Can these two queries be optimised into a single one? - postgresql

Given the tables:
create table entries
(
id integer generated always as identity
constraint entries_pk
primary key,
name text not null,
description text,
type integer not null
);
create table tasks
(
id integer generated always as identity
constraint tasks_pk
primary key,
channel_id bigint not null,
type integer not null,
is_active boolean default true not null
);
I currently have two separate queries. First:
SELECT id FROM tasks WHERE is_active = true;
Then, once for each id returned by the first query:
SELECT t.channel_id, e.name, e.description
FROM tasks t
JOIN entries e ON t.type = e.type
WHERE t.id = :task_id
ORDER BY random()
LIMIT 1;
In other words I want a single random entry for each active task.
Can this be accomplished in a single query while retaining the limit per task?

Sure; use DISTINCT ON:
SELECT DISTINCT ON (t.id)
t.id, t.channel_id, e.name, e.description
FROM tasks t
JOIN entries e USING (type)
WHERE t.is_active
ORDER BY t.id, random();
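If the per-task LIMIT 1 should stay explicit, a LATERAL subquery is another option. The following is a minimal sketch against the tables from the question, not a benchmarked replacement; it assumes only active tasks are wanted, as in the original first query, and the alias en is made up for the example.

```sql
-- For every active task, run the "pick one random entry" subquery once.
SELECT t.id, t.channel_id, e.name, e.description
FROM tasks t
CROSS JOIN LATERAL (
    SELECT en.name, en.description
    FROM entries en
    WHERE en.type = t.type      -- same join condition as the original query
    ORDER BY random()
    LIMIT 1                     -- the per-task limit
) e
WHERE t.is_active;
```

Like the inner join, this drops tasks that have no matching entry. Writing LEFT JOIN LATERAL (...) e ON true instead would keep them, with NULL name and description.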

Related

Modify existing query

I have two tables
create table jobs (
id varchar unique primary key,
account_email varchar not null,
active boolean not null default true,
enabled boolean not null default false,
name varchar (50) not null,
...
);
create table job_tags (
job_id varchar not null,
tag varchar(50) not null,
foreign key (job_id) references jobs(id) on delete cascade,
unique (job_id, tag)
);
And this SQL query to get a job:
SELECT * FROM jobs INNER JOIN job_categories ON (jobs.category_id=job_categories.category_id) WHERE jobs.id=$1
Since I have little experience, I run one more query to load the job_tags. Is it possible to do this with only one query? I work with Go and sqlx. Thanks.
Yes, you almost got it:
SELECT * FROM jobs
INNER JOIN job_categories ON (jobs.category_id=job_categories.category_id)
INNER JOIN job_tags ON (jobs.id = job_tags.job_id)
WHERE jobs.id=$1
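Note that the INNER JOIN on job_tags returns one row per tag, and no row at all for a job without tags. If one row per job is preferred, the tags can be collected into an array. This is a sketch, assuming PostgreSQL; it leaves out job_categories because its definition is not shown, and the column alias tags is made up for the example.

```sql
-- One row per job; tags gathered into an array (empty array if the job has no tags).
SELECT j.*,
       COALESCE(
           array_agg(t.tag ORDER BY t.tag) FILTER (WHERE t.tag IS NOT NULL),
           '{}'
       ) AS tags
FROM jobs j
LEFT JOIN job_tags t ON t.job_id = j.id   -- LEFT JOIN keeps jobs without tags
WHERE j.id = $1
GROUP BY j.id;                            -- id is the primary key, so j.* is allowed
```

On the Go side the array column needs an array-aware scan target (for example pq.Array when the lib/pq driver is used).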

PostgreSQL count other values of ID that have the same value of other column

Let's say we have the following table that stores id of an observation and its address_id. You can create the table with the following code:
drop table if exists schema.pl_address_cnt;
create table schema.pl_address_cnt (
id serial,
address_id int);
insert into schema.pl_address_cnt(address_id) values
(100), (101), (100), (101), (100), (125), (128), (200), (200), (100);
My task is to count for each id how many other ids (thus -1) have the same address_id. I've come up with a solution that turns out to be quite expensive (explain) on the original dataset. I wonder whether my solution can be somehow optimised.
with tmp_table as (select address_id
, count(distinct id) as id_count
from schema.pl_address_cnt
group by address_id
)
select id
, id_count - 1
from schema.pl_address_cnt as pac
left join tmp_table as tt on tt.address_id=pac.address_id;
You can try omitting the CTE: left join the table to itself on equal address_id but different id, then aggregate.
SELECT pac1.id,
count(pac2.id)
FROM pl_address_cnt pac1
LEFT JOIN pl_address_cnt pac2
ON pac1.address_id = pac2.address_id
AND pac1.id <> pac2.id
GROUP BY pac1.id
ORDER BY pac1.id;
For performance you can try indexes on (address_id, id) and (id).
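A window function is another way to express the same count without a join. This is a sketch, assuming id is unique and address_id is never NULL (rows with a NULL address_id would be counted together here, whereas the self join reports 0 for them); the column alias is made up.

```sql
-- Count the rows sharing this row's address_id, minus the row itself.
SELECT id,
       count(*) OVER (PARTITION BY address_id) - 1 AS others_with_same_address
FROM pl_address_cnt
ORDER BY id;
```

With the sample rows from the question inserted into a fresh table, ids 1, 3, 5 and 10 (address 100) get 3, ids 2 and 4 (address 101) and ids 8 and 9 (address 200) get 1, and ids 6 and 7 get 0.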

PostgreSQL Transaction to Use Results from Query to Insert and Query another Table then Return Original Query Results

I am writing an application that stores data on file samples and YARA signatures. Essentially, in a single transaction, I need to execute a query, reference those results in an insert and another query, then return the original results. I have three tables that are relevant to this discussion:
samples - this is the table that stores information on files that need to be scanned with the associated YARA signatures.
yararules - the table that stores information on the YARA rules.
yaratracker - a table that tracks the sample/rule pairs that have been processed thus far.
In a single transaction, the application needs to:
1. Get a batch of unique sample/rule pairs that have not yet been processed. Preferably, this query will get all non-processed rules associated with a single sample (i.e. if I'm going to run the YARA rules on a sample, I want to run all of the YARA rules not yet processed on that sample so that I only have to load the sample into memory once).
2. Get a unique list of id, sha256 from the batch found in step 1.
3. Insert the batch from step 1 into the yaraqueue with the matchcount column equal to 0 and complete column set to false.
I can accomplish Step 1 with the query below, but I don't know how to reference those results to accomplish step 2. I've tried looking into variables, but apparently there isn't one that can hold multiple rows. I've looked into using a cursor, but I can't seem to use the cursor with a subsequent command and then return the cursor.
SELECT s.id,r.id
FROM sample s CROSS JOIN yararules r
WHERE r.status = 'Disabled' AND NOT EXISTS(
SELECT 1 FROM yaratracker q
WHERE q.sample_id = s.id AND q.rule_id = r.id
)
ORDER BY s.id
LIMIT 1000;
The relevant database schema looks like this.
CREATE TYPE samplelist AS ENUM ('Whitelist', 'Blacklist', 'Greylist', 'Unknown');
CREATE TABLE samples (
id SERIAL PRIMARY KEY,
md5 CHAR(32) NOT NULL,
sha1 CHAR(40) NOT NULL,
sha256 CHAR(64) NOT NULL,
total INT NOT NULL,
positives INT NOT NULL,
list SAMPLELIST NOT NULL,
filetype VARCHAR(16) NOT NULL,
submitted TIMESTAMP WITH TIME ZONE NOT NULL,
user_id SERIAL REFERENCES users
);
CREATE UNIQUE INDEX md5_idx ON {0} (md5);
CREATE UNIQUE INDEX sha1_idx ON {0} (sha1);
CREATE UNIQUE INDEX sha256_idx ON {0} (sha256);
CREATE TYPE rulestatus AS ENUM ('Enabled', 'Disabled');
CREATE TABLE yararules (
id SERIAL PRIMARY KEY,
name VARCHAR(32) NOT NULL UNIQUE,
description TEXT NOT NULL,
rules TEXT NOT NULL,
lastmodified TIMESTAMP WITH TIME ZONE NOT NULL,
status rulestatus NOT NULL,
user_id SERIAL REFERENCES users ON DELETE CASCADE
);
CREATE TABLE yaratracker (
id SERIAL PRIMARY KEY,
rule_id SERIAL REFERENCES yararules ON DELETE CASCADE,
sample_id SERIAL REFERENCES sample ON DELETE CASCADE,
matchcount INT NOT NULL,
complete BOOL NOT NULL
);
CREATE INDEX composite_idx ON yaratracker (rule_id, sample_id);
CREATE INDEX complete_idx ON yaratracker (complete);
INSERT INTO target_table(a,b,c,...)
SELECT sid, rid, sha, ...
FROM (
SELECT s.id AS sid
,r.id AS rid
, s.sha256 AS sha
, ...
, ROW_NUMBER() OVER (PARTITION BY s.id) AS rn -- <<<--- HERE
FROM sample s CROSS JOIN yararules r
WHERE r.status = 'Disabled' AND NOT EXISTS(
SELECT 1 FROM yaratracker q
WHERE q.sample_id = s.id
AND q.rule_id = r.id
)
ORDER BY s.id
LIMIT 1000
) src
WHERE src.rn = 1; -- <<<--- HERE
The WHERE src.rn = 1 will restrict the cross-join to deliver only one tuple per sample.id (both id and sha256 are unique in the sample table, so picking a unique id has the same effect as picking a unique sha256)
The complete cross-join result will never be generated; the optimiser is smart enough to push down the WHERE rn=1 condition into the subquery.
Note: the LIMIT 1000 should probably be removed (or pulled up to a higher level)
If you REALLY need to save the results from the CROSS JOIN, you could use a chain of CTEs (expect a performance degradation ...)
WITH big AS (
SELECT s.id AS sample_id
,r.id AS rule_id
, s.sha256
-- , ...
, ROW_NUMBER() OVER (PARTITION BY s.id) AS rn -- <<<--- HERE
FROM sample s
CROSS JOIN yararules r
WHERE r.status = 'Disabled' AND NOT EXISTS(
SELECT 1 FROM yaratracker q
WHERE q.sample_id = s.id AND q.rule_id = r.id
)
)
, ins AS (
INSERT INTO target_table(a,b,c,...)
SELECT b.sample_id, b.rule_id, b.sha256 , ...
FROM big b
WHERE b.rn = 1 -- <<<--- HERE
RETURNING *
)
INSERT INTO yaratracker (rule_id, sample_id, matchcount, complete )
SELECT b.rule_id, b.sample_id, 0, False
FROM big b
-- LEFT JOIN ins i ON i.a = b.sample_id AND i.b= b.rule_id
;
NOTE: the yaratracker(rule_id,sample_id) should not be serials but just plain integers, referencing yararules(id) and sample(id)
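To also hand the list of samples back to the application (step 2 in the question), the insert can live in a data-modifying CTE while the outer SELECT returns the distinct samples. The following is only a sketch: it reuses the table name sample and the status filter exactly as they appear in the question's query, and the CTE names batch and ins are made up.

```sql
WITH batch AS (
    -- Step 1: unprocessed sample/rule pairs, as in the question's query.
    SELECT s.id AS sample_id, s.sha256, r.id AS rule_id
    FROM sample s
    CROSS JOIN yararules r
    WHERE r.status = 'Disabled'
      AND NOT EXISTS (
          SELECT 1 FROM yaratracker q
          WHERE q.sample_id = s.id AND q.rule_id = r.id
      )
    ORDER BY s.id
    LIMIT 1000
), ins AS (
    -- Step 3: record the batch; this runs even though the outer query never reads ins.
    INSERT INTO yaratracker (rule_id, sample_id, matchcount, complete)
    SELECT rule_id, sample_id, 0, false
    FROM batch
)
-- Step 2: unique samples in the batch, returned to the caller.
SELECT DISTINCT sample_id, sha256
FROM batch
ORDER BY sample_id;
```

Because batch is referenced twice, it is computed once, so the inserted pairs and the returned samples come from the same set of rows. Keep in mind that a LIMIT of 1000 pairs can cut the rule list of the last sample short.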

Merging columns from 2 different tables to apply aggregate function

I have below 3 Tables
Create table products(
prod_id character(20) NOT NULL,
name character varying(100) NOT NULL,
CONSTRAINT prod_pkey PRIMARY KEY (prod_id)
)
Create table dress_Sales(
prod_id character(20) NOT NULL,
dress_amount numeric(7,2) NOT NULL,
CONSTRAINT prod_pkey PRIMARY KEY (prod_id),
CONSTRAINT prod_id_fkey FOREIGN KEY (prod_id)
REFERENCES products (prod_id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION
)
Create table sports_Sales(
prod_id character(20) NOT NULL,
sports_amount numeric(7,2) NOT NULL,
CONSTRAINT prod_pkey PRIMARY KEY (prod_id),
CONSTRAINT prod_id_fkey FOREIGN KEY (prod_id)
REFERENCES products (prod_id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION
)
I want to get the sum and average sales amount from both tables (only for the selected prod_id values). I have tried the code below, but it does not produce any value.
select sum(coalesce(b.dress_amount, c.sports_amount)) as total_Amount
from products a JOIN dress_sales b on a.prod_id = b.prod_id
JOIN sports_sales c on a.prod_id = c.prod_id and a.prod_id = ANY('{"123456","456789"}')
Here 1000038923 is in dress_sales table and 8002265822 is in sports_sales.
Looks like your product can exist in only one table (dress_sales or sports_sales).
In this case you should use left join:
select
sum(coalesce(b.dress_amount, c.sports_amount)) as total_amount,
avg(coalesce(b.dress_amount, c.sports_amount)) as avg_amount
from products a
left join dress_sales b using(prod_id)
left join sports_sales c using(prod_id)
where
a.prod_id in ('1', '2');
With an inner join (the default for a plain JOIN), a product row survives only if it has a match in both dress_sales and sports_sales, so a product that is present in just one of them drops out of the result set.
If a product can appear in both tables, coalesce would count only its dress_amount. In that case you can use a subquery that combines both tables with UNION ALL, so that dress_amount and sports_amount are both included.
select sum(combined.amount), avg(combined.amount)
from
(select prod_id, dress_amount as amount from dress_sales
union all
select prod_id, sports_amount as amount from sports_sales) combined
where
combined.prod_id in ('1','2');

Looking up values from many tables based on value in each column

I have several tables containing key-value pairs for different fields in my database. I also have a table that contains the keys of these different tables, and each key should be replaced by the value it refers to. However, I can't figure out how to select these values from the multiple tables.
The tables
CREATE TABLE CHARACTERS(
ID INTEGER PRIMARY KEY,
NAME VARCHAR(64)
);
CREATE TABLE MEDIA(
ID INTEGER PRIMARY KEY,
NAME VARCHAR(64)
);
CREATE TABLE EPISODES(
ID INTEGER PRIMARY KEY,
MEDIAID INTEGER,
NAME VARCHAR(64)
);
-- Selecting from this table
CREATE TABLE APPS(
ID INTEGER PRIMARY KEY,
CHARID INTEGER,
EPISODEID INTEGER,
MEDIAID INTEGER
);
I am selecting from the APPS table, and I want to replace the value of each *ID column with the value of the NAME column in the accompanying table. I want this done for each row in the APPS table. Like so...
CHARID -> CHARACTERS.NAME
EPISODEID -> EPISODES.NAME
MEDIAID -> MEDIA.NAME
I have tried to use joins, but they don't do it for each row in the APPS table. I have 18 rows in the APPS table, but I get back either far fewer or far more rows than that. How can I get exactly one result row for each row in the APPS table?
You do this by JOINing the tables together and selecting the desired columns from the individual tables:
SELECT c.name AS character_name, e.name AS episode, m.name AS media
FROM apps a
LEFT JOIN episodes e ON e.id = a.episodeid
LEFT JOIN media m ON m.id = a.mediaid
LEFT JOIN characters c ON c.id = a.charid;
If you want to present the rows in a specific order, you can specify that too as a final clause in the SELECT statement. You can use any field from the included tables; that field is not necessarily part of the columns selected:
ORDER BY a.id -- order by apps.id
or
ORDER BY e.id, c.name -- order first by episode id, then by character name
etc
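Put together, a full statement might look like the sketch below. Selecting a.id alongside the names is an addition for illustration, to make it easy to check that every APPS row comes back exactly once.

```sql
SELECT a.id,
       c.name AS character_name,
       e.name AS episode,
       m.name AS media
FROM apps a
LEFT JOIN characters c ON c.id = a.charid
LEFT JOIN episodes   e ON e.id = a.episodeid
LEFT JOIN media      m ON m.id = a.mediaid
ORDER BY a.id;   -- ORDER BY goes last, after all joins
```

Each join matches on the primary key of the joined table, so it can contribute at most one match per APPS row, and LEFT JOIN keeps the rows whose id has no match (the name is then NULL).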