Generate data with at least one occurrence - postgresql

I have three tables:
create table genres
(
genre_id serial primary key,
genre_name varchar NOT NULL UNIQUE
);
create table movies
(
movie_id serial primary key,
movie_name varchar NOT NULL
);
create table movie_genres
(
movie_id integer references movies NOT NULL,
genre_id integer references genres NOT NULL,
PRIMARY KEY(movie_id, genre_id)
);
Tables genres and movies are full of data and I want to generate some random data for table movie_genres, so that every movie has at least one genre.
I tried it this way, but then it is possible for a movie to be without any genre. Can anyone help me with that, please?
insert into movie_genres
select movie_id, genre_id
from genres cross join movies
where random() < 0.15;

Hmm, you can try joining a derived table in which you first select one random genre and then UNION it with some more genres chosen randomly.
INSERT INTO movie_genres
       (movie_id,
        genre_id)
SELECT m.movie_id,
       rg.genre_id
FROM   movies m
       CROSS JOIN ((SELECT g.genre_id
                    FROM   genres g
                    ORDER  BY random()
                    LIMIT  1)
                   UNION
                   (SELECT g.genre_id
                    FROM   genres g
                    WHERE  random() < 0.15)) rg;
That, however, means that every movie gets that one same genre which was selected first. To overcome this and have the first genre be random per movie, a lateral join can be used. (Remark: you need to use some column from the outer table in the derived table, as otherwise the optimizer seems to optimize the LATERAL away.)
INSERT INTO movie_genres
       (movie_id,
        genre_id)
SELECT rg.movie_id,
       rg.genre_id
FROM   movies m
       CROSS JOIN LATERAL ((SELECT g.genre_id,
                                   m.movie_id -- that's just here to force the optimizer to keep the join lateral
                            FROM   genres g
                            ORDER  BY random()
                            LIMIT  1)
                           UNION
                           (SELECT g.genre_id,
                                   m.movie_id
                            FROM   genres g
                            WHERE  random() < 0.15)) rg;
db<>fiddle
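As a quick sanity check (my addition, not part of the original answer), a query along these lines should come back empty once every movie has at least one genre:
-- movies that still have no genre; expected to return zero rows
SELECT m.movie_id
FROM movies m
LEFT JOIN movie_genres mg ON mg.movie_id = m.movie_id
WHERE mg.movie_id IS NULL;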

Related

Improving performance of a GROUP BY ... HAVING COUNT(...) > 1 in PostgreSQL

I'm trying to select the orders that are part of a trip with multiple orders.
I tried many approaches but can't find how to write a performant query.
To reproduce the problem, here is the setup (here it's 100 000 rows, but it takes more like 1 000 000 rows to see the timeout on db-fiddle).
Schema (PostgreSQL v14)
create table trips (id bigint primary key);
create table orders (id bigint primary key, trip_id bigint);
create index trips_idx on trips (id);
create index orders_idx on orders (id);
create index orders_trip_idx on orders (trip_id);
insert into trips (id) select seq from generate_series(1,100000) seq;
insert into orders (id, trip_id) select seq, floor(random() * 100000 + 1) from generate_series(1,100000) seq;
Query #1
explain analyze select orders.id
from orders
inner join trips on trips.id = orders.trip_id
inner join orders trips_orders on trips_orders.trip_id = trips.id
group by orders.id, trips.id
having count(trips_orders) > 1
limit 50
;
View on DB Fiddle
Here is what pgmustard gives me on the real query:
Do you actually need the join on trips? You could try
SELECT shared.id
FROM orders shared
WHERE EXISTS (SELECT *
              FROM orders other
              WHERE other.trip_id = shared.trip_id
                AND other.id != shared.id);
to replace the group by with a hash join, or
SELECT unnest(array_agg(orders.id))
FROM orders
GROUP BY trip_id
HAVING count(*) > 1
;
to hopefully get Postgres to just use the trip_id index.
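If that second query still has to visit the table, one thing worth trying (my suggestion, not from the original answer; the index name is made up) is a covering index so Postgres can answer the aggregation with an index-only scan:
create index orders_trip_id_id_idx on orders (trip_id, id);
-- with a reasonably up-to-date visibility map (run VACUUM if needed), the
-- GROUP BY trip_id / HAVING count(*) > 1 aggregation can be served from the index alone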

PySpark. How do I make sure the daily incremental data has NO duplicated UUID as PK in HIVE

I created a table in Hive with a UUID as the primary key, for example:
create table if not exists mydb.mytable as SELECT uuid() as uni_id, c.name, g.city, g.country
FROM client c
INNER JOIN geo g ON c.geo_id = g.id
Every day I need to insert data into mytable. How do I make sure the daily incremental data has no duplicated UUIDs as the PK?
If by UUID what you're actually looking for is a sequential unique identifier, then I think you can use an auto-increment id. In pure HQL, this can be achieved with row_number() and a cross join.
insert overwrite table dest_tbl
select
    a.rn + b.mid as id, col1, col2,...
from (
    select
        *, row_number() over (order by rand()) as rn
    from src_tbl
) a
join (select max(id) as mid from dest_tbl) b
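One caveat worth noting (my addition, not part of the original answer): on the very first load dest_tbl is empty, so max(id) is NULL and every generated id would be NULL. Guarding the subquery with coalesce avoids that, e.g. replacing the last line with:
-- fall back to 0 when the destination table is still empty
join (select coalesce(max(id), 0) as mid from dest_tbl) b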

PostgreSQL count other values of ID that have the same value of other column

Let's say we have the following table, which stores the id of an observation and its address_id. You can create the table with the following code:
drop table if exists schema.pl_address_cnt;
create table schema.pl_address_cnt (
id serial,
address_id int);
insert into schema.pl_address_cnt(address_id) values
(100), (101), (100), (101), (100), (125), (128), (200), (200), (100);
My task is to count, for each id, how many other ids (hence the -1) have the same address_id. I've come up with a solution that turns out to be quite expensive (explain) on the original dataset. I wonder whether my solution can be somehow optimised.
with tmp_table as (
    select address_id,
           count(distinct id) as id_count
    from schema.pl_address_cnt
    group by address_id
)
select id,
       id_count - 1
from schema.pl_address_cnt as pac
left join tmp_table as tt on tt.address_id = pac.address_id;
You can try to omit the CTE and instead do a self LEFT JOIN on the same address but a different id, and then aggregate.
SELECT pac1.id,
count(pac2.id)
FROM pl_address_cnt pac1
LEFT JOIN pl_address_cnt pac2
ON pac1.address_id = pac2.address_id
AND pac1.id <> pac2.id
GROUP BY pac1.id
ORDER BY pac1.id;
For performance you can try indexes on (address_id, id) and (id).
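For reference, those indexes could look something like this (the index names are just examples):
CREATE INDEX pl_address_cnt_address_id_id_idx ON schema.pl_address_cnt (address_id, id);
CREATE INDEX pl_address_cnt_id_idx ON schema.pl_address_cnt (id);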

PostgreSQL - Append a table to another and add a field without listing all fields

I have two tables:
table_a with fields item_id, rank, and 50 other fields.
table_b with fields item_id and the same 50 fields as table_a.
I need to write a SELECT query that adds the rows of table_b to table_a but with rank set to a specific value, let's say 4.
Currently I have:
SELECT * FROM table_a
UNION
SELECT item_id, 4 rank, field_1, field_2, ...
How can I join the two tables together without writing out all of the fields and without using an INSERT query?
EDIT:
My idea is to join table_b to table_a somehow with the rank field remaining empty, then simply replace the null rank fields. The rank field is never null, but item_id can be duplicated and table_a may have item_id values that are not in table_b, and vice-versa.
I am not sure I understand why you need this, but you can use jsonb functions:
select (jsonb_populate_record(null::table_a, row)).*
from (
    select to_jsonb(a) as row
    from table_a a
    union
    select to_jsonb(b) || '{"rank": 4}'
    from table_b b
) s
order by item_id;
Working example in rextester.
I'm pretty sure I've got it. The predefined rank column can be added to table_b by joining table_b to a subset of itself that contains only the columns to the left of where the new column should go, together with the new column itself.
WITH
  _leftcols AS ( SELECT item_id, 4 AS rank FROM table_b ),
  _combined AS ( SELECT * FROM _leftcols JOIN table_b USING (item_id) )
SELECT * FROM _combined
UNION
SELECT * FROM table_a

Is it possible to use a aggregate in an aggregate to get a specific single value?

I have been playing around with code for a while now, and I have come across a problem where I must get the number of certain fields where the average is above a certain amount, grouped by two fields from different tables.
Here is my code and expectations:
SELECT C.Course, S.Name, COUNT(*) as Average
FROM Students S
INNER JOIN Student_Modules SM ON SM.StudentID = S.ID
INNER JOIN Courses_Template C ON C.ID = SM.CourseID
GROUP BY C.Course, S.Name
HAVING AVG(SM.Percentage_Obtained) > 80
This sends me back rows containing the course name, the student's name, and the count of module percentages for each student whose average is above 80%.
This counts for me as "the number of students that passed the course". I would like to know how to force this query to give me the number of students who have passed the course instead of the number of modules each student has passed, if that is possible.
EDIT 1:
STUDENT LAYOUT
CREATE TABLE Students
(ID INT IDENTITY(1,1) PRIMARY KEY CLUSTERED
,StudentNumber VARCHAR(20)
,Name VARCHAR(40)
,Surname VARCHAR(40)
,Student_ID VARCHAR(13)
,Languages VARCHAR(200)
,[Address] Varchar (512)
,Contact_Number varchar(20)
,Email Varchar (150)
,Days_Absent INT
,Student_Web_Username varchar(40)
,Student_Web_Password varchar(MAX)
,BranchID int
,Constraint FKStudentBranch FOREIGN KEY (BranchID) REFERENCES Branches(ID)
,CONSTRAINT Unq_StudentNumber UNIQUE (StudentNumber)
,CONSTRAINT Unq_Student_ID UNIQUE (Student_ID));
STUDENT_MODULE LAYOUT
CREATE TABLE Student_Modules
(ID INT IDENTITY(1,1) PRIMARY KEY CLUSTERED
,ModuleID INT
,StudentID INT
,CourseID INT
,Percentage_Obtained INT Check (Percentage_Obtained >= -1 AND Percentage_Obtained <= 100)
,CONSTRAINT FKStudentModulesChosen FOREIGN KEY (ModuleID) REFERENCES Modules_Template(ID) ON DELETE CASCADE
,CONSTRAINT FKStudentModules FOREIGN KEY (StudentID) REFERENCES Students(ID) ON DELETE CASCADE);
COURSES_TEMPLATE LAYOUT
CREATE TABLE COURSES_TEMPLATE
(ID INT IDENTITY(1,1) PRIMARY KEY CLUSTERED
,Course VARCHAR(40)
,Price SMALLMONEY CHECK(Price > 0)
,BranchID INT
,CONSTRAINT FKCourseBranches FOREIGN KEY (BranchID) REFERENCES Branches(ID) ON DELETE CASCADE);
If they need to pass with an average of 80% across all modules:
SELECT C.Course, COUNT(DISTINCT S.ID) as [Average]
FROM Students S
INNER JOIN Student_Modules SM ON S.ID = SM.StudentID
INNER JOIN Courses_Template C ON SM.CourseID = C.ID
INNER JOIN (
    SELECT SM.StudentID, SM.CourseID
    FROM Student_Modules SM
    GROUP BY SM.StudentID, SM.CourseID
    HAVING AVG(SM.Percentage_Obtained) > 80
) Pass ON SM.StudentID = Pass.StudentID AND SM.CourseID = Pass.CourseID
GROUP BY C.Course
If they need to pass each module with 80% to pass the course, then:
SELECT C.Course, COUNT(DISTINCT S.ID) as [Average]
FROM Students S
INNER JOIN Student_Modules SM ON S.ID = SM.StudentID
INNER JOIN Courses_Template C ON SM.CourseID = C.ID
LEFT OUTER JOIN (
    SELECT DISTINCT SM.StudentID, SM.CourseID
    FROM Student_Modules SM
    WHERE SM.Percentage_Obtained <= 80
) as NotPass ON SM.StudentID = NotPass.StudentID AND SM.CourseID = NotPass.CourseID
WHERE NotPass.StudentID IS NULL
GROUP BY C.Course
This is untested; let me know of any errors, or paste the incorrect output and the expected output.
It looks like you want the number of students that passed each course? If so wouldn't you just need to group by C.Course and then have a Count(S.Name) as NumWhoPassed for the display?
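A sketch of what that comment suggests (untested, and only an assumption based on the posted schema): build on the Pass subquery from the first answer, which already has one row per passing (student, course), so counting its rows per course gives the number of students who passed.
SELECT C.Course, COUNT(*) AS NumWhoPassed
FROM (
    -- one row per (student, course) whose average is above 80%
    SELECT SM.StudentID, SM.CourseID
    FROM Student_Modules SM
    GROUP BY SM.StudentID, SM.CourseID
    HAVING AVG(SM.Percentage_Obtained) > 80
) Pass
INNER JOIN Courses_Template C ON Pass.CourseID = C.ID
GROUP BY C.Course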