I have a table with about 500,000 rows. One of the columns is a string field.
Is there a way to get the set of all existing values of that string in PostgreSQL without having to request each row out of the database and add the values to a set manually?
Example:
first_name last_name
will i.am
will smith
britney spears
The set of all existing values for "first_name" would be ['will', 'britney'].
SELECT DISTINCT first_name FROM people;
or
SELECT first_name FROM people GROUP BY first_name;
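To see that both statements return the same set, here is a minimal sketch that runs them against an in-memory SQLite database from Python. SQLite is only a stand-in for PostgreSQL here, and the `people` table is filled with the three example rows from the question.

```python
import sqlite3

# In-memory SQLite database standing in for the PostgreSQL table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (first_name TEXT, last_name TEXT)")
conn.executemany(
    "INSERT INTO people (first_name, last_name) VALUES (?, ?)",
    [("will", "i.am"), ("will", "smith"), ("britney", "spears")],
)

# Both statements ask the database for one row per distinct first_name,
# so only the distinct values travel to the client.
distinct_names = {
    row[0] for row in conn.execute("SELECT DISTINCT first_name FROM people")
}
grouped_names = {
    row[0]
    for row in conn.execute("SELECT first_name FROM people GROUP BY first_name")
}

print(sorted(distinct_names))
print(distinct_names == grouped_names)
```

Neither query guarantees an ordering, which is why the results are collected into sets before comparing.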
Related
I have a time-series location data table containing the following columns (time, first_name, last_name, loc_lat, loc_long) with the first three columns as the primary key. The table has more than 1M rows.
I notice that first_name and last_name duplicate quite often. There are only 100 combinations in 1M rows. Therefore, to save disk space, I am thinking about creating a separate people table with columns (id, first_name, last_name) where (first_name, last_name) is a unique constraint, in order to simplify the time-series location table to be (time, person_id, loc_lat, loc_long) where person_id is a foreign key for the people table.
I want to first create a new table from my existing 1M row table to test if there is indeed meaningful disk space save with this change. I feel like this task is quite doable but cannot find a concrete way to do so yet. Any suggestions?
That's a basic step of database normalization.
If you can afford to do so, it will be faster to write a new table exchanging full names for IDs than altering the schema of the existing table and updating all rows. Basically:
BEGIN; -- wrap in single transaction (optional, but safer)
CREATE TABLE people (
people_id integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY
, first_name text NOT NULL
, last_name text NOT NULL
, CONSTRAINT full_name_uni UNIQUE (first_name, last_name)
);
INSERT INTO people (first_name, last_name)
SELECT DISTINCT first_name, last_name
FROM tbl
ORDER BY 1, 2; -- optional
ALTER TABLE tbl RENAME TO tbl_old; -- free up org. table name
CREATE TABLE tbl AS
SELECT t.time, p.people_id, t.loc_lat, t.loc_long
FROM tbl_old t
JOIN people p USING (first_name, last_name);
-- ORDER BY ??
ALTER TABLE tbl ADD CONSTRAINT people_id_fk FOREIGN KEY (people_id) REFERENCES people(people_id);
-- make sure the new table is complete. indexes? constraints?
-- Finally:
DROP TABLE tbl_old;
COMMIT;
Related:
Best way to populate a new column in a large table?
Add new column without table lock?
Updating database rows without locking the table in PostgreSQL 9.2
DISTINCT is simple. But for only 100 distinct full names - and with the right index support! - there are more sophisticated, (much) faster ways. See:
Optimize GROUP BY query to retrieve latest row per user
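The table rewrite above can be tried out on a small scale before touching the real data. The following is a rough sketch using Python's sqlite3 module with an in-memory database as a stand-in for PostgreSQL; the three sample rows are invented for illustration, and the final query is one possible way to check that no information was lost.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Original wide table (sample rows are made up for this sketch).
CREATE TABLE tbl_old (
    time TEXT, first_name TEXT, last_name TEXT, loc_lat REAL, loc_long REAL,
    PRIMARY KEY (time, first_name, last_name)
);
INSERT INTO tbl_old VALUES
    ('2024-01-01T00:00', 'will', 'smith', 1.0, 2.0),
    ('2024-01-01T00:05', 'will', 'smith', 1.1, 2.1),
    ('2024-01-01T00:00', 'britney', 'spears', 3.0, 4.0);

-- One row per distinct full name.
CREATE TABLE people (
    people_id INTEGER PRIMARY KEY,
    first_name TEXT NOT NULL,
    last_name TEXT NOT NULL,
    UNIQUE (first_name, last_name)
);
INSERT INTO people (first_name, last_name)
SELECT DISTINCT first_name, last_name FROM tbl_old ORDER BY 1, 2;

-- New narrow table that stores the id instead of the full name.
CREATE TABLE tbl AS
SELECT t.time, p.people_id, t.loc_lat, t.loc_long
FROM tbl_old t
JOIN people p USING (first_name, last_name);
""")

old_count = conn.execute("SELECT COUNT(*) FROM tbl_old").fetchone()[0]
new_count = conn.execute("SELECT COUNT(*) FROM tbl").fetchone()[0]
people_count = conn.execute("SELECT COUNT(*) FROM people").fetchone()[0]

# Rows that cannot be reconstructed from the normalized tables (should be none).
lost_rows = conn.execute("""
    SELECT time, first_name, last_name, loc_lat, loc_long FROM tbl_old
    EXCEPT
    SELECT t.time, p.first_name, p.last_name, t.loc_lat, t.loc_long
    FROM tbl t JOIN people p USING (people_id)
""").fetchall()

print(old_count, new_count, people_count, lost_rows)
```

In PostgreSQL itself, the disk usage of the old and new tables could then be compared with `pg_total_relation_size()` before dropping `tbl_old`.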
Let's say I have 2 tables: Students and Groups.
The Group table has 2 columns: id, GroupName
The Student table has 3 columns: id, StudentName and GroupID
The GroupID is a foreign key to a Group field.
I need to import the Students table from a CSV, but in my CSV instead of the Group id appears the name of the group. How can I import it with pgAdmin without modifying the csv?
Based on Laurenz's answer, use the following scripts:
Create a temp table to insert from CSV file:
CREATE TEMP TABLE std_temp (id int, student_name char(25), group_name char(25));
Then, import the CSV file:
COPY std_temp FROM '/home/username/Documents/std.csv' CSV HEADER;
Now, create std and grp tables for students and groups:
CREATE TABLE grp (id int, name char(25));
CREATE TABLE std (id int, name char(20), grp_id int);
Now it is the grp table's turn to be populated from the distinct group names. Note how row_number() is used to provide the value for id:
INSERT INTO grp (id, name) select row_number() OVER (), * from (select distinct group_name from std_temp) as foo;
And the final step, select data based on the join then insert it into the std table:
insert into std (id, name, grp_id) select std_temp.id, std_temp.student_name,grp.id from std_temp inner join grp on std_temp.group_name = grp.name;
At the end, retrieve the data from the final std table:
select * from std;
Your easiest option is to import the file into a temporary table that is defined like the CSV file. Then you can join that table with the "groups" table and use INSERT INTO ... SELECT ... to populate the "students" table.
There is of course also the option to define a view on a join of the two tables and define an INSTEAD OF INSERT trigger on the view that inserts values into the underlying tables as appropriate. Then you could load the data directly to the view.
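The view-plus-trigger option might look roughly like the sketch below. It uses SQLite through Python's sqlite3 module as a stand-in, since SQLite also supports INSTEAD OF triggers on views; in PostgreSQL the trigger body would live in a trigger function instead. The names `grp`, `student`, `student_import` and the sample rows are hypothetical, and creating missing groups on the fly is a design choice of this sketch, not something the answer prescribes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE grp (
    id INTEGER PRIMARY KEY,
    group_name TEXT NOT NULL UNIQUE
);
CREATE TABLE student (
    id INTEGER PRIMARY KEY,
    student_name TEXT NOT NULL,
    group_id INTEGER NOT NULL REFERENCES grp (id)
);

-- The view has the same shape as the CSV: student name plus group name.
CREATE VIEW student_import AS
SELECT s.student_name, g.group_name
FROM student s
JOIN grp g ON g.id = s.group_id;

-- Route inserts on the view to the two underlying tables.
CREATE TRIGGER student_import_insert
INSTEAD OF INSERT ON student_import
BEGIN
    INSERT OR IGNORE INTO grp (group_name) VALUES (NEW.group_name);
    INSERT INTO student (student_name, group_id)
    SELECT NEW.student_name, id FROM grp WHERE group_name = NEW.group_name;
END;
""")

# Rows as they would come out of the CSV file (invented sample data).
csv_rows = [("Ann", "Math"), ("Bob", "Physics"), ("Cid", "Math")]
conn.executemany(
    "INSERT INTO student_import (student_name, group_name) VALUES (?, ?)",
    csv_rows,
)

group_count = conn.execute("SELECT COUNT(*) FROM grp").fetchone()[0]
loaded = conn.execute(
    "SELECT student_name, group_name FROM student_import ORDER BY student_name"
).fetchall()
print(group_count, loaded)
```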
The suggestion by @LaurenzAlbe is the obvious approach (IMHO never load a spreadsheet directly into your tables; they are untrustworthy beasts). But I believe your implementation after loading the staging table is flawed.
First, using row_number() virtually ensures that sooner or later you get duplicated ids, or different ids for the same group name. The ids always increment by 1 from 1 up to the number of group names, no matter how many groups were loaded previously, and you cannot ensure the identical sequence on subsequent spreadsheets. What happens when a spreadsheet contains a group that does not previously exist?
Further, there is no validation that the group name does not already exist. Result: duplicate group names and/or multiple ids for the same name.
Second, using the id from the spreadsheet as the id in the student (std) table is full of error possibilities. How do you ensure that number is unique across spreadsheets?
Even if it is unique within a single spreadsheet, how do you ensure another spreadsheet does not use the same numbers as a previous one? Or, assuming multiple users create the spreadsheets, that one user's numbers do not overlap another user's, even if all users are very conscious of the numbers they use? Result: duplicate id numbers.
A much better approach would be to put a unique key on the group table's name column, then insert any group names from the staging table into the group table, trapping any duplicate-name errors (using ON CONFLICT). Then load the student table directly from the staging table while selecting the group id from the group table by the (now unique) group name.
create table csv_load_temp( junk_num integer, student_name text, group_name text);
create table groups( grp_id integer generated always as identity
, name text
, grp_key text generated always as ( lower(name) ) stored
, constraint grp_pk
primary key (grp_id)
, constraint grp_bk
unique (grp_key)
);
create table students (std_id integer generated always as identity
, name text
, grp_id integer
, constraint std_pk
primary key (std_id)
, constraint std2grp_fk
foreign key (grp_id)
references groups(grp_id)
);
-- Function to load Groups and Students
create or replace function establish_students()
returns void
language sql
as $$
insert into groups (name)
select distinct group_name
from csv_load_temp
on conflict (grp_key) do nothing;
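insert into students (name, grp_id)
select t.student_name, grp.grp_id
from csv_load_temp t
join groups grp
-- match on the same lower-cased key the unique constraint uses,
-- so names that differ only in case still find their group
on (grp.grp_key = lower(t.group_name));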
$$;
The groups table requires Postgres v12 (for the generated column). For prior versions, remove the grp_key column
and put the unique constraint directly on the name column. What to do about capitalization is up to your business logic.
See fiddle for full example. Obviously the 2 inserts in the Establish_Students function can be run standalone and independently. In that case the function itself is not necessary.
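To illustrate why the unique key plus ON CONFLICT keeps ids stable across several spreadsheets, here is a small sketch using Python's sqlite3 module (SQLite 3.24 or later) as a stand-in for PostgreSQL. It follows the simpler variant with the unique constraint directly on the name column; the `load_batch` helper, the table name `grp` and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE csv_load_temp (junk_num INTEGER, student_name TEXT, group_name TEXT);
CREATE TABLE grp (grp_id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE students (
    std_id INTEGER PRIMARY KEY,
    name TEXT,
    grp_id INTEGER REFERENCES grp (grp_id)
);
""")


def load_batch(rows):
    """Stage one spreadsheet's rows, then load groups and students from the stage."""
    conn.execute("DELETE FROM csv_load_temp")
    conn.executemany("INSERT INTO csv_load_temp VALUES (?, ?, ?)", rows)
    # Group names that already exist are skipped, so their ids never change.
    # SQLite needs the WHERE clause so ON CONFLICT is not parsed as a join condition.
    conn.execute("""
        INSERT INTO grp (name)
        SELECT DISTINCT group_name FROM csv_load_temp WHERE true
        ON CONFLICT (name) DO NOTHING
    """)
    # The spreadsheet's own number (junk_num) is ignored; ids are generated.
    conn.execute("""
        INSERT INTO students (name, grp_id)
        SELECT t.student_name, g.grp_id
        FROM csv_load_temp t
        JOIN grp g ON g.name = t.group_name
    """)


# Two spreadsheets that reuse the same row numbers and share one group.
load_batch([(1, "Ann", "Math"), (2, "Bob", "Physics")])
load_batch([(1, "Cid", "Math"), (2, "Dee", "Chemistry")])

groups = dict(conn.execute("SELECT name, grp_id FROM grp"))
students = conn.execute(
    "SELECT std_id, name, grp_id FROM students ORDER BY std_id"
).fetchall()
print(groups)
print(students)
```

The second load reuses the existing id for "Math" and only adds "Chemistry", while every student gets a fresh generated id regardless of the numbers in the spreadsheet.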
I have a list of values like 2,3,4 (comma separated) and a table, call it table1, which has ids like 1,2,3,4,5,6. From table1 I have emp_name.
What I want is to get the list of employees as comma-separated values, based on the list given in the condition.
To achieve this I tried GROUP_CONCAT and FIND_IN_SET, but was unable to get the desired output.
e.g. select emp_name from emp_table where emp_id in('2,3,4', '1,2,3,4,5,6');
Desired output: mr2,mr3,mr4
I know I can achieve it by looping through the values, but I want the MySQL way.
Any help would be appreciated.
Thank you
Having a table like
CREATE TABLE employees (
id INTEGER NOT NULL AUTO_INCREMENT,
name VARCHAR(100),
PRIMARY KEY (id)
);
you may query it like this
SELECT GROUP_CONCAT(name) AS employee FROM employees WHERE id IN (1,2,3,4,5,6);
and the result would be
Sandeep,Mohammed,Sabri,Ashtam,Tamal
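When the ids arrive as a single comma-separated string such as '2,3,4', they have to be split into separate values before they can be used in IN (...). A minimal sketch of that step, using Python's sqlite3 module as a stand-in for MySQL (SQLite also provides group_concat); the mr1 to mr6 sample names follow the question's desired output and are otherwise made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO employees (id, name) VALUES (?, ?)",
    [(i, f"mr{i}") for i in range(1, 7)],
)

# The list arrives as one string; IN ('2,3,4') would compare against that
# single string, so split it into individual integer parameters first.
wanted = "2,3,4"
ids = [int(part) for part in wanted.split(",")]
placeholders = ",".join("?" * len(ids))

row = conn.execute(
    f"SELECT group_concat(name) FROM employees WHERE id IN ({placeholders})",
    ids,
).fetchone()
names = row[0]
print(names)
```

The order of the concatenated names is not guaranteed here; in MySQL, GROUP_CONCAT accepts an ORDER BY inside the call if a fixed order matters.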
I have a query which works fine on SQLite, but when I run it on the same data in Postgresql I get:
column "role.id" must appear in the GROUP BY clause or be used in an aggregate function
I have three tables, for people, exhibitions, and a table that links the two: "One person in one exhibition performing a particular role" (such as "Artist" or "Curator"):
CREATE TABLE "person" ("id" integer NOT NULL PRIMARY KEY AUTOINCREMENT,
"name" varchar(255));
CREATE TABLE "exhibition" ("id" integer NOT NULL PRIMARY KEY AUTOINCREMENT,
"name" varchar(255));
CREATE TABLE "role" (`id` integer NOT NULL PRIMARY KEY AUTOINCREMENT,
`name` varchar(30) NOT NULL,
`exhibition_id` integer NOT NULL,
`person_id` integer NOT NULL,
FOREIGN KEY(`exhibition_id`) REFERENCES `exhibition`(`id`),
FOREIGN KEY(`person_id`) REFERENCES `person`(`id`));
I want to display the people involved in an exhibition ordered by how many things they've done. So, I get the IDs of the people in an exhibition (1,2,3,4) and then do this:
SELECT
*,
COUNT(person.id) AS role_count
FROM person
INNER JOIN role
ON person.id = role.person_id
WHERE person.id IN ( 1, 2, 3, 4 )
GROUP BY person.id
ORDER BY role_count DESC
That orders the people by role_count, which is the number of roles they've had across all exhibitions.
It works fine on SQLite, but not in Postgresql. I've tried putting role.id into the GROUP BY (instead of, and as well as, person.id) but that changes the results.
You know when you struggle for ages, post an SO question, and then immediately stumble on the answer?
From this answer I realised that I couldn't select role.id (which the SELECT * is implicitly doing) as it wasn't in the GROUP BY.
I couldn't add it to the GROUP BY (because that changes the results) so the solution was to not select it.
So I changed the SELECT part to:
SELECT
person.*,
COUNT(person.id) AS role_count
FROM person
...
Now role.id is not being selected. And that works.
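If I needed any other fields from the role table, like name, I could add those explicitly too, as long as each one is also listed in the GROUP BY or wrapped in an aggregate function (otherwise PostgreSQL raises the same error):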
SELECT
person.*,
role.name,
COUNT(person.id) AS role_count
FROM person
...
Just like the error says, Standard SQL doesn't let you SELECT anything other than one of the GROUP BY columns or a call to an aggregate function. (For a logical reason: How would the RDBMS know which role.id to select when there are multiple rows to select from within a group?) PostgreSQL actually enforces this rule; SQLite ignores it and just returns data from an arbitrary row in the group.
As you discovered, omitting role.id from the SELECT fixes your error. But if you do want SQLite's behavior of selecting the ID from an arbitrary row, you can simply wrap it in an aggregate function, e.g., SELECT MAX(role.id) instead of just SELECT role.id.
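A portable form of the query, written so that every selected column is either grouped or aggregated, could look like the sketch below. It runs on SQLite via Python's sqlite3 module, but is written to avoid relying on SQLite's lenient handling of ungrouped columns. The sample people and roles are invented, and the exhibition table is left out to keep the sketch short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE role (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    exhibition_id INTEGER NOT NULL,
    person_id INTEGER NOT NULL REFERENCES person (id)
);
INSERT INTO person (id, name) VALUES (1, 'Ada'), (2, 'Ben');
INSERT INTO role (name, exhibition_id, person_id) VALUES
    ('Artist', 10, 1), ('Curator', 11, 1), ('Artist', 10, 2);
""")

# Every selected column is either in the GROUP BY or inside an aggregate.
rows = conn.execute("""
    SELECT person.id,
           person.name,
           COUNT(role.id) AS role_count,
           MAX(role.id)   AS some_role_id
    FROM person
    INNER JOIN role ON person.id = role.person_id
    WHERE person.id IN (1, 2)
    GROUP BY person.id, person.name
    ORDER BY role_count DESC
""").fetchall()
print(rows)
```

Listing person.name in the GROUP BY is redundant in PostgreSQL when person.id is the primary key, but it keeps the query valid on engines that do not track that dependency.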
I have two tables lets say table1 with id, fname, lname, gender and table2 with fname, lname.
table2 contains some data. I want to select all the data from table2 that has lname as "roy" and insert into table one incrementing value of id. But I don't want to use triggers. Is there anyway to do this.
If you have auto incrementing column id then just omit it while performing an insert and let the database get next value from sequence and assign it to a newly inserted record:
INSERT INTO table1 (fname, lname)
SELECT fname, lname
FROM table2
WHERE lname = 'roy'
If you don't have one, then you should probably create it, or use the last value + 1 (not safe and not recommended).
In different databases creating an auto incrementing column is done differently:
PostgreSQL has SERIAL
SQL Server has IDENTITY
Oracle before 12c uses sequences combined with PL/SQL triggers
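As a quick check that omitting the id column really lets the database assign the next values, here is a small sketch using Python's sqlite3 module, where INTEGER PRIMARY KEY AUTOINCREMENT plays the role of the auto-incrementing column. The sample rows are invented; table and column names follow the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    fname TEXT,
    lname TEXT,
    gender TEXT
);
CREATE TABLE table2 (fname TEXT, lname TEXT);

-- One pre-existing row, so the copied rows must continue after id 1.
INSERT INTO table1 (fname, lname, gender) VALUES ('existing', 'row', 'f');
INSERT INTO table2 (fname, lname) VALUES
    ('arun', 'roy'), ('bina', 'das'), ('chitra', 'roy');

-- id is not listed, so the database generates it for every copied row.
INSERT INTO table1 (fname, lname)
SELECT fname, lname FROM table2 WHERE lname = 'roy';
""")

copied = conn.execute(
    "SELECT id, fname, lname FROM table1 ORDER BY id"
).fetchall()
print(copied)
```

No trigger is involved: the generated ids simply continue after the highest existing one, and gender stays NULL for the copied rows because table2 has no such column.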