I have a csv file with about 1,500 fields and 5-6 million rows. It is a dataset with one row for each individual who has received public benefits at some point since ISO week 32 in 1991. Each field represents one week and holds a number relating to the specific benefit received in that particular week. If the individual has received no benefits, the field is left blank (''). In addition to the weekly values there are a number of other fields (ID, sex, birthdate, etc.)
The data set is updated quarterly with an added field for each week in the quarter, and an added row for each new individual.
This is a sample of the data:
y_9132,y_9133,y_9134,...,y_1443,id,sex,dateofbirth
891,891,891,...,110,1000456,1,'1978/01/16'
110,112,112,...,997,2000789,0,'1945/09/28'
I'm trying to convert the data to a tabular format so it can be analysed using PostgreSQL with a column store or similar (Amazon Redshift is a possibility).
The fields beginning with "y_" represent the year and week of the received public benefits. In a tabular format, the field name should be converted to a row number or a date, starting with the Monday of ISO week 32 in 1991 (1991/08/05).
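For reference, that week-number-to-date mapping is a one-liner in SQL (a sketch; week 1 = Monday 1991-08-05):

```sql
-- Week n starts (n - 1) * 7 days after 1991-08-05.
SELECT week, DATE '1991-08-05' + (week - 1) * 7 AS week_start
FROM generate_series(1, 3) AS week;
-- → (1, 1991-08-05), (2, 1991-08-12), (3, 1991-08-19)
```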
The tabular dataset I'm trying to convert the csv file to would look like this:
(Week is just a sequential number, starting with 1 for the date '1991/08/05')
week,benefit,ID
1,891,1000456
2,891,1000456
3,891,1000456
...
1211,110,1000456
1,110,2000789
2,112,2000789
3,112,2000789
...
1211,997,2000789
I have written a function in PostgreSQL that works, but it is very slow. The entire conversion takes 15 hours. I have tried it on my laptop with an SSD and 8 GB RAM, and on an Amazon RDS instance with 30 GB of memory. Still slow. The PostgreSQL function splits the csv into chunks; I've experimented a bit, and 100K rows per batch seems fastest (yeah, 15-hours fast).
To be clear, I'm not particularly looking for a solution using PostgreSQL. Anything will do. In fact, I'm not sure why I would even use a DB for this at all.
That said, here are my functions in PostgreSQL:
First function: I load part of the csv file into a table called part_grund. I only load the fields with the weekly data and the ID.
CREATE OR REPLACE FUNCTION DREAMLOAD_PART(OUT result text) AS
$BODY$
BEGIN
EXECUTE 'DROP TABLE IF EXISTS part_grund;
CREATE UNLOGGED TABLE part_grund
(id int, raw_data text[],rn int[]);
INSERT INTO part_grund
SELECT raw_data[1300]::int
,raw_data[1:1211]
,rn
FROM grund_no_headers
cross join
(
SELECT ARRAY(
WITH RECURSIVE t(n) AS
(
VALUES (1)
UNION ALL
SELECT n+1 FROM t WHERE n < 1211
)
SELECT n FROM t
) AS rn) AS rn;
CREATE INDEX idx_id on part_grund (id);';
END;
$BODY$
LANGUAGE plpgsql;
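(Incidentally, the recursive CTE above only serves to build the array [1, 2, ..., 1211]; generate_series expresses the same thing more directly:)

```sql
-- Equivalent to the WITH RECURSIVE construction of rn above.
SELECT ARRAY(SELECT g FROM generate_series(1, 1211) AS g) AS rn;
```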
Second function: Here, the data is transformed using the unnest function.
CREATE OR REPLACE FUNCTION DREAMLOAD(startint int, batch_size int, OUT result text) AS
$BODY$
DECLARE
i integer := startint;
e integer := startint + batch_size;
endint integer;
BEGIN
endint := (SELECT MAX(ID) FROM part_grund) + batch_size;
EXECUTE 'DROP TABLE IF EXISTS BENEFIT;
CREATE UNLOGGED TABLE BENEFIT (
ID integer
,benefit smallint
,Week smallint
);';
WHILE e <= endint LOOP
EXECUTE 'INSERT INTO BENEFIT
SELECT ID
,unnest(raw_data) AS benefit
,unnest(rn) AS week
FROM part_grund
WHERE ID between ' || i || ' and ' || e-1 ||';';
i := i + batch_size;
e := e + batch_size;
END LOOP;
END;
$BODY$
LANGUAGE plpgsql;
As I mentioned above, it works, but it is painfully slow. So suggestions for a faster way of doing this would be much appreciated.
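For reference, the batching loop can in principle be collapsed into a single set-based statement using unnest ... WITH ORDINALITY (available from PostgreSQL 9.4), which produces the week number on the fly and makes both the precomputed rn array and the ID-range loop unnecessary. A sketch, assuming the part_grund staging table from the first function (untested against the real data):

```sql
-- One pass over the staging table; WITH ORDINALITY numbers the array
-- elements 1..1211, which is exactly the week number.
CREATE UNLOGGED TABLE benefit AS
SELECT pg.id,
       u.benefit::smallint AS benefit,
       u.week::smallint    AS week
FROM part_grund pg
CROSS JOIN LATERAL unnest(pg.raw_data) WITH ORDINALITY AS u(benefit, week)
WHERE u.benefit IS NOT NULL AND u.benefit <> '';  -- drop the blank (no-benefit) weeks
```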
I have the following table in Postgres (days), which would typically be populated like below:
id day visits passes
1 Monday {11,13,19} {13,17}
2 Tuesday {7,9} {11,13,19}
3 Wednesday {2,5,21} {21,27}
4 Thursday {3,11,39} {21,19}
In order to get the visit or passes ids over a range of days I have written the following function
CREATE OR REPLACE FUNCTION day_entries(p_column TEXT,VARIADIC ids int[]) RETURNS bigint[] AS
$$
DECLARE result bigint[];
DECLARE hold bigint[];
BEGIN
FOR i IN 1 .. array_upper(ids,1) LOOP
execute format('SELECT %I FROM days WHERE id = $1',p_column) USING ids[i] INTO hold;
result := unnest(result) UNION unnest(hold);
END LOOP;
RETURN result;
END;
$$
LANGUAGE 'plpgsql';
which works with a subsequent call to day_entries('visits',1,2,3) returning
{11,9,19,21,5,13,2,7}
While it does the job, I am concerned that, based on my one-day-old knowledge of writing Postgres functions, I have worked one or more inefficiencies into the process. Can the function be made simpler in some way?
The other issue is more a curiosity than a problem: the order of elements in the result appears to have no bearing on the order of the visits entries in the three rows that are touched. Although this is not an issue as far as I am concerned, I am curious to know why it happens.
You can do the unnesting and aggregating in a single statement, no need for a loop. And you can use the ANY operator with the array to select all matching rows.
CREATE OR REPLACE FUNCTION day_entries(p_column TEXT, variadic p_ids int[])
RETURNS bigint[] AS
$$
DECLARE
result bigint[];
BEGIN
execute
format('SELECT array(select unnest(%I) from days WHERE id = any($1))', p_column)
USING p_ids -- pass the whole array as a parameter
INTO result;
RETURN result;
END;
$$
LANGUAGE plpgsql;
Not related to your questions, but I think you are going down the wrong road with that design. While arrays might look intriguing to beginners, they should be used only rarely.
And if you find yourself unnesting and aggregating things back and forth, this is a strong indication that something could be improved.
I would split your table up into two tables: one that stores the "day" information, and one that stores visits and passes together with a column distinguishing the two. Then finding visits is as simple as adding a where ... = 'visit' rather than having to cope with (slow and error-prone) dynamic SQL.
Without knowing more details, I would probably create the tables like this:
create table days
(
id integer not null primary key,
day character varying(9) not null
);
create table event
(
day_id integer not null references days,
event_id integer not null,
event_type varchar(10) not null check (event_type in ('visit', 'pass'))
);
event_id might even be a foreign key to another table you haven't shown us, again something you can't really do with de-normalized tables.
Getting all visits for specific days is then as simple as:
select event_id
from event
where day_id in (1,2)
and event_type = 'visit';
Or if you do need that as an array:
select array_agg(event_id)
from event
where day_id in (1,2)
and event_type = 'visit';
Online example
For my master's thesis I am analyzing several algorithms that could be useful for a mobile service provider (test data sets are based on a mobile music school) to find the optimal teacher for a new student, taking the locations of a teacher's existing students into account.
The attached code provides correct results for a simple KNN (k-nearest neighbor) search avoiding duplicates.
As "DISTINCT ON" requires st.teacher_id to be included in the ORDER BY clause, the R-tree index I have on my geometry column "address_transform" is not used. This leads to very bad performance once the table size gets larger (100k rows for the student table), the geometry gets more complex, etc.
Any ideas how to rewrite the function so that the index gets used?
CREATE OR REPLACE FUNCTION thesis_knn_distinct (q_neighbors_no integer, q_latitude numeric, q_longitude numeric, q_instrument text, q_student_table character varying)
RETURNS TABLE (
student_id TEXT,
teacher_id TEXT,
distance DOUBLE PRECISION,
instrument TEXT[]
)
AS $$
DECLARE
location_txt varchar(50) := 'SRID=4326;POINT('||q_longitude||' '||q_latitude||')';
teacher_table varchar(25);
BEGIN
IF q_student_table LIKE 'student_hesse%' THEN
teacher_table = 'teacher_synth_large';
ELSIF [...]
END IF;
RETURN QUERY EXECUTE 'WITH teacher_filter AS (
SELECT DISTINCT ON (st.teacher_id) st.id, st.teacher_id, ST_DistanceSphere(address_box, $2::geometry) AS distance, te.instrument
FROM '|| q_student_table::regclass ||' st INNER JOIN '|| teacher_table::regclass ||' te
ON st.teacher_id = te.id
WHERE te.instrument @> ARRAY[$1]::text[]
ORDER BY st.teacher_id, st.address_transform <-> ST_Transform($2::geometry,3857)
)
SELECT * FROM teacher_filter
ORDER BY distance
LIMIT $3;'
USING q_instrument, location_txt, q_neighbors_no;
END; $$
LANGUAGE 'plpgsql';
Annotations:
I'm using a dynamic query as I'm testing with several tables of real/synthetic data (indexed, non-indexed, clustered, etc.)
I am aware of the possibility to set configuration parameters like enable_seqscan but that's not really a permanent solution to my problem
As an alternative I have already implemented a (pretty fast) variation where I pre-select a multiple of the required neighbors via simple KNN and then remove duplicates in a second step. This works ok for a purely distance-related approach but the pre-selection does not necessarily contain the best matches if other parameters apart from distance are taken into account at a later step as well.
I am using postgres 10.4, postgis 2.4.4
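One direction that is sometimes suggested for DISTINCT ON + KNN combinations is a LATERAL subquery, so the <-> ordering runs in a small per-teacher scan where the GiST index can be considered. A rough, untested sketch using the table and column names from the function above, with $1/$2/$3 standing for the same parameters:

```sql
-- Per teacher, find the nearest student via the KNN operator, then rank
-- teachers by that distance. Table names are from one of the test setups.
WITH teacher_filter AS (
    SELECT s.id, te.id AS teacher_id, s.distance, te.instrument
    FROM teacher_synth_large te
    CROSS JOIN LATERAL (
        SELECT st.id,
               ST_DistanceSphere(st.address_box, $2::geometry) AS distance
        FROM student_hesse st
        WHERE st.teacher_id = te.id
        ORDER BY st.address_transform <-> ST_Transform($2::geometry, 3857)
        LIMIT 1
    ) s
    WHERE te.instrument @> ARRAY[$1]::text[]
)
SELECT * FROM teacher_filter
ORDER BY distance
LIMIT $3;
```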
Below is a function that's part of a process used to periodically upload many CSVs (which can change) into a Postgres 9.6 db. Does anyone have suggestions on how to improve this function, or other data-upload processes you'd like to share?
The function works (I think), so I thought I'd share in case it would be helpful for someone else. As a total newbie, this took me flipping forever, so hopefully I can save someone some time.
I lifted code from various sources+++ to make this function, which inserts all of the columns in the source table that have matching destination table columns, casting the data type from the source columns into the data type of the destination columns during the insert. I plan to turn this into a trigger function(s) that executes upon update of the source table(s).
Bigger picture: 1) batch file runs dbf2csv to export DBFs into CSVs, 2) batch files run csvkit to load many CSVs into a separate tables in a schema called dataloader and add a new column for the CSV date, 3) the below function moves the data from the dataloader table to the main tables located in the public schema. I had thought about using PGloader, but I don't know Python. An issue that I will have is if new columns are added to the source CSVs (this function will ignore them), but I can monitor that manually as the columns don't change much.
+++ A few I can remember (thanks!)
Dynamic insert
Dynamic insert #2
Data type
More dynamic code
I experimented with FDWs and can't remember why I didn't use this approach.
Foreign data wrapper
CREATE OR REPLACE FUNCTION dataloader.insert_des3 (
tbl_des pg_catalog.regclass,
tbl_src pg_catalog.regclass
)
RETURNS void AS
$body$
DECLARE
tdes_cols text;
tsrc_cols text;
BEGIN
SET search_path TO dataloader, public;
SELECT string_agg( quote_ident( c1.attname ), ',' ),
-- COALESCE must wrap quote_ident, not the other way round; otherwise a
-- missing source column becomes the quoted identifier "NULL" instead of a NULL literal
string_agg( COALESCE( quote_ident( c2.attname ), 'NULL' ) || '::' || format_type(c1.atttypid, c1.atttypmod), ',' )
INTO tdes_cols,
tsrc_cols
FROM pg_attribute c1
LEFT JOIN pg_attribute c2
ON c2.attrelid = tbl_src
AND c2.attnum > 0 --attnum is the column number of c2
AND NOT c2.attisdropped
AND c1.attname = lower(c2.attname)
WHERE c1.attrelid = tbl_des
AND c1.attnum > 0
AND NOT c1.attisdropped
AND c1.attname <> 'id';
EXECUTE format(
-- %s (not %I) for the table arguments: a regclass renders schema-qualified
-- and pre-quoted, and %I would quote the whole thing as a single identifier
' INSERT INTO %s (%s)
SELECT %s
FROM %s
',
tbl_des,
tdes_cols,
tsrc_cols,
tbl_src
);
END;
$body$
LANGUAGE 'plpgsql'
VOLATILE
CALLED ON NULL INPUT
SECURITY INVOKER
COST 100;
To call the function
SELECT dataloader.insert_des3('public.tbl_des','dataloader.tbl_src');
My idea is to implement a basic "vector clock", where timestamps are clock-based, always go forward, and are guaranteed to be unique.
For example, in a simple table:
CREATE TABLE IF NOT EXISTS timestamps (
last_modified TIMESTAMP UNIQUE
);
I use a trigger to set the timestamp value before insertion. It basically just goes into the future when two inserts arrive at the same time:
CREATE OR REPLACE FUNCTION bump_timestamp()
RETURNS trigger AS $$
DECLARE
previous TIMESTAMP;
current TIMESTAMP;
BEGIN
previous := NULL;
SELECT last_modified INTO previous
FROM timestamps
ORDER BY last_modified DESC LIMIT 1;
current := clock_timestamp();
IF previous IS NOT NULL AND previous >= current THEN
current := previous + INTERVAL '1 milliseconds';
END IF;
NEW.last_modified := current;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
DROP TRIGGER IF EXISTS tgr_timestamps_last_modified ON timestamps;
CREATE TRIGGER tgr_timestamps_last_modified
BEFORE INSERT OR UPDATE ON timestamps
FOR EACH ROW EXECUTE PROCEDURE bump_timestamp();
I then run a massive amount of insertions in two separate clients:
DO
$$
BEGIN
FOR i IN 1..100000 LOOP
INSERT INTO timestamps DEFAULT VALUES;
END LOOP;
END;
$$;
As expected, I get collisions:
ERROR: duplicate key value violates unique constraint "timestamps_last_modified_key"
SQL state: 23505
Detail: Key (last_modified)=(2016-01-15 18:35:22.550367) already exists.
Context: SQL statement "INSERT INTO timestamps DEFAULT VALUES"
PL/pgSQL function inline_code_block line 4 at SQL statement
#rach suggested mixing clock_timestamp() with a SEQUENCE object, but that would probably imply getting rid of the TIMESTAMP type. Even so, I can't really figure out how it would solve the isolation problem...
Is there a common pattern to avoid this?
Thank you for your insights :)
If you have only one Postgres server, as you said, I think that using timestamp + sequence can solve the problem, because sequences are non-transactional and respect the insert order.
If you have db shards then it will be much more complex, but maybe the distributed sequences of 2ndQuadrant's BDR could help, though I don't think that ordering would be respected. I added some code below in case you have a setup to test it.
CREATE SEQUENCE "timestamps_seq";
-- Let's test first, how to generate id.
SELECT extract(epoch from now())::bigint::text || LPAD(nextval('timestamps_seq')::text, 20, '0') as unique_id ;
unique_id
--------------------------------
145288519200000000000000000010
(1 row)
CREATE TABLE IF NOT EXISTS timestamps (
unique_id TEXT UNIQUE NOT NULL DEFAULT extract(epoch from now())::bigint::text || LPAD(nextval('timestamps_seq')::text, 20, '0')
);
INSERT INTO timestamps DEFAULT VALUES;
INSERT INTO timestamps DEFAULT VALUES;
INSERT INTO timestamps DEFAULT VALUES;
select * from timestamps;
unique_id
--------------------------------
145288556900000000000000000001
145288557000000000000000000002
145288557100000000000000000003
(3 rows)
Let me know if that works. I'm not a DBA, so it might be good to ask on dba.stackexchange.com too about potential side effects.
My two cents (inspired by http://tapoueh.org/blog/2013/03/15-batch-update):
Try adding the following before the massive amount of insertions:
LOCK TABLE timestamps IN SHARE MODE;
Official documentation is here: http://www.postgresql.org/docs/current/static/sql-lock.html
Say I have a table like posts, which has typical columns like id, body, created_at. I'd like to generate a unique string with the creation of each post, for use in something like a url shortener. So maybe a 10-character alphanumeric string. It needs to be unique within the table, just like a primary key.
Ideally there would be a way for Postgres to handle both of these concerns:
generate the string
ensure its uniqueness
And they must go hand-in-hand, because my goal is to not have to worry about any uniqueness-enforcing code in my application.
I don't claim the following is efficient, but it is how we have done this sort of thing in the past.
CREATE FUNCTION make_uid() RETURNS text AS $$
DECLARE
new_uid text;
done bool;
BEGIN
done := false;
WHILE NOT done LOOP
new_uid := md5(''||now()::text||random()::text);
done := NOT exists(SELECT 1 FROM my_table WHERE uid=new_uid);
END LOOP;
RETURN new_uid;
END;
$$ LANGUAGE PLPGSQL VOLATILE;
make_uid() can be used as the default for a column in my_table. Something like:
ALTER TABLE my_table ADD COLUMN uid text NOT NULL DEFAULT make_uid();
md5(''||now()::text||random()::text) can be adjusted to taste. You could consider encode(...,'base64') except some of the characters used in base-64 are not URL friendly.
All existing answers are WRONG because they are based on a SELECT while generating the unique code per table record. Assume we need a unique code per record while inserting: imagine two concurrent INSERTs happening at the same moment (which happens more often than you think). The same code gets generated for both inserts, because at the moment of the SELECT that code did not yet exist in the table. One instance will INSERT and the other will fail.
First let us create table with code field and add unique index
CREATE TABLE my_table
(
code TEXT NOT NULL
);
CREATE UNIQUE INDEX ON my_table (lower(code));
Then we should have a function or procedure (you can also use the code inside a trigger) where we: 1. generate a new code, 2. try to insert a new record with the new code, and 3. if the insert fails, try again from step 1.
CREATE OR REPLACE PROCEDURE my_table_insert()
AS $$
DECLARE
new_code TEXT;
BEGIN
LOOP
new_code := LOWER(SUBSTRING(MD5(''||NOW()::TEXT||RANDOM()::TEXT) FOR 8));
BEGIN
INSERT INTO my_table (code) VALUES (new_code);
EXIT;
EXCEPTION WHEN unique_violation THEN
NULL; -- collision: loop around and retry with a fresh code
END;
END LOOP;
END;
$$ LANGUAGE PLPGSQL;
This is a guaranteed error-free solution, unlike the other solutions in this thread.
Use a Feistel network. This technique works efficiently to generate unique random-looking strings in constant time, without any collisions.
For a version with about 2 billion possible strings (2^31) of 6 letters, see this answer.
For a 63 bits version based on bigint (9223372036854775808 distinct possible values), see this other answer.
You may change the round function as explained in the first answer to introduce a secret element to have your own series of strings (not guessable).
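For concreteness, the usual starting point is the pseudo_encrypt function from the PostgreSQL wiki: a small 3-round Feistel network that maps every 32-bit integer to a unique, random-looking 32-bit integer, so feeding it nextval() from a sequence yields collision-free values that can then be encoded as strings (snippet reproduced from the wiki; change the round function as described in the linked answers to get your own series):

```sql
CREATE OR REPLACE FUNCTION pseudo_encrypt(value int) RETURNS int AS $$
DECLARE
  l1 int; l2 int;
  r1 int; r2 int;
  i int := 0;
BEGIN
  -- Split the input into two 16-bit halves.
  l1 := (value >> 16) & 65535;
  r1 := value & 65535;
  -- Three Feistel rounds with a fixed round function.
  WHILE i < 3 LOOP
    l2 := r1;
    r2 := l1 # ((((1366 * r1 + 150889) % 714025) / 714025.0) * 32767)::int;
    l1 := l2;
    r1 := r2;
    i := i + 1;
  END LOOP;
  -- Recombine the (swapped) halves; the construction is a bijection,
  -- so distinct inputs always give distinct outputs.
  RETURN ((r1 << 16) + l1);
END;
$$ LANGUAGE plpgsql STRICT IMMUTABLE;
```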
The easiest way is probably to use a sequence to guarantee uniqueness
(and append a fixed x-digit random number after the sequence value):
CREATE SEQUENCE test_seq;
CREATE TABLE test_table (
id bigint NOT NULL DEFAULT (nextval('test_seq')::text || (LPAD(floor(random()*100000000)::text, 8, '0')))::bigint,
txt TEXT
);
insert into test_table (txt) values ('1');
insert into test_table (txt) values ('2');
select id, txt from test_table;
However, this will waste a huge part of the value range. (Note: the max bigint is 9223372036854775807; if you use an 8-digit random number at the end, you can only have 922337203 records. Though 8 digits are probably not necessary. Also check the max number for your programming environment!)
Alternatively, you can use varchar for the id and convert the above number with to_hex(), or change it to base 36 like below (but for base 36, try not to expose it to customers, to avoid some funny strings showing up!):
PostgreSQL: Is there a function that will convert a base-10 int into a base-36 string?
Check out a blog post by Bruce. This gets you part of the way there. You will have to make sure the value doesn't already exist. Maybe concatenate the primary key to it?
Generating Random Data Via Sql
"Ever need to generate random data? You can easily do it in client applications and server-side functions, but it is possible to generate random data in sql. The following query generates five lines of 40-character-length lowercase alphabetic strings:"
SELECT
(
SELECT string_agg(x, '')
FROM (
SELECT chr(ascii('a') + floor(random() * 26)::integer)
FROM generate_series(1, 40 + b * 0)
) AS y(x)
)
FROM generate_series(1,5) as a(b);
Use the primary key in your data. If you really need an alphanumeric unique string, you can use base-36 encoding. In PostgreSQL you can use this function.
Example:
select base36_encode(generate_series(1000000000,1000000010));
GJDGXS
GJDGXT
GJDGXU
GJDGXV
GJDGXW
GJDGXX
GJDGXY
GJDGXZ
GJDGY0
GJDGY1
GJDGY2