I'm using PostgreSQL 8.2.
I want to be able to create a simple table with the year and all the week numbers up to 52 for that year.
Like this:
 year | week
------+------
 2013 |    1
 2013 |    2
  ... |  ...
 2013 |   52
There must be an efficient way to do this.
If it needs to scale out to any year, it should dynamically list all the week numbers for that year.
Any help is appreciated.
TIA
From what I know, PostgreSQL does not have a function that returns the number of weeks in a year. So I guess the best way to proceed is to create this function:
CREATE OR REPLACE FUNCTION weeks_in_year(aYear integer) RETURNS integer AS
$$
DECLARE
vW integer;
BEGIN
vW := date_part('week', (aYear::text || '-12-31')::date);
-- If Dec 31 falls into ISO week 1 of the following year, this year has 52 weeks.
RETURN CASE WHEN vW = 1 THEN 52 ELSE vW END;
END
$$
LANGUAGE plpgsql;
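A quick sanity check (2013 has 52 ISO weeks; 2015 starts on a Thursday, so it has 53):
SELECT weeks_in_year(2013);  -- 52
SELECT weeks_in_year(2015);  -- 53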
Then you can use this function with generate_series to get your data:
select 2013 AS year,generate_series(1, weeks_in_year(2013)) AS week
If you have to create a table, you can use:
SELECT 2013 AS year,generate_series(1, weeks_in_year(2013)) AS week INTO my_table
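If you need this for several years at once, the same pair of functions scales out; a sketch (adjust the year range as needed):
-- One row per (year, week) for a range of years:
SELECT y AS year, generate_series(1, weeks_in_year(y)) AS week
FROM generate_series(2010, 2015) AS y;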
Besides one-off actions, we need recurring actions in our database. We want the user to be able to define a periodicity (every 1, 2, 3, ... years) and a period (e.g. from 2018 to 2020) in a form. This data should be used to insert the appropriate datasets for a defined action.
If the user chooses an annual periodicity starting from 2018, 3 datasets (2018, 2019 and 2020) should be inserted into the actions table.
If the user chooses a biannual periodicity starting from 2018, only 2 datasets (2018 and 2020) should be inserted into the actions table.
The simplified table actions looks like this:
id serial not null
id_action integer
action_year integer
periodicity integer
from_ integer
to_ integer
I need a starting point for the sql statement.
You should use generate_series(start, stop, step)
Annual:
=> select generate_series(2018,2020,1);
generate_series
-----------------
2018
2019
2020
(3 rows)
Biannual:
=> select generate_series(2018,2020,2);
generate_series
-----------------
2018
2020
(2 rows)
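If you want the series to drive the inserts directly, something along these lines should work (the id_action value 50 is just an example; column names are taken from your simplified table):
-- One row per generated year for a biannual action running 2018-2020:
INSERT INTO actions (id_action, action_year, periodicity, from_, to_)
SELECT 50, y, 2, 2018, 2020
FROM generate_series(2018, 2020, 2) AS y;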
I didn't know about the function generate_series() until now. Thanks for pointing me in that direction.
To get things running the way I intended, I need to use generate_series() inside a trigger function that is fired AFTER INSERT. After first running into trouble with recursive trigger inserts, I now have the problem that my trigger produces too many duplicate inserts (increasing with the chosen periodicity).
My table actions looks like this:
id serial not null
id_action integer
action_year integer
periodicity integer
from_ integer
to_ integer
My Trigger on the table:
CREATE TRIGGER tr_actions_recurrent
AFTER INSERT
ON actions
FOR EACH ROW
WHEN ((pg_trigger_depth() = 0))
EXECUTE PROCEDURE actions_recurrent();
Here my trigger function:
CREATE OR REPLACE FUNCTION actions_recurrent()
RETURNS trigger AS
$BODY$
BEGIN
IF NEW.periodicity >0 AND NEW.action_year <= NEW.to_-NEW.periodicity THEN
INSERT into actions(id_action, action_year,periodicity, from_, to_)
SELECT NEW.id_action, y, NEW.periodicity, NEW.from_, NEW.to_
FROM actions, generate_series(NEW.from_+NEW.periodicity,NEW.to_,NEW.periodicity) AS y;
END IF;
RETURN NEW;
END;
$BODY$
LANGUAGE plpgsql VOLATILE
COST 100;
When I do an insert
INSERT INTO actions (id_action, action_year,periodicity,from_, to_)
VALUES (50,2018,4,2018,2028);
I get one row for action_year 2018, but 13 rows each for 2022 and 2026??
In my understanding, the IF clause in the trigger function should prevent such repetitive execution.
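Could the extra FROM actions in the INSERT ... SELECT be the culprit? It cross joins the generated series with every row already in the table. A sketch of the same INSERT without that join (untested):
INSERT INTO actions (id_action, action_year, periodicity, from_, to_)
SELECT NEW.id_action, y, NEW.periodicity, NEW.from_, NEW.to_
FROM generate_series(NEW.from_ + NEW.periodicity, NEW.to_, NEW.periodicity) AS y;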
My idea is to implement a basic «vector clock», where timestamps are clock-based, always go forward and are guaranteed to be unique.
For example, in a simple table:
CREATE TABLE IF NOT EXISTS timestamps (
last_modified TIMESTAMP UNIQUE
);
I use a trigger to set the timestamp value before insertion. It basically just goes into the future when two inserts arrive at the same time:
CREATE OR REPLACE FUNCTION bump_timestamp()
RETURNS trigger AS $$
DECLARE
previous TIMESTAMP;
current TIMESTAMP;
BEGIN
previous := NULL;
SELECT last_modified INTO previous
FROM timestamps
ORDER BY last_modified DESC LIMIT 1;
current := clock_timestamp();
IF previous IS NOT NULL AND previous >= current THEN
current := previous + INTERVAL '1 milliseconds';
END IF;
NEW.last_modified := current;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
DROP TRIGGER IF EXISTS tgr_timestamps_last_modified ON timestamps;
CREATE TRIGGER tgr_timestamps_last_modified
BEFORE INSERT OR UPDATE ON timestamps
FOR EACH ROW EXECUTE PROCEDURE bump_timestamp();
I then run a massive number of insertions from two separate clients:
DO
$$
BEGIN
FOR i IN 1..100000 LOOP
INSERT INTO timestamps DEFAULT VALUES;
END LOOP;
END;
$$;
As expected, I get collisions:
ERROR: duplicate key value violates unique constraint "timestamps_last_modified_key"
SQL state: 23505
Detail: Key (last_modified)=(2016-01-15 18:35:22.550367) already exists.
Context: SQL statement "INSERT INTO timestamps DEFAULT VALUES"
PL/pgSQL function inline_code_block line 4 at SQL statement
@rach suggested mixing clock_timestamp() with a SEQUENCE object, but it would probably imply getting rid of the TIMESTAMP type. Even then, I can't really figure out how it would solve the isolation problem...
Is there a common pattern to avoid this?
Thank you for your insights :)
If you have only one Postgres server, as you said, I think that using a timestamp + sequence can solve the problem, because sequences are non-transactional and respect the insert order.
If you have db shards it will be much more complex, but maybe the distributed sequences of 2ndQuadrant's BDR could help, although I don't think their ordering would be respected. I added some code below if you have a setup to test it.
CREATE SEQUENCE "timestamps_seq";
-- Let's test first, how to generate id.
SELECT extract(epoch from now())::bigint::text || LPAD(nextval('timestamps_seq')::text, 20, '0') as unique_id ;
unique_id
--------------------------------
145288519200000000000000000010
(1 row)
CREATE TABLE IF NOT EXISTS timestamps (
unique_id TEXT UNIQUE NOT NULL DEFAULT extract(epoch from now())::bigint::text || LPAD(nextval('timestamps_seq')::text, 20, '0')
);
INSERT INTO timestamps DEFAULT VALUES;
INSERT INTO timestamps DEFAULT VALUES;
INSERT INTO timestamps DEFAULT VALUES;
select * from timestamps;
unique_id
--------------------------------
145288556900000000000000000001
145288557000000000000000000002
145288557100000000000000000003
(3 rows)
Let me know if that works. I'm not a DBA, so it might be good to ask on dba.stackexchange.com too about potential side effects.
My two cents (inspired by http://tapoueh.org/blog/2013/03/15-batch-update):
Try adding the following before the massive batch of insertions:
LOCK TABLE timestamps IN SHARE MODE;
Official documentation is here: http://www.postgresql.org/docs/current/static/sql-lock.html
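Note that LOCK TABLE only works inside a transaction block and the lock is released at commit, so it has to be taken in the same transaction as the inserts, roughly like this:
BEGIN;
LOCK TABLE timestamps IN SHARE MODE;
-- ... the inserts that need serialized access to the table ...
INSERT INTO timestamps DEFAULT VALUES;
COMMIT;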
I have a csv file with about 1,500 fields and 5-6 million rows. It is a dataset with one row for each individual who has received public benefits at some point since ISO week 32 in 1991. Each field represents one week and holds a number relating to the specific benefit received in that particular week. If the individual has received no benefits, the field is left blank (''). In addition to the weekly values there are a number of other fields (ID, sex, birthdate, etc.)
The data set is updated quarterly with an added field for each week in the quarter, and an added row for each new individual.
This is a sample of the data:
y_9132,y_9133,y_9134,...,y_1443,id,sex,dateofbirth
891,891,891,...,110,1000456,1,'1978/01/16'
110,112,112,...,997,2000789,0,'1945/09/28'
I'm trying to convert the data to a tabular format so it can be analysed using PostgreSQL with a column store or similar (Amazon Redshift is a possibility).
The fields beginning with "y_" represent the year and week of the received public benefits. In a tabular format, the field name should be converted to a row number or a date, starting with Monday in ISO week 32 of 1991 (1991/08/05).
The tabular dataset I'm trying to convert the csv file to would look like this:
(Week is just a sequential number, starting with 1 for the date '1991/08/05')
week,benefit,ID
1,891,1000456
2,891,1000456
3,891,1000456
...
1211,110,1000456
1,110,2000789
2,112,2000789
3,112,2000789
...
1211,997,2000789
I have written a function in PostgreSQL that works, but it is very slow. The entire conversion takes 15h. I have tried using my laptop with an SSD and 8GB RAM. I also tried it on an Amazon RDS instance with 30GB memory. Still slow. The PostgreSQL function splits the csv into chunks. I've experimented a bit, and 100K rows per batch seems fastest (yeah, 15h fast).
To be clear, I'm not particularly looking for solution using PostgreSQL. Anything will do. In fact, I'm not sure why I would even use a DB for this at all.
That said, here are my functions in PostgreSQL:
First function: I load part of the csv file into a table called part_grund. I only load the fields with the weekly data and the ID.
CREATE OR REPLACE FUNCTION DREAMLOAD_PART(OUT result text) AS
$BODY$
BEGIN
EXECUTE 'DROP TABLE IF EXISTS part_grund;
CREATE UNLOGGED TABLE part_grund
(id int, raw_data text[],rn int[]);
INSERT INTO part_grund
SELECT raw_data[1300]::int
,raw_data[1:1211]
,rn
FROM grund_no_headers
cross join
(
SELECT ARRAY(
WITH RECURSIVE t(n) AS
(
VALUES (1)
UNION ALL
SELECT n+1 FROM t WHERE n < 1211
)
SELECT n FROM t
) AS rn) AS rn;
CREATE INDEX idx_id on part_grund (id);';
END;
$BODY$
LANGUAGE plpgsql;
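As a side note, the recursive CTE that builds the 1..1211 week-number array could also be written with generate_series(), which should behave the same:
SELECT ARRAY(SELECT generate_series(1, 1211)) AS rn;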
Second function: Here, the data is transformed using the unnest function.
CREATE OR REPLACE FUNCTION DREAMLOAD(startint int, batch_size int, OUT result text) AS
$BODY$
DECLARE
i integer := startint;
e integer := startint + batch_size;
endint integer;
BEGIN
endint := (SELECT MAX(ID) FROM part_grund) + batch_size;
EXECUTE 'DROP TABLE IF EXISTS BENEFIT;
CREATE UNLOGGED TABLE BENEFIT (
ID integer
,benefit smallint
,Week smallint
);';
WHILE e <= endint LOOP
EXECUTE 'INSERT INTO BENEFIT
SELECT ID
,unnest(raw_data) AS benefit
,unnest(rn) AS week
FROM part_grund
WHERE ID between ' || i || ' and ' || e-1 ||';';
i=i+batch_size;
e=e+batch_size;
END LOOP;
END;
$BODY$
LANGUAGE plpgsql;
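The batching relies on the fact that parallel unnest() calls over equal-length arrays are paired up element-wise; a tiny standalone illustration with made-up values:
SELECT 1000456                      AS id,
       unnest(ARRAY[891, 891, 110]) AS benefit,
       unnest(ARRAY[1, 2, 3])       AS week;
-- yields (1000456, 891, 1), (1000456, 891, 2), (1000456, 110, 3)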
As I mentioned above, it works, but it is painfully slow. So, suggestions for a faster way of doing this would be much appreciated.
I need help with creating a trigger which forbids user to delete data that is newer than 2 weeks.
My current code:
CREATE OR REPLACE FUNCTION f_delete_data() RETURNS trigger AS $$
BEGIN
RAISE EXCEPTION 'Cant delete data which is newer than 2 weeks.';
RETURN NULL;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER trig_delete_data BEFORE DELETE ON Results
FOR EACH ROW WHEN (OLD.Date < DATE_SUB(NOW(), INTERVAL 14 DAY)) EXECUTE PROCEDURE
f_delete_data();
This code reports a syntax error at or near "14".
Why is the date_sub(.., interval 14 day) call not working?
I'm using PostgreSQL 9.3.0.
$$ LANGUAGE plpgsql; indicates you're using Postgres; however, DATE_SUB is a MySQL-specific function that is not available in Postgres.
Try replacing the DATE_SUB expression with this:
(OLD.Date < NOW() - INTERVAL '14 DAYS')
Besides the obvious mistake (there is no DATE_SUB in Postgres), you also have your logic backwards. If you want to protect rows where the value in the date column is less than 2 weeks old ("newer than 2 weeks"), then you must reverse the comparison operator.
CREATE TRIGGER trig_delete_data
BEFORE DELETE ON results
FOR EACH ROW
-- WHEN (OLD.date < DATE_SUB(NOW(), INTERVAL 14 DAY))  -- MySQL syntax, invalid in Postgres
-- WHEN (OLD.date < now() - interval '14 days')        -- comparison operator backwards
WHEN (OLD.date > now() - interval '14 days')
EXECUTE PROCEDURE f_delete_data();
And f_delete_data() really should be named something like f_protect_new_data().
Or, if your column is an actual date, like the ill-chosen column name suggests, simplify further:
WHEN (OLD.date >= CURRENT_DATE - 14)
The manual on CURRENT_DATE & friends.
Use >= in this case: the 14th day back from today is still illegal to delete according to your definition. The bound is logically a bit different from the timestamp handling.
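To see the two bounds side by side:
SELECT CURRENT_DATE - 14           AS date_bound,       -- midnight boundary, the whole day is included
       now() - interval '14 days'  AS timestamp_bound;  -- cuts at the current time of day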
Why "ill-chosen"? "date" is a reserved word in standard SQL and a basic type name in Postgres. If the column actually holds a timestamp, not a date, it's misleading on top of that.
Is it possible to declare a serial field in Postgres (9.0) which will increment based on a pattern?
For example:
Pattern: YYYY-XXXXX
where YYYY is a year, and XXXXX increments from 00000 - 99999.
Or should I just use a trigger?
EDIT: I prefer the year to be auto-determined, maybe based on the server date. The XXXXX part starts at 00000 for each year, i.e. it "resets" to 00000 and increments up to 99999 again when the year part changes.
I would create a separate SEQUENCE for each year, so that each sequence keeps track of one year - even after that year is over, should you need more unique IDs for that year later.
This function does it all:
Improved with input from @Igor and @Clodoaldo in the comments.
CREATE OR REPLACE FUNCTION f_year_id(y text = to_char(now(), 'YYYY'))
RETURNS text AS
$func$
BEGIN
LOOP
BEGIN
RETURN y ||'-'|| to_char(nextval('year_'|| y ||'_seq'), 'FM00000');
EXCEPTION WHEN undefined_table THEN -- error code 42P01
EXECUTE 'CREATE SEQUENCE year_' || y || '_seq MINVALUE 0 START 0';
END;
END LOOP;
END
$func$ LANGUAGE plpgsql VOLATILE;
Call:
SELECT f_year_id();
Returns:
2013-00000
Basically, this returns text in your requested pattern, automatically tailored to the current year. If a sequence of the name year_<year>_seq does not exist yet, it is created automatically and nextval() is retried.
Note that you cannot also have an overloaded function without a parameter at the same time (like my previous example), or Postgres will not know which one to pick and will throw an exception in despair.
Use this function as DEFAULT value in your table definition:
CREATE TABLE tbl (id text DEFAULT f_year_id(), ...)
Or you can get the next value for a year of your choice:
SELECT f_year_id('2012');
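A hypothetical example of the default in action (table and column names are made up):
CREATE TABLE invoice (id text DEFAULT f_year_id(), amount numeric);
INSERT INTO invoice (amount) VALUES (9.99), (19.99);
-- the id values continue the current year's sequence, e.g. 2013-00001 and 2013-00002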
Tested in Postgres 9.1. Should work in v9.0 or v9.2 just as well.
To understand what's going on here, read these chapters in the manual:
CREATE FUNCTION
CREATE SEQUENCE
39.6.3. Simple Loops
39.5.4. Executing Dynamic Commands
39.6.6. Trapping Errors
Appendix A. PostgreSQL Error Codes
Table 9-22. Template Pattern Modifiers for Date/Time Formatting
You can create a function that will form this value (YYYY-XXXXX) and set this function as a default for a column.
Details here.