Massive insertions from one big table to other related tables - postgresql

Intro:
Currently I have scraped all the data into one PostgreSQL 'Bigtable' table (there are about 1.2M rows). Now I need to split the design into separate tables which all depend on the Bigtable. Some of the tables might have subtables. The model looks pretty much like a snowflake.
Problem:
What would be the best option for inserting the data into those tables? I thought of writing the insertions as functions in SQL or PL/pgSQL, but the problem remains the auto-generated IDs.
Also, if you know of any tools that might make solving this easier, please post!
// Edit: I have added an example; it is not the real case, just an illustration.

1.2M rows is not that much. The best tool is an SQL script executed from the psql console. If you have a reasonably recent version of Pg, you can use inline functions (the DO statement) where necessary, but probably the most useful command is the INSERT INTO ... SELECT statement.
-- file conversion.sql
DROP TABLE IF EXISTS f1 CASCADE;
CREATE TABLE f1(a int, b int);
INSERT INTO f1
SELECT x1, y1
FROM data
WHERE x1 = 10;
...
-- end file
psql mydb -f conversion.sql
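Where a set-based INSERT INTO ... SELECT is not enough, a minimal DO sketch (using the same hypothetical data/f1 tables as above, purely illustrative):
DO $$
DECLARE
    r record;
BEGIN
    -- row-by-row logic, for the cases a single INSERT ... SELECT can't express
    FOR r IN SELECT x1, y1 FROM data WHERE x1 = 10 LOOP
        INSERT INTO f1(a, b) VALUES (r.x1, r.y1);
    END LOOP;
END;
$$;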

If I understand your question, you can use a PL/pgSQL function like this:
CREATE OR REPLACE FUNCTION migration() RETURNS integer AS
$BODY$
DECLARE
    currentProductId INTEGER;
    currentUserId    INTEGER;
    currentReg       RECORD;
BEGIN
    FOR currentReg IN
        SELECT * FROM bigtable
    LOOP
        -- Product
        SELECT productid INTO currentProductId
        FROM product
        WHERE name = currentReg.product_name;
        IF currentProductId IS NULL THEN
            -- a plain INSERT ... RETURNING is safer than building the
            -- statement with string concatenation (no quoting issues)
            INSERT INTO product (name)
            VALUES (currentReg.product_name)
            RETURNING productid INTO currentProductId;
        END IF;
        -- User ("user" must be double-quoted: it is a reserved word)
        SELECT userid INTO currentUserId
        FROM "user"
        WHERE first_name = currentReg.first_name
          AND last_name = currentReg.last_name;
        IF currentUserId IS NULL THEN
            INSERT INTO "user" (first_name, last_name)
            VALUES (currentReg.first_name, currentReg.last_name)
            RETURNING userid INTO currentUserId;
            -- Insert into userAdded too with: currentUserId and currentProductId
            [...]
        END IF;
        -- Rest of tables
        [...]
    END LOOP;
    RETURN 1;
END;
$BODY$
LANGUAGE plpgsql;
select * from migration();
In this case it's assumed that each table has its own primary key sequence, and I have reduced the number of fields in the tables to keep things simple.
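For reference, a minimal sketch of the table shapes this assumes (hypothetical columns; serial gives each table its own sequence):
CREATE TABLE product (productid serial PRIMARY KEY, name text);
CREATE TABLE "user" (userid serial PRIMARY KEY, first_name text, last_name text);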
I hope this has been helpful.

No need to use a function for this (unless I misunderstood your problem).
If your id columns are all defined as serial columns (i.e. they automatically generate their values), then this can be done with simple INSERT statements. This assumes that the target tables are all empty.
INSERT INTO users (firstname, lastname)
SELECT DISTINCT firstname, lastname
FROM bigtable;
INSERT INTO category (name)
SELECT DISTINCT category_name
FROM bigtable;
-- the following assumes a column categoryid in the product table
-- which is not visible from your screenshot
INSERT INTO product (product_name, description, categoryid)
SELECT DISTINCT b.product_name, b.description, c.categoryid
FROM bigtable b
JOIN category c ON c.category_name = b.category_name;
INSERT INTO product_added (product_productid, user_userid)
SELECT DISTINCT p.productid, u.userid
FROM bigtable b
JOIN product p ON p.product_name = b.product_name
JOIN users u ON u.firstname = b.firstname AND u.lastname = b.lastname;

Related

How to iterate over all schemas and get the row count of a same-named table in every schema, every 5 minutes?

Imagine there are 5 schemas in my database, and in every schema there is a table with a common name (e.g. table1) into which records get inserted every 5 minutes. How can I iterate over all schemas and calculate the count of table1? (I have to automate the process, so I am going to write the code in a function and call that function every 5 minutes using crontab.)
Basically two options. First: hard-code schema.table and union the results. So something like:
create or replace function count_rows_in_each_table1()
returns table (schema_name text, number_of_rows bigint)
language sql
as $$
    select 'schema1', count(*) from schema1.table1 union all
    select 'schema2', count(*) from schema2.table1 union all
    select 'schema3', count(*) from schema3.table1 union all
    ...
    select 'scheman', count(*) from scheman.table1;
$$;
The alternative is building the query dynamically from information_schema:
create or replace function count_rows_in_each_table1()
returns table (schema_name text, number_of_rows bigint)
language plpgsql
as $$
declare
    c_rows_count cursor is
        select table_schema::text
        from information_schema.tables
        where table_name = 'table1';
    l_tbl           record;
    l_sql_statement text = '';
    l_connector     text = '';
    l_base_select   text = 'select ''%s'', count(*) from %I.table1';
begin
    for l_tbl in c_rows_count
    loop
        l_sql_statement = l_sql_statement ||
                          l_connector ||
                          format(l_base_select, l_tbl.table_schema, l_tbl.table_schema);
        l_connector = ' union all ';
    end loop;
    raise notice E'Running Query: \n%', l_sql_statement;
    return query execute l_sql_statement;
end;
$$;
Which is better? With few schemas that are seldom added or dropped, opt for the first: it is direct and easily shows what you are doing. If you add/drop schemas often, opt for the second. If you have many schemas but seldom add/drop them, modify the second to generate the first, save the output, and schedule execution of the generated query (see the sketch below).
NOTE: Not tested
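A minimal sketch of that third route (equally untested): reuse the format() pattern above, but have the function return the generated SQL as text, so you can save it to a file and schedule it.
create or replace function generate_count_query()
returns text
language sql
as $$
    select string_agg(
               format('select %L, count(*) from %I.table1',
                      table_schema, table_schema),
               E' union all\n')
    from information_schema.tables
    where table_name = 'table1';
$$;
-- in psql, save the generated query for scheduling:
-- \copy (select generate_count_query()) to 'count_table1.sql'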

Postgres stored procedure/function

New to stored procedures. I have a requirement where I need to execute multiple queries inside a stored procedure and return the results. I would like to know whether this is possible or not.
Ex :
Query 1 returns a list of userids:
Select userid from user where username = ?
For each userid from the above query, I need to execute three different queries, like:
Query 2: select session_details from session where userid = ?
Query 3: select location from location where userid = ?
The return value should be a collection of userid, session_details and location.
Is this possible? Can you provide some hints?
You can loop through query results like so:
FOR id IN SELECT userid FROM "user" WHERE username = p_username
LOOP
    ...
END LOOP;
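Fleshed out into a complete function, that skeleton could look like this sketch (hypothetical table and column names; note that user is a reserved word, so it must be double-quoted when used as a table name):
CREATE OR REPLACE FUNCTION sessions_for(p_username text)
RETURNS SETOF session AS $$
DECLARE
    uid integer;
BEGIN
    FOR uid IN SELECT userid FROM "user" WHERE username = p_username LOOP
        -- one inner query per userid found by the outer query
        RETURN QUERY SELECT * FROM session s WHERE s.userid = uid;
    END LOOP;
END;
$$ LANGUAGE plpgsql;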
As @Fahad Anjum says in his comment, it's better if you can do it in a single query. But if that's not possible, you have three possibilities to achieve what you want:
SETOF
TABLE
refcursor
1. SETOF
You can return a set of values. The set can be based on an existing table, a temporary table, or a TYPE you define.
TYPE example:
-- In your case the type could be (userid integer, session integer, location text)
CREATE TYPE tester AS (id integer);
-- The function returns a SETOF the created type.
CREATE OR REPLACE FUNCTION test() RETURNS SETOF tester
AS $$
BEGIN
    RETURN QUERY SELECT generate_series(1, 3) AS id;
END;
$$ LANGUAGE plpgsql;
-- Then, you get the set by selecting from the function as if it were a table.
SELECT * FROM test();
Table and Temp Table examples:
-- Create a temporary table or a regular table:
CREATE TEMP TABLE test_table(id integer);
-- or CREATE TABLE test_table(id integer);
-- or use an existing table in your schema(s).
-- The function returns a SETOF the table you need.
CREATE OR REPLACE FUNCTION test() RETURNS SETOF test_table
AS $$
BEGIN
    RETURN QUERY SELECT generate_series(1, 3) AS id;
END;
$$ LANGUAGE plpgsql;
-- Then, you get the set by selecting from the function as if it were a table.
SELECT * FROM test();
-- NOTE: Since you are only returning a SETOF the table,
-- you don't insert any data into the table.
-- So, if you select the 'temp' table you won't see any changes.
SELECT * FROM test_table;
2. TABLE
A function can return a table. This is similar to creating a temporary table and then returning a SETOF of it, but in this case you declare the 'temp' table in the RETURNS clause of the function.
-- Next to TABLE you define the columns of the table the function will return
CREATE OR REPLACE FUNCTION test() RETURNS TABLE (id integer)
AS $$
BEGIN
    RETURN QUERY SELECT generate_series(1, 3) AS id;
END;
$$ LANGUAGE plpgsql;
-- As in the other examples, you select from the function to get the data.
SELECT * FROM test();
3. refcursor
This one is the most complex solution: you return a cursor, not the actual data. If you need a 'dynamic' structure for your returning set, this is the solution.
Since you need static data you won't strictly need this option, but a minimal sketch follows.
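A refcursor sketch (hypothetical cursor name; the FETCH must happen in the same transaction as the function call):
CREATE OR REPLACE FUNCTION test_cursor() RETURNS refcursor
AS $$
DECLARE
    c refcursor := 'my_cursor';
BEGIN
    OPEN c FOR SELECT generate_series(1, 3) AS id;
    RETURN c;
END;
$$ LANGUAGE plpgsql;
-- usage, inside a transaction:
BEGIN;
SELECT test_cursor();
FETCH ALL IN my_cursor;
COMMIT;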
Which of these to use depends on your specific case. If you regularly use userid, session, location in different ways and functions, it is better to use SETOF with a type.
If you have a table that has the userid, session, location columns, it's better to return a SETOF of that table.
If you only use userid, session, location in this one case, then the RETURNS TABLE approach is better.
If you need to return a dynamic set you would have to use cursors, but that solution is really more advanced.
Based solely on your example, here's probably the easiest way to do it:
CREATE FUNCTION my_func(user_id INTEGER)
RETURNS TABLE (userid INTEGER, session INTEGER, location TEXT) AS
$$
SELECT u.userid, s.session, l.location
FROM -- etc... your query here
$$
LANGUAGE SQL STABLE;
Addressing comment:
That's a bit of a different question. One question is how to return multiple records containing multiple fields from a stored procedure; one way is shown above.
The other question is how to write a query that gets data from multiple tables. Again, there are many ways to do it. One way (again, based on my interpretation of your requirements in the examples) is:
SELECT u.userid
     , ARRAY(SELECT s.session_details FROM session s WHERE s.userid = u.userid)
     , ARRAY(SELECT l.location FROM location l WHERE l.userid = u.userid)
FROM "user" u
WHERE u.username = user_name;
This will return one record containing the user_id, an array of session_details for that user, and an array of locations for that user.
Then the function can be changed to:
CREATE FUNCTION my_func(user_name TEXT, OUT userid INTEGER, OUT session_details TEXT[], OUT locations TEXT[])
AS $$
    SELECT u.userid
         , ARRAY(SELECT s.session_details FROM session s WHERE s.userid = u.userid)
         , ARRAY(SELECT l.location FROM location l WHERE l.userid = u.userid)
    FROM "user" u
    WHERE u.username = user_name;
$$ LANGUAGE SQL STABLE;
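It can then be called like any other function (hypothetical username):
SELECT * FROM my_func('jdoe');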

postgres lag and window to create cohort table [duplicate]

I am trying to create crosstab queries in PostgreSQL such that the crosstab columns are generated automatically instead of being hardcoded. I have written a function that dynamically generates the column list that I need for my crosstab query. The idea is to substitute the result of this function into the crosstab query using dynamic SQL.
I know how to do this easily in SQL Server, but my limited knowledge of PostgreSQL is hindering my progress here. I was thinking of storing the result of the function that generates the dynamic list of columns in a variable and using that to dynamically build the SQL query. It would be great if someone could guide me regarding the same.
-- Table which has to be pivoted
CREATE TABLE test_db
(
    kernel_id int,
    key       int,
    value     int
);
INSERT INTO test_db VALUES
    (1,1,99),
    (1,2,78),
    (2,1,66),
    (3,1,44),
    (3,2,55),
    (3,3,89);
-- This function dynamically returns the list of columns for crosstab
CREATE FUNCTION test() RETURNS TEXT AS '
DECLARE
    key_id  int;
    text_op TEXT = '' kernel_id int, '';
BEGIN
    FOR key_id IN SELECT DISTINCT key FROM test_db ORDER BY key LOOP
        text_op := text_op || key_id || '' int , '';
    END LOOP;
    text_op := text_op || '' DUMMY text'';
    RETURN text_op;
END;
' LANGUAGE 'plpgsql';
-- This query works. I just need to convert the static list
-- of crosstab columns to be generated dynamically.
SELECT * FROM
crosstab
(
'SELECT kernel_id, key, value FROM test_db ORDER BY 1,2',
'SELECT DISTINCT key FROM test_db ORDER BY 1'
)
AS x (kernel_id int, key1 int, key2 int, key3 int); -- How can I replace ..
-- .. this static list with a dynamically generated list of columns ?
You can use the provided C function crosstab_hash for this.
The manual is not very clear in this respect. It's mentioned at the end of the chapter on crosstab() with two parameters:
You can create predefined functions to avoid having to write out the
result column names and types in each query. See the examples in the
previous section. The underlying C function for this form of crosstab
is named crosstab_hash.
For your example:
CREATE OR REPLACE FUNCTION f_cross_test_db(text, text)
RETURNS TABLE (kernel_id int, key1 int, key2 int, key3 int)
AS '$libdir/tablefunc','crosstab_hash' LANGUAGE C STABLE STRICT;
Call:
SELECT * FROM f_cross_test_db(
'SELECT kernel_id, key, value FROM test_db ORDER BY 1,2'
,'SELECT DISTINCT key FROM test_db ORDER BY 1');
Note that you need to create a distinct crosstab_hash function for every crosstab function with a different return type.
Related:
PostgreSQL row to columns
Your function to generate the column list is rather convoluted; it can be replaced with this plain SQL query:
SELECT 'kernel_id int, '
|| string_agg(DISTINCT key::text, ' int, ' ORDER BY key::text)
|| ' int, DUMMY text'
FROM test_db;
And it cannot be used dynamically anyway.
@erwin-brandstetter: The return type of the function isn't an issue if you're always returning a JSON type with the converted results.
Here is the function I came up with:
CREATE OR REPLACE FUNCTION report.test(
    i_start_date TIMESTAMPTZ,
    i_end_date   TIMESTAMPTZ,
    i_interval   INT
) RETURNS TABLE (
    tab JSON
) AS $ab$
DECLARE
    _key_id  TEXT;
    _text_op TEXT = '';
    _ret     JSON;
BEGIN
    -- SELECT DISTINCT for query results
    FOR _key_id IN
        SELECT DISTINCT at_name
        FROM report.company_data_date cd
        JOIN report.company_data_amount cda ON cd.id = cda.company_data_date_id
        JOIN report.amount_types at ON cda.amount_type_id = at.id
        WHERE date_start BETWEEN i_start_date AND i_end_date
          AND interval_type_id = i_interval
    LOOP
        -- build the column-definition list, one entry per distinct at_name
        IF char_length(_text_op) > 1 THEN
            _text_op := _text_op || ', ' || _key_id || ' NUMERIC(20,2)';
        ELSE
            _text_op := _text_op || _key_id || ' NUMERIC(20,2)';
        END IF;
    END LOOP;
    -- build the query with parameter filters
    RETURN QUERY
    EXECUTE '
        SELECT array_to_json(array_agg(row_to_json(t)))
        FROM (
            SELECT * FROM crosstab(''SELECT date_start, at.at_name, cda.amount ct
                FROM report.company_data_date cd
                JOIN report.company_data_amount cda ON cd.id = cda.company_data_date_id
                JOIN report.amount_types at ON cda.amount_type_id = at.id
                WHERE date_start between $$' || i_start_date::TEXT || '$$ AND $$' || i_end_date::TEXT || '$$
                AND interval_type_id = ' || i_interval::TEXT || ' ORDER BY date_start'')
            AS ct (date_start timestamptz, ' || _text_op || ')
        ) t;';
END;
$ab$ LANGUAGE plpgsql;
So, when you run it, you get the dynamic results in JSON, and you don't need to know how many values were pivoted:
select * from report.test(now()- '1 week'::interval, now(), 1);
tab
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
[{"date_start":"2015-07-27T08:40:01.277556-04:00","burn_rate":0.00,"monthly_revenue":5800.00,"cash_balance":0.00},{"date_start":"2015-07-27T08:50:02.458868-04:00","burn_rate":34000.00,"monthly_revenue":15800.00,"cash_balance":24000.00}]
(1 row)
Edit: If you have mixed datatypes in your crosstab, you can add logic to look up the type of each column with something like this:
SELECT a.attname AS column_name, format_type(a.atttypid, a.atttypmod) AS data_type
FROM pg_attribute a
JOIN pg_class b ON a.attrelid = b.oid
JOIN pg_catalog.pg_namespace n ON n.oid = b.relnamespace
WHERE n.nspname = $$schema_name$$ AND b.relname = $$table_name$$ AND a.attnum > 0 AND NOT a.attisdropped;
I realise this is an older post but I struggled for a little while on the same issue.
My problem statement:
I had a table with multiple values in a field and wanted to create a crosstab query with 40+ column headings per row.
My solution was to create a function which looped through the table column to grab the values that I wanted to use as column headings within the crosstab query.
Within this function I could then create the crosstab query. In my use case I added the crosstab result into a separate table.
E.g.
CREATE OR REPLACE FUNCTION field_values_ct()
RETURNS VOID AS $$
DECLARE
    rec RECORD;
    str text;
BEGIN
    str := '"Issue ID" text,';
    -- loop to build the column-heading string
    FOR rec IN SELECT DISTINCT field_name
               FROM issue_fields
               ORDER BY field_name
    LOOP
        str := str || '"' || rec.field_name || '" text' || ',';
    END LOOP;
    -- drop the trailing comma
    str := substring(str, 0, length(str));
    EXECUTE 'CREATE EXTENSION IF NOT EXISTS tablefunc;
        DROP TABLE IF EXISTS temp_issue_fields;
        CREATE TABLE temp_issue_fields AS
        SELECT *
        FROM crosstab(''select issue_id, field_name, field_value from issue_fields order by 1'',
                      ''SELECT DISTINCT field_name FROM issue_fields ORDER BY 1'')
        AS final_result (' || str || ')';
END;
$$ LANGUAGE plpgsql;
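It can then be run like this; the function (re)creates temp_issue_fields each time:
SELECT field_values_ct();
SELECT * FROM temp_issue_fields;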
The approach described here worked well for me. Instead of retrieving the pivot table directly, the easier approach is to let a function generate the SQL query string, then execute that string on demand; a sketch of this follows.
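A minimal sketch of that generate-then-execute idea, reusing the test_db example from above (the \gexec meta-command assumes psql 9.6+):
CREATE OR REPLACE FUNCTION crosstab_sql() RETURNS text AS $$
    SELECT format(
        'SELECT * FROM crosstab(%L, %L) AS x (kernel_id int, %s)',
        'SELECT kernel_id, key, value FROM test_db ORDER BY 1,2',
        'SELECT DISTINCT key FROM test_db ORDER BY 1',
        string_agg(DISTINCT format('key%s int', key), ', ' ORDER BY format('key%s int', key))
    )
    FROM test_db;
$$ LANGUAGE sql;
-- in psql: generate the query, then execute the generated text
SELECT crosstab_sql() \gexec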

Postgresql 9.1 select from all schemas

I have a PostgreSQL 9.1 database with a couple hundred schemas. All have the same structure, just different data. I need to perform a select on a table and get the data from each schema. Unfortunately I haven't found a decent way to do it.
I tried setting the search path to schema_1, schema_2, etc. and then performing a select on the table, but it only selects data from the first schema.
The only way I managed to do it so far is by generating a big query like:
select * from schema_1.table
union
select * from schema_2.table
union
(...another 100 lines....)
Is there any other way to do this in a more reasonable fashion? If this is not possible, can I at least find out which of the schemas has records in that table without performing this select?
Different schemas mean different tables, so if you have to stick to this structure, it'll mean unions, one way or the other. That can be pretty expensive. If you're after partitioning through the convenience of search paths, it might make sense to reverse your schema:
Store a big table in the public schema, and then provision views in each of the individual schemas.
Check out this sqlfiddle that demonstrates my concept:
http://sqlfiddle.com/#!12/a326d/1
Also pasted inline for posterity, in case sqlfiddle is inaccessible:
Schema:
CREATE SCHEMA customer_1;
CREATE SCHEMA customer_2;
CREATE TABLE accounts(id serial, name text, value numeric, customer_id int);
CREATE INDEX ON accounts (customer_id);
CREATE VIEW customer_1.accounts AS SELECT id, name, value FROM public.accounts WHERE customer_id = 1;
CREATE VIEW customer_2.accounts AS SELECT id, name, value FROM public.accounts WHERE customer_id = 2;
INSERT INTO accounts(name, value, customer_id) VALUES('foo', 100, 1);
INSERT INTO accounts(name, value, customer_id) VALUES('bar', 100, 1);
INSERT INTO accounts(name, value, customer_id) VALUES('biz', 150, 2);
INSERT INTO accounts(name, value, customer_id) VALUES('baz', 75, 2);
Queries:
SELECT SUM(value) FROM public.accounts;
SET search_path TO 'customer_1';
SELECT * FROM accounts;
SET search_path TO 'customer_2';
SELECT * FROM accounts;
Results:
425
1 foo 100
2 bar 100
3 biz 150
4 baz 75
If you need to know something about the data in the tables, you have to SELECT; there is no other way. A schema is just logical addressing. What matters for your case is that you are using a lot of tables, so you have to do a massive UNION.
search_path works as expected. It does not mean "return data from all the listed schemas"; it specifies the order in which schemas are searched for a table that is not fully qualified. The search ends at the first table that has the requested name.
Attention: massive unions can require a lot of memory.
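To illustrate (hypothetical table name; resolution stops at the first match):
SET search_path TO schema_1, schema_2;
-- resolves to schema_1.table1; schema_2.table1 is never consulted
SELECT * FROM table1;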
You can use dynamic SQL in a DO block with a temp table:
postgres=# DO $$
declare
    r record;
begin
    drop table if exists result;
    create temp table result as select * from x.a limit 0; -- first table
    for r in select table_schema, table_name
             from information_schema.tables
             where table_name = 'a'
    loop
        raise notice '%', r;
        execute format('insert into result select * from %I.%I',
                       r.table_schema,
                       r.table_name);
    end loop;
end; $$;
result:
NOTICE: (y,a)
NOTICE: (x,a)
DO
postgres=# select * from result;
a
----
1
2
3
4
5
..
Here's one approach. You will need to pre-feed it all the schema names you are targeting; you could change this to just loop through all the schemas, as Pavel shows, if you know you want every schema. In my example I have three schemas that I care about, each containing a table called bar. The logic runs a select on each schema's bar table and inserts the values into a result table. At the end you have a table with all the data from all the tables. You could change this to update, delete, or run DDL; I chose to keep it simple and just collect the data from each table in each schema.
--START SETUP AKA Run This Section Once
-- (creating all three schemas and tables so the example is self-contained)
create schema schema1; create schema schema2; create schema schema3;
create table schema1.bar(bar_id SERIAL PRIMARY KEY, bar_name VARCHAR(50) NOT NULL);
create table schema2.bar(bar_id SERIAL PRIMARY KEY, bar_name VARCHAR(50) NOT NULL);
create table schema3.bar(bar_id SERIAL PRIMARY KEY, bar_name VARCHAR(50) NOT NULL);
insert into schema1.bar(bar_name) select 'One';
insert into schema2.bar(bar_name) select 'Two';
insert into schema3.bar(bar_name) select 'Three';
--END SETUP
DO $$
DECLARE
    r             record;
    l_id          INTEGER = 1;
    l_schema_name TEXT;
begin
    drop table if exists public.result;
    create table public.result (bar_id INTEGER, bar_name TEXT);
    drop table if exists public.schemas;
    create table public.schemas (id serial PRIMARY KEY, schema_name text NOT NULL);
    INSERT INTO public.schemas(schema_name)
    VALUES ('schema1'),('schema2'),('schema3');
    for r in select *
             from public.schemas
    loop
        raise notice '%', r;
        SELECT schema_name INTO l_schema_name
        FROM public.schemas
        WHERE id = l_id;
        raise notice '%', l_schema_name;
        EXECUTE 'set search_path TO ' || l_schema_name;
        EXECUTE 'INSERT into public.result(bar_id, bar_name) select bar_id, bar_name from ' || l_schema_name || '.bar';
        l_id = l_id + 1;
    end loop;
end; $$;
--DEBUG
select * from schema1.bar;
select * from schema2.bar;
select * from schema3.bar;
select * from public.result;
select * from public.schemas;
--CLEANUP
--DROP TABLE public.result;
--DROP TABLE public.schemas;

Set of values in pl/pgsql

I'm writing some PL/pgSQL code in Postgres and have run into this issue. Simplified, my code looks like this:
declare
    resulter mytype%rowtype;
...
for resulter in
    select id, [a lot of other fields]
    from mytable [joining a lot of other tables]
    where [some reasonable where clause]
loop
    if [certain condition on the resulter] then
        [add id to a set];
    end if;
    return next resulter;
end loop;

select into myvar sum([some field])
from anothertable
where id in ([my set from above])
The question is about the [add id to a set] part. In another scenario in the past, I dealt with it this way:
declare
    myset varchar := '';
...
loop
    if [condition] then
        myset := myset || ',' || id;
    end if;
    return next resulter;
end loop;

execute 'select sum([field]) from anothertable where id in (' || trim(leading ',' from myset) || ')' into myvar
However, this doesn't seem very efficient to me when the number of IDs to be added to this set is large. What other options do I have for keeping track of this set and then using it?
-- update --
Obviously, another option is to create a temporary table and insert ids into it when needed, then have the last select statement use a sub-select on that temporary table, like so:
create temporary table x (id integer);
loop
    if [condition] then
        insert into x values (id);
    end if;
    return next resulter;
end loop;

select into myvar sum([field]) from anothertable where id in (select id from x);
Any other options? And what would be the most efficient, considering that there may be many thousands of relevant IDs?
In my opinion, temp tables are the most efficient way to handle this:
create temp table x(id integer not null primary key) on commit drop;
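A sketch of how that wires into the loop from the question (placeholders kept from the question's pseudo-code; ON CONFLICT needs PostgreSQL 9.5+, otherwise test for existence before inserting):
-- inside the loop:
if [condition] then
    insert into x values (resulter.id) on conflict do nothing;
end if;
-- afterwards, one set-based aggregate instead of a comma-separated string:
select into myvar sum([some field])
from anothertable
where id in (select id from x);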