How to quickly insert ~300GB/one billion records with relations into a PostgreSQL database? - postgresql

I have been working on this for months without finding a solution, and I hope I can get some help here.
The task: I need to import real-time data records from an online data provider into our database.
The real-time data is provided as files. Each file contains up to 200,000 JSON records, one record per line. Several to tens of files are published online every day, starting from the year 2013.
I calculated the size of the whole file store and got a total of around 300GB. I can only estimate the total number of records (the REST API gives me file sizes but not line counts); it should be around one billion records or a little more.
Before I can insert a record into our database, I need to extract two values from it (station, parameter) and link the record to them.
So the workflow is something like:
1. Find the parameter in our database. If it exists, just return parameter.id; otherwise insert it into the parameter table as a new entry and return the newly created parameter.id.
2. Find the station in our database. Similarly, if the station already exists, take its id; otherwise create a new station and use the new station.id.
3. Insert the JSON record into our main data table with the identified parameter.id and station.id to establish the relationship.
Basically it is a simple database structure with three main tables (data, parameter, station). They are connected through parameter.id and station.id with primary key/foreign key relationships.
But the lookups are very time-consuming, and I cannot find a way to insert this amount of data into the database in a reasonable time.
I made two attempts:
1. Normal SQL queries with bulk insert.
The workflow is the one described above.
For each record: a) get parameter.id, b) get station.id, c) insert the record.
Even with bulk insert, I could only insert one million records in a week. The records are not that short; each contains about 20 fields.
2. After that, I tried the following:
I don't check parameter and station in advance, but just use the COPY command to copy the records into an intermediate table with no relations. By my calculation, all one billion records can be imported this way in around ten days.
But after the COPY, I need to manually find all distinct stations (there are only a few parameters, so I can ignore the parameter part) with a select distinct or group by query, create these stations in the station table, and then UPDATE all records with their corresponding station.id. This UPDATE operation takes a very long time.
Here is an example:
1. I spent one and a half days importing 33,000,000 records into the intermediate table.
2. I queried with select longitude, latitude from records group by longitude, latitude and got 4,500 stations.
3. I inserted these 4,500 stations into the station table.
4. For each station I run
update records set stationid = station.id where longitude=station.longitude and latitude=station.latitude
The job is still running, but I estimate it will take two days.
And this is only for 30,000,000 records; I have one billion.
So my question is, how can I insert this amount of data into the database quickly?
Thank you very much in advance!
2020-08-20 10:01
Thank you all very much for the comments!
@Brits:
Yes and no, all the COPYs together took over 24 hours. There is one COPY command per JSON file. For each file I need to do the following:
1. Download the file
2. Flatten the JSON file (no relation check for station, but with a parameter check, which is easy and quick because the parameters are practically constants in this project) into a CSV-like text file for the COPY command
3. Execute the COPY command via Python
Steps 1, 2 and 3 together take around one minute for a JSON file containing ~200,000 records if I am in the company network, and around 20 minutes if I work from home.
So just ignore the "one and a half days" statement. My estimate is that, working in the company network, I will be able to import all 300GB/one billion records into my intermediate table without relation checks in ten days.
I think my problem now is not moving the JSON content into a flat database table, but building the relationship between data and station.
Once I have my JSON content in the flat table, I need to find all stations and update all records:
1. select longitude, latitude, count(*) from records group by longitude, latitude
2. insert these longitude/latitude combinations into the station table
3. for each entry in the station table, do
update records set stationid = station.id where longitude=station.longitude and latitude=station.latitude
Step 3 above takes very long (and step 1 alone takes several minutes for only 34 million records; I have no idea how long it will take for one billion records).
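As a rough sketch of the alternative to looping over stations, steps 1 to 3 above can be expressed as two set-based statements that handle all stations at once. The example below uses Python's built-in sqlite3 module with an in-memory database purely as a stand-in so that it runs anywhere; the sample rows and the value column are made up. In PostgreSQL the update would more typically be written in the joined form, update records r set stationid = s.id from station s where r.longitude = s.longitude and r.latitude = s.latitude, and whether that is fast enough for one billion rows is not something this sketch can show.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; table and column names follow
# the post (records, station, longitude, latitude, stationid).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE records (id INTEGER PRIMARY KEY, longitude REAL, latitude REAL,
                          value REAL, stationid INTEGER);
    CREATE TABLE station (id INTEGER PRIMARY KEY, longitude REAL, latitude REAL,
                          UNIQUE (longitude, latitude));
""")
# Made-up sample rows: two records at one coordinate pair, one at another.
conn.executemany(
    "INSERT INTO records (longitude, latitude, value) VALUES (?, ?, ?)",
    [(13.4, 52.5, 1.0), (13.4, 52.5, 2.0), (8.7, 50.1, 3.0)],
)

# Steps 1 and 2 in one statement: each distinct coordinate pair becomes a station.
conn.execute("""
    INSERT INTO station (longitude, latitude)
    SELECT DISTINCT longitude, latitude FROM records
""")

# Step 3 as a single statement for all stations instead of one UPDATE per station.
# The UNIQUE constraint above gives the lookup an index on (longitude, latitude).
conn.execute("""
    UPDATE records
    SET stationid = (SELECT s.id FROM station s
                     WHERE s.longitude = records.longitude
                       AND s.latitude = records.latitude)
""")
conn.commit()

rows = conn.execute("SELECT id, stationid FROM records ORDER BY id").fetchall()
station_count = conn.execute("SELECT COUNT(*) FROM station").fetchone()[0]
```

After running it, rows holds one (id, stationid) pair per record and station_count is the number of distinct coordinate pairs.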
@Mike Organek @Laurenz Albe Thanks a lot. Your comments are still difficult for me to understand at the moment. I will study them and report back.
The total file count is 100,000+.
I am thinking about parsing all the files first to collect the individual stations, and then doing the COPYs with station.id and parameter.id already known. I will report back.
2020-09-04 09:14
I finally got what I wanted, even though I am not sure whether it is the correct way.
What I have done since my question:
1) Parse all the JSON files, extract the unique stations by coordinate, and save them into the station table
2) Parse all the JSON files again and save all the fields into the record table with the COPY command, taking the parameter id from constants and the station id from 1)
I ran 1) and 2) as a pipeline, in parallel. 1) took longer than 2), so 2) always had to wait.
After 10 days, I have my data in Postgres, with ca. 15,000 stations and 0.65 billion records in total, each record with its corresponding station id.
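The two passes described in 1) and 2) can be sketched roughly as follows. This is not the actual script; the JSON field names (longitude, latitude, parameter, time, value), the parameter constants and the sample lines are all assumptions made for illustration, and the real records have about 20 fields.

```python
import csv
import io
import json

def station_key(record):
    """Identify a station by its coordinate pair, as described in the post."""
    return (record["longitude"], record["latitude"])

def collect_stations(lines, stations):
    """Pass 1: give every coordinate pair not seen before the next station id."""
    for line in lines:
        key = station_key(json.loads(line))
        if key not in stations:
            stations[key] = len(stations) + 1
    return stations

def flatten_for_copy(lines, stations, parameter_ids, out):
    """Pass 2: write one CSV row per record with both ids already resolved."""
    writer = csv.writer(out)
    for line in lines:
        rec = json.loads(line)
        writer.writerow([
            stations[station_key(rec)],
            parameter_ids[rec["parameter"]],
            rec["time"],
            rec["value"],
        ])

# Hypothetical sample lines, one JSON record per line.
sample = [
    '{"longitude": 13.4, "latitude": 52.5, "parameter": "pm10", "time": "2013-01-01T00:00:00Z", "value": 21.0}',
    '{"longitude": 13.4, "latitude": 52.5, "parameter": "no2", "time": "2013-01-01T00:00:00Z", "value": 14.5}',
    '{"longitude": 8.7, "latitude": 50.1, "parameter": "pm10", "time": "2013-01-01T00:00:00Z", "value": 9.0}',
]
stations = collect_stations(sample, {})
buffer = io.StringIO()
flatten_for_copy(sample, stations, {"pm10": 1, "no2": 2}, buffer)
csv_text = buffer.getvalue()
```

Because the station ids are resolved in memory before the load, the resulting CSV text can go straight to COPY and no UPDATE over the loaded table is needed afterwards.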
@steven-matison Thank you very much, could you please explain a little bit more?

Related

What is the best way to move millions of data from one postgres database to another?

We have a task at the moment where I need to move millions of records from one database to another.
To complicate things slightly, I need to change an id on each record before inserting the data.
Here is the setup: we have 100 stations in database a.
Each station contains 30+ sensors.
Each sensor contains readings for about the last 10 years.
These readings are at anywhere from 15-minute to daily intervals.
So each station can have at least 5m records.
Database b has the same structure as database a.
The reading table contains the following fields:
id: primary key
sensor_id: int4
value: numeric(12)
time: timestamp
What I have done so far for one station:
1. Connect to database a and select all readings for station 1
2. Find all corresponding sensors in database b
3. Change the sensor_id from database a to its new sensor_id from database b
4. Chunk the updated sensor_id data into groups of about 5000 parameters
5. Loop over the chunks and do a mass insert
In theory, this should work.
However, I am getting errors saying duplicate key violates unique constraint.
If I query the database for the records that are failing, the data doesn't exist.
The weird thing is that if I run the script 4 or 5 times in a row, all the data eventually gets in there. So I am at a loss as to why I am receiving this error, because it doesn't seem accurate.
Is there a way to prevent this error from happening?
Is there a more efficient way of doing this?
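The remap-and-chunk part of the steps above (changing sensor_id, then splitting into groups) might look roughly like this in plain Python. The id_map contents and the sample readings are invented, and the actual database reads and inserts are left out.

```python
from itertools import islice

def remap_sensor_ids(readings, id_map):
    """Yield readings with database a's sensor_id replaced by database b's."""
    for sensor_id, value, time in readings:
        yield (id_map[sensor_id], value, time)

def chunked(rows, size):
    """Yield lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# Hypothetical mapping from sensor ids in database a to sensor ids in database b.
id_map = {101: 7, 102: 8}
# Hypothetical readings as (sensor_id, value, time); the old primary key is left out.
readings = [
    (101, "1.5", "2020-01-01 00:00"),
    (102, "2.5", "2020-01-01 00:15"),
    (101, "1.7", "2020-01-01 00:15"),
]
batches = list(chunked(remap_sensor_ids(readings, id_map), 2))
```

Each element of batches is then one mass insert. The sketch deliberately does not carry over the id column from database a, so that database b would assign its own keys; whether the copied ids are behind the duplicate key error cannot be told from the post.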

Simple update query taking too long - Postgres

I have a table with 28 million rows that I want to update. It has around 60 columns and an ID column (primary key) with an index created on it. I created four new columns and I want to populate them with the data from four columns of another table, which also has an ID column with an index created on it. Both tables have the same number of rows and just the primary key and the index on the IDENTI column. The query has been running for 15 hours, and since this is high-priority work, we are starting to get nervous about it and don't have much time to experiment with queries. We have never updated a table this big (7 GB), so we are not sure whether this amount of time is normal.
This is the query:
UPDATE consolidated
SET IDEDUP2=uni.IDEDUP2,
    USE21=uni.USE21,
    USE22=uni.USE22,
    PESOXX2=uni.PESOXX2
FROM uni_group uni, consolidated con
WHERE con.IDENTI=uni.IDENTI
How can I make it faster? Is it possible? If not, is there a way to check how much longer it is going to take (without killing the process)?
Just as additional information, we have previously run much more complex queries on 3-million-row tables (PostGIS), and they took about 15 hours as well.
You should not repeat the target table in the FROM clause. Your statement creates a cartesian join of the consolidated table with itself, which is not what you want.
You should use the following:
UPDATE consolidated con
SET IDEDUP2=uni.IDEDUP2,
    USE21=uni.USE21,
    USE22=uni.USE22,
    PESOXX2=uni.PESOXX2
FROM uni_group uni
WHERE con.IDENTI = uni.IDENTI;

long running queries and new data

I'm looking at a Postgres system with tables containing tens or hundreds of millions of rows, being fed at a rate of a few rows per second.
I need to do some processing on the rows of these tables, so I plan to run some simple select queries: select * with a where clause based on a range (each row contains a timestamp, which is what I'll use for the ranges). It may be a "closed range", with a start and an end that I know are contained in the table and into which no new data will fall, or an "open range", i.e. one of the range boundaries might not be "in the table yet", so rows being fed into the table might fall into that range.
Since the response will itself contain millions of rows, and the processing per row can take some time (tens of ms), I know I'll need to use a cursor and fetch, say, a few thousand rows at a time. My question is:
If I run an "open range" query, will I only get the result as it was when I started the query, or will new rows that are inserted into the table and fall in the range while I run my fetches show up?
(I tend to think that I won't see new rows, but I'd like confirmation...)
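Update:
It should not happen under any isolation level, because a single query (including one read through a cursor) sees a snapshot taken when it starts:
https://www.postgresql.org/docs/current/static/transaction-iso.html
If you re-run the query inside the same transaction, though, Postgres guarantees the same result only under Repeatable Read or Serializable isolation; under Read Committed each new statement sees rows committed in the meantime.
Well, I think that when you run a query, you create a new transaction, and it will not see data changes from any other transaction until it commits.
So, basically, "you only get the result as it was when you started the query".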

Insert multiple records into fact table based on fields in single record

I'm working in Pentaho 4.4.1-GA (Kettle / PDI). The database is Postgres.
I need to be able to insert multiple records into a fact table based on the fields that come from a single record. The single record contains fields:
productcode1, price1
productcode2, price2
productcode3, price3
...
productcode10,price10
So if there was a value for each of the 10 productcode / prices then I'd need to insert a total of 10 records into the fact table. If there were values for 4 of the combinations, then I'd need to insert 4 records into the fact table, etcetera. All field values for the fact records would be identical except for the PK (generated by sequence), product codes, and prices.
I figure that I need some type of looping construct which would let me check whether or not a value was present for each productx field, and if so, do an insert/update step on the fact table with the desired field values. I'm just not sure how to do this in Pentaho.
Any ideas? All suggestions are welcome :)
Thank You,
Rakesh
Could you give a sample input and output for your scenario?
From your example data I infer that if there are 10 different product codes but only 4 product prices, you want 4 records inserted into your table. Is that so?
For a start, you could add a constant value of 1 to the records that pass a NOT NULL filter and then use a Group By step to count the 1s. This would give you the count. BTW, it would be helpful if you could provide more details on which columns you would be loading, as there are ways to make a PDI transformation execute multiple times.
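To make the intended result concrete, the expansion described in the question (one fact row per productcode/price pair that has values) can be sketched in plain Python. This is not a Pentaho transformation, only an illustration of the logic; the orderid field and the sample values are invented.

```python
def expand_products(record, max_products=10):
    """Return one fact row per productcodeN/priceN pair where both values are present."""
    # Fields other than the numbered product columns are copied to every output row.
    # Assumption: no shared field name starts with "productcode" or "price".
    shared = {k: v for k, v in record.items()
              if not k.startswith(("productcode", "price"))}
    rows = []
    for n in range(1, max_products + 1):
        code = record.get(f"productcode{n}")
        price = record.get(f"price{n}")
        if code is None or price is None:
            continue  # skip slots without a complete code/price pair
        rows.append({**shared, "productcode": code, "price": price})
    return rows

# Hypothetical input record with three slots, one of them incomplete.
record = {
    "orderid": 42,
    "productcode1": "A", "price1": 9.99,
    "productcode2": "B", "price2": None,
    "productcode3": "C", "price3": 1.5,
}
fact_rows = expand_products(record)
```

Here fact_rows contains two rows, for products A and C, each carrying the shared orderid; the primary key would still come from the sequence at insert time.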

Executing query in chunks on Greenplum

I am trying to create a way to convert bulk date queries into incremental queries. For example, if a query has a where condition specified as
WHERE date > now()::date - interval '365 days' and date < now()::date
this will fetch a year's data if executed today. If the same query is executed tomorrow, 365 days of data will be fetched again. However, I already have the last 364 days of data from the previous run. I just want a single day's data to be fetched and a single day's data to be deleted from the system, so that I end up with 365 days of data with better performance. This data is to be stored in a separate temp table.
To achieve this, I create an incremental query, which will be executed in the next run. However, deleting a single date's data is proving tricky when the "date" column does not appear in the SELECT clause but only in the WHERE condition, because the temp table schema will then not have the "date" column.
So I thought of executing the bulk query in chunks and assigning an ID to each chunk. This way, I can delete a chunk and add a chunk while the other data remains unaffected.
Is there a way to achieve this in Postgres or Greenplum, e.g. some built-in functionality? I went through the whole documentation but could not find any.
If not, is there a better solution to this problem?
I think this is best handled with something like an aggregates table (I assume the issue is that you have heavy aggregates to compute over a lot of data). This doesn't necessarily cause normalization problems (and data warehouses often denormalize anyway). The aggregates you need can be stored per day, so you can cut down to one record per day of closed data, plus the non-closed data. Restricting the aggregates to data which cannot change is what is required to avoid the usual insert/update anomalies that normalization prevents.
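A minimal sketch of the per-day aggregate idea, in plain Python rather than in the database: one stored value per closed day, with the newest day added and expired days dropped on each run. The function name, the use of a sum as the aggregate and the three-day window in the example are assumptions for illustration only.

```python
from datetime import date, timedelta

def roll_window(daily, new_day, new_value, window_days=365):
    """Add one closed day's aggregate, drop expired days, return the window total."""
    daily[new_day] = new_value
    # Keep only the most recent `window_days` days, counting new_day itself.
    cutoff = new_day - timedelta(days=window_days)
    for day in [d for d in daily if d <= cutoff]:
        del daily[day]
    return sum(daily.values())

# Hypothetical per-day aggregates for three closed days.
daily = {
    date(2020, 1, 1): 10,
    date(2020, 1, 2): 20,
    date(2020, 1, 3): 30,
}
total = roll_window(daily, date(2020, 1, 4), 40, window_days=3)
```

With a three-day window, adding 4 January drops 1 January, so only one day is computed and one day is removed per run, which is the incremental behaviour the question asks for.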