How do you insert part of an existing column record into a new table in PostgreSQL? - postgresql

I'm in the process of formatting a database and have found a column I'd like to format. It has 3 types of information for every record in the column. For example, a record in my history column shows as (American, born Estonia. 19011974). What I want to do is put this data into new individual columns to make them atomic. I want to extract data such as 'American' into a country column, 'Estonia' into a born column, '1901' into a born column and '1974' into a death column. UPDATE: However, some of the columns hold nulls; for example, another record in the same column might be (German, 19242004), so a normal regular expression wouldn't work for all data, would it? Any help is appreciated!
What PostgreSQL statements would I use to obtain specific parts of this data from the individual records? I understand it would be an insert, and I have already come up with:
INSERT INTO historian (id,url)
SELECT object_id, url FROM maintable;
That statement allowed me to get those values into new columns, but those were atomic already, so I could easily transition them. Thanks for any help! :)
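A regular expression can still cope with the missing parts if the "born ..." portion is made optional. Below is a minimal Python sketch of that idea, not a definitive solution. The function name `split_history` and the group names are hypothetical, and it assumes every value starts with the nationality, ends with two four-digit years, and optionally has a "born <place>." part in between. The same pattern idea should carry over to PostgreSQL's own regular expression functions, such as `regexp_match`.

```python
import re

# Assumed layout: "<country>, [born <place>. ]<birth year><death year>"
# Surrounding parentheses and a single separator between the years are tolerated.
HISTORY_PATTERN = re.compile(
    r"^\(?\s*(?P<country>[^,]+?)\s*,\s*"        # nationality up to the first comma
    r"(?:born\s+(?P<born_place>[^.]+?)\.\s*)?"  # optional "born <place>."
    r"(?P<born_year>\d{4})\D?(?P<death_year>\d{4})"  # two 4-digit years
    r"\s*\)?$"
)


def split_history(text):
    """Return a dict of the extracted parts, or None if the text does not fit."""
    match = HISTORY_PATTERN.match(text.strip())
    if match is None:
        return None
    # Groups that did not take part in the match (e.g. born_place) are None,
    # which maps naturally onto NULL in the target table.
    return match.groupdict()


if __name__ == "__main__":
    print(split_history("American, born Estonia. 19011974"))
    print(split_history("German, 19242004"))
```

Values that do not fit the assumed layout return None, so they can be listed and reviewed by hand instead of being inserted incorrectly.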

Related

Isolating same data in Alteryx workflow

I have a large set of data, however below shows only 4 entries. I want to isolate the rows that have the same entries. For instance, on table one you can see that the first two rows have the same value in the columns number, ID, Brand, and Partner. I want to only get the rows with these same entries, so my final result will be Data Table 2.
Data Table
Data Table 2
This is the quickest way I can think of
https://drive.google.com/file/d/1Ufbl6J-gwCi5OpqaGHmC93aHuA6g-gSi/view?usp=sharing
Actually, I've just realised the 'RecordID' is redundant so you can leave that tool out, I think?

How to quickly insert ~300GB/one billion record with relation into PostgreSQL database?

I have been working on this for months but still no solution, I hope I can get help from you ...
The task is, I need to insert/import real time data records from an online data provider into our database.
Real time data is provided in the form of files. Each file contains up to 200,000 JSON records, one record per line. Every day several to tens of files are provided online, starting from the year 2013.
I calculated the whole file store and got a total size of around 300GB. I estimated the whole number of records (I can get the file sizes via rest api but not line numbers of each file), it should be around one billion records or a little bit more.
Before I can import/insert one record into our database, I need to find out two parameters (station, parameter) from the record and make relationship.
So the workflow is something like:
Find parameter in our database, if the parameter exists in our db, just return parameter.id; otherwise the parameter should be inserted into parameter table as a new entry and a new parameter.id will be created and returned;
Find station in our database, similarly, if that station exists already, just take its id, otherwise a new station will be created and a new station.id will be created;
Then I can insert this json record into our main data table with its identified parameter.id and station.id to make relationship.
Basically it is a simple database structure with three main tables (data, parameter, station). They are connected by parameter.id and station.id with a primary key/foreign key relationship.
But the querying is very time consuming and I cannot find a way to properly insert this amount of data into the database in a foreseeable time.
I did two trials:
Just use normal SQL queries with bulk insert.
The workflow is described above.
For each record a) get parameter.id b) get station.id c) insert this record
Even with bulk insert, I could only insert one million records in a week. The records are not that short and contain about 20 fields.
After that, I tried following:
I don't check parameter and station in advance, but just use the COPY command to copy the records into an intermediate table with no relation. For this I calculated that all one billion records can be imported in around ten days.
But after the COPY, I need to manually find all distinct stations (there are only a few parameters, so I can ignore the parameter part) with a SQL query using select distinct or group by, create these stations in the station table, and then UPDATE all the records with their corresponding station.id, but this UPDATE operation takes very, very long.
Here is an example:
I spent one and a half days to import 33,000,000 records into the intermediate table.
I queried with select longitude, latitude from records group by longitude, latitude and got 4,500 stations
I inserted these 4,500 stations into the station table
And for each station I do
update record set stationid = station.id where longitude=station.longitude and latitude=station.latitude
The job is still running but I estimate it will take two days
And this is only for 30,000,000 records, I have one billion.
So my question is, how can I insert this amount of data into the database quickly?
Thank you very much in advance!
2020-08-20 10:01
Thank you all very much for the comments!
#Brits:
Yes and no, all the "COPY"s took over 24 hours. One "COPY" command for one JSON file. For each file I need to do the following:
1. Download the file
2. Flatten the JSON file (no relation check of station, but with parameter check, which is very easy and quick, since parameter is like a constant in the project) to a CSV-like text file for the "COPY" command
3. Execute the "COPY" command via Python
Steps 1, 2 and 3 together take around one minute for a JSON file containing ~200,000 records if I am in the company network, and around 20 minutes if I work from home.
So just ignore the "one and a half day" statement. My estimation is, if I work in the company network, I will be able to import all 300GB/One billion records into my intermediate table without relation check in ten days.
I think my problem now is not to move the json content into a flat database table, but to build the relationship between data - station.
After I have my JSON content in the flat table, I need to find out all stations and update all records with:
1. select longitude, latitude, count(*) from records group by longitude, latitude
2. insert these longitude/latitude combinations into the station table
3. for each entry in the station table, do
update records set stationid = station.id where longitude=station.longitude and latitude=station.latitude
Step 3 above takes very long (step 1 also takes several minutes for only 34 million records; I have no idea how long it will take for one billion records)
#Mike Organek #Laurenz Albe Thanks a lot. Your comments are still difficult for me to understand. I will study them and give feedback.
The total file count is 100,000+
I am thinking about parsing all the files to get the individual stations first, and then doing the "COPY"s with station.id and parameter.id known in advance. I will give feedback.
2020-09-04 09:14
I finally got what I wanted, even though I am not sure whether it is correct.
What I have done since my question:
1) Parse all the JSON files, extract the unique stations by coordinate and save them into the station table
2) Parse all the JSON files again and save all the fields into the record table, with the parameter id from constants and the station id from 1), with the COPY command
I ran 1) and 2) in a pipeline, in parallel. 1) took longer than 2), so 2) always had to wait.
After 10 days, I have my data in Postgres, with ca. 15,000 stations and 0.65 billion records in total; each record has the corresponding station id.
#steven-matison Thank you very much and could you please explain a little bit more?
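The two-pass approach from the last update could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code actually used: the JSON field names `longitude`, `latitude` and `value`, and both function names, are hypothetical, and station ids are assigned in memory instead of by the database.

```python
import csv
import io
import json


def build_station_ids(json_lines, station_ids=None):
    """Pass 1: give every distinct (longitude, latitude) pair an integer id."""
    station_ids = {} if station_ids is None else station_ids
    for line in json_lines:
        rec = json.loads(line)
        key = (rec["longitude"], rec["latitude"])
        if key not in station_ids:
            station_ids[key] = len(station_ids) + 1
    return station_ids


def flatten_for_copy(json_lines, station_ids, out):
    """Pass 2: write CSV rows that already carry the station id, ready for COPY."""
    writer = csv.writer(out, lineterminator="\n")
    for line in json_lines:
        rec = json.loads(line)
        # Dictionary lookup replaces the per-record station query or later UPDATE.
        station_id = station_ids[(rec["longitude"], rec["latitude"])]
        writer.writerow([station_id, rec["value"]])


if __name__ == "__main__":
    lines = [
        '{"longitude": 8.5, "latitude": 47.4, "value": 1.5}',
        '{"longitude": 9.0, "latitude": 48.1, "value": 2.5}',
        '{"longitude": 8.5, "latitude": 47.4, "value": 3.5}',
    ]
    ids = build_station_ids(lines)
    buffer = io.StringIO()
    flatten_for_copy(lines, ids, buffer)
    print(buffer.getvalue())
```

Because the station id is resolved by a dictionary lookup while the file is being flattened, the rows can be loaded with COPY directly into the final table and no UPDATE over the whole table is needed afterwards.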

How to extract information meeting a specific criterion from a table?

I have a table with 6 columns and 140,000 rows, and I can't figure out how to extract specific information from the table. For instance, when I try to extract all the accidents that happened on a specific date, either it tells me that the row '12/05/2015' does not exist, or it doesn't let me set 'Date' as a Row Name because the dates repeat (more than one accident happens in a day), giving me the error 'Duplicate row name: '01/01/2015''.
How can I pick a date and extract all of the data that corresponds to it?
P.S. Below you can see two photos, one of the table and one of the errors I get when trying to set date as a row to make everything clearer.
If I understand your problem correctly, you want to extract from the table the rows that contain Date1. If so, try this:
new_table = table(table(:,1)==Date1,:);

DAX Query to get a specific range of rows

How can I create a DAX query that retrieves rows from a given range, in that order? Let's say I want the rows from row 1000 to row 2000. There is no unique id in my database. Should I add one, or is it possible without it?
If you can't distinguish a filter to create the subset of rows you are targeting then I would use a unique ID. I have not come across anything in DAX that allows to select rows in your powerpivot data set. If there isn't anything unique about the data you are targeting then I imagine you would need a unique ID.
i.e. I normally have column values I can filter with to target or create the subset of data I want to use.
I hope I am wrong and there is a way and look forward to someone posting a way.

Insert multiple records into fact table based on fields in single record

I'm working in Pentaho 4.4.1-GA (Kettle / PDI). The database is Postgres.
I need to be able to insert multiple records into a fact table based on the fields that come from a single record. The single record contains fields:
productcode1, price1
productcode2, price2
productcode3, price3
...
productcode10,price10
So if there was a value for each of the 10 productcode / prices then I'd need to insert a total of 10 records into the fact table. If there were values for 4 of the combinations, then I'd need to insert 4 records into the fact table, etcetera. All field values for the fact records would be identical except for the PK (generated by sequence), product codes, and prices.
I figure that I need some type of looping construct which would let me check whether or not a value was present for each productx field, and if so, do an insert/update step on the fact table with the desired field values. I'm just not sure how to do this in Pentaho.
Any ideas? All suggestions are welcome :)
Thank You,
Rakesh
Could you give a sample input and output for your scenario??
From your example data I can infer that if there are 10 different product codes and only 4 product prices you want to have 4 records inserted into your table. Is that so?
Well, for a start you can add a constant value of 1 to those records by filtering for NOT NULL and then use a Group By step to count the number of 1's. This would give you the count. BTW, it would be helpful if you could provide more details on what columns you would be loading, as there are ways to make a PDI transformation execute multiple times.
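Independent of how it is wired up in PDI, the row-splitting logic the question asks for can be sketched in plain Python. This is only an illustration of the intended result, not a Pentaho transformation. The function name `unpivot_products` and the `order_id` field are hypothetical, and a pair is assumed to count as present only when both the product code and the price are non-null.

```python
def unpivot_products(record, max_pairs=10):
    """Turn one wide record into one row per populated productcode/price pair."""
    # Fields shared by all generated fact rows (everything except the numbered pairs).
    shared = {
        key: value
        for key, value in record.items()
        if not (key.startswith("productcode") or key.startswith("price"))
    }
    rows = []
    for i in range(1, max_pairs + 1):
        code = record.get(f"productcode{i}")
        price = record.get(f"price{i}")
        if code is None or price is None:
            continue  # skip empty pairs, so 4 populated pairs yield 4 rows
        rows.append({**shared, "productcode": code, "price": price})
    return rows


if __name__ == "__main__":
    wide = {
        "order_id": 7,  # hypothetical shared field
        "productcode1": "A", "price1": 1.0,
        "productcode2": None, "price2": None,
        "productcode3": "C", "price3": 3.0,
    }
    for row in unpivot_products(wide):
        print(row)
```

Each returned row would then go through the usual insert/update step, with the PK coming from the sequence as described in the question.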