Batch update table with 2 million rows - tsql

Hi all, I've got an interesting task: updating a single column in a table that has roughly 2 million rows. I've tried doing this using MVC Entity Framework, but I'm encountering "out of memory" exceptions, and I'm wondering if there's another way.
The interesting part is that it's not just a simple update. The procedure needs to read the TelephoneNumber column already in the table, which could contain 014812001, for example. Then it needs to calculate a score for this number based on the digits that occur more than once. Using the above number, this would score a 6, as we have 3 x 1's and 3 x 0's, giving a total of 6.
Once this score has been calculated, it needs to be written into a column of the row currently being processed, so in our case the row with TelephoneNumber = 014812001.
Is this possible using T-SQL, or is it better to carry on with my Entity Framework approach?

For such a bulk update, I would always recommend doing this on the server itself - there's really no point in dragging down 2 million rows, updating a single column, and then pushing them all back to the server again.
Based on your description, it should be fairly simple to create a little T-SQL user-defined function that calculates this score. Once you have that, you can issue a single T-SQL statement:
UPDATE dbo.YourTable
SET Score = dbo.fnCalculateScore(TelephoneNumber)
WHERE .... (whatever condition you might have) .....
That should be several orders of magnitude faster than your Entity Framework approach.
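For example, a minimal sketch of such a function, assuming TelephoneNumber holds digits only (adjust the parameter type and length to your schema):
CREATE FUNCTION dbo.fnCalculateScore (@TelephoneNumber VARCHAR(20))
RETURNS INT
AS
BEGIN
    DECLARE @Score INT = 0;
    DECLARE @Occurrences INT;
    DECLARE @Digit CHAR(1);
    DECLARE @i INT = 0;

    -- For each digit 0-9, count how often it appears; any digit that appears
    -- more than once contributes its full count to the score.
    WHILE @i <= 9
    BEGIN
        SET @Digit = CAST(@i AS CHAR(1));
        SET @Occurrences = LEN(@TelephoneNumber) - LEN(REPLACE(@TelephoneNumber, @Digit, ''));
        IF @Occurrences > 1
            SET @Score = @Score + @Occurrences;
        SET @i = @i + 1;
    END

    RETURN @Score;
END

-- Quick check against the example from the question:
-- SELECT dbo.fnCalculateScore('014812001');  -- returns 6 (three 0's + three 1's)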

How to quickly insert ~300GB / one billion records with relations into a PostgreSQL database?

I have been working on this for months but still have no solution; I hope I can get some help from you...
The task is that I need to insert/import real-time data records from an online data provider into our database.
The real-time data is provided as files. Each file contains up to 200,000 JSON records, one record per line. Every day several to tens of files are provided online, going back to 2013.
I calculated the whole file store and got a total size of around 300GB. I estimated the total number of records (I can get the file sizes via the REST API but not the line count of each file); it should be around one billion records or a little more.
Before I can import/insert a record into our database, I need to look up two values (station, parameter) from the record and establish the relationships.
So the workflow is something like:
Find the parameter in our database; if it already exists, just return parameter.id, otherwise insert the parameter into the parameter table as a new entry so that a new parameter.id is created and returned;
Find the station in our database similarly: if that station already exists, just take its id, otherwise create a new station and a new station.id;
Then I can insert this JSON record into our main data table with the identified parameter.id and station.id to establish the relationships.
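For illustration, this lookup-or-insert can be done in a single round trip in PostgreSQL, assuming a unique constraint on (longitude, latitude); the table and column names below are placeholders for the real schema:
INSERT INTO station (longitude, latitude)
VALUES (13.40, 52.52)
ON CONFLICT (longitude, latitude) DO UPDATE
    SET longitude = EXCLUDED.longitude   -- no-op update so RETURNING yields the id even on conflict
RETURNING id;
The same pattern works for the parameter lookup.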
Basically it is a simple database structure with three main tables (data, parameter, station). They are connected by parameter.id and station.id with primary key/foreign key relationships.
But the querying is very time-consuming, and I cannot find a way to insert this amount of data into the database within a foreseeable time.
I did two trials:
Just use normal SQL queries with bulk insert.
The workflow is described above.
For each record a) get parameter.id b) get station.id c) insert this record
Even with bulk insert, I could only insert one million records in a week. The records are not that short and contain about 20 fields.
After that, I tried the following:
I don't check parameter and station in advance, but just use the COPY command to copy the records into an intermediate table with no relations. By my calculation, all one billion records can be imported this way in around ten days.
But after the COPY, I need to manually find all distinct stations (there are only a few parameters, so I can ignore the parameter part) with a select distinct or group by query, create these stations in the station table, and then UPDATE all the records with their corresponding station.id - and this UPDATE operation takes very, very long.
Here is an example:
I spent one and a half days to import 33,000,000 records into the intermediate table.
I queried with select longitude, latitude from records group by longitude, latitude and got 4,500 stations
I inserted these 4,500 stations into the station table
And for each station I do
update record set stationid = station.id where longitude=station.longitude and latitude=station.latitude
The job is still running, but I estimate it will take two days.
And this is only for 30,000,000 records - I have one billion.
So my question is, how can I insert this amount of data into the database quickly?
Thank you very much in advance!
2020-08-20 10:01
Thank you all very much for the comments!
@Brits:
Yes and no: all the COPYs together took over 24 hours, one COPY command per JSON file. For each file I need to do the following:
1. Download the file
2. Flatten the JSON file into a CSV-like text file for the COPY command (no relation check for station, but with the parameter check - that is very easy and quick, the parameters are effectively constants in this project)
3. Execute the COPY command via Python
Steps 1, 2 and 3 together take around one minute per JSON file containing ~200,000 records if I am on the company network, and around 20 minutes if I work from home.
So just ignore the "one and a half days" statement. My estimate is that, working on the company network, I will be able to import all 300GB / one billion records into my intermediate table, without relation checks, in ten days.
I think my problem now is not moving the JSON content into a flat database table, but building the relationship between data and station.
After I have my JSON content in the flat table, I need to find all stations and update all records as follows:
1. select longitude, latitude, count(*) from records group by longitude, latitude
2. insert these longitude/latitude combinations into the station table
3. for each entry in the station table, do
update records set stationid = station.id where longitude=station.longitude and latitude=station.latitude
Step 3 above takes very long (and step 1 already takes several minutes for just 34 million records; I have no idea how long step 1 will take for one billion records).
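For what it's worth, the per-station loop in step 3 can usually be collapsed into a single set-based statement (a sketch, using the table and column names from above):
UPDATE records r
SET stationid = s.id
FROM station s
WHERE r.longitude = s.longitude
  AND r.latitude = s.latitude;
With one pass like this, Postgres can join the staging table against station once instead of rescanning it for every station.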
@Mike Organek @Laurenz Albe Thanks a lot. Even your comments are difficult for me to understand right now. I will study them and give feedback.
The total file count is 100,000+
I am thinking about parsing all the files to get the individual stations first, and then doing the COPYs with station.id and parameter.id already in place. I will give feedback.
2020-09-04 09:14
I finally got what I want, even if I am not sure whether it is correct.
What I have done since my question:
1) Parse all the JSON files, extract the unique stations by coordinates, and save them into the station table
2) Parse all the JSON files again and save all the fields into the record table, with the parameter id taken from constants and the station id from step 1), using the COPY command
I did 1) and 2) in a pipeline, running in parallel. Step 1) took longer than 2), so I always needed to make 2) wait.
And after 10 days, I have my data in Postgres, with ca. 15,000 stations and 0.65 billion records in total, each record having its corresponding station id.
@steven-matison Thank you very much - could you please explain a little bit more?

Simple update query taking too long - Postgres

I have a table with 28 million rows that I want to update. It has around 60 columns and an ID column (primary key) with an index on it. I created four new columns and want to populate them with data from four columns of another table, which also has an ID column with an index on it. Both tables have the same number of rows and only the primary key and the index on the IDENTI column. The query has been running for 15 hours, and since this is high-priority work, we are starting to get nervous about it and don't have much time to experiment with queries. We have never updated a table this big (7 GB), so we are not sure whether this amount of time is normal.
This is the query:
UPDATE consolidated
SET IDEDUP2=uni.IDEDUP2,
    USE21=uni.USE21,
    USE22=uni.USE22,
    PESOXX2=uni.PESOXX2
FROM uni_group uni, consolidated con
WHERE con.IDENTI=uni.IDENTI
How can I make it faster? Is it possible? If not, is there a way to check how much longer it is going to take (without killing the process)?
Just as additional information: we have run much more complex queries before on 3-million-row tables (PostGIS), and they have taken about 15 hours as well.
You should not repeat the target table in the FROM clause. Your statement creates a cartesian join of the consolidated table with itself, which is not what you want.
You should use the following:
UPDATE consolidated con
SET IDEDUP2=uni.IDEDUP2,
    USE21=uni.USE21,
    USE22=uni.USE22,
    PESOXX2=uni.PESOXX2
FROM uni_group uni
WHERE con.IDENTI = uni.IDENTI
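Before letting this run against 28 million rows again, it may be worth a quick sanity check of the plan; EXPLAIN (without ANALYZE) only plans the statement and does not execute the update:
EXPLAIN
UPDATE consolidated con
SET IDEDUP2=uni.IDEDUP2,
    USE21=uni.USE21,
    USE22=uni.USE22,
    PESOXX2=uni.PESOXX2
FROM uni_group uni
WHERE con.IDENTI = uni.IDENTI;
With the second reference to consolidated removed and an index on IDENTI on both sides, the plan should show a single join (typically a hash or merge join) rather than a nested loop over a cross product.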

Long-running queries and new data

I'm looking at a Postgres system with tables containing tens or hundreds of millions of rows, fed at a rate of a few rows per second.
I need to do some processing on the rows of these tables, so I plan to run some simple select queries: select * with a where clause based on a range (each row contains a timestamp; that's what I'll use for the ranges). It may be a "closed range", with a start and an end that I know are contained in the table and into which I know no new data will fall, or an "open range", i.e. one of the range boundaries might not be "in the table yet", so rows still being fed into the table might fall within that range.
Since the result will itself contain millions of rows, and the processing per row can take some time (tens of ms), I'm fully aware I'll need to use a cursor and fetch, say, a few thousand rows at a time. My question is:
If I run an "open range" query, will I only get the result as it was when I started the query, or will new rows that fall in the range and are inserted into the table while I'm fetching show up?
(I tend to think I won't see new rows, but I'd like confirmation...)
updated
It should not happen under any isolation level:
https://www.postgresql.org/docs/current/static/transaction-iso.html
but Postgres ensures it only at the Serializable isolation level
Well, I think when you run a query you create a new transaction, and it will not see data from any other transaction until it commits.
So, basically, "you only get the result as it was when you started the query".
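To make the snapshot behaviour explicit, one option (a sketch; the events table and its ts column are just placeholders) is to open the cursor inside a single transaction and keep fetching from it:
BEGIN ISOLATION LEVEL REPEATABLE READ;
DECLARE batch_cur CURSOR FOR
    SELECT * FROM events
    WHERE ts >= '2020-01-01';        -- "open range": no upper bound yet
FETCH FORWARD 1000 FROM batch_cur;   -- repeat until no rows come back
-- rows committed by other sessions after the snapshot was taken will not appear
CLOSE batch_cur;
COMMIT;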

Insert multiple records into fact table based on fields in single record

I'm working in Pentaho 4.4.1-GA (Kettle / PDI). The database is Postgres.
I need to be able to insert multiple records into a fact table based on the fields that come from a single record. The single record contains fields:
productcode1, price1
productcode2, price2
productcode3, price3
...
productcode10,price10
So if there is a value for each of the 10 productcode/price pairs, I'd need to insert a total of 10 records into the fact table. If there are values for 4 of the combinations, I'd need to insert 4 records into the fact table, etcetera. All field values for the fact records would be identical except for the PK (generated by a sequence), the product codes, and the prices.
I figure that I need some type of looping construct which would let me check whether or not a value was present for each productx field, and if so, do an insert/update step on the fact table with the desired field values. I'm just not sure how to do this in Pentaho.
Any ideas? All suggestions are welcome :)
Thank You,
Rakesh
Could you give a sample input and output for your scenario??
From your example data I can infer that if there are 10 different product codes and only 4 product prices you want to have 4 records inserted into your table. Is that so?
Well, for a start you can add a constant value of 1 to those records by filtering for NOT NULL and then use a Group By step to count the number of 1's. That would give you the count. BTW, it would be helpful if you could provide more details on which columns you would be loading, as there are ways to make a PDI transformation execute multiple times.
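For reference, the underlying reshaping (one fact row per non-empty productcode/price pair) can also be expressed directly on the Postgres side; this is just a sketch with illustrative table and column names, not the PDI solution itself:
INSERT INTO fact_table (product_code, price)
SELECT t.productcode, t.price
FROM source_table s
CROSS JOIN LATERAL (VALUES
    (s.productcode1, s.price1),
    (s.productcode2, s.price2),
    (s.productcode3, s.price3)
    -- ... continue up to (s.productcode10, s.price10)
) AS t(productcode, price)
WHERE t.productcode IS NOT NULL;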

Tableau Future and Current References

Tough problem I am working on here.
I have a table of CustomerIDs and CallDates. I want to measure whether there is a 'repeat call' within a certain period of time (up to 30 days).
I plan on creating a parameter called RepeatTime which is a range from 0 - 30 days, so the user can slide a scale to see the number/percentage of total repeats.
In Excel, I have this working. I sort CustomerID in order and then sort CallDate from earliest to latest. I then have formulas like:
=IF(AND(CurrentCustomerID = FutureCustomerID, FutureCallDate - CurrentCallDate <= RepeatTime), 1,0)
CurrentCustomerID = the current row, and FutureCustomerID = the following row (so it is checking whether the customer ID is the same).
FutureCallDate = the following row and CurrentCallDate = the current row. It subtracts the current call time from the future call time to measure the time in between.
The goal is to be able to see, dynamically, how many customers called in for a specific reason within maybe 4 hours or 1 day or 5 days, etc., all the way up to 30 days (30 days is our actual metric, but it is good to see the calls which are repeats within a shorter time frame so we can investigate).
I had a similar problem; see here for a detailed version: Array calculation in Tableau, maxif routine.
Your case is basically the same as mine, so you could apply that solution, but I find the one I'm about to give easier to understand. I would do:
1) Create a calculated field called RepeatTime:
DATEDIFF('day', LOOKUP(MAX(CallDates), -1), MAX(CallDates))
This will calculate how many days have passed since the previous call to the current one. You can wrap it in IFNULL so the first entry doesn't come out as Null.
2) Drag CustomersID, CallDates and RepeatTime to the worksheet (they can be on the Marks card; they don't need to be on Rows or Columns).
3) Configure the table calculation of RepeatTime: Compute Using > Advanced..., partitioning by CustomersID, addressing CallDates.
Also sort by the field CallDates, Maximum, Ascending.
This will guarantee the table calculation works properly.
4) Now you have a base that you can use for what you need. You can either export it to csv or mdb and connect to it.
The best approach, actually, is to have this RepeatTime field calculated outside Tableau, on your database, so it's already there when you connect to it. But this is a way to use Tableau to do the calculation for you.
Unfortunately there's no way to do this directly with your database.
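If the underlying database does support window functions, the database-side calculation suggested above could look something like the following sketch; the calls table and its column names are assumptions, and the date subtraction yields whole days in, for example, PostgreSQL:
SELECT CustomerID,
       CallDate,
       CallDate - LAG(CallDate) OVER (PARTITION BY CustomerID
                                      ORDER BY CallDate) AS RepeatTime
FROM calls;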