Delphi - Duplicating Data between data sources - rest

Delphi Seattle, Win10. I need to write a generic routine to refresh a set of tables from a single source to a single destination. The tables on both ends already exist, and the process is a complete refresh, i.e. empty the destination table and then copy all rows. The tables are identical: same columns, data types, etc. The challenge lies in the restrictions on how the data can be accessed. The only access I have to the source is via REST: I can pull the data using RESTClient, RESTRequest and RESTResponse connected to a RESTAdapter, TDataSource and TClientDataSet. The destination is an Oracle database, which I have direct access to.
I have approx 15 tables, with the largest being approx 200,000 rows, 40 columns.
Right now, I am looping through each row and, for each column in the source, finding the matching column in the destination... and the performance is killing me. Is there a more elegant (and, above all, faster) way to do this?
Here is a code snippet of what I am doing now...
// for each row, loop
...
  // Copy Each Field
  for i := 0 to dm1.ds_Generic.DataSet.FieldCount - 1 do
  begin
    FieldFrom := dm1.ds_Generic.DataSet.Fields[i];
    FieldTo := dm1.tGeneric.FindField(FieldFrom.FieldName);
    if Assigned(FieldTo) then
    begin
      FieldTo.Value := FieldFrom.Value;
    end;
  end;
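One common way to speed up that inner loop is to resolve the column mapping once, before iterating the rows, and to disable dataset controls while copying. Below is a minimal, untested sketch of that idea; only dm1.ds_Generic and dm1.tGeneric come from the snippet above, everything else (the CopyAllRows name, the parallel field arrays) is illustrative:

// uses Data.DB
procedure CopyAllRows(Src, Dst: TDataSet);
var
  SrcF, DstF: array of TField;
  i, n: Integer;
  F: TField;
begin
  // Resolve the source -> destination field mapping once, not once per row.
  SetLength(SrcF, Src.FieldCount);
  SetLength(DstF, Src.FieldCount);
  n := 0;
  for i := 0 to Src.FieldCount - 1 do
  begin
    F := Dst.FindField(Src.Fields[i].FieldName);
    if Assigned(F) then
    begin
      SrcF[n] := Src.Fields[i];
      DstF[n] := F;
      Inc(n);
    end;
  end;

  Src.DisableControls;
  Dst.DisableControls;
  try
    Src.First;
    while not Src.Eof do
    begin
      Dst.Append;
      for i := 0 to n - 1 do
        DstF[i].Value := SrcF[i].Value; // plain Variant copy, no name lookup
      Dst.Post;
      Src.Next;
    end;
  finally
    Dst.EnableControls;
    Src.EnableControls;
  end;
end;

Called as CopyAllRows(dm1.ds_Generic.DataSet, dm1.tGeneric) after the destination table has been emptied. If the row-by-row Post into the Oracle-bound dataset is still the bottleneck, batching the inserts on the Oracle side (for example with FireDAC's Array DML, the subject of the related question below) is the usual next step.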

Related

FireDAC Array DML and Returning clauses

Using FireDAC's Array DML feature, it doesn't seem possible to utilise a RETURNING clause (in my case, with PostgreSQL).
If I run a simple insert query such as:
With FDQuery Do
begin
  SQL.Text := 'INSERT INTO temptab(email, name) '
    + 'VALUES (''email1'', ''name1''), '
    + '(''email2'', ''name2'') '
    + 'RETURNING id';
  Open;
end;
The query returns two records containing the ids of the newly inserted records.
For larger inserts I would prefer to use Array DML, but in some cases I also need to be able to get returned data.
The Open function does not have an ATimes parameter. Whilst you can call Open with Array DML, it results in the insertion and return of just the first record.
I cannot find any other properties or methods which would seem to facilitate this. I have posted on Praxis to see if anyone there has any ideas, but I have had no response. I have also posted this as a new feature request on Quality Central.
If anyone knows of a way of achieving this using Array DML, I would be grateful to hear, but my principal question is what is the most efficient route for retrieving the inserted data (principally IDs) from the DB if I persist with Array DML?
A couple of ideas occur to me, neither of which seem tremendously attractive:
Within a StartTransaction/Commit block, retrieve the id of the last inserted record after the insertion and then work backwards to grab the requisite number of records. This seems a bit risky, although since it happens within a transaction it should probably be okay.
Add an integer field to the relevant table, populate each inserted record with a unique identifier, and retrieve the records with that identifier after the insert. Whilst this would ensure the return of the inserted records, it would be relatively inefficient unless I index the field used to store the identifier.
Both the above would be dependent on records being inserted into the DB in the order they are supplied to the Array DML, but I assume/hope that is a given.
I would appreciate views on the best (i.e. most efficient and reliable) of the above options, and any suggestions for alternative, even better options, even if those entail abandoning Array DML where a RETURNING clause is needed.
You actually can get all returned IDs. You can tell FireDAC to store the result values in parameters with {INTO }. See for example the following code:
FDQuery.SQL.Text := 'INSERT into tablename (fieldname) values (:p1) returning id {into :p2}';
FDQuery.Params.ArraySize := 2;                 // two Array DML elements
FDQuery.Params[0].AsStrings[0] := 'one';       // values for :p1
FDQuery.Params[0].AsStrings[1] := 'two';
FDQuery.Params[1].ParamType := ptInputOutput;  // :p2 receives the returned ids
FDQuery.Params[1].DataType := ftLargeInt;
FDQuery.Execute(2, 0);                         // ATimes = 2, AOffset = 0
ID1 := FDQuery.Params[1].AsLargeInts[0];
ID2 := FDQuery.Params[1].AsLargeInts[1];
This works when 1 row is returned per Array DML element. I think it will not work for more than 1 row, but I've not tested it; if it does work, you would have to know which result corresponds to which Array DML element.
Note that FireDAC throws an access violation (AV) when 0 rows are returned for one or more elements of the Array DML, for example when you UPDATE a row that was deleted in the meantime. The AV has nothing to do with the Array DML itself: you get the same AV when calling a plain FDQuery.Execute;.
I've suggested another option earlier on the delphipraxis forum, but that is a suboptimal solution as it uses a temp table to store the IDs:
https://en.delphipraxis.net/topic/4693-firedac-array-dml-returning-values-from-inserted-records/

Postgresql - Return column subset from cursor

I have a legacy stored procedure returning a number (a row count) and a cursor with many columns; I need to retrieve a subset of the selected columns. I can think of three ways of doing it:
Invoke the existing procedure from the outside, and map columns to my own data structures trimming unneeded columns;
Write a new stored procedure, mostly identical to the existing one but returning different columns;
Write a new stored procedure, invoking the old one internally and filtering columns (the referenced entities and thus the number of rows are exactly the same as the existing procedure).
Number 2 is obviously a no-go.
Number 1 is viable. As far as I know, there is little difference in computing cost between retrieving one column or many, since the engine has to read full rows regardless before filtering out the unrequested columns; but I do have a feeling it would be heavier at runtime to invoke the procedure from the outside, as objects representing the unneeded columns would exist on return from the DB call.
I would be interested in implementing Number 3, but I would prefer to maintain the same return type as the existing function (count + refcursor) for conformity.
I think I could transfer all the rows in the cursor returned by the existing function into a temporary table, as described e.g. in this question, and use it as a source for the output cursor (see the sketch below), but:
I am not sure of how the output cursor would behave with a temporary table created with a drop-on-commit clause (would the results exist reliably after the procedure has terminated? Would the temporary table be dropped as expected?);
I read that temporary tables are expensive to use, and it feels like overkill for what in the end is a filtering of columns on the same rows from a pre-computed result.
Is there a way to query the existing cursor so that it may be used as a source for the output cursor, while filtering columns?
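For what it's worth, here is a minimal, untested sketch of the temp-table variant of Number 3. It assumes the legacy function returns (count integer, refcursor) in that order and that only two columns are wanted; legacy_proc, filtered_legacy, tmp_subset and the column names/types are all hypothetical:

CREATE OR REPLACE FUNCTION filtered_legacy(OUT row_count integer, OUT result refcursor)
LANGUAGE plpgsql AS $$
DECLARE
    src refcursor;
    rec record;
BEGIN
    -- Call the existing function and capture its count and cursor.
    SELECT * INTO row_count, src FROM legacy_proc();

    -- Spool the cursor into a transaction-scoped temp table holding only the wanted columns.
    CREATE TEMP TABLE tmp_subset (col_a integer, col_b text) ON COMMIT DROP;
    LOOP
        FETCH src INTO rec;
        EXIT WHEN NOT FOUND;
        INSERT INTO tmp_subset VALUES (rec.col_a, rec.col_b);
    END LOOP;
    CLOSE src;

    -- Hand back a cursor over the subset. Both this cursor and the ON COMMIT DROP
    -- table only live until the transaction ends, so the caller must FETCH before committing.
    OPEN result FOR SELECT * FROM tmp_subset;
END;
$$;

This keeps the (count + refcursor) return shape, but it materialises every row once, which is exactly the temp-table overhead mentioned above.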

How to update a local table remotely?

I have a large table on a remote server with an unknown number (millions) of rows of data. I'd like to be able to fetch the data in batches of 100,000 rows at a time, update my local table with each fetched batch, and continue until all rows have been fetched. Is there a way I can update a local table remotely?
Currently I have a dummy table called t on the server along with the following variables...
t:([]sym:1000000?`A`B`Ab`Ba`C`D`Cd`Dc;id:1+til 1000000)
selector:select from t where sym like "A*"
counter:count selector
divy:counter%100000
divyUP:ceiling divy
and the below function on the client, along with the variable index set to 0 and normTable, which is an empty copy of the remote table...
index:0
normTable:h"0#t"
batches:{[idx;divy;anty;seltr]
if[not idx=divy;
batch:select[(anty;100000)] from seltr;
`normTable upsert batch;
idx+::1;
divy:divy;
anty+:100000;
seltr:seltr;
batches[idx;divy;anty;seltr]];
idx::0}
I call that function using the following command...
batches[index;h"divyUP";0;h"selector"]
The problem with this approach, though, is that h"selector" fetches all the rows of data at once (and multiple times: once for each batch of 100,000 that it upserts into my local normTable).
I could move the batches function to the remote server, but then how would I update my local normTable remotely?
Alternatively, I could break the rows into batches on the server and then pull each batch individually. But if I don't know how many rows there are, how do I know how many variables are required? For example, the following would work, but only up to the first 400k rows...
batch1:select[100000] from t where sym like "A*"
batch2:select[100000 100000] from t where sym like "A*"
batch3:select[200000 100000] from t where sym like "A*"
batch4:select[300000 100000] from t where sym like "A*"
Is there a way to generate the batchX variables so that the number of variables created matches divyUP?
I would suggest a few changes, since you are querying a remote server:
Do not run synchronous requests, as that would slow down the server's processing. Make asynchronous requests using callbacks instead.
Do not do a full table scan (a heavy comparison, especially for a regex match) on each call. Most of the data might still be in cache for the next call, but that is not guaranteed, and the scan will again impact the server's normal operations.
Do not make data requests in bursts. Either use a timer or request the next batch only once the previous batch has arrived.
The approach below is based on these suggestions. It avoids scanning the full table for columns other than the index column (which is lightweight) and makes the next request only when the previous batch has arrived.
Create Batch processing function
This function runs on the server, reads a small batch of rows from the table using indices, and returns the required data.
q) batch:{[ind;s] ni:ind+s; d:select from t where i within (ind;ni), sym like "A*";
neg[.z.w](`upd;d;$[ni<count t;ni+1;0]) }
It takes 2 arguments: the starting index and the batch size to work on.
This function finally calls the upd function on the local machine asynchronously, passing 2 arguments (in this order):
The data from the current batch request
The table index to start the next batch from (0 is returned once all rows are done, to stop further batch processing)
Create Callback function
The result from the batch processing function arrives in this function.
If the index is > 0, there is more data to process and the next batch should start from that index.
q) upd:{[data;ind] t::t,data;if[ind>0;fetch ind]}
Create Main function to start process
q)fetch:{[ind] h (batch;ind;size)}
Finally, open the connection, create the table variable and run the fetch function.
q) h:hopen `:server:port
q) t:()
q) size:100
q) fetch 0
Now, the above method is based on the assumption that the server table is static. If it is getting updated in real time, changes would be required depending on how the table is being updated on the server.
Also, other optimizations can be done depending on the attributes set on the remote table, which can improve performance.
If you're OK sending sync messages, it can be simplified to something like:
{[h;i]`mytab upsert h({select from t where i in x};i)}[h]each 0N 100000#til h"count t"
And you can easily change it to control the number of batches (rather than the batch size) by instead using 10 0N# (that would do it in 10 batches), as in the variant below.
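For example, the 10-batch variant of the same one-liner (same hypothetical handle h and table names as above):
{[h;i]`mytab upsert h({select from t where i in x};i)}[h]each 10 0N#til h"count t"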
Rather than having individual variables, the cut function can split the result of the select into chunks of 100,000 rows. Indexing into the result then gives each chunk as a table.
batches:100000 cut select from t where sym like "A*"
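For example, continuing from the line above:
count batches     / number of 100,000-row chunks (the last one may be shorter)
batches 0         / first chunk, an ordinary table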

Case-when or if-then to control table creation in Redshift

I have a handful of data sources that I'd like to apply the same analyses to and eventually load into one larger table (uniformtable). Different sources contain different columns, and sometimes a source involves crosswalk files that I need to join. I'd like to have one query that converts each source's data into uniformtable's format, based on a unique key for each source. Something along the lines of this:
case when source.sourceid = 1 then
create uniformtable as
select column1a as uniforma, column1b as uniformb, sourceid from source
else
when source.sourceid = 2 then
create uniformtable as
select column2a as uniforma, column2b as uniformb, sourceid from source
end;
I've tried using if-then and case-when to accomplish this, but I get syntax errors pointing to the very start of my query. Does Redshift allow you to use if logic for this kind of control?
No, this logic is not permitted.
CASE is an expression that is only valid within a SQL statement (for example, in a SELECT list); it cannot be used as flow control to choose between whole statements such as CREATE TABLE.
You would need to perform this logic external to Amazon Redshift, and then just send the final SQL to create the table.
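For example (a sketch reusing the names from the question), the external logic would pick which one of these concrete statements to send for a given source:
-- sourceid = 1
create table uniformtable as
select column1a as uniforma, column1b as uniformb, sourceid from source;
-- sourceid = 2
create table uniformtable as
select column2a as uniforma, column2b as uniformb, sourceid from source;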

How to tell if record has changed in Postgres

I have a bit of an "upsert" type of question... but I want to throw it out there because it's a little bit different from any that I've read on Stack Overflow.
Basic problem.
I'm working on moving from MySQL to PostgreSQL 9.1.5 (hosted on Heroku). As part of that, I need to import multiple CSV files every day. Some of the data is sales information and is almost guaranteed to be new and to need inserting. But other parts of the data are almost guaranteed to be the same. For example, the CSV files (note plural) will have POS (point of sale) information in them. This rarely changes (and when it does, it is most likely only via additions). Then there is product information. There are about 10,000 products (the vast majority will be unchanged, but it's possible to have both additions and updates).
The final item (but an important one) is that I have a requirement to provide an audit trail/information for any given item. For example, if I add a new POS record, I need to be able to trace it back to the file it was found in. If I change a UPC code or the description of a product, then I need to be able to trace it back to the import (and file) where the change came from.
Solution that I'm contemplating.
Since the data is provided to me via CSV, I'm working on the idea that COPY will be the best/fastest way to load it. The structure of the data in the files is not exactly what I have in the database (i.e. the final destination), so I'm copying them into tables in a staging schema that match the CSVs (note: one schema per data source). The tables in the staging schemas have BEFORE INSERT row triggers. These triggers can decide what to do with the data (insert, update or ignore).
For the tables that are most likely to contain new data, the trigger will try the insert first. If the record is already there, it returns NULL (and stops the insert into the staging table). For tables that rarely change, it will query the target table and see if the record is found. If it is, then I need a way to see whether any of the fields have changed (because, remember, I need to show that the record was modified by import x from file y). I could obviously just boilerplate out the code and test each column, but I was looking for something a little more "eloquent" and more maintainable than that.
In a way, what I'm doing is combining an importing system with an audit trail system. So, in researching audit trails, I reviewed the following wiki.postgresql.org article. It seems like hstore might be a nice way of getting at the changes (and of easily ignoring some columns in the table that aren't important, e.g. "last_modified").
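As a concrete (untested) sketch of that hstore idea, assuming the hstore extension is installed and using hypothetical staging.products / public.products tables keyed on product_id:

-- Staging rows whose data differs from the live row, ignoring last_modified,
-- together with the fields that changed.
SELECT n.product_id,
       hstore(n) - hstore(o) AS changed_fields
FROM   staging.products n
JOIN   public.products  o USING (product_id)
WHERE  (hstore(n) - 'last_modified'::text)
       IS DISTINCT FROM (hstore(o) - 'last_modified'::text);

The hstore(record) cast and the hstore - hstore / hstore - text operators come with the hstore extension; changed_fields lists only the key/value pairs in the staging row that differ from the live row.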
I'm about 90% sure it will all work... I've created some testing tables etc and played around with it.
My question?
Is there a better, more maintainable way of accomplishing this task of finding the maybe 3 records out of 10K that require a change to the database? I could certainly write a Python script (or something else) that reads the file and tries to figure out what to do with each record, but that feels horribly inefficient and would lead to lots of round trips.
A few final things:
I don't have control over the input files. I would love it if they only sent me the deltas, but they don't and it's completely outside of my control or influence.
The system is growing, and new data sources are likely to be added that will greatly increase the amount of data being processed (so I'm trying to keep things efficient).
I know this is not nice, simple SO question (like "how to sort a list in python") but I believe one of the great things about SO is that you can ask hard questions and people will share their thoughts about how they think the best way to solve it is.
I have lots of similar operations. What I do is COPY to temporary staging tables:
CREATE TEMP TABLE target_tmp AS
SELECT * FROM target_tbl LIMIT 0; -- only copy structure, no data
COPY target_tmp FROM '/path/to/target.csv';
For performance, run ANALYZE - temp. tables are not analyzed by autovacuum!
ANALYZE target_tmp;
Also for performance, maybe even create an index or two on the temp table, or add a primary key if the data allows for that.
ALTER TABLE target_tmp ADD CONSTRAINT target_tmp_pkey PRIMARY KEY(target_id);
You don't need the performance stuff for small imports.
Then use the full scope of SQL commands to digest the new data.
For instance, if the primary key of the target table is target_id ..
Maybe DELETE what isn't there any more?
DELETE FROM target_tbl t
WHERE NOT EXISTS (
SELECT 1 FROM target_tmp t1
WHERE t1.target_id = t.target_id
);
Then UPDATE what's already there:
UPDATE target_tbl t
SET col1 = t1.col1
FROM target_tmp t1
WHERE t.target_id = t1.target_id
To avoid empty UPDATEs, simply add:
...
AND col1 IS DISTINCT FROM t1.col1; -- repeat for relevant columns
Or, if the whole row is relevant:
...
AND t IS DISTINCT FROM t1; -- check the whole row
Then INSERT what's new:
INSERT INTO target_tbl(target_id, col1)
SELECT t1.target_id, t1.col1
FROM target_tmp t1
LEFT JOIN target_tbl t USING (target_id)
WHERE t.target_id IS NULL;
Clean up if your session goes on (temp tables are dropped at end of session automatically):
DROP TABLE target_tmp;
Or use ON COMMIT DROP or similar with CREATE TEMP TABLE.
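For example, the staging copy at the top could be created drop-on-commit from the start (which also means the whole import has to run inside a single transaction):
CREATE TEMP TABLE target_tmp ON COMMIT DROP AS
SELECT * FROM target_tbl LIMIT 0;  -- structure only; dropped automatically at COMMIT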
Code untested, but should work in any modern version of PostgreSQL except for typos.