How to insert autoincremented master/slave records using ScalaQuery? - scala

A classic issue, but with a new framework, hence the problem.
PostgreSQL + Scala + ScalaQuery. I have a Master table with a serial (autoincrement) id and a Slave table, also with a serial id.
I need to insert one master record and several slaves. I have to do it within a transaction (to be able to cancel everything), so I cannot run a query after inserting the master to find out its id. As far as I can see, the SQ "insert" method does not return any reference to the inserted master record.
So how do I do it?
The SQ examples cover this, but without an autoincremented field, so that solution (pre-set ids) is not applicable here.

If I understand correctly, this is not currently possible in an automatic way. If you are not afraid of a workaround, it can be done as follows. First, obtain the id of the last insert (after each master record insertion):
PostgreSQL function for last inserted ID
Then use it in SQ:
http://groups.google.com/group/scalaquery/browse_thread/thread/faa7d3e5842da82e
This code shows the MySql way. I'm posting it to the list for
posterity's sake.
val scopeIdentity = SimpleFunction.nullaryLong
val inserted = Actions.insert("cat", "eats", "dog")
//Print out the count of inserted records.
println(inserted)
//Print out the primary key for the last inserted record.
println(Query(scopeIdentity).first)
//Regards //Bryan
However, for autoincremented fields you have to insert through projections that exclude those fields, which means inserting tuples instead of named record types, so it may be worth waiting until SQ supports this directly.
Note that I am an SQ newbie, so I might be misinforming you.
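The underlying pattern is not specific to ScalaQuery: inside a single transaction, the generated key can be read back on the same connection before anything is committed, so a rollback still cancels everything. Below is a minimal sketch of that pattern in Python, using the standard-library sqlite3 module as a stand-in for PostgreSQL. The tables master and slave and the helpers make_db and insert_master_with_slaves are made up for illustration; on PostgreSQL the counterpart of cursor.lastrowid would be INSERT ... RETURNING id or currval on the sequence.

```python
import sqlite3


def make_db():
    """Create an in-memory database with hypothetical master/slave tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE master ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE slave ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "master_id INTEGER NOT NULL REFERENCES master(id), "
        "name TEXT NOT NULL)"
    )
    return conn


def insert_master_with_slaves(conn, master_name, slave_names):
    """Insert one master row and its slave rows atomically; return the master id."""
    # "with conn" commits on success and rolls back if an exception escapes.
    with conn:
        cur = conn.execute("INSERT INTO master (name) VALUES (?)", (master_name,))
        master_id = cur.lastrowid  # key generated inside the still-open transaction
        conn.executemany(
            "INSERT INTO slave (master_id, name) VALUES (?, ?)",
            [(master_id, name) for name in slave_names],
        )
    return master_id
```

If any slave insert fails, the master row disappears together with it, which is the "cancel all" behaviour the question asks for.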

Related

FireDAC Array DML and Returning clauses

Using FireDAC's Array DML feature, it doesn't seem possible to utilise a RETURNING clause (in my case with PostgreSQL).
If I run a simple insert query such as:
with FDQuery do
begin
  SQL.Text := 'INSERT INTO temptab(email, name) '
    + 'VALUES (''email1'', ''name1''), '
    + '(''email2'', ''name2'') '
    + 'RETURNING id';
  Open;
end;
The query returns two records containing the id for the newly inserted records.
For larger inserts I would prefer to use Array DML, but in some cases I also need to be able to get returned data.
The Open function does not have an ATimes parameter. Whilst you can call Open with Array DML, it results in the insertion and return of just the first record.
I cannot find any other properties, methods which would seem to facilitate this. I have posted on Praxis to see if anyone there has any ideas, but I have had no response. I have also posted this as a new feature request on Quality Central.
If anyone knows of a way of achieving this using Array DML, I would be grateful to hear, but my principal question is what is the most efficient route for retrieving the inserted data (principally IDs) from the DB if I persist with Array DML?
A couple of ideas occur to me, neither of which seem tremendously attractive:
1) Between StartTransaction and Commit, after the insertion, retrieve the id of the last inserted record and then count backwards to get the requisite number of ids. This seems to me to be a bit risky, although since it happens within a transaction it should probably be okay.
2) Add an integer field to the relevant table, populate each inserted record with a unique identifier, and after the insert retrieve the records with that identifier. Whilst this would ensure the return of the inserted records, it would be relatively inefficient unless I index the field used to store the identifier.
Both the above would be dependent on records being inserted into the DB in the order they are supplied to the Array DML, but I assume/hope that is a given.
I would appreciate views on the best (ie most efficient and reliable) of the above options and any suggestions as to alternative even better options even if those entail abandoning Array DML where a Returning clause is needed.
You actually can get all the returned IDs. You can tell FireDAC to store the result values in parameters with {INTO }. See for example the following code:
FDQuery.SQL.Text := 'INSERT into tablename (fieldname) values (:p1) returning id {into :p2}';
FDQuery.Params.ArraySize := 2;
FDQuery.Params[0].AsStrings[0] := 'one';
FDQuery.Params[0].AsStrings[1] := 'two';
FDQuery.Params[1].ParamType := ptInputOutput;
FDQuery.Params[1].DataType := ftLargeInt;
FDQuery.Execute(2,0);
ID1 := FDQuery.Params[1].AsLargeInts[0];
ID2 := FDQuery.Params[1].AsLargeInts[1];
This works when one row is returned per Array DML element. I think it will not work for more than one row, but I've not tested it. If it does, you would have to know which result corresponds to which Array DML element.
Note that FireDAC throws an access violation (AV) when 0 rows are returned for one or more elements in the Array DML, for example when you UPDATE a row that was deleted in the meantime. The AV has nothing to do with Array DML itself: you get one as well when a plain FDQuery.Execute; is called.
I suggested another option earlier on the delphipraxis forum, but it is a suboptimal solution, as it uses a temp table to store the IDs:
https://en.delphipraxis.net/topic/4693-firedac-array-dml-returning-values-from-inserted-records/
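The second idea from the question (tagging every row of a batch with a unique marker and selecting on it afterwards) does not depend on FireDAC at all. As a rough sketch only: the code below assumes a hypothetical batch_id column on temptab, uses a random text token rather than the integer field the question mentions, and runs against Python's standard sqlite3 module in place of PostgreSQL. The helper bulk_insert_with_ids is invented for illustration.

```python
import sqlite3
import uuid


def bulk_insert_with_ids(conn, rows):
    """Insert (email, name) rows in one batch and return {email: generated id}."""
    batch_id = uuid.uuid4().hex  # marker shared by every row of this batch
    with conn:  # one transaction for the whole batch
        conn.executemany(
            "INSERT INTO temptab (email, name, batch_id) VALUES (?, ?, ?)",
            [(email, name, batch_id) for email, name in rows],
        )
        fetched = conn.execute(
            "SELECT email, id FROM temptab WHERE batch_id = ?", (batch_id,)
        ).fetchall()
    return dict(fetched)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE temptab ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, email TEXT, name TEXT, batch_id TEXT)"
)
# The index keeps the follow-up SELECT cheap, as the question anticipates.
conn.execute("CREATE INDEX temptab_batch_idx ON temptab (batch_id)")

ids = bulk_insert_with_ids(conn, [("email1", "name1"), ("email2", "name2")])
```

Matching the ids back through a natural key (here email, assumed unique within a batch) avoids relying on the rows being inserted in the order they were supplied.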

Spark Delta Table Updates

I am working in Microsoft Azure Databricks environment using sparksql and pyspark.
I have a Delta table on a lake where data is partitioned by, say, file_date. Every partition contains files storing millions of records per day with no primary/unique key. All these records have a "status" column which is either NULL (if everything looks good on that specific record) or not NULL (say, if a particular lookup mapping for a particular column is not found). Additionally, my process has another folder called "mapping", refreshed periodically (let's say nightly, to keep it simple), from which the mappings are read.
On a daily basis, there is a good chance that about 100-200 rows error out (status column not NULL). From these files, a downstream job pulls all the valid records each day (hence the partitioning by file_date) and sends them for further processing, ignoring those 100-200 errored records until the correct mapping file is received. In addition to the valid records, the downstream job should also check whether a mapping has since been found for the errored records and, if so, pass them downstream as well (after, of course, updating the data lake with the appropriate mapping and status).
What is the best way to go? Ideally, I would first update the Delta table/lake directly with the correct mapping and set the status column to, say, "available_for_reprocessing"; my downstream job would then pull the valid data for the day plus the "available_for_reprocessing" data and, after processing, set the status to "processed". But this seems to be super difficult using Delta.
I was looking at "https://docs.databricks.com/delta/delta-update.html", and the update example there only shows a simple update with constants, not updates from multiple tables.
The other option, and the most inefficient one, is to pull ALL the data (both processed and errored) for the last, say, 30 days, get the mapping for the errored records, and write the dataframe back into the Delta lake using the replaceWhere option. This is super inefficient, as we are reading everything (hundreds of millions of records) and writing everything back just to process, say, 1000 records at most. If you search for deltaTable = DeltaTable.forPath(spark, "/data/events/") at "https://docs.databricks.com/delta/delta-update.html", the example provided is for very simple updates. Without a unique key, it is also impossible to update specific records. Can someone please help?
I use pyspark or can use sparksql, but I am lost.
If you want to update one column ('status') on the condition that all lookups are now correct for rows where they weren't correct before (where 'status' is currently incorrect), I think the UPDATE command along with EXISTS can help you solve this. It isn't mentioned in the update documentation, but it works for both delete and update operations, effectively allowing you to update/delete records on joins.
For your scenario I believe the sql command would look something like this:
UPDATE your_db.table_name AS a
SET status = 'correct'
WHERE EXISTS
(
    SELECT *
    FROM your_db.table_name AS b
    JOIN lookup_table_1 AS t1 ON t1.lookup_column_a = b.lookup_column_a
    JOIN lookup_table_2 AS t2 ON t2.lookup_column_b = b.lookup_column_b
    -- ... add further lookups if needed
    WHERE
        b.status = 'incorrect' AND
        a.lookup_column_a = b.lookup_column_a AND
        a.lookup_column_b = b.lookup_column_b
)
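The same idea in a simpler, directly correlated form, offered only as a sketch: the tables records and mapping, their columns, the 'mapping_not_found' status value, and the helper mark_reprocessable are all invented for illustration, and SQLite (through Python's standard sqlite3 module) stands in for Delta, so the exact syntax Delta accepts may differ. Only errored rows whose lookup key now has a mapping are touched, and no unique key is needed.

```python
import sqlite3


def mark_reprocessable(conn):
    """Fill in mappings for errored rows that can now be resolved; return rows changed."""
    with conn:
        cur = conn.execute(
            """
            UPDATE records
               SET mapped_value = (SELECT m.mapped_value
                                     FROM mapping AS m
                                    WHERE m.lookup_key = records.lookup_key),
                   status = 'available_for_reprocessing'
             WHERE status = 'mapping_not_found'
               AND EXISTS (SELECT 1
                             FROM mapping AS m
                            WHERE m.lookup_key = records.lookup_key)
            """
        )
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE records (file_date TEXT, lookup_key TEXT, mapped_value TEXT, status TEXT);
    CREATE TABLE mapping (lookup_key TEXT PRIMARY KEY, mapped_value TEXT);
    INSERT INTO records VALUES
        ('2020-01-01', 'A', 'alpha', NULL),
        ('2020-01-01', 'B', NULL, 'mapping_not_found'),
        ('2020-01-01', 'C', NULL, 'mapping_not_found');
    INSERT INTO mapping VALUES ('A', 'alpha'), ('B', 'beta');
    """
)
changed = mark_reprocessable(conn)
```

Rows that still have no mapping (key 'C' here) keep their error status and are picked up on a later run once the mapping arrives.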
Merge did the trick...
MERGE INTO deptdelta AS maindept
USING updated_dept_location AS upddept
ON upddept.dno = maindept.dno
WHEN MATCHED THEN UPDATE SET maindept.dname = upddept.updated_name, maindept.location = upddept.updated_location

Reference foreign keys using SSIS-Lookup

I am asking for help on the following topic. I am trying to create an ETL process with two Excel data sources (S1 ~300 rows and S2 ~7000 rows). S1 contains project information and employee details, and S2 contains the number of hours each employee worked on each project at a given timestamp.
I want to insert the number of hours each employee worked on each project at a given timestamp into the fact table, referencing the existing primary keys in the dimension tables. If an entry is not yet present in the dimension tables, I want to add a new entry first and use the newly generated id. The destination table structure looks as follows (Data Warehouse, Star Schema): Destination Table Structure
In SSIS, I first created three Data Flow tasks for filling the dimension tables (project, employee and time) with distinct values (using GROUP BY, as S1 and S2 contain a lot of duplicate rows), and then a fourth Data Flow task (see image below) to insert the fact table data, and this is where I'm running into problems:
Data Flow Task FactTable
I am using three LookUp functions to retrieve the foreign keys project_id, employee_id and time_id from the dimension tables (using project name, employee number and timestamp). If the id is found, it is passed on all the way to Merge Join 1; if not, a new dimension entry is created (let's say for project) and the generated project_id is passed on instead. The same goes for employee and time respectively.
There are two issues with this:
1) The "amount of hours" (passed by Multicast four, see image above) is not matched in the final result (No Match)
2) The number of rows being inserted keeps increasing forever (an endless join, I believe due to the Merge Joins).
What I've tried:
I have used one UNION instead of three Merge Joins before, but this resulted in the foreign keys each ending up in separate rows instead of being merged together.
I used Merge (instead of Merge Join) and combined the join and sort conditions in what I feel are all possible ways.
I understand that this scenario might be confusing for everybody else, but thank you for taking the time to look at it! Any help is greatly appreciated.
Solved it
For anybody having similar issues:
Separating the Data Flows that fill the dimension tables from those that fill the fact tables will do the trick.
It's a clean solution and easier to debug.
Also: don't run the LookUp functions in parallel, but rather one after the other, passing the attributes along. This saves unnecessary Merges as well.
So, to sum up:
Four Data Flow Tasks, three for filling dimension tables ONLY and one for filling fact tables ONLY.
See also "Loading Multiple Tables using SSIS keeping foreign key relationships": the answer posted there by onupdatecascade is basically it.
Good luck!
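Outside SSIS, the structure of that fix looks roughly like the following Python sketch. The dimension dictionaries, the field names, and the helpers load_dimension and build_fact_rows are hypothetical; the point is only that the dimensions are filled first, and each fact row then passes through the lookups one after another, carrying its measure along, so nothing has to be merged back together afterwards.

```python
def load_dimension(values):
    """Assign a surrogate key (1, 2, ...) to each distinct natural key, in first-seen order."""
    dimension = {}
    for value in values:
        if value not in dimension:
            dimension[value] = len(dimension) + 1
    return dimension


def build_fact_rows(source_rows, projects, employees, times):
    """Resolve the three foreign keys for each source row, one lookup after another."""
    facts = []
    for row in source_rows:
        fact = {"hours": row["hours"]}  # the measure travels with the row
        fact["project_id"] = projects[row["project_name"]]
        fact["employee_id"] = employees[row["employee_no"]]
        fact["time_id"] = times[row["timestamp"]]
        facts.append(fact)
    return facts


source = [
    {"project_name": "P1", "employee_no": 7, "timestamp": "2020-01-01", "hours": 4.0},
    {"project_name": "P1", "employee_no": 8, "timestamp": "2020-01-01", "hours": 2.5},
    {"project_name": "P2", "employee_no": 7, "timestamp": "2020-01-02", "hours": 8.0},
]

# Step 1: dimensions only. Step 2: facts only.
projects = load_dimension(r["project_name"] for r in source)
employees = load_dimension(r["employee_no"] for r in source)
times = load_dimension(r["timestamp"] for r in source)
facts = build_fact_rows(source, projects, employees, times)
```

Because every source row yields exactly one fact row, the row count cannot grow the way it did with the Merge Joins.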

loop and change all record numbers in Firebird database

I have a database table with unique record numbers created with a generator, but because of an error in the code (setting generators) the record numbers suddenly became large, since many numbers were skipped. I would like to rewrite all record numbers, starting with 1 and finishing with the total number of records. Doing this from the application would take a lot of time.
From what I see in the Firebird documentation, this should be a simple task using a loop, but I have no experience with Firebird programming; I only use simple SQL statements. Can somebody help?
Actually, there is no need to program a loop; a simple update statement should do. First, reset the generator:
SET GENERATOR my_GEN TO 0;
and then update all the records, assigning each a new id:
update tab set recno = gen_id(my_GEN, 1) order by recno asc;
This assumes that all references to the recno field are via foreign keys with ON UPDATE CASCADE; otherwise you either mess up your data or the update fails.
During this operation there should be no other users in the database!
That being said, you really shouldn't care about gaps in your record numbers.
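For readers who would rather see the renumbering logic spelled out, here is a small sketch in Python with the standard sqlite3 module standing in for Firebird. The table tab and column recno follow the answer above, while the payload column and the helper renumber are made up. Assuming all existing numbers are positive, walking the rows in ascending order means each new number is never larger than the old one, so a unique recno cannot collide with a row that has not been renumbered yet.

```python
import sqlite3


def renumber(conn):
    """Rewrite recno as 1..N, preserving the existing order; return N."""
    with conn:  # all-or-nothing
        old_numbers = [
            row[0] for row in conn.execute("SELECT recno FROM tab ORDER BY recno")
        ]
        for new, old in enumerate(old_numbers, start=1):
            if new != old:
                conn.execute("UPDATE tab SET recno = ? WHERE recno = ?", (new, old))
    return len(old_numbers)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tab (recno INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO tab VALUES (?, ?)", [(3, "a"), (10, "b"), (500, "c"), (501, "d")]
)
conn.commit()
total = renumber(conn)
```

With this client-side variant the generator would still have to be set to the returned total afterwards, and the same caveat about referencing tables applies.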

Does SQL Alchemy know the current Sequence value?

I'm using the SQLAlchemy (1.0) ORM with a PostgreSQL database. Let's say I have a line in my class
serialid = Column(Integer, Sequence('journal_seq'), unique=True)
I realize that there is then a special "table" in my database that holds either the last used or the next available integer for serialid. Can I, from the ORM, get the value of that integer (without incrementing it, otherwise I could just call next_value)? And is there a guarantee that the next serialid will have exactly that value?
I'd like to make a journal item with a serialid, and also make another item referring to that same serialid, but I'd like the two of them to be committed (or rolled back) together, and until I commit the journal item, I don't know what its serialid is. Maybe there is a cleaner way of doing it without knowing the current sequence value. A relationship would be great, but I don't know how to set it up.
(I know there is a question that asks the same thing when you control the SQL directly. I'd like to do the same from the SQLAlchemy ORM.)