Insert data from staging table into multiple, related tables? - tsql

I'm working on an application that imports data from Access to SQL Server 2008. Currently, I'm using a stored procedure to import the data individually by record. I can't go with a bulk insert or anything like that because the data is inserted into two related tables...I have a bunch of fields that go into the Account table (first name, last name, etc.) and three fields that will each have a record in an Insurance table, linked back to the Account table by the auto-incrementing AccountID that's selected with SCOPE_IDENTITY in the stored procedure.
Performance isn't very good due to the number of round trips to the database from the application. For this and some other reasons I'm planning to instead use a staging table and import the data from there. Reading up on my options for approaching this, a cursor that executes the same insert stored procedure on the data in the staging table would make sense. However it appears that cursors are evil incarnate and should be avoided.
Is there any way to insert data into one table, retrieve the auto-generated IDs, then insert data for the same records into another table using the corresponding ID, in a set-based operation? Or is a cursor my only option here?

Look at the OUTPUT clause. You should be able to add it to your INSERT statement to do what you want.
BTW, if you need to output columns into the second table that weren't inserted into the first one, then use MERGE instead of INSERT (as suggested in the comment to the original question) as its OUTPUT clause supports referencing other columns from the source table(s). Otherwise, keeping it with an INSERT is more straightforward, and it does give you access to the inserted identity column.

I'm having experiment to worked out in inserting multiple record into related table using databinding. So, try this!
Hopefully this is very helpful. Follow this link How to insert record into related tables. for more information.

Related

PostgreSQL: Return auto-generated ids from COPY FROM insertion

I have a non-empty PostgreSQL table with a GENERATED ALWAYS AS IDENTITY column id. I do a bulk insert with the C++ binding pqxx::stream_to, which I'm assuming uses COPY FROM. My problem is that I want to know the ids of the newly created rows, but COPY FROM has no RETURNING clause. I see several possible solutions, but I'm not sure if any of them is good, or which one is the least bad:
Provide the ids manually through COPY FROM, taking care to give the values which the identity sequence would have provided, then afterwards synchronize the sequence with setval(...).
First stream the data to a temp-table with a custom index column for ordering. Then do something likeINSERT INTO foo (col1, col2)
SELECT ttFoo.col1, ttFoo.col2 FROM ttFoo
ORDER BY ttFoo.idx RETURNING foo.id
and depend on the fact that the identity sequence produces ascending numbers to correlate them with ttFoo.idx (I cannot do RETURNING ttFoo.idx too because only the inserted row is available for that which doesn't contain idx)
Query the current value of the identity sequence prior to insertion, then check afterwards which rows are new.
I would assume that this is a common situation, yet I don't see an obviously correct solution. What do you recommend?
You can find out which rows have been affected by your current transaction using the system columns. The xmin column contains the ID of the inserting transaction, so to return the id values you just copied, you could:
BEGIN;
COPY foo(col1,col2) FROM STDIN;
SELECT id FROM foo
WHERE xmin::text = (txid_current() % (2^32)::bigint)::text
ORDER BY id;
COMMIT;
The WHERE clause comes from this answer, which explains the reasoning behind it.
I don't think there's any way to optimise this with an index, so it might be too slow on a large table. If so, I think your second option would be the way to go, i.e. stream into a temp table and INSERT ... RETURNING.
I think you can create id with type is uuid.
The first step, you should random your ids after that bulk insert them, by this way your will not need to return ids from database.

Best performance method for getting records by large collection of IDs

I am writing a query with code to select all records from a table where a column value is contained in a CSV. I found a suggestion that the best way to do this was using ARRAY functionality in PostgresQL.
I have a table price_mapping and it has a primary key of id and a column customer_id of type bigint.
I want to return all records that have a customer ID in the array I will generate from csv.
I tried this:
select * from price_mapping
where ARRAY[customer_id] <# ARRAY[5,7,10]::bigint[]
(the 5,7,10 part would actually be a csv inserted by my app)
But I am not sure that is right. In application the array could contain 10's of thousands of IDs so want to make sure I am doing right with best performance method.
Is this the right way in PostgreSQL to retrieve large collection of records by pre-defined column value?
Thanks
Generally this is done with the SQL standard in operator.
select *
from price_mapping
where customer_id in (5,7,10)
I don't see any reason using ARRAY would be faster. It might be slower given it has to build arrays, though it might have been optimized.
In the past this was more optimal:
select *
from price_mapping
where customer_id = ANY(VALUES (5), (7), (10)
But new-ish versions of Postgres should optimize this for you.
Passing in tens of thousands of IDs might run up against a query size limit either in Postgres or your database driver, so you may wish to batch this a few thousand at a time.
As for the best performance, the answer is to not search for tens of thousands of IDs. Find something which relates them together, index that column, and search by that.
If your data is big enough, try this:
Read your CSV using a FDW (foreign data wrapper)
If you need this connection often, you might build a materialized view from it, holding only needed columns. Refresh this when new CSV is created.
Join your table against this foreign table or materialized viev.

PostgreSQL 9.5 ON CONFLICT DO UPDATE command cannot affect row a second time

I have a table from which I want to UPSERT into another, when try to launch the query, I get the "cannot affect row a second time" error. So I tried to see if I have some duplicate on my first table regarding the field with the UNIQUE constraint, and I have none. I must be missing something, but since I cannot figure out what (and my query is a bit complex because it is including some JOIN), here is the query, the field with the UNIQUE constraint is "identifiant_immeuble" :
with upd(a,b,c,d,e,f,g,h,i,j,k) as(
select id_parcelle, batimentimmeuble,etatimmeuble,nb_loc_hab_ma,nb_loc_hab_ap,nb_loc_pro, dossier.id_dossier, adresse.id_adresse, zapms.geom, 1, batimentimmeuble2
from public.zapms
left join geo_pays_gex.dossier on dossier.designation_siea=zapms.id_dossier
left join geo_pays_gex.adresse on adresse.id_voie=(select id_voie from geo_pays_gex.voie where (voie.designation=zapms.nom_voie or voie.nom_quartier=zapms.nom_quartier) and voie.rivoli=lpad(zapms.rivoli,4,'0'))
and adresse.num_voie=zapms.num_voie
and adresse.insee=zapms.insee_commune::integer
)
insert into geo_pays_gex.bal2(identifiant_immeuble, batimentimmeuble, id_etat_addr, nb_loc_hab_ma, nb_loc_hab_ap, nb_loc_pro, id_dossier, id_adresse, geom, raccordement, batimentimmeuble2)
select a,b,c,d,e,f,g,h,i,j,k from upd
on conflict (identifiant_immeuble) do update
set batimentimmeuble=excluded.batimentimmeuble, id_etat_addr=excluded.id_etat_addr, nb_loc_hab_ma=excluded.nb_loc_hab_ma, nb_loc_hab_ap=excluded.nb_loc_hab_ap, nb_loc_pro=excluded.nb_loc_pro,
id_dossier=excluded.id_dossier, id_adresse=excluded.id_adresse,geom=excluded.geom, raccordement=1, batimentimmeuble2=excluded.batimentimmeuble2
;
As you can see, I use several intermediary tables in this query : one to store the street's names (voie), one related to this one storing the adresses (adresse, basically numbers related through a foreign key to the street's names table), and another storing some other datas related to the projects' names (dossier).
I don't know what other information I could give to help find an answer, I guess it is better I do not share the actual content of my tables since it may touch some privacy regulations or such.
Thanks for your attention.
EDIT : I found a workaround by deleting the entries present in the zapms table from the bal2 table, as such
delete from geo_pays_gex.bal2 where bal2.identifiant_immeuble in (select id_parcelle from zapms);
it is not entirely satisfying though, since I would have prefered to keep track of the data creator and the date of creation, as much as the fact that the data has been modified (I have some fields to store this information) and here I simply erase all this history... And I have another table with the primary key of the bal2 table as a foreign key. I am still in the DB creation so I can afford to truncate this table, but in production it wouldn't be possible since I would lose some datas.

Creating Stored Procedures that can work with different tables

I need to use the same Stored Procedures against many tables all with the same structure in my DB. This is data loaded from customers,with one table/customer and the data needs calculations/checks run before it's loaded to our DataWarehouse.
So far these are the options and issues I've found and I'm looking for a better pattern/approach.
Create a view that points to the
table I want to process, the SPs
then talk to that view. This works
well (especially once I'd worked out
how to create views 'automagically'
based on their columns). But the
view can only be used with one table
at a time, forcing the system to
deal with one customer at a time.
Use dynamic sql within each SP -
makes the SPs much harder to
read/debug and for those reasons has
been ruled out
Create a partitioned view across
all the tables and then use a
paramatised table function to return
just the data we're interested in -
ah but then I can't update the data
as the function returns a table that
can be only used for select
Use dynamic sql inside a function
(can't be done) to create a view
(which also can't be done) .... give
up
Within the SP create a temp table
with over the target table using
dynamic sql, but then the temp table
only exists in the session that runs
the dynamic sql not the 'parent'
session that's running the SP ...
give up
Create a global temp table using
dynamic SQL to avoid the scope issue
of 5, then run the SP against the
global temp table. Still run into
the single customer issue.
Create the view as in 1 within a
transaction and then run all the SPs
and then commit - works fine for one
user, but any others are now blocked
trying to create a new view of the
same name
Use a temporary view ... can't in
T/Sql
Move all the code into .Net - but
we have environment issues where
tsql is much easier to host/run
I know I'm not the only person who has this problem, have any of you good people solved it, please help.
Maybe your approach is wrong, I will go deep in details in a while but it seems that your problem can be solved using SSIS
-- Updated answer:
First, the big picture:
The most affordable way to process the tables dynamically is using a script instead of a stored procedure. If you want to make table access randomly chosen, you certainly will not use any of the performance advantages of stored procedures, i.e. execution plans. A SQL Script can be easily upgraded to point one table at runtime using placeholders and replacing it before executing.
The script can be loaded from the filesystem, a variable, a text column in a table, etc. The loading process consists in read the script content to a string variable. This step occurs once.
The next step is the preparation stage. This step will be executed for each table to be processed. The main business of this step is to replace the table placeholders with the current table being processed. Also is possible to set parameter values like any parameter you can need to pass into the sp that you already wrote.
The last step is the execution of the script. As is already loaded into a variable and the placeholders were set to the current table name, you can safely call a ExecuteSQLTask with the sql variable as the input. This process of course happens for each table you want to process.
Ok. Now let's see this in action.
This is a sample database model:
CREATE TABLE [dbo].[t_n](
[id] [int] IDENTITY(1,1) NOT NULL,
[name] [varchar](50) NOT NULL,
[start] [datetime] NULL,
CONSTRAINT [PK_t_n] PRIMARY KEY CLUSTERED ([id] ASC)
) ON [PRIMARY]
where t_n represents any table (t_1, t_2, t_3, etc).
This is your current stored procedure:
CREATE PROCEDURE SpProcessT_n
AS
BEGIN
SET NOCOUNT ON;
SELECT * FROM [t1];
END
GO
Now, transform this stored procedure to a Sql script, placing a placeholder instead of the table name
SET NOCOUNT ON;
SELECT * FROM [$table_name];
I choose to save this in a .sql file in the filesystem to keep the POC as simple as possible.
Next, create a SSIS Package like this:
These are the settings I choose to set up the loop:
And this is the way you can assign the table name to a variable called appropriately _table_name_
This is the setup of the script task, here you find that the variable _table_name_ has read only access, while a new variable called SqlExec has read/write access:
And this is it's Main function:
public void Main()
{
String Table_Name = Dts.Variables["table_name"].Value.ToString();
String SqlScript;
Regex reg = new Regex(#"\$table_name", RegexOptions.Compiled);
using (var f = File.OpenText(#"c:\sqlscript.sql")) {
SqlScript = f.ReadToEnd();
f.Close();
}
SqlScript = reg.Replace(SqlScript, Table_Name);
Dts.Variables["SqlExec"].Value = SqlScript;
Dts.TaskResult = (int)ScriptResults.Success;
}
You can notice that the Dts Variable SqlExec contains the sql script that will be executed. Now you can set the following options in your ExecuteSqlTask:
Successfully tested in MSSQL 2008, if you put a insert inside the script file you will notice new rows in each table.
Hope this helps!
If your application can afford to have one cut-off day late, then you can have a nightly scheduled job to run an SSIS package that will consolidate all 150+ tables into one single huge table. Since the freshness of the results of the queries against that huge table will then be 1 'date' late, this solution will not include any rows that recently been loaded.
You can actually time the running of this package. If it is still amazingly fast, say within 30 minutes, then you can bet to run it in every few hours, like during: the start of work day, lunch break, and end of day. This way you can have a nearly fresh data to query with.
Write a partitioned view including table names?
SELECT 'TableName', t.* FROM TableName t
UNION ALL
SELECT 'TableName2', t.* FROM TableName2 t
Then write a single instead of trigger which uses dynamic SQL for writing (less testing involved with that use of dynamic SQL because you'd just write the simple CRUD operations once for all tables I'd think)
I would not do this with SQL. What you are describing sounds like a traditional ETL situation.
Since all of the customer tables are the same, I would create a table in the data warehouse with all the columns from the client table, a surrogate key column, and a type identifier. You have an option to create a "staging" table here that will only have data in it during the ETL process, or just working on a single "live" table. I would create the staging table.
Then within SSIS package (don't worry you can still schedule from SQL Server agent, it hasn't totally left the DB server), start the ETL process...
E(xtract): copy the data from your source into the staging table in the data warehouse. You most likely want to use a sub-package within a foreach loop and changing the name of the table that you want to process from an external store (most people would say put this in the warehouse, but its up to you).
T(ransform): run the calculations/checks you were talking about, but do it on the whole set...
L(oad): Copy it to your real within the data warehouse.
There are a couple things I would NOT do.
1. Modify the data in the source table.
2. Try to do this in t-sql. Its just not what tsql is good at.
If you need more detail on this approach, I would probably ask the question with some Business Intelligence tags. I'll be traveling for the next week or so, but I will try to look at the comments to clear anything up if you need me to.
I am fairly certain that the standard way to solve this is using dynamic SQL in each sp (your option 2), which has already been ruled out.
Your goal is to make generic, multi-table SQL. I don't see how you intend to accomplish that without sacrificing some efficiency and readability.

Optimize getting counts of rows grouped by first letter in SQLite?

My current query looks something like this:
SELECT SUBSTR(name,1,1), COUNT(*) FROM files GROUP BY SUBSTR(name,1,1)
But it's taking a pretty long time just to do counts on a table that's already indexed by the name column. I saw from this question that some engines might not use indexes correctly for the SUBSTR function, and in fact, sqlite will not use indexes for SUBSTR(string,1,1).
Is there any other approach that would utilize the index and net me some faster queries?
One strategy that is consistent with your access pattern is to add a new indexed column "first_letter" to your table. Use a trigger on to set the value on insert and update. Then your query is a simple group by first_letter.
Another strategy is to create a shadow table which contains an aggregation of the mother table. This isn't easy because it is your job as developer to keep the shadow table consistent with the mother table. Every delete, update or insert in table files needs to be accompanied by a change in the shadow table.
Databases like Oracle have support for materialized views to achieve this automatically but sqlite doesn't.