Apply diffs between two SQL Server Tables efficiently - tsql

I have a once-a-day ingestion case in which I will be getting a large file via FTP which contains the up-to-date versions of 4 database tables.
For each table, I would like to:
Truncate table in staging database
BCP the FTP'd file into that table
Find diffs (IUD) between staging table and production table
Make any required IUDs to production table so it matches staging table
I'm sure this is a reasonably common problem, but I'm not 100% sure as to the best way to approach it.
Are there any built in T-SQL features for this kind of problem, or do I just need to do various joins to find the inserted/updated/deleted records and execute them manually? I'm sure I can manage to do it this second way, but any suggestions are greatly appreciated none-the-less (not looking for working code).

Since nobody ever put it as a real answer, the MERGE command as mentioned by Mikael Eriksson in the comment is the right way to go, it worked great.
Here's a simple example usage:
MERGE dbo.DimProduct AS Target
USING (SELECT ProductID, ProductName, ProductColor, ProductCategory FROM dbo.StagingProduct) AS Source
ON (Target.ProductID = Source.ProductID)
WHEN MATCHED THEN
UPDATE SET Target.ProductName = Source.ProductName
WHEN NOT MATCHED BY TARGET THEN
INSERT (ProductID, ProductName, ProductColor, ProductCategory)
VALUES (Source.ProductID, Source.ProductName, Source.ProductColor, Source.ProductCategory)
OUTPUT $action, Inserted.*, Deleted.*;
from: http://www.bidn.com/blogs/bretupdegraff/bidn-blog/239/using-the-new-tsql-merge-statement-with-sql-server-2008
which helped me.

RedGate's SQL Compare product has automation capabilities.
http://downloads.red-gate.com/HelpPDF/ContinuousIntegrationForDatabasesUsingRedGateSQLTools.pdf
(I am not associated with redgate. I don't even like their products that much, but it seems to fit the case in this instance)

Related

Given the OSM data for a state, find its area

Apologies if this is a silly question, I'm very new to PostgreSQL/PostGIS.
I have downloaded an .osm file for the entire state of Baden-Württemberg from here - https://download.geofabrik.de/europe/germany/baden-wuerttemberg.html
I have to answer the following questions, by writing queries for the same -
What is the area (expressed in m^2) of Baden-Württemberg?
What is by area the smallest city of Baden-Württemberg?
I have imported the osm file into a PostgreSQL using the osm2pgsql tool.
There seem to be multiple tables that are being created, core and generated by osm2pgsql. In addition to the core tables - planet_osm_nodes, planet_osm_ways and planet_osm_rels there are other tables as well, and I'm not quite sure how to use them to complete my tasks.
It would be really helpful if anybody could provide some insight into how I can go about writing the PostGIS query for the same.
I think something like this should give you the cities and their area in square meters in order
SELECT *, ST_Area(way)
FROM planet_osm_polygon
WHERE boundary='administrative'
AND admin_level=7
ORDER BY ST_Area(way)
Same thing should work with the border of Baden-Württemberg only that you would need admin_level=4
https://postgis.net/docs/ST_Area.html
https://wiki.openstreetmap.org/wiki/DE:Grenze

how long would a insert query averagely take if the tables involved go over 20million rows in postgresql?

i got this little school project which involves postgresql but its been 1 hour after running the following command and its still not done:
INSERT INTO series_has_performer(serie_id, performer_id)
(SELECT series.serie_id, performers.performer_id
FROM series, performers WHERE series.title = performers.title and
series.title like '%''%' and series.ep_name = performers.ep_name);
where as the list series is about 3.3million rows long and the list performers 36million rows long.
i have also already disabled autocommit and add the foreign key after the insert is done.
is there something im doing wrong been googling all day and some forums said i should just copy the files into a csv file and then use those to copy into the wanted table but that seems like quite an inefficient way to do it to me. any suggestions?

Tableau Extract API with multiple tables in a database

I am currently experimenting with Tableau Extract API to generate some TDE from the tables I have in a PostgreSQL database. I was able to write a code to generate the TDE from single table, but I would like to do this for multiple joined tables. To be more specific, if I have two tables that are inner joined by some field, how would I generate the TDE for this?
I can see that if I am working with small number of tables, I could use a SQL query with JOIN clauses to create a one gigantic table, and generate the TDE from that table.
>> SELECT * FROM table_1 INNER JOIN table_2
INTO new_table_1
ON table_1.id_1 = table_2.id_2;
>> SELECT * FROM new_table_1 INNER JOIN TABLE_3
INTO new_table_2
ON new_table_1.id_1 = table_3.id_3
and then generate the TDE from new_table_2.
However, I have some tables that have over 40 different fields, so this could get messy.
Is this even a possibility with current version of the API?
You can read from as many tables or other sources as you want. Or use complex query with lots of joins, or create a view and read from that. Usually, creating a view is helpful when you have a complex query joining many tables.
The data extract API is totally agnostic about how or where you get the data to feed it -- the whole point is to allow you to grab data from unusual sources that don't have pre-built drivers for Tableau.
Since Tableau has a Postgres driver and can read from it directly, you don't need to write a program with the data extract API at all. You can define your extract with Tableau Desktop. If you need to schedule automated refreshes of the extract, you can use Tableau Server or its tabcmd command.
Many thanks for your replies. I am aware that I could use Tableau Desktop to define my extract. In fact, I have done this many times before. I am just trying to create the extracts using the API, because I need to create some calculated fields, which is near impossible to create using the Tableau Desktop.
At this point, I am hesitant to use JOINs in the SQL query because the resulting table would look too complicated to comprehend (some of these tables also have same field names).
When you say that I could read from multiple tables or sources, does that mean with the Tableau Extract API? At this point, I cannot find anywhere in this API that accommodates multiple sources. For example, I know that when I use multiple tables in the Tableau Desktop, there are icons on the left hand side that tells me that the extract is composed of multiple tables. This just doesn't seem to be happening with the API, which leaves me stranded. Anyways, thank you again for your replies.
Going back to the topic, this is something that I tried few days ago on my python code
try:
tdefile= tde.Extract("extract.tde")
except:
os.remove("extract.tde")
tdefile = tde.Extract("extract.tde")
tableDef = tde.TableDefinition()
# Read each column in table and set the column data types using tableDef.addColumn
# Some code goes here...
for eachTable in tableNames:
tableAdd = tdeFile.addTable(eachTable, tableDef)
# Use SQL query to retrieve bunch_of_rows from eachTable
for some_row in bunch_of_rows:
# Read each row in table, and set the values in each column position of each row
# Some code goes here...
tableAdd.insert(some_row)
some_row.close()
tdefile.close()
When I execute this code, I get the error that eachTable has to be called "Extract".
Of course, this code has its flaws, as there is no where in this code that tells how each table are being joined.
So I am little thrown off here, because it doesn't seem like I can use multiple tables unless I use JOINs to generate one table that contains everything.

Updating the text of a large number of stored procedures

The question pretty much sums it up. I've got to replace text in a large number for store procedures. Its not so many that doing it manually is impossible, but enough that I'm asking the question. I also prefer automation as it reduces the change of user error when we make the change in production.
I can Identify them like this:
select OBJECT_DEFINITION(object_id), *
from sys.procedures
where OBJECT_DEFINITION(object_id) like '%''MyExampleLiteral''%'
order by name
Is there any way to mass update them all to change 'MyExampleLiteral' to 'MyOtherExampleLiteral'?
I'd even settle for a way to open all the stored procs. Just Finding these store procs in a larger list will take some time.
I thought about generating alter statements using the above select statements, but then I lose line breaks.
Thanks in advance,
This is a Microsoft SQL Server.
There are different tools to use depending on the database in question. For example, Microsoft SQL Server Data Tools integrates with Visual Studio, and allows you to do these types of operations fairly easily. The database is stored in your solution as scripts, which you can then search and replace any keyword you wish. I'm assuming there would be similar tools available for other platforms.
You could do this with dynamic sql. Query the system tables to get all the SPs containing your "MyExampleLiteral":
SELECT [object_id] FROM sys.objects o
WHERE type_desc = 'SQL_STORED_PROCEDURE'
AND is_ms_shipped = 0
AND OBJECT_DEFINITION(o.[object_id]) LIKE '%<search string>%'
Then, write a while loop to go through those object_ids. In the while loop, get the OBJECT_DEFINITION() into a string and replace the "MyExampleLiteral", then replace CREATE PROCEDURE with ALTER PROCEDURE and execute the string using sp_executesql.
Doing something this crazy, make sure you backup the database first.

How to tell if record has changed in Postgres

I have a bit of an "upsert" type of question... but, I want to throw it out there because it's a little bit different than any that I've read on stackoverflow.
Basic problem.
I'm working on moving from mysql to PostgreSQL 9.1.5 (hosted on Heroku). As a part of that, I need to import multiple CSV files everyday. Some of the data is sales information and is almost guaranteed to be new and need to be inserted. But, other parts of the data is almost guaranteed to be the same. For example, the csv files (note plural) will have POS (point of sale) information in them. This rarely changes (and is most likely only via additions). Then there is product information. There are about 10,000 products (vast majority will be unchanged, but it's possible to have both additions and updates).
The final item (but is important), is that I have a requirement to be able to provide an audit trail/information for any given item. For example, if I add a new POS record, I need to be able to trace that back to the file it was found in. If I change a UPC code or description of a product, then I need to be able to trace it back to the import (and file) where the change came from.
Solution that I'm contemplating.
Since the data is provided to me via CSV, then I'm working around the idea that COPY will be the best/fastest way. The structure of the data in the files is not exactly what I have in the database (i.e. final destination). So, I'm copying them into tables in the staging schema that match the CSV (note: one schema per datasource). The tables in the staging schemas will have a before insert row triggers. These triggers can decide what to do with the data (insert, update or ignore).
For the tables that are most likely to contain new data, then it will try to insert first. If the record is already there, then it will return NULL (and stop the insert into the staging table). For tables that rarely change, then it will query the table and see if the record is found. If it is, then I need a way to see if any of the fields are changed. (because remember, I need to show that the record was modified by import x from file y) I obviously can just boiler plate out the code and test each column. But, was looking for something a little more "eloquent" and more maintainable than that.
In a way, what I'm kind of doing is combining a importing system with an audit trail system. So, in researching audit trails, I reviewed the following wiki.postgresql.org article. It seems like the hstore might be a nice way of getting changes (and being able to easily ignore some columns in the table that aren't important - e.g. "last_modified")
I'm about 90% sure it will all work... I've created some testing tables etc and played around with it.
My question?
Is a better, more maintainable way of accomplishing this task of finding the maybe 3 records out of 10K that require a change to the database. I could certainly write a python script (or something else) that reads the file and tries to figure out what to do with each record, but that feels horribly inefficient and will lead to lots of round trips.
A few final things:
I don't have control over the input files. I would love it if they only sent me the deltas, but they don't and it's completely outside of my control or influence.
he system is grow and new data sources are likely to be added that will greatly increase the amount of data being processed (so, I'm trying to keep things efficient)
I know this is not nice, simple SO question (like "how to sort a list in python") but I believe one of the great things about SO is that you can ask hard questions and people will share their thoughts about how they think the best way to solve it is.
I have lots of similar operations. What I do is COPY to temporary staging tables:
CREATE TEMP TABLE target_tmp AS
SELECT * FROM target_tbl LIMIT 0; -- only copy structure, no data
COPY target_tmp FROM '/path/to/target.csv';
For performance, run ANALYZE - temp. tables are not analyzed by autovacuum!
ANALYZE target_tmp;
Also for performance, maybe even create an index or two on the temp table, or add a primary key if the data allows for that.
ALTER TABLE ADD CONSTRAINT target_tmp_pkey PRIMARY KEY(target_id);
You don't need the performance stuff for small imports.
Then use the full scope of SQL commands to digest the new data.
For instance, if the primary key of the target table is target_id ..
Maybe DELETE what isn't there any more?
DELETE FROM target_tbl t
WHERE NOT EXISTS (
SELECT 1 FROM target_tmp t1
WHERE t1.target_id = t.target_id
);
Then UPDATE what's already there:
UPDATE target_tbl t
SET col1 = t1.col1
FROM target_tmp t1
WHERE t.target_id = t1.target_id
To avoid empty UPDATEs, simply add:
...
AND col1 IS DISTINCT FROM t1.col1; -- repeat for relevant columns
Or, if the whole row is relevant:
...
AND t IS DISTINCT FROM t1; -- check the whole row
Then INSERT what's new:
INSERT INTO target_tbl(target_id, col1)
SELECT t1.target_id, t1.col1
FROM target_tmp t1
LEFT JOIN target_tbl t USING (target_id)
WHERE t.target_id IS NULL;
Clean up if your session goes on (temp tables are dropped at end of session automatically):
DROP TABLE target_tmp;
Or use ON COMMIT DROP or similar with CREATE TEMP TABLE.
Code untested, but should work in any modern version of PostgreSQL except for typos.