Upsert in Amazon RedShift without Function or Stored Procedures - amazon-redshift

As there is no support for user-defined functions or stored procedures in RedShift, how can I achieve an UPSERT mechanism in RedShift, which is based on ParAccel, a PostgreSQL 8.0.2 fork?
Currently, I'm trying to achieve an UPSERT mechanism using an IF...THEN...ELSE... statement,
e.g.:
IF NOT EXISTS(SELECT...WHERE(SELECT..))
THEN INSERT INTO tblABC() SELECT... FROM tblXYZ
ELSE UPDATE tblABC SET.,.,.,. FROM tblXYZ WHERE...
which gives me an error, since I'm running this code on its own rather than inside a function or stored procedure.
So, is there any way to achieve UPSERT?
Thanks

You should probably read this article on upsert by depesz. You can't rely on SERIALIZABLE for this since, AFAIK, ParAccel doesn't support full serializability the way Pg 9.1+ does. As outlined in that post, you can't really do what you want purely in the DB anyway.
The short version is that even on current PostgreSQL versions that support writable CTEs it's still hard. On an 8.0 based ParAccel, you're pretty much out of luck.
I'd do a staged merge. COPY the new data to a temporary table on the server, LOCK the destination table, then do an UPDATE ... FROM followed by an INSERT INTO ... SELECT. Doing the data uploads in big chunks and locking the table for the upserts is reasonably in keeping with how Redshift is used anyway.
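The staged merge described above can be sketched roughly as follows. This is only an illustration: SQLite (via Python's standard library) stands in for Redshift so the example runs anywhere, and the table `tbl_abc`, the `staging` table, and the `id`/`val` columns are hypothetical names. On Redshift you would COPY into the temp table, LOCK the destination, and use UPDATE ... FROM instead of the correlated subquery shown here.

```python
import sqlite3


def staged_upsert(conn, rows):
    """Merge (id, val) rows into tbl_abc through a staging table.

    Assumption: SQLite stands in for Redshift; all table and column
    names are hypothetical.
    """
    # Step 1: create the staging table (a temp table loaded via COPY on Redshift).
    conn.execute("CREATE TEMP TABLE staging (id INTEGER PRIMARY KEY, val TEXT)")
    try:
        # "with conn" commits on success and rolls back on error, so the
        # UPDATE and the INSERT are applied together or not at all.
        with conn:
            conn.executemany("INSERT INTO staging (id, val) VALUES (?, ?)", rows)
            # Step 2: update rows that already exist in the destination.
            conn.execute(
                """
                UPDATE tbl_abc
                   SET val = (SELECT s.val FROM staging s WHERE s.id = tbl_abc.id)
                 WHERE id IN (SELECT id FROM staging)
                """
            )
            # Step 3: insert rows that are not in the destination yet.
            conn.execute(
                """
                INSERT INTO tbl_abc (id, val)
                SELECT s.id, s.val
                  FROM staging s
                 WHERE NOT EXISTS (SELECT 1 FROM tbl_abc t WHERE t.id = s.id)
                """
            )
    finally:
        # Clean up the staging table whether or not the merge succeeded.
        conn.execute("DROP TABLE staging")
```

Running the update before the insert matters: if you insert first, the following UPDATE would needlessly touch the rows you just added.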
Another approach is to externally co-ordinate the upserts via something local to your application cluster. Have all your tools communicate via an external tool where they take an "insert-intent lock" before doing an insert. You want a distributed locking tool appropriate to your system. If everything's running inside one application server, it might be as simple as a synchronized singleton object.
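For the single-application-server case, the "synchronized singleton" idea might look something like the sketch below. Everything here is hypothetical: `InsertIntentLocks` is an illustrative helper, and a plain dict stands in for the destination table. A multi-server setup would need a real distributed lock instead.

```python
import threading


class InsertIntentLocks:
    """Process-wide registry handing out one lock per key (hypothetical helper)."""

    _instance = None
    _instance_guard = threading.Lock()

    def __init__(self):
        self._guard = threading.Lock()
        self._locks = {}

    @classmethod
    def instance(cls):
        # Guard creation so concurrent callers all get the same singleton.
        with cls._instance_guard:
            if cls._instance is None:
                cls._instance = cls()
            return cls._instance

    def for_key(self, key):
        # Create the per-key lock on first use and reuse it afterwards.
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())


def upsert(store, key, value):
    """Insert or update `key` while holding its insert-intent lock.

    `store` is a dict standing in for the destination table.
    """
    with InsertIntentLocks.instance().for_key(key):
        if key in store:
            store[key] = value
            return "updated"
        store[key] = value
        return "inserted"
```

Because the existence check and the write happen under the same lock, two callers can never both decide that the same key is missing.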

Related

Is it possible to evaluate a Postgres expression without connecting to a database?

PostgreSQL has excellent support for evaluating JSONPath expressions against JSON data.
For example, this query returns true because the value of the nested field is indeed "foo".
select '{"header": {"nested": "foo"}}'::jsonb @? '$.header ? (@.nested == "foo")'
Notably this query does not reference any schemas or tables. Ideally, I would like to use this functionality of PostgreSQL without creating or connecting to a full database instance. Is it possible to run PostgreSQL in such a way that it doesn't have schemas or tables, but is still able to evaluate "standalone" queries?
Some other context on the project: we need to evaluate JSONPath expressions against JSON data in both a Postgres database and a Python application. Unfortunately, Python does not have any JSONPath libraries that support enough of the spec to be useful to us.
"Ideally, I would like to use this functionality of PostgreSQL without creating or connecting to a full database instance."
Well, it is open source. You can always pull out the source code for this functionality you want and adapt it to compile by itself. But that seems like a large and annoying undertaking, and I probably wouldn't do it. And short of that, no.
Why do you need this? Are you worried about scalability or ease of installation or performance or what? If you are already using PostgreSQL anyway, firing up a dummy connection to just fire some queries at the JSONB engine doesn't seem too hard.
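A dummy connection used purely as a JSONPath evaluator could be wrapped along these lines. This is a sketch under assumptions: `jsonpath_exists` is a hypothetical helper, `conn` is assumed to be a psycopg2-style DB-API connection (with `%s` placeholders) to any PostgreSQL 12+ server, and no tables or schemas are touched.

```python
def jsonpath_exists(conn, document, path):
    """Return True if the JSONPath `path` matches anything in `document`.

    Assumptions: `conn` is a psycopg2-style connection to PostgreSQL 12+;
    `document` is a JSON string and `path` a JSONPath string.
    """
    with conn.cursor() as cur:
        # Both values are passed as parameters and cast on the server,
        # so the query needs no tables at all.
        cur.execute("SELECT %s::jsonb @? %s::jsonpath", (document, path))
        return cur.fetchone()[0]
```

The Python side then calls `jsonpath_exists(conn, '{"header": {"nested": "foo"}}', '$.header ? (@.nested == "foo")')`, so both the database and the application use the same JSONPath engine.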

How to set Postgres tables to use columnar storage using dbt?

We are using dbt to manage our data models in MSSQL, and we are considering moving to PostgreSQL Citus.
I'm not clear on how I could use Citus's features via dbt. I'm thinking specifically of data compression and columnar storage.
Any ideas?
TBH, I don't have any experience with dbt.
However, given that the dbt docs state that they support PostgreSQL, and Citus Columnar is a table-access-method level abstraction that doesn't require any changes to the queries(*), I wouldn't expect any compatibility issues when combining both.
My guess is that after creating your tables with the USING columnar option, you can just follow the dbt docs for the rest.
(*): You might still want to have a look into that page to see what Citus Columnar does or doesn't support atm.
https://github.com/citusdata/citus/tree/master/src/backend/columnar#introduction
Was wondering if you had any updates on how this process has gone for you?
I'm looking to implement dbt at my company and was considering Postgres/Citus as well.
I believe you could use a combination of dbt's pre-hook/post-hook functions and the incremental materialization instead of the table or view materialization.
My thinking is to either create the columnar table in a pre-hook and use the dbt incremental update, or materialize as a view/table and then run SELECT alter_table_set_access_method('table_name', 'columnar'); in a post-hook.

Is there a way to show everything that was changed in a PostgreSQL database during a transaction?

I often have to execute complex sql scripts in a single transaction on a large PostgreSQL database and I would like to verify everything that was changed during the transaction.
Verifying each single entry on each table "by hand" would take ages.
Dumping the database before and after the script to plain sql and using diff on the dumps isn't really an option since each dump would be about 50G of data.
Is there a way to show all the data that was added, deleted or modified during a single transaction?
What you are looking for is one of the most searched-for topics when it comes to capturing database changes. You could call it a kind of version control.
As far as I know, sadly, there is no built-in approach for this in PostgreSQL or MySQL. You can work around that by adding triggers for the operations you care about most.
You can create backup schemas and tables to capture the rows that are updated, created, or deleted.
In this way you can achieve what you want. I know this process is fully manual, but it is really effective.
If you need to analyze the script's behaviour only sporadically, then the easiest approach would be to change server configuration parameter log_min_duration_statement to 0 and then back to any value it had before the analysis. Then all of the script activity will be written to the instance log.
This approach is not suitable if your storage is not prepared to accommodate this amount of data, or for systems in which you don't want sensitive client data to be written to a plain-text log file.

Challenges in migrating from IBM DB2 to Netezza

Because of the added advantages of high performance and reduced turnaround time, I am trying to migrate all the data from IBM DB2 to Netezza in my organization.
But what I realized is that there is no concept of a primary key in Netezza? If true, I can try to take care of this issue by using a duplicate removal stage in Datastage.
Also, could you guys please help me understand whether there are any more constraints that I should consider, or challenges I could face, in a DB2 to Netezza migration?
Netezza does allow you to specify Primary Key and Foreign Key constraints, but they are not enforced. Which is to say that they are purely informational (for both the user and the optimizer). A well-formed upsert process in ETL is a good way to manage this.
On the topic of other issues you may face, here are a few thoughts:
Surrogate Keys
Be sure that you generate your surrogate keys either with Netezza's SEQUENCE object, or with a surrogate key generator in your ETL tool. Avoid using ROW_NUMBER for this process as it will most often prevent you from exploiting the parallel nature of the system when used in this way.
Stored Procedures
Stored procedures should avoid row-by-row/cursor-based processing when possible, as this is another case where you may prevent yourself from exploiting the parallel nature of the system.
SQL Extension Functions
If you find that you rely on functions that exist in DB2 that you don't find natively in Netezza, be sure to check what is available in the SQL Extensions Toolkit, which is included with Netezza, but not automatically installed/configured.
MERGE
If you rely on MERGE in your current environment, be aware that you must be on v7.2.1 to use MERGE in Netezza. Otherwise you will have to break it out into an INSERT/UPDATE operation.
Once you load the data in Netezza, one method we have utilized is to create a View to access the data and only expose the view. The view would have the logic inside to remove the duplicates.
Good luck!
Delan
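The de-duplicating view mentioned above might look roughly like this. It is only a sketch: SQLite stands in for Netezza so it can be run locally, and `customer_raw`, `customer`, and the `load_seq` column (assumed to increase with every load) are hypothetical names. The view keeps only the most recently loaded row per key.

```python
import sqlite3


def build_dedup_view(conn):
    """Create a raw table plus a view exposing one row per cust_id.

    Assumption: load_seq is unique and grows with each load, so the
    highest load_seq per key marks the row to keep.
    """
    conn.executescript(
        """
        CREATE TABLE customer_raw (cust_id INTEGER, name TEXT, load_seq INTEGER);

        -- Only the view is exposed to users; duplicates stay hidden.
        CREATE VIEW customer AS
        SELECT r.cust_id, r.name
          FROM customer_raw r
         WHERE r.load_seq = (SELECT MAX(r2.load_seq)
                               FROM customer_raw r2
                              WHERE r2.cust_id = r.cust_id);
        """
    )
```

The trade-off is that the duplicates are filtered on every read rather than removed once at load time.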

Data Warehousing Postgres

We're considering using SSIS to maintain a PostgreSQL data warehouse. I've used it before between SQL Servers with no problems, but am having a lot of difficulty getting it to play nicely with Postgres. I’m using the evaluation version of the OLEDB PGNP data provider (http://www.postgresql.org/about/news.1004).
I wanted to start with something simple like UPSERT on the fact table (10k-15k rows are updated/inserted daily), but this is proving very difficult (not to mention I’ll want to use surrogate keys in the future).
I’ve attempted (Link) and (http://consultingblogs.emc.com/jamiethomson/archive/2006/09/12/SSIS_3A00_-Checking-if-a-row-exists-and-if-it-does_2C00_-has-it-changed.aspx), which are effectively the same (except I don’t really understand the union all at the end when I’m trying to upsert). But I run into the same problem with parameters when doing the update using an OLE DB command, which I tried to overcome using (http://technet.microsoft.com/en-us/library/ms141773.aspx), but that just doesn’t seem to work. I get a validation error:
The external columns for complent.... are out of sync with the datasource columns... external column “Param_2” needs to be removed from the external columns.
(This error is repeated for the first two parameters as well. I never came across this using the SQL connection, as it supports named parameters.)
Has anyone come across this?
AND:
The fact that this simple task is apparently so difficult to do in SSIS suggests I’m using the wrong tool for the job. Is there a better (and still flexible) way of doing this? Or would another ETL package be better for use between two Postgres databases? Other options include any listed on (http://en.wikipedia.org/wiki/Extract,_transform,_load#Open-source_ETL_frameworks). I could just go and write a load of SQL to do this for me, but I wanted a neat and easily maintainable solution.
I have used the Slowly Changing Dimension wizard for this with good success. It may give you what you are looking for.
http://msdn.microsoft.com/en-us/library/ms141715.aspx
The External Columns Out Of Sync: SSIS is Case Sensitive - I encountered this issue multiple times and it makes me want to pull my hair out.
This simple task is going to take some work either way. SSIS is by no means an enterprise class ETL product yet, but it does give you some quick and easy functionality, and is sufficient for most ETL work. I guess it is also about your level of comfort with it as well.
SCD is way too slow for what I want. I need to use set-based SQL.
It turned out that a lot of my problems were with bugs in the provider.
I opened a forum topic (http://www.pgoledb.com/forum/viewtopic.php?f=4&t=49) and had a useful discussion with the moderator/support/developer person.
Also, Postgres doesn't let you do cross-db queries, so I solved the problem this way:
Data Source from Production DB to a temp Archive DB table
Run set-based query between temp table and archive table
Truncate temp table
Note that the temp table is not actually a temp table, but a copy of the archive table schema used to store data in temporarily.
Took a while, but I got there in the end.
"This simple task is going to take some work either way. SSIS is by no means an enterprise class ETL product yet, but it does give you some quick and easy functionality, and is sufficient for most ETL work. I guess it is also about your level of comfort with it as well."
What enterprise ETL solution would you suggest?