I have a look up table that needs to be updated from my spark code. There are multiple spark jobs using the same look up table.
I have a dataframe that needs to be inserted or updated into Postgres table. I cant do it using sparks native postgres library because it only supports appends and overwrites, however I have a clear case of insert or update.
Things i have tried
I have tried using slick but am unable to get anything working and the documentation seemed confusing.
I also tried using foreach partition and iterating on every row and running the query using slicks db.run functionality, still not working and no errors
The thing is I'm unable to get a working solution, can someone share their code snippet on how they got it to be working? Seems like a pretty common use case but Im stuck.
Related
I am trying to implement a FULL-TEXT SEARCH (PostgreSQL) on kotlin by using kotlin-exposed. I have my queries in raw SQL but cannot write queries containing to_tsvector()or to_tsquery() in kotlin. I couldn't actually find anything similar anywhere. After a bit of reading, I understood that complex queries could be written as raw SQL here (i couldn't get it working too) but there is a chance of SQL injection in that. Is there a way to tackle this?
I am not posting any code since what I've tried is just trial and error, actually, the methods are not even available in IDE. Any help is appreciated. My DB is PostgreSQL.
I have worked on a Python application that queries a PostgreSQL DB using SQLAlchemy. My DB has a large number of tables many of which get modified every release. Whenever I make a change to a table (say, renaming a column), I execute mypy to find places in the code which are referencing old column names and I fix them in all such places. I now have to write some code using C/C++ and I have used libpq. The problem is that I have to manually scan the code to find places where changes are required whenever I make table changes. This is error prone. Is there a better way to ensure that the C/C++ code is in sync with the DB schema ?
Looking to find if anybody got through a way to perform upserts (append / update / partial inserts/update) on Phoenix using Apache Spark. I could see as per Phoenix documentation save SaveMode.Overwrite is only supported - which is overwrite with full load. I tried changing the mode it throws error.
Currently, we have *.hql jobs running to perform this operation, now we want to rewrite them in Spark Scala. Thanks for sharing your valuable inputs.
While Phoenix connector indeed supports only SaveMode.Overwrite, the implementation doesn't conform to the Spark standard, which states that:
Overwrite mode means that when saving a DataFrame to a data source, if data/table already exists, existing data is expected to be overwritten by the contents of the DataFrame
If you check the source, you'll see that saveToPhoenix just calls saveAsNewAPIHadoopFile with PhoenixOutputFormat, which
internally builds the UPSERT query for you
In other words SaveMode.Overwrite with Phoenix Connector is in fact UPSERT.
I am trying to save a dataframe as table using saveAsTable and well it works but I want to save the table to not the default database, Does anyone know if there is a way to set the database to use? I tried with hiveContext.sql("use db_name") and this did not seem to do it. There is an saveAsTable that takes in some options. Is there a way that i can do it with the options?
It does not look like you can set the database name yet... if you read the HiveContext.scala code you see a lot comments like...
// TODO: Database support...
So I am guessing that its not supported yet.
Update:
In spark 1.5.1 this works, which did not work in early versions. In early version you had to use a using statement like in deformitysnot answer.
df.write.format("parquet").mode(SaveMode.Append).saveAsTable("databaseName.tablename")
This was fixed in Spark 1.5 and you can do it using :
hiveContext.sql("USE sparkTables");
dataFrame.saveAsTable("tab3", "orc", SaveMode.Overwrite);
By the way in Spark 1.5 you can read Spark saved dataframes from Hive command line (beeline, ...), something that was impossible in earlier versions.
We're considering using SSIS to maintain a PostgreSql data warehouse. I've used it before between SQL Servers with no problems, but am having a lot of difficulty getting it to play nicely with Postgres. I’m using the evaluation version of the OLEDB PGNP data provider (http://www.postgresql.org/about/news.1004).
I wanted to start with something simple like UPSERT on the fact table (10k-15k rows are updated/inserted daily), but this is proving very difficult (not to mention I’ll want to use surrogate keys in the future).
I’ve attempted (Link) and (http://consultingblogs.emc.com/jamiethomson/archive/2006/09/12/SSIS_3A00_-Checking-if-a-row-exists-and-if-it-does_2C00_-has-it-changed.aspx) which are effectively the same (except I don’t really understand the union all at the end when I’m trying to upsert) But I run into the same problem with parameters when doing the update using a OLEDb command – which I tried to overcome using (http://technet.microsoft.com/en-us/library/ms141773.aspx) but that just doesn’t seem to work, I get a validation error –
The external columns for complent.... are out of sync with the datasource columns... external column “Param_2” needs to be removed from the external columns.
(this error is repeated for the first two parameters as well – never came across this using the sql connection as it supports named parameters)
Has anyone come across this?
AND:
The fact that this simple task is apparently so difficult to do in SSIS suggests I’m using the wrong tool for the job - is there a better (and still flexible) way of doing this? Or would another ETL package be better for use between two Postgres database? -Other options include any listed on (http://en.wikipedia.org/wiki/Extract,_transform,_load#Open-source_ETL_frameworks). I could just go and write a load of SQL to do this for me, but I wanted a neat and easily maintainable solution.
I have used the Slowly Changing Dimension wizard for this with good success. It may give you what you are looking for especially with the Wizard
http://msdn.microsoft.com/en-us/library/ms141715.aspx
The External Columns Out Of Sync: SSIS is Case Sensitive - I encountered this issue multiple times and it makes me want to pull my hair out.
This simple task is going to take some work either way. SSIS is by no means an enterprise class ETL product yet, but it does give you some quick and easy functionality, and is sufficient for most ETL work. I guess it is also about your level of comfort with it as well.
SCD is way too slow for what I want. I need to use set based sql.
It turned out that a lot of my problems were with bugs in the provider.
I opened a forum topic (http://www.pgoledb.com/forum/viewtopic.php?f=4&t=49) and had a useful discussion with the moderator/support/developer person.
Also Postgres doesn't let you do cross db querys, so I solved the problem this way:
Data Source from Production DB to a temp Archive DB table
Run set based query between temp table and archive table
Truncate temp table
Note that the temp table is not atchally a temp table, but a copy of the archive table schema to temporarily stored data in.
Took a while, but I got there in the end.
This simple task is going to take some work either way. SSIS is by no means an enterprise class ETL product yet, but it does give you some quick and easy functionality, and is sufficient for most ETL work. I guess it is also about your level of comfort with it as well.
What enterprise ETL solution would you suggest?