Upsert to Phoenix table in Apache Spark - scala

I'm looking to find out whether anybody has found a way to perform upserts (append / update / partial insert/update) on Phoenix using Apache Spark. As per the Phoenix documentation, only SaveMode.Overwrite is supported for saving, which amounts to an overwrite with a full load. When I tried changing the mode, it threw an error.
Currently we have *.hql jobs running to perform this operation; now we want to rewrite them in Spark Scala. Thanks for sharing your valuable inputs.

While the Phoenix connector indeed supports only SaveMode.Overwrite, the implementation doesn't conform to the Spark standard, which states that:
Overwrite mode means that when saving a DataFrame to a data source, if data/table already exists, existing data is expected to be overwritten by the contents of the DataFrame
If you check the source, you'll see that saveToPhoenix just calls saveAsNewAPIHadoopFile with PhoenixOutputFormat, which internally builds the UPSERT query for you.
In other words, SaveMode.Overwrite with the Phoenix connector is in fact an UPSERT.
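For reference, a minimal sketch of such a write using the phoenix-spark connector, assuming an existing Phoenix table; the table name, columns, and ZooKeeper URL below are placeholders:

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("PhoenixUpsertSketch").getOrCreate()
import spark.implicits._

// Hypothetical DataFrame matching a Phoenix table OUTPUT_TABLE(ID BIGINT PRIMARY KEY, COL1 VARCHAR)
val df = Seq((1L, "updated"), (2L, "inserted")).toDF("ID", "COL1")

// Despite the name, Overwrite is translated into Phoenix UPSERT statements:
// rows with an existing primary key are updated, the rest are inserted.
df.write
  .format("org.apache.phoenix.spark")
  .mode(SaveMode.Overwrite)
  .option("table", "OUTPUT_TABLE")   // placeholder table name
  .option("zkUrl", "zkhost:2181")    // placeholder ZooKeeper quorum
  .save()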

Related

Is there a way for PySpark to give the user a warning when executing a query on an Apache Hive table without specifying partition keys?

We are using Spark SQL with Apache Hive tables (via the AWS Glue Data Catalog). One problem is that when we execute a Spark SQL query without specifying the partitions to read via the WHERE clause, it gives the user no warning that it will proceed to load all partitions and thus will likely time out or fail.
Is there a way to ideally error out, or at least give some warning, when a user executes a Spark SQL query on an Apache Hive table without specifying partition keys? It's very easy to forget to do this.
I searched for existing solutions to this and found none, both on Stack Overflow and on the wider internet. I was expecting some configuration option/code that would help me achieve the goal.
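No built-in Spark setting for this comes to mind, but one crude workaround (purely an illustrative sketch, written in Scala to match the rest of this thread; the helper and table names are hypothetical) is to route queries through a guard that checks whether the SQL text mentions at least one partition column:

import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical guard: refuse to run SQL against a partitioned table when no partition
// column appears in the query text. A crude string check, not real partition-pruning analysis.
def runGuarded(spark: SparkSession, tableName: String, sql: String): DataFrame = {
  val partitionCols = spark.catalog.listColumns(tableName)
    .collect()
    .filter(_.isPartition)
    .map(_.name.toLowerCase)
  val mentionsPartition = partitionCols.isEmpty ||
    partitionCols.exists(col => sql.toLowerCase.contains(col))
  require(mentionsPartition,
    s"Query on partitioned table $tableName does not reference any of: ${partitionCols.mkString(", ")}")
  spark.sql(sql)
}

// Usage (hypothetical table with partition column dt):
// runGuarded(spark, "analytics.events", "SELECT * FROM analytics.events WHERE dt = '2023-01-01'")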

How to do a bulk insert / bulk load into HBase through Glue

I am trying to do a bulk insert or bulk load into HBase on EMR using Glue Scala (Spark 3.1). I got this working using
table.put(List<Put>);
but without satisfactory performance. I tried to insert through a Spark DataFrame following some examples, but the libraries' features are only compatible with Spark 1.6. I also tried to reproduce some examples of inserting HFiles into HDFS and processing them through HFileOutputFormat and HFileOutputFormat2, but these classes were removed from newer versions. How can I perform a high-performance insert into HBase with current libraries, or even a bulk load? The examples that I found were old, and the HBase reference book wasn't clear on this point.
Thank you.
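For what it's worth, HFileOutputFormat2 still appears to ship with HBase 2.x (in the hbase-mapreduce module), and a common pattern is to write HFiles from Spark and then hand them to the incremental loader. The following is only a rough sketch under that assumption; the table name, column family, qualifier, id column, input path, and staging directory are all placeholders, and the rows must be globally sorted by row key before writing:

import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.{HBaseConfiguration, KeyValue, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2
import org.apache.hadoop.hbase.tool.LoadIncrementalHFiles
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("HBaseBulkLoadSketch").getOrCreate()

val conf = HBaseConfiguration.create()
val tableName = TableName.valueOf("my_table")              // placeholder table
val connection = ConnectionFactory.createConnection(conf)
val table = connection.getTable(tableName)
val regionLocator = connection.getRegionLocator(tableName)

// Let HFileOutputFormat2 pick up the table layout, compression, and bloom filter settings.
val job = Job.getInstance(conf)
job.setMapOutputKeyClass(classOf[ImmutableBytesWritable])
job.setMapOutputValueClass(classOf[KeyValue])
HFileOutputFormat2.configureIncrementalLoad(job, table.getDescriptor, regionLocator)

// Hypothetical source DataFrame with string columns "id" and "value".
val df = spark.read.parquet("s3://my-bucket/input/")       // placeholder input path

// Build (rowKey, KeyValue) pairs and sort them globally by row key, as the output format requires.
val cells = df.rdd
  .map { row =>
    val id = row.getAs[String]("id")
    val kv = new KeyValue(Bytes.toBytes(id), Bytes.toBytes("cf"), Bytes.toBytes("value"),
      Bytes.toBytes(row.getAs[String]("value")))
    (id, kv)
  }
  .sortByKey()
  .map { case (id, kv) => (new ImmutableBytesWritable(Bytes.toBytes(id)), kv) }

// Write HFiles to a staging directory, then hand them to the region servers.
// LoadIncrementalHFiles is deprecated in recent HBase releases in favour of BulkLoadHFiles,
// but is still present in 2.x.
val stagingDir = "/tmp/hfiles-staging"                      // placeholder staging path
cells.saveAsNewAPIHadoopFile(
  stagingDir,
  classOf[ImmutableBytesWritable],
  classOf[KeyValue],
  classOf[HFileOutputFormat2],
  job.getConfiguration)

new LoadIncrementalHFiles(conf).doBulkLoad(new Path(stagingDir), connection.getAdmin, table, regionLocator)
connection.close()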

Spark Dataframe delete to Elasticsearch

I am using an Apache Spark DataFrame and I want to delete data from Elasticsearch.
For adding and updating I am using the command below:
val esURL = "https://56h874526b6741db87c3c91324g755.westeurope.azure.elastic-cloud.com:9243"
val indexName = "test_elastic/test_elastic"
df.write
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes.wan.only", "true")
  .option("es.port", "443")
  .option("es.net.ssl", "true")
  .option("es.net.http.auth.user", "userid")
  .option("es.net.http.auth.pass", "pwd")
  .option("es.nodes", esURL)
  .option("es.mapping.id", "primary_key")
  .mode("append")
  .save(indexName)
My question is how I can delete some rows from an Elasticsearch index. In my case the Elasticsearch index is "test_elastic".
A quick search in the repository shows that the Elasticsearch Hadoop connector first gained support for deletes in the 8.x and 7.8 versions, neither of which had been released at the moment I write this.
https://github.com/elastic/elasticsearch-hadoop/pull/1324
From maintainer jbaiera:
LGTM! Thanks very much for your dedication on getting this in! I'll go ahead and merge it in to master and backport it to the 7.x branch. It should be available in the 7.8.0 release when that lands.
From the current code changes, supported ES versions should be 2.x to 8.x, but there is no documentation yet that I could find (did not look much though), and no information about direct usage from the Spark API (and I'm not even sure a delete API exists for Spark DataFrames, whatever their data source).
On the other hand, there is a write mode called "overwrite" that works and could allow you to achieve data deletion, but overwriting the whole index may not be practical depending on the data volume.
I think your best bet would be to drop out of the Spark DataFrame API and switch to one (or several) direct calls to the bulk delete API.
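As an illustrative sketch of that approach only (the URL, credentials, index name, and id column below are placeholders), you could collect the document ids per partition and post delete actions to the _bulk REST endpoint yourself:

import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets
import java.util.Base64
import org.apache.spark.sql.DataFrame

// Post bulk "delete" actions to Elasticsearch for every document id in the DataFrame.
// esUrl, index, user and pwd are placeholders; idColumn must hold the _id values to delete.
def bulkDelete(df: DataFrame, esUrl: String, index: String, idColumn: String,
               user: String, pwd: String): Unit = {
  df.select(idColumn).rdd.foreachPartition { rows =>
    // NDJSON body: one delete action per line, with a trailing newline.
    val body = rows
      .map(r => s"""{ "delete": { "_index": "$index", "_id": "${r.getString(0)}" } }""")
      .mkString("", "\n", "\n")
    if (body.trim.nonEmpty) {
      val conn = new URL(s"$esUrl/_bulk").openConnection().asInstanceOf[HttpURLConnection]
      val auth = Base64.getEncoder.encodeToString(s"$user:$pwd".getBytes(StandardCharsets.UTF_8))
      conn.setRequestMethod("POST")
      conn.setRequestProperty("Content-Type", "application/x-ndjson")
      conn.setRequestProperty("Authorization", s"Basic $auth")
      conn.setDoOutput(true)
      conn.getOutputStream.write(body.getBytes(StandardCharsets.UTF_8))
      conn.getOutputStream.close()
      conn.getResponseCode                                  // force execution; check/log the response in real code
      conn.disconnect()
    }
  }
}

Note that the action format above targets ES 7+ (no _type in the action metadata), so it would need adjusting for older clusters.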

How to access the HIVE ACID table in Spark sql?

How could you access the HIVE ACID table, in Spark sql?
We have worked on and open sourced a datasource that will enable users to work on their Hive ACID Transactional tables using Spark.
Github: https://github.com/qubole/spark-acid
It is available as a Spark package and instructions to use it are on the Github page. Currently the datasource supports only reading from Hive ACID tables, and we are working on adding the ability to write into these tables via Spark as well.
Feedback and suggestions are welcome!
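Going by the project's README, reading through that datasource should look roughly like this (the table name is a placeholder, and option names may vary between releases):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("HiveAcidReadSketch")
  .enableHiveSupport()
  .getOrCreate()

// Read a Hive ACID transactional table through the spark-acid datasource.
// "default.acid_table" is a placeholder table name.
val acidDf = spark.read
  .format("HiveAcid")
  .option("table", "default.acid_table")
  .load()

acidDf.show()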
@aniket Spark doesn't support reading Hive ACID tables directly (https://issues.apache.org/jira/browse/SPARK-15348, https://issues.apache.org/jira/browse/SPARK-16996).
The data layout for transactional tables requires special logic to decide which directories to read and how to combine them correctly. Some data files may represent updates of previously written rows, for example. Also, if you are reading while something is writing to this table, your read may fail (without the special logic) because it will try to read incomplete ORC files. Compaction may (again without the special logic) make it look like your data is duplicated.
It can be done (WIP) via LLAP - tracked in https://issues.apache.org/jira/browse/HIVE-12991
I faced the same issue (Spark for Hive ACID tables) and I was able to manage with a JDBC call from Spark. Maybe I can use this JDBC call from Spark until we get native ACID support in Spark.
https://github.com/Gowthamsb12/Spark/blob/master/Spark_ACID
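A rough sketch of what that JDBC workaround might look like; the HiveServer2 host, port, credentials, and table name are placeholders, and the Hive JDBC driver must be on the classpath:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("HiveAcidOverJdbc").getOrCreate()

// Read the ACID table through HiveServer2 instead of touching its ORC files directly.
// The URL, table, and credentials below are placeholders.
val acidDf = spark.read
  .format("jdbc")
  .option("url", "jdbc:hive2://hiveserver2-host:10000/default")
  .option("driver", "org.apache.hive.jdbc.HiveDriver")
  .option("dbtable", "acid_table")
  .option("user", "hive_user")
  .option("password", "hive_password")
  .load()

acidDf.show()

Be aware that the Hive JDBC driver has some quirks with Spark's JDBC source (for example, columns can come back prefixed with the table name), so this may need tuning.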
Spark can read an ACID table directly at least since Spark 2.3.2. But I can also confirm it can't read an ACID table in Spark 2.2.0.

Apache Spark 1.3 dataframe saveAsTable to a database other than default

I am trying to save a DataFrame as a table using saveAsTable, and it works, but I want to save the table to a database other than the default one. Does anyone know if there is a way to set the database to use? I tried hiveContext.sql("use db_name") and this did not seem to do it. There is a saveAsTable variant that takes in some options. Is there a way I can do it with the options?
It does not look like you can set the database name yet... if you read the HiveContext.scala code, you see a lot of comments like...
// TODO: Database support...
So I am guessing that it's not supported yet.
Update:
In Spark 1.5.1 this works, which did not work in earlier versions. In earlier versions you had to use a USE statement like in deformitysnot's answer.
df.write.format("parquet").mode(SaveMode.Append).saveAsTable("databaseName.tablename")
This was fixed in Spark 1.5, and you can do it using:
hiveContext.sql("USE sparkTables");
dataFrame.saveAsTable("tab3", "orc", SaveMode.Overwrite);
By the way, in Spark 1.5 you can read Spark-saved DataFrames from the Hive command line (beeline, ...), something that was impossible in earlier versions.