Spark DataFrame delete from Elasticsearch - Scala

I am using Apache Spark DataFrames and I want to delete data from Elasticsearch.
For adding and updating I am using the command below:
val esURL = "https://56h874526b6741db87c3c91324g755.westeurope.azure.elastic-cloud.com:9243"
var indexName = "test_elastic/test_elastic"

df.write
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes.wan.only", "true")
  .option("es.port", "443")
  .option("es.net.ssl", "true")
  .option("es.net.http.auth.user", "userid")
  .option("es.net.http.auth.pass", "pwd")
  .option("es.nodes", esURL)
  .option("es.mapping.id", "primary_key")
  .mode("append")
  .save(indexName)
My question is: how can I delete some documents (rows) from the Elasticsearch index? In my case the index is "test_elastic".

A quick search in the repository shows that delete support first lands in the Elasticsearch Hadoop connector in the 8.x and 7.8 versions, neither of which is released at the moment I write this.
https://github.com/elastic/elasticsearch-hadoop/pull/1324
From maintainer jbaiera:
LGTM! Thanks very much for your dedication on getting this in! I'll go ahead and merge it in to master and backport it to the 7.x branch. It should be available in the 7.8.0 release when that lands.
Judging from the code changes, the supported ES versions should be 2.x to 8.x, but I could not find any documentation yet (I did not look much, though), and there is no information about direct usage from the Spark API (I'm not even sure a delete API exists on Spark DataFrames, whatever their data source).
On the other hand, there is a write mode called "overwrite" that works today and could let you achieve deletion, but overwriting the whole index may not be practical depending on the data volume.
I think your best bet would be to drop out of the Spark DataFrame API and switch to one (or several) direct bulk delete calls.
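As an illustration only (this is not part of the connector), here is a minimal sketch that posts Elasticsearch _bulk delete actions over plain HTTP for the ids held in a DataFrame column. The DataFrame name rowsToDelete, the column primary_key, the credentials and the cluster URL are placeholders based on the question; _delete_by_query is an alternative endpoint if you can express the rows to remove as a query. Requires Java 11+ for java.net.http.

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import java.util.Base64

val esURL     = "https://<your-cluster>.westeurope.azure.elastic-cloud.com:9243"  // placeholder
val indexName = "test_elastic"
val auth      = "Basic " + Base64.getEncoder.encodeToString("userid:pwd".getBytes("UTF-8"))

// rowsToDelete is any DataFrame holding the ids to remove (assumed to be strings)
rowsToDelete.select("primary_key").rdd.foreachPartition { rows =>
  val client = HttpClient.newHttpClient()
  rows.grouped(500).foreach { batch =>                       // one _bulk request per 500 ids
    val body = batch
      .map(r => s"""{"delete":{"_index":"$indexName","_id":"${r.getString(0)}"}}""")
      .mkString("", "\n", "\n")                              // NDJSON: one action per line
    val request = HttpRequest.newBuilder()
      .uri(URI.create(s"$esURL/_bulk"))
      .header("Content-Type", "application/x-ndjson")
      .header("Authorization", auth)
      .POST(HttpRequest.BodyPublishers.ofString(body))
      .build()
    val response = client.send(request, HttpResponse.BodyHandlers.ofString())
    // note: _bulk returns 200 even if individual items fail; inspect the body for errors
    require(response.statusCode() == 200, s"bulk delete failed: ${response.body()}")
  }
}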

Related

How to do a bulk insert / bulk load into HBase through Glue

I am trying to do a bulk insert or bulk load into HBase on EMR using Glue Scala (Spark 3.1). I got this working using
table.put(List<Put>);
but without satisfactory performance. I tried inserting through a Spark DataFrame following some examples, but those libraries are only compatible with Spark 1.6. I also tried to reproduce some examples that write HFiles into HDFS and process them through HOutputFormat and HOutputFormat2, but these classes were removed in newer versions. How can I perform a high-performance insert into HBase with current libraries, or even a bulk load? The examples I found were old, and the HBase reference book wasn't clear about this point.
Thank you.
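A common pattern for this (a sketch only, not something given in this thread; the table name, column family and column layout are illustrative) is to distribute the Puts across the executors via TableOutputFormat and saveAsNewAPIHadoopDataset. This is not an HFile bulk load, but it avoids funnelling everything through a single client-side table.put call:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job

val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableOutputFormat.OUTPUT_TABLE, "my_table")            // illustrative table name
val job = Job.getInstance(hbaseConf)
job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])

// df is the DataFrame to write; column 0 is assumed to hold the row key, column 1 a value
val puts = df.rdd.map { row =>
  val put = new Put(Bytes.toBytes(row.getString(0)))
  put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(row.getString(1)))
  (new ImmutableBytesWritable(), put)
}
puts.saveAsNewAPIHadoopDataset(job.getConfiguration)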

What changes do I have to do to migrate an application from Spark 1.5 to Spark 2.1?

I have to migrate an application written in Scala 2.10.4 using Spark 1.6 to Spark 2.1.
The application processes text files of around 7 GB and contains several RDD transformations.
I was told to try to recompile it with Scala 2.11, which should be enough to make it work with Spark 2.1. This sounds strange to me, as I know there are some relevant changes in Spark 2, like:
Introduction of the SparkSession object
Merge of the DataSet and DataFrame APIs
I managed to recompile the application on Spark 2 with Scala 2.11 with only minor changes, due to Kryo serializer registration.
I still have some runtime errors that I am trying to solve, and I am trying to figure out what will come next.
My question regards what changes are "necessary" to make the application work as before, what changes are "recommended" in terms of performance optimization (I need to keep at least the same level of performance), and whatever else you think could be useful for a newbie in Spark :).
Thanks in advance!
I did the same a year ago; there are not many changes you need to make. What comes to mind (there is a small sketch after this list):
If your code is cluttered with references to the SparkContext/sqlContext, just extract these variables from the SparkSession instance at the beginning of your code.
df.map switched to the RDD API in Spark 1.6; in Spark 2.x you stay in the DataFrame API (which now has its own map method). To get the same behaviour as before, replace df.map with df.rdd.map. The same is true for df.foreach, df.mapPartitions, etc.
unionAll in Spark 1.6 is just union in Spark 2.x.
The Databricks CSV library is now included in Spark.
When you insert into a partitioned Hive table, the partition columns must now come last in the schema; in Spark 1.6 they had to come first.
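A minimal sketch of the first three points, assuming illustrative file names and a Spark 2.x application or shell:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("migrated-app").getOrCreate()
val sc         = spark.sparkContext   // formerly built via new SparkContext(conf)
val sqlContext = spark.sqlContext     // only where legacy code still needs it

val df  = spark.read.text("input.txt")   // illustrative inputs
val df2 = spark.read.text("more.txt")

// Spark 1.6: df.map returned an RDD; in Spark 2.x stay in the Dataset API,
// or call .rdd explicitly to get the old behaviour back.
val lineLengths = df.rdd.map(row => row.getString(0).length)

// Spark 1.6: df.unionAll(df2); Spark 2.x renames it to union.
val all = df.union(df2)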
What you should consider (but would require more work):
migrate RDD code to Dataset code
enable the CBO (cost-based optimizer)
collect_list can be used with structs; in Spark 1.6 it could only be used with primitives. This can simplify some things (see the example after this list).
the data source API was improved/unified
the left_anti join was introduced
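For example (a small illustration with made-up data; the column names are placeholders), assuming a SparkSession named spark is in scope:

import spark.implicits._
import org.apache.spark.sql.functions.{collect_list, struct}

val customers = Seq((1, "Ann"), (2, "Bob")).toDF("customer_id", "name")
val orders    = Seq((1, 101, 9.99), (1, 102, 5.00)).toDF("customer_id", "order_id", "amount")

// collect_list over structs (Spark 1.6 only allowed primitives here)
val grouped = orders
  .groupBy("customer_id")
  .agg(collect_list(struct("order_id", "amount")).as("orders"))

// left_anti join: customers that have no orders at all
val withoutOrders = customers.join(orders, Seq("customer_id"), "left_anti")
withoutOrders.show()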

Upsert to Phoenix table in Apache Spark

Looking to find out if anybody has found a way to perform upserts (append / update / partial insert/update) on Phoenix using Apache Spark. As per the Phoenix documentation, only SaveMode.Overwrite is supported, which is an overwrite with a full load. When I tried changing the mode, it threw an error.
Currently we have *.hql jobs running to perform this operation; now we want to rewrite them in Spark Scala. Thanks for sharing your valuable inputs.
While the Phoenix connector indeed supports only SaveMode.Overwrite, the implementation doesn't conform to the Spark contract, which states that:
Overwrite mode means that when saving a DataFrame to a data source, if data/table already exists, existing data is expected to be overwritten by the contents of the DataFrame
If you check the source, you'll see that saveToPhoenix just calls saveAsNewAPIHadoopFile with PhoenixOutputFormat, which internally builds the UPSERT query for you.
In other words, SaveMode.Overwrite with the Phoenix connector is in fact an UPSERT.
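For reference, a minimal write sketch following the Phoenix Spark connector's documented DataFrame API; the table name and ZooKeeper URL are placeholders:

import org.apache.spark.sql.SaveMode

// "Overwrite" here does not truncate the table: the connector turns each row
// into an UPSERT keyed on the Phoenix primary key.
df.write
  .format("org.apache.phoenix.spark")
  .mode(SaveMode.Overwrite)
  .option("table", "MY_TABLE")      // placeholder Phoenix table
  .option("zkUrl", "zkhost:2181")   // placeholder ZooKeeper quorum
  .save()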

hbase-spark for Spark 2

I want to do a full scan on HBase from Spark 2 using Scala.
I don't have a fixed catalog definition, so libraries such as SHC are not an option.
My logical choice was to use hbase-spark, which works fine in Spark 1.6.
On top of the poor documentation of this library in previous versions, my surprise came when checking the latest HBase releases: in tag 2.0, for example, hbase-spark is gone, although it is still in master.
So my questions are:
Where is the hbase-spark module for the last releases?
Where can I find a hbase-spark version compatible with Spark 2?
Thanks!
It seems the hbase-spark module was removed from the HBase project for the 2.0 release:
https://issues.apache.org/jira/browse/HBASE-18817
#bp2010 already answered part of the question.
Regarding HBase Spark, see below. It works with Spark 2.
There are some options that don't require a fixed catalog from client code:
HBase Spark
Source code with examples is here: https://github.com/apache/hbase-connectors/tree/master/spark/hbase-spark
Apache Phoenix Spark connector
https://phoenix.apache.org/phoenix_spark.html
I'm not sure whether it helps you, since the table must be mapped to a Phoenix table. If you have Phoenix, your problem is writing the catalog from code, and you can standardize the types in the HBase table, then for a full scan this can be the way to go. Otherwise, go with option 1.
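For completeness, a minimal sketch of a catalog-free full scan with the hbase-spark connector's HBaseContext (option 1); the table name is a placeholder and a SparkSession named spark is assumed to be in scope:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.util.Bytes

val hbaseConf    = HBaseConfiguration.create()
val hbaseContext = new HBaseContext(spark.sparkContext, hbaseConf)

// hbaseRDD runs the scan on the executors and yields (rowkey, Result) pairs
val scanRdd = hbaseContext.hbaseRDD(TableName.valueOf("my_table"), new Scan())
val rowKeys = scanRdd.map { case (key, _) => Bytes.toString(key.get()) }
rowKeys.take(10).foreach(println)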

Apache Spark 1.3 dataframe SaveAsTable database other then default

I am trying to save a DataFrame as a table using saveAsTable, and it works, but I want to save the table to a database other than the default one. Does anyone know if there is a way to set the database to use? I tried hiveContext.sql("use db_name") and this did not seem to do it. There is a saveAsTable overload that takes some options. Is there a way I can do it with the options?
It does not look like you can set the database name yet... if you read the HiveContext.scala code you see a lot of comments like...
// TODO: Database support...
So I am guessing that it's not supported yet.
Update:
In Spark 1.5.1 the following works; it did not work in earlier versions, where you had to use a USING statement as in deformitysnot's answer.
df.write.format("parquet").mode(SaveMode.Append).saveAsTable("databaseName.tablename")
This was fixed in Spark 1.5, and you can do it using:
hiveContext.sql("USE sparkTables");
dataFrame.saveAsTable("tab3", "orc", SaveMode.Overwrite);
By the way, in Spark 1.5 you can read Spark-saved DataFrames from the Hive command line (beeline, ...), something that was impossible in earlier versions.