Spark Dataframe upsert to Elasticsearch - scala

I am using an Apache Spark DataFrame and I want to upsert data into Elasticsearch. I found that I can overwrite the existing data like this:
val df = spark.read.option("header","true").csv("/mnt/data/akc_breed_info.csv")
df.write
.format("org.elasticsearch.spark.sql")
.option("es.nodes.wan.only","true")
.option("es.port","443")
.option("es.net.ssl","true")
.option("es.nodes", esURL)
.option("es.mapping.id", index)
.mode("Overwrite")
.save("index/dogs")
But what I have noticed so far is that mode("Overwrite") actually deletes all the existing matching data and then inserts the new data.
Is there a way I can upsert the records instead of deleting and re-writing them? I need to query this data in near real time. Thanks in advance.

The reason mode("Overwrite") was a problem is that when you overwrite your entire DataFrame, it first deletes all the data that matches the rows of your DataFrame, so for a moment the entire index looked empty to me. I figured out how to actually upsert.
Here is my code:
df.write
.format("org.elasticsearch.spark.sql")
.option("es.nodes.wan.only","true")
.option("es.nodes.discovery", "false")
.option("es.nodes.client.only", "false")
.option("es.net.ssl","true")
.option("es.mapping.id", index)
.option("es.write.operation", "upsert")
.option("es.nodes", esURL)
.option("es.port", "443")
.mode("append")
.save(path)
Note that you have to set "es.write.operation" to "upsert" and use .mode("append").

Try setting:
es.write.operation = upsert
This should perform the required operation. You can find more details in https://www.elastic.co/guide/en/elasticsearch/hadoop/current/configuration.html

Related

Filter rows of snowflake table while reading in pyspark dataframe

I have a huge Snowflake table. I want to do some transformations on the table in PySpark. My Snowflake table has a column called 'snapshot'. I only want to read the current snapshot data into a PySpark DataFrame and do the transformations on that filtered data.
So, is there a way to filter the rows while reading the Snowflake table into a Spark DataFrame (I don't want to read the entire Snowflake table into memory because that is not efficient), or do I need to read the entire Snowflake table (into a Spark DataFrame) and then apply a filter to get the latest snapshot, something like below?
SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"
snowflake_database="********"
snowflake_schema="********"
source_table_name="********"
snowflake_options = {
    "sfUrl": "********",
    "sfUser": "********",
    "sfPassword": "********",
    "sfDatabase": snowflake_database,
    "sfSchema": snowflake_schema,
    "sfWarehouse": "COMPUTE_WH"
}
df = spark.read \
    .format(SNOWFLAKE_SOURCE_NAME) \
    .options(**snowflake_options) \
    .option("dbtable", snowflake_database + "." + snowflake_schema + "." + source_table_name) \
    .load()
df = df.where(df.snapshot == current_timestamp()).collect()
There are forms of filters (filter or where functionality of Spark DataFrame) that Spark doesn't pass to the Spark Snowflake connector. That means, in some situations, you may get more records than you expect.
The safest way would be to use a SQL query directly:
df = spark.read \
    .format(SNOWFLAKE_SOURCE_NAME) \
    .options(**snowflake_options) \
    .option("query", "SELECT X, Y, Z FROM TABLE1 WHERE SNAPSHOT = CURRENT_TIMESTAMP()") \
    .load()
Of course, if you want to use the filter/where functionality of the Spark DataFrame, check the Query History in the Snowflake UI to see whether the generated query has the right filter applied.
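For illustration, here is a minimal Scala sketch of both approaches. snowflakeOptions mirrors the snowflake_options map from the question, and currentSnapshot is a placeholder for the snapshot value you want:
// Sketch only: query pushdown via the "query" option.
val viaQuery = spark.read
  .format("net.snowflake.spark.snowflake")
  .options(snowflakeOptions)
  .option("query", "SELECT X, Y, Z FROM TABLE1 WHERE SNAPSHOT = CURRENT_TIMESTAMP()")
  .load()

// Alternative: a DataFrame filter. Confirm in the Snowflake Query History that the
// WHERE clause was actually pushed down before relying on this for a huge table.
import org.apache.spark.sql.functions.col
val viaFilter = spark.read
  .format("net.snowflake.spark.snowflake")
  .options(snowflakeOptions)
  .option("dbtable", "TABLE1")
  .load()
  .where(col("SNAPSHOT") === currentSnapshot)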

Writestreaming query not writing data into table in databricks

Can someone help me with this issue?
I have a Delta table "orders". This table is loaded with 1000 records from the Delta file.
Now we are getting a streaming JSON file which appends data into this table.
The readStream DataFrame orderInputDF:
from pyspark.sql.functions import *
orderInputDF = (spark
.readStream
.schema(jsonSchema)
.option("maxFilesPerTrigger", 1)
.json(stream_path).
withColumn("submitted_at",to_timestamp('submittedAt')).
withColumn("submitted_yyyy_mm", col('submitted_at').substr(0,7)).
withColumnRenamed("orderId",'order_id').
withColumnRenamed("customerId","customer_id").
withColumnRenamed("salesRepId","sales_rep_id").
withColumn("shipping_address_attention", col("shippingAddress.attention")).
withColumn("shipping_address_address", col("shippingAddress.address")).
withColumn("shipping_address_city", col("shippingAddress.city")).
withColumn("shipping_address_state", col("shippingAddress.state")).
withColumn("shipping_address_zip", col("shippingAddress.zip")).
withColumn("ingest_file_name",lit(files[0][0])).
withColumn("ingested_at",current_timestamp()).
drop("products","shippingAddress","submittedAt")
)
This readStream DataFrame is returning 20 records.
The writeStream query is as below:
orderOutputDF = (orderInputDF
.writeStream
.format("delta")
.format("memory")
.queryName(orders_table)
.outputMode("append")
.option("checkpointLocation", orders_checkpoint_path)
.start()
)
This query is not appending the data. Every time, it inserts the new records and deletes the old records.
Can someone help me with this?
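One detail worth noting: the writeStream above chains .format("delta") and then .format("memory"), and the last call wins, so the query writes to an in-memory sink rather than to the Delta table. A minimal Scala sketch of a Delta-only sink, with ordersCheckpointPath and ordersTablePath as placeholders for the paths used in the question:
// Sketch only: write the stream to Delta in append mode.
val orderOutputQuery = orderInputDF.writeStream
  .format("delta")
  .outputMode("append")
  .option("checkpointLocation", ordersCheckpointPath)
  .start(ordersTablePath)   // or .toTable("orders") on Spark 3.1+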

Counting unique values on grouped data in a Spark Dataframe with Structured Streaming on Delta Lake

Hello, everyone.
I have a Structured Streaming pipeline on a Delta Lake.
My last table is supposed to count how many unique IDs access a platform per week.
I'm grouping the data by week in the stream; however, I cannot count the unique values of IDs in the other column, and instead I keep getting the count of all rows, including repeated ones.
I have tried grouping the data twice, by week and then by device_id.
I have tried dropDuplicates().
Nothing has worked out so far.
Can someone explain what I am missing?
My code:
from pyspark.sql.functions import weekofyear, col

def silverToGold(silverPath, goldPath, queryName):
    (spark.readStream
        .format("delta")
        .load(silverPath)
        .withColumn("week", weekofyear("client_event_time"))
        .groupBy(col("week"))
        .count()
        .select(col("week"), col("count").alias("WAU"))
        .writeStream
        .format("delta")
        .option("checkpointLocation", goldPath + "/_checkpoint")
        .queryName(queryName)
        .outputMode("complete")
        .start(goldPath))
The following code worked.
I used approx_count_distinct with rsd=0.01.
from pyspark.sql.functions import weekofyear, approx_count_distinct, col

def silverToGold(silverPath, goldPath, queryName):
    (spark.readStream
        .format("delta")
        .load(silverPath)
        .withColumn("week", weekofyear(col("eventDate")))
        .groupBy(col("week"))
        .agg(approx_count_distinct("device_id", rsd=0.01).alias("WAU"))
        .select("week", "WAU")
        .writeStream
        .format("delta")
        .option("checkpointLocation", goldPath + "/_checkpoint")
        .queryName(queryName)
        .outputMode("complete")
        .start(goldPath))

Multiple aggregations and Distinct Function in Spark Structured Streaming

I need to run some aggregations on streaming data from Kafka and output the top 10 rows of the result to the console every M seconds.
input_df = (
spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", brokers)
.option("subscribe", "page_views")
.load()
.selectExpr('cast(value as string)')
)
...
...
# info has 2 cols: domain, uid (info = transformation of input_df)
# It's an example of what I want to do (like in simple pyspark)
stat = (
info
.groupby('domain')
.agg(
F.count(F.col('UID')).alias('view'),
F.countDistinct(F.col('UID')).alias('unique')
)
.sort(F.col("view").desc())
.limit(10)
)
query = (
stat
.writeStream
.outputMode("complete")
.format("console")
.option("truncate", "true")
.start()
)
This example is without a time trigger, but I can add that myself.
Because countDistinct is not allowed in streaming, I have no idea how to complete my exercise.
I tried to make two DataFrames, one for each aggregation (df_1 = (domain, view), df_2 = (domain, unique)), and then join df_1 with df_2, but having several aggregations is also not allowed. So it's a dead end for me.
It would be great to have a solution for this.
Thanks for your attention!
You can do this with flatMapGroupsWithState, which is an arbitrary stateful function. Besides, it supports append mode and update mode.
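flatMapGroupsWithState belongs to the typed Dataset API (Scala/Java). A rough Scala sketch, assuming info has domain and uid string columns as in the question and that spark is the active SparkSession:
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}
import spark.implicits._

// Sketch only: per-domain state holding a running view count and the distinct UIDs seen so far.
// Keeping all UIDs in state can grow without bound; a real job would add a timeout or an
// approximate structure, and the top-10 ranking would be handled downstream (e.g. in foreachBatch).
case class PageView(domain: String, uid: String)
case class DomainState(views: Long, uids: Seq[String])
case class DomainStats(domain: String, views: Long, unique: Long)

def updateStats(
    domain: String,
    events: Iterator[PageView],
    state: GroupState[DomainState]): Iterator[DomainStats] = {
  val previous = state.getOption.getOrElse(DomainState(0L, Seq.empty))
  val batch = events.toSeq
  val updated = DomainState(
    previous.views + batch.size,
    (previous.uids ++ batch.map(_.uid)).distinct)
  state.update(updated)
  Iterator(DomainStats(domain, updated.views, updated.uids.size))
}

val stat = info.as[PageView]   // info = (domain, uid), as in the question
  .groupByKey(_.domain)
  .flatMapGroupsWithState(OutputMode.Update, GroupStateTimeout.NoTimeout)(updateStats)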

Spark JDBC returning dataframe only with column names

I am trying to connect to a Hive table using Spark JDBC, with the following code:
val df = spark.read.format("jdbc").
option("driver", "org.apache.hive.jdbc.HiveDriver").
option("user","hive").
option("password", "").
option("url", jdbcUrl).
option("dbTable", tableName).load()
df.show()
but all I get back is an empty DataFrame with modified column names, like this:
--------------|---------------|
tableName.uuid|tableName.name |
--------------|---------------|
I've tried to read the DataFrame in a lot of ways, but the result is always the same.
I'm using the Hive JDBC driver, and this Hive table is located in an EMR cluster. The code also runs in the same cluster.
Any help will be really appreciated.
Thank you all.
Please set fetchsize in the options; it should work.
Dataset<Row> referenceData = sparkSession.read()
    .option("fetchsize", "100")
    .format("jdbc")
    .option("url", jdbc.getJdbcURL())
    .option("user", "")
    .option("password", "")
    .option("dbtable", hiveTableName)
    .load();