merge into deltalake table updates all rows - scala

I'm trying to update a Delta Lake table using a Spark DataFrame. I want to update all rows that differ between the DataFrame and the Delta Lake table, and to insert all rows that are missing from the Delta Lake table.
I tried to do this as follows:
import io.delta.tables._
val not_equal_string = df.schema.fieldNames.map(fn =>
  s"coalesce(not ((updates.${fn} = history.${fn}) or (isnull(history.${fn}) and isnull(updates.${fn})) ),false)"
).reduceLeft((x,y) => s"$x OR $y ")
val deltaTable = DeltaTable.forPath(spark, "s3a://sparkdata/delta-table")
deltaTable.as("history").merge(
  df.as("updates"), "updates.EquipmentKey = history.EquipmentKey"
).whenMatched(not_equal_string).updateAll().whenNotMatched().insertAll().execute()
This works, but when I look at the resulting Delta table I see that it effectively doubled in size even though I didn't update a single record. A new JSON file was generated with a remove for every old partition and an add for all new partitions.
When I just run a SQL join with the whenMatched criterion as a WHERE condition, I don't get a single row.
I would expect the Delta table to be untouched after such a merge operation. Am I missing something simple?
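For reference, the change-detection condition built above can be written more compactly with Spark SQL's null-safe equality operator `<=>`. The sketch below is plain Scala string building only; the object name and the sample column names are hypothetical stand-ins for `df.schema.fieldNames`. Note that it is not strictly equivalent to the original expression: when exactly one side is NULL, the original evaluates to false (the comparison yields NULL and `coalesce` turns it into false), whereas `NOT (a <=> b)` reports the row as changed.

```scala
object NullSafeDiff {
  // Builds "NOT (updates.`c` <=> history.`c`) OR ..." for the given columns.
  // `<=>` is Spark SQL's null-safe equality: NULL <=> NULL is true, NULL <=> 1 is false.
  def notEqualCondition(fields: Seq[String],
                        source: String = "updates",
                        target: String = "history"): String =
    fields
      .map(fn => s"NOT ($source.`$fn` <=> $target.`$fn`)")
      .mkString(" OR ")

  def main(args: Array[String]): Unit = {
    // Hypothetical column names standing in for df.schema.fieldNames
    val fields = Seq("EquipmentKey", "Status")
    println(notEqualCondition(fields))
  }
}
```

The resulting string could be passed to `whenMatched(...)` in place of `not_equal_string`; whether that changes the file rewriting described above is a separate matter.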

Related

queryExecution.analyzed.stats.sizeInBytes always returns 9223372036854775807

I'm trying to check the size of the different tables we're generating in our data warehouse, so we can have an automatic way to calculate partition size in next runs.
In order to get the table size I'm getting the stats from dataframes in the following way:
val db = "database"
val table_name = "table_name"
val table_size_bytes = spark.read.table(s"$db.$table_name").queryExecution.analyzed.stats.sizeInBytes
This was working fine until I started running the same code on partitioned tables. Each time I ran it on a partitioned table I got the same value for sizeInBytes: 9223372036854775807, which is Long.MaxValue (the largest value a 64-bit signed integer can hold).
Is this a bug in Spark or should I be running this in a different way for partitioned tables?
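One way to read that number: it is the default of `spark.sql.defaultSizeInBytes`, which Spark uses as a placeholder when it has no statistics for a relation. A minimal sketch, assuming that placeholder should simply be treated as "size unknown" (the object and function names here are hypothetical, not part of Spark):

```scala
object TableSize {
  // Treats the Long.MaxValue placeholder (or anything larger) as "no statistics available".
  def knownSize(sizeInBytes: BigInt): Option[BigInt] =
    if (sizeInBytes >= BigInt(Long.MaxValue)) None else Some(sizeInBytes)

  def main(args: Array[String]): Unit = {
    println(knownSize(BigInt("9223372036854775807"))) // prints None
    println(knownSize(BigInt(123456789L)))            // prints Some(123456789)
  }
}
```

A guard like this only detects the missing-statistics case; it does not produce the real size of a partitioned table.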

Spark DataFrame turns empty after writing to table

I'm having some concerns regarding the behaviour of dataframes after writing them to Hive tables.
Context:
I run a Spark Scala (version 2.2.0.2.6.4.105-1) job through spark-submit in my production environment, which has Hadoop 2.
I do multiple computations and store some intermediate data to Hive ORC tables; after storing a table, I need to re-use the dataframe to compute a new dataframe to be stored in another Hive ORC table.
E.g.:
// dataframe with ~10 million record
val df = prev_df.filter(some_filters)
val df_temp_table_name = "temp_table"
val df_table_name = "table"
sql("SET hive.exec.dynamic.partition = true")
sql("SET hive.exec.dynamic.partition.mode = nonstrict")
df.createOrReplaceTempView(df_temp_table_name)
sql(s"""INSERT OVERWRITE TABLE $df_table_name PARTITION(partition_timestamp)
SELECT * FROM $df_temp_table_name """)
These steps always work and the table is properly populated with the correct data and partitions.
After this, I need to use the just computed dataframe (df) to update another table. So I query the table to be updated into dataframe df2, then I join df with df2, and the result of the join needs to overwrite the table of df2 (a plain, non-partitioned table).
val table_name_to_be_updated = "table2"
// Query the table to be updated
val df2 = sql(s"SELECT * FROM $table_name_to_be_updated")
val df3 = df.join(df2).filter(some_filters).withColumn(something)
val temp = "temp_table2"
df3.createOrReplaceTempView(temp)
sql(s"""INSERT OVERWRITE TABLE $table_name_to_be_updated
SELECT * FROM $temp """)
At this point, df3 is always found empty, so the resulting Hive table is always empty as well. This happens also when I .persist() it to keep it in memory.
When testing with spark-shell, I have never encountered the issue. This happens only when the flow is scheduled in cluster-mode under Oozie.
What do you think might be the issue? Do you have any advice on approaching a problem like this with efficient memory usage?
I don't understand if it's the first df that turns empty after writing to a table, or if the issue is because I first query and then try to overwrite the same table.
Thank you very much in advance and have a great day!
Edit:
Previously, df was computed in an individual script and then inserted into its respective table. On a second script, that table was queried into a new variable df; then the table_to_be_updated was also queried and stored into a variable old_df2 let's say. The two were then joined and computed upon in a new variable df3, that was then inserted with overwrite into the table_to_be_updated.

how to split a list to multiple partitions and sent to executors

When we use Spark to read data from CSV or a DB as follows, it automatically splits the data into multiple partitions and sends them to the executors:
spark
.read
.option("delimiter", ",")
.option("header", "true")
.option("mergeSchema", "true")
.option("codec", properties.getProperty("sparkCodeC"))
.format(properties.getProperty("fileFormat"))
.load(inputFile)
Currently, I have an id list such as:
[1,2,3,4,5,6,7,8,9,...1000]
What I want to do is split this list into multiple partitions and send them to the executors, and in each executor run the SQL as follows:
ids.foreach(id => {
select * from table where id = id
})
When we load data from Cassandra, the connector generates the query as:
select columns from table where Token(k) >= ? and Token(k) <= ?
This means the connector scans the whole database. I don't actually need to scan the whole table; I just want to get all the data from the table where k (the partition key) is in the id list.
The table schema is:
CREATE TABLE IF NOT EXISTS tab.events (
  k int,
  o text,
  event text,
  PRIMARY KEY (k,o)
);
Alternatively, how can I use Spark to load data from Cassandra using a predefined SQL statement, without scanning the whole table?
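The splitting step described in the question can be sketched in plain Scala, without Spark or a Cassandra session. This is only an illustration under assumptions: the object and function names are hypothetical, and it assumes that one `IN` query on the partition key per slice is acceptable. In Spark itself, `sc.parallelize(ids, numSlices)` would do the slicing and distribute the slices to executors.

```scala
object IdBatches {
  // Splits ids into at most `numPartitions` slices of roughly equal size.
  def slices(ids: Seq[Int], numPartitions: Int): List[Seq[Int]] = {
    require(numPartitions > 0, "numPartitions must be positive")
    if (ids.isEmpty) Nil
    else {
      val size = math.ceil(ids.size.toDouble / numPartitions).toInt
      ids.grouped(size).toList
    }
  }

  // One CQL statement per slice, restricted to the partition key k of tab.events.
  def queryFor(slice: Seq[Int]): String =
    s"SELECT k, o, event FROM tab.events WHERE k IN (${slice.mkString(", ")})"

  def main(args: Array[String]): Unit = {
    val ids = 1 to 1000
    // Print the beginning of each generated statement
    slices(ids, 4).map(queryFor).foreach(q => println(q.take(70)))
  }
}
```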
You simply need to use the joinWithCassandraTable function to select only the data required for your operation. Be aware that this function is only available via the RDD API.
Something like this:
val joinWithRDD = your_df.rdd.joinWithCassandraTable("tab","events")
You need to make sure that the column name in your DataFrame matches the partition key name in Cassandra; see the documentation for more information.
The DataFrame implementation is only available in the DSE version of the Spark Cassandra Connector, as described in the following blog post.
Update (September 2020): support for joins with Cassandra was added in Spark Cassandra Connector 2.5.0.

Insert Counts for SaveAsTable and Insert INTO

I have Scala code that I run in spark-shell to extract data from JSON files, load it first into a Hive staging table, and then load the data from staging into the final main table in Hive.
I use the following commands:
//read json to DF
val df = hiveContext.read.schema(schema1).json(file)
//DF to Staging
df.write.mode("append").saveAsTable("stg")
//Staging to Final
hiveContext.sql("insert into final select distinct columns from stg left outer join final on stg.id = final.id where stg.id is not null and final.id is null")
I want to understand the number of records read/written at every stage.
For JSON to DF, I understand that I can do df.count(), but what about the remaining two stages? How do I get the counts for the saveAsTable and INSERT INTO statements?

running tasks in parallel on separate Hive partitions using Scala and Spark to speed up loading Hive and writing results to Hive or Parquet

This question is a spin-off from this one (saving a list of rows to a Hive table in pyspark).
EDIT please see my update edits at the bottom of this post
I have used both Scala and now Pyspark to do the same task, but I am having problems with VERY slow saves of a dataframe to parquet or csv, or converting a dataframe to a list or array type data structure. Below is the relevant python/pyspark code and info:
from pyspark.sql.functions import lit

#Table is a List of Rows from small Hive table I loaded using
#query = "SELECT * FROM Table"
#Table = sqlContext.sql(query).collect()
for i in range(len(Table)):
    rows = sqlContext.sql(qry)
    val1 = Table[i][0]
    val2 = Table[i][1]
    count = Table[i][2]
    x = 100 - count
    #hivetemp is a table that I copied from Hive to my hfs using:
    #create external table IF NOT EXISTS hivetemp LIKE hivetableIwant2copy LOCATION "/user/name/hiveBackup";
    #INSERT OVERWRITE TABLE hivetemp SELECT * FROM hivetableIwant2copy;
    query = "SELECT * FROM hivetemp WHERE col1<>\""+val1+"\" AND col2 ==\""+val2+"\" ORDER BY RAND() LIMIT "+str(x)
    rows = sqlContext.sql(query)
    rows = rows.withColumn("col4", lit(10))
    rows = rows.withColumn("col5", lit(some_string))
    #writing to parquet is heck slow AND I can't work with pandas due to the library not installed on the server
    rows.saveAsParquetFile("rows"+str(i)+".parquet")
    #tried this before and heck slow also
    #rows_list = rows.collect()
    #shuffle(rows_list)
I have tried to do the above in Scala, and I had similar problems. I could easily load the hive table or query of a hive table, but needing to do a random shuffle or store a large dataframe encounters memory issues. There were also some challenges with being able to add 2 extra columns.
The Hive table (hiveTemp) that I want to add rows to has 5,570,000 rows (about 5.5 million) and 120 columns.
The Hive table that I am iterating through in the for loop has 5000 rows and 3 columns. There are 25 unique val1 values (a column in hiveTemp), and 3000 combinations of val1 and val2. Val2 could be one of 5 columns and its specific cell value. This means that if I tweaked the code, I could reduce the lookups of rows to add from 5000 down to 26, but the number of rows I would have to retrieve, store, and randomly shuffle would be pretty large and hence a memory issue (unless anyone has suggestions on this).
The total number of rows I need to add to the table is about 100,000.
The ultimate goal is to have the original table of 5.5 million rows appended with the 100k+ rows and written as a Hive or Parquet table. If it's easier, I am fine with writing the 100k rows to their own table that can be merged into the 5.5 million row table later.
Scala or Python is fine, though Scala is preferred.
Any advice on this and the options that would be best would be great.
Thanks a lot!
EDIT
Some additional thoughts I had on this problem:
I used the hash partitioner to partition the Hive table into 26 partitions, based on a column that has 26 distinct values. The operations I want to perform in the for loop could be generalized so that they only need to happen on each of these partitions.
That being said, how could I write the Scala code to do this, with a separate executor running the loop on each partition (or what guide can I look at online)? I am thinking this would make things much faster.
I know how to do something like this using multiple threads, but I'm not sure how to do it in the Scala/Spark paradigm.
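To make the per-group operation concrete, here is a rough sketch on plain Scala collections of what the loop body does for one (val1, val2, count) entry: keep rows where col1 differs from val1 and col2 equals val2, shuffle them, and take 100 - count. The `Row` case class, the object name, and the fixed random seed are hypothetical simplifications. In Spark, a function like this could in principle be applied to each partition with `mapPartitions`, which is an assumption about the design rather than a tested solution.

```scala
import scala.util.Random

object PerGroupSample {
  // Hypothetical, simplified row; the real table has 120 columns.
  final case class Row(col1: String, col2: String, payload: String)

  // Picks up to (target - count) random rows with col1 != val1 and col2 == val2,
  // mirroring the WHERE ... ORDER BY RAND() LIMIT x query from the loop.
  def sample(rows: Seq[Row], val1: String, val2: String, count: Int,
             target: Int = 100, rng: Random = new Random(42)): Seq[Row] = {
    val needed = math.max(target - count, 0)
    val candidates = rows.filter(r => r.col1 != val1 && r.col2 == val2)
    rng.shuffle(candidates).take(needed)
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq(Row("a", "x", "1"), Row("b", "x", "2"), Row("c", "x", "3"), Row("b", "y", "4"))
    // count = 98 means 2 more rows are needed for this (val1, val2) pair
    sample(rows, val1 = "a", val2 = "x", count = 98).foreach(println)
  }
}
```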