Update a Hive table using Spark Scala - scala

I need to update a Hive table with a statement like this:
update A from B
set
Col5 = A.Col2,
Col2 = B.Col2,
DT_Change = B.DT,
Col3 = B.Col3,
Col4 = B.Col4
where A.Col1 = B.Col1 and A.Col2 <> B.Col2
using a Spark RDD in Scala.
How can I do this?

I will split this question into two questions to keep the explanation simple.
First question: How to write Spark RDD data to a Hive table?
The simplest way is to convert the RDD into a Spark SQL DataFrame using rdd.toDF(). Then register the DataFrame as a temp table using df.registerTempTable("temp_table"). Now you can query the temp table and insert into the Hive table using sqlContext.sql("insert into table my_table select * from temp_table").
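The steps above can be sketched roughly as follows. This is a minimal sketch, assuming Spark 2.x (so it uses SparkSession and createOrReplaceTempView instead of sqlContext and registerTempTable). The Record case class, the RddToHive object and the insertRdd helper are hypothetical names used only for illustration, and the target table is assumed to exist already with matching columns.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object RddToHive {
  // Hypothetical record type; replace it with the real column layout.
  case class Record(col1: Int, col2: String)

  // Converts the RDD to a DataFrame, exposes it as a temp view,
  // and appends its rows to an existing table.
  def insertRdd(spark: SparkSession, rdd: RDD[Record], table: String): Unit = {
    import spark.implicits._ // brings rdd.toDF() into scope

    val df = rdd.toDF()
    df.createOrReplaceTempView("temp_table")
    spark.sql(s"INSERT INTO TABLE $table SELECT * FROM temp_table")
  }
}
```

To write to a real Hive table, the session would typically be built with SparkSession.builder().enableHiveSupport().getOrCreate().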
Second question: How to update a Hive table from Spark?
Hive is currently not a good fit for record-level updates. Updates can only be performed on tables that support ACID transactions, and a primary limitation is that only the ORC format supports updates. You can find more information at https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions
For this, you can refer to the question "How to Updata an ORC Hive table form Spark using Scala".
Some methods may be deprecated in Spark 2.x (for example, registerTempTable was replaced by createOrReplaceTempView), so check the Spark 2.0 documentation for the current methods.
While there may be better approaches, this is the simplest one I can think of that works.
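If the table is not ACID-enabled, one common alternative is to compute the updated rows with a join and write the result out as a new table. The following is only a sketch of that idea, not the method from the linked question. It assumes A has the columns Col1 to Col5 and DT_Change, B has Col1 to Col4 and DT, and B holds at most one row per Col1 (otherwise rows of A would be duplicated). HiveUpdateSketch and applyUpdate are hypothetical names.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, when}

object HiveUpdateSketch {
  // Emulates: UPDATE A FROM B SET ... WHERE A.Col1 = B.Col1 AND A.Col2 <> B.Col2
  // Rows of A without a matching, differing row in B are returned unchanged.
  def applyUpdate(a: DataFrame, b: DataFrame): DataFrame = {
    val joined = a.as("a").join(b.as("b"), col("a.Col1") === col("b.Col1"), "left_outer")

    // True only when a row of B matched and Col2 actually differs.
    val changed = col("b.Col1").isNotNull && (col("a.Col2") =!= col("b.Col2"))

    joined.select(
      col("a.Col1").as("Col1"),
      when(changed, col("b.Col2")).otherwise(col("a.Col2")).as("Col2"),
      when(changed, col("b.Col3")).otherwise(col("a.Col3")).as("Col3"),
      when(changed, col("b.Col4")).otherwise(col("a.Col4")).as("Col4"),
      when(changed, col("a.Col2")).otherwise(col("a.Col5")).as("Col5"), // keeps the old Col2
      when(changed, col("b.DT")).otherwise(col("a.DT_Change")).as("DT_Change")
    )
  }
}
```

The result can then be written to a separate table, for example with applyUpdate(a, b).write.mode("overwrite").saveAsTable("A_updated"), where A_updated is a hypothetical name. Writing to a different table avoids overwriting a table that the same job is still reading.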

Related

Spark DataFrame turns empty after writing to table

I'm having some concerns regarding the behaviour of dataframes after writing them to Hive tables.
Context:
I run a Spark Scala (version 2.2.0.2.6.4.105-1) job through spark-submit in my production environment, which has Hadoop 2.
I do multiple computations and store some intermediate data to Hive ORC tables; after storing a table, I need to re-use the dataframe to compute a new dataframe to be stored in another Hive ORC table.
E.g.:
// dataframe with ~10 million records
val df = prev_df.filter(some_filters)
val df_temp_table_name = "temp_table"
val df_table_name = "table"
sql("SET hive.exec.dynamic.partition = true")
sql("SET hive.exec.dynamic.partition.mode = nonstrict")
df.createOrReplaceTempView(df_temp_table_name)
sql(s"""INSERT OVERWRITE TABLE $df_table_name PARTITION(partition_timestamp)
SELECT * FROM $df_temp_table_name """)
These steps always work and the table is properly populated with the correct data and partitions.
After this, I need to use the just computed dataframe (df) to update another table. So I query the table to be updated into dataframe df2, then I join df with df2, and the result of the join needs to overwrite the table of df2 (a plain, non-partitioned table).
val table_name_to_be_updated = "table2"
// Query the table to be updated
val df2 = sql(s"SELECT * FROM $table_name_to_be_updated")
val df3 = df.join(df2).filter(some_filters).withColumn(something)
val temp = "temp_table2"
df3.createOrReplaceTempView(temp)
sql(s"""INSERT OVERWRITE TABLE $table_name_to_be_updated
SELECT * FROM $temp """)
At this point, df3 is always empty, so the resulting Hive table is always empty as well. This also happens when I .persist() it to keep it in memory.
When testing with spark-shell, I have never encountered the issue. It happens only when the flow is scheduled in cluster mode under Oozie.
What do you think the issue might be? Do you have any advice on approaching a problem like this with efficient memory usage?
I can't tell whether the first df turns empty after being written to a table, or whether the issue is that I first query a table and then try to overwrite that same table.
Thank you very much in advance and have a great day!
Edit:
Previously, df was computed in a separate script and then inserted into its own table. In a second script, that table was queried into a new variable df; then table_to_be_updated was also queried and stored in a variable, say old_df2. The two were joined and processed into a new variable df3, which was then inserted with overwrite into table_to_be_updated.
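One way to test the second hypothesis (reading a table and overwriting it in the same job) is to materialize the result in a staging table first and only then overwrite the target. DataFrames are evaluated lazily, so df3 is not computed until the INSERT runs. The sketch below is an assumption-laden illustration, not a confirmed fix for this job: StagedOverwrite, overwriteViaStaging and the staging table name are hypothetical, and the target table is assumed to exist with the same column order as the result.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object StagedOverwrite {
  // Writes `result` to a staging table, then overwrites `target` from the staging copy,
  // so the final write no longer reads from the table it replaces.
  def overwriteViaStaging(spark: SparkSession, result: DataFrame, target: String, staging: String): Unit = {
    // Step 1: force the computation and persist the rows outside the target table.
    result.write.mode("overwrite").saveAsTable(staging)

    // Step 2: overwrite the target from the staging table (columns are matched by position).
    spark.table(staging).write.mode("overwrite").insertInto(target)

    // Step 3: clean up the staging table.
    spark.sql(s"DROP TABLE IF EXISTS $staging")
  }
}
```

In the flow above this would be called as overwriteViaStaging(spark, df3, table_name_to_be_updated, "table2_staging"), with table2_staging being a hypothetical name.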

How to split a list into multiple partitions and send them to executors

When we use Spark to read data from CSV or a DB as follows, it automatically splits the data into multiple partitions and sends them to the executors:
spark
.read
.option("delimiter", ",")
.option("header", "true")
.option("mergeSchema", "true")
.option("codec", properties.getProperty("sparkCodeC"))
.format(properties.getProperty("fileFormat"))
.load(inputFile)
Currently, I have an id list such as:
[1,2,3,4,5,6,7,8,9,...1000]
What I want to do is split this list into multiple partitions and send them to the executors, and in each executor run SQL like:
ids.foreach(id => {
select * from table where id = id
})
When we load data from Cassandra, the connector generates a query like:
select columns from table where Token(k) >= ? and Token(k) <= ?
This means the connector scans the whole table. I don't need to scan the whole table; I just want to get the rows where k (the partition key) is in the id list.
The table schema is:
CREATE TABLE IF NOT EXISTS tab.events (
k int,
o text,
event text,
PRIMARY KEY (k,o)
);
Alternatively, how can I use Spark to load data from Cassandra with a predefined SQL statement, without scanning the whole table?
You simply need to use the joinWithCassandraTable function to select only the data required for your operation. Be aware that this function is only available via the RDD API.
Something like this:
val joinWithRDD = your_df.rdd.joinWithCassandraTable("tab","events")
You need to make sure that the column name in your DataFrame matches the partition key name in Cassandra; see the documentation for more information.
The DataFrame implementation is only available in the DSE version of the Spark Cassandra Connector, as described in the following blog post.
Update (September 2020): support for joins with Cassandra in the DataFrame API was added in Spark Cassandra Connector 2.5.0.
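For the id list from the question, the RDD-based join could look roughly like the sketch below. It assumes the Spark Cassandra Connector is on the classpath and that spark.cassandra.connection.host is configured. EventsByIds, EventKey, toKeys and loadEvents are hypothetical names; the field name k is chosen to match the partition key of tab.events.

```scala
import com.datastax.spark.connector._
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object EventsByIds {
  // The field name must match the partition key column of tab.events.
  case class EventKey(k: Int)

  // Removes duplicate ids and wraps each one in a key object.
  def toKeys(ids: Seq[Int]): Seq[EventKey] = ids.distinct.map(EventKey(_))

  // Distributes the keys over `numPartitions` partitions; each executor then
  // queries Cassandra only for its own keys instead of scanning token ranges.
  def loadEvents(sc: SparkContext, ids: Seq[Int], numPartitions: Int): RDD[(EventKey, CassandraRow)] =
    sc.parallelize(toKeys(ids), numPartitions)
      .joinWithCassandraTable[CassandraRow]("tab", "events")
}
```

A call such as EventsByIds.loadEvents(sc, 1 to 1000, 8) would return pairs of the key and the matching row; the partition count of 8 is an arbitrary value for illustration.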

Most efficient way to select and process data from a dataframe

I would like to load and process data from a dataframe in Spark using Scala.
The raw SQL Statement looks like this:
INSERT INTO TABLE_1
(
key_attribute,
attribute_1,
attribute_2
)
SELECT
MIN(TABLE_2.key_attribute),
CURRENT_TIMESTAMP as attribute_1,
'Some_String' as attribute_2
FROM TABLE_2
LEFT OUTER JOIN TABLE_1
ON TABLE_2.key_attribute = TABLE_1.key_attribute
WHERE
TABLE_1.key_attribute IS NULL
AND TABLE_2.key_attribute IS NOT NULL
GROUP BY
attribute_1,
attribute_2,
TABLE_2.key_attribute
What I've done so far:
I created a DataFrame from the Select Statement and joined it with the TABLE_2 DataFrame.
val table_1 = spark.sql("Select key_attribute, current_timestamp() as attribute_1, 'Some_String' as attribute_2").toDF();
table_2.join(table_1, Seq("key_attribute"), "left_outer");
That is not much progress, because I am facing several difficulties:
How do I handle the SELECT and process the data efficiently? Should I keep everything in separate DataFrames?
How do I add the WHERE/GROUP BY clauses with attributes from several sources?
Is there any other or better way besides Spark SQL?
The main steps are:
First, create the DataFrame from your raw data.
Then register it as a temp table.
You can use filter() or a WHERE condition in Spark SQL to get the resulting DataFrame.
Then, as you already did, you can use joins between DataFrames. You can think of a DataFrame as a representation of a table.
Regarding efficiency, the processing is done in parallel, so that is already taken care of. If you need anything more specific about efficiency, please mention it.
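As an illustration of these steps, the SQL statement from the question can be read as "keys that exist in TABLE_2 but not in TABLE_1", which maps onto a left anti join in the DataFrame API. This is a sketch under that reading, not a verified translation of the original statement; NewKeysSketch and newRows are hypothetical names.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, current_timestamp, lit}

object NewKeysSketch {
  // Returns one row per non-null key of table2 that is missing from table1,
  // with the two constant attributes from the INSERT statement added.
  def newRows(table1: DataFrame, table2: DataFrame): DataFrame =
    table2
      .filter(col("key_attribute").isNotNull)           // TABLE_2.key_attribute IS NOT NULL
      .join(table1, Seq("key_attribute"), "left_anti")  // LEFT JOIN ... WHERE TABLE_1.key IS NULL
      .select("key_attribute")
      .distinct()                                       // stands in for GROUP BY / MIN on the key
      .withColumn("attribute_1", current_timestamp())
      .withColumn("attribute_2", lit("Some_String"))
}
```

The result could then be appended with newRows(table_1, table_2).write.insertInto("TABLE_1"), keeping in mind that insertInto matches columns by position.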

Join Multiple Data frames in Spark

I am implementing a project where MySQL data is imported to HDFS using Sqoop. It has nearly 30 tables. I am reading each table as a DataFrame by inferring the schema and registering it as a temp table. I have a few questions about this:
1. Several joins need to be implemented across the tables, say df1 to df10. In MySQL the query would be
select a.id, b.name, c.AccountName from accounts a, priority b, bills c where a.id = b.id and c.name = a.name
Instead of using
sqlContext.sql("select a.id, b.name, c.AccountName from accounts a, priority b, bills c where a.id = b.id and c.name = a.name")
is there another way to join all the DataFrames effectively based on conditions?
2. Is converting tables to DataFrames and querying on top of them the correct approach, or is there a better way to handle this type of join and query in Spark?
I had a similar problem and ended up using:
// df_list must be filled with the DataFrames to join before calling reduce
val df_list = ListBuffer[DataFrame]()
df_list.toList.reduce((a, b) => a.join(b, a.col(a.schema.head.name) === b.col(b.schema.head.name), "left_outer"))
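A variation of the reduce-based approach above is sketched below. It assumes every DataFrame shares one key column name and that the other column names do not collide. Passing the key as Seq(key) keeps a single copy of the key column in the result, which avoids ambiguous column references when more than two DataFrames are chained. JoinAll and joinAll are hypothetical names.

```scala
import org.apache.spark.sql.DataFrame

object JoinAll {
  // Left-joins all DataFrames in order on a shared key column.
  def joinAll(dfs: Seq[DataFrame], key: String): DataFrame = {
    require(dfs.nonEmpty, "at least one DataFrame is needed")
    dfs.reduce((a, b) => a.join(b, Seq(key), "left_outer"))
  }
}
```

For example, JoinAll.joinAll(Seq(accounts, priority, bills), "id") would join three DataFrames that all carry an id column.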
You could also use a free-form SQL query in Sqoop and join everything there, or use Spark JDBC to do the same job.

Difference between DataFrame API methods vs Storing DF as table

What is the difference between saving the DataFrame as a temp table and processing it with SQL queries, and calling the DataFrame API methods directly?
For example:
df.registerTempTable("tablename")
sqlCtx.sql("select column1 from tablename where column2='value2' group by column1")
and this
df.where($"column2" === "value2").select($"column1").distinct()
Is there any performance difference between these two?
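One way to investigate this is to build the same query both ways and compare the plans that Spark produces. Both forms are compiled by the same Catalyst optimizer, so the plans are usually very similar, but this should be checked for the actual query. The sketch assumes Spark 2.x and a DataFrame with columns column1 and column2; ApiVsSql, viaSql and viaApi are hypothetical names.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object ApiVsSql {
  // The query expressed through a temp view and SQL text.
  def viaSql(spark: SparkSession, df: DataFrame): DataFrame = {
    df.createOrReplaceTempView("tablename")
    spark.sql("SELECT column1 FROM tablename WHERE column2 = 'value2' GROUP BY column1")
  }

  // The same query expressed through DataFrame API methods.
  def viaApi(df: DataFrame): DataFrame =
    df.where(col("column2") === "value2").select("column1").distinct()
}
```

Calling ApiVsSql.viaSql(spark, df).explain(true) and ApiVsSql.viaApi(df).explain(true) prints the parsed, optimized and physical plans of each version for comparison.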