Writing dataframe to Hive table using PySpark

This may have been answered elsewhere but I couldn't find an exact solution I was looking for.
I have a dataframe df created using:
df=spark.sql('''select distinct Col_1, Col_2, Col_3 from sourceTable''')
with the following structure:

col_1     | Col_2 | Col_3
----------|-------|------
first_row | row   | line
The target table has the following structure, where Col_3, rundate and runtime are partition columns in hive:

col_1     | Col_2 | Col_3 | rundate | runtime
----------|-------|-------|---------|--------
first_row | row   | line  |         |
Now, normally, when the schemas match, I know I can write the df to Hive using
df.write.mode("overwrite").partitionBy("Col_3").insertInto("TargetTable", overwrite=True)
and set
hive.exec.dynamic.partition = true
hive.exec.dynamic.partition.mode = nonstrict
spark.sql.sources.partitionOverwriteMode = dynamic
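For reference, I set those at runtime with something like this (a minimal sketch, assuming a Hive-enabled SparkSession named spark):
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")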
The challenge I am facing is that my source schema is different, and moreover I want to partition by Col_3 and month (instead of the daily rundate and runtime).
Can someone help me with this?
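Conceptually, something like the sketch below is what I am after, assuming a hypothetical target table TargetTable_monthly partitioned by (Col_3, run_month), where run_month is a derived yyyy-MM column (the table and column names here are only placeholders):

from pyspark.sql import functions as F

monthly_df = (df
    .withColumn("run_month", F.date_format(F.current_date(), "yyyy-MM"))
    .select("Col_1", "Col_2", "Col_3", "run_month"))   # partition columns (Col_3, run_month) must come last, in table order

monthly_df.write.mode("overwrite").insertInto("TargetTable_monthly")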

Related

How to PartitionBy a column in spark and drop the same column before saving the dataframe in spark scala

Let's say we have a dataframe with columns col1, col2, col3, col4. Now, while saving the df, I want to partition it by col2, and the final df that gets saved should not have col2. So the final df should have col1, col3, col4. Any advice on how I can achieve this?
newdf.drop("Status").write.mode("overwrite").partitionBy("Status").csv("C:/Users/Documents/Test")
drop will remove the Status column, and your code will fail at partitionBy with the error below, because the Status column was dropped:
org.apache.spark.sql.AnalysisException: Partition column `status` not found in schema [...]
Check the code below; it will not include the Status values inside your data files, because partitionBy moves them into the partition directory names.
newdf
.write
.mode("overwrite")
.partitionBy("Status")
.csv("C:/Users/Documents/Test")

Update Table Hive Using Spark Scala

I need to update a Hive table, like:
update A from B
set
Col5 = A.Col2,
Col2 = B.Col2,
DT_Change = B.DT,
Col3 = B.Col3,
Col4 = B.Col4
where A.Col1 = B.Col1 and A.Col2 <> B.Col2
Using Scala Spark RDD
How can I do this?
I want to split this question into two parts to explain it simply.
First question: How to write Spark RDD data to a Hive table?
The simplest way is to convert the RDD into a dataframe using rdd.toDF(). Then register the dataframe as a temp table using df.registerTempTable("temp_table"). Now you can query the temp table and insert into the Hive table using sqlContext.sql("insert into table my_table select * from temp_table").
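Put together, a rough PySpark sketch of those two steps (my_table is the hypothetical target Hive table, and the column names passed to toDF() are only placeholders):
df = rdd.toDF(["col1", "col2"])              # convert the RDD to a dataframe
df.registerTempTable("temp_table")           # deprecated in Spark 2.x in favour of createOrReplaceTempView
sqlContext.sql("insert into table my_table select * from temp_table")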
Second question: How to update a Hive table from Spark?
As of now, Hive is not the best fit for record-level updates. Updates can only be performed on tables that support ACID transactions, and one primary limitation is that only the ORC format supports them. You can find more information at https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions
You can refer to How to Update an ORC Hive table from Spark using Scala for this.
A few methods may have been deprecated in Spark 2.x, so check the Spark 2.0 documentation for the latest methods.
While there could be better approaches, this is the simplest one I can think of that works.
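Purely as an illustration of the general idea (this is not the code from the linked answer), one common workaround when ACID updates are not available is to rebuild the affected rows with a join and write the result back, for example to a staging table. A rough PySpark sketch, assuming Hive tables named A and B with exactly the columns from the question:

from pyspark.sql import functions as F

a = spark.table("A")
b = spark.table("B")

# rows of A that have a matching B row with a different Col2 take the new values
updated = (a.alias("a")
    .join(b.alias("b"), (F.col("a.Col1") == F.col("b.Col1")) & (F.col("a.Col2") != F.col("b.Col2")))
    .select(F.col("a.Col1").alias("Col1"),
            F.col("b.Col2").alias("Col2"),
            F.col("b.Col3").alias("Col3"),
            F.col("b.Col4").alias("Col4"),
            F.col("a.Col2").alias("Col5"),
            F.col("b.DT").alias("DT_Change")))

# all other rows of A are kept as they are
unchanged = a.join(updated.select("Col1"), on="Col1", how="left_anti")

unchanged.unionByName(updated).write.mode("overwrite").saveAsTable("A_updated")   # hypothetical staging table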

Most efficient way to select and process data from a dataframe

I would like to load and process data from a dataframe in Spark using Scala.
The raw SQL Statement looks like this:
INSERT INTO TABLE_1
(
key_attribute,
attribute_1,
attribute_2
)
SELECT
MIN(TABLE_2.key_attribute),
CURRENT_TIMESTAMP as attribute_1,
'Some_String' as attribute_2
FROM TABLE_2
LEFT OUTER JOIN TABLE_1
ON TABLE_2.key_attribute = TABLE_1.key_attribute
WHERE
TABLE_1.key_attribute IS NULL
AND TABLE_2.key_attribute IS NOT NULL
GROUP BY
attribute_1,
attribute_2,
TABLE_2.key_attribute
What I've done so far:
I created a DataFrame from the Select Statement and joined it with the TABLE_2 DataFrame.
val table_1 = spark.sql("Select key_attribute, current_timestamp() as attribute_1, 'Some_String' as attribute_2").toDF();
table_2.join(table_1, Seq("key_attribute"), "left_outer");
Not much progress so far, because I am facing too many difficulties:
How do I handle the SELECT and process the data efficiently? Keep everything in separate DataFrames?
How do I insert the WHERE/GROUP BY clause with attributes from several sources?
Is there any other/better way except Spark SQL?
A few steps in handling this are:
First, create the dataframe from your raw data.
Then save it as a temp table.
You can use filter() or a WHERE condition in Spark SQL to get the resultant dataframe.
Then, as you already did, you can make use of joins between dataframes. You can think of a dataframe as a representation of a table.
Regarding efficiency: since the processing is done in parallel, that is taken care of. If you want anything more specific regarding efficiency, please mention it.
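Following those steps, a rough PySpark sketch of the same logic (the question is in Scala; table_1 and table_2 are assumed to already be dataframes over TABLE_1 and TABLE_2):

from pyspark.sql import functions as F

# TABLE_2 keys that do not yet exist in TABLE_1 (the LEFT OUTER JOIN + IS NULL filter)
missing = (table_2
    .where(F.col("key_attribute").isNotNull())
    .join(table_1, on="key_attribute", how="left_anti")
    .select("key_attribute")
    .distinct())                                 # plays the role of the GROUP BY / MIN dedup

# add the constant attributes from the SELECT list
to_insert = (missing
    .withColumn("attribute_1", F.current_timestamp())
    .withColumn("attribute_2", F.lit("Some_String")))

to_insert.write.insertInto("TABLE_1")            # assumes TABLE_1 exists as a Hive table with matching column order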

Insert data into a Hive table with HiveContext using Spark Scala

I was able to insert data into a Hive table from my spark code using HiveContext like below
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("CREATE TABLE IF NOT EXISTS e360_models.employee(id INT, name STRING, age INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'")
sqlContext.sql("insert into table e360_models.employee select t.* from (select 1210, 'rahul', 55) t")
sqlContext.sql("insert into table e360_models.employee select t.* from (select 1211, 'sriram pv', 35) t")
sqlContext.sql("insert into table e360_models.employee select t.* from (select 1212, 'gowri', 59) t")
val result = sqlContext.sql("FROM e360_models.employee SELECT id, name, age")
result.show()
But this approach creates a separate file in the warehouse for every insertion, like below:
part-00000
part-00000_copy_1
part-00000_copy_2
part-00000_copy_3
Is there any way to avoid this and just append the new data to a single file, or is there any other, better way to insert data into Hive from Spark?
No, there is no way to do that. Each new insert will create a new file. It's not a Spark "issue", but a general behavior you can experience with Hive too. The only way is to perform a single insert with the UNION of all your data, but if you need to do multiple inserts, you'll have multiple files.
The only other thing you can do is enable file merging in Hive (see Hive Create Multi small files for each insert in HDFS and https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties).
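To illustrate the single-insert idea in PySpark (assuming a Spark 2.x SparkSession named spark instead of the HiveContext above; coalesce(1) is only an optional way to force one output file for the batch):

rows = [(1210, "rahul", 55), (1211, "sriram pv", 35), (1212, "gowri", 59)]
new_df = spark.createDataFrame(rows, ["id", "name", "age"])

# one insert for the whole batch instead of one insert per record
new_df.coalesce(1).write.insertInto("e360_models.employee")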

Hive: How to do a SELECT query to output a unique primary key using HiveQL?

I have a dataset with the following schema, which I want to transform into a table that can be exported to SQL. I am using Hive. The input is as follows:
call_id,stat1,stat2,stat3
1,a,b,c,
2,x,y,z,
3,d,e,f,
1,j,k,l,
The output table needs to have call_id as its primary key so it needs to be unique. The output schema should be
call_id,stat2,stat3,
1,b,c, or (1,k,l)
2,y,z,
3,e,f,
The problem is that when I use the keyword DISTINCT in the Hive query, the DISTINCT applies to all the columns combined. I want to apply the DISTINCT operation only to call_id, something along the lines of:
SELECT DISTINCT(call_id), stat2,stat3 from intable;
However, this is not valid in Hive (I am not well-versed in SQL either).
The only legal query seems to be
SELECT DISTINCT call_id, stat2,stat3 from intable;
But this returns multiple rows with the same call_id, as the other columns differ and each row as a whole is distinct.
NOTE: There is no arithmetic relation between a,b,c,x,y,z, etc. So any trick of averaging or summing is not viable.
Any ideas how I can do this?
One quick idea, not the best one, but it will do the job:
hive>create table temp1(a int,b string);
hive>insert overwrite table temp1
select call_id,max(concat(stat1,'|',stat2,'|',stat3)) from intable group by call_id;
hive>insert overwrite table intable
select a, split(b,'\\|')[0], split(b,'\\|')[1], split(b,'\\|')[2] from temp1;
"I want to apply the DISTINCT operation only to the call_id"
But how will then Hive know which row to eliminate?
Without knowing the amount of data / the size of the stat fields you have, the following query can do the job:
select distinct i1.call_id, i1.stat2, i1.stat3 from (
select call_id, MIN(concat(stat1, stat2, stat3)) as smin
from intable group by call_id
) i2 join intable i1 on i1.call_id = i2.call_id
AND concat(i1.stat1, i1.stat2, i1.stat3) = i2.smin;
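For comparison, a rough PySpark sketch of the same "one row per call_id" idea, assuming the data is loaded as a dataframe named intable_df with the four columns from the question (this uses a window function rather than the MIN(concat(...)) trick):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("call_id").orderBy("stat1", "stat2", "stat3")

deduped = (intable_df
    .withColumn("rn", F.row_number().over(w))
    .where(F.col("rn") == 1)                  # keep one deterministic row per call_id
    .select("call_id", "stat2", "stat3"))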