Join Multiple DataFrames in Spark - Scala

I am implementing a project where MySQL data is imported to HDFS using Sqoop. There are nearly 30 tables. I am reading each table as a DataFrame by inferring the schema and registering it as a temp table. I have a few questions about doing this...
1. Several joins need to be implemented across the tables, say df1 to df10. In MySQL the query would be
select a.id, b.name, c.AccountName from accounts a, priority b, bills c where a.id = b.id and c.name = a.name
Instead of using
sqlContext.sql("select a.id, b.name, c.AccountName from accounts a, priority b, bills c where a.id = b.id and c.name = a.name")
is there another way to join all the DataFrames effectively based on conditions?
2. Is converting the tables to DataFrames and querying on top of them the right way, or is there a better way to approach this kind of join and query in Spark?

I had a similar problem and I ended up using:
import org.apache.spark.sql.DataFrame, scala.collection.mutable.ListBuffer

val df_list = ListBuffer[DataFrame]()  // populate with your DataFrames
// Left-outer join the DataFrames pairwise, each on its first column
df_list.toList.reduce((a, b) => a.join(b, a.col(a.schema.head.name) === b.col(b.schema.head.name), "left_outer"))
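If the join keys differ between tables, you can also chain join calls with explicit conditions instead of a single SQL string. A minimal sketch of that idea (the sample data and SparkSession setup are made up; on Spark 1.x you would build the DataFrames through sqlContext instead):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("multi-join-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical stand-ins for the imported tables
val accounts = Seq((1, "acme")).toDF("id", "name")
val priority = Seq((1, "high")).toDF("id", "name")
val bills    = Seq(("acme", "ACME Corp")).toDF("name", "AccountName")

// Chain the joins with explicit conditions, then project the needed columns
val joined = accounts
  .join(priority, accounts("id") === priority("id"))
  .join(bills, accounts("name") === bills("name"))
  .select(accounts("id"), priority("name"), bills("AccountName"))

joined.show()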
You could also write a free-form SQL query in Sqoop and join everything there, or use Spark JDBC to do the same job.

Related

Spark join operation for two DataFrames

When df1 and df2 have the same rows and
df1 and df2 have no duplicated values,
what is the complexity of the join operation df1.join(df2)?
My guess is that it takes O(n^2).
Is it possible to sort both DataFrames to get better performance?
If not, what is the way to make a join faster in PySpark?
Even if df1 and df2 have the same set of rows, if they are not partitioned then Spark has to partition both DataFrames on the join key in order to join them. From Spark 2.3 onwards, sort-merge join is the default join workhorse; it requires both DataFrames to be partitioned and sorted by the join key, and only then is the join performed. Both DataFrames also have to be co-located for a sort-merge join.
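You can see this in the physical plan. A minimal sketch in Scala (Spark 2.3+ assumed; broadcasting is disabled here only so the sort-merge join shows up even for tiny test data):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("smj-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Disable automatic broadcast joins so the planner picks a sort-merge join
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

val df1 = (1 to 1000).map(i => (i, s"a$i")).toDF("id", "left_val")
val df2 = (1 to 1000).map(i => (i, s"b$i")).toDF("id", "right_val")

// The plan shows SortMergeJoin with an Exchange (shuffle) and Sort on each side
df1.join(df2, "id").explain()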
Is it possible to sort both DataFrames to get better performance? If not, what is the way to make a join faster in PySpark?
Yes. If you see that a particular DataFrame is used again and again in joins on the same join key, you can repartition that DataFrame on the join key and cache it for further use. Please refer to the link below for more details:
https://deepsense.ai/optimize-spark-with-distribute-by-and-cluster-by/
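A minimal sketch of that idea in Scala (the DataFrame and column names are assumptions):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("repartition-cache-sketch").master("local[*]").getOrCreate()
import spark.implicits._

val df1 = Seq((1, "a"), (2, "b")).toDF("id", "v1")
val df2 = Seq((1, "x"), (2, "y")).toDF("id", "v2")

// Repartition the frequently reused DataFrame on the join key and cache it,
// so later joins on "id" can reuse the same partitioning
val df1ByKey = df1.repartition(col("id")).cache()

df1ByKey.join(df2, "id").show()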

How to optimize a broadcast join in Spark Scala?

I am a new Spark Scala developer and I want to improve my code by using a broadcast join.
As I understand it, a broadcast join can optimise the code when we join a large DataFrame with a small one. That is exactly my case: I have a first DataFrame (tab1 in my example) that contains more than 3 billion rows, which I have to join with a second one containing only 900 rows.
Here is my SQL query:
SELECT tab1.id1, regexp_extract(tab2.emp_name, ".*?(\\d+)\\)$", 1) AS city,
topo_2g3g.emp_id AS emp_id, tab1.emp_type
FROM table1 tab1
INNER JOIN table2 tab2
ON (tab1.emp_type = tab2.emp_type AND tab1.start = tab2.code)
And here is my attempt to use a broadcast join:
import org.apache.spark.sql.functions.{broadcast, col}

val tab1 = df1.filter(""" id > 100 """).as("tab1")
val tab2 = df2.filter(""" id > 100 """).as("tab2")
val result = tab1.join(
  broadcast(tab2),
  col("tab1.emp_type") === col("tab2.emp_type") && col("tab1.start") === col("tab2.code"),
  "inner")
The problem is that this approach is not optimized at all. The result contains ALL the columns of the two tables, while I don't need all of them. I just need 3 of them plus the last one (with a regex on it), which is not optimal at all. It's like generating a very big table first and then reducing it to a small one, whereas in SQL we get the small table directly.
So, after this step:
I have to use withColumn to generate the new column (with the regex).
I have to apply a select to keep only the 3 columns that I need, while I got them immediately in SQL (with no extra step, I mean).
Can you help me optimize my code and my query?
Thanks in advance
You can select the columns you want before doing the join:
df1.filter(""" id > 100 """).select("col1", "col2").as("table1")
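Putting it together, a sketch of the full query with the projection and the regex applied around the broadcast join (df1 and df2 are the DataFrames from the question; the exact column sets are assumptions based on the SQL above):
import org.apache.spark.sql.functions.{broadcast, col, regexp_extract}

// Keep only the columns each side actually contributes before joining
val tab1 = df1.select("id1", "emp_type", "start")
val tab2 = df2.select(
  col("emp_type"),
  col("code"),
  col("emp_id"),
  regexp_extract(col("emp_name"), ".*?(\\d+)\\)$", 1).as("city"))

val result = tab1
  .join(broadcast(tab2),
    tab1("emp_type") === tab2("emp_type") && tab1("start") === tab2("code"),
    "inner")
  .select(tab1("id1"), col("city"), col("emp_id"), tab1("emp_type"))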

Update a Hive Table Using Spark Scala

I need to update a Hive table like
update A from B
set
Col5 = A.Col2,
Col2 = B.Col2,
DT_Change = B.DT,
Col3 = B.Col3,
Col4 = B.Col4
where A.Col1 = B.Col1 and A.Col2 <> B.Col2
using a Spark RDD in Scala.
How can I do this?
I want to split this question into two parts to explain it simply.
First question: How do I write Spark RDD data to a Hive table?
The simplest way is to convert the RDD into a DataFrame using rdd.toDF(), then register the DataFrame as a temp table using df.registerTempTable("temp_table"). Now you can query the temp table and insert into the Hive table using sqlContext.sql("insert into table my_table select * from temp_table").
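A minimal sketch of that flow (Spark 1.x style, matching the methods above; the case class, table name, and HiveContext setup are assumptions):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

case class Record(col1: Int, col2: String)

val sc = new SparkContext(new SparkConf().setAppName("rdd-to-hive-sketch"))
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._

val rdd = sc.parallelize(Seq(Record(1, "a"), Record(2, "b")))

// RDD -> DataFrame -> temp table -> Hive table
val df = rdd.toDF()
df.registerTempTable("temp_table")
sqlContext.sql("insert into table my_table select * from temp_table")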
Second question: How do I update a Hive table from Spark?
As of now, Hive is not the best fit for record-level updates. Updates can only be performed on tables that support ACID, and one major limitation is that only the ORC format supports updating Hive tables. You can find some information on it at https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions
You can refer to How to update an ORC Hive table from Spark using Scala for this.
A few methods might have been deprecated with Spark 2.x, so check the Spark 2.0 documentation for the latest methods.
While there could be better approaches, this is the simplest approach I can think of that works.

Flink Table and SQL API

I want to know whether we can write a query using two tables (a join) in the Flink Table and SQL API.
I am new to Flink. I want to create two tables from two different DataSets, query them, and produce another DataSet.
My query would be like select ... from table1, table2 ... So can we write a query like this that queries two or more tables?
Thanks
Flink's Table API supports join operations (full, left, right, and inner joins) on batch tables (e.g. those created from a DataSet). In SQL, for example:
SELECT c, g FROM Table3, Table5 WHERE b = e
For streaming tables (e.g. those created from a DataStream), Flink does not yet support join operations. But the Flink community is actively working to add them in the near future.
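A minimal sketch of a batch join with the Scala Table API (assuming an older Flink 1.x release, to match the answer above; the DataSets and field names are made up):
import org.apache.flink.api.scala._
import org.apache.flink.table.api.TableEnvironment
import org.apache.flink.table.api.scala._

val env  = ExecutionEnvironment.getExecutionEnvironment
val tEnv = TableEnvironment.getTableEnvironment(env)

// Hypothetical input DataSets
val ds1: DataSet[(Int, String, String)]            = env.fromElements((1, "x", "c1"))
val ds2: DataSet[(String, String, String, String)] = env.fromElements(("x", "e1", "f1", "g1"))

// Turn the DataSets into tables with named fields and join them
val table3 = ds1.toTable(tEnv, 'a, 'b, 'c)
val table5 = ds2.toTable(tEnv, 'd, 'e, 'f, 'g)

val result = table3.join(table5).where('b === 'e).select('c, 'g)
result.toDataSet[(String, String)].print()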

Solving data skew in SparkSQL

I have a Spark SQL query that joins a fact table and a dimension table. The join condition leads to data skew, because one of the resulting combinations has huge data compared to the others. In Scala, I think this could be solved with
partitionBy(new org.apache.spark.HashPartitioner(160))
but this works only on an RDD and not on a SchemaRDD (DataFrame).
Is there an equivalent to this?
Here is what my code looks like:
sqlContext.sql("select product_category,shipment_item_id,shipment_amount from shipments_fact f left outer join product_category pc on f.category_code = pc.category_code")
Request help...
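For reference, a sketch of the RDD-level repartitioning mentioned in the question (illustration only; the sample data is made up, and partitionBy requires a pair RDD):
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("skew-sketch").setMaster("local[*]"))

// Hypothetical fact data keyed by category_code
val shipments = sc.parallelize(Seq(("cat1", 100.0), ("cat2", 50.0), ("cat1", 75.0)))

val partitioned = shipments.partitionBy(new HashPartitioner(160))
println(partitioned.partitions.length)  // 160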