Spark Scala equivalent for SKEW join hints - scala

Spark SQL has a skew hint available (please see here). Is there an equivalent hint available for Spark Scala?
Example
This is the Spark SQL code where the fact table has a skewed ProductId column:
SELECT /*+ SKEW('viewFact', 'ProductId') */
RevSumDivisionName, RevSumCategoryName, CloudAddOnFlag,
SUM(ActualRevenueAmt) AS RevenueUSD, COUNT(*) AS Cnt
FROM viewFact
INNER JOIN viewPMST ON viewFact.ProductId = viewPMST.ProductId
INNER JOIN viewRsDf ON viewPMST.ProductFamilyId = viewRsDf.ProductFamilyId
INNER JOIN viewRevH ON viewRsDf.RevSumCategoryId = viewRevH.RevSumCategoryId
GROUP BY RevSumDivisionName, RevSumCategoryName, CloudAddOnFlag
Same join in Scala:
inFact
.join(inPMst, Seq("ProductId"))
.join(inRsDf, Seq("ProductFamilyId"))
.join(inRevH, Seq("RevSumCategoryId"))
.groupBy($"RevSumDivisionName", $"RevSumCategoryName", $"CloudAddOnFlag")
.agg(sum($"ActualRevenueAmt") as "RevenueUSD", count($"*") as "Cnt")
I'm just unable to find the syntax for the skew hint.

Spark SQL has a skew hint available
It does not. The Databricks platform does, but it is a proprietary extension (same as indexing) that is not available in open-source Spark as such.
I'm just unable to find the syntax for the skew hint.
In the general case, query plan hints are passed using the hint method, which can be used like this:
val hint: String = ???
inFact.join(inPMst.hint(hint), Seq("ProductId"))
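For illustration, here is a minimal sketch with a concrete hint name. "broadcast" is a hint name that open-source Spark (2.2+) recognizes; a platform-specific hint such as the Databricks skew hint would presumably be supplied through the same hint(...) call with its own name and parameters:
// Sketch: pass a hint by name through the Dataset API before joining.
// "broadcast" asks the optimizer to broadcast inPMst for the join on ProductId.
val joined = inFact
  .join(inPMst.hint("broadcast"), Seq("ProductId"))
  .join(inRsDf, Seq("ProductFamilyId"))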

Related

Drop function not working after left outer join in pyspark

My pyspark version is 2.1.1. I am trying to join two dataframes (left outer) having two columns id and priority. I am creating my dataframes like this:
a = "select 123 as id, 1 as priority"
a_df = spark.sql(a)
b = "select 123 as id, 1 as priority union select 112 as uid, 1 as priority"
b_df = spark.sql(b)
c_df = a_df.join(b_df, (a_df.id==b_df.id), 'left').drop(b_df.priority)
c_df schema is coming as DataFrame[id: int, priority: int, id: int, priority: int]
The drop function is not removing the columns.
But if I try to do:
c_df = a_df.join(b_df, (a_df.id==b_df.id), 'left').drop(a_df.priority)
Then the priority column from a_df gets dropped.
Not sure if there is a version-change issue or something else, but it feels very weird that the drop function behaves like this.
I know the workaround could be to remove the unwanted columns before the join, but I'm still not sure why the drop function is not working.
Thanks in advance.
Duplicate column names from joins in pyspark lead to unpredictable behavior, and I've read that you should disambiguate the names before joining. From Stack Overflow, see Spark Dataframe distinguish columns with duplicated name and Pyspark Join and then column select is showing unexpected output. I'm sorry to say I can't find why pyspark doesn't work as you describe.
But the databricks documentation addresses this problem: https://docs.databricks.com/spark/latest/faq/join-two-dataframes-duplicated-column.html
From the Databricks documentation:
If you perform a join in Spark and don’t specify your join correctly you’ll end up with duplicate column names. This makes it harder to select those columns. This topic and notebook demonstrate how to perform a join so that you don’t have duplicated columns.
When you join, you can instead try either using an alias (that's typically what I use, sketched below), or you can pass the join columns as a list or a string.
df = left.join(right, ["priority"])
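As a sketch of the alias approach (shown here in the Scala DSL; the PySpark DataFrame.alias API works the same way), assuming a_df and b_df are the two frames from the question:
// Sketch: alias each side so the duplicate column names stay addressable after the join.
import org.apache.spark.sql.functions.col

val joined = a_df.alias("a")
  .join(b_df.alias("b"), col("a.id") === col("b.id"), "left")
  .select(col("a.id"), col("a.priority"))   // keep only the left-hand columns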

Join Multiple Data frames in Spark

I am implementing a project where MySQL data is imported into HDFS using Sqoop. It has nearly 30 tables. I am reading each table as a DataFrame by inferring the schema and registering them as temp tables. I have a few questions about doing this...
1. There are several joins that need to be implemented for the tables, say df1 to df10. In MySQL the query would be
select a.id, b.name, c.AccountName from accounts a, priority b, bills c where a.id = b.id and c.name = a.name
Instead of using
sqlContext.sql("select a.id, b.name, c.AccountName from accounts a, priority b, bills c where a.id = b.id and c.name = a.name")
is there another way to join all the data frames effectively based on conditions?
2. Is it correct to convert the tables to data frames and query on top of them, or is there a better way to approach these kinds of joins and queries in Spark?
I had a similar problem and ended up using:
import org.apache.spark.sql.DataFrame
import scala.collection.mutable.ListBuffer

val df_list = ListBuffer[DataFrame]()   // append df1 ... df10 here
df_list.toList.reduce((a, b) => a.join(b, a.col(a.schema.head.name) === b.col(b.schema.head.name), "left_outer"))
You could also write a free-form SQL statement in Sqoop and do the joins there, or use the Spark JDBC data source to do the same job.
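As a hedged sketch of the DataFrame DSL alternative to passing a raw SQL string, assuming accounts, priority and bills have already been loaded as DataFrames with the columns used in the query above:
// Sketch: express the three-way join with explicit column conditions instead of SQL text.
val result = accounts
  .join(priority, accounts("id") === priority("id"))
  .join(bills, bills("name") === accounts("name"))
  .select(accounts("id"), priority("name"), bills("AccountName"))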

How to use CROSS JOIN and CROSS APPLY in Spark SQL

I am very new to Spark and Scala, and I am writing Spark SQL code. I am in a situation where I need to apply CROSS JOIN and CROSS APPLY in my logic. Here I will post the SQL query which I have to convert to Spark SQL.
select Table1.Column1,Table2.Column2,Table3.Column3
from Table1 CROSS JOIN Table2 CROSS APPLY Table3
I need to convert the above query to run through SQLContext in Spark SQL. Kindly help me. Thanks in advance.
First set the property below in the Spark conf:
spark.sql.crossJoin.enabled=true
Then dataFrame1.join(dataFrame2) will do a cross/Cartesian join.
We can also use the query below to do the same:
sqlContext.sql("select * from table1 CROSS JOIN table2 CROSS JOIN table3...")
Set the Spark configuration:
import org.apache.spark.SparkConf

val sparkConf = new SparkConf()
  .set("spark.sql.crossJoin.enabled", "true")
Explicit cross join in Spark 2.x using the crossJoin method:
crossJoin(right: Dataset[_]): DataFrame
val df_new = df1.crossJoin(df2)
Note: cross joins are one of the most time-consuming joins and should often be avoided.
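Putting the pieces together, here is a minimal self-contained sketch (Spark 2.x; the session settings and sample data are chosen purely for illustration):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.crossJoin.enabled", "true")   // also allow implicit Cartesian joins
  .getOrCreate()
import spark.implicits._

val df1 = Seq("A", "B", "C").toDF("Column1")
val df2 = Seq(1, 2).toDF("Column2")

// Explicit Cartesian product: 3 x 2 = 6 rows.
df1.crossJoin(df2).show()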

Nested SQL Query in Spark [duplicate]

I am running this query in the Spark shell, but it gives me an error:
sqlContext.sql(
"select sal from samplecsv where sal < (select MAX(sal) from samplecsv)"
).collect().foreach(println)
error:
java.lang.RuntimeException: [1.47] failure: ``)'' expected but identifier MAX found
select sal from samplecsv where sal < (select MAX(sal) from samplecsv)
^
at scala.sys.package$.error(package.scala:27)
Can anybody explain this to me? Thanks.
Planned features:
SPARK-23945 (Column.isin() should accept a single-column DataFrame as input).
SPARK-18455 (General support for correlated subquery processing).
Spark 2.0+
Spark SQL should support both correlated and uncorrelated subqueries. See SubquerySuite for details. Some examples include:
select * from l where exists (select * from r where l.a = r.c)
select * from l where not exists (select * from r where l.a = r.c)
select * from l where l.a in (select c from r)
select * from l where a not in (select c from r)
Unfortunately, as of now (Spark 2.0), it is impossible to express the same logic using the DataFrame DSL.
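For example, a minimal sketch of running one of the supported forms through SQL (assuming l and r are existing DataFrames and spark is a Spark 2.0+ SparkSession):
// Register the frames as temporary views, then use a WHERE-clause subquery in plain SQL.
l.createOrReplaceTempView("l")
r.createOrReplaceTempView("r")
spark.sql("select * from l where l.a in (select c from r)").show()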
Spark < 2.0
Spark supports subqueries in the FROM clause (same as Hive <= 0.12).
SELECT col FROM (SELECT * FROM t1 WHERE bar) t2
It simply doesn't support subqueries in the WHERE clause. Generally speaking, arbitrary subqueries (in particular correlated subqueries) cannot be expressed using Spark without promoting them to a Cartesian join.
Since subquery performance is usually a significant issue in a typical relational system, and every subquery can be expressed using a JOIN, there is no loss of functionality here.
https://issues.apache.org/jira/browse/SPARK-4226
There is a pull request to implement that feature; my guess is it might land in Spark 2.0.
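In this particular case the subquery can also be avoided entirely. A hedged sketch (assuming samplecsv is available as a DataFrame named sampleDf): compute the aggregate first, then use it as a plain literal in the filter:
import org.apache.spark.sql.functions.{col, max}

// Compute MAX(sal) up front, then filter with it as an ordinary value instead of a subquery.
val maxSal = sampleDf.agg(max("sal")).first().get(0)
sampleDf.where(col("sal") < maxSal).select("sal").show()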

Solving data skew in SparkSQL

I have Spark SQL code that joins a fact table and a dimension table. The join condition leads to data skew, as one of the result combinations will have huge data compared to the others. In Scala, I think this can be solved with
partitionBy(new org.apache.spark.HashPartitioner(160))
But this works only on RDDs and not on SchemaRDDs.
Is there an equivalent to this?
Here is what my code looks like:
sqlContext.sql("select product_category,shipment_item_id,shipment_amount from shipments_fact f left outer join product_category pc on f.category_code = pc.category_code")
Request help...
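One common workaround (sketched here under the assumption that the product_category dimension table is small enough to broadcast, which the thread does not confirm) is to sidestep the skewed shuffle with a broadcast join:
import org.apache.spark.sql.functions.broadcast

// Sketch: broadcasting the small dimension table avoids shuffling the skewed fact table.
val shipmentsFact = sqlContext.table("shipments_fact")
val productCategory = sqlContext.table("product_category")

shipmentsFact
  .join(broadcast(productCategory), Seq("category_code"), "left_outer")
  .select("product_category", "shipment_item_id", "shipment_amount")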