I have Spark SQL code that joins a fact table and a dimension table. The join condition leads to data skew, because one of the key combinations produces far more rows than the others. In Scala, I think this can be solved with
partitionBy(new org.apache.spark.HashPartitioner(160))
But this works only on an RDD and not on a SchemaRDD.
Is there an equivalent for a SchemaRDD?
Here is what my code looks like:
sqlContext.sql("select product_category,shipment_item_id,shipment_amount from shipments_fact f left outer join product_category pc on f.category_code = pc.category_code")
Any help would be appreciated.
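As a rough sketch of the DataFrame-side counterpart: in later Spark versions (1.6 onwards, as far as I know) a DataFrame can be hash-partitioned on a column with repartition(numPartitions, column). The table and column names come from the question; the local SparkSession and the sample rows are assumptions made only to keep the example runnable. Note that hash partitioning on the skewed key still sends every row of a hot key to the same partition, so this alone may not remove the skew.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session and tiny sample data, only for illustration.
val spark = SparkSession.builder().master("local[*]").appName("repartition-sketch").getOrCreate()

val shipmentsFact = spark
  .createDataFrame(Seq((1, 10.0, "A"), (2, 20.0, "A"), (3, 5.0, "B")))
  .toDF("shipment_item_id", "shipment_amount", "category_code")

// DataFrame counterpart of rdd.partitionBy(new HashPartitioner(160)):
// hash-partition the rows on the join key into 160 partitions.
val partitioned = shipmentsFact.repartition(160, col("category_code"))

println(partitioned.rdd.getNumPartitions)
```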
When df1 and df2 have the same rows and
neither df1 nor df2 contains duplicate values,
what is the complexity of the join operation df1.join(df2)?
My guess is that it takes O(n^2).
Is it possible to sort both data frames to get better performance?
If not, what is the way to make a join faster in PySpark?
Even if df1 and df2 have the same set of rows, if they are not partitioned then Spark has to partition both data frames on the join key in order to join them. From Spark 2.3 onwards, sort-merge join is the default join workhorse. It requires both data frames to be partitioned and sorted by the join key before the join is performed. Both data frames also have to be co-located for a sort-merge join.
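To make the complexity question concrete, here is a minimal single-machine sketch of the sort-merge idea, assuming unique keys on both sides as in the question. The function name sortMergeJoin and the sample data are made up for illustration, and Spark's distributed implementation is considerably more involved. The cost is dominated by the two sorts, roughly O(n log n), followed by one linear pass, rather than O(n^2).

```scala
// Minimal sort-merge join over two in-memory sequences with unique keys.
// Hypothetical helper for illustration only, not Spark's implementation.
def sortMergeJoin[K: Ordering, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Vector[(K, A, B)] = {
  val ord = implicitly[Ordering[K]]
  val l = left.sortBy(_._1).toVector  // O(n log n)
  val r = right.sortBy(_._1).toVector // O(m log m)
  val out = Vector.newBuilder[(K, A, B)]
  var i = 0
  var j = 0
  // Single linear pass over both sorted inputs: O(n + m).
  while (i < l.length && j < r.length) {
    val c = ord.compare(l(i)._1, r(j)._1)
    if (c == 0) { out += ((l(i)._1, l(i)._2, r(j)._2)); i += 1; j += 1 }
    else if (c < 0) i += 1
    else j += 1
  }
  out.result()
}

val joined = sortMergeJoin(Seq(3 -> "c", 1 -> "a", 2 -> "b"), Seq(2 -> "B", 3 -> "C", 4 -> "D"))
```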
Is it possible to sort both data frames to get better performance? If not, what is the way to make a join faster in PySpark?
Yes. If you see that a particular data frame is used again and again in joins on the same join key, you can repartition the data frame on that key and cache it for further use. Please refer to the link below for more details:
https://deepsense.ai/optimize-spark-with-distribute-by-and-cluster-by/
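A minimal sketch of that advice in Scala, assuming a local SparkSession and made-up sample data in which "id" plays the role of the join key (none of these names come from the original question):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session, only for illustration.
val spark = SparkSession.builder().master("local[*]").appName("repartition-cache-sketch").getOrCreate()

// Hypothetical data; "id" is the join key.
val df1 = spark.createDataFrame(Seq((1, "a"), (2, "b"), (3, "c"))).toDF("id", "left_value")
val df2 = spark.createDataFrame(Seq((1, "x"), (2, "y"), (4, "z"))).toDF("id", "right_value")

// Partition once on the join key and keep the result in memory,
// so that later joins on "id" can reuse this layout.
val df1ByKey = df1.repartition(col("id")).cache()

val joined = df1ByKey.join(df2, Seq("id"))

// Print the physical plan to see which join strategy Spark actually picked.
joined.explain()
```

With inputs this small Spark will most likely choose a broadcast join, so the benefit only shows up on data large enough to need a shuffle.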
I am a new developer in Spark Scala and I want to improve my code by using a broadcast join.
As I understand it, a broadcast join can optimize the code when a large DataFrame is joined with a small one. That is exactly my case: I have a first DF (tab1 in my example) that contains more than 3 billion rows, which I have to join with a second one that has only 900 rows.
Here is my SQL query:
SELECT tab1.id1, regexp_extract(tab2.emp_name, ".*?(\\d+)\\)$", 1) AS city,
topo_2g3g.emp_id AS emp_id, tab1.emp_type
FROM table1 tab1
INNER JOIN table2 tab2
ON (tab1.emp_type = tab2.emp_type AND tab1.start = tab2.code)
And here is my attempt to use a broadcast join:
import org.apache.spark.sql.functions.{broadcast, col}
// The aliases must match the prefixes used in the join condition below.
val tab1 = df1.filter(""" id > 100 """).as("tab1")
val tab2 = df2.filter(""" id > 100 """).as("tab2")
val result = tab1.join(
broadcast(tab2)
, col("tab1.emp_type") === col("tab2.emp_type") && col("tab1.start") === col("tab2.code")
, "inner")
The problem is that this approach is not optimized at all. The result contains ALL the columns of the two tables, while I only need 3 of them plus one more computed with a regex. It is as if we generate a very big table first and then reduce it to a small one, whereas in SQL we get the small table directly.
So, after this step:
I have to use withColumn to generate the new column (with the regex)
Then apply a method to select the 3 columns that I need, whereas in SQL I get them IMMEDIATELY (with no filter, I mean).
Can you please help me optimize my code and my query?
Thanks in advance
Select the columns you want before doing the join:
df1.select("col1", "col2").filter(""" id > 100 """).as("table1")
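Putting the pieces together, a sketch of the whole query in the DataFrame API could look like the following. The local SparkSession and the sample rows are assumptions for illustration, and because the table behind emp_id is not clear from the query in the question, it is assumed here to come from tab2.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{broadcast, col, regexp_extract}

// Assumption: a local session and hypothetical sample rows standing in for the real tables.
val spark = SparkSession.builder().master("local[*]").appName("broadcast-join-sketch").getOrCreate()

val df1 = spark.createDataFrame(Seq((101, 1, "T1", "S1"), (102, 2, "T2", "S2")))
  .toDF("id", "id1", "emp_type", "start")
val df2 = spark.createDataFrame(Seq((201, 7, "Paris (75)", "T1", "S1")))
  .toDF("id", "emp_id", "emp_name", "emp_type", "code")

// Filter first, then keep only the columns needed for the join and the final projection.
val tab1 = df1.filter("id > 100").select("id1", "emp_type", "start").as("tab1")
val tab2 = df2.filter("id > 100").select("emp_id", "emp_name", "emp_type", "code").as("tab2")

val result = tab1
  .join(
    broadcast(tab2), // ship the small table to every executor
    col("tab1.emp_type") === col("tab2.emp_type") && col("tab1.start") === col("tab2.code"),
    "inner")
  .select(
    col("tab1.id1"),
    regexp_extract(col("tab2.emp_name"), ".*?(\\d+)\\)$", 1).as("city"),
    col("tab2.emp_id"), // assumption: emp_id belongs to tab2
    col("tab1.emp_type"))
```

The final select computes the regex column and the projection in one step, so no separate withColumn is needed. As far as I know, Spark's optimizer prunes unused columns for both the SQL and the DataFrame versions, so the explicit early select mostly makes the intent clearer.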
Spark SQL has a skew hint available (please see here). Is there an equivalent hint available for Spark Scala?
Example
This is the Spark SQL code where the fact table has a skewed ProductId column:
SELECT /*+ SKEW('viewFact', 'ProductId') */
RevSumDivisionName, RevSumCategoryName, CloudAddOnFlag,
SUM(ActualRevenueAmt) AS RevenueUSD, COUNT(*) AS Cnt
FROM viewFact
INNER JOIN viewPMST ON viewFact.ProductId = viewPMST.ProductId
INNER JOIN viewRsDf ON viewPMST.ProductFamilyId = viewRsDf.ProductFamilyId
INNER JOIN viewRevH ON viewRsDf.RevSumCategoryId = viewRevH.RevSumCategoryId
GROUP BY RevSumDivisionName, RevSumCategoryName, CloudAddOnFlag
Same join in Scala:
inFact
.join(inPMst, Seq("ProductId"))
.join(inRsDf, Seq("ProductFamilyId"))
.join(inRevH, Seq("RevSumCategoryId"))
.groupBy($"RevSumDivisionName", $"RevSumCategoryName", $"CloudAddOnFlag")
.agg(sum($"ActualRevenueAmt") as "RevenueUSD", count($"*") as "Cnt")
I'm just unable to find the syntax for the skew hint.
Spark SQL has a skew hint available
It does not. The Databricks platform has one, but it is a proprietary extension (like indexing) that is not available in Spark itself.
I'm just unable to find the syntax for the skew hint.
In the general case, query plan hints are passed using the hint method, which can be used like this:
val hint: String = ???
inFact.join(inPMst.hint(hint), Seq("ProductId"))
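For a concrete instance, "broadcast" is a hint name that open-source Spark does understand, so a sketch with the hint method could look like this. The local SparkSession and the sample rows are assumptions for illustration. The two adaptive-execution settings are an assumption that you are on Spark 3.0 or later, where the engine can split skewed partitions in sort-merge joins by itself.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session and hypothetical sample rows.
val spark = SparkSession.builder().master("local[*]").appName("hint-sketch").getOrCreate()

// Assumption: Spark 3.0+, where adaptive query execution can handle skewed joins.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

val inFact = spark.createDataFrame(Seq((1, 100.0), (1, 50.0), (2, 25.0)))
  .toDF("ProductId", "ActualRevenueAmt")
val inPMst = spark.createDataFrame(Seq((1, 10), (2, 20)))
  .toDF("ProductId", "ProductFamilyId")

// Ask the planner to broadcast the small side instead of shuffling the skewed fact table.
val joined = inFact.join(inPMst.hint("broadcast"), Seq("ProductId"))
```

Broadcasting only helps when the hinted table is small enough to fit in executor memory.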
I would like to load and process data from a dataframe in Spark using Scala.
The raw SQL Statement looks like this:
INSERT INTO TABLE_1
(
key_attribute,
attribute_1,
attribute_2
)
SELECT
MIN(TABLE_2.key_attribute),
CURRENT_TIMESTAMP as attribute_1,
'Some_String' as attribute_2
FROM TABLE_2
LEFT OUTER JOIN TABLE_1
ON TABLE_2.key_attribute = TABLE_1.key_attribute
WHERE
TABLE_1.key_attribute IS NULL
AND TABLE_2.key_attribute IS NOT NULL
GROUP BY
attribute_1,
attribute_2,
TABLE_2.key_attribute
What I've done so far:
I created a DataFrame from the Select Statement and joined it with the TABLE_2 DataFrame.
val table_1 = spark.sql("Select key_attribute, current_timestamp() as attribute_1, 'Some_String' as attribute_2").toDF();
table_2.join(table_1, Seq("key_attribute"), "left_outer");
Not really much progress, because I face many difficulties:
How do I handle the SELECT while processing the data efficiently? Keep everything in separate DataFrames?
How do I insert the WHERE/GROUP BY clause with attributes from several sources?
Is there any other/better way except Spark SQL?
A few steps for handling this:
First create the DataFrame with your raw data.
Then save it as a temp table.
You can use filter() or a WHERE condition in Spark SQL to get the resulting DataFrame.
Then, as you already did, you can use joins between DataFrames. You can think of a DataFrame as a representation of a table.
Regarding efficiency: the processing is done in parallel, so that is already taken care of. If you need anything more regarding efficiency, please mention it.
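The steps above can be sketched in the DataFrame API as follows. This is only one possible translation of the SQL in the question: it assumes a local SparkSession and made-up sample rows, and it swaps the LEFT OUTER JOIN plus IS NULL check for a left_anti join, with distinct() playing the role of the GROUP BY on the key.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, current_timestamp, lit}

// Assumption: a local session and hypothetical sample rows.
val spark = SparkSession.builder().master("local[*]").appName("anti-join-sketch").getOrCreate()

val table1 = spark.createDataFrame(Seq(Tuple1(1))).toDF("key_attribute")
val table2 = spark
  .createDataFrame(Seq[Tuple1[Option[Int]]](Tuple1(Some(1)), Tuple1(Some(2)), Tuple1(Some(2)), Tuple1(None)))
  .toDF("key_attribute")

val newRows = table2
  .filter(col("key_attribute").isNotNull)          // TABLE_2.key_attribute IS NOT NULL
  .join(table1, Seq("key_attribute"), "left_anti") // keys with no match in TABLE_1
  .select("key_attribute")
  .distinct()                                      // one row per key, like the GROUP BY
  .withColumn("attribute_1", current_timestamp())
  .withColumn("attribute_2", lit("Some_String"))

// To append the result to an existing table, something like this should work:
// newRows.write.insertInto("TABLE_1")
```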
I am implementing a project where MySQL data is imported into HDFS using Sqoop. It has nearly 30 tables. I am reading each table as a DataFrame by inferring the schema and registering it as a temp table. I have a few questions about doing this...
1. Several joins need to be implemented across the tables, say df1 to df10. In MySQL the query would be
select a.id,b.name,c.AccountName from accounts a, priority b, bills c where a.id=b.id and c.name=a.name
Instead of using
sqlContext.sql("select a.id,b.name,c.AccountName from accounts a, priority b, bills c where a.id=b.id and c.name=a.name")
is there another way to join all the data frames effectively based on conditions?
2. Is converting tables to data frames and querying on top of them the correct way, or is there a better way to approach this type of join and querying in Spark?
I had a similar problem and I ended up using:
import scala.collection.mutable.ListBuffer
import org.apache.spark.sql.DataFrame
val df_list = ListBuffer[DataFrame]() // append the DataFrames to join first, e.g. df_list += df1
df_list.toList.reduce((a, b) => a.join(b, a.col(a.schema.head.name) === b.col(b.schema.head.name), "left_outer"))
You could also use a free-form SQL query in Sqoop and do all the joins there, or use Spark JDBC to do the same job.
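For the specific three-table query in the question, a direct DataFrame translation might look like the sketch below. The table and column names come from the question; the local SparkSession and the sample rows are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session and hypothetical sample rows.
val spark = SparkSession.builder().master("local[*]").appName("multi-join-sketch").getOrCreate()

val accounts = spark.createDataFrame(Seq((1, "alice"), (2, "bob"))).toDF("id", "name")
val priority = spark.createDataFrame(Seq((1, "high"), (2, "low"))).toDF("id", "name")
val bills = spark.createDataFrame(Seq(("alice", "ACC-1"))).toDF("name", "AccountName")

// Aliases keep the identically named columns ("id", "name") unambiguous.
val result = accounts.as("a")
  .join(priority.as("b"), col("a.id") === col("b.id"))
  .join(bills.as("c"), col("c.name") === col("a.name"))
  .select(col("a.id"), col("b.name"), col("c.AccountName"))
```

Both this and the sqlContext.sql version go through the same optimizer, so the choice is mostly about readability.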