I am very new to Spark and Scala, and I am writing Spark SQL code. I need to apply CROSS JOIN and CROSS APPLY in my logic. Here is the SQL query that I have to convert to Spark SQL:
select Table1.Column1,Table2.Column2,Table3.Column3
from Table1 CROSS JOIN Table2 CROSS APPLY Table3
I need to run the above query through SQLContext in Spark SQL. Kindly help me. Thanks in advance.
First, set the following property in the Spark configuration:
spark.sql.crossJoin.enabled=true
Then dataFrame1.join(dataFrame2), with no join condition, performs a cross (Cartesian) join.
The same can be done with a SQL query:
sqlContext.sql("select * from table1 CROSS JOIN table2 CROSS JOIN table3...")
Set the Spark configuration:
import org.apache.spark.SparkConf
val sparkConf = new SparkConf()
  .set("spark.sql.crossJoin.enabled", "true")
Explicit cross join in Spark 2.x using the crossJoin method:
crossJoin(right: Dataset[_]): DataFrame
val dfNew = df1.crossJoin(df2)
Note: a cross join returns every pairing of rows from both sides (rows in df1 × rows in df2), so it is one of the most expensive joins and should usually be avoided.
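To make the answer concrete for the three-table query in the question, here is a minimal sketch. The sample rows and the local SparkSession are assumptions for illustration only. As far as I know Spark SQL has no CROSS APPLY keyword; when the right-hand side is a plain table that does not reference the left side, CROSS APPLY yields the same rows as CROSS JOIN, so the sketch treats both as cross joins.

```scala
import org.apache.spark.sql.SparkSession

// Assumed local session, only for trying the sketch out
val spark = SparkSession.builder()
  .appName("cross-join-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical stand-ins for Table1, Table2 and Table3
val table1 = Seq(1, 2).toDF("Column1")
val table2 = Seq("a", "b").toDF("Column2")
val table3 = Seq(true, false).toDF("Column3")

// Chain explicit cross joins, then pick one column from each table
val result = table1
  .crossJoin(table2)
  .crossJoin(table3)
  .select("Column1", "Column2", "Column3")

result.show()
```

With two rows per input the result holds 2 × 2 × 2 rows, which shows how quickly the output grows with real tables.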
Related
In our Spark-Scala application, we want to use typed Datasets. There is a join between DF1 and DF2 (DF = DataFrame).
My question is: should we convert both DF1 and DF2 to Dataset[T] and then perform the join, or should we do the join first and then convert the resulting DataFrame to a Dataset?
As I understand it, since Dataset[T] is being used here for type safety, we should convert DF1 and DF2 to Dataset[T] first. Can someone please confirm, or advise if something is not correct?
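The two options being compared can be sketched as follows. The case classes, column names and sample rows are hypothetical; the sketch only shows what each option looks like and does not claim that one is the recommended approach.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical row types, not taken from the question
case class Order(customerId: Int, amount: Double)
case class Customer(customerId: Int, name: String)
case class OrderWithName(customerId: Int, amount: Double, name: String)

val spark = SparkSession.builder()
  .appName("typed-join-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val df1 = Seq((1, 9.5), (2, 3.0)).toDF("customerId", "amount")
val df2 = Seq((1, "Ann"), (2, "Bob")).toDF("customerId", "name")

// Option A: join the untyped DataFrames, then convert the result
val joinedThenTyped: Dataset[OrderWithName] =
  df1.join(df2, Seq("customerId")).as[OrderWithName]

// Option B: convert first, then use joinWith, which keeps both sides typed
val orders: Dataset[Order] = df1.as[Order]
val customers: Dataset[Customer] = df2.as[Customer]
val typedPairs: Dataset[(Order, Customer)] =
  orders.joinWith(customers, orders("customerId") === customers("customerId"))
```

Note that calling join on two Datasets still returns a DataFrame; joinWith is the variant that returns a Dataset of pairs.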
Spark SQL has a skew hint available (please see here). Is there an equivalent hint available for Spark Scala?
Example
This is the Spark SQL code where fact table has skewed ProductId column:
SELECT /*+ SKEW('viewFact', 'ProductId') */
RevSumDivisionName, RevSumCategoryName, CloudAddOnFlag,
SUM(ActualRevenueAmt) AS RevenueUSD, COUNT(*) AS Cnt
FROM viewFact
INNER JOIN viewPMST ON viewFact.ProductId = viewPMST.ProductId
INNER JOIN viewRsDf ON viewPMST.ProductFamilyId = viewRsDf.ProductFamilyId
INNER JOIN viewRevH ON viewRsDf.RevSumCategoryId = viewRevH.RevSumCategoryId
GROUP BY RevSumDivisionName, RevSumCategoryName, CloudAddOnFlag
Same join in Scala:
inFact
.join(inPMst, Seq("ProductId"))
.join(inRsDf, Seq("ProductFamilyId"))
.join(inRevH, Seq("RevSumCategoryId"))
.groupBy($"RevSumDivisionName", $"RevSumCategoryName", $"CloudAddOnFlag")
.agg(sum($"ActualRevenueAmt") as "RevenueUSD", count($"*") as "Cnt")
I just can't find the syntax for the skew hint.
Spark SQL has a skew hint available
It does not. The Databricks platform has one, but it is a proprietary extension (like indexing) and is not available in Spark itself.
I just can't find the syntax for the skew hint.
In the general case, query plan hints are passed using the hint method, which can be used like this:
val hint: String = ???
inFact.join(inPMst.hint(hint), Seq("ProductId"))
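As a concrete instance of the hint method, here is a minimal sketch using the "broadcast" hint, which open-source Spark supports (Dataset.hint was added in 2.2). The sample rows are made up for illustration, and this is not a replacement for the Databricks skew hint; it only shows the call shape.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hint-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data reusing the column names from the question
val inFact = Seq((1, 10.0), (1, 20.0), (2, 5.0)).toDF("ProductId", "ActualRevenueAmt")
val inPMst = Seq((1, 100), (2, 200)).toDF("ProductId", "ProductFamilyId")

// Ask the planner to broadcast the smaller side of the join
val joined = inFact.join(inPMst.hint("broadcast"), Seq("ProductId"))

// Print the physical plan to check which join strategy was chosen
joined.explain()
```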
I have a 243MB dataset. I need to add a row number column to my DataFrame, and I tried the methods below:
import org.apache.spark.sql.functions._
df.withColumn("Rownumber", monotonically_increasing_id())
The row number goes wrong after 248352 rows: after that point the value jumps to 8589934592.
I also tried:
df.registerTempTable("table")
val query = s"select *,ROW_NUMBER() OVER (order by Year) as Rownumber from table"
val z = hiveContext.sql(query)
This method gives the right answer but takes much more time, so I can't use it.
df.rdd.zipWithIndex has the same problem.
What is the best way to solve this in Spark with Scala? I'm using Spark 2.3.0.
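The jump to 8589934592 is consistent with how monotonically_increasing_id is documented to build its values: the partition index goes into the upper 31 bits and the record number within the partition into the lower 33 bits. The plain-Scala sketch below reproduces that layout; the helper name idFor is hypothetical, and the assumption that the first partition held 248352 rows is inferred from the numbers in the question.

```scala
// Rebuilds the documented id layout: partition index shifted left by 33 bits,
// plus the record's position inside that partition.
def idFor(partitionIndex: Int, recordInPartition: Long): Long =
  (partitionIndex.toLong << 33) + recordInPartition

// Last row of the first partition, assuming it held 248352 rows (0-based position)
val lastOfPartition0 = idFor(0, 248351L)

// First row of the second partition
val firstOfPartition1 = idFor(1, 0L)

println(lastOfPartition0)  // 248351
println(firstOfPartition1) // 8589934592
```

So the ids are unique and increasing, but they are not consecutive across partitions. Consecutive numbering needs a global pass over the data, which is what ROW_NUMBER() over a window or zipWithIndex pays for.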
I need to update a Hive table like this:
update A from B
set
Col5 = A.Col2,
Col2 = B.Col2,
DT_Change = B.DT,
Col3 = B.Col3,
Col4 = B.Col4
where A.Col1 = B.Col1 and A.Col2 <> B.Col2
using Scala and Spark RDDs.
How can I do this?
I want to split this question into two questions to explain it simply.
First question: How to write Spark RDD data to a Hive table?
The simplest way is to convert the RDD into a Spark SQL DataFrame using rdd.toDF(). Then register the DataFrame as a temporary table using df.registerTempTable("temp_table"). Now you can query the temporary table and insert into the Hive table using sqlContext.sql("insert into table my_table select * from temp_table").
Second question: How to update a Hive table from Spark?
As of now, Hive is not the best fit for record-level updates. Updates can only be performed on tables that support ACID. One primary limitation is that only the ORC format supports updating Hive tables. You can find more information at https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions
You can refer to "How to Updata an ORC Hive table form Spark using Scala" for this.
A few of these methods may have been deprecated in Spark 2.x (registerTempTable, for example, was deprecated in 2.0 in favor of createOrReplaceTempView), so check the Spark 2.0 documentation for the latest methods.
While there could be better approaches, this is the simplest approach that I can think of which works.
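When the table does not support ACID updates, one common workaround is to compute the updated rows in Spark and write them out as a new table. The sketch below expresses the UPDATE from the question as a left join plus conditional columns. The sample rows are invented, and it assumes Col1 is unique in B; otherwise the join would duplicate rows of A.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

val spark = SparkSession.builder()
  .appName("update-as-join-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical contents of tables A and B
val a = Seq(
  (1, "x", "2020-01-01", "c3", "c4", "prev"),
  (2, "y", "2020-01-01", "d3", "d4", "prev")
).toDF("Col1", "Col2", "DT_Change", "Col3", "Col4", "Col5")
val b = Seq(
  (1, "z", "2021-06-01", "n3", "n4"),
  (2, "y", "2021-06-01", "m3", "m4")
).toDF("Col1", "Col2", "DT", "Col3", "Col4")

// Mirrors: where A.Col1 = B.Col1 and A.Col2 <> B.Col2
val changed = col("b.Col1").isNotNull && col("a.Col2") =!= col("b.Col2")

val updated = a.as("a")
  .join(b.as("b"), col("a.Col1") === col("b.Col1"), "left_outer")
  .select(
    col("a.Col1").as("Col1"),
    when(changed, col("b.Col2")).otherwise(col("a.Col2")).as("Col2"),
    when(changed, col("b.DT")).otherwise(col("a.DT_Change")).as("DT_Change"),
    when(changed, col("b.Col3")).otherwise(col("a.Col3")).as("Col3"),
    when(changed, col("b.Col4")).otherwise(col("a.Col4")).as("Col4"),
    when(changed, col("a.Col2")).otherwise(col("a.Col5")).as("Col5") // old Col2 is kept in Col5
  )

// Writing is left out of the sketch; with Hive support enabled it could be, for example:
// updated.write.mode("overwrite").saveAsTable("a_updated")  // hypothetical table name
updated.show()
```

Writing to a separate table and swapping it in afterwards avoids overwriting a table while it is still being read.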
I would like to load and process data from a dataframe in Spark using Scala.
The raw SQL Statement looks like this:
INSERT INTO TABLE_1
(
key_attribute,
attribute_1,
attribute_2
)
SELECT
MIN(TABLE_2.key_attribute),
CURRENT_TIMESTAMP as attribute_1,
'Some_String' as attribute_2
FROM TABLE_2
LEFT OUTER JOIN TABLE_1
ON TABLE_2.key_attribute = TABLE_1.key_attribute
WHERE
TABLE_1.key_attribute IS NULL
AND TABLE_2.key_attribute IS NOT NULL
GROUP BY
attribute_1,
attribute_2,
TABLE_2.key_attribute
What I've done so far:
I created a DataFrame from the Select Statement and joined it with the TABLE_2 DataFrame.
val table_1 = spark.sql("Select key_attribute, current_timestamp() as attribute_1, 'Some_String' as attribute_2").toDF();
table_2.join(table_1, Seq("key_attribute"), "left_outer");
Not much progress, because I am facing several difficulties:
How do I handle the SELECT that processes data efficiently? Should I keep everything in separate DataFrames?
How do I express the WHERE/GROUP BY clauses with attributes from several sources?
Is there any other or better way besides Spark SQL?
A few steps for handling this:
First, create the DataFrame from your raw data.
Then save it as a temp table.
You can use filter() or a WHERE condition in Spark SQL to get the resulting DataFrame.
Then, as you did, you can use joins with DataFrames. You can think of a DataFrame as a representation of a table.
Regarding efficiency: since the processing is done in parallel, it is largely taken care of. If you need anything more regarding efficiency, please mention it.
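The steps above can be sketched with the DataFrame API as follows. This is one possible reading of the SQL in the question, not the only one: the LEFT OUTER JOIN with "TABLE_1.key_attribute IS NULL" is written as a left anti join, and MIN over a group keyed by key_attribute is written as distinct. The sample rows are assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, current_timestamp, lit}

val spark = SparkSession.builder()
  .appName("insert-select-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical contents; only key_attribute matters for finding missing keys
val table1 = Seq("k1").toDF("key_attribute")
val table2 = Seq("k1", "k2", "k2", null).toDF("key_attribute")

val newRows = table2
  .filter(col("key_attribute").isNotNull)            // TABLE_2.key_attribute IS NOT NULL
  .join(table1, Seq("key_attribute"), "left_anti")   // keys with no match in TABLE_1
  .select("key_attribute")
  .distinct()                                        // one row per key, like the GROUP BY
  .withColumn("attribute_1", current_timestamp())
  .withColumn("attribute_2", lit("Some_String"))

// Appending to an existing table is left as a comment because it needs that table to exist:
// newRows.write.mode("append").insertInto("TABLE_1")
newRows.show(truncate = false)
```

Keeping the work in a single chained DataFrame lets Spark plan the filter, join and projection together, so there is no need to hold intermediate results in separate DataFrames.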