SparkSQL Select with multiple columns, then join? - scala

I'm unfamiliar with Spark SQL, but I want to select multiple columns in this query and then join the two frames. The primary key column is ID from df.
val count1 = df.select(size($"col1").as("col1Name"))
val count2 = df.select(size($"col2").as("col2Name"))
So ultimately I want a table with ID, count1 and count2. How can I achieve this?

I believe what you are trying to do is count two columns from df. You can do this with the query below:
df.createOrReplaceTempView("temp_table")
// Below is an example of how you can use Spark SQL
val newdf = spark.sql("select id, count(col1) as count1, count(col2) as count2 from temp_table group by id")
// You can use this dataframe for further operations
newdf.show(false)
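If col1 and col2 are array columns and you actually want their per-row sizes next to the ID (as in the size() calls in the question), a minimal DataFrame-API sketch, assuming df has columns id, col1 and col2, would be:
import spark.implicits._
import org.apache.spark.sql.functions.size

// Select the ID together with the element counts of both array columns in one pass,
// so no join between two separate single-column dataframes is needed.
val counts = df.select($"id", size($"col1").as("count1"), size($"col2").as("count2"))
counts.show(false)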

Related

Apache Spark merge two identical DataFrames summing all rows and columns

I have two dataframes with identical column names but a different number of rows; each row is identified by an ID and a Date, as follows:
First table (the one with all the ID's available):
ID | Date       | Amount A
1  | 2021-09-01 | 100
1  | 2021-09-02 | 50
2  | 2021-09-01 | 70
Second table (a smaller version including only some ID's):
ID | Date       | Amount A
2  | 2021-09-01 | 50
2  | 2021-09-02 | 30
What I would like to have is a single table with the following output:
ID | Date       | Amount A
1  | 2021-09-01 | 100
1  | 2021-09-02 | 50
2  | 2021-09-01 | 120
2  | 2021-09-02 | 30
Thanks in advance.
Approach 1: Using a Join
You can join both tables and sum the amounts of matching rows. A full outer join is used, with missing amounts coalesced to 0, so that rows present in only one of the tables (such as ID 2 on 2021-09-02) are kept.
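For reference, the two input dataframes from the question can be recreated like this (a minimal Scala sketch, assuming a SparkSession named spark):
import spark.implicits._

// Sample data taken from the tables in the question
val firstDf = Seq(
  (1, "2021-09-01", 100),
  (1, "2021-09-02", 50),
  (2, "2021-09-01", 70)
).toDF("ID", "Date", "AmountA")

val secondDf = Seq(
  (2, "2021-09-01", 50),
  (2, "2021-09-02", 30)
).toDF("ID", "Date", "AmountA")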
Using spark sql
Ensure your dataframes are accessible as temporary views
firstDf.createOrReplaceTempView("first_df")
secondDf.createOrReplaceTempView("second_df")
Execute the following on your spark session
val outputDf = sparkSession.sql("<insert sql below here>")
SELECT
  COALESCE(first_df.ID, second_df.ID) AS ID,
  COALESCE(first_df.Date, second_df.Date) AS Date,
  COALESCE(first_df.AmountA, 0) + COALESCE(second_df.AmountA, 0) AS AmountA
FROM
  first_df
FULL OUTER JOIN
  second_df ON first_df.ID = second_df.ID AND
               first_df.Date = second_df.Date
Using Scala api
val outputDf = firstDf.alias("first_df")
  .join(
    secondDf.alias("second_df"),
    Seq("ID", "Date"),
    "full"
  )
  .selectExpr(
    "ID",
    "Date",
    "coalesce(first_df.AmountA, 0) + coalesce(second_df.AmountA, 0) as AmountA"
  )
Using pyspark api
outputDf = (
    firstDf.alias("first_df")
    .join(
        secondDf.alias("second_df"),
        ["ID", "Date"],
        "full"
    )
    .selectExpr(
        "ID",
        "Date",
        "coalesce(first_df.AmountA, 0) + coalesce(second_df.AmountA, 0) as AmountA"
    )
)
Approach 2: Using a Union then aggregate by sum
Using spark sql
Ensure your dataframes are accessible as temporary views
firstDf.createOrReplaceTempView("first_df")
secondDf.createOrReplaceTempView("second_df")
Execute the following on your spark session
val outputDf = sparkSession.sql("<insert sql below here>")
SELECT
  ID,
  Date,
  SUM(AmountA) as AmountA
FROM (
  SELECT ID, Date, AmountA FROM first_df
  UNION ALL
  SELECT ID, Date, AmountA FROM second_df
) t
GROUP BY
  ID,
  Date
Using Scala api
import org.apache.spark.sql.functions.sum

val outputDf = firstDf.select("ID", "Date", "AmountA")
  .union(secondDf.select("ID", "Date", "AmountA"))
  .groupBy("ID", "Date")
  .agg(
    sum("AmountA").alias("AmountA")
  )
Using Pyspark api
from pyspark.sql import functions as F
outputDf = (
    firstDf.select("ID", "Date", "AmountA")
    .union(secondDf.select("ID", "Date", "AmountA"))
    .groupBy("ID", "Date")
    .agg(
        F.sum("AmountA").alias("AmountA")
    )
)
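With the sample data above, either approach should give the output the question asks for, e.g. in Scala:
outputDf.orderBy("ID", "Date").show()
// Expected (matching the desired output in the question):
// ID=1, Date=2021-09-01, AmountA=100
// ID=1, Date=2021-09-02, AmountA=50
// ID=2, Date=2021-09-01, AmountA=120
// ID=2, Date=2021-09-02, AmountA=30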
Let me know if this works for you.

Spark scala window count max

I have the following df:
result | state | clubName
win    | XYZ   | club1
win    | XYZ   | club2
win    | XYZ   | club1
win    | PQR   | club3
I need, for each state, the clubName with the maximum number of wins.
val byState = Window.partitionBy("state").orderBy('state)
I tried creating a window, but it does not help.
Expected Result:
Something like this in SQL:
select temp.res
from (select count(result) as res
      from table
      group by clubName) temp
group by state
e.g.
state | max_count_of_wins | clubName
XYZ   | 2                 | club1
You can get the win count for each club, then assign a rank for each club ordered by wins, and filter those rows with rank = 1.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df2 = df.withColumn(
  "wins",
  count(when(col("result") === "win", 1))
    .over(Window.partitionBy("state", "clubName"))
).withColumn(
  "rn",
  row_number().over(Window.partitionBy("state").orderBy(desc("wins")))
).filter("rn = 1").selectExpr("state", "wins as max_count_of_wins", "clubName")

df2.show
+-----+-----------------+--------+
|state|max_count_of_wins|clubName|
+-----+-----------------+--------+
| PQR| 1| club3|
| XYZ| 2| club1|
+-----+-----------------+--------+
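If two clubs tie for the most wins in a state and you want to keep all of them, a hypothetical variant (same imports as above) swaps row_number for rank:
val df2Ties = df.withColumn(
  "wins",
  count(when(col("result") === "win", 1))
    .over(Window.partitionBy("state", "clubName"))
).withColumn(
  "rn",
  // rank() keeps every club tied for the maximum; row_number() picks one arbitrarily
  rank().over(Window.partitionBy("state").orderBy(desc("wins")))
).filter("rn = 1").selectExpr("state", "wins as max_count_of_wins", "clubName")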
You can also run plain SQL through Spark SQL, after registering the dataframe df as a temporary view named Table1 (e.g. df.createOrReplaceTempView("Table1")):
spark.sql("""
  SELECT state, nWins AS max_count_of_wins, clubName
  FROM (
    SELECT state, clubName, COUNT(1) AS nWins,
           ROW_NUMBER() OVER (PARTITION BY state ORDER BY COUNT(1) DESC) AS rn
    FROM Table1
    WHERE result = 'win'
    GROUP BY state, clubName
  ) t
  WHERE rn = 1
""")
p.s. if you want to try it yourself use the initialization
CREATE TABLE Table1
(`result` varchar(3), `state` varchar(3), `clubName` varchar(5))
;
INSERT INTO Table1
(`result`, `state`, `clubName`)
VALUES
('win', 'XYZ', 'club1'),
('win', 'XYZ', 'club2'),
('win', 'XYZ', 'club1'),
('win', 'PQR', 'club3')
;
on http://sqlfiddle.com.
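Equivalently, a minimal sketch creating the same sample data directly in Spark (assuming a SparkSession named spark):
import spark.implicits._

val df = Seq(
  ("win", "XYZ", "club1"),
  ("win", "XYZ", "club2"),
  ("win", "XYZ", "club1"),
  ("win", "PQR", "club3")
).toDF("result", "state", "clubName")

df.createOrReplaceTempView("Table1")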

Loop through the list which has queries to be executed and appended to dataframe

I need to loop through each element in the list, run the query against the database, and append the result into the same dataframe (df). Could you please let me know how to achieve this?
PS : I am using spark scala for this.
List(
  (select * from table1 where a=10) as rules,
  (select * from table1 where b=10) as rules,
  (select * from table1 where c=10) as rules
)
Thank you.
As you load data from the same table table1, you can simply combine the conditions with OR in the WHERE clause:
val df = spark.sql("select * from table1 where a=10 or b=10 or c=10")
If the queries are on different tables, you can map each one to a dataframe and then union them (assuming they return compatible schemas):
val queries = List(
  "select * from table1 where a=10",
  "select * from table1 where b=10",
  "select * from table1 where c=10"
)

val df = queries.map(spark.sql).reduce(_ union _)
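If you also need to know which rule produced each row (the question aliases every query as rules), a hypothetical variant tags the rows before the union:
import org.apache.spark.sql.functions.lit

// Tag each result set with a (hypothetical) rule name, then stack them into one dataframe
val dfWithRule = queries.zipWithIndex.map { case (query, i) =>
  spark.sql(query).withColumn("rule", lit(s"rule_${i + 1}"))
}.reduce(_ union _)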

Can we write a hive query in Spark - UDF

Can we write a Hive query inside a Spark UDF?
e.g. I have 2 tables:
Table A and B
where b1 contains column names of A and b2 contains the values of those columns in A.
Now I want to query the tables in such a way that I get the result below:
Result.
Basically, replace the values of the columns in A with the corresponding values from B, based on the column names and their values.
To achieve that I wrote a Spark UDF, e.g. convert, as below:
def convert(colname: String, colvalue: String) = {
  sqlContext.sql("SELECT b3 from B where b1 = colname and b2 = colvalue").toString
}
I registered it as:
sqlContext.udf.register("conv",convert(_:String,_:String));
Now my main query is-
val result = sqlContext.sql("select a1 , conv('a2',a2), conv('a3',a3)");
result.take(2);
It gives me java.lang.NullPointerException.
Can someone please suggest if this feature is supported in spark/hive.
Any other approach is also welcome.
Thanks!
No, a UDF does not allow you to run a query inside it: the UDF body executes on the executors, where no SQLContext/SparkSession is available, which is why you get the NullPointerException.
You can only pass data in as variables and transform it to get the final result back at the row/column level.
Here is the solution to your question. You can do it in Hive itself.
WITH a_plus_col AS (
  SELECT a1, 'a2' AS col_name, a2 AS col_value FROM A
  UNION ALL
  SELECT a1, 'a3' AS col_name, a3 AS col_value FROM A
)
SELECT a_plus_col.a1 AS r1,
       MAX(CASE WHEN a_plus_col.col_name = 'a2' THEN B.b3 END) AS r2,
       MAX(CASE WHEN a_plus_col.col_name = 'a3' THEN B.b3 END) AS r3
FROM a_plus_col
INNER JOIN B ON (a_plus_col.col_name = b1 AND a_plus_col.col_value = b2)
GROUP BY a_plus_col.a1;
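The same query also runs unchanged through Spark SQL once both tables are registered as temporary views; a minimal sketch, assuming DataFrames dfA and dfB hold tables A and B and a SparkSession named spark:
dfA.createOrReplaceTempView("A")
dfB.createOrReplaceTempView("B")

// Run the Hive-style query above as one Spark SQL statement
// instead of calling sqlContext inside a UDF.
val result = spark.sql("""
  WITH a_plus_col AS (
    SELECT a1, 'a2' AS col_name, a2 AS col_value FROM A
    UNION ALL
    SELECT a1, 'a3' AS col_name, a3 AS col_value FROM A
  )
  SELECT a_plus_col.a1 AS r1,
         MAX(CASE WHEN a_plus_col.col_name = 'a2' THEN B.b3 END) AS r2,
         MAX(CASE WHEN a_plus_col.col_name = 'a3' THEN B.b3 END) AS r3
  FROM a_plus_col
  INNER JOIN B ON (a_plus_col.col_name = b1 AND a_plus_col.col_value = b2)
  GROUP BY a_plus_col.a1
""")
result.show(false)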

Deduplication problems in Pyspark

I have one dataframe with many rows of id, date and other information. It contains 2,856,134 records. A count distinct of ID results in 1,552,184 records.
Using this:
DF2 = sorted(DF.groupBy(DF.id).max('date').alias('date').collect())
Gives me the max date per ID, and results in 1,552,184 records, which matches the above. So far so good.
I try to join DF2 back to DF where id = id and max_date = date:
df3 = DF2.join(DF,(DF2.id==DF.id)&(DF2.Max_date==DF.date),"left")
This results in 2,358,316 records - which is different than the original amount.
I changed the code to:
df3 = DF2.join(DF,(DF2.id==DF.id)&(DF2.Max_date==DF.date),"left").dropDuplicates()
This results in 1,552,508 records (which is odd, since it should return 1,552,184 from the de-duplicated DF2 above).
Any idea what's happening here? I presume it's something to do with my join function.
Thanks!
It's because your table 2 has duplicate entries for the join key. For example:

Table1   Table2
------   ------
1        2
2        2
3        5
4        6

SELECT Table1.Id, Table2.Id
FROM Table1
LEFT OUTER JOIN Table2 ON Table1.Id = Table2.Id
Results:
1,null
2,2
2,2
3,null
4,null
I hope this helps you solve your problem.
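As a side note, one way to sidestep the join (and the duplicate-key inflation) entirely is to keep exactly one row per id with a window function. Sketched here in Scala (the PySpark equivalent is analogous); the column names id and date are taken from the question:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Keep one row per id: the one with the latest date.
// This avoids joining the max date back onto DF, which multiplies rows
// whenever several rows share the same (id, max date).
val w = Window.partitionBy("id").orderBy(col("date").desc)
val df3 = DF.withColumn("rn", row_number().over(w))
  .filter(col("rn") === 1)
  .drop("rn")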