Spark data frame rows between 1 preceding and 1 preceding - scala

Here is a piece of a Teradata query that I am trying to rewrite with the Spark Scala DataFrame API.
Input Query
SELECT MIN(activitydate)
       OVER (ORDER BY cust_id, site_group, activitydate
             ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) AS pread
FROM table_name
What is the equivalent of ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING in Spark Scala?
This is what I tried:
val window = Window.orderBy('cust_id, 'site_group, 'activitydate)
val df2 = df1.withColumn("pread", lag('activitydate, 1).over(window))
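For reference, the Teradata frame can also be expressed literally with rowsBetween(-1, -1); a minimal sketch, assuming df1 carries the columns named in the query above:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, min}

// A frame of exactly one row: the row immediately before the current one,
// i.e. ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING
// (no partitionBy, mirroring the original query, so all rows go through one partition)
val w = Window
  .orderBy("cust_id", "site_group", "activitydate")
  .rowsBetween(-1, -1)

// MIN over a single-row frame is simply that row's value
val withPread = df1.withColumn("pread", min(col("activitydate")).over(w))

The lag('activitydate, 1) attempt above yields the same result, since both read exactly one row back from the current row (and return null for the first row).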

Related

Apache Spark merge two identical DataFrames summing all rows and columns

I have two DataFrames with identical column names but a different number of rows, each of them identified by an ID and a Date, as follows:
First table (the one with all the IDs available):
ID    Date          Amount A
1     2021-09-01    100
1     2021-09-02    50
2     2021-09-01    70
Second table (a smaller version including only some IDs):
ID    Date          Amount A
2     2021-09-01    50
2     2021-09-02    30
What I would like to have is a single table with the following output:
ID    Date          Amount A
1     2021-09-01    100
1     2021-09-02    50
2     2021-09-01    120
2     2021-09-02    30
Thanks in advance.
Approach 1: Using a Join
You may join both tables and sum the amounts of matching rows. Since some (ID, Date) pairs exist in only one of the tables, use a full outer join and COALESCE so that unmatched rows are kept and a missing amount counts as 0.
Using Spark SQL
Ensure your dataframes are accessible as temporary views:
firstDf.createOrReplaceTempView("first_df")
secondDf.createOrReplaceTempView("second_df")
Execute the following on your Spark session:
val outputDf = sparkSession.sql("<insert sql below here>")
SELECT
  COALESCE(first_df.ID, second_df.ID) AS ID,
  COALESCE(first_df.Date, second_df.Date) AS Date,
  COALESCE(first_df.AmountA, 0) + COALESCE(second_df.AmountA, 0) AS AmountA
FROM
  first_df
FULL OUTER JOIN
  second_df ON first_df.ID = second_df.ID AND
               first_df.Date = second_df.Date
Using the Scala API
val outputDf = firstDf.alias("first_df")
  .join(
    secondDf.alias("second_df"),
    Seq("ID", "Date"),
    "full"
  )
  .selectExpr(
    "ID",
    "Date",
    "coalesce(first_df.AmountA, 0) + coalesce(second_df.AmountA, 0) as AmountA"
  )
Using the PySpark API
outputDf = (
    firstDf.alias("first_df")
    .join(
        secondDf.alias("second_df"),
        ["ID", "Date"],
        "full"
    )
    .selectExpr(
        "ID",
        "Date",
        "coalesce(first_df.AmountA, 0) + coalesce(second_df.AmountA, 0) as AmountA"
    )
)
Approach 2: Using a union, then aggregating with sum
Using Spark SQL
Ensure your dataframes are accessible as temporary views:
firstDf.createOrReplaceTempView("first_df")
secondDf.createOrReplaceTempView("second_df")
Execute the following on your Spark session:
val outputDf = sparkSession.sql("<insert sql below here>")
SELECT
ID,
Date,
SUM(AmountA) as AmountA
FROM (
SELECT ID, Date, AmountA FROM first_df UNION ALL
SELECT ID, Date, AmountA FROM second_df
) t
GROUP BY
ID,
Date
Using the Scala API
import org.apache.spark.sql.functions.sum

val outputDf = firstDf.select("ID", "Date", "AmountA")
  .union(secondDf.select("ID", "Date", "AmountA"))
  .groupBy("ID", "Date")
  .agg(
    sum("AmountA").alias("AmountA")
  )
Using the PySpark API
from pyspark.sql import functions as F

outputDf = (
    firstDf.select("ID", "Date", "AmountA")
    .union(secondDf.select("ID", "Date", "AmountA"))
    .groupBy("ID", "Date")
    .agg(
        F.sum("AmountA").alias("AmountA")
    )
)
Let me know if this works for you.

SparkSQL Select with multiple columns, then join?

I'm unfamiliar with Spark SQL, but I want to select multiple columns in this query and then join the two frames. The primary key column is ID from df.
val count1 = df.select(size($"col1").as("col1Name"))
val count2 = df.select(size($"col2").as("col2Name"))
So ultimately I want a table with ID, count1 and count2. How can I achieve this?
I believe what you are trying to do is count two columns from df. You can do this as below:
df.createOrReplaceTempView("temp_table")
// Below is an example of how you can use Spark SQL
val newdf = spark.sql("select id, count(col1) as count1, count(col2) as count2 from temp_table group by id")
// You can use this dataframe further for operations
newdf.show(false)
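If what you actually need is the per-row array length, as in the size(...) calls in the question, rather than a grouped row count, both values can be taken from df in a single select with no join at all; a minimal sketch, assuming col1 and col2 are array-typed columns and spark.implicits._ is imported for the $ syntax:

import org.apache.spark.sql.functions.size

// Keep the ID alongside both sizes in one pass over df
val counts = df.select(
  $"ID",
  size($"col1").as("col1Name"),
  size($"col2").as("col2Name")
)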

Is Scala Spark better than PySpark?

I was translating PySpark code to Scala Spark, expecting Scala Spark to perform well.
But Scala Spark is taking more time than PySpark.
Can anyone please spot issues with these two queries when they are executed in Scala Spark?
Query 1:
sqlContext.sql("""
  SELECT a.pair, a.bi_count, a.uni_count,
         unigram_table.uni_count AS uni_count_2,
         (log(a.bi_count) - log(a.uni_count) - log(unigram_table.uni_count)) AS score
  FROM (
    SELECT * FROM bigram_table
    JOIN unigram_table ON bigram_table.parent = unigram_table.token
  ) AS a
  JOIN unigram_table ON a.child = unigram_table.token
  WHERE a.bi_count > 4000
  ORDER BY score DESC
  LIMIT 400000
""")
Execution time in pyspark - 3 min
Execution time in Scala spark - 3 min
Query 2:
sqlContext.sql("""
  SELECT pair, tri_count,
         (log(tri_count) - log(count1) - log(count2) - log(unigram_table.uni_count)) AS score
  FROM (
    SELECT pair, tri_count, count1, child1, child2,
           unigram_table.uni_count AS count2
    FROM (
      SELECT pair, child1, child2, tri_count,
             unigram_table.uni_count AS count1
      FROM trigram_table
      JOIN unigram_table ON trigram_table.parent = unigram_table.token
    ) AS a
    JOIN unigram_table ON a.child1 = unigram_table.token
  ) AS b
  JOIN unigram_table ON b.child2 = unigram_table.token
  WHERE tri_count > 3000
  ORDER BY score DESC
""")
Execution time in pyspark - 3 min
Execution time in Scala spark - 12 min

Spark SQL: Unable to use aggregate within a window function

I use this SQL to create a session_id for a dataset: if a user is inactive for more than 30 minutes (30 * 60 seconds), a new session_id is assigned. I am new to Spark SQL and am trying to replicate the same procedure with a Spark SQL context, but I'm encountering some errors.
session_id follows the naming convention:
userid_1,
userid_2,
userid_3,...
SQL (date is in seconds):
CREATE TABLE tablename_with_session_id AS
SELECT * , userid || '_' || SUM(new_session) OVER (PARTITION BY userid ORDER BY date asc, new_session desc rows unbounded preceding) AS session_id
FROM
(SELECT *,
CASE
WHEN (date - LAG(date) OVER (PARTITION BY userid ORDER BY date) >= 30 * 60)
THEN 1
WHEN row_number() over (partition by userid order by date) = 1
THEN 1
ELSE 0
END as new_session
FROM
tablename
)
order by date;
I tried using the same SQL in Spark-Scala with:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val tableSessionID = sqlContext.sql("""
  SELECT *, CONCAT(userid, '_', SUM(new_session)) OVER (PARTITION BY userid ORDER BY date asc, new_session desc rows unbounded preceding) AS new_session_id
  FROM (SELECT *, CASE WHEN (date - LAG(date) OVER (PARTITION BY userid ORDER BY date) >= 30 * 60) THEN 1 WHEN row_number() over (partition by userid order by date) = 1 THEN 1 ELSE 0 END as new_session FROM clickstream)
  ORDER BY date""")
I got an error suggesting that the Spark SQL expression sum(new_session) be wrapped within a window function.
I then tried using multiple data frames:
val temp1 = sqlContext.sql("SELECT *, CASE WHEN (date - LAG(date) OVER (PARTITION BY userid ORDER BY date) >= 30 * 60) THEN 1 WHEN row_number() over (partition by userid order by date) = 1 THEN 1 ELSE 0 END as new_session FROM clickstream")
temp1.registerTempTable("clickstream_temp1")
val temp2 = sqlContext.sql("SELECT * , SUM(new_session) OVER (PARTITION BY userid ORDER BY date asc, new_session desc rows unbounded preceding) AS s_id FROM clickstream_temp1")
temp2.registerTempTable("clickstream_temp2")
val temp3 = sqlContext.sql("SELECT * , CONCAT(userid,'_',s_id) OVER (PARTITION BY userid ORDER BY date asc, new_session desc rows unbounded preceding) AS new_session_id FROM clickstream_temp2")
Only the last statement, val temp3 = ..., returns an error: CONCAT(userid, '_', s_id) cannot be used within a window function.
What's the workaround? Is there an alternative?
Thanks
To use concat with a Spark window function you need a user-defined aggregate function (UDAF); you can't directly use the concat function as a window function.
//Extend UserDefinedAggregateFunction to write custom aggregate function
//You can also specify any constructor arguments. For instance you can have
//CustomConcat(arg1: Int, arg2: String)
class CustomConcat() extends org.apache.spark.sql.expressions.UserDefinedAggregateFunction {
import org.apache.spark.sql.types._
import org.apache.spark.sql.expressions.MutableAggregationBuffer
import org.apache.spark.sql.Row
// Input Data Type Schema
def inputSchema: StructType = StructType(Array(StructField("description", StringType)))
// Intermediate Schema
def bufferSchema = StructType(Array(StructField("groupConcat", StringType)))
// Returned Data Type.
def dataType: DataType = StringType
// Self-explaining
def deterministic = true
// This function is called whenever key changes
def initialize(buffer: MutableAggregationBuffer) = {buffer(0) = " ".toString}
// Iterate over each entry of a group
def update(buffer: MutableAggregationBuffer, input: Row) = { buffer(0) = buffer.getString(0) + input.getString(0) }
// Merge two partial aggregates
def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = { buffer1(0) = buffer1.getString(0) + buffer2.getString(0) }
// Called after all the entries are exhausted.
def evaluate(buffer: Row) = {buffer.getString(0)}
}
val newdescription = new CustomConcat
val newdesc1 = newdescription($"description").over(windowspec)
You can use newdesc1 as an aggregate function for concatenation in window functions.
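Note that windowspec is not defined in the snippet above; a minimal sketch of one that mirrors the question's SUM(new_session) window, declared before the two lines above, might look like this (clickstreamDf is an assumed handle to the clickstream data):

import org.apache.spark.sql.expressions.Window

// Same partitioning and ordering as the question's window,
// with a running frame from the start of the partition to the current row
val windowspec = Window
  .partitionBy($"userid")
  .orderBy($"date".asc, $"new_session".desc)
  .rowsBetween(Long.MinValue, 0)

// Attach the windowed UDAF column to the data
val result = clickstreamDf.withColumn("running_concat", newdesc1)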
For more information you can have a look at :
databricks udaf
I hope this will answer your question.
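As a side note: if all that is needed is the userid_N string itself, the concatenation does not have to be evaluated over a window at all. Once s_id has been computed (as in clickstream_temp2 above), a plain per-row CONCAT avoids the error; a sketch using the question's own temp table:

val temp3 = sqlContext.sql("SELECT *, CONCAT(userid, '_', CAST(s_id AS STRING)) AS new_session_id FROM clickstream_temp2")

Here CONCAT is an ordinary row-level function rather than a window expression, which is why it no longer needs the OVER clause.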

T-SQL query to find equal row values across columns

I have this table:
col 1   col 2   col 3   ...   col N
------------------------------------
1       A       B       ...   fooa
10      A       foo     ...   cc
4       A       B       ...   fooa
Is it possible, with a T-SQL query, to return only one row that has a value only where the values in that column are ALL the same?
col 1   col 2   col 3   ...   col N
------------------------------------
--      A       --      ...   --
SELECT
CASE WHEN COUNT(col1) = COUNT(*) AND MIN(col1) = MAX(col1) THEN MIN(col1) END AS col1,
CASE WHEN COUNT(col2) = COUNT(*) AND MIN(col2) = MAX(col2) THEN MIN(col2) END AS col2,
...
FROM yourtable
You have to allow for NULLs in the column:
COUNT(*) counts them
COUNT(col1) doesn't count them
That is, a column with a mix of As and NULLs isn't a single value, even though MIN and MAX would both give A, because they ignore NULLs. For example, with values (A, NULL, A), COUNT(*) = 3 but COUNT(col1) = 2, so the COUNT check fails while MIN(col1) = MAX(col1) = 'A'.
Edit:
removed DISTINCT to get counts the same for NULL check
added the MIN/MAX check (as per Mark Byers' deleted answer) to verify uniqueness