Spark and SparkSQL: How to imitate window function? - scala

Description
Given a dataframe df
id | date
---------------
1 | 2015-09-01
2 | 2015-09-01
1 | 2015-09-03
1 | 2015-09-04
2 | 2015-09-04
I want to create a running counter or index,
grouped by the same id and
sorted by date in that group,
thus
id | date | counter
--------------------------
1 | 2015-09-01 | 1
1 | 2015-09-03 | 2
1 | 2015-09-04 | 3
2 | 2015-09-01 | 1
2 | 2015-09-04 | 2
This is something I can achieve with window function, e.g.
val w = Window.partitionBy("id").orderBy("date")
val resultDF = df.select( df("id"), rowNumber().over(w) )
Unfortunately, Spark 1.4.1 does not support window functions for regular dataframes:
org.apache.spark.sql.AnalysisException: Could not resolve window function 'row_number'. Note that, using window functions currently requires a HiveContext;
Questions
How can I achieve the above computation on current Spark 1.4.1 without using window functions?
When will window functions for regular dataframes be supported in Spark?
Thanks!

You can use HiveContext for local DataFrames as well and, unless you have a very good reason not to, it is probably a good idea anyway. It is the default SQLContext available in the spark-shell and pyspark shells (for now sparkR seems to use a plain SQLContext), and its parser is recommended by the Spark SQL and DataFrame Guide.
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rowNumber
object HiveContextTest {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Hive Context")
    val sc = new SparkContext(conf)
    val sqlContext = new HiveContext(sc)
    import sqlContext.implicits._

    val df = sc.parallelize(
      ("foo", 1) :: ("foo", 2) :: ("bar", 1) :: ("bar", 2) :: Nil
    ).toDF("k", "v")

    val w = Window.partitionBy($"k").orderBy($"v")
    df.select($"k", $"v", rowNumber.over(w).alias("rn")).show
  }
}
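Equivalently, once you have a HiveContext you can register the dataframe as a temporary table and write the window function in plain SQL. A minimal sketch, assuming the same df and sqlContext as above ("df_tmp" is just an illustrative table name):
// Spark 1.4/1.5 API; registerTempTable was later replaced by createOrReplaceTempView
df.registerTempTable("df_tmp")
sqlContext.sql(
  """SELECT k, v, row_number() OVER (PARTITION BY k ORDER BY v) AS rn
    |FROM df_tmp""".stripMargin
).show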

You can do this with RDDs. Personally I find the API for RDDs makes a lot more sense - I don't always want my data to be 'flat' like a dataframe.
val df = sqlContext.sql("select 1, '2015-09-01'"
).unionAll(sqlContext.sql("select 2, '2015-09-01'")
).unionAll(sqlContext.sql("select 1, '2015-09-03'")
).unionAll(sqlContext.sql("select 1, '2015-09-04'")
).unionAll(sqlContext.sql("select 2, '2015-09-04'"))
// dataframe as an RDD (of Row objects)
df.rdd
// grouping by the first column of the row
.groupBy(r => r(0))
// map each group - an Iterable[Row] - to a list and sort by the second column
.map(g => g._2.toList.sortBy(row => row(1).toString))
.collect()
The above gives a result like the following:
Array[List[org.apache.spark.sql.Row]] =
Array(
List([1,2015-09-01], [1,2015-09-03], [1,2015-09-04]),
List([2,2015-09-01], [2,2015-09-04]))
If you want the position within the 'group' as well, you can use zipWithIndex.
df.rdd.groupBy(r => r(0)).map(g =>
g._2.toList.sortBy(row => row(1).toString).zipWithIndex).collect()
Array[List[(org.apache.spark.sql.Row, Int)]] = Array(
List(([1,2015-09-01],0), ([1,2015-09-03],1), ([1,2015-09-04],2)),
List(([2,2015-09-01],0), ([2,2015-09-04],1)))
You could flatten this back to a simple List/Array of Row objects using flatMap, but if you still need to operate on the 'group', that won't be a great idea.
The downside to using RDD like this is that it's tedious to convert from DataFrame to RDD and back again.
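For reference, here is a minimal sketch of flattening the zipped groups back into (id, date, counter) tuples and then into a dataframe again. It assumes the same df and a sqlContext with implicits imported, and makes the counter 1-based as in the question:
import sqlContext.implicits._

val resultDF = df.rdd
  .groupBy(r => r(0))
  .flatMap { case (_, rows) =>
    // sort each group by date and attach a 1-based position
    rows.toList.sortBy(row => row(1).toString).zipWithIndex.map {
      // assumes the first column is an Int and the second a String, as in the sample df
      case (row, idx) => (row.getInt(0), row.getString(1), idx + 1)
    }
  }
  .toDF("id", "date", "counter")

resultDF.show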

I totally agree that window functions for DataFrames are the way to go if you have Spark version >= 1.5. But if you are really stuck with an older version (e.g. 1.4.1), here is a hacky way to solve this:
import org.apache.spark.sql.functions.count

val df = sc.parallelize((1, "2015-09-01") :: (2, "2015-09-01") :: (1, "2015-09-03") :: (1, "2015-09-04") :: (2, "2015-09-04") :: Nil)
  .toDF("id", "date")

val dfDuplicate = df.selectExpr("id as idDup", "date as dateDup")

val dfWithCounter = df.join(dfDuplicate, $"id" === $"idDup")
  .where($"date" >= $"dateDup")
  .groupBy($"id", $"date")
  .agg(count($"idDup").as("counter"))
  .select($"id", $"date", $"counter")
Now if you run dfWithCounter.show, you will get:
+---+----------+-------+
| id| date|counter|
+---+----------+-------+
| 1|2015-09-01| 1|
| 1|2015-09-04| 3|
| 1|2015-09-03| 2|
| 2|2015-09-01| 1|
| 2|2015-09-04| 2|
+---+----------+-------+
Note that date is not sorted, but the counter is correct. Also, you can reverse the ordering of the counter by changing the >= to <= in the where clause.
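If you also want the rows displayed in the question's order, sorting the result is enough (a one-line sketch):
dfWithCounter.orderBy("id", "date").show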

Related

Spark Dataframe extracting columns based on dynamically selected columns

Schema of input dataframe
- employeeKey (int)
- employeeTypeId (string)
- loginDate (string)
- employeeDetailsJson (string)
{"Grade":"100","ValidTill":"2021-12-01","Supervisor":"Alex","Vendor":"technicia","HourlyRate":29}
For permanent employees, some attributes are available and some are not; the same goes for contracting employees.
So I am looking for an efficient way to build the dataframe based on only the selected columns, as opposed to transforming all columns and then selecting the ones I need.
Also, please advise whether this is the best way to extract values from the JSON string by key. As the attributes in the string are dynamic, I cannot build a StructType schema for it, so I am using good old get_json_object.
(Spark 2.4.5 now; will use Spark 3 in the future.)
val dfSelectColumns=List("Employee-Key", "Employee-Type","Login-Date","cont.Vendor-Name","cont.Hourly-Rate" )
//val dfSelectColumns=List("Employee-Key", "Employee-Type","Login-Date","perm.Level","perm-Validity","perm.Supervisor" )
val resultDF = inputDF.get
.withColumn("Employee-Key", col("employeeKey"))
.withColumn("Employee-Type", when(col("employeeTypeId") === 1, "Permanent")
.when(col("employeeTypeId") === 2, "Contractor")
.otherwise("unknown"))
.withColumn("Login-Date", to_utc_timestamp(to_timestamp(col("loginDate"), "yyyy-MM-dd'T'HH:mm:ss"), ""America/Chicago""))
.withColumn("perm.Level", get_json_object(col("employeeDetailsJson"), "$.Grade"))
.withColumn("perm.Validity", get_json_object(col("employeeDetailsJson"), "$.ValidTill"))
.withColumn("perm.SuperVisor", get_json_object(col("employeeDetailsJson"), "$.Supervisor"))
.withColumn("cont.Vendor-Name", get_json_object(col("employeeDetailsJson"), "$.Vendor"))
.withColumn("cont.Hourly-Rate", get_json_object(col("employeeDetailsJson"), "$.HourlyRate"))
.select(dfSelectColumns.head, dfSelectColumns.tail: _*)
I see that you have two shapes of record, one for Permanent and another for Contractor, so you can define two schemas.
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
val schemaBase = new StructType().add("Employee-Key", IntegerType).add("Employee-Type", StringType).add("Login-Date", DateType)
val schemaPerm = schemaBase.add("Level", IntegerType).add("Validity", StringType)// Permanent attributes
val schemaCont = schemaBase.add("Vendor", StringType).add("HourlyRate", DoubleType) // Contractor attributes
Then you can use the two schemas to load the data into dataframes.
For Permanent Employee:
val jsonPermDf = Seq( // Construct sample dataframe
(2, """{"Employee-Key":2, "Employee-Type":"Permanent", "Login-Date":"2021-11-01", "Level":3, "Validity":"ok"}""")
, (3, """{"Employee-Key":3, "Employee-Type":"Permanent", "Login-Date":"2020-10-01", "Level":2, "Validity":"ok-yes"}""")
).toDF("key", "raw_json")
val permDf = jsonPermDf.withColumn("data", from_json(col("raw_json"),schemaPerm)).select($"data.*")
permDf.show()
For Contractor:
val jsonContDf = Seq( // Construct sample dataframe
(1, """{"Employee-Key":1, "Employee-Type":"Contractor", "Login-Date":"2021-12-01", "Vendor":"technicia", "HourlyRate":29}""")
, (4, """{"Employee-Key":4, "Employee-Type":"Contractor", "Login-Date":"2019-09-01", "Vendor":"Minis", "HourlyRate":35}""")
).toDF("key", "raw_json")
val contDf = jsonContDf.withColumn("data", from_json(col("raw_json"),schemaCont)).select($"data.*")
contDf.show()
This is the resulting dataframe for Permanent:
+------------+-------------+----------+-----+--------+
|Employee-Key|Employee-Type|Login-Date|Level|Validity|
+------------+-------------+----------+-----+--------+
| 2| Permanent|2021-11-01| 3| ok|
| 3| Permanent|2020-10-01| 2| ok-yes|
+------------+-------------+----------+-----+--------+
This is the result dataframe for Contractor:
+------------+-------------+----------+---------+----------+
|Employee-Key|Employee-Type|Login-Date| Vendor|HourlyRate|
+------------+-------------+----------+---------+----------+
| 1| Contractor|2021-12-01|technicia| 29.0|
| 4| Contractor|2019-09-01| Minis| 35.0|
+------------+-------------+----------+---------+----------+
If the schema of the JSON in employeeDetailsJson is unstable, you can still parse it into a Map[String, String] using the from_json function with the schema map<string,string>. Then you can explode the map column and pivot to get the keys as columns.
Example:
val df1 = df.withColumn(
"employeeDetails",
from_json(col("employeeDetailsJson"), "map<string,string>")
).select(
col("employeeKey"),
col("employeeTypeId"),
col("loginDate"),
explode("employeeDetails")
).groupBy("employeeKey", "employeeTypeId", "loginDate")
.pivot("key")
.agg(first("value"))
df1.show()
//+-----------+--------------+---------------------+-----+----------+----------+----------+---------+
//|employeeKey|employeeTypeId|loginDate |Grade|HourlyRate|Supervisor|ValidTill |Vendor |
//+-----------+--------------+---------------------+-----+----------+----------+----------+---------+
//|1 |1 |2021-02-05'T'21:28:06|100 |29 |Alex |2021-12-01|technicia|
//+-----------+--------------+---------------------+-----+----------+----------+----------+---------+
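One design note on the sketch above: calling pivot("key") without an explicit value list makes Spark run an extra job first just to collect the distinct keys. If the possible keys are known up front (they are visible in the sample JSON from the question), you can pass them explicitly; a variant of the same pipeline, using the DataType overload of from_json:
import org.apache.spark.sql.types.{MapType, StringType}

val df1 = df.withColumn(
    "employeeDetails",
    from_json(col("employeeDetailsJson"), MapType(StringType, StringType))
  ).select(
    col("employeeKey"), col("employeeTypeId"), col("loginDate"),
    explode(col("employeeDetails"))
  )
  .groupBy("employeeKey", "employeeTypeId", "loginDate")
  // passing the known keys skips the extra distinct-keys job
  .pivot("key", Seq("Grade", "HourlyRate", "Supervisor", "ValidTill", "Vendor"))
  .agg(first("value"))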

Spark: generate a list of column names that contain (SQL LIKE) a string

The line below is the simple syntax for searching for a string in a particular column using SQL LIKE functionality.
val dfx = df.filter($"name".like(s"%${productName}%"))
The question is: how do I grab each and every column NAME whose VALUES contain the particular string, and generate a new column with a list of those column names for every row?
So far this is the approach I took, but I am stuck because I can't use the Spark SQL "like" function inside a UDF.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types._
import spark.implicits._
val df1 = Seq(
(0, "mango", "man", "dit"),
(1, "i-man", "man2", "mane"),
(2, "iman", "mango", "ho"),
(3, "dim", "kim", "sim")
).toDF("id", "col1", "col2", "col3")
val df2 = df1.columns.foldLeft(df1) {
(acc: DataFrame, colName: String) =>
acc.withColumn(colName, concat(lit(colName + "="), col(colName)))
}
val df3 = df2.withColumn("merged_cols", split(concat_ws("X", df2.columns.map(c=> col(c)):_*), "X"))
Here is a sample of the expected output. Note that there are only 3 columns here, but in the real job I'll be reading multiple tables which can contain a dynamic number of columns.
+---+------+------+------+------------------+
| id| col1 | col2 | col3 | merged_cols      |
+---+------+------+------+------------------+
|  0| mango| man  | dit  | col1, col2       |
|  1| i-man| man2 | mane | col1, col2, col3 |
|  2| iman | mango| ho   | col1, col2       |
|  3| dim  | kim  | sim  |                  |
+---+------+------+------+------------------+
This can be done using a foldLeft over the columns together with when and otherwise:
val e = "%man%"
val df2 = df1.columns.foldLeft(df.withColumn("merged_cols", lit(""))){(df, c) =>
df.withColumn("merged_cols", when(col(c).like(e), concat($"merged_cols", lit(s"$c,"))).otherwise($"merged_cols"))}
.withColumn("merged_cols", expr("substring(merged_cols, 1, length(merged_cols)-1)"))
All columns that satisfy the condition e will be appended to the string in the merged_cols column. Note that the column must exist for the first append to work, so it is added (containing an empty string) to the dataframe before it is sent into the foldLeft.
The last line of the code simply removes the extra , that is added at the end. If you want the result as an array instead, simply adding .withColumn("merged_cols", split($"merged_cols", ",")) would work.
An alternative approach is to use a UDF instead. This can be preferable when dealing with many columns, since foldLeft will create multiple dataframe copies. Here a regex is used (not SQL like, since that operates on whole columns).
val e = ".*man.*"
val concat_cols = udf((vals: Seq[String], names: Seq[String]) => {
vals.zip(names).filter{case (v, n) => v.matches(e)}.map(_._2)
})
val df2 = df.withColumn("merged_cols", concat_cols(array(df.columns.map(col(_)): _*), typedLit(df.columns.toSeq)))
Note: typedLit can be used in Spark versions 2.2+; when using older versions, use array(df1.columns.map(lit(_)): _*) instead.
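As a quick check against the expected output in the question, you can join the array back into a comma-separated string before showing it (a small sketch; concat_ws just makes merged_cols print like the sample above):
df2.withColumn("merged_cols", concat_ws(", ", $"merged_cols")).show(false)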

Spark How can I filter out rows that contain char sequences from another dataframe?

So, I am trying to remove rows from df2 if the Value in df2 is "like" a key from df1. I'm not sure if this is possible, or if I might need to change df1 into a list first? It's a fairly small dataframe, but as you can see, we want to remove the 2nd and 3rd rows from df2 and just return back df2 without them.
df1
+--------------------+
| key|
+--------------------+
| Monthly Beginning|
| Annual Percentage|
+--------------------+
df2
+--------------------+--------------------------------+
| key| Value|
+--------------------+--------------------------------+
| Date| 1/1/2018|
| Date| Monthly Beginning on Tuesday|
| Number| Annual Percentage Rate for...|
| Number| 17.5|
+--------------------+--------------------------------+
I thought it would be something like this?
df.filter(($"Value" isin (keyDf.select("key") + "%"))).show(false)
But that doesn't work and I'm not surprised, but I think it helps show what I am trying to do if my previous explanation was not sufficient enough. Thank you for your help ahead of time.
Convert the first dataframe df1 to a List[String], then create a UDF and apply the filter condition.
Spark-shell:
import org.apache.spark.sql.functions._
//Converting df1 to list
val df1List=df1.select("key").map(row=>row.getString(0).toLowerCase).collect.toList
//Creating udf , spark stands for spark session
spark.udf.register("filterUDF", (str: String) => df1List.filter(str.toLowerCase.contains(_)).length)
//Applying filter
df2.filter("filterUDF(Value)=0").show
//output
+------+--------+
| key| Value|
+------+--------+
| Date|1/1/2018|
|Number| 17.5|
+------+--------+
Scala-IDE:
val sparkSession=SparkSession.builder().master("local").appName("temp").getOrCreate()
val df1=sparkSession.read.format("csv").option("header","true").load("C:\\spark\\programs\\df1.csv")
val df2=sparkSession.read.format("csv").option("header","true").load("C:\\spark\\programs\\df2.csv")
import sparkSession.implicits._
val df1List=df1.select("key").map(row=>row.getString(0).toLowerCase).collect.toList
sparkSession.udf.register("filterUDF", (str: String) => df1List.filter(str.toLowerCase.contains(_)).length)
df2.filter("filterUDF(Value)=0").show
Convert df1 to a List and df2 to a Dataset.
case class s(key: String, Value: String)
val df2Ds = df2.as[s]
Then we can use the filter method to filter out the records, somewhat like this:
// true when none of the keys from df1 appear in the value
// (df1List is the lowercased key list built above)
def check(str: String): Boolean =
  !df1List.exists(key => str.toLowerCase.contains(key))

df2Ds.filter(s => check(s.Value)).collect

Scala Spark - split vector column into separate columns in a Spark DataFrame

I have a Spark DataFrame where I have a column with Vector values. The vector values are all n-dimensional, aka with the same length. I also have a list of column names Array("f1", "f2", "f3", ..., "fn"), each corresponds to one element in the vector.
some_columns... | Features
... | [0,1,0,..., 0]
to
some_columns... | f1 | f2 | f3 | ... | fn
... | 0 | 1 | 0 | ... | 0
What is the best way to achieve this? I thought of one way which is to create a new DataFrame with createDataFrame(Row(Features), featureNameList) and then join with the old one, but it requires spark context to use createDataFrame. I only want to transform the existing data frame. I also know .withColumn("fi", value) but what do I do if n is large?
I'm new to Scala and Spark and couldn't find any good examples for this. I think this can be a common task. My particular case is that I used the CountVectorizer and wanted to recover each column individually for better readability instead of only having the vector result.
One way could be to convert the vector column to an array<double> and then use getItem to extract the individual elements.
import org.apache.spark.sql.functions._
import org.apache.spark.ml._
val df = Seq( (1 , linalg.Vectors.dense(1,0,1,1,0) ) ).toDF("id", "features")
//df: org.apache.spark.sql.DataFrame = [id: int, features: vector]
df.show
//+---+---------------------+
//|id |features |
//+---+---------------------+
//|1 |[1.0,0.0,1.0,1.0,0.0]|
//+---+---------------------+
// A UDF to convert VectorUDT to ArrayType
val vecToArray = udf( (xs: linalg.Vector) => xs.toArray )
// Add a ArrayType Column
val dfArr = df.withColumn("featuresArr" , vecToArray($"features") )
// Array of element names that need to be fetched
// ArrayIndexOutOfBounds is not checked.
// sizeof `elements` should be equal to the number of entries in column `features`
val elements = Array("f1", "f2", "f3", "f4", "f5")
// Create a SQL-like expression using the array
val sqlExpr = elements.zipWithIndex.map{ case (alias, idx) => col("featuresArr").getItem(idx).as(alias) }
// Extract Elements from dfArr
dfArr.select(sqlExpr : _*).show
//+---+---+---+---+---+
//| f1| f2| f3| f4| f5|
//+---+---+---+---+---+
//|1.0|0.0|1.0|1.0|0.0|
//+---+---+---+---+---+
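As a side note, if you move to Spark 3.0+ later, there is a built-in helper that replaces the hand-rolled vecToArray UDF above; the getItem part of the approach stays the same:
import org.apache.spark.ml.functions.vector_to_array

// same idea as the vecToArray UDF, but provided by Spark 3.0+
val dfArr = df.withColumn("featuresArr", vector_to_array($"features"))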

scala spark - matching dataframes based on variable dates

I'm trying to match two dataframes based on a variable date window. I am not simply trying to get an exact match, which my code achieves, but to get all likely candidates within a variable day window.
I was able to get exact matches on dates with my code.
But I want to find out if the records are still viable to match, since they could be a few days off on either side but would still be reasonable enough to join on.
I've tried looking for something similar to python's pd.to_timedelta('1 day') in Spark to add to the filter, but alas I have had no luck.
Here is my current code which matches the dataframe on the ID column and then runs a filter to ensure that the from_date in the second dataframe is between the start_date and the end_date of the first dataframe.
What I need is not the exact date match but be able to match records if they fall between a day or two (either side) of the actual dates.
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().getOrCreate()
val df1 = spark.read.option("header","true")
.option("inferSchema","true").csv("../data/df1.csv")
val df2 = spark.read.option("header","true")
.option("inferSchema","true")
.csv("../data/df2.csv")
val df = df2.join(df1,
(df1("ID") === df2("ID")) &&
(df2("from_date") >= df1("start_date")) &&
(df2("from_date") <= df1("end_date")),"left")
.select(df1("ID"), df1("start_date"), df1("end_date"),
$"from_date", $"to_date")
df.coalesce(1).write.format("com.databricks.spark.csv")
.option("header", "true").save("../mydata.csv")
Essentially I want to be able to edit this date window to increase or decrease the data actually matching.
Would really appreciate your input. I'm new to spark/scala but gotta say I'm loving it so far ... soo much faster (and cleaner) than python!
cheers
You can apply date_add and date_sub to start_date/end_date in your join condition, as shown below:
import org.apache.spark.sql.functions._
import java.sql.Date
val df1 = Seq(
(1, Date.valueOf("2018-12-01"), Date.valueOf("2018-12-05")),
(2, Date.valueOf("2018-12-01"), Date.valueOf("2018-12-06")),
(3, Date.valueOf("2018-12-01"), Date.valueOf("2018-12-07"))
).toDF("ID", "start_date", "end_date")
val df2 = Seq(
(1, Date.valueOf("2018-11-30")),
(2, Date.valueOf("2018-12-08")),
(3, Date.valueOf("2018-12-08"))
).toDF("ID", "from_date")
val deltaDays = 1
df2.join( df1,
df1("ID") === df2("ID") &&
df2("from_date") >= date_sub(df1("start_date"), deltaDays) &&
df2("from_date") <= date_add(df1("end_date"), deltaDays),
"left_outer"
).show
// +---+----------+----+----------+----------+
// | ID| from_date| ID|start_date| end_date|
// +---+----------+----+----------+----------+
// | 1|2018-11-30| 1|2018-12-01|2018-12-05|
// | 2|2018-12-08|null| null| null|
// | 3|2018-12-08| 3|2018-12-01|2018-12-07|
// +---+----------+----+----------+----------+
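One practical tweak: the joined output above carries two ID columns, one from each dataframe. Aliasing the dataframes lets you keep just one and pick the columns explicitly; a sketch reusing df1, df2 and deltaDays from above:
val matched = df2.as("b").join(
    df1.as("a"),
    $"a.ID" === $"b.ID" &&
      $"b.from_date" >= date_sub($"a.start_date", deltaDays) &&
      $"b.from_date" <= date_add($"a.end_date", deltaDays),
    "left_outer"
  )
  .select($"b.ID", $"b.from_date", $"a.start_date", $"a.end_date")

matched.show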
You can get the same results using the datediff() function as well. Check this out:
scala> val df1 = Seq((1, "2018-12-01", "2018-12-05"),(2, "2018-12-01", "2018-12-06"),(3, "2018-12-01", "2018-12-07")).toDF("ID", "start_date", "end_date").withColumn("start_date",'start_date.cast("date")).withColumn("end_date",'end_date.cast("date"))
df1: org.apache.spark.sql.DataFrame = [ID: int, start_date: date ... 1 more field]
scala> val df2 = Seq((1, "2018-11-30"), (2, "2018-12-08"),(3, "2018-12-08")).toDF("ID", "from_date").withColumn("from_date",'from_date.cast("date"))
df2: org.apache.spark.sql.DataFrame = [ID: int, from_date: date]
scala> val delta = 1;
delta: Int = 1
scala> df2.join(df1,df1("ID") === df2("ID") && datediff('from_date,'start_date) >= -delta && datediff('from_date,'end_date)<=delta, "leftOuter").show(false)
+---+----------+----+----------+----------+
|ID |from_date |ID |start_date|end_date |
+---+----------+----+----------+----------+
|1 |2018-11-30|1 |2018-12-01|2018-12-05|
|2 |2018-12-08|null|null |null |
|3 |2018-12-08|3 |2018-12-01|2018-12-07|
+---+----------+----+----------+----------+
scala>