Transform dataset with empty data for dates - scala

I have a dataset with date, accountid and value. I want to transform it into a new dataset where, if an accountid is not present on a particular date, a row for that accountid with a value of 0 is added for that date. Is this possible?
val df = sc.parallelize(Seq(
  ("2018-01-01", 100.5, "id1"),
  ("2018-01-02", 120.6, "id1"),
  ("2018-01-03", 450.2, "id2")
)).toDF("date", "val", "accountid")
+----------+-----+---------+
| date| val|accountid|
+----------+-----+---------+
|2018-01-01|100.5| id1|
|2018-01-02|120.6| id1|
|2018-01-03|450.2| id2|
+----------+-----+---------+
I want to transform this dataset into this format
+----------+-----+---------+
| date| val|accountid|
+----------+-----+---------+
|2018-01-01|100.5| id1|
|2018-01-01| 0.0| id2|
|2018-01-02|120.6| id1|
|2018-01-02| 0.0| id2|
|2018-01-03|450.2| id2|
|2018-01-03| 0.0| id1|
+----------+-----+---------+

You can use a udf function to fulfil this requirement.
Before that, you will have to collect the complete set of accountids and broadcast it so that it can be used inside the udf function.
The array returned by the udf function is then exploded, and finally the struct columns are selected.
import org.apache.spark.sql.functions._
val idList = df.select(collect_set("accountid")).first().getAs[Seq[String]](0)
val broadCastedIdList = sc.broadcast(idList)
def populateUdf = udf((date: String, value: Double, accountid: String) =>
  Array(accounts(date, value, accountid)) ++
    broadCastedIdList.value.filterNot(_ == accountid).map(accounts(date, 0.0, _)))
df.select(populateUdf(col("date"), col("val"), col("accountid")).as("struct"))
  .withColumn("struct", explode(col("struct")))
  .select(col("struct.date"), col("struct.value").as("val"), col("struct.accountid"))
  .show(false)
And of course you would need a case class
case class accounts(date:String, value:Double, accountid:String)
which should give you
+----------+-----+---------+
|date |val |accountid|
+----------+-----+---------+
|2018-01-01|100.5|id1 |
|2018-01-01|0.0 |id2 |
|2018-01-02|120.6|id1 |
|2018-01-02|0.0 |id2 |
|2018-01-03|450.2|id2 |
|2018-01-03|0.0 |id1 |
+----------+-----+---------+
Note: the field is named value in the case class because val is a reserved word in Scala and cannot be used as an identifier; that is why the final select renames struct.value back to val.
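As an aside, Scala can also escape reserved words with backticks, so a hedged alternative sketch (not verified against Spark's case-class encoder) that keeps the original column name would be:
case class accounts(date: String, `val`: Double, accountid: String)
With that definition the rename in the final select would be unnecessary.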

You can create a reference DataFrame covering the full date range and every accountid:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row
val Row(minTs: Long, maxTs: Long) = df
  .select(to_date($"date").cast("timestamp").cast("bigint") as "date")
  .select(min($"date"), max($"date")).first
val by = 60 * 60 * 24
val ref = spark
  .range(minTs, maxTs + by, by)
  .select($"id".cast("timestamp").cast("date").cast("string").as("date"))
  .crossJoin(df.select("accountid").distinct)
and outer join with input data:
ref.join(df, Seq("date", "accountid"), "leftouter").na.fill(0.0).show
// +----------+---------+-----+
// | date|accountid| val|
// +----------+---------+-----+
// |2018-01-03| id1| 0.0|
// |2018-01-01| id1|100.5|
// |2018-01-02| id2| 0.0|
// |2018-01-02| id1|120.6|
// |2018-01-03| id2|450.2|
// |2018-01-01| id2| 0.0|
// +----------+---------+-----+
Concept adopted from this sparklyr answer by user6910411.
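If you only need the dates that already appear in the data (as in the expected output above), a simpler hedged variant skips building the timestamp range and just cross joins the distinct dates with the distinct account ids:
df.select("date").distinct
  .crossJoin(df.select("accountid").distinct)
  .join(df, Seq("date", "accountid"), "left_outer")
  .na.fill(0.0, Seq("val"))
  .show()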

Related

How to get min and max of clusters

I created a Scala program to apply k-means to a specific column of a dataframe. The dataframe name is df_items and the column name is price.
import org.apache.spark._
import org.apache.spark.sql.types._
import org.apache.spark.ml.clustering._
import org.apache.spark.ml.feature.VectorAssembler
val df_items = spark.read.format("csv").option("header", "true").load("path.csv")
// need to cast because df_items("price") is String
df_items.createGlobalTempView("items")
val price = spark.sql("SELECT cast(price as double) price FROM global_temp.items")
case class Rows(price:Double)
val rows = price.as[Rows]
val assembler = new VectorAssembler().setInputCols(Array("price")).setOutputCol("features")
val data = assembler.transform(rows)
val kmeans = new KMeans().setK(6)
val model = kmeans.fit(data)
val predictions = model.summary.predictions
Prediction results:
+------+--------+----------+
| price|features|prediction|
+------+--------+----------+
| 58.9| [58.9]| 0|
| 239.9| [239.9]| 3|
| 199.0| [199.0]| 5|
| 12.99| [12.99]| 0|
| 199.9| [199.9]| 5|
| 21.9| [21.9]| 0|
| 19.9| [19.9]| 0|
| 810.0| [810.0]| 1|
|145.95|[145.95]| 5|
| ... | ... | ... |
My goal is to get the min and the max value of a cluster (or of all clusters). Is this possible?
Thanks a lot.
If I understand your question correctly, you could use groupBy on the prediction column.
import org.apache.spark.sql.functions.{col, min, max}
predictions.groupBy("prediction")
  .agg(min(col("price")).as("min_price"),
       max(col("price")).as("max_price"))
  .show()
Is this what you need?

Spark Dataframe implementation similar to Oracle's LISTAGG function - Unable to Order with in the group

I want to implement a function similar to Oracle's LISTAGG function.
The equivalent Oracle code is:
select KEY,
       listagg(CODE, '-') within group (order by DATE) as CODE
from demo_table
group by KEY
Here is my Spark Scala dataframe implementation, but I am unable to order the values within each group.
Input:
val values = List(List("66", "PL", "2016-11-01"), List("66", "PL", "2016-12-01"),
List("67", "JL", "2016-12-01"), List("67", "JL", "2016-11-01"), List("67", "PL", "2016-10-01"), List("67", "PO", "2016-09-01"), List("67", "JL", "2016-08-01"),
List("68", "PL", "2016-12-01"), List("68", "JO", "2016-11-01"))
.map(row => (row(0), row(1), row(2)))
val df = values.toDF("KEY", "CODE", "DATE")
df.show()
+---+----+----------+
|KEY|CODE| DATE|
+---+----+----------+
| 66| PL|2016-11-01|
| 66| PL|2016-12-01|----- group 1
| 67| JL|2016-12-01|
| 67| JL|2016-11-01|
| 67| PL|2016-10-01|
| 67| PO|2016-09-01|
| 67| JL|2016-08-01|----- group 2
| 68| PL|2016-12-01|
| 68| JO|2016-11-01|----- group 3
+---+----+----------+
UDF implementation:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions.udf
val listAgg = udf((xs: Seq[String]) => xs.mkString("-"))
df.groupBy("KEY")
.agg(listAgg(collect_list("CODE")).alias("CODE"))
.show(false)
+---+--------------+
|KEY|CODE |
+---+--------------+
|68 |PL-JO |
|67 |JL-JL-PL-PO-JL|
|66 |PL-PL |
+---+--------------+
Expected output (CODE ordered by DATE within each group):
+---+--------------+
|KEY|CODE |
+---+--------------+
|68 |JO-PL |
|67 |JL-PO-PL-JL-JL|
|66 |PL-PL |
+---+--------------+
Use the struct built-in function to combine the CODE and DATE columns, and use that new struct column in the collect_list aggregation. Then, in the udf function, sort by DATE and collect the CODE values as a - separated string.
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
def sortAndStringUdf = udf((codeDate: Seq[Row]) =>
  codeDate.sortBy(row => row.getAs[Long]("DATE")).map(row => row.getAs[String]("CODE")).mkString("-"))
df.withColumn("codeDate", struct(col("CODE"), col("DATE").cast("timestamp").cast("long").as("DATE")))
  .groupBy("KEY").agg(sortAndStringUdf(collect_list("codeDate")).as("CODE"))
  .show()
which should give you
+---+--------------+
|KEY| CODE|
+---+--------------+
| 68| JO-PL|
| 67|JL-PO-PL-JL-JL|
| 66| PL-PL|
+---+--------------+
I hope the answer is helpful
Update
This should be faster than using a udf function: sort_array on an array of structs orders by the first struct field (DATE here), which is why DATE is placed first in the struct.
df.withColumn("codeDate", struct(col("DATE").cast("timestamp").cast("long").as("DATE"), col("CODE")))
.groupBy("KEY")
.agg(concat_ws("-", expr("sort_array(collect_list(codeDate)).CODE")).alias("CODE"))
.show(false)
which should give you the same result as above

Converting from org.apache.spark.sql.Dataset to CoordinateMatrix

I have a spark SQL dataset whose schema defined as follows,
User_id <String> | Item_id <String> | Bought_Status <Boolean>
I would like to convert this to a sparse matrix to apply recommender system algorithms. These are very large RDD datasets, so I read that CoordinateMatrix is the right way to create a sparse matrix from them.
However I got stuck at a point where the API doc says that RDD[MatrixEntry] is mandatory to create a CoordinateMatrix. Also MatrixEntry needs a format of int,int, long.
I am not able to convert my data schema to this format. Can you please help me convert this data to a sparse matrix in Spark? I am currently programming in Scala.
Please note that MatrixEntry is of type (Long, Long, Double).
Reference: https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.mllib.linalg.distributed.MatrixEntry
Also, since the user/item columns are strings, they need to be indexed before processing. Here is how you can create a CoordinateMatrix with Scala:
//Imports needed
scala> import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix
import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix
scala> import org.apache.spark.mllib.linalg.distributed.MatrixEntry
import org.apache.spark.mllib.linalg.distributed.MatrixEntry
scala> import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.feature.StringIndexer
//Let's create a dummy dataframe
scala> val df = spark.sparkContext.parallelize(List(
| ("u1","i1" ,true),
| ("u1","i2" ,true),
| ("u2","i3" ,false),
| ("u2","i4" ,false),
| ("u3","i1" ,true),
| ("u3","i3" ,true),
| ("u4","i3" ,false),
| ("u4","i4" ,false))).toDF("user","item","bought")
df: org.apache.spark.sql.DataFrame = [user: string, item: string ... 1 more field]
scala> df.show
+----+----+------+
|user|item|bought|
+----+----+------+
| u1| i1| true|
| u1| i2| true|
| u2| i3| false|
| u2| i4| false|
| u3| i1| true|
| u3| i3| true|
| u4| i3| false|
| u4| i4| false|
+----+----+------+
//Index user/ item columns
scala> val indexer1 = new StringIndexer().setInputCol("user").setOutputCol("userIndex")
indexer1: org.apache.spark.ml.feature.StringIndexer = strIdx_2de8d35b8301
scala> val indexed1 = indexer1.fit(df).transform(df)
indexed1: org.apache.spark.sql.DataFrame = [user: string, item: string ... 2 more fields]
scala> val indexer2 = new StringIndexer().setInputCol("item").setOutputCol("itemIndex")
indexer2: org.apache.spark.ml.feature.StringIndexer = strIdx_493ce45dbec3
scala> val indexed2 = indexer2.fit(indexed1).transform(indexed1)
indexed2: org.apache.spark.sql.DataFrame = [user: string, item: string ... 3 more fields]
scala> val tempDF = indexed2.withColumn("userIndex",indexed2("userIndex").cast("long")).withColumn("itemIndex",indexed2("itemIndex").cast("long")).withColumn("bought",indexed2("bought").cast("double")).select("userIndex","itemIndex","bought")
tempDF: org.apache.spark.sql.DataFrame = [userIndex: bigint, itemIndex: bigint ... 1 more field]
scala> tempDF.show
+---------+---------+------+
|userIndex|itemIndex|bought|
+---------+---------+------+
| 0| 1| 1.0|
| 0| 3| 1.0|
| 1| 0| 0.0|
| 1| 2| 0.0|
| 2| 1| 1.0|
| 2| 0| 1.0|
| 3| 0| 0.0|
| 3| 2| 0.0|
+---------+---------+------+
//Create coordinate matrix of size 4*4
scala> val corMat = new CoordinateMatrix(tempDF.rdd.map(m => MatrixEntry(m.getLong(0),m.getLong(1),m.getDouble(2))), 4, 4)
corMat: org.apache.spark.mllib.linalg.distributed.CoordinateMatrix = org.apache.spark.mllib.linalg.distributed.CoordinateMatrix@16be6b36
//Check the content of coordinate matrix
scala> corMat.entries.collect
res2: Array[org.apache.spark.mllib.linalg.distributed.MatrixEntry] = Array(MatrixEntry(0,1,1.0), MatrixEntry(0,3,1.0), MatrixEntry(1,0,0.0), MatrixEntry(1,2,0.0), MatrixEntry(2,1,1.0), MatrixEntry(2,0,1.0), MatrixEntry(3,0,0.0), MatrixEntry(3,2,0.0))
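Once the CoordinateMatrix exists, the standard mllib conversions can take it to whatever form a downstream algorithm expects; a short hedged sketch:
// convert to other distributed matrix types as needed
val rowMat = corMat.toRowMatrix()     // e.g. for columnSimilarities
val blockMat = corMat.toBlockMatrix() // e.g. for distributed matrix multiplication
println((corMat.numRows(), corMat.numCols())) // (4, 4) for the toy data above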
Hope this helps!

Spark - pass full row to a udf and then get column name inside udf

I am using Spark with Scala and want to pass the entire row to a udf and then, inside the udf, pick out each column name and column value. How can I do this?
I am trying the following:
inputDataDF.withColumn("errorField", mapCategory(ruleForNullValidation) (col(_*)))
def mapCategory(categories: Map[String, Boolean]) = {
  udf((input: Row) => ??? /* check whether each column is in categories; if yes, check for null; if null then false; repeat for all columns and combine the results */)
}
In Spark 1.6 you can use Row as the external type and struct as the expression. Column names can be fetched from the schema. For example:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, struct, udf}
val df = Seq((1, 2, 3)).toDF("a", "b", "c")
val f = udf((row: Row) => row.schema.fieldNames)
df.select(f(struct(df.columns map col: _*))).show
// +-----------------------------------------------------------------------------+
// |UDF(named_struct(NamePlaceholder, a, NamePlaceholder, b, NamePlaceholder, c))|
// +-----------------------------------------------------------------------------+
// | [a, b, c]|
// +-----------------------------------------------------------------------------+
Values can be accessed by name using Row.getAs method.
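For example, a minimal hedged sketch against the toy df above, summing the columns by name:
val g = udf((row: Row) => row.getAs[Int]("a") + row.getAs[Int]("b") + row.getAs[Int]("c"))
df.select(g(struct(df.columns map col: _*)).as("sum")).show
// expected output: a single row containing 6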
Here is a simple working example:
Input Data:
+-----+---+--------+
| NAME|AGE|CATEGORY|
+-----+---+--------+
| RIO| 35| FIN|
| TOM| 90| ACC|
|KEVIN| 32| |
| STEF| 22| OPS|
+-----+---+--------+
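For reference, one way this input could be constructed (the original post does not show it, so the snippet below merely reproduces the table above):
val df = Seq(
  ("RIO", 35, "FIN"),
  ("TOM", 90, "ACC"),
  ("KEVIN", 32, ""),
  ("STEF", 22, "OPS")
).toDF("NAME", "AGE", "CATEGORY")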
//Define category list and UDF
val categoryList = List("FIN","ACC")
def mapCategoryUDF(ls: List[String]) = udf[Boolean, Row]((x: Row) => ls.contains(x.getAs[String]("CATEGORY")))
import org.apache.spark.sql.functions.{struct}
df.withColumn("errorField",mapCategoryUDF(categoryList)(struct("*"))).show()
Result should look like this:
+-----+---+--------+----------+
| NAME|AGE|CATEGORY|errorField|
+-----+---+--------+----------+
| RIO| 35| FIN| true|
| TOM| 90| ACC| true|
|KEVIN| 32| | false|
| STEF| 22| OPS| false|
+-----+---+--------+----------+
Hope this helps!!
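For the asker's original goal (flagging columns that fail a null check driven by a Map[String, Boolean]), here is a hedged sketch that combines the two ideas above; the rule map and column names are illustrative only, not from the original post:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, struct, udf}
def nullValidationUdf(rules: Map[String, Boolean]) = udf((row: Row) =>
  row.schema.fieldNames
    .filter(name => rules.getOrElse(name, false))        // only columns flagged for validation
    .filter(name => row.isNullAt(row.fieldIndex(name)))) // keep the ones that are actually null
val ruleForNullValidation = Map("NAME" -> true, "CATEGORY" -> true) // hypothetical rules
df.withColumn("errorField", nullValidationUdf(ruleForNullValidation)(struct(df.columns map col: _*))).show()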

Spark Scala: How to convert DataFrame[Vector] to DataFrame[f1: Double, ..., fn: Double]

I just used StandardScaler to normalize my features for an ML application. After selecting the scaled features, I want to convert this back to a dataframe of Doubles, though the length of my vectors is arbitrary. I know how to do it for a specific 3 features by using
myDF.map{case Row(v: Vector) => (v(0), v(1), v(2))}.toDF("f1", "f2", "f3")
but not for an arbitrary amount of features. Is there an easy way to do this?
Example:
val testDF = sc.parallelize(List(Vectors.dense(5D, 6D, 7D), Vectors.dense(8D, 9D, 10D), Vectors.dense(11D, 12D, 13D))).map(Tuple1(_)).toDF("scaledFeatures")
val myColumnNames = List("f1", "f2", "f3")
// val finalDF = DataFrame[f1: Double, f2: Double, f3: Double]
EDIT
I found out how to unpack to column names when creating the dataframe, but I am still having trouble converting a vector to the sequence needed to create the dataframe:
finalDF = testDF.map{case Row(v: Vector) => v.toArray.toSeq /* <= this errors */}.toDF(List("f1", "f2", "f3"): _*)
Spark >= 3.0.0
Since Spark 3.0 you can use vector_to_array
import org.apache.spark.ml.functions.vector_to_array
testDF.select(vector_to_array($"scaledFeatures").alias("_tmp")).select(exprs: _*)
// exprs is the list of per-element column expressions built in the Spark < 3.0.0 section below
Spark < 3.0.0
One possible approach is something similar to this
import org.apache.spark.sql.functions.udf
// In Spark 1.x you will have to replace the ML Vector with the MLlib one
// import org.apache.spark.mllib.linalg.Vector
// In 2.x the below is usually the right choice
import org.apache.spark.ml.linalg.Vector
// Get size of the vector
val n = testDF.first.getAs[Vector](0).size
// Simple helper to convert vector to array<double>
// asNondeterministic is available in Spark 2.3 or later
// It can be removed, but at the cost of decreased performance
val vecToSeq = udf((v: Vector) => v.toArray).asNondeterministic
// Prepare a list of columns to create
val exprs = (0 until n).map(i => $"_tmp".getItem(i).alias(s"f$i"))
testDF.select(vecToSeq($"scaledFeatures").alias("_tmp")).select(exprs:_*)
If you know a list of columns upfront you can simplify this a little:
val cols: Seq[String] = ???
val exprs = cols.zipWithIndex.map{ case (c, i) => $"_tmp".getItem(i).alias(c) }
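For instance, with the three column names from the question, a hedged usage of the snippet above would be:
val myCols = Seq("f1", "f2", "f3")
val namedExprs = myCols.zipWithIndex.map { case (c, i) => $"_tmp".getItem(i).alias(c) }
testDF.select(vecToSeq($"scaledFeatures").alias("_tmp")).select(namedExprs: _*).show()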
For Python equivalent see How to split Vector into columns - using PySpark.
Please try VectorSlicer:
import org.apache.spark.ml.feature.{VectorAssembler, VectorSlicer}
import org.apache.spark.ml.linalg.Vectors
val dataset = spark.createDataFrame(
Seq((1, 0.2, 0.8), (2, 0.1, 0.9), (3, 0.3, 0.7))
).toDF("id", "negative_logit", "positive_logit")
val assembler = new VectorAssembler()
.setInputCols(Array("negative_logit", "positive_logit"))
.setOutputCol("prediction")
val output = assembler.transform(dataset)
output.show()
/*
+---+--------------+--------------+----------+
| id|negative_logit|positive_logit|prediction|
+---+--------------+--------------+----------+
| 1| 0.2| 0.8| [0.2,0.8]|
| 2| 0.1| 0.9| [0.1,0.9]|
| 3| 0.3| 0.7| [0.3,0.7]|
+---+--------------+--------------+----------+
*/
val slicer = new VectorSlicer()
.setInputCol("prediction")
.setIndices(Array(1))
.setOutputCol("positive_prediction")
val posi_output = slicer.transform(output)
posi_output.show()
/*
+---+--------------+--------------+----------+-------------------+
| id|negative_logit|positive_logit|prediction|positive_prediction|
+---+--------------+--------------+----------+-------------------+
| 1| 0.2| 0.8| [0.2,0.8]| [0.8]|
| 2| 0.1| 0.9| [0.1,0.9]| [0.9]|
| 3| 0.3| 0.7| [0.3,0.7]| [0.7]|
+---+--------------+--------------+----------+-------------------+
*/
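Note that the sliced column is still a one-element vector; if a plain Double column is needed, one hedged option on Spark 3.0+ is vector_to_array:
import org.apache.spark.ml.functions.vector_to_array
posi_output
  .withColumn("positive_prediction_d", vector_to_array($"positive_prediction").getItem(0))
  .show()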
An alternate solution that evolved a couple of days ago: import the VectorDisassembler into your project (as long as it is not merged into Spark), then:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
val dataset = spark.createDataFrame(
Seq((0, 1.2, 1.3), (1, 2.2, 2.3), (2, 3.2, 3.3))
).toDF("id", "val1", "val2")
val assembler = new VectorAssembler()
.setInputCols(Array("val1", "val2"))
.setOutputCol("vectorCol")
val output = assembler.transform(dataset)
output.show()
/*
+---+----+----+---------+
| id|val1|val2|vectorCol|
+---+----+----+---------+
| 0| 1.2| 1.3|[1.2,1.3]|
| 1| 2.2| 2.3|[2.2,2.3]|
| 2| 3.2| 3.3|[3.2,3.3]|
+---+----+----+---------+*/
val disassembler = new org.apache.spark.ml.feature.VectorDisassembler()
.setInputCol("vectorCol")
disassembler.transform(output).show()
/*
+---+----+----+---------+----+----+
| id|val1|val2|vectorCol|val1|val2|
+---+----+----+---------+----+----+
| 0| 1.2| 1.3|[1.2,1.3]| 1.2| 1.3|
| 1| 2.2| 2.3|[2.2,2.3]| 2.2| 2.3|
| 2| 3.2| 3.3|[3.2,3.3]| 3.2| 3.3|
+---+----+----+---------+----+----+*/
I use Spark 2.3.2 and built an xgboost4j binary-classification model; the result looks like this:
results_train.select("classIndex","probability","prediction").show(3,0)
+----------+----------------------------------------+----------+
|classIndex|probability |prediction|
+----------+----------------------------------------+----------+
|1 |[0.5998525619506836,0.400147408246994] |0.0 |
|1 |[0.5487841367721558,0.45121586322784424]|0.0 |
|0 |[0.5555324554443359,0.44446757435798645]|0.0 |
I defined the following udf to get the elements out of the vector column probability:
import org.apache.spark.sql.functions._
def getProb = udf((probV: org.apache.spark.ml.linalg.Vector, clsInx: Int) => probV.apply(clsInx) )
results_train.select("classIndex","probability","prediction").
withColumn("p_0",getProb($"probability",lit(0))).
withColumn("p_1",getProb($"probability", lit(1))).show(3,0)
+----------+----------------------------------------+----------+------------------+-------------------+
|classIndex|probability |prediction|p_0 |p_1 |
+----------+----------------------------------------+----------+------------------+-------------------+
|1 |[0.5998525619506836,0.400147408246994] |0.0 |0.5998525619506836|0.400147408246994 |
|1 |[0.5487841367721558,0.45121586322784424]|0.0 |0.5487841367721558|0.45121586322784424|
|0 |[0.5555324554443359,0.44446757435798645]|0.0 |0.5555324554443359|0.44446757435798645|
Hope this helps those who work with Vector type inputs.
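On Spark 3.0+ the same result can be obtained without a udf via vector_to_array; a hedged alternative (not applicable to the 2.3.2 setup above, where that function does not exist):
import org.apache.spark.ml.functions.vector_to_array
results_train.select("classIndex", "probability", "prediction")
  .withColumn("probs", vector_to_array($"probability"))
  .withColumn("p_0", $"probs".getItem(0))
  .withColumn("p_1", $"probs".getItem(1))
  .show(3, false)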
Since the above answers either need additional libraries or are not supported in my setup, I used a pandas dataframe to easily extract the vector values and then converted it back to a Spark dataframe.
# convert to pandas dataframe
pandasDf = dataframe.toPandas()
# add a new column, filled with 0s
pandasDf['newColumnName'] = 0
# now iterate through the rows and update the column
for index, row in pandasDf.iterrows():
    value = row['vectorCol'][0]  # get the 0th value of the vector
    pandasDf.loc[index, 'newColumnName'] = value  # put the value in the new column
# convert back to a Spark dataframe
sparkDf = spark.createDataFrame(pandasDf)