Simple Roll-Down With Spark Dataframe (Scala)

If you have a simple dataframe that looks like this:
val n = sc.parallelize(List[String](
  "Alice", null, null,
  "Bob", null, null,
  "Chuck"
)).toDF("name")
Which looks like this:
//+-----+
//| name|
//+-----+
//|Alice|
//| null|
//| null|
//| Bob|
//| null|
//| null|
//|Chuck|
//+-----+
How can I use dataframe roll-down functions to get:
//+-----+
//| name|
//+-----+
//|Alice|
//|Alice|
//|Alice|
//| Bob|
//| Bob|
//| Bob|
//|Chuck|
//+-----+
Note: please state any needed imports; I suspect these include:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.{WindowSpec, Window}
Note: Some sites I tried to mimic are:
http://xinhstechblog.blogspot.com/2016/04/spark-window-functions-for-dataframes.html
and
https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
I've come across something like this in the past, so I realize that Spark versions will differ. I am using 1.5.2 on the cluster (where this solution is more useful) and 2.0 in local emulation. I prefer a 1.5.2-compatible solution.
Also, I'd like to get away from writing SQL directly, i.e. avoid using sqlContext.sql(...).

If you have another column that allows grouping of the values, here's a suggestion:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import sqlContext.implicits._
val df = Seq(
  (Some("Alice"), 1),
  (None, 1),
  (None, 1),
  (Some("Bob"), 2),
  (None, 2),
  (None, 2),
  (Some("Chuck"), 3)
).toDF("name", "group")
// min ignores nulls, so within each group it returns the single non-null name
val result = df.withColumn("new_col", min(col("name")).over(Window.partitionBy("group")))
result.show()
+-----+-----+-------+
| name|group|new_col|
+-----+-----+-------+
|Alice| 1| Alice|
| null| 1| Alice|
| null| 1| Alice|
| Bob| 2| Bob|
| null| 2| Bob|
| null| 2| Bob|
|Chuck| 3| Chuck|
+-----+-----+-------+
On the other hand, if you only have a column that allows ordering, but not grouping, the solution is a little harder. My first idea is to create a subset and then do a join:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import sqlContext.implicits._
val df = Seq(
  (Some("Alice"), 1),
  (None, 2),
  (None, 3),
  (Some("Bob"), 4),
  (None, 5),
  (None, 6),
  (Some("Chuck"), 7)
).toDF("name", "order")
val subset = df
  .select("name", "order")
  .where(col("name").isNotNull)
  .withColumn("next", lead("order", 1).over(Window.orderBy("order")))

// Join each row to the non-null name whose [order, next) range contains it.
// The last name has next = null, so its own rows fall through the left join
// and coalesce keeps their original (non-null) name. Note that null names
// appearing after the last non-null name would not be filled.
val partial = df.as("a")
  .join(subset.as("b"), col("a.order") >= col("b.order") && col("a.order") < col("b.next"), "left")

val result = partial.select(coalesce(col("a.name"), col("b.name")).as("name"), col("a.order"))
result.show()
+-----+-----+
| name|order|
+-----+-----+
|Alice| 1|
|Alice| 2|
|Alice| 3|
| Bob| 4|
| Bob| 5|
| Bob| 6|
|Chuck| 7|
+-----+-----+
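Not part of the original answer, but if Spark 2.0 is available (it will not help on the 1.5.2 cluster mentioned in the question), a shorter alternative is to carry the last non-null value forward over an ordered window. A minimal sketch, assuming the same df with an order column:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

// No partitionBy, so all rows go through a single partition; fine for small data.
// last(..., ignoreNulls = true) returns the most recent non-null name seen so far.
val w = Window.orderBy("order").rowsBetween(Long.MinValue, 0)
val filled = df.withColumn("name", last(col("name"), ignoreNulls = true).over(w))
filled.show()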

Related

Spark: map columns of a dataframe to their ID of the distinct elements

I have the following dataframe with two columns, A and B, of string type:
val df = spark.createDataFrame(Seq(
  ("a1", "b1"),
  ("a1", "b2"),
  ("a1", "b2"),
  ("a2", "b3")
)).toDF("A", "B")
I create maps between the distinct elements of each column and a set of integers:
val mapColA = df
  .select("A")
  .distinct
  .rdd
  .zipWithIndex
  .collectAsMap

val mapColB = df
  .select("B")
  .distinct
  .rdd
  .zipWithIndex
  .collectAsMap
Now I want to create new columns in the dataframe by applying those maps to their corresponding columns. For one map only, this would be:
df.select("A").map(x => mapColA.get(x)).show()
However, I don't understand how to apply each map to its corresponding column and create two new columns (e.g. with withColumn). The expected result would be:
val result = spark.createDataFrame(Seq(
  ("a1", "b1", 1, 1),
  ("a1", "b2", 1, 2),
  ("a1", "b2", 1, 2),
  ("a2", "b3", 2, 3)
)).toDF("A", "B", "idA", "idB")
Could you help me?
If I understood correctly, this can be achieved using dense_rank:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val df2 = df.withColumn("idA", dense_rank().over(Window.orderBy("A")))
  .withColumn("idB", dense_rank().over(Window.orderBy("B")))
df2.show
+---+---+---+---+
| A| B|idA|idB|
+---+---+---+---+
| a1| b1| 1| 1|
| a1| b2| 1| 2|
| a1| b2| 1| 2|
| a2| b3| 2| 3|
+---+---+---+---+
If you want to stick with your original code, you can make some modifications:
import spark.implicits._

// zipWithIndex assigns arbitrary zero-based IDs, so the numbering differs from the expected output
val mapColA = df.select("A").distinct().rdd.map(r => r.getAs[String](0)).zipWithIndex.collectAsMap
val mapColB = df.select("B").distinct().rdd.map(r => r.getAs[String](0)).zipWithIndex.collectAsMap
val df2 = df.map(r => (r.getAs[String](0), r.getAs[String](1), mapColA.get(r.getAs[String](0)), mapColB.get(r.getAs[String](1)))).toDF("A", "B", "idA", "idB")
df2.show
+---+---+---+---+
| A| B|idA|idB|
+---+---+---+---+
| a1| b1| 1| 2|
| a1| b2| 1| 0|
| a1| b2| 1| 0|
| a2| b3| 0| 1|
+---+---+---+---+
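Not part of the original answers, but another way to apply the collected maps (assuming, as above, that they fit in driver memory) is to broadcast them and look them up through UDFs. A rough sketch:
import org.apache.spark.sql.functions._

// Broadcast the driver-side maps once, then look each value up per row via a UDF.
val bA = spark.sparkContext.broadcast(mapColA)
val bB = spark.sparkContext.broadcast(mapColB)
val idA = udf((a: String) => bA.value.get(a))
val idB = udf((b: String) => bB.value.get(b))
val df3 = df.withColumn("idA", idA(col("A"))).withColumn("idB", idB(col("B")))
df3.show()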

How to parallelize processing of dataframe in apache spark with combination over a column

I'm looking for a solution to build an aggregation over all combinations of a column. For example, I have a data frame as below:
val df = Seq(("A", 1), ("B", 2), ("C", 3), ("A", 4), ("B", 5)).toDF("id", "value")
+---+-----+
| id|value|
+---+-----+
| A| 1|
| B| 2|
| C| 3|
| A| 4|
| B| 5|
+---+-----+
I'm looking for an aggregation over all combinations of the column "id". Below is a solution I found, but it cannot exploit Spark's parallelism; the loop runs one combination at a time from the driver. Is there any better solution that gets rid of the for loop?
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._

val list = df.select($"id").distinct().orderBy($"id").as[String].collect()
val combinations = (1 to list.length).flatMap(x => list.combinations(x)).filter(_.length > 1)

val schema = StructType(
  StructField("indexvalue", IntegerType, true) ::
  StructField("segment", StringType, true) :: Nil)

var initialDF = spark.createDataFrame(sc.emptyRDD[Row], schema)
for (x <- combinations) {
  initialDF = initialDF.union(df.filter($"id".isin(x: _*))
    .agg(expr("sum(value)").as("indexvalue"))
    .withColumn("segment", lit(x.mkString("+"))))
}
initialDF.show()
+----------+-------+
|indexvalue|segment|
+----------+-------+
| 12| A+B|
| 8| A+C|
| 10| B+C|
| 15| A+B+C|
+----------+-------+
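One possible way to drop the driver-side loop (a sketch, not part of the original post) is to turn the combinations themselves into a dataframe and compute all the sums in a single distributed aggregation. The combinations are still enumerated on the driver, which is fine as long as their count stays manageable:
import org.apache.spark.sql.functions._
import spark.implicits._

// One row per combination: ("A+B", ["A", "B"]), ("A+B+C", ["A", "B", "C"]), ...
val ids = df.select($"id").distinct().as[String].collect().sorted
val segments = (2 to ids.length)
  .flatMap(n => ids.combinations(n))
  .map(c => (c.mkString("+"), c.toSeq))
  .toDF("segment", "members")

// Attach each value row to every combination containing its id, then sum per segment.
val result = df
  .join(segments, expr("array_contains(members, id)"))
  .groupBy($"segment")
  .agg(sum($"value").as("indexvalue"))
result.show()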

Converting row into column using spark scala

I want to convert rows into columns using a Spark dataframe.
My table is like this
Eno,Name
1,A
1,B
1,C
2,D
2,E
I want to convert it into
Eno,n1,n2,n3
1,A,B,C
2,D,E,Null
I used the code below:
val r = spark.sqlContext.read.format("csv").option("header", "true").option("inferschema", "true").load("C:\\Users\\axy\\Desktop\\abc2.csv")
val n = Seq("n1", "n2", "n3")
r.groupBy("Eno")
  .pivot("Name", n)
  .agg(expr("coalesce(first(Name),3)").cast("double"))
  .show()
But I am getting this result:
+---+----+----+----+
|Eno| n1| n2| n3|
+---+----+----+----+
| 1|null|null|null|
| 2|null|null|null|
+---+----+----+----+
Can anyone help me get the desired result?
import org.apache.spark.sql.functions._
import spark.implicits._
// Map each name to the pivot column it should land in, then pivot on that label.
val m = map(lit("A"), lit("n1"), lit("B"), lit("n2"), lit("C"), lit("n3"), lit("D"), lit("n1"), lit("E"), lit("n2"))
val df = Seq((1, "A"), (1, "B"), (1, "C"), (2, "D"), (2, "E")).toDF("Eno", "Name")
df.withColumn("new", m($"Name")).groupBy("Eno").pivot("new").agg(first("Name")).show()
+---+---+---+----+
|Eno| n1| n2| n3|
+---+---+---+----+
| 1| A| B| C|
| 2| D| E|null|
+---+---+---+----+
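Not from the original answer, but if you don't want to hardcode the letter-to-slot map, a rough sketch (assuming at most three names per Eno, as in the question) is to number the names within each Eno and pivot on that rank:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Label each name n1, n2, n3 within its Eno, then pivot on that label.
val slot = concat(lit("n"), row_number().over(Window.partitionBy("Eno").orderBy("Name")).cast("string"))
df.withColumn("slot", slot)
  .groupBy("Eno")
  .pivot("slot", Seq("n1", "n2", "n3"))
  .agg(first("Name"))
  .show()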
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq((1, "A"), (1, "B"), (1, "C"), (2, "D"), (2, "E")).toDF("Eno", "Name")
// Collect the names per Eno, then pull them out positionally into n1..n3.
val getName = udf { (names: Seq[String], i: Int) => if (names.size > i) names(i) else null }
val tdf = df.groupBy($"Eno").agg(collect_list($"Name").as("names"))
val ndf = (1 to 3).foldLeft(tdf) { (ndf, i) => ndf.withColumn(s"n$i", getName($"names", lit(i - 1))) }
  .drop("names")
ndf.show()
+---+---+---+----+
|Eno| n1| n2| n3|
+---+---+---+----+
| 1| A| B| C|
| 2| D| E|null|
+---+---+---+----+

Find smallest value in a rolling window partitioned by group

I have a dataframe containing different geographical positions as well as the distance to some other places. My problem is that I want to find the closest n places for each geographical position. My first idea was to use groupBy() followed by some sort of aggregation but I couldn't get that to work.
Instead I tried to first convert the dataframe to an RDD and then use groupByKey(); it works, but the method is cumbersome. Is there any better alternative to solve this problem? Maybe using groupBy() and aggregating somehow?
A small example of my approach where n=2 with input:
+---+--------+
| id|distance|
+---+--------+
| 1| 5.0|
| 1| 3.0|
| 1| 7.0|
| 1| 4.0|
| 2| 1.0|
| 2| 3.0|
| 2| 3.0|
| 2| 7.0|
+---+--------+
Code:
df.rdd.map{case Row(id: Long, distance: Double) => (id, distance)}
.groupByKey()
.map{case (id: Long, iter: Iterable[Double]) => (id, iter.toSeq.sorted.take(2))}
.toDF("id", "distance")
.withColumn("distance", explode($"distance"))
Output:
+---+--------+
| id|distance|
+---+--------+
| 1| 3.0|
| 1| 4.0|
| 2| 1.0|
| 2| 3.0|
+---+--------+
You can use Window as below:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local").appName("test").getOrCreate()
import spark.implicits._

case class A(id: Long, distance: Double)

val df = List(A(1, 5.0), A(1, 3.0), A(1, 7.0), A(1, 4.0), A(2, 1.0), A(2, 3.0), A(2, 4.0), A(2, 7.0))
  .toDF("id", "distance")

val window = Window.partitionBy("id").orderBy("distance")
val result = df.withColumn("rank", row_number().over(window)).where(col("rank") <= 2)
result.drop("rank").show()
You can increase the number of results you get by changing the 2.
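For example, a small helper that parameterizes n (a sketch, not part of the original answer, assuming the same id and distance columns):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Keep the n smallest distances per id.
def closestN(df: DataFrame, n: Int): DataFrame = {
  val w = Window.partitionBy("id").orderBy("distance")
  df.withColumn("rank", row_number().over(w))
    .where(col("rank") <= n)
    .drop("rank")
}
closestN(df, 2).show()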
Hope this helps.

How to use Spark ML ALS algorithm? [duplicate]

Is it possible to factorize a Spark dataframe column? By factorizing I mean mapping each unique value in the column to an ID, so that equal values share the same ID.
Example, the original dataframe:
+----------+----------------+--------------------+
| col1| col2| col3|
+----------+----------------+--------------------+
|1473490929|4060600988513370| A|
|1473492972|4060600988513370| A|
|1473509764|4060600988513370| B|
|1473513432|4060600988513370| C|
|1473513432|4060600988513370| A|
+----------+----------------+--------------------+
to the factorized version:
+----------+----------------+--------------------+
| col1| col2| col3|
+----------+----------------+--------------------+
|1473490929|4060600988513370| 0|
|1473492972|4060600988513370| 0|
|1473509764|4060600988513370| 1|
|1473513432|4060600988513370| 2|
|1473513432|4060600988513370| 0|
+----------+----------------+--------------------+
In Scala itself it would be fairly simple, but since Spark distributes its dataframes over nodes I'm not sure how to keep a mapping from A->0, B->1, C->2.
Also, assume the dataframe is pretty big (gigabytes), which means loading one entire column into the memory of a single machine might not be possible.
Can it be done?
You can use StringIndexer to encode letters into indices:
import org.apache.spark.ml.feature.StringIndexer
val indexer = new StringIndexer()
.setInputCol("col3")
.setOutputCol("col3Index")
val indexed = indexer.fit(df).transform(df)
indexed.show()
+----------+----------------+----+---------+
| col1| col2|col3|col3Index|
+----------+----------------+----+---------+
|1473490929|4060600988513370| A| 0.0|
|1473492972|4060600988513370| A| 0.0|
|1473509764|4060600988513370| B| 1.0|
|1473513432|4060600988513370| C| 2.0|
|1473513432|4060600988513370| A| 0.0|
+----------+----------------+----+---------+
Data:
val df = spark.createDataFrame(Seq(
  (1473490929, "4060600988513370", "A"),
  (1473492972, "4060600988513370", "A"),
  (1473509764, "4060600988513370", "B"),
  (1473513432, "4060600988513370", "C"),
  (1473513432, "4060600988513370", "A")
)).toDF("col1", "col2", "col3")
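If you want the index to replace col3 as an integer, matching the expected output, a minimal follow-up sketch (not part of the original answer):
import org.apache.spark.sql.functions.col

// Swap the indexed column in for col3 and cast the Double index to an integer code.
indexed
  .withColumn("col3", col("col3Index").cast("int"))
  .drop("col3Index")
  .show()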
You can use a user-defined function.
First you create the mapping you need:
import org.apache.spark.sql.functions.udf

val updateFunction = udf { (x: String) =>
  x match {
    case "A" => 0
    case "B" => 1
    case "C" => 2
    case _   => 3
  }
}
And now you only have to apply it to your DataFrame:
df.withColumn("col3", updateFunction(df.col("col3")))