How to point to or select a cell in a dataframe, Spark - Scala

I want to find the time difference between 2 cells.
With arrays in Python I would do a for loop computing st[i+1] - st[i] and store the results somewhere.
I have this dataframe sorted by time. How can I do it with Spark 2 / Scala? Pseudo-code is enough.
+-----+----+
|   st|name|
+-----+----+
|15:30| dog|
|15:32| dog|
|18:33| dog|
|18:34| dog|
+-----+----+

If the sliding diffs are to be computed per partition by name, I would use the lag() Window function:
import spark.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val df = Seq(
  ("a", 100), ("a", 120),
  ("b", 200), ("b", 240), ("b", 270)
).toDF("name", "value")

val window = Window.partitionBy($"name").orderBy("value")

df.
  withColumn("diff", $"value" - lag($"value", 1).over(window)).
  na.fill(0).
  orderBy("name", "value").
  show
// +----+-----+----+
// |name|value|diff|
// +----+-----+----+
// | a| 100| 0|
// | a| 120| 20|
// | b| 200| 0|
// | b| 240| 40|
// | b| 270| 30|
// +----+-----+----+
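Applied back to the original time-string frame, the same lag pattern works once st is parsed into something numeric. A minimal sketch, assuming a SparkSession named spark is in scope and the strings really are in HH:mm format (both assumptions are illustrative):
import spark.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

// unix_timestamp turns "HH:mm" strings into seconds so they can be subtracted
val times = Seq(("15:30", "dog"), ("15:32", "dog"), ("18:33", "dog"), ("18:34", "dog"))
  .toDF("st", "name")
  .withColumn("ts", unix_timestamp($"st", "HH:mm"))

val byName = Window.partitionBy($"name").orderBy($"ts")

times
  .withColumn("diff_minutes", ($"ts" - lag($"ts", 1).over(byName)) / 60)
  .show()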
On the other hand, if the sliding diffs are to be computed across the entire dataset, a Window function without partitioning wouldn't scale, hence I would resort to using RDD's sliding() function:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.mllib.rdd.RDDFunctions._

val rdd = df.rdd

val diffRDD = rdd.sliding(2).
  map{ case Array(x, y) => Row(y.getString(0), y.getInt(1), y.getInt(1) - x.getInt(1)) }

val headRDD = sc.parallelize(Seq(Row.fromSeq(rdd.first.toSeq :+ 0)))

val headDF = spark.createDataFrame(headRDD, df.schema.add("diff", IntegerType))
val diffDF = spark.createDataFrame(diffRDD, df.schema.add("diff", IntegerType))

val resultDF = headDF union diffDF

resultDF.show
// +----+-----+----+
// |name|value|diff|
// +----+-----+----+
// | a| 100| 0|
// | a| 120| 20|
// | b| 200| 80|
// | b| 240| 40|
// | b| 270| 30|
// +----+-----+----+

Something like:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

object Data1 {

  Logger.getLogger("org").setLevel(Level.OFF)
  Logger.getLogger("akka").setLevel(Level.OFF)

  def main(args: Array[String]): Unit = {
    implicit val spark: SparkSession =
      SparkSession
        .builder()
        .appName("Test")
        .master("local[1]")
        .getOrCreate()

    import org.apache.spark.sql.functions.col

    val rows = Seq(Row(1, 1), Row(1, 1), Row(1, 1))
    val schema = List(StructField("int1", IntegerType, true), StructField("int2", IntegerType, true))

    val someDF = spark.createDataFrame(
      spark.sparkContext.parallelize(rows),
      StructType(schema)
    )

    someDF.withColumn("diff", col("int1") - col("int2")).show()
  }
}
gives
+----+----+----+
|int1|int2|diff|
+----+----+----+
| 1| 1| 0|
| 1| 1| 0|
| 1| 1| 0|
+----+----+----+

If you are specifically looking to diff adjacent elements in a collection, then in Scala I would zip the collection with its tail to give a collection containing tuples of adjacent pairs.
Unfortunately there isn't a tail method on RDDs or DataFrames/Datasets.
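For a plain Scala collection the idea looks like this (a tiny illustration, not Spark code):
val xs = List(1, 3, 6, 10)
// pair each element with its successor, then diff each pair
xs.zip(xs.tail).map { case (a, b) => b - a }   // List(2, 3, 4)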
You could do something like:
val a = myDF.rdd
val tail = myDF.rdd.zipWithIndex.collect {
  case (v, index) if index >= 1 => v
}
a.zip(tail).map { case (l, r) => /* diff the st column of l and r */ }.collect
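Note that RDD.zip requires both sides to have the same number of elements in each partition, which a dropped-head RDD won't have, so in practice it is safer to align the two by index. A rough sketch, assuming myDF has an Int column at position 1 to diff (purely illustrative):
// pair each row's value with the next row's value by joining on a shifted index
val indexed = myDF.rdd.zipWithIndex.map { case (row, i) => (i, row.getInt(1)) }
val shifted = indexed.map { case (i, v) => (i - 1, v) }           // value of the following row
val diffs   = indexed.join(shifted).sortByKey().values.map { case (cur, next) => next - cur }
diffs.collect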

Related

Scala spark, input dataframe, return columns where all values equal to 1

Given a dataframe, say it contains 4 columns and 3 rows. I want to write a function that returns the columns where all the values in that column are equal to 1.
This is Scala code. I want to use some Spark transformations to transform or filter the dataframe input. This filter should be implemented in a function.
case class Grade(c1: Integer, c2: Integer, c3: Integer, c4: Integer)
val example = Seq(
Grade(1,3,1,1),
Grade(1,1,null,1),
Grade(1,10,2,1)
)
val dfInput = spark.createDataFrame(example)
After I call the function filterColumns()
val dfOutput = dfInput.filterColumns()
it should return a 3-row, 2-column dataframe whose values are all 1.
A bit more readable approach using Dataset[Grade]
import org.apache.spark.sql.functions.col
import scala.collection.mutable
import org.apache.spark.sql.Column

val tmp = dfInput.map(grade => grade.dropWhenNotEqualsTo(1))
val rowsCount = dfInput.count()
val colsToRetain = mutable.Set[Column]()

for (column <- tmp.columns) {
  val withoutNullsCount = tmp.select(column).na.drop().count()
  if (rowsCount == withoutNullsCount) colsToRetain += col(column)
}

dfInput.select(colsToRetain.toArray: _*).show()
+---+---+
| c4| c1|
+---+---+
| 1| 1|
| 1| 1|
| 1| 1|
+---+---+
And the case class:
case class Grade(c1: Integer, c2: Integer, c3: Integer, c4: Integer) {
  def dropWhenNotEqualsTo(n: Integer): Grade = {
    Grade(nullOrValue(c1, n), nullOrValue(c2, n), nullOrValue(c3, n), nullOrValue(c4, n))
  }
  def nullOrValue(c: Integer, n: Integer) = if (c == n) c else null
}
grade.dropWhenNotEqualsTo(1) -> returns a new Grade with the values that do not satisfy the condition replaced by nulls
+---+----+----+---+
| c1| c2| c3| c4|
+---+----+----+---+
| 1|null| 1| 1|
| 1| 1|null| 1|
| 1|null|null| 1|
+---+----+----+---+
(column <- tmp.columns) -> iterate over the columns
tmp.select(column).na.drop() -> drop rows with nulls
e.g. for c2 this will return
+---+
| c2|
+---+
| 1|
+---+
if (rowsCount == withoutNullsCount) colsToRetain += col(column) -> retain the column only if it contains no nulls (i.e. if the column contains nulls, drop it)
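As a side note, the per-column na.drop().count() launches one job per column; the same check can be done in a single aggregation. A sketch reusing tmp and rowsCount from above:
import org.apache.spark.sql.functions.{col, count, when}

// count the non-null values of every column in one pass
val nonNullCounts = tmp.select(tmp.columns.map(c => count(when(col(c).isNotNull, 1)).as(c)): _*).first()
val keep = tmp.columns.filter(c => nonNullCounts.getAs[Long](c) == rowsCount).map(col)
dfInput.select(keep: _*).show()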
One of the options is a reduce on the RDD:
import spark.implicits._
import org.apache.spark.sql.Row

val df = Seq(("1", "A", "3", "4"), ("1", "2", "?", "4"), ("1", "2", "3", "4")).toDF()
df.show()

val first = df.first()
val size = first.length
val diffStr = "#"
val targetStr = "1"

def rowToArray(row: Row): Array[String] = {
  val arr = new Array[String](row.length)
  for (i <- 0 to row.length - 1) {
    arr(i) = row.getString(i)
  }
  arr
}

def compareArrays(a1: Array[String], a2: Array[String]): Array[String] = {
  val arr = new Array[String](a1.length)
  for (i <- 0 to a1.length - 1) {
    arr(i) = if (a1(i).equals(a2(i)) && a1(i).equals(targetStr)) a1(i) else diffStr
  }
  arr
}

val diff = df.rdd
  .map(rowToArray)
  .reduce(compareArrays)

val cols = (df.columns zip diff).filter(!_._2.equals(diffStr)).map(s => df(s._1))
df.select(cols: _*).show()
+---+---+---+---+
| _1| _2| _3| _4|
+---+---+---+---+
| 1| A| 3| 4|
| 1| 2| ?| 4|
| 1| 2| 3| 4|
+---+---+---+---+
+---+
| _1|
+---+
| 1|
| 1|
| 1|
+---+
I would try to prepare the dataset for processing without nulls. In the case of a few columns this simple iterative approach might work fine (don't forget to import the Spark implicits first: import spark.implicits._):
import org.apache.spark.sql.Dataset

val example = spark.sparkContext.parallelize(Seq(
  Grade(1, 3, 1, 1),
  Grade(1, 1, 0, 1),
  Grade(1, 10, 2, 1)
)).toDS().cache()

def allOnes(colName: String, ds: Dataset[Grade]): Boolean = {
  val row = ds.select(colName).distinct().collect()
  if (row.length == 1 && row.head.getInt(0) == 1) true
  else false
}

val resultColumns = example.columns.filter(col => allOnes(col, example))
example.selectExpr(resultColumns: _*).show()
result is:
+---+---+
| c1| c4|
+---+---+
| 1| 1|
| 1| 1|
| 1| 1|
+---+---+
If nulls are inevitable, use untyped dataset (aka dataframe):
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val schema = StructType(Seq(
  StructField("c1", IntegerType, nullable = true),
  StructField("c2", IntegerType, nullable = true),
  StructField("c3", IntegerType, nullable = true),
  StructField("c4", IntegerType, nullable = true)
))

val example = spark.sparkContext.parallelize(Seq(
  Row(1, 3, 1, 1),
  Row(1, 1, null, 1),
  Row(1, 10, 2, 1)
))

val dfInput = spark.createDataFrame(example, schema).cache()

def allOnes(colName: String, df: DataFrame): Boolean = {
  val row = df.select(colName).distinct().collect()
  if (row.length == 1 && row.head.getInt(0) == 1) true
  else false
}

val resultColumns = dfInput.columns.filter(col => allOnes(col, dfInput))
dfInput.selectExpr(resultColumns: _*).show()

Sequential Dynamic filters on the same Spark Dataframe Column in Scala Spark

I have a column named root and need to filter the dataframe based on the different values of that root column.
Suppose the values in root are parent, child or sub-child, and I want to apply these filters dynamically through a variable.
val x = ("parent,child,sub-child").split(",")
x.map(eachvalue <- {
var df1 = df.filter(col("root").contains(eachvalue))
}
But when I am doing it, it always overwrites df1; instead, I want to apply all 3 filters and get the result.
In the future I may extend the list to any number of filter values and the code should still work.
Thanks,
Bab
You should apply the subsequent filters to the result of the previous filter, not to df:
val x = ("parent,child,sub-child").split(",")
var df1 = df
x.foreach(eachvalue => {
  df1 = df1.filter(col("root").contains(eachvalue))
})
df1 after this loop will have all the filters applied to it.
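The same chaining can also be written without a var using foldLeft (a sketch with the same semantics: each step filters the result of the previous step):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

val x = "parent,child,sub-child".split(",")
val filtered: DataFrame = x.foldLeft(df)((acc, v) => acc.filter(col("root").contains(v)))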
Let's see an example with spark shell. Hope it helps you.
scala> import spark.implicits._
import spark.implicits._
scala> val df0 =
spark.sparkContext.parallelize(List(1,2,1,3,3,2,1)).toDF("number")
df0: org.apache.spark.sql.DataFrame = [number: int]
scala> val list = List(1,2,3)
list: List[Int] = List(1, 2, 3)
scala> val dfFiltered = for (number <- list) yield { df0.filter($"number" === number)}
dfFiltered: List[org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]] = List([number: int], [number: int], [number: int])
scala> dfFiltered(0).show
+------+
|number|
+------+
| 1|
| 1|
| 1|
+------+
scala> dfFiltered(1).show
+------+
|number|
+------+
| 2|
| 2|
+------+
scala> dfFiltered(2).show
+------+
|number|
+------+
| 3|
| 3|
+------+
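If a single DataFrame containing all the matching rows is wanted rather than a list of them, the per-value results can be unioned back together (a sketch):
// combine the list of filtered DataFrames into one
val combined = dfFiltered.reduce(_ union _)
combined.show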
AFAIK isin can be used in this case; below is an example.
import spark.implicits._
import org.apache.spark.sql.functions.col

val colorStringArr = "red,yellow,blue".split(",")

val colorDF =
  List(
    "red",
    "yellow",
    "purple"
  ).toDF("color")

// to derive a column using a list
colorDF.withColumn(
  "is_primary_color",
  col("color").isin(colorStringArr: _*)
).show()

println("if you don't want derived column and directly want to filter using a list with isin then .. ")

colorDF.filter(col("color").isin(colorStringArr: _*)).show
Result :
+------+----------------+
| color|is_primary_color|
+------+----------------+
| red| true|
|yellow| true|
|purple| false|
+------+----------------+
if you don't want derived column and directly want to filter using a list with isin then ....
+------+
| color|
+------+
| red|
|yellow|
+------+
One more way using array_contains and swapping the arguments.
scala> val x = ("parent,child,sub-child").split(",")
x: Array[String] = Array(parent, child, sub-child)
scala> val df = Seq(("parent"),("grand-parent"),("child"),("sub-child"),("cousin")).toDF("root")
df: org.apache.spark.sql.DataFrame = [root: string]
scala> df.show
+------------+
| root|
+------------+
| parent|
|grand-parent|
| child|
| sub-child|
| cousin|
+------------+
scala> df.withColumn("check", array_contains(lit(x),'root)).show
+------------+-----+
| root|check|
+------------+-----+
| parent| true|
|grand-parent|false|
| child| true|
| sub-child| true|
| cousin|false|
+------------+-----+
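The same expression can be used directly in a filter instead of as a derived check column (a sketch):
// keep only the rows whose root value is in the list: parent, child and sub-child remain
df.filter(array_contains(lit(x), 'root)).show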
Here are my two cents
import spark.implicits._

val filters = List(1, 2, 3)
val data = List(5, 1, 2, 1, 3, 3, 2, 1, 4)
val colName = "number"

val df = spark.
  sparkContext.
  parallelize(data).
  toDF(colName).
  filter(r => filters.contains(r.getAs[Int](colName)))

df.show()
df.show()
which results in
+------+
|number|
+------+
| 1|
| 2|
| 1|
| 3|
| 3|
| 2|
| 1|
+------+

How to parallelize processing of dataframe in apache spark with combination over a column

I'm looking for a solution to build an aggregation with all combinations of a column. For example, I have a data frame as below:
val df = Seq(("A", 1), ("B", 2), ("C", 3), ("A", 4), ("B", 5)).toDF("id", "value")
+---+-----+
| id|value|
+---+-----+
| A| 1|
| B| 2|
| C| 3|
| A| 4|
| B| 5|
+---+-----+
And I am looking for an aggregation over all combinations of the column "id". Below I found a solution, but it cannot use the parallelism of Spark; it works only on the driver node or on a single executor. Is there any better solution in order to get rid of the for loop?
import spark.implicits._
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{expr, lit}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val list = df.select($"id").distinct().orderBy($"id").as[String].collect()
val combinations = (1 to list.length flatMap (x => list.combinations(x))) filter (_.length > 1)

val schema = StructType(
  StructField("indexvalue", IntegerType, true) ::
  StructField("segment", StringType, true) :: Nil)

var initialDF = spark.createDataFrame(sc.emptyRDD[Row], schema)

for (x <- combinations) {
  initialDF = initialDF.union(df.filter($"id".isin(x: _*))
    .agg(expr("sum(value)").as("indexvalue"))
    .withColumn("segment", lit(x.mkString("+"))))
}

initialDF.show()
initialDF.show()
+----------+-------+
|indexvalue|segment|
+----------+-------+
| 12| A+B|
| 8| A+C|
| 10| B+C|
| 15| A+B+C|
+----------+-------+
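One way to avoid the driver-side loop of separate jobs (a sketch, assuming the list of distinct ids stays small enough to enumerate its combinations on the driver) is to turn the combinations into a (segment, id) mapping and let a single join + groupBy compute all the sums at once:
import spark.implicits._
import org.apache.spark.sql.functions.sum

val ids = df.select($"id").distinct().orderBy($"id").as[String].collect()
val combos = (2 to ids.length).flatMap(n => ids.combinations(n).map(_.toSeq))

// one row per (segment, member id), e.g. ("A+B", "A"), ("A+B", "B"), ...
val segmentDF = combos
  .flatMap(c => c.map(id => (c.mkString("+"), id)))
  .toDF("segment", "id")

segmentDF
  .join(df, "id")
  .groupBy("segment")
  .agg(sum($"value").as("indexvalue"))
  .show()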

How to update column of spark dataframe based on the values of previous record

I have three columns in df
Col1,col2,col3
X,x1,x2
Z,z1,z2
Y,
X,x3,x4
P,p1,p2
Q,q1,q2
Y
I want to do the following:
when col1 = x, store the values of col2 and col3
and assign those column values to the next row where col1 = y
expected output
X,x1,x2
Z,z1,z2
Y,x1,x2
X,x3,x4
P,p1,p2
Q,q1,q2
Y,x3,x4
Any help would be appreciated.
Note: Spark 1.6
Here's one approach using Window functions, with the following steps:
1. Add a row-identifying column (not needed if there is already one) and combine the non-key columns (presumably many of them) into one
2. Create tmp1 with conditional nulls, and tmp2 using the last/rowsBetween Window function to back-fill with the last non-null value
3. Create newcols conditionally from cols and tmp2
4. Expand newcols back into individual columns using foldLeft
Note that this solution uses a Window function without partitioning, thus it may not work for a large dataset.
val df = Seq(
("X", "x1", "x2"),
("Z", "z1", "z2"),
("Y", "", ""),
("X", "x3", "x4"),
("P", "p1", "p2"),
("Q", "q1", "q2"),
("Y", "", "")
).toDF("col1", "col2", "col3")
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val colList = df.columns.filter(_ != "col1")
val df2 = df.select($"col1", monotonically_increasing_id.as("id"),
struct(colList.map(col): _*).as("cols")
)
val df3 = df2.
withColumn( "tmp1", when($"col1" === "X", $"cols") ).
withColumn( "tmp2", last("tmp1", ignoreNulls = true).over(
Window.orderBy("id").rowsBetween(Window.unboundedPreceding, 0)
) )
df3.show
// +----+---+-------+-------+-------+
// |col1| id| cols| tmp1| tmp2|
// +----+---+-------+-------+-------+
// | X| 0|[x1,x2]|[x1,x2]|[x1,x2]|
// | Z| 1|[z1,z2]| null|[x1,x2]|
// | Y| 2| [,]| null|[x1,x2]|
// | X| 3|[x3,x4]|[x3,x4]|[x3,x4]|
// | P| 4|[p1,p2]| null|[x3,x4]|
// | Q| 5|[q1,q2]| null|[x3,x4]|
// | Y| 6| [,]| null|[x3,x4]|
// +----+---+-------+-------+-------+
val df4 = df3.withColumn( "newcols",
when($"col1" === "Y", $"tmp2").otherwise($"cols")
).select($"col1", $"newcols")
df4.show
// +----+-------+
// |col1|newcols|
// +----+-------+
// | X|[x1,x2]|
// | Z|[z1,z2]|
// | Y|[x1,x2]|
// | X|[x3,x4]|
// | P|[p1,p2]|
// | Q|[q1,q2]|
// | Y|[x3,x4]|
// +----+-------+
val dfResult = colList.foldLeft( df4 )(
(accDF, c) => accDF.withColumn(c, df4(s"newcols.$c"))
).drop($"newcols")
dfResult.show
// +----+----+----+
// |col1|col2|col3|
// +----+----+----+
// | X| x1| x2|
// | Z| z1| z2|
// | Y| x1| x2|
// | X| x3| x4|
// | P| p1| p2|
// | Q| q1| q2|
// | Y| x3| x4|
// +----+----+----+
[UPDATE]
For Spark 1.x, last(colName, ignoreNulls) isn't available in the DataFrame API. A work-around is to revert to Spark SQL, which supports ignore-nulls in its last() method:
df2.
withColumn( "tmp1", when($"col1" === "X", $"cols") ).
createOrReplaceTempView("df2table")
// might need to use registerTempTable("df2table") instead
val df3 = spark.sqlContext.sql("""
select col1, id, cols, tmp1, last(tmp1, true) over (
order by id rows between unbounded preceding and current row
) as tmp2
from df2table
""")
Yes, there is a lag function that requires ordering
import org.apache.spark.sql.expressions.Window.orderBy
import org.apache.spark.sql.functions.{coalesce, lag}
case class Temp(a: String, b: Option[String], c: Option[String])
val input = ss.createDataFrame(
Seq(
Temp("A", Some("a1"), Some("a2")),
Temp("D", Some("d1"), Some("d2")),
Temp("B", Some("b1"), Some("b2")),
Temp("E", None, None),
Temp("C", None, None)
))
+---+----+----+
| a| b| c|
+---+----+----+
| A| a1| a2|
| D| d1| d2|
| B| b1| b2|
| E|null|null|
| C|null|null|
+---+----+----+
val order = orderBy($"a")
input
.withColumn("b", coalesce($"b", lag($"b", 1).over(order)))
.withColumn("c", coalesce($"c", lag($"c", 1).over(order)))
.show()
+---+---+---+
| a| b| c|
+---+---+---+
| A| a1| a2|
| B| b1| b2|
| C| b1| b2|
| D| d1| d2|
| E| d1| d2|
+---+---+---+

Spark ML StringIndexer and OneHotEncoder - empty strings error [duplicate]

I am importing a CSV file (using spark-csv) into a DataFrame which has empty String values. When I apply the OneHotEncoder, the application crashes with the error requirement failed: Cannot have an empty string for name. Is there a way I can get around this?
I could reproduce the error in the example provided on the Spark ML page:
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

val df = sqlContext.createDataFrame(Seq(
  (0, "a"),
  (1, "b"),
  (2, "c"),
  (3, ""), //<- original example has "a" here
  (4, "a"),
  (5, "c")
)).toDF("id", "category")

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(df)

val indexed = indexer.transform(df)

val encoder = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")

val encoded = encoder.transform(indexed)
encoded.show()
It is annoying since missing/empty values are a highly generic case.
Thanks in advance,
Nikhil
The OneHotEncoder/OneHotEncoderEstimator does not accept an empty string for a name, otherwise you'll get the following error:
java.lang.IllegalArgumentException: requirement failed: Cannot have an empty string for name.
at scala.Predef$.require(Predef.scala:233)
at org.apache.spark.ml.attribute.Attribute$$anonfun$5.apply(attributes.scala:33)
at org.apache.spark.ml.attribute.Attribute$$anonfun$5.apply(attributes.scala:32)
[...]
This is how I would do it (there is another way to do it, cf. @Anthony's answer):
I'll create a UDF to process the empty category:
import org.apache.spark.sql.functions._
def processMissingCategory = udf[String, String] { s => if (s == "") "NA" else s }
Then I'll apply the UDF to the column:
val df = sqlContext.createDataFrame(Seq(
  (0, "a"),
  (1, "b"),
  (2, "c"),
  (3, ""), //<- original example has "a" here
  (4, "a"),
  (5, "c")
)).toDF("id", "category")
  .withColumn("category", processMissingCategory('category))
df.show
// +---+--------+
// | id|category|
// +---+--------+
// | 0| a|
// | 1| b|
// | 2| c|
// | 3| NA|
// | 4| a|
// | 5| c|
// +---+--------+
Now, you can go back to your transformations
val indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIndex").fit(df)
val indexed = indexer.transform(df)
indexed.show
// +---+--------+-------------+
// | id|category|categoryIndex|
// +---+--------+-------------+
// | 0| a| 0.0|
// | 1| b| 2.0|
// | 2| c| 1.0|
// | 3| NA| 3.0|
// | 4| a| 0.0|
// | 5| c| 1.0|
// +---+--------+-------------+
// Spark <2.3
// val encoder = new OneHotEncoder().setInputCol("categoryIndex").setOutputCol("categoryVec")
// Spark +2.3
val encoder = new OneHotEncoderEstimator().setInputCols(Array("categoryIndex")).setOutputCols(Array("category2Vec"))
val encoded = encoder.transform(indexed)
encoded.show
// +---+--------+-------------+-------------+
// | id|category|categoryIndex| categoryVec|
// +---+--------+-------------+-------------+
// | 0| a| 0.0|(3,[0],[1.0])|
// | 1| b| 2.0|(3,[2],[1.0])|
// | 2| c| 1.0|(3,[1],[1.0])|
// | 3| NA| 3.0| (3,[],[])|
// | 4| a| 0.0|(3,[0],[1.0])|
// | 5| c| 1.0|(3,[1],[1.0])|
// +---+--------+-------------+-------------+
EDIT:
@Anthony's solution in Scala:
df.na.replace("category", Map( "" -> "NA")).show
// +---+--------+
// | id|category|
// +---+--------+
// | 0| a|
// | 1| b|
// | 2| c|
// | 3| NA|
// | 4| a|
// | 5| c|
// +---+--------+
I hope this helps!
Yep, it's a little thorny but maybe you can just replace the empty string with something sure to be different than other values. NOTE that I am using pyspark DataFrameNaFunctions API but Scala's should be similar.
df = sqlContext.createDataFrame([(0,"a"), (1,'b'), (2, 'c'), (3,''), (4,'a'), (5, 'c')], ['id', 'category'])
df = df.na.replace('', 'EMPTY', 'category')
df.show()
+---+--------+
| id|category|
+---+--------+
| 0| a|
| 1| b|
| 2| c|
| 3| EMPTY|
| 4| a|
| 5| c|
+---+--------+
If the column contains null, the OneHotEncoder fails with a NullPointerException.
Therefore I extended the UDF to translate null values as well:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

object OneHotEncoderExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("OneHotEncoderExample Application").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // $example on$
    val df1 = sqlContext.createDataFrame(Seq(
      (0.0, "a"),
      (1.0, "b"),
      (2.0, "c"),
      (3.0, ""),
      (4.0, null),
      (5.0, "c")
    )).toDF("id", "category")

    import org.apache.spark.sql.functions.udf
    def emptyValueSubstitution = udf[String, String] {
      case "" => "NA"
      case null => "null"
      case value => value
    }

    val df = df1.withColumn("category", emptyValueSubstitution(df1("category")))

    val indexer = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("categoryIndex")
      .fit(df)
    val indexed = indexer.transform(df)
    indexed.show()

    val encoder = new OneHotEncoder()
      .setInputCol("categoryIndex")
      .setOutputCol("categoryVec")
      .setDropLast(false)
    val encoded = encoder.transform(indexed)
    encoded.show()
    // $example off$

    sc.stop()
  }
}