Sequential Dynamic filters on the same Spark Dataframe Column in Scala Spark - scala

I have a column named root and need to filter dataframe based on the different values of a root column.
Suppose I have a values in root are parent,child or sub-child and I want to apply these filters dynamically through a variable.
val x = ("parent,child,sub-child").split(",")
x.map(eachvalue <- {
var df1 = df.filter(col("root").contains(eachvalue))
}
But when I am doing it, it always overwriting the DF1 instead, I want to apply all the 3 filters and get the result.
May be in future I may extend the list to any number of filter values and the code should work.
Thanks,
Bab

You should apply the subsequent filters to the result of the previous filter, not on df:
val x = ("parent,child,sub-child").split(",")
var df1 = df
x.map(eachvalue <- {
df1 = df1.filter(col("root").contains(eachvalue))
}
df1 after the map operation will have all filters applied to it.

Let's see an example with spark shell. Hope it helps you.
scala> import spark.implicits._
import spark.implicits._
scala> val df0 =
spark.sparkContext.parallelize(List(1,2,1,3,3,2,1)).toDF("number")
df0: org.apache.spark.sql.DataFrame = [number: int]
scala> val list = List(1,2,3)
list: List[Int] = List(1, 2, 3)
scala> val dfFiltered = for (number <- list) yield { df0.filter($"number" === number)}
dfFiltered: List[org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]] = List([number: int], [number: int], [number: int])
scala> dfFiltered(0).show
+------+
|number|
+------+
| 1|
| 1|
| 1|
+------+
scala> dfFiltered(1).show
+------+
|number|
+------+
| 2|
| 2|
+------+
scala> dfFiltered(2).show
+------+
|number|
+------+
| 3|
| 3|
+------+

AFAIK isin can be used in this case below is the example.
import spark.implicits._
val colorStringArr = "red,yellow,blue".split(",")
val colorDF =
List(
"red",
"yellow",
"purple"
).toDF("color")
// to derive a column using a list
colorDF.withColumn(
"is_primary_color",
col("color").isin(colorStringArr: _*)
).show()
println( "if you don't want derived column and directly want to filter using a list with isin then .. ")
colorDF.filter(col("color").isin(colorStringArr: _*)).show
Result :
+------+----------------+
| color|is_primary_color|
+------+----------------+
| red| true|
|yellow| true|
|purple| false|
+------+----------------+
if you don't want derived column and directly want to filter using a list with isin then ....
+------+
| color|
+------+
| red|
|yellow|
+------+

One more way using array_contains and swapping the arguments.
scala> val x = ("parent,child,sub-child").split(",")
x: Array[String] = Array(parent, child, sub-child)
scala> val df = Seq(("parent"),("grand-parent"),("child"),("sub-child"),("cousin")).toDF("root")
df: org.apache.spark.sql.DataFrame = [root: string]
scala> df.show
+------------+
| root|
+------------+
| parent|
|grand-parent|
| child|
| sub-child|
| cousin|
+------------+
scala> df.withColumn("check", array_contains(lit(x),'root)).show
+------------+-----+
| root|check|
+------------+-----+
| parent| true|
|grand-parent|false|
| child| true|
| sub-child| true|
| cousin|false|
+------------+-----+
scala>

Here are my two cents
val filters = List(1,2,3)
val data = List(5,1,2,1,3,3,2,1,4)
val colName = "number"
val df = spark.
sparkContext.
parallelize(data).
toDF(colName).
filter(
r => filters.contains(r.getAs[Int](colName))
)
df.show()
which results in
+------+
|number|
+------+
| 1|
| 2|
| 1|
| 3|
| 3|
| 2|
| 1|
+------+

Related

Merge two columns of different DataFrames in Spark using scala

I want to merge two columns from separate DataFrames in one DataFrames
I have two DataFrames like this
val ds1 = sc.parallelize(Seq(1,0,1,0)).toDF("Col1")
val ds2 = sc.parallelize(Seq(234,43,341,42)).toDF("Col2")
ds1.show()
+-----+
| Col1|
+-----+
| 0|
| 1|
| 0|
| 1|
+-----+
ds2.show()
+-----+
| Col2|
+-----+
| 234|
| 43|
| 341|
| 42|
+-----+
I want 3rd dataframe containing two columns Col1 and Col2
+-----++-----+
| Col1|| Col2|
+-----++-----+
| 0|| 234|
| 1|| 43|
| 0|| 341|
| 1|| 42|
+-----++-----+
I tried union
val ds3 = ds1.union(ds2)
But, it adds all row of ds2 to ds1.
monotonically_increasing_id <-- is not Deterministic.
Hence it is not guaranteed that you would get correct result
Easier to do by using RDD and creating key by using zipWithIndex
val ds1 = sc.parallelize(Seq(1,0,1,0)).toDF("Col1")
val ds2 = sc.parallelize(Seq(234,43,341,42)).toDF("Col2")
// Convert to RDD with ZIPINDEX < Which will be our key
val ds1Rdd = ds1.rdd.repartition(4).zipWithIndex().map{ case (v,k) => (k,v) }
val ds2Rdd = ds2.as[(Int)].rdd.repartition(4).zipWithIndex().map{ case (v,k) => (k,v) }
// Check How The KEY-VALUE Pair looks
ds1Rdd.collect()
res50: Array[(Long, Int)] = Array((0,0), (1,1), (2,1), (3,0))
res51: Array[(Long, Int)] = Array((0,341), (1,42), (2,43), (3,234))
So First element of the tuple is our Join key
we simply join and rearrange to result dataframe
val joinedRdd = ds1Rdd.join(ds2Rdd)
val resultrdd = joinedRdd.map(x => x._2).map(x => (x._1 ,x._2))
// resultrdd: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[204] at map at <console>
And we convert to DataFrame
resultrdd.toDF("Col1","Col2").show()
+----+----+
|Col1|Col2|
+----+----+
| 0| 341|
| 1| 42|
| 1| 43|
| 0| 234|
+----+----+

How to transform a string column of a dataframe into a column of Array[String] with Apache Spark and Scala

I have a DataFrame with a column 'title_from' as below.
.
This colume contains a sentence and I want to transform this column into a Array[String]. I have tried something like this but it does not works.
val newDF = df.select("title_from").map(x => x.split("\\\s+")
How can I achieve this? How can I transform a datafram of strings into a dataframe of Array[string]? I want evry line of newDF to be an array of words from df.
Thanks for any help!
You can use the withColumn function.
import org.apache.spark.sql.functions._
val newDF = df.withColumn("split_title_from", split(col("title_from"), "\\s+"))
.select("split_title_from")
Can you try following to get the list of all authors
scala> val df = Seq((1,"a1,a2,a3"), (2,"a1,a4,a10")).toDF("id","author")
df: org.apache.spark.sql.DataFrame = [id: int, author: string]
scala> df.show()
+---+---------+
| id| author|
+---+---------+
| 1| a1,a2,a3|
| 2|a1,a4,a10|
+---+---------+
scala> df.select("author").show
+---------+
| author|
+---------+
| a1,a2,a3|
|a1,a4,a10|
+---------+
scala> df.select("author").flatMap( row => { row.get(0).toString().split(",")}).show()
+-----+
|value|
+-----+
| a1|
| a2|
| a3|
| a1|
| a4|
| a10|
+-----+

How to assign keys to items in a column in Scala?

I have the following RDD:
Col1 Col2
"abc" "123a"
"def" "783b"
"abc "674b"
"xyz" "123a"
"abc" "783b"
I need the following output where each item in each column is converted into a unique key.
for example : abc->1,def->2,xyz->3
Col1 Col2
1 1
2 2
1 3
3 1
1 2
Any help would be appreciated. Thanks!
In this case, you can rely on the hashCode of the string. The hashcode will be the same if the input and datatype is same. Try this.
scala> "abc".hashCode
res23: Int = 96354
scala> "xyz".hashCode
res24: Int = 119193
scala> val df = Seq(("abc","123a"),
| ("def","783b"),
| ("abc","674b"),
| ("xyz","123a"),
| ("abc","783b")).toDF("col1","col2")
df: org.apache.spark.sql.DataFrame = [col1: string, col2: string]
scala>
scala> def hashc(x:String):Int =
| return x.hashCode
hashc: (x: String)Int
scala> val myudf = udf(hashc(_:String):Int)
myudf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,IntegerType,Some(List(StringType)))
scala> df.select(myudf('col1), myudf('col2)).show
+---------+---------+
|UDF(col1)|UDF(col2)|
+---------+---------+
| 96354| 1509487|
| 99333| 1694000|
| 96354| 1663279|
| 119193| 1509487|
| 96354| 1694000|
+---------+---------+
scala>
If you must map your columns into natural numbers starting from 1, one approach would be to apply zipWithIndex to the individual columns, add 1 to the index (as zipWithIndex always starts from 0), convert indvidual RDDs to DataFrames, and finally join the converted DataFrames for the index keys:
val rdd = sc.parallelize(Seq(
("abc", "123a"),
("def", "783b"),
("abc", "674b"),
("xyz", "123a"),
("abc", "783b")
))
val df1 = rdd.map(_._1).distinct.zipWithIndex.
map(r => (r._1, r._2 + 1)).
toDF("col1", "c1key")
val df2 = rdd.map(_._2).distinct.zipWithIndex.
map(r => (r._1, r._2 + 1)).
toDF("col2", "c2key")
val dfJoined = rdd.toDF("col1", "col2").
join(df1, Seq("col1")).
join(df2, Seq("col2"))
// +----+----+-----+-----+
// |col2|col1|c1key|c2key|
// +----+----+-----+-----+
// |783b| abc| 2| 1|
// |783b| def| 3| 1|
// |123a| xyz| 1| 2|
// |123a| abc| 2| 2|
// |674b| abc| 2| 3|
//+----+----+-----+-----+
dfJoined.
select($"c1key".as("col1"), $"c2key".as("col2")).
show
// +----+----+
// |col1|col2|
// +----+----+
// | 2| 1|
// | 3| 1|
// | 1| 2|
// | 2| 2|
// | 2| 3|
// +----+----+
Note that if you're okay with having the keys start from 0, the step of map(r => (r._1, r._2 + 1)) can be skipped in generating df1 and df2.

How to point or select a cell in a dataframe, Spark - Scala

I want to find the time difference of 2 cells.
With arrays in python I would do a for loop the st[i+1] - st[i] and store the results somewhere.
I have this dataframe sorted by time. How can I do it with Spark 2 or Scala, a pseudo-code is enough.
+--------------------+-------+
| st| name|
+--------------------+-------+
|15:30 |dog |
|15:32 |dog |
|18:33 |dog |
|18:34 |dog |
+--------------------+-------+
If the sliding diffs are to be computed per partition by name, I would use the lag() Window function:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val df = Seq(
("a", 100), ("a", 120),
("b", 200), ("b", 240), ("b", 270)
).toDF("name", "value")
val window = Window.partitionBy($"name").orderBy("value")
df.
withColumn("diff", $"value" - lag($"value", 1).over(window)).
na.fill(0).
orderBy("name", "value").
show
// +----+-----+----+
// |name|value|diff|
// +----+-----+----+
// | a| 100| 0|
// | a| 120| 20|
// | b| 200| 0|
// | b| 240| 40|
// | b| 270| 30|
// +----+-----+----+
On the other hand, if the sliding diffs are to be computed across the entire dataset, Window function without partition wouldn't scale hence I would resort to using RDD's sliding() function:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.mllib.rdd.RDDFunctions._
val rdd = df.rdd
val diffRDD = rdd.sliding(2).
map{ case Array(x, y) => Row(y.getString(0), y.getInt(1), y.getInt(1) - x.getInt(1)) }
val headRDD = sc.parallelize(Seq(Row.fromSeq(rdd.first.toSeq :+ 0)))
val headDF = spark.createDataFrame(headRDD, df.schema.add("diff", IntegerType))
val diffDF = spark.createDataFrame(diffRDD, df.schema.add("diff", IntegerType))
val resultDF = headDF union diffDF
resultDF.show
// +----+-----+----+
// |name|value|diff|
// +----+-----+----+
// | a| 100| 0|
// | a| 120| 20|
// | b| 200| 80|
// | b| 240| 40|
// | b| 270| 30|
// +----+-----+----+
Something like:
object Data1 {
import org.apache.log4j.Logger
import org.apache.log4j.Level
Logger.getLogger("org").setLevel(Level.OFF)
Logger.getLogger("akka").setLevel(Level.OFF)
def main(args: Array[String]) : Unit = {
implicit val spark: SparkSession =
SparkSession
.builder()
.appName("Test")
.master("local[1]")
.getOrCreate()
import org.apache.spark.sql.functions.col
val rows = Seq(Row(1, 1), Row(1, 1), Row(1, 1))
val schema = List(StructField("int1", IntegerType, true), StructField("int2", IntegerType, true))
val someDF = spark.createDataFrame(
spark.sparkContext.parallelize(rows),
StructType(schema)
)
someDF.withColumn("diff", col("int1") - col("int2")).show()
}
}
gives
+----+----+----+
|int1|int2|diff|
+----+----+----+
| 1| 1| 0|
| 1| 1| 0|
| 1| 1| 0|
+----+----+----+
If you are specifically looking to diff adjacent elements in a collection then in Scala I would zip the collection with its tail to give a collection containing tuples of adjacent pairs.
Unfortunately there isn't a tail method on RDDs or DataFrames/Sets
You could do something like:
val a = myDF.rdd
val tail = myDF.rdd.zipWithIndex.collect{
case (index, v) if index > 1 => v}
a.zip(tail).map{ case (l, r) => /* diff l and r st column */}.collect

how to concat multiple columns in spark while getting the column names to be concatenated from another table (different for each row)

I am trying to concat multiple columns in spark using concat function.
For example below is the table for which I have to add new concatenated column
table - **t**
+---+----+
| id|name|
+---+----+
| 1| a|
| 2| b|
+---+----+
and below is the table which has the information about which columns are to be concatenated for given id (for id 1 column id and name needs to be concatenated and for id 2 only id)
table - **r**
+---+-------+
| id| att |
+---+-------+
| 1|id,name|
| 2| id |
+---+-------+
if I join the two tables and do something like below, I am able to concat but not based on the table r (as the new column is having 1,a for first row but for second row it should be 2 only)
t.withColumn("new",concat_ws(",",t.select("att").first.mkString.split(",").map(c => col(c)): _*)).show
+---+----+-------+---+
| id|name| att |new|
+---+----+-------+---+
| 1| a|id,name|1,a|
| 2| b| id |2,b|
+---+----+-------+---+
I have to apply filter before the select in the above query, but I am not sure how to do that in withColumn for each row.
Something like below, if that is possible.
t.withColumn("new",concat_ws(",",t.**filter**("id="+this.id).select("att").first.mkString.split(",").map(c => col(c)): _*)).show
As it will require to filter each row based on the id.
scala> t.filter("id=1").select("att").first.mkString.split(",").map(c => col(c))
res90: Array[org.apache.spark.sql.Column] = Array(id, name)
scala> t.filter("id=2").select("att").first.mkString.split(",").map(c => col(c))
res89: Array[org.apache.spark.sql.Column] = Array(id)
Below is the final required result.
+---+----+-------+---+
| id|name| att |new|
+---+----+-------+---+
| 1| a|id,name|1,a|
| 2| b| id |2 |
+---+----+-------+---+
We can use UDF
Requirements for this logic to work.
The column name of your table t should be in same order as it comes in col att of table r
scala> input_df_1.show
+---+----+
| id|name|
+---+----+
| 1| a|
| 2| b|
+---+----+
scala> input_df_2.show
+---+-------+
| id| att|
+---+-------+
| 1|id,name|
| 2| id|
+---+-------+
scala> val join_df = input_df_1.join(input_df_2,Seq("id"),"inner")
join_df: org.apache.spark.sql.DataFrame = [id: int, name: string ... 1 more field]
scala> val req_cols = input_df_1.columns
req_cols: Array[String] = Array(id, name)
scala> def new_col_udf = udf((cols : Seq[String],row : String,attr : String) => {
| val row_values = row.split(",")
| val attrs = attr.split(",")
| val req_val = attrs.map{at =>
| val index = cols.indexOf(at)
| row_values(index)
| }
| req_val.mkString(",")
| })
new_col_udf: org.apache.spark.sql.expressions.UserDefinedFunction
scala> val intermediate_df = join_df.withColumn("concat_column",concat_ws(",",'id,'name)).withColumn("new_col",new_col_udf(lit(req_cols),'concat_column,'att))
intermediate_df: org.apache.spark.sql.DataFrame = [id: int, name: string ... 3 more fields]
scala> val result_df = intermediate_df.select('id,'name,'att,'new_col)
result_df: org.apache.spark.sql.DataFrame = [id: int, name: string ... 2 more fields]
scala> result_df.show
+---+----+-------+-------+
| id|name| att|new_col|
+---+----+-------+-------+
| 1| a|id,name| 1,a|
| 2| b| id| 2|
+---+----+-------+-------+
Hope it answers your question.
This may be done in a UDF:
val cols: Seq[Column] = dataFrame.columns.map(x => col(x)).toSeq
val indices: Seq[String] = dataFrame.columns.map(x => x).toSeq
val generateNew = udf((values: Seq[Any]) => {
val att = values(indices.indexOf("att")).toString.split(",")
val associatedIndices = indices.filter(x => att.contains(x))
val builder: StringBuilder = StringBuilder.newBuilder
values.filter(x => associatedIndices.contains(values.indexOf(x)))
values.foreach{ v => builder.append(v).append(";") }
builder.toString()
})
val dfColumns = array(cols:_*)
val dNew = dataFrame.withColumn("new", generateNew(dfColumns))
This is just a sketch, but the idea is that you can pass a sequence of items to the user defined function, and select the ones that are needed dynamically.
Note that there are additional types of collection/maps that you can pass - for example How to pass array to UDF