How can I apply boolean indexing in a Spark-Scala dataframe? - scala

I have two Spark-Scala dataframes and I need to use one boolean column from one dataframe to filter the second dataframe. Both dataframes have the same number of rows.
In pandas I would so it like this:
import pandas as pd
df1 = pd.DataFrame({"col1": ["A", "B", "A", "C"], "boolean_column": [True, False, True, False]})
df2 = pd.DataFrame({"col1": ["Z", "X", "Y", "W"], "col2": [1, 2, 3, 4]})
filtered_df2 = df2[df1['boolean_column']]
// Expected filtered_df2 should be this:
// df2 = pd.DataFrame({"col1": ["Z", "Y"], "col2": [1, 3]})
How can I do the same operation in Spark-Scala in the most time-efficient way?
My current solution is to add "boolean_column" from df1 to df2, then filter df2 by selecting only the rows with a true value in the newly added column and finally removing "boolean_column" from df2, but I'm not sure it is the best solution.
Any suggestion is appreciated.
Edit:
The expected output is a Spark-Scala dataframe (not a list or a column) with the same schema as the second dataframe, and only the subset of rows from df2 that satisfy the boolean mask from the "boolean_column" of df1.
The schema of df2 presented above is just an example. I'm expecting to receive df2 as a parameter, with any number of columns of different (and not fixed) schemas.

you can zip both DataFrames and filter on those tuples.
val ints = sparkSession.sparkContext.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
val bools = sparkSession.sparkContext.parallelize(List(true, false, true, false, true, false, true, false, true, false))
val filtered = ints.zip(bools).filter { case (int, bool) => bool }.map { case (int, bool) => int }
println(filtered.collect().toList) //List(1, 3, 5, 7, 9)

I managed to solve this with the following code:
import org.apache.spark.sql.types.{LongType, StructField, StructType}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, Row, SQLContext, SparkSession}
val spark = SparkSession.builder().appName(sc.appName).master(sc.master).getOrCreate()
val sqlContext = spark.sqlContext
def addColumnIndex(df: DataFrame, sqlContext: SQLContext) = sqlContext.createDataFrame(
// Add Column index
df.rdd.zipWithIndex.map{case (row, columnindex) => Row.fromSeq(row.toSeq :+ columnindex)},
// Create schema
StructType(df.schema.fields :+ StructField("columnindex", LongType, nullable = false))
)
import spark.implicits._
val DF1 = Seq(
("A", true),
("B", false),
("A", true),
("C", false)
).toDF("col1", "boolean_column")
val DF2 = Seq(
("Z", 1),
("X", 2),
("Y", 3),
("W", 4)
).toDF("col_1", "col_2")
// Add index
val DF1WithIndex = addColumnIndex(DF1, sqlContext)
val DF2WithIndex = addColumnIndex(DF2, sqlContext)
// Join
val joinDF = DF2WithIndex
.join(DF1WithIndex, Seq("columnindex"))
.drop("columnindex", "col1")
// Filter
val filteredDF2 = joinDF.filter(joinDF("boolean_column")).drop("boolean_column")
The filtered dataframe will be the following:
+-----+-----+
|col_1|col_2|
+-----+-----+
| Z| 1|
| Y| 3|
+-----+-----+

Related

Find the duplicate values of an array column in the dataframe in Scala

I have a dataframe with an array column like:
val df = Seq(
Array("abc", "abc", "null", "null"),
Array("bcd", "bc", "bcd", "null"),
Array("ijk", "abc", "bcd", "ijk")).toDF("col")
And looks like:
col:
["abc","abc","null","null"]
["bcd","bc","bcd","null"]
["ijk","abc","bcd","ijk"]
I am trying to get the duplicate value of each array in scala:
col_1:
['abc']
['bcd']
['ijk']
I tried to get the duplicate value in the list but no clue on how this can be done with array column
val df = List("bcd", "bc", "bcd", "null")
df.groupBy(identity).collect { case (x, List(_,_,_*)) => x }
df.withColumn("id", monotonically_increasing_id())
.withColumn("col", explode(col("col")))
.groupBy("id", "col")
.count()
.filter(col("count") > 1 /*&& col("col") =!= "null"*/)
.select("col")
.show()
You can simply use custom UDF
def findDuplicate = udf((in: Seq[String]) =>
in.groupBy(identity)
.filter(_._2.length > 1)
.keys
.toArray
)
df.withColumn("col_1", explode(findDuplicate($"col")))
.show()
if you are willing to skip null values (as in your example) just add .filterNot(_ == "null") before .groupBy
The duplicate values of an array column could be obtained by assigning a monotonically increasing id to each array, exploding the array, and then window grouping by id and col.
import org.apache.spark.sql.functions.max
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.explode
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.functions.monotonically_increasing_id
import spark.implicits._
val df = spark.sparkContext.parallelize(Seq(
Array("abc", "abc", null, null),
Array("bcd", "bc", "bcd", null),
Array("ijk", "abc", "bcd", "ijk"))).toDF("col")
df.show(10)
val idfDF = df.withColumn("id", monotonically_increasing_id)
val explodeDF = idfDF.select(col("id"), explode(col("col")))
val countDF = explodeDF.groupBy("id", "col").count()
// Windows are partitions of id
val byId = Window.partitionBy("id")
val maxDF = countDF.withColumn("max", max("count") over byId)
val finalDf = maxDF.where("max == count").where("col is not null").select("col")
finalDf.show(10)
+---+
|col|
+---+
|abc|
|ijk|
|bcd|
+---+

How to extract efficiently multiple columns from a single string column RDD?

I have a file with 20+ columns of which I would like to extract a few. Until now, I have the following code. I'm sure there is a smart way to do it, but not able to get it working successfully. Any ideas?
mvnmdata is of type RDD[String]
val strpcols = mvnmdata.map(x => x.split('|')).map(x => (x(0),x(1),x(5),x(6),x(7),x(8),x(9),x(10),x(11),x(12),x(13),x(14),x(15),x(16),x(17),x(18),x(19),x(20),x(21),x(22),x(23) ))```
The next solution provides an easy and scalable way to manage your column names and indices. It is based on a map which determines the column name/index relation. The map will also help us to handle both the index of the extracted column and its name.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructType, StructField}
val rdd = spark.sparkContext.parallelize(Seq(
"1|500|400|300",
"1|34|67|89",
"2|10|20|56",
"3|2|5|56",
"3|1|8|22"))
val dictColums = Map("c0" -> 0, "c2" -> 2)
// create schema from map keys
val schema = StructType(dictColums.keys.toSeq.map(StructField(_, StringType, true)))
val mappedRDD = rdd.map{line => line.split('|')}
.map{
cols => Row.fromSeq(dictColums.values.toSeq.map{cols(_)})
}
val df = spark.createDataFrame(mappedRDD, schema).show
//output
+---+---+
| c0| c2|
+---+---+
| 1|400|
| 1| 67|
| 2| 20|
| 3| 5|
| 3| 8|
+---+---+
First we declare dictColums in this example we will extract the cols "c0" -> 0 and "c2" -> 2
Next we create the schema from the keys of the map
The one map (which you already have) will split the line by |, the second one will create a Row containing the values that correspond to each item of dictColums.values
UPDATE:
You could also create a function from the above functionality in order to be able to reuse it multiple times:
import org.apache.spark.sql.DataFrame
def stringRddToDataFrame(colsMapping: Map[String, Int], rdd: RDD[String]) : DataFrame = {
val schema = StructType(colsMapping.keys.toSeq.map(StructField(_, StringType, true)))
val mappedRDD = rdd.map{line => line.split('|')}
.map{
cols => Row.fromSeq(colsMapping.values.toSeq.map{cols(_)})
}
spark.createDataFrame(mappedRDD, schema)
}
And then use it for your case:
val cols = Map("c0" -> 0, "c1" -> 1, "c5" -> 5, ... "c23" -> 23)
val df = stringRddToDataFrame(cols, rdd)
As below,if you don't want to write repeated x(i),you can process it in a loop. Example 1:
val strpcols = mvnmdata.map(x => x.split('|'))
.map(x =>{
val xbuffer = new ArrayBuffer[String]()
for (i <- Array(0,1,5,6...)){
xbuffer.append(x(i))
}
xbuffer
})
If you only want to define the index list with start&end and the numbers to be excluded, see Example 2 of below:
scala> (1 to 10).toSet
res8: scala.collection.immutable.Set[Int] = Set(5, 10, 1, 6, 9, 2, 7, 3, 8, 4)
scala> ((1 to 10).toSet -- Set(2,9)).toArray.sortBy(row=>row)
res9: Array[Int] = Array(1, 3, 4, 5, 6, 7, 8, 10)
The final code you want:
//define the function to process indexes
def getSpecIndexes(start:Int, end:Int, removedValueSet:Set[Int]):Array[Int] = {
((start to end).toSet -- removedValueSet).toArray.sortBy(row=>row)
}
val strpcols = mvnmdata.map(x => x.split('|'))
.map(x =>{
val xbuffer = new ArrayBuffer[String]()
//call the function
for (i <- getSpecIndexes(0,100,Set(3,4,5,6))){
xbuffer.append(x(i))
}
xbuffer
})

Finding size of distinct array column

I am using Scala and Spark to create a dataframe. Here's my code so far:
val df = transformedFlattenDF
.groupBy($"market", $"city", $"carrier").agg(count("*").alias("count"), min($"bandwidth").alias("bandwidth"), first($"network").alias("network"), concat_ws(",", collect_list($"carrierCode")).alias("carrierCode")).withColumn("carrierCode", split(($"carrierCode"), ",").cast("array<string>")).withColumn("Carrier Count", collect_set("carrierCode"))
The column carrierCode becomes an array column. The data is present as follows:
CarrierCode
1: [12,2,12]
2: [5,2,8]
3: [1,1,3]
I'd like to create a column that counts the number of distinct values in each array. I tried doing collect_set, however, it gives me an error saying grouping expressions sequence is empty Is it possible to find the number of distinct values in each row's array? So that way in our same example, there could be a column like so:
Carrier Count
1: 2
2: 3
3: 2
collect_set is for aggregation hence should be applied within your groupBy-agg step:
val df = transformedFlattenDF.groupBy($"market", $"city", $"carrier").agg(
count("*").alias("count"), min($"bandwidth").alias("bandwidth"),
first($"network").alias("network"),
concat_ws(",", collect_list($"carrierCode")).alias("carrierCode"),
size(collect_set($"carrierCode")).as("carrier_count") // <-- ADDED `collect_set`
).
withColumn("carrierCode", split(($"carrierCode"), ",").cast("array<string>"))
If you don't want to change the existing groupBy-agg code, you can create a UDF like in the following example:
import org.apache.spark.sql.functions._
val codeDF = Seq(
Array("12", "2", "12"),
Array("5", "2", "8"),
Array("1", "1", "3")
).toDF("carrier_code")
def distinctElemCount = udf( (a: Seq[String]) => a.toSet.size )
codeDF.withColumn("carrier_count", distinctElemCount($"carrier_code")).
show
// +------------+-------------+
// |carrier_code|carrier_count|
// +------------+-------------+
// | [12, 2, 12]| 2|
// | [5, 2, 8]| 3|
// | [1, 1, 3]| 2|
// +------------+-------------+
Without UDF and using RDD conversion and back to DF for posterity:
import org.apache.spark.sql.functions._
val df = sc.parallelize(Seq(
("A", 2, 100, 2), ("F", 7, 100, 1), ("B", 10, 100, 100)
)).toDF("c1", "c2", "c3", "c4")
val x = df.select("c1", "c2", "c3", "c4").rdd.map(x => (x.get(0), List(x.get(1), x.get(2), x.get(3))) )
val y = x.map {case (k, vL) => (k, vL.toSet.size) }
// Manipulate back to your DF, via conversion, join, what not.
Returns:
res15: Array[(Any, Int)] = Array((A,2), (F,3), (B,2))
Solution above better, as stated more so for posterity.
You can take help for udf and you can do like this.
//Input
df.show
+-----------+
|CarrierCode|
+-----------+
|1:[12,2,12]|
| 2:[5,2,8]|
| 3:[1,1,3]|
+-----------+
//udf
val countUDF=udf{(str:String)=>val strArr=str.split(":"); strArr(0)+":"+strArr(1).split(",").distinct.length.toString}
df.withColumn("Carrier Count",countUDF(col("CarrierCode"))).show
//Sample Output:
+-----------+-------------+
|CarrierCode|Carrier Count|
+-----------+-------------+
|1:[12,2,12]| 1:3|
| 2:[5,2,8]| 2:3|
| 3:[1,1,3]| 3:3|
+-----------+-------------+

Merging the results of a scala spark dataframe as an array of results in another dataframe's column

Is there a way to take the following two dataframes and join them by the col0 field producing the output below?
//dataframe1
val df1 = Seq(
(1, 9, 100.1, 10),
).toDF("pk", "col0", "col1", "col2")
//dataframe2
val df2 = Seq(
(1, 9 "a1", "b1"),
(2, 9 "a2", "b2")
).toDF("pk", "col0", "str_col1", "str_col2")
//expected dataframe result
+---+-----+----+---------------------------+
| pk| col1|col2| new_arr_col |
+---+-----+----+---------------------------+
| 1|100.1| 10|[[1,9,a1, b1],[2,9,a2, b2]]|
+---+-----+----+---------------------------+
import org.apache.spark.sql.functions._
import spark.implicits._
// creating new array column out of all df2 columns:
val df2AsArray = df2.select($"col0", array(df2.columns.map(col): _*) as "new_arr_col")
val result = df1.join(df2AsArray, "col0")
.groupBy(df1.columns.map(col): _*) // grouping by all df1 columns
.agg(collect_list("new_arr_col") as "new_arr_col") // collecting array of arrays
.drop("col0")
result.show(false)
// +---+-----+----+--------------------------------------------------------+
// |pk |col1 |col2|new_arr_col |
// +---+-----+----+--------------------------------------------------------+
// |1 |100.1|10 |[WrappedArray(2, 9, a2, b2), WrappedArray(1, 9, a1, b1)]|
// +---+-----+----+--------------------------------------------------------+

Partial/Full-match value in one RDD to values in another RDD

I have two RDDs where the first RDD has records of the form
RDD1 = (1, 2017-2-13,"ABX-3354 gsfette"
2, 2017-3-18,"TYET-3423 asdsad"
3, 2017-2-09,"TYET-3423 rewriu"
4, 2017-2-13,"ABX-3354 42324"
5, 2017-4-01,"TYET-3423 aerr")
and the second RDD has records of the form
RDD2 = ('mfr1',"ABX-3354")
('mfr2',"TYET-3423")
I need to find all the records in RDD1 which have a full match/partial match for each value in RDD2 matching the 3rd Column of RDD1 to 2nd column of RDD2 and get the count
For this example, the end result would be:
ABX-3354 2
TYET-3423 3
What is the best way to do this?
I am posting couple of solutions with Spark SQL and more focused towards accurate pattern matching of search string in given text.
1: Using CrossJoin
import spark.implicits._
val df1 = Seq(
(1, "2017-2-13", "ABX-3354 gsfette"),
(2, "2017-3-18", "TYET-3423 asdsad"),
(3, "2017-2-09", "TYET-3423 rewriu"),
(4, "2017-2-13", "ABX-335442324"), //changed from "ABX-3354 42324"
(5, "2017-4-01", "aerrTYET-3423") //changed from "TYET-3423 aerr"
).toDF("id", "dt", "txt")
val df2 = Seq(
("mfr1", "ABX-3354"),
("mfr2", "TYET-3423")
).toDF("col1", "key")
//match function for filter
def matcher(row: Row): Boolean = row.getAs[String]("txt")
.contains(row.getAs[String]("key"))
val join = df1.crossJoin(df2)
import org.apache.spark.sql.functions.count
val result = join.filter(matcher _)
.groupBy("key")
.agg(count("txt").as("count"))
2: Using Broadcast variable
import spark.implicits._
val df1 = Seq(
(1, "2017-2-13", "ABX-3354 gsfette"),
(2, "2017-3-18", "TYET-3423 asdsad"),
(3, "2017-2-09", "TYET-3423 rewriu"),
(4, "2017-2-13", "ABX-3354 42324"),
(5, "2017-4-01", "aerrTYET-3423"),
(6, "2017-4-01", "aerrYET-3423")
).toDF("id", "dt", "pattern")
//small dataset to broadcast
val df2 = Seq(
("mfr1", "ABX-3354"),
("mfr2", "TYET-3423")
).map(_._2) // considering only 2 values in pair
//Lookup to use in UDF
val lookup = spark.sparkContext.broadcast(df2)
//Udf
import org.apache.spark.sql.functions._
val matcher = udf((txt: String) => {
val matches: Seq[String] = lookup.value.filter(txt.contains(_))
if (matches.size > 0) matches.head else null
})
val result = df1.withColumn("match", matcher($"pattern"))
.filter($"match".isNotNull) // not interested in non matching records
.groupBy("match")
.agg(count("pattern").as("count"))
Both solutions result same output
result.show()
+---------+-----+
| key|count|
+---------+-----+
|TYET-3423| 3|
| ABX-3354| 2|
+---------+-----+
Here is how you can get the result
val RDD1 = spark.sparkContext.parallelize(Seq(
(1, "2017-2-13", "ABX-3354 gsfette"),
(2, "2017-3-18", "TYET-3423 asdsad"),
(3, "2017-2-09", "TYET-3423 rewriu"),
(4, "2017-2-13", "ABX-3354 42324"),
(5, "2017-4-01", "TYET-3423 aerr")
))
val RDD2 = spark.sparkContext.parallelize(Seq(
("mfr1","ABX-3354"),
("mfr2","TYET-3423")
))
RDD1.map(r =>{
(r._3.split(" ")(0), (r._1, r._2, r._3))
})
.join(RDD2.map(r => (r._2, r._1)))
.groupBy(_._1)
.map(r => (r._1, r._2.toSeq.size))
.foreach(println)
Output:
(TYET-3423,3)
(ABX-3354,2)
Hope this helps!