Partial/full-match value in one RDD to values in another RDD - Scala

I have two RDDs where the first RDD has records of the form
RDD1 = (1, 2017-2-13, "ABX-3354 gsfette")
       (2, 2017-3-18, "TYET-3423 asdsad")
       (3, 2017-2-09, "TYET-3423 rewriu")
       (4, 2017-2-13, "ABX-3354 42324")
       (5, 2017-4-01, "TYET-3423 aerr")
and the second RDD has records of the form
RDD2 = ('mfr1',"ABX-3354")
('mfr2',"TYET-3423")
I need to find all the records in RDD1 that have a full or partial match for each value in RDD2, matching the 2nd column of RDD2 against the 3rd column of RDD1, and get the count per value.
For this example, the end result would be:
ABX-3354 2
TYET-3423 3
What is the best way to do this?

I am posting a couple of solutions using Spark SQL, focused on accurate pattern matching of the search string in the given text.
1: Using CrossJoin

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.count
import spark.implicits._

val df1 = Seq(
  (1, "2017-2-13", "ABX-3354 gsfette"),
  (2, "2017-3-18", "TYET-3423 asdsad"),
  (3, "2017-2-09", "TYET-3423 rewriu"),
  (4, "2017-2-13", "ABX-335442324"), // changed from "ABX-3354 42324"
  (5, "2017-4-01", "aerrTYET-3423")  // changed from "TYET-3423 aerr"
).toDF("id", "dt", "txt")

val df2 = Seq(
  ("mfr1", "ABX-3354"),
  ("mfr2", "TYET-3423")
).toDF("col1", "key")

// match function for the filter
def matcher(row: Row): Boolean =
  row.getAs[String]("txt").contains(row.getAs[String]("key"))

val join = df1.crossJoin(df2)

val result = join.filter(matcher _)
  .groupBy("key")
  .agg(count("txt").as("count"))
2: Using Broadcast variable

import spark.implicits._

val df1 = Seq(
  (1, "2017-2-13", "ABX-3354 gsfette"),
  (2, "2017-3-18", "TYET-3423 asdsad"),
  (3, "2017-2-09", "TYET-3423 rewriu"),
  (4, "2017-2-13", "ABX-3354 42324"),
  (5, "2017-4-01", "aerrTYET-3423"),
  (6, "2017-4-01", "aerrYET-3423")
).toDF("id", "dt", "pattern")

// small dataset to broadcast
val df2 = Seq(
  ("mfr1", "ABX-3354"),
  ("mfr2", "TYET-3423")
).map(_._2) // keep only the second value of each pair

// lookup to use in the UDF
val lookup = spark.sparkContext.broadcast(df2)

// UDF
import org.apache.spark.sql.functions._
val matcher = udf((txt: String) => {
  val matches: Seq[String] = lookup.value.filter(txt.contains(_))
  if (matches.nonEmpty) matches.head else null
})

val result = df1.withColumn("match", matcher($"pattern"))
  .filter($"match".isNotNull) // not interested in non-matching records
  .groupBy("match")
  .agg(count("pattern").as("count"))
Both solutions produce the same output:
result.show()
+---------+-----+
| key|count|
+---------+-----+
|TYET-3423| 3|
| ABX-3354| 2|
+---------+-----+
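As a side note (my own addition, not part of the original solutions), the same partial match can also be expressed directly as a join condition using Column.contains, which avoids the separate filter step; Spark will still plan a nested-loop join under the hood:

import org.apache.spark.sql.functions.count

// sketch: join df1 and df2 from solution 1 on a contains() condition
val altResult = df1.join(df2, df1("txt").contains(df2("key")))
  .groupBy("key")
  .agg(count("txt").as("count"))

altResult.show()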

Here is how you can get the result:

val RDD1 = spark.sparkContext.parallelize(Seq(
  (1, "2017-2-13", "ABX-3354 gsfette"),
  (2, "2017-3-18", "TYET-3423 asdsad"),
  (3, "2017-2-09", "TYET-3423 rewriu"),
  (4, "2017-2-13", "ABX-3354 42324"),
  (5, "2017-4-01", "TYET-3423 aerr")
))

val RDD2 = spark.sparkContext.parallelize(Seq(
  ("mfr1", "ABX-3354"),
  ("mfr2", "TYET-3423")
))

RDD1.map(r => (r._3.split(" ")(0), (r._1, r._2, r._3)))
  .join(RDD2.map(r => (r._2, r._1)))
  .groupBy(_._1)
  .map(r => (r._1, r._2.size))
  .foreach(println)
Output:
(TYET-3423,3)
(ABX-3354,2)
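Note that the split(" ")(0) join above only matches when the key is the leading token of the text. If a partial match anywhere in the string is needed, one possible sketch on plain RDDs (my assumption, and potentially expensive on large data) is a cartesian product with contains:

RDD1.cartesian(RDD2)
  .filter { case ((_, _, txt), (_, key)) => txt.contains(key) }
  .map { case (_, (_, key)) => (key, 1) }
  .reduceByKey(_ + _)
  .foreach(println)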
Hope this helps!

Related

Spark Scala - DataFrames comparison

How to compare 2 Dataframes based on PK.
Basically, I want to write Scala Spark code to compare 2 big DataFrames (10M records each, 100 columns each) and show output like:
ID Diff
1 [ {Col1: [1,2]}, {col3: [5,10]} ...]
2 [ {Col3: [4,2]}, {col7: [2,6]} ...]
ID is PK
Diff column - shows, for each differing column, first the column name and then the values that differ between the two DataFrames in that column.
Each differing column can be converted to a string, and then all columns are concatenated:
// ---- data ----
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import spark.implicits._

val leftDF = Seq(
  (1, 1, 5, 0),
  (2, 0, 4, 2)
).toDF("ID", "Col1", "col3", "col7")

val rightDF = Seq(
  (1, 2, 10, 0),
  (2, 0, 2, 6)
).toDF("ID", "Col1", "col3", "col7")

def getDifferenceForColumn(name: String): Column =
  when(
    col("l." + name) =!= col("r." + name),
    concat(lit("{" + name + ": ["), col("l." + name), lit(","), col("r." + name), lit("]}"))
  ).otherwise(lit(""))

val diffColumn = leftDF
  .columns
  .filter(_ != "ID")
  .map(name => getDifferenceForColumn(name))
  .reduce((l, r) => concat(
    l,
    when(length(r) =!= 0 && length(l) =!= 0, lit(",")).otherwise(lit("")),
    r))

val diffColumnWithBraces = concat(lit("["), diffColumn, lit("]"))

leftDF
  .alias("l")
  .join(rightDF.alias("r"), Seq("id"))
  .select(col("ID"), diffColumnWithBraces.alias("DIFF"))
Output:
+---+------------------------------+
|ID |DIFF |
+---+------------------------------+
|1 |[{Col1: [1,2]},{col3: [5,10]}]|
|2 |[{col3: [4,2]},{col7: [2,6]}] |
+---+------------------------------+
If the columns cannot contain the value "}{", the two variables in the solution above can be changed as follows; this may perform better:
val diffColumns = leftDF
  .columns
  .filter(_ != "ID")
  .map(name => getDifferenceForColumn(name))

val diffColumnWithBraces =
  concat(lit("["), regexp_replace(concat(diffColumns: _*), "\\}\\{", "},{"), lit("]"))
A UDF can also be used; the input data and output are the same as in my first answer:
val colNames = leftDF
  .columns
  .filter(_ != "ID")

val generateSeqDiff = (colNames: Seq[String], leftValues: Seq[Any], rightValues: Seq[Any]) => {
  val nameValues = colNames
    .zip(leftValues)
    .zip(rightValues)
    .filterNot { case ((_, l), r) => l == r }
    .map { case ((name, l), r) => s"{$name: [$l,$r]}" }
    .mkString(",")
  s"[$nameValues]"
}

val generateSeqDiffUDF = udf(generateSeqDiff)

leftDF
  .select($"ID", array(colNames.head, colNames.tail: _*).alias("leftValues"))
  .alias("l")
  .join(
    rightDF
      .select($"ID", array(colNames.head, colNames.tail: _*).alias("rightValues"))
      .alias("r"),
    Seq("id"))
  .select($"ID", generateSeqDiffUDF(lit(colNames), $"leftValues", $"rightValues").alias("DIFF"))

How can I apply boolean indexing in a Spark-Scala dataframe?

I have two Spark-Scala dataframes and I need to use one boolean column from one dataframe to filter the second dataframe. Both dataframes have the same number of rows.
In pandas I would do it like this:
import pandas as pd
df1 = pd.DataFrame({"col1": ["A", "B", "A", "C"], "boolean_column": [True, False, True, False]})
df2 = pd.DataFrame({"col1": ["Z", "X", "Y", "W"], "col2": [1, 2, 3, 4]})
filtered_df2 = df2[df1['boolean_column']]
# Expected filtered_df2 should be this:
# df2 = pd.DataFrame({"col1": ["Z", "Y"], "col2": [1, 3]})
How can I do the same operation in Spark-Scala in the most time-efficient way?
My current solution is to add "boolean_column" from df1 to df2, then filter df2 by selecting only the rows with a true value in the newly added column, and finally remove "boolean_column" from df2, but I'm not sure it is the best solution.
Any suggestion is appreciated.
Edit:
The expected output is a Spark-Scala dataframe (not a list or a column) with the same schema as the second dataframe, and only the subset of rows from df2 that satisfy the boolean mask from the "boolean_column" of df1.
The schema of df2 presented above is just an example. I'm expecting to receive df2 as a parameter, with any number of columns of different (and not fixed) schemas.
You can zip both datasets and filter on the resulting tuples; here is the idea demonstrated on plain RDDs:
val ints = sparkSession.sparkContext.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
val bools = sparkSession.sparkContext.parallelize(List(true, false, true, false, true, false, true, false, true, false))
val filtered = ints.zip(bools).filter { case (int, bool) => bool }.map { case (int, bool) => int }
println(filtered.collect().toList) //List(1, 3, 5, 7, 9)
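The same zip idea can be carried over to DataFrames. The sketch below is only an assumption about how to wire it up; filterByMask is a hypothetical helper, and it relies on RDD.zip's requirement that both sides have the same number of partitions and the same number of elements per partition:

import org.apache.spark.sql.DataFrame

def filterByMask(df2: DataFrame, df1: DataFrame, maskCol: String = "boolean_column"): DataFrame = {
  // boolean mask taken from df1
  val mask = df1.select(maskCol).rdd.map(_.getBoolean(0))
  // zip df2's rows with the mask and keep only the rows flagged true
  val keptRows = df2.rdd.zip(mask).filter(_._2).map(_._1)
  // rebuild a DataFrame with df2's original schema
  df2.sparkSession.createDataFrame(keptRows, df2.schema)
}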
I managed to solve this with the following code:
import org.apache.spark.sql.types.{LongType, StructField, StructType}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, Row, SQLContext, SparkSession}

val spark = SparkSession.builder().appName(sc.appName).master(sc.master).getOrCreate()
val sqlContext = spark.sqlContext

def addColumnIndex(df: DataFrame, sqlContext: SQLContext) = sqlContext.createDataFrame(
  // Add column index
  df.rdd.zipWithIndex.map { case (row, columnindex) => Row.fromSeq(row.toSeq :+ columnindex) },
  // Create schema
  StructType(df.schema.fields :+ StructField("columnindex", LongType, nullable = false))
)

import spark.implicits._

val DF1 = Seq(
  ("A", true),
  ("B", false),
  ("A", true),
  ("C", false)
).toDF("col1", "boolean_column")

val DF2 = Seq(
  ("Z", 1),
  ("X", 2),
  ("Y", 3),
  ("W", 4)
).toDF("col_1", "col_2")

// Add index
val DF1WithIndex = addColumnIndex(DF1, sqlContext)
val DF2WithIndex = addColumnIndex(DF2, sqlContext)

// Join
val joinDF = DF2WithIndex
  .join(DF1WithIndex, Seq("columnindex"))
  .drop("columnindex", "col1")

// Filter
val filteredDF2 = joinDF.filter(joinDF("boolean_column")).drop("boolean_column")
The filtered dataframe will be the following:
+-----+-----+
|col_1|col_2|
+-----+-----+
| Z| 1|
| Y| 3|
+-----+-----+

Merging the results of a scala spark dataframe as an array of results in another dataframe's column

Is there a way to take the following two dataframes and join them by the col0 field producing the output below?
// dataframe1
val df1 = Seq(
  (1, 9, 100.1, 10)
).toDF("pk", "col0", "col1", "col2")

// dataframe2
val df2 = Seq(
  (1, 9, "a1", "b1"),
  (2, 9, "a2", "b2")
).toDF("pk", "col0", "str_col1", "str_col2")
//expected dataframe result
+---+-----+----+---------------------------+
| pk| col1|col2| new_arr_col |
+---+-----+----+---------------------------+
| 1|100.1| 10|[[1,9,a1, b1],[2,9,a2, b2]]|
+---+-----+----+---------------------------+
import org.apache.spark.sql.functions._
import spark.implicits._

// creating a new array column out of all df2 columns:
val df2AsArray = df2.select($"col0", array(df2.columns.map(col): _*) as "new_arr_col")

val result = df1.join(df2AsArray, "col0")
  .groupBy(df1.columns.map(col): _*)                 // grouping by all df1 columns
  .agg(collect_list("new_arr_col") as "new_arr_col") // collecting array of arrays
  .drop("col0")
result.show(false)
// +---+-----+----+--------------------------------------------------------+
// |pk |col1 |col2|new_arr_col |
// +---+-----+----+--------------------------------------------------------+
// |1 |100.1|10 |[WrappedArray(2, 9, a2, b2), WrappedArray(1, 9, a1, b1)]|
// +---+-----+----+--------------------------------------------------------+
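If keeping the original column types of df2 matters, a possible variation (my assumption, not part of the answer above) is to collect structs instead of arrays, since array() coerces the mixed Int/String columns to a common string type:

import org.apache.spark.sql.functions.{col, collect_list, struct}

// same join/group as above, but each df2 row becomes a typed struct
val df2AsStruct = df2.select($"col0", struct(df2.columns.map(col): _*) as "new_struct_col")

val structResult = df1.join(df2AsStruct, "col0")
  .groupBy(df1.columns.map(col): _*)
  .agg(collect_list("new_struct_col") as "new_arr_col")
  .drop("col0")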

Spark : anti-join two DStreams

I can do JOINs on two Spark DStreams like :
val joinStream = stream1.join(stream2)
Now, what if I need to get all the records that weren't joined? Essentially, something like stream1.anti-join(stream2). Is this possible somehow?
Thanks and appreciate any help!
Assuming you had these:
val rdd1 = sc.parallelize(Array(
  (1, "one"),
  (2, "twow"),
  (3, "three"),
  (4, "four"),
  (5, "five")
))

val rdd2 = sc.parallelize(Array(
  (1, "otherOne"),
  (4, "otherFour"),
  (5, "otherFive"),
  (6, "six"),
  (7, "seven")
))

val antiJoined = rdd1.fullOuterJoin(rdd2).filter(r => r._2._1.isEmpty || r._2._2.isEmpty)
antiJoined.collect.foreach(println)
(6,(None,Some(six)))
(2,(Some(twow),None))
(3,(Some(three),None))
(7,(None,Some(seven)))
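If only one side matters (e.g. the records of stream1 whose key never matched stream2), a per-batch subtractByKey via DStream.transformWith is another option; the pair types below are an assumption mirroring the rdd1/rdd2 example above:

import org.apache.spark.rdd.RDD

// left anti-join per micro-batch: keep stream1 records whose key has no match in stream2
val leftAnti = stream1.transformWith(stream2,
  (r1: RDD[(Int, String)], r2: RDD[(Int, String)]) => r1.subtractByKey(r2))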

Spark table transformation (ERROR: 5063)

I have the following data:
val RDDApp = sc.parallelize(List("A", "B", "C"))
val RDDUser = sc.parallelize(List(1, 2, 3))
val RDDInstalled = sc.parallelize(List((1, "A"), (1, "B"), (2, "B"), (2, "C"), (3, "A"))).groupByKey
val RDDCart = RDDUser.cartesian(RDDApp)
I want to map this data so that I have an RDD of tuples of (userId, Boolean indicating whether the letter is installed for that user). I thought I found a solution with this:
val results = RDDCart.map (entry =>
(entry._1, RDDInstalled.lookup(entry._1).contains(entry._2))
)
If I call results.first, I get org.apache.spark.SparkException: SPARK-5063. I see the problem with the Action within the Mapping function but do not know how I can work around it so that I get the same result.
Just join and mapValues:
RDDCart.join(RDDInstalled).mapValues{case (x, xs) => xs.toSeq.contains(x)}
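If the app itself should also be kept in the result (an assumption about the desired output, since the question only mentions (userId, Boolean)), a small variation:

// keep the user, the app and the installed flag
val withApp = RDDCart.join(RDDInstalled)
  .map { case (user, (app, installed)) => (user, app, installed.toSet.contains(app)) }

withApp.collect.sorted.foreach(println)
// e.g. (1,A,true), (1,B,true), (1,C,false), ...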