How to compare two DataFrames based on a PK?
Basically I want to write Scala Spark code to compare two big DataFrames (10M records each, 100 columns each) and show output like:
ID Diff
1 [ {Col1: [1,2]}, {col3: [5,10]} ...]
2 [ {Col3: [4,2]}, {col7: [2,6]} ...]
ID is the PK.
Diff column: show each column name where there is a difference, followed by the differing values (left vs. right) for that column.
Each differing column can be converted to a string, and then all the columns are concatenated:
// ---- data ---
val leftDF = Seq(
(1, 1, 5, 0),
(2, 0, 4, 2)
).toDF("ID", "Col1", "col3", "col7")
val rightDF = Seq(
(1, 2, 10, 0),
(2, 0, 2, 6)
).toDF("ID", "Col1", "col3", "col7")
def getDifferenceForColumn(name: String): Column =
  when(
    col("l." + name) =!= col("r." + name),
    concat(lit("{" + name + ": ["), col("l." + name), lit(","), col("r." + name), lit("]}")))
    .otherwise(lit(""))

val diffColumn = leftDF
  .columns
  .filter(_ != "ID")
  .map(name => getDifferenceForColumn(name))
  .reduce((l, r) => concat(
    l,
    when(length(r) =!= 0 && length(l) =!= 0, lit(",")).otherwise(lit("")),
    r))
val diffColumnWithBraces = concat(lit("["), diffColumn, lit("]"))
leftDF
.alias("l")
.join(rightDF.alias("r"), Seq("id"))
.select(col("ID"), diffColumnWithBraces.alias("DIFF"))
Output:
+---+------------------------------+
|ID |DIFF |
+---+------------------------------+
|1 |[{Col1: [1,2]},{col3: [5,10]}]|
|2 |[{col3: [4,2]},{col7: [2,6]}] |
+---+------------------------------+
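One caveat, which is an assumption on my part about the data rather than part of the question: =!= evaluates to null when either side is null, so a row where only one side is null falls through to the otherwise branch and is not reported. A null-safe sketch of getDifferenceForColumn:

def getDifferenceForColumnNullSafe(name: String): Column = {
  // render nulls as the text "null" so concat of a null value does not wipe out the whole string
  val l = coalesce(col("l." + name).cast("string"), lit("null"))
  val r = coalesce(col("r." + name).cast("string"), lit("null"))
  when(
    not(col("l." + name) <=> col("r." + name)),
    concat(lit("{" + name + ": ["), l, lit(","), r, lit("]}")))
    .otherwise(lit(""))
}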
If the column values can never contain the string "}{", two variables in the solution above can be changed, which may perform better:
val diffColumns = leftDF
.columns
.filter(_ != "ID")
.map(name => getDifferenceForColumn(name))
val diffColumnWithBraces = concat(lit("["), regexp_replace(concat(diffColumns: _*),"\\}\\{","},{"), lit("]"))
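For completeness, a sketch of how this variant plugs into the same join as the first version (same leftDF and rightDF as above):

leftDF
  .alias("l")
  .join(rightDF.alias("r"), Seq("ID"))
  .select(col("ID"), diffColumnWithBraces.alias("DIFF"))
  .show(false)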
A UDF can also be used; the input data and the output are the same as in my first answer:
val colNames = leftDF
.columns
.filter(_ != "ID")
val generateSeqDiff = (colNames: Seq[String], leftValues: Seq[Any], rightValues: Seq[Any]) => {
val nameValues = colNames
.zip(leftValues)
.zip(rightValues)
.filterNot({ case ((_, l), r) => l == r })
.map({ case ((name, l), r) => s"{$name: [$l,$r]}" })
.mkString(",")
s"[$nameValues]"
}
val generateSeqDiffUDF = udf(generateSeqDiff)
leftDF
.select($"ID", array(colNames.head, colNames.tail: _*).alias("leftValues"))
.alias("l")
.join(
rightDF
.select($"ID", array(colNames.head, colNames.tail: _*).alias("rightValues"))
.alias("r"), Seq("id"))
.select($"ID", generateSeqDiffUDF(lit(colNames), $"leftValues", $"rightValues").alias("DIFF"))
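Two practical notes, both assumptions on my part rather than part of the answer above: with 100 columns of mixed types it is safest to cast everything to string before building the array, and on some Spark versions an array literal needs typedLit rather than lit:

import org.apache.spark.sql.functions.typedLit

// cast each column to string so array(...) has a single element type
val stringValues = array(colNames.map(c => col(c).cast("string")): _*)

// typedLit builds the Seq[String] literal of column names passed to the UDF
val namesLit = typedLit(colNames.toSeq)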
How do I take the average of the columns in an array cols, using only non-null values, in a DataFrame df? I can do this for all columns, but it gives null when any of the values is null.
val cols = Array($"col1", $"col2", $"col3")
df.withColumn("avgCols", cols.foldLeft(lit(0)){(x, y) => x + y} / cols.length)
I don't want to na.fill because I want to preserve the true average.
I guess you can do something like this:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, struct, udf}

val cols = Array("col1", "col2", "col3")

def countAvg =
  udf((data: Row) => {
    val notNullIndices = cols.indices.filterNot(i => data.isNullAt(i))
    notNullIndices.map(i => data.getDouble(i)).sum / notNullIndices.length
  })
df.withColumn("seqNull", struct(cols.map(col): _*))
.withColumn("avg", countAvg(col("seqNull")))
.show(truncate = false)
But be careful, here average is counted only for not null elements.
If you need exactly the behaviour of your code (dividing by the total number of columns):
val cols = Array("col1", "col2", "col3")
def countAvg =
udf((data: Row) => {
val notNullIndices = cols.indices.filterNot(i => data.isNullAt(i))
notNullIndices.map(i => data.getDouble(i)).sum / cols.length
})
df.withColumn("seqNull", struct(cols.map(col): _*))
.withColumn("avg", countAvg(col("seqNull")))
.show(truncate = false)
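A tiny worked example of the difference, with hypothetical data (my assumption: double columns):

// for a row (1.0, null, 3.0)
//   first UDF:  (1.0 + 3.0) / 2 = 2.0     (divides by the non-null count)
//   second UDF: (1.0 + 3.0) / 3 ≈ 1.33    (divides by cols.length)
val dfWithNull = Seq(
  (Some(1.0), Option.empty[Double], Some(3.0))
).toDF("col1", "col2", "col3")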
The aggregate higher-order function can do it without a UDF (available in the Scala API since Spark 3.0).
import org.apache.spark.sql.functions._

val cols = Array($"col1", $"col2", $"col3")

df.withColumn(
  "avgCols",
  aggregate(
    array(cols: _*),                                      // aggregate expects a single array column
    struct(lit(0.0).alias("sum"), lit(0).alias("count")), // accumulator: running sum and non-null count
    (acc, x) => struct(
      (acc("sum") + coalesce(x.cast("double"), lit(0.0))).alias("sum"),
      (acc("count") + when(x.isNotNull, 1).otherwise(0)).alias("count")), // add 1 for every non-null value
    s => s("sum") / s("count")
  )
)
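If the aggregate function is not available in your Spark version, a sketch of the same idea with plain column expressions (sum of the non-null values divided by the count of non-null values):

import org.apache.spark.sql.functions._

val cols = Array($"col1", $"col2", $"col3")

val sumNotNull   = cols.map(c => coalesce(c, lit(0))).reduce(_ + _)
val countNotNull = cols.map(c => when(c.isNotNull, 1).otherwise(0)).reduce(_ + _)

df.withColumn("avgCols", sumNotNull / countNotNull) // null if every column is null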
I have a file with 20+ columns, of which I would like to extract a few. Until now, I have the following code. I'm sure there is a smarter way to do it, but I'm not able to get it working successfully. Any ideas?
mvnmdata is of type RDD[String]
val strpcols = mvnmdata.map(x => x.split('|')).map(x => (x(0),x(1),x(5),x(6),x(7),x(8),x(9),x(10),x(11),x(12),x(13),x(14),x(15),x(16),x(17),x(18),x(19),x(20),x(21),x(22),x(23) ))
The following solution provides an easy and scalable way to manage your column names and indices. It is based on a map which defines the column name/index relation; the map handles both the index of each extracted column and its name.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructType, StructField}
val rdd = spark.sparkContext.parallelize(Seq(
"1|500|400|300",
"1|34|67|89",
"2|10|20|56",
"3|2|5|56",
"3|1|8|22"))
val dictColums = Map("c0" -> 0, "c2" -> 2)
// create schema from map keys
val schema = StructType(dictColums.keys.toSeq.map(StructField(_, StringType, true)))
val mappedRDD = rdd.map{line => line.split('|')}
.map{
cols => Row.fromSeq(dictColums.values.toSeq.map{cols(_)})
}
val df = spark.createDataFrame(mappedRDD, schema)
df.show
//output
+---+---+
| c0| c2|
+---+---+
| 1|400|
| 1| 67|
| 2| 20|
| 3| 5|
| 3| 8|
+---+---+
First we declare dictColums; in this example we will extract the columns "c0" -> 0 and "c2" -> 2.
Next we create the schema from the keys of the map.
The first map (which you already have) splits each line by |; the second one creates a Row containing the values that correspond to each item of dictColums.values.
UPDATE:
You could also create a function from the above functionality in order to be able to reuse it multiple times:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame
def stringRddToDataFrame(colsMapping: Map[String, Int], rdd: RDD[String]) : DataFrame = {
val schema = StructType(colsMapping.keys.toSeq.map(StructField(_, StringType, true)))
val mappedRDD = rdd.map{line => line.split('|')}
.map{
cols => Row.fromSeq(colsMapping.values.toSeq.map{cols(_)})
}
spark.createDataFrame(mappedRDD, schema)
}
And then use it for your case:
val cols = Map("c0" -> 0, "c1" -> 1, "c5" -> 5, ... "c23" -> 23)
val df = stringRddToDataFrame(cols, rdd)
If you don't want to write x(i) repeatedly, you can process the indices in a loop, as below. Example 1:
import scala.collection.mutable.ArrayBuffer

val strpcols = mvnmdata.map(x => x.split('|'))
  .map(x => {
    val xbuffer = new ArrayBuffer[String]()
    for (i <- Array(0,1,5,6...)) {
      xbuffer.append(x(i))
    }
    xbuffer
  })
If you only want to define the index list with a start, an end, and the numbers to be excluded, see Example 2 below:
scala> (1 to 10).toSet
res8: scala.collection.immutable.Set[Int] = Set(5, 10, 1, 6, 9, 2, 7, 3, 8, 4)
scala> ((1 to 10).toSet -- Set(2,9)).toArray.sortBy(row=>row)
res9: Array[Int] = Array(1, 3, 4, 5, 6, 7, 8, 10)
The final code you want:
//define the function to process indexes
def getSpecIndexes(start:Int, end:Int, removedValueSet:Set[Int]):Array[Int] = {
((start to end).toSet -- removedValueSet).toArray.sortBy(row=>row)
}
val strpcols = mvnmdata.map(x => x.split('|'))
.map(x =>{
val xbuffer = new ArrayBuffer[String]()
//call the function
for (i <- getSpecIndexes(0,100,Set(3,4,5,6))){
xbuffer.append(x(i))
}
xbuffer
})
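A more idiomatic sketch of the same loop, mapping over the index list directly instead of appending to an ArrayBuffer (same hypothetical indices as above):

val strpcols = mvnmdata
  .map(_.split('|'))
  .map(x => getSpecIndexes(0, 100, Set(3, 4, 5, 6)).map(x(_)))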
I have a dataframe containing a bunch of values
val df = List(
(2017, 1, 1234),
(2017, 2, 1234),
(2017, 3, 1234),
(2017, 4, 1234),
(2018, 1, 12345),
(2018, 2, 12346),
(2018, 3, 12347),
(2018, 4, 12348)
).toDF("year", "month", "employeeCount")
df: org.apache.spark.sql.DataFrame = [year: int, month: int, employeeCount: int]
I want to filter that dataframe by a list of (year, month) pairs:
val filterValues = List((2018, 1), (2018, 2))
I can easily cheat and write the code that achieves it:
df.filter(
(col("year") === 2018 && col("month") === 1) ||
(col("year") === 2018 && col("month") === 2)
).show
but of course that's not satisfactory because filterValues could change, and I want to base it on whatever is in that list.
Is it possible to dynamically build my filter_expression and then pass it to df.filter(filter_expression)? I can't figure out how.
Based on your comment:
imagine someone calling this from the command-line with something like --filterColumns "year,month" --filterValues "2018|1,2018|2"
val filterValues = "2018|1,2018|2"
val filterColumns = "year,month"
you can get a list of columns
val colnames = filterColumns.split(',')
Convert data to a local Dataset (add schema when needed):
val filter = spark.read.option("delimiter", "|")
.csv(filterValues.split(',').toSeq.toDS)
.toDF(colnames: _*)
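If the filter columns should carry the same types as df (the "add schema when needed" part), one sketch is a DDL-string schema, assuming Spark 2.3+:

val typedFilter = spark.read
  .option("delimiter", "|")
  .schema("year INT, month INT") // field names come from the schema, so toDF is not needed
  .csv(filterValues.split(',').toSeq.toDS)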
and semi join:
df.join(filter, colnames, "left_semi").show
// +----+-----+-------------+
// |year|month|employeeCount|
// +----+-----+-------------+
// |2018| 1| 12345|
// |2018| 2| 12346|
// +----+-----+-------------+
An expression like this one should work as well:
import org.apache.spark.sql.functions._
val pred = filterValues
.split(",")
.map(x => colnames.zip(x.split('|'))
.map { case (c, v) => col(c) === v }
.reduce(_ && _))
.reduce(_ || _)
df.where(pred).show
// +----+-----+-------------+
// |year|month|employeeCount|
// +----+-----+-------------+
// |2018| 1| 12345|
// |2018| 2| 12346|
// +----+-----+-------------+
but will require more work if some type casting is required.
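A sketch of that extra casting step, assuming each literal should be cast to the matching column type taken from df's schema:

val typedPred = filterValues
  .split(",")
  .map(x => colnames.zip(x.split('|'))
    .map { case (c, v) => col(c) === lit(v).cast(df.schema(c).dataType) }
    .reduce(_ && _))
  .reduce(_ || _)

df.where(typedPred).show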
You can always do that using a udf function, as follows:
val filterValues = List((2018, 1), (2018, 2))
import org.apache.spark.sql.functions._
def filterUdf = udf((year:Int, month:Int) => filterValues.exists(x => x._1 == year && x._2 == month))
df.filter(filterUdf(col("year"), col("month"))).show(false)
Updated
You commented as
I mean that the list of columns to filter on (and the corresponding list of respective values) would be supplied from elsewhere at runtime.
For that you will have the list of column names provided too, so the solution would be something like below:
val filterValues = List((2018, 1), (2018, 2))
val filterColumns = List("year", "month")
import org.apache.spark.sql.functions._
def filterUdf = udf((unknown: Seq[Int]) => filterValues.exists(x => !x.productIterator.toList.zip(unknown).map(y => y._1 == y._2).contains(false)))
df.filter(filterUdf(array(filterColumns.map(col): _*))).show(false)
You can build up your filter_expression like this:
val df = List(
(2017, 1, 1234),
(2017, 2, 1234),
(2017, 3, 1234),
(2017, 4, 1234),
(2018, 1, 12345),
(2018, 2, 12346),
(2018, 3, 12347),
(2018, 4, 12348)
).toDF("year", "month", "employeeCount")
val filterValues = List((2018, 1), (2018, 2))
val filter_expression = filterValues
  .map { case (y, m) => col("year") === y and col("month") === m }
  .reduce(_ || _)

df
  .filter(filter_expression)
  .show()
+----+-----+-------------+
|year|month|employeeCount|
+----+-----+-------------+
|2018| 1| 12345|
|2018| 2| 12346|
+----+-----+-------------+
I have two RDDs where the first RDD has records of the form
RDD1 = (1, 2017-2-13,"ABX-3354 gsfette"
2, 2017-3-18,"TYET-3423 asdsad"
3, 2017-2-09,"TYET-3423 rewriu"
4, 2017-2-13,"ABX-3354 42324"
5, 2017-4-01,"TYET-3423 aerr")
and the second RDD has records of the form
RDD2 = ('mfr1',"ABX-3354")
('mfr2',"TYET-3423")
I need to find all the records in RDD1 which have a full or partial match for each value in RDD2 (matching the 3rd column of RDD1 to the 2nd column of RDD2) and get the count.
For this example, the end result would be:
ABX-3354 2
TYET-3423 3
What is the best way to do this?
I am posting a couple of solutions with Spark SQL, focused on accurate pattern matching of the search string in the given text.
1: Using CrossJoin
import spark.implicits._
val df1 = Seq(
(1, "2017-2-13", "ABX-3354 gsfette"),
(2, "2017-3-18", "TYET-3423 asdsad"),
(3, "2017-2-09", "TYET-3423 rewriu"),
(4, "2017-2-13", "ABX-335442324"), //changed from "ABX-3354 42324"
(5, "2017-4-01", "aerrTYET-3423") //changed from "TYET-3423 aerr"
).toDF("id", "dt", "txt")
val df2 = Seq(
("mfr1", "ABX-3354"),
("mfr2", "TYET-3423")
).toDF("col1", "key")
import org.apache.spark.sql.Row

//match function for filter
def matcher(row: Row): Boolean = row.getAs[String]("txt")
  .contains(row.getAs[String]("key"))
val join = df1.crossJoin(df2)
import org.apache.spark.sql.functions.count
val result = join.filter(matcher _)
.groupBy("key")
.agg(count("txt").as("count"))
2: Using Broadcast variable
import spark.implicits._
val df1 = Seq(
(1, "2017-2-13", "ABX-3354 gsfette"),
(2, "2017-3-18", "TYET-3423 asdsad"),
(3, "2017-2-09", "TYET-3423 rewriu"),
(4, "2017-2-13", "ABX-3354 42324"),
(5, "2017-4-01", "aerrTYET-3423"),
(6, "2017-4-01", "aerrYET-3423")
).toDF("id", "dt", "pattern")
//small dataset to broadcast
val df2 = Seq(
("mfr1", "ABX-3354"),
("mfr2", "TYET-3423")
).map(_._2) // considering only 2 values in pair
//Lookup to use in UDF
val lookup = spark.sparkContext.broadcast(df2)
//Udf
import org.apache.spark.sql.functions._
val matcher = udf((txt: String) => {
val matches: Seq[String] = lookup.value.filter(txt.contains(_))
if (matches.size > 0) matches.head else null
})
val result = df1.withColumn("match", matcher($"pattern"))
.filter($"match".isNotNull) // not interested in non matching records
.groupBy("match")
.agg(count("pattern").as("count"))
Both solutions produce the same output:
result.show()
+---------+-----+
| key|count|
+---------+-----+
|TYET-3423| 3|
| ABX-3354| 2|
+---------+-----+
Here is how you can get the result
val RDD1 = spark.sparkContext.parallelize(Seq(
(1, "2017-2-13", "ABX-3354 gsfette"),
(2, "2017-3-18", "TYET-3423 asdsad"),
(3, "2017-2-09", "TYET-3423 rewriu"),
(4, "2017-2-13", "ABX-3354 42324"),
(5, "2017-4-01", "TYET-3423 aerr")
))
val RDD2 = spark.sparkContext.parallelize(Seq(
("mfr1","ABX-3354"),
("mfr2","TYET-3423")
))
RDD1.map(r =>{
(r._3.split(" ")(0), (r._1, r._2, r._3))
})
.join(RDD2.map(r => (r._2, r._1)))
.groupBy(_._1)
.map(r => (r._1, r._2.toSeq.size))
.foreach(println)
Output:
(TYET-3423,3)
(ABX-3354,2)
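If the grouping is only needed for counting, a reduceByKey sketch over the same RDDs avoids materializing the grouped collections:

RDD1.map(r => (r._3.split(" ")(0), 1))
  .join(RDD2.map(r => (r._2, r._1)))
  .map { case (key, _) => (key, 1) }
  .reduceByKey(_ + _)
  .foreach(println)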
Hope this helps!
val temp = sqlContext.sql(s"SELECT A, B, C, (CASE WHEN (D) in (1,2,3) THEN ((E)+0.000)/60 ELSE 0 END) AS Z from TEST.TEST_TABLE")
val temp1 = temp.map({ temp => ((temp.getShort(0), temp.getString(1)), (temp.getDouble(2), temp.getDouble(3)))})
.reduceByKey((x, y) => ((x._1+y._1),(x._2+y._2)))
Instead of the above code, which does the computation (the case evaluation) in the Hive layer, I would like to have the transformation done in Scala. How would I do it?
Is it possible to do the same while filling the data inside the map?
val temp = sqlContext.sql(s"SELECT A, B, C, D, E from TEST.TEST_TABLE")
val tempTransform = temp.map(row => {
  val z = List[Double](1, 2, 3).contains(row.getDouble(3)) match {
    case true => row.getDouble(4) / 60
    case _    => 0d
  }
  Row(row.getShort(0), row.getString(1), row.getDouble(2), z)
})
val temp1 = tempTransform.map({ temp => ((temp.getShort(0), temp.getString(1)), (temp.getDouble(2), temp.getDouble(3)))})
.reduceByKey((x, y) => ((x._1+y._1),(x._2+y._2)))
You can use this syntax as well:
new_df = old_df.withColumn('target_column', udf(df.name))
as referred to in this example:
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._ // for `toDF` and $""
import org.apache.spark.sql.functions._ // for `when`
val df = sc.parallelize(Seq((4, "blah", 2), (2, "", 3), (56, "foo", 3), (100, null, 5)))
.toDF("A", "B", "C")
val newDf = df.withColumn("D", when($"B".isNull or $"B" === "", 0).otherwise(1))
In your case, execute the SQL, which gives you a DataFrame, like below:
val temp = sqlContext.sql(s"SELECT A, B, C, D, E from TEST.TEST_TABLE")
and then apply withColumn with when/otherwise (or, if needed, a Spark UDF that calls your Scala function logic instead of a Hive UDF).
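A sketch of that translation for the query in the question, assuming D and E are numeric so isin(1, 2, 3) and the division behave like the SQL CASE:

import org.apache.spark.sql.functions._

val temp = sqlContext.sql(s"SELECT A, B, C, D, E from TEST.TEST_TABLE")

val withZ = temp.withColumn(
  "Z",
  when(col("D").isin(1, 2, 3), (col("E") + 0.000) / 60).otherwise(lit(0)))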