How to replace values in RDD 1 per keys in RDD 2? - scala

Here are two RDDs.
Table1-pair(key,value)
val table1 = sc.parallelize(Seq(("1", "a"), ("2", "b"), ("3", "c")))
//RDD[(String, String)]
Table2-Arrays
val table2 = sc.parallelize(Array(Array("1", "2", "d"), Array("1", "3", "e")))
//RDD[Array[String]]
I am trying to change elements of table2 such as "1" to "a" using the keys and values in table1. My expect result is as follows:
RDD[Array[String]] = (Array(Array("a", "b", "d"), Array("a", "c", "e")))
Is there a way to make this possible?
If so, would it be efficient using a huge dataset?

I think we can do it better with dataframes while avoiding joins as it might involve shuffling of data.
val table1 = spark.sparkContext.parallelize(Seq(("1", "a"), ("2", "b"), ("3", "c"))).collectAsMap()
//Brodcasting so that mapping is available to all nodes
val brodcastedMapping = spark.sparkContext.broadcast(table1)
val table2 = spark.sparkContext.parallelize(Array(Array("1", "2", "d"), Array("1", "3", "e")))
def changeMapping(value: String): String = {
brodcastedMapping.value.getOrElse(value, value)
}
val changeMappingUDF = udf(changeMapping(_:String))
table2.toDF.withColumn("exploded", explode($"value"))
.withColumn("new", changeMappingUDF($"exploded"))
.groupBy("value")
.agg(collect_list("new").as("mappedCol"))
.select("mappedCol").rdd.map(r => r.toSeq.toArray.map(_.toString))
Let me know if it suits your requirement otherwise I can modify it as needed.

You can do that in Dataset
package dataframe
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.{SQLContext, SparkSession}
import org.apache.spark.{SparkConf, SparkContext}
/**
* #author vaquar.khan#gmail.com
*/
object Test {
case class table1Class(key: String, value: String)
case class table2Class(key: String, value: String, value1: String)
def main(args: Array[String]) {
val spark =
SparkSession.builder()
.appName("DataFrame-Basic")
.master("local[4]")
.getOrCreate()
import spark.implicits._
//
val table1 = Seq(
table1Class("1", "a"), table1Class("2", "b"), table1Class("3", "c"))
val df1 = spark.sparkContext.parallelize(table1, 4).toDF()
df1.show()
val table2 = Seq(
table2Class("1", "2", "d"), table2Class("1", "3", "e"))
val df2 = spark.sparkContext.parallelize(table2, 4).toDF()
df2.show()
//
df1.createOrReplaceTempView("A")
df2.createOrReplaceTempView("B")
spark.sql("select d1.key,d1.value,d2.value1 from A d1 inner join B d2 on d1.key = d2.key").show()
//TODO
/* need to fix query
spark.sql( "select * from ( "+ //B1.value,B1.value1,A.value
" select A.value,B.value,B.value1 "+
" from B "+
" left join A "+
" on B.key = A.key ) B2 "+
" left join A " +
" on B2.value = A.key" ).show()
*/
}
}
Results :
+---+-----+
|key|value|
+---+-----+
| 1| a|
| 2| b|
| 3| c|
+---+-----+
+---+-----+------+
|key|value|value1|
+---+-----+------+
| 1| 2| d|
| 1| 3| e|
+---+-----+------+
[Stage 15:=====================================> (68 + 6) / 100]
[Stage 15:============================================> (80 + 4) / 100]
+-----+-----+------+
|value|value|value1|
+-----+-----+------+
| 1| a| d|
| 1| a| e|
+-----+-----+------+

Is there a way to make this possible?
Yes. Use Datasets (not RDDs as less effective and expressive), join them together and select fields of your liking.
val table1 = Seq(("1", "a"), ("2", "b"), ("3", "c")).toDF("key", "value")
scala> table1.show
+---+-----+
|key|value|
+---+-----+
| 1| a|
| 2| b|
| 3| c|
+---+-----+
val table2 = sc.parallelize(
Array(Array("1", "2", "d"), Array("1", "3", "e"))).
toDF("a").
select($"a"(0) as "a0", $"a"(1) as "a1", $"a"(2) as "a2")
scala> table2.show
+---+---+---+
| a0| a1| a2|
+---+---+---+
| 1| 2| d|
| 1| 3| e|
+---+---+---+
scala> table2.join(table1, $"key" === $"a0").select($"value" as "a0", $"a1", $"a2").show
+---+---+---+
| a0| a1| a2|
+---+---+---+
| a| 2| d|
| a| 3| e|
+---+---+---+
Repeat for the other a columns and union together. While repeating the code, you'll notice the pattern that will make the code generic.
If so, would it be efficient using a huge dataset?
Yes (again). We're talking Spark here and a huge dataset is exactly why you chose Spark, isn't it?

Related

Spark: map columns of a dataframe to their ID of the distinct elements

I have the following dataframe of two columns of string type A and B:
val df = (
spark
.createDataFrame(
Seq(
("a1", "b1"),
("a1", "b2"),
("a1", "b2"),
("a2", "b3")
)
)
).toDF("A", "B")
I create maps between distinct elements of each columns and a set of integers
val mapColA = (
df
.select("A")
.distinct
.rdd
.zipWithIndex
.collectAsMap
)
val mapColB = (
df
.select("B")
.distinct
.rdd
.zipWithIndex
.collectAsMap
)
Now I want to create a new columns in the dataframe applying those maps to their correspondent columns. For one map only this would be
df.select("A").map(x=>mapColA.get(x)).show()
However I don't understand how to apply each map to their correspondent columns and create two new columns (e.g. with withColumn). The expected result would be
val result = (
spark
.createDataFrame(
Seq(
("a1", "b1", 1, 1),
("a1", "b2", 1, 2),
("a1", "b2", 1, 2),
("a2", "b3", 2, 3)
)
)
).toDF("A", "B", "idA", "idB")
Could you help me?
If I understood correctly, this can be achieved using dense_rank:
import org.apache.spark.sql.expressions.Window
val df2 = df.withColumn("idA", dense_rank().over(Window.orderBy("A")))
.withColumn("idB", dense_rank().over(Window.orderBy("B")))
df2.show
+---+---+---+---+
| A| B|idA|idB|
+---+---+---+---+
| a1| b1| 1| 1|
| a1| b2| 1| 2|
| a1| b2| 1| 2|
| a2| b3| 2| 3|
+---+---+---+---+
If you want to stick with your original code, you can make some modifications:
val mapColA = df.select("A").distinct().rdd.map(r=>r.getAs[String](0)).zipWithIndex.collectAsMap
val mapColB = df.select("B").distinct().rdd.map(r=>r.getAs[String](0)).zipWithIndex.collectAsMap
val df2 = df.map(r => (r.getAs[String](0), r.getAs[String](1), mapColA.get(r.getAs[String](0)), mapColB.get(r.getAs[String](1)))).toDF("A","B", "idA", "idB")
df2.show
+---+---+---+---+
| A| B|idA|idB|
+---+---+---+---+
| a1| b1| 1| 2|
| a1| b2| 1| 0|
| a1| b2| 1| 0|
| a2| b3| 0| 1|
+---+---+---+---+

Avoiding duplicate coulmns after nullSafeJoin scala spark

I have use case wherein I need to join nullable columns. I am doing the same like this :
def nullSafeJoin(leftDF: DataFrame, rightDF: DataFrame, joinOnColumns: Seq[String]) = {
val dataset1 = leftDF.alias("dataset1")
val dataset2 = rightDF.alias("dataset2")
val firstColumn = joinOnColumns.head
val colExpression: Column = (col(s"dataset1.$firstColumn").eqNullSafe(col(s"dataset2.$firstColumn")))
val fullExpr = joinOnColumns.tail.foldLeft(colExpression) {
(colExpression, p) => colExpression && (col(s"dataset1.$p").eqNullSafe(col(s"dataset2.$p")))
}
dataset1.join(dataset2, fullExpr)
}
The final joined dataset has duplicate columns. I have tried dropping the columns using the alias like this :
dataset1.join(dataset2, fullExpr).drop(s"dataset2.$firstColumn")
but it doesn't work.
I understand that instead of dropping we can do a select columns.
I am trying to have a generic code base so don't want to pass the list of columns to be selected to the function (In case of drop I would be having to just drop the list of joinOnColumns we have passed to the function)
Any pointers on how to solve this would be really helpful.
Thanks!
Edit : (Sample data )
leftDF :
+------------------+-----------+---------+---------+-------+
| A| B| C| D| status|
+------------------+-----------+---------+---------+-------+
| 14567| 37| 1| game|Enabled|
| 14567| BASE| 1| toy| Paused|
| 13478| null| 5| game|Enabled|
| 2001| BASE| 1| null| Paused|
| null| 37| 1| home|Enabled|
+------------------+-----------+---------+---------+-------+
rightDF :
+------------------+-----------+---------+
| A| B| C|
+------------------+-----------+---------+
| 140| 37| 1|
| 569| BASE| 1|
| 13478| null| 5|
| 2001| BASE| 1|
| null| 37| 1|
+------------------+-----------+---------+
Final Join (Required):
+------------------+-----------+---------+---------+-------+
| A| B| C| D| status|
+------------------+-----------+---------+---------+-------+
| 13478| null| 5| game|Enabled|
| 2001| BASE| 1| null| Paused|
| null| 37| 1| home|Enabled|
+------------------+-----------+---------+---------+-------+
Your final DataFrame has duplicate columns from both leftDF & rightDF, don't have identifier to check if that column is from leftDF or rightDF.
So I have renamed leftDF & rightDF columns. leftDF columns starts with left_[column_name] & rightDF columns starts with right_[column_name]
Hope below code will help you.
scala> :paste
// Entering paste mode (ctrl-D to finish)
val left = Seq(("14567", "37", "1", "game", "Enabled"), ("14567", "BASE", "1", "toy", "Paused"), ("13478", "null", "5", "game", "Enabled"), ("2001", "BASE", "1", "null", "Paused"), ("null", "37", "1", "home", "Enabled")).toDF("a", "b", "c", "d", "status")
val right = Seq(("140", "37", 1), ("569", "BASE", 1), ("13478", "null", 5), ("2001", "BASE", 1), ("null", "37", 1)).toDF("a", "b", "c")
import org.apache.spark.sql.DataFrame
def nullSafeJoin(leftDF: DataFrame, rightDF: DataFrame, joinOnColumns: Seq[String]):DataFrame = {
val leftRenamedDF = leftDF
.columns
.map(c => (c, s"left_${c}"))
.foldLeft(leftDF){ (df, c) =>
df.withColumnRenamed(c._1, c._2)
}
val rightRenamedDF = rightDF
.columns
.map(c => (c, s"right_${c}"))
.foldLeft(rightDF){(df, c) =>
df.withColumnRenamed(c._1, c._2)
}
val fullExpr = joinOnColumns
.tail
.foldLeft($"left_${joinOnColumns.head}".eqNullSafe($"right_${joinOnColumns.head}")){(cee, p) =>
cee && ($"left_${p}".eqNullSafe($"right_${p}"))
}
val finalColumns = joinOnColumns
.map(c => col(s"left_${c}").as(c)) ++ // Taking All columns from Join columns
leftDF.columns.diff(joinOnColumns).map(c => col(s"left_${c}").as(c)) ++ // Taking missing columns from leftDF
rightDF.columns.diff(joinOnColumns).map(c => col(s"right_${c}").as(c)) // Taking missing columns from rightDF
leftRenamedDF.join(rightRenamedDF, fullExpr).select(finalColumns: _*)
}
scala>
Final DataFrame result is :
scala> nullSafeJoin(left, right, Seq("a", "b", "c")).show(false)
// Exiting paste mode, now interpreting.
+-----+----+---+----+-------+
|a |b |c |d |status |
+-----+----+---+----+-------+
|13478|null|5 |game|Enabled|
|2001 |BASE|1 |null|Paused |
|null |37 |1 |home|Enabled|
+-----+----+---+----+-------+

Group by and find count before doing pivot spark

I have a dataframe like below
A B C D
foo one small 1
foo one large 2
foo one large 2
foo two small 3
I need to groupBy based on A and B pivot on column C, and sum column D
I am able to do this using
df.groupBy("A", "B").pivot("C").sum("D")
However I need also to find count after groupBy ,if I try something like
df.groupBy("A", "B").pivot("C").agg(sum("D"), count)
I get an output like
A B large small large_count small_count
Is there a way to get only one count after groupBy before doing pivot
On output try
output.withColumn("count", $"large_count"+$"small_count").show
You can drop the two count columns if you want to
To do it before pivot try
df.groupBy("A", "B").agg(count("C"))
Is this what you are expecting?.
val df = Seq(("foo", "one", "small", 1),
("foo", "one", "large", 2),
("foo", "one", "large", 2),
("foo", "two", "small", 3)).toDF("A","B","C","D")
scala> df.show
+---+---+-----+---+
| A| B| C| D|
+---+---+-----+---+
|foo|one|small| 1|
|foo|one|large| 2|
|foo|one|large| 2|
|foo|two|small| 3|
+---+---+-----+---+
scala> val df2 = df.groupBy('A,'B).pivot("C").sum("D")
df2: org.apache.spark.sql.DataFrame = [A: string, B: string ... 2 more fields]
scala> val df3 = df.groupBy('A as "A1",'B as "B1").agg(sum('D) as "sumd")
df3: org.apache.spark.sql.DataFrame = [A1: string, B1: string ... 1 more field]
scala> df3.join(df2,'A==='A1 and 'B==='B1,"inner").select("A","B","sumd","large","small").show
+---+---+----+-----+-----+
| A| B|sumd|large|small|
+---+---+----+-----+-----+
|foo|one| 5| 4| 1|
|foo|two| 3| null| 3|
+---+---+----+-----+-----+
scala>
This wont require a join. Is this what you are looking for ?
val df = Seq(("foo", "one", "small", 1),
("foo", "one", "large", 2),
("foo", "one", "large", 2),
("foo", "two", "small", 3)).toDF("A","B","C","D")
scala> df.show
+---+---+-----+---+
| A| B| C| D|
+---+---+-----+---+
|foo|one|small| 1|
|foo|one|large| 2|
|foo|one|large| 2|
|foo|two|small| 3|
+---+---+-----+---+
df.registerTempTable("dummy")
spark.sql("SELECT * FROM (SELECT A , B , C , sum(D) as D from dummy group by A,B,C grouping sets ((A,B,C) ,(A,B)) order by A nulls last , B nulls last , C nulls last) dummy pivot (first(D) for C in ('large' large ,'small' small , null total))").show
+---+---+-----+-----+-----+
| A| B|large|small|total|
+---+---+-----+-----+-----+
|foo|one| 4| 1| 5|
|foo|two| null| 3| 3|
+---+---+-----+-----+-----+

drop duplicate words in long string using scala

I am curious to learn how to drop duplicate words within strings that are contained in a dataframe column. I would like to accomplish it using scala.
By way of example, below you can find a dataframe I would like to transform.
dataframe:
val dataset1 = Seq(("66", "a,b,c,a", "4"), ("67", "a,f,g,t", "0"), ("70", "b,b,b,d", "4")).toDF("KEY1", "KEY2", "ID")
+----+-------+---+
|KEY1| KEY2| ID|
+----+-------+---+
| 66|a,b,c,a| 4|
| 67|a,f,g,t| 0|
| 70|b,b,b,d| 4|
+----+-------+---+
result:
+----+----------+---+
|KEY1| KEY2| ID|
+----+----------+---+
| 66| a, b, c| 4|
| 67|a, f, g, t| 0|
| 70| b, d| 4|
+----+----------+---+
Using pyspark I have used the following code to get the above result. I could not rewrite such a code via scala. Do you have any suggestion? Thanking you in advance I wish you a nice day.
pyspark code:
# dataframe
l = [("66", "a,b,c,a", "4"),("67", "a,f,g,t", "0"),("70", "b,b,b,d", "4")]
#spark.createDataFrame(l).show()
df1 = spark.createDataFrame(l, ['KEY1', 'KEY2','ID'])
# function
import re
import numpy as np
# drop duplicates in a row
def drop_duplicates(row):
# split string by ', ', drop duplicates and join back
words = re.split(',',row)
return ', '.join(np.unique(words))
# drop duplicates
from pyspark.sql.functions import udf
drop_duplicates_udf = udf(drop_duplicates)
dataset2 = df1.withColumn('KEY2', drop_duplicates_udf(df1.KEY2))
dataset2.show()
Dataframe solution
scala> val df = Seq(("66", "a,b,c,a", "4"), ("67", "a,f,g,t", "0"), ("70", "b,b,b,d", "4")).toDF("KEY1", "KEY2", "ID")
df: org.apache.spark.sql.DataFrame = [KEY1: string, KEY2: string ... 1 more field]
scala> val distinct :String => String = _.split(",").toSet.mkString(",")
distinct: String => String = <function1>
scala> val distinct_id = udf (distinct)
distinct_id: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))
scala> df.select('key1,distinct_id('key2).as("distinct"),'id).show
+----+--------+---+
|key1|distinct| id|
+----+--------+---+
| 66| a,b,c| 4|
| 67| a,f,g,t| 0|
| 70| b,d| 4|
+----+--------+---+
scala>
There could be a more optimized solution but this could help you.
val rdd2 = dataset1.rdd.map(x => x(1).toString.split(",").distinct.mkString(", "))
// and then transform it to dataset
// or
val distinctUDF = spark.udf.register("distinctUDF", (s: String) => s.split(",").distinct.mkString(", "))
dataset1.createTempView("dataset1")
spark.sql("Select KEY1, distinctUDF(KEY2), ID from dataset1").show
import org.apache.spark.sql._
val dfUpdated = dataset1.rdd.map{
case Row(x: String, y: String,z:String) => (x,y.split(",").distinct.mkString(", "),z)
}.toDF(dataset1.columns:_*)
In spark-shell:
scala> val dataset1 = Seq(("66", "a,b,c,a", "4"), ("67", "a,f,g,t", "0"), ("70", "b,b,b,d", "4")).toDF("KEY1", "KEY2", "ID")
dataset1: org.apache.spark.sql.DataFrame = [KEY1: string, KEY2: string ... 1 more field]
scala> dataset1.show
+----+-------+---+
|KEY1| KEY2| ID|
+----+-------+---+
| 66|a,b,c,a| 4|
| 67|a,f,g,t| 0|
| 70|b,b,b,d| 4|
+----+-------+---+
scala> val dfUpdated = dataset1.rdd.map{
case Row(x: String, y: String,z:String) => (x,y.split(",").distinct.mkString(", "),z)
}.toDF(dataset1.columns:_*)
dfUpdated: org.apache.spark.sql.DataFrame = [KEY1: string, KEY2: string ... 1 more field]
scala> dfUpdated.show
+----+----------+---+
|KEY1| KEY2| ID|
+----+----------+---+
| 66| a, b, c| 4|
| 67|a, f, g, t| 0|
| 70| b, d| 4|
+----+----------+---+

Spark ML StringIndexer and OneHotEncoder - empty strings error [duplicate]

I am importing a CSV file (using spark-csv) into a DataFrame which has empty String values. When applied the OneHotEncoder, the application crashes with error requirement failed: Cannot have an empty string for name.. Is there a way I can get around this?
I could reproduce the error in the example provided on Spark ml page:
val df = sqlContext.createDataFrame(Seq(
(0, "a"),
(1, "b"),
(2, "c"),
(3, ""), //<- original example has "a" here
(4, "a"),
(5, "c")
)).toDF("id", "category")
val indexer = new StringIndexer()
.setInputCol("category")
.setOutputCol("categoryIndex")
.fit(df)
val indexed = indexer.transform(df)
val encoder = new OneHotEncoder()
.setInputCol("categoryIndex")
.setOutputCol("categoryVec")
val encoded = encoder.transform(indexed)
encoded.show()
It is annoying since missing/empty values is a highly generic case.
Thanks in advance,
Nikhil
Since the OneHotEncoder/OneHotEncoderEstimator does not accept empty string for name, or you'll get the following error :
java.lang.IllegalArgumentException: requirement failed: Cannot have an empty string for name.
at scala.Predef$.require(Predef.scala:233)
at org.apache.spark.ml.attribute.Attribute$$anonfun$5.apply(attributes.scala:33)
at org.apache.spark.ml.attribute.Attribute$$anonfun$5.apply(attributes.scala:32)
[...]
This is how I will do it : (There is other way to do it, rf. #Anthony 's answer)
I'll create an UDF to process the empty category :
import org.apache.spark.sql.functions._
def processMissingCategory = udf[String, String] { s => if (s == "") "NA" else s }
Then, I'll apply the UDF on the column :
val df = sqlContext.createDataFrame(Seq(
(0, "a"),
(1, "b"),
(2, "c"),
(3, ""), //<- original example has "a" here
(4, "a"),
(5, "c")
)).toDF("id", "category")
.withColumn("category",processMissingCategory('category))
df.show
// +---+--------+
// | id|category|
// +---+--------+
// | 0| a|
// | 1| b|
// | 2| c|
// | 3| NA|
// | 4| a|
// | 5| c|
// +---+--------+
Now, you can go back to your transformations
val indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIndex").fit(df)
val indexed = indexer.transform(df)
indexed.show
// +---+--------+-------------+
// | id|category|categoryIndex|
// +---+--------+-------------+
// | 0| a| 0.0|
// | 1| b| 2.0|
// | 2| c| 1.0|
// | 3| NA| 3.0|
// | 4| a| 0.0|
// | 5| c| 1.0|
// +---+--------+-------------+
// Spark <2.3
// val encoder = new OneHotEncoder().setInputCol("categoryIndex").setOutputCol("categoryVec")
// Spark +2.3
val encoder = new OneHotEncoderEstimator().setInputCols(Array("categoryIndex")).setOutputCols(Array("category2Vec"))
val encoded = encoder.transform(indexed)
encoded.show
// +---+--------+-------------+-------------+
// | id|category|categoryIndex| categoryVec|
// +---+--------+-------------+-------------+
// | 0| a| 0.0|(3,[0],[1.0])|
// | 1| b| 2.0|(3,[2],[1.0])|
// | 2| c| 1.0|(3,[1],[1.0])|
// | 3| NA| 3.0| (3,[],[])|
// | 4| a| 0.0|(3,[0],[1.0])|
// | 5| c| 1.0|(3,[1],[1.0])|
// +---+--------+-------------+-------------+
EDIT:
#Anthony 's solution in Scala :
df.na.replace("category", Map( "" -> "NA")).show
// +---+--------+
// | id|category|
// +---+--------+
// | 0| a|
// | 1| b|
// | 2| c|
// | 3| NA|
// | 4| a|
// | 5| c|
// +---+--------+
I hope this helps!
Yep, it's a little thorny but maybe you can just replace the empty string with something sure to be different than other values. NOTE that I am using pyspark DataFrameNaFunctions API but Scala's should be similar.
df = sqlContext.createDataFrame([(0,"a"), (1,'b'), (2, 'c'), (3,''), (4,'a'), (5, 'c')], ['id', 'category'])
df = df.na.replace('', 'EMPTY', 'category')
df.show()
+---+--------+
| id|category|
+---+--------+
| 0| a|
| 1| b|
| 2| c|
| 3| EMPTY|
| 4| a|
| 5| c|
+---+--------+
if the column contains null the OneHotEncoder fails with a NullPointerException.
therefore i extended the udf to tanslate null values as well
object OneHotEncoderExample {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("OneHotEncoderExample Application").setMaster("local[2]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
// $example on$
val df1 = sqlContext.createDataFrame(Seq(
(0.0, "a"),
(1.0, "b"),
(2.0, "c"),
(3.0, ""),
(4.0, null),
(5.0, "c")
)).toDF("id", "category")
import org.apache.spark.sql.functions.udf
def emptyValueSubstitution = udf[String, String] {
case "" => "NA"
case null => "null"
case value => value
}
val df = df1.withColumn("category", emptyValueSubstitution( df1("category")) )
val indexer = new StringIndexer()
.setInputCol("category")
.setOutputCol("categoryIndex")
.fit(df)
val indexed = indexer.transform(df)
indexed.show()
val encoder = new OneHotEncoder()
.setInputCol("categoryIndex")
.setOutputCol("categoryVec")
.setDropLast(false)
val encoded = encoder.transform(indexed)
encoded.show()
// $example off$
sc.stop()
}
}