How to replace values in RDD 1 per keys in RDD 2?

How to replace values in RDD 1 per keys in RDD 2? - scala

Here are two RDDs.
Table1-pair(key,value)
val table1 = sc.parallelize(Seq(("1", "a"), ("2", "b"), ("3", "c")))
//RDD[(String, String)]
Table2-Arrays
val table2 = sc.parallelize(Array(Array("1", "2", "d"), Array("1", "3", "e")))
//RDD[Array[String]]
I am trying to change elements of table2 such as "1" to "a" using the keys and values in table1. My expect result is as follows:
RDD[Array[String]] = (Array(Array("a", "b", "d"), Array("a", "c", "e")))
Is there a way to make this possible?
If so, would it be efficient using a huge dataset?

I think we can do it better with dataframes while avoiding joins as it might involve shuffling of data.
val table1 = spark.sparkContext.parallelize(Seq(("1", "a"), ("2", "b"), ("3", "c"))).collectAsMap()
//Brodcasting so that mapping is available to all nodes
val brodcastedMapping = spark.sparkContext.broadcast(table1)
val table2 = spark.sparkContext.parallelize(Array(Array("1", "2", "d"), Array("1", "3", "e")))
def changeMapping(value: String): String = {
brodcastedMapping.value.getOrElse(value, value)
}
val changeMappingUDF = udf(changeMapping(_:String))
table2.toDF.withColumn("exploded", explode($"value"))
.withColumn("new", changeMappingUDF($"exploded"))
.groupBy("value")
.agg(collect_list("new").as("mappedCol"))
.select("mappedCol").rdd.map(r => r.toSeq.toArray.map(_.toString))
Let me know if it suits your requirement otherwise I can modify it as needed.

You can do that in Dataset
package dataframe
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.{SQLContext, SparkSession}
import org.apache.spark.{SparkConf, SparkContext}
/**
* #author vaquar.khan#gmail.com
*/
object Test {
case class table1Class(key: String, value: String)
case class table2Class(key: String, value: String, value1: String)
def main(args: Array[String]) {
val spark =
SparkSession.builder()
.appName("DataFrame-Basic")
.master("local[4]")
.getOrCreate()
import spark.implicits._
//
val table1 = Seq(
table1Class("1", "a"), table1Class("2", "b"), table1Class("3", "c"))
val df1 = spark.sparkContext.parallelize(table1, 4).toDF()
df1.show()
val table2 = Seq(
table2Class("1", "2", "d"), table2Class("1", "3", "e"))
val df2 = spark.sparkContext.parallelize(table2, 4).toDF()
df2.show()
//
df1.createOrReplaceTempView("A")
df2.createOrReplaceTempView("B")
spark.sql("select d1.key,d1.value,d2.value1 from A d1 inner join B d2 on d1.key = d2.key").show()
//TODO
/* need to fix query
spark.sql( "select * from ( "+ //B1.value,B1.value1,A.value
" select A.value,B.value,B.value1 "+
" from B "+
" left join A "+
" on B.key = A.key ) B2 "+
" left join A " +
" on B2.value = A.key" ).show()
*/
}
}
Results :
+---+-----+
|key|value|
+---+-----+
| 1| a|
| 2| b|
| 3| c|
+---+-----+
+---+-----+------+
|key|value|value1|
+---+-----+------+
| 1| 2| d|
| 1| 3| e|
+---+-----+------+
[Stage 15:=====================================> (68 + 6) / 100]
[Stage 15:============================================> (80 + 4) / 100]
+-----+-----+------+
|value|value|value1|
+-----+-----+------+
| 1| a| d|
| 1| a| e|
+-----+-----+------+

Is there a way to make this possible?
Yes. Use Datasets (not RDDs as less effective and expressive), join them together and select fields of your liking.
val table1 = Seq(("1", "a"), ("2", "b"), ("3", "c")).toDF("key", "value")
scala> table1.show
+---+-----+
|key|value|
+---+-----+
| 1| a|
| 2| b|
| 3| c|
+---+-----+
val table2 = sc.parallelize(
Array(Array("1", "2", "d"), Array("1", "3", "e"))).
toDF("a").
select($"a"(0) as "a0", $"a"(1) as "a1", $"a"(2) as "a2")
scala> table2.show
+---+---+---+
| a0| a1| a2|
+---+---+---+
| 1| 2| d|
| 1| 3| e|
+---+---+---+
scala> table2.join(table1, $"key" === $"a0").select($"value" as "a0", $"a1", $"a2").show
+---+---+---+
| a0| a1| a2|
+---+---+---+
| a| 2| d|
| a| 3| e|
+---+---+---+
Repeat for the other a columns and union together. While repeating the code, you'll notice the pattern that will make the code generic.
If so, would it be efficient using a huge dataset?
Yes (again). We're talking Spark here and a huge dataset is exactly why you chose Spark, isn't it?

Related

Spark: map columns of a dataframe to their ID of the distinct elements

I have the following dataframe of two columns of string type A and B:
val df = (
spark
.createDataFrame(
Seq(
("a1", "b1"),
("a1", "b2"),
("a1", "b2"),
("a2", "b3")
)
)
).toDF("A", "B")
I create maps between distinct elements of each columns and a set of integers
val mapColA = (
df
.select("A")
.distinct
.rdd
.zipWithIndex
.collectAsMap
)
val mapColB = (
df
.select("B")
.distinct
.rdd
.zipWithIndex
.collectAsMap
)
Now I want to create a new columns in the dataframe applying those maps to their correspondent columns. For one map only this would be
df.select("A").map(x=>mapColA.get(x)).show()
However I don't understand how to apply each map to their correspondent columns and create two new columns (e.g. with withColumn). The expected result would be
val result = (
spark
.createDataFrame(
Seq(
("a1", "b1", 1, 1),
("a1", "b2", 1, 2),
("a1", "b2", 1, 2),
("a2", "b3", 2, 3)
)
)
).toDF("A", "B", "idA", "idB")
Could you help me?

If I understood correctly, this can be achieved using dense_rank:
import org.apache.spark.sql.expressions.Window
val df2 = df.withColumn("idA", dense_rank().over(Window.orderBy("A")))
.withColumn("idB", dense_rank().over(Window.orderBy("B")))
df2.show
+---+---+---+---+
| A| B|idA|idB|
+---+---+---+---+
| a1| b1| 1| 1|
| a1| b2| 1| 2|
| a1| b2| 1| 2|
| a2| b3| 2| 3|
+---+---+---+---+
If you want to stick with your original code, you can make some modifications:
val mapColA = df.select("A").distinct().rdd.map(r=>r.getAs[String](0)).zipWithIndex.collectAsMap
val mapColB = df.select("B").distinct().rdd.map(r=>r.getAs[String](0)).zipWithIndex.collectAsMap
val df2 = df.map(r => (r.getAs[String](0), r.getAs[String](1), mapColA.get(r.getAs[String](0)), mapColB.get(r.getAs[String](1)))).toDF("A","B", "idA", "idB")
df2.show
+---+---+---+---+
| A| B|idA|idB|
+---+---+---+---+
| a1| b1| 1| 2|
| a1| b2| 1| 0|
| a1| b2| 1| 0|
| a2| b3| 0| 1|
+---+---+---+---+

Avoiding duplicate coulmns after nullSafeJoin scala spark

I have use case wherein I need to join nullable columns. I am doing the same like this :
def nullSafeJoin(leftDF: DataFrame, rightDF: DataFrame, joinOnColumns: Seq[String]) = {
val dataset1 = leftDF.alias("dataset1")
val dataset2 = rightDF.alias("dataset2")
val firstColumn = joinOnColumns.head
val colExpression: Column = (col(s"dataset1.$firstColumn").eqNullSafe(col(s"dataset2.$firstColumn")))
val fullExpr = joinOnColumns.tail.foldLeft(colExpression) {
(colExpression, p) => colExpression && (col(s"dataset1.$p").eqNullSafe(col(s"dataset2.$p")))
}
dataset1.join(dataset2, fullExpr)
}
The final joined dataset has duplicate columns. I have tried dropping the columns using the alias like this :
dataset1.join(dataset2, fullExpr).drop(s"dataset2.$firstColumn")
but it doesn't work.
I understand that instead of dropping we can do a select columns.
I am trying to have a generic code base so don't want to pass the list of columns to be selected to the function (In case of drop I would be having to just drop the list of joinOnColumns we have passed to the function)
Any pointers on how to solve this would be really helpful.
Thanks!
Edit : (Sample data )
leftDF :
+------------------+-----------+---------+---------+-------+
| A| B| C| D| status|
+------------------+-----------+---------+---------+-------+
| 14567| 37| 1| game|Enabled|
| 14567| BASE| 1| toy| Paused|
| 13478| null| 5| game|Enabled|
| 2001| BASE| 1| null| Paused|
| null| 37| 1| home|Enabled|
+------------------+-----------+---------+---------+-------+
rightDF :
+------------------+-----------+---------+
| A| B| C|
+------------------+-----------+---------+
| 140| 37| 1|
| 569| BASE| 1|
| 13478| null| 5|
| 2001| BASE| 1|
| null| 37| 1|
+------------------+-----------+---------+
Final Join (Required):
+------------------+-----------+---------+---------+-------+
| A| B| C| D| status|
+------------------+-----------+---------+---------+-------+
| 13478| null| 5| game|Enabled|
| 2001| BASE| 1| null| Paused|
| null| 37| 1| home|Enabled|
+------------------+-----------+---------+---------+-------+

Your final DataFrame has duplicate columns from both leftDF & rightDF, don't have identifier to check if that column is from leftDF or rightDF.
So I have renamed leftDF & rightDF columns. leftDF columns starts with left_[column_name] & rightDF columns starts with right_[column_name]
Hope below code will help you.
scala> :paste
// Entering paste mode (ctrl-D to finish)
val left = Seq(("14567", "37", "1", "game", "Enabled"), ("14567", "BASE", "1", "toy", "Paused"), ("13478", "null", "5", "game", "Enabled"), ("2001", "BASE", "1", "null", "Paused"), ("null", "37", "1", "home", "Enabled")).toDF("a", "b", "c", "d", "status")
val right = Seq(("140", "37", 1), ("569", "BASE", 1), ("13478", "null", 5), ("2001", "BASE", 1), ("null", "37", 1)).toDF("a", "b", "c")
import org.apache.spark.sql.DataFrame
def nullSafeJoin(leftDF: DataFrame, rightDF: DataFrame, joinOnColumns: Seq[String]):DataFrame = {
val leftRenamedDF = leftDF
.columns
.map(c => (c, s"left_${c}"))
.foldLeft(leftDF){ (df, c) =>
df.withColumnRenamed(c._1, c._2)
}
val rightRenamedDF = rightDF
.columns
.map(c => (c, s"right_${c}"))
.foldLeft(rightDF){(df, c) =>
df.withColumnRenamed(c._1, c._2)
}
val fullExpr = joinOnColumns
.tail
.foldLeft($"left_${joinOnColumns.head}".eqNullSafe($"right_${joinOnColumns.head}")){(cee, p) =>
cee && ($"left_${p}".eqNullSafe($"right_${p}"))
}
val finalColumns = joinOnColumns
.map(c => col(s"left_${c}").as(c)) ++ // Taking All columns from Join columns
leftDF.columns.diff(joinOnColumns).map(c => col(s"left_${c}").as(c)) ++ // Taking missing columns from leftDF
rightDF.columns.diff(joinOnColumns).map(c => col(s"right_${c}").as(c)) // Taking missing columns from rightDF
leftRenamedDF.join(rightRenamedDF, fullExpr).select(finalColumns: _*)
}
scala>
Final DataFrame result is :
scala> nullSafeJoin(left, right, Seq("a", "b", "c")).show(false)
// Exiting paste mode, now interpreting.
+-----+----+---+----+-------+
|a |b |c |d |status |
+-----+----+---+----+-------+
|13478|null|5 |game|Enabled|
|2001 |BASE|1 |null|Paused |
|null |37 |1 |home|Enabled|
+-----+----+---+----+-------+

Group by and find count before doing pivot spark

I have a dataframe like below
A B C D
foo one small 1
foo one large 2
foo one large 2
foo two small 3
I need to groupBy based on A and B pivot on column C, and sum column D
I am able to do this using
df.groupBy("A", "B").pivot("C").sum("D")
However I need also to find count after groupBy ,if I try something like
df.groupBy("A", "B").pivot("C").agg(sum("D"), count)
I get an output like
A B large small large_count small_count
Is there a way to get only one count after groupBy before doing pivot

On output try
output.withColumn("count", $"large_count"+$"small_count").show
You can drop the two count columns if you want to
To do it before pivot try
df.groupBy("A", "B").agg(count("C"))

Is this what you are expecting?.
val df = Seq(("foo", "one", "small", 1),
("foo", "one", "large", 2),
("foo", "one", "large", 2),
("foo", "two", "small", 3)).toDF("A","B","C","D")
scala> df.show
+---+---+-----+---+
| A| B| C| D|
+---+---+-----+---+
|foo|one|small| 1|
|foo|one|large| 2|
|foo|one|large| 2|
|foo|two|small| 3|
+---+---+-----+---+
scala> val df2 = df.groupBy('A,'B).pivot("C").sum("D")
df2: org.apache.spark.sql.DataFrame = [A: string, B: string ... 2 more fields]
scala> val df3 = df.groupBy('A as "A1",'B as "B1").agg(sum('D) as "sumd")
df3: org.apache.spark.sql.DataFrame = [A1: string, B1: string ... 1 more field]
scala> df3.join(df2,'A==='A1 and 'B==='B1,"inner").select("A","B","sumd","large","small").show
+---+---+----+-----+-----+
| A| B|sumd|large|small|
+---+---+----+-----+-----+
|foo|one| 5| 4| 1|
|foo|two| 3| null| 3|
+---+---+----+-----+-----+
scala>

This wont require a join. Is this what you are looking for ?
val df = Seq(("foo", "one", "small", 1),
("foo", "one", "large", 2),
("foo", "one", "large", 2),
("foo", "two", "small", 3)).toDF("A","B","C","D")
scala> df.show
+---+---+-----+---+
| A| B| C| D|
+---+---+-----+---+
|foo|one|small| 1|
|foo|one|large| 2|
|foo|one|large| 2|
|foo|two|small| 3|
+---+---+-----+---+
df.registerTempTable("dummy")
spark.sql("SELECT * FROM (SELECT A , B , C , sum(D) as D from dummy group by A,B,C grouping sets ((A,B,C) ,(A,B)) order by A nulls last , B nulls last , C nulls last) dummy pivot (first(D) for C in ('large' large ,'small' small , null total))").show
+---+---+-----+-----+-----+
| A| B|large|small|total|
+---+---+-----+-----+-----+
|foo|one| 4| 1| 5|
|foo|two| null| 3| 3|
+---+---+-----+-----+-----+

drop duplicate words in long string using scala

I am curious to learn how to drop duplicate words within strings that are contained in a dataframe column. I would like to accomplish it using scala.
By way of example, below you can find a dataframe I would like to transform.
dataframe:
val dataset1 = Seq(("66", "a,b,c,a", "4"), ("67", "a,f,g,t", "0"), ("70", "b,b,b,d", "4")).toDF("KEY1", "KEY2", "ID")
+----+-------+---+
|KEY1| KEY2| ID|
+----+-------+---+
| 66|a,b,c,a| 4|
| 67|a,f,g,t| 0|
| 70|b,b,b,d| 4|
+----+-------+---+
result:
+----+----------+---+
|KEY1| KEY2| ID|
+----+----------+---+
| 66| a, b, c| 4|
| 67|a, f, g, t| 0|
| 70| b, d| 4|
+----+----------+---+
Using pyspark I have used the following code to get the above result. I could not rewrite such a code via scala. Do you have any suggestion? Thanking you in advance I wish you a nice day.
pyspark code:
# dataframe
l = [("66", "a,b,c,a", "4"),("67", "a,f,g,t", "0"),("70", "b,b,b,d", "4")]
#spark.createDataFrame(l).show()
df1 = spark.createDataFrame(l, ['KEY1', 'KEY2','ID'])
# function
import re
import numpy as np
# drop duplicates in a row
def drop_duplicates(row):
# split string by ', ', drop duplicates and join back
words = re.split(',',row)
return ', '.join(np.unique(words))
# drop duplicates
from pyspark.sql.functions import udf
drop_duplicates_udf = udf(drop_duplicates)
dataset2 = df1.withColumn('KEY2', drop_duplicates_udf(df1.KEY2))
dataset2.show()

Dataframe solution
scala> val df = Seq(("66", "a,b,c,a", "4"), ("67", "a,f,g,t", "0"), ("70", "b,b,b,d", "4")).toDF("KEY1", "KEY2", "ID")
df: org.apache.spark.sql.DataFrame = [KEY1: string, KEY2: string ... 1 more field]
scala> val distinct :String => String = _.split(",").toSet.mkString(",")
distinct: String => String = <function1>
scala> val distinct_id = udf (distinct)
distinct_id: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))
scala> df.select('key1,distinct_id('key2).as("distinct"),'id).show
+----+--------+---+
|key1|distinct| id|
+----+--------+---+
| 66| a,b,c| 4|
| 67| a,f,g,t| 0|
| 70| b,d| 4|
+----+--------+---+
scala>

There could be a more optimized solution but this could help you.
val rdd2 = dataset1.rdd.map(x => x(1).toString.split(",").distinct.mkString(", "))
// and then transform it to dataset
// or
val distinctUDF = spark.udf.register("distinctUDF", (s: String) => s.split(",").distinct.mkString(", "))
dataset1.createTempView("dataset1")
spark.sql("Select KEY1, distinctUDF(KEY2), ID from dataset1").show

import org.apache.spark.sql._
val dfUpdated = dataset1.rdd.map{
case Row(x: String, y: String,z:String) => (x,y.split(",").distinct.mkString(", "),z)
}.toDF(dataset1.columns:_*)
In spark-shell:
scala> val dataset1 = Seq(("66", "a,b,c,a", "4"), ("67", "a,f,g,t", "0"), ("70", "b,b,b,d", "4")).toDF("KEY1", "KEY2", "ID")
dataset1: org.apache.spark.sql.DataFrame = [KEY1: string, KEY2: string ... 1 more field]
scala> dataset1.show
+----+-------+---+
|KEY1| KEY2| ID|
+----+-------+---+
| 66|a,b,c,a| 4|
| 67|a,f,g,t| 0|
| 70|b,b,b,d| 4|
+----+-------+---+
scala> val dfUpdated = dataset1.rdd.map{
case Row(x: String, y: String,z:String) => (x,y.split(",").distinct.mkString(", "),z)
}.toDF(dataset1.columns:_*)
dfUpdated: org.apache.spark.sql.DataFrame = [KEY1: string, KEY2: string ... 1 more field]
scala> dfUpdated.show
+----+----------+---+
|KEY1| KEY2| ID|
+----+----------+---+
| 66| a, b, c| 4|
| 67|a, f, g, t| 0|
| 70| b, d| 4|
+----+----------+---+

Spark ML StringIndexer and OneHotEncoder - empty strings error [duplicate]

I am importing a CSV file (using spark-csv) into a DataFrame which has empty String values. When applied the OneHotEncoder, the application crashes with error requirement failed: Cannot have an empty string for name.. Is there a way I can get around this?
I could reproduce the error in the example provided on Spark ml page:
val df = sqlContext.createDataFrame(Seq(
(0, "a"),
(1, "b"),
(2, "c"),
(3, ""), //<- original example has "a" here
(4, "a"),
(5, "c")
)).toDF("id", "category")
val indexer = new StringIndexer()
.setInputCol("category")
.setOutputCol("categoryIndex")
.fit(df)
val indexed = indexer.transform(df)
val encoder = new OneHotEncoder()
.setInputCol("categoryIndex")
.setOutputCol("categoryVec")
val encoded = encoder.transform(indexed)
encoded.show()
It is annoying since missing/empty values is a highly generic case.
Thanks in advance,
Nikhil

Since the OneHotEncoder/OneHotEncoderEstimator does not accept empty string for name, or you'll get the following error :
java.lang.IllegalArgumentException: requirement failed: Cannot have an empty string for name.
at scala.Predef$.require(Predef.scala:233)
at org.apache.spark.ml.attribute.Attribute$$anonfun$5.apply(attributes.scala:33)
at org.apache.spark.ml.attribute.Attribute$$anonfun$5.apply(attributes.scala:32)
[...]
This is how I will do it : (There is other way to do it, rf. #Anthony 's answer)
I'll create an UDF to process the empty category :
import org.apache.spark.sql.functions._
def processMissingCategory = udf[String, String] { s => if (s == "") "NA" else s }
Then, I'll apply the UDF on the column :
val df = sqlContext.createDataFrame(Seq(
(0, "a"),
(1, "b"),
(2, "c"),
(3, ""), //<- original example has "a" here
(4, "a"),
(5, "c")
)).toDF("id", "category")
.withColumn("category",processMissingCategory('category))
df.show
// +---+--------+
// | id|category|
// +---+--------+
// | 0| a|
// | 1| b|
// | 2| c|
// | 3| NA|
// | 4| a|
// | 5| c|
// +---+--------+
Now, you can go back to your transformations
val indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIndex").fit(df)
val indexed = indexer.transform(df)
indexed.show
// +---+--------+-------------+
// | id|category|categoryIndex|
// +---+--------+-------------+
// | 0| a| 0.0|
// | 1| b| 2.0|
// | 2| c| 1.0|
// | 3| NA| 3.0|
// | 4| a| 0.0|
// | 5| c| 1.0|
// +---+--------+-------------+
// Spark <2.3
// val encoder = new OneHotEncoder().setInputCol("categoryIndex").setOutputCol("categoryVec")
// Spark +2.3
val encoder = new OneHotEncoderEstimator().setInputCols(Array("categoryIndex")).setOutputCols(Array("category2Vec"))
val encoded = encoder.transform(indexed)
encoded.show
// +---+--------+-------------+-------------+
// | id|category|categoryIndex| categoryVec|
// +---+--------+-------------+-------------+
// | 0| a| 0.0|(3,[0],[1.0])|
// | 1| b| 2.0|(3,[2],[1.0])|
// | 2| c| 1.0|(3,[1],[1.0])|
// | 3| NA| 3.0| (3,[],[])|
// | 4| a| 0.0|(3,[0],[1.0])|
// | 5| c| 1.0|(3,[1],[1.0])|
// +---+--------+-------------+-------------+
EDIT:
#Anthony 's solution in Scala :
df.na.replace("category", Map( "" -> "NA")).show
// +---+--------+
// | id|category|
// +---+--------+
// | 0| a|
// | 1| b|
// | 2| c|
// | 3| NA|
// | 4| a|
// | 5| c|
// +---+--------+
I hope this helps!

Yep, it's a little thorny but maybe you can just replace the empty string with something sure to be different than other values. NOTE that I am using pyspark DataFrameNaFunctions API but Scala's should be similar.
df = sqlContext.createDataFrame([(0,"a"), (1,'b'), (2, 'c'), (3,''), (4,'a'), (5, 'c')], ['id', 'category'])
df = df.na.replace('', 'EMPTY', 'category')
df.show()
+---+--------+
| id|category|
+---+--------+
| 0| a|
| 1| b|
| 2| c|
| 3| EMPTY|
| 4| a|
| 5| c|
+---+--------+

if the column contains null the OneHotEncoder fails with a NullPointerException.
therefore i extended the udf to tanslate null values as well
object OneHotEncoderExample {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("OneHotEncoderExample Application").setMaster("local[2]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
// $example on$
val df1 = sqlContext.createDataFrame(Seq(
(0.0, "a"),
(1.0, "b"),
(2.0, "c"),
(3.0, ""),
(4.0, null),
(5.0, "c")
)).toDF("id", "category")
import org.apache.spark.sql.functions.udf
def emptyValueSubstitution = udf[String, String] {
case "" => "NA"
case null => "null"
case value => value
}
val df = df1.withColumn("category", emptyValueSubstitution( df1("category")) )
val indexer = new StringIndexer()
.setInputCol("category")
.setOutputCol("categoryIndex")
.fit(df)
val indexed = indexer.transform(df)
indexed.show()
val encoder = new OneHotEncoder()
.setInputCol("categoryIndex")
.setOutputCol("categoryVec")
.setDropLast(false)
val encoded = encoder.transform(indexed)
encoded.show()
// $example off$
sc.stop()
}
}

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

How to replace values in RDD 1 per keys in RDD 2? - scala

Related

Spark: map columns of a dataframe to their ID of the distinct elements

Avoiding duplicate coulmns after nullSafeJoin scala spark

Group by and find count before doing pivot spark

drop duplicate words in long string using scala

Spark ML StringIndexer and OneHotEncoder - empty strings error [duplicate]

Categories

Resources