In my Scala program, I am trying to combine the results from multiple levels of groupBy.
The dataset that I am using is quite big. As a small sample, I have a dataframe that looks like this:
+---+---+----+-----+-----+
| F| L| Loy|Email|State|
+---+---+----+-----+-----+
| f1| l1|loy1| null| s1|
| f1| l1|loy1| e1| s1|
| f2| l2|loy2| e2| s2|
| f2| l2|loy2| e3| null|
| f1| l1|null| e1| s3|
+---+---+----+-----+-----+
For the first-level groupBy I use the following script to obtain the result grouped on the (F, L, Loy) columns:
df.groupBy("F", "L", "Loy").agg(collect_set($"Email").alias("Email"), collect_set($"State").alias("State")).show
The result is like this:
+---+---+----+--------+-----+
| F| L| Loy| Email|State|
+---+---+----+--------+-----+
| f1| l1|null| [e1]| [s3]|
| f2| l2|loy2|[e2, e3]| [s2]|
| f1| l1|loy1| [e1]| [s1]|
+---+---+----+--------+-----+
The problem I am dealing with is how to perform the second-level groupBy, which groups on (F, L, Email) and takes F and L as a String while the Email column is an Array[String]. This groupBy should return a result as follows:
+---+---+----+--------+---------+
| F| L| Loy| Email| State|
+---+---+----+--------+---------+
| f1| l1|loy1| [e1]| [s3, s1]|
| f2| l2|loy2|[e2, e3]| [s2]|
+---+---+----+--------+---------+
The main goal is to reduce the number of entries as much as possible by applying groupBy at different levels. I am quite new to Scala and any help would be appreciated :)
Just use concat_ws() with an empty separator, which collapses each single-element State array back into a plain string, and then collect_set gives you an array of states again. Check this out.
scala> val df = Seq( ("f1","l1","loy1",null,"s1"),("f1","l1","loy1","e1","s1"),("f2","l2","loy2","e2","s2"),("f2","l2","loy2","e3",null),("f1","l1",null,"e1","s3")).toDF("F","L","loy","email","state")
df: org.apache.spark.sql.DataFrame = [F: string, L: string ... 3 more fields]
scala> df.show(false)
+---+---+----+-----+-----+
|F |L |loy |email|state|
+---+---+----+-----+-----+
|f1 |l1 |loy1|null |s1 |
|f1 |l1 |loy1|e1 |s1 |
|f2 |l2 |loy2|e2 |s2 |
|f2 |l2 |loy2|e3 |null |
|f1 |l1 |null|e1 |s3 |
+---+---+----+-----+-----+
scala> val df2 = df.groupBy("F", "L", "Loy").agg(collect_set($"Email").alias("Email"), collect_set($"State").alias("State"))
df2: org.apache.spark.sql.DataFrame = [F: string, L: string ... 3 more fields]
scala> df2.show(false)
+---+---+----+--------+-----+
|F |L |Loy |Email |State|
+---+---+----+--------+-----+
|f1 |l1 |null|[e1] |[s3] |
|f2 |l2 |loy2|[e2, e3]|[s2] |
|f1 |l1 |loy1|[e1] |[s1] |
+---+---+----+--------+-----+
scala> df2.groupBy("F","L","email").agg(max('loy).as("loy"),collect_set(concat_ws("",'state)).as("state")).show
+---+---+--------+----+--------+
| F| L| email| loy| state|
+---+---+--------+----+--------+
| f2| l2|[e2, e3]|loy2| [s2]|
| f1| l1| [e1]|loy1|[s3, s1]|
+---+---+--------+----+--------+
scala>
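An alternative for Spark 2.4+ (my own sketch, not part of the answer above): instead of concat_ws, flatten the collected State arrays and deduplicate them, which also works when a State array has more than one element.
import org.apache.spark.sql.functions._
df2.groupBy("F", "L", "Email")
  .agg(max($"Loy").as("Loy"),
       array_distinct(flatten(collect_set($"State"))).as("State"))
  .show(false)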
Related
I have a pyspark dataframe as:
+--------+------+
|numbers1|words1|
+--------+------+
| 1| word1|
| 1| word2|
| 1| word3|
| 2| word4|
| 2| word5|
| 3| word6|
| 3| word7|
| 3| word8|
| 3| word9|
+--------+------+
I want to produce another dataframe that would generate all pairs of words in each group. So the result for the above would be:
ID wordA wordB
1 word1 word2
1 word1 word3
1 word2 word3
2 word4 word5
3 word6 word7
3 word6 word8
3 word6 word9
3 word7 word8
3 word7 word9
3 word8 word9
I know I can do this in Python with Pandas:
from itertools import combinations
ndf = df.groupby('ID')['words'].apply(lambda x: list(combinations(x.values, 2))) \
        .apply(pd.Series).stack().reset_index(level=0, name='words')
But now I need to implement this with just the Spark APIs, without the itertools library. How can I rewrite this without combinations, using the DataFrame or RDD API?
Here is my attempt with a self-join on the dataframe; the filter words1 < words2 keeps each unordered pair exactly once and drops self-pairs.
import pyspark.sql.functions as f
df.join(df.withColumnRenamed('words1', 'words2'), ['numbers1'], 'outer') \
.filter('words1 < words2').show(10, False)
+--------+------+------+
|numbers1|words1|words2|
+--------+------+------+
|1 |word1 |word3 |
|1 |word1 |word2 |
|1 |word2 |word3 |
|2 |word4 |word5 |
|3 |word6 |word9 |
|3 |word6 |word8 |
|3 |word6 |word7 |
|3 |word7 |word9 |
|3 |word7 |word8 |
|3 |word8 |word9 |
+--------+------+------+
Here is a solution using combinations in a UDF. It uses the same logic as the Pandas code you showed.
from itertools import combinations
from pyspark.sql import types as T, functions as F
df_agg = df.groupBy("numbers1").agg(F.collect_list("words1").alias("words_list"))
@F.udf(
    T.ArrayType(
        T.StructType(
            [
                T.StructField("wordA", T.StringType(), True),
                T.StructField("wordB", T.StringType(), True),
            ]
        )
    )
)
def combi(words_list):
    # all 2-element combinations of the words collected for a group
    return list(combinations(words_list, 2))
df_agg = df_agg.withColumn("combinations", combi(F.col("words_list")))
new_df = df_agg.withColumn("combination", F.explode("combinations")).select(
"numbers1",
F.col("combination.wordA").alias("wordA"),
F.col("combination.wordB").alias("wordB"),
)
new_df.show()
+--------+------+------+
|numbers1| wordA| wordB|
+--------+------+------+
| 1| word1| word2|
| 1| word1| word3|
| 1| word2| word3|
| 3| word6| word7|
| 3| word6| word8|
| 3| word6| word9|
| 3| word7| word8|
| 3| word7| word9|
| 3| word8| word9|
| 2| word4| word5|
+--------+------+------+
I have a spark DataFrame like this:
+---+---+---+---+---+---+---+
| f1| f2| f3| f4| f5| f6| f7|
+---+---+---+---+---+---+---+
| 5| 4| 5| 2| 5| 5| 5|
+---+---+---+---+---+---+---+
How can I pivot (transpose) it to:
+---+---+
| f1| 5|
+---+---+
| f2| 4|
+---+---+
| f3| 5|
+---+---+
| f4| 2|
+---+---+
| f5| 5|
+---+---+
| f6| 5|
+---+---+
| f7| 5|
+---+---+
Is there a simple way in Spark Scala to do this transposition?
scala> df.show()
+---+---+---+---+---+---+---+
| f1| f2| f3| f4| f5| f6| f7|
+---+---+---+---+---+---+---+
| 5| 4| 5| 2| 5| 5| 5|
+---+---+---+---+---+---+---+
scala> import org.apache.spark.sql.DataFrame
scala> def transposeUDF(transDF: DataFrame, transBy: Seq[String]): DataFrame = {
| val (cols, types) = transDF.dtypes.filter{ case (c, _) => !transBy.contains(c)}.unzip
| require(types.distinct.size == 1)
|
| val kvs = explode(array(
| cols.map(c => struct(lit(c).alias("columns"), col(c).alias("value"))): _*
| ))
|
| val byExprs = transBy.map(col(_))
|
| transDF
| .select(byExprs :+ kvs.alias("_kvs"): _*)
| .select(byExprs ++ Seq($"_kvs.columns", $"_kvs.value"): _*)
| }
scala> val df1 = df.withColumn("tempColumn", lit("1"))
scala> transposeUDF(df1, Seq("tempColumn")).drop("tempColumn").show(false)
+-------+-----+
|columns|value|
+-------+-----+
|f1 |5 |
|f2 |4 |
|f3 |5 |
|f4 |2 |
|f5 |5 |
|f6 |5 |
|f7 |5 |
+-------+-----+
Spark 2.4+: use map_from_arrays.
scala> var df =Seq(( 5, 4, 5, 2, 5, 5, 5)).toDF("f1", "f2", "f3", "f4", "f5", "f6", "f7")
scala> df.select(array('*).as("v"), lit(df.columns).as("k")).select(map_from_arrays('k, 'v).as("map")).select(explode('map)).show(false)
+---+-----+
|key|value|
+---+-----+
|f1 |5 |
|f2 |4 |
|f3 |5 |
|f4 |2 |
|f5 |5 |
|f6 |5 |
|f7 |5 |
+---+-----+
Hope it helps.
I wrote a function:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DataType, DataTypes, DoubleType}
object DT {
  val KEY_COL_NAME = "dt_key"
  val VALUE_COL_NAME = "dt_value"
  def pivot(df: DataFrame, valueDataType: DataType, cols: Array[String], keyColName: String, valueColName: String): DataFrame = {
    // emit one (column-name, value) Row per selected column of every input row
    val tempData: RDD[Row] = df.rdd.flatMap(row => row.getValuesMap(cols).map(Row.fromTuple))
    val keyStructField = DataTypes.createStructField(keyColName, DataTypes.StringType, false)
    val valueStructField = DataTypes.createStructField(valueColName, DataTypes.StringType, true)
    val structType = DataTypes.createStructType(Array(keyStructField, valueStructField))
    df.sparkSession.createDataFrame(tempData, structType).select(col(keyColName), col(valueColName).cast(valueDataType))
  }
  def pivot(df: DataFrame, valueDataType: DataType): DataFrame = {
    pivot(df, valueDataType, df.columns, KEY_COL_NAME, VALUE_COL_NAME)
  }
}
It works:
df.show()
DT.pivot(df, DoubleType).show()
turning this:
+---+---+-----------+---+---+
| f1| f2|         f3| f4| f5|
+---+---+-----------+---+---+
|100|  1|0.355072464|  0| 31|
+---+---+-----------+---+---+
into this:
+------+-----------+
|dt_key|   dt_value|
+------+-----------+
|    f1|      100.0|
|    f5|       31.0|
|    f3|0.355072464|
|    f4|        0.0|
|    f2|        1.0|
+------+-----------+
and turning this two-row dataframe:
+---+---+-----------+-----------+---+
| f1| f2|         f3|         f4| f5|
+---+---+-----------+-----------+---+
|100|  1|0.355072464|          0| 31|
| 63|  2|0.622775801|0.685809375| 16|
+---+---+-----------+-----------+---+
into this:
+------+-----------+
|dt_key|   dt_value|
+------+-----------+
|    f1|      100.0|
|    f5|       31.0|
|    f3|0.355072464|
|    f4|        0.0|
|    f2|        1.0|
|    f1|       63.0|
|    f5|       16.0|
|    f3|0.622775801|
|    f4|0.685809375|
|    f2|        2.0|
+------+-----------+
very nice!
I am trying to construct a distinction matrix using Spark and am confused about how to do it optimally. I am new to Spark. I have given a small example of what I'm trying to do below.
Example of distinction matrix construction:
Given Dataset D:
+----+-----+------+-----+
| id | a1 | a2 | a3 |
+----+-----+------+-----+
| 1 | yes | high | on |
| 2 | no | high | off |
| 3 | yes | low | off |
+----+-----+------+-----+
and my distinction table is
+-------+----+----+----+
| id,id | a1 | a2 | a3 |
+-------+----+----+----+
| 1,2 | 1 | 0 | 1 |
| 1,3 | 0 | 1 | 1 |
| 2,3 | 1 | 1 | 0 |
+-------+----+----+----+
i.e. whenever an attribute ai is helpful in distinguishing a pair of tuples, the distinction table has a 1, otherwise a 0.
My datasets are huge and I am trying to do this in Spark. The following approaches came to my mind:
using a nested for loop to iterate over all members of the RDD (of the dataset)
using the cartesian() transformation over the original RDD and iterating over all members of the resulting RDD to get the distinction table
My questions are:
In the 1st approach, does Spark automatically optimize the nested for loop internally for parallel processing?
In the 2nd approach, using cartesian() causes extra storage overhead to store the intermediate RDD. Is there any way to avoid this storage overhead and still get the final distinction table?
Which of these approaches is better, and is there any other approach that would be useful for constructing the distinction matrix efficiently (in both space and time)?
For this dataframe:
scala> val df = List((1, "yes", "high", "on" ), (2, "no", "high", "off"), (3, "yes", "low", "off") ).toDF("id", "a1", "a2", "a3")
df: org.apache.spark.sql.DataFrame = [id: int, a1: string ... 2 more fields]
scala> df.show
+---+---+----+---+
| id| a1| a2| a3|
+---+---+----+---+
| 1|yes|high| on|
| 2| no|high|off|
| 3|yes| low|off|
+---+---+----+---+
We can build a cartesian product by using crossJoin with itself. However, the column names will be ambiguous (I don't really know how to easily deal with that). To prepare for that, let's create a second dataframe:
scala> val df2 = df.toDF("id_2", "a1_2", "a2_2", "a3_2")
df2: org.apache.spark.sql.DataFrame = [id_2: int, a1_2: string ... 2 more fields]
scala> df2.show
+----+----+----+----+
|id_2|a1_2|a2_2|a3_2|
+----+----+----+----+
| 1| yes|high| on|
| 2| no|high| off|
| 3| yes| low| off|
+----+----+----+----+
In this example we can get combinations by filtering using id < id_2.
scala> val xp = df.crossJoin(df2)
xp: org.apache.spark.sql.DataFrame = [id: int, a1: string ... 6 more fields]
scala> xp.show
+---+---+----+---+----+----+----+----+
| id| a1| a2| a3|id_2|a1_2|a2_2|a3_2|
+---+---+----+---+----+----+----+----+
| 1|yes|high| on| 1| yes|high| on|
| 1|yes|high| on| 2| no|high| off|
| 1|yes|high| on| 3| yes| low| off|
| 2| no|high|off| 1| yes|high| on|
| 2| no|high|off| 2| no|high| off|
| 2| no|high|off| 3| yes| low| off|
| 3|yes| low|off| 1| yes|high| on|
| 3|yes| low|off| 2| no|high| off|
| 3|yes| low|off| 3| yes| low| off|
+---+---+----+---+----+----+----+----+
scala> val filtered = xp.filter($"id" < $"id_2")
filtered: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, a1: string ... 6 more fields]
scala> filtered.show
+---+---+----+---+----+----+----+----+
| id| a1| a2| a3|id_2|a1_2|a2_2|a3_2|
+---+---+----+---+----+----+----+----+
| 1|yes|high| on| 2| no|high| off|
| 1|yes|high| on| 3| yes| low| off|
| 2| no|high|off| 3| yes| low| off|
+---+---+----+---+----+----+----+----+
At this point the problem is basically solved. To get the final table we can use a when().otherwise() expression on each column pair (a sketch of that variant follows the output below), or a UDF as I have done here:
scala> val dist = udf((a:String, b: String) => if (a != b) 1 else 0)
dist: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,IntegerType,Some(List(StringType, StringType)))
scala> val distinction = filtered.select($"id", $"id_2", dist($"a1", $"a1_2").as("a1"), dist($"a2", $"a2_2").as("a2"), dist($"a3", $"a3_2").as("a3"))
distinction: org.apache.spark.sql.DataFrame = [id: int, id_2: int ... 3 more fields]
scala> distinction.show
+---+----+---+---+---+
| id|id_2| a1| a2| a3|
+---+----+---+---+---+
| 1| 2| 1| 0| 1|
| 1| 3| 0| 1| 1|
| 2| 3| 1| 1| 0|
+---+----+---+---+---+
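For reference, here is a minimal sketch of the when().otherwise() variant mentioned above; it assumes the attribute columns contain no nulls (a null comparison yields 0 here, whereas the UDF counts null vs. non-null as a difference):
import org.apache.spark.sql.functions.when
val distinction2 = filtered.select(
  $"id", $"id_2",
  when($"a1" =!= $"a1_2", 1).otherwise(0).as("a1"),
  when($"a2" =!= $"a2_2", 1).otherwise(0).as("a2"),
  when($"a3" =!= $"a3_2", 1).otherwise(0).as("a3"))
distinction2.show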
In my Scala program, I am trying to combine the results from multiple levels of groupBy.
The dataset that I am using is quite big. As a small sample, I have a dataframe that looks like this:
val df = (Seq(("f1", "l1", "loy1", null, "s1"),
("f1", "l1", "loy1", "e1", "s1"),
("f2", "l2", "loy2", "e2", "s2"),
("f2", "l2", "loy2", "e3", null),
("f1", "l1", null , "e1", "s3"),
("f1", "l1", null , "e2", "s3"),
("f2", "l2", null , null, "s4")
).toDF("F", "L", "Loy", "Email", "State"))
+---+---+----+-----+-----+
| F| L| Loy|Email|State|
+---+---+----+-----+-----+
| f1| l1|loy1| null| s1|
| f1| l1|loy1| e1| s1|
| f2| l2|loy2| e2| s2|
| f2| l2|loy2| e3| null|
| f1| l1|null| e1| s3|
| f1| l1|null| e2| s3|
| f2| l2|null| null| s4|
+---+---+----+-----+-----+
For the first-level groupBy I use the following script to obtain the result grouped on the (F, L, Loy) columns:
df.groupBy("F", "L", "Loy").agg(collect_set($"Email").alias("Email"), collect_set($"State").alias("State")).show
The result is like this:
+---+---+----+--------+-----+
| F| L| Loy| Email|State|
+---+---+----+--------+-----+
| f1| l1|null|[e1, e2]| [s3]|
| f2| l2|loy2|[e2, e3]| [s2]|
| f1| l1|loy1| [e1]| [s1]|
| f2| l2|null| []| [s4]|
+---+---+----+--------+-----+
The problem I am dealing with is how to perform the second-level groupBy, which groups on (F, L, Email) and takes F and L as a String while the Email column is an Array[String]. This groupBy should return a result as follows:
+---+---+------+--------+---------+
| F| L| Loy| Email| State|
+---+---+------+--------+---------+
| f1| l1|[loy1]|[e1, e2]| [s3, s1]|
| f2| l2|[loy2]|[e2, e3]| [s2]|
| f2| l2| null| []| [s4]|
+---+---+------+--------+---------+
The main goal is to reduce the number of entries as much as possible by applying groupBy at different levels. I am quite new to Scala and any help would be appreciated :)
How can I replace empty values in a column Field1 of DataFrame df?
Field1    Field2
          AA
12        BB
This command does not give the expected result:
df.na.fill("Field1",Seq("Anonymous"))
The expected result:
Field1    Field2
Anonymous AA
12        BB
You can also try this; it handles blank/empty strings as well as nulls.
df.show()
+------+------+
|Field1|Field2|
+------+------+
| | AA|
| 12| BB|
| 12| null|
+------+------+
df.na.replace(Seq("Field1","Field2"),Map(""-> null)).na.fill("Anonymous", Seq("Field2","Field1")).show(false)
+---------+---------+
|Field1 |Field2 |
+---------+---------+
|Anonymous|AA |
|12 |BB |
|12 |Anonymous|
+---------+---------+
Fill: Returns a new DataFrame that replaces null or NaN values in
numeric columns with value.
Two things:
An empty string is not null or NaN, so you'll have to use a case statement for that.
fill silently ignores numeric columns when you give it a text value.
Failing Null Replace with Fill / Text:
scala> a.show
+----+---+
| f1| f2|
+----+---+
|null| AA|
| 12| BB|
+----+---+
scala> a.na.fill("Anonymous", Seq("f1")).show
+----+---+
| f1| f2|
+----+---+
|null| AA|
| 12| BB|
+----+---+
Working Example - Using Null With All Numbers:
scala> a.show
+----+---+
| f1| f2|
+----+---+
|null| AA|
| 12| BB|
+----+---+
scala> a.na.fill(1, Seq("f1")).show
+---+---+
| f1| f2|
+---+---+
| 1| AA|
| 12| BB|
+---+---+
Failing Example (Empty String instead of Null):
scala> b.show
+---+---+
| f1| f2|
+---+---+
| | AA|
| 12| BB|
+---+---+
scala> b.na.fill(1, Seq("f1")).show
+---+---+
| f1| f2|
+---+---+
| | AA|
| 12| BB|
+---+---+
Case Statement Fix Example:
scala> b.show
+---+---+
| f1| f2|
+---+---+
| | AA|
| 12| BB|
+---+---+
scala> b.select(when(col("f1") === "", "Anonymous").otherwise(col("f1")).as("f1"), col("f2")).show
+---------+---+
| f1| f2|
+---------+---+
|Anonymous| AA|
| 12| BB|
+---------+---+
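If you also want to cover real nulls in the same string column, here is a minimal sketch (my own addition, assuming f1 is a string column) that handles both the empty string and null in one expression:
import org.apache.spark.sql.functions.{col, when}
b.select(
  when(col("f1").isNull || col("f1") === "", "Anonymous")
    .otherwise(col("f1")).as("f1"),
  col("f2")).show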
You can try the code below when you have any number of columns in the dataframe.
Note: when you write data to formats like Parquet, the null (void) data type is not supported, so we have to cast it.
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType
val df = Seq(
(1, ""),
(2, "Ram"),
(3, "Sam"),
(4,"")
).toDF("ID", "Name")
// null type column
val inputDf = df.withColumn("NulType", lit(null).cast(StringType))
//Output
+---+----+-------+
| ID|Name|NulType|
+---+----+-------+
| 1| | null|
| 2| Ram| null|
| 3| Sam| null|
| 4| | null|
+---+----+-------+
// Replace every empty string in the dataframe with the *string* "null"
// (to get a real null instead, use Map("" -> null), as in the answer above)
val colName = inputDf.columns // array of all column names
val data = inputDf.na.replace(colName, Map("" -> "null"))
data.show()
+---+----+-------+
| ID|Name|NulType|
+---+----+-------+
| 1|null| null|
| 2| Ram| null|
| 3| Sam| null|
| 4|null| null|
+---+----+-------+
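If you want a real null (rather than the string "null") and then a readable placeholder, here is a small sketch (my own variation, combining na.replace and na.fill as in the first answer):
// "" -> real null first, then fill the string column with a placeholder
val cleaned = inputDf
  .na.replace(Seq("Name", "NulType"), Map("" -> null))
  .na.fill("Anonymous", Seq("Name"))
cleaned.show()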