How to perform one-to-many mapping on a Spark Scala DataFrame column using flatMap

I am specifically looking for a flatMap solution for mocking a data column in a Spark Scala DataFrame by duplicating data, i.e. a one-to-many mapping inside flatMap.
My input data looks like this:
+---+----+-----+
|id |name|marks|
+---+----+-----+
|1 |ABCD|12 |
|2 |CDEF|12 |
|3 |FGHI|14 |
+---+----+-----+
After a 1-to-3 mapping of the id column, I expect something like this:
+---+----+-----+
|id |name|marks|
+---+----+-----+
|1 |ABCD|12 |
|2 |CDEF|12 |
|3 |FGHI|14 |
|2 |null|null |
|3 |null|null |
|1 |null|null |
|2 |null|null |
|1 |null|null |
|3 |null|null |
+---+----+-----+
Please let me know if any part of the requirement needs clarification.
Thanks in advance!!!

I see that you are attempting to generate data with a requirement of re-using values in the ID column.
You can just select the ID column and generate random values and do a union back to your original dataset.
For example:
val data = Seq((1,"asd",15), (2,"asd",20), (3,"test",99)).toDF("id","testName","marks")
+---+--------+-----+
| id|testName|marks|
+---+--------+-----+
| 1| asd| 15|
| 2| asd| 20|
| 3| test| 99|
+---+--------+-----+
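import org.apache.spark.sql.functions.{concat, lit, rand}
import org.apache.spark.sql.types._
// Keep the id column and fill the other columns with random values.
val newRecords = data.select("id").withColumn("testName", concat(lit("name_"), (rand() * 10).cast(IntegerType).cast(StringType))).withColumn("marks", (rand() * 100).cast(IntegerType))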
+---+--------+-----+
| id|testName|marks|
+---+--------+-----+
| 1| name_2| 35|
| 2| name_9| 20|
| 3| name_3| 7|
+---+--------+-----+
val result = data.union(newRecords) // union replaces unionAll, which was deprecated in Spark 2.0
+---+--------+-----+
| id|testName|marks|
+---+--------+-----+
| 1| asd| 15|
| 2| asd| 20|
| 3| test| 99|
| 1| name_2| 35|
| 2| name_9| 20|
| 3| name_3| 7|
+---+--------+-----+
You can run the randomisation step in a loop and union all the generated DataFrames to get as many extra rows per id as you need.
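Since the question asks for flatMap specifically, here is a minimal sketch of the one-to-many logic on a plain Scala Seq rather than a DataFrame. The Record case class, the expand helper, and the factor parameter are hypothetical names introduced for illustration. Each input record yields itself plus copies that keep only the id, which matches the expected output in the question apart from row order.

```scala
object OneToMany {
  // Hypothetical row type mirroring the question's columns; name and marks
  // are optional so the generated copies can leave them empty (null in Spark).
  final case class Record(id: Int, name: Option[String], marks: Option[Int])

  // Emit the original record followed by (factor - 1) copies that keep only the id.
  def expand(r: Record, factor: Int): Seq[Record] =
    r +: Seq.fill(factor - 1)(Record(r.id, None, None))

  val input: Seq[Record] = Seq(
    Record(1, Some("ABCD"), Some(12)),
    Record(2, Some("CDEF"), Some(12)),
    Record(3, Some("FGHI"), Some(14))
  )

  // flatMap flattens the per-record sequences into a single result.
  val output: Seq[Record] = input.flatMap(expand(_, 3))
}
```

On a typed Dataset the same function should be usable as `data.as[Record].flatMap(expand(_, 3))`, assuming `spark.implicits._` is in scope and the column names match the case class fields; that part is not exercised by the sketch above.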

Related

Process multiple dataframes in parallel Scala

I am a newbie in Scala-Spark. I have a DataFrame like the one below that I need to split into chunks based on a group ID, and then process the chunks independently in parallel.
+----+-------+-----+-------+
|user|feature|value|groupID|
+----+-------+-----+-------+
| 1| 100| 1| A|
| 2| 20B| 0| B|
| 3| 30A| 1| B|
| 4| 40A| 1| B|
| 5| 50A| 1| A|
| 6| 10A| 0| B|
| 7| 200| 1| A|
| 8| 30B| 1| B|
| 9| 400| 0| A|
| 10| 50C| 0| A|
+----+-------+-----+-------+
Step 1: I need to split it into two different DataFrames like the ones below. I can use a filter for this, but I am not sure whether (given the large number of DataFrames this will produce) I should save them to ADLS as Parquet files or keep them in memory.
+----+-------+-----+-------+
|user|feature|value|groupID|
+----+-------+-----+-------+
| 1| 100| 1| A|
| 5| 50A| 1| A|
| 7| 200| 1| A|
| 9| 400| 0| A|
| 10| 50C| 0| A|
+----+-------+-----+-------+
+----+-------+-----+-------+
|user|feature|value|groupID|
+----+-------+-----+-------+
| 2| 20B| 0| B|
| 3| 30A| 1| B|
| 4| 40A| 1| B|
| 6| 10A| 0| B|
| 8| 30B| 1| B|
+----+-------+-----+-------+
Step 2: Process each DataFrame independently and in parallel, producing independent processed DataFrames.
To give some context:
The number of group IDs will be high, so they cannot be hardcoded.
The processing of each DataFrame would ideally happen in parallel.
I am asking for a brief idea of how to proceed. I have seen .par.foreach, but it is not clear to me how to apply it to a dynamic number of DataFrames, how to store the results independently, or whether it is the most efficient way.
Check the code below.
scala> df.show(false)
+----+-------+-----+-------+
|user|feature|value|groupID|
+----+-------+-----+-------+
|1 |100 |1 |A |
|2 |20B |0 |B |
|3 |30A |1 |B |
|4 |40A |1 |B |
|5 |50A |1 |A |
|6 |10A |0 |B |
|7 |200 |1 |A |
|8 |30B |1 |B |
|9 |400 |0 |A |
|10 |50C |0 |A |
+----+-------+-----+-------+
Get the distinct groupID values from the DataFrame.
scala> val groupIds = df.select($"groupID").distinct.as[String].collect // Get distinct group ids.
groupIds: Array[String] = Array(B, A)
Use .par to process the groups in parallel. Put your own logic (for example, saving each DataFrame) inside map rather than foreach; show is used in foreach here only to display each DataFrame.
scala> groupIds.par.map(groupid => df.filter($"groupID" === lit(groupid))).foreach(_.show(false))
+----+-------+-----+-------+
|user|feature|value|groupID|
+----+-------+-----+-------+
|2 |20B |0 |B |
|3 |30A |1 |B |
|4 |40A |1 |B |
|6 |10A |0 |B |
|8 |30B |1 |B |
+----+-------+-----+-------+
+----+-------+-----+-------+
|user|feature|value|groupID|
+----+-------+-----+-------+
|1 |100 |1 |A |
|5 |50A |1 |A |
|7 |200 |1 |A |
|9 |400 |0 |A |
|10 |50C |0 |A |
+----+-------+-----+-------+
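As an alternative to .par (which lives in the separate scala-parallel-collections module from Scala 2.13 onward), the same pattern can be sketched with scala.concurrent.Future. This is only an illustration on a plain Seq standing in for the DataFrame: the Row case class and the process function are hypothetical, and process simply sums the value column. The point is that the group IDs are discovered at runtime and each result stays keyed by its group ID.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object PerGroupProcessing {
  // Hypothetical row type mirroring the question's columns.
  final case class Row(user: Int, feature: String, value: Int, groupID: String)

  val rows: Seq[Row] = Seq(
    Row(1, "100", 1, "A"), Row(2, "20B", 0, "B"), Row(3, "30A", 1, "B"),
    Row(5, "50A", 1, "A"), Row(9, "400", 0, "A")
  )

  // Hypothetical per-group step: here it just sums the value column.
  def process(group: Seq[Row]): Int = group.map(_.value).sum

  // Runs inside a method (not the object initializer) so that the futures
  // never wait on this object's own initialization.
  def run(): Map[String, Int] = {
    // Discover the group ids at runtime instead of hardcoding them.
    val groupIds = rows.map(_.groupID).distinct
    // Start one Future per group id and keep each result keyed by its id.
    val all = Future.traverse(groupIds) { id =>
      Future(id -> process(rows.filter(_.groupID == id)))
    }
    Await.result(all, 30.seconds).toMap
  }
}
```

With Spark, the body of each Future would instead filter the DataFrame and trigger an action such as a write; Spark accepts jobs submitted from several threads of one application. If the goal is only to store each group separately, `df.write.partitionBy("groupID")` writes one directory per group without any explicit loop.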

How to find the max length unique rows from a dataframe with spark?

I am trying to find the unique rows (based on id) that have the maximum length value in a Spark dataframe. Each Column has a value of string type.
The dataframe is like:
+-----+---+----+---+---+
|id | A | B | C | D |
+-----+---+----+---+---+
|1 |toto|tata|titi| |
|1 |toto|tata|titi|tutu|
|2 |bla |blo | | |
|3 |b | c | | d |
|3 |b | c | a | d |
+-----+---+----+---+---+
The expectation is:
+-----+---+----+---+---+
|id | A | B | C | D |
+-----+---+----+---+---+
|1 |toto|tata|titi|tutu|
|2 |bla |blo | | |
|3 |b | c | a | d |
+-----+---+----+---+---+
I can't figure how to do this using Spark easily...
Thanks in advance
Note: This approach copes with columns being added to or removed from the DataFrame, with no code change needed.
First compute the length of all columns concatenated together (except the first column, id), then filter out every row except the one with the maximum length for its id. If several rows of an id tie for the maximum length, all of them are kept.
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
val output = input.withColumn("rowLength", length(concat(input.columns.toList.drop(1).map(col): _*)))
.withColumn("maxLength", max($"rowLength").over(Window.partitionBy($"id")))
.filter($"rowLength" === $"maxLength")
.drop("rowLength", "maxLength")
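The logic of the window approach above (row length, maximum per id, then filter) can be sketched on a plain Scala collection to make each step visible. This is not Spark code: the tuple layout and the rowLength and longest helpers are hypothetical names used only for illustration.

```scala
object LongestRowPerId {
  // Each row is (id, remaining column values), mirroring the question's sample data.
  val rows: Seq[(Int, Seq[String])] = Seq(
    (1, Seq("toto", "tata", "titi", "")),
    (1, Seq("toto", "tata", "titi", "tutu")),
    (2, Seq("bla", "blo", "", "")),
    (3, Seq("b", "c", "", "d")),
    (3, Seq("b", "c", "a", "d"))
  )

  // Length of all non-id columns concatenated, like length(concat(...)).
  def rowLength(cols: Seq[String]): Int = cols.mkString.length

  // Keep every row whose length equals the maximum for its id (ties are all kept).
  def longest(rs: Seq[(Int, Seq[String])]): Seq[(Int, Seq[String])] =
    rs.groupBy(_._1).toSeq.sortBy(_._1).flatMap { case (_, group) =>
      val maxLen = group.map(r => rowLength(r._2)).max
      group.filter(r => rowLength(r._2) == maxLen)
    }
}
```

Here groupBy plays the role of Window.partitionBy, and the max followed by filter corresponds to the maxLength column and the equality filter.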
scala> df.show
+---+----+----+----+----+
| id| A| B| C| D|
+---+----+----+----+----+
| 1|toto|tata|titi| |
| 1|toto|tata|titi|tutu|
| 2| bla| blo| | |
| 3| b| c| | d|
| 3| b| c| a| d|
+---+----+----+----+----+
scala> df.groupBy("id").agg(concat_ws("",collect_set(col("A"))).alias("A"),concat_ws("",collect_set(col("B"))).alias("B"),concat_ws("",collect_set(col("C"))).alias("C"),concat_ws("",collect_set(col("D"))).alias("D")).show
+---+----+----+----+----+
| id| A| B| C| D|
+---+----+----+----+----+
| 1|toto|tata|titi|tutu|
| 2| bla| blo| | |
| 3| b| c| a| d|
+---+----+----+----+----+

Flatten Spark Dataframe and Name Columns

How can one unnest an array within a spark dataframe, such that the resulting dataframe contains one row for each value in the original array?
Example:
scala> df.show()
+---------+------+
|employees|person|
+---------+------+
|[1, 2, 3]| Mary|
|[4, 5, 6]| John|
+---------+------+
Expected result:
+---------+------+
|employee |person|
+---------+------+
|1 | Mary|
|2 | Mary|
|3 | Mary|
|4 | John|
|5 | John|
|6 | John|
+---------+------+
This is what I have tried:
df.select($"person", explode($"employees")).show()
+------+---+
|person|col|
+------+---+
| Mary| 1|
| Mary| 2|
| Mary| 3|
| John| 4|
| John| 5|
| John| 6|
+------+---+
How can I have the resulting exploded column be named "employee"?
df.select($"person", explode($"employees").alias("employee")).show()
or
df.select($"person", explode($"employees").as("employee")).show()
You can also use withColumn to create the new column (note that this keeps the original employees column as well):
df.withColumn("employee", explode($"employees")).show()

Forward-fill missing data in PySpark not working

I have a simple dataset as shown below.
| id| name| country| languages|
|1 | Bob| USA| Spanish|
|2 | Angelina| France| null|
|3 | Carl| Brazil| null|
|4 | John| Australia| English|
|5 | Anne| Nepal| null|
I am trying to impute the null values in languages with the last non-null value, using pyspark.sql.window to create a window over certain rows, but nothing happens. The column that is supposed to have its null values filled, temp_filled_spark, remains unchanged, i.e. it is a copy of the original languages column.
import sys
from pyspark.sql import Window
from pyspark.sql.functions import last
window = Window.partitionBy('name').orderBy('country').rowsBetween(-sys.maxsize, 0)
filled_column = last(df['languages'], ignorenulls=True).over(window)
df = df.withColumn('temp_filled_spark', filled_column)
df.orderBy('name', 'country').show(100)
I expect the output column to be:
|temp_filled_spark|
| Spanish|
| Spanish|
| Spanish|
| English|
| English|
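Could anybody help point out the mistake?
Partitioning by name puts every row in its own partition here (each name is unique), so there is never an earlier row to take a value from. Instead, we can create a window that treats the entire DataFrame as one partition:
import sys
from pyspark.sql import Window
from pyspark.sql import functions as F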
>>> df1.show()
+---+--------+---------+---------+
| id| name| country|languages|
+---+--------+---------+---------+
| 1| Bob| USA| Spanish|
| 2|Angelina| France| null|
| 3| Carl| Brazil| null|
| 4| John|Australia| English|
| 5| Anne| Nepal| null|
+---+--------+---------+---------+
>>> w = Window.partitionBy(F.lit(1)).orderBy('id').rowsBetween(-sys.maxsize, 0)  # order by id so "last" has a well-defined row order
>>> df1.select("*",F.last('languages',True).over(w).alias('newcol')).show()
+---+--------+---------+---------+-------+
| id| name| country|languages| newcol|
+---+--------+---------+---------+-------+
| 1| Bob| USA| Spanish|Spanish|
| 2|Angelina| France| null|Spanish|
| 3| Carl| Brazil| null|Spanish|
| 4| John|Australia| English|English|
| 5| Anne| Nepal| null|English|
+---+--------+---------+---------+-------+
Hope this helps.!
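For readers who want to see the forward-fill logic itself, here is a small sketch in Scala on a plain collection (not PySpark and not a DataFrame). The forwardFill helper is a hypothetical name; it carries the last defined value forward through the gaps, which is what last(..., ignorenulls=True) over an ordered, unbounded-preceding window computes.

```scala
object ForwardFill {
  // Carry the last defined value forward; gaps before the first value stay None.
  def forwardFill[A](xs: Seq[Option[A]]): Seq[Option[A]] =
    xs.scanLeft(Option.empty[A])((last, cur) => cur.orElse(last)).tail

  // The languages column from the question, in id order.
  val languages: Seq[Option[String]] =
    Seq(Some("Spanish"), None, None, Some("English"), None)

  val filled: Seq[Option[String]] = forwardFill(languages)
}
```

The result depends entirely on the row order, which is why the window needs a meaningful orderBy.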

How to append column values in Spark SQL?

I have the below table:
+-------+---------+---------+
|movieId|movieName| genre|
+-------+---------+---------+
| 1| example1| action|
| 1| example1| thriller|
| 1| example1| romance|
| 2| example2|fantastic|
| 2| example2| action|
+-------+---------+---------+
What I am trying to achieve is to append the genre values together where the id and name are the same. Like this:
+-------+---------+---------------------------+
|movieId|movieName| genre |
+-------+---------+---------------------------+
| 1| example1| action|thriller|romance |
| 2| example2| action|fantastic |
+-------+---------+---------------------------+
Use groupBy and collect_list to get a list of all genres for the same movie ID and name. Then combine these into a string using concat_ws (if the order is important, first use sort_array, which sorts the genres alphabetically). A small example with the given sample DataFrame:
import org.apache.spark.sql.functions.{collect_list, concat_ws, sort_array}
val df2 = df.groupBy("movieId", "movieName")
  .agg(collect_list($"genre").as("genre"))
  .withColumn("genre", concat_ws("|", sort_array($"genre")))
Gives the result:
+-------+---------+-----------------------+
|movieId|movieName|genre |
+-------+---------+-----------------------+
|1 |example1 |action|romance|thriller|
|2 |example2 |action|fantastic |
+-------+---------+-----------------------+
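The same group, sort, and join steps can be sketched on a plain Scala collection, which makes the effect of sorting easy to check. This is an illustration only, not Spark code; the tuple layout and the combine helper are hypothetical.

```scala
object GenresPerMovie {
  // (movieId, movieName, genre) rows from the question's sample table.
  val rows: Seq[(Int, String, String)] = Seq(
    (1, "example1", "action"), (1, "example1", "thriller"), (1, "example1", "romance"),
    (2, "example2", "fantastic"), (2, "example2", "action")
  )

  // Group by (id, name), sort each group's genres, and join them with "|".
  def combine(rs: Seq[(Int, String, String)]): Seq[(Int, String, String)] =
    rs.groupBy(r => (r._1, r._2)).toSeq
      .map { case ((id, name), group) => (id, name, group.map(_._3).sorted.mkString("|")) }
      .sortBy(_._1)
}
```

Sorting gives a stable, alphabetical genre order; without it the order would follow whatever order the rows happened to arrive in.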