Spark data frame: converting row values into column names - Scala

Using a Spark DataFrame, I need to convert the row values into columns, partition by course_id, and create a CSV file per course.
val someDF = Seq(
  ("user1", "math", "algebra-1", "90"),
  ("user1", "physics", "gravity", "70"),
  ("user3", "biology", "health", "50"),
  ("user2", "biology", "health", "100"),
  ("user1", "math", "algebra-1", "40"),
  ("user2", "physics", "gravity-2", "20")
).toDF("user_id", "course_id", "lesson_name", "score")
someDF.show(false)
+-------+---------+-----------+-----+
|user_id|course_id|lesson_name|score|
+-------+---------+-----------+-----+
| user1| math| algebra-1| 90|
| user1| physics| gravity| 70|
| user3| biology| health| 50|
| user2| biology| health| 100|
| user1| math| algebra-1| 40|
| user2| physics| gravity-2| 20|
+-------+---------+-----------+-----+
val result = someDF.groupBy("user_id", "course_id").pivot("lesson_name").agg(first("score"))
result.show(false)
+-------+---------+---------+-------+---------+------+
|user_id|course_id|algebra-1|gravity|gravity-2|health|
+-------+---------+---------+-------+---------+------+
| user3| biology| null| null| null| 50|
| user1| math| 90| null| null| null|
| user2| biology| null| null| null| 100|
| user2| physics| null| null| 20| null|
| user1| physics| null| 70| null| null|
+-------+---------+---------+-------+---------+------+
With the above code I'm able to convert the row values (lesson_name) into column names.
But I need to save the output to CSV course-wise.
The expected output in CSV should be in the format below.
biology.csv // Expected Output
+-------+---------+------+
|user_id|course_id|health|
+-------+---------+------+
| user3 | biology |  50  |
| user2 | biology |  100 |
+-------+---------+------+
physics.csv // Expected Output
+-------+---------+---------+-------+
|user_id|course_id|gravity-2|gravity|
+-------+---------+---------+-------+
| user2 | physics |   20    |  null |
| user1 | physics |   null  |   70  |
+-------+---------+---------+-------+
**Note: each course's CSV should contain only its own specific lesson names; it should not contain lesson names from non-relevant courses.
Currently the CSV output I'm able to produce is in the format below:**
result.write
.partitionBy("course_id")
.mode("overwrite")
.format("com.databricks.spark.csv")
.option("header", "true")
.save(somepath)
e.g.:
biology.csv // Wrong output, because it contains non-relevant courses' lesson columns (algebra-1, gravity, gravity-2)
+-------+---------+---------+-------+---------+------+
|user_id|course_id|algebra-1|gravity|gravity-2|health|
+-------+---------+---------+-------+---------+------+
| user3| biology| null| null| null| 50|
| user2| biology| null| null| null| 100|
+-------+---------+---------+-------+---------+------+
Can anyone help solve this problem?

Just filter by course before you pivot:
val result = someDF.filter($"course_id" === "physics").groupBy("user_id", "course_id").pivot("lesson_name").agg(first("score"))
+-------+---------+-------+---------+
|user_id|course_id|gravity|gravity-2|
+-------+---------+-------+---------+
|user2 |physics |null |20 |
|user1 |physics |70 |null |
+-------+---------+-------+---------+
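To produce one file per course, the same idea can be wrapped in a loop over the distinct course_ids. A minimal sketch (the output path /tmp/courses is a placeholder; also note Spark writes a directory of part files per course rather than a literal biology.csv, so a single named file would need a rename step afterwards):
import org.apache.spark.sql.functions._

val courses = someDF.select("course_id").distinct().collect().map(_.getString(0))
courses.foreach { c =>
  someDF.filter($"course_id" === c)   // pivot now only sees this course's lessons
    .groupBy("user_id", "course_id")
    .pivot("lesson_name")
    .agg(first("score"))
    .coalesce(1)                      // single part file per course
    .write.mode("overwrite")
    .option("header", "true")
    .csv(s"/tmp/courses/$c")          // placeholder output path
}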

I'm assuming you mean you'd like to save the data into separate directories by course_id. You can use this approach:
scala> val someDF = Seq(
  ("user1", "math", "algebra-1", "90"),
  ("user1", "physics", "gravity", "70"),
  ("user3", "biology", "health", "50"),
  ("user2", "biology", "health", "100"),
  ("user1", "math", "algebra-1", "40"),
  ("user2", "physics", "gravity-2", "20")
).toDF("user_id", "course_id", "lesson_name", "score")
scala> val result = someDF.groupBy("user_id", "course_id").pivot("lesson_name").agg(first("score"))
scala> val eventNames = result.select($"course_id").distinct().collect()
val eventlist = eventNames.map(x => x(0).toString)
for (eventName <- eventlist) {
  val course = result.where($"course_id" === eventName)
  // Remove the all-null columns: map each cell to 0 (null) or 1, then take the
  // column-wise max; a column whose max is 0 is entirely null for this course.
  val row = course
    .select(course.columns.map(c => when(col(c).isNull, 0).otherwise(1).as(c)): _*)
    .groupBy().max(course.columns.map(c => c): _*)
    .first
  // Keep only the columns whose max is 1. The aggregate renames columns to
  // "max(colname)", so drop(4).dropRight(1) strips the "max(" prefix and ")" suffix.
  val colKeep = row.getValuesMap[Int](row.schema.fieldNames)
    .map { c => if (c._2 == 1) Some(c._1) else None }
    .flatten.toArray
  val final_df = course.select(row.schema.fieldNames.intersect(colKeep)
    .map(c => col(c.drop(4).dropRight(1))): _*)
  final_df.show()
  final_df.coalesce(1).write.mode("overwrite").format("csv").save(s"${eventName}")
}
+-------+---------+------+
|user_id|course_id|health|
+-------+---------+------+
| user3| biology| 50|
| user2| biology| 100|
+-------+---------+------+
+-------+---------+-------+---------+
|user_id|course_id|gravity|gravity-2|
+-------+---------+-------+---------+
| user2| physics| null| 20|
| user1| physics| 70| null|
+-------+---------+-------+---------+
+-------+---------+---------+
|user_id|course_id|algebra-1|
+-------+---------+---------+
| user1| math| 90|
+-------+---------+---------+
If it solves your purpose, please accept the answer. Happy Hadooping!

Related

Fill null or empty with next row value in Spark

Is there a way to replace null values in a Spark data frame with the next row's non-null value? An additional row_count column has been added for window partitioning and ordering. More specifically, I'd like to achieve the following result:
+---------+----+    +---------+----+
|row_count|  id|    |row_count|  id|
+---------+----+    +---------+----+
|        1|null|    |        1| 109|
|        2| 109|    |        2| 109|
|        3|null|    |        3| 108|
|        4|null|    |        4| 108|
|        5| 108| => |        5| 108|
|        6|null|    |        6| 110|
|        7| 110|    |        7| 110|
|        8|null|    |        8|null|
|        9|null|    |        9|null|
|       10|null|    |       10|null|
+---------+----+    +---------+----+
I tried the code below, but it is not giving the proper result.
val ss = dataframe.select($"*",
  sum(when(dataframe("id").isNull || dataframe("id") === "", 1).otherwise(0))
    .over(Window.orderBy($"row_count")) as "value")
val window1 = Window.partitionBy($"value").orderBy("id").rowsBetween(0, Long.MaxValue)
val selectList = ss.withColumn("id_fill_from_below", last("id").over(window1))
  .drop($"row_count").drop($"value")
Here is an approach:
1. Filter the non-nulls (dfNonNulls)
2. Filter the nulls (dfNulls)
3. Find the right value for each null id, using a join and a window function
4. Fill the null dataframe (dfNullFills)
5. Union dfNonNulls and dfNullFills
data.csv
row_count,id
1,
2,109
3,
4,
5,108
6,
7,110
8,
9,
10,
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

var df = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("data.csv")

var dfNulls = df.filter(
  $"id".isNull
).withColumnRenamed(
  "row_count", "row_count_nulls"
).withColumnRenamed(
  "id", "id_nulls"
)

val dfNonNulls = df.filter(
  $"id".isNotNull
).withColumnRenamed(
  "row_count", "row_count_values"
).withColumnRenamed(
  "id", "id_values"
)

// Pair every null row with all non-null rows that come after it
dfNulls = dfNulls.join(
  dfNonNulls, $"row_count_nulls" lt $"row_count_values", "left"
).select(
  $"id_nulls", $"id_values", $"row_count_nulls", $"row_count_values"
)

// Keep only the nearest following non-null value for each null row
val window = Window.partitionBy("row_count_nulls").orderBy("row_count_values")
val dfNullFills = dfNulls.withColumn(
  "rn", row_number.over(window)
).where($"rn" === 1).drop("rn").select(
  $"row_count_nulls".alias("row_count"), $"id_values".alias("id"))

dfNullFills.union(dfNonNulls).orderBy($"row_count").show()
which results in
+---------+----+
|row_count| id|
+---------+----+
| 1| 109|
| 2| 109|
| 3| 108|
| 4| 108|
| 5| 108|
| 6| 110|
| 7| 110|
| 8|null|
| 9|null|
| 10|null|
+---------+----+
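The same backfill can also be expressed with a single forward-looking window instead of a join. A minimal sketch, reusing the df read from data.csv above (note that a window without partitionBy pulls all rows into a single partition, which is fine for small data but won't scale):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Frame from the current row to the end of the ordering:
// first(..., ignoreNulls = true) then returns the next non-null id.
val forward = Window.orderBy("row_count")
  .rowsBetween(Window.currentRow, Window.unboundedFollowing)

df.withColumn("id", first("id", ignoreNulls = true).over(forward))
  .orderBy("row_count")
  .show()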

Copy missing column values from the row above/below

I have a dataframe with index, category, and a few other columns. index and category are never empty/null, but the other columns' data can come in as null. When all the other columns in a row are null, we have to copy the values from the row above/below, based on category.
val df = Seq(
  (1, 1, null, null, null),
  (2, 1, null, null, null),
  (3, 1, null, null, null),
  (4, 1, "123.12", "124.52", "95.98"),
  (5, 1, "452.12", "478.65", "1865.12"),
  (1, 2, "2014.21", "147", "265"),
  (2, 2, "1457", "12483.00", "215.21"),
  (3, 2, null, null, null),
  (4, 2, null, null, null)
).toDF("index", "category", "col1", "col2", "col3")
scala> df.show
+-----+--------+-------+--------+-------+
|index|category| col1| col2| col3|
+-----+--------+-------+--------+-------+
| 1| 1| null| null| null|
| 2| 1| null| null| null|
| 3| 1| null| null| null|
| 4| 1| 123.12| 124.52| 95.98|
| 5| 1| 452.12| 478.65|1865.12|
| 1| 2|2014.21| 147| 265|
| 2| 2| 1457|12483.00| 215.21|
| 3| 2| null| null| null|
| 4| 2| null| null| null|
+-----+--------+-------+--------+-------+
The expected dataframe is as below:
+-----+--------+-------+--------+-------+
|index|category| col1| col2| col3|
+-----+--------+-------+--------+-------+
| 1| 1| 123.12| 124.52| 95.98| // Copied from below for same category
| 2| 1| 123.12| 124.52| 95.98| // Copied from below for same category
| 3| 1| 123.12| 124.52| 95.98|
| 4| 1| 123.12| 124.52| 95.98|
| 5| 1| 452.12| 478.65|1865.12|
| 1| 2|2014.21| 147| 265|
| 2| 2| 1457|12483.00| 215.21|
| 3| 2| 1457|12483.00| 215.21| // Copied from above for same category
| 4| 2| 1457|12483.00| 215.21| // Copied from above for same category
+-----+--------+-------+--------+-------+
Update: when several consecutive rows with nulls are possible, more advanced windows have to be used:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val cols = Seq("col1", "col2", "col3")
// Frame covering everything from the start of the category up to the current row
val beforeWindow = Window
  .partitionBy("category")
  .orderBy("index")
  .rangeBetween(Window.unboundedPreceding, Window.currentRow)
// Frame covering everything from the current row to the end of the category
val afterWindow = Window
  .partitionBy("category")
  .orderBy("index")
  .rangeBetween(Window.currentRow, Window.unboundedFollowing)
val result = cols.foldLeft(df)((updated, columnName) =>
  updated.withColumn(columnName,
    coalesce(col(columnName),
      last(columnName, ignoreNulls = true).over(beforeWindow),  // nearest non-null above
      first(columnName, ignoreNulls = true).over(afterWindow)   // nearest non-null below
    ))
)
The single-null case (at most one consecutive null per category) can be resolved with the window functions lag and lead combined with coalesce:
val cols = Seq("col1", "col2", "col3")
val categoryWindow = Window.partitionBy("category").orderBy("index")
val result = cols.foldLeft(df)((updated, columnName) =>
  updated.withColumn(columnName,
    coalesce(col(columnName),
      lag(col(columnName), 1).over(categoryWindow),   // value from the row above
      lead(col(columnName), 1).over(categoryWindow)   // value from the row below
    ))
)
result.show(false)
Output (of the advanced-window version above; with this sample data, the lag/lead variant cannot fill the runs of consecutive nulls):
+-----+--------+-------+--------+-------+
|index|category|col1   |col2    |col3   |
+-----+--------+-------+--------+-------+
|1    |1       |123.12 |124.52  |95.98  |
|2    |1       |123.12 |124.52  |95.98  |
|3    |1       |123.12 |124.52  |95.98  |
|4    |1       |123.12 |124.52  |95.98  |
|5    |1       |452.12 |478.65  |1865.12|
|1    |2       |2014.21|147     |265    |
|2    |2       |1457   |12483.00|215.21 |
|3    |2       |1457   |12483.00|215.21 |
|4    |2       |1457   |12483.00|215.21 |
+-----+--------+-------+--------+-------+

Upsert Two Dataframes in Scala

I have two data sources, both of which have opinions about the current state of the same set of entities. Either data source may contain the most current data, which may or may not be from the current date. For example:
val df1 = Seq((1, "green", "there", "2018-01-19"), (2, "yellow", "there", "2018-01-18"), (4, "yellow", "here", "2018-01-20")).toDF("id", "status", "location", "date")
val df2 = Seq((2, "red", "here", "2018-01-20"), (3, "green", "there", "2018-01-20"), (4, "green", "here", "2018-01-19")).toDF("id", "status", "location", "date")
df1.show
+---+------+--------+----------+
| id|status|location| date|
+---+------+--------+----------+
| 1| green| there|2018-01-19|
| 2|yellow| there|2018-01-18|
| 4|yellow| here|2018-01-20|
+---+------+--------+----------+
df2.show
+---+------+--------+----------+
| id|status|location| date|
+---+------+--------+----------+
| 2| red| here|2018-01-20|
| 3| green| there|2018-01-20|
| 4| green| here|2018-01-19|
+---+------+--------+----------+
I want the output to be the set of most current states for each entity:
+---+------+--------+----------+
| id|status|location| date|
+---+------+--------+----------+
| 1| green| there|2018-01-19|
| 2| red| here|2018-01-20|
| 3| green| there|2018-01-20|
| 4|yellow| here|2018-01-20|
+---+------+--------+----------+
My approach, which seems to work, is to join the two tables and then do a kind of custom coalesce operation based on date:
val joined = df1.join(df2, df1("id") === df2("id"), "outer")
+----+------+--------+----------+----+------+--------+----------+
| id|status|location| date| id|status|location| date|
+----+------+--------+----------+----+------+--------+----------+
| 1| green| there|2018-01-19|null| null| null| null|
|null| null| null| null| 3| green| there|2018-01-20|
| 4|yellow| here|2018-01-20| 4|yellow| here|2018-01-20|
| 2|yellow| there|2018-01-18| 2| red| here|2018-01-20|
+----+------+--------+----------+----+------+--------+----------+
def weirdCoal(name: String) = when(df1("date") > df2("date") || df2("date").isNull, df1(name)).otherwise(df2(name)) as name
val output = joined.select(df1.columns.map(weirdCoal): _*)
+---+------+--------+----------+
| id|status|location| date|
+---+------+--------+----------+
| 1| green| there|2018-01-19|
| 2| red| here|2018-01-20|
| 3| green| there|2018-01-20|
| 4|yellow| here|2018-01-20|
+---+------+--------+----------+
Which is the output I expect.
I can also see doing this via some kind of union / aggregation approach or with a window that partitions by id and sorts by date and takes the last row.
My question: is there an idiomatic way of doing this?
Yes, it can be done without a join, using window functions:
df1.union(df2)
.withColumn("rank", rank().over(Window.partitionBy($"id").orderBy($"date".desc)))
.filter($"rank" === 1)
.drop($"rank")
.orderBy($"id")
.show
output:
+---+------+--------+----------+
| id|status|location| date|
+---+------+--------+----------+
| 1| green| there|2018-01-19|
| 2| red| here|2018-01-20|
| 3| green| there|2018-01-20|
| 4|yellow| here|2018-01-20|
+---+------+--------+----------+
The above code partitions the data by id and keeps the row with the latest date within each id.
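One caveat worth hedging: if both sources can carry the same latest date for an id, rank() keeps both rows. A row_number() sketch that keeps exactly one row per id (the tie is broken arbitrarily unless you add more sort keys):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window.partitionBy($"id").orderBy($"date".desc)

df1.union(df2)
  .withColumn("rn", row_number().over(w)) // exactly one row per id survives
  .filter($"rn" === 1)
  .drop("rn")
  .orderBy($"id")
  .show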

How to replace empty values in a column of DataFrame?

How can I replace empty values in a column Field1 of DataFrame df?
Field1    Field2
          AA
12        BB
This command does not produce the expected result (note also that the arguments are swapped; the signature is na.fill(value, columns)):
df.na.fill("Field1",Seq("Anonymous"))
The expected result:
Field1    Field2
Anonymous AA
12        BB
You can also try this; it might handle blank/empty and null values in one go:
df.show()
+------+------+
|Field1|Field2|
+------+------+
| | AA|
| 12| BB|
| 12| null|
+------+------+
df.na.replace(Seq("Field1", "Field2"), Map("" -> null)).na.fill("Anonymous", Seq("Field2", "Field1")).show(false)
+---------+---------+
|Field1 |Field2 |
+---------+---------+
|Anonymous|AA |
|12 |BB |
|12 |Anonymous|
+---------+---------+
Fill: Returns a new DataFrame that replaces null or NaN values in
numeric columns with value.
Two things:
1. An empty string is not null or NaN, so you'll have to use a case statement for that.
2. fill seems not to work when given a text value for a numeric column.
Failing Null Replace with Fill / Text:
scala> a.show
+----+---+
| f1| f2|
+----+---+
|null| AA|
| 12| BB|
+----+---+
scala> a.na.fill("Anonymous", Seq("f1")).show
+----+---+
| f1| f2|
+----+---+
|null| AA|
| 12| BB|
+----+---+
Working Example - Using Null With All Numbers:
scala> a.show
+----+---+
| f1| f2|
+----+---+
|null| AA|
| 12| BB|
+----+---+
scala> a.na.fill(1, Seq("f1")).show
+---+---+
| f1| f2|
+---+---+
| 1| AA|
| 12| BB|
+---+---+
Failing Example (Empty String instead of Null):
scala> b.show
+---+---+
| f1| f2|
+---+---+
| | AA|
| 12| BB|
+---+---+
scala> b.na.fill(1, Seq("f1")).show
+---+---+
| f1| f2|
+---+---+
| | AA|
| 12| BB|
+---+---+
Case Statement Fix Example:
scala> b.show
+---+---+
| f1| f2|
+---+---+
| | AA|
| 12| BB|
+---+---+
scala> b.select(when(col("f1") === "", "Anonymous").otherwise(col("f1")).as("f1"), col("f2")).show
+---------+---+
| f1| f2|
+---------+---+
|Anonymous| AA|
| 12| BB|
+---------+---+
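To apply the same case-statement fix across all columns at once, here is a small sketch (it assumes every column is a string; a numeric column would need a cast before comparing against ""):
import org.apache.spark.sql.functions._

val fixed = b.select(b.columns.map { c =>
  when(col(c) === "", "Anonymous").otherwise(col(c)).as(c) // per-column case statement
}: _*)
fixed.show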
You can try the code below when you have any number of columns in the dataframe.
Note: when writing data to formats like Parquet, null-typed columns are not supported; we have to cast them.
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

val df = Seq(
  (1, ""),
  (2, "Ram"),
  (3, "Sam"),
  (4, "")
).toDF("ID", "Name")

// null-typed column; cast it (here to StringType) so formats like Parquet accept it
val inputDf = df.withColumn("NulType", lit(null).cast(StringType))
//Output
+---+----+-------+
| ID|Name|NulType|
+---+----+-------+
| 1| | null|
| 2| Ram| null|
| 3| Sam| null|
| 4| | null|
+---+----+-------+
// Replace all empty strings in the dataframe
// (note: Map("" -> "null") inserts the *string* "null"; use Map("" -> null) for a real null)
val colName = inputDf.columns // array of all column names
val data = inputDf.na.replace(colName, Map("" -> "null"))
data.show()
+---+----+-------+
| ID|Name|NulType|
+---+----+-------+
| 1|null| null|
| 2| Ram| null|
| 3| Sam| null|
| 4|null| null|
+---+----+-------+

How to fill missing values in a DataFrame?

After querying a MySQL DB and building the corresponding data frame, I am left with this:
mydata.show
+--+------+------+------+------+------+------+
|id| sport| var1| var2| var3| var4| var5|
+--+------+------+------+------+------+------+
| 1|soccer|330234| | | | |
| 2|soccer| null| null| null| null| null|
| 3|soccer|330101| | | | |
| 4|soccer| null| null| null| null| null|
| 5|soccer| null| null| null| null| null|
| 6|soccer| null| null| null| null| null|
| 7|soccer| null| null| null| null| null|
| 8|soccer|330024|330401| | | |
| 9|soccer|330055|330106| | | |
|10|soccer| null| null| null| null| null|
|11|soccer|390027| | | | |
|12|soccer| null| null| null| null| null|
|13|soccer|330101| | | | |
|14|soccer|330059| | | | |
|15|soccer| null| null| null| null| null|
|16|soccer|140242|140281| | | |
|17|soccer|330214| | | | |
|18|soccer| | | | | |
|19|soccer|330055|330196| | | |
|20|soccer|210022| | | | |
+--+------+------+------+------+------+------+
Every var column is a string (nullable = true). So I'd like to change all the empty values to null, to be able to treat empty cells and cells with null as equal, possibly without leaving the data frame for an RDD...
My approach would be to create a list of expressions. In Scala this can be done using map; in Python you'd use a list comprehension. After that, you unpack that list inside a df.select call, as in the examples below. Inside each expression, empty strings are replaced with a null value.
Scala:
val exprs = df.columns.map(x => when(col(x) === "", null).otherwise(col(x)).as(x))
df.select(exprs: _*).show()
Python:
# Creation of a dummy dataframe:
df = sc.parallelize([("", "19911201", 1, 1, 20.0),
                     ("", "19911201", 2, 1, 20.0),
                     ("hola", "19911201", 2, 1, 20.0),
                     (None, "20111201", 3, 1, 20.0)]).toDF()
df.show()

exprs = [when(col(x) == '', None).otherwise(col(x)).alias(x)
         for x in df.columns]
df.select(*exprs).show()
E.g:
+----+--------+---+---+----+
| _1| _2| _3| _4| _5|
+----+--------+---+---+----+
| |19911201| 1| 1|20.0|
| |19911201| 2| 1|20.0|
|hola|19911201| 2| 1|20.0|
|null|20111201| 3| 1|20.0|
+----+--------+---+---+----+
+----+--------+---+---+----+
| _1| _2| _3| _4| _5|
+----+--------+---+---+----+
|null|19911201| 1| 1|20.0|
|null|19911201| 2| 1|20.0|
|hola|19911201| 2| 1|20.0|
|null|20111201| 3| 1|20.0|
+----+--------+---+---+----+
One option would be to do the opposite: replace nulls with empty values (I personally hate nulls...), for which you can use the coalesce function:
import org.apache.spark.sql.functions._
val result = input.withColumn("myCol", coalesce(input("myCol"), lit("")))
To do that for multiple columns:
val cols = Seq("var1", "var2", "var3", "var4", "var5")
val result = cols.foldLeft(input) { case (df, colName) =>
  df.withColumn(colName, coalesce(df(colName), lit("")))
}