Create a Vertical Table in Spark 2 [duplicate] - scala

This question already has answers here:
How to melt Spark DataFrame?
(6 answers)
Closed 4 years ago.
How do I create a vertical table in Spark 2 SQL?
I am building an ETL pipeline with Spark 2 / SQL / Scala. I have data in a normal (wide) table structure like this:
Input Table:
| ID | A | B | C | D |
| 1 | A1 | B1 | C1 | D1 |
| 2 | A2 | B2 | C2 | D2 |
Output Table:
| ID | Key | Val |
| 1 | A | A1 |
| 1 | B | B1 |
| 1 | C | C1 |
| 1 | D | D1 |
| 2 | A | A2 |
| 2 | B | B2 |
| 2 | C | C2 |
| 2 | D | D2 |

This could do the trick as well:
Input Data:
+---+---+---+---+---+
|ID |A |B |C |D |
+---+---+---+---+---+
|1 |A1 |B1 |C1 |D1 |
|2 |A2 |B2 |C2 |D2 |
|3 |A3 |B3 |C3 |D3 |
+---+---+---+---+---+
Zip the column names with their positions (the number of columns to be included):
import spark.implicits._  // needed for the flatMap encoder and toDF
val cols = Seq("A", "B", "C", "D") zip Range(0, 4, 1)  // Seq(("A",0), ("B",1), ("C",2), ("D",3))
df.flatMap(r => cols.map(i => (r.getString(0), i._1, r.getString(i._2 + 1))))
  .toDF("ID", "KEY", "VALUE").show()
Result should look like this:
+---+---+-----+
| ID|KEY|VALUE|
+---+---+-----+
| 1| A| A1|
| 1| B| B1|
| 1| C| C1|
| 1| D| D1|
| 2| A| A2|
| 2| B| B2|
| 2| C| C2|
| 2| D| D2|
| 3| A| A3|
| 3| B| B3|
| 3| C| C3|
| 3| D| D3|
+---+---+-----+
Good Luck!!
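If you would rather stay in the SQL expression layer, Spark's built-in stack generator expresses the same melt; a minimal sketch, assuming the same df with a string ID column and columns A to D:
// stack(n, key1, val1, key2, val2, ...) emits n (KEY, VALUE) rows per input row
df.selectExpr(
  "ID",
  "stack(4, 'A', A, 'B', B, 'C', C, 'D', D) as (KEY, VALUE)"
).show()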

Related

Add column elements to a Dataframe Scala Spark

I have two dataframes, and I want to add the rows of the second one to every id of the first one.
My dataframes are like:
id | name | rate
1 | a | 3
1 | b | 4
1 | c | 1
2 | a | 2
2 | d | 4
and the second one is:
name
a
b
c
d
e
And I want a result like this:
id | name | rate
1 | a | 3
1 | b | 4
1 | c | 1
1 | d | null
1 | e | null
2 | a | 2
2 | b | null
2 | c | null
2 | d | 4
2 | e | null
How can I do this?
It seems it's more than a simple join.
// cross join every distinct id with every name, then left join back to pick up the existing rates
val df = df1.select("id").distinct().crossJoin(df2)
  .join(df1, Seq("name", "id"), "left")
  .orderBy("id", "name")

df.show
+----+---+----+
|name| id|rate|
+----+---+----+
| a| 1| 3|
| b| 1| 4|
| c| 1| 1|
| d| 1|null|
| e| 1|null|
| a| 2| 2|
| b| 2|null|
| c| 2|null|
| d| 2| 4|
| e| 2|null|
+----+---+----+
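If you want to reproduce this, a minimal sketch of how the two input dataframes could be built from the sample data above (df1 and df2 are the names the snippet assumes):
import spark.implicits._  // for toDF on local Seqs

val df1 = Seq((1, "a", 3), (1, "b", 4), (1, "c", 1), (2, "a", 2), (2, "d", 4))
  .toDF("id", "name", "rate")
val df2 = Seq("a", "b", "c", "d", "e").toDF("name")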

Pyspark - advanced aggregation of monthly data

I have a table of the following format.
+----------+-------+-------+
| Customer | Month | Sales |
+----------+-------+-------+
| A        | 3     | 40    |
| A        | 2     | 50    |
| B        | 1     | 20    |
+----------+-------+-------+
I need it in the format below:
+----------+---------+---------+---------+
| Customer | Month 1 | Month 2 | Month 3 |
+----------+---------+---------+---------+
| A        | 0       | 50      | 40      |
| B        | 20      | 0       | 0       |
+----------+---------+---------+---------+
Can you please help me solve this problem in PySpark?
This should help. I am assuming you are using SUM to aggregate the values from the original DF:
>>> df.show()
+--------+-----+-----+
|Customer|Month|Sales|
+--------+-----+-----+
| A| 3| 40|
| A| 2| 50|
| B| 1| 20|
+--------+-----+-----+
>>> import pyspark.sql.functions as F
>>> df2=(df.withColumn('COLUMN_LABELS',F.concat(F.lit('Month '),F.col('Month')))
.groupby('Customer')
.pivot('COLUMN_LABELS')
.agg(F.sum('Sales'))
.fillna(0))
>>> df2.show()
+--------+-------+-------+-------+
|Customer|Month 1|Month 2|Month 3|
+--------+-------+-------+-------+
| A| 0| 50| 40|
| B| 20| 0| 0|
+--------+-------+-------+-------+
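Since the rest of this thread is in Scala, here is a rough Scala counterpart of the same pivot, a sketch that assumes a df with the Customer/Month/Sales columns shown above:
import org.apache.spark.sql.functions._

val df2 = df
  .withColumn("COLUMN_LABELS", concat(lit("Month "), col("Month")))  // e.g. "Month 3"
  .groupBy("Customer")
  .pivot("COLUMN_LABELS")   // one output column per label
  .agg(sum("Sales"))
  .na.fill(0)               // missing months become 0

df2.show()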

How to find the max length unique rows from a dataframe with spark?

I am trying to find, for each id, the unique row with the maximum-length values in a Spark dataframe. Every column is of string type.
The dataframe is like:
+---+-----+-----+-----+-----+
| id|    A|    B|    C|    D|
+---+-----+-----+-----+-----+
|  1| toto| tata| titi|     |
|  1| toto| tata| titi| tutu|
|  2|  bla|  blo|     |     |
|  3|    b|    c|     |    d|
|  3|    b|    c|    a|    d|
+---+-----+-----+-----+-----+
The expectation is:
+---+-----+-----+-----+-----+
| id|    A|    B|    C|    D|
+---+-----+-----+-----+-----+
|  1| toto| tata| titi| tutu|
|  2|  bla|  blo|     |     |
|  3|    b|    c|    a|    d|
+---+-----+-----+-----+-----+
I can't figure out how to do this easily with Spark...
Thanks in advance.
Note: This approach copes with columns being added to or removed from the DataFrame, without any code change.
It works by first computing the length of all columns (except the first one) concatenated together, and then keeping, per id, only the row(s) with the maximum length.
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._

val output = input
  .withColumn("rowLength", length(concat(input.columns.toList.drop(1).map(col): _*)))  // length of every column except id, concatenated
  .withColumn("maxLength", max($"rowLength").over(Window.partitionBy($"id")))          // longest row per id
  .filter($"rowLength" === $"maxLength")
  .drop("rowLength", "maxLength")
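To try this out, the sample rows above can be loaded into a dataframe named input (an assumed setup, with empty strings for the blank cells):
import spark.implicits._

val input = Seq(
  ("1", "toto", "tata", "titi", ""),
  ("1", "toto", "tata", "titi", "tutu"),
  ("2", "bla",  "blo",  "",     ""),
  ("3", "b",    "c",    "",     "d"),
  ("3", "b",    "c",    "a",    "d")
).toDF("id", "A", "B", "C", "D")

With input defined this way, the snippet above keeps only the longest row per id, matching the expected output.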
scala> df.show
+---+----+----+----+----+
| id| A| B| C| D|
+---+----+----+----+----+
| 1|toto|tata|titi| |
| 1|toto|tata|titi|tutu|
| 2| bla| blo| | |
| 3| b| c| | d|
| 3| b| c| a| d|
+---+----+----+----+----+
scala> df.groupBy("id").agg(
     |   concat_ws("", collect_set(col("A"))).alias("A"),
     |   concat_ws("", collect_set(col("B"))).alias("B"),
     |   concat_ws("", collect_set(col("C"))).alias("C"),
     |   concat_ws("", collect_set(col("D"))).alias("D")
     | ).show
+---+----+----+----+----+
| id| A| B| C| D|
+---+----+----+----+----+
| 1|toto|tata|titi|tutu|
| 2| bla| blo| | |
| 3| b| c| a| d|
+---+----+----+----+----+
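The same aggregation can also be built programmatically over every column except id, so it keeps working when columns are added or removed; a rough sketch:
import org.apache.spark.sql.functions._

// one concat_ws(collect_set(...)) aggregation per non-id column
val aggs = df.columns.filterNot(_ == "id")
  .map(c => concat_ws("", collect_set(col(c))).alias(c))

df.groupBy("id").agg(aggs.head, aggs.tail: _*).show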

remove first character of a spark string column

I wonder, as I said in the title, how to remove the first character of a Spark string column, for the two following cases:
val myDF1 = Seq(("£14326"),("£1258634"),("£15626"),("£163262")).toDF("A")
val myDF2 = Seq(("a14326"),("c1258634"),("t15626"),("f163262")).toDF("A")
myDF1.show
myDF2.show
+--------+
| A|
+--------+
|£14326 |
|£1258634|
|£15626 |
|£163262 |
+--------+
+--------+
| A |
+--------+
|a14326 |
|c1258634|
|t15626 |
|f163262 |
+--------+
I would like to obtain:
+--------+-------+
| A| B|
+--------+-------+
|£14326 | 14326|
|£1258634|1258634|
|£15626 | 15626|
|£163262 | 163262|
+--------+-------+
+--------+-------+
| A| B|
+--------+-------+
|a14326 |14326 |
|c1258634|1258634|
|t15626 |15626 |
|f163262 |163262 |
+--------+-------+
Do you have any idea?
You can do something like this.
myDF1.show
+------+
| A|
+------+
|£14326|
|£12586|
|£15626|
|£16326|
+------+
myDF1.withColumn("B", expr("substring(A, 2, length(A))")).show
+------+-----+
| A| B|
+------+-----+
|£14326|14326|
|£12586|12586|
|£15626|15626|
|£16326|16326|
+------+-----+
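If you would rather avoid expr, regexp_replace does the same thing with a regex anchored at the start of the string; a small sketch over the two example frames from the question:
import org.apache.spark.sql.functions._

// "^." matches exactly the first character, which is replaced with nothing
myDF1.withColumn("B", regexp_replace(col("A"), "^.", "")).show
myDF2.withColumn("B", regexp_replace(col("A"), "^.", "")).show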

Spark SQL window function look ahead and complex function

I have the following data:
+-----+----+-----+
|event|t |type |
+-----+----+-----+
| A |20 | 1 |
| A |40 | 1 |
| B |10 | 1 |
| B |20 | 1 |
| B |120 | 1 |
| B |140 | 1 |
| B |320 | 1 |
| B |340 | 1 |
| B |360 | 7 |
| B |380 | 1 |
+-----+----+-----+
And what I want is something like this:
+-----+----+----+
|event|t |grp |
+-----+----+----+
| A |20 |1 |
| A |40 |1 |
| B |10 |2 |
| B |20 |2 |
| B |120 |3 |
| B |140 |3 |
| B |320 |4 |
| B |340 |4 |
| B |380 |5 |
+-----+----+----+
Rules:
Group together all values (column t) that belong to the same event; a gap of more than 50 ms to the previous value starts a new group.
When a row of type 7 appears, start a new group there as well and remove that row (see the last row).
I can achieve the first rule with the answer from this thread:
Code:
val windowSpec = Window.partitionBy("event").orderBy("t")

val newSession = (coalesce(
    $"t" - lag($"t", 1).over(windowSpec),
    lit(0)
  ) > 50).cast("bigint")

val sessionized = df.withColumn("session", sum(newSession).over(windowSpec))
I have to say I can't figure out how it works, and I don't know how to modify it so that rule 2 also works...
Hope someone can give me some useful hints.
What I tried:
val newSession = (coalesce(
    $"t" - lag($"t", 1).over(windowSpec),
    lit(0)
  ) > 50 || lead($"type", 1).over(windowSpec) =!= 7).cast("bigint")
But this only produced an error: "Must follow method; cannot follow org.apache.spark.sql.Column".
This should do the trick:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._  // for the $ syntax

val win = Window.partitionBy($"event").orderBy($"t")

val newSession = (coalesce(
    $"t" - lag($"t", 1).over(win),
    lit(0)
  ) > 50
  or $"type" === 7)  // also start a new group in this case
  .cast("bigint")

df.withColumn("session", sum(newSession).over(win))
  .where($"type" =!= 7)  // remove these rows
  .orderBy($"event", $"t")
  .show
gives:
+-----+---+----+-------+
|event| t|type|session|
+-----+---+----+-------+
| A| 20| 1| 0|
| A| 40| 1| 0|
| B| 10| 1| 0|
| B| 20| 1| 0|
| B|120| 1| 1|
| B|140| 1| 1|
| B|320| 1| 2|
| B|340| 1| 2|
| B|380| 1| 3|
+-----+---+----+-------+
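For reference, a minimal sketch of how the sample dataframe might be created (the column types are an assumption) so the snippet above can be run end to end:
import spark.implicits._

val df = Seq(
  ("A", 20, 1), ("A", 40, 1),
  ("B", 10, 1), ("B", 20, 1),
  ("B", 120, 1), ("B", 140, 1),
  ("B", 320, 1), ("B", 340, 1),
  ("B", 360, 7), ("B", 380, 1)
).toDF("event", "t", "type")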