I am trying to transpose data in PySpark. I was able to transpose using a single column, but with multiple columns I am not sure how to pass parameters to the explode function.
Input format:
Output format:
Can someone please hint me with any example or reference? Thanks in advance.
Use stack to transpose as below (Spark >= 2.4):
Load the test data
val data =
"""
|PersonId | Education1CollegeName | Education1Degree | Education2CollegeName | Education2Degree |Education3CollegeName | Education3Degree
| 1 | xyz | MS | abc | Phd | pqr | BS
| 2 | POR | MS | ABC | Phd | null | null
""".stripMargin
val stringDS1 = data.split(System.lineSeparator())
.map(_.split("\\|").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString("|"))
.toSeq.toDS()
val df1 = spark.read
.option("sep", "|")
.option("inferSchema", "true")
.option("header", "true")
.option("nullValue", "null")
.csv(stringDS1)
df1.show(false)
df1.printSchema()
/**
* +--------+---------------------+----------------+---------------------+----------------+---------------------+----------------+
* |PersonId|Education1CollegeName|Education1Degree|Education2CollegeName|Education2Degree|Education3CollegeName|Education3Degree|
* +--------+---------------------+----------------+---------------------+----------------+---------------------+----------------+
* |1 |xyz |MS |abc |Phd |pqr |BS |
* |2 |POR |MS |ABC |Phd |null |null |
* +--------+---------------------+----------------+---------------------+----------------+---------------------+----------------+
*
* root
* |-- PersonId: integer (nullable = true)
* |-- Education1CollegeName: string (nullable = true)
* |-- Education1Degree: string (nullable = true)
* |-- Education2CollegeName: string (nullable = true)
* |-- Education2Degree: string (nullable = true)
* |-- Education3CollegeName: string (nullable = true)
* |-- Education3Degree: string (nullable = true)
*/
Un-pivot the table using stack
df1.selectExpr("PersonId",
"stack(3, Education1CollegeName, Education1Degree, Education2CollegeName, Education2Degree, " +
"Education3CollegeName, Education3Degree) as (CollegeName, EducationDegree)")
.where("CollegeName is not null and EducationDegree is not null")
.show(false)
/**
* +--------+-----------+---------------+
* |PersonId|CollegeName|EducationDegree|
* +--------+-----------+---------------+
* |1 |xyz |MS |
* |1 |abc |Phd |
* |1 |pqr |BS |
* |2 |POR |MS |
* |2 |ABC |Phd |
* +--------+-----------+---------------+
*/
A cleaned-up PySpark version of this:
from pyspark.sql import functions as F
df_a = spark.createDataFrame(
    [(1, 'xyz', 'MS', 'abc', 'Phd', 'pqr', 'BS'), (2, 'POR', 'MS', 'ABC', 'Phd', '', '')],
    ["id", "Education1CollegeName", "Education1Degree", "Education2CollegeName", "Education2Degree", "Education3CollegeName", "Education3Degree"])
+---+---------------------+----------------+---------------------+----------------+---------------------+----------------+
| id|Education1CollegeName|Education1Degree|Education2CollegeName|Education2Degree|Education3CollegeName|Education3Degree|
+---+---------------------+----------------+---------------------+----------------+---------------------+----------------+
| 1| xyz| MS| abc| Phd| pqr| BS|
| 2| POR| MS| ABC| Phd| | |
+---+---------------------+----------------+---------------------+----------------+---------------------+----------------+
Code -
df = df_a.selectExpr("id", "stack(3, Education1CollegeName, Education1Degree, Education2CollegeName, Education2Degree, Education3CollegeName, Education3Degree) as (B, C)")
+---+---+---+
| id| B| C|
+---+---+---+
| 1|xyz| MS|
| 1|abc|Phd|
| 1|pqr| BS|
| 2|POR| MS|
| 2|ABC|Phd|
| 2| | |
+---+---+---+
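Person 2's missing third education comes through here as empty strings rather than nulls, so, mirroring the null filter in the Scala version above, you may want to drop those rows. A minimal sketch, using the df produced above:
from pyspark.sql import functions as F
# keep only rows where both stacked values are non-empty
df_clean = df.where((F.col("B") != "") & (F.col("C") != ""))
df_clean.show()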
Related
I have a dataframe like this
+------+-------------+
|A     |Devices      |
+------+-------------+
|house1|[100,101,102]|
|house1|[103,104]    |
+------+-------------+
And I want to explode the column 'Devices' into multiple rows. My final dataframe should look like this
+------+-------+
|A     |Devices|
+------+-------+
|house1|100    |
|house1|101    |
|house1|102    |
|house1|103    |
|house1|104    |
+------+-------+
The schema of the table is
root
|-- A: String (nullable = true)
|-- Devices: array (nullable = true)
| |-- element: String (containsNull = true)
I tried doing this, but it shows an error (UnresolvedAttribute on $"Devices"):
Df.withColumn("c", explode(split($"Devices","\\,")))
Df.select(col("A"), explode(col("Devices")))
Using this I am able to get the required answer.
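Since the thread is mostly about PySpark, a hedged equivalent (assuming the DataFrame is named df): Devices is already an array column, so explode alone is enough and no split is needed.
from pyspark.sql import functions as F
# explode the array column directly; each element becomes its own row
df.select(F.col("A"), F.explode(F.col("Devices")).alias("Devices")).show()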
I am trying to get this complex data into a normal DataFrame format.
My data schema:
root
|-- column_names: array (nullable = true)
| |-- element: string (containsNull = true)
|-- values: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
|-- id: array (nullable = true)
| |-- element: string (containsNull = true)
My data file (JSON format):
{"column_names":["2_col_name","3_col_name"],"id":["a","b","c","d","e"],"values":[["2_col_1",1],["2_col_2",2],["2_col_3",9],["2_col_4",10],["2_col_5",11]]}
I am trying to convert the above data into this format:
+----------+----------+----------+
|1_col_name|2_col_name|3_col_name|
+----------+----------+----------+
| a| 2_col_1| 1|
| b| 2_col_2| 2|
| c| 2_col_3| 9|
| d| 2_col_4| 10|
| e| 2_col_5| 11|
+----------+----------+----------+
I tried using the explode function on id and values, but got a different output, as below:
+---+-------------+
| id| values|
+---+-------------+
| a| [2_col_1, 1]|
| a| [2_col_2, 2]|
| a| [2_col_3, 9]|
| a|[2_col_4, 10]|
| a|[2_col_5, 11]|
| b| [2_col_1, 1]|
| b| [2_col_2, 2]|
| b| [2_col_3, 9]|
| b|[2_col_4, 10]|
+---+-------------+
only showing top 9 rows
Not sure what I am doing wrong.
You can use the arrays_zip + inline functions to flatten, then pivot the column names:
val df1 = df.select(
$"column_names",
expr("inline(arrays_zip(id, values))")
).select(
$"id".as("1_col_name"),
expr("inline(arrays_zip(column_names, values))")
)
.groupBy("1_col_name")
.pivot("column_names")
.agg(first("values"))
df1.show
//+----------+----------+----------+
//|1_col_name|2_col_name|3_col_name|
//+----------+----------+----------+
//|e |2_col_5 |11 |
//|d |2_col_4 |10 |
//|c |2_col_3 |9 |
//|b |2_col_2 |2 |
//|a |2_col_1 |1 |
//+----------+----------+----------+
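A PySpark sketch of the same idea, assuming the input DataFrame is named df and has the schema shown above (arrays_zip and inline are used through SQL expressions, available from Spark 2.4):
from pyspark.sql import functions as F

df1 = (df
    # pair each id with its corresponding [2_col, 3_col] value array
    .select("column_names", F.expr("inline(arrays_zip(id, values))"))
    # pair each column name with its value, keeping id as 1_col_name
    .select(F.col("id").alias("1_col_name"), F.expr("inline(arrays_zip(column_names, values))"))
    # pivot the column names back into real columns
    .groupBy("1_col_name")
    .pivot("column_names")
    .agg(F.first("values")))
df1.show()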
Split the timestamp based on hours in Spark
1,2019-04-01 04:00:21,12
1,2019-04-01 06:01:22,34
1,2019-04-01 09:21:23,10
1,2019-04-01 11:23:09,15
1,2019-04-01 12:02:10,15
1,2019-04-01 15:00:21,10
1,2019-04-01 18:00:22,10
1,2019-04-01 19:30:22,30
1,2019-04-01 20:22:30,30
1,2019-04-01 22:20:30,30
1,2019-04-01 23:59:00,10
Split the timestamps by every 6 hours into 4 parts per day and sum the values.
Here I'm splitting like 0-6 AM, 6 AM-12 PM, etc.
1,2019-04-01,12
1,2019-04-01,59
1,2019-04-01,25
1,2019-04-01,110
Try this:
Load the test data
spark.conf.set("spark.sql.session.timeZone", "UTC")
val data =
"""
|c1,c2,c3
|1,2019-04-01 04:00:21,12
|1,2019-04-01 06:01:22,34
|1,2019-04-01 09:21:23,10
|1,2019-04-01 11:23:09,15
|1,2019-04-01 12:02:10,15
|1,2019-04-01 15:00:21,10
|1,2019-04-01 18:00:22,10
|1,2019-04-01 19:30:22,30
|1,2019-04-01 20:22:30,30
|1,2019-04-01 22:20:30,30
|1,2019-04-01 23:59:00,10
""".stripMargin
val stringDS2 = data.split(System.lineSeparator())
.map(_.split("\\,").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString(","))
.toSeq.toDS()
val df2 = spark.read
.option("sep", ",")
.option("inferSchema", "true")
.option("header", "true")
.option("nullValue", "null")
.csv(stringDS2)
df2.show(false)
df2.printSchema()
/**
* +---+-------------------+---+
* |c1 |c2 |c3 |
* +---+-------------------+---+
* |1 |2019-03-31 22:30:21|12 |
* |1 |2019-04-01 00:31:22|34 |
* |1 |2019-04-01 03:51:23|10 |
* |1 |2019-04-01 05:53:09|15 |
* |1 |2019-04-01 06:32:10|15 |
* |1 |2019-04-01 09:30:21|10 |
* |1 |2019-04-01 12:30:22|10 |
* |1 |2019-04-01 14:00:22|30 |
* |1 |2019-04-01 14:52:30|30 |
* |1 |2019-04-01 16:50:30|30 |
* |1 |2019-04-01 18:29:00|10 |
* +---+-------------------+---+
*
* root
* |-- c1: integer (nullable = true)
* |-- c2: timestamp (nullable = true)
* |-- c3: integer (nullable = true)
*/
Truncate the timestamp to 6-hour ranges and then groupBy().sum:
val seconds = 21600 // 6 hrs
df2.withColumn("c2_long", expr(s"floor(cast(c2 as long) / $seconds) * $seconds"))
.groupBy("c1", "c2_long")
.agg(sum($"c3").as("c3"))
.withColumn("c2", to_date(to_timestamp($"c2_long")))
.withColumn("c2_time", to_timestamp($"c2_long"))
.orderBy("c2")
.show(false)
/**
* +---+----------+---+----------+-------------------+
* |c1 |c2_long |c3 |c2 |c2_time |
* +---+----------+---+----------+-------------------+
* |1 |1554055200|12 |2019-03-31|2019-03-31 18:00:00|
* |1 |1554120000|100|2019-04-01|2019-04-01 12:00:00|
* |1 |1554076800|59 |2019-04-01|2019-04-01 00:00:00|
* |1 |1554141600|10 |2019-04-01|2019-04-01 18:00:00|
* |1 |1554098400|25 |2019-04-01|2019-04-01 06:00:00|
* +---+----------+---+----------+-------------------+
*/
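A PySpark sketch of the same bucketing, assuming df2 has been loaded with the same columns c1, c2 (timestamp), c3:
from pyspark.sql import functions as F

seconds = 6 * 60 * 60  # 6-hour buckets
result = (df2
    .withColumn("c2_long", F.floor(F.col("c2").cast("long") / seconds) * seconds)
    .groupBy("c1", "c2_long")
    .agg(F.sum("c3").alias("c3"))
    .withColumn("c2", F.to_date(F.col("c2_long").cast("timestamp")))
    .orderBy("c2"))
result.show(truncate=False)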
SCALA: The answer in the post that I commented on works very well.
df.groupBy($"id", window($"time", "6 hours").as("time"))
.agg(sum("count").as("count"))
.orderBy("time.start")
.select($"id", to_date($"time.start").as("time"), $"count")
.show(false)
+---+----------+-----+
|id |time |count|
+---+----------+-----+
|1 |2019-04-01|12 |
|1 |2019-04-01|59 |
|1 |2019-04-01|25 |
|1 |2019-04-01|110 |
+---+----------+-----+
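For completeness, a hedged PySpark equivalent of this window-based approach, assuming a DataFrame df with columns id, time (timestamp), and count:
from pyspark.sql import functions as F

result = (df
    .groupBy("id", F.window("time", "6 hours").alias("time"))   # 6-hour tumbling windows
    .agg(F.sum("count").alias("count"))
    .orderBy("time.start")
    .select("id", F.to_date("time.start").alias("time"), "count"))
result.show(truncate=False)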
Table1:
class male female
1 2 1
2 0 2
3 2 0
Table2:
class gender
1 m
1 f
1 m
2 f
2 f
3 m
3 m
Using Spark with Scala, take the data from table1 and dump it into another table in the format of table2 as shown. Also, please show how to do the reverse.
Please help me with this. Thanks in advance.
You can use a udf and the explode function like below.
import org.apache.spark.sql.functions._
import spark.implicits._
val df=Seq((1,2,1),(2,0,2),(3,2,0)).toDF("class","male","female")
//Input Df
+-----+----+------+
|class|male|female|
+-----+----+------+
| 1| 2| 1|
| 2| 0| 2|
| 3| 2| 0|
+-----+----+------+
val getGenderUdf = udf((x: Int, y: Int) => List.fill(x)("m") ++ List.fill(y)("f"))
val df1 = df.withColumn("gender", getGenderUdf(df.col("male"), df.col("female")))
  .drop("male", "female")
  .withColumn("gender", explode($"gender"))
df1.show()
+-----+------+
|class|gender|
+-----+------+
| 1| m|
| 1| m|
| 1| f|
| 2| f|
| 2| f|
| 3| m|
| 3| m|
+-----+------+
Reverse of df1
val df2 = df1.groupBy("class").pivot("gender").agg(count("gender"))
  .na.fill(0)
  .withColumnRenamed("m", "male")
  .withColumnRenamed("f", "female")
df2.show()
//Sample Output:
+-----+------+----+
|class|female|male|
+-----+------+----+
| 1| 1| 2|
| 3| 0| 2|
| 2| 2| 0|
+-----+------+----+
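A PySpark sketch of the same udf + explode idea and its reverse (the names get_gender, df1, and df2 are just illustrative):
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

df = spark.createDataFrame([(1, 2, 1), (2, 0, 2), (3, 2, 0)], ["class", "male", "female"])

# build a list of 'm'/'f' tokens from the two counts, then explode it into rows
get_gender = F.udf(lambda m, f: ["m"] * m + ["f"] * f, ArrayType(StringType()))
df1 = df.select("class", F.explode(get_gender("male", "female")).alias("gender"))
df1.show()

# reverse: count the genders per class and pivot them back into male/female columns
df2 = (df1.groupBy("class").pivot("gender").agg(F.count("gender"))
       .na.fill(0)
       .withColumnRenamed("m", "male")
       .withColumnRenamed("f", "female"))
df2.show()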
val inDF = Seq((1,2,1),
(2, 0, 2),
(3, 2, 0)).toDF("class", "male", "female")
val testUdf = udf((m: Int, f: Int) => {
val ml = 1.to(m).map(_ => "m")
val fml = 1.to(f).map(_ => "f")
ml ++ fml
})
val df1 = inDF.withColumn("mf", testUdf('male, 'female))
.drop("male", "female")
.select('class, explode('mf).alias("gender"))
Perhaps this is helpful - without a UDF (Spark >= 2.4).
Load the test data provided
val data =
"""
|class | male | female
|1 | 2 | 1
|2 | 0 | 2
|3 | 2 | 0
""".stripMargin
val stringDS1 = data.split(System.lineSeparator())
.map(_.split("\\|").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString(","))
.toSeq.toDS()
val df1 = spark.read
.option("sep", ",")
.option("inferSchema", "true")
.option("header", "true")
.option("nullValue", "null")
.csv(stringDS1)
df1.show(false)
df1.printSchema()
/**
* +-----+----+------+
* |class|male|female|
* +-----+----+------+
* |1 |2 |1 |
* |2 |0 |2 |
* |3 |2 |0 |
* +-----+----+------+
*
* root
* |-- class: integer (nullable = true)
* |-- male: integer (nullable = true)
* |-- female: integer (nullable = true)
*/
compute the gender array and explode
df1.select($"class",
when($"male" >= 1, sequence(lit(1), col("male"))).otherwise(array()).as("male"),
when($"female" >= 1, sequence(lit(1), col("female"))).otherwise(array()).as("female")
).withColumn("male", expr("TRANSFORM(male, x -> 'm')"))
.withColumn("female", expr("TRANSFORM(female, x -> 'f')"))
.withColumn("gender", explode(concat($"male", $"female")))
.select("class", "gender")
.show(false)
/**
* +-----+------+
* |class|gender|
* +-----+------+
* |1 |m |
* |1 |m |
* |1 |f |
* |2 |f |
* |2 |f |
* |3 |m |
* |3 |m |
* +-----+------+
*/
Vice versa (here df2 refers to the class/gender DataFrame produced above):
df2.groupBy("class").agg(collect_list("gender").as("gender"))
.withColumn("male", expr("size(FILTER(gender, x -> x='m'))"))
.withColumn("female", expr("size(FILTER(gender, x -> x='f'))"))
.select("class", "male", "female")
.orderBy("class")
.show(false)
/**
* +-----+----+------+
* |class|male|female|
* +-----+----+------+
* |1 |2 |1 |
* |2 |0 |2 |
* |3 |2 |0 |
* +-----+----+------+
*/
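The same higher-order-function approach carries over to PySpark almost unchanged, since TRANSFORM and FILTER are used through expr. A sketch against the df1 loaded above, with gender_df and counts_df as illustrative names:
from pyspark.sql import functions as F

# forward: class/male/female counts -> one gender row per person
gender_df = (df1
    .select("class",
            F.when(F.col("male") >= 1, F.sequence(F.lit(1), F.col("male"))).otherwise(F.expr("array()")).alias("male"),
            F.when(F.col("female") >= 1, F.sequence(F.lit(1), F.col("female"))).otherwise(F.expr("array()")).alias("female"))
    .withColumn("male", F.expr("TRANSFORM(male, x -> 'm')"))
    .withColumn("female", F.expr("TRANSFORM(female, x -> 'f')"))
    .withColumn("gender", F.explode(F.concat("male", "female")))
    .select("class", "gender"))

# reverse: gender rows -> male/female counts per class
counts_df = (gender_df.groupBy("class").agg(F.collect_list("gender").alias("gender"))
    .withColumn("male", F.expr("size(FILTER(gender, x -> x = 'm'))"))
    .withColumn("female", F.expr("size(FILTER(gender, x -> x = 'f'))"))
    .select("class", "male", "female")
    .orderBy("class"))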
I have 2 dataframes:
val df1 = sc.parallelize(Seq((123, 2.23, 1.12), (234, 2.45, 0.12), (456, 1.112, 0.234))).toDF("objid", "ra", "dec")
val df2 = sc.parallelize(Seq((4567, 123, "name1", "val1"), (2322, 456, "name2", "val2"), (3324, 555, "name3", "val3"), (5556, 123, "name4", "val4"), (3345, 123, "name5", "val5"))).toDF("specid", "objid", "name", "value")
They look like below:
df1.show()
+-----+-----+-----+
|objid| ra| dec|
+-----+-----+-----+
| 123| 2.23| 1.12|
| 234| 2.45| 0.12|
| 456|1.112|0.234|
+-----+-----+-----+
df2.show()
+------+-----+-----+-----+
|specid|objid| name|value|
+------+-----+-----+-----+
| 4567| 123|name1| val1|
| 2322| 456|name2| val2|
| 3324| 555|name3| val3|
| 5556| 123|name4| val4|
| 3345| 123|name5| val5|
+------+-----+-----+-----+
Now I want to nest df2 inside df1 as a nested column so the schema should look like below:
val new_schema = df1.schema.add("specs", df2.schema)
new_schema: org.apache.spark.sql.types.StructType = StructType(StructField(objid,IntegerType,false), StructField(ra,DoubleType,false), StructField(dec,DoubleType,false), StructField(specs,StructType(StructField(specid,IntegerType,false), StructField(objid,IntegerType,false), StructField(name,StringType,true), StructField(value,StringType,true)),true))
The reason I wanted to do it this way is that there is a one-to-many relationship between df1 and df2, which means there is more than one spec per objid. And I am not going to join only these two tables; there are about 50 tables that I ultimately want to join together to create a mega table. Most of those tables have 1-to-n relationships, and I was thinking about a way to avoid having a lot of duplicate rows and null cells in the final join result.
The ultimate result would look something like:
+-----+-----+-----+----------------------------------------------------------------+
|objid|   ra|  dec|specs (specid, name, value)                                     |
+-----+-----+-----+----------------------------------------------------------------+
|  123| 2.23| 1.12|[(4567, name1, val1), (5556, name4, val4), (3345, name5, val5)] |
|  234| 2.45| 0.12|[]                                                              |
|  456|1.112|0.234|[(2322, name2, val2)]                                           |
+-----+-----+-----+----------------------------------------------------------------+
I was trying to add the column to df1 using .withColumn but ran into errors.
What I actually wanted to do was select all the columns from df2 where df2.objid = df1.objid, and make that the new column in df1, but I am not sure if that's the best approach. Even if it is, I am not sure how to do it.
Could someone please tell me how to do this?
To my knowledge, you cannot have a DataFrame inside another DataFrame (the same is true for RDDs).
What you need is a join between the two DataFrames. You can perform different types of joins and combine the rows from the two DataFrames (this is where you effectively nest df2's columns inside df1).
You need to join both DataFrames on the column objid, like below:
val join = df1.join(df2, "objid")
join.printSchema()
output:
root
|-- objid: integer (nullable = false)
|-- ra: double (nullable = false)
|-- dec: double (nullable = false)
|-- specid: integer (nullable = false)
|-- name: string (nullable = true)
|-- value: string (nullable = true)
and when we say
join.show()
the output will be
+-----+-----+-----+------+-----+-----+
|objid| ra| dec|specid| name|value|
+-----+-----+-----+------+-----+-----+
| 456|1.112|0.234| 2322|name2| val2|
| 123| 2.23| 1.12| 4567|name1| val1|
+-----+-----+-----+------+-----+-----+
for more details you can check here
Update:
I think you are looking for something like this
df1.join(df2, df1("objid") === df2("objid"), "left_outer").show()
and the output is:
+-----+-----+-----+------+-----+-----+-----+
|objid| ra| dec|specid|objid| name|value|
+-----+-----+-----+------+-----+-----+-----+
| 456|1.112|0.234| 2322| 456|name2| val2|
| 234| 2.45| 0.12| null| null| null| null|
| 123| 2.23| 1.12| 4567| 123|name1| val1|
| 123| 2.23| 1.12| 5556| 123|name4| val4|
| 123| 2.23| 1.12| 3345| 123|name5| val5|
+-----+-----+-----+------+-----+-----+-----+
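If you are working from PySpark instead, the same left outer join would look roughly like this (assuming df1 and df2 have been created as PySpark DataFrames with the same columns):
df1.join(df2, df1["objid"] == df2["objid"], "left_outer").show()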