Lateral view explode strange behaviour - pyspark

I am concatenating two array columns and converting the result back to an array. But when I apply explode on it, nothing happens. I am using Spark 2.3. Is there anything weird here?
df = spark.createDataFrame([(1,25,['A','B','B','C'],['A','B','B','C']),(1,20,['A','A','B','C'],['A','B','B','C']),(1,20,['A','C','B','C'],['A','B','B','C']),(2,26,['X','Y','Z','C'],['A','B','B','C'])],['id','age','one','two'])
+---+---+------------+------------+
| id|age|         one|         two|
+---+---+------------+------------+
|  1| 25|[A, B, B, C]|[A, B, B, C]|
|  1| 20|[A, A, B, C]|[A, B, B, C]|
|  1| 20|[A, C, B, C]|[A, B, B, C]|
|  2| 26|[X, Y, Z, C]|[A, B, B, C]|
+---+---+------------+------------+
>>> df.createOrReplaceTempView('df')
>>> df2 = spark.sql('''select id,age, array(concat_ws(',', one, two)) as three from df''')
>>> df2.show()
+---+---+-----------------+
| id|age|            three|
+---+---+-----------------+
|  1| 25|[A,B,B,C,A,B,B,C]|
|  1| 20|[A,A,B,C,A,B,B,C]|
|  1| 20|[A,C,B,C,A,B,B,C]|
|  2| 26|[X,Y,Z,C,A,B,B,C]|
+---+---+-----------------+
>>> df2.createOrReplaceTempView('df2')
>>> spark.sql('''select id, age, four from df2 lateral view explode(three) tbl as four''').show()  # not exploding
+---+---+---------------+
| id|age|           four|
+---+---+---------------+
|  1| 25|A,B,B,C,A,B,B,C|
|  1| 20|A,A,B,C,A,B,B,C|
|  1| 20|A,C,B,C,A,B,B,C|
|  2| 26|X,Y,Z,C,A,B,B,C|
+---+---+---------------+
Please note that I can make it work by
>>> df2 = spark.sql('''select id,age, split(concat_ws(',', one, two),',') as three from df''')
But just wondering why the first approach is not working.

concat_ws creates a single string column and not an array:
from pyspark.sql import functions as F

df.select(F.size(df.one)).show()
df2.select(F.size(df2.three)).show()
Output:
+---------+
|size(one)|
+---------+
|        4|
|        4|
|        4|
|        4|
+---------+
+-----------+
|size(three)|
+-----------+
|          1|
|          1|
|          1|
|          1|
+-----------+
That means your array has just one element:
df2.select(df2.three.getItem(0)).show()
df2.select(df2.three.getItem(1)).show()
df2.printSchema()
Output:
+---------------+
|       three[0]|
+---------------+
|A,B,B,C,A,B,B,C|
|A,A,B,C,A,B,B,C|
|A,C,B,C,A,B,B,C|
|X,Y,Z,C,A,B,B,C|
+---------------+
+--------+
|three[1]|
+--------+
|    null|
|    null|
|    null|
|    null|
+--------+
root
|-- id: long (nullable = true)
|-- age: long (nullable = true)
|-- three: array (nullable = false)
| |-- element: string (containsNull = false)
So what you should actually use is concat, which concatenates arrays on Spark >= 2.4:
df3 = spark.sql('''select id,age, concat(one, two) as three from df''')
df3.show(truncate=False)
df3.printSchema()
df3.select(df3.three.getItem(0)).show()
df3.select(df3.three.getItem(1)).show()
Output:
+---+---+------------------------+
|id |age|three |
+---+---+------------------------+
|1 |25 |[A, B, B, C, A, B, B, C]|
|1 |20 |[A, A, B, C, A, B, B, C]|
|1 |20 |[A, C, B, C, A, B, B, C]|
|2 |26 |[X, Y, Z, C, A, B, B, C]|
+---+---+------------------------+
root
|-- id: long (nullable = true)
|-- age: long (nullable = true)
|-- three: array (nullable = true)
| |-- element: string (containsNull = true)
+--------+
|three[0]|
+--------+
|       A|
|       A|
|       A|
|       X|
+--------+
+--------+
|three[1]|
+--------+
|       B|
|       A|
|       C|
|       Y|
+--------+
Concatenating two arrays with Spark < 2.4 requires a UDF (check this answer for an example).

An example of how to do this using a UDF:
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

arraycat = F.udf(lambda x, y: x + y, ArrayType(StringType()))
df = df.withColumn("combined", arraycat("one", "two"))
df = df.withColumn("combined", F.explode("combined"))

Related

Spark nested complex dataframe

I am trying to get this complex data into a normal dataframe format.
My data schema:
root
|-- column_names: array (nullable = true)
| |-- element: string (containsNull = true)
|-- values: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
|-- id: array (nullable = true)
| |-- element: string (containsNull = true)
My Data File(JSON Format):
{"column_names":["2_col_name","3_col_name"],"id":["a","b","c","d","e"],"values":[["2_col_1",1],["2_col_2",2],["2_col_3",9],["2_col_4",10],["2_col_5",11]]}
I am trying to convert above data into this format:
+----------+----------+----------+
|1_col_name|2_col_name|3_col_name|
+----------+----------+----------+
|         a|   2_col_1|         1|
|         b|   2_col_2|         2|
|         c|   2_col_3|         9|
|         d|   2_col_4|        10|
|         e|   2_col_5|        11|
+----------+----------+----------+
I tried using the explode function on id and values, but got a different output, as below:
+---+-------------+
| id|       values|
+---+-------------+
|  a| [2_col_1, 1]|
|  a| [2_col_2, 2]|
|  a| [2_col_3, 9]|
|  a|[2_col_4, 10]|
|  a|[2_col_5, 11]|
|  b| [2_col_1, 1]|
|  b| [2_col_2, 2]|
|  b| [2_col_3, 9]|
|  b|[2_col_4, 10]|
+---+-------------+
only showing top 9 rows
Not sure where I am going wrong.
You can use the arrays_zip + inline functions to flatten, then pivot on the column names:
val df1 = df.select(
    $"column_names",
    expr("inline(arrays_zip(id, values))")
  ).select(
    $"id".as("1_col_name"),
    expr("inline(arrays_zip(column_names, values))")
  )
  .groupBy("1_col_name")
  .pivot("column_names")
  .agg(first("values"))

df1.show
//+----------+----------+----------+
//|1_col_name|2_col_name|3_col_name|
//+----------+----------+----------+
//|e         |2_col_5   |11        |
//|d         |2_col_4   |10        |
//|c         |2_col_3   |9         |
//|b         |2_col_2   |2         |
//|a         |2_col_1   |1         |
//+----------+----------+----------+
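For reference, a rough PySpark equivalent of the same arrays_zip + inline + pivot approach (Spark >= 2.4; it assumes the JSON file has been loaded into a dataframe called df, which the question does not show):
from pyspark.sql import functions as F

df1 = (df.select("column_names", F.expr("inline(arrays_zip(id, values))"))
         .select(F.col("id").alias("1_col_name"),
                 F.expr("inline(arrays_zip(column_names, values))"))
         .groupBy("1_col_name")
         .pivot("column_names")
         .agg(F.first("values")))
df1.show()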

Drop rows in Spark which don't follow schema

Currently, the schema for my table is:
root
|-- product_id: integer (nullable = true)
|-- product_name: string (nullable = true)
|-- aisle_id: string (nullable = true)
|-- department_id: string (nullable = true)
I want to apply the schema below to the above table and delete all the rows which do not follow it:
val productsSchema = StructType(Seq(
  StructField("product_id", IntegerType, nullable = true),
  StructField("product_name", StringType, nullable = true),
  StructField("aisle_id", IntegerType, nullable = true),
  StructField("department_id", IntegerType, nullable = true)
))
Use option "DROPMALFORMED" while loading the data which ignores corrupted records.
spark.read.format("json")
.option("mode", "DROPMALFORMED")
.option("header", "true")
.schema(productsSchema)
.load("sample.json")
If the data does not match the schema, Spark will put null as the value in that column. We then just have to filter out the rows in which every column is null.
The filter below keeps a row if at least one of its columns is not null.
scala> "cat /tmp/sample.json".! // JSON File Data, one row is not matching with schema.
{"product_id":1,"product_name":"sampleA","aisle_id":"AA","department_id":"AAD"}
{"product_id":2,"product_name":"sampleBB","aisle_id":"AAB","department_id":"AADB"}
{"product_id":3,"product_name":"sampleCC","aisle_id":"CC","department_id":"CCC"}
{"product_id":3,"product_name":"sampledd","aisle_id":"dd","departmentId":"ddd"}
{"name","srinivas","age":29}
res100: Int = 0
scala> schema.printTreeString
root
|-- aisle_id: string (nullable = true)
|-- department_id: string (nullable = true)
|-- product_id: long (nullable = true)
|-- product_name: string (nullable = true)
scala> val df = spark.read.schema(schema).option("badRecordsPath", "/tmp/badRecordsPath").format("json").load("/tmp/sample.json") // Loading JSON data; rows that do not match the schema come back with null in every column.
df: org.apache.spark.sql.DataFrame = [aisle_id: string, department_id: string ... 2 more fields]
scala> df.show(false)
+--------+-------------+----------+------------+
|aisle_id|department_id|product_id|product_name|
+--------+-------------+----------+------------+
|AA      |AAD          |1         |sampleA     |
|AAB     |AADB         |2         |sampleBB    |
|CC      |CCC          |3         |sampleCC    |
|dd      |null         |3         |sampledd    |
|null    |null         |null      |null        |
+--------+-------------+----------+------------+
scala> df.filter(df.columns.map(c => s"${c} is not null").mkString(" or ")).show(false) // Filter out the all-null rows.
+--------+-------------+----------+------------+
|aisle_id|department_id|product_id|product_name|
+--------+-------------+----------+------------+
|AA      |AAD          |1         |sampleA     |
|AAB     |AADB         |2         |sampleBB    |
|CC      |CCC          |3         |sampleCC    |
|dd      |null         |3         |sampledd    |
+--------+-------------+----------+------------+
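In PySpark, the same filter could be written roughly as follows (a sketch, using the same df):
from functools import reduce
from pyspark.sql import functions as F

# keep a row if at least one of its columns is not null
keep = reduce(lambda a, b: a | b, [F.col(c).isNotNull() for c in df.columns])
df.filter(keep).show(truncate=False)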
Do check out the na.drop functions on DataFrame; you can drop rows based on null values, on a minimum number of non-null values per row, and also based on specific columns which have nulls.
scala> sc.parallelize(Seq((1,"a","a"),(1,"a","a"),(2,"b","b"),(3,"c","c"),(4,"d","d"),(4,"d",null))).toDF
res7: org.apache.spark.sql.DataFrame = [_1: int, _2: string ... 1 more field]
scala> res7.show()
+---+---+----+
| _1| _2|  _3|
+---+---+----+
|  1|  a|   a|
|  1|  a|   a|
|  2|  b|   b|
|  3|  c|   c|
|  4|  d|   d|
|  4|  d|null|
+---+---+----+
//dropping row if a null is found
scala> res7.na.drop.show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
|  1|  a|  a|
|  1|  a|  a|
|  2|  b|  b|
|  3|  c|  c|
|  4|  d|  d|
+---+---+---+
//drops rows that have fewer than 3 non-null values (minNonNulls = 3)
scala> res7.na.drop(minNonNulls = 3).show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
|  1|  a|  a|
|  1|  a|  a|
|  2|  b|  b|
|  3|  c|  c|
|  4|  d|  d|
+---+---+---+
//drops nothing, since every row has at least 2 non-null values
scala> res7.na.drop(minNonNulls = 2).show()
+---+---+----+
| _1| _2|  _3|
+---+---+----+
|  1|  a|   a|
|  1|  a|   a|
|  2|  b|   b|
|  3|  c|   c|
|  4|  d|   d|
|  4|  d|null|
+---+---+----+
//drops rows based on nulls in the `_3` column
scala> res7.na.drop(Seq("_3")).show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
|  1|  a|  a|
|  1|  a|  a|
|  2|  b|  b|
|  3|  c|  c|
|  4|  d|  d|
+---+---+---+
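For completeness, the same na.drop variants in PySpark look roughly like this (a sketch; df.na.drop and DataFrame.dropna take the same parameters):
# drop a row if any column is null
df.na.drop().show()

# keep only rows with at least 3 non-null values
df.na.drop(thresh=3).show()

# drop rows that have a null in the _3 column
df.na.drop(subset=["_3"]).show()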

How to convert rdd object to dataframe in Scala

I read data from Elasticsearch and save it into an RDD.
val es_rdd = sc.esRDD("indexname/typename",query="?q=*")
The RDD contains data like the following example:
(uniqueId,Map(field -> value))
(uniqueId2,Map(field2 -> value2))
How can I convert this RDD (String, Map) to a DataFrame (String, String, String)?
You can use explode to achieve it.
import spark.implicits._
import org.apache.spark.sql.functions._
val rdd = sc.range(1, 10).map(s => (s, Map(s -> s)))
val ds = spark.createDataset(rdd)
val df = ds.toDF()
df.printSchema()
df.show()
df.select('_1,explode('_2)).show()
output:
root
|-- _1: long (nullable = false)
|-- _2: map (nullable = true)
| |-- key: long
| |-- value: long (valueContainsNull = false)
+---+--------+
| _1|      _2|
+---+--------+
|  1|[1 -> 1]|
|  2|[2 -> 2]|
|  3|[3 -> 3]|
|  4|[4 -> 4]|
|  5|[5 -> 5]|
|  6|[6 -> 6]|
|  7|[7 -> 7]|
|  8|[8 -> 8]|
|  9|[9 -> 9]|
+---+--------+
+---+---+-----+
| _1|key|value|
+---+---+-----+
|  1|  1|    1|
|  2|  2|    2|
|  3|  3|    3|
|  4|  4|    4|
|  5|  5|    5|
|  6|  6|    6|
|  7|  7|    7|
|  8|  8|    8|
|  9|  9|    9|
+---+---+-----+
In the end, I read it directly in Spark SQL format using the following call to Elasticsearch:
val df = spark.read.format("org.elasticsearch.spark.sql")
  .option("query", "?q=*")
  .option("pushdown", "true")
  .load("indexname/typename")

How to perform aggregation (sum) on different columns and group the result based on another column of a spark dataframe?

Using Scala Spark, I read a table from Postgres and formed a dataframe, locationDF, which contains location-related data in the format below.
val opts = Map("url" -> "databaseurl","dbtable" -> "locations")
val locationDF = spark.read.format("jdbc").options(opts).load()
locationDF.printSchema()
root
|-- locn_id: integer (nullable = true)
|-- start_date: string (nullable = true)
|-- work_min: double (nullable = true)
|-- coverage: double (nullable = true)
|-- speed: double (nullable = true)
Initial Data:
+-------------+----------+-------------------+----------------+------------------+
| locn_id|start_date| work_min| coverage| speed|
+-------------+----------+-------------------+----------------+------------------+
| 3|2012-02-22| 53.62948333333333| 13.644|3.9306276263070457|
| 7|2012-02-22|0.11681666666666667| 0.0| 0.0|
| 1|2012-02-21| 22.783333333333335| 2.6| 8.762820512820513|
| 1|2012-01-21| 23.033333333333335| 2.6| 8.85897435897436|
| 1|2012-01-21| 44.98533333333334| 6.99| 6.435670004768718|
| 4|2012-02-21| 130.34788333333333| 54.67| 2.384267117858667|
| 2|2012-01-21| 94.61035| 8.909|10.619637445280052|
| 1|2012-02-21| 0.0| 0.0| 0.0|
| 1|2012-02-21| 29.3377| 4.579| 6.407010264249837|
| 1|2012-01-21| 59.13276666666667| 8.096| 7.303948451910409|
| 2|2012-03-21| 166.41843333333333| 13.048|12.754325056202738|
| 1|2012-03-21| 14.853183333333334| 2.721| 5.458722283474213|
| 9|2012-03-21| 1.69895| 0.845|2.0105917159763314|
+-------------+----------+-------------------+----------------+------------------+
I am trying to compute the sum of work_min (and convert it into hours), the sum of coverage, and the average speed for each particular year and month, and form another dataframe.
To do that, I have separated the month and year from the date column start_date as below, getting two columns, year and month, out of it.
locationDF.withColumn("year", date_format(to_date($"start_date"),
"yyyy").cast(("Integer"))).withColumn("month",
date_format(to_date($"start_date"), "MM").cast(("Integer")))
+-------------+----------+-------------------+----------------+------------------+----+-----+
| locn_id|start_date| work_min| coverage| speed|year|month|
+-------------+----------+-------------------+----------------+------------------+----+-----+
| 3|2012-02-22| 53.62948333333333| 13.644|3.9306276263070457|2012| 2|
| 7|2012-02-22|0.11681666666666667| 0.0| 0.0|2012| 2|
| 1|2012-02-21| 22.783333333333335| 2.6| 8.762820512820513|2012| 2|
| 1|2012-01-21| 23.033333333333335| 2.6| 8.85897435897436|2012| 1|
| 1|2012-01-21| 44.98533333333334| 6.99| 6.435670004768718|2012| 1|
| 4|2012-02-21| 130.34788333333333| 54.67| 2.384267117858667|2012| 2|
| 2|2012-01-21| 94.61035| 8.909|10.619637445280052|2012| 1|
| 1|2012-02-21| 0.0| 0.0| 0.0|2012| 2|
| 1|2012-02-21| 29.3377| 4.579| 6.407010264249837|2012| 2|
| 1|2012-01-21| 59.13276666666667| 8.096| 7.303948451910409|2012| 1|
| 2|2012-03-21| 166.41843333333333| 13.048|12.754325056202738|2012| 3|
| 1|2012-03-21| 14.853183333333334| 2.721| 5.458722283474213|2012| 3|
| 9|2012-03-21| 1.69895| 0.845|2.0105917159763314|2012| 3|
+-------------+----------+-------------------+----------------+------------------+----+-----+
But I don't understand how to perform, all at the same time, a sum aggregation on the two separate columns work_min and coverage plus an average of the column speed for that particular year and month, and obtain the result as below.
+----+-----+-------------+------------+-----------------+
|year|month|sum_work_mins|sum_coverage| avg_speed|
+----+-----+-------------+------------+-----------------+
|2012| 1|221.7617833 | 26.595 |11.07274342031118|
|2012| 2|236.2152166 | 75.493 |7.161575173745354|
|2012| 3|182.9705666 | 16.614 |6.741213018551094|
+----+-----+-------------+------------+-----------------+
Could anyone let me know how I can achieve that?
I think you are looking for this:
scala> dfd.groupBy("year","month").agg(sum("work_min").as("sum_work_min"),sum("coverage").as("sum_coverage"),avg("speed").as("avg_speed")).show
+----+-----+------------------+------------------+-----------------+
|year|month|      sum_work_min|      sum_coverage|        avg_speed|
+----+-----+------------------+------------------+-----------------+
|2012|    1|221.76178333333334|26.595000000000002|8.304557565233385|
|2012|    2| 236.2152166666667|            75.493|3.580787586872677|
|2012|    3|182.97056666666666|            16.614|6.741213018551094|
+----+-----+------------------+------------------+-----------------+
hope it helps you.
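For reference, roughly the same aggregation in PySpark, including the minutes-to-hours conversion mentioned in the question (a sketch; locationDF and the column names come from the question, result is just an illustrative name):
from pyspark.sql import functions as F

result = (locationDF
          .withColumn("year", F.year(F.to_date("start_date")))
          .withColumn("month", F.month(F.to_date("start_date")))
          .groupBy("year", "month")
          .agg((F.sum("work_min") / 60).alias("sum_work_hours"),
               F.sum("coverage").alias("sum_coverage"),
               F.avg("speed").alias("avg_speed")))
result.show()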

Create a new column from one of the values available in other columns as an array of key-value pairs

I have extracted some data from Hive into a dataframe, which is in the format shown below.
+-------+----------------+----------------+----------------+----------------+
| NUM_ID|            SIG1|            SIG2|            SIG3|            SIG4|
+-------+----------------+----------------+----------------+----------------+
|XXXXX01|[{15695605310...|[{15695605310...|[{15695605310...|[{15695605310...|
|XXXXX02|[{15695604780...|[{15695604780...|[{15695604780...|[{15695604780...|
|XXXXX03|[{15695605310...|[{15695605310...|[{15695605310...|[{15695605310...|
|XXXXX04|[{15695605310...|[{15695605310...|[{15695605310...|[{15695605310...|
|XXXXX05|[{15695605310...|[{15695605310...|[{15695605310...|[{15695605310...|
|XXXXX06|[{15695605340...|[{15695605340...|[{15695605340...|[{15695605340...|
|XXXXX07|[{15695605310...|[{15695605310...|[{15695605310...|[{15695605310...|
|XXXXX08|[{15695605310...|[{15695605310...|[{15695605310...|[{15695605310...|
If we take only one signal it will be as below.
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|
[{1569560537000,3.7825},{1569560481000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|
[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560527000,34.7825}]|
[{1569560535000,34.7825},{1569560479000,34.7825},{1569560487000,34.7825}]
For each NUM_ID, each SIG column will have an array of E and V pairs.
The schema for the above data is:
fromHive.printSchema
root
|-- NUM_ID: string (nullable = true)
|-- SIG1: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: double (nullable = true)
|-- SIG2: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: double (nullable = true)
|-- SIG3: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: double (nullable = true)
|-- SIG4: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: double (nullable = true)
My requirement is to get all the E values from all the columns for a particular NUM_ID and create them as a new column, with the corresponding signal values in other columns, as shown below.
+-------+-------------+-------+-------+-------+-------+
| NUM_ID|            E| SIG1_V| SIG2_V| SIG3_V| SIG4_V|
+-------+-------------+-------+-------+-------+-------+
|XXXXX01|1569560531000|33.7825|34.7825|   null|96.3354|
|XXXXX01|1569560505000|   null|   null|35.5501|   null|
|XXXXX01|1569560531001|73.7825|   null|   null|   null|
|XXXXX02|1569560505000|34.7825|   null|35.5501|96.3354|
|XXXXX02|1569560531000|33.7825|34.7825|35.5501|96.3354|
|XXXXX02|1569560505001|73.7825|   null|   null|   null|
|XXXXX02|1569560502000|   null|   null|35.5501|96.3354|
|XXXXX03|1569560531000|73.7825|   null|   null|   null|
|XXXXX03|1569560505000|34.7825|   null|35.5501|96.3354|
|XXXXX03|1569560509000|   null|34.7825|35.5501|96.3354|
The E values from all four signal columns, for a particular NUM_ID, should be taken as a single column without duplicates, and the V values for the corresponding E should be populated in separate columns. If a signal does not have an E-V pair for a particular E, then that column should be null, as shown above.
Thanks in advance. Any lead appreciated.
For better understanding, below is the sample structure for the input and expected output.
INPUT:
+-------+-----------------+-----------------+-----------------+-----------------+
| NUM_ID|             SIG1|             SIG2|             SIG3|             SIG4|
+-------+-----------------+-----------------+-----------------+-----------------+
|XXXXX01|[{E1,V1},{E2,V2}]|[{E1,V3},{E3,V4}]|[{E4,V5},{E5,V6}]|[{E5,V7},{E2,V8}]|
|XXXXX02|[{E7,V1},{E8,V2}]|[{E1,V3},{E3,V4}]|[{E1,V5},{E5,V6}]|[{E9,V7},{E8,V8}]|
|XXXXX03|[{E1,V1},{E2,V2}]|[{E1,V3},{E3,V4}]|[{E4,V5},{E5,V6}]|[{E5,V7},{E2,V8}]|
OUTPUT EXPECTED:
+-------+---+-------+-------+-------+-------+
| NUM_ID|  E| SIG1_V| SIG2_V| SIG3_V| SIG4_V|
+-------+---+-------+-------+-------+-------+
|XXXXX01| E1|     V1|     V3|   null|   null|
|XXXXX01| E2|     V2|   null|   null|     V8|
|XXXXX01| E3|   null|     V4|   null|   null|
|XXXXX01| E4|   null|   null|     V5|   null|
|XXXXX01| E5|   null|   null|     V6|     V7|
|XXXXX02| E1|   null|     V3|     V5|   null|
|XXXXX02| E3|   null|     V4|   null|   null|
|XXXXX02| E5|   null|   null|     V6|   null|
|XXXXX02| E7|     V1|   null|   null|   null|
|XXXXX02| E8|     V2|   null|   null|     V7|
|XXXXX02| E9|   null|34.7825|   null|     V8|
Input CSV file is as below:
NUM_ID|SIG1|SIG2|SIG3|SIG4 XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.UserDefinedFunction
val df = spark.read.format("csv").option("header","true").option("delimiter", "|").load("path .csv")
df.show(false)
+-------+----------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+
|NUM_ID |SIG1 |SIG2 |SIG3 |SIG4 |
+-------+----------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|
+-------+----------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+
//UDF to generate column E
def UDF_E: UserDefinedFunction = udf((r: Row) => {
  val SigColumn = "SIG1,SIG2,SIG3,SIG4"
  val colList = SigColumn.split(",").toList
  val rr = "[\\}],[\\{]".r
  var out = ""
  colList.foreach { x =>
    // strip the surrounding [{ }] and turn "},{" into "|" so each {E,V} pair can be isolated
    val a = (rr replaceAllIn(r.getAs(x).toString, "|")).replaceAll("\\[\\{", "").replaceAll("\\}\\]", "")
    // keep only the E part (before the comma) of every pair
    val b = a.split("\\|").map(x => x.split(",")(0)).toSet
    out = out + "," + b.mkString(",")
  }
  // deduplicate the E values collected across all four signals
  val out1 = out.replaceFirst(s""",""", "").split(",").toSet.mkString(",")
  out1
})
//UDF to generate the column value for a given E and Signal
def UDF_V: UserDefinedFunction = udf((E: String, SIG: String) => {
  // normalize the signal string: {..} -> (..) and drop the surrounding [ ]
  val Signal = SIG.replaceAll("\\{", "\\(").replaceAll("\\}", "\\)").replaceAll("\\[", "").replaceAll("\\]", "")
  // build a Map of E -> V from the (E,V) pairs
  val SigMap = "(\\w+),([\\w 0-9 .]+)".r.findAllIn(Signal).matchData.map(i => {(i.group(1), i.group(2))}).toMap
  var out = ""
  if (SigMap.keys.toList.contains(E)) {
    out = SigMap(E).toString
  }
  out
})
//new DataFrame with Column "E"
val df1 = df.withColumn("E", UDF_E(struct(df.columns map col: _*))).withColumn("E", explode(split(col("E"), ",")))
df1.show(false)
+-------+----------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+-------------+
|NUM_ID |SIG1 |SIG2 |SIG3 |SIG4 |E |
+-------+----------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+-------------+
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560483000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560497000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560475000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560489000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560535000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560531000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560513000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560537000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560491000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560521000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560505000|
+-------+----------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+-------------+
//Final DataFrame
val df2 = df1.withColumn("SIG1_V", UDF_V(col("E"),col("SIG1"))).withColumn("SIG2_V", UDF_V(col("E"),col("SIG2"))).withColumn("SIG3_V", UDF_V(col("E"),col("SIG3"))).withColumn("SIG4_V", UDF_V(col("E"),col("SIG4"))).drop("SIG1","SIG2","SIG3","SIG4")
df2.show()
+-------+-------------+-------+-------+-------+-------+
| NUM_ID|            E| SIG1_V| SIG2_V| SIG3_V| SIG4_V|
+-------+-------------+-------+-------+-------+-------+
|XXXXX01|1569560475000| 3.7812|       |       |       |
|XXXXX01|1569560483000| 3.7812|       |       |34.7825|
|XXXXX01|1569560489000|       |34.7825|       |       |
|XXXXX01|1569560491000|34.7875|       |       |       |
|XXXXX01|1569560497000|       |34.7825|       |       |
|XXXXX01|1569560505000|       |       |34.7825|       |
|XXXXX01|1569560513000|       |       |34.7825|       |
|XXXXX01|1569560521000|       |       |34.7825|       |
|XXXXX01|1569560531000| 3.7825|34.7825|34.7825|34.7825|
|XXXXX01|1569560535000|       |       |       |34.7825|
|XXXXX01|1569560537000|       | 3.7825|       |       |
+-------+-------------+-------+-------+-------+-------+
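As an alternative to the UDF approach, if you read straight from Hive (so the SIG columns are real array<struct<E,V>> columns, as in the printed schema above), an explode-plus-outer-join sketch in PySpark could look roughly like this; fromHive and the column names are taken from the question, while the helper sig_to_long is just for illustration:
from functools import reduce
from pyspark.sql import functions as F

sig_cols = ["SIG1", "SIG2", "SIG3", "SIG4"]

def sig_to_long(df, c):
    # one row per (NUM_ID, E) with the V value of this signal
    return (df.select("NUM_ID", F.explode(c).alias("s"))
              .select("NUM_ID", F.col("s.E").alias("E"), F.col("s.V").alias(c + "_V")))

parts = [sig_to_long(fromHive, c) for c in sig_cols]

# a full outer join keeps every E seen in any signal; missing signals stay null
result = reduce(lambda left, right: left.join(right, ["NUM_ID", "E"], "full_outer"), parts)
result.orderBy("NUM_ID", "E").show()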