Spark dataframe join showing unexpected results - 0 rows - scala
I'm using Spark 1.6.0 and I want to join two dataframes; they appear in the YARN log as shown below.
df_train_raw
df_user_clicks_info
I tried to inner join them with the following code:
val df_tmp_tmp_0 = df_train_raw.join(df_user_clicks_info, Seq("subscriberid"))
df_tmp_tmp_0.show()
And the result I got was completely empty!
+------------+--------+-----+------------+-------------------------------+-------------------------------+--------------------------------+---------------------------------+---------------------------------+---------------------------------+--------------------------------+--------------------------------+--------------------------------+---------------------------------+----------------------------------+----------------------------------+----------------------------------+---------------------------------+---------------------------------+---------------------------------+----------------------------------+-----------------------------------+-----------------------------------+-----------------------------------+----------------------------------+
|subscriberid|objectid|label|subscriberid|user_clicks_avg_everyday_a_week|user_clicks_sum_time_1_9_a_week|user_clicks_sum_time_9_14_a_week|user_clicks_sum_time_14_17_a_week|user_clicks_sum_time_17_19_a_week|user_clicks_sum_time_19_23_a_week|user_clicks_sum_time_23_1_a_week|user_clicks_avg_everyday_weekday|user_clicks_sum_time_1_9_weekday|user_clicks_sum_time_9_14_weekday|user_clicks_sum_time_14_17_weekday|user_clicks_sum_time_17_19_weekday|user_clicks_sum_time_19_23_weekday|user_clicks_sum_time_23_1_weekday|user_clicks_avg_everyday_weekdend|user_clicks_sum_time_1_9_weekdend|user_clicks_sum_time_9_14_weekdend|user_clicks_sum_time_14_17_weekdend|user_clicks_sum_time_17_19_weekdend|user_clicks_sum_time_19_23_weekdend|user_clicks_sum_time_23_1_weekdend|
+------------+--------+-----+------------+-------------------------------+-------------------------------+--------------------------------+---------------------------------+---------------------------------+---------------------------------+--------------------------------+--------------------------------+--------------------------------+---------------------------------+----------------------------------+----------------------------------+----------------------------------+---------------------------------+---------------------------------+---------------------------------+----------------------------------+-----------------------------------+-----------------------------------+-----------------------------------+----------------------------------+
+------------+--------+-----+------------+-------------------------------+-------------------------------+--------------------------------+---------------------------------+---------------------------------+---------------------------------+--------------------------------+--------------------------------+--------------------------------+---------------------------------+----------------------------------+----------------------------------+----------------------------------+---------------------------------+---------------------------------+---------------------------------+----------------------------------+-----------------------------------+-----------------------------------+-----------------------------------+----------------------------------+
I don't know why; nothing seems wrong here. I'd appreciate some help, thanks!
After two commenters suggested checking for stray whitespace, I gave it another try:
df_train_raw
————————————
+------------+-----------+-----+
|subscriberid| objectid|label|
+------------+-----------+-----+
| 104752237|11029932485| 0|
| 105246837|11029932485| 0|
| 105517237|11029932485| 0|
| 108917037|11030797988| 0|
| 108917037|11029648595| 0|
| 109901037|11029648595| 0|
| 105517237|11030720502| 0|
| 105246837|11029986502| 0|
| 104752237|11029191717| 0|
| 105246837|11029191717| 0|
| 105517237|11029191717| 0|
| 109901037|11030138623| 0|
| 105517237|11014105538| 0|
| 105517237|11014105543| 0|
| 105517237|11016478156| 0|
| 105517237|11023285357| 0|
| 105246837|11026067980| 0|
| 105246837|11030797988| 0|
| 108917037|11029932485| 0|
| 109901037|11029932485| 0|
+------------+-----------+-----+
only showing top 20 rows
————————————
root
|-- subscriberid: long (nullable = true)
|-- objectid: long (nullable = true)
|-- label: integer (nullable = true)
and printed the "subscriberid" column, which shows this is not a whitespace issue:
df_train_raw.select("subscriberid").take(20).foreach(println)
The result:
[104752237]
[105246837]
[105517237]
[108917037]
[108917037]
[109901037]
[105517237]
[105246837]
[104752237]
[105246837]
[105517237]
[109901037]
[105517237]
[105517237]
[105517237]
[105517237]
[105246837]
[105246837]
[108917037]
[109901037]
And for df_user_clicks_info:
+------------+-------------------------------+-------------------------------+--------------------------------+---------------------------------+---------------------------------+---------------------------------+--------------------------------+--------------------------------+--------------------------------+---------------------------------+----------------------------------+----------------------------------+----------------------------------+---------------------------------+---------------------------------+---------------------------------+----------------------------------+-----------------------------------+-----------------------------------+-----------------------------------+----------------------------------+
|subscriberid|user_clicks_avg_everyday_a_week|user_clicks_sum_time_1_9_a_week|user_clicks_sum_time_9_14_a_week|user_clicks_sum_time_14_17_a_week|user_clicks_sum_time_17_19_a_week|user_clicks_sum_time_19_23_a_week|user_clicks_sum_time_23_1_a_week|user_clicks_avg_everyday_weekday|user_clicks_sum_time_1_9_weekday|user_clicks_sum_time_9_14_weekday|user_clicks_sum_time_14_17_weekday|user_clicks_sum_time_17_19_weekday|user_clicks_sum_time_19_23_weekday|user_clicks_sum_time_23_1_weekday|user_clicks_avg_everyday_weekdend|user_clicks_sum_time_1_9_weekdend|user_clicks_sum_time_9_14_weekdend|user_clicks_sum_time_14_17_weekdend|user_clicks_sum_time_17_19_weekdend|user_clicks_sum_time_19_23_weekdend|user_clicks_sum_time_23_1_weekdend|
+------------+-------------------------------+-------------------------------+--------------------------------+---------------------------------+---------------------------------+---------------------------------+--------------------------------+--------------------------------+--------------------------------+---------------------------------+----------------------------------+----------------------------------+----------------------------------+---------------------------------+---------------------------------+---------------------------------+----------------------------------+-----------------------------------+-----------------------------------+-----------------------------------+----------------------------------+
| 104752237| 1.71| 0| 0| 0| 4| 4| 4| 0.8| 0| 0| 0| 0| 4| 0| 4.0| 0| 0| 0| 4| 0| 4|
| 105517237| 17.14| 12| 36| 12| 0| 60| 0| 9.6| 0| 0| 0| 0| 48| 0| 36.0| 12| 36| 12| 0| 12| 0|
| 109901037| 2.14| 0| 3| 3| 6| 3| 0| 2.4| 0| 0| 3| 6| 3| 0| 1.5| 0| 3| 0| 0| 0| 0|
| 105246837| 8.0| 8| 0| 0| 16| 32| 0| 8.0| 8| 0| 0| 8| 24| 0| 8.0| 0| 0| 0| 8| 8| 0|
+------------+-------------------------------+-------------------------------+--------------------------------+---------------------------------+---------------------------------+---------------------------------+--------------------------------+--------------------------------+--------------------------------+---------------------------------+----------------------------------+----------------------------------+----------------------------------+---------------------------------+---------------------------------+---------------------------------+----------------------------------+-----------------------------------+-----------------------------------+-----------------------------------+----------------------------------+
————————————
root
|-- subscriberid: string (nullable = true)
|-- user_clicks_avg_everyday_a_week: double (nullable = false)
|-- user_clicks_sum_time_1_9_a_week: long (nullable = false)
|-- user_clicks_sum_time_9_14_a_week: long (nullable = false)
|-- user_clicks_sum_time_14_17_a_week: long (nullable = false)
|-- user_clicks_sum_time_17_19_a_week: long (nullable = false)
|-- user_clicks_sum_time_19_23_a_week: long (nullable = false)
|-- user_clicks_sum_time_23_1_a_week: long (nullable = false)
|-- user_clicks_avg_everyday_weekday: double (nullable = false)
|-- user_clicks_sum_time_1_9_weekday: long (nullable = false)
|-- user_clicks_sum_time_9_14_weekday: long (nullable = false)
|-- user_clicks_sum_time_14_17_weekday: long (nullable = false)
|-- user_clicks_sum_time_17_19_weekday: long (nullable = false)
|-- user_clicks_sum_time_19_23_weekday: long (nullable = false)
|-- user_clicks_sum_time_23_1_weekday: long (nullable = false)
|-- user_clicks_avg_everyday_weekdend: double (nullable = false)
|-- user_clicks_sum_time_1_9_weekdend: long (nullable = false)
|-- user_clicks_sum_time_9_14_weekdend: long (nullable = false)
|-- user_clicks_sum_time_14_17_weekdend: long (nullable = false)
|-- user_clicks_sum_time_17_19_weekdend: long (nullable = false)
|-- user_clicks_sum_time_19_23_weekdend: long (nullable = false)
|-- user_clicks_sum_time_23_1_weekdend: long (nullable = false)
df_user_clicks_info.select("subscriberid").take(20).foreach(println)
[104752237]
[105517237]
[109901037]
[105246837]
It didn't work either :(
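For what it's worth, a more direct way to rule out hidden whitespace than eyeballing the take(20) output is to compare each key's length with its trimmed length. This is only a small sketch against the string-typed column shown above, not code from the original post:

import org.apache.spark.sql.functions.{col, length, trim}

// Count ids that carry leading/trailing whitespace; 0 means the key is clean
df_user_clicks_info
  .where(length(col("subscriberid")) > length(trim(col("subscriberid"))))
  .count()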
Thanks to everyone who helped. As for the cause, I think it is a bug in Spark 1.6.0; I solved it by changing my data processing rather than upgrading Spark. Originally I wanted to build df_3 directly from df_1 and df_2, but that join never produced the result I expected because of the issue described above. So I took another route: I built two intermediate dataframes, df_tmp_1 and df_tmp_2, and joined those instead, which gave the result I wanted. I still don't know the root cause, but this workaround may be worth trying if you are on Spark 1.6.0 and hit the same join problem.
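Independent of that workaround, one detail visible in the two schemas above is that the join key types differ: subscriberid is long in df_train_raw but string in df_user_clicks_info. Whether or not that is what bit Spark 1.6.0 here, a sketch of normalizing the key type on both sides before joining (an added suggestion, not the workaround described above) would be:

import org.apache.spark.sql.functions.col

// Cast the string-typed key to long so both sides agree before the join
val df_clicks_casted = df_user_clicks_info
  .withColumn("subscriberid", col("subscriberid").cast("long"))

val joined = df_train_raw.join(df_clicks_casted, Seq("subscriberid"))
joined.show()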
Related
Drop rows in Spark which don't follow schema
Currently, the schema for my table is:

root
 |-- product_id: integer (nullable = true)
 |-- product_name: string (nullable = true)
 |-- aisle_id: string (nullable = true)
 |-- department_id: string (nullable = true)

I want to apply the schema below to the table above and delete all rows that do not conform to it:

val productsSchema = StructType(Seq(
  StructField("product_id", IntegerType, nullable = true),
  StructField("product_name", StringType, nullable = true),
  StructField("aisle_id", IntegerType, nullable = true),
  StructField("department_id", IntegerType, nullable = true)
))
Use the option "DROPMALFORMED" while loading the data, which ignores corrupted records:

spark.read.format("json")
  .option("mode", "DROPMALFORMED")
  .option("header", "true")
  .schema(productsSchema)
  .load("sample.json")
If the data does not match the schema, Spark puts null as the value in that column, so we just have to filter out the rows where every column is null. The session below uses filter to drop those all-null rows.

scala> "cat /tmp/sample.json".! // JSON file data; one row does not match the schema.
{"product_id":1,"product_name":"sampleA","aisle_id":"AA","department_id":"AAD"}
{"product_id":2,"product_name":"sampleBB","aisle_id":"AAB","department_id":"AADB"}
{"product_id":3,"product_name":"sampleCC","aisle_id":"CC","department_id":"CCC"}
{"product_id":3,"product_name":"sampledd","aisle_id":"dd","departmentId":"ddd"}
{"name","srinivas","age":29}
res100: Int = 0

scala> schema.printTreeString
root
 |-- aisle_id: string (nullable = true)
 |-- department_id: string (nullable = true)
 |-- product_id: long (nullable = true)
 |-- product_name: string (nullable = true)

scala> val df = spark.read.schema(schema).option("badRecordsPath", "/tmp/badRecordsPath").format("json").load("/tmp/sample.json") // Load the JSON data; rows that do not match the schema come back with null in every column.
df: org.apache.spark.sql.DataFrame = [aisle_id: string, department_id: string ... 2 more fields]

scala> df.show(false)
+--------+-------------+----------+------------+
|aisle_id|department_id|product_id|product_name|
+--------+-------------+----------+------------+
|AA      |AAD          |1         |sampleA     |
|AAB     |AADB         |2         |sampleBB    |
|CC      |CCC          |3         |sampleCC    |
|dd      |null         |3         |sampledd    |
|null    |null         |null      |null        |
+--------+-------------+----------+------------+

scala> df.filter(df.columns.map(c => s"${c} is not null").mkString(" or ")).show(false) // Keep rows where at least one column is not null.
+--------+-------------+----------+------------+
|aisle_id|department_id|product_id|product_name|
+--------+-------------+----------+------------+
|AA      |AAD          |1         |sampleA     |
|AAB     |AADB         |2         |sampleBB    |
|CC      |CCC          |3         |sampleCC    |
|dd      |null         |3         |sampledd    |
+--------+-------------+----------+------------+
Also check out the na.drop functions on DataFrame: you can drop rows that contain nulls, require a minimum number of non-null values per row, or drop based on nulls in specific columns.

scala> sc.parallelize(Seq((1,"a","a"),(1,"a","a"),(2,"b","b"),(3,"c","c"),(4,"d","d"),(4,"d",null))).toDF
res7: org.apache.spark.sql.DataFrame = [_1: int, _2: string ... 1 more field]

scala> res7.show()
+---+---+----+
| _1| _2|  _3|
+---+---+----+
|  1|  a|   a|
|  1|  a|   a|
|  2|  b|   b|
|  3|  c|   c|
|  4|  d|   d|
|  4|  d|null|
+---+---+----+

// dropping a row if any null is found
scala> res7.na.drop.show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
|  1|  a|  a|
|  1|  a|  a|
|  2|  b|  b|
|  3|  c|  c|
|  4|  d|  d|
+---+---+---+

// drops rows that have fewer than minNonNulls = 3 non-null values
scala> res7.na.drop(minNonNulls = 3).show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
|  1|  a|  a|
|  1|  a|  a|
|  2|  b|  b|
|  3|  c|  c|
|  4|  d|  d|
+---+---+---+

// drops nothing, since every row has at least 2 non-null values
scala> res7.na.drop(minNonNulls = 2).show()
+---+---+----+
| _1| _2|  _3|
+---+---+----+
|  1|  a|   a|
|  1|  a|   a|
|  2|  b|   b|
|  3|  c|   c|
|  4|  d|   d|
|  4|  d|null|
+---+---+----+

// drops rows based on nulls in the `_3` column
scala> res7.na.drop(Seq("_3")).show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
|  1|  a|  a|
|  1|  a|  a|
|  2|  b|  b|
|  3|  c|  c|
|  4|  d|  d|
+---+---+---+
Create a new column from one of the values available in other columns as an array of key-value pairs
I have extracted some data from Hive into a dataframe, which is in the format shown below.

+-------+----------------+----------------+----------------+----------------+
| NUM_ID|            SIG1|            SIG2|            SIG3|            SIG4|
+-------+----------------+----------------+----------------+----------------+
|XXXXX01|[{15695605310...|[{15695605310...|[{15695605310...|[{15695605310...|
|XXXXX02|[{15695604780...|[{15695604780...|[{15695604780...|[{15695604780...|
|XXXXX03|[{15695605310...|[{15695605310...|[{15695605310...|[{15695605310...|
|XXXXX04|[{15695605310...|[{15695605310...|[{15695605310...|[{15695605310...|
|XXXXX05|[{15695605310...|[{15695605310...|[{15695605310...|[{15695605310...|
|XXXXX06|[{15695605340...|[{15695605340...|[{15695605340...|[{15695605340...|
|XXXXX07|[{15695605310...|[{15695605310...|[{15695605310...|[{15695605310...|
|XXXXX08|[{15695605310...|[{15695605310...|[{15695605310...|[{15695605310...|
+-------+----------------+----------------+----------------+----------------+

If we take only one row, it looks like this:

|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|
        [{1569560537000,3.7825},{1569560481000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|
        [{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560527000,34.7825}]|
        [{1569560535000,34.7825},{1569560479000,34.7825},{1569560487000,34.7825}]

For each NUM_ID, each SIG column holds an array of {E, V} pairs. The schema for the above data is:

fromHive.printSchema
root
 |-- NUM_ID: string (nullable = true)
 |-- SIG1: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- E: long (nullable = true)
 |    |    |-- V: double (nullable = true)
 |-- SIG2: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- E: long (nullable = true)
 |    |    |-- V: double (nullable = true)
 |-- SIG3: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- E: long (nullable = true)
 |    |    |-- V: double (nullable = true)
 |-- SIG4: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- E: long (nullable = true)
 |    |    |-- V: double (nullable = true)

My requirement is to collect all the E values from all the SIG columns for a particular NUM_ID into a new column, with the corresponding signal values in separate columns, as shown below:

+-------+-------------+-------+-------+-------+-------+
| NUM_ID|            E| SIG1_V| SIG2_V| SIG3_V| SIG4_V|
+-------+-------------+-------+-------+-------+-------+
|XXXXX01|1569560531000|33.7825|34.7825|   null|96.3354|
|XXXXX01|1569560505000|   null|   null|35.5501|   null|
|XXXXX01|1569560531001|73.7825|   null|   null|   null|
|XXXXX02|1569560505000|34.7825|   null|35.5501|96.3354|
|XXXXX02|1569560531000|33.7825|34.7825|35.5501|96.3354|
|XXXXX02|1569560505001|73.7825|   null|   null|   null|
|XXXXX02|1569560502000|   null|   null|35.5501|96.3354|
|XXXXX03|1569560531000|73.7825|   null|   null|   null|
|XXXXX03|1569560505000|34.7825|   null|35.5501|96.3354|
|XXXXX03|1569560509000|   null|34.7825|35.5501|96.3354|
+-------+-------------+-------+-------+-------+-------+

The E values from all four signal columns for a particular NUM_ID should end up in a single column without duplicates, and the V value for the corresponding E should be populated in that signal's column; if a signal has no {E, V} pair for a particular E, that column should be null, as shown above. Thanks in advance, any lead appreciated. For better understanding, below is a sample structure for the input and expected output.
INPUT:

+-------+-----------------+-----------------+-----------------+-----------------+
| NUM_ID|             SIG1|             SIG2|             SIG3|             SIG4|
+-------+-----------------+-----------------+-----------------+-----------------+
|XXXXX01|[{E1,V1},{E2,V2}]|[{E1,V3},{E3,V4}]|[{E4,V5},{E5,V6}]|[{E5,V7},{E2,V8}]|
|XXXXX02|[{E7,V1},{E8,V2}]|[{E1,V3},{E3,V4}]|[{E1,V5},{E5,V6}]|[{E9,V7},{E8,V8}]|
|XXXXX03|[{E1,V1},{E2,V2}]|[{E1,V3},{E3,V4}]|[{E4,V5},{E5,V6}]|[{E5,V7},{E2,V8}]|
+-------+-----------------+-----------------+-----------------+-----------------+

EXPECTED OUTPUT:

+-------+---+-------+-------+-------+-------+
| NUM_ID|  E| SIG1_V| SIG2_V| SIG3_V| SIG4_V|
+-------+---+-------+-------+-------+-------+
|XXXXX01| E1|     V1|     V3|   null|   null|
|XXXXX01| E2|     V2|   null|   null|     V8|
|XXXXX01| E3|   null|     V4|   null|   null|
|XXXXX01| E4|   null|   null|     V5|   null|
|XXXXX01| E5|   null|   null|     V6|     V7|
|XXXXX02| E1|   null|     V3|     V5|   null|
|XXXXX02| E3|   null|     V4|   null|   null|
|XXXXX02| E5|   null|   null|     V6|   null|
|XXXXX02| E7|     V1|   null|   null|   null|
|XXXXX02| E8|     V2|   null|   null|     V8|
|XXXXX02| E9|   null|   null|   null|     V7|
+-------+---+-------+-------+-------+-------+
Input CSV file is as below:

NUM_ID|SIG1|SIG2|SIG3|SIG4
XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions._ // udf, struct, col, explode, split

val df = spark.read.format("csv").option("header","true").option("delimiter", "|").load("path .csv")
df.show(false)

+-------+----------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+
|NUM_ID |SIG1 |SIG2 |SIG3 |SIG4 |
+-------+----------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|
+-------+----------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+

//UDF to generate column E
def UDF_E: UserDefinedFunction = udf((r: Row) => {
  val SigColumn = "SIG1,SIG2,SIG3,SIG4"
  val colList = SigColumn.split(",").toList
  val rr = "[\\}],[\\{]".r
  var out = ""
  colList.foreach { x =>
    val a = (rr replaceAllIn(r.getAs(x).toString, "|")).replaceAll("\\[\\{","").replaceAll("\\}\\]","")
    val b = a.split("\\|").map(x => x.split(",")(0)).toSet
    out = out + "," + b.mkString(",")
  }
  val out1 = out.replaceFirst(s""",""","").split(",").toSet.mkString(",")
  out1
})

//UDF to generate column value with Signal
def UDF_V: UserDefinedFunction = udf((E: String, SIG: String) => {
  val Signal = SIG.replaceAll("\\{", "\\(").replaceAll("\\}", "\\)").replaceAll("\\[", "").replaceAll("\\]", "")
  val SigMap = "(\\w+),([\\w 0-9 .]+)".r.findAllIn(Signal).matchData.map(i => {(i.group(1), i.group(2))}).toMap
  var out = ""
  if (SigMap.keys.toList.contains(E)) {
    out = SigMap(E).toString
  }
  out
})

//new DataFrame with Column "E"
val df1 = df.withColumn("E", UDF_E(struct(df.columns map col: _*))).withColumn("E", explode(split(col("E"), ",")))
df1.show(false)
+-------+----------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+-------------+
|NUM_ID |SIG1 |SIG2 |SIG3 |SIG4 |E |
+-------+----------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+-------------+
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560483000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560497000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560475000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560489000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560535000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560531000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560513000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560537000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560491000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560521000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560505000|
+-------+----------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+-------------+

//Final DataFrame
val df2 = df1
  .withColumn("SIG1_V", UDF_V(col("E"), col("SIG1")))
  .withColumn("SIG2_V", UDF_V(col("E"), col("SIG2")))
  .withColumn("SIG3_V", UDF_V(col("E"), col("SIG3")))
  .withColumn("SIG4_V", UDF_V(col("E"), col("SIG4")))
  .drop("SIG1", "SIG2", "SIG3", "SIG4")

df2.show()
+-------+-------------+-------+-------+-------+-------+
| NUM_ID|            E| SIG1_V| SIG2_V| SIG3_V| SIG4_V|
+-------+-------------+-------+-------+-------+-------+
|XXXXX01|1569560475000| 3.7812|       |       |       |
|XXXXX01|1569560483000| 3.7812|       |       |34.7825|
|XXXXX01|1569560489000|       |34.7825|       |       |
|XXXXX01|1569560491000|34.7875|       |       |       |
|XXXXX01|1569560497000|       |34.7825|       |       |
|XXXXX01|1569560505000|       |       |34.7825|       |
|XXXXX01|1569560513000|       |       |34.7825|       |
|XXXXX01|1569560521000|       |       |34.7825|       |
|XXXXX01|1569560531000| 3.7825|34.7825|34.7825|34.7825|
|XXXXX01|1569560535000|       |       |       |34.7825|
|XXXXX01|1569560537000|       | 3.7825|       |       |
+-------+-------------+-------+-------+-------+-------+
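For reference, a sketch of an alternative that avoids the string-parsing UDFs, assuming the data is read with the Hive schema shown in the question (fromHive, with each SIG column as array<struct<E: long, V: double>>) rather than as raw CSV text; this is an added suggestion, not part of the original answer:

import org.apache.spark.sql.functions.{col, explode}

val sigs = Seq("SIG1", "SIG2", "SIG3", "SIG4")

// One (NUM_ID, E, SIGi_V) dataframe per signal column
val perSignal = sigs.map { s =>
  fromHive
    .select(col("NUM_ID"), explode(col(s)).as("ev"))
    .select(col("NUM_ID"), col("ev.E").as("E"), col("ev.V").as(s + "_V"))
}

// Full outer join on (NUM_ID, E): an E missing from a signal shows up as null in that column
val result = perSignal.reduce(_.join(_, Seq("NUM_ID", "E"), "full_outer"))
result.orderBy("NUM_ID", "E").show(false)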
Drop duplicates if reverse is present between two columns
I have a dataframe (around 20,000,000 rows) and I'd like to drop duplicates across two columns if those columns have the same values, or even if those values are in the reverse order. For example, the original dataframe:

+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   1|   A|
|   1|   1|   B|
|   2|   1|   C|
|   1|   2|   D|
|   3|   5|   E|
|   3|   4|   F|
|   4|   3|   G|
+----+----+----+

where the schema of the columns is as follows:

root
 |-- col1: string (nullable = true)
 |-- col2: string (nullable = true)
 |-- col3: string (nullable = true)

The desired dataframe should look like:

+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   1|   A|
|   1|   2|   D|
|   3|   5|   E|
|   3|   4|   F|
+----+----+----+

The dropDuplicates() method only removes duplicates when the values are in the same order. I followed the accepted answer to the question "Pandas: remove reverse duplicates from dataframe", but it took too much time.
You can use this. Note: in 'col3', 'D' will be removed instead of 'C' (the opposite of the desired output above), because the row with 'C' is positioned before the row with 'D'.

from pyspark.sql import functions as F

df = spark.read.csv('/FileStore/tables/stack2.csv', header='True')
df2 = df.select(F.least(df.col1, df.col2).alias('col1'),
                F.greatest(df.col1, df.col2).alias('col2'),
                df.col3)
df2.dropDuplicates(['col1', 'col2']).show()

Hope this helps.
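For the Scala API, an equivalent sketch of the same least/greatest trick (assuming a dataframe df with the col1/col2/col3 layout from the question) would be:

import org.apache.spark.sql.functions.{col, least, greatest}

// Build an order-independent key from the pair, then deduplicate on it
val deduped = df
  .withColumn("k1", least(col("col1"), col("col2")))
  .withColumn("k2", greatest(col("col1"), col("col2")))
  .dropDuplicates("k1", "k2")
  .drop("k1", "k2")

deduped.show()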
Why does pyspark give the wrong value of variance?
I have a registered table in pyspark:

+--------+-------+--------+------------+---------+-----------------+----------------------+
|order_id|user_id|eval_set|order_number|order_dow|order_hour_of_day|days_since_prior_order|
+--------+-------+--------+------------+---------+-----------------+----------------------+
| 2168274|      2|   prior|           1|        2|               11|                  null|
| 1501582|      2|   prior|           2|        5|               10|                    10|
| 1901567|      2|   prior|           3|        1|               10|                     3|
|  738281|      2|   prior|           4|        2|               10|                     8|
| 1673511|      2|   prior|           5|        3|               11|                     8|
| 1199898|      2|   prior|           6|        2|                9|                    13|
| 3194192|      2|   prior|           7|        2|               12|                    14|
|  788338|      2|   prior|           8|        1|               15|                    27|
| 1718559|      2|   prior|           9|        2|                9|                     8|
| 1447487|      2|   prior|          10|        1|               11|                     6|
| 1402090|      2|   prior|          11|        1|               10|                    30|
| 3186735|      2|   prior|          12|        1|                9|                    28|
| 3268552|      2|   prior|          13|        4|               11|                    30|
|  839880|      2|   prior|          14|        3|               10|                    13|
| 1492625|      2|   train|          15|        1|               11|                    30|
+--------+-------+--------+------------+---------+-----------------+----------------------+

I want to calculate the variance of days_since_prior_order, excluding the null value. The correct value should be 97.91836734693878, which is what Hive and Python give, but my pyspark gives me 105.45054945054943.

spark.sql("select variance(days_since_prior_order) from \
    (select * from orders where user_id=2 and days_since_prior_order is not null ) ").show()

The original table data types are correct:

 |-- order_id: long (nullable = true)
 |-- user_id: long (nullable = true)
 |-- eval_set: string (nullable = true)
 |-- order_number: short (nullable = true)
 |-- order_dow: short (nullable = true)
 |-- order_hour_of_day: short (nullable = true)
 |-- days_since_prior_order: short (nullable = true)
Try the following function instead of pyspark.sql.functions.variance(col):

pyspark.sql.functions.var_pop(col)
    Aggregate function: returns the population variance of the values in a group.

With your column data, var_pop gives me this result:

[Row(var_pop(days_since_prior_order)=97.91836734693877)]

The reason is that variance() and var_samp() are scaled by 1/(N-1), while var_pop() is scaled by 1/N, with N the number of values selected. See population and sample variance for a useful link. Here you will find the docs of var_pop().
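As a quick sanity check of that scaling (a small worked example added here, not part of the original answer): with the 14 non-null rows above, 97.91836734693878 × 14/13 ≈ 105.4505…, which is exactly the variance() value the question reports. The same can be recomputed directly from the values:

// Recompute both variants from the 14 non-null days_since_prior_order values above
val xs = Seq(10, 3, 8, 8, 13, 14, 27, 8, 6, 30, 28, 30, 13, 30).map(_.toDouble)
val n = xs.length                                 // 14
val mean = xs.sum / n
val ssd = xs.map(x => math.pow(x - mean, 2)).sum  // sum of squared deviations
println(ssd / n)        // ~97.918  -> var_pop, matches Hive/Python
println(ssd / (n - 1))  // ~105.451 -> variance()/var_samp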
DataFrame explode list of JSON objects
I have JSON data in the following format:

{
  "date": 100,
  "userId": 1,
  "data": [
    { "timeStamp": 101, "reading": 1 },
    { "timeStamp": 102, "reading": 2 }
  ]
}
{
  "date": 200,
  "userId": 1,
  "data": [
    { "timeStamp": 201, "reading": 3 },
    { "timeStamp": 202, "reading": 4 }
  ]
}

I read it into Spark SQL:

val df = SQLContext.read.json(...)
df.printSchema
// root
// |-- date: double (nullable = true)
// |-- userId: long (nullable = true)
// |-- data: array (nullable = true)
// |    |-- element: struct (containsNull = true)
// |    |    |-- timeStamp: double (nullable = true)
// |    |    |-- reading: double (nullable = true)

I would like to transform it so that there is one row per reading. To my understanding, every transformation should produce a new DataFrame, so the following should work:

import org.apache.spark.sql.functions.explode

val exploded = df
  .withColumn("reading", explode(df("data.reading")))
  .withColumn("timeStamp", explode(df("data.timeStamp")))
  .drop("data")

exploded.printSchema
// root
// |-- date: double (nullable = true)
// |-- userId: long (nullable = true)
// |-- timeStamp: double (nullable = true)
// |-- reading: double (nullable = true)

The resulting schema is correct, but I get every value twice:

exploded.show
// +-----------+-----------+-----------+-----------+
// |       date|     userId|  timeStamp|    reading|
// +-----------+-----------+-----------+-----------+
// |        100|          1|        101|          1|
// |        100|          1|        101|          1|
// |        100|          1|        102|          2|
// |        100|          1|        102|          2|
// |        200|          1|        201|          3|
// |        200|          1|        201|          3|
// |        200|          1|        202|          4|
// |        200|          1|        202|          4|
// +-----------+-----------+-----------+-----------+

My feeling is that there is something about the lazy evaluation of the two explodes that I don't understand. Is there a way to get the above code to work? Or should I use a different approach altogether?
"The resulting schema is correct, but I get every value twice"

While the schema is correct, the output you've provided doesn't reflect the actual result. In practice you'll get the Cartesian product of timeStamp and reading for each input row.

"My feeling is that there is something about the lazy evaluation"

No, it has nothing to do with lazy evaluation. The way you use explode is just wrong. To understand what is going on, let's trace the execution for date equal to 100:

val df100 = df.where($"date" === 100)

step by step. The first explode will generate two rows, one for 1 and one for 2:

val df100WithReading = df100.withColumn("reading", explode(df("data.reading")))

df100WithReading.show
// +------------------+----+------+-------+
// |              data|date|userId|reading|
// +------------------+----+------+-------+
// |[[1,101], [2,102]]| 100|     1|      1|
// |[[1,101], [2,102]]| 100|     1|      2|
// +------------------+----+------+-------+

The second explode generates two rows (timeStamp equal to 101 and 102) for each row from the previous step:

val df100WithReadingAndTs = df100WithReading
  .withColumn("timeStamp", explode(df("data.timeStamp")))

df100WithReadingAndTs.show
// +------------------+----+------+-------+---------+
// |              data|date|userId|reading|timeStamp|
// +------------------+----+------+-------+---------+
// |[[1,101], [2,102]]| 100|     1|      1|      101|
// |[[1,101], [2,102]]| 100|     1|      1|      102|
// |[[1,101], [2,102]]| 100|     1|      2|      101|
// |[[1,101], [2,102]]| 100|     1|      2|      102|
// +------------------+----+------+-------+---------+

If you want correct results, explode data and select afterwards:

val exploded = df.withColumn("data", explode($"data"))
  .select($"userId", $"date", $"data".getItem("reading"), $"data".getItem("timestamp"))

exploded.show
// +------+----+-------------+---------------+
// |userId|date|data[reading]|data[timestamp]|
// +------+----+-------------+---------------+
// |     1| 100|            1|            101|
// |     1| 100|            2|            102|
// |     1| 200|            3|            201|
// |     1| 200|            4|            202|
// +------+----+-------------+---------------+
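A small follow-up sketch (not part of the original answer, and assuming the same df, imports, and implicits as the snippet above): if you want the output columns to keep the original field names instead of data[reading] and data[timestamp], alias the extracted fields:

val tidy = df.withColumn("data", explode($"data"))
  .select(
    $"userId",
    $"date",
    $"data".getItem("reading").as("reading"),     // rename data[reading] -> reading
    $"data".getItem("timeStamp").as("timeStamp")) // rename data[timestamp] -> timeStamp

tidy.printSchema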