I have JSON data in the following format:
{
  "date": 100,
  "userId": 1,
  "data": [
    {
      "timeStamp": 101,
      "reading": 1
    },
    {
      "timeStamp": 102,
      "reading": 2
    }
  ]
}
{
  "date": 200,
  "userId": 1,
  "data": [
    {
      "timeStamp": 201,
      "reading": 3
    },
    {
      "timeStamp": 202,
      "reading": 4
    }
  ]
}
I read it into Spark SQL:
val df = sqlContext.read.json(...)
df.printSchema
// root
// |-- date: double (nullable = true)
// |-- userId: long (nullable = true)
// |-- data: array (nullable = true)
// | |-- element: struct (containsNull = true)
// | | |-- timeStamp: double (nullable = true)
// | | |-- reading: double (nullable = true)
I would like to transform it in order to have one row per reading. To my understanding, every transformation should produce a new DataFrame, so the following should work:
import org.apache.spark.sql.functions.explode
val exploded = df
.withColumn("reading", explode(df("data.reading")))
.withColumn("timeStamp", explode(df("data.timeStamp")))
.drop("data")
exploded.printSchema
// root
// |-- date: double (nullable = true)
// |-- userId: long (nullable = true)
// |-- timeStamp: double (nullable = true)
// |-- reading: double (nullable = true)
The resulting schema is correct, but I get every value twice:
exploded.show
// +-----------+-----------+-----------+-----------+
// | date| userId| timeStamp| reading|
// +-----------+-----------+-----------+-----------+
// | 100| 1| 101| 1|
// | 100| 1| 101| 1|
// | 100| 1| 102| 2|
// | 100| 1| 102| 2|
// | 200| 1| 201| 3|
// | 200| 1| 201| 3|
// | 200| 1| 202| 4|
// | 200| 1| 202| 4|
// +-----------+-----------+-----------+-----------+
My feeling is that there is something about the lazy evaluation of the two explodes that I don't understand.
Is there a way to get the above code to work? Or should I use a different approach all together?
The resulting schema is correct, but I get every value twice
While the schema is correct, the output you've provided doesn't reflect the actual result. In practice you'll get a Cartesian product of timeStamp and reading for each input row.
My feeling is that there is something about the lazy evaluation
No, it has nothing to do with lazy evaluation. The way you use explode is just wrong. To understand what is going on, let's trace the execution for date equal to 100:
val df100 = df.where($"date" === 100)
step by step. The first explode generates two rows, one for reading 1 and one for reading 2:
val df100WithReading = df100.withColumn("reading", explode(df("data.reading")))
df100WithReading.show
// +------------------+----+------+-------+
// | data|date|userId|reading|
// +------------------+----+------+-------+
// |[[1,101], [2,102]]| 100| 1| 1|
// |[[1,101], [2,102]]| 100| 1| 2|
// +------------------+----+------+-------+
The second explode generates two rows (timeStamp equal to 101 and 102) for each row from the previous step:
val df100WithReadingAndTs = df100WithReading
.withColumn("timeStamp", explode(df("data.timeStamp")))
df100WithReadingAndTs.show
// +------------------+----+------+-------+---------+
// | data|date|userId|reading|timeStamp|
// +------------------+----+------+-------+---------+
// |[[1,101], [2,102]]| 100| 1| 1| 101|
// |[[1,101], [2,102]]| 100| 1| 1| 102|
// |[[1,101], [2,102]]| 100| 1| 2| 101|
// |[[1,101], [2,102]]| 100| 1| 2| 102|
// +------------------+----+------+-------+---------+
If you want correct results, explode data and select afterwards:
val exploded = df.withColumn("data", explode($"data"))
.select($"userId", $"date",
$"data".getItem("reading"), $"data".getItem("timestamp"))
exploded.show
// +------+----+-------------+---------------+
// |userId|date|data[reading]|data[timestamp]|
// +------+----+-------------+---------------+
// | 1| 100| 1| 101|
// | 1| 100| 2| 102|
// | 1| 200| 3| 201|
// | 1| 200| 4| 202|
// +------+----+-------------+---------------+
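If you prefer the flat column names from the original attempt instead of data[reading] and data[timestamp], you can alias the nested fields after the explode. A minimal sketch of the same approach (the aliases are my own choice, not part of the original answer):
import org.apache.spark.sql.functions.explode

// Assumes the usual implicits are in scope for $ (import sqlContext.implicits._).
val flattened = df
  .withColumn("data", explode($"data"))
  .select(
    $"userId",
    $"date",
    $"data".getItem("reading").as("reading"),
    $"data".getItem("timeStamp").as("timeStamp"))
flattened.show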
Related
I am trying to get this complex data into a normal DataFrame format.
My data schema:
root
|-- column_names: array (nullable = true)
| |-- element: string (containsNull = true)
|-- values: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
|-- id: array (nullable = true)
| |-- element: string (containsNull = true)
My data file (JSON format):
{"column_names":["2_col_name","3_col_name"],"id":["a","b","c","d","e"],"values":[["2_col_1",1],["2_col_2",2],["2_col_3",9],["2_col_4",10],["2_col_5",11]]}
I am trying to convert the above data into this format:
+----------+----------+----------+
|1_col_name|2_col_name|3_col_name|
+----------+----------+----------+
| a| 2_col_1| 1|
| b| 2_col_2| 2|
| c| 2_col_3| 9|
| d| 2_col_4| 10|
| e| 2_col_5| 11|
+----------+----------+----------+
I tried using the explode function on id and values, but got a different output, as shown below:
+---+-------------+
| id| values|
+---+-------------+
| a| [2_col_1, 1]|
| a| [2_col_2, 2]|
| a| [2_col_3, 9]|
| a|[2_col_4, 10]|
| a|[2_col_5, 11]|
| b| [2_col_1, 1]|
| b| [2_col_2, 2]|
| b| [2_col_3, 9]|
| b|[2_col_4, 10]|
+---+-------------+
only showing top 9 rows
Not sure where I am going wrong.
You can use the arrays_zip + inline functions to flatten, then pivot the column names:
import org.apache.spark.sql.functions.{expr, first}

val df1 = df.select(
    $"column_names",
    expr("inline(arrays_zip(id, values))")
  ).select(
    $"id".as("1_col_name"),
    expr("inline(arrays_zip(column_names, values))")
  )
  .groupBy("1_col_name")
  .pivot("column_names")
  .agg(first("values"))
df1.show
//+----------+----------+----------+
//|1_col_name|2_col_name|3_col_name|
//+----------+----------+----------+
//|e |2_col_5 |11 |
//|d |2_col_4 |10 |
//|c |2_col_3 |9 |
//|b |2_col_2 |2 |
//|a |2_col_1 |1 |
//+----------+----------+----------+
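Note that row order after a groupBy/pivot is not guaranteed, which is why the rows above come back in a different order than the expected output. If the order matters, sort explicitly (a small follow-up of mine, not part of the original answer):
// Sort to get a deterministic row order after the pivot.
df1.orderBy("1_col_name").show(false)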
Currently, the schema for my table is:
root
|-- product_id: integer (nullable = true)
|-- product_name: string (nullable = true)
|-- aisle_id: string (nullable = true)
|-- department_id: string (nullable = true)
I want to apply the below schema to the above table and delete all the rows which do not follow it:
val productsSchema = StructType(Seq(
StructField("product_id",IntegerType,nullable = true),
StructField("product_name",StringType,nullable = true),
StructField("aisle_id",IntegerType,nullable = true),
StructField("department_id",IntegerType,nullable = true)
))
Use the option "DROPMALFORMED" while loading the data, which ignores corrupted records.
spark.read.format("json")
  .option("mode", "DROPMALFORMED")
  .schema(productsSchema)
  .load("sample.json")
If the data does not match the schema, Spark will put null as the value in that column. We just have to filter out the null values for all columns.
Used filter to filter `null` values for all columns.
scala> "cat /tmp/sample.json".! // JSON File Data, one row is not matching with schema.
{"product_id":1,"product_name":"sampleA","aisle_id":"AA","department_id":"AAD"}
{"product_id":2,"product_name":"sampleBB","aisle_id":"AAB","department_id":"AADB"}
{"product_id":3,"product_name":"sampleCC","aisle_id":"CC","department_id":"CCC"}
{"product_id":3,"product_name":"sampledd","aisle_id":"dd","departmentId":"ddd"}
{"name","srinivas","age":29}
res100: Int = 0
scala> schema.printTreeString
root
|-- aisle_id: string (nullable = true)
|-- department_id: string (nullable = true)
|-- product_id: long (nullable = true)
|-- product_name: string (nullable = true)
scala> val df = spark.read.schema(schema).option("badRecordsPath", "/tmp/badRecordsPath").format("json").load("/tmp/sample.json") // Load the JSON data; if the schema does not match, we get null values for all columns in that row.
df: org.apache.spark.sql.DataFrame = [aisle_id: string, department_id: string ... 2 more fields]
scala> df.show(false)
+--------+-------------+----------+------------+
|aisle_id|department_id|product_id|product_name|
+--------+-------------+----------+------------+
|AA |AAD |1 |sampleA |
|AAB |AADB |2 |sampleBB |
|CC |CCC |3 |sampleCC |
|dd |null |3 |sampledd |
|null |null |null |null |
+--------+-------------+----------+------------+
scala> df.filter(df.columns.map(c => s"${c} is not null").mkString(" or ")).show(false) // Filter null rows.
+--------+-------------+----------+------------+
|aisle_id|department_id|product_id|product_name|
+--------+-------------+----------+------------+
|AA |AAD |1 |sampleA |
|AAB |AADB |2 |sampleBB |
|CC |CCC |3 |sampleCC |
|dd |null |3 |sampledd |
+--------+-------------+----------+------------+
Do check out the na.drop functions on DataFrame: you can drop rows based on null values, on a minimum number of non-null values in a row, and also based on specific columns which contain nulls.
scala> sc.parallelize(Seq((1,"a","a"),(1,"a","a"),(2,"b","b"),(3,"c","c"),(4,"d","d"),(4,"d",null))).toDF
res7: org.apache.spark.sql.DataFrame = [_1: int, _2: string ... 1 more field]
scala> res7.show()
+---+---+----+
| _1| _2| _3|
+---+---+----+
| 1| a| a|
| 1| a| a|
| 2| b| b|
| 3| c| c|
| 4| d| d|
| 4| d|null|
+---+---+----+
//dropping row if a null is found
scala> res7.na.drop.show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
| 1| a| a|
| 1| a| a|
| 2| b| b|
| 3| c| c|
| 4| d| d|
+---+---+---+
//drops a row only if it has fewer than 3 non-null values (minNonNulls = 3)
scala> res7.na.drop(minNonNulls = 3).show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
| 1| a| a|
| 1| a| a|
| 2| b| b|
| 3| c| c|
| 4| d| d|
+---+---+---+
//drops nothing, since every row has at least 2 non-null values
scala> res7.na.drop(minNonNulls = 2).show()
+---+---+----+
| _1| _2| _3|
+---+---+----+
| 1| a| a|
| 1| a| a|
| 2| b| b|
| 3| c| c|
| 4| d| d|
| 4| d|null|
+---+---+----+
//drops rows based on nulls in the `_3` column
scala> res7.na.drop(Seq("_3")).show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
| 1| a| a|
| 1| a| a|
| 2| b| b|
| 3| c| c|
| 4| d| d|
+---+---+---+
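Tying this back to the original question: if, as described above, values that do not match the stricter schema (e.g. aisle_id = "AA" against IntegerType) are read as null, then dropping rows with nulls in those columns removes exactly the rows that do not follow productsSchema. A hedged sketch under that assumption:
// Assumes df was loaded with productsSchema, so failed integer casts show up as nulls.
val clean = df.na.drop(Seq("aisle_id", "department_id"))
clean.show(false)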
I read data from Elasticsearch and save it into an RDD.
val es_rdd = sc.esRDD("indexname/typename",query="?q=*")
The rdd has the next example data:
(uniqueId,Map(field -> value))
(uniqueId2,Map(field2 -> value2))
How can I convert this RDD[(String, Map[String, String])] to a DataFrame of (String, String, String)?
You can use explode to achieve it.
import spark.implicits._
import org.apache.spark.sql.functions._
val rdd = sc.range(1, 10).map(s => (s, Map(s -> s)))
val ds = spark.createDataset(rdd)
val df = ds.toDF()
df.printSchema()
df.show()
df.select('_1,explode('_2)).show()
output:
root
|-- _1: long (nullable = false)
|-- _2: map (nullable = true)
| |-- key: long
| |-- value: long (valueContainsNull = false)
+---+--------+
| _1| _2|
+---+--------+
| 1|[1 -> 1]|
| 2|[2 -> 2]|
| 3|[3 -> 3]|
| 4|[4 -> 4]|
| 5|[5 -> 5]|
| 6|[6 -> 6]|
| 7|[7 -> 7]|
| 8|[8 -> 8]|
| 9|[9 -> 9]|
+---+--------+
+---+---+-----+
| _1|key|value|
+---+---+-----+
| 1| 1| 1|
| 2| 2| 2|
| 3| 3| 3|
| 4| 4| 4|
| 5| 5| 5|
| 6| 6| 6|
| 7| 7| 7|
| 8| 8| 8|
| 9| 9| 9|
+---+---+-----+
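If you want the three columns under descriptive names rather than _1/key/value, you can rename them after the explode. A small sketch on the toy DataFrame above (the names id, field and value are illustrative, not from the original post):
import org.apache.spark.sql.functions.explode

// Explode the map (relies on spark.implicits._ imported above), then rename the columns.
val flat = df.select($"_1", explode($"_2")).toDF("id", "field", "value")
flat.show()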
I read it directly in Spark SQL format using the following call to Elasticsearch:
val df = spark.read.format("org.elasticsearch.spark.sql")
.option("query", "?q=*")
.option("pushdown", "true")
.load("indexname/typename")
I'm using Spark 1.6.0 and I want to join 2 DataFrames; they are shown in the YARN log as follows.
df_train_raw
df_user_clicks_info
I have tried to inner join them with code:
val df_tmp_tmp_0 = df_train_raw.join(df_user_clicks_info, Seq("subscriberid"))
df_tmp_tmp_0.show()
And the result I got was exactly nothing! OMG!
+------------+--------+-----+------------+-------------------------------+-------------------------------+--------------------------------+---------------------------------+---------------------------------+---------------------------------+--------------------------------+--------------------------------+--------------------------------+---------------------------------+----------------------------------+----------------------------------+----------------------------------+---------------------------------+---------------------------------+---------------------------------+----------------------------------+-----------------------------------+-----------------------------------+-----------------------------------+----------------------------------+
|subscriberid|objectid|label|subscriberid|user_clicks_avg_everyday_a_week|user_clicks_sum_time_1_9_a_week|user_clicks_sum_time_9_14_a_week|user_clicks_sum_time_14_17_a_week|user_clicks_sum_time_17_19_a_week|user_clicks_sum_time_19_23_a_week|user_clicks_sum_time_23_1_a_week|user_clicks_avg_everyday_weekday|user_clicks_sum_time_1_9_weekday|user_clicks_sum_time_9_14_weekday|user_clicks_sum_time_14_17_weekday|user_clicks_sum_time_17_19_weekday|user_clicks_sum_time_19_23_weekday|user_clicks_sum_time_23_1_weekday|user_clicks_avg_everyday_weekdend|user_clicks_sum_time_1_9_weekdend|user_clicks_sum_time_9_14_weekdend|user_clicks_sum_time_14_17_weekdend|user_clicks_sum_time_17_19_weekdend|user_clicks_sum_time_19_23_weekdend|user_clicks_sum_time_23_1_weekdend|
+------------+--------+-----+------------+-------------------------------+-------------------------------+--------------------------------+---------------------------------+---------------------------------+---------------------------------+--------------------------------+--------------------------------+--------------------------------+---------------------------------+----------------------------------+----------------------------------+----------------------------------+---------------------------------+---------------------------------+---------------------------------+----------------------------------+-----------------------------------+-----------------------------------+-----------------------------------+----------------------------------+
+------------+--------+-----+------------+-------------------------------+-------------------------------+--------------------------------+---------------------------------+---------------------------------+---------------------------------+--------------------------------+--------------------------------+--------------------------------+---------------------------------+----------------------------------+----------------------------------+----------------------------------+---------------------------------+---------------------------------+---------------------------------+----------------------------------+-----------------------------------+-----------------------------------+-----------------------------------+----------------------------------+
I don't know why; nothing seems wrong here. Hope someone can help, please. Thanks!
After two friends' advice about whitespace, I had another try:
df_train_raw
————————————
+------------+-----------+-----+
|subscriberid| objectid|label|
+------------+-----------+-----+
| 104752237|11029932485| 0|
| 105246837|11029932485| 0|
| 105517237|11029932485| 0|
| 108917037|11030797988| 0|
| 108917037|11029648595| 0|
| 109901037|11029648595| 0|
| 105517237|11030720502| 0|
| 105246837|11029986502| 0|
| 104752237|11029191717| 0|
| 105246837|11029191717| 0|
| 105517237|11029191717| 0|
| 109901037|11030138623| 0|
| 105517237|11014105538| 0|
| 105517237|11014105543| 0|
| 105517237|11016478156| 0|
| 105517237|11023285357| 0|
| 105246837|11026067980| 0|
| 105246837|11030797988| 0|
| 108917037|11029932485| 0|
| 109901037|11029932485| 0|
+------------+-----------+-----+
only showing top 20 rows
————————————
root
|-- subscriberid: long (nullable = true)
|-- objectid: long (nullable = true)
|-- label: integer (nullable = true)
and printed the "subscriberid" column; this showed it was not a whitespace issue.
df_train_raw.select("subscriberid").take(20).foreach(println)
The result:
[104752237]
[105246837]
[105517237]
[108917037]
[108917037]
[109901037]
[105517237]
[105246837]
[104752237]
[105246837]
[105517237]
[109901037]
[105517237]
[105517237]
[105517237]
[105517237]
[105246837]
[105246837]
[108917037]
[109901037]
And for df_user_clicks_info:
+------------+-------------------------------+-------------------------------+--------------------------------+---------------------------------+---------------------------------+---------------------------------+--------------------------------+--------------------------------+--------------------------------+---------------------------------+----------------------------------+----------------------------------+----------------------------------+---------------------------------+---------------------------------+---------------------------------+----------------------------------+-----------------------------------+-----------------------------------+-----------------------------------+----------------------------------+
|subscriberid|user_clicks_avg_everyday_a_week|user_clicks_sum_time_1_9_a_week|user_clicks_sum_time_9_14_a_week|user_clicks_sum_time_14_17_a_week|user_clicks_sum_time_17_19_a_week|user_clicks_sum_time_19_23_a_week|user_clicks_sum_time_23_1_a_week|user_clicks_avg_everyday_weekday|user_clicks_sum_time_1_9_weekday|user_clicks_sum_time_9_14_weekday|user_clicks_sum_time_14_17_weekday|user_clicks_sum_time_17_19_weekday|user_clicks_sum_time_19_23_weekday|user_clicks_sum_time_23_1_weekday|user_clicks_avg_everyday_weekdend|user_clicks_sum_time_1_9_weekdend|user_clicks_sum_time_9_14_weekdend|user_clicks_sum_time_14_17_weekdend|user_clicks_sum_time_17_19_weekdend|user_clicks_sum_time_19_23_weekdend|user_clicks_sum_time_23_1_weekdend|
+------------+-------------------------------+-------------------------------+--------------------------------+---------------------------------+---------------------------------+---------------------------------+--------------------------------+--------------------------------+--------------------------------+---------------------------------+----------------------------------+----------------------------------+----------------------------------+---------------------------------+---------------------------------+---------------------------------+----------------------------------+-----------------------------------+-----------------------------------+-----------------------------------+----------------------------------+
| 104752237| 1.71| 0| 0| 0| 4| 4| 4| 0.8| 0| 0| 0| 0| 4| 0| 4.0| 0| 0| 0| 4| 0| 4|
| 105517237| 17.14| 12| 36| 12| 0| 60| 0| 9.6| 0| 0| 0| 0| 48| 0| 36.0| 12| 36| 12| 0| 12| 0|
| 109901037| 2.14| 0| 3| 3| 6| 3| 0| 2.4| 0| 0| 3| 6| 3| 0| 1.5| 0| 3| 0| 0| 0| 0|
| 105246837| 8.0| 8| 0| 0| 16| 32| 0| 8.0| 8| 0| 0| 8| 24| 0| 8.0| 0| 0| 0| 8| 8| 0|
+------------+-------------------------------+-------------------------------+--------------------------------+---------------------------------+---------------------------------+---------------------------------+--------------------------------+--------------------------------+--------------------------------+---------------------------------+----------------------------------+----------------------------------+----------------------------------+---------------------------------+---------------------------------+---------------------------------+----------------------------------+-----------------------------------+-----------------------------------+-----------------------------------+----------------------------------+
————————————
root
|-- subscriberid: string (nullable = true)
|-- user_clicks_avg_everyday_a_week: double (nullable = false)
|-- user_clicks_sum_time_1_9_a_week: long (nullable = false)
|-- user_clicks_sum_time_9_14_a_week: long (nullable = false)
|-- user_clicks_sum_time_14_17_a_week: long (nullable = false)
|-- user_clicks_sum_time_17_19_a_week: long (nullable = false)
|-- user_clicks_sum_time_19_23_a_week: long (nullable = false)
|-- user_clicks_sum_time_23_1_a_week: long (nullable = false)
|-- user_clicks_avg_everyday_weekday: double (nullable = false)
|-- user_clicks_sum_time_1_9_weekday: long (nullable = false)
|-- user_clicks_sum_time_9_14_weekday: long (nullable = false)
|-- user_clicks_sum_time_14_17_weekday: long (nullable = false)
|-- user_clicks_sum_time_17_19_weekday: long (nullable = false)
|-- user_clicks_sum_time_19_23_weekday: long (nullable = false)
|-- user_clicks_sum_time_23_1_weekday: long (nullable = false)
|-- user_clicks_avg_everyday_weekdend: double (nullable = false)
|-- user_clicks_sum_time_1_9_weekdend: long (nullable = false)
|-- user_clicks_sum_time_9_14_weekdend: long (nullable = false)
|-- user_clicks_sum_time_14_17_weekdend: long (nullable = false)
|-- user_clicks_sum_time_17_19_weekdend: long (nullable = false)
|-- user_clicks_sum_time_19_23_weekdend: long (nullable = false)
|-- user_clicks_sum_time_23_1_weekdend: long (nullable = false)
df_user_clicks_info.select("subscriberid").take(20).foreach(println)
[104752237]
[105517237]
[109901037]
[105246837]
It didn't work either :(
Thanks to the friends who helped me. The reason for this, I think, is a bug in Spark 1.6.0, and I solved it by changing my data processing without upgrading Spark. In the beginning I wanted to get df_3 from df_1 and df_2, but that didn't give the result I wanted because of the bug mentioned in the question, so I took another route: I built df_tmp_1 and df_tmp_2 and joined those instead, which gave the result. I don't know why either, but it seems like a good idea if you use Spark 1.6.0 and hit the same join bug as me.
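One more thing worth ruling out, given the schemas printed above: subscriberid is a long in df_train_raw but a string in df_user_clicks_info, so the join key types don't match. A hedged sketch of aligning the types before the join (my suggestion, not the fix the author actually used):
import org.apache.spark.sql.functions.col

// Cast the join key to a common type before joining (long vs string in the schemas above).
val df_joined = df_train_raw
  .withColumn("subscriberid", col("subscriberid").cast("string"))
  .join(df_user_clicks_info, Seq("subscriberid"))
df_joined.show()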
I have a DataFrame (around 20,000,000 rows) and I'd like to drop duplicates based on two columns if those columns have the same values, or even if those values are in the reverse order.
For example the original dataframe:
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| 1| A|
| 1| 1| B|
| 2| 1| C|
| 1| 2| D|
| 3| 5| E|
| 3| 4| F|
| 4| 3| G|
+----+----+----+
where the schema of the columns is as follows:
root
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
|-- col3: string (nullable = true)
The desired dataframe should look like:
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| 1| A|
| 1| 2| D|
| 3| 5| E|
| 3| 4| F|
+----+----+----+
The dropDuplicates() method removes duplicates only if the values are in the same order.
I followed the accepted answer to the question "Pandas: remove reverse duplicates from dataframe", but it took too much time.
You can use least and greatest to normalize the two columns, then drop the duplicates:
Note: in 'col3', 'D' will be removed instead of 'C', because 'C' is positioned before 'D'.
Hope this helps.
from pyspark.sql import functions as F
df = spark.read.csv('/FileStore/tables/stack2.csv', header = 'True')
df2 = df.select(F.least(df.col1,df.col2).alias('col1'),F.greatest(df.col1,df.col2).alias('col2'),df.col3)
df2.dropDuplicates(['col1','col2']).show()
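For readers following the rest of this page in Scala, the same least/greatest idea translates directly (a sketch, assuming a DataFrame df with the columns col1, col2, col3 shown above):
import org.apache.spark.sql.functions.{least, greatest}

// Assumes spark.implicits._ is in scope for the $ syntax.
// Normalize the unordered pair (col1, col2), then drop duplicates on it.
val df2 = df.select(
  least($"col1", $"col2").as("col1"),
  greatest($"col1", $"col2").as("col2"),
  $"col3")
df2.dropDuplicates("col1", "col2").show()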