PySpark - Convert list of JSON objects to rows

I want to convert a list of objects to rows and store their attributes as columns.
{
"heading": 1,
"columns": [
{
"col1": "a",
"col2": "b",
"col3": "c"
},
{
"col1": "d",
"col2": "e",
"col3": "f"
}
]
}
Final Result
heading | col1 | col2 | col3
1 | a | b | c
1 | d | e | f
I am currently flattening my data (and excluding the columns column) with:
df = target_table.relationalize('roottable', temp_path)
However, for this use case, I will need the columns column. I saw examples where arrays_zip and explode were used. Would I need to iterate through each object, or is there an easier way to extract each object and convert it into a row?

Using the Spark SQL built-in function inline or inline_outer is probably the easiest way to handle this (use inline_outer when NULLs are allowed in columns):
From the Apache Hive documentation:
Explodes an array of structs to multiple rows. Returns a row-set with N columns (N = number of top level elements in the struct), one row per struct from the array. (As of Hive 0.10.)
df.selectExpr('heading', 'inline_outer(columns)').show()
+-------+----+----+----+
|heading|col1|col2|col3|
+-------+----+----+----+
| 1| a| b| c|
| 1| d| e| f|
+-------+----+----+----+
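If you'd rather use the explode route mentioned in the question, an equivalent sketch (assuming the same df, with a heading column and an array-of-structs columns column) is to explode the array and then expand the struct fields, which produces the same result as inline_outer above:
from pyspark.sql import functions as F

# one row per struct in the array (explode_outer keeps rows whose array is NULL/empty)
exploded = df.select('heading', F.explode_outer('columns').alias('c'))

# expand the struct fields into top-level columns
exploded.select('heading', 'c.*').show()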

PySpark dataframe: how to use flatMap

I am writing a PySpark program that compares two tables, let's say Table 1 and Table 2.
Both tables have an identical structure, but may contain different data.
Table 1 has the following columns:
key1, key2, col1, col2, col3
The sample data in Table 1 is as follows:
"a", 1, "x1", "y1", "z1"
"a", 2, "x2", "y2", "z2"
"a", 3, "x3", "y3", "z3"
Similarly, Table 2 has the following columns:
key1, key2, col1, col2, col3
The sample data in Table 2 is as follows:
"a", 1, "x1", "y1", "z1"
"a", 2, "x21", "y21", "z2"
"a", 3, "x3", "y3", "z31"
The program creates a data frame (let's say df1) that contains the following columns:
Key1, Key2, a.Col1, a.Col2, a.Col3, b.Col1, b.Col2, b.Col3, column_names
example data:
"a", 2, "x2", "y2", "z2", "x21", "y21", "z2", "col1,col2"
"a", 3, "x3", "y3", "z3", "x3", "y3", "z31", "col3"
The column "column_names" contains columns that have different values between table1 and table2
Using this data frame, I need to create another data frame that contains below structure
key1, key2, field_in_difference, src_value, tgt_value
"a", 2, "col1", "x2", "x21"
"a", 2, "col2", "y2", "y21"
"a", 3, "col3", "z3", "z31"
I am thinking that I need to use flatMap in PySpark.
Can I use flatMap on one of the columns in the dataframe, so that multiple rows are created in the resulting dataframe while the remaining columns are copied into the new rows?
I tried the following, but it does not seem to be correct syntax:
df2 = df1.withColumn("newcolumn", func.concat_ws(",", flatMap(lambda x: x.split(','))))
But I get an error: NameError: name 'flatMap' is not defined
I am not sure how to specify that the flatMap needs to be done on the column "column_names" while keeping the remaining columns as they are.
I think the approach is to first create one row per column in difference,
then in a second step create another df that transforms this into the expected output.
I really appreciate the help.
flatMap works on RDDs, not DataFrames.
I don't quite understand how you want to use flatMap on df1, but I think working directly from Table 1 and Table 2 might be easier. Let's say Table 1 is df_src and Table 2 is df_tgt.
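(If you want to reproduce the steps below, a minimal sketch to build these two frames from the sample data in the question:)
df_src = spark.createDataFrame(
    [('a', 1, 'x1', 'y1', 'z1'), ('a', 2, 'x2', 'y2', 'z2'), ('a', 3, 'x3', 'y3', 'z3')],
    ['key1', 'key2', 'col1', 'col2', 'col3'])
df_tgt = spark.createDataFrame(
    [('a', 1, 'x1', 'y1', 'z1'), ('a', 2, 'x21', 'y21', 'z2'), ('a', 3, 'x3', 'y3', 'z31')],
    ['key1', 'key2', 'col1', 'col2', 'col3'])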
df_src.show()
+----+----+----+----+----+
|key1|key2|col1|col2|col3|
+----+----+----+----+----+
| a| 1| x1| y1| z1|
| a| 2| x2| y2| z2|
| a| 3| x3| y3| z3|
+----+----+----+----+----+
df_tgt.show()
+----+----+----+----+----+
|key1|key2|col1|col2|col3|
+----+----+----+----+----+
| a| 1| x1| y1| z1|
| a| 2| x21| y21| z2|
| a| 3| x3| y3| z31|
+----+----+----+----+----+
You can un-pivot both dataframes using the stack function, join them, and filter the result.
from pyspark.sql.functions import col
# unpivot col1, col2 and col3 of both dataframes. rename key columns as well
df_src = df_src.selectExpr("key1 key1_s", "key2 key2_s", "stack(3, 'col1', col1, 'col2', col2, 'col3', col3) (field_s, src_value)")
df_tgt = df_tgt.selectExpr("key1 key1_t", "key2 key2_t", "stack(3, 'col1', col1, 'col2', col2, 'col3', col3) (field_t, tgt_value)")
# join the dataframes on keys and field, then filter where field values are different
df_res = (df_src
.join(df_tgt,
[col('key1_s') == col('key1_t'), col('key2_s') == col('key2_t'), col('field_s') == col('field_t')],
'inner')
.filter(col('src_value') != col('tgt_value'))
.selectExpr('key1_s key1', 'key2_s key2', 'field_s field_in_difference', 'src_value', 'tgt_value')
)
df_res.show()
+----+----+-------------------+---------+---------+
|key1|key2|field_in_difference|src_value|tgt_value|
+----+----+-------------------+---------+---------+
| a| 2| col1| x2| x21|
| a| 2| col2| y2| y21|
| a| 3| col3| z3| z31|
+----+----+-------------------+---------+---------+
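If you do want to start from df1 instead, a rough sketch (assuming the source/target columns in df1 are renamed to a_col1 ... b_col3, since dots in column names are awkward to reference) would be to explode the comma-separated column_names and then pick the matching source/target values with when/coalesce rather than flatMap:
from pyspark.sql import functions as F

cols = ['col1', 'col2', 'col3']

# one row per differing column name
exploded = df1.withColumn('field_in_difference',
                          F.explode(F.split('column_names', ',')))

# for each row, pick the source/target value belonging to that column name
src_value = F.coalesce(*[F.when(F.col('field_in_difference') == c, F.col('a_' + c)) for c in cols])
tgt_value = F.coalesce(*[F.when(F.col('field_in_difference') == c, F.col('b_' + c)) for c in cols])

df2 = exploded.select('key1', 'key2', 'field_in_difference',
                      src_value.alias('src_value'), tgt_value.alias('tgt_value'))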

substring function returns a Column type instead of a value. Is there a way to fetch a value out of a Column type in PySpark?

I am building a join condition in my PySpark application using the substring function. This function returns a Column type instead of a value.
substring(trim(coalesce(df.col1)), 13, 3) returns
Column<b'substring(trim(coalesce(col1), 13, 3)'>
I tried with expr but still get the same Column type result:
expr("substring(trim(coalesce(df.col1)),length(trim(coalesce(df.col1))) - 2, 3)")
I want to compare the value coming from substring to the value of another dataframe column. Both are of string type.
pyspark:
substring(trim(coalesce(df.col1)), length(trim(coalesce(df.col1))) -2, 3) == df2["col2"]
Let's say col1 = 'abcdefghijklmno'.
The expected output of the substring function should be 'mno' based on the above definition.
Creating sample dataframes to join:
list1 = [('ABC','abcdefghijklmno'),('XYZ','abcdefghijklmno'),('DEF','abcdefghijklabc')]
df1=spark.createDataFrame(list1, ['col1', 'col2'])
list2 = [(1,'mno'),(2,'mno'),(3,'abc')]
df2=spark.createDataFrame(list2, ['col1', 'col2'])
import pyspark.sql.functions as f
Create a substring condition that reads the last three characters:
cond=f.substring(df1['col2'], -3, 3)==df2['col2']
newdf=df1.join(df2,cond)
>>> newdf.show()
+----+---------------+----+----+
|col1| col2|col1|col2|
+----+---------------+----+----+
| ABC|abcdefghijklmno| 1| mno|
| ABC|abcdefghijklmno| 2| mno|
| XYZ|abcdefghijklmno| 1| mno|
| XYZ|abcdefghijklmno| 2| mno|
| DEF|abcdefghijklabc| 3| abc|
+----+---------------+----+----+
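As for the title question: a Column is just an expression that Spark evaluates per row when it is used in a join or filter, so you normally never need the value itself. If you really do want a plain Python string out of it, you have to collect it, for example (a sketch on the sample df1 above):
# evaluate the expression and pull the result of the first row back to the driver
value = df1.select(f.substring(df1['col2'], -3, 3).alias('last3')).first()['last3']
# value is now the plain string 'mno' for the sample data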

Count Rows where a date is within a date range for each ID

We have a dataframe (of several million rows) consisting of:
Id
start date
end date
date
For each row we take the date value and want to count how many rows exist for that Id where this date lies between start date and end date.
This value should then be included in a new column ("sum_of_rows").
Here is the table we expect (with sum_of_rows being the column to create):
+---+----------+----------+----------+-----------+
| Id| start| end| date|sum_of_rows|
+---+----------+----------+----------+-----------+
| A|2008-01-02|2010-01-01|2009-01-01| 2|
| A|2005-01-02|2012-01-01| null| null|
| A|2013-01-02|2015-01-01|2014-01-01| 1|
| B|2002-01-02|2019-01-01|2003-01-01| 1|
| B|2015-01-02|2017-01-01|2016-01-01| 2|
+---+----------+----------+----------+-----------+
Example:
We look at the first row: take the date "2009-01-01", look at all rows whose Id matches this row's Id (A here), and count
in how many of those rows the date "2009-01-01" lies between start and end (true for rows 1 and 2 in this example).
Code for the original table:
table = spark.createDataFrame(
[
["A", '2008-01-02', '2010-01-01', '2009-01-01'],
["A", '2005-01-02', '2012-01-01', None],
["A", '2013-01-02', '2015-01-01', '2014-01-01'],
["B", '2002-01-02', '2019-01-01', '2003-01-01'],
["B", '2015-01-02', '2017-01-01', '2016-01-01']
],
("Id", "start", "end", "date")
)
This code works but creates a "product" join which is not recommended with big volumes of data.
import pyspark.sql.functions as F

table2 = table.select(
F.col("id"),
F.col("start").alias("s"),
F.col("end").alias("e"),
)
table3 = table.join(
table2, on="id"
)
table3 = table3.withColumn(
"one",
F.when(
F.col("date").between(F.col("s"),F.col("e")),
1
).otherwise(0)
)
table3.groupBy(
"Id",
"start",
"end",
"date"
).agg(F.sum("one").alias("sum_of_rows")).show()
+---+----------+----------+----------+-----------+
| Id| start| end| date|sum_of_rows|
+---+----------+----------+----------+-----------+
| B|2002-01-02|2019-01-01|2003-01-01| 1|
| B|2015-01-02|2017-01-01|2016-01-01| 2|
| A|2008-01-02|2010-01-01|2009-01-01| 2|
| A|2013-01-02|2015-01-01|2014-01-01| 1|
| A|2005-01-02|2012-01-01| null| 0|
+---+----------+----------+----------+-----------+
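If the row-multiplying join becomes a problem at scale, one possible alternative (a sketch, needs Spark 2.4+ for higher-order functions) is to join only one aggregated row of intervals per Id and count the matches inside that array:
import pyspark.sql.functions as F

# one row per Id holding all of its (start, end) intervals
intervals = table.groupBy("Id").agg(
    F.collect_list(F.struct("start", "end")).alias("intervals"))

result = (table
    .join(intervals, on="Id")
    .withColumn(
        "sum_of_rows",
        F.when(F.col("date").isNotNull(),
               F.size(F.expr("filter(intervals, x -> date >= x.start and date <= x.end)"))))
    .drop("intervals"))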

pyspark/dataframe - creating a nested structure

I'm using PySpark with dataframes and would like to create a nested structure as below.
Before:
Column 1 | Column 2 | Column 3
--------------------------------
A | B | 1
A | B | 2
A | C | 1
After:
Column 1 | Column 4
--------------------------------
A | [B : [1,2]]
A | [C : [1]]
Is this doable?
I don't think you can get that exact output, but you can come close. The problem is the key names for Column 4. In Spark, structs need to have a fixed set of columns known in advance. But let's leave that for later; first, the aggregation:
import pyspark
from pyspark.sql import functions as F
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)
data = [('A', 'B', 1), ('A', 'B', 2), ('A', 'C', 1)]
columns = ['Column1', 'Column2', 'Column3']
data = spark.createDataFrame(data, columns)
data.createOrReplaceTempView("data")
data.show()
# Result
+-------+-------+-------+
|Column1|Column2|Column3|
+-------+-------+-------+
| A| B| 1|
| A| B| 2|
| A| C| 1|
+-------+-------+-------+
nested = spark.sql("SELECT Column1, Column2, STRUCT(COLLECT_LIST(Column3) AS data) AS Column4 FROM data GROUP BY Column1, Column2")
nested.toJSON().collect()
# Result
['{"Column1":"A","Column2":"C","Column4":{"data":[1]}}',
'{"Column1":"A","Column2":"B","Column4":{"data":[1,2]}}']
Which is almost what you want, right? The problem is that if you do not know your key names in advance (that is, the values in Column 2), Spark cannot determine the structure of your data. Also, I am not entirely sure how you can use the value of a column as key for a structure unless you use a UDF (maybe with a PIVOT?):
datatype = 'struct<B:array<bigint>,C:array<bigint>>' # Add any other potential keys here.
@F.udf(datatype)
def replace_struct_name(column2_value, column4_value):
return {column2_value: column4_value['data']}
nested.withColumn('Column4', replace_struct_name(F.col("Column2"), F.col("Column4"))).toJSON().collect()
# Output
['{"Column1":"A","Column2":"C","Column4":{"C":[1]}}',
'{"Column1":"A","Column2":"B","Column4":{"B":[1,2]}}']
This of course has the drawback that the number of keys must be discrete and known in advance, otherwise other key values will be silently ignored.
First, a reproducible example of your dataframe:
from pyspark.sql import SQLContext
import pyspark.sql.functions as F

js = [{"col1": "A", "col2":"B", "col3":1},{"col1": "A", "col2":"B", "col3":2},{"col1": "A", "col2":"C", "col3":1}]
jsrdd = sc.parallelize(js)
sqlContext = SQLContext(sc)
jsdf = sqlContext.read.json(jsrdd)
jsdf.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| A| B| 1|
| A| B| 2|
| A| C| 1|
+----+----+----+
Now, lists are not stored as key-value pairs. You can either use a map (dictionary) column or simply collect_list() after doing a groupby on col1 and col2.
jsdf.groupby(['col1', 'col2']).agg(F.collect_list('col3')).show()
+----+----+------------------+
|col1|col2|collect_list(col3)|
+----+----+------------------+
| A| C| [1]|
| A| B| [1, 2]|
+----+----+------------------+
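If you want the Column 2 value itself as the key, a map column is a closer fit than a struct, because map keys (unlike struct fields) do not have to be known in advance. A sketch on the same jsdf:
nested = (jsdf.groupby('col1', 'col2')
          .agg(F.collect_list('col3').alias('values'))
          .select('col1', F.create_map('col2', 'values').alias('col4')))
# col4 is a map column, e.g. {B -> [1, 2]} or {C -> [1]} per row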

Data filtering in Spark

I am trying to do a certain kind of filtering using Spark. I have a data frame that looks like the following:
ID Property#1 Property#2 Property#3
-----------------------------------------
01 a b c
01 a X c
02 d e f
03 i j k
03 i j k
I expect the properties for a given ID to be the same. In the example above, I would like to filter out the following:
ID Property#2
---------------
01 b
01 X
Note that it is okay for IDs to be repeated in the data frame as long as the properties are the same (e.g. ID '03' in the first table). The code needs to be as efficient as possible, as I am planning to apply it to datasets with >10k rows. I tried extracting the distinct rows using the distinct function in the DataFrame API, grouping them on the ID column using groupBy, and aggregating the results using the countDistinct function, but unfortunately I couldn't get a working version of the code. Also, the way I implemented it seems to be quite slow. I was wondering if anyone can provide some pointers on how to approach this problem.
Thanks!
You can, for example, aggregate and join. First you'll have to create a lookup table:
val df = Seq(
("01", "a", "b", "c"), ("01", "a", "X", "c"),
("02", "d", "e", "f"), ("03", "i", "j", "k"),
("03", "i", "j", "k")
).toDF("id", "p1", "p2", "p3")
val lookup = df.distinct.groupBy($"id").count
Then filter the records:
df.join(broadcast(lookup), Seq("id"))
df.join(broadcast(lookup), Seq("id")).where($"count" !== 1).show
// +---+---+---+---+-----+
// | id| p1| p2| p3|count|
// +---+---+---+---+-----+
// | 01| a| b| c| 2|
// | 01| a| X| c| 2|
// +---+---+---+---+-----+
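The same approach in PySpark, assuming a df with the same columns (id, p1, p2, p3) as the Scala example above:
from pyspark.sql import functions as F

# ids whose distinct property combinations appear more than once are inconsistent
lookup = df.distinct().groupBy("id").count()
(df.join(F.broadcast(lookup), "id")
   .where(F.col("count") != 1)
   .show())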