parse a string of jsons in pyspark

I am trying to parse a column containing a list of JSON strings, but even after trying multiple schemas with StructType, StructField, etc., I am unable to get the schema right.
[{"event":"empCreation","count":"148"},{"event":"jobAssignment","count":"3"},{"event":"locationAssignment","count":"77"}]
[{"event":"empCreation","count":"334"},{"event":"jobAssignment","count":33"},{"event":"locationAssignment","count":"73"}]
[{"event":"empCreation","count":"18"},{"event":"jobAssignment","count":"32"},{"event":"locationAssignment","count":"72"}]
Based on this SO post, I was able to derive the JSON schema, but even after applying the from_json function it still wouldn't work:
Pyspark: Parse a column of json strings
Can you please help?

You can parse the given JSON with the schema definition below and read it as a DataFrame by providing the schema info.
>>> from pyspark.sql.types import StructType, StructField, StringType
>>>
>>> dschema = StructType([
...     StructField("event", StringType(), True),
...     StructField("count", StringType(), True)])
>>>
>>> df = spark.read.json('/<json_file_path>/json_file.json', schema=dschema)
>>>
>>> df.show()
+------------------+-----+
| event|count|
+------------------+-----+
| empCreation| 148|
| jobAssignment| 3|
|locationAssignment| 77|
| empCreation| 334|
| jobAssignment| 33|
|locationAssignment| 73|
| empCreation| 18|
| jobAssignment| 32|
|locationAssignment| 72|
+------------------+-----+
>>>
Contents of the json file:
cat json_file.json
[{"event":"empCreation","count":"148"},{"event":"jobAssignment","count":"3"},{"event":"locationAssignment","count":"77"}]
[{"event":"empCreation","count":"334"},{"event":"jobAssignment","count":"33"},{"event":"locationAssignment","count":"73"}]
[{"event":"empCreation","count":"18"},{"event":"jobAssignment","count":"32"},{"event":"locationAssignment","count":"72"}]

Thanks so much @Lakshmanan, but I had to make a slight change to the schema:
eventCountSchema = ArrayType(StructType([StructField("event", StringType(), True), StructField("count", StringType(), True)]), True)
and then applied this schema to the DataFrame's complex datatype column.
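For reference, applying that ArrayType schema to a string column and exploding the result might look like the sketch below; the column name events_json and the inline sample row are assumptions for illustration, not from the original post.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# Hypothetical column holding the raw JSON array strings
df = spark.createDataFrame(
    [('[{"event":"empCreation","count":"148"},{"event":"jobAssignment","count":"3"}]',)],
    ["events_json"])

eventCountSchema = ArrayType(StructType([
    StructField("event", StringType(), True),
    StructField("count", StringType(), True)]), True)

# Parse the string column into an array of structs, then explode to one row per event
parsed = (df
    .withColumn("events", F.from_json("events_json", eventCountSchema))
    .select(F.explode("events").alias("e"))
    .select("e.event", "e.count"))
parsed.show()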

Related

spark bad records: bad records show reason for only one column

I am trying to filter out bad records from a CSV file using PySpark. The code snippet is given below:
from pyspark.sql import SparkSession
schema="employee_id int,name string,address string,dept_id int"
spark = SparkSession.builder.appName("TestApp").getOrCreate()
data = spark.read.format("csv").option("header", True).schema(schema).option("badRecordsPath", "/tmp/bad_records").load("/path/to/csv/file")
schema_for_bad_record="path string,record string,reason string"
bad_records_frame=spark.read.schema(schema_for_bad_record).json("/tmp/bad_records")
bad_records_frame.select("reason").show()
The valid DataFrame is:
+-----------+-------+-------+-------+
|employee_id| name|address|dept_id|
+-----------+-------+-------+-------+
| 1001| Floyd| Delhi| 1|
| 1002| Watson| Mumbai| 2|
| 1004|Thomson|Chennai| 3|
| 1005| Bonila| Pune| 4|
+-----------+-------+-------+-------+
In one of the records, both employee_id and dept_id have incorrect values, but the reason shows only one column's issue:
java.lang.NumberFormatException: For input string: "abc"
Is there any way to show reasons for multiple columns in case of failure?

reading partitioned parquet record in pyspark

I have a parquet file partitioned by a date field (YYYY-MM-DD).
How can I efficiently read only the (current date - 1 day) records from the file in PySpark? Please suggest.
PS: I would not like to read the entire file and then filter the records, as the data volume is huge.
There are multiple ways to go about this:
Suppose this is the input data and you write out the DataFrame partitioned on the "date" column:
import datetime
from pyspark.sql.types import StructType, StructField, DateType, StringType

data = [(datetime.date(2022, 6, 12), "Hello"), (datetime.date(2022, 6, 19), "World")]
schema = StructType([StructField("date", DateType()), StructField("message", StringType())])
df = spark.createDataFrame(data, schema=schema)
df.write.mode('overwrite').partitionBy('date').parquet('./test')
You can read the parquet files associated with a given date using this syntax:
spark.read.parquet('./test/date=2022-06-19').show()
# The catch is that the date column will be omitted from your DataFrame
+-------+
|message|
+-------+
| World|
+-------+
# You could add the date column back with lit:
from pyspark.sql import functions as f
from pyspark.sql.types import DateType

(spark.read.parquet('./test/date=2022-06-19')
    .withColumn('date', f.lit('2022-06-19').cast(DateType()))
    .show()
)
# Output
+-------+----------+
|message| date|
+-------+----------+
| World|2022-06-19|
+-------+----------+
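Another option, as a sketch: read the base path and filter on the partition column. Spark prunes the non-matching partitions during file listing, and the date column stays in the output.
from pyspark.sql import functions as f

# Only the date=2022-06-19 directory is scanned thanks to partition pruning,
# and the partition column is kept in the result
spark.read.parquet('./test').where(f.col('date') == '2022-06-19').show()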
A more efficient solution is to use Delta tables:
df.write.mode('overwrite').partitionBy('date').format('delta').save('./test')
spark.read.format('delta').load('./test').where(f.col('date') == '2022-06-19').show()
The Spark engine uses the _delta_log to optimize your query and only reads the parquet files that apply to it. Also, the output will have all the columns:
+-------+----------+
|message| date|
+-------+----------+
| World|2022-06-19|
+-------+----------+
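To check that only the matching partition is actually read, you can inspect the physical plan; a sketch (the exact plan text varies by Spark/Delta version):
from pyspark.sql import functions as f

# Look for the partition filter / file pruning in the physical plan
spark.read.format('delta').load('./test') \
    .where(f.col('date') == '2022-06-19') \
    .explain()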
You can read it by passing a date variable while reading. The code below is dynamic: you don't need to hardcode the date, just append it to the path.
>>> df.show()
+-----+-----------------+-----------+----------+
|Sr_No| User_Id|Transaction| dt|
+-----+-----------------+-----------+----------+
| 1|paytm 111002203#p| 100D|2022-06-29|
| 2|paytm 111002203#p| 50C|2022-06-27|
| 3|paytm 111002203#p| 20C|2022-06-26|
| 4|paytm 111002203#p| 10C|2022-06-25|
| 5| null| 1C|2022-06-24|
+-----+-----------------+-----------+----------+
>>> df.write.partitionBy("dt").mode("append").parquet("/dir1/dir2/sample.parquet")
>>> from datetime import date
>>> from datetime import timedelta
>>> today = date.today()
# Here I am taking the date two days back; for one day back, use timedelta(days=1)
>>> yesterday = today - timedelta(days = 2)
>>> two_days_back=yesterday.strftime('%Y-%m-%d')
>>> path="/dir1/dir2/sample.parquet/dt="+two_days_back
>>> spark.read.parquet(path).show()
+-----+-----------------+-----------+
|Sr_No| User_Id|Transaction|
+-----+-----------------+-----------+
| 2|paytm 111002203#p| 50C|
+-----+-----------------+-----------+

structtype add method to add arraytype

StructType has a method called add. I have seen examples that use it like this:
schema = Structype()
schema.add('testing',string)
schema.add('testing2',string)
How can I add a StructType and an ArrayType to the schema using add()?
You need to use it as below -
from pyspark.sql.types import *
schema = StructType()
schema.add('testing',StringType())
schema.add('testing2',StringType())
Example of creating a DataFrame using this schema -
df = spark.createDataFrame(data=[(1,2), (3,4)],schema=schema)
df.show()
+-------+--------+
|testing|testing2|
+-------+--------+
| 1| 2|
| 3| 4|
+-------+--------+
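To cover the ArrayType/StructType part of the question, here is a minimal sketch; the field names tags and address are made up for illustration:
from pyspark.sql.types import StructType, StructField, StringType, ArrayType

schema = StructType()
schema.add('testing', StringType())
# add an array-of-strings field
schema.add('tags', ArrayType(StringType()))
# add a nested struct field
schema.add('address', StructType([
    StructField('city', StringType()),
    StructField('zip', StringType())]))

df = spark.createDataFrame(
    data=[("a", ["x", "y"], ("Pune", "411001"))],
    schema=schema)
df.printSchema()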

How to read multiple nested JSON objects from one file into a DataFrame with PySpark in Azure Databricks?

I have a .log file in ADLS which contains multiple nested JSON objects, as follows:
{"EventType":3735091736,"Timestamp":"2019-03-19","Data":{"Id":"event-c2","Level":2,"MessageTemplate":"Test1","Properties":{"CorrId":"d69b7489","ActionId":"d0e2c3fd"}},"Id":"event-c20b9c7eac0808d6321106d901000000"}
{"EventType":3735091737,"Timestamp":"2019-03-18","Data":{"Id":"event-d2","Level":2,"MessageTemplate":"Test1","Properties":{"CorrId":"f69b7489","ActionId":"d0f2c3fd"}},"Id":"event-d20b9c7eac0808d6321106d901000000"}
{"EventType":3735091738,"Timestamp":"2019-03-17","Data":{"Id":"event-e2","Level":1,"MessageTemplate":"Test1","Properties":{"CorrId":"g69b7489","ActionId":"d0d2c3fd"}},"Id":"event-e20b9c7eac0808d6321106d901000000"}
I need to read these nested JSON objects in PySpark and convert them to a DataFrame as follows:
EventType Timestamp Data.[Id] ..... [Data.Properties.CorrId] [Data.Properties.ActionId]
3735091736 2019-03-19 event-c2 ..... d69b7489 d0e2c3fd
3735091737 2019-03-18 event-d2 ..... f69b7489 d0f2c3fd
3735091738 2019-03-17 event-e2 ..... f69b7489 d0d2c3fd
For the above I am using ADLS and PySpark in Azure Databricks.
Does anyone know a general way to deal with this problem? Thanks!
1. You can read it into an RDD first; it will be read in as a list of strings.
2. Convert each JSON string into a native Python datatype using json.loads().
3. Then convert the RDD into a DataFrame; toDF() can infer the schema directly.
4. Using the answer from Flatten Spark Dataframe column of map/dictionary into multiple columns, you can explode the Data column into multiple columns (this assumes your Id column is unique). Note that explode returns key and value columns for each entry in the map type.
5. Repeat the 4th point to explode the Properties column.
Solution:
import json
from pyspark.sql import functions as F

rdd = sc.textFile("demo_files/Test20191023.log")
df = rdd.map(lambda x: json.loads(x)).toDF()
df.show()
# +--------------------+----------+--------------------+----------+
# | Data| EventType| Id| Timestamp|
# +--------------------+----------+--------------------+----------+
# |[MessageTemplate ...|3735091736|event-c20b9c7eac0...|2019-03-19|
# |[MessageTemplate ...|3735091737|event-d20b9c7eac0...|2019-03-18|
# |[MessageTemplate ...|3735091738|event-e20b9c7eac0...|2019-03-17|
# +--------------------+----------+--------------------+----------+
data_exploded = df.select('Id', 'EventType', 'Timestamp', F.explode('Data'))\
    .groupBy('Id', 'EventType', 'Timestamp').pivot('key').agg(F.first('value'))
# There will be a duplicate Id column, which might cause ambiguity problems
data_exploded.show()
# +--------------------+----------+----------+--------+-----+---------------+--------------------+
# | Id| EventType| Timestamp| Id|Level|MessageTemplate| Properties|
# +--------------------+----------+----------+--------+-----+---------------+--------------------+
# |event-c20b9c7eac0...|3735091736|2019-03-19|event-c2| 2| Test1|{CorrId=d69b7489,...|
# |event-d20b9c7eac0...|3735091737|2019-03-18|event-d2| 2| Test1|{CorrId=f69b7489,...|
# |event-e20b9c7eac0...|3735091738|2019-03-17|event-e2| 1| Test1|{CorrId=g69b7489,...|
# +--------------------+----------+----------+--------+-----+---------------+--------------------+
I was able to read the data with the following code.
from pyspark.sql.functions import *
DF = spark.read.json("demo_files/Test20191023.log")
DF.select(col('Id'),col('EventType'),col('Timestamp'),col('Data.Id'),col('Data.Level'),col('Data.MessageTemplate'),
col('Data.Properties.CorrId'),col('Data.Properties.ActionId'))\
    .show()
Result:
+--------------------+----------+----------+--------+-----+---------------+--------+--------+
| Id| EventType| Timestamp| Id|Level|MessageTemplate| CorrId|ActionId|
+--------------------+----------+----------+--------+-----+---------------+--------+--------+
|event-c20b9c7eac0...|3735091736|2019-03-19|event-c2| 2| Test1|d69b7489|d0e2c3fd|
|event-d20b9c7eac0...|3735091737|2019-03-18|event-d2| 2| Test1|f69b7489|d0f2c3fd|
|event-e20b9c7eac0...|3735091738|2019-03-17|event-e2| 1| Test1|g69b7489|d0d2c3fd|
+--------------------+----------+----------+--------+-----+---------------+--------+--------+
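Since the result above ends up with two columns named Id, a small variation that aliases them avoids ambiguity in later operations; a sketch (the alias names EventId and DataId are made up):
from pyspark.sql.functions import col

DF = spark.read.json("demo_files/Test20191023.log")
DF.select(col('Id').alias('EventId'),
          col('EventType'),
          col('Timestamp'),
          col('Data.Id').alias('DataId'),
          col('Data.Level'),
          col('Data.MessageTemplate'),
          col('Data.Properties.CorrId'),
          col('Data.Properties.ActionId'))\
  .show()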

List to DataFrame in pyspark

Can someone tell me how to convert a list containing strings to a DataFrame in PySpark? I am using Python 3.6 with Spark 2.2.1. I have just started learning the Spark environment and my data looks like below:
my_data =[['apple','ball','ballon'],['cat','camel','james'],['none','focus','cake']]
Now, I want to create a DataFrame as follows:
----------------------------------
| ID | words                     |
----------------------------------
| 1  | ['apple','ball','ballon'] |
| 2  | ['cat','camel','james']   |
----------------------------------
I also want to add an ID column, which is not present in the data.
You can convert the list to a list of Row objects, then use spark.createDataFrame which will infer the schema from your data:
from pyspark.sql import Row
R = Row('ID', 'words')
# use enumerate to add the ID column
spark.createDataFrame([R(i, x) for i, x in enumerate(my_data)]).show()
+---+--------------------+
| ID| words|
+---+--------------------+
| 0|[apple, ball, bal...|
| 1| [cat, camel, james]|
| 2| [none, focus, cake]|
+---+--------------------+
Try this -
data_array = []
for i in range(0, len(my_data)):
    data_array.append((i, my_data[i]))
df = spark.createDataFrame(data=data_array, schema=["ID", "words"])
df.show()
Try this -- the simplest approach
from datetime import datetime
from pyspark.sql import Row

utc = datetime.utcnow()  # the original snippet left utc undefined
x = Row(utc_timestamp=utc, routine='routine name', message='your message')
data = [x]
df = sqlContext.createDataFrame(data)
Simple Approach:
my_data = [['apple','ball','ballon'], ['cat','camel','james'], ['none','focus','cake']]
spark.sparkContext.parallelize(my_data).zipWithIndex() \
    .toDF(["words", "id"]).show(truncate=False)
+---------------------+--+
|words                |id|
+---------------------+--+
|[apple, ball, ballon]|0 |
|[cat, camel, james]  |1 |
|[none, focus, cake]  |2 |
+---------------------+--+