Parsing schema of pyarrow.parquet.ParquetDataset object - pyspark

I'm using pyarrow to read Parquet data from S3, and I'd like to convert the schema into a format suitable for running an MLeap-serialized model outside of Spark. This requires parsing the schema.
If I had a Pyspark dataframe, I could do this:
test_df = spark.read.parquet(test_data_path)
schema = [ { "name" : field.simpleString().split(":")[0], "type" : field.simpleString().split(":")[1] }
for field in test_df.schema ]
How can I achieve the same if I read the data using pyarrow instead?
Also, for the Spark dataframe I can obtain the rows in a suitable format for model evaluation by doing the following:
rows = [[field for field in row] for row in test_df.collect()]
How can I achieve a similar thing using pyarrow?
Thanks in advance for your help.

If you want to get the schema, you can do the following with pyarrow.parquet:
import pyarrow.parquet as pq
dataset = pq.ParquetDataset(<path to file>).read_pandas()  # returns a pyarrow Table
schema = dataset.schema
schemaDict = {name: dtype for name, dtype in zip(schema.names, schema.types)}
This will give you a dictionary of column names to datatypes.
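For the second part of the question (getting the rows in a list-of-lists form for model evaluation), one option is to go through pandas. A minimal sketch, assuming the data fits in memory:
import pyarrow.parquet as pq

# Read the Parquet data as above and turn the resulting pyarrow Table into a list
# of row lists, analogous to [[field for field in row] for row in test_df.collect()]
table = pq.ParquetDataset(<path to file>).read_pandas()
rows = table.to_pandas().values.tolist()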

Related

How to read JSON in data frame column

I'm reading an HDFS directory:
val schema = spark.read.json("/HDFS path").schema
val df = spark.read.schema(schema).json("/HDFS path")
Here I select only the primary keys and the timestamp from the JSON file:
val df2 = df.select($"PK1", $"PK2", $"PK3", $"ts")
Then I use a window function to get the latest record per primary key, based on the timestamp:
val dfrank = df2.withColumn("rank", row_number().over(
    Window.partitionBy($"PK1", $"PK2", $"PK3").orderBy($"ts".desc)
  ))
  .filter($"rank" === 1)
From this window function I get only the updated primary keys and the timestamp of the updated JSON.
Now I have to add one more column in which I want only the JSON with the updated PK and timestamp.
How can I do that?
I tried the following, but I get the wrong JSON instead of the updated one:
val df3 = dfrank.withColumn("JSON", lit(dfrank.toJSON.first()))
Result shown in image.
Here, you convert the entire dataframe to JSON and collect it to the driver with toJSON (which will crash with a large dataframe), and then you add to your dataframe a column containing a JSON version of its first row. I don't think this is what you want.
From what I understand, you have a dataframe and, for each row, you want to create a JSON column that contains all of its columns. You could create a struct with all your columns and then use to_json, like this:
val df3 = dfrank.withColumn("JSON", to_json(struct(dfrank.columns.map(col): _*)))

Databricks Flatten Nested JSON to Dataframe with PySpark

I am trying to convert a nested JSON to a flattened DataFrame.
I have read in the JSON as follows:
df = spark.read.json("/mnt/ins/duedil/combined.json")
The resulting dataframe looks like the following:
I have made a start on flattening the dataframe as follows:
display(df.select ("companyId","countryCode"))
The above will display the following
I would like to select "fiveYearCAGR" under the following path: "financials:element:amortisationOfIntangibles:fiveYearCAGR".
Can someone let me know how to add to the select statement to retrieve the fiveYearCAGR?
Your financials column is an array, so if you want to extract something from within it, you need some array transformations. One example is to use transform:
from pyspark.sql import functions as F

df.select(
    "companyId",
    "countryCode",
    F.transform('financials', lambda x: x['amortisationOfIntangibles']['fiveYearCAGR']).alias('fiveYearCAGR')
)
This will return the fiveYearCAGR values in an array. If you need to flatten it further, you can use explode/explode_outer.
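A minimal sketch of that flattening step, reusing the question's column names (the nested field names are taken from the path in the question, so treat them as assumptions about the actual schema):
from pyspark.sql import functions as F

flattened = df.select(
    "companyId",
    "countryCode",
    F.transform('financials', lambda x: x['amortisationOfIntangibles']['fiveYearCAGR']).alias('fiveYearCAGR')
)

# One row per array element; explode_outer keeps companies whose financials array is empty or null
flattened = flattened.withColumn("fiveYearCAGR", F.explode_outer("fiveYearCAGR"))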

Writing tables in Azure Data Lake Gen 2 takes too long with Pyspark Notebook

I am quite new to parallel processing with Spark in Azure Synapse using a PySpark notebook. I need to do some transformations on data stored in Cosmos DB and then save the results as Parquet files in Azure Data Lake Gen2. The data comes from a tracking tool called Mixpanel (events). Here is the code:
# libraries
from pyspark.sql import functions as F
import re
from pyspark.sql.types import *
from pyspark.sql import *
from pyspark.sql.functions import *

# Load data from Cosmos DB
df_Mixpanel = spark.read\
    .format("cosmos.olap")\
    .option("spark.synapse.linkedService", "<linked-service>")\
    .option("spark.cosmos.container", "events-mixpanel")\
    .load()
The data are JSON strings that I need to convert into JSON objects:
df_Mixpanel.printSchema()
# Infer schema of Mixpanel data and parse json column to return a new dataframe
mixpanel_json_schema = spark.read.json(df_Mixpanel.rdd.map(lambda row: row.data)).schema
df_Mixpanel2 = df_Mixpanel\
    .withColumn("data", from_json(col("data"), mixpanel_json_schema)).select("id", "data.*")\
    .withColumn("time", to_timestamp(col("time")))\
    .withColumn("Year", year(col("time")))\
    .withColumn("Month", month(col("time")))
Some fields contain a list of values stored as a string with square brackets (for example, "['a', 'b', 'c']"). These have to be exploded into several rows:
list_Mixpanel_col = df_Mixpanel2.columns
regex = re.compile(r"(^id|distinct_id|mp_+)")
list_Mixpanel_col_filtered = [i for i in list_Mixpanel_col if not regex.search(i)]

df_Mixpanel3 = df_Mixpanel2
for my_col in list_Mixpanel_col_filtered:
    df_Mixpanel3 = df_Mixpanel3.withColumn(my_col, F.expr("regexp_extract_all(" + my_col + r", '((\\w+\\s?-?\\.?\'?)+)', 0)"))\
        .withColumn(my_col, explode_outer(my_col))
All these steps work very well until I have to save the results to Azure Data Lake Gen2 in Parquet format:
# note: "header" is a CSV option; it has no effect on Parquet output
df_Mixpanel3.write\
    .option("header", True)\
    .partitionBy("Year", "Month")\
    .mode("overwrite")\
    .parquet(path_staging)
I had to cancel the job because it was taking too long to process. One of the reasons is data skew (so Spark isn't parallelizing the work well).
How can I solve this problem? I tried to find a better partition key, but the only one with a reasonably even data distribution seems to be the combination Year/Month. None of my Mixpanel fields (event name, page type, ...) has a good data distribution.
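One way to act on that reasoning, as a sketch rather than a tested fix (the useful number of partitions depends on your data volume), is to explicitly repartition on the write keys before writing, so the rows produced by the explode step are redistributed across tasks:
# Redistribute rows on the same keys used by partitionBy before the write
df_Mixpanel3.repartition("Year", "Month")\
    .write\
    .partitionBy("Year", "Month")\
    .mode("overwrite")\
    .parquet(path_staging)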
In the Spark Web UI, I also see some stdout/stderr logs. Looking into the details, there is this message:
Request for https://XXX.dev.azuresynapse.net/sparkhistory/api/v1/sparkpools/customerportal/livyid/68/hs/applications/application_1656687049284_0001/executorlogs/containers/container_1656687049284_0001_01_000003/files/stdout?feature.enableStandaloneHS=true failed, activityId:1134e452-48d7-42ac-971a-f7db85c0b8dc, status code:404.
Response:
{
"TraceId": "f9f41914-660b-4211-9816-454543544279 | client-request-id : 1134e452-48d7-42ac-971a-f7db85c0b8dc",
"Message": "The HTTP status code of the response was not expected (404).\n\nStatus: 404\nResponse: \n{\n \"message\" : \"Can not find the log file:[stdout] for the container:container_1656687049284_0001_01_000003 in NodeManager:vm-6c067982:39163\",\n \"traceId\" : \"f9f41914-660b-4211-9816-454543544279 | client-request-id : 1134e452-48d7-42ac-971a-f7db85c0b8dc.45632bf3-c43e-47cb-a455-71856d634756\"\n}"
}
Does it impact the performance of my Spark job?

pyspark udf with parameter

I need to transform a PySpark dataframe column, checkin_time, from milliseconds to a timezone-adjusted timestamp; the timezone information is in another column, tz_info.
I tried the following:
def tz_adjust(x, tz_info):
    if tz_info:
        y = col(x) + col(tz_info)
        return from_unixtime(col(y)/1000)
    else:
        return from_unixtime(col(x)/1000)

def udf_tz_adjust(tz_info):
    return udf(lambda l: tz_adjust(l, tz_info))
When applying this udf to the column:
df.withColumn('checkin_time', udf_tz_adjust('time_zone')(col('checkin_time')))
I got this error:
AttributeError: 'NoneType' object has no attribute '_jvm'
Any idea how to pass the second column as a parameter to the udf?
Thanks.
IMHO, what you are doing is a combination of a UDF and a partial function, which can get tricky. I don't think you need a UDF at all for this purpose. You can do the following:
# not tested
from pyspark.sql.functions import col, from_unixtime, when

df.withColumn(
    "checkin_time",
    when(col("tz_info").isNotNull(),
         from_unixtime((col("checkin_time") + col("tz_info")) / 1000))
    .otherwise(from_unixtime(col("checkin_time") / 1000))
)
UDFs have their own serde inefficiencies, which are even worse in Python because of the extra overhead of converting JVM datatypes to and from Python datatypes.
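If you do want to keep a UDF, a plain Python UDF can simply take both columns as arguments, so no closure or partial is needed. A minimal sketch, assuming checkin_time and tz_info are both millisecond integer columns as the question implies:
from datetime import datetime, timezone
from pyspark.sql.functions import col, udf
from pyspark.sql.types import TimestampType

@udf(TimestampType())
def tz_adjust_udf(checkin_ms, tz_offset_ms):
    # Both arguments arrive as plain Python values, one pair per row
    if checkin_ms is None:
        return None
    offset = tz_offset_ms or 0
    return datetime.fromtimestamp((checkin_ms + offset) / 1000, tz=timezone.utc)

df.withColumn("checkin_time", tz_adjust_udf(col("checkin_time"), col("tz_info")))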

PySpark - iterate rows of a Data Frame

I need to iterate over the rows of a pyspark.sql.dataframe.DataFrame.
I have done this in pandas in the past with the function iterrows(), but I need to find something similar for PySpark without using pandas.
If I do for row in myDF: it iterates over the columns.
Thanks
You can use the select method to operate on your dataframe using a user-defined function, something like this:
from pyspark.sql import functions as F
from pyspark.sql.functions import col
from pyspark.sql.types import StringType

columns = myDF.columns
my_udf = F.udf(lambda data: "do whatever you want here", StringType())
myDF.select(*[my_udf(col(c)) for c in columns])
Then inside the select you can choose what you want to do with each column.
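If the goal really is to loop over rows on the driver (the closest analogue to pandas' iterrows()), collect() and toLocalIterator() are the standard options. A minimal sketch; note that both bring rows to the driver, so this only makes sense for reasonably small results:
# toLocalIterator() fetches one partition at a time instead of materializing
# the whole dataframe at once like collect() does
for row in myDF.toLocalIterator():
    values = [v for v in row]   # a Row behaves like a tuple
    print(values)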