Export parquet file schema to JSON or CSV - pyspark

I need to extract the schema of a parquet file into JSON, TXT or CSV format.
It should include the column name and datatype from the parquet file.
For example:
{"id", "type" : "integer" },
{"booking_date""type" : "timestamp", "format" : "%Y-%m-%d %H:%M:%S.%f" }

We can read the schema from the parquet file using .schema, convert it to JSON, and finally save it as a text file.
Input parquet file:
spark.read.parquet("/tmp").printSchema()
#root
#|-- id: integer (nullable = true)
#|-- name: string (nullable = true)
#|-- booking_date: timestamp (nullable = true)
Extract the schema and write it to HDFS/local filesystem:
spark.sparkContext.parallelize(                  # convert the schema JSON string to an RDD
    [spark.read.parquet("/tmp").schema.json()]   # read the schema of the parquet file as JSON
).repartition(1) \
 .saveAsTextFile("/tmp_schema/")                 # save the output to HDFS
Read the output file from HDFS:
$ hdfs dfs -cat /tmp_schema/part-00000
{"fields":[{"metadata":{},"name":"id","nullable":true,"type":"integer"},{"metadata":{},"name":"name","nullable":true,"type":"string"},{"metadata":{},"name":"booking_date","nullable":true,"type":"timestamp"}],"type":"struct"}

Related

AWS Glue: Pyspark: Problems while flattening DynamoDB JSON

I need help with reading and converting a JSON element. From a Glue ETL job, I am using the DynamoDB connector to export data from a cross-account DynamoDB table.
Example Data from DynamoDB:
{
  "header_name": { "S": "2caee47b-4e6d-4eba-812d-65e4098f1f78" },
  "additional_info": { "S": "test additionalIdentifier" },
  "biz_group": { "S": "64b27e6a-863d-4ee6-9a33-386675ce348a" },
  "key_id": { "S": "2caee47b-4e6d-4eba-812d-65e4098f1f78" },
  "workflow_id": {
    "M": {
      "48281078dd41": { "N": "1" }
    }
  }
}
There are 2 million unique workflow_id keys.
I'm reading the data using the Glue connector for DynamoDB exports:
S3bucket_node1 = glueContext.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.export": "ddb",
        "dynamodb.s3.bucket": "s3_bucket",
        "dynamodb.s3.prefix": "prefix-",
        "dynamodb.tableArn": "TABLE_ARN",
        "dynamodb.unnestDDBJson": False,
        "dynamodb.sts.roleArn": "STS_ROLE_ARN",
        "dynamodb.region": "us-west-2",
        "dynamodb.s3.bucketOwner": "ACCOUNT ID BUCKET OWNER",
    },
    transformation_ctx="S3bucket_node1",
)
When the schema is inferred, the keys inside workflow_id are interpreted as individual columns, so workflow_id gets exploded (pasting only a few below).
root
|-- header_name: string
|-- additional_info: string
|-- biz_group: string
|-- key_id: string
|-- workflow_id: struct
| |-- 48281078dd41: long
| |-- 48281078dd42: long
| |-- ... (one field per unique workflow id)
I would like to treat the element workflow_id as a string instead of having it interpreted as a struct.
Question: is there a way to convert the field workflow_id to a string, ideally by manipulating the schema while reading the data? I tried using a crawler and changing the type to string, but that causes an "internal exception". I am trying to avoid the crawler and use as few layers as possible.
Use Case:
DynamoDB (cross account) -> AWS Glue DDB connector -> DataFrame -> flatten -> Parquet -> Redshift Spectrum
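One possible workaround, offered only as a sketch and not from the original thread: convert the DynamicFrame to a DataFrame and serialize the inferred workflow_id struct back into a single JSON string with to_json before flattening. Note this does not avoid the wide struct during schema inference itself; it only collapses it again afterwards.

from pyspark.sql import functions as F

# Work on a plain Spark DataFrame
df = S3bucket_node1.toDF()

# Collapse the exploded struct back into one JSON string column,
# so downstream flatten/Parquet/Spectrum steps see workflow_id as a string
df = df.withColumn("workflow_id", F.to_json(F.col("workflow_id")))

df.printSchema()  # workflow_id: string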

Databricks pyspark readstream to read data from azure blob storage which is written in partitioned structure

I have a process which fetches data from an RDBMS source and writes it to Azure blob storage. The data is written in a partitioned structure like:
Storage account
|_ container
   |_ data-load
      |_ updateddate_p=yyyymmdd
         |_ timestamp_p=16382882
            |_ data-file.orc
         |_ timestamp_p=16382885
            |_ data-file-1.orc
Now in Databricks, I mount the Azure storage (classic) to the cluster and use readStream to read the ORC file data as a stream.
Is there a way to read the ORC data together with the partition info, so that the streaming DataFrame carries the partition columns when reading?
Then, when using writeStream to write the data, I should be able to fetch updateddate_p and timestamp_p with their values from the readStream DataFrame.
Resolved by adding the partition columns as part of the schema:
{
  "metadata": {},
  "name": "updateddate_p",
  "nullable": true,
  "type": "integer"
},
{
  "metadata": {},
  "name": "timestamp_p",
  "nullable": true,
  "type": "long"
},
Used the read stream as follows, and the partition columns loaded as expected:
import json
from pyspark.sql import types as T

# load the custom schema (including the partition columns) from the JSON file
with open('custom-schema.json', 'r') as f:
    schema_definition = T.StructType.fromJson(json.loads(f.read()))

data_readstream = (
    spark.readStream
         .format("orc")
         .schema(schema_definition)
         .option("pathGlobFilter", "*.orc")
         .load(f'{mounted_input_data_path_loc}/')
)
# write stream used partitionBy(column1, column2, ...)
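For completeness, a minimal writeStream sketch using those partition columns might look like this; mounted_output_data_path_loc and the checkpoint location are placeholders I am assuming, not paths from the original post:

query = (
    data_readstream.writeStream
        .format("orc")
        .option("checkpointLocation", f'{mounted_output_data_path_loc}/_checkpoint')
        .partitionBy("updateddate_p", "timestamp_p")
        .start(f'{mounted_output_data_path_loc}/')
)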

Read CosmosDb items from pyspark (Databricks) with an inconsistent model

Let's say I have two items in CosmosDb:
{
  "Test": {
    "InconsistentA": 10
  },
  "Common": 1
}
{
  "Test": {
    "InconsistentB": 10
  },
  "Common": 2
}
How can I read this data so that it ends up with the following schema?
Test: string (the JSON string of the inconsistent part of the model)
Common: int (the consistent part of the model)
I don't know the model in advance, and the Spark CosmosDb driver (com.microsoft.azure.cosmosdb.spark) only reads the first X items in CosmosDb to infer the schema.
What I have tried is enforcing the schema this way:
|-- Test: string (nullable = true)
|-- Common: integer (nullable = true)
But the resulting value in the Test column is:
'{ InconsistentA=10 }'
Instead of:
'{ "InconsistentA": 10 }'

Dataframe headers get modified on saving as parquet file using glue

I'm writing a dataframe with headers, partitioned, to S3 using the code below:
df_dynamic = DynamicFrame.fromDF(
    df_columned,
    glue_context,
    "temp_ctx"
)
print("\nUploading parquet to " + destination_path)
glue_context.write_dynamic_frame.from_options(
    frame=df_dynamic,
    connection_type="s3",
    connection_options={
        "path": destination_path,
        "partitionKeys": ["partition_id"]
    },
    format_options={
        "header": "true"
    },
    format="glueparquet"
)
Once my files are created, I see #1, #2 appended to my column headers.
Example: if my column name is "Doc Date", it gets converted to Doc_Date#1.
I thought this was just how parquet saves data, but when I try to read from the same files using the code below, my headers are no longer the same; they now come back as Doc_Date#1. How do I fix this?
str_folder_path = str.format(
    _S3_PATH_FORMAT,
    args['BUCKET_NAME'],
    str_relative_path
)
df_grouped = glue_context.create_dynamic_frame.from_options(
    "s3",
    {
        'paths': [str_folder_path],
        'recurse': True,
        'groupFiles': 'inPartition',
        'groupSize': '1048576'
    },
    format_options={
        "header": "true"
    },
    format="parquet"
)
return df_grouped.toDF()
Issue resolved!
The problem was that I had spaces in my column names. Once I replaced them with underscores (_), the issue went away.
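A minimal sketch of that rename step, applied before wrapping the DataFrame in a DynamicFrame (df_columned, glue_context and "temp_ctx" are taken from the snippet above; the rest is assumed):

from awsglue.dynamicframe import DynamicFrame

# Replace spaces in every column name with underscores before writing
df_columned = df_columned.toDF(*[c.replace(" ", "_") for c in df_columned.columns])

df_dynamic = DynamicFrame.fromDF(df_columned, glue_context, "temp_ctx")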

Spark 2.4.7 ignoring null fields while writing to JSON [duplicate]

I am trying to write a JSON file using Spark. Some keys have null as their value. These show up just fine in the Dataset, but when I write the file, those keys get dropped. How do I ensure they are retained?
Code to write the file:
ddp.coalesce(20).write().mode("overwrite").json("hdfs://localhost:9000/user/dedupe_employee");
Part of the JSON data from the source:
"event_header": {
"accept_language": null,
"app_id": "App_ID",
"app_name": null,
"client_ip_address": "IP",
"event_id": "ID",
"event_timestamp": null,
"offering_id": "Offering",
"server_ip_address": "IP",
"server_timestamp": 1492565987565,
"topic_name": "Topic",
"version": "1.0"
}
Output:
"event_header": {
"app_id": "App_ID",
"client_ip_address": "IP",
"event_id": "ID",
"offering_id": "Offering",
"server_ip_address": "IP",
"server_timestamp": 1492565987565,
"topic_name": "Topic",
"version": "1.0"
}
In the above example, the keys accept_language, app_name and event_timestamp have been dropped.
Apparently, Spark (before version 3) does not provide any option to handle nulls here, so the following custom solution should work.
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper
import com.fasterxml.jackson.databind.ObjectMapper

case class EventHeader(accept_language: String, app_id: String, app_name: String, client_ip_address: String, event_id: String, event_timestamp: String, offering_id: String, server_ip_address: String, server_timestamp: Long, topic_name: String, version: String)

val ds = Seq(EventHeader(null, "App_ID", null, "IP", "ID", null, "Offering", "IP", 1492565987565L, "Topic", "1.0")).toDS()

// Serialize each record with Jackson, which keeps null fields
val ds1 = ds.mapPartitions(records => {
  val mapper = new ObjectMapper with ScalaObjectMapper
  mapper.registerModule(DefaultScalaModule)
  records.map(mapper.writeValueAsString(_))
})

ds1.coalesce(1).write.text("hdfs://localhost:9000/user/dedupe_employee")
This will produce output like:
{"accept_language":null,"app_id":"App_ID","app_name":null,"client_ip_address":"IP","event_id":"ID","event_timestamp":null,"offering_id":"Offering","server_ip_address":"IP","server_timestamp":1492565987565,"topic_name":"Topic","version":"1.0"}
If you are on Spark 3, you can set
spark.sql.jsonGenerator.ignoreNullFields false
ignoreNullFields is an option, available since Spark 3, that controls whether null fields are dropped when a DataFrame is written to a JSON file.
If you are on Spark 2 (specifically PySpark 2.4.6), you can try converting the DataFrame to an RDD of Python dicts and then call RDD.saveAsTextFile to write a JSON-like file to HDFS. The following example may help.
cols = ddp.columns
ddp_ = ddp.rdd
ddp_ = ddp_.map(lambda row: dict([(c, row[c]) for c in cols]))
ddp_.repartition(1).saveAsTextFile(your_hdfs_file_path)
This should produce an output file like:
{'accept_language': None, 'app_id': '123', ...}
{'accept_language': None, 'app_id': '456', ...}
What's more, if you want to replace Python None with JSON null (and get proper double-quoted JSON), you will need to dump every dict to JSON before saving:
import json
ddp_ = ddp_.map(lambda d: json.dumps(d, ensure_ascii=False))
Since Spark 3, if you are using the class DataFrameWriter
https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameWriter.html#json-java.lang.String-
(the same applies to PySpark)
https://spark.apache.org/docs/3.0.0-preview/api/python/_modules/pyspark/sql/readwriter.html
its json method has an option ignoreNullFields=None, where None means it falls back to the spark.sql.jsonGenerator.ignoreNullFields setting, whose default is true.
So just set this option to false:
ddp.coalesce(20).write().mode("overwrite").option("ignoreNullFields", "false").json("hdfs://localhost:9000/user/dedupe_employee")
Alternatively, to retain null values when converting to JSON, set this config option on the SparkSession:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.master("local[1]")
    .config("spark.sql.jsonGenerator.ignoreNullFields", "false")
    .getOrCreate()
)
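As a quick sanity check of that config, here is a sketch with a made-up three-column DataFrame and output path (neither is from the original post):

from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("app_id", StringType(), True),
    StructField("app_name", StringType(), True),
    StructField("version", StringType(), True),
])
df = spark.createDataFrame([("App_ID", None, "1.0")], schema)

# With ignoreNullFields disabled, app_name is written as JSON null instead of being dropped
df.coalesce(1).write.mode("overwrite").json("/tmp/null_fields_check")
# Expected file content: {"app_id":"App_ID","app_name":null,"version":"1.0"}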