DataFrame headers get modified when saving as a Parquet file using Glue - PySpark

I'm writing a DataFrame with headers, partitioned, to S3 using the code below:
df_dynamic = DynamicFrame.fromDF(
    df_columned,
    glue_context,
    "temp_ctx"
)
print("\nUploading parquet to " + destination_path)
glue_context.write_dynamic_frame.from_options(
    frame=df_dynamic,
    connection_type="s3",
    connection_options={
        "path": destination_path,
        "partitionKeys": ["partition_id"]
    },
    format_options={
        "header": "true"
    },
    format="glueparquet"
)
Once my files are created, I see #1, #2 appended to my column headers.
Example: if my column name is "Doc Date", it gets converted to Doc_Date#1.
I assumed this was just Parquet's way of saving data.
But when I try to read from the same files using the code below, my headers are no longer the same; they come back as Doc_Date#1. How do I fix this?
str_folder_path = str.format(
    _S3_PATH_FORMAT,
    args['BUCKET_NAME'],
    str_relative_path
)
df_grouped = glue_context.create_dynamic_frame.from_options(
    "s3",
    {
        'paths': [str_folder_path],
        'recurse': True,
        'groupFiles': 'inPartition',
        'groupSize': '1048576'
    },
    format_options={
        "header": "true"
    },
    format="parquet"
)
return df_grouped.toDF()

Issue resolved!
The issue was that I had spaces in my column names. Once I replaced them with underscores (_), the problem went away.
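For anyone hitting the same thing, a minimal sketch of the rename (assuming the df_columned DataFrame from the question; the column-name replacement is the only change):
# Replace spaces in column names with underscores before converting to a
# DynamicFrame, since glueparquet mangles field names containing spaces.
df_columned = df_columned.toDF(*[c.replace(" ", "_") for c in df_columned.columns])

df_dynamic = DynamicFrame.fromDF(df_columned, glue_context, "temp_ctx")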

Related

Spark 2.4.7 ignoring null fields while writing to JSON [duplicate]

I am trying to write a JSON file using Spark. There are some keys that have null as the value. These show up just fine in the Dataset, but when I write the file, the keys get dropped. How do I ensure they are retained?
code to write the file:
ddp.coalesce(20).write().mode("overwrite").json("hdfs://localhost:9000/user/dedupe_employee");
part of JSON data from source:
"event_header": {
"accept_language": null,
"app_id": "App_ID",
"app_name": null,
"client_ip_address": "IP",
"event_id": "ID",
"event_timestamp": null,
"offering_id": "Offering",
"server_ip_address": "IP",
"server_timestamp": 1492565987565,
"topic_name": "Topic",
"version": "1.0"
}
Output:
"event_header": {
"app_id": "App_ID",
"client_ip_address": "IP",
"event_id": "ID",
"offering_id": "Offering",
"server_ip_address": "IP",
"server_timestamp": 1492565987565,
"topic_name": "Topic",
"version": "1.0"
}
In the above example, the keys accept_language, app_name and event_timestamp have been dropped.
Apparently, Spark does not provide any option to handle nulls here, so the following custom solution should work.
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper
import com.fasterxml.jackson.databind.ObjectMapper

case class EventHeader(accept_language: String, app_id: String, app_name: String, client_ip_address: String, event_id: String, event_timestamp: String, offering_id: String, server_ip_address: String, server_timestamp: Long, topic_name: String, version: String)

val ds = Seq(EventHeader(null, "App_ID", null, "IP", "ID", null, "Offering", "IP", 1492565987565L, "Topic", "1.0")).toDS()

val ds1 = ds.mapPartitions(records => {
  val mapper = new ObjectMapper with ScalaObjectMapper
  mapper.registerModule(DefaultScalaModule)
  records.map(mapper.writeValueAsString(_))
})

ds1.coalesce(1).write.text("hdfs://localhost:9000/user/dedupe_employee")
This will produce output like:
{"accept_language":null,"app_id":"App_ID","app_name":null,"client_ip_address":"IP","event_id":"ID","event_timestamp":null,"offering_id":"Offering","server_ip_address":"IP","server_timestamp":1492565987565,"topic_name":"Topic","version":"1.0"}
If you are on Spark 3, you can set
spark.sql.jsonGenerator.ignoreNullFields false
ignoreNullFields is an option, available since Spark 3, that controls whether null fields are dropped when a DataFrame is written to a JSON file.
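As an illustration (not part of the original answer), a short PySpark sketch that sets this config on an existing session before writing; the session name spark, the DataFrame ddp and the HDFS path are taken from the surrounding examples:
# keep null fields in the generated JSON (Spark 3+)
spark.conf.set("spark.sql.jsonGenerator.ignoreNullFields", "false")
ddp.coalesce(20).write.mode("overwrite").json("hdfs://localhost:9000/user/dedupe_employee")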
If you need Spark 2 (specifically PySpark 2.4.6), you can try converting the DataFrame to an RDD of Python dicts and then calling saveAsTextFile to write the JSON file to HDFS. The following example may help.
cols = ddp.columns
ddp_ = ddp.rdd
ddp_ = ddp_.map(lambda row: dict([(c, row[c]) for c in cols]))
ddp_.repartition(1).saveAsTextFile(your_hdfs_file_path)
This should produce an output file like:
{"accept_language": None, "app_id":"123", ...}
{"accept_language": None, "app_id":"456", ...}
What's more, if you want to replace Python None with JSON null, you will need to dump every dict to JSON (after import json):
ddp_ = ddp_.map(lambda row: json.dumps(row, ensure_ascii=False))
Since Spark 3, if you are using the class DataFrameWriter
https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameWriter.html#json-java.lang.String-
(the same applies to PySpark)
https://spark.apache.org/docs/3.0.0-preview/api/python/_modules/pyspark/sql/readwriter.html
its json method has an option ignoreNullFields=None, where None means true.
So just set this option to false:
ddp.coalesce(20).write().mode("overwrite").option("ignoreNullFields", "false").json("hdfs://localhost:9000/user/dedupe_employee")
To retain null values when converting to JSON, set this config option:
spark = (
    SparkSession.builder.master("local[1]")
    .config("spark.sql.jsonGenerator.ignoreNullFields", "false")
).getOrCreate()

How to select a child tag from a JSON file using Scala

Good day!
I am writing Scala code to select multiple child tags from a JSON file; however, I am not getting the exact solution. The code looks like below.
Code:
val spark = SparkSession.builder.master("local").appName("").config("spark.sql.warehouse.dir", "C:/temp").getOrCreate()
val df = spark.read.option("header", "true").json("C:/Users/Desktop/data.json").select("type", "city", "id","name")
println(df.show())
Data.json
{"claims":[
{ "type":"Part B",
"city":"Chennai",
"subscriber":[
{ "id":11 },
{ "name":"Harvey" }
] },
{ "type":"Part D",
"city":"Bangalore",
"subscriber":[
{ "id":12 },
{ "name":"andrew" }
] } ]}
Expected Result:
type     city       subscriber/0/id   subscriber/1/name
Part B   Chennai    11                Harvey
Part D   Bangalore  12                Andrew
Please help me with the above code.
If I'm not mistaken, Apache Spark expects each line to be a separate JSON object, so it will fail if you try to load a pretty-printed JSON file.
https://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
http://jsonlines.org/examples/
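As an illustration of the workaround (not part of the original answer, and in PySpark rather than the Scala of the question): assuming Spark 2.2+ and an existing SparkSession named spark, the multiLine option lets you read the pretty-printed file as a single document, after which explode gives the expected columns. The path is the one from the question:
from pyspark.sql import functions as F

# multiLine reads a pretty-printed JSON document instead of JSON Lines (Spark 2.2+)
df = spark.read.option("multiLine", True).json("C:/Users/Desktop/data.json")

claims = df.select(F.explode("claims").alias("c"))
result = claims.select(
    F.col("c.type"),
    F.col("c.city"),
    F.col("c.subscriber")[0]["id"].alias("subscriber_0_id"),
    F.col("c.subscriber")[1]["name"].alias("subscriber_1_name")
)
result.show()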

Parsing JSON in Spark and populating a DataFrame column dynamically based on a node's value

I am using Spark 1.6.3 to parse a JSON structure.
I have the JSON structure below:
{
    "events": [
        {
            "_update_date": 1500301647576,
            "eventKey": "depth2Name",
            "depth2Name": "XYZ"
        },
        {
            "_update_date": 1500301647577,
            "eventKey": "journey_start",
            "journey_start": "2017-07-17T14:27:27.144Z"
        }
    ]
}
I want to parse the above JSON into 3 columns in a DataFrame. eventKey's value (e.g. depth2Name) is also the name of a node in the JSON, and I want to read the value from that corresponding node and add it to a column "eventValue", so that I can accommodate any new events dynamically.
Here is the expected output:
_update_date,eventKey,eventValue
1500301647576,depth2Name,XYZ
1500301647577,journey_start,2017-07-17T14:27:27.144Z
sample code:
val x = sc.wholeTextFiles("/user/jx665240/events.json").map(x => x._2)
val namesJson = sqlContext.read.json(x)
namesJson.printSchema()
namesJson.registerTempTable("namesJson")
val eventJson=namesJson.select("events")
val mentions1 =eventJson.select(explode($"events")).toDF("events").select($"events._update_date",$"events.eventKey",$"events.$"events.eventKey"")
$"events.$"events.eventKey"" is not working.
Can you please suggest how to fix this issue?
Thanks,
Sree
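One way to sketch the dynamic column (shown in newer PySpark with a SparkSession named spark, rather than the asker's Spark 1.6.3, so treat it as an illustration only): build eventValue by checking, for each candidate column, whether its name matches eventKey, and coalescing the results. The file path follows the question:
from pyspark.sql import functions as F

# multiLine handles the pretty-printed file (Spark 2.2+); path from the question
df = spark.read.option("multiLine", True).json("/user/jx665240/events.json")
events = df.select(F.explode("events").alias("e")).select("e.*")

# every column apart from the fixed ones is a candidate event-value column
value_cols = [c for c in events.columns if c not in ("_update_date", "eventKey")]
event_value = F.coalesce(*[F.when(F.col("eventKey") == c, F.col(c)) for c in value_cols])

result = events.select("_update_date", "eventKey", event_value.alias("eventValue"))
result.show(truncate=False)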

How to convert a CSV file to XLS or XLSX in SAPUI5?

I have been trying to convert a .csv file to .xls or .xlsx, and I have managed to do so, but the problem is that the file will not open: it says it is unable to open the file format.
These are the snippets I have been using.
1. var oExport = new sap.ui.core.util.Export({
       exportType: new sap.ui.core.util.ExportType({
           separatorChar: ",",
           fileExtension: "xlsx",
           mimeType: "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
       }),
       models: oModel,
       rows: {
           path: "/"
       },
       columns: [itemsArray]
   });
   oExport.saveFile(oFileName).always(function() {
       this.destroy();
   });
When I download like this, the file size is zero, so it does not contain any values. I then tried an alternative way.
2. var oExport = new sap.ui.core.util.Export({
       exportType: new sap.ui.core.util.ExportTypeCSV({
           separatorChar: ",",
           fileExtension: "xlsx",
           mimeType: "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
       }),
       models: oModel,
       rows: {
           path: "/"
       },
       columns: [itemsArray]
   });
   oExport.saveFile(oFileName).always(function() {
       this.destroy();
   });
When I use the above code, the file has some data but I am unable to open it. Could anyone help me with this?
Thank you.
AFAIK SAPUI5 doesn't provide that functionality out of the box, so you may need to manage it yourself.
This might help you: https://github.com/SheetJS/js-xlsx.
We used it to read xlsx files, but it looks like it can help you write them as well.

Amazon Redshift COPY using JSON having trouble

I have created a simple table called test3:
create table if not exists test3 (
    Studies varchar(300) not null,
    Series varchar(500) not null
);
I have some JSON data:
{
    "Studies": [{
        "studyinstanceuid": "2.16.840.1.114151",
        "studydescription": "Some study",
        "studydatetime": "2014-10-03 08:36:00"
    }],
    "Series": [{
        "SeriesKey": "abc",
        "SeriesInstanceUid": "xyz",
        "studyinstanceuid": "2.16.840.1.114151",
        "SeriesDateTime": "2014-10-03 09:05:09"
    }, {
        "SeriesKey": "efg",
        "SeriesInstanceUid": "stw",
        "studyinstanceuid": "2.16.840.1.114151",
        "SeriesDateTime": "0001-01-01 00:00:00"
    }],
    "ExamKey": "exam-key",
}
and here is my jsonpaths file:
{
    "jsonpaths": [
        "$['Studies']",
        "$['Series']"
    ]
}
Both the JSON data and the jsonpaths file are uploaded to S3.
I try to execute the following COPY command in the Redshift console:
copy test3
from 's3://mybucket/redshift_demo/input.json'
credentials 'aws_access_key_id=my_key;aws_secret_access_key=my_access'
json 's3://mybucket/redsift_demo/json_path.json'
I get the following error. Can anyone please help? I have been stuck on this for some time now.
[Amazon](500310) Invalid operation: Number of jsonpaths and the number of columns should match. JSONPath size: 1, Number of columns in table or column list: 2
Details:
-----------------------------------------------
error: Number of jsonpaths and the number of columns should match. JSONPath size: 1, Number of columns in table or column list: 2
code: 8001
context:
query: 1125432
location: s3_utility.cpp:670
process: padbmaster [pid=83747]
-----------------------------------------------;
1 statement failed.
Execution time: 1.58s
Redshift's error is misleading. The issue is that your input file is wrongly formatted: you have an extra comma after the last JSON entry.
Copy succeeds if you change "ExamKey": "exam-key", to "ExamKey": "exam-key"
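A quick way to catch this kind of problem before uploading (a suggested check, not part of the original answer; the file name is illustrative) is to validate the file locally with Python, since the standard JSON grammar rejects trailing commas:
import json

# json.load raises JSONDecodeError on the stray comma after "ExamKey"
with open("input.json") as f:
    try:
        json.load(f)
        print("JSON is valid")
    except json.JSONDecodeError as e:
        print("Invalid JSON:", e)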