Dynamically generating a schema from a JSON file - Scala

I need help defining a dynamic schema with fields and data types taken from an input metadata JSON file. Below is the JSON file:
[
  {
    "trim": true,
    "name": "id",
    "nullable": true,
    "id": null,
    "position": 0,
    "table": "employee",
    "type": "integer",
    "primaryKey": true
  },
  {
    "trim": true,
    "name": "salary",
    "nullable": true,
    "id": null,
    "position": 1,
    "table": "employee",
    "type": "double",
    "primaryKey": false
  },
  {
    "trim": true,
    "name": "dob",
    "nullable": true,
    "id": null,
    "position": 2,
    "table": "employee",
    "type": "date",
    "primaryKey": false
  }
]
I found a useful link, but there all the fields are mapped to the string data type:
Programmatically generate the schema AND the data for a dataframe in Apache Spark
My requirement is a bit different. I have an input CSV file without any header, so all column values come in as the string data type. Similarly, I have a JSON metadata file that contains the field names and their corresponding data types. I want to define a schema that maps each field name from the JSON metadata to its corresponding data type and applies it to the input CSV data.
For example, below is the sample code I have written for mapping the column names from the JSON file onto the input CSV data, but it does not convert or map the columns to the corresponding data types.
// read the header-less CSV; every column is inferred as a string
val in_emp = spark.read.csv(in_source_data)
// read the JSON metadata file
val in_meta_emp = spark.read.option("multiline", "true").json(in_meta_data)
// collect the column names ("type" is selected but never used)
val in_cols = in_meta_emp.select("name", "type").map(_.getString(0)).collect
// rename the CSV columns; this does not change their data types
val in_cols_map = in_emp.toDF(in_cols: _*)
in_emp.show
in_cols_map.show
in_cols_map.dtypes
Result:
Mapped input data types:
Array[(String, String)] = Array((id,StringType), (salary,StringType), (dob,StringType))
The code below shows the static way to define the schema, but I am looking for a dynamic way that picks up the columns and their corresponding data types from the JSON metadata file.
import org.apache.spark.sql.types._

val schema = StructType(Array(
  StructField("id", IntegerType, true),
  StructField("dob", DateType, true),
  StructField("salary", DoubleType, true)
))
val in_emp = spark.read
  .schema(schema)
  .option("inferSchema", "true")
  .option("dateFormat", "yyyy.MM.dd")
  .csv(in_source_data)

Related

ADF - Loop through a large JSON file in a dataflow

We currently receive some metadata information from a third party supplier in the form of a JSON file.
The JSON file contains definitions of some tables which need to be loaded into SQL via ADF.
The JSON file looks like this; it's a list of tables and their data types:
"Tables": [
{
"name": "account",
"description": "account",
"$type": "LocalEntity",
"attributes": [
{
"dataType": "guid",
"maxLength": "-1",
"name": "Id"
},
{
"dataType": "string",
"maxLength": "250",
"name": "name"
}
]
},
{
"name": "customer",
"description": "account",
"$type": "LocalEntity",
"attributes": [
{
"dataType": "guid",
"maxLength": "-1",
"name": "Id"
},
{
"dataType": "string",
"maxLength": "100",
"name": "name"
}
]
}
]
What we need to do is to loop through this JSON and via an ADF data flow we create the required tables in the destination database.
We initially designed the pipeline with a Lookup activity that loads the JSON file and then passes the output to a ForEach loop. This worked really well when we had only a small JSON file, but once we started using real data the JSON file exceeded the 4 MB limit, resulting in the Lookup activity throwing an error.
We then tried a mapping data flow: loading the JSON as a source, setting the sink as a cache, and outputting this to an output variable which we then loop through. Again this works with smaller datasets, but as soon as the dataset is large enough it can't parse it into an output.
I am sure this should be easy to do but just can't get my head around it!
Here is a sample procedure for looping through a large JSON file in a data flow:
Create a linked service and a dataset pointing to the JSON file path.
Use that dataset as the source in the data flow.
Add a Flatten transformation and choose the required input in its Unroll by option to get the input columns from the source.
Create a linked service and dataset for the sink path.
Attach the data flow to a Data Flow activity in the pipeline.
You will get the expected result in the SQL database.

Pyspark: Best way to set json strings in dataframe column

I need to create a couple of columns in a DataFrame where I want to transform and store a JSON string. Here is one JSON I need to store in one column; the other JSONs are similar. Can you please help with how to transform and store this JSON string in the column? The values section needs to be filled from the values of other columns within the same DataFrame.
{
  "name": "",
  "headers": [
    {
      "name": "A",
      "dataType": "number"
    },
    {
      "name": "B",
      "dataType": "string"
    },
    {
      "name": "C",
      "dataType": "string"
    }
  ],
  "values": [
    [
      2,
      "some value",
      "some value"
    ]
  ]
}
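The question is tagged PySpark, but staying with the Scala Spark API used elsewhere on this page (the same functions exist in pyspark.sql.functions), here is a minimal sketch of one way to assemble such a JSON string per row with struct and to_json. The column names A, B, C are assumptions, and the inner values are cast to strings for simplicity; keeping 2 as a true number inside a mixed-type array would need a different representation (for example a UDF):

import org.apache.spark.sql.functions._
import spark.implicits._

// Hypothetical input frame with the columns that feed the "values" section.
val df = Seq((2, "some value", "some value")).toDF("A", "B", "C")

// Static "headers" section, mirroring the sample JSON.
val headers = array(
  struct(lit("A").as("name"), lit("number").as("dataType")),
  struct(lit("B").as("name"), lit("string").as("dataType")),
  struct(lit("C").as("name"), lit("string").as("dataType"))
)

// Build the JSON string column; "values" is filled from the row's own columns.
val withJson = df.withColumn(
  "payload",
  to_json(struct(
    lit("").as("name"),
    headers.as("headers"),
    array(array(col("A").cast("string"), col("B"), col("C"))).as("values")
  ))
)

withJson.select("payload").show(false)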

array_sort function sorting the data based on first numerical element in Array<Struct>

I need to sort an array<struct> based on a particular element of the struct. I am trying to use the array_sort function and can see that, by default, it sorts the array based on the first numerical element. Is this the expected behavior? Please find below the sample code and output.
val jsonData = """
{
"topping":
[
{ "id": "5001", "id1": "5001", "type": "None" },
{ "id": "5002", "id1": "5008", "type": "Glazed" },
{ "id": "5005", "id1": "5007", "type": "Sugar" },
{ "id": "5007", "id1": "5002", "type": "Powdered Sugar" },
{ "id": "5006", "id1": "5005", "type": "Chocolate with Sprinkles" },
{ "id": "5003", "id1": "5004", "type": "Chocolate" },
{ "id": "5004", "id1": "5003", "type": "Maple" }
]
}
"""
val json_df = spark.read.json(Seq(jsonData).toDS)
val sort_df = json_df.select(array_sort($"topping").as("sort_col"))
display(sort_df)
Output:
As you can see, the output above is sorted based on the id element, which is the first numerical element in the struct.
Is there any way to specify the element based on which sorting can be done?
Is this the expected behavior?
Short answer, yes!
For arrays with struct-type elements, it compares the first fields to determine the order, and if they are equal it compares the second fields, and so on. You can see that easily by modifying your input data to have the same value in id for two rows; you'll then notice the order is determined by the second field.
The array_sort function uses the collection operation ArraySort. If you look into the code you'll find how it handles complex data types like StructType.
Is there any way to specify the element based on which sorting can be done?
One way is to use a transform function to change the positions of the struct fields so that the first field contains the values you want the ordering to be based on. Example: if you want to order by the field type:
val transform_expr = "TRANSFORM(topping, x -> struct(x.type as type, x.id as id, x.id1 as id1))"
val transform_df = json_df.select(expr(transform_expr).alias("topping_transform"))
transform_df.show(false)
//+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
//|topping_transform |
//+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
//|[[None, 5001, 5001], [Glazed, 5002, 5008], [Sugar, 5005, 5007], [Powdered Sugar, 5007, 5002], [Chocolate with Sprinkles, 5006, 5005], [Chocolate, 5003, 5004], [Maple, 5004, 5003]]|
//+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
val sort_df = transform_df.select(array_sort($"topping_transform").as("sort_col"))
sort_df.show(false)
//+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
//|sort_col |
//+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
//|[[Chocolate, 5003, 5004], [Chocolate with Sprinkles, 5006, 5005], [Glazed, 5002, 5008], [Maple, 5004, 5003], [None, 5001, 5001], [Powdered Sugar, 5007, 5002], [Sugar, 5005, 5007]]|
//+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
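As a side note, if you are on Spark 3.0 or later (an assumption about your environment), array_sort also accepts a comparator lambda in SQL, so the struct fields do not need to be reordered at all. A sketch ordering by the type field:

// Requires Spark 3.0+: array_sort(expr, comparator)
val sort_by_type = json_df.select(
  expr("""
    array_sort(topping, (l, r) ->
      CASE WHEN l.type < r.type THEN -1
           WHEN l.type > r.type THEN 1
           ELSE 0 END)
  """).as("sort_col")
)
sort_by_type.show(false)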

Partition column disappears in result set dataframe Spark

I tried to split a Spark DataFrame by the timestamp column update_database_time and write it into HDFS with a defined Avro schema. However, after calling the repartition method I get this exception:
Caused by: org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Catalyst type StructType(StructField(random_pk,DecimalType(38,0),true), StructField(random_string,StringType,true), StructField(code,StringType,true), StructField(random_bool,BooleanType,true), StructField(random_int,IntegerType,true), StructField(random_float,DoubleType,true), StructField(random_double,DoubleType,true), StructField(random_enum,StringType,true), StructField(random_date,DateType,true), StructField(random_decimal,DecimalType(4,2),true), StructField(update_database_time_tz,TimestampType,true), StructField(random_money,DecimalType(19,4),true)) to Avro type {"type":"record","name":"TestData","namespace":"DWH","fields":[{"name":"random_pk","type":["null",{"type":"bytes","logicalType":"decimal","precision":38,"scale":0}]},{"name":"random_string","type":["string","null"]},{"name":"code","type":["string","null"]},{"name":"random_bool","type":["boolean","null"]},{"name":"random_int","type":["int","null"]},{"name":"random_float","type":["double","null"]},{"name":"random_double","type":["double","null"]},{"name":"random_enum","type":["null",{"type":"enum","name":"enumType","symbols":["VAL_1","VAL_2","VAL_3"]}]},{"name":"random_date","type":["null",{"type":"int","logicalType":"date"}]},{"name":"random_decimal","type":["null",{"type":"bytes","logicalType":"decimal","precision":4,"scale":2}]},{"name":"update_database_time","type":["null",{"type":"long","logicalType":"timestamp-millis"}]},{"name":"update_database_time_tz","type":["null",{"type":"long","logicalType":"timestamp-millis"}]},{"name":"random_money","type":["null",{"type":"bytes","logicalType":"decimal","precision":19,"scale":4}]}]}.
I assume that the partitioning column disappears from the result. How can I redefine the operation so this does not happen?
Here is the code I use:
dataDF.write
.partitionBy("update_database_time")
.format("avro")
.option(
"avroSchema",
SchemaRegistry.getSchema(
schemaRegistryConfig.url,
schemaRegistryConfig.dataSchemaSubject,
schemaRegistryConfig.dataSchemaVersion))
.save(s"${hdfsURL}${pathToSave}")
Judging by the exception you have provided, the error seems to stem from incompatible schemas between the fetched Avro schema and Spark's schema. Taking a quick look, the most worrisome parts are probably these:
1. (Possibly Catalyst doesn't know how to transform a string into enumType.)
Spark schema:
StructField(random_enum,StringType,true)
AVRO schema:
{
"name": "random_enum",
"type": [
"null",
{
"type": "enum",
"name": "enumType",
"symbols": [
"VAL_1",
"VAL_2",
"VAL_3"
]
}
]
}
2. (update_database_time_tz appears only once in the DataFrame's schema, but twice in the Avro schema.)
Spark schema:
StructField(update_database_time_tz,TimestampType,true)
AVRO schema:
{
"name": "update_database_time",
"type": [
"null",
{
"type": "long",
"logicalType": "timestamp-millis"
}
]
},
{
"name": "update_database_time_tz",
"type": [
"null",
{
"type": "long",
"logicalType": "timestamp-millis"
}
]
}
I'd suggest consolidating the schemas first and getting rid of that exception before looking into other possible partitioning problems.
EDIT: In regard to number 2, I missed the fact that there are different names in the Avro schema, which leads to the column update_database_time missing from the DataFrame.
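Beyond fixing the schema mismatch, here is a sketch of a possible workaround for the disappearing column (not from the original answer): partitionBy drops the partition column from the written files, so duplicating it keeps one copy in the data, matching an Avro schema that still declares update_database_time, while the helper copy (the name update_database_time_part is hypothetical) drives the partitioning:

import org.apache.spark.sql.functions.col

dataDF
  .withColumn("update_database_time_part", col("update_database_time"))
  .write
  .partitionBy("update_database_time_part")
  .format("avro")
  .option(
    "avroSchema",
    SchemaRegistry.getSchema(
      schemaRegistryConfig.url,
      schemaRegistryConfig.dataSchemaSubject,
      schemaRegistryConfig.dataSchemaVersion))
  .save(s"${hdfsURL}${pathToSave}")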

Avro genericdata.Record ignores data types

I have the following avro schema
{ "namespace": "example.avro",
"type": "record",
"name": "User",
"fields": [
{"name": "name", "type": "string"},
{"name": "favorite_number", "type": ["int", "null"]},
{"name": "favorite_color", "type": ["string", "null"]}
]
}
I use the following snippet to set up a Record
val schema = new Schema.Parser().parse(new File("data/user.avsc"))
val user1 = new GenericData.Record(schema) //strangely this schema only checks for valid fields NOT types.
user1.put("name", "Fred")
user1.put("favorite_number", "Jones")
I would have thought that this would fail to validate against the schema.
When I add the line
user1.put("last_name", 100)
it generates a runtime error, which is what I would expect in the first case as well:
Exception in thread "main" org.apache.avro.AvroRuntimeException: Not a valid schema field: last_name
at org.apache.avro.generic.GenericData$Record.put(GenericData.java:125)
at csv2avro$.main(csv2avro.scala:40)
at csv2avro.main(csv2avro.scala)
What's going on here?
It won't fail when you add the value to the Record; it will fail when it tries to serialize, because that is the point at which it tries to match the types. As far as I'm aware, that is the only place it does type checking.
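To make that concrete, here is a small sketch (using the standard Avro Java API from Scala) showing where the check happens: put only validates field names, while explicit validation and serialization check the value types:

import java.io.{ByteArrayOutputStream, File}
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumWriter}
import org.apache.avro.io.EncoderFactory

val schema = new Schema.Parser().parse(new File("data/user.avsc"))
val user1 = new GenericData.Record(schema)
user1.put("name", "Fred")
user1.put("favorite_number", "Jones") // accepted here, despite the ["int","null"] type

// Explicit validation reports the mismatch...
println(GenericData.get().validate(schema, user1)) // false

// ...and serialization throws, because that is where the types are matched.
val writer = new GenericDatumWriter[GenericData.Record](schema)
val out = new ByteArrayOutputStream()
val encoder = EncoderFactory.get().binaryEncoder(out, null)
writer.write(user1, encoder) // throws (e.g. an UnresolvedUnionException)
encoder.flush()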