How to create a DataFrame schema from a schema JSON file in PySpark?

I'm trying to use PySpark to create a DataFrame schema from a schema JSON file. Once the DataFrame schema is created, I will load JSON data files using this schema. Could somebody help me? Thanks in advance. My schema JSON file looks like this:
[
  {
    "name": "visitorId",
    "type": "INTEGER",
    "mode": "NULLABLE"
  },
  {
    "name": "visitStartTime",
    "type": "INTEGER",
    "mode": "NULLABLE"
  },
  {
    "name": "totals",
    "type": "RECORD",
    "mode": "NULLABLE",
    "fields": [
      {
        "name": "visits",
        "type": "INTEGER",
        "mode": "NULLABLE"
      },
      {
        "name": "hits",
        "type": "INTEGER",
        "mode": "NULLABLE"
      },
      {
        "name": "pageviews",
        "type": "INTEGER",
        "mode": "NULLABLE"
      },
      {
        "name": "transactions",
        "type": "INTEGER",
        "mode": "NULLABLE"
      },
      {
        "name": "timeOnScreen",
        "type": "INTEGER",
        "mode": "NULLABLE"
      }
    ]
  },
  {
    "name": "channelGrouping",
    "type": "STRING",
    "mode": "NULLABLE"
  }
]

Your schema is not defined in the format PySpark expects, so it cannot parse it.
I've changed your schema to:
{
  "type": "struct",
  "fields": [
    {
      "name": "visitorId",
      "type": "integer",
      "nullable": true,
      "metadata": {}
    },
    {
      "name": "visitStartTime",
      "type": "integer",
      "nullable": true,
      "metadata": {}
    },
    {
      "name": "totals",
      "type": {
        "type": "struct",
        "fields": [
          {
            "name": "visits",
            "type": "integer",
            "nullable": true,
            "metadata": {}
          },
          {
            "name": "hits",
            "type": "integer",
            "nullable": true,
            "metadata": {}
          },
          {
            "name": "pageviews",
            "type": "integer",
            "nullable": true,
            "metadata": {}
          },
          {
            "name": "transactions",
            "type": "integer",
            "nullable": true,
            "metadata": {}
          },
          {
            "name": "timeOnScreen",
            "type": "integer",
            "nullable": true,
            "metadata": {}
          }
        ]
      },
      "nullable": true,
      "metadata": {}
    },
    {
      "name": "channelGrouping",
      "type": "string",
      "nullable": true,
      "metadata": {}
    }
  ]
}
I saved that as a schema.json file and then created a StructType from this JSON using:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType
import json

if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .appName("Test") \
        .getOrCreate()

    # Parse the schema file into a StructType
    with open("schema.json") as f_in:
        schema_data = json.load(f_in)
    schemaFromJson = StructType.fromJson(schema_data)

    # Build an empty DataFrame just to verify the schema
    data = []
    df = spark.createDataFrame(spark.sparkContext.parallelize(data), schemaFromJson)
    df.printSchema()
The result is:
root
|-- visitorId: integer (nullable = true)
|-- visitStartTime: integer (nullable = true)
|-- totals: struct (nullable = true)
| |-- visits: integer (nullable = true)
| |-- hits: integer (nullable = true)
| |-- pageviews: integer (nullable = true)
| |-- transactions: integer (nullable = true)
| |-- timeOnScreen: integer (nullable = true)
|-- channelGrouping: string (nullable = true)
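Once the StructType exists, the original goal of loading the JSON data files with this schema only needs the reader API. A minimal sketch, assuming the data files sit at a placeholder path:

# Load the actual JSON data files using the schema parsed above
data_df = spark.read.schema(schemaFromJson).json("path/to/data/*.json")
data_df.printSchema()
data_df.show(5, truncate=False)

As a side note, if you ever need to produce a schema file in this format from an existing DataFrame, json.dumps(df.schema.jsonValue()) emits exactly the structure that StructType.fromJson consumes.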

Related

pyspark stream from kafka topic with avro format returns null dataframe

I have a topic in Avro format and I want to read it as a stream in PySpark, but the output is null. My data looks like this:
{
"ID": 559,
"DueDate": 1676362642000,
"Number": 1,
"__deleted": "false"
}
and the schema in the schema registry is:
{
"type": "record",
"name": "Value",
"namespace": "test",
"fields": [
{
"name": "ID",
"type": "long"
},
{
"name": "DueDate",
"type": {
"type": "long",
"connect.version": 1,
"connect.name": "io.debezium.time.Timestamp"
}
},
{
"name": "Number",
"type": "long"
},
{
"name": "StartDate",
"type": [
"null",
{
"type": "long",
"connect.version": 1,
"connect.name": "io.debezium.time.Timestamp"
}
],
"default": null
},
{
"name": "__deleted",
"type": [
"null",
"string"
],
"default": null
}
],
"connect.name": "test.Value"
}
and the schema I defined in PySpark is:
schema = StructType([
StructField("ID",LongType(),False),
StructField("DueDate",LongType(),False),
StructField("Number",LongType(),False),
StructField("StartDate",LongType(),True),
StructField("__deleted",StringType(),True)
])
In the result, I see a null DataFrame. I expect the values in the DataFrame to match the records in the Kafka topic, but all columns are null.
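For what it's worth, a common cause of an all-null result in this setup, assuming the topic is produced with Confluent's Avro serializer and the stream is decoded with from_avro: the serializer prepends a 5-byte header (one magic byte plus a 4-byte schema id) to every message, and from_avro also expects the Avro schema as a JSON string rather than a Spark StructType. A rough sketch with placeholder topic, servers, and schema path:

from pyspark.sql import SparkSession
from pyspark.sql.functions import expr
from pyspark.sql.avro.functions import from_avro

spark = SparkSession.builder.getOrCreate()

# The Avro schema as a JSON string -- from_avro takes this, not a StructType
with open("value_schema.avsc") as f:  # placeholder path
    avro_schema = f.read()

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder
       .option("subscribe", "my_topic")                      # placeholder
       .load())

# Confluent's serializer prepends 5 bytes (magic byte + 4-byte schema id).
# If they are not stripped, from_avro cannot decode the payload, which shows
# up either as an error or, in PERMISSIVE mode, as all-null columns.
decoded = (raw
           .withColumn("payload", expr("substring(value, 6, length(value) - 5)"))
           .select(from_avro("payload", avro_schema).alias("record"))
           .select("record.*"))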

PubSub Subscription error with REPEATED Column Type - Avro Schema

I am trying to use the PubSub Subscription "Write to BigQuery" option but am running into an issue with the "REPEATED" column type. The message I get when updating the subscription is:
Incompatible schema mode for field 'Values': field is REQUIRED in the topic schema, but REPEATED in the BigQuery table schema
My Avro Schema is:
{
"type": "record",
"name": "Avro",
"fields": [
{
"name": "ItemID",
"type": "string"
},
{
"name": "UserType",
"type": "string"
},
{
"name": "Values",
"type": [
{
"type": "record",
"name": "Values",
"fields": [
{
"name": "AttributeID",
"type": "string"
},
{
"name": "AttributeValue",
"type": "string"
}
]
}
]
}
]
}
Input JSON That "Matches" Schema:
{
"ItemID": "Item_1234",
"UserType": "Item",
"Values": {
"AttributeID": "TEST_ID_1",
"AttributeValue": "Value_1"
}
}
my Table looks like:
ItemID | STRING | NULLABLE
UserType | STRING | NULLABLE
Values | RECORD | REPEATED
AttributeID | STRING | NULLABLE
AttributeValue | STRING | NULLABLE
I am able to "Test" and "Validate Schema" and it comes back with a success. Question is, what am I missing on the Avro for the Values node to make it "REPEATED" vs "Required" for subscription to be created.
The issue is that Values is not an array type in your Avro schema, meaning it expects only one in the message, while it is a repeated type in your BigQuery schema, meaning it expects a list of them.
Per Kamal's comment above, this schema works:
{
"type": "record",
"name": "Avro",
"fields": [
{
"name": "ItemID",
"type": "string"
},
{
"name": "UserType",
"type": "string"
},
{
"name": "Values",
"type": {
"type": "array",
"items": {
"name": "NameDetails",
"type": "record",
"fields": [
{
"name": "ID",
"type": "string"
},
{
"name": "Value",
"type": "string"
}
]
}
}
}
]
}
the payload:
{
"ItemID": "Item_1234",
"UserType": "Item",
"Values": [
{ "AttributeID": "TEST_ID_1" },
{ "AttributeValue": "Value_1" }
]
}

ADF Copy activity delimited to parquet data type mapping with extra column

I'm trying to use a copy activity in ADF to copy data from a CSV to a Parquet file. I can get the column name mappings to work, and I can mostly get the data types to map successfully. However, I am adding a dynamic column called LoadDate that is created from an expression in ADF, and I can't seem to get it to map correctly. Looking at the Parquet file that is output, the "Date" column, which comes from the delimited file, gets the type INT96, which Azure Databricks correctly reads as a date. However, the LoadDate column, which is generated from an expression, gets the type BYTE_ARRAY.
I just can't seem to get it to output the extra column in the correct format. Any help would be appreciated.
Below is the mapping section of my JSON.
"mappings": [
{
"source": {
"name": "Date",
"type": "DateTime",
"physicalType": "String"
},
"sink": {
"name": "Date",
"type": "DateTime",
"physicalType": "INT_96"
}
},
{
"source": {
"name": "Item",
"type": "String",
"physicalType": "String"
},
"sink": {
"name": "Item",
"type": "String",
"physicalType": "UTF8"
}
},
{
"source": {
"name": "Opt",
"type": "INT32",
"physicalType": "String"
},
"sink": {
"name": "Opt",
"type": "Int32",
"physicalType": "INT_32"
}
},
{
"source": {
"name": "Branch",
"type": "INT32",
"physicalType": "String"
},
"sink": {
"name": "Branch",
"type": "Int32",
"physicalType": "INT_32"
}
},
{
"source": {
"name": "QTY",
"type": "INT32",
"physicalType": "String"
},
"sink": {
"name": "QTY",
"type": "Int32",
"physicalType": "INT_32"
}
},
{
"source": {
"name": "LoadDate",
"type": "DateTime",
"physicalType": "String"
},
"sink": {
"name": "LoadDate",
"type": "DateTime",
"physicalType": "INT_96"
}
}
]

avro.io.AvroTypeException: The datum [object] is not an example of the schema

I have been struggling with this issue for quite some time. I am working with AvroProducer (confluent-kafka) and getting an error related to the schema I defined.
Here is the complete stacktrace of the issue I am getting:
raise AvroTypeException(self.writer_schema, datum)
avro.io.AvroTypeException: The datum {'totalDifficulty': 2726165051, 'stateRoot': '0xf09bd6730b3ae7f5728836564837d7f776a8f7333628c8b84cb57d7c6d48ebba', 'sha3Uncles': '0x1dcc4de8dec75d7aab85b567b6ccd41ad312451b948a7413f0a142fd40d49347', 'size': 538, 'logs': [], 'gasLimit': 8000000, 'mixHash': '0x410b2b19519be16496727c93515f399072ffecf06defe4913d00eb4d10bb7351', 'logsBloom': '0x00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000', 'nonce': '0x18dc6c0d30839c91', 'proofOfAuthorityData': '0xd883010817846765746888676f312e31302e34856c696e7578', 'number': 5414, 'timestamp': 1552577641, 'difficulty': 589091, 'gasUsed': 0, 'miner': '0x48FA5EBc2f0D82B5D52faAe624Fa2426998ab492', 'hash': '0x71259991acb407a85befa8b3c5df26a94a11a6c08f92f3e3b7c9c0e8e1f5916d', 'transactionsRoot': '0x56e81f171bcc55a6ff8345e692c0f86e5b48e01b996cadc001622fb5e363b421', 'receiptsRoot': '0x56e81f171bcc55a6ff8345e692c0f86e5b48e01b996cadc001622fb5e363b421', 'transactions': [], 'parentHash': '0x9f0c25eeab86fc144296cb034c94857beed331936016d60c0986a35ac07d9c68', 'uncles': []} is not an example of the schema {
"type": "record",
"name": "value",
"namespace": "exporter.value.opsnetBlock",
"fields": [
{
"type": "int",
"name": "difficulty"
},
{
"type": "string",
"name": "proofOfAuthorityData"
},
{
"type": "int",
"name": "gasLimit"
},
{
"type": "int",
"name": "gasUsed"
},
{
"type": "string",
"name": "hash"
},
{
"type": "string",
"name": "logsBloom"
},
{
"type": "int",
"name": "size"
},
{
"type": "string",
"name": "miner"
},
{
"type": "string",
"name": "mixHash"
},
{
"type": "string",
"name": "nonce"
},
{
"type": "int",
"name": "number"
},
{
"type": "string",
"name": "parentHash"
},
{
"type": "string",
"name": "receiptsRoot"
},
{
"type": "string",
"name": "sha3Uncles"
},
{
"type": "string",
"name": "stateRoot"
},
{
"type": "int",
"name": "timestamp"
},
{
"type": "int",
"name": "totalDifficulty"
},
{
"type": "string",
"name": "transactionsRoot"
},
{
"type": {
"type": "array",
"items": "string"
},
"name": "transactions"
},
{
"type": {
"type": "array",
"items": "string"
},
"name": "uncles"
},
{
"type": {
"type": "array",
"items": {
"type": "record",
"name": "Child",
"namespace": "exporter.value.opsnetBlock",
"fields": [
{
"type": "string",
"name": "address"
},
{
"type": "string",
"name": "blockHash"
},
{
"type": "int",
"name": "blockNumber"
},
{
"type": "string",
"name": "data"
},
{
"type": "int",
"name": "logIndex"
},
{
"type": "boolean",
"name": "removed"
},
{
"type": {
"type": "array",
"items": "string"
},
"name": "topics"
},
{
"type": "string",
"name": "transactionHash"
},
{
"type": "int",
"name": "transactionIndex"
}
]
}
},
"name": "logs"
}
]
}
Can anybody please tell me where I am going wrong?
Thanks in advance.
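Judging only from the datum and the schema shown (this is a guess from the values above, not a confirmed diagnosis), one likely mismatch is that totalDifficulty is 2726165051, which is outside the 32-bit range of Avro's int type, so a field declared as "int" cannot hold it and the writer rejects the whole datum. A quick sanity check:

# Avro "int" is a 32-bit signed integer; anything above 2**31 - 1 will not
# validate against it and needs "long" in the schema instead.
INT32_MAX = 2**31 - 1            # 2147483647
total_difficulty = 2726165051    # value from the failing datum above
print(total_difficulty > INT32_MAX)  # True -> totalDifficulty needs "long"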

Json parsing using Play Json with different fields

I have the JSON below and I'm parsing it using play-json. Somehow the "datafeeds/schema/fields" node is not getting parsed properly.
I have created standard Reads to parse this JSON, but the "datafeeds" node does not seem to parse correctly because the "format" node (under datafeeds/schema/fields) is sometimes a String and sometimes a JsObject, and the same goes for the "type" node.
If I treat the schema as a JsObject, the whole JSON parses correctly, but it seems I then have to process the schema separately.
My JSON looks like this:
{
"entities": [
{
"name": "customers",
"number_of_buckets": 5,
"entity_column_name": "customer_id",
"entity_column_type": "integer"
},
{
"name": "accounts",
"number_of_buckets": 7,
"entity_column_name": "account_id",
"entity_column_type": "string"
},
{
"name": "products",
"number_of_buckets": 1,
"entity_column_name": "product_id",
"entity_column_type": "integer"
}
],
"datafeeds": [
{
"name": "customer_demographics",
"version": "1",
"delimiter": "|",
"filename_re_pattern": ".*(customer_demographics_v1_[0-9]{8}\\.psv)$",
"frequency": {
"days": 1
},
"from": "2015-07-01",
"drop_threshold": {
"rows": null,
"percentage": 0.05
},
"dry_run": false,
"header": true,
"text_qualifier": null,
"landing_path": "landing",
"schema": {
"fields": [
{
"time_key": true,
"format": "yyyy-MM-dd",
"metadata": {},
"name": "record_date",
"nullable": false,
"primary_key": true,
"type": "timestamp",
"timezone": "Australia/Sydney"
},
{
"format": "yyyy-MM-dd",
"metadata": {},
"name": "extract_date",
"nullable": false,
"primary_key": true,
"type": "timestamp",
"timezone": "Australia/Sydney"
},
{
"entity_type": "customers",
"metadata": {},
"name": "customer_id",
"nullable": false,
"primary_key": true,
"type": "integer"
},
{
"metadata": {},
"name": "year_of_birth",
"nullable": true,
"type": "integer"
},
{
"metadata": {},
"name": "month_of_birth",
"nullable": true,
"type": "integer"
},
{
"metadata": {},
"name": "postcode",
"nullable": true,
"type": "string"
},
{
"metadata": {},
"name": "state",
"nullable": true,
"type": "string"
},
{
"format": {
"false": "N",
"true": "Y"
},
"metadata": {},
"name": "marketing_consent",
"nullable": true,
"type": "boolean"
}
],
"type": "struct"
}
},
{
"name": "customer_statistics",
"version": "1",
"delimiter": "|",
"filename_re_pattern": ".*(customer_statistics_v1_[0-9]{8}\\.psv)$",
"frequency": {
"days": 1
},
"from": "2015-07-01",
"drop_threshold": {
"rows": null,
"percentage": 0.05
},
"dry_run": false,
"header": true,
"text_qualifier": null,
"landing_path": "landing",
"schema": {
"fields": [
{
"time_key": true,
"format": "yyyy-MM-dd",
"metadata": {},
"name": "record_date",
"nullable": false,
"primary_key": true,
"type": "timestamp",
"timezone": "Australia/Sydney"
},
{
"format": "yyyy-MM-dd",
"metadata": {},
"name": "extract_date",
"nullable": false,
"primary_key": true,
"type": "timestamp",
"timezone": "Australia/Sydney"
},
{
"entity_type": "customers",
"metadata": {},
"name": "customer_id",
"nullable": false,
"primary_key": true,
"type": "integer"
},
{
"metadata": {},
"name": "risk_score",
"nullable": true,
"type": "double"
},
{
"metadata": {},
"name": "mkg_segments",
"nullable": true,
"type": {
"type":"array",
"elementType":"string",
"containsNull": false
}
},
{
"metadata": {},
"name": "avg_balance",
"nullable": true,
"type": "decimal"
},
{
"metadata": {},
"name": "num_accounts",
"nullable": true,
"type": "integer"
}
],
"type": "struct"
}
}
],
"tables": [
{
"name": "table_name",
"version": "version",
"augmentations": [
{
"left_table_name": "left_table_name",
"left_table_version": "v1",
"right_table_name": "right_table_name",
"right_table_version": "v1",
"columns": [
"column_a",
"column_b",
"column_c"
],
"join_cols": [
{
"left_table": "system_code",
"right_table": "key_a"
},
{
"left_table": "group_product_code",
"right_table": "key_b"
},
{
"left_table": "sub_product_code",
"right_table": "key_c"
}
]
}
],
"sources": [
{
"name": "table_name",
"version": "v1",
"mandatory": true,
"type": "datafeed | table"
}
],
"aggregations": [
{
"column_name": "customer_age_customer_age",
"column_type": "long",
"description": "date_diff",
"expression": "max_by",
"source_columns": [
{
"column_name": "customer_age_year_of_birth",
"source": {
"name": "customers",
"type": "table",
"version": "v1"
}
},
{
"column_name": "customer_age_month_of_birth",
"source": {
"name": "customers",
"type": "table",
"version": "v1"
}
}
]
}
],
"column_level_transformations": [
{
"column_name": "column_added",
"column_type": "long",
"description": "adding two columns to return something else",
"expression": "column_a+column_b",
"source_columns": [
{
"column_name": "column_a",
"source": {
"name": "source_a",
"type": "table",
"version": "v1"
}
},
{
"column_name": "column_b",
"source": {
"name": "source_b",
"type": "table",
"version": "v1"
}
}
]
}
],
"frequency": {
"months": 1
},
"joins": [
{
"name": "table_name",
"version": "v1"
},
{
"name": "table_name_b",
"version": "v2"
}
],
"from": "2015-07-01",
"format": "parquet",
"structure": "primitives",
"index_query": "sql statement",
"insert_query": "sql statement"
}
]
}
Any idea how to parse this JSON?
Edit: updated to answer the updated question
I'm not sure how you're parsing now, but you can try this:
import play.api.libs.json.Reads._
import play.api.libs.json._
case class Frequency(days: Int)
case class DropThreshold(
  rows: Option[Int], //guessing type here
  percentage: Double
)
case class Format(`false`: String, `true`: String)
case class Type(`type`: String, elementType: String, containsNull: Boolean)
case class Field(
  entity_type: Option[String],
  time_key: Option[Boolean],
  format: Option[Either[String, Format]],
  metadata: Option[JsObject],
  name: Option[String],
  nullable: Option[Boolean],
  primary_key: Option[Boolean],
  `type`: Option[Either[String, Type]],
  timezone: Option[String]
)
case class Schema(fields: Seq[Field])
case class Datafeed(
  name: String,
  version: String,
  delimiter: String,
  filename_re_pattern: String,
  frequency: Frequency,
  from: String,
  drop_threshold: DropThreshold,
  dry_run: Boolean,
  header: Boolean,
  text_qualifier: Option[String], //guessing type here
  landing_path: String,
  schema: Schema
)
case class Entity(name: String, number_of_buckets: Int, entity_column_name: String, entity_column_type: String)
case class MyJson(entities: Seq[Entity], datafeeds: Seq[Datafeed])
implicit def eitherReads[A, B](implicit A: Reads[A], B: Reads[B]): Reads[Either[A, B]] = Reads[Either[A, B]] { json =>
  A.reads(json) match {
    case JsSuccess(value, path) => JsSuccess(Left(value), path)
    case JsError(e1) => B.reads(json) match {
      case JsSuccess(value, path) => JsSuccess(Right(value), path)
      case JsError(e2) => JsError(JsError.merge(e1, e2))
    }
  }
}
implicit val frequencyReads: Reads[Frequency] = Json.reads[Frequency]
implicit val dropThresholdReads: Reads[DropThreshold] = Json.reads[DropThreshold]
implicit val formatReads: Reads[Format] = Json.reads[Format]
implicit val typeReads: Reads[Type] = Json.reads[Type]
implicit val fieldReads: Reads[Field] = Json.reads[Field]
implicit val schemaReads: Reads[Schema] = Json.reads[Schema]
implicit val datafeedReads: Reads[Datafeed] = Json.reads[Datafeed]
implicit val entityReads: Reads[Entity] = Json.reads[Entity]
implicit val myJsonReads: Reads[MyJson] = Json.reads[MyJson]
With the Either Reads copied from here. To test:
scala> val json = Json.parse("""{"entities": [{"name": "customers","number_of_buckets": 5,"entity_column_name": "customer_id","entity_column_type": "integer"},{"name": "accounts","number_of_buckets": 7,"entity_column_name": "account_id","entity_column_type": "string"},{"name": "products","number_of_buckets": 1,"entity_column_name": "product_id","entity_column_type": "integer"}],"datafeeds": [{"name": "customer_demographics","version": "1","delimiter": "|","filename_re_pattern": ".*(customer_demographics_v1_[0-9]{8}\\.psv)$","frequency": {"days": 1},"from": "2015-07-01","drop_threshold": {"rows": null,"percentage": 0.05},"dry_run": false,"header": true,"text_qualifier": null,"landing_path": "landing","schema": {"fields": [{"time_key": true,"format": "yyyy-MM-dd","metadata": {},"name": "record_date","nullable": false,"primary_key": true,"type": "timestamp","timezone": "Australia/Sydney"},{"format": "yyyy-MM-dd","metadata": {},"name": "extract_date","nullable": false,"primary_key": true,"type": "timestamp","timezone": "Australia/Sydney"},{"entity_type": "customers","metadata": {},"name": "customer_id","nullable": false,"primary_key": true,"type": "integer"},{"metadata": {},"name": "year_of_birth","nullable": true,"type": "integer"},{"metadata": {},"name": "month_of_birth","nullable": true,"type": "integer"},{"metadata": {},"name": "postcode","nullable": true,"type": "string"},{"metadata": {},"name": "state","nullable": true,"type": "string"},{"format": {"false": "N","true": "Y"},"metadata": {},"name": "marketing_consent","nullable": true,"type": "boolean"}],"type": "struct"}},{"name": "customer_statistics","version": "1","delimiter": "|","filename_re_pattern": ".*(customer_statistics_v1_[0-9]{8}\\.psv)$","frequency": {"days": 1},"from": "2015-07-01","drop_threshold": {"rows": null,"percentage": 0.05},"dry_run": false,"header": true,"text_qualifier": null,"landing_path": "landing","schema": {"fields": [{"time_key": true,"format": "yyyy-MM-dd","metadata": {},"name": "record_date","nullable": false,"primary_key": true,"type": "timestamp","timezone": "Australia/Sydney"},{"format": "yyyy-MM-dd","metadata": {},"name": "extract_date","nullable": false,"primary_key": true,"type": "timestamp","timezone": "Australia/Sydney"},{"entity_type": "customers","metadata": {},"name": "customer_id","nullable": false,"primary_key": true,"type": "integer"},{"metadata": {},"name": "risk_score","nullable": true,"type": "double"},{"metadata": {},"name": "mkg_segments","nullable": true,"type": {"type":"array","elementType":"string","containsNull": false}},{"metadata": {},"name": "avg_balance","nullable": true,"type": "decimal"},{"metadata": {},"name": "num_accounts","nullable": true,"type": "integer"}],"type": "struct"}}],"tables": [{"name": "table_name","version": "version","augmentations": [{"left_table_name": "left_table_name","left_table_version": "v1","right_table_name": "right_table_name","right_table_version": "v1","columns": ["column_a","column_b","column_c"],"join_cols": [{"left_table": "system_code","right_table": "key_a"},{"left_table": "group_product_code","right_table": "key_b"},{"left_table": "sub_product_code","right_table": "key_c"}]}],"sources": [{"name": "table_name","version": "v1","mandatory": true,"type": "datafeed | table"}],"aggregations": [{"column_name": "customer_age_customer_age","column_type": "long","description": "date_diff","expression": "max_by","source_columns": [{"column_name": "customer_age_year_of_birth","source": {"name": "customers","type": "table","version": 
"v1"}},{"column_name": "customer_age_month_of_birth","source": {"name": "customers","type": "table","version": "v1"}}]}],"column_level_transformations": [{"column_name": "column_added","column_type": "long","description": "adding two columns to return something else","expression": "column_a+column_b","source_columns": [{"column_name": "column_a","source": {"name": "source_a","type": "table","version": "v1"}},{"column_name": "column_b","source": {"name": "source_b","type": "table","version": "v1"}}]}],"frequency": {"months": 1},"joins": [{"name": "table_name","version": "v1"},{"name": "table_name_b","version": "v2"}],"from": "2015-07-01","format": "parquet","structure": "primitives","index_query": "sql statement","insert_query": "sql statement"}]}""")
json: play.api.libs.json.JsValue = {"entities":[{"name":"customers","number_of_buckets":5,"entity_column_name":"customer_id","entity_column_type":"integer"},{"name":"accounts","number_of_buckets":7,"entity_column_name":"account_id","entity_column_type":"string"},{"name":"products","number_of_buckets":1,"entity_column_name":"product_id","entity_column_type":"integer"}],"datafeeds":[{"name":"customer_demographics","version":"1","delimiter":"|","filename_re_pattern":".*(customer_demographics_v1_[0-9]{8}\\.psv)$","frequency":{"days":1},"from":"2015-07-01","drop_threshold":{"rows":null,"percentage":0.05},"dry_run":false,"header":true,"text_qualifier":null,"landing_path":"landing","schema":{"fields":[{"time_key":true,"format":"yyyy-MM-dd","metadata":{},"name":"record...
scala> json.validate[MyJson]
res0: play.api.libs.json.JsResult[MyJson] = JsSuccess(MyJson(List(Entity(customers,5,customer_id,integer), Entity(accounts,7,account_id,string), Entity(products,1,product_id,integer)),List(Datafeed(customer_demographics,1,|,.*(customer_demographics_v1_[0-9]{8}\.psv)$,Frequency(1),2015-07-01,DropThreshold(None,0.05),false,true,None,landing,Schema(List(Field(None,Some(true),Some(Left(yyyy-MM-dd)),Some({}),Some(record_date),Some(false),Some(true),Some(Left(timestamp)),Some(Australia/Sydney)), Field(None,None,Some(Left(yyyy-MM-dd)),Some({}),Some(extract_date),Some(false),Some(true),Some(Left(timestamp)),Some(Australia/Sydney)), Field(Some(customers),None,None,Some({}),Some(customer_id),Some(false),Some(true),Some(Left(integer)),None), Field(None,None,None,Some({}),...
Remember to set any optional or nullable fields to an Option type.