What column type do I need for this nested data in BigQuery? - apache-kafka

I have a JSON schema for a Kafka stream that I am integrating with BigQuery but I can't get the data type correct at the BigQuery end. This is the schema:
"my_meta_data": {
"type": "object",
"properties": {
"property_1": {
"type": "array",
"items": {
"type": "number"
}
},
"property_2": {
"type": "array",
"items": {
"type": "number"
}
},
"property_3": {
"type": "array",
"items": {
"type": "number"
}
}
}
}
I tried this in the JSON file defining the BigQuery table:
{
"name": "my_meta_data",
"type": "RECORD",
"mode": "REPEATED",
"fields": [
{
"name": "property_1",
"type": "INT64",
"mode": "REPEATED"
},
{
"name": "property_2",
"type": "INT64",
"mode": "REPEATED"
},
{
"name": "property_3",
"type": "INT64",
"mode": "REPEATED"
}
]
}
I am using a hosted connector from Confluent, the Kafka provider, and the error message is
The connector is failing because it cannot write a non-array element to an array column. Please check the schemas of the data in Kafka and the BigQuery tables the connector is writing to, and ensure that all data from Kafka that will be written to an array column in BigQuery is contained in an array.
I haven't defined an array column though, I've defined a RECORD column that contains arrays. Any ideas how I can set up the BigQuery table to capture this data? Thanks in advance.

Related

JDBC sink topic with multiple structs to postgres

I am trying to sink a few topics top a postgres database. However the topic schema defines a array at the top level and within it multiple structs. Automapping does not work and I cannot find any reference how to handle this. I need all structs because they are dependent types, the second struct references the first struct as a field.
Currently it breaks when hitting the 2nd struct stating statusChangeEvent (struct) has no mapping to sql column type. This because it is using auto.create to make a table (probably called ProcessStatus) then when hitting the second entry there is no column of course.
[
{
"type": "record",
"name": "processStatus",
"namespace": "company.some.process",
"fields": [
{
"name": "code",
"doc": "The code of the processStatus",
"type": "string"
},
{
"name": "name",
"doc": "The name of the processStatus",
"type": "string"
},
{
"name": "description",
"type": "string"
},
{
"name": "isCompleted",
"type": "boolean"
},
{
"name": "isSuccessfullyCompleted",
"type": "boolean"
}
]
},
{
"type": "record",
"name": "StatusChangeEvent",
"namespace": "company.some.process",
"fields": [
{
"name": "contNumber",
"type": "string"
},
{
"name": "processId",
"type": "string"
},
{
"name": "processVersion",
"type": "int"
},
{
"name": "extProcessId",
"type": [
"null",
"string"
],
"default": null
},
{
"name": "fromStatus",
"type": "process.status"
},
{
"name": "toStatus",
"doc": "The new status of the process",
"type": "company.some.process.processStatus"
},
{
"name": "changeDateTime",
"type": "long",
"logicalType": "timestamp-millis"
},
{
"name": "isPublic",
"type": "boolean"
}
]
}
]
I am not using ksql atm. Which connector settings are suited for this task? If there is a ksql alternative it would be nice to know but the current requirement is to use the JDBC connector.
I tried using flatten but it does not support struct fields that have a schema. Which seems kind of weird. Aren't schema's the whole selling point of connect with kafka? Or is it more of a constraint you have to work around?
Aren't schema's the whole selling point of connect with kafka?
Yes, but Postgres (or the JDBC Sink, in general) doesn't really support nested objects within columns. For that, you're better off with a document database, such as using Mongo Sink Connector.
Which connector settings are suited for this task?
None, really, other than transforms. You could write your own if flatten doesn't work.
You could try pre-defining your table to use JSONB for the two status columns, however, that's more of a workaround.

Convert CSV to AVRO in NiFi 1.13.0

I would like to convert my CSV dataflow to AVRO in NiFi 1.13.0 and send it to a Kafka topic with a Key and his Schema.
So I have multiple problems:
Convert my file in AVRO (compatible for Kafka Topic and Kakfa Stream use)
Send my AVRO message to my Kafka topic with his Schema
Attach a custom Key with my message
I saw many things about processors that does not exist anymore so I would like to have a clear answer for the 1.13.0 NiFi version.
Here is my dataFlow :
Project,Price,Charges,hours spent,Days spent,price/day
API,75000,2500,1500,187.5,1000
Here is the AVRO Schema I'd like to have at the end :
{
"name": "projectClass",
"type": "record",
"fields": [
{
"name": "Project",
"type": "string"
},
{
"name": "Price",
"type": "int"
},
{
"name": "Charges",
"type": "int"
},
{
"name": "hours spent",
"type": "int"
},
{
"name": "Days spent",
"type": "double"
},
{
"name": "price/day",
"type": "int"
}
]
}
The associated key must be a unique ID (int or double).
Thanks for your answers.

How to use Schema registry for Kafka Connect AVRO

I have started exploring Kafka and Kafka connect recently and did some initial set up .
But wanted to explore more on schema registry part .
My schema registry is started now what i should do .
I have a AVRO schema stored in avro_schema.avsc.
here is the schema
{
"name": "FSP-AUDIT-EVENT",
"type": "record",
"namespace": "com.acme.avro",
"fields": [
{
"name": "ID",
"type": "string"
},
{
"name": "VERSION",
"type": "int"
},
{
"name": "ACTION_TYPE",
"type": "string"
},
{
"name": "EVENT_TYPE",
"type": "string"
},
{
"name": "CLIENT_ID",
"type": "string"
},
{
"name": "DETAILS",
"type": "string"
},
{
"name": "OBJECT_TYPE",
"type": "string"
},
{
"name": "UTC_DATE_TIME",
"type": "long"
},
{
"name": "POINT_IN_TIME_PRECISION",
"type": "string"
},
{
"name": "TIME_ZONE",
"type": "string"
},
{
"name": "TIMELINE_PRECISION",
"type": "string"
},
{
"name": "AUDIT_EVENT_TO_UTC_DT",
"type": [
"string",
"null"
]
},
{
"name": "AUDIT_EVENT_TO_DATE_PITP",
"type": "string"
},
{
"name": "AUDIT_EVENT_TO_DATE_TZ",
"type": "string"
},
{
"name": "AUDIT_EVENT_TO_DATE_TP",
"type": "string"
},
{
"name": "GROUP_ID",
"type": "string"
},
{
"name": "OBJECT_DISPLAY_NAME",
"type": "string"
},
{
"name": "OBJECT_ID",
"type": [
"string",
"null"
]
},
{
"name": "USER_DISPLAY_NAME",
"type": [
"string",
"null"
]
},
{
"name": "USER_ID",
"type": "string"
},
{
"name": "PARENT_EVENT_ID",
"type": [
"string",
"null"
]
},
{
"name": "NOTES",
"type": [
"string",
"null"
]
},
{
"name": "SUMMARY",
"type": [
"string",
"null"
]
}
]
}
Is my schema is valid .I converted it online from JSON ?
where should i keep this schema file location i am not sure about .
Please guide me with the step to follow
.
I am sending records from Lambda function and from JDBC source both .
So basically how can i enforce AVRO schema and test ?
Do i have to change anything in avro-consumer properties file ?
Or is this correct way to register schema
./bin/kafka-avro-console-producer \
--broker-list b-3.**:9092,b-**:9092,b-**:9092 --topic AVRO-AUDIT_EVENT \
--property value.schema='{"type":"record","name":"myrecord","fields":[{"name":"f1","type":"string"}]}'
curl -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" --data '{"schema" : "{\"type\":\"struct\",\"fields\":[{\"type\":\"string\",\"optional\":false,\"field\":\"ID\"},{\"type\":\"string\",\"optional\":true,\"field\":\"VERSION\"},{\"type\":\"string\",\"optional\":true,\"field\":\"ACTION_TYPE\"},{\"type\":\"string\",\"optional\":true,\"field\":\"EVENT_TYPE\"},{\"type\":\"string\",\"optional\":true,\"field\":\"CLIENT_ID\"},{\"type\":\"string\",\"optional\":true,\"field\":\"DETAILS\"},{\"type\":\"string\",\"optional\":true,\"field\":\"OBJECT_TYPE\"},{\"type\":\"string\",\"optional\":true,\"field\":\"UTC_DATE_TIME\"},{\"type\":\"string\",\"optional\":true,\"field\":\"POINT_IN_TIME_PRECISION\"},{\"type\":\"string\",\"optional\":true,\"field\":\"TIME_ZONE\"},{\"type\":\"string\",\"optional\":true,\"field\":\"TIMELINE_PRECISION\"},{\"type\":\"string\",\"optional\":true,\"field\":\"GROUP_ID\"},{\"type\":\"string\",\"optional\":true,\"field\":\"OBJECT_DISPLAY_NAME\"},{\"type\":\"string\",\"optional\":true,\"field\":\"OBJECT_ID\"},{\"type\":\"string\",\"optional\":true,\"field\":\"USER_DISPLAY_NAME\"},{\"type\":\"string\",\"optional\":true,\"field\":\"USER_ID\"},{\"type\":\"string\",\"optional\":true,\"field\":\"PARENT_EVENT_ID\"},{\"type\":\"string\",\"optional\":true,\"field\":\"NOTES\"},{\"type\":\"string\",\"optional\":true,\"field\":\"SUMMARY\"},{\"type\":\"string\",\"optional\":true,\"field\":\"AUDIT_EVENT_TO_UTC_DT\"},{\"type\":\"string\",\"optional\":true,\"field\":\"AUDIT_EVENT_TO_DATE_PITP\"},{\"type\":\"string\",\"optional\":true,\"field\":\"AUDIT_EVENT_TO_DATE_TZ\"},{\"type\":\"string\",\"optional\":true,\"field\":\"AUDIT_EVENT_TO_DATE_TP\"}],\"optional\":false,\"name\":\"test\"}"}' http://localhost:8081/subjects/view/versions
what next i have to do
But when i try to see my schema i get only below
curl --silent -X GET http://localhost:8081/subjects/AVRO-AUDIT-EVENT/versions/latest
this is the result
{"subject":"AVRO-AUDIT-EVENT","version":1,"id":161,"schema":"{\"type\":\"string\",\"optional\":false}"}
Why i do not see my full registered schema
Also when i try to delete schema
i get below error
{"error_code":405,"message":"HTTP 405 Method Not Allowed"
i am not sure if my schema is registered correctly .
Please help me.
Thanks in Advance
is my schema valid
You can use the REST API of the Registry to try and submit it and see...
where should i keep this schema file location i am not sure about
It's not clear how you're sending messages...
If you actually wrote Kafka producer code, you store it within your code (as a string) or as a resource file.. If using Java, you can instead use the SchemaBuilder class to create the Schema object
You need to rewrite your producer to use Avro Schema and Serializers if you've not already
If we create AVRO schema will it work for Json as well .
Avro is a Binary format, but there is a JSONDecoder for it.
what should be the URL for our AVRO schema properties file ?
It needs to be the IP of your Schema Registry once you figure out how to start it. (with schema-registry-start)
Do i have to change anything in avro-consumer properties file ?
You need to use the Avro Deserializer
is this correct way to register schema
.> /bin/kafka-avro-console-producer \
Not quite. That's how you produce a message with a schema (and you need to use the correct schema). You also must provide --property schema.registry.url
You use the REST API of the Registry to register and verify schemas

Ingesting multi-valued dimension from comma sep string

I have event data from Kafka with the following structure that I want to ingest in Druid
{
"event": "some_event",
"id": "1",
"parameters": {
"campaigns": "campaign1, campaign2",
"other_stuff": "important_info"
}
}
Specifically, I want to transform the dimension "campaigns" from a comma-separated string into an array / multi-valued dimension so that it can be nicely filtered and grouped by.
My ingestion so far looks as follows
{
"type": "kafka",
"dataSchema": {
"dataSource": "event-data",
"parser": {
"type": "string",
"parseSpec": {
"format": "json",
"timestampSpec": {
"column": "timestamp",
"format": "posix"
},
"flattenSpec": {
"fields": [
{
"type": "root",
"name": "parameters"
},
{
"type": "jq",
"name": "campaigns",
"expr": ".parameters.campaigns"
}
]
}
},
"dimensionSpec": {
"dimensions": [
"event",
"id",
"campaigns"
]
}
},
"metricsSpec": [
{
"type": "count",
"name": "count"
}
],
"granularitySpec": {
"type": "uniform",
...
}
},
"tuningConfig": {
"type": "kafka",
...
},
"ioConfig": {
"topic": "production-tracking",
...
}
}
Which however leads to campaigns being ingested as a string.
I could neither find a way to generate an array out of it with a jq expression in flattenSpec nor did I find something like a string split expression that may be used as a transformSpec.
Any suggestions?
Try setting useFieldDiscover: false in your ingestion spec. when this flag is set to true (which is default case) then it interprets all fields with singular values (not a map or list) and flat lists (lists of singular values) at the root level as columns.
Here is a good example and reference link to use flatten spec:
https://druid.apache.org/docs/latest/ingestion/flatten-json.html
Looks like since Druid 0.17.0, Druid expressions support typed constructors for creating arrays, so using expression string_to_array should do the trick!

Using ADF Copy Activity with dynamic schema mapping

I'm trying to drive the columnMapping property from a database configuration table. My first activity in the pipeline pulls in the rows from the config table. My copy activity source is a Json file in Azure blob storage and my sink is an Azure SQL database.
In copy activity I'm setting the mapping using the dynamic content window. The code looks like this:
"translator": {
"value": "#json(activity('Lookup1').output.value[0].ColumnMapping)",
"type": "Expression"
}
My question is, what should the value of activity('Lookup1').output.value[0].ColumnMapping look like?
I've tried several different json formats but the copy activity always seems to ignore it.
For example, I've tried:
{
"type": "TabularTranslator",
"columnMappings": {
"view.url": "url"
}
}
and:
"columnMappings": {
"view.url": "url"
}
and:
{
"view.url": "url"
}
In this example, view.url is the name of the column in the JSON source, and url is the name of the column in my destination table in Azure SQL database.
The issue is due to the dot (.) sign in your column name.
To use column mapping, you should also specify structure in your source and sink dataset.
For your source dataset, you need specify your format correctly. And since your column name has dot, you need specify the json path as following.
You could use ADF UI to setup a copy for a single file first to get the related format, structure and column mapping format. Then change it to lookup.
And as my understanding, your first format should be the right format. If it is already in json format, then you may not need use "json" function in your expression.
There seems to be a disconnect between the question and the answer, so I'll hopefully provide a more straightforward answer.
When setting this up, you should have a source dataset with dynamic mapping. The sink doesn't require one, as we're going to specify it in the mapping.
Within the copy activity, format the dynamic json like the following:
{
"structure": [
{
"name": "Address Number"
},
{
"name": "Payment ID"
},
{
"name": "Document Number"
},
...
...
]
}
You would then specify your dynamic mapping like this:
{
"translator": {
"type": "TabularTranslator",
"mappings": [
{
"source": {
"name": "Address Number",
"type": "Int32"
},
"sink": {
"name": "address_number"
}
},
{
"source": {
"name": "Payment ID",
"type": "Int64"
},
"sink": {
"name": "payment_id"
}
},
{
"source": {
"name": "Document Number",
"type": "Int32"
},
"sink": {
"name": "document_number"
}
},
...
...
]
}
}
Assuming these were set in separate variables, you would want to send the source as a string, and the mapping as json:
source: #string(json(variables('str_dyn_structure')).structure)
mapping: #json(variables('str_dyn_translator')).translator
VladDrak - You could skip the source dynamic definition by building dynamic mapping like this:
{
"translator": {
"type": "TabularTranslator",
"mappings": [
{
"source": {
"type": "String",
"ordinal": "1"
},
"sink": {
"name": "dateOfActivity",
"type": "String"
}
},
{
"source": {
"type": "String",
"ordinal": "2"
},
"sink": {
"name": "CampaignID",
"type": "String"
}
}
]
}
}