Fan out data from Kafka to S3 using Kafka Connect

My Kafka topic receives a wrapper payload in JSON format. The wrapper payload looks like this:
{
  "format": "wrapper",
  "time": 1626814608000,
  "events": [
    {
      "id": "item1",
      "type": "product1",
      "count": 200
    },
    {
      "id": "item2",
      "type": "product2",
      "count": 300
    }
  ],
  "metadata": {
    "schema": "schema-1"
  }
}
I need to export this to S3. The catch is that I should not store the wrapper; instead, I should store the individual events based on the item.
For example, it should be stored in S3 as follows:
bucket/product1:
{"id": "item1", "type": "product1", "count": 200}
bucket/product2:
{"id": "item2", "type": "product2", "count": 300}
As you can see, the input is the wrapper with those events inside it. However, my output should be each of those individual events stored in S3, in the same bucket, with the product type as the prefix.
My question is: is it possible to use Kafka Connect to do this? I see it has Single Message Transforms, which seem to be a way to mutate data inside a record, but not to fan out the way I want. Even the signature looks like R => R:
https://github.com/apache/kafka/blob/trunk/connect/api/src/main/java/org/apache/kafka/connect/transforms/Transformation.java
So based on my research, it does not seem possible. But I want to check if I am missing something before using a different option.

Single Message Transforms accept one event and output one event; they cannot fan one record out into several.
You need to use a stream processor, such as the Kafka Streams branch or flatMap functions, to split an array of events into multiple events or route them to multiple topics.
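For illustration, here is a minimal Kafka Streams sketch of that approach. The topic names, the string serdes, and the use of Jackson for JSON parsing are assumptions for the example, not anything prescribed by the question: the stream reads each wrapper, flatMaps it into one record per event keyed by the product type, and writes the events to an output topic that an S3 sink connector can then consume.

import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class WrapperFanout {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wrapper-fanout");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("wrapper-topic", Consumed.with(Serdes.String(), Serdes.String()))
               // One wrapper record in, N event records out, keyed by the event's product type.
               .flatMap((String key, String wrapperJson) -> {
                   List<KeyValue<String, String>> events = new ArrayList<>();
                   try {
                       JsonNode wrapper = MAPPER.readTree(wrapperJson);
                       for (JsonNode event : wrapper.path("events")) {
                           events.add(KeyValue.pair(event.path("type").asText(), event.toString()));
                       }
                   } catch (Exception e) {
                       // Malformed wrapper: skip it (or route it to a dead-letter topic instead).
                   }
                   return events;
               })
               .to("events-topic", Produced.with(Serdes.String(), Serdes.String()));

        new KafkaStreams(builder.build(), props).start();
    }
}

From there, an S3 sink connector (for example the Confluent S3 sink, if that is what you use) can consume events-topic; its FieldPartitioner can partition objects by the type field so that events for each product type end up under their own prefix in the bucket.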

Related

In which cases will Meta's WhatsApp webhook payload contain multiple elements in its arrays?

This is about Meta's WhatsApp API integration and the response received on the webhook:
https://developers.facebook.com/docs/whatsapp/cloud-api/webhooks/payload-examples
I am new to the WhatsApp Cloud API integration and I am confused about why the inbound message payload on the webhook is so deeply nested. In which cases will Facebook (Meta) send multiple elements in these nested arrays?
Is it a good idea to read entry[0].changes[0].value.messages[0].text.body directly, or do I need to loop over every level?
In what cases will we receive multiple elements?
{
  "object": "whatsapp_business_account",
  "entry": [{
    "id": "WHATSAPP_BUSINESS_ACCOUNT_ID",
    "changes": [{
      "value": {
        "messaging_product": "whatsapp",
        "metadata": {
          "display_phone_number": PHONE_NUMBER,
          "phone_number_id": PHONE_NUMBER_ID
        },
        "contacts": [{
          "profile": {
            "name": "NAME"
          },
          "wa_id": PHONE_NUMBER
        }],
        "messages": [{
          "from": PHONE_NUMBER,
          "id": "wamid.ID",
          "timestamp": TIMESTAMP,
          "text": {
            "body": "MESSAGE_BODY"
          },
          "type": "text"
        }]
      },
      "field": "messages"
    }]
  }]
}
You can read the documentation of the Graph API webhooks:
https://developers.facebook.com/docs/graph-api/webhooks/getting-started#validate-payloads
Event Notifications are aggregated and sent in a batch with a maximum of 1000 updates. However batching cannot be guaranteed so be sure to adjust your servers to handle each Webhook individually.
You can also check, for each property, whether batching is possible at the link provided.
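Because entry, changes, and messages are all arrays that can carry more than one element when notifications are batched, it is safer to loop over every level than to hard-code entry[0].changes[0].value.messages[0].text.body. A minimal sketch using Jackson (the class name and the printing logic are illustrative, not part of any Meta SDK):

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class WhatsAppWebhookHandler {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Walks every entry -> change -> message instead of assuming index 0 everywhere.
    public static void handle(String payload) throws Exception {
        JsonNode root = MAPPER.readTree(payload);
        for (JsonNode entry : root.path("entry")) {
            for (JsonNode change : entry.path("changes")) {
                for (JsonNode message : change.path("value").path("messages")) {
                    // Only text messages carry a text.body field.
                    if ("text".equals(message.path("type").asText())) {
                        String from = message.path("from").asText();
                        String body = message.path("text").path("body").asText();
                        System.out.printf("Message from %s: %s%n", from, body);
                    }
                }
            }
        }
    }
}

Jackson's path() returns a missing node for absent fields, so payloads without messages (for example, status updates) simply fall through the loops.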

ADF - Loop through a large JSON file in a dataflow

We currently receive some metadata information from a third party supplier in the form of a JSON file.
The JSON file contains definitions of some tables which need to be loaded into SQL via ADF.
The JSON file looks like this; it's a list of tables and their data types:
"Tables": [
{
"name": "account",
"description": "account",
"$type": "LocalEntity",
"attributes": [
{
"dataType": "guid",
"maxLength": "-1",
"name": "Id"
},
{
"dataType": "string",
"maxLength": "250",
"name": "name"
}
]
},
{
"name": "customer",
"description": "account",
"$type": "LocalEntity",
"attributes": [
{
"dataType": "guid",
"maxLength": "-1",
"name": "Id"
},
{
"dataType": "string",
"maxLength": "100",
"name": "name"
}
]
}
]
What we need to do is loop through this JSON and, via an ADF data flow, create the required tables in the destination database.
We initially designed the pipeline with a Lookup activity that loads the JSON file and then passes the output to a ForEach loop. This worked really well when we had only a small JSON file, but as we started using real data the JSON file went over the Lookup activity's 4 MB limit, which caused it to throw an error.
We then tried using a mapping data flow, loading the JSON as a source, setting the sink to a cache, and outputting this to a variable which we then loop through. Again, this works with smaller datasets, but as soon as the dataset is large enough it cannot parse it into an output.
I am sure this should be easy to do but just can't get my head around it!
Here is a sample procedure to loop through a large JSON file in a data flow.
Create a linked service and dataset pointing to the JSON file path.
Use that dataset as the source in the data flow.
Add a Flatten transformation; it takes the input columns from the source and unrolls the required array via the "Unroll by" option.
Create a linked service and dataset for the sink path.
Attach the data flow to a Data Flow activity in the pipeline.
You will get the expected result in the SQL database.

How to ingest data into Apache Pinot from a Kafka topic with an Avro schema?

I have started exploring Apache Pinot and have a few questions regarding its schema. I want to understand how Apache Pinot works with a Kafka topic that has an Avro schema (a schema that includes nested objects, arrays of objects, etc.), because I didn't find any resource or example that shows how to ingest data from Kafka with an Avro schema attached.
As per my understanding, in Apache Pinot we have to provide a flat schema, or, for nested JSON objects, we can use transform functions. Is there any kind of Kafka Connect for Pinot for doing data ingestion?
Avro schema
{
  "namespace": "my.avro.ns",
  "name": "MyRecord",
  "type": "record",
  "fields": [
    {"name": "uid", "type": "int"},
    {"name": "somefield", "type": "string"},
    {"name": "options", "type": {
      "type": "array",
      "items": {
        "type": "record",
        "name": "lvl2_record",
        "fields": [
          {"name": "item1_lvl2", "type": "string"},
          {"name": "item2_lvl2", "type": {
            "type": "array",
            "items": {
              "type": "record",
              "name": "lvl3_record",
              "fields": [
                {"name": "item1_lvl3", "type": "string"},
                {"name": "item2_lvl3", "type": "string"}
              ]
            }
          }}
        ]
      }
    }}
  ]
}
Kafka Avro Message:
{
  "uid": 29153333,
  "somefield": "somevalue",
  "options": [
    {
      "item1_lvl2": "a",
      "item2_lvl2": [
        {
          "item1_lvl3": "x1",
          "item2_lvl3": "y1"
        },
        {
          "item1_lvl3": "x2",
          "item2_lvl3": "y2"
        }
      ]
    }
  ]
}
You don't need a separate connector to ingest data into Pinot from Kafka or from other stream systems such as Kinesis or Apache Pulsar. You simply configure the Pinot table to point to the stream source (the Kafka broker in your case), along with any transformations you may want in order to map the Kafka schema (Avro or otherwise) to the schema in Pinot.
How you should store the data in Pinot (table schema in Pinot) is more a function of how you want to query it.
If you are only interested in a particular field inside your nested field, you can configure a simple ingestion transform to extract that field during ingestion and store it as a column in Pinot.
If you want to preserve the entire nested JSON blob for a column, and then query the blob, then you can use JSON indexing.
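For example, here is a rough sketch of the relevant part of a real-time table config, using the column names from the Avro schema above. Treat the table name, derived column names, and exact property layout as assumptions to verify against the Pinot docs rather than a ready-to-use config; it combines an ingestion transform that extracts one nested field with one that keeps the whole options array as a JSON string for a JSON-indexed column:

{
  "tableName": "myrecords",
  "tableType": "REALTIME",
  "ingestionConfig": {
    "transformConfigs": [
      {
        "columnName": "first_item1_lvl2",
        "transformFunction": "jsonPathString(options, '$[0].item1_lvl2')"
      },
      {
        "columnName": "options_json",
        "transformFunction": "jsonFormat(options)"
      }
    ]
  }
}

The Kafka broker, topic, and Avro decoder are configured in the same table config's stream configuration section, and the JSON index for options_json is declared in the table's indexing config; the docs linked below cover the exact properties.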
Here are some pointers for your reference:
Ingestion Transforms
Flattening JSON
JSON functions
JSON Indexing
Pinot Docs
You may also want to consider joining the Apache Pinot Slack community for Apache Pinot-related questions.

How can I get "complex JSON" from a Kafka topic and insert it into several tables in MySQL?

I'm a Kafka beginner, so it is possible there is an API or a tool that could help me that I do not know about. If I'm approaching this problem wrongly, please let me know; I would really appreciate it.
I have a JSON in my topic which looks something like this:
{
  "id": "001",
  "value": "30000",
  "items": [
    {
      "id": 1,
      "description": "chicken breast",
      "value": "2300"
    },
    {
      "id": 2,
      "description": "Cookies",
      "value": "2400"
    }
  ]
}
And I need to take it and insert it into a database where the data is split across several tables.
Should I create a Kafka Streams app and transform my data to simplify that JSON, or is there any way to "map" or construct each row that I need to insert into each table with Kafka Connect?
P.S.: This is a simplified JSON example; the JSON in my topic will be much more complex.

Split JSON using NiFi

I have a JSON file with all the records merged together. I need to split the merged JSON and load the records into a separate database using NiFi.
When I execute db.collection.findOne(), my input looks like:
[
  {
    "name": "sai",
    "id": 101,
    "company": "adsdr"
  },
  {
    "name": "siva",
    "id": 102,
    "company": "shar"
  },
  {
    "name": "vanai",
    "id": 103,
    "company": "ddr"
  },
  {
    "name": "karti",
    "id": 104,
    "company": "sir"
  }
]
I am getting all of the JSON at once. I need to get output like:
{"name": "sai", "id": 101, "company": "adsdr"}
So I only want one record at a time; how can I split the JSON using NiFi?
There is a SplitJson processor for this purpose:
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.SplitJson/index.html
There are various JSON Path testers online to come up with the correct expression:
https://jsonpath.curiousconcept.com/
Use the SplitJson processor with the configuration shown in the screenshot below:
SplitJson Config
As Bryan said, you can use the SplitJson processor and then forward the split flow files to the other databases. The processor internally uses a JsonPath library; its documentation describes the operations the processor supports.
Just use this to get the first element:
// JSON Path Expression for the first element:
$[0]
[
  {
    "name": "sai",
    "id": 101,
    "company": "adsdr"
  }
]
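If you want every record emitted as its own flow file rather than just the first one, the usual approach (assuming the array sits at the root of the document, as in the input above) is to point the SplitJson processor's JsonPath Expression at the array itself:
// JSON Path Expression that targets the whole root-level array, one flow file per element:
$
Each resulting flow file then contains a single record such as {"name": "sai", "id": 101, "company": "adsdr"}.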