I am working on a project that pulls data from multiple database sources using Kafka Connect. I then want to transform the data into a specified JSON format and push that final JSON to an S3 bucket, preferably using Kafka Connect to keep my overhead down.
Here is an example of what the data currently looks like coming into Kafka (in Avro format):
{"tableName":"TABLE1","SchemaName{"string":"dbo"},"tableID":1639117030,"columnName":{"string":"DATASET"},"ordinalPosition":{"int":1},"isNullable":{"int":1},"dataType":{"string":"varchar"},"maxLength":{"int":510},"precision":{"int":0},"scale":{"int":0},"isPrimaryKey":{"int":0},"tableSizeKB":{"long":72}}
{"tableName":"dtproperties","SchemaName":{"string":"dbo"},"tableID":1745441292,"columnName":{"string":"id"},"ordinalPosition":{"int":1},"isNullable":{"int":0},"dataType":{"string":"int"},"maxLength":{"int":4},"precision":{"int":10},"scale":{"int":0},"isPrimaryKey":{"int":1},"tableSizeKB":{"long":24}}
This looks like so when converted to JSON:
{
"tablename" : "AS_LOOKUPS",
"tableID": 5835333,
"columnName": "SVALUE",
"ordinalPosition": 6,
"isNullable": 1,
"dataType": "varchar",
"maxLength": 4000,
"precision": 0,
"scale": 0,
"isPrimaryKey": 0,
"tableSize": 0,
"sizeUnit": "GB"
},
{
"tablename" : "AS_LOOKUPS",
"tableID": 5835333,
"columnName": "SORT_ORDER",
"ordinalPosition": 7,
"isNullable": 1,
"dataType": "int",
"maxLength": 4,
"precision": 10,
"scale": 0,
"isPrimaryKey": 0,
"tableSize": 0,
"sizeUnit": "GB"
}
My goal is to get the data to look like so:
{
"header": "Database Inventory",
"DBName": "DB",
"ServerName": "server#server.com",
"SchemaName": "DBE",
"DB Owner": "Name",
"DB Guardian" : "Name/Group",
"ASV" : "ASVC1AUTODWH",
"ENVCI": "ENVC1AUTODWHORE",
"Service Owner" : "Name/Group",
"Business Owner" : "Name/Group",
"Support Owner" : "Name/Group",
"Date of Data" : "2017-06-28 12:12:55.000",
"TABLE_METADATA": {
"TABLE_SIZE" : "500",
"UNIT_SIZE" : "GB",
"TABLE_ID": 117575457,
"TABLE_NAME": "spt_fallback_db",
"COLUMN_METADATA": [
{
"COLUMN_NM": "xserver_name",
"DATE_TYPE": "varchar",
"MAX_LENGTH": 30,
"PRECISION": 0,
"SCALE": 0,
"IS_NULLABLE": 0,
"PRIMARY_KEY": 0,
"ORDINAL_POSITION": 1
},
{
"COLUMN_NM": "xdttm_ins",
"DATE_TYPE": "datetime",
"MAX_LENGTH": 8,
"PRECISION": 23,
"SCALE": 3,
"IS_NULLABLE": 0,
"PRIMARY_KEY": 0,
"ORDINAL_POSITION": 2
}, ........
The header data will mostly be generic, but some of it, like the date, will need to be populated.
My original thought was that I could do everything with Kafka Connect and simply create a schema for the way I want the data to be formatted. I'm having a problem, though, with using a different schema with the connectors, and I'm not really sure it is even possible.
Another solution I thought about was using Kafka Streams and writing code to transform the data into what is needed. I'm not too sure how easy that is to do with Kafka Streams.
And finally, a third solution I have seen is to use Apache Spark and manipulate the data with DataFrames, but this would add more overhead.
I'm honestly not too sure which route to go, or whether any of these solutions are what I'm looking for, so I am open to all advice on how to solve this problem.
Kafka Connect does have Single Message Transforms (SMTs), a framework for making minor adjustments to the records produced by a source connector before they are written into Kafka, or to the records read from Kafka before they are sent to sink connectors. Most SMTs are quite simple functions, but you can chain them together for slightly more complex operations. You can always implement your own Transformation with custom logic, but no matter what, each transform operates on a single record at a time and should never make calls out to other services. SMTs are only for basic manipulation of individual records.
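For the flat parts of your target format, a chained SMT configuration on the source connector could look roughly like the sketch below; the transform aliases, the static header field, and the rename list are placeholders based on your sample fields, and none of this handles the nested TABLE_METADATA/COLUMN_METADATA structure:

{
  "transforms": "addHeader,rename",
  "transforms.addHeader.type": "org.apache.kafka.connect.transforms.InsertField$Value",
  "transforms.addHeader.static.field": "header",
  "transforms.addHeader.static.value": "Database Inventory",
  "transforms.rename.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
  "transforms.rename.renames": "columnName:COLUMN_NM,dataType:DATE_TYPE,ordinalPosition:ORDINAL_POSITION"
}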
However, the changes you want to make are probably a bit more complex than what is suitable through SMTs. Kafka Streams seems like it is the best solution to this problem, since it allows you to create a simple stream processor that consumes the topic(s) produced by the source connector, alters (and possibly combines) the messages accordingly, and writes them out to other topic(s). Since you're already using Avro, you can write your Streams application to use Avro generic records (see this example) or with classes auto-generated from the Avro schemas (see this example).
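For illustration, a minimal sketch of such a Streams processor using Avro generic records; the topic names, Schema Registry URL, and target schema are placeholders, and grouping the column rows per table into a COLUMN_METADATA array would additionally need a groupByKey/aggregate step that isn't shown here:

import java.util.Map;
import java.util.Properties;

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

import io.confluent.kafka.streams.serdes.avro.GenericAvroSerde;

public class MetadataReshaper {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "table-metadata-reshaper");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        // Serde that reads and writes Avro generic records via the Schema Registry
        GenericAvroSerde valueSerde = new GenericAvroSerde();
        valueSerde.configure(Map.of("schema.registry.url", "http://localhost:8081"), false);

        // Placeholder target schema for the reshaped record
        Schema target = SchemaBuilder.record("ColumnMetadata").fields()
                .requiredString("COLUMN_NM")
                .requiredString("DATE_TYPE")
                .requiredInt("ORDINAL_POSITION")
                .endRecord();

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, GenericRecord> source =
                builder.stream("db-metadata", Consumed.with(Serdes.String(), valueSerde));

        // Map each incoming column record onto the target layout and write it out
        source.mapValues(in -> (GenericRecord) new GenericRecordBuilder(target)
                        .set("COLUMN_NM", in.get("columnName").toString())
                        .set("DATE_TYPE", in.get("dataType").toString())
                        .set("ORDINAL_POSITION", in.get("ordinalPosition"))
                        .build())
              .to("db-metadata-reshaped", Produced.with(Serdes.String(), valueSerde));

        new KafkaStreams(builder.build(), props).start();
    }
}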
You also mention that you have data from multiple sources, and chances are those are going into separate topics. If you want to integrate, join, combine, or simply merge those topics into other topics, then Kafka Streams is a great way to do this.
Kafka Streams apps are also just normal Java applications, so you can deploy them using the platform of your choosing, whether that's Docker, Kubernetes, Mesos, AWS, or something else. They don't require a running distributed platform the way Apache Spark does.
I have some queries running against my Cloudant service. Some of them return quickly but a small minority are slower than expected. How can I see which queries are running slowly?
IBM Cloud activity logs can be sent to LogDNA Activity Tracker - each log item has latency measurements allowing you to identify which queries are running slower than others. For example, a typical log entry will look like this:
{
"ts": "2021-11-30T22:39:58.620Z",
"accountName": "xxxxx-yyyy-zzz-bluemix",
"httpMethod": "POST",
"httpRequest": "/yourdb/_find",
"responseSizeBytes": 823,
"clientIp": "169.76.71.72",
"clientPort": 31393,
"statusCode": 200,
"terminationState": "----",
"dbName": "yourdb",
"dbRequest": "_find",
"userAgent": "nodejs-cloudant/4.5.1 (Node.js v14.17.5)",
"sslVersion": "TLSv1.2",
"cipherSuite": "ECDHE-RSA-CHACHA20-POLY1305",
"requestClass": "query",
"parsedQueryString": null,
"rawQueryString": null,
"timings": {
"connect": 0,
"request": 1,
"response": 2610,
"transfer": 0
},
"meta": {},
"logSourceCRN": "crn:v1:bluemix:public:cloudantnosqldb:us-south:a/abc12345:afdfsdff-dfdf34-6789-87yh-abcr45566::",
"saveServiceCopy": false
}
The timings object contains various measurements, including the response time for the query.
For compliance reasons, the actual queries are not written to the logs, so to match queries to log entries you could put a unique identifier in the query string of the request, which would appear in the rawQueryString parameter of the log entry.
For more information on logging see this blog post.
Another option is to simply measure HTTP round-trip latency.
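For example, here is a minimal sketch of timing a _find request client-side with Java's built-in HttpClient; the account URL, IAM token, and selector are placeholders:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class QueryTimer {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Hypothetical Cloudant _find request; host, credentials and selector are placeholders
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://ACCOUNT.cloudantnosqldb.appdomain.cloud/yourdb/_find"))
                .header("Content-Type", "application/json")
                .header("Authorization", "Bearer YOUR_IAM_TOKEN")
                .POST(HttpRequest.BodyPublishers.ofString("{\"selector\": {\"type\": \"order\"}}"))
                .build();

        long start = System.nanoTime();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        System.out.println("HTTP " + response.statusCode() + " in " + elapsedMs + " ms");
    }
}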
Once you have found your slow queries, have a look at this post for ideas on how to optimise queries.
My goal is to grab JSON data from an HTTP source and store it in a Kafka topic using AVRO serialization.
Using Kafka Connect and an HTTP source connector along with a bunch of SMTs, I managed to create a Connect data structure that looks like this when written to the topic with the StringConverter:
Struct{base=stations,cod=200,coord=Struct{lat=54.0,lon=9.0},dt=1632150605}
Thus the JSON was successfully parsed into STRUCTs and I can manipulate individual elements using SMTs. Next, I created a new subject with the corresponding schema inside the Confluent Schema Registry and switched the connector's value converter over to the Confluent AVRO Converter with "value.converter": "io.confluent.connect.avro.AvroConverter".
Instead of the expected serialization I got an error message saying:
org.apache.kafka.common.errors.SerializationException: Error serializing Avro message
Caused by: org.apache.avro.SchemaParseException: Can't redefine: io.confluent.connect.avro.ConnectDefault
As soon as I remove the nested STRUCT with ReplaceField or simplify the structure with Flatten, the AVRO serialization works like a charm. So it looks like the converter cannot handle nested structures.
What is the right way to go when you have nested elements and want them to be serialized as such rather than storing the JSON as a String and trying to deal with object creation in the consumer or beyond? Is this possible in Kafka Connect?
The creation of STRUCT elements from a JSON string can be achieved by different means. Originally, the SMT ExpandJson was used for its simplicity. It does not create sufficiently named STRUCTs, however, as it doesn't have a schema to work off of. That is what caused the initial error message: the AVRO serializer uses the generic class io.confluent.connect.avro.ConnectDefault for those STRUCTs, and if there is more than one, the ambiguity throws an exception.
Another SMT doing seemingly the same thing is Json Schema, which has a documented FromJson conversion. It does accept a schema and thus gets around ExpandJson's problem of parsing nested elements as a generic type. What is being accepted is a JSON Schema, though, and the mapping to AVRO fullnames works by taking the word "properties" as the namespace and copying the field name. In this example, you would end up with properties.coord as the fullname of the inner element.
As an example, when the following JSON Schema is passed to the SMT:
{
"$schema": "http://json-schema.org/draft-04/schema#",
"type": "object",
"properties": {
"coord": {
"type": "object",
"properties": {
"lon": {
"type": "number"
},
"lat": {
"type": "number"
}
},
"required": [
"lon",
"lat"
]
},
...
}
The AVRO schema it produces (and thus looks for in the Schema Registry) is:
{
"type": "record",
"fields": [
...
{
"name": "coord",
"type": {
"type": "record",
"name": "coord",
"namespace": "properties",
"fields": [
{
"name": "lat",
"type": "double"
},
{
"name": "lon",
"type": "double"
}
],
"connect.name": "properties.coord"
}
},
...
}
In theory, if you have another schema with a coord element on the second level, it will get the same fullname, but since these are not individual entries inside the Schema Registry needing to be referenced, this will not lead to collisions. Not being able to control the namespace of the AVRO record from the JSON Schema is a little bit of a shame, as it feels like you're just about there, but I haven't been able to dig deep enough to offer a solution.
The suggested SMT SetSchemaMetadata (see the first reply to the question) can be useful in this process, but its documentation clashes a little with AVRO naming conventions, as it shows order-value in an example. It will try to find a schema that contains an AVRO record with this name as the root element, and since '-' is an illegal character in an AVRO name, you get an error. If you use the correct name of the root element, though, the SMT does something very useful: its RestService class, which queries the Schema Registry to find a matching schema, fails with a message printing out the exact schema definition that needs to be created, so you don't necessarily have to memorize all the transformation rules.
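For reference, a sketch of how SetSchemaMetadata could be chained in after whichever JSON-expanding transform you use, to give the root record an AVRO-legal fullname; the alias and the schema name com.example.WeatherReading are made up for illustration:

{
  "value.converter": "io.confluent.connect.avro.AvroConverter",
  "value.converter.schema.registry.url": "http://localhost:8081",
  "transforms": "setName",
  "transforms.setName.type": "org.apache.kafka.connect.transforms.SetSchemaMetadata$Value",
  "transforms.setName.schema.name": "com.example.WeatherReading"
}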
Thus the answer to the original question is: Yes, it can be done with Kafka Connect. And it also is the best way to go if you
don't want to write your own producer/connector
want to store JSON blobs in a typed way as opposed to converting them after they hit an initial topic
If conversion after data ingestion is an option, the de-, re- and serialization capabilities of ksqlDB seem to be quite powerful.
I have to store multiple fields containing nested (3-5 levels) JSON specific to our call center application in jsonb fields. We then materialize views specific to our analysis needs and push the exec reporting views to Redshift. The data was originally ingested from an S3 source that was the backup for a Lambda function that parsed a log stream and put it to S3 as Parquet.
The JSON to Parquet table loaded data in the following format:
{"ContactId": "val", "Timestamp": "2021-06-02T03:59:59.094Z", "Parameters": {"Text": "Para português, aperte três.", "Voice": "name", "Timeout": "3000", "MaxDigits": "1", "TextToSpeechType": "text"}, "ContactFlowId": "arn:-1993633fcebb", "ContactFlowModuleType": "GetUserInput"}
This has now changed, as upstream has removed the Lambda and put in Kinesis Firehose, which lands the data in the same location in Parquet. The new payload for those fields looks like this:
{"contactid": "val", "timestamp": "2021-06-02T03:59:59.094Z", "parameters": {"text": "Para português, aperte três.", "voice": "name", "timeout": "3000", "maxdigits": "1", "texttospeechtype": "text"}, "contactflowid": "arn:-1993633fcebb", "contactflowmoduletype": "GetUserInput"}
We didn't immediately realize the impact until the ETL in non-prod started behaving incorrectly and the materialized views started loading wrong. It turns out that queries that had been defined with the original key/value pairs in mind were no longer matching, even though nothing changed in the field names or nesting structure; only the casing did.
So:
message->>'ContactId' is distinct from message->>'contactid'.
The issue is that we now have both sets of nests within our core tables. I looked at the Firehose and it doesn't give options to preserve the case on the keys.
I could use CASE statements in the materialized view definitions based on time, since there is a distinct cutover date, but I was wondering how to handle queries that span the time before and after the cutover to Firehose.
My initial thought was to use COALESCE(message->>'ContactId', message->>'contactid'), but this quickly gets ugly when trying to refactor queries involving aggregations across various nesting levels.
Any thoughts on how I might best work around this? In addition to the materialized views, we also query these nests in trigger functions from stage to target, where values are cast to specific data types, so I'm concerned COALESCE may be computationally too intensive for some of our batch loads.
Any thoughts/ideas would be much appreciated.
Thanks
Went with COALESCE given no better alternative short of upstream correction.
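For what it's worth, a rough sketch of paying the COALESCE once in the materialized view definition, so downstream queries and aggregations only ever see the normalized columns; the view, table, and column names are placeholders:

CREATE MATERIALIZED VIEW contact_events AS
SELECT
    -- Normalize both key casings once, here, instead of in every downstream query
    COALESCE(message->>'ContactId',  message->>'contactid')  AS contact_id,
    COALESCE(message->'Parameters'->>'Text',
             message->'parameters'->>'text')                 AS prompt_text,
    COALESCE(message->>'Timestamp',  message->>'timestamp')::timestamptz AS event_ts
FROM call_center_log;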
What is the best practice for structuring messages for a topic containing different types that need to be ordered?
Example
Topic: user-events
Event types: UserCreatedEvent, UserUpdatedEvent, UserDeletedEvent.
Those events need to be saved in the same topic and partition to guarantee the order.
Possible solutions I see:
1. A single schema containing all event type fields.
   Pro: easy to implement and process as a stream.
   Con: many fields may end up empty, and it's not possible to specify required fields per event type.
2. A schema containing all the event type schemas, e.g. {eventId, timestamp, userCreated: {}, userUpdated: {}, userDeleted: {}}.
   Pro: easy to implement and process as a stream, and required fields can be set per event type.
   Con: the message type is not clear without inspecting the payload.
3. A different schema per event using an Avro union.
   Pro: every message is an event.
   Con: difficult to deserialize (GenericRecord).
Are there other possible solutions? How do you normally handle a topic with different message types, and how do you process this kind of topic?
Any reference to code example is welcome.
UPDATE
There are two articles from Confluent that try to explain how to solve this:
Should You Put Several Event Types in the Same Kafka Topic?
Putting Several Event Types in the Same Topic
My opinion on the articles is that they give you only a partial answer.
The first tells you when it is a good idea to save different types into the same topic, and that event sourcing is a good fit.
The second is more technical and illustrates the possibility of doing this with an Avro union.
But neither of them explains in detail how to do it with a real example.
I have seen projects on GitHub where they simplified the scenario by creating a single schema, more as a state than an actual event (point 1 above).
Talking to someone with experience using Kafka, I came up with the solution explained in point 2, nesting the events into a “carrying event”.
Yesterday I managed (I will share the solution ASAP) to use an Avro union, deserialize the events as GenericRecord, and do the transformation based on the event type.
Since I didn't find any similar solution, I was curious to know whether I'm missing something, like drawbacks (e.g. ksqlDB doesn't support different types) or better practices for doing the same in Kafka.
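For illustration, a union-based schema along the lines of point 3 could look like the sketch below; the event payload fields are invented placeholders. On the consumer side you can inspect the schema name of the GenericRecord you get back and branch on it, which is essentially the "transformation based on the event type" mentioned above:

[
  {
    "type": "record",
    "name": "UserCreatedEvent",
    "namespace": "com.example.events",
    "fields": [
      {"name": "userId", "type": "string"},
      {"name": "email", "type": "string"},
      {"name": "timestamp", "type": "long"}
    ]
  },
  {
    "type": "record",
    "name": "UserUpdatedEvent",
    "namespace": "com.example.events",
    "fields": [
      {"name": "userId", "type": "string"},
      {"name": "email", "type": ["null", "string"], "default": null},
      {"name": "timestamp", "type": "long"}
    ]
  },
  {
    "type": "record",
    "name": "UserDeletedEvent",
    "namespace": "com.example.events",
    "fields": [
      {"name": "userId", "type": "string"},
      {"name": "timestamp", "type": "long"}
    ]
  }
]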
In cases where I need to transfer different objects through one topic, I use a transport container. It stores some meta information about the nested object, plus the serialized object itself.
The .avsc schema of the container can look like this:
{
"type": "record",
"name": "TransportContainer",
"namespace": "org.example",
"fields": [
{
"name": "id",
"type": "long"
},
{
"name": "event_type",
"type": "string"
},
{
"name": "event_timestamp",
"type": "long"
},
{
"name": "event",
"type": "bytes"
}
]
}
In "event_type" field you should store the type of the event. You should use it to determine, which schema you need to use to deserialize a nested object, stored in the "event" field. Also, it helps to avoid deserialization of every nested object, if you want to read objects only with a specific type.
Say I publish and consume different types of Java objects. For each I have to define my own serializer implementation.
How can we provide all the implementations in the Kafka consumer/producer properties file under the "serializer.class" property?
We have a similar setup with different objects in different topics, but always the same object type in one topic. We use the ByteArrayDeserializer that comes with the Java API 0.9.0.1, which means our message consumers only ever get a byte[] as the value part of the message (we consistently use String for the keys). The first thing the topic-specific message consumer does is call the right deserializer to convert the byte[]. You could use an Apache Commons helper class. Simple enough.
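A minimal sketch of this first approach, assuming the values were written with plain Java serialization; the topic name and the Order class are placeholders, and SerializationUtils comes from Apache Commons Lang:

import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.commons.lang3.SerializationUtils;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ByteArrayValueConsumer {

    // Hypothetical payload type; in practice this is your own Serializable domain class
    public static class Order implements java.io.Serializable {
        public String id;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "example-group");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        try (KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                for (ConsumerRecord<String, byte[]> record : consumer.poll(Duration.ofMillis(500))) {
                    // This topic carries exactly one type, so the conversion is safe by convention
                    Order order = SerializationUtils.deserialize(record.value());
                    System.out.println("Got order " + order.id);
                }
            }
        }
    }
}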
If you prefer to let the KafkaConsumer do the deserialization for you, you can of course write your own Deserializer. The deserialize method you need to implement has the topic as the first argument. Use it as a key into a map that provides the necessary deserializer and off you go. My hunch is that in most cases you will just do a normal Java deserialization anyway.
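Here is a sketch of that second approach, keyed by topic name; the topics and delegate deserializers are stand-ins (in practice you would register the deserializers for your own object types):

import java.util.Map;

import org.apache.kafka.common.serialization.Deserializer;
import org.apache.kafka.common.serialization.IntegerDeserializer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PerTopicDeserializer implements Deserializer<Object> {

    // Topic name -> deserializer that knows how to handle that topic's values
    private final Map<String, Deserializer<?>> delegates = Map.of(
            "audit-log", new StringDeserializer(),
            "counters",  new IntegerDeserializer());

    @Override
    public void configure(Map<String, ?> configs, boolean isKey) {
        // no-op; the delegates used here need no configuration
    }

    @Override
    public Object deserialize(String topic, byte[] data) {
        Deserializer<?> delegate = delegates.get(topic);
        if (delegate == null) {
            throw new IllegalArgumentException("No deserializer registered for topic " + topic);
        }
        return delegate.deserialize(topic, data);
    }

    @Override
    public void close() {
        delegates.values().forEach(Deserializer::close);
    }
}

You would then set this class as the value.deserializer in the consumer properties.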
The downside of the 2nd approach is that you need a common super class for all your message objects to be able to parameterize the ConsumerRecord<K,V> properly. With the first approach, however, it is ConsumerRecord<String, byte[]> anyway. But then you convert the byte[] to the object you need just at the right place and need only one cast right there.
One option is Avro. Avro lets you define record types that you can then easily serialize and deserialize.
Here's an example schema adapted from the documentation:
{"namespace": "example.avro",
"type": "record",
"name": "User",
"fields": [
{"name": "name", "type": "string"},
{"name": "favorite_number", "default": null, "type": ["null","int"]},
{"name": "favorite_color", "default": null, "type": ["null","string"]}
]
}
Avro distinguishes between so-called SpecificData and GenericData. With SpecificData readers and writers, you can easily serialize and deserialize known Java objects. The downside is that SpecificData requires compile-time knowledge of the class-to-schema mapping.
On the other hand, GenericData readers and writers let you deal with record types you didn't know about at compile time. While obviously very powerful, this can get kind of clumsy -- you will have to invest time coding around the rough edges.
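For instance, a small sketch of the GenericData route, parsing the schema shown above at runtime and round-tripping a record with no generated classes (the values are just for illustration):

import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class GenericUserExample {
    public static void main(String[] args) throws IOException {
        // Parse the schema at runtime; no generated classes required
        Schema schema = new Schema.Parser().parse("""
                {"namespace": "example.avro",
                 "type": "record",
                 "name": "User",
                 "fields": [
                   {"name": "name", "type": "string"},
                   {"name": "favorite_number", "default": null, "type": ["null","int"]},
                   {"name": "favorite_color", "default": null, "type": ["null","string"]}
                 ]}
                """);

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Alyssa");
        user.put("favorite_number", 256);

        // Serialize to Avro binary
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(user, encoder);
        encoder.flush();

        // Deserialize back into a GenericRecord and read fields by name
        GenericRecord roundTripped = new GenericDatumReader<GenericRecord>(schema)
                .read(null, DecoderFactory.get().binaryDecoder(out.toByteArray(), null));
        System.out.println(roundTripped.get("name") + " / " + roundTripped.get("favorite_number"));
    }
}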
There are other options out there -- Thrift comes to mind -- but from what I understand, one of the major differences is Avro's ability to work with GenericData.
Another benefit is multi-language compatibility. Avro I know has native support for a lot of languages, on a lot of platforms. The other options do too, I am sure -- probably any off the shelf option is going to be better than rolling your own in terms of multi-language support, it's just a matter of degrees.