Kafka Connect transformations: ExtractField$Value while preserving other keys - apache-kafka

I have the following message structure:
{
    "payload": {
        "a": 1,
        "b": {"X": 1, "Y": 2}
    },
    "timestamp": 1659692671
}
I want to use SMTs to get the following structure:
{
    "a": 1,
    "b": {"X": 1, "Y": 2},
    "timestamp": 1659692671
}
When I use ExtractField$Value for payload, I cannot preserve timestamp.
I cannot use Flatten because "b"'s structure should not be flattened.
Any ideas?
Thanks

Unfortunately, moving a field into another Struct isn't possible with the built-in transforms. You'd have to write your own, or use a stream-processing library before the data reaches the connector.
Note that Kafka records themselves carry a timestamp, so do you really need the timestamp as a field in the value? One option is to extract the timestamp, then flatten, then add it back. However, this will override the record timestamp that the producer had set with that field's value.
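If you do go the custom route, here is a minimal sketch of what such a transform might look like in Scala, written against Connect's Transformation API. It is only an illustration: it assumes the value is a Struct, hard-codes the payload field name, and the class name HoistPayload is made up.

import java.util
import org.apache.kafka.common.config.ConfigDef
import org.apache.kafka.connect.connector.ConnectRecord
import org.apache.kafka.connect.data.{SchemaBuilder, Struct}
import org.apache.kafka.connect.transforms.Transformation
import scala.jdk.CollectionConverters._

// Hoists every field of "payload" to the top level while keeping the
// record's other top-level fields (e.g. "timestamp") untouched.
class HoistPayload[R <: ConnectRecord[R]] extends Transformation[R] {

  override def apply(record: R): R = {
    val value   = record.value().asInstanceOf[Struct]
    val payload = value.getStruct("payload")   // hard-coded for this sketch

    // New schema: payload's fields first, then the remaining top-level fields.
    val builder = SchemaBuilder.struct()
    payload.schema().fields().asScala.foreach(f => builder.field(f.name(), f.schema()))
    value.schema().fields().asScala
      .filterNot(_.name() == "payload")
      .foreach(f => builder.field(f.name(), f.schema()))
    val newSchema = builder.build()

    // Copy the values across.
    val newValue = new Struct(newSchema)
    payload.schema().fields().asScala.foreach(f => newValue.put(f.name(), payload.get(f)))
    value.schema().fields().asScala
      .filterNot(_.name() == "payload")
      .foreach(f => newValue.put(f.name(), value.get(f)))

    record.newRecord(record.topic(), record.kafkaPartition(), record.keySchema(),
      record.key(), newSchema, newValue, record.timestamp())
  }

  override def config(): ConfigDef = new ConfigDef()
  override def configure(configs: util.Map[String, _]): Unit = ()
  override def close(): Unit = ()
}

Packaged onto the Connect plugin path, it would be wired in like any other SMT, e.g. transforms=hoist with transforms.hoist.type set to the fully qualified class name above.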

Related

Kafka Connect Filter Transformer

I am using Kafka Connect, and I want to filter out some messages.
This is what a single message in Kafka looks like:
{
    "code": 2001,
    "time": 11111111,
    "requestId": "123456789",
    "info": [
        {
            "name": "dan",
            "value": 21
        }
    ]
}
And this is what my transform looks like:
transforms: FilterByCode
transforms.FilterByCode.type: io.confluent.connect.transforms.Filter$Value
transforms.FilterByCode.filter.condition: $[?(@.code == 2000)]
transforms.FilterByCode.filter.type: include
I want to filter out the messages whose code value is not 2000.
I tried a few different syntaxes for the condition, but could not find one that works.
All the messages pass through, and none are filtered out.
Any ideas on the correct syntax for the filtering condition?
If you want to filter out all non-2000 codes, try [?(@.code != 2000)]
You may test it here - http://jsonpath.herokuapp.com/
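For reference, the complete transform block following that suggestion would look something like the sketch below; as I read the Confluent docs, filter.type: exclude drops the records that match the condition, and the condition is Jayway-style JSONPath (which is what the linked tester evaluates):

transforms: FilterByCode
transforms.FilterByCode.type: io.confluent.connect.transforms.Filter$Value
transforms.FilterByCode.filter.condition: $[?(@.code != 2000)]
transforms.FilterByCode.filter.type: exclude

Equivalently, keeping filter.type: include with the original == 2000 condition should have the same effect.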

JSON data retrieved from MongoDB has type formats added explicitly inside fields, e.g. {"field": {"$numberInt": "20"}}. How do I process that data?

I used mongoimport to load data into MongoDB from CSV files. I am now trying to retrieve data from a MongoDB Realm service. The returned data for an entry looks like this:
{
    "_id": "6124edd04543fb222e",
    "Field1": "some string",
    "Field2": {
        "$numberDouble": "145.81"
    },
    "Field3": {
        "$numberInt": "0"
    },
    "Field4": {
        "$numberInt": "15"
    },
    "Field5": {
        "$numberInt": "0"
    }
}
How do I convert this into normal JSON by removing $numberInt and $numberDouble, like this:
{
    "_id": "6124edd04543fb222e",
    "Field1": "some string",
    "Field2": 145.81,
    "Field3": 0,
    "Field4": 15,
    "Field5": 0
}
The fields also differ between documents, so I cannot use Mongoose directly. Are there any solutions to this?
It would also help to know why the numbers are being returned as $numberInt: "".
Edit:
For anyone with the same problem, this is how I solved it.
The array of documents is in EJSON format instead of plain JSON, as the upvoted answer says. To convert it back into normal JSON, I used JSON.stringify to turn each document from the map function into a string, and then parsed it with EJSON.parse using the {strict: false} option (this option is important) to get normal JSON back.
// Convert each EJSON document back into plain JSON
const plainRestaurants = restaurants.map((restaurant) =>
  EJSON.parse(JSON.stringify(restaurant), { strict: false })
);
EJSON.parse documentation here. The module to be installed and imported is mongodb-extjson.
The format with $numberInt etc. is called (MongoDB) Extended JSON.
You are getting it on the output side either because this is how you inserted your data (meaning your inserted data was incorrect, you need to fix the ingestion side) or because you requested extended JSON serialization.
If the data in the database is correct, and you want non-extended JSON output, you generally need to write your own serializers to JSON since there are multiple possibilities of how to format the data. MongoDB's JSON output format is the Extended JSON you're seeing in your first quote.

How do I use ADF copy activity with multiple rows in source?

I have a source which is a JSON array and a sink which is SQL Server. When I use column mapping and look at the generated code, I can see the mapping is done against the first element of the array, so each run produces a single record even though the source has multiple records. How do I use the copy activity to import ALL the rows?
"enableStaging": false,
"translator": {
"type": "TabularTranslator",
"schemaMapping": {
"['#odata.context']": "BuyerFinancing",
"['#odata.nextLink']": "PropertyCondition",
"value[0].AssociationFee": "AssociationFee",
"value[0].AssociationFeeFrequency": "AssociationFeeFrequency",
"value[0].AssociationName": "AssociationName",
Use * as the source field to map the whole element in JSON format. For example, with this JSON:
{
    "results": [
        { "field1": "valuea", "field2": "valueb" },
        { "field1": "valuex", "field2": "valuey" }
    ]
}
and a database table with a column result to store the JSON, the mapping with results as the collection and * as the sub-element will create two records with:
{"field1": "valuea", "field2": "valueb"}
{"field1": "valuex", "field2": "valuey"}
in the result field.
Copy Data Field Mapping
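If you want individual columns instead of the whole object in one result column, the same cross apply can be expressed directly in the copy activity's translator. A hedged sketch, reusing the results/field1/field2 names from the example above and the newer mappings/collectionReference form from Microsoft's schema-mapping documentation (one row is produced per element of results):

"translator": {
    "type": "TabularTranslator",
    "collectionReference": "$['results']",
    "mappings": [
        { "source": { "path": "['field1']" }, "sink": { "name": "field1" } },
        { "source": { "path": "['field2']" }, "sink": { "name": "field2" } }
    ]
}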
ADF supports cross apply for JSON arrays. Please check the example in this doc: https://learn.microsoft.com/en-us/azure/data-factory/supported-file-formats-and-compression-codecs#jsonformat-example
For schema mapping, see: https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-schema-and-type-mapping#schema-mapping

How to remove a key/value from a JSON var using Scala

I have the following String var input:
val json= """[{"first": 1, "name": "abc", "timestamp": "2018/06/28"},
{"first": 2, "name": "mtm", "timestamp": "2018/06/28"}]"""
I need to remove the "timestamp" key/value pair.
Expected output:
val result = """[{"first": 1, "name": "abc"},{"first": 2, "name": "mtm"}]"""
Please kindly help.
A simple regex will do it:
json.replaceAll(""",\s*"timestamp"[^,}]*""", "")
Or with a JSON parser (though it's hard to answer without knowing which JSON parser you're using), perhaps:
parse it with one of these: What JSON library to use in Scala?
then remove the "timestamp" entries, e.g. with List.map(m => m - "timestamp") (depends on which library you're using)
then re-serialize the JSON (a circe-based sketch follows below)
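For example, a sketch with circe (an arbitrary choice; any of the libraries from that question would work similarly):

// assumes "io.circe" %% "circe-parser" is on the classpath
import io.circe.parser.parse

val json = """[{"first": 1, "name": "abc", "timestamp": "2018/06/28"},
  {"first": 2, "name": "mtm", "timestamp": "2018/06/28"}]"""

// Parse, drop "timestamp" from every object in the array, then re-render.
val result: Either[io.circe.Error, String] =
  parse(json).map { doc =>
    doc.mapArray(_.map(obj => obj.mapObject(_.remove("timestamp")))).noSpaces
  }
// result: Right([{"first":1,"name":"abc"},{"first":2,"name":"mtm"}])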

Delete a MongoDB subdocument by value

I have a collection containing documents that look like this:
{
    "user": "foo",
    "topics": {
        "Topic AB": {
            "score": 20,
            "frequency": 3,
            "last_seen": 40
        },
        "Topic BD": {
            "score": 10,
            "frequency": 2,
            "last_seen": 38
        },
        "Topic TF": {
            "score": 19,
            "frequency": 6,
            "last_seen": 20
        }
    }
}
I want to remove subdocuments whose last_seen value is less than 30.
I don't want to use arrays here since I'm using $inc to update the subdocuments in conjunction with upsert (which doesn't support the $ notation).
The real question here is how can I delete a key depending on its value. Using $unset simply drops a subdocument regardless of what it contains.
I'm afraid I don't think this is possible with your current design. Knowing the name of the key whose last_seen value you wish to test, for example Topic TF, you can do
> db.topics.update({"topics.Topic TF.last_seen" : { "$lt" : 30 }},
{ "$unset" : { "topics.Topic TF" : 1} })
However, with an embedded document structure, if you don't know the name of the key that you want to query against then you can't run the query. If the Topic XX keys are only known by what's in the document, you'd have to pull the whole document to find out what keys to test, and at that point you ought to just manipulate the document client-side and then update by _id.
The best option is to use arrays. The $ positional operator works with upserts, it just has a serious gotcha that, in the case of an insert, the $ will be interpreted as part of the field name instead of as an operator, so I understand your conclusion that it doesn't seem feasible. I'm not quite sure how you are using upsert such that arrays seem like they won't work, though. Could you give more detail there and I'll try to help come up with a reasonable workaround to use arrays and $ with your use case?
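In the meantime, to make the array option concrete: if topics were remodeled as an array of subdocuments, e.g. with a name field alongside score, frequency and last_seen (an assumed reshaping, not your current schema), pruning everything last seen before 30 becomes a single $pull:

db.topics.update(
    { "user": "foo" },
    { "$pull": { "topics": { "last_seen": { "$lt": 30 } } } }
)

$pull removes every matching array element in one statement, regardless of the topic name.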