I have 3 different topics with 3 Avro schemas in the Schema Registry. I want to stream these topics, join them together, and write the result into one topic. The problem is that the key I want to join on is different from the key I use when writing the data into each topic.
Let's say we have these 3 Avro schemas:
Alarm:
{
"type" : "record",
"name" : "Alarm",
"namespace" : "com.kafkastream.schema.avro",
"fields" : [ {
"name" : "alarm_id",
"type" : "string",
"doc" : "Unique identifier of the alarm."
}, {
"name" : "ne_id",
"type" : "string",
"doc" : "Unique identifier of the network element ID that produces the alarm."
}, {
"name" : "start_time",
"type" : "long",
"doc" : "is the timestamp when the alarm was generated."
}, {
"name" : "severity",
"type" : [ "null", "string" ],
"doc" : "The severity field is the default severity associated to the alarm ",
"default" : null
}]
}
Incident:
{
"type" : "record",
"name" : "Incident",
"namespace" : "com.kafkastream.schema.avro",
"fields" : [ {
"name" : "incident_id",
"type" : "string",
"doc" : "Unique identifier of the incident."
}, {
"name" : "incident_type",
"type" : [ "null", "string" ],
"doc" : "Categorization of the incident e.g. Network fault, network at risk, customer impact, etc",
"default" : null
}, {
"name" : "alarm_source_id",
"type" : "string",
"doc" : "Respective Alarm"
}, {
"name" : "start_time",
"type" : "long",
"doc" : "is the timestamp when the incident was generated on the node."
}, {
"name" : "ne_id",
"type" : "string",
"doc" : "ID of specific network element."
}]
}
Maintenance:
{
"type" : "record",
"name" : "Maintenance",
"namespace" : "com.kafkastream.schema.avro",
"fields" : [ {
"name" : "maintenance_id",
"type" : "string",
"doc" : "The message number is the unique ID for every maintenance"
}, {
"name" : "ne_id",
"type" : "string",
"doc" : "The NE ID is the network element ID on which the maintenance is done."
}, {
"name" : "start_time",
"type" : "long",
"doc" : "The timestamp when the maintenance start."
}, {
"name" : "end_time",
"type" : "long",
"doc" : "The timestamp when the maintenance start."
}]
}
I have 3 topics in my Kafka for each of these Avro schemas (let's say alarm_raw, incident_raw, and maintenance_raw), and whenever I write into these topics I use ne_id as the key (so the topics are partitioned by ne_id). Now I want to join these 3 topics, produce a new record, and write it into a new topic. The problem is that I want to join Alarm and Incident based on alarm_id and alarm_source_id, and join Alarm and Maintenance based on ne_id. I want to avoid creating a new topic and re-assigning a new key. Is there any way to specify the key while joining?
It depends on what kind of join you want to use (cf. https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Streams+Join+Semantics).
For a KStream-KStream join, there is currently (v0.10.2 and earlier) no other way than setting a new key (e.g., by using selectKey()) and doing a repartitioning.
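For illustration, here is a rough Scala sketch of that approach (it assumes SpecificRecord classes Alarm and Incident generated from the schemas above and default Avro serdes configured; the 0.10.2 API is used):

import java.util.concurrent.TimeUnit
import org.apache.kafka.streams.kstream.{JoinWindows, KStream, KStreamBuilder, KeyValueMapper, ValueJoiner}
import com.kafkastream.schema.avro.{Alarm, Incident}

val builder = new KStreamBuilder()
val alarms: KStream[String, Alarm] = builder.stream[String, Alarm]("alarm_raw")
val incidents: KStream[String, Incident] = builder.stream[String, Incident]("incident_raw")

// re-key both streams on the attribute you actually want to join on
val alarmsByAlarmId = alarms.selectKey(
  new KeyValueMapper[String, Alarm, String] {
    override def apply(neId: String, alarm: Alarm): String = alarm.getAlarmId.toString
  })
val incidentsByAlarmId = incidents.selectKey(
  new KeyValueMapper[String, Incident, String] {
    override def apply(neId: String, incident: Incident): String = incident.getAlarmSourceId.toString
  })

// the join repartitions the re-keyed streams on the new key via internal topics;
// serdes are left to the defaults here -- in practice use the join() overload that takes them
val joined: KStream[String, String] = alarmsByAlarmId.join(
  incidentsByAlarmId,
  new ValueJoiner[Alarm, Incident, String] {
    override def apply(a: Alarm, i: Incident): String = s"${a.getAlarmId},${i.getIncidentId}"
  },
  JoinWindows.of(TimeUnit.MINUTES.toMillis(5)))

joined.to("alarm_incident_joined")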
For a KStream-KTable join, Kafka 0.10.2 (to be released in the next weeks) contains a new feature called GlobalKTables (cf. https://cwiki.apache.org/confluence/display/KAFKA/KIP-99%3A+Add+Global+Tables+to+Kafka+Streams). This allows you to do a non-key join on the KTable (i.e., a KStream-GlobalKTable join), and thus you do not need to repartition the data in your GlobalKTable.
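For example (again only a sketch, reusing the builder, the alarms stream, and the generated Avro classes from the snippet above, and assuming the maintenance topic is keyed by ne_id):

import org.apache.kafka.streams.kstream.GlobalKTable
import com.kafkastream.schema.avro.Maintenance

// the whole maintenance topic is replicated to every application instance,
// so the table side needs no repartitioning
val maintenance: GlobalKTable[String, Maintenance] =
  builder.globalTable[String, Maintenance]("maintenance_raw", "maintenance-store")

// the KeyValueMapper derives the table lookup key (here ne_id) from each stream record
val alarmsWithMaintenance: KStream[String, String] = alarms.join(
  maintenance,
  new KeyValueMapper[String, Alarm, String] {
    override def apply(key: String, alarm: Alarm): String = alarm.getNeId.toString
  },
  new ValueJoiner[Alarm, Maintenance, String] {
    override def apply(a: Alarm, m: Maintenance): String = s"${a.getAlarmId},${m.getMaintenanceId}"
  })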
Note: a KStream-GlobalKTable join has different semantics than a KStream-KTable join. In contrast to the latter, it is not time synchronized, and thus the join is non-deterministic by design with regard to GlobalKTable updates; i.e., there is no guarantee which KStream record will be the first to "see" a GlobalKTable update and thus join with the updated GlobalKTable record.
There are plans to add a KTable-GlobalKTable join, too. This might become available in 0.10.3. There are no plans to add "global" KStream-KStream joins, though.
You can give all the streams the same key by modifying the existing keys.
You can use a KeyValueMapper, through which you can modify the key as well as the value.
You would use it as follows:
val modifiedStream = kStream.map[String, String](
  new KeyValueMapper[String, String, KeyValue[String, String]] {
    // replace the record key with the attribute you want to join on;
    // the value is passed through unchanged
    override def apply(key: String, value: String): KeyValue[String, String] =
      new KeyValue("modifiedKey", value)
  }
)
You can apply the above logic to multiple KStream objects so that they share a single key for joining the KStreams.
Related
I am trying to read messages from Kafka using a consumer with the following properties:
value.deserializer=io.confluent.kafka.serializers.KafkaAvroDeserializer
key.deserializer=org.apache.kafka.common.serialization.StringDeserializer
specific.avro.reader=true
And the schema is
{
"type" : "array",
"items" : {
"type" : "record",
"name" : "MyDto",
"namespace" : "test.dto",
"fields" : [ {
"default" : null,
"name" : "version",
"type" : ["null","string"]
}, {
"default" : null,
"name" : "testName",
"type" : ["null","string"]
}, {
"name" : "keys",
"type" : {"type": "array", "items": "string"},
"java-class" : "java.util.List"
}]
},
"java-class" : "java.util.List"
}
The object was written to Kafka successfully using this schema, but on deserialization I am getting the exception java.lang.NoSuchMethodException: java.util.List.<init>().
Is it possible to use the java.util.List class? I am using Confluent 3.1.2.
A List is an interface; it has no constructor or concrete implementation. Your producer is probably using an ArrayList, and so should the reader schema and the consumer.
Make sure the schema is defined with "java-class" : "java.util.ArrayList".
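For example, one way to apply that to the schema from the question (only the two java-class hints change, from java.util.List to java.util.ArrayList):

{
  "type" : "array",
  "items" : {
    "type" : "record",
    "name" : "MyDto",
    "namespace" : "test.dto",
    "fields" : [ {
      "default" : null,
      "name" : "version",
      "type" : ["null","string"]
    }, {
      "default" : null,
      "name" : "testName",
      "type" : ["null","string"]
    }, {
      "name" : "keys",
      "type" : {"type": "array", "items": "string"},
      "java-class" : "java.util.ArrayList"
    }]
  },
  "java-class" : "java.util.ArrayList"
}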
I am joining two data sets in BIRT. It's a left outer join; below is a screenshot of the data sets.
The reason why I need all the rows of the left table is that I am doing some calculations on the timestamp for all the rows of the left table. I need to count the priority levels (how many times each occurred) in the right table if the terminal ID matches the left table.
When I fetch the records I get duplicates, which causes my timestamp calculations to be doubled.
I can't do an inner join because I must do the timestamp calculation for every row of the left table.
The relation between the two tables is many-to-many. I will explain with an example what issue I am facing and what I want to achieve.
E.g. this is the data for the events of the DeviceEventObject data set:
record 1 :
"event" : "EXITED SUPERVISOR MODE",
"timestamp" : ISODate("2017-12-17T06:06:23.181Z"),
"terminal" : {
"terminalId" : "testterminal",
"branchId" : "test"
}
record 2:
"event" : "ENTERED SUPERVISOR MODE",
"timestamp" : ISODate("2017-12-17T06:06:23.181Z"),
"terminal" : {
"terminalId" : "testterminal",
"branchId" : "test"
}
From the timestamps of these events I am calculating the time between the entered and exited events.
Now the other data set is DeviceStatusErrorCodePrioirtyLevel.
E.g. these are the records in this data set:
"status" : "Online",
"errorCode" : "123",
"priorityLevel" : "test",
"emailTypeCode" : "123",
"terminal" : {
"terminalId" : "testterminal",
"branchId" : "test"
}
Now I want to calculate the number of times the priority level "test" occurred for the terminalId "testterminal". With the above data the count will be 1.
I am joining both data sets on the basis of terminalId.
Now with the above data I get duplicate records, which doubles the time I am calculating, and the count for the priority level also comes out as 2.
For example, this is what I get:
"event" : "EXITED SUPERVISOR MODE", "priorityLevel" : "test"
"event" : "ENTERED SUPERVISOR MODE", "priorityLevel" : "test"
What I want is:
"event" : "EXITED SUPERVISOR MODE", "priorityLevel" : "test"
"event" : "ENTERED SUPERVISOR MODE",
Additional info on the BIRT project:
Sample data from both data sets:
DeviceStatusErrorCodePrioirtyLevel:
{
"_id" : ObjectId("5a36095f1854ad0b7096184b"),
"className" : "com.omnia.pie.cm.models.snapshot.terminal.v2.DeviceStatusErrorCodePrioirtyLevel",
"timestamp" : ISODate("2017-12-17T06:06:23.181Z"),
"deviceName" : "CardReader",
"status" : "Online",
"errorCode" : "123",
"priorityLevel" : "test",
"emailTypeCode" : "123",
"terminal" : {
"terminalId" : "testterminal",
"branchId" : "test"
}
}
DeviceEventObject:
{
"_id" : ObjectId("5a3608c61854ad0b70961846"),
"className" : "com.omnia.pie.cm.models.snapshot.terminal.v2.DeviceEventObject",
"event" : "EXITED SUPERVISOR MODE",
"value" : "True",
"timestamp" : ISODate("2017-12-17T06:03:50.901Z"),
"transactionData" : {
"transactionType" : "",
"transactionNumber" : "",
"sessionId" : ""
},
"terminal" : {
"terminalId" : "testterminal",
"branchId" : "test"
}
}
Here is the link to my report in case it helps:
https://drive.google.com/file/d/1dHOEneG2-fbeP9Mz86LUhuk0tSxnLZxi/view?usp=sharing
Add a new data set for DeviceEventObject.
Add the following aggregate function in the command expression builder.
The pipeline below uses $lookup to pull in the data from the status error code priority level collection based on terminalId, followed by $unwind to flatten the data.
$group then groups the flattened data on terminalId to accumulate the distinct priority levels for a terminal ID.
$project counts the distinct priority levels.
[{$lookup:{
from: "devicestatuserrorcodeprioirtylevel", // name of the collection
localField: "terminal.terminalId",
foreignField: "terminal.terminalId",
as: "dsecpl"
}},
{$unwind:"$dsecpl"},
{$group:{
"_id":"$terminal.terminalId",
"prioritylevels":{"$addToSet":"$dsecpl.priorityLevel"},
"events":{"$push":"$event"}
}},
{"$project":{"prioritylevelcount":{"$size":"$prioritylevels"}, "events": 1} }
]
Move all the available fields to the selected fields column.
Preview results.
I have two MongoDB collections. The first includes frequency information for different IDs and is shown below in truncated form:
[
{
"_id" : "A1",
"value" : 19
},
{
"_id" : "A2",
"value" : 6
},
{
"_id" : "A3",
"value" : 12
},
{
"_id" : "A4",
"value" : 8
},
{
"_id" : "A5",
"value" : 4
},
...
]
The second collection is more complex and contains information for each _id listed in the first collection (it's called frequency_collection_id in the second collection), but frequency_collection_id may be inside two lists (info.details_one, and info.details_two) for each record:
[
{
"_id" : ObjectId("53cfc1d086763c43723abb07"),
"info" : {
"status" : "pass",
"details_one" : [
{
"frequency_collection_id" : "A1",
"name" : "A1_object_name",
"class" : "known"
},
{
"frequency_collection_id" : "A2",
"name" : "A2_object_name",
"class" : "unknown"
}
],
"details_two" : [
{
"frequency_collection_id" : "A1",
"name" : "A1_object_name",
"class" : "known"
},
{
"frequency_collection_id" : "A2",
"name" : "A2_object_name",
"class" : "unknown"
}
],
}
}
...
]
What I'm looking to do, is merge the frequency information (from the first collection) into the second collection, in effect creating a collection that looks like:
[
{
"_id" : ObjectId("53cfc1d086763c43723abb07"),
"info" : {
"status" : "pass",
"details_one" : [
{
"frequency_collection_id" : "A1",
"name" : "A1_object_name",
"class" : "known",
**"value" : 19**
},
{
"frequency_collection_id" : "A2",
"name" : "A2_object_name",
"class" : "unknown",
**"value" : 6**
}
],
"details_two" : [
{
"frequency_collection_id" : "A1",
"name" : "A1_object_name",
"class" : "known",
**"value" : 19**
},
{
"frequency_collection_id" : "A2",
"name" : "A2_object_name",
"class" : "unknown",
**"value" : 6**
}
],
}
}
...
]
I know that this should be possible with MongoDB's MapReduce functions, but all the examples I've seen are either too minimal for my collection structure, or are answering different questions than I'm looking for.
Does anyone have any pointers? How can I merge my frequency information (from my first collection) into the records (inside my two lists in each record of the second collection)?
I know this is more or less a JOIN, which MongoDB does not support, but from my reading, it looks like this is a prime example of MapReduce.
I'm learning Mongo as best I can, so please forgive me if my question is too naive.
Just like all MongoDB operations, a MapReduce always operates on a single collection only and cannot obtain info from another one. So your first step needs to be to dump both collections into one. Your documents have different _ids, so it should not be a problem for them to coexist in the same collection.
Then you do a MapReduce where the map function emits both kinds of documents for their common key, which is their frequency ID.
Your reduce function will then receive an array of two documents for each key: the two documents emitted for that frequency ID. You then just have to merge these two documents into one. Keep in mind that the reduce function can receive the two documents in any order. It can also happen that it gets called with a partial result (only one of the two documents) or with an already-merged result. You need to handle these cases gracefully! A good implementation could be to create a new object and then iterate over the input documents, copying all relevant fields and their values to the new object, so the resulting object is an amalgamation of the input documents.
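A rough mongo-shell sketch of that approach (the combined collection name combined and the output name frequency_merged are assumptions; both source collections are presumed to have been copied into combined beforehand):

// map: emit every document under its frequency ID
var mapFn = function () {
  if (this.value !== undefined) {
    // document that came from the frequency collection, e.g. { _id: "A1", value: 19 }
    emit(this._id, { value: this.value, entries: [] });
  } else {
    // document that came from the detail collection: one emit per embedded entry
    var doc = this;
    ["details_one", "details_two"].forEach(function (listName) {
      (doc.info[listName] || []).forEach(function (entry) {
        emit(entry.frequency_collection_id, { value: null, entries: [entry] });
      });
    });
  }
};

// reduce: amalgamate all partial results for one frequency ID into a single object;
// it is safe for re-reduce because its output has the same shape as the mapped values
var reduceFn = function (key, values) {
  var merged = { value: null, entries: [] };
  values.forEach(function (v) {
    if (v.value !== null && v.value !== undefined) merged.value = v.value;
    merged.entries = merged.entries.concat(v.entries || []);
  });
  return merged;
};

db.combined.mapReduce(mapFn, reduceFn, { out: "frequency_merged" });

The output is keyed by frequency ID, with the matching value and all detail entries collected together; reshaping that back into the nested structure of the second collection would be a separate post-processing step.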
I have two collections as below; products holds a reference to a user. I search for a product by name, and in return I want the combined output of product and user using the map-reduce method.
user collection
{
"_id" : ObjectId("52ac5dd1fb670c2007000000"),
"company" : {
"about" : "This is textile machinery dealer",
"contactAddress" : [{
"address" : "abcd",
"city" : "52ac4bc6fb670c1007000000",
"zipcode" : "39as46as80"
},{
"address" : "abcd",
"city" : "52ac4bc6fb670c1007000000",
"zipcode" : "39as46as80"
}],
"fax" : "58784868",
"mainProducts" : "ads,asd,asd",
"mobileNumber" : "9537236588",
"name" : "krishna steels",
}
"user" : ObjectId("52ac4eb7fb670c0c07000000")
}
product collection
{
"_id" : ObjectId("52ac5722fb670cf806000002"),
"category" : "52a2a9cc48a508b80e00001d",
"deliveryTime" : "10 days after received the ",
"price" : {
"minPrice" : "2000",
"maxPrice" : "3000",
"perUnit" : "5288ac6f7c104203e0976851",
"currency" : "INR"
},
"productName" : "New Mobile Solar Charger with Carabiner",
"rejectReason" : "",
"status" : 1,
"user" : ObjectId("52ac4eb7fb670c0c07000000")
}
This cannot be done directly: Mongo supports map-reduce only on one collection. You could try to fetch and merge in a Java collection; a couple of days back I solved a similar problem using a Java collection.
See this similar response about joins and multiple collections not being supported in Mongo.
This can be done using two map reduces.
You run your first MR and then you reduce out the second MR onto the results of the first.
You shouldn't do this, though. JOINs are not designed to be done through MR; in fact, it sounds like you are trying to do this MR with inline output, which in itself is a very bad idea.
MRs are not designed to run inline to the application.
You would be better off doing the JOIN elsewhere.
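For example, a minimal client-side version in the mongo shell (the collection names product and user are assumptions):

// look up the product by name, then fetch the owning user document
// through the shared "user" reference and combine both in the client
var product = db.product.findOne({ productName: "New Mobile Solar Charger with Carabiner" });
var owner = db.user.findOne({ user: product.user });
printjson({ product: product, user: owner });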
Suppose I have 'film' objects like the one below in my collection. Films have many actors, and actors belong to many films. Many-to-many.
How do I create another collection that consists of the unique 'actor' elements? Remember, some actors will be listed in more than one film.
{
"_id" : ObjectId("4edcffa5f320646bc8bd76b4"),
"directed_by" : [
"John Gilling"
],
"forbid:genre" : null,
"genre" : [ ],
"guid" : "#9202a8c04000641f8000000000b02e5d",
"id" : "/en/pirates_of_blood_river",
"initial_release_date" : "1962",
"name" : "Pirates of Blood River",
"starring" : [
{
"actor" : {
"guid" : "#9202a8c04000641f800000000006823e",
"id" : "/en/christopher_lee",
"name" : "Christopher Lee",
"lc_name" : "christopher lee"
}
},
{
"actor" : {
"guid" : "#9202a8c04000641f80000000001de22e",
"id" : "/en/oliver_reed",
"name" : "Oliver Reed",
"lc_name" : "oliver reed"
}
},
{
"actor" : {
"guid" : "#9202a8c04000641f80000000003b41da",
"id" : "/en/glenn_corbett",
"name" : "Glenn Corbett",
"lc_name" : "glenn corbett"
}
}
]
}
This can be done in the client app, but also via aggregation in MongoDB - for example, you can run a map-reduce job with an output collection. In the map step, given a film document, you emit key-value pairs of (actor.guid, { [other actor details] }), and in the reduce step you just return a single set of details (you could also count the number of films the actor was in at this point, if you wanted).
http://www.mongodb.org/display/DOCS/MapReduce has more info on the syntax.
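For example, a rough mongo-shell sketch of such a job (the collection names films and actors are assumptions):

// map: one emit per actor appearance, keyed by the actor's guid
var mapFn = function () {
  this.starring.forEach(function (s) {
    emit(s.actor.guid, {
      id: s.actor.id,
      name: s.actor.name,
      lc_name: s.actor.lc_name,
      films: 1
    });
  });
};

// reduce: keep one copy of the actor details and add up the film counts
var reduceFn = function (guid, values) {
  var result = { id: values[0].id, name: values[0].name, lc_name: values[0].lc_name, films: 0 };
  values.forEach(function (v) { result.films += v.films; });
  return result;
};

db.films.mapReduce(mapFn, reduceFn, { out: "actors" });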