Getting duplicate rows on left join in BIRT reports - Eclipse

I am joining two data sets in BIRT using a left outer join. Below is a screenshot of the data sets.
The reason I need all the rows of the left table is that I am doing calculations on the timestamp for every row of the left table. I also need to count how many times each priority level occurred in the right table when its terminalId matches the left table.
When I fetch the records I get duplicates, which causes my timestamp calculations to be doubled.
I can't use an inner join because I must do the timestamp calculation for every row of the left table.
The relation between the two tables is many-to-many. I will explain with an example what issue I am facing and what I want to achieve.
E.g. this is the data for the events in the DeviceEventObject data set:
Record 1:
"event" : "EXITED SUPERVISOR MODE",
"timestamp" : ISODate("2017-12-17T06:06:23.181Z"),
"terminal" : {
"terminalId" : "testterminal",
"branchId" : "test"
}
Record 2:
"event" : "ENTERED SUPERVISOR MODE",
"timestamp" : ISODate("2017-12-17T06:06:23.181Z"),
"terminal" : {
"terminalId" : "testterminal",
"branchId" : "test"
}
From the timestamps of these events I am calculating the time between the ENTERED and EXITED events.
Now the other data set is DeviceStatusErrorCodePrioirtyLevel:
E.g. this is a record in this data set:
"status" : "Online",
"errorCode" : "123",
"priorityLevel" : "test",
"emailTypeCode" : "123",
"terminal" : {
"terminalId" : "testterminal",
"branchId" : "test"
}
Now I want to calculate the number of times the priority level "test" occurred for the terminalId "testterminal". With the above data the count will be 1.
I am joining both data sets on terminalId.
With the above data I get duplicate records, which doubles the time I am calculating, and the count for the priority level comes out as 2 instead of 1.
For example, this is what I get:
"event" : "EXITED SUPERVISOR MODE", "priorityLevel" : "test"
"event" : "ENTERED SUPERVISOR MODE", "priorityLevel" : "test"
What I want is:
"event" : "EXITED SUPERVISOR MODE", "priorityLevel" : "test"
"event" : "ENTERED SUPERVISOR MODE",
Additional info about the BIRT project:
Sample data from both data sets:
DeviceStatusErrorCodePrioirtyLevel:
{
"_id" : ObjectId("5a36095f1854ad0b7096184b"),
"className" : "com.omnia.pie.cm.models.snapshot.terminal.v2.DeviceStatusErrorCodePrioirtyLevel",
"timestamp" : ISODate("2017-12-17T06:06:23.181Z"),
"deviceName" : "CardReader",
"status" : "Online",
"errorCode" : "123",
"priorityLevel" : "test",
"emailTypeCode" : "123",
"terminal" : {
"terminalId" : "testterminal",
"branchId" : "test"
}
}
DeviceEventObject:
{
"_id" : ObjectId("5a3608c61854ad0b70961846"),
"className" : "com.omnia.pie.cm.models.snapshot.terminal.v2.DeviceEventObject",
"event" : "EXITED SUPERVISOR MODE",
"value" : "True",
"timestamp" : ISODate("2017-12-17T06:03:50.901Z"),
"transactionData" : {
"transactionType" : "",
"transactionNumber" : "",
"sessionId" : ""
},
"terminal" : {
"terminalId" : "testterminal",
"branchId" : "test"
}
}
Here is the link to my report, in case it helps:
https://drive.google.com/file/d/1dHOEneG2-fbeP9Mz86LUhuk0tSxnLZxi/view?usp=sharing

Add a new data set for DeviceEventObject
Add the following aggregation pipeline in the command expression builder.
The pipeline below uses $lookup to pull the data from the status error code priority level collection based on terminalId, followed by $unwind to flatten the joined data.
$group the flattened data on terminalId to accumulate the distinct priority levels for each terminal ID.
$project to count the distinct priority levels.
[
    { "$lookup": {
        "from": "devicestatuserrorcodeprioirtylevel", // name of the collection
        "localField": "terminal.terminalId",
        "foreignField": "terminal.terminalId",
        "as": "dsecpl"
    }},
    { "$unwind": "$dsecpl" },
    { "$group": {
        "_id": "$terminal.terminalId",
        "prioritylevels": { "$addToSet": "$dsecpl.priorityLevel" },
        "events": { "$push": "$event" }
    }},
    { "$project": { "prioritylevelcount": { "$size": "$prioritylevels" }, "events": 1 } }
]
Move all the available fields to the selected fields column.
Preview results.
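To sanity-check the pipeline outside BIRT, the same stages can be run directly in the mongo shell. This is only a verification sketch; the collection name deviceeventobject is an assumption and should match whatever collection backs the DeviceEventObject data set.
db.deviceeventobject.aggregate([
    { "$lookup": { "from": "devicestatuserrorcodeprioirtylevel", "localField": "terminal.terminalId",
                   "foreignField": "terminal.terminalId", "as": "dsecpl" } },
    { "$unwind": "$dsecpl" },
    { "$group": { "_id": "$terminal.terminalId",
                  "prioritylevels": { "$addToSet": "$dsecpl.priorityLevel" },
                  "events": { "$push": "$event" } } },
    { "$project": { "prioritylevelcount": { "$size": "$prioritylevels" }, "events": 1 } }
])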

Related

How to calculate time difference based on 2 fields in MongoDB

I am familiar with simple MongoDB queries, but this one is a bit complex for me. What I am trying to achieve is: on the basis of the jsonObject.callID and jsonObject.mobile fields, I have to calculate the time difference of jsonObject.timestamp. For example, in the sample documents below, jsonObject.callID and jsonObject.mobile stay the same for the jsonObject.action values start and end. So, based on jsonObject.callID and jsonObject.mobile, I have to subtract the jsonObject.timestamp values. jsonObject.callID will be the same for the two interval actions, i.e. start and end, with the same jsonObject.mobile number.
{
"_id" : ObjectId("5df9bc5ee5e7251030535df5"),
"_class" : "com.abc.mongo.docs.IvrMongoLog",
"jsonObject" : {
"mode" : "ivr",
"callID" : "33333",
"callee" : "128",
"action" : "end",
"mobile" : "218924535466",
"timestamp" : "2019-12-18 16:18:12"
}
}
{
"_id" : ObjectId("5df9bc3de5e7251030535df4"),
"_class" : "com.abc.mongo.docs.IvrMongoLog",
"jsonObject" : {
"mode" : "ivr",
"callID" : "33333",
"callee" : "128",
"action" : "start",
"mobile" : "218924535466",
"timestamp" : "2019-12-18 16:12:11"
}
}
So I am trying to achieve an output like below:
{
"callee" : "128",
"mobile" : "218924535466",
"callID" : "33333",
"minutes_of_call" : "6" // difference of "2019-12-18 16:18:12" - "2019-12-18 16:12:11"
}
Subsequently I need such results for the next documents...
Kindly assist.
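No answer is attached here, but a minimal sketch of one possible approach in the mongo shell (assumptions: MongoDB 4.0+ for the format option of $dateFromString, and a collection name of ivr_logs): group by callID and mobile, take the earliest and latest timestamps of the pair, and convert the difference to minutes.
db.ivr_logs.aggregate([
    { "$group": {
        // one group per call: the same callID and mobile cover the start/end pair
        "_id": { "callID": "$jsonObject.callID", "mobile": "$jsonObject.mobile" },
        "callee": { "$first": "$jsonObject.callee" },
        "start": { "$min": { "$dateFromString": { "dateString": "$jsonObject.timestamp", "format": "%Y-%m-%d %H:%M:%S" } } },
        "end":   { "$max": { "$dateFromString": { "dateString": "$jsonObject.timestamp", "format": "%Y-%m-%d %H:%M:%S" } } }
    }},
    { "$project": {
        "_id": 0,
        "callID": "$_id.callID",
        "mobile": "$_id.mobile",
        "callee": 1,
        // subtracting two dates yields milliseconds; divide by 60000 for whole minutes
        "minutes_of_call": { "$floor": { "$divide": [ { "$subtract": [ "$end", "$start" ] }, 60000 ] } }
    }}
])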

How to collect specific samples from MongoDB collections?

I have a MongoDB collection "Events" with 1 million documents similar to:
{
"_id" : 32423,
"testid" : 43212,
"description" : "fdskfhsdj kfsdjfhskdjf hksdjfhsd kjfs",
"status" : "error",
"datetime" : ISODate("2018-12-04T15:55:00.000Z"),
"failure" : 0,
}
Considering the documents are sorted by the datetime field (ascending), I want to check them in chronological order one by one and pick only the records where the "failure" field was 0 in the previous document and is 1 in the current document. I want to skip the other records in between.
For example, if I also have the following records:
{
"_id" : 32424,
....
"datetime" : ISODate("2018-12-04T16:55:00.000Z"),
"failure" : 0,
}
,
{
"_id" : 32425,
....
"datetime" : ISODate("2018-12-04T17:55:00.000Z"),
"failure" : 1,
}
,
{
"_id" : 32426,
....
"datetime" : ISODate("2018-12-04T18:55:00.000Z"),
"failure" : 0,
}
I only want to collect the one with "_id" : 32425, and repeat the same policy for the following cases.
Of course, if I extract all the data at once, then I can process it using Python for instance. But, extracting all the records would be really time-consuming (1 million documents!).
Is there a way to do the above via MongoDB commands?
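One hedged sketch of a server-side approach (it needs MongoDB 5.0+ for $setWindowFields; the collection name Events is taken from the question): sort by datetime, carry the previous document's failure value forward with $shift, and keep only the 0-to-1 transitions.
db.Events.aggregate([
    { "$setWindowFields": {
        "sortBy": { "datetime": 1 },
        "output": {
            // failure value of the immediately preceding document in datetime order
            "prevFailure": { "$shift": { "output": "$failure", "by": -1, "default": null } }
        }
    }},
    { "$match": { "prevFailure": 0, "failure": 1 } }
])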

Is a MongoDB query with 1 indexed field faster than multiple indexed fields?

In the following model a product is owned by a customer and cannot be ordered by other customers. So I know that an order by customer 1 can only contain products owned by customer 1.
To give you an idea here is a simple version of the data model:
Orders:
{
'customer' : 1
'products' : [
{'productId' : 'a'},
{'productId' : 'b'}
]
}
Products:
{
'id' : 'a'
'name' : 'somename'
'customer' : 1
}
I need to find orders that contain certain products. I know the product id and customer id. I'm free to add/change indexes on my database.
Now my question is: is it faster to just add a single-field index on the product IDs and query using only that ID, or should I go for a compound index with customer and product ID?
I'm not sure if this matters, but in my real model the list of products is actually a list of objects which have an amount and a DBRef to the product. The customer is also a DBRef.
Here is a full order object:
{
"_id" : 0,
"_class" : "nl.pfa.myprintforce.models.Order",
"orderNumber" : "e35f1fa8-b4c4-4d53-89c9-66abe94a3553",
"status" : "ERROR",
"created" : ISODate("2017-03-30T11:50:50.292Z"),
"finished" : false,
"orderTime" : ISODate("2017-01-12T12:50:50.292Z"),
"expectedDelivery" : ISODate("2017-03-30T11:50:50.292Z"),
"totalItems" : 19,
"orderItems" : [
{
"amount" : 4,
"product" : {
"$ref" : "product",
"$id" : NumberLong(16)
}
},
{
"amount" : 7,
"product" : {
"$ref" : "product",
"$id" : NumberLong(26)
}
},
{
"amount" : 8,
"product" : {
"$ref" : "product",
"$id" : NumberLong(7)
}
}
],
"stateList" : [
{
"timestamp" : ISODate("2017-03-28T11:50:50.074Z"),
"status" : "NEW",
"message" : ""
},
{
"timestamp" : ISODate("2017-03-29T11:50:50.075Z"),
"status" : "IN_PRODUCTION",
"message" : ""
},
{
"timestamp" : ISODate("2017-03-30T11:50:50.075Z"),
"status" : "ERROR",
"message" : "Something went wrong"
}
],
"customer" : {
"$ref" : "customer",
"$id" : ObjectId("58dcf11a71571a24c475c044")
}
}
When I have the following indexes:
1: {"customer" : 1, "orderItems.product" : 1}
2: {"orderItems.product" : 1}
Both count queries (I use count to force the server to find all matching documents without transferring them over the network):
a: db.getCollection('order').find({
'orderItems.product' : DBRef('product',113)
}).count()
b: db.getCollection('order').find({
'customer' : DBRef('customer',ObjectId("58de009671571a07540a51d5")),
'orderItems.product' : DBRef('product',113)
}).count()
Run in about the same time, ~0.007 seconds, on a set of 200k documents.
When I add 1000k records for a different customer (and different products) it does not affect the time at all.
An extended explain shows that:
Query a just uses index 2.
Query b uses index 2 but also considered index 1. Perhaps index intersection is used here?
Because if I drop index 1 the results are:
Query a: 0.007 seconds
Query b: 0.035 seconds (5x as long!)
So my conclusion is that with the right indexing both methods work about as fast. However, if you do not need the compound index for anything else it's just a waste of space & write speed.
So: single field index is better in my case.
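For completeness, a sketch of how the plan selection described above can be inspected: explain("executionStats") reports the winning plan and any rejected candidate plans for the same predicate.
db.getCollection('order').find({
    'customer' : DBRef('customer', ObjectId("58de009671571a07540a51d5")),
    'orderItems.product' : DBRef('product', 113)
}).explain("executionStats")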

Kafka Streams join with a specific key as input

I have 3 different topics with 3 Avro schemas in the schema registry. I want to stream these topics, join them together, and write the result into one topic. The problem is that the key I want to join on is different from the key I use when writing the data to each topic.
Let's say we have these 3 Avro files:
Alarm:
{
"type" : "record",
"name" : "Alarm",
"namespace" : "com.kafkastream.schema.avro",
"fields" : [ {
"name" : "alarm_id",
"type" : "string",
"doc" : "Unique identifier of the alarm."
}, {
"name" : "ne_id",
"type" : "string",
"doc" : "Unique identifier of the network element ID that produces the alarm."
}, {
"name" : "start_time",
"type" : "long",
"doc" : "is the timestamp when the alarm was generated."
}, {
"name" : "severity",
"type" : [ "null", "string" ],
"doc" : "The severity field is the default severity associated to the alarm ",
"default" : null
}]
}
Incident:
{
"type" : "record",
"name" : "Incident",
"namespace" : "com.kafkastream.schema.avro",
"fields" : [ {
"name" : "incident_id",
"type" : "string",
"doc" : "Unique identifier of the incident."
}, {
"name" : "incident_type",
"type" : [ "null", "string" ],
"doc" : "Categorization of the incident e.g. Network fault, network at risk, customer impact, etc",
"default" : null
}, {
"name" : "alarm_source_id",
"type" : "string",
"doc" : "Respective Alarm"
}, {
"name" : "start_time",
"type" : "long",
"doc" : "is the timestamp when the incident was generated on the node."
}, {
"name" : "ne_id",
"type" : "string",
"doc" : "ID of specific network element."
}]
}
Maintenance:
{
"type" : "record",
"name" : "Maintenance",
"namespace" : "com.kafkastream.schema.avro",
"fields" : [ {
"name" : "maintenance_id",
"type" : "string",
"doc" : "The message number is the unique ID for every maintenance"
}, {
"name" : "ne_id",
"type" : "string",
"doc" : "The NE ID is the network element ID on which the maintenance is done."
}, {
"name" : "start_time",
"type" : "long",
"doc" : "The timestamp when the maintenance start."
}, {
"name" : "end_time",
"type" : "long",
"doc" : "The timestamp when the maintenance start."
}]
}
I have a topic in Kafka for each of these Avro schemas (let's say alarm_raw, incident_raw, maintenance_raw), and whenever I write into these topics I use ne_id as the key (so the topics are partitioned by ne_id). Now I want to join these 3 topics, produce a new record, and write it into a new topic. The problem is that I want to join Alarm and Incident based on alarm_id and alarm_source_id, and join Alarm and Maintenance based on ne_id. I want to avoid creating a new topic and re-assigning a new key. Is there any way to specify the key while joining?
It depends on what kind of join you want to use (cf. https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Streams+Join+Semantics).
For a KStream-KStream join, there is currently (v0.10.2 and earlier) no other way than setting a new key (e.g., by using selectKey()) and doing a repartitioning.
For a KStream-KTable join, Kafka 0.10.2 (to be released in the next weeks) contains a new feature called GlobalKTables (cf. https://cwiki.apache.org/confluence/display/KAFKA/KIP-99%3A+Add+Global+Tables+to+Kafka+Streams). This allows you to do a non-key join against the table (i.e., a KStream-GlobalKTable join), so you do not need to repartition the data in your GlobalKTable.
Note: a KStream-GlobalKTable join has different semantics than a KStream-KTable join. It is not time-synchronized, in contrast to the latter, and thus the join is non-deterministic by design with regard to GlobalKTable updates; i.e., there is no guarantee which KStream record will be the first to "see" a GlobalKTable update and thus join with the updated GlobalKTable record.
There are plans to add a KTable-GlobalKTable join, too. This might become available in 0.10.3. There are no plans to add "global" KStream-KStream joins though.
You can give the streams a common key by modifying the key.
You can use a KeyValueMapper, through which you can modify the key as well as the value.
You should use it as follows:
import org.apache.kafka.streams.KeyValue
import org.apache.kafka.streams.kstream.KeyValueMapper

// Re-key each record to the common key ("modifiedKey" here) while keeping the value unchanged.
val modifiedStream = kStream.map[String, String](
  new KeyValueMapper[String, String, KeyValue[String, String]] {
    override def apply(key: String, value: String): KeyValue[String, String] =
      new KeyValue("modifiedKey", value)
  }
)
You can apply the above logic to multiple KStream objects to maintain a single key for joining the KStreams.

Want to merge two collections in MongoDB using map-reduce

I have two collections as below; products have a reference to the user. I search for a product by name, and in return I want the combined output of product and user using the map-reduce method.
user collection
{
"_id" : ObjectId("52ac5dd1fb670c2007000000"),
"company" : {
"about" : "This is textile machinery dealer",
"contactAddress" : [{
"address" : "abcd",
"city" : "52ac4bc6fb670c1007000000",
"zipcode" : "39as46as80"
},{
"address" : "abcd",
"city" : "52ac4bc6fb670c1007000000",
"zipcode" : "39as46as80"
}],
"fax" : "58784868",
"mainProducts" : "ads,asd,asd",
"mobileNumber" : "9537236588",
"name" : "krishna steels",
}
"user" : ObjectId("52ac4eb7fb670c0c07000000")
}
product collection
{
"_id" : ObjectId("52ac5722fb670cf806000002"),
"category" : "52a2a9cc48a508b80e00001d",
"deliveryTime" : "10 days after received the ",
"price" : {
"minPrice" : "2000",
"maxPrice" : "3000",
"perUnit" : "5288ac6f7c104203e0976851",
"currency" : "INR"
},
"productName" : "New Mobile Solar Charger with Carabiner",
"rejectReason" : "",
"status" : 1,
"user" : ObjectId("52ac4eb7fb670c0c07000000")
}
This cannot be done. Mongo supports map-reduce only on one collection. You could try to fetch and merge in a Java collection; a couple of days back I solved a similar problem using a Java collection.
Click to see a similar response about joins and multi-collection queries not being supported in Mongo.
This can be done using two map-reduces.
You run your first MR and then reduce the second MR onto the results of the first (a rough sketch follows).
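For illustration only, a rough sketch of that two-pass approach in the mongo shell. The collection names user and product, the output collection name user_products, and the name-matching query are assumptions based on the sample documents above.
// Shared reduce: merge product names and the company sub-document per user key.
var reduceFn = function (key, values) {
    var out = { products: [], company: null };
    values.forEach(function (v) {
        out.products = out.products.concat(v.products || []);
        if (v.company) { out.company = v.company; }
    });
    return out;
};
// Pass 1: products matched by name, keyed by their user reference.
db.product.mapReduce(
    function () { emit(this.user, { products: [this.productName], company: null }); },
    reduceFn,
    { query: { productName: /Solar/ }, out: { reduce: "user_products" } }
);
// Pass 2: users keyed by _id, reduced onto the results of pass 1.
db.user.mapReduce(
    function () { emit(this._id, { products: [], company: this.company }); },
    reduceFn,
    { out: { reduce: "user_products" } }
);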
You shouldn't do this, though. JOINs are not designed to be done through MR; in fact, it sounds like you are trying to do this MR with inline output, which in itself is a very bad idea.
MRs are not designed to run inline to the application.
You would be better off doing the JOIN elsewhere.