How to generate a serial number in Apache Beam?

While processing my input, I want to add a new field to the output JSON whose value is auto-incremented.
Example input list:
{"name": "Amar", "age": 10}
{"name": "Akbar", "age": 20}
{"name": "Anthony", "age": 30}
Expected output after adding the serial number:
{"No": 1, "name": "Amar", "age": 10}
{"No": 2, "name": "Akbar", "age": 20}
{"No": 3, "name": "Anthony", "age": 30}

Beam processes elements in parallel and does not guarantee the ordering of elements.
However, if you still want to assign a counter, you can use state in Apache Beam to maintain one. Reference: https://beam.apache.org/blog/2017/02/13/stateful-processing.html
The scope of a state is a key + window, so this works fine for assigning independent counters to different sets of keys.
However, if you have a small number of keys and windows, this can limit the parallelism of your pipeline.
Also, such a counter has limited use in distributed data processing; it would be great if you could describe your use case a bit more.
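For illustration, a minimal sketch with the Beam Python SDK's per-key state (the transform and key names are just placeholders); keying every record to a single constant key yields one global sequence, but it also serializes that step:

import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.transforms.userstate import ReadModifyWriteStateSpec

class AssignSerialNo(beam.DoFn):
    # Per-key counter held in Beam state; scoped to key + window.
    COUNTER = ReadModifyWriteStateSpec('counter', VarIntCoder())

    def process(self, element, counter=beam.DoFn.StateParam(COUNTER)):
        key, record = element  # stateful DoFns require keyed input
        current = (counter.read() or 0) + 1
        counter.write(current)
        yield {'No': current, **record}

# Usage sketch: a constant key gives one global sequence (no parallelism here).
# numbered = (records
#             | beam.Map(lambda r: ('all', r))
#             | beam.ParDo(AssignSerialNo()))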

Related

Fan out data from Kafka to S3 using Kafka Connect

My Kafka receives a wrapper payload in JSON format. The wrapper payload looks like this:
{
  "format": "wrapper",
  "time": 1626814608000,
  "events": [
    {
      "id": "item1",
      "type": "product1",
      "count": 200
    },
    {
      "id": "item2",
      "type": "product2",
      "count": 300
    }
  ],
  "metadata": {
    "schema": "schema-1"
  }
}
I should export this to S3. But the catch here is, I should not store the wrapper. Instead, I should be storing the individual events based on the item.
For example, it should be stored in S3 as follows:
bucket/product1:
{"id": "item1", "type": "product1", "count": 200}
bucket/product2:
{"id": "item2", "type": "product2", "count": 300}
If you notice, the input is the wrapper with those events internally. However, my output should be each of those individual events stored in S3 in the same bucket with the product type as prefix.
My question is, is it possible to use Kafka Connect to do this? I see it has Single Message Transforms, which seem to be a way to mutate data inside the object, but not to fan out the way I want. Even the signature looks like R => R:
https://github.com/apache/kafka/blob/trunk/connect/api/src/main/java/org/apache/kafka/connect/transforms/Transformation.java
So based on my research, it does not seem possible. But I want to check if I am missing something before using a different option.
Transforms accept one event and output one event.
You need to use a stream processor, such as Kafka Streams' branch or flatMap functions, to split an array of events into multiple events or multiple topics.
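As an illustration of the fan-out shape only (a plain Python consumer/producer with confluent_kafka rather than Kafka Streams; topic and broker names are placeholders), each inner event is republished to a per-type topic, which an S3 sink connector could then write under the matching prefix:

import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'wrapper-fanout',
    'auto.offset.reset': 'earliest',
})
producer = Producer({'bootstrap.servers': 'localhost:9092'})
consumer.subscribe(['wrapper-topic'])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    wrapper = json.loads(msg.value())
    for event in wrapper.get('events', []):
        # One output record per inner event, routed by its "type".
        producer.produce('events-' + event['type'], value=json.dumps(event))
    producer.flush()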

Calculate differences between consecutive Kafka messages in one Topic

I have some temperature sensors that generate Kafka messages to a Kafka topic (my-sensors-topic). The messages generally look like the ones below.
{"Offset": 7, "Id": 1, "Time": 1643718777898, "Value": 21}
{"Offset": 6, "Id": 1, "Time": 1643718768592, "Value": 20}
{"Offset": 5, "Id": 2, "Time": 1643718755443, "Value": 21}
{"Offset": 4, "Id": 3, "Time": 1643718746678, "Value": 21}
{"Offset": 3, "Id": 4, "Time": 1643718733408, "Value": 22}
{"Offset": 2, "Id": 2, "Time": 1643718709450, "Value": 20}
{"Offset": 1, "Id": 3, "Time": 1643718667375, "Value": 22}
{"Offset": 0, "Id": 1, "Time": 1643718386944, "Value": 19}
What I want to do is, for a newly generated message:
{"Offset": 8, "Id": 2, "Time": 1643719318393, "Value": 21}
First, find the last existing message that has the same Id. In this case:
{"Offset": 5, "Id": 2, "Time": 1643718755443, "Value": 21}
because it is the last existing message with Id 2.
Second, I want to compute the "Time" difference (in milliseconds) between these two messages.
If the difference is greater than 60000, it counts as an error for this sensor, and I need to create a message to record the error and write it to another Kafka topic (my-sensors-error-topic).
The created message might look like:
{"Id": 2, "Time_lead": 1643719318393, "Time_lag": 1643718755443, "Latency": 562950}
// Latency is calculated as (Time_lead - Time_lag)
So later I can select the count from my-sensors-error-topic by (sensor) Id and know how many errors occurred for that sensor.
From my own investigation, to cover this scenario I need to use the Kafka Processor API with a state store. Some examples mention implementing the Processor interface, while others mention using a Transformer.
Which way is better for implementing my scenario, and how?
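For reference, the per-sensor check described above boils down to logic like this plain-Python sketch, where an in-memory dict stands in for the state store (illustrative only, not the Processor API itself):

last_seen = {}  # sensor Id -> Time of the latest message seen for that Id

def check(message, threshold_ms=60000):
    sensor_id, time_lead = message["Id"], message["Time"]
    time_lag = last_seen.get(sensor_id)
    last_seen[sensor_id] = time_lead
    if time_lag is not None and time_lead - time_lag > threshold_ms:
        # This record would be written to my-sensors-error-topic.
        return {"Id": sensor_id, "Time_lead": time_lead,
                "Time_lag": time_lag, "Latency": time_lead - time_lag}
    return None

# Replaying the sample messages in offset order and then the new message with
# Offset 8 returns {"Id": 2, "Time_lead": 1643719318393,
# "Time_lag": 1643718755443, "Latency": 562950}.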

RESTful URI structure to get an array of resources of the same type with only limited fields in the response?

We have the following API to get the list of all versions of cars in our database. A particular version can have multiple colour options available.
GET /api/versions/
[
  {
    "id": 1,
    "colors": [{"name": "red", "hex": "#ff0000"}, {"name": "blue", "hex": "#0000ff"}],  // array of colors of that version
    "price": 10000
  },
  {
    "id": 2,
    "colors": [{"name": "red", "hex": "#ff0000"}, {"name": "blue", "hex": "#0000ff"}],  // array of colors of that version
    "price": 20000
  },
  ...
]
The client wants an API to get data for NOT ALL but only some of the versions, and only the colour field. What should the URI for such a requirement be? I have thought of something like the below, but I am not sure:
To get the colours of version ids 8 and 9:
GET /api/versions/?fields=colors&id=8,9
[
  {
    "id": 8,
    "colors": [{"name": "tui", "hex": "#gg0000"}, {"name": "rie", "hex": "#or0000"}]  // array of colors of that version
  },
  {
    "id": 9,
    "colors": [{"name": "rie", "hex": "#or0000"}, {"name": "tui", "hex": "#gg0000"}]  // array of colors of that version
  }
]
Please note: I have oversimplified things here. The versions response is quite complex and contains many more fields than the id, colours, and price mentioned above. Also, we will get more requirements like the current one for colour, e.g. getting the price of multiple versions.
The query parameters you have specified are perfect!
If you have unique ids for cars, they can also become part of your API endpoint, like:
/api/cars/$car_id/versions?fields=colors,price&ids=8,9
Or you can try using /api/carversions instead of /api/versions.
You can also refer to my answer to a similar question.
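For completeness, a minimal sketch of how the server side might honour the fields/id parameters from the question (Flask and the in-memory data are illustrative assumptions, not part of the question):

from flask import Flask, jsonify, request

app = Flask(__name__)
VERSIONS = {
    8: {"id": 8, "colors": [{"name": "tui", "hex": "#gg0000"}], "price": 10000},
    9: {"id": 9, "colors": [{"name": "rie", "hex": "#or0000"}], "price": 20000},
}

@app.route("/api/versions/")
def list_versions():
    ids = request.args.get("id")         # e.g. "8,9"
    fields = request.args.get("fields")  # e.g. "colors"
    if ids:
        versions = [VERSIONS[i] for i in map(int, ids.split(","))]
    else:
        versions = list(VERSIONS.values())
    if fields:
        wanted = set(fields.split(",")) | {"id"}  # always include the id
        versions = [{k: v for k, v in ver.items() if k in wanted} for ver in versions]
    return jsonify(versions)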

MongoDB update if newer or insert if not exists

For example, I have some sensors, each with a unique SN.
Then, I have some history data (not sorted) in this format: (SN, timestamp, value)
I want to maintain each sensor's latest status in MongoDB: {"sn": xxx, "timestamp": xxx, "value": xxx, "installed_time": xxx}. Installed times might be added manually later.
So currently, my code is like:
if db.sensors.find_one({"sn": SN, "timestamp": {"$lt": timestamp}}):
    db.sensors.update_one({"sn": SN}, {"$set": {"timestamp": timestamp, "value": value}}, upsert=True)
I'd like to know whether I can combine these into one operation.
I tried to do a conditional upsert:
db.sensors.update_one({"sn": SN, "timestamp": {"$lt": timestamp}}, {"$set": {"timestamp": timestamp, "value": value}}, upsert=True)
The problem is that I end up with multiple documents with the same SN.
For example, let's start with an empty collection. First, (1, 3, 1) is processed and {"sn": 1, "timestamp": 3, "value": 1} is inserted. Then, processing (1, 1, 2) creates another document {"sn": 1, "timestamp": 1, "value": 2}. The intended behaviour is to just ignore this data point.
I also tried a document replacement: db.sensors.replace_one({"sn": SN, "timestamp": {"$lt": timestamp}}, {"sn": SN, "timestamp": timestamp, "value": value}, upsert=True). This overwrites other fields like installed_time.
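One way to collapse this into a single operation is an update that uses an aggregation pipeline (MongoDB 4.2+), so the timestamp comparison happens server-side; a sketch, assuming pymongo:

# Single conditional upsert; only timestamp and value are touched, so fields
# such as installed_time are preserved.
db.sensors.update_one(
    {"sn": SN},
    [{"$set": {
        "value": {"$cond": [{"$lt": ["$timestamp", timestamp]}, value, "$value"]},
        "timestamp": {"$max": ["$timestamp", timestamp]},
    }}],
    upsert=True,
)

If no document with this SN exists, the upsert starts from {"sn": SN} and both expressions should resolve to the incoming values; if a document with a newer timestamp exists, $max keeps its timestamp and $cond keeps its value.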

CouchDB: query reduced value on complex key with timeframe

An application user can perform different tasks. Each kind of task has a unique identifier. Each user activity is recorded in the database.
So we have the following Event entity to keep in the database:
{
  "user_id": 1,
  "task_id": 2,
  "event_dt": [2013, 11, 15, 10, 0, 0, 0]
}
I need to know how many tasks of each type were performed by a particular user during a particular timeframe. The timeframe might be quite long (e.g. a rolling chart for the last year is requested).
For better understanding, the map function might be something like:
emit([doc.user_id, doc.task_id, doc.event_dt], 1)
and it might be queried using group_level=2 (or group_level=1 in case just the number of user events is needed).
Is it possible to answer the above question with a single view query using the map/reduce mechanism? Or do I have to use list functionality (though it may cause performance issues)?
Just use the flat key [doc.user_id, doc.task_id].concat(doc.event_dt), since it simplifies the request and grouping logic:
with group_level=1: you get the number of tasks per user for all time
with group_level=2: the number of each task id per user for all time
with group_level=3: same as above, but within a specific year
with group_level=4: same as above, but also grouped by month
and so on down to days, hours, minutes and seconds
For instance, the result for group_level=3 may be:
{"rows":[
{"key": ["user1", "task1", 2012], "value": 3},
{"key": ["user1", "task2", 2013], "value": 14},
{"key": ["user1", "task3", 2013], "value": 15},
{"key": ["user2", "task1", 2012], "value": 9},
{"key": ["user2", "task4", 2012], "value": 26},
{"key": ["user2", "task4", 2013], "value": 53},
{"key": ["user3", "task1", 2013], "value": 5}
]}
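For reference, such a view can then be queried over HTTP with group_level plus a startkey/endkey range; a Python sketch, where the database, design document, and view names are placeholders:

import json
import requests

resp = requests.get(
    "http://localhost:5984/events/_design/stats/_view/by_user_task_time",
    params={
        "group_level": 3,  # user + task + year
        "startkey": json.dumps(["user1", "task1", 2013]),
        "endkey": json.dumps(["user1", "task1", 2013, {}]),  # {} sorts after any date suffix
    },
)
for row in resp.json()["rows"]:
    print(row["key"], row["value"])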