I am trying to implement data deduplication using Kafka Streams. Basically, I'd like to drop any duplicates after the first encountered message in a session window with a 1-second inactivity gap and an 8-hour grace period for late arrivals.
A more concrete example:
Input:
Key: A1; Value: { sensor: A1, current: 42, timestamp: Fri Apr 15 2022 21:59:22.555 }
Key: A1; Value: { sensor: A1, current: 42, timestamp: Fri Apr 15 2022 21:59:23.876 }
Key: A1; Value: { sensor: A1, current: 42, timestamp: Fri Apr 15 2022 21:59:23.574 }
Key: B2; Value: { sensor: B2, current: 42, timestamp: Fri Apr 15 2022 21:59:24.732 }
Desired output:
Key: A1; Value: { sensor: A1, current: 42, timestamp: Fri Apr 15 2022 21:59:22.555 }
Key: A1; Value: { sensor: A1, current: 42, timestamp: Fri Apr 15 2022 21:59:23.876 }
Key: B2; Value: { sensor: B2, current: 42, timestamp: Fri Apr 15 2022 21:59:24.732 }
So the third message
{ sensor: A1, current: 42, timestamp: Fri Apr 15 2022 21:59:23.574 }
from the input stream should be dropped, since its sensor and current fields match a record we already have within the defined 1-second window.
Here's sample code:
streamsBuilder
    .stream(
        "input-topic",
        Consumed.with(Serdes.String(), telemetrySerde)
            // use the timestamp carried in the Avro record as the record timestamp
            .withTimestampExtractor(ValueTimestampExtractor())
    )
    .groupBy(
        { _, value ->
            TelemetryDuplicate(
                sensor = value.sensor,
                current = value.current
            )
        },
        Grouped.with(telemetryDuplicateSerde, telemetrySerde)
    )
    .windowedBy(
        SessionWindows.ofInactivityGapAndGrace(
            /* inactivityGap = */ Duration.ofSeconds(1),
            /* afterWindowEnd = */ Duration.ofHours(8)
        )
    )
    // always keep only the first record of the group
    .reduce { value, _ -> value }
    .toStream()
    .selectKey { k, _ -> k.key().sensor }
    .to("output-topic", Produced.with(Serdes.String(), telemetrySerde))
This works and does the job. However, even though I rekey the resulting stream from the windowed key to just the sensor id, I get the following messages in the output-topic:
Key: A1; Value: { sensor: A1, current: 42, timestamp: Fri Apr 15 2022 21:59:22.555 }
Key: A1; Value: { sensor: A1, current: 42, timestamp: Fri Apr 15 2022 21:59:23.876 }
Key: A1; Value: NULL
Key: A1; Value: { sensor: A1, current: 42, timestamp: Fri Apr 15 2022 21:59:23.876 }
Key: B2; Value: { sensor: B2, current: 42, timestamp: Fri Apr 15 2022 21:59:24.732 }
That means the stream is actually de-duplicated, but in a rather awkward way: because the session window changes, it produces a tombstone that cancels the previous aggregation, even though neither the selected key nor the value has changed (see how reduce is defined).
The question: is it possible to somehow produce only the first encountered record in the window and not produce any tombstones and "updated" aggregations?
Cheers.
You can add a filter so that the null values (the tombstones) are not produced to your result topic:
...
    .selectKey { k, _ -> k.key().sensor }
    .filter { _, value -> value != null }
    .to("output-topic", Produced.with(Serdes.String(), telemetrySerde))
Please see the example of SessionWindows deduplication here.
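For reference, the desired output from the question can be reproduced outside Kafka with a few lines of plain Ruby. This is only a sketch of the intended semantics, not Kafka Streams code: the deduplicate helper, the ts_ms field (milliseconds past 21:59:00) and the symmetric 1-second comparison are assumptions made for illustration.

```ruby
GAP_MS = 1_000 # assumed to mirror the 1-second inactivity gap

# Keeps the first record seen for each (sensor, current) pair and drops any
# later arrival whose timestamp lies within gap_ms of a record already kept
# for that pair. Records are processed in arrival order.
def deduplicate(records, gap_ms: GAP_MS)
  kept_timestamps = Hash.new { |hash, key| hash[key] = [] }
  records.select do |record|
    key = [record[:sensor], record[:current]]
    duplicate = kept_timestamps[key].any? { |ts| (ts - record[:ts_ms]).abs <= gap_ms }
    kept_timestamps[key] << record[:ts_ms] unless duplicate
    !duplicate
  end
end

# The four input records from the question; ts_ms is a hypothetical field.
input = [
  { sensor: 'A1', current: 42, ts_ms: 22_555 },
  { sensor: 'A1', current: 42, ts_ms: 23_876 },
  { sensor: 'A1', current: 42, ts_ms: 23_574 }, # late arrival, within 1 s of 23.876
  { sensor: 'B2', current: 42, ts_ms: 24_732 }
]

puts deduplicate(input).map { |record| record[:ts_ms] }.join(', ')
# prints "22555, 23876, 24732"
```

The late record at 21:59:23.574 is dropped because it falls within one second of the record already kept at 21:59:23.876, which matches the desired output listed in the question.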
I have a mongo instance running with oplogMinRetentionHours set to 24 hours and the max oplog size set to 50 GB. Despite these settings, oplog entries seem to be retained indefinitely: the oplog has entries older than 24 hours, and its size has reached 1.4 TB (.34 TB on disk).
db.runCommand( { serverStatus: 1 } ).oplogTruncation.oplogMinRetentionHours
24 hrs
db.getReplicationInfo()
{
"logSizeMB" : 51200,
"usedMB" : 1464142.51,
"timeDiff" : 3601538,
"timeDiffHours" : 1000.43,
"tFirst" : "Fri Mar 19 2021 14:15:49 GMT+0000 (Greenwich Mean Time)",
"tLast" : "Fri Apr 30 2021 06:41:27 GMT+0000 (Greenwich Mean Time)",
"now" : "Fri Apr 30 2021 06:41:28 GMT+0000 (Greenwich Mean Time)"
}
MongoDB server version: 4.4.0
OS: Windows Server 2016 DataCenter 64bit
What I have noticed is that even a superuser with the root role is not able to access replset.oplogTruncateAfterPoint; I'm not sure if this is by design.
mongod.log
{"t":{"$date":"2021-04-30T06:35:51.308+00:00"},"s":"I", "c":"ACCESS",
"id":20436, "ctx":"conn8","msg":"Checking authorization
failed","attr":{"error":{"code":13,"codeName":"Unauthorized","errmsg":"not
authorized on local to execute command { aggregate:
"replset.oplogTruncateAfterPoint", pipeline: [ { $indexStats: {} }
], cursor: { batchSize: 1000.0 }, $clusterTime: { clusterTime:
Timestamp(1619764547, 1), signature: { hash: BinData(0,
180A28389B6BBA22ACEB5D3517029CFF8D31D3D8), keyId: 6935907196995633156
} }, $db: "local" }"}}}
I'm not sure why mongo does not delete older entries from the oplog.
MongoDB oplog truncation seems to be triggered by inserts, so the oplog gets truncated as and when inserts happen.
I am running some code in an action named AC1 of an Oozie workflow named WF1. This workflow is not scheduled but runs continuously; action AC1 usually gets its turn 4 times a day. The time at which this action runs is not known in advance.
Now, there is another Oozie workflow, WF2, scheduled to run at 4:00 AM using an Oozie coordinator. WF2 runs for only 3-4 minutes, as it is a small piece of code that needs to run in off-peak hours.
In WF2, we want to check the status of workflow action AC1, which runs as part of WF1 (every time an AC1 instance runs, a new id gets assigned to it). Is it possible to get the status of AC1 using its name only, without knowing the id?
I know of a workaround where I store the status of AC1 in a Hive table and keep querying it to find out the status. But if something is offered out of the box, it would be helpful.
There are several ways to do it (as you mention).
The built-in way is to use the job information.
You can do a simple GET and receive a response with the job status and all of its actions. In the example below, you can go to actions, look for your action name, and check its status:
HTTP/1.1 200 OK
Content-Type: application/json;charset=UTF-8
.
{
id: "0-200905191240-oozie-W",
appName: "indexer-workflow",
appPath: "hdfs://user/bansalm/indexer.wf",
externalId: "0-200905191230-oozie-pepe",
user: "bansalm",
status: "RUNNING",
conf: "<configuration> ... </configuration>",
createdTime: "Thu, 01 Jan 2009 00:00:00 GMT",
startTime: "Fri, 02 Jan 2009 00:00:00 GMT",
endTime: null,
run: 0,
actions: [
{
id: "0-200905191240-oozie-W#indexer",
name: "AC1",
type: "map-reduce",
conf: "<configuration> ...</configuration>",
startTime: "Thu, 01 Jan 2009 00:00:00 GMT",
endTime: "Fri, 02 Jan 2009 00:00:00 GMT",
status: "OK",
externalId: "job-123-200903101010",
externalStatus: "SUCCEEDED",
trackerUri: "foo:8021",
consoleUrl: "http://foo:50040/jobdetailshistory.jsp?jobId=...",
transition: "reporter",
data: null,
errorCode: null,
errorMessage: null,
retries: 0
},
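Once such a response has been fetched, picking out an action by name is a short lookup. The following is a minimal Ruby sketch under a few assumptions: the response body is strict JSON with quoted keys, the sample body is a hypothetical trimmed-down version of the example above, and action_statuses is a helper name made up here. Fetching the body over HTTP is left out.

```ruby
require 'json'

# Returns the status of every action with the given name in a parsed
# job-information document, or an empty array if no action matches.
def action_statuses(job_info, action_name)
  job_info.fetch('actions', [])
          .select { |action| action['name'] == action_name }
          .map { |action| action['status'] }
end

# Hypothetical, trimmed-down response body shaped like the example above.
SAMPLE_RESPONSE = <<~BODY
  {
    "id": "0-200905191240-oozie-W",
    "appName": "indexer-workflow",
    "status": "RUNNING",
    "actions": [
      { "id": "0-200905191240-oozie-W#indexer", "name": "AC1", "status": "OK" },
      { "id": "0-200905191240-oozie-W#reporter", "name": "reporter", "status": "PREP" }
    ]
  }
BODY

job_info = JSON.parse(SAMPLE_RESPONSE)
puts action_statuses(job_info, 'AC1').join(', ')
# prints "OK"
```

The helper returns an array rather than a single value because nothing in the response guarantees that an action name appears only once.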
I am using the Groovy RESTClient for doing some API automation.
The issue I am having is that when I do a GET call, the results I get back are missing all the quotes and formatting I would expect. This makes it hard to parse the results.
Example:
What I would expect (and what I am getting back from my API) ...
{
"results": [{
"licenseType": "mobileAppLicensesWithDevice",
"name": "Lead Retrieval - Device Rental & App license",
"owner": {
"id": "a705c768-ee33-491d-a993-4dd7bc61228b",
"entityType": "exhibitor"
},
"termId": "630493a4-4a70-4f4f-afaf-31610c14c181",
"id": "c215affe-ed0f-4014-8ba8-53f9df97942a",
"readableId": "77umkh20kq3",
"accessCode": "0w7zh6t",
"updatedAt": "2016-09-22T17:02:06.911Z",
"createdAt": "2016-09-22T17:02:06.911Z"
}, {
"licenseType": "mobileAppLicensesWithDevice",
"name": "Lead Retrieval - Device Rental & App license",
"owner": {
"id": "a705c768-ee33-491d-a993-4dd7bc61228b",
"entityType": "exhibitor"
},
"termId": "630493a4-4a70-4f4f-afaf-31610c14c181",
"id": "4249aedb-934f-4db1-89d6-6f7f10152bf5",
"readableId": "fyzv5ay7tfs",
"accessCode": "ray0pwb",
"updatedAt": "2016-09-22T17:02:06.911Z",
"createdAt": "2016-09-22T17:02:06.912Z"
}, {
"licenseType": "mobileAppLicensesWithDevice",
"name": "Lead Retrieval - Device Rental & App license",
"owner": {
"id": "a705c768-ee33-491d-a993-4dd7bc61228b",
"entityType": "exhibitor"
},
"termId": "630493a4-4a70-4f4f-afaf-31610c14c181",
"id": "ea6de933-0043-435c-ad4e-32b81018ec05",
"readableId": "ytby08d586x",
"accessCode": "3sv1lj6",
"updatedAt": "2016-09-22T17:02:06.912Z",
"createdAt": "2016-09-22T17:02:06.912Z"
}],
"etags": [{}, {}, {}],
"total": 3
}
And here is what I am getting back when I use the RESTClient ...
{
etags = [{}, {}, {}], results = [{
accessCode = wvnc16i,
createdAt = 2016-09-23T15:08:20.673Z,
id = 06f7bb76-cc0c-450d-a4af-fcf3392fbd1b,
licenseType = mobileAppLicensesWithDevice,
name = Lead Retrieval - Device Rental & App license,
owner = {
entityType = exhibitor,
id = adc7a8e5-b137-40c6-8765-deb3ce0b8b3f
},
readableId = r52jvlr0ok7,
termId = 630493a4-4a70-4f4f-afaf-31610c14c181,
updatedAt = 2016-09-23T15:08:20.673Z
}, {
accessCode = nxf2dzw,
createdAt = 2016-09-23T15:08:20.673Z,
id = bda7ec58-5c64-4082-9da1-48534dd6afcc,
licenseType = mobileAppLicensesWithDevice,
name = Lead Retrieval - Device Rental & App license,
owner = {
entityType = exhibitor,
id = adc7a8e5-b137-40c6-8765-deb3ce0b8b3f
},
readableId = b11yew6bqra,
termId = 630493a4-4a70-4f4f-afaf-31610c14c181,
updatedAt = 2016-09-23T15:08:20.673Z
}, {
accessCode = 3e7f1ip,
createdAt = 2016-09-23T15:08:20.674Z,
id = 2e12657a-8a62-4d2c-b5d2-2a3fe4c27f9e,
licenseType = mobileAppLicensesWithDevice,
name = Lead Retrieval - Device Rental & App license,
owner = {
entityType = exhibitor,
id = adc7a8e5-b137-40c6-8765-deb3ce0b8b3f
},
readableId = ye410zlhqe6,
termId = 630493a4-4a70-4f4f-afaf-31610c14c181,
updatedAt = 2016-09-23T15:08:20.674Z
}], total = 3
}
* EDIT *
Example of the code that's implementing this ...
def getLicenseId(String eventId, String exhibitorId) {
    String licenseId
    def apiGetLicenseId = webClient.get(
        path: "event_api/events/${eventId}/exhibitors/${exhibitorId}/licenses",
        headers: ['Authorization': "api_key $apiKey"],
        requestContentType: JSON
    )
    assert apiGetLicenseId.status == 200 : "API get license ID failed!\nAPI Response: ${apiGetLicenseId.data}"
    String index = apiGetLicenseId.data
    def slurper = new JsonSlurper().parseText(index)
    slurper.each {
        println(it)
    }
    licenseId = apiGetLicenseId.data.results[1].id
    println("Index = " + index)
    println("Index Length = " + slurper)
    println("License Id = " + licenseId)
    return licenseId
}
Any thoughts on what the issue is here and how it can be resolved?
I'm trying to follow along with http://mongotips.com/b/array-keys-allow-for-modeling-simplicity/
I have a Story document and a Rating document. The user will rate a story, so I wanted to create a many relationship to ratings by users as such:
class StoryRating
  include MongoMapper::Document
  # key <name>, <type>
  key :user_id, ObjectId
  key :rating, Integer
  timestamps!
end
class Story
  include MongoMapper::Document
  # key <name>, <type>
  timestamps!
  key :title, String
  key :ratings, Array, :index => true
  many :story_ratings, :in => :ratings
end
Then
irb(main):006:0> s = Story.create
irb(main):008:0> s.ratings.push(StoryRating.new(user_id: '0923ksjdfkjas'))
irb(main):009:0> s.ratings.last.save
=> true
irb(main):010:0> s.save
BSON::InvalidDocument: Cannot serialize an object of class StoryRating into BSON.
from /usr/local/lib/ruby/gems/1.9.1/gems/bson-1.6.2/lib/bson/bson_c.rb:24:in `serialize' (...)
Why?
You should use the association method "story_ratings" for your push/append, rather than Array.push on the internal "ratings" array, to get what you want when following John Nunemaker's "Array Keys Allow For Modeling Simplicity" discussion. The difference is that with the association method, MongoMapper inserts the BSON::ObjectId reference into the array, whereas with the latter you are pushing a Ruby StoryRating object into the array, and the underlying driver can't serialize it.
Here's a test that works for me, that shows the difference. Hope that this helps.
Test
require 'test_helper'
class Object
  def to_pretty_json
    JSON.pretty_generate(JSON.parse(self.to_json))
  end
end
class StoryTest < ActiveSupport::TestCase
  def setup
    User.delete_all
    Story.delete_all
    StoryRating.delete_all
    #stories_coll = Mongo::Connection.new['free11513_mongomapper_bson_test']['stories']
  end
  test "Array Keys" do
    user = User.create(:name => 'Gary')
    story = Story.create(:title => 'A Tale of Two Cities')
    rating = StoryRating.create(:user_id => user.id, :rating => 5)
    assert_equal(1, StoryRating.count)
    story.ratings.push(rating)
    p story.ratings
    assert_raise(BSON::InvalidDocument) { story.save }
    story.ratings.pop
    story.story_ratings.push(rating) # note story.story_ratings, NOT story.ratings
    p story.ratings
    assert_nothing_raised(BSON::InvalidDocument) { story.save }
    assert_equal(1, Story.count)
    puts Story.all(:ratings => rating.id).to_pretty_json
  end
end
Result
Run options: --name=test_Array_Keys
# Running tests:
[#<StoryRating _id: BSON::ObjectId('4fa98c25e4d30b9765000003'), created_at: Tue, 08 May 2012 21:12:05 UTC +00:00, rating: 5, updated_at: Tue, 08 May 2012 21:12:05 UTC +00:00, user_id: BSON::ObjectId('4fa98c25e4d30b9765000001')>]
[BSON::ObjectId('4fa98c25e4d30b9765000003')]
[
{
"created_at": "2012-05-08T21:12:05Z",
"id": "4fa98c25e4d30b9765000002",
"ratings": [
"4fa98c25e4d30b9765000003"
],
"title": "A Tale of Two Cities",
"updated_at": "2012-05-08T21:12:05Z"
}
]
.
Finished tests in 0.023377s, 42.7771 tests/s, 171.1084 assertions/s.
1 tests, 4 assertions, 0 failures, 0 errors, 0 skips
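The same distinction can be seen without MongoMapper or MongoDB at all. The plain-Ruby sketch below is only an analogy: fake_serialize, FakeRating and UnserializableError are hypothetical stand-ins invented here, and the real BSON driver's rules are more involved. It shows that an array holding an id string passes through a primitive-only serializer, while an array holding a whole object does not.

```ruby
# Hypothetical stand-ins; none of this is MongoMapper or the BSON driver.
class UnserializableError < StandardError; end

FakeRating = Struct.new(:id, :rating)

# Types this toy serializer accepts as leaf values.
SERIALIZABLE = [String, Integer, Float, NilClass, TrueClass, FalseClass].freeze

# Walks hashes and arrays recursively and rejects any other object that is
# not one of the primitive types listed above.
def fake_serialize(value)
  case value
  when Hash then value.transform_values { |item| fake_serialize(item) }
  when Array then value.map { |item| fake_serialize(item) }
  when *SERIALIZABLE then value
  else raise UnserializableError, "Cannot serialize an object of class #{value.class}"
  end
end

rating = FakeRating.new('4fa98c25e4d30b9765000003', 5)

# Storing only the id works, which is analogous to using the association.
fake_serialize('title' => 'A Tale of Two Cities', 'ratings' => [rating.id])

# Storing the object itself fails, which is analogous to ratings.push(rating).
begin
  fake_serialize('title' => 'A Tale of Two Cities', 'ratings' => [rating])
rescue UnserializableError => e
  puts e.message
end
# prints "Cannot serialize an object of class FakeRating"
```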
I am trying to get better-organized results by using class inheritance with MongoMapper, but I'm having some trouble.
class Item
  include MongoMapper::Document
  key :name, String
end
class Picture < Item
  key :url, String
end
class Video < Item
  key :length, Integer
end
When I run the following commands, they don't quite return what I am expecting.
>> Item.all
=> [#<Item name: "Testing", created_at: Sun, 03 Jan 2010 20:02:48 PST -08:00, updated_at: Mon, 04 Jan 2010 13:01:31 PST -08:00, _id: 4b416868010e2a04d0000002, views: 0, user_id: 4b416844010e2a04d0000001, description: "lorem?">]
>> Video.all
=> [#<Video name: "Testing", created_at: Sun, 03 Jan 2010 20:02:48 PST -08:00, updated_at: Mon, 04 Jan 2010 13:01:31 PST -08:00, _id: 4b416868010e2a04d0000002, views: 0, user_id: 4b416844010e2a04d0000001, description: "lorem?">]
>> Picture.all
=> [#<Picture name: "Testing", created_at: Sun, 03 Jan 2010 20:02:48 PST -08:00, updated_at: Mon, 04 Jan 2010 13:01:31 PST -08:00, _id: 4b416868010e2a04d0000002, views: 0, user_id: 4b416844010e2a04d0000001, description: "lorem?">]
They all return the same result. I would expect Item.all to list all of the results, that is, Items as well as Pictures and Videos. But if the item is actually a Picture, I would like it to be returned when I run Picture.all and not when I run Video.all. Do you see what I mean?
Am I misunderstanding how the inheritance works here? If I am, what is the best way to replicate this sort of behavior? I am trying to follow this (point 2) as a guideline for how I want this to work. I assume he can run Link.all to find all the links, and not include every other class that inherits from Item. Am I wrong?
The example you link to is a little misleading (or maybe just hard to follow) in that it doesn't show the full definition for the Item model. In order to use inheritance in your models, you'll need to define a key _type on the parent model. MongoMapper will then automatically set that key to the class name of the actual class of that document. So, for instance, your models would now look like this:
class Item
  include MongoMapper::Document
  key :name, String
  key :_type, String
end
class Picture < Item
  key :url, String
end
class Video < Item
  key :length, Integer
end
and the output of your searches (assuming you created a Picture object) will turn into:
>> Item.all
=> [#<Picture name: "Testing", _type: "Picture", created_at: Sun, 03 Jan 2010 20:02:48 PST -08:00, updated_at: Mon, 04 Jan 2010 13:01:31 PST -08:00, _id: 4b416868010e2a04d0000002, views: 0, user_id: 4b416844010e2a04d0000001, description: "lorem?">]
>> Video.all
=> []
>> Picture.all
=> [#<Picture name: "Testing", _type: "Picture", created_at: Sun, 03 Jan 2010 20:02:48 PST -08:00, updated_at: Mon, 04 Jan 2010 13:01:31 PST -08:00, _id: 4b416868010e2a04d0000002, views: 0, user_id: 4b416844010e2a04d0000001, description: "lorem?">]
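The idea behind the _type key can be illustrated in plain Ruby, without MongoMapper. The sketch below is an assumption-laden toy, not MongoMapper's implementation: COLLECTION is a hypothetical in-memory stand-in for the single shared collection, and create and all are simplified versions written for this example. Each stored document records its class name, and each class filters on it when queried.

```ruby
# Hypothetical in-memory stand-in for the one collection all classes share.
COLLECTION = []

class Item
  attr_reader :attributes

  # Every document records the name of the class that created it.
  def initialize(attributes = {})
    @attributes = attributes.merge('_type' => self.class.name)
  end

  def self.create(attributes = {})
    new(attributes).tap { |doc| COLLECTION << doc.attributes }
  end

  # The base class sees every document; a subclass sees only documents
  # whose _type names that subclass or one of its descendants.
  def self.all
    COLLECTION.select { |doc| Object.const_get(doc['_type']) <= self }
  end
end

class Picture < Item; end
class Video < Item; end

Picture.create('name' => 'Testing')

puts [Item.all.size, Picture.all.size, Video.all.size].join(', ')
# prints "1, 1, 0"
```

Without the discriminator, all three classes would read the same documents and could not tell them apart, which is the behavior described in the question.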