I'm trying to implement event deduplication:
https://developers.facebook.com/docs/marketing-api/conversions-api/deduplicate-pixel-and-server-events#deduplication-best-practices
Should EventID be unique per user or globally?
The event ID should be unique per event instance; that's from a Facebook webinar I attended. If the same event ID is received for many event instances, only the first event with that ID will be processed; the rest will be discarded.
From the docs' "Deduplication Best Practices" section: "If we find the same server key combination (event_id, event_name) and browser key combination (eventID, event) sent to the same pixel ID within 48 hours, we discard the subsequent events."
So an event_id only needs to be unique among events with the same event_name sent to the same pixel within a 48-hour window.
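As an illustration, here is a minimal server-side sketch (Java; the payload shape and class name are simplified placeholders, not the full Conversions API schema) of generating one ID per event instance and sharing it between the browser pixel event and the server event:

import java.util.Map;
import java.util.UUID;

// Sketch only: one event_id per event INSTANCE (e.g. per purchase), reused for
// both the browser pixel event and the Conversions API server event so that
// Facebook can deduplicate the pair. The payload below is illustrative.
public class DedupEventIdExample {

    public static void main(String[] args) {
        String eventId = UUID.randomUUID().toString();  // unique per event instance, not per user

        // Server event payload for the Conversions API (simplified).
        Map<String, Object> serverEvent = Map.of(
                "event_name", "Purchase",
                "event_id", eventId,                              // dedup key: (event_id, event_name)
                "event_time", System.currentTimeMillis() / 1000L  // seconds since epoch
        );

        // The SAME id must be passed to the pixel on the browser side, e.g.
        //   fbq('track', 'Purchase', {...}, {eventID: '<eventId>'});
        System.out.println("shared event_id: " + eventId);
        System.out.println("server event: " + serverEvent);
    }
}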
I have a @KafkaListener class that listens to a particular topic and consumes records that contain either a Person object or a Phone object (and only one of them). Every Phone has a reference / correlation id to the corresponding Person. The listener class performs certain validations that are specific to the type received, saves the object into a database and produces a transfer success / failed response back to Kafka that is consumed by another service.
So a Person can successfully be transferred without any corresponding Phone, but a Phone transfer should only succeed if the corresponding Person transfer has succeeded. I can't wrap my head around how to implement this "synchronization", because Persons and Phones get into Kafka independently as separate records and it's not guaranteed that the Person corresponding to a particular Phone will be processed before the Phone.
Is it at all possible to have such a synchronization given the current architecture or should I redesign the producer and send a Person / Phone pair as a separate type?
Thanks.
It's not clear how you're using the same serializer for different object types, but you should probably create separate topics and/or branch your current one into two (see the Kafka Streams API).
I assume there are fewer people than phones, in which case you could build a KTable from a people topic, then as you get phone records you can perform a left join or lookup against this table for some person ID.
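For example, a minimal Kafka Streams sketch of that approach; the topic names and the Person/Phone/TransferResult types and their serdes are placeholders for whatever the producer actually uses:

import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

// Sketch: join incoming Phone records against a table of already-transferred Persons.
public class PhoneTransferTopology {

    public static Topology build(Serde<Person> personSerde,
                                 Serde<Phone> phoneSerde,
                                 Serde<TransferResult> resultSerde) {
        StreamsBuilder builder = new StreamsBuilder();

        // People topic keyed by personId, materialized as a table (latest Person per key).
        KTable<String, Person> people =
                builder.table("people", Consumed.with(Serdes.String(), personSerde));

        // Phones arrive independently; re-key each Phone by the personId it references
        // so the stream-table join can match (Kafka Streams repartitions automatically).
        KStream<String, Phone> phones =
                builder.stream("phones", Consumed.with(Serdes.String(), phoneSerde))
                       .selectKey((key, phone) -> phone.getPersonId());

        // Left join: if the Person isn't in the table yet, emit a failed/retry result;
        // otherwise the Phone transfer can proceed.
        phones.leftJoin(people, (phone, person) ->
                        person == null ? TransferResult.failed(phone)
                                       : TransferResult.succeeded(phone, person))
              .to("phone-transfer-results", Produced.with(Serdes.String(), resultSerde));

        return builder.build();
    }
}

Note that a plain stream-table join only sees the table as it is when the Phone arrives; a Phone that shows up before its Person would still need a retry/requeue path (or a windowed join) rather than succeeding later automatically.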
Other solutions could involve using Kafka Connect to dump records into a system where you can do the join
I saw in the Stocktwits API documentation that they only allow getting the most recent 30 messages from a stream. However, I need to extract messages from Stocktwits by searching for a certain user or certain symbols over specified periods, e.g. Jan 2017 to Jan 2019. For example, I want to extract all the messages sent by a user from 2017 to 2019, or all the messages with the AAPL tag from 2017 to 2019. Is that possible?
From the doc: https://www.lagomframework.com/documentation/1.5.x/scala/ReadSide.html
It says:
"Event tags: In order to consume events from a read-side, the events need to be tagged. All events with a particular tag can be consumed as a sequential, ordered stream of events. Events can be tagged by making them implement the AggregateEvent interface. The tag is defined using the aggregateTag method."
Q1. What does it mean when it says the events are consumed as a sequential, ordered stream of events?
Q2. Why tag events when there is already an offset?
A1. The read-side(s) will see the events for an entity in the order in which they were persisted by the write-side.
A2. Tagging is used so that different read-sides can subscribe to only specific events.
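To illustrate A2, a minimal tagging sketch; the linked docs show the Scala API, but the same concept in Lagom's Java DSL looks roughly like this, with PersonEvent as a hypothetical event type:

import com.lightbend.lagom.javadsl.persistence.AggregateEvent;
import com.lightbend.lagom.javadsl.persistence.AggregateEventTag;

// Every PersonEvent carries the same tag, so a read-side processor can subscribe
// to just this tag and receive each entity's events in the order they were persisted.
public interface PersonEvent extends AggregateEvent<PersonEvent> {

    AggregateEventTag<PersonEvent> TAG = AggregateEventTag.of(PersonEvent.class);

    @Override
    default AggregateEventTag<PersonEvent> aggregateTag() {
        return TAG;
    }
}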
Short description about the setup:
I'm trying to implement a "basic" event store/ event-sourcing application using a RDBMS (in my case Postgres). The events are general purpose events with only some basic fields like eventtime, location, action, formatted as XML. Due to this general structure, there is now way of partitioning them in a useful way. The events are captured via a Java Application, that validate the events and then store them in an events table. Each event will get an uuid and recordtime when it is captured.
In addition, there can be subscriptions to external applications, which should get all events matching a custom criteria. When a new matching event is captured, the event should be PUSHED to the subscriber. To ensure, that the subscriber does not miss any event, I'm currently forcing the capture process to be single threaded. When a new event comes in, a lock is set, the event gets a recordtime assigned to the current time and the event is finally inserted into the DB table (explicitly waiting for the commit). Then the lock is released. For a subscription which runs scheduled for example every 5 seconds, I track the recordtime of the last sent event, and execute a query for new events like where recordtime > subscription_recordtime. When the matching events are successfully pushed to the subscriber, the subscription_recordtime is set to the events max recordtime.
Everything is actually working but as you can imagine, a single threaded capture process, does not scale very well. Thus the main question is: How can I optimise this and allow for example multiple capture processes running in parallel?
I already thought about setting the recordtime in the DB itself on insert, but since the order of commits cannot be guaranteed (JVM pauses), I think I might loose events when two capture transactions are running nearly at the same time. When I understand the DB generated timestamp currectly, it will be set before the actual commit. Thus a transaction with a recordtime t2 can already be visible to the subscription query, although another transaction with a recordtime t1 (t1 < t2), is still ongoing and so has not been committed. The recordtime for the subscription will be set to t2 and so the event from transaction 1 will be lost...
Is there a way to guarantee the order on a DB level, so that events are visible in the order they are captured/ committed? Every newly visible event must have a later timestamp then the event before (strictly monotonically increasing). I know about a full table lock, but I think, then I will have the same performance penalties as before.
Is it possible to set the DB to use a single threaded writer? Then each capture process would also be waiting for another write TX to finished, but on a DB level, which would be much better than a single instance/threaded capture application. Or can I use a different field/id for tracking the current state? Normal sequence ids will suffer from the same reasons.
Is there a way to guarantee the order on a DB level, so that events are visible in the order they are captured/committed?
You should not be concerned with global ordering of events. Your events should contain a Version property. When writing events, you should always be inserting monotonically increasing Version numbers for a given Aggregate/Stream ID. That really is the only ordering that should matter when you are inserting. For Customer ABC, with events 1, 2, 3, and 4, you should only write event 5.
A database transaction can ensure the correct order within a stream using the rules above.
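As a sketch of that rule (assuming a hypothetical Postgres events table with a UNIQUE (stream_id, version) constraint and plain JDBC), an append can rely on the constraint to reject a concurrent writer that tries to reuse a version:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Sketch only. Assumed table:
//   CREATE TABLE events (id BIGSERIAL PRIMARY KEY, stream_id TEXT NOT NULL,
//                        version BIGINT NOT NULL, payload TEXT NOT NULL,
//                        recordtime TIMESTAMPTZ NOT NULL DEFAULT now(),
//                        UNIQUE (stream_id, version));
public class EventAppender {

    public void append(Connection conn, String streamId, long currentVersion, String payloadXml)
            throws SQLException {
        String sql = "INSERT INTO events (stream_id, version, payload) VALUES (?, ?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, streamId);
            ps.setLong(2, currentVersion + 1);  // e.g. events 1..4 exist, so write version 5
            ps.setString(3, payloadXml);
            ps.executeUpdate();                 // a concurrent writer that reused this version
                                                // violates UNIQUE (stream_id, version) and must
                                                // re-read the stream and retry
        }
    }
}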
For a subscription that runs on a schedule, for example every 5 seconds, I track the recordtime of the last sent event and execute a query for new events like where recordtime > subscription_recordtime.
Reading events is a slightly different story. Firstly, you will likely have a serial column to uniquely identify events. That gives you ordering and allows you to determine whether you have read all events. When you read events from the store, you may detect a gap in the sequence; this happens if an insert was still in flight when you read the latest events. In this case, simply re-read the data and check whether the gap is gone. This requires your subscription to maintain its position in the index. Alternatively or additionally, you can read only events that are at least N milliseconds old, where N is a threshold high enough to compensate for delays in transactions (e.g. 500 or 1000).
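A rough sketch of such a reader, again against the hypothetical table above with its BIGSERIAL id column: it processes only the contiguous prefix of new ids and stops at the first gap so the caller can re-read shortly afterwards:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Sketch: read events after the last processed id and stop at a gap in the
// sequence, which indicates an insert whose transaction was still open.
public class SubscriptionReader {

    public List<Long> readNewEventIds(Connection conn, long lastProcessedId) throws SQLException {
        String sql = "SELECT id FROM events WHERE id > ? ORDER BY id";
        List<Long> contiguousIds = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, lastProcessedId);
            try (ResultSet rs = ps.executeQuery()) {
                long expected = lastProcessedId + 1;
                while (rs.next()) {
                    long id = rs.getLong(1);
                    if (id != expected) {
                        break;  // gap: an earlier id is not visible yet, re-read later
                    }
                    contiguousIds.add(id);
                    expected++;
                }
            }
        }
        return contiguousIds;  // the subscription advances its position to the last id returned
    }
}

One caveat: a gap can also be permanent (a rolled-back insert still consumes a sequence value), so in practice you would combine gap detection with the age threshold mentioned above rather than waiting indefinitely for a gap to close.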
Also, bear in mind that there are open source RDBMS event stores that you can either use or leverage in your process.
Marten: http://jasperfx.github.io/marten/documentation/events/
SqlStreamStore: https://github.com/SQLStreamStore/SQLStreamStore
I am researching Atom feeds as a way of distributing event data as part of our organisation's internal REST APIs. I can control the feeds and ensure:
there is a "head" feed containing time-ordered events with an etag which updates if the feed changes (and short cache headers).
there are "archive" feeds containing older events with a fixed etag (and long cache headers).
the events are timestamped and immutable, i.e. they happened and can't change.
The question is, what must the consumer remember to be sure to synchronize itself with the latest data at any time, without double processing of events?
The last etag it processed?
The timestamp of the last event it processed?
I suppose it needs both? The etag to efficiently ask the feed whether there have been any changes (using HTTP If-None-Match), and if so, the timestamp to apply only those changes from the updated feed that haven't already been processed...
The question has nothing particularly to do with REST or the technology used to consume the feed. It would apply to anyone writing code to consume an Atom-based feed, for example.
UPDATE
Thinking about it - some of the events may have the same timestamp, as they get "detected" at the same time in batches. It could then be awkward for the consumer to rely on the timestamp of the last event successfully processed, in case its processing dies halfway through a batch with the same timestamp... This is why I hate timestamps!
In that case does the feed need to send an id with every event that the consumer has to remember instead? Wouldn't that id have to increment to eternity, and never ever be reset? What are the alternatives?
Your events should all carry a unique ID. A client is required to track those IDs, and that is enough to prevent double-processing.
In that case does the feed need to send an id with every event that the consumer has to remember instead?
Yes. An atom:entry is required to have an atom:id that is unique. If your events are immutable, uniqueness of the ID is enough. In general, entries aren't required to be immutable. atom:updated contains the last significant change:
"the most recent instant in time when an entry or feed was modified in a way the publisher considers significant"
So a general client would need to consider the pair of id and updated.
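Putting that together, a minimal polling sketch (Java; the feed URL, the Atom parsing and the persistence of the consumer's state are elided placeholders): the stored ETag drives a conditional GET, and a set of already-seen id/updated pairs prevents double-processing:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.HashSet;
import java.util.Set;

// Sketch: poll an Atom feed without double-processing entries.
// Feed parsing is elided; processedKeys would be persisted between runs in a real consumer.
public class AtomFeedConsumer {

    private final HttpClient http = HttpClient.newHttpClient();
    private String lastEtag;                                    // from the previous successful fetch
    private final Set<String> processedKeys = new HashSet<>();  // atom:id + atom:updated

    public void poll(String feedUrl) throws Exception {
        HttpRequest.Builder req = HttpRequest.newBuilder(URI.create(feedUrl));
        if (lastEtag != null) {
            req.header("If-None-Match", lastEtag);   // cheap "has anything changed?" check
        }
        HttpResponse<String> resp = http.send(req.build(), HttpResponse.BodyHandlers.ofString());
        if (resp.statusCode() == 304) {
            return;                                  // nothing changed since the last poll
        }
        lastEtag = resp.headers().firstValue("ETag").orElse(null);

        for (Entry entry : parseEntries(resp.body())) {
            String key = entry.id() + "|" + entry.updated();  // immutable events: id alone suffices
            if (processedKeys.add(key)) {
                process(entry);                      // only entries not seen before
            }
        }
    }

    record Entry(String id, String updated) {}
    private Iterable<Entry> parseEntries(String atomXml) { /* Atom parsing elided */ return Set.of(); }
    private void process(Entry entry) { /* application-specific handling */ }
}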