Event sourcing - event replaying - CQRS

I have been reading about the Event Sourcing pattern, which can be very useful if you would like to rebuild your system.
However, what if I need to run the event rebuild while servicing new incoming requests? Is there any particular pattern or best practice for that scenario?
So, instead of scheduling system downtime, how do I ensure that new incoming requests will not screw up my system while it is replaying? Event synchronisation and sequence are really important for my system: it involves updating DB records that depend on the event sequence.
Any thoughts?

Note: for this example, all IDs are 6 random alphanumeric characters; think of them as UUIDs or SHA-1s, for example.
If you have this in the events:
WriteIndex | EventId | Type | streamId | Data
-------------------------------------------------------------------------------
1 | qcwbf2 | car.created | hrxs21 | { by: Alice, color: blue }
2 | e1owui | car.repainted | hrxs21 | { color: red }
3 | fjr4io | car.created | tye24p | { by: Alice, color: blue }
4 | fhreui | customer.created | b2dhuw | { name: Bob }
5 | urioe7 | car.sold | hrxs21 | { to: b2dhuw }
6 | gk3wq9 | customer.renamed | b2dhuw | { name: Charlie }
-------------------------------------------------------------------------------
And this in your projections (After 6):
CarId | Creator | Color | Sold | Customer | CustomerId
-------------------------------------------------------
hrxs21 | Alice | red | yes | Bob | b2dhuw
tye24p | Alice | blue | no | |
-------------------------------------------------------
CustomerId | Name
---------------------
b2dhuw | Charlie
---------------------
Imagine you have an error in the cars projection because your car projector did not properly listen to the "customer.renamed" event.
You rewrite all the projectors and want to replay.
Your fear
You replay events and get to this:
CarId | Creator | Color | Sold | Customer | CustomerId
-------------------------------------------------------
hrxs21 | Alice | red | yes | Charlie | b2dhuw
tye24p | Alice | blue | no | |
-------------------------------------------------------
But in parallel, while "rebuilding the car cache" (the projections are nothing more than a cache), three new events come in:
WriteIndex | EventId | Type | streamId | Data
-------------------------------------------------------------------------------
7 | je3i32 | car.repainted | hrxs21 | { color: orange }
8 | 8c227x | customer.created | wuionx | { name: Dan }
9 | e39jc2 | car.sold | tye24p | { to: wuionx }
So it "seems" the new rebuilt fresh cache never "gets to the current state" as now the car of Charlie (former Bob) is now orange instead of red, a new customer has been created, and car number wuionx is now owned by Dan.
Your solution
Mark your cache data with either an "index" (caution, it needs to be carefully designed) or a "timestamp" (caution with corrective events injected at past dates!).
When reading your cache, "make sure" you "apply" the "latest changes" still pending to be applied, in SYNC mode (not ASYNC).
Reason why: rebuilding many thousands of events can take time, but rebuilding just a few dozen should be lightning fast.
Quick-n-Dirty solution
So... make your "replayer" keep these tables instead (I'm assuming the WriteIndex is reliable for inspiration, but in practice I'd probably use something other than the WriteIndex; it's just for illustration):
CarId | Creator | Color | Sold | Customer | CustomerId | LatestChangedBy
--------------------------------------------------------|----------------
hrxs21 | Alice | red | yes | Charlie | b2dhuw | 6
tye24p | Alice | blue | no | | | 3
-------------------------------------------------------------------------
So when going to "consume" car tye24p you see that its latest update was caused by event 3, so you can "replay" from 4 to the end, listening only for this aggregate, and you'll end up with this:
CarId | Creator | Color | Sold | Customer | CustomerId | LatestChangedBy
--------------------------------------------------------|----------------
hrxs21 | Alice | red | yes | Charlie | b2dhuw | 6
tye24p | Alice | blue | yes | Dan | wuionx | 9
-------------------------------------------------------------------------
This is inefficient, as you can see, because you are replaying 4 to 6 "again" when you had already replayed them.
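A minimal Python sketch of that per-row catch-up, purely for illustration (the in-memory event list, the car_cache shape and the apply_to_car helper are all invented for this example; a real store would not be a Python list):
# The three new events from above (1-6 omitted), as (write_index, type, stream_id, data).
events = [
    (7, "car.repainted", "hrxs21", {"color": "orange"}),
    (8, "customer.created", "wuionx", {"name": "Dan"}),
    (9, "car.sold", "tye24p", {"to": "wuionx"}),
]

# Projection rows, each carrying its own LatestChangedBy pointer.
car_cache = {
    "hrxs21": {"creator": "Alice", "color": "red", "sold": True, "customer_id": "b2dhuw", "latest_changed_by": 6},
    "tye24p": {"creator": "Alice", "color": "blue", "sold": False, "customer_id": None, "latest_changed_by": 3},
}

def apply_to_car(row, ev_type, data):
    # Only the handlers needed for this illustration.
    if ev_type == "car.repainted":
        row["color"] = data["color"]
    elif ev_type == "car.sold":
        row["sold"] = True
        row["customer_id"] = data["to"]

def read_car(car_id):
    row = car_cache[car_id]
    # Per-row catch-up: replay anything newer than this row's own pointer.
    for index, ev_type, stream_id, data in events:
        if index > row["latest_changed_by"] and stream_id == car_id:
            apply_to_car(row, ev_type, data)
            row["latest_changed_by"] = index
    return row

print(read_car("tye24p"))  # now sold to customer wuionx (Dan), latest_changed_by == 9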
A bit better solution
Have a global-replayer counter
CarId | Creator | Color | Sold | Customer | CustomerId
-------------------------------------------------------
hrxs21 | Alice | red | yes | Charlie | b2dhuw
tye24p | Alice | blue | no | |
-------------------------------------------------------
ReplayerMetaData
-------------------------------------------------------
lastReplayed: 6
-------------------------------------------------------
When you want to access anything, you do a SYNC "quick update" of "any new events pending to be replayed".
If you want to access car tye24p, you just see that there are events up to "index 9" and you have only replayed up to 6. So, "before reading", you force an "update all pending" and replay just 7, 8 and 9. You end up with this car cache table:
CarId | Creator | Color | Sold | Customer | CustomerId
--------------------------------------------------------
hrxs21 | Alice | orange | yes | Charlie | b2dhuw
tye24p | Alice | blue | yes | Dan | wuionx
-------------------------------------------------------
ReplayerMetaData
-------------------------------------------------------
lastReplayed: 9
-------------------------------------------------------
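A similarly hedged sketch of this global-counter variant (again, all names are invented; project() stands for the same dispatching code the full rebuild uses):
replayer_meta = {"last_replayed": 6}  # the ReplayerMetaData pointer

def project(event):
    """Route one event to every projector (cars, customers, ...). Omitted here."""

def catch_up(events):
    # The SYNC "quick update": replay only what is pending, then advance the pointer.
    pending = [e for e in events if e["index"] > replayer_meta["last_replayed"]]
    for event in pending:
        project(event)
        replayer_meta["last_replayed"] = event["index"]
    return len(pending)  # "you need to update 0 more events" on the happy path

def read_car(car_id, events, car_cache):
    catch_up(events)  # usually a no-op, so the check costs almost nothing
    return car_cache[car_id]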
Overall
With this solution you:
May do an infinite number of trials before switching the read model.
May stay "online" while you do all the trials.
When your system is "ready to go" for the final rebuild (imagine it takes 1 hour to process 1 million events) you just run it BUT with the system online.
After the mega rebuild, say your lastReplayed = 1,000,000. If during that hour 2,000 new events came in, you just "replay those latest ones" again.
Imagine those 2,000 take 5 minutes; your pointer is now at 1,002,000. Imagine that in those 5 minutes 300 more events came in:
Replay "only those latest ones" and your pointer will be 1,002,300.
Once you are "nearly caught up" (no matter if there's still a 50-event gap) you just switch the read model (by a simple configuration flag) to the new model. This means the switch must not require a full deploy: you need to have already deployed a version able to read from either model, so switching is "immediate".
After switching, you just "ensure" in your reads that you "force apply the latest events synchronously".
This will affect only the first read and should take 1 or 2 seconds at most. The next reads will most probably already be in sync, so keeping the gap check costs nothing in performance; it will just say "you need to update 0 more events" and be done.
Hope this helps!

I've used CQRS+ES in a similar case.
I created a projection with prepared data that I could only update, not rebuild, and on every query I built the result from it quickly.
If you need to execute some long operations (like an update in the DB), use sagas: generate event -> saga -> update the projection after the saga ends and generate event2.
Of course, there will be some delay between an event arriving and the projection update.
It would be very interesting to learn more about your system and whether such a variant is good enough for you.

With the constraints you've described, it sounds like live rebuilding from zero cannot be part of your plans. You could instead have an A/B setup, playing through the events on a new system that is offline at that point, and then switching over to the new system once it has caught up. The key is that both the old and new systems can tune in to the event stream at the same time.
If you have varied systems subscribed to subsets of the events, it may be that you only need to replay events for one of those subsystems, and your synchronization/sequence needs can still be met without that subsystem in play.
You can prevent acting on obsolete information by including event sequence numbers in the data and having the sequence-dependent service defer processing if it hasn't seen that event yet.
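A rough sketch of that deferral check (all names here are hypothetical):
last_seen_sequence = 0  # highest event sequence this service has applied so far

def handle(event):
    """Hypothetical business logic, run once dependencies are satisfied."""

def on_event(event, defer):
    """Process an event only if the event it depends on has already been seen."""
    global last_seen_sequence
    if event["depends_on"] > last_seen_sequence:
        defer(event)  # not ready yet: requeue or retry later
        return
    handle(event)
    last_seen_sequence = max(last_seen_sequence, event["sequence"])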

For projecting your events to the read model, you need to use some sort of catch-up subscription, like the one that EventStore provides. In this case your new events will be saved to the store but will not immediately be projected.
However, you have to realise that your users will start seeing heavily stale data and taking actions based on the inconsistent read model. I would rather avoid this and let the system rebuild.
I agree with the first answer that you might want to build a new read model in parallel with updating the old one and switch over at some point, maybe not even for all users at first.

Related

PySpark SQL query to return row with most number of words

I am trying to come up with a PySpark SQL query to return the row of the review DataFrame whose text column has the most words.
I would like to return both the full text as well as the number of words. This question is in regard to the reviews of the Yelp dataset. Here is what I have so far, but apparently it is not (fully) correct:
query = """
SELECT text,LENGTH(text) - LENGTH(REPLACE(text,' ', '')) + 1 as count
FROM review
GROUP BY text
ORDER BY count DESC
"""
spark.sql(query).show()
Here is an example of a few rows from the dataframe:
[Row(business_id='ujmEBvifdJM6h6RLv4wQIg', cool=0, date='2013-05-07 04:34:36', funny=1, review_id='Q1sbwvVQXV2734tPgoKj4Q', stars=1.0, text='Total bill for this horrible service? Over $8Gs. These crooks actually had the nerve to charge us $69 for 3 pills. I checked online the pills can be had for 19 cents EACH! Avoid Hospital ERs at all costs.', useful=6, user_id='hG7b0MtEbXx5QzbzE6C_VA'),
Row(business_id='NZnhc2sEQy3RmzKTZnqtwQ', cool=0, date='2017-01-14 21:30:33', funny=0, review_id='GJXCdrto3ASJOqKeVWPi6Q', stars=5.0, text="I *adore* Travis at the Hard Rock's new Kelly Cardenas Salon! I'm always a fan of a great blowout and no stranger to the chains that offer this service; however, Travis has taken the flawless blowout to a whole new level! \n\nTravis's greets you with his perfectly green swoosh in his otherwise perfectly styled black hair and a Vegas-worthy rockstar outfit. Next comes the most relaxing and incredible shampoo -- where you get a full head message that could cure even the very worst migraine in minutes --- and the scented shampoo room. Travis has freakishly strong fingers (in a good way) and use the perfect amount of pressure. That was superb! Then starts the glorious blowout... where not one, not two, but THREE people were involved in doing the best round-brush action my hair has ever seen. The team of stylists clearly gets along extremely well, as it's evident from the way they talk to and help one another that it's really genuine and not some corporate requirement. It was so much fun to be there! \n\nNext Travis started with the flat iron. The way he flipped his wrist to get volume all around without over-doing it and making me look like a Texas pagent girl was admirable. It's also worth noting that he didn't fry my hair -- something that I've had happen before with less skilled stylists. At the end of the blowout & style my hair was perfectly bouncey and looked terrific. The only thing better? That this awesome blowout lasted for days! \n\nTravis, I will see you every single time I'm out in Vegas. You make me feel beauuuutiful!", useful=0, user_id='yXQM5uF2jS6es16SJzNHfg'),
Row(business_id='WTqjgwHlXbSFevF32_DJVw', cool=0, date='2016-11-09 20:09:03', funny=0, review_id='2TzJjDVDEuAW6MR5Vuc1ug', stars=5.0, text="I have to say that this office really has it together, they are so organized and friendly! Dr. J. Phillipp is a great dentist, very friendly and professional. The dental assistants that helped in my procedure were amazing, Jewel and Bailey helped me to feel comfortable! I don't have dental insurance, but they have this insurance through their office you can purchase for $80 something a year and this gave me 25% off all of my dental work, plus they helped me get signed up for care credit which I knew nothing about before this visit! I highly recommend this office for the nice synergy the whole office has!", useful=3, user_id='n6-Gk65cPZL6Uz8qRm3NYw')]
And expected output if this was the review with the most words:
I have to say that this office really has it together, they are so organized and friendly! Dr. J. Phillipp is a great dentist, very friendly and professional. The dental assistants that helped in my procedure were amazing, Jewel and Bailey helped me to feel comfortable! I don't have dental insurance, but they have this insurance through their office you can purchase for $80 something a year and this gave me 25% off all of my dental work, plus they helped me get signed up for care credit which I knew nothing about before this visit! I highly recommend this office for the nice synergy the whole office has!
And then something like Word count = xxxx
Edit: here is the example output for the top review using this code:
query = """
SELECT text, size(split(text, ' ')) AS word_count
FROM review
ORDER BY word_count DESC
"""
spark.sql(query).show(20, False)
Review returned with highest number of words:
Got a date with de$tiny?
** A ROMANTIC MOMENT WITH **
** THE BEST VIEW IN TOWN**
------------------------------------------------
/ **CN TOWER'S** \
/ **REVOLVING RESTAURANT** \
\ /
\ ----------------------------------------------- /
(dozens of lines of ASCII art, mostly "|" borders and blank space, omitted)
/ \
===========
o o~
/|~ ~|\
/\ / \ uhm, maybe not. the view may be great but a $30 to
$40 bleh $teak ain't necessarily gonna get you some
action later. Cheaper to get takeout from Harvey's and
eat and the beach! |4329 |
You can do this in native Spark SQL, without a UDF, by splitting the string into an array of words and taking the size of that array.
spark.sql("SELECT text, size(split(text, ' ')) as word_count FROM review ORDER BY word_count DESC").show(200, False)
Example
data = [("This is a sentence.",), ("This sentence has 5 words.", )]
review = spark.createDataFrame(data, ("text", ))
review.createOrReplaceTempView("review")
spark.sql("SELECT text, size(split(text, ' ')) as word_count FROM review ORDER BY word_count DESC").show(200, False)
Output
+--------------------------+----------+
|text |word_count|
+--------------------------+----------+
|This sentence has 5 words.|5 |
|This is a sentence. |4 |
+--------------------------+----------+
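If you only want the single longest review rather than the whole ordered list, the same query with a LIMIT should do (untested sketch against the same review table):
spark.sql("""
    SELECT text, size(split(text, ' ')) AS word_count
    FROM review
    ORDER BY word_count DESC
    LIMIT 1
""").show(1, False)
Note that splitting on a single space counts every run of consecutive spaces as extra "words", which is partly why the ASCII-art review above scores so high; splitting on a whitespace regular expression instead would give a stricter count.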

How to do column-level locking? Is that possible?

Let's say I have a table as below:
+----+------+--------+
| ID | NAME | STATUS |
+----+------+--------+
| 1 | ADAM | ACTIVE |
| 2 | EVE | ACTIVE |
| 3 | JOHN | ACTIVE |
+----+------+--------+
Let's say I want to do column-level locking - a transaction aborts if another transaction modifies the value of the same column, e.g.
+----+------+--------+
| ID | NAME | STATUS |
+----+------+--------+
| 1 | ADAM | ACTIVE | <- OK: Tx1: change NAME to ACE, Tx2: change STATUS to INACTIVE
| 2 | EVE | ACTIVE | <- Abort: Tx1: change NAME to CAROL, Tx2: change NAME to CAT
| 3 | JOHN | ACTIVE | <- OK, same value: Tx1: change NAME to JAN, Tx2: change NAME to JAN
+----+------+--------+
What lock or isolation level do I need to set?
You can't lock individual column values natively. You could make multiple tables with a one-to-one relationship, or you could roll your own optimistic locking:
select name from t where id=2; -- get Eve
update t set name='Carol' where id=2 and name='Eve' returning id;
-- if no rows updated, rollback and throw an error.
This would not consider it an error if other session(s) changed the value from Eve to Cat and then back to Eve again in between the first two statements shown.
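For completeness, here is a hedged sketch of driving that check-and-update from application code, assuming PostgreSQL with psycopg2 (table and values as in the question; the function name is made up):
import psycopg2

def rename_if_unchanged(conn, row_id, expected_name, new_name):
    """Optimistic 'column lock': only update if NAME still holds the value we read."""
    with conn:  # commits on success, rolls back if we raise
        with conn.cursor() as cur:
            cur.execute(
                "UPDATE t SET name = %s WHERE id = %s AND name = %s",
                (new_name, row_id, expected_name),
            )
            if cur.rowcount == 0:
                raise RuntimeError("NAME changed under us; retry or abort")

conn = psycopg2.connect("dbname=test")  # connection details are placeholders
rename_if_unchanged(conn, 2, "EVE", "CAROL")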

What exactly is a wide column store?

Googling for a definition either returns results for a column oriented DB or gives very vague definitions.
My understanding is that wide column stores consist of column families which consist of rows and columns. Each row within said family is stored together on disk. This sounds like how row oriented databases store their data. Which brings me to my first question:
How are wide column stores different from a regular relational DB table? This is the way I see it:
* column family -> table
* column family column -> table column
* column family row -> table row
This image from Database Internals simply looks like two regular tables:
The guess I have as to what is different comes from the fact that "multi-dimensional map" is mentioned along side wide column stores. So here is my second question:
Are wide column stores sorted from left to right? Meaning, in the above example, are the rows sorted first by Row Key, then by Timestamp, and finally by Qualifier?
Let's start with the definition of a wide column database.
Its architecture uses (a) persistent, sparse matrix, multi-dimensional mapping (row-value, column-value, and timestamp) in a tabular format meant for massive scalability (over and above the petabyte scale).
A relational database is designed to maintain the relationship between the entity and the columns that describe the entity. A good example is a Customer table. The columns hold values describing the Customer's name, address, and contact information. All of this information is the same for each and every customer.
A wide column database is one type of NoSQL database.
Maybe this is a better image of four wide column databases.
My understanding is that the first image at the top, the Column model, is what we called an entity/attribute/value table. It's an attribute/value table within a particular entity (column).
For Customer information, the first wide-column example might look like this.
Customer ID Attribute Value
----------- --------- ---------------
100001 name John Smith
100001 address 1 10 Victory Lane
100001 address 3 Pittsburgh, PA 15120
Yes, we could have modeled this for a relational database. The power of the attribute/value table comes with the more unusual attributes.
Customer ID Attribute Value
----------- --------- ---------------
100001 fav color blue
100001 fav shirt golf shirt
Any attribute that a marketer can dream up can be captured and stored in an attribute/value table. Different customers can have different attributes.
The Super Column model keeps the same information in a different format.
Customer ID: 100001
Attribute Value
--------- --------------
fav color blue
fav shirt golf shirt
You can have as many Super Column models as you have entities. They can be in separate NoSQL tables or put together as a Super Column family.
The Column Family and Super Column Family models simply give a row ID to the first two models in the picture for quicker retrieval of information.
Most (if not all) wide-column stores are indeed row-oriented stores, in that all parts of a record are stored together. You can see them as a two-dimensional key-value store. The first part of the key is used to distribute the data across servers; the second part of the key lets you quickly find the data on the target server.
Wide-column stores will have different features and behaviors. However, Apache Cassandra, for example, allows you to define how the data will be sorted. Take this table for example:
| id | country | timestamp | message |
|----+---------+------------+---------|
| 1 | US | 2020-10-01 | "a..." |
| 1 | JP | 2020-11-01 | "b..." |
| 1 | US | 2020-09-01 | "c..." |
| 2 | CA | 2020-10-01 | "d..." |
| 2 | CA | 2019-10-01 | "e..." |
| 2 | CA | 2020-11-01 | "f..." |
| 3 | GB | 2020-09-01 | "g..." |
| 3 | GB | 2020-09-02 | "h..." |
|----+---------+------------+---------|
If your partitioning key is (id) and your clustering key is (country, timestamp), the data will be stored like this:
[Key 1]
1:JP,2020-11-01,"b..." | 1:US,2020-09-01,"c..." | 1:US,2020-10-01,"a..."
[Key2]
2:CA,2019-10-01,"e..." | 2:CA,2020-10-01,"d..." | 2:CA,2020-11-01,"f..."
[Key3]
3:GB,2020-09-01,"g..." | 3:GB,2020-09-02,"h..."
Or in table form:
| id | country | timestamp | message |
|----+---------+------------+---------|
| 1 | JP | 2020-11-01 | "b..." |
| 1 | US | 2020-09-01 | "c..." |
| 1 | US | 2020-10-01 | "a..." |
| 2 | CA | 2019-10-01 | "e..." |
| 2 | CA | 2020-10-01 | "d..." |
| 2 | CA | 2020-11-01 | "f..." |
| 3 | GB | 2020-09-01 | "g..." |
| 3 | GB | 2020-09-02 | "h..." |
|----+---------+------------+---------|
If you change the primary key (the composite of partition key and clustering key) to (id, timestamp), with id as the partition key and timestamp as the clustering key in the default ascending order, the result would be:
[Key 1]
1:US,2020-09-01,"c..." | 1:US,2020-10-01,"a..." | 1:JP,2020-11-01,"b..."
[Key2]
2:CA,2019-10-01,"e..." | 2:CA,2020-10-01,"d..." | 2:CA,2020-11-01,"f..."
[Key3]
3:GB,2020-09-01,"g..." | 3:GB,2020-09-02,"h..."
Or in table form:
| id | country | timestamp | message |
|----+---------+------------+---------|
| 1 | US | 2020-09-01 | "c..." |
| 1 | US | 2020-10-01 | "a..." |
| 1 | JP | 2020-11-01 | "b..." |
| 2 | CA | 2019-10-01 | "e..." |
| 2 | CA | 2020-10-01 | "d..." |
| 2 | CA | 2020-11-01 | "f..." |
| 3 | GB | 2020-09-01 | "g..." |
| 3 | GB | 2020-09-02 | "h..." |
|----+---------+------------+---------|
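If it helps, here is a toy Python model of that "2-dimensional key-value store" view (purely illustrative, not how Cassandra actually stores data): the outer key plays the role of the partition key, and rows within a partition are kept sorted by the clustering key.
from bisect import insort

# partition key -> list of (clustering_key, row), kept in clustering-key order
partitions = {}

def insert(partition_key, clustering_key, row):
    insort(partitions.setdefault(partition_key, []), (clustering_key, row))

# Clustering key (country, timestamp), as in the first layout above.
insert(1, ("US", "2020-10-01"), {"message": "a..."})
insert(1, ("JP", "2020-11-01"), {"message": "b..."})
insert(1, ("US", "2020-09-01"), {"message": "c..."})

# Rows inside partition 1 come back ordered by country, then timestamp:
for key, row in partitions[1]:
    print(key, row["message"])
# ('JP', '2020-11-01') b...
# ('US', '2020-09-01') c...
# ('US', '2020-10-01') a...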

Splitting Values Between 2 Employees Possible Group By

We have an incentive scheme that awards points to staff. I have written a stored procedure so that we can dynamically change these values from day to day; however, I am struggling to work out the best way to split the values between two different agents.
The Setup
The source table simply looks like below:
Department | Policy |Opening Advisor | Closing Advisor | Agg Pts | Aff Pts | Rnl Pts |
Sales A | LYJX01PC01 | Sally | Sally | 0 | 3 | 0 |
Sales A | MUMP01TW01 | Sally | John | 0 | 0 | 3 |
This table is actually built from two separate tables, the first being the Opening table and the second being the Closing table. They are joined using "Branch" (a column I've not shown here) and "Policy"; to keep it simple, Policy is what joins these two tables.
The Idea
The idea here is that the Points Per Policy are split 50/50 if the Opening Advisor and the Closing Advisor are different.
On line 1 the Opening and Closing Advisor is Sally, so for that policy she would receive all the points; however, on line 2 she opened the policy but John closed it.
This would split the points between the two of them, i.e. 3/2 = 1.5. The output would then be similar to the below:
The Output
Department | Policy | Advisor | Agg Pts | Aff Pts | Rnl Pts |
Sales A | LYJX01PC01 | Sally | 0 | 3 | 0 |
Sales A | MUMP01TW01 | Sally | 0 | 0 | 1.5 |
Sales A | MUMP01TW01 | John | 0 | 0 | 1.5 |
In my mind all I'm doing is grouping by Policy and Advisor and then splitting the points if the Opening and Closing Advisors differ, but when I try to put that down in SQL I just go round in circles, and it's one of those moments where I need some outside advice.
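Since the split rule itself is simple, here is a small Python sketch of the row-level logic described above (column names shortened; purely illustrative, not the T-SQL that would ultimately be needed):
def split_policy_points(row):
    """Return one output row per advisor, halving the points when opener != closer."""
    advisors = {row["opening"], row["closing"]}
    share = 1.0 / len(advisors)  # 1.0 if the same advisor, 0.5 if they differ
    return [
        {
            "department": row["department"],
            "policy": row["policy"],
            "advisor": advisor,
            "agg_pts": row["agg_pts"] * share,
            "aff_pts": row["aff_pts"] * share,
            "rnl_pts": row["rnl_pts"] * share,
        }
        for advisor in advisors
    ]

row = {"department": "Sales A", "policy": "MUMP01TW01",
       "opening": "Sally", "closing": "John",
       "agg_pts": 0, "aff_pts": 0, "rnl_pts": 3}
print(split_policy_points(row))  # two rows, Sally and John, with 1.5 Rnl Pts each
In SQL the same effect is typically achieved by unpivoting the Opening/Closing Advisor columns into rows and dividing each points column by the number of distinct advisors on the policy.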

How do I generate a random sample of groups, including all people in the group, where the group_id (but not the person_id) changes across time?

I have data that looks like this:
+----------+-----------+------------+------+
| group_id | person_id | is_primary | year |
+----------+-----------+------------+------+
| aaa1 | 1 | TRUE | 2000 |
| aaa2 | 1 | TRUE | 2001 |
| aaa3 | 1 | TRUE | 2002 |
| aaa4 | 1 | TRUE | 2003 |
| aaa5 | 1 | TRUE | 2004 |
| bbb1 | 2 | TRUE | 2000 |
| bbb2 | 2 | TRUE | 2001 |
| bbb3 | 2 | TRUE | 2002 |
| bbb1 | 3 | FALSE | 2000 |
| bbb2 | 3 | FALSE | 2001 |
+----------+-----------+------------+------+
The data design is such that
person_id uniquely identifies an individual across time
group_id uniquely identifies a group within each year, but may change from year to year
each group contains primary and non-primary individuals
My goal is three-fold:
Get a random sample, e.g. 10%, of primary individuals
Get the data on those primary individuals for all time periods they appear in the database
Get the data on any non-primary individuals that share a group with any of the primary individuals that were sampled in the first and second steps
I'm unsure where to start with this, since I need to first pull a random sample of primary individuals and get all observations for them. Presumably I can do this by generating a random number that's the same within any person_id, then sample based on that. Then, I need to get the list of group_id that contain any of those primary individuals, and pull all records associated with those group_id.
I don't know where to start with these queries and subqueries, and unfortunately, the interface I'm using to access this database can't link information across separate queries, so I can't pull a list of random person_id for primary individuals, then use that text file to filter group_id in a second query; I have to do it all in one query.
A quick way to get this done is:
select
    data_result.*
from
    data as data_groups
    join (
        -- pick the sampled primary individuals (just one here; see the note below)
        select person_id
        from data
        where is_primary
        group by person_id
        order by random()
        limit 1
    ) as selected_primary
        on (data_groups.person_id = selected_primary.person_id)
    -- then pull every record sharing a group and year with the sampled people
    join data as data_result
        on (data_groups.group_id = data_result.group_id
            and data_groups.year = data_result.year)
I even made a fiddle so you can test it.
The query is pretty straightforward, it gets the sample, then it gets their groups and then it gets all the users of those groups.
Please pay attention to the LIMIT 1 clause, which is there only because the data set was so small. You can put in a value, or a query that gets the right percentage (a rough sketch of that variation follows below).
If anyone has an answer using windowing functions I'd like to see that.
Note: next time please provide the schema and the data insertion so it is easier to answer.
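If you want an approximate percentage rather than LIMIT 1, one possible variation (assuming PostgreSQL, where random() is available, and keeping the shape of the query above) is to filter the grouped primary individuals by a random threshold; here it is wrapped in a query string for illustration:
sample_fraction = 0.10  # keep roughly 10% of primary individuals

query = f"""
SELECT data_result.*
FROM (
    SELECT person_id
    FROM data
    WHERE is_primary
    GROUP BY person_id
    HAVING random() < {sample_fraction}
) AS selected_primary
JOIN data AS data_groups
  ON data_groups.person_id = selected_primary.person_id
JOIN data AS data_result
  ON data_groups.group_id = data_result.group_id
 AND data_groups.year = data_result.year
"""
print(query)  # run it through whatever client you use to reach the database
With more than one sampled person you may also want SELECT DISTINCT data_result.*, since two sampled primaries sharing a group and year would otherwise duplicate that group's rows.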