Mixpanel: Merge duplicate people profiles and also merge events - merge

I have duplicate profiles due to switching of the identifier in the code. I would like to merge the duplicate profiles now and also merge the events / activity feed.
I got the API working and by calling
deduplicate_people(prop_to_match='$email',merge_props=True,case_sensitive=False,backup=True,backup_file=None)
Duplicates are in fact removed, but the events / activity feed is not merged. So I'd loose many events.
Is there a way to remove duplicates and merging events / activity feed at the same time?

Duplicates happen because some persons use ID and others email as distinct_id due to the change of identifier. The events are referenced by that ID or email to the corresponding person.
So here is what I ended up doing to re-create the identity mapping for people and their events:
I used Mixpanel's API (export_people / export_events) to create a backup of people and events. I wrote a script that creates a mapping "distinct_id <-> email" for people that use an actual ID as distinct_id and not an email (each person has an $email field regardless of the content of the $distinct_id).
Then I went over all exported events. For each event that had an ID as distinct_id I used the mapping to change that distinct_id to email. Updated events were saved in a JSON file. Thus creating the reference from events to person using email as distinct_id -- the events that got lost otherwise.
Then I went ahead and used the de-duplicate API from Mixpanel to delete all duplicates -- thus loosing some events. Now I imported the events from the step before, which gave me back those missing events.
Three open questions to consider before using this approach:
I believe events are not actually deleted on deduplication. So by importing them again there are probably duplicate events in the system that are just not referenced to a person and that may show up at some point.
the deduplication by $email did keep the people that use email as distinct_id and removed the ones with the actual ID. I don't know if this is true every time or may have been a coincidence. My approach will fail for persons that still use ID as distinct_id.
I suppose it's generally discouraged to hack around the distinct_id like that, because making a mistake may result in data loss. So make sure to get it right..

Related

Caching strategies for a chat channel app

I need to develop a sort of chat channel with fairly simple requirements:
Each message should be shown along with the username and avatar of the author
Each message can have multiple comments
Each message should be shown along with its first comment
The first comment (from above) should be shown with its author username / avatar
Now I have designed the mongodb schema to satisfy the above, and fetching the messages is a fairly heavy aggregate command, mainly because I have 3 "lookup" commands:
To fetch the author username/avatar (which is stored just as an id on the message object)
To fetch the first comment (which is stored just as an id on the message object)
To fetch the first comment author username/avatar (again, stored as id on the comment object).
Once the channel is fetched I would like to cache it, reading around I found the Fanout on Write - Cache approach quite appealing, it is basically an "active" cache of the most recent X messages, "active" because it will get updated on every new message to the channel (new message will be push and last message will be removed).
The main issues I am "stuck" with are due to the fact that the "Cache" holds the denormalized data (meaning it includes the username/avatar/"first comment" objects and not ids), hence:
What to do when a user changes his username/avatar
What to do when a user delete a message
What to do when a "first comment" of a message is deleted?
Is it a wrong approach to hold the "cache" in its denormalized form? Any better approach to build a performant chat channel?
Thx!

Message library

The scenario is: some user sending messages to some group of people.
I was thinking to create one ROW for that specific conversation into one CLASS. WHERE in that ROW contains information such "sender name", "receiver " and addition I have column (PFRelation) which connects this specific row to another class where all messages from the user to the receiver would be saved(vice-versa) into.
So this action will happen every time the user starts a new conversation.
The benefit from this prospective :
Privacy because the only convo that is being saved are only from the user and the receiver group.
Downside of this prospective:
We all know that parse only provide 30reqs/s for free which means that 1 min =1800 reqs. So every time I create a new class to keep track of the convo. Am I using a lot of requests ?
I am looking suggestions and thoughts for the ideal way before I implement this messenger library.
It sounds like you have come up with something that is similar to what I have used before to implement messaging in an app with Parse as a backend. It's also important to think about how your UI will be querying for data. In general, it's most important to ensure that it is very easy and fast to read data. For most social apps, the following quote from Facebook's engineering team on Haystack is particularly relevant.
Haystack is an object store that we designed for sharing photos on
Facebook where data is written once, read often, never modified, and
rarely deleted.
The crucial piece of information here is written once, read often, never modified, and rarely deleted. No matter what approach you decide to take, keep that in mind while engineering your solution. The approach that I have used before to implement a messaging system using Parse is described below.
Overview
Each row (object) of the Message class corresponds with an individual text, picture, or video message that was posted. Each Message belongs to a Group. A Group can be as small as 2 User (private conversation) or grow as large as you like.
The RecentMessage class is the solution I came up with to deal with quickly and easily populating the UI. Each RecentMessage object corresponds to each Group that a given User may belong. Each User in a Group will have their own RecentMessage object which is kept up to date using beforeSave/afterSave cloud code triggers. Whenever a new Message is created, in the afterSave trigger we want to update all of the RecentMessage objects that belong to the Group.
You will most likely have a table in your app which displays all of the conversations that the user is part of. This is easily achieved by querying for all of that user's RecentMessage objects which already contains all of the Group information needed to load the rest of the messages when selected and also contains the most recent message's data (hence the name) to display in the table. Alternatively, RecentMessage could contain a pointer to the most recent Message, however I decided that copying the data was a beneficial tradeoff since it streamlines future queries.
Message
group (pointer to group which message is part of)
user (pointer to user who created it)
text (string)
picture (optional file)
video (optional file)
RecentMessage
group (group pointer)
user (user pointer)
lastMessage (string containing the text of most recent Message)
lastUser (pointer to the User who posted the most recent Message)
Group
members (array of user pointers)
name or whatever other info you want
Security/Privacy
Security and privacy are imperative when creating messaging functionality in your app. Make sure to read through the Parse Engineering security blog posts, and take your time to let it all soak in: Part I, Part II, Part III, Part IV, Part V.
Most important in our case is Part III which describes ACLs, or Access Control Lists. Group objects will have an ACL which corresponds to all of its member User. RecentMessage objects will have a restricted read/write ACL to its owner User. Message objects will inherit the ACL of the Group to which they belong, allowing all of the Group members to read. I recommend disabling the write ACL in the afterSave trigger so messages cannot be modified.
General Remarks
With regards to Parse and the request limit, you need to accept that fact that you will very quickly surpass the 30 req/s free tier. As a general rule of thumb, it's much better to focus on building the best possible user experience than to focus too much on scalability. By and large, issues of scalability rarely come into play because most apps fail. Not saying that to be discouraging — just something to keep in mind to prevent you from falling into the trap of over-engineering at the cost of time :)

Concurrent create in OrientDB

Due to some vague reasons we are using replicated orient-db for our application.
It looks likely that we will have cases when a single record could be created twice. Here is how it happens:
we have entities user_document, they have ID of user and ID of document - both users and documents are managed in another application and stored in another database;
this another application on creating new document sends broadcast event (via rabbit-mq topic);
several instances of our application receive this message and create another user_document with the same pair of user_id and document_id.
If I understand correct, we should use UNIQUE index on the pair of these ids and rely upon distributed transactions.
However due to some reasons (we have another team writitng layer between application and database) we probably could not use UNIQUE though it may sound stupid :)
What are our chances then?
Could we, for example, allow all instances to create redundant records and immediately after creation select by user_id and document_id and if more than one were found, delete ones with lexicografically higher own id?
Sure you can do in that way.
You can try to use something like
DELETE FROM (SELECT FROM user_document where user_id=? and document_id=? skip 1)
However, take a notice that without creation of index this approach may consume some additional resources on server, and you might got a significant slow down if user_document have big amount of records.

Creation Concurrency with CQRS and EventStore

Baseline info:
I'm using an external OAuth provider for login. If the user logs into the external OAuth, they are OK to enter my system. However this user may not yet exist in my system. It's not really a technology issue, but I'm using JOliver EventStore for what it's worth.
Logic:
I'm not given a guid for new users. I just have an email address.
I check my read model before sending a command, if the user email
exists, I issue a Login command with the ID, if not I issue a
CreateUser command with a generated ID. My issue is in the case of a new user.
A save occurs in the event store with the new ID.
Issue:
Assume two create commands are somehow issued before the read model is updated due to browser refresh or some other anomaly that occurs before consistency with the read model is achieved. That's OK that's not my problem.
What Happens:
Because the new ID is a Guid comb, there's no chance the event store will know that these two CreateUser commands represent the same user. By the time they get to the read model, the read model will know (because they have the same email) and can merge the two records or take some other compensating action. But now my read model is out of sync with the event store which still thinks these are two separate entities.
Perhaps it doesn't matter because:
Replaying the events will have the same effect on the read model
so that should be OK.
Because both commands are duplicate "Create" commands, they should contain identical information, so it's not like I'm losing anything in the event store.
Can anybody illuminate how they handled similar issues? If some compensating action needs to occur does the read model service issue some kind of compensation command when it realizes it's got a duplicate entry? Is there a simpler methodology I'm not considering?
You're very close to what I'd consider a proper possible solution. The scenario, if I may summarize, is somewhat like this:
Perform the OAuth-entication.
Using the read model decide between a recurring visitor and a new visitor, based on the email address.
In case of a new visitor, send a RegisterNewVisitor command message that gets handled and stored in the eventstore.
Assume there is some concurrency going on that, for the same email address, causes two RegisterNewVisitor messages, each containing what the system thinks is the key associated with the email address. These keys (guids) are different.
Detect this duplicate key issue in the read model and merge both read model records into one record.
Now instead of merging the records in the read model, why not send a ResolveDuplicateVisitorEmailAddress { Key1, Key2 } towards your domain model, leaving it up to the domain model (the codified form of the business decision to be taken) to resolve this issue. You could even have a dedicated read model to deal with these kind of issues, the other read model will just get a kind of DuplicateVisitorEmailAddressResolved event, and project it into the proper records.
Word of warning: You've asked a technical question and I gave you a technical, possible solution. In general, I would not apply this technique unless I had some business indicator that this is worth investing in (what's the frequency of a user logging in concurrently for the first time - maybe solving it this way is just a way of ignoring the root cause (flakey OAuth, no register new visitor process in place, etc)). There are other technical solutions to this problem but I wanted to give you the one closest to what you already have in place. They range from registering new visitors sequentially to keeping an in-memory projection of the visitors not yet in the read model.

How to get list of aggregates using JOliviers's CommonDomain and EventStore?

The repository in the CommonDomain only exposes the "GetById()". So what to do if my Handler needs a list of Customers for example?
On face value of your question, if you needed to perform operations on multiple aggregates, you would just provide the ID's of each aggregate in your command (which the client would obtain from the query side), then you get each aggregate from the repository.
However, looking at one of your comments in response to another answer I see what you are actually referring to is set based validation.
This very question has raised quite a lot debate about how to do this, and Greg Young has written an blog post on it.
The classic question is 'how do I check that the username hasn't already been used when processing my 'CreateUserCommand'. I believe the suggested approach is to assume that the client has already done this check by asking the query side before issuing the command. When the user aggregate is created the UserCreatedEvent will be raised and handled by the query side. Here, the insert query will fail (either because of a check or unique constraint in the DB), and a compensating command would be issued, which would delete the newly created aggregate and perhaps email the user telling them the username is already taken.
The main point is, you assume that the client has done the check. I know this is approach is difficult to grasp at first - but it's the nature of eventual consistency.
Also you might want to read this other question which is similar, and contains some wise words from Udi Dahan.
In the classic event sourcing model, queries like get all customers would be carried out by a separate query handler which listens to all events in the domain and builds a query model to satisfy the relevant questions.
If you need to query customers by last name, for instance, you could listen to all customer created and customer name change events and just update one table of last-name to customer-id pairs. You could hold other information relevant to the UI that is showing the data, or you could simply hold IDs and go to the repository for the relevant customers in order to work further with them.
You don't need list of customers in your handler. Each aggregate MUST be processed in its own transaction. If you want to show this list to user - just build appropriate view.
Your command needs to contain the id of the aggregate root it should operate on.
This id will be looked up by the client sending the command using a view in your readmodel. This view will be populated with data from the events that your AR emits.