I am running Neo4j 3.0.6, and am importing large amounts of data into a fresh instance from multiple sources. I enforce uniqueness using the following constraint:
CREATE CONSTRAINT ON (n:Person) ASSERT n.id IS UNIQUE
I then import data and relationships from multiple sources on multiple threads:
MERGE (mother:Person{id: 22})
MERGE (father:Person{id: 55})
MERGE (self:Person{id: 128})
SET self += {name: "Alan"}
MERGE (self)-[:MOTHER]->(mother)
MERGE (self)-[:FATHER]->(father)
Meanwhile, on another thread, but still on the same Neo4j server and bolt endpoint, I'll be importing the rest of the data:
MERGE (husband:Person{id: 55})
MERGE (self:Person{id: 22})
SET self += {name: "Phyllis"}
MERGE (self)-[:HUSBAND]->(husband)
MERGE (wife:Person{id: 22})
MERGE (self:Person{id: 55})
SET self += {name: "Angel"}
MERGE (self)-[:WIFE]->(wife)
MERGE (brother:Person{id: 128})
MERGE (self:Person{id: 92})
SET self += {name: "Brian"}
MERGE (self)-[:BROTHER]->(brother)
MERGE (self)<-[:BROTHER]-(brother)
Finally, if I run the constraint command again, I get this:
Unable to create CONSTRAINT ON ( Person:Person ) ASSERT Person.id IS UNIQUE:
Multiple nodes with label `Person` have property `id` = 55:
node(708823)
node(708827)
There is no guarantee which order the records will be processed in. What ends up happening is that multiple nodes for the same (:Person {id}) get created, but only one gets populated with name data.
It appears there is a race condition in Neo4j: if two MERGEs for the same id happen at the same time, both create a node. Is there a way to avoid this race condition? Is there a way to establish the necessary locks?
Possible duplicate: Neo4J 2.1.3 Uniqueness Constraint Being Violated, Is This A Bug? But that question is about CREATE, and this Google Groups answer indicates that CREATE behaves differently from MERGE with respect to constraints.
I understand that you can get an implicit lock on some node and then use that for synchronization, but that effectively serializes the processing, so the imports won't really run concurrently.
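For reference, the usual way to get that implicit lock is to write a property on a shared lock node at the start of each transaction; the property write takes that node's write lock, so competing imports queue up behind it. A sketch, assuming a single :ImportLock node has been created up front (the label and property are made up):
MATCH (lock:ImportLock {name: 'person-import'})
SET lock.locked = true
// the SET takes a write lock on the lock node for the rest of this transaction,
// so the MERGE (:Person {id: ...}) statements that follow run one transaction at a time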
Overall I think a better approach would be to abandon processing the same kind of data in multiple threads and just do a single import on one thread to MERGE :Persons and set their properties.
After that's imported, you can process the creation of your relationships, with the understanding that you'll be MATCHing instead of MERGEing on :Persons.
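So the relationship pass would look something like this (a sketch, reusing the ids from the question):
MATCH (self:Person {id: 128}), (mother:Person {id: 22}), (father:Person {id: 55})
MERGE (self)-[:MOTHER]->(mother)
MERGE (self)-[:FATHER]->(father)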
I have an Apache Kafka topic containing numerous record types.
For example:
UserCreated
UserUpdated
UserDeleted
AnotherRecordType
...
I wish to implement CDC on the three listed User* record types such that at the end, I have an up-to-date KTable with all user information.
How can I do this in ksqlDB? Since, as far as I know, Debezium and other CDC connectors also source their data from a single topic, I at least know it should be possible.
I've been reading through the Confluent docs for a while now, but I can't seem to find anything quite pertinent to my use case (CDC using existing topic). If there is anything I've overlooked, I would greatly appreciate a link to the relevant documentation as well.
I assume that, at the very least, the records must have the same key for ksqlDB to be able to match them. So my questions boil down to:
How would I tell ksqlDB which records are inserts, which are updates, and which are deletes?
Is the key matching a hard requirement, or are there other join/match predicates that we can use?
One possibility that I can think of is basically how CDC already does it: treat each incoming record as a new entry so that I can have something like a slowly changing dimension in the KTable, grouping on the key and selecting entries with e.g. the latest timestamp.
So, is something like the following:
CREATE TABLE users AS
SELECT user.user_id,
latest_by_offset(user.name) AS name,
latest_by_offset(user.email),
CASE WHEN record.key = UserDeleted THEN true ELSE FALSE END,
user.timestamp,
...
FROM users
GROUP BY user.user_id
EMIT CHANGES;
possible (using e.g. ROWKEY for record.key)? If not, how does e.g. Debezium do it?
The general pattern is to not have different schema types; just User. Then the first record for any unique key (userid, for example) is an insert. After that, any non-null value for the same key is an update (generally requiring all fields to be part of the value, effectively doing a "replace" operation in the table). Deletes are done by sending a null value for the key (a tombstone event).
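With that single-schema layout the KTable falls out almost for free (a sketch; the topic and column names are assumed): declaring a table over the topic keeps the latest value per key, and the null-value tombstones become deletes.
CREATE TABLE users (user_id VARCHAR PRIMARY KEY, name VARCHAR, email VARCHAR)
  WITH (KAFKA_TOPIC='users', VALUE_FORMAT='JSON');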
If you have multiple schemas, it might be better to create a new stream that nulls out any of the delete events, unifies the creates and updates into a common schema containing the information you want, and filters out the event types you want to ignore.
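A sketch of that unifying stream, with made-up topic and column names (user_id, record_type, name, email):
CREATE STREAM user_events (user_id VARCHAR KEY, record_type VARCHAR, name VARCHAR, email VARCHAR)
  WITH (KAFKA_TOPIC='user-events', VALUE_FORMAT='JSON');

CREATE STREAM user_changes AS
  SELECT user_id,
         CASE WHEN record_type = 'UserDeleted' THEN CAST(NULL AS VARCHAR) ELSE name END AS name,
         CASE WHEN record_type = 'UserDeleted' THEN CAST(NULL AS VARCHAR) ELSE email END AS email
  FROM user_events
  WHERE record_type <> 'AnotherRecordType'
  EMIT CHANGES;
The aggregated table from the question (GROUP BY user_id with latest_by_offset) can then be built on top of user_changes.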
how does e.g. Debezium do it?
For consuming data coming from Debezium topics, you can use a transform to "extract the new record state". It doesn't create any tables for you.
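That transform is Debezium's ExtractNewRecordState single message transform; it is applied in the connector (or sink) configuration, roughly like this fragment (the alias "unwrap" is just a conventional name, and the rest of the connector config is omitted):
"transforms": "unwrap",
"transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
"transforms.unwrap.drop.tombstones": "false"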
When implementing a system that creates tasks to be resolved by some workers, my idea would be to create a table holding the task definition along with a status, e.g. for document review we'd have something like reviewId, documentId, reviewerId, reviewTime.
When documents are uploaded to the system we'd just store the documentId along with a generated reviewId and leave the reviewerId and reviewTime empty. When the next reviewer comes along and starts the review we'd set his id and the current time to mark the job as "in progress" (I deliberately skip the case where the reviewer takes a long time, or dies during the review).
When implementing such a use case in e.g. PostgreSQL we could use:
UPDATE review
SET reviewerId = :reviewerId, reviewTime = :reviewTime
WHERE reviewId = (SELECT reviewId FROM review
                  WHERE reviewerId IS NULL AND reviewTime IS NULL
                  LIMIT 1
                  FOR UPDATE SKIP LOCKED)
RETURNING reviewId, documentId, reviewerId, reviewTime
i.e. update the first non-taken row, using SKIP LOCKED to skip any rows that are already being processed.
But when moving from this native solution to JDBC and beyond, I'm having trouble implementing it:
Spring Data JPA and Spring Data JDBC don't allow a @Modifying query to return anything other than void/boolean/int, which forces us to perform 2 queries in a single transaction - one to select the first pending row, and a second one to update it
one alternative would be to use a stored procedure, but I really hate the idea of keeping such logic so far away from the code
another alternative would be to use a persistent queue and skip the database altogether, but that introduces additional infrastructure components that need to be maintained and learned. Any suggestions are welcome though.
Am I missing something? Is it possible to have it all or do we have to settle for multiple queries or stored procedures?
Why doesn't Spring Data support returning an entity from modifying queries?
Because it seems like a rather special thing to do and Spring Data JDBC tries to focus on the essential stuff.
Is it possible to have it all or do we have to settle for multiple queries or stored procedures?
It is certainly possible to do this.
You can implement a custom method using an injected JdbcTemplate.
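For example, a custom repository fragment along these lines (just a sketch: the fragment and record names are made up, the SQL follows the question, and PostgreSQL's UPDATE ... RETURNING is read back through the RowMapper):
import java.sql.Timestamp;
import java.util.Optional;
import org.springframework.jdbc.core.JdbcTemplate;

// Value object for the claimed row; field names mirror the question's columns.
record ClaimedReview(long reviewId, long documentId, long reviewerId, Timestamp reviewTime) {}

// The custom fragment the main Spring Data repository interface would extend.
interface ReviewClaimRepository {
    Optional<ClaimedReview> claimNextReview(long reviewerId);
}

class ReviewClaimRepositoryImpl implements ReviewClaimRepository {

    private final JdbcTemplate jdbcTemplate;

    ReviewClaimRepositoryImpl(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @Override
    public Optional<ClaimedReview> claimNextReview(long reviewerId) {
        // Atomically claim one pending row (skipping rows locked by other workers)
        // and read it back, all in a single round trip.
        String sql = """
                UPDATE review
                   SET reviewerId = ?, reviewTime = now()
                 WHERE reviewId = (SELECT reviewId
                                     FROM review
                                    WHERE reviewerId IS NULL AND reviewTime IS NULL
                                    LIMIT 1
                                      FOR UPDATE SKIP LOCKED)
                RETURNING reviewId, documentId, reviewerId, reviewTime
                """;
        return jdbcTemplate.query(sql,
                        (rs, rowNum) -> new ClaimedReview(
                                rs.getLong("reviewId"),
                                rs.getLong("documentId"),
                                rs.getLong("reviewerId"),
                                rs.getTimestamp("reviewTime")),
                        reviewerId)
                .stream()
                .findFirst();
    }
}
The usual custom-fragment wiring applies: the Impl suffix lets Spring Data pick this up next to the generated query methods once the main repository interface extends ReviewClaimRepository.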
I notice that multiple requests to a record can cause writes to be overwritten. I am using Mongo, btw.
I have a schema like:
Trip { id, status, tagged_friends }
where tagged_friends is an association to Users collection
When I make 2 calls to update trips in close succession (in this case I am making 2 API calls from the client - actually automated tests), it's possible for them to interfere, since they both call trip.save().
Update 1: update the tagged_friends association
Update 2: update the status field
So I am thinking these 2 updates should only save the "dirty" fields. I think I can do that with Trips.update() rather than trip.save()? But the problem is that I can't seem to use update() to update an association - that doesn't appear to work.
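Concretely, what I have in mind is something along these lines (a sketch with plain MongoDB update operators; tripId and friendUserId are placeholders, and the idea is that each call only touches the field it names):
// status update from one request
db.trips.updateOne({ _id: tripId }, { $set: { status: "completed" } });
// tagged_friends update from the other request, adding a User id to the association array
db.trips.updateOne({ _id: tripId }, { $addToSet: { tagged_friends: friendUserId } });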
Or perhaps there's a better way to do this?
For somewhat vague reasons we are using a replicated OrientDB for our application.
It looks likely that we will have cases where a single record could be created twice. Here is how it happens:
we have user_document entities that hold a user ID and a document ID - both users and documents are managed by another application and stored in another database;
on creating a new document, that other application sends a broadcast event (via a RabbitMQ topic);
several instances of our application receive this message and each creates a user_document with the same pair of user_id and document_id.
If I understand correctly, we should use a UNIQUE index on this pair of IDs and rely upon distributed transactions.
However, for certain reasons (we have another team writing the layer between the application and the database) we probably cannot use UNIQUE, though that may sound stupid :)
What are our chances then?
Could we, for example, allow all instances to create redundant records, and immediately after creation select by user_id and document_id and, if more than one record is found, delete the ones whose own id is lexicographically higher?
Sure, you can do it that way.
You can try something like:
DELETE FROM (SELECT FROM user_document WHERE user_id = ? AND document_id = ? SKIP 1)
However, note that without creating an index this approach may consume additional resources on the server, and you might see a significant slowdown if user_document has a large number of records.
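If a plain (non-unique) index is allowed, something like this keeps that lookup cheap (a sketch; the index name is made up):
CREATE INDEX user_document_user_doc ON user_document (user_id, document_id) NOTUNIQUE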
I am seeing the following error in Jonathan Oliver's EventStore:
ERROR: 23505: duplicate key value violates unique constraint "ix_commits_revisions"
Any ideas why this is happening?
Assuming the index is as I googled it:
CREATE UNIQUE INDEX IX_Commits_Revisions ON Commits (
StreamId, StreamRevision, Items);
Two Saves have written changes against the same stream revision, which represents an optimistic concurrency violation.
Typically this would be converted by a Common Domain (or similar) layer to an EventStore ConcurrencyException.
The solution is to re-apply the Command against a fresh Load of the events in the stream.
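A rough sketch of that retry loop in C# (the repository, aggregate and command types here are illustrative stand-ins, not the exact EventStore/CommonDomain API):
using System;

// Minimal stand-ins so the sketch compiles; in a real app these come from the
// EventStore / CommonDomain packages and your own domain model.
public class ConcurrencyException : Exception { }

public interface IRepository
{
    T GetById<T>(Guid id);
    void Save(object aggregate, Guid commitId);
}

public class User
{
    public void Rename(string newName) { /* raises a UserRenamed event in the real aggregate */ }
}

public record RenameUserCommand(Guid UserId, string NewName);

public class RenameUserHandler
{
    private readonly IRepository repository;

    public RenameUserHandler(IRepository repository) => this.repository = repository;

    public void Handle(RenameUserCommand command)
    {
        while (true)
        {
            var user = repository.GetById<User>(command.UserId); // fresh Load of the stream
            user.Rename(command.NewName);                        // re-apply the Command
            try
            {
                repository.Save(user, Guid.NewGuid());           // attempt the commit
                return;
            }
            catch (ConcurrencyException)
            {
                // another Save won the race for this stream revision; loop and retry
            }
        }
    }
}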
If you are caching the IEventRepository, you shouldn't be, as everyone with write access to the database can equally write into the event stream.
How do I know all this? The Readme documents in the NuGet package explain the basis behind this very clearly and you're stealing from yourself/your employer if you don't read and re-read them until you can work this out yourself!