Neo4j: Creating nodes via Cypher/REST is slow

I am trying to create/update nodes via the REST API with Cypher's MERGE statement. Each node carries about 1 KB of attributes in total, and I create/update one node per request. (I know there are other ways to create lots of nodes in a batch, but that is not the question here.)
I use Neo4j Community 2.1.6 on Windows Server 2008 R2 Enterprise (24 CPUs, 64 GB RAM), and the database directory resides on a SAN drive. I get a rate of 4-6 nodes per second; in other words, a single create or update takes around 200 ms. This seems rather slow to me.
The query looks like this:
MERGE (a:TYP1 { name: {name}, version: {version} })
SET
a.ATTR1={param1},
a.ATTR2={param2},
a.ATTR3={param3},
a.ATTR4={param4},
a.ATTR5={param5}
return id(a)
There is an index on name, version and two of the attributes.
Why does it take so long? And what can I try to improve the situation?
I imagine one problem could be that every request has to create a new connection. Is there a way to keep the HTTP connection open across multiple requests?

I'm pretty sure Cypher can only use one index per query per label, so depending on your data the index usage might not be efficient.
As for a persistent connection, that is possible, though it depends on the library you're using to connect to the REST API. In the Ruby neo4j gem we use the Faraday gem, which has a NetHttpPersistent adapter.
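If you're hitting the REST API from Node.js, a minimal sketch along these lines keeps one TCP connection alive across requests (the keep-alive agent and the simplified MERGE are the only assumptions; the endpoint is Neo4j 2.x's transactional Cypher endpoint):
// Sketch only: reuse a single keep-alive connection for many single-node requests.
var http = require('http');
var request = require('request');

var keepAliveAgent = new http.Agent({ keepAlive: true });

function mergeNode(params, done) {
    request.post({
        uri: 'http://localhost:7474/db/data/transaction/commit',
        agent: keepAliveAgent,   // reuse the TCP connection instead of reconnecting per request
        json: {
            statements: [{
                statement: 'MERGE (a:TYP1 { name: {name}, version: {version} }) ' +
                           'SET a.ATTR1 = {param1} ' +   // remaining attributes omitted for brevity
                           'RETURN id(a)',
                parameters: params
            }]
        }
    }, function (err, res, body) {
        done(err, body);
    });
}
Keep-alive removes the per-request connection setup, but each request is still its own transaction, so the per-commit overhead on the server remains; batching several statements into one request (as in the answer below) avoids that as well.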

The index is only used when you MERGE on ONE attribute.
If you need to merge on both, create a compound property, index it (or better, use a constraint) and merge on that compound property.
Use ON CREATE SET, otherwise you (over-)write the attributes every time, even if you didn't actually create the node.
Adapted Statement
MERGE (a:TYP1 { name_version: {name_version} })
ON CREATE SET
a.version = {version},
a.name = {name},
a.ATTR1={param1},
a.ATTR2={param2},
a.ATTR3={param3},
a.ATTR4={param4},
a.ATTR5={param5}
return id(a)
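To make the compound-property approach concrete, here is a minimal Node.js sketch against the transactional endpoint (the '|' separator for the compound key is an illustrative assumption; any separator that cannot appear in the values will do):
// Sketch only: build the compound key client-side and MERGE on it.
// One-time setup (run once beforehand): CREATE CONSTRAINT ON (t:TYP1) ASSERT t.name_version IS UNIQUE
var request = require('request');

function upsertNode(node, done) {
    var singleCypher = {
        statement: 'MERGE (a:TYP1 { name_version: {name_version} }) ' +
                   'ON CREATE SET a.name = {name}, a.version = {version}, a.ATTR1 = {param1} ' +  // ATTR2..ATTR5 set the same way
                   'RETURN id(a)',
        parameters: {
            name_version: node.name + '|' + node.version,   // assumed separator scheme
            name: node.name,
            version: node.version,
            param1: node.param1
        }
    };
    request.post({
        uri: 'http://localhost:7474/db/data/transaction/commit',
        json: { statements: [singleCypher] }
    }, function (err, res, body) {
        done(err, body);
    });
}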

This is an example of how you can execute a batch of Cypher queries from Node.js in a single round trip to Neo4j.
To run it:
get Node.js installed (if you don't have it already)
get a token from https://developers.facebook.com/tools/explorer giving you access to user_groups
run it as > node {yourFileName}.js {yourToken}
Prerequisites:
var request = require("request");
var graph = require('fbgraph');
graph.setAccessToken(process.argv[2]);
function now() {
    var instant = new Date();
    return instant.getHours()
        + ':' + instant.getMinutes()
        + ':' + instant.getSeconds()
        + '.' + instant.getMilliseconds();
}
Get Facebook data:
graph.get('me?fields=groups,friends', function (err, res) {
    if (err) {
        console.log(err);
        throw now() + ' Could not get groups from Facebook';
    }
Create Cypher statements:
    var batchCypher = [];
    res.groups.data.forEach(function (group) {
        var singleCypher = {
            "statement": "CREATE (n:group{group}) RETURN n, id(n)",
            "parameters": { "group": group }
        };
        batchCypher.push(singleCypher);
Run them one by one:
        var fromNow = now();
        request.post({
            uri: "http://localhost:7474/db/data/transaction/commit",
            json: { statements: [singleCypher] }
        }, function (err, res) {
            if (err) {
                console.log('Could not commit ' + group.name);
                throw err;
            }
            console.log('Used ' + fromNow + ' - ' + now() + ' to commit ' + group.name);
            res.body.results.forEach(function (cypherRes) {
                console.log(cypherRes.data[0].row);
            });
        });
    });
Run them in batch:
    var fromNow = now();
    request.post({
        uri: "http://localhost:7474/db/data/transaction/commit",
        json: { statements: batchCypher }
    }, function (err, res) {
        if (err) {
            console.log('Could not commit the batch');
            throw err;
        }
        console.log('Used ' + fromNow + ' - ' + now() + ' to commit the batch');
    });
});
The log shows that a transaction for 5 groups is significantly slower than a transaction for 1 group, but significantly faster than 5 transactions of 1 group each.
Used 20:38:16.19 - 20:38:16.77 to commit Voiture occasion Belgique
Used 20:38:16.29 - 20:38:16.82 to commit Marches & Randonnées
Used 20:38:16.31 - 20:38:16.86 to commit Vlazarus
Used 20:38:16.34 - 20:38:16.87 to commit Wijk voor de fiets
Used 20:38:16.33 - 20:38:16.91 to commit Niet de bestemming maar de route maakt de tocht goed.
Used 20:38:16.35 - 20:38:16.150 to commit the batch
I just read your comment, Andreas, so it is not applicable for you, but you might use it to find out whether the time is spent in the communication or in the updates.

Related

loopback-connector-postgresql: jsonpath in the "where" condition

The model is stored in postgresql. Something like:
{
id: <serial>
data: <json> {
someIds: [<int>, ...]
}
}
How can I add a condition like jsonb_path_match(data::jsonb, 'exists($.someIds[*] ? (# == 3))') to the filter (where)?
In this case, the value 3 in (# == 3) should be determined by the user.
loopback-connector-postgresql does not support the JSON/JSONB data type yet. There is an open pull request to contribute such a feature, but it was never finished by the author - see #401.
As a workaround, you can execute a custom SQL query to perform a jsonb_path_match-based search of your data.
Instructions for LoopBack 3: https://loopback.io/doc/en/lb3/Executing-native-SQL.html
dataSource.connector.execute(sql_stmt, params, callback);
Instructions for LoopBack 4: https://loopback.io/doc/en/lb4/apidocs.repository.defaultcrudrepository.execute.html
const result = await modelRepository.execute(sql_stmt, params, options);
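As a concrete illustration, here is a minimal LoopBack 3 style sketch of that workaround (the data source name, table name and column names are assumptions based on the model above; note that inside a PostgreSQL jsonpath filter the current item is referenced with @, and the user-supplied value is passed as a jsonpath variable rather than concatenated into the query):
// Sketch only: raw SQL with a parameterized jsonb_path_match filter.
const ds = app.dataSources.postgres;            // assumed data source name
const wantedId = 3;                             // value supplied by the user

const sql =
    "SELECT id, data FROM mymodel " +           // assumed table name
    "WHERE jsonb_path_match(" +
    "  data::jsonb, " +
    "  'exists($.someIds[*] ? (@ == $val))', " + // @ is the current array element
    "  jsonb_build_object('val', $1::int)" +     // bind the user value as jsonpath variable $val
    ")";

ds.connector.execute(sql, [wantedId], (err, rows) => {
    if (err) throw err;
    console.log(rows);
});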

How to use Redis and MongoDb together

I have to build a web application to track user activity, and I'm having trouble understanding how to use Redis for tracking what users are doing online and Mongo for storing that data for later analysis.
I could just use Mongo, but I'm worried about the large number of calls needed to follow what each user is doing. So I was thinking of writing the online data to Redis and moving it to Mongo once it becomes old. By old I mean data that is no longer relevant to the user being online.
I also thought about a gateway between Redis and Mongo; could that be RabbitMQ?
Any suggestions?
Should I just use Mongo?
Just an example of the code I wrote:
Front-end (Angular application / Socket.io):
setInterval(function () {
    socket.emit('visitor-data', {
        referringSite: document.referrer,
        browser: navigator.sayswho,
        os: navigator.platform,
        page: location.pathname
    });
}, 3000);
Back-end (Node.js / Socket.io):
socket.on('visitor-data', function (data) {
    visitorsData[socket.id] = data;
});
visitorsData is just kept in memory, but I need to build a scalable application, so I can't store the data this way anymore.
Then I have some functions like this for computing the data:
function computeRefererCounts() {
    var referrerCounts = {};
    for (var key in visitorsData) {
        var referringSite = visitorsData[key].referringSite || '(direct)';
        if (referringSite in referrerCounts) {
            referrerCounts[referringSite]++;
        } else {
            referrerCounts[referringSite] = 1;
        }
    }
    return referrerCounts;
}
Just some numbers; I estimated something like:
1 million users per day
15 million activities per day
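Not a full answer, but a minimal sketch of the buffer-in-Redis-then-flush-to-Mongo idea described above (the module choices, ioredis and the official mongodb driver, the key/collection names and the 60-second threshold for "old" are all assumptions):
// Sketch only: keep "live" activity in Redis, periodically move stale entries to Mongo.
const Redis = require('ioredis');
const { MongoClient } = require('mongodb');

const redis = new Redis();                 // localhost:6379 by default
const STALE_AFTER_MS = 60 * 1000;          // treat data as "old" after 60s (assumption)

// Called on every visitor-data event: store the latest payload and remember when we saw it.
async function recordActivity(socketId, data) {
    await redis.hmset('visitor:' + socketId, 'payload', JSON.stringify(data), 'seenAt', Date.now());
    await redis.zadd('visitors:by-time', Date.now(), socketId);
}

// Called periodically: move everything older than the threshold into Mongo, then drop it from Redis.
async function flushStale(db) {
    const cutoff = Date.now() - STALE_AFTER_MS;
    const staleIds = await redis.zrangebyscore('visitors:by-time', 0, cutoff);
    for (const id of staleIds) {
        const entry = await redis.hgetall('visitor:' + id);
        if (entry.payload) {
            await db.collection('activities').insertOne({
                socketId: id,
                seenAt: new Date(Number(entry.seenAt)),
                data: JSON.parse(entry.payload)
            });
        }
        await redis.del('visitor:' + id);
        await redis.zrem('visitors:by-time', id);
    }
}

MongoClient.connect('mongodb://localhost:27017').then((client) => {
    const db = client.db('analytics');
    setInterval(() => flushStale(db).catch(console.error), 10 * 1000);
});
The sorted set keyed by last-seen time is what lets the flush job find stale entries without scanning every visitor key.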

Mongo - new vs processed approach

I am new to Mongo and have gotten close to where I want to be after 3 days of banging my head against the keyboard, but now I think I may just be misunderstanding certain key concepts:
What I am trying to do:
I have a node script that is pulling in feed items from various sources very frequently and storing them (title, link, origin, processed:false)
I have another script pulling out records at random, one at a time, using them, and updating processed:true
End Goal: Items should be unique by title - if it's been seen before it should not be written to DB, and once it's been processed one time, it should never be processed again.
INSERT SCRIPT:
key = {'title':title};
data = {'origin':origin, 'title':title, 'original_link':original_url, 'processed':false};
collection.update(key, data, {upsert:true}, function(err, doc) { ...
READ SCRIPT:
collection.findOne({processed: false}, function (err, doc) {
    if (err) throw err;
    logger.info("Read out the following item from mongodb:...");
    console.dir(doc);
    var thisId = doc._id;
    markProcessed(thisId);
});
var markProcessed = function (id) {
    collection.update({ _id: id },
        {
            $set: { 'processed': true }
        }, function (err, doc) {
            if (err) throw err;
            logger.info("Marked record:" + id + " as processed");
            console.dir(doc);
        }
    )
};
I've tried using collection.ensureIndex({'title':1}, {unique:true}) as well, without success.
Because the two scripts run in parallel, the read script ends up repeating work on already-processed records, and although the markProcessed function was working all day yesterday, it miraculously does not work today :)
I would very much appreciate any guidance.
There is a problem with your insert script. When you use collection.update and there is already a document with the same key in the database, that document will be overwritten with the new one (which, among other things, resets processed back to false). A unique index doesn't prevent this, because at no point are there two documents with the same title in the collection at the same time.
When you don't want to overwrite an existing record, use collection.insert, which will fail when the inserted document violates a unique index.
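A minimal sketch of that suggestion (error code 11000 is MongoDB's duplicate-key error; the rest mirrors the fields from the original insert script):
// Sketch only: rely on the unique index on title and use a plain insert.
// Assumes the index exists: collection.ensureIndex({ title: 1 }, { unique: true })
var doc = { origin: origin, title: title, original_link: original_url, processed: false };

collection.insert(doc, function (err, result) {
    if (err && err.code === 11000) {
        // Duplicate key: this title was stored before, so leave the existing record alone.
        return;
    }
    if (err) throw err;
    logger.info("Stored new item: " + title);
});
(If you prefer to keep the upsert, the $setOnInsert operator achieves a similar effect by only writing the fields when the document is actually created, but the plain insert above matches the suggestion.)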

MongoDB Social Network Adding Followers

I'm implementing a social network in MongoDB and I need to keep track of Followers and Following for each User. When I search for Users I want to display a list like Facebook with the User Name, Picture and number of Followers & Following. If I just wanted to display the User Name and Picture (info that doesn't change) it would be easy, but I also need to display the number of Followers & Following (which changes fairly regularly).
My current strategy is to embed the People a User follows into each User Document:
firstName: "Joe",
lastName: "Bloggs",
follows: [
{
_id: ObjectId("520534b81c9aac710d000002"),
profilePictureUrl: "https://pipt.s3.amazonaws.com/users/xxx.jpg",
name: "Mark Rogers",
},
{
_id: ObjectId("51f26293a5c5ea4331cb786a"),
name: "The Palace Bar",
profilePictureUrl: "https://s3-eu-west-1.amazonaws.com/businesses/xxx.jpg",
}
]
The question is - What is the best strategy to keep track of the number of Followers & Following for each User?
If I include the number of Follows / Following as part of the embedded document i.e.
follows: [
    {
        _id: ObjectId("520534b81c9aac710d000002"),
        profilePictureUrl: "https://pipt.s3.amazonaws.com/users/xxx.jpg",
        name: "Mark Rogers",
        followers: 10,
        following: 400
    }
]
then every time a user follows someone, multiple updates across all the embedded documents are required.
Since the consistency of this data isn't really important (i.e. showing someone I have 10 instead of 11 followers isn't the end of the world), I can queue these updates. Is this approach OK, or can anyone suggest a better approach?
You're on the right track. Think about which calculation is performed more - determining the number of followers/following or changing number of followers/following? Even if you're caching the output of the # of followers/following calculation it's still going to be performed one or two orders of magnitude more often than changing the number.
Also, think about the opposite. If you really need to display the number of followers/following for each of those users, you'll have to then do an aggregate on each load (or cache it somewhere, but you're still doing a lot of calcs).
Option 1: Cache the number of followers/following in the embedded document.
Upsides: Can display stats in O(1) time
Downsides: Requires O(N) time to follow/unfollow
Option 2: Count the number of followers/following on each page view (or cache invalidation)
Upsides: Can follow/unfollow in O(1) time
Downsides: Requires O(N) time to display
Add in the fact that follower/following stats can be eventually consistent, whereas the counts have to be displayed on demand, and I think it's a pretty easy decision to cache them.
I've gone ahead and implemented the followers/following update based on the strategy recommended by Mason (Option 1). Here's my code in Node.js and Mongoose, using the async.js waterfall pattern, in case anyone is interested or has any opinions. I haven't implemented queuing yet, but the plan would be to farm most of this off to a queue.
async.waterfall([
    function (callback) {
        /** find & update the person we are following */
        Model.User
            .findByIdAndUpdate(id, {$inc: {followers: 1}}, {upsert: true, select: {fullName: 1, profilePictureUrl: 1, address: 1, following: 1, followers: 1}})
            .lean()
            .exec(callback);
    },
    function (followee, callback) {
        /** find & update the person doing the following */
        var query = {
            $inc: {following: 1},
            $addToSet: { follows: followee }
        }
        Model.User
            .findByIdAndUpdate(credentials.username, query, {upsert: true, select: {fullName: 1, profilePictureUrl: 1, address: 1, following: 1, followers: 1}})
            .lean()
            .exec(function (err, follower) {
                callback(err, follower, followee);
            });
    },
    function (follower, followee, callback) {
        /** update the following count */
        Model.User
            .update({'follows._id': follower.id}, {'follows.$.following': follower.following}, {upsert: true, multi: true}, function (err) {
                callback(err, followee);
            });
    },
    function (followee, callback) {
        /** update the followers count */
        Model.User
            .update({'follows._id': followee.id}, {'follows.$.followers': followee.followers}, {upsert: true, multi: true}, callback);
    }
], function (err) {
    if (err)
        next(err);
    else {
        res.send(HTTPStatus.OK);
        next();
    }
});

Iterating through database records in node.js

I'm looking to learn Node.js and MongoDB, which look suitable for something I'd like to make. As a little project to help me learn, I thought I'd copy the "posts" table from a phpBB3 forum I have into a MongoDB collection, so I did something like this, where db is the MongoDB database connection and client is a MySQL database connection:
db.collection('posts', function (err, data) {
    client.query('select * from phpbb_posts', function (err, rs) {
        data.insert(rs);
    });
});
This works OK when I do it on small tables, but my posts table has about 100,000 rows, and this query doesn't return even when I leave it running for an hour. I suspect it's trying to load the entire table into memory and then insert it.
So what I would like to do is read a chunk of rows at a time and insert them. However, I can't see how to read a subset of the rows in node.js, and, even more of a problem, I can't understand how to iterate through the queries one at a time when I only get notified via a callback that each one has finished.
Any ideas how I can best do this? (I'm looking for solutions using node.js as I'd like to know how to solve this kind of problem, I could no doubt do it easily some other way)
You could try using the async library by caolan. The library implements some async flow-control methods to handle the caveats of the callback-oriented programming style in node.js.
For your case, using the whilst method could work, issuing LIMIT queries against MySQL and inserting the results into MongoDB.
Example (not tested, as I have no test data available, but I think you'll get the idea):
var insertCount = 0;
var offset = 0;
// set this to the overall record count from mysql
var recordCount = 0;

async.whilst(
    // test condition callback
    function () { return insertCount < recordCount; },
    // actual worker callback
    function (callback) {
        db.collection('posts', function (err, data) {
            client.query('select * from phpbb_posts LIMIT ' + insertCount + ',1000', function (err, rs) {
                data.insert(rs);
                // increment by the number of records actually fetched
                insertCount += rs.length;
                // trigger flow callback
                callback();
            });
        });
    },
    // finished callback
    function (err) {
        // finished inserting data, maybe check record count in mongodb here
    }
);
As I already mentioned, this code is just adapted from an example in the async library's readme. But maybe it is an option for moving that amount of records from MySQL to MongoDB.