StackOverflowError when executing cypher queries through Neo4J REST API - rest

I'm using Neo4j version 3.0.7
I'm reading a list of edges from a dataset and I need to pass those edges batch-wise using the REST API.
I used the following query format to create multiple nodes (if they don't already exist) and their relationships in Neo4j through a single Cypher query via the REST API. For each edge I obtain its two vertices, and the node name properties are set according to the vertex IDs of those vertices.
{
"query":
"MATCH (n { name: 0 }), (m { name:1 })
CREATE (n)-[:X]->(m)
WITH count(*) as dummy
MATCH (n { name: 0 }), (m { name: 6309 })
CREATE (n)-[:X]->(m)"
}
This approach works correctly for a batch of 10 edges but when I try to send a batch of 1000 edges (nodes and their relationships) through a single Cypher query, I get a StackOverflowError exception. Is there a better approach to achieve this task?
Thank you for your help.
The error obtained from the response:
{
"exception" : "StackOverflowError",
"fullname" : "java.lang.StackOverflowError",
"stackTrace" : [ "scala.collection.TraversableOnce$class.$div$colon(TraversableOnce.scala:151) ..."
}

You can use UNWIND to get a single query:
UNWIND [[0,1], [0,6309]] AS pair
MATCH (n {name: pair[0]}), (m {name: pair[1]})
CREATE (n)-[:X]->(m)
Insert your node pairs after UNWIND as a list of two-element lists. As the query uses the name property for finding the nodes, it is worth adding an index on it. For example, if you have Person nodes, index them with:
CREATE INDEX ON :Person(name)
(See also the Cypher reference card.)
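If you drive this from Python, you can also send the pairs as a query parameter instead of splicing them into the query string, so the query text stays constant no matter how large the batch is. A minimal sketch, assuming the transactional HTTP endpoint of Neo4j 3.0 at the default address and the requests library (URL, credentials and the edge list are placeholders for your own setup):
import requests

# Placeholder URL and credentials; adjust for your server.
url = "http://localhost:7474/db/data/transaction/commit"
edges = [[0, 1], [0, 6309]]  # batch of [source, target] name values

payload = {
    "statements": [{
        "statement": (
            "UNWIND {pairs} AS pair "
            "MATCH (n {name: pair[0]}), (m {name: pair[1]}) "
            "CREATE (n)-[:X]->(m)"
        ),
        "parameters": {"pairs": edges}
    }]
}

response = requests.post(url, json=payload, auth=("neo4j", "password"))
print(response.json())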

Related

How to merge aql query with iterative traversal

I want to query a collection in ArangoDB using AQL, and at each node in the query, expand the node using a traversal.
I have attempted to do this by calling the traversal as a subquery using a LET statement within the collection query.
The result set for the traversal is empty, even though the query completes.
FOR ne IN energy
FILTER ne.identifier == "12345"
LET ne_edges = (
FOR v, e IN 1..1 ANY ne relation
RETURN e
)
RETURN MERGE(ne, {"edges": ne_edges})
[
{
"value": 123.99,
"edges": []
}
]
I have verified there are edges, and the traversal returns correctly when it is not executed as a subquery.
It seems as if the initial query is completing before a result is returned from the subquery, giving the empty result shown above.
What am I missing? or is there a better way?
I can think of two ways to do this. The first is easier to understand, but the second is more compact. For the examples below, I have a vertex collection test2 and an edge collection testEdge that links parent and child items within test2.
Using Collect:
let seed = (FOR testItem IN test2
FILTER testItem._id in ['test2/Q1', 'test2/Q3']
RETURN testItem._id)
let traversal = (FOR seedItem in seed
FOR v, e IN 1..1 ANY seedItem
testEdge
RETURN {seed: seedItem, e_to: e._to})
for t in traversal
COLLECT seeds = t.seed INTO groups = t.e_to
return {myseed: seeds, mygroups: groups}
Above we first get the items we want to traverse through (seed), then we perform the traversal and get an object that has the seed _id and the related edges.
Then we finally use COLLECT ... INTO to group the results.
Using array expansion
FOR testItem IN test2
FILTER testItem._id in ['test2/Q1', 'test2/Q3']
LET testEdges = (
FOR v, e IN 1..1 ANY testItem testEdge
RETURN e
)
RETURN {myseed: testItem._id, mygroups: testEdges[*]._to}
This time we combine the seed search and the traversal by using the LET statement, then we use array expansion to group the items.
In either case, I end up with something that looks like this:
[
{
"myseed": "test2/Q1",
"mygroups": [
"test2/Q1-P5-2",
"test2/Q1-P6-3",
"test2/Q1-P4-1"
]
},
{
"myseed": "test2/Q3",
"mygroups": [
"test2/Q3",
"test2/Q3"
]
}
]
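If you need to run the second variant from Python, a minimal sketch with the python-arango driver might look like this (host, database name and credentials are assumptions; the collections test2 and testEdge and the seed IDs follow the example above):
from arango import ArangoClient

# Placeholder connection details; adjust for your deployment.
client = ArangoClient(hosts="http://localhost:8529")
db = client.db("_system", username="root", password="password")

aql = """
FOR testItem IN test2
  FILTER testItem._id IN @seeds
  LET testEdges = (
    FOR v, e IN 1..1 ANY testItem testEdge
      RETURN e
  )
  RETURN {myseed: testItem._id, mygroups: testEdges[*]._to}
"""

# Bind the seed IDs instead of embedding them in the query string.
cursor = db.aql.execute(aql, bind_vars={"seeds": ["test2/Q1", "test2/Q3"]})
for doc in cursor:
    print(doc)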

Custom index comparator in MongoDB

I'm working with a dataset composed of probabilistic encrypted elements that are indistinguishable from random samples. This way, sequential encryptions of the same number result in different ciphertexts. However, they are still comparable through a special function that applies algorithms like SHA256 to compare two ciphertexts.
I want to add a list of the described ciphertexts to a MongoDB database and index it using a tree-based structure (e.g., an AVL tree). I can't simply apply the default indexing of the database because, as described, the records must be comparable using the special function.
An example: Suppose I have a database db and a collection c composed by the following document type:
{
"_id":ObjectId,
"r":string
}
Moreover, let F(int,string,string) be the following function:
F(h,l,r) = ( SHA256(l | r) + h ) % 3
where the operator | is a standard concatenation function.
I want to execute the following query in an efficient way, such as in a collection with some suitable indexing:
db.c.find( { F(h,l,r) :{ $eq: 0 } } )
where h and l are chosen arbitrarily and are not constants. That is, suppose I want to find all records that satisfy F(h1,l1,r) = 0 for some pair (h1, l1). Later, in another moment, I want to do the same but using (h2, l2) such that h1 != h2 and l1 != l2. h and l may assume any value in the set of integers.
How can I do that?
You can execute this query using the $where operator, but $where cannot use an index, so query performance depends on the size of your dataset.
db.c.find({$where: function() { return F(1, "bb", this.r) == 0; }})
Before executing the code above, you need to store your function F on the MongoDB server:
db.system.js.save({
_id: "F",
value: function(h, l, r) {
// the body of function
}
})
Links:
store javascript function on server
I've tried a solution that stores the result of the function in your collection, so I changed the schema as below:
{
"_id": ObjectId,
"r": {
"_key": F(H, L, value),
"value": String
}
}
The field r._key is the value of F(h,l,r) with constant h and l, and the field r.value is the original r field.
So you can create an index on the field r._key, and your query condition will be:
db.c.find( { "r._key" : 0 } )

FlinkML: Joining DataSets of LabeledVector does not work

I am currently trying to join two DataSets (part of the Flink 0.10-SNAPSHOT API). Both DataSets have the same form:
predictions:
6.932018685453303E155 DenseVector(0.0, 1.4, 1437.0)
org:
2.0 DenseVector(0.0, 1.4, 1437.0)
general form:
LabeledVector(Double, DenseVector(Double,Double,Double))
What I want to create is a new DataSet[(Double,Double)] containing only the labels of the two DataSets i.e.:
join:
6.932018685453303E155 2.0
Therefore I tried the following command:
val join = org.join(predictions).where(0).equalTo(0){
(l, r) => (l.label, r.label)
}
But as a result 'join' is empty. Am I missing something?
You are joining on the label field (index 0) of the LabeledVector type, i.e., building all pairs of elements with matching labels. Your example indicates that you want to join on the vector field instead.
However, joining on the vector field, for example by calling:
org.join(predictions).where("vector").equalTo("vector"){
(l, r) => (l.label, r.label)
}
will not work, because DenseVector, the type of the vector field, is not recognized as a key type by Flink (the same holds for all kinds of arrays).
Till describes how to compare prediction and label values in a comment below.
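For clarity, here is a plain-Python sketch (outside Flink, using the example values above) of the join semantics the question is actually after: pair up elements whose vectors match and keep only the two labels:
# Toy data mirroring the example: (label, vector) pairs.
org = [(2.0, (0.0, 1.4, 1437.0))]
predictions = [(6.932018685453303e155, (0.0, 1.4, 1437.0))]

# Key by the vector, not the label, then emit the label pairs.
pred_by_vector = {vec: label for label, vec in predictions}
joined = [(pred_by_vector[vec], label) for label, vec in org if vec in pred_by_vector]
print(joined)  # [(6.932018685453303e+155, 2.0)]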

PyMongo updating array records with calculated fields via cursor

Basically the collection output of an elaborate aggregate pipeline for a very large dataset is similar to the following:
{
"_id" : {
"clienta" : NumberLong(460011766),
"clientb" : NumberLong(2886729962)
},
"states" : [
[
"fixed", "fixed.rotated","fixed.rotated.off"
]
],
"VBPP" : [
244,
182,
184,
11,
299
],
"PPF" : 72.4
}
The intuitive, albeit slow, way to replace these fields with values calculated from them (the length of the states array and the variance of the VBPP array) with PyMongo, before converting the results to arrays, is as follows:
records_list = []
cursor = db.clientAgg.find({}, {'_id' : 0,
'states' : 1,
'VBPP' : 1,
'PPF': 1})
for record in cursor:
records_list.append(record)
for dicts in records_list:
dicts['states'] = len(dicts['states'])
dicts['VBPP'] = np.var(dicts['VBPP'])
I have written various forms of this basic flow to optimize for speed, but bringing 500k dictionaries into memory to modify them before converting them to arrays to go through a machine learning estimator is costly. I have tried various ways to update the records directly via a cursor with variants of the following, with no success:
cursor = db.clientAgg.find().skip(0).limit(50000)
def iter():
for item in cursor:
yield item
l = []
for x in iter():
x['VBPP'] = np.var(x['VBPP'])
# Or
# db.clientAgg.update({'_id':x['_id']},{'$set':{'x.VBPS': somefunction as above }},upsert=False, multi=True)
I also unsuccessfully tried using Mongo's usual operators since the variance is as simple as subtracting the mean from each element of the array, squaring the result, then averaging the results.
If I could successfully modify the collection directly then I could utilize something very fast like Monary or IOPro to load data directly from Mongo and into a numpy array without the additional overhead.
Thank you for your time
MongoDB has no way to update a document with values calculated from the document's fields; currently you can only use update to set values to constants that you pass in from your application. So you can set document.x to 2, but you can't set document.x to document.y + document.z or any other calculated value.
See https://jira.mongodb.org/browse/SERVER-11345 and https://jira.mongodb.org/browse/SERVER-458 for possible future features.
In the immediate future, PyMongo will release a bulk API that allows you to send a batch of distinct update operations in a single network round-trip, which will improve your performance.
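In current PyMongo releases this is exposed as bulk_write; a rough sketch of batching the client-side calculation with it, with the database name assumed and the collection and field names taken from the question:
import numpy as np
from pymongo import MongoClient, UpdateOne

db = MongoClient().mydb   # database name assumed

ops = []
for doc in db.clientAgg.find({}, {"_id": 1, "states": 1, "VBPP": 1}):
    ops.append(UpdateOne(
        {"_id": doc["_id"]},
        {"$set": {"states": len(doc["states"]),
                  "VBPP": float(np.var(doc["VBPP"]))}}
    ))
    if len(ops) == 1000:                     # flush in batches of 1000 updates
        db.clientAgg.bulk_write(ops, ordered=False)
        ops = []
if ops:
    db.clientAgg.bulk_write(ops, ordered=False)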
Addendum:
I have two other ideas. First, run some Javascript server-side. E.g., to set all documents' b fields to 2 * a:
db.eval(function() {
var collection = db.test_collection;
collection.find().forEach(function(doc) {
var b = 2 * doc.a;
collection.update({_id: doc._id}, {$set: {b: b}});
});
});
The second idea is to use the aggregation framework's $out operator, new in MongoDB 2.5.2, to transform the collection into a second collection that includes the calculated field:
db.test_collection.aggregate({
$project: {
a: '$a',
b: {$multiply: [2, '$a']}
}
}, {
$out: 'test_collection2'
});
Note that $project must explicitly include all the fields you want; only _id is included by default.
For a million documents on my machine the former approach took 2.5 minutes, and the latter 9 seconds. So you could use the aggregation framework to copy your data from its source to its destination, with the calculated fields included. Then, if desired, drop the original collection and rename the target collection to the source's name.
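If you drive this from Python instead of the shell, the same pipeline can be submitted through PyMongo; a minimal sketch, with the database name assumed and the collection names taken from the example:
from pymongo import MongoClient

db = MongoClient().mydb   # database name assumed
db.test_collection.aggregate([
    {"$project": {"a": "$a", "b": {"$multiply": [2, "$a"]}}},
    {"$out": "test_collection2"},
])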
My final thought on this is that MongoDB 2.5.3 and later can stream large result sets from an aggregation pipeline using a cursor. There's no reason Monary can't use that capability, so you might file a feature request there. That would allow you to get documents from a collection in the form you want, via Monary, without having to actually store the calculated fields in MongoDB.

Is it possible to refer to multiple documents in a mongo db query?

Suppose I have a collection containing the following documents:
...
{
event_counter : 3,
event_type: 50,
event_data: "yaya"
}
{
event_counter : 4,
event_type: 100,
event_data: "whowho"
}
...
Is it possible to ask for:
for each document, e where e.event_type == 100
get me any document f where
f.event_counter = e.event_counter+1
or equivalently:
find each f, where f.event_counter==e.event_counter+1 && e.event_type==100
I think the best way for you to approach this is on the application side, using multiple queries. You would want to run a query to match all documents with event_type = 100, like this one:
db.collection.find({"event_type" : 100});
Then, you'll have to write some logic to iterate through the results and run more queries to find documents with the right value of event_counter.
I am not sure it's possible to do this using MongoDB's aggregation framework. If it is possible, it will be quite a complicated query.
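A hypothetical PyMongo sketch of that application-side approach (database and collection names are placeholders; the field names come from the documents above):
from pymongo import MongoClient

coll = MongoClient().mydb.collection   # database/collection names assumed

# First query: all documents with event_type == 100.
for e in coll.find({"event_type": 100}):
    # Follow-up query: any document whose counter is one higher.
    for f in coll.find({"event_counter": e["event_counter"] + 1}):
        print(e["_id"], "->", f["_id"])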