Spring Data Neo4j 4.0.0: StackOverflowError on import

I am using Spring Data Neo4j 4.0.0 with Neo4j 2.2.1, and I am trying to import a timetree-like object with two levels under the root. The object is built in memory and saved at the end, and at some point during the save I get this StackOverflowError:
Exception in thread "main" java.lang.StackOverflowError
at java.lang.Character.codePointAt(Character.java:4668)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3693)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
at java.util.regex.Pattern$Branch.match(Pattern.java:4500)
at java.util.regex.Pattern$Branch.match(Pattern.java:4500)
at java.util.regex.Pattern$Branch.match(Pattern.java:4500)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4466)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4615)
at java.util.regex.Pattern$Curly.match0(Pattern.java:4177)
at java.util.regex.Pattern$Curly.match(Pattern.java:4132)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
at java.util.regex.Pattern$Branch.match(Pattern.java:4502)
at java.util.regex.Pattern$Branch.match(Pattern.java:4500)
at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3715)
at java.util.regex.Pattern$Start.match(Pattern.java:3408)
at java.util.regex.Matcher.search(Matcher.java:1199)
at java.util.regex.Matcher.find(Matcher.java:618)
at java.util.Formatter.parse(Formatter.java:2517)
at java.util.Formatter.format(Formatter.java:2469)
at java.util.Formatter.format(Formatter.java:2423)
at java.lang.String.format(String.java:2792)
at org.neo4j.ogm.cypher.compiler.IdentifierManager.nextIdentifier(IdentifierManager.java:48)
at org.neo4j.ogm.cypher.compiler.SingleStatementCypherCompiler.newRelationship(SingleStatementCypherCompiler.java:71)
at org.neo4j.ogm.mapper.EntityGraphMapper.getRelationshipBuilder(EntityGraphMapper.java:357)
at org.neo4j.ogm.mapper.EntityGraphMapper.link(EntityGraphMapper.java:315)
at org.neo4j.ogm.mapper.EntityGraphMapper.mapEntityReferences(EntityGraphMapper.java:262)
at org.neo4j.ogm.mapper.EntityGraphMapper.mapEntity(EntityGraphMapper.java:154)
at org.neo4j.ogm.mapper.EntityGraphMapper.mapRelatedEntity(EntityGraphMapper.java:524)
at org.neo4j.ogm.mapper.EntityGraphMapper.link(EntityGraphMapper.java:324)
at org.neo4j.ogm.mapper.EntityGraphMapper.mapEntityReferences(EntityGraphMapper.java:262)
at org.neo4j.ogm.mapper.EntityGraphMapper.mapEntity(EntityGraphMapper.java:154)
at org.neo4j.ogm.mapper.EntityGraphMapper.mapRelatedEntity(EntityGraphMapper.java:524)
...
Thank you in advance; any suggestions would be much appreciated!

SDN 4 isn't really intended to be used to batch import your objects into Neo4j. It's an Object Graph Mapping framework for general-purpose Java applications, not a batch importer (which brings its own specific set of problems to the table). Some of the design decisions that support the intended use case for SDN run contrary to what you would do if you were designing a purpose-built ETL tool. We are also constrained by the performance of Neo4j's HTTP transactional endpoint, which, although by no means slow in absolute terms, cannot hope to compete with the Batch Inserter, for example.
There are some performance improvements we will be making in the future, and when the new binary protocol for Neo4j is released (2.3), we will be plugging that in as our transfer protocol. We expect this to improve transfer speeds to and from the database by at least an order of magnitude. However, please don't expect these changes to radically alter the behavioural characteristics of SDN 4. While a future version might be able to load a few thousand nodes much faster than it can currently, it still won't be an ETL tool, and I wouldn't expect it to be used as such.

After some hours of trial and error, I finally found that I need to limit the depth of my saves.
Previously I didn't specify a depth, and the saved object grew larger and larger as its children were inserted at the same time. After passing a depth of 1 to every save call, I finally got rid of the StackOverflowError. And by not saving at regular intervals (I put all the objects in an ArrayList and save them all at the end), I gained about 1 minute (from 3.5 minutes down to 2.5 minutes) when importing ca. 1000 nodes (with relationships).
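To illustrate what a save depth means, here is a minimal sketch in plain Python. It is not the OGM's code: Entity, link and entities_within are hypothetical names, and the tiny tree (1 root, 3 months, 2 days each) only stands in for the timetree. It shows how many entities a walk touches with an unbounded depth versus a depth of 1.

```python
class Entity:
    """A toy node; `related` holds its neighbours in both directions."""
    def __init__(self, name):
        self.name = name
        self.related = []

def link(a, b):
    # Bidirectional reference, like a parent/child pair in an object graph.
    a.related.append(b)
    b.related.append(a)

def entities_within(start, depth):
    """Names of entities within `depth` hops of `start`; depth < 0 means unbounded."""
    seen = {start.name}
    frontier = [start]
    hops = 0
    while frontier and (depth < 0 or hops < depth):
        next_frontier = []
        for entity in frontier:
            for other in entity.related:
                if other.name not in seen:
                    seen.add(other.name)
                    next_frontier.append(other)
        frontier = next_frontier
        hops += 1
    return seen

# Hypothetical timetree: root -> 3 months -> 2 days each.
root = Entity("root")
for m in range(3):
    month = Entity(f"month-{m}")
    link(root, month)
    for d in range(2):
        link(month, Entity(f"month-{m}/day-{d}"))

print(len(entities_within(root, -1)))  # unbounded: the whole tree
print(len(entities_within(root, 1)))   # depth 1: root plus its months
```

With an unbounded depth every save reaches the whole tree, and the tree keeps growing during the import; with a depth of 1 each save only touches an entity and its direct neighbours.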
Nevertheless, the performance is still not satisfactory, since I could import over 60,000 records in under 1 minute with my previous MongoDB implementation. I don't know whether this is down to SDN 4, or whether it would be faster with the Embedded API. I'm really curious whether anyone has benchmarked SDN 4 against the Embedded API.

Related

storing huge amounts of data in mongo

I am working on a front end system for a radius server.
The radius server will pass updates to the system every 180 seconds, which means that with about 15,000 clients I would get around 7,200,000 entries per day. That is a lot.
I am trying to understand the best possible way to store and retrieve this data. Obviously, as time goes on, the volume will become substantial. Will MongoDB handle this? A typical document is small, something like this:
{
id: 1
radiusId: uniqueId
start: 2017-01-01 14:23:23
upload: 102323
download: 1231556
}
However, there will be MANY of these records. I guess this is similar to the way SNMP NMS servers handle data; as far as I know, they use RRD for this.
Currently in my testing I just push every document into a single collection. So I am asking,
A) Is Mongo the right tool for the job and
B) Is there a better/more preferred/more optimal way to store the data
EDIT:
OK, so just in case someone comes across this and needs some help.
I ran it for a while in Mongo and was really not satisfied with the performance. We can chalk this up to the hardware I was running on, perhaps my level of knowledge, or the framework I was using. However, I found a solution that works very well for me. InfluxDB pretty much handles all of this right out of the box; it's a time series database, which is effectively the kind of data I am trying to store (https://github.com/influxdata/influxdb). Performance for me has been like night and day. Again, it could all be my fault, I'm just updating this.
EDIT 2:
So after a while I think I figured out why I never got the performance I was after with Mongo. I am using sailsjs as the framework, and it was searching by id using a regex, which obviously carries a huge performance hit. I will eventually try migrating back to Mongo from Influx and see if it's better.
15,000 clients updating every 180 seconds = ~83 insertions / sec. That's not a huge load even for a moderately sized DB server, especially given the very small size of the records you're inserting.
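The arithmetic behind both figures (the 7,200,000 entries per day from the question and the ~83 inserts per second here) can be checked with a few lines of Python. The variable names are just for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60          # 86,400

clients = 15_000
update_interval_s = 180

# Each client reports once per interval.
updates_per_client_per_day = SECONDS_PER_DAY // update_interval_s
entries_per_day = clients * updates_per_client_per_day
inserts_per_second = clients / update_interval_s

print(updates_per_client_per_day)       # updates per client per day
print(entries_per_day)                  # total entries per day
print(round(inserts_per_second, 1))     # average insert rate
```

Note that this is an average rate; if all clients report at the same moment, the peak will be higher.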
I think MongoDB will do fine with that load (also, to be honest, almost any modern SQL DB would probably be able to keep up as well). IMHO, the key points to consider are these:
Hardware: make sure you have enough RAM. This will primarily depend on how many indexes you define and how many queries you're doing. If this is primarily a log that will rarely be read, then you won't need much RAM for your working set (although you'll need enough for your indexes). But if you're also running queries, then you'll need many more resources.
If you are running extensive queries, consider setting up a replica set. That way, your master server can be reserved for writing data, ensuring reliability, while your slaves can be configured to serve your queries without affecting the write reliability.
Regarding the data structure, I think that's fine, but it'll really depend on what type of queries you wish to run against it. For example, if most queries use the radiusId to reference another table and pull in a bunch of data for each record, then you might want to consider denormalizing some of that data. But again, that really depends on the queries you run.
If you're really concerned about managing the write load reliably, consider using the Mongo front-end only to manage the writes, and then dumping the data to a data warehouse backend to run queries on. You can partially do this by running a replica set like I mentioned above, but the disadvantage of a replica set is that you can't restructure the data. The data in each member of the replica set is exactly the same (hence the name, replica set :-) Oftentimes, the best structure for writing data (normalized, small records) isn't the best structure for reading data (denormalized, large records with all the info and joins you need already done). If you're running a bunch of complex queries referencing a bunch of other tables, using a true data warehouse for the querying part might be better.
As your write load increases, you may consider sharding. I'm assuming the RadiusId points to each specific server among a pool of Radius servers. You could potentially shard on that key, which would split the writes based on which server is sending the data. Thus, as you increase your radius servers, you can increase your mongo servers proportionally to maintain write reliability. However, I don't think you need to do this right away as I bet one reasonably provisioned server should be able to manage the load you've specified.
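As a rough illustration of how a shard key spreads writes, here is a sketch in plain Python. MongoDB does this routing itself once a collection is sharded; shard_for, the key format and the shard count of 3 are all assumptions made for the example, not MongoDB's actual hashing.

```python
import hashlib
from collections import Counter

def shard_for(radius_id: str, shard_count: int) -> int:
    """Map a key to a shard index using a stable hash of the key."""
    digest = hashlib.md5(radius_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count

SHARDS = 3  # hypothetical number of shards

# Hypothetical radiusId values, one per client.
counts = Counter(shard_for(f"radius-{i}", SHARDS) for i in range(15_000))

for shard, n in sorted(counts.items()):
    print(f"shard {shard}: {n} keys")
```

The same key always lands on the same shard, and with many distinct keys the load splits roughly evenly. If radiusId only takes a handful of values (one per radius server), the split will be much coarser.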
Anyway, those are my preliminary suggestions.

Is this scenario a big data project?

I'm involved in a project with 2 phases, and I'm wondering whether it counts as a big data project (I'm a newbie in this field).
In the first phase I have this scenario:
I have to collect a huge amount of data
I need to store it
I need to build a web application that shows the data to the users
In the second phase I need to analyze the stored data and build reports from it.
As an example of the data volume: in one day I may need to collect and store around 86,400,000 records.
I was thinking of this kind of architecture:
to collect the data, some asynchronous technology like ActiveMQ and the MQTT protocol
to store the data, a NoSQL DB (MongoDB, HBase or other)
This would solve my first-phase problems.
But what about the second phase?
I was thinking about some big data software (like Hadoop or Spark) and some machine learning software, so that I can retrieve data from the DB, analyze it, and restructure or store it in a better way in order to build good reports and do some specific analysis.
I was wondering if this is the best approach.
How would you solve this kind of scenario? Am I on the right track?
thank you
Angelo
As answered by siddhartha, whether your project can be tagged as a big data project depends on the context and the business domain/case of your project.
Coming to the tech stack, each of the technologies you mentioned has a specific purpose. For example, if you have structured data, you can use any modern database with query support. NoSQL databases come in different flavours (columnar, document-based, key-value, etc.), so the technology choice again depends on the kind of data and the use case you have. I suggest you do some POCs and an analysis of the technologies before making the final call.
The definition of big data varies from user to user. For Google, 100 TB might be small data, but for me it is big data because of the difference in available commodity hardware. For example, Google can have a cluster of 50,000 nodes, each with 64 GB of RAM, for analysing 100 TB of data, so for them this is not big data. But I cannot have a cluster of 50,000 nodes, so for me it is big data.
The same applies in your case: if you have commodity hardware available, you can go ahead with Hadoop. As you have not mentioned the size of the files you are generating each day, I cannot be certain about your case. But Hadoop is a good choice for processing your data because of newer projects like Spark, which can help you process data in much less time and also gives you real-time analysis features. So in my opinion it is better to use Spark or Hadoop, because then you can play with your data. Moreover, since you want to use a NoSQL database, you can use HBase, which is available with Hadoop, to store your data.
Hope this answers your question.

Lucene searches are slow via AzureDirectory

I'm having trouble understanding the complexities of Lucene. Any help would be appreciated.
We're using a Windows Azure blob to store our Lucene index, with Lucene.Net and AzureDirectory. A WorkerRole contains the only IndexWriter, and it adds 20,000 or more records a day, and changes a small number (fewer than 100) of the existing documents. A WebRole on a different box is set up to take two snapshots of the index (into another AzureDirectory), alternating between the two, and telling the WebService which directory to use as it becomes available.
The WebService has two IndexSearchers that alternate, reloading as the next snapshot is ready--one IndexSearcher is supposed to handle all client requests at a time (until the newer snapshot is ready). The IndexSearcher sometimes takes a long time (minutes) to instantiate, and other times it's very fast (a few seconds). Since the directory is physically on disk already (not using the blob at this stage), we expected it to be a fast operation, so this is one confusing point.
We're currently up around 8 million records. The Lucene search used to be so fast (it was great), but now it's very slow. To try to improve this, we've started to IndexWriter.Optimize the index once a day after we back it up--some resources online indicated that Optimize is not required for often-changing indexes, but other resources indicate that optimization is required, so we're not sure.
The big problem is that whenever our web site has more traffic than a single user, we're getting timeouts on the Lucene search. We're trying to figure out if there's a bottleneck at the IndexSearcher object. It's supposed to be thread-safe, but it seems like something is blocking the requests so that only a single search is performed at a time. The box is an Azure VM, set to a Medium size so it has lots of resources available.
Thanks for whatever insight you can provide. Obviously, I can provide more detail if you have any further questions, but I think this is a good start.
I have much larger indexes and have not run into these issues (~100 million records).
Put the indexes in memory if you can (8 million records sounds like it should fit into memory, depending on the number of analyzed fields etc.). You can use RAMDirectory as the cache directory.
IndexSearcher is thread-safe and supposed to be re-used, but I am not sure if that is the reality. In Lucene 3.5 (Java version) they have a SearcherManager class that manages multiple threads for you.
http://java.dzone.com/news/lucenes-searchermanager
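The general idea of sharing one searcher across request threads and swapping it when a new snapshot is ready can be sketched with reference counting. This is plain Python, not the Lucene or Lucene.Net API; Snapshot and SnapshotManager are hypothetical names used only to show the acquire/release/swap pattern.

```python
import threading

class Snapshot:
    """Stands in for an open searcher over one index snapshot."""
    def __init__(self, name):
        self.name = name
        self.refs = 1        # the manager itself holds one reference
        self.closed = False

class SnapshotManager:
    def __init__(self, snapshot):
        self._lock = threading.Lock()
        self._current = snapshot

    def acquire(self):
        # Called by each request; many threads may hold the same snapshot.
        with self._lock:
            self._current.refs += 1
            return self._current

    def release(self, snapshot):
        # Close a snapshot only when nobody is using it any more.
        with self._lock:
            snapshot.refs -= 1
            if snapshot.refs == 0:
                snapshot.closed = True

    def swap(self, new_snapshot):
        # Publish the new snapshot, then drop the manager's reference to the old one.
        with self._lock:
            old, self._current = self._current, new_snapshot
        self.release(old)

manager = SnapshotManager(Snapshot("snapshot-a"))
in_flight = manager.acquire()            # a request starts on snapshot-a
manager.swap(Snapshot("snapshot-b"))     # a newer snapshot becomes current
print(in_flight.closed)                  # old snapshot stays open for the running request
manager.release(in_flight)
print(in_flight.closed)                  # closed once the last user is done
```

The lock is only held while swapping or counting, never during a search, so concurrent searches do not serialize on it.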
Also, a non-Lucene point: if you are on an extra-large+ VM, make sure you are taking advantage of all of the cores. Especially if you have a Web API/ASP.NET front end for it, those calls should all be asynchronous.

NoSQL for time series/logged instrument reading data that is also versioned

My Data
It's primarily monitoring data, passed in the form of Timestamp: Value, for each monitored value, on each monitored appliance. It's regularly collected over many appliances and many monitored values.
Additionally, it has the quirky feature of many of these data values being derived at the source, with the calculation changing from time to time. This means that my data is effectively versioned, and I need to be able to simply call up only data from the most recent version of the calculation. Note: This is not versioning where the old values are overwritten. I simply have timestamp cutoffs, beyond which the data changes its meaning.
My Usage
Downstream, I'm going to have various undefined data mining/machine learning uses for the data. It's not really clear yet what those uses are, but it is clear that I will be writing all of the downstream code in Python. Also, we are a very small shop, so I can really only deal with so much complexity in setup, maintenance, and interfacing to downstream applications. We just don't have that many people.
The Choice
I am not allowed to use a SQL RDBMS to store this data, so I have to find the right NoSQL solution. Here's what I've found so far:
Cassandra
Looks totally fine to me, but it seems like some of the major users have moved on. It makes me wonder if it's just not going to be that much of a vibrant ecosystem. This SE post seems to have good things to say: Cassandra time series data
Accumulo
Again, this seems fine, but I'm concerned that this is not a major, actively developed platform. It seems like this would leave me a bit starved for tools and documentation.
MongoDB
I have a, perhaps irrational, intense dislike for the Mongo crowd, and I'm looking for any reason to discard this as a solution. It seems to me like the data model of Mongo is all wrong for things with such a static, regular structure. My data even comes in (and has to stay in) order. That said, everybody and their mother seems to love this thing, so I'm really trying to evaluate its applicability. See this and many other SE posts: What NoSQL DB to use for sparse Time Series like data?
HBase
This is where I'm currently leaning. It seems like the successor to Cassandra with a totally usable approach for my problem. That said, it is a big piece of technology, and I'm concerned about really knowing what it is I'm signing up for, if I choose it.
OpenTSDB
This is basically a time-series specific database, built on top of HBase. Perfect, right? I don't know. I'm trying to figure out what another layer of abstraction buys me.
My Criteria
Open source
Works well with Python
Appropriate for a small team
Very well documented
Has specific features to take advantage of ordered time series data
Helps me solve some of my versioned data problems
So, which NoSQL database actually can help me address my needs? It can be anything, from my list or not. I'm just trying to understand what platform actually has code, not just usage patterns, that support my super specific, well understood needs. I'm not asking which one is best or which one is cooler. I'm trying to understand which technology can most natively store and manipulate this type of data.
Any thoughts?
It sounds like you are describing one of the most common use cases for Cassandra. Time series data in general is often a very good fit for the Cassandra data model. More specifically, many people store metric/sensor data like you are describing. See:
http://rubyscale.com/blog/2011/03/06/basic-time-series-with-cassandra/
http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra
http://engineering.rockmelt.com/post/17229017779/modeling-time-series-data-on-top-of-cassandra
As for your concerns about the community, I'm not sure what is giving you that impression, but there is quite a large community (see IRC, mailing lists) as well as a growing number of Cassandra users.
http://www.datastax.com/cassandrausers
Regarding your criteria:
Open source
Yes
Works well with Python
http://pycassa.github.com/pycassa/
Appropriate for a small team
Yes
Very well documented
http://www.datastax.com/docs/1.1/index
Has specific features to take advantage of ordered time series data
See above links
Helps me solve some of my versioned data problems
If I understand your description correctly you could solve this multiple ways. You could start writing a new row when the version changes. Alternatively you could use composite columns to store the version along with the timestamp/value pair.
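Independent of the database, the "timestamp cutoff" versioning can be sketched in a few lines of Python. The cutoffs and readings below are made-up sample data, and version_at and latest_version_readings are hypothetical helpers; the composite key at the end mirrors the idea of storing the version alongside the timestamp/value pair.

```python
from bisect import bisect_right

# Hypothetical timestamps at which the calculation changed; version i starts at cutoffs[i].
cutoffs = [0, 100, 250]

# Hypothetical (timestamp, value) readings for one monitored value.
readings = [(10, 1.0), (120, 2.0), (250, 3.0), (300, 4.0)]

def version_at(ts):
    """Index of the calculation version in effect at timestamp `ts`."""
    return bisect_right(cutoffs, ts) - 1

def latest_version_readings(rows):
    """Only the readings produced by the most recent calculation."""
    start = cutoffs[-1]
    return [(ts, value) for ts, value in rows if ts >= start]

# Composite key (version, timestamp) -> value, in the spirit of composite columns.
composite = {(version_at(ts), ts): value for ts, value in readings}

print(latest_version_readings(readings))
print(sorted(composite))
```

Old values are never overwritten here; a query for the latest version is just a range scan starting at the last cutoff.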
I'll also note that Accumulo, HBase, and Cassandra all have essentially the same data model. You will still find small differences around the data model in regards to specific features that each database offers, but the basics will be the same.
The bigger difference between the three will be the architecture of the system. Cassandra takes its architecture from Amazon's Dynamo. Every server in the cluster is the same, and it is quite simple to set up. HBase and Accumulo are more direct clones of BigTable. These have more moving parts and will require more setup and more types of servers, for example setting up HDFS, Zookeeper, and HBase/Accumulo-specific server types.
Disclaimer: I work for DataStax (we work with Cassandra)
I only have experience in Cassandra and MongoDB but my experience might add something.
So you're basically doing time-based metrics?
OK, if I understand correctly, you use the timestamp as a versioning mechanism, so that you query by a certain timestamp; say, to get the latest calculation used, you look up by the metric ID or whatever, sort by ts DESC, and take the first row?
It sounds like a versioned key value store at times.
With this in mind I probably would not recommend either of the two I have used.
Cassandra is too rigid and too hierarchical, too built around how you query, to the point where you can only make one pivot of graph data from the column family (I presume you would want to graph these metrics), which is crazy; that is why I dropped it. As for searching (which Facebook uses it for, and only that), it's not that impressive either.
MongoDB, well I love MongoDB and I am an elite of the user group and it could work here if you didn't use a key value storage policy but at the end of the day if your mind is not set and you don't like the tech then let me be the very first to say: don't use it! You will be no good at a tech that you don't like so stay away from it.
Though I would picture this happening in Mongo much like:
{
_id: ObjectId(),
metricId: 'AvailableMessagesInQueue',
formula: '4+5/10.01',
result: NaN,
ts: ISODate()
}
And you query for the latest version of your calculation by:
var results = db.metrics.find({ 'metricId': 'AvailableMessagesInQueue' }).sort({ ts: -1 });
var latest = results.next();
Which would output the doc structure you see above. Without knowing more about exactly how you wish to query, the general server and app scenario etc., that's the best I can come up with.
I found this thread on HBase though: http://mail-archives.apache.org/mod_mbox/hbase-user/201011.mbox/%3C5A76F6CE309AD049AAF9A039A39242820F0C20E5#sc-mbx04.TheFacebook.com%3E
Which might be of interest, it seems to support the argument that HBase is a good time based key value store.
I have not personally used HBase so do not take anything I say about it seriously....
I hope I have added something, if not you could try narrowing your criteria so we can answer more dedicated questions.
Hope it helps a little,
Not a plug for any particular technology but this article on Time Series storage using MongoDB might provide another way of thinking about the storage of large amounts of "sensor" data.
http://www.10gen.com/presentations/mongodc-2011/time-series-data-storage-mongodb
Axibase Time-Series Database
Open source
There is a free Community Edition
Works well with Python
https://github.com/axibase/atsd-api-python. There are also other language wrappers, for example ATSD R client.
Appropriate for a small team
Built-in graphics and rule engine make it productive for building an in-house reporting, dashboarding, or monitoring solution with less coding.
Very well documented
It's hard to beat IBM Redbooks, but we're trying. The API, configuration, and administration are documented in detail and with examples.
Has specific features to take advantage of ordered time series data
It's a time-series database from the ground-up so aggregation, filtering and non-parametric ARIMA and HW forecasts are available.
Helps me solve some of my versioned data problems
ATSD supports versioned time-series data natively in SE and EE editions. Versions keep track of status, change-time and source changes for the same timestamp for audit trails and reconciliations. It's a useful feature to have if you need clean, verified data with tracing. Think energy metering, PHMR records. ATSD schema also supports series tags, which you could use to store versioning columns manually if you're on CE edition or you need to extend default versioning columns: status, source, change-time.
Disclosure - I work for the company that develops ATSD.

Using Drools in a heavy batch process

We used Drools as part of a solution to act as a sort of filter in a very intense processing application, maybe running up to 100 rules on 500,000 + working memory objects.
It turns out that it is extremely slow.
Does anybody else have experience using Drools in a batch-type processing application?
It kind of depends on your rules. 500K objects is reasonable given enough memory (it has to populate a RETE network in memory, so memory usage is a multiple of 500K objects, i.e. space for the objects plus space for the network structure, indexes etc.). It's possible you are paging to disk, which would be really slow.
Of course, if you have rules that match combinations of the same type of fact, that can cause an explosion of combinations to try, which even if you have 1 rule will be really really slow.
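To put a number on that explosion: a single rule that matches pairs of facts of the same type has to consider every pair. A quick check in Python (math.comb needs Python 3.8+):

```python
from math import comb

facts = 500_000

# Number of unordered pairs a pair-matching rule would have to consider.
pairs = comb(facts, 2)

print(f"{pairs:,}")
```

That is about 1.25 × 10^11 candidate pairs for one rule, before any constraint or index narrows them down.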
If you had any more information on the analysis you are doing that would probably help with possible solutions.
I've used Drools with a stateful working memory containing over 1M facts. With some tuning of both your rules and the underlying JVM, performance can be quite good after a few minutes of initial start-up. Let me know if you want more details.
I haven't worked with the latest version of Drools (last time I used it was about a year ago), but back then our high-load benchmarks proved it to be utterly slow. A huge disappointment after having based much of our architecture on it.
At least something good I remember about drools is that their dev team was available on IRC and very helpful, you might give them a try, they're the experts after all: irc.codehaus.org #drools
I'm just learning drools myself, so maybe I'm missing something, but why is the whole batch of five hundred thousand objects added to working memory at once? The only reason I can think of is that there are rules that kick in only when two or more items in the batch are related.
If that isn't the case, then perhaps you could use a stateless session and assert one object at a time. I assume rules will run 500k times faster in that case.
Even if it is the case, do all your rules need access to all 500k objects? Could you speed things up by applying per-item rules one at a time, and then in a second phase of processing apply batch level rules using a different rulebase and working memory? This would not change the volume of data, but the RETE network would be smaller because the simple rules would have been removed.
An alternative approach would be to try and identify the related groups of objects and assert the objects in groups during the second phase, further reducing the volume of data in working memory as well as splitting up the RETE network.
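The two-phase idea can be sketched without any rule engine. This is plain Python, not Drools: the items, the "account" grouping key and both rule functions are hypothetical, and only show per-item rules running first, then group-level rules running on small related groups.

```python
from collections import defaultdict

# Hypothetical batch items; "account" is the assumed key that relates items.
items = [
    {"id": 1, "account": "a", "amount": 60},
    {"id": 2, "account": "a", "amount": 50},
    {"id": 3, "account": "b", "amount": -5},
    {"id": 4, "account": "b", "amount": 70},
]

def per_item_rule(item):
    # Phase 1: a rule that needs only one item at a time.
    return item["amount"] > 0

def group_rule(group):
    # Phase 2: a rule that needs all related items together.
    return sum(item["amount"] for item in group) > 100

# Phase 1: filter items one by one; nothing else is held in memory.
survivors = [item for item in items if per_item_rule(item)]

# Phase 2: evaluate group-level rules per related group, not on the whole batch.
groups = defaultdict(list)
for item in survivors:
    groups[item["account"]].append(item)

flagged = sorted(key for key, group in groups.items() if group_rule(group))
print(flagged)
```

Each group could be asserted into its own short-lived session, so the working memory only ever holds one group at a time.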
Drools is not really designed to be run on a huge number of objects. It's optimized for running complex rules on a few objects.
The working memory initialization for each additional object is too slow and the caching strategies are designed to work per working memory object.
Use a stateless session and add the objects one at a time?
I had problems with OutOfMemory errors after parsing a few thousand objects. Setting a different default optimizer solved the problem.
OptimizerFactory.setDefaultOptimizer(OptimizerFactory.SAFE_REFLECTIVE);
We were looking at drools as well, but for us the number of objects is low so this isn't an issue. I do remember reading that there are alternate versions of the same algorithm that take memory usage more into account, and are optimized for speed while still being based on the same algorithm. Not sure if any of them have made it into a real usable library though.
This optimizer can also be set with a JVM parameter:
-Dmvel2.disable.jit=true