I have a three-node setup of OrientDB 2.1.2 using mostly default configuration (except for some options mentioned in "Performance Tuning").
In a particular operation using the Java Graph API I am generating 1745 vertex insertions and 1753 edge insertions, which takes between 6 and 7 seconds to complete (sometimes it takes up to 10 seconds).
I feel that this is far too long for this operation to complete. I tried adding and removing the "massive insert" intent, but it doesn't help much.
Am I doing anything wrong / overlooking something? Is it normal for OrientDB to take up to 10 seconds to create ~3500 objects in the distributed setup?
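Simplified, the insert loop looks roughly like this (the vertex class, property names, and edge label below are placeholders for my real model):
import com.orientechnologies.orient.core.intent.OIntentMassiveInsert;
import com.tinkerpop.blueprints.Vertex;
import com.tinkerpop.blueprints.impls.orient.OrientGraph;
import com.tinkerpop.blueprints.impls.orient.OrientGraphFactory;

public class BatchInsert {
    public static void main(String[] args) {
        OrientGraphFactory factory = new OrientGraphFactory("remote:node1/mydb");
        OrientGraph graph = factory.getTx();
        graph.declareIntent(new OIntentMassiveInsert());
        try {
            Vertex previous = null;
            for (int i = 0; i < 1745; i++) {
                Vertex v = graph.addVertex("class:Item");       // placeholder vertex class
                v.setProperty("name", "item-" + i);
                if (previous != null) {
                    graph.addEdge(null, previous, v, "linked"); // placeholder edge label
                }
                previous = v;
            }
            graph.commit();                                     // one commit at the end
        } finally {
            graph.declareIntent(null);
            graph.shutdown();
        }
    }
}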
Trying to wrap my head around timescaledb, but my google-fu is failing me. Most likely because I'm not searching for the correct term.
With RRDtool, old data can be stored as averages, reducing the amount of data being stored.
I can't seem to find out how to do this with timescaledb. I'd like 5-minute resolution for 90 days, but after that it's pointless to keep all those data points, and I'd like to reduce it to 30- or 60-minute averages for a couple of years, then maybe daily averages after that.
Is this something that I can set in the database itself, or is this something I would have to implement in a housekeeping job?
We had the exact same question half a year ago.
The term "Data Retention" is also used by the timescaledb team. It is currently implemented using drop_chunks policies (see their doc here). It's a Enterprise feature but IMHO not (yet) as useful as it could/should be (and it surely does not do what you are looking for).
Let me explain: probably the easiest approach for down-sampling your data are Continuous Aggregates (their doc here). You can quite easily aggregate virtually any numeric value to whatever resolution you desire. However, Continuous Aggregates will be affected by the deletions of the drop_chunks, too. Your data is gone.
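For illustration, a continuous aggregate at one-hour resolution looks roughly like this (table and column names are invented, the DDL is shown through plain JDBC, and the exact syntax depends on your TimescaleDB version):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateContinuousAggregate {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/metrics", "user", "secret");
             Statement st = conn.createStatement()) {
            // TimescaleDB 2.x syntax; 1.x used CREATE VIEW ... WITH (timescaledb.continuous).
            st.execute(
                "CREATE MATERIALIZED VIEW metrics_hourly "
                + "WITH (timescaledb.continuous) AS "
                + "SELECT time_bucket('1 hour', time) AS bucket, "
                + "       device_id, avg(value) AS avg_value "
                + "FROM metrics GROUP BY bucket, device_id");
        }
    }
}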
One workaround for this would be to create separate lo-res hypertables instead. Then create your own background workers that copy the data from the original hi-res table into these new lo-res hypertables.
For housekeeping, either use the Data Retention Enterprise feature or create your own background workers.
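A minimal sketch of such a worker (connection details, table names, and columns are all made up; metrics_30min is assumed to be a separate lo-res hypertable created beforehand, and the drop_chunks call uses the 1.x signature, so check the docs for your version):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class DownsampleWorker {
    private static final String URL = "jdbc:postgresql://localhost:5432/metrics";

    public static void main(String[] args) {
        // Run once a day; each run handles the day that just aged past 90 days.
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(DownsampleWorker::downsample, 0, 1, TimeUnit.DAYS);
    }

    static void downsample() {
        try (Connection conn = DriverManager.getConnection(URL, "user", "secret");
             Statement st = conn.createStatement()) {
            // Copy the slice that just crossed the 90-day boundary into the
            // lo-res hypertable as 30-minute averages.
            st.execute(
                "INSERT INTO metrics_30min (bucket, device_id, avg_value) "
                + "SELECT time_bucket('30 minutes', time), device_id, avg(value) "
                + "FROM metrics "
                + "WHERE time <  now() - interval '90 days' "
                + "  AND time >= now() - interval '91 days' "
                + "GROUP BY 1, 2");
            // Then drop the aged-out hi-res chunks.
            st.execute("SELECT drop_chunks(interval '91 days', 'metrics')");
        } catch (Exception e) {
            e.printStackTrace(); // a real worker would log, retry, and track what it already copied
        }
    }
}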
I have a couple of million entries in a table with start and end timestamps. I want to implement an analysis tool that determines unique entries for a specific interval, let's say between yesterday and two months before yesterday.
Depending on the interval, the queries take between a couple of seconds and 30 minutes. How would I implement an analysis tool for a web front-end that allows querying this data fairly quickly, similar to Google Analytics?
I was thinking of moving the data into Redis and doing something clever with intervals and sorted sets etc., but I was wondering if there is something in PostgreSQL that would allow executing aggregated queries and re-using earlier results, so that, for instance, after querying the first couple of days it does not start from scratch again when looking at a different interval.
If not, what should I do? Export the data to something like Apache Spark or DynamoDB, run the analysis there, and fill Redis for quicker retrieval?
Either will do.
Aggregation is a basic task they all can do, and your data is small enough to fit into main memory. So you don't even need a database (but the aggregation functions of a database may still be better implemented than anything you would rewrite yourself, and SQL is quite convenient to use).
Just do it. Give it a try.
P.S.: Make sure to enable data indexing and choose the right data types. Maybe check the query plans, too.
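For example, something along these lines (table and column names are guesses at your schema, and CREATE INDEX IF NOT EXISTS needs PostgreSQL 9.5+):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class IntervalCount {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/analytics", "user", "secret")) {
            try (Statement st = conn.createStatement()) {
                // Index supporting "overlaps the requested interval" lookups.
                st.execute("CREATE INDEX IF NOT EXISTS entries_start_end_idx "
                        + "ON entries (start_ts, end_ts)");
            }
            // Unique entries active between two months ago and yesterday.
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT count(DISTINCT entry_id) FROM entries "
                    + "WHERE start_ts < now() - interval '1 day' "
                    + "  AND end_ts   > now() - interval '2 months'");
                 ResultSet rs = ps.executeQuery()) {
                if (rs.next()) {
                    System.out.println("unique entries: " + rs.getLong(1));
                }
            }
        }
    }
}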
My Postgres has been running really slowly lately; an aggregation over a month usually ends up taking more than a minute (to be exact, the last one took 7 minutes and 23 seconds).
Last Friday I recreated the servers (master and replica) and reimported the database.
The first thing I noticed is that the database went from 133 GB down to 42 GB (the actual data is around 12 GB; I guess the rest is indexes).
Everything was fast as hell for a day; after that the indexing finished (26 GB of indexes), and now I'm back to square one.
A count on ~5 million rows takes 3 minutes and 42 seconds.
I made the autovacuum more aggressive and it looks like it's doing its job now, but the DB is still slow.
I am using the DB for an API, so it's constantly growing. At the moment I have two tables: one with around 5 million rows and the other with 28 million.
So while the master has a lot of activity and, let's say, I expect some performance loss there, I don't expect it from the replica.
What's curious is that after a restart it's really fast for an hour or so.
Another thing I noticed is that on every query I run, I/O is at 100% while memory and CPU are almost not used at all.
Any help would be greatly appreciated.
Update
The same database on a smaller machine works like a charm.
Same queries, same indexes.
The only difference is the traffic; the smaller machine isn't writing or updating that much.
Also, I forgot to mention one thing: one of my indexes is clustered.
The live machine is a 5-core with 64 GB of RAM and 3k IOPS.
The test machine is a 2-core with 4 GB and an SSD.
Update
Found my issue.
Apparently the autovacuum can't get a lock, and by the time it does, the dead tuples have increased.
I made the autovacuum more aggressive for now and deleted a bunch of unused indexes.
I still don't know how to fix the lock issue, though.
Update
It looks like something is increasing the estimated row count.
Since my last update the row count has increased by 2 million.
I guess that by tomorrow the row count will again be around 12 million and the count will be slow as hell again.
Could this be related to autovacuum?
Update
Well, I found my issue.
It looks like Postgres loses a lot of speed on a write-intensive database.
I had a column that was used as a flag and updated many times per day.
Everything looks really good after the flag and the updates were removed.
Any clue on how to fix this issue on a write-intensive table?
Maybe the following pointers will help:
Are you really sure you want to run a 5-million-row aggregation for an API, every time? Can't you split the data into chunks such that only a small number of chunks actually receive most of the new rows (so the aggregation of all the previous chunks can be reused for the next query)? Time is one such measure; serial numbers could be another, etc. If so, partitioning the data is an obvious solution you should investigate; it really has a good chance of giving you sub-second query times (assuming you store the aggregations for previous chunks smartly). A sketch follows after these pointers.
A hunch about that first-hour magic: although this data fits in RAM, concurrent querying pushes the data set out, and then it's purely disk I/O... and in that case, CPU/RAM being idle isn't a surprise.
Finally, I think this setup is asking for a redesign. There is only so much you can do with a single SQL query, and expecting sub-second query times for a 5-million-row data set that is not within RAM is probably too optimistic!
(Nonetheless, do post your findings, if possible)
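To illustrate the partitioning idea from the first pointer, a rough sketch (table and column names are invented, and it assumes PostgreSQL 10+ declarative partitioning; on older versions you would use inheritance or pg_partman):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class MonthlyPartitions {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/api", "user", "secret");
             Statement st = conn.createStatement()) {
            // Parent table partitioned by time; new rows land almost entirely in the current month.
            st.execute("CREATE TABLE events (id bigserial, created_at timestamptz NOT NULL, "
                    + "payload jsonb) PARTITION BY RANGE (created_at)");
            st.execute("CREATE TABLE events_2019_01 PARTITION OF events "
                    + "FOR VALUES FROM ('2019-01-01') TO ('2019-02-01')");
            st.execute("CREATE TABLE events_2019_02 PARTITION OF events "
                    + "FOR VALUES FROM ('2019-02-01') TO ('2019-03-01')");
            // Aggregates over closed months can be computed once and stored,
            // so a live query only has to aggregate the current partition.
        }
    }
}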
I am using Spring Data Neo4j 4.0.0 with Neo4j 2.2.1 and I am trying to import a timetree-like object with 2 levels under the root. The object is built first and saved at the end, and at some point during the saving process I get this StackOverflowError:
Exception in thread "main" java.lang.StackOverflowError
at java.lang.Character.codePointAt(Character.java:4668)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3693)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
at java.util.regex.Pattern$Branch.match(Pattern.java:4500)
at java.util.regex.Pattern$Branch.match(Pattern.java:4500)
at java.util.regex.Pattern$Branch.match(Pattern.java:4500)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4466)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4615)
at java.util.regex.Pattern$Curly.match0(Pattern.java:4177)
at java.util.regex.Pattern$Curly.match(Pattern.java:4132)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
at java.util.regex.Pattern$Branch.match(Pattern.java:4502)
at java.util.regex.Pattern$Branch.match(Pattern.java:4500)
at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3715)
at java.util.regex.Pattern$Start.match(Pattern.java:3408)
at java.util.regex.Matcher.search(Matcher.java:1199)
at java.util.regex.Matcher.find(Matcher.java:618)
at java.util.Formatter.parse(Formatter.java:2517)
at java.util.Formatter.format(Formatter.java:2469)
at java.util.Formatter.format(Formatter.java:2423)
at java.lang.String.format(String.java:2792)
at org.neo4j.ogm.cypher.compiler.IdentifierManager.nextIdentifier(IdentifierManager.java:48)
at org.neo4j.ogm.cypher.compiler.SingleStatementCypherCompiler.newRelationship(SingleStatementCypherCompiler.java:71)
at org.neo4j.ogm.mapper.EntityGraphMapper.getRelationshipBuilder(EntityGraphMapper.java:357)
at org.neo4j.ogm.mapper.EntityGraphMapper.link(EntityGraphMapper.java:315)
at org.neo4j.ogm.mapper.EntityGraphMapper.mapEntityReferences(EntityGraphMapper.java:262)
at org.neo4j.ogm.mapper.EntityGraphMapper.mapEntity(EntityGraphMapper.java:154)
at org.neo4j.ogm.mapper.EntityGraphMapper.mapRelatedEntity(EntityGraphMapper.java:524)
at org.neo4j.ogm.mapper.EntityGraphMapper.link(EntityGraphMapper.java:324)
at org.neo4j.ogm.mapper.EntityGraphMapper.mapEntityReferences(EntityGraphMapper.java:262)
at org.neo4j.ogm.mapper.EntityGraphMapper.mapEntity(EntityGraphMapper.java:154)
at org.neo4j.ogm.mapper.EntityGraphMapper.mapRelatedEntity(EntityGraphMapper.java:524)
...
Thank you in advance; any suggestions would be really appreciated!
SDN 4 isn't really intended to be used for batch importing your objects into Neo4j. It's an Object Graph Mapping framework for general-purpose Java applications, not a batch importer (which brings its own specific set of problems to the table). Some of the design decisions made to support the intended use case for SDN run contrary to what you would do if you were designing a purpose-built ETL. We are also constrained by the performance of Neo4j's HTTP transactional endpoint, which, although by no means slow in absolute terms, cannot hope to compete with the Batch Inserter, for example.
There are some performance improvements we will be making in the future, and when the new binary protocol for Neo4j is released (2.3), we will be plugging that in as our transfer protocol. We expect this to improve transfer speeds to and from the database by at least an order of magnitude. However, please don't expect these changes to radically alter the behavioural characteristics of SDN 4. While a future version might be able to load a few thousand nodes much faster than it currently can, it still won't be an ETL tool, and I wouldn't expect it to be used as such.
After some hours of trial and error, I finally found that I needed to limit my save depth.
Previously, I didn't specify the depth, and the saved object grew larger and larger as the insertion of its children also ran concurrently. After passing a depth of 1 to every save call, I finally got rid of the StackOverflowError. And by not saving continuously (I put all the objects in an ArrayList and save them all at the end), I gained about a minute (from 3.5 minutes down to 2.5 minutes) for importing ca. 1,000 nodes (with relationships).
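In essence, the saving part now looks roughly like this (package name, connection URL, and session setup are simplified placeholders; in a Spring application the Session would come from the SDN configuration):
import org.neo4j.ogm.session.Session;
import org.neo4j.ogm.session.SessionFactory;

import java.util.ArrayList;
import java.util.List;

public class TimeTreeImporter {
    public static void main(String[] args) {
        SessionFactory sessionFactory = new SessionFactory("org.example.timetree"); // placeholder domain package
        Session session = sessionFactory.openSession("http://localhost:7474");

        List<Object> nodes = new ArrayList<>();
        // ... build the whole tree first, adding each entity to 'nodes' ...

        for (Object node : nodes) {
            session.save(node, 1); // depth 1: the entity plus its direct relationships only
        }
    }
}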
Nevertheless, the performance is still not satisfying, since I could import over 60,000 records in less than a minute with my previous MongoDB implementation. I don't know whether this is because of SDN 4 and whether it would be faster with the embedded API. I'm really curious whether anyone has done any benchmarking of SDN 4 against the embedded API.
I'm having trouble understanding the complexities of Lucene. Any help would be appreciated.
We're using a Windows Azure blob to store our Lucene index, with Lucene.Net and AzureDirectory. A WorkerRole contains the only IndexWriter, and it adds 20,000 or more records a day, and changes a small number (fewer than 100) of the existing documents. A WebRole on a different box is set up to take two snapshots of the index (into another AzureDirectory), alternating between the two, and telling the WebService which directory to use as it becomes available.
The WebService has two IndexSearchers that alternate, reloading as the next snapshot is ready--one IndexSearcher is supposed to handle all client requests at a time (until the newer snapshot is ready). The IndexSearcher sometimes takes a long time (minutes) to instantiate, and other times it's very fast (a few seconds). Since the directory is physically on disk already (not using the blob at this stage), we expected it to be a fast operation, so this is one confusing point.
We're currently up around 8 million records. The Lucene search used to be so fast (it was great), but now it's very slow. To try to improve this, we've started to IndexWriter.Optimize the index once a day after we back it up--some resources online indicated that Optimize is not required for often-changing indexes, but other resources indicate that optimization is required, so we're not sure.
The big problem is that whenever our web site has more traffic than a single user, we're getting timeouts on the Lucene search. We're trying to figure out if there's a bottleneck at the IndexSearcher object. It's supposed to be thread-safe, but it seems like something is blocking the requests so that only a single search is performed at a time. The box is an Azure VM, set to a Medium size so it has lots of resources available.
Thanks for whatever insight you can provide. Obviously, I can provide more detail if you have any further questions, but I think this is a good start.
I have much larger indexes and have not run into these issues (~100 million records).
Put the indexes in memory if you can (8 million records sounds like it should fit into memory, depending on the number of analyzed fields, etc.). You can use a RAMDirectory as the cache directory.
IndexSearcher is thread-safe and is supposed to be re-used, but I am not sure that is the reality. In Lucene 3.5 (the Java version), there is a SearcherManager class that manages the searcher for use across multiple threads (see the sketch at the end of this answer).
http://java.dzone.com/news/lucenes-searchermanager
Also, a non-Lucene point: if you are on an extra-large+ VM, make sure you are taking advantage of all of the cores. Especially if you have a Web API/ASP.NET front-end for it, those calls should all be asynchronous.
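For reference, the SearcherManager pattern looks roughly like this in Java (shown against a recent Lucene API; constructor signatures vary between versions, and newer Lucene.Net releases expose an equivalent class):
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.SearcherFactory;
import org.apache.lucene.search.SearcherManager;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;

public class SearchService {
    private final SearcherManager manager;

    public SearchService(Directory snapshotDir) throws Exception {
        // One SearcherManager shared by all request threads.
        this.manager = new SearcherManager(snapshotDir, new SearcherFactory());
    }

    public TopDocs search(Query query, int n) throws Exception {
        IndexSearcher searcher = manager.acquire(); // thread-safe checkout
        try {
            return searcher.search(query, n);
        } finally {
            manager.release(searcher);              // always give the searcher back
        }
    }

    public void onIndexUpdated() throws Exception {
        // Call periodically (or after swapping in a new snapshot) so that
        // subsequent searches see the refreshed reader.
        manager.maybeRefresh();
    }
}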