I am using OrientDB (versions 2.2.0 through 2.2.6). I have created a graph schema and inserted about 50 million records (edges and vertices). My problem is performance when editing the graph from the web admin console: rendering and zooming are very slow, and I also often get
java.lang.OutOfMemoryError
when I query for a large number of records, e.g. (select from V limit 10000). My question is whether there are any configuration settings I need to change, or whether I am doing something wrong.
The graph editor was not designed to work with that many nodes (10K).
Even if it had been, 10k nodes is a lot even for canvas rendering.
You should work with fewer nodes, around 1-500.
In Gremlin,
s = graph.traversal()
g = graph.traversal(computer())
I know the first one is for OLTP and the second for OLAP, and I know the difference between OLAP and OLTP at the definition level. I have the following questions:
How do the two traversal sources above differ in how they work?
Can I use the second one ('g') in queries in my application to get results? (I understand that 'g' gives results faster than the first one.)
What is the difference between OLAP and OLTP, with an example?
Thanks in advance.
From the user's perspective, in terms of results, there's no real difference between OLAP and OLTP. The Gremlin statements are the same save for configuration of the TraversalSource as you have shown with your use of withComputer() and other settings.
The difference is more in how the traversal is executed behind the scenes. OLAP-based traversals are meant to process the "entire graph" (i.e. all vertices/edges, perhaps more than once), whereas OLTP-based traversals are meant to process smaller bodies of data, typically starting with one or a handful of vertices and traversing from there. When you consider graphs at the scale of "billions of edges", it's easy to understand why an efficient mechanism like OLAP is needed to process them.
You really shouldn't think of OLTP vs OLAP as "faster" vs "slower". It's probably better to think of it as it is described in the documentation:
OLTP: real-time, limited data accessed, random data access, sequential processing, querying
OLAP: long running, entire data set accessed, sequential data access, parallel processing, batch processing
There's no reason why you can't use an OLAP traversal in your applications so long as your application is aware of the requirements of that traversal. If you have some SLA that says that REST requests must complete in under 0.5 seconds and you decide to use an OLAP traversal to get the answer, you will undoubtedly break your SLA. Assuming you execute the OLAP traversal job over Spark, it will take Spark 10-15 seconds just to get organized to run your job.
I'm not sure how to provide an example of OLAP and OLTP, except to talk about the use cases a little bit more, so it should be clear as to when to use one as opposed to the other. In any case, let's assume you have a graph with 10 billion edges. You would want your OLTP traversals to always start with some form of index lookup - like a traversal that shows the average age of the friends of the user "stephenm":
g.V().has('username','stephenm').out('knows').values('age').mean()
but what if I want to know the average age of every user in my database? In this case I don't have an index I can use to look up a "small set of starting vertices" - I have to process all the many millions/billions of vertices in my graph. This is a perfect use case for OLAP:
g.V().hasLabel('user').values('age').mean()
OLAP is also great for understanding growth of your graph and for maintaining your graph. With billions of edges and a high data ingestion rate, not knowing that your graph is growing improperly is a death sentence. It's good to use OLAP to grab global statistics over all the data in the graph:
g.E().label().groupCount()
g.V().label().groupCount()
In the above examples, you get an edge/vertex label distribution. If you have an idea of how your graph is growing, this can be a good indicator of whether or not your data ingestion process is working properly. On a billion-edge graph, trying to execute even one of these traversals in OLTP fashion would take "forever", if it ever finished at all without error.
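To make the configuration side concrete, here is a minimal Java sketch of building both kinds of TraversalSource with TinkerPop. This is only a sketch: the Hadoop/Spark properties file path is a placeholder, and which Graph implementation you open depends entirely on your provider.

import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer;
import org.apache.tinkerpop.gremlin.structure.Graph;
import org.apache.tinkerpop.gremlin.structure.util.GraphFactory;

public class TraversalSources {
    public static void main(String[] args) {
        // Placeholder configuration -- point this at your own graph provider.
        Graph graph = GraphFactory.open("conf/hadoop-graph/hadoop-gryo.properties");

        // OLTP: the traversal runs locally, starting from a small set of vertices.
        GraphTraversalSource s = graph.traversal();

        // OLAP: the same traversal API, but executed as a Spark job over the whole graph.
        GraphTraversalSource g = graph.traversal().withComputer(SparkGraphComputer.class);

        // OLTP-style: index lookup first, then walk out from a handful of vertices.
        Number friendAge = s.V().has("username", "stephenm")
                .out("knows").values("age").mean().next();

        // OLAP-style: touches every 'user' vertex in the graph.
        Number userAge = g.V().hasLabel("user").values("age").mean().next();

        System.out.println(friendAge + " / " + userAge);
    }
}

The traversal text stays the same in both cases; only the source (s vs g) decides whether it runs as a local walk or as a distributed job.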
I have searched the Internet and the manual for performance tuning options to reduce the build time of my de Bruijn graph in OrientDB.
Everything below is done in Java.
In OrientDB:
Kmers are indexed.
Edges are property based.
What I want to do is:
read in a multiple-sequence file
split each sequence into kmers
add the kmers to the database and create edges between adjacent kmers
Steps one and two are already done.
So when I add a kmer to OrientDB, I have to check whether that kmer already exists, and if it does I need its vertex so I can add a new edge.
Is there a fast way to do this?
I already created a local hash map with the kmer as key and the OrientDB RID as value, but getting the vertex still seems to take a lot of time.
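In outline, the lookup-and-link step looks roughly like the following sketch (class, property and edge names are only illustrative, not my exact code):

import java.util.HashMap;
import java.util.Map;

import com.orientechnologies.orient.core.intent.OIntentMassiveInsert;
import com.tinkerpop.blueprints.Vertex;
import com.tinkerpop.blueprints.impls.orient.OrientGraphNoTx;

public class KmerLoader {
    private final OrientGraphNoTx graph;
    // Local cache: kmer string -> OrientDB RID, so repeated kmers skip the index lookup.
    private final Map<String, Object> kmerRids = new HashMap<String, Object>();

    public KmerLoader(String url) {
        graph = new OrientGraphNoTx(url);
        graph.getRawGraph().declareIntent(new OIntentMassiveInsert());
    }

    private Vertex getOrCreateKmer(String kmer) {
        Object rid = kmerRids.get(kmer);
        if (rid != null) {
            // Loading directly by RID avoids the index entirely.
            return graph.getVertex(rid);
        }
        Vertex v = graph.addVertex("class:Kmer", "seq", kmer);
        kmerRids.put(kmer, v.getId());
        return v;
    }

    public void link(String fromKmer, String toKmer) {
        Vertex from = getOrCreateKmer(fromKmer);
        Vertex to = getOrCreateKmer(toKmer);
        graph.addEdge(null, from, to, "next");
    }
}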
I already tried with:
OGlobalConfiguration.USE_WAL.setValue(false);    // disable the write-ahead log (faster bulk load, no crash recovery)
OGlobalConfiguration.TX_USE_LOG.setValue(false); // disable the transaction log as well
declareIntent(new OIntentMassiveInsert());       // tell OrientDB to optimize for massive inserts
It currently takes nearly 3 hours to add 256 kmers and 40,000,000 edges.
Also, the resulting database is 9 GB, while the input file was 40 MB.
Any suggestions on how to improve this?
Feel free to ask if something is not understandable.
Thanks a lot.
Michael
Edit:
Do you have any experience with the record grow factor? I think the node records hold some in/out edge information by default. Could I improve the runtime with RECORD_GROW_FACTOR?
OrientDB: 2.1.3
pyorient: 1.4.7
I need to import a graph with one hundred thousand vertices and half a million edges into OrientDB via pyorient.
db.command one by one
At first, I just used db.command("create vertex V set a=1") to insert all the vertices and edges one by one.
But it took me about two hours.
So I want to find a way to optimize this process.
Massive Insert?
Then I found that OrientDB supports a Massive Insert intent, but unfortunately the author of pyorient mentioned in the issue "massive insertion: no transacations?" that
in the binary protocol (and in pyorient of course) there is not the massive insert intent.
SQL batch
pyorient supports SQL batch. Maybe this is an opportunity!
I put all the insert commands together and ran them with db.batch().
Taking a graph with 5000 vertices and 20000 edges as an example:
SQL batch
  construct vertices: 25.1708816278 s
  construct edges: 254.248636227 s
original (one by one)
  construct vertices: 19.5094766904 s
  construct edges: 147.627924276 s
It seems that SQL batch actually costs much more time.
So I want to know whether there is a better way to do this.
Thanks.
When you insert the entries one by one, have you already tried whether you get better performance by using a transactional graph and committing every X items? Usually that is the correct way to insert a lot of data. Unfortunately, as you already noted, with pyorient you cannot use the Massive Insert intent, and multi-process approaches cannot be exploited either: the socket connection is only one and all your concurrent operations are serialized (as in a pipeline) because a connection pool is not implemented in the driver, so you lose the performance advantages of multiprocessing.
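For reference, this is what the commit-every-X-items pattern looks like with the OrientDB Java Graph API; the URL, class name and batch size below are just examples, and with pyorient you would have to approximate the same batching through the commands the driver exposes.

import com.tinkerpop.blueprints.impls.orient.OrientGraph;

public class BatchImport {
    public static void main(String[] args) {
        // Transactional graph: commit in chunks instead of once per record.
        OrientGraph graph = new OrientGraph("plocal:/data/mygraph");
        final int batchSize = 1000; // tune this; a few hundred to a few thousand is typical
        try {
            for (int i = 0; i < 100000; i++) {
                graph.addVertex("class:V", "a", i);
                if (i % batchSize == 0) {
                    graph.commit(); // flushes the pending transaction and keeps memory bounded
                }
            }
            graph.commit();
        } finally {
            graph.shutdown();
        }
    }
}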
We have an old Typo3 4.5 application running very slowly on any server. Using blackfire.io and general debugging we are trying to figure out the bottleneck and whether we can alleviate some of the processing time.
The application uses the tt_news extension on various pages on the website. The index page has a lot of different tt_news modules, displaying entries for various categories and the like. All these entries also usually have a picture associated with them.
One of the major bottlenecks is the large number of SQL queries executed by Typo3. Particularly taxing is the following query, which is executed 247 times (!) on the start page:
SELECT ... FROM cache_imagesizes WHERE md5filename = ? limit ?
So far I wasn't able to find any resource on how to alleviate this, or whether it is even possible. I think that the tt_news extension is just extremely inefficient.
Any input is appreciated.
Please try to set an index on md5filename. Since this is a hash column, an index prefix of no more than about five characters should be enough.
247 queries means that you have 247 pictures being taken into account to render the front page. If you do not actually display that many pictures, try to reduce the data taken into account by setting limits in the plugins and in the rendering.
I'm having trouble understanding the complexities of Lucene. Any help would be appreciated.
We're using a Windows Azure blob to store our Lucene index, with Lucene.Net and AzureDirectory. A WorkerRole contains the only IndexWriter, and it adds 20,000 or more records a day, and changes a small number (fewer than 100) of the existing documents. A WebRole on a different box is set up to take two snapshots of the index (into another AzureDirectory), alternating between the two, and telling the WebService which directory to use as it becomes available.
The WebService has two IndexSearchers that alternate, reloading as the next snapshot is ready--one IndexSearcher is supposed to handle all client requests at a time (until the newer snapshot is ready). The IndexSearcher sometimes takes a long time (minutes) to instantiate, and other times it's very fast (a few seconds). Since the directory is physically on disk already (not using the blob at this stage), we expected it to be a fast operation, so this is one confusing point.
We're currently up around 8 million records. The Lucene search used to be so fast (it was great), but now it's very slow. To try to improve this, we've started to IndexWriter.Optimize the index once a day after we back it up--some resources online indicated that Optimize is not required for often-changing indexes, but other resources indicate that optimization is required, so we're not sure.
The big problem is that whenever our web site has more traffic than a single user, we're getting timeouts on the Lucene search. We're trying to figure out if there's a bottleneck at the IndexSearcher object. It's supposed to be thread-safe, but it seems like something is blocking the requests so that only a single search is performed at a time. The box is an Azure VM, set to a Medium size so it has lots of resources available.
Thanks for whatever insight you can provide. Obviously, I can provide more detail if you have any further questions, but I think this is a good start.
I have much larger indexes and have not run into these issues (~100 million records).
Put the index in memory if you can (8 million records sounds like it should fit into memory, depending on the number of analyzed fields etc.). You can use a RAMDirectory as the cache directory.
IndexSearcher is thread-safe and supposed to be re-used, but I am not sure if that is the reality. In Lucene 3.5 (Java version) they have a SearcherManager class that manages multiple threads for you.
http://java.dzone.com/news/lucenes-searchermanager
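For illustration, here is a minimal sketch of the SearcherManager pattern in Java (this uses the newer Java API; the Lucene.Net port's class and method names may differ slightly, and the index path and field names are placeholders):

import java.nio.file.Paths;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.SearcherManager;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class SearchService {
    private final SearcherManager manager;

    public SearchService(String indexPath) throws Exception {
        Directory dir = FSDirectory.open(Paths.get(indexPath));
        // null means the default SearcherFactory; pass your own to warm new searchers.
        manager = new SearcherManager(dir, null);
    }

    // Call this when a new snapshot has been swapped in, or on a timer.
    public void refresh() throws Exception {
        manager.maybeRefresh();
    }

    // Safe to call from many threads concurrently.
    public int countTopHits(String field, String value) throws Exception {
        IndexSearcher searcher = manager.acquire();
        try {
            TopDocs hits = searcher.search(new TermQuery(new Term(field, value)), 10);
            return hits.scoreDocs.length;
        } finally {
            manager.release(searcher); // never close an acquired searcher yourself
        }
    }
}

The key point is that every search acquires and then releases a shared searcher, and only the refresh call ever swaps the underlying reader, so searchers are reused across threads instead of being re-instantiated per snapshot.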
Also, a non-Lucene point: if you are on an extra-large or larger VM, make sure you are taking advantage of all of the cores. Especially if you have a Web API/ASP.NET front-end for it, those calls should all be asynchronous.