Performance tuning when creating a de Bruijn graph in OrientDB

I have searched the Internet and also the manual for performance-tuning options to decrease the build time of my de Bruijn graph with OrientDB.
All of the following is done in Java.
In OrientDB:
K-mers are indexed.
Edges are property-based.
What I want to do is:
read in a multiple-sequence file
split the sequences into k-mers
add the k-mers to the database and create edges between adjacent k-mers
Steps one and two are done already.
So when I add a k-mer to OrientDB, I have to check whether this k-mer already exists, and if so I need its vertex so that I can add a new edge.
Is there any fast way to do so?
I already created a local hash with the k-mer as key and the OrientDB RID as value. But getting the vertex seems to take a lot of time.
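For context, a minimal sketch of the kind of insert loop described above (the database path, class names, edge label and k-mers are placeholders):

import com.orientechnologies.orient.core.id.ORID;
import com.orientechnologies.orient.core.intent.OIntentMassiveInsert;
import com.tinkerpop.blueprints.impls.orient.OrientGraphNoTx;
import com.tinkerpop.blueprints.impls.orient.OrientVertex;

import java.util.HashMap;
import java.util.Map;

public class DeBruijnLoader {
    public static void main(String[] args) {
        // Non-transactional graph plus the massive-insert intent to reduce per-record overhead.
        OrientGraphNoTx graph = new OrientGraphNoTx("plocal:/tmp/debruijn");
        graph.getRawGraph().declareIntent(new OIntentMassiveInsert());
        if (graph.getVertexType("Kmer") == null) graph.createVertexType("Kmer");
        if (graph.getEdgeType("next") == null) graph.createEdgeType("next");

        // Local cache: k-mer string -> RID, so the index is only hit on the first occurrence.
        Map<String, ORID> kmerCache = new HashMap<>();

        String[] kmers = {"ACGT", "CGTA", "GTAC"}; // placeholder k-mer stream
        OrientVertex previous = null;
        for (String kmer : kmers) {
            OrientVertex current;
            ORID rid = kmerCache.get(kmer);
            if (rid == null) {
                current = graph.addVertex("class:Kmer", "seq", kmer);
                kmerCache.put(kmer, current.getIdentity());
            } else {
                current = graph.getVertex(rid); // this lookup is the part that seems slow
            }
            if (previous != null) {
                previous.addEdge("next", current); // edge between adjacent k-mers
            }
            previous = current;
        }
        graph.shutdown();
    }
}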
I already tried with:
OGlobalConfiguration.USE_WAL.setValue(false);
OGlobalConfiguration.TX_USE_LOG.setValue(false);
declareIntent(new OIntentMassiveInsert());
It takes me nearly 3 hours to add 256 k-mers and 40,000,000 edges.
Also, the created DB is 9 GB in size, while the input file was 40 MB.
Any suggestions how to improve it?
Feel free to ask if something is not understandable.
Thanks a lot.
Michael
Edit:
I think the node record holds some in- and out-edge information in it by default. Could I improve the runtime with RECORD_GROW_FACTOR? Do you have any experience with that?

Related

Data retention in timescaledb

Trying to wrap my head around timescaledb, but my google-fu is failing me. Most likely because I'm not searching for the correct term.
With RRD tool, old data can be stored as averages, reducing the amount of data being stored.
I can't seem to find out how to do this with timescaledb. I'd like 5-minute resolution for 90 days, but after that it's pointless to keep all those data points, and I'd like to reduce them to 30- or 60-minute averages for a couple of years, then maybe daily averages after that.
Is this something that I can set in the database itself, or is this something I would have to implement in a housekeeping job?
We had the exact same question half a year ago.
The term "Data Retention" is also used by the timescaledb team. It is currently implemented using drop_chunks policies (see their doc here). It's a Enterprise feature but IMHO not (yet) as useful as it could/should be (and it surely does not do what you are looking for).
Let me explain: probably the easiest approach to down-sampling your data is Continuous Aggregates (their docs here). You can quite easily aggregate virtually any numeric value to whatever resolution you desire. However, Continuous Aggregates are affected by the deletions done by drop_chunks, too. Your data is gone.
One workaround would be to create additional hypertables instead, and then write your own background workers that copy the data from the original hi-res table into these new lo-res hypertables.
For housekeeping, either use the Data Retention Enterprise feature or create your own background workers.
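A rough sketch of such a worker in Java/JDBC, assuming a hi-res hypertable metrics(time, device_id, value) and a manually created lo-res table metrics_30m (all names, the connection string and the 90-day cutoff are made up for illustration):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class DownsampleWorker {

    // Copy 30-minute averages of data older than 90 days into the lo-res table.
    private static final String DOWNSAMPLE_SQL =
        "INSERT INTO metrics_30m (bucket, device_id, avg_value) " +
        "SELECT time_bucket('30 minutes', time), device_id, avg(value) " +
        "FROM metrics " +
        "WHERE time < now() - interval '90 days' " +
        "GROUP BY 1, 2 " +
        "ON CONFLICT DO NOTHING";

    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/metrics", "postgres", "secret");
             PreparedStatement stmt = conn.prepareStatement(DOWNSAMPLE_SQL)) {
            int rows = stmt.executeUpdate();
            System.out.println("Down-sampled " + rows + " rows");
        }
        // Once the averages are copied, drop_chunks (or the retention policy) can
        // remove the old hi-res chunks without losing the aggregated history.
    }
}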

Counting Super Nodes On Titan

In my system I have the requirement that the number of edges on a node must be stored as an internal property on the vertex, as well as in a vertex-centric index on a specific outgoing edge. This naturally requires me to count the number of edges on the node after all the data has finished loading. I do so as follows:
long edgeCount = graph.getGraph().traversal().V(vertexId).bothE().count().next();
However, when I scale up my tests to the point where some of my nodes are "super" nodes, I get the following exception on the above line:
Caused by: com.netflix.astyanax.connectionpool.exceptions.TransportException: TransportException: [host=127.0.0.1(127.0.0.1):9160, latency=4792(4792), attempts=1]org.apache.thrift.transport.TTransportException: Frame size (70936735) larger than max length (62914560)!
at com.netflix.astyanax.thrift.ThriftConverter.ToConnectionPoolException(ThriftConverter.java:197) ~[astyanax-thrift-3.8.0.jar!/:3.8.0]
at com.netflix.astyanax.thrift.AbstractOperationImpl.execute(AbstractOperationImpl.java:65) ~[astyanax-thrift-3.8.0.jar!/:3.8.0]
at com.netflix.astyanax.thrift.AbstractOperationImpl.execute(AbstractOperationImpl.java:28) ~[astyanax-thrift-3.8.0.jar!/:3.8.0]
at com.netflix.astyanax.thrift.ThriftSyncConnectionFactoryImpl$ThriftConnection.execute(ThriftSyncConnectionFactoryImpl.java:153) ~[astyanax-thrift-3.8.0.jar!/:3.8.0]
at com.netflix.astyanax.connectionpool.impl.AbstractExecuteWithFailoverImpl.tryOperation(AbstractExecuteWithFailoverImpl.java:119) ~[astyanax-core-3.8.0.jar!/:3.8.0]
at com.netflix.astyanax.connectionpool.impl.AbstractHostPartitionConnectionPool.executeWithFailover(AbstractHostPartitionConnectionPool.java:352) ~[astyanax-core-3.8.0.jar!/:3.8.0]
at com.netflix.astyanax.thrift.ThriftColumnFamilyQueryImpl$4.execute(ThriftColumnFamilyQueryImpl.java:538) ~[astyanax-thrift-3.8.0.jar!/:3.8.0]
at com.thinkaurelius.titan.diskstorage.cassandra.astyanax.AstyanaxKeyColumnValueStore.getNamesSlice(AstyanaxKeyColumnValueStore.java:112) ~[titan-cassandra-1.0.0.jar!/:na]
What is the best way to fix this? Should I simply increase the frame size, or is there a better way to count the number of edges on the node?
Yes, you will need to increase the frame size. When you have a supernode, there is a really big row that needs to be read out of the storage backend, and this is even true in the OLAP case. I agree that if you are planning to calculate this on every vertex in the graph, this would be best done as an OLAP operation.
This and several other good tips can be found in this Titan mailing-list thread. Keep in mind that the link is pretty old; the concepts are still valid, but some of the Titan configuration property names may be different.
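The frame size can be raised when the graph is opened; a rough sketch, assuming the cassandrathrift backend and an illustrative 128 MB value (check the exact property name against your Titan version):

import com.thinkaurelius.titan.core.TitanFactory;
import com.thinkaurelius.titan.core.TitanGraph;

public class OpenGraphWithLargerFrames {
    public static void main(String[] args) throws Exception {
        // Raise the client-side thrift frame size so the big supernode row fits into one frame.
        // 128 MB is only an example value; the Cassandra server's
        // thrift_framed_transport_size_in_mb may need to be raised to match.
        TitanGraph graph = TitanFactory.build()
                .set("storage.backend", "cassandrathrift")
                .set("storage.hostname", "127.0.0.1")
                .set("storage.cassandra.frame-size-mb", 128)
                .open();

        // ... run the edge-count traversal as before ...
        graph.close();
    }
}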
Such a task, which is OLAP by its nature, should be performed using a distributed system, not using a traversal.
There is a concept called GraphComputer in TinkerPop 3, which can be used to perform such a task.
It basically allows you to run Gremlin queries that will be evaluated on multiple machines.
For example, you can use SparkGraphComputer to run your queries on top of Apache Spark.
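A minimal sketch of what that could look like here, assuming a Hadoop/Spark configuration file is already in place (its name and the vertex id are placeholders):

import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer;
import org.apache.tinkerpop.gremlin.structure.Graph;
import org.apache.tinkerpop.gremlin.structure.util.GraphFactory;

public class EdgeCountOlap {
    public static void main(String[] args) throws Exception {
        // The properties file points the graph at the Titan input format and the Spark master.
        Graph graph = GraphFactory.open("conf/hadoop-graph.properties");

        // Route the traversal through SparkGraphComputer instead of the standard OLTP engine.
        GraphTraversalSource g = graph.traversal().withComputer(SparkGraphComputer.class);

        Object vertexId = 1234L; // placeholder id
        long edgeCount = g.V(vertexId).bothE().count().next();
        System.out.println("edges: " + edgeCount);

        graph.close();
    }
}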

DQS performance: how many rows in a project can you handle?

This question is strictly DQS-performance related.
The ‘customers’ table I need to clean has 40,000,000 rows. I created a matching policy using a subset (no issues there, I just used the top 10,000 rows).
Now when I want to run a data quality project, I can't take the entire table in one project; it just won't respond. I only managed to handle 400,000 rows at a time, and even then it takes almost 2 hours. It's also not the best solution, because I have to run the project on a view where the id is between 1 and 400,000.
Any solution to this, guys?
I am also wondering where the bottleneck is: is it CPU or disk?
Regards.

Performance improvement for fetching records from a table of 10 million records in a Postgres DB

I have an analytics table that contains 10 million records, and to produce charts I have to fetch records from it. Several other tables are also joined to this table when the data is fetched, but it takes around 10 minutes even though I have indexed the join columns and used materialized views in Postgres. Performance is still very low: it takes 5 minutes just to execute the select query against the materialized view.
Please suggest a technique to get the result within 5 seconds. I don't want to change the DB storage structure, as too many code changes would be needed to support that. I would like to know whether there are built-in methods for improving query speed.
Thanks in Advance
In general you can take care of this issue by creating a better data structure (most engines already do this for you to an extent with keys).
But if you create a sorting column of sorts and build a tree-like structure over it, lookups cost O(log N) rather than the full scans you may be facing right now. This will ensure you always get a big speed-up in your searches.
This applies to binary trees, red-black trees and so on.
Another option for a speed-up would be to use something along the lines of Redis, i.e. a database caching layer.
For analytics in the past I have also chosen to use Hadoop-related technologies, though that may be a larger migration in your case at this point.
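To make the caching-layer idea concrete, a minimal cache-aside sketch using the Jedis client (the key, the 5-minute TTL and the query helper are all placeholders):

import redis.clients.jedis.Jedis;

public class ChartDataCache {

    private final Jedis jedis = new Jedis("localhost", 6379);

    // Return cached chart data if present; otherwise run the expensive query
    // against the materialized view and cache the serialized result.
    public String getChartData(String chartKey) {
        String cached = jedis.get(chartKey);
        if (cached != null) {
            return cached;
        }
        String json = runExpensiveQuery(chartKey); // placeholder for the real JDBC call
        jedis.setex(chartKey, 300, json);          // keep it for 5 minutes
        return json;
    }

    private String runExpensiveQuery(String chartKey) {
        // ... query against the materialized view goes here ...
        return "{}";
    }
}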

SQL Server: splitting a mirrored DB onto multiple devices

Say I have a large mirrored 1 TB production DB that resides on a single MDF device, and I would like to split it up into, say, five 200 GB devices.
I want to do this without interruption to Production.
I thought I could break the mirror and use the RESTORE process for creating a mirror to achieve the split to multiple devices quickly and without interruption to Production. Doing this twice would allow me to get this done in a few hours.
Has anyone done this? Is it the preferred method seeing as we are mirroring anyways?
What are other my alternatives, Pros and Cons? And gotchas?
Also, I recall another, more organic process where one would create the 5 new devices and somehow, over time, get the objects to move over to them. I'm not sure of the process for this, but I seem to recall it being discussed. It sounds like it could take a long time and possibly cause some blocking at times?
Thanks
...Ray
This isn't quite as simple a process as it first looks. Just adding the files to SQL Server isn't enough: even if you were to add 4 new files, they would all be empty space, so you would have one file with 1 TB of data in it and 4 empty ones. The empty files would eventually fill up, because SQL Server uses a proportional fill method across its files, but most of your queries would still be hitting the single full file.
I take it you are doing this to improve performance? If so, you will need to move data around into different files in order to actually split the data up. Whether you can do this online or not depends on whether you are running Enterprise Edition or not (as this allows you to rebuild indexes online).
An easy way to move a table (or more accurately a clustered index, which is pretty much the same thing as the table for all intents and purposes) is to add a new filegroup with a new data file and then rebuild the clustered index specifying the new filegroup. You can use the following to do this:
CREATE CLUSTERED INDEX Existing_Index_Name ON schema_name.table_name (column_name)
WITH (DROP_EXISTING = ON, ONLINE = ON) ON [new_filegroup_name]
GO
This will create the new index on the new filegroup and get rid of the old one, and if you are running Enterprise Edition it will do it all without blocking users.
See the following link for more methods of moving the data between filegroups:
Move data between SQL Server database filegroups
You should also look into partitioning your tables to help improve performance:
Partitioning Tables and Indexes
With regard to your mirroring setup: break the mirror, then on the primary add all your files/filegroups and move the data between the filegroups. Then back up the modified database on the primary, restore it on the mirror (so all the files are set up the same there), and re-establish mirroring.