Query via the Gremlin Server REST API is slower than in the Gremlin shell - Titan

I'm using the TinkerPop 3 Gremlin Server.
I execute a simple query (using the standard REST API) to get the edges of a vertex:
g.V(123456).outE('label')
When there are many results (about 2,000-3,000), the query is very slow; it takes more than 20 seconds to get the JSON response.
The interesting thing is that when I run the same query in the Gremlin shell, it takes about 1 second to receive the edge objects!
I'm not sure, but I suspect that Gremlin Server's JSON serializer (I'm using GraphSON) may be the problem (perhaps it is very slow).
Any ideas?
Thanks

That does seem pretty slow, but it is building a potentially large result set in memory, and serializing a graph element such as an entire Vertex or Edge is a bit "heavy" because it tries to match the general structure of the Vertex/Edge API hierarchy. We've seen that you can get faster serialization by just changing your query to:
g.V(123456).outE('label').valueMap()
or if you need the id/label as well:
g.V(123456).outE('label').valueMap(true)
This way the Vertex/Edge hierarchy gets flattened to a simple Map, which has less serialization overhead. In short, limit what you return to the data you actually need on the client side to improve your performance.
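For reference, here's a minimal sketch of submitting the flattened query from a client, using the gremlin npm driver (newer than the setup in the question) over the server's WebSocket channel; the endpoint, vertex id, and edge label below are assumptions to adapt to your setup:

// Minimal sketch with the "gremlin" npm package; endpoint/id/label are assumptions.
import gremlin from "gremlin";

const client = new gremlin.driver.Client("ws://localhost:8182/gremlin", {
  traversalSource: "g",
});

// valueMap(true) returns flat maps (plus id/label) instead of full Edge
// structures, so the GraphSON payload stays small.
const results = await client.submit(
  "g.V(vid).outE(lbl).valueMap(true)",
  { vid: 123456, lbl: "label" },
);
console.log(results.toArray());
await client.close();

The same principle applies to the HTTP endpoint: the less of the element hierarchy you ask the server to serialize, the smaller and faster the GraphSON response.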

Related

Why is saving data from an API to CSV faster than uploading it to a MongoDB database?

My question revolves around understanding the two procedures below (particularly their performance and code logic) that I used to collect trade data from the US Census Bureau API. I have already collected the data, but I ended up writing two different ways of requesting and saving it, which my questions pertain to.
A summary of my final questions comes at the bottom.
First way: npm request and MongoDB to save the data
I limited my procedure using tiny-async-pool (which caps how many concurrent invocations of a function run at once) so as not to request too much at once, hit timeouts, or overload my database with queries. Simply put, the bottleneck I was facing was the database: the API requests returned rather quickly (1-15 seconds depending on body size), but saving each array item to its own MongoDB document took 100 ms to 700 ms (the returned data was a nested array, ranging from a few hundred items to over one hundred thousand, with at most 10 values in each item). To save time after potential errors and not redo the same queries, I also checked my database before making each request to see if that query was already complete. The end result was that I did not follow this method, since it was very error-prone and susceptible to timeouts when the data was large (I even set the timeout to 10 minutes in the request options). A sketch of this pattern follows below.
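As a rough sketch of that first pipeline (the URLs, collection names, and concurrency limit below are illustrative, not the original code):

// Sketch of the first approach: pooled requests, one MongoDB document per array item.
import asyncPool from "tiny-async-pool";
import { MongoClient } from "mongodb";

const mongo = await MongoClient.connect("mongodb://localhost:27017");
const trades = mongo.db("census").collection("trades");
const urls: string[] = [/* API request URLs */];

async function fetchAndSave(url: string): Promise<void> {
  if (await trades.findOne({ sourceUrl: url })) return; // query already completed
  const rows: unknown[][] = await fetch(url).then((r) => r.json());
  // One insert (100-700 ms) per array item: with 100k+ items, this is the bottleneck.
  for (const row of rows) {
    await trades.insertOne({ sourceUrl: url, row });
  }
}

await asyncPool(5, urls, fetchAndSave); // cap concurrency at 5 (tiny-async-pool v1 API)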
Second way: npm request and save data to CSV
I used the same approach as the first method for the requests and concurrency; however, I saved each query's results to its own CSV file (sketched below). To handle errors without redoing successful queries, I also checked whether the file already existed and, if so, skipped that query. This approach was error-free: I ran it, and after a few hours all the data was saved. Writing to CSV was insanely fast, much more so than using MongoDB.
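And the CSV variant, under the same illustrative assumptions (a real implementation would also need proper CSV escaping):

// Sketch of the second approach: one CSV file per query, skipped if it already exists.
import * as fs from "node:fs";

async function fetchToCsv(url: string, file: string): Promise<void> {
  if (fs.existsSync(file)) return; // don't redo successful queries
  const rows: unknown[][] = await fetch(url).then((r) => r.json());
  // One sequential file write replaces up to 100k+ per-document database round trips.
  fs.writeFileSync(file, rows.map((r) => r.join(",")).join("\n"));
}

The existence check plays the same role as the database lookup in the first method, but costs a file stat instead of a query.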
Final summary and questions
My end goal was to get the data in the easiest manner possible. I used JavaScript because that's where I learned API requests and async operations, even though I will do most of my data analysis with Python and pandas. I first tried the database method, mostly because I thought it was the right way and I wanted to improve my database CRUD skills. After countless hours of refactoring code and trying new techniques, I still could not get it to work properly. I resorted to the CSV method, which was (a) much less code to write, (b) fewer checks, (c) faster, and (d) more reliable.
My final questions are these:
Why was the CSV approach better than the database approach? Any counterarguments or different approaches you would have used?
How do you handle bottlenecks and concurrency in your applications with regard to APIs and database operations? Do your techniques in production environments differ from personal use cases (in my case, I just needed the data, and a few hours of waiting was fine)?
Would you have used a different programming language or different package/module for this data collection procedure?

OLAP and OLTP queries in Gremlin

In Gremlin,
s = graph.traversal()
g = graph.traversal(computer())
I know the first one is for OLTP and the second for OLAP, and I know the difference between OLAP and OLTP at the definition level. I have the following questions:
How do the above two traversal sources differ in how they work?
Can I use the second one, using 'g', in queries in my application to get results? (I know this 'g' one gives results faster than the first one.)
What is the difference between OLAP and OLTP, with an example?
Thanks in advance.
From the user's perspective, in terms of results, there's no real difference between OLAP and OLTP. The Gremlin statements are the same save for configuration of the TraversalSource as you have shown with your use of withComputer() and other settings.
The difference is more in how the traversal is executed behind the scenes. OLAP-based traversals are meant to process the "entire graph" (i.e. all vertices/edges, perhaps more than once), whereas OLTP-based traversals are meant to process smaller bodies of data, typically starting with one or a handful of vertices and traversing from there. When you consider graphs at the scale of "billions of edges", it's easy to understand why an efficient mechanism like OLAP is needed to process them.
You really shouldn't think of OLTP vs OLAP as "faster" vs "slower". It's probably better to think of it as it is described in the documentation:
OLTP: real-time, limited data accessed, random data access, sequential processing, querying
OLAP: long running, entire data set accessed, sequential data access, parallel processing, batch processing
There's no reason why you can't use an OLAP traversal in your applications so long as your application is aware of the requirements of that traversal. If you have some SLA that says that REST requests must complete in under 0.5 seconds and you decide to use an OLAP traversal to get the answer, you will undoubtedly break your SLA. Assuming you execute the OLAP traversal job over Spark, it will take Spark 10-15 seconds just to get organized to run your job.
I'm not sure how to provide an example of OLAP versus OLTP except to talk about the use cases a little more, so that it's clear when to use one as opposed to the other. In any case, let's assume you have a graph with 10 billion edges. You would want your OLTP traversals to always start with some form of index lookup, like a traversal that shows the average age of the friends of the user "stephenm":
g.V().has('username','stephenm').out('knows').values('age').mean()
but what if I want to know the average age of every user in my database? In this case I don't have any index I can use to look up a "small set of starting vertices"; I have to process all the many millions/billions of vertices in my graph. This is a perfect use case for OLAP:
g.V().hasLabel('user').values('age').mean()
OLAP is also great for understanding growth of your graph and for maintaining your graph. With billions of edges and a high data ingestion rate, not knowing that your graph is growing improperly is a death sentence. It's good to use OLAP to grab global statistics over all the data in the graph:
g.E().label().groupCount()
g.V().label().groupCount()
In the above examples, you get an edge/vertex label distribution. If you have an idea of how your graph is growing, this can be a good indicator of whether or not your data ingestion process is working properly. On a billion-edge graph, trying to execute even one of these traversals in OLTP mode would take "forever", if it ever finished at all without error.

Spring Data Neo4j 4.0.0: StackOverflowError

I am using Spring Data Neo4j 4.0.0 with Neo4j 2.2.1, and I am trying to import a time-tree-like object with 2 levels under the root. The object is built up and saved at the end, and at some point during the saving process I get this StackOverflowError:
Exception in thread "main" java.lang.StackOverflowError
at java.lang.Character.codePointAt(Character.java:4668)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3693)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
at java.util.regex.Pattern$Branch.match(Pattern.java:4500)
at java.util.regex.Pattern$Branch.match(Pattern.java:4500)
at java.util.regex.Pattern$Branch.match(Pattern.java:4500)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4466)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4615)
at java.util.regex.Pattern$Curly.match0(Pattern.java:4177)
at java.util.regex.Pattern$Curly.match(Pattern.java:4132)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
at java.util.regex.Pattern$Branch.match(Pattern.java:4502)
at java.util.regex.Pattern$Branch.match(Pattern.java:4500)
at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3715)
at java.util.regex.Pattern$Start.match(Pattern.java:3408)
at java.util.regex.Matcher.search(Matcher.java:1199)
at java.util.regex.Matcher.find(Matcher.java:618)
at java.util.Formatter.parse(Formatter.java:2517)
at java.util.Formatter.format(Formatter.java:2469)
at java.util.Formatter.format(Formatter.java:2423)
at java.lang.String.format(String.java:2792)
at org.neo4j.ogm.cypher.compiler.IdentifierManager.nextIdentifier(IdentifierManager.java:48)
at org.neo4j.ogm.cypher.compiler.SingleStatementCypherCompiler.newRelationship(SingleStatementCypherCompiler.java:71)
at org.neo4j.ogm.mapper.EntityGraphMapper.getRelationshipBuilder(EntityGraphMapper.java:357)
at org.neo4j.ogm.mapper.EntityGraphMapper.link(EntityGraphMapper.java:315)
at org.neo4j.ogm.mapper.EntityGraphMapper.mapEntityReferences(EntityGraphMapper.java:262)
at org.neo4j.ogm.mapper.EntityGraphMapper.mapEntity(EntityGraphMapper.java:154)
at org.neo4j.ogm.mapper.EntityGraphMapper.mapRelatedEntity(EntityGraphMapper.java:524)
at org.neo4j.ogm.mapper.EntityGraphMapper.link(EntityGraphMapper.java:324)
at org.neo4j.ogm.mapper.EntityGraphMapper.mapEntityReferences(EntityGraphMapper.java:262)
at org.neo4j.ogm.mapper.EntityGraphMapper.mapEntity(EntityGraphMapper.java:154)
at org.neo4j.ogm.mapper.EntityGraphMapper.mapRelatedEntity(EntityGraphMapper.java:524)
...
Thank you in advance; your suggestions would be really appreciated!
SDN 4 isn't really intended to be used to batch-import your objects into Neo4j. It's an object-graph mapping framework for general-purpose Java applications, not a batch importer (which brings its own specific set of problems to the table). Some of the design decisions that support SDN's intended use case run contrary to what you would do if you were designing a purpose-built ETL. We are also constrained by the performance of Neo4j's HTTP transactional endpoint, which, although by no means slow in absolute terms, cannot hope to compete with, for example, the Batch Inserter.
There are some improvements to performance we will be making in the future, and when the new binary protocol for Neo4j is released (2.3), we will be plugging that in as our transfer protocol. We expect this to improve transfer speeds to and from the database by at least an order of magnitude. However, please don't expect these changes to radically alter the behavioural characteristics of SDN 4. While a future version might be able to load a few thousand nodes much faster than it currently can, it still won't be an ETL tool, and I wouldn't expect it to be used as such.
After some hours of trial and error, I finally found that I needed to limit the save depth.
Previously I didn't specify the depth, and the saved object grew larger and larger as its children were inserted as well. After passing a depth of 1 to every save call (i.e. session.save(object, 1)), I finally got rid of the StackOverflowError. And by not saving incrementally (I put all the objects in an ArrayList and save them all at the end), I gained a minute of performance (from 3.5 minutes down to 2.5 minutes) for importing ca. 1,000 nodes (with relationships).
Nevertheless, the performance is still not satisfying, since I could import over 60,000 records in less than 1 minute with my previous MongoDB implementation. I don't know whether this is down to SDN 4 or whether it would be faster with the embedded API. I'm really curious whether anyone has benchmarked SDN 4 against the embedded API.

Will I typically get better performance if I run an update/calc loop via JavaScript?

I have a script that loops over a set of records, performs some statistical calculations, and updates the records. It's a big cursor: get a record, calculate statistics from its embedded documents, set fields on the record, save the record. There are <5k records being looped over, and each one embeds 90 history entries.
Question: would I get substantially better performance if I did this via JavaScript? The alternative is writing it in Ruby. My (unfounded) opinion is that since this can be done entirely in the database, I will get better performance if I send a chunk of JS to MongoDB instead of adding Ruby into the mix.
Related: is map/reduce appropriate for finding the median and mode of a set of values for many records?
The answer is really "it depends" - if the fields you need to do the calculations are very large, doing the calculation on the server side with JS might be a lot faster simply by cutting down on network traffic.
But, executing JS on the server side also holds a write lock, so depending on how complicated the calculations are, it might be more efficient to just do your calculations on the client side and then simply update the document.
Your best bet is to do a simple benchmark of Ruby vs. server-side JS. If you need to serve other database traffic at the same time, that should be considered as well, because your lock percentage could differ between the two scenarios (you can monitor it with mongostat).
Also, keep in mind that using db.eval will not work with sharding, so avoid it if you are using a sharded environment or plan to in the future.
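To make the client-side option concrete, here is a minimal sketch using the Node.js mongodb driver (the collection and field names are made up, and bulkWrite is a newer driver feature than this question assumes; a Ruby version would look analogous). It computes the statistic in application code and batches the updates, so no server-side JS, and no eval write lock, is involved:

// Sketch: compute stats client-side, then push all updates in one bulkWrite.
import { AnyBulkWriteOperation, MongoClient } from "mongodb";

const client = await MongoClient.connect("mongodb://localhost:27017");
const records = client.db("app").collection("records");

const ops: AnyBulkWriteOperation[] = [];
for await (const doc of records.find({})) {
  const values = doc.history // ~90 embedded history entries per record
    .map((h: { value: number }) => h.value)
    .sort((a: number, b: number) => a - b);
  const median = values[Math.floor(values.length / 2)];
  ops.push({ updateOne: { filter: { _id: doc._id }, update: { $set: { median } } } });
}
if (ops.length > 0) await records.bulkWrite(ops); // one round trip for all updates
await client.close();

Whether this beats server-side JS still depends on how much of each document has to cross the network, which is exactly the trade-off described above.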

Entity Framework Code First - Reducing round trips with .Load() and .Local

I'm setting up a new application using Entity Framework Code First, and I'm looking at ways to reduce the number of round trips to the SQL Server as much as possible.
When I first read about the .Local property here I got excited about the possibility of bringing down entire object graphs early in my processing pipeline and then using .Local later without ever having to worry about incurring the cost of extra round trips.
Now that I'm playing around with it, I'm wondering if there is any way to pull down all the data I need for a single request in one round trip. If, for example, I have a web page with a few lists on it (news, events, and discussions), is there a way to pull the records from their 3 unrelated source tables into the DbContext in one single round trip? Do you think it's perfectly fine for a single page to make 20 round trips to the DB server? I suppose with a proper caching mechanism in place this issue could be mitigated.
I did run across a couple of attempts at returning multiple result sets from EF queries in one round trip, but I'm not sure the complexity and maturity of those solutions is worth the payoff.
In general, in terms of composing datasets to pass to MVC controllers, do you think it's best to simply make a separate query for each set of records you need and then deal with performance later in the caching layer, using either the EF Caching Provider or ASP.NET caching?
It is completely OK to make several DB calls if you need them. If you are afraid of multiple round trips, you can either write a stored procedure that returns multiple result sets (which doesn't work with default EF features) or execute your queries asynchronously (run multiple disjoint queries at the same time). Loading unrelated data with a single LINQ query is not possible.
One more note: if you decide to use the asynchronous approach, make sure you use a separate context instance in each asynchronous execution. Asynchronous execution uses a separate thread, and the context is not thread-safe.
I think you are doing a lot of work for little gain if you don't already have a performance problem. Yes, pay attention to what you are doing and don't make unnecessary calls, but the actual connection and across-the-wire overhead for each query is usually really low, so don't worry about it.
Remember "Premature optimization is the root of all evil".
My rule of thumb is that executing a call for each collection of objects you want to retrieve is OK; executing a call for each row you want to retrieve is bad. If your web page requires 20 collections, then 20 calls is OK.
That being said, reducing this to one call would not be difficult if you use the Translate method. Code something like this would work:
// Translate lives on ObjectContext; with a DbContext, go through IObjectContextAdapter.
var objectContext = ((IObjectContextAdapter)context).ObjectContext;
using (var reader = GetADataReader(sql)) // returns a DbDataReader over multiple result sets
{
    var firstCollection = objectContext.Translate<Whatever1>(reader).ToList();
    reader.NextResult(); // advance to the next result set
    var secondCollection = objectContext.Translate<Whatever2>(reader).ToList();
    // ...and so on for each result set
}
The big downside to doing this is that if you put your SQL into a stored proc, your stored procs become very specific to your web pages instead of being more general-purpose. This isn't the end of the world as long as you have good access to your database; otherwise, you could just define your SQL in code.