I'm using Titan with Cassandra and have several (related) questions about querying the database with Gremlin:
1.) Is there a faster way to count all vertices than
g.V.count()
Titan claims to use an index. But how can I use an index without property?
WARN c.t.t.g.transaction.StandardTitanTx - Query requires iterating over all vertices [<>]. For better performance, use indexes
2.) Is there a faster way to count all vertices with property 'myProperty' than
g.V.has('myProperty').count()
Again Titan reports the following:
WARN c.t.t.g.transaction.StandardTitanTx - Query requires iterating over all vertices [(myProperty<> null)]. For better performance, use indexes
But again, how can I do this? I already have an index on 'myProperty', but it needs a value to query fast.
3.) And the same questions with edges...
Iterating all vertices with g.V.count() is the only way to get the count. It can't be done "faster". If your graph is so large that it takes hours to get an answer or your query just never returns at all, you should consider using Faunus. However, even with Faunus you can expect to wait for your answer (such is the nature of Hadoop...no sub-second response here), but at least you will get one.
Any time you do a table scan (i.e. iterate all vertices) you get that "iterating over all vertices" warning. Generally speaking, you don't want to do that, because on a large graph the query may take hours or never return. Adding an index won't help you count all vertices any faster.
Edges have the same answer. Use g.E.count() in Gremlin if you can. If it takes too long, then try Faunus so you can at least get an answer.
Doing a count is expensive in big distributed graph databases. You can keep a node that tracks the aggregate numbers you need most often and update it from a cron job so the values are always at hand. Usually, if you have millions of vertices, working with the count from the previous hour is not a disaster.
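The "store the count and refresh it on a schedule" idea could be sketched roughly as below. This is only an illustration in Python, not Titan or Gremlin API: CachedCount, count_fn and max_age_seconds are hypothetical names, and count_fn stands in for whatever runs the full scan (for example g.V.count()).

```python
import time
from typing import Callable, Optional


class CachedCount:
    """Holds the last known result of an expensive count and when it was taken."""

    def __init__(self, count_fn: Callable[[], int], max_age_seconds: float = 3600.0):
        self._count_fn = count_fn        # hypothetical: performs the full scan
        self._max_age = max_age_seconds  # how stale a stored count may be
        self._value: Optional[int] = None
        self._updated_at: float = 0.0

    def refresh(self, now: Optional[float] = None) -> int:
        """Run the expensive count; this is what the scheduled job would call."""
        self._value = self._count_fn()
        self._updated_at = time.time() if now is None else now
        return self._value

    def get(self, now: Optional[float] = None) -> int:
        """Return the stored count, recomputing only if it is missing or too old."""
        now = time.time() if now is None else now
        if self._value is None or now - self._updated_at > self._max_age:
            return self.refresh(now)
        return self._value
```

In a real setup the stored value would live in the database itself (the "node" mentioned above) rather than in process memory, so that every client sees the same number.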
I would like to retrieve the total number of incoming or outgoing edges of a given type to/from a vertex in OrientDB. The obvious method is to construct a query using count() and inE(MyEdgeType), outE(MyEdgeType), or bothE(MyEdgeType). However, I am concerned about time complexity; if this operation is O(N) rather than O(1), I might be better off storing the number in the database rather than using count() each time I need it, because I anticipate the number of edges in question becoming very large. I have searched the documentation, but it does not seem to list the time complexities of OrientDB's functions. Also, I am unsure of whether to use in/out/both or inE/outE/bothE; I presume the E versions will be faster, but depending on how OrientDB stores edges under the hood, that may be wrong.
Is counting the set of incoming/outgoing/both edges of a given type to/from a vertex a constant-time operation--and if not, what is its time complexity? To be most efficient, do I need to use inE/outE/bothE, or in/out/both? Or is there some other method entirely that I've missed?
In OrientDB this is O(1): inE()/outE()/bothE().size() with or without the edge label as parameter.
I have about 1000 sensors outputting data during the day. Each sensor outputs about 100,000 points per day. When I query the data I am only interested in getting data from a given sensor on a given day. I don't do any cross-sensor queries. The time series are unevenly spaced and I need to keep the time resolution, so I cannot do things like arrays of 1 point per second.
I plan to store data over many years. I wonder which scheme is the best:
1. Each day/sensor pair corresponds to one collection, thus adding 1000 collections of about 100,000 documents each per day to my db.
2. Each sensor corresponds to a collection. I have a fixed number of 1000 collections that grow every day by about 100,000 documents each.
1 intuitively seems faster for querying. I am using MongoDB 3.4, which has no limit on the number of collections in a db.
2 seems cleaner, but I am afraid the collections will become huge and that querying will gradually become slower as each collection grows.
I am favoring 1 but I might be wrong. Any advice?
Update:
I followed the advice of
https://bluxte.net/musings/2015/01/21/efficient-storage-non-periodic-time-series-mongodb/
Instead of storing one document per measurement, I have a document containing 128 measurements plus startDate and nextDate fields. It reduces the number of documents and thus the index size, but I am still not sure how to organize the collections.
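A rough sketch of that bucketing, in Python. Only the bucket size of 128 and the startDate/nextDate names come from the description above; the other field names (sensor, measurements, t, v) are made up for illustration, and nextDate is assumed here to be the timestamp of the first measurement in the following bucket (None for the last one).

```python
from datetime import datetime
from typing import Dict, List, Tuple

BUCKET_SIZE = 128  # measurements per document, as described above


def to_buckets(sensor_id: str,
               points: List[Tuple[datetime, float]],
               bucket_size: int = BUCKET_SIZE) -> List[Dict]:
    """Group (timestamp, value) points into bucket documents of bucket_size."""
    points = sorted(points, key=lambda p: p[0])
    buckets = []
    for i in range(0, len(points), bucket_size):
        chunk = points[i:i + bucket_size]
        # Assumption: nextDate is the start of the following bucket, if any.
        has_next = i + bucket_size < len(points)
        next_date = points[i + bucket_size][0] if has_next else None
        buckets.append({
            "sensor": sensor_id,             # hypothetical field name
            "startDate": chunk[0][0],
            "nextDate": next_date,
            "measurements": [{"t": t, "v": v} for t, v in chunk],
        })
    return buckets
```

With about 100,000 points per sensor per day this gives roughly 780 documents per sensor per day instead of 100,000.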
When I query data, I just want the data for a (date,sensor) pair, that is why I thought 1 might speed up the reads. I currently have about 20,000 collections in my DB and when I query the list of all collections, it takes ages which makes me think that it is not a good idea to have so many collections.
What do you think?
I would definitely recommend approach 2, for a number of reasons:
MongoDB's sharding is designed to cope with individual collections getting larger and larger, and copes well with splitting data within a collection across separate servers as required. It does not have the same ability to split data which exists in many collections across different servers.
MongoDB is designed to be able to efficiently query very large collections, even when the data is split across multiple servers, as long as you can pick a suitable shard key which matches your most common read queries. In your case, that would be sensor + date.
With approach 1, your application needs to do the fiddly job of knowing which collection to query, and (possibly) where that collection is to be found. Approach 2, with well-configured sharding, means that the mongos process does that hard work for you.
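To make the "sensor + date" point concrete, here is a small Python sketch of what the key and the matching read could look like. The field names sensor and ts are assumptions, and the dictionaries are plain data in the shape MongoDB drivers accept; nothing here talks to a server.

```python
from datetime import date, datetime, timedelta
from typing import Dict

# Hypothetical compound key (field, direction) matching the common read pattern.
SENSOR_DAY_KEY = [("sensor", 1), ("ts", 1)]


def sensor_day_filter(sensor_id: str, day: date) -> Dict:
    """Build a query filter for one sensor on one calendar day."""
    start = datetime(day.year, day.month, day.day)
    end = start + timedelta(days=1)
    # Half-open range [start, end) so midnight of the next day is excluded.
    return {"sensor": sensor_id, "ts": {"$gte": start, "$lt": end}}
```

Because every read names a sensor and a bounded time range, a query shaped like this can be served from the leading fields of the key instead of scanning the whole collection.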
Whilst MongoDB has no limit on collections, I tried a similar approach to 2 but moved away from it to a single collection for all sensor values because it was more manageable.
Your planned data collection is significant. Have you considered ways to reduce the volume? In my system I compress same-value runs and only store changes. I can also reduce the volume by skipping co-linear midpoints and interpolating later when, say, I want to know what the value was at time 't'. Different sensors may need different compression algorithms (e.g. a stepped sensor like a thermostat set-point vs. one that represents a continuous quantity like a temperature). Having a single large collection also makes it easy to discard data when it does get too large.
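The two reductions mentioned above (storing only changes, and skipping co-linear midpoints with later interpolation) might look something like this in Python. This is a sketch under assumptions, not the system described in the answer: points are (time, value) pairs with strictly increasing times, and all function names are invented.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (time, value)


def drop_repeated_values(points: List[Point]) -> List[Point]:
    """Keep only the points where the value changes (suits stepped signals)."""
    kept: List[Point] = []
    for t, v in points:
        if not kept or kept[-1][1] != v:
            kept.append((t, v))
    return kept


def drop_collinear_midpoints(points: List[Point], tol: float = 1e-9) -> List[Point]:
    """Drop points lying on the straight line between their neighbours."""
    if len(points) < 3:
        return list(points)
    kept = [points[0]]
    for i in range(1, len(points) - 1):
        (t0, v0), (t1, v1), (t2, v2) = kept[-1], points[i], points[i + 1]
        # The cross product is zero when the three points are collinear.
        cross = (t1 - t0) * (v2 - v0) - (t2 - t0) * (v1 - v0)
        if abs(cross) > tol:
            kept.append(points[i])
    kept.append(points[-1])
    return kept


def value_at(points: List[Point], t: float) -> float:
    """Linearly interpolate the value at time t from the kept points."""
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        if t0 <= t <= t1:
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    raise ValueError("t is outside the recorded range")
```

The first function fits a set-point style sensor, where the last stored value holds until the next change; the second fits a continuous quantity, where value_at recovers the dropped midpoints.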
If you can guarantee unique timestamps you may also be able to use the timestamp as the _id field.
When I query the data I'm only interested in getting data from a
given sensor on a given day. I don't do any cross sensor queries.
But that's exactly what Cassandra is good for!
See this article and this one.
Really, in one of our projects we were stuck with legacy MongoDB and a scenario similar to yours, except that the amount of new data per day was even lower.
We tried changing the data structure, splitting the data across multiple MongoDB collections, changing replica set configurations, etc.
But we were still disappointed: as the data grew, performance degraded
under unpredictable load, and read requests noticeably slowed down write responses.
With Cassandra we had fast writes, and the improvement in read performance was visible to the naked eye. If you need complex data analysis and aggregation, you can always use a Spark (map-reduce style) job.
Moreover, thinking about future, Cassandra provides straightforward scalability.
I believe that keeping something for legacy is good as long as it suits well, but if not, it's more effective to change the technology stack.
If I understand correctly, you plan to create collections on the fly, i.e. at 12 AM you will have new collections. I think MongoDB is the wrong choice for this. In MongoDB there is no way to query documents across collections, so if you ever need that you will have to write a complex mechanism to retrieve the data. In my opinion, you should consider Elasticsearch, where you can create indices (the counterpart of collections) with names like sensor-data-s1-3-14-2017 and then do a wildcard search across indices (e.g. sensor-data-s1* or sensor-data-*). See here for wildcard search.
If you want to go with MongoDB my suggestion is to go with option 2 and shard the collections. While sharding, consider your query pattern so you could get optimal performance and that does not degrade over the period.
Approach #1 is not a good idea; the key to speed is divide (shard) and rule. What if the number of signals itself reaches 100,000?
So place one signal in one collection and shard the signals over nodes to speed up reads. Multiple collections or signals can live on the same node.
How this will help
Signal processing usually works on a time span, e.g. processing a signal over 3 days. In that case you can read from 3 nodes in parallel for that signal and process the data in parallel with Apache Spark.
Cross-signal processing: most signal processing algorithms, such as cross-correlation, analyze 2 or more signals over the same period. Because these signals are fetched in parallel, this will also be fast, and the pre-processing of each individual signal can be parallelized.
I know random record selection is not actually supported by MongoDB yet, but I have found a few ways to work around it.
However, I want to select a weighted random item. This is fairly easy with MySQL, but I'm not sure of the best way to go about it with Mongo.
The problem I am solving is: I have a collection that holds sweepstakes entries, and based on the number of times a user shares/promotes the contest, they get an "extra entry", to increase their chance of winning. Rather than duplicate the user's entry, I have a field that records the number of times they have shared the contest. I want to use this number as a multiplier to weight the random selection of a "winner".
Here are a few approaches I have thought of:
Use a variation on the Cookbook random selection method, generating an array of random numbers (equal to the multiplier), for greater chances the record will be near the random point queried (but Mongo doesn't support array [multi-key] indexes, yes? so it might be slow)
Another variation on the Cookbook random method using a geospatial query, using a round polygon with a radius equal to the multiplier instead of a simple random number (if this is even possible, I've never used MongoDB geo indexes and queries)
Expand the entries in a new temporary collection, then use one of the MongoDB random selection methods
Avoid the problem and just store the duplicated entries in Mongo in the first place, and do a regular random select thingamajig
Keep a separate index of the MongoIDs and their weight multipliers in MySQL (either constantly synced, or generated on demand) and use MySQL to do a random weighted selection
Query out a huge array to do it in PHP and hope it doesn't run out of memory! :/
Am I on to anything here? Any other suggestions, for an obvious solution I am missing? I'm going to do some experimenting to see what works, but any feedback on my initial ideas is welcome!!
Performance needs to be "good" not great, since none of these contests are probably ever going to have millions of entries (usually more like [tens of] thousands), so fairness/accuracy is more important than speed. Thanks.
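For a few tens of thousands of entries, the "query everything out and pick in application code" option from the list above can be sketched as follows. The example is in Python rather than PHP, and it assumes each entry counts once plus once per share (weight = 1 + shares); the question only says the share count is used as a multiplier, so adjust the weighting if it is meant differently. pick_winner and the (entry_id, shares) shape are hypothetical.

```python
import bisect
import itertools
import random
from typing import Optional, Sequence, Tuple


def pick_winner(entries: Sequence[Tuple[str, int]],
                rng: Optional[random.Random] = None) -> str:
    """Pick one entry_id at random, weighted by 1 + shares (assumed weighting)."""
    rng = rng or random.Random()
    weights = [1 + shares for _, shares in entries]
    # Running totals: entry i owns the tickets in [cumulative[i-1], cumulative[i]).
    cumulative = list(itertools.accumulate(weights))
    ticket = rng.randrange(cumulative[-1])  # 0 <= ticket < total weight
    index = bisect.bisect_right(cumulative, ticket)
    return entries[index][0]
```

Only the id and the share count of each entry need to be loaded, so memory use stays small, and every "ticket" has exactly the same chance of being drawn.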
I am building a MongoDB database that will work with an Android app. I have a user collection and a records collection. The records documents consist of GPS track data such as start and end coordinates, total time, top speed and distance. The user document has user id, first name, last name and so forth.
I want to have aggregate stats for each user that summarizes total distance, total time, total average speed and top speed to date.
I am confused about whether I should do a map reduce and create an aggregate collection for users, or whether I should add these stats to the user document with some kind of cron-job-type solution. I have read many guides about map reduce and aggregation for MongoDB but can't figure this out.
Thanks!
It sounds like your aggregate indicator values are per-user, in which case I would simply calculate them and push them directly into the user object at the same time as you update the current coordinates, speed etc. They would be nice and easy (and fast) to query, and you could aggregate them further if you wished.
When I say pre-calculate, I don't mean MapReduce, which you would use as a batch process, I simply mean calculate on update of the user object.
If your aggregate stats are compiled across users, then you could still pre-calculate them on update, but if you also need to be able to query those aggregate stats against some other condition or filter, such as, "tell me what the total distance travelled for all users within x region", then depending on the number of combinations you may not be able to cover all those with pre-calculation.
So, if your aggregate stats ARE across users, AND need some sort of filter applying, then they'll need to be calculated from some snapshot of data. The two approaches here are:
the aggregation framework in 2.2
MapReduce
You would need to use MapReduce say, if you've a LOT of historical data that you want to crunch and you can pre-calculate the results for fast reading later. By my definition, that data isn't changing frequently, but even if it did, you can also use incremental MR to add new results to an existing calculation.
The aggregation framework in 2.2 will allow you to do a lot of this on demand. It won't be as quick as reading pre-calculated values, of course, but it is much quicker than MR when executed on demand. It can't cope with the high-volume result sets that you can handle with MR, but it's better suited to queries where you don't know the parameter values in advance.
By way of example, if you wanted to calculate the aggregate sums of users stats within a particular lat/long, you couldn't use MR because there are just too many combinations of that filter, so you'd need to do that on the fly.
If however, you wanted it by city, well you could conceivably use MR there because you could stick to a finite set of cities and just pre-calculate them all.
But to wrap up, if your aggregate indicator values are per-user alone, then I'd start by calculating and storing the values inside the user object when I update the user object as I said in the first paragraph. Yes, you're storing the value as well as the inputs, but that's the model that saves you having to calculate on the fly.
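Calculating on update, as described above, amounts to folding each new record into the running totals stored on the user. A minimal Python sketch, with made-up field names and units (total_distance, total_time, top_speed, avg_speed; kilometres and hours), and with "total average speed" assumed to mean total distance divided by total time:

```python
from typing import Dict


def apply_record(stats: Dict[str, float],
                 distance_km: float,
                 duration_h: float,
                 top_speed_kmh: float) -> Dict[str, float]:
    """Return the user's stats after folding in one new GPS record."""
    updated = {
        "total_distance": stats.get("total_distance", 0.0) + distance_km,
        "total_time": stats.get("total_time", 0.0) + duration_h,
        "top_speed": max(stats.get("top_speed", 0.0), top_speed_kmh),
    }
    # Overall average speed is derived from the totals, not averaged per record.
    total_time = updated["total_time"]
    updated["avg_speed"] = updated["total_distance"] / total_time if total_time > 0 else 0.0
    return updated
```

The result would be written back into the user document in the same operation that stores the new record, so reads never have to touch the records collection for these numbers.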
I'm building an application that stores lots of data per user (possibly in gigabytes).
Something like a request log, so let's say you have the following fields for every record:
customer_id
date
hostname
environment
pid
ip
user_agent
account_id
user_id
module
action
id
response code
response time (range)
and possibly some more.
The good thing is that the usage will be mostly write-only, but when there are reads
I'd like to be able to answer them quickly, in near real time.
Another prediction about the usage pattern is that most of the time people will be looking at the most recent data,
and will only infrequently query the past, aggregate, etc., so my guess is that the working set will be much smaller than
the whole database, i.e. recent data for most users and ranges of history for some users that are doing analytics right now.
For the latter case I suppose it's OK for the first query to be slower until it gets the range into memory.
But the problem is that I'm not quite sure how to effectively index the data.
The start of the index is clear: it's customer_id and date. But the rest can be
used in any combination and I can't predict the most common ones, at least not with any degree of certainty.
We are currently prototyping this with mongo. Is there a way to do it in mongo (storage/cpu/cost) effectively?
The only thing that comes to mind is to try to predict a couple of frequent queries and index them and just massively shard the data
and ensure that each customer's data is spread evenly over the shards to allow fast table scan over just the 'customer, date' index for the rest
of the queries.
P.S. I'm also open to suggestions about db alternatives.
With this limited number of fields, you could potentially just have an index on each of them, or perhaps on each in combination with customer_id. MongoDB is clever enough to pick the fastest index for each case. If you can fit your whole data set in memory (a few GB is not a lot of data!), then none of this really matters.
You're saying you have a GB per user, but that still means you can have an index on the fields as there are only about a dozen. And with that much data, you want sharding anyway at some point soon.
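As an illustration of "an index on each field in combination with customer_id", the index definitions could be generated like this. This is a Python sketch that only builds the specifications as data, in the list-of-(field, direction) shape that pymongo's create_index accepts; the field list is a subset of the fields in the question, and index_specs is a hypothetical helper.

```python
from typing import List, Sequence, Tuple

# Subset of the fields listed in the question, used here for illustration.
FILTER_FIELDS = ["hostname", "environment", "pid", "ip", "user_agent",
                 "account_id", "user_id", "module", "action"]

IndexSpec = List[Tuple[str, int]]


def index_specs(prefix: Sequence[str] = ("customer_id",),
                fields: Sequence[str] = tuple(FILTER_FIELDS)) -> List[IndexSpec]:
    """One ascending compound index per field, each led by the shared prefix."""
    return [[(p, 1) for p in prefix] + [(f, 1)] for f in fields]
```

Each extra index costs write throughput and memory, so it is worth creating only those that queries actually end up using.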
cheers,
Derick
I think your requirements don't really mix well together. You can't have lots of data and instantaneous ad-hoc queries.
If you use a lot of indexes, then your writes will be slow, and you'll need much more RAM.
May I suggest this:
Keep your index on customer id and date to serve recent data to users and relax your requirements to either real-timeliness or accuracy of aggregate queries.
If you sacrifice accuracy, you will be firing map-reduce jobs every once in a while to precompute queries. Users then may see slightly stale data (or may not, it's historical immutable data, after all).
If you sacrifice speed, then you'll run map-reduce each time (right now it's the only sane way of calculating aggregates in a mongodb cluster).
Hope this helps :)