I am developing an app where users will receive geo-based information depending on their position. The server should be able to handle a huge number of connected clients (>100k).
I have come up with 4 approaches for handling the users' positions.
Approach 1 - Without a geospatial index:
The app server just holds a list of connected clients and their locations.
Whenever information becomes available, the server loops over the whole list and checks whether each client is within the given radius.
Doubts: Very expensive.
Approach 2 - Handle the geospatial index in the app server:
The app server maintains an R-tree with all connected clients and their locations.
For this I was looking at JSI (Java Spatial Index).
Doubts: Updating the geospatial index with JSI is very expensive.
Approach 3 - Let the database (MongoDB) do the geospatial indexing / calculation:
The app server only holds a reference to the connected client (the connection) and saves the key of that reference together with its location in MongoDB.
When new information is available, the server can query the database to get the keys of all clients nearby (rough sketch below).
Pro: I guess MongoDB has a much better implementation of geospatial indexes than I could ever write in the app server.
Doubts: Clients are traveling around, which forces me to update the geospatial index frequently. Can I do that, or will I run into performance problems?
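For illustration, a minimal mongo shell sketch of this approach; the clients collection, the loc field and the variables connectionKey, lng, lat, eventLng, eventLat and radiusMeters are placeholders, not part of any existing schema:

// Store / refresh a client's position, keyed by the connection reference:
db.clients.update(
    { _id: connectionKey },
    { $set: { loc: { type: "Point", coordinates: [ lng, lat ] } } },
    { upsert: true }
)
// Find the keys of all clients within radiusMeters of an event.
// $centerSphere takes the radius in radians, i.e. meters / earth radius in meters:
db.clients.find(
    { loc: { $geoWithin: { $centerSphere: [ [ eventLng, eventLat ], radiusMeters / 6378137 ] } } },
    { _id: 1 }
)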
Approach 4 - Own "index" using a two-dimensional array
Today I was thinking about creating a very simple index using a two-dimensional array: the outer array is for the longitude and the inner one for the latitude.
Let's say full degrees of longitude / latitude would be enough precision.
I could retrieve a list of users in a given area with:
ArrayList<User> userList = data[91][35]; // user at 91.2548980712891, 35.60869979858
// I would also need to get the users in the surrounding cells: 90;35, 92;35, ...
// If I need more precision I could use one more decimal: data[912][356]
Pro: I would have fast read and write access without querying the database.
Doubts: Grid cells cover less distance near the poles. Ugly hack?
I would be very grateful if someone could point me in "the" right direction.
The index used by MongoDB for geospatial indexing is based on a geohash, which essentially converts a two-dimensional space into a one-dimensional key suitable for B-tree indexing. While this is somewhat less efficient than an R-tree index, it will be vastly more efficient than your approach 1. I would also argue that filtering the data at the db level with a spatial query will be more efficient and easier to maintain than building your own spatial indexing strategy on top.
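For illustration, a minimal mongo shell sketch of the index plus a radius query (collection name, field name and coordinates are assumptions, not from your app):

// 2dsphere index (geohash-based under the hood):
db.clients.ensureIndex({ loc: "2dsphere" })
// All clients within 5 km of a point, nearest first:
db.clients.find({
    loc: {
        $nearSphere: {
            $geometry: { type: "Point", coordinates: [ 13.405, 52.52 ] },
            $maxDistance: 5000   // meters
        }
    }
})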
The main thing to be aware of with MongoDB is that you cannot use a geospatial field as a shard key, though you can shard a collection containing a geometry field using another key. Also, if you wish to do any aggregation queries (which isn't clear from your question), the $geoNear stage must be the first stage in the pipeline.
There is also the geoHaystack index, which is based on small buckets and optimized for searches over small areas; see http://docs.mongodb.org/manual/core/geohaystack/. It might be useful in your case.
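As a rough, hedged sketch of what that could look like (field names, bucket size and search values are made up for illustration):

// Haystack index bucketing positions together with an extra filter field:
db.clients.ensureIndex({ pos: "geoHaystack", type: 1 }, { bucketSize: 1 })
// Haystack indexes are queried through the geoSearch command:
db.runCommand({
    geoSearch: "clients",
    near: [ 13.4, 52.5 ],
    maxDistance: 0.2,
    search: { type: "subscriber" },
    limit: 50
})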
As far as speed is concerned, insertion and search on a B-tree index are essentially O(log n) (see the Wikipedia article on B-trees), while without an index your search is O(n), so it will not take very long before the difference in performance between having and not having an index is enormous.
If you are concerned about heavy writes slowing things down, you can tune the write concern in MongoDB so that you don't have to wait for a majority of replicas to acknowledge every write (the default), at the cost of potentially inconsistent data if you lose your primary.
Related
I have about 1000 sensors outputting data during the day. Each sensor outputs about 100,000 points per day. When I query the data I am only interested in getting data from a given sensor on a given day. I don't do any cross-sensor queries. The time series are unevenly spaced and I need to keep the time resolution, so I cannot do things like arrays of 1 point per second.
I plan to store data over many years. I wonder which scheme is the best:
1) each day/sensor pair corresponds to one collection, thus adding 1000 collections of about 100,000 documents each per day to my db
2) each sensor corresponds to a collection. I have a fixed number of 1000 collections that grow every day by about 100,000 documents each.
Option 1 intuitively seems faster for querying. I am using MongoDB 3.4, which has no limit on the number of collections in a db.
Option 2 seems cleaner, but I am afraid the collections will become huge and that querying will gradually become slower as each collection grows.
I am favoring option 1, but I might be wrong. Any advice?
Update:
I followed the advice of
https://bluxte.net/musings/2015/01/21/efficient-storage-non-periodic-time-series-mongodb/
Instead of storing one document per measurement, I now have one document containing 128 measurements, a startDate and a nextDate. This reduces the number of documents and thus the index size, but I am still not sure how to organize the collections.
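For reference, a rough sketch of one such bucket document (field names are my own guess at the layout, not a prescribed schema):

{
    _id: ObjectId("..."),
    sensorId: "s1",
    startDate: ISODate("2017-03-14T00:00:00Z"),
    nextDate: ISODate("2017-03-14T00:02:08Z"),
    count: 128,
    measurements: [
        { t: ISODate("2017-03-14T00:00:00.120Z"), v: 42.1 }
        // ... up to 128 { t, v } entries per bucket
    ]
}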
When I query data, I just want the data for a (date, sensor) pair, which is why I thought option 1 might speed up the reads. I currently have about 20,000 collections in my DB, and when I query the list of all collections it takes ages, which makes me think that it is not a good idea to have so many collections.
What do you think?
I would definitely recommend approach 2, for a number of reasons:
MongoDB's sharding is designed to cope with individual collections getting larger and larger, and copes well with splitting data within a collection across separate servers as required. It does not have the same ability to split data which exists in many collections across different servers.
MongoDB is designed to be able to efficiently query very large collections, even when the data is split across multiple servers, as long as you can pick a suitable shard key which matches your most common read queries. In your case, that would be sensor + date (see the sketch after this list).
With approach 1, your application needs to do the fiddly job of knowing which collection to query, and (possibly) where that collection is to be found. Approach 2, with well-configured sharding, means that the mongos process does that hard work for you.
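A hedged sketch of what that shard key setup could look like in the mongo shell (database, collection and field names are placeholders):

// Shard a single readings collection on a sensor + day compound key:
sh.enableSharding("telemetry")
db.readings.ensureIndex({ sensorId: 1, day: 1 })
sh.shardCollection("telemetry.readings", { sensorId: 1, day: 1 })
// A read for one (sensor, day) pair is then routed by mongos
// to the shard(s) holding that chunk:
db.readings.find({ sensorId: "s1", day: ISODate("2017-03-14T00:00:00Z") })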
Whilst MongoDB has no limit on collections, I tried an approach similar to 2 but moved away from it to a single collection for all sensor values, because it was more manageable.
Your planned data collection is significant. Have you considered ways to reduce the volume? In my system I compress same-value runs and only store changes; I can also reduce the volume by skipping co-linear midpoints and interpolating later when, say, I want to know what the value was at time 't'. Different sensors may need different compression algorithms (e.g. a stepped sensor like a thermostat set-point vs. one that represents a continuous quantity like a temperature). Having a single large collection also makes it easy to discard data when it does get too large.
If you can guarantee unique timestamps you may also be able to use the timestamp as the _id field.
When I query the data I'm only interested in getting data from a
given sensor on a given day. I don't do any cross-sensor queries.
But that's exactly what Cassandra is good for!
See this article and this one.
Really, in one of my projects we were stuck with legacy MongoDB and a scenario similar to yours, except that the amount of new data per day was even lower.
We tried changing the data structure, splitting the data over multiple MongoDB collections, changing the replica set configuration, etc.
But we were still disappointed: as the data grew, performance degraded under the unpredictable load, and read requests noticeably affected write response times.
With Cassandra we had fast writes, and the improvement in read performance was visible to the naked eye. If you need complex data analysis and aggregation, you can always run a Spark (map-reduce) job.
Moreover, thinking about the future, Cassandra provides straightforward scalability.
I believe that keeping a legacy technology is fine as long as it still suits you, but if it doesn't, it's more effective to change the technology stack.
If I understand right, you plan to create collections on the fly, i.e. at 12 AM each day you will have new collections. I guess MongoDB is the wrong choice for this: there is no way to query documents across collections in MongoDB, so you would have to write a complex mechanism to retrieve the data. In my opinion, you should consider Elasticsearch, where you can create indices (collections) like sensor-data-s1-3-14-2017 and do a wildcard search across indices (for example sensor-data-s1* or sensor-data-*). See here for wildcard search.
If you want to go with MongoDB, my suggestion is to go with option 2 and shard the collections. While sharding, consider your query pattern so that you get optimal performance which does not degrade over time.
Approach #1 is not a good idea; the key to speeding things up is divide (shard) and rule. What if the number of signals itself reaches 100,000?
So place each signal in its own collection and shard the signals over nodes to speed up reads. Multiple collections or signals can live on the same node.
How this will help
Signal processing usually works over a time span, e.g. process a signal for 3 days; in that case you can read the signal from 3 nodes in parallel and do parallel Apache Spark processing.
Cross-signal processing: most signal processing algorithms analyse the same period of 2 or more signals (e.g. cross-correlation), and since these 2 or more signals are fetched in parallel this will also be fast; the pre-processing of each individual signal can be parallelized as well.
I would like to hear some suggestions on implementing a database solution for the problem below:
1) There are 100 million XML documents saved to the database per day.
2) The database holds a maximum of 3 days of data.
3) There are 1 million query requests per day.
4) The values by which the documents are filtered are stored in a separate table and mapped to the corresponding XML document ID.
5) Documents are requested based on a date range, documents matching a list of IDs, the top 10 newest documents, and records that are new since the previous request.
Here is what I have done so far:
1) Checked whether I can use Redis: it is limited to a few datatypes, and I cannot use multiple where conditions to filter a hash in Redis. Indexing is based on date plus lots of other fields, and I am unable to choose the right data structure to store this in a hash.
2) Investigated DynamoDB: it is again a key-value store where all the filter conditions would have to be stored as one value. I am not sure whether querying a JSON document to filter the right XML document would be efficient.
3) Investigated Cassandra: it looks like it may fit my requirements, but it has the limitation that read operations might be slow. Cassandra has the advantage of faster write operations over changing data. This looks like the best possible solution so far.
Currently we are using SQL Server, and there is a performance problem, so we are looking for a better solution.
Please suggest, thanks.
It's not that reads in Cassandra might be slow, but it's hard to guarantee an SLA for reads (usually they will be fast, but some of them will be slow).
Cassandra doesn't have the search capabilities which you may need in the future (ordering, searching by many fields, ranked searching). You can probably achieve that with Cassandra, but with an obviously greater amount of effort than with a database built for search operations.
I suggest you look at Lucene/Elasticsearch. Let me quote the features of Lucene from their main website:
Scalable
High-Performance Indexing
over 150GB/hour on modern hardware
small RAM requirements -- only 1MB heap
incremental indexing as fast as batch indexing
index size roughly 20-30% the size of text indexed
Powerful, Accurate and Efficient Search Algorithms
ranked searching -- best results returned first
many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more
fielded searching (e.g. title, author, contents)
sorting by any field
multiple-index searching with merged results
allows simultaneous update and searching
flexible faceting, highlighting, joins and result grouping
fast, memory-efficient and typo-tolerant suggesters
pluggable ranking models, including the Vector Space Model and Okapi BM25
configurable storage engine (codecs)
I am developing an Android app that uses MongoDB to store user records in document format. I will have several records that contain information about a GPS track such as start longitude and latitude, finish longitude and latitude, total time, top speed and total distance.
My question is regarding average speed. Should I let my app compute the average speed and store that as a field in the document, or should I compute this by only getting time and distance?
I will have thousands of records that should be sorted based on average speed, and the most reasonable approach seems to be to store the average speed in the document as well. However, that breaks away from traditional SQL/ACID thinking, where the speed would be calculated outside the DB.
The current document structure for the record collection is like this:
DocumentID (record)
DocumentID (user)
Start lnlt
Finish lnlt
Start time/date/gmt
End time/date/gmt
Total distance
Total time
Top speed
KMZ File
You should not talk about ACID properties once you have made the choice to use a document-oriented DB such as Mongo. Now, you have answered the question yourself:
" the most reasonable seems to store the average speed in the document as well."
We programmers have a tendency to ignore the reasonable or simple approaches. We always question ourselves whenever the solution we find looks obvious or like common sense ;-).
Anyway, my suggestion is to store it, since you want the sort to be performed by the DB and not the application. This means that if any of the variables that influence the average speed change after the initial write, you should remember to update the stored field as well.
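For example (collection name, field names and recordId are my own placeholders, not from your schema), the write that stores the trip could set the derived field in the same update:

// Keep averageSpeed consistent with the values it is derived from:
var totalDistance = 12500;   // meters (example value)
var totalTime = 1800;        // seconds (example value)
db.records.update(
    { _id: recordId },       // recordId is a placeholder
    { $set: {
        totalDistance: totalDistance,
        totalTime: totalTime,
        averageSpeed: totalDistance / totalTime
    } }
)
// Sorting then uses a plain index on the pre-computed field:
db.records.ensureIndex({ averageSpeed: -1 })
db.records.find().sort({ averageSpeed: -1 })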
My question is regarding average speed. Should I let my app compute the average speed and store that as a field in the document, or should I compute this by only getting time and distance?
As #Panegea rightly said, MongoDB does not rely on ACID properties; it relies on your app being able to handle its distributed nature itself. That being said, calculating the average speed outside of the DB isn't all that bad, and using an atomic operator like $set will prevent oddities when you don't have fully ACID queries.
What you and #Panegea are talking about is a form of pre-aggregation of the value you need into a pre-defined field on the document. This is by far a recommended approach, not only in MongoDB but also in SQL (like the total shares on a Facebook wall post), where querying for the aggregation of a computed field would be tiresome and very hard on the server, or just not wise.
Edit
You could achieve this with the aggregation framework as well: http://docs.mongodb.org/manual/applications/aggregation/ - you might want to take a look there, but pre-aggregation is by far the fastest method.
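A rough sketch of what that could look like, computing the average speed on the fly instead of storing it (field names are adapted from your record layout and are assumptions; adjust to your actual keys):

db.records.aggregate([
    { $project: {
        averageSpeed: { $divide: [ "$totalDistance", "$totalTime" ] }
    } },
    { $sort: { averageSpeed: -1 } }
])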
I am looking for a database that implements two geospatial indexes, or that allows simulating this efficiently.
Motivation: our application deals with vectors rather than locations, and we often need to locate all the records where the source is near something and the destination is near something else.
MongoDB does not have this. Is there a database which does?
Maybe it could be simulated with the MongoDB map-reduce feature, where the database looks up all the records satisfying the source constraint and then passes them through map-reduce to keep only those which also satisfy the destination constraint. Has anyone done this?
Thanks.
It might be possible to fake this with MapReduce in Mongo, but this would only be suitable for use as a batch job and not likely to perform well as an application level query.
A workaround would be to store the source and destination points in separate Mongo collections. Then you could do a query on the source collection using $near to pull the closest points to the source point, then do another $near query against the destination collection, and compute the intersection in memory.
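A hedged sketch of that workaround in the mongo shell (the collection names, the shared vectorId field and the limits are invented for illustration):

// One geospatial index per collection, i.e. two indexes overall:
db.sources.ensureIndex({ loc: "2d" })
db.destinations.ensureIndex({ loc: "2d" })
// Pull candidates near each end of the vector:
var nearSource = db.sources.find({ loc: { $near: [ srcLng, srcLat ] } }).limit(200).toArray()
var nearDest = db.destinations.find({ loc: { $near: [ dstLng, dstLat ] } }).limit(200).toArray()
// Intersect the two candidate sets in the application by vectorId.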
Another option - since you can use the geospatial index to index a field which contains an array of points, store both the source and destination points as elements in an array. Then issue two queries to that collection (one for the source point, one for the destination) and scan through the two result sets to calculate the final result (the queries won't distinguish between which match was the source and which was the destination, so you'd have to check that on the client side).
We are looking at using a NoSQL database system for a large project. Currently, we have read a bit about MongoDB and Cassandra, though we have absolutely no experience with either. We are very proficient with traditional relational databases like MySQL and Microsoft SQL, but the NoSQL (key/value store) is a new paradigm for us.
So basically, which NoSQL database do you guys recommend for our use?
We do both heavy writes and reads. Basically we have tens of thousands of devices that are reporting:
device_id (int), latitude (decimal), longitude (decimal), date/time (datetime), heading char(2), speed (int)
Every minute. So, at peak times we need to be able to process hundreds of writes a second.
Then, we also have users, that are querying this information in the form of, give me all messages from device_id 1234 for the last day, or last week. Also, users do other querying like, give me all messages from device_1234 where speed is greater than 50 and date is today.
So, our initial thoughts are that MongoDB or Cassandra are going to allow us to scale this much more easily than a traditional database.
A document or value in MongoDB or Cassandra for us, might look like:
{
    device_id: 1234,
    location: [-118.12719739973545, 33.859012351859946],
    datetime: 1282274060,
    heading: "N",
    speed: 34
}
Which system do you guys recommend? Thanks greatly.
MongoDB has built-in support for geospatial indexes: http://www.mongodb.org/display/DOCS/Geospatial+Indexing
As an example, to find the 10 devices closest to a given location you can just do:
db.devices.find({location: {$near: [-118.12719739973545, 33.859012351859946]}}).limit(10)
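Note that $near assumes a geospatial index on the location field, e.g. (using the legacy 2d index of that era):

db.devices.ensureIndex({ location: "2d" })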
I have a post on a location-based app using MongoDB, just like the one you described. MongoDB, with its strong query and index support, might be the better choice for you. Just like Cassandra, MongoDB has partitioning and replication for scaling reads and writes, but their underlying architectures are very different.
Although you have not mentioned any location-based queries, if you are interested in queries like "give me all the devices within radius r of location l and between time t1 and t2", you will find MongoDB's geospatial querying and indexing extremely useful.
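For example, such a query could look roughly like the following sketch (field names taken from the document above; the radius is converted to radians for $centerSphere, and t1/t2 are placeholder timestamps):

// Devices within 10 km of a point, reporting between t1 and t2:
db.devices.find({
    location: { $geoWithin: { $centerSphere: [ [ -118.127, 33.859 ], 10 / 6378.1 ] } },
    datetime: { $gte: t1, $lte: t2 }
})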
I have done some work with MongoDB and geospatial data, but not on the scale mentioned above. The geospatial searches are very fast, much more so than in MySQL.
I suggest looking into mongodb's sharding, replication, and clustering functionality to deal with the volume of writes. Sharding across device identifier may be a good way to deal with the write volume. If you're interested in proximity of events then sharding across lat/lng may be more appropriate.
Go with MongoDB for geo-location search. Release 2.4 improves on the core geo features. Lots of big sites use it for geolocation search.
You might consider using Elasticsearch. ES keeps the JSON of the original document stored along with all the indexed fields. JSON can be instantiated into any modern language's variables/arguments. In Java, one could even disable that and store native Java persistence data in a field. After search retrieval, just loop through and instantiate a collection of the original object types.
Elasticsearch gives you trie indexes for high-speed numeric range queries; obviously you also get full-text searches of every flavor and geographic bounding-box queries, all combinable with AND or OR filtering. Date searches are also native (although Java's handling of dates is poor, so I switched to BIGINT representations of timestamps to represent dates).
Unlike some past (and maybe present) NoSQL solutions, geographic indexing and querying are part of any query and no extra steps are required. For example, one MongoDB solution in the recent past required a geospatial search to collect matching document IDs, which you then used inside another query to search within for your other criteria. In reality, that's what happens in all solutions anyway, but it's much faster and cached in Elasticsearch.