I need to do a geo-spatial query that is filtered by time (past 5 hours). I'm considering the following approaches, but I'm not sure which will yield the fastest query:
1. Migrate old data (older than 5 hours) to an archive table (runs periodically in the background).
2. Geo-spatial query with a time box (query objects within a time frame); I think this one might be computationally expensive at scale? (A sketch of this approach follows below.)
3. Geo-spatial query, then filter the results (drop everything older than the time constraint).
I'm really thinking about how this will work at large scale.
Thanks!
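A single query for option 2 can lean on a compound index that covers both the location and the timestamp, so the time filter does not have to be applied client-side. Here is a minimal sketch with the MongoDB Java driver; the collection name (events), the GeoJSON field (loc), the timestamp field (ts), and the coordinates are illustrative assumptions, not anything from the original question:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Indexes;
import org.bson.Document;
import java.util.Date;

public class GeoTimeQuery {
    public static void main(String[] args) {
        MongoClient client = MongoClients.create("mongodb://localhost:27017");
        MongoCollection<Document> events =
                client.getDatabase("test").getCollection("events");

        // Compound index covering both the location and the timestamp.
        events.createIndex(Indexes.compoundIndex(
                Indexes.geo2dsphere("loc"), Indexes.ascending("ts")));

        // Objects within roughly 10 km of a point AND newer than 5 hours ago.
        Date cutoff = new Date(System.currentTimeMillis() - 5L * 60 * 60 * 1000);
        double radiusRadians = 10.0 / 6378.1; // km divided by Earth radius in km
        for (Document d : events.find(Filters.and(
                Filters.geoWithinCenterSphere("loc", -73.99, 40.73, radiusRadians),
                Filters.gte("ts", cutoff)))) {
            System.out.println(d.toJson());
        }
    }
}

Whether this beats options 1 or 3 at scale depends mostly on how selective the geo predicate and the 5-hour window are; the compound index at least avoids fetching every nearby point and throwing most of them away afterwards.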
I was using MongoDB to store some time-series data at 1 Hz. To facilitate this, my central document represented an hour of data per device per user, with 3600 values pre-allocated at document creation time. Two drawbacks:
1. Every insert is an update (sketched below). I need to query for the correct record (by user, by device, by day, by hour), append the latest IoT reading to the list, and update the record.
2. Paged queries require complex custom code. I need to query for a record count of all the data matching my search range, then manually build each page of data to be returned.
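For context, the per-second write described in drawback 1 boils down to a single update that targets one slot of the pre-allocated array. A rough sketch, assuming a bucket document keyed by user, device, day, and hour with a 3600-element values array (all field names are guesses, not the poster's actual schema):

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Updates;
import org.bson.Document;

public class HourlyBucketWrite {
    public static void main(String[] args) {
        MongoClient client = MongoClients.create("mongodb://localhost:27017");
        MongoCollection<Document> readings =
                client.getDatabase("iot").getCollection("hourly_readings");

        int second = 42;        // offset of this reading within the hour (0-3599)
        double reading = 21.7;  // the 1 Hz sensor value

        // Every insert is really an update: locate the hour bucket and set one slot.
        readings.updateOne(
                Filters.and(
                        Filters.eq("user", "user-1"),
                        Filters.eq("device", "device-1"),
                        Filters.eq("day", "2024-01-01"),
                        Filters.eq("hour", 13)),
                Updates.set("values." + second, reading));
    }
}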
I was hoping the MongoDB Native Time-Series Collections introduced in 5.0 would give me some performance improvements and it did, but only on ingestion rate. The comparison I used is to push 108,000 records into the database as quickly as possible and average the response time, then perform paged queries and get a range for the response time for those. This is what I observed:
Mongo Custom Code Solution: 30 millisecond inserts, 10-20 millisecond paged queries.
Mongo Native Time-Series: 138 microsecond inserts, 50-90 millisecond paged queries.
The difference in insert rate was expected, see #1 above. But for paged queries I didn't expect my custom time-series kluge implementation to be significantly faster than a native implementation. I don't really see a discussion of the advantages to be expected in the Mongo 5.0 documentation. Should I expect to see improvements in this regard?
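For reference, this is roughly how a 5.0 native time-series collection is declared from the Java driver; the collection name, field names, and granularity below are assumptions, not the poster's actual setup:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.CreateCollectionOptions;
import com.mongodb.client.model.TimeSeriesGranularity;
import com.mongodb.client.model.TimeSeriesOptions;

public class CreateTimeSeries {
    public static void main(String[] args) {
        MongoClient client = MongoClients.create("mongodb://localhost:27017");
        MongoDatabase db = client.getDatabase("iot");

        // Native time-series collection: "ts" is the time field, "meta" groups
        // readings per user/device; SECONDS granularity suits 1 Hz data.
        db.createCollection("readings", new CreateCollectionOptions()
                .timeSeriesOptions(new TimeSeriesOptions("ts")
                        .metaField("meta")
                        .granularity(TimeSeriesGranularity.SECONDS)));
    }
}

The server buckets the measurements internally, which mainly shows up as ingestion and storage gains, consistent with what you measured; I am not aware of any documented guarantee about paged read latency.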
I need to insert a lot of data as a document into the database. It is usually 4k+ (pretty-printed) JSON rows / 150k+ characters. I'm inserting it all at once, but it still takes 20+ seconds.
Is this a known limitation of MongoDB? Are you aware of any settings or MongoDB forks that would provide a major performance boost? I can live with ~1 second.
no indexes inside the document
insert and update require the same amount of time
default mongo configuration
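If those 4k rows are really independent records rather than one giant document, a single unordered bulk insert is usually far faster than inserting them one by one. A minimal sketch with the MongoDB Java driver, with made-up collection and field names:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.InsertManyOptions;
import org.bson.Document;
import java.util.ArrayList;
import java.util.List;

public class BulkInsert {
    public static void main(String[] args) {
        MongoClient client = MongoClients.create("mongodb://localhost:27017");
        MongoCollection<Document> rows =
                client.getDatabase("test").getCollection("rows");

        // Build all documents in memory, then ship them in one round trip.
        List<Document> batch = new ArrayList<>();
        for (int i = 0; i < 4000; i++) {
            batch.add(new Document("n", i).append("payload", "row-" + i));
        }

        // ordered(false) lets the server keep going past individual failures
        // and generally inserts faster than an ordered batch.
        rows.insertMany(batch, new InsertManyOptions().ordered(false));
    }
}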
I am testing the performance of MongoDB to compare it with my current MySQL-based solution.
In a collection/table X with three attributes A, B, and C, I have attribute A indexed in both MongoDB and MySQL.
Now I load 1M records into MongoDB and MySQL and try the search performance in this straightforward scenario.
The insert speed of MongoDB is only 10% faster than inserting into MySQL. That is OK; I knew adopting MongoDB wouldn't bring a magic boost to my CRUD operations, but I am really surprised by MongoDB's search performance without an index.
The results show that a MongoDB select on a non-indexed field is ten times slower than a select on an indexed field.
On the other hand, a MySQL select (MyISAM) on a non-indexed field is only about 70% slower than a select on an indexed field.
Last but not least, in the indexed-select scenario, MongoDB is about 30% quicker than my MySQL solution.
I want to know: are the above figures normal? Especially the performance of a MongoDB select without an index?
My code looks like this:
import com.mongodb.BasicDBObject;
import com.mongodb.DBCursor;
import com.mongodb.DBObject;

// Equality query on field A (the indexed attribute) using the legacy Java driver API.
BasicDBObject query = new BasicDBObject("A", value_of_field_A);
DBCursor cursor = currentCollection.find(query);
while (cursor.hasNext()) {
    DBObject obj = cursor.next();
    // do nothing after that, only for testing purposes
}
BTW, from the business logic's perspective, my collection could be really large (TB and more). What would you suggest for the size of each physical collection: 10 million documents or 1 billion documents?
Thanks a lot!
------------------------------ Edit ------------------------------
I tried inserting 10 million records into both MongoDB and MySQL, and MongoDB is about 20% faster than MySQL, which is not as much as I had expected.
I am curious: if I set up MongoDB auto-sharding, will the insert speed improve? If so, do I need to put the shards on different physical machines, or can I put them on the same machine with multiple cores?
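For reference, auto-sharding is enabled with two admin commands, which from the Java driver look roughly like this (the database name, collection name, and shard key below are placeholders; the connection must go through a mongos):

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import org.bson.Document;

public class EnableSharding {
    public static void main(String[] args) {
        // Must be connected to a mongos, not directly to a mongod.
        MongoClient client = MongoClients.create("mongodb://localhost:27017");

        // Enable sharding for the database, then shard the collection on a hashed key
        // so inserts spread across shards instead of all landing on one chunk.
        client.getDatabase("admin")
              .runCommand(new Document("enableSharding", "test"));
        client.getDatabase("admin")
              .runCommand(new Document("shardCollection", "test.X")
                      .append("key", new Document("A", "hashed")));
    }
}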
------------------------------ Update ------------------------------
First, I changed the MongoDB write concern from ACKNOWLEDGED to UNACKNOWLEDGED, and the MongoDB insert speed became 3x faster.
Later on, I made the insert program parallel (8 threads on an 8-core computer). In MongoDB's ACKNOWLEDGED mode, inserts also improved 3x; in UNACKNOWLEDGED mode, the speed was actually 50% slower.
For MySQL, parallel inserts increase the speed 5x, which is faster than the best insert case from MongoDB!
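For anyone reproducing this, the write-concern change from the update above is a one-liner in the Java driver; a sketch with an illustrative collection name:

import com.mongodb.WriteConcern;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class WriteConcernDemo {
    public static void main(String[] args) {
        MongoClient client = MongoClients.create("mongodb://localhost:27017");

        // ACKNOWLEDGED (the default): each write waits for the server's response.
        MongoCollection<Document> acked =
                client.getDatabase("test").getCollection("X");

        // UNACKNOWLEDGED: fire-and-forget, faster but write errors go unreported.
        MongoCollection<Document> unacked =
                acked.withWriteConcern(WriteConcern.UNACKNOWLEDGED);

        acked.insertOne(new Document("A", 1));
        unacked.insertOne(new Document("A", 2));
    }
}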
MongoDB queries without an index will do a full collection scan, and keep in mind that MongoDB's data size is much larger than MySQL's for the same records. I am guessing this might be one of the reasons for the slowness of a full scan.
Regarding queries with an index, MongoDB may turn out faster because of caching, the lack of a complex query-optimizer plan (unlike MySQL), etc.
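You can confirm which plan each query gets with explain(): a COLLSCAN stage means a full collection scan, an IXSCAN stage means the index on A is used. A quick sketch, assuming a reasonably recent Java driver (4.2+ for explain on a find) and the collection X from the question:

import com.mongodb.ExplainVerbosity;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.Document;

public class ExplainDemo {
    public static void main(String[] args) {
        MongoClient client = MongoClients.create("mongodb://localhost:27017");
        MongoCollection<Document> x = client.getDatabase("test").getCollection("X");

        // Indexed field: the winning plan should contain an IXSCAN stage.
        Document indexed = x.find(Filters.eq("A", 42))
                            .explain(ExplainVerbosity.EXECUTION_STATS);
        System.out.println(indexed.toJson());

        // Non-indexed field: the winning plan should be a COLLSCAN over all documents.
        Document scanned = x.find(Filters.eq("C", 42))
                            .explain(ExplainVerbosity.EXECUTION_STATS);
        System.out.println(scanned.toJson());
    }
}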
The size of the collection is not an issue; in fact, 10 million documents can easily be handled in one collection. If you have a requirement to archive data, you can break it into smaller collections, which will make that process easier.
I have an HBase table of about 150k rows, each containing 3700 columns.
I need to select multiple rows at a time, and aggregate the results back, something like:
row[1][column1] + row[2][column1] ... + row[n][column1]
row[1][column2] + row[2][column2] ... + row[n][column2]
...
row[1][columnn] + row[2][columnn] ... + row[n][columnn]
I can do this using a scanner. The issue, I believe, is that the scanner works like a cursor: it does not distribute the work over multiple machines at the same time, but rather gets data from one region, then hops to another region to get the next set of data, and so on when my results span multiple regions.
Is there a way to scan in a distributed fashion (either an option, or creating a separate scanner for each region's worth of data, which might be a can of worms in itself), or is this something that must be done in a map/reduce job? If it's an M/R job, will it be "fast" enough for real-time queries? If not, are there good alternatives for doing these types of aggregations in real time with a NoSQL-type database?
What I would do in such cases is keep another table with the aggregation summaries. That is, when row[m] is inserted into table 1, I would save its contribution to the summation (or other aggregation results such as average, standard deviation, max, min, etc.) in table 2, keyed by the column name (which is the row key of table 2).
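One cheap way to keep such a summary table current, at least for sums and counts, is HBase's atomic counters: every time a value lands in table 1, increment the corresponding counter cell in table 2. A rough sketch with the HBase Java client, where the table, family, and qualifier names are made up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SummaryUpdate {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table summary = connection.getTable(TableName.valueOf("summary"))) {

            // Row key of the summary table is the column name from table 1;
            // the running sum lives in one counter cell per column.
            long newValue = 17L; // value just written to table 1 for column1
            summary.incrementColumnValue(
                    Bytes.toBytes("column1"),   // row key = source column
                    Bytes.toBytes("agg"),       // column family
                    Bytes.toBytes("sum"),       // qualifier holding the running sum
                    newValue);
        }
    }
}

Counters only cover additive aggregates; min, max, and the like need read-modify-write logic or a periodic recompute job instead.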
Another approach would be to index the data into a search tool such as Lucene, Solr, Elasticsearch, etc. and run the aggregation queries there; Solr, for example, has built-in support for computing this kind of statistics over a result set.
Finally, a scan spanning multiple regions, or an M/R job, is not designed for real-time queries (unless the cluster is designed for it, i.e. oversized relative to the data requirements).
Hope it helps.
I'm doing a within-box geospatial query on a collection of ~40K documents. The query takes ~0.3 s and fetching the documents takes ~0.6 s (there are ~10K documents in the result set).
The documents are fairly small (~100 bytes each) and I limit the result to return the lat/lon only.
It seems very slow. Is this about right or am I doing something wrong?
It seems very slow indeed. A roughly equivalent search I did on PostgreSQL, for example, is almost too fast to measure (i.e. probably faster than 1 ms).
I don't know much about MongoDB, but are you certain that the geospatial index is actually turned on? (I ask because in RDBMSs it's easy to define a table with geometrical/geographical columns yet not define the actual indexing appropriately, and so you get roughly the same performance as what you describe).
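One quick way to check from the Java driver: list the indexes and explain the box query; if no 2d/2dsphere index shows up, or the winning plan is a COLLSCAN, the geospatial index is not being used. The collection name (places), the field name (coordinates), and the box corners below are guesses:

import com.mongodb.ExplainVerbosity;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.Document;

public class CheckGeoIndex {
    public static void main(String[] args) {
        MongoClient client = MongoClients.create("mongodb://localhost:27017");
        MongoCollection<Document> places =
                client.getDatabase("test").getCollection("places");

        // 1) Is there a 2d/2dsphere index at all?
        for (Document idx : places.listIndexes()) {
            System.out.println(idx.toJson());
        }

        // 2) Does the box query actually use it? Look for an index scan stage,
        //    not COLLSCAN, in the winning plan.
        Document plan = places.find(
                        Filters.geoWithinBox("coordinates", -74.1, 40.6, -73.8, 40.9))
                .explain(ExplainVerbosity.QUERY_PLANNER);
        System.out.println(plan.toJson());
    }
}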