MongoDB batch operation best practices - mongodb

We're running a web application that serves around 700K RPM at peak times, which each one usually updating 2 documents in our database.
Our product's constraint is to update these documents immediatley (no lazy data dump), and we were wondering if it's more effective to use batch operations to do 2 updates, or just 2 separate update calls?
Thanks

A batch will save you one round trip for a second update. It might not make a big difference, but it's worth trying.

Related

storing huge amounts of data in mongo

I am working on a front end system for a radius server.
The radius server will pass updates to the system every 180 seconds. Which means if I have about 15,000 clients that would be around 7,200,000 entries per day...Which is a lot.
I am trying to understand what the best possible way to store and retrieve this data will be. Obviously as time goes on, this will become substantial. Will MongoDB handle this? Typical document is not much, something this
{
id: 1
radiusId: uniqueId
start: 2017-01-01 14:23:23
upload: 102323
download: 1231556
}
However, there will be MANY of these records. I guess this is something similar to the way that SNMP NMS servers handle data which as far as I know they use RRD to do this.
Currently in my testing I just push every document into a single collection. So I am asking,
A) Is Mongo the right tool for the job and
B) Is there a better/more preferred/more optimal way to store the data
EDIT:
OK, so just incase someone comes across this and needs some help.
I ran it for a while in mongo, I was really not satisfied with performance. We can chalk this up to the hardware I was running on, perhaps my level of knowledge or the framework I was using. However I found a solution that works very well for me. InfluxDB pretty much handles all of this right out of the box, its a time series database which is effectively the data I am trying to store (https://github.com/influxdata/influxdb). Performance for me has been like night & day. Again, could all be my fault, just updating this.
EDIT 2:
So after a while I think I figured out why I never got the performance I was after with Mongo. I am using sailsjs as framework and it was searching by id using regex, which obviously has a huge performance hit. I will eventually try migrate back to Mongo instead of influx and see if its better.
15,000 clients updating every 180 seconds = ~83 insertions / sec. That's not a huge load even for a moderately sized DB server, especially given the very small size of the records you're inserting.
I think MongoDB will do fine with that load (also, to be honest, almost any modern SQL DB would probably be able to keep up as well). IMHO, the key points to consider are these:
Hardware: make sure you have enough RAM. This will primarily depend on how many indexes you define, and how many queries you're doing. If this is primarily a log that will rarely be read, then you won't need much RAM for your working set (although you'll need enough for your indexes). But if you're also running queries then you'll need much more resources
If you are running extensive queries, consider setting up a replica set. That way, your master server can be reserved for writing data, ensuring reliability, while your slaves can be configured to serve your queries without affecting the write reliability.
Regarding the data structure, I think that's fine, but it'll really depend on what type of queries you wish to run against it. For example, if most queries use the radiusId to reference another table and pull in a bunch of data for each record, then you might want to consider denormalizing some of that data. But again, that really depends on the queries you run.
If you're really concerned about managing the write load reliably, consider using the Mongo front-end only to manage the writes, and then dumping the data to a data warehouse backend to run queries on. You can partially do this by running a replica set like I mentioned above, but the disadvantage of a replica set is that you can't restructure the data. The data in each member of the replica set is exactly the same (hence the name, replica set :-) Oftentimes, the best structure for writing data (normalized, small records) isn't the best structure for reading data (denormalized, large records with all the info and joins you need already done). If you're running a bunch of complex queries referencing a bunch of other tables, using a true data warehouse for the querying part might be better.
As your write load increases, you may consider sharding. I'm assuming the RadiusId points to each specific server among a pool of Radius servers. You could potentially shard on that key, which would split the writes based on which server is sending the data. Thus, as you increase your radius servers, you can increase your mongo servers proportionally to maintain write reliability. However, I don't think you need to do this right away as I bet one reasonably provisioned server should be able to manage the load you've specified.
Anyway, those are my preliminary suggestions.

Sphinx: Real-Time Search w/Expiration?

I am designing a search that will be fed around 50 to 200 GB of text data per day (similar to logs) and it only needs to retain that data for week or two. This data will be piped at a constant rate (5,000/per second for example), non-stop, 24 hours a day. After a week or two, the document should drop out of the index never to be heard from again.
The index should be searchable with free-form text across only 1 field (pretty small in size, around 512 characters max). At most, the schema could have 2 attributes that could be categorized.
The system needs to be indexed in near real-time as data is fed to it. A delay of 15 to 30 seconds is acceptable.
We prefer to stream data into the indexer/service with a constant stream of pipe data.
Lastly, a single stand-alone solution is prefer over any type of distribution setup (this will be part of a package to deploy and setup on local machines for testers).
I'm looking closely at Sphinx search engine with RT updates via the API as it checks off most of these. But, I am not seeing an easy way to expire documents after a certain length of time.
I am aware that I could track the IDs and a timestamp and issue a batch DELETE through the Sphinx API. But, that creates an issue of tracking large amounts of IDs in a separate datastore that will need the same kind of 5,000/per second inserts and deleting them when done.
I also have a concern around Sphinx Fragmentation of mass-inserting, and mass-deleting in the middle of inserting.
We would really prefer the search engine/indexer to handle the expiration itself.
I think I can perform a WHERE timestamp < UNIXTIMESTAMP-OF-TWO-WEEEKS-AGO as the where clause in the Sphinx API in order to gather the Document IDs to delete. The problem with that is if the system does not stay ontop of the deletes, the total number of documents/search results will be in the 10s of millions, maybe even billions in count after a two week timeframe if it has to gather a few days worth of document ids to delete. That's not a feasible query.
You can actually run
DELETE FROM rt WHERE timestamp < UNIXTIMESTAMP-OF-TWO-WEEEKS-AGO
As a query to delete the old documents, which is much simpler :)
You will also need to call OPTIMIZE INDEX from time to time.
Both these will have to be called on some sort of 'cron' schedule, as they wont be run automatically.
You might be better not using Sphinxes DELETE function at all. When writing RT indexes, as soon as the RAM chunk is full its writen out as a disk chunk. So you end up with a number of disk chunks on the disk. The oldest documents will be in the oldest chunk, sequentially.
So to clear out the oldest documents, you could just dispose of the oldest chunks. (on a rolling basis)
The problem is sphinx does not include a function to delete individual chunks.
Will need to shutdown searchd, delete the chunk(s), manipulate the header files and then restart Sphinx. Not an easy process.
But in the more general sense, not sure if sphinx will be able to keep up with a continuous stream of 5,000/documents per second (even ignoreing delete for a moment) - Sphinx is generally designed for write-infrequently, read-frequently. It builds a (for the most part) monolithic inverted index. This is great for querying, but is very hard to keep updated. Its not great for incremental updates.

How to handle large mongodb collection

We have a collection that is potentially going to be very large.This collection used to store Bill releated data. So this is often used to reporting/Analytics purpose.
Please let me know the best approch to handle this large collection
1) Can I split and archive the old data(say 12 months period)?.But here old data is required to get analytic reports.I want to query this old data to show the sale comparion for past 2 yesrs.
2)can I have new collection with old data(12 months) .So for every 12 months i've to create new collection. For reports generation,I've to access all this documents to query. So this will cause performance problem?
3) Can I go for Sharding?
There are many variables to account for, the clearest being what hardware you use, how the data is structured, and how it is queried. A distributed network ought to be able to chew through your data faster than a single machine, but before diving into that solution I recommend generating an absurd amount of mock data comparable to what you are expecting, and then testing various approaches. Seriously. Create a bunch of data, and try to break things. It's fun! Soon enough you'll know more about what your problem requires than any website could tell you.
As for direct responses:
Perhaps, before archiving the data, appropriate stats summaries can be generated (or updated). Those summaries/simplifications can be used for sale comparisons without reloading all of the archived data they represent.
This strikes me as sensible. By splitting up the sales data, you have more control over how much data needs to be accessed. After all, a user won't always wish to see 3 years of data, they may only wish to see last week's.
Move to sharding when you actually need it. As is stated on the MongoDB site:
Converting an unsharded database to a sharded cluster is easy and seamless, so there is little advantage in configuring sharding while your data set is small.
You'll know it's time when your memory-map approaches the server's RAM limit. MongoDB supports reading and writing to databases too large to keep in memory, but I'm sure you already know that is SLOW.

Drools is very slow when we integrate with Talend ETL and process millions of records

we have used around 30 rules with multiple conditions in it. we are under the assumption that Drools takes one record and compares it against the records then will give the output for each one.So the time taken for processing 1 million record is around 4 hours. Cant we not process the records in batches. I mean to say in big numbers and reducing time for processing. Pls help me this issue. Thanks for the response.
Inserting 1M facts in one batch is a very bad strategy (unless you need to find combinations out of the lot). The documentation makes it clear that all work (at least in 5.x) is done during inserts and modifications. (6.x is reportedly different, but it's still bad practice to needlessly fill your memory up with objects galore.)
Simply insert, and after some suitable number, call fireAllRules() and process (transmit,...) the results. Make sure that no "dead stock" remains in Working Memory from such a batch - this would also slow you down.

Lucene searches are slow via AzureDirectory

I'm having trouble understanding the complexities of Lucene. Any help would be appreciated.
We're using a Windows Azure blob to store our Lucene index, with Lucene.Net and AzureDirectory. A WorkerRole contains the only IndexWriter, and it adds 20,000 or more records a day, and changes a small number (fewer than 100) of the existing documents. A WebRole on a different box is set up to take two snapshots of the index (into another AzureDirectory), alternating between the two, and telling the WebService which directory to use as it becomes available.
The WebService has two IndexSearchers that alternate, reloading as the next snapshot is ready--one IndexSearcher is supposed to handle all client requests at a time (until the newer snapshot is ready). The IndexSearcher sometimes takes a long time (minutes) to instantiate, and other times it's very fast (a few seconds). Since the directory is physically on disk already (not using the blob at this stage), we expected it to be a fast operation, so this is one confusing point.
We're currently up around 8 million records. The Lucene search used to be so fast (it was great), but now it's very slow. To try to improve this, we've started to IndexWriter.Optimize the index once a day after we back it up--some resources online indicated that Optimize is not required for often-changing indexes, but other resources indicate that optimization is required, so we're not sure.
The big problem is that whenever our web site has more traffic than a single user, we're getting timeouts on the Lucene search. We're trying to figure out if there's a bottleneck at the IndexSearcher object. It's supposed to be thread-safe, but it seems like something is blocking the requests so that only a single search is performed at a time. The box is an Azure VM, set to a Medium size so it has lots of resources available.
Thanks for whatever insight you can provide. Obviously, I can provide more detail if you have any further questions, but I think this is a good start.
I have much larger indexes and have not run into these issues (~100 million records).
Put the indexes in memory if you can (8 million records sounds like it should fit into memory depending on the amount of analyzed fields etc.) You can use the RamDirectory as the cache directory
IndexSearcher is thread-safe and supposed to be re-used, but I am not sure if that is the reality. In Lucene 3.5 (Java version) they have a SearcherManager class that manages multiple threads for you.
http://java.dzone.com/news/lucenes-searchermanager
Also a non-Lucene post, if you are on an extra-large+ VM make sure you are taking advantage of all of the cores. Especially if you have an Web API/ASP.NET front-end for it, those calls all should be asynchronous.