Sphinx: Real-Time Search w/Expiration?

I am designing a search that will be fed around 50 to 200 GB of text data per day (similar to logs), and it only needs to retain that data for a week or two. The data will be piped in at a constant rate (for example, 5,000 documents per second), non-stop, 24 hours a day. After a week or two, a document should drop out of the index, never to be heard from again.
The index should be searchable with free-form text across only 1 field (pretty small in size, around 512 characters max). At most, the schema would have 2 attributes used for categorization.
The system needs to index data in near real-time as it is fed in. A delay of 15 to 30 seconds is acceptable.
We would prefer to stream data into the indexer/service through a constant pipe.
Lastly, a single stand-alone solution is preferred over any type of distributed setup (this will be part of a package to deploy and set up on local machines for testers).
I'm looking closely at the Sphinx search engine with RT updates via the API, as it checks off most of these requirements. But I am not seeing an easy way to expire documents after a certain length of time.
I am aware that I could track the IDs and a timestamp and issue a batch DELETE through the Sphinx API. But that creates the problem of tracking large numbers of IDs in a separate datastore, which would need to handle the same 5,000 inserts per second and delete the IDs once they are done.
I am also concerned about index fragmentation in Sphinx when mass-inserting and mass-deleting at the same time.
We would really prefer the search engine/indexer to handle the expiration itself.
I think I could use WHERE timestamp < UNIXTIMESTAMP-OF-TWO-WEEKS-AGO as the WHERE clause in the Sphinx API in order to gather the document IDs to delete. The problem with that is if the system does not stay on top of the deletes, the total number of matching documents could be in the tens of millions, maybe even billions, if it has to gather a few days' worth of document IDs to delete within a two-week timeframe. That's not a feasible query.

You can actually run
DELETE FROM rt WHERE timestamp < UNIXTIMESTAMP-OF-TWO-WEEKS-AGO
as a query to delete the old documents, which is much simpler :)
You will also need to call OPTIMIZE INDEX from time to time.
Both of these will have to be run on some sort of 'cron' schedule, as they won't be run automatically.
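For illustration, here is a minimal sketch of such a cron job in Python, assuming searchd is listening for SphinxQL on the default port 9306 and the RT index is named rt with an integer timestamp attribute (names follow the example above; adjust to your setup):

import time

import pymysql  # SphinxQL speaks the MySQL wire protocol, so a plain MySQL client works

TWO_WEEKS_SECONDS = 14 * 24 * 3600

def purge_old_documents():
    conn = pymysql.connect(host="127.0.0.1", port=9306)
    try:
        with conn.cursor() as cur:
            cutoff = int(time.time()) - TWO_WEEKS_SECONDS
            # Delete everything older than two weeks...
            cur.execute("DELETE FROM rt WHERE timestamp < %s", (cutoff,))
            # ...then merge disk chunks and physically reclaim the deleted rows.
            cur.execute("OPTIMIZE INDEX rt")
    finally:
        conn.close()

if __name__ == "__main__":
    purge_old_documents()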
You might be better off not using Sphinx's DELETE function at all. When writing to RT indexes, as soon as the RAM chunk is full it is written out as a disk chunk, so you end up with a number of disk chunks on disk. The oldest documents will be in the oldest chunk, sequentially.
So to clear out the oldest documents, you could just dispose of the oldest chunks (on a rolling basis).
The problem is that Sphinx does not include a function to delete individual chunks.
You would need to shut down searchd, delete the chunk(s), manipulate the header files and then restart Sphinx. Not an easy process.
But in the more general sense, I'm not sure whether Sphinx will be able to keep up with a continuous stream of 5,000 documents per second (even ignoring deletes for a moment). Sphinx is generally designed for write-infrequently, read-frequently workloads. It builds a (for the most part) monolithic inverted index, which is great for querying but very hard to keep updated; it's not great for incremental updates.

Related

Re-index data more than once in Apache Druid

I want to get last-hour and last-day aggregation results from Druid. Most of the queries I use are ad-hoc queries. I want to ask two questions:
1. Is it a good idea to ingest all raw data without rollup? Without rollup, can I re-index the data multiple times? For example, one task re-indexes the data to find unique user counts for each hour, and another task re-indexes the same data to find the total count for each 10 minutes.
2. If rollup is enabled to compute some basic summaries, it prevents getting information back from the raw data (because it is summarized). When I want to re-index the data, some useful information may not be available. Is it good practice to enable rollup in streaming mode?
Whether to enable rollup depends on your data size. Normally we keep the raw data outside of Druid so it can be replayed and re-indexed into different data sources. If you have a reasonable amount of data, you can set your segment granularity to hour/day/week/month, ensuring that each segment doesn't exceed the ideal segment size (500 MB is recommended), and set the query granularity to none at index time, so you can do the unique and total count aggregations at query time.
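For reference, a rough sketch of the granularitySpec fields discussed above, expressed here as a Python dict (the rest of the ingestion spec is omitted; the concrete values are only examples):

granularity_spec = {
    "type": "uniform",
    "segmentGranularity": "DAY",   # pick hour/day/week/month so segments stay near ~500 MB
    "queryGranularity": "NONE",    # keep raw timestamps; aggregate uniques/totals at query time
    "rollup": False,               # ingest raw rows without summarization
}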
You can also set your query granularity at index time to 10 minutes, and it can still provide you uniques per 1 hour and the total count received per 1 hour.
Also, you can index the data into multiple data sources if that's what you are asking. If you re-index data into the same data source, it will create duplicates and skew your results.
It depends on your use case. Rollup will give you better performance and space optimization in the Druid cluster. Ideally, I would suggest keeping your archived data separate, in a replayable format, for reuse.

MongoDB documents of calculated values for a dashboard vs re-retrieving on each web page view?

If I have a page in a web app that displays some dashboard type statistics about documents in my database (counts, docs created per hour, per day etc), is it best to pre-calculate this data and store it in a separate document (and update as needed), or assuming the collections have appropriate indexes, would it be appropriate to execute queries to retrieve these statistics on every load of the page?
It's not necessary for the data to be exactly up to date on every page hit/load, which is why I was thinking of maintaining the data I need to display in a separate document that can be retrieved on page hit (or even cached and only re-retrieved every 5 minutes or so).
That's pretty broad, and I have the feeling you have already identified the key points. Generally speaking, you should consider these questions:
Do you need to allow users to apply filters? Complex filters usually make pre-aggregation impossible.
Related: Is it likely that the exact same data is ever queried again? If not, pre-aggregation might need to happen on different levels of granularity (e.g. by creating day / week / month totals and summing these, instead of individual events).
What is the relation of reads vs. writes on the data? If the number of writes is small, it might be OK to keep counters in real-time, instead of using read-caching.
What are your performance requirements for cached and uncached queries? Getting fast cached queries is trivial, but comes at the cost of stale data. Making uncached queries faster is more tricky and usually requires something like the multi-level approach discussed before - it often doesn't help if old data comes super fast, but new queries take minutes.
Caching works especially well if the data can't be changed later (or is seldom changed), and the queries remain the same with a certain chance of recurring. A nice example is Facebook's profiles, where past years are apparently cached for every visitor-profile combination. First accesses are slow, however...
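As a concrete illustration of the pre-aggregation approach described in the question, here is a minimal sketch with pymongo, assuming an events collection with a created_at date field and a separate dashboard_stats collection (all names are hypothetical):

from datetime import datetime, timedelta, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["app"]

def refresh_dashboard_stats():
    since = datetime.now(timezone.utc) - timedelta(days=1)
    # Count documents created per hour over the last day.
    per_hour = list(db.events.aggregate([
        {"$match": {"created_at": {"$gte": since}}},
        {"$group": {
            "_id": {"$dateToString": {"format": "%Y-%m-%dT%H", "date": "$created_at"}},
            "count": {"$sum": 1},
        }},
        {"$sort": {"_id": 1}},
    ]))
    # Store the pre-computed stats in a single document the page can fetch cheaply.
    db.dashboard_stats.update_one(
        {"_id": "overview"},
        {"$set": {
            "per_hour": per_hour,
            "total": db.events.estimated_document_count(),
            "updated_at": datetime.now(timezone.utc),
        }},
        upsert=True,
    )

The page then reads the single "overview" document (optionally cached for a few minutes), and refresh_dashboard_stats() runs on a schedule or after writes.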

Lucene searches are slow via AzureDirectory

I'm having trouble understanding the complexities of Lucene. Any help would be appreciated.
We're using a Windows Azure blob to store our Lucene index, with Lucene.Net and AzureDirectory. A WorkerRole contains the only IndexWriter, and it adds 20,000 or more records a day, and changes a small number (fewer than 100) of the existing documents. A WebRole on a different box is set up to take two snapshots of the index (into another AzureDirectory), alternating between the two, and telling the WebService which directory to use as it becomes available.
The WebService has two IndexSearchers that alternate, reloading as the next snapshot is ready--one IndexSearcher is supposed to handle all client requests at a time (until the newer snapshot is ready). The IndexSearcher sometimes takes a long time (minutes) to instantiate, and other times it's very fast (a few seconds). Since the directory is physically on disk already (not using the blob at this stage), we expected it to be a fast operation, so this is one confusing point.
We're currently up around 8 million records. The Lucene search used to be very fast (it was great), but now it's very slow. To try to improve this, we've started running IndexWriter.Optimize on the index once a day after we back it up--some resources online indicate that Optimize is not required for often-changing indexes, but others indicate that it is required, so we're not sure.
The big problem is that whenever our web site has more traffic than a single user, we're getting timeouts on the Lucene search. We're trying to figure out if there's a bottleneck at the IndexSearcher object. It's supposed to be thread-safe, but it seems like something is blocking the requests so that only a single search is performed at a time. The box is an Azure VM, set to a Medium size so it has lots of resources available.
Thanks for whatever insight you can provide. Obviously, I can provide more detail if you have any further questions, but I think this is a good start.
I have much larger indexes and have not run into these issues (~100 million records).
Put the indexes in memory if you can (8 million records sounds like it should fit into memory, depending on the number of analyzed fields etc.). You can use a RAMDirectory as the cache directory.
IndexSearcher is thread-safe and supposed to be re-used, but I am not sure if that is the reality. In Lucene 3.5 (the Java version) there is a SearcherManager class that manages multiple threads for you.
http://java.dzone.com/news/lucenes-searchermanager
Also, a non-Lucene point: if you are on an extra-large+ VM, make sure you are taking advantage of all of the cores. Especially if you have a Web API/ASP.NET front end for it, those calls should all be asynchronous.

MongoDB High Avg. Flush Time - Write Heavy

I'm using MongoDB with approximately 4 million documents and around 5-6GB database size. The machine has 10GB of RAM, and free only reports around 3.7GB in use. The database is used for a video game related ladder (rankings) website, separated by region.
It's a fairly write-heavy workload, but it still gets a significant number of reads as well. We use an updater which queries an outside source every hour or two; the updater then processes the records and updates documents in the database. The updater only processes one region at a time (see the previous paragraph), so approximately 33% of the database is updated each run.
When the updater runs, and for the duration that it runs, the average flush time spikes to around 35-40 seconds, and we experience general slowdowns with other queries. The updater is run on a separate machine and only queries MongoDB at the end, when all the data has been retrieved and processed from the third party.
Some people have suggested slowing down the number of updates, or only updating players who have changed, but the problem comes down to rankings. Since we support ties between players, we need to pre-calculate the ranks - so even if only a few users have actually changed rank, we still need to update the rest of the users' ranks accordingly. At least, that was the case with MySQL - I'm not sure if there is a good solution with MongoDB for ranking ~800K to 1.2 million documents while supporting ties.
My question is: how can we improve the flush time and the slowdown we're experiencing? Why is it spiking so high? Would disabling journaling (to take some load off the I/O) help? Data loss isn't something I'm worried about, as the database is updated frequently regardless.
Server status: http://pastebin.com/w1ETfPWs
You are using the wrong tool for the job. MongoDB isn't designed for ranking large ladders in real time, at least not quickly.
Use something like Redis. Redis has something called a sorted set, designed for exactly this job; with it you can have 100 million entries and still fetch, say, the 5,000,000th through 5,001,000th entries at sub-millisecond speed.
From the official site (Redis - Sorted sets):
Sorted sets
With sorted sets you can add, remove, or update elements in a very fast way (in a time proportional to the logarithm of the number of elements). Since elements are taken in order and not ordered afterwards, you can also get ranges by score or by rank (position) in a very fast way. Accessing the middle of a sorted set is also very fast, so you can use Sorted Sets as a smart list of non repeating elements where you can quickly access everything you need: elements in order, fast existence test, fast access to elements in the middle!
In short with sorted sets you can do a lot of tasks with great performance that are really hard to model in other kind of databases.
With Sorted Sets you can:
Take a leader board in a massive online game, where every time a new score is submitted you update it using ZADD. You can easily take the top users using ZRANGE, you can also, given a user name, return its rank in the listing using ZRANK. Using ZRANK and ZRANGE together you can show users with a score similar to a given user. All very quickly.
Sorted Sets are often used in order to index data that is stored inside Redis. For instance if you have many hashes representing users, you can use a sorted set with elements having the age of the user as the score and the ID of the user as the value. So using ZRANGEBYSCORE it will be trivial and fast to retrieve all the users with a given interval of ages.
Sorted Sets are probably the most advanced Redis data types, so take some time to check the full list of Sorted Set commands to discover what you can do with Redis!
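A minimal sketch of such a leaderboard with redis-py, assuming a local Redis instance and a per-region key like "ladder:eu" (names are illustrative). Since ZREVRANK breaks ties arbitrarily, a "competition" rank for tied players is computed here by counting strictly higher scores instead:

import redis

r = redis.Redis(host="localhost", port=6379)

def submit_score(region, player_id, score):
    # ZADD inserts or updates the player's score in O(log N).
    r.zadd(f"ladder:{region}", {player_id: score})

def top_players(region, n=10):
    # Highest scores first, with their scores.
    return r.zrevrange(f"ladder:{region}", 0, n - 1, withscores=True)

def rank_with_ties(region, player_id):
    score = r.zscore(f"ladder:{region}", player_id)
    if score is None:
        return None
    # Players with a strictly higher score determine the rank, so ties share a rank.
    return r.zcount(f"ladder:{region}", f"({score}", "+inf") + 1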
Without seeing any disk statistics, I am of the opinion that you are saturating your disks.
This can be checked with iostat -xmt 2, and checking the %util column.
Please don't disable journalling - you will only cause more issues later down the line when your machine crashes.
Separating collections will have no effect. Separating databases may, but if you're IO bound, this will do nothing to help you.
Options
If I am correct, and your disks are saturated, adding more disks in a RAID 10 configuration will vastly help performance and durability - more so if you separate the journal off to an SSD.
Assuming that this machine is a single server, you can set up a replica set and send your read queries there. This should help you a fair bit, but not as much as the disks.
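If you do add a replica set, a minimal sketch of routing reads to secondaries with pymongo might look like this (host names, database/collection names, and the replica-set name are placeholders, and secondary reads are eventually consistent):

from pymongo import MongoClient

# readPreference=secondaryPreferred sends queries to a secondary when one is
# available, keeping the primary free for the hourly bulk updates.
client = MongoClient(
    "mongodb://db1.example.com,db2.example.com,db3.example.com"
    "/?replicaSet=rs0&readPreference=secondaryPreferred"
)
ladder = client["game"]["ladder"]

# Ranking pages can now be served by a secondary, while writes still go to the primary.
top_of_ladder = ladder.find().sort("rank", 1).limit(100)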

Unlocking a collection after aborting a reIndex() command in MongoDB?

I was attempting to reduce the size of my indexes on a mongo collection and ran db.collection.reIndex().
After about 90 minutes, I began to think it had somehow gotten locked up and tried to cancel. Now (about 2 hours after cancelling) the collection appears to be locked to all write commands. All my other collections are allowing writes. Is there any way to unlock it?
The period of time that it takes to perform this operation depends on a few things, namely:
The size of the collection.
The number of indexes in that collection.
Also note that this is a blocking operation.
Simply put, a small database (less than 500MB) should take only a few minutes to reindex, whereas a larger database (5-10GB or more) could take much longer, with the time increasing as the database size increases.
While it is best to let the procedure finish, if you absolutely need to stop it, then restarting the process would be the way to do it. Also, send in a support ticket to support@mongohq.com (including the name of the database) and the team can help more there.
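As an aside, on newer MongoDB versions (3.6+) you could also try to locate and kill the stuck operation instead of restarting the process; a hedged sketch with pymongo follows (the 600-second threshold and the commented-out kill call are illustrative, and this is not what the answer above recommends for older deployments):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")

# List in-progress operations and look for the long-running reIndex.
current = client.admin.command("currentOp")
for op in current.get("inprog", []):
    if op.get("secs_running", 0) > 600:
        print(op.get("opid"), op.get("op"), op.get("ns"))

# Once you have identified the offending opid, it can be terminated with
# killOp (use with care; killing internal operations can cause problems):
# client.admin.command("killOp", 1, op=<opid>)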