Lucene NRT: When to commit? - lucene.net

We're refactoring our Lucene host (Lucene.NET 2.9.2), and are implementing Lucene NRT (Near Realtime).
What is the best time/threshold at which to commit the changes to disk? Is there a golden rule? If the trigger should be the internal RAM buffer holding a certain amount of data, how do I get its size?
Once a commit happens, we update our database, so I'm not that fearful of power failures (once the process starts again, it will re-index any documents that have not been committed).

I have just implemented what sounds like the same scheme in our system. I decided to commit when I have over 1000 uncommitted documents. I think the right number really depends on how many docs/sec you will be adding. I am also not sure whether I can run the commit on a different thread from the one where I am adding the docs.
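To make that concrete, here is a minimal C# sketch of a count-based commit trigger against the Lucene.NET IndexWriter. The 1000-document threshold and the 64 MB RAM check are assumptions to tune, and you should verify that your 2.9.2 build exposes RamSizeInBytes() (the Java 2.9 IndexWriter does). IndexWriter is generally documented as thread-safe, so calling Commit() from a separate thread should work, but confirm against your version.

```csharp
using Lucene.Net.Documents;
using Lucene.Net.Index;

public class NrtIndexer
{
    private const int MaxUncommittedDocs = 1000;          // assumed threshold - tune for your docs/sec rate
    private const long MaxRamBytes = 64L * 1024 * 1024;   // assumed RAM threshold (64 MB)

    private readonly IndexWriter _writer;
    private int _uncommitted;                              // documents added since the last Commit()

    public NrtIndexer(IndexWriter writer)
    {
        _writer = writer;
    }

    public void Add(Document doc)
    {
        _writer.AddDocument(doc);
        _uncommitted++;

        // Commit either on document count or on the size of the writer's in-memory buffer.
        if (_uncommitted >= MaxUncommittedDocs || _writer.RamSizeInBytes() > MaxRamBytes)
        {
            _writer.Commit();   // flushes and fsyncs; this is the point to update the tracking database
            _uncommitted = 0;
        }
    }
}
```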

Related

Handling backups of a large table (>1 TB) in Postgres?

I have a 1 TB table (X) that is a pain to back up.
The table X contains historical log data that is not often updated after creation. We usually only access a single row at a time, so performance is still very good.
We currently make nightly full logical backups, and exclude X for the sake of backup time and space. We do not need historical backups of X, since the log files from which it is populated are themselves backed up. However, recovering X by re-processing the log files would take an unnecessarily long time.
I'd like to include X in our backup strategy so that our recovery time can be much faster. It doesn't seem feasible to include X in the nightly logical backup.
Ideally, I'd like a single full backup for X that is updated incrementally (purely to save time).
I lack the experience to investigate solutions alone, and I'm wondering what my options are?
Barman for incremental updates? Partition X? Both?
After doing some more reading, I'm inclined to partition the table and write a nightly script to perform logical backups only on the changed table partitions (and replace the previous backups). However, this strategy may still take a long time during recovery with a pg_restore... Thoughts?
Thanks!
I think using Barman with the rsync/SSH + WAL streaming option and performing incremental backups is the best option in your case. Going this way makes your nightly backups easier and less costly, since you don't have to implement much logic yourself once Barman is configured. I will update this answer shortly with a blog post that details the steps.
Logical backups may not be the right approach for periodic backups when dealing with large databases. With physical backups, even though the backup size is larger, that is more than compensated for by the lower acquisition and restore cost (performance, speed, and simplicity).
Thanks
UPDATE (2020-08-27):
Below is a git repo with an end-to-end demonstration. There are many implementations out there, but if you want to do something from scratch and keep the image simple (avoiding unnecessary dependencies), please take a look at this one:
https://github.com/softwarebrahma/PostgreSQL-Disaster-Recovery-With-Barman
Thanks

Sitecore 8.1 update 2 MongoDB backup

I am using a replica set (two MongoDB members, one arbiter) for my Sitecore CD servers.
Assuming all MongoDB data gets flushed to the reporting SQL DB, do we need to take a backup of the MongoDB database on the production CD servers?
If yes, what are the best approach and frequency for doing so, considering my application makes moderate use of analytics features (personalization, campaigns, etc.)?
Unfortunately, your assumption is bad - the MongoDB is the definitive source of analytic data, not the reporting db. The reporting db contains only the aggregate info needed for generating the report (mostly). In fact, if (when) something goes wrong with the SQL DB, the idea is that it is rebuilt from the source MongoDB. Remember: You can't un-add two numbers after you've added them!
Backup vs Replication
A backup is a point-in-time view of the database, whereas replication maintains multiple active copies of the current database. I would advocate for replication over backup for this type of data. Why? Glad you asked!
Currency - under what circumstances would you want to restore a 50GB MongoDB? What if it was a week old? What if it was a month? Really the only useful data is current data, and websites are volatile places - log data backups are out of date within an hour. If you personalise on stale data, is that providing a good user experience?
Cost - backing up large datasets is costly in terms of time, storage capacity and compute requirements; backups are also a pain to restore, and the bigger they are, the more likely there's a corruption somewhere.
Run of business
In a production MongoDB environment you really should have 2-3 replicas. That's going to save your arse if one of the boxes dies, which they sometimes do - MongoDB works the disks very hard.
These replicas are self-healing and always current (pretty much), so they are much better than taking backups. The chance that you lose all your replicas at once is really low except for one particular edge case... upgrades. So a backup is really only protection against hardware failure or data corruption, which, in a multi-instance replica set, is already very effectively handled. Unless you're paranoid, you're never going to use that backup, and it'll cost you plenty to have it.
Sitecore Upgrades
This is the killer edge-case - always make backups (see Back Up and Restore with MongoDB Tools) before running an upgrade because you can corrupt all of your replicas in one motion and you'll want to be able to roll back.
Data Trimming (side-note)
You didn't ask this, but at some point you'll be thinking "how the heck can I back up this 170GB monster db every day? this is ridiculous" - and you'll be right.
There are various schools of thought around how long this data should be persisted for - that's a question only you or your client can answer. I suggest keeping it until there's too much, then make a decision on how much you have to get rid of. Keep as much as you can tolerate.

Retrying or recovering after failed IndexWriter.Commit() in Lucene.NET

Our Lucene.NET index is located on a remote machine, accessible over a UNC path. For performance reasons (and following what appears to be Lucene.NET best practice), IndexWriter is not Commit()ed after each document modification, but rather once every 30 seconds.
Now, sometimes the network fails and Commit() errors out with an exception. I know that Lucene.NET is "fully ACID", and as such these failures do not corrupt the index itself. What worries me is that not-yet-committed documents are lost.
Is there any recommended way of dealing with this? Can I retry IndexWriter.Commit() in hopes that network connectivity is restored? Or should I buffer documents in RAMDirectory and then merge these into FSDirectory, with retry semantics? Or something else entirely?
In my implementation, I use an Oracle table. When a document is created, a row is added to the table with a value indicating it is not indexed. After the IndexWriter commit succeeds, I update the table to indicate it is indexed (along with some other data like indexed_date, etc.). That way, if there is any kind of failure, the document will be found again and indexed (or possibly re-indexed) when the system or connectivity is restored. The table also opens up all kinds of reporting and audit capability that would otherwise not be available.
This might not be an option for you. Buffering documents locally before handing them to the IndexWriter would work. I'm not sure why you would need retry semantics if you only looked to the local buffer for docs to index. I think you just have to make sure you delete a document from the local buffer after the commit succeeds, so you don't keep indexing it forever. ;^)
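For illustration, here is a rough C# sketch of the table-driven pattern described above. The IPendingDocStore interface, its methods, and the batch size are hypothetical stand-ins for your own data access code; the point is only the ordering: add documents, Commit(), and mark rows as indexed only after the commit succeeds.

```csharp
using System;
using System.Collections.Generic;
using Lucene.Net.Documents;
using Lucene.Net.Index;

// Hypothetical abstraction over the tracking table (e.g. an Oracle PENDING_DOCS table).
public interface IPendingDocStore
{
    IList<PendingDoc> GetUnindexed(int batchSize);
    void MarkIndexed(IEnumerable<long> ids, DateTime indexedDate);
}

public class PendingDoc
{
    public long Id { get; set; }

    // Map the database row to a Lucene Document (fields omitted here).
    public Document ToLuceneDocument() { return new Document(); }
}

public class RecoverableIndexer
{
    private readonly IndexWriter _writer;
    private readonly IPendingDocStore _store;

    public RecoverableIndexer(IndexWriter writer, IPendingDocStore store)
    {
        _writer = writer;
        _store = store;
    }

    public void IndexPendingBatch()
    {
        IList<PendingDoc> batch = _store.GetUnindexed(500);   // assumed batch size
        if (batch.Count == 0) return;

        var ids = new List<long>();
        foreach (PendingDoc pending in batch)
        {
            _writer.AddDocument(pending.ToLuceneDocument());
            ids.Add(pending.Id);
        }

        // If Commit() throws (e.g. the UNC share is unreachable), the rows stay flagged
        // as unindexed and are simply picked up again on the next run.
        _writer.Commit();

        // Only after a successful commit are the documents recorded as indexed.
        _store.MarkIndexed(ids, DateTime.UtcNow);
    }
}
```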

Keeping a snapshot of the most current version of each aggregate in an event store

We're currently using an SQL-backed Event Store (the typical 2-table implementation), and some people on the team are afraid that, even though we're using the Event Store only for writes, things may get a bit slower. A suggestion was therefore made that, instead of adding snapshots here and there, we maintain a snapshot of each aggregate in its most recent state (in JSON format), fully consistent with its event stream. All querying in the system will end up being done on the read side, against a typical SQL database that is updated in an eventually consistent fashion from the ES (write) side.
Having such a system in place would allow us to enjoy the benefits of having an Event Store while simultaneously removing any possible performance issues altogether. We are currently not making use of any "time-travelling" feature, although sooner or later that will end up being the case.
Is this a good approach? There's something about it that leaves me uncomfortable. For instance, if we need some sort of time-travelling feature, not having snapshots here and there in each aggregate's event stream will prove a performance disaster. Of course, we could have both a most-current snapshot per aggregate instance and also snapshots throughout the event streams.
If we decide to go down this route, should we make the snapshot update for a given aggregate transactional with the event updates on that same aggregate, or should we just update the events and update the snapshot in an eventually consistent manner?
What are the downsides of this approach? Has anyone tried something of the kind?
You should probably run your own benchmarks before adding unnecessary complexity to your system. We have noticed performance problems when thousands of events need to be queried and applied to rebuild an aggregate from the event stream, where JSON-to-object deserialization was the biggest bottleneck. If each of your aggregates has only a few events (say, < 100), you probably won't notice any significant difference in practice.
Most event stores record snapshots every n events/commits, say every 50-100 events, and on assembly query the latest snapshots and apply the missing events since the last snapshot. If you also keep all old snapshots in your snapshot database, the time traveling feature will be as fast as a usual query, and you'll only need slightly more persistence space, which is cheap nowadays.
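As a hedged C# sketch of that load path, with hypothetical ISnapshotStore/IEventStore interfaces and an assumed interval of 50 events, just to make the shape concrete:

```csharp
using System.Collections.Generic;

public interface IEvent { }

public interface IAggregate
{
    long Version { get; }
    void Apply(IEvent e);      // applies the event and advances Version
}

public class Snapshot
{
    public IAggregate State;
    public long Version;
}

// Hypothetical stores; substitute your own persistence.
public interface ISnapshotStore
{
    Snapshot GetLatest(string aggregateId);                        // null if no snapshot yet
    void Save(string aggregateId, IAggregate state, long version);
}

public interface IEventStore
{
    IEnumerable<IEvent> GetEventsAfter(string aggregateId, long version);
}

public class SnapshottingRepository
{
    private const int SnapshotEvery = 50;   // assumed interval, in the 50-100 range mentioned above

    private readonly ISnapshotStore _snapshots;
    private readonly IEventStore _events;

    public SnapshottingRepository(ISnapshotStore snapshots, IEventStore events)
    {
        _snapshots = snapshots;
        _events = events;
    }

    public IAggregate Load(string aggregateId, IAggregate empty)
    {
        Snapshot snapshot = _snapshots.GetLatest(aggregateId);
        IAggregate state = snapshot != null ? snapshot.State : empty;
        long fromVersion = snapshot != null ? snapshot.Version : 0;

        // Replay only the events recorded after the snapshot was taken.
        foreach (IEvent e in _events.GetEventsAfter(aggregateId, fromVersion))
            state.Apply(e);

        // Take a new snapshot once enough events have accumulated. Shown inline for brevity;
        // as noted below, the write belongs outside the business transaction (e.g. another thread).
        if (state.Version - fromVersion >= SnapshotEvery)
            _snapshots.Save(aggregateId, state, state.Version);

        return state;
    }
}
```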
The snapshots should always be written outside the original transaction (and can be generated on another thread), since it's non-crucial if the latest snapshot is missing, but you don't want your business transaction to fail due to errors in the snapshot write transaction.
Depending on your usual system uptime and data size, it might make sense to hold snapshots in memory, in a distributed cache/grid, or in another (non-SQL) database.

How to ensure external projections are in sync when using CQRS and EventSourcing?

I'm starting a new application and I want to use CQRS and event sourcing. I get the idea of replaying events to recreate aggregates, and of snapshotting to speed things up if needed, using in-memory models, caching, etc.
My question is about large read models that I don't want to hold in memory. Suppose I have an application where I sell products, and I want to listen to a stream of events like "ProductRegistered" and "ProductSold" and build a table in a relational database that will be used for reporting or integration with another system. Suppose there are lots of records, this table may take anywhere from a few seconds to minutes to truncate/rebuild, and the application exports dozens of these projections for multiple purposes.
How does one handle the consistency of the projections in this scenario?
With in-memory data, it's quite simple and fast to replay the events. But I feel that external projections kept on disk will be much slower to rebuild.
Should I always start my application with a TRUNCATE TABLE + rebuild for every external projection? This seems impractical to me over time, but I may be worrying about a problem I don't have yet.
Since the table is itself like a snapshot, I could keep a "control table" recording the last event I handled for that projection, so I can replay only what's needed. But I'm worried about inconsistencies if the application or database crashes. It seems that verifying the table's consistency would cost about as much as rebuilding it, which points back to option 1 again.
How would you handle that in a way that is maintainable over time? Are there better solutions?
Thank you very much.
One way to handle this is the concept of checkpointing. Essentially either your event stream or your whole system has a version number (checkpoint) that increments with each event.
For each projection, you store the last committed checkpoint that was applied. At startup, you pull events greater than the last checkpoint number that was applied to the projection, and continue building your projection from there. If you need to rebuild your projection, you delete the data AND the checkpoint and rerun the whole stream (or set of streams).
Caution: the last applied checkpoint and the projection's read models need to be persisted in a single transaction to ensure they do not get out of sync.
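To make the checkpointing idea concrete, here is a hedged C# sketch using plain ADO.NET. The table names (projection_checkpoints, product_sales) and the ProductSold event are illustrative assumptions; the essential part is that the read-model update and the checkpoint update commit in the same transaction, as the caution above says.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

// Illustrative event; only Checkpoint matters for the pattern.
public class ProductSold
{
    public long Checkpoint;     // ever-increasing position in the event stream
    public string ProductId;
    public decimal Amount;
}

public class ProductSalesProjection
{
    private readonly IDbConnection _db;   // any open ADO.NET connection to the read-model database

    public ProductSalesProjection(IDbConnection db) { _db = db; }

    // Called at startup: tells the subscriber where to resume reading the event stream.
    public long GetCheckpoint()
    {
        using (IDbCommand cmd = _db.CreateCommand())
        {
            cmd.CommandText =
                "SELECT last_checkpoint FROM projection_checkpoints WHERE projection = 'product_sales'";
            object result = cmd.ExecuteScalar();
            return result == null ? 0L : Convert.ToInt64(result);
        }
    }

    public void Handle(ProductSold e)
    {
        // The read-model row and the checkpoint are written in the SAME transaction,
        // so a crash can never leave them out of sync.
        using (IDbTransaction tx = _db.BeginTransaction())
        {
            Execute(tx, "UPDATE product_sales SET total_sold = total_sold + @amount WHERE product_id = @id",
                    Param("@amount", e.Amount), Param("@id", e.ProductId));

            Execute(tx, "UPDATE projection_checkpoints SET last_checkpoint = @cp WHERE projection = 'product_sales'",
                    Param("@cp", e.Checkpoint));

            tx.Commit();
        }
    }

    private void Execute(IDbTransaction tx, string sql, params KeyValuePair<string, object>[] parameters)
    {
        using (IDbCommand cmd = _db.CreateCommand())
        {
            cmd.Transaction = tx;
            cmd.CommandText = sql;
            foreach (var p in parameters)
            {
                IDbDataParameter dbParam = cmd.CreateParameter();
                dbParam.ParameterName = p.Key;
                dbParam.Value = p.Value;
                cmd.Parameters.Add(dbParam);
            }
            cmd.ExecuteNonQuery();
        }
    }

    private static KeyValuePair<string, object> Param(string name, object value)
    {
        return new KeyValuePair<string, object>(name, value);
    }
}
```

Rebuilding a projection is then just deleting its read-model rows and resetting its checkpoint to 0 before replaying the stream, as described above.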