Syncing two MongoDB instances

I'm trying to figure out how to solve the following use case with MongoDB.
There is a main server where users can upload, edit, and otherwise manage content. Sometimes users want to work independently of the main server, so they receive a defined subset of the main server's MongoDB entries and deploy their own local environment. After some time, they want to pull updates for those entries from the main server and/or submit their own updates and new insertions.
Is there any out-of-the-box solution for this? I thought standard replica sets could be used, but the documentation is sparse, so I can't tell whether it would still work if a user is offline for, say, a month.
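For illustration only, here is a minimal sketch of the kind of delta sync described above, assuming pymongo, that every document carries an updatedAt timestamp, and hypothetical connection strings and collection names; it deliberately ignores conflict resolution (a document edited both locally and on the main server):

from datetime import datetime, timezone
from pymongo import MongoClient

# Hypothetical connection strings and namespaces; adjust to your environment.
main = MongoClient("mongodb://main-server:27017").appdb.entries
local = MongoClient("mongodb://localhost:27017").appdb.entries

def pull_updates(last_sync: datetime) -> None:
    """Copy documents changed on the main server since the last sync."""
    for doc in main.find({"updatedAt": {"$gt": last_sync}}):
        local.replace_one({"_id": doc["_id"]}, doc, upsert=True)

def push_updates(last_sync: datetime) -> None:
    """Submit locally changed or newly inserted documents to the main server."""
    for doc in local.find({"updatedAt": {"$gt": last_sync}}):
        main.replace_one({"_id": doc["_id"]}, doc, upsert=True)

if __name__ == "__main__":
    # In practice the last sync time would be persisted between runs.
    last_sync = datetime(2024, 1, 1, tzinfo=timezone.utc)
    pull_updates(last_sync)
    push_updates(last_sync)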

Related

Saving query results back into elastic stack

I am absolutely new to the Elastic Stack.
My problem space is this: I have a utility which runs on client machines. A few logs are generated on these machines (thousands of them), so we have three data sources: CSV files, log files generated by my application, and the Windows event log. I want to combine these three and generate some useful information out of them, and also build a dashboard with some graphs which will be used by managers.
I have zeroed in on the ELK stack. The idea is that I install Beats on the client machines and push data to Elasticsearch, and then use Kibana to get some visualizations. Since I might have thousands of clients pushing data to the Elasticsearch server, it might not be feasible to keep this data on the server forever. But I need the updated visualizations to be available at all times. So I was planning that periodic queries will run against the indexed data in Elasticsearch, and the result (which is the real information I need) will be saved back into Elasticsearch in a separate index, with the Kibana visualizations set up on that index. All the original data can then be cleared. This way I extract the real information, keep it, and delete the unnecessary raw data.
My questions to the experts are:
Is my thinking or design correct (with respect to the ELK stack), given the problem statement?
Is it feasible in the ELK stack, and are there any examples or utilities to achieve this?
Thanks
Gaurav
Saving the results of your aggregations back into Elasticsearch is a perfectly valid option. You should also consider cold storage as an option for keeping large amounts of data with long retention.
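As an illustration of that approach, here is a minimal sketch using the Python Elasticsearch client (8.x-style API); the index names, the date field, and the aggregation are hypothetical placeholders:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical endpoint

# Aggregate event counts per host over the last day from the raw log indices.
raw = es.search(
    index="raw-logs-*",                      # hypothetical raw index pattern
    size=0,
    query={"range": {"@timestamp": {"gte": "now-1d"}}},
    aggs={"per_host": {"terms": {"field": "host.keyword"}}},
)

# Write each bucket back into a small summary index that the Kibana dashboards use.
for bucket in raw["aggregations"]["per_host"]["buckets"]:
    es.index(
        index="log-summaries",               # hypothetical summary index
        document={"host": bucket["key"], "events": bucket["doc_count"]},
    )

Once the summaries are stored, index lifecycle management (ILM) or a scheduled delete can remove the raw indices, which is what keeps the cluster size bounded.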
You tagged logz.io in your question, so it's worth mentioning that there is a logz.io feature called 'Timeless accounts' which uses Optimizers to define query results that should be saved for longer than the retention periods of the underlying logs.
For the record, I work at logz.io

How to implement an optimistic (or pessimistic) locking when using two databases that need to be in sync?

I'm working on a solution in which we have two databases that are used for the following purposes:
An Elasticsearch cluster used for search purposes
A Postgres database that acts as the source of truth for the data
Our application allows users to retrieve and update products, and a product has multiple attributes: name, price, description, etc. Two typical use cases are:
Retrieve products by name: a search is performed in Elasticsearch, and the IDs it returns are then used in a secondary query against Postgres to obtain the actual, trustworthy data (so we get fast searches on big tables while still serving trustworthy data); a sketch of this read path follows this list.
Update product fields: we allow users to update any product information (it's a kind of collaborative wiki). First we store the data in Postgres, and then in Elasticsearch.
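For concreteness, a minimal sketch of that read path, assuming a products table in Postgres, a products index in Elasticsearch, psycopg2 and the 8.x-style Python Elasticsearch client; all names here are placeholders:

from elasticsearch import Elasticsearch
import psycopg2

es = Elasticsearch("http://localhost:9200")      # hypothetical endpoints
pg = psycopg2.connect("dbname=shop user=app")

def search_products_by_name(name: str) -> list:
    # 1) Fast full-text search in Elasticsearch, fetching only the IDs.
    hits = es.search(
        index="products",
        query={"match": {"name": name}},
        source=False,
    )["hits"]["hits"]
    ids = [int(h["_id"]) for h in hits]   # assumes ES _id mirrors the Postgres id
    if not ids:
        return []
    # 2) Authoritative read from Postgres, the source of truth.
    with pg.cursor() as cur:
        cur.execute(
            "SELECT id, name, price, description FROM products WHERE id = ANY(%s)",
            (ids,),
        )
        return cur.fetchall()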
However, as I feared, as the number of people using the app increased we ran into race conditions: if user #1 changed a product's name to "Banana" while user #2 changed the same product's name to "Apple" at the same time, sometimes the last record saved in Elasticsearch would be "Banana" while in Postgres "Apple" would be the last value, creating a serious inconsistency between the databases.
So I've started reading about optimistic/pessimistic locking in order to solve my problem, but so far all the articles I find deal with the case of a single relational database, and the solutions offered rely on ORM implementations (e.g. Hibernate). Our combined ES + Postgres storage solution requires more "ballet" than that.
What techniques/options are available to me to solve this kind of problem?
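One technique that maps reasonably well onto this setup (sketched here with made-up table, index and column names, using psycopg2 and the 8.x-style Elasticsearch client) is optimistic locking with a version counter: Postgres only applies an update if the caller still holds the latest version, and the same counter is passed to Elasticsearch as an external version so that a stale write is rejected instead of silently overwriting a newer one:

from elasticsearch import Elasticsearch
import psycopg2

es = Elasticsearch("http://localhost:9200")
pg = psycopg2.connect("dbname=shop user=app")

def update_product_name(product_id: int, new_name: str, expected_version: int) -> bool:
    """Returns False if someone else updated the product first."""
    with pg, pg.cursor() as cur:
        cur.execute(
            """UPDATE products
               SET name = %s, version = version + 1
               WHERE id = %s AND version = %s
               RETURNING version""",
            (new_name, product_id, expected_version),
        )
        row = cur.fetchone()
        if row is None:            # stale version: reject and let the caller retry
            return False
        new_version = row[0]
    # External versioning: Elasticsearch applies the write only if this version is
    # higher than the one it already holds, so out-of-order writes are dropped.
    # (In practice you would index the full product document here, not one field.)
    es.index(
        index="products",
        id=product_id,
        document={"name": new_name},
        version=new_version,
        version_type="external",
    )
    return True

If the Elasticsearch write fails after the Postgres commit, the two stores still drift apart, which is exactly the gap the queue/indexer approaches in the answer below are meant to close.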
Well, I may attract some criticism, but let me explain it the way I understand it. What I understand is that this problem/concern is more of an architectural matter than a design/code one.
Immediate consistency (and, of course, eventual consistency)
From the application layer
For immediate consistency between the two databases, the only way to achieve it is polyglot persistence done transactionally, so that either the same data gets updated in both Postgres and Elasticsearch or in neither of them. I wouldn't recommend this, purely because it would put a lot of pressure on the application and you would find it very difficult to scale and maintain.
So basically: GUI --> Application Layer --> Postgres/Elasticsearch
Queue/real-time streaming mechanism
You need a messaging queue so that updates go onto the queue in an event-based approach, as in the sketch after the flow below.
GUI --> Application Layer --> Postgres --> Queue --> Elasticsearch
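As a sketch of that flow, assuming Kafka and the kafka-python client (topic and field names are made up), the application commits to Postgres first and then publishes an event that a separate consumer applies to Elasticsearch:

import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",                     # hypothetical broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_product_update(product_id: int, fields: dict) -> None:
    # Called after the Postgres transaction commits; a consumer on the other
    # side of the topic writes the change into Elasticsearch.
    producer.send("product-updates", {"id": product_id, "fields": fields})
    producer.flush()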
Eventual consistency, but not immediate consistency
Have a separate application, normally called an indexer. The purpose of this tool is to pick up updates from Postgres and push them into Elasticsearch.
What you can have in the indexer is a separate configuration per source, each of which would have:
An option to do a select * and index everything into Elasticsearch, i.e. a full crawl.
This would be utilized when you want to delete and reindex the entire data set in Elasticsearch.
The ability to detect only the updated rows in Postgres and push just those into Elasticsearch, i.e. an incremental crawl.
For this you would need a select query with a where clause based on a status column on your Postgres rows (e.g. pull records with status 0 for documents that were recently updated), or based on a timestamp (pull records updated in the last 30 seconds/1 minute, or whatever your needs dictate). This is the incremental query.
Once you perform the incremental crawl, if you implement it using a status column, you need to change the status to 1 (success) or -1 (failure) so that the same document doesn't get picked up in the next crawl. This is the post-incremental query.
Basically, schedule jobs to run the above queries as part of the indexing operations; a sketch of such an incremental crawl follows the flow below.
Basically we would have: GUI --> Application Layer --> Postgres --> Indexer --> Elasticsearch
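A minimal sketch of such an incremental crawl, assuming a products table with an updated_at column, psycopg2, and the Elasticsearch bulk helper (every name here is a placeholder):

from datetime import datetime, timezone
import psycopg2
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch("http://localhost:9200")
pg = psycopg2.connect("dbname=shop user=indexer")

def incremental_crawl(last_run: datetime) -> None:
    """Push rows changed since the previous run into Elasticsearch."""
    with pg.cursor() as cur:
        cur.execute(
            "SELECT id, name, price, description FROM products WHERE updated_at > %s",
            (last_run,),
        )
        actions = (
            {
                "_index": "products",
                "_id": row[0],
                "_source": {"name": row[1], "price": row[2], "description": row[3]},
            }
            for row in cur
        )
        bulk(es, actions)

if __name__ == "__main__":
    # In practice the watermark would be persisted between scheduled runs.
    incremental_crawl(datetime(2024, 1, 1, tzinfo=timezone.utc))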
Summary
I do not think it would be wise to aim for a fail-proof setup; rather, we should have a system that can recover in the quickest possible time when it comes to providing consistency between two different data sources.
Having the systems decoupled helps greatly with scaling and with tracking down issues related to data correctness/quality, and at the same time it helps you deal with frequent updates as well as with the growth rate of the data and of the updates along with it.
Hope it helps!

Processing large mongo collection offsite

I have a system writing logs into MongoDB (about a million log entries per day). On a weekly basis I need to calculate some statistics on those logs. Since the calculations are very processor- and memory-intensive, I want to copy the collection I'm working with to a powerful offsite machine. How do I keep the offsite collection up to date without copying everything? I modify the offsite collection by storing statistics within its elements, i.e. adding fields such as {"algorithm_1": "passed"} or {"stat1": 3.1415}. Is replication right for my use case, or should I investigate other alternatives?
As to your question: yes, replication does partially resolve your issue, with limitations.
There are several ways I know of to resolve your issue:
The half-database, half-application way
Replication keeps your data up to date. However, it doesn't allow you to modify the secondary nodes (which hold what you call the "offsite collection"). So you have to do the calculation on the secondary and write the results to the primary. You need an application that runs the aggregation on the secondary and writes the result back to its primary.
This requires that you run an application: PHP, .NET, Python, whatever.
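A minimal sketch of this half-database, half-application approach in Python with pymongo (database, collection and field names are made up): read from a secondary via a read preference, then write the computed statistics back through the primary:

from pymongo import MongoClient, ReadPreference

client = MongoClient("mongodb://rs-member-1,rs-member-2/?replicaSet=rs0")  # hypothetical URI

# Run the heavy aggregation against a secondary so the primary stays responsive.
logs = client.logdb.get_collection("logs", read_preference=ReadPreference.SECONDARY_PREFERRED)
weekly_stats = list(logs.aggregate([
    {"$group": {"_id": "$level", "count": {"$sum": 1}}},
]))

# Writes always go through the primary, which then replicates them back out.
client.logdb.weekly_stats.insert_many(weekly_stats)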
The full-server way
Since you are going to have multiple servers anyway, you could consider using sharding for faster storage and do the calculation online directly. This way you don't even need to run a separate application: Map/Reduce does the calculation and writes the output into a new collection. I DON'T recommend this solution, though, because of the Map/Reduce performance issues in current versions.
The full-application way
Basically you still use replication for reading, but the server doesn't do any calculation beyond serving queries. You can use a capped collection or a TTL index to remove expired data, and you simply enumerate the data one document at a time in your application and do the calculations yourself.
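For the full-application way, a TTL index is a one-liner in pymongo; in the sketch below (field name and retention period are placeholders) MongoDB drops log documents automatically once their createdAt is older than 30 days:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")

# MongoDB's TTL monitor deletes documents once `createdAt` is older than the threshold.
client.logdb.logs.create_index("createdAt", expireAfterSeconds=30 * 24 * 3600)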

MongoDB: Switch database/collection referenced by a given name on the fly

My application needs only read access to all of its databases. One of those databases (db_1) hosts a collection coll_1 whose entire contents* need to be replaced periodically**.
My goal is to have no or very little effect on read performance for servers currently connected to the database.
Approaches I have been able to think of so far:
1. renameCollection
Build a temporary collection coll_tmp, then use renameCollection with dropTarget: true to move its contents over to coll_1. The downside of this approach is that as far as I can tell, renameCollection does not copy indexes, so once the collection is renamed, coll_1 would need reindexing. While I don't have a good estimate of how long this would take, I would think that query-performance will be significantly affected until reindexing is complete.
2. TTL Index
Instead of straight up replacing, use a time-to-live index to expire documents after the chosen replacement period. Insert new data every time period. This seems like a decent solution to me, except that for our specific application, old data is better than no data. In this scenario, if the cron job to repopulate the database fails for whatever reason, we could potentially be left with an empty coll_1 which is undesirable. I think this might have a negligible effect, but this solution also requires on-the-fly indexing as every document is inserted.
3. Communicate current database to read-clients
Simply use two different databases (or collections?) and inform connected clients which one is more recent. This solution would allow for finishing indexing the new coll_1_alt (and then coll_1 again) before making it available. I personally dislike the solution since it couples the read clients very closely to the database itself, and of course communication channels are always imperfect.
4. copyDatabase
Use copyDatabase to rename (designate) an alternate database db_tmp to db_1. db_tmp would also have a collection coll_1. Once reindexing is complete on db_tmp.coll_1, copyDatabase could be used to simply rename db_tmp to db_1. It seems that this would require dropping db_1 before renaming, leaving a window in which the data won't be accessible.
Ideally (and naively), I'd just set db_1 to be something akin to a symlink, switching to the most current database as needed.
Anyone has good suggestions on how to achieve the desired effect?
*There are about 10 million documents in coll_1.
** The current plan is to replace the collection once every 24 hours. The replacement interval might get as low as once every 30 minutes, but not lower.
The problem that you point out in option 4 you will also have with option 1: dropTarget also means that the collection will not be available.
Another alternative could be to keep both the old and the new data in the same collection and use a "version ID" that you then still have to communicate to your clients so that they can include it in their queries. That at least saves you from having to reindex, as you pointed out for option 1.
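A minimal sketch of that version-ID idea with pymongo (database, collection and field names are made up): load the replacement data under a fresh version tag, flip a small pointer document once the load is complete, and have readers filter on the current version:

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017").db_1      # hypothetical deployment

def load_new_version(docs, version: int) -> None:
    """Insert the replacement data set under a new version tag, then publish it."""
    db.coll_1.insert_many([{**d, "version": version} for d in docs])
    # Readers switch over only once the load (and any index build) has finished.
    db.meta.update_one({"_id": "coll_1"}, {"$set": {"current": version}}, upsert=True)
    # Older versions can be removed lazily once no client is still reading them.
    db.coll_1.delete_many({"version": {"$lt": version - 1}})

def read_current(query: dict):
    current = db.meta.find_one({"_id": "coll_1"})["current"]
    return db.coll_1.find({**query, "version": current})

A compound index that leads with the version field keeps these filtered reads cheap.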
I think your best bet is actually option 3; it's the closest thing to switching a symlink, except that it happens on the client side.

SQL Server: splitting a mirrored DB onto multiple devices

Say I have a large, mirrored, 1 TB production DB that resides on a single MDF device, and I would like to split it up into, say, five 200 GB devices.
I want to do this without interruption to Production.
I thought I could break the mirror and use the RESTORE process for creating a mirror to achieve the split to multiple devices quickly and without interruption to Production. Doing this twice would allow me to get this done in a few hours.
Has anyone done this? Is it the preferred method seeing as we are mirroring anyways?
What are my other alternatives, and their pros and cons? Any gotchas?
Also, I recall another, more organic process where one would create the 5 new devices and somehow, over time, get the objects to move over to them. I'm not sure of the procedure for this, but I seem to recall it being discussed. It sounds like this could take a long time and possibly cause some blocking at times?
Thanks
...Ray
This isn't quite as simple a process as it first looks. The reason is that just adding the files to SQL Server isn't enough: even if you were to add 4 new files, they would all be empty space. You would have one file with 1 TB of data in it and 4 empty ones, which would eventually fill up because SQL Server uses a proportional fill method for its files, but most of your queries would still be hitting the single file.
I take it you are doing this to improve performance? If so, you will need to move data around into the different files in order to actually split the data up. Whether you can do this online or not depends on whether you are running Enterprise Edition (which allows you to rebuild indexes online).
An easy way to move a table (or, more accurately, a clustered index, which is pretty much the same thing as the table for all intents and purposes) is to add a new filegroup with a new data file and then rebuild the clustered index specifying the new filegroup. You can use the following to do this:
CREATE CLUSTERED INDEX Existing_Index_Name
ON schema_name.table_name (column_name)
WITH (DROP_EXISTING = ON, ONLINE = ON)
ON [new_filegroup_name];
GO
This code will create the new index on the new filegroup and get rid of the old one, and if you are running Enterprise Edition, it will do it all without blocking users.
See the following link for more methods of moving the data between filegroups:
Move data between SQL Server database filegroups
You should also look into partitioning your tables to help improve performance too:
Partitioning Tables and Indexes
With regards to your mirroring setup: break the mirror, then on the principal add all your files/filegroups, then move the data between the filegroups, then back up the modified database on the principal, restore it on the mirror (so all the files are laid out the same way there), and then set up mirroring again.