When I think of a document database, I think of a bunch of JSON files. (I imagine it is more complex than that, but that is how I envision it.)
In an upcoming project, we need the ability to deal with multiple different versions of the data. As I looked at the requirements, they turned out to be very similar to the needs that drive branching and merging of code: versions of the data moving through a process, emergency updates to existing data in production even while active versions are being worked on, and so on.
This has me wondering: do any of the popular document databases have features similar to branching and merging of documents? (I tried searching around, but I could not find any relevant results.)
RavenDB has great Revisions and Patching features.
With Revisions you can keep track of your documents' history:
https://ravendb.net/docs/article-page/4.2/Csharp/server/extensions/revisions
https://ravendb.net/learn/inside-ravendb-book/reader/4.0/4-deep-dive-into-the-ravendb-client-api#document-revisions
With Patching you can update existing data in production:
https://ravendb.net/docs/article-page/4.2/Csharp/client-api/operations/patching/single-document
https://ravendb.net/learn/inside-ravendb-book/reader/4.0/2-zero-to-ravendb#patching-documents
We are investigating how to use a modern version control system for a legacy system.
However, there are some difficulties that we do not really know how to solve. How do you think we should organize this?
We are responsible for some parts of the code, and the customer is responsible for other parts.
The code in our development system is the master; all changes are made there and then sent out to the customer. So far this is like any common project. But for some directories with a few hundred files, the customer is responsible for the code. The customer can change these files as they want and does not have to inform us. When we do an update for them, we need to get a copy of these files and merge in our changes before we give the new version to the customer.
We have released four base versions of the system over the 25 years it has existed. Our six customers, however, use different versions of the system, which means we have to take that into account. In addition, each customer has their own requirements, and we have made major adjustments for each customer.
We have several parallel projects touching the same files. Today we have major problems with the various projects interfering with each other, so we would benefit greatly from keeping changes in different branches. But how do we organize this?
Some projects affect only one customer, others involve several customers. And some projects may be rolled into a future base version.
So my question is: how should we organize this in our new version control system?
I've been using MongoDB for about a year now, though not nearly to its full potential.
I've been developing new software away from anyone's eyes but my own, and I've enjoyed the flexibility of the database to its fullest, making major structural changes to data on the fly.
Now that I'm at a point where I have production server(s) and three development servers, I'm having a real problem with changing data structures and keeping them in sync.
Theoretically, the development servers should always have the most current data from production. In a structured database, if I rename something, I can just run a compare tool and make the corresponding change in production after a pull. In MongoDB, this can become incredibly difficult: there could be hundreds of changes from document to document, much less from database to database.
I've been reviewing my ~/.dbshell file to get a feel for the changes I've made, but what about changes made within the program itself? Configuration database changes?
Are there tools or procedures that are around to make this easier?
I've spent hours on Google researching how others do it. I came across Mongeez, but it's more manual and tedious than I need. In the past, I've just done a mongodump and mongorestore inside a git directory to transport data, but these snapshots are too rigid. I read a few blog posts about moving new data from production to development, but nothing about updating development documents in production. I could write a comparison script, but I feel like this is reinventing the wheel. There has to be a better way.
TL;DR: What are some ways to version NoSQL data, new entries and changed data, between environments?
I had a similar problem/experience while managing a few production Mongo machines for about a year.
Two quick pieces of advice:
WiredPrairie is right. Version your documents and that will allow you to migrate in a casual/relaxed manner. I wish we had done that up front. One of my biggest regrets.
We used Groovy to connect and do our schema/data changes and I loved it. The language is easy to learn and it works great with JSON. My practice was to back up the collections I'd be operating on, write the scripts in dev, run them and if I messed up, restore the backed up collections. Iterate until I got the scripts perfect and then repeat in production.
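For what it's worth, the same back-up/patch/restore loop can be written in any language with a MongoDB driver; here is a minimal sketch in Python with pymongo rather than Groovy (the database, collection, and field names are hypothetical):

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    db = client["app"]  # hypothetical database name

    # 1. Back up the collection we are about to touch.
    snapshot = list(db.users.find())
    db.users_backup.delete_many({})
    if snapshot:
        db.users_backup.insert_many(snapshot)

    # 2. Patch: rename a field on every document still at schema version 1,
    #    stamping the new version so the script is safe to re-run.
    result = db.users.update_many(
        {"schemaVersion": 1},
        {"$rename": {"username": "login"}, "$set": {"schemaVersion": 2}},
    )
    print(f"migrated {result.modified_count} documents")

    # 3. If the result looks wrong, restore from the backup:
    #    db.users.delete_many({})
    #    db.users.insert_many(list(db.users_backup.find()))

Filtering on the schemaVersion field is what makes the script safe to re-run: documents that have already been migrated no longer match.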
What NoSQL database do you recommend for developing a Wiki-like application?
I need documents to have many sub-sections of text, each of which can be version controlled, yet kept normalized.
Think of a Wikipedia page. It has many sections, and being a wiki, it has version control for the document. However, I do not want a new document to be created (or the document to be entirely duplicated) every time a paragraph is changed. I only want that particular paragraph (or section) to get a new version, so it won't waste storage space.
Any recommendation on the database or the design strategy?
Currently no NoSQL database provides what you want here (as far as I know). The closest is CouchDB, which keeps document revision history on every update. Disk space is cheap, so generally that's not a problem.
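For reference, here is a minimal sketch of that revision behaviour against CouchDB's HTTP API, in Python with requests (the URL, credentials, and document names are placeholders); note that old revisions are only guaranteed to survive until the database is compacted:

    import requests

    base = "http://admin:secret@localhost:5984/wiki"  # hypothetical server and db

    requests.put(base)  # create the database (a 412 response means it already exists)

    # First write: CouchDB assigns a revision starting with 1-
    rev1 = requests.put(f"{base}/page1", json={"body": "first draft"}).json()["rev"]

    # Every update must carry the current _rev, or CouchDB rejects it as a conflict.
    requests.put(f"{base}/page1", json={"body": "second draft", "_rev": rev1})

    # Inspect the revision history; old revisions survive until compaction.
    info = requests.get(f"{base}/page1", params={"revs_info": "true"}).json()
    print([r["rev"] for r in info["_revs_info"]])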
But if versioning is key to your business and one of its requirements, you should choose a tool built specifically to solve this problem - Git. Git does exactly what you want and does a lot of the heavy lifting for your wiki app (version diffs, easy blame (in other words, who made which changes), hooks, etc.).
A great example is GitHub wiki pages. Their git-based wiki engine (Gollum) is open source.
To conclude, here are your options:
- use Git
- use CouchDB, which does revision tracking for you but, as far as I know, saves a full copy of the document
- implement the revision logic in your app; any NoSQL DB would fit nicely (see the sketch below)
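To make the third option concrete for the wiki case in the question, here is a minimal sketch of app-level, per-section revision logic on top of MongoDB with pymongo (all collection and field names are hypothetical). Only the edited section gains a new revision, so the rest of the page is never duplicated:

    from datetime import datetime, timezone
    from pymongo import MongoClient

    db = MongoClient("mongodb://localhost:27017")["wiki"]  # hypothetical names

    def edit_section(page_id, section_id, new_text, author):
        """Store an immutable revision of one section and point the page at it."""
        rev_id = db.section_revisions.insert_one({
            "sectionId": section_id,
            "text": new_text,
            "author": author,
            "ts": datetime.now(timezone.utc),
        }).inserted_id
        # The page holds only pointers to each section's current revision,
        # so editing one paragraph never copies the others.
        db.pages.update_one(
            {"_id": page_id},
            {"$set": {f"sections.{section_id}": rev_id}},
        )

    def section_history(section_id):
        """All revisions of one section, newest first."""
        return db.section_revisions.find({"sectionId": section_id}).sort("ts", -1)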
One of those questions that's difficult to google.
We were running into issues the other day with the speed of our svn repository. The standard solution to this seems to be "more RAM! more CPU!" etc. That got me wondering: are there any source-control systems that use a document/NoSQL database (MongoDB, CouchDB, etc.) as their backing store? It seems like it might be a natural fit -- but I'm no expert on source-control database theory. Perhaps there's a way to configure a more recent source-control system to use a document DB as storage?
None that I know of do, and they wouldn't want to. Given the difference in degrees of testing, it would likely hurt robustness (a really bad thing for a source code repository). It would probably also end up hurting performance, because of the inability to do delta storage.
Note that Subversion has two very different storage mechanisms, one backed by the embedded Berkeley DB, and the other backed by simple files. One or the other of these might be better suited to your usage.
Also, since you posed your question pretty broadly, I'll comment on Git and TFS.
Git uses very efficiently packed files in the filesystem to store the repository. Frequently, the entire history is smaller than a checkout. For one very old project that my lab has, the entire history is 57MiB, and a working tree (not counting history) is 56MiB.
TFS stores a lot (possibly all) of its data in a SQL database.
Git uses memory-mapped files just like MongoDB :)
Though Git doesn't actually use MongoDB and I don't think it would want to. If you look at Git, it doesn't really need a NoSQL DB, it basically is a DB.
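That claim is easy to make concrete: Git is essentially a content-addressed key/value store in which every object's key is the SHA-1 of its content. A small sketch of how Git derives a blob's ID:

    import hashlib

    def git_blob_id(content: bytes) -> str:
        """The SHA-1 object ID Git assigns to a file's contents."""
        header = f"blob {len(content)}\0".encode()
        return hashlib.sha1(header + content).hexdigest()

    # Matches `printf 'hello\n' | git hash-object --stdin`
    print(git_blob_id(b"hello\n"))  # ce013625030ba8dba906f756967f9e9ca394464a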
As far as I know, no VCS uses a NoSQL/document database. The idea of using CouchDB etc. is not new... but no one has implemented such a thing so far.
This is a question for those of you developing on a team of devs where all of you have separate databases. You're versioning your database using source control and other tools which will automatically bring dev databases up to date with the latest version of the database (schema, data, SPs, functions, etc.).
OK Great! But wait! What if you are developing on version 4.0 of your software, but now you need to switch branches to the 3.2 branch to fix a bug? The schema could be (almost assuredly is) very different by now...
I suppose if you went through the extra effort to write rollback scripts along with your change scripts, this could work. But that seems like a lot of work - is it really worth it?
Much easier would be to create a new 3.2-branch database and work with that while working on the 3.2-branch code. It doesn't seem reasonable to me to require that each developer has exactly one database to work with.
I'm going out on a limb and assuming that you are versioning the database as a binary. If all your database assets were in the form of constructive code (e.g. SQL scripts and/or text data dumps), the solution would be simple, as suggested by Mark: store these assets as part of the development branch. To work on version 3.2, switch the branch, re-run the create scripts, and presto: a 3.2 database. Merging would be just as easy as with regular code (or just as painful, depending on your version control system of choice).
Here are some suggestions to work in this mode:
If creating the database instances from text is too slow, make a cache on a shared disk volume, keyed by the contents of all the schema / data files (or the MD5 sum thereof).
Write a pre-commit hook to ensure that the schema and data dumps in the developer's instance are the same as the ones under version control. This prevents people from making changes to their dev database with an interactive tool and then forgetting to commit them. (A sketch of such a hook follows this list.)
You mention change scripts; treat them as a liability. While they may be required by your deployment scenario (e.g. for customers who want to upgrade in-place), they duplicate information from the version history of the database, and per Murphy's law, duplication means desynchronization sooner or later. Try to auto-generate the change scripts from the versioned database assets using "diff"; or if this cannot be achieved, dedicate some serious unit tests to database upgrades.
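As an illustration of the pre-commit suggestion (reusing the MD5-sum idea from the caching one), here is a minimal sketch in Python; the dump command and file paths are hypothetical and depend entirely on your database and repository layout:

    #!/usr/bin/env python3
    """Pre-commit hook: refuse the commit if the dev database schema has
    drifted from the schema dump under version control."""
    import hashlib
    import subprocess
    import sys

    VERSIONED_DUMP = "db/schema.sql"                  # hypothetical repo path
    DUMP_CMD = ["pg_dump", "--schema-only", "devdb"]  # substitute your DB's dump tool

    live = subprocess.run(DUMP_CMD, capture_output=True, check=True).stdout
    with open(VERSIONED_DUMP, "rb") as f:
        versioned = f.read()

    if hashlib.md5(live).hexdigest() != hashlib.md5(versioned).hexdigest():
        sys.exit("dev schema differs from the versioned dump; "
                 "re-dump the schema and commit it together with your change")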