Synchronizing Applications - mongodb

I have a standalone network device. It needs to be reworked to function as part of a geographically distributed group of these devices. Synchronization between devices in the group need not occur frequently, no more than hourly. The application is Rails with SQLite.
Mainly, we want to keep certain pieces of information collected on these devices in sync. Because of the deployment, it isn't feasible to add a large database cluster.
I have been considering CouchDB, since replication, and handling the conflicts that result from it, is one of its strong suits.
What do you think of CouchDB as a mechanism to keep distributed network devices synchronized? Any thoughts or suggestions for an alternative approach?

What is the particular question?
CouchDB implements master-master replication, which is exactly what you are asking for.
Or?

CouchDB would be a great fit for this because, as you say, it has master-master replication. Since you're replicating over the WAN, another huge advantage is that CouchDB was designed to handle going on and off the network gracefully, which will be a nice piece of fault tolerance.
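As a rough illustration of how little ceremony replication takes, here is a minimal sketch that triggers a one-shot pull from a peer device through CouchDB's _replicate endpoint. The hostnames, the database name, and the use of Python's requests library are assumptions for the example, not details from the question:

    import requests

    # Pull changes from a peer device into the local database.
    # "devicedata" and the peer hostname are hypothetical.
    resp = requests.post(
        "http://localhost:5984/_replicate",
        json={
            "source": "http://peer-device.example.com:5984/devicedata",
            "target": "devicedata",  # local database to replicate into
        },
    )
    resp.raise_for_status()
    print(resp.json())  # replication history/status on success

Scheduling a call like this hourly (cron would do) matches the synchronization frequency described in the question.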
A lot of people have used CouchDB for this type of situation. Take a look at some case studies (http://www.couchbase.com/customers/case-studies) and a recent blog post I wrote about using CouchDB to keep front-end servers' session data synchronized (weblog.bocoup.com/storing-php-sessions-in-couchdb).
Also, it would help if you posted more information about your case so that we can tailor our answers.
Cheers.

CouchDB is fine. You might have some alternatives with Unix tools.
The simplest key/value database is files in a filesystem, and they work great. If you only need key/value storage with basic replication, then rsync can do that. If your conflict resolution policy is, for example, "always take the latest timestamped data", then you might get away with rsync alone.
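As a sketch of what that could look like, the following script pulls from each peer using rsync's --update flag, which skips files that are newer on the receiving side - a crude last-writer-wins policy. The peer list and data directory are hypothetical, and running it hourly from cron matches the stated sync frequency:

    import subprocess

    # Hypothetical peers and data directory; adjust to the real deployment.
    PEERS = ["device-a.example.com", "device-b.example.com"]
    DATA_DIR = "/var/lib/appdata/"

    for peer in PEERS:
        # -a: preserve attributes, -z: compress over the WAN,
        # --update: skip files that are newer locally (latest-timestamp-wins)
        subprocess.run(
            ["rsync", "-az", "--update", f"{peer}:{DATA_DIR}", DATA_DIR],
            check=True,
        )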
First of all, you're probably running Unix/Linux. SSH and rsync will be included, unlike CouchDB.
Another advantage of rsync (actually its SSH tunnel) is of course identification, authentication, and authorization. Your device is presumably Unix/Linux, and there are a million ways to wire up Unix authorization. It's not a guarantee but nearly anything is doable: password files, NIS, LDAP, Kerberos, Samba/Active Directory. The list goes on.
With Couch you will have to figure out some kind of user management system.
Will you use OAuth?
Will you have to write an authentication plugin?
Will you also replicate the _users database around? What about conflicts in the _users database?
Do you instead have a central _users database? How can you have a central users database if you can't have a central data database?
Couch, like MySQL, is a full-blown server. It will impose maintenance load that rsync won't:
Remember to compact your databases, compact your views, and run view cleanup (a sketch of these calls follows this list)
Remember to rotate the log files
Possibly back up your .couch files and your .ini config
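For reference, those compaction and cleanup chores are plain HTTP calls against CouchDB's API; a minimal sketch, with a hypothetical database and design-document name:

    import requests

    BASE = "http://localhost:5984/devicedata"  # hypothetical database
    HEADERS = {"Content-Type": "application/json"}  # CouchDB requires this

    requests.post(f"{BASE}/_compact", headers=HEADERS)           # compact the database
    requests.post(f"{BASE}/_compact/mydesign", headers=HEADERS)  # compact one design doc's views
    requests.post(f"{BASE}/_view_cleanup", headers=HEADERS)      # remove stale view index files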
In other words, can you do a quick and dirty rsync hack, or do you need the full Couch package?
CouchDB is a uniform, consistent platform regardless of OS. That can be good or bad. Not knowing your specifics, I would guess that rsync over SSH is the best short-term option, but Couch is the best long-term one. (Though, as with so many software projects, the long term never seems to arrive.)

Related

Suggestions for allowing ad-hoc queries against postgresql via simple web service

I have a PostgreSQL 9.3 two-node cluster with a warm-standby (read-only) slave. There are around 30 individual databases with a few hundred total tables and 1.3 TB of raw data. I'd really like the internets to have full access to these tables and allow folks to write arbitrary queries against them. The main reason is my ignorance and incompetence with setting up useful things like REST services, etc...
So I suppose one approach would be to simply allow PostgreSQL TCP connections to the warm-standby host as a user with very limited SELECT perms, and perhaps that is what I should do?
Another approach would be to have some simple JSON(P)-emitting service that simply takes a database and query string, then returns results?
And I suspect you'll have a better approach, so that's why I am here :)
In general, I am not worried if the internets overrun this host with load and DoS it. I just don't want it to become a security liability or offer some method to delete data on the warm-standby host. The machine would be there for use, and if there are naughty users, too bad for the others, I guess. If it gets popular, I could set up more read-only hosts anyway...
Thanks in advance for your thoughts and for those that say I just need to grit my teeth and figure out how to properly provide web services for the data. My main languages are PHP and python, so if you have ideas of tools for those languages...
There is a site, SQL Fiddle, that allows simple querying of different databases. Its code is open source and available on GitHub.
You can try to adapt the code to your needs.
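If adapting SQL Fiddle is more than you need, a stripped-down version of the JSON-emitting service the question describes might look roughly like the sketch below. Flask and psycopg2 are assumptions here (the asker mentions PHP and Python), as are the host and role names; the read-only enforcement is belt-and-braces, combining a SELECT-only role with a read-only session:

    from flask import Flask, jsonify, request
    import psycopg2

    app = Flask(__name__)

    @app.route("/query")
    def query():
        sql = request.args.get("q", "")
        # Connect as a role with SELECT-only grants (hypothetical name),
        # and force the session to be read-only as a second guard.
        conn = psycopg2.connect(
            host="standby.example.com",
            dbname=request.args.get("db", "postgres"),
            user="readonly_web",
            options="-c default_transaction_read_only=on",
        )
        try:
            with conn.cursor() as cur:
                cur.execute(sql)
                cols = [d[0] for d in cur.description]
                rows = cur.fetchmany(1000)  # cap result size
            return jsonify({"columns": cols, "rows": rows})
        finally:
            conn.close()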

What are the pros and cons of DynamoDB with respect to other NoSQL databases?

We use MongoDB database add-on on Heroku for our SaaS product. Now that Amazon launched DynamoDB, a cloud database service, I was wondering how that changes the NoSQL offerings landscape?
Specifically for cloud based services or SaaS vendors, how will using DynamoDB be better or worse as compared to say MongoDB? Are there any cost, performance, scalability, reliability, drivers, community etc. benefits of using one versus the other?
For starters, it will be fully managed by Amazon's expert team, so you can bet that it will scale very well with virtually no input from the end user (developer).
Also, since it's built and managed by Amazon, you can assume that they have designed it to work very well with their infrastructure, so performance should be top notch. In addition to being specifically built for their infrastructure, they have chosen to use SSDs as storage, so right from the start, disk throughput will be significantly higher than other data stores on AWS that are HDD-backed.
I haven't seen any drivers yet and I think it's too early to tell how the community will react to this, but I suspect that Amazon will have drivers for all of the most popular languages, and the community will likely receive this well - and in turn create additional drivers and tools.
Using MongoDB through an add-on for Heroku effectively turns MongoDB into a SaaS product as well.
In reality one would be comparing whatever service a chosen provider offers to what Amazon can offer, instead of comparing one persistence solution to another.
This is very hard to do. Each provider will have varying levels of service at different price points, and being able to run the database on your own hardware locally for development purposes is a welcome option.
I think the key difference to consider is that MongoDB is software that you can install anywhere (including at AWS, at another cloud service, or in-house), whereas DynamoDB is available exclusively as a hosted service from Amazon (AWS). If you want to retain the option of hosting your application in-house, DynamoDB is not an option. If hosting outside of AWS is not a consideration, then DynamoDB should be your default choice unless very specific features are of higher consideration.
There's a table in the following link that summarizes the attributes of DynamoDB and Cassandra:
http://www.datastax.com/dev/blog/amazon-dynamodb
Something that needs improvement in DynamoDB in order to become more usable is the ability to index attributes other than the primary key.
UPDATE 1 (06/04/2013)
On 04/18/2013, Amazon announced support for Local Secondary Indexes, which addressed exactly this limitation:
http://aws.amazon.com/about-aws/whats-new/2013/04/18/amazon-dynamodb-announces-local-secondary-indexes/
I have to be honest; I was very excited when I heard about the new DynamoDB and did attend the webinar yesterday. However, it's difficult to make a decision right now, as everything they said was still very vague; I have no idea which operations their service is going to allow or support.
The one thing I do know is that scaling is handled automatically, which is pretty awesome; yet there are still so many unknowns that it's tough to make a solid analysis until all the facts are in and we can start using it.
Thus far I still see mongo as working much better for me (personally) in the project undertaking that I've been working on.
Like most DB decisions, it's really going to come down to a project by project decision of what's best for your need.
I anxiously await more information on the product, as for now though it is in beta and I wouldn't jump ship to adopt the latest and greatest only to be a tester :)
I think one of the key differences between DynamoDB and other NoSQL offerings is the provisioned throughput - you pay for a specific throughput level on a table, and provided you keep your data well-partitioned, you can always expect that throughput to be met. So as your application load grows, you can scale up and keep your performance more-or-less constant.
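To make the provisioned-throughput model concrete, here is a minimal sketch of creating a table with explicit read and write capacity. It uses boto3, the current AWS SDK for Python (which postdates this discussion); the table name and capacity figures are made up:

    import boto3

    dynamodb = boto3.client("dynamodb")

    # You pay for exactly this capacity, and requests beyond it are throttled.
    dynamodb.create_table(
        TableName="sessions",  # hypothetical table
        AttributeDefinitions=[
            {"AttributeName": "session_id", "AttributeType": "S"},
        ],
        KeySchema=[
            {"AttributeName": "session_id", "KeyType": "HASH"},
        ],
        ProvisionedThroughput={
            "ReadCapacityUnits": 1000,
            "WriteCapacityUnits": 500,
        },
    )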
Amazon DynamoDB seems like a pretty decent NoSQL solution. It is fast, and it is pretty easy to use. Other than having an AWS account, there really isn't any setup or maintenance required. The feature set and API are fairly small right now compared to MongoDB/CouchDB/Cassandra, but I would expect them to grow over time as feedback from the developer community is received. Right now, all of the official AWS SDKs include a DynamoDB client.
Pros
Lightning Fast (uses SSDs internally)
Really (really) reliable. (chances of write failures are lower)
Seamless scaling (no need to do manual sharding)
Works as a web service (no server to manage, no configuration, no installation)
Easily integrated with other AWS features (can export the whole table to S3 or use EMR, etc.)
Replication is managed internally, so the chance of accidental data loss is negligible.
Cons
Very (very) limited querying.
Scanning is painful (I remember a scan through the Java SDK once running for 6 hours)
Pre-defined throughput, which means a sudden increase beyond the provisioned throughput will be throttled.
Throughput is partitioned as the table is sharded internally (which means if you provisioned a throughput of 1000 and the table is split into two partitions, but you are reading only the latest data from one partition, then your effective read throughput is only 500).
No joins; limited indexing allowed (basically 2).
No views, triggers, scripts, or stored procedures.
It's really good as an alternative to session storage in a scalable application. Another good use would be logging/auditing in an extensive system. NOT preferable for feature-rich applications with frequent enhancements or changes.

I need suggestions for a distributed media storage data store

I want to develop a multimedia system. The system needs to store millions of videos and images, so I want to select a distributed storage subsystem. Can anyone give me some suggestions? Thanks!
I guess the best option for the "millions of videos and images" is a content distribution/delivery network (CDN):
A CDN is a server setup which allows for faster, more efficient delivery of your media files. It does this by maintaining copies of your media at different points of presence (POPs) along a global network to ensure quick client access and the fastest delivery possible.
If you use a CDN, you don't need to worry about many problems (distribution, fast access). Integration with a CDN should also be very simple.
@yi_H:
You can configure your writes to be replicated to multiple nodes before the call returns to the client. Whether or not that is needed of course depends on the use case, and it definitely involves a performance hit. So if you are implementing a write-heavy analytical database, it will have a significant impact on write throughput.
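In MongoDB terms this knob is the write concern. A minimal PyMongo sketch, with a hypothetical collection and an arbitrary acknowledgement count:

    from pymongo import MongoClient, WriteConcern

    db = MongoClient("mongodb://localhost:27017/")["media"]

    # Require acknowledgement from 2 replica-set members before returning.
    events = db.get_collection("events", write_concern=WriteConcern(w=2))
    events.insert_one({"video_id": "abc123", "action": "uploaded"})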
As for all the other points you make about the question's lack of requirements, etc., I second those.
Having a replicated file system with metadata in a NoSQL database is a very common way of doing things. Have you considered this kind of approach?
Have you taken a look at MongoDB GridFS? I have never used it, but it is something I would look at to see if it gives you any ideas.
You gave us (near) zero information about what your requirements are. E.g.:
Do you want atomic transactions?
Is the system read or write heavy?
Do you need fast queries or want to batch-process the data set?
How big are the videos?
Do you want to distribute data locally (on a LAN) or spanning multiple data centers / continents?
How are we supposed to pick the right tool if we don't know what it needs to support?
Without any knowledge of the system I would advise using some kind of FS replication for the videos and images and then storing the metadata associated with the items either in MongoDB, MySQL Master-Master or MySQL Cluster.
Distributed related to what?
If you are talking about replication to distribute the data:
MongoDB is restricted to master-slave replication, so only one node is able to accept writes, which leaves you with a single point of failure for a really distributed system.
CouchDB is able to replicate peer-to-peer.
You can find a very good comparison here, and here is another that also compares against HBase.
With CouchDB you also have to be aware that you will be talking HTTP to the database, and that it has built-in web services.
Regards,
Chris
An alternative is to use MongoDB's GridFS, serving as a (very easily manageable) redundant and distributed filesystem.
Some will say that it's slow on reads (and it is, mostly because of the nature of its design), but that doesn't have to be a dealbreaker for your system as a whole, because if you need performance later on, you could always put Varnish or Squid in front of the filesystem tier.
For all I know, Squid also supports on-disk cache for all the less-hot files.
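To give a flavor of how little code GridFS needs, here is a minimal PyMongo sketch of storing and retrieving a video; the database name and filename are made up:

    import gridfs
    from pymongo import MongoClient

    db = MongoClient("mongodb://localhost:27017/")["media"]
    fs = gridfs.GridFS(db)

    # Store a video; GridFS chunks it across documents automatically.
    with open("clip.mp4", "rb") as f:
        file_id = fs.put(f, filename="clip.mp4", content_type="video/mp4")

    # Read it back by id.
    data = fs.get(file_id).read()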
Sources:
http://www.mongodb.org/display/DOCS/GridFS
http://www.squid-cache.org/Doc/config/cache_dir/

do any source-control systems use a document database for storage?

One of those questions that's difficult to google.
We were running into issues the other day with the speed of our svn repository. The standard solution to this seems to be "more RAM! more CPU!" etc., which got me wondering: are there any source-control systems that use a document/NoSQL database (MongoDB, CouchDB, etc.) for storage? It seems like it might be a natural fit - but I'm no expert on source-control storage theory. Perhaps there's a way to configure a more recent source-control system to use a document DB as storage?
None that I know of do, and they wouldn't want to. Given the difference in degrees of testing, it would likely hurt robustness (a really bad thing for a source code repository). It would probably also end up hurting performance, because of the inability to do delta storage.
Note that Subversion has two very different storage mechanisms, one backed by the embedded Berkeley DB, and the other backed by simple files. One or the other of these might be better suited to your usage.
Also, since you posed your question pretty broadly, I'll comment on Git and TFS.
Git uses very efficiently packed files in the filesystem to store the repository. Frequently, the entire history is smaller than a checkout. For one very old project that my lab has, the entire history is 57MiB, and a working tree (not counting history) is 56MiB.
TFS stores a lot (possibly all) of its data in a SQL database.
Git uses memory-mapped files just like MongoDB :)
Though Git doesn't actually use MongoDB and I don't think it would want to. If you look at Git, it doesn't really need a NoSQL DB, it basically is a DB.
As far as I know, none of the VCSs use NoSQL/document-based databases. The idea of using CouchDB etc. is not new... but no one has implemented such a thing so far...

NO-SQL reliable for small business app?

I'm deciding between going with a NoSQL engine or a regular SQL one for a document management system for small businesses.
I have experience with Firebird/SQL Server and have found a good track record of reliability (especially with Firebird).
This market is full of crappy "servers" (clone-built PCs, the majority), cheap hard disks, rarely any RAID or anything like that; some are in locations where a power-off is normal, some don't have a UPS, etc. (I will include off-site auto-backup to external servers, but that doesn't change the internal setup.) (I know about educating end users on proper setups, but it is stupid to depend on that, so let's stick to the point.)
From the design point of view, a schema-less database is the way to go for my system, but I worry whether any of the current solutions (MongoDB, Tokyo Cabinet, etc.) are like Firebird and survive crashes, malfunctions & abuse, so that data corruption is very rare.
The plan is to store the office documents there & provide a central repository.
Check out Neo4j. It is a graph database (schema-free) that can be used like a document or key/value store.
Neo4j has been in production for many years in environments like you describe. Unlike many other NOSQL databases Neo4j actually flushes data to disk and uses a transaction log to recover from an inconsistent state. It also has real transactions (full ACID) that can span multiple operations and treat them as a single unit (which also seems to be a feature that is frequently left out in many other NOSQL stores).
-Johan
(Disclaimer: I am part of the Neo4j team)
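As an illustration of the multi-operation transactions mentioned above, here is a minimal sketch using the official Neo4j Python driver (which postdates this answer); the labels, queries, and connection details are all made up:

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687",
                                  auth=("neo4j", "password"))

    # Both operations commit atomically, or neither does.
    with driver.session() as session:
        with session.begin_transaction() as tx:
            tx.run("CREATE (:Document {title: $t})", t="contract.pdf")
            tx.run("CREATE (:AuditEntry {doc: $t, action: 'added'})",
                   t="contract.pdf")
            tx.commit()

    driver.close()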
CouchDB has the reliability you need:
The CouchDB file layout and commitment system features all Atomic Consistent Isolated Durable (ACID) properties. On-disk, CouchDB never overwrites committed data or associated structures, ensuring the database file is always in a consistent state.
Look at the ACID Properties section here for more info.
With CouchDB you also get easy backup and replication.
I've no code in production using CouchDB yet, but so far I'm very happy with the tests and the development process with CouchDB.