XML versus MongoDB

I have a problem...
I need to store a daily barrage of about 3,000 mid-sized XML documents (100 to 200 data elements).
The data is somewhat unstable in the sense that the schema changes from time to time and the changes are not announced with enough advance notice, but need to be dealt with retroactively on an emergency "hotfix" basis.
The consumption pattern for the data involves both a website and some simple analytics (some averages and pie charts).
MongoDB seems like a great solution except for one problem: it requires converting between XML and JSON. I would prefer to store the XML documents as they arrive, untouched, and shift any intelligent processing to the consumer of the data. That way, any bugs in the data-loading code will not cause permanent damage. Bugs in the consumer(s) are always harmless, since you can fix them and re-run without permanent data loss.
I don't really need "massively parallel" processing capabilities. It's about 4GB of data which fits comfortably in a 64-bit server.
I have eliminated Cassandra from consideration (due to complex setup) and CouchDB (due to the lack of familiar features such as indexing, which I will need initially given my RDBMS ways of thinking).
So finally here's my actual question...
Is it worthwhile to look at native XML databases, which are not as mature as MongoDB, or should I bite the bullet, convert all the XML to JSON as it arrives, and just use MongoDB?

You may want to have a look at BaseX (basex.org), which has a built-in XQuery processor and Lucene-based full-text indexing.
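If you want a feel for what querying that store looks like, here is a minimal sketch of running an XQuery against BaseX over its REST interface. It assumes the BaseX HTTP server is running locally on its default port with default credentials, that the daily documents have been added to a database named "feeds", and that the price element is made up for illustration:

```python
# A minimal sketch, not production code: query BaseX over its REST API.
# Assumes the BaseX HTTP server runs locally on the default port with
# default credentials, and the daily XML documents live in a database
# named "feeds"; the <price> element is invented for illustration.
import requests

XQUERY = "avg(collection('feeds')//price)"

resp = requests.get(
    "http://localhost:8984/rest/feeds",
    params={"query": XQUERY},
    auth=("admin", "admin"),   # BaseX default credentials; change for real use
)
resp.raise_for_status()
print(resp.text)               # result of the XQuery, returned as text
```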

That Data Volume is Small
If there is no need for parallel data processing, there is no need for MongoDB. Especially when dealing with small amounts of data like 4GB, the overhead of distributing work can easily exceed the actual evaluation effort.
4GB / 60k nodes is not large for XML databases, either. After some time getting into it, you will come to appreciate XQuery as a great tool for XML document analysis.
Is it Really?
Or do you receive 4GB daily and have to evaluate that together with all the data you have already stored? Then you will eventually reach a volume that you cannot store and process on one machine any more, and distributing the work will become necessary. Not within days or weeks, but a single year will already bring you more than 1TB.
Converting to JSON
What does your input look like? Does it adhere to a schema, or even resemble tabular data? MongoDB's capabilities for analyzing semi-structured data are much weaker than what XML databases provide. On the other hand, if you only need to pull a few fields from well-defined paths and you can analyze one input file after another, MongoDB probably will not suffer much.
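For what it's worth, the conversion itself is not much code. Below is a rough sketch of the "convert to JSON but keep the raw XML untouched" idea using the xmltodict and pymongo packages against a local mongod; the collection, field, and element names are made up, and keeping the original string means a loader bug can always be fixed by re-parsing raw_xml later:

```python
# A rough sketch of "convert to JSON, but keep the untouched XML as well",
# using the xmltodict and pymongo packages against a local mongod.
# Collection, field, and element names are made up for illustration.
import xmltodict
from pymongo import MongoClient

docs = MongoClient("mongodb://localhost:27017").feeds.daily_docs

def load_document(xml_text: str) -> None:
    docs.insert_one({
        "raw_xml": xml_text,                  # the untouched original; re-parse it if the loader had a bug
        "parsed": xmltodict.parse(xml_text),  # best-effort JSON-ish view for querying/indexing
    })

load_document("<order id='42'><price>120.50</price></order>")
```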
Carrying XML into the Cloud
If you want to use both an XML database's capabilities for analyzing the data and some NoSQL system's capabilities for distributing the work, you could run the XML database on top of that system.
BaseX is getting to the cloud with exactly the capabilities you need -- but it will probably still take some time for that feature to get production-ready.

Related

How to handle large mongodb collection

We have a collection that is potentially going to be very large. This collection is used to store bill-related data, which is often used for reporting/analytics purposes.
Please let me know the best approach to handle this large collection:
1) Can I split and archive the old data (say, a 12-month period)? The problem is that the old data is still needed for analytics reports; I want to query it to show sales comparisons for the past 2 years.
2) Can I move the old data (12 months) into a new collection? Then every 12 months I would have to create a new collection, and for report generation I would have to query across all of these collections. Would this cause a performance problem?
3) Should I go for sharding?
There are many variables to account for, the clearest being what hardware you use, how the data is structured, and how it is queried. A distributed network ought to be able to chew through your data faster than a single machine, but before diving into that solution I recommend generating an absurd amount of mock data comparable to what you are expecting, and then testing various approaches. Seriously. Create a bunch of data, and try to break things. It's fun! Soon enough you'll know more about what your problem requires than any website could tell you.
As for direct responses:
Perhaps, before archiving the data, appropriate stats summaries can be generated (or updated). Those summaries/simplifications can be used for sale comparisons without reloading all of the archived data they represent.
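As a rough illustration of that idea (not the poster's code), here is how such summaries might be pre-computed with MongoDB's aggregation framework via pymongo; the bills collection, its fields, and the monthly_sales summary collection are all hypothetical:

```python
# A sketch of pre-computing monthly sales summaries with the aggregation
# framework before archiving raw documents. The "bills" collection, its
# fields, and "monthly_sales" are hypothetical.
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017").billing

pipeline = [
    {"$match": {"status": "paid"}},                      # assumed field
    {"$group": {
        "_id": {"year": {"$year": "$created_at"},        # created_at assumed to be a date
                "month": {"$month": "$created_at"}},
        "total_amount": {"$sum": "$amount"},
        "bill_count": {"$sum": 1},
    }},
]

for summary in db.bills.aggregate(pipeline):
    # Keep the summaries in a small collection so reports can read them
    # without touching the archived raw data.
    db.monthly_sales.replace_one({"_id": summary["_id"]}, summary, upsert=True)
```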
This strikes me as sensible. By splitting up the sales data, you have more control over how much data needs to be accessed. After all, a user won't always wish to see 3 years of data, they may only wish to see last week's.
Move to sharding when you actually need it. As is stated on the MongoDB site:
Converting an unsharded database to a sharded cluster is easy and seamless, so there is little advantage in configuring sharding while your data set is small.
You'll know it's time when your memory-map approaches the server's RAM limit. MongoDB supports reading and writing to databases too large to keep in memory, but I'm sure you already know that is SLOW.
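For reference, when that point does arrive, turning sharding on is a couple of admin commands run against a mongos router. A sketch only, with a hypothetical billing.bills namespace and an illustrative shard key:

```python
# A sketch only: enabling sharding once a cluster (config servers, shards,
# and a mongos router) is in place. The "billing.bills" namespace and the
# shard key are illustrative, not a recommendation.
from pymongo import MongoClient

client = MongoClient("mongodb://mongos-host:27017")   # connect to mongos, not a plain mongod

client.admin.command("enableSharding", "billing")
client.admin.command("shardCollection", "billing.bills",
                     key={"customer_id": 1, "created_at": 1})
```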

mongo as a main db for a complex project

Does it make sense to use MongoDB in a system with a great number of entities (50+) connected to each other, for example in a CRM? Any "success stories"?
There is a need for intensive writing and fast selection from a high number of records for some kind of analytics system.
It is definitely hard to provide a recommendation for such an open question; however, you can analyze some of the advantages of MongoDB over other databases. Most likely you are considering Mongo as an alternative to a relational database like Oracle or SQL Server.
From http://mongodb.org you can see the main characteristics...
Document Oriented Storage: This basically means you can have a single document or multiple documents representing your data structures. One very important thing here is that the schema is dynamic, that is, you can add more attributes without having to change your database. Pretty useful for adding flexibility to your system.
Full index support: We wouldn't expect any less than full support for indices, right?
Replication and High availability; Sharding: Very critical elements for availability, disaster recovery, and to guarantee the ability to grow with your system.
Querying: Again, pretty critical requirement. Need to make sure you account for the dynamic schema. You will need to consider in your queries that some attributes are not defined for all documents (remember dynamic schema?).
Map/Reduce: Very useful for analytics. Recommended for aggregating large amounts of data. Should be used offline, meaning you don't run a live query against a map/reduce function, otherwise you will be sitting around waiting for a while. But it is great for running batch analytics on your system.
GridFS: A great way of storing binary data. Automatically generates MD5s for your files, splits them into chunks, and can add metadata. Your files will stay with your database.
Also, the Geolocation indices are great. You can define lon,lat attributes and do searches on those.
Now it is up to you to see whether these features are a good fit for your needs, or whether you would rather stay with a well-known relational system.
Before jumping into a solution you should experiment and build some prototypes. You will see very early what challenges you'll have in your design.
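As one tiny prototype along those lines, the sketch below exercises the dynamic-schema and querying points from the list above (missing attributes, $exists, an index) with pymongo; the collection and fields are made up:

```python
# A tiny prototype of the dynamic-schema and querying points above, using
# pymongo against a local mongod; the collection and fields are made up.
from pymongo import MongoClient

crm = MongoClient("mongodb://localhost:27017").crm

crm.contacts.insert_many([
    {"name": "Alice", "email": "alice@example.com", "phone": "+1-555-0100"},
    {"name": "Bob", "email": "bob@example.com"},        # no "phone" attribute at all
])

crm.contacts.create_index("email")                      # full index support

# Remember the dynamic schema: only match documents that define the field.
with_phone = crm.contacts.find({"phone": {"$exists": True}})
print([c["name"] for c in with_phone])                  # ['Alice']
```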
Hope this helps.

Storing millions of log files - Approx 25 TB a year

As part of my work we get approximately 25TB worth of log files annually, and currently they are saved on an NFS-based filesystem. Some are archived as zipped/tar.gz files while others reside in plain text format.
I am looking for alternatives to the NFS-based system. I looked at MongoDB and CouchDB; the fact that they are document-oriented databases seems to make them the right fit. However, the log file content would need to be converted to JSON to be stored in the DB, something I am not willing to do. I need to retain the log file content as is.
As for usage, we intend to put a small REST API on top and allow people to get a file listing, the latest files, and the ability to fetch a file.
The proposed solutions/ideas need to be some form of distributed database or filesystem at the application level where one can store log files and scale horizontally effectively by adding more machines.
Ankur
Since you don't want querying features, you can use Apache Hadoop.
I believe HDFS and HBase will be a nice fit for this.
You can see a lot of huge-storage stories on the Hadoop "Powered By" page.
Take a look at Vertica, a columnar database supporting parallel processing and fast queries. Comcast used it to analyze about 15GB/day of SNMP data, running at an average rate of 46,000 samples per second, using five quad core HP Proliant servers. I heard some Comcast operations folks rave about Vertica a few weeks ago; they still really like it. It has some nice data compression techniques and "k-safety redundancy", so they could dispense with a SAN.
Update: One of the main advantages of a scalable analytics database approach is that you can do some pretty sophisticated, quasi-real time querying of the log. This might be really valuable for your ops team.
Have you tried looking at gluster? It is scalable, provides replication and many other features. It also gives you standard file operations so no need to implement another API layer.
http://www.gluster.org/
I would strongly recommend against using a key/value or document-based store for this data (Mongo, Cassandra, etc.). Use a file system. This is because the files are so large, and the access pattern is going to be a linear scan. One problem that you will run into is retention: most of the "NoSQL" storage systems use logical deletes, which means that you have to compact your database to remove deleted rows. You'll also have a problem if your individual log records are small and you have to index each one of them; your index will be very large.
Put your data in HDFS with 2-3 way replication in 64 MB chunks in the same format that it's in now.
If you are to choose a document database:
With CouchDB you can use the attachments API (_attachments) to attach the file as-is to a document; the document itself could contain only metadata (like timestamp, locality, etc.) for indexing. Then you get a REST API for the documents and the attachments.
A similar approach is possible with Mongo's GridFS, but you would build the API yourself.
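To make the GridFS idea concrete, here is a minimal sketch using pymongo's gridfs module; the database name, file path, and metadata fields are illustrative, and the file is stored byte-for-byte as it arrived:

```python
# A minimal sketch of the GridFS approach with pymongo's gridfs module:
# the log file is stored untouched, in chunks, with metadata you choose.
# The database name, path, and metadata fields are illustrative.
import gridfs
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017").logstore
fs = gridfs.GridFS(db)

with open("/var/log/app/2011-06-01.log", "rb") as f:
    file_id = fs.put(f, filename="2011-06-01.log",
                     metadata={"host": "web01", "day": "2011-06-01"})

# A REST handler could later list files and stream one back by id:
print(fs.get(file_id).read()[:200])
```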
Also HDFS is a very nice choice.

MongoDB vs. Redis vs. Cassandra for a fast-write, temporary row storage solution

I'm building a system that tracks and verifies ad impressions and clicks. This means that there are a lot of insert commands (about 90/second average, peaking at 250) and some read operations, but the focus is on performance and making it blazing-fast.
The system is currently on MongoDB, but I've been introduced to Cassandra and Redis since then. Would it be a good idea to go to one of these two solutions, rather than stay on MongoDB? Why or why not?
Thank you
For a harvesting solution like this, I would recommend a multi-stage approach. Redis is good at real-time communication. Redis is designed as an in-memory key/value store and inherits some very nice benefits of being a memory database: O(1) list operations. For as long as there is RAM to use on a server, Redis will not slow down pushing to the end of your lists, which is good when you need to insert items at such an extreme rate. Unfortunately, Redis can't operate with data sets larger than the amount of RAM you have (it only writes to disk for persistence; reading from disk happens when restarting the server or after a system crash), and scaling has to be done by you and your application. (A common way is to spread keys across numerous servers, which is implemented by some Redis drivers, especially those for Ruby on Rails.) Redis also has support for simple publish/subscribe messaging, which can be useful at times as well.
In this scenario, Redis is "stage one." For each specific type of event you create a list in Redis with a unique name; for example, we have "page viewed" and "link clicked." For simplicity we want to make sure the data in each list has the same structure; "link clicked" may have a user token, link name, and URL, while "page viewed" may only have the user token and URL. Your first concern is just recording the fact that it happened, and whatever absolutely necessary data you need is pushed.
Next we have some simple processing workers that take this frantically inserted information off of Redis' hands, by asking it to take an item off the end of the list and hand it over. The worker can make any adjustments/deduplication/ID lookups needed to properly file the data and hand it off to a more permanent storage site. Fire up as many of these workers as you need to keep Redis' memory load bearable. You could write the workers in anything you wish (Node.js, C#, Java, ...) as long as it has a Redis driver (most web languages do now) and one for your desired storage (SQL, Mongo, etc.)
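A minimal sketch of that two-stage pipeline, using the redis-py and pymongo drivers, with MongoDB standing in as the "more permanent storage site"; the list name and event fields are made up:

```python
# A minimal sketch of the two-stage pipeline described above, using the
# redis-py and pymongo drivers. MongoDB stands in for the "more permanent
# storage site"; the list name and event fields are made up.
import json

import redis
from pymongo import MongoClient

r = redis.Redis(host="localhost", port=6379)
events = MongoClient("mongodb://localhost:27017").tracking.events

def record_click(user_token: str, url: str) -> None:
    """Stage one: an O(1) push, called from the hot request path."""
    r.rpush("events:link_clicked", json.dumps({"user": user_token, "url": url}))

def worker_loop() -> None:
    """Stage two: drain the list and file events into permanent storage."""
    while True:
        _key, raw = r.blpop("events:link_clicked")   # blocks until an item arrives
        events.insert_one(json.loads(raw))

record_click("abc123", "https://example.com/landing")
```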
MongoDB is good at document storage. Unlike Redis, it is able to deal with databases larger than RAM, and it supports sharding/replication on its own. An advantage of MongoDB over SQL-based options is that you don't have to have a predetermined schema; you're free to change the way data is stored however you want at any time.
I would, however, suggest Redis or Mongo only for the "stage one" phase of holding data for processing, and a traditional SQL setup (Postgres or MSSQL, perhaps) to store post-processed data. Tracking client behavior sounds like relational data to me, since you may want to ask "Show me everyone who views this page", "How many pages did this person view on this given day", or "What day had the most viewers in total?". There may be even more complex joins or queries for analytic purposes you come up with, and mature SQL solutions can do a lot of this filtering for you; NoSQL (Mongo or Redis specifically) can't do joins or complex queries across varied sets of data.
I currently work for a very large ad network and we write to flat files :)
I'm personally a Mongo fan, but frankly, Redis and Cassandra are unlikely to perform either better or worse. I mean, all you're doing is throwing stuff into memory and then flushing to disk in the background (both Mongo and Redis do this).
If you're looking for blazing-fast speed, the other option is to keep several impressions in local memory and then flush them to disk every minute or so. Of course, this is basically what Mongo and Redis do for you, so it's not a really compelling reason to move.
All three solutions (four if you count flat-files) will give you blazing fast writes. The non-relational (nosql) solutions will give you tunable fault-tolerance as well for the purposes of disaster recovery.
In terms of scale, our test environment, with only three MongoDB nodes, can handle 2-3k mixed transactions per second. At 8 nodes, we can handle 12k-15k mixed transactions per second. Cassandra can scale even higher. 250 reads is (or should be) no problem.
The more important question is, what do you want to do with this data? Operational reporting? Time-series analysis? Ad-hoc pattern analysis? real-time reporting?
MongoDB is a good option if you want the ability to do ad-hoc analysis based on multiple attributes within a collection. You can put up to 40 indexes on a collection, though the indexes will be stored in-memory, so watch for size. But the result is a flexible analytical solution.
Cassandra is a key-value store. You define a static column or set of columns that will act as your primary index right up front. All queries run against Cassandra should be tuned to this index. You can put a secondary index on it, but that's about as far as it goes. You can, of course, use MapReduce to scan the store for non-key attributes, but it will be just that: a serial scan through the store. Cassandra also doesn't have the notion of "like" or regex operations on the server nodes. If you want to find all customers whose first name starts with "Alex", you'll have to scan through the entire collection, pull the first name out for each entry, and run it through a client-side regex.
I'm not familiar enough with Redis to speak intelligently about it. Sorry.
If you are evaluating non-relational platforms, you might also want to consider CouchDB and Riak.
Hope this helps.
Just found this: http://blog.axant.it/archives/236
Quoting the most interesting part:
This second graph is about Redis RPUSH vs Mongo $PUSH vs Mongo insert, and I find this graph to be really interesting. Up to 5000 entries mongodb $push is faster even when compared to Redis RPUSH, then it becomes incredibly slow; probably the mongodb array type has linear insertion time and so it becomes slower and slower. mongodb might gain a bit of performance by exposing a constant-time insertion list type, but even with the linear-time array type (which can guarantee constant-time look-up) it has its applications for small sets of data.
I guess everything depends at least on data type and volume. The best advice would probably be to benchmark on your typical dataset and see for yourself.
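A bare-bones harness for that kind of benchmark might look like the following; it times Redis RPUSH against MongoDB inserts with a payload shaped like your own impressions, and the names and numbers are purely illustrative:

```python
# A bare-bones benchmark template: time Redis RPUSH against MongoDB inserts
# using a payload shaped like your real impressions. All names and sizes are
# illustrative; results depend heavily on hardware, drivers, and settings.
import json
import time

import redis
from pymongo import MongoClient

PAYLOAD = {"user": "abc123", "ad_id": 42, "ts": 1309000000}
N = 10_000

r = redis.Redis()
impressions = MongoClient().bench.impressions

start = time.perf_counter()
for _ in range(N):
    r.rpush("bench:impressions", json.dumps(PAYLOAD))
print(f"Redis RPUSH:      {N / (time.perf_counter() - start):,.0f} ops/sec")

start = time.perf_counter()
for _ in range(N):
    impressions.insert_one(dict(PAYLOAD))            # copy, since insert_one adds _id
print(f"Mongo insert_one: {N / (time.perf_counter() - start):,.0f} ops/sec")
```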
According to the Benchmarking Top NoSQL Databases (download here)
I recommend Cassandra.
If you have the choice (and need to move away from flat files), I would go with Redis. It's blazing fast, will comfortably handle the load you're talking about, but more importantly you won't have to manage the flushing/IO code. I understand it's pretty straightforward, but less code to manage is better than more.
You will also get horizontal scaling options with Redis that you may not get with file based caching.
I can get around 30k inserts/sec with MongoDB on a simple $350 Dell. If you only need around 2k inserts/sec, I would stick with MongoDB and shard it for scalability. Maybe also look into doing something with Node.js or something similar to make things more asynchronous.
The problem with inserts into databases is that they usually require writing to a random block on disk for each insert. What you want is something that only writes to disk every 10 inserts or so, ideally to sequential blocks.
Flat files are good. Summary statistics (eg total hits per page) can be obtained from flat files in a scalable manner using merge-sorty map-reducy type algorithms. It's not too hard to roll your own.
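For example, a first cut at "total hits per page" over flat files can be a single streaming pass; the log format assumed below (one space-separated "timestamp page user" line per hit) is made up:

```python
# A first cut at "total hits per page" over flat log files in one streaming
# pass. The log format assumed here (space-separated "timestamp page user",
# one line per hit) is made up; adjust the parsing to your files.
import glob
from collections import Counter

hits_per_page = Counter()

for path in glob.glob("/data/logs/*.log"):
    with open(path) as log:
        for line in log:
            fields = line.split()
            if len(fields) >= 2:
                hits_per_page[fields[1]] += 1        # fields[1] = page in the assumed format

for page, hits in hits_per_page.most_common(10):
    print(page, hits)
```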
SQLite now supports Write Ahead Logging, which may also provide adequate performance.
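Turning WAL on from Python's standard sqlite3 module is a one-line pragma; the table here is just an example:

```python
# Switching SQLite to write-ahead logging from the standard library;
# the table and rows are just an example.
import sqlite3

conn = sqlite3.connect("impressions.db")
conn.execute("PRAGMA journal_mode=WAL")              # sticks to the database file
conn.execute("CREATE TABLE IF NOT EXISTS hits (ts INTEGER, page TEXT)")
conn.executemany("INSERT INTO hits VALUES (?, ?)",
                 [(1309000000 + i, "/landing") for i in range(1000)])
conn.commit()
```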
I have hands-on experience with MongoDB, CouchDB, and Cassandra. I converted a lot of files to base64 strings and inserted these strings into the NoSQL stores.
MongoDB was the fastest, Cassandra was the slowest, and CouchDB was slow too.
I think MySQL would be much faster than all of them, but I haven't tried MySQL for my test case yet.

Has anyone used an object database with a large amount of data?

Object databases like MongoDB and db4o are getting lots of publicity lately. Everyone that plays with them seems to love them. I'm guessing that they are dealing with about 640K of data in their sample apps.
Has anyone tried to use an object database with a large amount of data (say, 50GB or more)? Are you able to still execute complex queries against it (like from a search screen)? How does it compare to your usual relational database of choice?
I'm just curious. I want to take the object database plunge, but I need to know if it'll work on something more than a sample app.
Someone just went into production with 12 terabytes of data in MongoDB. The largest I knew of before that was 1TB. Lots of people are keeping really large amounts of data in Mongo.
It's important to remember that Mongo works a lot like a relational database: you need the right indexes to get good performance. You can use explain() on queries and contact the user list for help with this.
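A small sketch of that advice with pymongo (the collection and field names are illustrative): create the index your query needs, then check the plan with explain():

```python
# A small illustration of "right indexes plus explain()" with pymongo;
# the collection and fields are made up.
from pymongo import ASCENDING, MongoClient

orders = MongoClient("mongodb://localhost:27017").shop.orders

orders.create_index([("customer_id", ASCENDING), ("created_at", ASCENDING)])

# The plan shows whether the query used the index or fell back to a
# collection scan; unindexed queries on big collections are the usual culprit.
print(orders.find({"customer_id": 42}).explain())
```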
When I started db4o back in 2000, I didn't have huge databases in mind. The key goal was to store any complex object very simply with one line of code, and to do that well and fast with low resource consumption, so it can run embedded and on mobile devices.
Over time we had many users that used db4o for webapps and with quite large amounts of data, going close to today's maximum database file size of 256GB (with a configured block size of 127 bytes). So to answer your question: yes, db4o will work with 50GB, but you shouldn't plan to use it for terabytes of data (unless you can nicely split your data over multiple db4o databases; the setup cost for a single database is negligible, you can just call #openFile()).
db4o was acquired by Versant in 2008, because its capabilities (embedded, low resource consumption, lightweight) make it a great complementary product to Versant's high-end object database VOD. VOD scales for huge amounts of data, and it does so much better than relational databases. I think it will merely chuckle over 50GB.
MongoDB powers SourceForge, The New York Times, and several other large databases...
You should read the MongoDB use cases. People who are just playing with the technology are often just looking at how it works and are not yet at the point where they understand the limitations. For the right sorts of datasets and access patterns, 50GB is nothing for MongoDB running on the right hardware.
These non-relational systems look at the trade-offs which RDBMSs made and changed them a bit. Consistency is not as important as other things in some situations, so these solutions let you trade it off for something else. The trade-off is still relatively minor: milliseconds, or maybe seconds, in some situations.
It is worth reading about the CAP theorem too.
I was looking at moving the API I have for use with the Stack Overflow iPhone app I wrote a while back to MongoDB from where it currently sits in a MySQL database. In raw form the SO CC dump is in the multi-gigabyte range, and the way I constructed the documents for MongoDB resulted in a 10GB+ database. It is arguable that I didn't construct the documents well, but I didn't want to spend a ton of time doing this.
One of the very first things you will run into if you start down this path is the lack of 32 bit support. Of course everything is moving to 64 bit now but just something to keep in mind. I don't think any of the major document databases support paging in 32 bit mode and that is understandable from a code complexity standpoint.
To test what I wanted to do I used a 64-bit EC2 instance. The second thing I ran into is that even though this machine had 7GB of memory, once the physical memory was exhausted things went from fast to not so fast. I may have had something set up incorrectly at that point, because the lack of 32-bit support had already killed what I wanted to use it for, but I still wanted to see what it looked like. Loading the same data dump into MySQL takes about 2 minutes on a much less powerful box, but the scripts I used to load the two databases work differently, so I can't make a good comparison. Loading only a subset of the data into MongoDB was much faster, as long as it resulted in a database that was less than 7GB.
I think my take away from it was that large databases will work just fine but you may have to think about how the data is structured more than you would with a traditional database if you want to maintain the high performance. I see a lot of people using MongoDB for logging and I can imagine that a lot of those databases are massive but at the same time they may not be doing a lot of random access so that may mask what performance would look like for more traditional applications.
A recent resource that might be helpful is the visual guide to nosql systems. There are a decent number of choices outside of MongoDB. I have used Redis as well although not with as large of a database.
Here's some benchmarks on db4o:
http://www.db4o.com/about/productinformation/benchmarks/
I think it ultimately depends on a lot of factors, including the complexity of the data, but db4o seems to certainly hang with the best of them.
Perhaps worth a mention.
The European Space Agency's Planck mission is running on the Versant Object Database.
http://sci.esa.int/science-e/www/object/index.cfm?fobjectid=46951
It is a satellite with 74 onboard sensors launched last year which is mapping the infrared spectrum of the universe and storing the information in a map segment model. It has been getting a ton of hype these days because it's producing some of the coolest images ever seen of the universe.
Anyway, it has generated 25T of information stored in Versant and replicated across 3 continents. When the mission is complete next year, it will be a total of 50T
Probably also worth noting: object databases tend to be a lot smaller when holding the same information. This is because they are truly normalized: no data duplication for joins, no empty wasted column space, and a few indexes rather than hundreds of them. You can find public information about testing ESA did to compare storage in a multi-column relational database format versus using a proper object model and storing it in the Versant object database. They found they could save 75% disk space by using Versant.
Here is the implementation:
http://www.planck.fr/Piodoc/PIOlib_Overview_V1.0.pdf
Here they talk about 3T -vs- 12T found in the testing
http://newscenter.lbl.gov/feature-stories/2008/12/10/cosmic-data/
Also ... there are benchmarks which show Versant orders of magnitude faster on the analysis side of the mission.
Cheers,
-Robert