I was given the data files for a MongoDB database without being told much about what they contain. Is there a way to probe the contents of the database to identify commonalities within it?
There are a few different mongo shell helpers that will give you an idea of the general structure of collections (e.g. field/data types and their frequency of usage in a collection):
schema.js
variety
To get a better idea of the actual content, you could try one of the Admin UIs.
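If you just want a quick feel before reaching for a tool, here is a hand-rolled sketch of what these helpers do, counting field usage across a sample of documents (the collection name mycoll is an assumption):
var counts = {};
db.mycoll.find().limit(100).forEach(function (doc) {
  for (var key in doc) { counts[key] = (counts[key] || 0) + 1; }
});
printjson(counts);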
Related
I have a Mongo database that I did not create or architect. Is there a good way to introspect the db or print out its structure, to start to get a handle on what types of data are being stored, how the data types are nested, etc.?
Just query the database by running the following commands in the mongo shell:
use mydb // switch to the database you want to query
show collections // list all collections in the database
db.collectionName.find().pretty() // show the documents in a collection in a readable format; do the same for each collection in the database
You should then be able to examine the document structure.
There is actually a tool to help you out here called Variety:
http://blog.mongodb.org/post/21923016898/meet-variety-a-schema-analyzer-for-mongodb
You can view the Github repo for it here: https://github.com/variety/variety
I should probably warn you that:
It uses map/reduce (MR) to accomplish its tasks.
It uses certain other queries that could bring a production set-up to a near halt in terms of performance.
As such, I recommend you run this on a development server or a hidden node of a replica set or something similar.
Depending on the size and depth of your documents it may take a very long time to arrive at a rough structure of your database this way, but it will eventually give you one.
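For reference, Variety's README shows an invocation along these lines (the database and collection names here are just examples):
mongo test --eval "var collection = 'users'" variety.js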
This will print each field name and the type of its value:
var schematodo = db.collection_name.findOne()
for (var key in schematodo) { print(key, typeof schematodo[key]); }
I would recommend limiting the result set rather than issuing an unrestricted find command.
use mydb
db.collectionName.find().limit(10)
var z = db.collectionName.find().limit(10).toArray()
Object.keys(z[0])
Object.keys(z[1])
This will help you begin to understand your database structure, or lack thereof.
This is an open-source tool that I created along with a friend: https://pypi.python.org/pypi/mongoschema/
It is a Python library with pretty simple usage. You can try it out (and even contribute).
One option is to use Mongoeye, an open-source tool similar to Variety.
The difference is that Mongoeye is a stand-alone program (the Mongo shell is not required) and it has more features (histograms, most frequent values, etc.).
https://github.com/mongoeye/mongoeye
A few days ago I found the GUI client MongoDB Compass, which has some nice visualizations. See the product overview. It comes directly from the MongoDB people, and according to their docs:
MongoDB Compass is designed to allow users to easily analyze and understand the contents of their data collections within MongoDB...
You may have been asking about the validation schema. Here's an answer on how to get it:
How to retrieve MongoDb collection validator rules?
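For a quick check from the shell, any validator shows up in the collection options; a minimal sketch, with the collection name as an assumption:
db.getCollectionInfos({ name: "issues" })[0].options.validator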
Use MongoDB Compass, which takes a sample as explained in its documentation: it draws a random sample of 1000 documents to infer the schema. It could miss something, but sampling is the only rational option if your database is several GBs. Compass visualises the result, and the schema can then be exported as JSON.
You can use MongoDB's tool mongodump. On running it, a dump folder is created in the directory from which you executed mongodump. In that folder there is a subfolder for each database in MongoDB, containing the files for that database's collections.
This method is the best I know of, as you can also make out the schema of empty collections.
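For reference, a typical invocation looks like this (the database name and output path are assumptions):
mongodump --db=mydb --out=dump/
This produces dump/mydb/ with a <collection>.bson data file plus a <collection>.metadata.json file per collection.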
The system I'm building is essentially an issue tracking system, but with various issue templates. Some issue types will have different formats than others.
I was originally planning on using MySQL with a main issues table and an issues_meta table that contains key => value pairs. However, I'm thinking NoSQL (MongoDB) might be the better option.
Can MongoDB provide me with the ability to generate "standard" reports, like # of issues by type, # of issues by type by month, # of issues assigned per person, etc.? I ask this because I've read a few sources that said Mongo was bad at reporting.
I'm also planning on storing my audit logs in Mongo, since I want a single "table" for all actions (Modifications to any table). In Mongo I can store each field that was changed easily, since it is schemaless. Is this a bad idea?
Anything else I should know, and will Mongo work for what I want?
I think MongoDB will be a perfect match for that use case.
MongoDB collections are heterogeneous, meaning you can store documents with different fields in the same bag. So different reporting templates won't be a show stopper. You will be able to model a full issue with a single document.
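The "standard" reports from the question are a good fit for the aggregation framework; a minimal sketch, assuming a hypothetical issues collection with a type field:
db.issues.aggregate([
  { $group: { _id: "$type", count: { $sum: 1 } } },
  { $sort: { count: -1 } }
])
Counting by type and month is the same idea, grouping on $year/$month of the issue's date field.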
MongoDB would be a good fit for logging too. You may be interested in capped collections.
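If you go the capped-collection route for the audit log, creation looks like this (the name and size here are assumptions; size is in bytes):
db.createCollection("auditLog", { capped: true, size: 100 * 1024 * 1024 })
A capped collection preserves insertion order and automatically discards the oldest documents once the size limit is reached.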
Should you need relational associations between documents, you can have that too.
If you are using Ruby, I can recommend Mongoid. It will make things easier. It also has support for versioning of documents.
MongoDB will definitely work (and you can use capped collections to automatically drop old records, if you want), but you should ask yourself whether it fits this task well. For the use case you've described, Redis (simple and fast enough) or Riak (if you care a lot about your log data) might be the better option.
From my experience, this is what I've come up with.
I'm currently saving User and Statistic classes into MongoDB and everything works great.
But how do I save the logs each user generates?
I was thinking of using the Logback SiftingAppender and delegating the log information
to separate MongoDB collections, where every collection has the id of the user.
That way I don't have to create advanced map/reduce queries, since the logs are neatly stacked.
Or I could use the SiftingAppender with a FileAppender so each user has a separate log file.
Is there a problem with this if MongoDB ends up with one million log collections, each named with a user id? (Is that even possible, btw?)
If everything is stored in MongoDB, the MongoDB master-slave replication makes
it easy to recover if a master node dies.
What about the FileAppender approach? It feels like there would be a whole lot of log files
to administer. One could maybe save them in folders according to the alphabet: folder A
for users with names/ids starting with A.
What are other options to make this work?
On your question of 1M collections: the default namespace file for a db is 16MB, which allows about 24000 namespaces (12000 collections plus their _id indexes). More info on this website.
And you can set the maximum .ns (namespace) file size to 2GB with the --nssize option, which will allow roughly 3072000 namespaces.
Make use of embedded documents and have one document per user, with an array of embedded documents containing the log entries. You can also benefit from sharding if collections get large.
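A minimal sketch of that layout (all field names here are assumptions):
db.users.insert({
  _id: "user42",
  name: "Alice",
  logs: [ { ts: new Date(), level: "INFO", msg: "logged in" } ]
})
// append a new log entry to an existing user's array
db.users.update({ _id: "user42" }, { $push: { logs: { ts: new Date(), level: "WARN", msg: "quota low" } } })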
I just installed mongodb from Synaptic (Using ubuntu 11.10).
I'm following this tutorial: http://howtonode.org/express-mongodb
to build a simple blog using node.js, express.js and mongodb.
I reached the part where you can actually store post in the mongodb.
It worked for me, but I'm not sure how: I didn't even start MongoDB myself (I just installed it).
The second thing that has me puzzled is: where is that data going?
Does anyone have any clue?
EDIT:
I think I found where the data is going:
var/lib/
where there are some files that are apparently binary.
MongoDB stores all your data as BSON within the directory you found. The internal Mongo processes handle the sharding of the data when necessary, following your specific setup and accommodating you if you have set up distribution across several servers.
One of the most confusing things about coming over to Mongo from RDBMS is the nature of collections vs tables.
http://www.mongodb.org/display/DOCS/MongoDB%2C+CouchDB%2C+MySQL+Compare+Grid
Of note: a Mongo collection does not need to have any schema designed; it is assumed that the client application will validate and enforce any particular schema. The way the data comes in and is serialized is the way it will be saved in its BSON representation. It is entirely possible to have missing keys and totally different docs in the same collection, which is why robust data checking in your app becomes so important.
Mongo collections also do not need to be created explicitly. There isn't anything that approximates the CREATE TABLE syntax from MySQL; since there's no schema, there isn't much such a command would have to describe. Your collection will be created with the first document inserted, and every document inserted will be given a unique '_id' key (an ObjectId by default, assuming you don't define this key in your objects using your own methodology).
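A quick illustration in the shell (the collection name is just an example):
db.posts.insert({ title: "first post" }) // implicitly creates the posts collection
db.posts.findOne() // the stored document has an auto-generated ObjectId in _id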
Here's a helpful resource with the basic commands.
http://cheat.errtheblog.com/s/mongo
Mongo, and NoSQL in general, is a developer-led movement trying to bridge the gap between the object-oriented application layer and the persistence layer in an app, which would traditionally have been represented by an RDBMS. Developers would much rather work within one paradigm, and OOP has become predominant, especially in the web environment.
If you're coming over from LAMP and used phpMyAdmin in the past, I really must suggest RockMongo. This tool makes it so much easier to observe and understand the actual structure of the BSON documents stored on your server.
http://code.google.com/p/rock-php/wiki/rock_mongo
Best of luck!
I'm working on a project where I will have a LOT of data, and it will be searchable by several forms that are very efficiently expressed as SQL Queries, but it also needs to be searched via natural language processing.
My plan is to build an index using Lucene for this form of search.
My question is: if I do this and perform a search, Lucene will return the IDs of the matching documents in the index, and I then have to look up those entities in the relational database.
This could be done in two ways (That I can think of so far):
N queries, one per ID (horrible)
Pass all the IDs to a stored procedure at once (perhaps as a comma-delimited parameter). This has the downsides of being limited by the maximum parameter size and the slow performance of a UDF splitting the string into a temporary table.
I'm almost tempted to mirror everything into Lucene's index, so that I can periodically regenerate the index from the backing store but only need to access the index for the frontend.
Advice?
I would store the 'frontend' data inside the index itself, avoiding any db interaction. The db would be queried only when you want more information on the specific record.
When I encountered this problem I went with a relational database that has full-text search capabilities (I used PostgreSQL 8.3, which has built-in full-text support with stemming and thesaurus support). This way the database can be queried using both SQL and full-text commands. The downside is that you need a DB with full-text search capabilities, and those capabilities might be inferior to what Lucene can do.
I guess the answer depends on what you are going to do with the results. If you are going to display them in a grid and let the user choose the exact document they want to access, you may want to add enough text to the index to help the user identify the document (a blurb of, say, 200 characters), and then, once the user selects a document, hit the DB to retrieve the whole thing.
This will impact the size of your index for sure, so that is another consideration you need to keep in mind. I would also put a cache between the DB and the front end so that the most used items will not incur the full cost of a DB access every time.
Probably not an option depending on how much stuff is in your database, but what I have done is store the db ids in the search index along with the properties I wanted indexed. Then in my service classes I cache all the data needed to display search results for all the objects (e.g., name, db id, image URLs, description blurbs, social media info). The service class returns a Dictionary that can look up objects by db id, and I use the ids returned by Lucene.NET to pull data from the in-memory cache.
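The pattern is roughly this, shown as a minimal JavaScript sketch (all names are assumptions; my actual code is .NET):
var cache = {}; // dbId -> { name, imageUrl, blurb, ... }
function warmCache(rows) { rows.forEach(function (r) { cache[r.id] = r; }); }
// map the ids Lucene returns to display-ready objects, skipping any stale ids
function resultsFor(searchIds) { return searchIds.map(function (id) { return cache[id]; }).filter(Boolean); }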
You could also forego the in-memory cache and store all the necessary properties for displaying a search result in the search index. I didn't do this because the in-memory cache is also used in scenarios other than search.
The in-memory cache is always fresh to within a few hours, and the only time I have to hit the db is if I need to pull more detailed data for a single object (if the user clicks on the link for a specific object to go to the page for that object).