Is it viable to use Hadoop with MongoDB as Database rather than HDFS - mongodb

I am doing research in Hadoop with MongoDB as Database not HDFS. So, I need some guidance in terms of performance and usability.
My scenario
My data is
Tweets from twitter
Facebook News feed
I can get the data from twitter and Facebook API . In order for hadoop processing I need to store.
So my question is, Is it viable (or beneficial) to use Hadoop along with Mongo DB to store social networking data like twitter feeds, facebook posts, etc? Or is it better to go with HDFS and store data in a file . Any expertise guidance will be appreciated. Thanks

It is totally viable to do that. But it mainly depends on your needs. Basically on, what do you want to do once you have the data?
That said, MongoDB is definitely a good option. It is good at storing unstructured, deeply nested documents, like JSON in your case. You don't have to worry too much about nesting and relations in your data. You don't have to worry about the schema as well. Schema-less storage is certainly a compelling reason to go with MongoDB.
On the other hand, I find HDFS more suitable for flat files, where you just have to pick the normalized data and start processing.
But these are just my thoughts. Others might have a different opinion. My final suggestion would be analyze your use case well and then finalize your store.
HTH

Related

MongoDB + Google Big Query - Normalize Data and Import to BQ

I've done quite a bit of searching, but haven't been able to find anything within this community that fits my problem.
I have a MongoDB collection that I would like to normalize and upload to Google Big Query. Unfortunately, I don't even know where to start with this project.
What would be the best approach to normalize the data? From there, what is recommended when it comes to loading that data to BQ?
I realize I'm not giving much detail here... but any help would be appreciated. Please let me know if I can provide any additional information.
If you're using python, easy way is to read collection chunky and use pandas' to_gbq method. Easy and quite fast to implement. But better to get more details.
Additionally to the answer provided by SirJ, you have multiple options to load data to BigQuery, including loading the data to Cloud Storage, local machine, Dataflow any more as mentioned here. Cloud Storage supports data in multiple formats such as CSV, JSON, Avro, Parquet and more. You also have various options to load data using Web UI, Command Line, API or using the Client Libraries which support C#, GO, Java, Node.JS, PHP, Python and Ruby.

Ionic mobile app using Cassandra, what about local storage?

I am working on a project using Ionic for the mobile side, I have a web app as well linked to a Cassandra database.
I need data synchronization between the mobile device (local storage) and the server-hosted Cassandra database. I use the cassandra-driver to connect to the database but then I realize how problematic it is to convert the data to an other type of database (SQLite for example).
Should I rather use an other database than Cassandra to make the synchronization easier ? (I need a noSQL solution)
Choice of the database depends on type of data you want to store. Cassandra is a column oriented database. It has great performance when you have to deal with large amount of data, but has many limitations related to the queries you need in order to pull data. For that reason, it might require additional efforts to develop something that you could easily do with some other database. So, the real question is do you really need Cassandra.
If you are using it only for mobile application, I don't think you will have so much data to exploit Cassandra benefits.
In your place, I would rather consider some other databases, such as MongoDB in case JSON is appropriate format for your data or Redis if you data is key/value pairs.

Modularize user management server, social feed server

I plan to design a system with Dreamfactory as the user management server while a separate REST server for social feed. Dreamfactory will have its own MySQL database for storing user info while the social feed will use MongoDB.
Is this a good system design? I'm new to this as I'm using both open source platform for two different purposes; social feed and user management.
It's difficult to answer your question without knowing requirements to the system. I was going to ask you why storing users in MySQL, but all the same I can ask why using MongoDB or product XXX ;)
There is no silver bullet in programming. Tool is chosen from requirements, not vice versa.
If you do not need to relate data, do not need transactions and does not care about data consistency at all, why go why relational databases? Solutions like AeroSpike or just Redis (yes, it can be persistent too) can give you much higher read/write rate.
Well, I suggest you go write a document, containing your system description, think of load this system is going to have. May be you will decide, that storing data in CSV files is ok for you (joking ;) )

NoSQL Database for Blog / Content Management System? (MongoDB / Cassandra)

My company has been used Oracle for a long time but we would like to look for a NoSQL database as a replacement for faster querying and flexible schema design.
I have tried to use MongoDB which would be the most popular NoSQL database nowadays. I connected it to Spring Data to do some simple queries, which is quite easy to be set up and code simply. Since we are using Spring MVC for web development, Spring Data seems quite suitable for integration.
However, I heard that Cassandra would have better performance in write and read, especially in large scaling system. I am not sure whether it is worth to move to Cassandra and not sure how to measure the performance between MongoDB and Cassandra.
Here are some requirements for my system:
focusing on article fetching
tagging for articles for users to easily search for their favors or related articles
non-distributed system, but have load-balancing and fail-over
Java based, Spring MVC for web development
articles would be stored as XML
probably provide user-defined tables (collections) and fields (keys)
Therefore I would like to raise some questions:
Which Database is the most suitable for my case? You may also raise other databases apart from MongoDB and Cassandra.
If I use Cassandra, which framework would be suitable for integrating to Spring MVC?
Thank you so much in advanced.
I have experience using Spring and Cassandra together. But I always have written my own data access layer.
Using the ORMs out there for Cassandra will not allow you to leverage its full power, and you will, most likely, introduce bugs because your SQL background will make you expect certain behaviours that are just not what Cassandra will give you.
My advice write the code that will access Cassandra yourself and do not be afraid to denormalize A LOT. Think more about how you want to query (or find it) your data than the format in which you want to save it.
I also strongly recommend reading this amazing article: Cassandra Data Modeling Best Practices part 1 part 2
Another DB which might suit your application better is CouchDB (I like using BigCouch). It is another Document based NoSQL database and is in my opinion superior to MongoDB. It offers better solution for scaling and gives emphasis to Availability (just like Cassandra).
I'd like to point you to this question about the difference between CouchDB and MongoDB.
As far as framework goes Play framework has a lot of plugin to work with NoSQL systems, so you might give it a try. You could try playorm which is the last I experimented on.
EDIT : I forgot to mention Kundera as well as an ORM for Cassandra
Choosing between Cassandra and MongoDB depends on type of storage. MongoDB is primarily for document based storage where you get an edge by having various sql like features.
If you require columnar database with high availability and multi dc replication? go for Cassandra.
http://db-engines.com/en/system/Cassandra%3BHBase%3BMongoDB

Is NoSQL suitable for Selling Tickets Web Application?

I want to write a high scalable web application for selling event tickets. I want to use NoSQL database, like Big Table or MongoDB and Cloud Service like Google App Engine (GAE) or Amazon Elastic Compute Cloud (Amazon EC2)
Is it posible using this type of database to be sure that two client will not be able to buy a ticket for the same place simultaneously? Or may be I will have to use RDBMS database and forget about Google App Engine?
Things like GAE's datastore can still support transactional semantics, for example:
http://code.google.com/appengine/docs/python/datastore/transactions.html
So yes, it is possible to do what you're seeking to do. (Note - GAE's Datastore is not exactly NoSQL, since it uses SQL-like queries.)
I have a problem with this question. Not all NoSQL databases are created equally, and different NoSQL databases have different ways they store data. Generally the thing you should be worried about are: data is actually written to disk and not just into memory. Most NoSQL databases can do this but not by default. Let's just say this is not a problem, you can usually tell the database like MOngo or Cassandra to write data to disk, can even tell how many servers at minimum the data should be written to.
The problem is that you may not get a true transactional support. When you deal with ecommerce it's important to have all or nothing type of transation where several operations either succeed completely or rolled back. There must be absolutely no chance that only part of your data is saved. For example, if you need to write data to more than one table (collection or document in NoSQL lingo), if server goes down in the middle of the process and your data is only written to one table, that's usually unacceptable in ecommerce.
I am not familiar with all NoSQL databases, but the ones I know don't have this option yet.
MySQL, on the other hand, does.
If transactional support or lack of it does not bother you, then I think its OK to use NoSQL as long as you tell it to save data to disk and not just into memory.
The answer is 'maybe.'
Depending on what you're trying to build, you many be able to use some of the techniques in this post:
http://kylebanker.com/blog/2010/06/07/mongodb-inventory-transactions/
Using something like get_or_insert you can easily ensure that two clients are not receiving the same resource simultaneously on Google App Engine. However, there are big differences between GAE and a RDBMS, so make sure you study them further before you make a decision.