How to read and write BSON files with Spark? - mongodb

I have many MongoDB dumps in gzip-compressed BSON files, each with multiple documents. I would like to read them directly into Spark, ideally partitioning at the individual document level.
Previous discussions (1, 2) are old and use the deprecated Hadoop Mongo connector. The new, actively maintained Spark Mongo connector seems to implement a DefaultSource interface, a couple of custom partitioners, and a connection layer.
I would like to extract (or contribute) a way to read a multi-document BSON file from disk into a DataFrame, such that different documents can be loaded into different partitions. Writing would also be great to have for completeness, but I'm not sure how robust writing to a single file from multiple writers can be. I am new to Spark and unsure where to start.
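For reference, the naive workaround I've sketched so far (spark-shell-style Scala; the path, the repartition factor, and the use of `RawBsonDocument` from the driver's `org.bson` package are my own assumptions, not anything the connector offers) streams each gzipped dump and decodes the length-prefixed BSON documents by hand:

```scala
import java.io.{DataInputStream, EOFException}
import java.nio.{ByteBuffer, ByteOrder}
import java.util.zip.GZIPInputStream

import org.apache.spark.sql.SparkSession
import org.bson.RawBsonDocument

val spark = SparkSession.builder.appName("bson-dumps").getOrCreate()
val sc = spark.sparkContext
import spark.implicits._

// One record per file; each task then streams that file's documents.
val json = sc.binaryFiles("hdfs:///dumps/*.bson.gz").flatMap { case (_, pds) =>
  val in = new DataInputStream(new GZIPInputStream(pds.open()))

  // Read one length-prefixed (little-endian int32) BSON document off the stream.
  def nextDoc(): Option[String] = {
    val lenBytes = new Array[Byte](4)
    val more = try { in.readFully(lenBytes); true }
               catch { case _: EOFException => in.close(); false }
    if (!more) None
    else {
      val len = ByteBuffer.wrap(lenBytes).order(ByteOrder.LITTLE_ENDIAN).getInt
      val doc = new Array[Byte](len)
      System.arraycopy(lenBytes, 0, doc, 0, 4)
      in.readFully(doc, 4, len - 4)
      Some(new RawBsonDocument(doc).toJson) // render each document as extended JSON
    }
  }

  Iterator.continually(nextDoc()).takeWhile(_.isDefined).map(_.get)
}

// Infer a schema from the JSON text; repartition to spread documents out,
// since binaryFiles only gave one partition per dump file.
val df = spark.read.json(json.toDS()).repartition(200)
df.printSchema()
```

This at least produces a DataFrame, but it only partitions per input file and spreads documents in the shuffle; true document-level input partitioning would presumably need a custom InputFormat or support in the connector itself.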

Related

Is it possible to store data in HDFS as key-value?

Storing data in NoSQL databases can provide a key-value storage model. HDFS, however, is a distributed file store in the Hadoop ecosystem; key-value pairs are what MapReduce jobs operate on, so that structure is generated in the processing phase only.
I need to know if there is a possibility to store data at rest in HDFS where each value is identified by a key.
Hadoop has supported SequenceFiles since its early days (if not since inception): https://wiki.apache.org/hadoop/SequenceFile.
These are mostly useful in map/reduce scenarios; today you'd probably want to use one of the columnar formats (Parquet or ORC) to store your data. You can degrade them to hold just one key and one value, or use them with multiple values per key. They also hold metadata that lets you skip data while scanning (e.g. Parquet filter pushdown: https://drill.apache.org/docs/parquet-filter-pushdown/).
Note that none of these formats will give you online query capabilities (like NoSQL databases); for that you need a NoSQL database. If you want one that stores its data on HDFS, there's HBase (by the way, the HFile format it uses is also a key/multi-value format stored on HDFS).
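If it helps, here is a minimal sketch of both options from Spark/Scala (the paths and the tiny key/value dataset are made up):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("kv-on-hdfs").getOrCreate()
val sc = spark.sparkContext
import spark.implicits._

val kv = Seq(("user:1", "alice"), ("user:2", "bob"))

// SequenceFile: the classic Hadoop key/value container.
sc.parallelize(kv).saveAsSequenceFile("hdfs:///data/kv-seq")
val backFromSeq = sc.sequenceFile[String, String]("hdfs:///data/kv-seq")

// Parquet: a columnar format that can be "degraded" to just a key and a value,
// but also supports many columns and predicate pushdown while scanning.
kv.toDF("key", "value").write.parquet("hdfs:///data/kv-parquet")
val backFromParquet = spark.read.parquet("hdfs:///data/kv-parquet")
  .where($"key" === "user:1") // this filter can be pushed down to the file scan
```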

What value does the Postgres adapter for Spark/Hadoop add?

I am not an HDFS nerd, but coming from a traditional RDBMS background, I am scratching the surface of newer technologies like Hadoop and Spark. Now I am looking at my options when it comes to SQL querying on Spark data.
What I realized is that Spark inherently supports SQL querying. Then I came across this link:
https://www.enterprisedb.com/news/enterprisedb-announces-new-apache-spark-connecter-speed-postgres-big-data-processing
which I am trying to make some sense of. If I am understanding it correctly, the data is still stored in HDFS, but the Postgres connector is used as a query engine? If so, in the presence of an existing querying framework, what new value does this Postgres connector add?
Or am I misunderstanding what it actually does?
I think you are misunderstanding.
They are alluding to the concept of a Foreign Data Wrapper (FDW).
"... They allow PostgreSQL queries to include structured or unstructured data, from multiple sources such as Postgres and NoSQL databases, as well as HDFS, as if they were in a single database. ...
"
This sounds to me like the Oracle Big Data Appliance approach. From Postgres you can look at the whole world of data processing logically as though it were all Postgres, but under the hood the HDFS data is accessed by the Spark query engine, invoked by the Postgres query engine; the likely premise is that you need not concern yourself with that. We are in the domain of virtualization: you can combine Big Data and Postgres data on the fly.
There is no such thing as "Spark data", as Spark is not a database as such, barring some Spark-formatted data that is not compatible with Hive.
The value will invariably be stated as "you need not learn Big Data", etc. Whether that is true remains to be seen.

is it possible to store mongodb data on hdfs

In my project I'm wrestling with the choice of data storage. There is streaming data in JSON format, and the most suitable database for it is MongoDB. I also have to analyze the data with Hadoop or Spark.
So my conflict starts here: can I store MongoDB collections in HDFS, or must the MongoDB and HDFS storage units be different? It is an important issue for my decision. Must I keep Hadoop and MongoDB on the same disk units or on separate ones?
They need to be different units, since the storage methods, the security policy implementations, and the storage mechanisms themselves are different.
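That said, nothing prevents you from periodically copying snapshots of your MongoDB collections onto HDFS for the Hadoop/Spark analysis, while MongoDB keeps ingesting the JSON stream. A rough sketch with the MongoDB Spark connector (2.x option names; the URI, the database.collection, and the output path are placeholders):

```scala
import org.apache.spark.sql.SparkSession

// Assumes the mongo-spark-connector 2.x package is on the classpath.
val spark = SparkSession.builder
  .appName("mongo-to-hdfs")
  .config("spark.mongodb.input.uri", "mongodb://mongo-host:27017/mydb.events")
  .getOrCreate()

// Read the collection through the connector's DataFrame source ...
val events = spark.read
  .format("com.mongodb.spark.sql.DefaultSource")
  .load()

// ... and persist a columnar snapshot on HDFS for the analytics jobs,
// while MongoDB keeps serving the live streaming writes.
events.write.mode("overwrite").parquet("hdfs:///warehouse/events_snapshot")
```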

What is the common practice to store users' data and analyze it with Spark/Hadoop?

I'm new to Spark. I'm a Web developer, not familiar with big data.
Let's say I have a portal website. Users' behavior and actions are stored in 5 sharded MongoDB clusters.
How do I analyze them with Spark?
Or can Spark get the data from any database directly (Postgres/MongoDB/MySQL/...)?
I ask because most websites use a relational DB as the back-end database.
Should I export all the data in the website's databases into HBase?
I store all the user logs in PostgreSQL; is it practical to export the data into HBase or another database Spark prefers?
It seems that would create lots of duplicated data if I copy the data to a new database.
Does my big data setup need any framework other than Spark?
For analyzing the data in the website's databases, I don't see the reason why I would need HDFS, Mesos, ...
How can I make the Spark workers access the data in the PostgreSQL databases?
I only know how to read data from a text file, and I have seen some code about how to load data from hdfs://.
But I don't have an HDFS system now; should I set up HDFS just for this purpose?
Spark is a distributed compute engine, so it expects to have files accessible from all nodes. Here are some choices you might consider (a JDBC sketch for the PostgreSQL part of your question follows below):
There is a Spark-MongoDB connector. This post explains how to get it working.
Export the data out of MongoDB into Hadoop and then use Spark to process the files. For this, you need to have a Hadoop cluster running.
If you are on Amazon, you can put the files in S3 and access them from Spark.
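For the PostgreSQL part of the question: Spark can also read a relational table directly over JDBC, with no HDFS involved. A minimal sketch (host, credentials, table name, and the partitioning bounds are placeholders; the PostgreSQL JDBC driver jar has to be on the workers' classpath):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("pg-logs").getOrCreate()

// Each worker opens its own JDBC connection; partitionColumn/numPartitions
// split the table into parallel reads instead of one giant query.
val logs = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://pg-host:5432/portal")
  .option("dbtable", "user_logs")
  .option("user", "analytics")
  .option("password", "secret")
  .option("partitionColumn", "id")
  .option("lowerBound", "1")
  .option("upperBound", "10000000")
  .option("numPartitions", "16")
  .load()

logs.groupBy("action").count().show()
```

This works fine for one-off analysis; if the queries hammer the production database, a periodic export to files (HDFS or S3) is the usual compromise.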

HBase or Mongo for an Analytics DB if already using Hadoop?

I currently have a Hadoop cluster where I store tons of logs over which I run pig scripts for calculating aggregated analytics. I also have a Mongo cluster where I store production data.
I've recently been put in a position where I need to do a lot of one-off analytics queries, or enable others to do them. These queries frequently need to use both production data and log data together, so whatever I go with, I'd like to have everything in one place. My log data is in json and about 10x the size of my prod data. Here are the pros/cons of Mongo and HBase I'm seeing:
Mongo Pros/ HBase Cons:
Since log data is in JSON, I can get it into Mongo pretty easily, and I can do this in real time as it comes in through something like FluentD.
Most people I work with already have experience writing Mongo queries from needing to work with prod data, so getting an analytics db up on Mongo would be very simple for everyone to use.
I know much less about Hbase than Mongo.
No idea how easy/difficult it would be to get data in JSON or from Mongo into Hbase. I imagine this isn't so bad, but I don't see much documentation.
HBase Pros/Mongo Cons:
My log data is much bigger than my prod data, so storing it in both hadoop and mongo would be way more expensive than storing my prod data in both hadoop and mongo.
I can build HBase on top of my already running Hadoop cluster and fit my prod data in there without adding many extra machines. If I went with Mongo, I'd need a whole new Mongo cluster.
I could use Phoenix on top of HBase to allow a simple SQL syntax for accessing all our data, but I'm not sure how unwieldy this would be for multi-level, document-based data.
I know very little about Hbase currently, and I wouldn't consider myself a Mongo expert, so I'm probably missing a lot.
So, what am I missing, and which is right for my situation?
First of all, you should use something you can already handle. Therefore MongoDB seems a good choice, especially since the data is already in JSON format.
On the other hand, I used HBase for quite a while and the read performance is amazing even with a lot of rows, though I really don't know whether there is any good and fast integration of MongoDB with Hadoop.
HBase is the Hadoop database, so it is predestined to work together with Hadoop.
If the logs could be indexed by an HBase rowkey such as:
producing_program_identifier, timestamp, ...
then HBase could work quite well for this query pattern.
But if you decide on HBase, use the Phoenix framework; it will save you time by giving you familiar interfaces like JDBC and SQL-like queries. It also provides simple aggregation functions (count, avg, max, min), which may be sufficient.
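As a rough illustration of that (a hedged sketch: the Zookeeper quorum and the LOGS table are placeholders, and it assumes the Phoenix client jar is on the classpath):

```scala
import java.sql.DriverManager

// Phoenix exposes HBase tables through a plain JDBC connection.
val conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181")
val stmt = conn.createStatement()

// Simple aggregation over rows keyed by (program_id, ts).
val rs = stmt.executeQuery(
  """SELECT program_id, COUNT(*) AS events, MAX(ts) AS last_seen
    |FROM LOGS
    |WHERE ts > CURRENT_DATE() - 7
    |GROUP BY program_id""".stripMargin)

while (rs.next()) {
  println(s"${rs.getString("program_id")}: ${rs.getLong("events")} events, " +
    s"last seen ${rs.getTimestamp("last_seen")}")
}
conn.close()
```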
From what you're saying, it seems a MongoDB-based solution would work best for you.
HBase is extremely versatile and you can get it to serve both your prod needs and your analytics needs; however, the general-purpose SQL capabilities (in Phoenix, Cloudera's Impala and others) are in their infancy, and the standard HBase way to get high query performance (designing the data structure for reads) will take a lot of effort (especially since you don't have experience with HBase).
By the way, it may make sense for you to use map/reduce to pre-aggregate the data and then load it into MongoDB; that way you utilize your current setup better rather than changing it either way.
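A rough sketch of that last idea (mongo-spark-connector 2.x option names; the URI, paths and grouping columns are placeholders), running the heavy aggregation on the Hadoop cluster and loading only the rolled-up result into MongoDB:

```scala
import org.apache.spark.sql.SparkSession

// Assumes mongo-spark-connector 2.x is on the classpath.
val spark = SparkSession.builder
  .appName("log-rollup")
  .config("spark.mongodb.output.uri", "mongodb://mongo-host:27017/analytics.daily_rollup")
  .getOrCreate()

// The logs are already JSON on the Hadoop cluster.
val logs = spark.read.json("hdfs:///logs/*.json")

// Pre-aggregate on Hadoop; only the small result set goes to Mongo.
val rollup = logs.groupBy("date", "program_id").count()

rollup.write
  .format("com.mongodb.spark.sql.DefaultSource")
  .mode("append")
  .save()
```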