What is the best way to run Map/Reduce stuff on data from Mongo? - mongodb

I have a large Mongo database (100GB) hosted in the cloud (MongoLab or MongoHQ). I would like to run some Map/Reduce tasks on the data to compute some expensive statistics and was wondering what the best workflow is for getting this done. Ideally I would like to use Amazon's Map/Reduce services to do this instead of maintaining my own Hadoop cluster.
Does it make sense to copy the data from the database to S3 and then run Amazon Map/Reduce on it? Or are there better ways to get this done?
Also, further down the line I might want to run the queries frequently, for example every day, so the data on S3 would need to mirror what is in Mongo. Would this complicate things?
Any suggestions/war stories would be super helpful.

Amazon EMR provides a utility called S3DistCp for copying data into and out of S3. It is commonly used with EMR when you don't want to host your own cluster or use up instances to store data: S3 stores all your data, and EMR reads from and writes to S3.
However, transferring 100GB will take time and if you plan on doing this more than once (i.e. more than a one-off batch job), it will be a significant bottleneck in your processing (especially if the data is expected to grow).
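To put the transfer cost in rough numbers, here is a back-of-the-envelope sketch. The throughput figures are assumptions for illustration, not measurements of any provider, and a real transfer also pays for the database export itself.

```python
def transfer_hours(size_gb, throughput_mbit_per_s):
    """Estimate hours needed to move size_gb at a sustained throughput."""
    size_mbit = size_gb * 1000 * 8  # decimal GB -> megabits
    return size_mbit / throughput_mbit_per_s / 3600


# Assumed sustained throughputs in Mbit/s (hypothetical values).
for mbit in (100, 500, 1000):
    print(f"{mbit} Mbit/s: {transfer_hours(100, mbit):.1f} h")
```

If the data set grows, the estimate grows linearly with it, which is why a daily full copy becomes harder to justify over time.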
It looks like you may not need to use S3. MongoDB provides a Hadoop adapter that lets you run map/reduce jobs directly on top of your MongoDB data: http://blog.mongodb.org/post/24610529795/hadoop-streaming-support-for-mongodb
This looks appealing since it lets you implement the MR in Python/JS/Ruby.
I think this mongo-hadoop setup would be more efficient than copying 100GB of data out to S3.
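As a rough illustration of what such a job looks like, here is a plain-Python sketch of a mapper and a reducer. It does not use the mongo-hadoop streaming API itself; the document fields (`user_id`, `amount`) and the local `run_job` driver are hypothetical and only show the shape of the logic you would port.

```python
from collections import defaultdict


def mapper(documents):
    """Emit (key, value) pairs, one per input document."""
    for doc in documents:
        yield doc["user_id"], doc["amount"]


def reducer(key, values):
    """Collapse all values for one key into a summary record."""
    values = list(values)
    return {"_id": key, "count": len(values), "total": sum(values)}


def run_job(documents):
    """Local stand-in for the shuffle phase: group by key, then reduce."""
    grouped = defaultdict(list)
    for key, value in mapper(documents):
        grouped[key].append(value)
    return [reducer(key, values) for key, values in sorted(grouped.items())]


# Hypothetical sample documents.
docs = [
    {"user_id": "a", "amount": 5},
    {"user_id": "b", "amount": 2},
    {"user_id": "a", "amount": 7},
]
print(run_job(docs))
```

Keeping the mapper and reducer free of any I/O makes them easy to test locally before wiring them into a real streaming job.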
UPDATE: An example of using map-reduce with mongo here.

Related

GCP Dataflow vs Cloud Functions to automate scraping output and file-on-cloud merge into JSON format to insert in DB

I have two sources:
1) A CSV that will be uploaded to a cloud storage service, probably GCP Cloud Storage.
2) The output of a scraping process done with Python.
When a user updates 1) (the cloud-stored file), an event should be triggered to execute 2) (the scraping process), and then some transformation should take place to merge these two sources into one in JSON format. Finally, the content of this JSON file should be stored in a DB that is easy to access and low cost. The files the user will update are at most 5MB, and the updates will take place once a week.
From what I've read, I can use GCP Cloud Functions to accomplish this whole process, or I can use Dataflow. I've even considered using both. I've also thought of using MongoDB to store the JSON objects produced by the final merge of the two sources.
Why should I use Cloud Functions, Dataflow or both? What are your thoughts on the DB? I'm open to different approaches. Thanks.
Regarding the use of Cloud Functions and Dataflow: in your case I would go for Cloud Functions, as you don't have a big volume of data. Dataflow is more complex and more expensive, and you would have to use Apache Beam. If you are comfortable with Python, and taking your scenario into consideration, I would choose Cloud Functions. Easy, convenient...
To trigger a Cloud Function when a Cloud Storage object is updated, you will have to configure a trigger. Pretty easy.
https://cloud.google.com/functions/docs/calling/storage
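The merge step itself is small enough to live inside such a function. Below is a minimal sketch of it, assuming both sources share a join column; the column names, the `id` key and the sample data are hypothetical. In a storage-triggered function, the entry point would read the uploaded CSV, run the scraper, and then call something like this.

```python
import csv
import io
import json


def merge_sources(csv_text, scraped_rows, key="id"):
    """Join CSV rows with scraped rows on `key`; return a JSON string."""
    scraped_by_key = {str(row[key]): row for row in scraped_rows}
    merged = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        extra = scraped_by_key.get(row[key], {})
        # CSV values win when both sources define the same field.
        merged.append({**extra, **row})
    return json.dumps(merged)


# Hypothetical inputs: an uploaded CSV and the scraper's output.
csv_text = "id,name\n1,widget\n2,gadget\n"
scraped = [{"id": "1", "price": 9.5}]
print(merge_sources(csv_text, scraped))
```

Rows without a scraped counterpart are kept as they are, so a partial scrape does not drop CSV records.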
Regarding the DB: MongoDB is a good option, but if you want something quick and inexpensive, consider Datastore.
As a managed service, it will make your life easy with a lot of native integrations. It also has a very interesting free tier.

How to access gold table in delta lake for web dashboards and other?

I am using the delta lake oss version 0.8.0.
Let's assume we calculated aggregated data and cubes using the raw data and saved the results in a gold table using delta lake.
My question is, is there a well known way to access these gold table data and deliver them to a web dashboard for example?
In my understanding, you need a running spark session to query a delta table.
So one possible solution could be to write a web api, which executes these spark queries.
Also you could write the gold results in a database like postgres to access it, but that seems just duplicating the data.
Is there a known best practice solution?
The real answer depends on your requirements regarding latency, number of requests per second, amount of data, deployment options (cloud/on-prem, where the data is located: HDFS/S3/...), etc. Possible approaches are:
Run Spark in local mode inside your application; this may require a lot of memory.
Run the Thrift JDBC/ODBC server as a separate process, and access the data via JDBC/ODBC.
Read data directly using the Delta Standalone Reader library for JVM, or via delta-rs library that works with Rust/Python/Ruby
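As a hedged sketch of the last option, the snippet below reads a Delta table through the delta-rs Python bindings (the third-party `deltalake` package) and shapes the rows into a JSON body for a dashboard endpoint. The helper names, the table path you pass in and the row limit are hypothetical, and whether delta-rs can read your table depends on the features your writer uses.

```python
import json


def load_gold_rows(table_path, columns=None):
    """Read a Delta table into a list of dicts without a Spark session."""
    # Imported lazily so the rest of the module works without the package.
    from deltalake import DeltaTable  # third-party: pip install deltalake

    table = DeltaTable(table_path).to_pyarrow_table(columns=columns)
    return table.to_pylist()


def dashboard_payload(rows, limit=100):
    """Shape rows into the JSON body a dashboard endpoint could return."""
    # default=str covers values such as dates that json cannot encode.
    return json.dumps({"count": len(rows), "rows": rows[:limit]}, default=str)
```

A web API handler would call `load_gold_rows` (ideally with caching) and return `dashboard_payload(rows)`; this avoids both a running Spark session and a second copy of the data in Postgres.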

Datapipeline from Sagemaker to Redshift

I wanted to check with the community here whether anyone has explored a pipeline directly from SageMaker to Redshift.
I want to load the predicted data from Sagemaker to a table in Redshift. I was planning to do it via S3, but was wondering if there are better ways to do this.
I think your idea to stage data in S3, if acceptable in your specific use-case, is a good baseline design:
SageMaker smoothly connects to S3 (via Batch Transform or Processing job)
Redshift COPY statements are best practice for efficient loading of data, and can be done from S3 ("COPY loads large amounts of data much more efficiently than using INSERT statements, and stores the data more effectively as well." - Redshift documentation)
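For illustration, here is a minimal sketch that builds such a COPY statement for CSV predictions staged in S3. The table name, bucket path and IAM role ARN are hypothetical placeholders, and the statement would still need to be executed through whatever Redshift client you use.

```python
def build_copy_statement(table, s3_uri, iam_role_arn, has_header=False):
    """Return a Redshift COPY statement that loads CSV files from S3."""
    lines = [
        f"COPY {table}",
        f"FROM '{s3_uri}'",
        f"IAM_ROLE '{iam_role_arn}'",
        "FORMAT AS CSV",
    ]
    if has_header:
        # Skip the first line of each file when it holds column names.
        lines.append("IGNOREHEADER 1")
    return "\n".join(lines) + ";"


# Hypothetical table, bucket and role names.
print(build_copy_statement(
    "analytics.predictions",
    "s3://my-bucket/sagemaker/batch-output/",
    "arn:aws:iam::123456789012:role/RedshiftCopyRole",
))
```

The values are interpolated directly into the SQL text, so this sketch is only suitable for trusted, hard-coded inputs.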

How do I efficiently migrate the BigQuery Tables to On-Prem Postgres?

I need to migrate tables from BigQuery to an on-prem Postgres database.
How can I efficiently achieve that?
Some thoughts so far:
I will use Google APIs to export the data from the tables
Store it locally
And finally, import to Postgres
But I am not sure if that can be done for a huge amount of data in TBs. Also, how can I automate this process? Can I use Jenkins for that?
Exporting the data from BigQuery, storing it and importing it into PostgreSQL is a good approach. Here are two other alternatives that you can consider:
1) There's a PostgreSQL wrapper for BigQuery that allows you to query BigQuery directly. Depending on your scenario, this might be the easiest way to transfer the data, although for TBs it might not be the best approach. This suggestion was made by #David in this SO question.
2) Using Dataflow. You can create an ETL process using Apache Beam to make the transfer. Take a look at this how-to for transferring data from BigQuery to CloudSQL. You would need to adapt it for a local PostgreSQL, but the idea remains the same.
Here's another SO answer that gives more context on this approach.
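For the plain export, store and import route, the steps can be sketched as a small command builder that a scheduler such as Jenkins could run. This is a minimal sketch, not a tested pipeline: the table names, bucket prefix and local paths are hypothetical, and it assumes `bq`, `gsutil` and `psql` are installed and configured (for example via the usual `PG*` environment variables).

```python
import shlex


def export_commands(bq_table, gcs_prefix, local_dir):
    """Commands that export a BigQuery table to GCS and download it."""
    # The wildcard lets BigQuery shard a large table across many files.
    pattern = f"{gcs_prefix}/part-*.csv"
    return [
        ["bq", "extract", "--destination_format=CSV", bq_table, pattern],
        ["gsutil", "-m", "cp", pattern, local_dir],
    ]


def import_command(pg_table, csv_path):
    """Command that loads one downloaded CSV file into Postgres."""
    copy = f"\\copy {pg_table} FROM '{csv_path}' WITH (FORMAT csv, HEADER true)"
    return ["psql", "-c", copy]


# Hypothetical names; print the commands instead of running them.
for cmd in export_commands("mydataset.events", "gs://my-bucket/export", "/data/export/"):
    print(shlex.join(cmd))
print(shlex.join(import_command("events", "/data/export/part-000000000000.csv")))
```

Because the export is sharded, the import command runs once per downloaded file, which also makes it easy to retry a single failed chunk.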

HBase or Mongo for an Analytics DB if already using Hadoop?

I currently have a Hadoop cluster where I store tons of logs over which I run pig scripts for calculating aggregated analytics. I also have a Mongo cluster where I store production data.
I've recently been put in a position where I need to do a lot of one-off analytics queries, or enable others to do them. These queries frequently need to use both production data and log data together, so whatever I go with, I'd like to have everything in one place. My log data is in json and about 10x the size of my prod data. Here are the pros/cons of Mongo and HBase I'm seeing:
Mongo Pros/ HBase Cons:
Since log data is in JSON, I can get it into Mongo pretty easily, and I can do this in real time as it comes in through something like FluentD.
Most people I work with already have experience writing Mongo queries from needing to work with prod data, so getting an analytics db up on Mongo would be very simple for everyone to use.
I know much less about Hbase than Mongo.
No idea how easy/difficult it would be to get data in JSON or from Mongo into Hbase. I imagine this isn't so bad, but I don't see much documentation.
HBase Pros/Mongo Cons:
My log data is much bigger than my prod data, so storing it in both hadoop and mongo would be way more expensive than storing my prod data in both hadoop and mongo.
I can build HBase on top of my already running Hadoop cluster and fit my prod data in there without adding many extra machines. If I went with Mongo, I'd need a whole new Mongo cluster.
I could use Phoenix on top of HBase to allow a simple SQL syntax for accessing all our data, but I'm not sure how unwieldy this would be for multi-level document-based data.
I know very little about HBase currently, and I wouldn't consider myself a Mongo expert, so I'm probably missing a lot.
So, what am I missing, and which is right for my situation?
First of all, you should use something you can already handle. MongoDB therefore seems a good choice, especially since the data is already in JSON format.
On the other hand, I have used HBase for quite a while and its read performance is amazing even with a lot of rows, and I really don't know whether there is any good and fast integration of MongoDB with Hadoop.
HBase is the Hadoop database, so it is designed to work together with Hadoop.
If the logs could be indexed by (in the HBase Rowkey):
producing_program_identifier, timestamp, ...
HBase could work quite well for this query pattern.
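A minimal sketch of such a composite row key is shown below. The separator byte, the millisecond timestamps and the helper names are assumptions for illustration; the point is only that a fixed-width, big-endian timestamp keeps HBase's lexicographic key order equal to time order within one program.

```python
import struct


def build_row_key(program_id, timestamp_ms):
    """Compose program id and timestamp into one sortable row key."""
    # Big-endian fixed-width encoding keeps byte order equal to time order.
    return program_id.encode("utf-8") + b"\x00" + struct.pack(">Q", timestamp_ms)


def scan_range(program_id, start_ms, stop_ms):
    """Return (start_row, stop_row) covering one program's time window."""
    # HBase scans include the start row and exclude the stop row.
    return build_row_key(program_id, start_ms), build_row_key(program_id, stop_ms)


# Hypothetical program id and timestamps.
start, stop = scan_range("api", 1000, 3000)
print(start, stop)
```

With keys laid out this way, "all logs of program X between T1 and T2" becomes a single contiguous range scan instead of a full table scan.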
But if you decide on HBase, use the Phoenix framework. It will save you time by offering familiar interfaces like JDBC and SQL-like queries. It also provides simple aggregation functions (count, avg, max, min), which may be sufficient.
From what you're saying, it seems a MongoDB-based solution would work best for you.
HBase is extremely versatile, and you can get it to serve both your prod needs and your analytics needs. However, the general-purpose SQL capabilities (in Phoenix, Cloudera's Impala and others) are in their infancy, and the standard HBase way to get high query performance (designing the data structure for reads) will take a lot of effort (especially since you don't have experience with HBase).
By the way, it may make sense for you to use map/reduce to pre-aggregate the data and then load it into MongoDB, making better use of your current setup rather than changing it either way.
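To make the pre-aggregation idea concrete, here is a small sketch that rolls JSON log lines up into one summary document per day and event type. The log fields (`timestamp`, `event`) and the `_id` scheme are hypothetical; in practice this logic would run as the Hadoop job, and the resulting documents would be upserted into MongoDB.

```python
import json
from collections import Counter


def preaggregate(log_lines):
    """Count events per (day, event type) from JSON log lines."""
    counts = Counter()
    for line in log_lines:
        record = json.loads(line)
        day = record["timestamp"][:10]  # assumes ISO 8601 timestamps
        counts[(day, record["event"])] += 1
    # One summary document per key, with a deterministic _id for upserts.
    return [
        {"_id": f"{day}:{event}", "day": day, "event": event, "count": n}
        for (day, event), n in sorted(counts.items())
    ]


# Hypothetical log lines.
logs = [
    '{"timestamp": "2014-03-01T10:00:00Z", "event": "login"}',
    '{"timestamp": "2014-03-01T11:30:00Z", "event": "login"}',
    '{"timestamp": "2014-03-02T09:15:00Z", "event": "purchase"}',
]
print(preaggregate(logs))
```

Only the small summary documents land in MongoDB, so the log data, roughly 10x the size of the prod data, never has to be stored twice.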