GCP Dataflow vs Cloud Functions to automate scraping output and file-on-cloud merge into JSON format to insert in DB - MongoDB

I have two sources:
A CSV that will be uploaded to a cloud storage service, probably GCP Cloud Storage.
The output of a scraping process done with Python.
When a user updates 1) (the cloud-stored file), an event should be triggered to execute 2) (the scraping process), and then some transformation should take place to merge these two sources into a single JSON format. Finally, the content of this JSON should be stored in a DB that is easy to access and low cost. The files the user will upload are at most 5 MB and the updates will take place once a week.
From what I've read, I can use GCP Cloud Functions to accomplish this whole process, or I can use Dataflow; I've even considered using both. I've also thought of using MongoDB to store the JSON objects resulting from the final merge of the two sources.
Why should I use Cloud Functions, Dataflow or both? What are your thoughts on the DB? I'm open to different approaches. Thanks.

Regarding the use of Cloud Functions and Dataflow: in your case I would go for Cloud Functions, as you don't have a big volume of data. Dataflow is more complex and more expensive, and you would have to use Apache Beam. If you are comfortable with Python, and taking your scenario into consideration, I would choose Cloud Functions. Easy, convenient...
To trigger a Cloud Function when a Cloud Storage object is updated, you will have to configure the triggers. Pretty easy.
https://cloud.google.com/functions/docs/calling/storage
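As a rough illustration, here is a minimal sketch of such a function in Python (a 1st-gen background function triggered by google.storage.object.finalize). The scrape_source() helper, the MONGO_URI environment variable and the database/collection names are placeholders for your own setup, not something from your question:

import csv
import io
import os

from google.cloud import storage   # GCS client, available in the Cloud Functions runtime
from pymongo import MongoClient    # add pymongo to requirements.txt

def scrape_source():
    # Placeholder for the scraping logic; return a dict keyed by the field used to join.
    return {}

def on_csv_upload(event, context):
    # Background function triggered by google.storage.object.finalize.
    blob = storage.Client().bucket(event["bucket"]).blob(event["name"])
    rows = list(csv.DictReader(io.StringIO(blob.download_as_text())))  # max ~5 MB, fits in memory

    # Merge the CSV rows with the scraped data, row by row.
    scraped = scrape_source()
    merged = [{**row, **scraped.get(row.get("id"), {})} for row in rows]

    # Store the merged JSON documents.
    if merged:
        MongoClient(os.environ["MONGO_URI"])["mydb"]["merged"].insert_many(merged)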
Regarding the DB: MongoDB is a good option, but if you want something quick and inexpensive, consider Datastore.
As a managed service it will make your life easy, with a lot of native integrations. It also has a very interesting free tier.

Related

How to access gold table in delta lake for web dashboards and others?

I am using the Delta Lake OSS version 0.8.0.
Let's assume we calculated aggregated data and cubes using the raw data and saved the results in a gold table using Delta Lake.
My question is, is there a well-known way to access this gold table data and deliver it to a web dashboard, for example?
In my understanding, you need a running Spark session to query a Delta table.
So one possible solution could be to write a web API which executes these Spark queries.
You could also write the gold results to a database like Postgres to access them, but that seems to just duplicate the data.
Is there a known best practice solution?
The real answer depends on your requirements regarding latency, number of requests per second, amount of data, deployment options (cloud/on-prem, where the data is located - HDFS/S3/...), etc. Possible approaches are:
Have Spark running in local mode inside your application - it may require a lot of memory, etc.
Run the Thrift JDBC/ODBC server as a separate process, and access data via JDBC/ODBC
Read data directly using the Delta Standalone Reader library for the JVM, or via the delta-rs library that works with Rust/Python/Ruby
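For instance, with the delta-rs Python bindings (the deltalake package), you could read a gold table into pandas without any Spark session and serve it from a lightweight web API. The table path and column names below are made up for illustration:

from deltalake import DeltaTable

# The path can be local, s3://, gs://, etc., depending on the configured storage backend.
dt = DeltaTable("s3://my-lake/gold/daily_cube")   # hypothetical gold table

# Load the table (or only the columns the dashboard needs) into pandas.
df = dt.to_pandas(columns=["date", "region", "revenue"])
print(df.head())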

Writing to cloud storage as a side effect in cloud dataflow

I have a cloud dataflow job that does a bunch of processing for an appengine app. At one stage in the pipeline, I do a group by a particular key, and for each record that matches that key I would like to write a file to Cloud Storage (using that key as part of the file name).
I don't know in advance how many of these records there will be. So this usage pattern doesn't fit the standard Cloud Dataflow data sink pattern (where the sharding of that output stage determines the number of output files, and I have no control over the output file names per shard).
I am considering writing to Cloud Storage directly as a side-effect in a ParDo function, but have the following queries:
Is writing to cloud storage as a side-effect allowed at all?
If I was writing from outside a Dataflow pipeline, it seems I should use the Java client for the JSON Cloud Storage API. But that involves authenticating via OAuth to do any work, and that seems inappropriate for a job already running on GCE machines as part of a Dataflow pipeline: will this work?
Any advice gratefully received.
Answering the first part of your question:
While nothing directly prevents you from performing side-effects (such as writing to Cloud Storage) in your pipeline code, usually it is a very bad idea. You should consider the fact that your code is not running in a single-threaded fashion on a single machine. You'd need to deal with several problems:
Multiple writers could be writing at the same time. You need to find a way to avoid conflicts between writers. Since Cloud Storage doesn't support appending to an object directly, you might want to use the composite objects technique.
Workers can be aborted, e.g. in case of transient failures or problems with the infrastructure, which means that you need to be able to handle the interrupted/incomplete writes issue.
Workers can be restarted (after they were aborted). That would cause the side-effect code to run again. Thus you need to be able to handle duplicate entries in your output in one way or another.
Nothing in Dataflow prevents you from writing to a GCS file in your ParDo.
You can use GcpOptions.getCredential() to obtain a credential to use for authentication. This will use a suitable mechanism for obtaining a credential depending on how the job is running. For example, it will use a service account when the job is executing on the Dataflow service.
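For reference, here is a sketch of the side-effect write with the Beam Python SDK (the answer above refers to the Java SDK's GcpOptions.getCredential()); on Dataflow, the storage client created in the DoFn picks up the worker service account automatically. The bucket and object naming are placeholders:

import apache_beam as beam
from google.cloud import storage

class WriteGroupToGcs(beam.DoFn):
    # Writes one GCS object per (key, records) pair as a side effect of the ParDo.
    def __init__(self, bucket_name):
        self._bucket_name = bucket_name
        self._client = None

    def setup(self):
        # Created once per worker; on Dataflow this uses the worker's service account.
        self._client = storage.Client()

    def process(self, element):
        key, records = element
        blob = self._client.bucket(self._bucket_name).blob("output/%s.txt" % key)
        blob.upload_from_string("\n".join(str(r) for r in records))
        yield key

# Usage after a GroupByKey:
# grouped | beam.ParDo(WriteGroupToGcs("my-output-bucket"))

Keep in mind the caveats from the first answer (concurrent writers, aborted workers, duplicate writes on retries).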

Migration from Google Cloud SQL to Datastore

Is there an easy way to migrate a large (100 GB) Google Cloud SQL database to Google Cloud Datastore?
The way that comes to mind is to write a Python App Engine script for each database and table and then put it into Datastore. That sounds tedious, but maybe it has to be done?
Side note: the reason I'm leaving Cloud SQL is that I have JSP pages with multiple queries on them and they are incredibly slow, even with a D32 SQL instance. I hope that putting it in Datastore will be faster.
There seem to be a ton of questions about moving away from Datastore to Cloud SQL, but I couldn't find this one.
Thanks
Here are a few options:
Write an App Engine mapreduce [1] program that pulls data in appropriate chunks from Cloud SQL and writes it to Datastore.
Spin up a VM on Google Compute Engine and write a program that fetches the data from Cloud SQL and writes it to Datastore using the Datastore external API [2] (sketched after the links below).
Use the Datastore restore [3]. I'm not familiar with the format, so I don't know how much work it is to produce something that the restore will accept.
[1] https://cloud.google.com/appengine/docs/python/dataprocessing/
[2] https://cloud.google.com/datastore/docs/apis/overview
[3] https://cloud.google.com/appengine/docs/adminconsole/datastoreadmin?csw=1#restoring_data
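As a rough sketch of option 2, assuming a MySQL Cloud SQL instance and made-up table/kind/column names, the Python clients keep this fairly short:

import pymysql                      # Cloud SQL (MySQL) driver
from google.cloud import datastore  # uses the GCE VM's service account for auth

sql = pymysql.connect(host="10.0.0.3", user="app", password="secret", database="mydb")
ds = datastore.Client()

with sql.cursor(pymysql.cursors.DictCursor) as cur:
    cur.execute("SELECT id, name, created_at FROM customers")   # hypothetical table
    batch = []
    for row in cur:
        entity = datastore.Entity(key=ds.key("Customer", row["id"]))
        entity.update({"name": row["name"], "created_at": row["created_at"]})
        batch.append(entity)
        if len(batch) == 500:        # Datastore accepts at most 500 entities per commit
            ds.put_multi(batch)
            batch = []
    if batch:
        ds.put_multi(batch)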
I wrote a couple of scripts that do this, running on Compute Engine.
The GCP Datastore API:
import googledatastore
Here is the code:
https://gist.github.com/nburn42/d8b488da1d2dc53df63f4c4a32b95def
And the Dataflow API:
from apache_beam.io.gcp.datastore.v1.datastoreio import WriteToDatastore
Here is the code:
https://gist.github.com/nburn42/2c2a06e383aa6b04f84ed31548f1cb09
I get a quota-exceeded error after it hits 100,000 entities, though, and I have to wait another day to do another batch.
Hopefully these are useful to someone with a smaller database than me.
(The quota problem is described here: Move data from Google Cloud-SQL to Cloud Datastore)

Google Cloud Storage API - Merge data into existing data

I have some datasets in Google Cloud Storage. I could find out how to append more data to a dataset. But if I want to merge into the dataset (insert, else update), how do I do it?
One option I have is to use Hive's INSERT OVERWRITE. Is there any better option?
Is there any option with the Google Cloud Storage API itself?
Maybe this could be helpful: https://cloud.google.com/storage/docs/json_api/v1/objects/compose
Objects: compose
Concatenates a list of existing objects into a new object in the same bucket.
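With the Python client library, composing looks roughly like this (the bucket and object names are made up):

from google.cloud import storage

bucket = storage.Client().bucket("my-dataset-bucket")   # hypothetical bucket

# Concatenate the existing object and the newly appended delta into one combined object.
combined = bucket.blob("dataset/combined.csv")
combined.compose([bucket.blob("dataset/base.csv"), bucket.blob("dataset/delta.csv")])

Note that compose only concatenates bytes; it does not deduplicate or update existing rows, which is why the next answer points to MapReduce for a true insert-else-update merge.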
GCS treats your objects (files) as blobs; there are no built-in GCS operations on the text inside your objects. There is an easier way to do what you are doing, though.
App Engine-hosted MapReduce provides built-in adapters to work with GCS. You can find the example code in this repo.

What is the best way to run Map/Reduce stuff on data from Mongo?

I have a large Mongo database (100 GB) hosted in the cloud (MongoLab or MongoHQ). I would like to run some Map/Reduce tasks on the data to compute some expensive statistics, and was wondering what the best workflow is for getting this done. Ideally I would like to use Amazon's Map/Reduce services to do this instead of maintaining my own Hadoop cluster.
Does it make sense to copy the data from the database to S3 and then run Amazon Map/Reduce on it? Or are there better ways to get this done?
Also, further down the line I might want to run the queries frequently, like every day, so the data on S3 would need to mirror what is in Mongo; would this complicate things?
Any suggestions/war stories would be super helpful.
Amazon provides a utility called S3DistCp to get data in and out of S3. This is commonly used when running Amazon's EMR product and you don't want to host your own cluster or use up instances to store data. S3 can store all your data for you, and EMR can read/write data from/to S3.
However, transferring 100GB will take time and if you plan on doing this more than once (i.e. more than a one-off batch job), it will be a significant bottleneck in your processing (especially if the data is expected to grow).
It looks like you may not need to use S3. Mongo has implemented an adapter to run map-reduce jobs on top of your MongoDB: http://blog.mongodb.org/post/24610529795/hadoop-streaming-support-for-mongodb
This looks appealing since it lets you implement the MR in Python/JS/Ruby.
I think this mongo-hadoop setup would be more efficient than copying 100GB of data out to S3.
UPDATE: An example of using map-reduce with mongo here.
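Loosely following the streaming examples from the mongo-hadoop post linked above (the pymongo_hadoop helpers and the 'country' field are assumptions for illustration, not taken from your data), a Python mapper/reducer pair looks something like this:

# mapper.py
from pymongo_hadoop import BSONMapper

def mapper(documents):
    for doc in documents:
        yield {"_id": doc["country"], "count": 1}   # emit one (key, value) pair per document

BSONMapper(mapper)

# reducer.py
from pymongo_hadoop import BSONReducer

def reducer(key, values):
    return {"_id": key, "count": sum(v["count"] for v in values)}

BSONReducer(reducer)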