How to store Pub/Sub data to BigQuery using Cloud Functions? - google-cloud-firestore

I am once again asking for your help.
Let me tell you my current situation first.
I have a device that connects to "Cloud IoT Core" and sends data using MQTT.
The data then goes to a Pub/Sub topic.
Then a "Cloud Function" gets triggered which stores the data inside "Firestore".
Another "Cloud Function" gets triggered which sends me an email with the data stored inside Firestore.
The size of the data is about 1 kilobyte and I expect to send about 10K messages per month.
I need that data to create a dashboard, for which I am using "Google Data Studio".
To get my data in there I installed the Firebase extension "Stream Collections to BigQuery" to send the data to "BigQuery". From there I just had to click a few buttons to automatically stream the data from BigQuery to "Google Data Studio".
Everything works so far, but as you can see I store the data four times: once via email, once inside Firestore, once inside BigQuery, and once in Data Studio. All of this is going to cost a lot of money in the long term, because the data stored doubles every month.
What I need from you guys is some advice on best practices.
Is there a way to store the data directly inside BigQuery when it arrives in the Pub/Sub?
If so can I also send an email with the data as an attachment?
Is BigQuery a good solution or should I use "Cloud SQL"?
To save data inside Firestore I can execute the following inside a cloud function. Is there a similar way for BigQuery?
firestore.collection("put collection name here").doc("put document name here").set({
  'name': name,
  'age': age
}).then((writeResult) => {
  // console.log('Successfully executed set');
  return;
}).catch((err) => {
  console.log(err);
  return;
});

Is there a way to store the data directly inside BigQuery when it arrives in Pub/Sub?
Yes, you can use Dataflow to build a streaming pipeline, as explained in various documentation pages and blog posts:
GCP Doc: Pub/Sub Topic to BigQuery
A Dataflow Journey: from PubSub to BigQuery
Write a Pub/Sub Stream to BigQuery
But you could also use the Node.js Client for BigQuery in a Cloud Function, triggered by Pub/Sub. However, one could consider that this doesn't "store the data directly"...
If so can I also send an email with the data as an attachment?
If you use a Cloud Function, that's quite easy, for example by using the dedicated "Trigger Email" Firebase Extension.
You can also directly send an email from a Cloud Function by using the nodemailer package, see this official Cloud Function sample.
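As a rough sketch of the nodemailer route (this is not the official sample; the topic name, SMTP host, credentials and addresses below are placeholders), a Pub/Sub-triggered Cloud Function could mail the incoming data as an attachment like this:

// Hedged sketch: mail each Pub/Sub message as a JSON attachment with nodemailer.
// Topic name, SMTP settings and addresses are placeholders.
const functions = require('firebase-functions');
const nodemailer = require('nodemailer');

const transporter = nodemailer.createTransport({
  host: 'smtp.example.com',
  port: 465,
  secure: true,
  auth: { user: 'me@example.com', pass: process.env.SMTP_PASSWORD }
});

exports.mailDeviceData = functions.pubsub.topic('device-data').onPublish(async (message) => {
  const payload = Buffer.from(message.data, 'base64').toString();
  await transporter.sendMail({
    from: 'me@example.com',
    to: 'me@example.com',
    subject: 'New device reading',
    text: 'See the attached reading.',
    attachments: [{ filename: 'reading.json', content: payload }]
  });
});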
Is BigQuery a good solution or should I use "Cloud SQL"?
It all depends on your exact use case... There is a lot of literature on the net: https://www.google.com/search?client=firefox-b-d&q=difference+between+Cloud+SQL+and+BigQuery
However, since you are going to use Data Studio, a classical answer would be to use BigQuery, since it is best suited for analytics. But again, it depends on your exact use case.
(Note that this question alone would probably be closed on SO because it is opinion-based.)
To save data inside Firestore I can execute the following inside a cloud function. Is there a similar way for BigQuery?
Yes, as said above, use the Node.js Client for BigQuery in your Cloud Function.
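For illustration, a hedged sketch of such a function (the topic, dataset, table and field names are assumptions and must match your own BigQuery schema):

// Hedged sketch: stream each Pub/Sub message into a BigQuery table.
// 'iot_data', 'readings' and the row fields are assumed names.
const functions = require('firebase-functions');
const { BigQuery } = require('@google-cloud/bigquery');

const bigquery = new BigQuery();

exports.pubsubToBigQuery = functions.pubsub.topic('device-data').onPublish(async (message) => {
  const data = JSON.parse(Buffer.from(message.data, 'base64').toString());
  await bigquery
    .dataset('iot_data')
    .table('readings')
    .insert([{ name: data.name, age: data.age, receivedAt: new Date().toISOString() }]);
});

With something like this in place, the same function (or a second one subscribed to the same topic) can still write to Firestore or send the email, so nothing else in the setup has to change.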

Related

flutter/firebase schedule a function at given time

I want to schedule a cloud function at a specific time, and that time is stored in a Firestore document. When I add data to Firestore, a cloud function should trigger, read the date and time from that newly added document, and then schedule another cloud function at that specific time to perform a specific task (update a status in Firestore).
For the time scheduling check this official documentation out:
https://firebase.google.com/docs/functions/schedule-functions
If you want to execute code on database changes check the triggers out:
https://firebase.google.com/docs/functions/database-events
You will need something like onUpdate(). Depends on your needs.
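As one hedged sketch of combining the two ideas (the 'tasks' collection and the 'runAt'/'status' field names are assumptions), a scheduled function could periodically process the Firestore documents whose time has come:

// Hedged sketch: every 5 minutes, find pending documents whose scheduled time
// has passed and update their status. Collection and field names are assumptions.
const functions = require('firebase-functions');
const admin = require('firebase-admin');
admin.initializeApp();

exports.processDueTasks = functions.pubsub.schedule('every 5 minutes').onRun(async () => {
  const now = admin.firestore.Timestamp.now();
  const due = await admin.firestore()
    .collection('tasks')
    .where('status', '==', 'pending')
    .where('runAt', '<=', now)
    .get();
  await Promise.all(due.docs.map((doc) => doc.ref.update({ status: 'done' })));
});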
You can store files in Firebase Storage. Check out the official documentation for a code example:
https://firebase.google.com/docs/storage/web/list-files
In case you want to read data from Firebase in your Flutter app, you can use the Flutter Firebase packages.
Here you can find the instructions for all firebase packages:
https://firebase.flutter.dev/
Stack Overflow is for asking questions. Your question sounds more like you're expecting someone to give you the whole code so you don't have to do any research yourself.
If that's not the case, sorry for the misunderstanding.
Cloud Functions trigger for Firestore writes. There is nothing built in to trigger them at a time that is specified in the document that is written.
But you can build that yourself using Cloud Tasks, as Doug shows in this excellent blog post: How to schedule a Cloud Function to run in the future with Cloud Tasks (to build a Firestore document TTL)
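A condensed, hedged sketch of that pattern (the project, region, queue name, callback URL and the 'performAt' field are all assumptions based on the general approach, not on your code):

// Hedged sketch: on document creation, enqueue a Cloud Tasks HTTP task that
// fires at the time stored in the document. Names and URL are placeholders.
const functions = require('firebase-functions');
const { CloudTasksClient } = require('@google-cloud/tasks');

const tasksClient = new CloudTasksClient();

exports.scheduleTask = functions.firestore.document('tasks/{taskId}').onCreate(async (snapshot) => {
  const { performAt } = snapshot.data(); // assumed Firestore Timestamp field
  const parent = tasksClient.queuePath('my-project', 'us-central1', 'firestore-ttl');
  await tasksClient.createTask({
    parent,
    task: {
      scheduleTime: { seconds: performAt.seconds },
      httpRequest: {
        httpMethod: 'POST',
        url: 'https://us-central1-my-project.cloudfunctions.net/performTask',
        headers: { 'Content-Type': 'application/json' },
        body: Buffer.from(JSON.stringify({ docPath: snapshot.ref.path })).toString('base64')
      }
    }
  });
});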
I used the node-schedule package inside Firebase Cloud Functions, after which I was able to schedule tasks at dynamic times taken from Firestore.
Here is sample code:
const functions = require('firebase-functions');
const schedule = require('node-schedule');

exports.scheduleMessage = functions.firestore.document('users/{userID}/pending/{messageID}').onCreate(async (snapshot, context) => {
  const messageId = context.params.messageID;
  const myDate = snapshot.data().scheduledAt.toDate(); // assumes a 'scheduledAt' Timestamp field on the document
  // node-schedule cron format: minute hour dayOfMonth month dayOfWeek
  schedule.scheduleJob(messageId, `${myDate.getUTCMinutes()} ${myDate.getUTCHours()} * * *`, async () => {
    // your logic inside
  });
});

Google Cloud Spanner real time Change Data Capture to PubSub/Kafka through Cloud Data Fusion or Others

I would like to achieve a real-time change data capture (log-based preferred) pipeline from Google Cloud Spanner to Pub/Sub/Kafka for my downstream real-time applications. Could you please let me know if there is a good and cost-effective way to achieve that? I would appreciate any advice and recommendations.
In addition, for Cloud Data Fusion from Google, I noticed that it can replicate in real time from MySQL/PostgreSQL to Cloud Spanner, but I did not find a way to go from Cloud Spanner to Pub/Sub/Kafka in real time.
I also found two other ways, which I list here for any comments or suggestions.
Use Debezium, a log-based change data capture Kafka connector from the link https://cloud.google.com/architecture/capturing-change-logs-with-debezium#deploying_debezium_on_gke_on_google_cloud
Create a polling service (which may miss some data) to poll data from cloud spanner from the link: https://cloud.google.com/architecture/deploying-event-sourced-systems-with-cloud-spanner
If you have any suggestion or comment on this, I will be really grateful.
There's an open source implementation of a polling service for Cloud Spanner that can also automatically push changes to Pub/Sub here: https://github.com/cloudspannerecosystem/spanner-change-watcher
It is however not log-based. It has some inherent limitations:
It can miss updates if the same record is updated twice within the polling interval. In that case, only the last value will be reported.
It only supports soft deletes.
You could have a look at the samples to see if it is something that might suit your needs at least to some degree: https://github.com/cloudspannerecosystem/spanner-change-watcher/tree/master/samples
Cloud Spanner has a new feature called Change Streams that would allow building a downstream pipeline from Spanner to PubSub/Kafka.
At this time, there's not a pre-packaged Spanner to PubSub/Kafka connector.
The way to read change streams currently is to use the SpannerIO Apache Beam connector that would allow building the pipeline with Dataflow, or also directly querying the API.
Disclaimer: I'm a Developer Advocate that works with the Cloud Spanner team.

Cannot create a batch pipeline to get data from ZohoCRM with HTTP plugin 1.2.1 to BigQuery. Returns Spark Program 'phase-1' failed

This is my first post here; I'm new to Data Fusion and have low to no coding skills.
I want to get data from ZohoCRM to BigQuery, with each module from ZohoCRM (e.g. accounts, contacts...) becoming a separate table in BigQuery.
To connect to Zoho CRM I obtained a code, token, refresh token and everything needed as described here https://www.zoho.com/crm/developer/docs/api/v2/get-records.html. Then I ran a successful get records request as described here via Postman and it returned the records from Zoho CRM Accounts module as JSON file.
I thought it would all be fine and set the parameters in Data Fusion (DataFusion_settings_1 and DataFusion_settings_2); it validated fine. Then I previewed and ran the pipeline without deploying it. It failed with the following info from the logs (logs_screenshot). I tried to manually enter a few fields in the schema when the format was JSON. I tried changing the format to CSV; neither worked. I tried switching "Verify HTTPS Trust Certificates" on and off. It did not help.
I'd be really thankful for some help. Thanks.
Update, 2020-12-03
I got in touch with a Google Cloud account manager, who took my question to their engineers, and here is the info:
The HTTP plugin can be used to "fetch Atom or RSS feeds regularly, or to fetch the status of an external system"; it does not seem to be designed for APIs.
At the moment a more suitable tool for data collected via APIs is Dataflow https://cloud.google.com/dataflow
"Google Cloud Dataflow is used as the primary ETL mechanism, extracting the data from the API Endpoints specified by the customer, which is then transformed into the required format and pushed into BigQuery, Cloud Storage and Pub/Sub."
https://www.onixnet.com/insights/gcp-101-an-introduction-to-google-cloud-platform
So in the next few weeks I'll be looking at Dataflow.
Can you please attach the complete logs of the preview run? Make sure to redact any PII data. Also, what version of CDF are you using? Is the CDF instance private or public?
Thanks and Regards,
Sagar
Did you end up using Dataflow?
I am also experiencing the same issue with the HTTP plugin, but my temporary workaround was to use Cloud Scheduler to periodically trigger a Cloud Function that fetches my data from the API and exports it as JSON to GCS, which can then be accessed by Data Fusion.
My solution is of course non-ideal, so I am still looking for a way to use the Data Fusion HTTP plugin. I was able to make it work to get sample data from public API endpoints, but for a reason still unknown to me I can't get it to work for my actual API.
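In case it helps, a rough sketch of that workaround (the API URL, auth token, bucket and object names are placeholders, and node-fetch v2 is assumed as the HTTP client):

// Hedged sketch: HTTP Cloud Function, triggered by Cloud Scheduler, that fetches
// records from an API and writes them as a JSON file to a GCS bucket.
const { Storage } = require('@google-cloud/storage');
const fetch = require('node-fetch'); // v2-style require

const storage = new Storage();

exports.exportApiToGcs = async (req, res) => {
  const response = await fetch('https://api.example.com/records', {
    headers: { Authorization: `Bearer ${process.env.API_TOKEN}` }
  });
  const records = await response.json();

  const file = storage.bucket('my-staging-bucket').file(`exports/records-${Date.now()}.json`);
  await file.save(JSON.stringify(records), { contentType: 'application/json' });
  res.status(200).send('Export complete');
};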

Run a Google Cloud Function for each file in a bucket

I have a Google Cloud Function triggered by a Google Cloud Storage object.finalize event. When I deploy a new version of this function, I would like to run it for every existing file in the bucket (which have already been processed by the previous version of the function). Processing all the existing files in the bucket is a long-running task, hence I don't think a single Google Cloud Function run that processes all files in a row is an option.
The best option I can see for now is to make a Google Cloud Function I can trigger via HTTP that will list all the files in the bucket and publish one event per file via Google Pub/Sub, and then process each of these events with a slightly modified version of my initial Google Cloud Function which accepts a Pub/Sub event in place of the object.finalize storage event.
I think it can work but I was wondering if there was an easier way to perform this operation.
If the operation you're trying to perform may take longer than the maximum time a Cloud Function can run, you will need to split that operation into multiple steps. Your approach of using a Pub/Sub trigger for each individual file sounds like a valid way to do that.
One option might be to write a small program that lists all of the objects in a bucket and, for each object, posts a message to Cloud Pub/Sub that triggers your function in the same way a GCS change would.
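A hedged sketch of such a small program, written as an HTTP-triggered Cloud Function (the bucket and topic names are placeholders):

// Hedged sketch: list every object in a bucket and publish one Pub/Sub message
// per object, so a Pub/Sub-triggered function can reprocess each file.
const { Storage } = require('@google-cloud/storage');
const { PubSub } = require('@google-cloud/pubsub');

const storage = new Storage();
const pubsub = new PubSub();

exports.reprocessBucket = async (req, res) => {
  const [files] = await storage.bucket('my-bucket').getFiles();
  const topic = pubsub.topic('reprocess-file');
  for (const file of files) {
    await topic.publishMessage({ json: { bucket: 'my-bucket', name: file.name } });
  }
  res.status(200).send(`Published ${files.length} messages`);
};

The Pub/Sub-triggered variant of the original function then only needs to read the bucket and object name from the message instead of from the storage event.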

How to automatically create a new file based on an existing file within Google Cloud Storage

It's the first time I used Google Cloud, so I might ask the question in the wrong place.
 
Information provider upload a new file to Google Cloud Storage every day.
The file contains the information of all my clients/departments.
I have to sort through the information and create new file(s) containing the relevant information for each department in my company, so that everyone gets only the information relevant to them (security).
I can't figure out what are the steps I need to follow, to complete the task.
Can you help me?
You want to have a process that starts automatically and subsequently generates a new file once you upload something to Google Cloud Storage.
The easiest way to handle this is using Object Change Notifications. You can set up Object Change Notifications per bucket, and this will send a POST request to a URL that you can define.
You can then easily set up a server (or run it on App Engine) that will execute an action based on the POST request that it receives.
There is an even simpler option (although still in alpha) named Cloud Functions. Cloud Functions is a serverless service that provides event-based microservices (e.g. 'do this' when a new file is uploaded to GCS). This means you only have to write the code that defines what needs to happen when a new file is uploaded, and Cloud Functions will take care of executing the code when you upload a file to GCS. See this tutorial on using Cloud Functions with Google Cloud Storage.
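For illustration, a hedged sketch of the Cloud Functions route, assuming the uploaded file is newline-delimited JSON with a 'department' field and that the output goes to a separate bucket (all of these are assumptions):

// Hedged sketch: GCS-triggered function that splits an uploaded NDJSON file
// into one file per department. Field and bucket names are assumptions.
const { Storage } = require('@google-cloud/storage');

const storage = new Storage();

exports.splitByDepartment = async (event) => {
  const [contents] = await storage.bucket(event.bucket).file(event.name).download();
  const records = contents.toString().trim().split('\n').map((line) => JSON.parse(line));

  // Group records by the assumed 'department' field.
  const byDepartment = {};
  for (const record of records) {
    (byDepartment[record.department] = byDepartment[record.department] || []).push(record);
  }

  // Write one output file per department, e.g. department-files/sales/<original name>.
  await Promise.all(Object.entries(byDepartment).map(([dept, rows]) =>
    storage.bucket('department-files').file(`${dept}/${event.name}`)
      .save(rows.map((r) => JSON.stringify(r)).join('\n'))
  ));
};

Access can then be restricted per department, for example by writing to one bucket per department and granting IAM access accordingly.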