Add metadata to S3 objects pushed by Kinesis Firehose - metadata

I have a Firehose stream pushing data to S3, with a small Lambda processing the format.
I'd like these objects in S3 to have some metadata when they are created.
I see that you can add metadata with the AWS CLI or through the console, but I can't find a way to automate it for the files created by Firehose.
The output format from the Lambda only includes recordId, data and result.
Am I missing something? Or is this something I cannot customise with Firehose?
Thanks :)

Firehose doesn't offer that option out of the box. But you can configure an S3 event notification that fires when Firehose publishes a new object to S3. The event notification can trigger a Lambda function that adds the metadata.
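A minimal sketch of such a Lambda in Python, assuming the function is subscribed to the bucket's object-created notifications. S3 cannot change metadata on an existing object, so the usual approach is to copy the object onto itself with a new metadata set. The `EXTRA_METADATA` values and the `handle_event` helper are made up for illustration; `copy_object` is the regular boto3 S3 call.

```python
import urllib.parse

# Hypothetical metadata to attach; replace with your own key/value pairs.
EXTRA_METADATA = {"source": "firehose"}


def add_metadata(s3_client, bucket, key, metadata):
    """Copy an object onto itself so that it carries the given user metadata."""
    s3_client.copy_object(
        Bucket=bucket,
        Key=key,
        CopySource={"Bucket": bucket, "Key": key},
        Metadata=metadata,
        MetadataDirective="REPLACE",  # without this S3 keeps the old metadata
    )


def handle_event(event, s3_client, metadata=EXTRA_METADATA):
    """Process an S3 event notification and return the keys that were updated."""
    updated = []
    for record in event.get("Records", []):
        # The in-place copy emits its own ObjectCreated:Copy event; skip it
        # so the function does not keep re-triggering itself.
        if record.get("eventName", "").endswith("Copy"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        add_metadata(s3_client, bucket, key, metadata)
        updated.append(key)
    return updated


def lambda_handler(event, context):
    import boto3  # bundled with the AWS Lambda Python runtime

    return handle_event(event, boto3.client("s3"))
```

Two caveats with this sketch: a single `copy_object` call only works for objects up to 5 GB, and as far as I know `REPLACE` also resets headers such as Content-Type unless you pass them again in the same call.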

Related

How to store PubSub data to big query using cloud functions?

I am once again asking for your help.
Let me tell you my current situation first.
I have a device that connects to the "Cloud IoT core" and sends data using mqtt.
The data then goes to the Pub/Sub topic.
Then a "Cloud function" gets triggered which stores the data inside "Firestore"
Another "Cloud function" gets triggered which sends me an email with the stored data inside Firestore.
The size of the data is about 1 Kilobyte and I expect to send about 10K messages per Month
I need that data to create a dashboard for which I am using "Google Data Studio"
To get my data in there I installed the Firebase extension "Stream Collections to BigQuery" to send the data to "BigQuery". From there I just had to click a few buttons to automatically stream data from BigQuery to "Google Data Studio".
Everything works so far, but as you can see I store the data 4 times: once via email, once inside Firestore, once inside BigQuery and once in Data Studio. All of this is going to cost a lot of money in the long term, because the data stored doubles every month.
What I need from you guys is some advice on best practices.
Is there a way to store the data directly inside BigQuery when it arrives in the Pub/Sub?
If so can I also send an email with the data as an attachment?
Is BigQuery a good solution or should I use "Cloud SQL"?
To save data inside Firestore I can execute the following inside a Cloud Function. Is there a similar way for BigQuery?
firestore.collection("put Collection name here").doc("put document name here").set({
  'name': name,
  'age': age
}).then((writeResult) => {
  // console.log('Successfully executed set');
  return;
}).catch((err) => {
  console.log(err);
  return;
});
"Is there a way to store the data directly inside BigQuery when it arrives in the Pub/Sub?"
Yes, you can use Dataflow to build a streaming pipeline, as explained in different documentation items or blogs:
GCP Doc: Pub/Sub Topic to BigQuery
A Dataflow Journey: from PubSub to BigQuery
Write a Pub/Sub Stream to BigQuery
But you could also use the Node.js Client for BigQuery in a Cloud Function, triggered by Pub/Sub. However, one could consider that this doesn't "store the data directly"...
"If so can I also send an email with the data as an attachment?"
If you use a Cloud Function, that's quite easy, for example by using the dedicated "Trigger Email" Firebase Extension.
You can also directly send an email from a Cloud Function by using the nodemailer package, see this official Cloud Function sample.
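For comparison, here is a rough Python sketch of the same idea using only the standard library (`email` and `smtplib`) instead of nodemailer. The SMTP host, credentials, subject line and the `data.json` file name are placeholders, not values from the question.

```python
import json
import smtplib
from email.message import EmailMessage


def build_email(sender, recipient, payload):
    """Build a message that carries the payload as a JSON attachment."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "New device data"  # placeholder subject
    msg.set_content("The latest message is attached as JSON.")
    msg.add_attachment(
        json.dumps(payload, indent=2).encode("utf-8"),
        maintype="application",
        subtype="json",
        filename="data.json",  # placeholder file name
    )
    return msg


def send_email(msg, host, port, user, password):
    """Send the message through an SMTP server that supports STARTTLS."""
    with smtplib.SMTP(host, port) as smtp:
        smtp.starttls()
        smtp.login(user, password)
        smtp.send_message(msg)
```

Inside a Pub/Sub-triggered function you would call `build_email` with the decoded message and then `send_email` with your provider's SMTP settings.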
"Is BigQuery a good solution or should I use "Cloud SQL"?"
It all depends on your exact use case... There is a lot of literature on the net: https://www.google.com/search?client=firefox-b-d&q=difference+between+Cloud+SQL+and+BigQuery
However, since you are going to use Data Studio, a classical answer would be to use BigQuery, since it is best suited for analytics. But again, it depends on your exact use case.
(Note that this question alone would probably be closed on SO because it is opinion based).
"To save data inside Firestore I can execute the following inside a cloud function. Is there a similar way for BigQuery?"
Yes, as said above, use the Node.js Client for BigQuery in your Cloud Function.
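As a rough illustration of that approach, here is a sketch written with the Python client (`google-cloud-bigquery`) rather than the Node.js one named above. It assumes the Pub/Sub message body is a JSON object whose keys match the table columns (`name` and `age`, as in the Firestore snippet); `TABLE_ID` is a made-up table name.

```python
import base64
import json

# Hypothetical table ID in "project.dataset.table" form.
TABLE_ID = "my-project.my_dataset.device_data"


def decode_pubsub_event(event):
    """Decode the base64 payload of a Pub/Sub-triggered function event."""
    return json.loads(base64.b64decode(event["data"]).decode("utf-8"))


def store_row(bq_client, table_id, row):
    """Stream one row into BigQuery and raise if the API reports errors."""
    errors = bq_client.insert_rows_json(table_id, [row])
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")


def pubsub_to_bigquery(event, context):
    """Entry point for a background Cloud Function with a Pub/Sub trigger."""
    from google.cloud import bigquery  # requires google-cloud-bigquery

    store_row(bigquery.Client(), TABLE_ID, decode_pubsub_event(event))
```

`insert_rows_json` returns a list of per-row errors instead of raising, so the explicit check is what makes a failed insert visible in the function logs.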

Cannot create a batch pipeline to get data from ZohoCRM with http plugin 1.2.1 to BigQuery. Returns Spark Program 'phase-1' failed

This is my first post here. I'm new to Data Fusion and have little to no coding skills.
I want to get data from ZohoCRM into BigQuery, with each ZohoCRM module (e.g. accounts, contacts...) becoming a separate table in BigQuery.
To connect to Zoho CRM I obtained a code, token, refresh token and everything needed as described here https://www.zoho.com/crm/developer/docs/api/v2/get-records.html. Then I ran a successful get records request via Postman as described there, and it returned the records from the Zoho CRM Accounts module as a JSON file.
I thought it would all be fine and set the parameters in Data Fusion (screenshots: DataFusion_settings_1 and DataFusion_settings_2). It validated fine. Then I previewed and ran the pipeline without deploying it. It failed with the following info from the logs (screenshot: logs_screenshot). I tried manually entering a few fields in the schema with the format set to JSON. I tried changing the format to CSV; neither worked. I tried switching Verify HTTPS Trust Certificates on and off. It did not help.
I'd be really thankful for some help. Thanks.
Update, 2020-12-03
I got in touch with a Google Cloud Account Manager, who then took my question to their engineers, and here is the info:
The HTTP plugin can be used to "fetch Atom or RSS feeds regularly, or to fetch the status of an external system"; it does not seem to be designed for APIs.
At the moment a more suitable tool for data collected via APIs is Dataflow https://cloud.google.com/dataflow
"Google Cloud Dataflow is used as the primary ETL mechanism, extracting the data from the API Endpoints specified by the customer, which is then transformed into the required format and pushed into BigQuery, Cloud Storage and Pub/Sub."
https://www.onixnet.com/insights/gcp-101-an-introduction-to-google-cloud-platform
So in the next weeks I'll be looking at Dataflow.
Can you please attach the complete logs of the preview run? Make sure to redact any PII data. Also what is the version of CDF you are using? Is CDF instance private or public?
Thanks and Regards,
Sagar
Did you end up using Dataflow?
I am also experiencing the same issue with the HTTP plugin, but my temporary workaround was to use Cloud Scheduler to periodically trigger a Cloud Function that fetches my data from the API and exports it as JSON to GCS, where it can then be accessed by Data Fusion.
My solution is of course non-ideal, so I am still looking for a way to use the Data Fusion HTTP plugin. I was able to make it work to get sample data from public API end-points, but for a reason still unknown to me I can't get it to work for my actual API.
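For anyone wanting to try the same workaround, a rough Python sketch of such a function is below. It assumes the API returns a JSON list of records and that newline-delimited JSON is an acceptable file format downstream; the URL, header, bucket and object names are placeholders.

```python
import json
import urllib.request


def to_ndjson(records):
    """Serialise a list of dicts as newline-delimited JSON."""
    return "\n".join(json.dumps(r) for r in records) + "\n"


def fetch_json(url, headers):
    """GET a JSON document from the API (assumed here to be a list of records)."""
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def export_to_gcs(storage_client, bucket_name, blob_name, records):
    """Write the records to a Cloud Storage object and return its gs:// URI."""
    blob = storage_client.bucket(bucket_name).blob(blob_name)
    blob.upload_from_string(to_ndjson(records), content_type="application/json")
    return f"gs://{bucket_name}/{blob_name}"


def main(request):
    """HTTP entry point, meant to be called by a Cloud Scheduler job."""
    from google.cloud import storage  # requires google-cloud-storage

    # Placeholder URL, token and bucket; substitute your own values.
    records = fetch_json("https://api.example.com/records", {"Authorization": "Bearer <token>"})
    return export_to_gcs(storage.Client(), "my-export-bucket", "exports/records.json", records)
```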

How to create a table in Google BigQuery from Kafka text files continuously uploaded to Google Cloud Storage

I want to create a BigQuery table from Cloud Storage. A Kafka stream is uploaded as text files to Cloud Storage every 5 minutes. I want a BigQuery table that is updated every 5 minutes from the newly uploaded files. What is the best way to do this? Please give me some suggestions.
You could use Google Cloud Functions to detect when a file is uploaded, then execute some code to index that file.
Alternatively, I believe there already exists a BigQuery Kafka Connector, so you could skip GCS unless you need the raw data. (Note: binary files would be cheaper to store than plaintext, and BigQuery supports reading various formats)
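The first suggestion (a function that reacts to each upload) could look roughly like the Python sketch below. It assumes the Kafka text files contain newline-delimited JSON and that new rows should simply be appended; `TABLE_ID` is a made-up table name, and the source format would need changing for CSV or other layouts.

```python
# Hypothetical destination table in "project.dataset.table" form.
TABLE_ID = "my-project.my_dataset.kafka_events"


def gcs_uri(event):
    """Build the gs:// URI of the object described by a storage event."""
    return f"gs://{event['bucket']}/{event['name']}"


def load_file(bq_client, job_config, uri, table_id):
    """Start a BigQuery load job for one file and wait for it to finish."""
    job = bq_client.load_table_from_uri(uri, table_id, job_config=job_config)
    job.result()  # blocks until done and raises if the load failed
    return job


def on_file_uploaded(event, context):
    """Entry point for a Cloud Function triggered by a finalized GCS object."""
    from google.cloud import bigquery  # requires google-cloud-bigquery

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,  # assumed file format
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
        autodetect=True,
    )
    load_file(bigquery.Client(), job_config, gcs_uri(event), TABLE_ID)
```

Load jobs are batch operations, so with a new file every 5 minutes the table lags the Kafka stream by roughly that interval plus the load time.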

Run a Google Cloud Function for each file in a bucket

I have a Google Cloud Function triggered by a Google Cloud Storage object.finalize event. When I deploy a new version of this function, I would like to run it for every existing file in the bucket (which have already been processed by the previous version of the function). Processing all the existing files in the bucket is a long running task, hence I don't think a Google Cloud Function which will process all files in a row is an option.
The best option I can see for now is to make a Google Cloud Function that I can trigger via HTTP, which will list all the files in the bucket and publish one event per file via Google PubSub, and then process each of these events with a slightly modified version of my initial Google Cloud Function, which accepts a PubSub event in place of the object.finalize storage event.
I think it can work but I was wondering if there was an easier way to perform this operation.
If the operation you're trying to perform may take longer than the maximum time that a Cloud Function can run, you will need to split that operation into multiple steps. Your approach of using a PubSub trigger for each individual file sounds like a valid way to do that to me.
One option might be to write a small program that lists all of the objects in a bucket and, for each object, posts a message to Cloud Pub/Sub that triggers your function in the same way a GCS change would.
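A small sketch of such a program in Python, assuming the `google-cloud-storage` and `google-cloud-pubsub` client libraries. The message layout (a JSON object with `bucket` and `name`) is my own choice to mirror the fields a storage-triggered function usually reads; adjust it to whatever your function expects.

```python
import json


def publish_all(storage_client, publisher, topic_path, bucket_name):
    """Publish one Pub/Sub message per object in the bucket; return message IDs."""
    futures = []
    for blob in storage_client.list_blobs(bucket_name):
        # Assumed payload shape: the same bucket/name fields as a storage event.
        payload = json.dumps({"bucket": bucket_name, "name": blob.name}).encode("utf-8")
        futures.append(publisher.publish(topic_path, data=payload))
    # Wait until Pub/Sub has accepted every message.
    return [future.result() for future in futures]


def main(project_id, topic_id, bucket_name):
    """Run once after deploying a new version of the function."""
    from google.cloud import pubsub_v1, storage

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project_id, topic_id)
    return len(publish_all(storage.Client(), publisher, topic_path, bucket_name))
```

Because each message triggers its own function invocation, the work fans out across instances and no single invocation has to stay within the timeout for the whole bucket.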

Google Cloud Dataflow: while in PubSub streaming mode, TextIO.Read uses massive amounts of vCPU time

I'm using Google Cloud Platform to transfer data from an Azure server to a BigQuery table (working nice and smoothly, functionally speaking).
The pipeline looks like this:
[Image: Dataflow streaming pipeline]
The 'FetchMetadata' part of the pipeline is a simple TextIO.Read implementation where I read a 66-line .csv file with metadata from a GCP Storage bucket:
PCollection<String> metaLine = p.apply(TextIO.Read.named("FetchMetadata")
.from("gs://my-bucket"));
When I use my pipeline in Batch mode this works like a charm: first the metadata file is loaded in the pipeline in less than a second of vCPU time and then the data itself is loaded in the pipeline. Now when running in Streaming mode I would love to replicate that behaviour to some extent but when I just use the same code there is a problem: when running the pipeline for just 15 minutes (actual time) the TextIO.Read block uses a whopping 4 hours of vCPU time. For a pipeline that will be permanently running for a low budget project this is unacceptable.
So my question: is it possible to change the code so that the file is re-read periodically (if the file changes I want the pipeline to be updated, so let's say hourly updates) and not continuously, as it is doing right now?
I've found some documentation that mentions TextIO.Read.Bound, which looks like a good place to start solving this issue, but it's no solution for my periodic update problem (as far as I know).
I was stuck in a similar situation. The way I solved this problem is a bit different. I would like the community's insights into this solution.
I had files being updated every hour in a GCS bucket. I followed the blog post about Scheduling Dataflow Jobs from App Engine or Google Cloud Function.
I had the App Engine endpoint configured to receive the object change notifications from the GCS bucket that contained the files to be processed. For every file that was created (an update is also a create operation in an object store), the App Engine application would submit a job to Google Dataflow. The job would read the lines from the file (file name in the HTTP request body) and publish them to a Google PubSub topic.
A streaming pipeline subscribed to the Google PubSub topic would then process the messages and output the relevant rows to BigQuery. This way, the streaming pipeline ran at the minimum worker count when idle, the ingest of the files happened through a batch pipeline, and the streaming pipeline scaled with the volume of messages published to the Google PubSub topic.
In the tutorial for submitting jobs to Google Dataflow, the jar is executed on the underlying terminal. I modified the code to submit a job to Google Dataflow using templates, which can be executed with parameters. This way, the job submission operation becomes very lightweight while still creating a job for every new file uploaded to the GCS bucket. Please refer to this link for details about executing Google Dataflow job templates.
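Until the actual snippets are added, here is a rough Python sketch of what launching a template per uploaded file can look like. This is not the original code: it assumes the Google API Python client (`googleapiclient`) and the Dataflow `projects.templates.launch` REST method, and the template parameter names `inputFile` and `outputTopic`, the project ID and the paths are all placeholders that depend on how your template was built.

```python
import re
import time


def job_name_for(object_name, now=None):
    """Derive a lowercase, hyphenated job name from the uploaded object's name."""
    stamp = int(now if now is not None else time.time())
    slug = re.sub(r"[^a-z0-9]+", "-", object_name.lower()).strip("-")
    return f"ingest-{slug}-{stamp}"


def launch_body(bucket, object_name, topic, now=None):
    """Build the request body for a template launch."""
    return {
        "jobName": job_name_for(object_name, now),
        # Hypothetical parameter names; they must match your own template.
        "parameters": {
            "inputFile": f"gs://{bucket}/{object_name}",
            "outputTopic": topic,
        },
    }


def launch_template(dataflow, project_id, template_path, body):
    """Submit the launch request through a Dataflow API client."""
    request = dataflow.projects().templates().launch(
        projectId=project_id, gcsPath=template_path, body=body
    )
    return request.execute()


def on_file_uploaded(event, context):
    """Entry point reacting to a new object in the GCS bucket."""
    from googleapiclient.discovery import build

    dataflow = build("dataflow", "v1b3")
    body = launch_body(event["bucket"], event["name"], "projects/my-project/topics/lines")
    return launch_template(dataflow, "my-project", "gs://my-bucket/templates/file-to-pubsub", body)
```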
Note: Please mention in the comments if the answer needs to be modified for the code snippets of the dataflow job template and app engine application and I will update the answer accordingly.