Google Cloud Dataflow: while in PubSub streaming mode, TextIO.Read uses massive amounts of vCPU time - streaming

I'm using Google Cloud Platform to transfer data from an Azure server to a BigQuery table (functionally speaking, this works nicely and smoothly).
The pipeline looks like this:
[Image: Dataflow streaming pipeline]
The 'FetchMetadata' part of the pipeline is a simple TextIO.Read implementation where I read a 66-line .csv file with metadata from a GCP Storage bucket:
PCollection<String> metaLine = p.apply(TextIO.Read.named("FetchMetadata")
    .from("gs://my-bucket"));
When I run my pipeline in batch mode this works like a charm: first the metadata file is loaded into the pipeline in less than a second of vCPU time, and then the data itself is loaded. When running in streaming mode I would love to replicate that behaviour to some extent, but simply reusing the same code causes a problem: after running the pipeline for just 15 minutes of wall-clock time, the TextIO.Read step has used a whopping 4 hours of vCPU time. For a pipeline that will run permanently for a low-budget project, this is unacceptable.
So my question: is it possible to change the code so that the file is re-read periodically (if the file changes I want the pipeline to pick that up, so let's say hourly updates) rather than continuously, as it is doing right now?
I've found some documentation that mentions TextIO.Read.Bound, which looks like a good place to start solving this issue, but as far as I know it does not solve my periodic update problem.
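To make the "hourly instead of continuously" behaviour asked for here concrete, below is a minimal sketch in plain Java that is independent of any Dataflow API. It caches a loaded value and reloads it only once a fixed interval has elapsed. The class name RefreshingCache, the loader, and the injected clock are all hypothetical names for illustration; how such a cache would be wired into a pipeline step is not shown and would depend on the SDK version in use.

```java
import java.time.Duration;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Caches a loaded value and reloads it only after the refresh interval has elapsed. */
final class RefreshingCache<T> {
    private final Supplier<T> loader;   // hypothetical: e.g. code that reads the metadata CSV
    private final long intervalMillis;  // how long a loaded value stays valid
    private final LongSupplier clock;   // injected so the timing can be tested
    private T value;
    private long loadedAtMillis;
    private boolean loaded;

    RefreshingCache(Supplier<T> loader, Duration interval, LongSupplier clock) {
        this.loader = loader;
        this.intervalMillis = interval.toMillis();
        this.clock = clock;
    }

    /** Returns the cached value, calling the loader only on first use or after expiry. */
    synchronized T get() {
        long now = clock.getAsLong();
        if (!loaded || now - loadedAtMillis >= intervalMillis) {
            value = loader.get();
            loadedAtMillis = now;
            loaded = true;
        }
        return value;
    }
}
```

With Duration.ofHours(1) as the interval and System::currentTimeMillis as the clock, the loader would run at most once per hour no matter how often get() is called.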

I was stuck in a similar situation. The way I solved this problem is a bit different. I would like the community's insights into this solution.
I had files being updated every hour in a GCS bucket. I followed the blog post about scheduling Dataflow jobs from App Engine or Google Cloud Functions.
I configured the App Engine endpoint to receive object change notifications from the GCS bucket that contained the files to be processed. For every file that was created (an update is also a create operation in an object store), the App Engine application would submit a job to Google Dataflow. The job would read the lines from the file (the file name is in the HTTP request body) and publish them to a Google PubSub topic.
A streaming pipeline subscribed to that Google PubSub topic would then process the messages and output the relevant rows to BigQuery. This way, the streaming pipeline ran at the minimum worker count when idle, the files were ingested through a batch pipeline, and the streaming pipeline scaled with the volume of messages published to the Google PubSub topic.
In the tutorial for submitting jobs to Google Dataflow, the jar is executed on the underlying terminal. I modified the code to submit the job using Dataflow templates, which can be executed with parameters. This way, job submission becomes very lightweight while still creating a job for every new file uploaded to the GCS bucket. Please refer to the documentation on executing Google Dataflow job templates for details.
Note: Please mention in the comments if the answer needs to be modified for the code snippets of the dataflow job template and app engine application and I will update the answer accordingly.

Related

Google Cloud Spanner real time Change Data Capture to PubSub/Kafka through Cloud Data Fusion or Others

I would like to achieve a real time change data capture (log-based preferred) pipeline from Google Cloud Spanner to PubSub/Kafka for my downstream real time applications. Could you please let me know if there is a great and cost-effective way to achieve that? I will appreciate any advice and recommendations.
In addition, I noticed that Cloud Data Fusion from Google can replicate in real time from MySQL/PostgreSQL to Cloud Spanner, but I did not find a way to go from Cloud Spanner to PubSub/Kafka in real time.
I also found two other approaches, which I list here for any comments or suggestions.
Use Debezium, a log-based change data capture Kafka connector: https://cloud.google.com/architecture/capturing-change-logs-with-debezium#deploying_debezium_on_gke_on_google_cloud
Create a polling service (which may miss some data) to poll data from Cloud Spanner: https://cloud.google.com/architecture/deploying-event-sourced-systems-with-cloud-spanner
If you have any suggestion or comment on this, I will be really grateful.
There's an open source implementation of a polling service for Cloud Spanner that can also automatically push changes to PubSub here: https://github.com/cloudspannerecosystem/spanner-change-watcher
It is however not log-based. It has some inherent limitations:
It can miss updates if the same record is updated twice within the polling interval. In that case, only the last value will be reported.
It only supports soft deletes.
You could have a look at the samples to see if it is something that might suit your needs at least to some degree: https://github.com/cloudspannerecosystem/spanner-change-watcher/tree/master/samples
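Both limitations listed above follow from polling itself, and a toy model may make that clearer. The sketch below is not the spanner-change-watcher API; it assumes, for illustration only, that each row carries a last-commit timestamp and that the poller asks for rows newer than the highest timestamp it has already seen. TimestampPoller and Row are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Toy poller: reports rows whose commit timestamp is newer than the last poll. */
final class TimestampPoller {
    /** A row as the poller sees it: only the current value and its commit timestamp. */
    static final class Row {
        final String value;
        final long commitTs;

        Row(String value, long commitTs) {
            this.value = value;
            this.commitTs = commitTs;
        }
    }

    private long lastSeenTs = Long.MIN_VALUE;

    /** Returns "key=value" for every row changed since the previous poll. */
    List<String> poll(Map<String, Row> table) {
        List<String> changes = new ArrayList<>();
        long maxTs = lastSeenTs;
        for (Map.Entry<String, Row> entry : table.entrySet()) {
            Row row = entry.getValue();
            if (row.commitTs > lastSeenTs) {
                changes.add(entry.getKey() + "=" + row.value);
                maxTs = Math.max(maxTs, row.commitTs);
            }
        }
        lastSeenTs = maxTs;
        return changes;
    }
}
```

If a row is written twice between two polls, the table only holds the second value, so the first is never reported. If a row is physically deleted, there is nothing left for the poller to find, which is why only soft deletes (a flag column updated on the row) can be observed.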
Cloud Spanner has a new feature called Change Streams that would allow building a downstream pipeline from Spanner to PubSub/Kafka.
At this time, there's not a pre-packaged Spanner to PubSub/Kafka connector.
Currently, the way to read change streams is either to use the SpannerIO Apache Beam connector, which allows building the pipeline with Dataflow, or to query the API directly.
Disclaimer: I'm a Developer Advocate that works with the Cloud Spanner team.

Cannot create a batch pipeline to get data from ZohoCRM with http plugin 1.2.1 to BigQuery. Returns Spark Program 'phase-1' failed

This is my first post here. I'm new to Data Fusion and I have little to no coding skills.
I want to get data from ZohoCRM into BigQuery. Each module from ZohoCRM (e.g. accounts, contacts...) should become a separate table in BigQuery.
To connect to Zoho CRM I obtained a code, token, refresh token and everything needed as described here https://www.zoho.com/crm/developer/docs/api/v2/get-records.html. Then I ran a successful get records request as described here via Postman and it returned the records from Zoho CRM Accounts module as JSON file.
I thought everything would be fine and set the parameters in Data Fusion (screenshots: DataFusion_settings_1 and DataFusion_settings_2); they validated fine. Then I previewed and ran the pipeline without deploying it. It failed with the following info from the logs (screenshot: logs_screenshot). I tried manually entering a few fields in the schema when the format was JSON. I tried changing the format to CSV; neither worked. I tried switching Verify HTTPS Trust Certificates on and off. It did not help.
I'd be really thankful for some help. Thanks.
Update, 2020-12-03
I got in touch with my Google Cloud Account Manager, who then took my question to their engineers. Here is the info:
The HTTP plugin can be used to "fetch Atom or RSS feeds regularly, or to fetch the status of an external system"; it does not seem to be designed for APIs.
At the moment a more suitable tool for data collected via APIs is Dataflow https://cloud.google.com/dataflow
"Google Cloud Dataflow is used as the primary ETL mechanism, extracting the data from the API Endpoints specified by the customer, which is then transformed into the required format and pushed into BigQuery, Cloud Storage and Pub/Sub."
https://www.onixnet.com/insights/gcp-101-an-introduction-to-google-cloud-platform
So in the next weeks I'll be looking at Dataflow.
Can you please attach the complete logs of the preview run? Make sure to redact any PII. Also, what version of CDF are you using? Is the CDF instance private or public?
Thanks and Regards,
Sagar
Did you end up using Dataflow?
I am also experiencing the same issue with the HTTP plugin, but my temporary way to go around it was to use a cloud scheduler to periodically trigger a cloud function that fetches my data from the API and exports them as a JSON to GCS, which can then be accessed by Data Fusion.
My solution is of course non-ideal, so I am still looking for a way to use the Data Fusion HTTP plugin. I was able to make it work to get sample data from public API end-points, but for a reason still unknown to me I can't get it to work for my actual API.

Dataprep jobs running for over 72 hours since 6/20 update. Job status reads complete but not published

I have been running daily Dataprep jobs and since the update last week, approximately half of my jobs are now hanging and not being published. They appear as jobs in progress although when I go to the actual job page, the job appears to be complete. There is no publishing action and the publishing target does not appear updated. Some jobs have now been going on for over 72 hours since Friday.
I've seen traces of other users having the same issue online but have not seen any sort of response or recognition from either Google or Trifacta.
I have tried restarting the jobs without success, and there appears to be no way to cancel the hanging jobs because, from Google's perspective, the jobs themselves were successful, just not published. This problem affects my jobs that publish to BigQuery as well as jobs that publish to Google Cloud Storage, and both manual and scheduled jobs.
This may impact only jobs that have been pushed during the upgrade and should be rather cosmetic in nature. Please note that you won't get charged.
Did the exact same job work before with no changes? If so, please contact support and give them the IDs of the previously successful job and the now failing job as a reference so it can be investigated further.
Cheers,
Sebastian
I have come across the same problem! The output of the jobs is placed in a temp folder in Cloud Storage, with the output mostly consisting of multiple files without headers.
It is also creating huge issues here. Instead of the normal output file, it places multiple parts of it in a temp folder without headers. This makes new scheduled jobs that rely on these outputs useless, because they do not load the new output.
If you manually merge the files in the temp folder, add headers (in the case of CSV) and place the result in the correct folder, the output can be created manually (for CSV).
Also no response from Google yet.
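For anyone who wants to script the manual merge described above rather than do it by hand, here is a minimal sketch using only the Java standard library. It assumes the part files have already been downloaded from the temp folder into a local directory, that their file names sort in the intended order, and that you know the header line yourself. PartFileMerger is a hypothetical name, and the output path must be outside the parts directory.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Merges headerless CSV part files into one file that starts with a header line. */
final class PartFileMerger {
    /** Writes the header, then the lines of every regular file in partsDir, sorted by name. */
    static void merge(Path partsDir, String header, Path output) throws IOException {
        List<Path> parts;
        // Files.list holds a directory handle, so close the stream when done.
        try (Stream<Path> listing = Files.list(partsDir)) {
            parts = listing.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
        List<String> merged = new ArrayList<>();
        merged.add(header);
        for (Path part : parts) {
            merged.addAll(Files.readAllLines(part, StandardCharsets.UTF_8));
        }
        Files.write(output, merged, StandardCharsets.UTF_8);
    }
}
```

This reads everything into memory, which is fine for small outputs; for large outputs a streaming copy would be a better fit. The merged file still has to be uploaded to the correct folder afterwards.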
We're seeing the exact same thing, down to the destinations and job types. It's almost like Dataprep is losing track of the underlying Dataflow job and not finishing on its completion (that's why you see the temp files: that's the output, and Dataprep then handles the formatting of the output file separately).
Someone was kind enough to already post this on the issue tracker, so please go star it and add any additional details that may be helpful to the Dataprep team:
https://issuetracker.google.com/issues/135865374

Run a Google Cloud Function for each file in a bucket

I have a Google Cloud Function triggered by a Google Cloud Storage object.finalize event. When I deploy a new version of this function, I would like to run it for every existing file in the bucket (which have already been processed by the previous version of the function). Processing all the existing files in the bucket is a long running task, hence I don't think a Google Cloud Function which will process all files in a row is an option.
The best option I can see for now is to make a Google Cloud Function that can be triggered via HTTP, lists all the files in the bucket and publishes one event per file via Google PubSub, and then to process each of these events with a slightly modified version of my initial Google Cloud Function that accepts a PubSub event in place of the object.finalize storage event.
I think it can work but I was wondering if there was an easier way to perform this operation.
If the operation you're trying to perform may take longer than the maximum time that a Cloud Function can run, you will need to split that operation into multiple steps. Your approach of using a PubSub trigger for each individual file sounds like a valid way to do that.
One option might be to write a small program that lists all of the objects in a bucket and, for each object, posts a message to Cloud Pub/Sub that triggers your function in the same way a GCS change would.
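The shape of such a small program can be sketched as a simple fan-out loop. In the sketch below the bucket listing and the publisher are deliberately left as plain Java parameters: in a real program they would be backed by the Cloud Storage and Pub/Sub client libraries, which are not shown here. BucketFanOut and publishOnePerObject are hypothetical names, and skipping names that end in "/" is an assumption about how folder placeholder objects appear in the listing.

```java
import java.util.function.Consumer;

/** Publishes one message per object so each file is handled by its own function invocation. */
final class BucketFanOut {
    /**
     * Calls the publisher once for every object name and returns how many were published.
     * Names ending in "/" are treated as folder placeholders and skipped (an assumption).
     */
    static int publishOnePerObject(Iterable<String> objectNames, Consumer<String> publisher) {
        int published = 0;
        for (String name : objectNames) {
            if (name.endsWith("/")) {
                continue;
            }
            publisher.accept(name);
            published++;
        }
        return published;
    }
}
```

Because each message carries a single object name, no individual function invocation has to process the whole bucket, which keeps every invocation within the Cloud Function time limit mentioned above.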

Automate / schedule a script

I have read a number of blogs and watched tutorials, but cannot find anything that helps with my problem.
I have a stakeholder who drops files into Google Cloud Storage, and I have already written a script that performs ETL tasks on those files.
It would be great if I could create a trigger that runs my script as soon as a file is dropped in a specific place in Google Cloud Storage.
Google Cloud Storage supports Google Cloud Pub/Sub Notifications. This allows you to programmatically receive notifications when new objects are uploaded to your bucket.
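Once notifications arrive, the subscriber still has to decide which of them should start the script. Cloud Storage notification messages carry attributes such as eventType, bucketId and objectId, and a minimal sketch of that decision in plain Java could look like the following. The class name, the method name, and the idea of watching a path prefix are assumptions for illustration; receiving the message and launching the ETL script are not shown.

```java
import java.util.Map;

/** Decides whether a Cloud Storage notification should start the ETL script. */
final class UploadNotificationFilter {
    /**
     * Returns true only for newly finalized objects under the watched prefix.
     * The attributes map is assumed to hold the notification's message attributes.
     */
    static boolean shouldRunEtl(Map<String, String> attributes, String watchedPrefix) {
        String eventType = attributes.get("eventType");
        String objectId = attributes.get("objectId");
        return "OBJECT_FINALIZE".equals(eventType)
                && objectId != null
                && objectId.startsWith(watchedPrefix);
    }
}
```

Filtering on OBJECT_FINALIZE means that deletions and metadata updates in the same bucket would not start the script.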