Google Cloud Spanner real-time Change Data Capture to Pub/Sub/Kafka through Cloud Data Fusion or others

I would like to build a real-time change data capture pipeline (log-based preferred) from Google Cloud Spanner to Pub/Sub/Kafka for my downstream real-time applications. Could you please let me know if there is a reliable and cost-effective way to achieve that? I would appreciate any advice and recommendations.
In addition, I noticed that Google's Cloud Data Fusion can replicate in real time from MySQL/PostgreSQL to Cloud Spanner, but I did not find a way to go from Cloud Spanner to Pub/Sub/Kafka in real time.
Also, I found two other approaches, listed here for any comments or suggestions:
Use Debezium, a log-based change data capture Kafka connector: https://cloud.google.com/architecture/capturing-change-logs-with-debezium#deploying_debezium_on_gke_on_google_cloud
Create a polling service (which may miss some data) that polls data from Cloud Spanner: https://cloud.google.com/architecture/deploying-event-sourced-systems-with-cloud-spanner
If you have any suggestions or comments on this, I would be really grateful.

There's an open source implementation of a polling service for Cloud Spanner that can also automatically push changes to Pub/Sub: https://github.com/cloudspannerecosystem/spanner-change-watcher
It is, however, not log-based and has some inherent limitations:
It can miss updates if the same record is updated twice within the polling interval. In that case, only the last value will be reported.
It only supports soft deletes.
You could have a look at the samples to see if it is something that might suit your needs at least to some degree: https://github.com/cloudspannerecosystem/spanner-change-watcher/tree/master/samples
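For a rough idea of what using the watcher looks like, here is a minimal sketch in Java based on the repository's samples. The project, instance, database and table names are placeholders, the watched table needs a commit timestamp column, and the exact class names may differ between versions of the library:

```java
import com.google.cloud.Timestamp;
import com.google.cloud.spanner.DatabaseId;
import com.google.cloud.spanner.Spanner;
import com.google.cloud.spanner.SpannerOptions;
import com.google.cloud.spanner.watcher.SpannerTableChangeWatcher.Row;
import com.google.cloud.spanner.watcher.SpannerTableChangeWatcher.RowChangeCallback;
import com.google.cloud.spanner.watcher.SpannerTableTailer;
import com.google.cloud.spanner.watcher.TableId;

public class WatchSingleTable {
  public static void main(String[] args) {
    // Placeholder identifiers; replace with your own project, instance, database and table.
    Spanner spanner = SpannerOptions.newBuilder().build().getService();
    DatabaseId databaseId = DatabaseId.of("my-project", "my-instance", "my-database");

    // The tailer polls the table for new commit timestamps at a configurable interval.
    SpannerTableTailer tailer =
        SpannerTableTailer.newBuilder(spanner, TableId.of(databaseId, "Singers")).build();
    tailer.addCallback(
        new RowChangeCallback() {
          @Override
          public void rowChange(TableId table, Row row, Timestamp commitTimestamp) {
            // This is where you would forward the change to Pub/Sub or Kafka.
            System.out.printf("Change in %s at %s: %s%n", table, commitTimestamp, row.asStruct());
          }
        });
    tailer.startAsync().awaitRunning();
  }
}
```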

Cloud Spanner has a new feature called Change Streams that allows building a downstream pipeline from Spanner to Pub/Sub/Kafka.
At this time, there is no pre-packaged Spanner to Pub/Sub/Kafka connector.
Currently, the way to read change streams is to use the SpannerIO Apache Beam connector to build the pipeline with Dataflow, or to query the API directly.
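As a rough illustration, a Dataflow pipeline that reads a change stream with the SpannerIO Beam connector and forwards each record to Pub/Sub could look like the sketch below. The project, instance, database, change stream and topic names are placeholders (the change stream must already exist in the database), and the record serialization is deliberately simplified:

```java
import com.google.cloud.Timestamp;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.spanner.SpannerConfig;
import org.apache.beam.sdk.io.gcp.spanner.SpannerIO;
import org.apache.beam.sdk.io.gcp.spanner.changestreams.model.DataChangeRecord;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

public class SpannerChangeStreamToPubSub {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    pipeline
        // Read data change records from the change stream, starting from now.
        .apply("ReadChangeStream",
            SpannerIO.readChangeStream()
                .withSpannerConfig(SpannerConfig.create()
                    .withProjectId("my-project")
                    .withInstanceId("my-instance")
                    .withDatabaseId("my-database"))
                .withChangeStreamName("my_change_stream")
                .withMetadataInstance("my-instance")
                .withMetadataDatabase("my-metadata-database")
                .withInclusiveStartAt(Timestamp.now()))
        // Serialize each record; a real pipeline would build proper JSON/Avro here.
        .apply("ToString",
            MapElements.into(TypeDescriptors.strings())
                .via((DataChangeRecord record) -> record.toString()))
        // Publish to a Pub/Sub topic for downstream consumers.
        .apply("PublishToPubSub",
            PubsubIO.writeStrings().to("projects/my-project/topics/spanner-changes"));

    pipeline.run();
  }
}
```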
Disclaimer: I'm a Developer Advocate who works with the Cloud Spanner team.


Can we check Firestore reads origin?

Is there a way to quantify how many Firestore reads come from clients and how many from Google Cloud Functions?
I'd like to reduce my project's read costs.
Firebase currently does not provide tools to track the origin of document reads. All reads fall under the same bucket: "a read happened". If you need to measure specific reads from your app, you will have to track that yourself, for example by adding a logger that records whether the request came from the client or from the Cloud Function itself.
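As an illustration of that idea, here is a minimal sketch in Java of a wrapper around Firestore reads that emits one structured log line per read, so the counts can later be aggregated in Cloud Logging. The class name and log fields are made up for illustration, and the same pattern applies to the client SDKs:

```java
import com.google.cloud.firestore.DocumentSnapshot;
import com.google.cloud.firestore.Firestore;

public class TrackedFirestoreReads {
  private final Firestore db;
  private final String origin; // e.g. "client" or "cloud-function"

  public TrackedFirestoreReads(Firestore db, String origin) {
    this.db = db;
    this.origin = origin;
  }

  public DocumentSnapshot getDocument(String path) throws Exception {
    DocumentSnapshot snapshot = db.document(path).get().get();
    // One structured line per read; filter and count these by "origin" in Cloud Logging.
    System.out.println(
        String.format("{\"firestoreRead\": 1, \"origin\": \"%s\", \"doc\": \"%s\"}", origin, path));
    return snapshot;
  }
}
```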
This documentation may come in handy.
Firestore audit logging information:
"Google Cloud services write audit logs to help you answer the questions, 'Who did what, where, and when?' within your Google Cloud resources."
Data Access audit logs:
"Includes 'admin read' operations that read metadata or configuration information. Also includes 'data read' and 'data write' operations that read or write user-provided data."
To receive Data Access audit logs, you must explicitly enable them.
https://cloud.google.com/firestore/docs/audit-logging

Running periodic queries on a Google Cloud SQL instance

I have a Google Cloud SQL PostgreSQL instance and I'd like to run periodic SQL queries on it and use the monitoring system to alert the user with the results.
How can I accomplish this using just the GCP platform, without having to develop a separate app?
As far as I am aware, there is no built-in feature for recurring queries in Cloud SQL at the moment, so you have to implement your own.
You can use Cloud Scheduler to trigger a Cloud Function (via an HTTPS endpoint) that runs the query on Cloud SQL and then notifies the user in whatever way suits your needs (I would recommend using Pub/Sub).
You might also want to save the result in a GCS bucket so the user can pull it from there.
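As a rough sketch of such a function in Java, using the Functions Framework, the Cloud SQL JDBC socket factory and the Pub/Sub client; the connection string, topic and query are placeholders, not real values:

```java
import com.google.cloud.functions.HttpFunction;
import com.google.cloud.functions.HttpRequest;
import com.google.cloud.functions.HttpResponse;
import com.google.cloud.pubsub.v1.Publisher;
import com.google.protobuf.ByteString;
import com.google.pubsub.v1.PubsubMessage;
import com.google.pubsub.v1.TopicName;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PeriodicQueryFunction implements HttpFunction {

  // Cloud SQL connection via the JDBC socket factory (placeholder values).
  private static final String JDBC_URL =
      "jdbc:postgresql:///mydb"
          + "?cloudSqlInstance=my-project:us-central1:my-instance"
          + "&socketFactory=com.google.cloud.sql.postgres.SocketFactory"
          + "&user=postgres&password=secret";
  private static final TopicName TOPIC = TopicName.of("my-project", "periodic-query-results");

  @Override
  public void service(HttpRequest request, HttpResponse response) throws Exception {
    StringBuilder result = new StringBuilder();

    // Run the recurring query; Cloud Scheduler invokes this endpoint on the desired schedule.
    try (Connection conn = DriverManager.getConnection(JDBC_URL);
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery(
            "SELECT count(*) AS overdue FROM invoices WHERE due_date < now()")) {
      while (rs.next()) {
        result.append("overdue=").append(rs.getLong("overdue"));
      }
    }

    // Publish the result to Pub/Sub so a subscriber (email, chat, etc.) can alert the user.
    Publisher publisher = Publisher.newBuilder(TOPIC).build();
    try {
      PubsubMessage message = PubsubMessage.newBuilder()
          .setData(ByteString.copyFromUtf8(result.toString()))
          .build();
      publisher.publish(message).get();
    } finally {
      publisher.shutdown();
    }

    response.getWriter().write("published: " + result);
  }
}
```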
Also, you might want to check BigQuery, which has a built-in feature for scheduling queries.

Am I getting hacked? Huge spike of Cloud Firestore traffic, with no explanation

I'm currently facing a huge spike of reads on Cloud Firestore and have no way to trace upstream to find the issue. I first saw this increase on Cloudflare: from 1 million requests to 175 million in 3 days, with no correlation with user activity.
Cloudflare Dashboard before
Cloudflare Dashboard after
Diving into the statistics from GCP and Firebase is even more confusing, as they reflect different realities.
GCP Dashboard Cloud Firestore Read and Write
Firebase Dashboard Firestore Read and Write
I checked whether it was correlated with a new deployment or a new security rule, but found nothing.
For a while I suspected a hack, but writes seem to follow reads, so I'm sure of nothing.
Has anyone had a similar experience, or a hint of where to find more info on GCP?
Thanks for reading guys

PostgreSQL data_directory on Google Cloud Storage, possible?

I am new to Google Cloud and was wondering if it is possible to run a PostgreSQL container on Cloud Run with PostgreSQL's data_directory pointed to Cloud Storage?
If possible, could you please point me to some tutorials/guides on this topic? And what are the downsides of this approach?
Edit-0: Just to clarify what I am trying to achieve:
I am learning Google Cloud and want to write a simple application to work along with it. I have decided that the backend code will run as a container under Cloud Run and the persistent data (i.e. the database files) will reside on Cloud Storage. Because this is a small app for learning purposes, I am trying to use as few moving parts as possible on the backend (and also ones that are always free). Both PostgreSQL and the backend code will reside in the same container, except for the actual data files, which will reside on Cloud Storage. Is this approach correct? Are there better approaches to achieve the same minimalism?
Edit-1: Okay, I got the answer! The Google documentation here mentions the following:
"Don't run a database over Cloud Storage FUSE!"
Buckets are not meant to store database information; some of the limits are the following:
There is no limit to writes across multiple objects, which includes uploading, updating, and deleting objects. Buckets initially support roughly 1000 writes per second and then scale as needed.
There is no limit to reads of objects in a bucket, which includes reading object data, reading object metadata, and listing objects. Buckets initially support roughly 5000 object reads per second and then scale as needed.
One alternative, using a separate persistent disk for your PostgreSQL database, is to use Google Compute Engine. You can follow the “How to Set Up a New Persistent Disk for PostgreSQL Data” Community Tutorial.

How to chain IBM Data Connect Activities in a flow

I have defined several Activities in IBM Data Connect (on Bluemix) and would like to chain them together, e.g. one for copying a Cloudant DB to dashDB, another for refining the copied data, and so on.
Can this be done? If yes, how?
Data Connect doesn't currently support a way of chaining your activities together. However, you could make use of the current scheduling capabilities to arrange the activities to run in sequence. As the only trigger mechanism we currently have is time, this requires you to leave enough time for each activity to finish before the next one in the chain starts.
I will find out for you if we have the kind of feature you're after on our roadmap.
Regards,
Wesley -
IBM Bluemix Data Connect Engineering
You can also use the Data Connect API to do the orchestration. See the documentation here: https://console.ng.bluemix.net/docs/services/dataworks1/index.html
Regards,
Hernando Borda
IBM Bluemix Data Connect Product Manager