Unable to use the google.cloud.sql.connector module in Google Cloud Composer - google-cloud-sql

I am trying to schedule a Dataflow pipeline job that reads content from a Cloud SQL SQL Server instance and writes it to a BigQuery table. I'm using google.cloud.sql.connector[pytds] to set up the connection. The Dataflow job runs successfully when I launch it manually through the Google Cloud Shell, but the Airflow version (using Google Cloud Composer) fails with a NameError:
NameError: name 'Connector' is not defined
I have enabled the save_main_session option. I have also listed the connector module in the py_requirements option, and it is being installed (per the Airflow logs):
py_requirements=['apache-beam[gcp]==2.41.0','cloud-sql-python-connector[pytds]==0.6.1','pyodbc==4.0.34','SQLAlchemy==1.4.41','pymssql==2.2.5','sqlalchemy-pytds==0.3.4','pylint==2.15.4']
[2022-11-02 07:40:53,308] {process_utils.py:173} INFO - Collecting cloud-sql-python-connector[pytds]==0.6.1
[2022-11-02 07:40:53,333] {process_utils.py:173} INFO - Using cached cloud_sql_python_connector-0.6.1-py2.py3-none-any.whl (28 kB)
But it seems the import is not working.

You have to install the PyPI packages on the Cloud Composer nodes; there is a PyPI Packages tab on the environment page in the Composer GUI.
Add all the packages needed by your Dataflow job to Composer via this page, except Apache Beam and Apache Beam GCP, because Beam and the Google Cloud dependencies are already installed in Cloud Composer.
Cloud Composer is the runner of your Dataflow job, and the runner instantiates the job. To instantiate the job correctly, the runner needs to have the dependencies installed.
Then, at execution time, the Dataflow job will use the given py_requirements or setup.py file on the workers.
The py_requirements or setup.py must also contain the packages needed to execute the Dataflow job.
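For illustration, a hedged sketch of what the Composer task can look like with the Beam operator; the file path, project, region, and requirement pins below are placeholders, not the exact DAG:
from datetime import datetime
from airflow import DAG
from airflow.providers.apache.beam.operators.beam import BeamRunPythonPipelineOperator
from airflow.providers.google.cloud.operators.dataflow import DataflowConfiguration

with DAG("cloudsql_to_bq", schedule_interval=None, start_date=datetime(2022, 11, 1)) as dag:
    run_pipeline = BeamRunPythonPipelineOperator(
        task_id="run_pipeline",
        runner="DataflowRunner",
        py_file="gs://my-bucket/dataflow/pipeline.py",  # placeholder pipeline file
        pipeline_options={
            "temp_location": "gs://my-bucket/tmp",  # placeholder bucket
            "save_main_session": True,
        },
        # Installed on the Dataflow workers at execution time.
        py_requirements=[
            "apache-beam[gcp]==2.41.0",
            "cloud-sql-python-connector[pytds]==0.6.1",
            "SQLAlchemy==1.4.41",
        ],
        py_interpreter="python3",
        py_system_site_packages=False,
        dataflow_config=DataflowConfiguration(
            job_name="cloudsql-to-bq",
            project_id="my-project",  # placeholder project
            location="us-central1",  # placeholder region
        ),
    )
The same package list (minus apache-beam and apache-beam[gcp]) is what goes into the PyPI Packages tab, so the Composer environment that launches the job also has the dependencies.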

Related

Installing connectors for Confluent Cloud

I'm trying to install the MQ source and sink connectors for our Confluent Cloud cluster. I've done this for on-prem Apache Kafka, but doing the same for the cloud seems to be different. The Confluent documentation says I need Confluent Platform installed locally, which I did, then to run confluent-hub install (which does install the connector locally), and then to use the JSON config against the distributed instance.
My problem is that when I submit the JSON, it says the MQ connector class was not found. I tried pointing CLASSPATH to the directory where the jars are, but I still get the same error. How do I run this successfully?
ERROR Uncaught exception in REST call to /connectors (org.apache.kafka.connect.runtime.rest.errors.ConnectExceptionMapper:61)
org.apache.kafka.connect.errors.ConnectException: Failed to find any class that implements Connector and which name matches io.confluent.connect.ibm.mq.IbmMQSourceConnector,
I also want to understand how installing the connector locally would apply to my cloud cluster. Not sure what I'm missing!
Confluent Cloud doesn't support custom connector installation, last I checked; connectors need to be explicitly supported and offered by Confluent.
I assume you're reading documentation that indicates you need to run your own Connect cluster (not necessarily locally), where you have full control over the connectors that are installed.
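For example, with a self-managed Connect worker the JSON goes to that worker's REST API; this is a rough sketch in which the host and the MQ config keys are placeholders, and the class only resolves if the connector jars are on that worker's plugin.path:
import requests

# Placeholder connector config; connector.class must be resolvable from the
# Connect worker's plugin.path, otherwise the "failed to find any class" error appears.
connector_config = {
    "name": "ibm-mq-source",
    "config": {
        "connector.class": "io.confluent.connect.ibm.mq.IbmMQSourceConnector",
        "kafka.topic": "mq-messages",  # placeholder settings
        "mq.hostname": "mq.example.com",
        "mq.port": "1414",
        "tasks.max": "1",
    },
}

# POST to the self-managed Connect worker's REST endpoint (placeholder host/port).
resp = requests.post("http://connect.example.com:8083/connectors", json=connector_config)
resp.raise_for_status()
print(resp.json())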

Error copying pip.conf from bucket to Cloud Composer Airflow environment

Similar to this lonely questioner, I'm trying to install a Python package from a private PyPI repository so that it's available to our Google Cloud Composer Airflow instance.
I've followed these instructions, but Airflow still doesn't know about my package:
No module named 'foopackage'
I can't find any reference to my pip.conf in any logs anywhere, so I'm not sure whether the file is in the right place or has the right contents.
How can I proceed with debugging this problem?
The Cloud Composer environment logs show that there was a problem with copying pip.conf from the bucket, but don't give any other details:
{
insertId: "16qa4c8g540zxs3"
logName: "projects/{my-env}/logs/composer-agent"
receiveTimestamp: "2020-02-06T15:59:03.164564368Z"
resource: {…}
severity: "ERROR"
textPayload: "Copying gs://{my-bucket}/config/pip/pip.conf...
"
timestamp: "2020-02-06T15:59:00.857642186Z"
}
I first thought this might be a permissions issue, but the file seems to have the same set of permissions as other files in this bucket.
Where can I get more detailed information on what went wrong when copying that file?
Update
I'm on composer-1.7.2-airflow-1.10.2.
Update
The service account for my Composer environment already has the project.editor role.
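For what it's worth, a minimal sketch (with a placeholder bucket name) of one way to confirm the file really sits at the gs://<environment-bucket>/config/pip/pip.conf path the instructions describe:
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("us-central1-my-env-bucket")  # placeholder environment bucket
blob = bucket.blob("config/pip/pip.conf")
if not blob.exists():
    print("pip.conf is not at config/pip/pip.conf")
else:
    # Print the contents to check the index URL and credentials.
    print(blob.download_as_text())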
The error in the environment log is an indicator that the Docker image used for the web server failed to build. To find the root cause, view the Cloud Build logs in your project.
The reason for this is an operation that failed or took too long and timed out on Composer's backend. In some cases these errors persist in the backend, blocking future attempts, and you can try re-enabling the API.
The first solution that comes to my mind is running the following commands in Cloud Shell:
gcloud services disable composer.googleapis.com
gcloud services enable composer.googleapis.com
After enabling the API, please update your Composer environment as usual.
When you install packages, the Composer environment re-creates the Docker containers for the Airflow workers and scheduler, then performs a rolling update within the GKE cluster so the workers stay available. You can check Kubernetes Engine > Workloads to see whether your environment timed out while waiting for the scheduler and workers to come back online.
When the Composer environment uses a custom service account that does not have IAM access to use Cloud Build, builds will fail immediately, so please check this. You can diagnose it by going to Cloud Build > History; builds without a log mean the build failed before it even tried to build a container.
If your package implements native bindings, it will fail at runtime if the shared libraries don't exist on the system. Such a package is incompatible with Cloud Composer, because getting shared libraries into the build environment is not currently supported.
Also, make sure your project is packaged correctly.
I hope you find the above information useful.
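Regarding packaging the project correctly, a minimal setup.py along these lines (name, version, and dependencies are placeholders) is usually enough for pip to build and install the package from a private index:
from setuptools import find_packages, setup

setup(
    name="foopackage",  # placeholder package name
    version="0.1.0",
    packages=find_packages(),
    install_requires=[],  # runtime dependencies, if any
    python_requires=">=3.6",
)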

Deploy Dataflow job using Scio

I've started developing my first Dataflow job using Scio, the Scala SDK. The Dataflow job will run in streaming mode.
Can anyone advise the best way to deploy this? I have read in the Scio docs that they use sbt-pack and then deploy it within a Docker container. I have also read about using Dataflow templates (but not in great detail).
What's best?
As with the Java and Python SDKs, you can run your code directly on Dataflow by using the Dataflow runner and launching it from your computer (or a VM/function).
If you want to package it for reuse, you can create a template.
You can't run a custom container on Dataflow.
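For reference, launching directly with the Dataflow runner looks like this in the Python SDK (a hedged sketch with placeholder project, bucket, topic, and table); with Scio the equivalent is passing --runner=DataflowRunner plus the project/region options to your main class when you run it:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",  # placeholder project
    region="us-central1",  # placeholder region
    temp_location="gs://my-bucket/tmp",  # placeholder bucket
    streaming=True,  # the job runs in streaming mode
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
        | "Parse" >> beam.Map(lambda msg: {"payload": msg.decode("utf-8")})
        | "Write" >> beam.io.WriteToBigQuery(
            "my-project:my_dataset.events",  # placeholder table
            schema="payload:STRING",
        )
    )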

Spring Task in Spring Cloud Dataflow on PCF can't find java

I have a Spring Cloud Task fat jar that I have successfully deployed to SCDF running on PCF. I have created a definition for it and can therefore run it from the dashboard. FWIW, it reads from and writes to a database using Spring JDBC.
I'm now trying to set it up to run on a schedule and am having issues. I created a stream with a triggertask source and a task-launcher-local sink, and have configured the triggertask URI to point to the fat jar (via HTTP, using a staticfile PCF-pushed app).
The dashboard shows the two PCF apps (one for triggertask, one for task-launcher-local) both starting successfully, and it all runs, but the task fails every time with the error:
Caused by: java.io.IOException: Cannot run program "java" (in directory "/home/vcap/tmp/spring-cloud-dataflow-5903184636016162160/Task--582903409-1502669137014/Task--582903409"): error=2, No such file or directory
From what I can tell and surmise, the PCF app running the stream tries to fork and exec a java call, but since java is not on the path in PCF app containers, I get the error.
Am I right? Either way, how can I get the Spring Cloud Task (jar) to run successfully?
Spring Cloud Data Flow: Server
1.2.3 (using built spring-cloud-dataflow-server-cloudfoundry-1.2.3.BUILD-SNAPSHOT.jar)
Spring Cloud Data Flow: Shell
1.2.3 (using downloaded spring-cloud-dataflow-shell-1.2.3.RELEASE.jar)
Deployment Environment
PCF v1.11.6 (on Azure)
pcf dev v0.26.0 (on mac)
App Starters
http://bit-dot-ly/1-0-4-GA-stream-applications-rabbit-maven
Logs
link to log
The stream definition is missing from the post. It is possible that you're using the task-launcher-local sink, which is compatible only with SCDF's local server and will fail with the attached error when running on CF. Please make sure you're using the task-launcher-cloudfoundry sink. This application was added in the latest release of the app starters.
As pointed out in the previous SO thread, it is highly recommended that you use the latest release of the app starters (1.0.4 is at least 10 months old). The latest releases can be found on the project site.

How to create a Spark Streaming jar that would work in AWS EMR?

I've been developing a Spark Streaming application with Eclipse, and I'm using sbt to run it locally.
Now I want to deploy the application on AWS using a jar, but when I use sbt's package command it creates a jar without the dependencies, so when I upload it to AWS it won't work because Scala is missing.
Is there a way to create an uber-jar with sbt? Am I doing something wrong with the deployment of Spark on AWS?
To create an uber-jar with sbt, use the sbt-assembly plugin. For more details about creating an uber-jar with sbt-assembly, refer to the blog post.
After creating it, you can run the assembly jar using the java -jar command.
But from Spark 1.0.0 onwards, the spark-submit script in Spark's bin directory is used to launch applications on a cluster; for more details, refer here.
You should really be following Running Spark on EC2, which reads:
The spark-ec2 script, located in Spark’s ec2 directory, allows you to
launch, manage and shut down Spark clusters on Amazon EC2. It
automatically sets up Spark, Shark and HDFS on the cluster for you.
This guide describes how to use spark-ec2 to launch clusters, how to
run jobs on them, and how to shut them down. It assumes you’ve already
signed up for an EC2 account on the Amazon Web Services site.
I've only partially followed the document so I can't comment on how well it's written.
Moreover, according to the Shipping Code to the Cluster chapter in the other document:
The recommended way to ship your code to the cluster is to pass it
through SparkContext’s constructor, which takes a list of JAR files
(Java/Scala) or .egg and .zip libraries (Python) to disseminate to
worker nodes. You can also dynamically add new files to be sent to
executors with SparkContext.addJar and addFile.