ApacheBeam on FlinkRunner doesn't read from Kafka - apache-kafka

I'm trying to run Apache Beam backed by a local Flink cluster in order to consume from a Kafka Topic, as described in the Documentation for ReadFromKafka.
The code is basically this pipeline plus some other setup, as described in the Beam examples:
with beam.Pipeline() as p:
    lines = p | ReadFromKafka(
        consumer_config={'bootstrap.servers': bootstrap_servers},
        topics=[topic],
    ) | beam.WindowInto(beam.window.FixedWindows(1))
    output = lines | beam.FlatMap(lambda x: print(x))
    output | WriteToText(output)
Since I attempted to run on Flink, I followed this doc for Beam on Flink and did the following:
--> I downloaded the binaries for Flink 1.10 and followed these instructions to properly set up the cluster.
I checked the logs for the server and task instance. Both were properly initialized.
--> Started Kafka using Docker, exposing it on port 9092.
--> Executed the following in the terminal:
python example_1.py --runner FlinkRunner --topic myTopic --bootstrap_servers localhost:9092 --flink_master localhost:8081 --output output_folder
The terminal outputs
2.23.0: Pulling from apache/beam_java_sdk
Digest: sha256:3450c7953f8472c2312148a2a8324b0115fd71b3a7a01a3b017f6de69d89dfe1
Status: Image is up to date for apache/beam_java_sdk:2.23.0
docker.io/apache/beam_java_sdk:2.23.0
But then, after writing some messages to myTopic, the terminal remains frozen and I don't see anything in the output folder. I checked flink-conf.yaml and, given these two lines,
jobmanager.rpc.address: localhost
jobmanager.rpc.port: 6123
I assumed that the port for the jobs would be 6123 instead of 8081 as specified in beam documentation, but the behaviour for both ports is the same.
I'm very new to Beam/Flink, so I'm not sure what the cause could be. I have two hypotheses as of now, but can't quite figure out how to investigate them:
1. Something related to the port that Beam uses to communicate with Flink in order to send the jobs.
2. The Expansion Service for the Python SDK mentioned in the apache.beam.io.external.ReadFromKafka docs:
Note: To use these transforms, you need to start a Java Expansion Service. Please refer to the portability documentation on how to do that. Flink Users can use the built-in Expansion Service of the Flink Runner’s Job Server. The expansion service address has to be provided when instantiating the transforms.
But reading the portability documentation, it refers me back to the same doc for Beam on Flink.
Could someone, please, help me out?
Edit: I was writing to the topic using the Debezium Source Connector for PostgreSQL and seeing the behavior mentioned above. But when I tried to write to the topic manually, the application crashed with the following:
RuntimeError: org.apache.beam.sdk.util.UserCodeException: org.apache.beam.sdk.coders.CoderException: cannot encode a null byte[]
at org.apache.beam.sdk.util.UserCodeException.wrap(UserCodeException.java:36)

You are doing everything correctly; the Java Expansion Service no longer needs to be started manually (see the latest docs). Also, Flink serves the web UI at 8081, but accepts job submission there just as well, so either port works fine.
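One way to test the port hypothesis is to ask the Flink REST endpoint which jobs it knows about. The sketch below assumes the REST API is reachable at the address passed as --flink_master and that /jobs/overview returns a JSON object with a "jobs" list whose entries carry "name" and "state". The helper names are made up for illustration.

```python
import json
import urllib.request


def jobs_overview_url(flink_master: str) -> str:
    """Build the URL of Flink's job overview REST endpoint."""
    # Accept both "localhost:8081" and "http://localhost:8081".
    if not flink_master.startswith(("http://", "https://")):
        flink_master = "http://" + flink_master
    return flink_master.rstrip("/") + "/jobs/overview"


def summarize_jobs(payload: dict) -> list:
    """Return (name, state) pairs from a /jobs/overview response body."""
    return [(job.get("name"), job.get("state")) for job in payload.get("jobs", [])]


def fetch_jobs(flink_master: str, timeout: float = 5.0) -> list:
    """Query the cluster (network call) and list the jobs it reports."""
    url = jobs_overview_url(flink_master)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return summarize_jobs(json.load(resp))
```

Calling fetch_jobs("localhost:8081") while the pipeline is supposedly running should list it, with a state such as RUNNING, if the submission reached the cluster. An empty list would point at the submission step rather than at Kafka.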
It looks like you may be running into the issue that Python's TextIO does not yet support streaming.
Additionally, there is the complication that when running Python pipelines on Flink, the actual code runs in a Docker container, and so if you are trying to write to a "local" file it will be a file inside the container, not on your machine.
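A possible way around both points while debugging is to log the records instead of writing them to a file. This is only a sketch: it assumes ReadFromKafka hands each record to Python as a (key, value) tuple of raw bytes, and the function names are invented for this example. The None handling is purely defensive and is not claimed to fix the CoderException from the question.

```python
import logging
from typing import Optional, Tuple


def format_record(record: Tuple[Optional[bytes], Optional[bytes]]) -> str:
    """Turn a (key, value) pair of raw bytes into a printable line."""
    key, value = record
    # Records may lack a key (or a value), so treat None explicitly.
    key_text = key.decode("utf-8", errors="replace") if key is not None else "<no key>"
    value_text = value.decode("utf-8", errors="replace") if value is not None else "<no value>"
    return f"{key_text}: {value_text}"


def log_record(record):
    """Log the record and pass it through unchanged."""
    logging.info(format_record(record))
    return record
```

In the pipeline from the question this would be used as lines | beam.Map(log_record) in place of the FlatMap and WriteToText steps. The messages should then appear in the worker logs rather than in a file on the local machine.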

Related

Submit jobs via Rest API and deploy Flink on a running Kubernetes cluster (Native way)

I am trying to implement a REST client for Flink to send jobs via Flink's RESTful services. I also want to integrate Flink and Kubernetes natively. I have decided to use "Application Mode" as the deployment mode, according to the Flink documentation.
I have already implemented a job and packaged it as a jar, and I have tested it on standalone Flink. But my aim is to move to Kubernetes and deploy my application in Application Mode via the Flink REST API.
I have already investigated the samples in the Flink documentation - Native Kubernetes. But I cannot find a sample for executing the same samples via RESTful services (esp. how to set --target kubernetes-application/kubernetes-session or other parameters).
In addition to samples, I checked out the Flink sources from GitHub and tried to find some sample implementation or get some clue.
I think the ones below are related to my case.
org.apache.flink.client.program.rest.RestClusterClient
org.apache.flink.kubernetes.KubernetesClusterDescriptorTest.testDeployApplicationCluster
But they are too complicated for me to work out the following points:
For application mode, is there any need to initialize a container to serve the Flink REST services before submitting a job? If so, is it the JobManager?
For application mode, how can I set the same command line parameters via Rest services?
For session mode, in the command line samples, kubernetes-session.sh is executed before job submission to initialize a JobManager container. How should I do this step via the REST client?
For session mode, how can I set the same command line parameters via Rest services? Although the command line samples send .jar job as parameter, should I upload jar before submitting job?
Could you please provide me some clue/sample to continue my implementation?
Best regards,
Burcu
I suspect that if you study the implementation of the Apache Flink Kubernetes Operator you'll find some clues.
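For the session-mode part of the question, a rough sketch of the REST side might look like the following. It covers only the case where a JobManager REST endpoint is already running (session or standalone cluster) and does not address creating an Application Mode cluster on Kubernetes. It assumes the jar was first uploaded with a multipart POST to /jars/upload, that the upload response contains a "filename" path whose last segment is the jar id, and that /jars/<jarid>/run accepts the JSON fields entryClass, programArgs and parallelism. These should be checked against the REST API docs for the Flink version in use. The function names and com.example.Main are placeholders.

```python
import json
import urllib.request
from urllib.parse import quote


def jar_id_from_upload(upload_response: dict) -> str:
    """Extract the jar id from a /jars/upload response (assumed shape)."""
    # The response is assumed to hold the server-side path of the stored jar.
    return upload_response["filename"].rsplit("/", 1)[-1]


def run_jar_request(base_url, jar_id, entry_class, program_args="", parallelism=1):
    """Build (but do not send) the POST request that starts an uploaded jar."""
    url = f"{base_url.rstrip('/')}/jars/{quote(jar_id, safe='')}/run"
    body = {
        "entryClass": entry_class,
        "programArgs": program_args,
        "parallelism": parallelism,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned request to urllib.request.urlopen would submit the job. So in this flow the jar is uploaded first and then started by id, with the command line arguments carried in the JSON body.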

Debezium io with pulsar

I want to understand how Pulsar uses the Debezium IO connector for CDC.
While creating the source using pulsar-admin source create, how can I pass the broker URL and the client authentication parameters, similar to what we do when using localrun?
The command I run:
bin/pulsar-admin source localrun --sourceConfigFile debezium-mysql-source-config.yaml --client-auth-plugin --client-auth-params --broker-service-url
Now I want to replace this with a connector that runs in cluster mode.
Localrun is a special mode that simplifies debugging, and it runs outside of the normal cluster. It needs extra parameters to create the client for the local runtime.
In cluster mode the connector gets its client from the Pulsar connectors runtime, through the function worker configuration. All you need to do is use "bin/pulsar-admin source create ...".

Confluent Kafka services (local) do not start properly on wsl2 and seems to timeout communicating their status

I am seeing various different issues while trying to start Kafka services on wsl2. Details/symptoms below:
Confluent Kafka (7.0.0) platform
wsl2 - ubuntu 20.04LTS
When I use the command:
confluent local services start
Typically the system will take a long time and then exit with a "service failed" message (e.g. for zookeeper, as that is the first service to start).
If I check the logs, it has actually started. So I type the command again and, sure enough, it immediately says zookeeper is up, then proceeds to try to start kafka, which after a minute again says it failed to start (but it really has started).
I suspect that after starting the service (which is quite fast), the system is not able to communicate back/exit and thus times out. I am not sure where the logs related to this are.
Can see this in the screenshot below
This means that starting the whole stack (zookeeper/kafka/schema-registry/kafka-rest/kafka-connect/etc.) takes forever, and in between I start getting other errors (sometimes schema-registry is not able to find the cluster id, sometimes it's a log file related error), which means I need to destroy and start again.
I have tried this over a couple of days and can't get it to work. Is Confluent Kafka that unstable on Windows, or am I missing some config change?
In terms of setup, I have not done any change in the config and am using the default config/ports.
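To confirm the suspicion that the services are up even though the CLI reports a failure, one could probe the ports directly. This is a minimal sketch; the port numbers are what I understand to be the defaults of a local Confluent stack and are an assumption, as are the function names.

```python
import socket

# Assumed default ports; adjust if the configuration was changed.
DEFAULT_PORTS = {
    "zookeeper": 2181,
    "kafka": 9092,
    "schema-registry": 8081,
    "kafka-rest": 8082,
    "connect": 8083,
}


def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def stack_status(host="localhost", ports=DEFAULT_PORTS):
    """Map each service name to whether its port accepts connections."""
    return {name: is_listening(host, port) for name, port in ports.items()}
```

Running stack_status() right after "confluent local services start" reports a failure would show whether the service is in fact accepting connections, which separates a real startup problem from a status check that times out.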

How to send logs from Google Stackdriver to Kafka

I see many docs and posts about how to send logs to Stackdriver but almost no information about how to do the opposite - send logs from the Stackdriver to Kafka.
In my case, our Ops want to collect the logs from our web servers using Google's Stackdriver agents and push them to Stackdriver ... However, for my stream processing needs I want to get the logs into Kafka to use its unparalleled ability to retain data and have it reprocessed by any number of consumers, something that I cannot do with PubSub.
So, what are the options for doing this? I only saw a couple of possible avenues - neither sounds too good:
based on this post: (https://powerspace.tech/how-to-stream-data-from-google-pubsub-to-kafka-with-kafka-connect-dbef1c340a76) push data into PubSub first, and then read from it using either Kafka connector or write my own Kafka consumer. I hate the thought of adding yet another hop (serialize/deserialize/ack/etc.) between the source of data and Kafka ....
I noticed a brief mention of adding a plugin to Google's version of Fluentd (which is what the Stackdriver log collection agent is based on) here: https://powerspace.tech/how-to-stream-data-from-google-pubsub-to-kafka-with-kafka-connect-dbef1c340a76 . Not many details - so hard to tell how involved this approach is ...
Any other options?
Thank you!
Go into the Kafka console and add some elements. Once you have added them, check that they are reflected in the Cloud Shell by running this command:
gcloud pubsub subscriptions pull from-kafka --auto-ack --limit=10
It takes some time to sync with the Kafka console, so you may need to run the command a couple of times before you get results.
You will run the commands in the Cloud Shell and see the output in the Kafka VM SSH.
[Image 1]
Now you will verify the opposite direction, running the command in the Kafka VM and seeing the output in the Cloud Shell. It will take some time for the output to be reflected, and you may have to run the same gcloud pubsub subscriptions pull command a couple of times to see it. Your output will look like this:
[Image 2]
The Kafka plugin is deprecated. For more information, refer to https://cloud.google.com/stackdriver/docs/deprecations
Note: This functionality is only available for agents running on Linux. It is not available on Windows.
Kafka is monitored via JMX. Monitoring supports monitoring Kafka version 0.8.2 and higher.
On your VM instance, download kafka-082.conf from the GitHub configuration repository and place it in the directory /etc/stackdriver/collectd.d/:
(cd /etc/stackdriver/collectd.d/ && sudo curl -O https://raw.githubusercontent.com/Stackdriver/stackdriver-agent-service-configs/master/etc/collectd.d/kafka-082.conf)
The downloaded plugin configuration file assumes that your Kafka server is configured to accept JMX connections on port 9999. If you have configured Kafka with a different JMX port, as root, edit the file and follow the instructions to change the JMX port settings.
After adding the configuration file, restart the Monitoring agent by running the following command:
sudo service stackdriver-agent restart
What is monitored:
https://cloud.google.com/monitoring/api/metrics_agent#agent-kafka

How to redirect Apache Spark logs from the driver and the slaves to the console of the machine that launches the Spark job using log4j?

I'm trying to build an Apache Spark application that normalizes CSV files from HDFS (changes the delimiter, fixes broken lines). I use log4j for logging, but all the logs are printed on the executors, so the only way I can check them is with the yarn logs -applicationId command. Is there any way I can redirect all logs (from the driver and from the executors) to my gateway node (the one that launches the Spark job) so I can check them during execution?
You should have the executors log4j props configured to write files local to themselves. Streaming back to the driver will cause unnecessary latency in processing.
If you plan on being able to "tail" the logs in near real time, you would need a solution like Splunk or Elasticsearch, together with agents such as Splunk Forwarders, Fluentd, or Filebeat that run on each box, watch all configured log paths, and push that data to a destination indexer, which parses and extracts the log fields.
Now, there are other alternatives like Streamsets or Nifi or Knime (all open source), which offer more instrumentation for collecting event processing failures, and effectively allow for "dead letter queues" to handle errors in a specific way. The part I like about those tools - no programming required.
I think it is not possible. When you execute Spark in local mode you are able to see the logs in the console. Otherwise you have to alter the log4j properties for the log file path.
As per https://spark.apache.org/docs/preview/running-on-yarn.html#configuration,
YARN has two modes for handling container logs after an application has completed. If log aggregation is turned on (with the yarn.log-aggregation-enable config in yarn-site.xml file), container logs are copied to HDFS and deleted on the local machine.
You can also view the container log files directly in HDFS using the HDFS shell or API. The directory where they are located can be found by looking at your YARN configs (yarn.nodemanager.remote-app-log-dir and yarn.nodemanager.remote-app-log-dir-suffix in yarn-site.xml).
I am not sure whether the log aggregation from worker nodes happens in real time.
There is an indirect way to achieve this. Enable the following property in yarn-site.xml.
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
This will store all the logs of the submitted applications in an HDFS location. Then, using the following command, you can download the logs into a single aggregated file.
yarn logs -applicationId application_id_example > app_logs.txt
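Once the aggregated file exists, it can be split per container with a few lines of Python. This sketch assumes each container's section in the yarn logs output starts with a header line of the form "Container: <container id> on <host>". That is how the output looked in the versions I have seen, but it should be checked against the actual file. The function name is made up.

```python
import re
from collections import defaultdict

# Assumed header format: "Container: <id> on <host>_<port>"
CONTAINER_HEADER = re.compile(r"^Container: (\S+) on (\S+)")


def split_by_container(text):
    """Group the lines of an aggregated log by the container they belong to."""
    sections = defaultdict(list)
    current = None
    for line in text.splitlines():
        match = CONTAINER_HEADER.match(line)
        if match:
            # A new container section starts here.
            current = match.group(1)
        elif current is not None:
            sections[current].append(line)
    return {cid: "\n".join(lines) for cid, lines in sections.items()}
```

Reading app_logs.txt and passing its contents to split_by_container gives one string per container. Each can then be written to its own file or searched for the application's own log4j lines.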
I came across this github repo which downloads the driver and container logs separately. Clone this repository : https://github.com/hammerlab/yarn-logs-helpers
git clone --recursive https://github.com/hammerlab/yarn-logs-helpers.git
In your .bashrc (or equivalent), source .yarn-logs-helpers.sourceme:
$ source /path/to/repo/.yarn-logs-helpers.sourceme
Then download the aggregated logs into nicely segregated driver and container logs by this command.
yarn-container-logs application_example_id