How to run an Apache Beam pipeline built in Golang that connects to Pub/Sub on the DirectRunner - apache-beam

I'm trying to build an Apache Beam pipeline in Golang that connects to Pub/Sub.
I'm trying to run the pipeline on the DirectRunner. The official documentation says to set the streaming option to true, but the Golang SDK has no streaming option.
My question is: how can I run a pipeline that connects to Pub/Sub in Golang on the DirectRunner?

With the current implementation of pubsubio in the Go SDK, you can use it only on the Dataflow runner. You can take a look at this example to get started.

Related

How to import and use kafka-connect-datagen in a Spark application

We need to perform unit testing for our real-time streaming application written in Scala/Spark.
One option is to use embedded-kafka for Kafka test-case simulation.
The other option is to use kafka-connect-datagen - https://github.com/confluentinc/kafka-connect-datagen
The examples found on various blogs only cover the CLI.
What I'm looking for is an example of importing and using kafka-connect-datagen within a Scala application.
I'd appreciate any good resource on kafka-connect-datagen, or on simulating a streaming application within a Scala application.
Kafka Connect is meant to run standalone.
You can use the Testcontainers project to start a broker and a Connect worker, then run the datagen connector from there.
Otherwise, for more rigorous testing, write your own KafkaProducer.send calls with data you control, as sketched below.
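As a rough illustration of that last suggestion, here is a minimal Scala sketch (broker address, topic name, and payloads are all made up) that feeds deterministic test records to the input topic with a plain KafkaProducer:

```scala
import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

object TestDataGenerator extends App {
  val props = new Properties()
  // Hypothetical broker address; with Testcontainers this would be the container's bootstrap servers.
  props.put("bootstrap.servers", "localhost:9092")
  props.put("key.serializer", classOf[StringSerializer].getName)
  props.put("value.serializer", classOf[StringSerializer].getName)

  val producer = new KafkaProducer[String, String](props)

  // Send a small batch of records whose content you fully control,
  // so the assertions in your streaming test are deterministic.
  (1 to 100).foreach { i =>
    producer.send(new ProducerRecord[String, String]("test-input-topic", s"key-$i", s"""{"id": $i, "status": "OK"}"""))
  }

  producer.flush()
  producer.close()
}
```

If you do go the Testcontainers route, the same producer can point at the broker the test starts, so the whole test stays self-contained.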

Test Kafka and Flink integration flow

I would like to test Kafka/Flink integration, for example with FlinkKafkaConsumer011 and FlinkKafkaProducer011.
The process will be:
read from a Kafka topic with Flink
some manipulation with Flink
write into another Kafka topic with Flink
With a string example it would be: read a string from the input topic, convert it to uppercase, and write it into a new topic.
The question is: how do I test this flow?
By test I mean a unit/integration test.
Thanks!
The Flink documentation has a short page on how you can write unit/integration tests for your transformation operators: link. The page also has a section on testing checkpointing and state handling, and on using AbstractStreamOperatorTestHarness. For the uppercase example, the transformation itself boils down to a plain function test, as sketched below.
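A minimal Scala sketch of such a unit test, assuming ScalaTest (the class names here are made up); it needs no cluster at all:

```scala
import org.apache.flink.api.common.functions.MapFunction
import org.scalatest.funsuite.AnyFunSuite

// The transformation under test: the same function you would pass to stream.map(...)
// between the Kafka source and the Kafka sink.
class UppercaseMapper extends MapFunction[String, String] {
  override def map(value: String): String = value.toUpperCase
}

class UppercaseMapperTest extends AnyFunSuite {
  test("converts input strings to uppercase") {
    val mapper = new UppercaseMapper
    assert(mapper.map("hello flink") == "HELLO FLINK")
  }
}
```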
However, I think you are more interested in end-to-end integration testing, including testing sources and sinks. For that, you can start a Flink mini cluster. Here is a link to example code that starts a Flink mini cluster: link. A sketch of that pattern follows.
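A hedged Scala sketch of the mini-cluster pattern, adapted to the uppercase example (it assumes the flink-test-utils dependency; the collecting sink and job wiring are illustrative, not the exact code behind the link):

```scala
import java.util.{ArrayList, Arrays, Collections}

import org.apache.flink.runtime.testutils.MiniClusterResourceConfiguration
import org.apache.flink.streaming.api.functions.sink.SinkFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.test.util.MiniClusterWithClientResource

object CollectSink {
  // Shared, synchronized buffer so all parallel sink instances write to the same place.
  val values: java.util.List[String] = Collections.synchronizedList(new ArrayList[String]())
}

class CollectSink extends SinkFunction[String] {
  override def invoke(value: String): Unit = CollectSink.values.add(value)
}

object UppercaseJobIT extends App {
  // Local mini cluster standing in for a real Flink deployment.
  val flinkCluster = new MiniClusterWithClientResource(
    new MiniClusterResourceConfiguration.Builder()
      .setNumberTaskManagers(1)
      .setNumberSlotsPerTaskManager(1)
      .build())

  flinkCluster.before()
  try {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    // In the real flow these elements would come from FlinkKafkaConsumer011,
    // and the sink would be FlinkKafkaProducer011.
    env.fromElements("a", "b", "c")
      .map(_.toUpperCase)
      .addSink(new CollectSink)

    env.execute("uppercase-mini-cluster-test")
    assert(CollectSink.values.containsAll(Arrays.asList("A", "B", "C")))
  } finally {
    flinkCluster.after()
  }
}
```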
You can also launch a Kafka broker within a JVM and use it for your testing purposes. Flink's Kafka connector does that for its integration tests. Here is sample code that starts the Kafka server: link.
If you are running locally, you can use a simple generator app to produce messages for your source Kafka topic (there are many available; you can generate messages continuously or at a configured interval). Here is an example of how you can set Flink's job global parameters when running locally: Kafka010Example.
Another alternative is to create an integration environment (as opposed to production) to run your end-to-end testing. You will get a real feel for how your program will behave in a production-like environment. It is always advisable to have a complete parallel testing environment, including test source/sink Kafka topics.

Is it possible to push messages to Kafka from Google Dataflow?

Is there any way to connect Kafka as a sink in Google Dataflow? I know we can use CloudPubSubConnector with Pub/Sub and Kafka, but I don't want to use Pub/Sub in between Dataflow and Kafka.
Thanks,
Bala
Yes (assuming you are using the Java SDK). See 'Writing to Kafka' with a usage example in the JavaDoc for KafkaIO: https://github.com/apache/beam/blob/release-2.3.0/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L221
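The KafkaIO sink lives in the Beam Java SDK; the following minimal sketch (written here in Scala against that Java API, with made-up broker and topic names) shows how the write transform is typically wired up:

```scala
import org.apache.beam.sdk.Pipeline
import org.apache.beam.sdk.coders.{KvCoder, StringUtf8Coder}
import org.apache.beam.sdk.io.kafka.KafkaIO
import org.apache.beam.sdk.options.PipelineOptionsFactory
import org.apache.beam.sdk.transforms.Create
import org.apache.beam.sdk.values.KV
import org.apache.kafka.common.serialization.StringSerializer

object KafkaSinkExample extends App {
  val pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args: _*).create())

  pipeline
    // Stand-in for whatever upstream transforms produce your key/value pairs.
    .apply(Create.of(KV.of("user-1", "hello"))
      .withCoder(KvCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of())))
    // Write each KV element to Kafka; broker address and topic are hypothetical.
    .apply(KafkaIO.write[String, String]()
      .withBootstrapServers("broker-1:9092")
      .withTopic("dataflow-output")
      .withKeySerializer(classOf[StringSerializer])
      .withValueSerializer(classOf[StringSerializer]))

  pipeline.run().waitUntilFinish()
}
```

When submitting with --runner=DataflowRunner, make sure the Dataflow workers can reach the Kafka brokers over the network (VPN, peering, or public listeners).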
If you're writing Dataflow jobs in Python, you can use Confluent's Kafka client (https://github.com/confluentinc/confluent-kafka-python) and write your own Beam sink/source interface. There is a guide for writing your own sinks/sources in Beam: https://beam.apache.org/documentation/sdks/python-custom-io/

Spark Streaming with NiFi

I am looking for a way to make use of Spark Streaming in NiFi. I have seen a couple of posts where a Site-to-Site TCP connection is used for the Spark Streaming application, but I think it would be better if I could launch Spark Streaming from a NiFi custom processor.
PublishKafka would publish messages into Kafka, and then a NiFi Spark Streaming processor would read from the Kafka topic.
I can launch a Spark Streaming application from a custom NiFi processor using the Spark launcher API, but the biggest challenge is that it would create a Spark streaming context for each flow file, which can be a costly operation.
Does anyone suggest storing the Spark streaming context in a controller service? Or is there any better approach for running a Spark Streaming application with NiFi?
You can make use of ExecuteSparkInteractive to run the Spark code that you are trying to include in your Spark Streaming application.
You need a few things set up for the Spark code to run from within NiFi:
Set up a Livy server
Add NiFi controller services to start Spark Livy sessions:
LivySessionController
StandardSSLContextService (may be required)
Once you enable LivySessionController within NiFi, it will start Spark sessions, and you can check on the Spark UI whether those Livy sessions are up and running.
Now that the Livy Spark sessions are running, whenever a flow file moves through the NiFi flow, it will run the Spark code within ExecuteSparkInteractive (a hedged sketch of such a snippet follows below).
This is similar to a Spark Streaming application running outside NiFi. For me this approach works very well and is easier to maintain than a separate Spark Streaming application.
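As an illustration only, the kind of snippet you might paste into the ExecuteSparkInteractive processor's Code property could look like this (paths and the filter are made up; Livy's interactive Scala sessions pre-bind the spark session variable):

```scala
// Runs inside the Livy session that LivySessionController keeps open;
// `spark` is the SparkSession the session provides.
val events = spark.read.json("/data/incoming/events")         // hypothetical input location
val cleaned = events.filter("status = 'OK'")                   // hypothetical transformation
cleaned.write.mode("append").parquet("/data/curated/events")   // hypothetical output location
```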
Hope this helps!
I can launch a Spark Streaming application from a custom NiFi processor using the Spark launcher API, but the biggest challenge is that it would create a Spark streaming context for each flow file, which can be a costly operation.
You'd be launching a standalone application in each case, which is not what you want. If you are going to integrate with Spark Streaming or Flink, you should be using something like Kafka to pub-sub between them.

Spark to MongoDB via Mesos

I am trying to connect Apache Spark to MongoDB using Mesos. Here is my architecture:
MongoDB: MongoDB Cluster of 2 shards, 1 config server and 1 query server.
Mesos: 1 Mesos Master, 4 Mesos slaves
Now I have installed Spark on just 1 node. There is not much information available on this out there. I just wanted to pose a few questions:
As far as I understand, I can connect Spark to MongoDB via Mesos. In other words, I end up using MongoDB as the storage layer. Do I really need Hadoop? Is it mandatory to pull all the data into Hadoop just for Spark to read it?
Here is the reason I am asking this. The Spark install expects the HADOOP_HOME variable to be set. This seems like very tight coupling! Most of the posts on the net talk about the MongoDB-Hadoop connector. It doesn't make sense if I'm forced to move everything to Hadoop.
Does anyone have an answer?
Regards
Mario
Spark itself takes a dependency on Hadoop, and data in HDFS can be used as a data source.
However, if you use the Mongo Spark Connector you can use MongoDB as a data source for Spark without going via Hadoop at all, as in the sketch below.
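A minimal Scala sketch of that approach (the Mesos master URL, MongoDB URI, database, and collection are made up; it assumes the mongo-spark-connector package is on the classpath):

```scala
import com.mongodb.spark.MongoSpark
import org.apache.spark.{SparkConf, SparkContext}

object MongoOnMesos extends App {
  val conf = new SparkConf()
    .setAppName("mongo-via-mesos")
    .setMaster("mesos://zk://mesos-master:2181/mesos")            // hypothetical Mesos master URL
    .set("spark.mongodb.input.uri",
         "mongodb://query-server:27017/mydb.mycollection")        // hypothetical MongoDB query router

  val sc = new SparkContext(conf)

  // Reads documents straight from MongoDB; no Hadoop/HDFS in the data path.
  val rdd = MongoSpark.load(sc)
  println(s"Read ${rdd.count()} documents from MongoDB")

  sc.stop()
}
```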
The Spark-Mongo connector is a good idea; moreover, if you are executing Spark in a Hadoop cluster you need to set HADOOP_HOME.
Check your requirements and test it (tutorial). The prerequisites are:
Basic working knowledge of MongoDB and Apache Spark. Refer to the MongoDB documentation and Spark documentation.
A running MongoDB instance (version 2.6 or later).
Spark 1.6.x.
Scala 2.10.x if using the mongo-spark-connector_2.10 package
Scala 2.11.x if using the mongo-spark-connector_2.11 package
The new MongoDB Connector for Apache Spark provides higher performance, greater ease of use, and access to more advanced Spark functionality than the MongoDB Connector for Hadoop (the tutorial includes a table comparing the capabilities of both connectors).
Then you need to configure Spark with Mesos:
Connecting Spark to Mesos
To use Mesos from Spark, you need a Spark binary package available in a place accessible by Mesos, and a Spark driver program configured to connect to Mesos.
Alternatively, you can install Spark in the same location on all the Mesos slaves and configure spark.mesos.executor.home (which defaults to SPARK_HOME) to point to that location; a configuration sketch follows.
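A hedged sketch of the relevant settings for that second option (the install path and master URL are made up):

```scala
import org.apache.spark.SparkConf

// Assumes the same Spark distribution is unpacked at /opt/spark on every Mesos slave.
// For the binary-package option, set spark.executor.uri to an archive reachable by Mesos instead.
val conf = new SparkConf()
  .setMaster("mesos://mesos-master:5050")           // hypothetical Mesos master
  .set("spark.mesos.executor.home", "/opt/spark")   // where executors find Spark on each slave
```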