I found two libraries for working with NATS, for Java and Scala (https://github.com/Logimethods/nats-connector-spark-scala, https://github.com/Logimethods/nats-connector-spark). Writing a separate connector in Scala and sending its output to pySpark feels wrong. Is there any other way to connect pySpark to NATS?
Spark-submit version 2.3.0.2.6.5.1100-53
Thanks in advance.
In general, there is no clean way :( The only option I found is to use a Python NATS client, send its output to a socket, and have pySpark process the data received from that socket.
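A rough sketch of that socket bridge (not an official connector; it assumes the nats-py client, and the host, port, and subject names are just placeholders):

```python
# Process 1: subscribe to NATS and serve the messages over a plain TCP socket.
import asyncio
import nats  # pip install nats-py

clients = set()

async def handle_client(reader, writer):
    # Spark connects here; keep the connection registered until it closes.
    clients.add(writer)
    try:
        await reader.read()
    finally:
        clients.discard(writer)

async def main():
    server = await asyncio.start_server(handle_client, "0.0.0.0", 9999)
    nc = await nats.connect("nats://localhost:4222")

    async def on_msg(msg):
        # Forward each NATS message as one line of text.
        for w in list(clients):
            w.write(msg.data + b"\n")
            await w.drain()

    await nc.subscribe("my.subject", cb=on_msg)
    async with server:
        await server.serve_forever()

asyncio.run(main())
```

And on the pySpark side, something like:

```python
# Process 2: pySpark reads lines from that socket with the DStream API.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="nats-socket-stream")
ssc = StreamingContext(sc, 5)  # 5-second micro-batches
lines = ssc.socketTextStream("localhost", 9999)
lines.pprint()

ssc.start()
ssc.awaitTermination()
```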
Related
We need to perform unit testing for our real-time streaming application written in Scala with Spark.
One option is to use embedded Kafka to simulate Kafka in the test cases.
The other option is to use kafka connect datagen - https://github.com/confluentinc/kafka-connect-datagen
The examples found on various blogs include CLI option.
What I'm looking for is an example of using kafka-connect-datagen from within a Scala application.
I'd appreciate any good resource on kafka-connect-datagen OR on simulating a streaming source within a Scala application.
Kafka Connect is meant to be standalone.
You can use TestContainers project to start a broker and Connect worker, then run datagen connector from there.
Otherwise, for more rigorous testing, write your own KafkaProducer.send calls with data you control.
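The question is about a Scala app, but as a sketch of that second option, here is the same idea with kafka-python (broker address and topic name are placeholders):

```python
# Produce deterministic test fixtures instead of random datagen output.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for i in range(10):
    producer.send("test-input-topic", {"id": i, "event": "click"})

producer.flush()
```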
I want to live-stream data from one system to another.
I am using kafka-python and am able to live stream locally.
I figured out that connectors can handle multiple devices. Can someone suggest a way to use connectors to implement this in Python?
Kafka Connect is a Java Framework, not Python.
Kafka Connect exposes a REST API, which you can interact with using urllib3 or requests, not kafka-python.
https://kafka.apache.org/documentation/#connect
Once you create a connector, you are welcome to use kafka-python to produce data that, for example, a JDBC sink connector would consume; or you can use pandas to write to a database that a JDBC source connector (or Debezium) would then capture.
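Putting both pieces together, a rough sketch (connector name, hosts, and connection settings are placeholders, not a tested configuration):

```python
# Create a JDBC sink connector through the Kafka Connect REST API with
# requests, then produce JSON records with kafka-python.
import json
import requests
from kafka import KafkaProducer

connector = {
    "name": "jdbc-sink-example",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
        "topics": "devices",
        "connection.url": "jdbc:postgresql://db:5432/metrics",
        "connection.user": "user",
        "connection.password": "secret",
    },
}
requests.post("http://connect:8083/connectors", json=connector).raise_for_status()

producer = KafkaProducer(
    bootstrap_servers="broker:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("devices", {"device_id": "sensor-1", "reading": 21.5})
producer.flush()
```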
I have a transformation in Pentaho Data Integration (PDI) that queries NetSuite, builds a JSON string for each row, and finally sends these strings to Kafka. This is the transformation:
When I test the transform against my local Kafka it works like a charm, as you can see below:
The problem appears when I substitute the connection parameters with those of an AWS EC2 instance where I also have Kafka running: the transformation does not raise errors, but the messages never reach Kafka, as can be seen here:
This is the configuration of the Kafka Producer step of the transformation:
The strange thing is that although it does not send the messages to Kafka, it does seem to connect to the server, because the combo box is populated with the names of the topics I have:
In addition, this error is observed in the PDI terminal:
ERROR [NamedClusterServiceLocatorImpl] Could not find service for interface org.pentaho.hadoop.shim.api.jaas.JaasConfigService associated with named cluster null
Which doesn't make sense to me because I'm using a direct connection and not a connection to a Hadoop Cluster.
So I wanted to ask the members of this community whether anyone has used PDI to send messages to Kafka, and whether they had to make any configuration in PDI or Kafka to achieve it, since I cannot figure out what could be happening.
Thanks in advance for any ideas or comments to help me solve this!
I am new to spark.
I want to know the use of Spark job servers like Ooyala's spark-jobserver and Livy.
Is it that Spark code cannot communicate with HDFS directly?
Do we need an intermediary like a web service to send and receive data?
Thanks and Regards,
ooyala and Livy
Persistent Spark context or not, they help you run Spark like an HTTP service: you can create or run a Spark job directly from an HTTP call instead of launching it from the command line or from a cron job, for instance.
is spark code cannot communicate with HDFS directly
No. It can communicate with HDFS directly.
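For example, a minimal pySpark job that reads from and writes to HDFS with no web service in between (paths and the namenode address are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-direct").getOrCreate()

# Read a text file straight from HDFS and write the result back to HDFS.
df = spark.read.text("hdfs://namenode:8020/data/input.txt")
df.write.mode("overwrite").parquet("hdfs://namenode:8020/data/output")
```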
Do we need a medium like webservice to send and receive data.
Actually, Ooyala's spark-jobserver and Livy return the result as JSON, as you would get from an API call.
So it is up to you whether to build such an intermediary or not.
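For instance, a Livy batch submission looks roughly like this (host, port, and the job file path are placeholders):

```python
import requests

livy = "http://livy-server:8998"

# Submit a Spark job as a Livy batch over HTTP.
resp = requests.post(
    f"{livy}/batches",
    json={"file": "hdfs:///jobs/my_job.py", "args": ["2025-01-01"]},
)
batch = resp.json()

# Poll the batch state; the response comes back as JSON, like any API call.
state = requests.get(f"{livy}/batches/{batch['id']}/state").json()
print(state)
```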
Is it possible to connect to memsql from pyspark?
I heard that MemSQL recently built the Streamliner infrastructure on top of pySpark to allow for custom Python transformations.
But does this mean I can run pySpark, or submit a Python Spark job, that connects to MemSQL?
Yes to both questions.
Streamliner is the best approach if your aim is to get data into MemSQL or perform transformation during ingest. How to use Python with Streamliner: http://docs.memsql.com/latest/spark/memsql-spark-interface-python/
You can also query MemSQL from a Spark application. Details on that here: http://docs.memsql.com/latest/spark/spark-sql-pushdown/
You can also run a Spark shell. See http://docs.memsql.com/latest/ops/cli/SPARK-SHELL/ & http://docs.memsql.com/latest/spark/admin/#launching-the-spark-shell
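As a rough sketch of querying MemSQL from a pySpark application (this uses plain JDBC over the MySQL protocol rather than the connector described in the docs above; host, table, and credentials are placeholders, and the MySQL JDBC driver must be on the classpath):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("memsql-query").getOrCreate()

# MemSQL speaks the MySQL wire protocol, so a standard JDBC read works.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://memsql-host:3306/mydb")
    .option("dbtable", "events")
    .option("user", "root")
    .option("password", "secret")
    .load()
)
df.show()
```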