I am new to Spark.
I want to know what Spark job servers like Ooyala's and Livy are used for.
Is it that Spark code cannot communicate with HDFS directly?
Do we need a medium like a web service to send and receive data?
Thanks and Regards,
Ooyala and Livy
These job servers let you run Spark like an HTTP service, with or without a persistent Spark context: you can create or run a Spark job directly from an HTTP call instead of launching it from the command line or a cron job, for instance.
Is it that Spark code cannot communicate with HDFS directly?
No, Spark code can communicate with HDFS directly.
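For example, here is a minimal PySpark sketch that reads straight from HDFS; the namenode address and path are placeholders, not values from the question:

    # Minimal sketch: Spark reads from HDFS through its built-in Hadoop client,
    # with no web service in between. "namenode:8020" and the path are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hdfs-direct-read").getOrCreate()

    df = spark.read.text("hdfs://namenode:8020/data/events")
    df.show(5)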
Do we need a medium like a web service to send and receive data?
Actually, Ooyala's job server and Livy return the result as JSON, just like a call to any API.
So whether you need to build such a medium depends on your use case.
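As an illustration, here is a hedged sketch of submitting a job through Livy's batch REST API with Python's requests library; the Livy host, application path, and arguments are placeholder assumptions:

    # Sketch of submitting a Spark job over HTTP via Livy's /batches endpoint.
    # "livy-host:8998" and the HDFS path to the application are placeholders.
    import requests

    livy_url = "http://livy-host:8998"

    # Ask Livy to launch the job; no command line or cron job involved.
    resp = requests.post(
        livy_url + "/batches",
        json={"file": "hdfs:///apps/my_job.py", "args": ["2024-01-01"]},
        headers={"Content-Type": "application/json"},
    )
    batch = resp.json()

    # Livy answers with JSON, e.g. the batch id and its current state.
    print(batch["id"], batch["state"])

    # Poll the state until the job finishes.
    state = requests.get("{}/batches/{}/state".format(livy_url, batch["id"])).json()
    print(state)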
I want to live stream from one system to another.
I am using kafka-python and am able to live stream locally.
I figured out that connectors can handle multiple devices. Can someone suggest a way to use connectors to implement this in Python?
Kafka Connect is a Java framework, not a Python one.
Kafka Connect exposes a REST API, which you interact with using urllib3 or requests rather than kafka-python:
https://kafka.apache.org/documentation/#connect
Once you create a connector, you are welcome to use kafka-python to produce data that the JDBC sink connector would consume, for example; or you can use pandas to write to a database that the JDBC source connector (or Debezium) would then pick up.
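For example, here is a hedged sketch of registering a connector through that REST API with requests; the Connect host and the JDBC sink settings below are placeholder values, not a confirmed setup:

    # Sketch: create and list connectors through Kafka Connect's REST API.
    # "connect-host:8083" and the sink configuration are placeholder assumptions.
    import requests

    connect_url = "http://connect-host:8083"

    connector = {
        "name": "jdbc-sink-example",
        "config": {
            "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
            "topics": "device-events",
            "connection.url": "jdbc:postgresql://db-host:5432/mydb",
            "connection.user": "user",
            "connection.password": "secret",
            "auto.create": "true",
        },
    }

    # Register the connector; Kafka Connect itself does the consuming and writing.
    requests.post(connect_url + "/connectors", json=connector).raise_for_status()

    # Confirm it is registered.
    print(requests.get(connect_url + "/connectors").json())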
I found two libraries for working with NATS, one for Java and one for Scala (https://github.com/Logimethods/nats-connector-spark-scala, https://github.com/Logimethods/nats-connector-spark). Writing a separate connector in Scala and sending its output to pySpark feels wrong. Is there any other way to connect pySpark to NATS?
Spark-submit version 2.3.0.2.6.5.1100-53
Thanks in advance.
In general, there is no clean way :( The only option I found was to use a Python NATS client, send its output to a socket, and have pySpark process the data received there.
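A minimal sketch of the Spark side of that approach, assuming a separate process (for example a Python NATS subscriber) writes one message per line to localhost:9999; the host, port, and batch interval are assumptions:

    # Sketch: pySpark (2.x DStream API) reads lines from a local socket that a
    # NATS subscriber is assumed to feed; host, port and interval are placeholders.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="nats-via-socket")
    ssc = StreamingContext(sc, batchDuration=5)  # 5-second micro-batches

    lines = ssc.socketTextStream("localhost", 9999)
    lines.pprint()  # replace with real processing

    ssc.start()
    ssc.awaitTermination()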
As of now, we are using curl with a GET call to fetch data from an external system via their endpoint URL.
I'm planning to set up a new process; is there any way to leverage Kafka here instead of curl?
Unfortunately, we don't have the Confluent version of Kafka.
Kafka doesn't perform or accept HTTP calls.
You'd have to write some HTTP scraper, then combine this with a Kafka producer.
Note: you don't need Confluent Platform just to set up and run their REST Proxy next to your existing brokers.
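Here is a hedged sketch of that scraper-plus-producer combination using requests and kafka-python; the endpoint URL, broker address, topic name, and polling interval are all placeholder assumptions:

    # Sketch: poll an HTTP endpoint and forward each response into Kafka.
    # The URL, broker, topic and interval below are placeholders.
    import json
    import time

    import requests
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers=["broker1:9092"],
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    while True:
        # The same GET call you make with curl today, done from Python instead.
        payload = requests.get("https://example.com/api/endpoint").json()
        producer.send("external-data", value=payload)
        producer.flush()
        time.sleep(60)  # poll once a minute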
I am looking for a way to make use of Spark Streaming in NiFi. I have seen a couple of posts where a site-to-site TCP connection is used for the Spark Streaming application, but I think it would be better if I could launch Spark Streaming from a custom NiFi processor.
PublishKafka would publish messages to Kafka, and then the NiFi Spark Streaming processor would read from the Kafka topic.
I can launch a Spark Streaming application from a custom NiFi processor using the Spark Streaming launcher API, but the biggest challenge is that it creates a Spark streaming context for each flow file, which can be a costly operation.
Would anyone suggest storing the Spark streaming context in a controller service? Or is there a better approach for running a Spark Streaming application with NiFi?
You can use ExecuteSparkInteractive to run the Spark code that you are trying to include in your Spark Streaming application.
You need a few things set up for the Spark code to run from within NiFi:
Set up a Livy server
Add NiFi controller services to start Spark Livy sessions:
LivySessionController
StandardSSLContextService (may be required)
Once you enable LivySessionController within NiFi, it will start Spark sessions, and you can check on the Spark UI whether those Livy sessions are up and running.
Now that the Livy Spark sessions are running, whenever a flow file moves through the NiFi flow, it will run the Spark code within ExecuteSparkInteractive, as in the sketch below.
This is similar to a Spark Streaming application running outside NiFi. For me this approach works very well and is easy to maintain compared to having a separate Spark Streaming application.
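For reference, a sketch of the kind of PySpark snippet that could go into the ExecuteSparkInteractive processor; the staging path and target table are hypothetical, and the spark session is assumed to come from the Livy session kept alive by LivySessionController:

    # Hypothetical per-flow-file snippet run through Livy by ExecuteSparkInteractive.
    # The "spark" session is provided by the Livy session; the path and table names
    # are placeholders. (Outside Livy you would build the session yourself with
    # SparkSession.builder.getOrCreate().)
    staged = spark.read.json("hdfs:///staging/events/latest")

    # Light per-batch transformation, then append to the reporting table.
    cleaned = staged.dropDuplicates(["event_id"]).filter("event_type IS NOT NULL")
    cleaned.write.mode("append").saveAsTable("reporting.events")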
Hope this helps!!
I can launch a Spark Streaming application from a custom NiFi processor using the Spark Streaming launcher API, but the biggest challenge is that it creates a Spark streaming context for each flow file, which can be a costly operation.
You'd be launching a standalone application in each case, which is not what you want. If you are going to integrate with Spark Streaming or Flink, you should be using something like Kafka to pub-sub between them.
We are working on Qubole with Spark version 2.0.2.
We have a multi-step process in which all the intermediate steps write their output to HDFS and later this output is used in the reporting layer.
As per our use case, we want to avoid writing to HDFS, keep all the intermediate output as temporary tables in Spark, and write only the final reporting-layer output.
For this implementation, we wanted to use the job server provided by Qubole, but when we try to trigger multiple queries on it, the job server runs our jobs sequentially.
I have observed the same behavior on a Databricks cluster as well.
The cluster we are using has 30 r4.2xlarge nodes.
Does anyone have experience running multiple jobs using a job server?
The community's help will be greatly appreciated!