How can I queue REST requests using Apache Livy and Apache Spark? - scala

I have a remote process that sends thousands of requests to my humble Spark Standalone cluster:
3 worker nodes with 4 cores and 8 GB each
An identical master node, where the driver runs
The cluster hosts a simple data-processing service developed in Scala. The requests are sent via a curl command with some parameters for the .jar, through the Apache Livy REST interface, like this:
curl -s -X POST -H "Content-Type: application/json" REMOTE_IP:8998/batches -d '{"file":"file://PATH_TO_JAR/target/MY_JAR-1.0.0-jar-with-dependencies.jar","className":"project.update.MY_JAR","args":["C1:1","C2:2","C3:3"]}'
Each request triggers a Spark job. Resource scheduling for the cluster is dynamic, so it can serve at most 3 requests at a time; when a worker goes idle, another queued request is served.
At some point the requests exhaust the master node's memory even though they are in the WAITING state (because Spark registers every submitted job), the master node hangs, and the workers lose their connection to it.
Is there a way to queue these requests so that Spark doesn't hold RAM for them, and to process another request from that queue only when a worker is free?
This question is similar: it says that yarn.scheduler.capacity.max-applications only allows N RUNNING applications, but I can't figure out whether that is the solution I need. As far as I'm aware, Apache Livy doesn't have this functionality.
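One client-side workaround would be to poll Livy's GET /batches endpoint and only submit a new batch when fewer than 3 batches are still active. A rough sketch in bash (assuming jq is available; which states count as "active" is an assumption you may need to adjust):

# Hypothetical client-side throttle: submit only when fewer than MAX_ACTIVE
# batches are in a non-terminal state on the Livy server.
LIVY=REMOTE_IP:8998
MAX_ACTIVE=3
while true; do
  active=$(curl -s "$LIVY/batches" | jq '[.sessions[] | select(.state=="starting" or .state=="running")] | length')
  if [ "$active" -lt "$MAX_ACTIVE" ]; then
    curl -s -X POST -H "Content-Type: application/json" "$LIVY/batches" \
      -d '{"file":"file://PATH_TO_JAR/target/MY_JAR-1.0.0-jar-with-dependencies.jar","className":"project.update.MY_JAR","args":["C1:1","C2:2","C3:3"]}'
    break
  fi
  sleep 5
done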

Related

How to get dedicated Apache Kafka MirrorMaker 2.0 Rest API exposed

I am trying to reach a dedicated MirrorMaker 2.0 cluster to see the status of connectors/tasks etc. In this README in their git repo, the Apache Kafka people claim that when started with dedicated.mode.enable.internal.rest=true, MirrorMaker nodes open an internal listener port to communicate with each other.
My question is: is there a way to expose this port externally so I can send curl requests to the dedicated MirrorMaker nodes, as we normally do with curl http://localhost:8083/connectors, to see the running connectors etc.?
I have already tried multiple solutions I found online; they simply do not work. It seems to me this is impossible when you start MirrorMaker 2.0 with ./bin/connect-mirror-maker. I know this is possible if I add every single required connector manually to an existing Kafka Connect cluster, but that's not what I am looking for.
I am also curious whether there is a way to add the dedicated MirrorMaker cluster connectors to an already running Kafka Connect cluster.
This is important because we would like to use curl responses to check task status for MirrorMaker.
Thanks.
You should be able to run connect-distributed as normal, have its REST API available, and then configure and monitor MM2 without using its dedicated scripts. Similarly, this is how you'd add the MM2 connectors to any other existing Connect cluster.
Ideally, you should monitor via JMX instead, where you get a count of the running tasks, rather than using curl. Or, add Jolokia or the Prometheus JMX Exporter to run their own HTTP server, then curl that and grep for the task metrics.
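As a rough sketch of the first suggestion: with a plain connect-distributed worker running, you could POST the MM2 source connector to it like any other connector. The cluster aliases, bootstrap servers and topic pattern below are placeholders; check the MM2 docs for the full set of required properties:

curl -s -X POST -H "Content-Type: application/json" http://localhost:8083/connectors -d '{
  "name": "mm2-source",
  "config": {
    "connector.class": "org.apache.kafka.connect.mirror.MirrorSourceConnector",
    "source.cluster.alias": "source",
    "target.cluster.alias": "target",
    "source.cluster.bootstrap.servers": "source-broker:9092",
    "target.cluster.bootstrap.servers": "target-broker:9092",
    "topics": ".*"
  }
}'

Then the usual GET http://localhost:8083/connectors/mm2-source/status call gives you the connector/task states you were after.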

Kafka Connector - distributed - load balancing tasks

I am running a development environment for Confluent Kafka, Community edition, on Windows, version 3.0.1-2.11.
I am trying to achieve load balancing of tasks between 2 instances of a connector. I am running Kafka ZooKeeper, the broker, REST services and 2 instances of Connect distributed on the same machine.
The only difference between the properties files for the two instances is the REST port, since they are running on the same machine.
I don't create topics for connector offsets, config, status. Should I?
I have custom code for a sink connector.
When I create a worker for my sink connector, I do this by executing a POST request
POST http://localhost:8083/connectors
against either of the running Connect instances. Checking whether the worker is loaded is done at
GET http://localhost:8083/connectors
My sink connector has System.out.println() lines in its code, with which I can follow the output in the console log.
When my worker is running, I can see that only one Connect instance is executing the connector code. If I terminate that instance, the other one takes over the worker and execution resumes. However, this is not what I want.
My goal is for both instances to run the worker code so that they can share the load between them.
I've tried to go over some open-source connectors to see whether there are specifics in how connector code should be written, but with no success.
I've made several different attempts to tackle this problem, also without success.
I could rewrite my business code to work around this, but I'm pretty sure I'm missing something that isn't obvious to me.
Recently I commented on Robin Moffatt's answer to this question.
From the sound of it, your custom code is not correctly spawning the number of tasks that you are expecting.
Make sure that you've set tasks.max > 1 in your config.
Make sure that your connector is correctly creating the appropriate number of task configurations in taskConfigs.
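As a quick sanity check, the Connect REST API can tell you how many tasks were actually created and which worker each one is assigned to (the connector name my-sink-connector is a placeholder):

curl -s http://localhost:8083/connectors/my-sink-connector/status
# The response contains a "tasks" array; if it has only one element,
# the connector never asked for more than one task.
curl -s http://localhost:8083/connectors/my-sink-connector/tasks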
References:
https://opencredo.com/blogs/kafka-connect-source-connectors-a-detailed-guide-to-connecting-to-what-you-love/
https://docs.confluent.io/current/connect/devguide.html
https://enfuse.io/a-diy-guide-to-kafka-connectors/

Starting multiple connectors in Kafka Connect within a single distributed worker?

How do you start multiple Kafka connectors in a Kafka Connect setup within a single distributed worker (running on 3 different servers)?
Right now I need 4 Kafka connectors in this distributed worker (same group.id).
Currently, I am adding one connector at a time using following curl command.
curl -X POST -H "Content-type: application/json" -d '<my_single_connector_config>' 'http://localhost:8083/connectors'
Issue:
For each new connector I add, the previous/existing connector(s) restart along with the new one.
Question:
How should I start/create all these new connectors with one REST call in a distributed worker mode?
Is there any way to have all connector configs in a single REST call, like an array of connector configs?
I tried to search for the same but didn't come across any workaround for this.
Thanks.
For each new connector I add, the previous/existing connector(s) restart along with the new one.
Yes, that's the current behaviour of Kafka Connect. For further discussion see:
https://issues.apache.org/jira/browse/KAFKA-5505
https://cwiki.apache.org/confluence/display/KAFKA/Incremental+Cooperative+Rebalancing%3A+Support+and+Policies
How should I start/create all these new connectors with one REST call in a distributed worker mode?
Is there any way to have all connector configs in a single REST call, like an array of connector configs?
You can't do it in a single REST call; each POST to /connectors creates exactly one connector.
If you want to isolate your connectors from each other when creating/updating them, you can just run multiple distributed clusters.
So instead of 1 distributed Connect cluster running 3 connectors, you could have 3 distributed Connect clusters each running 1 connector.
Remember that in practice a 'distributed cluster' could consist of just a single node, and indeed all of them could run on the same machine. You'd scale out for resilience and throughput capacity.
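If the goal is simply to avoid issuing the four curl commands by hand, you can script the sequential calls; note that each POST is still a separate request and still triggers a rebalance (the file names below are placeholders for your connector config JSON files):

for cfg in connector-1.json connector-2.json connector-3.json connector-4.json; do
  curl -s -X POST -H "Content-Type: application/json" \
    -d @"$cfg" http://localhost:8083/connectors
done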

Kafka connect cluster setup or launching connect workers

I am going through Kafka Connect, and I am trying to get the concepts.
Let us say I have a Kafka cluster (nodes k1, k2 and k3) set up and running, and now I want to run Kafka Connect workers on different nodes, say c1 and c2, in distributed mode.
A few questions.
1) To run or launch Kafka Connect in distributed mode I need to use the command ../bin/connect-distributed.sh, which is available on the Kafka cluster nodes. So do I need to launch Kafka Connect from one of the Kafka cluster nodes, or does any node from which I launch Kafka Connect need to have the Kafka binaries so that I can use ../bin/connect-distributed.sh?
2) Do I need to copy my connector plugins to the Kafka cluster node (or to all cluster nodes?) from where I do step 1?
3) How does Kafka copy these connector plugins to the worker node before starting the JVM process there? The plugin is what contains my task code, and it needs to be copied to the worker in order to start the process on the worker.
4) Do I need to install anything on the Connect cluster nodes c1 and c2, such as Java or anything Kafka Connect related?
5) In some places it says to use the Confluent Platform, but I would like to start with plain Apache Kafka Connect first.
Can someone please throw some light on this? Even a pointer to some resources would help.
Thank you.
1) In order to have a highly available kafka-connect service you need to run at least two instances of connect-distributed.sh on two distinct machines that have the same group.id. You can find more details regarding the configuration of each worker here. For improved performance, Connect should be run independently of the broker and ZooKeeper machines.
2) Yes, you need to place all your connectors under plugin.path (normally under /usr/share/java/) on every machine on which you are planning to run kafka-connect.
3) kafka-connect will load the connectors on startup. You don't need to handle this. Note that if your kafka-connect instance is running and a new connector is added, you need to restart the service.
4) You need to have Java installed on all your machines. For Confluent Platform particularly:
Java 1.7 and 1.8 are supported in this version of Confluent Platform (Java 1.9 is currently not supported). You should run with the Garbage-First (G1) garbage collector. For more information, see the Supported Versions and Interoperability.
5) It depends. Confluent was founded by the original creators of Apache Kafka, and its platform comes as a more complete distribution adding schema management, connectors and clients. It also comes with KSQL, which is quite useful if you need to act on certain events. Confluent simply adds on top of the Apache Kafka distribution; it's not a modified version.
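Regarding points 2) and 3), a quick way to confirm that a worker picked up the jars from its plugin.path is to ask its REST API which plugins it sees (host and port are whatever you configured):

curl -s http://c1:8083/connector-plugins
# Your connector class should appear in the returned list; if it is missing,
# re-check plugin.path on that machine and restart the worker.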
The answer given by Giorgos is correct. I have run a few connectors and now I understand it better.
I am just trying to put it differently.
In Kafka Connect there are two things involved: one is the worker, and the second is the connector. Below are details about running distributed Kafka Connect.
A Kafka Connect worker is a Java process on which the connector/connect tasks run. So the first thing is to launch a worker. To run a worker we need Java installed on that machine, the Kafka Connect sh/bat scripts to launch the worker, and the Kafka libs the worker uses; for this we simply copy/install Kafka on the worker machine. We also need to copy all the connector and connect-task related jars/dependencies into the "plugin.path" defined in the worker properties file. Now the worker machine is ready. To start the worker we invoke ./bin/connect-distributed.sh ./config/connect-distributed.properties, where connect-distributed.properties holds the worker configuration. The same thing has to be repeated on each machine where we need to run Kafka Connect.
Now the worker Java process is running on all machines. The worker config has a group.id property; the workers that share the same value for this property form a group/cluster of workers.
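For reference, a minimal connect-distributed.properties could look roughly like this; the broker addresses, topic names and paths are placeholders, and the converter and internal-topic settings shown are just the commonly used ones:

# Minimal sketch of a distributed worker config (values are placeholders)
bootstrap.servers=k1:9092,k2:9092,k3:9092
group.id=connect-cluster-1
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
# Internal topics the workers in this group use to share state
offset.storage.topic=connect-offsets
config.storage.topic=connect-configs
status.storage.topic=connect-status
# Where the connector jars live on this machine
plugin.path=/usr/share/java
rest.port=8083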
Each worker process exposes a REST endpoint (default http://localhost:8083/connectors). To launch/start a connector on the running workers, we HTTP-POST a connector config JSON; based on the given config, the workers in the group/cluster start the connector and the requested number of tasks.
Example connector POST:
curl -X POST -H "Content-Type: application/json" --data '{"name": "local-file-sink", "config": {"connector.class":"FileStreamSinkConnector", "tasks.max":"3", "file":"test.sink.txt", "topics":"connect-test" }}' http://localhost:8083/connectors

Storm cluster not working in Production mode

I have a Storm topology running on two nodes. One is the nimbus and the other is the supervisor.
A proxy, which is not part of Storm, accepts an HTTP request from a client and passes it to the Storm topology.
The topology is like this:
1. The proxy passes data to a storm spout.
2. The spout passes data to multiple bolts.
3. The result is passed back to the proxy by the last bolt.
I am running the proxy and passing data to Storm. I am able to connect a socket to the listener on the topology side. The data emitted by the spout is shown as 0 in the UI. The same topology works fine in local mode.
I thought it was a problem with the supervisor, but the supervisor seems to be running fine because I am able to see the supervisor description and the individual spouts and bolts. But none of them emit anything.
Now I am wondering whether the problem is the data being passed to the wrong machine, or something else. In order to communicate with the spout, I'm creating the socket from the proxy as follows:
InetAddress stormInetAddr=InetAddress.getByName("198.18.17.16");
int stormPort=4321;
Socket stormSocket=new Socket(stormInetAddr,stormPort);
Here 198.18.17.16 is the nimbus IP, and 4321 is the port where data is expected.
I tried giving the supervisor IP here, and it didn't connect. However, this one does.
Now the proxy waits for the output on a specific port.
On the other side, after processing, data is read from the last bolt, yet there seems to be no activity from the cluster. I am getting a response, but it is basically the same request I sent with some jumbled data; the real response is supposed to be sent by the last bolt to a specific port I defined. So I do get data back, but the cluster shows NO ACTIVITY. I know this is very vague, but does anyone have any idea as to what's happening?
It sounds like Storm is working fine, but your proxy/network settings are not. If it were a Storm error, you would see exceptions in the Nimbus UI and/or in the Storm supervisor logs.
Consider temporarily shutting down Storm and using nc -l 4321 on the supervisor machines to verify your proxy is working as expected.
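A minimal sketch of that check, assuming the supervisor's address is 198.18.17.15 (substitute your actual supervisor IP and port):

# On the supervisor machine: listen on the port the topology would use
nc -l 4321
# From the proxy machine: send a test line and confirm it appears above
echo "hello from proxy" | nc 198.18.17.15 4321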
However...
You may have a fundamental flaw in your model. Storm's spouts are pull-based, so it seems odd to have incoming requests pushed to them. This is possible, of course, if you have your spouts start listening when they spin up and simply queue the requests. However, this presents another challenge for your model: you will likely have multiple spouts running on a single machine and they cannot share the same port (4321).
If you want to meld these two worlds of push and pull, then consider using a Kafka spout.