I have a total of 3 VMs (CloudVPS). Each of them has Java and the Confluent open source platform installed. On VM1 I am running 3 Splunk sink connector processes, which read from different topics and run on different ports, and I posted the JSON configuration to each of them using REST calls.
Since I am running in distributed mode I want to take advantage of the other 2 VMs as well. Can anyone please tell me what to do to add the other 2 VMs to those 3 processes and achieve parallel processing?
You just need to run Kafka Connect in distributed mode on the three VMs (follow the instructions here) and make sure you give them all the same group.id, which identifies them as members of the same cluster (and thus eligible for sharing the workload of tasks across them). More config details for distributed mode here.
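As a rough sketch (the broker address, group id, topic names and port are placeholders, adjust them to your setup), each worker's connect-distributed.properties would share the same group.id and internal topics:
# same group.id on every worker that should join this Connect cluster
group.id=splunk-connect-cluster
bootstrap.servers=broker1:9092
# internal topics shared by all workers in the group
config.storage.topic=connect-configs
offset.storage.topic=connect-offsets
status.storage.topic=connect-status
# only needs to differ if several workers run on the same VM
rest.port=8083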
See also:
https://rmoff.net/2019/11/22/common-mistakes-made-when-configuring-multiple-kafka-connect-workers/
http://rmoff.dev/ksldn19-kafka-connect
I have three Ubuntu VMs (clones) on my local machine which I want to use to make a simple cluster: one VM to be used as a master and the other two as slaves. I can ssh from every VM to every other one successfully, I have the IPs of the two slaves in the conf/slaves file of the master, and the master's IP is in the spark-env.sh of every VM. When I run
start-slave.sh spark://master-ip:7077
from the slaves, they appear in the Spark UI. But when I try to run things in parallel I always get the message about the resources. For testing code I use the Scala shell:
spark-shell --master spark://master-ip:7077 and sc.parallelize(1 until 10000).count.
Do you mean this warning: WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory?
This message will pop up any time an application is requesting more resources from the cluster than the cluster can currently provide.
Spark is only looking for two things: cores and RAM. Cores represent the number of open executor slots that your cluster provides for execution. RAM refers to the amount of free RAM required on any worker running your application.
Note that for both of these resources the maximum value is not your system's max, it is the max as set by your Spark configuration.
If you need to run multiple Spark apps simultaneously, then you'll need to adjust the number of cores being used by each app.
If you are working with applications on the same node, you need to assign cores to each application to make them work in parallel: ResourceScheduling
If you use VMs (as in your situation): assign only one core to each VM when you first create it, or whatever is relevant to your system's resource capacity, because right now Spark requests 4 cores for each of the 2 VMs = 8 cores, which you don't have.
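To keep a single app from grabbing more than the cluster has, a rough sketch of capping its resources at launch time (the master host, core count and memory value are placeholders):
spark-shell --master spark://master-ip:7077 \
  --total-executor-cores 2 \
  --executor-memory 1g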
Here is a tutorial I found that could help you: Install Spark on Ubuntu: Standalone Cluster Mode
Further Reading: common-spark-troubleshooting
I have an Apache Spark standalone setup.
I wish to start 3 workers to run in parallel:
I use the commands below.
./start-master.sh
SPARK_WORKER_INSTANCES=3 SPARK_WORKER_CORES=2 ./start-slaves.sh
I tried to run a few jobs and below are the Apache UI results:
Ignore the last three applications that failed. Below are my questions:
Why do I have just one worker displayed in the UI despite asking Spark to start 3, each with 2 cores?
I want to partition my input RDD for better performance. For the first two jobs, with no partitions, I had a time of 2.7 min. Here my Scala source code had the following:
val tweets = sc.textFile("/Users/soft/Downloads/tweets").map(parseTweet).persist()
In my third job (4.3 min) I had the below:
val tweets = sc.textFile("/Users/soft/Downloads/tweets",8).map(parseTweet).persist()
I expected a shorter time with more partitions (8). Why was this the opposite of what I expected?
Apparently you have only one active worker; you need to investigate why the other workers are not reported, by checking the Spark logs.
More partitions doesn't always mean that the application runs faster. You need to check how you are creating partitions from the source data, the amount of data being partitioned, how much data is being shuffled, etc.
In case you are running on a local machine, it is quite normal to just start a single worker with several CPUs, as shown in the output. It will still split your tasks across the available CPUs of the machine.
Partitioning your file will happen automatically depending on the amount of available resources, and it works quite well most of the time. Spark (and partitioning the files) comes with some overhead, so often, especially on a single machine, Spark adds so much overhead that it will slow down your process. The added value comes with large amounts of data on a cluster of machines.
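If you want to see what Spark actually did with the input, a small sketch reusing the code from the question (parseTweet and the path are from the question; getNumPartitions just reports the partition count):
val tweets = sc.textFile("/Users/soft/Downloads/tweets", 8).map(parseTweet).persist()
// how many partitions the input was actually split into
println(tweets.getNumPartitions)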
Assuming that you are starting a stand-alone cluster, I would suggest using the configuration files to set up the cluster and using start-all.sh to start it.
First, in your spark/conf/slaves (copied from spark/conf/slaves.template), add the IPs (or server names) of your worker nodes.
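For example (the hostnames are placeholders for your own worker machines):
# spark/conf/slaves -- one worker host or IP per line
worker-node-1
worker-node-2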
Configure spark/conf/spark-defaults.conf (copied from spark/conf/spark-defaults.conf.template). Set at least the master to the server that runs your master.
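For example (the hostname is a placeholder):
# spark/conf/spark-defaults.conf
spark.master    spark://master-node:7077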
Use the spark-env.sh (copied from spark-env.sh.template) to configure the cores per worker, memory etc:
export SPARK_WORKER_CORES="2"
export SPARK_WORKER_MEMORY="6g"
export SPARK_DRIVER_MEMORY="4g"
export SPARK_REPL_MEM="4g"
Since it is standalone (and not hosted on a Hadoop environment), you need to share (or copy) the configuration, or rather the complete Spark directory, to all nodes in your cluster. Also, the data you are processing needs to be available on all nodes, e.g. directly from a bucket or a shared drive.
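One possible way to do the copying, assuming the same install path on every node (the path, user and hostnames are placeholders):
rsync -az /opt/spark/ user@worker-node-1:/opt/spark/
rsync -az /opt/spark/ user@worker-node-2:/opt/spark/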
As suggested by #skjagini, check out the various log files in spark/logs/ to see what's going on. Each node will write its own log files.
See https://spark.apache.org/docs/latest/spark-standalone.html for all options.
(We have had a setup like this running for several years and it works great!)
I want to start learning Spark 2.0, so I am trying to set up my dev environment (Scala 2.11).
Spark uses a distributed environment to work on one cluster across multiple separate machines, one node per machine. However, I do not have many machines for my testing purposes; I only have one machine with CentOS 7 on it.
I am not after performance; I need something that simulates a working cluster so that I can learn Spark.
How can I set up a development environment to learn and develop Spark applications without having access to multiple machines, but still be able to learn and write code for a fully functional Spark-based environment?
Start with local mode.
Spark will do everything as usual: spawn executors, distribute tasks, etc. The only step that will be omitted is the transfer of data across the network, and that's done completely under the hood in production, so you don't need to take this omission into account while coding.
You will be able to specify the number of executors (only threads in this mode), and test, for example, the fact that Spark Streaming needs at least 2 of them.
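A minimal sketch of that streaming check (standard Spark Streaming API; the app name and batch interval are arbitrary):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
// local[2]: one thread for the receiver, at least one to process the data;
// with local[1] a receiver-based stream would never make progress
val conf = new SparkConf().setMaster("local[2]").setAppName("local-streaming-test")
val ssc = new StreamingContext(conf, Seconds(5))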
Referring to your comments:
Or it does not make much sense to make a cluster to learn Spark because it is all done under the hood and the programming is all the same on local and, say, standalone/YARN/Mesos mode
Yes, there are some conventions, but they are exactly the same on local and other modes.
Does local mode mean that I will be able to start an exemplary cluster with, say, 3 nodes?
local[3] should do the trick.
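For example, a minimal sketch (the app name is arbitrary; SparkSession requires Spark 2.x):
spark-shell --master local[3]
or, from application code:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().master("local[3]").appName("learning-spark").getOrCreate()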
I have a Nimbus server and 3 other supervisor servers, and I have 11 Storm topologies running. But all of them are running on the Nimbus node only. How do I configure the other supervisors so that the topologies get distributed among the various supervisors? Which configuration files do I have to change?
It seems that there is something funny going on. For the two hosts corona-stage-storm-supervisor-01 and corona-stage-storm-supervisor-02 there are two supervisors each. However, a host should have only one supervisor running. I would assume that this "confuses" Nimbus and it uses the remaining host (corona-storm-nimbus-01), which only has a single supervisor running.
See Storm documentation for more detail (and talk to your admin who did the setup):
https://storm.apache.org/releases/1.0.0/Setting-up-a-Storm-cluster.html
About the number of workers: this parameter defines how many worker JVMs are used for a topology (the supervisor JVM starts worker JVMs that do the actual work; supervisors are basically a "host-local master" for coordination). You can set it in your job configuration via conf.setNumWorkers(int) (see the sketch below). If you want a topology to spread out over multiple hosts, you need to increase this parameter. Nevertheless, for multiple topologies as in your case, a value of one might also be ok; different topologies should run on different hosts, independently of this parameter.
See Storm documentation for more details:
https://storm.apache.org/releases/1.0.0/Understanding-the-parallelism-of-a-Storm-topology.html
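For example, a rough sketch of setting the worker count at submission time (spouts and bolts are omitted, the topology name is a placeholder, and the imports assume Storm 1.x package names):
import org.apache.storm.{Config, StormSubmitter}
import org.apache.storm.topology.TopologyBuilder
val builder = new TopologyBuilder()
// ... setSpout / setBolt calls for your topology go here ...
val conf = new Config()
conf.setNumWorkers(3)  // request 3 worker JVMs, which the scheduler can place on different supervisors
StormSubmitter.submitTopology("my-topology", conf, builder.createTopology())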
I'm interested in using Celery for an app I'm working on. It all seems pretty straightforward, but I'm a little confused about what I need to do if I have multiple load-balanced application servers. All of the documentation assumes that the broker will be on the same server as the application. Currently, all of my application servers sit behind an Amazon ELB and tasks need to be able to come from any one of them.
This is what I assume I need to do:
Run a broker server on a separate instance
Configure each application instance to connect to that broker server
Each application instance will also be a Celery worker (running celeryd)?
My only beef with that is: what happens if my broker instance dies? Can I run 2 broker instances somehow so I'm safe if one goes down?
Any tips or information on what to do in a setup like mine would be greatly appreciated. I'm sure I'm missing something or not understanding something.
For future reference, for those who do prefer to stick with RabbitMQ...
You can create a RabbitMQ cluster from 2 or more instances. Add those instances to your ELB and point your celeryd workers at the ELB. Just make sure you connect to the right ports and you should be all set. Don't forget to allow your RabbitMQ machines to talk among themselves to run the cluster. This works very well for me in production.
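For example (a sketch only; the app module, credentials and ELB hostname are placeholders, and the exact CLI syntax varies between Celery versions), each worker just points its broker URL at the load balancer instead of a single RabbitMQ host:
celery -A yourapp -b amqp://user:password@your-rabbitmq-elb:5672// worker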
One exception here: if you need to schedule tasks, you need a celerybeat process. For some reason, I wasn't able to connect celerybeat to the ELB and had to connect it to one of the instances directly. I opened an issue about it and it is supposed to be resolved (I didn't test it yet). Keep in mind that celerybeat by itself can only exist once, so that's already a single point of failure.
You are correct in all points.
How to make the broker reliable: set up a clustered RabbitMQ installation, as described here:
http://www.rabbitmq.com/clustering.html
Celery beat also doesn't have to be a single point of failure if you run it on every worker node with:
https://github.com/ybrs/single-beat