Partitioning of Apache Spark - scala

I have a cluster consisting of 1 master and 10 worker nodes. When I set the number of partitions to 3, does the master node use only 3 worker nodes, or does it use all of them? I ask because it appears that all of them are used.

It is not entirely clear what you are asking, but the following might help.
When you start the job with 10 executors, the Spark application master gets all the resources from YARN, so all the executors are already associated with the Spark job.
However, if your data has fewer partitions than there are executors available, the remaining executors will sit idle. Hence it is not a good idea to keep the number of partitions below the executor count.
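The arithmetic behind this can be sketched in a few lines. This is a plain Python model, not Spark code: it assumes one task per partition per stage and one core per task (Spark's default), and the function names are hypothetical.

```python
def max_concurrent_tasks(num_partitions, num_executors, cores_per_executor=1):
    """Upper bound on the tasks of one stage that can run at the same time."""
    total_slots = num_executors * cores_per_executor
    return min(num_partitions, total_slots)


def min_idle_executors(num_partitions, num_executors):
    """Executors that cannot receive a task because there are too few partitions."""
    return max(0, num_executors - num_partitions)


# The situation from the question: 3 partitions, 10 executors.
print(max_concurrent_tasks(3, 10))  # 3
print(min_idle_executors(3, 10))    # 7
```

All 10 executors still show up as allocated to the application, which matches what the question observes; they simply have no tasks to run.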

Related

Apache Spark - how to see how many nodes are being used during a job run?

I am using Scala Spark 2.4 and want to know the usage of the queue.
How can I display how many nodes of a big cluster (100+ nodes) are being utilized by a particular job while it is running?
Thanks.

Running a single kafka s3 sink connector in standalone vs distributed mode

I have a Kafka topic "mytopic" with 10 partitions and want to use the S3 sink connector to sink records to an S3 bucket. For scaling purposes it should run on multiple nodes, writing partition data in parallel to the same S3 bucket.
The Kafka Connect user guide and many other blogs/tutorials recommend running workers in distributed mode instead of standalone mode to achieve better scalability and fault tolerance:
... distributed mode is more flexible in terms of scalability and offers the added advantage of a highly available service to minimize downtime.
I want to figure out which mode to choose for my use case: having one logical connector running on multiple nodes in parallel. My understanding is the following:
If I run in distributed mode, I will end up having only 1 worker processing all the partitions, since it's considered one connector task.
Instead I should run in standalone mode in multiple nodes. In that case I will have a consumer group and achieve parallel processing of partitions.
In the standalone scenario described above I will actually have fault tolerance: if one instance dies, the consumer group will rebalance and the other standalone workers will handle the freed partitions.
Is my understanding correct, or am I missing something?
Unfortunately I couldn't find much information on this topic other than this Google Groups discussion, where the author came to the same conclusion as I did.
In theory, that might work, but you'll end up ssh-ing to multiple machines and keeping basically the same config files on each, just to run the connect-standalone command instead of connect-distributed.
You're missing the part about Connect server task rebalancing, though, which communicates over the Connect server REST ports.
The underlying task code is all the same, only the entrypoint and offset storage are different. So, why not just use distributed if you have multiple machines?
You don't need to run multiple standalone processes. In distributed mode the Kafka Connect workers take care of distributing the tasks, rebalancing, and offset management; you need to specify the same group id ...
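A minimal sketch of what the distributed setup could look like, assuming the Confluent S3 sink connector (the question does not say which S3 sink is used). It only builds the JSON payload one would POST to the /connectors endpoint of the Connect REST API; the connector name, bucket, region, and flush.size values are placeholders. The relevant setting is tasks.max: a single connector in distributed mode can run up to that many tasks, and the workers spread those tasks across the cluster.

```python
import json


def s3_sink_payload(topic, bucket, region, tasks_max):
    """Build a connector definition for the Connect REST API (POST /connectors)."""
    return {
        "name": "s3-sink-" + topic,  # placeholder connector name
        "config": {
            "connector.class": "io.confluent.connect.s3.S3SinkConnector",
            # Up to this many tasks share the topic's partitions.
            "tasks.max": str(tasks_max),
            "topics": topic,
            "s3.bucket.name": bucket,  # placeholder
            "s3.region": region,       # placeholder
            "storage.class": "io.confluent.connect.s3.storage.S3Storage",
            "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
            "flush.size": "1000",      # placeholder
        },
    }


# One task per partition of the 10-partition topic from the question.
payload = s3_sink_payload("mytopic", "my-bucket", "us-east-1", 10)
print(json.dumps(payload, indent=2))
```

With tasks.max set to 10 and, say, 5 distributed workers, each worker would be expected to run about 2 of the tasks, so one logical connector is not limited to a single worker.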

Kafka Streams: Threads vs CPU cores

If the machine has 16 cores and if we define 6 threads in the config, would Kafka Streams utilize 6 cores OR would all the threads run on just a single core OR there is no control over the cores?
That is the wrong way to think about it; multiple factors are involved here.
If we define 6 tasks, it means we have 6 partitions for that topic, which will be consumed in parallel by the Kafka consumer or connector.
If you have 16 cores and no other process is running, then chances are that it will execute as you expect.
That is not a normal production scenario, though: in production we have multiple topics (each with more than 1 partition), which invalidates your theory.
You should size the number of tasks according to the consumers, and the machine should run only the worker.
Once the above condition is satisfied, we can run a performance test on that data:
How much time does it take to process 50k records?
What is our expected time?
We can then upgrade our system based on these basic parameters.
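As a rough model of the relationship between partitions, tasks, and threads discussed above: Kafka Streams creates tasks based on the number of input partitions and spreads them over the configured stream threads (num.stream.threads), while the choice of CPU core for each thread is left to the JVM and the operating system scheduler. The sketch below is a plain Python model, not Kafka Streams code. It assumes tasks are spread as evenly as possible, which simplifies the real assignment logic, and the function name is hypothetical.

```python
def tasks_per_thread(num_tasks, num_threads):
    """Spread tasks over threads as evenly as possible (simplified model)."""
    base, extra = divmod(num_tasks, num_threads)
    return [base + (1 if i < extra else 0) for i in range(num_threads)]


# 6 threads, 6 partitions: every thread has exactly one task.
print(tasks_per_thread(6, 6))   # [1, 1, 1, 1, 1, 1]

# 6 threads, 4 partitions: two threads have nothing to do.
print(tasks_per_thread(4, 6))   # [1, 1, 1, 1, 0, 0]

# 6 threads, 10 partitions: some threads process two tasks.
print(tasks_per_thread(10, 6))  # [2, 2, 2, 2, 1, 1]
```

In this model the thread count caps how many tasks can make progress at once, but it says nothing about which of the 16 cores a thread lands on.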

storm-kafka-client spout consume message at different speed for different partition

I have a storm cluster of 5 nodes and a kafka cluster installed on the same nodes.
storm version: 1.2.1
kafka version: 1.1.0
I also have a kafka topic of 10 partitions.
Now I want to consume this topic's data and process it with Storm, but the message consumption speed is really strange.
For testing, my Storm topology has only one component, a Kafka spout, and I always set the Kafka spout parallelism to 10, so that each partition is read by only one thread.
When I run this topology on just 1 worker, all partitions are read quickly and the lag is almost the same for each (very small).
When I run this topology on 2 workers, 5 partitions are read quickly, but the other 5 partitions are read very slowly.
When I run this topology on 3 or 4 workers, 7 partitions are read quickly and the other 3 partitions are read very slowly.
When I run this topology on more than 5 workers, 8 partitions are read quickly and the other 2 partitions are read slowly.
Another strange thing is that when I use a different consumer group id when configuring the Kafka spout, the test result may be different.
For example, when I use a specific group id and run the topology on 5 workers, only 2 partitions are read quickly, just the opposite of the test using another group id.
I have written a simple Java app that calls the high-level Kafka Java API. I ran it on each of the 5 Storm nodes and found that it can consume data very quickly from every partition, so a network issue can be ruled out.
Has anyone met the same problem before? Or does anyone have an idea of what may cause such a strange problem?
Thanks!

Apache Spark: Driver (instead of just the Executors) tries to connect to Cassandra

I guess I'm not yet fully understanding how Spark works.
Here is my setup:
I'm running a Spark cluster in Standalone mode. I'm using 4 machines for this: One is the Master, the other three are Workers.
I have written an application that reads data from a Cassandra cluster (see https://github.com/journeymonitor/analyze/blob/master/spark/src/main/scala/SparkApp.scala#L118).
The 3-node Cassandra cluster runs on the same machines that host the Spark Worker nodes. The Spark Master machine does not run a Cassandra node:
Machine 1      Machine 2        Machine 3        Machine 4
Spark Master   Spark Worker     Spark Worker     Spark Worker
               Cassandra node   Cassandra node   Cassandra node
The reasoning behind this is that I want to optimize data locality - when running my Spark app on the cluster, each Worker only needs to talk to its local Cassandra node.
Now, when submitting my Spark app to the cluster by running spark-submit --deploy-mode client --master spark://machine-1 from Machine 1 (the Spark Master), I expect the following:
a Driver instance is started on the Spark Master
the Driver starts one Executor on each Spark Worker
the Driver distributes my application to each Executor
my application runs on each Executor, and from there, talks to Cassandra via 127.0.0.1:9042
However, this doesn't seem to be the case. Instead, the Spark Master tries to talk to Cassandra (and fails, because there is no Cassandra node on the Machine 1 host).
What is it that I misunderstand? Does it work differently? Does the Driver in fact read the data from Cassandra and distribute it to the Executors? But then I could never read data larger than the memory of Machine 1, even if the total memory of my cluster is sufficient.
Or does the Driver talk to Cassandra not to read data, but to find out how to partition the data, and then instruct the Executors to read "their" part of the data?
If someone can enlighten me, that would be much appreciated.
The driver program is responsible for creating the SparkContext and SQLContext and for scheduling tasks on the worker nodes. That includes creating logical and physical plans and applying optimizations. To be able to do that, it needs access to the data source schema and possibly other information such as various statistics. Implementation details vary from source to source, but generally speaking it means that the data should be accessible on all nodes, including the application master.
At the end of the day, your expectations are almost correct. Chunks of the data are fetched individually on each worker without going through the driver program, but the driver has to be able to connect to Cassandra to fetch the required metadata.
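The division of labor described here can be illustrated with a toy model. This is plain Python, not the Spark or Cassandra connector API: the ring layout, the data, and both function names are hypothetical, and the machine names follow the setup in the question. The planning step touches only metadata, while the reading step touches only the rows of one split.

```python
# Hypothetical cluster metadata: (token range, host that owns it).
RING = [((0, 99), "machine-2"), ((100, 199), "machine-3"), ((200, 299), "machine-4")]

# Hypothetical rows stored on each node, keyed by token.
DATA = {
    "machine-2": {5: "a", 42: "b"},
    "machine-3": {150: "c"},
    "machine-4": {201: "d", 250: "e"},
}


def plan_splits(ring):
    """Driver side: uses only metadata and never reads rows."""
    return [{"token_range": rng, "preferred_host": host} for rng, host in ring]


def read_split(split, data):
    """Executor side: reads the rows of one split from the node that owns it."""
    lo, hi = split["token_range"]
    rows = data[split["preferred_host"]]
    return [value for token, value in sorted(rows.items()) if lo <= token <= hi]


splits = plan_splits(RING)                    # needs a connection for metadata
results = [read_split(s, DATA) for s in splits]  # each read stays local to one node
print(results)  # [['a', 'b'], ['c'], ['d', 'e']]
```

In this model the planner never sees a row, which is why reading data larger than the driver's memory is not a problem, but it still needs to reach the cluster to learn the ring layout.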