How to set up a fully functional (including cluster) Spark learning/development environment on one machine? - scala

I want to start learning Spark 2.0, so I am trying to set up my dev (Scala 2.11) environment.
Spark is designed for a distributed environment: one cluster spread across multiple separate machines, with one node per machine. However, I do not have many machines for testing; I only have one machine running CentOS 7.
I am not after performance; I need something that simulates a working cluster so that I can learn Spark.
How can I set up a development environment to learn and develop Spark applications without access to multiple machines, while still being able to write code for a fully functional Spark-based environment?

Start with local mode.
Spark will do everything as usual: spawn executors, distribute tasks, etc. The only step that is omitted is the transfer of data across the network, and in production that happens completely under the hood, so you don't need to take this omission into account while coding.
You will be able to specify the number of executors (which are only threads in this mode) and test, for example, the fact that Spark Streaming needs at least 2 of them.
Referring to your comments:
Or it does not make much sense to make a cluster to learn spark
because it is all done under the hood and the programming is all the
same on local and say standalone/YARN/mesos mode
Yes, there are some conventions, but they are exactly the same on local and other modes.
Does the local mode means that I will be able to start exemplary
cluster with say 3 nodes?
local[3] should do the trick.
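As a rough illustration of the local mode described above, here is a minimal sketch in Scala. It assumes Spark 2.x (the spark-sql artifact) is on the classpath; the object name LocalModeSketch, the helper sumOfSquares, and the app name are made up for this example. Note that local[3] gives three worker threads inside a single JVM rather than three separate nodes.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical example object; not part of Spark itself.
object LocalModeSketch {
  // Distributes 1..n over three partitions and sums the squares.
  def sumOfSquares(spark: SparkSession, n: Int): Long =
    spark.sparkContext
      .parallelize(1 to n, numSlices = 3)
      .map(i => i.toLong * i)
      .reduce(_ + _)

  def main(args: Array[String]): Unit = {
    // "local[3]" runs the driver and three worker threads in this one JVM.
    val spark = SparkSession.builder()
      .master("local[3]")
      .appName("local-mode-sketch")
      .getOrCreate()
    try {
      println(s"master = ${spark.sparkContext.master}")
      println(s"sum of squares 1..100 = ${sumOfSquares(spark, 100)}")
    } finally {
      spark.stop()
    }
  }
}
```

Moving to a real cluster later should mostly come down to a different master URL (usually passed through spark-submit instead of being hard-coded); the job code itself stays the same.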


kubernetes - How do I get one job to work with multiple nodes?

Currently, when I create and run a deployment, it only runs on one node.
I want all nodes to work on one task at the same time using Kubernetes.
I want all nodes to work like one computer.
Kubernetes is about managing containers and scheduling them to run across a cluster, not about “jobs” per se. Have a look at MapReduce and Apache Spark.
First, you need to understand more about Kubernetes and why it may not match the concept you have in mind. Kubernetes is a container orchestration tool that automates many of the manual processes involved in deploying, managing, and scaling containerized applications.
In other words, you can cluster together groups of hosts running Linux containers, and K8s helps you manage those clusters. To process some kind of job or data, you will need software that runs on Kubernetes.
The next thing you might want to look into is the concept of distributed computing and a distributed computing model called MapReduce.
MapReduce was introduced by Google to meet the demand of the large number of users of its applications. It is used to write scalable applications that process large amounts of data in parallel on a large cluster of commodity hardware servers. Hadoop is software that has adopted MapReduce and is capable of running MapReduce programs written in various languages (Python, Ruby, C++).
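To make the MapReduce model a bit more concrete, here is a small single-machine sketch in plain Scala (no Hadoop or Kubernetes involved). It only mimics the three phases (map, shuffle, reduce) on an in-memory collection; the object and function names are invented for illustration, and a real framework would run each phase in parallel across many machines.

```scala
// Hypothetical, single-process imitation of the MapReduce phases.
object WordCountMapReduce {
  // Map phase: each input line becomes a list of (word, 1) pairs.
  def mapPhase(line: String): Seq[(String, Int)] =
    line.toLowerCase.split("\\s+").filter(_.nonEmpty).map(w => (w, 1)).toSeq

  // Shuffle phase: group all emitted values by their key (the word).
  def shufflePhase(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (word, ps) => word -> ps.map(_._2) }

  // Reduce phase: collapse the values of each key into a single count.
  def reducePhase(grouped: Map[String, Seq[Int]]): Map[String, Int] =
    grouped.map { case (word, counts) => word -> counts.sum }

  def wordCount(lines: Seq[String]): Map[String, Int] =
    reducePhase(shufflePhase(lines.flatMap(mapPhase)))

  def main(args: Array[String]): Unit = {
    val lines = Seq("spark runs on a cluster", "a cluster runs jobs")
    wordCount(lines).toSeq.sortBy(_._1).foreach { case (word, count) =>
      println(s"$word\t$count")
    }
  }
}
```

In a real cluster the map calls would run on whichever nodes hold the input splits, and the shuffle would move data over the network so that all values for one key end up on the same reducer.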
Take a look at this Medium article about a distributed computing system based on MapReduce and Kubernetes.

Apache Spark standalone settings

I have an Apache Spark standalone setup.
I wish to start 3 workers to run in parallel:
I use the commands below.
I tried to run a few jobs and below are the apache UI results:
Ignore the last three applications that failed. Below are my questions:
Why do I have just one worker displayed in the UI despite asking Spark to start 3, each with 2 cores?
I want to partition my input RDD for better performance. For the first two jobs, with no partitions specified, I had a time of 2.7 min. Here my Scala source code had the following:
val tweets = sc.textFile("/Users/soft/Downloads/tweets").map(parseTweet).persist()
In my third job (4.3 min) I had the below:
val tweets = sc.textFile("/Users/soft/Downloads/tweets",8).map(parseTweet).persist()
I expected a shorter time with more partitions (8). Why was the result the opposite of what I expected?
Apparently you have only one active worker; check the Spark logs to investigate why the other workers are not being reported.
More partitions don't always mean that the application runs faster. You need to check how you are creating partitions from the source data, the amount of data partitioned, how much data is being shuffled, etc.
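One way to check how the data is actually being partitioned is to ask the RDD directly. The sketch below is only an assumption-laden illustration: it uses synthetic data from parallelize instead of the tweets file from the question, runs on a local[2] master, and the names PartitionCheck and partitionCounts are made up. For sc.textFile the second argument is a minimum number of partitions, so the real count can be higher than what you pass in.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper object for inspecting partition counts.
object PartitionCheck {
  // Returns (partition count chosen by Spark, partition count requested explicitly).
  def partitionCounts(spark: SparkSession, requested: Int): (Int, Int) = {
    val sc = spark.sparkContext
    val byDefault = sc.parallelize(1 to 10000)            // uses sc.defaultParallelism
    val explicit = sc.parallelize(1 to 10000, requested)  // uses the requested count
    (byDefault.getNumPartitions, explicit.getNumPartitions)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("partition-check")
      .getOrCreate()
    try {
      val (byDefault, explicit) = partitionCounts(spark, 8)
      println(s"defaultParallelism  = ${spark.sparkContext.defaultParallelism}")
      println(s"default partitions  = $byDefault")
      println(s"explicit partitions = $explicit")
    } finally {
      spark.stop()
    }
  }
}
```

If the number of partitions is much larger than the number of cores actually available, the extra scheduling overhead can easily outweigh any gain, which may be part of what happened in the third job.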
In case you are running on a local machine, it is quite normal to just start a single worker with several CPUs, as shown in the output. It will still split your tasks over the available CPUs in the machine.
Partitioning of your file happens automatically depending on the amount of available resources, and it works quite well most of the time. Spark (and partitioning the files) comes with some overhead, so often, especially on a single machine, Spark adds so much overhead that it slows down your process. The added value comes with large amounts of data on a cluster of machines.
Assuming that you are starting a stand-alone cluster, I would suggest using the configuration files to set up the cluster and use to start a cluster.
First, in your spark/conf/slaves (copied from spark/conf/slaves.template), add the IPs (or server names) of your worker nodes.
Configure spark/conf/spark-defaults.conf (copied from spark/conf/spark-defaults.conf.template). Set at least the master node to the server that runs your master.
Use the (copied from to configure the cores per worker, memory etc:
export SPARK_REPL_MEM="4g"
Since it is standalone (and not hosted on a Hadoop environment) you need to share (or copy) the configuration (or rather the complete spark directory) to all nodes in your cluster. Also the data you are processing needs to be available on all nodes e.g. directly from a bucket or a shared drive.
As suggested by @skjagini, check out the various log files in spark/logs/ to see what's going on. Each node will write its own log files.
See for all options.
(we have a setup like this running for several years and it works great!)

Running kafka connect in Distributed mode?

I have a total of 3 VMs (CloudVPS). Each of them has Java and Confluent Open Source installed. On VM1 I am running 3 processes of Splunk-sink-connector, which read from different topics and run on different ports. Using REST calls, I posted a JSON configuration to each of them.
Since I am running in distributed mode, I want to take advantage of the other 2 VMs as well. Can anyone please tell me what to do to add the other 2 VMs to those 3 processes to achieve parallel processing?
You just need to run Kafka Connect in distributed mode on the three VMs. Follow the instructions here and make sure you give them all the same which identifies them as members of the same cluster (and thus eligible for sharing the workload of tasks across them). More config details for distributed mode here.

Apache Mesos vs Google Kubernetes

What's the difference between Apache's Mesos and Google's Kubernetes?
I read the accepted answers but I'm still confused what the differences are.
If Kubernetes is a cluster manager, then what does Mesos do (I understand what it does from watching a bunch of videos, but I suppose I'm more confused about how those two work together)?
From my reading, both Kubernetes and Marathon are "frameworks" sitting on top of Mesos?
What is Mesos responsible for and what are Kubernetes/Marathon responsible for and how do they work with each other?
I think the better question is: when would I want to use Kubernetes on top of Mesos vs. just running Mesos alone?
Mesos is another abstraction layer. It simply abstracts the underlying hardware so that software that wants to run on top of it only has to define the resources it requires, without having to know any other information.
Kubernetes could do a similar thing, but without the abstraction provided by Mesos you can't run other frameworks (e.g., Spark or Cassandra) on the same machine without manually dividing it between those frameworks.
Apache Mesos is a resource manager that shares resources (CPU shares, RAM, disk, ports) across a cluster of machines in a fair way. By sharing, I mean it offers these resources to so-called framework schedulers (such as Marathon) and thereby has a clear separation of concerns between resource management and scheduling decisions (which are implemented, depending on the job type, for example long-running or batch, by the framework scheduler). See also the Mesos architecture for further details.
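The split between "who owns the resources" and "who decides what runs" can be sketched with a toy model in plain Scala. None of this is the Mesos API: Offer, Task, FrameworkScheduler, GreedyScheduler and allocate are hypothetical names, and the round-robin hand-out is a deliberate simplification of the fair-sharing allocation that real Mesos performs.

```scala
// Toy model of two-level scheduling; all names are hypothetical.
object ResourceOfferSketch {
  final case class Offer(agent: String, cpus: Double, memMb: Int)
  final case class Task(name: String, cpus: Double, memMb: Int)

  // A framework scheduler decides which of its own tasks to run on an offer.
  trait FrameworkScheduler {
    def name: String
    def onOffer(offer: Offer): Seq[Task]
  }

  // Launches pending tasks in queue order as long as they fit into the offer.
  final class GreedyScheduler(val name: String, private var pending: Vector[Task])
      extends FrameworkScheduler {
    def onOffer(offer: Offer): Seq[Task] = {
      val start = (Vector.empty[Task], Vector.empty[Task], offer.cpus, offer.memMb)
      val (launched, kept, _, _) = pending.foldLeft(start) {
        case ((run, keep, cpus, mem), t) =>
          if (t.cpus <= cpus && t.memMb <= mem) (run :+ t, keep, cpus - t.cpus, mem - t.memMb)
          else (run, keep :+ t, cpus, mem)
      }
      pending = kept // tasks that did not fit wait for a later offer
      launched
    }
  }

  // The "resource manager" side: it only hands offers to frameworks in turn
  // and records what they launched as (framework, agent, task).
  def allocate(offers: Seq[Offer],
               frameworks: Seq[FrameworkScheduler]): Seq[(String, String, Task)] =
    offers.zipWithIndex.flatMap { case (offer, i) =>
      val fw = frameworks(i % frameworks.size)
      fw.onOffer(offer).map(task => (fw.name, offer.agent, task))
    }

  def main(args: Array[String]): Unit = {
    val batch = new GreedyScheduler("batch", Vector(Task("etl", 2.0, 1024), Task("report", 4.0, 2048)))
    val web = new GreedyScheduler("web", Vector(Task("nginx", 1.0, 512)))
    val offers = Seq(Offer("agent-1", 4.0, 4096), Offer("agent-2", 2.0, 1024))
    allocate(offers, Seq(batch, web)).foreach { case (fw, agent, task) =>
      println(s"$fw launches ${task.name} on $agent")
    }
  }
}
```

The point of the sketch is that allocate never looks inside a task: it knows only about resources, while each scheduler applies its own policy to the offers it receives.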

Connecting laptop(s)/desktop(s) to form a MATLAB computing cluster?

I have experience running parallel jobs on a remote cluster, and parallel (parfor) jobs on a single local machine, but I have never tried making a cluster of my own. I have access to a couple of laptops/desktops/servers (root access on all except one server), and was wondering if I could connect them all (or some) to form a local cluster (it will have about 30 cores total).
Once you move beyond working with one machine, you move license types from a Parallel Computing Toolbox license to a Distributed Computing Server license. The licenses are available for clusters of 8 workers and up. List price on an 8-worker cluster is $6K; 32 workers are $21K. You can get more information on the MathWorks product page. Also note that submitting jobs to the workers requires the Parallel Computing Toolbox.
Once you have the worker licenses, the only supported way to distribute jobs to the workers is through a scheduler. The server licenses come with a basic MathWorks scheduler that does have some limitations but is ideal for single users or small groups. Beyond that you would need to go with one of the higher-end schedulers such as LSF. A full list of supported schedulers is on the product page. Moving from a PCT setup on a single machine to a distributed setup can be fairly involved.
Are you prepared to pay the license cost for this? You can use local clusters (up to 8 workers) using 1 copy of the Parallel Computing Toolbox license. But to use distributed clusters, you need a distributed computing toolbox for each "node" (processor core) on the cluster. I'm not familiar with how to set this up. I know that I have access to a few of these clusters, and I also use local clusters extensively. We opted not to create our own distributed cluster for this reason. We also have data showing that distributed clusters were slow for our particular tasks (a lot of file I/O was happening in our case).
I know this doesn't answer your question, just a few things to think about.