Apache Spark standalone settings - scala

I have an Apache Spark standalone setup.
I wish to start 3 workers to run in parallel.
I use the commands below:
./start-master.sh
SPARK_WORKER_INSTANCES=3 SPARK_WORKER_CORES=2 ./start-slaves.sh
I ran a few jobs; the Spark UI results are below:
Ignore the last three applications that failed. My questions are:
Why do I have just one worker displayed in the UI despite asking Spark to start 3, each with 2 cores?
I want to partition my input RDD for better performance. The first two jobs, with no explicit partitions, took 2.7 min. There my Scala source code had the following:
val tweets = sc.textFile("/Users/soft/Downloads/tweets").map(parseTweet).persist()
In my third job (4.3 min) I had the below:
val tweets = sc.textFile("/Users/soft/Downloads/tweets",8).map(parseTweet).persist()
I expected a shorter time with more partitions (8). Why did the opposite happen?

Apparently you have only one active worker. Check the Spark logs to find out why the other workers are not being reported.
More partitions don't always mean that the application runs faster. You need to check how the partitions are created from the source data, how much data is partitioned, how much data is shuffled, etc.
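One rough way to see why more partitions need not help is the sketch below. It assumes each core runs one task at a time, that all tasks take about the same time, and it ignores per-task overhead. The function name and the core counts are hypothetical, chosen only for illustration.

```python
import math

def task_waves(num_partitions, total_cores):
    """Number of sequential rounds needed when each core runs one task at a time."""
    return math.ceil(num_partitions / total_cores)

# Hypothetical cluster sizes, not taken from the question
waves_8_on_2 = task_waves(8, 2)  # 8 partitions on 2 cores: several rounds
waves_2_on_2 = task_waves(2, 2)  # 2 partitions on 2 cores: a single round
waves_8_on_8 = task_waves(8, 8)  # 8 partitions only run fully in parallel with 8 cores

print(waves_8_on_2, waves_2_on_2, waves_8_on_8)
```

Under these assumptions, extra partitions beyond the number of available cores only add more rounds, each with its own scheduling overhead.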

If you are running on a local machine, it is quite normal to start just a single worker with several CPUs, as shown in the output. It will still split your tasks over the available CPUs in the machine.
Partitioning your file happens automatically depending on the amount of available resources, and it works quite well most of the time. Spark (and partitioning the files) comes with some overhead, so often, especially on a single machine, Spark adds so much overhead that it slows down your process. The added value comes with large amounts of data on a cluster of machines.
Assuming that you are starting a standalone cluster, I would suggest using the configuration files to set up the cluster and start-all.sh to start it.
First, in your spark/conf/slaves (copied from spark/conf/slaves.template), add the IPs (or server names) of your worker nodes.
Configure spark/conf/spark-defaults.conf (copied from spark/conf/spark-defaults.conf.template). Set at least the master node to the server that runs your master.
Use the spark-env.sh (copied from spark-env.sh.template) to configure the cores per worker, memory etc:
export SPARK_WORKER_CORES="2"
export SPARK_WORKER_MEMORY="6g"
export SPARK_DRIVER_MEMORY="4g"
export SPARK_REPL_MEM="4g"
Since it is standalone (and not hosted on a Hadoop environment) you need to share (or copy) the configuration (or rather the complete spark directory) to all nodes in your cluster. Also the data you are processing needs to be available on all nodes e.g. directly from a bucket or a shared drive.
As suggested by @skjagini, check out the various log files in spark/logs/ to see what's going on. Each node will write its own log files.
See https://spark.apache.org/docs/latest/spark-standalone.html for all options.
(we have a setup like this running for several years and it works great!)

Related

Initial job has not accepted any resources; Error with spark in VMs

I have three Ubuntu VMs (clones) on my local machine which I want to use to make a simple cluster: one VM as the master and the other two as slaves. I can ssh into every VM from every other one successfully, and I have the IPs of the two slaves in the conf/slaves file of the master and the master's IP in the spark-env.sh of every VM. When I run
start-slave.sh spark://master-ip:7077
from the slaves, they appear in the Spark UI. But when I try to run things in parallel I always get the message about the resources. For testing code I use the Scala shell
spark-shell --master spark://master-ip:7077 and sc.parallelize(1 until 10000).count.
Do you mean this warning: WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster ui to ensure that workers are registered and have sufficient memory
This message will pop up any time an application is requesting more resources from the cluster than the cluster can currently provide.
Spark is only looking for two things: cores and RAM. Cores represents the number of open executor slots that your cluster provides for execution. RAM refers to the amount of free RAM required on any worker running your application.
Note that for both of these resources the maximum value is not your system's maximum; it is the maximum set by your Spark configuration.
If you need to run multiple Spark apps simultaneously then you’ll need to adjust the amount of cores being used by each app.
If you are working with applications on the same node you need to assign cores to each application to make them work in parallel: ResourceScheduling
If you use VMs (as in your situation): assign only one core to each VM when you first create it, or whatever fits your system's resource capacity. As it stands, Spark requests 4 cores for each of the 2 VMs, 8 cores in total, which you don't have.
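The check described above can be sketched in a few lines of Python. This is a simplified model for illustration only, not Spark's scheduler logic: the Worker class, the usable_workers function, and all numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Worker:
    free_cores: int      # cores the worker offers that are not in use
    free_memory_mb: int  # memory the worker offers that is not in use

def usable_workers(workers, executor_cores, executor_memory_mb):
    """Return the workers that can host at least one executor of the requested size."""
    return [
        w for w in workers
        if w.free_cores >= executor_cores and w.free_memory_mb >= executor_memory_mb
    ]

# Hypothetical setup: two VMs, each offering 1 core and 1024 MB
workers = [Worker(1, 1024), Worker(1, 1024)]

# Asking for 4 cores per executor finds no suitable worker in this model,
# which is the situation the warning describes.
too_big = usable_workers(workers, executor_cores=4, executor_memory_mb=1024)

# Asking for 1 core and 512 MB per executor fits on both workers.
fits = usable_workers(workers, executor_cores=1, executor_memory_mb=512)

print(len(too_big), len(fits))
```

If no worker can satisfy the request, either lower what the application asks for or raise what the workers are configured to offer.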
This is a tutorial I found that could help you: Install Spark on Ubuntu: Standalone Cluster Mode
Further Reading: common-spark-troubleshooting

DASK with local files on WORKER systems

I am working with multiple systems as workers.
Each worker system has a part of the data locally stored. And I want the computation done by each worker on its respective file only.
I have tried using :
distributed.scheduler.decide_worker()
send_task_to_worker(worker, key)
but I could not automate assigning the task for each file.
Also, is there any way I can access local files of the worker? Using the TCP address, I only have access to a temp folder of the worker created for Dask.
You can target computations to run on certain workers using the workers= keyword to the various methods on the client. See http://distributed.readthedocs.io/en/latest/locality.html#user-control for more information.
You might run a function on each of your workers that tells you which files are present:
>>> client.run(os.listdir, my_directory)
{'192.168.0.1:52523': ['myfile1.dat', 'myfile2.dat'],
'192.168.0.2:4244': ['myfile3.dat'],
'192.168.0.3:5515': ['myfile4.dat', 'myfile5.dat']}
You might then submit computations to run on those workers specifically.
future = client.submit(load, 'myfile1.dat', workers='192.168.0.1:52523')
If you are using dask.delayed you can also pass workers= to the persist method. See http://distributed.readthedocs.io/en/latest/locality.html#user-control for more information.
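To automate the assignment for every file, the per-worker listing can be inverted into a file-to-worker map. The sketch below is plain Python and runs without a cluster. The files_to_workers helper is a hypothetical name, and the listing is hard-coded in the same shape as the client.run result shown above.

```python
def files_to_workers(listing):
    """Invert {worker_address: [filenames]} into {filename: worker_address}."""
    mapping = {}
    for worker, filenames in listing.items():
        for name in filenames:
            mapping[name] = worker
    return mapping

# Same shape as the result of client.run(os.listdir, my_directory)
listing = {
    '192.168.0.1:52523': ['myfile1.dat', 'myfile2.dat'],
    '192.168.0.2:4244': ['myfile3.dat'],
    '192.168.0.3:5515': ['myfile4.dat', 'myfile5.dat'],
}
placement = files_to_workers(listing)

# With a live cluster, each file could then be sent to the worker that holds it
# (load is the user-defined function from the example above):
# futures = [client.submit(load, name, workers=worker)
#            for name, worker in placement.items()]
```

This assumes each file name exists on exactly one worker; if a file is present on several workers, the last one listed wins in this sketch.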

Intermittent file not found using Google Cloud Storage from Dataproc - flushing writes?

I have a series of dataproc jobs that run to import some data received each morning. The process creates a cluster, runs four jobs in sequence, then shuts down the cluster. The input file is read from Google Cloud Storage, and the intermediate results are also saved in Avro form in GCS with the final output going to Cloud SQL.
Fairly often the jobs will fail trying to read the Avro written by the previous job. It appears that GCS hasn't "caught up" and the results from the previous job haven't been fully written. I was getting failures trying to read files that appeared to be from the previous day's run and partway through those files would disappear and be replaced by the new ones. I have changed my script that runs the files to clear the work area before starting the jobs, but still have problems where sometimes it starts reading and all the parts haven't been written fully.
I could change the code to simply store the intermediate files on the cluster, tho I like having them available outside for diagnosing other problems. I could also just write to both locations with the cluster for working and GCS for diagnostics.
But assuming this is some kind of sync issue, is there a way to force GCS to flush writes / be fully synced between jobs? Or is there some check I can do to make sure everything has been written before starting the next job in my chain?
EDIT: To answer the comment below, the sequence of jobs all run on the same cluster. The cluster is started, each job run in turn on that cluster, and then the cluster is shut down.
For now, I have worked around this by having the jobs write to HDFS on the cluster in addition to GCS, and the subsequent jobs reading from the cluster. The GCS output is now strictly for diagnostics in case of a problem. But even tho my immediate problem is (I believe) fixed I still would like to know what's happening and why GCS seems out of sync for a bit.

How to set up a fully functional (including cluster) Spark learning and development environment on one machine?

I want to start learning Spark 2.0, so I am trying to set up my dev (Scala v2.11) environment.
Spark uses a distributed environment: one cluster spread across multiple separate machines, one node per machine. However, I do not have many machines for testing; I only have one machine, with CentOS 7 on it.
I am not after performance, I need something that would simulate a working cluster so that I could learn Spark.
How can I set up a development environment to learn and develop Spark applications without access to multiple machines, while still being able to learn and write code for a fully functional Spark-based environment?
Start with local mode.
Spark will do everything as usual: spawn executors, distribute tasks etc, the only step that will be omitted is the transfer of data across the network, and it's done completely under the hood in production so you don't need to take this omission into account while coding.
You will be able to specify number of executors (only threads in this mode), and test for example the fact that Spark Streaming needs at least 2 of them.
Referring to your comments:
Or it does not make much sense to make a cluster to learn spark
because it is all done under the hood and the programming is all the
same on local and say standalone/YARN/mesos mode
Yes, there are some conventions, but they are exactly the same on local and other modes.
Does the local mode means that I will be able to start exemplary
cluster with say 3 nodes?
local[3] should do the trick.

Spark: long delay between jobs

We are running a Spark job that extracts data, does some expensive data conversion, and writes to several different files. Everything runs fine, but I'm getting random long delays between the finish of a resource-intensive job and the start of the next job.
In the picture below, we can see that the job scheduled at 17:22:02 took 15 min to finish, so I expected the next job to be scheduled around 17:37:02. However, the next job was scheduled at 22:05:59, more than 4 hours after the job succeeded.
When I dig into the next job's Spark UI, it shows <1 sec scheduler delay. So I'm confused about where this 4-hour delay is coming from.
(Spark 1.6.1 with Hadoop 2)
Updated:
I can confirm that David's answer below is spot on: how I/O operations are handled in Spark is a bit unexpected. (It makes sense that a file write essentially does a "collect" behind the curtain before it writes, considering ordering and/or other operations.) But I'm a bit uncomfortable with the fact that I/O time is not included in job execution time. I guess you can see it in the "SQL" tab of the Spark UI, as queries are still running even with all jobs being successful, but you cannot dive into it at all.
I'm sure there are more ways to improve this, but the two methods below were sufficient for me:
reduce file count
set parquet.enable.summary-metadata to false
I/O operations often come with significant overhead that will occur on the master node. Since this work isn't parallelized, it can take quite a bit of time. And since it is not a job, it does not show up in the resource manager UI. Some examples of I/O tasks that are done by the master node:
Spark will write to temporary s3 directories, then move the files using the master node
Reading of text files often occurs on the master node
When writing parquet files, the master node will scan all the files post-write to check the schema
These issues can be solved by tweaking yarn settings or redesigning your code. If you provide some source code, I might be able to pinpoint your issue.
Discussion of writing I/O Overhead with Parquet and s3
Discussion of reading I/O Overhead "s3 is not a filesystem"
Problem:
I faced a similar issue when writing parquet data to s3 with pyspark on EMR 5.5.1. All workers would finish writing data to the _temporary bucket in the output folder, and the Spark UI would show that all tasks had completed. But the Hadoop Resource Manager UI would neither release resources for the application nor mark it as complete. On checking the s3 bucket, it seemed that the Spark driver was moving the files one by one from the _temporary directory to the output bucket, which was extremely slow, and the whole cluster was idle except the driver node.
Solution:
The solution that worked for me was to use the committer class by AWS (EmrOptimizedSparkSqlParquetOutputCommitter) by setting the configuration property spark.sql.parquet.fs.optimized.committer.optimization-enabled to true.
e.g.:
spark-submit ....... --conf spark.sql.parquet.fs.optimized.committer.optimization-enabled=true
or
pyspark ....... --conf spark.sql.parquet.fs.optimized.committer.optimization-enabled=true
Note that this property is available in EMR 5.19 or higher.
Result:
After running the Spark job on EMR 5.20.0 using the above solution, it did not create any _temporary directory and all the files were written directly to the output bucket, hence the job finished very quickly.
For more details:
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-s3-optimized-committer.html