Spark: master local[*] is a lot slower than master local - scala

I have an EC2 set up with r3.8xlarge (32 cores, 244G RAM).
In my Spark application, I am reading two CSV files from S3 using Databricks' Spark-CSV; each CSV has about 5 million rows. I unionAll the two DataFrames and run dropDuplicates on the combined DataFrame.
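For context, the read/union/dedup pipeline described above looks roughly like this (a sketch only; the S3 paths are placeholders and an existing SparkContext sc is assumed):
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
// helper to read one CSV with spark-csv, header row included
val readCsv = (path: String) =>
  sqlContext.read.format("com.databricks.spark.csv")
    .option("header", "true")
    .load(path)

val df1 = readCsv("s3n://my-bucket/file1.csv")   // placeholder path
val df2 = readCsv("s3n://my-bucket/file2.csv")   // placeholder path
val deduped = df1.unionAll(df2).dropDuplicates()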
But when I have,
val conf = new SparkConf()
.setMaster("local[32]")
.setAppName("Raw Ingestion On Apache Spark")
.set("spark.sql.shuffle.partitions", "32")
Spark is slower than with .setMaster("local").
Wouldn't it be faster with 32 cores?

Well, Spark is not a Windows operating system that works at maximum possible capacity out of the box; you need to tune it for your usage.
Right now you have just bluntly told Spark to start and process your data on one node with 32 cores. That is not what Spark is best at. It is a distributed system meant to run on a multi-node cluster; that is where it works best.
The reason is simple: even if you are using 32 cores, what about I/O?
If it runs, say, 32 tasks at once, that is 32 processes reading from the same disk.
You specified 32 cores, but what about executor memory?
Did both setups you were testing have the same RAM?
You have now explicitly asked for 32 partitions; if the data is very small, that is a lot of overhead. Ideally you shouldn't specify the partition count until you know exactly what you are doing, or you are running a repetitive task and know the data is going to be roughly the same every time.
If you tune it correctly, Spark with 32 cores will indeed work faster than "local", which basically runs on one core.
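For instance, a fairer starting point is the same configuration without a hard-coded shuffle-partition count (a sketch mirroring the question's snippet; whether 32 partitions helps depends entirely on the data size):
val conf = new SparkConf()
  .setMaster("local[32]")
  .setAppName("Raw Ingestion On Apache Spark")
  // leave spark.sql.shuffle.partitions at its default, or size it to the data,
  // rather than pinning it to the core count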

Related

Spark data frame is not utilizing the workers

I have a Spark cluster with 3 worker nodes. When I try to load a CSV file from HDFS, it only utilizes the resources (CPU & memory) on the system where I load the CSV via spark-shell (the master node).
Load dataframe
val df = spark.read.format("csv")
.option("header","true")
.load("hdfs://ipaddr:9000/user/smb_ram/2016_HDD.csv")
Do some operation on the dataframe
df.agg(sum("failure")).show
When I load the CSV, the system memory increases by 1.3 GB (the HDFS file size) and CPU usage hits 100%. The workers are idling, with CPU near 0% and no change in memory usage. Ideally I would expect all the heavy lifting to be done by the workers, which is not happening.
Set the Spark deploy mode to cluster; that should solve your problem. It looks like your job is running in client mode.
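For a packaged application, that would look something like this (a sketch; the master URL, class name, and jar are placeholders):
spark-submit --master spark://master-host:7077 --deploy-mode cluster --class com.example.LoadCsv my-app.jar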

Write dataframe from spark cluster to cassandra cluster: Partitioning and Performance Tuning

I have two clusters -
1. Cloudera Hadoop - Spark jobs run here
2. Cloud - Cassandra cluster, multiple DC
While writing a DataFrame from my Spark job to a Cassandra cluster, I am doing a repartition (repartitionCount=10) in Spark before writing. See below:
import org.apache.spark.sql.cassandra._
records.repartition(repartitionCount).write.cassandraFormat(table, keySpace)
.mode(SaveMode.Append)
.options(options)
.option(CassandraConnectorConf.LocalDCParam.name, cassandraDC.name)
.option(CassandraConnectorConf.ConnectionHostParam.name, cassandraDC.hosts)
.save()
In my multi-tenant Spark cluster, for a Spark batch load of 20M records with the configs below, I see a lot of task failures, resource preemption, and on-the-fly failures.
spark.cassandra.output.batch.grouping.buffer.size=1000
spark.cassandra.output.batch.grouping.key=partition
spark.cassandra.output.concurrent.writes=20
spark.cassandra.connection.compression=LZ4
How should I tune this? Is the repartition to blame?
PS: My initial understanding was: for a load of 20M rows, the repartition should distribute the load evenly over executors (partitions of 2M rows each), and batching would be done at the level of these partitions (on 2M rows). But now I am wondering whether this is causing an unnecessary shuffle, if the spark-cassandra-connector does its batching at the level of the whole DataFrame (all 20M rows).
UPDATE: Removing the repartition hurt performance a lot on my Cloudera Spark cluster (the default set at the Spark level is spark.sql.shuffle.partitions: 200), so I dug a bit deeper and found that my initial understanding was correct. Please note that my Spark and Cassandra clusters are separate. The DataStax spark-cassandra-connector opens one connection per partition with a Cassandra coordinator node, so I have decided to leave the repartition as is. As Alex suggested, I have reduced the concurrent writes, which I believe should help.
You don't need to repartition in Spark - just write the data from Spark to Cassandra, and don't try to change the Spark Cassandra Connector defaults - they work fine in most situations. You need to look at what kind of stage failures are happening - most probably you're simply overloading Cassandra because of spark.cassandra.output.concurrent.writes=20 (use the default value of 5) - sometimes having fewer writers helps write data faster, because you don't overload Cassandra and jobs aren't restarted.
P.S. The partition in spark.cassandra.output.batch.grouping.key is not a Spark partition; it's the Cassandra partition, which depends on the value of the partition key column.
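In other words, the write can stay close to the question's code, just without the repartition and with the connector defaults left alone (a sketch; the CassandraConnectorConf import path is the one used by recent connector versions, and spark.cassandra.output.concurrent.writes is left at its default of 5 instead of being overridden):
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.cassandra._
import com.datastax.spark.connector.cql.CassandraConnectorConf

// no repartition: let the connector group writes by Cassandra partition key
records.write.cassandraFormat(table, keySpace)
  .mode(SaveMode.Append)
  .option(CassandraConnectorConf.LocalDCParam.name, cassandraDC.name)
  .option(CassandraConnectorConf.ConnectionHostParam.name, cassandraDC.hosts)
  .save()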

How are multiple executors managed on the worker nodes with a Spark standalone cluster?

Until now, I have only used Spark on a Hadoop cluster with YARN as the resource manager. In that type of cluster, I know exactly how many executors to run and how the resource management works. However, now that I am trying to use a standalone Spark cluster, I have become a little confused. Correct me where I am wrong.
From this article, by default, a worker node uses all the memory of the node minus 1 GB. But I understand that by using SPARK_WORKER_MEMORY, we can make it use less memory. For example, if the total memory of the node is 32 GB but I specify 16 GB, the Spark worker is not going to use more than 16 GB on that node?
But what about executors? Say I want to run 2 executors per node: can I do that by specifying executor memory during spark-submit to be half of SPARK_WORKER_MEMORY, and if I want to run 4 executors per node, by specifying executor memory to be a quarter of SPARK_WORKER_MEMORY?
If so, besides executor memory, I would also have to specify executor cores correctly, I think. For example, if I want to run 4 executors on a worker, I would have to specify executor cores to be a quarter of SPARK_WORKER_CORES? What happens if I specify a bigger number than that? I mean, if I specify executor memory to be a quarter of SPARK_WORKER_MEMORY, but executor cores to be only half of SPARK_WORKER_CORES? Would I get 2 or 4 executors running on that node in that case?
In my experience, this is the best way to control the number of executors, cores and memory.
Cores: you can set the total number of cores across all executors and the number of cores per executor.
Memory: executor memory is set individually.
--total-executor-cores 12 --executor-cores 2 --executor-memory 6G
This would give you 6 executors with 2 cores and 6G each, so in total you are looking at 12 cores and 36G.
You can set driver memory using
--driver-memory 2G
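Putting it together, a full submit might look like this (a sketch; the master URL, class name, and jar are placeholders):
spark-submit --master spark://master-host:7077 --total-executor-cores 12 --executor-cores 2 --executor-memory 6G --driver-memory 2G --class com.example.MyApp my-app.jar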
So, I experimented with the Spark Standalone cluster myself a bit, and this is what I noticed.
My intuition that multiple executors can be run inside a worker by tuning executor cores was indeed correct. Let us say your worker has 16 cores. If you specify 8 cores per executor, Spark will run 2 executors per worker.
How many executors run inside a worker also depends on the executor memory you specify. For example, if the worker memory is 24 GB and you want to run 2 executors per worker, you cannot specify executor memory to be more than 12 GB.
A worker's memory can be limited when starting a slave by specifying a value for the optional --memory parameter or by changing the value of SPARK_WORKER_MEMORY. The same goes for the number of cores (--cores / SPARK_WORKER_CORES).
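For example, in conf/spark-env.sh on each worker (the values here are illustrative only):
SPARK_WORKER_CORES=16
SPARK_WORKER_MEMORY=24g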
If you want to be able to run multiple jobs on the Standalone Spark cluster, you could use the spark.cores.max configuration property while doing spark-submit. For example, like this.
spark-submit <other parameters> --conf="spark.cores.max=16" <other parameters>
So, if your Standalone Spark Cluster allows 64 cores in total, and you give only 16 cores to your program, other Spark jobs could use the remaining 48 cores.
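The same cap can also be set programmatically when building the configuration (a sketch; the app name is a placeholder):
val conf = new SparkConf()
  .setAppName("my-app")
  .set("spark.cores.max", "16")   // leave the remaining cluster cores for other jobs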

Spark write to parquet on hdfs

I have 3 nodes with Hadoop and Spark installed. I would like to pull data from an RDBMS into a DataFrame and write this data as Parquet on HDFS. The "dfs.replication" value is 1.
When I try this with the following command, I see that all the HDFS blocks are located on the node where I ran spark-shell:
scala> xfact.write.parquet("hdfs://sparknode01.localdomain:9000/xfact")
Is this the intended behaviour or should all blocks be distributed across the cluster?
Thanks
Since you are writing your data to HDFS, this does not depend on Spark but on HDFS. From Hadoop: The Definitive Guide:
Hadoop’s default strategy is to place the first replica on the same node as the client (for
clients running outside the cluster, a node is chosen at random, although the system
tries not to pick nodes that are too full or too busy).
So yes, this is the intended behaviour.
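If you want to verify where the blocks actually ended up, hdfs fsck lists the block locations for the output path from the question:
hdfs fsck /xfact -files -blocks -locations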
Just as #nik says, I do my work with multiple clients and it worked for me:
This is the Python snippet:
columns = xfact.columns
test = sqlContext.createDataFrame(xfact.rdd.map(lambda a: a),columns)
test.write.mode('overwrite').parquet('hdfs://sparknode01.localdomain:9000/xfact')

Apache Spark: Driver (instead of just the Executors) tries to connect to Cassandra

I guess I'm not yet fully understanding how Spark works.
Here is my setup:
I'm running a Spark cluster in Standalone mode. I'm using 4 machines for this: One is the Master, the other three are Workers.
I have written an application that reads data from a Cassandra cluster (see https://github.com/journeymonitor/analyze/blob/master/spark/src/main/scala/SparkApp.scala#L118).
The 3-node Cassandra cluster runs on the same machines that also host the Spark Worker nodes. The Spark Master node does not run a Cassandra node:
Machine 1       Machine 2        Machine 3        Machine 4
Spark Master    Spark Worker     Spark Worker     Spark Worker
                Cassandra node   Cassandra node   Cassandra node
The reasoning behind this is that I want to optimize data locality - when running my Spark app on the cluster, each Worker only needs to talk to its local Cassandra node.
Now, when submitting my Spark app to the cluster by running spark-submit --deploy-mode client --master spark://machine-1 from Machine 1 (the Spark Master), I expect the following:
a Driver instance is started on the Spark Master
the Driver starts one Executor on each Spark Worker
the Driver distributes my application to each Executor
my application runs on each Executor, and from there, talks to Cassandra via 127.0.0.1:9042
However, this doesn't seem to be the case. Instead, the Spark Master tries to talk to Cassandra (and fails, because there is no Cassandra node on the Machine 1 host).
What is it that I misunderstand? Does it work differently? Does the Driver in fact read the data from Cassandra and then distribute it to the Executors? But then I could never read data larger than the memory of Machine 1, even if the total memory of my cluster is sufficient.
Or, does the Driver talk to Cassandra not to read data, but to find out how to partition the data, and instructs the Executors to read "their" part of the data?
If someone can enlighten me, that would be much appreciated.
The driver program is responsible for creating the SparkContext and SQLContext and for scheduling tasks on the worker nodes. This includes creating logical and physical plans and applying optimizations. To be able to do that, it has to have access to the data source schema and possibly other information such as statistics. Implementation details vary from source to source, but generally speaking it means that the data should be accessible on all nodes, including the application master.
At the end of the day, your expectations are almost correct. Chunks of the data are fetched individually on each worker without going through the driver program, but the driver has to be able to connect to Cassandra to fetch the required metadata.
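In practice that means the Cassandra contact point should be an address the driver can reach too, not 127.0.0.1 - for example (a sketch; the host name is a placeholder, and any Cassandra node reachable from both the driver and the executors will do):
val conf = new SparkConf()
  .setAppName("analyze")
  .set("spark.cassandra.connection.host", "machine-2")   // a Cassandra node the driver can also connect to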