Different outputs per number of partitions in Spark - Scala

I run Spark code on my local machine and on a cluster.
I create a SparkContext object for the local machine with the following code:
val sc = new SparkContext("local[*]", "Trial")
I create a SparkContext object for the cluster with the following code:
val spark = SparkSession.builder.appName(args(0)+" "+args(1)).getOrCreate()
val sc = spark.sparkContext
and I set the number of partitions to 4 for both the local machine and the cluster with the following code:
val dataset = sc.textFile("Dataset.txt", 4)
In my cluster, I created 5 workers. One of them is the driver node; the rest run as workers.
I expect the results to be the same. However, the results of the two runs, local and cluster, are different. What are the reasons for this?

I create a SparkContext object for the local machine with the following code
and
I create a SparkContext object for the cluster with the following code:
It appears that you have defined two different environments for sc and spark: you set local[*] explicitly for sc, while spark takes whatever default applies (which may come from external configuration files or the master URL given to spark-submit).
These environments may differ, and that can affect what you get.
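A minimal way to keep the two runs consistent (just a sketch; use whatever master URL you actually target) is to set the master explicitly on the builder as well:
val spark = SparkSession.builder
  .master("local[*]")  // or the cluster's master URL, or leave it to spark-submit --master
  .appName("Trial")
  .getOrCreate()
val sc = spark.sparkContext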
I expect the results to be the same. However, the results of the two runs, local and cluster, are different. What are the reasons for this?
The Dataset.txt you process in the local and cluster environments is likely not the same file, hence the difference in the results. I'd strongly recommend using HDFS or some other shared file system to avoid such "surprises" in the future.
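As a minimal sketch (the HDFS location below is hypothetical), pointing both runs at the same shared file could look like:
val dataset = sc.textFile("hdfs:///shared/Dataset.txt", 4)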

Related

Can it be assumed that the number of partitions in a Spark dataframe persists after being written to a table?

I recently inherited a Scala project that uses Spark. The CI pipeline succeeded the last time it ran, but that was over six months ago; since then our CI infrastructure has changed and I am now adapting this CI pipeline to run on the new infrastructure (it basically used to run on an on-premises Kubernetes cluster, and it now has to run on a cloud-based Kubernetes cluster).
A test that used to succeed is now failing, and the only thing that has changed is the infrastructure on which it runs. The problematic code in question boils down to this:
val tableName = "data_table_3_spark_partitions"
spark.createDataFrame(someData).repartition(3).write.mode(SaveMode.Overwrite).saveAsTable(tableName)
val numberOfSparkPartitionsPriorToErase = spark.table(tableName).rdd.getNumPartitions
numberOfSparkPartitionsPriorToErase shouldBe 3
The test now fails on that 4th line with:
1 was not equal to 3
The dataframe being written contains only 5 rows (it is a unit test after all)
The code is running against a local filesystem; here is the SparkConf:
new SparkConf().
setMaster("local[*]").
setAppName("test").
set("spark.ui.enabled", "false").
set("spark.app.id", appID).
set("spark.driver.host", "localhost").
set("spark.sql.sources.partitionOverwriteMode","dynamic")
It seems to me that this is a somewhat brittle test because it is predicated on the following assumption:
when a Spark dataframe is written to a table and that table is then read into a new dataframe, the new dataframe will have the same number of partitions as the original dataframe
I think this failing test proves that this is a false assumption. Am I correct that it is a false assumption?
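For what it's worth, a minimal probe of that assumption (just a sketch against a local SparkSession; the table name is arbitrary) would be:
import org.apache.spark.sql.SaveMode

// Write 5 rows across 3 partitions, then read the table back.
spark.range(5).repartition(3).write.mode(SaveMode.Overwrite).saveAsTable("partition_probe")

// The read-back partition count is governed by the file-scan settings
// (e.g. spark.sql.files.maxPartitionBytes), not by the write-time repartition(3),
// so for a handful of tiny files it may well come back as 1.
println(spark.table("partition_probe").rdd.getNumPartitions)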

How to copy file from local to HDFS directory in Oozie spark scala job?

I am trying to copy some files from a local path to HDFS with Scala, and running it with Oozie. The job is failing because it is not able to read files from the local path. Is there a way to read local files in an Oozie job?
It is not possible for Spark to copy/read local files when it is running in cluster mode. The reason is:
When Oozie submits a Spark job in cluster mode, YARN will not necessarily allocate the same (local) node as an executor. Even if you have a limited number of executors and it does allocate that node, the other executors still cannot access the file.
The only possible solution I see is to keep all local files in a shared directory that is accessible by all cluster nodes. After that, you can use the commands below to run HDFS operations from Scala.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
val hdfspath = new Path("hdfs:///user/nikhil/test.csv")
val localpath = new Path("file:///home/cloudera/test/")

// Get the FileSystem for the HDFS path and copy the file down to the local directory.
val fs = hdfspath.getFileSystem(conf)
fs.copyToLocalFile(hdfspath, localpath)
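Since the question asks about the opposite direction (local to HDFS), a sketch of the same idea going the other way, reusing the fs handle above with placeholder paths, and assuming it runs on a node that can actually see the local file (e.g. a client-mode driver or edge node), would be:
// Copy a local file up to HDFS (paths are placeholders).
fs.copyFromLocalFile(new Path("file:///home/cloudera/test/test.csv"), new Path("hdfs:///user/nikhil/"))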
Please see the link below for help with setting up a shared (NFS) directory, just for reference.
https://www.tecmint.com/how-to-setup-nfs-server-in-linux/

How to limit pyspark resources

I'm running pyspark on my local machine, and I want to limit the number of cores and the amount of memory it uses (I have 8 cores and 16 GB of memory).
I don't know how to do this; I've tried adding these lines to my code, but the process is still greedy.
from pyspark import SparkContext, SparkConf
conf = (SparkConf().setMaster("local[4]")
.set("spark.executor.cores", "4")
.set("spark.cores.max", "4")
.set('spark.executor.memory', '6g')
)
sc = SparkContext(conf=conf)
rdd = sc.parallelize(input_data, numSlices=4)
map_result = rdd.map(map_func)
map_result.reduce(reduce_func)
Why are the confs not applied?
This may be happening due to "precedence" in configuration, since Spark allows different ways to set configuration parameters. In the documentation we can see:
Any values specified as flags or in the properties file will be passed on to the application and merged with those specified through SparkConf. Properties set directly on the SparkConf take highest precedence, then flags passed to spark-submit or spark-shell, then options in the spark-defaults.conf file. A few configuration keys have been renamed since earlier versions of Spark; in such cases, the older key names are still accepted, but take lower precedence than any instance of the newer key.
For more info: Spark Documentation
So I suggest reviewing spark-submit parameters and configuration files.
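For example (a sketch; the script name is a placeholder), the limits can be passed on the spark-submit command line instead of in code:
spark-submit --master "local[4]" --driver-memory 6g your_script.py
In local mode the executors live inside the driver JVM, so a memory cap generally has to be supplied before the JVM starts (e.g. via --driver-memory or spark-defaults.conf) rather than via SparkConf in the script.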
Hope it helps.

Spark Yarn Architecture

I had a question regarding this image in a tutorial I was following. So, based on this image, in a YARN-based architecture does the execution of a Spark application look something like this:
First you have a driver which is running on a client node or some data node. This driver (similar to a driver in Java?) contains your code (written in Java, Python, Scala, etc.) that you submit to the SparkContext. Then the SparkContext represents the connection to HDFS and submits your request to the Resource Manager in the Hadoop ecosystem. The Resource Manager then communicates with the Name Node to figure out which data nodes in the cluster contain the information the client node asked for. The SparkContext will also put an executor on the worker node that will run the tasks. Then the Node Manager will start the executor, which will run the tasks given to it by the SparkContext and will return the data the client asked for from HDFS to the driver.
Is the above interpretation correct?
Also, would the driver send out three executors to each data node to retrieve the data from HDFS, since the data in HDFS is replicated 3 times across various data nodes?
Your interpretation is close to reality, but it seems that you are a bit confused on some points.
Let's see if I can make this more clear to you.
Let's say that you have the word count example in Scala.
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]) {
    val inputFile = args(0)
    val outputFile = args(1)
    val conf = new SparkConf().setAppName("wordCount")
    val sc = new SparkContext(conf)
    val input = sc.textFile(inputFile)
    val words = input.flatMap(line => line.split(" "))
    val counts = words.map(word => (word, 1)).reduceByKey { case (x, y) => x + y }
    counts.saveAsTextFile(outputFile)
  }
}
In every Spark job you have an initialisation step where you create a SparkContext object providing some configuration such as the app name and the master; then you read an inputFile, process it, and save the result of your processing to disk. All of this code runs in the Driver, except for the anonymous functions that do the actual processing (the functions passed to .flatMap, .map and reduceByKey) and the I/O functions textFile and saveAsTextFile, which run remotely on the cluster.
Here the DRIVER is the name given to the part of the program running locally on the same node where you submit your code with spark-submit (in your picture it is called the Client Node). You can submit your code from any machine (either the ClientNode, a WorkerNode or even the MasterNode) as long as you have spark-submit and network access to your YARN cluster. For simplicity I will assume that the client node is your laptop and the YARN cluster is made up of remote machines.
For simplicity I will leave ZooKeeper out of this picture, since it is used to provide high availability for HDFS and is not involved in running a Spark application. I have to mention that the YARN Resource Manager and the HDFS NameNode are roles in YARN and HDFS (actually they are processes running inside a JVM) and they could live on the same master node or on separate machines. Even YARN Node Managers and Data Nodes are only roles, but they usually live on the same machine to provide data locality (processing close to where the data are stored).
When you submit your application, you first contact the Resource Manager, which together with the NameNode tries to find worker nodes available to run your Spark tasks. In order to take advantage of the data locality principle, the Resource Manager will prefer worker nodes that store on the same machine the HDFS blocks (any of the 3 replicas of each block) of the file you have to process. If no worker node with those blocks is available, it will use any other worker node. In this case, since the data will not be available locally, HDFS blocks have to be moved over the network from any of the Data Nodes to the Node Manager running the Spark task. This process is done for each block that makes up your file, so some blocks may be found locally and some have to be moved.
When the ResourceManager finds a worker node available, it will contact the NodeManager on that node and ask it to create a YARN container (a JVM) in which to run a Spark executor. In other cluster modes (Mesos or standalone) you won't have a YARN container, but the concept of a Spark executor is the same. A Spark executor runs as a JVM and can run multiple tasks.
The Driver running on the client node and the tasks running on Spark executors keep communicating in order to run your job. If the driver is running on your laptop and your laptop crashes, you will lose the connection to the tasks and your job will fail. That is why, when Spark is running on a YARN cluster, you can specify whether you want to run your driver on your laptop ("--deploy-mode=client") or on the YARN cluster as another YARN container ("--deploy-mode=cluster"). For more details, look at spark-submit.
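For instance (a sketch; the jar name and arguments are placeholders), submitting the word-count job above with the driver running inside the cluster would look roughly like:
spark-submit --master yarn --deploy-mode cluster --class WordCount wordcount.jar /input/words.txt /output/counts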

Spark : check your cluster UI to ensure that workers are registered

I have a simple program in Spark:
/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object SimpleApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("spark://10.250.7.117:7077").setAppName("Simple Application").set("spark.cores.max", "2")
    val sc = new SparkContext(conf)
    val ratingsFile = sc.textFile("hdfs://hostname:8020/user/hdfs/mydata/movieLens/ds_small/ratings.csv")
    // first get the first 10 records
    println("Getting the first 10 records: ")
    ratingsFile.take(10)
    // get the number of records in the movie ratings file
    println("The number of records in the movie list are : ")
    ratingsFile.count()
  }
}
When I try to run this program from the spark-shell, i.e. I log into the name node (Cloudera installation) and run the commands sequentially in the spark-shell:
val ratingsFile = sc.textFile("hdfs://hostname:8020/user/hdfs/mydata/movieLens/ds_small/ratings.csv")
println("Getting the first 10 records: ")
ratingsFile.take(10)
println("The number of records in the movie list are : ")
ratingsFile.count()
I get correct results, but if I try to run the program from Eclipse, no resources are assigned to the program and in the console log all I see is:
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
Also, in the Spark UI, I see this:
Job keeps Running - Spark
Also, it should be noted that this version of spark was installed with Cloudera (hence no worker nodes show up).
What should I do to make this work?
EDIT:
I checked the HistoryServer and these jobs don't show up there (even in incomplete applications)
I have done configuration and performance tuning for many spark clusters and this is a very common/normal message to see when you are first prepping/configuring a cluster to handle your workloads.
This is unequivocally due to insufficient resources to have the job launched. The job is requesting one of:
more memory per worker than allocated to it (1GB)
more CPUs than are available on the cluster
Finally figured out what the answer is.
When deploying a spark program on a YARN cluster, the master URL is just yarn.
So in the program, the Spark context should just look like:
val conf = new SparkConf().setAppName("SimpleApp")
This Eclipse project should then be built using Maven, and the generated jar deployed to the cluster by copying it there and running the following command:
spark-submit --master yarn --class "SimpleApp" Recommender_2-0.0.1-SNAPSHOT.jar
This means that running from eclipse directly would not work.
You can check your cluster's worker node cores: your application can't exceed that. For example, say you have two worker nodes, each with 4 cores, and you have 2 applications to run. Then you can give each application 4 cores to run its job.
You can set it like this in the code:
SparkConf sparkConf = new SparkConf().setAppName("JianSheJieDuan")
        .set("spark.cores.max", "4");
It works for me.
There are also some causes of this same error message other than those posted here.
For a Spark-on-Mesos cluster, make sure you have Java 8 or a newer Java version on the Mesos slaves.
For Spark standalone, make sure you have Java 8 (or newer) on the workers.
You don't have any workers to execute the job. There are no available cores for the job to execute on, and that's why the job's state is still 'Waiting'.
If you have no workers registered with Cloudera, how will the jobs execute?