Sharing data between nodes using Apache Spark - scala

Here is how I launch the Spark job:
./bin/spark-submit \
--class MyDriver \
--master spark://master:7077 \
--executor-memory 845M \
--deploy-mode client \
./bin/SparkJob-0.0.1-SNAPSHOT.jar
The class MyDriver accesses the Spark context using:
val sc = new SparkContext(new SparkConf())
val dataFile = sc.textFile("/data/example.txt", 1)
In order to run this within a cluster I copy the file "/data/example.txt" to all nodes in the cluster. Is there a mechanism in Spark to share this data file between nodes without copying it manually? I don't think I can use a broadcast variable in this case?
Update:
An option is to have a dedicated file server which shares the file to be processed: val dataFile = sc.textFile("http://fileserver/data/example.txt", 1)

sc.textFile("/some/file.txt") reads a file that is distributed in HDFS, i.e.:
/some/file.txt is (already) split into multiple parts, which are spread over a couple of machines each,
and each worker/task reads one part of the file. This is useful because you don't need to manage the splitting yourself.
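For illustration, here is a minimal sketch, assuming the file has been put into HDFS (the path is an assumption):
val dataFile = sc.textFile("hdfs:///data/example.txt")
// each partition (roughly one per HDFS block) is read by a separate task on some worker,
// so no manual copying to the worker nodes is needed
println(s"partitions: ${dataFile.partitions.length}")
println(s"lines: ${dataFile.count()}")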
If you have copied the file to each worker node, you can read it in every task:
val myRdd = sc.parallelize(1 to 100) // 100 elements
val fileReadEverywhere = myRdd.map(_ => read("/my/file.txt"))
and have the code of read(...) implemented somewhere.
Otherwise, you can also use a broadcast variable that is sent from the driver to all workers:
val myObject = read("/my/file.txt") // object instantiated on the driver node
val bdObj = sc.broadcast(myObject)
val myRdd = sc.parallelize(1 to 100)
  .map { i =>
    // use bdObj in task i, e.g.:
    bdObj.value.process(i)
  }
In this case, myObject should be serializable, and it is better if it is not too big.
Also, the method read(...) runs on the driver machine, so you only need the file on the driver. But if you don't know which machine that is (e.g. if you use spark-submit), then the file should be on all machines :-\ . In this case, it may be better to read from some DB or an external file system.
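For completeness, here is a minimal self-contained sketch of the broadcast approach, assuming read(...) just loads the lines of a small file with scala.io.Source (the path and the per-task logic are placeholders):
import scala.io.Source

// assumption: the file is small enough to fit in driver memory
def read(path: String): Seq[String] = {
  val src = Source.fromFile(path)
  try src.getLines().toList finally src.close()
}

val lines = read("/my/file.txt")    // runs on the driver only
val bdLines = sc.broadcast(lines)   // shipped once to each executor

val result = sc.parallelize(1 to 100).map { i =>
  // each task reads the broadcast contents locally
  bdLines.value.count(_.contains(i.toString))
}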


Class org.apache.spark.SparkException, java.lang.NoClassDefFoundError: Could not initialize class XXX

I get org.apache.spark.SparkException, java.lang.NoClassDefFoundError: Could not initialize class XXX (the class where the field validation lives) when I am trying to do field validations on a Spark DataFrame. Here is my code.
All classes and objects used are serializable. It fails as an AWS EMR Spark job (it works fine on my local machine).
val newSchema = df.schema.add("errorList", ArrayType(new StructType()
  .add("fieldName", StringType)
  .add("value", StringType)
  .add("message", StringType)))
// Validators is a sequence of validations on columns in a Row.
// Validator method signature:
// def checkForErrors(row: Row): (fieldName, value, message) = {
//   logic to validate the field in a row }
val validateRow: Row => Row = (row: Row) => {
  val errorList = validators.map(validator => validator.checkForErrors(row))
  Row.merge(row, Row(errorList))
}
val validateDf = df.map(validateRow)(RowEncoder.apply(newSchema))
Versions: Spark 2.4.7 and Scala 2.11.8
Any ideas on why this might happen, or has someone had the same issue?
I faced a very similar problem with EMR release 6.8.0 - in particular, the spark.jars configuration was not respected for me on EMR (I pointed it at the location of a JAR in S3), even though it seems to be a normally accepted Spark parameter.
For me, the solution was to follow this guide ("How do I resolve the "java.lang.ClassNotFoundException" in Spark on Amazon EMR?"):
https://aws.amazon.com/premiumsupport/knowledge-center/emr-spark-classnotfoundexception/
In CDK (where our EMR cluster definition lives), I set up an EMR step, executed immediately after cluster creation, that rewrites spark.driver.extraClassPath and spark.executor.extraClassPath to also contain the location of my additional JAR (in my case the JAR physically comes in a Docker image, but you could also set up a bootstrap action to copy it onto the cluster from S3), as per the code in the article under "For Amazon EMR release version 6.0.0 and later".
The reason you have to do this "rewriting" is that EMR already populates spark.*.extraClassPath with a bunch of its own JAR locations, e.g. the JARs that contain the S3 drivers, so you effectively have to append your own JAR location rather than just setting spark.*.extraClassPath to your location. If you do the latter (I tried it), you will lose a lot of EMR functionality, such as being able to read from S3.
#!/bin/bash
#
# This is an example of script_b.sh for changing /etc/spark/conf/spark-defaults.conf
#
while [ ! -f /etc/spark/conf/spark-defaults.conf ]
do
sleep 1
done
#
# Now the file is available, do your work here
#
sudo sed -i '/spark.*.extraClassPath/s/$/:\/home\/hadoop\/extrajars\/\*/' /etc/spark/conf/spark-defaults.conf
exit 0

Filtering millions of files with pySpark and Cloud Storage

I am facing the following task: I have small individual files (megabyte-sized) stored in a Google Cloud Storage bucket, grouped into directories by date (each directory contains around 5k files). I need to look at each (XML) file, filter the proper ones, and either put them into Mongo or write them back to Google Cloud Storage in, say, Parquet format. I wrote a simple pySpark program that looks like this:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import *
spark = (
    SparkSession
    .builder
    .appName('myApp')
    .config("spark.mongodb.output.uri", "mongodb://<mongo_connection>")
    .config("spark.mongodb.output.database", "test")
    .config("spark.mongodb.output.collection", "test")
    .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
    .config("spark.dynamicAllocation.enabled", "true")
    .getOrCreate()
)
spark_context = spark.sparkContext
spark_context.setLogLevel("INFO")
sql_context = pyspark.SQLContext(spark_context)
# configure Hadoop
hadoop_conf = spark_context._jsc.hadoopConfiguration()
hadoop_conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
hadoop_conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
# DataFrame schema
schema = StructType([
    StructField('filename', StringType(), True),
    StructField("date", DateType(), True),
    StructField("xml", StringType(), True)
])
# -------------------------
# Main operation
# -------------------------
# get all files
files = spark_context.wholeTextFiles('gs://bucket/*/*.gz')
rows = files \
    .map(lambda x: custom_checking_map(x)) \
    .filter(lambda x: x is not None)
# transform to DataFrame
df = sql_context.createDataFrame(rows, schema)
# write to mongo
df.write.format("mongo").mode("append").save()
# write back to Cloud Storage
df.write.parquet('gs://bucket/test.parquet')
spark_context.stop()
I tested it on a subset (a single directory, gs://bucket/20191010/*.gz) and it works. I deployed it on a Google Dataproc cluster, but I doubt anything is happening since the logs stop after 19/11/06 15:41:40 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1573054807908_0001
I am running a 3-worker cluster with 4 cores and 15GB RAM + 500GB HDD each. Spark version 2.3.3, Scala 2.11, mongo-connector-spark_2.11-2.3.3.
I am new to Spark so any suggestions are appreciated. Normally I would write this job using Python multiprocessing, but I wanted to move to something "better", and now I am not sure.
It can take a significant amount of time to list a very large number of files in GCS - most probably your job "hangs" while the Spark driver lists all the files before starting processing.
You will achieve much better performance by listing all directories first and then processing the files in each directory. To get the best performance you can process directories in parallel, but taking into account that each directory has 5k files and your cluster has only 3 workers, it could be good enough to process the directories sequentially.
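For example, here is a sketch of the sequential variant, reusing the session, schema and custom_checking_map from your code (the date range and the output layout are assumptions):
from datetime import date, timedelta

# iterate over the date-named directories instead of globbing the whole bucket at once
start, end = date(2019, 10, 1), date(2019, 11, 6)
dirs = [(start + timedelta(days=i)).strftime('%Y%m%d')
        for i in range((end - start).days + 1)]

for d in dirs:
    files = spark_context.wholeTextFiles('gs://bucket/{}/*.gz'.format(d))
    rows = files.map(custom_checking_map).filter(lambda x: x is not None)
    df = sql_context.createDataFrame(rows, schema)
    df.write.format("mongo").mode("append").save()
    # write each day's Parquet output under its own prefix (assumed layout)
    df.write.mode("append").parquet('gs://bucket/output/{}'.format(d))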

Simple Spark job fails due to GC overhead limit

I've created a standalone Spark (2.1.1) cluster on my local machines
with 9 cores / 80G per machine (a total of 27 cores / 240G RAM).
I've got a sample Spark job that sums all the numbers from 1 to x.
This is the code:
package com.example
import org.apache.spark.sql.SparkSession

object ExampleMain {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("spark://192.168.1.2:7077")
      .config("spark.driver.maxResultSize", "3g")
      .appName("ExampleApp")
      .getOrCreate()
    val sc = spark.sparkContext
    val rdd = sc.parallelize(List.range(1, 1000))
    val sum = rdd.reduce((a, b) => a + b)
    println(sum)
    done
  }
  def done = {
    println("\n\n")
    println("-------- DONE --------")
  }
}
When running the above code I get results after a few seconds,
so I cranked up the code to sum all the numbers from 1 to 1B (1,000,000,000), and then I get a "GC overhead limit exceeded" error.
I read that Spark should spill to disk if there isn't enough memory. I've tried to play with my cluster configuration, but that didn't help.
Driver memory = 6G
Number of workers = 24
Cores per worker = 1
Memory per worker = 10G
I'm not a developer and have no knowledge of Scala, but I would like to find a solution to run this code without GC issues.
Per philantrovert's request, I'm adding my spark-submit command:
/opt/spark-2.1.1/bin/spark-submit \
--class "com.example.ExampleMain" \
--master spark://192.168.1.2:6066 \
--deploy-mode cluster \
/mnt/spark-share/example_2.11-1.0.jar
In addition, my spark/conf files are as follows:
The slaves file contains the 3 IP addresses of my nodes (including the master).
spark-defaults.conf contains:
spark.master spark://192.168.1.2:7077
spark.driver.memory 10g
spark-env.sh contains:
SPARK_LOCAL_DIRS= shared folder among all nodes
SPARK_EXECUTOR_MEMORY=10G
SPARK_DRIVER_MEMORY=10G
SPARK_WORKER_CORES=1
SPARK_WORKER_MEMORY=10G
SPARK_WORKER_INSTANCES=8
SPARK_WORKER_DIR= shared folder among all nodes
SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true"
Thanks
I suppose the problem is that you create a List with 1 billion entries on the driver, which is a huge data structure (4GB). There is a more efficient way to programmatically create a Dataset/RDD:
val rdd = spark.range(1000000000L).rdd
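For example, a sketch of the same sum computed this way, without materializing anything on the driver:
// the range is generated lazily, per partition, on the executors
val sum = spark.range(1L, 1000000001L)  // 1 to 1,000,000,000 (the end is exclusive)
  .rdd
  .map(_.longValue)                     // unbox java.lang.Long to Scala Long
  .reduce(_ + _)
println(sum)                            // 500000000500000000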

Spark Cluster: How to print out the content of RDD on each worker node

I just started learning Apache Spark and wanted to know why this is not working for me.
I am running Spark 2.1 and have started a master and a worker (not local). This is my code:
object SimpleApp {
  def main(args: Array[String]) {
    val file = [FILELOCATION]
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val textFile = sc.textFile(file)
    val counts = textFile.flatMap(line => line.split(" "))
      .map(word => (word.toLowerCase.toCharArray.toList.sorted.mkString, 1))
      .reduceByKey(_ + _)
    counts.map(println)
    counts.foreach(println)
    val countCollect = counts.collect()
    sc.stop()
  }
}
I cannot seem to get the worker nodes to print their contents to stdout. Even if I set the master and worker to local, it does not seem to work.
Am I misunderstanding something here?
If you want to print something on an executor, a normal println will do. It will print the output to that executor's stdout.
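For example, using the counts RDD from the question, the following output lands in each executor's stdout log, not in the driver console:
counts.foreach { case (word, count) =>
  // printed on whichever executor processes this partition
  println(s"$word -> $count")
}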
You can actually view the worker status, application status, the stderr and stdout of each worker, the RDD distribution, and many more things at localhost:8080 in a browser on the master machine. Click on a worker ID to view its logs (stdout, stderr). If you want to see the actual distribution and status, click on the running application, and then click on the Application Detail UI link; it will show the complete details of your application.
If you only want to view the worker UI, you can see it by typing localhost:8081 on the worker machine.
Whenever you submit a Spark job, the tasks (instructions) for the job go from the driver to the executors. The driver can be running on the same node that you are currently logged onto (local and YARN client mode), or the driver can be on another node (the application master).
All actions return a result back to the driver, so if you are logged onto the machine where the driver runs, you can see the output. But you cannot see the output on the executor nodes, since any print statement is printed on the console of the respective machine. You can instead call saveAsTextFile(), which saves each partition into the output directory as a separate file; that way you can see the contents of each partition.
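For example (the output path is just a placeholder):
// writes one part-XXXXX file per partition under the given directory
counts.saveAsTextFile("hdfs:///tmp/word-counts")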

Checkpoint data corruption in Spark Streaming

I am testing checkpointing and write ahead logs with the basic Spark Streaming code below. I am checkpointing into a local directory. After starting and stopping the application a few times (using Ctrl-C), it refuses to start, due to what looks like data corruption in the checkpoint directory. I am getting:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 80.0 failed 1 times, most recent failure: Lost task 0.0 in stage 80.0 (TID 17, localhost): com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 13994
at com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137)
at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:192)
Full code:
import org.apache.hadoop.conf.Configuration
import org.apache.spark._
import org.apache.spark.streaming._
object ProtoDemo {
  def createContext(dirName: String) = {
    val conf = new SparkConf().setAppName("mything")
    conf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint(dirName)
    val lines = ssc.socketTextStream("127.0.0.1", 9999)
    val words = lines.flatMap(_.split(" "))
    val pairs = words.map(word => (word, 1))
    val wordCounts = pairs.reduceByKey(_ + _)
    val runningCounts = wordCounts.updateStateByKey[Int] {
      (values: Seq[Int], oldValue: Option[Int]) =>
        val s = values.sum
        Some(oldValue.fold(s)(_ + s))
    }
    // Print the first ten elements of each RDD generated in this DStream to the console
    runningCounts.print()
    ssc
  }

  def main(args: Array[String]) = {
    val hadoopConf = new Configuration()
    val dirName = "/tmp/chkp"
    val ssc = StreamingContext.getOrCreate(dirName, () => createContext(dirName), hadoopConf)
    ssc.start()
    ssc.awaitTermination()
  }
}
Basically, what you are testing is a driver failure scenario. For this to work, depending on the cluster manager you are running, you have to follow the instructions below to monitor the driver process and relaunch the driver if it fails:
Configuring automatic restart of the application driver - To automatically recover from a driver failure, the deployment infrastructure that is used to run the streaming application must monitor the driver process and relaunch the driver if it fails. Different cluster managers have different tools to achieve this.
Spark Standalone - A Spark application driver can be submitted to run within the Spark Standalone cluster (see cluster deploy mode), that is, the application driver itself runs on one of the worker nodes. Furthermore, the Standalone cluster manager can be instructed to supervise the driver, and relaunch it if the driver fails either due to a non-zero exit code, or due to failure of the node running the driver. See cluster mode and supervise in the Spark Standalone guide for more details (a spark-submit sketch follows after this list).
YARN - YARN supports a similar mechanism for automatically restarting an application. Please refer to the YARN documentation for more details.
Mesos - Marathon has been used to achieve this with Mesos.
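For the Standalone case above, a sketch of such a supervised cluster-mode submission (the master URL and jar path are placeholders):
./bin/spark-submit \
  --class ProtoDemo \
  --master spark://master:7077 \
  --deploy-mode cluster \
  --supervise \
  /path/to/streaming-app.jar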
You also need to configure the write ahead logs as below; there are special instructions for S3 which you need to follow.
While using S3 (or any file system that does not support flushing) for write ahead logs, please remember to enable
spark.streaming.driver.writeAheadLog.closeFileAfterWrite and
spark.streaming.receiver.writeAheadLog.closeFileAfterWrite.
See Spark Streaming Configuration for more details.
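For example, these could be set on the SparkConf inside createContext (a sketch; only needed when the write ahead log directory is on S3 or a similar file system without flush support):
conf.set("spark.streaming.driver.writeAheadLog.closeFileAfterWrite", "true")
conf.set("spark.streaming.receiver.writeAheadLog.closeFileAfterWrite", "true")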
The issue looks more like a Kryo serializer issue than checkpoint corruption.
In the code example (including the GitHub project), Kryo serialization is not configured.
Since it is not configured, a KryoException should not happen.
When using "write ahead logs" and restoring from a checkpoint directory, all of the Spark config is read from there.
In your example, the createContext method is not called when starting from the checkpoint.
I assume the issue is that another application was tested before with the same checkpoint directory, and the Kryo serializer was configured there.
The current application then fails to restore from that checkpoint.
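For reference, a sketch of how such an earlier configuration might have looked (MyClass is just a placeholder); if a run with this config used the same checkpoint directory, the recovered context would try to read Kryo-serialized data:
val conf = new SparkConf()
  .setAppName("mything")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyClass]))  // placeholder class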