How to make Spark slaves use HDFS input files 'local' to them in a Hadoop+Spark cluster?

I have a cluster of 9 computers with Apache Hadoop 2.7.2 and Spark 2.0.0 installed on them. Each computer runs an HDFS datanode and Spark slave. One of these computers also runs an HDFS namenode and Spark master.
I've uploaded a few TBs of gz-archives to HDFS with Replication=2. It turned out that some of the archives are corrupt, and I want to find them. It looks like 'gunzip -t' can help. So I'm trying to find a way to run a Spark application on the cluster so that each Spark executor tests, as far as possible, the archives that are 'local' to it (i.e. that have one of their replicas on the same computer where the executor runs). The following script runs, but sometimes Spark executors process 'remote' files in HDFS:
// Usage (after packaging a jar with mainClass set to 'com.qbeats.cortex.CommoncrawlArchivesTester' in spark.pom
// and placing this jar file into Spark's home directory):
// ./bin/spark-submit --master spark://LV-WS10.lviv:7077 spark-cortex-fat.jar spark://LV-WS10.lviv:7077 hdfs://LV-WS10.lviv:9000/commoncrawl 9
// This tests the gz-archives in the directory hdfs://LV-WS10.lviv:9000/commoncrawl for corruption
// using a Spark cluster with the Spark master URL spark://LV-WS10.lviv:7077 and 9 Spark slaves
package com.qbeats.cortex

import org.apache.hadoop.mapred.TextInputFormat
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.FileSplit
import org.apache.spark.rdd.HadoopRDD
import org.apache.spark.{SparkContext, SparkConf, AccumulatorParam}
import sys.process._

object CommoncrawlArchivesTester extends App {
  object LogAccumulator extends AccumulatorParam[String] {
    def zero(initialValue: String): String = ""
    def addInPlace(log1: String, log2: String) = if (log1.isEmpty) log2 else log1 + "\n" + log2
  }

  override def main(args: Array[String]): Unit = {
    if (args.length >= 3) {
      val appName = "CommoncrawlArchivesTester"
      val conf = new SparkConf().setAppName(appName).setMaster(args(0))
      conf.set("spark.executor.memory", "6g")
      conf.set("spark.shuffle.service.enabled", "true")
      conf.set("spark.dynamicAllocation.enabled", "true")
      conf.set("spark.dynamicAllocation.initialExecutors", args(2))
      val sc = new SparkContext(conf)

      val log = sc.accumulator(LogAccumulator.zero(""))(LogAccumulator)

      val text = sc.hadoopFile(args(1), classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
      val hadoopRdd = text.asInstanceOf[HadoopRDD[LongWritable, Text]]
      val fileAndLine = hadoopRdd.mapPartitionsWithInputSplit { (inputSplit, iterator) =>
        val fileName = inputSplit.asInstanceOf[FileSplit].getPath.toString
        class FilePath extends Iterable[String] {
          def iterator = List(fileName).iterator
        }
        val result = (sys.env("HADOOP_PREFIX") + "/bin/hadoop fs -cat " + fileName) #| "gunzip -t" !

        println("Processed %s.".format(fileName))
        if (result != 0) {
          log.add(fileName)
          println("Corrupt: %s.".format(fileName))
        }
        (new FilePath).iterator
      }

      val result = fileAndLine.collect()
      println("Corrupted files:")
      println(log.value)
    }
  }
}
What would you suggest?
ADDED LATER:
I tried another script which reads files from HDFS via textFile(). It looks like a Spark executor doesn't prefer the input files that are 'local' to it. Doesn't this contradict "Spark brings code to data, not data to code"? (See the locality-check sketch after this script.)
// Usage (after packaging a jar with mainClass set to 'com.qbeats.cortex.CommoncrawlArchiveLinesCounter' in spark.pom)
// ./bin/spark-submit --master spark://LV-WS10.lviv:7077 spark-cortex-fat.jar spark://LV-WS10.lviv:7077 hdfs://LV-WS10.lviv:9000/commoncrawl 9
package com.qbeats.cortex

import org.apache.spark.{SparkContext, SparkConf}

object CommoncrawlArchiveLinesCounter extends App {
  override def main(args: Array[String]): Unit = {
    if (args.length >= 3) {
      val appName = "CommoncrawlArchiveLinesCounter"
      val conf = new SparkConf().setAppName(appName).setMaster(args(0))
      conf.set("spark.executor.memory", "6g")
      conf.set("spark.shuffle.service.enabled", "true")
      conf.set("spark.dynamicAllocation.enabled", "true")
      conf.set("spark.dynamicAllocation.initialExecutors", args(2))
      val sc = new SparkContext(conf)

      val helper = new Helper
      val nLines = sc.
        textFile(args(1) + "/*").
        mapPartitionsWithIndex( (index, it) => {
          println("Processing partition %s".format(index))
          it
        }).
        count

      println(nLines)
    }
  }
}
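A quick way to check what the scheduler actually sees (a sketch, not part of the original script; it reuses sc and args from the script above and belongs inside the same main): print the preferred locations Spark derives from the HDFS block locations of each partition. If these hostnames don't match the hostnames the executors register with, NODE_LOCAL scheduling can't happen.

// Hypothetical locality check: list, for every partition, the hosts Spark considers "preferred".
val rdd = sc.textFile(args(1) + "/*")
rdd.partitions.foreach { p =>
  println(s"partition ${p.index} -> ${rdd.preferredLocations(p).mkString(", ")}")
}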
SAIF C, could you explain in more detail please?

I've solved the problem by switching from Spark’s standalone mode to YARN.
Related topic: How does Apache Spark know about HDFS data nodes?
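For reference, a minimal sketch (my assumption, not the asker's exact code) of the driver-side changes when the job runs on YARN: the master is no longer hard-coded, and the locality wait can be raised so the scheduler holds out longer for NODE_LOCAL slots.

import org.apache.spark.{SparkConf, SparkContext}

// Submitted with: ./bin/spark-submit --master yarn <jar> <args>
val conf = new SparkConf()
  .setAppName("CommoncrawlArchivesTester")   // no setMaster(): taken from spark-submit
  .set("spark.locality.wait", "30s")         // wait longer for a node-local slot (default 3s)
val sc = new SparkContext(conf)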

Related

How can I read/write data from Azurite using Spark?

I have tried to read/write Parquet files from/to Azurite using Spark like this:
import com.holdenkarau.spark.testing.DatasetSuiteBase
import org.apache.spark.SparkConf
import org.apache.spark.sql.SaveMode
import org.scalatest.WordSpec
class SimpleAzuriteSpec extends WordSpec with DatasetSuiteBase {
  val AzuriteHost = "localhost"
  val AzuritePort = 10000
  val AzuriteAccountName = "devstoreaccount1"
  val AzuriteAccountKey = "Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw=="
  val AzuriteContainer = "container1"
  val AzuriteDirectory = "dir1"
  val AzuritePath = s"wasb://$AzuriteContainer@$AzuriteAccountName.blob.core.windows.net/$AzuriteDirectory/"

  override final def conf: SparkConf = {
    val cfg = super.conf
    val settings =
      Map(
        s"spark.hadoop.fs.azure.storage.emulator.account.name" -> AzuriteAccountName,
        s"spark.hadoop.fs.azure.account.key.${AzuriteAccountName}.blob.core.windows.net" -> AzuriteAccountKey
      )
    settings.foreach { case (k, v) =>
      cfg.set(k, v)
    }
    cfg
  }

  "Spark" must {
    "write to/read from Azurite" in {
      import spark.implicits._
      val xs = List(Rec(1, "Alice"), Rec(2, "Bob"))
      val inputDs = spark.createDataset(xs)

      inputDs.write
        .format("parquet")
        .mode(SaveMode.Overwrite)
        .save(AzuritePath)

      val ds = spark.read
        .format("parquet")
        .load(AzuritePath)
        .as[Rec]

      ds.show(truncate = false)

      val actual = ds.collect().toList.sortBy(_.id)
      assert(actual == xs)
    }
  }
}

case class Rec(id: Int, name: String)
I have tried both Azurite 3.9.0 and Azurite 2.7.0 (both in Docker). I can transfer files to/from Azurite using az (dockerized as well).
The test above runs on the Docker host. Azurite is reachable from the Docker host.
I am using Spark 2.4.5, Hadoop 2.10.0, and this dependency:
libraryDependencies += "org.apache.hadoop" % "hadoop-azure" % "2.10.0"
When using az, this connection string works:
AZURE_STORAGE_CONNECTION_STRING="DefaultEndpointsProtocol=http;AccountName=devstoreaccount1;AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==;BlobEndpoint=http://azurite-3.9.0:10000/devstoreaccount1;QueueEndpoint=http://azurite-3.9.0:10001/devstoreaccount1;"
yet I do not know how to configure this in Spark.
My question: How can I configure the host, the port, credentials etc. (in the path or in SparkConf)?
Yes, that's possible, but Azurite has to be reachable at 127.0.0.1:10000 for wasb (if it runs on another machine, port forwarding will help). Then specify the following Spark args, for example:
./pyspark --conf "spark.hadoop.fs.defaultFS=wasb://container@azurite" --conf "spark.hadoop.fs.azure.storage.emulator.account.name=azurite"
The default file system will then be backed by your Azurite instance.
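For a Scala application, a rough equivalent of those two flags (a sketch; it assumes hadoop-azure and its azure-storage dependency are on the classpath and keeps the answer's example account name "azurite"):

import org.apache.spark.sql.SparkSession

// Point the default file system at the emulator account, as in the pyspark command above.
val spark = SparkSession.builder()
  .appName("azurite-wasb")
  .config("spark.hadoop.fs.defaultFS", "wasb://container@azurite")
  .config("spark.hadoop.fs.azure.storage.emulator.account.name", "azurite")
  .getOrCreate()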
Don't use Azurite, just add these Jars to your Spark Dockerfile:
# Set JARS env
ENV JARS=${SPARK_HOME}/jars/azure-storage-${AZURE_STORAGE_VER}.jar,${SPARK_HOME}/jars/hadoop-azure-${HADOOP_AZURE_VER}.jar,${SPARK_HOME}/jars/jetty-util-ajax-${JETTY_VER}.jar,${SPARK_HOME}/jars/jetty-util-${JETTY_VER}.jar
RUN echo "spark.jars ${JARS}" >> $SPARK_HOME/conf/spark-defaults.conf
Set your configuration:
spark = SparkSession.builder.config(conf=sparkConf).getOrCreate()
spark.sparkContext._jsc.hadoopConfiguration().set(f"fs.azure.account.key.{ os.environ['AZURE_STORAGE_ACCOUNT'] }.blob.core.windows.net", os.environ['AZURE_STORAGE_KEY'])
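If the application is in Scala rather than PySpark, the same key can be set on the Hadoop configuration directly (a sketch; AZURE_STORAGE_ACCOUNT and AZURE_STORAGE_KEY are assumed to be set in the environment):

// Same setting as the Python line above, via the Scala API.
spark.sparkContext.hadoopConfiguration.set(
  s"fs.azure.account.key.${sys.env("AZURE_STORAGE_ACCOUNT")}.blob.core.windows.net",
  sys.env("AZURE_STORAGE_KEY"))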
Then you can read it:
val df = spark.read.parquet("wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/<directory-name>")

Spark job hanging up on an EC2 m4.10x machine

I have the code below that launches a Spark job. When I work on fewer than 40 files (the maximum number of cores on my machine), parallelize works fine; however, when I work on more files than that, it runs into trouble. Any advice, please?
object Cleanup extends Processor {
  def main(args: Array[String]): Unit = {
    val fileSeeker = new TelemetryFileSeeker("Config")
    val files = fileSeeker.searchFiles(bucketName, urlPrefix, "2018-01-01T00:00:00.000Z", "2018-04-30T00:00:00.000Z").filter(_.endsWith(".gz"))
      .map(each => (each, each.slice(0, each.lastIndexOf("/")))).slice(0, 100)
    if (files.nonEmpty) {
      println("Number of Files" + files.length)
      sc.parallelize(files).map(each => changeFormat(each)).collect()
    }
  }

  def changeFormat(file: (String, String)): Unit = {
    val fileProcessor = new Processor("Config", sparksession)
    val uuid = java.util.UUID.randomUUID.toString
    val tempInput = "inputfolder" + uuid
    val tempOutput = "outputfolder" + uuid
    val inpaths = Paths.get(tempInput)
    val outpaths = Paths.get(tempOutput)
    if (Files.notExists(inpaths)) Files.createDirectory(inpaths)
    if (Files.notExists(outpaths)) Files.createDirectory(outpaths)
    val downloadedFiles = fileProcessor.downloadAndUnzip(bucketName, List(file._1), tempInput)
    val parsedFiles = fileProcessor.parseCSV(downloadedFiles)
    parsedFiles.select(
      "pa1",
      "pa2",
      "pa3"
    ).withColumn("pa4", lit(0.0)).write.mode(SaveMode.Overwrite).format(CSV_FORMAT)
      .option("codec", "org.apache.hadoop.io.compress.GzipCodec").save(tempOutput)
    val processedFiles = new File(tempOutput).listFiles.filter(_.getName.endsWith(".gz"))
    val filesNames = processedFiles.map(_.getName).toList
    val filesPaths = processedFiles.map(_.getPath).toList
    fileProcessor.cleanUpRemote(bucketName, "new/" + file._2, filesNames)
    fileProcessor.uploadFiles(bucketName, "new/" + file._2, filesPaths)
    fileProcessor.cleanUpLocal(tempInput, tempOutput)
    val remoteFiles = fileProcessor.checkRemote(bucketName, "new/" + file._2, filesNames)
    logger.info("completed " + file._1)
  }
}
Spark config below:
lazy val spark = SparkSession
  .builder()
  .appName("Project")
  .config("spark.master", "local[*]")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .config("spark.executor.memory", "5g")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .enableHiveSupport()
  .getOrCreate()
FYI: the parseCSV function downloads one file into a temporary folder and creates a dataframe from the files in that folder. Each file is about 1 GB in size. Also, I am running this using java -cp jar class.
While I couldn't figure out the exact issue, I bypassed it by passing only 38 files at a time, using the List.grouped method (see the sketch below).
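The grouping could look roughly like this (a sketch reusing the question's own files, sc, and changeFormat; 38 is just the batch size mentioned above):

// Process at most 38 files per pass so the number of simultaneous tasks
// stays below the cores available to local[*].
files.grouped(38).foreach { batch =>
  sc.parallelize(batch).map(each => changeFormat(each)).collect()
}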

Spark MLlib unable to write out to S3: path already exists

I have data in an S3 bucket in directory /data/vw/. Each line is of the form:
| abc:2 def:1 ghi:3 ...
I want to convert it to the following format:
abc abc def ghi ghi ghi
The new converted lines should go to S3 in directory /data/spark
Basically, repeat each string the number of times that follows the colon. I am trying to convert a VW LDA input file to a corresponding file for consumption by Spark's LDA library.
The code:
import org.apache.spark.{SparkConf, SparkContext}

object Vw2SparkLdaFormatConverter {

  def repeater(s: String): String = {
    val ssplit = s.split(':')
    (ssplit(0) + ' ') * ssplit(1).toInt
  }

  def main(args: Array[String]) {
    val inputPath = args(0)
    val outputPath = args(1)

    val conf = new SparkConf().setAppName("FormatConverter")
    val sc = new SparkContext(conf)

    val vwdata = sc.textFile(inputPath)
    val sparkdata = vwdata.map(s => s.trim().split(' ').map(repeater).mkString)

    val coalescedSparkData = sparkdata.coalesce(100)
    coalescedSparkData.saveAsTextFile(outputPath)

    sc.stop()
  }
}
When I run this (as a Spark EMR job in AWS), the step fails with exception:
18/01/20 00:16:28 ERROR ApplicationMaster: User class threw exception: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory s3a://mybucket/data/spark already exists
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory s3a://mybucket/data/spark already exists
at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:131)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1119)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1096)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1096)
at ...
The code is run as:
spark-submit --class Vw2SparkLdaFormatConverter --deploy-mode cluster --master yarn --conf spark.yarn.submit.waitAppCompletion=true --executor-memory 4g s3a://mybucket/scripts/myscalajar.jar s3a://mybucket/data/vw s3a://mybucket/data/spark
I have tried specifying new output paths (/data/spark1 etc.), ensuring that they do not exist before the step runs. Even then it does not work.
What am I doing wrong? I am new to Scala and Spark so I might be overlooking something here.
You could convert to a DataFrame and then save with overwrite enabled.
coalescedSparkData.toDF.write.mode("overwrite").csv(outputPath)
Or, if you insist on using RDD methods, you can do as described in this answer.
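For the RDD route, one common workaround looks like this (a sketch of mine, reusing the question's outputPath, sc, and coalescedSparkData; it assumes the previous output may safely be deleted): remove the directory via the Hadoop FileSystem API before saving.

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

// Delete the previous run's output (recursively), then write as before.
val fs = FileSystem.get(URI.create(outputPath), sc.hadoopConfiguration)
val out = new Path(outputPath)
if (fs.exists(out)) fs.delete(out, true)
coalescedSparkData.saveAsTextFile(outputPath)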

Setting up IntelliJ to run Apache Spark with a remote master

I have a project set up with H2O. I am able to run the code with Apache Toree, with the Spark master set to spark://xxxx.yyyy.zzzz:port.
It works fine and I can see the output in the Spark UI.
I am trying to run the same code as an application in IntelliJ, but I get the error java.lang.ClassNotFoundException: org.apache.spark.h2o.utils.NodeDesc, although I do see the application in the Spark UI for a short amount of time.
I tried running a simple hello-world application, and that worked as well; I was able to see the application in the Spark UI.
import java.io.File

import hex.tree.gbm.GBM
import hex.tree.gbm.GBMModel.GBMParameters
import org.apache.spark.h2o.{StringHolder, H2OContext}
import org.apache.spark.{SparkFiles, SparkContext, SparkConf}
import water.fvec.H2OFrame

/**
 * Example of Sparkling Water based application.
 */
object SparklingWaterDroplet {

  def main(args: Array[String]) {
    // Create Spark Context
    val conf = configure("Sparkling Water Droplet")
    val sc = new SparkContext(conf)

    // Create H2O Context
    val h2oContext = H2OContext.getOrCreate(sc)
    import h2oContext.implicits._

    // Register file to be available on all nodes
    sc.addFile(this.getClass.getClassLoader.getResource("iris.csv").getPath)

    // Load data and parse it via h2o parser
    val irisTable = new H2OFrame(new File(SparkFiles.get("iris.csv")))

    // Build GBM model
    val gbmParams = new GBMParameters()
    gbmParams._train = irisTable
    gbmParams._response_column = 'class
    gbmParams._ntrees = 5

    val gbm = new GBM(gbmParams)
    val gbmModel = gbm.trainModel.get

    // Make prediction on train data
    val predict = gbmModel.score(irisTable)('predict)

    // Compute number of mispredictions with help of Spark API
    val trainRDD = h2oContext.asRDD[StringHolder](irisTable('class))
    val predictRDD = h2oContext.asRDD[StringHolder](predict)

    // Make sure that both RDDs has the same number of elements
    assert(trainRDD.count() == predictRDD.count)
    val numMispredictions = trainRDD.zip(predictRDD).filter( i => {
      val act = i._1
      val pred = i._2
      act.result != pred.result
    }).collect()

    println(
      s"""
         |Number of mispredictions: ${numMispredictions.length}
         |
         |Mispredictions:
         |
         |actual X predicted
         |------------------
         |${numMispredictions.map(i => i._1.result.get + " X " + i._2.result.get).mkString("\n")}
       """.stripMargin)

    // Shutdown application
    sc.stop()
  }

  def configure(appName: String = "Sparkling Water Demo"): SparkConf = {
    val conf = new SparkConf().setAppName(appName)
      .setMaster("spark://xxx.yyy.zz.aaaa:oooo")
    conf
  }
}
I also tried exporting the jars as 'compile' from the dependencies menu.
Is there anything I am missing in the IntelliJ setup?
It looks like the external libraries are not getting pushed to the master.
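One thing worth checking (a hedged sketch, with a placeholder jar path): when launching from IntelliJ against a remote standalone master, the application's jars are not shipped automatically the way spark-submit ships them, so they can be listed explicitly on the SparkConf:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("Sparkling Water Droplet")
  .setMaster("spark://xxx.yyy.zz.aaaa:oooo")
  // Placeholder path: point this at the project's assembled (fat) jar so the
  // Sparkling Water / H2O classes reach the executors.
  .setJars(Seq("target/scala-2.11/sparkling-water-droplet-assembly.jar"))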

Spark-submit cannot access local file system

Really simple Scala code fails at the first count() method call.
def main(args: Array[String]) {
  // create Spark context with Spark configuration
  val sc = new SparkContext(new SparkConf().setAppName("Spark File Count"))
  val fileList = recursiveListFiles(new File("C:/data")).filter(_.isFile).map(file => file.getName())
  val filesRDD = sc.parallelize(fileList)
  val linesRDD = sc.textFile("file:///temp/dataset.txt")
  val lines = linesRDD.count()
  val files = filesRDD.count()
}
I don't want to set up a HDFS installation for this right now. How do I configure Spark to use the local file system? This works with spark-shell.
To read a file from the local filesystem (from a Windows directory), you need to use the pattern below.
val fileRDD = sc.textFile("C:\\Users\\Sandeep\\Documents\\test\\test.txt");
Please see the sample working program below, which reads data from the local file system.
package com.scala.example

import org.apache.spark._

object Test extends Serializable {
  val conf = new SparkConf().setAppName("read local file")
  conf.set("spark.executor.memory", "100M")
  conf.setMaster("local")
  val sc = new SparkContext(conf)
  val input = "C:\\Users\\Sandeep\\Documents\\test\\test.txt"

  def main(args: Array[String]): Unit = {
    val fileRDD = sc.textFile(input)
    val counts = fileRDD.flatMap(line => line.split(","))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.collect().foreach(println)

    // Stop the Spark context
    sc.stop
  }
}
val sc = new SparkContext(new SparkConf().setAppName("Spark File Count").setMaster("local[8]"))
might help.
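Putting the two answers together, a minimal sketch (assuming the data really does live on the machine that runs the driver): run in local mode and address the file with a file:// URI.

import org.apache.spark.{SparkConf, SparkContext}

// Local master, local file path.
val sc = new SparkContext(new SparkConf().setAppName("Spark File Count").setMaster("local[8]"))
val linesRDD = sc.textFile("file:///temp/dataset.txt")
println(linesRDD.count())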