I am running linear regression in PySpark with multiple parameters and it works fine, but if I run the same code with the same configuration in Scala, it gives me the error below when fitting the model.
Please ask questions and I will add all the details; I have been stuck on this for the last month.
19/12/06 23:59:35 ERROR cluster.YarnScheduler: Lost executor 6 on bvpr-bdaws09.vq.internal.vodafone.com: Container killed by YARN for exceeding memory limits. 1.5 GB of 1.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
One more thing: in Scala, if I fit the model directly, without the multiple parameters, it also works.
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}

val spark = (SparkSession.builder
.appName("MLib_scala")
.master("local")
.config("spark.driver.memory", "3g")
.config("spark.executor.memory", "20g")
.config("spark.dynamicAllocation.maxExecutors","5")
.config("spark.executor.cores", "3")
.config("spark.yarn.executor.memoryOverhead","2g")
.enableHiveSupport()
.getOrCreate())
val df1 = spark.sql("select * from Table_4kCols_10k_rows") //some 3GB hive table
val list_cols = df1.columns
val featureCols = list_cols.filter(_ != "dormant_flag")
val df2 = df1.withColumnRenamed("dormant_flag", "label")
val assembler = new VectorAssembler().setInputCols(featureCols).setOutputCol("features")
val df = assembler.transform(df2)
val seed = 5043
val Array(train, test) = df.randomSplit(Array(0.7, 0.3))
val lr = new LogisticRegression().setLabelCol("label").setFeaturesCol("features")
val fittedLR = lr.fit(train) // This works fine
val stages = Array(lr)
val pipeline = new Pipeline().setStages(stages)
val params = new ParamGridBuilder().addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0)).addGrid(lr.regParam, Array(0.1, 2.0)).build()
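// 3 elasticNetParam values x 2 regParam values = 6 parameter combinations for TrainValidationSplit to evaluate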
val evaluator = new BinaryClassificationEvaluator().setMetricName("areaUnderROC").setRawPredictionCol("prediction").setLabelCol("label")
val tvs = new TrainValidationSplit().setTrainRatio(0.75).setEstimatorParamMaps(params).setEstimator(pipeline).setEvaluator(evaluator)
val tvsFitted = tvs.fit(train) // This gives the error above; the equivalent PySpark code works fine
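Since the log itself suggests boosting spark.yarn.executor.memoryOverhead and the builder hard-codes .master("local"), one thing worth checking is whether those builder settings were actually picked up by the running application (if a session already exists, e.g. in spark-shell, they may not be). A small read-only diagnostic sketch, assuming spark is the session built above:

// Read-only checks; `spark` is the session built above.
println(spark.sparkContext.master)                                  // is the effective master "local" or YARN?
println(spark.conf.getOption("spark.executor.memory"))              // Some("20g") only if the builder setting took effect
println(spark.conf.getOption("spark.yarn.executor.memoryOverhead")) // Some("2g") only if the builder setting took effect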
I have tried many times to apply a function that makes some modification to a Spark DataFrame containing text strings. Below is the corresponding code, but it always gives me this error:
An error occurred while calling o699.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 27.0 failed 1 times, most recent failure: Lost task 0.0 in stage 27.0 (TID 29, localhost, executor driver):
import os
import sys
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
#!hdfs dfs -rm -r nixon_token*
spark = SparkSession.builder \
    .appName("spark-nltk") \
    .getOrCreate()
data = spark.sparkContext.textFile('1970-Nixon.txt')
def word_tokenize(x):
    import nltk
    return str(nltk.word_tokenize(x))
test_tok = udf(lambda x: word_tokenize(x),StringType())
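# Assumption: df_test is an existing DataFrame with a string column named "spans" (not shown above).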
resultDF = df_test.select("spans", test_tok('spans').alias('text_tokens'))
resultDF.show()
I am trying to create a block matrix from an input data file. I have managed to read the data from the file and store it correctly in IndexedRowMatrix and CoordinateMatrix form.
When I call .toBlockMatrix() on the CoordinateMatrix, the result is a block matrix containing only 0.0, with the same dimensions as the CoordinateMatrix.
I am using Spark version 1.5.0-cdh5.5.0.
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix
import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix
import org.apache.spark.mllib.linalg.distributed.IndexedRow
import org.apache.spark.mllib.linalg.distributed.BlockMatrix
val conf = new SparkConf().setMaster("local").setAppName("Transpose")
val sc = new SparkContext(conf)
val dataRDD = sc.textFile("/user/cloudera/data/data.txt").map(line => Vectors.dense(line.split(" ").map(_.toDouble))).zipWithIndex.map(_.swap)
//Format of dataRDD is RDD[(Long, Vector)]
val rows = dataRDD.map{case(k,v) => IndexedRow(k,v)}
//Format of rows is RDD[IndexedRow]
val mat: IndexedRowMatrix = new IndexedRowMatrix(rows)
val coordMat: CoordinateMatrix = mat.toCoordinateMatrix()
val blockMat: BlockMatrix = coordMat.toBlockMatrix().cache()
The data file is simply two columns by sixty rows of integers.
140 123
141 310
310 381
480 321
... ...
Update:
I've done some investigating and discovered that the groupByKey step is not working correctly, which is what prevents the BlockMatrix from being formed properly. However, I still do not know why groupByKey, join, and groupBy are not working and always return an empty result.
I solved the problem by removing these lines of code:
val conf = new SparkConf().setMaster("local").setAppName("Transpose")
val sc = new SparkContext(conf)
I found the answer in a comment by Farzad Nozarian on the page linked below:
Unable to count words using reduceByKey((v1,v2) => v1 + v2) scala function in spark
As a side note, this might help people who are getting empty results from .groupByKey, .reduceByKey, .join, etc.
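In other words, inside spark-shell the pre-built sc should be used directly rather than wrapped by a second context. A minimal sketch of the same pipeline, reusing the imports from the question and the shell's own sc:

// In spark-shell, `sc` already exists; reuse it instead of creating a second SparkContext.
val dataRDD = sc.textFile("/user/cloudera/data/data.txt")
  .map(line => Vectors.dense(line.split(" ").map(_.toDouble)))
  .zipWithIndex
  .map(_.swap)
val mat = new IndexedRowMatrix(dataRDD.map { case (k, v) => IndexedRow(k, v) })
val blockMat: BlockMatrix = mat.toCoordinateMatrix().toBlockMatrix().cache()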
I have some very simple code to try cosine similarity:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, CoordinateMatrix, RowMatrix}
val rows= Array(((1,2,3,4,5),(1,2,3,4,5),(1,2,4,5,8),(3,4,1,2,7),(7,7,7,7,7)))
val mat = new RowMatrix(rows)
val simsPerfect = mat.columnSimilarities()
val simsEstimate = mat.columnSimilarities(0.8)
I ran this code on Amazon AWS, which has Spark 1.5; however, I got the following message for the last two lines:
"Error: value columnSimilarities is not a member of org.apache.spark.rdd.RDD[(Int, Int)]"
Could you please help me resolve this issue?
I found the answer: I needed to convert the matrix to an RDD of row vectors. Here is the corrected code:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, CoordinateMatrix, RowMatrix}
import org.apache.spark.rdd._
import org.apache.spark.mllib.linalg._
def matrixToRDD(m: Matrix): RDD[Vector] = {
  val columns = m.toArray.grouped(m.numRows)
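  // Matrix.toArray returns the entries in column-major order, so grouping by numRows yields one array per column.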
  val rows = columns.toSeq.transpose // Skip this if you want a column-major RDD.
  val vectors = rows.map(row => new DenseVector(row.toArray))
  sc.parallelize(vectors)
}
val dm: Matrix = Matrices.dense(5, 5, Array(1,2,3,4,5, 1,2,3,4,5, 1,2,4,5,8, 3,4,1,2,7, 7,7,7,7,7))
val rows = matrixToRDD(dm)
val mat = new RowMatrix(rows)
val simsPerfect = mat.columnSimilarities()
val simsEstimate = mat.columnSimilarities(0.8)
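// The 0.8 threshold variant uses DIMSUM sampling to approximate the similarities, trading some accuracy for less computation.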
println("Pairwise similarities are: " + simsPerfect.entries.collect.mkString(", "))
println("Estimated pairwise similarities are: " + simsEstimate.entries.collect.mkString(", "))
Cheers
I'm trying to run the SparkPi example on my standalone-mode cluster.
package org.apache.spark.examples
import scala.math.random
import org.apache.spark._
/** Computes an approximation to pi */
object SparkPi {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("SparkPi")
      .setMaster("spark://192.168.17.129:7077")
      .set("spark.driver.allowMultipleContexts", "true")
    val spark = new SparkContext(conf)
    val slices = if (args.length > 0) args(0).toInt else 2
    val n = math.min(100000L * slices, Int.MaxValue).toInt // avoid overflow
    val count = spark.parallelize(1 until n, slices).map { i =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x*x + y*y < 1) 1 else 0
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / n)
    spark.stop()
  }
}
Note: I made a small change to these lines:
val conf = new SparkConf().setAppName("SparkPi")
.setMaster("spark://192.168.17.129:7077")
.set("spark.driver.allowMultipleContexts", "true")
Problem: I'm using spark-shell (the Scala interface) to run this code. When I run it, I receive this error repeatedly:
15/02/09 06:39:23 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
Note: I can see my workers in my master's web UI, and I can also see a new job in the Running Applications section, but the application never finishes and I keep seeing the error.
What is the problem?
Thanks
If you want to run this from the spark-shell, start the shell with the argument --master spark://192.168.17.129:7077 and enter the following code:
import scala.math.random
import org.apache.spark._
val slices = 10
val n = math.min(100000L * slices, Int.MaxValue).toInt // avoid overflow
val count = sc.parallelize(1 until n, slices).map { i =>
  val x = random * 2 - 1
  val y = random * 2 - 1
  if (x*x + y*y < 1) 1 else 0
}.reduce(_ + _)
println("Pi is roughly " + 4.0 * count / n)
Otherwise, compile the code into a jar and run it with spark-submit, but remove setMaster from the code and pass the master URL as the --master argument to the spark-submit script. Also remove the allowMultipleContexts setting from the code; you only need one SparkContext.
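For reference, a minimal sketch of that variant (the jar path and the slices argument below are placeholders): the context setup in the code shrinks to the two lines that follow, and the master URL moves to the command line, e.g. spark-submit --class org.apache.spark.examples.SparkPi --master spark://192.168.17.129:7077 <path-to-your-jar> 10.

// The master comes from spark-submit's --master flag, so it is not hard-coded here,
// and only one SparkContext is created.
val conf = new SparkConf().setAppName("SparkPi")
val spark = new SparkContext(conf)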