Error in using MLlib function ALS in Spark - scala

I have read from a file like this:
val ratingText = sc.textFile("/home/cloudera/rec_data/processed_data/ratings/000000_0")
Used the following function to parse this data:
import org.apache.spark.mllib.recommendation.{ALS, Rating}

def parseRating(str: String): Rating = {
  val fields = str.split(",")
  Rating(fields(0).toInt, fields(1).trim.toInt, fields(2).trim.toDouble)
}
And created an RDD, which is then randomly split, keeping 80% for training:
val ratingsRDD = ratingText.map(x=>parseRating(x)).cache()
val splits = ratingsRDD.randomSplit(Array(0.8, 0.2), 0L)
val trainingRatingsRDD = splits(0).cache()
Used the training RDD to create a model as follows:
val model = new ALS().setRank(20).setIterations(10).run(trainingRatingsRDD)
I get the following error in the last command
16/10/28 01:03:44 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
16/10/28 01:03:44 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
16/10/28 01:03:46 WARN LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeSystemLAPACK
16/10/28 01:03:46 WARN LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeRefLAPACK
Edit: T. Gaweda's suggestion helped remove the errors, but I am still getting the following warnings:
16/10/28 01:53:59 WARN Executor: 1 block locks were not released by TID = 60:
[rdd_420_0]
16/10/28 01:54:00 WARN Executor: 1 block locks were not released by TID = 61:
[rdd_421_0]
And I think this has resulted in an empty model, because the next step fails with the following error:
val topRecsForUser = model.recommendProducts(4276736,3)
Error is:
java.util.NoSuchElementException: next on empty iterator at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
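One thing I still want to rule out is the simplest explanation: that this user ID never made it into the 80% training split. A rough check using the RDDs above:
// sanity check (sketch): how many ratings does this user have in the training split?
val userId = 4276736
val userRatingsInTraining = trainingRatingsRDD.filter(_.user == userId).count()
println(s"Ratings for user $userId in training split: $userRatingsInTraining")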
Please help!

It's only a warning. Spark uses BLAS to perform calculations. BLAS has native implementations and a JVM implementation; the native one is more optimized and faster, but you must install the native library separately.
Without that configuration the warning message will appear and Spark will use the JVM implementation of BLAS. The results should be the same, just possibly computed somewhat more slowly.
Here you can find a description of what BLAS is and how to configure it; on CentOS, for example, it should only take: yum install openblas lapack
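If you build your application with sbt, the native glue also has to be on the classpath (Spark's prebuilt binaries leave out netlib-java's native wrappers for licensing reasons); a minimal sketch, assuming version 1.1.2:
// build.sbt (sketch): lets Spark MLlib pick up the system BLAS/LAPACK installed above
libraryDependencies += ("com.github.fommil.netlib" % "all" % "1.1.2").pomOnly()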

Related

Spark - JVM Insufficient memory error while using Spark SQL

I am trying to run a Spark job to process some JSON data using Spark SQL. When I submit the job, I see the following error in the logs:
Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00007f29b96d5000, 12288, 0) failed; error='Cannot allocate memory' (errno=12)
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 12288 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /tmp/hs_err_pid5716.log
I am using the following code in the application:
import java.io.{File, PrintWriter}

val url = "foo://fooLink"
val rawData = sqlContext.read.option("multiline", true).json(url)
val pwp = new PrintWriter(new File("/tmp/file"))
rawData.collect.foreach(pwp.println)
pwp.close()
Command used to submit the job:
spark-submit --spark-conf spark.driver.userClassPathFirst=true --region us-east-1 --classname someClass somePackage-1.0-super.jar
It works for smaller data, but for some reason the job does not create "/tmp/file" on the cluster and throws the above error in the driver logs. Is there a way I can work around this? Any ideas would be greatly appreciated. Thanks :)
You will have to tweak some JVM flags: -XX:MaxDirectMemorySize and -Xmx.
Edit your spark-defaults.conf and modify the spark.executor.extraJavaOptions option to set the flags.
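For illustration, the relevant spark-defaults.conf entries might look roughly like this (the sizes are placeholders to tune for your workload; note that the executor heap is normally sized via spark.executor.memory rather than an explicit -Xmx):
# illustrative values only
spark.executor.memory             4g
spark.executor.extraJavaOptions   -XX:MaxDirectMemorySize=2g
spark.driver.memory               4g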

Compilation Error while writing Hopping Window in Kafka Streams (Confluent 4.0.0)

I am trying to write an aggregation operation over time windows in Confluent open source 4.0.0, as below.
KTable<Windowed<String>, aggrTest> testWinAlerts =
    testRecords.groupByKey()
        .windowedBy(TimeWindows.of(TimeUnit.SECONDS.toMillis(120))
            .advanceBy(TimeUnit.SECONDS.toMillis(1)))
        .aggregate(
            new aggrTestInitilizer(),
            new minMaxCalculator(),
            Materialized.<String, aggrTest, WindowStore<Bytes, byte[]>>as("queryable-store-name")
                .withValueSerde(aggrMessageSerde)
                .withKeySerde(Serdes.String()));
But the above code gives the following compilation error:
Exception in thread "main" java.lang.Error: Unresolved compilation problem:
The method aggregate(Initializer<VR>, Aggregator<? super String,? super TestFields,VR>, Materialized<String,VR,WindowStore<Bytes,byte[]>>) in the type TimeWindowedKStream<String,TestFields> is not applicable for the arguments (aggrTestInitilizer, minMaxCalculator, Materialized<String,aggrTest,WindowStore<Bytes,byte[]>>)
The same code written against version 3.3.1, as below, does not give any error:
KTable<Windowed<String>, aggrTest> testWinAlerts =
    testRecords.groupByKey()
        .aggregate(
            new aggrTestInitilizer(),
            new minMaxCalculator(),
            TimeWindows.of(TimeUnit.SECONDS.toMillis(120))
                .advanceBy(TimeUnit.SECONDS.toMillis(1)),
            aggrMessageSerde,
            "aggr-test");
What might be the issue? The aggrTestInitilizer, minMaxCalculator, and aggrMessageSerde used in both cases are the same.

How to catch an exception that occurred on a spark worker?

import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.regression.LabeledPoint

val HTF = new HashingTF(50000)
val Tf = Case.map(row => HTF.transform(row)).cache()
val Idf = new IDF().fit(Tf)
try {
  Idf.transform(Tf).map(x => LabeledPoint(1, x))
} catch {
  case ex: Throwable => println(ex.getMessage)
}
Code like this isn't working.
HashingTF/IDF belong to org.apache.spark.mllib.feature.
I'm still getting an exception that says
org.apache.spark.SparkException: Failed to get broadcast_5_piece0 of broadcast_5
I cannot see any of my files in the error log, how do I debug this?
It seems that the worker ran out of memory.
Instant temporary fix:
Run the application without caching — just remove .cache().
How to debug:
The Spark UI probably has the complete exception details.
Check the stage details.
Check the logs and the thread dump in the Executor tab.
If you find multiple exceptions or errors, try to resolve them in sequence; most of the time, resolving the first error will resolve the subsequent ones.
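Also note that transform and map are lazy, so a try/catch around them never actually runs the job; a worker-side failure is rethrown on the driver only when an action executes. A rough sketch of catching it there, assuming the same Tf and Idf as above:
import org.apache.spark.mllib.regression.LabeledPoint

val labeled = Idf.transform(Tf).map(x => LabeledPoint(1, x))
try {
  // count() (or any other action) is what actually runs the job,
  // so a failure on a worker surfaces here, wrapped in a SparkException
  labeled.count()
} catch {
  case ex: Throwable => println(ex.getMessage)
}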

Around 5-10% of executors are LOST in my Mesos framework

I have a 200-node Mesos cluster that can run around 2700 executors concurrently. Around 5-10% of my executors are LOST at the very beginning; they get only as far as extracting the executor tar file.
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0617 21:35:09.947180 45885 fetcher.cpp:76] Fetching URI 'http://download_url/remote_executor.tgz'
I0617 21:35:09.947273 45885 fetcher.cpp:126] Downloading 'http://download_url/remote_executor.tgz' to '/mesos_dir/remote_executor.tgz'
I0617 21:35:57.551722 45885 fetcher.cpp:64] Extracted resource '/mesos_dir/remote_executor.tgz' into '/extracting_mesos_dir/'
Please let me know if someone else is facing this issue.
I am using Python to implement both the scheduler and the executor. The executor code is a Python file that extends the base class 'Executor'. I have implemented the launchTasks method of the Executor class, which simply does what the executor is supposed to do.
The executor info is:
executor = mesos_pb2.ExecutorInfo()
executor.executor_id.value = "executor-%s" % (str(task_id),)
executor.command.value = 'python -m myexecutor'
# where to download executor from
tar_uri = '%s/remote_executor.tgz' % (
self.conf.remote_executor_cache_url)
executor.command.uris.add().value = tar_uri
executor.name = 'some_executor_name'
executor.source = "executor_test"
Can you provide more details about what your executor is supposed to do (ideally the ExecutorInfo definition and the executor itself)? What is the command you use to start the executor (CommandInfo)?
For an example definition of an executor, have a look at Rendler.
It includes a sample executor and the ExecutorInfo definition.
Rendler also includes samples in Java, Go, Python, Scala, and Haskell.

Spark with MongoDB error

I'm learning to use Spark with MongoDB, but I've encountered a problem that I think is related to the way I use Spark, because it doesn't make any sense to me.
My proof of concept is to filter a collection containing about 800K documents by a certain field.
My code is very simple. Connect to my MongoDB, apply a filter transformation and then count the elements:
JavaSparkContext sc = new JavaSparkContext("local[2]", "Spark Test");
Configuration config = new Configuration();
config.set("mongo.input.uri", "mongodb://127.0.0.1:27017/myDB.myCollection");
JavaPairRDD<Object, BSONObject> mongoRDD = sc.newAPIHadoopRDD(config, com.mongodb.hadoop.MongoInputFormat.class, Object.class, BSONObject.class);
long numberOfFilteredElements = mongoRDD.filter(myCollectionDocument -> myCollectionDocument._2().get("site").equals("marfeel.com")).count();
System.out.format("Filtered collection size: %d%n", numberOfFilteredElements);
When I execute this code, the Mongo driver splits my collection into 2810 partitions, so an equal number of tasks starts processing.
Around task number 1000, I get the following error message:
ERROR Executor: Exception in task 990.0 in stage 0.0 (TID 990) java.lang.OutOfMemoryError: unable to create new native thread
I've searched a lot about this error, but it doesn't make any sense to me. I've come to the conclusion that either I have a problem in my code, I have some library version incompatibilities, or my real problem is that I'm getting the whole Spark concept wrong and the code above doesn't make any sense at all.
I'm using the following library versions:
org.apache.spark.spark-core_2.11 -> 1.2.0
org.apache.hadoop.hadoop-client -> 2.4.1
org.mongodb.mongo-hadoop.mongo-hadoop-core -> 1.3.1
org.mongodb.mongo-java-driver -> 2.13.0-rc1