Spark with MongoDB error - mongodb

I'm learning to use Spark with MongoDB, but I've run into a problem that I think is related to the way I'm using Spark, because the error doesn't make any sense to me.
My proof of concept is to filter a collection containing about 800K documents by a certain field.
My code is very simple: connect to MongoDB, apply a filter transformation, and then count the elements:
JavaSparkContext sc = new JavaSparkContext("local[2]", "Spark Test");
Configuration config = new Configuration();
config.set("mongo.input.uri", "mongodb://127.0.0.1:27017/myDB.myCollection");
JavaPairRDD<Object, BSONObject> mongoRDD = sc.newAPIHadoopRDD(config, com.mongodb.hadoop.MongoInputFormat.class, Object.class, BSONObject.class);
long numberOfFilteredElements = mongoRDD.filter(myCollectionDocument -> myCollectionDocument._2().get("site").equals("marfeel.com")).count();
System.out.format("Filtered collection size: %d%n", numberOfFilteredElements);
When I execute this code, the Mongo connector splits my collection into 2810 partitions, so the same number of tasks starts processing.
Around task number 1000, I get the following error message:
ERROR Executor: Exception in task 990.0 in stage 0.0 (TID 990) java.lang.OutOfMemoryError: unable to create new native thread
I've searched a lot about this error, but it doesn't make any sense to me. I've come to the conclusion that either there is a problem in my code, I have some library version incompatibilities, or my real problem is that I'm getting the whole Spark concept wrong and the code above doesn't make sense at all.
I'm using the following library versions:
org.apache.spark.spark-core_2.11 -> 1.2.0
org.apache.hadoop.hadoop-client -> 2.4.1
org.mongodb.mongo-hadoop.mongo-hadoop-core -> 1.3.1
org.mongodb.mongo-java-driver -> 2.13.0-rc1
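For what it's worth, one workaround I'm planning to try (the option name comes from the mongo-hadoop configuration docs, so treat it as an assumption on my part) is asking for larger input splits, so that far fewer than 2810 tasks, and therefore far fewer threads, get created:
config.set("mongo.input.split_size", "64"); // request ~64 MB input splits so the collection yields fewer partitions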

Related

Zeppelin fails after loading DeltaTable with "Could not find active SparkSession"

I'm running into an issue when trying to do repeated queries against Delta Lake tables in Zeppelin. This code snippet runs without any problems the first time through:
import io.delta.tables._
val deltaTable = DeltaTable.forPath("s3://bucket/path")
deltaTable.toDF.show()
But when I try to run it a second time, it fails with this error:
java.lang.IllegalArgumentException: Could not find active SparkSession
at io.delta.tables.DeltaTable$$anonfun$1.apply(DeltaTable.scala:620)
at io.delta.tables.DeltaTable$$anonfun$1.apply(DeltaTable.scala:620)
at scala.Option.getOrElse(Option.scala:121)
at io.delta.tables.DeltaTable$.forPath(DeltaTable.scala:619)
... 51 elided
I can restart the Spark interpreter and run the query again, but this is a huge impediment to development. Does anyone know why this is happening and whether there is a workaround that doesn't involve restarting the interpreter every time I want to run a new query?
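A workaround I've been experimenting with (only a sketch; it assumes SparkSession.setActiveSession and the two-argument DeltaTable.forPath(spark, path) overload exist in the Spark/Delta versions on the cluster) is to re-register the interpreter's session as active before touching DeltaTable:
import io.delta.tables._
import org.apache.spark.sql.SparkSession
// re-attach the notebook's existing session to the current thread,
// then hand it to forPath explicitly instead of relying on the active-session lookup
SparkSession.setActiveSession(spark)
val deltaTable = DeltaTable.forPath(spark, "s3://bucket/path")
deltaTable.toDF.show()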

How to tell PySpark where the pymongo-spark package is located? [duplicate]

I'm having difficulty getting these components to knit together properly. I have Spark installed and working; I can run jobs locally, standalone, and also via YARN. I have followed the steps advised (to the best of my knowledge) here and here.
I'm working on Ubuntu, and the various component versions I have are:
Spark spark-1.5.1-bin-hadoop2.6
Hadoop hadoop-2.6.1
Mongo 2.6.10
Mongo-Hadoop connector cloned from https://github.com/mongodb/mongo-hadoop.git
Python 2.7.10
I had some difficulty following the various steps, such as which jars to add to which path, so this is what I have added:
in /usr/local/share/hadoop-2.6.1/share/hadoop/mapreduce I have added mongo-hadoop-core-1.5.0-SNAPSHOT.jar
and the following environment variables:
export HADOOP_HOME="/usr/local/share/hadoop-2.6.1"
export PATH=$PATH:$HADOOP_HOME/bin
export SPARK_HOME="/usr/local/share/spark-1.5.1-bin-hadoop2.6"
export PYTHONPATH="/usr/local/share/mongo-hadoop/spark/src/main/python"
export PATH=$PATH:$SPARK_HOME/bin
My Python program is basic
from pyspark import SparkContext, SparkConf
import pymongo_spark
pymongo_spark.activate()
def main():
    conf = SparkConf().setAppName("pyspark test")
    sc = SparkContext(conf=conf)
    rdd = sc.mongoRDD(
        'mongodb://username:password@localhost:27017/mydb.mycollection')
if __name__ == '__main__':
    main()
I am running it using the command
$SPARK_HOME/bin/spark-submit --driver-class-path /usr/local/share/mongo-hadoop/spark/build/libs/ --master local[4] ~/sparkPythonExample/SparkPythonExample.py
and I am getting the following output as a result
Traceback (most recent call last):
  File "/home/me/sparkPythonExample/SparkPythonExample.py", line 24, in <module>
    main()
  File "/home/me/sparkPythonExample/SparkPythonExample.py", line 17, in main
    rdd = sc.mongoRDD('mongodb://username:password@localhost:27017/mydb.mycollection')
  File "/usr/local/share/mongo-hadoop/spark/src/main/python/pymongo_spark.py", line 161, in mongoRDD
    return self.mongoPairRDD(connection_string, config).values()
  File "/usr/local/share/mongo-hadoop/spark/src/main/python/pymongo_spark.py", line 143, in mongoPairRDD
    _ensure_pickles(self)
  File "/usr/local/share/mongo-hadoop/spark/src/main/python/pymongo_spark.py", line 80, in _ensure_pickles
    orig_tb)
py4j.protocol.Py4JError
According to here
This exception is raised when an exception occurs in the Java client
code. For example, if you try to pop an element from an empty stack.
The instance of the Java exception thrown is stored in the
java_exception member.
Looking at the source code for pymongo_spark.py and the line throwing the error, it says
"Error while communicating with the JVM. Is the MongoDB Spark jar on
Spark's CLASSPATH? : "
So in response, I have tried to make sure the right jars are being passed, but I might be doing this all wrong; see below:
$SPARK_HOME/bin/spark-submit --jars /usr/local/share/spark-1.5.1-bin-hadoop2.6/lib/mongo-hadoop-spark-1.5.0-SNAPSHOT.jar,/usr/local/share/spark-1.5.1-bin-hadoop2.6/lib/mongo-java-driver-3.0.4.jar --driver-class-path /usr/local/share/spark-1.5.1-bin-hadoop2.6/lib/mongo-java-driver-3.0.4.jar,/usr/local/share/spark-1.5.1-bin-hadoop2.6/lib/mongo-hadoop-spark-1.5.0-SNAPSHOT.jar --master local[4] ~/sparkPythonExample/SparkPythonExample.py
I have imported pymongo to the same python program to verify that I can at least access MongoDB using that, and I can.
I know there are quite a few moving parts here so if I can provide any more useful information please let me know.
Updates:
2016-07-04
Since the last update, the MongoDB Spark Connector has matured quite a lot. It provides up-to-date binaries and a data-source-based API, but it uses SparkConf-based configuration, so it is subjectively less flexible than Stratio/Spark-MongoDB.
2016-03-30
Since the original answer, I have found two different ways to connect to MongoDB from Spark:
mongodb/mongo-spark
Stratio/Spark-MongoDB
While the former seems relatively immature, the latter looks like a much better choice than the Mongo-Hadoop connector and provides a Spark SQL API.
# Adjust Scala and package version according to your setup
# although officially 0.11 supports only Spark 1.5
# I haven't encountered any issues on 1.6.1
bin/pyspark --packages com.stratio.datasource:spark-mongodb_2.11:0.11.0
df = (sqlContext.read
    .format("com.stratio.datasource.mongodb")
    .options(host="mongo:27017", database="foo", collection="bar")
    .load())
df.show()
## +---+----+--------------------+
## | x| y| _id|
## +---+----+--------------------+
## |1.0|-1.0|56fbe6f6e4120712c...|
## |0.0| 4.0|56fbe701e4120712c...|
## +---+----+--------------------+
It seems to be much more stable than mongo-hadoop-spark, supports predicate pushdown without static configuration and simply works.
The original answer:
Indeed, there are quite a few moving parts here. I tried to make it a little more manageable by building a simple Docker image which roughly matches the described configuration (I've omitted the Hadoop libraries for brevity, though). You can find the complete source on GitHub (DOI 10.5281/zenodo.47882) and build it from scratch:
git clone https://github.com/zero323/docker-mongo-spark.git
cd docker-mongo-spark
docker build -t zero323/mongo-spark .
or download an image I've pushed to Docker Hub (so you can simply docker pull zero323/mongo-spark).
Start images:
docker run -d --name mongo mongo:2.6
docker run -i -t --link mongo:mongo zero323/mongo-spark /bin/bash
Start PySpark shell passing --jars and --driver-class-path:
pyspark --jars ${JARS} --driver-class-path ${SPARK_DRIVER_EXTRA_CLASSPATH}
And finally see how it works:
import pymongo
import pymongo_spark
mongo_url = 'mongodb://mongo:27017/'
client = pymongo.MongoClient(mongo_url)
client.foo.bar.insert_many([
    {"x": 1.0, "y": -1.0}, {"x": 0.0, "y": 4.0}])
client.close()
pymongo_spark.activate()
rdd = (sc.mongoRDD('{0}foo.bar'.format(mongo_url))
    .map(lambda doc: (doc.get('x'), doc.get('y'))))
rdd.collect()
## [(1.0, -1.0), (0.0, 4.0)]
Please note that mongo-hadoop seems to close the connection after the first action, so calling, for example, rdd.count() after the collect will throw an exception.
Based on the different problems I've encountered while creating this image, I tend to believe that passing mongo-hadoop-1.5.0-SNAPSHOT.jar and mongo-hadoop-spark-1.5.0-SNAPSHOT.jar to both --jars and --driver-class-path is the only hard requirement.
Notes:
This image is loosely based on jaceklaskowski/docker-spark, so please be sure to send some good karma to @jacek-laskowski if it helps.
If you don't require a development version including the new API, then using --packages is most likely a better option.
Can you try using the --packages option instead of --jars ... in your spark-submit command:
spark-submit --packages org.mongodb.mongo-hadoop:mongo-hadoop-core:1.3.1,org.mongodb:mongo-java-driver:3.1.0 [REST OF YOUR OPTIONS]
Some of these jar files are not uber jars and need more dependencies to be downloaded before they can work.
I was having this same problem yesterday. I was able to fix it by placing mongo-java-driver.jar in $HADOOP_HOME/lib, and mongo-hadoop-core.jar and mongo-hadoop-spark.jar in $HADOOP_HOME/spark/classpath/emr (or any other folder that is on the $SPARK_CLASSPATH).
Let me know if that helps.
Good Luck!
#see https://github.com/mongodb/mongo-hadoop/wiki/Spark-Usage
from pyspark import SparkContext, SparkConf
import pymongo_spark
# Important: activate pymongo_spark.
pymongo_spark.activate()
def main():
    conf = SparkConf().setAppName("pyspark test")
    sc = SparkContext(conf=conf)
    # Create an RDD backed by the MongoDB collection.
    # This RDD *does not* contain key/value pairs, just documents.
    # If you want key/value pairs, use the mongoPairRDD method instead.
    rdd = sc.mongoRDD('mongodb://localhost:27017/db.collection')
    # Save this RDD back to MongoDB as a different collection.
    rdd.saveToMongoDB('mongodb://localhost:27017/db.other.collection')
    # You can also read and write BSON:
    bson_rdd = sc.BSONFileRDD('/path/to/file.bson')
    bson_rdd.saveToBSON('/path/to/bson/output')
if __name__ == '__main__':
    main()

Starting KsqlRestApplication from Scala and getting NoSuchMethodError org.apache.kafka.streams.StreamsConfig.getConsumerConfigs

I am trying to write a program that lets me run predefined KSQL operations on Kafka topics from Scala, but I don't want to open the KSQL CLI every time. Therefore I want to start the KSQL "server" from within my Scala program. If I understand the KSQL source code correctly, I have to build and start a KsqlRestApplication:
def restServer = KsqlRestApplication.buildApplication(
  new KsqlRestConfig(defaultServerProperties),
  true,
  new VersionCheckerAgent {
    override def start(ksqlModuleType: KsqlModuleType, properties: Properties): Unit = ???
  })
But when I try doing that, I get the following error:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.kafka.streams.StreamsConfig.getConsumerConfigs(Ljava/lang/String;Ljava/lang/String;)Ljava/util/Map;
at io.confluent.ksql.rest.server.BrokerCompatibilityCheck.create(BrokerCompatibilityCheck.java:62)
at io.confluent.ksql.rest.server.KsqlRestApplication.buildApplication(KsqlRestApplication.java:241)
I looked into the call in BrokerCompatibilityCheck: in the create function it calls StreamsConfig.getConsumerConfigs() with two Strings as parameters, instead of the parameters defined in
https://kafka.apache.org/0102/javadoc/org/apache/kafka/streams/StreamsConfig.html#getConsumerConfigs(StreamThread,%20java.lang.String,%20java.lang.String).
Are my KSQL and Kafka version simply not compatible or am I doing something wrong?
I am using KSQL version 4.1.0-SNAPSHOT and Kafka version 1.0.0.
Yes, NoSuchMethodError typically indicates a version incompatibility between libraries.
The link you posted is to the javadoc for Kafka 0.10.2. The method hasn't changed in 1.0, but in the upcoming 1.1 it indeed takes only two Strings:
https://kafka.apache.org/11/javadoc/org/apache/kafka/streams/StreamsConfig.html#getConsumerConfigs(java.lang.String,%20java.lang.String)
That suggests the version of KSQL you're using (4.1.0-SNAPSHOT) depends on version 1.1 of Kafka Streams, which is currently in the release-candidate phase and, I believe, should be out soon:
https://lists.apache.org/thread.html/780c4458b16590e99261b69d7b41b9ec374a3226d72c8d38885a008a#%3Cusers.kafka.apache.org%3E
As per that email, you can find the latest (1.1.0-rc2) artifacts in the Apache staging repo:
https://repository.apache.org/content/groups/staging/
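If you want to compile against those staged artifacts, a minimal sbt sketch could look like this (the coordinates and the 1.1.0 version string are assumptions based on the thread above; I have not verified them against the staging repo):
resolvers += "Apache Staging" at "https://repository.apache.org/content/groups/staging/"
libraryDependencies ++= Seq(
  // match the Kafka Streams API that KSQL 4.1.0-SNAPSHOT appears to compile against
  "org.apache.kafka" % "kafka-streams" % "1.1.0",
  "org.apache.kafka" % "kafka-clients" % "1.1.0"
)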

Error in using MLlib function ALS in Spark

I have read from a file like this:
val ratingText = sc.textFile("/home/cloudera/rec_data/processed_data/ratings/000000_0")
Used the following function to parse this data:
def parseRating(str: String): Rating = {
  val fields = str.split(",")
  Rating(fields(0).toInt, fields(1).trim.toInt, fields(2).trim.toDouble)
}
And created an RDD, which is then split into different RDDs:
val ratingsRDD = ratingText.map(x=>parseRating(x)).cache()
val splits = ratingsRDD.randomSplit(Array(0.8, 0.2), 0L)
val trainingRatingsRDD = splits(0).cache()
Used the training RDD to create a model as follows:
val model = new ALS().setRank(20).setIterations(10).run(trainingRatingsRDD)
I get the following error in the last command
16/10/28 01:03:44 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
16/10/28 01:03:44 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
16/10/28 01:03:46 WARN LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeSystemLAPACK
16/10/28 01:03:46 WARN LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeRefLAPACK
Edit: T. Gaweda's suggestion helped in removing the errors, but I am still getting the following warning:
16/10/28 01:53:59 WARN Executor: 1 block locks were not released by TID = 60:
[rdd_420_0]
16/10/28 01:54:00 WARN Executor: 1 block locks were not released by TID = 61:
[rdd_421_0]
And I think this has resulted in an empty model, because the next step results in the following error:
val topRecsForUser = model.recommendProducts(4276736,3)
Error is:
java.util.NoSuchElementException: next on empty iterator at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
Please help!
It's only a warning. Spark uses BLAS to perform calculations. BLAS has native implementations and a JVM implementation; the native one is more optimized and faster. However, you must install the native library separately.
Without that configuration, the warning message will appear and Spark will use the JVM implementation of BLAS. The results should be the same, just possibly computed more slowly.
Here you've got a description of what BLAS is and how to configure it; for example, on CentOS it should only be: yum install openblas lapack
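As for the NoSuchElementException from recommendProducts: a quick sanity check worth running (just a sketch, assuming the standard org.apache.spark.mllib.recommendation.Rating case class from the question) is to confirm that the user id actually ended up in the training split, since randomSplit can easily leave a sparse user out of the model:
val userId = 4276736
// count the training ratings for this user; 0 would explain the empty iterator
val ratingsForUser = trainingRatingsRDD.filter(_.user == userId).count()
println(s"Training ratings for user $userId: $ratingsForUser")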

Around 5-10% of executors are LOST in my Mesos framework

I have a 200-node Mesos cluster that can run around 2700 executors concurrently. Around 5-10% of my executors are LOST at the very beginning; they get only as far as extracting the executor tar file.
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0617 21:35:09.947180 45885 fetcher.cpp:76] Fetching URI 'http://download_url/remote_executor.tgz'
I0617 21:35:09.947273 45885 fetcher.cpp:126] Downloading 'http://download_url/remote_executor.tgz' to '/mesos_dir/remote_executor.tgz'
I0617 21:35:57.551722 45885 fetcher.cpp:64] Extracted resource '/mesos_dir/remote_executor.tgz' into '/extracting_mesos_dir/'
Please let me know if someone else is facing this issue.
I am using Python to implement both the scheduler and the executor. The executor code is a Python file that extends the base class 'Executor'. I have implemented the launchTasks method of the Executor class, which simply does what the executor is supposed to do.
The executor info is:
executor = mesos_pb2.ExecutorInfo()
executor.executor_id.value = "executor-%s" % (str(task_id),)
executor.command.value = 'python -m myexecutor'
# where to download executor from
tar_uri = '%s/remote_executor.tgz' % (
    self.conf.remote_executor_cache_url)
executor.command.uris.add().value = tar_uri
executor.name = 'some_executor_name'
executor.source = "executor_test"
Can you provide more details about what your executor is supposed to do (ideally the ExecutorInfo definition and the executor itself)? What is the command you use to start the executor (CommandInfo)?
For an example definition of an executor, have a look at Rendler.
It includes a sample executor and the ExecutorInfo definition.
Rendler also includes samples in Java, Go, Python, Scala, and Haskell.