Spark with MongoDB is very slow

I am running spark-shell with the MongoDB connector, but the program is very slow; I don't think I will ever get a response from it.
My spark-shell command is:
./spark-shell --master spark://spark_host:7077 \
--conf "spark.mongodb.input.uri=mongodb://mongod_user:password@mongod_host:27017/database.collection?readPreference=primaryPreferred" \
--jars /mongodb/lib/mongo-spark-connector_2.10-2.0.0.jar,/mongodb/lib/bson-3.2.2.jar,/mongodb/lib/mongo-java-driver-3.2.2.jar
And my app code is:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import com.mongodb.spark._
import org.bson.Document
import com.mongodb.spark.config.ReadConfig
import org.apache.spark.sql.SparkSession
import com.mongodb.spark.rdd.MongoRDD
val sparkSession = SparkSession.builder().getOrCreate()
val df = MongoSpark.load(sparkSession)
val dataset = df.filter("thisRequestTime > 1499250131596")
dataset.first // waits for a very long time
What have I missed? Please help.
PS: my Spark cluster runs in standalone mode. The app dependencies are:
<properties>
<encoding>UTF-8</encoding>
<maven.compiler.source>1.8</maven.compiler.source>
<maven.compiler.target>1.8</maven.compiler.target>
<scala.compat.version>2.11</scala.compat.version>
<spark.version>2.1.1</spark.version>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_${scala.compat.version}</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_${scala.compat.version}</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.mongodb.spark</groupId>
<artifactId>mongo-spark-connector_${scala.compat.version}</artifactId>
<version>2.0.0</version>
</dependency>
</dependencies>

I was stuck in this kind of problem for a while, but finally got past it.
I don't know the details of your MongoDB configuration, but here is the solution to my problem; I hope you find it helpful.
My dataset is huge too, and I had configured a sharded cluster for MongoDB; that is what made it slow. To solve it, add one piece of configuration: spark.mongodb.input.partitioner=MongoShardedPartitioner. Otherwise the default partitioner will be used, which is not suitable for a sharded MongoDB.
You can find more specific information here
good luck!
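For what it's worth, here is a minimal sketch of how that setting can be supplied, either on the spark-shell command line or through a per-read ReadConfig. The option names are, as far as I know, the ones documented for the 2.0.x connector, and the database/collection are assumed to come from spark.mongodb.input.uri:

// Option 1: pass the partitioner when launching spark-shell
// ./spark-shell ... --conf "spark.mongodb.input.partitioner=MongoShardedPartitioner"

// Option 2: override the partitioner for a single read via ReadConfig
import com.mongodb.spark._
import com.mongodb.spark.config.ReadConfig
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val readConfig = ReadConfig(
  Map("partitioner" -> "MongoShardedPartitioner"),
  Some(ReadConfig(spark.sparkContext))  // inherit uri/database/collection from spark.mongodb.input.*
)
val df = MongoSpark.load(spark, readConfig)
df.filter("thisRequestTime > 1499250131596").first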

Related

Getting dependency errors for SparkSession and SQLContext

I am getting dependency errors for SQLContext and SparkSession in my Spark program:
val sqlContext = new SQLContext(sc)
val spark = SparkSession.builder()
Error for SQLContext:
Symbol 'type org.apache.spark.Logging' is missing from the classpath. This symbol is required by 'class org.apache.spark.sql.SQLContext'. Make sure that type Logging is in your classpath and check for conflicting dependencies with -Ylog-classpath. A full rebuild may help if 'SQLContext.class' was compiled against an incompatible version of org.apache.spark.
Error for SparkSession:
not found: value SparkSession
Below are the Spark dependencies in my pom.xml:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.10</artifactId>
<version>1.6.0-cdh5.15.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>2.0.0-cloudera1-SNAPSHOT</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-catalyst_2.10</artifactId>
<version>1.6.0-cdh5.15.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-test-tags_2.10</artifactId>
<version>1.6.0-cdh5.15.1</version>
</dependency>
You can't have both Spark 2 and Spark 1.6 dependencies defined in your project.
org.apache.spark.Logging is no longer available in Spark 2.
Change
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>2.0.0-cloudera1-SNAPSHOT</version>
</dependency>
to
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.6.0-cdh5.15.1</version>
</dependency>
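For reference, SparkSession only exists from Spark 2.0 onward; with a consistent 1.6.x dependency set, the SQL entry point is SQLContext. A minimal sketch, assuming a plain local run (the app name and input path are placeholders):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setAppName("sqlcontext-example").setMaster("local[*]")  // placeholder app name
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)  // SparkSession does not exist before Spark 2.0
val df = sqlContext.read.json("path/to/file.json")  // placeholder path
df.show()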

Creating a Kafka consumer using Spark Streaming

I have started Kafka and created a topic and a producer. Now I want to read the messages sent by that producer. My code:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

def main(args: Array[String]) {
  val sparkConf = new SparkConf()
  val spark = new SparkContext(sparkConf)
  val streamingContext = new StreamingContext(spark, Seconds(5))
  val kafkaStream = KafkaUtils.createStream(streamingContext
    , "localhost:2181"
    , "test-group"
    , Map("test" -> 1))
  kafkaStream.print
  streamingContext.start
  streamingContext.awaitTermination
}
The dependencies I use:
<properties>
<spark.version>1.6.2</spark.version>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
</dependencies>
But every time I try to run it in IDEA I get:
Exception in thread "main" java.lang.NoSuchMethodError: scala.Predef$.$conforms()Lscala/Predef$$less$colon$less;
at org.apache.spark.util.Utils$.getSystemProperties(Utils.scala:1582)
at org.apache.spark.SparkConf.<init>(SparkConf.scala:59)
at org.apache.spark.SparkConf.<init>(SparkConf.scala:53)
at com.mypackage.KafkaConsumer$.main(KafkaConsumer.scala:10)
at com.mypackage.KafkaConsumer.main(KafkaConsumer.scala)
Other questions here point to conflicts between dependencies.
I use Scala 2.10.5 and Spark 1.6.2. I have used them in other projects and they worked fine.
Line 10 in this case is val sparkConf = new SparkConf().
I am trying to run the app in IDEA without packaging it.
What could be the reason for this problem?
It's a Scala version error: you are using different Scala versions in your code and in your dependencies.
You say you are using Scala 2.10, but you import spark-xx_2.11 dependencies. Unify your Scala version.
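One quick way to confirm which Scala and Spark versions the application actually runs against is to print them at startup; this is a small diagnostic sketch, not part of the original answer:

// Print the Scala and Spark versions seen at runtime to spot a mismatch
println(scala.util.Properties.versionString)  // e.g. "version 2.10.5"
println(org.apache.spark.SPARK_VERSION)       // e.g. "1.6.2"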

Spark streaming with Kafka - java.lang.ClassNotFoundException: org.apache.spark.internal.Logging

I'm currently stuck in an annoying situation. I'm trying to implement Kafka offset handling logic using Spark Streaming together with Kafka and MongoDB. The offset persistence to/from MongoDB does what it is supposed to do, but when I try to create a direct stream using:
JavaInputDStream<ConsumerRecord<String, String>> events = KafkaUtils.createDirectStream(
    jsc,
    LocationStrategies.PreferConsistent(),
    ConsumerStrategies.<String, String>Assign(committedOffsetRanges.keySet(), kafkaParams, committedOffsetRanges)
);
I get the following exception (a lot of lines removed for brevity):
at org.apache.spark.streaming.kafka010.ConsumerStrategies.Assign(ConsumerStrategy.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.internal.Logging
I use the following dependencies:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>${spark-version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.10</artifactId>
<version>${spark-version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10-assembly_2.10</artifactId>
<version>2.2.0</version>
</dependency>
${spark-version} is set to "1.6.0-cdh5.12.1".
I've read that org.apache.spark.internal.Logging existed up to v 1.5.2 of Spark but unfortunately I'm not able to downgrade.
Has anybody found a solution to this, or at least a workaround?

Unable to connect to remote Cassandra via Spark + Scala

I'm having some trouble trying to connect to a remote Cassandra instance using Apache Spark and Scala. In the past I successfully managed to connect to MongoDB in the same way.
This time I really don't understand why I'm getting the following error:
Failed to open native connection to Cassandra at {127.0.0.1}:9042
I guess it's a dependency and version problem, but I was not able to find anything related to this particular issue, either in the documentation or in other questions.
I connect to my server via an SSH tunnel using JSch, and that works fine. Then I can successfully connect to the local Apache Spark using SparkConnectionFactory.scala:
package connection

import org.apache.spark.{SparkConf, SparkContext}

class SparkConnectionFactory {

  var sparkContext: SparkContext = _

  def initSparkConnection = {
    val configuration = new SparkConf(true).setMaster("local[8]")
      .setAppName("my_test")
      .set("spark.cassandra.connection.host", "localhost")
      .set("spark.cassandra.input.consistency.level", "ONE")
      .set("spark.driver.allowMultipleContexts", "true")
    val sc = new SparkContext(configuration)
    sparkContext = sc
  }

  def getSparkInstance: SparkContext = {
    sparkContext
  }

}
And calling it in my Main.scala:
val sparkConnectionFactory = new SparkConnectionFactory
sparkConnectionFactory.initSparkConnection
val sc : SparkContext = sparkConnectionFactory.getSparkInstance
But when I try to select all the items inside a Cassandra table using:
val rdd = sc.cassandraTable("my_keyspace", "my_table")
rdd.foreach(println)
I get the error I wrote above.
On my server I installed Scala ~2.11.6, Spark ~2.1.1, and Spark SQL ~2.1.1. I have 8 cores and a replication factor of 1. In my pom.xml I have:
. . .
<properties>
<scala.version>2.11.6</scala.version>
</properties>
<dependencies>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
. . .
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.10 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.1.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/com.datastax.spark/spark-cassandra-connector_2.10 -->
<dependency>
<groupId>com.datastax.spark</groupId>
<artifactId>spark-cassandra-connector_2.11</artifactId>
<version>2.0.3</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.10 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.1.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/commons-codec/commons-codec -->
<dependency>
<groupId>commons-codec</groupId>
<artifactId>commons-codec</artifactId>
<version>1.9</version>
</dependency>
</dependencies>
Is my issue caused by conflicting versions? If so, how can I fix it? If not, any hint on what's causing it?
Thanks in advance.
I'm forwarding port 9042 to 8988
Then that's the port you need to connect to:
.set("spark.cassandra.connection.port", "8988")
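Combined with the factory from the question, the configuration would look roughly like this; it is only a sketch using the question's values (localhost and the locally forwarded port 8988):

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._  // brings sc.cassandraTable into scope

val configuration = new SparkConf(true).setMaster("local[8]")
  .setAppName("my_test")
  .set("spark.cassandra.connection.host", "localhost")
  .set("spark.cassandra.connection.port", "8988")  // the locally forwarded port, not the remote 9042
val sc = new SparkContext(configuration)
val rdd = sc.cassandraTable("my_keyspace", "my_table")
rdd.foreach(println)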

Error while using SparkSession or SQLContext

I am new to Spark. I am just trying to parse a JSON file using SparkSession or SQLContext.
But whenever I run it, I get the following error:
Exception in thread "main" java.lang.NoSuchMethodError:
org.apache.spark.internal.config.package$.CATALOG_IMPLEMENTATION()Lorg/apache/spark/internal/config/ConfigEntry; at
org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$sessionStateClassName(SparkSession.scala:930) at
org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:112) at
org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:110) at
org.apache.spark.sql.DataFrameReader.<init>(DataFrameReader.scala:535) at
org.apache.spark.sql.SparkSession.read(SparkSession.scala:595) at
org.apache.spark.sql.SQLContext.read(SQLContext.scala:504) at
joinAssetsAndAd$.main(joinAssetsAndAd.scala:21) at
joinAssetsAndAd.main(joinAssetsAndAd.scala)
So far I have created a Scala project in the Eclipse IDE, configured it as a Maven project, and added the Spark core and Spark SQL dependencies.
My dependencies:
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.1.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.0.0</version>
</dependency>
</dependencies>
Could you please explain why I am getting this error and how to correct it?
Try to use the same version for spark-core and spark-sql: change the version of spark-sql to 2.1.0.
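Once both artifacts are on 2.1.0, parsing a JSON file through SparkSession should work; a minimal sketch with a placeholder input path:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("json-parse-example")  // placeholder app name
  .master("local[*]")
  .getOrCreate()
val df = spark.read.json("path/to/input.json")  // placeholder path
df.printSchema()
df.show()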