I am using Spark Streaming in Scala, where I am working with streaming Twitter data. I have the following code:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

val ssc = new StreamingContext(new SparkConf(), Seconds(5))
val tweets = TwitterUtils.createStream(ssc, None)
val user = tweets.map(x => x.getText())
val lang = tweets.map(x => x.getLang())
I am getting the following error:
[error] /home/user/Lab1.1/Twitterstats.scala:103: value getLang is not a member of twitter4j.Status
[error] val lang = tweets.map(x=> x.getLang())
[error] ^
[error] one error found
What is wrong with the above code? Could someone please help?
spark-streaming-twitter uses Twitter4J. getLang() has only been supported since Twitter4J version 3.0.6. If you are using spark-streaming-twitter 1.5.2 (or lower) you won't be able to call getLang(), because those versions depend on Twitter4J 3.0.3. Since spark-streaming-twitter 1.6.0, Twitter4J 4.0.4 is used, which does include getLang().
So you could upgrade spark-streaming-twitter to 1.6.0 or higher, or use another third-party library to detect the language of your tweets.
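For example, with sbt the upgrade is a one-line dependency change. A minimal sketch, assuming an sbt build whose Scala version matches your Spark distribution:
// build.sbt -- sketch only: spark-streaming-twitter 1.6.0 depends on
// twitter4j 4.0.4, which provides Status.getLang()
libraryDependencies += "org.apache.spark" %% "spark-streaming-twitter" % "1.6.0"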
I am new to Apache Spark, and I am using Scala and MongoDB to learn it.
https://docs.mongodb.com/spark-connector/current/scala-api/
I am trying to read an RDD from my MongoDB database; my notebook script is below:
import com.mongodb.spark.config._
import com.mongodb.spark._
val readConfig = ReadConfig(Map("uri" -> "mongodb+srv://root:root#mongodbcluster.td5gp.mongodb.net/test_database.test_collection?retryWrites=true&w=majority"))
val testRDD = MongoSpark.load(sc, readConfig)
print(testRDD.collect)
At the print(testRDD.collect) line, I got this error:
java.lang.NoSuchMethodError:
com.mongodb.internal.connection.Cluster.selectServer(Lcom/mongodb/selector/ServerSelector;)Lcom/mongodb/internal/connection/Server;
And more than 10 lines "at..."
Used libraries:
org.mongodb.spark:mongo-spark-connector_2.12:3.0.1
org.mongodb.scala:mongo-scala-driver_2.12:4.2.3
Is this a problem with MongoDB's internal libraries, or how could I fix it?
Many thanks
I suspect that there is a conflict between mongo-spark-connector and mongo-scala-driver. The former uses Mongo Java driver 4.0.5, while the latter is based on version 4.2.3. I would recommend trying with only mongo-spark-connector.
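For example, if your dependencies were declared in sbt form, a minimal sketch of the suggested setup, keeping only the connector (which pulls in a compatible MongoDB Java driver transitively):
// build.sbt -- sketch only: drop mongo-scala-driver 4.2.3 and let the
// connector resolve its own MongoDB Java driver version
libraryDependencies += "org.mongodb.spark" %% "mongo-spark-connector" % "3.0.1"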
I was also facing the same problem and solved it using the mongo-spark-connector_2.12:3.0.1 jar, together with the Scalaj HTTP 2.4.2 jar. It's working fine now.
I am using Spark 3.0.1 in my Kotlin project. Compilation fails with the following error:
e: org.jetbrains.kotlin.util.KotlinFrontEndException: Exception while analyzing expression at (51,45) in /home/user/project/src/main/kotlin/ModelBuilder.kt
...
Caused by: java.lang.IllegalStateException: No parameter with index 0-0 (name=reverser$module$1 access=16) in method scala.collection.TraversableOnce.reverser$2
at org.jetbrains.kotlin.load.java.structure.impl.classFiles.AnnotationsAndParameterCollectorMethodVisitor.visitParameter(Annotations.kt:48)
at org.jetbrains.org.objectweb.asm.ClassReader.readMethod(ClassReader.java:1149)
at org.jetbrains.org.objectweb.asm.ClassReader.accept(ClassReader.java:680)
at org.jetbrains.org.objectweb.asm.ClassReader.accept(ClassReader.java:392)
at org.jetbrains.kotlin.load.java.structure.impl.classFiles.BinaryJavaClass.<init>(BinaryJavaClass.kt:77)
at org.jetbrains.kotlin.load.java.structure.impl.classFiles.BinaryJavaClass.<init>(BinaryJavaClass.kt:40)
at org.jetbrains.kotlin.cli.jvm.compiler.KotlinCliJavaFileManagerImpl.findClass(KotlinCliJavaFileManagerImpl.kt:115)
at org.jetbrains.kotlin.cli.jvm.compiler.KotlinCliJavaFileManagerImpl.findClass(KotlinCliJavaFileManagerImpl.kt:85)
at org.jetbrains.kotlin.cli.jvm.compiler.KotlinCliJavaFileManagerImpl$findClass$$inlined$getOrPut$lambda$1.invoke(KotlinCliJavaFileManagerImpl.kt:113)
at org.jetbrains.kotlin.cli.jvm.compiler.KotlinCliJavaFileManagerImpl$findClass$$inlined$getOrPut$lambda$1.invoke(KotlinCliJavaFileManagerImpl.kt:48)
at org.jetbrains.kotlin.load.java.structure.impl.classFiles.ClassifierResolutionContext.resolveClass(ClassifierResolutionContext.kt:60)
at org.jetbrains.kotlin.load.java.structure.impl.classFiles.ClassifierResolutionContext.resolveByInternalName$frontend_java(ClassifierResolutionContext.kt:101)
at org.jetbrains.kotlin.load.java.structure.impl.classFiles.BinaryClassSignatureParser$parseParameterizedClassRefSignature$1.invoke(BinaryClassSignatureParser.kt:141)
I've cleaned/rebuilt the project several times, removed the build directory, and tried building from the command line with Gradle.
The code where this happens:
val data = listOf(...)
val schema = StructType(arrayOf(
    StructField("label", DataTypes.DoubleType, false, Metadata.empty()),
    StructField("sentence", DataTypes.StringType, false, Metadata.empty())
))
val dataframe = spark.createDataFrame(data, schema) // <- offending line.
I was using Kotlin version 1.4.0 and upgraded to 1.4.10 without any change; still the same error.
It looks like this bug (and a related one) has already been reported to JetBrains, but is it really not possible to use Spark 3 (local mode) with Kotlin 1.4?
I managed to get it working with Spring Boot (2.3.5) by adding the following to the dependencyManagement:
dependencies {
    dependencySet("org.scala-lang:2.12.10") {
        entry("scala-library")
    }
}
This downgrades the scala-library jar from version 2.12.12 to 2.12.10, which is the same version as the scala-reflect jar in my project. I'm also using Kotlin 1.4.10.
Are you trying to use this API?
https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/SparkSession.html#createDataFrame-java.util.List-java.lang.Class-
There is no method which takes a java.util.List as well as a schema object, AFAIK.
I am following the jpmml-evaluator-spark instructions to load a local PMML model.
My code is like below:
import java.io.File
import org.jpmml.evaluator.spark._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql._
// load pmml
val pmmlFile = new File(getClass.getClassLoader.getResource("random_forest.pmml").getFile)
// create evaluator
val evaluator = EvaluatorUtil.createEvaluator(pmmlFile)
I cannot show the error message directly, so I put it here
Guesses:
There are some reasons that I think may cause this problem:
1. jpmml-evaluator-spark does not support PMML 4.3, even though the author said the new version 1.1.0 already supports PMML 4.3.
2. There is some problem with my random_forest.pmml, because this file came from someone else.
Note:
development environment
spark 2.1.1
scala 2.11.8
and I run locally; the Mac system version is OS X El Capitan 10.11.6.
You are using Apache Spark 2.0, 2.1 or 2.2, which prepends a legacy version of the JPMML-Model library (1.2.15, to be precise) to your application classpath. This issue is documented in SPARK-15526.
Solution: fix your application classpath as described in the JPMML-Evaluator-Spark documentation (alternatively, consider switching to Apache Spark 2.3.0 or newer).
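The documentation describes the authoritative classpath fix; purely as an illustration of the idea, a sketch using Spark's (experimental) user-classpath-first switches, which make the application's newer JPMML-Model classes win over Spark's bundled legacy copy:
import org.apache.spark.SparkConf

// Sketch only, not the documented JPMML recipe: ask Spark to resolve classes
// from the application's jars before its own bundled ones
val conf = new SparkConf()
  .set("spark.driver.userClassPathFirst", "true")
  .set("spark.executor.userClassPathFirst", "true")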
Another option for using PMML in Spark is PMML4S-Spark, which supports the latest PMML 4.4, for example:
import org.pmml4s.spark.ScoreModel

// Load the PMML model from a local file
val model = ScoreModel.fromFile(pmmlFile)

// Score an existing DataFrame `df`; the result contains the appended prediction columns
val scoreDf = model.transform(df)
I am trying to access the streaming tweets from Spark Streaming.
This is the software configuration.
Ubuntu 14.04.2 LTS
scala -version
Scala code runner version 2.11.7 -- Copyright 2002-2013, LAMP/EPFL
spark-submit --version
Spark version 1.6.0
Following is the code.
object PrintTweets {
  def main(args: Array[String]) {
    // Configure Twitter credentials using twitter.txt
    setupTwitter()
    // Set up a Spark streaming context named "PrintTweets" that runs locally using
    // all CPU cores and one-second batches of data
    val ssc = new StreamingContext("local[*]", "PrintTweets", Seconds(1))
    // Get rid of log spam (should be called after the context is set up)
    setupLogging()
    // Create a DStream from Twitter using our streaming context
    val tweets = TwitterUtils.createStream(ssc, None)
    // Now extract the text of each status update into RDD's using map()
    val statuses = tweets.map(status => status.getText())
    // Print out the first ten
    statuses.print()
    // Kick it all off
    ssc.start()
    ssc.awaitTermination()
  }
}
Utilities.scala
object Utilities {
  /** Makes sure only ERROR messages get logged to avoid log spam. */
  def setupLogging() = {
    import org.apache.log4j.{Level, Logger}
    val rootLogger = Logger.getRootLogger()
    rootLogger.setLevel(Level.ERROR)
  }

  /** Configures Twitter service credentials using twitter.txt in the main workspace directory */
  def setupTwitter() = {
    import scala.io.Source
    for (line <- Source.fromFile("./data/twitter.txt").getLines) {
      val fields = line.split(" ")
      if (fields.length == 2) {
        System.setProperty("twitter4j.oauth." + fields(0), fields(1))
      }
    }
  }
}
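For reference, setupTwitter() expects twitter.txt to contain one space-separated name/value pair per line. A sketch with placeholder values (the names are the standard twitter4j OAuth property suffixes):
consumerKey yourConsumerKey
consumerSecret yourConsumerSecret
accessToken yourAccessToken
accessTokenSecret yourAccessTokenSecret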
Issues:
Since it needs the twitter4j library, I added twitter4j-core-4.0.4 and twitter4j-stream-4.0.4 to the Eclipse build path as external jars.
Then I ran the program; it didn't throw any error, but no tweets appeared in the console. The output was empty.
After reading some forums, I downgraded twitter4j to 3.0.3 and also selected the Scala 2.10 library container in the Eclipse Build Path window.
After that I got a java.lang.NoSuchMethodError at run time.
16/05/14 11:46:01 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.NoSuchMethodError: twitter4j.TwitterStream.addListener(Ltwitter4j/StreamListener;)V
at org.apache.spark.streaming.twitter.TwitterReceiver.onStart(TwitterInputDStream.scala:72)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:575)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:565)
at org.apache.spark.SparkContext$$anonfun$37.apply(SparkContext.scala:1992)
at org.apache.spark.SparkContext$$anonfun$37.apply(SparkContext.scala:1992)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Please help me resolve this. I initially installed Spark built with Scala 2.11. Is that the problem? Do I need to uninstall everything, reinstall Scala 2.10, and then the pre-compiled Spark package?
Or, apart from Scala 2.11, do I need to have Scala 2.10 on my system?
The above exception seems to be caused by an incompatibility between Spark 1.6.0 and twitter4j 3.0.3.
The twitter4j.TwitterStream that is used in the onStart method of org.apache.spark.streaming.twitter.TwitterReceiver is expected to have an addListener method taking an instance of twitter4j.StreamListener.
twitter4j 3.0.3 has no twitter4j.TwitterStream.addListener(StreamListener) method; instead it has a few other addListener methods, which take subclasses of StreamListener.
twitter4j 4.0.4 has the desired method, which is why no error occurs with that library. So changing to twitter4j 3.0.3 will not solve the problem; the problem is somewhere else.
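If you manage dependencies with a build tool instead of Eclipse's external jars, pinning twitter4j at 4.0.4 is straightforward. A minimal sketch in sbt, assuming Spark 1.6.0 pre-built for Scala 2.10:
// build.sbt -- sketch only; versions assume Spark 1.6.0 built for Scala 2.10
scalaVersion := "2.10.6"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming" % "1.6.0",
  "org.apache.spark" %% "spark-streaming-twitter" % "1.6.0",
  "org.twitter4j" % "twitter4j-core" % "4.0.4",
  "org.twitter4j" % "twitter4j-stream" % "4.0.4"
)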
In my case, I had a Spark Java project. I cleaned the pom file and started adding dependencies back in order: first I resolved the Spark-related errors, then spark-launcher, and onward from the larger libraries.
Note: I was using a CDH 6.2.0 environment.
I am new to Scala and am trying to read a file using the following code:
scala> val textFile = sc.textFile("README.md")
scala> textFile.count()
But I keep getting the following error
error: not found: value sc
I have tried everything, but nothing seems to work. I am using Scala version 2.10.4 and Spark 1.1.0 (I have even tried Spark 1.2.0, but it doesn't work either). I have sbt installed and compiled, yet I am not able to run sbt/sbt assembly. Is the error because of this?
You should run this code using ./spark-shell. It's the Scala REPL with a provided SparkContext, bound to the name sc. You can find it in your Apache Spark distribution, in the folder spark-1.4.1/bin.
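If you want to run the same code outside spark-shell (for example from an sbt project), you have to create the context yourself. A minimal sketch for Spark 1.x, assuming local mode:
import org.apache.spark.{SparkConf, SparkContext}

// spark-shell creates this for you and binds it to the name `sc`
val conf = new SparkConf().setAppName("ReadFile").setMaster("local[*]")
val sc = new SparkContext(conf)

val textFile = sc.textFile("README.md")
println(textFile.count())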