I am currently working on a Scala/Spark homework project in which I am to read in a CSV file containing a few thousand movie reviews as a DataFrame. I am then to analyze these reviews and train a model to detect whether a review is positive or negative, using TF-IDF and Word2Vec. The issue I am having is that my code so far does not find the header field named "word" that is output by a regex tokenizer. My code is written below, as well as the console output.
Thank you for your help; I appreciate any pointers on how I might do this correctly or better than what I am doing now.
import org.apache.spark._
//import spark.implicits._
import org.apache.spark.sql._
import org.apache.spark.sql.SparkSession
import java.io._
import scala.io._
import scala.collection.mutable.ListBuffer
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.rand
import org.apache.spark.rdd.RDD
import org.apache.spark.ml.feature.StopWordsRemover
import org.apache.spark.ml.feature.{RegexTokenizer, Tokenizer}
/*
DONE: step 1: loop through all the positive / negative reviews and label each (1 = Positive, 2 = Negative)
during the loop, place the text that is read as a string into a DF.
step 2: check which words are common among different labels and text with each other (possibly remove stop words)
this will satisfy the TF-IDF requirement
step 3: convert text into vectors and perform regression on the values
step 4: compare accuracy using the actual data (data for the above was using the test folder data)
*/
object Machine {
def main(args: Array[String]) {
val spark = SparkSession.builder.appName("Movie Review Manager").getOrCreate()
println("Reading data...")
val df = spark.read.format("csv").option("header", "true").load("movie_data.csv")
val regexTokenizer = new RegexTokenizer().setInputCol("review").setOutputCol("word").setPattern("\\s")
val remover = new StopWordsRemover().setInputCol("word").setOutputCol("feature")
df.show()
regexTokenizer.transform(df).show(false)
df.collect()
remover.transform(df).show(false)
df.show()
spark.stop()
}
}
And here is the console output:
Exception in thread "main" java.lang.IllegalArgumentException: Field "word" does not exist.
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:267)
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:267)
at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
at scala.collection.AbstractMap.getOrElse(Map.scala:59)
at org.apache.spark.sql.types.StructType.apply(StructType.scala:266)
at org.apache.spark.ml.feature.StopWordsRemover.transformSchema(StopWordsRemover.scala:111)
at org.apache.spark.ml.feature.StopWordsRemover.transform(StopWordsRemover.scala:91)
at Machine$.main(movieProgram.scala:44)
at Machine.main(movieProgram.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
2018-03-13 03:41:28 INFO SparkContext:54 - Invoking stop() from shutdown hook
2018-03-13 03:41:28 INFO AbstractConnector:318 - Stopped Spark@3abfc4ed{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
You are not storing the result of new RegexTokenizer().setInputCol("review").setOutputCol("word").setPattern("\\s"), which is what new StopWordsRemover().setInputCol("word").setOutputCol("feature") needs to operate on.
Because the tokenized DataFrame is never saved, the stop-words remover is applied to the original df (where the word column is not present), which produces the error
java.lang.IllegalArgumentException: Field "word" does not exist...
The correct way is to do it as below:
object Machine {
def main(args: Array[String]) {
val spark = SparkSession.builder.appName("Movie Review Manager").getOrCreate()
println("Reading data...")
val df = spark.read.format("csv").option("header", "true").load("movie_data.csv")
val regexTokenizer = new RegexTokenizer().setInputCol("review").setOutputCol("word").setPattern("\\s")
val remover = new StopWordsRemover().setInputCol("word").setOutputCol("feature")
df.show(false) //original dataframe
val tokenized = regexTokenizer.transform(df)
tokenized.show(false) //tokenized dataframe
val removed = remover.transform(tokenized)
removed.show(false) //stopwords removed dataframe
spark.stop()
}
}
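Since your assignment also calls for TF-IDF, here is a minimal sketch of how these stages can be chained into an ml Pipeline together with HashingTF and IDF, so that no intermediate DataFrame has to be stored by hand. The rawFeatures and tfidfFeatures column names and the numFeatures value are placeholders of mine, not from your code:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{HashingTF, IDF, RegexTokenizer, StopWordsRemover}
// tokenize, drop stop words, then build TF-IDF features
val tokenizer = new RegexTokenizer().setInputCol("review").setOutputCol("word").setPattern("\\s")
val remover = new StopWordsRemover().setInputCol("word").setOutputCol("feature")
val hashingTF = new HashingTF().setInputCol("feature").setOutputCol("rawFeatures").setNumFeatures(10000)
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("tfidfFeatures")
val pipeline = new Pipeline().setStages(Array(tokenizer, remover, hashingTF, idf))
// fitting the pipeline runs every stage in order on the DataFrame
val featurized = pipeline.fit(df).transform(df)
featurized.select("tfidfFeatures").show(false)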
I hope the answer is helpful
Related
I need to read messages from a Kafka producer, find the words in those messages that contain %, and generate a message for the different % values. Finally, I need to send the result to Elasticsearch.
I am able to see the values in the console using kafkaStream.print(), but I need to process the strings to match the required keywords and generate the message.
My code:
package rnd
import org.apache.spark.SparkConf
import kafka.serializer.StringDecoder
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}
object WordFind {
def main(args: Array[String]) {
val conf = new SparkConf().setMaster("local").setAppName("KafkaReceiver")
val checkpointDir = "/usr/local/kafka/kafka_2.11-0.11.0.2/checkpoint/"
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.Seconds
val batchIntervalSeconds = 2
val ssc = new StreamingContext(conf, Seconds(10))
import org.apache.spark.streaming.kafka.KafkaUtils
val kafkaStream = KafkaUtils.createStream(ssc, "localhost:2181", "spark-streaming-consumer-group", Map("wordcounttopic" -> 5))
val s = kafkaStream.print()
println(" the words are: " + s)
ssc.remember(Minutes(1))
ssc.checkpoint(checkpointDir)
ssc
ssc.start()
ssc.awaitTerminationOrTimeout(batchIntervalSeconds * 5 * 1000)
}
}
If I pass "The usage is 75%" through the Kafka producer, I should generate a message saying "Increase ram by 25%" in Elasticsearch.
The output that I am getting is:
18/02/09 16:38:27 INFO BlockManagerMasterEndpoint: Registering block manager localhost:37879 with 2.4 GB RAM, BlockManagerId(driver, localhost, 37879)
18/02/09 16:38:27 INFO BlockManagerMaster: Registered BlockManager
18/02/09 16:38:27 WARN StreamingContext: spark.master should be set as local[n], n > 1 in local mode if you have receivers to get data, otherwise Spark jobs will not get resources to process the received data.
***the words are: ()***
I want the String that I am passing to appear in place of the () in s.
The val kafkaStream is a ReceiverInputDStream[(String, String)], where the data is (kafkaMetaData, kafkaMessage).
For more information see https://github.com/apache/spark/blob/f830bb9170f6b853565d9dd30ca7418b93a54fe3/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/KafkaInputDStream.scala#L135.
We need to extract the second element of the tuple and do the pattern matching (i.e. filter the ReceiverInputDStream to find the words that contain %), and then use map to generate the output (i.e. a message for the different % values). And as mentioned by @stefanobaghino, the print() function just prints the output to the console and doesn't return the record as a string.
For example:
import org.apache.spark.streaming.dstream.ReceiverInputDStream
val kafkaStream: ReceiverInputDStream[(String, String)] = KafkaUtils.createStream(sparkStreamingContext, "localhost:2181",
"spark-streaming-consumer-group", Map("wordcounttopic" -> 5))
import org.apache.spark.streaming.dstream.DStream
val filteredStream: DStream[(String, String)] = kafkaStream
.filter(record => record._2.contains("%")) // TODO : pattern matching here
val outputDStream: DStream[String] = filteredStream
.map(record => record._2.toUpperCase()) // just assuming some operation
outputDStream.print()
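For the specific message you describe ("The usage is 75%" -> "Increase ram by 25%"), the toUpperCase placeholder above could be replaced with something roughly like this; the regex and the 100 - usage arithmetic are assumptions based on your example:
val percentPattern = """(\d+)\s*%""".r
val usageMessages: DStream[String] = filteredStream.map { case (_, message) =>
  percentPattern.findFirstMatchIn(message) match {
    case Some(m) =>
      val usage = m.group(1).toInt
      s"Increase ram by ${100 - usage}%" // "The usage is 75%" -> "Increase ram by 25%"
    case None =>
      s"No percentage found in: $message" // fallback, adjust as needed
  }
}
usageMessages.print()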
Use the resulting output DStream to write into Elasticsearch. Hope this helps.
I am trying to use Spark SQL to query the data coming from Kafka using Zeppelin for real-time trend analysis, but without success.
Here are the simple code snippets that I am running in Zeppelin:
//Load Dependency
%dep
z.reset()
z.addRepo("Spark Packages Repo").url("http://repo1.maven.org/maven2/")
z.load("org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.1")
z.load("org.apache.spark:spark-core_2.11:2.0.1")
z.load("org.apache.spark:spark-sql_2.11:2.0.1")
z.load("org.apache.spark:spark-streaming_2.11:2.0.1"
//simple streaming
%spark
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils
import _root_.kafka.serializer.StringDecoder
import org.apache.spark.sql.SparkSession
val conf = new SparkConf()
.setAppName("clickstream")
.setMaster("local[*]")
.set("spark.streaming.stopGracefullyOnShutdown", "true")
.set("spark.driver.allowMultipleContexts","true")
val spark = SparkSession
.builder()
.appName("Spark SQL basic example")
.config(conf)
.getOrCreate()
val ssc = new StreamingContext(conf, Seconds(1))
val topicsSet = Set("timer")
val kafkaParams = Map[String, String]("metadata.broker.list" -> "192.168.25.1:9091,192.168.25.1:9092,192.168.25.1:9093")
val lines = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
ssc, kafkaParams, topicsSet).map(_._2)
lines.window(Seconds(60)).foreachRDD{ rdd =>
val clickDF = spark.read.json(rdd) //doesn't have to be json
clickDF.createOrReplaceTempView("testjson1")
//olderway
//clickDF.registerTempTable("testjson2")
clickDF.show
}
lines.print()
ssc.start()
ssc.awaitTermination()
I am able to print each Kafka message, but when I run a simple SQL query such as %sql select * from testjson1 (or testjson2), I get the following error:
java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:347)
at scala.None$.get(Option.scala:345)
at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
at org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:646)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
In this post, streaming data is being queried (with a Twitter example), so I am thinking it should be possible with Kafka streaming as well. So I guess maybe I am doing something wrong or missing some point?
Any ideas, suggestions, or recommendations are welcome.
The error message does not say that the temp view is missing. It says that the None instance does not provide an element named get.
With Spark, the calculations based on RDDs are performed only when an action is called. So up to the point where you create the temporary table, no calculation has been performed; all the calculations happen when you execute your query on the table. If the table did not exist, you would get a different error message.
The Kafka messages may well print fine, but your exception says that the None instance does not know get. So I believe that your source JSON data contains items without data, that those items are represented by None, and that they cause the exception while Spark performs the calculations.
I would suggest that you first verify whether your solution works in general by testing it with sample data that does not contain empty JSON elements.
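For example, a minimal sketch of that idea, assuming the problematic items show up as blank or whitespace-only records, would be to guard the foreachRDD body before registering the view:
lines.window(Seconds(60)).foreachRDD { rdd =>
  val nonEmpty = rdd.filter(_.trim.nonEmpty) // drop blank records before parsing
  if (!nonEmpty.isEmpty()) {                 // skip batches with no usable data
    val clickDF = spark.read.json(nonEmpty)
    clickDF.createOrReplaceTempView("testjson1")
    clickDF.show()
  }
}
This is only a way to narrow the problem down, not a guaranteed fix.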
I'm trying to create a text classifier Spark (1.6.2) app, but I don't know what I am doing wrong. This is my code:
import org.apache.spark.ml.classification.{NaiveBayes, NaiveBayesModel}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.mllib
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
/**
* Created by kebodev on 2016.11.29..
*/
object PredTest {
def main(args: Array[String]): Unit = {
val conf = new SparkConf()
.setMaster("local[*]")
.setAppName("IktatoSparkRunner")
.set("spark.executor.memory", "2gb")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val trainData = sqlContext.read.json("src/main/resources/tst.json")
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val wordsData = tokenizer.transform(trainData)
val hashingTF = new HashingTF()
.setInputCol("words").setOutputCol("features").setNumFeatures(20)
val featurizedData = hashingTF.transform(wordsData)
val model = NaiveBayes.train(featurizedData)
}
}
The NaiveBayes object doesn't have a train method; what should I import?
If I try it this way:
val naBa = new NaiveBayes()
naBa.fit(featurizedData)
I get this exception:
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Column label must be of type DoubleType but was actually StringType.
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
at org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:53)
at org.apache.spark.ml.classification.Classifier.org$apache$spark$ml$classification$ClassifierParams$$super$validateAndTransformSchema(Classifier.scala:56)
at org.apache.spark.ml.classification.ClassifierParams$class.validateAndTransformSchema(Classifier.scala:40)
at org.apache.spark.ml.classification.ProbabilisticClassifier.org$apache$spark$ml$classification$ProbabilisticClassifierParams$$super$validateAndTransformSchema(ProbabilisticClassifier.scala:53)
at org.apache.spark.ml.classification.ProbabilisticClassifierParams$class.validateAndTransformSchema(ProbabilisticClassifier.scala:37)
at org.apache.spark.ml.classification.ProbabilisticClassifier.validateAndTransformSchema(ProbabilisticClassifier.scala:53)
at org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:116)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:68)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:89)
at PredTest$.main(PredTest.scala:37)
at PredTest.main(PredTest.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
This is what my JSON file looks like:
{"text":"any text","label":"6.0"}
I'm really a noob on this topic. Can anyone help me with how to create a model, and then how to predict a new value?
Thank you!
Labels and Feature Vectors only contain Doubles. Your label column contains a String.
See your stacktrace:
Column label must be of type DoubleType but was actually StringType.
You can use the StringIndexer or CountVectorizer to convert it appropriately. See http://spark.apache.org/docs/latest/ml-features.html#stringindexer for further details.
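For example, a minimal sketch with StringIndexer and the ml version of NaiveBayes, reusing the column names from your code (the labelIndex name is just a placeholder of mine):
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.classification.NaiveBayes
// map the string label (e.g. "6.0") to a numeric label column
val indexer = new StringIndexer().setInputCol("label").setOutputCol("labelIndex")
val indexedData = indexer.fit(featurizedData).transform(featurizedData)
val nb = new NaiveBayes().setLabelCol("labelIndex").setFeaturesCol("features")
val model = nb.fit(indexedData)
// model.transform(...) then adds a "prediction" column you can inspect
model.transform(indexedData).select("label", "prediction").show()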
I have been trying to parse data from a DStream obtained from Spark Streaming (TCP) and send it to Elasticsearch. I am getting the error org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: Found unrecoverable error [127.0.0.1:9200] returned Bad Request(400) - failed to parse; Bailing out..
The following is my code:
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.SparkContext
import org.apache.spark.serializer.KryoSerializer;
import org.apache.spark.SparkContext._
import org.elasticsearch.spark._
import org.elasticsearch.spark.rdd.EsSpark
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.spark.TaskContext
import org.elasticsearch.common.transport.InetSocketTransportAddress;
object Test {
case class createRdd(Message: String, user: String)
def main(args:Array[String]) {
val mapper=new ObjectMapper()
val SparkConf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[*]")
SparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
SparkConf.set("es.nodes","localhost:9200")
SparkConf.set("es.index.auto.create", "true")
// Create a local StreamingContext with batch interval of 10 second
val ssc = new StreamingContext(SparkConf, Seconds(10))
/* Create a DStream that will connect to hostname and port, like localhost 9999. As stated earlier, DStream will get created from StreamContext, which in return is created from SparkContext. */
val lines = ssc.socketTextStream("localhost",9998)
// Using this DStream (lines) we will perform transformation or output operation.
val words = lines.map(_.split(" "))
words.foreachRDD(_.saveToEs("spark/test"))
ssc.start() // Start the computation
ssc.awaitTermination() // Wait for the computation to terminate
}
}
The following is the error:
16/10/17 11:02:30 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
16/10/17 11:02:30 INFO BlockManager: Found block input-0-1476682349200 locally
16/10/17 11:02:30 INFO Version: Elasticsearch Hadoop v5.0.0.BUILD.SNAPSHOT [4282a0194a]
16/10/17 11:02:30 INFO EsRDDWriter: Writing to [spark/test]
16/10/17 11:02:30 ERROR TaskContextImpl: Error in TaskCompletionListener
org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: Found unrecoverable error [127.0.0.1:9200] returned Bad Request(400) - failed to parse; Bailing out..
at org.elasticsearch.hadoop.rest.RestClient.processBulkResponse(RestClient.java:250)
at org.elasticsearch.hadoop.rest.RestClient.bulk(RestClient.java:202)
at org.elasticsearch.hadoop.rest.RestRepository.tryFlush(RestRepository.java:220)
at org.elasticsearch.hadoop.rest.RestRepository.flush(RestRepository.java:242)
at org.elasticsearch.hadoop.rest.RestRepository.close(RestRepository.java:267)
at org.elasticsearch.hadoop.rest.RestService$PartitionWriter.close(RestService.java:120)
at org.elasticsearch.spark.rdd.EsRDDWriter$$anonfun$write$1.apply(EsRDDWriter.scala:42)
at org.elasticsearch.spark.rdd.EsRDDWriter$$anonfun$write$1.apply(EsRDDWriter.scala:42)
at org.apache.spark.TaskContext$$anon$1.onTaskCompletion(TaskContext.scala:123)
at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:97)
at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:95)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:95)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I am coding in Scala. I am unable to find the reason for the error; please help me out with this exception.
Thank you.
I am attempting to perform Spark MLlib PCA (using Scala) on a RowMatrix with 2168 columns and a large number of rows. However, I have observed that even with as few as 2 rows in the matrix (a 112KB text file), the following error is always produced at the same job step:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at breeze.linalg.svd$.breeze$linalg$svd$$doSVD_Double(svd.scala:92)
at breeze.linalg.svd$Svd_DM_Impl$.apply(svd.scala:39)
at breeze.linalg.svd$Svd_DM_Impl$.apply(svd.scala:38)
at breeze.generic.UFunc$class.apply(UFunc.scala:48)
at breeze.linalg.svd$.apply(svd.scala:22)
at org.apache.spark.mllib.linalg.distributed.RowMatrix.computePrincipalComponents(RowMatrix.scala:380)
at SimpleApp$.main(scala-pca.scala:17)
at SimpleApp.main(scala-pca.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I have also observed that this error is remedied by using 1100 columns or fewer, regardless of the number of rows in the RowMatrix.
I am running Spark 1.3.0 standalone across 21 nodes, with 12 workers and 20G memory per node. I am submitting the job via spark-submit with --driver-memory 6g and --conf spark.executor.memory=1700m. In spark-env.sh the following options are set:
SPARK_WORKER_MEMORY=1700M
SPARK_WORKER_CORES=1
SPARK_WORKER_INSTANCES=12
Here is the code I am submitting:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.mllib.linalg.Matrix
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.linalg.{Vector, Vectors}
object SimpleApp {
def main(args: Array[String]) {
val datafilePattern = "/path/to/data/files*.txt"
val conf = new SparkConf().setAppName("pca_analysis").setMaster("master-host")
val sc = new SparkContext(conf)
val lData = sc.textFile(datafilePattern).cache()
val vecData = lData.map(line => line.split(" ").map(v => v.toDouble)).map(arr => Vectors.dense(arr))
val rmat: RowMatrix = new RowMatrix(vecData)
val pc: Matrix = rmat.computePrincipalComponents(15)
val projected: RowMatrix = rmat.multiply(pc)
println("Finished projecting rows.")
}
}
Has anyone else experienced this problem with the computePrincipalComponents() method? Any help is much appreciated.
I just ran into this issue, and the fix is to increase --driver-memory to maybe 2G or more if needed.