MLlib classification example stops in stage 1 - scala

EDIT:
I tried using the text from Gabriel's answer and got spam features: 9 and ham features: 13. I tried changing the HashingTF to numFeatures = 9, then 13, then created one for each. Then the program stopped at "count at DataValidators.scala:38" just like before.
Completed Jobs(4)
count at 21 (spamFeatures)
count at 23 (hamFeatures)
count at 28 (trainingData.count())
first at GeneralizedLinearAlgorithm at 34 (val model = lrLearner.run(trainingData))
1) Why are the features being counted by lines, when the code splits the text on spaces (" ")?
2) Two things I see that differ between my code and Gabriel's code:
a) I don't have anything about logger, but that shouldn't be an issue...
b) My files are located on HDFS (hdfs://ip-abc-de-.compute.internal:8020/user/ec2-user/spam.txt); once again, that shouldn't be an issue, but I'm not sure if there's something I'm missing...
3) How long should I let it run? I've let it run for at least 10 minutes with local[2]...
I'm guessing at this point it might be some sort of issue with my Spark/MLlib setup? Is there an even simpler program I can run to check whether there is a setup issue with MLlib? I have been able to run other Spark Streaming/SQL jobs before...
Thanks!
[reposted from spark community]
Hello Everyone,
I am trying to run this MLlib example from Learning Spark:
https://github.com/databricks/learning-spark/blob/master/src/main/scala/com/oreilly/learningsparkexamples/scala/MLlib.scala#L48
Things I'm doing differently:
1) instead of their spam.txt and normal.txt I have text files with 200 words...nothing huge at all and just plain text, with periods, commas, etc.
2) I've used numFeatures = 200, 1000 and 10,000
Error: I keep getting stuck when I try to run the model (based on details from the UI below):
val model = new LogisticRegressionWithSGD().run(trainingData)
It will freeze on something like this:
[Stage 1:==============> (1 + 0) / 4]
Some details from webui:
org.apache.spark.rdd.RDD.count(RDD.scala:910)
org.apache.spark.mllib.util.DataValidators$$anonfun$1.apply(DataValidators.scala:38)
org.apache.spark.mllib.util.DataValidators$$anonfun$1.apply(DataValidators.scala:37)
org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm$$anonfun$run$2.apply(GeneralizedLinearAlgorithm.scala:161)
org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm$$anonfun$run$2.apply(GeneralizedLinearAlgorithm.scala:161)
scala.collection.LinearSeqOptimized$class.forall(LinearSeqOptimized.scala:70)
scala.collection.immutable.List.forall(List.scala:84)
org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:161)
org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:146)
$line21.$read$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
$line21.$read$$iwC$$iwC$$iwC.<init>(<console>:38)
$line21.$read$$iwC$$iwC.<init>(<console>:40)
$line21.$read$$iwC.<init>(<console>:42)
$line21.$read.<init>(<console>:44)
$line21.$read$.<init>(<console>:48)
$line21.$read$.<clinit>(<console>)
$line21.$eval$.<init>(<console>:7)
$line21.$eval$.<clinit>(<console>)
$line21.$eval.$print(<console>)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
I am not sure what I am doing wrong...any help is much appreciated, thank you!

Thanks for this question. I wasn't aware of these examples, so I downloaded and tested them. The files in the Git repository contain a lot of HTML code. The example works, but you end up with 100 features, which is possibly why you're not getting consistent results, since your own files contain far fewer features. To test that it works without the HTML, I removed the HTML code from spam.txt and ham.txt as follows:
ham.txt=
Dear Spark Learner, Thanks so much for attending the Spark Summit 2014!
Check out videos of talks from the summit at ...
Hi Mom, Apologies for being late about emailing and forgetting to send you
the package. I hope you and bro have been ...
Wow, hey Fred, just heard about the Spark petabyte sort. I think we need to
take time to try it out immediately ...
Hi Spark user list, This is my first question to this list, so thanks in
advance for your help! I tried running ...
Thanks Tom for your email. I need to refer you to Alice for this one. I
haven't yet figured out that part either ...
Good job yesterday! I was attending your talk, and really enjoyed it. I
want to try out GraphX ...
Summit demo got whoops from audience! Had to let you know. --Joe
spam.txt=
Dear sir, I am a Prince in a far kingdom you have not heard of. I want to
send you money via wire transfer so please ...
Get Viagra real cheap! Send money right away to ...
Oh my gosh you can be really strong too with these drugs found in the
rainforest. Get them cheap right now ...
YOUR COMPUTER HAS BEEN INFECTED! YOU MUST RESET YOUR PASSWORD. Reply to
this email with your password and SSN ...
THIS IS NOT A SCAM! Send money and get access to awesome stuff really
cheap and never have to ...
Then use the modified MLlib.scala below. Make sure log4j is referenced in your project so that output can be redirected to a file instead of the console. You basically need to run it twice: on the first run, watch the output, which prints the number of features in spam and ham; you can then set the correct number of features (instead of 100). I used 5.
package com.oreilly.learningsparkexamples.scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.log4j.Logger
object MLlib {
  private val logger = Logger.getLogger("MLlib")

  def main(args: Array[String]) {
    logger.info("This is spark in Windows")
    val conf = new SparkConf().setAppName(s"Book example: Scala").setMaster("local[2]").set("spark.executor.memory","1g")
    //val conf = new SparkConf().setAppName(s"Book example: Scala")
    val sc = new SparkContext(conf)
    // Load 2 types of emails from text files: spam and ham (non-spam).
    // Each line has text from one email.
    val spam = sc.textFile("spam.txt")
    val ham = sc.textFile("ham.txt")
    // Create a HashingTF instance to map email text to vectors of 5 (not 100) features.
    val tf = new HashingTF(numFeatures = 5)
    // Each email is split into words, and each word is mapped to one feature.
    // Note: count() returns the number of emails (lines), not the vector dimension.
    val spamFeatures = spam.map(email => tf.transform(email.split(" ")))
    println ("features in spam " + spamFeatures.count())
    val hamFeatures = ham.map(email => tf.transform(email.split(" ")))
    println ("features in ham " + hamFeatures.count())
    // Create LabeledPoint datasets for positive (spam) and negative (ham) examples.
    val positiveExamples = spamFeatures.map(features => LabeledPoint(1, features))
    val negativeExamples = hamFeatures.map(features => LabeledPoint(0, features))
    val trainingData = positiveExamples ++ negativeExamples
    trainingData.cache() // Cache data since Logistic Regression is an iterative algorithm.
    // Create a Logistic Regression learner which uses the SGD optimizer.
    val lrLearner = new LogisticRegressionWithSGD()
    // Run the actual learning algorithm on the training data.
    val model = lrLearner.run(trainingData)
    // Test on a positive example (spam) and a negative one (ham).
    // First apply the same HashingTF feature transformation used on the training data.
    val ex1 = "O M G GET cheap stuff by sending money to ..."
    val ex2 = "Hi Dad, I started studying Spark the other ..."
    val posTestExample = tf.transform(ex1.split(" "))
    val negTestExample = tf.transform(ex2.split(" "))
    // Now use the learned model to predict spam/ham for new emails.
    println(s"Prediction for positive test example: ${ex1} : ${model.predict(posTestExample)}")
    println(s"Prediction for negative test example: ${ex2} : ${model.predict(negTestExample)}")
    sc.stop()
  }
}
When I run this in the output I'm getting:
features in spam 5
features in ham 7
Prediction for positive test example: O M G GET cheap stuff by sending money
to ... : 1.0
Prediction for negative test example: Hi Dad, I started studying Spark the
other ... : 0.0
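On question 1) from the edit: the 5 and 7 printed above are the number of emails, not the number of features. sc.textFile yields one element per line, map keeps that count, and so count() reports lines. numFeatures only fixes the length of each vector. Here is a minimal plain-Python sketch of the hashing idea. It assumes CRC32 as a stand-in hash (Spark's HashingTF uses its own hash function, so the bucket indices will differ) and uses two emails shortened from the files above:

```python
import zlib


def hashing_tf(words, num_features):
    """Map a list of words to a fixed-length term-frequency vector.

    Assumption: CRC32 is a stand-in hash for illustration only; it is not
    the hash function Spark's HashingTF uses.
    """
    vector = [0.0] * num_features
    for word in words:
        index = zlib.crc32(word.encode("utf-8")) % num_features
        vector[index] += 1.0
    return vector


# Two sample emails, shortened from spam.txt and ham.txt above.
emails = [
    "Get Viagra real cheap! Send money right away to ...",
    "Hi Mom, Apologies for being late about emailing ...",
]

vectors = [hashing_tf(email.split(" "), num_features=5) for email in emails]

# One vector per email: this is the number that count() reports on the RDD.
print(len(vectors))
# Every vector has exactly num_features entries, whatever the email length.
print([len(v) for v in vectors])
# The entries of each vector add up to the number of words in that email.
print([sum(v) for v in vectors])
```

So changing numFeatures to 9 or 13 does not "match" the data in any way; it only changes how many buckets the words are hashed into.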

I had the same problem with Spark 1.5.2 on my local cluster.
My program stopped on "count at DataValidators.scala:40".
I resolved it by running Spark with "spark-submit --master local".

I had a similar problem with Spark 1.5.2 on my local cluster. My program stopped on "count at DataValidators.scala:40". I was caching my training features. Removing the caching (simply not calling the cache function) resolved it. I'm not sure of the actual cause, though.

Related

I am getting a syntax error while computing average number of friends in apache spark

I recently started Frank Kane's course on taming big data with Apache Spark using Python.
On the line where I compute the average number of friends, I am getting a syntax error, and I cannot work out how to fix it. Please see the code below. FYI, I'm using Python 3. I have highlighted the line with the syntax error. Please help, as I am stuck here.
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local").setAppName("AverageAge")
sc = SparkContext(conf = conf)
def parseline(line):
    fields =line.split(',')
    friend_age= int(fields[2])
    friends_number= int(fields[3])
    return (friend_age,friends_number)
lines = sc.textFile("file:///Sparkcourse/SparkCourse/fakefriends.csv")
rdd=lines.map(parseline)
making_keys=rdd.mapByValues(lambda x:(x,1))
totalsByAge=making_keys.reduceByKeys(lambda x,y: (x[0]+y[0],x[1]+y[1])
**averages_by_keys= totalsByAge.mapValues(lambda x: x[0] / x[1])**(Syntax Error)
results=averageByKeys.collect()
for result in results:
    print result
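Look at the line above the highlighted one: the reduceByKeys(...) call is missing its closing parenthesis, so Python only reports the syntax error when it reaches the next line.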
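A few other things will bite once the parenthesis is fixed: the PySpark methods are called mapValues and reduceByKey (there is no mapByValues or reduceByKeys), the result is assigned to averages_by_keys but collected from averageByKeys, and print needs parentheses in Python 3. To check the logic without a Spark installation, here is the same (sum, count) aggregation sketched in plain Python. The sample rows are made up, and the column layout id,name,age,friends is an assumption based on the question's use of fields[2] and fields[3]:

```python
from collections import defaultdict


def parse_line(line):
    """Return (age, number_of_friends) from one CSV line.

    Assumption: columns are id,name,age,friends.
    """
    fields = line.split(",")
    return int(fields[2]), int(fields[3])


# Hypothetical sample rows standing in for fakefriends.csv.
lines = [
    "0,Will,33,385",
    "1,Jean,26,2",
    "2,Hugh,33,15",
]

# Plays the role of mapValues(lambda x: (x, 1)) followed by
# reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1])).
totals_by_age = defaultdict(lambda: (0, 0))
for age, friends in map(parse_line, lines):
    total, count = totals_by_age[age]
    totals_by_age[age] = (total + friends, count + 1)

# Plays the role of mapValues(lambda x: x[0] / x[1]).
averages_by_age = {
    age: total / count for age, (total, count) in totals_by_age.items()
}

for age, average in sorted(averages_by_age.items()):
    print(age, average)
```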

Incorrect evaluations on Random Forest in Pyspark

I am running a prediction using Logistic Regression and Random Forest on telecom churn data set.
Please find here the code snippet from my notebook:
data=spark.read.csv("D:\Shashank\CBA\Pyspark\Telecom_Churn_Data_SingTel.csv", header=True, inferSchema=True)
data.show(3)
This link is to show the kind of data i am dealing with on a high level
data=data.drop("State").drop("Area Code").drop("Phone Number")
from pyspark.ml.feature import StringIndexer, VectorAssembler
intlPlanIndex = StringIndexer(inputCol="International Plan", outputCol="International Plan Index")
voiceMailPlanIndex = StringIndexer(inputCol="Voice mail Plan", outputCol="Voice mail Plan Index")
churnIndex = StringIndexer(inputCol="Churn", outputCol="label")
othercols=["Account Length", "Num of Voice mail Messages","Total Day Minutes", "Total Day Calls", "Total day Charge","Total Eve Minutes","Total Eve Calls","Total Eve Charge","Total Night Minutes","Total Night Calls ","Total Night Charge","Total International Minutes","Total Intl Calls","Total Intl Charge","Number Customer Service calls "]
assembler = VectorAssembler(inputCols= ['International Plan Index'] + ['Voice mail Plan Index'] + othercols, outputCol="features")
(train, test) = data.randomSplit([0.8,0.2])
from pyspark.ml.classification import LogisticRegression
lrObj = LogisticRegression(labelCol='label', featuresCol='features')
from pyspark.ml.pipeline import Pipeline
pipeline = Pipeline(stages=[intlPlanIndex, voiceMailPlanIndex, churnIndex, assembler, lrObj])
lrModel = pipeline.fit(train)
prediction_train = lrModel.transform(train)
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
lr_Evaluator = MulticlassClassificationEvaluator()
lr_Evaluator.evaluate(prediction_train)
This image shows the result on evaluation using logistic Regression
I then repeat the same using a Random Forest classification model:
and I evaluate to 94.4%
My result is sort of like this:
Link to my Random Forest evaluation result
Everything looks OK up to this point.
But I was curious to see how things are actually being predicted, so I printed the values of my predictions using the code below:
selected = prediction_1.select("features", "Label", "Churn", "prediction")
for row in selected.collect():
print(row)
The result I get looks like the screenshot below:
Link to image that shows the 2 results printed out for manual analysis
I then copy both cells shown in the above link into a comparison tool to see if my predicted values are different. (I expect there to be some difference, since the evaluation for Random Forest turned out to be better.)
But the comparison in every tool showed that the predictions are the same. Yet the evaluation shows a difference: 83.6% with LogisticRegression and 94.4% with RandomForest.
Why is there no difference between the two sets of predictions generated by the two models when the final evaluation using MulticlassClassificationEvaluator gives me different scores?
You seem to be interested in accuracy, which the evaluator reports when you pass metricName="accuracy":
predictions = model.transform(test)
evaluator = MulticlassClassificationEvaluator(labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
For more info, refer to the official documentation.
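Note that MulticlassClassificationEvaluator defaults to metricName="f1", so calling evaluate() with no arguments, as in the question, returns a weighted F1 score rather than accuracy. Also, labelCol should name the label column your pipeline actually produces ("label" in the question, not "indexedLabel"). The two metrics can differ on the same predictions. Here is a rough plain-Python illustration; the labels and predictions are made up, and weighted_f1 is a hand-written helper rather than Spark's code:

```python
from collections import Counter


def accuracy(labels, predictions):
    """Fraction of rows where the prediction equals the label."""
    correct = sum(1 for l, p in zip(labels, predictions) if l == p)
    return correct / len(labels)


def weighted_f1(labels, predictions):
    """Per-class F1, averaged with weights equal to each class's share of labels."""
    label_counts = Counter(labels)
    total = 0.0
    for cls, n_true in label_counts.items():
        tp = sum(1 for l, p in zip(labels, predictions) if l == cls and p == cls)
        n_pred = sum(1 for p in predictions if p == cls)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        total += f1 * n_true / len(labels)
    return total


# Hypothetical labels and predictions for an imbalanced, churn-like dataset.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
predictions = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

print(accuracy(labels, predictions))
print(weighted_f1(labels, predictions))
```

When comparing two models, make sure both are evaluated with the same metricName.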
This question is no longer relevant, since I am now able to see the difference in the predictions, which is in line with the accuracy reported for each model.
The question came up because the data I copied from my Jupyter notebook was incomplete.
Thanks, and I appreciate your time.

Pyspark - Transfer control out of Spark Session (sc)

This is a follow-up question to
Pyspark filter operation on Dstream
How does one design the job to keep a count of how many error/warning messages have come through in, say, a day or an hour?
What I have tried:
from __future__ import print_function
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
def counts():
    counter += 1
    print(counter.value)

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: network_wordcount.py <hostname> <port>", file=sys.stderr)
        exit(-1)
    sc = SparkContext(appName="PythonStreamingNetworkWordCount")
    ssc = StreamingContext(sc, 5)
    counter = sc.accumulator(0)
    lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))
    errors = lines.filter(lambda l: "error" in l.lower())
    errors.foreachRDD(lambda e : e.foreach(counts))
    errors.pprint()
    ssc.start()
    ssc.awaitTermination()
This, however, has multiple issues. To start with, print doesn't work (it does not output to stdout; from what I have read, the best I can use here is logging). Can I save the output of that function to a text file and tail that file instead?
I am not sure why the program just exits; there is no error/dump anywhere to look into further (Spark 1.6.2).
How does one preserve state? What I am trying to do is aggregate logs by server and severity. Another use case is to count how many transactions were processed by looking for certain keywords.
Pseudo Code for what I want to try:
foreachRDD(Dstream):
if RDD.contains("keyword1 | keyword2 | keyword3"):
dictionary[keyword] = dictionary.get(keyword,0) + 1 //add the keyword if not present and increase the counter
print dictionary //or send this dictionary to else where
The last part of sending or printing dictionary requires switching out of spark streaming context - Can someone explain the concept please?
print doesn't work
I would recommend reading the design patterns section of the Spark documentation. I think that roughly what you want is something like this:
def _process(iter):
    for item in iter:
        print(item)
lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))
errors = lines.filter(lambda l: "error" in l.lower())
errors.foreachRDD(lambda e : e.foreachPartition(_process))
This will get your print call to work (though it is worth noting that the print will execute on the workers and not on the driver, so if you're running this code on a cluster you will only see the output in the worker logs).
However, it won't solve your second problem:
How does one preserve state?
For this, take a look at updateStateByKey and the related example.
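updateStateByKey takes an update function that receives the new values for a key in the current batch plus the previous state (None for a key seen for the first time) and returns the new state; it also requires checkpointing to be enabled on the StreamingContext. To give a feel for the contract, here is a plain-Python sketch. run_batches is a hypothetical stand-in for the streaming loop, and the log lines are made up:

```python
def update_count(new_values, running_count):
    """Update function in the shape updateStateByKey expects.

    new_values: list of values seen for a key in the current batch.
    running_count: state from earlier batches, or None for a new key.
    """
    return sum(new_values) + (running_count or 0)


def run_batches(batches, keywords=("error", "warning")):
    """Plain-Python stand-in for the streaming loop (illustration only).

    Simplification: only keys that appear in a batch are updated here,
    which gives the same totals for simple counting.
    """
    state = {}
    for batch in batches:
        # Group a 1 under every keyword found in each line of the batch.
        grouped = {}
        for line in batch:
            lowered = line.lower()
            for keyword in keywords:
                if keyword in lowered:
                    grouped.setdefault(keyword, []).append(1)
        # Fold the batch into the running state, key by key.
        for key, values in grouped.items():
            state[key] = update_count(values, state.get(key))
    return state


# Hypothetical log lines arriving in two micro-batches.
batches = [
    ["ERROR disk full", "INFO started", "WARNING slow response"],
    ["error timeout", "Error retry failed"],
]

print(run_batches(batches))
```

In the real job you would map each matching line to a (keyword, 1) pair and pass a function like update_count to updateStateByKey; the resulting DStream can then be printed or saved each batch.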

Using Custom Hadoop input format for processing binary file in Spark

I have developed a Hadoop-based solution that processes a binary file using the classic Hadoop MapReduce technique. The binary file is about 10 GB and is divided into 73 HDFS blocks, and the business logic, written as a map process, operates on each of these 73 blocks. We have developed a custom InputFormat and a custom RecordReader in Hadoop that return a key (IntWritable) and a value (BytesWritable) to the map function. The value is nothing but the contents of an HDFS block (binary data). The business logic knows how to read this data.
Now I would like to port this code to Spark. I am new to Spark and have run simple examples (wordcount, the pi example). However, I could not find a straightforward example of processing binary files in Spark. I see two possible solutions for this use case. The first is to avoid the custom input format and record reader: find a method in Spark that creates an RDD for those HDFS blocks, then use a map-like method that feeds the HDFS block content to the business logic. If that is not possible, I would like to reuse the custom input format and custom reader through something like the Hadoop API, HadoopRDD, etc. My problem: I do not know whether the first approach is possible. If it is, can anyone please provide some pointers to examples? I tried the second approach but was highly unsuccessful. Here is the code snippet I used:
package org {
  object Driver {
    def myFunc(key : IntWritable, content : BytesWritable):Int = {
      println(key.get())
      println(content.getSize())
      return 1
    }
    def main(args: Array[String]) {
      // create a spark context
      val conf = new SparkConf().setAppName("Dummy").setMaster("spark://<host>:7077")
      val sc = new SparkContext(conf)
      println(sc)
      val rd = sc.newAPIHadoopFile("hdfs:///user/hadoop/myBin.dat", classOf[RandomAccessInputFormat], classOf[IntWritable], classOf[BytesWritable])
      val count = rd.map (x => myFunc(x._1, x._2)).reduce(_+_)
      println("The count is *****************************"+count)
    }
  }
}
Please note that the print statement in the main method prints 73, which is the number of blocks, whereas the print statements inside the map function print 0.
Can someone tell me what I am doing wrong here? I think I am not using the API the right way, but I failed to find any documentation or usage examples.
A couple of problems at a glance. You define myFunc but call func. Your myFunc has no return type, so you can't call collect(). If your myFunc truly doesn't have a return value, you can do foreach instead of map.
collect() pulls the data in an RDD to the driver to allow you to do stuff with it locally (on the driver).
I have made some progress in this issue. I am now using the below function which does the job
var hRDD = new NewHadoopRDD(sc, classOf[RandomAccessInputFormat],
classOf[IntWritable],
classOf[BytesWritable],
job.getConfiguration()
)
val count = hRDD.mapPartitionsWithInputSplit{ (split, iter) => myfuncPart(split, iter)}.collect()
However, I ended up with another error, the details of which I have posted here:
Issue in accessing HDFS file inside spark map function
15/10/30 11:11:39 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 40.221.94.235): java.io.IOException: No FileSystem for scheme: spark
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)

Finding mean and standard deviation of a large dataset

I have about 1500 files on S3 (each file looks like this:)
Format :
UserId \t ItemId:Score,ItemId:Score,ItemId:Score \n
UserId \t ItemId:Score,ItemId:Score,ItemId:Score \n
I read the file as:
import scala.io.Source
val FileRead = Source.fromFile("/home/home/testdataFile1").mkString
Here is an example of what I get:
1152 401368:1.006,401207:1.03
1184 401230:1.119,40049:1.11,40029:1.31
How do I compute the average and standard deviation of the variable 'Score'?
While it's not explicit in the question, Apache Spark is a good tool for doing this in a distributed way. I assume you have set up a Spark cluster. Read the files into an RDD:
val lines: RDD[String] = sc.textFile("s3n://bucket/dir/*")
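Pick out every "score" on each line (taking only the text after the last ":" would keep just one score per line):
val scores: RDD[Double] = lines.flatMap(_.split("\t").last.split(",")).map(_.split(":").last.toDouble).cache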
.cache saves it in memory. This avoids re-reading the files all the time, but can use a lot of RAM. Remove it if you want to trade speed for RAM.
Calculate the metrics:
val count = scores.count
val mean = scores.sum / count
val devs = scores.map(score => (score - mean) * (score - mean))
val stddev = Math.sqrt(devs.sum / (count - 1))
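If you want to sanity-check the numbers on a small sample first, here is the same two-pass calculation sketched in plain Python on the two sample lines from the question. It assumes the user ID and the item list are separated by whitespace (a tab in the described format), and it collects every score on a line:

```python
import math


def parse_scores(line):
    """Extract every score from 'UserId<TAB>ItemId:Score,ItemId:Score,...'.

    Assumption: the user ID and the item list are separated by whitespace.
    """
    _user_id, items = line.split(None, 1)
    return [float(item.split(":")[1]) for item in items.strip().split(",")]


# The two sample lines from the question.
lines = [
    "1152\t401368:1.006,401207:1.03",
    "1184\t401230:1.119,40049:1.11,40029:1.31",
]

scores = [score for line in lines for score in parse_scores(line)]

count = len(scores)
mean = sum(scores) / count
# Sample standard deviation: divide the squared deviations by (count - 1).
stddev = math.sqrt(sum((s - mean) ** 2 for s in scores) / (count - 1))

print(count, mean, stddev)
```

Dividing by count instead of (count - 1) gives the population standard deviation; with millions of scores the difference is negligible.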
This question is not new, so maybe I can update the answers.
There are stddev functions (stddev, stddev_pop, and stddev_samp) in Spark SQL (import org.apache.spark.sql.functions) since Spark version 1.6.0.
I use Apache Commons Math for this stuff (http://commons.apache.org/proper/commons-math/userguide/stat.html), albeit from Java. You can stream values through the SummaryStatistics class, so you aren't limited by the size of memory. Scala-to-Java interop should allow you to do this, but I haven't tried it. You should be able to iterate through the file line by line and stream the values through an instance of SummaryStatistics. How hard could it be in Scala?
Lookie here, someone is off and Scala-izing the whole thing: https://code.google.com/p/scalalab/wiki/ApacheCommonMathsLibraryInScalaLab
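To show what "streaming through" buys you, here is a small plain-Python sketch of a running mean and sample standard deviation using Welford's algorithm. It illustrates the idea only and is not the Commons Math SummaryStatistics implementation; RunningStats is a made-up name:

```python
import math


class RunningStats:
    """Streaming mean and sample standard deviation (Welford's algorithm)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, value):
        """Fold one value into the running statistics."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    def stddev(self):
        """Sample standard deviation; 0.0 until at least two values are seen."""
        if self.count < 2:
            return 0.0
        return math.sqrt(self._m2 / (self.count - 1))


stats = RunningStats()
# The five scores from the question's sample lines, fed one at a time.
for score in [1.006, 1.03, 1.119, 1.11, 1.31]:
    stats.add(score)

print(stats.count, stats.mean, stats.stddev())
```

Only three numbers are kept in memory no matter how many scores are fed in, so the 1500 files can be processed one line at a time.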
I don't think that storage space should be an issue, so I would try putting all of the values into an array of doubles, adding them up, and dividing by the number of elements in the array to get the mean. Then add up the squares of the differences between each value and the mean, and divide that by the number of elements. Then take the square root.