Finding mean and standard deviation of a large dataset - scala

I have about 1500 files on S3. Each file looks like this:
Format:
UserId \t ItemId:Score,ItemId:Score,ItemId:Score \n
UserId \t ItemId:Score,ItemId:Score,ItemId:Score \n
I read the file as:
import scala.io.Source
val FileRead = Source.fromFile("/home/home/testdataFile1").mkString
Here is an example of what I get:
1152 401368:1.006,401207:1.03
1184 401230:1.119,40049:1.11,40029:1.31
How do I compute the average and standard deviation of the variable 'Score'?

While it's not explicit in the question, Apache Spark is a good tool for doing this in a distributed way. I assume you have set up a Spark cluster. Read the files into an RDD:
val lines: RDD[String] = sc.textFile("s3n://bucket/dir/*")
Pick out every "Score" value (each line can hold several ItemId:Score pairs):
val scores: RDD[Double] = lines
  .flatMap(_.split("\t").last.split(","))  // all ItemId:Score pairs on the line
  .map(_.split(":").last.toDouble)
  .cache
.cache saves it in memory. This avoids re-reading the files all the time, but can use a lot of RAM. Remove it if you want to trade speed for RAM.
Calculate the metrics:
val count = scores.count
val mean = scores.sum / count
val devs = scores.map(score => (score - mean) * (score - mean))
val stddev = Math.sqrt(devs.sum / (count - 1))
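As a side note (not part of the original answer): an RDD[Double] also gets one-pass summary statistics through Spark's DoubleRDDFunctions, which yields the same numbers without the manual sums:
val stats = scores.stats()       // org.apache.spark.util.StatCounter
val mean2 = stats.mean
val stddev2 = stats.sampleStdev  // uses the (count - 1) denominator, like above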

This question is not new, so maybe I can update the answers.
There are stddev functions (stddev, stddev_pop, and stddev_samp) in Spark SQL (import org.apache.spark.sql.functions._) since Spark version 1.6.0.
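For example, a quick sketch assuming the scores have already been parsed into a DataFrame column named "score" (the DataFrame and column names here are placeholders):
import org.apache.spark.sql.functions.{avg, stddev_samp, stddev_pop}
scoresDF.agg(avg("score"), stddev_samp("score"), stddev_pop("score")).show()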

I use Apache Commons Math for this stuff (http://commons.apache.org/proper/commons-math/userguide/stat.html), albeit from Java. You can stream data through the SummaryStatistics class, so you aren't limited by the size of memory. Scala-to-Java interop should allow you to do this, but I haven't tried it. You should be able to work through the file line by line and stream the values through an instance of SummaryStatistics. How hard could it be in Scala?
Lookie here, someone is off and Scala-izing the whole thing: https://code.google.com/p/scalalab/wiki/ApacheCommonMathsLibraryInScalaLab
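A rough sketch of that streaming approach in Scala, assuming Commons Math 3 is on the classpath and the tab-separated format from the question:
import org.apache.commons.math3.stat.descriptive.SummaryStatistics
import scala.io.Source

val stats = new SummaryStatistics()
for {
  line  <- Source.fromFile("/home/home/testdataFile1").getLines()
  entry <- line.split("\t").last.split(",")
} stats.addValue(entry.split(":").last.toDouble)

println(stats.getMean)
println(stats.getStandardDeviation)  // sample standard deviation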

I don't think that storage space should be an issue here, so I would try putting all of the values into an array of doubles, summing them up, and using that sum and the number of elements in the array to calculate the mean. Then sum up the squared differences between each value and the mean, divide by the number of elements, and take the square root.
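That idea written out as a small helper (population variance, i.e. dividing by n; a minimal sketch for data that fits in memory):
def meanAndStddev(values: Array[Double]): (Double, Double) = {
  val mean = values.sum / values.length
  val variance = values.map(v => (v - mean) * (v - mean)).sum / values.length
  (mean, math.sqrt(variance))
}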

Related

Count Triangles in Scala - Spark

I am trying to get into data analytics using Spark with Scala. My question is: how do I get the triangles in a graph? And I don't mean the triangle count that comes with GraphX, but the actual nodes that make up each triangle.
Suppose we have a graph file. I was able to calculate the triangles in Scala, but the same technique does not apply in Spark, since I have to use RDD operations.
The data I give to the function is a list of adjacencies consisting of the src and the list of destinations of that source, e.g. Adj(5, List(1,2,3)), Adj(4, List(9,8,7)), ...
My Scala version is this:
case class Adj(src: Int, dst: List[Int])

def triangles(paths: List[Adj]): Unit =
  for {
    i <- paths
    j <- paths
    k <- paths
    if i.src != j.src && i.src != k.src && j.src != k.src                       // three distinct nodes
    if i.dst.contains(j.src) && j.dst.contains(k.src) && k.dst.contains(i.src)  // edges close a cycle
  } println((i.src, j.src, k.src))                                              // 3 nodes that make a triangle
And the output would be something like:
(1,2,3)
(4,5,6)
(2,5,6)
In conclusion, I want the same output but in a Spark environment. In addition, I am looking for a more efficient way to hold the information about adjacencies, like key mapping and then reducing by key or something. As the Spark environment needs quite a different way to approach each problem (big data operations), I would be grateful if you could explain the way of thinking and give me a small briefing about the functions you used.
Thank you.
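One possible starting point (a rough sketch, not from the original thread): the same triple condition can be expressed with plain RDD operations via cartesian. This keeps the logic identical to the List version but stays O(n^3), so for real graphs the next step would be a key-based join on the adjacency lists instead.
val adj = sc.parallelize(Seq(Adj(5, List(1, 2, 3)), Adj(4, List(9, 8, 7))))  // RDD[Adj]

val triangleNodes = adj.cartesian(adj).cartesian(adj)
  .map { case ((i, j), k) => (i, j, k) }
  .filter { case (i, j, k) =>
    i.src != j.src && i.src != k.src && j.src != k.src &&
    i.dst.contains(j.src) && j.dst.contains(k.src) && k.dst.contains(i.src)
  }
  .map { case (i, j, k) => (i.src, j.src, k.src) }

triangleNodes.collect().foreach(println)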

Use Scala to manipulate an RDD: get word counts from Shakespeare.txt against a verb dict

The verb_dict.txt contains entries like this:
abash,abash,abashed,abashed,abashes,abashing
abate,abate,abated,abated,abates,abating
abide,abide,abode,abode,abides,abiding
......
The Shakespeare text is just a 5 MB article.
Every verb has 6 tenses; the problem is that whenever a different tense is met in the text, it has to be counted towards the original tense.
Use learned RDD operations to merge the verb pairs that come from the same verb, e.g. (work, 100), (works, 50), (working, 150) ----> (work, 300).
My idea is to group things like [(verb_in_different_tense, (original_tense, count))]. I don't know if that works, just give it a thought.
My code shows below:
val shakes = sc.textFile("shakespeare.txt")                          // create an RDD from the text file
val shakes1 = shakes.filter(l => l.length > 0)                       // remove empty lines
val shakes2 = shakes1.map(x => x.replaceAll("""[\p{Punct}]""", ""))  // remove punctuation
val shakes3 = shakes2.flatMap(line => line.split(" "))               // split each line into words
val shakes4 = shakes3.filter(_.nonEmpty)                             // drop empty tokens
val shakes5 = shakes4.filter(w => w == w.toLowerCase())              // keep only words that are already lower case
This is done for the Shakespeare.txt file.
Then I need to use the learned RDD operations to merge the verb pairs that come from the same verb, e.g. (work, 100), (works, 50), (working, 150) ----> (work, 300).
Can anyone give me specific steps to deal with this question, please?
There are other questions related to this topic, and they are related to NLP tasks. What you need here is to extract the lemmas of those words, put them in another column, and then group by that column.
Take a look at https://en.wikipedia.org/wiki/Lemma_(morphology)
You can use the Stanford NLP library to apply the lemmatizer to your words (tokens). Here is an example of how to use it from Java, which you can also use without problems from Scala: https://stanfordnlp.github.io/stanfordnlp/lemma.html
In this repo you can see how to use Stanford CoreNLP in Spark: https://github.com/databricks/spark-corenlp.
Or you can use the annotators from the SparkNLP project: https://nlp.johnsnowlabs.com/docs/en/annotators
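If you would rather stay with the verb_dict.txt idea from the question instead of a full lemmatizer, a minimal sketch (assuming the shakes5 RDD built above and the comma-separated dictionary format shown) is to broadcast a tense-to-base-form map and reduce by key:
val tenseToBase = sc.textFile("verb_dict.txt")
  .flatMap { line =>
    val forms = line.split(",").map(_.trim)
    forms.map(form => (form, forms.head))  // every tense points to the base verb
  }
  .collectAsMap()
val dict = sc.broadcast(tenseToBase)

val verbCounts = shakes5
  .flatMap(w => dict.value.get(w).map(base => (base, 1)))  // keep only known verb forms
  .reduceByKey(_ + _)                                      // e.g. (work, 300)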

Apache Spark: multiple outputs in one map task

TL;DR: I have a large file that I iterate over three times to get three different sets of counts out. Is there a way to get three maps out in one pass over the data?
Some more detail:
I'm trying to compute PMI between words and features that are listed in a large file. My pipeline looks something like this:
val wordFeatureCounts = sc.textFile(inputFile).flatMap(line => {
  val word = getWordFromLine(line)
  val features = getFeaturesFromLine(line)
  for (feature <- features) yield ((word, feature), 1)
})
And then I repeat this to get word counts and feature counts separately:
val wordCounts = sc.textFile(inputFile).flatMap(line => {
  val word = getWordFromLine(line)
  val features = getFeaturesFromLine(line)
  for (feature <- features) yield (word, 1)
})
val featureCounts = sc.textFile(inputFile).flatMap(line => {
  val word = getWordFromLine(line)
  val features = getFeaturesFromLine(line)
  for (feature <- features) yield (feature, 1)
})
(I realize I could just iterate over wordFeatureCounts to get the wordCounts and featureCounts, but that doesn't answer my question, and looking at running times in practice I'm not sure it's actually faster to do it that way. Also note that there are some reduceByKey operations and other stuff that I do with this after the counts are computed that aren't shown, as they aren't relevant to the question.)
What I would really like to do is something like this:
val (wordFeatureCounts, wordCounts, featureCounts) = sc.textFile(inputFile).flatMap(line => {
  val word = getWordFromLine(line)
  val features = getFeaturesFromLine(line)
  val wfCounts = for (feature <- features) yield ((word, feature), 1)
  val wCounts = for (feature <- features) yield (word, 1)
  val fCounts = for (feature <- features) yield (feature, 1)
  ??.setOutput1(wfCounts)
  ??.setOutput2(wCounts)
  ??.setOutput3(fCounts)
})
Is there any way to do this with spark? In looking for how to do this, I've seen questions about multiple outputs when you're saving the results to disk (not helpful), and I've seen a bit about accumulators (which don't look like what I need), but that's it.
Also note that I can't just yield all of these results in one big list, because I need three separate maps out. If there's an efficient way to split a combined RDD after the fact, that could work, but the only way I can think of to do this would end up iterating over the data four times, instead of the three I currently do (once to create the combined map, then three times to filter it into the maps I actually want).
It is not possible to split an RDD into multiple RDDs. This is understandable if you think about how this would work under the hood. Say you split RDD x = sc.textFile("x") into a = x.filter(_.head == 'A') and b = x.filter(_.head == 'B'). Nothing happens so far, because RDDs are lazy. But now you print a.count. So Spark opens the file, and iterates through the lines. If the line starts with A it counts it. But what do we do with lines starting with B? Will there be a call to b.count in the future? Or maybe it will be b.saveAsTextFile("b") and we should be writing these lines out somewhere? We cannot know at this point. Splitting an RDD is just not possible with the Spark API.
But nothing stops you from implementing something if you know what you want. If you want to get both a.count and b.count you can map lines starting with A into (1, 0) and lines starting with B into (0, 1) and then sum up the tuples elementwise in a reduce. If you want to save lines with B into a file while counting lines with A, you could use an accumulator in a map before filter(_.head == 'B').saveAsTextFile.
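For example, a quick sketch of that first trick, assuming x: RDD[String] as above:
val (aCount, bCount) = x
  .map(line => if (line.head == 'A') (1L, 0L) else (0L, 1L))
  .reduce { case ((a1, b1), (a2, b2)) => (a1 + a2, b1 + b2) }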
The only generic solution is to store the intermediate data somewhere. One option is to just cache the input (x.cache). Another is to write the contents into separate directories in a single pass, then read them back as separate RDDs. (See Write to multiple outputs by key Spark - one Spark job.) We do this in production and it works great.
This is one of the major disadvantages of Spark over traditional map-reduce programming. An RDD/DF/DS can be transformed into another RDD/DF/DS, but you cannot map an RDD into multiple outputs. To avoid recomputation you need to cache the results into some intermediate RDD and then run multiple map operations to generate multiple outputs. The caching solution will work if you are dealing with reasonably sized data. But if the data is large compared to the memory available, the intermediate outputs will be spilled to disk and the advantage of caching will not be that great. Check out the discussion here - https://issues.apache.org/jira/browse/SPARK-1476. This is an old Jira but still relevant. Check out the comment by Mridul Muralidharan.
Spark needs to provide a solution where a map operation can produce multiple outputs without the need to cache. It may not be elegant from a functional programming perspective, but I would argue it would be a good compromise to achieve better performance.
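As a concrete illustration of the caching workaround (a sketch reusing the helpers from the question, not the only way to structure it):
val pairs = sc.textFile(inputFile).flatMap { line =>
  val word = getWordFromLine(line)
  val features = getFeaturesFromLine(line)
  for (feature <- features) yield (word, feature)
}.cache()  // one pass over the input, kept around for the three aggregations

val wordFeatureCounts = pairs.map { case (w, f) => ((w, f), 1) }.reduceByKey(_ + _)
val wordCounts        = pairs.map { case (w, _) => (w, 1) }.reduceByKey(_ + _)
val featureCounts     = pairs.map { case (_, f) => (f, 1) }.reduceByKey(_ + _)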
I was also quite disappointed to see that this is a hard limitation of Spark over classic MapReduce. I ended up working around it by using multiple successive maps in which I filter out the data I need.
Here's a schematic toy example that performs different calculations on the numbers 0 to 49 and writes both to different output files.
from functools import partial
import os

from pyspark import SparkContext


# Generate mock data
def generate_data():
    for i in range(50):
        yield 'output_square', i * i
        yield 'output_cube', i * i * i


# Map function to siphon data to a specific output
def save_partition_to_output(part_index, part, filter_key, output_dir):
    # Initialise output file handle lazily to avoid creating empty output files
    file = None
    try:
        for key, data in part:
            if key != filter_key:
                # Pass through non-matching rows and skip
                yield key, data
                continue
            if file is None:
                file = open(os.path.join(output_dir, '{}-part{:05d}.txt'.format(filter_key, part_index)), 'w')
            # Consume data
            file.write(str(data) + '\n')
            yield from []
    finally:
        if file is not None:
            file.close()


def main():
    sc = SparkContext()
    rdd = sc.parallelize(generate_data())

    # Repartition to number of outputs
    # (not strictly required, but reduces number of output files).
    #
    # To split partitions further, use repartition() instead or
    # partition by another key (not the output name).
    rdd = rdd.partitionBy(numPartitions=2)

    # Map and filter to first output.
    rdd = rdd.mapPartitionsWithIndex(partial(save_partition_to_output, filter_key='output_square', output_dir='.'))

    # Map and filter to second output.
    rdd = rdd.mapPartitionsWithIndex(partial(save_partition_to_output, filter_key='output_cube', output_dir='.'))

    # Trigger execution.
    rdd.count()


if __name__ == '__main__':
    main()
This will create two output files output_square-part00000.txt and output_cube-part00000.txt with the desired output splits.

Spark/Scala read hadoop file

In a pig script I saved a table using PigStorage('|').
I have in the corresponding hadoop folder files like
part-r-00000
etc.
What is the best way to load it in Spark/Scala? In this table I have 3 fields: Int, String, Float
I tried:
val text = sc.hadoopFile("file", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], sc.defaultMinPartitions)
But then I would need somehow to split each line. Is there a better way to do it?
If I were coding in Python I would create a DataFrame indexed by the first field, whose columns are the values found in the string field and whose coefficients are the float values. But I need to use Scala to use the PCA module, and Spark's DataFrames don't seem that close to Python's.
Thanks for the insight
PigStorage creates a text file without any schema information, so you need to do that work yourself, something like:
val csv = sc.textFile("file") // or the directory where the part files are
val data = csv.map { line =>
  val vals = line.split("\\|")  // '|' must be escaped because split takes a regex
  (vals(0).toInt, vals(1), vals(2).toDouble)
}
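If a DataFrame is preferable (for example for the PCA step mentioned in the question), the tuples can be converted directly; the column names below are just placeholders:
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val df = data.toDF("id", "label", "value")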

MLlib classification example stops in stage 1

EDIT:
I tried using the text from Gabriel's answer and got spam features: 9 and ham features: 13. I tried changing the HashingTF to numFeatures = 9, then 13, then created one for each. Then the program stopped at "count at DataValidators.scala:38" just like before.
Completed Jobs(4)
count at 21 (spamFeatures)
count at 23 (hamFeatures)
count at 28 (trainingData.count())
first at GeneralizedLinearAlgorithm at 34 (val model = lrLearner.run(trainingData)
1) Why are the features being counted by lines, when in the code each email is split by spaces (" ")?
2) Two things I see that differ between my code and Gabriel's code:
a) I don't have anything about logger, but that shouldn't be an issue...
b) My files are located on hdfs(hdfs://ip-abc-de-.compute.internal:8020/user/ec2-user/spam.txt), once again shouldn't be an issue, but not sure if there's something i'm missing...
3) How long should I let it run for? I've let it run for at least 10 minutes with :local[2]..
I'm guessing at this point it might be some sort of issue with my Spark/MLlib setup? Is there an even simpler program I can run to see if there is a setup issue with MLlib? I have been able to run other Spark Streaming/SQL jobs before...
Thanks!
[reposted from spark community]
Hello Everyone,
I am trying to run this MLlib example from Learning Spark:
https://github.com/databricks/learning-spark/blob/master/src/main/scala/com/oreilly/learningsparkexamples/scala/MLlib.scala#L48
Things I'm doing differently:
1) Instead of their spam.txt and normal.txt I have text files with 200 words... nothing huge at all, just plain text with periods, commas, etc.
3) I've used numFeatures = 200, 1000 and 10,000
Error: I keep getting stuck when I try to run the model (based on the details from the web UI below):
val model = new LogisticRegressionWithSGD().run(trainingData)
It will freeze on something like this:
[Stage 1:==============> (1 + 0) / 4]
Some details from webui:
org.apache.spark.rdd.RDD.count(RDD.scala:910)
org.apache.spark.mllib.util.DataValidators$$anonfun$1.apply(DataValidators.scala:38)
org.apache.spark.mllib.util.DataValidators$$anonfun$1.apply(DataValidators.scala:37)
org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm$$anonfun$run$2.apply(GeneralizedLinearAlgorithm.scala:161)
org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm$$anonfun$run$2.apply(GeneralizedLinearAlgorithm.scala:161)
scala.collection.LinearSeqOptimized$class.forall(LinearSeqOptimized.scala:70)
scala.collection.immutable.List.forall(List.scala:84)
org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:161)
org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:146)
$line21.$read$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
$line21.$read$$iwC$$iwC$$iwC.<init>(<console>:38)
$line21.$read$$iwC$$iwC.<init>(<console>:40)
$line21.$read$$iwC.<init>(<console>:42)
$line21.$read.<init>(<console>:44)
$line21.$read$.<init>(<console>:48)
$line21.$read$.<clinit>(<console>)
$line21.$eval$.<init>(<console>:7)
$line21.$eval$.<clinit>(<console>)
$line21.$eval.$print(<console>)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
I am not sure what I am doing wrong...any help is much appreciated, thank you!
Thanks for this question, I wasn't aware of these examples, so I downloaded them and tested them. What I see is that the git repository contains files with a lot of HTML code. It works, but you will end up adding 100 features, which is possibly why you're not getting consistent results, since your own files contain far fewer features. What I did to test that this works without HTML code was to remove the HTML from spam.txt and ham.txt as follows:
ham.txt=
Dear Spark Learner, Thanks so much for attending the Spark Summit 2014!
Check out videos of talks from the summit at ...
Hi Mom, Apologies for being late about emailing and forgetting to send you
the package. I hope you and bro have been ...
Wow, hey Fred, just heard about the Spark petabyte sort. I think we need to
take time to try it out immediately ...
Hi Spark user list, This is my first question to this list, so thanks in
advance for your help! I tried running ...
Thanks Tom for your email. I need to refer you to Alice for this one. I
haven't yet figured out that part either ...
Good job yesterday! I was attending your talk, and really enjoyed it. I
want to try out GraphX ...
Summit demo got whoops from audience! Had to let you know. --Joe
spam.txt=
Dear sir, I am a Prince in a far kingdom you have not heard of. I want to
send you money via wire transfer so please ...
Get Viagra real cheap! Send money right away to ...
Oh my gosh you can be really strong too with these drugs found in the
rainforest. Get them cheap right now ...
YOUR COMPUTER HAS BEEN INFECTED! YOU MUST RESET YOUR PASSWORD. Reply to
this email with your password and SSN ...
THIS IS NOT A SCAM! Send money and get access to awesome stuff really
cheap and never have to ...
Then use the modified MLlib.scala below, and make sure you have log4j referenced in your project to redirect output to a file instead of the console. You basically need to run it twice: in the first run, watch the output that prints the number of features in spam and ham, then set the correct number of features (instead of 100); I used 5.
package com.oreilly.learningsparkexamples.scala

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.log4j.Logger

object MLlib {

  private val logger = Logger.getLogger("MLlib")

  def main(args: Array[String]) {
    logger.info("This is spark in Windows")
    val conf = new SparkConf().setAppName(s"Book example: Scala").setMaster("local[2]").set("spark.executor.memory", "1g")
    //val conf = new SparkConf().setAppName(s"Book example: Scala")
    val sc = new SparkContext(conf)

    // Load 2 types of emails from text files: spam and ham (non-spam).
    // Each line has text from one email.
    val spam = sc.textFile("spam.txt")
    val ham = sc.textFile("ham.txt")

    // Create a HashingTF instance to map email text to vectors of 5 (not 100) features.
    val tf = new HashingTF(numFeatures = 5)

    // Each email is split into words, and each word is mapped to one feature.
    val spamFeatures = spam.map(email => tf.transform(email.split(" ")))
    println("features in spam " + spamFeatures.count())
    val hamFeatures = ham.map(email => tf.transform(email.split(" ")))
    println("features in ham " + hamFeatures.count())

    // Create LabeledPoint datasets for positive (spam) and negative (ham) examples.
    val positiveExamples = spamFeatures.map(features => LabeledPoint(1, features))
    val negativeExamples = hamFeatures.map(features => LabeledPoint(0, features))
    val trainingData = positiveExamples ++ negativeExamples
    trainingData.cache() // Cache data since Logistic Regression is an iterative algorithm.

    // Create a Logistic Regression learner which uses the SGD optimizer.
    val lrLearner = new LogisticRegressionWithSGD()
    // Run the actual learning algorithm on the training data.
    val model = lrLearner.run(trainingData)

    // Test on a positive example (spam) and a negative one (ham).
    // First apply the same HashingTF feature transformation used on the training data.
    val ex1 = "O M G GET cheap stuff by sending money to ..."
    val ex2 = "Hi Dad, I started studying Spark the other ..."
    val posTestExample = tf.transform(ex1.split(" "))
    val negTestExample = tf.transform(ex2.split(" "))

    // Now use the learned model to predict spam/ham for new emails.
    println(s"Prediction for positive test example: ${ex1} : ${model.predict(posTestExample)}")
    println(s"Prediction for negative test example: ${ex2} : ${model.predict(negTestExample)}")

    sc.stop()
  }
}
When I run this in the output I'm getting:
features in spam 5
features in ham 7
Prediction for positive test example: O M G GET cheap stuff by sending money
to ... : 1.0
Prediction for negative test example: Hi Dad, I started studying Spark the
other ... : 0.0
I had the same problem with Spark 1.5.2 on my local cluster.
My program stopped on "count at DataValidators.scala:40".
Resolved by running Spark with "spark-submit --master local".
I had a similar problem with Spark 1.5.2 on my local cluster. My program stopped on "count at DataValidators.scala:40". I was caching my training features; I removed the caching (just did not call the cache function) and it was resolved. I'm not sure of the actual cause though.