I am getting a syntax error while computing the average number of friends in Apache Spark (PySpark)

Recently I started Frank Kane's course, Taming Big Data with Apache Spark and Python.
On the line where I compute the average number of friends, I am getting a syntax error, and I cannot figure out how to fix it. Please refer to the code below. FYI, I am using Python 3. I have highlighted the line with the syntax error. Please help, as I am stuck here.
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local").setAppName("AverageAge")
sc = SparkContext(conf = conf)
def parseline(line):
    fields = line.split(',')
    friend_age = int(fields[2])
    friends_number = int(fields[3])
    return (friend_age, friends_number)
lines = sc.textFile("file:///Sparkcourse/SparkCourse/fakefriends.csv")
rdd=lines.map(parseline)
making_keys=rdd.mapByValues(lambda x:(x,1))
totalsByAge=making_keys.reduceByKeys(lambda x,y: (x[0]+y[0],x[1]+y[1])
**averages_by_keys= totalsByAge.mapValues(lambda x: x[0] / x[1])**(Syntax Error)
results=averageByKeys.collect()
for result in results:
    print result

Look at the line above the highlighted one: the reduceByKeys(...) call is missing a closing parenthesis. (Python only notices the unbalanced parentheses when it reaches the next line, which is why the error is reported there.)
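For completeness, here is one way the whole snippet could look once it runs; treat it as a sketch rather than the course's official solution. Besides the missing parenthesis, note that the RDD methods are spelled mapValues and reduceByKey (not mapByValues/reduceByKeys), the variable passed to collect() has to match the one defined above it, and Python 3 needs print() as a function.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("AverageAge")
sc = SparkContext(conf=conf)

def parseline(line):
    fields = line.split(',')
    friend_age = int(fields[2])
    friends_number = int(fields[3])
    return (friend_age, friends_number)

lines = sc.textFile("file:///Sparkcourse/SparkCourse/fakefriends.csv")
rdd = lines.map(parseline)
# (age, numFriends) -> (age, (numFriends, 1)) so both the total and the count can be summed per key
making_keys = rdd.mapValues(lambda x: (x, 1))
totals_by_age = making_keys.reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))
averages_by_age = totals_by_age.mapValues(lambda x: x[0] / x[1])
results = averages_by_age.collect()
for result in results:
    print(result)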

Related

Action in Scala/Spark is not getting executed after a transformation

I am currently experimenting with Apache Spark through Scala. I am using version 2.4.3 of Spark Core (as defined in my build.sbt file). I am running a simple example: creating an RDD from a text file and filtering all the lines that contain the word "pandas". After that, I use an action to count the number of lines that actually contain that word. If I simply count the total number of lines in the file, everything works fine, but if I apply the filter transformation and then try to count the number of elements, it never finishes executing.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._
println("Creating Context")
val conf = new SparkConf().setMaster("local").setAppName("Test")
val sc = new SparkContext(conf)
val lines = sc.textFile("/home/lbali/example.txt")
val pandas = lines filter(line => line.contains("pandas"))
println("+++++ number of lines: " + lines.count()) // this works ok.
println("+++++ number of lines with pandas: " + pandas.count()) // This does not work
sc.stop()
Try persisting the RDD. When multiple actions are performed on the same RDD, it is better to persist it rather than recomputing the whole lineage each time:
import org.apache.spark.storage.StorageLevel
lines.persist(StorageLevel.MEMORY_AND_DISK)
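For anyone following this thread from PySpark, a minimal sketch of the same idea (the file path is the one from the question and purely illustrative): persist the RDD once, then run both actions against it.
from pyspark import SparkConf, SparkContext, StorageLevel

conf = SparkConf().setMaster("local").setAppName("PersistExample")
sc = SparkContext(conf=conf)

# Persist once, then reuse the cached data for both actions.
lines = sc.textFile("/home/lbali/example.txt")
lines.persist(StorageLevel.MEMORY_AND_DISK)

pandas = lines.filter(lambda line: "pandas" in line)
print("number of lines: " + str(lines.count()))
print("number of lines with pandas: " + str(pandas.count()))

sc.stop()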
I think I found a solution: downgrading the Scala version from 2.12.8 to 2.11.12 solved the issue.

How to remove header by using filter function in spark?

I want to remove the header from a file. But since the file will be split into partitions, I can't just drop the first item. So I was using a filter function to work around it, and below is the code I am using:
val noHeaderRDD = baseRDD.filter(line=>!line.contains("REPORTDATETIME"));
and the error I am getting says "error: not found: value line". What could be the issue with this code?
I don't think anybody has pointed out the obvious, that line.contains is also possible:
val noHeaderRDD = baseRDD.filter(line => !(line contains("REPORTDATETIME")))
You were nearly there, just a syntax issue, but that is significant of course!
Using textFile as below:
val rdd = sc.textFile(<<path>>)
rdd.filter(x => !x.startsWith(<<"Header Text">>))
Or
In Spark 2.0:
spark.read.option("header","true").csv("filePath")
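For PySpark readers, a hedged sketch of the same two approaches, assuming an existing sc/spark from the shell (the header token and file path are placeholders):
# RDD approach: drop any line containing a token that only appears in the header
rdd = sc.textFile("filePath")
no_header_rdd = rdd.filter(lambda line: "REPORTDATETIME" not in line)

# DataFrame approach (Spark 2.0+): let the reader consume the first line as a header
df = spark.read.option("header", "true").csv("filePath")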

Pyspark - Transfer control out of Spark Session (sc)

This is a follow-up question to
Pyspark filter operation on Dstream
To keep a count of how many error/warning messages have come through in, say, a day or an hour, how does one design the job?
What I have tried:
from __future__ import print_function
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
def counts():
    counter += 1
    print(counter.value)

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: network_wordcount.py <hostname> <port>", file=sys.stderr)
        exit(-1)
    sc = SparkContext(appName="PythonStreamingNetworkWordCount")
    ssc = StreamingContext(sc, 5)
    counter = sc.accumulator(0)
    lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))
    errors = lines.filter(lambda l: "error" in l.lower())
    errors.foreachRDD(lambda e: e.foreach(counts))
    errors.pprint()
    ssc.start()
    ssc.awaitTermination()
This, however, has multiple issues. To start with, print doesn't work (it does not output to stdout; I have read about it, and the best I can use here is logging). Can I save the output of that function to a text file and tail that file instead?
I am also not sure why the program just exits; there is no error or dump anywhere to look into further (Spark 1.6.2).
How does one preserve state? What I am trying to do is aggregate logs by server and severity; another use case is to count how many transactions were processed by looking for certain keywords.
Pseudo Code for what I want to try:
foreachRDD(Dstream):
    if RDD.contains("keyword1 | keyword2 | keyword3"):
        dictionary[keyword] = dictionary.get(keyword, 0) + 1  // add the keyword if not present and increase the counter
    print dictionary  // or send this dictionary elsewhere
The last part, sending or printing the dictionary, requires transferring control out of the Spark streaming context. Can someone explain the concept, please?
print doesn't work
I would recommend reading the design patterns section of the Spark documentation. I think that roughly what you want is something like this:
def _process(iter):
    for item in iter:
        print(item)
lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))
errors = lines.filter(lambda l: "error" in l.lower())
errors.foreachRDD(lambda e : e.foreachPartition(_process))
This will get your print call to work (though it is worth noting that the print statement will execute on the workers and not the driver, so if you're running this code on a cluster you will only see the output in the worker logs).
However, it won't solve your second problem:
How does one preserve state?
For this, take a look at updateStateByKey and the related example.
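As a rough sketch of what that could look like for the keyword-count use case above (the keyword list and checkpoint directory are made up; it assumes the ssc and lines from the question, and stateful streaming requires a checkpoint directory):
ssc.checkpoint("/tmp/streaming-checkpoint")  # required for stateful operations; example path

keywords = ["error", "warning", "transaction"]  # hypothetical keywords

def extract_keywords(line):
    lower = line.lower()
    return [(kw, 1) for kw in keywords if kw in lower]

def update_count(new_values, running_count):
    # new_values: counts from the current batch; running_count: state carried over from earlier batches
    return sum(new_values) + (running_count or 0)

counts = lines.flatMap(extract_keywords).updateStateByKey(update_count)
counts.pprint()  # running totals per keyword, printed on the driver
Each batch then prints the cumulative counts without leaving the streaming context; if the totals need to go elsewhere, a foreachRDD on counts can write them out.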

Read ORC files directly from Spark shell

I am having issues reading an ORC file directly from the Spark shell. Note: I am running Hadoop 1.2 and Spark 1.2, using the pyspark shell; I can also use spark-shell (Scala).
I have used this resource http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.2.4/Apache_Spark_Quickstart_v224/content/ch_orc-spark-quickstart.html .
from pyspark.sql import HiveContext
hiveCtx = HiveContext(sc)
inputRead = sc.hadoopFile("hdfs://user#server:/file_path",
classOf[inputFormat:org.apache.hadoop.hive.ql.io.orc.OrcInputFormat],
classOf[outputFormat:org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat])
I generally get an error saying the syntax is wrong. One time the code seemed to work when I used just the first of the three arguments passed to hadoopFile, but when I tried to use
inputRead.first()
the output was RDD[nothing, nothing]. I don't know if this is because the inputRead variable did not get created as an RDD or if it was not created at all.
I appreciate any help!
In Spark 1.5, I'm able to load my ORC file as:
val orcfile = "hdfs:///ORC_FILE_PATH"
val df = sqlContext.read.format("orc").load(orcfile)
df.show
You can try this code, it's working for me.
val LoadOrc = spark.read.option("inferSchema", true).orc("filepath")
LoadOrc.show()
You can also pass multiple paths to read from:
val df = sqlContext.read.format("orc").load("hdfs://localhost:8020/user/aks/input1/*","hdfs://localhost:8020/aks/input2/*/part-r-*.orc")
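Since the question was asked from the pyspark shell, here is a hedged Python equivalent (the paths are placeholders, and this assumes a Spark build recent enough to ship the ORC data source, as in the answers above):
# Spark 1.x pyspark shell (sqlContext available, Hive support built in)
df = sqlContext.read.format("orc").load("hdfs:///ORC_FILE_PATH")
df.show()

# Spark 2.x+ pyspark shell
df = spark.read.orc("hdfs:///ORC_FILE_PATH")
df.show()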

MLlib classification example stops in stage 1

EDIT:
I tried using the text from Gabriel's answer and got spam features: 9 and ham features: 13. I tried changing the HashingTF to numFeatures = 9, then 13, then created one for each. Then the program stopped at "count at DataValidators.scala:38" just like before.
Completed Jobs(4)
count at 21 (spamFeatures)
count at 23 (hamFeatures)
count at 28 (trainingData.count())
first at GeneralizedLinearAlgorithm at 34 (val model = lrLearner.run(trainingData))
1) Why are the features being counted by lines, when in the code the text is split by spaces (" ")?
2) Two things I see different between my code and Gabriel's code:
a) I don't have anything about a logger, but that shouldn't be an issue...
b) My files are located on HDFS (hdfs://ip-abc-de-.compute.internal:8020/user/ec2-user/spam.txt); once again, that shouldn't be an issue, but I am not sure if there is something I am missing...
3) How long should I let it run for? I've let it run for at least 10 minutes with local[2].
I'm guessing at this point it might be some sort of issue with my Spark/MLlib setup? Is there an even simpler program I can run to see if there is a setup issue with MLlib? I have been able to run other Spark streaming/SQL jobs before...
Thanks!
[reposted from spark community]
Hello Everyone,
I am trying to run this MLlib example from Learning Spark:
https://github.com/databricks/learning-spark/blob/master/src/main/scala/com/oreilly/learningsparkexamples/scala/MLlib.scala#L48
Things I'm doing differently:
1) instead of their spam.txt and normal.txt I have text files with 200 words...nothing huge at all and just plain text, with periods, commas, etc.
3) I've used numFeatures = 200, 1000 and 10,000
Error: I keep getting stuck when I try to run the model (based on details from the UI below):
val model = new LogisticRegressionWithSGD().run(trainingData)
It will freeze on something like this:
[Stage 1:==============> (1 + 0) / 4]
Some details from webui:
org.apache.spark.rdd.RDD.count(RDD.scala:910)
org.apache.spark.mllib.util.DataValidators$$anonfun$1.apply(DataValidators.scala:38)
org.apache.spark.mllib.util.DataValidators$$anonfun$1.apply(DataValidators.scala:37)
org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm$$anonfun$run$2.apply(GeneralizedLinearAlgorithm.scala:161)
org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm$$anonfun$run$2.apply(GeneralizedLinearAlgorithm.scala:161)
scala.collection.LinearSeqOptimized$class.forall(LinearSeqOptimized.scala:70)
scala.collection.immutable.List.forall(List.scala:84)
org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:161)
org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:146)
$line21.$read$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
$line21.$read$$iwC$$iwC$$iwC.<init>(<console>:38)
$line21.$read$$iwC$$iwC.<init>(<console>:40)
$line21.$read$$iwC.<init>(<console>:42)
$line21.$read.<init>(<console>:44)
$line21.$read$.<init>(<console>:48)
$line21.$read$.<clinit>(<console>)
$line21.$eval$.<init>(<console>:7)
$line21.$eval$.<clinit>(<console>)
$line21.$eval.$print(<console>)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
I am not sure what I am doing wrong...any help is much appreciated, thank you!
Thanks for this question; I wasn't aware of these examples, so I downloaded and tested them. What I see is that the git repository contains files with a lot of HTML code. It works, but you will end up adding 100 features, which is possibly why you're not getting consistent results, since your own files contain far fewer features. What I did to test that this works without the HTML code was to remove it from spam.txt and ham.txt as follows:
ham.txt=
Dear Spark Learner, Thanks so much for attending the Spark Summit 2014!
Check out videos of talks from the summit at ...
Hi Mom, Apologies for being late about emailing and forgetting to send you
the package. I hope you and bro have been ...
Wow, hey Fred, just heard about the Spark petabyte sort. I think we need to
take time to try it out immediately ...
Hi Spark user list, This is my first question to this list, so thanks in
advance for your help! I tried running ...
Thanks Tom for your email. I need to refer you to Alice for this one. I
haven't yet figured out that part either ...
Good job yesterday! I was attending your talk, and really enjoyed it. I
want to try out GraphX ...
Summit demo got whoops from audience! Had to let you know. --Joe
spam.txt=
Dear sir, I am a Prince in a far kingdom you have not heard of. I want to
send you money via wire transfer so please ...
Get Viagra real cheap! Send money right away to ...
Oh my gosh you can be really strong too with these drugs found in the
rainforest. Get them cheap right now ...
YOUR COMPUTER HAS BEEN INFECTED! YOU MUST RESET YOUR PASSWORD. Reply to
this email with your password and SSN ...
THIS IS NOT A SCAM! Send money and get access to awesome stuff really
cheap and never have to ...
Then use the modified MLlib.scala below. Make sure you have log4j referenced in your project to redirect output to a file instead of the console. You basically need to run it twice: in the first run, watch the output where the number of features in spam and ham is printed; you can then set the correct number of features (instead of 100). I used 5.
package com.oreilly.learningsparkexamples.scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.log4j.Logger
object MLlib {

  private val logger = Logger.getLogger("MLlib")

  def main(args: Array[String]) {
    logger.info("This is spark in Windows")
    val conf = new SparkConf().setAppName(s"Book example: Scala").setMaster("local[2]").set("spark.executor.memory","1g")
    //val conf = new SparkConf().setAppName(s"Book example: Scala")
    val sc = new SparkContext(conf)

    // Load 2 types of emails from text files: spam and ham (non-spam).
    // Each line has text from one email.
    val spam = sc.textFile("spam.txt")
    val ham = sc.textFile("ham.txt")

    // Create a HashingTF instance to map email text to vectors of 5 (not 100) features.
    val tf = new HashingTF(numFeatures = 5)

    // Each email is split into words, and each word is mapped to one feature.
    val spamFeatures = spam.map(email => tf.transform(email.split(" ")))
    println("features in spam " + spamFeatures.count())
    val hamFeatures = ham.map(email => tf.transform(email.split(" ")))
    println("features in ham " + hamFeatures.count())

    // Create LabeledPoint datasets for positive (spam) and negative (ham) examples.
    val positiveExamples = spamFeatures.map(features => LabeledPoint(1, features))
    val negativeExamples = hamFeatures.map(features => LabeledPoint(0, features))
    val trainingData = positiveExamples ++ negativeExamples
    trainingData.cache() // Cache data since Logistic Regression is an iterative algorithm.

    // Create a Logistic Regression learner which uses the SGD optimizer.
    val lrLearner = new LogisticRegressionWithSGD()
    // Run the actual learning algorithm on the training data.
    val model = lrLearner.run(trainingData)

    // Test on a positive example (spam) and a negative one (ham).
    // First apply the same HashingTF feature transformation used on the training data.
    val ex1 = "O M G GET cheap stuff by sending money to ..."
    val ex2 = "Hi Dad, I started studying Spark the other ..."
    val posTestExample = tf.transform(ex1.split(" "))
    val negTestExample = tf.transform(ex2.split(" "))

    // Now use the learned model to predict spam/ham for new emails.
    println(s"Prediction for positive test example: ${ex1} : ${model.predict(posTestExample)}")
    println(s"Prediction for negative test example: ${ex2} : ${model.predict(negTestExample)}")

    sc.stop()
  }
}
When I run this in the output I'm getting:
features in spam 5
features in ham 7
Prediction for positive test example: O M G GET cheap stuff by sending money
to ... : 1.0
Prediction for negative test example: Hi Dad, I started studying Spark the
other ... : 0.0
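For anyone working through this thread in PySpark rather than Scala, here is a rough sketch of the same pipeline with pyspark.mllib; the file names and numFeatures mirror the Scala answer above and are assumptions, not a tested solution.
from pyspark import SparkConf, SparkContext
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithSGD

conf = SparkConf().setMaster("local[2]").setAppName("SpamHamExample")
sc = SparkContext(conf=conf)

spam = sc.textFile("spam.txt")
ham = sc.textFile("ham.txt")

# Small feature space, matching the answer above.
tf = HashingTF(numFeatures=5)
spam_features = spam.map(lambda email: tf.transform(email.split(" ")))
ham_features = ham.map(lambda email: tf.transform(email.split(" ")))

positive = spam_features.map(lambda features: LabeledPoint(1, features))
negative = ham_features.map(lambda features: LabeledPoint(0, features))
training_data = positive.union(negative)
training_data.cache()  # logistic regression is iterative, so cache the training set

model = LogisticRegressionWithSGD.train(training_data)

pos_test = tf.transform("O M G GET cheap stuff by sending money to ...".split(" "))
neg_test = tf.transform("Hi Dad, I started studying Spark the other ...".split(" "))
print("Prediction for positive test example: %s" % model.predict(pos_test))
print("Prediction for negative test example: %s" % model.predict(neg_test))

sc.stop()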
I had the same problem with Spark 1.5.2 on my local cluster.
My program stopped on "count at DataValidators.scala:40".
Resolved by running spark as "spark-submit --master local"
I had a similar problem with Spark 1.5.2 on my local cluster. My program stopped on "count at DataValidators.scala:40". I was caching my training features; I removed the caching (just did not call the cache function) and that resolved it. I am not sure of the actual cause, though.