This is a follow-up to the question "Pyspark filter operation on Dstream".
To keep a count of how many error/warning messages have come through over, say, a day or an hour, how does one design the job?
What I have tried:
from __future__ import print_function
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
def counts():
    counter += 1
    print(counter.value)
if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: network_wordcount.py <hostname> <port>", file=sys.stderr)
        exit(-1)
    sc = SparkContext(appName="PythonStreamingNetworkWordCount")
    ssc = StreamingContext(sc, 5)
    counter = sc.accumulator(0)
    lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))
    errors = lines.filter(lambda l: "error" in l.lower())
    errors.foreachRDD(lambda e : e.foreach(counts))
    errors.pprint()
    ssc.start()
    ssc.awaitTermination()
This, however, has multiple issues. To start with, print doesn't work (it does not output to stdout; from what I have read, the best I can use here is logging). Can I save the output of that function to a text file and tail that file instead?
I am not sure why the program just exits; there is no error or dump anywhere to look into further (Spark 1.6.2).
How does one preserve state? What I am trying to do is aggregate logs by server and severity. Another use case is to count how many transactions were processed by looking for certain keywords.
Pseudocode for what I want to try:
foreachRDD(Dstream):
    if RDD.contains("keyword1 | keyword2 | keyword3"):
        dictionary[keyword] = dictionary.get(keyword,0) + 1 //add the keyword if not present and increase the counter
    print dictionary //or send this dictionary elsewhere
The last part, sending or printing the dictionary, requires switching out of the Spark streaming context. Can someone explain the concept, please?
print doesn't work
I would recommend reading the design patterns section of the Spark documentation. I think that roughly what you want is something like this:
def _process(partition):
    # Runs on the executors, once per partition of each batch RDD
    for item in partition:
        print(item)
lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))
errors = lines.filter(lambda l: "error" in l.lower())
errors.foreachRDD(lambda e : e.foreachPartition(_process))
This will get your print call to work (though it is worth noting that the print will execute on the workers and not on the driver, so if you're running this code on a cluster you will only see the output in the worker logs).
However, it won't solve your second problem:
How does one preserve state?
For this, take a look at updateStateByKey and the related example.
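As a rough sketch of what the update function passed to updateStateByKey could look like for the keyword-counting use case: the function receives the new values seen for a key in the current batch plus the previous state (None the first time) and returns the new state. The names KEYWORDS, tag_keywords and update_count are made up for illustration, and the Spark wiring is shown only in comments because it needs a running StreamingContext with checkpointing enabled. The loop at the bottom merely simulates two batches in plain Python.

```python
# Hypothetical keyword list; substitute whatever you are searching for.
KEYWORDS = ["error", "warning", "timeout"]

def tag_keywords(line):
    """Return one (keyword, 1) pair for every keyword found in the line."""
    lowered = line.lower()
    return [(kw, 1) for kw in KEYWORDS if kw in lowered]

def update_count(new_values, running_count):
    """State update: add this batch's hits to the running total.

    running_count is None the first time a key is seen.
    """
    return sum(new_values) + (running_count or 0)

# Assumed wiring inside the streaming job (not executed here):
#   ssc.checkpoint("checkpoint")   # updateStateByKey requires checkpointing
#   totals = lines.flatMap(tag_keywords).updateStateByKey(update_count)
#   totals.pprint()

# Plain-Python simulation of two batches, to show how state carries over.
state = {}
batches = [
    ["ERROR disk full", "all good", "Warning: slow"],
    ["error again", "timeout and error"],
]
for batch in batches:
    hits = {}
    for line in batch:
        for kw, one in tag_keywords(line):
            hits.setdefault(kw, []).append(one)
    for kw, values in hits.items():
        state[kw] = update_count(values, state.get(kw))
print(state)
```

Because the state lives in Spark rather than in a driver-side dictionary, the running totals survive across batches without any shared mutable variable.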
Related
I recently started a course by Frank Kane, namely Taming big data by Apache Spark using Python.
In the line where I compute the average number of friends, I am getting a syntax error, and I cannot understand how to fix it. Please refer to the code below. FYI, I'm using Python 3. I have highlighted the line with the syntax error. Please help, as I am stuck here.
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local").setAppName("AverageAge")
sc = SparkContext(conf = conf)
def parseline(line):
    fields =line.split(',')
    friend_age= int(fields[2])
    friends_number= int(fields[3])
    return (friend_age,friends_number)
lines = sc.textFile("file:///Sparkcourse/SparkCourse/fakefriends.csv")
rdd=lines.map(parseline)
making_keys=rdd.mapByValues(lambda x:(x,1))
totalsByAge=making_keys.reduceByKeys(lambda x,y: (x[0]+y[0],x[1]+y[1])
**averages_by_keys= totalsByAge.mapValues(lambda x: x[0] / x[1])**(Syntax Error)
results=averageByKeys.collect()
for result in results:
    print result
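Look at the line above the highlighted one: the reduceByKeys(...) call is missing its closing parenthesis, which is why Python reports the syntax error on the following line.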
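For reference, the (sum, count) pattern the code is going for can be sketched in plain Python without Spark. The sample rows below are invented for illustration. In the Spark version the corresponding RDD methods are, as far as I know, mapValues and reduceByKey (the mapByValues and reduceByKeys spellings in the question would fail once the parenthesis is fixed), the result variable must be spelled the same way in both places, and Python 3 needs print(result).

```python
# Made-up sample rows in the fakefriends.csv layout: id,name,age,friends
rows = [
    "0,Will,33,385",
    "1,Jean,33,2",
    "2,Hugh,55,221",
    "3,Deanna,40,465",
    "4,Quark,55,21",
]

def parse_line(line):
    """Extract (age, number_of_friends) from one CSV row."""
    fields = line.split(",")
    return int(fields[2]), int(fields[3])

# Equivalent of mapValues(lambda x: (x, 1)) followed by
# reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1])).
totals_by_age = {}
for age, friends in map(parse_line, rows):
    total, count = totals_by_age.get(age, (0, 0))
    totals_by_age[age] = (total + friends, count + 1)

# Equivalent of mapValues(lambda x: x[0] / x[1]).
averages_by_age = {
    age: total / count for age, (total, count) in totals_by_age.items()
}

for age, average in sorted(averages_by_age.items()):
    print(age, average)
```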
I have an application that processes records in an rdd and puts them into a cache. I put a couple of Spark Accumulators in my application to keep track of processed and failed records. These stats are sent to statsD before the application closes. Here is some simple sample code:
val sc: SparkContext = new SparkContext(conf)
val jdbcDF: DataFrame = sqlContext.read.format("jdbc").options(Map(...)).load().persist(StorageLevel.MEMORY_AND_DISK)
logger.info("Processing table with " + jdbcDF.count + " rows")
val processedRecords = sc.accumulator(0L, "processed records")
val erroredRecords = sc.accumulator(0L, "errored records")
jdbcDF.rdd.foreachPartition(iterator => {
  processedRecords += iterator.length // Problematic line
  val cache = getCacheInstanceFromBroadcast()
  processPartition(iterator, cache, erroredRecords) // updates cache with iterator documents
})
submitStats(processedRecords, erroredRecords)
I built and ran this in my cluster and it appeared to be functioning correctly, the job was marked as a SUCCESS by Spark. I queried the stats using Grafana and both counts were accurate.
However, when I queried my cache, Couchbase, none of the documents were there. I've combed through both driver and executor logs to see if any errors were being thrown, but I couldn't find anything. My thinking is that this is some memory issue, but would a couple of long accumulators be enough to cause a problem?
I was able to get this code snippet working by commenting out the line that increments processedRecords - see the line in the snippet noted with Problematic line.
Does anyone know why commenting out that line fixes the issue? Also why is Spark failing silently and not marking the job as FAILURE?
The application isn't "failing" per se. The main problem is that an Iterator can only be iterated through once.
Calling iterator.length actually goes through and exhausts the iterator. Thus, when processPartition receives iterator, the iterator is already exhausted and looks empty (so no records will be processed).
See the Scala docs, which confirm that size is "the number of elements returned by it. Note: it will be at its end after this operation!" You can also view the source code to confirm this.
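The same one-pass behaviour is easy to reproduce outside Spark. The sketch below uses a plain Python iterator as a stand-in for the partition iterator (process_partition is a hypothetical placeholder, not the asker's function): counting the elements first leaves nothing for the processing step, whereas counting while processing yields both the work and the count.

```python
def process_partition(records):
    """Hypothetical stand-in for the real work; returns how many records it handled."""
    processed = 0
    for _ in records:
        processed += 1
    return processed

# Counting first consumes the iterator, like iterator.length in Scala.
partition = iter(["a", "b", "c"])
counted_first = sum(1 for _ in partition)
handled_after_count = process_partition(partition)  # iterator is already empty

# Counting as a by-product of processing touches each record exactly once.
partition = iter(["a", "b", "c"])
handled_single_pass = process_partition(partition)

print(counted_first, handled_after_count, handled_single_pass)
```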
Workaround
If you rewrite processPartition to return a long value, that can be fed into the accumulator.
Also, sc.accumulator is deprecated in recent versions of Spark.
The workaround could look something like:
val acc = sc.longAccumulator("total processed records")
...
df.rdd.foreachPartition(iterator => {
  val cache = getCacheInstanceFromBroadcast()
  acc.add(processPartition(iterator, cache, erroredRecords))
})
...
// do something else
I have developed a Hadoop-based solution that processes a binary file. It uses the classic Hadoop MR technique. The binary file is about 10GB and divided into 73 HDFS blocks, and the business logic, written as a map process, operates on each of these 73 blocks. We have developed a CustomInputFormat and CustomRecordReader in Hadoop that return a key (IntWritable) and value (BytesWritable) to the map function. The value is nothing but the contents of an HDFS block (binary data). The business logic knows how to read this data.
Now, I would like to port this code to Spark. I am a beginner in Spark and can run simple examples (wordcount, pi example). However, I could not find a straightforward example of processing binary files in Spark. I see two solutions for this use case:
1) Avoid using the custom input format and record reader: find a method in Spark that creates an RDD for those HDFS blocks, and use a map-like method that feeds the HDFS block content to the business logic.
2) If this is not possible, re-use the custom input format and custom record reader through something such as the Hadoop API, HadoopRDD, etc.
My problem: I do not know whether the first approach is possible. If it is, can anyone please provide some pointers to examples? I tried the second approach but was highly unsuccessful. Here is the code snippet I used:
package org {
  object Driver {
    def myFunc(key : IntWritable, content : BytesWritable):Int = {
      println(key.get())
      println(content.getSize())
      return 1
    }
    def main(args: Array[String]) {
      // create a spark context
      val conf = new SparkConf().setAppName("Dummy").setMaster("spark://<host>:7077")
      val sc = new SparkContext(conf)
      println(sc)
      val rd = sc.newAPIHadoopFile("hdfs:///user/hadoop/myBin.dat", classOf[RandomAccessInputFormat], classOf[IntWritable], classOf[BytesWritable])
      val count = rd.map (x => myFunc(x._1, x._2)).reduce(_+_)
      println("The count is *****************************"+count)
    }
  }
}
Please note that the print statement in the main method prints 73, which is the number of blocks, whereas the print statements inside the map function print 0.
Can someone tell me what I am doing wrong here? I think I am not using the API the right way, but I failed to find any documentation or usage examples.
A couple of problems at a glance. You define myFunc but call func. Your myFunc has no return type, so you can't call collect(). If your myFunc truly doesn't have a return value, you can do foreach instead of map.
collect() pulls the data in an RDD to the driver to allow you to do stuff with it locally (on the driver).
I have made some progress in this issue. I am now using the below function which does the job
var hRDD = new NewHadoopRDD(sc, classOf[RandomAccessInputFormat],
classOf[IntWritable],
classOf[BytesWritable],
job.getConfiguration()
)
val count = hRDD.mapPartitionsWithInputSplit{ (split, iter) => myfuncPart(split, iter)}.collect()
However, I ended up with another error, the details of which I have posted here:
Issue in accessing HDFS file inside spark map function
15/10/30 11:11:39 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 40.221.94.235): java.io.IOException: No FileSystem for scheme: spark
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
EDIT:
I tried using the text from Gabriel's answer and got spam features: 9 and ham features: 13. I tried changing the HashingTF to numFeatures = 9, then 13, then created one for each. Then the program stopped at "count at DataValidators.scala:38" just like before.
Completed Jobs(4)
count at 21 (spamFeatures)
count at 23 (hamFeatures)
count at 28 (trainingData.count())
first at GeneralizedLinearAlgorithm at 34 (val model = lrLearner.run(trainingData))
1) Why are the features being counted by lines, as in the code it is being split by spaces (" ")
2) Two things I see that differ between my code and Gabriel's code:
a) I don't have anything about a logger, but that shouldn't be an issue...
b) My files are located on HDFS (hdfs://ip-abc-de-.compute.internal:8020/user/ec2-user/spam.txt); once again this shouldn't be an issue, but I'm not sure if there's something I'm missing...
3) How long should I let it run for? I've let it run for at least 10 minutes with local[2]...
I'm guessing at this point it might be some sort of issue with my Spark/MLlib setup? Is there an even simpler program I can run to see if there is a setup issue with MLlib? I have been able to run other Spark streaming/SQL jobs before...
Thanks!
[reposted from spark community]
Hello Everyone,
I am trying to run this MLlib example from Learning Spark:
https://github.com/databricks/learning-spark/blob/master/src/main/scala/com/oreilly/learningsparkexamples/scala/MLlib.scala#L48
Things I'm doing differently:
1) instead of their spam.txt and normal.txt I have text files with 200 words...nothing huge at all and just plain text, with periods, commas, etc.
3) I've used numFeatures = 200, 1000 and 10,000
Error: I keep getting stuck when I try to run the model (based off details from ui below):
val model = new LogisticRegressionWithSGD().run(trainingData)
It will freeze on something like this:
[Stage 1:==============> (1 + 0) / 4]
Some details from webui:
org.apache.spark.rdd.RDD.count(RDD.scala:910)
org.apache.spark.mllib.util.DataValidators$$anonfun$1.apply(DataValidators.scala:38)
org.apache.spark.mllib.util.DataValidators$$anonfun$1.apply(DataValidators.scala:37)
org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm$$anonfun$run$2.apply(GeneralizedLinearAlgorithm.scala:161)
org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm$$anonfun$run$2.apply(GeneralizedLinearAlgorithm.scala:161)
scala.collection.LinearSeqOptimized$class.forall(LinearSeqOptimized.scala:70)
scala.collection.immutable.List.forall(List.scala:84)
org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:161)
org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:146)
$line21.$read$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
$line21.$read$$iwC$$iwC$$iwC.<init>(<console>:38)
$line21.$read$$iwC$$iwC.<init>(<console>:40)
$line21.$read$$iwC.<init>(<console>:42)
$line21.$read.<init>(<console>:44)
$line21.$read$.<init>(<console>:48)
$line21.$read$.<clinit>(<console>)
$line21.$eval$.<init>(<console>:7)
$line21.$eval$.<clinit>(<console>)
$line21.$eval.$print(<console>)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
I am not sure what I am doing wrong...any help is much appreciated, thank you!
Thanks for this question. I wasn't aware of these examples, so I downloaded and tested them. What I see is that the files in the git repository contain a lot of HTML code. It works, but you end up adding 100 features, which is possibly why you're not getting consistent results, since your own files contain far fewer features. To test that this works without the HTML code, I removed it from spam.txt and ham.txt as follows:
ham.txt=
Dear Spark Learner, Thanks so much for attending the Spark Summit 2014!
Check out videos of talks from the summit at ...
Hi Mom, Apologies for being late about emailing and forgetting to send you
the package. I hope you and bro have been ...
Wow, hey Fred, just heard about the Spark petabyte sort. I think we need to
take time to try it out immediately ...
Hi Spark user list, This is my first question to this list, so thanks in
advance for your help! I tried running ...
Thanks Tom for your email. I need to refer you to Alice for this one. I
haven't yet figured out that part either ...
Good job yesterday! I was attending your talk, and really enjoyed it. I
want to try out GraphX ...
Summit demo got whoops from audience! Had to let you know. --Joe
spam.txt=
Dear sir, I am a Prince in a far kingdom you have not heard of. I want to
send you money via wire transfer so please ...
Get Viagra real cheap! Send money right away to ...
Oh my gosh you can be really strong too with these drugs found in the
rainforest. Get them cheap right now ...
YOUR COMPUTER HAS BEEN INFECTED! YOU MUST RESET YOUR PASSWORD. Reply to
this email with your password and SSN ...
THIS IS NOT A SCAM! Send money and get access to awesome stuff really
cheap and never have to ...
Then use the modified MLlib.scala below. Make sure log4j is referenced in your project so that output is redirected to a file instead of the console. You basically need to run it twice: in the first run, watch the output, which prints the number of features in spam and ham; you can then set the correct number of features (instead of 100). I used 5.
package com.oreilly.learningsparkexamples.scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.log4j.Logger
object MLlib {
  private val logger = Logger.getLogger("MLlib")
  def main(args: Array[String]) {
    logger.info("This is spark in Windows")
    val conf = new SparkConf().setAppName(s"Book example: Scala").setMaster("local[2]").set("spark.executor.memory","1g")
    //val conf = new SparkConf().setAppName(s"Book example: Scala")
    val sc = new SparkContext(conf)
    // Load 2 types of emails from text files: spam and ham (non-spam).
    // Each line has text from one email.
    val spam = sc.textFile("spam.txt")
    val ham = sc.textFile("ham.txt")
    // Create a HashingTF instance to map email text to vectors of 5 (not 100) features.
    val tf = new HashingTF(numFeatures = 5)
    // Each email is split into words, and each word is mapped to one feature.
    val spamFeatures = spam.map(email => tf.transform(email.split(" ")))
    println ("features in spam " + spamFeatures.count())
    val hamFeatures = ham.map(email => tf.transform(email.split(" ")))
    println ("features in ham " + ham.count())
    // Create LabeledPoint datasets for positive (spam) and negative (ham) examples.
    val positiveExamples = spamFeatures.map(features => LabeledPoint(1, features))
    val negativeExamples = hamFeatures.map(features => LabeledPoint(0, features))
    val trainingData = positiveExamples ++ negativeExamples
    trainingData.cache() // Cache data since Logistic Regression is an iterative algorithm.
    // Create a Logistic Regression learner which uses the SGD optimizer.
    val lrLearner = new LogisticRegressionWithSGD()
    // Run the actual learning algorithm on the training data.
    val model = lrLearner.run(trainingData)
    // Test on a positive example (spam) and a negative one (ham).
    // First apply the same HashingTF feature transformation used on the training data.
    val ex1 = "O M G GET cheap stuff by sending money to ..."
    val ex2 = "Hi Dad, I started studying Spark the other ..."
    val posTestExample = tf.transform(ex1.split(" "))
    val negTestExample = tf.transform(ex2.split(" "))
    // Now use the learned model to predict spam/ham for new emails.
    println(s"Prediction for positive test example: ${ex1} : ${model.predict(posTestExample)}")
    println(s"Prediction for negative test example: ${ex2} : ${model.predict(negTestExample)}")
    sc.stop()
  }
}
When I run this, I get the following output:
features in spam 5
features in ham 7
Prediction for positive test example: O M G GET cheap stuff by sending money
to ... : 1.0
Prediction for negative test example: Hi Dad, I started studying Spark the
other ... : 0.0
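One thing worth keeping in mind when reading these numbers: count() on the mapped RDD returns the number of vectors, that is, one per line (email), not the number of features. The 5 and 7 above match the 5 spam and 7 ham emails in the files. Each vector always has numFeatures slots, regardless of how many words the email contains. A simplified stand-in for the hashing trick illustrates this. It uses zlib.crc32 as the hash, which is an assumption made for the sake of a deterministic example and not the hash function Spark's HashingTF uses; the two emails are made up.

```python
import zlib

def hashing_tf(words, num_features):
    """Map a list of words to a fixed-length term-frequency vector."""
    vector = [0] * num_features
    for word in words:
        index = zlib.crc32(word.encode("utf-8")) % num_features
        vector[index] += 1
    return vector

# Two made-up "emails", one per line, as sc.textFile would yield them.
emails = [
    "Get cheap stuff now",
    "Hi Mom, sorry for the late reply about the package",
]
vectors = [hashing_tf(email.split(" "), num_features=5) for email in emails]

print(len(vectors))               # number of emails, which is what count() reports
print([len(v) for v in vectors])  # every vector has num_features slots
print([sum(v) for v in vectors])  # total term count per email
```

With so few slots, many different words collide in the same index, so a very small numFeatures trades accuracy for size.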
I had the same problem with Spark 1.5.2 on my local cluster.
My program stopped on "count at DataValidators.scala:40".
Resolved by running spark as "spark-submit --master local"
I had a similar problem with Spark 1.5.2 on my local cluster. My program stopped on "count at DataValidators.scala:40". I was caching my training features. Removing the caching (simply not calling the cache function) resolved it. I am not sure of the actual cause, though.
I have a Spark RDD[String] that I would like to stream to the input of an external command on the local machine. The setup would be something like this
val data: RDD[String] = <Valid data>
val process = Seq("wc", "-l") // This is not the actual process, but it works the same way as it consumes a whole bunch of lines and produces very little output itself
// Here's what I've tried so far
val exitCode = (process #< data.toLocalIterator.toStream) ! // Doesn't work
val exitCode = (process #< new ByteArrayInputStream(data.toLocalIterator.mkString("\n").getBytes("UTF-8"))) ! // Works but seems to load the whole data into local memory which is definitely not what I want as data could be very big
val processIO = new ProcessIO(
  in => data.toLocalIterator.toStream,
  out => scala.io.Source.fromInputStream(out).getLines.foreach(println),
  err => scala.io.Source.fromInputStream(err).getLines.foreach(println))
val exitCode = process.run(processIO) // This also doesn't work
Can anyone point me to a working solution that doesn't load all the data on the local machine and just streams it from an RDD[String] straight to the process, just like I'd do with
cat data.txt | wc -l
on the command line.
Thanks
I think I've figured this out. It seems that I forgot to actually write anything to the InputStream. Here is code that seems to be working for my small tests. I still haven't tested it on the big data yet, but it looks like it should work.
val processIO = BasicIO.standard(in => {
  data.toLocalIterator.foreach(x => in.write((x + Properties.lineSeparator).getBytes(Charsets.UTF_8)))
  in.close
})
val exitCode = process.run(processIO).exitValue
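For comparison, the same idea (write line by line into the child's stdin, then close it so the child sees end of input) looks like this in Python. This is only a sketch: the generator stands in for rdd.toLocalIterator(), and a small Python one-liner replaces wc -l so the example does not depend on the platform. It assumes, as in the question, that the child produces very little output; a chattier child would need its stdout drained concurrently to avoid blocking.

```python
import subprocess
import sys

def lines():
    """Stand-in for rdd.toLocalIterator(): yields records one at a time."""
    for i in range(1000):
        yield "record %d" % i

# Child process that counts the lines on its stdin, like `wc -l`.
child_code = "import sys; print(sum(1 for _ in sys.stdin))"
proc = subprocess.Popen(
    [sys.executable, "-c", child_code],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
)

# Write each record as it is produced; nothing is joined into one big string.
for line in lines():
    proc.stdin.write((line + "\n").encode("utf-8"))
proc.stdin.close()  # signals end of input to the child

line_count = int(proc.stdout.read().decode("utf-8").strip())
proc.stdout.close()
exit_code = proc.wait()
print(line_count, exit_code)
```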
This is not an answer, but you should be aware that it won't behave like cat data.txt | wc -l, since the RDD can (and usually will) be split across multiple processes (tasks running in executors). Your accepting program therefore needs to be able to handle multiple streams, and you should know that the data will not be ordered.