count occurrences of each word in apache spark - scala

val sc = new SparkContext("local[4]", "wc")
val lines: RDD[String] = sc.textFile("/tmp/inputs/*")
val errors = lines.filter(line => line.contains("ERROR"))
// Count all the errors
println(errors.count())
The above snippet counts the number of lines that contain the word ERROR. Is there a simple function, similar to contains, that would return the number of occurrences of the word instead?
Say the file is on the order of gigabytes and I want to parallelize the effort across a Spark cluster.

Just count the instances per line and sum those together:
val errorCount = lines.map{line => line.split("[\\p{Punct}\\p{Space}]").filter(_ == "ERROR").size}.reduce(_ + _)
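An equivalent formulation (a sketch, assuming the same punctuation/whitespace tokenization as above) flattens the tokens first and then counts the matches:
// assumes lines: RDD[String] as in the question
val errorCount = lines
  .flatMap(_.split("[\\p{Punct}\\p{Space}]"))
  .filter(_ == "ERROR")
  .count()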


How to select several elements from an RDD file line using Spark in Scala

I'm new to Spark and Scala and I would like to select several columns from a dataset.
I transformed my data into an RDD using:
val dataset = sc.textFile(args(0))
Then I split each line:
val resu = dataset.map(line => line.split("\001"))
But in my dataset I have a lot of features and I just want to keep some of them (columns 2 and 3).
I tried this (which works with PySpark), but it doesn't work:
val resu = dataset.map(line => line.split("\001")[2,3])
I know this is a newbie question, but can someone help me? Thanks.
I just want to keep some of them (columns 2 and 3)
If you want columns 2 and 3 in tuple form you can do
val resu = dataset.map(line => {
  val array = line.split("\001")
  (array(2), array(3))
})
But if you want column 2 and 3 in array form then you can do
val resu = dataset.map(line => {
  val array = line.split("\001")
  Array(array(2), array(3))
})
In Scala, in order to access specific array elements you have to use parentheses.
In your case, you want a sublist, so you can use the slice(i, j) function. It extracts the elements from index i up to (but not including) index j. So in your case, you may use:
val resu = dataset.map(line => line.split("\001").slice(2,4))
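As a small illustration of both forms (a sketch; the sample string is made up):
val array = "a\001b\001c\001d\001e".split("\001")
array(2)          // "c"  -- element access uses parentheses
array.slice(2, 4) // Array("c", "d")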
Hope it helps.

Counting all the characters in a file using Spark/Scala?

How can I count all the characters in a file using Spark/Scala? Here is what I am doing in the Spark shell:
scala> val logFile = sc.textFile("ClasspathLength.txt")
scala> val counts = logFile.flatMap(line=>line.split("").map(char=>(char,1))).reduceByKey(_ + _)
scala> println(counts.count())
62
I am getting incorrect count here. Could someone help me fix this?
What you're doing here is:
Counting the number of times each unique character appears in the input:
val counts = logFile.flatMap(line=>line.split("").map(char=>(char,1))).reduceByKey(_ + _)
and then:
Counting the number of records in this result (using counts.count()), which ignores the actual values you calculated in the previous step
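If you did want the total from that pair RDD, one option (a sketch, assuming the counts RDD defined above) is to sum its values rather than count its keys:
// sums the per-character counts instead of counting distinct characters
counts.values.sum()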
If you're interested in the total number of characters in the file, there's no need for grouping at all: you can map each line to its length and then use the implicit conversion into DoubleRDDFunctions to call sum():
logFile.map(_.length).sum()
Alternatively you can flatMap into separate record per character and then use count:
logFile.flatMap(_.toList).count
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[4]")
  .appName("Nos of Word Count")
  .getOrCreate()
val sparkContext = spark.sparkContext
sparkContext.setLogLevel("ERROR")

val rdd1 = sparkContext.textFile("data/mini.txt")
println(rdd1.count())                      // number of lines
val rdd2 = rdd1.flatMap(f => f.split(" ")) // split each line into words
println(rdd2.count())                      // number of words
val rdd3 = rdd2.map(w => w.length)         // length of each word
println(rdd3.sum().round)                  // total number of characters (excluding spaces)
All you need here is flatMap + count:
logFile.flatMap(line => line).count

How to do data cleansing in Scala

I just started using Scala on Spark, so I am not sure whether my question is workable or whether I should turn to another solution/tool:
I have a text file for word counting and sorting; here is the file.
I load the file into HDFS
I then use the following code in Scala to do the counting:
val file = sc.textFile("hdfs://localhost:9000/Peter")
val counts = file.flatMap(line => line.split(" ")).map(p => (p,1)).reduceByKey(_+_).sortByKey(true,1)
counts.saveAsTextFile("Peter_SortedOutput6")
I checked the result on HDFS with hdfs dfs -cat hdfs://localhost:9000/user/root/Peter_SortedOutput5/part-00000
Part of the result is posted here for ease of reading:
((For,1)
((not,1)
(1,8)
(10,8)
(11,8)
(12,8)
(13,8)
(14,8)
(15,7)
(16,7)
(17,7)
(18,7)
(19,6)
(2,8)
(20,5)
(21,5)
(22,4)
(23,2)
(24,2)
(25,2)
(3,8)
(4,8)
(5,8)
(6,8)
(7,8)
(8,8)
(9,8)
(Abraham,,1)
(According,1)
(Amen.,4)
(And,19)
(As,5)
(Asia,,1)
(Babylon,,1)
(Balaam,1)
(Be,2)
(Because,1)
First, this is really not what I expected; I want the results shown in descending order of count.
Second, there are results like the following:
(God,25)
(God's,1)
(God,,9)
(God,),1)
(God.,6)
(God:,2)
(God;,2)
(God?,1)
How can I do some cleansing in the split step so these occurrences are grouped into a single entry, (God, 47)?
Thank you very much.
There is a course, BerkeleyX: CS105x Introduction to Apache Spark, on edx.org by Berkeley & Databricks. One of the assignments is word count.
The steps are:
remove punctuation, by replacing "[^A-Za-z0-9\s]+" with "" (or "[^A-Za-z\s]+" to also drop numbers)
trim all spaces
lowercase all words
We can add an extra step, such as:
remove stop words
Code as follows:
import org.apache.spark.ml.feature.StopWordsRemover
import org.apache.spark.sql.functions.{explode, split}
import spark.implicits._ // for toDF and $"..." (already available in spark-shell)

// val reg = raw"[^A-Za-z0-9\s]+" // with numbers
val reg = raw"[^A-Za-z\s]+"       // no numbers

val lines = sc.textFile("peter.txt")
  .map(_.replaceAll(reg, "").trim.toLowerCase)
  .toDF("line")
val words = lines.select(split($"line", " ").alias("words"))

val remover = new StopWordsRemover()
  .setInputCol("words")
  .setOutputCol("filtered")
val noStopWords = remover.transform(words)

// explode to one word per row, then drop to the RDD API for reduceByKey
val counts = noStopWords.select(explode($"filtered").alias("word"))
  .rdd.map(row => (row.getString(0), 1))
  .reduceByKey(_ + _)

// from (word, count) to (count, word), sorted by count descending
val mostCommon = counts.map(p => (p._2, p._1)).sortByKey(false, 1)
mostCommon.take(5)
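A note on the design: the drop to .rdd is needed because reduceByKey lives on pair RDDs. A sketch of the same aggregation staying entirely in the DataFrame API (assuming the noStopWords frame above) would be:
noStopWords.select(explode($"filtered").alias("word"))
  .groupBy("word").count()
  .orderBy($"count".desc)
  .show(5)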
Clean the data using replaceAll:
val counts = file.flatMap(line => line.trim.toLowerCase.split(" ").map(_.replaceAll("[$,?+.;:\'s\\W\\d]", "")))
Sort by value in the Scala API:
.map(item => item.swap) // interchanges position of entries in each tuple
.sortByKey(true, 1) // 1st arg configures ascending sort, 2nd arg configures one task
.map(item => item.swap)
Sort by value in the Python API:
.map(lambda (a, b): (b, a)) \
.sortByKey(1, 1) \ # 1st arg configures ascending sort, 2nd configures 1 task
.map(lambda (a, b): (b, a))
The code should then look like this (untested, so please fix any remaining syntax errors):
val file = sc.textFile("hdfs://localhost:9000/Peter")
val counts = file.flatMap(line => line.trim.toLowerCase.split(" ").map(_.replaceAll("[$,?+.;:\'s\\W\\d]", "")))
  .map(p => (p, 1))
  .reduceByKey(_ + _)
  .map(rec => rec.swap)
  .sortByKey(false, 1) // false = descending by count, as requested
  .map(rec => rec.swap)
counts.saveAsTextFile("Peter_SortedOutput6")
See scala_regular_expressions for what [\\W], [\\d], or [;:',.?] mean.
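As a quick illustration of those character classes (a small sketch; the sample string is made up):
// \W matches any non-word character, \d matches any digit
"God's, word 7;".replaceAll("[\\W\\d]", "") // => "Godsword"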

Replace every word in a document based on a defined pattern

I have a file with thousands of documents, and I want to read every document and replace each word in it following a (Key, Value) pattern; at first all values are zero (0).
docs file:
Problems installing software weak tech support.
great idea executed perfect.
A perfect solution for anyone who listens to the radio.
...
and I have a score_file containing many words, e.g.:
idea 1
software 1
weak 1
who 1
perfect 1
price -1
...
The output should follow this pattern:
(Problems,0) (installing,1) (software,1) (develop,2) (weak,1) (tech,1) (support,0).
(great,1) (idea,1) (executed,2) (perfect,1).
(perfect,1) (solution,1) (for,0) (anyone,1) (who,1) (listens,1) (to,0) (the,0) (radio,0).
If a word of the document occurs in the score_file, then the values of (the left word, this word, the right word) in the document are each increased by 1 or -1 according to that word's score.
I've tried:
val Words_file = sc.textFile("score_file.txt")
val Words_List = sc.broadcast(Words_file.map({ (line) =>
val Array(a,b) = line.split(" ").map(_.trim)(a,b.toInt)}).collect().toMap)
val Docs_file = sc.textFile("docs_file.txt")
val tokens = Docs_file.map(line => (line.split(" ").map(word => Words_List.value.getOrElse(word, 0)).reduce(_ + _), line.split(" ").filter(Words_List.value.contains(_) == false).mkString(" ")))
val out_Docs = tokens.map(s => if (s._2.length > 3) {s._1 + "," + s._2})
But it scores every document rather than its words. How can I generate my desired output?
It is kinda hard to read your code; you seem to use a weird mix of CamelCase and underscores, with vals sometimes starting with uppercase and sometimes not.
I'm not sure I completely got the question, but to get an output where each word in a given line is replaced by itself and the number coming from the other file, this might work:
val sc = new SparkContext()
val wordsFile = sc.textFile("words_file.txt")
val words = sc.broadcast(wordsFile.map(line => {
  val Array(a, b) = line.split(" ").map(_.trim)
  (a, b.toInt)
}).collect().toMap)

val docs = sc.textFile("docs_file.txt")
val tokens = docs.map(line => {
  line.split(" ")
    .map(token => s"(${token}, ${words.value.getOrElse(token, 0)})")
    .mkString(" ")
})
tokens ends up being an RDD[String], just like the input, preserving lines (documents). I reformatted the code a bit to make it more readable.
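To inspect or persist the result (a minimal sketch; the output path is hypothetical):
// look at the first couple of rewritten documents, then write them out
tokens.take(2).foreach(println)
tokens.saveAsTextFile("scored_docs_out") // hypothetical output directory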

How to remove lines with fewer than 3 words?

I have a corpus of millions of documents and I want to remove lines that are less than 3 words long (in Scala and Spark).
How can I do this?
It all depends on how you define words, but assuming a very simple approach:
def naiveTokenizer(text: String): Array[String] = text.split("""\s+""")
def naiveWordCount(text: String): Int = naiveTokenizer(text).length
val lines = sc.textFile("README.md")
lines.filter(naiveWordCount(_) >= 3)
You could use the count function (this assumes words are separated by single ' ' spaces, so a line with at least 3 words contains at least 2 spaces):
scala.io.Source.fromFile("myfile.txt").getLines().filter(_.count(_ == ' ') >= 2)
zero33's answer is valid if you want to keep the lines intact. However, if you also want the lines tokenized, then flatMap is more efficient:
val lines = sc.textFile("README.md")
lines.flatMap(line => {
  val tokenized = line.split("""\s+""")
  if (tokenized.length >= 3) tokenized
  else Seq()
})