Replace every word in a document based on a defined pattern - scala

I have a file with thousands of documents, and I want to read every document and replace each word in it according to a (Key, Value) pattern, where at first all values are zero (0).
docs file :
Problems installing software weak tech support.
great idea executed perfect.
A perfect solution for anyone who listens to the radio.
...
and I have a score_file containing many words, e.g.:
idea 1
software 1
weak 1
who 1
perfect 1
price -1
...
The output should follow this pattern:
(Problems,0) (installing,1) (software,1) (develop,2) (weak,1) (tech,1) (support,0).
(great,1) (idea,1) (executed,2) (perfect,1).
(perfect,1) (solution,1) (for,0) (anyone,1) (who,1) (listens,1) (to,0) (the,0) (radio,0).
If a word in a document occurs in the score_file, then the values of (the left word, the word itself, the right word) in the document are each incremented by 1 or -1 according to the word's score.
I've tried:
val Words_file = sc.textFile("score_file.txt")
val Words_List = sc.broadcast(Words_file.map { line =>
  val Array(a, b) = line.split(" ").map(_.trim)
  (a, b.toInt)
}.collect().toMap)
val Docs_file = sc.textFile("docs_file.txt")
val tokens = Docs_file.map(line => (
  line.split(" ").map(word => Words_List.value.getOrElse(word, 0)).reduce(_ + _),
  line.split(" ").filter(!Words_List.value.contains(_)).mkString(" ")
))
val out_Docs = tokens.map(s => if (s._2.length > 3) s._1 + "," + s._2)
But it scores every document, not its words. How can I generate my desired output?

It is kind of hard to read your code; you seem to use a weird mix of CamelCase with underscores, with vals sometimes starting with an uppercase letter and sometimes not.
I'm not sure I completely understood the question, but to get an output where each word in a given line is replaced by itself and the number coming from the other file, this might work:
val sc = new SparkContext()
val wordsFile = sc.textFile("words_file.txt")
val words = sc.broadcast(wordsFile.map { line =>
  val Array(a, b) = line.split(" ").map(_.trim)
  (a, b.toInt)
}.collect().toMap)
val docs = sc.textFile("docs_file.txt")
val tokens = docs.map { line =>
  line.split(" ")
    .map(token => s"(${token}, ${words.value.getOrElse(token, 0)})")
    .mkString(" ")
}
tokens ends up being an RDD[String], just like the input, preserving lines (documents). I reformatted the code a bit to make it more readable.
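To also apply the neighbor rule rather than only looking each word up, the question's expected output suggests that each word's value is the sum of the scores of its left neighbor, itself, and its right neighbor. A minimal Spark-free sketch of that interpretation, with a hard-coded score map (in Spark you would broadcast the map and apply scoreLine inside docs.map):

```scala
// Assumed scoring rule, inferred from the sample output: each word's value
// is the sum of the scores of (previous word, the word itself, next word).
val scores = Map("idea" -> 1, "software" -> 1, "weak" -> 1,
                 "who" -> 1, "perfect" -> 1, "price" -> -1)

def scoreLine(line: String): String = {
  val words = line.stripSuffix(".").split(" ")
  words.indices.map { i =>
    // sum the scores over the window [i-1, i, i+1], clipped at the line ends
    val windowSum = (math.max(0, i - 1) to math.min(words.length - 1, i + 1))
      .map(j => scores.getOrElse(words(j), 0)).sum
    s"(${words(i)},$windowSum)"
  }.mkString(" ") + "."
}

scoreLine("great idea executed perfect.")
// (great,1) (idea,1) (executed,2) (perfect,1).
```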

Related

Spark - create list of words from text file and the word that comes immediately after it

I'm trying to create a pair rdd of every word from a text file and every word that follows it.
So for instance,
("I'm", "trying"), ("trying", "to"), ("to", "create") ...
It seems like I could almost use the zip function here, if I were able to start with an offset of 1 on the second part.
How can I do this, or is there a better way?
I'm still not quite used to thinking in terms of functional programming here.
You can manipulate the index, then join on the initial pair RDD:
val rdd = sc.parallelize("I'm trying to create a".split(" "))
val el1 = rdd.zipWithIndex().map(l => (-1+l._2, l._1))
val el2 = rdd.zipWithIndex().map(l => (l._2, l._1))
el2.join(el1).map(l => l._2).collect()
Which outputs:
Array[(String, String)] = Array((I'm,trying), (trying,to), (to,create), (create,a))
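Note that the join-based approach does not guarantee output order. If you process line by line, the same pairing can also be done with plain Scala's sliding, which avoids the join entirely; a minimal sketch on a single line of text:

```scala
// Consecutive-word pairs via sliding windows of size 2.
val words = "I'm trying to create a".split(" ").toSeq
val pairs = words.sliding(2).map { case Seq(a, b) => (a, b) }.toList
// List((I'm,trying), (trying,to), (to,create), (create,a))
```

In Spark itself, MLlib ships an RDD version of this (org.apache.spark.mllib.rdd.RDDFunctions.sliding) that works across partition boundaries.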

Is there a better way of converting Iterator[Char] to Seq[String]?

The following is the code that I have used to convert Iterator[Char] to Seq[String]:
val result = IOUtils.toByteArray(new FileInputStream (new File(fileDir)))
val remove_comp = result.grouped(11)
  .map { arr => arr.update(2, 32); arr }
  .flatMap { arr => arr.update(3, 32); arr }
val convert_iter = remove_comp.map(_.toChar.toString).toSeq.mkString.split("\n")
val rdd_input = Spark.sparkSession.sparkContext.parallelize(convert_iter)
The contents of fileDir:
12**34567890
12##34567890
12!!34567890
12¬¬34567890
12
'34567890
I am not happy with this code, as the data size is big and converting to a String would blow the heap space.
val convert_iter = remove_comp.map(_.toChar)
convert_iter: Iterator[Char] = non-empty iterator
Is there a better way of coding?
By completely disregarding corner cases such as empty Strings, I would start with something like:
val test = Iterable('s', 'f', '\n', 's', 'd', '\n', 's', 'v', 'y')
val (allButOne, last) = test.foldLeft((Seq.empty[String], Seq.empty[Char])) {
  case ((strings, chars), char) =>
    if (char == '\n')
      (strings :+ chars.mkString, Seq.empty)
    else
      (strings, chars :+ char)
}
val result = allButOne :+ last.mkString
I am sure it could be made more elegant and handle corner cases better (once you define how you want them handled), but I think it is a nice starting point.
To be honest, I am not entirely sure what you want to achieve. I just guessed that you want to group the chars separated by \n together and turn them into Strings.
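If the data fits in memory anyway (which the foldLeft above also assumes), a shorter route to the same grouping is to build the string once and split on newlines; a minimal sketch, with the -1 limit keeping any trailing empty groups:

```scala
// Same grouping as the foldLeft: chars between '\n' become one String each.
val test = Iterable('s', 'f', '\n', 's', 'd', '\n', 's', 'v', 'y')
val result = test.mkString.split("\n", -1).toSeq
// Seq(sf, sd, svy)
```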
Looking at your code, I see that you are trying to remove special characters such as **, ## and so on from a file that contains the following data:
12**34567890
12##34567890
12!!34567890
12¬¬34567890
12
'34567890
For that you can just read the data using sparkContext's textFile and use regex's replaceAllIn:
import scala.util.matching.Regex

val pattern = new Regex("[¬~!##$^%&*\\(\\)_+={}\\[\\]|;:\"'<,>.?` /\\-]")
val result = sc.textFile(fileDir).map(line => pattern.replaceAllIn(line, ""))
and you should have your result as an RDD[String], which is also iterable:
1234567890
1234567890
1234567890
1234567890
12
34567890
Updated
If there are \n and \r characters in between the text at the 3rd and 4th places, and if the result is all fixed-length 10-digit text, then you can use the wholeTextFiles API of sparkContext with the following regex:
val pattern = new Regex("[¬~!##$^%&*\\(\\)_+={}\\[\\]|;:\"'<,>.?` /\\-\r\n]")
val result = sc.wholeTextFiles(fileDir).flatMap(line => pattern.replaceAllIn(line._2, "").grouped(10))
You should get the output as
1234567890
1234567890
1234567890
1234567890
1234567890
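A plain-Scala sketch of the wholeTextFiles idea above, assuming fixed-length 10-digit records once the special characters and line breaks are stripped:

```scala
// Strip special chars (including \r\n) from the whole text, then
// regroup into fixed-length 10-character records.
val raw = "12**34567890\n12##34567890"
val pattern = "[¬~!##$^%&*\\(\\)_+={}\\[\\]|;:\"'<,>.?` /\\-\r\n]".r
val records = pattern.replaceAllIn(raw, "").grouped(10).toList
// List(1234567890, 1234567890)
```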
I hope the answer is helpful.

Scala: write a MapReduce progam to find a word that follows a word [Homework]

I have a homework assignment where I must write a MapReduce program in Scala to find, for each word in a file, the word that most often follows it.
For example, for the word "basketball", the word "is" comes next 5 times, "has" 2 times, and "court" 1 time.
In a text file this might show up as:
basketball is..... (this sequence happens 5 times)
basketball has..... (this sequence happens 2 times)
basketball court.... (this sequence happens 1 time)
I am having a hard time conceptually figuring out how to do this.
The idea I have had, but have not been able to successfully implement, is:
Iterate through each word; if the word is "basketball", take the next word and add it to a map. Reduce by key, and sort from highest to lowest.
Unfortunately I do not know how to take the next word in a list of words.
For example, I would like to do something like this:
val lines = spark.textFile("basketball_words_only.txt") // process lines in file
// split into individual words
val words = lines.flatMap(line => line.split(" "))
var listBuff = new ListBuffer[String]() // a list buffer to hold each following word
val it = Iterator(words)
while (it.hasNext) {
  listBuff += it.next().next() // <-- this is what I would like to do
}
val follows = listBuff.map(word => (word, 1))
val count = follows.reduceByKey((x, y) => x + y) // another issue, as I cannot reduceByKey with a ListBuffer
val sort = count.sortBy(_._2, false, 1)
val result2 = sort.collect()
for (i <- 0 to result2.length - 1) {
  printf("%s follows %d times\n", result2(i)._1, result2(i)._2)
}
Any help would be appreciated. If I am over thinking this I am open to different ideas and suggestions.
Here's one way to do it using MLlib's sliding function:
import org.apache.spark.mllib.rdd.RDDFunctions._

val resRDD = textFile.
  flatMap(_.split("""[\s,.;:!?]+""")).
  sliding(2).
  map { case Array(x, y) => ((x, y), 1) }.
  reduceByKey(_ + _).
  map { case ((x, y), c) => (x, y, c) }.
  sortBy(z => (z._1, z._3, z._2), false)
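That yields (word, next, count) triples; to answer the actual homework question (the single most frequent follower per word), one more aggregation is needed. A plain-collections sketch of that step, with made-up sample triples (in Spark you would use reduceByKey on the word key instead of groupBy):

```scala
// For each word, keep the follower with the highest count.
val triples = Seq(("basketball", "is", 5), ("basketball", "has", 2),
                  ("basketball", "court", 1), ("the", "ball", 3))
val best = triples
  .groupBy(_._1)                                   // word -> its triples
  .map { case (w, ts) => (w, ts.maxBy(_._3)._2) }  // word -> top follower
// Map(basketball -> is, the -> ball)
```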

How to do data cleansing in Scala

I just started Scala on Spark, so I am not sure if my question is workable or whether I should turn to another solution/tool:
I have a text file for word counting and sorting, here is the file.
I load the file into HDFS
I then use the following code in Scala to do the counting:
val file = sc.textFile("hdfs://localhost:9000/Peter")
val counts = file.flatMap(line => line.split(" ")).map(p => (p,1)).reduceByKey(_+_).sortByKey(true,1)
counts.saveAsTextFile("Peter_SortedOutput6")
I checked the result on hdfs by hdfs dfs -cat hdfs://localhost:9000/user/root/Peter_SortedOutput5/part-00000
Part of the result is posted here for the convenience of reading:
((For,1)
((not,1)
(1,8)
(10,8)
(11,8)
(12,8)
(13,8)
(14,8)
(15,7)
(16,7)
(17,7)
(18,7)
(19,6)
(2,8)
(20,5)
(21,5)
(22,4)
(23,2)
(24,2)
(25,2)
(3,8)
(4,8)
(5,8)
(6,8)
(7,8)
(8,8)
(9,8)
(Abraham,,1)
(According,1)
(Amen.,4)
(And,19)
(As,5)
(Asia,,1)
(Babylon,,1)
(Balaam,1)
(Be,2)
(Because,1)
First, this is really not what I expected; I want the results shown in descending order of count.
Second, there are results like the following:
(God,25)
(God's,1)
(God,,9)
(God,),1)
(God.,6)
(God:,2)
(God;,2)
(God?,1)
How can I do some cleansing in the split so these occurrences get grouped into one (God, 47)?
Thank you very much.
There is a course, BerkeleyX: CS105x Introduction to Apache Spark, on edx.org by Berkeley & Databricks. One of the assignments is doing word count.
The steps are
remove punctuation, by replacing "[^A-Za-z0-9\s]+" with "", or use "[^A-Za-z\s]+" to also exclude numbers
trim all spaces
lower all words
we can add an extra step like
remove stop words
Code as follows
import org.apache.spark.ml.feature.StopWordsRemover
import org.apache.spark.sql.functions.{explode, split}

// val reg = raw"[^A-Za-z0-9\s]+" // with numbers
val reg = raw"[^A-Za-z\s]+" // no numbers
val lines = sc.textFile("peter.txt").
  map(_.replaceAll(reg, "").trim.toLowerCase).toDF("line")
val words = lines.select(split($"line", " ").alias("words"))
val remover = new StopWordsRemover()
  .setInputCol("words")
  .setOutputCol("filtered")
val noStopWords = remover.transform(words)
// drop down to an RDD, since DataFrames have no reduceByKey
val counts = noStopWords.select(explode($"filtered").alias("word"))
  .as[String].rdd
  .map(word => (word, 1))
  .reduceByKey(_ + _)
// from word -> num to num -> word
val mostCommon = counts.map(p => (p._2, p._1)).sortByKey(false, 1)
mostCommon.take(5)
Clean the data with replaceAll (note that replaceAll applies to each word, so it must go inside a map):
val counts = file.flatMap(line => line.trim.toLowerCase.split(" ").map(_.replaceAll("[$,?+.;:'s\\W\\d]", "")))
sort by value in scala API:
.map(item => item.swap) // interchanges position of entries in each tuple
.sortByKey(true, 1) // 1st arg configures ascending sort, 2nd arg configures one task
.map(item => item.swap)
sort by value in python API:
.map(lambda (a, b): (b, a)) \
.sortByKey(1, 1) \ # 1st arg configures ascending sort, 2nd configures 1 task
.map(lambda (a, b): (b, a))
The code should look like this (you may see a syntax error; please fix if any):
val file = sc.textFile("hdfs://localhost:9000/Peter")
val counts = file.flatMap(line => line.trim.toLowerCase.split(" ").map(_.replaceAll("[$,?+.;:'s\\W\\d]", "")))
  .map(p => (p, 1))
  .reduceByKey(_ + _)
  .map(rec => rec.swap)
  .sortByKey(false, 1) // false gives the descending count order asked for
  .map(rec => rec.swap)
counts.saveAsTextFile("Peter_SortedOutput6")
See scala_regular_expressions for what [\\W], [\\d], or [;:',.?] mean.

count occurrences of each word in apache spark

val sc = new SparkContext("local[4]", "wc")
val lines: RDD[String] = sc.textFile("/tmp/inputs/*")
val errors = lines.filter(line => line.contains("ERROR"))
// Count all the errors
println(errors.count())
The above snippet counts the number of lines that contain the word ERROR. Is there a simplified function similar to contains that would return the number of occurrences of the word instead?
Say the file is on the order of gigabytes and I want to parallelize the effort using Spark clusters.
Just count the instances per line and sum those together:
val errorCount = lines.map{line => line.split("[\\p{Punct}\\p{Space}]").filter(_ == "ERROR").size}.reduce(_ + _)
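A Spark-free sketch of the same count-per-line-then-sum logic on plain collections, to show what the map and reduce each do (Spark simply distributes these two steps across the cluster):

```scala
// Split each line on punctuation/whitespace, count exact "ERROR" tokens,
// then sum the per-line counts.
val lines = Seq("ERROR foo ERROR", "ok", "ERROR,bar")
val errorCount = lines.map { line =>
  line.split("[\\p{Punct}\\p{Space}]").count(_ == "ERROR")
}.sum
// errorCount == 3
```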