Text manipulation in Spark and Scala

This is my data:
review/text: The product picture and part number match, but they together do not math the description.
review/text: A necessity for the Garmin. Used the adapter to power the unit on my motorcycle. Works like a charm.
review/text: This power supply did the job and got my computer back online in a hurry.
review/text: Not only did the supply work. it was easy to install, a lot quieter than the PowMax that fried.
review/text: This is an awesome power supply that was extremely easy to install.
review/text: I had my doubts since best buy would end up charging me $60. at the time I bought my camera for the card and the cable.
review/text: Amazing... Installed the board, and that's it, no driver needed. Work great, no error messages.
and I've tried:
import org.apache.spark.{SparkContext, SparkConf}

object test12 {
  def filterfunc(s: String): Array[String] = {
    s.split("""\.""")
      .map(_.split(" ")
        .filter(_.nonEmpty)
        .map(_.replaceAll("""\W""", "").toLowerCase)
        .filter(_.nonEmpty))
      .flatMap(x => x)
  }

  def main(args: Array[String]): Unit = {
    val conf1 = new SparkConf().setAppName("pre2").setMaster("local")
    val sc = new SparkContext(conf1)
    val rdd = sc.textFile("data/2012/2012.txt")
    val stopWords = sc.broadcast(List[String]("reviewtext", "a", "about", "above", "according", "accordingly", "across", "actually", ...))
    var grouped_doc_words = rdd.flatMap({ (line) =>
      val words = line.map(filterfunc).filter(word_filter.value)
      words.map(w => {
        (line.hashCode(), w)
      })
    }).groupByKey()
  }
}
and I want to generate this output:
doc1: product picture number match together not math description.
doc2: necessity garmin. adapter power unit my motorcycle. works like charm.
doc3: power supply job computer online hurry.
doc4: not supply work. easy install quieter powmax fried.
...
Some exceptions: (1) the words (not, n't, non, none) must not be removed; (2) all dot (.) symbols must be kept.
My code above doesn't work very well.

Why not just do something like the following? This way you don't need any grouping or flatMapping.
EDIT:
I was writing this by hand and indeed there were some little bugs, but I hoped the idea was clear. Here is tested code:
def processLine(s: String, stopWords: Set[String]): List[String] = {
  s.toLowerCase()
    .replaceAll(""""[^a-zA-Z\.]""", "")
    .replaceAll("""\.""", " .")
    .split("\\s+")
    .filter(!stopWords.contains(_))
    .toList
}
def main(args: Array[String]): Unit = {
  val conf1 = new SparkConf().setAppName("pre2").setMaster("local")
  val sc = new SparkContext(conf1)
  val rdd = sc.parallelize(
    List(
      "The product picture and part number match, but they together do not math the description.",
      "A necessity for the Garmin. Used the adapter to power the unit on my motorcycle. Works like a charm.",
      "This power supply did the job and got my computer back online in a hurry."
    )
  )
  val stopWords = sc.broadcast(
    Set("reviewtext", "a", "about", "above",
        "according", "accordingly",
        "across", "actually", "..."))
  val grouped_doc_words = rdd.map(processLine(_, stopWords.value))
  grouped_doc_words.collect().foreach(p => println(p))
}
As a result, this gives you:
List(the, product, picture, and, part, number, match,, but, they, together, do, not, math, the, description, .)
List(necessity, for, the, garmin, ., used, the, adapter, to, power, the, unit, on, my, motorcycle, ., works, like, charm, .)
List(this, power, supply, did, the, job, and, got, my, computer, back, online, in, hurry, .)
Now if you want a string instead of a list, just do:
grouped_doc_words.map(_.mkString(" "))
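Collecting after that map gives one string per document, e.g. (a small sketch, just joining the tokens shown above):
grouped_doc_words
  .map(_.mkString(" "))
  .collect()
  .foreach(println)
// prints e.g. "the product picture and part number match, but they together do not math the description ."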

I think there is a bug at the marked line:
var grouped_doc_words = rdd.flatMap({ (line) =>
  val words = line.map(filterfunc).filter(word_filter.value) // **
  words.map(w => {
    (line.hashCode(), w)
  })
}).groupByKey()
Here:
line.map(filterfunc)
should be:
filterfunc(line)
Explanation:
line is a String. map runs over a collection of items. When you do line.map(...), it runs the passed function on each Char of the String, which is not what you want.
scala> val line2 = "This is a long string"
line2: String = This is a long string
scala> line2.map(_.length)
<console>:13: error: value length is not a member of Char
line2.map(_.length)
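By contrast, map over a collection of Strings does what you expect:
scala> List("This", "is", "a", "long", "string").map(_.length)
res0: List[Int] = List(4, 2, 1, 4, 6)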
Additionally, I don't know why you are using this in filterfunc:
.map(_.replaceAll( """\W""", "")
I am not able to run spark-shell properly at my end. Can you please report back whether these fixes solve your problem?
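For illustration, the flatMap with that fix applied would look roughly like this (a sketch only, assuming filterfunc returns the cleaned words for a whole line and stopWords is the broadcast list from your question):
val grouped_doc_words = rdd.flatMap { line =>
  // apply filterfunc to the whole line instead of mapping over its Chars,
  // then drop the broadcast stop words
  val words = filterfunc(line).filter(w => !stopWords.value.contains(w))
  words.map(w => (line.hashCode(), w))
}.groupByKey()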

Related

Converting discrete chunks of Stdin to usable form

Simply, if the user pastes a chunk of text (multiple lines) into the console all at once, I want to be able to grab that chunk and use it.
Currently my code is
val stringLines: List[String] = io.Source.stdin.getLines().toList
doStuff(stringLines)
However, doStuff is never called. I realize the stdin iterator doesn't have an 'end', but how do I get the input as it currently is? I've checked many SO answers, but none of them work for multiple lines that need to be collated. I need all the lines of the user input at once, and it will always come as a single paste of data.
This is a rough outline, but it seems to work. I had to delve into Java Executors, which I hadn't encountered before, and I might not be using them correctly.
You might want to play around with the timeout value, but 10 milliseconds passes my tests.
import java.util.concurrent.{Callable, Executors, TimeUnit}
import scala.util.{Failure, Success, Try}

def getChunk: List[String] = { // blocks until StdIn has data
  val executor = Executors.newSingleThreadExecutor()
  val callable = new Callable[String]() {
    def call(): String = io.StdIn.readLine()
  }
  def nextLine(acc: List[String]): List[String] =
    Try { executor.submit(callable).get(10L, TimeUnit.MILLISECONDS) } match {
      case Success(str) => nextLine(str :: acc)
      case Failure(_) =>
        executor.shutdownNow() // should test for the Failure type
        acc.reverse
    }
  nextLine(List(io.StdIn.readLine())) // this is the blocking part
}
Usage is quite simple.
val input: List[String] = getChunk
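and from there the original snippet works as intended (doStuff being your own function):
doStuff(input)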

If I knew what was going on here, I wouldn't ask. Error given is: InvalidKeySpecException: Password is not ASCII

Ok, so basically, if I am in the console (Intellij) and I type FileScramble.getRandomPW, I get an ASCII password. But if I run the command in the code, I don't. Instead, I get "org.jasypt.exceptions.EncryptionInitializationException: InvalidKeySpecException: Password is not ASCII."
Here is a screen shot of what I mean.
The fact that I've been up and down that block of code so many times leads me to believe that I'm missing something fundamental in the Scala language. The try-catch of the getRandomPW block is never triggered. And, like I said, if I call it from the console, I get only ASCII.
The program is just going to scramble the contents of a file before deletion. It's by no means secure -- it's an exercise. It's me getting familiar with 1) Scala, 2) encryption, and 3) sbt.
So here is the relevant code:
import java.io.{BufferedOutputStream, File, FileOutputStream, InputStream}
import java.nio.ByteBuffer
import java.security.SecureRandom
import org.jasypt.util.binary.BasicBinaryEncryptor

object FileScramble {
  val base64chars = ('a' to 'z').union('A' to 'Z').union(0 to 9).union(List('/', '+'))

  def byteArrayToBase64(x: java.nio.ByteBuffer): String = {
    // convert to string and filter out anything but base64chars
    val nowString = new String(x.array.takeWhile(_ != 0), "UTF-8")
    nowString.filter(base64chars.contains(_))
  }

  def writeBytes(data: Stream[Byte], file: File) = {
    val target = new BufferedOutputStream(new FileOutputStream(file))
    try data.foreach(target.write(_)) finally target.close
  }

  def getRandomPW: String = {
    try {
      var output: String = ""
      while (output.length() < 10) {
        // val r = scala.util.Random
        val r = SecureRandom.getInstance("SHA1PRNG")
        var bytePW: Array[Byte] = new Array[Byte](1000)
        r.nextBytes(bytePW)
        // get 1000 random bytes into a ByteBuffer
        val preString = ByteBuffer.allocate(1000).put(bytePW)
        // get a random base 64 password at least 10 chars long
        output = byteArrayToBase64(preString)
      }
      output
    }
    catch {
      case e: Exception => e.getMessage()
    }
  }

  def main(args: Array[String]): Unit = {
    val fileHandle = new java.io.File(args(0))
    // https://github.com/liufengyun/scala-bug
    val source = scala.io.Source.fromFile(fileHandle, "ISO-8859-1")
    // source = new MyInputStream(dataStream)
    val byteArray = source.map(_.toByte).toArray
    // val byteStream = source.map(_.toByte).toStream
    source.close()
    var binaryEncryptor = new BasicBinaryEncryptor()
    val pw = getRandomPW
    println("BEGIN: " + pw + ":END")
    binaryEncryptor.setPassword(pw)
    val encryptedOut = binaryEncryptor.encrypt(byteArray).toStream
    writeBytes(encryptedOut, fileHandle)
  }
}
Honestly, I've been up and down the block for a few hours and have not come up with any ideas as to what could be happening. It's by far the biggest head-scratcher I've had recently, to the point that I've asked SO a question for the first time in several years.
Your help is appreciated! I thank you in advance, whether you can help or not.
You have only one small, elusive mistake - when you're trying to add the numeric characters 0 - 9, you should add union('0' to '9'), instead of union(0 to 9) - otherwise you're adding non-ASCII characters (unicode values 0 - 9...) and thus getting the (justifiable) exception.
@TzachZohar has it exactly right.
What you might also consider, though, is letting the compiler help you out a bit more by adding your expected type.
val base64anys: Seq[Char] = ('a' to 'z').union('A' to 'Z').union(0 to 9).union(List('/', '+'))
does not compile. So you would have seen the error.
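For reference, the corrected declaration with the explicit type compiles and stays ASCII-only:
// '0' to '9' adds the digit characters, not the raw code points 0 to 9
val base64chars: Seq[Char] = ('a' to 'z').union('A' to 'Z').union('0' to '9').union(List('/', '+'))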

spark scala - replace text if exists in list

I have 2 datasets.
One is a dataframe with a bunch of data, one column has comments (a string).
The other is a list of words.
If a comment contains a word in the list, I want to replace the word in the comment with ##### and return the comment in full with the replaced words.
Here's some sample data:
CommentSample.txt
1 A badword small town
2 "Love the truck, though rattle is annoying."
3 Love the paint!
4
5 "Like that you added the ""oh badword2"" handle to passenger side."
6 "badword you. specific enough for you, badword3?"
7 This car is a piece if badword2
ProfanitySample.txt
badword
badword2
badword3
Here's my code so far:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

case class Response(UniqueID: Int, Comment: String)

val response = sc.textFile("file:/data/CommentSample.txt")
  .map(_.split("\t"))
  .filter(_.size == 2)
  .map(r => Response(r(0).trim.toInt, r(1).trim.toString, r(10).trim.toInt))
  .toDF()

var profanity = sc.textFile("file:/data/ProfanitySample.txt").map(x => (x.toLowerCase())).toArray()

def replaceProfanity(s: String): String = {
  val l = s.toLowerCase()
  val r = "#####"
  if (profanity.contains(s))
    r
  else
    s
}

def processComment(s: String): String = {
  val commentWords = sc.parallelize(s.split(' '))
  commentWords.foreach(replaceProfanity)
  commentWords.collect().mkString(" ")
}

response.select(processComment("Comment")).show(100)
It compiles, it runs, but the words are not replaced.
I don't know how to debug in Scala.
I'm totally new! This is my first project ever!
Many thanks for any pointers.
-M
First, I think the usecase you describe here won't benefit much from the use of DataFrames - it's simpler to implement using RDDs only (DataFrames are mostly convenient when your transformations can easily be described using SQL, which isn't the case here).
So - here's a possible implementation using RDDs. This assumes the list of profanities isn't too large (i.e. up to ~thousands), so we can collect it into non-distributed memory. If that's not the case, a different approach (involving a join) might be needed.
import org.apache.spark.rdd.RDD

case class Response(UniqueID: Int, Comment: String)

val mask = "#####"

val responses: RDD[Response] = sc.textFile("file:/data/CommentSample.txt")
  .map(_.split("\t"))
  .filter(_.size == 2)
  .map(r => Response(r(0).trim.toInt, r(1).trim))

val profanities: Array[String] = sc.textFile("file:/data/ProfanitySample.txt").collect()

val result = responses.map(r => {
  // using foldLeft here means we'll replace profanities one by one,
  // with the result of each replace as the input of the next,
  // starting with the original comment
  profanities.foldLeft(r.Comment)({
    case (updatedComment, profanity) => updatedComment.replaceAll(s"(?i)\\b$profanity\\b", mask)
  })
})
result.take(10).foreach(println) // just printing some examples...
Note that the case-insensitivity and the "words only" limitations are implemented in the regex itself: "(?i)\\bSomeWord\\b".
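As a quick sanity check of that regex outside Spark (plain Scala, with sample values taken from the data above):
val profanities = Array("badword", "badword2")
val comment = "badword you. This car is a piece if badword2"
val masked = profanities.foldLeft(comment) {
  case (c, p) => c.replaceAll(s"(?i)\\b$p\\b", "#####")
}
println(masked) // ##### you. This car is a piece if #####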

Scala, finding max value in arrays

First time I've had to ask a question here, there is not enough info on Scala out there for a newbie like me.
Basically what I have is a file filled with hundreds of thousands of lists formatted like this:
(type, date, count, object)
Rows look something like this:
(food, 30052014, 400, banana)
(food, 30052014, 2, pizza)
All I need to do is find the one row with the highest count.
I know I did this a couple of months ago but can't seem to wrap my head around it now. I'm sure I can do this without a function too. All I want to do is set a value and put that row in it but I can't figure it out.
I think basically what I want to do is a Math.max on the 3rd element in the lists, but I just can't get it.
Any help will be kindly appreciated. Sorry if my wording or formatting of this question isn't the best.
EDIT: There's some extra info I've left out that I should probably add:
All the records are stored in a tsv file. I've done this to split them:
val split_food = food.map(_.split("\t"))
so basically I think I need to use split_food... somehow
Modified version of @Szymon's answer with your edit addressed:
val split_food = food.map(_.split("\t"))
val max_food = split_food.maxBy(tokens => tokens(2).toInt)
or, analogously:
val max_food = split_food.maxBy { case Array(_, _, count, _) => count.toInt }
In case you're using Apache Spark's RDD, which has a limited number of the usual Scala collection methods, you have to go with reduce:
val max_food = split_food.reduce { (max: Array[String], current: Array[String]) =>
  val curCount = current(2).toInt
  val maxCount = max(2).toInt // you probably would want to preprocess all items,
                              // so .toInt will not be called again and again
  if (curCount > maxCount) current else max
}
You should use maxBy function:
case class Purchase(category: String, date: Long, count: Int, name: String)

object Purchase {
  // split returns an Array, so match with an Array pattern
  def apply(s: String) = s.split("\t") match {
    case Array(cat, date, count, name) => Purchase(cat, date.toLong, count.toInt, name)
  }
}

foodRows.map(row => Purchase(row)).maxBy(_.count)
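For example, given one tab-separated row shaped like your data:
val p = Purchase("food\t30052014\t400\tbanana")
// p: Purchase = Purchase(food,30052014,400,banana)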
Simply:
case class Record(food:String, date:String, count:Int)
val l = List(Record("ciccio", "x", 1), Record("buffo", "y", 4), Record("banana", "z", 3))
l.maxBy(_.count)
>>> res8: Record = Record(buffo,y,4)
Not sure if you got the answer yet, but I had the same issues with maxBy. I found that once I ran the package import (import scala.io.Source), I was able to use maxBy and it worked.

Scala: How to avoid reading the file twice

Here is the sample program I am working on: it reads a file with a list of values, one per line. I have to add all these values, converting them to Double, and I also need to sort them. Here is what I came up with so far, and it is working fine.
import scala.io.Source

object Expense {
  def main(args: Array[String]): Unit = {
    val lines = Source.fromFile("c://exp.txt").getLines()
    val sum: Double = lines.foldLeft(0.0)((i, s) => i + s.replaceAll(",", "").toDouble)
    println("Total => " + sum)
    println((Source.fromFile("c://exp.txt").getLines() map (_.replaceAll(",", "").toDouble)).toList.sorted)
  }
}
The question here is: as you can see, I am reading the file twice, and I want to avoid that. Since Source.fromFile("c://exp.txt").getLines() gives you an iterator, I can loop through it only once; after that it is exhausted, so I can't reuse the lines for sorting and need to read the file again. Also, I don't want to store them in a temporary list. Is there any elegant way of doing this in a functional way?
Convert it to a List, so you can reuse it:
val lines = Source.fromFile("c://exp.txt").getLines().toList
As per my comment, convert it to a list, and rewrite as follows:
import scala.io.Source

object Expense {
  def main(args: Array[String]): Unit = {
    // materialize the lines once so they can be traversed for both the sum and the sort
    val lines = Source.fromFile("c://exp.txt").getLines().toList map (_.replaceAll(",", "").toDouble)
    val sum: Double = lines.foldLeft(0.0)((i, s) => i + s)
    println("Total => " + sum)
    println(lines.sorted)
  }
}
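The .toList is what makes the difference: a plain Iterator is single-pass, which is why the unconverted version would print an empty sorted list. A quick REPL illustration:
scala> val it = Iterator(1.0, 2.0, 3.0)
it: Iterator[Double] = non-empty iterator
scala> it.sum
res0: Double = 6.0
scala> it.toList // already exhausted by the sum
res1: List[Double] = List()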