Extract numbers and store them in a variable in Scala and Spark

I have a file like below:
0; best wrap ear market pair pair break make
1; time sennheiser product better earphone fit
1; recommend headphone pretty decent full sound earbud design
0; originally buy work gym work well robust sound quality good clip
1; terrific sound great fit toss mine profuse sweater headphone
0; negative experienced sit chair back touch chair earplug displace hurt
...
and I want to extract the number and store it with each word for each document. I've tried:
var grouped_with_wt = data.flatMap({ (line) =>
  val words = line.split(";").split(" ")
  words.map(w => {
    val a =
    (line.hashCode(), (vocab_lookup.value(w), a))
  })
}).groupByKey()
The expected output is:
(1453543,(best,0),(wrap,0),(ear,0),(market,0),(pair,0),(break,0),(make,0))
(3942334,(time,1),(sennheiser,1),(product,1),(better,1),(earphone,1),(fit,1))
...
After generating the above results, I used them in this code to generate the final results:
val Beta = DenseMatrix.zeros[Int](V, S)
val Beta_c = grouped_with_wt.flatMap(kv => {
  kv._2.map(wt => {
    Beta(wt._1, wt._2) += 1
  })
})
Final results:
1 0
1 0
1 0
1 0
...
This code doesn't work well. Can anybody help me? I want code that produces results like the above.

val inputRDD = sc.textFile("input dir ")
val outRDD = inputRDD.map(r => {
  // Split "label; words..." into the label and the text.
  val tuple = r.split(";")
  val key = tuple(0)
  val words = tuple(1).trim().split(" ")
  // Pair every word with the document's label.
  val outArr = words.map(w => (w, key))
  (r.hashCode, outArr.mkString(","))
})
outRDD.saveAsTextFile("output dir")
Output:
(-1704185638,(best,0),(wrap,0),(ear,0),(market,0),(pair,0),(pair,0),(break,0),(make,0))
(147969209,(time,1),(sennheiser,1),(product,1),(better,1),(earphone,1),(fit,1))
(1145947974,(recommend,1),(headphone,1),(pretty,1),(decent,1),(full,1),(sound,1),(earbud,1),(design,1))
(838871770,(originally,0),(buy,0),(work,0),(gym,0),(work,0),(well,0),(robust,0),(sound,0),(quality,0),(good,0),(clip,0))
(934228708,(terrific,1),(sound,1),(great,1),(fit,1),(toss,1),(mine,1),(profuse,1),(sweater,1),(headphone,1))
(659513416,(negative,0),(experienced,0),(sit,0),(chair,0),(back,0),(touch,0),(chair,0),(earplug,0),(displace,0),(hurt,0))
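
As for the follow-up Beta step from the question: mutating Beta inside flatMap runs the closure on the Spark executors, each of which updates its own copy of the matrix, so the driver's Beta never sees the counts. A minimal sketch of one way around this, assuming Breeze's DenseMatrix and that vocab_lookup is a broadcast Map[String, Int] from word to row index (all names taken from the question):

import breeze.linalg.DenseMatrix

// `data` is the RDD[String] of "label; words..." lines from the question.
val pairsByDoc = data.map { line =>
  val Array(label, text) = line.split(";", 2)
  (line.hashCode, text.trim.split(" ").map(w => (w, label.trim.toInt)).toSeq)
}

// Collect the (word, label) pairs (or aggregate with reduceByKey for large
// data) and count them into the V x S matrix locally on the driver.
val Beta = DenseMatrix.zeros[Int](V, S)
for {
  (_, pairs) <- pairsByDoc.collect()
  (word, label) <- pairs
} Beta(vocab_lookup.value(word), label) += 1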

Related

How to generate a random Scala Int in Chisel code?

I am trying to implement the way-prediction technique in the RocketChip core (in-order). For this, I need to access each way separately. This is how the SRAM for tags looks after modification (a separate SRAM for each way):
val tag_arrays = Seq.fill(nWays) { SeqMem(nSets, UInt(width = tECC.width(1 + tagBits))) }
val tag_rdata = Reg(Vec(nWays, UInt(width = tECC.width(1 + tagBits))))
for ((tag_array, i) <- tag_arrays zipWithIndex) {
  tag_rdata(i) := tag_array.read(s0_vaddr(untagBits-1, blockOffBits), !refill_done && s0_valid)
}
And I want to access it like this:
when (refill_done) {
  val enc_tag = tECC.encode(Cat(tl_out.d.bits.error, refill_tag))
  tag_arrays(repl_way).write(refill_idx, enc_tag)
  ccover(tl_out.d.bits.error, "D_ERROR", "I$ D-channel error")
}
Here repl_way is a random Chisel UInt generated by an LFSR. But a Seq element can only be accessed by a Scala Int index, which causes a compilation error. I then tried to access it like this:
when (refill_done) {
  val enc_tag = tECC.encode(Cat(tl_out.d.bits.error, refill_tag))
  for (i <- 0 until nWays) {
    when (repl_way === i.U) { tag_arrays(i).write(refill_idx, enc_tag) }
  }
  ccover(tl_out.d.bits.error, "D_ERROR", "I$ D-channel error")
}
But then this assertion fires:
assert(PopCount(s1_tag_hit zip s1_tag_disparity map { case (h, d) => h && !d }) <= 1)
I am trying to modify the ICache.scala file. Any ideas on how to do this properly? Thanks!
I think you can just use a Vec here instead of a Seq:
val tag_arrays = Vec(nWays, SeqMem(nSets, UInt(width = tECC.width(1 + tagBits))))
A Vec allows indexing with a UInt.
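To illustrate the difference, here is a hypothetical standalone example (in current Chisel3 syntax, unlike the Chisel2-style snippets above): a Vec of registers can be read and written through a hardware UInt index, which a plain Scala Seq cannot.

import chisel3._
import chisel3.util.log2Ceil

class WaySelect(nWays: Int) extends Module {
  val io = IO(new Bundle {
    val way  = Input(UInt(log2Ceil(nWays).W))
    val data = Input(UInt(8.W))
    val out  = Output(UInt(8.W))
  })
  val regs = Reg(Vec(nWays, UInt(8.W)))
  regs(io.way) := io.data // dynamic (hardware) index: legal on a Vec
  io.out := regs(io.way)
}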

ZipInputStream.read in ZipEntry

I am reading a zip file using ZipInputStream. The zip file contains 4 CSV files. Some files are written out completely, some only partially. Please help me find the issue with the code below. Is there a limit on how much the ZipInputStream.read method can read into the buffer?
val zis = new ZipInputStream(inputStream)
Stream.continually(zis.getNextEntry).takeWhile(_ != null).foreach { file =>
  if (!file.isDirectory && file.getName.endsWith(".csv")) {
    val buffer = new Array[Byte](file.getSize.toInt)
    zis.read(buffer)
    val fo = new FileOutputStream("c:\\temp\\input\\" + file.getName)
    fo.write(buffer)
  }
You have not closed/flushed the files you attempted to write. It should be something like this (assuming Scala syntax — or is this Kotlin/Ceylon?):
val fo = new FileOutputStream("c:\\temp\\input\\" + file.getName)
try {
  fo.write(buffer)
} finally {
  fo.close()
}
You should also check the read count and keep reading if necessary, something like this:
var readBytes = 0
while (readBytes < buffer.length) {
  val r = zis.read(buffer, readBytes, buffer.length - readBytes)
  r match {
    case -1 => throw new IllegalStateException("Read terminated before reading everything")
    case _  => readBytes += r
  }
}
PS: Your example also seems to be missing some of the required closing }s.
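
Putting both fixes together, a minimal sketch (reusing the inputStream and output directory from the question; note that ZipEntry.getSize can return -1 when the size is not recorded in the archive):

import java.io.FileOutputStream
import java.util.zip.ZipInputStream

val zis = new ZipInputStream(inputStream)
Stream.continually(zis.getNextEntry).takeWhile(_ != null).foreach { entry =>
  if (!entry.isDirectory && entry.getName.endsWith(".csv")) {
    val buffer = new Array[Byte](entry.getSize.toInt)
    // A single read may return fewer bytes than requested, so loop until full.
    var readBytes = 0
    while (readBytes < buffer.length) {
      val r = zis.read(buffer, readBytes, buffer.length - readBytes)
      if (r == -1) throw new IllegalStateException("Read terminated before reading everything")
      readBytes += r
    }
    val fo = new FileOutputStream("c:\\temp\\input\\" + entry.getName)
    try fo.write(buffer) finally fo.close()
  }
}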

Simplify Scala loop to one line

How do I simplify this loop into a single expression using foreach, map, or something similar in Scala? I want to fold hitsArray into that shipList.filter call.
val hitsArray: Array[String] = T.split(" ")
for (hit <- hitsArray) {
  shipSize = shipList.length
  shipList = shipList.filter(!_.equalsIgnoreCase(hit))
}
if (shipList.length == 0) {
  shipSunk = shipSunk + 1
} else if (shipList.length < shipSize) {
  shipHit = shipHit + 1
}
To be fair, I don't understand why you reassign shipSize = shipList.length on every iteration, since only the value from the last pass is ever read.
T.split(" ").foreach { hit =>
  shipList = shipList.filter(!_.equalsIgnoreCase(hit))
}
which gets you to where you want to go. I've kept it at three lines because you want to emphasize that you're working via a side effect in that foreach. That said, I don't see any advantage to making it a one-liner; what you had before was perfectly readable.
Something like this maybe?
shipList.filter(ship => T.split(" ").forall(!_.equalsIgnoreCase(ship)))
Although it's cleaner if shipList is already all lower case:
shipList.filterNot(T.split(" ").map(_.toLowerCase) contains _)
Or if your T is large, move it outside the loop:
val hits = T.split(" ").map(_.toLowerCase)
shipList.filterNot(hits contains _)
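
A self-contained sketch of how the whole block could collapse into this style (the sample data is made up for illustration):

val T = "alpha bravo"
var shipList = List("Alpha", "Charlie")
var shipSunk = 0
var shipHit = 0

val hits = T.split(" ").map(_.toLowerCase)
val shipSize = shipList.length
shipList = shipList.filterNot(s => hits.contains(s.toLowerCase))
if (shipList.isEmpty) shipSunk += 1
else if (shipList.length < shipSize) shipHit += 1
// shipList == List("Charlie"), shipHit == 1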

Converting Imperative Expressions to Functional style paradigm

I have the following Scala snippet from my code. I am not able to convert it into functional style. I could do this elsewhere in my code, but I cannot change the snippet below. The issue is that only once the code has exhausted all the pattern-matching options should it fall back to "NA". The following code does that, but it's not in functional (for-yield) style:
var matches = new ListBuffer[List[String]]()
for (line <- caselist) {
  var count = 0
  for (pat <- pattern if (!pat.findAllIn(line).isEmpty)) {
    count += 1
    matches += pat.findAllIn(line).toList
  }
  if (count == 0) {
    matches += List("NA")
  }
}
return matches.toList
Your question is not entirely complete, so I can't be sure, but I believe the following will do the job:
for {
  line <- caselist
  matches = pattern.map(_.findAllIn(line).toList)
} yield matches.flatten match {
  case Nil => List("NA")
  case ms  => ms
}
This should also do the job: use flatMap and filter to generate the matches, checking each line for at least one match.
caselist.flatMap { line =>
  val results = pattern.map(pat => pat.findAllIn(line).toList)
  val filteredResults = results.filter(_.nonEmpty)
  if (filteredResults.isEmpty) List(List("NA"))
  else filteredResults
}
Functional doesn't mean you can't have intermediate named values.
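
For a concrete picture of the result shape, a small sketch with made-up patterns and input (applying the for-yield version above):

import scala.util.matching.Regex

val pattern: List[Regex] = List("""\d+""".r, """ERROR""".r)
val caselist = List("ERROR 42", "all good")

val result = for {
  line <- caselist
  found = pattern.map(_.findAllIn(line).toList)
} yield found.flatten match {
  case Nil => List("NA")
  case ms  => ms
}
// result == List(List("42", "ERROR"), List("NA"))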

java.nio.BufferUnderflowException when processing files in Scala

I got a problem similar to this guy's while processing a 4 MB log file. I'm actually processing multiple files simultaneously, but since I keep getting this exception, I decided to just test it on a single file:
val temp = Source.fromFile("./datasource/input.txt")
val dummy = new PrintWriter("test.txt")
var itr = 0
println("Default Buffer size: " + Source.DefaultBufSize)
try {
  for (chr <- temp) {
    dummy.print(chr.toChar)
    itr += 1
    if (itr == 75703) println("Passed line 85")
    if (itr % 256 == 0) { print("..." + itr); temp.reset; System.gc }
    if (itr == 75703) println("Passed line 87")
    if (itr % 2048 == 0) println("")
    if (itr == 75703) println("Passed line 89")
  }
} finally {
  println("\nFailed at itr = " + itr)
}
What always happens is that it fails at itr = 75703, and my output file is always 64 KB (65536 bytes exactly). No matter where I put temp.reset or System.gc, every experiment ends the same way.
It seems like the problem lies in some memory allocation, but I cannot find any useful information on it. Any idea how to solve this one?
Any help is greatly appreciated.
EDIT: Actually I want to process the input as binary files, so this technique is not a good fit; many have recommended that I use BufferedInputStream instead.
Why are you calling reset on the Source before it has finished iterating through the file?
val temp = Source.fromFile("./datasource/input.txt")
try {
  for (line <- temp.getLines) {
    //whatever
  }
} finally temp.reset
Should work just fine with no underflows. See also this question.
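
Regarding the edit about binary processing, a minimal sketch of the BufferedInputStream approach (reusing the file names from the question; the 8 KB buffer size is an arbitrary choice):

import java.io.{BufferedInputStream, FileInputStream, FileOutputStream}

val in = new BufferedInputStream(new FileInputStream("./datasource/input.txt"))
val out = new FileOutputStream("test.txt")
try {
  val buf = new Array[Byte](8192)
  // Copy raw bytes; read returns -1 at end of stream.
  var n = in.read(buf)
  while (n != -1) {
    out.write(buf, 0, n)
    n = in.read(buf)
  }
} finally {
  in.close()
  out.close()
}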