Split a list into several files in Scala

I have the following list,
10,44,22
10,47,12
15,38,3
15,41,30
16,44,15
16,47,18
22,38,21
22,41,42
34,44,40
34,47,36
40,38,39
40,41,42
45,38,27
45,41,30
46,44,45
46,47,48
I then write its content to a single file with the following code:
val fstream:FileWriter = new FileWriter("patSPO.csv")
var out:BufferedWriter = new BufferedWriter(fstream)
val sl = listSPO.sortBy(l => (l.sub, l.pre))
for ( a <- 0 to listSPO.size-1){
out.write(sl(a).sub.toString+","+sl(a).pre.toString+","+sl(a).obj.toString+"\n")
}
out.close()
However, I want to split the content across n files, so I tried the following for 4 files:
val fstream:FileWriter = new FileWriter("patSPO.csv")
val fstream1:FileWriter = new FileWriter("patSPO1.csv")
val fstream2:FileWriter = new FileWriter("patSPO2.csv")
val fstream3:FileWriter = new FileWriter("patSPO3.csv")
val fstream4:FileWriter = new FileWriter("patSPO4.csv")
var out:BufferedWriter = new BufferedWriter(fstream)
var out1:BufferedWriter = new BufferedWriter(fstream1)
var out2:BufferedWriter = new BufferedWriter(fstream2)
var out3:BufferedWriter = new BufferedWriter(fstream3)
var out4:BufferedWriter = new BufferedWriter(fstream4)
val b :Int = listSPO.size/4
val sl = listSPO.sortBy(l => (l.sub, l.pre))
for ( a <- 0 to listSPO.size-1){
out.write(sl(a).sub.toString+","+sl(a).pre.toString+","+sl(a).obj.toString+"\n")
}
for ( a <- 0 to b-1){
out1.write(sl(a).sub.toString+","+sl(a).pre.toString+","+sl(a).obj.toString+"\n")
}
for ( a <- b to (b*2)-1){
out2.write(sl(a).sub.toString+","+sl(a).pre.toString+","+sl(a).obj.toString+"\n")
}
for ( a <- b*2 to (b*3)-1){
out3.write(sl(a).sub.toString+","+sl(a).pre.toString+","+sl(a).obj.toString+"\n")
}
for ( a <- b*3 to (b*4)-1){
out4.write(sl(a).sub.toString+","+sl(a).pre.toString+","+sl(a).obj.toString+"\n")
}
out.close()
out1.close()
out2.close()
out3.close()
out4.close()
My question is whether there is a general way to do this where I only specify the number of files to generate (for example 32), without writing the out, the for and the fstream 32 times.

First, let's make some utilities to eliminate ugly boilerplate:
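import java.io.{BufferedWriter, FileWriter}
// Lazily opens `number` writers named prefix + index + ext (the first file gets no index)
def open(
  number: Int = 1,
  prefix: String,
  ext: String
) = (0 until number)
  .iterator
  .map { i =>
    "%s%s%s".format(prefix, if (i == 0) "" else i.toString, ext)
  }.map(new FileWriter(_))
  .map(new BufferedWriter(_))
// Formats one element as a CSV line
def toString(l: WhateverTheType) = Seq(
  l.sub,
  l.pre,
  l.obj
).mkString(",") + "\n"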
And now to the implementation:
def writeToFiles(
  listSPO: List[WhateverTheType],
  numFiles: Int = 1,
  prefix: String = "patSPO",
  ext: String = ".csv"
) = listSPO
  .grouped(math.max(1, (listSPO.size + numFiles - 1) / numFiles)) // chunk size rounded up, so at most numFiles groups
  .zip(open(numFiles, prefix, ext))
  .foreach { case (input, file) =>
    try {
      // write every element of this group, one CSV line each
      input.foreach(l => file.write(toString(l)))
    } finally {
      file.close()
    }
  }

Assuming the objects in the list have this type (otherwise, replace SPO in the code below with your type):
case class SPO(sub: Int, pre: Int, obj: Int)
This should do it:
val sl = listSPO.sortBy(l => (l.sub, l.pre))
val files = 5 // whatever number you want
// helper function to write a single list - similar to your implementation but shorter
def writeListToFile(name: String, list: List[SPO]): Unit = {
val writer = new BufferedWriter(new FileWriter(name))
list.foreach(spo => writer.write(s"${spo.sub},${spo.pre},${spo.obj}\n"))
writer.close()
}
sl.grouped(math.max(1, (sl.size + files - 1) / files)) // split into sublists; chunk size is rounded up so at most `files` sublists are produced
  .zipWithIndex // add index of sublists for file name
  .foreach {
    case (sublist, 0) => writeListToFile("patSPO.csv", sublist) // if indeed you want the first file name NOT to include the index
    case (sublist, index) => writeListToFile(s"patSPO$index.csv", sublist)
  }
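Note that grouped takes a chunk size, not a chunk count, so it can return fewer sublists than requested when the list does not divide evenly. If exactly n files are needed, a minimal sketch along these lines may help. The helper name splitInto is hypothetical (it is not part of either answer), and it only splits the list; writing each chunk to a file works as shown above.

```scala
// Hypothetical helper: split `xs` into exactly `n` consecutive chunks
// whose sizes differ by at most one element.
def splitInto[A](xs: List[A], n: Int): List[List[A]] = {
  require(n > 0, "n must be positive")
  val base = xs.size / n   // minimum chunk size
  val extra = xs.size % n  // the first `extra` chunks get one more element
  val sizes = List.tabulate(n)(i => if (i < extra) base + 1 else base)
  // Walk the list once, peeling off each chunk in turn
  sizes.foldLeft((xs, List.empty[List[A]])) { case ((rest, acc), size) =>
    val (chunk, tail) = rest.splitAt(size)
    (tail, chunk :: acc)
  }._2.reverse
}

val chunks = splitInto((1 to 10).toList, 4)
```

When the list has fewer elements than n, the trailing chunks are empty, which would produce empty files rather than missing ones.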

Related

Using foreach in the countWords example gives a "cannot resolve symbol" error

Here is a countWords example (Scala).
[original]
def countWords(text: String): mutable.Map[String, Int] = {
val counts = mutable.Map.empty[String, Int]
for (rawWord <- text.split("[ ,!.]+")) {
val word = rawWord.toLowerCase
val oldCount =
if (counts.contains(word)) counts(word)
else 0
counts += (word -> (oldCount + 1))
}
return counts
}
[my code]
def countWords2(text: String):mutable.Map[String, Int] = {
val counts = mutable.Map.empty[String, Int]s
text.split("[ ,!.]").foreach(word =>
val lowWord = word.toLowerCase()
val oldCount = if (counts.contains(lowWord)) counts(lowWord) else 0
counts += (lowWord -> (oldCount + 1))
)
return counts
}
I tried to convert the "for()" loop to "foreach", but I got a "cannot resolve symbol" error message.
How should foreach be used in this case?
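A minimal sketch of one way to write this with foreach, assuming Scala 2: a function literal passed in parentheses can hold only a single expression, so a body with several statements needs braces. The stray s after mutable.Map.empty[String, Int] in the attempt above is also worth checking, since it is an unresolved identifier. The sample sentence is made up for illustration.

```scala
import scala.collection.mutable

def countWords2(text: String): mutable.Map[String, Int] = {
  val counts = mutable.Map.empty[String, Int]
  // Braces (not parentheses) let the function literal hold several statements.
  // The trailing + in the regex avoids empty "words" between adjacent delimiters.
  text.split("[ ,!.]+").foreach { word =>
    val lowWord = word.toLowerCase
    counts(lowWord) = counts.getOrElse(lowWord, 0) + 1
  }
  counts
}

val result = countWords2("See Spot run! Run, Spot. Run!")
```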

Evaluating multiple filters in a single pass

I have created the RDD below, and I need to apply a series of filters to the same dataset to derive different counters and aggregates.
Is there a way to apply these filters and compute the aggregates in a single pass, so that Spark does not go over the same dataset multiple times?
val res = df.rdd.map(row => {
// ............... Generate data here for each row.......
})
res.persist(StorageLevel.MEMORY_AND_DISK)
val all = res.count()
val stats1 = res.filter(row => row.getInt(1) > 0)
val stats1Count = stats1.count()
val stats1Agg = stats1.map(r => r.getInt(1)).mean()
val stats2 = res.filter(row => row.getInt(2) > 0)
val stats2Count = stats2.count()
val stats2Agg = stats2.map(r => r.getInt(2)).mean()
You can use aggregate:
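case class Stats(count: Long = 0L, sum: Long = 0L) {
  // Mean as a Double; NaN when no value passed the filter
  def mean: Double = if (count == 0) Double.NaN else sum.toDouble / count
  // Merge two partial results (used when combining partitions)
  def +(s: Stats): Stats = Stats(count + s.count, sum + s.sum)
  // Fold in one value, keeping only positives (`<-` is reserved in Scala, so a named method is used)
  def add(n: Int): Stats = if (n > 0) copy(count + 1, sum + n) else this
}
val (stats1, stats2) = res.aggregate(Stats() -> Stats())(
  { (s, row) => (s._1.add(row.getInt(1)), s._2.add(row.getInt(2))) },
  { (a, b) => (a._1 + b._1, a._2 + b._2) }
)
val (stats1Count, stats1Agg, stats2Count, stats2Agg) = (stats1.count, stats1.mean, stats2.count, stats2.mean)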
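The single-pass idea can be tried without a cluster. The sketch below uses a plain Seq and foldLeft in place of an RDD and aggregate; the Acc class and the sample rows are made up for illustration. On one machine a single fold function is enough, whereas Spark's aggregate also needs a second function to merge per-partition results.

```scala
// Hypothetical accumulator: counts and sums only the positive values it sees
final case class Acc(count: Long = 0L, sum: Long = 0L) {
  def add(n: Int): Acc = if (n > 0) Acc(count + 1, sum + n) else this
  def mean: Double = if (count == 0) Double.NaN else sum.toDouble / count
}

// Made-up rows: (id, col1, col2)
val rows = Seq((1, 5, 0), (2, -3, 4), (3, 7, 6))

// One traversal updates both accumulators at once
val (s1, s2) = rows.foldLeft((Acc(), Acc())) {
  case ((a, b), (_, c1, c2)) => (a.add(c1), b.add(c2))
}
```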

Iterating over cogrouped RDD

I used the cogroup function and obtained the following RDD:
org.apache.spark.rdd.RDD[(Int, (Iterable[(Int, Long)], Iterable[(Int, Long)]))]
Before the map operation, the joined object looks like this:
RDD[(Int, (Iterable[(Int, Long)], Iterable[(Int, Long)]))]
(-2095842000,(CompactBuffer((1504999740,1430096464017), (613904354,1430211912709), (-1514234644,1430288363100), (-276850688,1430330412225)),CompactBuffer((-511732877,1428682217564), (1133633791,1428831320960), (1168566678,1428964645450), (-407341933,1429009306167), (-1996133514,1429016485487), (872888282,1429031501681), (-826902224,1429034491003), (818711584,1429111125268), (-1068875079,1429117498135), (301875333,1429121399450), (-1730846275,1429131773065), (1806256621,1429135583312))))
(352234000,(CompactBuffer((1350763226,1430006650167), (-330160951,1430320010314)),CompactBuffer((2113207721,1428994842593), (-483470471,1429324209560), (1803928603,1429426861915))))
Now I want to do the following:
val globalBuffer = ListBuffer[Double]()
val joined = data1.cogroup(data2).map(x => {
val listA = x._2._1.toList
val listB = x._2._2.toList
for(tupleB <- listB) {
val localResults = ListBuffer[Double]()
val itemToTest = Set(tupleB._1)
val tempList = ListBuffer[(Int, Double)]()
for(tupleA <- listA) {
val tValue = someFunctionReturnDouble(tupleB._2, tupleA._2)
val i = (tupleA._1, tValue)
tempList += i
}
val sortList = tempList.sortWith(_._2 > _._2).slice(0,20).map(i => i._1)
val intersect = sortList.toSet.intersect(itemToTest)
if (intersect.size > 0)
localResults += 1.0
else localResults += 0.0
val normalized = sum(localResults.toList)/localResults.size
globalBuffer += normalized
}
})
//method sum
def sum(xs: List[Double]): Double = xs.sum
At the end of this I expected joined to be a list of Double values, but when I looked at it, it was Unit. I also feel this is not the Scala way of doing it. How do I obtain globalBuffer as the final result?
Hmm, if I understood your code correctly, it could benefit from these improvements:
val joined = data1.cogroup(data2).map(x => {
  val listA = x._2._1.toList
  val listB = x._2._2.toList
  val localResults = listB.map {
    case (intBValue, longBValue) =>
      val itemToTest = intBValue // it's always one element
      val tempList = listA.map {
        case (intAValue, longAValue) =>
          (intAValue, someFunctionReturnDouble(longBValue, longAValue))
      }
      val sortList = tempList.sortBy(-_._2).slice(0, 20).map(i => i._1) // descending by score
      if (sortList.toSet.contains(itemToTest)) { 1.0 } else { 0.0 }
      // no real need to convert to a set for 20 elements, by the way
  }
  sum(localResults) / localResults.size
})
Transformations of RDDs are not going to modify globalBuffer. Copies of globalBuffer are made and sent out to each of the workers, but any modifications to these copies on the workers will never modify the globalBuffer that exists on the driver (the one you have defined outside the map on the RDD.) Here's what I do (with a few additional modifications):
val joined = data1.cogroup(data2) map { x =>
val iterA = x._2._1
val iterB = x._2._2
var count, positiveCount = 0
val tempList = ListBuffer[(Int, Double)]()
for (tupleB <- iterB) {
tempList.clear
for(tupleA <- iterA) {
val tValue = someFunctionReturnDouble(tupleB._2, tupleA._2)
tempList += ((tupleA._1, tValue))
}
val sortList = tempList.sortWith(_._2 > _._2).iterator.take(20)
if (sortList.exists(_._1 == tupleB._1)) positiveCount += 1
count += 1
}
positiveCount.toDouble/count
}
At this point you can obtain a local copy of the proportions by using joined.collect.
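The per-key computation can be checked on ordinary collections before running it inside map. The sketch below is an illustration only: score is a made-up stand-in for someFunctionReturnDouble, and hitRate and its topN parameter are hypothetical names.

```scala
// Made-up scoring function standing in for someFunctionReturnDouble:
// closer timestamps score higher
def score(b: Long, a: Long): Double = -math.abs(b - a).toDouble

// Fraction of B items whose id appears among the topN best-scoring A items
def hitRate(as: Seq[(Int, Long)], bs: Seq[(Int, Long)], topN: Int = 20): Double = {
  val hits = bs.count { case (bId, bTs) =>
    as.map { case (aId, aTs) => (aId, score(bTs, aTs)) }
      .sortBy(-_._2)        // descending by score
      .take(topN)
      .exists(_._1 == bId)
  }
  if (bs.isEmpty) Double.NaN else hits.toDouble / bs.size
}

val as = Seq((1, 100L), (2, 200L), (3, 300L))
val bs = Seq((2, 210L), (9, 110L))
val rate = hitRate(as, bs, topN = 2)
```

Because the function returns its result instead of appending to a shared buffer, it can be called directly from the function passed to map.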

Spark: split rows and accumulate

I have this code:
val rdd = sc.textFile("sample.log")
val splitRDD = rdd.map(r => StringUtils.splitPreserveAllTokens(r, "\\|"))
val rdd2 = splitRDD.filter(...).map(row => createRow(row, fieldsMap))
sqlContext.createDataFrame(rdd2, structType).save(
"org.apache.phoenix.spark", SaveMode.Overwrite, Map("table" -> table, "zkUrl" -> zkUrl))
def createRow(row: Array[String], fieldsMap: ListMap[Int, FieldConfig]): Row = {
//add additional index for invalidValues
val arrSize = fieldsMap.size + 1
val arr = new Array[Any](arrSize)
var invalidValues = ""
for ((k, v) <- fieldsMap) {
val valid = ...
var value : Any = null
if (valid) {
value = row(k)
// if (v.code == "SOURCE_NAME") --> 5th column in the row
// sourceNameCount = row(k).split(",").size
} else {
invalidValues += v.code + " : " + row(k) + " | "
}
arr(k) = value
}
arr(arrSize - 1) = invalidValues
Row.fromSeq(arr.toSeq)
}
fieldsMap contains the mapping of the input columns as (index, FieldConfig) pairs, where the FieldConfig class holds the "code" and "dataType" values.
TOPIC -> (0, v.code = "TOPIC", v.dataType = "String")
GROUP -> (1, v.code = "GROUP")
SOURCE_NAME1,SOURCE_NAME2,SOURCE_NAME3 -> (4, v.code = "SOURCE_NAME")
This is the sample.log:
TOPIC|GROUP|TIMESTAMP|STATUS|SOURCE_NAME1,SOURCE_NAME2,SOURCE_NAME3|
SOURCE_TYPE1,SOURCE_TYPE2,SOURCE_TYPE3|SOURCE_COUNT1,SOURCE_COUNT2,SOURCE_COUNT3|
DEST_NAME1,DEST_NAME2,DEST_NAME3|DEST_TYPE1,DEST_TYPE2,DEST_TYPE3|
DEST_COUNT1,DEST_COUNT2,DEST_COUNT3|
The goal is to split each input line (sample.log) into as many rows as it has source names. In the example above, the output will have 3 rows:
TOPIC|GROUP|TIMESTAMP|STATUS|SOURCE_NAME1|SOURCE_TYPE1|SOURCE_COUNT1|
|DEST_NAME1|DEST_TYPE1|DEST_COUNT1|
TOPIC|GROUP|TIMESTAMP|STATUS|SOURCE_NAME2|SOURCE_TYPE2|SOURCE_COUNT2|
DEST_NAME2|DEST_TYPE2|DEST_COUNT2|
TOPIC|GROUP|TIMESTAMP|STATUS|SOURCE_NAME3|SOURCE_TYPE3|SOURCE_COUNT3|
|DEST_NAME3|DEST_TYPE3|DEST_COUNT3|
This is the new code I am working on (still using createRow defined above):
val rdd2 = splitRDD.filter(...).flatMap(row => {
val srcName = row(4).split(",")
val srcType = row(5).split(",")
val srcCount = row(6).split(",")
val destName = row(7).split(",")
val destType = row(8).split(",")
val destCount = row(9).split(",")
var newRDD: ArrayBuffer[Row] = new ArrayBuffer[Row]()
//if (srcName != null) {
println("\n\nsrcName.size: " + srcName.size + "\n\n")
for (i <- 0 to srcName.size - 1) {
// missing column: destType can sometimes be null
val splittedRow: Array[String] = Row.fromSeq(Seq((row(0), row(1), row(2), row(3),
srcName(i), srcType(i), srcCount(i), destName(i), "", destCount(i)))).toSeq.toArray[String]
newRDD = newRDD ++ Seq(createRow(splittedRow, fieldsMap))
}
//}
Seq(Row.fromSeq(Seq(newRDD)))
})
Converting splittedRow to Array[String] (".toSeq.toArray[String]") fails with:
error: type arguments [String] do not conform to method toArray's type parameter bounds [B >: Any]
so I decided to build splittedRow like this instead:
val rowArr: Array[String] = new Array[String](10)
for (j <- 0 to 3) {
rowArr(j) = row(j)
}
rowArr(4) = srcName(i)
rowArr(5) = row(5).split(",")(i)
rowArr(6) = row(6).split(",")(i)
rowArr(7) = row(7).split(",")(i)
rowArr(8) = row(8).split(",")(i)
rowArr(9) = row(9).split(",")(i)
val splittedRow = rowArr
You could use a flatMap operation instead of a map operation to return multiple rows. Consequently, your createRow would be refactored to createRows(row: Array[String], fieldsMap: ListMap[Int, FieldConfig]): Seq[Row].
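A minimal sketch of the row-splitting step on plain arrays, without Spark or Row. The helper name explode and the sample line are made up; it assumes columns 0 to 3 are shared and columns 4 to 9 hold comma-separated lists, as in the sample log above.

```scala
// Hypothetical helper: expand one parsed line into one row per source name.
def explode(row: Array[String]): Seq[Array[String]] = {
  val shared = row.take(4)
  // limit -1 keeps trailing empty strings so positions stay aligned
  val lists = row.slice(4, 10).map(_.split(",", -1))
  // the number of source names (first list) decides how many rows come out
  val n = lists.headOption.map(_.length).getOrElse(0)
  (0 until n).map { i =>
    // a list shorter than the source-name list yields an empty field
    shared ++ lists.map(xs => if (i < xs.length) xs(i) else "")
  }
}

val line = "T|G|TS|OK|s1,s2|t1,t2|1,2|d1,d2|x1,x2|3,4"
val rows = explode(line.split("\\|", -1))
```

In the Spark job this kind of function would be passed to flatMap on splitRDD, with each resulting array then handed to createRow.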

Scala priority queue not ordering properly?

I'm seeing some strange behavior with Scala's collection.mutable.PriorityQueue. I'm performing an external sort and testing it with 1M records. Each time I run the test and verify the results, between 10 and 20 records are not sorted properly. When I replace the Scala PriorityQueue implementation with a java.util.PriorityQueue, it works 100% of the time. Any ideas?
Here's the code (sorry it's a bit long...). I test it using the tools gensort -a 1000000 and valsort from http://sortbenchmark.org/
def externalSort(inFileName: String, outFileName: String)
(implicit ord: Ordering[String]): Int = {
val MaxTempFiles = 1024
val TempBufferSize = 4096
val inFile = new java.io.File(inFileName)
/** Partitions input file and sorts each partition */
def partitionAndSort()(implicit ord: Ordering[String]):
List[java.io.File] = {
/** Gets block size to use */
def getBlockSize: Long = {
var blockSize = inFile.length / MaxTempFiles
val freeMem = Runtime.getRuntime().freeMemory()
if (blockSize < freeMem / 2)
blockSize = freeMem / 2
else if (blockSize >= freeMem)
System.err.println("Not enough free memory to use external sort.")
blockSize
}
/** Sorts and writes data to temp files */
def writeSorted(buf: List[String]): java.io.File = {
// Create new temp buffer
val tmp = java.io.File.createTempFile("external", "sort")
tmp.deleteOnExit()
// Sort buffer and write it out to tmp file
val out = new java.io.PrintWriter(tmp)
try {
for (l <- buf.sorted) {
out.println(l)
}
} finally {
out.close()
}
tmp
}
val blockSize = getBlockSize
var tmpFiles = List[java.io.File]()
var buf = List[String]()
var currentSize = 0
// Read input and divide into blocks
for (line <- io.Source.fromFile(inFile).getLines()) {
if (currentSize > blockSize) {
tmpFiles ::= writeSorted(buf)
buf = List[String]()
currentSize = 0
}
buf ::= line
currentSize += line.length() * 2 // 2 bytes per char
}
if (currentSize > 0) tmpFiles ::= writeSorted(buf)
tmpFiles
}
/** Merges results of sorted partitions into one output file */
def mergeSortedFiles(fs: List[java.io.File])
(implicit ord: Ordering[String]): Int = {
/** Temp file buffer for reading lines */
class TempFileBuffer(val file: java.io.File) {
private val in = new java.io.BufferedReader(
new java.io.FileReader(file), TempBufferSize)
private var curLine: String = ""
readNextLine() // prep first value
def currentLine = curLine
def isEmpty = curLine == null
def readNextLine() {
if (curLine == null) return
try {
curLine = in.readLine()
} catch {
case _: java.io.EOFException => curLine = null
}
if (curLine == null) in.close()
}
override protected def finalize() {
try {
in.close()
} finally {
super.finalize()
}
}
}
val wrappedOrd = new Ordering[TempFileBuffer] {
def compare(o1: TempFileBuffer, o2: TempFileBuffer): Int = {
ord.compare(o1.currentLine, o2.currentLine)
}
}
val pq = new collection.mutable.PriorityQueue[TempFileBuffer](
)(wrappedOrd)
// Init queue with item from each file
for (tmp <- fs) {
val buf = new TempFileBuffer(tmp)
if (!buf.isEmpty) pq += buf
}
var count = 0
val out = new java.io.PrintWriter(new java.io.File(outFileName))
try {
// Read each value off of queue
while (pq.size > 0) {
val buf = pq.dequeue()
out.println(buf.currentLine)
count += 1
buf.readNextLine()
if (buf.isEmpty) {
buf.file.delete() // don't need anymore
} else {
// re-add to priority queue so we can process next line
pq += buf
}
}
} finally {
out.close()
}
count
}
mergeSortedFiles(partitionAndSort())
}
My tests don't show any bugs in PriorityQueue.
import org.scalacheck._
import Prop._
import scala.collection.mutable.PriorityQueue
object PriorityQueueProperties extends Properties("PriorityQueue") {
def listToPQ(l: List[String]): PriorityQueue[String] = {
val pq = new PriorityQueue[String]
l foreach (pq +=)
pq
}
def pqToList(pq: PriorityQueue[String]): List[String] =
if (pq.isEmpty) Nil
else { val h = pq.dequeue; h :: pqToList(pq) }
property("Enqueued elements are dequeued in reverse order") =
forAll { (l: List[String]) => l.sorted == pqToList(listToPQ(l)).reverse }
property("Adding/removing elements doesn't break sorting") =
forAll { (l: List[String], s: String) =>
(l.size > 0) ==>
((s :: l.sorted.init).sorted == {
val pq = listToPQ(l)
pq.dequeue
pq += s
pqToList(pq).reverse
})
}
}
scala> PriorityQueueProperties.check
+ PriorityQueue.Enqueued elements are dequeued in reverse order: OK, passed 100 tests.
+ PriorityQueue.Adding/removing elements doesn't break sorting: OK, passed 100 tests.
If you could somehow reduce the input enough to make a test case, it would help.
I ran it with five million inputs several times, and the output always matched the expected result. My guess from looking at your code is that your Ordering is the problem (i.e. it's giving inconsistent answers).
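One way to test the ordering hypothesis in isolation is a small in-memory k-way merge. Keep in mind that mutable.PriorityQueue dequeues the largest element under its Ordering, whereas java.util.PriorityQueue returns the smallest, so the two are not drop-in replacements for each other with the same comparator. The sketch below is an illustration, not the original code: mergeSorted is a hypothetical name, and it reverses the ordering so that lines come out ascending.

```scala
import scala.collection.mutable

// Sketch of a k-way merge over inputs that are each already sorted ascending.
def mergeSorted(inputs: Seq[Iterator[String]])(implicit ord: Ordering[String]): List[String] = {
  // Queue entries are buffered iterators compared by their current head.
  // The ordering is reversed because dequeue() returns the largest element.
  val byHead: Ordering[BufferedIterator[String]] =
    Ordering.by[BufferedIterator[String], String](_.head)(ord.reverse)
  val pq = mutable.PriorityQueue.empty[BufferedIterator[String]](byHead)
  inputs.map(_.buffered).filter(_.hasNext).foreach(it => pq.enqueue(it))
  val out = List.newBuilder[String]
  while (pq.nonEmpty) {
    val it = pq.dequeue()
    out += it.next()
    // re-add only after removal, so the head never changes while queued
    if (it.hasNext) pq.enqueue(it)
  }
  out.result()
}

val merged = mergeSorted(Seq(Iterator("a", "d"), Iterator("b", "c"), Iterator.empty))
```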