Reading lines and raw bytes from the same source in Scala

I need to write code that does the following:
Connect to a TCP socket
Read a line ending in "\r\n" that contains a number N
Read N bytes
Use those N bytes
I am currently using the following code:
val socket = new Socket(InetAddress.getByName(host), port)
val in = socket.getInputStream
val out = new PrintStream(socket.getOutputStream)
val reader = new DataInputStream(in)
val baos = new ByteArrayOutputStream
val buffer = new Array[Byte](1024)
out.print(cmd + "\r\n")
out.flush()
val firstLine = reader.readLine.split("\\s")
if (firstLine(0) == "OK") {
  def read(written: Int, max: Int, baos: ByteArrayOutputStream): Array[Byte] = {
    if (written >= max) baos.toByteArray
    else {
      val count = reader.read(buffer, 0, buffer.length)
      baos.write(buffer, 0, count)
      read(written + count, max, baos)
    }
  }
  read(0, firstLine(1).toInt, baos)
} else {
  // RAISE something
}
baos.toByteArray()
The problem with this code is that the use of DataInputStream#readLine raises a deprecation warning, but I can't find a class that implements both read(...) and readLine(...). BufferedReader, for example, implements read, but it reads chars and not bytes. I could cast those chars to bytes, but I don't think that's safe.
Are there other ways to write something like this in Scala?
Thank you

Be aware that on the JVM a char takes 2 bytes, so the string "\r\n" occupies 4 bytes in memory. This is generally not true for strings stored outside of the JVM.
I think the safest way would be to read your input as raw bytes until you reach the binary representation of "\r\n". Then you can create a Reader (which turns bytes into JVM-compatible chars) over those first bytes, where you can be sure there is text only, parse it, and continue safely with the rest of the binary data.
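A minimal sketch of that idea (the helper names readHeaderLine and readPayload are mine, not from the answer); it assumes the header line is ASCII and reads it byte by byte straight off the InputStream, so no payload bytes get buffered away:
import java.io.{ByteArrayOutputStream, DataInputStream, InputStream}
import java.nio.charset.StandardCharsets

// Read raw bytes until "\r\n", decode only that header as text.
def readHeaderLine(in: InputStream): String = {
  val buf = new ByteArrayOutputStream
  var prev = -1
  var cur = in.read()
  while (cur != -1 && !(prev == '\r' && cur == '\n')) {
    buf.write(cur)
    prev = cur
    cur = in.read()
  }
  val bytes = buf.toByteArray
  // drop the trailing '\r' that was written before the '\n' was seen
  val len = if (bytes.nonEmpty && bytes.last == '\r') bytes.length - 1 else bytes.length
  new String(bytes, 0, len, StandardCharsets.US_ASCII)
}

// Read exactly n payload bytes from the same stream.
def readPayload(in: InputStream, n: Int): Array[Byte] = {
  val data = new Array[Byte](n)
  new DataInputStream(in).readFully(data) // loops internally until n bytes are read
  data
}
The header can then be split as before, e.g. readHeaderLine(in).split("\\s"), and the payload fetched with readPayload(in, firstLine(1).toInt).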

You can achieve the goal of using read(...) and readLine(...) in one class. The idea is to use BufferedReader.read(): Int. Because BufferedReader buffers the content, you can read one character at a time without a performance penalty.
The change can be (without Scala-style optimization):
import java.io.BufferedInputStream
import java.io.BufferedReader
import java.io.ByteArrayOutputStream
import java.io.InputStreamReader
import java.io.PrintStream
import java.net.InetAddress
import java.net.Socket

object ReadLines extends App {
  val host = "127.0.0.1"
  val port = 9090
  val socket = new Socket(InetAddress.getByName(host), port)
  val in = socket.getInputStream
  val out = new PrintStream(socket.getOutputStream)
  // val reader = new DataInputStream(in)
  val bufIns = new BufferedInputStream(in)
  val reader = new BufferedReader(new InputStreamReader(bufIns, "utf8"))
  val baos = new ByteArrayOutputStream
  val buffer = new Array[Byte](1024)
  val cmd = "get:"
  out.print(cmd + "\r\n")
  out.flush()
  val firstLine = reader.readLine.split("\\s")
  if (firstLine(0) == "OK") {
    def read(written: Int, max: Int, baos: ByteArrayOutputStream): Array[Byte] = {
      if (written >= max) {
        println("get: " + new String(baos.toByteArray))
        baos.toByteArray()
      } else {
        // val count = reader.read(buffer, 0, buffer.length)
        var count = 0
        var b = reader.read()
        while (b != -1) {
          buffer(count) = b.toByte
          count += 1
          // stop at the payload size and at the chunk buffer's capacity
          if (written + count < max && count < buffer.length) {
            b = reader.read()
          } else {
            b = -1
          }
        }
        baos.write(buffer, 0, count)
        read(written + count, max, baos)
      }
    }
    read(0, firstLine(1).toInt, baos)
  } else {
    // RAISE something
  }
  baos.toByteArray()
}
For testing, here is a server:
import java.io.PrintStream
import java.net.ServerSocket

object ReadLinesServer extends App {
  val serverSocket = new ServerSocket(9090)
  while (true) {
    val socket = serverSocket.accept()
    println("accepted a connection.")
    val ops = socket.getOutputStream()
    val printStream = new PrintStream(ops, true, "utf8")
    printStream.print("OK 2\r\n") // "2" is the payload length; alphanumeric chars are 1 byte each in UTF-8
    printStream.print("ab")
  }
}

This seems to be the best solution I can find:
val reader = new BufferedReader(new InputStreamReader(in))
val buffer = new Array[Char](1024)
out.print(cmd + "\r\n")
out.flush()
val firstLine = reader.readLine.split("\\s")
if (firstLine(0) == "OK") {
  def read(readCount: Int, acc: List[Byte]): Array[Byte] = {
    if (readCount <= 0) acc.toArray
    else {
      val count = reader.read(buffer, 0, buffer.length)
      val asBytes = buffer.slice(0, count).map(_.toByte)
      read(readCount - count, acc ++ asBytes)
    }
  }
  read(firstLine(1).toInt, List[Byte]())
} else {
  // RAISE
}
That is, use buffer.map(_.toByte).toArray to transform a char Array into a Byte Array without caring about the encoding.
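One caveat on that last point: reader.read returns decoded chars, so mapping them back with _.toByte only reproduces the original bytes when the InputStreamReader decodes with a single-byte charset. If you go this route, it may be safer to pin the charset explicitly; a small sketch of that adjustment (not part of the original answer):
import java.io.{BufferedReader, InputStreamReader}
import java.nio.charset.StandardCharsets

// ISO-8859-1 maps every byte 0-255 to the char with the same code point,
// so char -> toByte recovers the original byte exactly.
val reader = new BufferedReader(
  new InputStreamReader(in, StandardCharsets.ISO_8859_1))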

Related

Why does "Inflator" algorithm fails for UTF-8 encoding?

I wrote the following code in order to decompress messages that were compressed using the Deflater algorithm:
def decompressMsg[V: StringDecoder](msg: String): Try[V] = {
  if (msg.startsWith(CompressionHeader)) {
    logger.debug(s"Message before decompression is: ${msg}")
    val compressedByteArray =
      msg.drop(CompressionHeader.length).getBytes(StandardCharsets.UTF_8)
    val inflaterInputStream = new InflaterInputStream(
      new ByteArrayInputStream(compressedByteArray)
    )
    val decompressedByteArray = readDataFromInflaterInputStream(inflaterInputStream)
    StringDecoder.decode[V](new String(decompressedByteArray, StandardCharsets.UTF_8).tap {
      decompressedMsg => logger.info(s"Message after decompression is: ${decompressedMsg}")
    })
  } else {
    StringDecoder.decode[V](msg)
  }
}
private def readDataFromInflaterInputStream(
    inflaterInputStream: InflaterInputStream
): Array[Byte] = {
  val outputStream = new ByteArrayOutputStream
  var runLoop = true
  while (runLoop) {
    val buffer = new Array[Byte](BufferSize)
    val len = inflaterInputStream.read(buffer) // ERROR IS THROWN FROM THIS LINE!!
    outputStream.write(buffer, 0, len)
    if (len < BufferSize) runLoop = false
  }
  outputStream.toByteArray
}
The input argument 'msg' was compressed using the Deflater. The above code fails with the error message:
invalid stored block lengths java.util.zip.ZipException: invalid stored block lengths
After I saw this thread, I changed StandardCharsets.UTF_8 to StandardCharsets.ISO_8859_1 and, surprisingly, the code passed and returned the desired behaviour.
I don't want to work with an encoding other than UTF_8. Do you have an idea how to make my code work with the UTF_8 encoding?
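For what it's worth, the charset difference can be demonstrated in isolation: deflated output is arbitrary binary, and UTF-8 does not round-trip arbitrary bytes (invalid sequences become U+FFFD), while ISO-8859-1 round-trips every byte value. A small sketch illustrating that point (the byte values below are just examples):
import java.nio.charset.StandardCharsets

// Deflate output is arbitrary binary, e.g. it can contain 0x9C, 0xCA, 0xFF, ...
val raw: Array[Byte] = Array(0x78, 0x9c, 0xca, 0x00, 0xff).map(_.toByte)

val viaUtf8   = new String(raw, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8)
val viaLatin1 = new String(raw, StandardCharsets.ISO_8859_1).getBytes(StandardCharsets.ISO_8859_1)

println(raw.sameElements(viaUtf8))   // false: invalid UTF-8 bytes were replaced with U+FFFD
println(raw.sameElements(viaLatin1)) // true: ISO-8859-1 round-trips all 256 byte values
If UTF-8 is required end to end, the usual approach is to Base64-encode the compressed bytes before putting them into a String and to Base64-decode before inflating, rather than treating the compressed bytes as text.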

Scala cipher AES decryption does not work -- conditionally unable to decrypt file -- padding error

Scala 2.11.8, help with encryption... This follows principles from this StackOverflow post and gives an error (javax.crypto.BadPaddingException: Given final block not properly padded). I know what causes the error, but I need help handling it. Note: the error occurs when the decrypt code is executed separately (i.e. in a different window, a separate spark-shell instance). Rarely, the error occurs when both encrypt and decrypt are run in the same instance. The salt and IvSpec are copied to the separate instance, as shown at the end (note: I have verified that the bytes are the same in both instances).
import java.io.{BufferedWriter, File, FileWriter, FileInputStream, FileOutputStream, BufferedInputStream, BufferedOutputStream, DataInputStream, DataOutputStream}
import org.apache.commons.io.FileUtils;
import javax.crypto.{Cipher, SecretKey, SecretKeyFactory, CipherInputStream, CipherOutputStream}
import javax.crypto.spec.{IvParameterSpec, SecretKeySpec, PBEKeySpec}
import java.security.SecureRandom
import scala.util.Random
import scala.math.pow
val password = "Let us test this"
val random = new SecureRandom();
val salt = Array.fill[Byte](16)(0)
random.nextBytes(salt)
val IvSpec1 = Array.fill[Byte](16)(0)
random.nextBytes(IvSpec1)
val IvSpec = new IvParameterSpec(IvSpec1)
def password_to_key(password: String, salt: Array[Byte]): SecretKeySpec = {
  val spec = new PBEKeySpec(password.toCharArray(), salt, 65536, 256)
  val f = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA1")
  val key = f.generateSecret(spec).getEncoded()
  new SecretKeySpec(key, "AES")
}
val Key = password_to_key(password, salt)
val Algorithm = "AES/CBC/PKCS5Padding"
val cipher_encrypt = Cipher.getInstance(Algorithm)
cipher_encrypt.init(Cipher.ENCRYPT_MODE, Key, IvSpec)
val cipher_decrypt = Cipher.getInstance(Algorithm)
cipher_decrypt.init(Cipher.DECRYPT_MODE, Key, IvSpec)
def encrypt(file_in: String, file_out: String, cipher_encrypt: javax.crypto.Cipher) {
  val in = new BufferedInputStream(new FileInputStream(file_in))
  val out = new BufferedOutputStream(new FileOutputStream(file_out))
  val out_encrypted = new CipherOutputStream(out, cipher_encrypt)
  val bufferSize = 1024 * pow(2, 4).toInt
  val bb = new Array[Byte](bufferSize)
  var bb_read = in.read(bb, 0, bufferSize)
  while (bb_read > 0) {
    out_encrypted.write(bb, 0, bb_read)
    bb_read = in.read(bb, 0, bufferSize)
  }
  in.close()
  out_encrypted.close()
  out.close()
}
def decrypt(file_in: String, file_out: String, cipher_decrypt: javax.crypto.Cipher) {
  val in = new BufferedInputStream(new FileInputStream(file_in))
  val in_decrypted = new CipherInputStream(in, cipher_decrypt)
  val out = new BufferedOutputStream(new FileOutputStream(file_out))
  val bufferSize = 1024 * pow(2, 4).toInt
  val bb = new Array[Byte](bufferSize)
  var bb_read = in_decrypted.read(bb, 0, bufferSize)
  while (bb_read > 0) {
    out.write(bb, 0, bb_read)
    bb_read = in_decrypted.read(bb, 0, bufferSize)
  }
  in_decrypted.close()
  in.close()
  out.close()
}
val file_in = "test.csv"
val file_encrypt = "test_encrypt.csv"
val file_decrypt = "test_decrypt.csv"
encrypt(file_in, file_encrypt, cipher_encrypt)
decrypt( file_encrypt, file_decrypt, cipher_decrypt)
// To write salt, IvSpec (to re-read it in a separate instance...)
val salt_loc = new File("salt.txt")
val IvSpec_loc = new File("IvSpec.txt")
val salt_w = new FileOutputStream(salt_loc)
salt_w.write(salt)
salt_w.close()
val IvSpec_w = new FileOutputStream(IvSpec_loc)
IvSpec_w.write(IvSpec1)
IvSpec_w.close()
//to re-read salt and IvSpec in a separate instance...
//Ignore that we do not need to re-read IvSpec
val salt_loc = new File("salt.txt")
val IvSpec_loc = new File("IvSpec.txt")
val salt_r = new FileInputStream(salt_loc)
val salt = Stream.continually(salt_r.read).takeWhile(-1 !=).map(_.toByte).toArray
val IvSpec_r = new FileInputStream(IvSpec_loc)
val IvSpec1 = Stream.continually(IvSpec_r.read).takeWhile(-1 !=).map(_.toByte).toArray
val IvSpec = new IvParameterSpec(IvSpec1)
When the decrypt code is executed in a separate Java process, this reliably gives a padding-related error. It works most of the time (95%+) when encrypt and decrypt are done in the same script (e.g. the above will work most of the time if executed in the same instance). If decrypt is done separately, by reading the salt and IvSpec back in a separate window/process/thread/Java instance, it fails.
The error is:
java.io.IOException: javax.crypto.BadPaddingException: Given final block not properly padded
at javax.crypto.CipherInputStream.getMoreData(CipherInputStream.java:121)
at javax.crypto.CipherInputStream.read(CipherInputStream.java:239)
at javax.crypto.CipherInputStream.read(CipherInputStream.java:215)
at decrypt3(<console>:81)
... 60 elided
Caused by: javax.crypto.BadPaddingException: Given final block not properly padded
at com.sun.crypto.provider.CipherCore.doFinal(CipherCore.java:991)
at com.sun.crypto.provider.CipherCore.doFinal(CipherCore.java:847)
at com.sun.crypto.provider.AESCipher.engineDoFinal(AESCipher.java:446)
at javax.crypto.Cipher.doFinal(Cipher.java:2047)
at javax.crypto.CipherInputStream.getMoreData(CipherInputStream.java:118)
... 63 more
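No answer is shown for this one, but one common pattern that removes a whole class of mismatches between the two instances is to store the 16-byte salt and 16-byte IV as a fixed-length header of the encrypted file itself and to read them back with readFully, so the decrypting instance reconstructs exactly the same key and IvParameterSpec. This is only a sketch of that idea (hypothetical file layout: salt, then IV, then ciphertext), not a claimed fix for the padding error above:
import java.io.{BufferedInputStream, BufferedOutputStream, DataInputStream,
  FileInputStream, FileOutputStream}

// Writing: prepend salt and IV to the ciphertext file.
def openEncryptedOutput(path: String, salt: Array[Byte], iv: Array[Byte]) = {
  val out = new BufferedOutputStream(new FileOutputStream(path))
  out.write(salt) // 16 bytes
  out.write(iv)   // 16 bytes
  out             // wrap in a CipherOutputStream afterwards, as in encrypt() above
}

// Reading: recover exactly 16 + 16 header bytes before building the cipher.
def readHeader(path: String): (Array[Byte], Array[Byte], DataInputStream) = {
  val in = new DataInputStream(new BufferedInputStream(new FileInputStream(path)))
  val salt = new Array[Byte](16); in.readFully(salt)
  val iv   = new Array[Byte](16); in.readFully(iv)
  (salt, iv, in) // wrap `in` in a CipherInputStream built with new IvParameterSpec(iv)
}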

What happens when you do java data manipulations in Spark outside of an RDD

I am reading a CSV file from HDFS using Spark. It's going into an FSDataInputStream object. I can't use the textFile() method because it splits up the CSV file by line feed, and I am reading a CSV file that has line feeds inside the text fields. opencsv from SourceForge handles line feeds inside the cells; it's a nice project, but it accepts a Reader as input. I need to convert the stream to a string so that I can pass it to opencsv as a StringReader. So: HDFS file -> FSDataInputStream -> String -> StringReader -> an opencsv list of strings. Below is the code...
import java.io._
import org.apache.spark.sql.SQLContext
import org.apache.hadoop.fs._
import org.apache.hadoop.conf._
import com.opencsv._
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import java.lang.StringBuilder
val conf = new Configuration()
val hdfsCoreSitePath = new Path("core-site.xml")
val hdfsHDFSSitePath = new Path("hdfs-site.xml")
conf.addResource(hdfsCoreSitePath)
conf.addResource(hdfsHDFSSitePath)
val fileSystem = FileSystem.get(conf)
val csvPath = new Path("/raw_data/project_name/csv/file_name.csv")
val csvFile = fileSystem.open(csvPath)
val fileLen = fileSystem.getFileStatus(csvPath).getLen().toInt
var b = Array.fill[Byte](2048)(0)
var j = 1
val stringBuilder = new StringBuilder()
var bufferString = ""
csvFile.seek(0)
csvFile.read(b)
bufferString = new String(b, "UTF-8")
stringBuilder.append(bufferString)
while (j != -1) {
  b = Array.fill[Byte](2048)(0)
  j = csvFile.read(b)
  bufferString = new String(b, "UTF-8")
  stringBuilder.append(bufferString)
}
// substring returns a String; trim to the real file length to drop the zero padding
val stringBuilderClean = stringBuilder.substring(0, fileLen)
val reader: Reader = new StringReader(stringBuilderClean)
val csv = new CSVReader(reader)
val javaContext = new JavaSparkContext(sc)
val sqlContext = new SQLContext(sc)
val javaRDD = javaContext.parallelize(csv.readAll())
//do a bunch of transformations on the RDD
It works, but I doubt it is scalable. It makes me wonder how big a limitation it is to have a driver program which pipes all the data through one JVM. My questions to anyone very familiar with Spark are:
What happens when you do data manipulations across your whole data set like this, before it even gets dropped into the input RDD? Is it just treated as any other program and would be swapping out like crazy, I guess?
How would you then make any Spark program scalable? Do you always NEED to extract the data directly into an input RDD?
Your code loads the data into memory, and then the Spark driver splits it and sends each part of the data to an executor; of course, this is not scalable.
There are two ways to address this.
Write a custom InputFormat to support the CSV file format:
import java.io.{InputStreamReader, IOException}
import com.google.common.base.Charsets
import com.opencsv.{CSVParser, CSVReader}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{Seekable, Path, FileSystem}
import org.apache.hadoop.io.compress._
import org.apache.hadoop.io.{ArrayWritable, Text, LongWritable}
import org.apache.hadoop.mapred._
class CSVInputFormat extends FileInputFormat[LongWritable, ArrayWritable] with JobConfigurable {
private var compressionCodecs: CompressionCodecFactory = _
def configure(conf: JobConf) {
compressionCodecs = new CompressionCodecFactory(conf)
}
protected override def isSplitable(fs: FileSystem, file: Path): Boolean = {
val codec: CompressionCodec = compressionCodecs.getCodec(file)
if (null == codec) {
return true
}
codec.isInstanceOf[SplittableCompressionCodec]
}
@throws(classOf[IOException])
def getRecordReader(genericSplit: InputSplit, job: JobConf, reporter: Reporter): RecordReader[LongWritable, ArrayWritable] = {
reporter.setStatus(genericSplit.toString)
val delimiter: String = job.get("textinputformat.record.delimiter")
var recordDelimiterBytes: Array[Byte] = null
if (null != delimiter) {
recordDelimiterBytes = delimiter.getBytes(Charsets.UTF_8)
}
new CsvLineRecordReader(job, genericSplit.asInstanceOf[FileSplit], recordDelimiterBytes)
}
}
class CsvLineRecordReader(job: Configuration, split: FileSplit, recordDelimiter: Array[Byte])
extends RecordReader[LongWritable, ArrayWritable] {
private val compressionCodecs = new CompressionCodecFactory(job)
private val maxLineLength = job.getInt(org.apache.hadoop.mapreduce.lib.input.
LineRecordReader.MAX_LINE_LENGTH, Integer.MAX_VALUE)
private var filePosition: Seekable = _
private val file = split.getPath
private val codec = compressionCodecs.getCodec(file)
private val isCompressedInput = codec != null
private val fs = file.getFileSystem(job)
private val fileIn = fs.open(file)
private var start = split.getStart
private var pos: Long = 0L
private var end = start + split.getLength
private var reader: CSVReader = _
private var decompressor: Decompressor = _
private lazy val CSVSeparator =
if (recordDelimiter == null)
CSVParser.DEFAULT_SEPARATOR
else
recordDelimiter(0).asInstanceOf[Char]
if (isCompressedInput) {
decompressor = CodecPool.getDecompressor(codec)
if (codec.isInstanceOf[SplittableCompressionCodec]) {
val cIn = (codec.asInstanceOf[SplittableCompressionCodec])
.createInputStream(fileIn, decompressor, start, end, SplittableCompressionCodec.READ_MODE.BYBLOCK)
reader = new CSVReader(new InputStreamReader(cIn), CSVSeparator)
start = cIn.getAdjustedStart
end = cIn.getAdjustedEnd
filePosition = cIn
}else {
reader = new CSVReader(new InputStreamReader(codec.createInputStream(fileIn, decompressor)), CSVSeparator)
filePosition = fileIn
}
} else {
fileIn.seek(start)
reader = new CSVReader(new InputStreamReader(fileIn), CSVSeparator)
filePosition = fileIn
}
@throws(classOf[IOException])
private def getFilePosition: Long = {
if (isCompressedInput && null != filePosition) {
filePosition.getPos
}else
pos
}
private def nextLine: Option[Array[String]] = {
if (getFilePosition < end){
// readNext automatically splits the line into elements
reader.readNext() match {
case null => None
case elems => Some(elems)
}
} else
None
}
override def next(key: LongWritable, value: ArrayWritable): Boolean =
nextLine
.exists { elems =>
key.set(pos)
val lineLength = elems.foldRight(0)((a, b) => a.length + 1 + b)
pos += lineLength
value.set(elems.map(new Text(_)))
if (lineLength < maxLineLength) true else false
}
@throws(classOf[IOException])
def getProgress: Float =
if (start == end)
0.0f
else
Math.min(1.0f, (getFilePosition - start) / (end - start).toFloat)
override def getPos: Long = pos
override def createKey(): LongWritable = new LongWritable
override def close(): Unit = {
try {
if (reader != null) {
reader.close
}
} finally {
if (decompressor != null) {
CodecPool.returnDecompressor(decompressor)
}
}
}
override def createValue(): ArrayWritable = new ArrayWritable(classOf[Text])
}
Simple test example:
val arrayRdd = sc.hadoopFile("source path", classOf[CSVInputFormat], classOf[LongWritable], classOf[ArrayWritable],
sc.defaultMinPartitions).map(_._2.get().map(_.toString))
arrayRdd.collect().foreach(e => println(e.mkString(",")))
The other way, which I prefer, is to use spark-csv written by Databricks, which has good support for the CSV file format; you can find usage examples on its GitHub page.
Update for spark-csv: use univocity as the parserLib, which can handle multi-line cells:
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("parserLib", "univocity")
.option("inferSchema", "true") // Automatically infer data types
.load("source path")
What happens when you do data manipulations across your whole data set like this, before it even gets dropped into the input RDD? Is it just treated as any other program and would be swapping out like crazy, I guess?
You load the whole dataset into local memory. So if you have the memory, it works.
How would you then make any Spark program scalable?
You have to select a data format that Spark can load, or change your application so that it can load its data format into Spark directly, or a bit of both.
In this case you could look at creating a custom InputFormat that splits on something other than newlines. I think you would also want to look at how you write your data, so that it is partitioned in HDFS at record boundaries, not newlines.
However, I suspect the simplest answer is to encode the data differently: JSON Lines, escaping the newlines in the CSV file during the write, Avro, or anything else that fits better with Spark and HDFS.
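For the JSON Lines suggestion, the loading side is a one-liner along the lines of the spark-csv snippet above (assuming the same sqlContext and Spark 1.x API; the path is a placeholder):
// Each line is a complete JSON document, so embedded newlines inside values
// are escaped as \n and record splitting stays trivial.
val jsonDf = sqlContext.read.json("source path")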

Round-tripping through Deflater in Scala fails

I'm trying to round-trip bytes through Java's Deflater and running into issues. First the output, then the code. What am I doing wrong here, and how can I properly round-trip through these streams?
Output:
scala> new String(decompress(compress("face".getBytes)))
(crazy output string of length 20)
Code:
def compress(bytes: Array[Byte]): Array[Byte] = {
  val deflater = new java.util.zip.Deflater
  val baos = new ByteArrayOutputStream
  val dos = new DeflaterOutputStream(baos, deflater)
  dos.write(bytes)
  baos.close
  dos.finish
  dos.close
  baos.toByteArray
}
def decompress(bytes: Array[Byte]): Array[Byte] = {
  val deflater = new java.util.zip.Deflater
  val baos = new ByteArrayOutputStream(512)
  val bytesIn = new ByteArrayInputStream(bytes)
  val in = new DeflaterInputStream(bytesIn, deflater)
  var go = true
  while (go) {
    val b = in.read
    if (b == -1)
      go = false
    else
      baos.write(b)
  }
  baos.close
  in.close
  baos.toByteArray
}
You're (re-)Deflater-ing the result of the original deflation when you should be Inflater-ing it...
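That is, decompress should drive an Inflater rather than a second Deflater. A sketch of the corrected method under that reading, keeping the shape of the original:
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.{Inflater, InflaterInputStream}

def decompress(bytes: Array[Byte]): Array[Byte] = {
  val inflater = new Inflater                         // was: new Deflater
  val baos = new ByteArrayOutputStream(512)
  val bytesIn = new ByteArrayInputStream(bytes)
  val in = new InflaterInputStream(bytesIn, inflater) // was: DeflaterInputStream
  val buf = new Array[Byte](512)
  var read = in.read(buf)
  while (read != -1) {
    baos.write(buf, 0, read)
    read = in.read(buf)
  }
  in.close()
  baos.toByteArray
}
With this change, new String(decompress(compress("face".getBytes))) should yield "face" again.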

scala priority queue not ordering properly?

I'm seeing some strange behavior with Scala's collection.mutable.PriorityQueue. I'm performing an external sort and testing it with 1M records. Each time I run the test and verify the results, between 10 and 20 records are not sorted properly. If I replace the Scala PriorityQueue implementation with a java.util.PriorityQueue, it works 100% of the time. Any ideas?
Here's the code (sorry it's a bit long...). I test it using the tools gensort -a 1000000 and valsort from http://sortbenchmark.org/
def externalSort(inFileName: String, outFileName: String)
(implicit ord: Ordering[String]): Int = {
val MaxTempFiles = 1024
val TempBufferSize = 4096
val inFile = new java.io.File(inFileName)
/** Partitions input file and sorts each partition */
def partitionAndSort()(implicit ord: Ordering[String]):
List[java.io.File] = {
/** Gets block size to use */
def getBlockSize: Long = {
var blockSize = inFile.length / MaxTempFiles
val freeMem = Runtime.getRuntime().freeMemory()
if (blockSize < freeMem / 2)
blockSize = freeMem / 2
else if (blockSize >= freeMem)
System.err.println("Not enough free memory to use external sort.")
blockSize
}
/** Sorts and writes data to temp files */
def writeSorted(buf: List[String]): java.io.File = {
// Create new temp buffer
val tmp = java.io.File.createTempFile("external", "sort")
tmp.deleteOnExit()
// Sort buffer and write it out to tmp file
val out = new java.io.PrintWriter(tmp)
try {
for (l <- buf.sorted) {
out.println(l)
}
} finally {
out.close()
}
tmp
}
val blockSize = getBlockSize
var tmpFiles = List[java.io.File]()
var buf = List[String]()
var currentSize = 0
// Read input and divide into blocks
for (line <- io.Source.fromFile(inFile).getLines()) {
if (currentSize > blockSize) {
tmpFiles ::= writeSorted(buf)
buf = List[String]()
currentSize = 0
}
buf ::= line
currentSize += line.length() * 2 // 2 bytes per char
}
if (currentSize > 0) tmpFiles ::= writeSorted(buf)
tmpFiles
}
/** Merges results of sorted partitions into one output file */
def mergeSortedFiles(fs: List[java.io.File])
(implicit ord: Ordering[String]): Int = {
/** Temp file buffer for reading lines */
class TempFileBuffer(val file: java.io.File) {
private val in = new java.io.BufferedReader(
new java.io.FileReader(file), TempBufferSize)
private var curLine: String = ""
readNextLine() // prep first value
def currentLine = curLine
def isEmpty = curLine == null
def readNextLine() {
if (curLine == null) return
try {
curLine = in.readLine()
} catch {
case _: java.io.EOFException => curLine = null
}
if (curLine == null) in.close()
}
override protected def finalize() {
try {
in.close()
} finally {
super.finalize()
}
}
}
val wrappedOrd = new Ordering[TempFileBuffer] {
def compare(o1: TempFileBuffer, o2: TempFileBuffer): Int = {
ord.compare(o1.currentLine, o2.currentLine)
}
}
val pq = new collection.mutable.PriorityQueue[TempFileBuffer](
)(wrappedOrd)
// Init queue with item from each file
for (tmp <- fs) {
val buf = new TempFileBuffer(tmp)
if (!buf.isEmpty) pq += buf
}
var count = 0
val out = new java.io.PrintWriter(new java.io.File(outFileName))
try {
// Read each value off of queue
while (pq.size > 0) {
val buf = pq.dequeue()
out.println(buf.currentLine)
count += 1
buf.readNextLine()
if (buf.isEmpty) {
buf.file.delete() // don't need anymore
} else {
// re-add to priority queue so we can process next line
pq += buf
}
}
} finally {
out.close()
}
count
}
mergeSortedFiles(partitionAndSort())
}
My tests don't show any bugs in PriorityQueue.
import org.scalacheck._
import Prop._
import scala.collection.mutable.PriorityQueue

object PriorityQueueProperties extends Properties("PriorityQueue") {
  def listToPQ(l: List[String]): PriorityQueue[String] = {
    val pq = new PriorityQueue[String]
    l foreach (pq +=)
    pq
  }
  def pqToList(pq: PriorityQueue[String]): List[String] =
    if (pq.isEmpty) Nil
    else { val h = pq.dequeue; h :: pqToList(pq) }
  property("Enqueued elements are dequeued in reverse order") =
    forAll { (l: List[String]) => l.sorted == pqToList(listToPQ(l)).reverse }
  property("Adding/removing elements doesn't break sorting") =
    forAll { (l: List[String], s: String) =>
      (l.size > 0) ==>
        ((s :: l.sorted.init).sorted == {
          val pq = listToPQ(l)
          pq.dequeue
          pq += s
          pqToList(pq).reverse
        })
    }
}
scala> PriorityQueueProperties.check
+ PriorityQueue.Enqueued elements are dequeued in reverse order: OK, passed 100 tests.
+ PriorityQueue.Adding/removing elements doesn't break sorting: OK, passed 100 tests.
If you could somehow reduce the input enough to make a test case, it would help.
I ran it with five million inputs several times, and the output always matched the expected result. My guess from looking at your code is that your Ordering is the problem (i.e. it's giving inconsistent answers).
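To illustrate one way an Ordering can give inconsistent answers (a hypothetical example, not taken from the question's code): if compare consults mutable state, the heap invariant can be silently violated and a few elements can come out of the queue in the wrong order.
import scala.collection.mutable.PriorityQueue

// Hypothetical illustration only: an Ordering that reads mutable state.
class Row(var key: String)
implicit val byKey: Ordering[Row] = Ordering.by(_.key)

val pq = PriorityQueue.empty[Row]
val a = new Row("b"); val b = new Row("m"); val c = new Row("z")
pq ++= Seq(a, b, c)
a.key = "zzz"             // mutating an element already in the queue
println(pq.dequeue().key) // prints "z", not "zzz": the heap was built with the old key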