java.nio.BufferUnderflowException when processing files in Scala

I ran into a problem similar to this one while processing a 4 MB log file. I'm actually processing multiple files simultaneously, but since I keep getting this exception, I decided to test with a single file:
val temp = Source.fromFile("./datasource/input.txt")
val dummy = new PrintWriter("test.txt")
var itr = 0
println("Default Buffer size: " + Source.DefaultBufSize)
try {
for( chr <- temp) {
dummy.print(chr.toChar)
itr += 1
if(itr == 75703) println("Passed line 85")
if(itr % 256 == 0){ print("..." + itr); temp.reset; System.gc; }
if(itr == 75703) println("Passed line 87")
if(itr % 2048 == 0) println("")
if(itr == 75703) println("Passed line 89")
}
} finally {
println("\nFailed at itr = " + itr)
}
It always fails at itr = 75703, and my output file is always 64 KB (exactly 65536 bytes). No matter where I put temp.reset or System.gc, every experiment ends the same way.
It seems the problem lies in some memory allocation, but I cannot find any useful information about it. Any idea how to solve this?
All your help is greatly appreciated.
EDIT: Actually I want to process the files as binary, so this technique is not a good solution; many have recommended that I use BufferedInputStream instead.

Why are you calling reset on the Source before it has finished iterating through the file?
import scala.io.Source
val temp = Source.fromFile("./datasource/input.txt")
try {
  for (line <- temp.getLines) {
    // whatever
  }
} finally temp.reset
Should work just fine with no underflows. See also this question
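Regarding the EDIT in the question: here is a minimal sketch of reading the file as raw bytes through a BufferedInputStream, assuming a plain chunked copy is all that is needed. The object name BinaryCopy, the copy method and the 8192-byte chunk size are illustrative choices, not something taken from the question.

```scala
import java.io.{BufferedInputStream, BufferedOutputStream, FileInputStream, FileOutputStream}

object BinaryCopy {
  // Copies src to dst in fixed-size chunks and returns the number of bytes copied.
  // No character decoding is involved, so arbitrary binary content is safe.
  def copy(src: String, dst: String, chunkSize: Int = 8192): Long = {
    val in = new BufferedInputStream(new FileInputStream(src))
    try {
      val out = new BufferedOutputStream(new FileOutputStream(dst))
      try {
        val chunk = new Array[Byte](chunkSize)
        var total = 0L
        var n = in.read(chunk)          // number of bytes read, or -1 at end of stream
        while (n != -1) {
          out.write(chunk, 0, n)        // write only the bytes that were actually read
          total += n
          n = in.read(chunk)
        }
        total
      } finally out.close()             // close() also flushes the buffered output
    } finally in.close()
  }
}
```

Because the bytes never pass through a charset decoder, a file that is not valid text in the default encoding should not trip up this loop the way it can with Source.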

Related

Decoding delimited frames from byte arrays

I have frames that are delimited by bytes to start and stop the frame (they do not appear in the stream).
I read a chunk from disk or a network socket, which I then need to pass to a deserializer, but only after I have de-framed the packet.
Frames may span multiple chunks that have been read; note how frame 3 is split across array 1 and array 2.
Rather than reinvent the wheel for this common problem, do any GitHub or similar projects exist?
I am investigating ReadOnlySequenceSegment<T> from https://www.codemag.com/article/1807051/Introducing-.NET-Core-2.1-Flagship-Types-Span-T-and-Memory-T and will post updates as I work out the requirements.
Update
Further to Stephen Cleary's link (thank you!!) to https://github.com/davidfowl/TcpEcho/blob/master/src/Server/Program.cs I have the below.
My data is JSON, so unlike in the original question the delimiter tokens will appear in the stream. Therefore I have to count the array delimiters and only declare a frame when I have found the outermost [ and ] characters.
The code below works and does fewer manual copies (I am not sure whether copies still happen behind the scenes; the code is considerably neater using David Fowler's approach).
However, I am converting to an array instead of using buffer.PositionOf((byte)'[') because I could not see how to call PositionOf with an offset applied (i.e. scan deeper into the frame, past previously found delimiter tokens).
Am I butchering the library with a brute-force approach, or is the code below good to go with the array conversion?
class Program
{
static async Task Main(string[] args)
{
using var stream = File.Open(args[0], FileMode.Open);
var reader = PipeReader.Create(stream);
while (true)
{
ReadResult result = await reader.ReadAsync();
ReadOnlySequence<byte> buffer = result.Buffer;
while (TryDeframe(ref buffer, out ReadOnlySequence<byte> line))
{
// Process the line.
var str = System.Text.Encoding.UTF8.GetString(line.ToArray());
Console.WriteLine(str);
}
// Tell the PipeReader how much of the buffer has been consumed.
reader.AdvanceTo(buffer.Start, buffer.End);
// Stop reading if there's no more data coming.
if (result.IsCompleted)
{
break;
}
}
// Mark the PipeReader as complete.
await reader.CompleteAsync();
}
private static bool TryDeframe(ref ReadOnlySequence<byte> buffer, out ReadOnlySequence<byte> frame)
{
int frameCount = 0;
int start = -1;
int end = -1;
var bytes = buffer.ToArray();
for (var i = 0; i < bytes.Length; i++)
{
var b = bytes[i];
if (b == (byte)'[')
{
if (start == -1)
start = i;
frameCount++;
}
else if (b == (byte)']')
{
frameCount--;
if (frameCount == 0)
{
end = i;
break;
}
}
}
if (start == -1 || end == -1) // no frame found
{
frame = default;
return false;
}
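// Slice takes (start, length), so the length is end - start + 1.
frame = buffer.Slice(start, end - start + 1);
// Skip everything up to and including the closing ']'.
buffer = buffer.Slice(end + 1);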
return true;
}
}
"Do any GitHub or similar projects exist?"
David Fowler has an echo server that uses Pipelines to implement delimited frames.
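For the nested-bracket case in the update, the depth counting can be kept as state across chunks, so no chunk has to be rescanned. Below is a rough sketch of that idea in Scala rather than C#. BracketDeframer and feed are hypothetical names. It assumes that [ and ] never occur inside JSON string values; a real implementation would have to track string and escape state as well.

```scala
import java.io.ByteArrayOutputStream
import scala.collection.mutable.ArrayBuffer

// Hypothetical helper: keeps the nesting depth and the partial frame between
// calls, and emits one frame per balanced outermost [ ... ] pair.
final class BracketDeframer {
  private val pending = new ByteArrayOutputStream()
  private var depth = 0

  // Feed one chunk; returns the frames completed by this chunk (possibly none).
  def feed(chunk: Array[Byte]): Seq[Array[Byte]] = {
    val frames = ArrayBuffer.empty[Array[Byte]]
    for (b <- chunk) {
      if (b == '['.toByte) depth += 1
      if (depth > 0) pending.write(b.toInt)   // bytes outside any frame are dropped
      if (b == ']'.toByte && depth > 0) {
        depth -= 1
        if (depth == 0) {                     // outermost bracket closed: frame complete
          frames += pending.toByteArray
          pending.reset()
        }
      }
    }
    frames.toSeq
  }
}
```

A frame that starts in one chunk and ends in the next is simply returned by the later feed call, since the partial bytes stay in pending in between.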

Efficient way to optimise a Scala code to read large file that doesn't fit in memory

Problem statement:
We have a large log file which stores user interactions with an application. The entries in the log file follow the following schema: {userId, timestamp, actionType} where actionType is one of two possible values: [open, close]
Constraints:
The log file is too big to fit in memory on one machine. Also assume that the aggregated data doesn’t fit into memory.
Code has to be able to run on a single machine.
Should not use an out-of-the box implementation of mapreduce or 3rd party database; don’t assume we have a Hadoop or Spark or other distributed computing framework.
There can be multiple entries of each actionType for each user, and there might be missing entries in the log file. So a user might be missing a close record between two open records or vice versa.
Timestamps will come in strictly ascending order.
For this problem, we need to implement a class/classes that computes the average time spent by each user between open and close. Keep in mind that there are missing entries for some users, so we will have to make a choice about how to handle these entries when making our calculations. Code should follow a consistent policy with regards to how we make that choice.
The desired output for the solution should be [{userId, timeSpent},….] for all the users in the log file.
Sample log file (comma-separated, text file)
1,1435456566,open
2,1435457643,open
3,1435458912,open
1,1435459567,close
4,1435460345,open
1,1435461234,open
2,1435462567,close
1,1435463456,open
3,1435464398,close
4,1435465122,close
1,1435466775,close
Approach
Below is the code I've written in Python and Scala. It does not seem efficient enough for the scenario given, so I'd like feedback from the community on how it could be optimised.
Scala implementation
import java.io.FileInputStream
import java.util.{Scanner, Map, LinkedList}
import java.lang.Long
import scala.collection.mutable
object UserMetrics extends App {
if (args.length == 0) {
println("Please provide input data file name for processing")
sys.exit(1) // stop here; args(0) below would otherwise throw
}
val userMetrics = new UserMetrics()
userMetrics.readInputFile(args(0),if (args.length == 1) 600000 else args(1).toInt)
}
case class UserInfo(userId: Integer, prevTimeStamp: Long, prevStatus: String, timeSpent: Long, occurence: Integer)
class UserMetrics {
val usermap = mutable.Map[Integer, LinkedList[UserInfo]]()
def readInputFile(stArr:String, timeOut: Int) {
var inputStream: FileInputStream = null
var sc: Scanner = null
try {
inputStream = new FileInputStream(stArr);
sc = new Scanner(inputStream, "UTF-8");
while (sc.hasNextLine()) {
val line: String = sc.nextLine();
processInput(line, timeOut)
}
for ((key: Integer, userLs: LinkedList[UserInfo]) <- usermap) {
val userInfo:UserInfo = userLs.get(0)
val timespent = if (userInfo.occurence>0) userInfo.timeSpent/userInfo.occurence else 0
println("{" + key +","+timespent + "}")
}
if (sc.ioException() != null) {
throw sc.ioException();
}
} finally {
if (inputStream != null) {
inputStream.close();
}
if (sc != null) {
sc.close();
}
}
}
def processInput(line: String, timeOut: Int) {
val strSp = line.split(",")
val userId: Integer = Integer.parseInt(strSp(0))
val curTimeStamp = Long.parseLong(strSp(1))
val status = strSp(2)
val uInfo: UserInfo = UserInfo(userId, curTimeStamp, status, 0, 0)
val emptyUserInfo: LinkedList[UserInfo] = new LinkedList[UserInfo]()
val lsUserInfo: LinkedList[UserInfo] = usermap.getOrElse(userId, emptyUserInfo)
if (lsUserInfo != null && lsUserInfo.size() > 0) {
val lastUserInfo: UserInfo = lsUserInfo.get(lsUserInfo.size() - 1)
val prevTimeStamp: Long = lastUserInfo.prevTimeStamp
val prevStatus: String = lastUserInfo.prevStatus
if (prevStatus.equals("open")) {
if (status.equals(lastUserInfo.prevStatus)) {
val timeSelector = if ((curTimeStamp - prevTimeStamp) > timeOut) timeOut else curTimeStamp - prevTimeStamp
val timeDiff = lastUserInfo.timeSpent + timeSelector
lsUserInfo.remove()
lsUserInfo.add(UserInfo(userId, curTimeStamp, status, timeDiff, lastUserInfo.occurence + 1))
} else if(!status.equals(lastUserInfo.prevStatus)){
val timeDiff = lastUserInfo.timeSpent + curTimeStamp - prevTimeStamp
lsUserInfo.remove()
lsUserInfo.add(UserInfo(userId, curTimeStamp, status, timeDiff, lastUserInfo.occurence + 1))
}
} else if(prevStatus.equals("close")) {
if (status.equals(lastUserInfo.prevStatus)) {
lsUserInfo.remove()
val timeSelector = if ((curTimeStamp - prevTimeStamp) > timeOut) timeOut else curTimeStamp - prevTimeStamp
lsUserInfo.add(UserInfo(userId, curTimeStamp, status, lastUserInfo.timeSpent + timeSelector, lastUserInfo.occurence+1))
}else if(!status.equals(lastUserInfo.prevStatus))
{
lsUserInfo.remove()
lsUserInfo.add(UserInfo(userId, curTimeStamp, status, lastUserInfo.timeSpent, lastUserInfo.occurence))
}
}
}else if(lsUserInfo.size()==0){
lsUserInfo.add(uInfo)
}
usermap.put(userId, lsUserInfo)
}
}
Python Implementation
import sys
def fileBlockStream(fp, number_of_blocks, block):
    # A generator that splits a file into blocks and iterates over the lines of one of the blocks.
    assert 0 <= block and block < number_of_blocks  # Assertions to validate number of blocks given
    assert 0 < number_of_blocks
    fp.seek(0, 2)  # seek to end of file to compute block size
    file_size = fp.tell()
    ini = file_size * block // number_of_blocks  # compute start & end point of file block (integer offsets)
    end = file_size * (1 + block) // number_of_blocks
    if ini <= 0:
        fp.seek(0)
    else:
        fp.seek(ini - 1)
        fp.readline()
    while fp.tell() < end:
        yield fp.readline()  # iterate over lines of the particular chunk or block
def computeResultDS(chunk, avgTimeSpentDict, defaultTimeOut):
    countPos, totTmPos, openTmPos, closeTmPos, nextEventPos = 0, 1, 2, 3, 4
    for rows in chunk.splitlines():
        if len(rows.split(",")) != 3:
            continue
        userKeyID = rows.split(",")[0]
        try:
            curTimeStamp = int(rows.split(",")[1])
        except ValueError:
            print("Invalid Timestamp for ID:" + str(userKeyID))
            continue
        curEvent = rows.split(",")[2]
        if userKeyID in avgTimeSpentDict.keys() and avgTimeSpentDict[userKeyID][nextEventPos]==1 and curEvent == "close":
            # Check if already existing userID with expected Close event 0 - Open; 1 - Close
            # Array value within dictionary stores [No. of pair events, total time spent (Close tm-Open tm), Last Open Tm, Last Close Tm, Next expected Event]
            curTotalTime = curTimeStamp - avgTimeSpentDict[userKeyID][openTmPos]
            totalTime = curTotalTime + avgTimeSpentDict[userKeyID][totTmPos]
            eventCount = avgTimeSpentDict[userKeyID][countPos] + 1
            avgTimeSpentDict[userKeyID][countPos] = eventCount
            avgTimeSpentDict[userKeyID][totTmPos] = totalTime
            avgTimeSpentDict[userKeyID][closeTmPos] = curTimeStamp
            avgTimeSpentDict[userKeyID][nextEventPos] = 0  # Change next expected event to Open
        elif userKeyID in avgTimeSpentDict.keys() and avgTimeSpentDict[userKeyID][nextEventPos]==0 and curEvent == "open":
            avgTimeSpentDict[userKeyID][openTmPos] = curTimeStamp
            avgTimeSpentDict[userKeyID][nextEventPos] = 1  # Change next expected event to Close
        elif userKeyID in avgTimeSpentDict.keys() and avgTimeSpentDict[userKeyID][nextEventPos]==1 and curEvent == "open":
            curTotalTime, closeTime = missingHandler(defaultTimeOut, avgTimeSpentDict[userKeyID][openTmPos], curTimeStamp)
            totalTime = curTotalTime + avgTimeSpentDict[userKeyID][totTmPos]
            avgTimeSpentDict[userKeyID][totTmPos] = totalTime
            avgTimeSpentDict[userKeyID][closeTmPos] = closeTime
            avgTimeSpentDict[userKeyID][openTmPos] = curTimeStamp
            eventCount = avgTimeSpentDict[userKeyID][countPos] + 1
            avgTimeSpentDict[userKeyID][countPos] = eventCount
        elif userKeyID in avgTimeSpentDict.keys() and avgTimeSpentDict[userKeyID][nextEventPos]==0 and curEvent == "close":
            curTotalTime, openTime = missingHandler(defaultTimeOut, avgTimeSpentDict[userKeyID][closeTmPos], curTimeStamp)
            totalTime = curTotalTime + avgTimeSpentDict[userKeyID][totTmPos]
            avgTimeSpentDict[userKeyID][totTmPos] = totalTime
            avgTimeSpentDict[userKeyID][openTmPos] = openTime
            eventCount = avgTimeSpentDict[userKeyID][countPos] + 1
            avgTimeSpentDict[userKeyID][countPos] = eventCount
        elif curEvent == "open":
            # Initialize userid with Open event
            avgTimeSpentDict[userKeyID] = [0, 0, curTimeStamp, 0, 1]
        elif curEvent == "close":
            # Initialize userid with missing handler function since there is no Open event for this User
            totaltime, OpenTime = missingHandler(defaultTimeOut, 0, curTimeStamp)
            avgTimeSpentDict[userKeyID] = [1, totaltime, OpenTime, curTimeStamp, 0]
def missingHandler(defaultTimeOut, curTimeVal, lastTimeVal):
    if lastTimeVal - curTimeVal > defaultTimeOut:
        return defaultTimeOut, curTimeVal
    else:
        return lastTimeVal - curTimeVal, curTimeVal
def computeAvg(avgTimeSpentDict, defaultTimeOut):
    resDict = {}
    for k, v in avgTimeSpentDict.items():
        if v[0] == 0:
            resDict[k] = 0
        else:
            resDict[k] = v[1]/v[0]
    return resDict
if __name__ == "__main__":
    avgTimeSpentDict = {}
    if len(sys.argv) < 2:
        print("Please provide input data file name for processing")
        sys.exit(1)
    fileObj = open(sys.argv[1])
    number_of_chunks = 4 if len(sys.argv) < 3 else int(sys.argv[2])
    defaultTimeOut = 60000 if len(sys.argv) < 4 else int(sys.argv[3])
    for chunk_number in range(number_of_chunks):
        for chunk in fileBlockStream(fileObj, number_of_chunks, chunk_number):
            computeResultDS(chunk, avgTimeSpentDict, defaultTimeOut)
    print(computeAvg(avgTimeSpentDict, defaultTimeOut))
    avgTimeSpentDict.clear()  # Nullify dictionary
    fileObj.close()  # Close the file object
Both programs above give the desired output, but efficiency is what matters in this scenario. Let me know if you have something better or any suggestions for the existing implementation.
Thanks in advance!
What you are after is iterator usage. I'm not going to re-write your code, but the trick here is likely to be using an iterator. Fortunately Scala provides decent out of the box tooling for the job.
import scala.io.Source
object ReadBigFiles {
def read(fileName: String): Unit = {
val lines: Iterator[String] = Source.fromFile(fileName).getLines
// now you get iterator semantics for the file line traversal
// that means you can only go through the lines once, but you don't incur a penalty on heap usage
}
}
For your use case, you seem to require a lastUser, so you're dealing with groups of 2 entries. I think you have two choices: either go for iterator.sliding(2), which will produce a sequence for every consecutive pair, or simply add recursion to the mix using options.
def navigate(source: Iterator[String], last: Option[User]): ResultType = {
if (source.hasNext) {
val current = source.next()
last match {
case Some(existing) => // compare with previous user etc
case None => navigate(source, Some(current))
}
} else {
// exit recursion, return result
}
}
You can avoid all the code you've written to read the file and so on. If you need to count occurrences, simply build a Map inside your recursion, and increment the occurrences at every step based on your business logic.
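To make the "build a Map while traversing the iterator" idea concrete, here is a minimal sketch that uses a fold instead of explicit recursion. It is only one possible policy, chosen for illustration: a close is paired with the most recent unmatched open of the same user, a repeated open replaces the pending one, and an unmatched close is ignored. AvgSession, Acc and averages are made-up names. Note that it still keeps one small record per user in memory, so if even that does not fit, an external sort or on-disk partitioning by userId would be needed on top.

```scala
object AvgSession {
  // Per-user running state: pending open timestamp, total time, completed open/close pairs.
  final case class Acc(openAt: Option[Long], total: Long, pairs: Int)

  // Consumes the lines once; malformed lines (wrong field count) are skipped.
  // A non-numeric timestamp would still throw NumberFormatException in this sketch.
  def averages(lines: Iterator[String]): Map[String, Double] = {
    val state = lines.foldLeft(Map.empty[String, Acc]) { (m, line) =>
      line.split(",").map(_.trim) match {
        case Array(user, ts, action) =>
          val acc = m.getOrElse(user, Acc(None, 0L, 0))
          val next = (action, acc.openAt) match {
            case ("open", _)        => acc.copy(openAt = Some(ts.toLong)) // later open replaces an unmatched one
            case ("close", Some(o)) => Acc(None, acc.total + (ts.toLong - o), acc.pairs + 1)
            case _                  => acc // close without a pending open, or unknown action
          }
          m.updated(user, next)
        case _ => m
      }
    }
    // Users without a single completed pair are left out of the result.
    state.collect { case (user, Acc(_, total, pairs)) if pairs > 0 => user -> total.toDouble / pairs }
  }
}
```

It could be driven with something like AvgSession.averages(Source.fromFile(fileName).getLines()), so the file itself is never held in memory.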
from queue import LifoQueue, Queue
def averageTime() -> None:
    logs = {}
    records = Queue()
    with open("log.txt") as fp:
        for line in fp:  # iterate lazily instead of loading every line with readlines()
            fields = line.strip().split(",")
            if len(fields) != 3:
                continue  # skip malformed lines
            user, ts, action = fields
            if user not in logs:
                logs[user] = LifoQueue()
            logs[user].put((int(ts), action))
    for k in logs:
        somme = 0
        count = 0
        while not logs[k].empty():
            l = logs[k].get()
            # add close timestamps, subtract open timestamps; assumes every open has a matching close
            somme = (somme + l[0]) if l[1] == "close" else (somme - l[0])
            count = count + 1
        records.put([k, somme, count // 2])
    while not records.empty():
        record = records.get()
        if record[2] > 0:  # avoid dividing by zero for a user with a single event
            print(f"UserId={record[0]} Avg={record[1]/record[2]}")

ZipInputStream.read in ZipEntry

I am reading a zip file using ZipInputStream. The zip file contains 4 CSV files. Some of them are written out completely, some only partially. Please help me find the issue with the code below. Is there a limit on how much ZipInputStream.read can read into the buffer in one call?
val zis = new ZipInputStream(inputStream)
Stream.continually(zis.getNextEntry).takeWhile(_ != null).foreach { file =>
if (!file.isDirectory && file.getName.endsWith(".csv")) {
val buffer = new Array[Byte](file.getSize.toInt)
zis.read(buffer)
val fo = new FileOutputStream("c:\\temp\\input\\" + file.getName)
fo.write(buffer)
}
You have not closed/flushed the files you attempted to write. It should be something like this (assuming Scala syntax, or is this Kotlin/Ceylon?):
val fo = new FileOutputStream("c:\\temp\\input\\" + file.getName)
try {
fo.write(buffer)
} finally {
fo.close
}
Also you should check the read count and read more if necessary, something like this:
var readBytes = 0
while (readBytes < buffer.length) {
val r = zis.read(buffer, readBytes, buffer.length - readBytes)
r match {
case -1 => throw new IllegalStateException("Read terminated before reading everything")
case _ => readBytes += r
}
}
PS: Your example seems to be missing some closing }s.
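Putting both points together (close the output, and keep reading until read returns -1), a sketch could look like the following. It uses a fixed-size buffer instead of one sized from getSize, on the assumption that the entry size is not always known up front. UnzipCsv and extract are illustrative names, and stripping the directory part of the entry name is an extra precaution that is not in the original code.

```scala
import java.io.{File, FileOutputStream, InputStream}
import java.util.zip.ZipInputStream

object UnzipCsv {
  // Extracts every .csv entry of the zip stream into targetDir and returns the entry names written.
  def extract(in: InputStream, targetDir: File): List[String] = {
    val zis = new ZipInputStream(in)
    val written = List.newBuilder[String]
    try {
      val chunk = new Array[Byte](8192)
      var entry = zis.getNextEntry
      while (entry != null) {
        if (!entry.isDirectory && entry.getName.endsWith(".csv")) {
          // Keep only the file name so an entry cannot escape targetDir.
          val out = new FileOutputStream(new File(targetDir, new File(entry.getName).getName))
          try {
            var n = zis.read(chunk)       // may return fewer bytes than the buffer holds
            while (n != -1) {             // -1 marks the end of the current entry
              out.write(chunk, 0, n)
              n = zis.read(chunk)
            }
          } finally out.close()
          written += entry.getName
        }
        entry = zis.getNextEntry
      }
    } finally zis.close()
    written.result()
  }
}
```

A single read call is allowed to return after filling only part of the buffer, which would be consistent with the partially written files described in the question.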

Simplify Scala loop to one line

How do I simplify this loop into a call such as foreach, map, or something similar in Scala? I want to use hitsArray inside the shipList.filter call.
val hitsArray: Array[String] = T.split(" ");
for (hit <- hitsArray) {
shipSize = shipList.length
shipList = shipList.filter(!_.equalsIgnoreCase(hit))
}
if (shipList.length == 0) {
shipSunk = shipSunk + 1
} else if (shipList.length < shipSize) {
shipHit = shipHit + 1
}
To be fair, I don't understand why you are calling shipSize = shipList.length as you don't use it anywhere.
T.split(" ").foreach{ hit =>
shipList = shipList.filter(!_.equalsIgnoreCase(hit))
}
which gets you to where you want to go. I've made it 3 lines because you want to emphasize you're working via side effect in that foreach. That said, I don't see any advantage to making it a one-liner. What you had before was perfectly readable.
Something like this maybe?
shipList.filter(ship => T.split(" ").forall(!_.equalsIgnoreCase(ship)))
Although cleaner if shipList is already all lower case:
shipList.filterNot(T.split(" ").map(_.toLowerCase) contains _)
Or if your T is large, move it outside the loop:
val hits = T.split(" ").map(_.toLowerCase)
shipList.filterNot(hits contains _)
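If the sunk/hit bookkeeping from the question should come along too, the filter and the comparison could be wrapped in one function that returns the outcome instead of mutating counters. This is a rough sketch; it assumes the size comparison is meant against the ship's length before any of the shots are applied (the original loop overwrites shipSize on every iteration). Battleship, Outcome and applyHits are made-up names.

```scala
object Battleship {
  sealed trait Outcome
  case object Sunk extends Outcome
  case object Hit extends Outcome
  case object Miss extends Outcome

  // Removes every cell of the ship that appears in the space-separated shots
  // (case-insensitively) and reports what happened to the ship.
  def applyHits(ship: List[String], shots: String): (List[String], Outcome) = {
    val hits = shots.split(" ").map(_.toLowerCase).toSet
    val remaining = ship.filterNot(cell => hits.contains(cell.toLowerCase))
    val outcome: Outcome =
      if (remaining.isEmpty) Sunk
      else if (remaining.length < ship.length) Hit
      else Miss
    (remaining, outcome)
  }
}
```

The caller can then increment shipSunk or shipHit based on the returned outcome, which keeps the side effects in one place.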

Extract numbers and store them in variable in Scala and Spark

I have a file like below:
0; best wrap ear market pair pair break make
1; time sennheiser product better earphone fit
1; recommend headphone pretty decent full sound earbud design
0; originally buy work gym work well robust sound quality good clip
1; terrific sound great fit toss mine profuse sweater headphone
0; negative experienced sit chair back touch chair earplug displace hurt
...
I want to extract the leading number of each line and store it in a for each document. I've tried:
var grouped_with_wt = data.flatMap({ (line) =>
val words = line.split(";").split(" ")
words.map(w => {
val a =
(line.hashCode(),(vocab_lookup.value(w), a))
})
}).groupByKey()
expected output is :
(1453543,(best,0),(wrap,0),(ear,0),(market,0),(pair,0),(break,0),(make,0))
(3942334,(time,1),(sennheiser,1),(product,1),(better,1),(earphone,1),(fit,1))
...
after generating above results i used them in this code to generate final results:
val Beta = DenseMatrix.zeros[Int](V, S)
val Beta_c = grouped_with_wt.flatMap(kv => {
kv._2.map(wt => {
Beta(wt._1,wt._2) +=1
})
})
final results:
1 0
1 0
1 0
1 0
...
This code doesn't work. Can anybody help me? I want code along the lines of the above.
val inputRDD = sc.textFile("input dir ")
val outRDD = inputRDD.map(r => {
val tuple = r.split(";")
val key = tuple(0)
val words = tuple(1).trim().split(" ")
val outArr = words.map(w => {
new Tuple2(w,key)
})
(r.hashCode, outArr.mkString(","))
})
outRDD.saveAsTextFile("output dir")
output
(-1704185638,(best,0),(wrap,0),(ear,0),(market,0),(pair,0),(pair,0),(break,0),(make,0))
(147969209,(time,5),(sennheiser,5),(product,5),(better,5),(earphone,5),(fit,5))
(1145947974,(recommend,1),(headphone,1),(pretty,1),(decent,1),(full,1),(sound,1),(earbud,1),(design,1))
(838871770,(originally,4),(buy,4),(work,4),(gym,4),(work,4),(well,4),(robust,4),(sound,4),(quality,4),(good,4),(clip,4))
(934228708,(terrific,5),(sound,5),(great,5),(fit,5),(toss,5),(mine,5),(profuse,5),(sweater,5),(headphone,5))
(659513416,(negative,-3),(experienced,-3),(sit,-3),(chair,-3),(back,-3),(touch,-3),(chair,-3),(earplug,-3),(displace,-3),(hurt,-3))
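The per-line parsing in the map above can also be pulled out into a plain function, which makes it testable without a Spark context and keeps the label as an Int. A minimal sketch, assuming every valid line has the form "label; word word ..."; LabeledLine and parse are hypothetical names, and lines that do not match (such as the "..." placeholder) come back as None.

```scala
object LabeledLine {
  // Parses "label; w1 w2 ..." into (hash of the line, (word, label) pairs).
  def parse(line: String): Option[(Int, Seq[(String, Int)])] =
    line.split(";", 2) match {
      case Array(label, text) =>
        // A non-numeric label yields None instead of throwing.
        scala.util.Try(label.trim.toInt).toOption.map { a =>
          val words = text.trim.split("\\s+").filter(_.nonEmpty).toSeq
          (line.hashCode, words.map(w => (w, a)))
        }
      case _ => None
    }
}
```

With an RDD[String] this would presumably be used as data.flatMap(line => LabeledLine.parse(line)), so that unparsable lines are dropped rather than failing the job.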