Efficient way to optimise Scala code to read a large file that doesn't fit in memory - scala

Problem statement
We have a large log file which stores user interactions with an application. Entries in the log file follow this schema: {userId, timestamp, actionType}, where actionType is one of two possible values: [open, close].
Constraints:
The log file is too big to fit in memory on one machine. Also assume that the aggregated data doesn’t fit into memory.
Code has to be able to run on a single machine.
Should not use an out-of-the box implementation of mapreduce or 3rd party database; don’t assume we have a Hadoop or Spark or other distributed computing framework.
There can be multiple entries of each actionType for each user, and there might be missing entries in the log file. So a user might be missing a close record between two open records or vice versa.
Timestamps will come in strictly ascending order.
For this problem, we need to implement a class (or classes) that computes the average time spent by each user between open and close. Keep in mind that there are missing entries for some users, so we will have to make a choice about how to handle these entries when making our calculations. The code should follow a consistent policy with regard to how we make that choice.
The desired output for the solution should be [{userId, timeSpent},….] for all the users in the log file.
Sample log file (comma-separated, text file)
1,1435456566,open
2,1435457643,open
3,1435458912,open
1,1435459567,close
4,1435460345,open
1,1435461234,open
2,1435462567,close
1,1435463456,open
3,1435464398,close
4,1435465122,close
1,1435466775,close
Approach
Below is the code I've written in Scala and Python. It doesn't seem efficient enough, nor up to the expectations of the given scenario, so I'd like feedback from the community of developers in this forum on how we could better optimise this code for the scenario.
Scala implementation
import java.io.FileInputStream
import java.util.{Scanner, Map, LinkedList}
import java.lang.Long
import scala.collection.mutable

object UserMetrics extends App {
  if (args.length == 0) {
    println("Please provide input data file name for processing")
    sys.exit(1)
  }
  val userMetrics = new UserMetrics()
  userMetrics.readInputFile(args(0), if (args.length == 1) 600000 else args(1).toInt)
}

// Running state per user: last seen event, accumulated time, and pair count
case class UserInfo(userId: Integer, prevTimeStamp: Long, prevStatus: String, timeSpent: Long, occurence: Integer)

class UserMetrics {
  val usermap = mutable.Map[Integer, LinkedList[UserInfo]]()

  def readInputFile(stArr: String, timeOut: Int) {
    var inputStream: FileInputStream = null
    var sc: Scanner = null
    try {
      inputStream = new FileInputStream(stArr)
      sc = new Scanner(inputStream, "UTF-8")
      while (sc.hasNextLine()) {
        val line: String = sc.nextLine()
        processInput(line, timeOut)
      }
      for ((key: Integer, userLs: LinkedList[UserInfo]) <- usermap) {
        val userInfo: UserInfo = userLs.get(0)
        val timespent = if (userInfo.occurence > 0) userInfo.timeSpent / userInfo.occurence else 0
        println("{" + key + "," + timespent + "}")
      }
      if (sc.ioException() != null) {
        throw sc.ioException()
      }
    } finally {
      if (inputStream != null) {
        inputStream.close()
      }
      if (sc != null) {
        sc.close()
      }
    }
  }

  def processInput(line: String, timeOut: Int) {
    val strSp = line.split(",")
    val userId: Integer = Integer.parseInt(strSp(0))
    val curTimeStamp = Long.parseLong(strSp(1))
    val status = strSp(2)
    val uInfo: UserInfo = UserInfo(userId, curTimeStamp, status, 0, 0)
    val emptyUserInfo: LinkedList[UserInfo] = new LinkedList[UserInfo]()
    val lsUserInfo: LinkedList[UserInfo] = usermap.getOrElse(userId, emptyUserInfo)
    if (lsUserInfo != null && lsUserInfo.size() > 0) {
      val lastUserInfo: UserInfo = lsUserInfo.get(lsUserInfo.size() - 1)
      val prevTimeStamp: Long = lastUserInfo.prevTimeStamp
      val prevStatus: String = lastUserInfo.prevStatus
      if (prevStatus.equals("open")) {
        if (status.equals(lastUserInfo.prevStatus)) {
          // open followed by open: missing close, so cap the gap at the timeout
          val timeSelector = if ((curTimeStamp - prevTimeStamp) > timeOut) timeOut else curTimeStamp - prevTimeStamp
          val timeDiff = lastUserInfo.timeSpent + timeSelector
          lsUserInfo.remove()
          lsUserInfo.add(UserInfo(userId, curTimeStamp, status, timeDiff, lastUserInfo.occurence + 1))
        } else if (!status.equals(lastUserInfo.prevStatus)) {
          // open followed by close: a complete session
          val timeDiff = lastUserInfo.timeSpent + curTimeStamp - prevTimeStamp
          lsUserInfo.remove()
          lsUserInfo.add(UserInfo(userId, curTimeStamp, status, timeDiff, lastUserInfo.occurence + 1))
        }
      } else if (prevStatus.equals("close")) {
        if (status.equals(lastUserInfo.prevStatus)) {
          // close followed by close: missing open, so cap the gap at the timeout
          lsUserInfo.remove()
          val timeSelector = if ((curTimeStamp - prevTimeStamp) > timeOut) timeOut else curTimeStamp - prevTimeStamp
          lsUserInfo.add(UserInfo(userId, curTimeStamp, status, lastUserInfo.timeSpent + timeSelector, lastUserInfo.occurence + 1))
        } else if (!status.equals(lastUserInfo.prevStatus)) {
          // close followed by open: idle time, nothing to add
          lsUserInfo.remove()
          lsUserInfo.add(UserInfo(userId, curTimeStamp, status, lastUserInfo.timeSpent, lastUserInfo.occurence))
        }
      }
    } else if (lsUserInfo.size() == 0) {
      lsUserInfo.add(uInfo)
    }
    usermap.put(userId, lsUserInfo)
  }
}
Python Implementation
import sys

def fileBlockStream(fp, number_of_blocks, block):
    # A generator that splits a file into blocks and iterates over the lines of one of the blocks.
    assert 0 <= block < number_of_blocks  # validate the block index given
    assert 0 < number_of_blocks
    fp.seek(0, 2)  # seek to end of file to compute block size
    file_size = fp.tell()
    ini = file_size * block // number_of_blocks  # compute start & end point of file block
    end = file_size * (1 + block) // number_of_blocks
    if ini <= 0:
        fp.seek(0)
    else:
        fp.seek(ini - 1)
        fp.readline()  # skip the partial line at the block boundary
    while fp.tell() < end:
        yield fp.readline()  # iterate over lines of the particular chunk or block

def computeResultDS(chunk, avgTimeSpentDict, defaultTimeOut):
    countPos, totTmPos, openTmPos, closeTmPos, nextEventPos = 0, 1, 2, 3, 4
    for rows in chunk.splitlines():
        if len(rows.split(",")) != 3:
            continue
        userKeyID = rows.split(",")[0]
        try:
            curTimeStamp = int(rows.split(",")[1])
        except ValueError:
            print("Invalid Timestamp for ID:" + str(userKeyID))
            continue
        curEvent = rows.split(",")[2]
        if userKeyID in avgTimeSpentDict and avgTimeSpentDict[userKeyID][nextEventPos] == 1 and curEvent == "close":
            # Existing userID with the expected Close event (0 - Open; 1 - Close).
            # The list stored per userID holds [no. of pair events, total time spent (close tm - open tm),
            # last open tm, last close tm, next expected event].
            curTotalTime = curTimeStamp - avgTimeSpentDict[userKeyID][openTmPos]
            totalTime = curTotalTime + avgTimeSpentDict[userKeyID][totTmPos]
            eventCount = avgTimeSpentDict[userKeyID][countPos] + 1
            avgTimeSpentDict[userKeyID][countPos] = eventCount
            avgTimeSpentDict[userKeyID][totTmPos] = totalTime
            avgTimeSpentDict[userKeyID][closeTmPos] = curTimeStamp
            avgTimeSpentDict[userKeyID][nextEventPos] = 0  # next expected event is Open
        elif userKeyID in avgTimeSpentDict and avgTimeSpentDict[userKeyID][nextEventPos] == 0 and curEvent == "open":
            avgTimeSpentDict[userKeyID][openTmPos] = curTimeStamp
            avgTimeSpentDict[userKeyID][nextEventPos] = 1  # next expected event is Close
        elif userKeyID in avgTimeSpentDict and avgTimeSpentDict[userKeyID][nextEventPos] == 1 and curEvent == "open":
            # Two Opens in a row: account for the missing Close via the timeout policy
            curTotalTime, closeTime = missingHandler(defaultTimeOut, avgTimeSpentDict[userKeyID][openTmPos], curTimeStamp)
            totalTime = curTotalTime + avgTimeSpentDict[userKeyID][totTmPos]
            avgTimeSpentDict[userKeyID][totTmPos] = totalTime
            avgTimeSpentDict[userKeyID][closeTmPos] = closeTime
            avgTimeSpentDict[userKeyID][openTmPos] = curTimeStamp
            eventCount = avgTimeSpentDict[userKeyID][countPos] + 1
            avgTimeSpentDict[userKeyID][countPos] = eventCount
        elif userKeyID in avgTimeSpentDict and avgTimeSpentDict[userKeyID][nextEventPos] == 0 and curEvent == "close":
            # Two Closes in a row: account for the missing Open via the timeout policy
            curTotalTime, openTime = missingHandler(defaultTimeOut, avgTimeSpentDict[userKeyID][closeTmPos], curTimeStamp)
            totalTime = curTotalTime + avgTimeSpentDict[userKeyID][totTmPos]
            avgTimeSpentDict[userKeyID][totTmPos] = totalTime
            avgTimeSpentDict[userKeyID][openTmPos] = openTime
            eventCount = avgTimeSpentDict[userKeyID][countPos] + 1
            avgTimeSpentDict[userKeyID][countPos] = eventCount
        elif curEvent == "open":
            # Initialize userid with an Open event
            avgTimeSpentDict[userKeyID] = [0, 0, curTimeStamp, 0, 1]
        elif curEvent == "close":
            # Initialize userid via the missing handler since there is no Open event for this user
            totaltime, OpenTime = missingHandler(defaultTimeOut, 0, curTimeStamp)
            avgTimeSpentDict[userKeyID] = [1, totaltime, OpenTime, curTimeStamp, 0]

def missingHandler(defaultTimeOut, curTimeVal, lastTimeVal):
    if lastTimeVal - curTimeVal > defaultTimeOut:
        return defaultTimeOut, curTimeVal
    else:
        return lastTimeVal - curTimeVal, curTimeVal

def computeAvg(avgTimeSpentDict, defaultTimeOut):
    resDict = {}
    for k, v in avgTimeSpentDict.items():  # iteritems() is Python 2 only
        if v[0] == 0:
            resDict[k] = 0
        else:
            resDict[k] = v[1] / v[0]
    return resDict

if __name__ == "__main__":
    avgTimeSpentDict = {}
    if len(sys.argv) < 2:
        print("Please provide input data file name for processing")
        sys.exit(1)
    fileObj = open(sys.argv[1])
    number_of_chunks = 4 if len(sys.argv) < 3 else int(sys.argv[2])
    defaultTimeOut = 60000 if len(sys.argv) < 4 else int(sys.argv[3])
    for chunk_number in range(number_of_chunks):
        for chunk in fileBlockStream(fileObj, number_of_chunks, chunk_number):
            computeResultDS(chunk, avgTimeSpentDict, defaultTimeOut)
    print(computeAvg(avgTimeSpentDict, defaultTimeOut))
    avgTimeSpentDict.clear()  # nullify the dictionary
    fileObj.close()  # close the file object
Both programs above give the desired output, but efficiency is what matters for this particular scenario. Let me know if you have anything better or any suggestions on the existing implementations.
Thanks in Advance!!

What you are after is iterator usage. I'm not going to rewrite your code, but the trick here is likely to be using an iterator. Fortunately, Scala provides decent out-of-the-box tooling for the job.
import scala.io.Source

object ReadBigFiles {
  def read(fileName: String): Unit = {
    val lines: Iterator[String] = Source.fromFile(fileName).getLines
    // now you get iterator semantics for the file line traversal
    // that means you can only go through the lines once, but you don't incur a penalty on heap usage
  }
}
For your use case, you seem to require a lastUser, so you're dealing with groups of 2 entries. I think you have two choices: either go for iterator.sliding(2), which will produce an iterator for every pair, or simply add recursion to the mix using Options.
def navigate(source: Iterator[String], last: Option[User]): ResultType = {
  if (source.hasNext) {
    val current = source.next()
    last match {
      case Some(existing) => // compare with previous user etc
      case None => navigate(source, Some(current))
    }
  } else {
    // exit recursion, return result
  }
}
You can avoid all the code you've written to read the file and so on. If you need to count occurrences, simply build a Map inside your recursion, and increment the occurrences at every step based on your business logic.
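To make that concrete, here is a minimal, untested sketch of a single-pass fold over the line iterator. It keeps the question's fixed-timeout policy for missing events (names like State and timeOut are mine, not from the original code), and it holds only one small state record per user, so heap usage is proportional to the number of distinct users, not the file size. The problem statement says even the aggregates may not fit in memory; the same fold would still work, but the map would then have to spill to disk.
import scala.io.Source

object StreamingUserMetrics {
  // Running state per user: the last event seen plus accumulated totals
  final case class State(lastTs: Long, lastAction: String, total: Long, sessions: Long)

  def main(args: Array[String]): Unit = {
    val timeOut = 600000L // cap applied when an open or close is missing
    val acc = scala.collection.mutable.Map.empty[String, State]
    val source = Source.fromFile(args(0))
    try {
      for (line <- source.getLines()) {
        val Array(user, tsStr, action) = line.split(",", 3)
        val ts = tsStr.toLong
        acc(user) = acc.get(user) match {
          case None => State(ts, action, 0L, 0L)
          case Some(s) =>
            val gap = ts - s.lastTs
            if (s.lastAction == "open" && action == "close") // complete session
              State(ts, action, s.total + gap, s.sessions + 1)
            else if (s.lastAction == action) // missing event: cap the gap at the timeout
              State(ts, action, s.total + math.min(gap, timeOut), s.sessions + 1)
            else // close followed by open: idle time, ignore
              State(ts, action, s.total, s.sessions)
        }
      }
    } finally source.close()

    for ((user, s) <- acc) {
      val avg = if (s.sessions > 0) s.total / s.sessions else 0L
      println(s"{$user,$avg}")
    }
  }
}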

from queue import LifoQueue, Queue

def averageTime() -> None:
    logs = {}
    records = Queue()
    with open("log.txt") as fp:
        for line in fp:
            user, ts, action = line.strip().split(",")
            if user not in logs:
                logs[user] = LifoQueue()
            logs[user].put((int(ts), action))
    for k in logs:
        somme = 0
        count = 0
        while not logs[k].empty():
            ts, action = logs[k].get()
            # close timestamps count positive and open negative, so the sum is the total time spent
            somme = (somme - ts) if action == "open" else (somme + ts)
            count = count + 1
        records.put([k, somme, count // 2])
    while not records.empty():
        record = records.get()
        if record[2] > 0:  # guard against users with a single unpaired event
            print(f"UserId={record[0]} Avg={record[1] / record[2]}")

Related

How to generate a random Scala Int in Chisel code?

I am trying to implement the way-prediction technique in the RocketChip core (in-order). For this, I need to access each way separately. So this is how the SRAM for tags looks after the modification (a separate SRAM for each way):
val tag_arrays = Seq.fill(nWays) { SeqMem(nSets, UInt(width = tECC.width(1 + tagBits))) }
val tag_rdata = Reg(Vec(nWays, UInt(width = tECC.width(1 + tagBits))))
for ((tag_array, i) <- tag_arrays zipWithIndex) {
  tag_rdata(i) := tag_array.read(s0_vaddr(untagBits-1, blockOffBits), !refill_done && s0_valid)
}
And I want to access it like this:
when (refill_done) {
  val enc_tag = tECC.encode(Cat(tl_out.d.bits.error, refill_tag))
  tag_arrays(repl_way).write(refill_idx, enc_tag)
  ccover(tl_out.d.bits.error, "D_ERROR", "I$ D-channel error")
}
Here repl_way is a random Chisel UInt generated by an LFSR. But a Seq element can only be accessed with a Scala Int index, which causes a compilation error. Then I tried to access it like this:
when (refill_done) {
  val enc_tag = tECC.encode(Cat(tl_out.d.bits.error, refill_tag))
  for (i <- 0 until nWays) {
    when (repl_way === i.U) { tag_arrays(i).write(refill_idx, enc_tag) }
  }
  ccover(tl_out.d.bits.error, "D_ERROR", "I$ D-channel error")
}
But then this assertion fires:
assert(PopCount(s1_tag_hit zip s1_tag_disparity map { case (h, d) => h && !d }) <= 1)
I am trying to modify the ICache.scala file. Any ideas on how to do this properly? Thanks!
I think you can just use a Vec here instead of a Seq
val tag_arrays = Vec(nWays, SeqMem(nSets, UInt(width = tECC.width(1 + tagBits))))
The Vec allows indexing with a UInt
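If your Chisel version accepts a Vec of SeqMems, the write from the question should then compile unchanged, because Vec.apply takes a hardware index:
when (refill_done) {
  val enc_tag = tECC.encode(Cat(tl_out.d.bits.error, refill_tag))
  tag_arrays(repl_way).write(refill_idx, enc_tag) // repl_way is a UInt, which a Vec can index
}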

Can anyone explain interesting spin-lock behavior?

Given the following code
case class Score(value: BigInt, random: Long = randomLong) extends Comparable[Score] {
  override def compareTo(that: Score): Int = {
    if (this.value < that.value) -1
    else if (this.value > that.value) 1
    else if (this.random < that.random) -1
    else if (this.random > that.random) 1
    else 0
  }

  override def equals(obj: _root_.scala.Any): Boolean = {
    val that = obj.asInstanceOf[Score]
    this.value == that.value && this.random == that.random
  }
}

@tailrec
private def update(mode: UpdateMode, member: String, newScore: Score, spinCount: Int, spinStart: Long): Unit = {
  // Caution: there is some subtle logic below, so don't modify it unless you grok it
  try {
    Metrics.checkSpinCount(member, spinCount)
  } catch {
    case cause: ConcurrentModificationException =>
      throw new ConcurrentModificationException(Leaderboard.maximumSpinCountExceeded.format("update", member), cause)
  }
  // Set the spin-lock
  put(member, None) match {
    case None =>
      // BEGIN CRITICAL SECTION
      // Member's first time on the board
      if (scoreToMember.put(newScore, member) != null) {
        val message = s"$member: added new member in memberToScore, but found old member in scoreToMember"
        logger.error(message)
        throw new ConcurrentModificationException(message)
      }
      memberToScore.put(member, Some(newScore)) // remove the spin-lock
      // END CRITICAL SECTION
    case Some(option) => option match {
      case None => // Update in progress, so spin until complete
        //logger.debug(s"update: $member locked, spinCount = $spinCount")
        for (i <- -1 to spinCount * 2) { Thread.`yield`() } // dampen contention
        update(mode, member, newScore, spinCount + 1, spinStart)
      case Some(oldScore) =>
        // BEGIN CRITICAL SECTION
        // Member already on the leaderboard
        if (scoreToMember.remove(oldScore) == null) {
          val message = s"$member: oldScore not found in scoreToMember, concurrency defect"
          logger.error(message)
          throw new ConcurrentModificationException(message)
        } else {
          val score =
            mode match {
              case Replace =>
                //logger.debug(s"$member: newScore = $newScore")
                newScore
              case Increment =>
                //logger.debug(s"$member: newScore = $newScore, oldScore = $oldScore")
                Score(newScore.value + oldScore.value)
            }
          //logger.debug(s"$member: updated score = $score")
          scoreToMember.put(score, member)
          memberToScore.put(member, Some(score)) // remove the spin-lock
          //logger.debug(s"update: $member unlocked")
        }
        // END CRITICAL SECTION
        // Do this outside the critical section to reduce time under lock
        if (spinCount > 0) Metrics.checkSpinTime(System.nanoTime() - spinStart)
    }
  }
}
There are two important data structures: memberToScore and scoreToMember. I have experimented with both TrieMap[String, Option[Score]] and ConcurrentHashMap[String, Option[Score]] for memberToScore, and both show the same behavior.
So far my testing indicates the code is correct and thread safe, but the mystery is the performance of the spin-lock. On a system with 12 hardware threads, running 1000 iterations on each of 12 Futures: hitting the same member all the time results in spin cycles of 50 or more, but hitting a random distribution of members can result in spin cycles of 100 or more. The behavior gets worse if I don't dampen the spin by iterating over yield() calls.
So this seems counter-intuitive: I was expecting the random distribution of keys to result in less spinning than the same key, but testing shows otherwise.
Can anyone offer some insight into this counter-intuitive behavior?
Granted there may be better solutions to my design, and I am open to them, but for now I cannot seem to find a satisfactory explanation for what my tests are showing, and my curiosity leaves me hungry.
As an aside, while the single member test has a lower ceiling for the spin count, the random member test has a lower ceiling for time spinning, which is what I would expect. I just cannot explain why the random member test generally produces a higher ceiling for spin count.
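For reference, here is a minimal sketch of the kind of test harness described above; leaderboard.update and its arguments are placeholders for the real API (not from the question's code), and the single-member and random-member runs differ only in how the key is chosen:
import java.util.concurrent.ThreadLocalRandom
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object SpinLockHarness {
  def run(randomKeys: Boolean): Unit = {
    val members = (0 until 1000).map(i => s"member-$i")
    val futures = (1 to 12).map { _ =>
      Future {
        for (i <- 1 to 1000) {
          val member =
            if (randomKeys) members(ThreadLocalRandom.current().nextInt(members.size))
            else members.head // every Future hammers the same key
          leaderboard.update(member, Score(BigInt(i))) // placeholder for the real update API
        }
      }
    }
    Await.result(Future.sequence(futures), 5.minutes)
  }
}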

SCALA multiple nested case statement in map RDD operations

I have an RDD with 7 string elements
linesData: org.apache.spark.rdd.RDD[(String, String, String, String, String, String, String)]
I need to figure out for each record the following 2 items: Status and CSrange.
the logic is something like:
case when a(6) = 0 then
    if Date(a(5)) > Date(a(4)) then
        if Date(a(5)) - Date(a(4)) > 60 days then
            status = '60+'
        else
            status = 'Curr'
        endif
    else
        status = 'Curr'
    end
when (a(6) >= 1 and a(6) <= 3) then
    Status = 'In-Forcl'
when (a(6) >= 4 and a(6) <= 8) then
    Status = 'Forclosed'
else
    Status = 'Unknown'
end case

case when (a(1) < 640 and a(1) > 0) then CSrange = '<640'
when (a(1) < 650 and a(1) > 579) then CSrange = '640-649'
when (a(1) < 660 and a(1) > 619) then CSrange = '650-659'
when (a(1) < 680 and a(1) > 639) then CSrange = '640-649'
when (a(1) > 789) then CSrange = '680+'
else
    CSrange = 'Unknown'
end case
At this point, I would like to write the data out to disk, with probably 9 elements instead of 7. (Later I will need to calculate the rate of each of the statuses above by various elements.)
My first problems are:
1. How do I do date arithmetic? I need to stay at the RDD level (no DataFrames).
2. I don't know how to do the CASE statements in Scala.
SAMPLE DATA:
(2017_7_0555,794,Scott,CORNERSTONE,8/1/2017,8/1/2017,0)
(2017_7_0557,682,Hennepin,LAKE AREA MT,9/1/2017,8/1/2017,0)
(2017_7_0565,754,Ramsey,GUARANTEED R,6/1/2017,8/1/2017,0)
(2017_7_0570,645,Hennepin,FAIRWAY INDE,2/1/2015,8/1/2017,5)
(2017_7_0574,732,Wright,GUARANTEED R,7/1/2017,8/1/2017,0)
(2017_7_0575,789,Hennepin,GUARANTEED R,8/1/2017,8/1/2017,0)
(2017_7_0577,662,Hennepin,MIDCOUNTRY,8/1/2017,8/1/2017,0)
(2017_7_4550,642,Mower,WELLS FARGO,5/1/2017,8/1/2017,0)
(2017_7_4574,689,Hennepin,RIVER CITY,8/1/2017,8/1/2017,0)
(2017_7_4584,662,Hennepin,WELLS FARGO,8/1/2017,8/1/2017,0)
(2017_7_4600,719,Ramsey,PHH HOME LOA,5/1/2017,8/1/2017,0)
I would create a case class with seven fields, map the RDD[(String, ...)] to an RDD[MyClass], and cast to the appropriate types. I recommend the JodaTime library for the dates. This will make your code more descriptive.
Then, in a map, extract the status and range with two functions:
myRDD.map(myInstance => (getStatus(myInstance), getRange(myInstance)))
def getStatus(myInstance: MyClass): String = {
  // seven fields, so seven positions in the tuple
  val (_, _, _, _, date4, date5, field6) = MyClass.unapply(myInstance).get
  field6 match {
    case 0 =>
      if (date5.isAfter(date4)) {
        if (date5.isAfter(date4.plusDays(60))) "60+" else "Curr"
      } else {
        "Curr"
      }
    case x if (x >= 1 && x <= 3) => "In-Forcl"
    ....
  }
}
Note:
I haven't tested the code.
Rename the variables from the example.
This shows an example of how to manage JodaTime dates and Scala pattern matching; you have to finish the function and define the other one.
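To make this concrete, here is a self-contained, untested sketch along the same lines. It uses java.time instead of JodaTime (either works; java.time ships with Java 8+). The case class field names are guesses based on the sample data, and since the score buckets in the question overlap, the boundaries below are tidied assumptions:
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit

// Assumed names for the seven columns of the sample data
case class LoanRecord(id: String, score: Int, county: String, lender: String,
                      lastDate: LocalDate, reportDate: LocalDate, code: Int)

object LoanStatus {
  private val fmt = DateTimeFormatter.ofPattern("M/d/yyyy")

  def parse(t: (String, String, String, String, String, String, String)): LoanRecord =
    LoanRecord(t._1, t._2.toInt, t._3, t._4,
      LocalDate.parse(t._5, fmt), LocalDate.parse(t._6, fmt), t._7.toInt)

  def getStatus(r: LoanRecord): String = r.code match {
    case 0 => // date arithmetic: more than 60 days between the two dates
      if (ChronoUnit.DAYS.between(r.lastDate, r.reportDate) > 60) "60+" else "Curr"
    case x if x >= 1 && x <= 3 => "In-Forcl"
    case x if x >= 4 && x <= 8 => "Forclosed"
    case _ => "Unknown"
  }

  def getRange(r: LoanRecord): String = r.score match {
    case s if s <= 0 => "Unknown"
    case s if s < 640 => "<640"
    case s if s < 650 => "640-649"
    case s if s < 660 => "650-659"
    case s if s < 680 => "660-679"
    case _ => "680+"
  }
}

// Usage, staying at the RDD level:
// val withStatus = linesData.map(LoanStatus.parse).map(r => (r.id, LoanStatus.getStatus(r), LoanStatus.getRange(r)))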

ZipInputStream.read in ZipEntry

I am reading a zip file using ZipInputStream. The zip file has 4 CSV files. Some files are written out completely, some only partially. Please help me find the issue with the code below. Is there any limit on the buffer read by the ZipInputStream.read method?
val zis = new ZipInputStream(inputStream)
Stream.continually(zis.getNextEntry).takeWhile(_ != null).foreach { file =>
  if (!file.isDirectory && file.getName.endsWith(".csv")) {
    val buffer = new Array[Byte](file.getSize.toInt)
    zis.read(buffer)
    val fo = new FileOutputStream("c:\\temp\\input\\" + file.getName)
    fo.write(buffer)
  }
You have not closed/flushed the files you attempted to write. It should be something like this (assuming Scala syntax, or is this Kotlin/Ceylon?):
val fo = new FileOutputStream("c:\\temp\\input\\" + file.getName)
try {
  fo.write(buffer)
} finally {
  fo.close()
}
Also you should check the read count and read more if necessary, something like this:
var readBytes = 0
while (readBytes < buffer.length) {
  val r = zis.read(buffer, readBytes, buffer.length - readBytes)
  r match {
    case -1 => throw new IllegalStateException("Read terminated before reading everything")
    case _ => readBytes += r
  }
}
PS: Your example also seems to be missing some of the required closing }s.
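Putting both fixes together, here is a minimal untested sketch. Note that ZipEntry.getSize can return -1 when the size is unknown, so copying through a fixed-size buffer is safer than sizing the array from the entry:
import java.io.{FileOutputStream, InputStream}
import java.util.zip.ZipInputStream

def extractCsvs(inputStream: InputStream): Unit = {
  val zis = new ZipInputStream(inputStream)
  try {
    Stream.continually(zis.getNextEntry).takeWhile(_ != null).foreach { entry =>
      if (!entry.isDirectory && entry.getName.endsWith(".csv")) {
        val fo = new FileOutputStream("c:\\temp\\input\\" + entry.getName)
        try {
          val buffer = new Array[Byte](8192) // fixed-size copy buffer; the entry size may be unknown
          var r = zis.read(buffer)
          while (r != -1) {
            fo.write(buffer, 0, r)
            r = zis.read(buffer)
          }
        } finally fo.close()
      }
    }
  } finally zis.close()
}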

Specify Variable Initialization Order in Scala

I have a special class Model that needs to have its methods called in a very specific order.
I tried doing something like this:
val model = new Model
new MyWrappingClass {
  val first = model.firstMethod()
  val second = model.secondMethod()
  val third = model.thirdMethod()
}
The methods should be called in the order listed, however I am seeing an apparently random order.
Is there any way to get the variable initialization methods to be called in a particular order?
I doubt your methods are called in the wrong order. But to be sure, you can try something like this:
val (first, second, third) = (
  model.firstMethod(),
  model.secondMethod(),
  model.thirdMethod()
)
You likely have some other problem with your code.
I can run 100 million loops where it never gets the order wrong, as follows:
class Model {
  var done = Array(false, false, false)
  def firstMethod(): Boolean = { done(0) = true; done(1) || done(2) }
  def secondMethod(): Boolean = { done(1) = true; !done(0) || done(2) }
  def thirdMethod(): Boolean = { done(2) = true; !done(0) || !done(1) }
}
Notice that these methods return true if called out of order and false when called in order.
Here's your class:
class MyWrappingClass {
  val model = new Model
  val first = model.firstMethod()
  val second = model.secondMethod()
  val third = model.thirdMethod()
}
Our function to check for bad behavior on each trial:
def isNaughty(w: MyWrappingClass):Boolean = { w.first || w.second || w.third };
A short program to test:
var i = 0
var b = false
while ((i < 100000000) && !b) {
  b = isNaughty(new MyWrappingClass)
  i += 1
}
if (b) {
  println("out-of-order behavior occurred")
  println(i)
} else {
  println("looks good")
}
Scala 2.11.7 on OpenJDK8 / Ubuntu 15.04
Of course this doesn't prove that the wrong order is impossible, only that correct behavior seems highly repeatable in a fairly simple case.