Spark Structured Streaming - Update following groupByKey and mapGroupsWithState giving duplicate key results - scala

I am trying to execute the following stateful aggregation in Databricks (Scala):
sig_df
  .as[InputRow]
  .groupByKey(_.uid)
  .mapGroupsWithState(GroupStateTimeout.NoTimeout)(updateAcrossEvents)
  .writeStream
  .queryName("events_per_window_2")
  .format("memory")
  .outputMode("update")
  .start()
The functions managing the state are these:
def updateAcrossEvents(uid: String,
                       inputs: Iterator[InputRow],
                       oldState: GroupState[UState]): UState = {
  var state: UState = if (oldState.exists) oldState.get else UState(uid, -999999, -999999, -999999)
  for (input <- inputs) {
    state = updateUStateWithEvent(state, input)
    oldState.update(state)
  }
  state
}
And this:
def updateUStateWithEvent(state: UState, input: InputRow): UState = {
  // no timestamp, just ignore it
  if (Option(input.timestamp).isEmpty) {
    return state
  }
  if (input.sig_id == 10) {
    state.front_in = input.sig_value.toInt
  } else if (input.sig_id == 17) {
    state.rear_in = input.sig_value.toInt
  } else if (input.sig_id == 25) {
    state.top_in = input.sig_value.toInt
  }
  // return the updated state
  state
}
The issue I am facing is that the output has duplicates for the key uid. The following query returns plenty of results:
SELECT uid, count(*)
FROM events_per_window_2
WHERE front_in <> -999999
   OR rear_in <> -999999
   OR top_in <> -999999
GROUP BY uid
HAVING count(*) > 1
I was under the impression that since the output mode is update, we would not get any duplicates. What might be going wrong with my approach here?

Related

Can someone help me with the Apex code below? I want to update this code to comply with governor limits.

public static void updatecasefields(List<Case> lstcase) {
    //List<Case> lstcase = new list<case>();
    ID devRecordTypeId = Schema.SObjectType.Case.getRecordTypeInfosByDeveloperName().get('CRM_CSR_Case').getRecordTypeId();
    for (Case cs : lstcase) {
        if (cs.ID != null && cs.RecordTypeId == devRecordTypeId) {
        }
        List<CRM_CasePick__c> Casp = [SELECT Id, CRM_Carrier_Name__c, CRM_LOB__c, CRM_SLA_Turnaround_Time__c, CRM_Category__c, CRM_Issue_Sub_Type__c, CRM_Issue_Type__c, CRM_Turnaround_Time_Days__c
                                      FROM CRM_CasePick__c
                                      WHERE CRM_Carrier_Name__c = :cs.GiDP_CarrierName__c
                                        AND CRM_Category__c = :cs.CRM_Category__c
                                        AND CRM_Issue_Type__c = :cs.CRM_Issue_Type__c
                                        AND CRM_Issue_Sub_Type__c = :cs.CRM_Issue_Sub_Type__c
                                        AND CRM_LOB__c = :cs.CRM_Line_of_Business__c];
        for (CRM_CasePick__c cp : Casp) {
            cs.CRM_Turnaround_Time_Days__c = cp.CRM_Turnaround_Time_Days__c;
            cs.CRM_SLA_Turnaround_time__c = cp.CRM_SLA_Turnaround_Time__c;
        }
    }
}
Remove the SOQL query from the for loop - best practice is to never run a query within a loop.
Right now it runs that query once for every Case in the initial list; if the list has more than 100 records, it will exceed the governor limit of 100 SOQL queries per transaction.

Performance disadvantage of using Datasets vs RDDs with Spark

I've partially rewritten my code to use Datasets instead of RDDs; however, I'm seeing a significant performance decrease for some operations.
For example:
val filtered = trips.filter(t => exportFilter.check(t)).cache()
seems to be much slower, and the CPU is mostly idle.
What is the reason for this? Is it a bad idea to use Datasets when trying to access plain objects?
UPDATE:
Here is the filter check method:
override def check(trip: Trip): Boolean = {
  if (trip == null || !trip.isCompleted) {
    return false
  }
  // Return if no extended filter configured or we already
  if (exportConfiguration.isBasicFilter) {
    return trip.isCompleted
  }
  // Here trip is completed, check other conditions
  // Filter out trips from future
  val isTripTimeOk = checkTripTime(trip)
  return isTripTimeOk
}

/**
 * Trip time should have end time today or inside yesterday midnight interval
 */
def checkTripTime(trip: Trip): Boolean = {
  // Check inclusive trip low bound. Should have end time today or inside yesterday midnight interval
  val isLowBoundOk = tripTimingProcessor.isLaterThanYesterdayMidnightIntervalStarts(trip.getEndTimeMillis)
  if (!isLowBoundOk) {
    updateLowBoundMetrics(trip)
    return false
  }
  // Check trip high bound
  val isHighBoundOk = tripTimingProcessor.isBeforeMidnightIntervalStarts(trip.getEndTimeMillis)
  if (!isHighBoundOk) {
    metricService.inc(trip.getStartTimeMillis, trip.getProviderId,
      ExportMetricName.TRIPS_EXPORTED_S3_SKIPPED_END_INSIDE_MIDNIGHT_INTERVAL)
  }
  return isHighBoundOk
}

private def updateLowBoundMetrics(trip: Trip) = {
  metricService.inc(trip.getStartTimeMillis, trip.getProviderId,
    ExportMetricName.TRIPS_EXPORTED_S3_SKIPPED_END_BEFORE_YESTERDAY_MIDNIGHT_INTERVAL)
  val pointIter = trip.getPoints.iterator()
  while (pointIter.hasNext()) {
    val point = pointIter.next()
    metricService.inc(point.getCaptureTimeMillis, point.getProviderId,
      ExportMetricName.POINT_EXPORTED_S3_SKIPPED_END_BEFORE_YESTERDAY_MIDNIGHT_INTERVAL)
  }
}

Efficient way to optimise Scala code to read a large file that doesn't fit in memory

Problem statement below.
We have a large log file which stores user interactions with an application. The entries in the log file follow this schema: {userId, timestamp, actionType} where actionType is one of two possible values: [open, close]
Constraints:
The log file is too big to fit in memory on one machine. Also assume that the aggregated data doesn’t fit into memory.
Code has to be able to run on a single machine.
Should not use an out-of-the-box implementation of MapReduce or a 3rd-party database; don't assume we have Hadoop or Spark or another distributed computing framework.
There can be multiple entries of each actionType for each user, and there might be missing entries in the log file. So a user might be missing a close record between two open records or vice versa.
Timestamps will come in strictly ascending order.
For this problem, we need to implement a class/classes that computes the average time spent by each user between open and close. Keep in mind that there are missing entries for some users, so we will have to make a choice about how to handle these entries when making our calculations. Code should follow a consistent policy with regards to how we make that choice.
The desired output for the solution should be [{userId, timeSpent},….] for all the users in the log file.
Sample log file (comma-separated, text file)
1,1435456566,open
2,1435457643,open
3,1435458912,open
1,1435459567,close
4,1435460345,open
1,1435461234,open
2,1435462567,close
1,1435463456,open
3,1435464398,close
4,1435465122,close
1,1435466775,close
Approach
Below is the code I've written in Python and Scala, which doesn't seem to be efficient or up to the expectations of the given scenario. I'd like feedback from the developer community in this forum on how we could better optimise this code for the given scenario.
Scala implementation
import java.io.FileInputStream
import java.util.{Scanner, Map, LinkedList}
import java.lang.Long
import scala.collection.mutable

object UserMetrics extends App {
  if (args.length == 0) {
    println("Please provide input data file name for processing")
  }
  val userMetrics = new UserMetrics()
  userMetrics.readInputFile(args(0), if (args.length == 1) 600000 else args(1).toInt)
}

case class UserInfo(userId: Integer, prevTimeStamp: Long, prevStatus: String, timeSpent: Long, occurence: Integer)

class UserMetrics {
  val usermap = mutable.Map[Integer, LinkedList[UserInfo]]()

  def readInputFile(stArr: String, timeOut: Int) {
    var inputStream: FileInputStream = null
    var sc: Scanner = null
    try {
      inputStream = new FileInputStream(stArr);
      sc = new Scanner(inputStream, "UTF-8");
      while (sc.hasNextLine()) {
        val line: String = sc.nextLine();
        processInput(line, timeOut)
      }
      for ((key: Integer, userLs: LinkedList[UserInfo]) <- usermap) {
        val userInfo: UserInfo = userLs.get(0)
        val timespent = if (userInfo.occurence > 0) userInfo.timeSpent / userInfo.occurence else 0
        println("{" + key + "," + timespent + "}")
      }
      if (sc.ioException() != null) {
        throw sc.ioException();
      }
    } finally {
      if (inputStream != null) {
        inputStream.close();
      }
      if (sc != null) {
        sc.close();
      }
    }
  }

  def processInput(line: String, timeOut: Int) {
    val strSp = line.split(",")
    val userId: Integer = Integer.parseInt(strSp(0))
    val curTimeStamp = Long.parseLong(strSp(1))
    val status = strSp(2)
    val uInfo: UserInfo = UserInfo(userId, curTimeStamp, status, 0, 0)
    val emptyUserInfo: LinkedList[UserInfo] = new LinkedList[UserInfo]()
    val lsUserInfo: LinkedList[UserInfo] = usermap.getOrElse(userId, emptyUserInfo)
    if (lsUserInfo != null && lsUserInfo.size() > 0) {
      val lastUserInfo: UserInfo = lsUserInfo.get(lsUserInfo.size() - 1)
      val prevTimeStamp: Long = lastUserInfo.prevTimeStamp
      val prevStatus: String = lastUserInfo.prevStatus
      if (prevStatus.equals("open")) {
        if (status.equals(lastUserInfo.prevStatus)) {
          val timeSelector = if ((curTimeStamp - prevTimeStamp) > timeOut) timeOut else curTimeStamp - prevTimeStamp
          val timeDiff = lastUserInfo.timeSpent + timeSelector
          lsUserInfo.remove()
          lsUserInfo.add(UserInfo(userId, curTimeStamp, status, timeDiff, lastUserInfo.occurence + 1))
        } else if (!status.equals(lastUserInfo.prevStatus)) {
          val timeDiff = lastUserInfo.timeSpent + curTimeStamp - prevTimeStamp
          lsUserInfo.remove()
          lsUserInfo.add(UserInfo(userId, curTimeStamp, status, timeDiff, lastUserInfo.occurence + 1))
        }
      } else if (prevStatus.equals("close")) {
        if (status.equals(lastUserInfo.prevStatus)) {
          lsUserInfo.remove()
          val timeSelector = if ((curTimeStamp - prevTimeStamp) > timeOut) timeOut else curTimeStamp - prevTimeStamp
          lsUserInfo.add(UserInfo(userId, curTimeStamp, status, lastUserInfo.timeSpent + timeSelector, lastUserInfo.occurence + 1))
        } else if (!status.equals(lastUserInfo.prevStatus)) {
          lsUserInfo.remove()
          lsUserInfo.add(UserInfo(userId, curTimeStamp, status, lastUserInfo.timeSpent, lastUserInfo.occurence))
        }
      }
    } else if (lsUserInfo.size() == 0) {
      lsUserInfo.add(uInfo)
    }
    usermap.put(userId, lsUserInfo)
  }
}
}
Python Implementation
import sys

def fileBlockStream(fp, number_of_blocks, block):
    #A generator that splits a file into blocks and iterates over the lines of one of the blocks.
    assert 0 <= block and block < number_of_blocks #Assertions to validate number of blocks given
    assert 0 < number_of_blocks
    fp.seek(0,2) #seek to end of file to compute block size
    file_size = fp.tell()
    ini = file_size * block / number_of_blocks #compute start & end point of file block
    end = file_size * (1 + block) / number_of_blocks
    if ini <= 0:
        fp.seek(0)
    else:
        fp.seek(ini-1)
        fp.readline()
    while fp.tell() < end:
        yield fp.readline() #iterate over lines of the particular chunk or block

def computeResultDS(chunk,avgTimeSpentDict,defaultTimeOut):
    countPos,totTmPos,openTmPos,closeTmPos,nextEventPos = 0,1,2,3,4
    for rows in chunk.splitlines():
        if len(rows.split(",")) != 3:
            continue
        userKeyID = rows.split(",")[0]
        try:
            curTimeStamp = int(rows.split(",")[1])
        except ValueError:
            print("Invalid Timestamp for ID:" + str(userKeyID))
            continue
        curEvent = rows.split(",")[2]
        if userKeyID in avgTimeSpentDict.keys() and avgTimeSpentDict[userKeyID][nextEventPos]==1 and curEvent == "close":
            #Check if already existing userID with expected Close event 0 - Open; 1 - Close
            #Array value within dictionary stores [No. of pair events, total time spent (Close tm-Open tm), Last Open Tm, Last Close Tm, Next expected Event]
            curTotalTime = curTimeStamp - avgTimeSpentDict[userKeyID][openTmPos]
            totalTime = curTotalTime + avgTimeSpentDict[userKeyID][totTmPos]
            eventCount = avgTimeSpentDict[userKeyID][countPos] + 1
            avgTimeSpentDict[userKeyID][countPos] = eventCount
            avgTimeSpentDict[userKeyID][totTmPos] = totalTime
            avgTimeSpentDict[userKeyID][closeTmPos] = curTimeStamp
            avgTimeSpentDict[userKeyID][nextEventPos] = 0 #Change next expected event to Open
        elif userKeyID in avgTimeSpentDict.keys() and avgTimeSpentDict[userKeyID][nextEventPos]==0 and curEvent == "open":
            avgTimeSpentDict[userKeyID][openTmPos] = curTimeStamp
            avgTimeSpentDict[userKeyID][nextEventPos] = 1 #Change next expected event to Close
        elif userKeyID in avgTimeSpentDict.keys() and avgTimeSpentDict[userKeyID][nextEventPos]==1 and curEvent == "open":
            curTotalTime,closeTime = missingHandler(defaultTimeOut,avgTimeSpentDict[userKeyID][openTmPos],curTimeStamp)
            totalTime = curTotalTime + avgTimeSpentDict[userKeyID][totTmPos]
            avgTimeSpentDict[userKeyID][totTmPos]=totalTime
            avgTimeSpentDict[userKeyID][closeTmPos]=closeTime
            avgTimeSpentDict[userKeyID][openTmPos]=curTimeStamp
            eventCount = avgTimeSpentDict[userKeyID][countPos] + 1
            avgTimeSpentDict[userKeyID][countPos] = eventCount
        elif userKeyID in avgTimeSpentDict.keys() and avgTimeSpentDict[userKeyID][nextEventPos]==0 and curEvent == "close":
            curTotalTime,openTime = missingHandler(defaultTimeOut,avgTimeSpentDict[userKeyID][closeTmPos],curTimeStamp)
            totalTime = curTotalTime + avgTimeSpentDict[userKeyID][totTmPos]
            avgTimeSpentDict[userKeyID][totTmPos]=totalTime
            avgTimeSpentDict[userKeyID][openTmPos]=openTime
            eventCount = avgTimeSpentDict[userKeyID][countPos] + 1
            avgTimeSpentDict[userKeyID][countPos] = eventCount
        elif curEvent == "open":
            #Initialize userid with Open event
            avgTimeSpentDict[userKeyID] = [0,0,curTimeStamp,0,1]
        elif curEvent == "close":
            #Initialize userid with missing handler function since there is no Open event for this User
            totaltime,OpenTime = missingHandler(defaultTimeOut,0,curTimeStamp)
            avgTimeSpentDict[userKeyID] = [1,totaltime,OpenTime,curTimeStamp,0]

def missingHandler(defaultTimeOut,curTimeVal,lastTimeVal):
    if lastTimeVal - curTimeVal > defaultTimeOut:
        return defaultTimeOut,curTimeVal
    else:
        return lastTimeVal - curTimeVal,curTimeVal

def computeAvg(avgTimeSpentDict,defaultTimeOut):
    resDict = {}
    for k,v in avgTimeSpentDict.iteritems():
        if v[0] == 0:
            resDict[k] = 0
        else:
            resDict[k] = v[1]/v[0]
    return resDict

if __name__ == "__main__":
    avgTimeSpentDict = {}
    if len(sys.argv) < 2:
        print("Please provide input data file name for processing")
        sys.exit(1)
    fileObj = open(sys.argv[1])
    number_of_chunks = 4 if len(sys.argv) < 3 else int(sys.argv[2])
    defaultTimeOut = 60000 if len(sys.argv) < 4 else int(sys.argv[3])
    for chunk_number in range(number_of_chunks):
        for chunk in fileBlockStream(fileObj, number_of_chunks, chunk_number):
            computeResultDS(chunk, avgTimeSpentDict, defaultTimeOut)
    print (computeAvg(avgTimeSpentDict,defaultTimeOut))
    avgTimeSpentDict.clear() #Nullify dictionary
    fileObj.close #Close the file object
Both programs above give the desired output, but efficiency is what matters for this particular scenario. Let me know if you have anything better or any suggestions on the existing implementation.
Thanks in Advance!!
What you are after is iterator usage. I'm not going to rewrite your code, but the trick here is likely to be using an iterator. Fortunately, Scala provides decent out-of-the-box tooling for the job.
import scala.io.Source

object ReadBigFiles {
  def read(fileName: String): Unit = {
    val lines: Iterator[String] = Source.fromFile(fileName).getLines
    // now you get iterator semantics for the file line traversal
    // that means you can only go through the lines once, but you don't incur a penalty on heap usage
  }
}
For your use case, you seem to require a lastUser, so you're dealing with groups of 2 entries. I think you have two choices: either go for iterator.sliding(2), which will produce an iterator over every pair of consecutive lines (a minimal sketch of this is shown after the recursion example below), or simply add recursion to the mix using options.
def navigate(source: Iterator[String], last: Option[User]): ResultType = {
  if (source.hasNext) {
    val current = source.next()
    last match {
      case Some(existing) => // compare with previous user etc
      case None => navigate(source, Some(current))
    }
  } else {
    // exit recursion, return result
  }
}
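For the sliding(2) alternative mentioned above, a minimal sketch might look like this (the file name is a placeholder and the pairing logic is left as a comment, since it depends on your policy for missing entries):
import scala.io.Source

// Hypothetical sketch of the sliding(2) option: each element is a Seq of two
// consecutive lines, so the previous and the current record can be compared
// without holding the whole file in memory.
val pairs: Iterator[Seq[String]] = Source.fromFile("user_logs.csv").getLines.sliding(2)

pairs.foreach {
  case Seq(prev, curr) =>
    // split prev and curr on "," and apply your open/close pairing logic here
  case _ =>
    // the file had fewer than two lines
}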
You can avoid all the code you've written to read the file and so on. If you need to count occurrences, simply build a Map inside your recursion, and increment the occurrences at every step based on your business logic.
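To make that concrete, here is a minimal, hypothetical sketch of the map-in-recursion idea; the accumulator fields, the file name and the policy of skipping unmatched close events are illustrative assumptions, not requirements from the question:
import scala.io.Source

object AverageTimes extends App {
  // Hypothetical accumulator: open timestamps still waiting for a close,
  // plus per-user total time and pair count.
  final case class Acc(openAt: Map[String, Long] = Map.empty,
                       total: Map[String, Long] = Map.empty,
                       pairs: Map[String, Long] = Map.empty)

  @annotation.tailrec
  def walk(lines: Iterator[String], acc: Acc): Acc =
    if (!lines.hasNext) acc
    else {
      val next = lines.next().split(",") match {
        case Array(user, ts, "open") =>
          acc.copy(openAt = acc.openAt.updated(user, ts.toLong))
        case Array(user, ts, "close") =>
          acc.openAt.get(user) match {
            case Some(openTs) =>
              acc.copy(
                openAt = acc.openAt - user,
                total = acc.total.updated(user, acc.total.getOrElse(user, 0L) + (ts.toLong - openTs)),
                pairs = acc.pairs.updated(user, acc.pairs.getOrElse(user, 0L) + 1L))
            case None => acc // close without a matching open: skipped in this sketch
          }
        case _ => acc // malformed line: ignored
      }
      walk(lines, next)
    }

  val result = walk(Source.fromFile("user_logs.csv").getLines, Acc())
  result.pairs.foreach { case (user, n) => println(s"{$user,${result.total(user) / n}}") }
}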
from queue import LifoQueue, Queue

def averageTime() -> float:
    logs = {}
    records = Queue()
    with open("log.txt") as fp:
        lines = fp.readlines()
    for line in lines:
        # split "userId,timestamp,action" rather than indexing single characters
        user, timestamp, action = line.strip().split(",")
        if user not in logs:
            logs[user] = LifoQueue()
        logs[user].put((int(timestamp), action))
    for k in logs:
        somme = 0
        count = 0
        while not logs[k].empty():
            l = logs[k].get()
            # close timestamps are added and open timestamps subtracted,
            # so each open/close pair contributes its duration
            somme = (somme - l[0]) if l[1] == "open" else (somme + l[0])
            count = count + 1
        records.put([k, somme, count // 2])
    while not records.empty():
        record = records.get()
        print(f"UserId={record[0]} Avg={record[1]/record[2]}")

How to assure the returned StringList will be ordered : Scala

I am using Scala 2.11.8
I am trying to read queries from my property file. Each query set has multiple parts (explained below), and I have a certain sequence in which these queries must execute.
Code:
import com.typesafe.config.ConfigFactory

object ReadProperty {
  def main(args: Array[String]): Unit = {
    val queryRead = ConfigFactory.load("testqueries.properties").getConfig("select").getStringList("caseInc").toArray()
    val localRead = ConfigFactory.load("testqueries.properties").getConfig("select").getStringList("caseLocal").toArray.toSet
    queryRead.foreach(println)
    localRead.foreach(println)
  }
}
Property file content:
select.caseInc.2 = Select emp_salary, emp_dept_id from employees
select.caseLocal.1 = select one
select.caseLocal.3 = select three
select.caseRemote.2 = Select e1.emp_name, d1.dept_name, e1.salary from emp_1 e1 join dept_1 d1 on(e1.emp_dept_id = d1.dept_id)
select.caseRemote.1 = Select * from departments
select.caseInc.1 = Select emp_id, emp_name from employees
select.caseLocal.2 = select two
select.caseLocal.4 = select four
Output:
Select emp_id, emp_name from employees
Select emp_salary, emp_dept_id from employees
select one
select two
select three
select four
As we can see in the output, the result is sorted. In the property file, I have numbered the queries in the sequence in which they should run (passing caseInc and caseLocal as arguments).
With getStringList() I always get the list sorted by the sequence numbers I provided.
Even when I tried using toArray() and toArray().toSet I got sorted output.
So far so good.
But how can I be sure that it will always return the sorted order I have provided in the property file? I am confused because I am not able to find any API documentation that says the returned list will be sorted.
I think you can rely on this fact. Looking into the code of DefaultTransformer you can see the following piece of logic:
} else if (requested == ConfigValueType.LIST && value.valueType() == ConfigValueType.OBJECT) {
    // attempt to convert an array-like (numeric indices) object to a
    // list. This would be used with .properties syntax for example:
    // -Dfoo.0=bar -Dfoo.1=baz
    // To ensure we still throw type errors for objects treated
    // as lists in most cases, we'll refuse to convert if the object
    // does not contain any numeric keys. This means we don't allow
    // empty objects here though :-/
    AbstractConfigObject o = (AbstractConfigObject) value;
    Map<Integer, AbstractConfigValue> values = new HashMap<Integer, AbstractConfigValue>();
    for (String key : o.keySet()) {
        int i;
        try {
            i = Integer.parseInt(key, 10);
            if (i < 0)
                continue;
            values.put(i, o.get(key));
        } catch (NumberFormatException e) {
            continue;
        }
    }
    if (!values.isEmpty()) {
        ArrayList<Map.Entry<Integer, AbstractConfigValue>> entryList = new ArrayList<Map.Entry<Integer, AbstractConfigValue>>(
                values.entrySet());
        // sort by numeric index
        Collections.sort(entryList,
                new Comparator<Map.Entry<Integer, AbstractConfigValue>>() {
                    @Override
                    public int compare(Map.Entry<Integer, AbstractConfigValue> a,
                            Map.Entry<Integer, AbstractConfigValue> b) {
                        return Integer.compare(a.getKey(), b.getKey());
                    }
                });
        // drop the indices (we allow gaps in the indices, for better or
        // worse)
        ArrayList<AbstractConfigValue> list = new ArrayList<AbstractConfigValue>();
        for (Map.Entry<Integer, AbstractConfigValue> entry : entryList) {
            list.add(entry.getValue());
        }
        return new SimpleConfigList(value.origin(), list);
    }
}
Note how the keys are parsed as integer values and then sorted using Integer.compare.
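If you want extra reassurance, a minimal sketch along these lines demonstrates the behaviour (it uses ConfigFactory.parseString instead of a .properties file, and the object and key names here are illustrative):
import com.typesafe.config.ConfigFactory

object OrderCheck extends App {
  // Indices are deliberately written out of order in the source text.
  val config = ConfigFactory.parseString(
    """
      |select.caseLocal.3 = select three
      |select.caseLocal.1 = select one
      |select.caseLocal.2 = select two
    """.stripMargin)

  // getStringList converts the numerically keyed object into a list sorted
  // by index, via the DefaultTransformer logic quoted above.
  val queries = config.getConfig("select").getStringList("caseLocal")
  println(queries) // expected: [select one, select two, select three]
}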

Specify Variable Initialization Order in Scala

I have a special class Model that needs to have its methods called in a very specific order.
I tried doing something like this:
val model = new Model
new MyWrappingClass {
  val first = model.firstMethod()
  val second = model.secondMethod()
  val third = model.thirdMethod()
}
The methods should be called in the order listed; however, I am seeing an apparently random order.
Is there any way to get the variable initialization methods to be called in a particular order?
I doubt your methods are called in the wrong order. But to be sure, you can try something like this:
val (first, second, third) = (
  model.firstMethod(),
  model.secondMethod(),
  model.thirdMethod()
)
You likely have some other problem with your code.
I can run 100 million loops where it never gets the order wrong, as follows:
class Model {
  var done = Array(false, false, false);
  def firstMethod(): Boolean = { done(0) = true; done(1) || done(2) };
  def secondMethod(): Boolean = { done(1) = true; !done(0) || done(2) };
  def thirdMethod(): Boolean = { done(2) = true; !done(0) || !done(1) };
};
Notice that these methods return true if called out of order and false when called in order.
Here's your class:
class MyWrappingClass {
  val model = new Model;
  val first = model.firstMethod()
  val second = model.secondMethod()
  val third = model.thirdMethod()
};
Our function to check for bad behavior on each trial:
def isNaughty(w: MyWrappingClass):Boolean = { w.first || w.second || w.third };
A short program to test:
var i = 0
var b = false;
while ((i < 100000000) && !b) {
  b = isNaughty(new MyWrappingClass);
  i += 1;
}

if (b) {
  println("out-of-order behavior occurred");
  println(i);
} else {
  println("looks good");
}
Scala 2.11.7 on OpenJDK8 / Ubuntu 15.04
Of course this doesn't prove it is impossible to get the wrong order, only that correct behavior seems highly repeatable in a fairly simple case.