How to count new elements from a stream using Spark Streaming - Scala

I have implemented a daily batch computation. Here is some pseudo-code.
"newUser" could also be called "first activated user".
// Get today's log from HBase or somewhere else
val log = getRddFromHbase(todayDate)
// Compute active users
val activeUser = log.map(line => ((line.uid, line.appId), line)).reduceByKey(distinctStrategyMethod)
// Get history users from HDFS
val historyUser = loadFromHdfs(path + yesterdayDate)
// Compute new users from active users and history users
val newUser = activeUser.subtractByKey(historyUser)
// Merge to get the new history users
val newHistoryUser = historyUser.union(newUser)
// Save today's history users
saveToHdfs(path + todayDate)
The computation of "activeUser" can be converted to Spark Streaming easily. Here is some code:
val transformedLog = sdkLogDs.map(sdkLog => {
  val time = System.currentTimeMillis()
  val timeToday = ((time - (time + 3600000 * 8) % 86400000) / 1000).toInt
  ((sdkLog.appid, sdkLog.bcode, sdkLog.uid), (sdkLog.channel_no, sdkLog.ctime.toInt, timeToday))
})
val activeUser = transformedLog.groupByKeyAndWindow(Seconds(86400), Seconds(60)).mapValues(x => {
  // keep the record with the smallest ctime per key
  var firstLine = x.head
  x.foreach(line => {
    if (line._2 < firstLine._2) firstLine = line
  })
  firstLine
})
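A side note, not from the original post: since that mapValues loop only keeps the record with the smallest ctime per key, an equivalent result can be sketched with reduceByKeyAndWindow, which avoids materializing the whole group for the 24-hour window. A sketch only, assuming the same transformedLog as above:
// keep the earliest record per key over the window instead of grouping everything
val activeUserViaReduce = transformedLog.reduceByKeyAndWindow(
  (a, b) => if (a._2 <= b._2) a else b,  // same "smallest ctime wins" rule as the loop above
  Seconds(86400),
  Seconds(60)
)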
But the approach for "newUser" and "historyUser" is confusing me.
I think my question can be summarized as "how to count new elements from a stream". As in my pseudo-code above, "newUser" is part of "activeUser", and I must maintain a set of "historyUser" to know which part is "newUser".
I have considered an approach, but I think it may not work the right way:
Load the history users as an RDD, then for each DStream batch of "activeUser", find the elements that don't exist in "historyUser". One problem here is when I should update this "historyUser" RDD so that I get the right "newUser" for a window.
Updating the "historyUser" RDD means adding "newUser" to it, just like what I did in the pseudo-code above, where "historyUser" is updated once a day. Another problem is how to do this RDD update operation from a DStream. I think updating "historyUser" when the window slides is proper, but I haven't found a proper API to do this.
So what is the best practice to solve this problem?

updateStateByKey would help here as it allows you to set initial state (your historical users) and then update it on each interval of your main stream. I put some code together to explain the concept:
val historyUsers = loadFromHdfs(path + yesterdayDate).map(UserData(...))

case class UserStatusState(isNew: Boolean, values: UserData)

// this will prepare the RDD of already known historical users
// to pass into updateStateByKey as initial state
// (note: the initial state passed to updateStateByKey has to be a pair RDD of (key, state),
//  keyed the same way as sdkLogDs)
val initialStateRDD = historyUsers.map(user => UserStatusState(false, user))

// stateful stream
val trackUsers = sdkLogDs.updateStateByKey(
  updateState,
  new HashPartitioner(sdkLogDs.ssc.sparkContext.defaultParallelism),
  true,
  initialStateRDD
)
// only new users
val newUsersStream = trackUsers.filter(_._2.isNew)

def updateState(newValues: Seq[UserData], prevState: Option[UserStatusState]): Option[UserStatusState] = {
  // Group all values for the specific user as needed
  val groupedUserData: UserData = newValues.reduce(...)
  // prevState is defined only for users previously seen in the stream
  // or loaded as initial state from the historyUsers RDD.
  // For new users it is None
  val isNewUser = prevState.isEmpty
  // as we return state here for the user, prevState won't be None on subsequent intervals
  Some(UserStatusState(isNewUser, groupedUserData))
}
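Two practical notes on top of this sketch (assumptions on my part, not from the original answer): updateStateByKey needs a checkpoint directory, and the accumulated state can be written back to HDFS periodically so the history file used elsewhere stays current. Roughly:
// stateful operations require checkpointing on the StreamingContext
sdkLogDs.context.checkpoint("hdfs:///checkpoints/new-user-job")   // hypothetical path
// periodically persist the whole known-user state so it can seed the next restart
// (or the existing daily batch job)
trackUsers.foreachRDD { (rdd, time) =>
  rdd.saveAsObjectFile(path + "state-" + time.milliseconds)
}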

Related

Slick: insert values into 3 different tables using one Slick API call

The question is related to Slick.
I have three tables:
1) Users
2) Team2members
3) Team2Owners
In my POST request to users, I pass values for memberOf and managerOf; these values should be inserted into the Team2members and Team2Owners tables respectively, not into the Users table. The other values of the POST request are inserted into the Users table.
My POST request looks like this:
{
  "kind": "via#user",
  "userReference": {"userId": "priya16"},
  "user": "preferredNameSpecialChar#domain1.com",
  "memberOf": {"teamReference": {"organizationId": "airtel", "teamId": "supportteam"}},
  "managerOf": {"teamReference": {"organizationId": "airtel", "teamId": "supportteam"}},
  "firstName": "Special_fn1",
  "lastName": "specialChar_ln1",
  "preferredName": [{"locale": "employee1", "value": "##$%^&*(Z0FH"}],
  "description": " preferredNameSpecialChar test "
}
I am forming the query shown below. It seems to work fine when only memberInsert is defined; when I define both values, i.e. memberInsert and managerInsert, the insertion happens only for the second value.
val query = config.api.customerTableDBIO(apiRequest.parameters.organizationId).flatMap { tables =>
  val userInsert = tables.Users returning tables.Users += empRow
  val memberInsert = inputObject.memberOf.map(m => m.copy(teamReference = m.teamReference.copy(organizationId = apiRequest.parameters.organizationId))).map { r =>
    for {
      team2MemberRow <- tables.Team2members returning tables.Team2members += Teams2MembersEntity.fromEmtToTeams2Members(r, empRow.id)
      team <- tables.Teams.filter(_.id === r.teamReference.teamId.toLowerCase).map(_.friendlyName).result.headOption
    } yield (team2MemberRow, team)
  }
  val managerInsert = inputObject.managerOf.map(m => m.copy(teamReference = m.teamReference.copy(organizationId = apiRequest.parameters.organizationId))).map { r =>
    for {
      team2OwnerRow <- tables.Team2owners returning tables.Team2owners += Teams2OwnersEntity.fromEmtToTeam2owners(r, empRow.id)
      team <- tables.Teams.filter(_.id === r.teamReference.teamId.toLowerCase).map(_.friendlyName).result.headOption
    } yield (team2OwnerRow, team)
  }
  userInsert.flatMap { userRow =>
    val user = UserEntity.fromDbEntity(userRow)
    if (memberInsert.isDefined) memberInsert.get
      .map(r => user.copy(memberOf = Some(Teams2MembersEntity.fromEmtToMemberRef(r._1, r._2.map(TeamEntity.toApiFriendlyName).getOrElse(List.empty)))))
    else DBIO.successful(user)
    if (managerInsert.isDefined) managerInsert.get
      .map(r => user.copy(managerOf = Some(Teams2OwnersEntity.fromEmtToManagerRef(r._1, r._2.map(TeamEntity.toApiFriendlyName).getOrElse(List.empty)))))
    else DBIO.successful(user)
  }
}
The problem looks to be with the final call to flatMap.
That should return a DBIO[T]. However, your expression builds a DBIO[T] in several branches, yet only the value of the last expression is returned from the flatMap block; the result of the first if/else is simply discarded. That would explain why you don't see all the actions being run.
Instead, what you could do is assign each step to a value and sequence them. There are lots of ways you could do that, such as using DBIO.seq or andThen.
Here's a sketch of one approach that might work for you....
val maybeInsertMember: Option[DBIO[User]] =
  member.map( your code for constructing an action here )
val maybeInsertManager: Option[DBIO[User]] =
  manager.map( your code for constructing an action here )

DBIO.sequenceOption(maybeInsertMember) andThen
  DBIO.sequenceOption(maybeInsertManager) andThen
  DBIO.successful(user)
The result of that expression is a DBIO[User] which combines three queries together.
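A small follow-up, not part of the original answer: if the three inserts should succeed or fail together, the combined action can be wrapped with .transactionally before running it. combinedAction and db below are placeholder names, assuming the usual Slick profile api import:
// run the combined DBIO[User] atomically
val result = db.run(combinedAction.transactionally)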

How to create basic pagination logic in Gatling?

So I am trying to create basic pagination in Gatling but failing miserably.
My current scenario is as follows:
1) The first call is always a POST request; the body includes page index 1 and page size 50:
{"pageIndex":1,"pageSize":50}
I then receive 50 objects plus the total count of objects in the environment:
"totalCount":232
Since I need to iterate through all objects in the environment, I will need to POST this call 5 times, each time with an updated pageIndex.
My current (failing) code looks like:
def getAndPaginate(jsonBody: String) = {
  val pageSize = 50;
  var totalCount: Int = 0
  var currentPage: Int = 1
  var totalPages: Int = 0
  exec(session => session.set("pageIndex", currentPage))
  exec(http("Get page")
    .post("/api")
    .body(ElFileBody("json/" + jsonBody)).asJson
    .check(jsonPath("$.result.objects[?(#.type == 'type')].id").findAll.saveAs("list")))
    .check(jsonPath("$.result.totalCount").saveAs("totalCount"))
    .exec(session => {
      totalCount = session("totalCount").as[Int]
      totalPages = Math.ceil(totalCount/pageSize).toInt
      session
    })
    .asLongAs(currentPage <= totalPages) {
      exec(http("Get assets action list")
        .post("/api")
        .body(ElFileBody("json/" + jsonBody)).asJson
        .check(jsonPath("$.result.objects[?(#.type == 'type')].id").findAll.saveAs("list")))
      currentPage = currentPage + 1
      exec(session => session.set("pageIndex", currentPage))
      pause(Config.minDelayValue seconds, Config.maxDelayValue seconds)
    }
}
Currently the pagination values are not assigned to the variables that I created at the beginning of the function. If I create the variables at the object level then they are assigned, but in a manner I don't understand. For example, the result of Math.ceil(totalCount/pageSize).toInt is 4 while it should be 5 (it is 5 if I execute it in the immediate window... I don't get it). I would then expect asLongAs(currentPage <= totalPages) to repeat 5 times, but it only repeats twice.
I tried to create the function in a class rather than an object because, as far as I understand, there is only one object instance. (To prevent multiple users accessing the same variable I also ran only one user, with the same result.)
I am obviously missing something basic here (new to Gatling and Scala), so any help would be highly appreciated :)
Using regular Scala variables to hold the values isn't going to work - the Gatling DSL defines builders that are only executed once at startup, so lines like
.asLongAs(currentPage <= totalPages)
will only ever be evaluated with the initial values.
So you just need to handle everything using session variables:
def getAndPaginate(jsonBody: String) = {
  val pageSize = 50
  exec(session => session.set("notDone", true))
    .asLongAs("${notDone}", "index") {
      exec(http("Get assets action list")
        .post("/api")
        .body(ElFileBody("json/" + jsonBody)).asJson
        .check(
          jsonPath("$.result.totalCount")
            // transform lets us take the result of a check (and an optional session) and modify it
            // before storing - so we can use it to store a boolean that reflects whether we're on the last page
            .transform((totalCount, session) => ((session("index").as[Int] + 1) * pageSize) < totalCount.toInt)
            .saveAs("notDone")
        )
      )
      .pause(Config.minDelayValue seconds, Config.maxDelayValue seconds)
    }
}
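As an aside, not part of the original answer: the unexpected 4 in the question comes from integer division - totalCount / pageSize is truncated before Math.ceil ever sees a fractional value. If that calculation is still needed somewhere, convert to Double first:
Math.ceil(232 / 50).toInt                          // 4: 232 / 50 is Int division, giving 4 before ceil runs
Math.ceil(232.0 / 50).toInt                        // 5: divide as Double, then round up
Math.ceil(totalCount.toDouble / pageSize).toInt    // general form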

Difference between RoundRobinRouter and RoundRobinRoutingLogic

So I was reading a tutorial about Akka and came across http://manuel.bernhardt.io/2014/04/23/a-handful-akka-techniques/, and I think he explained it pretty well. I just picked up Scala recently and am having difficulties with the tutorial above.
I wonder what the difference is between RoundRobinRouter and the current RoundRobinRoutingLogic? Obviously the implementation is quite different.
Previously the implementation with RoundRobinRouter was
val workers = context.actorOf(Props[ItemProcessingWorker].withRouter(RoundRobinRouter(100)))
with processBatch
def processBatch(batch: List[BatchItem]) = {
  if (batch.isEmpty) {
    log.info(s"Done migrating all items for data set $dataSetId. $totalItems processed items, we had ${allProcessingErrors.size} errors in total")
  } else {
    // reset processing state for the current batch
    currentBatchSize = batch.size
    allProcessedItemsCount = currentProcessedItemsCount + allProcessedItemsCount
    currentProcessedItemsCount = 0
    allProcessingErrors = currentProcessingErrors ::: allProcessingErrors
    currentProcessingErrors = List.empty
    // distribute the work
    batch foreach { item =>
      workers ! item
    }
  }
}
Here's my implementation using RoundRobinRoutingLogic:
var mappings: Option[ActorRef] = None
var router = {
  val routees = Vector.fill(100) {
    mappings = Some(context.actorOf(Props[Application3]))
    context watch mappings.get
    ActorRefRoutee(mappings.get)
  }
  Router(RoundRobinRoutingLogic(), routees)
}
and treated the processBatch as such
def processBatch(batch: List[BatchItem]) = {
  if (batch.isEmpty) {
    println(s"Done migrating all items for data set $dataSetId. $totalItems processed items, we had ${allProcessingErrors.size} errors in total")
  } else {
    // reset processing state for the current batch
    currentBatchSize = batch.size
    allProcessedItemsCount = currentProcessedItemsCount + allProcessedItemsCount
    currentProcessedItemsCount = 0
    allProcessingErrors = currentProcessingErrors ::: allProcessingErrors
    currentProcessingErrors = List.empty
    // distribute the work
    batch foreach { item =>
      // println(item.id)
      mappings.get ! item
    }
  }
}
I somehow cannot get this tutorial to run; it gets stuck at the point where it iterates over the batch list. I wonder what I did wrong.
Thanks
First of all, you have to distinguish the difference between them.
RoundRobinRouter is a Router that uses round-robin to select a connection.
While
RoundRobinRoutingLogic uses round-robin to select a routee
You can provide your own RoutingLogic (it has helped me understand how Akka works under the hood):
class RedundancyRoutingLogic(nbrCopies: Int) extends RoutingLogic {
  val roundRobin = RoundRobinRoutingLogic()
  def select(message: Any, routees: immutable.IndexedSeq[Routee]): Routee = {
    val targets = (1 to nbrCopies).map(_ => roundRobin.select(message, routees))
    SeveralRoutees(targets)
  }
}
Link to the docs: http://doc.akka.io/docs/akka/2.3.3/scala/routing.html
P.S. This doc is very clear and it has helped me the most.
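One more detail, not from the original answers: with the Router/RoutingLogic approach in the question, messages need to go through the router itself rather than through the last routee kept in mappings, otherwise only one worker ever receives work. Roughly, inside processBatch:
// distribute the work through the router, which applies RoundRobinRoutingLogic
batch foreach { item =>
  router.route(item, self)   // self (or sender()) becomes the sender seen by the routee
}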
Actually I misunderstood the method, and found out the solution was to use RoundRobinPool as stated in http://doc.akka.io/docs/akka/2.3-M2/project/migration-guide-2.2.x-2.3.x.html
For example RoundRobinRouter has been renamed to RoundRobinPool or
RoundRobinGroup depending on which type you are actually using.
from
val workers = context.actorOf(Props[ItemProcessingWorker].withRouter(RoundRobinRouter(100)))
to
val workers = context.actorOf(RoundRobinPool(100).props(Props[ItemProcessingWorker]), "router2")
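A small usage note, not from the original answer: the pool router is an ordinary ActorRef, so the sending side stays exactly as in the first snippet:
// messages sent to the pool's ActorRef are round-robined across the 100 workers
batch foreach { item =>
  workers ! item
}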

Apache-Spark: method in foreach doesn't work

I read a file from HDFS, each line of which contains x1,x2,y1,y2 representing an Envelope in JTS.
I would like to use that data to build an STRtree in foreach.
val inputData = sc.textFile(inputDataPath).cache()
val strtree = new STRtree
inputData.foreach(line => {
  val array = line.split(",").map(_.toDouble)
  val e = new Envelope(array(0), array(1), array(2), array(3))
  println("envelope is " + e)
  strtree.insert(e, new Rectangle(array(0), array(1), array(2), array(3)))
})
As you can see, I also print the e object.
To my surprise, when I log the size of strtree, it is zero! It seems that the insert method has no effect here.
By the way, if I hard-code some test data line by line, the STRtree is built fine.
One more thing: the project is packed into a jar and submitted in the spark-shell.
So, why does the method in foreach not work?
You will have to collect() to do this (foreach on an RDD runs inside the executors, so each task mutates its own deserialized copy of strtree; the driver's instance is never touched, which is why it stays empty):
inputData.collect().foreach(line => {
... // your code
})
You can do this (to avoid collecting all of the raw data):
val pairs = inputData.map(line => {
  val array = line.split(",").map(_.toDouble)
  val e = new Envelope(array(0), array(1), array(2), array(3))
  println("envelope is " + e)
  (e, new Rectangle(array(0), array(1), array(2), array(3)))
})
pairs.collect().foreach(pair => {
  strtree.insert(pair._1, pair._2)
})
Use .map() instead of .foreach() and reassign the outcome.
foreach does not return the outcome of the applied function. It can be used for sending data somewhere, storing to a DB, printing, and so on.
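A minimal sketch of that suggestion for this case, essentially the same pattern as the second snippet above (note that collect() requires the results to fit in driver memory):
// map on the executors, bring the (Envelope, Rectangle) pairs back to the driver,
// then build the STRtree locally
val entries = inputData.map { line =>
  val array = line.split(",").map(_.toDouble)
  (new Envelope(array(0), array(1), array(2), array(3)),
   new Rectangle(array(0), array(1), array(2), array(3)))
}.collect()

entries.foreach { case (env, rect) => strtree.insert(env, rect) }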

Can I use non volatile external variables in Scala Enumeratee?

I need to group the output of my Enumerator into different ZipEntries based on a specific property (providerId). The original chartPreparations stream is ordered by providerId, so I can just keep a reference to the current provider and add a new entry when the provider changes.
Enumerator.outputStream(os => {
  val currentProvider = new AtomicReference[String]()
  // Step 1. Creating zipped output file
  val zipOs = new ZipOutputStream(os, Charset.forName("UTF8"))
  // Step 2. Processing chart preparation Enumerator
  val chartProcessingTask = (chartPreparations) run Iteratee.foreach(cp => {
    // Step 2.1. Write new entry if needed
    if (currentProvider.get() == null || cp.providerId != currentProvider.get()) {
      if (currentProvider.get() != null) {
        zipOs.write("</body></html>".getBytes(Charset.forName("UTF8")))
      }
      currentProvider.set(cp.providerId)
      zipOs.putNextEntry(new ZipEntry(cp.providerName + ".html"))
      zipOs.write(HTML_HEADER)
    }
    // Step 2.2. Write chart preparation in HTML format
    zipOs.write(toHTML(cp).getBytes(Charset.forName("UTF8")))
  })
  // Step 3. On complete, close the stream
  chartProcessingTask.onComplete(_ => zipOs.close())
})
Since the current provider reference changes during the output, I made it an AtomicReference so that I could handle references from different threads.
Can currentProvider just be a var Option[String], and why?