Maintaining state within a stream - scala

I have a high-volume stream of user data. I want to determine whether a user is new by its id. To reduce calls to the db, I would rather keep an in-memory state of previously seen users.
val users = mutable.Set[String]()
//init the state from db
users ++= db.getAllUsersIds()
val source: Source[User, NotUsed]
val dbSink: Sink[User, NotUsed] //goes to db
//if the user is added to the set it will return true
val usersFilter = Flow[User].filter(user => users.add(user.id))
Now I can create a graph:
source ~> usersFilter ~> dbSink
My problem is that the mutable state is shared and unsafe. Is there a way to maintain the state within the flow?

There are two ways of doing this.
If you are receiving a stream of records and want to deduplicate it (because some ids have already been processed), you can use the approach described here:
http://janschulte.com/2016/03/08/deduplicate-akka-stream/
The other way is via a database lookup: load the known IDs once and check whether each user's ID already exists.
val alreadyExists: Flow[User, User, NotUsed] = {
  // build a cache of known ids
  val knownIdList = ... // query database and get list of IDs
  Flow[User].filterNot(user => knownIdList.contains(user.id))
}
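To keep the state inside the flow rather than in a shared variable, one option is to create the set inside a factory function, so that every run of the stream gets its own private copy. The following is a minimal sketch using only the standard library, with an Iterator standing in for the stream; `User`, `dedupeById` and `newUsers` are hypothetical names for illustration. In Akka Streams, `statefulMapConcat` accepts a factory of a similar shape (`() => In => Iterable[Out]`), so the same idea should carry over.

```scala
import scala.collection.mutable

object DedupeSketch {
  // Hypothetical user type; only the id matters here.
  final case class User(id: String)

  // Returns a factory: each call to the factory creates a fresh `seen` set,
  // so the mutable state is private to one run and never shared.
  def dedupeById(knownIds: Set[String]): () => User => List[User] = () => {
    val seen = mutable.Set.empty[String]
    seen ++= knownIds // seed with the ids already stored in the db
    user => if (seen.add(user.id)) List(user) else Nil
  }

  // Stand-in for a stream: apply the stateful function element by element.
  def newUsers(input: Iterator[User], knownIds: Set[String]): List[User] = {
    val step = dedupeById(knownIds)()
    input.flatMap(step).toList
  }
}
```

Because the set is created inside the factory, two runs of the same pipeline do not see each other's state.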

Related

Scaffeine: how to set different expiration time for default value

Scala application use case:
We have a Scala-based module that reads data from a global cache (Redis) and saves it into a local cache (Scaffeine). Because we want this data to be refreshed asynchronously, we are using a LoadingCache with refreshAfterWrite set to a refresh window of 2.second.
Question:
We need to set a different expiry time when storing values in the local cache, depending on whether the key is present in Redis (the global cache) or not.
e.g.
If the key is not present in the global cache, we want to save the same key in the local cache with a default value and the refresh window set to 5.minutes.
If the key is present in the global cache, we want to store it in the local cache with the actual value and the refresh window set to 30.minutes.
Sample code
import com.github.blemale.scaffeine.{LoadingCache, Scaffeine}
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.util.{Failure, Success, Try}
object LocalCache extends App {
  // data being stored in the cache
  class DataObject(data: String) {
    override def toString: String = {
      "[ 'data': '" + this.data + "' ]"
    }
  }
  // loader helper
  private def loaderHelper(key: Int): Future[DataObject] = {
    // this method will be replaced by one that reads the data from the Redis cache
    // for now, returns different values per key
    if (key == 1) Future.successful(new DataObject("LOADER_HELPER_1"))
    else if (key == 2) Future.successful(new DataObject("LOADER_HELPER_2"))
    else Future.successful(new DataObject("LOADER_HELPER"))
  }
  // loader: blocks on the future for up to 1 second
  private def loader(key: Int): DataObject = {
    Try {
      Await.result(loaderHelper(key), 1.second)
    } match {
      case Success(result) =>
        result
      case Failure(_) =>
        // fall back to a default value if loading fails or times out
        new DataObject("LOADER")
    }
  }
  // initCache
  private def initCache(maximumSize: Int): LoadingCache[Int, DataObject] =
    Scaffeine()
      .recordStats()
      .expireAfterWrite(2.second)
      .maximumSize(maximumSize)
      .build(loader)
  // operations on the cache.
  val cache: LoadingCache[Int, DataObject] = initCache(maximumSize = 500)
  cache.put(1, new DataObject("foo"))
  cache.put(2, new DataObject("hoo"))
  println("sleeping for 3 sec\n")
  Thread.sleep(3000)
  println(cache.getIfPresent(1).toString)
  println(cache.getIfPresent(2).toString)
  println(cache.get(3).toString)
  println("sleeping for 10 sec\n")
  Thread.sleep(10000)
  println("waking up from 10 sec sleep")
  println(cache.get(1).toString)
  println(cache.get(2).toString)
  println(cache.get(3).toString)
  println("\nCache Stats: " + cache.stats())
}
I see lots of custom policies that can be used to override the expiry policies (expireAfterWrite/Update/Access), but I cannot find anything for the refreshAfterWrite policy, which refreshes the data asynchronously. Any help would be appreciated.
P.S.
I'm very new to Scala and am still exploring Scaffeine.
Unfortunately variable refresh is not supported yet. There is an open issue to provide that feature.
At the moment expiration can be customized per entry, but automatic refresh is fixed. A manual refresh may be triggered by LoadingCache.refresh(key), if you want to manage it yourself. For example, you could periodically iterate over the entries (via the asMap() view) and refresh manually based on custom criteria.
The AsyncLoadingCache could be useful instead of blocking on a future within your cache loader. The cache will return the in-flight future, won't make it expirable until the value materializes, and will remove it if it fails. Note that the synchronous() view is very useful for async caches to access more operations.
For testing, you might find Guava's fake ticker useful to simulate time.
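The manual-refresh idea above could be sketched as a periodic pass that selects the keys whose refresh window has elapsed, where the window depends on whether the entry holds the default value. This sketch uses only the standard library: `Entry`, `refreshWindow` and `dueForRefresh` are hypothetical names, and the 5-minute and 30-minute windows are the ones from the question. The selected keys would then be handed to LoadingCache.refresh(key).

```scala
import scala.concurrent.duration._

object RefreshSketch {
  // Hypothetical cache entry: the value, whether it is the default value
  // (key missing in Redis), and when it was last written (epoch millis).
  final case class Entry(value: String, isDefault: Boolean, writtenAtMillis: Long)

  // Windows taken from the question: 5 minutes for defaults, 30 minutes otherwise.
  def refreshWindow(e: Entry): FiniteDuration =
    if (e.isDefault) 5.minutes else 30.minutes

  // Keys whose window has elapsed at `nowMillis`. A scheduler would call
  // this periodically and trigger a refresh for each returned key.
  def dueForRefresh[K](entries: Map[K, Entry], nowMillis: Long): Set[K] =
    entries.collect {
      case (k, e) if nowMillis - e.writtenAtMillis >= refreshWindow(e).toMillis => k
    }.toSet
}
```

Passing the current time in as a parameter keeps the selection logic testable without sleeping, which is the same motivation as the fake ticker mentioned above.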

Joining a KTable with a KStream and nothing arrives in the output topic

I left join a KStream with a KTable, but I don't see any output in the output topic:
val stringSerde: Serde[String] = Serdes.String()
val longSerde: Serde[java.lang.Long] = Serdes.Long()
val genericRecordSerde: Serde[GenericRecord] = new GenericAvroSerde()
val builder = new KStreamBuilder()
val networkImprStream: KStream[java.lang.Long, GenericRecord] = builder
  .stream(dfpGcsNetworkImprEnhanced)
// Create a global table for advertisers. The data from this global table
// will be fully replicated on each instance of this application.
val advertiserTable: GlobalKTable[java.lang.Long, GenericRecord]= builder.globalTable(advertiserTopicName, "advertiser-store")
// Join the network impr stream to the advertiser global table. As this is global table
// we can use a non-key based join with out needing to repartition the input stream
val networkImprWithAdvertiserNameKStream: KStream[java.lang.Long, GenericRecord] = networkImprStream.leftJoin(advertiserTable,
  (_, networkImpr) => {
    println(networkImpr)
    networkImpr.get("advertiserId").asInstanceOf[java.lang.Long]
  },
  (networkImpr: GenericRecord, advertiserIdToName: GenericRecord) => {
    println(networkImpr)
    // with a left join the table side is null when no advertiser matches
    if (advertiserIdToName != null) {
      networkImpr.put("advertiserName", advertiserIdToName.get("name"))
    }
    networkImpr
  }
)
networkImprWithAdvertiserNameKStream.to(networkImprProcessed)
val streams = new KafkaStreams(builder, streamsConfiguration)
streams.cleanUp()
streams.start()
// usually the stream application would be running forever,
// in this example we just let it run for some time and stop since the input data is finite.
Thread.sleep(15000L)
If I bypass the join and directly output the input topic to the output, I see messages arriving. I've already changed the join to a left join, added some printlns to see when the key is extracted (nothing is printed on the console though). Also I use the kafka streams reset tool every time, so starting from the beginning. I am running out of ideas here. Also I've added some test access to the store and it works and contains keys from the stream (although this should not prohibit any output because of the left join).
In my source stream the key is null. Although I am not using this key to join the table, it must not be null. After creating an intermediate stream with a dummy key, it works. So even though I have a global KTable here, the restrictions on the keys of the stream messages still apply:
http://docs.confluent.io/current/streams/developer-guide.html#kstream-ktable-join
Input records for the stream with a null key or a null value are ignored and do not trigger the join.
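The rule quoted above can be modelled in plain Scala to show why nothing reached the output topic and why a dummy key helps. This is a toy model, not the Kafka Streams API: `Record`, `leftJoin` and `withDummyKey` are hypothetical names. In Kafka Streams itself, the re-keying step would typically be done with KStream.selectKey or map before the join.

```scala
object NullKeyJoinSketch {
  // Toy record: the key may be null, as in the source topic of the question.
  final case class Record(key: String, advertiserId: Long)

  // Toy left join that mimics the documented rule: records with a null
  // key are skipped and never reach the joiner.
  def leftJoin(stream: List[Record], table: Map[Long, String]): List[(Long, Option[String])] =
    stream
      .filter(_.key != null)
      .map(r => (r.advertiserId, table.get(r.advertiserId)))

  // Give every record a non-null dummy key before joining.
  def withDummyKey(stream: List[Record]): List[Record] =
    stream.map(r => if (r.key == null) r.copy(key = "dummy") else r)
}
```

With null keys the model emits nothing at all, which matches the symptom of the key extractor never being called; after re-keying, unmatched records still come through with an empty table side, as expected from a left join.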

How to use Flink streaming to process Data stream of Complex Protocols

I'm using Flink streaming to process data traffic logs from a 3G network (GPRS Tunnelling Protocol), and I'm having trouble combining the information that belongs to one user session.
For example: how do I map the start and the end of a session? I don't know whether Flink streaming is suited to handling complex protocols like this.
P.S.:
We capture the data exchanged between the SGSN and the GGSN in a 3G network (using the GTP protocol with GTP-C/U messages). A session starts when the SGSN sends the CreateReq(TEID, Seq, IMSI, TEID_dl, TEID_data_dl) message and the GGSN responds with the CreateRsp(TEID_dl, Seq, TEID_ul, TEID_data_ul) message.
After the session is established, other GTP-C messages (e.g. UpdateReq, DeleteReq) sent from the SGSN to the GGSN use TEID_ul and the response messages use TEID_dl; GTP-U messages use TEID_data_ul (SGSN -> GGSN) and TEID_data_dl (GGSN -> SGSN). GTP-U messages contain information such as AppID (facebook, twitter, web), url, ...
Finally, I want to process the continuous log data stream and map the GTP-C and GTP-U messages of the same user (IMSI) to produce a report.
I've tried this:
val sessions = createReqs.connect(createRsps).flatMap(new CoFlatMapFunction[CreateReq, CreateRsp, Session] {
  // holds CreateReqs indexed by (teid_dl, seq)
  private val createReqs = mutable.HashMap.empty[(String, String), CreateReq]
  // holds CreateRsps indexed by (teid, seq)
  private val createRsps = mutable.HashMap.empty[(String, String), CreateRsp]
  override def flatMap1(req: CreateReq, out: Collector[Session]): Unit = {
    val key = (req.teid_dl, req.header.seqNum)
    val oRsp = createRsps.get(key)
    if (!oRsp.isEmpty) {
      val rsp = oRsp.get
      println("OK")
      out.collect(new Session(rsp.header.time, req.imsi, req.teid_dl, req.teid_ddl, rsp.teid_upl, rsp.teid_dupl, req.rat, req.apn))
      createRsps.remove(key)
    } else {
      createReqs.put(key, req)
    }
  }
  override def flatMap2(rsp: CreateRsp, out: Collector[Session]): Unit = {
    val key = (rsp.header.teid, rsp.header.seqNum)
    val oReq = createReqs.get(key)
    if (!oReq.isEmpty) {
      val req = oReq.get
      out.collect(new Session(rsp.header.time, req.imsi, req.teid_dl, req.teid_ddl, rsp.teid_upl, rsp.teid_dupl, req.rat, req.apn))
      createReqs.remove(key)
    } else {
      createRsps.put(key, rsp)
    }
  }
}).print()
This code always returns an empty result, even though the input stream contains the CreateRsp and CreateReq messages of the same session and they appear very close together (within 1 second). When I debug, oReq.isEmpty == true every time.
What am I doing wrong?
To be honest it is a bit difficult to see through the telco specifics here, but if I understand correctly you have at least 3 streams, the first two being the CreateReq and the CreateRsp streams.
To detect the establishment of a session I would use the ConnectedDataStream abstraction to share state between the two aforementioned streams. Check out this example for usage or the related Flink docs.
Is this what you are trying to achieve?
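The shared-state matching described above can be sketched in plain Scala, independent of Flink: whichever message arrives first is buffered under its (TEID, Seq) key, and a session is emitted when the counterpart shows up. The message types below are trimmed-down, hypothetical versions of the ones in the question, and `Pairer` is an illustrative name. One thing that may be worth checking in the Flink job, as an assumption about the cause rather than a confirmed diagnosis: both connected inputs generally need to be keyed by the same key, so that a request and its response reach the same parallel instance; otherwise each instance only sees part of the data and its lookups can come back empty.

```scala
import scala.collection.mutable

object SessionPairingSketch {
  // Hypothetical, trimmed-down message types; field names are assumptions.
  final case class CreateReq(teidDl: String, seq: String, imsi: String)
  final case class CreateRsp(teid: String, seq: String, teidUl: String)
  final case class Session(imsi: String, teidDl: String, teidUl: String)

  // Buffers whichever side arrives first and emits a Session on a match.
  final class Pairer {
    private val pendingReqs = mutable.HashMap.empty[(String, String), CreateReq]
    private val pendingRsps = mutable.HashMap.empty[(String, String), CreateRsp]

    def onReq(req: CreateReq): Option[Session] = {
      val key = (req.teidDl, req.seq)
      pendingRsps.remove(key) match {
        case Some(rsp) => Some(Session(req.imsi, req.teidDl, rsp.teidUl))
        case None =>
          pendingReqs.put(key, req) // response not seen yet: remember the request
          None
      }
    }

    def onRsp(rsp: CreateRsp): Option[Session] = {
      val key = (rsp.teid, rsp.seq)
      pendingReqs.remove(key) match {
        case Some(req) => Some(Session(req.imsi, req.teidDl, rsp.teidUl))
        case None =>
          pendingRsps.put(key, rsp) // request not seen yet: remember the response
          None
      }
    }
  }
}
```

The matching only works if both messages of a session are handled by the same `Pairer` instance, which is the plain-Scala analogue of the keying concern mentioned above.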

Why do Scala futures not run faster even with more threads in the thread pool?

I have the following algorithm in Scala:
1. Do an initial call to the db to initialize a cursor
2. Get 1000 entities from the db (returns a Future)
3. For every entity, make one additional request to the database and get the modified entity (returns a Future)
4. Transform the original entity
5. Put the transformed entity into the Future callback from #3
6. Wait for all Futures
In Scala it looks something like:
val client = ...
val size = 1000
val init = client.firstSearch(size) // request over network, returns a Future
val initResult = Await.result(init, 30.seconds)
var cursorId: String = initResult.getCursorId
while (!cursorId.isEmpty) {
  val nextCursor: Future[String] = client.grabWithSize(cursorId).flatMap { response =>
    val futures: Seq[Future[Boolean]] = response.getAllResults.map { result =>
      val grabbedOne: Future[Entity] = client.grabOneEntity(result.id) // request over network
      val resultMap: Map[String, Any] = buildMap(result)
      val transformed: Map[String, Any] = transform(resultMap) // no future here
      grabbedOne.map { grabbed =>
        buildMap(grabbed) == transformed
      }
    }
    // wait for all futures of this batch, then move on to the next cursor
    Future.sequence(futures).map(_ => response.getNewCursorId)
  }
  cursorId = Await.result(nextCursor, 30.seconds)
}
def buildMap(...): Map[String, Any] // sync call
I noticed that if I increase size, say, by a factor of two, every iteration of the while loop becomes about 1.5 times slower. But I do not see my PC's processor loaded any more than before: the load stays near zero while the time increases about 1.5 times. Why? I have set up:
implicit val ec = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(1024))
I think that not all Futures are executed in parallel. But why? And how can I fix it?
I see that in your code, the Futures don't block each other. It's more likely the database that is the bottleneck.
Is it possible to do a SQL join, so that the number of database calls is O(1) rather than O(n)? (If you're using Slick, have a look under the queries section about joins.)
If the load is low, the connection pool is probably maxed out; you would need to increase it for the database and the network.
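Independently of where the bottleneck sits, the steps in the question can be chained without blocking a thread between batches. The following is a minimal sketch under stated assumptions: `Page`, `grabPage` and `grabOneEntity` are hypothetical in-memory stand-ins for the client calls, and the cursor values are made up. It does not remove a database or connection-pool limit; it only shows the loop expressed with Future.traverse and flatMap instead of a while loop.

```scala
import scala.concurrent.{ExecutionContext, Future}

object CursorLoopSketch {
  implicit val ec: ExecutionContext = ExecutionContext.global

  // Hypothetical page of results plus the cursor for the next page
  // (an empty cursor means there is nothing left to fetch).
  final case class Page(ids: Seq[Int], nextCursor: String)

  // Hypothetical in-memory stand-ins for the two network calls.
  def grabPage(cursor: String): Future[Page] = Future {
    cursor match {
      case "c0" => Page(Seq(1, 2, 3), "c1")
      case "c1" => Page(Seq(4, 5), "")
      case _    => Page(Nil, "")
    }
  }
  def grabOneEntity(id: Int): Future[Int] = Future(id * 10)

  // Processes one page, waits for all of its futures, then moves on to the
  // next cursor; the results are accumulated in page order.
  def drain(cursor: String, acc: Vector[Int] = Vector.empty): Future[Vector[Int]] =
    if (cursor.isEmpty) Future.successful(acc)
    else grabPage(cursor).flatMap { page =>
      Future.traverse(page.ids)(grabOneEntity).flatMap { entities =>
        drain(page.nextCursor, acc ++ entities)
      }
    }
}
```

Only the caller at the very edge of the program would need to block on the resulting future, for example with Await.result.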

continuously fetch database results with scalaz.stream

I'm new to Scala and extremely new to scalaz. Through a different Stack Overflow answer and some handholding, I was able to use scalaz.stream to implement a Process that continuously fetches Twitter API results. Now I'd like to do the same thing for the Cassandra DB where the Twitter handles are stored.
The code for fetching the twitter results is here:
def urls: Seq[(Handle,URL)] = {
Await.result(
getAll(connection).map { List =>
List.map(twitterToGet =>
(twitterToGet.handle, urlBoilerPlate + twitterToGet.handle + parameters + twitterToGet.sinceID)
)
},
5 seconds)
}
val fetchUrl = channel.lift[Task, (Handle, URL), Fetched] {
url => Task.delay {
val finalResult = callTwitter(url)
if (finalResult.tweets.nonEmpty) {
connection.updateTwitter(finalResult)
} else {
println("\n" + finalResult.handle + " does not have new tweets")
}
s"\ntwitter Fetch & database update completed"
}
}
val P = Process
val process =
(time.awakeEvery(3.second) zipWith P.emitAll(urls))((b, url) => url).
through(fetchUrl)
val fetched = process.runLog.run
fetched.foreach(println)
What I'm planning to do is use
def urls: Seq[(Handle,URL)] = {
to continuously fetch Cassandra results (with an awakeEvery) and send them off to an actor to run the above twitter fetching code.
My question is: what is the best way to implement this with scalaz.stream? Note that I'd like it to get ALL the database results, then have a delay before getting ALL the database results again. Should I use the same architecture as the Twitter fetching code above? If so, how would I create a channel.lift that doesn't require input? Is there a better way in scalaz.stream?
Thanks in advance
Got this working today. The cleanest way to do it would be to emit the database results as a stream and attach a sink to the end of the stream to do the twitter processing. What I actually have is a bit more complex as it retrieves the database results continuously and sends them off to an actor for the twitter processing. The style of retrieving the results follows my original code from my question:
val connection = new simpleClient(conf.getString("cassandra.node"))
implicit val threadPool = new ScheduledThreadPoolExecutor(4)
val system = ActorSystem("mySystem")
val twitterFetch = system.actorOf(Props[TwitterFetch], "twitterFetch")
def myEffect = channel.lift[Task, simpleClient, String]{
connection: simpleClient => Task.delay{
val results = Await.result(
getAll(connection).map { List =>
List.map(twitterToGet =>
(twitterToGet.handle, urlBoilerPlate + twitterToGet.handle + parameters + twitterToGet.sinceID)
)
},
5 seconds)
println("Query Successful, results= " +results +" at " + format.print(System.currentTimeMillis()))
twitterFetch ! fetched(connection, results)
s"database fetch completed"
}
}
val P = Process
val process =
(time.awakeEvery(3.second).flatMap(_ => P.emit(connection).
through(myEffect)))
val fetching = process.runLog.run
fetching.foreach(println)
Some notes:
I had asked about using channel.lift without input, but it became clear that the input should be the cassandra connection.
The line
val process =
(time.awakeEvery(3.second).flatMap(_ => P.emit(connection).
through(myEffect)))
was changed from zipWith to flatMap because I wanted to retrieve the results continuously instead of once.
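The difference between the two shapes can be illustrated with plain lists standing in for the tick stream. This is a toy model, not scalaz.stream: `FakeDb`, `zipped` and `flatMapped` are hypothetical names, and the handles are made up. In the zip-style version the query runs once and each tick consumes one row; in the flatMap-style version every tick re-runs the query.

```scala
object ZipVsFlatMapSketch {
  // Hypothetical stand-in for the database query; counts how often it runs.
  final class FakeDb {
    var queries = 0
    def getAll(): List[String] = {
      queries += 1
      List("handleA", "handleB")
    }
  }

  // zip-style: the query runs once up front and each tick consumes one row,
  // so the output ends as soon as the rows run out.
  def zipped(ticks: List[Int], db: FakeDb): List[String] = {
    val rows = db.getAll()
    ticks.zip(rows).map { case (_, row) => row }
  }

  // flatMap-style: every tick re-runs the query and emits all of its rows.
  def flatMapped(ticks: List[Int], db: FakeDb): List[String] =
    ticks.flatMap(_ => db.getAll())
}
```

With three ticks, the zip-style version queries once and stops after two rows, while the flatMap-style version queries three times, which is the "get ALL the results, wait, get ALL the results again" behaviour asked for in the question.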