Scala application use case:
We have a Scala based that module reads the data from global cache (Redis) and save the same into local cache(Scaffeine). As we want this data to be refreshed asynchronously, we are using LoadingCache with refreshAfterWrite duration set to refresh window of 2.second.
Question:
We need to set different expiry time while setting values in local cache based on if key present in the redis (global cache) or not.
e.g.
If the key is not present in the global cache, we want to save the same key in local cache with default value and refresh window set to 5.minutes.
If key is present in the global cache, we want to store the same in local cache with actual value and refresh window set to 30.minute.
Sample code
object LocalCache extends App {
// data being stored in the cache
class DataObject(data: String) {
override def toString: String = {
"[ 'data': '" + this.data + "' ]"
}
}
// loader helper
private def loaderHelper(key: Int): Future[DataObject] = {
// this method will replace to read the data from Redis Cache
// for now, returns different values per key
if (key == 1) Future.successful(new DataObject("LOADER_HELPER_1"))
else if (key == 2) Future.successful(new DataObject("LOADER_HELPER_2"))
else Future.successful(new DataObject("LOADER_HELPER"))
}
// async loader
private def loader(key: Int): DataObject = {
Try {
Await.result(loaderHelper(key), 1.seconds)
} match {
case Success(result) =>
result
case Failure(exception: Exception) =>
val temp: DataObject = new DataObject("LOADER")
temp
}
}
// initCache
private def initCache(maximumSize: Int): LoadingCache[Int, DataObject] =
Scaffeine()
.recordStats()
.expireAfterWrite(2.second)
.maximumSize(maximumSize)
.build(loader)
// operations on the cache.
val cache: LoadingCache[Int, DataObject] = initCache(maximumSize = 500)
cache.put(1, new DataObject("foo"))
cache.put(2, new DataObject("hoo"))
println("sleeping for 3 sec\n")
Thread.sleep(3000)
println(cache.getIfPresent(1).toString)
println(cache.getIfPresent(2).toString)
println(cache.get(3).toString)
println("sleeping for 10 sec\n")
Thread.sleep(10000)
println("waking up from 10 sec sleep")
println(cache.get(1).toString)
println(cache.get(2).toString)
println(cache.get(3).toString)
println("\nCache Stats: "+ cache.stats())
}
I see lots of custom.policy that can be used to overwrite the expiryAfter policies (expiryAfterWrite/Update/Access) but nothing can be found for refreshAterWrite policies which refreshes the data asynchronously. Any help will be appreciable.
P.S.
I'm very newbie to work on Scala and also explore the Scaffeine.
Unfortunately variable refresh is not supported yet. There is an open issue to provide that feature.
At the moment expiration can be custom per entry, but automatic refresh is fixed. A manual refresh may be triggered by LoadingCache.refresh(key), if you want to manage it yourself. For example, you could periodically iterate over the entries (via the asMap() view) and refresh manually based on a custom criteria.
The AsyncLoadingCache could be useful instead of blocking on a future within your cache loader. The cache will return the in-flight future, won't make it expirable until the value materializes, and will remove it if it fails. Note that the synchronous() view is very useful for async caches to access more operations.
From testing, you might find Guava's fake ticker useful to simulate time.
Related
I am accessing a state store to query it and have had to wrap the store() statement with a try/catch block to retry it because sometimes I am getting this exception:
org.apache.kafka.streams.errors.InvalidStateStoreException: Cannot get state store customers-store because the stream thread is PARTITIONS_REVOKED, not RUNNING
at org.apache.kafka.streams.state.internals.StreamThreadStateStoreProvider.stores(StreamThreadStateStoreProvider.java:49)
at org.apache.kafka.streams.state.internals.QueryableStoreProvider.getStore(QueryableStoreProvider.java:57)
at org.apache.kafka.streams.KafkaStreams.store(KafkaStreams.java:1053)
at com.codependent.kafkastreams.customer.service.CustomerService.getCustomer(CustomerService.kt:75)
at com.codependent.kafkastreams.customer.service.CustomerServiceKt.main(CustomerService.kt:108)
This is the code used to retrieve the store (the full code is on a github repo):
fun getCustomer(id: String): Customer? {
var keyValueStore: ReadOnlyKeyValueStore<String, Customer>? = null
while(keyValueStore == null) {
try {
keyValueStore = streams.store(CUSTOMERS_STORE, QueryableStoreTypes.keyValueStore<String, Customer>())
} catch (ex: InvalidStateStoreException) {
ex.printStackTrace()
}
}
val customer = keyValueStore.get(id)
return customer
}
And this is the main program:
fun main(args: Array<String>) {
val customerService = CustomerService("main", "localhost:9092")
customerService.initializeStreams()
customerService.createCustomer(Customer("53", "Joey"))
val customer = customerService.getCustomer("53")
println(customer)
customerService.stopStreams()
}
The exception happens randomly running the program several times, after the previous executions finish. Note: I don't do anything to the executing Kafka cluster and use its default config.
At the time you are accessing the store, the Kafka Streams application is going through a rebalance, and state stores aren't accessible at that time. You want to make sure you only query the stores when the application state is RUNNING and not REBALANCING.
What you could do is check the state of the application before attempting to read from the store like this:
if(streams.state() == State.RUNNING) {
keyValueStore = streams.store(...);
val customer = keyValueStore.get(id);
return customer;
}
There is also a KafkaStreams.setStateListener method you can use to register a KafkStreams.StateListener implementation. The StateListener.onChange method is called each time the application changes its state.
I'm trying to mimic a periodic token refresh. We have javascript code in our frontend that periodically checks if a token refresh needs to occur, and if so, issues a call to refresh the token. This is ran every couple minutes or so as long as anyone is using the application.
The typical user of our application will leave the app open without doing anything on it for time greater than the token lifetime. So I can't simply check and perform a token refresh on each call without adjusting the script to not mimic real life usage (because calls would need to occur more frequently).
Any ideas if, or how, this could be possible?
Okay, the best solution I could come up with was essentially to create my own "Pause" class, that breaks long pauses down into small pauses, and between each checks to see if the token needs refreshed. Looks roughly like this:
//new RefreshADToken().create() creates a ChainBuilder that refreshes the token if it's necessary
object PauseHelpers {
val tooBigOfPauseThreshold = 150 //300 seconds = 5 minutes, so anything over 150 is too big
def adPause(duration: Int): ChainBuilder = {
doIfOrElse(duration > tooBigOfPauseThreshold) {
val iterations = duration / tooBigOfPauseThreshold
repeat(iterations, "pause_counter") {
pause(tooBigOfPauseThreshold)
.exec(new RefreshADToken().create())
}.pause(duration % tooBigOfPauseThreshold).exec(new RefreshADToken().create())
} {
pause(duration).exec(new RefreshADToken().create())
}
}
}
//... then
import orhub.modules.actions._
class POC extends Simulation {
//some stuff
var scn = scenario("poc")
.feed(hospitalUsersFeeder)
.exec(session => {
session.set("env", environment)
})
.exec(new ADPause(120 * 60).create())
I have a heavy load flow of users data. I want to determine if this is a new user by it's id. In order to reduce calls to the db I rather maintain a state in memory of previous users.
val users = mutable.set[String]()
//init the state from db
user = db.getAllUsersIds()
val source: Source[User, NotUsed]
val dbSink: Sink[User, NotUsed] //goes to db
//if the user is added to the set it will return true
val usersFilter = Flow[User].filter(user => users.add(user.id))
now I can create a graph
source ~> usersFilter ~> dbSink
my problem is that the mutable state is shared and unsafe. Is there an option to maintain the state within the flow ?
There are two ways of doing this.
If you are getting a streams of records and you want to deduplicate the stream (because some ids are already processed). You can do
http://janschulte.com/2016/03/08/deduplicate-akka-stream/
The other way of doing this is via database lookups where you check if the ID already exists.
val alreadyExists : Flow[User, NotUsed] = {
// build a cache of known ids
val knownIdList = ... // query database and get list of IDs
Flow[User].filterNot(user => knownIdList.contains(user.id))
}
I have paged interface. Given a starting point a request will produce a list of results and a continuation indicator.
I've created an observable that is built by constructing and flat mapping an observable that reads the page. The result of this observable contains both the data for the page and a value to continue with. I pluck the data and flat map it to the subscriber. Producing a stream of values.
To handle the paging I've created a subject for the next page values. It's seeded with an initial value then each time I receive a response with a valid next page I push to the pages subject and trigger another read until such time as there is no more to read.
Is there a more idiomatic way of doing this?
function records(start = 'LATEST', limit = 1000) {
let pages = new rx.Subject();
this.connect(start)
.subscribe(page => pages.onNext(page));
let records = pages
.flatMap(page => {
return this.read(page, limit)
.doOnNext(result => {
let next = result.next;
if (next === undefined) {
pages.onCompleted();
} else {
pages.onNext(next);
}
});
})
.pluck('data')
.flatMap(data => data);
return records;
}
That's a reasonable way to do it. It has a couple of potential flaws in it (that may or may not impact you depending upon your use case):
You provide no way to observe any errors that occur in this.connect(start)
Your observable is effectively hot. If the caller does not immediately subscribe to the observable (perhaps they store it and subscribe later), then they'll miss the completion of this.connect(start) and the observable will appear to never produce anything.
You provide no way to unsubscribe from the initial connect call if the caller changes its mind and unsubscribes early. Not a real big deal, but usually when one constructs an observable, one should try to chain the disposables together so it call cleans up properly if the caller unsubscribes.
Here's a modified version:
It passes errors from this.connect to the observer.
It uses Observable.create to create a cold observable that only starts is business when the caller actually subscribes so there is no chance of missing the initial page value and stalling the stream.
It combines the this.connect subscription disposable with the overall subscription disposable
Code:
function records(start = 'LATEST', limit = 1000) {
return Rx.Observable.create(observer => {
let pages = new Rx.Subject();
let connectSub = new Rx.SingleAssignmentDisposable();
let resultsSub = new Rx.SingleAssignmentDisposable();
let sub = new Rx.CompositeDisposable(connectSub, resultsSub);
// Make sure we subscribe to pages before we issue this.connect()
// just in case this.connect() finishes synchronously (possible if it caches values or something?)
let results = pages
.flatMap(page => this.read(page, limit))
.doOnNext(r => this.next !== undefined ? pages.onNext(this.next) : pages.onCompleted())
.flatMap(r => r.data);
resultsSub.setDisposable(results.subscribe(observer));
// now query the first page
connectSub.setDisposable(this.connect(start)
.subscribe(p => pages.onNext(p), e => observer.onError(e)));
return sub;
});
}
Note: I've not used the ES6 syntax before, so hopefully I didn't mess anything up here.
This page has a description of Map's getOrElseUpdate usage method:
object WithCache{
val cacheFun1 = collection.mutable.Map[Int, Int]()
def fun1(i:Int) = i*i
def catchedFun1(i:Int) = cacheFun1.getOrElseUpdate(i, fun1(i))
}
So you can use catchedFun1 which will check if cacheFun1 contains key and return value associated with it. Otherwise, it will invoke fun1, then cache fun1's result in cacheFun1, then return fun1's result.
I can see one potential danger - cacheFun1 can became to large. So cacheFun1 must be cleaned somehow by garbage collector?
P.S. What about scala.collection.mutable.WeakHashMap and java.lang.ref.* ?
See the Memo pattern and the Scalaz implementation of said paper.
Also check out a STM implementation such as Akka.
Not that this is only local caching so you might want to lookinto a distributed cache or STM such as CCSTM, Terracotta or Hazelcast
Take a look at spray caching (super simple to use)
http://spray.io/documentation/1.1-SNAPSHOT/spray-caching/
makes the job easy and has some nice features
for example :
import spray.caching.{LruCache, Cache}
//this is using Play for a controller example getting something from a user and caching it
object CacheExampleWithPlay extends Controller{
//this will actually create a ExpiringLruCache and hold data for 48 hours
val myCache: Cache[String] = LruCache(timeToLive = new FiniteDuration(48, HOURS))
def putSomeThingInTheCache(#PathParam("getSomeThing") someThing: String) = Action {
//put received data from the user in the cache
myCache(someThing, () => future(someThing))
Ok(someThing)
}
def checkIfSomeThingInTheCache(#PathParam("checkSomeThing") someThing: String) = Action {
if (myCache.get(someThing).isDefined)
Ok(s"just $someThing found this in the cache")
else
NotFound(s"$someThing NOT found this in the cache")
}
}
On the scala mailing list they sometimes point to the MapMaker in the Google collections library. You might want to have a look at that.
For simple caching needs, I'm still using Guava cache solution in Scala as well.
Lightweight and battle tested.
If it fit's your requirements and constraints generally outlined below, it could be a great option:
Willing to spend some memory to improve speed.
Expecting that keys will sometimes get queried more than once.
Your cache will not need to store more data than what would fit in RAM. (Guava caches are local to a single run of your application.
They do not store data in files, or on outside servers.)
Example for using it will be something like this:
lazy val cachedData = CacheBuilder.newBuilder()
.expireAfterWrite(60, TimeUnit.MINUTES)
.maximumSize(10)
.build(
new CacheLoader[Key, Data] {
def load(key: Key): Data = {
veryExpansiveDataCreation(key)
}
}
)
To read from it, you can use something like:
def cachedData(ketToData: Key): Data = {
try {
return cachedData.get(ketToData)
} catch {
case ee: Exception => throw new YourSpecialException(ee.getMessage);
}
}
Since it hasn't been mentioned before let me put on the table the light Spray-Caching that can be used independently from Spray and provides expected size, time-to-live, time-to-idle eviction strategies.
We are using Scaffeine (Scala + Caffeine), and you can read abouts its pros/cons compared to other frameworks over here.
You add your sbt,
"com.github.blemale" %% "scaffeine" % "4.0.1"
Build your cache
import com.github.blemale.scaffeine.{Cache, Scaffeine}
import scala.concurrent.duration._
val cachedItems: Cache[String, Int] =
Scaffeine()
.recordStats()
.expireAtferWrite(60.seconds)
.maximumSize(500)
.build[String, Int]()
cachedItems.put("key", 1) // Add items
cache.getIfPresent("key") // Returns an option