I am developing an Apache Flink application using the Scala API (I am pretty new to this technology).
I am using a hashmap to store some values that come from a database, and I need to refresh these values every hour. Is there any way to refresh this hashmap asynchronously?
Thanks!
I'm not sure what you mean by "refresh this hashmap asynchronously" in the context of a Flink workflow.
For what it's worth, if you have a hashmap that's keyed by some piece of data from records flowing through your workflow, then you can use Flink's support for managed key state to store the value (and checkpoint it), and make it queryable.
I interpret your question to mean that you are using some state in Flink to mirror/cache some data that comes from an external database, and you wish to periodically refresh it.
Typically this sort of thing is done by continuously streaming a Change Data Capture (CDC) stream from the external database into Flink. Continuous, streaming solutions are generally a better fit for Flink. But if you want to do this in hourly batches, you could write a custom source or a ProcessFunction that wakes up once an hour, makes a query to the database, and emits a stream of records that can be used to update the operator holding the state.
You can achieve this with Apache Flink's Asynchronous I/O for External Data Access; see the async I/O documentation for details.
Here's a way to use AsyncDataStream to refresh a map periodically by creating an async function and attaching it to a source stream.
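import java.util.concurrent.TimeUnit
import scala.concurrent.{ExecutionContext, Future}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.async.{ResultFuture, RichAsyncFunction}

// DataBaseClient, host, port, credentials, env and src are placeholders for your own setup.
class AsyncEnricherFunction extends RichAsyncFunction[String, (String, String)] {
  @transient @volatile private var m: Map[String, String] = _
  @transient private var client: DataBaseClient = _
  @transient private var refreshInterval: Long = _
  @transient @volatile private var lastRefreshed: Long = _

  @throws(classOf[Exception])
  override def open(parameters: Configuration): Unit = {
    client = new DataBaseClient(host, port, credentials)
    refreshInterval = 1000 // milliseconds; use 60 * 60 * 1000 for an hourly refresh
    load()
  }

  // Reloads the whole table into the in-memory map.
  private def load(): Unit = {
    val str = "select key, value from KeyValue"
    m = client.query(str).asMap
    lastRefreshed = System.currentTimeMillis()
  }

  override def asyncInvoke(input: String, resultFuture: ResultFuture[(String, String)]): Unit = {
    Future {
      // Refresh lazily once the interval has elapsed, then enrich the record.
      if (System.currentTimeMillis() > lastRefreshed + refreshInterval) load()
      val enriched = (input, m(input))
      resultFuture.complete(Seq(enriched))
    }(ExecutionContext.global)
  }

  override def close(): Unit = { client.close() }
}

val in: DataStream[String] = env.addSource(src)
val enriched = AsyncDataStream.unorderedWait(in, new AsyncEnricherFunction(), 5000, TimeUnit.MILLISECONDS, 100)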
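The function above reloads the map lazily from inside asyncInvoke, so several in-flight requests may trigger a reload at the same time. One possible alternative, sketched below using only the JDK and not any Flink API, is to refresh on a background timer and swap in a new immutable snapshot atomically. RefreshingCache, loader and refreshNow are hypothetical names introduced for this illustration; loader stands in for the database query.

```scala
import java.util.concurrent.{Executors, ScheduledExecutorService, TimeUnit}
import java.util.concurrent.atomic.AtomicReference

// Hypothetical helper: keeps an immutable snapshot and replaces it on a timer.
final class RefreshingCache[K, V](loader: () => Map[K, V], period: Long, unit: TimeUnit) {
  // Load once up front so readers never observe an empty cache.
  private val snapshot = new AtomicReference[Map[K, V]](loader())
  private val scheduler: ScheduledExecutorService = Executors.newSingleThreadScheduledExecutor()

  scheduler.scheduleAtFixedRate(
    new Runnable {
      def run(): Unit =
        // Keep the old snapshot if a reload fails; an uncaught exception
        // would cancel all later runs of the scheduled task.
        try {
          snapshot.set(loader())
        } catch {
          case e: Exception => System.err.println(s"refresh failed: ${e.getMessage}")
        }
    },
    period, period, unit)

  // Readers only dereference the current snapshot, so they never block.
  def get(key: K): Option[V] = snapshot.get().get(key)

  // Reload immediately on the calling thread (handy in tests).
  def refreshNow(): Unit = snapshot.set(loader())

  // Stop the background thread; call this when the owner shuts down.
  def close(): Unit = scheduler.shutdownNow()
}
```

Inside a rich function, such a cache would presumably be created in open() and shut down in close(), for example new RefreshingCache(() => client.query(str).asMap, 1, TimeUnit.HOURS), assuming the client can be used safely from the timer thread.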
I need to get some data from Cassandra for entries in a Kafka-Streams streaming application. I'd need to perform the join on ID. I'd like to set up a cache to save time used for queries.
The table is simple:
id | name
---|-----
1 |Mike
My plan is straightforward: query the table from database then store into a Map[Int, String].
The main problem is - data may change in the table and needs to be updated periodically, so I need to query it from time to time.
So far I've come up with a threaded solution like this:
// local database mirror
class Mirror(user: String, password: String) extends Runnable {
  var database: Map[Int, String] = Map[Int, String]() withDefaultValue "undefined"

  def run(): Unit = {
    update()
  }

  def update(): Unit = {
    println("update")
    database.synchronized {
      println("sync-update")
      // val c = Driver.getConnection(...)
      // database = c.execute(select id, name from table). ...
      database += (1 -> "one")
      Thread.sleep(100)
      // c.close()
    }
  }

  def get(k: Int): Option[String] = {
    println("get")
    database.synchronized {
      println("sync-get")
      if (! (database contains k)) {
        update()
        database.get(k)
      } else {
        database.get(k)
      }
    }
  }
}
Main looks like this:
def main(args: Array[String]): Unit = {
  val db = new Mirror("u", "p")
  val ex = new ScheduledThreadPoolExecutor(1)
  val f = ex.scheduleAtFixedRate(db, 100, 100, TimeUnit.SECONDS)
  while (true) { // simulate stream
    val res = db.get(1)
    println(res)
    Thread.sleep(10000)
  }
}
It seems to function fine. But are there any pitfalls in my code? In particular, I'm not confident about the thread safety of the update and get functions.
If you are not opposed to using Akka I would look at Akka Streams; specifically Alpakka to do this. There's no need to reinvent the wheel if you don't have to.
That being said the code has the following problems:
Existence check on cache will not help if the entries in Cassandra are updated. It will only help if they are missing from your cache
Look at using a reentrant read write lock if you believe that most of the time your cache will have the current entries. This will help with contention if you have multiple threads calling your mirror.
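A minimal sketch of the read-write-lock suggestion, under the assumption that the whole table is reloaded on each update. LockedMirror and load are hypothetical names; load stands in for the Cassandra query. Note also that in the question's code database.synchronized locks on whichever Map instance the var currently points to, and database += ... reassigns that var, so two threads can end up holding different locks; a dedicated lock object avoids this.

```scala
import java.util.concurrent.locks.ReentrantReadWriteLock

// Hypothetical rewrite of the mirror; `load` stands in for the Cassandra query.
final class LockedMirror(load: () => Map[Int, String]) {
  private val lock = new ReentrantReadWriteLock()
  private var database: Map[Int, String] = Map.empty

  // Writers take the exclusive lock and replace the whole map.
  def update(): Unit = {
    val fresh = load() // query outside the lock so readers are not blocked by I/O
    lock.writeLock().lock()
    try {
      database = fresh
    } finally {
      lock.writeLock().unlock()
    }
  }

  // Readers share the read lock, so concurrent gets do not block each other.
  def get(k: Int): Option[String] = {
    lock.readLock().lock()
    try {
      database.get(k)
    } finally {
      lock.readLock().unlock()
    }
  }
}
```

Because update() replaces the map wholesale on every scheduled run, changed rows are picked up too, not only missing keys.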
Again, I would highly recommend you look at Akka Streams with Alpakka because you can do what you want with that tool without having to write a bunch of code yourself.
I am completely new to Flink. This question may be a duplicate, but I found only one related link, and I could not understand it.
https://stackoverflow.com/a/44294980/6904987
I stored data in Redis in key-value format; for example, the key is UserId and the value is UserInfo. I wrote the code below for it.
class RedisExampleMapper extends RedisMapper[(String, String)] {
  override def getCommandDescription: RedisCommandDescription = {
    new RedisCommandDescription(RedisCommand.HSET, "HASH_NAME")
  }
  override def getKeyFromData(data: (String, String)): String = data._1
  override def getValueFromData(data: (String, String)): String = data._2
}

val env = StreamExecutionEnvironment.getExecutionEnvironment
val conf = new FlinkJedisPoolConfig.Builder().setHost("IP").build()
val streamSink = env.readTextFile("/path/useInformation.txt").map(x => {
  val userInformation = x.split(",")
  val UserId = userInformation(0)
  val UserInfo = userInformation(1)
  (UserId, UserInfo)
})
val redisSink = new RedisSink[(String, String)](conf, new RedisExampleMapper)
streamSink.addSink(redisSink)
Sample Data:
12 "UserInfo12"
13 "UserInfo13"
14 "UserInfo14"
15 "UserInfo15"
I want to fetch data from Redis using Flink, based on the key. For example, 14 should return "UserInfo14". The output should be printed to the Flink log file or to the terminal.
Thanks in advance.
Extending on the answer in https://stackoverflow.com/a/44294980/6904987.
Add the source with env.addSource(new RedisSource(data structure name)).
You have to implement the RedisSource yourself: it connects to a Redis database and reads the records from a Redis data structure.
The implementation depends. Either you consume from Redis through polling or you subscribe to Redis, emitting events from the source whenever you get them from Redis.
You can check the general SourceFunction example and documentation available here: https://ci.apache.org/projects/flink/flink-docs-release-1.5/api/java/org/apache/flink/streaming/api/functions/source/SourceFunction.html
If you want to query Redis for key-value search, you can use a Redis client inside your transformations. For example, Jedis can be used to query Redis if you are using Java with Flink.
The aim is to stream data from a database, perform some computation on each chunk of data (this computation returns a Future of some case class), and send the result as a chunked response to the user. Currently I am able to stream data and send the response without performing any computation. However, I am unable to perform this computation and then stream the result.
This is the route I have implemented.
def streamingDB1 =
  path("streaming-db1") {
    get {
      val src = Source.fromPublisher(db.stream(getRds))
      complete(src)
    }
  }
The function getRds returns the rows of a table mapped into a case class (using Slick). Now consider the function compute, which takes each row as input and returns a Future of another case class. Something like
def compute(x: Tweet): Future[TweetNew] = ???
How can I apply this function to the variable src and send the chunked response (as a stream) of this computation to the user?
You could transform the source using mapAsync:
val src =
  Source.fromPublisher(db.stream(getRds))
    .mapAsync(parallelism = 3)(compute)
complete(src)
Adjust the level of parallelism as needed.
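To make the effect of the parallelism setting concrete, here is a rough standard-library approximation, not Akka's implementation: it runs at most parallelism futures at a time and returns results in input order. mapAsync itself keeps a sliding window of in-flight futures and emits downstream as results become available in order, whereas this sketch works in fixed batches. BoundedAsync and mapInBatches are hypothetical names.

```scala
import scala.concurrent.{ExecutionContext, Future}

object BoundedAsync {
  // Runs `f` over `items` in batches of at most `parallelism` concurrent
  // futures and returns the results in input order.
  def mapInBatches[A, B](items: Seq[A], parallelism: Int)(f: A => Future[B])(
      implicit ec: ExecutionContext): Future[Vector[B]] =
    items.grouped(parallelism).foldLeft(Future.successful(Vector.empty[B])) { (acc, batch) =>
      // Start the next batch only after all earlier batches have completed.
      acc.flatMap(done => Future.sequence(batch.map(f)).map(done ++ _))
    }
}
```

A higher parallelism means more calls to compute are in flight at once, which helps when each call is slow but puts more load on whatever compute talks to.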
Note that you might need to configure a few settings as mentioned in the Slick documentation:
Note: Some database systems may require session parameters to be set in a certain way to support streaming without caching all data at once in memory on the client side. For example, PostgreSQL requires both .withStatementParameters(rsType = ResultSetType.ForwardOnly, rsConcurrency = ResultSetConcurrency.ReadOnly, fetchSize = n) (with the desired page size n) and .transactionally for proper streaming.
So if you're using PostgreSQL, for example, then your Source might look something like the following:
val src =
  Source.fromPublisher(
    db.stream(
      getRds.withStatementParameters(
        rsType = ResultSetType.ForwardOnly,
        rsConcurrency = ResultSetConcurrency.ReadOnly,
        fetchSize = 10
      ).transactionally
    )
  ).mapAsync(parallelism = 3)(compute)
You need a way to marshal TweetNew. Also, if you send a chunk of length 0, the client may close the connection.
This code works with curl:
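import akka.NotUsed
import akka.http.scaladsl.model.{ContentTypes, HttpEntity}
import akka.http.scaladsl.server.Directives._
import akka.stream.scaladsl.Source
import akka.util.ByteString
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future

case class TweetNew(str: String)

def compute(string: String): Future[TweetNew] = Future {
  TweetNew(string)
}

val route = path("hello") {
  get {
    // Render each element as one newline-terminated chunk.
    val byteString: Source[ByteString, NotUsed] = Source.apply(List("t1", "t2", "t3"))
      .mapAsync(2)(compute)
      .map(tweet => ByteString(tweet.str + "\n"))
    complete(HttpEntity(ContentTypes.`text/plain(UTF-8)`, byteString))
  }
}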
I've got a SourceQueue. When I offer an element to this I want it to pass through the Stream and when it reaches the Sink have the output returned to the code that offered this element (similar as Sink.head returns an element to the RunnableGraph.run() call).
How do I achieve this? A simple example of my problem would be:
val source = Source.queue[String](100, OverflowStrategy.fail)
val flow = Flow[String].map(element => s"Modified $element")
val sink = Sink.ReturnTheStringSomehow
val graph = source.via(flow).to(sink).run()
val x = graph.offer("foo")
println(x) // Output should be "Modified foo"
val y = graph.offer("bar")
println(y) // Output should be "Modified bar"
val z = graph.offer("baz")
println(z) // Output should be "Modified baz"
Edit: For the example I have given in this question Vladimir Matveev provided the best answer. However, it should be noted that this solution only works if the elements are going into the sink in the same order they were offered to the source. If this cannot be guaranteed the order of the elements in the sink may differ and the outcome might be different from what is expected.
I believe it is simpler to use the already existing primitive for pulling values from a stream, called Sink.queue. Here is an example:
val source = Source.queue[String](128, OverflowStrategy.fail)
val flow = Flow[String].map(element => s"Modified $element")
val sink = Sink.queue[String]().withAttributes(Attributes.inputBuffer(1, 1))
val (sourceQueue, sinkQueue) = source.via(flow).toMat(sink)(Keep.both).run()
def getNext: String = Await.result(sinkQueue.pull(), 1.second).get
sourceQueue.offer("foo")
println(getNext)
sourceQueue.offer("bar")
println(getNext)
sourceQueue.offer("baz")
println(getNext)
It does exactly what you want.
Note that setting the inputBuffer attribute for the queue sink may or may not be important for your use case - if you don't set it, the buffer will be zero-sized and the data won't flow through the stream until you invoke the pull() method on the sink.
sinkQueue.pull() yields a Future[Option[T]], which will be completed successfully with Some if the sink receives an element or with a failure if the stream fails. If the stream completes normally, it will be completed with None. In this particular example I'm ignoring this by using Option.get but you would probably want to add custom logic to handle this case.
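As a small sketch of that custom logic, the three outcomes can be told apart by wrapping the blocking wait in a Try. The pulled parameter stands in for the Future[Option[String]] returned by sinkQueue.pull(); PullHandling and describe are hypothetical names, and the returned strings are only placeholders for real handling.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.util.{Failure, Success, Try}

object PullHandling {
  // `pulled` stands in for the Future[Option[String]] returned by sinkQueue.pull().
  def describe(pulled: Future[Option[String]], timeout: FiniteDuration = 1.second): String =
    Try(Await.result(pulled, timeout)) match {
      case Success(Some(element)) => s"element: $element"  // the sink received an element
      case Success(None)          => "stream completed"    // normal completion
      case Failure(error)         => s"stream failed or timed out: ${error.getMessage}"
    }
}
```

The Failure branch also covers the case where Await.result times out, since that surfaces as an exception too.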
Well, you can see what the offer() method returns if you take a look at its definition :) What you can do is create a Source.queue[(Promise[String], String)], write a helper function that pushes a pair into the stream via offer (making sure the offer doesn't fail because the queue is full), complete the promise inside your stream, and use the promise's future to catch the completion event in external code.
I do that to throttle rate to external API used from multiple places of my project.
Here is how it looked in my project before Typesafe added Hub sources to akka
import scala.concurrent.Promise
import scala.concurrent.Future
import java.util.concurrent.ConcurrentLinkedDeque
import akka.stream.scaladsl.{Keep, Sink, Source}
import akka.stream.{OverflowStrategy, QueueOfferResult}
import scala.util.Success

private val queue = Source.queue[(Promise[String], String)](100, OverflowStrategy.backpressure)
  .toMat(Sink.foreach({ case (p, param) =>
    p.complete(Success(param.reverse))
  }))(Keep.left)
  .run

private val futureDeque = new ConcurrentLinkedDeque[Future[String]]()

private def sendQueuedRequest(request: String): Future[String] = {
  val p = Promise[String]
  val offerFuture = queue.offer(p -> request)

  def addToQueue(future: Future[String]): Future[String] = {
    futureDeque.addLast(future)
    future.onComplete(_ => futureDeque.remove(future))
    future
  }

  offerFuture.flatMap {
    case QueueOfferResult.Enqueued =>
      addToQueue(p.future)
  }.recoverWith {
    case ex =>
      val first = futureDeque.pollFirst()
      if (first != null)
        addToQueue(first.flatMap(_ => sendQueuedRequest(request)))
      else
        sendQueuedRequest(request)
  }
}
I realize that the blocking synchronized queue may be a bottleneck and may grow indefinitely, but because the API calls in my project are made only from other Akka streams, which are backpressured, I never have more than a dozen items in futureDeque. Your situation may differ.
If you create a MergeHub.source[(Promise[String], String)]() instead, you'll get a reusable sink. Then every time you need to process an item you'll create a complete graph and run it. In that case you won't need the hacky Java container to queue requests.
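Stripped of Akka, the promise-per-element idea can be sketched with a plain bounded queue and a single worker thread standing in for the stream. This is only an illustration of the pattern, not the code from the project above; PromiseQueue, process and stop are hypothetical names.

```scala
import java.util.concurrent.LinkedBlockingQueue
import scala.concurrent.{Future, Promise}
import scala.util.Try

// Hypothetical stand-in for the stream: one worker thread drains a bounded queue.
final class PromiseQueue(capacity: Int)(process: String => String) {
  private val queue = new LinkedBlockingQueue[(Promise[String], String)](capacity)

  private val worker = new Thread(new Runnable {
    def run(): Unit =
      try {
        while (true) {
          val (promise, request) = queue.take()
          // Completing the promise hands the result back to whoever offered the element.
          promise.complete(Try(process(request)))
        }
      } catch {
        case _: InterruptedException => () // stop() was called
      }
  })
  worker.setDaemon(true)
  worker.start()

  // Returns a future of the processed value, or a failed future if the queue is full.
  def offer(request: String): Future[String] = {
    val promise = Promise[String]()
    if (queue.offer((promise, request))) promise.future
    else Future.failed(new IllegalStateException("queue is full"))
  }

  def stop(): Unit = worker.interrupt()
}
```

Because every caller holds the future of its own promise, each result is matched to the element that produced it, even if elements were to reach the consumer in a different order than they were offered.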
Problem statement: We store all incoming request parameters from users of a particular module as rows in a MySQL table (this is a huge amount of data). Now we want to design a process that reads each record from this table, gets more information about the user's request by calling third-party APIs, and then stores the returned meta information in another table.
Current Attempts:
I am using Scala + Slick to do this. As the data to read is huge, I want to read this table one row at a time and process it. I tried using slick + akka streams, however I am getting 'java.util.concurrent.RejectedExecutionException'
Following is the rough logic that I have tried,
implicit val system = ActorSystem("Example")
import system.dispatcher
implicit val materializer = ActorMaterializer()
val future = db.stream(SomeQuery.result)
Source.fromPublisher(future)
  .map(row => dataEnrichmentAPI.process(row))
  .runForeach(id => println("Processed row : " + id))
dataEnrichmentAPI.process : This function makes a third party REST call and also does some DB query to get required data. This DB query is done using 'db.run' method and it also waits until it finishes (Using Await)
e.g.,
def process(row: RequestRecord): Int = {
  // SomeQuery2 = Check if data is already there in DB
  val retId: Seq[Int] = Await.result(db.run(SomeQuery2.result), Duration.Inf)
  if (retId.isEmpty) {
    val metaData = RestCall()
    // SomeQuery3 = Store this metaData in DB
    Await.result(db.run(SomeQuery3.result), Duration.Inf)
    return metaData.id
  } else {
    // SomeQuery4 = Get meta data id
    return Await.result(db.run(SomeQuery4.result), Duration.Inf)
  }
}
I am getting this exception where I make the blocking calls to the DB. I don't think I can get rid of them, as the return value is required for the rest of the flow to continue.
Is the blocking call the reason behind this exception?
What is the best practice for solving this kind of problem?
Thanks.
I don't know if this is your problem (too few details), but you should never block.
Speaking of best practices, use async stages instead.
This is more or less what your code would look like without using Await.result:
def process(row: RequestRecord): Future[Int] = {
  db.run(SomeQuery2.result) flatMap {
    case retId if retId.isEmpty =>
      // what is this? is it a sync call? if it's a rest call it should return a future
      val metaData = RestCall()
      db.run(SomeQuery3.result).map(_ => metaData.id)
    case _ => db.run(SomeQuery4.result)
  }
}

Source.fromPublisher(db.stream(SomeQuery.result))
  // choose your own parallelism
  .mapAsync(2)(dataEnrichmentAPI.process)
  .runForeach(id => println("Processed row : " + id))
This way you will be handling backpressure and parallelism explicitly and idiomatically.
Try never to call Await.result in production code; compose futures using map, flatMap and for comprehensions instead.
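For illustration, the same branching can also be written as a for comprehension. In this sketch findExisting, restCall and store are hypothetical stubs standing in for the database queries and the REST call, with made-up return values, and the REST call is assumed to return a Future as well.

```scala
import scala.concurrent.{ExecutionContext, Future}

object Enrichment {
  // Hypothetical stand-ins for the database lookups and the REST call.
  def findExisting(rowId: Int)(implicit ec: ExecutionContext): Future[Option[Int]] =
    Future(if (rowId % 2 == 0) Some(rowId * 100) else None)
  def restCall(rowId: Int)(implicit ec: ExecutionContext): Future[Int] = Future(rowId + 1000)
  def store(metaId: Int)(implicit ec: ExecutionContext): Future[Unit] = Future(())

  // Same branching as `process`, composed without blocking any thread.
  def process(rowId: Int)(implicit ec: ExecutionContext): Future[Int] =
    for {
      existing <- findExisting(rowId)
      metaId <- existing match {
        case Some(id) => Future.successful(id) // already in the DB
        case None =>
          for {
            fetched <- restCall(rowId) // fetch from the third-party API
            _ <- store(fetched)        // persist before returning the id
          } yield fetched
      }
    } yield metaId
}
```

Each step starts only when the previous future completes, so the ordering of the original blocking version is preserved while no thread waits on Await.result.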