I need to get some data from Cassandra for entries in a Kafka-Streams streaming application. I'd need to perform the join on ID. I'd like to set up a cache to save time used for queries.
The table is simple:
id | name
---|-----
1 |Mike
My plan is straightforward: query the table from database then store into a Map[Int, String].
The main problem is - data may change in the table and needs to be updated periodically, so I need to query it from time to time.
So far I've come up with a threaded solution like this:
// local database mirror
class Mirror(user: String, password: String) extends Runnable {
var database: Map[Int, String] = Map[Int, String]() withDefaultValue "undefined"
def run(): Unit = {
update()
}
//
def update(): Unit = {
println("update")
database.synchronized {
println("sync-update")
// val c = Driver.getConnection(...)
// database = c.execute(select id, name from table). ...
database += (1 -> "one")
Thread.sleep(100)
// c.close()
}
}
def get(k: Int): Option[String] = {
println("get")
database.synchronized {
println("sync-get")
if (! (database contains k)) {
update()
database.get(k)
} else {
database.get(k)
}
}
}
}
Main looks like this:
def main(args: Array[String]): Unit = {
val db = new Mirror("u", "p")
val ex = new ScheduledThreadPoolExecutor(1)
val f = ex.scheduleAtFixedRate(db, 100, 100, TimeUnit.SECONDS)
while(true) { // simulate stream
val res = db.get(1)
println(res)
Thread.sleep(10000)
}
}
It seems to function fine. But are there any pitfalls in my code? Especially I'm not confident about thread safety of update & get functions.
If you are not opposed to using Akka I would look at Akka Streams; specifically Alpakka to do this. There's no need to reinvent the wheel if you don't have to.
That being said the code has the following problems:
Existence check on cache will not help if the entries in Cassandra are updated. It will only help if they are missing from your cache
Look at using a reentrant read write lock if you believe that most of the time your cache will have the current entries. This will help with contention if you have multiple threads calling your mirror.
Again, I would highly recommend you look at Akka Streams with Alpakka because you can do what you want with that tool wihtout having to write a bunch of code yourself.
Related
I read the advantages of using RichSinkFunction over directly calling the DB methods. Therefore, I decided to write my own RichSinkFunction.
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.sink.{RichSinkFunction, SinkFunction}
import com.datastax.driver.core.{Session, Cluster}
class CassandraAsSink extends RichSinkFunction {
override def open(parameters: Configuration): Unit = {
val cluster = Cluster.builder().addContactPoint("localhost").build()//
val session = cluster.connect("example")
}
override def invoke(value: Nothing, context: SinkFunction.Context): Unit = {
session.execute(
s"""
INSERT INTO users (name, credits, user_id)
VALUES ($name, $credits, $userId)
"""
)
}
override def close(): Unit = {
//something like session.close()
}
}
However, I am not able to develop it fully. I want to call this method under a separate class which should pass 3 arguments that I want to enter mentioned in the code. The record is in JSON format. I can manage that by parsing and getting the attributes. But how do I pass it to the invoke method and how can I pass the session object throughout the class. Also, is it a correct way of doing it since I am new to both Flink and Scala?
Will stream/string.new CassandraAsSink().invoke(name,credits,user_id) work when it comes to the calling part?
Modified:
class CassandraSink extends RichSinkFunction[String] {
var cluster: Cluster = _
var session: Session = _
println("inside....")
override def open(parameters: Configuration): Unit = {
cluster = Cluster.builder().addContactPoint("localhost").build() //
session = cluster.connect("example")
println("Connected....")
}
override def invoke(value: String): Unit = {
println("inside invoke: " + value)
session.execute(
s"""
INSERT INTO jsondata1(records_b)
VALUES ($value)
"""
)
}
override def close(): Unit = {
session.close()
println("Session Closed...")
//something like session.close()
}
}
Calling part:
val datastreamFromString:DataStream[String]=env.fromElements(data) // where data is string
datastreamFromString.addSink(new CassandraAsSink())
I figured out that there is some problem with my DataStream created from String. The class is working fine. I have initialized the env variable as the second line in the class.
Flink already has a Cassandra sink; it has valuable features you haven't attempted to support, especially checkpointing.
As for your questions:
You can make session a member variable that can be initialized in open and used in invoke.
Flink will call the invoke method for every stream record coming into the sink. This record passed to invoke as the value parameter. You'll need to extract the fields like name, etc from that value.
You'll need to attach the sink to your job graph; overall it will end up being something like this:
val env = StreamExecutionEnvironment.getExecutionEnvironment
env
.addSource(source)
... // some processing
.addSink(new CassandraAsSink())
env.execute()
By the way, there are training lessons with examples and exercises included in the Flink documentation to help you get started.
I am developing a Apache Flink application using Scala API ( I am pretty new using this technology).
I am using a hashmap to store some values that come from a database, and I need to refresh these values each 1h. There is any way to refresh this hashmap asynchronously?
Thanks!
I'm not sure what you mean by "refresh this hashmap asynchronously" in the context of a Flink workflow.
For what it's worth, if you have a hashmap that's keyed by some piece of data from records flowing through your workflow, then you can use Flink's support for managed key state to store the value (and checkpoint it), and make it queryable.
I interpret your question to mean that you are using some state in Flink to mirror/cache some data that comes from an external database, and you wish to periodically refresh it.
Typically this sort of thing is done by continuously streaming a Change Data Capture (CDC) stream from the external database into Flink. Continuous, streaming solutions are generally a better fit for Flink. But if you want to do this in hourly batches, you could write a custom source or a ProcessFunction that wakes up once an hour, makes a query to the database, and emits a stream of records that can be used to update the operator holding the state.
You can achieve this with the use of Apache Flink's Asynchronous I/O for External Data Access, see this post for details async io.
Here's a way to use AsyncDataStream to refresh a map periodically by creating a async function and attaching it to a source stream.
class AsyncEnricherFunction extends RichAsyncFunction[String, (String String)] {
#transient private var m: Map[String, String] = _
#transient private var client: DataBaseClient = _
#transient private var refreshInterval: Int = _
#throws(classOf[Exception])
override def open(parameters: Configuration): Unit = {
client = new DataBaseClient(host, port, credentials)
refreshInterval = 1000
load()
}
private def load(): Unit = {
val str = "select key, value from KeyValue"
m = client.query(str).asMap
lastRefreshed = System.currentTimeMillis()
}
override def asyncInvoke(input: String, resultFuture: ResultFuture[(String, String]): Unit = {
Future {
if (System.currentTimeMillis() > lastRefreshed + refreshInterval) load()
val enriched = (input, m(input))
resultFuture.complete(Seq(enriched))
}(ExecutionContext.global)
}
override def close() : Unit = { client.close() }
}
val in: DataStream[String] = env.addSource(src)
val enriched = AsyncDataStream.unorderedWait(in, AsyncEnricherFunction(), 5000, TimeUnit.MILLISECONDS, 100)
I have a method in which there are multiple calls to db. As I have not implemented any concurrent processing, a 2nd db call has to wait until the 1st db call gets completed, 3rd has to wait until the 2nd gets completed and so on.
All db calls are independent of each other. I want to make this in such a way that all DB calls run concurrently.
I am new to Akka framework.
Can someone please help me with small sample or references would help. Application is developed in Scala Lang.
There are three primary ways that you could achieve concurrency for the given example needs.
Futures
For the particular use case that is asked about in the question I would recommend Futures before any akka construct.
Suppose we are given the database calls as functions:
type Data = ???
val dbcall1 : () => Data = ???
val dbcall2 : () => Data = ???
val dbcall3 : () => Data = ???
Concurrency can be easily applied, and then the results can be collected, using Futures:
val f1 = Future { dbcall1() }
val f2 = Future { dbcall2() }
val f3 = Future { dbcall3() }
for {
v1 <- f1
v2 <- f2
v3 <- f3
} {
println(s"All data collected: ${v1}, ${v2}, ${v3}")
}
Akka Streams
There is a similar stack answer which demonstrates how to use the akka-stream library to do concurrent db querying.
Akka Actors
It is also possible to write an Actor to do the querying:
object MakeQuery
class DBActor(dbCall : () => Data) extends Actor {
override def receive = {
case _ : MakeQuery => sender ! dbCall()
}
}
val dbcall1ActorRef = system.actorOf(Props(classOf[DBActor], dbcall1))
However, in this use case Actors are less helpful because you still need to collect all of the data together.
You can either use the same technique as the "Futures" section:
val f1 : Future[Data] = (dbcall1ActorRef ? MakeQuery).mapTo[Data]
for {
v1 <- f1
...
Or, you would have to wire the Actors together by hand through the constructor and handle all of the callback logic for waiting on the other Actor:
class WaitingDBActor(dbCall : () => Data, previousActor : ActorRef) {
override def receive = {
case _ : MakeQuery => previousActor forward MakeQuery
case previousData : Data => sender ! (dbCall(), previousData)
}
}
If you want to querying database, you should use something like slick which is a modern database query and access library for Scala.
quick example of slick:
case class User(id: Option[Int], first: String, last: String)
class Users(tag: Tag) extends Table[User](tag, "users") {
def id = column[Int]("id", O.PrimaryKey, O.AutoInc)
def first = column[String]("first")
def last = column[String]("last")
def * = (id.?, first, last) <> (User.tupled, User.unapply)
}
val users = TableQuery[Users]
then your need to create configuration for your db:
mydb = {
dataSourceClass = "org.postgresql.ds.PGSimpleDataSource"
properties = {
databaseName = "mydb"
user = "myuser"
password = "secret"
}
numThreads = 10
}
and in your code you load configuration:
val db = Database.forConfig("mydb")
then run your query with db.run method which gives you future as result, for example you can get all rows by calling method result
val allRows: Future[Seq[User]] = db.run(users.result)
this query run without blocking current thread.
If you have task which take long time to execute or calling to another service, you should use futures.
Example of that is simple HTTP call to external service. you can find example in here
If you have task which take long time to execute and for doing so, you have to keep mutable states, in this case the best option is using Akka Actors which encapsulate your state inside an actor which solve problem of concurrency and thread safety as simple as possible.Example of suck tasks are:
import akka.actor.Actor
import scala.concurrent.Future
case class RegisterEndpoint(endpoint: String)
case class NewUpdate(update: String)
class UpdateConsumer extends Actor {
val endpoints = scala.collection.mutable.Set.empty[String]
override def receive: Receive = {
case RegisterEndpoint(endpoint) =>
endpoints += endpoint
case NewUpdate(update) =>
endpoints.foreach { endpoint =>
deliverUpdate(endpoint, update)
}
}
def deliverUpdate(endpoint: String, update: String): Future[Unit] = {
Future.successful(Unit)
}
}
If you want to process huge amount of live data, or websocket connection, processing CSV file which is growing over time, ... or etc, the best option is Akka stream. For example reading data from kafka topic using Alpakka:Alpakka kafka connector
Within an akka stream stage FlowShape[A, B] , part of the processing I need to do on the A's is to save/query a datastore with a query built with A data. But that datastore driver query gives me a future, and I am not sure how best to deal with it (my main question here).
case class Obj(a: String, b: Int, c: String)
case class Foo(myobject: Obj, name: String)
case class Bar(st: String)
//
class SaveAndGetId extends GraphStage[FlowShape[Foo, Bar]] {
val dao = new DbDao // some dao with an async driver
override def createLogic(inheritedAttributes: Attributes) = new GraphStageLogic(shape) {
setHandlers(in, out, new InHandler with Outhandler {
override def onPush() = {
val foo = grab(in)
val add = foo.record.value()
val result: Future[String] = dao.saveAndGetRecord(add.myobject)//saves and returns id as string
//the naive approach
val record = Await(result, Duration.inf)
push(out, Bar(record))// ***tests pass every time
//mapping the future approach
result.map { x=>
push(out, Bar(x))
} //***tests fail every time
The next stage depends on the id of the db record returned from query, but I want to avoid Await. I am not sure why mapping approach fails:
"it should work" in {
val source = Source.single(Foo(Obj("hello", 1, "world")))
val probe = source
.via(new SaveAndGetId))
.runWith(TestSink.probe)
probe
.request(1)
.expectBarwithId("one")//say we know this will be
.expectComplete()
}
private implicit class RichTestProbe(probe: Probe[Bar]) {
def expectBarwithId(expected: String): Probe[Bar] =
probe.expectNextChainingPF{
case r # Bar(str) if str == expected => r
}
}
When run with mapping future, I get failure:
should work ***FAILED***
java.lang.AssertionError: assertion failed: expected: message matching partial function but got unexpected message OnComplete
at scala.Predef$.assert(Predef.scala:170)
at akka.testkit.TestKitBase$class.expectMsgPF(TestKit.scala:406)
at akka.testkit.TestKit.expectMsgPF(TestKit.scala:814)
at akka.stream.testkit.TestSubscriber$ManualProbe.expectEventPF(StreamTestKit.scala:570)
The async side channels example in the docs has the future in the constructor of the stage, as opposed to building the future within the stage, so doesn't seem to apply to my case.
I agree with Ramon. Constructing a new FlowShapeis not necessary in this case and it is too complicated. It is very much convenient to use mapAsync method here:
Here is a code snippet to utilize mapAsync:
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future
object MapAsyncExample {
val numOfParallelism: Int = 10
def main(args: Array[String]): Unit = {
Source.repeat(5)
.mapAsync(parallelism)(x => asyncSquare(x))
.runWith(Sink.foreach(println)) previous stage
}
//This method returns a Future
//You can replace this part with your database operations
def asyncSquare(value: Int): Future[Int] = Future {
value * value
}
}
In the snippet above, Source.repeat(5) is a dummy source to emit 5 indefinitely. There is a sample function asyncSquare which takes an integer and calculates its square in a Future. .mapAsync(parallelism)(x => asyncSquare(x)) line uses that function and emits the output of Future to the next stage. In this snipet, the next stage is a sink which prints every item.
parallelism is the maximum number of asyncSquare calls that can run concurrently.
I think your GraphStage is unnecessarily overcomplicated. The below Flow performs the same actions without the need to write a custom stage:
val dao = new DbDao
val parallelism = 10 //number of parallel db queries
val SaveAndGetId : Flow[Foo, Bar, _] =
Flow[Foo]
.map(foo => foo.record.value().myobject)
.mapAsync(parallelism)(rec => dao.saveAndGetRecord(rec))
.map(Bar.apply)
I generally try to treat GraphStage as a last resort, there is almost always an idiomatic way of getting the same Flow by using the methods provided by the akka-stream library.
Trying out the newly minted Akka Streams. It seems to be working except for one small thing - there's no output.
I have the following table definition:
case class my_stream(id: Int, value: String)
class Streams(tag: Tag) extends Table[my_stream](tag, "my_stream") {
def id = column[Int]("id")
def value = column[String]("value")
def * = (id, value) <> (my_stream.tupled, my_stream.unapply)
}
And I'm trying to output the contents of the table to stdout like this:
def main(args: Array[String]) : Unit = {
implicit val system = ActorSystem("Subscriber")
implicit val materializer = ActorMaterializer()
val strm = TableQuery[Streams]
val db = Database.forConfig("pg-postgres")
try{
var src = Source.fromPublisher(db.stream(strm.result))
src.runForeach(r => println(s"${r.id},${r.value}"))(materializer)
} finally {
system.shutdown
db.close
}
}
I have verified that the query is being run by configuring debug logging. However, all I get is this:
08:59:24.099 [main] INFO com.zaxxer.hikari.HikariDataSource - pg-postgres - is starting.
08:59:24.428 [main] INFO com.zaxxer.hikari.pool.HikariPool - pg-postgres - is closing down.
The cause is that Akka Streams is asynchronous and runForeach returns a Future which will be completed once the stream completes, but that Future is not being handled and as such the system.shutdown and db.close executes immediately instead of after the stream completes.
Just in case it helps anyone searching this very same issue but in MySQL, take into account that you should enable the driver stream support "manually":
def enableStream(statement: java.sql.Statement): Unit = {
statement match {
case s: com.mysql.jdbc.StatementImpl => s.enableStreamingResults()
case _ =>
}
}
val publisher = sourceDb.stream(query.result.withStatementParameters(statementInit = enableStream))
Source: http://www.slideshare.net/kazukinegoro5/akka-streams-100-scalamatsuri
Ended up using #ViktorKlang answer and just wrapped the run with an Await.result. I also found an alternative answer in the docs which demonstrates using the reactive streams publisher and subscriber interfaces:
The stream method returns a DatabasePublisher[T] and Source.fromPublisher returns a Source[T, NotUsed]. This means you have to attach a subscriber instead of using runForEach - according to the release notes NotUsed is a replacement for Unit. Which means nothing gets passed to the Sink.
Since Slick implements the reactive streams interface and not the Akka Stream interfaces you need to use the fromPublisher and fromSubscriber integration point. That means you need to implement the org.reactivestreams.Subscriber[T] interface.
Here's a quick and dirty Subscriber[T] implementation which simply calls println:
class MyStreamWriter extends org.reactivestreams.Subscriber[my_stream] {
private var sub : Option[Subscription] = None;
override def onNext(t: my_stream): Unit = {
println(t.value)
if(sub.nonEmpty) sub.head.request(1)
}
override def onError(throwable: Throwable): Unit = {
println(throwable.getMessage)
}
override def onSubscribe(subscription: Subscription): Unit = {
sub = Some(subscription)
sub.head.request(1)
}
override def onComplete(): Unit = {
println("ALL DONE!")
}
}
You need to make sure you call the Subscription.request(Long) method in onSubscribe and then in onNext to ask for data or nothing will be sent or you won't get the full set of results.
And here's how you use it:
def main(args: Array[String]) : Unit = {
implicit val system = ActorSystem("Subscriber")
implicit val materializer = ActorMaterializer()
val strm = TableQuery[Streams]
val db = Database.forConfig("pg-postgres")
try{
val src = Source.fromPublisher(db.stream(strm.result))
val flow = src.to(Sink.fromSubscriber(new MyStreamWriter()))
flow.run()
} finally {
system.shutdown
db.close
}
}
I'm still trying to figure this out so I welcome any feedback. Thanks!