Read Data From Redis Using Flink - scala

I am completely new to Flink. This question may be a duplicate, but I found only one link and it is not understandable to me:
https://stackoverflow.com/a/44294980/6904987
I stored data in Redis in key-value format; for example, the key is a UserId and the value is the UserInfo. I have written the code below for it.
class RedisExampleMapper extends RedisMapper[(String, String)] {
  override def getCommandDescription: RedisCommandDescription = {
    new RedisCommandDescription(RedisCommand.HSET, "HASH_NAME")
  }
  override def getKeyFromData(data: (String, String)): String = data._1
  override def getValueFromData(data: (String, String)): String = data._2
}

val env = StreamExecutionEnvironment.getExecutionEnvironment
val conf = new FlinkJedisPoolConfig.Builder().setHost("IP").build()
val streamSink = env.readTextFile("/path/useInformation.txt").map(x => {
  val userInformation = x.split(",")
  val userId = userInformation(0)
  val userInfo = userInformation(1)
  (userId, userInfo)
})
val redisSink = new RedisSink[(String, String)](conf, new RedisExampleMapper)
streamSink.addSink(redisSink)
Sample Data:
12 "UserInfo12"
13 "UserInfo13"
14 "UserInfo14"
15 "UserInfo15"
I want to fetch data from Redis using Flink based on a key. For example, 14 should return "UserInfo14". The output should be printed to the Flink log file or to the terminal, whichever it is.
Thanks in advance.

Extending on the answer in https://stackoverflow.com/a/44294980/6904987:
Add the source with env.addSource(new RedisSource(<data structure name>)).
You have to implement the RedisSource yourself; it connects to a Redis database and reads records from a Redis data structure.
The implementation depends on your use case: either you consume from Redis by polling, or you subscribe to Redis and emit events from the source whenever you receive them.
You can check the general SourceFunction example and documentation available here: https://ci.apache.org/projects/flink/flink-docs-release-1.5/api/java/org/apache/flink/streaming/api/functions/source/SourceFunction.html
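A minimal sketch of such a polling source, assuming the Jedis client is on the classpath and that the data was written into the hash HASH_NAME as above; the class name and connection parameters are illustrative, not part of any Flink API:

import org.apache.flink.streaming.api.functions.source.SourceFunction
import redis.clients.jedis.Jedis
import scala.collection.JavaConverters._

// Illustrative one-shot polling source: reads every field of the hash and
// emits each (key, value) pair. Host, port and hash name are assumptions.
class RedisSource(host: String, port: Int, hashName: String)
    extends SourceFunction[(String, String)] {

  @volatile private var running = true

  override def run(ctx: SourceFunction.SourceContext[(String, String)]): Unit = {
    val jedis = new Jedis(host, port)
    try {
      // hgetAll returns all fields and values of the hash as a java.util.Map
      jedis.hgetAll(hashName).asScala.foreach { case (k, v) =>
        if (running) ctx.collect((k, v))
      }
    } finally {
      jedis.close()
    }
  }

  override def cancel(): Unit = running = false
}

// Printing the stream covers the "output in the log file or terminal" part:
// env.addSource(new RedisSource("localhost", 6379, "HASH_NAME")).print()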

If you want to query Redis for key-value search, you can use a Redis client inside your transformations. For example, Jedis can be used to query Redis if you are using Java with Flink.
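For a plain lookup by key, one option is a RichMapFunction that opens a Jedis connection once per task in open(). A sketch, again assuming Jedis and the HASH_NAME hash; names and connection parameters are illustrative:

import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration
import redis.clients.jedis.Jedis

// Looks up the value for each incoming key; connection details are assumptions.
class RedisLookup(host: String, port: Int, hashName: String)
    extends RichMapFunction[String, (String, String)] {

  @transient private var jedis: Jedis = _

  override def open(parameters: Configuration): Unit = {
    jedis = new Jedis(host, port)
  }

  override def map(key: String): (String, String) =
    (key, jedis.hget(hashName, key)) // hget returns null if the field is missing

  override def close(): Unit = if (jedis != null) jedis.close()
}

// keys is a DataStream[String] of user ids, e.g. "14":
// keys.map(new RedisLookup("localhost", 6379, "HASH_NAME")).print()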

Extracting Data From Azure SQL database using Akka.io

Currently, I am able to create a session using IntelliJ:
// "sqlserver" is the name of the configuration block in application.conf
val databaseConfig = DatabaseConfig.forConfig[JdbcProfile]("sqlserver")
implicit val session = SlickSession.forConfig(databaseConfig)
This is the config:
sqlserver = {
  profile = "slick.jdbc.SQLServerProfile$"
  db {
    driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
    host = <myHostName> # e.g. myresource.database.windows.net
    port = <myPortNumber> # e.g. 1433
    databaseName = <myDatabaseName>
    url = <jdbc:sqlserver:myHostName:port;database=myDatabase>
    user = <user>
    password = <password>
    connectionTimeout = "30 seconds"
  }
}
Some of the methods suggested are:
// The example domain
case class User(id: Int, name: String)
val users = (1 to 42).map(i => User(i, s"Name$i"))

// This import enables the use of the Slick sql"...",
// sqlu"...", and sqlt"..." String interpolators.
// See "http://slick.lightbend.com/doc/3.2.1/sql.html#string-interpolation"
import session.profile.api._

// Stream the users into the database as insert statements
val done: Future[Done] =
  Source(users)
    .via(
      // add an optional first argument to specify the parallelism factor (Int)
      Slick.flow(user => sqlu"INSERT INTO ALPAKKA_SLICK_SCALADSL_TEST_USERS VALUES(${user.id}, ${user.name})")
    )
    .log("nr-of-updated-rows")
    .runWith(Sink.ignore)
I couldn't find any examples of methods to extract data with SQL commands from Akka.io. The closest one is at this link:
[Alpakka Slick (JDBC)][1]
At this point there are no errors for the connection, but I'm still lacking the methods to access and download data from Azure SQL databases with query methods.
This one looks like it is creating its own Vector list.
case class User(id: Int, name: String)
val users = (1 to 42).map(i => User(i, s"Name$i"))
Results: Vector(User(1,Name1), User(2,Name2), ....)
Is there a way where I can extract my data from Azure SQL server?
[1]: https://doc.akka.io/docs/alpakka/current/slick.html
If you want to get data from SQL Server into an Akka Stream, you need a Source, not a Sink (which is for writing from Akka into the database).
Because Alpakka defers JDBC integration to the Slick library, it's perhaps worth reading up on that library.
From the documentation, you'll want something like:
import akka.Done
import akka.stream.alpakka.slick.scaladsl.Slick
import akka.stream.scaladsl.Sink
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import slick.jdbc.GetResult
import session.profile.api._ // uses the implicit SlickSession defined above

case class User(id: Int, name: String)

// Define how to transform result rows (each row being a PositionedResult)
// into Users. See https://scala-slick.org/doc/3.3.2/sql.html
implicit val getUserFromResult = GetResult(r => User(r.nextInt, r.nextString))

val gotAllUsers: Future[Done] =
  Slick.source(sql"SELECT id, name FROM table".as[User])
    .log("user")
    .runWith(Sink.ignore)

// Wait for the query to complete before exiting; only useful for this example.
Await.result(gotAllUsers, Duration.Inf)

Apache Flink - Refresh a Hashmap asynchronously

I am developing an Apache Flink application using the Scala API (I am pretty new to this technology).
I am using a hashmap to store some values that come from a database, and I need to refresh these values every hour. Is there any way to refresh this hashmap asynchronously?
Thanks!
I'm not sure what you mean by "refresh this hashmap asynchronously" in the context of a Flink workflow.
For what it's worth, if you have a hashmap that's keyed by some piece of data from records flowing through your workflow, then you can use Flink's support for managed keyed state to store the values (and checkpoint them), and make them queryable.
I interpret your question to mean that you are using some state in Flink to mirror/cache some data that comes from an external database, and you wish to periodically refresh it.
Typically this sort of thing is done by continuously streaming a Change Data Capture (CDC) stream from the external database into Flink. Continuous, streaming solutions are generally a better fit for Flink. But if you want to do this in hourly batches, you could write a custom source or a ProcessFunction that wakes up once an hour, makes a query to the database, and emits a stream of records that can be used to update the operator holding the state.
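A rough sketch of such a periodic source, where queryDatabase() is a hypothetical placeholder for the real database call:

import org.apache.flink.streaming.api.functions.source.SourceFunction

// Emits a fresh snapshot of the reference data once an hour. A downstream
// operator (e.g. via connect/broadcast state) replaces its hashmap with it.
class HourlyRefreshSource extends SourceFunction[Map[String, String]] {

  @volatile private var running = true

  override def run(ctx: SourceFunction.SourceContext[Map[String, String]]): Unit = {
    while (running) {
      val snapshot = queryDatabase() // hypothetical: load the current values
      ctx.collect(snapshot)
      Thread.sleep(60 * 60 * 1000L)  // wake up again in one hour; Flink interrupts this on cancel
    }
  }

  override def cancel(): Unit = running = false

  private def queryDatabase(): Map[String, String] =
    Map.empty // replace with the real query
}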
You can achieve this with Apache Flink's Asynchronous I/O for External Data Access; see the documentation on async I/O for details.
Here's a way to use AsyncDataStream to refresh a map periodically, by creating an async function and attaching it to a source stream.
class AsyncEnricherFunction extends RichAsyncFunction[String, (String, String)] {
  @transient private var m: Map[String, String] = _
  @transient private var client: DataBaseClient = _
  @transient private var refreshInterval: Int = _
  @transient private var lastRefreshed: Long = _

  @throws(classOf[Exception])
  override def open(parameters: Configuration): Unit = {
    client = new DataBaseClient(host, port, credentials)
    refreshInterval = 1000
    load()
  }

  private def load(): Unit = {
    val str = "select key, value from KeyValue"
    m = client.query(str).asMap
    lastRefreshed = System.currentTimeMillis()
  }

  override def asyncInvoke(input: String, resultFuture: ResultFuture[(String, String)]): Unit = {
    Future {
      if (System.currentTimeMillis() > lastRefreshed + refreshInterval) load()
      val enriched = (input, m(input))
      resultFuture.complete(Seq(enriched))
    }(ExecutionContext.global)
  }

  override def close(): Unit = { client.close() }
}

val in: DataStream[String] = env.addSource(src)
val enriched = AsyncDataStream.unorderedWait(in, new AsyncEnricherFunction(), 5000, TimeUnit.MILLISECONDS, 100)

Cache Cassandra table in scala application

I need to get some data from Cassandra for entries in a Kafka Streams streaming application. I need to perform the join on ID. I'd like to set up a cache to save the time spent on queries.
The table is simple:
id | name
---+------
 1 | Mike
My plan is straightforward: query the table from the database, then store the result in a Map[Int, String].
The main problem is that the data in the table may change and needs to be refreshed periodically, so I need to query it from time to time.
So far I've come up with a threaded solution like this:
// local database mirror
class Mirror(user: String, password: String) extends Runnable {
var database: Map[Int, String] = Map[Int, String]() withDefaultValue "undefined"
def run(): Unit = {
update()
}
//
def update(): Unit = {
println("update")
database.synchronized {
println("sync-update")
// val c = Driver.getConnection(...)
// database = c.execute(select id, name from table). ...
database += (1 -> "one")
Thread.sleep(100)
// c.close()
}
}
def get(k: Int): Option[String] = {
println("get")
database.synchronized {
println("sync-get")
if (! (database contains k)) {
update()
database.get(k)
} else {
database.get(k)
}
}
}
}
Main looks like this:
def main(args: Array[String]): Unit = {
  val db = new Mirror("u", "p")
  val ex = new ScheduledThreadPoolExecutor(1)
  val f = ex.scheduleAtFixedRate(db, 100, 100, TimeUnit.SECONDS)
  while (true) { // simulate stream
    val res = db.get(1)
    println(res)
    Thread.sleep(10000)
  }
}
It seems to work fine, but are there any pitfalls in my code? In particular, I'm not confident about the thread safety of the update and get functions.
If you are not opposed to using Akka I would look at Akka Streams; specifically Alpakka to do this. There's no need to reinvent the wheel if you don't have to.
That being said, the code has the following problems:
The existence check on the cache will not help if the entries in Cassandra are updated; it only helps if they are missing from your cache.
Look at using a reentrant read-write lock if you believe that most of the time your cache will have the current entries. This will help with contention if you have multiple threads calling your mirror (a sketch follows below).
Again, I would highly recommend you look at Akka Streams with Alpakka, because you can do what you want with that tool without having to write a bunch of code yourself.
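For illustration, a sketch of the read-write-lock variant of the mirror; loadFromCassandra() is a placeholder for your real query:

import java.util.concurrent.locks.ReentrantReadWriteLock

// Reads take the shared read lock; the periodic refresh takes the write lock.
class LockedMirror {
  private val lock = new ReentrantReadWriteLock()
  private var database: Map[Int, String] = Map.empty

  def get(k: Int): Option[String] = {
    lock.readLock().lock()
    try database.get(k)
    finally lock.readLock().unlock()
  }

  def update(): Unit = {
    val fresh = loadFromCassandra() // query outside the lock to keep the critical section short
    lock.writeLock().lock()
    try database = fresh
    finally lock.writeLock().unlock()
  }

  private def loadFromCassandra(): Map[Int, String] =
    Map(1 -> "Mike") // replace with the real Cassandra query
}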

Akka Stream return object from Sink

I've got a SourceQueue. When I offer an element to it, I want it to pass through the stream, and when it reaches the sink, I want the output returned to the code that offered the element (similar to how Sink.head returns an element to the RunnableGraph.run() call).
How do I achieve this? A simple example of my problem would be:
val source = Source.queue[String](100, OverflowStrategy.fail)
val flow = Flow[String].map(element => s"Modified $element")
val sink = Sink.ReturnTheStringSomehow
val graph = source.via(flow).to(sink).run()
val x = graph.offer("foo")
println(x) // Output should be "Modified foo"
val y = graph.offer("bar")
println(y) // Output should be "Modified bar"
val z = graph.offer("baz")
println(z) // Output should be "Modified baz"
Edit: For the example I have given in this question Vladimir Matveev provided the best answer. However, it should be noted that this solution only works if the elements are going into the sink in the same order they were offered to the source. If this cannot be guaranteed the order of the elements in the sink may differ and the outcome might be different from what is expected.
I believe it is simpler to use the already existing primitive for pulling values from a stream, called Sink.queue. Here is an example:
val source = Source.queue[String](128, OverflowStrategy.fail)
val flow = Flow[String].map(element => s"Modified $element")
val sink = Sink.queue[String]().withAttributes(Attributes.inputBuffer(1, 1))
val (sourceQueue, sinkQueue) = source.via(flow).toMat(sink)(Keep.both).run()
def getNext: String = Await.result(sinkQueue.pull(), 1.second).get
sourceQueue.offer("foo")
println(getNext)
sourceQueue.offer("bar")
println(getNext)
sourceQueue.offer("baz")
println(getNext)
It does exactly what you want.
Note that setting the inputBuffer attribute for the queue sink may or may not be important for your use case - if you don't set it, the buffer will be zero-sized and the data won't flow through the stream until you invoke the pull() method on the sink.
sinkQueue.pull() yields a Future[Option[T]], which will be completed successfully with Some if the sink receives an element or with a failure if the stream fails. If the stream completes normally, it will be completed with None. In this particular example I'm ignoring this by using Option.get but you would probably want to add custom logic to handle this case.
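For example, a variant of getNext that handles completion and failure explicitly instead of calling Option.get could look like this (just a sketch, reusing sinkQueue from above):

import scala.concurrent.Await
import scala.concurrent.duration._
import scala.util.{Failure, Success, Try}

// Distinguishes "got an element", "stream completed" and "stream failed".
def getNextSafe: Option[String] =
  Try(Await.result(sinkQueue.pull(), 1.second)) match {
    case Success(someElement @ Some(_)) => someElement // an element arrived
    case Success(None)                  => None        // stream completed normally
    case Failure(ex) =>
      println(s"stream failed: $ex")
      None
  }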
Well, you know what the offer() method returns if you take a look at its definition :) What you can do is create a Source.queue[(Promise[String], String)], create a helper function that pushes the pair to the stream via offer (making sure offer doesn't fail because the queue might be full), then complete the promise inside your stream and use the future of the promise to catch the completion event in external code.
I do that to throttle the rate to an external API used from multiple places in my project.
Here is how it looked in my project before Typesafe added Hub sources to Akka:
import scala.concurrent.{Future, Promise}
import scala.concurrent.ExecutionContext.Implicits.global
import java.util.concurrent.ConcurrentLinkedDeque
import akka.stream.scaladsl.{Keep, Sink, Source}
import akka.stream.{OverflowStrategy, QueueOfferResult}
import scala.util.Success

private val queue = Source.queue[(Promise[String], String)](100, OverflowStrategy.backpressure)
  .toMat(Sink.foreach({ case (p, param) =>
    p.complete(Success(param.reverse))
  }))(Keep.left)
  .run

private val futureDeque = new ConcurrentLinkedDeque[Future[String]]()

private def sendQueuedRequest(request: String): Future[String] = {
  val p = Promise[String]
  val offerFuture = queue.offer(p -> request)

  def addToQueue(future: Future[String]): Future[String] = {
    futureDeque.addLast(future)
    future.onComplete(_ => futureDeque.remove(future))
    future
  }

  offerFuture.flatMap {
    case QueueOfferResult.Enqueued =>
      addToQueue(p.future)
  }.recoverWith {
    case ex =>
      val first = futureDeque.pollFirst()
      if (first != null)
        addToQueue(first.flatMap(_ => sendQueuedRequest(request)))
      else
        sendQueuedRequest(request)
  }
}
I realize that the synchronized queue may be a bottleneck and may grow indefinitely, but because the API calls in my project are made only from other Akka streams, which are backpressured, I never have more than a dozen items in futureDeque. Your situation may differ.
If you create MergeHub.source[(Promise[String], String)]() instead, you'll get a reusable sink. Then every time you need to process an item, you'll create a complete graph and run it. In that case you won't need the hacky Java container to queue requests.
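A rough sketch of that MergeHub variant (it needs an implicit Materializer, or an ActorSystem in newer Akka versions, in scope; the buffer size and the reversing "work" are only illustrative):

import akka.NotUsed
import akka.stream.scaladsl.{Keep, MergeHub, Sink, Source}
import scala.concurrent.{Future, Promise}
import scala.util.Success

// Running the graph once materializes a Sink that can be reused from anywhere.
val reusableSink: Sink[(Promise[String], String), NotUsed] =
  MergeHub.source[(Promise[String], String)](perProducerBufferSize = 16)
    .toMat(Sink.foreach[(Promise[String], String)] { case (p, param) =>
      p.complete(Success(param.reverse))
    })(Keep.left)
    .run()

// Each call runs a tiny graph that attaches to the shared hub and returns the promise's future.
def sendRequest(request: String): Future[String] = {
  val p = Promise[String]()
  Source.single(p -> request).runWith(reusableSink)
  p.future
}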

Make CRUD operations with ReactiveMongo

I have started to learn Scala recently and am trying to create a simple API using Akka HTTP and ReactiveMongo.
I have problems with simple operations. I have spent a lot of time digging through docs, official tutorials, Stack Overflow, etc. Probably I am missing something very simple.
My code:
object MongoDB {
  val config = ConfigFactory.load()
  val database = config.getString("mongodb.database")
  val servers = config.getStringList("mongodb.servers").asScala
  val credentials = List(Authenticate(database, config.getString("mongodb.userName"), config.getString("mongodb.password")))
  val driver = new MongoDriver
  val connection = driver.connection(servers, authentications = credentials)
  //val db = connection.database(database)
}
Now I would like to perform basic CRUD operations. I am trying different code snippets but can't get them working.
Here are some examples:
object TweetManager {
  import MongoDB._

  // taken from the docs
  val collection = connection.database("test").
    map(_.collection("tweets"))

  val document1 = BSONDocument(
    "author" -> "Tester",
    "body" -> "test"
  )

  // taken from the ReactiveMongo tutorial; it had an extra parameter of type BSONCollection,
  // but I can't find a way of getting it
  def insertDoc1(doc: BSONDocument): Future[Unit] = {
    // another attempt at getting the collection:
    // def collection = for (db1 <- db) yield db1.collection[BSONCollection]("tweets")
    val writeRes: Future[WriteResult] = collection.insert(doc)
    writeRes.onComplete { // Dummy callbacks
      case Failure(e) => e.printStackTrace()
      case Success(writeResult) =>
        println(s"successfully inserted document with result: $writeResult")
    }
    writeRes.map(_ => {}) // in this example, do nothing with the success
  }
}

insertDoc1(document1)
I can't perform any operation on the collection. The IDE gives me "cannot resolve symbol", and the compiler gives this error:
value insert is not a member of scala.concurrent.Future[reactivemongo.api.collections.bson.BSONCollection]
What is the correct way of doing it?
You are trying to call the insert operation on a Future[Collection], rather than on the underlying collection (calling an operation on Future[T] rather than on T is not specific to ReactiveMongo).
It's recommended to have a look at the documentation.
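For example, a sketch of the insert chained onto the Future from the question (the exact insert API differs between ReactiveMongo versions; newer ones spell it insert.one):

import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future
import reactivemongo.bson.BSONDocument

// `collection` is the Future[BSONCollection] from the question; flatMap resolves
// the future first, then calls insert on the actual collection.
def insertDoc1(doc: BSONDocument): Future[Unit] =
  collection
    .flatMap(coll => coll.insert(doc))
    .map(writeResult => println(s"successfully inserted document with result: $writeResult"))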