I have a collection of people in Mongo, and I want to go over each person in the collection as a stream. For each person I call a method that performs an API call, changes the model, and inserts the result into a new collection in Mongo.
It looks like this:
def processPeople()(implicit m: Materializer): Future[Unit] = {
  val peopleSource: Source[Person, Future[State]] =
    collection.find(json()).cursor[Person]().documentSource()

  peopleSource.runWith(Sink.seq[Person]).map { people =>
    people.foreach { person =>
      changeModelAndInsertToNewCollection(person)
    }
  }
}
But this is not working... the model-changing part seems to work, but the insert into Mongo does not.
It also looks like the method does not start right away; there is some processing going on for about a minute before it starts... do you see the issue?
Solution 1 :
def changeModelAndInsertToNewCollection(person: Person): Future[Boolean] = {
  // TODO: call the Mongo API to update the person
  ???
}
def processPeople()(implicit m: Materializer): Future[Done] = {
  val numberOfConcurrentUpdate = 10

  val peopleSource: Source[Person, Future[State]] =
    collection
      .find(json())
      .cursor[Person]()
      .documentSource()

  peopleSource
    .mapAsync(numberOfConcurrentUpdate)(changeModelAndInsertToNewCollection)
    .withAttributes(ActorAttributes.supervisionStrategy(Supervision.restartingDecider))
    .runWith(Sink.ignore)
}
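A minimal sketch of what the update step could look like, assuming a hypothetical toNewModel transformation, a newCollection handle to the target collection, and an implicit writer for the new model in scope (none of these names are from the original code):

  def changeModelAndInsertToNewCollection(person: Person): Future[Boolean] = {
    val updated = toNewModel(person)          // external API call / model change
    newCollection.insert(updated).map(_.ok)   // insert returns a Future[WriteResult]
  }

mapAsync then back-pressures the Mongo cursor so that at most numberOfConcurrentUpdate inserts are in flight at any time, and the stream only completes once every insert Future has completed.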
Solution 2 :
Using Alpakka as the Akka Streams connector for Mongo:
val source: Source[Document, NotUsed] =
MongoSource(collection.find(json()).cursor[Person]().documentSource())
source.runWith(MongoSink.updateOne(2, collection))
Related
I need to get some data from Cassandra for entries in a Kafka-Streams streaming application. I'd need to perform the join on ID. I'd like to set up a cache to save time used for queries.
The table is simple:
id | name
---|------
 1 | Mike
My plan is straightforward: query the table from the database, then store the result in a Map[Int, String].
The main problem is - data may change in the table and needs to be updated periodically, so I need to query it from time to time.
So far I've come up with a threaded solution like this:
// local database mirror
class Mirror(user: String, password: String) extends Runnable {
  var database: Map[Int, String] = Map[Int, String]() withDefaultValue "undefined"

  def run(): Unit = {
    update()
  }

  def update(): Unit = {
    println("update")
    database.synchronized {
      println("sync-update")
      // val c = Driver.getConnection(...)
      // database = c.execute(select id, name from table). ...
      database += (1 -> "one")
      Thread.sleep(100)
      // c.close()
    }
  }

  def get(k: Int): Option[String] = {
    println("get")
    database.synchronized {
      println("sync-get")
      if (!(database contains k)) {
        update()
        database.get(k)
      } else {
        database.get(k)
      }
    }
  }
}
Main looks like this:
def main(args: Array[String]): Unit = {
  val db = new Mirror("u", "p")
  val ex = new ScheduledThreadPoolExecutor(1)
  val f = ex.scheduleAtFixedRate(db, 100, 100, TimeUnit.SECONDS)

  while (true) { // simulate stream
    val res = db.get(1)
    println(res)
    Thread.sleep(10000)
  }
}
It seems to function fine. But are there any pitfalls in my code? In particular, I'm not confident about the thread safety of the update and get functions.
If you are not opposed to using Akka I would look at Akka Streams; specifically Alpakka to do this. There's no need to reinvent the wheel if you don't have to.
That being said, the code has the following problems:
The existence check on the cache will not help if the entries in Cassandra are updated; it only helps if they are missing from your cache.
Look at using a reentrant read-write lock if you believe that most of the time your cache will have the current entries. This will help with contention if you have multiple threads calling your mirror; a minimal sketch follows.
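For illustration, a sketch of the read/write-lock idea (the class name and the loadFromCassandra helper are hypothetical, not part of your code):

  import java.util.concurrent.locks.ReentrantReadWriteLock

  class LockedMirror(loadFromCassandra: () => Map[Int, String]) {
    private val lock = new ReentrantReadWriteLock()
    private var cache: Map[Int, String] = Map.empty

    // many readers may hold the read lock at the same time
    def get(k: Int): Option[String] = {
      lock.readLock().lock()
      try cache.get(k)
      finally lock.readLock().unlock()
    }

    // the scheduled refresh takes the exclusive write lock
    def refresh(): Unit = {
      val fresh = loadFromCassandra() // query outside the lock
      lock.writeLock().lock()
      try cache = fresh
      finally lock.writeLock().unlock()
    }
  }

Readers only block while the cache reference is being swapped, which keeps contention low when reads dominate.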
Again, I would highly recommend you look at Akka Streams with Alpakka because you can do what you want with that tool without having to write a bunch of code yourself.
I have a method in which there are multiple calls to the db. As I have not implemented any concurrent processing, the 2nd db call has to wait until the 1st completes, the 3rd has to wait until the 2nd completes, and so on.
All db calls are independent of each other. I want to make this in such a way that all DB calls run concurrently.
I am new to the Akka framework.
Can someone please help me with a small sample or some references? The application is developed in Scala.
There are three primary ways that you could achieve concurrency for the given example needs.
Futures
For the particular use case asked about in the question, I would recommend Futures before any Akka construct.
Suppose we are given the database calls as functions:
type Data = ???
val dbcall1 : () => Data = ???
val dbcall2 : () => Data = ???
val dbcall3 : () => Data = ???
Concurrency can easily be applied, and the results collected, using Futures (this assumes an implicit ExecutionContext in scope, e.g. scala.concurrent.ExecutionContext.Implicits.global):
val f1 = Future { dbcall1() }
val f2 = Future { dbcall2() }
val f3 = Future { dbcall3() }
for {
  v1 <- f1
  v2 <- f2
  v3 <- f3
} {
  println(s"All data collected: ${v1}, ${v2}, ${v3}")
}
Akka Streams
There is a similar Stack Overflow answer which demonstrates how to use the akka-stream library to do concurrent db querying.
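The gist of that approach, as a hedged sketch using the same three hypothetical calls defined above:

  import akka.actor.ActorSystem
  import akka.stream.ActorMaterializer
  import akka.stream.scaladsl.{Sink, Source}
  import scala.concurrent.Future

  implicit val system = ActorSystem("db-calls")
  implicit val materializer = ActorMaterializer()
  import system.dispatcher

  // run up to 3 calls concurrently and collect all results
  val results: Future[Seq[Data]] =
    Source(List(dbcall1, dbcall2, dbcall3))
      .mapAsyncUnordered(parallelism = 3)(call => Future(call()))
      .runWith(Sink.seq)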
Akka Actors
It is also possible to write an Actor to do the querying:
object MakeQuery

class DBActor(dbCall: () => Data) extends Actor {
  override def receive = {
    case MakeQuery => sender ! dbCall()
  }
}
val dbcall1ActorRef = system.actorOf(Props(classOf[DBActor], dbcall1))
However, in this use case Actors are less helpful because you still need to collect all of the data together.
You can either use the same technique as the "Futures" section (note that the ask pattern ? requires import akka.pattern.ask and an implicit akka.util.Timeout in scope):
val f1 : Future[Data] = (dbcall1ActorRef ? MakeQuery).mapTo[Data]
for {
v1 <- f1
...
Or, you would have to wire the Actors together by hand through the constructor and handle all of the callback logic for waiting on the other Actor:
class WaitingDBActor(dbCall: () => Data, previousActor: ActorRef) extends Actor {
  override def receive = {
    case MakeQuery => previousActor forward MakeQuery
    case previousData: Data => sender ! (dbCall(), previousData)
  }
}
If you want to query a database, you should use something like Slick, which is a modern database query and access library for Scala.
A quick example of Slick:
case class User(id: Option[Int], first: String, last: String)

class Users(tag: Tag) extends Table[User](tag, "users") {
  def id = column[Int]("id", O.PrimaryKey, O.AutoInc)
  def first = column[String]("first")
  def last = column[String]("last")
  def * = (id.?, first, last) <> (User.tupled, User.unapply)
}
val users = TableQuery[Users]
Then you need to create the configuration for your db:
mydb = {
  dataSourceClass = "org.postgresql.ds.PGSimpleDataSource"
  properties = {
    databaseName = "mydb"
    user = "myuser"
    password = "secret"
  }
  numThreads = 10
}
and in your code you load the configuration:
val db = Database.forConfig("mydb")
Then run your query with the db.run method, which gives you a Future as the result. For example, you can get all rows by calling the result method:
val allRows: Future[Seq[User]] = db.run(users.result)
This query runs without blocking the current thread.
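For example, one way to consume that Future asynchronously (standard scala.concurrent / scala.util API):

  import scala.concurrent.ExecutionContext.Implicits.global
  import scala.util.{Failure, Success}

  allRows.onComplete {
    case Success(rows) => rows.foreach(println)
    case Failure(err)  => err.printStackTrace()
  }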
If you have a task which takes a long time to execute, or which calls another service, you should use Futures.
An example of that is a simple HTTP call to an external service; a minimal sketch is below.
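A sketch of that pattern, with callExternalService standing in for whatever blocking HTTP client you use (the name and URL are placeholders):

  import scala.concurrent.Future
  import scala.concurrent.ExecutionContext.Implicits.global

  def callExternalService(url: String): String = ??? // blocking HTTP call

  // the call runs on the execution context instead of blocking the caller
  val response: Future[String] = Future {
    callExternalService("http://example.com/api")
  }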
If you have a task which takes a long time to execute and you have to keep mutable state while doing so, the best option is Akka Actors, which encapsulate your state inside an actor and solve the problems of concurrency and thread safety as simply as possible. An example of such a task:
import akka.actor.Actor
import scala.concurrent.Future

case class RegisterEndpoint(endpoint: String)
case class NewUpdate(update: String)

class UpdateConsumer extends Actor {
  val endpoints = scala.collection.mutable.Set.empty[String]

  override def receive: Receive = {
    case RegisterEndpoint(endpoint) =>
      endpoints += endpoint

    case NewUpdate(update) =>
      endpoints.foreach { endpoint =>
        deliverUpdate(endpoint, update)
      }
  }

  def deliverUpdate(endpoint: String, update: String): Future[Unit] = {
    Future.successful(())
  }
}
If you want to process a huge amount of live data, a WebSocket connection, a CSV file that keeps growing over time, etc., the best option is Akka Streams. For example, reading data from a Kafka topic using the Alpakka Kafka connector; a sketch follows.
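A hedged sketch of the Kafka case, assuming the Alpakka Kafka connector (akka-stream-kafka) is on the classpath; the topic name and broker address are placeholders:

  import akka.actor.ActorSystem
  import akka.kafka.scaladsl.Consumer
  import akka.kafka.{ConsumerSettings, Subscriptions}
  import akka.stream.ActorMaterializer
  import akka.stream.scaladsl.Sink
  import org.apache.kafka.common.serialization.StringDeserializer

  implicit val system = ActorSystem("kafka-consumer")
  implicit val materializer = ActorMaterializer()

  val consumerSettings =
    ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
      .withBootstrapServers("localhost:9092")
      .withGroupId("example-group")

  // stream every record of the topic and print its value
  Consumer
    .plainSource(consumerSettings, Subscriptions.topics("example-topic"))
    .map(_.value)
    .runWith(Sink.foreach(println))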
I have started to learn Scala recently and am trying to create a simple API using Akka HTTP and ReactiveMongo.
I am having problems with simple operations. I have spent a lot of time digging through docs, official tutorials, Stack Overflow, etc. Probably I am missing something very simple.
My code:
object MongoDB {
  val config = ConfigFactory.load()
  val database = config.getString("mongodb.database")
  val servers = config.getStringList("mongodb.servers").asScala
  val credentials = List(Authenticate(database, config.getString("mongodb.userName"), config.getString("mongodb.password")))
  val driver = new MongoDriver
  val connection = driver.connection(servers, authentications = credentials)
  //val db = connection.database(database)
}
Now I would like to make basic CRUD operations. I am trying different code snippets but can't get it working.
Here are some examples:
object TweetManager {
  import MongoDB._

  //taken from docs
  val collection = connection.database("test").
    map(_.collection("tweets"))

  val document1 = BSONDocument(
    "author" -> "Tester",
    "body" -> "test"
  )

  //taken from the ReactiveMongo tutorial; it had an extra parameter of type BSONCollection, but I can't find a way of getting it
  def insertDoc1(doc: BSONDocument): Future[Unit] = {
    //another try at getting the collection
    //def collection = for ( db1 <- db) yield db1.collection[BSONCollection]("tweets")
    val writeRes: Future[WriteResult] = collection.insert(doc)
    writeRes.onComplete { // Dummy callbacks
      case Failure(e) => e.printStackTrace()
      case Success(writeResult) =>
        println(s"successfully inserted document with result: $writeResult")
    }
    writeRes.map(_ => {}) // in this example, do nothing with the success
  }
}
insertDoc1(document1)
I can't do any operation on the collection. IDE gives me: "cannot resolve symbol". Compiler gives error:
value insert is not a member of scala.concurrent.Future[reactivemongo.api.collections.bson.BSONCollection]
What is the correct way of doing it?
You are trying to call the insert operation on a Future[Collection], rather than on the underlying collection (calling operation on Future[T] rather than on T is not specific to ReactiveMongo).
It's recommended to have a look at the documentation.
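A minimal sketch of resolving the collection before inserting, assuming the same ReactiveMongo version as in the question (where insert(doc) is available directly on the collection):

  import scala.concurrent.Future
  import scala.concurrent.ExecutionContext.Implicits.global
  import reactivemongo.api.collections.bson.BSONCollection
  import reactivemongo.bson.BSONDocument

  val collection: Future[BSONCollection] =
    connection.database("test").map(_.collection("tweets"))

  def insertDoc1(doc: BSONDocument): Future[Unit] =
    collection
      .flatMap(_.insert(doc)) // resolve the Future[BSONCollection], then insert
      .map(_ => ())           // discard the WriteResult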
I'm trying to dump my Mongo database into a JSON object, but because my queries to the database are asynchronous I'm having problems.
Each collection in my database contains user data and each collection name is a user name.
So, when I want to get all my users data I recover all the collection names and then loop over them to recover each collection one by one.
def databaseDump(prom: Promise[JsObject]) = {
  val res = for {
    dbUsers <- getUsers
  } yield dbUsers

  var rebuiltJson = Json.obj()
  var array = JsArray()

  res.map { users =>
    users.map { userNames =>
      if (userNames.size == 0) {
        prom failure new Throwable("Empty database")
      }

      var counter = 0
      userNames.foreach { username =>
        getUserTables(username).map { tables =>
          /* Add data to array */
          ...
          counter += 1
          if (counter == userNames.size) {
            /* Add data to new json */
            ...
            prom success rebuiltJson
          }
        }
      }
    }
  }
}
This kinda works, but sometimes the promise is successfully completed even though all the data has not yet been recovered. This is due to the fact that my counter variable is not a reliable solution.
Is there a way to loop over all the users, query the database, and wait for all the data to be recovered before completing the promise? I tried to use a for comprehension but didn't find a way to do it. Is there a way to dump a whole Mongo DB into one JSON: { username : data, username : data, ... }?
The users/tables terminology was getting me confused, so I wrote a new function that dumps a database into a single JsObject.
// helper function to find all documents inside a collection c
// and return them as a single JsArray
def getDocs(c: JSONCollection)(implicit ec: ExecutionContext) =
  c.find(Json.obj()).cursor[JsObject]().jsArray()

def dumpToJsObject(db: DefaultDB)(implicit ec: ExecutionContext): Future[JsObject] = {
  // get a list of all collections in the db
  val collectionNames = db.collectionNames
  val collections = collectionNames.map(_.map(db.collection[JSONCollection](_)))

  // each entry is a tuple collectionName -> content (as JsArray)
  val namesToDocs = collections.flatMap {
    colls => Future.sequence(colls.map(c => getDocs(c).map(c.name -> _)))
  }

  // convert to a single JsObject
  namesToDocs.map(JsObject(_))
}
I haven't tested it yet (I will do so later), but this function should at least give you the general idea. You get the list of all collections inside the database. For each collection, you perform a query to get all documents inside that collection. The list of documents is converted into a JsArray, and finally all collections are composed to a single JsObject with the collection names as keys.
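A possible usage sketch, assuming connection is your usual MongoConnection and "my_db" is a placeholder database name:

  import scala.concurrent.ExecutionContext.Implicits.global
  import play.api.libs.json.Json

  connection.database("my_db")
    .flatMap(db => dumpToJsObject(db))
    .foreach(dump => println(Json.prettyPrint(dump)))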
If the goal is to write the data to an output stream (a local file or the network), with side effects:
import scala.concurrent.{ ExecutionContext, Future }

import reactivemongo.bson.BSONDocument
import reactivemongo.api.{ Cursor, MongoDriver, MongoConnection }

val mongoUri = "mongodb://localhost:27017/my_db"
val driver = new MongoDriver
val maxDocs = Int.MaxValue // max per collection

// Requires an ExecutionContext in scope
// (e.g. `import scala.concurrent.ExecutionContext.Implicits.global`)
def dump()(implicit ec: ExecutionContext): Future[Unit] = for {
  uri <- Future.fromTry(MongoConnection.parseURI(mongoUri))
  con = driver.connection(uri)
  dn <- Future(uri.db.get)
  db <- con.database(dn)
  cn <- db.collectionNames
  _ <- Future.sequence(cn.map { collName =>
    println(s"Collection: $collName")

    db.collection(collName).find(BSONDocument.empty). // findAll
      cursor[BSONDocument]().foldWhile({}, maxDocs) { (_, doc) =>
        // Replace println by the appropriate side effect
        Cursor.Cont(println(s"- ${BSONDocument pretty doc}"))
      }
  })
} yield ()
If using the JSON serialization pack, just replace BSONDocument with JsObject (e.g. BSONDocument.empty ~> Json.obj()).
If testing from the Scala REPL, after pasting the previous code, it can be executed as follows.
dump().onComplete {
  case result =>
    println(s"Dump result: $result")
    //driver.close()
}
Trying out the newly minted Akka Streams. It seems to be working except for one small thing - there's no output.
I have the following table definition:
case class my_stream(id: Int, value: String)

class Streams(tag: Tag) extends Table[my_stream](tag, "my_stream") {
  def id = column[Int]("id")
  def value = column[String]("value")
  def * = (id, value) <> (my_stream.tupled, my_stream.unapply)
}
And I'm trying to output the contents of the table to stdout like this:
def main(args: Array[String]): Unit = {
  implicit val system = ActorSystem("Subscriber")
  implicit val materializer = ActorMaterializer()

  val strm = TableQuery[Streams]
  val db = Database.forConfig("pg-postgres")

  try {
    var src = Source.fromPublisher(db.stream(strm.result))
    src.runForeach(r => println(s"${r.id},${r.value}"))(materializer)
  } finally {
    system.shutdown
    db.close
  }
}
I have verified that the query is being run by configuring debug logging. However, all I get is this:
08:59:24.099 [main] INFO com.zaxxer.hikari.HikariDataSource - pg-postgres - is starting.
08:59:24.428 [main] INFO com.zaxxer.hikari.pool.HikariPool - pg-postgres - is closing down.
The cause is that Akka Streams is asynchronous: runForeach returns a Future which is completed once the stream completes. But that Future is not being handled, so system.shutdown and db.close execute immediately instead of after the stream completes.
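A minimal sketch of one way to fix it, keeping the main method from the question but only shutting down after the returned Future completes (Await is acceptable in a small demo):

  import scala.concurrent.Await
  import scala.concurrent.duration.Duration

  try {
    val src = Source.fromPublisher(db.stream(strm.result))
    val done = src.runForeach(r => println(s"${r.id},${r.value}"))(materializer)
    Await.result(done, Duration.Inf) // block the main thread until the stream finishes
  } finally {
    system.shutdown
    db.close
  }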
Just in case it helps anyone hitting this very same issue but with MySQL: take into account that you have to enable the driver's streaming support "manually":
def enableStream(statement: java.sql.Statement): Unit = {
  statement match {
    case s: com.mysql.jdbc.StatementImpl => s.enableStreamingResults()
    case _ =>
  }
}

val publisher = sourceDb.stream(query.result.withStatementParameters(statementInit = enableStream))
Source: http://www.slideshare.net/kazukinegoro5/akka-streams-100-scalamatsuri
Ended up using @ViktorKlang's answer and just wrapped the run with an Await.result. I also found an alternative answer in the docs which demonstrates using the Reactive Streams publisher and subscriber interfaces:
The stream method returns a DatabasePublisher[T] and Source.fromPublisher returns a Source[T, NotUsed]. This means you have to attach a subscriber instead of using runForeach; according to the release notes, NotUsed is a replacement for Unit, which means nothing gets passed to the Sink.
Since Slick implements the Reactive Streams interfaces and not the Akka Streams interfaces, you need to use the fromPublisher and fromSubscriber integration points. That means you need to implement the org.reactivestreams.Subscriber[T] interface.
Here's a quick and dirty Subscriber[T] implementation which simply calls println:
class MyStreamWriter extends org.reactivestreams.Subscriber[my_stream] {
  private var sub: Option[Subscription] = None

  override def onNext(t: my_stream): Unit = {
    println(t.value)
    if (sub.nonEmpty) sub.head.request(1)
  }

  override def onError(throwable: Throwable): Unit = {
    println(throwable.getMessage)
  }

  override def onSubscribe(subscription: Subscription): Unit = {
    sub = Some(subscription)
    sub.head.request(1)
  }

  override def onComplete(): Unit = {
    println("ALL DONE!")
  }
}
You need to make sure you call the Subscription.request(Long) method in onSubscribe and then again in onNext to ask for more data; otherwise nothing will be sent, or you won't get the full set of results.
And here's how you use it:
def main(args: Array[String]): Unit = {
  implicit val system = ActorSystem("Subscriber")
  implicit val materializer = ActorMaterializer()

  val strm = TableQuery[Streams]
  val db = Database.forConfig("pg-postgres")

  try {
    val src = Source.fromPublisher(db.stream(strm.result))
    val flow = src.to(Sink.fromSubscriber(new MyStreamWriter()))
    flow.run()
  } finally {
    system.shutdown
    db.close
  }
}
I'm still trying to figure this out so I welcome any feedback. Thanks!