How to save the outcome of collection.find into Array - mongodb

object ConnHelper extends Serializable {
  lazy val jedis = new Jedis("localhost")
  lazy val mongoClient = MongoClient("mongodb://localhost:27017/recommender")
}

val ratingCollection = ConnHelper.mongoClient
  .getDatabase(mongoConfig.db)
  .getCollection(MONGODB_RATING_COLLECTION)

val Existratings: Observable[Option[BsonValue]] = ratingCollection
  .find(equal("userId", 1234))
  .map { item => item.get("productId") }
The documents look like this:
{
"id":****,
"userId":4567,
"productId":12345,
"score":5.0
}
I use Scala and Mongo-Scala-Driver 2.9.0 to connect to MongoDB and find the documents whose "userId" field equals 1234. I then want to save the value of "productId" from those documents into an Array, but the returned value has an Observable type.
Could anyone tell me how to save the query result into an Array? I would appreciate it very much.

Please try a method which uses a Promise/Future structure to find the sequence of documents that match the search criteria. For example:
import org.mongodb.scala._
import org.mongodb.scala.bson._
import org.mongodb.scala.model.Filters.equal
import scala.concurrent.{ Await, Promise }
import scala.concurrent.duration.Duration

def find(search_key: String, search_value: String, collection_name: String): Seq[Document] = {
  // The application will need to wait for the find operation thread to complete
  // in order to process the returned value.
  log.debug(s"Starting database find operation thread")
  // Set up a new client connection, database, and collection
  val _client: MongoClient = MongoClient(config_client)
  val _database: MongoDatabase = _client.getDatabase(config_database)
  val collection: MongoCollection[Document] = _database.getCollection(collection_name)
  // Set up the result sequence
  var result_seq: Seq[Document] = Seq.empty
  // Set up a Promise to wait for the database operation to complete
  val promise = Promise[Boolean]()
  // Start the find operation thread; once it has finished, read the resulting documents.
  collection.find(equal(search_key, search_value)).collect().subscribe((results: Seq[Document]) => {
    log.trace(s"Find operation thread completed")
    // Append the found documents to the result
    result_seq = result_seq ++ results
    log.trace(s"Result sequence: $result_seq")
    promise.success(true) // complete the Promise
    _client.close()       // close the client connection to avoid leaking resources
  })
  val future = promise.future        // Promise completion result
  Await.result(future, Duration.Inf) // wait for the promise to complete
  // Return the document sequence
  result_seq
}
Then you can iterate through the document sequence and pull the products into a List (better than Array).
import scala.collection.mutable

def read: List[String] = {
  val document_seq = Database.find("userId", "1234", collection)
  // Set up an empty return map
  val return_map: mutable.Map[String, String] = mutable.Map.empty
  // Translate each document into an id -> productId entry
  document_seq.foreach(_document => {
    return_map.put(
      _document("id").asString.getValue,
      _document("productId").asString.getValue
    )
  })
  // Convert the map values to a List and return
  return_map.values.toList
}

The Mongo Scala driver uses the Observable model, which is composed of three parts: the Observable, the Observer, and the Subscription.
You need to subscribe an observer to the observable. Take a look at the examples.
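For illustration, a manual subscription might look roughly like the following sketch (not from the original answer; the ArrayBuffer and the request(Long.MaxValue) demand strategy are assumptions):

import org.mongodb.scala.{ Observer, Subscription }
import org.mongodb.scala.bson.BsonValue
import org.mongodb.scala.model.Filters.equal
import scala.collection.mutable.ArrayBuffer

val buffer = ArrayBuffer.empty[BsonValue]

ratingCollection
  .find(equal("userId", 1234))
  .map(_.get("productId"))
  .subscribe(new Observer[Option[BsonValue]] {
    override def onSubscribe(subscription: Subscription): Unit = subscription.request(Long.MaxValue)
    override def onNext(result: Option[BsonValue]): Unit = result.foreach(buffer += _)
    override def onError(e: Throwable): Unit = e.printStackTrace()
    override def onComplete(): Unit = println(s"Collected ${buffer.size} productIds")
  })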
The fastest solution is to complete with a toFuture call:
val Existratings =
  ratingCollection
    .find(equal("userId", 1234))
    .map { item => item.get("productId") }
    .toFuture()
That will return a Future holding a Seq of BsonValues with the result set.
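If an Array is really needed, one option (a sketch only, assuming you are willing to block on the Future; the 30-second timeout is an arbitrary choice) is to await the Future and convert the resulting Seq:

import scala.concurrent.Await
import scala.concurrent.duration._

val productIdArray = Await.result(Existratings, 30.seconds).toArray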

Or maybe this:
val productIds = ratingCollection
  .find(equal("userId", 1234))
  .map { _.get("productId") }
  .toArray

The most direct solution to get an Array is to fold directly into one:
import scala.util.{ Failure, Success }
import scala.concurrent.ExecutionContext.Implicits.global

ratingCollection
  .find(???)
  .map { ??? }
  .foldLeft(Array.empty[Item]) { _ :+ _ }
  .head() // in order to get a Future[Array[Item]]
  .onComplete {
    case Success(values: Array[Item]) => // capture the array
    case Failure(exception)           => // failure logic
  }
It's probably best to work with the Future rather than build your own Observer logic for the subscription.

Related

Why is a Thread.sleep or closing the connection required after waiting for a remove call to complete?

I'm again seeking you to share your wisdom with me, the scala padawan!
I'm playing with reactive mongo in Scala, and while I was writing a test using scalatest, I faced the following issue.
First the code:
"delete" when {
"passing an existent id" should {
"succeed" in {
val testRecord = TestRecord(someString)
Await.result(persistenceService.persist(testRecord), Duration.Inf)
Await.result(persistenceService.delete(testRecord.id), Duration.Inf)
Thread.sleep(1000) // Why do I need that to make the test succeed?
val thrownException = intercept[RecordNotFoundException] {
Await.result(persistenceService.read(testRecord.id), Duration.Inf)
}
thrownException.getMessage should include(testRecord._id.toString)
}
}
}
And the read and delete methods with the code initializing connection to db (part of the constructor):
class MongoPersistenceService[R](url: String, port: String, databaseName: String, collectionName: String) {

  val driver = MongoDriver()
  val parsedUri: Try[MongoConnection.ParsedURI] = MongoConnection.parseURI("%s:%s".format(url, port))
  val connection: Try[MongoConnection] = parsedUri.map(driver.connection)
  val mongoConnection = Future.fromTry(connection)

  def db: Future[DefaultDB] = mongoConnection.flatMap(_.database(databaseName))
  def collection: Future[BSONCollection] = db.map(_.collection(collectionName))

  def read(id: BSONObjectID): Future[R] = {
    val query = BSONDocument("_id" -> id)
    val readResult: Future[R] = for {
      coll <- collection
      record <- coll.find(query).requireOne[R]
    } yield record
    readResult.recover {
      case NoSuchResultException => throw RecordNotFoundException(id)
    }
  }

  def delete(id: BSONObjectID): Future[Unit] = {
    val query = BSONDocument("_id" -> id)
    // first read then call remove. Read will throw if not present
    read(id).flatMap { (_) => collection.map(coll => coll.remove(query)) }
  }
}
So to make my test pass, I had to add a Thread.sleep right after waiting for the delete to complete. Knowing this is evil, usually punished by many whiplashes, I want to learn and find the proper fix here.
While trying other things, I found that, instead of waiting, entirely closing the connection to the db also did the trick...
What am I misunderstanding here? Should a connection to the db be opened and closed for each call to it, rather than performing many actions like adding, removing, and updating records over one connection?
Note that everything works fine when I remove the read call in my delete function. Also, by closing the connection, I mean calling close on the MongoDriver from my test and also stopping and starting again the embedded Mongo which I'm using in the background.
Thanks for helping guys.
Warning: this is a blind guess, I've no experience with MongoDB on Scala.
You may have forgotten to flatMap
Take a look at this bit:
collection.map(coll => coll.remove(query))
Since collection is a Future[BSONCollection] per your code and remove returns a Future[WriteResult] per the docs, the actual type of this expression is Future[Future[WriteResult]].
Now, you have annotated your function as returning Future[Unit]. Scala will often produce Unit as a return value by throwing away possibly meaningful values, which is what it does in your case:
read(id).flatMap { (_) =>
  collection.map(coll => {
    coll.remove(query) // we didn't wait for removal
    ()                 // before returning unit
  })
}
So your code should probably be
read(id).flatMap(_ => collection.flatMap(_.remove(query).map(_ => ())))
Or a for-comprehension:
for {
  _    <- read(id)
  coll <- collection
  _    <- coll.remove(query)
} yield ()
You can make Scala warn you about discarded values by adding a compiler flag (assuming SBT):
scalacOptions += "-Ywarn-value-discard"
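As a minimal illustration (hypothetical code, not from the question): with that flag enabled, the compiler warns about the silently dropped Future below.

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

def fireAndForget(): Unit = {
  Future { 42 } // warning: discarded non-Unit value
}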

ReactiveMongo conditional update

I am confused about how one would conditionally update a document based on a previous query, using only Futures.
Let's say I want to push some value into an array in a document, but only if that array has a size less than a given integer.
I am using this function to get the document; after getting the document I push values - what I am unable to do is do so conditionally.
def joinGroup(actionRequest: GroupRequests.GroupActionRequest): Future[GroupResponse.GroupActionCompleted] = {
  // groupIsNotFull() is a boolean future
  groupIsNotFull(actionRequest.groupId).map(
    shouldUpdate => {
      if (shouldUpdate) {
        Logger.info(actionRequest.initiator + " Joining Group: " + actionRequest.groupId)
        val selector = BSONDocument("group.groupid" -> BSONDocument("$eq" -> actionRequest.groupId))
        val modifier = BSONDocument("$push" -> BSONDocument("group.users" -> "test-user"))
        val updateResult = activeGroups.flatMap(_.update(selector, modifier))
          .map(res => {
            GroupActionCompleted(
              actionRequest.groupId,
              actionRequest.initiator,
              GroupConstants.Actions.JOIN,
              res.ok,
              GroupConstants.Messages.JOIN_SUCCESS
            )
          })
          .recover {
            case e: Throwable => GroupActionCompleted(
              actionRequest.groupId,
              actionRequest.initiator, GroupConstants.Actions.JOIN,
              success = false,
              GroupConstants.Messages.JOIN_FAIL
            )
          }
        updateResult
      } else {
        val updateResult = Future.successful(
          GroupActionCompleted(
            actionRequest.groupId,
            actionRequest.initiator,
            GroupConstants.Actions.JOIN,
            success = false,
            GroupConstants.Messages.JOIN_FAIL
          ))
        updateResult
      }
    }
  )
}

// returns a Future[Boolean] based on if there is room for another user
private def groupIsNotFull(groupid: String): Future[Boolean] = {
  findGroupByGroupId(groupid)
    .map(group => {
      if (group.isDefined) {
        val fGroup = group.get
        fGroup.group.users.size < fGroup.group.groupInformation.maxUsers
      } else {
        false
      }
    })
}
I am confused about why I cannot do this. The compilation error is:
Error:type mismatch;
found : scala.concurrent.Future[response.group.GroupResponse.GroupActionCompleted]
required: response.group.GroupResponse.GroupActionCompleted
for the 'updateResult' in both the if and the else branch.
As a side question: is this the proper way of updating documents conditionally - that is, querying for the document, doing some logic, then executing another query?
Ok got it - you need to flatMap the first Future[Boolean] like this:
groupIsNotFull(actionRequest.groupId).flatMap( ...
Using flatMap, the result will be a Future[T]; with map you would get a Future[Future[T]]. The compiler knows you want to return a Future[T], so it's expecting the map to return a T, while you are trying to return a Future[T] - hence the error. Using flatMap will fix this.
Some further clarity on map vs flatmap here: In Scala Akka futures, what is the difference between map and flatMap?
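A minimal sketch of the difference in this context (the doUpdate helper below is hypothetical, standing in for the update logic inside joinGroup):

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

def doUpdate(shouldUpdate: Boolean): Future[GroupActionCompleted] = ???

// map nests the Futures...
val nested: Future[Future[GroupActionCompleted]] =
  groupIsNotFull(actionRequest.groupId).map(shouldUpdate => doUpdate(shouldUpdate))

// ...flatMap flattens them into the type the method declares
val flat: Future[GroupActionCompleted] =
  groupIsNotFull(actionRequest.groupId).flatMap(shouldUpdate => doUpdate(shouldUpdate))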
I believe the problem is that the joinGroup2 function's return type is Future[Response], yet you are returning just a Response in the else block. If you look at the signature of the mapTo[T] function, it returns a Future[T].
I think you need to wrap the Response object in a Future. Something like this:
else {
Future { Response(false, ERROR_REASON) }
}
Btw you have a typo: Respose -> Response

Scala - Batched Stream from Futures

I have instances of a case class Thing, and I have a bunch of queries to run that return a collection of Things like so:
def queries: Seq[Future[Seq[Thing]]]
I need to collect all Things from all futures (like above) and group them into equally sized collections of 10,000 so they can be serialized to files of 10,000 Things.
def serializeThings(things: Seq[Thing]): Future[Unit]
I want it to be implemented in such a way that I don't wait for all queries to run before serializing. As soon as there are 10,000 Things returned after the futures of the first queries complete, I want to start serializing.
If I do something like:
Future.sequence(queries)
It will collect the results of all the queries, but my understanding is that operations like map won't be invoked until all queries complete and all the Things must fit into memory at once.
What's the best way to implement a batched stream pipeline using Scala collections and concurrent libraries?
I think I managed to make something work. The solution is based on my previous answer. It collects results from the Future[List[Thing]] calls until it reaches a threshold of BatchSize. Then it calls the serializeThings future; when that finishes, the loop continues with the rest.
import scala.concurrent.{ Await, Future }
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object BatchFutures extends App {

  case class Thing(id: Int)

  def getFuture(id: Int): Future[List[Thing]] = {
    Future.successful {
      List.fill(3)(Thing(id))
    }
  }

  def serializeThings(things: Seq[Thing]): Future[Unit] = Future.successful {
    //Thread.sleep(2000)
    println("processing: " + things)
  }

  val ids = (1 to 4).toList
  val BatchSize = 5

  val future = ids.foldLeft(Future.successful[List[Thing]](Nil)) {
    case (acc, id) =>
      acc flatMap { processed =>
        getFuture(id) flatMap { res =>
          val all = processed ++ res
          val (batch, rest) = all.splitAt(BatchSize)
          if (batch.length == BatchSize) { // the futures filled the batch with the needed amount
            serializeThings(batch) map { _ =>
              rest // carry the rest over
            }
          } else {
            Future.successful(all) // we need more Things for a batch
          }
        }
      }
  }.flatMap { rest =>
    serializeThings(rest)
  }

  Await.result(future, Duration.Inf)
}
The result prints:
processing: List(Thing(1), Thing(1), Thing(1), Thing(2), Thing(2))
processing: List(Thing(2), Thing(3), Thing(3), Thing(3), Thing(4))
processing: List(Thing(4), Thing(4))
When the number of Things isn't divisible by BatchSize we have to call serializeThings once more (the last flatMap). I hope it helps! :)
Before you call Future.sequence, do what you want with each individual future, and only then use Future.sequence.
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// this can be used for serializing
def doSomething(): Unit = ???
// do something with the failed future
def doSomethingElse(): Unit = ???
def doSomething(list: List[_]): Unit = ???

val list: List[Future[_]] = List.fill(10000)(Future(doSomething()))

val newList =
  list.map { f =>   // plain map is enough here; the futures already run asynchronously
    f.map { result =>
      doSomething()
    }.recover { case throwable =>
      doSomethingElse()
    }
  }

Future.sequence(newList).map(list => doSomething(list)) // wait till all are complete
Instead of generating newList you could use Future.traverse:
Future.traverse(list) { f =>
  f.map(x => doSomething()).recover { case th => doSomethingElse() }
}.map(completeListOfValues => doSomething(completeListOfValues))

ReactiveMongo database dump with Play Framework 2.5

I'm trying to dump my mongo database into a json object, but because my queries to the database are asynchronous I'm having problems.
Each collection in my database contains user data and each collection name is a user name.
So, when I want to get all my users' data, I retrieve all the collection names and then loop over them to retrieve each collection one by one.
def databaseDump(prom: Promise[JsObject]) = {
  for {
    dbUsers <- getUsers
  } yield dbUsers

  var rebuiltJson = Json.obj()
  var array = JsArray()

  res.map { users =>
    users.map { userNames =>
      if (userNames.size == 0) {
        prom failure new Throwable("Empty database")
      }
      var counter = 0
      userNames.foreach { username =>
        getUserTables(username).map { tables =>
          /* Add data to array */
          ...
          counter += 1
          if (counter == userNames.size) {
            /* Add data to new json */
            ...
            prom success rebuiltJson
          }
        }
      }
    }
  }
}
This kinda works, but sometimes the promise is successfully triggered even though all the data has not yet been recovered. This is due to the fact that my counter variable is not a reliable solution.
Is there a way to loop over all the users, query the database, and wait for all the data to be recovered before successfully triggering the promise? I tried to use a for comprehension but didn't find a way to do it. Is there a way to dump a whole mongo DB into one Json: { username : data, username : data, ... }?
The users/tables terminology was getting me confused, so I wrote a new function that dumps a database into a single JsObject.
// helper function to find all documents inside a collection c
// and return them as a single JsArray
def getDocs(c: JSONCollection)(implicit ec: ExecutionContext) =
  c.find(Json.obj()).cursor[JsObject]().jsArray()

def dumpToJsObject(db: DefaultDB)(implicit ec: ExecutionContext): Future[JsObject] = {
  // get a list of all collections in the db
  val collectionNames = db.collectionNames
  val collections = collectionNames.map(_.map(db.collection[JSONCollection](_)))
  // each entry is a tuple collectionName -> content (as JsArray)
  val namesToDocs = collections.flatMap { colls =>
    Future.sequence(colls.map(c => getDocs(c).map(c.name -> _)))
  }
  // convert to a single JsObject
  namesToDocs.map(JsObject(_))
}
I haven't tested it yet (I will do so later), but this function should at least give you the general idea. You get the list of all collections inside the database. For each collection, you perform a query to get all documents inside that collection. The list of documents is converted into a JsArray, and finally all collections are composed to a single JsObject with the collection names as keys.
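A hypothetical usage sketch (the file path, the way the DefaultDB instance is obtained, and the use of Play's Json.prettyPrint are assumptions):

import java.nio.file.{ Files, Paths }
import play.api.libs.json.Json
import scala.concurrent.ExecutionContext.Implicits.global

// `db` is your resolved DefaultDB instance
dumpToJsObject(db).map { dump =>
  Files.write(Paths.get("dump.json"), Json.prettyPrint(dump).getBytes("UTF-8"))
}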
If the goal is to write the data to an output stream (local/file or network), with side effects:
import scala.concurrent.{ ExecutionContext, Future }
import reactivemongo.bson.BSONDocument
import reactivemongo.api.{ Cursor, MongoDriver, MongoConnection }

val mongoUri = "mongodb://localhost:27017/my_db"
val driver = new MongoDriver
val maxDocs = Int.MaxValue // max per collection

// Requires an ExecutionContext in scope
// (e.g. `import scala.concurrent.ExecutionContext.Implicits.global`)
def dump()(implicit ec: ExecutionContext): Future[Unit] = for {
  uri <- Future.fromTry(MongoConnection.parseURI(mongoUri))
  con = driver.connection(uri)
  dn <- Future(uri.db.get)
  db <- con.database(dn)
  cn <- db.collectionNames
  _ <- Future.sequence(cn.map { collName =>
    println(s"Collection: $collName")
    db.collection(collName).find(BSONDocument.empty). // findAll
      cursor[BSONDocument]().foldWhile({}, maxDocs) { (_, doc) =>
        // Replace println with the appropriate side effect
        Cursor.Cont(println(s"- ${BSONDocument pretty doc}"))
      }
  })
} yield ()
If using with the JSON serialization pack, just replace BSONDocument with JsObject (e.g. BSONDocument.empty ~> Json.obj()).
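For example, the inner query could become something like this (a sketch, assuming the Play JSON serialization pack and a JSONCollection as in the earlier answer; the import paths may vary between ReactiveMongo versions):

import play.api.libs.json.{ JsObject, Json }
import reactivemongo.play.json._
import reactivemongo.play.json.collection.JSONCollection

db.collection[JSONCollection](collName).find(Json.obj()).
  cursor[JsObject]().foldWhile({}, maxDocs) { (_, doc) =>
    Cursor.Cont(println(s"- ${Json.prettyPrint(doc)}"))
  }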
If testing from the Scala REPL, after pasting the previous code, it can be executed as follows.
dump().onComplete {
  case result =>
    println(s"Dump result: $result")
    //driver.close()
}

Spark Scala Get Data Back from rdd.foreachPartition

I have some code like this:
println("\nBEGIN Last Revs Class: " + distinctFileGidsRDD.getClass)
val lastRevs = distinctFileGidsRDD.
  foreachPartition(iter => {
    SetupJDBC(jdbcDriver, jdbcUrl, jdbcUser, jdbcPassword)
    while (iter.hasNext) {
      val item = iter.next()
      //println(item(0))
      println("String: " + item(0).toString())
      val jsonStr = DB.readOnly { implicit session =>
        sql"SELECT jsonStr FROM lasttail WHERE fileGId = ${item(0)}::varchar".
          map { resultSet => resultSet.string(1) }.single.apply()
      }
      println("\nJSON: " + jsonStr)
    }
  })
println("\nEND Last Revs Class: " + lastRevs.getClass)
The code outputs (with heavy edits) something like:
BEGIN Last Revs Class: class org.apache.spark.rdd.MapPartitionsRDD
String: 1fqhSXPE3GwrJ6SZzC65gJnBaB5_b7j3pWNSfqzU5FoM
JSON: Some({"Struct":{"fileGid":"1fqhSXPE3GwrJ6SZzC65gJnBaB5_b7j3pWNSfqzU5FoM",... )
String: 1eY2wxoVq17KGMUBzCZZ34J9gSNzF038grf5RP38DUxw
JSON: Some({"Struct":{"fileGid":"1fqhSXPE3GwrJ6SZzC65gJnBaB5_b7j3pWNSfqzU5FoM",... )
...
JSON: None()
END Last Revs Class: void
QUESTION 1:
How can I get the lastRevs value to be in a useful format like the JSON string/null or an option like Some / None?
QUESTION 2:
My preference: is there another way to get at the partition data in an RDD-like format (rather than the iterator format)?
dstream.foreachRDD { (rdd, time) =>
  rdd.foreachPartition { partitionIterator =>
    val partitionId = TaskContext.get.partitionId()
    val uniqueId = generateUniqueId(time.milliseconds, partitionId)
    // use this uniqueId to transactionally commit the data in partitionIterator
  }
}
from http://spark.apache.org/docs/latest/streaming-programming-guide.html#performance-tuning
QUESTION 3: Is the method of getting data that I am using a sane one (given I am following the link above)? (Put aside the fact that this is scalikejdbc/JDBC right now; it is going to be a key-value store of some type beyond this prototype.)
To create a transformation that uses resources local to the executor (such as a DB or network connection), you should use rdd.mapPartitions. It allows you to initialize some code locally on the executor and use those local resources to process the data in the partition.
The code should look like:
val lastRevs = distinctFileGidsRDD.
  mapPartitions { iter =>
    SetupJDBC(jdbcDriver, jdbcUrl, jdbcUser, jdbcPassword)
    iter.map { element =>
      DB.readOnly { implicit session =>
        sql"SELECT jsonStr FROM lasttail WHERE fileGId = ${element(0)}::varchar"
          .map { resultSet => resultSet.string(1) }.single.apply()
      }
    }
  }
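Since transformations are lazy, you still need an action to get the values back on the driver. A minimal follow-up sketch (assuming the result is small enough to fit in driver memory):

// Materialize the mapped partitions on the driver;
// each element is the Option[String] produced by .single.apply() above.
val revs: Array[Option[String]] = lastRevs.collect()
revs.foreach(println)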