How to fetch data from MongoDB in Scala

I wrote the following code to fetch data from MongoDB:
import com.typesafe.config.ConfigFactory
import org.mongodb.scala.{ Document, MongoClient, MongoCollection, MongoDatabase }
import scala.concurrent.ExecutionContext
object MongoService extends Service {
  val conf = ConfigFactory.load()
  implicit val mongoService: MongoClient = MongoClient(conf.getString("mongo.url"))
  implicit val mongoDB: MongoDatabase = mongoService.getDatabase(conf.getString("mongo.db"))
  implicit val ec: ExecutionContext = ExecutionContext.global

  def getAllDocumentsFromCollection(collection: String) = {
    mongoDB.getCollection(collection).find()
  }
}
But when I try to get data via getAllDocumentsFromCollection, I'm not getting the individual documents for further manipulation. Instead I'm getting
FindObservable(com.mongodb.async.client.FindIterableImpl@23555cf5)
UPDATED:
object MongoService {
  // My settings (see available connection options)
  val mongoUri = "mongodb://localhost:27017/smsto?authMode=scram-sha1"

  import scala.concurrent.{ ExecutionContext, Future }
  import ExecutionContext.Implicits.global // use any appropriate context

  import reactivemongo.api.{ DefaultDB, MongoConnection, MongoDriver }
  import reactivemongo.bson.{ BSONDocumentWriter, Macros }

  // Connect to the database: Must be done only once per application
  val driver = MongoDriver()
  val parsedUri = MongoConnection.parseURI(mongoUri)
  val connection = parsedUri.map(driver.connection(_))

  // Database and collections: Get references
  val futureConnection = Future.fromTry(connection)
  def db1: Future[DefaultDB] = futureConnection.flatMap(_.database("smsto"))
  def personCollection = db1.map(_.collection("person"))

  // Write Documents: insert or update
  implicit def personWriter: BSONDocumentWriter[Person] = Macros.writer[Person]
  // or provide a custom one

  def createPerson(person: Person): Future[Unit] =
    personCollection.flatMap(_.insert(person).map(_ => {})) // use personWriter

  def getAll(collection: String) =
    db1.map(_.collection(collection))

  // Custom persistent types
  case class Person(firstName: String, lastName: String, age: Int)
}
I tried to use ReactiveMongo as well, with the code above, but I couldn't make getAll work, and I'm getting the following error in createPerson.
Please suggest how I can get all data from a collection.
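For the ReactiveMongo variant, note that getAll above returns a Future of a collection reference, not the documents themselves. As a rough sketch only, assuming a ReactiveMongo 0.12-style API (the find(selector) and Cursor.collect signatures differ between versions), a method that actually fetches the documents could sit inside the MongoService object above:
import reactivemongo.api.Cursor
import reactivemongo.api.collections.bson.BSONCollection
import reactivemongo.bson.BSONDocument

// db1 is the Future[DefaultDB] defined in MongoService above
def getAllDocs(collectionName: String): Future[List[BSONDocument]] =
  db1.flatMap { db =>
    db.collection[BSONCollection](collectionName)
      .find(BSONDocument())                                        // empty selector: match everything
      .cursor[BSONDocument]()
      .collect[List](-1, Cursor.FailOnError[List[BSONDocument]]())
  }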

This is likely too late for the OP, but hopefully the following methods of retrieving and iterating over collections using the mongo-scala driver can prove useful to others.
The Asynchronous Way - Iterating over documents asynchronously means you won't have to store an entire collection in memory, which can become unreasonable for large collections. However, you won't have access to all your documents outside the subscribe code block for reuse. I'd recommend doing things asynchronously if you can, since this is how the mongo-scala driver was intended to be used.
db.getCollection(collectionName).find().subscribe(
  (doc: org.mongodb.scala.bson.Document) => {
    // operate on an individual document here
  },
  (e: Throwable) => {
    // do something with errors here, if desired
  },
  () => {
    // this signifies that you've reached the end of your collection
  }
)
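For context, a minimal sketch of the setup around that call might look as follows; the URI, database, and collection names are placeholders, not values from the original question:
import org.mongodb.scala._

val client: MongoClient = MongoClient("mongodb://localhost:27017") // placeholder URI
val db: MongoDatabase = client.getDatabase("mydb")                 // placeholder database name

db.getCollection("myCollection").find().subscribe(
  (doc: Document) => println(doc.toJson()),  // handle each document as it arrives
  (e: Throwable) => e.printStackTrace(),     // handle errors
  () => println("Finished iterating")        // collection exhausted
)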
The "Synchronous" Way - This is a pattern I use when my use-case calls for a synchronous solution, and I'm working with smaller collections or result-sets. It still uses the asynchronous mongo-scala driver, but it returns a list of documents and blocks downstream code execution until all documents are returned. Handling errors and timeouts may depend on your use case.
import org.mongodb.scala._
import org.mongodb.scala.bson.Document
import org.mongodb.scala.model.Filters
import scala.collection.mutable.ListBuffer
/* This function optionally takes filters if you do not wish to return the entire collection.
 * You could extend it to take other optional query params, such as org.mongodb.scala.model.{Sorts, Projections, Aggregates}
 */
def getDocsSync(db: MongoDatabase, collectionName: String, filters: Option[conversions.Bson]): ListBuffer[Document] = {
  val docs = scala.collection.mutable.ListBuffer[Document]()
  var processing = true

  val query = if (filters.isDefined) {
    db.getCollection(collectionName).find(filters.get)
  } else {
    db.getCollection(collectionName).find()
  }

  query.subscribe(
    (doc: Document) => docs.append(doc), // add doc to mutable list
    (e: Throwable) => throw e,
    () => processing = false
  )

  while (processing) {
    Thread.sleep(100) // wait here until all docs have been returned
  }
  docs
}
// sample usage of 'synchronous' method
val client: MongoClient = MongoClient(uriString)
val db: MongoDatabase = client.getDatabase(dbName)
val allDocs = getDocsSync(db, "myCollection", Option.empty)
val someDocs = getDocsSync(db, "myCollection", Option(Filters.eq("fieldName", "foo")))
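If blocking is acceptable, an alternative to the Thread.sleep polling loop is to convert the observable to a Future and await it. This is only a sketch; the 10-second timeout is an arbitrary choice:
import org.mongodb.scala._
import scala.concurrent.Await
import scala.concurrent.duration._

// Blocks until the whole result set is materialized or the timeout expires
def getDocsBlocking(db: MongoDatabase, collectionName: String,
                    timeout: Duration = 10.seconds): Seq[Document] =
  Await.result(db.getCollection(collectionName).find().toFuture(), timeout)
Compared to the busy-wait, this surfaces driver errors as a failed Future and enforces a hard timeout instead of waiting forever.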

Related

Apache Beam stops processing PubSub messages after some time

I'm trying to write a simple Apache Beam pipeline (which will run on the Dataflow runner) to do the following:
Read PubSub messages containing file paths on GCS from a subscription.
For each message, read the data contained in the file associated with the message (the files can be in a variety of formats: csv, jsonl, json, xml, ...).
Do some processing on each record.
Write back the result on GCS.
I'm using a 10-second fixed window on the messages. Since incoming files are already chunked (max size 10 MB), I decided not to use splittable DoFns to read the files, in order to avoid adding needless complexity (especially for files that are not trivially splittable into chunks).
Here is a simplified code sample that gives the exact same problem of the full one:
package skytv.ingester
import java.io.{BufferedReader, InputStreamReader}
import java.nio.charset.StandardCharsets
import kantan.csv.rfc
import org.apache.beam.sdk.Pipeline
import org.apache.beam.sdk.io.{Compression, FileIO, FileSystems, TextIO, WriteFilesResult}
import org.apache.beam.sdk.io.gcp.pubsub.{PubsubIO, PubsubMessage}
import org.apache.beam.sdk.options.PipelineOptionsFactory
import org.apache.beam.sdk.transforms.DoFn.ProcessElement
import org.apache.beam.sdk.transforms.windowing.{BoundedWindow, FixedWindows, PaneInfo, Window}
import org.apache.beam.sdk.transforms.{Contextful, DoFn, MapElements, PTransform, ParDo, SerializableFunction, SimpleFunction, WithTimestamps}
import org.apache.beam.sdk.values.{KV, PCollection}
import org.joda.time.{Duration, Instant}
import skytv.cloudstorage.CloudStorageClient
import skytv.common.Closeable
import kantan.csv.ops._
import org.apache.beam.sdk.io.FileIO.{Sink, Write}
class FileReader extends DoFn[String, List[String]] {
private def getFileReader(filePath: String) = {
val cloudStorageClient = new CloudStorageClient()
val inputStream = cloudStorageClient.getInputStream(filePath)
val isr = new InputStreamReader(inputStream, StandardCharsets.UTF_8)
new BufferedReader(isr)
}
private def getRowsIterator(fileReader: BufferedReader) = {
fileReader
.asUnsafeCsvReader[Seq[String]](rfc
.withCellSeparator(',')
.withoutHeader
.withQuote('"'))
.toIterator
}
@ProcessElement
def processElement(c: ProcessContext): Unit = {
val filePath = c.element()
Closeable.tryWithResources(
getFileReader(filePath)
) {
fileReader => {
getRowsIterator(fileReader)
.foreach(record => c.output(record.toList))
}
}
}
}
class DataWriter(tempFolder: String) extends PTransform[PCollection[List[String]], WriteFilesResult[String]] {
private val convertRecord = Contextful.fn[List[String], String]((dr: List[String]) => {
dr.mkString(",")
})
private val getSink = Contextful.fn[String, Sink[String]]((destinationKey: String) => {
TextIO.sink()
})
private val getPartitioningKey = new SerializableFunction[List[String], String] {
override def apply(input: List[String]): String = {
input.head
}
}
private val getNaming = Contextful.fn[String, Write.FileNaming]((destinationKey: String) => {
new Write.FileNaming {
override def getFilename(
window: BoundedWindow,
pane: PaneInfo,
numShards: Int,
shardIndex: Int,
compression: Compression
): String = {
s"$destinationKey-${window.maxTimestamp()}-${pane.getIndex}.csv"
}
}
})
override def expand(input: PCollection[List[String]]): WriteFilesResult[String] = {
val fileWritingTransform = FileIO
.writeDynamic[String, List[String]]()
.by(getPartitioningKey)
.withDestinationCoder(input.getPipeline.getCoderRegistry.getCoder(classOf[String]))
.withTempDirectory(tempFolder)
.via(convertRecord, getSink)
.withNaming(getNaming)
.withNumShards(1)
input
.apply("WriteToAvro", fileWritingTransform)
}
}
object EnhancedIngesterScalaSimplified {
private val SUBSCRIPTION_NAME = "projects/<project>/subscriptions/<subscription>"
private val TMP_LOCATION = "gs://<path>"
private val WINDOW_SIZE = Duration.standardSeconds(10)
def main(args: Array[String]): Unit = {
val options = PipelineOptionsFactory.fromArgs(args: _*).withValidation().create()
FileSystems.setDefaultPipelineOptions(options)
val p = Pipeline.create(options)
val messages = p
.apply("ReadMessages", PubsubIO.readMessagesWithAttributes.fromSubscription(SUBSCRIPTION_NAME))
// .apply("AddProcessingTimeTimestamp", WithTimestamps.of(new SerializableFunction[PubsubMessage, Instant] {
// override def apply(input: PubsubMessage): Instant = Instant.now()
// }))
val parsedMessages = messages
.apply("ApplyWindow", Window.into[PubsubMessage](FixedWindows.of(WINDOW_SIZE)))
.apply("ParseMessages", MapElements.via(new SimpleFunction[PubsubMessage, String]() {
override def apply(msg: PubsubMessage): String = new String(msg.getPayload, StandardCharsets.UTF_8)
}))
val dataReadResult = parsedMessages
.apply("ReadData", ParDo.of(new FileReader))
val writerResult = dataReadResult.apply(
"WriteData",
new DataWriter(TMP_LOCATION)
)
writerResult.getPerDestinationOutputFilenames.apply(
"FilesWritten",
MapElements.via(new SimpleFunction[KV[String, String], String]() {
override def apply(input: KV[String, String]): String = {
println(s"Written ${input.getKey}, ${input.getValue}")
input.getValue
}
}))
p.run.waitUntilFinish()
}
}
The problem is that after processing some messages (on the order of 1,000), the job stops processing new messages, and they remain in the PubSub subscription unacknowledged forever.
I saw that in such a situation the watermark stops advancing and the data freshness increases linearly, indefinitely.
Here is a screenshot from Dataflow:
And here is the situation on the PubSub queue:
Is it possible that some messages remain stuck in the Dataflow queues, filling them so that no new messages can be added?
I thought there was some problem with how timestamps are computed by PubsubIO, so I tried to force the timestamps to be equal to the processing time of each message, but I had no success.
If I leave the Dataflow job in this state, it seems to continuously reprocess the same messages without ever writing any data to storage.
Do you have any idea how to solve this problem?
Thanks!
Most likely the pipeline has encountered an error while processing one (or more) elements (and it shouldn't have anything to do with how timestamps are computed by PubsubIO), which stops the watermark from advancing, since the failed work will be retried again and again on Dataflow.
You can check whether there are any failures in the logs, specifically from the worker or harness components. If there's an unhandled runtime exception, such as a parse error, it is very likely the root cause of a streaming pipeline getting stuck.
If there's no UserCodeException, then it is likely some other issue caused by the Dataflow backend, and you can reach out to Dataflow customer support so engineers can look into the backend issue for your pipeline.
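To illustrate the first point, one way to keep a single bad element from stalling the whole pipeline is to catch exceptions inside the DoFn and log (or route away) the failing element instead of letting the exception escape and force endless retries. The sketch below is built around the FileReader DoFn from the question; SafeFileReader and its skip-on-error behaviour are illustrative assumptions, not part of the original code:
import org.apache.beam.sdk.transforms.DoFn
import org.apache.beam.sdk.transforms.DoFn.ProcessElement
import org.slf4j.LoggerFactory
import scala.util.{Failure, Success, Try}

class SafeFileReader extends DoFn[String, List[String]] {

  @ProcessElement
  def processElement(c: ProcessContext): Unit = {
    val filePath = c.element()
    // Wrap the risky work so one unreadable file doesn't fail the bundle
    // and get retried forever, stalling the watermark.
    Try {
      // ... open filePath and c.output(...) each parsed record, as in FileReader above ...
    } match {
      case Success(_) => ()
      case Failure(ex) =>
        LoggerFactory.getLogger(getClass).error(s"Skipping unprocessable file: $filePath", ex)
    }
  }
}
A more robust variant would route the failing paths to a dead-letter output via side outputs, but even plain logging makes the failing elements visible in the worker logs mentioned above.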

Can't understand TwoPhaseCommitSinkFunction lifecycle

I needed a sink to a Postgres DB, so I started to build a custom Flink SinkFunction. Since FlinkKafkaProducer implements TwoPhaseCommitSinkFunction, I decided to do the same. As stated in O'Reilly's book Stream Processing with Apache Flink, you just need to implement the abstract methods, enable checkpointing, and you're good to go. But what really happens when I run my code is that the commit method is called only once, and it is called before invoke, which is totally unexpected, since you shouldn't be ready to commit if your set of ready-to-commit transactions is empty. Worse, after committing, invoke is called for all of the transaction lines present in my file, and then abort is called, which is even more unexpected.
When the sink is initialized, it is my understanding that the following should occur:
beginTransaction is called and sends an identifier to invoke
invoke adds the lines to the transaction, according to the identifier received
preCommit makes all final modifications on the current transaction's data
commit handles the finalized transaction of pre-committed data
So, I can't see why my program doesn't show this behaviour.
Here goes my sink code:
package PostgresConnector
import java.sql.{BatchUpdateException, DriverManager, PreparedStatement, SQLException, Timestamp}
import java.text.ParseException
import java.util.{Date, Properties, UUID}
import org.apache.flink.api.common.ExecutionConfig
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.sink.{SinkFunction, TwoPhaseCommitSinkFunction}
import org.apache.flink.streaming.api.scala._
import org.slf4j.{Logger, LoggerFactory}
class PostgreSink(props : Properties, config : ExecutionConfig) extends TwoPhaseCommitSinkFunction[(String,String,String,String),String,String](createTypeInformation[String].createSerializer(config),createTypeInformation[String].createSerializer(config)){
private var transactionMap : Map[String,Array[(String,String,String,String)]] = Map()
private var parsedQuery : PreparedStatement = _
private val insertionString : String = "INSERT INTO mydb (field1,field2,point) values (?,?,point(?,?))"
override def invoke(transaction: String, value: (String,String,String,String), context: SinkFunction.Context[_]): Unit = {
val LOG = LoggerFactory.getLogger(classOf[FlinkCEPClasses.FlinkCEPPipeline])
val res = this.transactionMap.get(transaction)
if(res.isDefined){
var array = res.get
array = array ++ Array(value)
this.transactionMap += (transaction -> array)
}else{
val array = Array(value)
this.transactionMap += (transaction -> array)
}
LOG.info("\n\nPassing through invoke\n\n")
()
}
override def beginTransaction(): String = {
val LOG: Logger = LoggerFactory.getLogger(classOf[FlinkCEPClasses.FlinkCEPPipeline])
val identifier = UUID.randomUUID.toString
LOG.info("\n\nPassing through beginTransaction\n\n")
identifier
}
override def preCommit(transaction: String): Unit = {
val LOG = LoggerFactory.getLogger(classOf[FlinkCEPClasses.FlinkCEPPipeline])
try{
val tuple : Option[Array[(String,String,String,String)]]= this.transactionMap.get(transaction)
if(tuple.isDefined){
tuple.get.foreach( (value : (String,String,String,String)) => {
LOG.info("\n\n"+value.toString()+"\n\n")
this.parsedQuery.setString(1,value._1)
this.parsedQuery.setString(2,value._2)
this.parsedQuery.setString(3,value._3)
this.parsedQuery.setString(4,value._4)
this.parsedQuery.addBatch()
})
}
}catch{
case e : SQLException =>
LOG.info("\n\nError when adding transaction to batch: SQLException\n\n")
case f : ParseException =>
LOG.info("\n\nError when adding transaction to batch: ParseException\n\n")
case g : NoSuchElementException =>
LOG.info("\n\nError when adding transaction to batch: NoSuchElementException\n\n")
case h : Exception =>
LOG.info("\n\nError when adding transaction to batch: Exception\n\n")
}
this.transactionMap = this.transactionMap.empty
LOG.info("\n\nPassing through preCommit...\n\n")
}
override def commit(transaction: String): Unit = {
val LOG : Logger = LoggerFactory.getLogger(classOf[FlinkCEPClasses.FlinkCEPPipeline])
if(this.parsedQuery != null) {
LOG.info("\n\n" + this.parsedQuery.toString+ "\n\n")
}
try{
this.parsedQuery.executeBatch
val LOG : Logger = LoggerFactory.getLogger(classOf[FlinkCEPClasses.FlinkCEPPipeline])
LOG.info("\n\nExecuting batch\n\n")
}catch{
case e : SQLException =>
val LOG : Logger = LoggerFactory.getLogger(classOf[FlinkCEPClasses.FlinkCEPPipeline])
LOG.info("\n\n"+"Error : SQLException"+"\n\n")
}
this.transactionMap = this.transactionMap.empty
LOG.info("\n\nPassing through commit...\n\n")
}
override def abort(transaction: String): Unit = {
val LOG : Logger = LoggerFactory.getLogger(classOf[FlinkCEPClasses.FlinkCEPPipeline])
this.transactionMap = this.transactionMap.empty
LOG.info("\n\nPassing through abort...\n\n")
}
override def open(parameters: Configuration): Unit = {
val LOG: Logger = LoggerFactory.getLogger(classOf[FlinkCEPClasses.FlinkCEPPipeline])
val driver = props.getProperty("driver")
val url = props.getProperty("url")
val user = props.getProperty("user")
val password = props.getProperty("password")
Class.forName(driver)
val connection = DriverManager.getConnection(url + "?user=" + user + "&password=" + password)
this.parsedQuery = connection.prepareStatement(insertionString)
LOG.info("\n\nConfiguring BD conection parameters\n\n")
}
}
And this is my main program:
package FlinkCEPClasses
import PostgresConnector.PostgreSink
import org.apache.flink.api.java.io.TextInputFormat
import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.cep.PatternSelectFunction
import org.apache.flink.cep.pattern.conditions.SimpleCondition
import org.apache.flink.cep.scala.pattern.Pattern
import org.apache.flink.core.fs.{FileSystem, Path}
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.cep.scala.{CEP, PatternStream}
import org.apache.flink.streaming.api.functions.source.FileProcessingMode
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import java.util.Properties
import org.apache.flink.api.common.ExecutionConfig
import org.slf4j.{Logger, LoggerFactory}
class FlinkCEPPipeline {
val LOG: Logger = LoggerFactory.getLogger(classOf[FlinkCEPPipeline])
LOG.info("\n\nStarting the pipeline...\n\n")
var env : StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
env.enableCheckpointing(10)
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime)
env.setParallelism(1)
//var input : DataStream[String] = env.readFile(new TextInputFormat(new Path("/home/luca/Desktop/lines")),"/home/luca/Desktop/lines",FileProcessingMode.PROCESS_CONTINUOUSLY,1)
var input : DataStream[String] = env.readTextFile("/home/luca/Desktop/lines").name("Raw stream")
var tupleStream : DataStream[(String,String,String,String)] = input.map(new S2PMapFunction()).name("Tuple Stream")
var properties : Properties = new Properties()
properties.setProperty("driver","org.postgresql.Driver")
properties.setProperty("url","jdbc:postgresql://localhost:5432/mydb")
properties.setProperty("user","luca")
properties.setProperty("password","root")
tupleStream.addSink(new PostgreSink(properties,env.getConfig)).name("Postgres Sink").setParallelism(1)
tupleStream.writeAsText("/home/luca/Desktop/output",FileSystem.WriteMode.OVERWRITE).name("File Sink").setParallelism(1)
env.execute()
}
My S2PMapFunction code:
package FlinkCEPClasses
import org.apache.flink.api.common.functions.MapFunction
case class S2PMapFunction() extends MapFunction[String,(String,String,String,String)] {
override def map(value: String): (String, String, String,String) = {
var tuple = value.replaceAllLiterally("(","").replaceAllLiterally(")","").split(',')
(tuple(0),tuple(1),tuple(2),tuple(3))
}
}
My pipeline works like this: I read lines from a file, map them to a tuple of strings, and use the data inside the tuples to save them in a Postgres DB.
If you want to simulate the data, just create a file with lines in a format like this:
(field1,field2,pointx,pointy)
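For example, a few lines of Scala are enough to generate such a file at the path the pipeline reads from (the values themselves are arbitrary):
import java.nio.file.{Files, Paths}
import scala.collection.JavaConverters._

// Writes 20 sample lines in the (field1,field2,pointx,pointy) format
val sampleLines = (1 to 20).map(i => s"(field$i,value$i,${i * 1.0},${i * 2.0})")
Files.write(Paths.get("/home/luca/Desktop/lines"), sampleLines.asJava)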
Edit
The execution order of the TwoPhaseCommitSinkFunction's methods is the following:
Starting pipeline...
beginTransaction
preCommit
beginTransaction
commit
invoke
invoke
invoke
invoke
invoke
invoke
invoke
invoke
invoke
invoke
invoke
invoke
invoke
invoke
invoke
invoke
invoke
invoke
abort
I'm not an expert on this topic, but a couple of guesses:
preCommit is called whenever Flink begins a checkpoint, and commit is called when the checkpoint is complete. These methods are called simply because checkpointing is happening, regardless of whether the sink has received any data.
Checkpointing is happening periodically, regardless of whether any data is flowing through your pipeline. Given your very short checkpointing interval (10 msec), it does seem plausible that the first checkpoint barrier will reach the sink before the source has managed to send it any data.
It also looks like you are assuming that only one transaction will be open at a time. I'm not sure that's strictly guaranteed, but so long as maxConcurrentCheckpoints is 1 (which is the default), you should be okay.
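To make the second point concrete, a less aggressive checkpointing setup gives the source time to emit data before the first barrier reaches the sink, and pins the number of concurrent checkpoints to one. The 10-second interval below is just an illustrative value, not a recommendation from the original answer:
import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.enableCheckpointing(10000, CheckpointingMode.EXACTLY_ONCE)  // every 10 s instead of every 10 ms
env.getCheckpointConfig.setMaxConcurrentCheckpoints(1)          // at most one open checkpoint/transaction
env.getCheckpointConfig.setMinPauseBetweenCheckpoints(5000)     // breathing room between checkpoints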
So, here goes the "answer" to this question. Just to be clear: at this moment, the problem with the TwoPhaseCommitSinkFunction hasn't been solved yet. If what you're looking for is about the original problem, then you should look for another answer. If you don't care about what you'll use as a sink, then maybe I can help you with that.
As suggested by @DavidAnderson, I started to study the Table API to see if it could solve my problem, which was using Flink to insert lines in my database table.
It turned out to be really simple, as you'll see.
Note: beware of the version you are using. My Flink version is 1.9.0.
Source code
package FlinkCEPClasses
import java.sql.Timestamp
import java.util.Properties
import org.apache.flink.api.common.typeinfo.{TypeInformation, Types}
import org.apache.flink.api.java.io.jdbc.JDBCAppendTableSink
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.table.api.{EnvironmentSettings, Table}
import org.apache.flink.table.api.scala.StreamTableEnvironment
import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.sinks.TableSink
import org.postgresql.Driver
class TableAPIPipeline {
// --- normal pipeline initialization in this block ---
var env : StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
env.enableCheckpointing(10)
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime)
env.setParallelism(1)
var input : DataStream[String] = env.readTextFile("/home/luca/Desktop/lines").name("Original stream")
var tupleStream : DataStream[(String,Timestamp,Double,Double)] = input.map(new S2PlacaMapFunction()).name("Tuple Stream")
var properties : Properties = new Properties()
properties.setProperty("driver","org.postgresql.Driver")
properties.setProperty("url","jdbc:postgresql://localhost:5432/mydb")
properties.setProperty("user","myuser")
properties.setProperty("password","mypassword")
// --- normal pipeline initialization in this block END ---
// These two lines create what Flink calls StreamTableEnvironment.
// It seems pretty similar to a normal stream initialization.
val settings = EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build()
val tableEnv = StreamTableEnvironment.create(env,settings)
//Since I wanted to sink data into a database, I used the JDBC TableSink,
//because it is very intuitive and an exact match for my need. You may
//look for other TableSink classes that fit better in your solution.
var tableSink : JDBCAppendTableSink = JDBCAppendTableSink.builder()
.setBatchSize(1)
.setDBUrl("jdbc:postgresql://localhost:5432/mydb")
.setDrivername("org.postgresql.Driver")
.setPassword("mypassword")
.setUsername("myuser")
.setQuery("INSERT INTO mytable (data1,data2,data3) VALUES (?,?,point(?,?))")
.setParameterTypes(Types.STRING,Types.SQL_TIMESTAMP,Types.DOUBLE,Types.DOUBLE)
.build()
val fieldNames = Array("data1","data2","data3","data4")
val fieldTypes = Array[TypeInformation[_]](Types.STRING,Types.SQL_TIMESTAMP,Types.DOUBLE, Types.DOUBLE)
// This is the crucial part of the code: first, you need to register
// your table sink, providing the name, the field names, the field types and
// the TableSink object.
tableEnv.registerTableSink("postgres-table-sink",
fieldNames,
fieldTypes,
tableSink
)
// Then, you transform your DataStream into a Table object.
var table = tableEnv.fromDataStream(tupleStream)
// Finally, you insert your stream data into the registered sink.
table.insertInto("postgres-table-sink")
env.execute()
}
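The S2PlacaMapFunction referenced above is not shown in the answer. A hypothetical version, mirroring the earlier S2PMapFunction but producing the (String, Timestamp, Double, Double) tuples the table sink expects, might look like this; the input format and parsing are assumptions:
import java.sql.Timestamp
import org.apache.flink.api.common.functions.MapFunction

case class S2PlacaMapFunction() extends MapFunction[String, (String, Timestamp, Double, Double)] {
  override def map(value: String): (String, Timestamp, Double, Double) = {
    // assumed input format: (plate,2019-01-01 10:00:00.0,12.34,56.78)
    val fields = value.replaceAllLiterally("(", "").replaceAllLiterally(")", "").split(',')
    (fields(0), Timestamp.valueOf(fields(1)), fields(2).toDouble, fields(3).toDouble)
  }
}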

How to make synchronous queries using the MongoDB Scala driver

As "MongoDb Scala Driver" is the only official Scala driver now, I plan to switch from Casbah. However, MongoDb Scala Driver seems to support only Asynchronous API (at least in its documentation).
Is there a way to make synchronous queries?
I had the same problem some days ago when I was moving from Casbah. Apparently the official MongoDB driver uses the observer pattern. I wanted to retrieve a sequence number from a collection, and I had to wait for the value to be retrieved before continuing the operation. I am not sure if this is the correct way, but at least it is one way of doing it:
def getSequenceId(seqName: String): Int = {
  val query = new BsonDocument("seq_id", new BsonString(seqName))
  val resultado = NewMongo.SequenceCollection.findOneAndUpdate(query, inc("nextId", 1))
  resultado.subscribe(new Observer[Document] {
    override def onNext(result: Document): Unit = {}
    override def onError(e: Throwable): Unit = {}
    override def onComplete(): Unit = {}
  })
  val awaitedR = Await.result(resultado.toFuture, Duration.Inf).asInstanceOf[List[Document]](0)
  val ret = awaitedR.get("nextId").getOrElse(0).asInstanceOf[BsonDouble].intValue()
  ret
}
Apparently you can convert the resulting observable into a Future and wait for its result using Await, as I did. Then you can manipulate the result as you wish.
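For reference, a trimmed-down sketch of the same lookup without the redundant subscribe call, reusing the NewMongo.SequenceCollection reference from above (the head() conversion and the 10-second timeout are my own choices, not from the original answer):
import org.mongodb.scala._
import org.mongodb.scala.bson.{BsonDocument, BsonString}
import org.mongodb.scala.model.Updates.inc
import scala.concurrent.Await
import scala.concurrent.duration._

def getSequenceId(seqName: String): Int = {
  val query = new BsonDocument("seq_id", new BsonString(seqName))
  // head() turns the observable into a Future of its first emitted document
  val first = NewMongo.SequenceCollection.findOneAndUpdate(query, inc("nextId", 1)).head()
  Await.result(first, 10.seconds)
    .get("nextId")
    .map(_.asNumber().intValue())
    .getOrElse(0)
}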
The database configuration is the following:
private val mongoClient: MongoClient = MongoClient("mongodb://localhost:27017/?maxPoolSize=30")
private val database: MongoDatabase = mongoClient.getDatabase("mydb");
val Sequence: MongoCollection[Document] = database.getCollection(SEQUENCE);
Hope my answer was helpful

Parallelizing access to S3 objects in Scala

I'm writing a Scala program to read objects matching a certain prefix on S3.
At the moment, I'm testing it on my MacBook Pro, and it takes 270 ms (avg. over 1,000 trials) to hit S3, retrieve the 10 objects (avg. object size 150 KB) and process them to print the output.
Here's my code:
val myBucket = "my-test-bucket"
val myPrefix = "t"
val startTime = System.currentTimeMillis()
//Can I make listObject parallel?
val listObjRequest: ListObjectsRequest = new ListObjectsRequest().withBucketName(myBucket)
val listObjResult: Seq[String] = s3.listObjects(listObjRequest).getObjectSummaries.par.toIndexedSeq.map(_.getKey).filter(_ matches s"./.*${myPrefix}.*/*")
//Can I make forEach parallel?
listObjResult foreach println //Could be any function
println(s"Total time: ${System.currentTimeMillis() - startTime}ms")
In the big scheme of things, I've got to sift through 50 GB of data (approx. 350K nested objects) and delete objects following a certain prefix (approx. 40K objects).
Hardware considerations aside, what can I do to optimize my code?
Thanks!
A possible solution would be to batch the request objects and send a request for batch deletion in S3. You can group the objects to delete and then parallelize the requests by mapping each group to a Future:
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.DeleteObjectsRequest.KeyVersion
import com.amazonaws.services.s3.model.{DeleteObjectsRequest, DeleteObjectsResult}
import scala.collection.JavaConverters._
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent._
import scala.util.Try
object AmazonBatchDeletion {
def main(args: Array[String]): Unit = {
val filesToDelete: List[String] = ???
val numOfGroups: Int = ???
val deletionAttempts: Iterator[Future[Try[DeleteObjectsResult]]] =
filesToDelete
.grouped(numOfGroups)
.map(groupToDelete => Future {
blocking {
deleteFilesInBatch(groupToDelete, "bucketName")
}
})
val result: Future[Iterator[Try[DeleteObjectsResult]]] =
Future.sequence(deletionAttempts)
// TODO: make sure deletion was successful.
// Recover if needed form faulted futures.
}
def deleteFilesInBatch(filesToDelete: List[String],
bucketName: String): Try[DeleteObjectsResult] = {
val amazonClient = new AmazonS3Client()
val deleteObjectsRequest = new DeleteObjectsRequest(bucketName)
deleteObjectsRequest.setKeys(filesToDelete.map(new KeyVersion(_)).asJava)
Try {
amazonClient.deleteObjects(deleteObjectsRequest)
}
}
}
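One possible way to finish the TODO above is to block on the combined future and inspect the per-batch results; this is just a sketch, and the timeout is arbitrary:
import com.amazonaws.services.s3.model.DeleteObjectsResult
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.util.{Failure, Success, Try}

def reportDeletions(result: Future[Iterator[Try[DeleteObjectsResult]]]): Unit = {
  val outcomes = Await.result(result, 10.minutes) // arbitrary timeout
  outcomes.foreach {
    case Success(res) => println(s"Deleted ${res.getDeletedObjects.size()} objects")
    case Failure(ex) => println(s"Batch deletion failed: ${ex.getMessage}")
  }
}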

Casbah: No implicit view available error

In a Play app, using Salat and Casbah, I am trying to de-serialize a DBObject into an object of type Task, but I am getting this error when calling .asObject:
No implicit view available from com.mongodb.casbah.Imports.DBObject =>
com.mongodb.casbah.Imports.MongoDBObject. Error occurred in an
application involving default arguments.
The object is serialized correctly with .asDBObject, and written to the database as expected.
What is causing this behaviour, and what can be done to solve it? Here's the model involved:
package models
import db.{MongoFactory, MongoConnection}
import com.novus.salat._
import com.novus.salat.global._
import com.novus.salat.annotations._
import com.mongodb.casbah.Imports._
import com.mongodb.casbah.commons.Imports._
import play.api.Play
case class Task(label: String, _id: ObjectId = new ObjectId)

object Task {

  implicit val ctx = new Context {
    val name = "Custom_Classloader"
  }
  ctx.registerClassLoader(Play.classloader(Play.current))

  val taskCollection = MongoFactory.database("tasks")

  def create(label: String): Task = {
    val task = new Task(label)
    val dbObject = grater[Task].asDBObject(task)
    taskCollection.save(dbObject)
    grater[Task].asObject(dbObject)
  }

  def all(): List[Task] = {
    val results = taskCollection.find()
    val tasks = for (item <- results) yield grater[Task].asObject(item)
    tasks.toList
  }
}
Versions
casbah: "2.8.1"
scala: "2.11.6"
salat: "1.9.9"
Instructions on creating a custom context:
First, define a custom context as implicit val ctx = new Context { /* custom behaviour */ } in a package object
Stop importing com.novus.salat.global._
Import your own custom context everywhere instead
Source: https://github.com/novus/salat/wiki/CustomContext
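Applied to the Task model above, a minimal sketch of such a package object might look like this (the file location is an assumption):
// e.g. app/models/package.scala
package object models {
  import com.novus.salat.Context
  import play.api.Play

  implicit val ctx = new Context {
    val name = "Custom_Classloader"
  }
  ctx.registerClassLoader(Play.classloader(Play.current))
}
Code inside package models then picks up this implicit context automatically, so object Task can drop both its own ctx definition and the import com.novus.salat.global._ line; code in other packages would import models._ instead.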