How to make synchronous queries using the MongoDB Scala driver - mongodb

As "MongoDb Scala Driver" is the only official Scala driver now, I plan to switch from Casbah. However, MongoDb Scala Driver seems to support only Asynchronous API (at least in its documentation).
Is there a way to make synchronous queries?

I had the same problem some days ago when I was moving from Casbah. Apparently the official MongoDB driver uses the observer pattern. I wanted to retrieve a sequence number from a collection, and I had to wait for the value to be retrieved before continuing the operation. I am not sure if it's the correct way, but at least this is one way of doing it:
import org.mongodb.scala._
import org.mongodb.scala.bson.{BsonDocument, BsonDouble, BsonString}
import org.mongodb.scala.model.Updates.inc
import scala.concurrent.Await
import scala.concurrent.duration.Duration

def getSequenceId(seqName: String): Int = {
  val query = new BsonDocument("seq_id", new BsonString(seqName))
  val resultado = NewMongo.SequenceCollection.findOneAndUpdate(query, inc("nextId", 1))
  // Convert the observable to a Future and block until the document is returned.
  // (Subscribing with an empty Observer as well would execute the update a second time.)
  val awaitedR = Await.result(resultado.toFuture(), Duration.Inf).head
  awaitedR.get("nextId").getOrElse(new BsonDouble(0)).asInstanceOf[BsonDouble].intValue()
}
Apparently you can convert the result observable into a future and wait for its result using Await, as I did. Then you can manipulate the result as you wish.
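The same idea can be wrapped in a small blocking helper. This is only a minimal sketch (the helper name is mine, not part of the driver), assuming the standard Observable.toFuture conversion that ships with org.mongodb.scala:

import org.mongodb.scala._
import scala.concurrent.Await
import scala.concurrent.duration.Duration

// Blocks the calling thread until the observable completes and returns
// everything it emitted. Use only where blocking is acceptable.
def awaitResults[T](observable: Observable[T], timeout: Duration = Duration.Inf): Seq[T] =
  Await.result(observable.toFuture(), timeout)

For example, awaitResults(collection.find(query)) would return the matched documents synchronously.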
The database configuration is the following:
private val mongoClient: MongoClient = MongoClient("mongodb://localhost:27017/?maxPoolSize=30")
private val database: MongoDatabase = mongoClient.getDatabase("mydb")
val Sequence: MongoCollection[Document] = database.getCollection(SEQUENCE)
Hope my answer was helpful

Related

Determining if a MongoDB connection is unavailable and creating a new connection if it is

I'm attempting to improve the below code that creates a MongoDB connection and inserts a document using the insertDocument method:
import com.typesafe.scalalogging.LazyLogging
import org.mongodb.scala.result.InsertOneResult
import org.mongodb.scala.{Document, MongoClient, MongoCollection, MongoDatabase, Observer, SingleObservable}
import play.api.libs.json.JsResult.Exception

object MongoFactory extends LazyLogging {

  val uri: String = "mongodb+srv://*********"
  val client: MongoClient = MongoClient(uri)
  val db: MongoDatabase = client.getDatabase("db")
  val collection: MongoCollection[Document] = db.getCollection("col")

  def insertDocument(document: Document) = {
    val singleObservable: SingleObservable[InsertOneResult] = collection.insertOne(document)
    singleObservable.subscribe(new Observer[InsertOneResult] {
      override def onNext(result: InsertOneResult): Unit = println(s"onNext: $result")
      override def onError(e: Throwable): Unit = println(s"onError: $e")
      override def onComplete(): Unit = println("onComplete")
    })
  }
}
The primary issue I see with the above code is that if the connection becomes stale, due to the MongoDB server going offline or some other condition, then the connection is not restarted.
An improvement to cater for this scenario is:
object MongoFactory extends LazyLogging {

  val uri: String = "mongodb+srv://*********"
  var client: MongoClient = MongoClient(uri)
  var db: MongoDatabase = client.getDatabase("db")
  var collection: MongoCollection[Document] = db.getCollection("col")

  def isDbDown(): Boolean = {
    try {
      client.getDatabase("db")
      false
    }
    catch {
      case e: Exception =>
        true
    }
  }

  def insertDocument(document: Document) = {
    if (isDbDown()) {
      client = MongoClient(uri)
      db = client.getDatabase("db")
      collection = db.getCollection("col")
    }

    val singleObservable: SingleObservable[InsertOneResult] = collection.insertOne(document)
    singleObservable.subscribe(new Observer[InsertOneResult] {
      override def onNext(result: InsertOneResult): Unit = println(s"onNext: $result")
      override def onError(e: Throwable): Unit = println(s"onError: $e")
      override def onComplete(): Unit = println("onComplete")
    })
  }
}
I expect this to handle the scenario where the DB connection becomes unavailable, but is there a more idiomatic Scala method of determining whether the connection is unavailable and recreating it if so?
Your code does not create connections. It creates MongoClient instances.
As such you cannot "create a new connection". MongoDB drivers do not provide an API for applications to manage connections.
Connections are managed internally by the driver and are created and destroyed automatically as needed in response to application requests/commands. You can configure connection pool size and when stale connections are removed from the pool.
Furthermore, execution of a single application command may involve multiple connections (up to 3 easily, possibly over 5 if encryption is involved), and the connection(s) used depend on the command/query. Checking the health of any one connection, even if it was possible, wouldn't be very useful.
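For reference, pool sizing and stale-connection cleanup are usually tuned on the connection string rather than in application code; a minimal sketch using standard URI options (the values are only examples):

import org.mongodb.scala.MongoClient

// maxPoolSize / minPoolSize bound the pool; maxIdleTimeMS controls how long an
// idle connection may live before the driver removes it from the pool.
val client: MongoClient = MongoClient(
  "mongodb://localhost:27017/?maxPoolSize=50&minPoolSize=5&maxIdleTimeMS=60000"
)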

Bulk Insert Data in HBase using Structured Spark Streaming

I'm reading data coming from Kafka (100,000 lines per second) using Spark Structured Streaming, and I'm trying to insert all of it into HBase.
I'm on Cloudera Hadoop 2.6 and I'm using Spark 2.3.
I tried something like what I've seen here.
eventhubs.writeStream
  .foreach(new MyHBaseWriter[Row])
  .option("checkpointLocation", checkpointDir)
  .start()
  .awaitTermination()
MyHBaseWriter looks like this :
class AtomeHBaseWriter[RECORD] extends HBaseForeachWriter[Row] {
  override val tableName: String = "hbase-table-name"

  override def toPut(record: Row): Put = {
    // Parse the JSON payload
    val data = JSON.parseFull(record.getString(0)).asInstanceOf[Some[Map[String, Object]]]
    val key = data.getOrElse(Map())("key") + ""
    val value = data.getOrElse(Map())("val") + ""

    val p = new Put(Bytes.toBytes(key))
    // Add columns ...
    p.addColumn(Bytes.toBytes(columnFamilyName), Bytes.toBytes(columnName), Bytes.toBytes(value))
    p
  }
}
And the HBaseForeachWriter class looks like this :
trait HBaseForeachWriter[RECORD] extends ForeachWriter[RECORD] {
  val tableName: String

  def pool: Option[ExecutorService] = None
  def user: Option[User] = None

  private var hTable: Table = _
  private var connection: Connection = _

  override def open(partitionId: Long, version: Long): Boolean = {
    connection = createConnection()
    hTable = getHTable(connection)
    true
  }

  def createConnection(): Connection = {
    // I create the HBase Connection here
  }

  def getHTable(connection: Connection): Table = {
    connection.getTable(TableName.valueOf(Variables.getTableName()))
  }

  override def process(record: RECORD): Unit = {
    val put = toPut(record)
    hTable.put(put)
  }

  override def close(errorOrNull: Throwable): Unit = {
    hTable.close()
    connection.close()
  }

  def toPut(record: RECORD): Put
}
So here I'm doing the puts line by line; even though I allow 20 executors with 4 cores each, the data is not inserted into HBase immediately. So what I need to do is a bulk load, but I'm struggling because everything I find on the internet does it with RDDs and Map/Reduce.
What I understand is that the rate of record ingestion into HBase is slow. I have a few suggestions for you.
1) hbase.client.write.buffer
The following property may help you:
hbase.client.write.buffer
Description: Default size of the BufferedMutator write buffer in bytes. A bigger buffer takes more memory, on both the client and the server side since the server instantiates the passed write buffer to process it, but a larger buffer size reduces the number of RPCs made. For an estimate of server-side memory used, evaluate hbase.client.write.buffer * hbase.regionserver.handler.count.
Default: 2097152 (around 2 MB)
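If you prefer to raise it programmatically rather than in hbase-site.xml, it is just a client configuration property; a minimal sketch (the 8 MB value is only an example):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.ConnectionFactory

val conf = HBaseConfiguration.create()
// A larger client write buffer means fewer, bigger RPCs, at the cost of memory
// on both the client and the region server.
conf.setLong("hbase.client.write.buffer", 8 * 1024 * 1024)
val connection = ConnectionFactory.createConnection(conf)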
I prefer foreachBatch (see the Spark docs; it is something like foreachPartition in Spark core) rather than foreach.
Also, in your HBase writer that extends ForeachWriter:
in open, initialise an ArrayList of Puts;
in process, add each Put to that list;
in close, call table.put(listOfPuts) and then clear the list once the table has been updated.
What this basically does: the write buffer mentioned above fills up to 2 MB and is then flushed into the HBase table; until then records do not reach the table. You can increase that to 10 MB and so on. This way the number of RPCs is reduced, and a large chunk of data is flushed into the HBase table whenever the write buffer fills up and a flushCommits to the table is triggered.
Example code: see the sketch below and the foreachPartition example later in my answer.
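A minimal sketch of that buffered writer as a standalone ForeachWriter (the column family, field names, and table name are illustrative, not from the original post):

import java.util
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory, Put, Table}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.{ForeachWriter, Row}

class BufferedHBaseWriter(tableName: String) extends ForeachWriter[Row] {
  private var connection: Connection = _
  private var table: Table = _
  private val puts = new util.ArrayList[Put]()

  override def open(partitionId: Long, version: Long): Boolean = {
    connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
    table = connection.getTable(TableName.valueOf(tableName))
    true
  }

  override def process(row: Row): Unit = {
    // Build one Put per row and buffer it instead of writing immediately.
    val put = new Put(Bytes.toBytes(row.getAs[String]("rowKey")))
    put.addColumn(Bytes.toBytes("c"), Bytes.toBytes("x"), Bytes.toBytes(row.getAs[String]("x")))
    puts.add(put)
  }

  override def close(errorOrNull: Throwable): Unit = {
    // One batched write per partition, then clean up.
    if (!puts.isEmpty) table.put(puts)
    table.close()
    connection.close()
  }
}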
2) Switch off the WAL (write-ahead log). The danger is that there is no recovery, but it will speed up writes, if you do not need to be able to recover the data.
Note: if you are using Solr or Cloudera Search on the HBase tables, you should not turn it off, since Solr works on the WAL. If you switch it off, Solr indexing won't work; this is a common mistake many of us make.
How to switch it off: https://hbase.apache.org/1.1/apidocs/org/apache/hadoop/hbase/client/Put.html#setWriteToWAL(boolean)
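In recent HBase client versions the per-Put durability setting is the usual way to do this; a minimal sketch (row key and columns omitted):

import org.apache.hadoop.hbase.client.{Durability, Put}
import org.apache.hadoop.hbase.util.Bytes

val p = new Put(Bytes.toBytes("some-row-key"))
// Skip the write-ahead log for this mutation: faster writes, but the data is
// lost if a region server fails before the memstore is flushed.
p.setDurability(Durability.SKIP_WAL)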
Basic architecture and a link for further study:
http://hbase.apache.org/book.html#perf.writing
As I mentioned, a list of Puts is a good way. This is the old way of doing it (foreachPartition with a list of Puts), from before Structured Streaming; the example is below. foreachPartition operates once per partition, not for every row.
def writeHbase(mydataframe: DataFrame) = {
  val columnFamilyName: String = "c"

  mydataframe.foreachPartition(rows => {
    val puts = new util.ArrayList[Put]
    rows.foreach(row => {
      val key = row.getAs[String]("rowKey")
      val p = new Put(Bytes.toBytes(key))

      val columnX = row.getAs[Double]("x")
      val columnY = row.getAs[Long]("y")

      p.addColumn(
        Bytes.toBytes(columnFamilyName),
        Bytes.toBytes("x"),
        Bytes.toBytes(columnX)
      )
      p.addColumn(
        Bytes.toBytes(columnFamilyName),
        Bytes.toBytes("y"),
        Bytes.toBytes(columnY)
      )

      puts.add(p)
    })
    HBaseUtil.putRows(hbaseZookeeperQuorum, hbaseTableName, puts)
  })
}
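HBaseUtil.putRows above is the answerer's own helper, not a library call; a minimal sketch of what such a helper might look like (error handling simplified):

import java.util
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}

object HBaseUtil {
  // Opens a connection, writes the whole batch of Puts in one call, and closes.
  def putRows(zookeeperQuorum: String, tableName: String, puts: util.List[Put]): Unit = {
    val conf = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.quorum", zookeeperQuorum)
    val connection = ConnectionFactory.createConnection(conf)
    try {
      val table = connection.getTable(TableName.valueOf(tableName))
      try table.put(puts)
      finally table.close()
    } finally {
      connection.close()
    }
  }
}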
To sum up:
What I feel is that we need to understand how Spark and HBase behave in order to make them an effective pair.

How to fetch data from MongoDB in Scala

I wrote the following code to fetch data from MongoDB:
import com.typesafe.config.ConfigFactory
import org.mongodb.scala.{ Document, MongoClient, MongoCollection, MongoDatabase }
import scala.concurrent.ExecutionContext

object MongoService extends Service {
  val conf = ConfigFactory.load()

  implicit val mongoService: MongoClient = MongoClient(conf.getString("mongo.url"))
  implicit val mongoDB: MongoDatabase = mongoService.getDatabase(conf.getString("mongo.db"))
  implicit val ec: ExecutionContext = ExecutionContext.global

  def getAllDocumentsFromCollection(collection: String) = {
    mongoDB.getCollection(collection).find()
  }
}
But when I try to get data from getAllDocumentsFromCollection, I don't get each document for further manipulation. Instead I get:
FindObservable(com.mongodb.async.client.FindIterableImpl@23555cf5)
UPDATED:
object MongoService {
  // My settings (see available connection options)
  val mongoUri = "mongodb://localhost:27017/smsto?authMode=scram-sha1"

  import ExecutionContext.Implicits.global // use any appropriate context

  // Connect to the database: Must be done only once per application
  val driver = MongoDriver()
  val parsedUri = MongoConnection.parseURI(mongoUri)
  val connection = parsedUri.map(driver.connection(_))

  // Database and collections: Get references
  val futureConnection = Future.fromTry(connection)
  def db1: Future[DefaultDB] = futureConnection.flatMap(_.database("smsto"))
  def personCollection = db1.map(_.collection("person"))

  // Write Documents: insert or update
  implicit def personWriter: BSONDocumentWriter[Person] = Macros.writer[Person]
  // or provide a custom one

  def createPerson(person: Person): Future[Unit] =
    personCollection.flatMap(_.insert(person).map(_ => {})) // use personWriter

  def getAll(collection: String) =
    db1.map(_.collection(collection))

  // Custom persistent types
  case class Person(firstName: String, lastName: String, age: Int)
}
I tried to use ReactiveMongo as well with the above code, but I couldn't make it work for getAll, and I'm getting the following error in createPerson.
Please suggest how I can get all the data from a collection.
This is likely too late for the OP, but hopefully the following methods of retrieving and iterating over collections using the mongo-scala driver can prove useful to others.
The Asynchronous Way - Iterating over documents asynchronously means you won't have to store an entire collection in-memory, which can become unreasonable for large collections. However, you won't have access to all your documents outside the subscribe code block for reuse. I'd recommend doing things asynchronously if you can, since this is how the mongo-scala driver was intended to be used.
db.getCollection(collectionName).find().subscribe(
  (doc: org.mongodb.scala.bson.Document) => {
    // operate on an individual document here
  },
  (e: Throwable) => {
    // do something with errors here, if desired
  },
  () => {
    // this signifies that you've reached the end of your collection
  }
)
The "Synchronous" Way - This is a pattern I use when my use-case calls for a synchronous solution, and I'm working with smaller collections or result-sets. It still uses the asynchronous mongo-scala driver, but it returns a list of documents and blocks downstream code execution until all documents are returned. Handling errors and timeouts may depend on your use case.
import org.mongodb.scala._
import org.mongodb.scala.bson.Document
import org.mongodb.scala.model.Filters
import scala.collection.mutable.ListBuffer

/* This function optionally takes filters if you do not wish to return the entire collection.
 * You could extend it to take other optional query params, such as org.mongodb.scala.model.{Sorts, Projections, Aggregates}
 */
def getDocsSync(db: MongoDatabase, collectionName: String, filters: Option[conversions.Bson]): ListBuffer[Document] = {
  val docs = scala.collection.mutable.ListBuffer[Document]()
  var processing = true

  val query = if (filters.isDefined) {
    db.getCollection(collectionName).find(filters.get)
  } else {
    db.getCollection(collectionName).find()
  }

  query.subscribe(
    (doc: Document) => docs.append(doc), // add doc to mutable list
    (e: Throwable) => throw e,
    () => processing = false
  )

  while (processing) {
    Thread.sleep(100) // wait here until all docs have been returned
  }

  docs
}
// sample usage of 'synchronous' method
val client: MongoClient = MongoClient(uriString)
val db: MongoDatabase = client.getDatabase(dbName)
val allDocs = getDocsSync(db, "myCollection", Option.empty)
val someDocs = getDocsSync(db, "myCollection", Option(Filters.eq("fieldName", "foo")))

Converting Guava futures to Twitter ones

I have a Finatra application that accesses Cassandra using the Datastax driver, which produces Guava futures. The conversion is done by the following code:
import com.google.common.util.concurrent.{FutureCallback, Futures, ListenableFuture}
import com.twitter.util.{Future, Promise}

implicit def toTwitterFuture[A](f: ListenableFuture[A]): Future[A] = {
  val p = Promise[A]()
  val callback = new FutureCallback[A] {
    override def onSuccess(result: A): Unit = p.setValue(result)
    override def onFailure(t: Throwable): Unit = p.setException(t)
  }
  Futures.addCallback(f, callback)
  p
}
The problem is that after this conversion, the rest of the request processing happens in a thread pool created by the Cassandra driver, potentially blocking other I/O tasks. How can I access Finagle's executor, where this kind of CPU-intensive work should be done?
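One direction (not from the original post) is to pass an explicit executor to Futures.addCallback, so the promise is completed on a pool the application owns instead of the driver's I/O threads; a minimal sketch, where the fixed thread pool is only a placeholder for whatever executor your setup provides:

import java.util.concurrent.Executors
import com.google.common.util.concurrent.{FutureCallback, Futures, ListenableFuture}
import com.twitter.util.{Future, Promise}

// An application-owned pool for CPU-bound continuations (illustrative).
val callbackExecutor = Executors.newFixedThreadPool(Runtime.getRuntime.availableProcessors())

implicit def toTwitterFuture[A](f: ListenableFuture[A]): Future[A] = {
  val p = Promise[A]()
  Futures.addCallback(f, new FutureCallback[A] {
    override def onSuccess(result: A): Unit = p.setValue(result)
    override def onFailure(t: Throwable): Unit = p.setException(t)
  }, callbackExecutor) // the promise is now completed on our pool, not the driver's
  p
}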

Spark Structured Streaming, multiple queries are not running concurrently

I slightly modified the example taken from here - https://github.com/apache/spark/blob/v2.2.0/examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredNetworkWordCount.scala
I added a second writeStream (sink):
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.{ForeachWriter, Row, SparkSession}

case class MyWriter1() extends ForeachWriter[Row] {
  override def open(partitionId: Long, version: Long): Boolean = true
  override def process(value: Row): Unit = {
    println(s"custom1 - ${value.get(0)}")
  }
  override def close(errorOrNull: Throwable): Unit = {}
}

case class MyWriter2() extends ForeachWriter[(String, Int)] {
  override def open(partitionId: Long, version: Long): Boolean = true
  override def process(value: (String, Int)): Unit = {
    println(s"custom2 - $value")
  }
  override def close(errorOrNull: Throwable): Unit = {}
}

object Main extends Serializable {
  def main(args: Array[String]): Unit = {
    println("starting")

    Logger.getLogger("org").setLevel(Level.OFF)
    Logger.getLogger("akka").setLevel(Level.OFF)

    val host = "localhost"
    val port = "9999"

    val spark = SparkSession
      .builder
      .master("local[*]")
      .appName("app-test")
      .getOrCreate()

    import spark.implicits._

    // Create DataFrame representing the stream of input lines from connection to host:port
    val lines = spark.readStream
      .format("socket")
      .option("host", host)
      .option("port", port)
      .load()

    // Split the lines into words
    val words = lines.as[String].flatMap(_.split(" "))

    // Generate running word count
    val wordCounts = words.groupBy("value").count()

    // Start running the query that prints the running counts to the console
    val query1 = wordCounts.writeStream
      .outputMode("update")
      .foreach(MyWriter1())
      .start()

    val ds = wordCounts.map(x => (x.getAs[String]("value"), x.getAs[Int]("count")))
    val query2 = ds.writeStream
      .outputMode("update")
      .foreach(MyWriter2())
      .start()

    spark.streams.awaitAnyTermination()
  }
}
Unfortunately, only the first query runs; the second never runs (MyWriter2 is never called).
Please advise what I'm doing wrong. According to the docs: you can start any number of queries in a single SparkSession, and they will all run concurrently, sharing the cluster resources.
I had the same situation (but on the newer Structured Streaming API) and in my case it helped to call awaitTermination() on the last streaming query.
Something like:
query1.start()
query2.start().awaitTermination()
Update:
Instead of the above, this built-in solution/method is better:
sparkSession.streams.awaitAnyTermination()
Are you using nc -lk 9999 to send data to Spark? Every query creates a connection to nc, but nc can only send data to the first connection (query). You can write a TCP server instead of nc; a sketch follows.
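A minimal sketch of such a server: it listens on port 9999 and writes every line typed on stdin to all currently connected clients (this is an illustrative helper, not from the original answer):

import java.io.PrintWriter
import java.net.ServerSocket
import scala.collection.mutable.ListBuffer
import scala.io.StdIn

object BroadcastServer {
  def main(args: Array[String]): Unit = {
    val server = new ServerSocket(9999)
    val clients = ListBuffer.empty[PrintWriter]

    // Accept connections in the background; each streaming query opens its own socket.
    new Thread(() => {
      while (true) {
        val socket = server.accept()
        clients.synchronized {
          clients += new PrintWriter(socket.getOutputStream, true)
        }
      }
    }).start()

    // Broadcast every stdin line to all connected clients.
    Iterator.continually(StdIn.readLine()).takeWhile(_ != null).foreach { line =>
      clients.synchronized(clients.foreach(_.println(line)))
    }
  }
}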
You are using .awaitAnyTermination(), which will terminate the application when the first stream returns; you have to wait for both of the streams to finish before terminating.
Something like this should do the trick:
query1.awaitTermination()
query2.awaitTermination()
What you have done is right! Just go ahead and check which scheduler your Spark framework is using. Most probably it is using the FIFO scheduler, which means the first query takes up all the resources. Change it to the FAIR scheduler and you should be good.
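For reference, the scheduling mode is a standard Spark setting; a minimal sketch of switching it to FAIR when building the session (applied to the example from the question):

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder
  .master("local[*]")
  .appName("app-test")
  .config("spark.scheduler.mode", "FAIR") // let both streaming queries share resources
  .getOrCreate()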