I have a document collection like:
{ "_id":"ABC", "job": {...} }
How do I read a batch of Jobs given a collection of IDs from MongoDB in Kotlin?
We search like this:
MongoClients.create(clientSettings).use { client ->
val db = client.getDatabase(dbName)
val coll = db.getCollection(collName)
val filter = Filters.`in`("_id", jobs.ids)
val res = coll.find(filter).map{ it.get("job") /* ??? */ }.asIterable()
return JobBatch(res)
}
Please see /* ??? */: how do I convert the Document to a Job class?
I have a problem to solve where data comes in as JSON from Kinesis like below:
{
  "datatype": "datatype_1",
  "id": "id_1",
  "data": {...}
}
Every record in the stream then needs to go through a lookup function, with datatype and id passed as arguments, to find the unique group of locations the item should be written to as JSON.
i.e.
def get_locations(id: String, datatype: String): Array[String] = //custom logic here
where the resultant array would look like
[ "s3:///example_bucket/example_folder_1", "s3:///example_bucket2/example_folder_2"]
My question is: how do I most efficiently group records coming off the stream by datatype and id and write them to the various S3 locations? I was hoping to do something like the below:
sparkSession.readStream.format("kinesis")
.option("streamName", kinesis_stream_name)
.option("initialPosition", "latest")
.option("region", aws_region)
.load()
//more transforms
.select(
col("datatype"),
col("id"),
col("data")
)
// Not sure how I can do what's below
// .write.partitionBy("id", "datatype")
// .format("json")
// .option("compression","gzip")
// .save(get_locations("id","datatype")) //saving to all locations in result array
As a best practice, I do advise you to create the bucket in code at runtime; you can use the Node.js AWS S3 API or the API of whatever language your runtime uses.
As you said in your comment, you are getting the parameters from your runtime.
However, as an answer to your question, here is a function that creates a bucket whose name contains the id (you can change it to whatever format you like). In that bucket you will then get many files, based on how the dataframe is partitioned while saving:
import java.util
import com.amazonaws.regions.Regions
import com.amazonaws.services.s3.model.{AmazonS3Exception, Bucket}
import com.amazonaws.services.s3.{AmazonS3, AmazonS3ClientBuilder}
def get_locations(id: String, datatype: String) = {
//configure the default region to whatever region is appropriate
val s3: AmazonS3 = AmazonS3ClientBuilder.standard.withRegion(Regions.DEFAULT_REGION).build
object CreateBucket {
def getBucket(bucket_name: String): Bucket = {
var named_bucket = null.asInstanceOf[Bucket]
val buckets: util.List[Bucket] = s3.listBuckets
import scala.collection.JavaConversions._
for (b <- buckets) {
if (b.getName.equals(bucket_name)) named_bucket = b
}
named_bucket
}
def createBucket(bucket_name: String): Bucket = {
var b = null.asInstanceOf[Bucket]
if (s3.doesBucketExistV2(bucket_name)) {
System.out.format("Bucket %s already exists.\n", bucket_name)
b = getBucket(bucket_name)
}
else try b = s3.createBucket(bucket_name)
catch {
case e: AmazonS3Exception =>
System.err.println(e.getErrorMessage)
}
b
}
}
//change your bucket name here if
//you like
val bucket_name = "bucket_" + id
CreateBucket.createBucket(bucket_name)
}
//I don't know how you will get those parameters
var id = " "
var datatype = " "
df.write.partitionBy("id", "datatype")
.format("json")
.option("compression", "gzip")
.save("s3://" + get_locations(id, datatype).getName)
Don't forget to add the dependency in Maven or in build.sbt, with the AWS SDK version you are already using:
<dependency>
<groupId>com.amazonaws</groupId>
<artifactId>aws-java-sdk-s3</artifactId>
<version>1.11.979</version>
</dependency>
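If the project uses sbt rather than Maven, the equivalent line in build.sbt (same illustrative version) would be:
libraryDependencies += "com.amazonaws" % "aws-java-sdk-s3" % "1.11.979"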
I followed this: Unable to send data to MongoDB using Kafka-Spark Structured Streaming, to send data from Spark Structured Streaming to MongoDB, and I implemented it successfully, but there is one issue.
With this process function:
override def process(record: Row): Unit = {
val doc: Document = Document(record.prettyJson.trim)
// lazy opening of MongoDB connection
ensureMongoDBConnection()
val result = collection.insertOne(doc)
if (messageCountAccum != null)
messageCountAccum.add(1)
}
the code executes without any problem, but no data is sent to MongoDB.
But if I add a print statement like this:
override def process(record: Row): Unit = {
val doc: Document = Document(record.prettyJson.trim)
// lazy opening of MongoDB connection
ensureMongoDBConnection()
val result = collection.insertOne(doc)
result.foreach(println) //print statement
if (messageCountAccum != null)
messageCountAccum.add(1)
}
the data gets inserted into MongoDB.
I don't know why.
The foreach is what initializes the writer sink; without it, your dataframe is never evaluated.
Try this:
val df = // your df here
val processed = df.map(r => process(r))
processed.count() // an action is needed to force evaluation, so process() runs for every row
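If the goal is also to make sure each write has completed inside process itself, one option (a sketch, not part of the original answer; the 10-second timeout is illustrative) is to block on the Observable returned by insertOne, since the MongoDB Scala driver only executes it once something subscribes, which is exactly what the extra foreach(println) was doing:
// inside the same ForeachWriter as in the question
// requires: import org.mongodb.scala._, scala.concurrent.Await, scala.concurrent.duration._
override def process(record: Row): Unit = {
  val doc: Document = Document(record.prettyJson.trim)
  ensureMongoDBConnection()
  // insertOne returns a lazy Observable; toFuture() subscribes to it and
  // Await.result forces the write to finish before process returns
  Await.result(collection.insertOne(doc).toFuture(), 10.seconds)
  if (messageCountAccum != null)
    messageCountAccum.add(1)
}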
I would like to create a generic update method that takes as input the object's _id and a String (JSON) describing the update to apply.
I need to transform the inputDocument variable into a Document so it can be passed to the update method.
I need this generic way of typing it because I would like to use this method on any field of the collection.
How can I achieve this?
def updateField(_id : String, inputDocument : String): Future[UpdateResult] = {
/* inputDocument = {"key" : value}*/
val mongoClient = MongoClient("mongodb://localhost:27017")
val database: MongoDatabase = mongoClient.getDatabase("databaseName")
val collection: MongoCollection[Document] = database.getCollection("collectionName")
val updateDocument = Document("$set" -> inputDocument)
collection
.updateOne(Filters.eq("_id", BsonObjectId(_id)), updateDocument)
.toFuture()
}
My thinking about handling the inputDocument variable was not correct.
It is indeed necessary to transform the input into a Document.
org.mongodb.scala.bson.collection.mutable.Document has an apply() method that parses a JSON string into a Document.
Thank you all for your comments.
def updateField(_id : String, inputDocument : String): Future[UpdateResult] = {
/* inputDocument = {"key" : value}*/
val mongoClient = MongoClient("mongodb://localhost:27017")
val database: MongoDatabase = mongoClient.getDatabase("databaseName")
val collection: MongoCollection[Document] = database.getCollection("collectionName")
val updateDocument = Document("$set" -> Document(inputDocument))
collection
.updateOne(Filters.eq("_id", BsonObjectId(_id)), updateDocument)
.toFuture()
}
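For reference, a hypothetical call (both the ObjectId hex string and the "status" field are made up for illustration) would look like:
// set the "status" field of the document whose _id matches the given ObjectId
val result: Future[UpdateResult] =
  updateField("507f1f77bcf86cd799439011", """{ "status": "DONE" }""")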
I wrote the following code to fetch data from MongoDB:
import com.typesafe.config.ConfigFactory
import org.mongodb.scala.{ Document, MongoClient, MongoCollection, MongoDatabase }
import scala.concurrent.ExecutionContext
object MongoService extends Service {
val conf = ConfigFactory.load()
implicit val mongoService: MongoClient = MongoClient(conf.getString("mongo.url"))
implicit val mongoDB: MongoDatabase = mongoService.getDatabase(conf.getString("mongo.db"))
implicit val ec: ExecutionContext = ExecutionContext.global
def getAllDocumentsFromCollection(collection: String) = {
mongoDB.getCollection(collection).find()
}
}
But when I try to get data from getAllDocumentsFromCollection, I don't get the individual documents for further manipulation. Instead I get:
FindObservable(com.mongodb.async.client.FindIterableImpl@23555cf5)
UPDATED:
import reactivemongo.api.{DefaultDB, MongoConnection, MongoDriver}
import reactivemongo.bson.{BSONDocumentWriter, Macros}
import scala.concurrent.{ExecutionContext, Future}

object MongoService {
// My settings (see available connection options)
val mongoUri = "mongodb://localhost:27017/smsto?authMode=scram-sha1"
import ExecutionContext.Implicits.global // use any appropriate context
// Connect to the database: Must be done only once per application
val driver = MongoDriver()
val parsedUri = MongoConnection.parseURI(mongoUri)
val connection = parsedUri.map(driver.connection(_))
// Database and collections: Get references
val futureConnection = Future.fromTry(connection)
def db1: Future[DefaultDB] = futureConnection.flatMap(_.database("smsto"))
def personCollection = db1.map(_.collection("person"))
// Write Documents: insert or update
implicit def personWriter: BSONDocumentWriter[Person] = Macros.writer[Person]
// or provide a custom one
def createPerson(person: Person): Future[Unit] =
personCollection.flatMap(_.insert(person).map(_ => {})) // use personWriter
def getAll(collection: String) =
db1.map(_.collection(collection))
// Custom persistent types
case class Person(firstName: String, lastName: String, age: Int)
}
I tried to use ReactiveMongo as well, with the code above, but I couldn't make getAll work, and I get the following error in createPerson.
Please suggest how I can get all the data from a collection.
This is likely too late for the OP, but hopefully the following methods of retrieving and iterating over collections using the mongo-scala driver will prove useful to others.
The Asynchronous Way - Iterating over documents asynchronously means you won't have to store an entire collection in-memory, which can become unreasonable for large collections. However, you won't have access to all your documents outside the subscribe code block for reuse. I'd recommend doing things asynchronously if you can, since this is how the mongo-scala driver was intended to be used.
db.getCollection(collectionName).find().subscribe(
(doc: org.mongodb.scala.bson.Document) => {
// operate on an individual document here
},
(e: Throwable) => {
// do something with errors here, if desired
},
() => {
// this signifies that you've reached the end of your collection
}
)
The "Synchronous" Way - This is a pattern I use when my use-case calls for a synchronous solution, and I'm working with smaller collections or result-sets. It still uses the asynchronous mongo-scala driver, but it returns a list of documents and blocks downstream code execution until all documents are returned. Handling errors and timeouts may depend on your use case.
import org.mongodb.scala._
import org.mongodb.scala.bson.{Document, conversions}
import org.mongodb.scala.model.Filters
import scala.collection.mutable.ListBuffer
/* This function optionally takes filters if you do not wish to return the entire collection.
* You could extend it to take other optional query params, such as org.mongodb.scala.model.{Sorts, Projections, Aggregates}
*/
def getDocsSync(db: MongoDatabase, collectionName: String, filters: Option[conversions.Bson]): ListBuffer[Document] = {
val docs = scala.collection.mutable.ListBuffer[Document]()
var processing = true
val query = if (filters.isDefined) {
db.getCollection(collectionName).find(filters.get)
} else {
db.getCollection(collectionName).find()
}
query.subscribe(
(doc: Document) => docs.append(doc), // add doc to mutable list
(e: Throwable) => throw e,
() => processing = false
)
while (processing) {
Thread.sleep(100) // wait here until all docs have been returned
}
docs
}
// sample usage of 'synchronous' method
val client: MongoClient = MongoClient(uriString)
val db: MongoDatabase = client.getDatabase(dbName)
val allDocs = getDocsSync(db, "myCollection", Option.empty)
val someDocs = getDocsSync(db, "myCollection", Option(Filters.eq("fieldName", "foo")))
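For smaller result sets there is also a shortcut worth noting (a sketch; the 30-second timeout is arbitrary): the driver can convert the FindObservable into a Future of all documents, which avoids the manual subscribe/poll loop:
import org.mongodb.scala._
import scala.concurrent.Await
import scala.concurrent.duration._

// find() returns a FindObservable; toFuture() collects every document into a Seq
val allDocsViaFuture: Seq[Document] =
  Await.result(db.getCollection("myCollection").find().toFuture(), 30.seconds)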
class Employee(tag: Tag) extends Table[table_types.user](tag, "EMPLOYEE") {
def employeeID = column[Int]("EMPLOYEE_ID")
def empName = column[String]("NAME")
def startDate = column[String]("START_DATE")
def * = (employeeID, empName, startDate)
}
object employeeHandle {
def insert(emp:Employee):Future[Any] = {
val dao = new SlickPostgresDAO
val db = dao.db
val insertdb = DBIO.seq(employee += (emp))
db.run(insertdb)
}
}
Then I insert a million employee records into the database:
object Hello extends App {
val employees = List(/* 1 million employee records */)
for (employee <- employees) {
employeeHandle.insert(employee)
// code to call the REST API to confirm the entry
}
}
However, when I run the above code I soon run out of connections to Postgres. How can I do it in parallel (in a non-blocking way) while ensuring I don't run out of Postgres connections?
I think you don't need to do it in parallel; I don't see how that would solve the problem. Instead, you could simply create the connection once before you start the loop and pass it to employeeHandle.insert(db, employee).
Something like this (I don't know Scala):
object Hello extends App {
val dao = new SlickPostgresDAO
val db = dao.db
val employees = List(/* 1 million employee records */)
for (employee <- employees) {
employeeHandle.insert(db, employee)
// code to call the REST API to confirm the entry
}
}
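The matching change to employeeHandle, so that it reuses the db passed in instead of creating a new DAO on every call, could look roughly like this (a sketch; the Database type is the one exposed by your Slick profile's api, and employee is the TableQuery from the question):
object employeeHandle {
  // db is created once by the caller and shared across all inserts
  def insert(db: Database, emp: Employee): Future[Any] = {
    val insertdb = DBIO.seq(employee += emp)
    db.run(insertdb)
  }
}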
Almost all of the Slick insert examples I have come across use blocking to fulfil the results. It would be nice to have one that doesn't.
My take on it:
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object Hello extends App {
val employees = List(/* 1 million employee records */)
val groupedList = employees.grouped(10).toList
insertTests()
def insertTests(l: List[List[Employee]] = groupedList): Unit = {
val ins = l.head
val futures = ins.map { emp => employeeHandle.insert(emp) }
val seq = Future.sequence(futures)
Await.result(seq, Duration.Inf)
if (l.tail.nonEmpty) insertTests(l.tail)
}
}
Also, the connection setup in employeeHandle should live outside the insert method:
object employeeHandle {
val dao = new SlickPostgresDAO
val db = dao.db
def insert(emp:Employee):Future[Any] = {
val insertdb = DBIO.seq(employee += (emp))
db.run(insertdb)
}
}
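If the Await.result above is still undesirable (the complaint was that most examples block), one way to keep only one batch of ten in flight at a time without blocking a thread is to chain the batches with flatMap instead. A rough sketch, reusing the same employeeHandle:
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

def insertAll(batches: List[List[Employee]]): Future[Unit] =
  batches.foldLeft(Future.successful(())) { (prev, batch) =>
    // start the next batch only after the previous one finishes,
    // so at most batch.size connections are in use at any time
    prev.flatMap(_ => Future.sequence(batch.map(emp => employeeHandle.insert(emp))).map(_ => ()))
  }

// usage: insertAll(employees.grouped(10).toList)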