Reading documents from a MongoDB collection and writing them locally using Scala - mongodb

I am using the MongoDB Scala driver to access a MongoDB collection and "download" the documents it contains to a JSON file. The collection holds documents in the hundreds of thousands, yet a similar task in Python (downloading, building a pd.DataFrame and storing it as CSV) takes 10 minutes, while doing it in spark-shell easily takes over an hour. I sense I am doing something very wrong, but can't figure out what.
val documents = db.getCollection("collectionName").find().toFuture()

documents.onComplete {
  case Success(docs) => {
    // without partitioning, tasks would become too large, resulting in most docs being written as empty
    val rdd = sc.parallelize(docs, 1000)
    val ds = spark.read.json(rdd.map(_.toJson).toDS)
    ds.write.mode("overwrite").json("/path/to/data/data.json")
  }
  case Failure(e) => println("Error")
}
Is there a better way to do this?
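One possible simplification, sketched here under the assumption that the result still fits in memory and a JSON-lines file is acceptable, is to skip Spark entirely and write the driver's output straight to disk, avoiding the parse/serialise round trip that spark.read.json / write.json introduces:

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
import scala.collection.JavaConverters._
import scala.concurrent.Await
import scala.concurrent.duration.Duration

// Pull the documents once the Future completes (same collection and path as above).
val docs = Await.result(
  db.getCollection("collectionName").find().toFuture(),
  Duration.Inf)

// Write one JSON document per line.
Files.write(
  Paths.get("/path/to/data/data.json"),
  docs.map(_.toJson()).asJava,
  StandardCharsets.UTF_8)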

Related

MongoDB countDocuments() very slow in Scala as compared to MongoDB Compass

I am trying to make MongoDB count documents based on a where clause.
// headResult() is a helper defined on Observable elsewhere in the code:
def headResult(): C = Await.result(observable.head(), Duration(10, TimeUnit.SECONDS))

val database: MongoDatabase = mongoClient.getDatabase("dbname")
val collection: MongoCollection[Document] = database.getCollection("tablename")
val recordCount = collection.countDocuments()
  .headResult()
This query returns the count as 766,782, but it takes 2 to 2.5 seconds. When I make the same query through MongoDB Compass it takes 0.2 seconds.
db.tbltrackerdata.find({},{}).count()
Since the where clause is dynamic, I cannot precompute the count or maintain any metadata for it.
This isn't because of Scala. It's because db.collection.count({}) or db.collection.find({}).count() does not actually fetch the documents; it simply retrieves the document count from the collection's metadata.
The countDocuments method performs a proper aggregation, which is expected to be slower but is also more accurate. You can see this mentioned in the MongoDB documentation here:
https://docs.mongodb.com/manual/reference/method/db.collection.countDocuments/#mechanics
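If a metadata-based figure is acceptable in the unfiltered case, newer versions of the Scala driver also expose estimatedDocumentCount, which should behave like the fast count Compass uses. A small sketch, reusing the headResult helper from the question (availability depends on the driver version):

// Fast, approximate count from collection metadata; no filter can be applied.
val fastCount: Long = collection.estimatedDocumentCount().headResult()

// The accurate, filterable count stays with countDocuments, e.g.
// (with import org.mongodb.scala.model.Filters):
// collection.countDocuments(Filters.equal("someField", "someValue")).headResult()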

Updating Data in MongoDB from Apache Spark Streaming

I am using the Scala API of Apache Spark Streaming to read from a Kafka server with a window size of one minute and a slide interval of one minute.
The messages from Kafka contain a timestamp from the moment they were sent and an arbitrary value. Each of the values is supposed to be reduced with reduceByKeyAndWindow and saved to MongoDB.
val messages = stream.map(record => (record.key, record.value.toDouble))
val reduced = messages.reduceByKeyAndWindow((x: Double, y: Double) => x + y,
  Seconds(60), Seconds(60))

reduced.foreachRDD({ rdd =>
  import spark.implicits._
  val aggregatedPower = rdd.map({ x => MyJsonObj(x._2, x._1) }).toDF()
  aggregatedPower.write.mode("append").mongo()
})
This works so far; however, it is possible that some messages arrive with a delay of a minute, which leads to two JSON objects with the same timestamp in the database.
{"_id":"5aa93a7e9c6e8d1b5c486fef","power":6.146849997,"timestamp":"2018-03-14 15:00"}
{"_id":"5aa941df9c6e8d11845844ae","power":5.0,"timestamp":"2018-03-14 15:00"}
The documentation of the mongo-spark-connector didn't help me find a solution.
Is there a smart way to query whether the timestamp in the current window is already in the database and if so update this value?
It seems that what you're looking for is a MongoDB operation called upsert, where an update operation inserts a new document if the criteria have no match, and updates the fields if there is a match.
If you are using the MongoDB Connector for Spark v2.2+, whenever a Spark DataFrame contains an _id field the data will be upserted: any existing documents with the same _id value are updated, and documents whose _id value is not yet in the collection are inserted.
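As a minimal sketch of that behaviour (the deterministic _id derived from the timestamp and the PowerDoc case class are assumptions, and a late value would still have to be merged into power before writing, as described next):

import com.mongodb.spark.sql._   // provides .mongo() on DataFrameWriter, as used above

case class PowerDoc(_id: String, power: Double, timestamp: String)

reduced.foreachRDD { rdd =>
  import spark.implicits._
  // Use the timestamp as a deterministic _id so repeated windows hit the same document.
  val df = rdd.map { case (timestamp, power) => PowerDoc(timestamp, power, timestamp) }.toDF()
  // Because the DataFrame carries _id, matching documents are replaced
  // instead of a second document being appended.
  df.write.mode("append").mongo()
}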
Now you could try to create an RDD using MongoDB Spark Aggregation, specifying a $match filter to query where timestamp matches the current window:
val aggregatedRdd = rdd.withPipeline(Seq(
  Document.parse(
    "{ $match: { timestamp : '2018-03-14 15:00' } }"
  )))
Modify the value of the power field, and then write with mode("append").
You may also find the blog post Data Streaming MongoDB/Kafka useful if you would like to write a Kafka consumer and insert directly into MongoDB, applying your logic with the MongoDB Java Driver.

Spark, mapPartitions, network connection is closed before map operation is finished

I am running a Spark job, and at some point I want to connect to an Elasticsearch server to get some data and add it to an RDD. The code I am using looks like this:
input.mapPartitions(records => {
  val elcon = new ElasticSearchConnection
  val client: TransportClient = elcon.openConnection()
  val newRecs = records.flatMap(record => {
    val response = client.prepareGet("index", "indexType",
      record.id.toString).execute().actionGet()
    val newRec = processRec(record, response)
    newRec
  }) // end of flatMap
  client.close()
  newRecs
}) // end of mapPartitions
My problem is that client.close() is called before the flatMap operation has finished, which of course results in an exception. The code works if I move the creation and closing of the connection inside the flatMap, but that would open a huge number of connections. Is it possible to make sure that client.close() is called after the flatMap operation has finished?
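One way to keep the per-partition client, sketched here under the assumption that each partition's results fit in memory, is to force the lazy iterator before closing the connection:

import org.elasticsearch.client.transport.TransportClient

input.mapPartitions { records =>
  val elcon = new ElasticSearchConnection
  val client: TransportClient = elcon.openConnection()
  try {
    // flatMap over an Iterator is lazy; toList forces it while the client is still open.
    records.flatMap { record =>
      val response = client.prepareGet("index", "indexType", record.id.toString)
        .execute().actionGet()
      processRec(record, response)
    }.toList.iterator
  } finally {
    client.close()
  }
}

That said, the join-based approach below avoids the per-document blocking calls altogether.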
Making a blocking call for each item in your RDD to fetch the corresponding Elasticsearch document is what causes the problem. It is generally advised to avoid blocking calls.
There is an alternative approach using Elasticsearch-for-Hadoop's Spark support:
Read the Elasticsearch index/type as another RDD and join it with your RDD.
Include the right version of the ES-Hadoop dependency.
import org.elasticsearch.spark._

// Returns a pair RDD of (_id, Map of all key/value pairs for all fields)
val esRdd = sc.esRDD("index/indexType")
// Convert your RDD of records to a pair RDD of (id, record), since the join is on the id
val inputPairs = input.map(record => (record.id, record))
// Join on the id; the result is a pair RDD of (id, (inputRecord, esRecord))
inputPairs.join(esRdd).map(rec => processResponse(rec._2._1, rec._2._2))
Hope this helps.
PS: This still does not alleviate the lack of co-location (i.e., each document with a given _id may come from a different shard of the index). A better approach would have been to achieve co-location of the input RDD and the ES index's documents at the time the ES index is created.

elasticsearch-spark connector size limit parameter is ignored in query

I'm trying to query Elasticsearch with the elasticsearch-spark connector, and I want to return only a few results.
For example:
val conf = new SparkConf().set("es.nodes","localhost").set("es.index.auto.create", "true").setMaster("local")
val sparkContext = new SparkContext(conf)
val query = "{\"size\":1}"
println(sparkContext.esRDD("index_name/type", query).count())
However this will return all the documents in the index.
Some parameters in the query are actually ignored by design, such as from, size, fields, etc.
They are used internally inside the elasticsearch-spark connector.
Unfortunately this list of unsupported parameters isn't documented. But if you wish to use the size parameter you can always rely on the pushdown predicate and use the DataFrame/Dataset limit method.
So you ought to use the Spark SQL DSL instead, e.g.:
val df = sqlContext.read.format("org.elasticsearch.spark.sql")
  .option("pushdown", "true")
  .load("index_name/doc_type")
  .limit(10) // instead of size: 10
This query will return the first 10 documents returned by the match_all query that is used by default by the connector.
Note: The following isn't correct on any level.
This is actually on purpose. Since the connector does a parallel query, it also looks at the number of documents being returned so if the user specifies a parameter, it will overwrite it according to the es.scroll.limit setting (see the configuration option).
When you query Elasticsearch directly, it also runs the query in parallel on all the index shards, without overwriting those parameters.
If I understand this correctly, you are executing a count operation, which does not return any documents. Do you expect it to return 1 because you specified size: 1? That's not happening, which is by design.
Edited to add:
This is the definition of count() in elasticsearch-hadoop:
override def count(): Long = {
  val repo = new RestRepository(esCfg)
  try {
    return repo.count(true)
  } finally {
    repo.close()
  }
}
It does not take the query into account at all, but considers the entire ES index as the RDD input.
This is actually on purpose. Since the connector does a parallel query, it also looks at the number of documents being returned so if the user specifies a parameter, it will overwrite it according to the es.scroll.limit setting (see the configuration option).
In other words, if you want to control the size, do so through that setting as it will always take precedence.
Beware that this parameter applies per shard. So, if you have 5 shards you might get up to five hits if this parameter is set to 1.
See https://www.elastic.co/guide/en/elasticsearch/hadoop/master/configuration.html
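A minimal sketch of controlling the size through that setting, using the same local setup as the question (the limit value and index name are illustrative):

import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._

val conf = new SparkConf()
  .setMaster("local")
  .set("es.nodes", "localhost")
  .set("es.scroll.limit", "1")   // applies per shard/partition

val sc = new SparkContext(conf)

// esRDD yields (documentId, fieldMap) pairs; with 5 shards, expect up to 5 hits.
val hits = sc.esRDD("index_name/type").collect()
println(hits.length)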

Upsert many records using ReactiveMongo and Scala

I am writing a DAO Actor for MongoDB that uses ReactiveMongo. I want to implement some very simple CRUD operations, among which the ability to upsert many records in one shot. Since I have a reactive application (built on Akka), it's important for me to have idempotent actions, so I need the operation to be an upsert, not an insert.
So far I have the following (ugly) code to do so:
case class UpsertResult[T](nUpd: Int, nIns: Int, failed: List[T])

def upsertMany[T](l: List[T], collection: BSONCollection)
                 (implicit ec: ExecutionContext, w: BSONDocumentWriter[T]): Future[UpsertResult[T]] = {
  Future.sequence(l.map(o => collection.save(o).map(r => (o, r))))
    .transform({ results =>
      val failed: List[T] = results.filter(!_._2.ok).unzip._1
      val nUpd = results.count(_._2.updatedExisting)
      UpsertResult(nUpd, results.size - nUpd - failed.size, failed)
    }, t => t)
}
Is there an out-of-the-box way of upserting many records at once using the reactivemongo API alone?
I am a MongoDB beginner so this might sound trivial to many. Any help is appreciated!
MongoDB has no support for upserting multiple documents in one query; the update operation, for example, can only ever insert up to one new document. So this is not a flaw in the ReactiveMongo driver, there simply is no DB command to achieve the result you expect. Iterating over the documents you want to upsert is the right way to do it.
The MongoDB manual on upserts contains further information:
http://docs.mongodb.org/manual/core/update/#update-operations-with-the-upsert-flag
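If you prefer an explicit upsert over save, a per-document sketch using update(..., upsert = true) might look roughly like this; the import paths match older ReactiveMongo releases and may differ in yours, and the selector is left to the caller since it depends on your id field:

import reactivemongo.api.collections.default.BSONCollection
import reactivemongo.bson.{BSONDocument, BSONDocumentWriter}
import reactivemongo.core.commands.LastError
import scala.concurrent.{ExecutionContext, Future}

// Upsert a single document: insert it if the selector matches nothing,
// otherwise overwrite the matched document.
def upsertOne[T](o: T, selector: BSONDocument, collection: BSONCollection)
                (implicit ec: ExecutionContext, w: BSONDocumentWriter[T]): Future[LastError] =
  collection.update(selector, w.write(o), upsert = true)

// Many documents still mean iterating, as in the question.
def upsertAll[T](docs: List[(BSONDocument, T)], collection: BSONCollection)
                (implicit ec: ExecutionContext, w: BSONDocumentWriter[T]): Future[List[LastError]] =
  Future.sequence(docs.map { case (selector, o) => upsertOne(o, selector, collection) })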
According to the docs, BSONCollection.save inserts the document, or updates it if it already exists in the collection: see here. Now, I'm not sure exactly how it decides whether the document already exists: presumably it's based on what MongoDB tells it, so the primary key/_id or a unique index.
In short: I think you're doing it the right way (including your result counts from LastError).