Error: com.mongodb.MongoCursorNotFoundException while reading from MongoDB into Spark

I am trying to read a MongoDB collection with 234 million records into Spark, and I want only one field.
case class Linkedin_Profile(experience : Array[Experience])
case class Experience(company : String)
val rdd = MongoSpark.load(sc, ReadConfig(Map("uri" -> mongo_uri_linkedin)))
val company_DS = rdd.toDS[Linkedin_Profile]()
val count_udf = udf((x: scala.collection.mutable.WrappedArray[String]) => {x.filter( _ != null).groupBy(identity).mapValues(_.size)})
val company_ColCount = company_DS.select(explode(count_udf($"experience.company")))
val comp_rdd = company_ColCount.rdd  // assumed definition; comp_rdd was not defined in the original snippet
comp_rdd.saveAsTextFile("/dbfs/FileStore/chandan/intermediate_count_results.csv")
The job runs for about an hour, with half of the jobs completed, but after that it gives this error:
com.mongodb.MongoCursorNotFoundException:
Query failed with error code -5 and error message
'Cursor 8962537864706894243 not found on server cluster0-shard-01-00-i7t2t.mongodb.net:37017'
on server cluster0-shard-01-00-i7t2t.mongodb.net:37017
I tried changing the configuration as below, but to no avail.
System.setProperty("spark.mongodb.keep_alive_ms", "7200000")
Please suggest how to read this large collection.

The config property spark.mongodb.keep_alive_ms is meant to control the life of the client (see the connector documentation).
The issue you're experiencing seems to be related to server-side configuration. According to what's documented on this issue:
By specifying the cursorTimeoutMillis option, administrators can configure mongod or mongos to automatically remove idle client cursors after a specified interval. The timeout applies to all cursors maintained on a mongod or mongos, may be specified when starting the mongod or mongos, and may be modified at any time using the setParameter command.
So, try starting your mongod daemon with cursorTimeoutMillis specified explicitly, such as:
mongod --setParameter cursorTimeoutMillis=10800000
This instructs the server to keep cursors valid for 3 hours.
Although this may in theory get rid of the annoyance, it is still a good idea to get the reads to complete faster. You might want to look into limiting the data you actually load into Spark to only what you really need; for example, you could project just the one field on the MongoDB side (a rough sketch follows). There may be many other options worth looking into for tuning the read speed.
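As an illustration (a sketch, not from the original post, assuming the same sc and mongo_uri_linkedin as in the question), the connector's withPipeline lets you push a $project stage down to MongoDB so only experience.company is transferred, which also shortens how long each cursor has to stay alive:
import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.ReadConfig
import org.bson.Document

// Project only the needed field on the server side; _id is dropped as well.
val slimRdd = MongoSpark
  .load(sc, ReadConfig(Map("uri" -> mongo_uri_linkedin)))
  .withPipeline(Seq(Document.parse("""{ "$project": { "experience.company": 1, "_id": 0 } }""")))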

Yes, by specifying the cursorTimeoutMillis option you can avoid this.
But if you are not the administrator, you can cache the MongoRDD with an action first, and then do the rest of your work in the Spark environment.
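A minimal sketch of that idea, reusing the rdd from the question (names come from the post above, not verified against your setup):
import org.apache.spark.storage.StorageLevel

// Persist the RDD and force a full pass with an action, so later stages read
// from Spark's cache instead of keeping a MongoDB cursor open for hours.
rdd.persist(StorageLevel.MEMORY_AND_DISK)
rdd.count()  // the action that materializes the cache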

Related

Setting up MongoDB environment requirements for Parse Server

I have my instance running and am able to connect remotely; however, I'm stuck on where to set this parameter to false, since the documentation states that the default is true:
failIndexKeyTooLong
Setting 'failIndexKeyTooLong' is a three-step process:
1. Go to the command console in the Tools menu item for the admin database of your database instance. This command will only work on the admin database.
2. Once there, pick any command from the list and it will give you a short JSON text for that command.
3. Erase the command they provide (I chose 'ping') and enter the following JSON:
{
    "setParameter" : 1,
    "failIndexKeyTooLong" : false
}
Note if you are using a free plan at MongoLab: this will NOT work on the free plan; it only works with paid plans. If you have the free plan, you will not even see the admin database. HOWEVER, I contacted MongoLab and here is what they suggest:
Hello,
First of all, welcome to MongoLab. We'd be happy to help.
The failIndexKeyTooLong=false option is only necessary when your data
include indexed values that exceed the maximum key value length of
1024 bytes. This only occurs when Parse auto-indexes certain
collections, which can actually lead to incorrect query results. Parse
has updated their migration guide to include a bit more information
about this, here:
https://parse.com/docs/server/guide#database-why-do-i-need-to-set-failindexkeytoolong-false-
Chances are high that your migration will succeed without this
parameter being set. Can you please give that a try? If for any reason
it does fail, please let us know and we can help you on potential next
steps.
Our Dedicated and Shared Cluster plans
(https://mongolab.com/plans/pricing/) do provide the ability to toggle
this option, but because our free Sandbox plans are running on shared
server processes, with other Sandbox users, this parameter is not
configurable.
When launching your MongoDB server, you can set this parameter to false:
mongod --setParameter failIndexKeyTooLong=false
I have written an article that helps you set up Parse Server and all of its dependencies on your own server:
https://medium.com/#jcminarro/run-parse-server-on-your-own-server-using-digitalocean-b2a7d66e1205

RedisClient fails strategy

I'm using the Play framework with Scala. I also use the rediscala driver (this one: https://github.com/etaty/rediscala ) to communicate with Redis. If Redis doesn't contain the data, then my app looks for it in MongoDB.
When Redis fails or is simply unavailable for some reason, the application waits too long for a response. How can I implement a failover strategy in this case? I would like to stop requesting Redis if requests take too long, and start working with Redis again when it is back online.
To clarify the question, my code currently looks like the following:
private def getUserInfo(userName: String): Future[Option[UserInfo]] = {
  CacheRepository.getBaseUserInfo(userName) flatMap {
    case Some(userInfo) =>
      Logger.trace(s"AuthenticatedAction.getUserInfo($userName). User has been found in cache")
      Future.successful(Some(userInfo))
    case None =>
      getUserFromMongo(userName)
  }
}
I think you need to distinguish between the following cases (in order of their likelihood of occurrence):
1. No data in cache (Redis) - I guess in this case Redis will return very quickly and you have to get it from Mongo. In your code above you need to set the data in Redis after you get it from Mongo, so that you have it in the cache for subsequent calls (see the sketch after this list).
You need to wrap your RedisClient in application code that is aware of any disconnects/reconnects. Essentially have two states: first, when Redis is working properly; second, when Redis is down/slow.
2. Redis is slow - this could happen because of one of the following.
2.1. Network is slow: Again, you cannot do much about this other than return a message to your client. Going to Mongo is unlikely to resolve this if your network itself is slow.
2.2. Operation is slow: This happens if you are trying to get a lot of data or you are running a range query on a sorted set, for example. In this case you need to revisit the Redis data structure you are using and the amount of data you are storing in Redis. However, it looks like in your example this is not going to be an issue. Single Redis get operations are generally low latency on a LAN.
3. Redis node is not reachable - I'm not sure how often this is going to happen unless your network is down. In such a case you will have trouble connecting to MongoDB as well. I believe this can also happen when the node running Redis is down or its disk is full, etc., so you should handle this in your design. Having said that, the rediscala client will automatically detect any disconnects and reconnect automatically. I have personally done this: stopped Redis, upgraded the Redis version, and restarted it without touching my running client (JVM).
Finally, you can use a Future with a timeout (see Scala Futures - built in timeout?) in your program above. If the Future is not completed by the timeout, you can take your other action(s) (go to Mongo or return an error message to the user). Given that #1 and #2 are likely to happen much more frequently than #3, your timeout value should reflect these two cases. Given that #1 and #2 are fast on a LAN, you can start with a timeout value of 100ms.
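As an illustration of point 1, here is a rough write-back sketch (not from the original post). It mirrors getUserInfo above; toRedisHash is a hypothetical serializer that should produce whatever shape valueExtractor later reads back with hgetall:
private def getUserInfoWithWriteBack(userName: String): Future[Option[UserInfo]] =
  CacheRepository.getBaseUserInfo(userName) flatMap {
    case Some(userInfo) =>
      Future.successful(Some(userInfo))
    case None =>
      getUserFromMongo(userName) andThen {
        // fire-and-forget cache fill; toRedisHash is an assumed helper, not part of the post
        case scala.util.Success(Some(userInfo)) =>
          redisClient.hmset(s"user:$userName", toRedisHash(userInfo))
      }
  }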
Soumya Simanta provided a detailed answer, and I would just like to post the code I used for the timeout. The code requires the Play framework, which is used in my project.
private def get[B](key: String, valueExtractor: Map[String, ByteString] => Option[B], logErrorMessage: String): Future[Option[B]] = {
  val timeoutFuture = Promise.timeout(None, Duration(Settings.redisTimeout))
  val mayBeHaveData = redisClient.hgetall(key) map { value =>
    valueExtractor(value)
  } recover {
    case e =>
      Logger.info(logErrorMessage + e)
      None
  }
  // if a timeout occurs, then None will be the result of the method
  Future.firstCompletedOf(List(mayBeHaveData, timeoutFuture))
}

Get opcounters by database

I have a monitoring script that looks something like this
client = pymongo.MongoClient()
for database in client.database_names():
    iterator = client[database].command({"serverStatus": 1})["opcounters"].iteritems()
    for key, value in iterator:
        log(key, data=value, database=database)
This has been giving me the same results for every database. Looking at my graphs, I get data like this:
opcounters.command_per_second on test_database: 53.32K
opcounters.command_per_second on log_database: 53.32K
Obviously, "serverStatus" is indicative of the entire server, not just the database.
Is it possible to get opcounters for each database?
There are no per-database op counters, at least as of v2.8.0 or earlier. The op-counter structure used in serverStatus is a global one; each new count is recorded without the context of which db or collection was involved.
As a small aside, the collStats command does not have op statistics at all, so it won't be possible to calculate a database's total ops by aggregation either.
There's an open feature request you can watch/upvote in the MongoDB issue tracker: SERVER-2178: Track stats per db/collection.

Elixir / Elixir-mongo collection find breaks on Enum.to_list

This is my first go-around with Elixir, and I'm trying to make a simple web scraper that saves into MongoDB.
I've installed the elixir-mongo package and am able to insert into the database correctly. Sadly, I'm not able to retrieve the values that I have put into the DB.
Here is the error that I am getting:
** (Mix) Could not start application jobboard: exited in: JB.start(:normal, [])
** (EXIT) an exception was raised:
** (ArgumentError) argument error
(elixir) lib/enum.ex:1266: Enum.reduce/3
(elixir) lib/enum.ex:1798: Enum.to_list/1
(jobboard) lib/scraper.ex:8: JB.Scraper.scrape/0
(jobboard) lib/jobboard.ex:26: JB.start/2
(kernel) application_master.erl:272: :application_master.start_it_old/4
If I understand the source correctly, then the mongo library should implement reduce here:
https://github.com/checkiz/elixir-mongo/blob/13211a0c0c9bb5fed29dd2faf7a01342b4e97eb4/lib/mongo_find.ex#L78
Here are the relevant sections of my code:
# JB.Scraper
def scrape do
  urls = JB.ScrapedUrls.unscraped_urls
end

# JB.ScrapedUrls
def unscraped_urls do
  MongoService.find(%{scraped: false})
end

# MongoService
def find(statement) do
  collection |> Mongo.Collection.find(statement) |> Enum.to_list
end

defp collection do
  mongo = Mongo.connect!
  db = mongo |> Mongo.db("simply_hired_urls")
  db |> Mongo.Db.collection("urls")
end
As a bonus, if anyone can tell me how I can get around connecting to Mongo every time I make a new call, that would be awesome. :) I'm still figuring out FP.
Thanks!
Jon
I haven't used this library, but I just made a simple attempt with a simplified version of your code.
I've started with
Mongo.connect!
|> Mongo.db("test")
|> Mongo.Db.collection("foo")
|> Mongo.Collection.find(%{scraped: true})
|> Enum.to_list
This worked fine. Then I suspected that the problem occurs when too many connections are open, so I ran this test repeatedly, and then it failed with the same error you got. It failed consistently when trying to open the connection for the 2037th time. Looking at the mongodb log, I can tell that it can't open another connection:
[initandlisten] can't create new thread, closing connection
To fix this, I simply closed the connection after I converted the results to a list, using Mongo.Server.close/1. That fixed the problem.
As you noted yourself, this is not an optimal way of communicating with the database, and you'd be better off if you could reuse the connection for multiple queries.
A standard way of doing this is to hold on to the connection in a process, such as GenServer or an Agent. The connection becomes a part of the process state, and you can run multiple queries in that process over the same connection.
Obviously, if multiple client processes use a single database process, all queries will be serialized, and the database process then becomes a performance bottleneck. To deal with this, you could open a pool of processes, each one managing a distinct database connection. This can be done in a simple way with the poolboy library.
My suggestion is that you try implementing a single GenServer-based process that maintains the connection and runs queries (a rough sketch follows). Then see if your code works correctly, and when it does, try to use poolboy to deal with concurrent requests efficiently.
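Here is a sketch of such a GenServer (the module name is hypothetical; the database and collection names are reused from your snippets, and the exact elixir-mongo calls should be treated as an illustration for the version you're on):
defmodule JB.MongoWorker do
  use GenServer

  # Client API: start the worker once (e.g. from your supervision tree), then
  # run queries through it so every call reuses the same connection.
  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, :ok, Keyword.put_new(opts, :name, __MODULE__))
  end

  def find(criteria) do
    GenServer.call(__MODULE__, {:find, criteria})
  end

  # Server callbacks: the collection handle (and its connection) lives in the process state.
  def init(:ok) do
    collection =
      Mongo.connect!
      |> Mongo.db("simply_hired_urls")
      |> Mongo.Db.collection("urls")
    {:ok, collection}
  end

  def handle_call({:find, criteria}, _from, collection) do
    results =
      collection
      |> Mongo.Collection.find(criteria)
      |> Enum.to_list
    {:reply, results, collection}
  end
end
With that in place, MongoService.find/1 could simply delegate to JB.MongoWorker.find/1, and moving to a poolboy-managed pool of such workers later doesn't change the call sites much.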

how to understand read preferences in mongo

I'm just getting started with MongoDB. I am trying to understand how to set up my secondary database servers so that when there is no primary, the secondaries can be used to read data. I believe the read preference I'm going for is primaryPreferred.
Now that I kind of understand which of the read preferences I want to test out, I'm trying to understand how to set up my replica set for primaryPreferred.
I've been reading through the following documentation:
http://docs.mongodb.org/manual/tutorial/configure-replica-set-tag-sets/
Questions:
1. Is this the right doc to follow to set up read preferences?
2. Assuming that it is, I want to verify that the tag names/values can be anything I come up with. Specifically, the key used in the example, "dc", is NOT a keyword in Mongo. Is that correct?
3. Once I set up these tags, do I have to specify any settings in my client when connecting to the Mongo database? I'm using a PHP front end, and I found this:
http://php.net/manual/en/mongodb.setreadpreference.php
4. Can you confirm that these tags replace the rs.slaveOk() method?
Environment:
mongoDB version 2.6.5
replica set with 3 members - one primary and 2 secondary servers
1. Yes.
2. Yes.
3. Yes, but the link that you provided is only for readPreference. You also need to supply a custom writeConcern (extract from the link in the question):
db.users.insert( { id: "xyz", status: "A" }, { writeConcern: { w: "MultipleDC" } } )
Look into the PHP driver documentation for how to do that; a rough client-side sketch is included below.
4. Yes, you may skip the call to slaveOk() in this case (especially since, in 95% of cases, you will be reading from the primary).
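For question 3, a rough client-side sketch (hypothetical host names and replica set name, using the legacy mongo PHP extension that matches MongoDB 2.6; adapt to the driver version you actually run):
<?php
// Connect to the replica set; "rs0" and the host names are placeholders.
$m = new MongoClient(
    "mongodb://host1:27017,host2:27017,host3:27017",
    array("replicaSet" => "rs0")
);

// Prefer the primary, but allow secondaries matching your own tag set when it is down.
// The tag name/value ("dc" => "east") is whatever you configured, not a Mongo keyword.
$m->setReadPreference(
    MongoClient::RP_PRIMARY_PREFERRED,
    array(array("dc" => "east"))
);

$db = $m->selectDB("mydb");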