I'm trying to write from a Spark RDD to MongoDB using the mongo-spark-connector.
I'm facing two problems:
[main problem] I can't connect to Mongo if I define the host according to the documentation (using all instances in the Mongo replica set).
[secondary/related problem] If I connect to the primary only, I can write... but then I typically crash the primary while writing the first collection.
Environment:
mongo-spark-connector 1.1
spark 1.6
scala 2.10.5
First I'll set up a dummy example to demonstrate...
import org.bson.Document
import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.WriteConfig
import org.apache.spark.rdd.RDD
/**
* fake json data
*/
val recs: List[String] = List(
"""{"a": 123, "b": 456, "c": "apple"}""",
"""{"a": 345, "b": 72, "c": "banana"}""",
"""{"a": 456, "b": 754, "c": "cat"}""",
"""{"a": 876, "b": 43, "c": "donut"}""",
"""{"a": 432, "b": 234, "c": "existential"}"""
)
val rdd_json_str: RDD[String] = sc.parallelize(recs, 5)
val rdd_hex_bson: RDD[Document] = rdd_json_str.map(json_str => Document.parse(json_str))
Some values that won't change...
// credentials
val user = ???
val pwd = ???
// fixed values
val db = "db_name"
val replset = "replset_name"
val collection_name = "collection_name"
Here's what does NOT work... in this case "url" would look something like machine.unix.domain.org and "ip" would look like... well, an IP address.
This is how the documentation says to define the host... with every machine in the replica set.
val host = "url1:27017,url2:27017,url3:27017"
val host = "ip_address1:27017,ip_address2:27017,ip_address3:27017"
I can't get either of these to work. Using every permutation I can think of for the URI...
val uri = s"mongodb://${user}:${pwd}#${host}/${db}?replicaSet=${replset}"
val uri = s"mongodb://${user}:${pwd}#${host}/?replicaSet=${replset}"
val uri = s"mongodb://${user}:${pwd}#${replset}/${host}/${db}"
val uri = s"mongodb://${user}:${pwd}#${replset}/${host}/${db}.${collection_name}"
val uri = s"mongodb://${user}:${pwd}#${host}" // setting db, collection, replica set in WriteConfig
val uri = s"mongodb://${user}:${pwd}#${host}/${db}" // this works IF HOST IS PRIMARY ONLY; not for hosts as defined above
EDIT
More detail on the error messages... the errors take two forms...
form 1
typically includes java.net.UnknownHostException: machine.unix.domain.org
also, it comes back with server addresses in URL form even when they were defined as IP addresses
com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting
for a server that matches WritableServerSelector. Client view of cluster
state is {type=REPLICA_SET, servers=[{address=machine.unix.domain.org:27017,
type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketException:
machine.unix.domain.org}, caused by {java.net.UnknownHostException:
machine.unix.domain.org}}, {address=machine.unix.domain.org:27017,
type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketException:
machine.unix.domain.org}, caused by {java.net.UnknownHostException:
machine.unix.domain.org}}, {address=machine.unix.domain.org:27017,
type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketException:
machine.unix.domain.org}, caused by {java.net.UnknownHostException:
machine.unix.domain.org}}]
form 2
(authentication error... though connecting with the same credentials to the primary only works fine)
com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting
for a server that matches WritableServerSelector. Client view of cluster
state is {type=REPLICA_SET, servers=[{address=xx.xx.xx.xx:27017,
type=UNKNOWN, state=CONNECTING, exception=
{com.mongodb.MongoSecurityException: Exception authenticating
MongoCredential{mechanism=null, userName='xx', source='admin', password=
<hidden>, mechanismProperties={}}}, caused by
{com.mongodb.MongoCommandException: Command failed with error 18:
'Authentication failed.' on server xx.xx.xx.xx:27017. The full response is {
"ok" : 0.0, "errmsg" : "Authentication failed.", "code" : 18, "codeName" :
"AuthenticationFailed", "operationTime" : { "$timestamp" : { "t" :
1534459121, "i" : 1 } }, "$clusterTime" : { "clusterTime" : { "$timestamp" :
{ "t" : 1534459121, "i" : 1 } }, "signature" : { "hash" : { "$binary" :
"xxx=", "$type" : "0" }, "keyId" : { "$numberLong" : "123456" } } } }}}...
end EDIT
Here's what DOES work... on the dummy data only... more on that below...
val host = s"${primary_ip_address}:27017" // primary only
val uri = s"mongodb://${user}:${pwd}#${host}/${db}"
val writeConfig: WriteConfig =
WriteConfig(Map(
"uri" -> uri,
"database" -> db,
"collection" -> collection_name,
"replicaSet" -> replset))
// write data to mongo
MongoSpark.save(rdd_hex_bson, writeConfig)
This... connecting to the primary only... works great for dummy data, but crashes the primary for real data (50-100 GB from an RDD with 2700 partitions). My guess is that it opens too many connections at once... it looks like it opens ~900 connections to write (this jibes with a default parallelism of 2700 based on 900 virtual cores and a parallelism factor of 3x).
I'm guessing that if I repartition so it opens fewer connections, I'll have better luck... but I'm also guessing this ties in to writing to the primary only instead of spreading the writes over all instances.
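For what it's worth, here's that idea sketched out (the target of 32 partitions is an arbitrary illustration, not a tested number):

// coalesce to fewer partitions before saving, so fewer simultaneous
// connections are opened against the primary
val rdd_coalesced: RDD[Document] = rdd_hex_bson.coalesce(32)
MongoSpark.save(rdd_coalesced, writeConfig)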
I've read everything I can find here... but most examples are for single-instance connections... https://docs.mongodb.com/spark-connector/v1.1/configuration/#output-configuration
It turns out there were two problems here. From the original question, these were referenced as errors of 'form 1' and 'form 2'.
error of 'form 1' - solution
The gist of the problem turned out to be a bug in the mongo-spark-connector. It turns out that it can't connect to a replica set using IP addresses... it requires resolvable hostnames (URIs). Since the DNS servers in our cloud don't have these lookups, I got it working by modifying /etc/hosts on every executor and then using the connection string format like this:
val host = "URI1:27017,URI2:27017,URI3:27017"
val uri = s"mongodb://${user}:${pwd}#${host}/${db}?replicaSet=${replset}&authSource=${db}"
val writeConfig: WriteConfig =
WriteConfig(Map(
"uri"->uri,
"database"->db,
"collection"->collection,
"replicaSet"->replset,
"writeConcern.w"->"majority"))
This required first adding the following to /etc/hosts on every machine:
IP1 URI1
IP2 URI2
IP3 URI3
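With those entries in place on every executor, the save call itself is unchanged:

// same save as before, now against the full replica set
MongoSpark.save(rdd_hex_bson, writeConfig)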
Now of course, I can't figure out how to use bootstrap actions in AWS EMR to update /etc/hosts when the cluster spins up. But that's another question. (AWS EMR bootstrap action as sudo)
error of 'form 2' - solution
adding &authSource=${db} to the uri solved this.
Related
We have 4 MongoDB servers, of which the first one is currently the primary, with 3 replicas. If I specify all 4 servers in the connection string, it fails to connect at all, but if I just specify the first one, it connects fine. This is bad because if the first server fails, it will not be able to connect.
This works:
mongodb://login:password@server1:27017/admin?readPreference=Primary
This does NOT work:
mongodb://login:password@server1:27017,server2:27017,server3:27017,server4:27017/admin?readPreference=Primary
Exception:
A timeout occured after 30000ms selecting a server using CompositeServerSelector{ Selectors = WritableServerSelector, LatencyLimitingServerSelector{ AllowedLatencyRange = 00:00:00.0150000 } }. Client view of cluster state is { ClusterId : "1", ConnectionMode : "Automatic", Type : "ReplicaSet", State : "Connected", Servers : [{ ServerId: "{ ClusterId : 1, EndPoint : "Unspecified/server1:27017" }", EndPoint: "Unspecified/server1:27017", State: "Disconnected", Type: "Unknown", HeartbeatException: "MongoDB.Driver.MongoConnectionException: An exception occurred while opening a connection to the server.
The service trying to connect runs on Kube.
Any idea why this would be?
You need to add the replica set name to the connection string: replicaSet=myRepl
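Assuming the replica set is actually named myRepl (substitute your real set name), the full connection string would look like:
mongodb://login:password@server1:27017,server2:27017,server3:27017,server4:27017/admin?replicaSet=myRepl&readPreference=Primary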
I am trying to write data from Spark (using Databricks) to Azure Cosmos DB (MongoDB API). There are no errors when executing the notebook, but I am getting the below error when querying the collection.
I have used the jar from the Databricks website, azure-cosmosdb-spark_2.4.0_2.11-2.1.2-uber.jar. My Databricks runtime version is 6.5 (includes Apache Spark 2.4.5, Scala 2.11).
import org.joda.time.format._
import com.microsoft.azure.cosmosdb.spark.schema._
import com.microsoft.azure.cosmosdb.spark.CosmosDBSpark
import com.microsoft.azure.cosmosdb.spark.config.Config
import org.apache.spark.sql.functions._
val configMap = Map(
"Endpoint" -> "https://******.documents.azure.com:***/",
"Masterkey" -> "****==",
"Database" -> "dbname",
"Collection" -> "collectionname"
)
val config = Config(configMap)
val df = spark.sql("select id,name,address from emp_db.employee")
CosmosDBSpark.save(df, config)
When I query the collection, I get the below response:
Error: error: {
"_t" : "OKMongoResponse",
"ok" : 0,
"code" : 1,
"errmsg" : "Unknown server error occurred when processing this request.",
"$err" : "Unknown server error occurred when processing this request."
}
Any help would be much appreciated. Thank you!!!
That error suggests you are using Cosmos DB with the MongoDB API.
The Spark connector for Cosmos DB (azure-cosmosdb-spark) only supports accounts using the SQL API.
Instead you should use the MongoDB connector.
https://learn.microsoft.com/en-us/azure/cosmos-db/spark-connector
Use this instead: https://docs.mongodb.com/spark-connector/master/
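A minimal sketch of that switch, assuming the mongo-spark-connector 2.4.x for Scala 2.11 is attached to the cluster, and using the MongoDB API connection string from the Azure portal (the host, port, and query flags below are the usual Cosmos DB defaults; verify them against your account's Connection String blade):

import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.WriteConfig

// connection string copied from the Azure portal for the Mongo API account
val mongoUri = "mongodb://<account>:<primary-key>@<account>.mongo.cosmos.azure.com:10255/?ssl=true&replicaSet=globaldb"

val writeConfig = WriteConfig(Map(
  "uri" -> mongoUri,
  "database" -> "dbname",
  "collection" -> "collectionname"
))

val df = spark.sql("select id,name,address from emp_db.employee")
MongoSpark.save(df, writeConfig)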
I have created a dataset in Azure Data Factory to connect to MongoDB.
While configuring it, I added the MongoDB connection string, and it showed the connection as successful (as shown in the image view).
After that I configured the whole pipeline with the proper configuration.
Now I am facing an issue while running that pipeline.
I am getting the below error saying the MongoDB connection is not valid.
Operation on target Copydataset1 failed: Failure happened on 'Source' side. ErrorCode=UserErrorMongoDbConnectionTimeout,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Connection to MongoDB server is timeout. This is usually caused by invalid connection string.,Source=Microsoft.DataTransfer.Runtime.MongoDbV2Connector,''Type=System.TimeoutException,Message=A timeout occured after 30000ms selecting a server using CompositeServerSelector{ Selectors = ReadPreferenceServerSelector{ ReadPreference = { Mode : Primary } }, LatencyLimitingServerSelector{ AllowedLatencyRange = 00:00:00.0150000 } }. Client view of cluster state is { ClusterId : "1", ConnectionMode : "Automatic", Type : "Unknown", State : "Disconnected", Servers : [{ ServerId: "{ ClusterId : 1, EndPoint : "192.0.0.1:27017" }", EndPoint: "192.0.0.1:27017", State: "Disconnected", Type: "Unknown" }] }.,Source=MongoDB.Driver.Core,'
NOTE: The IP 192.0.0.1:27017 is only an example.
I am trying to use Cygnus with MongoDB, but no data has been persisted in the database.
Here is the notification got in cygnus:
15/07/21 14:48:01 INFO handlers.OrionRestHandler: Starting transaction (1437482681-118-0000000000)
15/07/21 14:48:01 INFO handlers.OrionRestHandler: Received data ({ "subscriptionId" : "55a73819d0c457bb20b1d467", "originator" : "localhost", "contextResponses" : [ { "contextElement" : { "type" : "enocean", "isPattern" : "false", "id" : "enocean:myButtonA", "attributes" : [ { "name" : "ButtonValue", "type" : "", "value" : "ON", "metadatas" : [ { "name" : "TimeInstant", "type" : "ISO8601", "value" : "2015-07-20T21:29:56.509293Z" } ] } ] }, "statusCode" : { "code" : "200", "reasonPhrase" : "OK" } } ]})
15/07/21 14:48:01 INFO handlers.OrionRestHandler: Event put in the channel (id=1454120446, ttl=10)
Here is my agent configuration:
cygnusagent.sources = http-source
cygnusagent.sinks = OrionMongoSink
cygnusagent.channels = mongo-channel
#=============================================
# source configuration
# channel name where to write the notification events
cygnusagent.sources.http-source.channels = mongo-channel
# source class, must not be changed
cygnusagent.sources.http-source.type = org.apache.flume.source.http.HTTPSource
# listening port the Flume source will use for receiving incoming notifications
cygnusagent.sources.http-source.port = 5050
# Flume handler that will parse the notifications, must not be changed
cygnusagent.sources.http-source.handler = com.telefonica.iot.cygnus.handlers.OrionRestHandler
# URL target
cygnusagent.sources.http-source.handler.notification_target = /notify
# Default service (service semantic depends on the persistence sink)
cygnusagent.sources.http-source.handler.default_service = def_serv
# Default service path (service path semantic depends on the persistence sink)
cygnusagent.sources.http-source.handler.default_service_path = def_servpath
# Number of channel re-injection retries before a Flume event is definitely discarded (-1 means infinite retries)
cygnusagent.sources.http-source.handler.events_ttl = 10
# Source interceptors, do not change
cygnusagent.sources.http-source.interceptors = ts gi
# TimestampInterceptor, do not change
cygnusagent.sources.http-source.interceptors.ts.type = timestamp
# GroupinInterceptor, do not change
cygnusagent.sources.http-source.interceptors.gi.type = com.telefonica.iot.cygnus.interceptors.GroupingInterceptor$Builder
# Grouping rules for the GroupingInterceptor, put the right absolute path to the file if necessary
# See the doc/design/interceptors document for more details
cygnusagent.sources.http-source.interceptors.gi.grouping_rules_conf_file = /home/egm_demo/usr/fiware-cygnus/conf/grouping_rules.conf
# ============================================
# OrionMongoSink configuration
# sink class, must not be changed
cygnusagent.sinks.mongo-sink.type = com.telefonica.iot.cygnus.sinks.OrionMongoSink
# channel name from where to read notification events
cygnusagent.sinks.mongo-sink.channel = mongo-channel
# FQDN/IP:port where the MongoDB server runs (standalone case) or comma-separated list of FQDN/IP:port pairs where the MongoDB replica set members run
cygnusagent.sinks.mongo-sink.mongo_hosts = 127.0.0.1:27017
# a valid user in the MongoDB server (or empty if authentication is not enabled in MongoDB)
cygnusagent.sinks.mongo-sink.mongo_username =
# password for the user above (or empty if authentication is not enabled in MongoDB)
cygnusagent.sinks.mongo-sink.mongo_password =
# prefix for the MongoDB databases
#cygnusagent.sinks.mongo-sink.db_prefix = kura
# prefix for the MongoDB collections
#cygnusagent.sinks.mongo-sink.collection_prefix = button
# true if collection names are based on a hash, false for human-readable collections
cygnusagent.sinks.mongo-sink.should_hash = false
# ============================================
# mongo-channel configuration
# channel type (must not be changed)
cygnusagent.channels.mongo-channel.type = memory
# capacity of the channel
cygnusagent.channels.mongo-channel.capacity = 1000
# amount of bytes that can be sent per transaction
cygnusagent.channels.mongo-channel.transactionCapacity = 100
Here is my rule:
{
"grouping_rules": [
{
"id": 1,
"fields": [
"button"
],
"regex": ".*",
"destination": "kura",
"fiware_service_path": "/kuraspath"
}
]
}
Any ideas of what I have missed? Thanks in advance for your help!
This configuration parameter is wrong:
cygnusagent.sinks = OrionMongoSink
According to your configuration, it must be mongo-sink (I mean, you are configuring a Mongo sink named mongo-sink when you configure lines such as cygnusagent.sinks.mongo-sink.type).
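That is, the declaration has to match the name used in the per-sink properties:

cygnusagent.sinks = mongo-sink

so that it lines up with lines like cygnusagent.sinks.mongo-sink.type = com.telefonica.iot.cygnus.sinks.OrionMongoSink.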
In addition, I would recommend not using the grouping rules feature; it is an advanced feature for sending the data to a collection different from the default one, and in a first stage I would play with the default behaviour. Thus, my recommendation is to leave the path to the file in cygnusagent.sources.http-source.interceptors.gi.grouping_rules_conf_file, but comment out all the JSON within it :)
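For instance (a sketch only; I have not verified this against every Cygnus version), an effectively disabled grouping rules file could keep the JSON skeleton with an empty rule set:

{
    "grouping_rules": []
}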
I am using an Ubuntu t1.micro EC2 instance and installed MongoDB 2.6.7 using this link: http://docs.mongodb.org/manual/tutorial/install-mongodb-on-ubuntu/
The problem I am facing is that I cannot access the replica set's PRIMARY member.
ServerAddress address0 = new ServerAddress("<public_ip1>", 27017);
ServerAddress address1 = new ServerAddress("<public_ip2>", 27018);
ServerAddress address2 = new ServerAddress("<public_ip3>", 27019);
I am getting MongoTimeoutException.
The issue here is: when I don't use the PRIMARY's server address and set ReadPreference to secondaryPreferred, I can read from the available SECONDARY.
And I can read (and even write to the PRIMARY) when I use any of these server addresses as an individual connection (which makes a direct, single-server connection with no replica set discovery):
MongoClient mongoClient = new MongoClient("<public_ip1>", 27017);
Replica Set configuration has been given below:
{
"_id" : "replicaSet",
"version" : 5,
"members" : [
{
"_id" : 0,
"host" : "ip-10-0-3-76:27017" //**private_ip**
},
{
"_id" : 1,
"host" : "ip-10-0-2-19:27018" //**private_ip**
},
{
"_id" : 2,
"host" : "ip-10-0-3-144:27019" //**private_ip**
}
]
}
There is no problem with the security configs either; I have allowed ALL traffic inbound and outbound.
Could anyone please help me solve this problem?
The error is given below:
Exception in thread "main" com.mongodb.MongoTimeoutException: Timed out after 10000 ms while waiting for a server that matches {serverSelectors=[ReadPreferenceServerSelector{readPreference=secondaryPreferred},
LatencyMinimizingServerSelector{acceptableLatencyDifference=15 ms}]}. Client view of cluster state is {type=ReplicaSet, servers=[{address=ip-10-0-2-19:27018, type=Unknown, state=Connecting, exception=
{com.mongodb.MongoException$Network: Exception opening the socket}, caused by {java.net.UnknownHostException: ip-10-0-2-19}}, {address=ip-10-0-3-10:27019, type=Unknown, state=Connecting, exception=
{com.mongodb.MongoException$Network: Exception opening the socket}, caused by {java.net.UnknownHostException: ip-10-0-3-10}}, {address=ip-10-0-3-76:27017, type=Unknown, state=Connecting, exception=
{com.mongodb.MongoException$Network: Exception opening the socket}, caused by {java.net.UnknownHostException: ip-10-0-3-76}}]
at com.mongodb.BaseCluster.getServer(BaseCluster.java:82)
at com.mongodb.DBTCPConnector.getServer(DBTCPConnector.java:656)
at com.mongodb.DBTCPConnector.access$500(DBTCPConnector.java:40)
at com.mongodb.DBTCPConnector$MyPort.getConnection(DBTCPConnector.java:505)
at com.mongodb.DBTCPConnector$MyPort.get(DBTCPConnector.java:448)
at com.mongodb.DBTCPConnector.innerCall(DBTCPConnector.java:284)
at com.mongodb.DBTCPConnector.call(DBTCPConnector.java:269)
at com.mongodb.DBCollectionImpl.find(DBCollectionImpl.java:84)
at com.mongodb.DB.command(DB.java:320)
at com.mongodb.DB.command(DB.java:299)
at com.mongodb.DBCollection.getCount(DBCollection.java:1269)
at com.mongodb.DBCursor.count(DBCursor.java:796)
at com.test.replicaSetTest.main(replicaSetTest.java:41)
Yes, you can use Elastic IP addresses. Attach one Elastic IP to each of the EC2 instances' network interfaces. The timeouts happen because the replica set config lists the members by their private hostnames (ip-10-0-3-76 and so on), which a client outside the VPC cannot resolve; that is exactly the java.net.UnknownHostException shown for each member in your stack trace. The members need to be reachable at addresses the client can resolve.
You can refer to this link: http://blog.mongodirector.com/best-practices-for-deploying-mongodb-on-ec2/