I am using the Scala API of Apache Spark Streaming to read from a Kafka server with a window size of one minute and a slide interval of one minute.
The messages from Kafka contain a timestamp from the moment they were sent and an arbitrary value. Each of the values is supposed to be reduced with reduceByKeyAndWindow and saved to MongoDB.
val messages = stream.map(record => (record.key, record.value.toDouble))
val reduced = messages.reduceByKeyAndWindow((x: Double, y: Double) => x + y,
  Seconds(60), Seconds(60))
reduced.foreachRDD({ rdd =>
  import spark.implicits._
  val aggregatedPower = rdd.map({ x => MyJsonObj(x._2, x._1) }).toDF()
  aggregatedPower.write.mode("append").mongo()
})
This works so far; however, it is possible that some messages arrive with a delay of up to a minute, which leads to two JSON objects with the same timestamp in the database.
{"_id":"5aa93a7e9c6e8d1b5c486fef","power":6.146849997,"timestamp":"2018-03-14 15:00"}
{"_id":"5aa941df9c6e8d11845844ae","power":5.0,"timestamp":"2018-03-14 15:00"}
The documentation of the mongo-spark-connector didn't help me find a solution.
Is there a smart way to query whether the timestamp in the current window is already in the database and if so update this value?
It seems that what you're looking for is a MongoDB operation called upsert: an update operation that inserts a new document if the criteria have no match, and updates the fields if there is a match.
If you are using the MongoDB Connector for Spark v2.2+, whenever a Spark DataFrame contains an _id field the data will be upserted. This means any existing documents with the same _id value are updated, and new documents whose _id value does not yet exist in the collection are inserted.
Now you could try to create an RDD using MongoDB Spark Aggregation, specifying a $match filter to query for documents whose timestamp matches the current window:
// Requires: import com.mongodb.spark.MongoSpark and import org.bson.Document.
// Load an RDD from MongoDB (spark.mongodb.input.uri must be set) so that withPipeline is available.
val mongoRdd = MongoSpark.load(sc)
val aggregatedRdd = mongoRdd.withPipeline(Seq(
  Document.parse(
    "{ $match: { timestamp : '2018-03-14 15:00' } }"
  )))
Modify the value of the power field, and then write with mode("append").
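For instance, a minimal sketch of that last step, building on the aggregatedRdd above (the delta value is only a placeholder; in the real job it would come from the (timestamp, power) pairs of the reduced stream):

import com.mongodb.spark.sql._                 // implicit helpers for .mongo(), if not already imported
import org.apache.spark.sql.functions.col

val delta = 5.0                                // newly reduced power for this window (placeholder)

// Because the _id field is carried along, mode("append") updates the matched documents
// (connector v2.2+) instead of inserting a second document with the same timestamp;
// timestamps that are not in the collection yet are still appended as before.
aggregatedRdd.toDF()
  .withColumn("power", col("power") + delta)
  .write.mode("append").mongo()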
You may also find the blog post Data Streaming MongoDB/Kafka useful if you would like to write a Kafka consumer and insert directly into MongoDB, applying your logic with the MongoDB Java Driver.
Related
I am using the MongoDB Scala driver, trying to access a MongoDB collection and "download" the documents contained within to a JSON file. The number of documents in the collection is in the hundreds of thousands, but a similar task in Python (downloading, creating a pd.DataFrame and storing it as CSV) takes 10 minutes, while doing it in spark-shell easily takes over an hour. I sense I am doing something very wrong, but can't figure out what.
import spark.implicits._   // needed for .toDS below

val documents = db.getCollection("collectionName").find().toFuture()

documents.onComplete {
  case Success(docs) => {
    // without partitioning tasks would become too large - resulting in most docs being written as empty
    val rdd = sc.parallelize(docs, 1000)
    val ds = spark.read.json(rdd.map(_.toJson).toDS)
    ds.write.mode("overwrite").json("/path/to/data/data.json")
  }
  case Failure(e) => println("Error")
}
Is there a better way to do this?
I am trying to read a huge, complex document from MongoDB into a Spark DataFrame. When I convert the data to JSON first, it works, but if I read directly from MongoDB I get the following error: Caused by: com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast STRING into a DoubleType (value: BsonString{value='NaN'})
I am able to read into a DataFrame and do all the processing, but I get the error when I try to show it or write it to JSON/CSV.
at mongo$.main(mongo.scala:270) – df.show()
Using sbt for dependencies:
mongo-spark-connector: 2.2.1
Scala version: 2.11.6, Spark version: 2.3.0/2.2.0
As described by the error, this happens because there is a String value of "NaN" in a field that is inferred in the Spark schema as Double.
Amongst all the documents there is a value for the field that is not a Double, for example:
{_id:1, foo: 100.00}
{_id:2, foo: 101.00}
{_id:3, foo: 102.00}
{_id:4, foo: 103.00}
...
{_id:99, foo: "NaN"}
{_id:100, foo: 200.00}
As you may know, "NaN" means "Not a Number". It is likely that whichever process created the document failed to insert a Double and defaulted to NaN.
There are a few ways to solve this, depending on your use case:
Utilise MongoDB Schema Validation to ensure that the values within the collection have the expected type on insert.
Perform a transformation to clean the data. Query the collection to find the offending fields, i.e. {foo: "NaN"}, and update them with a desired value, e.g. 0 (see the sketch below).
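For the second option, a rough sketch using the MongoDB Scala driver might look like this (the connection string, database, collection and the replacement value 0.0 are assumptions for illustration):

import org.mongodb.scala.MongoClient
import org.mongodb.scala.model.Filters.equal
import org.mongodb.scala.model.Updates.set

val client     = MongoClient("mongodb://localhost:27017")
val collection = client.getDatabase("myDatabase").getCollection("myCollection")

// Replace every string "NaN" stored in foo with a numeric default,
// so that Spark can safely infer foo as a Double afterwards.
collection
  .updateMany(equal("foo", "NaN"), set("foo", 0.0))
  .toFuture()                                  // remember to await/handle this Future

After the cleanup (or with schema validation in place), the Spark read should no longer hit the MongoTypeConversionException.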
I had a similar conversion problem. MongoDB takes a sample of 1,000 documents to infer the schema. In my case, 1,000 documents were not sufficient to cover all cases. I increased the sample size and this solved the problem. See the MongoDB documentation.
Code:
val readConfig = ReadConfig(Map(
"database" -> "myDatabase",
"collection" -> "myCollection",
"sampleSize" -> "100000"), Some(ReadConfig(sc)))
val df = sc.loadFromMongoDB(readConfig).toDF()
I am trying to read a collection with documents of varying schema from MongoDB using the spark-mongo connector to get a DataFrame. What I see is that Spark infers the schema for the DataFrame from the first record, and if I query the DataFrame for any other field, I get an exception. Is there any way to resolve this issue?
I am running a Spark job, and at some point I want to connect to an Elasticsearch server to get some data and add it to an RDD. The code I am using looks like this:
input.mapPartitions(records => {
  val elcon = new ElasticSearchConnection
  val client: TransportClient = elcon.openConnection()
  val newRecs = records.flatMap(record => {
    val response = client.prepareGet("index", "indexType",
      record.id.toString).execute().actionGet()
    val newRec = processRec(record, response)
    newRec
  }) //end of flatMap
  client.close()
  newRecs
}) //end of mapPartitions
My problem is that the client.close() command is called before the flatMap operation is finished, which of course results in an exception. The code works if I move the creation and closing of the connection inside the flatMap, but this would generate a huge number of connections. Is it possible to make sure that client.close() will be called after the flatMap operation is finished?
Making a blocking call for each item in your RDD to fetch the corresponding Elasticsearch document is causing the problem. It is generally advised to avoid blocking calls.
There is an alternative approach using the Elasticsearch-for-Hadoop Spark support:
Read the Elasticsearch index/type as another RDD and join it with your RDD.
Include the right version of the ES-Hadoop dependency.
import org.elasticsearch.spark._

val esRdd = sc.esRDD("index/indexType") // returns a pair RDD of (_id, Map of all key/value pairs for all fields)
val inputPairs = input.map(record => (record.id, record)) // pair RDD of (id, record), since we join on the id
// Join the two RDDs on the id; the result is a pair RDD of (id, (inputRddRecord, esRddRecord))
inputPairs.join(esRdd).map(rec => processResponse(rec._2._1, rec._2._2))
Hope this helps.
PS: It will still not alleviate the problem of lack of co-location (i.e. each document with a given _id will come from a different shard of the index). A better approach would have been to achieve co-location of the input RDD and the ES index's documents at the time of creating the ES index.
I'm trying to query Elasticsearch with the elasticsearch-spark connector and I want to return only a few results:
For example:
val conf = new SparkConf().set("es.nodes","localhost").set("es.index.auto.create", "true").setMaster("local")
val sparkContext = new SparkContext(conf)
val query = "{\"size\":1}"
println(sparkContext.esRDD("index_name/type", query).count())
However this will return all the documents in the index.
Some parameters are actually ignored in the query by design, such as from, size, fields, etc.
They are used internally by the elasticsearch-spark connector.
Unfortunately, this list of unsupported parameters isn't documented. But if you wish to use the size parameter, you can always rely on the pushdown predicate and use the DataFrame/Dataset limit method.
So you ought to use the Spark SQL DSL instead, e.g.:
val df = sqlContext.read.format("org.elasticsearch.spark.sql")
.option("pushdown","true")
.load("index_name/doc_type")
.limit(10) // instead of size : 10
This query will return the first 10 documents returned by the match_all query that is used by default by the connector.
Note: the following claim, quoted from the other answer, isn't correct on any level.
This is actually on purpose. Since the connector does a parallel query, it also looks at the number of documents being returned so if the user specifies a parameter, it will overwrite it according to the es.scroll.limit setting (see the configuration option).
When you query Elasticsearch, it also runs the query in parallel on all the index shards, without overwriting them.
If I understand this correctly, you are executing a count operation, which does not return any documents. Do you expect it to return 1 because you specified size: 1? That's not happening, which is by design.
Edited to add:
This is the definition of count() in elasticsearch-hadoop:
override def count(): Long = {
  val repo = new RestRepository(esCfg)
  try {
    return repo.count(true)
  } finally {
    repo.close()
  }
}
It does not take the query into account at all, but considers the entire ES index as the RDD input.
This is actually on purpose. Since the connector does a parallel query, it also looks at the number of documents being returned so if the user specifies a parameter, it will overwrite it according to the es.scroll.limit setting (see the configuration option).
In other words, if you want to control the size, do so through that setting as it will always take precedence.
Beware that this parameter applies per shard. So, if you have 5 shards you might get up to five hits if this parameter is set to 1.
See https://www.elastic.co/guide/en/elasticsearch/hadoop/master/configuration.html
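For completeness, here is a rough sketch of using that setting with the RDD API from the question; the local setup is assumed, and collect() is used only to observe the effect (count() would bypass the query, as shown above):

import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._

val conf = new SparkConf()
  .setMaster("local")
  .set("es.nodes", "localhost")
  .set("es.scroll.limit", "1")     // cap the hits returned per scroll/split, i.e. roughly per shard

val sparkContext = new SparkContext(conf)

// collect() goes through the scroll reads, so the limit above is honoured here
// (unlike count(), which as shown in the snippet above ignores the query entirely).
val docs = sparkContext.esRDD("index_name/type").collect()
println(docs.length)               // up to one hit per shard with es.scroll.limit = 1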