I am trying to read a collection with documents of varying schema from MongoDB using the spark-mongo connector to get a dataframe. What I see is that Spark infers the schema for the dataframe from the first record, and if I query the dataframe for any other field, I get an exception. Is there any way to resolve this issue?
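One way around this (a rough sketch only, with a placeholder URI and hypothetical field names) is to supply the schema explicitly instead of relying on inference, or to raise the connector's sampleSize so more documents are inspected before the schema is derived:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder()
  .appName("mongo-read")
  .config("spark.mongodb.input.uri", "mongodb://localhost/mydb.mycoll")  // placeholder URI
  .getOrCreate()

// Declare every field you intend to query, even if some documents lack it.
val schema = StructType(Seq(
  StructField("_id", StringType, nullable = true),
  StructField("name", StringType, nullable = true),      // hypothetical field
  StructField("age", IntegerType, nullable = true)       // present only in some documents
))

val df = spark.read
  .format("com.mongodb.spark.sql.DefaultSource")
  .schema(schema)          // skip inference; fields missing from a document come back as null
  .load()

// Alternatively, let the connector sample more documents before inferring:
// .option("sampleSize", "10000")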
I want to use the MongoDB to BigQuery Dataflow template and I have 2 questions:
Can I somehow configure partitioning for the destination table? For example, if I want to dump my database every day?
Can I map nested fields in MongoDB to records in BQ instead of columns with string values?
I see the User option with the values FLATTEN and NONE, but FLATTEN will flatten documents only one level deep.
Could either of these two approaches help?
Creating the destination table with its structure defined before running the Dataflow job
Using a UDF
I tried to use the MongoDB to BigQuery Dataflow template with the User option set to FLATTEN.
I'm trying to load data from Elasticsearch into MongoDB. I want to retain the same _id value that is present in Elasticsearch while writing to Mongo as well. I'm able to do that, but the _id field is of type String in Elasticsearch, and I would like to push it into MongoDB after converting it to the Mongo ObjectId datatype.
The data from Elasticsearch is loaded into a dataframe, and I'm using Spark with Scala for this. Any help to achieve this?
I have tried modifying the dataframe this way, but it throws an error:
df("_id") = new ObjectId(df("_id"))
It doesn't work this way.
val df = spark.read
.format("org.elasticsearch.spark.sql")
.option("query", esQuery)
.option("pushdown", true)
.option("scroll.size", Config.ES_SCROLL_SIZE)
.load(Config.ES_RESOURCE)
.withColumn("_id", $"_metadata".getItem("_id"))
.drop("_metadata")
df("_id") = new ObjectId(df("_id"))  // fails: a dataframe column cannot be assigned like this, and ObjectId cannot be built from a Column
I want to load the dataframe into MongoDB with the _id field as the Mongo ObjectId datatype rather than String:
Present: _id : "123456ABCD"
Expected: _id : ObjectId(123456ABCD)
Try this
import org.apache.spark.sql.functions.typedLit
.withColumn("_id", typedLit(new ObjectId($"_metadata".getItem("_id"))))
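If that does not compile in your setup (ObjectId cannot be constructed from a Column), another option is to rely on the connector's convention of representing an ObjectId as a struct with a single oid string field. A rough sketch, assuming the _id strings are valid 24-character hex ObjectId values and reusing the df from the question:

import org.apache.spark.sql.functions.{col, struct}

val withObjectId = df.withColumn("_id", struct(col("_id").as("oid")))  // the connector writes {oid: "<24-char hex>"} back as an ObjectId
withObjectId.write
  .format("com.mongodb.spark.sql.DefaultSource")
  .mode("append")
  .save()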
I need some help, please.
How can I add an index while inserting a dataframe into MongoDB like this?
my_dataframe.write.format("com.mongodb.spark.sql.DefaultSource").mode("append").save()
My dataframe contains a field named my_custom_id and I want to make that my new index instead of the default _id index added by MongoDB.
Thank you
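One possible approach (a sketch, not necessarily the only way): if my_custom_id should act as the document key, rename it to _id before writing, and MongoDB will use it as the primary index. As far as I know the Spark connector itself does not create secondary indexes; for an additional index on my_custom_id you would create it on the collection directly, e.g. with the Mongo shell or a driver.

my_dataframe
  .withColumnRenamed("my_custom_id", "_id")   // becomes MongoDB's primary key and its default unique index
  .write
  .format("com.mongodb.spark.sql.DefaultSource")
  .mode("append")
  .save()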
I am using the Scala API of Apache Spark Streaming to read from a Kafka server in a window with a size of one minute and a slide interval of one minute.
The messages from Kafka contain a timestamp from the moment they were sent and an arbitrary value. Each of the values is supposed to be aggregated with reduceByKeyAndWindow and saved to MongoDB.
val messages = stream.map(record => (record.key, record.value.toDouble))
val reduced = messages.reduceByKeyAndWindow((x: Double , y: Double) => (x + y),
Seconds(60), Seconds(60))
reduced.foreachRDD({ rdd =>
import spark.implicits._
val aggregatedPower = rdd.map({x => MyJsonObj(x._2, x._1)}).toDF()
aggregatedPower.write.mode("append").mongo()
})
This works so far; however, it is possible that some messages arrive with a delay of a minute, which leads to two JSON objects with the same timestamp in the database.
{"_id":"5aa93a7e9c6e8d1b5c486fef","power":6.146849997,"timestamp":"2018-03-14 15:00"}
{"_id":"5aa941df9c6e8d11845844ae","power":5.0,"timestamp":"2018-03-14 15:00"}
The documentation of the mongo-spark-connector didn't help me find a solution.
Is there a smart way to query whether the timestamp in the current window is already in the database and, if so, update this value?
It seems that what you're looking for is a MongoDB operation called upsert, where an update operation inserts a new document if the criteria match nothing and updates the fields if there is a match.
If you are using the MongoDB Connector for Spark v2.2+, whenever a Spark dataframe contains an _id field, the data will be upserted: any existing documents with the same _id value will be updated, and documents whose _id value does not yet exist in the collection will be inserted.
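For example (a rough sketch reusing the question's aggregatedPower dataframe and .mongo() writer): deriving _id from the window timestamp makes a later write for the same window replace the earlier document instead of inserting a duplicate. Note that the replace overwrites the old power value; to add the delayed value to it, you need the read-modify-write described below.

import com.mongodb.spark.sql._                   // provides the .mongo() writer used in the question
import org.apache.spark.sql.functions.col

val withId = aggregatedPower.withColumn("_id", col("timestamp"))  // deterministic _id per window
withId.write.mode("append").mongo()              // _id present => upsert with connector v2.2+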
Now you could try to create an RDD using MongoDB Spark Aggregation, specifying a $match filter to query where timestamp matches the current window:
import org.bson.Document

// Note: withPipeline is available on a MongoRDD (for example MongoSpark.load(sparkContext)),
// not on the plain streaming RDD from foreachRDD.
val aggregatedRdd = rdd.withPipeline(Seq(
  Document.parse(
    "{ $match: { timestamp : '2018-03-14 15:00' } }"
  )))
Modify the value of the power field, and then write with mode 'append' (a rough sketch of the whole flow is below).
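Putting those steps together, a rough sketch (reusing the question's names, assuming the input URI for the collection is configured, and skipping error handling) might look like:

import com.mongodb.spark.MongoSpark
import com.mongodb.spark.sql._
import org.apache.spark.sql.functions.col
import org.bson.Document

val windowTs = "2018-03-14 15:00"                        // current window's timestamp

// Look up any existing power value for this window.
val existingPower = MongoSpark.load(spark.sparkContext)
  .withPipeline(Seq(Document.parse(s"""{ $$match: { timestamp: "$windowTs" } }""")))
  .map(_.getDouble("power").doubleValue)
  .collect()
  .headOption
  .getOrElse(0.0)

// Add it to the freshly aggregated value and write back with the same _id, so the connector upserts.
aggregatedPower
  .withColumn("power", col("power") + existingPower)
  .withColumn("_id", col("timestamp"))
  .write.mode("append").mongo()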
You may also find the blog Data Streaming MongoDB/Kafka useful if you would like to write a Kafka consumer and insert directly into MongoDB, applying your logic with the MongoDB Java Driver.
We have some collections in MongoDB from which we are extracting data and loading it into MySQL tables through Kettle. How do I get the creation timestamp of a record from MongoDB in Kettle?
I read some articles saying the date can be derived from the ObjectId, but I was not able to extract it in Kettle.
Can anyone give me the syntax for extracting the date from the ObjectId, or is it possible to get a hidden create timestamp from MongoDB's metadata layer in Kettle?
Thanks,
Deepthi
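In case it helps, the creation time is encoded in the ObjectId itself: the first 4 bytes (the leading 8 hex characters) are the seconds since the Unix epoch. A minimal sketch in plain Scala (the example _id is taken from the documents above); the same substring/parse logic should be reproducible in a Kettle scripting step such as Modified Java Script Value:

import java.time.Instant
import org.bson.types.ObjectId

val hex = "5aa93a7e9c6e8d1b5c486fef"               // example _id from above

// Option 1: let the BSON driver do it
val createdDate = new ObjectId(hex).getDate        // java.util.Date of creation

// Option 2: parse the leading 8 hex characters yourself
val seconds = java.lang.Long.parseLong(hex.take(8), 16)
val createdInstant = Instant.ofEpochSecond(seconds)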