I'm trying to append data to an existing document in a collection while migrating data into MongoDB from other sources using Spark. I searched the documentation but didn't find anything.
Any kind of help will be appreciated.
Thanks.
I was researching this problem and found that you can append a Spark DataFrame to documents with an existing _id in MongoDB. You can do it using:
    MongoSpark.save(df.write.mode("append"))
In "append" mode the connector writes all the DataFrame's fields to the document with the matching _id. Note that it removes any fields that do not exist in the DataFrame you are writing to the database.
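A minimal end-to-end sketch of that call, assuming the MongoDB Spark Connector for Scala; the URI, database/collection, and input file below are placeholders:

    import com.mongodb.spark.MongoSpark
    import org.apache.spark.sql.SparkSession

    // Placeholder output URI -- point it at your own database.collection.
    val spark = SparkSession.builder()
      .appName("append-to-existing-documents")
      .config("spark.mongodb.output.uri", "mongodb://localhost:27017/mydb.people")
      .getOrCreate()

    // Any DataFrame whose rows carry the _id values of the documents to update.
    val df = spark.read.json("people_updates.json")

    // Save mode "append": a row whose _id already exists replaces that document,
    // so fields missing from the DataFrame are dropped from the stored document.
    MongoSpark.save(df.write.mode("append"))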
Source: https://groups.google.com/forum/#!topic/mongodb-user/eF-qdpYbFS0
Related
I have stored a database in my Clever Cloud MongoDB add-on and have managed to read my collection from there. How do I convert it to a Spark DataFrame? The data was originally in CSV format.
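One common way to do this, sketched here with the MongoDB Spark Connector; the connection string and database/collection names are placeholders, so use the URI exposed by your add-on:

    import com.mongodb.spark.MongoSpark
    import org.apache.spark.sql.SparkSession

    // Placeholder URI -- use the connection string from your MongoDB add-on.
    val spark = SparkSession.builder()
      .appName("mongo-to-dataframe")
      .config("spark.mongodb.input.uri", "mongodb://user:password@host:27017/mydb.mycollection")
      .getOrCreate()

    // Loads the collection as a DataFrame; the schema is inferred by sampling documents.
    val df = MongoSpark.load(spark)
    df.printSchema()
    df.show(5)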
I am using the Mongo-Hadoop connector to work with Spark and MongoDB. I want to delete the documents in an RDD from MongoDB; it looks like there is a MongoUpdateWritable to support document updates. Is there a way to do deletion with the Mongo-Hadoop connector?
Thanks
If you only want to remove records from an RDD, use the Spark API functions such as map, reduce, and filter.
If you want to save the results back later, use MongoUpdateWritable.
See the basics: Mongo-Hadoop-Spark
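A minimal sketch of the filtering part, assuming the usual (Object, BSONObject) pairs produced by the Mongo-Hadoop input format and a hypothetical status field:

    import com.mongodb.hadoop.MongoInputFormat
    import org.apache.hadoop.conf.Configuration
    import org.apache.spark.{SparkConf, SparkContext}
    import org.bson.BSONObject

    val sc = new SparkContext(new SparkConf().setAppName("mongo-hadoop-filter"))

    // Placeholder input URI -- point it at your own collection.
    val mongoConf = new Configuration()
    mongoConf.set("mongo.input.uri", "mongodb://localhost:27017/mydb.mycoll")

    // The Mongo-Hadoop connector exposes documents as (id, BSONObject) pairs.
    val rdd = sc.newAPIHadoopRDD(
      mongoConf,
      classOf[MongoInputFormat],
      classOf[Object],
      classOf[BSONObject])

    // "Deleting" inside Spark is just filtering: keep what you want, drop the rest.
    val kept = rdd.filter { case (_, doc) => doc.get("status") != "obsolete" } // hypothetical field

    // Note: this only removes documents from the RDD, not from MongoDB itself;
    // changing the database still requires a write-back step (e.g. MongoUpdateWritable).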
I've got a fairly big RDD with 400 fields coming from a Kafka Spark stream. I need to create another RDD or Map by selecting some fields from the initial RDD when I transform the stream, and eventually write to Elasticsearch.
I know my fields by field name but don't know the field index.
How do I project the specific fields by field name to a new Map?
Assuming each field is separated by the delimiter '#', you can determine the index of each field from the first row or a header file and store it in some data structure. You can then use this structure to look up the fields you need and create the new maps.
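A minimal sketch of that idea; the header and field names below are hypothetical, and the real header would carry all 400 names:

    // Header line (or separate header file) carrying the field names, '#'-delimited.
    val header = "userId#eventType#amount"                 // hypothetical, shortened
    val index: Map[String, Int] = header.split("#").zipWithIndex.toMap

    // The fields you care about, known only by name.
    val wanted = Seq("userId", "amount")                   // hypothetical names

    // Project each incoming line into a Map(fieldName -> value) of just those fields.
    def project(line: String): Map[String, String] = {
      val cols = line.split("#", -1)                       // -1 keeps trailing empty fields
      wanted.map(name => name -> cols(index(name))).toMap
    }

    // Inside the stream transformation, e.g.:
    // stream.map(record => project(record.value()))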
You can use the Apache Avro format to pre-process the data. That would allow you to access the data by field name and would not require knowing the field indexes in the string. The following link provides a great starting point for integrating Avro with Kafka and Spark.
http://aseigneurin.github.io/2016/03/04/kafka-spark-avro-producing-and-consuming-avro-messages.html
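As a rough sketch of the Avro side (the schema and field names below are made up and abbreviated; in practice use the producer's full schema or a schema registry, and see the linked post for the Kafka wiring):

    import org.apache.avro.Schema
    import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
    import org.apache.avro.io.DecoderFactory

    // Hypothetical, abbreviated schema for the incoming records.
    val schemaJson =
      """{"type":"record","name":"Event","fields":[
        |  {"name":"userId","type":"string"},
        |  {"name":"amount","type":"double"}
        |]}""".stripMargin
    val schema = new Schema.Parser().parse(schemaJson)

    def decode(bytes: Array[Byte]): GenericRecord = {
      val reader  = new GenericDatumReader[GenericRecord](schema)
      val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
      reader.read(null, decoder)
    }

    // Fields are now addressable by name, with no index bookkeeping:
    // stream.map(msg => decode(msg.value()))
    //       .map(r => Map("userId" -> r.get("userId").toString, "amount" -> r.get("amount")))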
I am setting up a new Elasticsearch instance using the mongo-connector Python tool. The tool is working, but it only imported around 100k entries from the MongoDB oplog.
However, my collections contain millions of records... Is there a way to pass all the records from each collection through the oplog without modifying the records in any way?
Following the advice of Sammaye, I solved this problem by iterating over the collection, converting each document to JSON, and posting it to the index API via curl. Thanks for the suggestion!
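For reference, roughly what that loop can look like from Scala instead of a shell script, assuming the MongoDB Java driver and a local Elasticsearch; the connection string, database/collection, and index name are placeholders:

    import java.net.URI
    import java.net.http.{HttpClient, HttpRequest, HttpResponse}
    import com.mongodb.client.MongoClients
    import scala.jdk.CollectionConverters._

    // Placeholders: adjust the connection string, database, collection, and index name.
    val mongo      = MongoClients.create("mongodb://localhost:27017")
    val collection = mongo.getDatabase("mydb").getCollection("mycollection")
    val http       = HttpClient.newHttpClient()
    val esBase     = "http://localhost:9200/myindex/_doc"

    // Walk the whole collection (not just the oplog), reuse the Mongo _id as the ES id,
    // and index each document individually. For millions of records the _bulk API is faster.
    for (doc <- collection.find().iterator().asScala) {
      val id = doc.getObjectId("_id").toHexString
      doc.remove("_id")                          // the _id goes into the URL, not the body
      val request = HttpRequest.newBuilder(URI.create(s"$esBase/$id"))
        .header("Content-Type", "application/json")
        .PUT(HttpRequest.BodyPublishers.ofString(doc.toJson()))
        .build()
      http.send(request, HttpResponse.BodyHandlers.ofString())
    }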
Is there an option to automatically generate a schema.xml for Solr from MongoDB? E.g., each field of a document and its subdocuments in a collection should be indexed and searchable by default.
As written in this SO answer, Solr's Schemaless Mode could help you:
Solr supports a Schemaless Mode. When starting Solr this way, you are initially not bound to a schema. When you give Solr a first document it will guess the appropriate field types and generate a schema that includes those field types for you. These fields are then fixed. You may still add new fields on the fly that way.
What you still need to do is create an import route of some kind from your MongoDB into Solr.
After googling a bit, you may stumble upon the SO question - solr Data Import Handlers for MongoDB - which may help you on that part too.
It would probably be simpler to create a mongo query whose result contains all the relevant information you require, save the result as JSON, and send that to Solr's update handler, which can parse JSON (see the sketch after the list below).
So, in short:
1. Create a new, empty core in Schemaless Mode
2. Create an import of some kind that covers all entities and attributes you want
3. Run the import
4. Check if the result is as you want it to be
As long as (4) is not satisfied, you may delete the core and repeat these steps.
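A minimal sketch of steps (2) and (3), assuming a local schemaless core named mycore and some JSON exported from your mongo query; the core name, URL, and field names are placeholders:

    import java.net.URI
    import java.net.http.{HttpClient, HttpRequest, HttpResponse}

    // JSON documents produced from your mongo query; field names here are placeholders.
    val docsJson =
      """[
        |  {"id": "1", "title": "first document",  "tags": ["mongo", "solr"]},
        |  {"id": "2", "title": "second document", "tags": ["import"]}
        |]""".stripMargin

    // Post to the core's JSON update handler; commit=true makes the documents searchable
    // right away, and in schemaless mode unknown fields are added to the schema on first use.
    val request = HttpRequest.newBuilder(
        URI.create("http://localhost:8983/solr/mycore/update?commit=true"))
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(docsJson))
      .build()

    val response = HttpClient.newHttpClient()
      .send(request, HttpResponse.BodyHandlers.ofString())
    println(response.statusCode())   // expect 200 when the import succeeded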
No, MongoDB does not provide this option. You will have to create a script that maps documents to XML.