I'm trying to append data to an existing document in a collection while migrating data into MongoDB from other sources using Spark. I searched the documentation but didn't find anything.
Any kind of help will be appreciated.
Thanks.
I was researching this problem and found that you can append a Spark DataFrame to documents with an existing _id in MongoDB. You can do it using:
    MongoSpark.save(df.write.mode("append"))
In "append" mode the connector writes all the DataFrame's fields to the document with the matching _id. Note that it removes any fields that do not exist in the DataFrame you are writing to the database.
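A minimal end-to-end sketch of that call, assuming the MongoDB Spark Connector for Scala; the URI, database/collection, and input file below are placeholders:

    import com.mongodb.spark.MongoSpark
    import org.apache.spark.sql.SparkSession

    // Placeholder output URI -- point it at your own database.collection.
    val spark = SparkSession.builder()
      .appName("append-to-existing-documents")
      .config("spark.mongodb.output.uri", "mongodb://localhost:27017/mydb.people")
      .getOrCreate()

    // Any DataFrame whose rows carry the _id values of the documents to update.
    val df = spark.read.json("people_updates.json")

    // Save mode "append": a row whose _id already exists replaces that document,
    // so fields missing from the DataFrame are dropped from the stored document.
    MongoSpark.save(df.write.mode("append"))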
Source: https://groups.google.com/forum/#!topic/mongodb-user/eF-qdpYbFS0
Related
I have stored a database in my Clever Cloud MongoDB add-on and have managed to read my collection from there. How do I convert it to a Spark DataFrame? The data was originally in CSV format.
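One common way to do this, sketched here with the MongoDB Spark Connector; the connection string and database/collection names are placeholders, so use the URI exposed by your add-on:

    import com.mongodb.spark.MongoSpark
    import org.apache.spark.sql.SparkSession

    // Placeholder URI -- use the connection string from your MongoDB add-on.
    val spark = SparkSession.builder()
      .appName("mongo-to-dataframe")
      .config("spark.mongodb.input.uri", "mongodb://user:password@host:27017/mydb.mycollection")
      .getOrCreate()

    // Loads the collection as a DataFrame; the schema is inferred by sampling documents.
    val df = MongoSpark.load(spark)
    df.printSchema()
    df.show(5)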
I am using the Mongo-Hadoop connector to work with Spark and MongoDB. I want to delete the documents in an RDD from MongoDB; it looks like there is a MongoUpdateWritable to support document updates. Is there a way to do deletion with the Mongo-Hadoop connector?
Thanks
If you only want to remove records from an RDD, use the Spark API functions such as map, reduce, and filter.
If you want to save the results back later, use MongoUpdateWritable.
See the basics: Mongo-Hadoop-Spark
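A minimal sketch of the filtering part, assuming the usual (Object, BSONObject) pairs produced by the Mongo-Hadoop input format and a hypothetical status field:

    import com.mongodb.hadoop.MongoInputFormat
    import org.apache.hadoop.conf.Configuration
    import org.apache.spark.{SparkConf, SparkContext}
    import org.bson.BSONObject

    val sc = new SparkContext(new SparkConf().setAppName("mongo-hadoop-filter"))

    // Placeholder input URI -- point it at your own collection.
    val mongoConf = new Configuration()
    mongoConf.set("mongo.input.uri", "mongodb://localhost:27017/mydb.mycoll")

    // The Mongo-Hadoop connector exposes documents as (id, BSONObject) pairs.
    val rdd = sc.newAPIHadoopRDD(
      mongoConf,
      classOf[MongoInputFormat],
      classOf[Object],
      classOf[BSONObject])

    // "Deleting" inside Spark is just filtering: keep what you want, drop the rest.
    val kept = rdd.filter { case (_, doc) => doc.get("status") != "obsolete" } // hypothetical field

    // Note: this only removes documents from the RDD, not from MongoDB itself;
    // changing the database still requires a write-back step (e.g. MongoUpdateWritable).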
I've got a fairly big RDD with 400 fields coming from a Kafka Spark stream. I need to create another RDD or Map by selecting some fields from the initial RDD when I transform the stream, and eventually write to Elasticsearch.
I know my fields by field name but don't know the field index.
How do I project the specific fields by field name to a new Map?
Assuming each field is separated by the delimiter '#', you can determine the index of each field from the first row or a header file and store it in some data structure. You can then use this structure to look up the fields you need and create the new maps.
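A minimal sketch of that idea; the header and field names below are hypothetical, and the real header would carry all 400 names:

    // Header line (or separate header file) carrying the field names, '#'-delimited.
    val header = "userId#eventType#amount"                 // hypothetical, shortened
    val index: Map[String, Int] = header.split("#").zipWithIndex.toMap

    // The fields you care about, known only by name.
    val wanted = Seq("userId", "amount")                   // hypothetical names

    // Project each incoming line into a Map(fieldName -> value) of just those fields.
    def project(line: String): Map[String, String] = {
      val cols = line.split("#", -1)                       // -1 keeps trailing empty fields
      wanted.map(name => name -> cols(index(name))).toMap
    }

    // Inside the stream transformation, e.g.:
    // stream.map(record => project(record.value()))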
You can use the Apache Avro format to pre-process the data. That would allow you to access the data by field name and would not require knowing the field indexes in the string. The following link provides a great starting point for integrating Avro with Kafka and Spark.
http://aseigneurin.github.io/2016/03/04/kafka-spark-avro-producing-and-consuming-avro-messages.html
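As a rough sketch of the Avro side (the schema and field names below are made up and abbreviated; in practice use the producer's full schema or a schema registry, and see the linked post for the Kafka wiring):

    import org.apache.avro.Schema
    import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
    import org.apache.avro.io.DecoderFactory

    // Hypothetical, abbreviated schema for the incoming records.
    val schemaJson =
      """{"type":"record","name":"Event","fields":[
        |  {"name":"userId","type":"string"},
        |  {"name":"amount","type":"double"}
        |]}""".stripMargin
    val schema = new Schema.Parser().parse(schemaJson)

    def decode(bytes: Array[Byte]): GenericRecord = {
      val reader  = new GenericDatumReader[GenericRecord](schema)
      val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
      reader.read(null, decoder)
    }

    // Fields are now addressable by name, with no index bookkeeping:
    // stream.map(msg => decode(msg.value()))
    //       .map(r => Map("userId" -> r.get("userId").toString, "amount" -> r.get("amount")))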
I am setting up a new Elasticsearch instance using the mongo-connector Python tool. The tool is working, but it only imported around 100k entries from the MongoDB oplog.
However, my collections contain millions of records... Is there a way to pass all the records from each collection through the oplog without modifying the records in any way?
Following the advice of Sammaye, I solved this problem by iterating over the collection, converting each document to JSON, and posting it to the index API via curl. Thanks for the suggestion!
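For reference, roughly what that loop can look like from Scala instead of a shell script, assuming the MongoDB Java driver and a local Elasticsearch; the connection string, database/collection, and index name are placeholders:

    import java.net.URI
    import java.net.http.{HttpClient, HttpRequest, HttpResponse}
    import com.mongodb.client.MongoClients
    import scala.jdk.CollectionConverters._

    // Placeholders: adjust the connection string, database, collection, and index name.
    val mongo      = MongoClients.create("mongodb://localhost:27017")
    val collection = mongo.getDatabase("mydb").getCollection("mycollection")
    val http       = HttpClient.newHttpClient()
    val esBase     = "http://localhost:9200/myindex/_doc"

    // Walk the whole collection (not just the oplog), reuse the Mongo _id as the ES id,
    // and index each document individually. For millions of records the _bulk API is faster.
    for (doc <- collection.find().iterator().asScala) {
      val id = doc.getObjectId("_id").toHexString
      doc.remove("_id")                          // the _id goes into the URL, not the body
      val request = HttpRequest.newBuilder(URI.create(s"$esBase/$id"))
        .header("Content-Type", "application/json")
        .PUT(HttpRequest.BodyPublishers.ofString(doc.toJson()))
        .build()
      http.send(request, HttpResponse.BodyHandlers.ofString())
    }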
Is there an option to automatically generate a schema.xml for Solr from MongoDB? E.g., each field of a document and its subdocuments in a collection should be indexed and searchable by default.
As written in this SO answer, Solr's Schemaless Mode could help you:
Solr supports a Schemaless Mode. When starting Solr this way, you are initially not bound to a schema. When you give Solr a first document it will guess the appropriate field types and generate a schema that includes those field types for you. These fields are then fixed. You may still add new fields on the fly that way.
What you still need to do is create an import route of some kind from your MongoDB into Solr.
After googling a bit, you may stumble upon the SO question - solr Data Import Handlers for MongoDB - which may help you on that part too.
It would probably be simpler to create a mongo query whose result contains all the relevant information you require, save the result as JSON, and send that to Solr's update handler, which can parse JSON (see the sketch after the list below).
So, in short:
1. Create a new, empty core in Schemaless Mode
2. Create an import of some kind that covers all entities and attributes you want
3. Run the import
4. Check if the result is as you want it to be
As long as (4) is not satisfied, you may delete the core and repeat these steps.
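A minimal sketch of steps (2) and (3), assuming a local schemaless core named mycore and some JSON exported from your mongo query; the core name, URL, and field names are placeholders:

    import java.net.URI
    import java.net.http.{HttpClient, HttpRequest, HttpResponse}

    // JSON documents produced from your mongo query; field names here are placeholders.
    val docsJson =
      """[
        |  {"id": "1", "title": "first document",  "tags": ["mongo", "solr"]},
        |  {"id": "2", "title": "second document", "tags": ["import"]}
        |]""".stripMargin

    // Post to the core's JSON update handler; commit=true makes the documents searchable
    // right away, and in schemaless mode unknown fields are added to the schema on first use.
    val request = HttpRequest.newBuilder(
        URI.create("http://localhost:8983/solr/mycore/update?commit=true"))
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(docsJson))
      .build()

    val response = HttpClient.newHttpClient()
      .send(request, HttpResponse.BodyHandlers.ofString())
    println(response.statusCode())   // expect 200 when the import succeeded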
No, MongoDB does not provide this option. You will have to create a script that maps documents to XML.