Moving HDFS data into MongoDB

I am trying to move HDFS data into MongoDB. I know how to export data into MySQL by using Sqoop, but I don't think I can use Sqoop for MongoDB. I need help understanding how to do that.

This recipe will use the MongoOutputFormat class to load data from an HDFS instance into a MongoDB collection.

Getting ready

The easiest way to get started with the Mongo Hadoop Adaptor is to clone the mongo-hadoop project from GitHub and build the project configured for a specific version of Hadoop. A Git client must be installed to clone this project. This recipe assumes that you are using the CDH3 distribution of Hadoop.

The official Git client can be found at http://git-scm.com/downloads.

The Mongo Hadoop Adaptor can be found on GitHub at https://github.com/mongodb/mongo-hadoop. This project needs to be built for a specific version of Hadoop. The resulting JAR file must be installed on each node in the $HADOOP_HOME/lib folder.

The MongoDB Java Driver is also required to be installed on each node in the $HADOOP_HOME/lib folder. It can be found at https://github.com/mongodb/mongo-java-driver/downloads.
How to do it...

Complete the following steps to copy data from HDFS into MongoDB:

1. Clone the mongo-hadoop repository with the following command line:
git clone https://github.com/mongodb/mongo-hadoop.git
2. Switch to the stable release 1.0 branch:
git checkout release-1.0
3. Set the Hadoop version which mongo-hadoop should target. In the folder that mongo-hadoop was cloned to, open the build.sbt file with a text editor and change the following line:
hadoopRelease in ThisBuild := "default"
to
hadoopRelease in ThisBuild := "cdh3"
4. Build mongo-hadoop:
./sbt package
This will create a file named mongo-hadoop-core_cdh3u3-1.0.0.jar in the core/target folder.
5. Download the MongoDB Java Driver Version 2.8.0 from https://github.com/mongodb/mongo-java-driver/downloads.
6. Copy mongo-hadoop and the MongoDB Java Driver to $HADOOP_HOME/lib on each node:
cp mongo-hadoop-core_cdh3u3-1.0.0.jar mongo-2.8.0.jar $HADOOP_HOME/lib
7. Create a Java MapReduce program that will read the weblog_entries.txt file from HDFS and write each record to MongoDB using the MongoOutputFormat class:
import java.io.*;
import org.apache.commons.logging.*;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.*;
import org.bson.*;
import org.bson.types.ObjectId;
import com.mongodb.hadoop.*;
import com.mongodb.hadoop.util.*;

public class ExportToMongoDBFromHDFS {

    private static final Log log = LogFactory.getLog(ExportToMongoDBFromHDFS.class);

    public static class ReadWeblogs extends Mapper<LongWritable, Text, ObjectId, BSONObject> {

        // The input key must be LongWritable (the byte offset of the line) to
        // match the Mapper's declared input types; otherwise this method would
        // not override map() and the identity mapper would run instead
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {

            System.out.println("Key: " + key);
            System.out.println("Value: " + value);

            // Each line of weblog_entries.txt is a tab-delimited record
            String[] fields = value.toString().split("\t");
            String md5 = fields[0];
            String url = fields[1];
            String date = fields[2];
            String time = fields[3];
            String ip = fields[4];

            BSONObject b = new BasicBSONObject();
            b.put("md5", md5);
            b.put("url", url);
            b.put("date", date);
            b.put("time", time);
            b.put("ip", ip);

            context.write(new ObjectId(), b);
        }
    }

    public static void main(String[] args) throws Exception {

        final Configuration conf = new Configuration();
        MongoConfigUtil.setOutputURI(conf, "mongodb://<HOST>:<PORT>/test.weblogs");
        System.out.println("Configuration: " + conf);

        final Job job = new Job(conf, "Export to Mongo");

        Path in = new Path("/data/weblogs/weblog_entries.txt");
        FileInputFormat.setInputPaths(job, in);

        job.setJarByClass(ExportToMongoDBFromHDFS.class);
        job.setMapperClass(ReadWeblogs.class);

        job.setOutputKeyClass(ObjectId.class);
        job.setOutputValueClass(BSONObject.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(MongoOutputFormat.class);

        // Map-only job: records go straight from the mapper to MongoDB
        job.setNumReduceTasks(0);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
8. Export as a runnable JAR file and run the job:
hadoop jar ExportToMongoDBFromHDFS.jar
9. Verify that the weblogs MongoDB collection was populated from the Mongo shell:
db.weblogs.find();

The basic problem is that MongoDB stores its data in BSON (binary JSON), while your HDFS data may be in a different format (text, sequence, or Avro files). The easiest thing to do is to use Pig to load your results into MongoDB using this driver:
https://github.com/mongodb/mongo-hadoop/tree/master/pig
You'll have to map your values to your collection; there's a good example on the GitHub page.
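As a rough sketch of that approach, a Pig script using MongoStorage from the pig module might look like the following; the JAR names, input path, field list, and connection string are assumptions carried over from the recipe above:

REGISTER mongo-hadoop-core_cdh3u3-1.0.0.jar;
REGISTER mongo-hadoop-pig_cdh3u3-1.0.0.jar;
REGISTER mongo-2.8.0.jar;

-- Load the tab-delimited weblog entries and name the fields
weblogs = LOAD '/data/weblogs/weblog_entries.txt' USING PigStorage('\t')
    AS (md5:chararray, url:chararray, date:chararray, time:chararray, ip:chararray);

-- Each tuple becomes one document in the test.weblogs collection
STORE weblogs INTO 'mongodb://<HOST>:<PORT>/test.weblogs'
    USING com.mongodb.hadoop.pig.MongoStorage();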

Related

Reading of DBname.system.indexes failed on Atlas cluster by mongobee after getting connected

I have a JHipster Spring Boot project. Recently I shifted from mLab standalone sandboxes to an Atlas cluster sandbox, M0 free tier replica set. It even worked, and I had made some database operations on it. But now, for some reason, there is a read permission error:
Error creating bean with name 'mongobee' defined in class path resource [DatabaseConfiguration.class]: Invocation of init method failed; nested exception is com.mongodb.MongoQueryException: Query failed with error code 8000 and error message 'user is not allowed to do action [find] on [test.system.indexes]' on server ********-shard-00-01-mfwhq.mongodb.net:27017
You can see the full stack trace here: https://pastebin.com/kaxcr7VS
I have searched high and low, and all I could find is that M0 tier users don't have permission to overwrite the admin database, which I am not doing.
Even now the connection to the mLab DB works fine, but I have this issue on the Atlas M0 tier.
MongoDB version: 3.4
JARs and their versions:
name: 'mongobee', version: '0.10'
name: 'mongo-java-driver', version: '3.4.2'
@Neil Lunn
The user ID I am using to connect is the admin's, and reads and writes work through the shell or Robo 3T (a MongoDB client).
After discussion with the MongoDB support team: MongoDB 3.0 deprecated direct access to the system.indexes collection, which had previously been used to list all indexes in a database. Applications should use db.<COLLECTION>.getIndexes() instead.
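For example, with the 3.x Java driver the equivalent is MongoCollection.listIndexes(); a minimal sketch, with the host, database, and collection names as placeholders:

import com.mongodb.MongoClient;
import com.mongodb.MongoClientURI;
import org.bson.Document;

public class ListIndexes {
    public static void main(String[] args) {
        MongoClient client = new MongoClient(new MongoClientURI("mongodb://<HOST>:27017"));
        try {
            // listIndexes() replaces querying the deprecated system.indexes collection
            for (Document index : client.getDatabase("test").getCollection("weblogs").listIndexes()) {
                System.out.println(index.toJson());
            }
        } finally {
            client.close();
        }
    }
}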
From the MongoDB Atlas docs it can be seen that they may forbid calls to system.* collections:
Optionally, for the read and readWrite role, you can also specify a collection. If you do not specify a collection for read and readWrite, the role applies to all collections (excluding some system.* collections) in the database.
From the stack trace it's visible that MongoBee is trying to make exactly this call, so this is the library's issue and the library should be updated.
UPDATE:
In order to fix the issue until MongoBee releases a new version:
1. Get the latest MongoBee sources: git clone git@github.com:mongobee/mongobee.git and cd mongobee
2. Fetch the pull request: git fetch origin pull/87/head:mongobee-atlas
3. Check it out: git checkout mongobee-atlas
4. Install the MongoBee JAR: mvn clean install
5. Get the compiled JAR from the /target folder or the local /.m2 repository
6. Use the JAR as a dependency in your project
Came across this issue this morning. Here's a quick-and-dirty monkey patch for the problem:
package com.github.mongobee.dao;

import com.github.mongobee.changeset.ChangeEntry;
import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;

import java.util.List;

import static com.github.mongobee.changeset.ChangeEntry.CHANGELOG_COLLECTION;

// Shadows mongobee's ChangeEntryIndexDao so that the changelog index lookup goes
// through getIndexInfo() instead of querying the system.indexes collection directly
public class ChangeEntryIndexDao {

    public void createRequiredUniqueIndex(DBCollection collection) {
        collection.createIndex(new BasicDBObject()
                        .append(ChangeEntry.KEY_CHANGEID, 1)
                        .append(ChangeEntry.KEY_AUTHOR, 1),
                new BasicDBObject().append("unique", true));
    }

    public DBObject findRequiredChangeAndAuthorIndex(DB db) {
        DBCollection changelogCollection = db.getCollection(CHANGELOG_COLLECTION);
        List<DBObject> indexes = changelogCollection.getIndexInfo();
        if (indexes == null) return null;
        for (DBObject index : indexes) {
            BasicDBObject indexKeys = (BasicDBObject) index.get("key");
            if (indexKeys != null
                    && indexKeys.get(ChangeEntry.KEY_CHANGEID) != null
                    && indexKeys.get(ChangeEntry.KEY_AUTHOR) != null) {
                return index;
            }
        }
        return null;
    }

    public boolean isUnique(DBObject index) {
        Object unique = index.get("unique");
        return unique instanceof Boolean && (Boolean) unique;
    }

    public void dropIndex(DBCollection collection, DBObject index) {
        collection.dropIndex(index.get("name").toString());
    }
}
Caused by: java.lang.NoSuchMethodError: com.github.mongobee.dao.ChangeEntryIndexDao.<init>(Ljava/lang/String;)V
at com.github.mongobee.dao.ChangeEntryDao.<init>(ChangeEntryDao.java:34)
at com.github.mongobee.Mongobee.<init>(Mongobee.java:87)
at com.xxx.proj.config.DatabaseConfiguration.mongobee(DatabaseConfiguration.java:62)
at com.xxx.proj.config.DatabaseConfiguration$$EnhancerBySpringCGLIB$$4ae465a5.CGLIB$mongobee$1(<generated>)
at com.xxx.proj.config.DatabaseConfiguration$$EnhancerBySpringCGLIB$$4ae465a5$$FastClassBySpringCGLIB$$f202afb.invoke(<generated>)
at org.springframework.cglib.proxy.MethodProxy.invokeSuper(MethodProxy.java:228)
at org.springframework.context.annotation.ConfigurationClassEnhancer$BeanMethodInterceptor.intercept(ConfigurationClassEnhancer.java:358)
at com.xxx.proj.config.DatabaseConfiguration$$EnhancerBySpringCGLIB$$4ae465a5.mongobee(<generated>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.springframework.beans.factory.support.SimpleInstantiationStrategy.instantiate(SimpleInstantiationStrategy.java:162)
... 22 common frames omitted
JHipster 5 must be using a different version, because I get the above error when implementing this code. It looks like it's expecting a different version.
This access to system.indexes is an open issue in mongobee. The issue has been fixed in the project, but the project was abandoned before the fix was ever released.
Due to this project abandonment, two successor libraries have since been forked from mongobee which have fixed this issue: Mongock and mongobeeJ.
Switching your application's dependency from the mongobee library to one of these successor libraries will allow you to run mongobee database migrations on Atlas.
To summarize these libraries:
Mongock - Forked from mongobee in 2018. Actively maintained. Has evolved significantly from the original, including built-in support for Spring, Spring Boot, and both versions 3 & 4 of the Mongo Java driver.
mongobeeJ - Forked from mongobee in 2018. Five updated versions have been released. Minimal evolution from the original mongobee. Mongo Java Driver 4 support was implemented in August 2020. The project was deprecated that same month, with a recommendation from its creators to use a library such as Mongock instead.
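As a sketch only, the dependency swap in a Gradle build might look like the following; the Mongock coordinates are assumptions, so check the project's documentation for the current group, artifact, and version:

// build.gradle - replace the abandoned mongobee dependency (illustrative coordinates)
// compile group: 'com.github.mongobee', name: 'mongobee', version: '0.10'
compile group: 'com.github.cloudyrock.mongock', name: 'mongock-spring', version: '<latest>'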

Export data from Mongo to Hive

My input: a collection ("demo1") in MongoDB (version 3.4.4).
My output: my data imported into a Hive database ("demo2") (version 1.2.1.2.3.4.7-4).
Purpose: create a connector between Mongo and Hive.
Error:
Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. com/mongodb/util/JSON
I tried two solutions, following these steps (but the error remains):
1) I created a local collection in Mongo (via Robomongo) connected to Docker.
2) I uploaded these versions of the JARs and added them in Hive:
ADD JAR /home/.../mongo-hadoop-hive-2.0.2.jar;
ADD JAR /home/.../mongo-hadoop-core-2.0.2.jar;
ADD JAR /home/.../mongo-java-driver-3.4.2.jar;
Unfortunately the error didn't change, so I uploaded other versions. I hesitated in choosing the right versions for my export, so I tried these:
ADD JAR /home/.../mongo-hadoop-hive-1.3.0.jar;
ADD JAR /home/.../mongo-hadoop-core-1.3.0.jar;
ADD JAR /home/.../mongo-java-driver-2.13.2.jar;
3) I created an external table:
CREATE EXTERNAL TABLE demo2
(
id INT,
name STRING,
password STRING,
email STRING
)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH
SERDEPROPERTIES('mongo.columns.mapping'='{"id":"_id","name":"name","password":"password","email":"email"}')
TBLPROPERTIES('mongo.uri'='mongodb://localhost:27017/local.demo1');
Error returned in Hive:
Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. com/mongodb/util/JSON
How can I resolve this problem?
Copying the correct JAR files (mongo-hadoop-core-2.0.2.jar, mongo-hadoop-hive-2.0.2.jar, mongo-java-driver-3.2.2.jar) to ALL the nodes of the cluster did the trick for me; a per-session alternative is sketched after the links below.
Other points to take care of:
Follow all steps mentioned here religiously - https://github.com/mongodb/mongo-hadoop/wiki/Hive-Usage#installation
Adhere to the requirements given here - https://github.com/mongodb/mongo-hadoop#requirements
Other useful links
https://github.com/mongodb/mongo-hadoop/wiki/FAQ#i-get-a-classnotfoundexceptionnoclassdeffounderror-when-using-the-connector-what-do-i-do
https://groups.google.com/forum/#!topic/mongodb-user/xMVoTSePgg0
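On the per-session alternative mentioned above: besides ADD JAR, the connector JARs can be supplied to a Hive session via Hive's auxiliary path option. This is only a sketch; the elided paths are kept as placeholders:

hive --auxpath /home/.../mongo-hadoop-hive-2.0.2.jar,/home/.../mongo-hadoop-core-2.0.2.jar,/home/.../mongo-java-driver-3.2.2.jar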

Spring Data MongoDB logging mongoTemplate query

I want to see the native MongoDB query created by Spring Data. I tried the following.
Created an application.properties file under META-INF/resources and added the following:
logging.level.org.springframework.data.mongodb.core.MongoTemplate=DEBUG
Created a log4j.properties file under META-INF/resources and added the following:
log4j.category.org.springframework.data.document.mongodb=DEBUG
log4j.appender.stdout.layout.ConversionPattern=%d{ABSOLUTE} %5p %40.40c:%4L - %m%n
At the code level:
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
...
private static final Log log = LogFactory.getLog(abc.class);
...
log.info(mongoTemplate.aggregate(aggregation, xyz.class, xyz.class));
I am not able to find any query in my Jetty console. Not sure what I am missing.

Testing Java MongoDB driver source with MongoDB running remotely

I have made changes to the source code and need to run all test cases to check their effect, using the command:
./gradlew check
I have MongoDB running on a remote machine. Can anyone help me configure the Java MongoDB driver tests to run against the remote MongoDB?
You need to import the Java driver in your project:
import com.mongodb.MongoClient;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;
import static java.util.Arrays.asList;
Then you need to connect to MongoDB on your server; it can be localhost or a remote host, and you can also choose which port to use:
MongoClient mongoClient = new MongoClient("localhost", 27017);
Then you can connect to your database:
MongoDatabase db = mongoClient.getDatabase("test");
And to get one of your collections and perform actions on it:
db.getCollection("restaurants").insertOne(
        new Document("address",
                new Document()
                        .append("street", "2 Avenue")
                        .append("zipcode", "10075")
                        .append("building", "1480")
                        .append("coord", asList(-73.9557413, 40.7720266)))
                .append("borough", "Manhattan")
                .append("cuisine", "Italian")
                .append("grades", asList(
                        new Document()
                                .append("grade", "A")
                                .append("score", 11),
                        new Document()
                                .append("grade", "B")
                                .append("score", 17)))
                .append("name", "Vella")
                .append("restaurant_id", "41704620"));
One can pass the connection string when starting the test cases:
./gradlew check -Dorg.mongodb.test.uri=mongodb://example.com:27017/

Soap UI, REST API, update database

I am going to use SoapUI to test a REST API framework.
Is there a way through which I can insert/update records in MongoDB with data from a file (CSV, TXT, etc.) using the SoapUI tool?
What I am trying to do is validate the API calls and update the database from a data file.
If you are willing to use a Groovy script, then you can do this pretty easily.
Put your JDBC driver in SoapUI's bin\ext directory:
https://github.com/mongodb/mongo-java-driver/downloads
(probably where you can get it for MongoDB)
Then you need roughly these things in your script:
import groovy.sql.Sql
def groovyUtils = new com.eviware.soapui.support.GroovyUtils(context)
groovyUtils.registerJdbcDriver("org.postgresql.Driver") // NOT SURE WHAT STRING FOR MONGODB
def connectString = "....."
sql = Sql.newInstance(connectString) // TEST YOUR CONNECT STRING IN A SQL BROWSER
def misc = sql.firstRow("SELECT * from table")
groovy.sql.Sql is very nice!
http://groovy.codehaus.org/api/groovy/sql/Sql.html
You can easily use the Groovy code below in a "Groovy test step" of your test case and connect to MongoDB. Before that, please ensure that the MongoDB Java client JAR file and GMongo are in the {Installation Directory}\bin\ext folder of your SoapUI installation.
GMongo: http://mvnrepository.com/artifact/com.gmongo/gmongo/1.5
MongoDB Java Client: http://mvnrepository.com/artifact/org.mongodb/mongo-java-driver/3.2.2
import com.gmongo.GMongoClient
import com.gmongo.GMongo
import com.mongodb.MongoCredential
import com.mongodb.ServerAddress
//def credentials = MongoCredential.createMongoCRCredential('admin', 'students', 'admin' as char[])
//def client = new GMongoClient(new ServerAddress("127.0.0.1:27017"))
context.gmongo = new GMongo()
def db = context.gmongo.getDB("test")
log.info db.fruit.find().count()
db.fruit.find().each { doc ->
    log.info doc
}