What is the most efficient way of moving data out of Hive and into MongoDB?

Is there an elegant, easy and fast way to move data out of Hive into MongoDB?

You can do the export with the Hadoop-MongoDB connector. Just run the Hive query in your job's main method. This output is then read by the Mapper, which inserts the data into MongoDB.
Example:
Here I'm inserting a semicolon-separated text file (id;firstname;lastname) into a MongoDB collection using a simple Hive query:
import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import com.mongodb.hadoop.MongoOutputFormat;
import com.mongodb.hadoop.io.BSONWritable;
import com.mongodb.hadoop.util.MongoConfigUtil;

public class HiveToMongo extends Configured implements Tool {

    private static class HiveToMongoMapper extends
            Mapper<LongWritable, Text, IntWritable, BSONWritable> {

        //See: https://issues.apache.org/jira/browse/HIVE-634
        private static final String HIVE_EXPORT_DELIMITER = '\001' + "";
        private IntWritable k = new IntWritable();
        private BSONWritable v = null;

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] split = value.toString().split(HIVE_EXPORT_DELIMITER);
            k.set(Integer.parseInt(split[0]));
            v = new BSONWritable();
            v.put("firstname", split[1]);
            v.put("lastname", split[2]);
            context.write(k, v);
        }
    }

    public static void main(String[] args) throws Exception {
        try {
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        }
        catch (ClassNotFoundException e) {
            System.out.println("Unable to load Hive Driver");
            System.exit(1);
        }
        try {
            Connection con = DriverManager.getConnection(
                    "jdbc:hive://localhost:10000/default");
            Statement stmt = con.createStatement();
            String sql = "INSERT OVERWRITE DIRECTORY " +
                    "'hdfs://localhost:8020/user/hive/tmp' select * from users";
            stmt.executeQuery(sql);
        }
        catch (SQLException e) {
            System.exit(1);
        }
        int res = ToolRunner.run(new Configuration(), new HiveToMongo(), args);
        System.exit(res);
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        Path inputPath = new Path("/user/hive/tmp");
        String mongoDbPath = "mongodb://127.0.0.1:6900/mongo_users.mycoll";
        MongoConfigUtil.setOutputURI(conf, mongoDbPath);
        /*
        Add dependencies to the distributed cache via
        DistributedCache.addFileToClassPath(...):
        - mongo-hadoop-core-x.x.x.jar
        - mongo-java-driver-x.x.x.jar
        - hive-jdbc-x.x.x.jar
        HadoopUtils is my own utility class
        */
        HadoopUtils.addDependenciesToDistributedCache("/libs/mongodb", conf);
        HadoopUtils.addDependenciesToDistributedCache("/libs/hive", conf);

        Job job = new Job(conf, "HiveToMongo");
        FileInputFormat.setInputPaths(job, inputPath);
        job.setJarByClass(HiveToMongo.class);
        job.setMapperClass(HiveToMongoMapper.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(MongoOutputFormat.class);
        job.setOutputKeyClass(IntWritable.class);    // matches the mapper's output key type
        job.setOutputValueClass(BSONWritable.class); // matches the mapper's output value type
        job.setNumReduceTasks(0);
        job.submit();
        System.out.println("Job submitted.");
        return 0;
    }
}
One drawback is that a 'staging area' (/user/hive/tmp) is needed to store the intermediate Hive output. Furthermore, as far as I know, the Mongo-Hadoop connector doesn't support upserts.
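If cleaning up the staging area matters to you, one option is to delete it once the export job has finished. A minimal sketch using the standard Hadoop FileSystem API, assuming the same path as in the example above:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// After the export job has finished, remove the intermediate Hive output.
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
fs.delete(new Path("/user/hive/tmp"), true); // true = delete recursively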
I'm not quite sure, but you could also try to fetch the data from Hive without running HiveServer (which exposes a Thrift service), which would probably save some overhead.
Look at the source code of Hive's org.apache.hadoop.hive.cli.CliDriver#processLine(String line, boolean allowInterrupting) method, which actually executes the query. Then you can hack together something like this:
...
LogUtils.initHiveLog4j();
CliSessionState ss = new CliSessionState(new HiveConf(SessionState.class));
ss.in = System.in;
ss.out = new PrintStream(System.out, true, "UTF-8");
ss.err = new PrintStream(System.err, true, "UTF-8");
SessionState.start(ss);
Driver qp = new Driver();
processLocalCmd("SELECT * from users", qp, ss); //taken from CliDriver
...
Side notes:
There's also a hive-mongo connector implementation you might check (a rough sketch of what such a table mapping looks like follows these notes).
It's also worth having a look at the implementation of the Hive-HBase connector to get an idea of how to implement a similar one for MongoDB.
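For a sense of what the hive-mongo route looks like: such connectors typically expose a MongoDB collection to Hive as an external table backed by a storage handler, so Hive can write to Mongo without a staging directory. A rough sketch, issued over the same Hive JDBC connection as in the main example; the storage handler class name and property names here are assumptions for illustration only, so check the connector's documentation for the exact values:
// Illustrative only: a Hive table backed directly by a MongoDB collection.
// The storage handler class and property names are assumptions, not verified values.
String ddl =
    "CREATE EXTERNAL TABLE mongo_users (id INT, firstname STRING, lastname STRING) " +
    "STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler' " +
    "TBLPROPERTIES('mongo.uri'='mongodb://127.0.0.1:6900/mongo_users.mycoll')";
stmt.execute(ddl);
stmt.execute("INSERT OVERWRITE TABLE mongo_users SELECT * FROM users");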

Have you looked into Sqoop? It's supposed to make it very simple to move data between Hadoop and SQL/NoSQL databases. This article also gives an example of using it with Hive.

Take a look at the Hadoop-MongoDB connector project:
http://api.mongodb.org/hadoop/MongoDB%2BHadoop+Connector.html
"This connectivity takes the form of allowing both reading MongoDB data into Hadoop (for use in MapReduce jobs as well as other components of the Hadoop ecosystem), as well as writing the results of Hadoop jobs out to MongoDB."
Not sure if it will work for your use case, but it's worth looking at.
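For the reading direction, a minimal sketch of the job setup, assuming the same mongo-hadoop classes as in the export example above (the collection URI is illustrative):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.util.MongoConfigUtil;

// Read documents from MongoDB into a MapReduce job instead of writing to it.
Configuration conf = new Configuration();
MongoConfigUtil.setInputURI(conf, "mongodb://127.0.0.1:6900/mongo_users.mycoll");

Job job = new Job(conf, "MongoToHadoop");
job.setInputFormatClass(MongoInputFormat.class);
// Mappers then typically receive each document's _id as the key and the BSON document as the value.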

Related

Perform Mongodb Aggregation in JMeter

I'm trying to run a simple aggregation in JMeter using the mongo-java-driver 3.8. I'm new to JMeter and to using Mongo with Java. I used this tutorial as a starting point:
https://www.blazemeter.com/blog/mongodb-performance-testing-with-jmeter/
I modified the code from the Querying Documents section for use in the JSR223 Sampler as follows:
import org.bson.Document;
import org.bson.types.ObjectId;
import com.mongodb.client.model.Aggregates;
try {
MongoCollection<Document> collection = vars.getObject("collection");
Document result = collection.aggregate(Arrays.asList(Aggregates.sample(1)));
vars.put("exampleDocumentId", result.get("_id").toString());
return "Document with id=" + result.get("_id") + " found";
}
catch (Exception e) {
SampleResult.setSuccessful(false);
SampleResult.setResponseCode("500");
SampleResult.setResponseMessage("Exception: " + e);
}
I get the following error in response for the Sampler result in the View Results tree:
Response code: 500
Response message: Exception: org.codehaus.groovy.runtime.typehandling.GroovyCastException:
Cannot cast object 'com.mongodb.client.internal.AggregateIterableImpl@3c7a0022' with class
'com.mongodb.client.internal.AggregateIterableImpl' to class 'org.bson.Document'
As chuckskull said, read the documentation:
https://mongodb.github.io/mongo-java-driver/3.8/driver/tutorials/aggregation/
You don't need blocks, and don't use forEach(printBlock); at the end of your aggregate statement; instead use first(), just as the tutorial mentioned above uses on find statements.
If you're a novice (like me), just use the restaurant data suggested in the documentation while you're getting the hang of how this works.
Here's a working example:
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.Aggregates;
import com.mongodb.client.model.Accumulators;
import com.mongodb.client.model.Projections;
import com.mongodb.client.model.Filters;
import org.bson.Document;

try {
    MongoClient mongoClient = MongoClients.create();
    MongoDatabase database = mongoClient.getDatabase("test");
    MongoCollection<Document> collection = database.getCollection("restaurants");
    Document result = collection.aggregate(Arrays.asList(Aggregates.sample(1))).first();
    vars.put("exampleDocumentId", result.get("_id").toString());
    return "Document with id=" + result.get("_id") + " found";
}
catch (Exception e) {
    SampleResult.setSuccessful(false);
    SampleResult.setResponseCode("500");
    SampleResult.setResponseMessage("Exception: " + e);
}
The collection.aggregate() call returns an AggregateIterable, which cannot be cast to a Document directly. You can use Groovy's head() method, which returns the first value from the Iterable instance, like:
Document result = collection.aggregate(Arrays.asList(Aggregates.sample(1))).head()
More information on Groovy scripting in JMeter: Apache Groovy - Why and How You Should Use It
Please pay attention to the part of the error where it says: Cannot cast object.
The method MongoCollection#aggregate returns AggregateIterable.
Changing Document to AggregateIterable<Document> will solve this problem.
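Applied to the sampler code from the question, that change is just the following (a minimal sketch reusing the names from the question; AggregateIterable comes from com.mongodb.client):
// aggregate() returns an AggregateIterable, not a Document
AggregateIterable<Document> result = collection.aggregate(Arrays.asList(Aggregates.sample(1)));
Document doc = result.first();  // take the single sampled document
vars.put("exampleDocumentId", doc.get("_id").toString());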

Unable to override lagom kafka parameters

I created a normal Java project and put all dependencies of the Lagom Kafka client on the classpath, then put the application.conf in the source folder.
Content of application.conf
lagom.broker.kafka {
service-name = ""
brokers = "127.0.0.1:9092"
}
While running the application, service-name = "" should be used (so that my broker list is used directly rather than being discovered), but it was not working.
While debugging I found that in the KafkaConfig class, service-name comes out to be "kafka_native".
I found that when KafkaConfig is created, the incoming config object doesn't have my application.conf in its origin.
After this I tried overriding them using VM parameters like this:
-Dlagom.broker.kafka.service-name=""
-Dlagom.broker.kafka.brokers="127.0.0.1:9092"
-Dakka.kafka.consumer.kafka-clients.auto.offset.reset="earliest"
and it worked.
Can somebody explain why overriding them in application.conf is not working?
This is how I am subscribing to the topic:
import java.net.URI;
import java.util.concurrent.CompletableFuture;
import com.ameyo.ticketing.ticket.api.TicketingService;
import com.ameyo.ticketing.ticket.api.events.TicketEvent;
import com.lightbend.lagom.javadsl.api.broker.Topic;
import com.lightbend.lagom.javadsl.client.integration.LagomClientFactory;
import com.typesafe.config.ConfigFactory;
import akka.Done;
import akka.stream.javadsl.Flow;
/**
*
*/
public class Main {
public static void main(String[] args) {
String brokers = ConfigFactory.load().getString("lagom.broker.kafka.brokers");
System.out.println("Initial Value for Brokers " + brokers);
LagomClientFactory clientFactory = LagomClientFactory.create("legacy-system", Main.class.getClassLoader());
TicketingService ticketTingService = clientFactory.createClient(TicketingService.class,
URI.create("http://localhost:11000"));
Topic<TicketEvent> ticketEvents = ticketTingService.ticketEvents();
ticketEvents.subscribe().withGroupId("nya13").atLeastOnce(Flow.<TicketEvent> create().mapAsync(1, e -> {
System.out.println("kuch to aaya");
return CompletableFuture.completedFuture(Done.getInstance());
}));
try {
Thread.sleep(1000000000);
} catch (InterruptedException e1) {
}
}
}
Change the configuration to:
akka{
lagom.broker.kafka {
service-name = ""
brokers = "127.0.0.1:9092"
}
}
and it worked.
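If you want to confirm which file a setting is actually resolved from, the Typesafe Config API can report the origin of a value. A small sketch (the path assumes the akka-prefixed layout shown above):
import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;

Config config = ConfigFactory.load();
String path = "akka.lagom.broker.kafka.brokers";
// Prints the resolved value and the file/line it was loaded from.
System.out.println(config.getString(path));
System.out.println(config.getValue(path).origin().description());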

How to run Hadoop jobs in Amazon EMR using Eclipse?

I have followed the tutorial given by Amazon here, but it seems that my code fails to run.
The error that I got:
Exception in thread "main" java.lang.Error: Unresolved compilation problem:
The method withJobFlowRole(String) is undefined for the type AddJobFlowStepsRequest
at main.main(main.java:38)
My full code:
import java.io.IOException;
import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.auth.PropertiesCredentials;
import com.amazonaws.services.elasticmapreduce.*;
import com.amazonaws.services.elasticmapreduce.model.AddJobFlowStepsRequest;
import com.amazonaws.services.elasticmapreduce.model.AddJobFlowStepsResult;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;
import com.amazonaws.services.elasticmapreduce.util.StepFactory;
public class main {
public static void main(String[] args) {
AWSCredentials credentials = null;
try {
credentials = new PropertiesCredentials(
main.class.getResourceAsStream("AwsCredentials.properties"));
} catch (IOException e1) {
System.out.println("Credentials were not properly entered into AwsCredentials.properties.");
System.out.println(e1.getMessage());
System.exit(-1);
}
AmazonElasticMapReduce client = new AmazonElasticMapReduceClient(credentials);
// predefined steps. See StepFactory for list of predefined steps
StepConfig hive = new StepConfig("Hive", new StepFactory().newInstallHiveStep());
// A custom step
HadoopJarStepConfig hadoopConfig1 = new HadoopJarStepConfig()
.withJar("s3://mywordcountbuckett/binary/WordCount.jar")
.withMainClass("com.my.Main1") // optional main class, this can be omitted if jar above has a manifest
.withArgs("--verbose"); // optional list of arguments
StepConfig customStep = new StepConfig("Step1", hadoopConfig1);
AddJobFlowStepsResult result = client.addJobFlowSteps(new AddJobFlowStepsRequest()
.withJobFlowRole("jobflow_role")
.withServiceRole("service_role")
.withSteps(hive, customStep));
System.out.println(result.getStepIds());
}
}
What could be the reason that the code is not running?
Are there any tutorials based on the latest version?

How to create repository instance in JackRabbit Oak using MongoMK

I am trying to create an Oak JCR repository to store content with "Apache Oak over MongoDB" (which I have absolutely no idea about).
Here's what I've been doing.
MongoClient connection = new MongoClient("127.0.0.1", 27017);
DB db = connection.getDB("test");
MongoMK.Builder m = new MongoMK.Builder();
MongoMK kernel = m.setMongoDB(db).open();
Repository repo = new Jcr().createRepository();
session = repo.login(); // Error javax.jcr.NoSuchWorkspaceException
I was trying to link the Repository to MongoMK, which seems like a nightmare.
I have tried doing
Repository repo = new Jcr(kernel).createRepository(); //Error
I found something similar at "How to create repository instance in JackRabbit Oak using MicroKernel", but that didn't help either.
My question being: is there any way to link up MongoMK and the Repository?
P.S. - I tried using "NodeStore".
Yes, this was not well documented. The following should work:
import javax.jcr.Node;
import javax.jcr.Repository;
import javax.jcr.Session;
import org.apache.jackrabbit.oak.Oak;
import org.apache.jackrabbit.oak.jcr.Jcr;
import org.apache.jackrabbit.oak.plugins.document.DocumentMK;
import org.apache.jackrabbit.oak.plugins.document.DocumentNodeStore;
import org.apache.jackrabbit.oak.spi.security.OpenSecurityProvider;
import com.mongodb.DB;
import com.mongodb.MongoClient;

public class Test {
    public static void main(String... args) throws Exception {
        DB db = new MongoClient("127.0.0.1", 27017).getDB("test2");
        DocumentNodeStore ns = new DocumentMK.Builder().
                setMongoDB(db).getNodeStore();
        Repository repo = new Jcr(new Oak(ns))
                .with(new OpenSecurityProvider())
                .createRepository();
        Session session = repo.login();
        Node root = session.getRootNode();
        if (root.hasNode("hello")) {
            Node hello = root.getNode("hello");
            long count = hello.getProperty("count").getLong();
            hello.setProperty("count", count + 1);
            System.out.println("found the hello node, count = " + count);
        } else {
            System.out.println("creating the hello node");
            root.addNode("hello").setProperty("count", 1);
        }
        session.save();
        session.logout();
        ns.dispose();
    }
}
This is now also documented.

SMS and Email Queues from Database

I just wanted to discuss a situation I am facing.
I want to send emails to the users, a lot of emails, but if I send them at application run time the AWS SDK is slow for emails: a bad user experience, at least for my application.
So what I plan to do is enter the data (email address, content to send, 0) into the database and launch a cron job that reads the table and starts sending the emails; once it sends an email, it marks the database row as 1.
I read somewhere that this is a bad practice and puts extra load on the database server.
Yes, I would use intelligent crons so that no two crons overlap, or set up one cron each for even and odd IDs, etc. I am also looking at 3rd-party alternatives like http://www.iron.io/ for crons.
Could someone share their experience with a similar situation? I just want to use an intelligent solution and not put a ton of resources into the database and spend heavily on transactions...
I had to do something similar and did as Charles Engelke suggested - I used SQS.
I eliminated the database entirely by putting the entire message contents in the SQS message. You're limited to 64 KB in an SQS message, so as long as that's not a problem this approach is possible.
Here is sample code to queue up the message:
package com.softwareconfidence.bsp.sending;

import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.model.SendMessageRequest;
import com.googlecode.funclate.json.Json;
import java.util.HashMap;
import java.util.Map;

public class EmailQueuer {
    private final AmazonSQS sqs;
    private final String sendQueueUrl;

    public EmailQueuer(AmazonSQS sqs, String sendQueueUrl) {
        this.sqs = sqs;
        this.sendQueueUrl = sendQueueUrl;
    }

    public void queue() {
        Map<String, String> emailModel = new HashMap<String, String>() {{
            put("from", "me@me.com");
            put("to", "you@you.com");
            put("cc", "her@them.com");
            put("subject", "Greetings");
            put("body", "Hello World");
        }};
        sqs.sendMessage(new SendMessageRequest(sendQueueUrl, Json.toJson(emailModel)));
    }
}
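A minimal usage sketch for the class above, assuming AWS SDK v1 style client construction; the queue name is illustrative and 'credentials' is assumed to be loaded elsewhere:
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClient;
import com.amazonaws.services.sqs.model.CreateQueueRequest;

// Illustrative wiring: create (or look up) the queue and enqueue one email message.
AmazonSQS sqs = new AmazonSQSClient(credentials);
String sendQueueUrl = sqs.createQueue(new CreateQueueRequest("email-send-queue")).getQueueUrl();
new EmailQueuer(sqs, sendQueueUrl).queue();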
Then in your app you need to have an executor service that polls the queue and processes messages:
new ScheduledThreadPoolExecutor(1).scheduleAtFixedRate(sendEmails(), 0, 1, MINUTES)
You will need to make sure to call shutdown() on this executor when the app is exiting. Anyway, this line will send emails every minute, where sendEmails() returns an instance of this Runnable class:
package com.softwareconfidence.bsp.standalone.sending;

import com.amazonaws.services.simpleemail.AmazonSimpleEmailService;
import com.amazonaws.services.simpleemail.model.*;
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.model.DeleteMessageRequest;
import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
import com.amazonaws.services.sqs.model.ReceiveMessageResult;
import com.amazonaws.services.sqs.model.SendMessageRequest;
import com.googlecode.funclate.json.Json;
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.List;
import java.util.Map;

public class FromSqsEmailer implements Runnable {
    private final AmazonSQS sqs;
    private final String sendQueueUrl;
    private final String deadLetterQueueUrl;
    private final AmazonSimpleEmailService emailService;

    public FromSqsEmailer(AmazonSimpleEmailService emailService, String deadLetterQueueUrl, String sendQueueUrl, AmazonSQS sqs) {
        this.emailService = emailService;
        this.deadLetterQueueUrl = deadLetterQueueUrl;
        this.sendQueueUrl = sendQueueUrl;
        this.sqs = sqs;
    }

    public void run() {
        int batchSize = 10;
        int numberHandled;
        do {
            ReceiveMessageResult receiveMessageResult =
                    sqs.receiveMessage(new ReceiveMessageRequest(sendQueueUrl).withMaxNumberOfMessages(batchSize));
            final List<com.amazonaws.services.sqs.model.Message> toSend = receiveMessageResult.getMessages();
            for (com.amazonaws.services.sqs.model.Message message : toSend) {
                SendEmailResult sendResult = sendMyEmail(Json.parse(message.getBody()));
                if (sendResult != null) {
                    sqs.deleteMessage(new DeleteMessageRequest(sendQueueUrl, message.getReceiptHandle()));
                }
            }
            numberHandled = toSend.size();
        } while (numberHandled > 0);
    }

    private SendEmailResult sendMyEmail(Map<String, Object> emailModel) {
        Destination to = new Destination()
                .withToAddresses(get("to", emailModel))
                .withCcAddresses(get("cc", emailModel));
        try {
            return emailService.sendEmail(new SendEmailRequest(get("from", emailModel), to, body(emailModel)));
        } catch (Exception e) {
            StringWriter stackTrace = new StringWriter();
            e.printStackTrace(new PrintWriter(stackTrace));
            sqs.sendMessage(new SendMessageRequest(deadLetterQueueUrl, "while sending email " + stackTrace));
        }
        return null;
    }

    private String get(String propertyName, Map<String, Object> emailModel) {
        return emailModel.get(propertyName).toString();
    }

    private Message body(Map<String, Object> emailModel) {
        Message message = new Message().withSubject(new Content(get("subject", emailModel)));
        Body body = new Body().withText(new Content(get("body", emailModel)));
        message.setBody(body);
        return message;
    }
}
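To tie it together, a minimal sketch of starting the poller and shutting it down cleanly on exit; emailService, sqs, and the queue URLs are assumed to be constructed elsewhere:
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

ScheduledExecutorService poller = new ScheduledThreadPoolExecutor(1);
poller.scheduleAtFixedRate(
        new FromSqsEmailer(emailService, deadLetterQueueUrl, sendQueueUrl, sqs),
        0, 1, TimeUnit.MINUTES);

// Stop polling when the application exits.
Runtime.getRuntime().addShutdownHook(new Thread(poller::shutdown));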
One downside of this approach, if you're using a database, is that the email sending step is an HTTP call. If you have a database transaction that rolls back after this HTTP call, your business process is undone, but the email is still going to be sent.
Food for thought.
Thanks for the detailed response, Mike. I finally ended up implementing a REST API for my application with secure username+password+key access, and I run it from a 3rd-party service, Iron.io, which gets
www.example.com/rest/messages/format/json
It iterates and sends the messages, collecting status in an array, which it then posts back to
www.example.com/rest/messagesposted
I followed this approach because I had to schedule messages over intervals of more than 90 days, and queues hold messages only for around 14 days.
What do you reckon?