How to run Hadoop jobs in Amazon EMR using Eclipse?

I have followed the tutorial given by Amazon here, but my code fails to run.
The error that I got:
Exception in thread "main" java.lang.Error: Unresolved compilation problem:
The method withJobFlowRole(String) is undefined for the type AddJobFlowStepsRequest
at main.main(main.java:38)
My full code:
import java.io.IOException;

import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.auth.PropertiesCredentials;
import com.amazonaws.services.elasticmapreduce.*;
import com.amazonaws.services.elasticmapreduce.model.AddJobFlowStepsRequest;
import com.amazonaws.services.elasticmapreduce.model.AddJobFlowStepsResult;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;
import com.amazonaws.services.elasticmapreduce.util.StepFactory;

public class main {

    public static void main(String[] args) {
        AWSCredentials credentials = null;
        try {
            credentials = new PropertiesCredentials(
                    main.class.getResourceAsStream("AwsCredentials.properties"));
        } catch (IOException e1) {
            System.out.println("Credentials were not properly entered into AwsCredentials.properties.");
            System.out.println(e1.getMessage());
            System.exit(-1);
        }

        AmazonElasticMapReduce client = new AmazonElasticMapReduceClient(credentials);

        // predefined steps. See StepFactory for list of predefined steps
        StepConfig hive = new StepConfig("Hive", new StepFactory().newInstallHiveStep());

        // A custom step
        HadoopJarStepConfig hadoopConfig1 = new HadoopJarStepConfig()
                .withJar("s3://mywordcountbuckett/binary/WordCount.jar")
                .withMainClass("com.my.Main1") // optional main class, this can be omitted if jar above has a manifest
                .withArgs("--verbose"); // optional list of arguments
        StepConfig customStep = new StepConfig("Step1", hadoopConfig1);

        AddJobFlowStepsResult result = client.addJobFlowSteps(new AddJobFlowStepsRequest()
                .withJobFlowRole("jobflow_role")
                .withServiceRole("service_role")
                .withSteps(hive, customStep));
        System.out.println(result.getStepIds());
    }
}
What could be the reason that the code is not running?
Are there any tutorials based on the latest version?
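For reference, with the 1.x aws-java-sdk the usual way to add steps to an already-running cluster identifies the cluster with withJobFlowId; if memory serves, withJobFlowRole and withServiceRole are setters on RunJobFlowRequest (used when launching a new cluster), which could explain the unresolved method. A hedged sketch reusing the client, hive and customStep objects from the code above (the "j-XXXXXXXXXXXXX" cluster ID is a placeholder, and the exact builder methods depend on the SDK version on your build path):

// Hedged sketch: add the two steps defined above to an existing cluster.
AddJobFlowStepsResult result = client.addJobFlowSteps(new AddJobFlowStepsRequest()
        .withJobFlowId("j-XXXXXXXXXXXXX") // placeholder for the real cluster (job flow) ID
        .withSteps(hive, customStep));
System.out.println(result.getStepIds());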

Related

Vert.x ConfigRetriever.listen does not get triggered when the config file changes

I have a simple verticle that I am using to test ConfigRetriever.listen for changes.
Using Vert.x version 4.3.4
import io.vertx.config.ConfigRetriever;
import io.vertx.config.ConfigRetrieverOptions;
import io.vertx.config.ConfigStoreOptions;
import io.vertx.core.AbstractVerticle;
import io.vertx.core.json.JsonObject;

public class MyVerticle extends AbstractVerticle {

    @Override
    public void start() {
        ConfigStoreOptions fileStore = new ConfigStoreOptions().setType("file")
                .setFormat("yaml")
                .setConfig(new JsonObject().put("path", "config.yaml"));
        ConfigRetrieverOptions options = new ConfigRetrieverOptions().setScanPeriod(1000)
                .addStore(fileStore);
        ConfigRetriever retriever = ConfigRetriever.create(vertx, options);
        retriever.listen(change -> {
            JsonObject previous = change.getPreviousConfiguration();
            System.out.println(previous);
            JsonObject changedConf = change.getNewConfiguration();
            System.out.println(changedConf);
        });
    }
}
[Edit] The config file is under src/main/resources.
When I run this, the 'previous' output is empty and the 'new' output is the config from my YAML file:
{}
{"bridgeservice":{"eb_address":"xyz","path":"/api/v1/aaa/","port":80}}
The problem is that when I change a value in the YAML config file, nothing happens; I expect the changes to get printed. When I run this in the debugger I see
Thread [vert.x-internal-blocking-0] (Running)
..
..
..
Thread [vert.x-internal-blocking-19] (Running)
When I put the following just before retriever.listen(), the "succeeded ..." line is printed, but nothing comes from the listen handler even after I change the config file values.
retriever.getConfig(ar -> {
    if (ar.succeeded()) {
        System.out.println("succeeded :" + ar.result());
    } else {
        ar.cause().printStackTrace();
    }
});
This may be related to the SO question having-trouble-listen-vert-x-config-change.
[Edit] When I moved my config file from src/main/resources to a folder cfg at the same level as src, the verticle behaved as it should and picked up config changes. I don't know why; maybe it's an Eclipse environment thing.
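A plausible explanation is that with path "config.yaml" the store resolves the copy on the classpath (the one under the build output), and editing the file in src/main/resources does not touch that copy until the project is rebuilt. A minimal sketch of the adjusted store options, assuming the file now lives at cfg/config.yaml relative to the launch directory (only the path value changes from the verticle above):

ConfigStoreOptions fileStore = new ConfigStoreOptions()
        .setType("file")
        .setFormat("yaml")
        // Point at the external copy you actually edit; "cfg/config.yaml" is an example location.
        .setConfig(new JsonObject().put("path", "cfg/config.yaml"));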

Unable to override lagom kafka parameters

I created a normal Java project and put all the Lagom Kafka client dependencies on the classpath, then put application.conf in the source folder.
Content of application.conf:
lagom.broker.kafka {
  service-name = ""
  brokers = "127.0.0.1:9092"
}
While running the application, service-name = "" should be used (so that my broker path is used rather than service discovery), but it was not working.
While debugging I found that in the KafkaConfig class, service-name comes out to be "kafka_native".
I found that when KafkaConfig is created, the conf object it receives does not have my application.conf in its origin.
After this I tried overriding the settings with VM parameters like this:
-Dlagom.broker.kafka.service-name=""
-Dlagom.broker.kafka.brokers="127.0.0.1:9092"
-Dakka.kafka.consumer.kafka-clients.auto.offset.reset="earliest"
and it worked.
Can somebody explain why overriding in application.conf is not working?
This is how I am subscribing to the topic:
import java.net.URI;
import java.util.concurrent.CompletableFuture;

import com.ameyo.ticketing.ticket.api.TicketingService;
import com.ameyo.ticketing.ticket.api.events.TicketEvent;
import com.lightbend.lagom.javadsl.api.broker.Topic;
import com.lightbend.lagom.javadsl.client.integration.LagomClientFactory;
import com.typesafe.config.ConfigFactory;

import akka.Done;
import akka.stream.javadsl.Flow;

public class Main {

    public static void main(String[] args) {
        String brokers = ConfigFactory.load().getString("lagom.broker.kafka.brokers");
        System.out.println("Initial Value for Brokers " + brokers);

        LagomClientFactory clientFactory = LagomClientFactory.create("legacy-system", Main.class.getClassLoader());
        TicketingService ticketTingService = clientFactory.createClient(TicketingService.class,
                URI.create("http://localhost:11000"));

        Topic<TicketEvent> ticketEvents = ticketTingService.ticketEvents();
        ticketEvents.subscribe().withGroupId("nya13").atLeastOnce(Flow.<TicketEvent>create().mapAsync(1, e -> {
            System.out.println("kuch to aaya"); // "something arrived"
            return CompletableFuture.completedFuture(Done.getInstance());
        }));

        try {
            Thread.sleep(1000000000);
        } catch (InterruptedException e1) {
        }
    }
}
Change the configuration to:
akka {
  lagom.broker.kafka {
    service-name = ""
    brokers = "127.0.0.1:9092"
  }
}
and it worked.
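If you want to see where a setting actually comes from, the Typesafe Config API can report the origin of each value. A small sketch that reads back the nested akka.lagom.broker.kafka.service-name path shown above (the path is only an assumption about where your effective setting lives):

import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;

public class ConfigOriginCheck {
    public static void main(String[] args) {
        Config config = ConfigFactory.load();
        // Print the effective value and the file, resource, or system property that defined it.
        System.out.println(config.getString("akka.lagom.broker.kafka.service-name"));
        System.out.println(config.getValue("akka.lagom.broker.kafka.service-name").origin());
    }
}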

How to create repository instance in JackRabbit Oak using MongoMK

I am trying to create an Oak JCR repository to store content with "Apache Oak over MongoDB"
(which I have absolutely no idea about).
Here's what I've been doing:
MongoClient connection = new MongoClient("127.0.0.1", 27017);
DB db = connection.getDB("test");

MongoMK.Builder m = new MongoMK.Builder();
MongoMK kernel = m.setMongoDB(db).open();

Repository repo = new Jcr().createRepository();
session = repo.login(); // Error: javax.jcr.NoSuchWorkspaceException
I was trying to link the "Repository" to "MongoMK", which seems like a nightmare.
I have tried doing
Repository repo = new Jcr(kernel).createRepository(); // Error
I found something similar at How to create repository instance in JackRabbit Oak using MicroKernel, but that didn't help either.
My question being: is there any way to link up MongoMK and the Repository?
P.S. - I tried using "NodeStore".
Yes, this was not well documented. The following should work:
import javax.jcr.Node;
import javax.jcr.Repository;
import javax.jcr.Session;

import org.apache.jackrabbit.oak.Oak;
import org.apache.jackrabbit.oak.jcr.Jcr;
import org.apache.jackrabbit.oak.plugins.document.DocumentMK;
import org.apache.jackrabbit.oak.plugins.document.DocumentNodeStore;
import org.apache.jackrabbit.oak.spi.security.OpenSecurityProvider;

import com.mongodb.DB;
import com.mongodb.MongoClient;

public class Test {

    public static void main(String... args) throws Exception {
        DB db = new MongoClient("127.0.0.1", 27017).getDB("test2");
        DocumentNodeStore ns = new DocumentMK.Builder().
                setMongoDB(db).getNodeStore();
        Repository repo = new Jcr(new Oak(ns))
                .with(new OpenSecurityProvider())
                .createRepository();

        Session session = repo.login();
        Node root = session.getRootNode();
        if (root.hasNode("hello")) {
            Node hello = root.getNode("hello");
            long count = hello.getProperty("count").getLong();
            hello.setProperty("count", count + 1);
            System.out.println("found the hello node, count = " + count);
        } else {
            System.out.println("creating the hello node");
            root.addNode("hello").setProperty("count", 1);
        }
        session.save();
        session.logout();
        ns.dispose();
    }
}
This is now also documented.
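Since the question also mentions trying a plain NodeStore: the same Jcr/Oak wiring accepts any NodeStore implementation, so you can swap the Mongo-backed store for an in-memory one while experimenting. A sketch assuming the MemoryNodeStore class from Oak's memory plugins package (verify the package names against your Oak version):

import javax.jcr.Repository;
import javax.jcr.Session;

import org.apache.jackrabbit.oak.Oak;
import org.apache.jackrabbit.oak.jcr.Jcr;
import org.apache.jackrabbit.oak.plugins.memory.MemoryNodeStore;
import org.apache.jackrabbit.oak.spi.security.OpenSecurityProvider;
import org.apache.jackrabbit.oak.spi.state.NodeStore;

public class MemoryRepoTest {
    public static void main(String... args) throws Exception {
        // Any NodeStore can back the repository; here an in-memory one for quick local tests.
        NodeStore ns = new MemoryNodeStore();
        Repository repo = new Jcr(new Oak(ns))
                .with(new OpenSecurityProvider())
                .createRepository();
        Session session = repo.login();
        System.out.println("root node: " + session.getRootNode().getPath());
        session.logout();
    }
}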

What is the most efficient way of moving data out of Hive and into MongoDB?

Is there an elegant, easy and fast way to move data out of Hive into MongoDB?
You can do the export with the Hadoop-MongoDB connector. Just run the Hive query in your job's main method. This output will then be used by the Mapper in order to insert the data into MongoDB.
Example:
Here I'm inserting a semicolon-separated text file (id;firstname;lastname) into a MongoDB collection using a simple Hive query:
import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import com.mongodb.hadoop.MongoOutputFormat;
import com.mongodb.hadoop.io.BSONWritable;
import com.mongodb.hadoop.util.MongoConfigUtil;

public class HiveToMongo extends Configured implements Tool {

    private static class HiveToMongoMapper extends
            Mapper<LongWritable, Text, IntWritable, BSONWritable> {

        // Hive exports fields separated by its default delimiter, \001 (Ctrl-A).
        // See: https://issues.apache.org/jira/browse/HIVE-634
        private static final String HIVE_EXPORT_DELIMITER = '\001' + "";
        private IntWritable k = new IntWritable();
        private BSONWritable v = null;

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] split = value.toString().split(HIVE_EXPORT_DELIMITER);

            k.set(Integer.parseInt(split[0]));
            v = new BSONWritable();
            v.put("firstname", split[1]);
            v.put("lastname", split[2]);
            context.write(k, v);
        }
    }

    public static void main(String[] args) throws Exception {
        try {
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        } catch (ClassNotFoundException e) {
            System.out.println("Unable to load Hive Driver");
            System.exit(1);
        }

        try {
            Connection con = DriverManager.getConnection(
                    "jdbc:hive://localhost:10000/default");
            Statement stmt = con.createStatement();
            String sql = "INSERT OVERWRITE DIRECTORY " +
                    "'hdfs://localhost:8020/user/hive/tmp' select * from users";
            stmt.executeQuery(sql);
        } catch (SQLException e) {
            System.exit(1);
        }

        int res = ToolRunner.run(new Configuration(), new HiveToMongo(), args);
        System.exit(res);
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        Path inputPath = new Path("/user/hive/tmp");
        String mongoDbPath = "mongodb://127.0.0.1:6900/mongo_users.mycoll";
        MongoConfigUtil.setOutputURI(conf, mongoDbPath);
        /*
        Add dependencies to distributed cache via
        DistributedCache.addFileToClassPath(...) :
          - mongo-hadoop-core-x.x.x.jar
          - mongo-java-driver-x.x.x.jar
          - hive-jdbc-x.x.x.jar
        HadoopUtils is an own utility class
        */
        HadoopUtils.addDependenciesToDistributedCache("/libs/mongodb", conf);
        HadoopUtils.addDependenciesToDistributedCache("/libs/hive", conf);

        Job job = new Job(conf, "HiveToMongo");

        FileInputFormat.setInputPaths(job, inputPath);
        job.setJarByClass(HiveToMongo.class);
        job.setMapperClass(HiveToMongoMapper.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(MongoOutputFormat.class);
        // match the mapper's output types
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(BSONWritable.class);
        job.setNumReduceTasks(0);

        job.submit();
        System.out.println("Job submitted.");
        return 0;
    }
}
One drawback is that a 'staging area' (/user/hive/tmp) is needed to store the intermediate Hive output. Furthermore, as far as I know, the Mongo-Hadoop connector doesn't support upserts.
I'm not quite sure, but you can also try to fetch the data from Hive without running hiveserver (which exposes a Thrift service), so you can probably save some overhead.
Look at the source code of Hive's org.apache.hadoop.hive.cli.CliDriver#processLine(String line, boolean allowInterupting) method, which actually executes the query. Then you can hack together something like this:
...
LogUtils.initHiveLog4j();
CliSessionState ss = new CliSessionState(new HiveConf(SessionState.class));
ss.in = System.in;
ss.out = new PrintStream(System.out, true, "UTF-8");
ss.err = new PrintStream(System.err, true, "UTF-8");
SessionState.start(ss);
Driver qp = new Driver();
processLocalCmd("SELECT * from users", qp, ss); //taken from CliDriver
...
Side notes:
There's also a hive-mongo connector implementation you might check.
It's also worth having a look at the implementation of the Hive-HBase connector to get some ideas if you want to implement a similar one for MongoDB.
Have you looked into Sqoop? It's supposed to make it very simple to move data between Hadoop and SQL/NoSQL databases. This article also gives an example of using it with Hive.
Take a look at the Hadoop-MongoDB connector project:
http://api.mongodb.org/hadoop/MongoDB%2BHadoop+Connector.html
"This connectivity takes the form of allowing both reading MongoDB data into Hadoop (for use in MapReduce jobs as well as other components of the Hadoop ecosystem), as well as writing the results of Hadoop jobs out to MongoDB."
Not sure if it will work for your use case, but it's worth looking at.

How to use WebSockets in Scala using Play Framework?

I would like to use WebSockets in Scala with the Play Framework, but I can't get the echo-server example to work.
What should I import for await() and disconnect()?
The error I get is: not found: value await. I used the code below:
package controllers

import play._
import play.mvc._
import play.mvc.Http.WebSocketEvent
import play.mvc.Http.WebSocketFrame
import play.mvc.Http.WebSocketClose
import play.mvc.WebSocketController

object MySocket extends WebSocketController {

  def echo = {
    while (Http.Inbound.current().isOpen()) {
      val e: WebSocketEvent =
        await(Http.Inbound.current().nextEvent()).asInstanceOf[WebSocketEvent]

      if (e.isInstanceOf[WebSocketFrame]) {
        val frame: WebSocketFrame = e.asInstanceOf[WebSocketFrame]
        if (!frame.isBinary) {
          if (frame.textData.equals("quit")) {
            Http.Outbound.current().send("Bye!")
            disconnect()
          } else {
            Http.Outbound.current().send("Echo: " + frame.textData)
          }
        }
      }

      if (e.isInstanceOf[WebSocketClose]) {
        Logger.info("Socket closed!")
      }
    }
  }
}
Here is the compilation error in the Terminal:
Compiling:
/Users/jonas/play-1.2.2RC1/jonassite/app/MySocket.scala
/Users/jonas/play-1.2.2RC1/jonassite/app/MySocket.scala:14: not found: value await
val e : WebSocketEvent = await(Http.Inbound.current().nextEvent()).asInstanceOf[WebSocketEvent]
^
/Users/jonas/play-1.2.2RC1/jonassite/app/MySocket.scala:20: not found: value disconnect
disconnect();
^
two errors found
12:52:57,049 ERROR ~
#66lce6kp8
Internal Server Error (500) for request GET /handshake
Compilation error (In /app/MySocket.scala around line 14)
The file /app/MySocket.scala could not be compiled. Error raised is : not found: value await
play.exceptions.CompilationException: not found: value await
at play.scalasupport.ScalaPlugin.compilationException(ScalaPlugin.scala:129)
at play.scalasupport.ScalaPlugin.detectClassesChange(ScalaPlugin.scala:115)
at play.plugins.PluginCollection.detectClassesChange(PluginCollection.java:358)
at play.Play.detectChanges(Play.java:591)
at play.Invoker$Invocation.init(Invoker.java:186)
at Invocation.HTTP Request(Play!)
await() and disconnect() are methods available from the WebSocketController. However, these are currently only available in the Java version, not in Scala. See this post here on the Play group for more information.
This should be available in the 1.0 release of the Scala plugin, but for now, if you want to use the async features (await etc.), you will have to use Java, or take a look at the Java wrapper that one of the Play users has developed.
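For reference, a rough sketch of the Java-side equivalent in Play 1.2, modeled on the framework's documented echo example (inbound, outbound, await and disconnect are inherited members of the Java WebSocketController; treat the exact fields and signatures as assumptions to verify against your Play version):

package controllers;

import play.Logger;
import play.mvc.Http.WebSocketClose;
import play.mvc.Http.WebSocketEvent;
import play.mvc.Http.WebSocketFrame;
import play.mvc.WebSocketController;

public class MySocket extends WebSocketController {

    public static void echo() {
        while (inbound.isOpen()) {
            // await() suspends the request until the next WebSocket event arrives.
            WebSocketEvent e = await(inbound.nextEvent());

            if (e instanceof WebSocketFrame) {
                WebSocketFrame frame = (WebSocketFrame) e;
                if (!frame.isBinary) {
                    if ("quit".equals(frame.textData)) {
                        outbound.send("Bye!");
                        disconnect();
                    } else {
                        outbound.send("Echo: " + frame.textData);
                    }
                }
            }
            if (e instanceof WebSocketClose) {
                Logger.info("Socket closed!");
            }
        }
    }
}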