Converting PDF file to text on HDFS (JAVA) [duplicate] - eclipse

This question already has answers here:
Why am I getting a NoClassDefFoundError in Java?
(31 answers)
Closed 5 years ago.
In this project, I extend FileInputFormat with a PdfInputFormat class. This class returns an object of the PdfRecordReader class, which does all the PDF conversion. I am facing an error here.
I am creating the jar in Eclipse via the export wizard:
Export > JAR file > create JAR.
I am selecting the option to package the required libraries into the jar.
I am executing the jar using the following command:
hadoop jar /home/tcs/converter.jar com.amal.pdf.PdfInputDriver /user/tcs/wordcountfile.pdf /user/convert
After running this I get the following exception:
17/06/09 09:26:51 WARN mapred.LocalJobRunner: job_local1466878685_0001
java.lang.Exception: java.lang.NoClassDefFoundError: org/apache/fontbox/cmap/CMapParser
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:489)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:549)
Caused by: java.lang.NoClassDefFoundError: org/apache/fontbox/cmap/CMapParser
at org.apache.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:548)
at org.apache.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:383)
at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:372)
at org.apache.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:61)
at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:552)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:248)
at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:207)
at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:367)
at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:291)
at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:247)
at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:180)
at com.amal.pdf.PdfRecordReader.initialize(PdfRecordReader.java:43)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:548)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:786)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:270)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: org.apache.fontbox.cmap.CMapParser
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 21 more
17/06/09 09:26:52 INFO mapreduce.Job: Job job_local1466878685_0001 failed with state FAILED due to: NA
17/06/09 09:26:52 INFO mapreduce.Job: Counters: 0
false
Here is the code:
PdfRecordReader class (code):
package com.amal.pdf;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;
public class PdfRecordReader extends RecordReader<Object, Object>
{
private String[] lines = null;
private LongWritable key = null;
private Text value = null;
@Override
public void initialize(InputSplit genericSplit, TaskAttemptContext context)
throws IOException, InterruptedException {
FileSplit split = (FileSplit) genericSplit;
Configuration job = context.getConfiguration();
final Path file = split.getPath();
/*
* The below code contains the logic for opening the file and seek to
* the start of the split. Here we are applying the Pdf Parsing logic
*/
FileSystem fs = file.getFileSystem(job);
FSDataInputStream fileIn = fs.open(split.getPath());
PDDocument pdf = null;
String parsedText = null;
PDFTextStripper stripper;
pdf = PDDocument.load(fileIn);
stripper = new PDFTextStripper();
//getting exception because of this line****
parsedText = stripper.getText(pdf);
this.lines = parsedText.split("\n");
}
@Override
public boolean nextKeyValue() throws IOException, InterruptedException {
if (key == null) {
key = new LongWritable();
key.set(1);
value = new Text();
value.set(lines[0]);
} else {
int temp = (int) key.get();
if (temp < (lines.length - 1)) {
int count = (int) key.get();
value = new Text();
value.set(lines[count]);
count = count + 1;
key = new LongWritable(count);
} else {
return false;
}
}
if (key == null || value == null) {
return false;
} else {
return true;
}
}
@Override
public LongWritable getCurrentKey() throws IOException,
InterruptedException {
return key;
}
@Override
public Text getCurrentValue() throws IOException, InterruptedException {
return value;
}
@Override
public float getProgress() throws IOException, InterruptedException {
return 0;
}
@Override
public void close() throws IOException {
}
}
// Note: Since this is for the Hadoop environment, Eclipse will not create a runnable JAR for this project.
// Is there any way to export this project as a runnable JAR?
// I need help understanding what I am doing wrong.

The error occurs because Hadoop could not find the org.apache.fontbox.cmap.CMapParser class, which comes from an external library (FontBox, a dependency of PDFBox) that your code uses.
The dependent jar was not packaged into the jar you passed to the hadoop command, so Hadoop could not find it at runtime. When a Hadoop job runs, the code (jars) is distributed to the nodes where the data lives in the HDFS cluster, and the dependent jar was not shipped along with your job.
There are two solutions you can follow:
1) Include the external jars with the hadoop command via -libjars:
hadoop jar /home/tcs/converter.jar com.amal.pdf.PdfInputDriver -libjars <path to external jars comma separated> /user/tcs/wordcountfile.pdf /user/convert
2) Use the Maven Shade plugin to create an uber jar that bundles all dependent libraries inside your own jar.
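A note on option 1: -libjars is a Hadoop generic option, so it is only honored when the driver parses the generic options, typically by launching the job through ToolRunner / GenericOptionsParser. Below is a minimal sketch of that driver shape, reusing the package and class names from the question; the InputFormat wiring and job setup are assumptions about the original driver, not its actual code.

package com.amal.pdf;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Hypothetical driver shape: launching through ToolRunner lets Hadoop parse and
// strip generic options such as -libjars before the remaining args reach run().
public class PdfInputDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // By this point args contains only the input and output paths.
        Job job = Job.getInstance(getConf(), "pdf to text");
        job.setJarByClass(PdfInputDriver.class);
        job.setInputFormatClass(PdfInputFormat.class); // the custom InputFormat from the question
        // Mapper, reducer and output key/value classes would be set here exactly
        // as in the original driver (omitted in this sketch).
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new PdfInputDriver(), args));
    }
}

With that in place, the jars listed after -libjars (here the PDFBox and FontBox jars) are copied to the task classpath. For option 2, the Shade plugin is configured in the project's pom.xml, and the resulting uber jar already contains those dependencies, so no -libjars is needed.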

Related

Cybersource org.apache.cxf.binding.soap.SoapFault: Security processing failed USING CXF

package com.cybersource.schemas.transaction_data.transactionprocessor;
import java.io.IOException;
import java.math.BigInteger;
import java.net.MalformedURLException;
import java.net.URL;
import java.rmi.RemoteException;
import java.util.HashMap;
import java.util.Map;
import org.apache.cxf.version.Version;
import com.cybersource.schemas.transaction_data_1.BillTo;
import com.cybersource.schemas.transaction_data_1.CCAuthService;
import com.cybersource.schemas.transaction_data_1.Card;
import com.cybersource.schemas.transaction_data_1.Item;
import com.cybersource.schemas.transaction_data_1.PurchaseTotals;
import com.cybersource.schemas.transaction_data_1.ReplyMessage;
import com.cybersource.schemas.transaction_data_1.RequestMessage;
import org.apache.cxf.endpoint.Client;
import org.apache.cxf.endpoint.Endpoint;
import org.apache.cxf.frontend.ClientProxy;
import org.apache.cxf.ws.security.wss4j.WSS4JInInterceptor;
import org.apache.cxf.ws.security.wss4j.WSS4JOutInterceptor;
import org.apache.ws.security.WSConstants;
import org.apache.ws.security.WSPasswordCallback;
//import org.apache.wss4j.common.ext.WSPasswordCallback;
import org.apache.ws.security.handler.WSHandlerConstants;
import javax.security.auth.callback.Callback;
import javax.security.auth.callback.CallbackHandler;
import javax.security.auth.callback.UnsupportedCallbackException;
public class CybersourceClientExample {
// Replace the MERCHANT_ID and MERCHANT_KEY with the appropriate values.
private static final String MERCHANT_ID = "MERCHANT_ID ";
private static final String MERCHANT_KEY = "MERCHANT_KEY ";
private static final String SERVER_URL = "https://ics2wstesta.ic3.com/commerce/1.x/transactionProcessor/CyberSourceTransaction_1.142.wsdl";
private static final String CLIENT_LIB_VERSION = Version.getCompleteVersionString() + "/1.5.10"; // CXF Version / WSS4J Version
private static final String CLIENT_LIBRARY = "Java CXF WSS4J";
private static final String CLIENT_ENV = System.getProperty("os.name") + "/" +
System.getProperty("os.version") + "/" +
System.getProperty("java.vendor") + "/" +
System.getProperty("java.version");
public static void main(String[] args) throws RemoteException, MalformedURLException {
RequestMessage request = new RequestMessage();
// To help Cybersource troubleshoot any problems that you may encounter,
// include the following information about the client.
addClientLibraryInfo(request);
request.setMerchantID(MERCHANT_ID);
// Internal Transaction Reference Code for the Merchant
request.setMerchantReferenceCode("222222");
// Here we are telling the client that we are going to run an AUTH.
request.setCcAuthService(new CCAuthService());
request.getCcAuthService().setRun("true");
request.setBillTo(buildBillTo());
request.setCard(buildCard());
request.setPurchaseTotals(buildPurchaseTotals());
request.getItem().add(buildItem("0", "12.34", "2"));
request.getItem().add(buildItem("1", "56.78", "1"));
ITransactionProcessor processor = new TransactionProcessor(new URL(SERVER_URL)).getPortXML();
// Add WS-Security Headers to the Request
addSecurityValues(processor);
ReplyMessage reply = processor.runTransaction(request);
System.out.println("decision = " + reply.getDecision());
System.out.println("reasonCode = " + reply.getReasonCode());
System.out.println("requestID = " + reply.getRequestID());
System.out.println("requestToken = " + reply.getRequestToken());
System.out.println("ccAuthReply.reasonCode = " + reply.getCcAuthReply().getReasonCode());
}
private static void addClientLibraryInfo(RequestMessage request) {
request.setClientLibrary(CLIENT_LIBRARY);
request.setClientLibraryVersion(CLIENT_LIB_VERSION);
request.setClientEnvironment(CLIENT_ENV);
}
private static Item buildItem(String id, String unitPrice, String quantity) {
Item item = new Item();
item.setId(new BigInteger(id));
item.setUnitPrice(unitPrice);
item.setQuantity(quantity);
return item;
}
private static PurchaseTotals buildPurchaseTotals() {
PurchaseTotals purchaseTotals = new PurchaseTotals();
purchaseTotals.setCurrency("USD");
purchaseTotals.setGrandTotalAmount("100");
return purchaseTotals;
}
private static Card buildCard() {
Card card = new Card();
card.setAccountNumber("4111111111111111");
card.setExpirationMonth(new BigInteger("12"));
card.setExpirationYear(new BigInteger("2020"));
return card;
}
private static BillTo buildBillTo() {
BillTo billTo = new BillTo();
billTo.setFirstName("John");
billTo.setLastName("Doe");
billTo.setStreet1("1295 Charleston Road");
billTo.setCity("Mountain View");
billTo.setState("CA");
billTo.setPostalCode("94043");
billTo.setCountry("US");
billTo.setEmail("null@cybersource.com");
billTo.setIpAddress("10.7.111.111");
return billTo;
}
private static void addSecurityValues(ITransactionProcessor processor) {
Client client = ClientProxy.getClient(processor);
Endpoint endpoint = client.getEndpoint();
// We'll have to add the Username and Password properties to an OutInterceptor
HashMap<String, Object> outHeaders = new HashMap<String, Object>();
outHeaders.put(WSHandlerConstants.ACTION, WSHandlerConstants.USERNAME_TOKEN);
outHeaders.put(WSHandlerConstants.USER, MERCHANT_ID);
outHeaders.put(WSHandlerConstants.PASSWORD_TYPE, WSConstants.PW_TEXT);
outHeaders.put(WSHandlerConstants.PW_CALLBACK_CLASS, ClientPasswordHandler.class.getName());
WSS4JOutInterceptor interceptor = new WSS4JOutInterceptor(outHeaders);
endpoint.getOutInterceptors().add(interceptor);
}
public static class ClientPasswordHandler implements CallbackHandler {
@Override
public void handle(Callback[] callbacks) throws IOException, UnsupportedCallbackException {
for (Callback callback : callbacks) {
if ((WSPasswordCallback)callback instanceof WSPasswordCallback) {
WSPasswordCallback passwordCallback = (WSPasswordCallback) callback;
passwordCallback.setPassword(MERCHANT_KEY);
}
}
}
}
}
I am facing the following error. Please help.
Dec 04, 2017 8:15:00 AM org.apache.cxf.wsdl.service.factory.ReflectionServiceFactoryBean buildServiceFromWSDL
INFO: Creating Service {urn:schemas-cybersource-com:transaction-data:TransactionProcessor}TransactionProcessor from WSDL: https://ics2wstesta.ic3.com/commerce/1.x/transactionProcessor/CyberSourceTransaction_1.142.wsdl
Dec 04, 2017 8:15:03 AM org.apache.cxf.phase.PhaseInterceptorChain doDefaultLogging
WARNING: Interceptor for {urn:schemas-cybersource-com:transaction-data:TransactionProcessor}TransactionProcessor#{urn:schemas-cybersource-com:transaction-data:TransactionProcessor}runTransaction has thrown exception, unwinding now
org.apache.cxf.binding.soap.SoapFault: Security processing failed.
at org.apache.cxf.ws.security.wss4j.WSS4JOutInterceptor$WSS4JOutInterceptorInternal.handleMessageInternal(WSS4JOutInterceptor.java:269)
at org.apache.cxf.ws.security.wss4j.WSS4JOutInterceptor$WSS4JOutInterceptorInternal.handleMessage(WSS4JOutInterceptor.java:135)
at org.apache.cxf.ws.security.wss4j.WSS4JOutInterceptor$WSS4JOutInterceptorInternal.handleMessage(WSS4JOutInterceptor.java:122)
at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
at org.apache.cxf.endpoint.ClientImpl.doInvoke(ClientImpl.java:518)
at org.apache.cxf.endpoint.ClientImpl.invoke(ClientImpl.java:427)
at org.apache.cxf.endpoint.ClientImpl.invoke(ClientImpl.java:328)
at org.apache.cxf.endpoint.ClientImpl.invoke(ClientImpl.java:281)
at org.apache.cxf.frontend.ClientProxy.invokeSync(ClientProxy.java:96)
at org.apache.cxf.jaxws.JaxWsClientProxy.invoke(JaxWsClientProxy.java:139)
at com.sun.proxy.$Proxy36.runTransaction(Unknown Source)
at com.cybersource.schemas.transaction_data.transactionprocessor.CybersourceClientExample.main(CybersourceClientExample.java:77)
Caused by: org.apache.wss4j.common.ext.WSSecurityException: WSHandler: password callback failed
Original Exception was java.lang.ClassCastException: org.apache.wss4j.common.ext.WSPasswordCallback cannot be cast to org.apache.ws.security.WSPasswordCallback
at org.apache.wss4j.dom.handler.WSHandler.performPasswordCallback(WSHandler.java:1172)
at org.apache.wss4j.dom.handler.WSHandler.getPasswordCB(WSHandler.java:1130)
at org.apache.wss4j.dom.action.UsernameTokenAction.execute(UsernameTokenAction.java:43)
at org.apache.wss4j.dom.handler.WSHandler.doSenderAction(WSHandler.java:234)
at org.apache.cxf.ws.security.wss4j.WSS4JOutInterceptor.access$100(WSS4JOutInterceptor.java:54)
at org.apache.cxf.ws.security.wss4j.WSS4JOutInterceptor$WSS4JOutInterceptorInternal.handleMessageInternal(WSS4JOutInterceptor.java:261)
... 11 more
Caused by: java.lang.ClassCastException: org.apache.wss4j.common.ext.WSPasswordCallback cannot be cast to org.apache.ws.security.WSPasswordCallback
at com.cybersource.schemas.transaction_data.transactionprocessor.CybersourceClientExample$ClientPasswordHandler.handle(CybersourceClientExample.java:152)
at org.apache.wss4j.dom.handler.WSHandler.performPasswordCallback(WSHandler.java:1170)
... 16 more
Exception in thread "main" javax.xml.ws.soap.SOAPFaultException: Security processing failed.
at org.apache.cxf.jaxws.JaxWsClientProxy.invoke(JaxWsClientProxy.java:161)
at com.sun.proxy.$Proxy36.runTransaction(Unknown Source)
at com.cybersource.schemas.transaction_data.transactionprocessor.CybersourceClientExample.main(CybersourceClientExample.java:77)
Caused by: org.apache.wss4j.common.ext.WSSecurityException: WSHandler: password callback failed
Original Exception was java.lang.ClassCastException: org.apache.wss4j.common.ext.WSPasswordCallback cannot be cast to org.apache.ws.security.WSPasswordCallback
at org.apache.wss4j.dom.handler.WSHandler.performPasswordCallback(WSHandler.java:1172)
at org.apache.wss4j.dom.handler.WSHandler.getPasswordCB(WSHandler.java:1130)
at org.apache.wss4j.dom.action.UsernameTokenAction.execute(UsernameTokenAction.java:43)
at org.apache.wss4j.dom.handler.WSHandler.doSenderAction(WSHandler.java:234)
at org.apache.cxf.ws.security.wss4j.WSS4JOutInterceptor.access$100(WSS4JOutInterceptor.java:54)
at org.apache.cxf.ws.security.wss4j.WSS4JOutInterceptor$WSS4JOutInterceptorInternal.handleMessageInternal(WSS4JOutInterceptor.java:261)
at org.apache.cxf.ws.security.wss4j.WSS4JOutInterceptor$WSS4JOutInterceptorInternal.handleMessage(WSS4JOutInterceptor.java:135)
at org.apache.cxf.ws.security.wss4j.WSS4JOutInterceptor$WSS4JOutInterceptorInternal.handleMessage(WSS4JOutInterceptor.java:122)
at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
at org.apache.cxf.endpoint.ClientImpl.doInvoke(ClientImpl.java:518)
at org.apache.cxf.endpoint.ClientImpl.invoke(ClientImpl.java:427)
at org.apache.cxf.endpoint.ClientImpl.invoke(ClientImpl.java:328)
at org.apache.cxf.endpoint.ClientImpl.invoke(ClientImpl.java:281)
at org.apache.cxf.frontend.ClientProxy.invokeSync(ClientProxy.java:96)
at org.apache.cxf.jaxws.JaxWsClientProxy.invoke(JaxWsClientProxy.java:139)
... 2 more
Caused by: java.lang.ClassCastException: org.apache.wss4j.common.ext.WSPasswordCallback cannot be cast to org.apache.ws.security.WSPasswordCallback
at com.cybersource.schemas.transaction_data.transactionprocessor.CybersourceClientExample$ClientPasswordHandler.handle(CybersourceClientExample.java:152)
at org.apache.wss4j.dom.handler.WSHandler.performPasswordCallback(WSHandler.java:1170)
... 16 more
"Caused by: java.lang.ClassCastException: org.apache.wss4j.common.ext.WSPasswordCallback cannot be cast to org.apache.ws.security.WSPasswordCallback at " - looks like you are mixing WSS4J versions incorrectly. Have you verified that all of the WSS4J jars on the classpath have the same version? Also have you checked that the WSS4J version is the correct one that should be used with the version of CXF you are using?

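Following up on the version-mixing comment above: the ClassCastException shows that the runtime hands the handler an org.apache.wss4j.common.ext.WSPasswordCallback (the WSS4J 2.x class), while the code imports the older org.apache.ws.security.WSPasswordCallback (WSS4J 1.x). Here is a sketch of the callback handler written against the newer package, assuming WSS4J 2.x is what is actually on the classpath:

import java.io.IOException;

import javax.security.auth.callback.Callback;
import javax.security.auth.callback.CallbackHandler;
import javax.security.auth.callback.UnsupportedCallbackException;

// WSS4J 2.x package, matching the class reported in the ClassCastException.
import org.apache.wss4j.common.ext.WSPasswordCallback;

public class ClientPasswordHandler implements CallbackHandler {

    private static final String MERCHANT_KEY = "MERCHANT_KEY"; // placeholder

    @Override
    public void handle(Callback[] callbacks) throws IOException, UnsupportedCallbackException {
        for (Callback callback : callbacks) {
            // Test the type before casting rather than casting inside the condition.
            if (callback instanceof WSPasswordCallback) {
                ((WSPasswordCallback) callback).setPassword(MERCHANT_KEY);
            }
        }
    }
}

The constants used for the out interceptor (WSHandlerConstants, WSConstants) should then come from the same WSS4J major version; mixing 1.x and 2.x jars on the classpath is what produces this kind of cast failure.
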
How to create a runnable jar for Hadoop using Eclipse

// I am getting an exception while running the Hadoop jar file, which converts PDF to text and passes it to the mapper
java.lang.Exception: java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, received org.apache.hadoop.io.LongWritable
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:489)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:549)
Caused by: java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, received org.apache.hadoop.io.LongWritable
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1072)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:715)
at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
at org.apache.hadoop.mapreduce.Mapper.map(Mapper.java:125)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:270)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
17/06/08 19:12:10 INFO mapreduce.Job: Job job_local815278758_0001 running in uber mode : false
17/06/08 19:12:10 INFO mapreduce.Job: map 0% reduce 0%
17/06/08 19:12:10 INFO mapreduce.Job: Job job_local815278758_0001 failed with state FAILED due to: NA
17/06/08 19:12:10 INFO mapreduce.Job: Counters: 0
//Mapper class
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class WordCountMapper extends
Mapper<Object, Object, Object, Object> {
private Text word = new Text();
private final static LongWritable one = new LongWritable(1);
protected void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
context.progress();
context.write(word, one);
}
}
}
//Reducer class
package com.amal.pdf;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class WordCountReducer extends
Reducer<Object, Object, Object, Object> {
protected void reduce(Text key, Iterable<LongWritable> values,
Context context) throws IOException, InterruptedException {
int sum = 0;
for (LongWritable value : values) {
sum += value.get();
}
context.write(key, new LongWritable(sum));
}
}
//PDF record Reader class
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;
public class PdfRecordReader extends RecordReader<Object, Object> {
private String[] lines = null;
private LongWritable key = null;
private Text value = null;
@Override
public void initialize(InputSplit genericSplit, TaskAttemptContext context)
throws IOException, InterruptedException {
FileSplit split = (FileSplit) genericSplit;
Configuration job = context.getConfiguration();
final Path file = split.getPath();
/*
* The below code contains the logic for opening the file and seek to
* the start of the split. Here we are applying the Pdf Parsing logic
*/
FileSystem fs = file.getFileSystem(job);
FSDataInputStream fileIn = fs.open(split.getPath());
PDDocument pdf = null;
String parsedText = null;
PDFTextStripper stripper;
pdf = PDDocument.load(fileIn);
stripper = new PDFTextStripper();
//getting exception because of this line
parsedText = stripper.getText(pdf);
this.lines = parsedText.split("\n");
}
@Override
public boolean nextKeyValue() throws IOException, InterruptedException {
if (key == null) {
key = new LongWritable();
key.set(1);
value = new Text();
value.set(lines[0]);
} else {
int temp = (int) key.get();
if (temp < (lines.length - 1)) {
int count = (int) key.get();
value = new Text();
value.set(lines[count]);
count = count + 1;
key = new LongWritable(count);
} else {
return false;
}
}
if (key == null || value == null) {
return false;
} else {
return true;
}
}
@Override
public LongWritable getCurrentKey() throws IOException,
InterruptedException {
return key;
}
@Override
public Text getCurrentValue() throws IOException, InterruptedException {
return value;
}
@Override
public float getProgress() throws IOException, InterruptedException {
return 0;
}
@Override
public void close() throws IOException {
}
}
// One more thing: can anyone help me create a runnable jar? The run configuration is not showing inside Eclipse because main() is meant for the Hadoop environment.
The error on your console says:
Caused by: java.io.IOException:
Type mismatch in key from map: expected org.apache.hadoop.io.Text, received org.apache.hadoop.io.LongWritable
That means that the key-value pair your mapper is producing does not match its declared definition.
Your mapper class should look something like this:
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
private Text word = new Text();
private final static IntWritable one = new IntWritable(1);
protected void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
context.progress();
context.write(word, one);
}
}
}
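The reducer and the job configuration have to agree with those types as well, otherwise a similar mismatch shows up on the reduce side. Here is a sketch of a matching reducer, following the class name from the question and the standard word-count pattern:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reducer whose types line up with the mapper above: Text keys in, IntWritable counts out.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

In the driver, the declared classes should match too, e.g. job.setMapOutputKeyClass(Text.class), job.setMapOutputValueClass(IntWritable.class), job.setOutputKeyClass(Text.class) and job.setOutputValueClass(IntWritable.class).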

InvalidInputException when running a jar exported from Eclipse

I have installed Hadoop 2.6 on CentOS 7 and it is running fine. But when I run a jar exported from Eclipse, it gives the following error:
[root@myspark ~]# hadoop jar fengcount.jar intput output1
17/05/26 21:24:51 INFO client.RMProxy: Connecting to ResourceManager
at myspark/192.168.44.100:8032 17/05/26 21:24:53 INFO mapreduce.JobSubmitter: Cleaning up the staging area /tmp/hadoop-yarn/staging/root/.staging/job_1495765615548_0004 Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://myspark:54310/user/root/intput
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:321)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:264)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:385)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:302)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:319)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:197)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1297)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1294)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1294)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1315)
at hdfs.hadoop_hdfs.fengcount.main(fengcount.java:39)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
The file input/test1.txt actually exists:
[root@myspark ~]# hdfs dfs -ls -R
drwxr-xr-x - root supergroup 0 2017-05-26 21:02 input
-rw-r--r-- 1 root supergroup 16 2017-05-24 01:57 input/test1.txt
My code:
package hdfs.hadoop_hdfs;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class fengcount {
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
// TODO Auto-generated method stub
Configuration conf=new Configuration();
String[] otherargs=new GenericOptionsParser(conf,args).getRemainingArgs();
if (otherargs.length!=2) {
System.err.println("Usage:fengcount<int><out>");
System.exit(2);
}
@SuppressWarnings("deprecation")
Job job=new Job(conf, "fengcount");
job.setJarByClass(fengcount.class);
job.setMapperClass(TokerizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(otherargs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherargs[1]));
System.exit(job.waitForCompletion(true)?0:1);
}
// mapper class
public static class TokerizerMapper extends Mapper<Object, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
@Override
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
// TODO Auto-generated method stub
System.out.println("key=" + key.toString());
System.out.println("value=" + value.toString());
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}
//reduce process
public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
private IntWritable result = new IntWritable();
@Override
public void reduce(Text key, Iterable<IntWritable> values,
Reducer<Text, IntWritable, Text, IntWritable>.Context context)
throws IOException, InterruptedException {
// TODO Auto-generated method stub
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
//mapreduce process
}
From the error log I can see the input path hdfs://myspark:54310/user/root/intput, which looks incorrect: the directory you listed is named input, but the command passed intput. Rerun the job with the correct input path.
Good luck!
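If you want the driver to fail with a clearer message in this situation, it can check the input path before submitting the job. Here is a small illustrative sketch using the Hadoop FileSystem API (the helper class name is made up for this example):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class InputPathCheck {

    // Returns true if the path exists in the default FileSystem for this configuration.
    public static boolean exists(Configuration conf, String path) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        Path p = new Path(path);
        if (!fs.exists(p)) {
            System.err.println("Input path does not exist: " + fs.makeQualified(p));
            return false;
        }
        return true;
    }
}

Calling InputPathCheck.exists(conf, otherargs[0]) in fengcount.main before submitting would have reported that /user/root/intput does not exist, while /user/root/input does.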

how to process a batch of text files through the Stanford NLP pipeline?

I need to process a batch of text files through the Stanford NLP pipeline with the semantic graph as output. There is an example of the pipeline I found in this SO question. I just changed the class name, removed "dcoref" from the .props, and added it as a new class to the CoreNLP project in Eclipse. Here is the code:
import java.io.*;
import java.util.*;
import edu.stanford.nlp.io.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.trees.*;
import edu.stanford.nlp.util.*;
public class testpipeline {
public static void main(String[] args) throws IOException {
PrintWriter out;
if (args.length > 1) {
out = new PrintWriter(args[1]);
} else {
out = new PrintWriter(System.out);
}
PrintWriter xmlOut = null;
if (args.length > 2) {
xmlOut = new PrintWriter(args[2]);
}
StanfordCoreNLP pipeline = new StanfordCoreNLP();
Annotation annotation;
if (args.length > 0) {
annotation = new Annotation(IOUtils.slurpFileNoExceptions(args[0]));
} else {
annotation = new Annotation("He didn't get a reply.");
}
pipeline.annotate(annotation);
pipeline.prettyPrint(annotation, out);
if (xmlOut != null) {
pipeline.xmlPrint(annotation, xmlOut);
}
// An Annotation is a Map ...
List<CoreMap> sentences = annotation.get(CoreAnnotations.SentencesAnnotation.class);
if (sentences != null && sentences.size() > 0) {
CoreMap sentence = sentences.get(0);
Tree tree = sentence.get(TreeCoreAnnotations.TreeAnnotation.class);
out.println();
out.println("The first sentence parsed is:");
tree.pennPrint(out);
}
}
}
But it doesn't work. When running the full pipeline, I get the following errors in the console:
Reading the model of the lemmatizer
java.io.FileNotFoundException: models\lemmatizer.model (The system cannot find the path specified)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:146)
at java.io.FileInputStream.<init>(FileInputStream.java:101)
at is2.lemmatizer.Lemmatizer.readModel(Lemmatizer.java:166)
at is2.lemmatizer.Lemmatizer.<init>(Lemmatizer.java:68)
at is2.lemmatizer.Lemmatizer.<init>(Lemmatizer.java:82)
at examples.FullPipeline.main(FullPipeline.java:56)
Applying the lemmatizer
Exception in thread "main" java.lang.NullPointerException
at is2.lemmatizer.Lemmatizer.lemmatize(Lemmatizer.java:447)
at is2.lemmatizer.Lemmatizer.apply(Lemmatizer.java:530)
at examples.FullPipeline.main(FullPipeline.java:59)
Would somebody suggest how to correct this, please?

Hadoop MapReduce on Eclipse: Cleaning up the staging area file:/app/hadoop/tmp/mapred/staging/myname183880112/.staging/job_local183880112_0001

2014-04-04 16:02:31.633 java[44631:1903] Unable to load realm info from SCDynamicStore
14/04/04 16:02:32 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/04/04 16:02:32 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/04/04 16:02:32 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
14/04/04 16:02:32 WARN snappy.LoadSnappy: Snappy native library not loaded
14/04/04 16:02:32 INFO mapred.FileInputFormat: Total input paths to process : 1
14/04/04 16:02:32 INFO mapred.JobClient: Cleaning up the staging area file:/app/hadoop/tmp/mapred/staging/myname183880112/.staging/job_local183880112_0001
java.lang.NullPointerException
at org.apache.hadoop.conf.Configuration.getLocalPath(Configuration.java:950)
at org.apache.hadoop.mapred.JobConf.getLocalPath(JobConf.java:476)
at org.apache.hadoop.mapred.LocalJobRunner$Job.<init>(LocalJobRunner.java:121)
at org.apache.hadoop.mapred.LocalJobRunner.submitJob(LocalJobRunner.java:592)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:1013)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
at LineIndex.main(LineIndex.java:92)
I am trying to execute a MapReduce line-indexing program in Eclipse. The above error shows up. My code is:
public class LineIndex {
public static class LineIndexMapper extends MapReduceBase
implements Mapper<LongWritable, Text, Text, Text> {
private final static Text word = new Text();
private final static Text location = new Text();
public void map(LongWritable key, Text val,
OutputCollector<Text, Text> output, Reporter reporter)
throws IOException {
FileSplit fileSplit = (FileSplit)reporter.getInputSplit();
String fileName = fileSplit.getPath().getName();
location.set(fileName);
String line = val.toString();
StringTokenizer itr = new StringTokenizer(line.toLowerCase());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
output.collect(word, location);
}
}
}
public static class LineIndexReducer extends MapReduceBase
implements Reducer<Text, Text, Text, Text> {
public void reduce(Text key, Iterator<Text> values,
OutputCollector<Text, Text> output, Reporter reporter)
throws IOException {
boolean first = true;
StringBuilder toReturn = new StringBuilder();
while (values.hasNext()){
if (!first)
toReturn.append(", ");
first=false;
toReturn.append(values.next().toString());
}
output.collect(key, new Text(toReturn.toString()));
}
}
/**
* The actual main() method for our program; this is the
* "driver" for the MapReduce job.
*/
public static void main(String[] args) {
JobClient client = new JobClient();
JobConf conf = new JobConf(LineIndex.class);
conf.setJobName("LineIndexer");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(conf, new Path("input"));
FileOutputFormat.setOutputPath(conf, new Path("output"));
conf.setMapperClass(LineIndexMapper.class);
conf.setReducerClass(LineIndexReducer.class);
conf.addResource(new Path("/usr/local/hadoop/etc/hadoop/core-site.xml"));
conf.addResource(new Path("/usr/local/hadoop/etc/hadoop/hdfs-site.xml"));
client.setConf(conf);
try {
JobClient.runJob(conf);
} catch (Exception e) {
e.printStackTrace();
}
}
}
I am unable to understand and resolve the NullPointerException here.
Could someone please help me out?
Can you add the mapred-site.xml file to the Configuration object and try once more?
You may also need to specify the property mapred.local.dir in that XML file.
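A sketch of what that could look like in LineIndex.main(), next to the existing addResource calls; the file location mirrors the paths already used for core-site.xml and hdfs-site.xml, and the mapred.local.dir value is only an example based on the staging path in your log:

JobConf conf = new JobConf(LineIndex.class);
conf.addResource(new Path("/usr/local/hadoop/etc/hadoop/core-site.xml"));
conf.addResource(new Path("/usr/local/hadoop/etc/hadoop/hdfs-site.xml"));
conf.addResource(new Path("/usr/local/hadoop/etc/hadoop/mapred-site.xml")); // newly added

// Alternatively, or in addition, set the local directory directly; it must
// exist and be writable on the machine running the job:
conf.set("mapred.local.dir", "/app/hadoop/tmp/mapred/local");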