How can I print a DOCX file in NetBeans using Aspose.Words?

Recently, I was working on a project that can print DOCX files in NetBeans. I am new to Java, so I'm not really familiar with it. I am using Aspose.Words, but I don't really know how to use it; the tutorials I have watched are not for NetBeans, so I am confused. Thanks a lot in advance for your help.
This is my code:
String dox = path.getText();
XWPFDocument docx = null;
try {
    // Extract the plain text with Apache POI (not Aspose.Words)
    docx = new XWPFDocument(POIXMLDocument.openPackage(dox));
    XWPFWordExtractor ext = new XWPFWordExtractor(docx);
    content.setText(ext.getText());
} catch (IOException ex) {
    Logger.getLogger(count.class.getName()).log(Level.SEVERE, null, ex);
}

PrinterJob job = PrinterJob.getPrinterJob();
job.setPrintable(new OutputPrinter(path.getText()));
boolean doPrint = job.printDialog();
if (doPrint) {
    try {
        job.print();
    } catch (PrinterException e) {
        // Print job did not complete.
    }
}
I tried extracting the text, but that only gets the plain content. What I'm trying to do is print the whole document, formatting and all, just as MS Word does.

First of all: why is this a NetBeans question? I can't see anything NetBeans-specific here; maybe you should change it to Java (at least the tag).
Second: I think your question is a duplicate of this one
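That said, since the question mentions Aspose.Words: printing through it is much simpler than extracting text with POI, because Aspose.Words lays out the whole document itself. A minimal sketch, assuming the Aspose.Words for Java jar is on the classpath (check Document.print() against your version's javadoc):
import com.aspose.words.Document;

// Load the DOCX with Aspose.Words and send it to the default printer.
// Formatting is preserved, unlike the POI text-extraction approach above.
Document doc = new Document(path.getText());
doc.print();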

Related

Creating filter using org.eclipse.core.resources.IProject class

I am creating a jar file using org.eclipse.core.resources.IProject. While creating it, I would like to exclude a few files from the selected directory based on the file extension.
if (selectPackageCombo.getText().equals(item)) {
    IProject project = availableProjects.get(i);
    // 'package' is a reserved word in Java, so use a different variable name
    Package pkg = EclipsePackageRepository.instance().getPackage(project);
    if (pkg != null) {
        try {
            // Exclude matching files and folders from the project
            project.createFilter(
                    IResourceFilterDescription.EXCLUDE_ALL
                            | IResourceFilterDescription.FOLDERS
                            | IResourceFilterDescription.FILES,
                    new FileInfoMatcherDescription("org.eclipse.ui.ide.multiFilter",
                            "1.0-name-matches-false-false-Test"),
                    IResource.BACKGROUND_REFRESH, new NullProgressMonitor());
        } catch (CoreException e) {
            e.printStackTrace();
        }
        jarData.setSelectedProject(project);
        jarData.setOutputSuffix(pkg.getPackageResource().getType());
    }
    break;
}
Please help me create the FileInfoMatcherDescription object so that it excludes all files with the extension ".ayt".
There isn't much information about this, but it looks like you can work out the values by using the 'Resource Filters' page in the project properties and creating the filter you want. The id and arguments values are then saved in the project's .project file, which you can read.
So for your requirement I get
<matcher>
<id>org.eclipse.ui.ide.multiFilter</id>
<arguments>1.0-name-matches-false-false-*.ayt</arguments>
</matcher>
so the constructor is
new FileInfoMatcherDescription("org.eclipse.ui.ide.multiFilter",
"1.0-name-matches-false-false-*.ayt")

In Eclipse, how to refactor a filename in a project programmatically

We can refactor project names, but I need help with refactoring a filename in Eclipse programmatically. We have a folder containing an xxx.zzz file, and we want to rename/refactor this file.
Kind regards
Try right-clicking the file, selecting Refactor, and entering your desired file name. You may additionally be asked to update references and similarly named variables. Check both (recommended), and your file and all its references will be updated.
If the file is a class file, you can try the code below:
// Describe a rename refactoring for a compilation unit (a .java file)
RefactoringContribution contribution =
        RefactoringCore.getRefactoringContribution(IJavaRefactorings.RENAME_COMPILATION_UNIT);
RenameJavaElementDescriptor descriptor =
        (RenameJavaElementDescriptor) contribution.createDescriptor();
descriptor.setProject(cu.getResource().getProject().getName());
descriptor.setNewName(newFileName);
descriptor.setJavaElement(cu);
descriptor.setUpdateReferences(true);

RefactoringStatus status = new RefactoringStatus();
try {
    RenameRefactoring refactoring = (RenameRefactoring) descriptor.createRefactoring(status);
    IProgressMonitor monitor = new NullProgressMonitor();
    // Check the preconditions before actually performing the change
    RefactoringStatus status1 = refactoring.checkInitialConditions(monitor);
    if (!status1.hasFatalError()) {
        RefactoringStatus status2 = refactoring.checkFinalConditions(monitor);
        if (!status2.hasFatalError()) {
            Change change = refactoring.createChange(monitor);
            change.perform(monitor);
        }
    }
} catch (CoreException e) {
    e.printStackTrace();
} catch (Exception e) {
    e.printStackTrace();
}
cu is of type ICompilationUnit; you can get a compilation unit from an IPackageFragment (see the sketch below).
You can also replace IJavaRefactorings.RENAME_COMPILATION_UNIT with whatever refactoring ID you need.
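For completeness, a hypothetical sketch of one way to obtain cu (the project, package path, and file name are illustrative; error handling omitted):
// Get the Java view of a workspace project
IJavaProject javaProject = JavaCore.create(project);
// Locate the package that contains the file to rename
IPackageFragment fragment =
        javaProject.findPackageFragment(new Path("/MyProject/src/com/example"));
// The compilation unit for the existing source file
ICompilationUnit cu = fragment.getCompilationUnit("Old.java");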

How to find the first instance of a pdf file using HTMLElement?

void DownloadFile(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    HtmlElementCollection links = webBrowser1.Document.Links;
    foreach (HtmlElement link in links)
    {
        if (link.InnerText.Equals("*.pdf"))
        {
            link.InvokeMember("Click");
            break;
        }
    }
}
How do I find the first link to a PDF file using HtmlElement? I was trying to match *.pdf, but it does not work.
Looks like you are using C#, but you've tagged this as htmlelements, which is a Java library, so you might be in the wrong place.
However, if InnerText yields the link's href (or if the link text contains .pdf), then you probably want:
link.InnerText.EndsWith(".pdf")
instead of
link.InnerText.Equals("*.pdf")
Equals does a literal string comparison, so "*.pdf" only matches a link whose text is exactly *.pdf; the * is not treated as a wildcard.

iText: how to tweak text extraction?

I'm using iText 5.5.8 for Java.
Following the default, straightforward text extraction procedures, i.e.
PdfTextExtractor.getTextFromPage(reader, pageNumber)
I was surprised to find several mistakes in the output; specifically, all letter d's come out as o's.
So how does text extraction in iText really work? Is it some kind of OCR?
I took a look under the hood, trying to grasp how TextExtractionStrategy works, but I couldn't figure out much. SimpleTextExtractionStrategy, for example, seems to just determine the presence of lines and spaces, whereas it's TextRenderInfo that provides the text, by invoking some decode method on a GraphicsState's font field, and that's as far as I could go without getting a major migraine.
So who's my man? Which class should I override, or which parameter should I tweak, to be able to tell iText "hey, you're reading all d's wrong!"?
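For reference, the hook iText exposes is the strategy passed to getTextFromPage. A minimal sketch (iText 5 API) that logs each decoded chunk, which at least shows where the wrong letters enter the pipeline:
class DebugStrategy extends SimpleTextExtractionStrategy
{
    @Override
    public void renderText(TextRenderInfo renderInfo)
    {
        // getText() returns text already decoded via the font's ToUnicode map,
        // so the bad 'd' -> 'o' mapping is already visible at this point
        System.out.println("decoded chunk: " + renderInfo.getText());
        super.renderText(renderInfo);
    }
}
// usage: PdfTextExtractor.getTextFromPage(reader, pageNumber, new DebugStrategy())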
edit:
sample PDF can be found at http://www.fpozzi.com/stampastopper/download/ name of file is 0116_LR.pdf
Sorry, can't share a direct link.
This is some basic code for text extraction
import java.io.File;
import java.io.IOException;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

public class Import
{
    public static void importFromPdf(final File pdfFile) throws IOException
    {
        PdfReader reader = new PdfReader(pdfFile.getAbsolutePath());
        try
        {
            for (int i = 1; i <= reader.getNumberOfPages(); i++)
            {
                System.out.println(PdfTextExtractor.getTextFromPage(reader, i));
                System.out.println("----------------------------------");
            }
        }
        finally
        {
            reader.close();
        }
    }

    public static void main(String[] args)
    {
        try
        {
            importFromPdf(new File("0116_LR.pdf"));
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }
}
edit after #blagae and #mkl answers
Before starting to fiddle with iText, I tried text extraction with Apache PDFBox (a project similar to iText that I just discovered), but it has the same issue.
Understanding how these programs treat text is way beyond my dedication, so I have written a simple method to extract text from the raw page content, that is, whatever stands between the BT and ET markers.
import java.io.File;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import com.itextpdf.text.io.RandomAccessSourceFactory;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.RandomAccessFileOrArray;
import com.itextpdf.text.pdf.parser.ContentByteUtils;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

public class Import
{
    // Matches string operands of text-showing operators, i.e. anything in parentheses
    private final static Pattern actualWordPattern = Pattern.compile("\\((.*?)\\)");

    public static void importFromPdf(final File pdfFile) throws IOException
    {
        PdfReader reader = new PdfReader(pdfFile.getAbsolutePath());
        Matcher matcher;
        String line, extractedText;
        boolean anyMatchFound;
        try
        {
            for (int i = 1; i <= 16; i++) // the sample file has 16 pages
            {
                byte[] contentBytes = ContentByteUtils.getContentBytesForPage(reader, i);
                RandomAccessFileOrArray raf = new RandomAccessFileOrArray(
                        new RandomAccessSourceFactory().createSource(contentBytes));
                // Skip ahead to the first BT (begin text) marker
                while ((line = raf.readLine()) != null && !line.equals("BT"));
                extractedText = "";
                // Collect everything up to the matching ET (end text) marker
                while ((line = raf.readLine()) != null && !line.equals("ET"))
                {
                    anyMatchFound = false;
                    matcher = actualWordPattern.matcher(line);
                    while (matcher.find())
                    {
                        anyMatchFound = true;
                        extractedText += matcher.group(1);
                    }
                    if (anyMatchFound)
                        extractedText += "\n";
                }
                System.out.println(extractedText);
                System.out.println("+++++++++++++++++++++++++++");
                String properlyExtractedText = PdfTextExtractor.getTextFromPage(reader, i);
                System.out.println(properlyExtractedText);
                System.out.println("---------------------------");
            }
        }
        finally
        {
            reader.close();
        }
    }

    public static void main(String[] args)
    {
        try
        {
            importFromPdf(new File("0116_LR.pdf"));
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }
}
It appears, at least in my case, that the characters are correct. However, the order of words, or even letters, is messy (super messy, in fact), so this approach is unusable as well.
What really surprises me is that all methods I have tried so far to retrieve text from PDFs, including copy/paste from Adobe Reader, screw something up.
I have come to the conclusion that the most reliable way to get some decent text extraction may also be the most unexpected: some good OCR.
I am now trying to:
1) transform pdf into an image (PDFBox is great at doing that - do not even bother to try pdf-renderer)
2) OCR that image
I will post my results in a few days.
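In the meantime, a minimal sketch of step 1 with PDFBox 2.x (the DPI and file names are illustrative):
import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;

PDDocument doc = PDDocument.load(new File("0116_LR.pdf"));
PDFRenderer renderer = new PDFRenderer(doc);
for (int i = 0; i < doc.getNumberOfPages(); i++)
{
    // 300 DPI is usually a good starting point for OCR
    BufferedImage image = renderer.renderImageWithDPI(i, 300);
    ImageIO.write(image, "png", new File("page-" + (i + 1) + ".png"));
}
doc.close();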
Your input document has been created in a strange (but 'legal') way. There is a Unicode mapping in the resources that maps arbitrary glyphs to Unicode points. In particular, character number 0x64 (d in ASCII) is mapped to the glyph with Unicode point 0x6f, which is o, in this font. This is not a problem per se, since any PDF viewer can handle it, but it is strange, because all the other glyphs that are used are not "cross-mapped": character 0x63 is mapped to Unicode point 0x63 (which is c), and so on.
Now for the reason that Acrobat does the text extraction correctly (except for the space), and the others go wrong. We'll have to delve into the PDF syntax for this:
[p, -17.9, e, -15.1, l, 1.4, l, 8.4, i, -20, m, 5.8, i, 14, st, -17.5, e, 31.2, ,, -20.1, a] TJ
<</ActualText <fffffffeffffffff00640064> >> BDC
5.102 0 Td
[d, -14.2, d] TJ
EMC
That tells a PDF viewer to print p-e-l-l-i- -m-i-st-e- -a on the first line of code, and d-d after that on the fourth line. However, d maps to o, which is apparently only a problem for text extraction. Acrobat does do the text extraction correctly, because there is a content marker /ActualText which says that whatever we write between the BDC and EMC markers must be parsed as dd (0x64,0x64).
So to answer your question: iText does this on the same level as a lot of well-respected viewers, which all ignore the /ActualText marker. Except for Acrobat, which does respect it and overrules the ToUnicode mapping.
And to really answer your question: iText is currently looking into parsing the /ActualText marker, but it will probably take a while before it gets into an official release.
This probably has to do with how the PDF was OCR'd in the first place, rather than with how iText is parsing the PDF's contents. Try copying/pasting the text from the PDF into Notepad and see if the "ds -> os" transformation still occurs. If it does, you're going to have to do the following when parsing text from this particular PDF:
1) Identify all occurrences of the string "os".
2) Decide whether the word containing the given "os" instance is a valid English/German/Spanish word.
3) If it IS a valid word, do nothing.
4) If it is NOT a valid word, perform the reverse "os -> ds" transformation and check against the dictionary in the language of your choice again (see the sketch below).
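A minimal sketch of that heuristic (assuming a Set<String> dictionary of valid lowercase words; the names are illustrative):
static String fixWord(String word, Set<String> dictionary)
{
    // Words that are already valid (or contain no "os") are left untouched
    if (!word.contains("os") || dictionary.contains(word.toLowerCase()))
    {
        return word;
    }
    // Try the reverse "os" -> "ds" transformation and re-check the dictionary
    String candidate = word.replace("os", "ds");
    return dictionary.contains(candidate.toLowerCase()) ? candidate : word;
}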

PdfUtilities.convertPdf2Png automatically creates images in my directory

I've written some code to perform OCR on a PDF using Tesseract (Tess4J):
public void DoOCRAnalyse(String from) throws FileNotFoundException {
    Tesseract instance = Tesseract.getInstance(); // JNA Interface Mapping
    // Renders each PDF page to a PNG file on disk (via GhostScript)
    File[] files = PdfUtilities.convertPdf2Png(new File(from));
    for (File f : files) {
        try {
            String result = instance.doOCR(f);
            // what I want: String result = instance.doOCR(<File or BufferedImage>);
            SearchForSVHC(result, SvhcList);
        } catch (TesseractException e) {
            System.err.println(e.getMessage());
        }
    }
}
It recognizes the text, which is great, but my problem is that it needs the images to be in a directory on disk. How can I pass a BufferedImage or a File to the method doOCR() without needing the files on disk?
You are passing a File object to doOCR. When you call convertPdf2Png, it invokes GhostScript to convert a PDF file to one or more PNG files. You can certainly delete them after OCR if you want, e.g., by calling f.delete() in a finally block.
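If you want to avoid temporary files entirely: recent Tess4J versions also expose an ITesseract.doOCR(BufferedImage) overload, so you can render the pages in memory with PDFBox instead of going through convertPdf2Png. A sketch (assuming PDFBox 2.x and Tess4J 3+; error handling omitted):
PDDocument pdf = PDDocument.load(new File(from));
PDFRenderer renderer = new PDFRenderer(pdf);
ITesseract tess = new Tesseract();
for (int i = 0; i < pdf.getNumberOfPages(); i++) {
    // Render each page to an in-memory image and OCR it directly
    BufferedImage page = renderer.renderImageWithDPI(i, 300);
    String result = tess.doOCR(page); // nothing written to disk
    SearchForSVHC(result, SvhcList);
}
pdf.close();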