How to extract content from .png or .gif files using Tika?

This is my piece of code:
Metadata metadata = new Metadata();
BodyContentHandler handler = new BodyContentHandler();
FileInputStream inputstream = new FileInputStream(new File(
"resources/bed_bath_beyond.gif"));
ParseContext parseContext = new ParseContext();
Parser parser = new AutoDetectParser();
parseContext.set(Parser.class, parser);
parser.parse(inputstream , handler, metadata, parseContext);
XHTMLContentHandler xhandler=new XHTMLContentHandler(handler, metadata);
String text = xhandler.toString();
System.out.println("Contents of the document:" + text);
The code above produces this output:
Contents of the document:
The content of the file does not appear in the output.
Please help.

Related

How to use itext to submit PDF Form

After reading many Stack Overflow posts and trying many solutions, I'm stuck with this:
I am receiving a PDF that I cannot change and need to process it automatically.
The PDF is a PDF form with 2 fields and a submit button.
The following code is the closest I have come to what I need:
public static final String SRC = "C:\\Dev\\test.pdf";
public static final String DEST = "C:\\Dev\\test_result.pdf";
public static final String DATA = "C:\\Dev\\data.xml";
File file = new File(DEST);
PdfReader reader = new PdfReader(SRC);
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(DEST));
AcroFields form = stamper.getAcroFields();
XfaForm xfa = form.getXfa();
xfa.fillXfaForm(new FileInputStream(DATA));
This throws a NullPointerException:
Exception in thread "main" java.lang.NullPointerException
at com.itextpdf.text.pdf.XfaForm.fillXfaForm(XfaForm.java:1168)
at com.itextpdf.text.pdf.XfaForm.fillXfaForm(XfaForm.java:1146)
at com.itextpdf.text.pdf.XfaForm.fillXfaForm(XfaForm.java:1134)
at com.itextpdf.text.pdf.XfaForm.fillXfaForm(XfaForm.java:1131)
I can get and set the fields on the form with this code:
AcroFields fields = reader.getAcroFields();
fields.setField("pdfForm.loginUser", "myemail@domain.com");
fields.setField("pdfForm.loginPass", "mypassword");
How do I convert the AcroFields to an XfaForm?

docx with track change producing incorrect output in Apache Tika

I am parsing docx files using Apache Tika.
AutoDetectParser parser = new AutoDetectParser();
ContentHandler contentHandler = new BodyContentHandler();
inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
Metadata metadata = new Metadata();
OfficeParserConfig officeParserConfig = new OfficeParserConfig();
officeParserConfig.setIncludeDeletedContent(false);
parseContext.set(OfficeParserConfig.class, officeParserConfig);
parser.parse(inputStream, contentHandler, metadata, parseContext);
System.out.println(contentHandler.toString());
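When I send a docx file with tracked changes, the output contains the deleted text along with the actual text and the inserted text. Is there a way to tell the parser to exclude the deleted text?
I figured it out. The difference from the code above is the call to setUseSAXDocxExtractor(true):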
AutoDetectParser parser = new AutoDetectParser();
ContentHandler contentHandler = new BodyContentHandler();
inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
Metadata metadata = new Metadata();
ParseContext parseContext = new ParseContext();
OfficeParserConfig officeParserConfig = new OfficeParserConfig();
officeParserConfig.setUseSAXDocxExtractor(true);
officeParserConfig.setIncludeDeletedContent(false);
parseContext.set(OfficeParserConfig.class, officeParserConfig);
parser.parse(inputStream, contentHandler, metadata, parseContext);
System.out.println(contentHandler.toString());

Is it possible to merge several pdfs using iText7

I have several datasheets for products. Each is a separate file. What I want to do is to use iText to generate a summary / recommended set of actions, based on answers to a webform, and then append to that all the relevant datasheets. This way, I only need to open one new tab in the browser to print all information, rather than opening one for the summary, and one for each datasheet that is needed.
So, is it possible to do this using iText?
Yes, you can merge PDFs using iText 7. E.g. look at the iText 7 Jump-Start tutorial sample C06E04_88th_Oscar_Combine, the pivotal code is:
PdfDocument pdf = new PdfDocument(new PdfWriter(dest));
PdfMerger merger = new PdfMerger(pdf);
//Add pages from the first document
PdfDocument firstSourcePdf = new PdfDocument(new PdfReader(SRC1));
merger.merge(firstSourcePdf, 1, firstSourcePdf.getNumberOfPages());
//Add pages from the second pdf document
PdfDocument secondSourcePdf = new PdfDocument(new PdfReader(SRC2));
merger.merge(secondSourcePdf, 1, secondSourcePdf.getNumberOfPages());
firstSourcePdf.close();
secondSourcePdf.close();
pdf.close();
(C06E04_88th_Oscar_Combine method createPdf)
Depending on your use case, you might want to use the PdfDenseMerger with its helper class PageVerticalAnalyzer instead of the PdfMerger here. It attempts to put content from multiple source pages onto a single target page and corresponds to the iText 5 PdfVeryDenseMergeTool from this answer. Due to the nature of PDF files this only works for PDFs without headers, footers, and similar artifacts.
I found a solution that works quite well.
public byte[] Combine(IEnumerable<byte[]> pdfs)
{
    using (var writerMemoryStream = new MemoryStream())
    {
        using (var writer = new PdfWriter(writerMemoryStream))
        {
            using (var mergedDocument = new PdfDocument(writer))
            {
                var merger = new PdfMerger(mergedDocument);
                foreach (var pdfBytes in pdfs)
                {
                    using (var copyFromMemoryStream = new MemoryStream(pdfBytes))
                    {
                        using (var reader = new PdfReader(copyFromMemoryStream))
                        {
                            using (var copyFromDocument = new PdfDocument(reader))
                            {
                                merger.Merge(copyFromDocument, 1, copyFromDocument.GetNumberOfPages());
                            }
                        }
                    }
                }
            }
        }
        return writerMemoryStream.ToArray();
    }
}
Usage:
DirectoryInfo d = new DirectoryInfo(INPUT_FOLDER);
var pdfList = new List<byte[]> { };
foreach (var file in d.GetFiles("*.pdf"))
{
pdfList.Add(File.ReadAllBytes(file.FullName));
}
File.WriteAllBytes(OUTPUT_FOLDER + "\\merged.pdf", Combine(pdfList));
Source:
https://www.nikouusitalo.com/blog/combining-pdf-documents-using-itext7-and-c/
If you want to merge two byte arrays and return the result as a single byte array containing a PDF/A document:
public static byte[] mergePDF(byte [] first, byte [] second) throws IOException {
// Initialize PDF writer
ByteArrayOutputStream arrayOutputStream = new ByteArrayOutputStream();
PdfWriter writer = new PdfWriter(arrayOutputStream);
// Initialize PDF document
PdfADocument pdf = new PdfADocument(writer, PdfAConformanceLevel.PDF_A_1B, new PdfOutputIntent("Custom", "",
"https://www.color.org", "sRGB IEC61966-2.1", new FileInputStream("sRGB_CS_profile.icm")));
PdfMerger merger = new PdfMerger(pdf);
//Add pages from the first document
PdfDocument firstSourcePdf = new PdfDocument(new PdfReader(new ByteArrayInputStream(first)));
merger.merge(firstSourcePdf, 1, firstSourcePdf.getNumberOfPages());
//Add pages from the second pdf document
PdfDocument secondSourcePdf = new PdfDocument(new PdfReader(new ByteArrayInputStream(second)));
merger.merge(secondSourcePdf, 1, secondSourcePdf.getNumberOfPages());
firstSourcePdf.close();
secondSourcePdf.close();
// Closing the document finalizes the PDF and, by default, also closes its writer.
pdf.close();
return arrayOutputStream.toByteArray();
}
The question doesn't specify the language, so I'm adding an answer using C#; this works for me. I'm creating three separate but related PDFs then combining them into one.
After creating the three separate PDF docs and adding data to them, I combine them this way:
PdfDocument pdfCombined = new PdfDocument(new PdfWriter(destCombined));
PdfMerger merger = new PdfMerger(pdfCombined);
PdfDocument pdfReaderExecSumm = new PdfDocument(new PdfReader(destExecSumm));
merger.Merge(pdfReaderExecSumm, 1, pdfReaderExecSumm.GetNumberOfPages());
PdfDocument pdfReaderPhrases = new PdfDocument(new PdfReader(destPhrases));
merger.Merge(pdfReaderPhrases, 1, pdfReaderPhrases.GetNumberOfPages());
PdfDocument pdfReaderUncommonWords = new PdfDocument(new PdfReader(destUncommonWords));
merger.Merge(pdfReaderUncommonWords, 1, pdfReaderUncommonWords.GetNumberOfPages());
pdfCombined.Close();
So the combined PDF is a PdfDocument backed by a PdfWriter, the merged pieces are PdfDocuments backed by PdfReaders, and the PdfMerger is the glue that binds it all together.
Here is the minimum C# code needed to merge file1.pdf and file2.pdf into a new merged.pdf:
var path = @"C:\Temp\";
var src0 = System.IO.Path.Combine(path, "merged.pdf");
var wtr0 = new PdfWriter(src0);
var pdf0 = new PdfDocument(wtr0);
var src1 = System.IO.Path.Combine(path, "file1.pdf");
var fi1 = new FileInfo(src1);
var rdr1= new PdfReader(fi1);
var pdf1 = new PdfDocument(rdr1);
var src2 = System.IO.Path.Combine(path, "file2.pdf");
var fi2 = new FileInfo(src2);
var rdr2 = new PdfReader(fi2);
var pdf2 = new PdfDocument(rdr2);
var merger = new PdfMerger(pdf0);
merger.Merge(pdf1, 1, pdf1.GetNumberOfPages());
merger.Merge(pdf2, 1, pdf2.GetNumberOfPages());
merger.Close();
pdf0.Close();
Here is a VB.NET solution using open source iText7 that can merge multiple PDF files to an output file.
Imports System.IO
Imports iText.Kernel.Pdf
Imports iText.Kernel.Utils
Public Function Merge_PDF_Files(ByVal input_files As List(Of String), ByVal output_file As String) As Boolean
Dim Input_Document As PdfDocument = Nothing
Dim Output_Document As PdfDocument = Nothing
Dim Merger As PdfMerger
Try
Output_Document = New iText.Kernel.Pdf.PdfDocument(New iText.Kernel.Pdf.PdfWriter(output_file))
'Create the output file (Document) from a Merger stream'
Merger = New PdfMerger(Output_Document)
'Merge each input PDF file to the output document'
For Each file As String In input_files
Input_Document = New PdfDocument(New PdfReader(file))
Merger.Merge(Input_Document, 1, Input_Document.GetNumberOfPages())
Input_Document.Close()
Next
Output_Document.Close()
Return True
Catch ex As Exception
'catch Exception if needed'
If Input_Document IsNot Nothing Then Input_Document.Close()
If Output_Document IsNot Nothing Then Output_Document.Close()
File.Delete(output_file)
Return False
End Try
End Function
USAGE EXAMPLE:
Dim success as boolean = false
Dim input_files_list As New List(Of String)
input_files_list.Add("c:\input_PDF1.pdf")
input_files_list.Add("c:\input_PDF2.pdf")
input_files_list.Add("c:\input_PDF3.pdf")
success = Merge_PDF_Files(input_files_list, "c:\output_PDF.pdf")
'Optional: handling errors'
if success then
'Files merged'
else
'Error merging files'
end if

file not downloading properly

I am downloading a file from a URL and saving it to a directory on my phone.
The path is: /private/var/mobile/Applications/17E4F0B0-0781-4259-B39D-37057D44B778/Documents/samplefile.txt
When I run a debug build, the file is created and its contents are downloaded. When I run an ad-hoc build, samplefile.txt is created but it is empty.
Code:
String directory = Environment.GetFolderPath (Environment.SpecialFolder.MyDocuments);
var filename = Path.Combine (directory, "samplefile.txt");
if (!File.Exists (filename)) {
File.Create (filename);
var webClient = new WebClient ();
webClient.DownloadStringCompleted += (s, e) => {
var text = e.Result; // get the downloaded text
File.WriteAllText (filename, text);
};
var url = new Uri (/**myURL**/);
webClient.Encoding = Encoding.UTF8;
webClient.DownloadStringAsync (url);
I modified your sample slightly, and the following works for me.
The StreamReader is there only to re-read the file and confirm that its contents match the downloaded text.
If you put a breakpoint there, you can also inspect the contents manually.
string directory = System.Environment.GetFolderPath(System.Environment.SpecialFolder.Personal);
var filename = Path.Combine(directory, "samplefile.txt");
if (!File.Exists(filename))
{
var webClient = new WebClient();
webClient.DownloadStringCompleted += (s, e) =>
{
// Write contents of downloaded file to device:-
var text = e.Result; // get the downloaded text
StreamWriter sw = new StreamWriter(filename);
sw.Write(text);
sw.Flush();
sw.Close();
sw = null;
// Read in contents from device and validate same as downloaded:-
StreamReader sr = new StreamReader(filename);
string strFileContentsOnDevice = sr.ReadToEnd();
System.Diagnostics.Debug.Assert(strFileContentsOnDevice == text);
};
var url = new Uri("**url here**", UriKind.Absolute);
webClient.Encoding = Encoding.UTF8;
webClient.DownloadStringAsync(url);
}
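One visible difference from the question's code is that this version no longer calls File.Create before the download; File.Create returns an open stream, and the question never closes it. Whether that alone explains the empty file on an ad-hoc build is an assumption, not something the thread confirms. As a rough illustration of the same "write in one step, then read back to verify" pattern, here is a minimal sketch in Java rather than C#. The class and method names are hypothetical, and the downloaded text is passed in as a plain string so that the example does not depend on a network call.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of any library mentioned in this thread.
class SaveDownloadedText {

    /**
     * Writes the text to the target file only if the file does not exist yet,
     * then reads the file back and checks that it matches what was written.
     * Returns true if the file was written, false if it already existed.
     */
    public static boolean saveIfAbsent(Path target, String text) throws IOException {
        if (Files.exists(target)) {
            return false;
        }
        // Files.write creates the file, writes all bytes, and closes the handle,
        // so there is no separate "create" step that could leave a handle open.
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));

        // Read the file back and compare it with the downloaded text.
        String onDisk = new String(Files.readAllBytes(target), StandardCharsets.UTF_8);
        if (!onDisk.equals(text)) {
            throw new IOException("File contents differ from the downloaded text");
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("download-demo");
        Path target = dir.resolve("samplefile.txt");
        System.out.println(saveIfAbsent(target, "hello"));       // prints "true"
        System.out.println(saveIfAbsent(target, "hello again")); // prints "false"
    }
}
```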

Streaming OpenXML result

My server application uses OpenXML to manipulate a Word document (as a template), saves the new content as a temporary file, and then sends it to the user to download.
How can I make this content ready to download without saving it on the server as a temporary file? Is it possible to save the OpenXML result as a byte[] or Stream instead of a file?
Using this page:
OpenXML file download without temporary file
I changed my code to this one:
byte[] result = null;
byte[] templateBytes = System.IO.File.ReadAllBytes(wordTemplate);
using (MemoryStream templateStream = new MemoryStream())
{
templateStream.Write(templateBytes, 0, (int)templateBytes.Length);
using (WordprocessingDocument doc = WordprocessingDocument.Open(templateStream, true))
{
MainDocumentPart mainPart = doc.MainDocumentPart;
...
mainPart.Document.Save();
}
// Read the bytes only after the document has been disposed,
// so that the whole package has been written to the stream.
result = templateStream.ToArray();
}
You can create the WordprocessingDocument and then use the Save() method to save it to a Stream.
http://msdn.microsoft.com/en-us/library/cc882497
var memoryStream = new MemoryStream();
document.Clone(memoryStream);
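A .docx file is a ZIP-based package, so the general idea of building the package in memory and taking its bytes once it has been closed can be shown without the OpenXML SDK. The following is only an analogous sketch in Java using java.util.zip. It does not produce a valid Word document, and the class name, method names, and entry name are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper that stands in for "build a package without a temp file".
class InMemoryPackage {

    /** Builds a one-entry ZIP package entirely in memory and returns its bytes. */
    public static byte[] build(String entryName, String content) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write(content.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        // The ZIP central directory is written when the stream is closed,
        // so the bytes are taken only after the try block has finished.
        return buffer.toByteArray();
    }

    /** Reads the content of the first entry back from the in-memory package. */
    public static String readFirstEntry(byte[] zipBytes) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            if (zip.getNextEntry() == null) {
                throw new IOException("Package contains no entries");
            }
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = zip.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = build("word/document.xml", "<w:document/>");
        System.out.println(readFirstEntry(bytes)); // prints "<w:document/>"
    }
}
```

The returned byte array can then be written straight to the HTTP response, which is the part that replaces the temporary file.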