How do I merge two PDF files with attachments using iText?

I am trying to merge two PDF files (file1.pdf and file2.pdf) into a single file, file3.pdf. One of the source files, file2.pdf, has a few attachments.
The PdfCopyFields addDocument method does not carry the attachments from the source PDF files over to the destination PDF file. How do I achieve this?
Extracting the document-level attachments from the source files using PdfDictionary and adding them to the destination file with the PdfWriter addFileAttachment method works.
Is there a more efficient way to include the attachments from the source PDF files in the destination PDF file after merging?
This is the sample code I am using to reproduce the scenario.
public class TestItext
{
public String[] attachments;
public TestItext()
{
attachments = new String[2];
}
public static void main(String[] args)
{
try
{
TestItext obj = new TestItext();
obj.extractDocLevelAttachments("C:\\source.pdf");
obj.addAttachments("C:\\source.pdf","C:\\temp\\dest.pdf");
}
catch(Exception e)
{
e.printStackTrace();
}
}
public void extractDocLevelAttachments(String filename) throws IOException
{
PdfReader reader = new PdfReader(filename);
PdfDictionary root = reader.getCatalog();
PdfDictionary documentnames = root.getAsDict(PdfName.NAMES);
PdfDictionary embeddedfiles = documentnames.getAsDict(PdfName.EMBEDDEDFILES);
PdfArray filespecs = embeddedfiles.getAsArray(PdfName.NAMES);
PdfDictionary filespec;
PdfDictionary refs;
FileOutputStream fos;
PRStream stream;
int count = 0;
for (int i = 0; i < filespecs.size(); ) {
filespecs.getAsString(i++);
filespec = filespecs.getAsDict(i++);
refs = filespec.getAsDict(PdfName.EF);
for (Object key : refs.getKeys()) {
fos = new FileOutputStream(String.format("C:\\temp\\"+ filespec.getAsString((PdfName)key).toString()));
attachments[count++] = String.format("C:\\temp\\"+ filespec.getAsString((PdfName)key).toString());
stream = (PRStream) PdfReader.getPdfObject(refs.getAsIndirectObject((PdfName)key));
fos.write(PdfReader.getStreamBytes(stream));
fos.flush();
fos.close();
}
}
reader.close();
}
public void addAttachments(String src, String dest) throws IOException, DocumentException
{
PdfReader reader = new PdfReader(src);
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
for (int i = 0; i < attachments.length; i++) {
addAttachment(stamper.getWriter(), new File(attachments[i]));
}
stamper.close();
}
protected void addAttachment(PdfWriter writer, File src) throws IOException {
PdfFileSpecification fs =
PdfFileSpecification.fileEmbedded(writer, src.getAbsolutePath(), src.getName(), null);
writer.addFileAttachment(src.getName().substring(0, src.getName().indexOf('.')), fs);
}
}
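One detail in the extraction loop above is worth isolating: the output path is built by concatenating an embedded file name onto the target folder and passing the result through String.format. A name containing `%` may then be treated as a format specifier and fail, and a name containing path separators could point outside the target folder. The following is a minimal stdlib-only sketch of a more defensive way to build that path. The class and method names (`AttachmentPaths`, `resolveInside`) are hypothetical and not part of iText.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class AttachmentPaths {
    /**
     * Resolves an embedded file name against a target directory, keeping only
     * the last path segment so the result stays inside that directory.
     */
    static Path resolveInside(Path targetDir, String embeddedName) {
        // Treat both separator styles as separators, whatever the host OS is.
        String normalized = embeddedName.replace('\\', '/');
        String lastSegment = normalized.substring(normalized.lastIndexOf('/') + 1);
        if (lastSegment.isEmpty() || lastSegment.equals(".") || lastSegment.equals("..")) {
            throw new IllegalArgumentException("Unusable attachment name: " + embeddedName);
        }
        return targetDir.resolve(lastSegment);
    }

    public static void main(String[] args) {
        Path dir = Paths.get("temp"); // hypothetical output folder
        System.out.println(resolveInside(dir, "report 100%.txt"));
        System.out.println(resolveInside(dir, "../../etc/passwd"));
    }
}
```

The returned path could then be used both for the FileOutputStream and for the entry stored in the attachments array, so the two always agree.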

Related

Manipulate paths, color etc. in iText

I need to analyze path data of PDF files and manipulate content with iText 7. Manipulations include deletion/replacement and coloring.
I can analyze the graphics all right with something like the following code:
public class ContentParsing {
public static void main(String[] args) throws IOException {
new ContentParsing().inspectPdf("testdata/test.pdf");
}
public void inspectPdf(String path) throws IOException {
File file = new File(path);
PdfDocument pdf = new PdfDocument(new PdfReader(file.getAbsolutePath()));
PdfDocumentContentParser parser = new PdfDocumentContentParser(pdf);
for (int i=1; i<=pdf.getNumberOfPages(); i++) {
parser.processContent(i, new PathEventListener());
}
pdf.close();
}
}
public class PathEventListener implements IEventListener {
public void eventOccurred(IEventData eventData, EventType eventType) {
PathRenderInfo pathRenderInfo = (PathRenderInfo) eventData;
for ( Subpath subpath : pathRenderInfo.getPath().getSubpaths() ) {
for ( IShape segment : subpath.getSegments() ) {
// Here goes some path analysis code
System.out.println(segment.getBasePoints());
}
}
}
public Set<EventType> getSupportedEvents() {
Set<EventType> supportedEvents = new HashSet<EventType>();
supportedEvents.add(EventType.RENDER_PATH);
return supportedEvents;
}
}
Now, what's the way to go with manipulating things and writing them back to the PDF? Do I have to construct an entirely new PDF document and copy everything over (in manipulated form), or can I somehow manipulate the read PDF data directly?
In essence, you are looking for a class which does not merely parse a PDF content stream and signal the instructions in it, as the PdfCanvasProcessor does (the PdfDocumentContentParser you use is merely a very thin wrapper around PdfCanvasProcessor), but which also creates the content stream anew from the instructions you forward back to it.
A generic content stream editor class
For iText 5.5.x a proof-of-concept for such a content stream editor class can be found in this answer (the Java version is a bit further down in the answer text).
This is a port of that proof-of-concept to iText 7:
public class PdfCanvasEditor extends PdfCanvasProcessor
{
/**
* This method edits the immediate contents of a page, i.e. its content stream.
* It explicitly does not descend into form xobjects, patterns, or annotations.
*/
public void editPage(PdfDocument pdfDocument, int pageNumber) throws IOException
{
if ((pdfDocument.getReader() == null) || (pdfDocument.getWriter() == null))
{
throw new PdfException("PdfDocument must be opened in stamping mode.");
}
PdfPage page = pdfDocument.getPage(pageNumber);
PdfResources pdfResources = page.getResources();
PdfCanvas pdfCanvas = new PdfCanvas(new PdfStream(), pdfResources, pdfDocument);
editContent(page.getContentBytes(), pdfResources, pdfCanvas);
page.put(PdfName.Contents, pdfCanvas.getContentStream());
}
/**
* This method processes the content bytes and outputs to the given canvas.
* It explicitly does not descend into form xobjects, patterns, or annotations.
*/
public void editContent(byte[] contentBytes, PdfResources resources, PdfCanvas canvas)
{
this.canvas = canvas;
processContent(contentBytes, resources);
this.canvas = null;
}
/**
* <p>
* This method writes content stream operations to the target canvas. The default
* implementation writes them as they come, so it essentially generates identical
* copies of the original instructions the {@link ContentOperatorWrapper} instances
* forward to it.
* </p>
* <p>
* Override this method to achieve some fancy editing effect.
* </p>
*/
protected void write(PdfCanvasProcessor processor, PdfLiteral operator, List<PdfObject> operands)
{
PdfOutputStream pdfOutputStream = canvas.getContentStream().getOutputStream();
int index = 0;
for (PdfObject object : operands)
{
pdfOutputStream.write(object);
if (operands.size() > ++index)
pdfOutputStream.writeSpace();
else
pdfOutputStream.writeNewLine();
}
}
//
// constructor giving the parent a dummy listener to talk to
//
public PdfCanvasEditor()
{
super(new DummyEventListener());
}
//
// Overrides of PdfCanvasProcessor methods
//
@Override
public IContentOperator registerContentOperator(String operatorString, IContentOperator operator)
{
ContentOperatorWrapper wrapper = new ContentOperatorWrapper();
wrapper.setOriginalOperator(operator);
IContentOperator formerOperator = super.registerContentOperator(operatorString, wrapper);
return formerOperator instanceof ContentOperatorWrapper ? ((ContentOperatorWrapper)formerOperator).getOriginalOperator() : formerOperator;
}
//
// members holding the output canvas and the resources
//
protected PdfCanvas canvas = null;
//
// A content operator class to wrap all content operators to forward the invocation to the editor
//
class ContentOperatorWrapper implements IContentOperator
{
public IContentOperator getOriginalOperator()
{
return originalOperator;
}
public void setOriginalOperator(IContentOperator originalOperator)
{
this.originalOperator = originalOperator;
}
@Override
public void invoke(PdfCanvasProcessor processor, PdfLiteral operator, List<PdfObject> operands)
{
if (originalOperator != null && !"Do".equals(operator.toString()))
{
originalOperator.invoke(processor, operator, operands);
}
write(processor, operator, operands);
}
private IContentOperator originalOperator = null;
}
//
// A dummy event listener to give to the underlying canvas processor to feed events to
//
static class DummyEventListener implements IEventListener
{
@Override
public void eventOccurred(IEventData data, EventType type)
{ }
@Override
public Set<EventType> getSupportedEvents()
{
return null;
}
}
}
(PdfCanvasEditor.java)
The explanations from the iText 5 answer still apply; the parsing framework has not changed much from iText 5.5.x to iText 7.0.x.
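The editing pattern above (feed each parsed instruction to a write method, which by default copies it and can be overridden to drop or replace it) can be illustrated without iText. The following is a toy model only: `ToyEditor`, `Instruction` and the plain-string operands are hypothetical stand-ins, not iText classes, and no graphics state is tracked.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ToyEditor {
    /** One content stream instruction: its operands followed by the operator. */
    static final class Instruction {
        final String operator;
        final List<String> operands;

        Instruction(String operator, String... operands) {
            this.operator = operator;
            this.operands = Arrays.asList(operands);
        }

        @Override
        public String toString() {
            return operands.isEmpty() ? operator : String.join(" ", operands) + " " + operator;
        }
    }

    static final List<String> TEXT_SHOWING_OPERATORS = Arrays.asList("Tj", "'", "\"", "TJ");

    private final List<String> output = new ArrayList<>();

    /** Default behaviour: copy the instruction unchanged. Override to edit. */
    protected void write(Instruction instruction) {
        output.add(instruction.toString());
    }

    /** Feeds every instruction through write and returns the rebuilt stream. */
    List<String> edit(List<Instruction> input) {
        output.clear();
        for (Instruction instruction : input) {
            write(instruction);
        }
        return new ArrayList<>(output);
    }

    /** Example edit: drop all text-showing instructions, keep everything else. */
    static List<String> dropText(List<Instruction> input) {
        ToyEditor editor = new ToyEditor() {
            @Override
            protected void write(Instruction instruction) {
                if (TEXT_SHOWING_OPERATORS.contains(instruction.operator)) {
                    return; // swallow the instruction instead of forwarding it
                }
                super.write(instruction);
            }
        };
        return editor.edit(input);
    }

    public static void main(String[] args) {
        List<Instruction> page = Arrays.asList(
                new Instruction("BT"),
                new Instruction("Tf", "/F1", "12"),
                new Instruction("Tj", "(Hello)"),
                new Instruction("ET"),
                new Instruction("re", "0", "0", "10", "10"),
                new Instruction("f"));
        System.out.println(dropText(page));
    }
}
```

In the real class the decision inside write can also consult the current graphics state, which this toy model leaves out.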
Usage examples
Unfortunately you wrote in very vague terms about how exactly you want to change the contents. Thus I simply ported some iText 5 samples which made use of the original iText 5 content stream editor class:
Watermark removal
These are ports of the use cases in this answer.
testRemoveBoldMTTextDocument
This example drops all text written in a font the name of which ends with "BoldMT":
try ( InputStream resource = getClass().getResourceAsStream("document.pdf");
PdfReader pdfReader = new PdfReader(resource);
OutputStream result = new FileOutputStream(new File(RESULT_FOLDER, "document-noBoldMTText.pdf"));
PdfWriter pdfWriter = new PdfWriter(result);
PdfDocument pdfDocument = new PdfDocument(pdfReader, pdfWriter) )
{
PdfCanvasEditor editor = new PdfCanvasEditor()
{
@Override
protected void write(PdfCanvasProcessor processor, PdfLiteral operator, List<PdfObject> operands)
{
String operatorString = operator.toString();
if (TEXT_SHOWING_OPERATORS.contains(operatorString))
{
if (getGraphicsState().getFont().getFontProgram().getFontNames().getFontName().endsWith("BoldMT"))
return;
}
super.write(processor, operator, operands);
}
final List<String> TEXT_SHOWING_OPERATORS = Arrays.asList("Tj", "'", "\"", "TJ");
};
for (int i = 1; i <= pdfDocument.getNumberOfPages(); i++)
{
editor.editPage(pdfDocument, i);
}
}
(EditPageContent.java test method testRemoveBoldMTTextDocument)
testRemoveBigTextDocument
This example drops all text written with a large font size:
try ( InputStream resource = getClass().getResourceAsStream("document.pdf");
PdfReader pdfReader = new PdfReader(resource);
OutputStream result = new FileOutputStream(new File(RESULT_FOLDER, "document-noBigText.pdf"));
PdfWriter pdfWriter = new PdfWriter(result);
PdfDocument pdfDocument = new PdfDocument(pdfReader, pdfWriter) )
{
PdfCanvasEditor editor = new PdfCanvasEditor()
{
@Override
protected void write(PdfCanvasProcessor processor, PdfLiteral operator, List<PdfObject> operands)
{
String operatorString = operator.toString();
if (TEXT_SHOWING_OPERATORS.contains(operatorString))
{
if (getGraphicsState().getFontSize() > 100)
return;
}
super.write(processor, operator, operands);
}
final List<String> TEXT_SHOWING_OPERATORS = Arrays.asList("Tj", "'", "\"", "TJ");
};
for (int i = 1; i <= pdfDocument.getNumberOfPages(); i++)
{
editor.editPage(pdfDocument, i);
}
}
(EditPageContent.java test method testRemoveBigTextDocument)
Text color change
This is a port of the use case in this answer.
testChangeBlackTextToGreenDocument
This example changes the color of black text to green.
try ( InputStream resource = getClass().getResourceAsStream("document.pdf");
PdfReader pdfReader = new PdfReader(resource);
OutputStream result = new FileOutputStream(new File(RESULT_FOLDER, "document-blackTextToGreen.pdf"));
PdfWriter pdfWriter = new PdfWriter(result);
PdfDocument pdfDocument = new PdfDocument(pdfReader, pdfWriter) )
{
PdfCanvasEditor editor = new PdfCanvasEditor()
{
@Override
protected void write(PdfCanvasProcessor processor, PdfLiteral operator, List<PdfObject> operands)
{
String operatorString = operator.toString();
if (TEXT_SHOWING_OPERATORS.contains(operatorString))
{
if (currentlyReplacedBlack == null)
{
Color currentFillColor = getGraphicsState().getFillColor();
if (Color.BLACK.equals(currentFillColor))
{
currentlyReplacedBlack = currentFillColor;
super.write(processor, new PdfLiteral("rg"), Arrays.asList(new PdfNumber(0), new PdfNumber(1), new PdfNumber(0), new PdfLiteral("rg")));
}
}
}
else if (currentlyReplacedBlack != null)
{
if (currentlyReplacedBlack instanceof DeviceCmyk)
{
super.write(processor, new PdfLiteral("k"), Arrays.asList(new PdfNumber(0), new PdfNumber(0), new PdfNumber(0), new PdfNumber(1), new PdfLiteral("k")));
}
else if (currentlyReplacedBlack instanceof DeviceGray)
{
super.write(processor, new PdfLiteral("g"), Arrays.asList(new PdfNumber(0), new PdfLiteral("g")));
}
else
{
super.write(processor, new PdfLiteral("rg"), Arrays.asList(new PdfNumber(0), new PdfNumber(0), new PdfNumber(0), new PdfLiteral("rg")));
}
currentlyReplacedBlack = null;
}
super.write(processor, operator, operands);
}
Color currentlyReplacedBlack = null;
final List<String> TEXT_SHOWING_OPERATORS = Arrays.asList("Tj", "'", "\"", "TJ");
};
for (int i = 1; i <= pdfDocument.getNumberOfPages(); i++)
{
editor.editPage(pdfDocument, i);
}
}
(EditPageContent.java test method testChangeBlackTextToGreenDocument)

Create PDF with iText on iSeries leads to error "The document has no pages."

We use the iText library in one of my customer's projects to generate a PDF from a string representing an HTML page. The iText version is 5.5.10.
The following piece of code works well in the development environments and on servers running Windows, but it does not work on the customer's server running on iSeries.
public class GeneratePDFCmdImpl extends ControllerCommandImpl implements
GeneratePDFCmd {
private String charsetStr = null;
private Charset charset = null;
private BaseFont bf = null;
private String destFile = null;
private String destFilename = null;
private String srcContent = null;
private String docName = null;
public void setDocName(String docname) {
this.docName = docname;
}
public void setSrcContent(String srcContent) {
this.srcContent = srcContent;
}
private void prepareDefaultsAndSettings() {
/* srcContent may be more complex html but even this simple one is not working */
srcContent = "<html><head></head><body>This is just a test</body></html>";
docName = "mypdf";
charsetStr = "UTF-8";
destFilename = docName+".pdf";
Date timestamp = new Date();
/* destFile = "/" is just for the sample. In my real project, the value is a folder where my app has full rights
*/
destFile = "/" + destFilename;
charset = Charset.forName(charsetStr);
FontFactory.register("/fonts/arial.ttf","Arial");
bf = FontFactory.getFont("Arial").getBaseFont();
}
@Override
public void performExecute() throws ECException {
super.performExecute();
Document document = null;
OutputStream os = null;
prepareDefaultsAndSettings();
try {
InputStream srcInputStream;
srcInputStream = new ByteArrayInputStream(srcContent.getBytes(charset));
document = new Document(PageSize.A4, 20, 20, 75, 80);
FileOutputStream destOutput = new FileOutputStream(destFile);
PdfWriter writer = PdfWriter.getInstance(document,destOutput);
writer.setPageEvent( new HeaderFooterPageEvent(bf));
document.open();
XMLWorkerHelper.getInstance().parseXHtml(writer, document, srcInputStream, charset);
document.close();
} catch (MalformedURLException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
} catch (DocumentException e) {
e.printStackTrace();
} finally {
if(document != null) {
document.close();
}
document = null;
try {
if (os != null) {
os.close();
}
} catch(IOException e) {
e.printStackTrace();
}
os = null;
}
}
private class HeaderFooterPageEvent extends PdfPageEventHelper {
PdfContentByte cb;
PdfTemplate template;
BaseFont bf;
Font f;
float fs;
public HeaderFooterPageEvent(BaseFont _bf) {
super();
bf = _bf;
f = new Font(bf);
}
@Override
public void onOpenDocument(PdfWriter writer, Document document) {
cb = writer.getDirectContent();
template = cb.createTemplate(50, 50);
}
@Override
public void onEndPage(PdfWriter writer, Document document) {
Date dat = new Date();
ColumnText ct = new ColumnText(writer.getDirectContent());
SimpleDateFormat sdf = new SimpleDateFormat("dd-MM-yyyy HH:mm");
ct.showTextAligned(writer.getDirectContent(), Element.ALIGN_CENTER, new Phrase(sdf.format(dat) ), 100, 30, 0);
String text = "Page " +writer.getPageNumber() + " to ";
float len = bf.getWidthPoint(text, 12);
cb.beginText();
cb.setFontAndSize(bf, 12);
cb.setTextMatrix(450, 30);
cb.showText(text);
cb.endText();
cb.addTemplate(template, 450 + len, 30);
}
@Override
public void onCloseDocument(PdfWriter writer, Document document) {
template.beginText();
template.setFontAndSize(bf, 12);
template.showText(String.valueOf(writer.getPageNumber()));
template.endText();
}
}
}
When executed on the iSeries, we get the following error message:
com.ibm.commerce.command.ECCommandTarget executeCommand CMN0420E: The following command exception has occurred during processing: "ExceptionConverter: java.io.IOException: The document has no pages.". ExceptionConverter: java.io.IOException: The document has no pages.
at com.itextpdf.text.pdf.PdfPages.writePageTree(PdfPages.java:112)
at com.itextpdf.text.pdf.PdfWriter.close(PdfWriter.java:1256)
at com.itextpdf.text.pdf.PdfDocument.close(PdfDocument.java:900)
at com.itextpdf.text.Document.close(Document.java:415)
at be.ourcustomer.package.GeneratePDFCmdImpl.performExecute(GeneratePDFCmdImpl.java:107)
I don't have much of an idea what we are doing wrong. Any help would be greatly appreciated.

Issue in converting HTML to PDF containing <pre> tag with Flying Saucer and ITEXT

I am using the Flying Saucer library to convert HTML to PDF. It generally works fine with HTML files.
But for some HTML files that include tags inside a pre tag, the generated PDF file displays those tags as text.
If I remove the pre tags, the formatting of the data is lost.
My code is:
org.w3c.dom.Document document = null;
try {
Document doc = Jsoup.parse(new File(htmlFile), "UTF-8", "");
Whitelist wl = new RelaxedPlusDataBase64Images();
Cleaner cleaner = new Cleaner(wl);
doc = cleaner.clean(doc);
Tidy tidy = new Tidy();
tidy.setShowWarnings(false);
tidy.setXmlTags(false);
tidy.setInputEncoding("UTF-8");
tidy.setOutputEncoding("UTF-8");
tidy.setPrintBodyOnly(true);
tidy.setXHTML(true);
tidy.setMakeClean(true);
tidy.setAsciiChars(true);
if (doc.select("pre").html().contains("</")) {
doc.select("pre").unwrap();
}
Reader reader = new StringReader(doc.html());
document = (tidy.parseDOM(reader, null));
Element element = (Element) document.getElementsByTagName("head").item(0);
element.getParentNode().removeChild(element);
NodeList elements = document.getElementsByTagName("img");
for (int i = 0; i < elements.getLength(); i++) {
String value = elements.item(i).getAttributes().getNamedItem("src").getNodeValue();
if (value != null && value.startsWith("cid:") && value.contains("#")) {
value = value.substring(value.indexOf("cid:") + 4, value.indexOf("#"));
elements.item(i).getAttributes().getNamedItem("src").setNodeValue(value);
System.out.println(value);
}
}
document.normalize();
System.out.println(getNiceLyFormattedXMLDocument(document));
} catch (Exception e) {
System.out.println(e);
}
The method to create the PDF is:
try {
org.w3c.dom.Document doc = CleanHtml.cleanNTidyHTML("b.html");
ITextRenderer renderer = new ITextRenderer();
renderer.setDocument(doc, null);
renderer.setPDFVersion(new Character('7'));
String outputFile = "test.pdf";
OutputStream os = new FileOutputStream(outputFile);
renderer.layout();
renderer.createPDF(os);
os.flush();
os.close();
} catch (Exception e) {
e.printStackTrace();
}
Using iText XMLWorker:
try {
org.w3c.dom.Document doc = CleanHtml.cleanNTidyHTML("a.html");
String k = CleanHtml.getNiceLyFormattedXMLDocument(doc);
OutputStream file = new FileOutputStream(new File("test.pdf"));
Document document = new Document();
PdfWriter writer = PdfWriter.getInstance(document, file);
document.open();
ByteArrayInputStream is = new ByteArrayInputStream(k.getBytes());
XMLWorkerHelper.getInstance().parseXHtml(writer, document, is);
document.close();
file.close();
} catch (Exception e) {
e.printStackTrace();
}
public static String getNiceLyFormattedXMLDocument(org.w3c.dom.Document doc) throws IOException, TransformerException {
TransformerFactory tf = TransformerFactory.newInstance();
Transformer transformer = tf.newTransformer();
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
// transformer.setOutputProperty(OutputKeys.METHOD, "xml");
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "4");
Writer stringWriter = new StringWriter();
StreamResult streamResult = new StreamResult(stringWriter);
transformer.transform(new DOMSource(doc), streamResult);
String result = stringWriter.toString();
return result;
}

How to create multi page table of content using itext in java

I am using the example MergeWithToc2.java to create the TOC.
I have tried a couple of things to resolve the issue but didn't succeed.
public class MergeWithToc2 {
public static final String SRC1 = "PositionPdf.pdf";
public static final String SRC2 = "concatenated1.pdf";
public static final String SRC3 = "new_page.pdf";
public static final String DEST = "test/merge_with_toc2.pdf";
public Map<String, PdfReader> filesToMerge;
public static void main(String[] args) throws IOException, DocumentException {
File file = new File(DEST);
file.getParentFile().mkdirs();
MergeWithToc2 app = new MergeWithToc2();
app.createPdf(DEST);
}
public MergeWithToc2() throws IOException {
filesToMerge = new TreeMap<String, PdfReader>();
for(int i=0 ; i <50 ; i++ ){
filesToMerge.put(i + "Hello World", new PdfReader(SRC1));
//filesToMerge.put("02 Movies / Countries", new PdfReader(SRC2));
}
}
public void createPdf(String filename) throws IOException, DocumentException {
ByteArrayOutputStream baos = new ByteArrayOutputStream();
Map<Integer, String> toc = new TreeMap<Integer, String>();
Document document = new Document();
PdfCopy copy = new PdfCopy(document, baos);
PageStamp stamp;
document.open();
int n;
int pageNo = 0;
PdfImportedPage page;
Chunk chunk;
for (Map.Entry<String, PdfReader> entry : filesToMerge.entrySet()) {
n = entry.getValue().getNumberOfPages();
toc.put(pageNo + 1, entry.getKey());
for (int i = 0; i < n; ) {
pageNo++;
page = copy.getImportedPage(entry.getValue(), ++i);
stamp = copy.createPageStamp(page);
chunk = new Chunk(String.format("Page %d", pageNo));
if (i == 1)
chunk.setLocalDestination("p" + pageNo);
ColumnText.showTextAligned(stamp.getUnderContent(),
Element.ALIGN_RIGHT, new Phrase(chunk),
559, 810, 0);
stamp.alterContents();
copy.addPage(page);
}
}
PdfReader reader = new PdfReader(SRC3);
page = copy.getImportedPage(reader, 1);
stamp = copy.createPageStamp(page);
Paragraph p;
PdfAction action;
PdfAnnotation link;
float y = 770;
ColumnText ct = new ColumnText(stamp.getOverContent());
ct.setSimpleColumn(36, 36, 559, y);
for (Map.Entry<Integer, String> entry : toc.entrySet()) {
p = new Paragraph(entry.getValue());
p.add(new Chunk(new DottedLineSeparator()));
p.add(String.valueOf(entry.getKey()));
ct.addElement(p);
ct.go();
action = PdfAction.gotoLocalPage("p" + entry.getKey(), false);
link = new PdfAnnotation(copy, 36, ct.getYLine(), 559, y, action);
stamp.addAnnotation(link);
y = ct.getYLine();
}
ct.go();
stamp.alterContents();
copy.addPage(page);
document.close();
for (PdfReader r : filesToMerge.values()) {
r.close();
}
reader.close();
reader = new PdfReader(baos.toByteArray());
n = reader.getNumberOfPages();
reader.selectPages(String.format("%d, 1-%d", n, n-1));
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(filename));
stamper.close();
}
}
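The last step of the example above moves the single TOC page from the end of the merged document to the front with selectPages(String.format("%d, 1-%d", n, n-1)). If the TOC grows to several pages, that range string has to move the last few pages instead of just one. The following is a small sketch of building such a string, assuming the TOC pages are appended at the end in reading order. The names `TocPageOrder`, `tocFirstSelection` and `tocPages` are hypothetical, and laying out the TOC itself across several pages is a separate step that is not shown here.

```java
class TocPageOrder {
    /**
     * Builds a page range string that puts the last tocPages pages first,
     * followed by all remaining pages in their original order.
     */
    static String tocFirstSelection(int totalPages, int tocPages) {
        if (tocPages < 1 || tocPages > totalPages) {
            throw new IllegalArgumentException("tocPages must be between 1 and totalPages");
        }
        int firstTocPage = totalPages - tocPages + 1;
        String tocRange = tocPages == 1
                ? String.valueOf(totalPages)
                : firstTocPage + "-" + totalPages;
        if (firstTocPage == 1) {
            return tocRange; // the document consists of TOC pages only
        }
        String contentRange = firstTocPage == 2 ? "1" : "1-" + (firstTocPage - 1);
        return tocRange + ", " + contentRange;
    }

    public static void main(String[] args) {
        System.out.println(tocFirstSelection(51, 1)); // prints "51, 1-50"
        System.out.println(tocFirstSelection(53, 3)); // prints "51-53, 1-50"
    }
}
```

With one TOC page this produces the same string as the original format call; the result would be passed to reader.selectPages in its place.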

How do I convert text files to .arff format(weka)

Please advise me on how to convert text files to the .arff format (Weka),
because I want to do data clustering on 1000 txt files.
Regards
There are some converters implemented in WEKA; just find the right format or make small changes to your data (using awk, sed...).
Here are the API pages related to this topic: http://weka.sourceforge.net/doc.stable/weka/core/converters/package-summary.html
For example, here is how to convert from CSV to ARFF:
java weka.core.converters.CSVLoader filename.csv > filename.arff
Here is the code you can use
package text.Classification;
import java.io.*;
import weka.core.*;
public class TextDirectoryToArff {
public Instances createDataset(String directoryPath) throws Exception {
FastVector atts;
FastVector attVals;
atts = new FastVector();
atts.addElement(new Attribute("contents", (FastVector) null));
String[] s = { "class1", "class2", "class3" };
attVals = new FastVector();
for (String p : s)
attVals.addElement(p);
atts.addElement(new Attribute("class", attVals));
Instances data = new Instances("MyRelation", atts, 0);
System.out.println(data);
InputStreamReader is = null;
File dir = new File(directoryPath);
String[] files = dir.list();
for (int i = 0; i < files.length; i++) {
if (files[i].endsWith(".txt")) {
double[] newInst = new double[2];
File txt = new File(directoryPath + File.separator + files[i]);
is = new InputStreamReader(new FileInputStream(txt));
StringBuffer txtStr = new StringBuffer();
int c;
while ((c = is.read()) != -1) {
txtStr.append((char) c);
}
newInst[0] = data.attribute(0).addStringValue(txtStr.toString());
int j=i%(s.length-1);
newInst[1] = attVals.indexOf(s[j]);
data.add(new Instance(1.0, newInst));
}
}
return data;
}
public static void main(String[] args) {
TextDirectoryToArff tdta = new TextDirectoryToArff();
try {
Instances dataset = tdta.createDataset("/home/asadul/Desktop/Downloads/text_example/class5");
PrintWriter fileWriter = new PrintWriter("/home/asadul/Desktop/Downloads/text_example/abc.arff", "UTF-8");
fileWriter.println(dataset);
fileWriter.close();
} catch (Exception e) {
System.err.println(e.getMessage());
e.printStackTrace();
}
}
}
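For reference, the ARFF that such a dataset serializes to is plain text: a @relation line, one @attribute line per column, and a @data section with one row per instance. The following stdlib-only sketch writes a single string attribute, one row per text, without using Weka at all. The class name `PlainArffWriter` is hypothetical, and the escaping rules (backslash-escaping quotes, backslashes, tabs and line breaks inside single-quoted values) are a simplified assumption, so it is worth loading the output into Weka to check it before relying on it.

```java
import java.util.Arrays;
import java.util.List;

class PlainArffWriter {
    /** Quotes a string value for the @data section (simplified escaping, an assumption). */
    static String quote(String value) {
        StringBuilder sb = new StringBuilder("'");
        for (char c : value.toCharArray()) {
            switch (c) {
                case '\\': sb.append("\\\\"); break;
                case '\'': sb.append("\\'"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default: sb.append(c);
            }
        }
        return sb.append('\'').toString();
    }

    /** Builds an ARFF document with a single string attribute named "contents". */
    static String toArff(String relation, List<String> texts) {
        StringBuilder sb = new StringBuilder();
        // The relation name is written as-is, so it must not contain spaces.
        sb.append("@relation ").append(relation).append("\n\n");
        sb.append("@attribute contents string\n\n");
        sb.append("@data\n");
        for (String text : texts) {
            sb.append(quote(text)).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(toArff("MyRelation",
                Arrays.asList("first document", "it's the\nsecond one")));
    }
}
```

String attributes usually have to be turned into numeric features, for example with Weka's StringToWordVector filter, before a clusterer can use them.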