How to search a string in a pdf file [duplicate] - lucene.net

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
How to index pdf, ppt, xl files in lucene (java based or python or php any of these is fine)?
I need to search for a string in a collection of files in a folder, including PDF, DOCX, and TXT formats. Is it possible to search for a string using Lucene.NET?
Please point me to some helpful references.
Thank you.

You would need to extract the text of the various files (PDF, DOCX, TXT) and insert that text into a Lucene index. Lucene doesn't have the ability to read text out of the various document formats itself.
Extract PDF text in .NET
Docx and Ifilters
Generally search for "extract {document format} text in .net" and you should find plenty of resources.
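As a rough illustration, here is a minimal Java sketch of that extract-then-index pipeline. The `extractText` helper and the in-memory map are stand-ins made up for this example: in a real setup, extraction would come from a PDF/DOCX parsing library, and the index would be built with Lucene's IndexWriter rather than a HashMap.

```java
import java.nio.charset.StandardCharsets;
import java.util.*;

public class ExtractThenSearch {

    // Stand-in for a per-format extractor; .txt is trivial, while
    // .pdf/.docx need a dedicated parsing library.
    static String extractText(String fileName, byte[] content) {
        return new String(content, StandardCharsets.UTF_8);
    }

    // Stand-in for the Lucene index: an inverted index from lowercased
    // term to the names of the files containing it.
    static Map<String, Set<String>> buildIndex(Map<String, String> docs) {
        Map<String, Set<String>> index = new HashMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (String term : doc.getValue().toLowerCase().split("\\W+")) {
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(doc.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new HashMap<>();
        docs.put("notes.txt", extractText("notes.txt",
                "Lucene builds an inverted index".getBytes(StandardCharsets.UTF_8)));
        docs.put("spec.txt", extractText("spec.txt",
                "The index maps terms to documents".getBytes(StandardCharsets.UTF_8)));
        System.out.println(buildIndex(docs).get("index")); // [notes.txt, spec.txt]
    }
}
```

The point of the structure is that search never touches the original files: everything goes through extraction first, so each new format only needs a new extractor.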

Related

Hindi content in Java and jasper report [duplicate]

How can I render hindi correctly when exporting to pdf?
(4 answers)
Closed 7 months ago.
How do I get Hindi content stored in an Informix database in Java code?
How do I display Hindi content in Jasper iReport? The stored content in the
database is showing as ??????????????????????????????????????
My code is:
String sql = "select template from ropk_sms where template_id=1307165174435759958";
List<Map<String, Object>> getRecords = jdbcTemplateObject.queryForList(sql);
List<Sirkdpe0100ActionBean> allRecords = new ArrayList<>();
for (Map<String, Object> row : getRecords) {
    Sirkdpe0100ActionBean objBean = new Sirkdpe0100ActionBean();
    Iterator<Object> itr = row.values().iterator();
    Object obj = itr.next();
    objBean.setParty_name(obj != null ? obj.toString().trim() : "");
    System.out.println("DATA" + objBean.getParty_name());
    allRecords.add(objBean);
}
Hindi content should just be a string of characters, like content in any other language.
The fact that your content shows as question marks likely means that you are using the wrong encoding, or that the font used to display the characters simply does not support Hindi.
Just make sure all your systems use Unicode and the same encoding type (e.g. UTF-8).
To have your database work correctly when sorting results, set the collation to Hindi as well (though I do not know how it is done on Informix).
Finally ensure the fonts used for rendering contain glyphs for your characters.
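To see the symptom in miniature, here is a small self-contained Java sketch (nothing project-specific) showing how encoding Devanagari text with a charset that cannot represent it turns every character into '?', while a UTF-8 round trip preserves the text:

```java
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    public static void main(String[] args) {
        String hindi = "\u0928\u092E\u0938\u094D\u0924\u0947"; // "नमस्ते"
        // Encoding Devanagari with a charset that lacks those characters
        // replaces each one with '?' -- the classic symptom.
        byte[] latin1 = hindi.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(new String(latin1, StandardCharsets.ISO_8859_1)); // ??????
        // Round-tripping through UTF-8 preserves the text.
        byte[] utf8 = hindi.getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(utf8, StandardCharsets.UTF_8).equals(hindi)); // true
    }
}
```

The same thing can happen at every hop: the JDBC connection, the database column encoding, and the font used in the Jasper report each have to support the characters.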

Why aren't word processing program documents stored as plaintext?

Whenever an MS word (or LibreOffice or other word processor) document is opened in its respective program, the words appear normally on the page, but when the document is opened in a text editor, most of it is Unicode gibberish.
I can understand why the document might have some parts that aren't legible, like bullet points or metadata, but why isn't at least some of the content stored as plaintext? Does every letter get encoded?
Microsoft Word's current format, .docx, is XML (plain text) compressed with ZIP. You can unzip the file by renaming .docx to .zip and then open the contained XML files in a text editor. So the content is stored partially as plain text, just compressed.
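To illustrate, here is a short Java sketch that reads the main XML part straight out of a .docx without renaming it, using only the standard library's ZIP classes (the file path is whatever document you have at hand):

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class DocxPeek {

    // A .docx is a ZIP archive; the document body lives in word/document.xml.
    static String mainDocumentXml(String docxPath) throws Exception {
        try (ZipFile zip = new ZipFile(docxPath)) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            return new String(zip.getInputStream(entry).readAllBytes(),
                    StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws Exception {
        // Prints the raw XML; the visible text sits inside <w:t> elements.
        System.out.println(mainDocumentXml(args[0]));
    }
}
```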
I suspect it is partly a branding thing. If you want, you can export the document to a text file:
go to File > Export > Change File Type > Plain Text (*.txt), and you can export the document there.

convert stream file of iText PDF not opening MS word

Our project has a requirement to generate the end report both as a PDF and as an MS Word document. We are using iTextSharp to dynamically generate tables and rows in the report. Finally, we upload the file to the server as PDF and MS Word: both are converted to a byte array/stream and saved as a PDF and as an MS Word document. The uploaded PDF works as expected, but the MS Word file gives an error and does not open (screenshot attached).
iTextSharp doesn't produce MS Word documents, so this isn't an actual iText question. When I look at your screen shot, I see that you are trying to import a PDF file into Word. Since Word can't interpret PDF syntax, it shows you the syntax of the PDF file:
%PDF-1.4
%âãÏÓ
1 0 obj
<</Type/Font...
I think the premise of your question is wrong. You are not using iTextSharp to create a PDF file and an MS Word file; you are using iTextSharp to create a PDF file only, not an MS Word file.
There is no such thing as "Save a PDF as MS Word file" in iTextSharp, and it will be extremely difficult to find another tool that can convert a PDF document to a Word document in an acceptable way. (There are such tools, but the quality is suboptimal for PDFs that weren't made to be converted to another format.)
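If the upload code has to handle both kinds of byte stream, one defensive option (my own suggestion, not part of iTextSharp) is to check the magic bytes before choosing a file name, so a PDF stream is never saved under a Word extension:

```java
import java.nio.charset.StandardCharsets;

public class FormatSniffer {

    // Classify a byte stream by its signature instead of trusting
    // the extension it is about to be saved under.
    static String sniff(byte[] bytes) {
        // Every PDF starts with "%PDF-" followed by the version number.
        if (bytes.length >= 5
                && new String(bytes, 0, 5, StandardCharsets.US_ASCII).equals("%PDF-")) {
            return "pdf";
        }
        // .docx (and other OOXML files) are ZIP archives: "PK\u0003\u0004".
        if (bytes.length >= 4 && bytes[0] == 'P' && bytes[1] == 'K'
                && bytes[2] == 3 && bytes[3] == 4) {
            return "docx-or-zip";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        System.out.println(sniff("%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII))); // pdf
    }
}
```

This does not convert anything; it only prevents the situation in the screenshot, where Word is handed raw PDF syntax.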

Starting format for text file to be converted to other formats

I need to write a document using images, texts, hyperlinks... And then convert it to PDF and DOC (but in the future it can be converted to more file formats).
What's the best "starting format" for this document?
Doc or Docx might be the best file format for creating a document containing images, text, hyperlinks, and other elements. Once created, it's easy to convert files in .doc/.docx format into other file formats, such as image, PDF, or HTML, using Open XML or a commercial library like Spire.Doc.

Comparing 2 word Documents in Office 2003

We have a Word document (Office 2003) containing bookmarks (the template). We generate the document via the application under test (the final document, also Office 2003): based on the data we enter in the application under test, the bookmarks get filled and the document gets printed.
So now I need to compare the template with the final document.
What is the best approach to compare the two documents?
Note: I need to compare the margins, fonts, and all other formatting as well.
Initial analysis: I converted the 2003 template and the final document to the 2010 Word format and changed the file type to .zip. When I extracted it, I got numerous XML files. I compared the XML on both sides, but that is not adding value for this kind of test, because even though there is a discrepancy, it is really difficult to map the contents of the XML to the actual contents of the Word document.
Winmerge has a plugin that handles Word 2003/2007 files.
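Since diffing all of the extracted XML produces too much noise, one way to narrow it down (a sketch of my own; the file paths are hypothetical) is to first list which parts of the two .docx archives differ at all, using the CRC-32 checksums stored in the ZIP directory, and only then diff those parts:

```java
import java.util.*;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class DocxPartDiff {

    // Returns the names of parts that are missing from, or differ in,
    // the second archive, judged by the CRC-32 in the ZIP directory.
    static List<String> differingParts(String templatePath, String finalPath)
            throws Exception {
        try (ZipFile a = new ZipFile(templatePath);
             ZipFile b = new ZipFile(finalPath)) {
            List<String> diffs = new ArrayList<>();
            for (ZipEntry ea : Collections.list(a.entries())) {
                ZipEntry eb = b.getEntry(ea.getName());
                if (eb == null || ea.getCrc() != eb.getCrc()) {
                    diffs.add(ea.getName());
                }
            }
            return diffs;
        }
    }
}
```

Formatting such as margins and fonts lives mostly in word/styles.xml and in the section properties inside word/document.xml, so a diff confined to the parts this reports is usually easier to map back to the visible document.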