I'm encountering some weird encoding issues. I need to parse an HTML document from the web, and I'm using the 'Content-Type' charset meta-data to determine the encoding type.
One page has been giving me trouble: it is encoded as 'Shift_JIS' (Japanese), and the parser result contains some garbled characters.
When I parse the same document as UTF-8, the characters that were garbled before are parsed correctly, but everything else is now garbled.
I'm assuming the document contains text in two different encoding types.
Is there any way I could parse this document correctly?
Also, I don't know how, but all the browsers seem to handle the issue well and present the page nicely.
Would really appreciate any thoughts on this.
The page that I need to parse : http://ao.recruit.co.jp/form.html
First of all, what the browser sees is:
莨夂、セ讎りヲ
What is shown in the rendered HTML is not the same because of the CSS text-indent: -9999px and the background image laid over it. But it's there. Removing them will show the text the browser is seeing.
Out of the box, decoding as Shift_JIS should give you 莨夂、セ讎りヲ?, but if you want the same result as in a browser, you should use a custom CharsetDecoder with IGNORE:
// Requires java.net.*, java.io.*, java.nio.charset.* and org.apache.commons.io.IOUtils
URL url = new URL("http://ao.recruit.co.jp/form.html");
BufferedInputStream bis = new BufferedInputStream(url.openStream());
CharsetDecoder decoder = Charset.forName("Shift_JIS").newDecoder();
decoder.onMalformedInput(CodingErrorAction.IGNORE);      // drop bytes that are not valid Shift_JIS
decoder.onUnmappableCharacter(CodingErrorAction.IGNORE); // drop bytes with no Unicode mapping
Reader inputReader = new InputStreamReader(bis, decoder);
String result = IOUtils.toString(inputReader);            // Apache Commons IO
System.out.print(result);
This will give you same result as with browsers. Of course, it won't parse the text from the image file.
I want to parse a PDF that has no images, only text. I'm trying to find pieces of text. For example, to search for the string "Name:" and be able to read the characters after the ":".
I'm already able to open a PDF, get the number of pages, and loop over them. The problem is when I want to use functions like CGPDFDictionaryGetStream or CGPDFStreamCopyData, because they use pointers. I have not found much info on the internet for Swift programmers.
Maybe the easiest way would be to parse all the content to an NSString. Then I could do the rest.
Here is my code:
// Get existing Pdf reference
let pdf = CGPDFDocumentCreateWithURL(NSURL(fileURLWithPath: path))
let pageCount = CGPDFDocumentGetNumberOfPages(pdf);
for index in 1...pageCount {
    let myPage = CGPDFDocumentGetPage(pdf, index)
    // Search somehow for the string "Name:" to get what's written next
}
You can use PDFKit to do this. It is part of the Quartz framework and is available on both iOS and macOS. It is also pretty fast: I was able to search through a PDF with over 15000 characters in just 0.07s.
Here is an example:
import Quartz
let pdf = PDFDocument(url: URL(fileURLWithPath: "/Users/...some path.../test.pdf"))
guard let contents = pdf?.string else {
    print("could not get string from pdf: \(String(describing: pdf))")
    exit(1)
}
let footNote = contents.components(separatedBy: "FOOT NOTE: ")[1] // get all the text after the first foot note
print(footNote.components(separatedBy: "\n")[0]) // print the first line of that text
// Output: "The operating system being written in C resulted in a more portable software."
You can also still access most (if not all) of the properties you had before, such as pdf.pageCount for the number of pages and pdf.page(at: <Int>) to get a specific page.
This is a pretty intensive task. There are libs like PDFKitten which are not maintained anymore. Here is a port of PDFKitten to Swift that I did, with some modifications to the way the string searching / content indexing is done, as well as support for TrueType fonts.
https://github.com/SimpleApp/PDFParser
[disclaimer: lib author]
[second disclaimer: this lib is 100% MIT open-sourced. The library has nothing to do with the company; it's not an ad or even a product. I'm posting this comment to help people, and maybe grow a community around it, because it's a very common requirement and nothing free works well enough]
EDIT: the reason it's a pretty intensive task (not to mention all the character encoding issues) is that the PDF format doesn't have the notion of a "line of text" or even a "word". All it has are character-printing instructions. This means that if you want to find a "word", you have to recompute the frame of every block of characters, using the font information, and find the ones that can be coalesced into a single word.
That's why you won't find many libraries offering these kinds of features, and why even some big projects sometimes fail to provide correct copy/paste or text-search behaviour.
I am looking for a way to actually get the contents of the file itself, in its text format, dumped. E.g.: I don't want a dictionary object, I don't want some sort of ExtractionStrategy option, I just want the same text document that iTextSharp uses to parse... the WHOLE thing as a string or StringBuilder...
I have not yet found a way to do this using any tools whatsoever... my problem is that I am trying to read a dynamic PDF into a C# application... and we all know that those darn dynamic PDFs can't be parsed by iTextSharp (AcroForm and AcroFields always come up empty), so I figured that if I can get the actual text dump of the entire file, I can see what it looks like and parse it myself for this specific task (e.g.: make a class for each document I know I can receive, and build a map there based on what I see).
If anyone can help me do that, or even better, find a way, in C#, to extract the XML Source for the PDF (kinda like clicking the XML Source tab in LiveCycle) instead, it would be greatly appreciated.
Thanks!
Matt
If you are looking for the actual operators and commands of each page in the raw text format, try the following code:
// Requires iTextSharp.text.pdf and System.IO
var reader = new PdfReader("test.pdf");
int intPageNum = reader.NumberOfPages;
for (int i = 1; i <= intPageNum; i++)
{
    // Raw page content stream: PDF operators and operands, not decoded text
    byte[] contentBytes = reader.GetPageContent(i);
    File.WriteAllBytes("page-" + i + ".txt", contentBytes);
}
reader.Close();
I am looking for a way to actually get the contents of the file itself, in its text format, dumped. E.g.: I don't want a dictionary object, I don't want some sort of ExtractionStrategy option, I just want the same text document that iTextSharp uses to parse... the WHOLE thing as a string or StringBuilder...
Unfortunately the data that iTextSharp parses is not yet text; the operators in that data are given in a textual format, but the actual glyphs may be given in a completely arbitrary, ad-hoc encoding. That being said, often some standard encoding is used, as it is the simplest solution for the components in use. You cannot count on that in general, though. The answer by VahidN shows you how to access the starting points for that content; quite often, though, the page content data he extracts only contains references to resources which are stored in different objects.
my problem is that I am trying to read a dynamic PDF into a C# application... and we all know that those darn dynamic PDFs can't be parsed by iTextSharp (AcroForm and AcroFields always come up empty),
This sounds as if you actually have a completely different task at hand. Dynamic forms and their contents are not part of the page content but are instead stored in a separate XML Forms Architecture (XFA) stream.
iText in Action, 2nd edition, in chapter 8 gives you some information on how to access the XFA stream data; for a first glimpse, look at the sample XfaMovie.cs.
You might also want to look at the iText XML Worker project for easier manipulation of XFA streams.
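As a rough sketch of that (assuming iTextSharp 5.x, where the XfaForm class with its XfaPresent and DomDocument members is available; the file names are placeholders), dumping the raw XFA XML, similar to what the XML Source tab in LiveCycle shows, could look like this:
using System;
using iTextSharp.text.pdf;

class XfaDump
{
    static void Main()
    {
        // "dynamic-form.pdf" and "xfa-dump.xml" are placeholder file names
        var reader = new PdfReader("dynamic-form.pdf");
        try
        {
            var xfa = new XfaForm(reader);
            if (!xfa.XfaPresent)
            {
                Console.WriteLine("This document has no XFA stream.");
                return;
            }
            // DomDocument holds the whole XFA package as a System.Xml.XmlDocument
            xfa.DomDocument.Save("xfa-dump.xml");
        }
        finally
        {
            reader.Close();
        }
    }
}
If XfaPresent comes back false, the form is a plain AcroForm after all and the usual AcroFields approach should apply.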
If you just want to dump the text, try this:
// Requires iTextSharp.text.pdf and iTextSharp.text.pdf.parser
PdfReader reader = new PdfReader(pdfFileName);
String text = "";
int nPages = reader.NumberOfPages;
for (int i = 0; i < nPages; i++)
{
    text += PdfTextExtractor.GetTextFromPage(reader, i + 1);  // pages are 1-based
}
reader.Close();
I have code that handles displaying a BLOB from a local Oracle database. I store both JPG and PDF files as BLOBs. I can view the JPG files, but not the PDFs. I have already tried changing
$self->content_type('image/jpg')
to
$self->content_type('application/pdf').
And the BLOB does have data; I checked the length and it is 184546.
All I get when I click the link for the PDF file is a blank page with the title GETIMAGPAGE(application/pdf).
Any help or pointers would be greatly appreciated.
Also, how can we have the content_type handle two different MIME types? For example, in my case both image and PDF, depending on what we get?
File::MMagic can recognize the type of data using magic numbers.
use File::MMagic;

my $magic = File::MMagic->new;
$self->content($blob);
# Detect the MIME type from the blob's leading bytes and use it as the response content type
$self->content_type($magic->checktype_contents($blob));
If you don't want to require a native/plugin PDF reader, perhaps FlexPaper might fit your needs.
I want to retrieve the text from a PDF file using iTextSharp. However, I wasn't able to use PdfTextExtractor as in the Java library (iText). I need the readPDFOffline method to return the content of the file. I give the pseudocode below so you can see what I want.
private string readPDFOffline(string fileUri);
    read PDF;
    retrieve Text Content of This Pdf;*
    save content into string contentOfflineFile;
    return contentOfflineFile;
I would like help with the * part of the code.
PdfTextExtractor is present in the most recent releases of iTextSharp.
Retrieving text from a PDF is not easy. Not impossible, but there are times when the only thing that will work is OCR. For all other cases, PdfTextExtractor should work. Cases where it does not work are considered bugs and should be reported as such.
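As a minimal sketch of the readPDFOffline method asked about above (assuming iTextSharp 5.x, where PdfTextExtractor lives in the iTextSharp.text.pdf.parser namespace and uses a simple text-extraction strategy by default; the method goes inside your own class):
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

private string readPDFOffline(string fileUri)
{
    var reader = new PdfReader(fileUri);
    var contentOfflineFile = new StringBuilder();
    try
    {
        // Page numbers are 1-based in iTextSharp
        for (int page = 1; page <= reader.NumberOfPages; page++)
        {
            contentOfflineFile.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page));
        }
    }
    finally
    {
        reader.Close();
    }
    return contentOfflineFile.ToString();
}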
Be aware that there are several cases where what looks like valid text is not extractable:
Text with no encoding... just glyph indexes. OCR time.
"Text" that is just raw paths. Horribly inefficient, and time for more OCR.
"Text" that is pixels in a bitmap. OCR once more.
OCR: Optical Character Recognition. There's even a reasonably good one for free available on Google Code, though I don't recall the name off the top of my head.
I have an app, coded with EJB3, JSF and Maven, which runs on JBoss 4.2.2.GA.
The problem I have been facing for two days is that I cannot convert non-English characters that are added to the URL at runtime. For instance, there is a search text box and a button. When a user enters a word containing non-English characters and pushes the button, it is added to the URL with bad characters like %56 or &347, etc.
Is there any way to achieve what I am trying to do here? BTW, is there also any way to get around this problem through JBoss-side configuration rather than the application side (filters, context.xml, etc.)?
Any help would be appreciated
Thanks a lot,
Baris
--
EDIT: I have solved this issue by using URLEncoder. When I pass the variable to the action method, I use URLEncoder to encode it to the right charset.
Example:
Take the parameter from the URL:
String someString = ServletActionContext.getRequest().getParameter("someStringFromURL");
Encode the string:
String encoded = URLEncoder.encode(someString, "ISO-8859-9");
Find the appropriate Connector element in your Tomcat server.xml (deploy/jboss-web.deployer/server.xml for recent JBoss versions) and add the attribute URIEncoding with a value of UTF-8.
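For example (a sketch only; the port and other attributes below are illustrative, so keep whatever your existing Connector already declares and just add URIEncoding):
<Connector port="8080" maxThreads="250"
           connectionTimeout="20000"
           redirectPort="8443"
           URIEncoding="UTF-8" />
This makes the web container decode the request URI and query string as UTF-8 instead of the default ISO-8859-1, so request.getParameter() should already return the right characters.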