I want to parse a PDF that contains only text, no images, and find specific pieces of text in it. For example, I want to search for the string "Name:" and read the characters that follow the colon.
I can already open a PDF, get the number of pages, and loop over them. The problem comes when I want to use functions like CGPDFDictionaryGetStream or CGPDFStreamCopyData, because they take pointers. I have not found much information online for Swift programmers.
Maybe the easiest way would be to dump all the content into an NSString. Then I could do the rest myself.
Here is my code:
// Get existing PDF reference
let pdf = CGPDFDocumentCreateWithURL(NSURL(fileURLWithPath: path))
let pageCount = CGPDFDocumentGetNumberOfPages(pdf)
for index in 1...pageCount {
    let myPage = CGPDFDocumentGetPage(pdf, index)
    // Somehow search for the string "Name:" to get what's written next
}
You can use PDFKit to do this. On macOS it is part of the Quartz framework; on iOS (11 and later) it is available as its own framework, so import PDFKit there instead of Quartz. It is also pretty fast: I was able to search through a PDF with over 15,000 characters in just 0.07 s.
Here is an example:
import Quartz
let pdf = PDFDocument(url: URL(fileURLWithPath: "/Users/...some path.../test.pdf"))
guard let contents = pdf?.string else {
    print("could not get string from pdf: \(String(describing: pdf))")
    exit(1)
}
let footNote = contents.components(separatedBy: "FOOT NOTE: ")[1] // get all the text after the first foot note
print(footNote.components(separatedBy: "\n")[0]) // print the first line of that text
// Output: "The operating system being written in C resulted in a more portable software."
You can still access most (if not all) of the properties you had before, such as pdf.pageCount for the number of pages and pdf.page(at: <Int>) to get a specific page. Note that page(at:) is zero-based, whereas CGPDFDocumentGetPage counts pages from 1.
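Once the whole document is available as one string, finding the text after a label such as "Name:" is ordinary string handling. Here is a minimal, language-neutral sketch in Python rather than Swift; the helper name text_after and the sample text are made up for illustration. Unlike indexing the result of a split with [1], it returns nothing instead of crashing when the label is missing.

```python
def text_after(label, contents):
    """Return the rest of the line following the first `label`, or None."""
    start = contents.find(label)
    if start == -1:
        return None  # label not present anywhere in the text
    rest = contents[start + len(label):]
    if not rest:
        return ""
    # Keep only the text up to the end of that line.
    return rest.splitlines()[0].strip()


# Hypothetical text, standing in for what a PDF text extraction returned.
sample = "Invoice 42\nName: Jane Doe\nCity: Oslo\n"
print(text_after("Name:", sample))
```

The same shape carries over to Swift with range(of:) on the extracted string.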
This is a pretty intensive task. There are libraries like PDFKitten, which are not maintained anymore. Here is a port of PDFKitten to Swift that I wrote, with some modifications to the way the string searching and content indexing is done, as well as support for TrueType fonts.
https://github.com/SimpleApp/PDFParser
[Disclaimer: I am the library's author.]
[Second disclaimer: this library is 100% MIT open source. It has nothing to do with the company; it's not an ad or even a product. I'm posting this to help people, and maybe to grow a community around it, because this is a very common requirement and nothing free works well enough.]
EDIT: the reason it's a pretty intensive task (not to mention all the character encoding issues) is that the PDF format has no notion of a "line of text" or even a "word". All it has is character printing instructions. This means that if you want to find a "word", you have to recompute the frame of every block of characters, using font information, and find the ones that can be coalesced into a single word.
That's the reason why you won't find many libraries offering this kind of feature, and why even some big projects sometimes fail to provide correct copy/paste or text search.
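To make the coalescing step concrete, here is a heavily simplified sketch in Python. It assumes the glyph boxes have already been computed from the font metrics and that all glyphs sit on one baseline. The Glyph type, the max_gap threshold, and the sample coordinates are all hypothetical; this is not how PDFKitten or the port above is actually implemented.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge of the glyph box
    width: float  # advance width taken from the font metrics


def coalesce_words(glyphs, max_gap=1.0):
    """Group glyphs on one baseline into words by the horizontal gap between them."""
    words = []
    current = ""
    prev_end = None
    for g in sorted(glyphs, key=lambda g: g.x):
        # A gap wider than max_gap is treated as a word boundary.
        if prev_end is not None and g.x - prev_end > max_gap:
            words.append(current)
            current = ""
        current += g.char
        prev_end = g.x + g.width
    if current:
        words.append(current)
    return words


# Made-up glyph boxes: "Name" drawn tightly, then a 5-unit gap before "Jo".
glyphs = [Glyph("N", 0, 5), Glyph("a", 5, 4), Glyph("m", 9, 6), Glyph("e", 15, 4),
          Glyph("J", 24, 4), Glyph("o", 28, 4)]
print(coalesce_words(glyphs))
```

A real implementation also has to handle multiple lines, rotated text, kerning adjustments, and glyphs that are emitted out of reading order.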
Is there a method I can use in a program to access and modify the header text of MP3 files ripped from music CDs?
There is a tool for this in the MusicMatch Jukebox music player, but with 2000 files ripped from 50 CDs the job is quite formidable, and the "supertagging" tool is cumbersome to use.
What I have in mind is more like an Excel-style grid that displays just three fields: Artist name, Song title, and Album name.
The Artist field would have the option of repeating the top entry down for all the song titles; Album would always be repeated for all song titles.
Song titles will of course have to be entered for each item.
In the ripped files, every file has the fields track#, artist, and album, plus some of less importance.
Just let me know if I am at the wrong forum for my search. I just don't know anywhere else that I might go.
For programming I might use Visual FoxPro and/or assembler. I haven't used C since the early 1980s.
If you really want to develop it yourself, at least use an ID3 library, don't write the functionality yourself!
A good one is at http://id3lib.sourceforge.net/. I haven't tried it recently, but I'm sure you can call it from VFP somehow.
If you just want something that is better for tagging a shed-load of files, look at MediaMonkey.
If you want to work solely in VFP, then you should use the VFP low-level file functions:
FOPEN()
FCHSIZE( )
FCLOSE( )
FCREATE( )
FEOF( )
FFLUSH( )
FGETS( )
FPUTS( )
FREAD( )
FSEEK( )
FWRITE( )
These are pretty well documented within the VFP Help system and there are numerous examples on the web.
With them you can get the 'raw' data from the MP3 file, identify what you are looking for, change it, and write it back again.
The downside is that specific 'fields' (e.g. Artist name, Song title and Album name, etc.) will not be readily recognized. You would need to write code to identify these and then identify where the values reside.
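As an illustration of what identifying the fields involves, here is a minimal sketch of reading the simplest tag format, ID3v1, which occupies the last 128 bytes of a file: the marker "TAG", then 30 bytes each for title, artist, and album, and 4 bytes for the year. It is written in Python rather than VFP, the function name read_id3v1 and the sample bytes are made up, and it assumes the files carry an ID3v1 tag at all. Many ripped files use ID3v2 instead, which sits at the start of the file and is considerably more involved.

```python
import struct


def read_id3v1(data):
    """Parse an ID3v1 tag from the raw bytes of an MP3 file, or return None."""
    if len(data) < 128 or data[-128:-125] != b"TAG":
        return None  # no ID3v1 tag at the end of the file
    # After the 3-byte marker: title, artist, album (30 bytes each), year (4 bytes).
    title, artist, album, year = struct.unpack("30s30s30s4s", data[-125:-31])

    def clean(raw):
        # Fields are padded with NUL bytes or spaces.
        return raw.split(b"\x00")[0].decode("latin-1").strip()

    return {"title": clean(title), "artist": clean(artist),
            "album": clean(album), "year": clean(year)}


# Fake file contents: some audio bytes followed by a hand-built tag.
fake_mp3 = (b"\xff\xfb" * 8 + b"TAG"
            + b"Some Song".ljust(30, b"\x00")
            + b"Some Artist".ljust(30, b"\x00")
            + b"Some Album".ljust(30, b"\x00")
            + b"1999"
            + b"".ljust(30, b"\x00")   # comment
            + b"\x0c")                 # genre byte
print(read_id3v1(fake_mp3))
```

For a real file the bytes would come from reading the file in binary mode; writing a changed tag back means overwriting the same fixed-width fields in place.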
Good Luck
I'm importing a text file that contains a list of status codes. Each code refers to the status of an object; for example, TOMTOM100 means "Delivery announced". Once I import the text file, the statuses are shown in 0 to 5 labels (depending on how many status updates are available).
At first I wanted to do this with if statements, but I quickly realized that it would be too much.
if ([trackTraceStatusone.text isEqualToString:@"TOMTOM100"])
{
    trackTraceStatusone.text = @"Delivery announced.";
}
Is there a way to create some kind of translator that automatically translates the status in a readable format?
TOMTOM100 > Delivery announced
TOMTOM101 > Delivery Scanned
and so on.
Sounds like a job for NSLocalizedStringFromTable() or the corresponding NSBundle method -localizedStringForKey:value:table:. This will let you load the string from a .strings file in your bundle, which will look something like this:
"TOMTOM100" = "Delivery Announced";
"TOMTOM101" = "Delivery Scanned";
This will also make it easy to provide different strings for different languages. For more information, see the String Resources section of the Resource Programming Guide.
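Stripped of the localization machinery, the idea is a lookup table from status code to readable text, with a fallback for codes the table does not know. Here is a minimal sketch in Python; the names STATUS_TEXT and readable_status are made up for illustration, and a .strings file plays the role of the dictionary in the approach above.

```python
# Table of known status codes; in the approach above this lives in a .strings file.
STATUS_TEXT = {
    "TOMTOM100": "Delivery announced",
    "TOMTOM101": "Delivery scanned",
}


def readable_status(code):
    """Translate a raw status code, falling back to a marker for unknown codes."""
    code = code.strip()
    return STATUS_TEXT.get(code, f"Unknown status ({code})")


for raw in ["TOMTOM100", "TOMTOM101 ", "TOMTOM999"]:
    print(readable_status(raw))
```

Adding a new status then means adding one table entry rather than another if branch.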
I am looking for a way to get the contents of the file itself, in its text format, dumped. That is, I don't want a dictionary object or some sort of extraction-strategy option; I just want the same text document that iTextSharp uses to parse, the WHOLE thing, as a string or StringBuilder.
I have not yet found a way to do this with any tool whatsoever. My problem is that I am trying to read a dynamic PDF into a C# application, and we all know that those darn dynamic PDFs can't be parsed by iTextSharp (AcroForm and AcroFields always come up empty). So I figured that if I can get the actual text dump of the entire file, I can see what it looks like and parse it myself for this specific task (e.g. make a class for each document I know I can receive, and build a map there based on what I see).
If anyone can help me do that, or even better, find a way in C# to extract the XML source of the PDF (like clicking the XML Source tab in LiveCycle), it would be greatly appreciated.
Thanks!
Matt
If you are looking for the actual operators and commands of each page in the raw text format, try the following code:
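// Requires: using System.IO; using iTextSharp.text.pdf;
var reader = new PdfReader("test.pdf");
int intPageNum = reader.NumberOfPages;
for (int i = 1; i <= intPageNum; i++)
{
    // Raw content stream of page i: operators and operands, not decoded text
    byte[] contentBytes = reader.GetPageContent(i);
    File.WriteAllBytes("page-" + i + ".txt", contentBytes);
}
reader.Close();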
I am looking for a way to actually get the contents of the file
itself, in its text format, dumped. E.g.: i don't want a dictionary
object, i don't want some sort of extractionstrategy option, i just
want the same text document that itextsharp uses to parse... the WHOLE
thing as a string or stringbuilder...
Unfortunately, the data that iTextSharp parses is not yet text: the operators in that data are given in a textual format, but the actual glyphs may be given in a completely arbitrary ad-hoc encoding. That being said, a standard encoding is often used because it is the simplest solution for the components involved. You cannot count on that in general, though. The answer by VahidN shows you how to access the starting points for that content; quite often, though, the page content data he extracts only contains references to resources stored in other objects.
my problem is that i am trying to read a dynamic PDF into a C#
application... and we all know that those darn dynamic PDFs can't be
parsed by iTextSharp (AcroForm and AcroFields always comes up empty),
This sounds as if you actually have a completely different task at hand. Dynamic forms and their contents are not part of the page content but instead stored in a separate XML Forms Architecture stream.
Chapter 8 of iText in Action, 2nd edition, gives you some information on how to access the XFA stream data; for a first glimpse, look at the sample XfaMovie.cs.
You might also want to look at the iText XML Worker project for easier manipulation of XFA streams.
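If you just want to dump the text, try this:
// Requires: using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
PdfReader reader = new PdfReader(pdfFileName);
String text = "";
int nPages = reader.NumberOfPages;
for (int i = 0; i < nPages; i++)
{
    // iTextSharp page numbers start at 1
    text += PdfTextExtractor.GetTextFromPage(reader, i + 1);
}
reader.Close();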
I want to retrieve the text from a PDF file using iTextSharp. However, I wasn't able to use PdfTextExtractor the way it is used in iText, the Java library that iTextSharp is ported from. I need a readPDFOffline method that returns the content of the file. The pseudocode below shows what I want.
private string readPDFOffline(string fileUri)
    read the PDF
    retrieve the text content of this PDF   (*)
    save the content into string contentOfflineFile
    return contentOfflineFile
I need help with the step marked (*).
PdfTextExtractor is present in the most recent releases of iTextSharp, available here.
Retrieving text in PDF is not easy. Not impossible, but there are times when the only thing that will work is OCR. For all other cases, PdfTextExtractor should work. Cases of it not working are considered bugs and should be reported as such.
Be aware that there are several cases where what looks like valid text is not extractable:
Text with no encoding... just glyph indexes. OCR time.
"Text" that is just raw paths. Horribly inefficient, and time for more OCR.
"Text" that is pixels in a bitmap. OCR once more.
OCR: Optical Character Recognition. There's even a reasonably good one for free available on Google Code, though I don't recall the name off the top of my head.
I am writing an iPhone app that has to pull raw HTML off a website and grab the URL and the displayed text of each link.
For example, from the link <a href="www.google.com">Click here to go to google</a>
it would grab
url = www.google.com
text = Click Here to go to google
I'm using the RegexKitLite library, but I'm in no way an expert on regular expressions. I have tried several things to get this working.
I want to use the following code
NSString *searchString = @"$10.23, $1024.42, $3099";
NSString *regexString = @"\\$((\\d+)(?:\\.(\\d+)|\\.?))";
NSArray *capturesArray = NULL;
capturesArray = [searchString arrayOfCaptureComponentsMatchedByRegex:regexString];
So my question is: can someone tell me what the searchString would be to parse HTML links, or point me to a clear tutorial on how RegexKitLite works? I have tried reading the documentation at http://regexkit.sourceforge.net/RegexKitLite/ and I don't understand it.
Thanks in advance,
Zen_silence
In short, don't do that. Regular expressions are a horrible way to parse HTML. HTML documents are highly structured with a hierarchy of tags whose contents may span lines without said lines appearing in the rendered form.
Assuming well structured HTML, you can use an XML parser.
In particular, the iPhone offers the NSXMLParser and some good examples of usage therein.
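To show the shape of the parser-based approach, here is a minimal sketch using Python's standard html.parser module as a stand-in for NSXMLParser. Both are event-driven: you get callbacks for start tags, text, and end tags. The class name LinkCollector, the helper extract_links, and the sample page are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (url, text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []    # finished (url, text) pairs
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse line breaks and runs of spaces inside the link text.
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def extract_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# The link text deliberately spans two lines.
page = '<p>See <a href="www.google.com">Click here to go to\n google</a> now.</p>'
print(extract_links(page))
```

Because the parser tracks tags rather than lines, link text that spans several lines is handled without special cases.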
searchString would be the whole raw HTML text, and regexString should be more like:
NSString *regexString = @"href=\"(.*)\">(.*)<";
Then you would use capturing matches to pull out match1 and match2, repeating the match through the HTML text using the Range option for searching so that you would skip past what you had already searched...
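One caveat with a pattern of this shape: (.*) is greedy, so when two links sit on the same line, the first group runs through to the last link. Here is a small sketch of the repeated-match idea using Python's re module instead of RegexKitLite. The non-greedy (.*?) variant is my substitution, not part of the pattern suggested above, and the sample HTML is made up.

```python
import re

# Non-greedy groups stop at the first closing quote and the first "<".
LINK_RE = re.compile(r'href="(.*?)">(.*?)<')
# The greedy form, for comparison.
GREEDY_RE = re.compile(r'href="(.*)">(.*)<')

html = '<a href="a.html">First</a> <a href="b.html">Second</a>'

# findall repeats the match through the text, skipping what was already matched.
print(LINK_RE.findall(html))
print(GREEDY_RE.findall(html))
```

The greedy pattern returns a single match whose URL group contains everything between the first and the last link, which is why the non-greedy form is used for the repeated search.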
I don't know what you are trying to do with searchString and the numbers though.
In case anyone else has this same question, the regex string to match an HTML link is
NSString *regexString = @"<a href=([^>]*)>([^>]*) - ";
The O'Reilly book "Mastering Regular Expressions" helped me figure this out really quickly. I highly recommend reading it if you are trying to use regular expressions.