Does anyone know what options are available to you in the following method?
// pdf:PDFDocument
pdf.dataRepresentationWithOptions(options: [NSObject : AnyObject])
I'm trying to take a PDF, open it up, search for a specific tag per the PDF spec, and insert additional tags immediately after that tag.
After editing with PDFDocument methods, I was hoping to convert it to a string so I could search the whole file, so I thought I would convert it to a data representation first and then to a string. But I suppose I could also save it to a file and reopen it from there, so I don't have to convert from the PDFDocument object directly.
I am trying to add a thumbnail to a JPEG picture using libexif.
For now I'm borrowing the code from exif (the command line tool that is shipped by the libexif team).
However, I noticed the XMP tags get deleted from the metadata. There is an old bug report here.
I tried to see how to achieve this anyway with libexif, but I don't really understand how to get the XMP from the input file and put it in the output file. I just want to copy all XMP data; I don't need to extract anything from it.
I saw there is a tag EXIF_TAG_XML_PACKET in exif-tag.h but couldn't figure out how to read/write this tag.
A related solution is in this SO answer, but it looks complicated. I'm not familiar with coding in C.
Is it actually possible to keep all XMP when using only libexif API? Have things changed in recent years on that? How would you write this in code?
Thanks
I believe it should be somewhat straightforward. XMP fields are described in the ISO/Adobe standard. Regular Kotlin/Java/Android file I/O and some string manipulation should be all that is required.
I would start out by becoming intimately familiar with ISO 16684-1:2019. Then, write a method for your JPEG file class that grabs all the XMP fields. Store those fields in a temp file (to prevent difficult-to-recover data loss in the event of your code or libexif crashing). Hand the file off to libexif. Generate the thumbnail. Finally, when that's done, you can restore the XMP fields. If the thumbnail is stored in an XMP field as well (and it sounds like it is), it may be easier to concatenate that field with the ones already grabbed, updating the temp file so that it contains EVERY XMP field, before adding all of the XMP fields back to the JPEG.
Unfortunately, I do not currently have the time to read a 50-page ISO standard, synthesize the information, and then write the code to implement the solution. Here's a link to the standard, at least, to get you started.
https://www.iso.org/obp/ui/#iso:std:iso:16684:-1:ed-2:v1:en
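The grab-and-restore steps can be sketched without libexif at all. The sketch below is in Python rather than C, and it rests on assumptions: that the XMP packet sits in its own APP1 segment whose payload begins with the namespace string "http://ns.adobe.com/xap/1.0/" followed by a NUL byte, that no fill bytes precede the markers, and that extended (multi-segment) XMP is not in use. The function names are made up for illustration.

```python
import struct

SOI = b"\xff\xd8"
XMP_HEADER = b"http://ns.adobe.com/xap/1.0/\x00"


def iter_segments(jpeg):
    """Yield (marker, start, end) for each header segment before the scan data."""
    if jpeg[:2] != SOI:
        raise ValueError("not a JPEG: missing SOI marker")
    pos = 2
    while pos + 4 <= len(jpeg):
        if jpeg[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        marker = jpeg[pos + 1]
        if marker == 0xDA:  # SOS: compressed image data follows, stop here
            return
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", jpeg[pos + 2:pos + 4])
        yield marker, pos, pos + 2 + length
        pos += 2 + length


def extract_xmp_segments(jpeg):
    """Return every raw APP1 segment whose payload starts with the XMP header."""
    return [
        jpeg[start:end]
        for marker, start, end in iter_segments(jpeg)
        if marker == 0xE1 and jpeg[start + 4:end].startswith(XMP_HEADER)
    ]


def restore_segments(jpeg, segments):
    """Insert raw segments after the leading APP0/APP1 segments of `jpeg`."""
    insert_at = 2
    for marker, _start, end in iter_segments(jpeg):
        if marker not in (0xE0, 0xE1):
            break
        insert_at = end
    return jpeg[:insert_at] + b"".join(segments) + jpeg[insert_at:]
```

One possible use is to call extract_xmp_segments on the bytes of the original file before handing it to libexif, then call restore_segments on the bytes of the file libexif wrote. Because the segments are copied byte for byte, nothing inside the XMP has to be parsed.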
I have code that uses a Novacode.LineChart. The LineChart type shown by default is this one:
But I don't want this type of chart; I want it without points, like this:
This is the code where I create the chart:
LineChart c = new LineChart();
c.AddLegend(ChartLegendPosition.Bottom, false);
c.Grouping = Grouping.Stacked;
Does anyone know how I can hide those points and show only the lines? Thanks to everyone!!
Your question showed up while I was searching for the exact same feature. It's probably a bit late, but I hope this will be useful for other people who need it.
My so-called answer is no more than a few lines of dirty, unmaintainable hack, so unless you are in dire need, I do not recommend going this way.
I also do not know whether this is an approved approach here, but I prefer to write out the solution step by step so that it may help you grasp the concept and find better methods.
After I realized that I could not use DocX to create a line chart without markers through the currently provided API, I wanted to know what the differences were between the actual and the desired output. So I manually edited the chart to get the expected result and saved a copy of that .docx file.
Before and after the edit
As you may already know, a .docx is a container format, essentially made up of a few different folders and files. You can open it with a .zip archive extractor. I used 7-Zip for this task and found the chart file at /word/charts/chart1.xml; the location may differ depending on the file, but you can easily figure it out.
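If you would rather locate the chart part programmatically than browse the archive by hand, a minimal sketch in Python could look like the following. It assumes the chart parts live under word/charts/ and are named chart1.xml, chart2.xml and so on, as in the file I inspected; chart_parts is a made-up helper name.

```python
import io
import zipfile


def chart_parts(docx_bytes):
    """Return the names of the chart XML parts inside a .docx archive."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return sorted(
            name
            for name in archive.namelist()
            # Assumed layout: word/charts/chart<N>.xml
            if name.startswith("word/charts/chart") and name.endswith(".xml")
        )
```

Reading the .docx file in binary mode and passing its bytes to chart_parts should list the parts worth diffing.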
I compared the two chart1.xml files, and the difference was that the file without the markers had an extra XML element with a child element carrying one attribute:
<c:marker>
<c:symbol val="none" />
</c:marker>
I had to somehow add this fragment to the chart. I added it on top of the example code provided by DocX. You can follow along from: DocX/ChartSample.cs at master
This is where the fun begins. Easy part first.
using System.Xml;
using System.Xml.Linq;
using Xceed.Words.NET;
// Create a line chart.
var line_chart = new LineChart();
// Create the data.
var PlaceholderData = ChartData.GenerateRandomDataForLinechart();
// Create and add series
var Series_1 = new Series("Your random chart with placeholder data");
Series_1.Bind(PlaceholderData, "X-Axis", "Y-Axis");
line_chart.AddSeries(Series_1);
// Create a new XmlDocument object and clone the actual chart XML
XmlDocument XMLWithNewTags = new XmlDocument();
XMLWithNewTags.LoadXml(line_chart.Xml.ToString());
I used the XPath Visualizer tool to determine the XPath query. This is important to know, because you can't just add the marker element anywhere and expect it to work. Why do I mention this? Because I appended the marker element at a random place and expected it to work. Naive.
// Set a namespace manager with the proper XPath location and alias
XmlNamespaceManager NSMngr = new XmlNamespaceManager(XMLWithNewTags.NameTable);
string XPathQuery = "/c:chartSpace/c:chart/c:plotArea/c:lineChart/c:ser";
string xmlns = "http://schemas.openxmlformats.org/drawingml/2006/chart";
NSMngr.AddNamespace("c", xmlns);
XmlNode NewNode = XMLWithNewTags.SelectSingleNode(XPathQuery, NSMngr);
Now create the necessary elements in the new XML document object, using the same namespace:
XmlElement Symbol = XMLWithNewTags.CreateElement("c", "symbol", xmlns);
Symbol.SetAttribute("val", "none");
XmlElement Marker = XMLWithNewTags.CreateElement("c", "marker", xmlns);
Marker.AppendChild(Symbol);
NewNode.AppendChild(Marker);
Now we should copy the latest changes back to the actual XML object. But oops: understandably, its setter is private, so the object is read-only from outside. This is where I thought, "Okay, I've fiddled enough with this. I'd better find another library," but then I decided to go on, because reasons.
I downloaded the DocX repo, changed this line to
get; set;
recompiled, copied Xceed.Words.NET.dll to both the projectfolder/packages and projectfolder/projectname/bin/Debug folders, and finally the last few lines were
// Copy the contents of latest changes to actual XML object
line_chart.Xml = XDocument.Parse(XMLWithNewTags.InnerXml);
// Insert chart into document
document.InsertChart(line_chart);
// Save this document to disk.
document.Save();
Is it worth it? I'm not sure, but I learned a few things while working on it. There are probably lots of bad programming practices in this answer, so please tell me if you see one. Sorry for my meh English.
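For comparison, the same marker edit can be sketched outside of DocX, directly on the chart XML. This is a Python sketch under assumptions, not the DocX API: it handles every series of every line chart instead of only the first one, and it places c:marker before the series children that I assume must follow it in the chart schema (c:cat, c:val and so on) rather than appending it at the end. hide_line_markers and tag are names made up for illustration.

```python
import xml.etree.ElementTree as ET

C_NS = "http://schemas.openxmlformats.org/drawingml/2006/chart"
ET.register_namespace("c", C_NS)


def tag(name):
    """Build the namespace-qualified tag name ElementTree expects."""
    return f"{{{C_NS}}}{name}"


# Series children assumed (from the chart schema) to come after c:marker.
AFTER_MARKER = {
    tag(name)
    for name in ("dPt", "dLbls", "trendline", "errBars", "cat", "val", "smooth", "extLst")
}


def hide_line_markers(chart_xml):
    """Return chart XML in which every line series has <c:symbol val="none"/>."""
    root = ET.fromstring(chart_xml)
    for line_chart in root.iter(tag("lineChart")):
        for series in line_chart.findall(tag("ser")):
            marker = series.find(tag("marker"))
            if marker is None:
                marker = ET.Element(tag("marker"))
                # Insert before the first child that has to follow the marker.
                position = len(series)
                for index, child in enumerate(series):
                    if child.tag in AFTER_MARKER:
                        position = index
                        break
                series.insert(position, marker)
            symbol = marker.find(tag("symbol"))
            if symbol is None:
                symbol = ET.Element(tag("symbol"))
                marker.insert(0, symbol)
            symbol.set("val", "none")
    return ET.tostring(root, encoding="unicode")
```

The returned string would still have to be written back into the chart part of the .docx, which is the step the private setter gets in the way of.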
I want to parse a PDF that has no images, only text. I'm trying to find pieces of text; for example, to search for the string "Name:" and read the characters after the ":".
I'm already able to open a PDF, get the number of pages, and loop over them. The problem comes when I want to use functions like CGPDFDictionaryGetStream or CGPDFStreamCopyData, because they use pointers. I have not found much information on the internet for Swift programmers.
Maybe the easiest way would be to parse all the content into an NSString. Then I could do the rest.
Here is my code:
// Get existing Pdf reference
let pdf = CGPDFDocumentCreateWithURL(NSURL(fileURLWithPath: path))
let pageCount = CGPDFDocumentGetNumberOfPages(pdf);
for index in 1...pageCount {
let myPage = CGPDFDocumentGetPage(pdf, index)
// Somehow search for the string "Name:" to get what's written next
}
You can use PDFKit to do this. On macOS it is part of the Quartz framework, and on iOS (11 and later) it is available by importing PDFKit directly. It is also pretty fast: I was able to search through a PDF with over 15,000 characters in just 0.07s.
Here is an example:
import Quartz
let pdf = PDFDocument(url: URL(fileURLWithPath: "/Users/...some path.../test.pdf"))
guard let contents = pdf?.string else {
print("could not get string from pdf: \(String(describing: pdf))")
exit(1)
}
let footNote = contents.components(separatedBy: "FOOT NOTE: ")[1] // get all the text after the first foot note
print(footNote.components(separatedBy: "\n")[0]) // print the first line of that text
// Output: "The operating system being written in C resulted in a more portable software."
You can also still access most (if not all) of the properties you had before, such as pdf.pageCount for the number of pages and pdf.page(at: <Int>) to get a specific page.
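Once the document text is available as one string, the "read what follows a label" step is plain string handling. Here is a small sketch of that step in Python instead of Swift; value_after_label is a made-up helper, and it returns None when the label is missing instead of indexing past the end of a split result.

```python
from typing import Optional


def value_after_label(text: str, label: str) -> Optional[str]:
    """Return the rest of the line that follows the first `label`, or None."""
    _before, found, rest = text.partition(label)
    if not found:
        return None
    # Keep only the remainder of that line and trim surrounding whitespace.
    return rest.split("\n", 1)[0].strip()
```

For the question's example, value_after_label(contents, "Name:") would give the text between "Name:" and the next line break.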
This is a pretty intensive task. There are libraries like PDFKitten, which are not maintained anymore. Here is a port of PDFKitten to Swift that I did, with some modifications to the way string searching and content indexing are done, as well as support for TrueType fonts.
https://github.com/SimpleApp/PDFParser
[disclaimer: lib author]
[second disclaimer: this lib is 100% MIT open source. The library has nothing to do with the company; it's not an ad or even a product. I'm posting this comment to help people, and then maybe grow a community around it, because it's a very common requirement and nothing free works well enough]
EDIT: the reason it's a pretty intensive task (not to mention all the character encoding issues) is that the PDF format doesn't have the notion of a "line of text" or even a "word". All it has is character-printing instructions. This means that if you want to find a "word", you have to recompute the frame of every block of characters, using font information, and find the ones that can be coalesced into a single word.
That's why you won't find many libraries offering this kind of feature, and why even some big projects sometimes fail to provide correct copy/paste or text search.
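To make the coalescing idea concrete, here is a deliberately simplified sketch in Python. It is not how PDFParser or PDFKitten do it. It assumes each glyph has already been reduced to a character, a left edge, a baseline and an advance width, that glyphs on one line share exactly the same baseline, and that y grows upwards as in default PDF user space. Glyph, coalesce_words and gap_tolerance are made-up names.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge
    y: float      # baseline
    width: float  # advance width


def coalesce_words(glyphs, gap_tolerance=1.0):
    """Group positioned glyphs into words, in reading order."""
    words = []
    previous = None
    # Higher baselines first (top of the page), then left to right.
    for glyph in sorted(glyphs, key=lambda g: (-g.y, g.x)):
        same_line = previous is not None and glyph.y == previous.y
        if same_line and glyph.x - (previous.x + previous.width) <= gap_tolerance:
            words[-1] += glyph.char  # close enough: same word
        else:
            words.append(glyph.char)  # new line or wide gap: new word
        previous = glyph
    return words
```

Real documents need much more than this (rotated text, kerning, varying font sizes, baselines that differ slightly within one line), which is where most of the effort goes.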
I am looking for a way to actually get the contents of the file itself, in its text format, dumped. I.e., I don't want a dictionary object, I don't want some sort of extraction strategy option, I just want the same text document that iTextSharp uses to parse... the WHOLE thing as a string or StringBuilder...
I have not yet found a way to do this using any tools whatsoever... My problem is that I am trying to read a dynamic PDF into a C# application... and we all know that those darn dynamic PDFs can't be parsed by iTextSharp (AcroForm and AcroFields always come up empty), so I figured that if I can get the actual text dump of the entire file, I can see what it looks like and parse it myself for this specific task (e.g., make a class for each document I know I can receive, and build a map there based on what I see).
If anyone can help me do that or, even better, find a way in C# to extract the XML source of the PDF (kind of like clicking the XML Source tab in LiveCycle), it would be greatly appreciated.
Thanks!
Matt
If you are looking for the actual operators and commands of each page in the raw text format, try the following code:
using System.IO;
using iTextSharp.text.pdf;
var reader = new PdfReader("test.pdf");
int intPageNum = reader.NumberOfPages;
for (int i = 1; i <= intPageNum; i++)
{
byte[] contentBytes = reader.GetPageContent(i);
File.WriteAllBytes("page-" + i + ".txt", contentBytes);
}
reader.Close();
I am looking for a way to actually get the contents of the file
itself, in its text format, dumped. E.g.: i don't want a dictionary
object, i don't want some sort of extractionstrategy option, i just
want the same text document that itextsharp uses to parse... the WHOLE
thing as a string or stringbuilder...
Unfortunately, the data that iTextSharp parses is not yet text: the operators in that data are given in a textual format, but the actual glyphs may be given in a completely arbitrary ad hoc encoding. That being said, a standard encoding is often used, as it is the simplest solution for the components involved. You cannot count on that in general, though. The answer by VahidN shows you how to access the starting points for that content; quite often, though, the page content data he extracts only contains references to resources that are stored in different objects.
my problem is that i am trying to read a dynamic PDF into a C#
application... and we all know that those darn dynamic PDFs can't be
parsed by iTextSharp (AcroForm and AcroFields always comes up empty),
This sounds as if you actually have a completely different task at hand. Dynamic forms and their contents are not part of the page content but instead stored in a separate XML Forms Architecture stream.
iText in Action, 2nd edition, gives you some information in chapter 8 on how to access the XFA stream data; for a first glimpse, look at the sample XfaMovie.cs.
You might also want to look at the iText XML Worker project for easier manipulation of XFA streams.
If you just want to dump the text, try this:
// Requires: using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
PdfReader reader = new PdfReader(pdfFileName);
String text = "";
int nPages = reader.NumberOfPages;
for (int i = 0; i < nPages; i++)
{
text += PdfTextExtractor.GetTextFromPage(reader, i + 1);
}
Many photo viewing and editing applications allow you to examine and change EXIF and IPTC data in JPEG and other image files. For example, I can see things like shutter speed, aperture and orientation in the picture files that come off my Canon A430. There are many, many name/value pairs in all this metadata. But...
What do I do if I want to store some data that doesn't have a built-in field name? Let's say I'm photographing an athletics competition and I want to tag every photo with the competitor's bib number. Can I create a "bib_number" field, assign it values of "0001", "5478", "8124" etc., and then search for all photos with bib_number="5478"?
I've spent a few hours searching, and the best I can come up with is to put this custom information in the "keywords" field, but this isn't quite what I'm after. With this solution I'd have to craft a query like "keywords contains bib_number_5478", whereas what I want is "bib_number is 5478".
So do the EXIF and/or IPTC standards allow additional user-defined field names?
Thanks
Kev
It can be used for that, but it really shouldn't: it's meant to be user-editable and so isn't a safe place to put critical metadata. Using an XMP sidecar is better for this kind of thing: in XMP, any field added that a given app does not understand is, according to the standard, supposed to be ignored by that app and not destroyed.
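As a rough illustration of the sidecar idea, the sketch below writes and reads a custom bib number property in an XMP document using only Python's standard library. The namespace URI http://example.com/ns/race/1.0/, the race prefix and the bibNumber property name are all made up for the example; a real workflow would pick its own namespace. The optional xpacket wrapper is left out to keep the sketch short.

```python
import xml.etree.ElementTree as ET

X_NS = "adobe:ns:meta/"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RACE_NS = "http://example.com/ns/race/1.0/"  # hypothetical custom namespace

for prefix, uri in (("x", X_NS), ("rdf", RDF_NS), ("race", RACE_NS)):
    ET.register_namespace(prefix, uri)


def build_sidecar(bib_number):
    """Return a minimal XMP document holding one custom property."""
    xmpmeta = ET.Element(f"{{{X_NS}}}xmpmeta")
    rdf = ET.SubElement(xmpmeta, f"{{{RDF_NS}}}RDF")
    description = ET.SubElement(
        rdf, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": ""}
    )
    ET.SubElement(description, f"{{{RACE_NS}}}bibNumber").text = bib_number
    return ET.tostring(xmpmeta, encoding="unicode")


def read_bib_number(sidecar_xml):
    """Return the custom bib number stored in a sidecar, or None."""
    node = ET.fromstring(sidecar_xml).find(f".//{{{RACE_NS}}}bibNumber")
    return None if node is None else node.text
```

With one sidecar per photo, "bib_number is 5478" becomes a loop over the sidecar files that keeps those where read_bib_number returns "5478".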
I don't know if there are applications to do this but by the standards described for JPEG files there is a field called Comments where you can assign values that could act like tags.
C# code:
using System.Windows.Media.Imaging;
using System.IO;
...
FileStream fs = new FileStream(@"<img_path>", FileMode.Open, FileAccess.ReadWrite);
BitmapMetadata bmd = (BitmapMetadata)BitmapFrame.Create(fs).Metadata;
bmd.Comment = "Some Comment Here";
Also, if you are looking for an application that already has this functionality built in, might I recommend IrfanView (open pic, go to Image menu, click on Comments button).
Hope this helps.