iText manipulate pdf with non-embedded fonts stored on file system - itext

I am using iText and need help applying fonts stored on the file system to existing PDFs whose fonts are not embedded.
a. I have PDF documents with non-embedded fonts (the PDFs include the name of each font). All of these non-embedded fonts are available on the file system.
b. I need to modify the existing PDFs by applying these fonts.
Could someone please provide a working code snippet? Thanks in advance.

Related

Is it possible to build a LibreOffice document from code similar to the way a web page is built from HTML and CSS?

Is it possible to build a LibreOffice document from code similar to the way a web page is built from HTML and CSS? Can one write an ODF file in which the content and styling are separate, and then open and view it in LibreOffice? If so, can one write the code in a text editor, as is done for HTML/CSS?
There are two reasons I ask. 1) When I need to make a style change in LibreOffice, I have to make the same adjustment manually in a hundred places, such as when changing the style of block quotes. 2) I'd like to build documents from a database of text.
I found a question on this in relation to databases but it was about eight years old.
Thank you for any direction you may be able to provide.
Unzip an .odt file that contains styles. You will see two files, content.xml and styles.xml. Edit these files using a text editor and then zip the folder back up to get a modified .odt file.
Be aware that there are two types of styles in the XML files. Named styles are what most people think of as styles, whereas automatic styles are custom formatting, like when you select some text and change the font directly.
The link from tohuwawohu describes utilities for working with the file programmatically. As also mentioned in the link, it's not too hard to write the code yourself. For example, in Python, import the built-in libraries zipfile and xml.etree.
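As a rough sketch of the unzip/edit/rezip approach described above, the following Python uses zipfile and xml.etree.ElementTree to change one named paragraph style in styles.xml. The function names, the style name passed in (for example "Quotations"), and the choice of the fo:margin-left attribute are illustrative assumptions, not something taken from the answer. ElementTree may also drop namespace declarations it considers unused, so the result should be checked by opening it in LibreOffice.

```python
import zipfile
import xml.etree.ElementTree as ET

# ODF namespaces that appear in styles.xml
NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "style": "urn:oasis:names:tc:opendocument:xmlns:style:1.0",
    "fo": "urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0",
}
for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)  # keep readable prefixes on output


def set_paragraph_margin(styles_xml, style_name, margin_left):
    """Return styles.xml bytes with fo:margin-left set on one named style."""
    root = ET.fromstring(styles_xml)
    name_attr = "{%s}name" % NS["style"]
    # Named styles live under office:styles (automatic styles live elsewhere).
    for style in root.findall("office:styles/style:style", NS):
        if style.get(name_attr) == style_name:
            props = style.find("style:paragraph-properties", NS)
            if props is None:
                props = ET.SubElement(
                    style, "{%s}paragraph-properties" % NS["style"]
                )
            props.set("{%s}margin-left" % NS["fo"], margin_left)
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


def restyle_odt(src_path, dst_path, style_name, margin_left):
    """Copy an .odt file, replacing styles.xml with the modified version."""
    with zipfile.ZipFile(src_path) as src, zipfile.ZipFile(dst_path, "w") as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "styles.xml":
                data = set_paragraph_margin(data, style_name, margin_left)
            # The ODF package format expects "mimetype" to be stored uncompressed.
            if item.filename == "mimetype":
                method = zipfile.ZIP_STORED
            else:
                method = zipfile.ZIP_DEFLATED
            dst.writestr(item.filename, data, compress_type=method)
```

A call such as restyle_odt("in.odt", "out.odt", "Quotations", "2cm") would then change every paragraph that uses that named style in one step, which is the point of keeping styling separate from content.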

Get documentation from GitHub project as a single pdf

I'm looking for a single pdf of the ErpNext and Frappe user manuals.
Documentation seems to be provided in html and the source is in markdown. I did find tools to convert markdown to html/pdf, but no reliable solution to generate a SINGLE pdf file keeping the structure as shown here:
Put more abstractly: How to transform GitHub markdown documentation (organized in subdirectories) into a single pdf file?
Could anyone help me out?
Any way of achieving this is welcome, thanks in advance!
You can convert markdown to PDF with Pandoc or similar tools.
You can search the internet for how to concatenate files on your OS.
There are several (online) tools to merge multiple PDFs into one.
To create a single file you can either
concatenate the markdown files into one big file, then convert to PDF, or
convert all markdown files to PDF, then merge all PDF files into one big PDF.
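The first option (concatenate, then convert) could be sketched in Python as below. The function name and the decision to order files by sorted path are assumptions made for illustration; a real documentation tree may need an explicit ordering, for example one taken from its table of contents.

```python
from pathlib import Path


def concatenate_markdown(docs_dir, output_file):
    """Join every .md file under docs_dir into one file, in sorted path order."""
    root = Path(docs_dir)
    out = Path(output_file).resolve()
    # rglob walks subdirectories; skip the output file if it lives in docs_dir.
    files = sorted(p for p in root.rglob("*.md") if p.resolve() != out)
    parts = [p.read_text(encoding="utf-8").strip() for p in files]
    # A blank line between files keeps headings from running into prior text.
    out.write_text("\n\n".join(parts) + "\n", encoding="utf-8")
    return files
```

The combined file can then be handed to Pandoc, for example with `pandoc combined.md -o combined.pdf`, assuming Pandoc and a PDF engine it supports are installed.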

asp.net web application to convert pdf to word

Is there a clear and reliable process for converting a PDF file into a Word file, with all formatting and images, in an ASP.NET web application?
The best way to do that is to use OCR. It will recognize the text and the images in the PDF file, and you can then save the result to a DOC file. I know of a third-party toolkit named LEADTOOLS that should meet your requirements, since it supports the ASP.NET environment. You can check their Online OCR Demo.
Also, you can check their website for more information, or contact their support team.
PDF is a presentational format in which all content is placed at absolute positions. There are no paragraphs or other structural elements (unless it is a Tagged PDF). Technically, you can output every word character by character in any order, and visually it would still look like normal text. Thus, a proper conversion to Word requires content recognition or some kind of OCR (e.g. ABBYY FineReader).
There are some paid components on the market that do text extraction, and some that convert pages to images (obviously, the latter is not a desirable approach for converting to Word).

How to read pdf table content data?

I need to read a PDF file that contains only tabular data, like an Excel file, and extract the cell values from it.
Is this possible using the iText API? If you have something to share, or know of any other solution, please share it.
The PDF format is just a canvas where text and graphics are placed without any structure information. As such there aren't any iText-objects in a PDF file. In each page there will probably be a number of Strings, but you can't reconstruct a phrase or a paragraph using these strings. There are probably a number of lines drawn, but you can't retrieve a Table-object based on these lines.
In short: parsing the content of a PDF-file is NOT POSSIBLE with iText.
You can try this! This lets you read PDF pages.
I recently ran into this problem. I wasn't able to make it work with itext.
An alternative solution I found was to open the PDF document in Adobe and export it to XML. At least with my PDFs, it preserved the table information, and I was then able to work with the XML programmatically to generate tabular files such as Excel.
The other issue I ran into was that Adobe only lets you export one file at a time and I had lots of files. Luckily Adobe also has a merge function. I ended up merging all the files together and then exporting them as one big XML file and working with that file to generate what I needed.
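A minimal sketch of that last step, turning exported XML into rows for a tabular file, might look like the following in Python. The element names TR, TD and TH are assumptions about what the export contains (they are parameters here, so adjust them to match the actual file), and the function names are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET


def tables_to_rows(xml_text, row_tag="TR", cell_tags=("TD", "TH")):
    """Collect the text of each cell, row by row, from exported XML."""
    root = ET.fromstring(xml_text)
    rows = []
    for row in root.iter(row_tag):
        # itertext() gathers text from nested elements inside a cell.
        cells = ["".join(c.itertext()).strip() for c in row if c.tag in cell_tags]
        rows.append(cells)
    return rows


def rows_to_csv(rows):
    """Serialize rows as CSV text, which Excel and similar tools can open."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()
```

With a merged export, all rows from all tables end up in one list, so some marker column or per-table split would be needed to tell the original files apart.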

How to create a Word-like document (.docx) in an app?

I would like to create a .docx file within an iPad application. The file would be created within the app (the user would create/edit it like in Word--preferably with the same "feel" of Word) and then it would be saved as a .docx file.
So, is it possible to do this? If so, how? What other alternative file formats are there?
Thanks,
John
You can easily generate RTF corresponding to most typical features of a word processor. It will not cover the vastness of available DOCX features, but I'm not certain a complete port of Microsoft Word to the iPhone would be practical, so most of these features would be unavailable anyway.
RTF is fully (read-write) supported by Microsoft Office and several other editors.
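To give a feel for how little is needed to generate basic RTF, here is a small sketch. It is written in Python purely to illustrate the format; the same strings could be built in whatever language the app uses. The function names, the font choice (Helvetica), and the 12 pt size are assumptions for the example.

```python
def rtf_escape(text):
    """Escape plain text for inclusion in an RTF document."""
    out = []
    for ch in text:
        if ch in "\\{}":
            out.append("\\" + ch)  # control characters must be backslash-escaped
        elif ch == "\n":
            out.append("\\line ")
        elif ord(ch) < 128:
            out.append(ch)
        else:
            # Non-ASCII: emit \uN per UTF-16 code unit, N as a signed 16-bit
            # value, followed by "?" as the fallback for older readers.
            data = ch.encode("utf-16-le")
            for i in range(0, len(data), 2):
                unit = int.from_bytes(data[i:i + 2], "little")
                if unit > 32767:
                    unit -= 65536
                out.append("\\u%d?" % unit)
    return "".join(out)


def build_rtf(paragraphs):
    """Build an RTF document from (text, bold) pairs, one pair per paragraph."""
    body = []
    for text, bold in paragraphs:
        run = rtf_escape(text)
        if bold:
            run = "{\\b " + run + "}"  # group limits bold to this run
        body.append(run + "\\par\n")
    # \fs is in half-points, so \fs24 is 12 pt.
    header = "{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0 Helvetica;}}\\f0\\fs24\n"
    return header + "".join(body) + "}"
```

Writing the returned string to a file with an .rtf extension should give a document that word processors can open; richer features (tables, images, styles) need more of the RTF specification than this covers.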