Images in the OOXML (Office Open XML) standard documents are damaged. Where can I find a good copy? - openxml

We are working on a project that deals with the OOXML format, specifically DOCX. We downloaded the PDFs from the ISO site (http://standards.iso.org/ittf/PubliclyAvailableStandards/index.html) but found that all the images in them are black. Some images have colored lines, but none of them has any text.
Has anyone here read the standard?
Where can I get a copy of the documents with intact images?
Thanks

You can take a look at the ECMA-376 version of the standard at the following link. I would download the third edition set of PDFs, as they are the most recent to date.

Related

Get documentation from GitHub project as a single pdf

I'm looking for a single pdf of the ErpNext and Frappe user manuals.
The documentation seems to be provided in HTML, and the source is in markdown. I did find tools to convert markdown to HTML/PDF, but no reliable solution for generating a SINGLE pdf file that keeps the structure as shown here:
Put more abstractly: how can I transform GitHub markdown documentation (organized in subdirectories) into a single pdf file?
Could anyone help me out?
Any way of achieving this is welcome, thanks in advance!
You can convert markdown to PDF with Pandoc or similar tools.
You can search the internet for how to concatenate files on your OS.
There are several (online) tools to merge multiple PDFs into one.
To create a single file you can either:
- concatenate the markdown files into one big file, then convert it to PDF, or
- convert all markdown files to PDF, then merge all the PDF files into one big PDF.
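The first option (concatenate, then convert) could be sketched roughly as follows. This is only a minimal sketch: the function name `concatenate_markdown` and the choice to order files by sorted path are my own assumptions, and a real manual may need a hand-written file order (for example, taken from its table of contents).

```python
from pathlib import Path


def concatenate_markdown(root: Path, output: Path) -> list[Path]:
    """Join every .md file under root into one file, in sorted path order.

    Hypothetical helper for illustration; returns the source files it used.
    """
    # rglob walks subdirectories too; skip the output file if it already exists.
    sources = sorted(
        p for p in root.rglob("*.md") if p.resolve() != output.resolve()
    )
    # Separate documents with a blank line so headings do not run together.
    parts = [p.read_text(encoding="utf-8").rstrip() for p in sources]
    output.write_text("\n\n".join(parts) + "\n", encoding="utf-8")
    return sources
```

The combined file can then be handed to Pandoc, e.g. `pandoc combined.md -o manual.pdf` (PDF output needs a PDF engine such as LaTeX installed). Relative image links inside the markdown files may break once the files are merged, so those might need adjusting.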

Matlab access PDF as an array of images

I am building a system that searches for a specific region in a picture and saves it. Everything works fine. I am mostly going to extract these regions from PDF books.
So I am looking for a way to treat a PDF file in MATLAB as an array of images (each page is an image). So far the only thing I have found is how to open PDF files in MATLAB.
The best solution I have come up with is to export the PDF as a set of PNG images and iterate through them. There is nothing wrong with this idea, but I am wondering whether I am missing something.
Judging from this page, it appears to be impossible to import a PDF directly into MATLAB:
A quick File Exchange search for 'pdf import' only turns up an attempt to extract text, not images.
So, all in all, your approach of saving the PDF as images and then importing them seems to be the way to go.
I agree with Salvador Dali and Dennis. To convert each page of the PDF to a PNG image, I downloaded ImageMagick and followed the commands here:
https://aleksandarjakovljevic.com/convert-pdf-images-using-imagemagick/
Specifically:
convert -density 150 -antialias "input_file_name.pdf" -resize 1024x -quality 100 "output_file_name-%03d.png"
Of course, there are other discussions about using ImageMagick for this purpose:
Converting a PDF to PNG and
Convert PDF to PNG using ImageMagick
This is an old thread, but it's the one I found when I asked the same question, so I thought I would elaborate in case it's helpful to future users who also land on this thread.
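For anyone who wants to script the conversion rather than type the command by hand, here is a small sketch that wraps the same ImageMagick call. The function names and default values are my own; only the flags come from the command above. It assumes the `convert` executable is on the PATH (on newer ImageMagick installs it may be called `magick` instead).

```python
import subprocess


def build_convert_command(pdf_path: str, output_prefix: str,
                          density: int = 150, width: int = 1024) -> list[str]:
    """Build the argument list for the ImageMagick command quoted above."""
    return [
        "convert",                      # assumed executable name
        "-density", str(density),       # rasterization resolution in DPI
        "-antialias",
        pdf_path,
        "-resize", f"{width}x",         # fixed width, height keeps aspect ratio
        "-quality", "100",
        f"{output_prefix}-%03d.png",    # one numbered PNG per page
    ]


def convert_pdf_to_pngs(pdf_path: str, output_prefix: str) -> None:
    """Run the conversion; raises CalledProcessError if ImageMagick fails."""
    subprocess.run(build_convert_command(pdf_path, output_prefix), check=True)
```

The resulting numbered PNG files can then be read one by one in MATLAB, as described in the question.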

asp.net web application to convert pdf to word

Is there any clear and proper process for converting a PDF file into a Word file, with all formatting and images preserved, in an ASP.NET web application?
The best way to do that is to use OCR. It will recognize the text and the images in the PDF file, and you can then save the result to a DOC file. I know of a third-party toolkit named LEADTOOLS that should help you meet your requirements, since it supports the ASP.NET environment. You can check their Online OCR Demo
You can also check their website for more information, or contact their support team.
PDF is a presentational format in which all content is placed at absolute positions. There are no paragraphs or other structural elements (unless it is a Tagged PDF). Technically, you can output every word character by character in any order, and visually it will still look like normal text. Thus, a proper conversion to Word requires content recognition or some kind of OCR (e.g. ABBYY FineReader).
There are some paid components on the market that can extract text, and some that convert pages to images (obviously, the latter is not a desirable approach for converting to Word).

Reading powerpoint pptx file in Objective-C

I was asked to find a way to use pptx as input for a game using cocos2D on iphone.
As far as I know, pptx files use the Office Open XML standard and should be fully readable, including information on animations, in any programming language.
However, I can only find examples and tutorials that use docx files, and I would like to know whether such documentation exists for pptx files.
I just spent two days on that topic, and I just can't find the strength to dive into Microsoft documentation.
A .pptx file is just a ZIP file containing some image files and XML files. Wouter van Vugt's free e-book will help you understand the XML. Unfortunately, the files contain a lot of boilerplate, so it can take some time to find the bits you are interested in.
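To get a feel for the layout before writing any Objective-C, you can peek inside the archive with a few lines of script. The sketch below uses Python only for illustration and assumes the conventional `ppt/slides/slideN.xml` part names that PowerPoint writes; strictly speaking the package locates slides through relationship files, so treat the path filter as an assumption. The function name is hypothetical.

```python
import zipfile


def list_slide_parts(pptx_path) -> list[str]:
    """Return the slide XML part names found in a .pptx archive.

    Accepts a file path or a file-like object. Assumes the conventional
    ppt/slides/ location for slide parts.
    """
    with zipfile.ZipFile(pptx_path) as archive:
        return sorted(
            name for name in archive.namelist()
            # Keeps slide XML, skips the .rels files under ppt/slides/_rels/.
            if name.startswith("ppt/slides/") and name.endswith(".xml")
        )
```

Note that the sort is lexicographic, so `slide10.xml` sorts before `slide2.xml`. The same idea carries over to any ZIP library available on iOS: unzip the file, then parse the slide XML.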

What's the easiest way to generate DOC files?

Right now I'm generating HTML with a Perl script and then manually converting it to DOC in OpenOffice. Actually, I have to copy the content, create a new "Text document", paste, and save, because OpenOffice treats HTML and DOC as separate file types, but that detail is not essential. It is very inconvenient.
Is there an automated way to convert HTML to a decent DOC file, or some other nice text-based format like HTML that I can generate and then convert to DOC automatically?
(I'm on OSX)
I can't help you get to .doc, but have you seen the Open XML Format SDK from Microsoft? This will allow you to generate Office 2007 format documents (.docx, .xlsx etc) from .NET code.
Theoretically you may have some luck with this under Mono on OS X, as it doesn't require an installation of Office 2007 (for Windows) to function.
Not sure if this is what you want, but you can fairly easily generate WordML documents with code. WordML is the Word 2003 XML file format. It's NOT the same thing as the Office 2007 Open XML formats. WordML is just one file that's not too hard to create if you're just doing fairly basic formatting. You could generate it directly rather than creating the HTML first. You can name the files with a .DOC extension and Word 2003 and later will open them just fine. You can resave them as real .DOC files if you want.
Here's the on-line WordML reference. I can send you some sample code if you'd like.
http://msdn.microsoft.com/en-us/library/aa212812(office.11).aspx
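To give an idea of how small such a file can be, here is a rough sketch that writes plain paragraphs as Word 2003 XML. The function name is made up, Python is used only for illustration (the same string building works in Perl), and the namespace URI and `mso-application` processing instruction are given to the best of my knowledge; check them against the reference above before relying on them.

```python
from xml.sax.saxutils import escape

# Word 2003 WordprocessingML namespace (assumed; verify against the reference).
WORDML_NS = "http://schemas.microsoft.com/office/word/2003/wordml"


def build_wordml(paragraphs: list[str]) -> str:
    """Return a minimal WordML document with one unformatted run per paragraph."""
    body = "".join(
        # escape() protects &, < and > so arbitrary text stays well-formed.
        f"<w:p><w:r><w:t>{escape(text)}</w:t></w:r></w:p>"
        for text in paragraphs
    )
    return (
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n'
        # Processing instruction that tells Windows to open the file in Word.
        '<?mso-application progid="Word.Document"?>\n'
        f'<w:wordDocument xmlns:w="{WORDML_NS}">'
        f"<w:body>{body}</w:body>"
        "</w:wordDocument>\n"
    )
```

Writing the returned string to a file with UTF-8 encoding and a .DOC (or .xml) extension should give something Word can open; formatting such as bold or headings needs additional run and paragraph properties that are not shown here.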
If you really want to create a general file format that can be converted into other formats, creating an XSL-FO file might be the way to go. There are a number of products out there that can take XSL-FO and transform it into other file types, such as Word and PDF.
We use the Aspose components, which are available for .NET and Java. With Java you should be able to use them on OS X, too.
You have to purchase the components (i.e. they are not free), but aside from that, they are really great.