PNG files in SO questions - png

Can we prevent PNG files from being uploaded to SO? Or at least suggest that they be pasted in as text and surrounded by braces?
It is very difficult for me to convert PNG data to usable XLS or TXT that can be used to duplicate the users problems.

Related

png files not readable after extracting mbtiles

I need to run with tiles on an offline system.
I like the idea of being able to download .mbtiles files,
and extract the tiles.
I downloaded a data set and tried to extract tiles using mb-util.
When I try to extract tiles from the .mbtiles file.
The .png files are not readable.
This isn't a permissions problem.
The .png files don't appear to be .png files.
Any idea why this doesn't work?
Is is not supported?
Does it have something to do with the free data set I'm testing with?
I'm going to have a difficult time getting my company to spend money for
something, if I can't prove the data is valid.

Get documentation from GitHub project as a single pdf

I'm looking for a single pdf of the ErpNext and Frappe user manuals.
Documentation seems to be provided in html and the source is in markdown. I did find tools to convert markdown to html/pdf, but no reliable solution to generate a SINGLE pdf file keeping the structure as shown here:
Put more abstractly: How to transform GitHub markdown documentation (organized in subdirectories) into a single pdf file?
Could anyone help me out?
Any way of achieving this is welcome, thanks in advance!
You can convert markdown to PDF with Pandoc or similar tools.
You can fsearch the internet about how to concatenate files on your OS.
There are several (online) tools to merge multiple PDFs into one.
To create a single file you can either
concatenate the markdown files into one big file, then convert to PDF, or
convert all markdown files to PDF, then merge all PDF files into one big PDF.

How do I automate converting PDF to HTML?

I work for a publisher and am trying to extract content from our fully laid out PDFs. I've tried pdftohtml, pdftotext, pdfminer, and other Python-based approaches to getting the content, as well as saving to Word, HTML, XML, etc. from the original Acrobat files.
I don't need just the text, I also need the text formatting. That's because, for example, I need all the blue text in the document.
When I save to HTML, Word, etc. from Acrobat, the resulting files contain screenshots of the pages, not the laid out text. When I extract text using different Python modules I get the text but lose the text formatting.
The only solution I've found is to manually copy and paste from the PDF into a word doc, then saving as HTML. I'm hoping to automate this.
Why does copying from Acrobat into Word achieve what I can't do by other means? Has anybody come across this problem before?
Maybe you can consider another method. The software (https://pdfapi.codeplex.com/) can convert pdf files to html directly via MVS. If you are able to use the MVS, i think the software i mentioned above is useful for you to convert the text in pdf files to html that can keep the format perfectly. Of course, it's just a referral, you can have a try.

Matlab access PDF as an array of images

Building a system which search for a specific region in the picture, and saves it. Everything works fine. Mostly I am going to extract these regions from pdf books.
So I am looking for a solution to treat PDF file in matlab as an array of images (each page is an image). Up till now the only thing I have found is how to open pdf files in matlab.
The best solution I came up with is to export PDF as many PNG images and iterate through them. There is nothing bad with these idea, but I am wondering am I missing something
Judging from this page it appears to be impossible to import pdf directly into matlab:
And a quick file exchange search for 'pdf import' only offers an attempt to extract text, rather than the images.
So all in all your approach of saving the pdf as images and then importing them seems to be the way to go.
I agree with Salvador Dali and Dennis. To convert each page of the PDF to a png image, I downloaded imagemagick and followed the commands here:
https://aleksandarjakovljevic.com/convert-pdf-images-using-imagemagick/
Specifically:
convert -density 150 -antialias "input_file_name.pdf" -resize 1024x -quality 100 "output_file_name-%03d.png"
Of course, there are other discussion about using ImageMagick for this purpose:
Converting a PDF to PNG and
Convert PDF to PNG using ImageMagick
This is an old thread, but it's the one I found when I asked the same question, so I thought I would elaborate in case it's helpful to future users who also land on this thread.

How to read pdf table content data?

I have a requirement to read a pdf file having tabular format data only like in excel file. I need to extract the cell value of given pdf file.
Is it be anyhow possible using itext API. If you have something to share then please share it or any other solutions?
The PDF format is just a canvas where text and graphics are placed without any structure information. As such there aren't any iText-objects in a PDF file. In each page there will probably be a number of Strings, but you can't reconstruct a phrase or a paragraph using these strings. There are probably a number of lines drawn, but you can't retrieve a Table-object based on these lines.
In short: parsing the content of a PDF-file is NOT POSSIBLE with iText.
You can try this! This lets you read PDF pages.
I recently ran into this problem. I wasn't able to make it work with itext.
An alternate solution I found was to open a PDF document in Adobe and export it to xml. At least with my PDF's it preserved the table information and then I was able to programmatically work with the XML to generate tabular files like excel etc.
The other issue I ran into was that Adobe only lets you export one file at a time and I had lots of files. Luckily Adobe also has a merge function. I ended up merging all the files together and then exporting them as one big XML file and working with that file to generate what I needed.