Get documentation from GitHub project as a single PDF

I'm looking for a single PDF of the ERPNext and Frappe user manuals.
Documentation seems to be provided in HTML and the source is in Markdown. I did find tools to convert Markdown to HTML/PDF, but no reliable solution to generate a SINGLE PDF file that keeps the original structure.
Put more abstractly: how do I transform GitHub Markdown documentation (organized in subdirectories) into a single PDF file?
Could anyone help me out?
Any way of achieving this is welcome, thanks in advance!

You can convert Markdown to PDF with Pandoc or similar tools.
You can search the internet for how to concatenate files on your OS.
There are several (online) tools to merge multiple PDFs into one.
To create a single file you can either
concatenate the markdown files into one big file and then convert it to PDF (a rough sketch of this approach follows below), or
convert all markdown files to PDF and then merge all the PDF files into one big PDF.
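A rough sketch of the first approach in Python, assuming pandoc (with a LaTeX engine) is installed and the Markdown sources sit in a local clone under a docs/ directory (the directory and file names are placeholders for your actual layout):

import subprocess
from pathlib import Path

DOCS_DIR = Path("docs")          # placeholder: root of the cloned markdown docs
OUTPUT_MD = Path("combined.md")  # intermediate single markdown file
OUTPUT_PDF = Path("manual.pdf")  # final PDF

# Walk the subdirectories in sorted order so chapters keep a stable sequence,
# then glue every .md file together with a page break between files.
parts = [md.read_text(encoding="utf-8") for md in sorted(DOCS_DIR.rglob("*.md"))]
OUTPUT_MD.write_text("\n\n\\newpage\n\n".join(parts), encoding="utf-8")

# Let pandoc do the markdown-to-PDF conversion and add a table of contents.
subprocess.run(["pandoc", str(OUTPUT_MD), "-o", str(OUTPUT_PDF), "--toc"], check=True)

The alphabetical sort is only a placeholder ordering; a real manual usually needs an explicit chapter list or index file to get the intended sequence.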

Related

Download link in GitHub markdown

When you link to a PDF file using:
[download this](file.pdf)
it downloads the PDF file. I have an Excel workbook that I'd like to allow someone to download using:
[download this](file.xlsx)
When I click it, it takes me to create a new page in the wiki. Is there any markdown syntax I can add that identifies the link as something to download?
If I have to, I can save the excel workbook as a PDF, but it's not going to be pretty.
Thank you!
First, try making a files subdirectory in your wiki, and putting your files in there.
I tried using an HTML anchor tag
<a href="files/file.csv" download>download this</a>
instead of the markdown link syntax
[download this](files/file.csv)
but it seems that GitHub wiki strips out the download attribute from the anchor tag.
In the end, I zipped my spreadsheet in a zip file and had the markdown link point to the zip file.
[download this](files/file.csv.zip)
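If you want to script that zipping step, a minimal Python sketch (the file names are placeholders for your actual workbook):

import zipfile

# Wrap the workbook in a zip archive so the wiki serves it as a plain download.
with zipfile.ZipFile("files/file.xlsx.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write("files/file.xlsx", arcname="file.xlsx")

The wiki link then points at the archive, e.g. [download this](files/file.xlsx.zip).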

How do I automate converting PDF to HTML?

I work for a publisher and am trying to extract content from our fully laid out PDFs. I've tried pdftohtml, pdftotext, pdfminer, and other Python-based approaches to getting the content, as well as saving to Word, HTML, XML, etc. from the original Acrobat files.
I don't need just the text, I also need the text formatting. That's because, for example, I need all the blue text in the document.
When I save to HTML, Word, etc. from Acrobat, the resulting files contain screenshots of the pages, not the laid out text. When I extract text using different Python modules I get the text but lose the text formatting.
The only solution I've found is to manually copy and paste from the PDF into a Word doc and then save as HTML. I'm hoping to automate this.
Why does copying from Acrobat into Word achieve what I can't do by other means? Has anybody come across this problem before?
Maybe you can consider another method. The library at https://pdfapi.codeplex.com/ can convert PDF files to HTML directly via MVS. If you are able to use MVS, I think it can convert the text in PDF files to HTML while keeping the formatting. Of course, it's just a suggestion; you can give it a try.
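As an aside, if staying with Python is an option, pdfminer.six does expose per-character font information through its layout objects, which may be enough to detect runs such as the blue text (colour itself may require inspecting the graphics state, which this sketch does not do); the input file name is a placeholder:

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTTextLine, LTChar

# Walk the layout tree and report the font used for each character,
# so differently formatted runs of text can be spotted.
for page_layout in extract_pages("input.pdf"):  # placeholder file name
    for element in page_layout:
        if not isinstance(element, LTTextContainer):
            continue
        for line in element:
            if not isinstance(line, LTTextLine):
                continue
            for ch in line:
                if isinstance(ch, LTChar):
                    print(repr(ch.get_text()), ch.fontname, round(ch.size, 1))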

How to read pdf table content data?

I have a requirement to read a PDF file containing only tabular data, like in an Excel file. I need to extract the cell values from the given PDF file.
Is this possible using the iText API? If you have something to share, please do, or suggest any other solutions.
The PDF format is just a canvas where text and graphics are placed without any structural information. As such there aren't any iText objects in a PDF file. In each page there will probably be a number of strings, but you can't reconstruct a phrase or a paragraph using these strings. There are probably a number of lines drawn, but you can't retrieve a Table object based on these lines.
In short: parsing the content of a PDF-file is NOT POSSIBLE with iText.
You can try this! This lets you read PDF pages.
I recently ran into this problem. I wasn't able to make it work with itext.
An alternate solution I found was to open a PDF document in Adobe and export it to XML. At least with my PDFs it preserved the table information, and then I was able to programmatically work with the XML to generate tabular files like Excel etc.
The other issue I ran into was that Adobe only lets you export one file at a time and I had lots of files. Luckily Adobe also has a merge function. I ended up merging all the files together and then exporting them as one big XML file and working with that file to generate what I needed.
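The XML that Adobe exports varies from document to document, so the tag names below are purely hypothetical; the point is only that once the table structure survives in XML, turning it into rows is straightforward with the Python standard library:

import csv
import xml.etree.ElementTree as ET

# NOTE: "Table", "TR" and "TD" are hypothetical tag names; inspect the XML
# Adobe actually produced and adjust the element paths accordingly.
tree = ET.parse("exported.xml")  # placeholder file name
with open("tables.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    for table in tree.iter("Table"):
        for row in table.iter("TR"):
            cells = ["".join(cell.itertext()).strip() for cell in row.iter("TD")]
            writer.writerow(cells)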

Images in OOXML (Office Open XML) standard documents are damaged. Where I can find a good one?

We are working on a project to deal with the OOXML format, specifically the DOCX format. We downloaded the PDFs from the ISO site (http://standards.iso.org/ittf/PubliclyAvailableStandards/index.html) but found that all images in the PDFs are black. Some images have colored lines, but none of them contain text.
Has anyone read the standard?
Where can I get a copy of the document with good images?
Thanks
You can take a look at the ECMA-376 version of the standard at the following link. I would download the third edition set of PDFs as they are the most recent to date.

Convert rtf files to chm files? Convert hlp files to chm files?

We were shipping .hlp files to customers when development was in VC++. The process to create it was as follows:
1. Create rtf file
2. Create new project in WinHelp and then compile to get .hlp file.
Now development has moved to .NET, and I also found that we can no longer open .hlp files on Windows 7 or Vista.
I wanted to know if there are any free command-line tools we can use to convert these .hlp files to a .chm file.
I also wanted to know if there are any free command-line tools to convert an .rtf file to a .chm file.
Microsoft has a tool which can convert Win Help projects to HTML Help. It is called HTML Help Workshop. You can open the existing .hpj project file with it and choose the option to convert it to HTML Help project .hhp. You can then compile the .hhp project with the same tool to generate the .chm file.
There are, however, many shortcomings in the tool. It generates an HTML page for each page in the RTF file, but the naming of these HTML pages is random, which makes future referencing difficult.
If you just have the .hlp file and not the original Win Help project files, you can use a decompiler to generate the .hpj and .rtf files first and then convert them using HTML Help Workshop.
I found the following link quite helpful:
http://www.help-info.de/en/Help_Info_WinHelp/hw_converting.htm
EDIT: there are some third-party converters and Help Authoring Tools (HATs) also available which may do the job better than HTML Help Workshop, but most of them are not free.
Keep in mind that CHM is compiled HTML, and RTF is not very closely related to HTML, so your main problem is the conversion of RTF to HTML.
I would try to convert the RTF to HTML, but with one topic per file.
What you could try is to load the RTF into Word and save it as HTML, and then use a program/script to split out the various topics into individual files and fix up the references.
Then compile the result with a CHM compiler (like MS HTML Help Workshop).
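If that final compile step needs to be automated, one option (assuming HTML Help Workshop is installed and the split-out topic pages already exist in a folder) is to generate a minimal .hhp project and hand it to hhc.exe; all paths and project settings below are placeholders:

import subprocess
from pathlib import Path

TOPICS_DIR = Path("topics")  # placeholder: folder holding one HTML file per topic
HHC = r"C:\Program Files (x86)\HTML Help Workshop\hhc.exe"  # typical install path, adjust as needed

html_files = sorted(TOPICS_DIR.glob("*.html"))  # assumes at least one topic page exists

# Write a minimal HTML Help project file that lists every topic page.
project = ["[OPTIONS]",
           "Compiled file=manual.chm",
           f"Default topic={html_files[0].name}",
           "Title=Converted Help",
           "",
           "[FILES]"] + [f.name for f in html_files]
(TOPICS_DIR / "manual.hhp").write_text("\n".join(project), encoding="utf-8")

# hhc.exe quirk: it returns 1 on success, so don't rely on the exit code alone.
subprocess.run([HHC, "manual.hhp"], cwd=TOPICS_DIR)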