Is it possible to build a LibreOffice document from code similar to the way a web page is built from HTML and CSS? - libreoffice

Is it possible to build a LibreOffice document from code similar to the way a web page is built from HTML and CSS? Can one write an ODF file in which the content and styling are separate, and then open/view it in LibreOffice? If so, can one write the code in a text editor, as is done for HTML/CSS?
There are two reasons I ask. 1) When I need to make a style change in LibreOffice, such as changing the style of block quotes, I have to manually make the same adjustment in a hundred places. 2) I'd like to build documents from a database of text.
I found a question on this in relation to databases but it was about eight years old.
Thank you for any direction you may be able to provide.

Unzip an .odt file that contains styles. Among the extracted files you will see content.xml and styles.xml. Edit these files using a text editor and then zip the contents back up (the files themselves, not the enclosing folder) to get a modified .odt file.
Be aware that there are two types of styles in the XML files. Named styles are what most people think of as styles, whereas automatic styles are custom formatting, like when you select some text and change the font directly.
The link from tohuwawohu describes utilities to work programmatically with the file. Also, as mentioned in the link, it's not too hard to write code yourself. For example, in Python, use the built-in zipfile and xml.etree modules.
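As a rough sketch of that approach, the Python below reads an .odt, changes one named paragraph style in styles.xml, and writes a new .odt. The style name, the margin property, and the function names are illustrative assumptions, not something from the original answer. It also assumes the ODF packaging convention that the "mimetype" entry comes first in the archive and is stored uncompressed. Note that ElementTree can drop namespace declarations it considers unused, so a real document may need a more careful XML library.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"
FO_NS = "urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0"


def set_margin_left(styles_xml: bytes, style_name: str, margin: str) -> bytes:
    """Set fo:margin-left on one named style (hypothetical helper)."""
    ET.register_namespace("style", STYLE_NS)
    ET.register_namespace("fo", FO_NS)
    # Re-register the document's own prefixes so ElementTree does not
    # rename them to ns0, ns1, ... when serializing.
    for _, (prefix, uri) in ET.iterparse(io.BytesIO(styles_xml), events=["start-ns"]):
        ET.register_namespace(prefix, uri)
    root = ET.fromstring(styles_xml)
    for style in root.iter(f"{{{STYLE_NS}}}style"):
        if style.get(f"{{{STYLE_NS}}}name") != style_name:
            continue
        props = style.find(f"{{{STYLE_NS}}}paragraph-properties")
        if props is None:
            props = ET.SubElement(style, f"{{{STYLE_NS}}}paragraph-properties")
        props.set(f"{{{FO_NS}}}margin-left", margin)
    return ET.tostring(root, encoding="UTF-8", xml_declaration=True)


def rewrite_odt(src: str, dst: str, style_name: str, margin: str) -> None:
    """Copy src to dst, editing styles.xml on the way (hypothetical helper)."""
    with zipfile.ZipFile(src) as zin, zipfile.ZipFile(dst, "w") as zout:
        names = zin.namelist()
        # Stable sort: "mimetype" moves to the front, the rest keep their order.
        names.sort(key=lambda n: n != "mimetype")
        for name in names:
            data = zin.read(name)
            if name == "styles.xml":
                data = set_margin_left(data, style_name, margin)
            method = zipfile.ZIP_STORED if name == "mimetype" else zipfile.ZIP_DEFLATED
            zout.writestr(name, data, compress_type=method)
```

A call such as rewrite_odt("in.odt", "out.odt", "Quotations", "2cm") would then change every paragraph that uses that named style at once, which is the HTML/CSS-like behavior the question asks about. Text formatted directly (automatic styles in content.xml) is not affected.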

Related

VSCode: is it possible to create a command that will fill a file with text on creation?

At my place of work, we have a standard source header. I've been using snippets to generate it when adding text to a file. However, since it's supposed to be pretty much used on everything, I figure I might as well see if I can automate its generation on file creation.
Is there a way to automatically add text to a file on creation in vscode? Can I generate different text based on the file extension?

Visio .vsdx format unzip and zip corrupts

I'm attempting to modify a Visio file (Open XML format) without having to use the Windows Visio application. My first experiment is just to use 7zip to unzip a known good .vsdx file that was created using Visio. That works fine; I can view the content of the package. Without making any modifications, I then use 7zip to re-zip the content and rename the result to .vsdx, but when I try to open the new file in Visio, it complains that the file is corrupt. Is there a way to manually re-zip the content into something that Visio accepts as a valid Visio file? I suspect that there may be some sort of check for the validity of the file, but I can't find what that may be. Thanks for any input.
I would use some form of OpenXML library to get at the file's guts using some sort of "approved magic".
Understanding that you might not want to do whatever you're doing via programming, I looked for some sort of free editor.
I found this free plug-in for Visual Studio:
https://marketplace.visualstudio.com/items?itemName=bsivanov.OpenXMLPackageEditorforVisualStudio
It works in the free "Microsoft Visual Studio Community 2019" as well. I just opened the dev environment (aka: the application) and dragged a Visio .vsdx file into the app. It opened with a tree-like editor. I was able to dig down until I found the visio > pages > page1.xml "leaf". Inside there, I was able to change some text on a shape, then save the "package".
Whatever this tool does, it saves the file properly, and I was able to open the altered .vsdx file in Visio. And the text that I changed in the editor was indeed changed inside of Visio!
I think I've used this in the past:
"Welcome to the Open XML SDK 2.5 for Office"
https://learn.microsoft.com/en-us/office/open-xml/open-xml-sdk
https://github.com/OfficeDev/Open-XML-SDK
To edit Visio files without the Visio application, you'll still need to understand how Visio works, to some extent.
A simple example:
I changed the text on a shape fairly easily within one of the page.xml files. That was easy. Then I wanted to add a copy of that shape. It was simple enough to copy and paste the whole xml block for the existing shape, then change the PinX and PinY attributes to move the shape to a different location on the page.
But you won't see that shape unless you give it a unique ID within the page. I tested deleting the ID attribute (to see if Visio would figure it out on open and assign one automatically), but it didn't work. If the ID is the same as another shape, the shape is ignored when you open the file. Once I changed ID to something unused, I did see the new copy of the shape.
If you create grouped shapes, or shapes that have advanced behavior (SmartShapes, ShapeSheet formulas, etc.), then this could get complicated. Formulas need to reference other shapes by ID, so you need to manage the IDs! For simple boxes and lines, etc., it might work well (and fast) to generate these things via OpenXML. Good luck!
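The copy-a-shape experiment described above could be scripted roughly as follows. This is only a sketch in Python for a simple, ungrouped shape: the function name is made up, and it assumes the page XML uses the Visio 2012 main namespace and stores PinX and PinY as Cell elements with N and V attributes. Check that against your own page1.xml before relying on it.

```python
import copy
import xml.etree.ElementTree as ET

# Assumed namespace of the page XML inside a .vsdx package.
VISIO_NS = "http://schemas.microsoft.com/office/visio/2012/main"


def duplicate_shape(page_xml: str, shape_id: str, pin_x: str, pin_y: str) -> str:
    """Clone one top-level shape, give it an unused ID, and move it."""
    ET.register_namespace("", VISIO_NS)  # keep the default namespace on output
    root = ET.fromstring(page_xml)
    shapes = root.find(f"{{{VISIO_NS}}}Shapes")
    original = next((s for s in shapes if s.get("ID") == shape_id), None)
    if original is None:
        raise ValueError(f"no top-level shape with ID {shape_id}")

    # Collect every ID already used on the page, including nested shapes.
    used = {int(s.get("ID")) for s in root.iter(f"{{{VISIO_NS}}}Shape")}

    clone = copy.deepcopy(original)
    clone.set("ID", str(max(used) + 1))  # duplicate IDs are ignored by Visio
    for cell in clone.findall(f"{{{VISIO_NS}}}Cell"):
        if cell.get("N") == "PinX":
            cell.set("V", pin_x)
        elif cell.get("N") == "PinY":
            cell.set("V", pin_y)
    shapes.append(clone)
    return ET.tostring(root, encoding="unicode")
```

Grouped shapes would need every nested shape renumbered as well, and any formulas that refer to shapes by ID updated, which this sketch does not attempt.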

is it possible to view a question with a browser before importing it to Moodle?

I have created an XML file using R/exams from just a single exercise, to be imported into Moodle. I would like to view it before uploading it to the Moodle question bank. I tried to open it with Firefox, and I can see some code but not the output, and a message appears saying that the XML file does not seem to have a style sheet associated with it. Is there a way to find this style sheet and to see how the question comes out just using a browser like Firefox or Chrome?
To emulate how the R/exams exercises are converted to HTML by exams2moodle() and how Moodle displays mathematical content, it's best to use
exams2html(..., converter = "pandoc-mathjax")
In recent versions of R/exams the resulting HTML file then automatically loads the MathJax Javascript that enables correct rendering of mathematical content in all modern browsers (including Google Chrome). See also http://www.R-exams.org/tutorials/math/ for some general advice about math in HTML.
To the best of my knowledge there is no tool that would quickly display Moodle XML files in such a way that you can easily assess them.

How do I automate converting PDF to HTML?

I work for a publisher and am trying to extract content from our fully laid out PDFs. I've tried pdftohtml, pdftotext, pdfminer, and other Python-based approaches to getting the content, as well as saving to Word, HTML, XML, etc. from the original Acrobat files.
I don't need just the text, I also need the text formatting. That's because, for example, I need all the blue text in the document.
When I save to HTML, Word, etc. from Acrobat, the resulting files contain screenshots of the pages, not the laid out text. When I extract text using different Python modules I get the text but lose the text formatting.
The only solution I've found is to manually copy and paste from the PDF into a word doc, then saving as HTML. I'm hoping to automate this.
Why does copying from Acrobat into Word achieve what I can't do by other means? Has anybody come across this problem before?
Maybe you can consider another method. The software at https://pdfapi.codeplex.com/ can convert PDF files to HTML directly via MVS. If you are able to use MVS, I think this software could be useful for converting the text in your PDF files to HTML while keeping the formatting. Of course, it's just a suggestion; you can give it a try.

Converting large amounts of text and dynamic data into PDF

I have a three page Word document that needs to be converted into PDF. This Word document was given to me as a template to show me what the PDF output should look like. I tried converting this document into PDF, created a PDF form and used iTextSharp to open the form, populate it with data and return it back to the client. This is all great but due to large amounts of data stored, the placeholders were insufficient and the text would be truncated or hidden.
My second attempt was to create an MVC 2 View without master page, pass the model to the view, take the HTML representation of the View, pass it over to iTextSharp and render the PDF. The problem here was that iTextSharp failed on some tags (one of them was <hr> tag). I managed to get rid of the problematic tag, but then tables were not rendered properly. Namely, the border attribute was ignored so I ended up with borderless tables. That attempt failed.
I need a suggestion or advice on the most efficient way to create a PDF document in MVC 2 which would be maintainable in the long run. I really don't want my actions to be 200+ lines long. Working directly with the Word document is not the best solution as I have never worked with VSTO so I don't quite know what it would look like to open Word and manipulate text inside of it and add dynamic data and then convert that dynamically into PDF.
Any suggestion is highly welcome.
Best regards!
One thing that I've done in the past is to save the Word file as a DOCX and unzip it, since DOCX is just a renamed zip file. Within the archive, open up /word/document.xml and you'll see your document. There are a lot of weird XML tags in there, but overall you should get a pretty good idea of where your content is. Then just add placeholder text like {FIRST_NAME}, save the file and re-zip.
Then from code you can just perform the same steps, unzipping with something like SharpZipLib or DotNetZip, swapping placeholder copy, re-zipping and then using very simple Word automation to Save-As a PDF.
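The unzip, swap and re-zip steps could look roughly like this. The answer names .NET libraries (SharpZipLib, DotNetZip); the sketch below uses Python's standard zipfile module purely to illustrate the idea, and the function name and placeholder dictionary are assumptions. It does not cover the final Save-As-PDF step.

```python
import zipfile
from xml.sax.saxutils import escape


def fill_docx(src: str, dst: str, values: dict) -> None:
    """Copy src to dst, replacing {KEY} placeholders in word/document.xml."""
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            if item.filename == "word/document.xml":
                text = data.decode("utf-8")
                for key, value in values.items():
                    # Escape &, < and > so the result is still valid XML.
                    text = text.replace("{" + key + "}", escape(value))
                data = text.encode("utf-8")
            zout.writestr(item.filename, data)
```

One caveat: Word sometimes splits what looks like a single word across several XML runs, so a placeholder typed in Word may not appear as one contiguous string in document.xml. Typing the placeholder directly into the XML, as described above, avoids that.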
The other route is to fully utilize iTextSharp and actually write Paragraphs and PdfPTable and everything else. It takes a lot longer to set up but would give you the most control.
Q: you say "... but due to large amounts of data stored, the placeholders were insufficient and the text would be truncated or hidden"
How do you end up having too much data? If the Word template can "hold" the data in 3 pages, it should fit in 3 PDF pages.
I used to use iTextSharp to create my PDFs, but I almost always ended up building the PDF document from scratch myself (not really a <200 line solution). Have you considered another library? I recently switched to MigraDoc's PDFSharp. It is way simpler to use than iText and has lots of examples and documentation.
Just my two cents
The Word document object model is quite easy to understand. A document contains a series of Paragraphs and Tables. Using the Open XML SDK, you can iterate through each paragraph/table in the Word document and retrieve its content and styles. Then you can generate the PDF document on the fly using that retrieved information. This will work under MVC too.
But if your Word document contains complex elements, then it will take some more time to implement this approach. Also, this approach would only work with Word 2007 and 2010 files.
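To give a feel for what iterating over paragraphs and tables involves, here is a small sketch that walks word/document.xml directly. It uses Python's ElementTree rather than the Open XML SDK the answer refers to, so it illustrates the document structure, not the SDK's API. The function name and the shape of the yielded tuples are assumptions.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def iter_blocks(document_xml):
    """Yield ("paragraph", style_id, text) or ("table", None, row_count)."""
    body = ET.fromstring(document_xml).find("w:body", NS)
    for child in body:
        if child.tag == f"{{{W}}}p":
            style = child.find("w:pPr/w:pStyle", NS)
            style_id = style.get(f"{{{W}}}val") if style is not None else None
            # A paragraph's text is spread over one or more w:t elements.
            text = "".join(t.text or "" for t in child.iter(f"{{{W}}}t"))
            yield ("paragraph", style_id, text)
        elif child.tag == f"{{{W}}}tbl":
            yield ("table", None, len(child.findall("w:tr", NS)))
```

Each yielded block could then be mapped to the corresponding PDF construct, for example a paragraph or a table in whichever PDF library is used.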
Also, as far as I know, the HTML to PDF options currently available in the iTextSharp library work only with a known set of tags.
Another suggestion is to make use of commercially available .NET components. There are a lot of good solutions available, for example Syncfusion.