Convert PDF to docx using LibreOffice without text boxes

I am working on a service that converts PDFs with selectable text to docx files.
I used LibreOffice to do this, with the command below:
libreoffice --headless --infilter="writer_pdf_import" --convert-to docx:"MS Word 2007 XML" --outdir /pdfOutput myPdf.pdf
The problem is that the text in my output file is not ordinary body text; it is placed in text boxes that contain editable text.
How can I convert a PDF to docx so that the result is plain text?

Related

Why aren't word processing program documents stored as plaintext?

Whenever an MS Word (or LibreOffice, or other word processor) document is opened in its own program, the words appear normally on the page, but when the document is opened in a text editor, most of it is Unicode gibberish.
I can understand why the document might have some parts that aren't legible, like bullet points or metadata, but why isn't at least some of the content stored as plaintext? Does every letter get encoded?
The current Microsoft Word format, docx, is a set of XML files containing plain text, compressed into a zip archive. You can rename the file from .docx to .zip, extract it, and open the XML files in a text editor (the main text is in word/document.xml). So the content is stored as plain text, just compressed.
I think it is probably a branding thing. If you want, you can export the document to a text file.
If you go to File > Export > Change File Type > Plain Text (*.txt), you can export the document there.
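The zip-of-XML layout described in the first answer can be checked directly. Below is a minimal Python sketch, using only the standard library, that reads word/document.xml from a docx and returns the text of each paragraph. The function names docx_paragraphs and make_minimal_docx are made up for this illustration, and the archive built by the helper contains only the one XML part, so it is enough for the demonstration but is not a complete file that Word would open.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(docx_bytes):
    """Return the text of every <w:p> paragraph in a docx, in document order.

    Paragraphs nested inside text boxes are also <w:p> elements, so their
    text shows up both on its own and as part of the enclosing paragraph.
    """
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # <w:t> elements hold the actual characters of each run.
        runs = [t.text or "" for t in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return paragraphs


def make_minimal_docx(paragraphs):
    """Hypothetical helper: build an in-memory zip holding only document.xml."""
    body = "".join(
        f"<w:p><w:r><w:t>{escape(p)}</w:t></w:r></w:p>" for p in paragraphs
    )
    xml = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("word/document.xml", xml)
    return buffer.getvalue()


if __name__ == "__main__":
    data = make_minimal_docx(["Hello", "World & co"])
    print(data[:2])               # zip archives start with the bytes "PK"
    print(docx_paragraphs(data))
```

For a real file, passing the result of open(path, "rb").read() to docx_paragraphs should work the same way, since the function only relies on the word/document.xml part.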

Converted stream file of iText PDF not opening in MS Word

Our project has a requirement to generate the final report both as a PDF and as an MS Word document. We are using iTextSharp to dynamically generate the tables and rows in the report. Finally, we upload the file to the server as both PDF and MS Word: each is converted to a byte array/stream and saved as a PDF and as an MS Word document. The uploaded PDF works as expected, but the MS Word file produces an error and does not open (screenshot attached).
iTextSharp doesn't produce MS Word documents, so this isn't an actual iText question. When I look at your screen shot, I see that you are trying to import a PDF file into Word. Since Word can't interpret PDF syntax, it shows you the syntax of the PDF file:
%PDF-1.4
%âãÏÓ
1 0 obj
<</Type/Font...
I think your question is wrong. You are not using iTextSharp to create a PDF file and an MS Word file. You are using iTextSharp to create a PDF file, and not an MS Word file.
There is no such thing as "Save a PDF as MS Word file" in iTextSharp, and it will be extremely difficult to find another tool that can convert a PDF document to a Word document in an acceptable way. (There are such tools, but the quality is suboptimal for PDFs that weren't made to be converted to another format.)
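The symptom in the screenshot (Word showing %PDF-1.4 and raw object syntax) suggests the saved "Word" file actually contains PDF bytes. A quick way to confirm this kind of mix-up is to look at the first bytes of the stream before saving it. The following Python sketch is only an illustration: the function name sniff_format is invented here, and the check covers just three well-known signatures.

```python
def sniff_format(data: bytes) -> str:
    """Guess a container format from the leading bytes of a file or stream."""
    if data.startswith(b"%PDF-"):
        return "pdf"
    if data.startswith(b"PK\x03\x04"):
        # Zip archive: docx, xlsx, odt and similar formats all start this way.
        return "zip"
    if data.startswith(b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"):
        # OLE2 compound file, used by the legacy binary .doc format.
        return "ole2"
    return "unknown"


if __name__ == "__main__":
    pdf_like = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n1 0 obj\n"
    print(sniff_format(pdf_like))
```

If a file with a .doc or .docx extension is reported as "pdf" by a check like this, renaming it will not help; the content has to be produced by a tool that actually writes a Word format.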

Put highlighted code in a word document using Apache POI

I'm generating a docx file (using Apache POI) that has a lot of SQL code in it. Because I'd like that code to be colored in the Word document, I'm first generating HTML with styles that do the syntax highlighting. Now I can't get that HTML into the Word document. Is that even possible (using POI)?
What I'd like to achieve is SQL code in a docx being colored based on a generated HTML (like exporting SQL code from Notepad++ as HTML and pasting it in a Word document). Any ideas?

How to convert a PDF to the AcroForm type?

I use pdftk to fill in forms.
Now, when I enter
F:\GoogleDisk\projects\comparepdfs>pdftk new/file.pdf fill_form new/b2bf7150aa9de8b2ef8edd20a5677f7f.fdf output new/temp_b2bf7150aa9de8b2ef8edd20a5677f7f.pdf
it returns
Warning: input PDF is not an acroform, so its fields were not filled.
How can I fix this, or convert the PDF to an AcroForm?
I solved it.
I used Combine Files in Acrobat, which creates a new PDF.
The new PDF works.

Printing RTF Text in PDFLib 9 Table Cell

I am using Perl and PDFLib 9 to dynamically create a PDF document. I read some data from a DB and print a table with the data onto the PDF. One field in my DB contains RTF text. How can I print RTF text in a PDFLib table cell? I can't find any example in the PDFLib cookbook.
You cannot use RTF text directly in PDFlib textflow/table cells, so you have to parse your RTF text and "translate" it into textflow notation. Afterwards, you can format text within a multiline text by using inline options within create_textflow() or by making multiple add_textflow() calls.
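As a rough illustration of the "parse and translate" step, the sketch below splits a very small subset of RTF into styled runs. It is written in Python rather than Perl, the name rtf_runs is made up, and it only understands groups, \b, \i and \par. It does not skip header groups such as the font table, and it does not handle escaped characters or \u and \' sequences, so real RTF from a database would need a proper parser. Mapping each run to textflow inline options or to a separate add_textflow() call is left out, because the exact option names should be taken from the PDFlib documentation.

```python
import re

# One token is a control word (with optional numeric parameter and one
# optional trailing space), a brace, or a stretch of plain text.
TOKEN = re.compile(r"\\([a-z]+)(-?\d+)? ?|([{}])|([^\\{}]+)")


def rtf_runs(rtf):
    """Split a tiny RTF subset into (text, bold, italic) tuples."""
    state = {"bold": False, "italic": False}
    stack = []  # saved formatting states, one per open group
    runs = []
    for word, param, brace, text in TOKEN.findall(rtf):
        if brace == "{":
            stack.append(dict(state))
        elif brace == "}":
            state = stack.pop()  # formatting is scoped to its group
        elif word in ("b", "i"):
            key = "bold" if word == "b" else "italic"
            state[key] = param != "0"  # \b turns on, \b0 turns off
        elif word == "par":
            runs.append(("\n", state["bold"], state["italic"]))
        elif text:
            runs.append((text, state["bold"], state["italic"]))
        # Any other control word (for example \rtf1) is ignored.
    return runs


if __name__ == "__main__":
    for run in rtf_runs(r"{\rtf1 Hello \b bold\b0  plain\par}"):
        print(run)
```

Each tuple could then be emitted as one piece of text with its own formatting, which matches the idea of calling add_textflow() several times with different options.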