I need to convert multiple .rtf-files to html. I thought the libre office commandline suit (lowriter) would do:
lowriter --headless --convert-to html *.rtf
But the program finishes (or crashed?) without any error-message at file number 249 (from around 380). I don't know why. As lowriter doesn't seem to have a good error-log (acording to the stackoverflow-post from Arnon Weinberg), maybe one of the .rtf-files is corrupt and the program crashes. But lowriter won't say. Is this possible?
I have no solution, but a workaround which workes fine for me:
for f in *.rtf; do lowriter --headless --convert-to html $f; done
I would be happy to know, if anybody can confirm or explain my problem and come up with a solution!
Related
I'm currently using PyPDF2 to work with PDF files in Python.
When I run a script to load some PDF files and extract some key words from the PDFs, I'm not able to:
PdfReadError: File has not been decrypted
So to (try &) get around this I implement:
if pathObj.isEncrypted:
pathObj.decrypt('')
However, I'm then comfronted with:
NotImplementedError: only algorithm code 1 and 2 are supported
Now, I kinda understand what the errors are telling me. What I don't understand is the fact that none of my PDFs are encrypted
Does anyone know why files that are not encryted are apparently encrypted? Is this some issue with PyPDF2?
Cheers
It would appear that these PDFs are 128-bit AES type encrypted. However, they can still used in Adobe just not with PyPDF2.
To get around the problem you have to install:
qpdf
and add to the code:
qpdf, --password=" ", --decrypt, in_put.pdf, out_put.pdf
I have a bunch of Archives that I want to extract. Problem is, there's a lot of them, and it's a lot of info to move around. I'd like to do it all at once. It's probably taken more time to research than to do it manually, but research is more interesting.
TL;DR: Would like help with 7-zip command line to extract multiple archives into their own directory. Autohotkey, Powershell, and batch files answers would also be nice if you are feeling extra helpful.
Win10, latest update and all that. I've been using 7-zip, so if there's a better extractor for this it might be a helpful suggestion. I have a little experience with coding, so I can usually pars an example and apply it to my project, but I can't come up with code on my own. So with that said, I'm comfortable using cmd, autohotkey, powershell, batch files, and a few others, but I need an example before I can do anything. haha
So, in my research, I found
(7z x -o"...\Stellaris\mod\Examples\" "...\content\281990\*")
for cmd, which works, except that extracts everything to the same dir since the archive files are in the root archive dir (I think that's why; if they were one folder down, it should work like I want right?). I don't think you can use environment variables in the path(?). Not sure what would make it work here...
Powershell: I only recently started tinkering with it so the one script I found didn't make any sense to me. And never found anyone using AutoHotKey for this.
And finally a **batch file* I found here seemed to come closest (normally I'd comment on that thread cause apparently it's still active, but I don't have 50 rep), but I wasn't sure how to modify it for my purposes:
#echo off
SET "filename=%~1" #Where does the working dir path go?
SET dirName=%filename:~0,-4% #How/where would you put in wildcards?
7z x -o"%dirName%" "%filename%"
I don't mind using any method, though I might prefer AHK? I'm probably most experienced there.
If you made it this far, wow, I'm impressed! I hope it was coherent enough to understand (probably not at first?). And maybe a little entertaining? I think I'm funny. Let me know if I should add or remove anything for the future. I know it's probably way too much context, but I would rather have too much than not enough, and I'm never sure what would be relevant and what would not. I'm not happy with my code format here, but I didn't quite understand what the help was saying about whitespace and I'm not familiar enough with Markdown yet (I wanted comments to be in line). Also, I'm honestly not sure about the tags.
EDIT: Added TL;DR at the top, and...
Found an answer via a program that does this. I'll post it in an answer as well: ExtractNow seems to be a bit outdated, last update was in '17, but it did what I wanted it to.
For interactive use at the command prompt:
for %z in ("\path\to\dir\subdir\*.zip") do #echo 7z "-o\path\to\extracted\%~nz" "%~z"
This won't run 7z, but it will print out the commands. Once you are satisfied that the printed commands look fine, remove the #echo to execute them.
In a batch script you must of course duplicate the % signs.
Found an answer via a program that does this. ExtractNow seems to be a bit outdated, last update was in '17, but it did what I wanted it to with only a few settings changes.
So, in my research, I found
(7z x -o"...\Stellaris\mod\Examples" "...\content\281990\*")
for cmd, which works, except that extracts everything to the same dir...
Assuming you were using Windows, 7-zip would have worked fine to do what you wanted. The only thing you were missing is the * character, which 7-zip expands to be the archive name when used with the -o switch:
7z x "dir\subdir\*.*" -o"dir\*"
So 7z x -o"...\Stellaris\mod\Examples" "...\content\281990\*"
becomes:
7z x -o"...\Stellaris\mod\Examples\*" "...\content\281990\*"
Also be aware that *.* does not mean any file under 7-zip. 7-Zip takes *.* to be name of any file that has an extension. To process all files just use a "dir\subdir\*" without the extra .*.
I would like to use openoffice or libreoffice to convert a presentation made with Impress ( odp file, but might be powerpoint ppt, too ) to jpg images.
My point is: I have an odp presentation file, composed with 10 slides, then I would receive 10 jpeg images, one for each slide.
I tried with :
soffice --headless --convert-to jpg presentation.odp
This works perfect, but I just receive the very first slide of my presentation, not all. I do need all of them.
I don't know if there's an option to tell soffice to convert all the slides instead of the first one.
I know there are other ways like converting to pdf and then use IM, but I want to solve this using soffice. Im doing everything under Ubuntu Linux.
Thanks in advance.
Juan
Im going to reply my own answer.
To convert, massively, from .odp to images, under Linux using CLI, I'll do:
soffice --headless --convert-to pdf presentation.odp
Then:
convert -density 400 converted.pdf -resize 800x600 my_filename%d.jpg
This solution works, but it needs some improvements to make it faster and to prevent it from failing due to lack of hardware resources.
But, if your odp is not that big, you converted from odp/ppt/pptx/whatever to images, massively, it is scriptable, and using just CLI.
In a perl script, I try to convert svg files to pdf. This works great by just refering to Inkscape:
system "inkscape -D -z --file=$in --export-pdf=$out";
But it is enormously slow even for little 100 KB files, I mean it can be minutes per file, causing the script to fail when running with a time-out constrain, eg. on a webserver.
To speed up, I have read about svg2pdf as a standalone, but never found a binary for Win7 or managed to compile it, even with the libcairo dlls present.
My last idea now is to use the CPAN module Cairo. It makes me hoping that it can convert an svg file to pdf, but in the documentation I only find drawings and surfaces, but no method to write/convert.
Has anyone experience with that?
Making my comment an answer: You could try rsvg-convert which is part of the librsvg library. It's probably faster than Inkscape but it's still an external command.
After a lot of experiments, I still can't get the following script working. I need some guidance on how to diagnoze this particular Perl problem. Thanks in advance.
This script is for testing the use of Office 2007 OCR API:
use warnings;
use strict;
use Win32::OLE;
use Win32::OLE::Const;
Win32::OLE::Const->Load("Microsoft Office Document Imaging 12\.0 Type Library")
or
die "Cannot use the Office 2007 OCR API";
my $miDoc = Win32::OLE->new('MODI.Document')
or die "Cannot create a MODI object";
#Loads an existing TIFF file
$miDoc->Create('OCR-test.tif');
#Performs OCR with the OCR language set to English
$miDoc->OCR(LangId => 'miLANG_ENGLISH');
#Get the OCR result
my $OCRresult = $miDoc->{Images}->Item(0)->{Layout}{Text};
print $OCRresult;
I did a small test. I loaded an .MDI file containing the OCR information. I deleted the OCR method line and ran the script and I got the expected text output of "print $OCRresult". But otherwise, Perl throws me the error saying
Use of uninitialized value $OCRresult in print at E:\OCR-test.pl line 15
I'm suspecting that something's wrong with the line
$miDoc->OCR(LangId => 'miLANG_ENGLISH');
I tried leaving the parens empty or using three paraments, like 'miLANG_ENGLISH',1,1 etc but without any luck.
I also tried using Microsfot Office Document Imaging to test if the TIF I'm experimenting with was text recognizable and the result was positive.
So what other diagnostic methods do I have?
Or can someone who happens to have Office 2007 test my code with a whatever jpg,bmp or tif pictures that have text content and see if something's wrong?
Thanks in advance.
UPDATE
Haha, I've finally figured out where the problem is and how I can solve it. #hobbs, thank you for leaving the comment :) Things are interesting. When I was trying to respond to your comment, I added the link of the url of Office Document Imaging 2003 VBA Language Reference and I took yet another look at the stuff there. And the following information caught my eyes:
LangId can be one of the following MiLANGUAGES constants.
miLANG_CHINESE_SIMPLIFIED (2052, &H804)
I changed the following OCR method line:
$miDoc->OCR('miLANG_ENGLISH',1,1);
to this:
$miDoc->OCR(2052,1,1);
A few notes:
1. I'm running ActivePerl 5.10.0 on Windows XP (Chinese version)
2. Before this, I already tried $miDoc->(9) but without luck
And suddenly and kind of magically that pesky ERROR saying "Use of uninitialized value $OCRresult in print at E:\OCR-test.pl line 15" disappeared completely and the OCRed text appeared on the screen. The OCR result was not satisfying but the parameter "2052" refers to Chinese and the TIF image contains all English. So I changed the parameter to
$miDoc->OCR(9,1,1) but this time without luck. Windows threw me this error:
unknown software exception (0x0000000d)
I changed the TIF image to one that contains all Chinese characters and changed the parameter to "$miDoc->OCR(2052,1,1);" again and this time everything worked just like expected. The OCR result was satisfying.
Now I think there's something weird about my Office 2007 OCR API and if someone who happens to run Windows XP (English version) and have installed Office 2007 would probably not encounter that exception error with the parameter
$miDoc->OCR(9,1,1);
Anyway, I'm really happy that I've finally get things working :D
For starters I would try dumping the value of $miDoc->{Images} -- does it exist? If it exists and it's a collection does it contain anything? If it contains anything, what is it? An error? Or maybe just a different structure than you're expecting? warn, Dumper, and a little exploration can go a long way.
Incidentally, if you want to do the "modern" thing and don't mind grabbing a nifty tool off of CPAN, try Devel::Dwarn -- it makes dumping to stderr even more fun than it was already :)