Programmatically get page count in Microsoft Word documents on Linux - ms-word

I need to get the page count from Word documents. I've tested many libraries and scripts (Apache POI, Perl scripts, some Linux applications and more), and the only working solution was to install Microsoft Office with Wine and access OLE from Perl. I managed to do it, but it seems I can't use it on a server due to licensing problems...
The problem with Apache POI and the other solutions that expose Word document info is the incompleteness of some documents: the pageCount property in the document summary is sometimes missing (often the case with ODT documents saved as .doc, and with older documents).
Is there any way to actually count pages (not just read the summary info) without installing Microsoft Office on the server?

I was going to say wvSummary, but I think this uses the metadata you're referring to. I'm not sure there is a way to get the page count without actually laying out the document. So you might have to resort to using APIs to drive a real Office-compatible application like OpenOffice or AbiWord.
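One hedged way to follow that suggestion without driving an application through an API, assuming LibreOffice (with its --headless and --convert-to options) and poppler-utils are installed, is to let LibreOffice lay the document out by converting it to PDF and then read the page count from pdfinfo (the file name below is a placeholder). It is slow, but it does not depend on the summary metadata:
libreoffice --headless --convert-to pdf --outdir /tmp "document.doc"
pdfinfo /tmp/document.pdf | awk '/^Pages:/ {print $2}'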

If you trust the document summary, then instead of using wvSummary you can just open the file and do a regex search for "nofpages(\d+)". Groups[1] will contain the number of pages.
Since Word always saves the summary when it saves the document, I think this is pretty safe if you know the document was last saved with Word, which in my experience is 99% of the time.

This is a version that also gets the page count from the document summary. I've added it late because MS Word has been through a number of updates since the question was asked.
The environment in which the following works is:
GNU bash, version 5.1.0(1)-release (x86_64-redhat-linux-gnu)
MS Word version 12 (2007) and version 16 (2016 - 2021)
It does not work for MS Word 9 (2000), and I assume earlier. I've also not tested the code on other shells.
DOCUMENT=<YourDocumentName>
PROPS=`unzip -c "$DOCUMENT" docProps/app.xml | tail -1`
NUMBER_OF_PAGES=`sed -e 's/.*\(<Pages>[0-9]*<\/Pages>\).*/\1/' <<< "$PROPS" | cut -d'>' -f2 | cut -d'<' -f1`
The PROPS variable is kept so that you can get the number of lines or words without reading the entire file again.
NUMBER_OF_LINES=`sed -e 's/.*\(<Lines>[0-9]*<\/Lines>\).*/\1/' <<< "$PROPS" | cut -d'>' -f2 | cut -d'<' -f1`
NUMBER_OF_WORDS=`sed -e 's/.*\(<Words>[0-9]*<\/Words>\).*/\1/' <<< "$PROPS" | cut -d'>' -f2 | cut -d'<' -f1`
You can view the other properties with:
echo $PROPS
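For reference, here is the same idea wrapped in a small bash function. The function name and the "unknown" fallback are my own additions, and like the snippet above it only works for OOXML containers (.docx), not legacy binary .doc files:
# Print the <Pages> value from docProps/app.xml, or "unknown" if it is absent
pages_in_docx() {
    local props re='<Pages>([0-9]+)</Pages>'
    props=$(unzip -p "$1" docProps/app.xml 2>/dev/null) || return 1
    if [[ $props =~ $re ]]; then
        echo "${BASH_REMATCH[1]}"
    else
        echo "unknown"
    fi
}
pages_in_docx "$DOCUMENT"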

Related

How to read a large number of LDAP entries in perl?

I already have an LDAP script that reads LDAP user information one by one. My problem is that it returns all users found in Active Directory. This will not work because our AD currently has around 100,000 users, which causes the script to crash due to memory limitations.
What I was thinking of doing was to process users in batches of X users and, if possible, use threads to process some users in parallel. The only thing is that I have just started using Perl, so I was wondering if anyone could give me a general idea of how to do this.
If you can get the executable ldapsearch to work in your environment (and it does work in *nix and Windows, although the syntax is often different), you can try something like this:
my $LDAP_SEARCH = "ldapsearch -h $LDAP_SERVER -p $LDAP_PORT -b $BASE -D uid=$LDAP_USERNAME -w $LDAP_PASSWORD -LLL";
my @LDAP_FIELDS = qw(uid mail Manager telephoneNumber CostCenter NTLogin displayName);
open (LDAP, "-|:utf8", "$LDAP_SEARCH \"$FILTER\" " . join(" ", @LDAP_FIELDS));
while (<LDAP>) {
# process each LDAP response
}
I use that to read nearly 100K LDAP entries without memory problems (although it still takes 30 minutes or more). You'll need to define $FILTER (or leave it blank) and of course all the LDAP server/username/password pieces.
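If the server itself caps result set sizes (Active Directory pages results at 1000 entries by default), OpenLDAP's ldapsearch can also request the paged-results control with -E, which keeps the same streaming approach working end to end. A sketch with placeholder values, using a simple bind (-x):
ldapsearch -x -h "$LDAP_SERVER" -p "$LDAP_PORT" -b "$BASE" \
    -D "uid=$LDAP_USERNAME" -w "$LDAP_PASSWORD" -LLL \
    -E pr=1000/noprompt "$FILTER" uid mail displayName |
while read -r line; do
    : # process each line of LDIF output here
done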
If you want/need to do a more pure-Perl version, I've had better luck with Net::LDAP instead of Net::LDAP::Express, especially for large queries.

updating table rows based on txt file

I have been searching, but so far I have only found how to insert data into tables based on CSV files.
I have the following scenario:
Directory name = ticketID
Inside this directory I have a couple of files, like:
Description.txt
Summary.txt - contains the ticket header and has been imported successfully.
Progress_#.txt - I get a new one of these every time a ticket gets updated.
Solution.txt
Importing the Issue.txt was easy since this was actually a CSV.
Now my problem is with Description and Progress files.
I need to update the existing rows with the data from these files. Something along the lines of:
update table_ticket set table_ticket.description = Description.txt where ticket_number = directoryname
I'm using PostgreSQL; the COPY command only covers new data, and it would still fail due to the ',;/ special characters.
I wanted to do this using a bash script, but it seems that won't be possible:
for i in `find . -type d`
do
update table_ticket
set table_ticket.description = $i/Description.txt
where ticket_number = $i
done
Of course the above code would have to take into account the connection to the database.
Does anyone have an idea of how I could achieve this using a shell script? Or would it be better to just write something in Java that reads the files and updates the records? I would like to avoid that approach, though.
Thanks
Alex
Thanks for the answer, but I came across this:
psql -U dbuser -h dbhost db
\set content `cat PATH/Description.txt`
update table_ticket set description = :'content' where ticketnr = TICKETNR;
Putting this into a simple script I created the following:
#!/bin/bash
for i in `find . -type d|grep ^./CS`
do
p=`echo $i|cut -b3-12 -`
echo $p
sed s/PATH/${p}/g cmd.sql > cmd.tmp.sql
ticketnr=`echo $p|cut -b5-10 -`
sed -i s/TICKETNR/${ticketnr}/g cmd.tmp.sql
cat cmd.tmp.sql
psql -U supportAdmin -h localhost supportdb -f cmd.tmp.sql
done
The downside is that it will always create a new connection; later I'll change it to create a single file instead.
But it does exactly what I was looking for: putting the contents inside a single column.
psql can't read the file in for you directly unless you intend to store it as a large object, in which case you can use lo_import. See the psql command \lo_import.
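For completeness, a sketch of the large-object route inside a psql session; the description_oid column is a hypothetical column of type oid, and psql stores the OID of the imported object in the variable :LASTOID:
\lo_import 'PATH/Description.txt'
update table_ticket set description_oid = :LASTOID where ticketnr = TICKETNR;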
Update: @AlexandreAlves points out that you can actually slurp file content in using
\set myvar `cat somefile`
then reference it as a psql variable with :'myvar'. Handy.
While it's possible to read the file in using the shell and feed it to psql, it's going to be awkward at best, as the shell offers neither a native PostgreSQL database driver with parameterised query support nor any text-escaping functions. You'd have to roll your own string escaping.
Even then, you need to know that the text encoding of the input file is valid for your client_encoding, otherwise you'll insert garbage and/or get errors. It quickly ends up being easier to do it in a language with proper PostgreSQL integration like Python, Perl, Ruby or Java.
There is a way to do what you want in bash if you really must, though: use Pg's delimited dollar quoting with a randomized delimiter to help prevent SQL injection attacks. It's not perfect but it's pretty darn close. I'm writing an example now.
Given problematic file:
$ cat > difficult.txt <<__END__
Shell metacharacters like: $!(){}*?"'
SQL-significant characters like "'()
__END__
and sample table:
psql -c 'CREATE TABLE testfile(filecontent text not null);'
You can:
#!/bin/bash
filetoread=$1
sep=$(printf '%04x%04x\n' $RANDOM $RANDOM)
psql <<__END__
INSERT INTO testfile(filecontent) VALUES (
\$x${sep}\$$(cat ${filetoread})\$x${sep}\$
);
__END__
This could be a little hard to read, and the random string generation is bash-specific, though I'm sure there are portable approaches.
A random tag string consisting of alphanumeric characters (I used hex for convenience) is generated and stored in sep.
psql is then invoked with a here-document tag that isn't quoted. The lack of quoting is important, as <<'__END__' would tell bash not to interpret shell metacharacters within the string, whereas plain <<__END__ allows the shell to interpret them. We need the shell to interpret metacharacters because we need to substitute sep into the here document and also need to use $(...) (equivalent to backticks) to insert the file text. The x before each substitution of sep is there because dollar-quote tags must be valid PostgreSQL identifiers, so they must start with a letter, not a number. There's an escaped dollar sign at the start and end of each tag because PostgreSQL dollar quotes are of the form $taghere$quoted text$taghere$.
So when the script is invoked as bash testscript.sh difficult.txt the here document lands up expanding into something like:
INSERT INTO testfile(filecontent) VALUES (
$x0a305c82$Shell metacharacters like: $!(){}*?"'
SQL-significant characters like "'()$x0a305c82$
);
where the tags vary each time, making SQL injection exploits that rely on prematurely ending the quoting difficult.
I still advise you to use a real scripting language, but this shows that it is indeed possible.
The best thing to do is to create a temporary table, COPY the data from the files in question into it, and then run your updates.
Your secondary option would be to create a function in a language like PL/PerlU and do this in a stored procedure, but you would lose a lot of the performance optimizations you get when updating from a temp table.
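A minimal sketch of that temp-table approach, assuming the per-ticket texts have first been collected into a CSV of (ticket_number, description) pairs; the file name and column names are hypothetical:
psql -U dbuser -h dbhost db <<'__SQL__'
CREATE TEMP TABLE ticket_import (ticket_number text, description text);
\copy ticket_import FROM 'descriptions.csv' WITH (FORMAT csv)
UPDATE table_ticket t
   SET description = i.description
  FROM ticket_import i
 WHERE t.ticket_number = i.ticket_number;
__SQL__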

PostgreSQL VCL controls

One of the products I write software for is an accounting type application. It is written in C++, uses C++ Builder and VCL controls, connects to PostgreSQL database running on Linux.
The PostgreSQL database is currently at version 8.4.x. We use UTF8 encoding. Everything works pretty good.
We are running tests of our software against PostgreSQL 9.2.3 with the exact same encoding and are finding a problem in which all our text-editing inputs store multiple lines as a single line containing literal \r\n character sequences.
So for example, if you enter 3 lines of text, hitting the Enter key after each line, then save it and read it back, I get one line with the real line endings gone. When we fetch the data from the database, we wind up with one line like so: line1\r\nline2\r\nline3\r\n, where "\r\n" is displayed instead of getting 0x0D, 0x0A in the stream.
Our application is not Unicode-aware; it uses Borland's AnsiString. (We are in the process of migrating this app to C++ Builder XE.) Does anyone know what might be causing this, or can anyone offer some things to try to fix it in the current code base while the larger conversion is underway?
I've tried the Borland DBText and DBRichText controls and they both do the same thing.
The other point I should mention is that we only tested the new PostgreSQL on the server and are still using an 8.x PostgreSQL client library (psql.lib). So the client and server versions aren't exactly at the same level, but I don't suspect this is the issue; any insight is certainly welcome.
UPDATE:
Here are some command line results from the two versions of PostgreSQL.
Version 9.2.3
testdb=# select * from notes where oid=5146352;
docid | docno | username | created | followup | reminder | subject | comments
-------+----------+----------+-------------------------------+----------+----------+-----------+-----------------------------
3001 | 11579522 | eric | 2013-02-15 22:38:24.136517+00 | f | f | Test Note | line1\r\nline2\r\nline3\r\n
Version 8.4.8
testdb=# select * from notes where oid=16490575;
docid | docno | username | created | followup | reminder | subject | comments
-------+----------+----------+------------------------------+----------+----------+--------------+----------
3001 | 11579522 | eric | 2013-02-18 20:15:23.10943-05 | f | f | <> | line1\r
: line2\r
: line3\r
:
Not sure how to format this for SO, but in the 8.4.8 command-line output the comment is printed on 3 separate lines, whereas the 9.2.3 version concatenates the output onto one line.
The insert is done by the same client for both databases. So something changed in the way PostgreSQL handles newline characters, and I'm wondering if there is a config setting to revert to the old behavior, or something I can do within my SELECT statement to get the old behavior back.
8.4 has standard_conforming_strings set to off by default, and 9.2 has it on by default.
When it's off, in a literal string, '\n' means a newline as in the C language, whereas when it's on, it means a backslash character followed by the character n.
To go back to the 8.4 behavior, you may issue SET standard_conforming_strings=off inside your sessions
or
ALTER DATABASE yourdb SET standard_conforming_strings=off;
for it to persist and be the default for new connections to this database.
Long term, it's recommended to adapt your code to work with standard_conforming_strings set to on, since that's the way forward.
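To illustrate the difference in a psql session (the strings here are just examples, not your data): with the setting on, backslash sequences in plain literals stay literal text, while E'' escape-string literals are interpreted the same way on old and new servers alike:
psql testdb <<'__SQL__'
SHOW standard_conforming_strings;
SELECT 'line1\r\nline2';   -- with the 9.2 default: 14 characters containing a literal \r\n
SELECT E'line1\r\nline2';  -- on any version: two lines separated by a real CR/LF
__SQL__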
Your problem looks like it is related to the Postgres config variable standard_conforming_strings. Before Postgres 9.1 this was turned off by default, which is why Postgres did not treat backslashes literally but interpreted them. According to the SQL standard, however, backslashes should be treated literally. So, from Postgres 9.1 this config variable is turned on by default, and you see your \r\n as literal text instead of it being interpreted.
Although this is not the right approach, to make it work in your case you need to edit your server's configuration file (postgresql.conf) and turn this setting off (standard_conforming_strings = off).

How to extract strings from plist files for translation (localization)?

I need to prepare a list of strings for translation of my iPhone application.
I have extracted strings from the *.m files using genstrings and from the XIB files using the ibtool command.
But I also have lots of text to translate in plist files (string-type fields enclosed in <string> tags).
Is there a nice bash script / command to extract those strings into a flat txt file?
I could review and filter it so my translators can work with a nice list rather than an alien-looking XML file.
I made a custom shell script which tries to figure out the values needed. You can then use the localize.py script in a modified way (see below) to automatically create the translation files. (The line breaks were somehow very important.) If there are more entities to be translated, the shell script can be modified accordingly.
#!/bin/bash
rm -f $2
sed -n 'N;/<key>Title<\/key>/{N;/<string>.*<\/string>/{s/.*<string>\(.*\)<\/string>.*/\/* \1 *\/\
"\1" = "\1";\
/p;};}' $1 >> $2
sed -n 'N;/<key>FooterText<\/key>/{N;/<string>.*<\/string>/{s/.*<string>\(.*\)<\/string>.*/\/* \1 *\/\
\"\1" = "\1";\
/p;}
;}' $1 >> $2
sed -n 'N;/<key>Titles<\/key>/{N;/<array>/{:a
N;/<\/array>/!{
/<string>.*<\/string>/{s/.*<string>\(.*\)<\/string>.*/\/* \1 *\/\
\"\1" = "\1";\
/p;}
ba
;};};}' $1 >> $2
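Hypothetical usage, since the script doesn't document its arguments: $1 is the source plist and $2 is the output strings file (both file names below are placeholders):
bash extract_plist_strings.sh Settings.bundle/Root.plist en.lproj/Root.strings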
The localize.py script needed some modification, so I created a small package containing the localizer for the source code and for the plist files. The new script even handles duplicates (meaning it will remove them).
We recently made a small online application to do that; please take a look at: http://www.icapps.be/plist-translator/
I can't think of any command off the top of my head. However, plists are glorified XML files, and there are various parsers available for them.
It shouldn't be too difficult to create a simple Python script to get all the strings from the file.
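If you'd rather stay in the shell, here is a crude sketch in the same spirit (file names are placeholders), assuming each <string> element sits on its own line; nested or multi-line values would need a real plist/XML parser:
sed -n 's/.*<string>\(.*\)<\/string>.*/\1/p' MyApp.plist > strings_to_translate.txt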
Does this help?
http://www.icanlocalize.com/site/tutorials/how-to-translate-plist-files/
We much prefer paying clients who use our translation system with our translators, but you can translate yourself in our GUI at no charge.

zsh filename globbing/substitution

I am trying to create my first zsh completion script, in this case for the command netcfg.
Lame as it may sound, I am stuck at the first hurdle. Disclaimer: I know how to do this crudely; however, I seek the "zsh way" to do it.
I need to list the files in /etc/network.d, but only the files, not the directory component, so I do the following:
echo $(ls /etc/network.d/*(.))
/etc/network.d/ethernet-dhcp /etc/network.d/wireless-wpa-config
What I wanted was:
ethernet-dhcp wireless-wpa-config
So I try (excuse my naivity) :
echo ${(s/*\/)$(ls /etc/network.d/*(.))}
/etc/network.d/ethernet-dhcp /etc/network.d/wireless-wpa-config
It seems that this doesn't work. I'm sure there must be some clever way of doing this by splitting into an array and getting the last part, but as I say, I'm a complete noob at this.
Any advice gratefully received.
General note: there is no need to use ls to generate the filenames; you might as well use echo some*glob. But if you want to protect possible embedded newline characters, even that is a bad idea. The first example below globs directly into an array to protect embedded newlines. The second one uses printf to generate NUL-terminated data to accomplish the same thing without using a variable.
It is easy to do if you are willing to use a variable:
typeset -a entries
entries=(/etc/network.d/*(.)) # generate the list
echo ${entries#/etc/network.d/} # strip the prefix from each one
You can also do it without a variable, but the extra stuff to isolate individual entries is a bit ugly:
# From the inside, to the outside:
# * glob the entries
# * NUL terminate them into a single string
# * split at NUL
# * strip the prefix from each one
echo ${${(0)"$(printf '%s\0' /etc/network.d/*(.))"}#/etc/network.d/}
Or, if you are going to use a subshell anyway (i.e. the command substitution in the previous example), just cd to the directory so it is not part of the glob expansion (plus, you do not have to repeat the directory name):
echo ${(0)"$(cd /etc/network.d && printf '%s\0' *(.))"}
Chris Johnsen's answer is full of useful information about zsh, however it doesn't mention the much simpler solution that works in this particular case:
echo /etc/network.d/*(:t)
This is using the t history modifier as a glob qualifier.
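Since the original goal was a completion function for netcfg, here is a rough sketch of how the :t modifier could slot into one. The function body is my own assumption and hasn't been tested against netcfg itself:
#compdef netcfg
# Offer the plain files in /etc/network.d, stripped to their tails, as profiles
local -a profiles
profiles=( /etc/network.d/*(N.:t) )
_describe 'network profile' profiles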
Thanks for your suggestions, guys. Having done yet more reading about zsh and coming back to the problem a couple of days later, I think I've got a very terse solution which I would like to share for your benefit:
echo ${$(print /etc/network.d/*(.)):t}
I'm used to seeing basename(1) strip off directory components; also, you can use echo /etc/network/* to get the file listing without running the external ls program. (Running external programs can slow down completion more than you'd like; I didn't find a zsh builtin for basename, but that doesn't mean there isn't one.)
Here's something I hope will help:
haig% for f in /etc/network/* ; do basename $f ; done
if-down.d
if-post-down.d
if-pre-up.d
if-up.d
interfaces