How to perform UTF-8 encoding using the xmlstarlet fo --encode option?

The synopsis for xmlstarlet fo says
XMLStarlet Toolkit: Format XML document
Usage: xmlstarlet fo [<options>] <xml-file>
where <options> are
-n or --noindent - do not indent
-t or --indent-tab - indent output with tabulation
-s or --indent-spaces <num> - indent output with <num> spaces
-o or --omit-decl - omit xml declaration <?xml version="1.0"?>
--net - allow network access
-R or --recover - try to recover what is parsable
-D or --dropdtd - remove the DOCTYPE of the input docs
-C or --nocdata - replace cdata section with text nodes
-N or --nsclean - remove redundant namespace declarations
-e or --encode <encoding> - output in the given encoding (utf-8, unicode...)
-H or --html - input is HTML
-h or --help - print help
When I run
cat unformatted.html | xmlstarlet fo -H -R --encode utf-8
I am returned the error message
failed to load external entity "utf-8"

In my limited experience, xmlstarlet fo in particular needs an explicit - (dash) argument to read from standard input.
In your example, the contents of unformatted.html are piped to xmlstarlet, but xmlstarlet fo doesn't 'see' the piped input unless you pass a - (dash).
Instead, it assumes that the last argument (utf-8) is the name of the file (the "external entity") whose contents you're trying to format. Obviously, there's no such file. Just to be on the safe side, I'd also enclose the encoding argument in double quotes, like so: "utf-8".
Altering your statement to
xmlstarlet fo -H -R --encode "utf-8" unformatted.html
should do the trick.
The cat is unnecessary, I'd think.
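Both variants described above (an explicit - for standard input, or the file name as the last argument) can be sketched as follows. This is only an illustration: the sample markup and the temporary file are made up, and the block skips itself when xmlstarlet is not installed.

```shell
# Hypothetical sample input standing in for unformatted.html.
html='<html><body><p>hello</p></body></html>'

if command -v xmlstarlet >/dev/null 2>&1; then
    # Variant 1: piped input; the trailing "-" makes fo read standard input.
    formatted=$(printf '%s\n' "$html" | xmlstarlet fo -H -R --encode "utf-8" -)
    printf '%s\n' "$formatted"

    # Variant 2: no pipe; the file name is the last argument.
    tmpfile=$(mktemp)
    printf '%s\n' "$html" > "$tmpfile"
    xmlstarlet fo -H -R --encode "utf-8" "$tmpfile"
    rm -f "$tmpfile"
else
    echo "xmlstarlet not found; skipping" >&2
fi
```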

Related

Formatting file changes encoding on Red Hat system

I have a bash script which extracts data from an Oracle database. I use spool to extract the data. After extraction I format the file by removing and replacing some characters. My problem is that after formatting, the files are in ANSI encoding instead of UTF-8.
Extraction with spool: the file is UTF-8.
Formatting with the cat and tr commands, redirecting into another file: this file is ANSI.
The same process works fine on an AIX system. I tried iconv but it doesn't work. Do you have any idea why the encoding changes from UTF-8 to ANSI, and how to correct it?
You should consistently use either ISO-8859-1 or UTF-8. In the latter case, don't use tr, as it doesn't (yet?) support multi-byte characters; use sed instead (e.g. sed 's/deletethis//g').
ISO-8859-1:
export LC_CTYPE=fr_FR.ISO-8859-1
export NLS_LANG=French_France.WE8ISO8859P1
# fetch data from Oracle, emulated by the following line
echo 'âêîôû' >test.latin1 # 5 bytes (+lineend)
# perform formatting, eg:
sed 's/ê/[e-circumflex]/g' test.latin1
# or the same with hex-codes:
sed $'s/\xea/[e-circumflex]/g' test.latin1
UTF-8:
export LC_CTYPE=fr_FR.UTF-8
export NLS_LANG=French_France.AL32UTF8
# fetch data from Oracle, emulated by the following line
echo 'âêîôû' >test.utf8 # 10 bytes (+lineend)
# perform formatting, eg:
sed 's/ê/[e-circumflex]/g' test.utf8
# or the same with hex-codes:
sed $'s/\xc3\xaa/[e-circumflex]/g' test.utf8
Note: no conversion (iconv, recode, etc) is required, just make sure NLS_LANG and LC_CTYPE are compatible. (Also, your terminal(emulator) should be set accordingly; for PuTTY it is Configuration/Category/Window/Translation/Remote-character-set.)
Original answer:
I cannot tell what's wrong with the formatting you perform, but here is a method to damage the utf8-encoded text:
$ echo 'ÁRVÍZTŰRŐ TÜKÖRFÚRÓGÉP' | iconv -f iso-8859-2 -t utf-8 | xxd
00000000: c381 5256 c38d 5a54 c5b0 52c5 9020 54c3 ..RV..ZT..R.. T.
00000010: 9c4b c396 5246 c39a 52c3 9347 c389 500a .K..RF..R..G..P.
$ echo 'ÁRVÍZTŰRŐ TÜKÖRFÚRÓGÉP' | iconv -f iso-8859-2 -t utf-8 | tr -d $'\200-\237' | xxd
00000000: c352 56c3 5a54 c5b0 52c5 2054 c34b c352 .RV.ZT..R. T.K.R
00000010: 46c3 52c3 47c3 500a F.R.G.P.
Here the tr -d $'\200-\237' part deleted half of the utf8-sequences (c381 became c3, c590 became c5), rendering the text unusable.
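To check whether a file has been damaged in this way, one option is to let iconv convert it from UTF-8 to UTF-8 and look at the exit status. This is a sketch that assumes an iconv which rejects malformed input (glibc's does); the function name and the file names are made up for the example.

```shell
# Succeeds (exit status 0) if the file named by $1 is well-formed UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

tmpdir=$(mktemp -d)

# "ê" in UTF-8 is the two bytes c3 aa.
printf '\303\252\n' > "$tmpdir/good.txt"

# Byte-wise tr deletes the continuation byte aa and leaves a lone c3.
printf '\303\252\n' | LC_ALL=C tr -d '\200-\277' > "$tmpdir/bad.txt"

for f in good.txt bad.txt; do
    if is_valid_utf8 "$tmpdir/$f"; then
        echo "$f: valid UTF-8"
    else
        echo "$f: damaged"
    fi
done
```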

exiftool not showing space character when using -t or -T option

I am using the following command to save a tag with a space character:
exiftool -config xmp.config -overwrite_original -PropertyID=' ' /Users/admin/Downloads/Files/09913/1KingWithSofaBed_rm521_1.tif
Using the -X option, I can see that the space character was saved successfully:
exiftool -X -filename -PropertyID /Users/admin/Downloads/Files/09913/1KingWithSofaBed_rm521_1.tif
<?xml version='1.0' encoding='UTF-8'?>
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
<rdf:Description rdf:about='/Users/admin/Downloads/Files/09913/1KingWithSofaBed_rm521_1.tif'
xmlns:et='http://ns.exiftool.ca/1.0/' et:toolkit='Image::ExifTool 11.84'
xmlns:System='http://ns.exiftool.ca/File/System/1.0/'
xmlns:XMP-xmp='http://ns.exiftool.ca/XMP/XMP-xmp/1.0/'>
<System:FileName>1KingWithSofaBed_rm521_1.tif</System:FileName>
<XMP-xmp:PropertyID> </XMP-xmp:PropertyID>
</rdf:Description>
</rdf:RDF>
The problem is that -t or -T does not show the space:
exiftool -t -filename -PropertyID /Users/admin/Downloads/Files/09913/1KingWithSofaBed_rm521_1.tif
File Name 1KingWithSofaBed_rm521_1.tif
Property ID
exiftool -T -filename -PropertyID /Users/admin/Downloads/Files/09913/1KingWithSofaBed_rm521_1.tif
1KingWithSofaBed_rm521_1.tif
In both cases the space is not present for the PropertyID field (I have checked the contents with a hex editor).
Is this a limitation of exiftool, or is it possible to show the space using the -t or -T option?
The answer from Phil Harvey, the author of exiftool:
You can use the (undocumented) -ec option (ExifTool 11.54 or later) to escape control characters using C-style escape sequences and preserve trailing newlines, nulls and newlines, etc
I tested it out and it seemed to preserve trailing spaces.

DOS/Windows xmlstarlet usage with a string instead of an XML file

Can xmlstarlet be used with a string instead of an XML file?
E.g.:
xmlstarlet sel -t -v "/*" "<pathlist><path>C:\file.txt</path></pathlist>"
instead of
xmlstarlet sel -t -v "/*" pathlist.xml
Or how else could I do this with a string?
When I echo the string and pipe it to xmlstarlet, it does not work:
SET "_var=^<pathlist^>^<path^>C:\file.txt^</path^> ^</pathlist^>"
&
call echo %^_var% | xmlstarlet sel -t -v "//*"
gives error:
< was unexpected at this time.
-:1.1: Document is empty
^
-:1.1: Start tag expected, '<' not found
^
This is actually a simple task, but I can't get it to work. I just want to echo a string to xmlstarlet within a one-liner.
cmd.exe syntax is weird; the following trick using set /p seems to work:
C:\tmp><nul (set /p ="<pathlist><path>C:\file.txt</path></pathlist>") | xmlstarlet sel -t -v /*
C:\file.txt
/* may get glob expanded (depending on what files you have). Unfortunately, there is no way to quote it from cmd.exe (the expansion is performed by libc on behalf of xmlstarlet), so you will have to rewrite the XPath in that case, e.g. /pathlist instead.
Source: https://groups.google.com/d/msg/alt.msdos.batch.nt/RNug94fXI5s/BdgYJfNmXysJ via http://www.netikka.net/tsneti/info/tscmd047.htm
I found no explanation of why escaping < and > doesn't work when the output is piped with |.
C:\tmp> echo ^<^>
<>
C:\tmp> echo ^<^> | more
> was unexpected at this time.

Extracting the contents between two different strings using bash or perl

I have tried to scan through the other posts on Stack Overflow for this, but couldn't get my code to work, hence I am posting a new question.
Below is the content of file temp.
<?xml version="1.0" encoding="UTF-8"?>
<env:Envelope xmlns:env="http://schemas.xmlsoap.org/soap/envelope/"><env:Body><dp:response xmlns:dp="http://www.datapower.com/schemas/management"><dp:timestamp>2015-01-22T13:38:04Z</dp:timestamp><dp:file name="temporary://test.txt">XJzLXJlc3VsdHMtYWN0aW9uX18i</dp:file><dp:file name="temporary://test1.txt">lc3VsdHMtYWN0aW9uX18i</dp:file></dp:response></env:Body></env:Envelope>
This file contains the base64-encoded contents of two files named test.txt and test1.txt. I want to extract the base64-encoded content of each file to separate files, test.txt and test1.txt respectively.
To achieve this, I have to remove the xml tags around the base64 contents. I am trying below commands to achieve this. However, it is not working as expected.
sed -n '/test.txt"\>/,/\<\/dp:file\>/p' temp | perl -p -e 's#<dp:file name="temporary://test.txt">##g'|perl -p -e 's#</dp:file>##g' > test.txt
sed -n '/test1.txt"\>/,/\<\/dp:file\>/p' temp | perl -p -e 's#<dp:file name="temporary://test1.txt">##g'|perl -p -e 's#</dp:file></dp:response></env:Body></env:Envelope>##g' > test1.txt
Below command:
sed -n '/test.txt"\>/,/\<\/dp:file\>/p' temp | perl -p -e 's#<dp:file name="temporary://test.txt">##g'|perl -p -e 's#</dp:file>##g'
produces output:
XJzLXJlc3VsdHMtYWN0aW9uX18i
<dp:file name="temporary://test1.txt">lc3VsdHMtYWN0aW9uX18i</dp:response> </env:Body></env:Envelope>
However, in the output I am expecting only the first line, XJzLXJlc3VsdHMtYWN0aW9uX18i. Where am I making a mistake?
When I run the command below, I get the expected output:
sed -n '/test1.txt"\>/,/\<\/dp:file\>/p' temp | perl -p -e 's#<dp:file name="temporary://test1.txt">##g'|perl -p -e 's#</dp:file></dp:response></env:Body></env:Envelope>##g'
It produces below string
lc3VsdHMtYWN0aW9uX18i
I can then easily route this to test1.txt file.
UPDATE
I have edited the question by updating the source file content. The source file doesn't contain any newline characters. The current solution does not work in that case; I have tried it and it failed. wc -l temp must output 1.
OS: solaris 10
Shell: bash
sed -n 's_<dp:file name="\([^"]*\)">\([^<]*\).*_\1 -> \2_p' temp
I added \1 -> to show the link from file name to content; for the content only, just remove this part.
This is the POSIX version, so with GNU sed use --posix.
It assumes that the base64-encoded content is on the same line as the surrounding tag (and not spread over several lines, which would need some modification).
Thanks to JID for the full explanation below.
How it works
sed -n
The -n means no printing so unless explicitly told to print, then there will be no output from sed
's_
This is to substitute the following regex using _ to separate regex from the replacement.
<dp:file name=
Regular text
"\([^"]*\)"
The brackets are a capture group and must be escaped unless the -r option is used (-r is not available in POSIX sed). Everything inside the brackets is captured. [^"]* means 0 or more occurrences of any character that is not a quote. So really this just captures anything between the two quotes.
>\([^<]*\)
Again uses a capture group, this time to capture everything between the > and the next <
.*
Everything else on the line
_\1 -> \2
This is the replacement, so replace everything in the regex before with the first capture group then a -> and then the second capture group.
_p
Means print the line
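The pieces explained above can be tried on a small input. In this sketch the sample data is hypothetical and places one <dp:file> element per line, which is what the answer assumes. The second command is the "content only" variant, where the file name is matched but not captured.

```shell
# Hypothetical sample: one <dp:file> element per line, as the answer assumes.
sample='<dp:file name="temporary://test.txt">XJzLXJlc3VsdHMtYWN0aW9uX18i</dp:file>
<dp:file name="temporary://test1.txt">lc3VsdHMtYWN0aW9uX18i</dp:file>'

# File name and content, using the command from the answer unchanged.
pairs=$(printf '%s\n' "$sample" |
    sed -n 's_<dp:file name="\([^"]*\)">\([^<]*\).*_\1 -> \2_p')

# Content only: the name is no longer captured, so the content is group 1.
contents=$(printf '%s\n' "$sample" |
    sed -n 's_<dp:file name="[^"]*">\([^<]*\).*_\1_p')

printf '%s\n' "$pairs" "$contents"
```

Note that the substitution leaves any text before <dp:file on the same line untouched, so input with leading markup on that line would keep that markup in the output.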
Resources
http://unixhelp.ed.ac.uk/CGI/man-cgi?sed
http://www.grymoire.com/Unix/Sed.html
/usr/xpg4/bin/sed works well here.
/usr/bin/sed does not work as expected if the file contains just one line.
The command below works for a file containing only a single line:
/usr/xpg4/bin/sed -n 's_<env:Envelope\(.*\)<dp:file name="temporary://BackUpDir/backupmanifest.xml">\([^>]*\)</dp:file>\(.*\)_\2_p' securebackup.xml 2>/dev/null
Without 2>/dev/null, this sed command prints the warning sed: Missing newline at end of file.
The reason is as follows:
The default Solaris sed ignores a last line that lacks a terminating newline, so as not to break existing scripts, because the original Unix implementation required every line to be terminated by a newline.
GNU sed is more relaxed, and the POSIX implementation accepts such a line but prints a warning.
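When the whole document is a single line holding several <dp:file> elements, another option is to split the input at every < with tr first, so that each element starts its own line, and then match at the start of the line. This is a different technique from the sed commands above and only a sketch: the sample line and the function name are made up, and on Solaris /usr/xpg4/bin/sed may still be the safer choice.

```shell
# Hypothetical single-line sample, shaped like the file in the question.
sample='<env:Body><dp:file name="temporary://test.txt">XJzLXJlc3VsdHMtYWN0aW9uX18i</dp:file><dp:file name="temporary://test1.txt">lc3VsdHMtYWN0aW9uX18i</dp:file></env:Body>'

# Split at every "<", then keep only the lines that begin a dp:file element.
list_files() {
    tr '<' '\n' |
        sed -n 's_^dp:file name="\([^"]*\)">\(.*\)_\1 -> \2_p'
}

result=$(printf '%s\n' "$sample" | list_files)
printf '%s\n' "$result"
```

Because printf adds the final newline, the input seen by sed always ends with one, which sidesteps the missing-newline problem described above.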

How to search and replace in text files only?

I have a directory containing a bunch of files, some text some binary, with no consistent naming. I want to search and replace a string in text files only. So I went with:
perl -i -pne 's#/some/text/to/replace#/replacement/text#' *
Remove the -i option and you will see that binary files get caught. How do I modify this one-liner to skip binary files?
ack -n --text --sort -f . | xargs perl -i -pne 's…'
Abusing ack goes much quicker than writing your own solution with -T.
Well, this is all based on what your definition of a text file is. Perl 5 has the -T filetest operator that will tell you if a filename or filehandle is a text file (using Perl 5's definition):
perl -i -pne 'BEGIN{@ARGV=grep-T,@ARGV}s#regex#replacement#' *
The BEGIN block will filter out any files that don't pass the -T test, so they won't even be read (except for their first block because that is what -T uses to determine if they are text).
From perldoc -f -X
The -T and -B switches work as follows. The first block or so of the file is examined for odd characters such as strange control codes or characters with the high bit set. If too many strange characters (>30%) are found, it's a -B file; otherwise it's a -T file. Also, any file containing a zero byte in the first block is considered a binary file. If -T or -B is used on a filehandle, the current IO buffer is examined rather than the first block. Both -T and -B return true on an empty file, or a file at EOF when testing a filehandle. Because you have to read a file to do the -T test, on most occasions you want to use a -f against the file first, as in next unless -f $file && -T $file .