Linux Sort vs Perl String Comparison - perl

Because I was dealing with very large files, I sorted my base and candidate files before comparing them, to see which lines were missing from each. I did this to avoid keeping the records in memory. The sorting was done with the Linux command-line tool sort.
In my Perl script, I would check whether the string on the current line was lt, gt, or eq to the line in the other file, advancing the file pointers as necessary. However, I hit a problem when I noticed that my string comparison considered the strings in the base file lt a string in the candidate file that contained special characters.
Is there a surefire way of making sure my Linux sort and Perl string comparisons are using the same type of string comparator?

The sort command uses the current locale, as specified by the environment variable LC_ALL, to determine the sort order for characters. Usually the easiest way to fix sorting issues is to manually set this to the C locale, which treats each 8-bit byte as a single character and compares by simple numeric value. In most shells this can be done as a one-off just for a single command by prefixing it like so:
LC_ALL=C sort < infile > outfile
This will also solve similar problems for some other text-processing programs. (E.g. I recall problems working with CSV files on a German person's computer -- this was traced back to the fact that Germans use a comma instead of a decimal point. Putting LC_ALL=C in front of the relevant commands fixed that issue too.)
[EDIT] Although Perl can be directed to treat some strings as Unicode, by default it still treats input and output as streams of 8-bit bytes, so the above approach should produce an order that is the same as Perl's sort() function. (Thanks to Ven'Tatsu for this nugget.)
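To illustrate the approach in the question, here is a minimal sketch of the compare-and-advance loop (the file names are made up); it assumes both inputs were sorted with LC_ALL=C sort, so that Perl's default byte-wise lt/gt/eq ordering agrees with sort(1):
#!/usr/bin/perl
use strict;
use warnings;

# Assumes both files were produced with: LC_ALL=C sort
open my $base, '<', 'base.sorted'      or die "base.sorted: $!";
open my $cand, '<', 'candidate.sorted' or die "candidate.sorted: $!";

my $base_line = <$base>;
my $cand_line = <$cand>;

while (defined $base_line && defined $cand_line) {
    if ($base_line lt $cand_line) {
        print "only in base: $base_line";
        $base_line = <$base>;
    }
    elsif ($base_line gt $cand_line) {
        print "only in candidate: $cand_line";
        $cand_line = <$cand>;
    }
    else {                              # line present in both files
        $base_line = <$base>;
        $cand_line = <$cand>;
    }
}

# Whatever is left over in one file has no match in the other.
while (defined $base_line) { print "only in base: $base_line";      $base_line = <$base>; }
while (defined $cand_line) { print "only in candidate: $cand_line"; $cand_line = <$cand>; }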

Related

Encoding special chars in XSLT output

I have built a set of scripts, part of which transform XML documents from one vocabulary to a subset of the document in another vocabulary.
For reasons that are opaque to me, but apparently non-negotiable, the target platform (Java-based) requires the output document to have 'encoding="UTF-8"' in the XML declaration, but some special characters within text nodes must be encoded with their hex Unicode value - e.g. '”' must be replaced with '&#x201D;' and so forth. I have not been able to acquire a definitive list of which chars must be encoded, but it does not appear to be as simple as "all non-ASCII".
Currently, I have a horrid mess of VBScript using ADODB to directly check each line of the output file after processing, and replace characters where necessary. This is painfully slow, and unsurprisingly some characters get missed (and are consequently nuked by the target platform).
While I could waste time "refining" the VBScript, the long-term aim is to get rid of that entirely, and I'm sure there must be a faster and more accurate way of achieving this, ideally within the XSLT stage itself.
Can anyone suggest any fruitful avenues of investigation?
(edit: I'm not convinced that character maps are the answer - I've looked at them before, and unless I'm mistaken, since my input could conceivably contain any unicode character, I would need to have a map containing all of them except the ones I don't want encoded...)
<xsl:output encoding="us-ascii"/>
Tells the serialiser that it has to produce ASCII-compatible output. That should force it to produce character references for all non-ASCII characters in text content and attribute values. (Should there be non-ASCII in other places like tag or attribute names, serialisation will fail.)
Well, with the XSLT 2.0 that you have tagged your post with, you can use a character map; see http://www.w3.org/TR/xslt20/#character-maps.

Why does CGI.pm still use "\0" as null character, when it's treated as a normal character in Perl?

Quoted from the CGI.pm docs:
When using this, the thing you must watch out for are multivalued CGI
parameters. Because a hash cannot distinguish between scalar and list
context, multivalued parameters will be returned as a packed string,
separated by the "\0" (null) character.
However, as it turns out, \0 is nothing special in Perl:
print length("test\0hi");
The output is:
7
whereas in C it should be 4.
Why does CGI.pm still use \0 as null character, when it's treated as a normal character (not the mark of end of string any more) in Perl?
It's a design mistake. I think we agree that it should not coerce the hash value to a string at all, but it probably seemed like a good idea back then, and \0 is simply the least bad choice of separator for various reasons of little importance.
Edit: People usually avoid putting NULs in their data precisely because they tend to cause breakage in C programs, which makes this character slightly more favourable as a separator.
Edit 2: hobbs comments that it goes back to Perl 4, so the mistake is not in the original design, but in carrying it over and then not trying hard enough to deprecate the feature.
Well, hindsight is always perfect. Hash::MultiValue is the smarter data structure you were thinking of.
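For what it's worth, here is a minimal sketch of what Hash::MultiValue gives you (the parameter names are just examples):
use Hash::MultiValue;

my $params = Hash::MultiValue->new(
    name  => 'alice',
    color => 'red',
    color => 'blue',                     # repeated keys are kept, not clobbered
);

my $name   = $params->get('name');       # 'alice'
my @colors = $params->get_all('color');  # ('red', 'blue')
my $color  = $params->{color};           # 'blue' -- plain hash access yields a single value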
It's a security feature.
Users of ->Vars expect a hash of key-values, where the values are strings. If one of the values happens to be a reference to an array, it would break that expectation and could cause the program to behave badly.
If you want to support arguments with multiple values, use ->param in list context. You can use it to build your own hash, if you want.
my %hash;
for my $name ($cgi->param) {               # with no arguments, param returns the parameter names
    $hash{$name} = [ $cgi->param($name) ]; # list context: all values for that name
}
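If you do go through ->Vars anyway, the packed values can be pulled apart with split, as the CGI.pm docs themselves suggest (the parameter name here is just an example):
my %vars   = $cgi->Vars;
my @colors = split /\0/, $vars{color};   # unpack a multi-valued parameter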
I strongly disagree about it being a design error. I think it's a very smart way of handling bad data (multiple instances of a parameter where at most one is expected).

How to use '^#' in Vim scripts?

I'm trying to work around a problem with using ^# (i.e., <ctrl-#>) characters in Vim scripts. I can insert them into a script, but when the script runs it seems the line is truncated at the point where a ^# was located.
My kludgy solution so far is to have a ^# stored in a variable, then reference the variable in the script whenever I would have quoted a literal ^#. Can someone tell me what's going on here? Is there a better way around this problem?
That is one reason why I never use raw special character values in scripts. While a literal ^# does not work, the string <C-#> in mappings works as expected, so you may use one of:
nnoremap <C-#> {rhs}
nnoremap <Nul> {rhs}
It is strange, but you cannot use <Char-0x0> here. Some notes about null byte in strings:
Inserting a null byte into a string truncates it: Vim uses old C-style strings that end with a null byte, so the byte cannot appear inside a string. These strings are also very inefficient, so if you want to generate a very large text, try accumulating it in a list of lines (using setline is very fast, because a buffer is represented as a list of lines).
Most functions that return a list of strings (like readfile, getline(start, end)) or take a list of strings (like writefile, setline, append) treat \n (NL) as Nul. It is also the internal representation of buffer lines; see :h NL-used-for-Nul.
If you try to insert a \n character into the command line, you will see a Nul displayed (but it is really a newline). If you want to edit a file that has \n in its filename (which is possible on *nix), you will need to prefix the newline with a backslash.
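As a small, hedged illustration of the NL-for-Nul convention (the file name is made up): store the problem byte as \n inside the string and let writefile() turn it back into a null byte on disk:
" Inside Vim strings a literal null byte cannot be stored, so use "\n" instead.
let s:line = "foo\nbar"
" With the 'b' (binary) flag, the NL inside the item is written out as a Nul byte,
" so out.bin should contain the bytes: f o o <Nul> b a r
call writefile([s:line], 'out.bin', 'b')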
The byte ctrl-# is also known as '\0'. Many languages, programs, etc. use it as an "end of string" marker, so it's not surprising that vim gets confused there. If you must use this byte in the middle of a script string, it sounds like your workaround is a decent one.

Why would Perl's printf output the format specifier rather than the formatted number?

I'm trying to debug a problem on a remote user's site. We've narrowed it down to a problem with formatted output in Perl. The user swears up and down that
perl -e 'printf "Number: %lG\n", 0.1'
prints
Number: %lG
not
Number: 0.1
The user reports that their Perl is version 5.8. The oldest version I have around is 5.8.1, and it seems to behave correctly.
Any guesses? Misconfiguration? Module conflicts?
Quoting from the sprintf documentation:
Returns a string formatted by the usual printf conventions of the C library
function sprintf. See below for more details and see sprintf(3) or
printf(3) on your system for an explanation of the general principles.
In other words, like many built-ins, Perl just thinly wraps the C function, and the result is platform-dependent.
Perl's sprintf permits the following universally-known conversions:
%l is not part of it. My guess is that the remote user is not using GNU. He can find out exactly what is supported by his unwashed Unix by typing man 3 sprintf or man 3 printf.
It should only do that if it doesn't recognise the format specifier. For example:
pax> perl -e 'printf "Number: %q\n", 0.1'
Number: %q
I think you're going to have to go on site to figure this one out although you may want to first get them to cut and paste the text and a screen dump from, for example, HyperSnap demo, into an email so you can check it carefully. I only suggest that one since we use it internally and it has a free trial. You can use any decent screen capture program you like.
I originally thought that they may be typing in 1G (wun jee) instead of lG (ell jee) but the former still works.
I do notice that %IG (eye jee) will print out the text rather than the number but, unless they're using a particularly bad font, that should be a recognisable difference.
BTW, I only have 5.10 to work with so it may be a version problem. It's hard to imagine, however, that there'd be a major difference between 5.8 and 5.8.1.
It looks like it was a Perl build configuration issue. The Perl build supports a `d_longdbl` option, which indicates whether long doubles are allowed or not. You can test whether it is set on your machine with:
perl -V:d_longdbl
More info at perldoc sprintf.
Thanks for your input everybody.
Edit:
Nope, that wasn't it either. Close inspection of the sprintf documentation revealed that the modifiers for a long double are q, ll, and L, NOT l. l is a valid modifier for integer types. D'oh.
It looks like most installations of perl will silently ignore the l, and parse the rest of the modifier correctly. Except on our user's site. ☹ Anyway, the problem was fixed by using a valid modifier for a long double.
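For illustration, this is the shape of the eventual fix (whether %LG really gives you a long double depends on that build's d_longdbl setting):
my $x = 0.1;
printf "Number: %G\n",  $x;   # plain double: portable across builds
printf "Number: %LG\n", $x;   # long double: only meaningful if this perl has d_longdbl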
FYI, I played with the same format specifiers in the C printf.
printf("The number is %lG\n", 0.001);
printf("The number is %LG\n", 0.001);
The first call “worked”, printing out 0.001, but the second call printed out a garbage value until I properly specified the type of the numeric literal:
printf("The number is %LG\n", 0.001L);
Apparently the C printf is silently ignoring the improper l modifier. This makes me suspect that most Perl installations ignore it too.

Command-line arguments as bytes instead of strings in python3

I'm writing a python3 program, that gets the names of files to process from command-line arguments. I'm confused regarding what is the proper way to handle different encodings.
I think I'd rather consider filenames as bytes and not strings, since that avoids the danger of using an incorrect encoding. Indeed, some of my file names use an incorrect encoding (latin1 when my system locale uses utf-8), but that doesn't prevent tools like ls from working. I'd like my tool to be resilient to that as well.
I have two problems: the command-line arguments are given to me as strings (I use argparse), and I want to report errors to the user as strings.
I've successfully adapted my code to use bytes, and my tool can handle files whose names are invalid in the current default encoding, as long as it finds them by recursing through the filesystem, because I convert the arguments to bytes early and use bytes when calling filesystem functions. When I receive a filename argument which is invalid, however, it is handed to me as a unicode string with strange characters like \udce8. I do not know what these are, and trying to encode the string always fails, be it with utf8 or with the corresponding (wrong) encoding (latin1 here).
The other problem is reporting errors. I expect users of my tool to parse my stdout (hence wanting to preserve filenames exactly), but when reporting errors on stderr I'd rather encode them in utf-8, replacing invalid sequences with appropriate "invalid/question mark" characters.
So,
1) Is there a better, completely different way to do it? (Yes, fixing the filenames is planned, but I'd still like my tool to be robust.)
2) How do I get the command line arguments in their original binary form (not pre-decoded for me), knowing that for invalid sequences re-encoding the decoded argument will fail, and
3) How do I tell the utf-8 codec to replace invalid, undecodable sequences with some invalid mark rather than dying on me ?
When I receive a filename argument
which is invalid, however, it is
handed to me as a unicode string with
strange characters like \udce8.
Those are surrogate characters. The low 8 bits hold the original invalid byte.
See PEP 383: Non-decodable Bytes in System Character Interfaces.
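A minimal sketch (Python 3.2+; the reporting format is made up) of getting the original bytes back and still printing something readable on stderr:
import os
import sys

for arg in sys.argv[1:]:
    # Undo the PEP 383 decoding: the lone surrogates map back to the raw bytes.
    raw = os.fsencode(arg)   # roughly arg.encode(sys.getfilesystemencoding(), 'surrogateescape')

    # For error reporting, replace anything UTF-8 cannot represent instead of dying.
    printable = arg.encode('utf-8', errors='replace').decode('utf-8')
    print('processing:', printable, file=sys.stderr)

    with open(raw, 'rb') as f:   # open() and the os.* functions accept bytes paths
        data = f.read()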
Don't go against the grain: filenames are strings, not bytes.
You shouldn't use a bytes when you should use a string. A bytes is a tuple of integers. A string is a tuple of characters. They are different concepts. What you're doing is like using an integer when you should use a boolean.
(Aside: Python stores all strings in-memory under Unicode; all strings are stored the same way. Encoding specifies how Python converts the on-file bytes into this in-memory format.)
Your operating system stores filenames as strings under a specific encoding. I'm surprised you say that some filenames have different encodings; as far as I know, the filename encoding is system-wide. Functions like open default to the default system filename encoding, for example.