Which encoding uses the \x (backslash x) prefix?

I'm attempting to decode text which is prefixing certain 'special characters' with \x. I've worked out the following mappings by hand:
\x28 (
\x29 )
\x3a :
e.g. 12\x3a39\x3a03 AM
Does anyone recognise what this encoding is?

It's ASCII. All occurrences of the four characters \xST are converted to 1 character, whose ASCII code is ST (in hexadecimal), where S and T are any of 0123456789abcdefABCDEF.
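As a minimal sketch of that rule in Python (the helper name decode_hex_escapes is made up for illustration, and it assumes every escape is exactly \x followed by two hex digits, as in the question's data):

```python
import re

# A backslash, the letter x, and exactly two hexadecimal digits.
_HEX_ESCAPE = re.compile(r"\\x([0-9a-fA-F]{2})")

def decode_hex_escapes(text):
    """Replace each literal backslash-x-ST sequence with the character whose code is 0xST."""
    return _HEX_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

# The raw string keeps the backslashes literal, as they appear in the input data.
decoded = decode_hex_escapes(r"12\x3a39\x3a03 AM")
print(decoded)  # 12:39:03 AM
```

Because the pattern consumes only two digits, the "39" and "03" that follow each escape are left alone.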

The '\xAB' notation is used in C, C++, Perl, and other languages taking a cue from C, as a way of expressing hexadecimal character codes in the middle of a string.
When digits follow the backslash directly, as in '\007', the character code is given in octal.
In C99 and later, you can also use \uabcd and \U00abcdef to encode Unicode characters in hexadecimal (exactly 4 and 8 hex digits respectively; the first two hex digits after \U must be 0 to be valid, and the third is usually 0 too, since 1 is the only other valid value).
Note that in C, octal escapes are limited to a maximum of 3 digits but hexadecimal escapes are not limited to 2 or 3 digits; the hexadecimal escape ends at the first character that's not a hexadecimal digit. In the question, the sequence is "12\x3a39\x3a03". That is a string containing 4 characters: 1, 2, \x3a39 and \x3a03. The actual value used for the 4-digit hex characters is implementation-defined. To achieve the desired result (using \x3A to represent a colon :), the code would have to use string concatenation:
"12\x3a" "39\x3a" "03"
This now contains 8 characters: 1, 2, :, 3, 9, :, 0, 3.
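The difference between the two spellings can be illustrated with a small Python sketch that mimics the C rule. This is a hypothetical tokenizer written for illustration, not a C parser: it treats \x followed by as many hex digits as possible as one escape, and every other character as one token.

```python
import re

# Greedy C-style rule: \x swallows every hex digit that follows it.
# Anything else counts as a single ordinary character.
_TOKEN = re.compile(r"\\x[0-9a-fA-F]+|.", re.DOTALL)

def c_style_tokens(literal_body):
    """Split the body of a C string literal into one token per resulting character."""
    return _TOKEN.findall(literal_body)

# One literal: the digits after each \x3a are absorbed into the escape.
tokens_joined = c_style_tokens(r"12\x3a39\x3a03")

# Three adjacent literals: each escape ends where its literal ends.
tokens_split = [t for part in (r"12\x3a", r"39\x3a", r"03")
                for t in c_style_tokens(part)]

print(len(tokens_joined), tokens_joined)
print(len(tokens_split), tokens_split)
```

The first call yields 4 tokens and the second 8, matching the character counts given above.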

I use CyberChef for this sort of thing.
If you paste the text into the input field and drag Magic from the Favourites list into the recipe, it will tell you the conversion and that you could have used the From_Hex recipe with a \x delimiter.

I'm guessing that what you are dealing with is a Unicode string that was encoded differently from what the output stream it was sent to expects, e.g. a UTF-16 string written to a Latin-1 device. In that situation, certain characters are output as escape sequences to avoid sending control characters or wrong characters to the output device. This happens in Python, at least.
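One way this kind of escaping shows up in Python is the backslashreplace error handler, sketched below with a made-up sample string. Characters the target encoding cannot represent are written as escapes, while plain ASCII such as a colon passes through unchanged, so this mechanism alone would not turn ":" into \x3a.

```python
text = "caf\u00e9 \u20ac5"  # "café €5"

# ASCII can represent neither é nor €, so both are escaped.
ascii_bytes = text.encode("ascii", errors="backslashreplace")

# Latin-1 can represent é (byte 0xE9) but not €, so only € is escaped.
latin1_bytes = text.encode("latin-1", errors="backslashreplace")

# A colon is plain ASCII and is never escaped by this handler.
colon_bytes = ":".encode("ascii", errors="backslashreplace")

print(ascii_bytes)
print(latin1_bytes)
print(colon_bytes)
```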

Related

PATINDEX does not recognize dot and comma

I have a column that should contain phone numbers but it contains whatever the user wanted. I need to create an update to remove all the characters after an invalid character.
To do this I am using a pattern with PATINDEX, PATINDEX('%[^0-9+-/()" "]%', [MobilNr]), and it seemed to work until I had some numbers such as +1235, 36446, where to my surprise the result is 0 instead of 6. Also, if the number contains . it returns 0.
Does PATINDEX ignore dot (".") and comma (",")? Are there other characters that PATINDEX will ignore?
It's not that PATINDEX ignores the comma and the dot, it's your pattern that created this problem.
With PATINDEX, the hyphen char (-) has a special meaning - it's in fact an operator that denotes an inclusive range - like 0-9 denotes all digits between 0 and 9 - so when you do +-/ it means all the chars between + and / (inclusive, of course). The comma and dot chars are within this range, that's why you get this result.
Fixing the pattern is easy: simply move the hyphen to the end of the bracketed set, where it is treated as a literal character instead of a range operator:
SELECT PATINDEX('%[^0-9/()" "+-]%', '+1235, 36446') -- Result: 6
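The same range behaviour can be reproduced outside SQL Server. The sketch below uses Python's re module as an analogy only (it is not PATINDEX, and the helper first_invalid is hypothetical), since a hyphen between two characters inside a bracketed set also denotes a range there.

```python
import re

# Every character from '+' (0x2B) through '/' (0x2F), inclusive.
plus_to_slash = [chr(c) for c in range(ord("+"), ord("/") + 1)]

def first_invalid(pattern, value):
    """Return the 1-based position of the first match, or 0 if there is none."""
    m = re.search(pattern, value)
    return m.start() + 1 if m else 0

# Hyphen between + and /: a range that silently allows ',' and '.'.
range_result = first_invalid(r'[^0-9+-/()" ]', "+1235, 36446")

# Hyphen moved to the end: a literal hyphen, so ',' is now rejected.
fixed_result = first_invalid(r'[^0-9/()" +-]', "+1235, 36446")

print(plus_to_slash)               # ['+', ',', '-', '.', '/']
print(range_result, fixed_result)  # 0 6
```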

Combining characters using their hexa name

This works:
say "\c[COMBINING BREVE, COMBINING DOT ABOVE]" # OUTPUT: «̆̇␤»
However, this does not:
say "\c[0306, 0307]"; # OUTPUT: «IJij␤»
It's treating it as two different characters. Is there a way to make it work directly by using the numbers, other than using uniname to convert them to names?
The \c[…] escape is for declaring a character by its name or an alias.
0306 is not a name, it is the ordinal/codepoint of a character.
The \x[…] escape is for declaring a character by its hexadecimal ordinal.
say "\x[0306, 0307]"; # OUTPUT: «̆̇␤»
(Hint: There is an x in a hexadecimal literal 0x0306)
\c uses decimal numbers:
say "\c[774, 775]"
where 774 is the decimal equivalent of 0306, works perfectly.
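The name / hexadecimal / decimal relationship can be cross-checked in Python with the standard unicodedata module. This is a sketch for verification only, not Raku code:

```python
import unicodedata

# By name, by hexadecimal codepoint, and by decimal codepoint.
by_name = (unicodedata.lookup("COMBINING BREVE")
           + unicodedata.lookup("COMBINING DOT ABOVE"))
by_hex = chr(0x0306) + chr(0x0307)
by_decimal = chr(774) + chr(775)

# What happens when 0306 and 0307 are read as the decimal numbers 306 and 307.
misread = chr(306) + chr(307)
misread_names = [unicodedata.name(ch) for ch in misread]

print(by_name == by_hex == by_decimal)  # True
print(misread_names)
```

Decimal 306 and 307 are U+0132 and U+0133, the IJ ligatures, which explains the unexpected output in the question.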

Character literal for vertical tab?

How can I write a character literal for a vertical tab ('\v', ASCII 11) in Scala?
'\v' doesn't work. (invalid escape character)
'\11' should be it, but...
scala> '\11'.toInt
res13: Int = 9
But 9 is the ASCII code for a normal tab('\t'). What is going on there?
EDIT: This works and produces the right character, but I'd still like to know the syntax for a literal.
val c:Char = 11
You need to use '\13'. It's in octal.
For more information, see the Scala Language Specification.
1.3.4 Character Literals
Syntax:
characterLiteral ::= ‘\’’ printableChar ‘\’’ | ‘\’’ charEscapeSeq ‘\’’
A character literal is a single character enclosed in quotes. The
character is either a printable unicode character or is described by
an escape sequence (§1.3.6).
Example 1.3.4 Here are some character
literals: ’a’ ’\u0041’ ’\n’ ’\t’ Note that ‘\u000A’ is not a valid
character literal because Unicode conversion is done before literal
parsing and the Unicode character \u000A (line feed) is not a
printable character. One can use instead the escape sequence ‘\n’ or
the octal escape ‘\12’ (§1.3.6).
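The octal arithmetic behind '\11' versus '\13' can be checked quickly. The sketch below uses Python, which happens to accept the same style of octal escape in string literals; it is not Scala code.

```python
# Octal 11 is 1*8 + 1 = 9 (horizontal tab); octal 13 is 1*8 + 3 = 11 (vertical tab).
tab_code = int("11", 8)
vt_code = int("13", 8)

# Python's octal escape "\13" denotes the same character as "\v" and "\x0b".
vt_char = "\13"

print(tab_code, vt_code)  # 9 11
print(vt_char == "\v")    # True
```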

CAM::PDF returning non ascii character instead of quotes

I am having trouble with non ascii characters being returned. I am not sure at which level the issue resides. It could be the actual PDF encoding, the decoding used by CAM::PDF (which is FlateDecode) or CAM::PDF itself. The following returns a string full of the commands used to create the PDF (Tm, Tj, etc).
use CAM::PDF;
my $filename = "sample.pdf";
my $cam_obj = CAM::PDF->new($filename) or die "$CAM::PDF::errstr\n";
my $tree = $cam_obj->getPageContentTree(1);
my $page_string = $tree->toString();
print $page_string;
You can download sample.pdf here
The text returned in the Tj often has one character which is non ASCII. In the PDF, the actual character is almost always a quote, single or double.
While reproducing this I found that the returned character is consistent within the PDF but varies amongst PDFs. I also noticed the PDF is using a specific font file. I'm now looking into font files to see if the same character can be mapped to varying binary values.
:edit:
Regarding Windows-1252. My PDF returns an "Õ" instead of apostrophes. The Õ character is hex 0xD5 in Windows-1252 and codepoint U+00D5 in Unicode (in UTF-8 it is the two bytes 0xC3 0x95). If the apostrophe were encoded with Windows-1252, the byte would be hex 0x91 or 0x92, which it is not. That is why the following does nothing to the character:
use Encode qw(decode encode);
my $page_string = 'Õ';
my $characters = decode 'Windows-1252', $page_string;
my $octets = encode 'UTF-8', $characters;
open my $sts, '>', 'TEST.txt' or die "Cannot open TEST.txt: $!";
print {$sts} $octets . "\n";
close $sts;
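The byte values mentioned above can be checked with a short Python sketch (for verification only, not part of the Perl script):

```python
o_tilde = "\u00d5"      # Õ
left_quote = "\u2018"   # left single quotation mark
right_quote = "\u2019"  # right single quotation mark, the typographic apostrophe

# Õ is the single byte 0xD5 in Windows-1252 but two bytes in UTF-8.
o_tilde_cp1252 = o_tilde.encode("cp1252")
o_tilde_utf8 = o_tilde.encode("utf-8")

# The curly quotes live at 0x91 and 0x92 in Windows-1252.
quotes_cp1252 = left_quote.encode("cp1252") + right_quote.encode("cp1252")

# Decoding 0xD5 as Windows-1252 and re-encoding as UTF-8 still yields Õ.
round_trip = b"\xd5".decode("cp1252").encode("utf-8").decode("utf-8")

print(o_tilde_cp1252, o_tilde_utf8, quotes_cp1252, round_trip)
```

The round trip shows why the decode/encode pair leaves the character as Õ: byte 0xD5 is a valid Windows-1252 letter, so nothing is remapped to a quote.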
I'm the author of CAM-PDF. Your PDF is non-compliant. From the PDF 1.7 specification, section 3.2.3 "String Objects":
"Within a literal string, the backslash (\) is used as an escape
character for various purposes, such as to include newline characters,
nonprinting ASCII characters, unbalanced parentheses, or the backslash
character itself in the string. [...] The \ddd escape sequence provides
a way to represent characters outside the printable ASCII character set."
If you have large quantities of non-ASCII characters, you can represent them using hexadecimal string notation.
EDIT: Perhaps my interpretation of the spec is incorrect, given a_note's alternative answer. I'll have to revisit this... Certainly, the spec could be clearer in this area.
Sorry to intrude, and with all due respect, sir, but the file IS compliant. Section 3.2.3 further states:
[The \ddd] notation provides a way to specify characters outside the
7-bit ASCII character set by using ASCII characters only. However,
any 8-bit value may appear in a string.
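To illustrate the \ddd notation quoted above, here is a simplified Python sketch. The helper decode_octal_escapes is hypothetical and handles only one-to-three-digit octal escapes; it ignores the other PDF escapes such as \n, \( or \\, and keeping only the low 8 bits of the value is an assumption of this sketch.

```python
import re

# A backslash followed by one to three octal digits.
_OCTAL_ESCAPE = re.compile(rb"\\([0-7]{1,3})")

def decode_octal_escapes(literal):
    """Replace each backslash-ddd escape with the single byte it denotes (low 8 bits)."""
    return _OCTAL_ESCAPE.sub(
        lambda m: bytes([int(m.group(1), 8) & 0xFF]), literal
    )

# Octal 325 is 3*64 + 2*8 + 5 = 213 = 0xD5, the byte seen in the question.
escaped = decode_octal_escapes(rb"It\325s")

# A raw 8-bit byte in the string passes through untouched.
raw = decode_octal_escapes(b"It\xd5s")

print(escaped, raw)
```

Both spellings produce the same bytes, which is the point of the second quotation: the escape is a convenience, not a requirement.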
"receiving" - where? You get "Õ" instead of expected what? And doing exactly what? You know that windows command prompt uses dos code page, not windows-1252, right? (oops, new thread again... probably i should register here :-) )

Check characters inside string for their Unicode value

I would like to replace characters with certain Unicode values in a variable with a dash. I have two ideas which might work, but I do not know how to check the value of a character:
1/ processing the variable as a string, checking every character's value and placing these characters in a new variable (replacing those characters which are invalid)
2/ using this magic :-)
$variable =~ s/[$char_range]/-/g;
char_range should be similar to [0-9] or [A-Z], but it should hold values for UTF-8 characters. I need the range from 0x00 to 0x7F, to be exact.
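The following expression should replace anything that is not ASCII with a hyphen, which is (I think) what you want to do. Note that the range stops at U+FFFF, so characters above that are not matched unless you raise the upper bound to U+10FFFF: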
s/[\N{U+0080}-\N{U+FFFF}]/-/g
There's no such thing as UTF-8 characters. There are only characters that you encode into UTF-8. Even then, you don't want to make ranges outside of the magical ones that Perl knows about. You're likely to get more than you expect.
To get the ordinal value for a character, use ord:
use utf8;
use feature 'say'; # needed for say
my $code_number = ord '😸'; # U+1F638
say sprintf "%#x", $code_number; # prints 0x1f638
However, I don't think that's what you need. It sounds like you want to replace characters in the ASCII range with a -. You can specify ranges of code numbers:
s/[\000-\177]/-/g; # in octal
s/[\x00-\x7f]/-/g; # in hexadecimal
You can specify wide character ordinal values in braces:
s/[\x80-\x{10ffff}]/-/g; # wide characters, replace non-ASCII in this case
When the characters have a common property, you can use that:
s/\p{ASCII}/-/g;
However, if you are replacing things character for character, you might want a transliteration:
$string =~ tr/\000-\177/-/;
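For comparison, the same operations can be sketched in Python. The function names are made up for illustration and this is not a translation of any particular Perl idiom:

```python
import re

# Ordinal value of a character, formatted as hexadecimal.
code_number = ord("\U0001F638")  # the grinning cat emoji from the Perl example
hex_text = f"{code_number:#x}"

def replace_non_ascii(text):
    """Replace every character outside 0x00-0x7F with a hyphen."""
    return re.sub(r"[^\x00-\x7f]", "-", text)

def replace_ascii(text):
    """Replace every character inside 0x00-0x7F with a hyphen."""
    return re.sub(r"[\x00-\x7f]", "-", text)

print(hex_text)                                    # 0x1f638
print(replace_non_ascii("caf\u00e9 \U0001F638"))   # caf- -
print(replace_ascii("a\u00e9b"))
```

Negating the ASCII range avoids having to spell out an upper bound for the non-ASCII characters.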