BreakPermittedHere char in filename - unicode

I received a .pdf file as a mail attachment with a U+0082 (Break Permitted Here) character where an é was presumably intended:
Sciences Num<here>riques et Technologie.pdf
What could have happened?

It's a simple mojibake case: the sender and the receiver apply different code pages.
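A minimal sketch of that round trip in Python (the exact code pages are an assumption; any OEM/ISO pairing with the same byte assignments would do):
# sender's side: an OEM code page such as CP850 encodes é as the single byte 0x82
raw = 'é'.encode('cp850')          # b'\x82'
# receiver's side: the same byte decoded as ISO 8859-1 (Latin-1) becomes U+0082
mangled = raw.decode('latin_1')
print('U+%04X' % ord(mangled))     # U+0082, i.e. BREAK PERMITTED HERE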
Example run for the given characters: .\Py\mojibakeWindows.py é \x82
Mojibake prove using 97 codecs
string é ['U+00e9'] Latin Small Letter E With Acute
versus ['U+0082'] ??? Cc
[b'\x82']
é ['cp437', 'cp720', 'cp775', 'cp850', 'cp852', 'cp857', 'cp858', 'cp860', 'cp861', 'cp863', 'cp865']
['cp1006', 'latin_1', 'iso8859_2', 'iso8859_3', 'iso8859_4', 'iso8859_5', 'iso8859_6', 'iso8859_7', 'iso8859_8', 'iso8859_9', 'iso8859_10', 'iso8859_13', 'iso8859_14', 'iso8859_15', 'iso8859_16', 'iso8859_11']
The mojibakeWindows.py script is as follows:
import sys
import codecs
import unicodedata

if len(sys.argv) == 3:
    str1st = sys.argv[1].encode('raw_unicode_escape').decode('unicode_escape').encode('utf-16_BE', 'surrogatepass').decode('utf-16_BE')
    str2nd = sys.argv[2].encode('raw_unicode_escape').decode('unicode_escape').encode('utf-16_BE', 'surrogatepass').decode('utf-16_BE')
else:
    print('need two `string` parameters e.g. as follows:')
    print(sys.argv[0], '"╧╤╪"', '"ÏÑØ"')
    sys.exit()
codec_list = ['ascii', 'big5', 'big5hkscs', 'cp037', 'cp424', 'cp437', 'cp500', 'cp720', 'cp737', 'cp775', 'cp850', 'cp852', 'cp855', 'cp856', 'cp857', 'cp858', 'cp860', 'cp861', 'cp862', 'cp863', 'cp864', 'cp865', 'cp866', 'cp869', 'cp874', 'cp875', 'cp932', 'cp949', 'cp950', 'cp1006', 'cp1026', 'cp1140', 'cp1250', 'cp1251', 'cp1252', 'cp1253', 'cp1254', 'cp1255', 'cp1256', 'cp1257', 'cp1258', 'euc_jp', 'euc_jis_2004', 'euc_jisx0213', 'euc_kr', 'gb2312', 'gbk', 'gb18030', 'hz', 'iso2022_jp', 'iso2022_jp_1', 'iso2022_jp_2', 'iso2022_jp_2004', 'iso2022_jp_3', 'iso2022_jp_ext', 'iso2022_kr', 'latin_1', 'iso8859_2', 'iso8859_3', 'iso8859_4', 'iso8859_5', 'iso8859_6', 'iso8859_7', 'iso8859_8', 'iso8859_9', 'iso8859_10', 'iso8859_13', 'iso8859_14', 'iso8859_15', 'iso8859_16', 'johab', 'koi8_r', 'koi8_u', 'mac_cyrillic', 'mac_greek', 'mac_iceland', 'mac_latin2', 'mac_roman', 'mac_turkish', 'ptcp154', 'shift_jis', 'shift_jis_2004', 'shift_jisx0213', 'utf_32', 'utf_32_be', 'utf_32_le', 'utf_16', 'utf_16_be', 'utf_16_le', 'utf_7', 'utf_8', 'utf_8_sig'] + ['cp273', 'cp1125', 'iso8859_11', 'koi8_t', 'kz1048']; # 'cp65001',
print('Mojibake prove using', len(codec_list), 'codecs')
str1stname = unicodedata.name(str1st, '??? {}'.format(unicodedata.category(str1st))) if len(str1st) == 1 else ''
str2ndname = unicodedata.name(str2nd, '??? {}'.format(unicodedata.category(str2nd))) if len(str2nd) == 1 else ''
print('string', str1st, ['U+{0:04x}'.format(ord(ch)) for ch in str1st], str1stname.title())
print('versus', str2nd, ['U+{0:04x}'.format(ord(ch)) for ch in str2nd], str2ndname.title())
str1list = []
str2list = []
strXlist = []
for cod in codec_list:
    for doc in codec_list:
        if cod != doc:
            # str1ste = codecs.encode(str1st, encoding=cod, errors='replace')
            try:
                str1ste = codecs.encode(str1st, encoding=cod, errors='strict')
            except:
                str1ste = b'?' * len(str1st)
            # str2nde = codecs.encode(str2nd, encoding=doc, errors='replace')
            try:
                str2nde = codecs.encode(str2nd, encoding=doc, errors='strict')
            except:
                str2nde = b'?' * len(str2nd)
            if str1ste == str2nde and b'?' not in str1ste:
                if cod not in str1list: str1list.append(cod)
                if doc not in str2list: str2list.append(doc)
                if str1ste not in strXlist: strXlist.append(str1ste)
print(strXlist)
print(str1st, str1list)
print(str2nd, str2list)
Another example uses my Alt KeyCode Finder script (see the Alt0 column for the ACP code and the Dec column for the OEMCP code): powershell -COMMAND .\PShell\MyCharMap.ps1 é,0x82
Ch Unicode Dec CP IME Alt Alt0 IME 0405/cs-CZ; CP65001; ACP 65001
é U+00E9 233 …233… Latin Small Letter E With Acute
130 CP437 en-US 0233 (ACP 1252) US & Western Eu
130 CP850 en-GB 0233 (ACP 1252) US & Western Eu
130 CP852 cs-CZ 0233 (ACP 1250) Central Europe
130 CP775 et-EE 0233 (ACP 1257) Baltic
130 CP857 tr-TR 0233 (ACP 1254) Turkish
130 CP720 ar-EG 0233 (ACP 1256) Arabic
vi-VN 0233 (ACP 1258) Vietnamese
U+0082 130 …130… Break Permitted Here
130 CP869 el-gr (ACP 1253) Greek-2
th-TH 0130 (ACP 874) Thai
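If the mismatch really is an OEM code page such as CP850 versus Latin-1, the damaged filename from the question can be repaired by reversing the round trip (a sketch, not a general fix):
name = 'Sciences Num\u0082riques et Technologie.pdf'
print(name.encode('latin_1').decode('cp850'))   # Sciences Numériques et Technologie.pdf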

Related

What does the \u{...} notation mean in Unicode and why are only some characters displayed like this in the CLDR project?

In this link you will find the most-used characters for each language. Why are some characters in some languages displayed with the \u{...} notation?
I think that what is inside the braces is the hexadecimal code of the character, but I can't understand why they would do this only for some characters.
The character sequences enclosed in curly brackets {} are digraphs (trigraphs, …) that count as a distinct letter in the given language (supposedly with its own place in the alphabet), for instance:
the digraph {ch} in cs (Czech);
the trigraph {dzs} in hu (Hungarian);
more complex digraph examples in kkj (the Kako language) are shown by the following Python snippet:
>>> kkj='[a á à â {a\u0327} b ɓ c d ɗ {ɗy} e é è ê ɛ {ɛ\u0301} {ɛ\u0300} {ɛ\u0302} {ɛ\u0327} f g {gb} {gw} h i í ì î {i\u0327} j k {kp} {kw} l m {mb} n {nd} nj {ny} ŋ {ŋg} {ŋgb} {ŋgw} o ó ò ô ɔ {ɔ\u0301} {ɔ\u0300} {ɔ\u0302} {ɔ\u0327} p r s t u ú ù û {u\u0327} v w y]'
>>> print( kkj)
[a á à â {a̧} b ɓ c d ɗ {ɗy} e é è ê ɛ {ɛ́} {ɛ̀} {ɛ̂} {ɛ̧} f g {gb} {gw} h i í ì î {i̧} j k {kp} {kw} l m {mb} n {nd} nj {ny} ŋ {ŋg} {ŋgb} {ŋgw} o ó ò ô ɔ {ɔ́} {ɔ̀} {ɔ̂} {ɔ̧} p r s t u ú ù û {u̧} v w y]
>>>
For instance, {a\u0327} renders as {a̧}, i.e. something like a Latin small letter a with a combining cedilla, which has no precomposed Unicode code point. A counterexample:
ņ (U+0146) Latin Small Letter N With Cedilla with decomposition 006E 0327:
>>> import unicodedata
>>> print( 'ņ', unicodedata.normalize('NFC','{n\u0327}'))
ņ {ņ}
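A quick way to test whether such a base letter + combining mark pair has a precomposed code point is to check whether NFC collapses it into a single character (a sketch; the helper name is mine):
import unicodedata

def has_precomposed(seq):
    # True if NFC composes the whole sequence into one code point
    return len(unicodedata.normalize('NFC', seq)) == 1

print(has_precomposed('n\u0327'))   # True  -> ņ (U+0146)
print(has_precomposed('a\u0327'))   # False -> no precomposed a-with-cedilla exists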
Edit:
Characters presented as Unicode literals (\uxxxx = the character with the 16-bit hex value xxxx) are unrenderable ones (or at least hard to render). The following Python script shows some of them (Bidi_Class values: L = Left_To_Right, R = Right_To_Left, NSM = Nonspacing_Mark, BN = Boundary_Neutral):
# -*- coding: utf-8 -*-
import unicodedata
pa = 'ੱੰ਼੍ੁੂੇੈੋੌ'
pa = '\u0327 \u0A71 \u0A70 \u0A3C ੦ ੧ ੨ ੩ ੪ ੫ ੬ ੭ ੮ ੯ ੴ ੳ ਉ ਊ ਓ ਅ ਆ ਐ ਔ ੲ ਇ ਈ ਏ ਸ {ਸ\u0A3C} ਹ ਕ ਖ {ਖ\u0A3C} ਗ {ਗ\u0A3C} ਘ ਙ ਚ ਛ ਜ {ਜ\u0A3C} ਝ ਞ ਟ ਠ ਡ ਢ ਣ ਤ ਥ ਦ ਧ ਨ ਪ ਫ {ਫ\u0A3C} ਬ ਭ ਮ ਯ ਰ ਲ ਵ ੜ \u0A4D ਾ ਿ ੀ \u0A41 \u0A42 \u0A47 \u0A48 \u0A4B \u0A4C'
pa = '\u0300 \u0301 \u0302 \u1DC6 \u1DC7 \u0A71 \u0A70 \u0A3C \u0A4D \u0A41 \u0A42 \u0A47 \u0A48 \u0A4B \u0A4C \u05B7 \u05B8 \u05BF \u200C \u200D \u200E \u200F \u064B \u064C \u064E \u064F \u0650'
# above examples from ·kkj· ·bas· ·pa· ·yi· ·kn· ·ur· ·mzn·
print( pa )
for chr in pa:
    if chr != ' ':
        if chr == '{' or chr == '}':
            print(chr)
        else:
            print('\\u%04x' % ord(chr), chr,
                  unicodedata.category(chr),
                  unicodedata.bidirectional(chr) + '\t',
                  str(unicodedata.combining(chr)) + '\t',
                  unicodedata.name(chr, '?'))
Result: .\SO\63659122.py
̀ ́ ̂ ᷆ ᷇ ੱ ੰ ਼ ੍ ੁ ੂ ੇ ੈ ੋ ੌ ַ ָ ֿ ‌ ‍ ‎ ‏ ً ٌ َ ُ ِ
\u0300 ̀ Mn NSM 230 COMBINING GRAVE ACCENT
\u0301 ́ Mn NSM 230 COMBINING ACUTE ACCENT
\u0302 ̂ Mn NSM 230 COMBINING CIRCUMFLEX ACCENT
\u1dc6 ᷆ Mn NSM 230 COMBINING MACRON-GRAVE
\u1dc7 ᷇ Mn NSM 230 COMBINING ACUTE-MACRON
\u0a71 ੱ Mn NSM 0 GURMUKHI ADDAK
\u0a70 ੰ Mn NSM 0 GURMUKHI TIPPI
\u0a3c ਼ Mn NSM 7 GURMUKHI SIGN NUKTA
\u0a4d ੍ Mn NSM 9 GURMUKHI SIGN VIRAMA
\u0a41 ੁ Mn NSM 0 GURMUKHI VOWEL SIGN U
\u0a42 ੂ Mn NSM 0 GURMUKHI VOWEL SIGN UU
\u0a47 ੇ Mn NSM 0 GURMUKHI VOWEL SIGN EE
\u0a48 ੈ Mn NSM 0 GURMUKHI VOWEL SIGN AI
\u0a4b ੋ Mn NSM 0 GURMUKHI VOWEL SIGN OO
\u0a4c ੌ Mn NSM 0 GURMUKHI VOWEL SIGN AU
\u05b7 ַ Mn NSM 17 HEBREW POINT PATAH
\u05b8 ָ Mn NSM 18 HEBREW POINT QAMATS
\u05bf ֿ Mn NSM 23 HEBREW POINT RAFE
\u200c ‌ Cf BN 0 ZERO WIDTH NON-JOINER
\u200d ‍ Cf BN 0 ZERO WIDTH JOINER
\u200e ‎ Cf L 0 LEFT-TO-RIGHT MARK
\u200f ‏ Cf R 0 RIGHT-TO-LEFT MARK
\u064b ً Mn NSM 27 ARABIC FATHATAN
\u064c ٌ Mn NSM 28 ARABIC DAMMATAN
\u064e َ Mn NSM 30 ARABIC FATHA
\u064f ُ Mn NSM 31 ARABIC DAMMA
\u0650 ِ Mn NSM 32 ARABIC KASRA
It seems that all code points which don't have a well-defined stand-alone appearance (or are not meant to be used as stand-alone characters) are represented with this notation.
For example, U+0A3C appears in the "character" {ਫ\u0A3C}; U+0A3C is a combining code point that modifies the character before it.

How to pick out the binary part from a multipart file with powershell?

I have a multipart file received from a server and I need to pick out the PDF part from it. I tried removing the first x lines and the last 2 with
$content = Get-Content $originalfile
$content[0..($content.length-3)] | Out-File $outfile
but that corrupts the binary data, so what is the right way to get the binary part out of the file?
MIME-Version: 1.0
Content-Type: multipart/related; boundary=MIME_Boundary;
start="<6624867311297537120--4d6a31bb.16a77205e4d.3282>";
type="text/xml"
--MIME_Boundary
Content-ID: <6624867311297537120--4d6a31bb.16a77205e4d.3282>
Content-Type: text/xml; charset=utf-8
Content-Transfer-Encoding: 8bit
<?xml version="1.0" encoding="UTF-8"?>
<soapenv:Body xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"/>
--MIME_Boundary
Content-ID:
Content-Type: application/xml
Content-Disposition: form-data; name="metadata"
<?xml version="1.0" encoding="ISO-8859-1"?>
<metadata><contentLength>64288</contentLength><etag>7e3da21f7ed1b434def94f4b</etag><contentType>application/octet-stream</contentType><properties><property><key>Account</key><value>finance</value></property><property><key>Business Unit</key><value>EU DEBMfg</value></property><property><key>Document Type</key><value>PAYABLES</value></property><property><key>Filename</key><value>test-pdf.pdf</value></property></properties></metadata>
--MIME_Boundary
Content-ID:
Content-Type: application/octet-stream
Content-Disposition: form-data; name="content"
%PDF-1.6
%âãÏÓ
37 0 obj <</Linearized 1/L 20597/O 40/E 14115/N 1/T 19795/H [ 1005 215]>>
endobj
xref
37 34
0000000016 00000 n
0000001386 00000 n
0000001522 00000 n
0000001787 00000 n
0000002250 00000 n
.
.
.
0000062787 00000 n
0000063242 00000 n
trailer
<<
/Size 76
/Prev 116
/Root 74 0 R
/Encrypt 38 0 R
/Info 75 0 R
/ID [ <C21F21EA44C1E2ED2581435FA5A2DCCE> <3B7296EB948466CB53FB76CC134E3E76> ]
>>
startxref
63926
%%EOF
--MIME_Boundary-
You need to read the file as a series of bytes and treat it as a binary file.
Next, to parse out the PDF part, you need to read it again as a string so you can run a regular expression over it.
That string must be in an encoding that does not alter the bytes in any way; for that there is the special code page 28591 (ISO 8859-1), in which every byte of the original file maps 1-to-1 to a character.
To do this, I've written the following helper function:
function ConvertTo-BinaryString {
    # converts the bytes of a file to a string that has a
    # 1-to-1 mapping back to the file's original bytes.
    # Useful for performing binary regular expressions.
    Param (
        [Parameter(Mandatory = $True, ValueFromPipeline = $True, Position = 0)]
        [ValidateScript( { Test-Path $_ -PathType Leaf } )]
        [String]$Path
    )
    $Stream = New-Object System.IO.FileStream -ArgumentList $Path, 'Open', 'Read'
    # Note: Codepage 28591 (ISO 8859-1) returns a 1-to-1 char to byte mapping
    $Encoding = [Text.Encoding]::GetEncoding(28591)
    $StreamReader = New-Object System.IO.StreamReader -ArgumentList $Stream, $Encoding
    $BinaryText = $StreamReader.ReadToEnd()
    $StreamReader.Close()
    $Stream.Close()
    return $BinaryText
}
Using the above function, you should be able to get the binary part from the multipart file like this:
$inputFile = 'D:\blah.txt'
$outputFile = 'D:\blah.pdf'
# read the file as byte array
$fileBytes = [System.IO.File]::ReadAllBytes($inputFile)
# and again as string where every byte has a 1-to-1 mapping to the file's original bytes
$binString = ConvertTo-BinaryString -Path $inputFile
# create your regex, all as ASCII byte characters: '%PDF.*%%EOF[\r?\n]{0,2}'
$regex = [Regex]'(?s)(\x25\x50\x44\x46[\x00-\xFF]*\x25\x25\x45\x4F\x46[\x0D\x0A]{0,2})'
$match = $regex.Match($binString)
# use a MemoryStream object to store the result
$stream = New-Object System.IO.MemoryStream
$stream.Write($fileBytes, $match.Index, $match.Length)
# save the binary data of the match as a series of bytes
[System.IO.File]::WriteAllBytes($outputFile, $stream.ToArray())
# clean up
$stream.Dispose()
Regex details:
( Match the regular expression below and capture its match into backreference number 1
\x25 Match the ASCII or ANSI character with position 0x25 (37 decimal => %) in the character set
\x50 Match the ASCII or ANSI character with position 0x50 (80 decimal => P) in the character set
\x44 Match the ASCII or ANSI character with position 0x44 (68 decimal => D) in the character set
\x46 Match the ASCII or ANSI character with position 0x46 (70 decimal => F) in the character set
[\x00-\xFF] Match a single character in the range between ASCII character 0x00 (0 decimal) and ASCII character 0xFF (255 decimal)
* Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\x25 Match the ASCII or ANSI character with position 0x25 (37 decimal => %) in the character set
\x25 Match the ASCII or ANSI character with position 0x25 (37 decimal => %) in the character set
\x45 Match the ASCII or ANSI character with position 0x45 (69 decimal => E) in the character set
\x4F Match the ASCII or ANSI character with position 0x4F (79 decimal => O) in the character set
\x46 Match the ASCII or ANSI character with position 0x46 (70 decimal => F) in the character set
[\x0D\x0A] Match a single character present in the list below
ASCII character 0x0D (13 decimal)
ASCII character 0x0A (10 decimal)
{0,2} Between zero and 2 times, as many times as possible, giving back as needed (greedy)
)
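For comparison, the same idea (a regex over the raw bytes, with no text decoding at all) as a quick sketch in Python; the file names are placeholders:
import re

with open('blah.txt', 'rb') as f:                 # the multipart file
    data = f.read()

m = re.search(rb'%PDF.*%%EOF(?:\r?\n){0,2}', data, re.DOTALL)
if m:
    with open('blah.pdf', 'wb') as f:             # the extracted PDF part
        f.write(m.group(0))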

Unicode letters with more than 1 alphabetic latin character?

I'm not really sure how to express it, but I'm searching for Unicode letters that look like more than one Latin letter.
I found this in Word so far:
DZ
Dz
dz
NJ
Lj
LJ
Nj
nj
Any others?
Here are some of the characters I've found. I first did this manually by looking through some likely blocks; later I wrote a Python script to do it automatically, which you can find at the end of this answer.
Digraphs

Two glyphs     Digraph     Unicode code points
DZ, Dz, dz     Ǳ, ǲ, ǳ     U+01F1 U+01F2 U+01F3
DŽ, Dž, dž     Ǆ, ǅ, ǆ     U+01C4 U+01C5 U+01C6
IJ, ij         Ĳ, ĳ        U+0132 U+0133
LJ, Lj, lj     Ǉ, ǈ, ǉ     U+01C7 U+01C8 U+01C9
NJ, Nj, nj     Ǌ, ǋ, ǌ     U+01CA U+01CB U+01CC
Ligatures

Non-ligature         Ligature    Unicode
AA, aa               Ꜳ, ꜳ        U+A732, U+A733
AE, ae               Æ, æ        U+00C6, U+00E6
AO, ao               Ꜵ, ꜵ        U+A734, U+A735
AU, au               Ꜷ, ꜷ        U+A736, U+A737
AV, av               Ꜹ, ꜹ        U+A738, U+A739
AV, av (with bar)    Ꜻ, ꜻ        U+A73A, U+A73B
AY, ay               Ꜽ, ꜽ        U+A73C, U+A73D
et                   🙰           U+1F670
ff                   ﬀ           U+FB00
ffi                  ﬃ           U+FB03
ffl                  ﬄ           U+FB04
fi                   ﬁ           U+FB01
fl                   ﬂ           U+FB02
OE, oe               Œ, œ        U+0152, U+0153
OO, oo               Ꝏ, ꝏ        U+A74E, U+A74F
ſs, ſz               ẞ, ß        U+1E9E, U+00DF
st                   ﬆ           U+FB06
ſt                   ﬅ           U+FB05
TZ, tz               Ꜩ, ꜩ        U+A728, U+A729
ue                   ᵫ           U+1D6B
VY, vy               Ꝡ, ꝡ        U+A760, U+A761
There are a few other ligatures that are used for phonetic transcription but look like Latin characters:
Non-ligature    Ligature    Unicode
db              ȸ           U+0238
dz              ʣ           U+02A3
IJ, ij          Ĳ, ĳ        U+0132, U+0133
ls              ʪ           U+02AA
lz              ʫ           U+02AB
qp              ȹ           U+0239
ts              ʦ           U+02A6
ui              ꭐ           U+AB50
turned ui       ꭑ           U+AB51
https://en.wikipedia.org/wiki/List_of_precomposed_Latin_characters_in_Unicode#Digraphs_and_ligatures
Edit:
There are more letterlike symbols besides ℻ and ℡, like the ones the OP found in the comments:
℀ ℁ ⅍ ℅ ℆ ℔ ℠ ™
Longer letters are mainly from the CJK Compatibility block
U+338x: ㎀ ㎁ ㎂ ㎃ ㎄ ㎅ ㎆ ㎇ ㎈ ㎉ ㎊ ㎋ ㎌ ㎍ ㎎ ㎏
U+339x: ㎐ ㎑ ㎒ ㎓ ㎔ ㎕ ㎖ ㎗ ㎘ ㎙ ㎚ ㎛ ㎜ ㎝ ㎞ ㎟
U+33Ax: ㎠ ㎡ ㎢ ㎣ ㎤ ㎥ ㎦ ㎧ ㎨ ㎩ ㎪ ㎫ ㎬ ㎭ ㎮ ㎯
U+33Bx: ㎰ ㎱ ㎲ ㎳ ㎴ ㎵ ㎶ ㎷ ㎸ ㎹ ㎺ ㎻ ㎼ ㎽ ㎾ ㎿
U+33Cx: ㏀ ㏁ ㏂ ㏃ ㏄ ㏅ ㏆ ㏇ ㏈ ㏉ ㏊ ㏋ ㏌ ㏍ ㏎ ㏏
U+33Dx: ㏐ ㏑ ㏒ ㏓ ㏔ ㏕ ㏖ ㏗ ㏘ ㏙ ㏚ ㏛ ㏜ ㏝ ㏞ ㏟
Among the 3-letter-like symbols are ㎈ ㎑ ㎒ ㎓ ㎔ ㏒ ㏕ ㏖ ㏙ ㎪ ㎫ ㎬ ㎭ ㏆ ㏿ ㍱ ... Probably the ones with the most characters are ㎉ and ㎯.
Unicode even has code points for Roman numerals. Here another 4-letter-like character can be found: Ⅷ
U+215x: ⅐ ⅑ ⅒ ⅓ ⅔ ⅕ ⅖ ⅗ ⅘ ⅙ ⅚ ⅛ ⅜ ⅝ ⅞ ⅟
U+216x: Ⅰ Ⅱ Ⅲ Ⅳ Ⅴ Ⅵ Ⅶ Ⅷ Ⅸ Ⅹ Ⅺ Ⅻ Ⅼ Ⅽ Ⅾ Ⅿ
U+217x: ⅰ ⅱ ⅲ ⅳ ⅴ ⅵ ⅶ ⅷ ⅸ ⅹ ⅺ ⅻ ⅼ ⅽ ⅾ ⅿ
U+218x: ↀ ↁ ↂ Ↄ ↄ ↅ ↆ ↇ ↈ ↉ ↊ ↋
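That the Roman numerals really are multi-letter characters can be checked through their compatibility decompositions, e.g. (a quick sketch):
import unicodedata

print(unicodedata.normalize('NFKC', '\u2167'))   # Ⅷ -> VIII
print(unicodedata.decomposition('\u2167'))       # <compat> 0056 0049 0049 0049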
If ordinary numbers count as well, there are some other code points for multiple digits, like ⒆ ⒇ ⓳ ⓴ in the Enclosed Alphanumerics block:
U+246x: ① ② ③ ④ ⑤ ⑥ ⑦ ⑧ ⑨ ⑩ ⑪ ⑫ ⑬ ⑭ ⑮ ⑯
U+247x: ⑰ ⑱ ⑲ ⑳ ⑴ ⑵ ⑶ ⑷ ⑸ ⑹ ⑺ ⑻ ⑼ ⑽ ⑾ ⑿
U+248x: ⒀ ⒁ ⒂ ⒃ ⒄ ⒅ ⒆ ⒇ ⒈ ⒉ ⒊ ⒋ ⒌ ⒍ ⒎ ⒏
U+249x: ⒐ ⒑ ⒒ ⒓ ⒔ ⒕ ⒖ ⒗ ⒘ ⒙ ⒚ ⒛ ⒜ ⒝ ⒞ ⒟
U+24Ax: ⒠ ⒡ ⒢ ⒣ ⒤ ⒥ ⒦ ⒧ ⒨ ⒩ ⒪ ⒫ ⒬ ⒭ ⒮ ⒯
U+24Bx: ⒰ ⒱ ⒲ ⒳ ⒴ ⒵ Ⓐ Ⓑ Ⓒ Ⓓ Ⓔ Ⓕ Ⓖ Ⓗ Ⓘ Ⓙ
U+24Cx: Ⓚ Ⓛ Ⓜ Ⓝ Ⓞ Ⓟ Ⓠ Ⓡ Ⓢ Ⓣ Ⓤ Ⓥ Ⓦ Ⓧ Ⓨ Ⓩ
U+24Dx: ⓐ ⓑ ⓒ ⓓ ⓔ ⓕ ⓖ ⓗ ⓘ ⓙ ⓚ ⓛ ⓜ ⓝ ⓞ ⓟ
U+24Ex: ⓠ ⓡ ⓢ ⓣ ⓤ ⓥ ⓦ ⓧ ⓨ ⓩ ⓪ ⓫ ⓬ ⓭ ⓮ ⓯
U+24Fx: ⓰ ⓱ ⓲ ⓳ ⓴ ⓵ ⓶ ⓷ ⓸ ⓹ ⓺ ⓻ ⓼ ⓽ ⓾ ⓿
and in Enclosed Alphanumeric Supplement
🅫, 🅪, 🆋, 🆌, 🆍, 🄭, 🄮, 🅊, 🅋, 🅌, 🅍, 🅎, 🅏
A few more:
Currency symbol group
₧ ₨ ₶ ₯ ₠ ₢ ₷
Miscellaneous technical group
⎂ ⏨
Control Pictures (you'll probably need to zoom in to see them):
U+240x: ␀ ␁ ␂ ␃ ␄ ␅ ␆ ␇ ␈ ␉ ␊ ␋ ␌ ␍ ␎ ␏
U+241x: ␐ ␑ ␒ ␓ ␔ ␕ ␖ ␗ ␘ ␙ ␚ ␛ ␜ ␝ ␞ ␟
U+242x: ␠ ␡ ␢ ␣ ␤ ␥ ␦
Alchemical Symbols
🜀 🜅 🜆 🜇 🜈 🝪 🝫 🝬 🝛 🝜 🝝
Musical Symbols
𝄶 𝄷 𝄸 𝄹 𝄉 𝄊 𝄫
And there are the emojis 🔟 💤🆔🚾🆖🆗🔢🔡🔠 💯🆘🆎🆑™🔙🔚🔜🔝🔛📆🗓🔞
Vertical bars may be read as an uppercase i or a lowercase L (like your 〷 example, which is actually the TELEGRAPH LINE FEED SEPARATOR SYMBOL), and we have:
ꔖ U+A516 VAI SYLLABLE SEE
⫼ U+2AFC LARGE TRIPLE VERTICAL BAR OPERATOR
𝍫 U+1D36B COUNTING ROD TENS DIGIT THREE
〢 〣 Suzhou numerals
川 Chinese "river"
║ BOX DRAWINGS DOUBLE VERTICAL...
Here's the automatic script to find the multi-character letters
import unicodedata

for c in range(0, 0x10FFFF + 1):
    d = unicodedata.normalize('NFKD', chr(c))
    if len(d) > 1 and d.isascii() and d.isalpha():
        print("U+%04X (%s): %s\n" % (c, chr(c), d))
It won't be able to find many ligatures like æ or œ, because they're not considered orthographic ligatures and have no decomposition in Unicode. Here's the result for Unicode 11.0.0 (checked with unicodedata.unidata_version):
U+0132 (IJ): IJ
U+0133 (ij): ij
U+01C7 (LJ): LJ
U+01C8 (Lj): Lj
U+01C9 (lj): lj
U+01CA (NJ): NJ
U+01CB (Nj): Nj
U+01CC (nj): nj
U+01F1 (DZ): DZ
U+01F2 (Dz): Dz
U+01F3 (dz): dz
U+20A8 (₨): Rs
U+2116 (№): No
U+2120 (℠): SM
U+2121 (℡): TEL
U+2122 (™): TM
U+213B (℻): FAX
U+2161 (Ⅱ): II
U+2162 (Ⅲ): III
U+2163 (Ⅳ): IV
U+2165 (Ⅵ): VI
U+2166 (Ⅶ): VII
U+2167 (Ⅷ): VIII
U+2168 (Ⅸ): IX
U+216A (Ⅺ): XI
U+216B (Ⅻ): XII
U+2171 (ⅱ): ii
U+2172 (ⅲ): iii
U+2173 (ⅳ): iv
U+2175 (ⅵ): vi
U+2176 (ⅶ): vii
U+2177 (ⅷ): viii
U+2178 (ⅸ): ix
U+217A (ⅺ): xi
U+217B (ⅻ): xii
U+3250 (㉐): PTE
U+32CC (㋌): Hg
U+32CD (㋍): erg
U+32CE (㋎): eV
U+32CF (㋏): LTD
U+3371 (㍱): hPa
U+3372 (㍲): da
U+3373 (㍳): AU
U+3374 (㍴): bar
U+3375 (㍵): oV
U+3376 (㍶): pc
U+3377 (㍷): dm
U+337A (㍺): IU
U+3380 (㎀): pA
U+3381 (㎁): nA
U+3383 (㎃): mA
U+3384 (㎄): kA
U+3385 (㎅): KB
U+3386 (㎆): MB
U+3387 (㎇): GB
U+3388 (㎈): cal
U+3389 (㎉): kcal
U+338A (㎊): pF
U+338B (㎋): nF
U+338E (㎎): mg
U+338F (㎏): kg
U+3390 (㎐): Hz
U+3391 (㎑): kHz
U+3392 (㎒): MHz
U+3393 (㎓): GHz
U+3394 (㎔): THz
U+3396 (㎖): ml
U+3397 (㎗): dl
U+3398 (㎘): kl
U+3399 (㎙): fm
U+339A (㎚): nm
U+339C (㎜): mm
U+339D (㎝): cm
U+339E (㎞): km
U+33A9 (㎩): Pa
U+33AA (㎪): kPa
U+33AB (㎫): MPa
U+33AC (㎬): GPa
U+33AD (㎭): rad
U+33B0 (㎰): ps
U+33B1 (㎱): ns
U+33B3 (㎳): ms
U+33B4 (㎴): pV
U+33B5 (㎵): nV
U+33B7 (㎷): mV
U+33B8 (㎸): kV
U+33B9 (㎹): MV
U+33BA (㎺): pW
U+33BB (㎻): nW
U+33BD (㎽): mW
U+33BE (㎾): kW
U+33BF (㎿): MW
U+33C3 (㏃): Bq
U+33C4 (㏄): cc
U+33C5 (㏅): cd
U+33C8 (㏈): dB
U+33C9 (㏉): Gy
U+33CA (㏊): ha
U+33CB (㏋): HP
U+33CC (㏌): in
U+33CD (㏍): KK
U+33CE (㏎): KM
U+33CF (㏏): kt
U+33D0 (㏐): lm
U+33D1 (㏑): ln
U+33D2 (㏒): log
U+33D3 (㏓): lx
U+33D4 (㏔): mb
U+33D5 (㏕): mil
U+33D6 (㏖): mol
U+33D7 (㏗): PH
U+33D9 (㏙): PPM
U+33DA (㏚): PR
U+33DB (㏛): sr
U+33DC (㏜): Sv
U+33DD (㏝): Wb
U+33FF (㏿): gal
U+FB00 (ff): ff
U+FB01 (fi): fi
U+FB02 (fl): fl
U+FB03 (ffi): ffi
U+FB04 (ffl): ffl
U+FB05 (ſt): st
U+FB06 (st): st
U+1F12D (🄭): CD
U+1F12E (🄮): WZ
U+1F14A (🅊): HV
U+1F14B (🅋): MV
U+1F14C (🅌): SD
U+1F14D (🅍): SS
U+1F14E (🅎): PPV
U+1F14F (🅏): WC
U+1F16A (🅪): MC
U+1F16B (🅫): MD
U+1F190 (🆐): DJ
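As noted above, æ and œ carry no compatibility decomposition, so the NFKD-based search cannot list them; a quick check (a sketch):
import unicodedata

for ch in 'æœﬁ':
    # NFKD leaves æ and œ unchanged, but expands the ﬁ ligature to 'fi'
    print('U+%04X' % ord(ch), unicodedata.normalize('NFKD', ch))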

Advanced Command-Line Replace Command In VBScript

I'm writing a compiler for my own computer language. Before the language can be compiled I need to replace all apostrophes (') with percent signs (%) via a command-line VBS program, but an apostrophe must only be replaced if there is NOT a circumflex accent (^) directly in front of it. So for example, in this code:
color 0a
input twelve = 0a "hi! that^'s great! "
execute :testfornum 'twelve'
exit
:testfornum
if numeric('1) (
return
) ELSE (
print 0a "oops 'twelve' should be numeric"
)
return
the apostrophe on line 2 should not be replaced, but the ones on lines 3, 6 and 9 should be.
Can anyone help me?
This is what I have so far:
'syntax: (cscript) replace.vbs [filename] "StringToFind" "stringToReplace"
Option Explicit
Dim FileScriptingObject, file, strReplace, strReplacement, fileD, lastContainment, newContainment
file=Wscript.arguments(0)
strReplace=WScript.arguments(1)
strReplacement=WScript.arguments(2)
Set FileScriptingObject=CreateObject("Scripting.FileSystemObject")
if FileScriptingObject.FileExists(file) = false then
    wscript.echo "File not found!"
    wscript.Quit
end if
set fileD=fileScriptingobject.OpenTextFile(file,1)
lastContainment=fileD.ReadAll
newContainment=replace(lastContainment,strReplace,strReplacement,1,-1,0)
set fileD=fileScriptingobject.OpenTextFile(file,2)
fileD.Write newContainment
fileD.Close
As @Ansgar's solution fails for the special case of a leading ' (with no non-^ character before it), here is an approach that uses a replace function, in a test script that makes further experiments easy:
Option Explicit

Function fpR(m, g1, g2, p, s)
    If "" = g1 Then
        fpR = "%"
    Else
        fpR = m
    End If
End Function

Function qq(s)
    qq = """" & s & """"
End Function

Dim rE : Set rE = New RegExp
rE.Global = True
rE.Pattern = "(\^)?(')"

Dim rA : Set rA = New RegExp
rA.Global = True
rA.Pattern = "([^^])'"
'rA.Pattern = "([^^])?'"

Dim s
For Each s In Split(" 'a^'b' a'b'^'c nix a^''b")
    WScript.Echo qq(s), "==>", qq(rE.Replace(s, GetRef("fpR"))), "==>", qq(rA.Replace(s, "$1%"))
Next
output:
cscript 25221565.vbs
"" ==> "" ==> ""
"'a^'b'" ==> "%a^'b%" ==> "'a^'b%" <=== oops
"a'b'^'c" ==> "a%b%^'c" ==> "a%b%^'c"
"nix" ==> "nix" ==> "nix"
"a^''b" ==> "a^'%b" ==> "a^'%b"
You can't do this with a normal string replacement. A regular expression would work, though:
...
Set re = New RegExp
re.Pattern = "(^|[^^])'"
re.Global = True
newContainment = re.Replace(lastContainment, "$1%")
...
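The (^|[^^])' pattern also copes with an apostrophe at the very start of the string; a quick sanity check of the same test cases in Python's re (compatible syntax here, a sketch):
import re

for s in ["'a^'b'", "a'b'^'c", "nix", "a^''b"]:
    print(s, '==>', re.sub(r"(^|[^^])'", r"\1%", s))
# "'a^'b'" ==> "%a^'b%"   (the leading apostrophe is handled correctly)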

How to identify all non-basic UTF-8 characters in a set of strings in perl

I'm using Perl's XML::Writer to generate an import file for a program called OpenNMS. According to the documentation I need to pre-declare all special characters as XML ENTITY declarations. Obviously I need to go through all the strings I'm exporting and catalogue the special characters used. What's the easiest way to work out which characters in a Perl string are "special" with respect to UTF-8 encoding? Is there any way to work out what the entity names for those characters should be?
In order to find "special" characters, you can use ord to find out the codepoint. Here's an example:
# Create a Unicode test file with some Latin chars, some Cyrillic,
# and some outside the BMP.
# The BMP is the basic multilingual plane, see perluniintro.
# (Not sure what you mean by saying "non-basic".)
perl -CO -lwe "print join '', map chr, 97 .. 100, 0x410 .. 0x415, 0x10000 .. 0x10003" > u.txt
# Read it and find codepoints outside the BMP.
perl -CI -nlwe "print for map ord, grep ord > 0xffff, split //" < u.txt
You can get a good introduction from reading perluniintro.
I'm not sure what the docs you're referring to mean in the section "Exported XML".
Looks like some limitation of a system which is de facto ASCII and doesn't do Unicode.
Or a misunderstanding of XML. Or both.
Anyway, if you're looking for names you could use or reference the canonical ones.
See XML Entity Definitions for Characters or one of the older documents for HTML or MathML referenced therein.
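If numeric character references are acceptable instead of named entities, escaping everything outside ASCII is straightforward; a minimal sketch (in Python, for illustration only, since choosing the names is the harder part):
def to_ncr(text):
    # replace every non-ASCII character with a numeric character reference
    return ''.join(ch if ord(ch) < 128 else '&#x%X;' % ord(ch) for ch in text)

print(to_ncr('crème brûlée'))   # cr&#xE8;me br&#xFB;l&#xE9;e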
You might look into the uniquote program. It has a --xml option. For example:
$ cat sample
1 NFD single combining characters: (crème brûlée et fiancé) and (crème brûlée et fiancé).
2 NFC single combining characters: (crème brûlée et fiancé) and (crème brûlée et fiancé).
3 NFD multiple combining characters: (hẫç̌k) and (hã̂ç̌k).
3 NFC multiple combining characters: (hẫç̌k) and (hã̂ç̌k).
5 invisible characters: (4⁄3⁢π⁢r³) and (4⁄3⁢π⁢r³).
6 astral characters: (𝐂 = sqrt[𝐀² + 𝐁²]) and (𝐂 = sqrt[𝐀² + 𝐁²]).
7 astral + combining chars: (𝐂̅ = sqrt[𝐀̅² + 𝐁̅²]) and (𝐂̅ = sqrt[𝐀̅² + 𝐁̅²]).
8 wide characters: (wide) and (wide).
9 regular characters: (normal) and (normal).
$ uniquote -x sample
1 NFD single combining characters: (cre\x{300}me bru\x{302}le\x{301}e et fiance\x{301}) and (cre\x{300}me bru\x{302}le\x{301}e et fiance\x{301}).
2 NFC single combining characters: (cr\x{E8}me br\x{FB}l\x{E9}e et fianc\x{E9}) and (cr\x{E8}me br\x{FB}l\x{E9}e et fianc\x{E9}).
3 NFD multiple combining characters: (ha\x{302}\x{303}c\x{327}\x{30C}k) and (ha\x{303}\x{302}c\x{327}\x{30C}k).
3 NFC multiple combining characters: (h\x{1EAB}\x{E7}\x{30C}k) and (h\x{E3}\x{302}\x{E7}\x{30C}k).
5 invisible characters: (4\x{2044}3\x{2062}\x{3C0}\x{2062}r\x{B3}) and (4\x{2044}3\x{2062}\x{3C0}\x{2062}r\x{B3}).
6 astral characters: (\x{1D402} = sqrt[\x{1D400}\x{B2} + \x{1D401}\x{B2}]) and (\x{1D402} = sqrt[\x{1D400}\x{B2} + \x{1D401}\x{B2}]).
7 astral + combining chars: (\x{1D402}\x{305} = sqrt[\x{1D400}\x{305}\x{B2} + \x{1D401}\x{305}\x{B2}]) and (\x{1D402}\x{305} = sqrt[\x{1D400}\x{305}\x{B2} + \x{1D401}\x{305}\x{B2}]).
8 wide characters: (\x{FF57}\x{FF49}\x{FF44}\x{FF45}) and (\x{FF57}\x{FF49}\x{FF44}\x{FF45}).
9 regular characters: (normal) and (normal).
$ uniquote -b sample
1 NFD single combining characters: (cre\xCC\x80me bru\xCC\x82le\xCC\x81e et fiance\xCC\x81) and (cre\xCC\x80me bru\xCC\x82le\xCC\x81e et fiance\xCC\x81).
2 NFC single combining characters: (cr\xC3\xA8me br\xC3\xBBl\xC3\xA9e et fianc\xC3\xA9) and (cr\xC3\xA8me br\xC3\xBBl\xC3\xA9e et fianc\xC3\xA9).
3 NFD multiple combining characters: (ha\xCC\x82\xCC\x83c\xCC\xA7\xCC\x8Ck) and (ha\xCC\x83\xCC\x82c\xCC\xA7\xCC\x8Ck).
3 NFC multiple combining characters: (h\xE1\xBA\xAB\xC3\xA7\xCC\x8Ck) and (h\xC3\xA3\xCC\x82\xC3\xA7\xCC\x8Ck).
5 invisible characters: (4\xE2\x81\x843\xE2\x81\xA2\xCF\x80\xE2\x81\xA2r\xC2\xB3) and (4\xE2\x81\x843\xE2\x81\xA2\xCF\x80\xE2\x81\xA2r\xC2\xB3).
6 astral characters: (\xF0\x9D\x90\x82 = sqrt[\xF0\x9D\x90\x80\xC2\xB2 + \xF0\x9D\x90\x81\xC2\xB2]) and (\xF0\x9D\x90\x82 = sqrt[\xF0\x9D\x90\x80\xC2\xB2 + \xF0\x9D\x90\x81\xC2\xB2]).
7 astral + combining chars: (\xF0\x9D\x90\x82\xCC\x85 = sqrt[\xF0\x9D\x90\x80\xCC\x85\xC2\xB2 + \xF0\x9D\x90\x81\xCC\x85\xC2\xB2]) and (\xF0\x9D\x90\x82\xCC\x85 = sqrt[\xF0\x9D\x90\x80\xCC\x85\xC2\xB2 + \xF0\x9D\x90\x81\xCC\x85\xC2\xB2]).
8 wide characters: (\xEF\xBD\x97\xEF\xBD\x89\xEF\xBD\x84\xEF\xBD\x85) and (\xEF\xBD\x97\xEF\xBD\x89\xEF\xBD\x84\xEF\xBD\x85).
9 regular characters: (normal) and (normal).
$ uniquote -v sample
1 NFD single combining characters: (cre\N{COMBINING GRAVE ACCENT}me bru\N{COMBINING CIRCUMFLEX ACCENT}le\N{COMBINING ACUTE ACCENT}e et fiance\N{COMBINING ACUTE ACCENT}) and (cre\N{COMBINING GRAVE ACCENT}me bru\N{COMBINING CIRCUMFLEX ACCENT}le\N{COMBINING ACUTE ACCENT}e et fiance\N{COMBINING ACUTE ACCENT}).
2 NFC single combining characters: (cr\N{LATIN SMALL LETTER E WITH GRAVE}me br\N{LATIN SMALL LETTER U WITH CIRCUMFLEX}l\N{LATIN SMALL LETTER E WITH ACUTE}e et fianc\N{LATIN SMALL LETTER E WITH ACUTE}) and (cr\N{LATIN SMALL LETTER E WITH GRAVE}me br\N{LATIN SMALL LETTER U WITH CIRCUMFLEX}l\N{LATIN SMALL LETTER E WITH ACUTE}e et fianc\N{LATIN SMALL LETTER E WITH ACUTE}).
3 NFD multiple combining characters: (ha\N{COMBINING CIRCUMFLEX ACCENT}\N{COMBINING TILDE}c\N{COMBINING CEDILLA}\N{COMBINING CARON}k) and (ha\N{COMBINING TILDE}\N{COMBINING CIRCUMFLEX ACCENT}c\N{COMBINING CEDILLA}\N{COMBINING CARON}k).
3 NFC multiple combining characters: (h\N{LATIN SMALL LETTER A WITH CIRCUMFLEX AND TILDE}\N{LATIN SMALL LETTER C WITH CEDILLA}\N{COMBINING CARON}k) and (h\N{LATIN SMALL LETTER A WITH TILDE}\N{COMBINING CIRCUMFLEX ACCENT}\N{LATIN SMALL LETTER C WITH CEDILLA}\N{COMBINING CARON}k).
5 invisible characters: (4\N{FRACTION SLASH}3\N{INVISIBLE TIMES}\N{GREEK SMALL LETTER PI}\N{INVISIBLE TIMES}r\N{SUPERSCRIPT THREE}) and (4\N{FRACTION SLASH}3\N{INVISIBLE TIMES}\N{GREEK SMALL LETTER PI}\N{INVISIBLE TIMES}r\N{SUPERSCRIPT THREE}).
6 astral characters: (\N{MATHEMATICAL BOLD CAPITAL C} = sqrt[\N{MATHEMATICAL BOLD CAPITAL A}\N{SUPERSCRIPT TWO} + \N{MATHEMATICAL BOLD CAPITAL B}\N{SUPERSCRIPT TWO}]) and (\N{MATHEMATICAL BOLD CAPITAL C} = sqrt[\N{MATHEMATICAL BOLD CAPITAL A}\N{SUPERSCRIPT TWO} + \N{MATHEMATICAL BOLD CAPITAL B}\N{SUPERSCRIPT TWO}]).
7 astral + combining chars: (\N{MATHEMATICAL BOLD CAPITAL C}\N{COMBINING OVERLINE} = sqrt[\N{MATHEMATICAL BOLD CAPITAL A}\N{COMBINING OVERLINE}\N{SUPERSCRIPT TWO} + \N{MATHEMATICAL BOLD CAPITAL B}\N{COMBINING OVERLINE}\N{SUPERSCRIPT TWO}]) and (\N{MATHEMATICAL BOLD CAPITAL C}\N{COMBINING OVERLINE} = sqrt[\N{MATHEMATICAL BOLD CAPITAL A}\N{COMBINING OVERLINE}\N{SUPERSCRIPT TWO} + \N{MATHEMATICAL BOLD CAPITAL B}\N{COMBINING OVERLINE}\N{SUPERSCRIPT TWO}]).
8 wide characters: (\N{FULLWIDTH LATIN SMALL LETTER W}\N{FULLWIDTH LATIN SMALL LETTER I}\N{FULLWIDTH LATIN SMALL LETTER D}\N{FULLWIDTH LATIN SMALL LETTER E}) and (\N{FULLWIDTH LATIN SMALL LETTER W}\N{FULLWIDTH LATIN SMALL LETTER I}\N{FULLWIDTH LATIN SMALL LETTER D}\N{FULLWIDTH LATIN SMALL LETTER E}).
9 regular characters: (normal) and (normal).
$ uniquote --xml sample
1 NFD single combining characters: (crème brûlée et fiancé) and (crème brûlée et fiancé).
2 NFC single combining characters: (crème brûlée et fiancé) and (crème brûlée et fiancé).
3 NFD multiple combining characters: (hâçk) and (hãçk).
3 NFC multiple combining characters: (hẫk) and (hãk).
5 invisible characters: (4⁄3⁢r³) and (4⁄3⁢r³).
6 astral characters: (𝐂 = sqrt[𝐀 + 𝐁]) and (𝐂 = sqrt[𝐀 + 𝐁]).
7 astral + combining chars: (𝐂 = sqrt[𝐀 + 𝐁]) and (𝐂 = sqrt[𝐀 + 𝐁]).
8 wide characters: (w) and (w).
9 regular characters: (normal) and (normal).
$ uniquote --verbose --html sample
1 NFD single combining characters: (crème brûlée et fiancé) and (crème brûlée et fiancé).
2 NFC single combining characters: (crème brûlée et fiancé) and (crème brûlée et fiancé).
3 NFD multiple combining characters: (hẫç̌k) and (hã̂ç̌k).
3 NFC multiple combining characters: (hẫç̌k) and (hã̂ç̌k).
5 invisible characters: (4⁄3⁢π⁢r³) and (4⁄3⁢π⁢r³).
6 astral characters: (𝐂 = sqrt[𝐀² + 𝐁²]) and (𝐂 = sqrt[𝐀² + 𝐁²]).
7 astral + combining chars: (𝐂̅ = sqrt[𝐀̅² + 𝐁̅²]) and (𝐂̅ = sqrt[𝐀̅² + 𝐁̅²]).
8 wide characters: (wide) and (wide).
9 regular characters: (normal) and (normal).