I need a binary-to-text encoding, like base64, that takes binary and produce text but i can't have any lowercase letters.
Anything but lowercase letters. Or anything case unsensitive that i can set to uppercase.
Is it possible?
Thanks a lot !
You're looking for Base32
The resulting character set is all one case, which can often be beneficial when using a case-insensitive filesystem, spoken language, or human memory.
You could try Uuencoding. It's like Base64 (in that it represents each 3-byte sequence with 4 symbols), but it uses a different set of 64 symbols, the ASCII characters 0x20 through 0x5F, which doesn't include lowercase.
Related
I cannot understand some key elements of encoding:
Is ASCII only a character or it also has its encoding scheme algorithm ?
Does other windows code pages such as Latin1 have their own encoding algorithm ?
Are UTF7, 8, 16, 32 the only encoding algorithms ?
Does the UTF alghoritms are used only with the UNICODE set ?
Given the ASCII text: Hello World, if I want to convert it into Latin1 or BIG5, which encoding algorithms are being used in this process ? More specifically, does Latin1/Big5 use their own encoding alghoritm or I have to use a UTF alghoritm ?
1: Ascii is just an encoding — a really simple encoding. It's literally just the positive end of a signed byte (0...127) mapped to characters and control codes.
Refer to https://www.ascii.codes/ to see the full set and inspect the characters.
There are definitely encoding algorithms to convert ascii strings to and from strings in other encodings, but there is no compression/decompression algorithm required to write or read ascii strings like there is for utf8 or utf16, if that's what you're implying.
2: LATIN-1 is also not a compressed (usually called 'variable width') encoding, so there's no algorithm needed to get in and out of it.
See https://kb.iu.edu/d/aepu for a nice description of LATIN-1 conceptually and of each character in the set. Like a lot of encodings, its first 128 slots are just ascii. Like ascii, it's 1 byte in size, but it's an unsigned byte, so after the last ascii character (DEL/127), LATIN1 adds another 128 characters.
As with any conversion from one string encoding to another, there is an algorithm specifically tailored to that conversion.
3: Again, unicode encodings are just that — encodings. But they're all compressed except for utf32. So unless you're working with utf32 there is always a compression/decompression step required to write and read them.
Note: When working with utf32 strings there is one nonlinear oddity that has to be accounted for... combining characters. Technically that is yet another type of compression since they save space by not giving a codepoint to every possible combination of uncombined character and combining character. They "precombine" a few, but they would run out of slots very quickly if they did them all.
4: Yes. The compression/decompression algorithms for the compressed unicode encodings are just for those encodings. They would not work for any other encoding.
Think of it like zip/unzip. Unzipping anything other than a zipped file or folder would of course not work. That goes for things that are not compressed in the first place and also things that are compressed but using another compression algorithm (e.g.: rar).
I recently wrote the utf8 and utf16 compression/decompression code for a new cross-platform library being developed, and I can tell you quite confidently if you feed a Big5-encoded string into my method written specifically for decompressing utf8... not only would it not work, it might very well crash.
Re: your "Hello World" question... Refer to my answer to your second question about LATIN-1. No conversion is required to go from ascii to LATIN-1 because the first 128 characters (0...127) of LATIN-1 are ascii. If you're converting from LATIN-1 to ascii, the same is true for the lower half of LATIN-1, but if any of the characters beyond 127 are in the string, it would be what's called a "lossy"/partial conversion or an outright failure, depending on your tolerance level for lossiness. In your example, however, all of the characters in "Hello World" have the exact same values in both encodings, so it would convert perfectly, without loss, in either direction.
I know practically nothing about Big5, but regardless, don't use utf-x algos for other encodings. Each one of those is written very specifically for 1 particular encoding (or in the case of conversion: pair of encodings).
If you're curious about utf8/16 compression/decompression algorithms, the unicode website is where you should start (watch out though. they don't use the compression/decompression metaphor in their documentation):
http://unicode.org
You probably won't need anything else.
... except maybe a decent codepoint lookup tool: https://www.unicode.codes/
You can roll your own code based on the unicode documentation, or use the official unicode library:
http://site.icu-project.org/home
Hope this helps.
In general, most encoding schemes like ASCII or Latin-1 are simply big tables mapping characters to specific byte sequences. There may or may not be some specific algorithm how the creators came up with those specific character⟷byte associations, but there's generally not much more to it than that.
One of the innovations of Unicode specifically is the indirection of assigning each character a unique number first and foremost, and worrying about how to encode that number into bytes secondarily. There are a number of encoding schemes for how to do this, from the UCS and GB 18030 encodings to the most commonly used UTF-8/UTF-16 encodings. Some are largely defunct by now like UCS-2. Each one has their pros and cons in terms of space tradeoffs, ease of processing and transportability (e.g. UTF-7 for safe transport over 7-bit system like email). Unless otherwise noted, they can all encode the full set of current Unicode characters.
To convert from one encoding to another, you pretty much need to map bytes from one table to another. Meaning, if you look at the EBCDIC table and the Windows 1250 table, the characters 0xC1 and 0x41 respectively both seem to represent the same character "A", so when converting between the two encodings, you'd map those bytes as equivalent. Yes, that means there needs to be one such mapping between each possible encoding pair.
Since that is obviously rather laborious, modern converters virtually always go through Unicode as a middleman. This way each encoding only needs to be mapped to the Unicode table, and the conversion can be done with encoding A → Unicode code point → encoding B. In the end you just want to identify which characters look the same/mean the same, and change the byte representation accordingly.
A character encoding is a mapping from a sequence of characters to a sequence of bytes (in the past there were also encodings to a sequence of bits - they are falling out of fashion). Usually this mapping is one-to-one but not necessarily onto. This means there may be byte sequences that don't correspond to a character sequence in this encoding.
The domain of the mapping defines which characters can be encoded.
Now to your questions:
ASCII is both, it defines 128 characters (some of them are control codes) and how they are mapped to the byte values 0 to 127.
Each encoding may define its own set of characters and how they are mapped to bytes
no, there are others as well ASCII, ISO-8859-1, ...
Unicode uses a two step mapping: first the characters are mapped to (relatively) small integers called "code points", then these integers are mapped to a byte sequence. The first part is the same for all UTF encodings, the second step differs. Unicode has the ambition to contain all characters. This means, most characters are in the "UNICODE set".
Every character in the world has been assigned a unicode value [ numbered from 0 to ...]. It is actually an unique value. Now, it depends on an individual that how he wants to use that unicode value. He can even use it directly or can use some known encoding schemes like utf8, utf16 etc. Encoding schemes map that unicode value into some specific bit sequence [ can vary from 1 byte to 4 bytes or may be 8 in future if we get to know about all the languages of universe/aliens/multiverse ] so that it can be uniquely identified in the encoding scheme.
For example ASCII is an encoding scheme which only encodes 128 characters out of all characters. It uses one byte for every character which is equivalent to utf8 representation. GSM7 is one other format which uses 7 bit per character to encode 128 characters from unicode character list.
Utf8:
It uses 1 byte for characters whose unicode value is till 127.
Beyond this it has its own way of representing the unicode values.
Uses 2 byte for Cyrillic then 3 bytes for Hindi characters.
Utf16:
It uses 2 byte for characters whose unicode value is till 127.
and it also uses 2 byte for Cyrillic, Hindi characters.
All the utf encoding schemes fixes initial bits in specific pattern [ eg: 110|restbits] and rest bits [eg: initialbits|11001] takes the unicode value to make a unique representation.
Wikipedia on utf8, utf16, unicode will make it clear.
I coded an utf translator which converts incoming utf8 text across all languages into its equivalent utf16 text.
I'm working with a binary file that references another file using absolute paths.
The path contains both japanese and ascii characters.
The length of the string is given, so I can just read that many bytes and convert it into a string.
However the problem is trying to convert the string. If I specify the encoding as ascii, it'll fail on the japanese characters. If I specify it as japanese encoding (shift-jis or something), it won't read the english characters properly.
One byte is used for each ascii character, while two bytes are used for each japanese character.
What is the fastest and cleanest way to convert these bytes into a string? The encodings are known. Will the same technique work in older versions of python.
This sounds like you have fallen victim for a misunderstand the basics of Unicode and encodings. It may be that you have not, but misunderstandnings are common and understandable, while the situation you describe are not.
A string of bytes that contains mixed encodings are, per definition, invalid in any of these encodings. If this really was the case, you would have to split the bytes string into it's parts, and decode every part separately. In this case it would probably mean splitting on the path separators, so it would be reasonably easy, but in other cases it would not. However, I serously doubt that this is the case, as it would mean that your source is insane. That happens, but it is unlikely. :-)
If the source gives you one path as a bytes string, it is most likely that this string uses only one encoding. It may contain both Japanese and ASCII-characters and still be using one encoding. The most common encodings that can handle both Japanese and ASCII are UTF-8 and UTF-16. My guess is that your source uses one of those. In fact, since you write "One byte is used for each ascii character, while two bytes are used for each japanese character" it is probably UTF-8. It could also be Shift JIS, but it seems you already tried that.
If not, please explain what your source is, and give examples of the byte strings (in ASCII/HEX) that you are given.
I'm looking for a simple and efficient way to store UTF-8 strings in ASCII-7. With efficient I mean the following:
all ASCII alphanumeric chars in the input should stay the same ASCII alphanumeric chars in the output
the resulting string should be as short as possible
the operation needs to be reversable without any data loss
the resulting ASCII string should be case insensitive
there should be no restriction on the input length
the whole UTF-8 range should be allowed
My first idea was to use Punycode (IDNA) as it fits the first four requirements, but it fails at the last two.
Can anyone recommend an alternative encoding scheme? Even better if there's some code available to look at.
UTF-7, or, slightly less transparent but more widespread, quoted-printable.
all ASCII chars in the input should stay ASCII chars in the output
(Obviously not fully possible as you need at least one character to act as an escape.)
Since ASCII covers the full range of 7-bit values, an encoding scheme that preserves all ASCII characters, is 7-bits long, and encodes the full Unicode range is not possible.
Edited to add:
I think I understand your requirements now. You are looking for a way to encode UTF-8 strings in a seven-bit code, in which, if that encoded string were interpreted as ASCII text, then the case of the alphabetic characters may be arbitrarily modified, and yet the decoded string will be byte-for-byte identical to the original.
If that's the case, then your best bet would probably be just to encode the binary representation of the original as a string of hexadecimal digits. I know you are looking for a more compact representation, but that's a pretty tall order given the other constraints of the system, unless some custom encoding is devised.
Since the hexadecimal representation can encode any arbitrary binary values, it might be possible to shrink the string by compressing them before taking the hex values.
If you're talking about non-standard schemes - MECE
URL encoding or numeric character references are two possible options.
It depends on the distribution of characters in your strings.
Quoted-printable is good for mostly-ASCII strings because there's no overhead except with '=' and control characters. However, non-ASCII characters take an inefficient 6-12 bytes each, so if you have a lot of those, you'll want to consider UTF-7 or Base64 instead.
Punycode is used for IDNA, but you can use it outside the restrictions imposed by it
Per se, Punycode doesn't fail your last 2 requirements:
>>> import sys
>>> _ = ("\U0010FFFF"*10000).encode("punycode")
>>> all(chr(c).encode("punycode") for c in range(sys.maxunicode))
True
(for idna, python supplies another homonymous encoding)
obviously, if you don't nameprep the input, the encoded string isn't strictly case-insensitive anymore... but if you supply only lowercase (or if you don't care about the decoded case) you should be good to go
I'm trying to get some legacy code to display Chinese characters properly. One character encoding I'm trying to work with starts with a 0x7F and is 4 bytes long (including the 0x7F byte). Does anyone know what kind of encoding this is and where I can find information for it? Thanks..
UPDATE:
I've also had to work with some Japanese encoding that starts every character with a 0xE3 and is three bytes long. It displays on my computer properly if I choose the Japanese locale in Windows, however, it doesn't display properly in our application. However, if any other locale other than Japanese is selected, I cannot even view the filenames properly. So I'm guessing this encoding is not Unicode. Anyone know what it is? Is it ANSI? Is it Shift JIS?
For the Chinese one, I've tested it with Unicode and UTF-8 characters and I'm getting the same pattern; 0x7F followed by three bytes. Are Unicode and UTF-8 the same?
One character encoding I'm trying to work with starts with a 0x7F and is 4 bytes long
What are the other bytes? Do you have any Latin text in this encoding?
If it's “0x7f 0x... 0x00 0x00” you are looking at UTF-32LE. It could also be two UTF-16 (either LE or BE) characters.
Most East Asian encodings use 0x80-0xFF as lead bytes for non-ASCII characters; there is none I know of that would use a leading 0x7F as anything other than an ASCII delete.
ETA:
are there supposed to be Byte Order Marks?
There doesn't need to be a BOM if there is an out-of-band way of signalling that the encoding is ‘UTF-32LE’ (possibly one that is lost before it gets to you).
I've also had to work with some Japanese encoding that starts every character with a 0xE3 and is three bytes long.
That's surely UTF-8. Sequence 0xE3 0x... 0x... would result in a character between U+3000 and U+4000, which is where the hiragana/katakana live.
It displays on my computer properly if I choose the Japanese locale in Windows, however, it doesn't display properly in our application.
Then chances are your application is is one of the regrettable horde of non-Unicode-compliant apps, still using ‘A’(*) versions of the Win32 interfaces inside of the ‘W’-suffixed ones. Whether you can read in the string according to its real encoding is moot: a non-Unicode-compliant app will never be able to display an East Asian ideograph on a Western locale.
(*: named for “ANSI”, which is Windows's misleading term for “whatever the system codepage is set to at the moment”. That's why changing your locale affected it.)
ETA(2):
OK, cracked it. It's not any standardised encoding I've met before, but it's relatively easy to decipher if you assume the premise that Unicode code points are being encoded.
0x00-0x7E: plain ASCII
0x7F A B C: Unicode character
The character encoded in a Unicode escape can be calculated by taking the index in a key string of A, B and C and adding together:
A*0x1000 + B*0x40 + C
That is, it's a base-64 character set, but it's not the usual Base64 standard. A little experimentation gives a key string of:
.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz
The ‘.’ and ‘_’ characters are guesses, since none of the characters you posted uses them. We'd need more data to find out the exact string.
So, for example:
0x7F 3 u g
A=4 B=58 C=44
4*0x1000 + 58*0x40 + 44 = 0x4EAC
U+4EAC = 京
ETA(3):
Yeah, it should be easy to create a native Unicode string by sucking out each code point manually and joining as a character. Not quite sure what's available on whatever platform you're using, but any Unicode-capable platform should be able to make a string from codepoints simply (and hopefully without having to manually re-encode to UTF-16LE bytes).
I figured it must be Unicode codepoints by noticing that the three example characters had first escape-characters in the same general range, and in the same numerical order as their Unicode codepoints. The other two characters seemed to change randomly, so it was very likely a big-endian encoding of the code point, and probably a base-64 encoding as 6 is as many bits as you can get out of readable ASCII.
Standard Base64 itself starts with letters, which would put something starting with a number too far up to be in the Basic Multilingual Plane. So I started guessing with ‘0123456789ABCDEFG...’ which would be the other obvious choice of key string. That got resulting numbers that were close to the code points for the given characters, but a bit too low. Inserting an extra character at the start of the key string (so digit ‘0’ doesn't map to number 0) got one of the characters right and the other two very close; the one that was right had no lower-case letters, so to change only the lower-case letters I inserted another character between the upper and lower cases. This came up with the right numbers.
It's not guaranteed that this is actually right, but (apart from the arbitrary choice of inserted characters) it's very likely to be it.
You might want to look at chinese character encoding page on Wikipedia. The only encoding in there that I can see that is always 4 bytes is UTF-32.
GB 18030 is the current standard Chinese character set, but it can be 1 to 4 bytes long.
Try chardet. It does a good job of guessing the character encoding of a string of bytes.
Are Unicode and UTF-8 the same?
No. UTF-8 is just one way to represent Unicode characters as a sequence of bytes. Unicode is the full standard, assigning numeric and human-readable identifiers to each character, as well as lots of metadata about the characters.
It might be a valid unicode encoding, such as a utf-8 or UTF16 surrogate pair.
Yes, the Chinese one is UTF-8, a implementation (encoding) of Unicode.
The UTF-8 is 1 byte long for ASCII characters and up to 4 bytes for others.
I have a Perl script that is being called by third parties to send me names of people who have registered my software. One of these parties encodes the names in UTF-8, so I have adapted my script accordingly to decode UTF-8 to ASCII with Encode::decode_utf8(...).
This usually works fine, but every 6 months or so one of the names contains cyrillic, greek or romanian characters, so decoding the name results in garbage characters such as "ПодражанÑкаÑ". I have to follow-up with the customer and ask him for a "latin character version" of his name in order to issue a registration code.
So, is there any Perl module that can detect whether there are such characters and automatically translates them to their closest ASCII representation if necessary?
It seems that I can use Lingua::Cyrillic::Translit::ICAO plus Lingua::DetectCharset to handle Cyrillic, but I would prefer something that works with other character sets as well.
I believe you could use Text::Unidecode for this, it is precisely what it tries to do.
In the documentation for Text::Unicode, under "Caveats", it appears that this phrase is incorrect:
Make sure that the input data really is a utf8 string.
UTF-8 is a variable-length encoding, whereas Text::Unidecode only accepts a fixed-length (two-byte) encoding for each character. So that sentence should read:
Make sure that the input data really is a string of two-byte Unicode characters.
This is also referred to as UCS-2.
If you want to convert strings which really are utf8, you would do it like so:
my $decode_status = utf8::decode($input_to_be_converted);
my $converted_string = unidecode ($input_to_be_converted);
If you have to deal with UTF-8 data that are not in the ascii range, your best bet is to change your backend so it doesn't choke on utf-8. How would you go about transliterating kanji signs?
If you get cyrilic text there is no "closest ASCII representation" for many characters.