Default DICOM encoding without Specific Character Set - encoding

If a DICOM file does not define a Specific Character Set (0008,0005), what character set does it use by default? Is ASCII the default encoding for DICOM files?
TL;DR
A DICOM file contains German ä in one of the tags, but the file does not specify any character set. I assume that in this case the file is allowed to contain only ASCII symbols (the default character set) and report this file as invalid. Before I submit my change, I want to make sure that I understood DICOM correctly.

As specified in the Dicom Data Structures and Encoding
6.1.2.5.4 Levels of Implementation and Initial Designation
a) Attribute Specific Character Set (0008,0005) not present:
7-bit code
Implementation level: ISO 2022 Level 1 - Elementary 7-bit code (code-level identifier 1)
Initial designation: ISO-IR 6 (ASCII) as G0.
Code Extension shall not be used
Reference:
http://dicom.nema.org/medical/dicom/current/output/chtml/part05/chapter_6.html#sect_6.1.2.5.4

To add to answer by JonnyQ, DICOM standard also defines mechanisms when confronted with character sets that are unknown to implementations or unsupported (see PS 3.5 section 6.1.2.3). Implementations can print or display such characters by replacing all unknown characters with the four characters "\nnn", where "nnn" is the three digit octal representation of each byte.
An example given in the standard for an ASCII based machine as follows:
Character String: Günther
Encoded representation: 04/07 15/12 06/14 07/04 06/08 06/05 07/02
ASCII based machine: G\374nther
Implementations may also encounter Control Characters which they have no means to print or display.
Application may print or display such Control Characters by replacing the Control Character with the
four characters “\nnn”, where “nnn” is the three digit octal representation of each byte.

Related

Why this elaborate RTF encoding of an apostrophe?

Scrivener produces RTF files with this elaborate apostrophe encoding:
They didn\loch\af0\hich\af0\dbch\af0\uc1\u8217\'92t do it.
Unicode 8217 is "Right Single Quotation Mark". Okay, but this RTF has that unicode character and \'92 as well. What's going on here?
That RTF breaks down to the following:
They didn - plain text
\loch - The text consists of single-byte low-ANSI (0x00–0x79) characters
\af0 - Associated Font Number 0
\hich - The text consists of single-byte high-ANSI (0x80–0xFF) characters
\af0 - Associated Font Number 0
\dbch - The text consists of double-byte characters
\af0 - Associated Font Number 0
\uc1 - number of bytes corresponding to a given \uN Unicode character
\u8217 - a single Unicode character that has no equivalent ANSI representation based on the current ANSI code page
\'92 - A hexadecimal value, based on the specified character set (may be used to identify 8-bit values).
t do it. - plain text
Some of that is superfluous in this context and can be ignored, it is just font information. What is important is that \u8217 represents the apostrophe in Unicode, \'92 represents an equivalent apostrophe in ANSI, and \uc1 indicates the \'92 takes up 1 character. A Unicode-enabled RTF reader will handle \u8217 and ignore \'92. A non-Unicode RTF reader will ignore \u8217 and handle \'92. This is stated in the RTF spec for Unicode RTF:
\uN
This keyword represents a single Unicode character that has no equivalent ANSI representation based on the current ANSI code page. N represents the Unicode character value expressed as a decimal number.
This keyword is followed immediately by equivalent character(s) in ANSI representation. In this way, old readers will ignore the \uN keyword and pick up the ANSI representation properly. When this keyword is encountered, the reader should ignore the next N characters, where N corresponds to the last \ucN value encountered.
...
An RTF writer, when it encounters a Unicode character with no corresponding ANSI character, should output \uN followed by the best ANSI representation it can manage. Also, if the Unicode character translates into an ANSI character stream with count of bytes differing from the current Unicode Character Byte Count, it should emit the \ucN keyword prior to the \uN keyword to notify the reader of the change.

Encode binary data into ASCII while preserving most of its valid characters

When displaying a bytes object in python the print function will display a select number values as ASCII characters instead of their numeric representation.
>>> b"1 2 \x30 a b \x80"
b'1 2 0 a b \x80'
Is there a known encoding that would allow a set of binary data containing mostly ASCII text to be put into a valid ASCII string where the few invalid characters are replaced by a numeric representation, similarly to what python does with bytes?
Edit:
I used python's bytes.repr as an example of what the encoding would do, the project we need this for is written in c++, something like a "language agnostic" spec would be nice.
Edit 2:
Think of this as a base64 alternative so that binary data that is mostly ASCII does not get altered too much.

Understanding encoding schemes

I cannot understand some key elements of encoding:
Is ASCII only a character or it also has its encoding scheme algorithm ?
Does other windows code pages such as Latin1 have their own encoding algorithm ?
Are UTF7, 8, 16, 32 the only encoding algorithms ?
Does the UTF alghoritms are used only with the UNICODE set ?
Given the ASCII text: Hello World, if I want to convert it into Latin1 or BIG5, which encoding algorithms are being used in this process ? More specifically, does Latin1/Big5 use their own encoding alghoritm or I have to use a UTF alghoritm ?
1: Ascii is just an encoding — a really simple encoding. It's literally just the positive end of a signed byte (0...127) mapped to characters and control codes.
Refer to https://www.ascii.codes/ to see the full set and inspect the characters.
There are definitely encoding algorithms to convert ascii strings to and from strings in other encodings, but there is no compression/decompression algorithm required to write or read ascii strings like there is for utf8 or utf16, if that's what you're implying.
2: LATIN-1 is also not a compressed (usually called 'variable width') encoding, so there's no algorithm needed to get in and out of it.
See https://kb.iu.edu/d/aepu for a nice description of LATIN-1 conceptually and of each character in the set. Like a lot of encodings, its first 128 slots are just ascii. Like ascii, it's 1 byte in size, but it's an unsigned byte, so after the last ascii character (DEL/127), LATIN1 adds another 128 characters.
As with any conversion from one string encoding to another, there is an algorithm specifically tailored to that conversion.
3: Again, unicode encodings are just that — encodings. But they're all compressed except for utf32. So unless you're working with utf32 there is always a compression/decompression step required to write and read them.
Note: When working with utf32 strings there is one nonlinear oddity that has to be accounted for... combining characters. Technically that is yet another type of compression since they save space by not giving a codepoint to every possible combination of uncombined character and combining character. They "precombine" a few, but they would run out of slots very quickly if they did them all.
4: Yes. The compression/decompression algorithms for the compressed unicode encodings are just for those encodings. They would not work for any other encoding.
Think of it like zip/unzip. Unzipping anything other than a zipped file or folder would of course not work. That goes for things that are not compressed in the first place and also things that are compressed but using another compression algorithm (e.g.: rar).
I recently wrote the utf8 and utf16 compression/decompression code for a new cross-platform library being developed, and I can tell you quite confidently if you feed a Big5-encoded string into my method written specifically for decompressing utf8... not only would it not work, it might very well crash.
Re: your "Hello World" question... Refer to my answer to your second question about LATIN-1. No conversion is required to go from ascii to LATIN-1 because the first 128 characters (0...127) of LATIN-1 are ascii. If you're converting from LATIN-1 to ascii, the same is true for the lower half of LATIN-1, but if any of the characters beyond 127 are in the string, it would be what's called a "lossy"/partial conversion or an outright failure, depending on your tolerance level for lossiness. In your example, however, all of the characters in "Hello World" have the exact same values in both encodings, so it would convert perfectly, without loss, in either direction.
I know practically nothing about Big5, but regardless, don't use utf-x algos for other encodings. Each one of those is written very specifically for 1 particular encoding (or in the case of conversion: pair of encodings).
If you're curious about utf8/16 compression/decompression algorithms, the unicode website is where you should start (watch out though. they don't use the compression/decompression metaphor in their documentation):
http://unicode.org
You probably won't need anything else.
... except maybe a decent codepoint lookup tool: https://www.unicode.codes/
You can roll your own code based on the unicode documentation, or use the official unicode library:
http://site.icu-project.org/home
Hope this helps.
In general, most encoding schemes like ASCII or Latin-1 are simply big tables mapping characters to specific byte sequences. There may or may not be some specific algorithm how the creators came up with those specific character⟷byte associations, but there's generally not much more to it than that.
One of the innovations of Unicode specifically is the indirection of assigning each character a unique number first and foremost, and worrying about how to encode that number into bytes secondarily. There are a number of encoding schemes for how to do this, from the UCS and GB 18030 encodings to the most commonly used UTF-8/UTF-16 encodings. Some are largely defunct by now like UCS-2. Each one has their pros and cons in terms of space tradeoffs, ease of processing and transportability (e.g. UTF-7 for safe transport over 7-bit system like email). Unless otherwise noted, they can all encode the full set of current Unicode characters.
To convert from one encoding to another, you pretty much need to map bytes from one table to another. Meaning, if you look at the EBCDIC table and the Windows 1250 table, the characters 0xC1 and 0x41 respectively both seem to represent the same character "A", so when converting between the two encodings, you'd map those bytes as equivalent. Yes, that means there needs to be one such mapping between each possible encoding pair.
Since that is obviously rather laborious, modern converters virtually always go through Unicode as a middleman. This way each encoding only needs to be mapped to the Unicode table, and the conversion can be done with encoding A → Unicode code point → encoding B. In the end you just want to identify which characters look the same/mean the same, and change the byte representation accordingly.
A character encoding is a mapping from a sequence of characters to a sequence of bytes (in the past there were also encodings to a sequence of bits - they are falling out of fashion). Usually this mapping is one-to-one but not necessarily onto. This means there may be byte sequences that don't correspond to a character sequence in this encoding.
The domain of the mapping defines which characters can be encoded.
Now to your questions:
ASCII is both, it defines 128 characters (some of them are control codes) and how they are mapped to the byte values 0 to 127.
Each encoding may define its own set of characters and how they are mapped to bytes
no, there are others as well ASCII, ISO-8859-1, ...
Unicode uses a two step mapping: first the characters are mapped to (relatively) small integers called "code points", then these integers are mapped to a byte sequence. The first part is the same for all UTF encodings, the second step differs. Unicode has the ambition to contain all characters. This means, most characters are in the "UNICODE set".
Every character in the world has been assigned a unicode value [ numbered from 0 to ...]. It is actually an unique value. Now, it depends on an individual that how he wants to use that unicode value. He can even use it directly or can use some known encoding schemes like utf8, utf16 etc. Encoding schemes map that unicode value into some specific bit sequence [ can vary from 1 byte to 4 bytes or may be 8 in future if we get to know about all the languages of universe/aliens/multiverse ] so that it can be uniquely identified in the encoding scheme.
For example ASCII is an encoding scheme which only encodes 128 characters out of all characters. It uses one byte for every character which is equivalent to utf8 representation. GSM7 is one other format which uses 7 bit per character to encode 128 characters from unicode character list.
Utf8:
It uses 1 byte for characters whose unicode value is till 127.
Beyond this it has its own way of representing the unicode values.
Uses 2 byte for Cyrillic then 3 bytes for Hindi characters.
Utf16:
It uses 2 byte for characters whose unicode value is till 127.
and it also uses 2 byte for Cyrillic, Hindi characters.
All the utf encoding schemes fixes initial bits in specific pattern [ eg: 110|restbits] and rest bits [eg: initialbits|11001] takes the unicode value to make a unique representation.
Wikipedia on utf8, utf16, unicode will make it clear.
I coded an utf translator which converts incoming utf8 text across all languages into its equivalent utf16 text.

How does WTX (websphere transformation extender) know whether the character is ASCII or UNICODE?

ASCII is 8-bit value. Unicode may be 8 or 16 or 32 bit value. If I define subclass as character how does WTX know whether it is 8 or 16 bit character?
Setting an item's subclass to character is only one half of the solution. You also have to set the language (defaults to "Western") and, more important, the character set. If you choose UTF-8 (-16, -32), the parser is capable of recognizing multi-byte characters and will read them properly (given that the document being parsed is encoded in the type tree's encoding, of course).

What is the difference between charsets and character encoding

What is the difference between charsets and character encoding? When i say i am using utf-8 encoding then what will be my charset? Does it take unicode as charset by default?
UTF-8 is an encoding of the Unicode character set. Therefore if you're using UTF-8, the character set is Unicode, but you're not likely to have to specify this separately anywhere. The other main encoding of Unicode is UTF-16, which is not put into 8-bit byte streams because it contains zero bytes. If you are dealing with Unicode in a byte sequence, it is certainly encoded as UTF-8.
Other than Unicode, character sets are usually considered to have a single fixed encoding, and then terms like character set, charset, codepage, encoding are often used interchangeably, or depending on the vendor. This is sloppy but creates no runtime problems.
The only possible exceptions I can think of are East Asian: JIS and EUC originally defined multiple encodings for the same character set, but in practice today, each encoding is just treated separately.
Character set: definition which character has which numeric code point (ascii, jis, unicode)
Encoding: definition how the numeric code point is physically represented (utf, ucs, shiftjis)
According to Unicode terminology
ACR: Abstract Character Repertoire
= the set of characters to be encoded, for example, some alphabet or symbol set
CCS: Coded Character Set
= a mapping from an abstract character repertoire to a set of nonnegative integers
CEF: Character Encoding Form
= a mapping from a set of nonnegative integers that are elements of a
CCS to a set of sequences of particular code units of some specified width, such as 32-bit integers
CES: Character Encoding Scheme
= a reversible transformation from a set of sequences of code units (from one or more CEFs to a serialized sequence of bytes)
CM: Character Map
= a mapping from sequences of members of an abstract character repertoire to serialized sequences of bytes bridging all four levels in a single operation
TES: Transfer Encoding Syntax
= a reversible transform of encoded data, which may or may not contain textual data
Older protocols like MIME use "charset" when they really mean "character encoding scheme". Originally, different character encodings were though of as independent character repertoires instead of subsets of Unicode.
A character set defines the mapping between numbers and characters. Almost all char sets say 65 is A, and agree in general about mappings of numbers up to 127. But they might have different stands when it comes to numbers above 127.
There are a lot of character sets
EBCDIC
Double Byte Character Set
ANSI
Different OEM char sets
Unicode, an effort to create a single character set that included every reasonable writing system on the planet and some make-believe ones like Klingon, too.
When you say character encoding, you're talking about how a Unicode code point (a character) is stored internally.
In UTF-8 encoding, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes.
There's something called UTF-7, which is a lot like UTF-8 but guarantees that the high bit will always be zero
There are hundreds of traditional encodings which can only store some code points correctly and change all the other code points into question marks. Some popular encodings of English text are Windows-1252 (the Windows 9x standard for Western European languages) and ISO-8859-1, aka Latin-1 (also useful for any Western European language).
UTF 7, 8, 16, and 32 all have the nice property of being able to store any code point correctly.
This post is almost entirely based on Joel Spolsky's post on Unicode: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets. Read it to get a better idea.
Charset is synonym for character encoding
Default encoding depends on the operating system and locale.
EDIT
http://www.w3.org/TR/REC-xml/#sec-TextDecl
http://www.w3.org/TR/REC-xml/#NT-EncodingDecl