Non-Latin characters in username for FTP - unicode

I tried to find the list of characters allowed in an FTP username, but the RFC is not very specific. Which FTP servers and clients support user names in Unicode? Special characters? Is there a generally accepted spec that lists the characters allowed in FTP usernames? (Googling was of no help to me.)

RFC 959 5.3.2:
<username> ::= <string>
<string> ::= <char> | <char><string>
<char> ::= any of the 128 ASCII characters except <CR> and <LF>
Later RFCs (like proposed standard RFC 3659) talk about UTF-8 extensions, but only in the context of pathnames and file contents encoding.
So you can only depend on ASCII, but I suspect in practice most clients and servers support UTF-8.
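As an aside, here is a minimal Python sketch of that grammar rule (the function name is mine, not from the RFC):

def is_rfc959_username(name: str) -> bool:
    # RFC 959 <string>: one or more ASCII characters, excluding CR and LF.
    return len(name) > 0 and all(ord(c) < 128 and c not in '\r\n' for c in name)

print(is_rfc959_username('anna'))   # True
print(is_rfc959_username('añna'))   # False: non-ASCII, so not covered by the grammar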

Try encoding the username as UTF-8; most FTP servers will handle it.
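For what it's worth, a minimal Python sketch of that approach using the standard-library ftplib; the host and credentials below are placeholders. ftplib sends commands using its encoding attribute (the default became UTF-8 in Python 3.9):

import ftplib

ftp = ftplib.FTP()
ftp.encoding = 'utf-8'                     # force UTF-8 on older Python versions too
ftp.connect('ftp.example.com')             # placeholder host
ftp.login(user='jürgen', passwd='secret')  # hypothetical non-ASCII account
print(ftp.getwelcome())
ftp.quit()

Whether the login succeeds still depends on the server storing the account name as UTF-8.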

Related

Character Encodings compatibility with ASCII

I'm currently reading mails from a file and processing some of the header information. Non-ASCII characters are encoded according to RFC 2047 in quoted-printable or Base64, so the files contain no non-ASCII characters. If the file is encoded in UTF-8, Windows-1252, or one of the ISO-8859-* character encodings, I won't run into problems, because ASCII is embedded at the same place in all these charsets (so 0x41 is an 'A' in all of them).
But what if the file is encoded using an encoding that does not embed ASCII in that way? Do encodings like this even exist? And if so, is there even a reliable way of detecting them?
Mozilla has a charset detector based on this very interesting article. It can detect a large number of different encodings. There is also a port to C# available on GitHub, which I have used before; it turned out to be quite reliable. Of course, when the text contains only ASCII characters, it cannot distinguish between the encodings that encode ASCII identically, but any encoding that encodes ASCII differently should be detected correctly by this library.
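To make both points concrete: EBCDIC (cp500 in Python) is a real encoding that does not embed ASCII, and a Python port of Mozilla's detector is available as the chardet package. A minimal sketch (the sample string is mine):

import chardet  # pip install chardet

print('A'.encode('cp500'))     # b'\xc1' — EBCDIC does not put 'A' at 0x41

data = 'Grüße aus München'.encode('iso-8859-1')
result = chardet.detect(data)  # returns e.g. {'encoding': ..., 'confidence': ...}
print(result['encoding'], result['confidence'])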

Transfer UTF-8 file over FTP using ASCII mode?

The question says it all. Is it possible to transfer a UTF-8 file over FTP using ASCII mode? Or will this cause the characters to be written incorrectly? Thanks!
UTF-8 encoding was designed to be backward compatible with ASCII encoding.
RFC 959 requires FTP clients and servers to treat files in ASCII mode as 8-bit:
3.1.1.1. ASCII TYPE
...
The sender converts the data from an internal character
representation to the standard 8-bit NVT-ASCII
representation (see the Telnet specification). The receiver
will convert the data from the standard form to his own
internal form.
In accordance with the NVT standard, the <CRLF> sequence
should be used where necessary to denote the end of a line
of text. (See the discussion of file structure at the end
of the Section on Data Representation and Storage.)
...
Using the standard NVT-ASCII representation means that data
must be interpreted as 8-bit bytes.
So even a UTF-8-unaware FTP client or server should correctly translate line endings, as these are encoded identically in ASCII and UTF-8, and it should not corrupt the other characters.
From a practical point of view: I haven't met a server that has problems with 8-bit text files. I'm Czech, so I regularly work with UTF-8, and in the past I worked with the Windows-1250 and ISO/IEC 8859-2 8-bit encodings.
RFC 2640, from 1999, updates the FTP protocol to support internationalization; section 2.2 requires FTP servers to use UTF-8 as the transfer encoding. So as long as you aren't trying to upload to a DEC TOPS-20 server (which stores five 7-bit bytes within a 36-bit word), you should be fine.
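As an illustration, a minimal Python sketch of an ASCII-mode transfer with the standard-library ftplib; the host and filename are placeholders. retrlines() issues TYPE A before the RETR and decodes the incoming 8-bit stream using the connection's encoding attribute:

import ftplib

ftp = ftplib.FTP('ftp.example.com')   # placeholder host
ftp.encoding = 'utf-8'                # decode the ASCII-mode stream as UTF-8
ftp.login()                           # anonymous login
lines = []
ftp.retrlines('RETR notes-utf8.txt', lines.append)  # hypothetical file; TYPE A mode
print('\n'.join(lines))
ftp.quit()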

What is the RFC 822 format for email addresses?

I have to make a regular expression for email addresses (RFC 822), and I want to know which characters are allowed in the local part and in the domain.
I found this https://www.rfc-editor.org/rfc/rfc822#section-6.1 but I don't see that it says which are the valid characters.
According to RFC 822, the local part may contain any ASCII character, since local-part is defined using word, which is defined as atom / quoted-string; atom covers most ASCII characters, and the rest can be written in a quoted-string. There are syntactic restrictions, but obeying them, any ASCII character can be used.
On similar grounds, RFC 822 allows any ASCII character in the domain part.
On the other hand, RFC 822 was obsoleted in 2001 by RFC 2822, which in turn was obsoleted in 2008 by RFC 5322. The status of RFCs can be checked from the RFC Editor’s RFC database.
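If you do end up writing the regular expression against the current spec, here is a minimal Python sketch of the RFC 5322 dot-atom form; it deliberately ignores quoted-string local parts and domain literals:

import re

# RFC 5322 atext: ALPHA / DIGIT / ! # $ % & ' * + - / = ? ^ _ ` { | } ~
ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"
DOT_ATOM = ATEXT + r"+(?:\." + ATEXT + r"+)*"
ADDR_RE = re.compile('^' + DOT_ATOM + '@' + DOT_ATOM + '$')

print(bool(ADDR_RE.match('john.doe@example.com')))    # True
print(bool(ADDR_RE.match('"john doe"@example.com')))  # False: quoted form not handled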

What string of characters should a source send to disambiguate the byte encoding it is using?

I'm decoding bytestreams into Unicode characters without knowing the encoding used by each of a hundred or so senders.
Many of the senders are not technically astute, and will not be able to tell me what encoding they are using. It will be determined by the happenstance of the toolchains they are using to generate the data.
The senders are, for the moment, all UK/English based, using a variety of operating systems.
Can I ask all the senders to send me a particular string of characters that will unambiguously demonstrate what encoding each sender is using?
I understand that there are libraries that use heuristics to guess at the encoding - I'm going to chase that up too, as a runtime fallback, but first I'd like to try and determine what encodings are being used, if I can.
(Don't think it's relevant, but I'm working in Python)
A full answer to this question depends on a lot of factors, such as the range of encodings used by the various upstream systems, how well your users will comply with instructions to type magic character sequences into text fields, and how skilled they are at the obscure keyboard combinations needed to type those sequences.
There are some very easy character sequences which only some users will be able to type. Only users with a Cyrillic keyboard and encoding will find it easy to type "Ильи́ч" (Ilyich), and so you only have to distinguish between the Cyrillic-capable encodings like UTF-8, UTF-16, iso8859_5, and koi8_r. Similarly, you could come up with Japanese, Chinese, and Korean character sequences which distinguish between users of Japanese, simplified Chinese, traditional Chinese, and Korean systems.
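For example, the byte patterns for a Cyrillic test word differ clearly between those encodings (shown here without the combining accent, which koi8_r cannot encode):

word = 'Ильич'
for enc in ['utf-8', 'utf-16le', 'iso8859_5', 'koi8_r']:
    print(enc, word.encode(enc))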
But let's concentrate on users of western European computer systems, and the common encodings like ISO-8859-15, Mac_Roman, UTF-8, UTF-16LE, and UTF-16BE. A very simple test is to have users enter the Euro character '€', U+20AC, and see what byte sequence gets generated:
byte ['\xa4'] means iso-8859-15 encoding
bytes ['\xe2', '\x82', '\xac'] mean utf-8 encoding
bytes ['\x20', '\xac'] mean utf-16be encoding
bytes ['\xac', '\x20'] mean utf-16le encoding
byte ['\x80'] means cp1252 ("Windows ANSI") encoding
byte ['\xdb'] means macroman encoding
iso-8859-1 won't be able to represent the Euro character at all. iso-8859-15 is the Euro-supporting successor to iso-8859-1.
U.S. users probably won't know how to type a Euro character. (OK, that's too snarky. 3% of them will know.)
You should check that each of these byte sequences, interpreted as any of the possible encodings, is not a character sequence that users would be likely to type themselves. For instance, the '\xa4' of the iso-8859-15 Euro sign could also be the iso-8859-1 or cp1252 encoding of '¤' (or the first byte of its UTF-16LE encoding), the macroman encoding of '§', or a byte of any of thousands of UTF-16 characters, such as the U+A4xx Yi Syllables or U+01A3 LATIN SMALL LETTER OI. It would not be a valid first byte of a UTF-8 sequence. If some of your users submit text in Yi, you might have a problem.
The Python 3 documentation's Standard Encodings section (in the codecs module documentation) lists the character encodings which the Python standard library can easily handle. The following program lets you see how a test character sequence is encoded into bytes by various encodings:
>>> euro = '\u20ac'   # U+20AC EURO SIGN
>>> for e in ['iso-8859-1', 'iso-8859-15', 'utf-8', 'utf-16be',
...           'utf-16le', 'cp1252', 'macroman']:
...     print(e, euro.encode(e, 'backslashreplace'))
So, as an expedient, satisficing hack, consider telling your users to type a '€' as the first character of a text field, if there are any problems with encoding. Then your system should interpret any of the above byte sequences as an encoding clue, and discard them. If users want to start their text content with a Euro character, they start the field with '€€'; the first gets swallowed, the second remains part of the text.
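A minimal sketch of that swallow-the-clue step, using the byte sequences listed above (the function name is mine):

EURO_SIGNATURES = {
    b'\xe2\x82\xac': 'utf-8',
    b'\x20\xac': 'utf-16be',
    b'\xac\x20': 'utf-16le',
    b'\xa4': 'iso-8859-15',
    b'\x80': 'cp1252',
    b'\xdb': 'macroman',
}

def sniff_encoding(raw: bytes):
    # Longest signatures first, so the three-byte UTF-8 sequence wins
    # over any single-byte signature that might prefix-match.
    for sig in sorted(EURO_SIGNATURES, key=len, reverse=True):
        if raw.startswith(sig):
            return EURO_SIGNATURES[sig], raw[len(sig):]
    return None, raw

enc, rest = sniff_encoding('€héllo'.encode('utf-8'))
print(enc, rest.decode(enc))   # utf-8 héllo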

Sending multilingual email. Which charset should I use?

I want to send emails in a number of languages (en/es/fr/de/ru/pl). I notice that Gmail uses the KOI8-R charset when sending emails containing Cyrillic characters.
Can I just use KOI8-R for ALL my emails, or is there any reason to select a particular charset for each language?
I would recommend always using UTF-8 nowadays.
Wikipedia on UTF-8:
UTF-8 (8-bit UCS/Unicode Transformation Format) is a variable-length character encoding for Unicode. It is able to represent any character in the Unicode standard, yet is backwards compatible with ASCII. For these reasons, it is steadily becoming the preferred encoding for e-mail, web pages, and other places where characters are stored or streamed.
Use UTF-8. KOI8-R wouldn't be ideal for non-Russian languages, and changing codesets always tends to be a headache on the receiving side.
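A minimal Python sketch of the UTF-8 route with the standard-library email and smtplib modules; the addresses and SMTP host are placeholders:

import smtplib
from email.message import EmailMessage

msg = EmailMessage()
msg['Subject'] = 'Grüße / Привет / Witaj'
msg['From'] = 'sender@example.com'       # placeholder
msg['To'] = 'recipient@example.com'      # placeholder
# Non-ASCII content is stored as UTF-8; the library picks a suitable
# Content-Transfer-Encoding and encodes the headers per RFC 2047.
msg.set_content('Bonjour, здравствуйте, dzień dobry!')

with smtplib.SMTP('smtp.example.com') as server:  # placeholder host
    server.send_message(msg)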