Socket: read until \n. What if bytes in the message happen to be \n and end reading prematurely?

I plan to have a socket reading data until it gets to the \n character (In order to read individual messages from a data stream). What happens, however, if the message you're sending happens to have bytes that match the \n character? Won't that end reading prematurely and mess up everything? How do people usually read until a certain part in their data?
Ok, Joe provided a lot of good alternatives to reading till the "\n" character. (Now that I think of it, \n is probably only used for text based things)
Quote from Joe: "There are a lot of choices here: 1) substitute something for the delimiter when it's part of the content; 2) pick a less likely delimiter than \n; 3) include a message length ahead of each message that you parse out."
#3 seems like the best way to accomplish what I'm trying to do.
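As a sketch of option 3, here is one way length-prefixed framing can look in Python. The 4-byte big-endian header is my own arbitrary choice, not something from the thread:

```python
import struct

def frame(payload: bytes) -> bytes:
    """Prefix the payload with a 4-byte big-endian length header."""
    return struct.pack('>I', len(payload)) + payload

def unframe(buffer: bytes):
    """Split one complete message off the front of buffer, if present.

    Returns (payload, rest), or (None, buffer) if more bytes are needed.
    """
    if len(buffer) < 4:
        return None, buffer
    (length,) = struct.unpack('>I', buffer[:4])
    if len(buffer) < 4 + length:
        return None, buffer
    return buffer[4:4 + length], buffer[4 + length:]

# A message containing \n travels intact:
msg, rest = unframe(frame(b'line one\nline two'))
```

Because the receiver counts bytes instead of scanning for a delimiter, \n bytes inside the payload are harmless.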

Related

Why doesn't IO::Socket::Async emit a trailing "a"?

I was wondering if anyone knows how to get around the encoding behavior of IO::Socket::Async, particularly the drawback described by this:
For example, if the UTF-8 encoding is being used and the last byte in the packet decoded to "a", this would not be emitted since the next packet may include a combining character that should form a single grapheme together. Control characters (such as \n) always serve as grapheme boundaries, so any text-based protocols that use newlines or null bytes as terminators will not need special consideration.
This is currently causing my sockets to omit the last character of messages, but I am not sure how to work around it. I tried converting the connection to a Channel and feeding a dummy \n into it to simulate end of input for the message, but that did not work. How can I work around this quirk of UTF-8 decoding?
Here is a minimal example to reproduce this:
sub listen(Int $port) {
    react {
        whenever IO::Socket::Async.listen('0.0.0.0', $port) -> $connection {
            whenever $connection.Supply -> $data {
                say $data;
                $connection.print: $data;
            }
        }
    }
}
listen(9999);
Now if you hit port 9999 on your local machine with any data that does not end with \n you will see that the last byte is ignored.
It's not a "drawback"; it's just Raku reflecting how Unicode works. If you know you only need to handle ASCII or Latin-1, then specify that:
whenever $connection.Supply(:enc<ascii>) -> $data {  # or :enc<latin-1>
    ...
}
If you want to handle Unicode text, then it's necessary to deal with the fact that receiving, for example, the codepoint for the letter "a" does not give enough information to pass along a complete character, since the next codepoint received in the next packet might be a combining character, such as an acute accent to be placed on the "a". Note that a Raku Str is a character-level data structure (in other languages, strings are often bytes or codepoints, which creates different problems that are largely invisible to those who only care about English text!)
Any well-designed network protocol will provide a way to know when the end of the text content has been reached. Some protocols, such as HTTP, explicitly specify the byte length of the content, so one can work at the byte level (:bin) and decode the result after receiving that many bytes. Others might use connection close or line breaks.
In conclusion, the string semantics of IO::Socket::Async (and elsewhere in Raku) aren't themselves a problem, but they may expose design problems in protocols.
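Python's incremental UTF-8 decoder shows the same buffering idea at the byte level (Python only holds back incomplete byte sequences, while Raku additionally holds back complete codepoints that might still combine into one grapheme), so this sketch is an analogy rather than the Raku behavior:

```python
import codecs

dec = codecs.getincrementaldecoder('utf-8')()
# A packet that ends mid-character: the lead byte of "é" is held back...
part1 = dec.decode(b'caf\xc3')
# ...and the character is emitted once the continuation byte arrives
# in the next packet.
part2 = dec.decode(b'\xa9')
```

Here `part1` is `'caf'` and `part2` is `'é'`: nothing is lost, it just arrives once the decoder can be sure the character is complete.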

Is there an "end of heading" or "beginning of transmission" character in Unicode?

Unicode has characters for START OF HEADING (␁ U+0001), START OF TEXT (␂ U+0002), END OF TEXT (␃ U+0003), and END OF TRANSMISSION (␄ U+0004). What's confusing about this is that, while there is a START OF HEADING character, there is no END OF HEADING character, and while there is an END OF TRANSMISSION character, there is no START OF TRANSMISSION character.
Where are these missing characters?
How should I go about representing the start of a transmission, or the end of a heading, using Unicode?
If the answer is "just use START OF HEADING in place of START OF TRANSMISSION," then what should I do if my "transmission" doesn't have a "heading"?
If the second part of the answer is "just use START OF TEXT in place of END OF HEADING," what happens if there is something between the heading and the text?†
† I can't imagine that this happens often (if ever), but I'm asking just in case someone out there ever tries to put something between the end of the heading and the start of their text.
Stack Exchange doesn't have a Unicode site, so I'm posting this here. If someone thinks it would fit better on one of the other Stack Exchange sites, please let me know in the comments.
The characters U+0000 to U+001F are imported directly from ASCII. If it didn't exist in ASCII, it doesn't exist in that range of Unicode.
Most are obsolete; in-band delimiters are not so much used nowadays. If you're using an existing protocol with in-band delimiters, it'll have rules based on ASCII usage; if you're designing a new protocol, there are probably better ways to proceed.
As far as I recall, there's no need for end-of-header in typical usage, because it coincides with start-of-text. There's presumably no need for start-of-transmission because the first thing you receive is the start of the transmission, right after synchronization (start bits in asynchronous line disciplines, SYN characters in synchronous ones).

Why does an email subject contain linefeed or carriage return characters?

I'm writing code to check a mailbox and forward unseen mails to another user.
But sometimes it fails with an error:
ValueError: Header values may not contain linefeed or carriage return characters
I checked the raw fetched data and found out that the 'Subject' value contains \r\n.
Not all mails contain them, but some do.
It just appears normal in the mailbox, and I have no idea why some contain such characters.
Does it have to do with the length of the subject?
How can I deal with these situations?
Thanks :)
Email messages have a maximum line length. That rule is historical and isn't upheld 100% of the time, so to speak. To stay within it, long header fields are "folded": a CR LF followed by a space or horizontal tab inside a header field is treated as plain whitespace, not as the end of the field. This is a really long subject, encoded in that way:
Subject: Pretend this is about 80-90
characters long
The simplest way to deal with it is to treat any sequence of whitespace characters (including the CR LF of a fold) as a single space.
Read the source of any email message and you'll see this wrapping in most of them. The Received field is almost always wrapped, for instance, and quite often To if there are many addressees, or Content-Type/Content-Disposition for attachments.
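As a sketch, unfolding a header this way is a one-line substitution in Python; the sample subject is the folded one shown above:

```python
import re

raw = "Subject: Pretend this is about 80-90\r\n characters long"
# A CRLF immediately followed by whitespace is folding, not a real line
# break; replace the whole run with a single space.
unfolded = re.sub(r'\r\n[ \t]+', ' ', raw)
```

After the substitution, `unfolded` is the single-line subject with one space where the fold was.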

Why is there a '=' at the end of lines in an SMTP message body?

I receive email messages over sockets and see that long lines in the message body are broken up, separated by the following expression
'=\r\n'
I cannot find any documentation on this and wonder if someone just happens to know where I can find information on this behavior.
Also, please give feedback ONLY on my question, no general comments regarding email and sockets!
Thanks
Alex
From Wikipedia, regarding Quoted-printable:
Lines of quoted-printable encoded data must not be longer than 76 characters. To satisfy this requirement without altering the encoded text, soft line breaks may be added as desired. A soft line break consists of an "=" at the end of an encoded line, and does not appear as a line break in the decoded text.
The \r\n is the standard line terminator in email. Together with the preceding "=", it forms a soft line break inserted by the quoted-printable encoder; a client that decodes the content will remove it, so it will not render as an actual line break, while a viewer showing the raw message will display it literally.
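Python's quopri module demonstrates the rule; the sample body below is made up. The trailing "=" plus CRLF is a soft break that vanishes on decode, while a bare CRLF survives as a real line break:

```python
import quopri

body = b"This line was wrapped by the sen=\r\nder at the 76-character limit.\r\n"
# The '=\r\n' soft break disappears on decoding; the final '\r\n' is a
# real line break and is kept.
decoded = quopri.decodestring(body)
```

After decoding, the two halves of "sender" are joined back into one word on a single line.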

How to detect malformed UTF-8 characters

I want to detect and replace malformed UTF-8 characters with blank space using a Perl script while loading the data using SQL*Loader. How can I do this?
Consider Python. It lets you extend codecs with user-defined error handlers, so you can replace undecodable bytes with anything you want.
import codecs
codecs.register_error('spacer', lambda ex: (' ', ex.start + 1))
s = b'spam\xb0\xc0eggs\xd0bacon'.decode('utf8', 'spacer')
print(s)
This prints, with each undecodable byte replaced by one space:
spam  eggs bacon
EDIT: (Removed bit about SQL Loader as it seems to no longer be relevant.)
One problem is going to be working out what counts as the "end" of a malformed UTF-8 character. It's easy to say what's illegal, but it may not be obvious where the next legal character starts.
RFC 3629 describes the structure of UTF-8 characters. If you take a look at that, you'll see that it's pretty straightforward to find invalid characters, AND that the next character boundary is always easy to find (it's a character < 128, or one of the "long character" start markers, with leading bits of 110, 1110, or 11110).
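A minimal sketch of that boundary rule in Python (the function name is mine, not from RFC 3629): a byte can begin a character if it is ASCII or carries one of the 110/1110/11110 lead patterns, i.e. it lies outside the 10xxxxxx continuation range.

```python
def next_char_boundary(data: bytes, pos: int) -> int:
    """Return the index of the next byte that can begin a UTF-8 character."""
    while pos < len(data):
        b = data[pos]
        # ASCII (< 0x80) or a 2- to 4-byte lead byte (0xC0-0xF7) can start
        # a character; 0x80-0xBF are continuation bytes and 0xF8-0xFF are
        # never valid in UTF-8, so both are skipped during resynchronization.
        if b < 0x80 or 0xC0 <= b <= 0xF7:
            return pos
        pos += 1
    return pos

# After the stray continuation bytes 0xB0 0x80, resync at the 'e' of 'eggs':
boundary = next_char_boundary(b'\xb0\x80eggs', 0)
```

Note this only finds where the next character *could* start; a strict validator would still need to check the continuation bytes that follow each lead byte.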
But BKB is probably correct: the easiest answer is to let Perl do it for you, although I'm not sure what Perl does when it detects incorrect UTF-8 with that filter in effect.