I am a newcomer to Perl, though not to programming in general. I have been looking for hints on how to escape from open() in Perl, but without luck, which is why I am asking here.
I have:
$mailprog = '/usr/lib/sendmail';
open(MAIL,"|$mailprog -t");
read(STDIN, $buffer, 18);
print MAIL "To: xxx\@xxx.xxx\n";
print MAIL "From: xxx\@xxx.xxx\n";
print MAIL "Subject: xxx\n";
print MAIL $buffer;
close (MAIL);
Is there any way I can shape the input in $buffer so as to escape from sendmail? The buffer length is arbitrary, and the input is entirely under my control. Thanks a lot for any ideas!
man sendmail says:
By default, Postfix sendmail(1) reads a message from standard input
until EOF or until it reads a line with only a . character, and
arranges for delivery. Postfix sendmail(1) relies on the postdrop(1)
command to create a queue file in the maildrop directory.
So you would want your input to contain the sequence "\n.\n" somewhere.
Only one sequence is special to sendmail once it starts reading the body: A line containing a single . signals the end of the input. (EOF does the same.)
That means that if your input contains a line that contains nothing but ., you need to escape it. The default transfer encoding doesn't provide a means of escape, so you will need to specify a Content-Transfer-Encoding that avoids the issue (e.g. base64) or allows you to escape the period (e.g. quoted-printable), and encode the content accordingly.
This brings us to the restrictions that your chosen content transfer encoding adds.
The default content transfer encoding, 7bit, requires lines of no more than 998 octets terminated by CRLF. Those lines may only contain octets in [1,127], and octets 10 and 13 may only appear as part of the line terminator.
If the content transfer encoding you chose isn't suitable to encode your input, you will need to choose a different one.
You really should be using something like Email::Sender instead of working at such a low level.
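If you do stick with piping to sendmail, one way to apply the base64 suggestion above is to encode the whole body before printing it. A minimal sketch, assuming the core MIME::Base64 module; safe_body is a hypothetical helper name, not part of any API:

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64);

# Hypothetical helper: base64 output only contains [A-Za-z0-9+/=] and
# newlines, so no line of the encoded body can ever be a lone ".".
sub safe_body {
    my ($raw) = @_;
    return encode_base64($raw);
}

my $buffer = "innocent text\n.\nrest of the message\n";
my $body   = safe_body($buffer);

# No encoded line is a bare dot, so sendmail reads the whole body:
die "unsafe line" if grep { $_ eq '.' } split /\n/, $body;
```

You would also need to emit `MIME-Version: 1.0` and `Content-Transfer-Encoding: base64` headers so the receiving end decodes the body; as noted above, Email::Sender and friends take care of all of this for you.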
Related
I am facing a problem with a script I am writing. In short, I connect to a local database with DBI and execute some queries. That part works just fine, and the values returned from the SELECT queries print correctly, but when I split, say, $firstName into an array and print out the array, I get weird characters. Note that all the fields in the table I am working with contain only Greek characters and are utf8_general_ci. I have played around with use utf8, use encoding, binmode, encode, etc., but split still returns weird characters, even though the whole Greek word printed fine before the split. I suppose this is due to some missing pragma about string encoding or something similar, but I really can't find the solution. Thanks in advance.
Here is the piece of code I am describing. Perl version is v5.14.2
@query = &DatabaseSubs::getStringFromDb();
print "$query[1]\n";    # prints the greek name fine
@chars = split('', $query[1]);
foreach $chr (@chars) {
    print "$chr \n";    # prints weird chars
}
And here is the output from print and foreach respectively.
By default, Perl assumes that you are working with single-byte characters. But you aren't: in UTF-8, the Greek characters you are using are two bytes in size. Therefore split is splitting your characters in half and you're getting strange characters.
You need to decode your bytes into characters as they come into your program. One way to do that would be like this.
use Encode;
my @query = map { decode_utf8($_) } DatabaseSubs::getStringFromDb();
(I've also removed the unnecessary and potentially confusing '&' from the subroutine call.)
Now @query contains properly decoded character strings, and split will split them into individual characters correctly(*).
But if you print one of these characters, you'll get a "wide character" warning. That's because Perl's I/O layer expects single-byte characters. You need to tell it to expect UTF8. You can do that like this:
binmode STDOUT, ':utf8';
There are other improvements that you could consider. For example, you could probably put the decoding into the getStringFromDb subroutine. I recommend reading perldoc perluniintro and perldoc perlunicode for more details.
(*) Yes, there's another whole level of pain lurking when you get into two-character graphemes, but let's ignore that for now.
Your data is in utf8, but perl doesn't know that, so each perl character is just one byte of the multibyte characters that are stored in the database.
You tell perl that the data is in fact utf8 with:
utf8::decode($query[1]);
(though most database drivers provide a way to automate this before you even see the data in your code). Once you've done this, split will properly operate on the actual characters. You probably then need to also set your output filehandle to expect utf8 characters, or it will try to downgrade them to an 8-bit encoding.
The issue is that split('', $word) splits on every byte, whereas in UTF-8 a single character can span multiple bytes. Characters with code points up to 127 are fine, but anything beyond 127 is represented as multiple bytes. You're essentially printing half of a character's bytes, hence it looks like garbage.
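To see the byte/character distinction concretely, here is a small self-contained sketch using hard-coded UTF-8 octets for the Greek string "αβ" instead of a database fetch:

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

binmode STDOUT, ':utf8';    # the output layer should expect characters

my $bytes = "\xCE\xB1\xCE\xB2";               # UTF-8 octets for "αβ"
my @raw   = split //, $bytes;                 # splits mid-character: 4 pieces
my @chars = split //, decode_utf8($bytes);    # real characters: 2 pieces

printf "raw pieces: %d, characters: %d\n", scalar @raw, scalar @chars;
# raw pieces: 4, characters: 2
```

Splitting the undecoded bytes yields four one-byte fragments, none of which is a printable Greek letter; after decode_utf8 the same split yields the two actual characters.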
My Perl program takes some text from a disk file as input, wraps it in some XML, then outputs it to STDOUT. The input is nominally UTF-8, but sometimes has junk inserted. I need to sanitize the output such that no invalid UTF-8 octets are emitted, otherwise the downstream consumer (Sphinx) will blow up.
At the very least I would like to know if the data is invalid so I can avoid passing it on; ideally I could remove just the offending bytes. However, enabling all the fatalisms I can find doesn't quite get me there with perl 5.12 (FWIW, use v5.12; use warnings qw( FATAL utf8 ); is in effect).
I'm specifically having trouble with the sequence "\xEF\xBF\xBE". If I create a file containing only these three bytes (perl -e 'print "\xEF\xBF\xBE"' > bad.txt), trying to read the file with mode :encoding(UTF-8) errors out with utf8 "\xFFFE" does not map to Unicode, but only under 5.14.0. 5.12.3 and earlier are perfectly fine reading and later writing that sequence. I'm unsure where it's getting the \xFFFE (the illegal reverse BOM) from, but at least having a complaint is consistent with Sphinx.
Unfortunately, decode_utf8("\xEF\xBF\xBE", 1) causes no errors under 5.12 or 5.14. I'd prefer a detection method that didn't require an encoded I/O layer, as that will just leave me with an error message and no way to sanitize the raw octets.
I'm sure there are more sequences that I need to address, but just handling this one would be a start. So my questions are: can I reliably detect this kind of problem data with a perl before 5.14? What substitution routine can generally sanitize almost-UTF-8 into strict UTF-8?
You should read the UTF-8 vs. utf8 vs. UTF8 section of the Encode docs.
To summarize, Perl has two different UTF-8 encodings. Its native encoding is called utf8, and basically allows any codepoint, regardless of what the Unicode standard says about that codepoint.
The other encoding is called utf-8 (a.k.a. utf-8-strict). This allows only codepoints that are listed as legal for interchange by the Unicode standard.
"\xEF\xBF\xBE", when interpreted as UTF-8, decodes to the codepoint U+FFFE. But that's not legal for interchange according to Unicode, so programs that are strict about such things complain.
Instead of using decode_utf8 (which uses the lax utf8 encoding), use decode with the utf-8 encoding. And read the Handling Malformed Data section to see the different ways you can handle or complain about problems.
Update: It does appear that some versions of Perl don't complain about U+FFFE, even when using the utf-8-strict encoding. This appears to be a bug. You may just have to build a list of codepoints that Sphinx complains about and filter them out manually (e.g. with tr).
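A possible manual filter along those lines; U+FFFE and U+FFFF are just the noncharacters from the question, so extend the tr list with whatever else your consumer rejects (dropping U+FFFD here is a choice, since it also erases the markers of replaced bad bytes):

```perl
use strict;
use warnings;
use Encode qw(decode);

my $raw  = "ok\xEF\xBF\xBEtail";                       # U+FFFE embedded
my $text = decode('UTF-8', $raw, Encode::FB_DEFAULT);  # lenient decode
$text =~ tr/\x{FFFD}\x{FFFE}\x{FFFF}//d;               # drop noncharacters
                                                       # and substitutions

die "still dirty" if $text =~ /[\x{FFFD}\x{FFFE}\x{FFFF}]/;
```

Whether the decode step yields U+FFFE (older, laxer perls) or a U+FFFD substitution (stricter ones), the tr pass removes it either way.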
You have a utf8 string containing some invalid utf8...
This replaces it with a default 'bad char'.
use Encode qw(decode encode);

# FB_DEFAULT substitutes U+FFFD for anything that doesn't decode cleanly.
my $chars     = decode('UTF-8', $malformed_utf8, Encode::FB_DEFAULT);
my $good_utf8 = encode('UTF-8', $chars, Encode::FB_CROAK);
I have a script that reads a large file line by line. The record separator ($/) that I would like to use is \n. The only problem is that the data on each line contains CRLF characters (\r\n), which the program should not consider the end of a line.
For example, here is a sample data file (with the newlines and CRLFs written out):
line1contents\n
line2contents\n
line3\r\ncontents\n
line4contents\n
If I set $/ = "\n", then it splits the third line into two lines. Ideally, I could just set $/ to a regex that matches \n and not \r\n, but I don't think that's possible. Another possibility is to read in the whole file, then use the split function to split on said regex. The only problem is that the file is too large to load into memory.
Any suggestions?
For this particular task, it sounds pretty straightforward to check your line ending and append the next line as necessary:
$/ = "\n";
...
while (<$input>) {
    # A line ending in CRLF means the logical record continues.
    while ( substr($_, -2) eq "\r\n" ) {
        my $next = <$input>;
        last unless defined $next;   # guard against EOF mid-record
        $_ .= $next;
    }
    ...
}
This is the same logic used to support line continuation in a number of different programming contexts.
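A self-contained version of that loop, reading from an in-memory filehandle so it can be run directly (the sample data is the one from the question):

```perl
use strict;
use warnings;

my $data = "line1contents\nline2contents\nline3\r\ncontents\nline4contents\n";
open my $input, '<', \$data or die $!;   # in-memory filehandle for the demo

my @records;
while (my $line = <$input>) {
    # A trailing CRLF means the logical record continues on the next line.
    while ( substr($line, -2) eq "\r\n" and not eof($input) ) {
        $line .= <$input>;
    }
    push @records, $line;
}

print scalar(@records), "\n";   # 4
```

The third record comes back as "line3\r\ncontents\n", i.e. the embedded CRLF no longer splits it in two.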
You are right that you can't set $/ to a regular expression.
dos2unix would put a UNIX newline character in for the "\r\n" and so wouldn't really solve the problem. I would use a regex that replaces all instances of "\r\n" with a space or tab character and save the results to a different file (since you don't want to split the line at those points). Then I would run your script on the new file.
Try using dos2unix on the file first, and then read in as normal.
I have written a server program using select. Then I connected a client using telnet, and the connection completed successfully.
If the input is 6 characters long including the newline, the server side reports the length as 7 characters. How is that possible?
Server side:
The client is sending \r\n instead of \n, which would account for the extra character. You can translate it back to just a newline with a simple regex:
# $data holds the input line from the client.
$data =~ s/\r\n/\n/g; # Search for \r\n, replace it with \n
Client side:
Assuming you're using Net::Telnet, you're probably sending 2 characters for the newline, \r and \n, as specified by the Telnet RFC.
The documentation I linked to says this,
In the input stream, each sequence of carriage return and line feed (i.e. "\015\012" or CR LF) is converted to "\n". In the output stream, each occurrence of "\n" is converted to a sequence of CR LF. See binmode() to change the behavior. TCP protocols typically use the ASCII sequence carriage return and line feed to designate a newline.
And the default is not binary mode (binmode), meaning that all instances of \n in your client data will be replaced by \r\n before it gets sent to the server.
The default Binmode is 0, which means do newline translation.
You can stop the module from replacing your newlines by calling binmode on your file descriptor, or in the case of Net::Telnet, call binmode on your object and pass 1.
# Do not translate newlines.
$obj->binmode(1);
Or on the server you can search for \r\n on the input data and replace it with \n.
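As a quick sanity check of where the extra character comes from (6 characters typed, 7 octets on the wire):

```perl
my $recv = "hello\r\n";     # 7 octets arrive from the telnet client
$recv =~ s/\r\n/\n/g;       # translate the CRLF back to a bare newline
print length($recv), "\n";  # 6
```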
I'm trying to clean up form input using the following Perl transliteration:
sub ValidateInput {
my $input = shift;
$input =~ tr/a-zA-Z0-9_@.:;',#$%&()\/\\{}[]?! -//cd;
return $input;
}
The problem is that this transliteration is removing embedded newline characters that users may enter into a textarea field which I want to keep as part of the string. Any ideas on how I can update this to stop it from removing embedded newline characters? Thanks in advance for your help!
I'm not sure what you are doing, but I suspect you are trying to keep all the characters between the space and the tilde in the ASCII table, along with some of the whitespace characters. I think most of your list condenses to a single range \x20-\x7e:
$string =~ tr/\x0a\x0d\x20-\x7e//cd;
If you want to knock out a character like " (although I suspect you really want it since you allow the single quote), just adjust your range:
$string =~ tr/\x0a\x0d\x20-\x21\x23-\x7e//cd;
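For instance, the single-range version keeps embedded line breaks while stripping other control characters (a stray BEL in this sketch):

```perl
my $input = "line one\r\nline two\x07";   # embedded CRLF plus a stray BEL
(my $clean = $input) =~ tr/\x0a\x0d\x20-\x7e//cd;
print $clean;   # "line one\r\nline two" -- CR and LF kept, BEL deleted
```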
That's a bit of a byzantine way of doing it! If you add \012 it should keep the newlines.
$input =~ tr/a-zA-Z0-9_@.:;',#$%&()\/\\{}[]?! \012-//cd;
See Form content types.
application/x-www-form-urlencoded: Line breaks are represented as "CR LF" pairs (i.e., %0D%0A).
...
multipart/form-data: As with all MIME transmissions, "CR LF" (i.e., %0D%0A) is used to separate lines of data.
I do not know what you have in the database, but now you know what your script sees.
You are using CGI.pm, right?
Thanks for the help guys! Ultimately I decided to process all the data in our database to remove the character that was causing the issue so that any text that was submitted via our update form (and not changed by the user) would match what was in the database. Per your suggestions I also added a few additional allowed characters to the validation regex.