A sed script to fix an invalid RFC2822 email header - sed

I have emails coming into a processing system, and some of them have started coming in with an invalid blank line splitting the email headers, like this:
Date: Thu, 7 Mar 2013 22:24:44 +0000
Message-ID: <86A1035194F72547A2979A7767CD3BAF35485B8D@QTS-MB02.ecicloud.com>
References: <C0DA0966847B31409025BBD9A70187DA35399D17@QTS-MB02.ecicloud.com>

Accept-Language: en-US
Content-Language: en-US
The blank line in the middle is invalid and causes problems for downstream programs.
I'd like to come up with a simple sed script that fixes any occurrence of Accept-Language:.* preceded by a blank line, so that the blank line is eliminated.

Delete all blank lines in a file with sed (note that this also removes the blank line that separates the headers from the body):
sed -i '/^\s*$/d' file
Delete blank lines from the start of the file up to the line starting with Content-Language:
sed -i '1,/^Content-Language/{/^\s*$/d}' file

sed '/^[ \t]*$/ {N;/\nAccept-Language: en-US$/! P;D;}' FILE
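If the language tag is not always en-US, the same N/P/D idea can be loosened to match any Accept-Language value. This is only a minimal sketch, assuming GNU sed and a POSIX shell; the function name fix_split_headers is made up for illustration.

```shell
# fix_split_headers is a hypothetical helper name, not an existing tool.
# It drops a blank line only when the very next line starts with Accept-Language:.
fix_split_headers() {
  # On a blank line, append the next line (N). Unless that next line is an
  # Accept-Language header, print the blank line (P). D then removes the
  # blank line from the pattern space and restarts the cycle with the rest.
  sed '/^[[:space:]]*$/{N;/\nAccept-Language:/!P;D;}' "$@"
}

# Example: the blank line before Accept-Language goes away, while the
# blank line that separates the headers from the body is kept.
printf 'References: <x>\n\nAccept-Language: en-GB\n\nbody\n' | fix_split_headers
```

Because only a blank line directly followed by Accept-Language: is dropped, the empty line between headers and body should survive, unlike with the "delete all blank lines" approach.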

You should really look at the formail and procmail commands for processing email. See http://www.procmail.org/ and http://linuxcommand.org/man_pages/formail1.html.

Related

What is this sed command doing?

I am looking for non-printable characters in a file, and I found this web page.
It shows the following command:
sed "l" file
If I am not mistaken, according to the man page, this command does the following:
List out the current line in a ''visually unambiguous'' form.
Moreover, when I run this command on a fake file with one line, the output is as follows:
The line is displayed twice. The first time, each output line contains at most 69 bytes of the input line, and the rest of the line continues on the next output line.
The second time, the line is displayed at its full length.
fake file
toto, titi, tatafdsfdsfdgfgfdsgrgdfgzfdgzgffgerssssssssssssssssssssssssss
Command
sed "l" fake_file
output
$ sed "l" fake_file
toto, titi, tatafdsfdsfdgfgfdsgrgdfgzfdgzgffgerssssssssssssssssssssss\
ssss$
toto, titi, tatafdsfdsfdgfgfdsgrgdfgzfdgzgffgerssssssssssssssssssssssssss
Questions
What does ''visually unambiguous'' mean exactly?
Why is the output like this? I was expecting only one line with the $ sign at the end. I was also not expecting the output to be limited to 69 bytes per line.
Environment
Tested with same output on:
sed (GNU sed) 4.7
sed (GNU sed) 4.2.2
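By default, sed prints the pattern space after processing each line, so the line shows up once from the l command and once from that automatic print. If you handle the output yourself, tell sed not to print the line automatically with the -n switch.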
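To make that concrete, here is a small sketch, assuming GNU sed. ''Visually unambiguous'' means that non-printable characters are written as escapes such as \t and the end of each line is marked with $. The numeric argument to l is a GNU extension, and a length of 0 turns off the wrapping of long lines. The function name show_invisible is only illustrative.

```shell
# show_invisible is a hypothetical helper name.
# -n suppresses the automatic print, so each line appears only once.
# 'l 0' lists the line in escaped form without wrapping (GNU sed extension).
show_invisible() {
  sed -n 'l 0' "$@"
}

# A tab and a trailing space become visible: the tab is shown as \t and
# the $ marks where the line really ends.
printf 'toto,\ttiti \n' | show_invisible
```

With plain `sed -n l` the same escapes are used, but long lines are still folded with a trailing backslash, which is where the 69-byte pieces come from.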

Using Windows version of Sed can I change a text file with timecode to a format for a video editing software. Also delete the first line

Is it possible to use Windows version Sed to change a text file with timecode to a format for the video editing software Edius?
%Frame Rate 3,0
00:00:00:00
00:00:00:00
00:01:06:15
00:07:19:12
00:09:52:03
I need to add a comma , then the timecode in quotation marks "00:00:00:00" then another comma and two quotation marks ,""
,"00:00:00:00",""
,"00:01:06:15",""
,"00:07:19:12",""
,"00:09:52:03",""
Thanks
sed 's/^/,"/;s/$/",""/' file > file.out && mv file.out file
If you're using Linux it can be simplified to
sed -i 's/^/,"/;s/$/",""/' file
Or if on macOS, then you'll need to pass an empty backup suffix as a separate argument
sed -i '' 's/^/,"/;s/$/",""/' file
IHTH
$ sed 's/.*/,"&",""/' file
,"00:00:00:00",""
,"00:01:06:15",""
,"00:07:19:12",""
,"00:09:52:03",""
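The question title also asks for the first line to be deleted, which the commands above do not do. A minimal sketch that combines both steps, assuming a POSIX shell (under cmd.exe on Windows the quoting would need to be adapted) and a file with plain LF line endings; the function name to_edius_markers is made up.

```shell
# to_edius_markers is a hypothetical helper name.
# 1d drops the first line (the "%Frame Rate" header); the substitution
# wraps every remaining line as ,"<timecode>","".
to_edius_markers() {
  sed '1d;s/.*/,"&",""/' "$@"
}

printf '%%Frame Rate 3,0\n00:00:00:00\n00:01:06:15\n' | to_edius_markers
```

Redirect the output to a new file (or use one of the in-place variants above) once the result looks right.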

echoing timestamp appends annoying "^@"

I run this command in vim: :echom system("date")<CR>
My expected output is something like this: Sat Jan 10 12:28:58 CET 2015
But it always appends an annoying ^@
so that the output looks like this: Sat Jan 10 12:28:58 CET 2015^@
Why?
And how can I easily avoid this?
When I run date in the terminal it gives me the expected output, plus a newline of course (*1). So my guess is that the ^@ comes from the newline, right?
I run vim 7.3 on Debian (the version from the official repositories) in the terminal version (not the GUI version!) in a gnome-terminal with UTF-8 encoding.
(*1): the prompt looks like this
user@host$ date
Sam Jän 10 12:28:58 CET 2015
user@host$
not like this:
user@host$ date
Sam Jän 10 12:28:58 CET 2015user@host$
The ^@ does come from the fact that the output of date ends with a newline (\n).
You can either:
remove the last character (this only gives the expected result if the command output ends with a newline):
:echom system("date")[:-2]
substitute the trailing \n (a bit more verbose):
:echom substitute(system("date"), '\n$', '', '')

Wget changing character around download location

I have a little perl script that I updated to download images from tvrage. But I have a problem.
This is the code line I have problems with:
system "wget -P '/home/user/script/cache/posters' $imgurl";
It usually works just fine but from time to time it fails with the same error.
HTTP request sent, awaiting response... 200 OK
Length: 16758 (16K) [image/jpeg]
Saving to: â/home/user/script/cache/posters/28386.jpgâ
ERROR! Wide character in syswrite at IO/Handle.pm line 207.
ERROR! Wide character in syswrite at IO/Handle.pm line 207.
Compilation failed in require.
Wide character in syswrite at IO/
I have located the problem to be that wget changes ‘ and ’ to â
â/home/user/script/cache/posters/28386.jpgâ
All successful downloads have the ‘ and ’
HTTP request sent, awaiting response... 200 OK
Length: 28218 (28K) [image/jpeg]
Saving to: ‘/home/user/script/cache/posters/6597.jpg’
I just tried adding this
system "wget --restrict-file-names=nocontrol -P '/home/tup/tuper4/cache/posters' $imgurl";
In the hope that it would work better and so far it has not failed but I suspect it's not the issue and would like some guidance if possible.
Should I maybe try
system "cd /location/ && wget $imgurl";
Would it make any difference?
I guess my real question here is: What could cause wget to change from ‘ and ’ to â ?
Thank you in advance for any help!
Output of locale is:
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
And the images are also UTF-8
I did suspect that it had to do with the encoding and hence added
--restrict-file-names=nocontrol
It remains to be seen whether it works.
Edit: Several days later and I have not seen the error again so it looks like "nocontrol" helped.
It's not wget changing the character.
The character encoding seems to be set to something wrong.
When the output really is UTF-8, as it probably is, but is interpreted as some other encoding, the quote showing up as the character â is a typical symptom. Sometimes it's followed by more characters.
So it should work if you set the encoding to UTF-8.
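The symptom can be reproduced by hand. The sketch below assumes that the UTF-8 bytes are being misread as ISO-8859-1 (the actual wrong encoding on the affected system is not known) and that iconv and od are available; the function name misread_as_latin1 is made up.

```shell
# misread_as_latin1 is a hypothetical helper name.
# It treats the incoming bytes as ISO-8859-1 and re-encodes them as UTF-8,
# which is what a misconfigured consumer effectively does.
misread_as_latin1() {
  iconv -f ISO-8859-1 -t UTF-8
}

# The left single quote U+2018 is the three UTF-8 bytes E2 80 98.
# Misread byte by byte, E2 becomes the visible character â (C3 A2), while
# 80 and 98 become invisible control characters (C2 80 and C2 98).
printf '\342\200\230' | misread_as_latin1 | od -An -tx1
```

That would explain why a single ‘ or ’ turns into â, sometimes with extra characters after it.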
--
What is the output of the command locale?
Background info:
http://askleo.com/why_do_i_get_odd_characters_instead_of_quotes_in_my_documents/
Googling "â quote" gives some good results.

sed: matching unicode blocks with

I am desperately trying to replace certain unicode characters (graphemes) in a file using sed. However, I keep failing for some of them, namely the ones from these unicode blocks:
\p{InHigh_Surrogates}: U+D800–U+DB7F
\p{InHigh_Private_Use_Surrogates}: U+DB80–U+DBFF
\p{InLow_Surrogates}: U+DC00–U+DFFF
I tried (in a sed config file loaded via the -f switch):
s/\p{InHigh_Surrogates}/###/ --> no effect at all
s/\\p\{InHigh_Surrogates\}/###_D-NON-UTF8_###/ -> error message 'Invalid content of \{\}'
Anybody got a suggestion? Also, I am not necessarily focused on using the blocks - but I also failed trying to define a character range of the form \xd800-\xdfff.
Thanks,
Thomas
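Try using the -r flag for sed. Note that sed has no support for \p{...} Unicode properties, so the command below only matches the literal text \p{InHigh_Surrogates} in the file, not code points from that block: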
$ sed -r 's/\\p\{InHigh_Surrogates\}/###/g' file
###: U+D800–U+DB7F
\p{InHigh_Private_Use_Surrogates}: U+DB80–U+DBFF
\p{InLow_Surrogates}: U+DC00–U+DFFF
From man sed:
-r, --regexp-extended
use extended regular expressions in the script.
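For the character-range attempt in the question, one possible approach is to match the raw bytes instead. Surrogate code points U+D800 to U+DFFF are not valid in UTF-8, but if they were written out in the three-byte UTF-8 pattern anyway they would appear as ED A0 80 through ED BF BF. The sketch below assumes GNU sed (for the \xHH escapes) and assumes the file really contains such byte sequences; the function name replace_surrogate_bytes is made up.

```shell
# replace_surrogate_bytes is a hypothetical helper name.
# LC_ALL=C makes sed work on single bytes rather than multibyte characters.
# ED, then A0..BF, then 80..BF covers the encoded forms of U+D800..U+DFFF.
replace_surrogate_bytes() {
  LC_ALL=C sed 's/\xED[\xA0-\xBF][\x80-\xBF]/###/g' "$@"
}

# ED A0 80 is the encoded form of U+D800 and is replaced;
# ED 9F BF (U+D7FF, a valid character) would be left alone.
printf 'a\355\240\200b\n' | replace_surrogate_bytes
```

The same byte-range idea can be adapted to other blocks by working out their encoded byte ranges first.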