Does . really match any character? - sed

I am using a very simple sed script to remove comments: sed -e 's/--.*$//'
It works great until non-ascii characters are present in a comment, e.g.: -- °.
This line does not match the regular expression and is not substituted.
Any idea how to get . to really match any character?
Solution :
Since file reports the input as ISO-8859 text, the LANG environment variable must be changed before calling sed:
LANG=iso8859 sed -e 's/--.*//' -

It works for me. It's probably a character encoding problem.
This might help:
Why does sed fail with International characters and how to fix?
http://www.barregren.se/blog/how-use-sed-together-utf8

@julio-guerra: I ran into a similar situation, trying to delete lines like the following (note the Æ character):
--MP_/yZa.b._zhqt9OhfqzaÆC
in a file, using
sed 's/^--MP_.*$//g' my_file
The file encoding indicated by the Linux file command was
file my_file: ISO-8859 text, with very long lines
file -b my_file: ISO-8859 text, with very long lines
file -bi my_file: text/plain; charset=iso-8859-1
I tried your solution (clever!), with various permutations; e.g.,
LANG=ISO-8859 sed 's/^--MP_.*$//g' my_file
but none of those worked. I found two workarounds:
The following Perl expression worked, i.e. deleted that line:
perl -pe 's/^--MP_.*$//g' my_file
[For an explanation of the -pe command-line switches, refer to this StackOverflow answer:
Perl flags -pe, -pi, -p, -w, -d, -i, -t? ]
Alternatively, after converting the file encoding to UTF-8, the sed expression worked (the Æ character remained, but was now UTF8-encoded):
iconv -f iso-8859-1 -t utf-8 my_file > my_file.utf8
As I am working with thousands of emails in various encodings that undergo intermediate processing (bash-scripted conversions to UTF-8 do not always work), the first workaround above (Perl) will probably be the most robust for my purposes.
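Both workarounds can be tried on a throwaway line. This is only a minimal sketch, assuming perl and iconv are installed; the sample bytes are made up for illustration (\306 is Æ in ISO-8859-1). Perl treats its input as bytes by default, so . matches the stray byte, while iconv turns it into valid UTF-8 before sed ever sees it.

```shell
# Hypothetical sample: one line to blank (contains byte 0xC6, i.e. Æ in
# ISO-8859-1) followed by one line to keep.
sample() { printf '%s\306C\n%s\n' '--MP_/yZa' 'keep'; }

# Workaround 1: perl works on bytes by default, so . also matches 0xC6.
perl_out=$(sample | perl -pe 's/^--MP_.*$//')

# Workaround 2: convert to UTF-8 first, then let sed do the substitution.
sed_out=$(sample | iconv -f iso-8859-1 -t utf-8 | sed 's/^--MP_.*$//')

# Brackets make the blanked first line visible.
printf '[%s]\n' "$perl_out" "$sed_out"
```

In both cases the first line should come out empty and the second line untouched.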
Notes:
sed (GNU sed) 4.4
perl v5.26.1 built for x86_64-linux-thread-multi
Arch Linux x86_64 system

The documentation of GNU sed's z command mentions this effect (my emphasis):
This command empties the content of pattern space. It is usually
the same as 's/.*//', but is more efficient and works in the
presence of invalid multibyte sequences in the input stream. POSIX
mandates that such sequences are not matched by '.', so that
there is no portable way to clear sed's buffers in the middle of
the script in most multibyte locales (including UTF-8 locales).
It seems likely that you are running sed in a UTF-8 (or other multibyte) locale. You'll want to set LC_CTYPE (that's finer-grained than LANG, and won't affect translation of error messages). Valid locale names usually look like en.iso88591 or (for the location in your profile) fr_FR.iso88591, not just the encoding on its own; you might be able to see the full list with locale -a.
Example:
LC_CTYPE=fr_FR.iso88591 sed -e 's/--.*//'
Alternatively, if you know that the non-comment parts of the line contain only ASCII, you could split the line at a comment marker, print the first part and discard the remainder:
sed -e 's/--/\n/' -e 'P' -e 'd'
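To see both suggestions side by side without depending on which ISO-8859 locales happen to be installed, here is a small sketch. It assumes GNU sed and a made-up input line, and it swaps in the C locale (where every byte counts as one character) for the fr_FR.iso88591 example; that swap is my assumption, not part of the answer above.

```shell
# Hypothetical input: an SQL-style line whose comment holds byte 0xB0
# (the degree sign in ISO-8859-1).
line() { printf 'select 1; %s \260\n' '--'; }

# In the C locale each byte is a character, so . can match 0xB0.
c_locale_out=$(line | LC_ALL=C sed -e 's/--.*$//')

# Locale-independent: split at the marker, print the first part, drop the rest.
split_out=$(line | sed -e 's/--/\n/' -e 'P' -e 'd')

# Brackets make the trailing space visible.
printf '[%s]\n' "$c_locale_out" "$split_out"
```

Both variables should end up holding just the code part of the line, including its trailing space.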

Related

Replacing Windows CRLF with Unix LF using Perl -- `Unrecognized switch: -g`?

Problem Background
We have several thousand large (more than 10M lines) text files of tabular data produced by a Windows machine, which we need to prepare for upload to a database.
We need to change the file encoding of these files from cp1252 to utf-8, replace any bare Unix LF sequences (i.e. \n) with spaces, then replace the DOS line end sequences ("CR-LF", i.e. \r\n) with Unix line end sequences (i.e. \n).
The dos2unix utility is not available for this task.
We initially had a bash function that packaged these operations together using iconv and sed, with iconv doing the encoding and sed dealing with the LF/CRLF sequences. I'm trying to replace part of this bash function with a perl command.
Example Code
Based on some helpful code review, I want to change this function to a perl script.
The author of the code review suggested the following perl to replace CRLF (i.e. "\r\n") with LF ("\n").
perl -g -pe 's/(?<!\r)\n/ /g; s/\r\n/\n/g;'
The explanation for why this is better than what we had previously makes perfect sense, but this line fails for me with:
Unrecognized switch: -g (-h will show valid options).
More interestingly, the author of the code review also suggests it is possible to perform the decode/recode in a perl script, too, but I am completely unsure where to start.
Questions
Please can someone explain why the suggested answer fails with "Unrecognized switch: -g (-h will show valid options)."?
If it helps, the line is supposed to receive piped input from iconv as follows (though I am interested in learning how to use perl to do the decoding/recoding step, too):
iconv --from-code=CP1252 --to-code=UTF-8 "$1" | \
perl -g -pe 's/(?<!\r)\n/ /g; s/\r\n/\n/g;' \
> "$2"
(Highly simplified) example input for testing:
apple|orange|\n|lemon\r\nrasperry|strawberry|mango|\n\r\n
Desired output:
apple|orange| |lemon\nrasperry|strawberry|mango| \n
Perl recently added the command line switch -g as an alias for 'gulp mode' in Perl v5.36.0.
This works in Perl version v5.36.0:
s=$(printf "Line 1\nStill Line 1\r\nLine 2\r\nLine 3\r\n")
perl -g -pe 's/(?<!\r)\n/ /g; s/\r\n/\n/g;' <<<"$s"
Prints:
Line 1 Still Line 1
Line 2
Line 3
But with any version of Perl earlier than v5.36.0, you would use:
perl -0777 -pe 's/(?<!\r)\n/ /g; s/\r\n/\n/g;' <<<"$s"
# same
BTW, the conversion you are looking for is much easier in this case with awk, since it is close to awk's defaults.
Just do this:
awk -v RS="\r\n" '{gsub(/\n/," ")} 1' <<<"$s"
Line 1 Still Line 1
Line 2
Line 3
Or, if you have a file:
awk -v RS="\r\n" '{gsub(/\n/," ")} 1' file
This is superior to the posted perl solution since the file is processed record by record (each block of text separated by \r\n) rather than having to read the entire file into memory.
(On Windows you may need to do awk -v RS="\r\n" -v ORS="\n" '...')
Another note:
You can get similar behavior from Perl by:
Setting the input record separator to the fixed string $/="\r\n" in a BEGIN block;
Using the -l switch so every record has the input record separator removed;
Using tr for speedy replacement of \n with ' ';
Possibly setting the output record separator, $\="\n", on Windows.
Full command:
perl -lpE 'BEGIN{$/="\r\n"} tr/\n/ /' file
The error message is about the command-line switch -g in perl -g -pe ..., not about the /g modifier on the regex. The modifier is valid (but useless without slurp mode, since -p reads line by line and each line contains only a single \n).
This switch simply does not exist with the perl version you are using. It was only added with perl 5.36, so you are likely using an older version. Try -0777 instead.

translate perl script removing control characters in script(1) output to sed

I'm recording terminal sessions using the script command. Unfortunately the typescript output file contains many control characters, for example from pressing the full-screen key (F11) while in the vim editor. To reproduce, try the following:
script -f -t 2>${LOGNAME}-$(/bin/date +%Y%m%d-%H%M%S).time -a ${LOGNAME}-$(/bin/date +%Y%m%d-%H%M%S).session
vi abc.log
#write something and save
#:x to quit vi
ctrl + d to quit script
The script output hostname-datetime.session contains too many vi control characters.
I found a perl script on commandlinefu that can remove these control characters from the typescript.
I am actually doing this replacement in C, and the program runs in a chroot environment where perl is not available.
Question: Is there a way to translate the following perl command to sed?
cat typescript | perl -pe 's/\e([^\[\]]|\[.*?[a-zA-Z]|\].*?\a)//g' | col -b > typescript-processed
If you ONLY want printable ASCII:
LC_ALL=C tr -cd ' -~\n\t' < typescript > typescript_printable_ascii_only
Why does this work? All printable ("normal") ASCII characters lie between space and tilde,
and in addition you need newline and tab.
So ' -~\n\t' covers all printable "normal" ASCII characters. tr -d 'chars' deletes all chars, and -c takes the complement of the given set (so everything except 'chars').
=> LC_ALL=C tr -cd ' -~\n\t' therefore deletes everything except the normal ASCII characters (including newline and tab). (I force the locale to 'C' to be sure tr runs in the right locale.)
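One thing worth seeing on a concrete line: tr only drops the non-printable bytes, so the printable tail of each escape sequence stays behind. A tiny sketch with a made-up typescript fragment:

```shell
# Hypothetical typescript fragment: a bold "b" wrapped in ESC[1m ... ESC[0m,
# followed by a tab and a "c".
printable=$(printf 'a\033[1mb\033[0m\tc\n' | LC_ALL=C tr -cd ' -~\n\t')

# The ESC bytes are gone, but "[1m" and "[0m" are printable and survive.
printf '%s\n' "$printable"
```

So this is fine when you only need to get rid of raw control bytes, but it does not remove whole escape sequences.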
This works well for me with GNU sed (or gsed on a Mac):
sed -re 's/\x1b[^m]*m//g' typescript | col -b
I created a sample typescript, and since I'm using a relatively advanced shell prompt, it's full of control characters, and the perl script in the OP doesn't actually work, so rather than converting I had to come up with my own.
Looking at the typescript with hexdump -C, it seems that all control sequences start with \x1b (the Escape character, or ^[), and end with the letter "m". So in sed I use a simple replacement from ^[ until m, normally written as \x1b.*?m but since sed doesn't support the ? symbol to make a pattern non-greedy, I used [^m]*m to emulate non-greedy matching.
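Here is a sketch of that replacement on a made-up prompt line, followed by my own attempt at carrying the perl pattern from the question over to sed by replacing each .*? with a negated bracket expression. Both assume GNU sed (for \x1b and \x07), and the second pattern is an untested-in-the-wild translation, not something from the answers above.

```shell
# Hypothetical prompt line with two colour sequences (ESC[1;32m and ESC[0m).
prompt_out=$(printf '\033[1;32muser@host\033[0m:~ ls\n' |
  sed -E 's/\x1b[^m]*m//g')

# Hypothetical line mixing a CSI sequence, an OSC title (ESC ] ... BEL) and a
# two-byte escape (ESC =). Each .*? from the perl pattern becomes a negated
# bracket expression, which stops at the first terminator just as .*? would.
full_out=$(printf '\033[1;32mhi\033[0m \033]0;title\007there\033=\n' |
  sed -E 's/\x1b([^][]|\[[^a-zA-Z]*[a-zA-Z]|\][^\x07]*\x07)//g')

printf '%s\n' "$prompt_out" "$full_out"
```

Note that the simple [^m]*m form swallows text up to the next "m" when a sequence ends in some other letter (ESC[K, for instance), which is where the longer pattern may be worth the extra noise.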

Replace all Windows-1252 characters to the respective UTF-8 ones

I am looking for a way to recursively replace all incompatible Windows-1252 characters with the respective UTF-8 ones.
I tried iconv, without success.
I also found the following command:
grep -rl oldstring . |xargs sed -i -e 's/oldstring/newstring/'
But I would rather not run this command by hand for every character.
Is there a way or software that can do that?
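One possible approach, sketched under the assumption that every matching file really is Windows-1252 and that iconv is installed: walk the tree with find and convert each file through a temporary copy. The helper name cp1252_to_utf8, the *.txt filter and the demo directory are all made up for illustration. Running this twice over the same files would double-encode them, so files that are already UTF-8 need to be kept out.

```shell
# Hypothetical helper: convert one file from CP1252 to UTF-8, replacing its
# contents only if iconv succeeds.
cp1252_to_utf8() {
    tmp=$(mktemp) || return 1
    if iconv -f CP1252 -t UTF-8 "$1" > "$tmp"; then
        cat "$tmp" > "$1"    # overwrite contents, keep the file's permissions
    else
        echo "skipped (conversion failed): $1" >&2
    fi
    rm -f "$tmp"
}

# Throwaway demo tree; in real use, point find at your own directory.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/sub"
printf 'caf\351 \223quoted\224\n' > "$demo_dir/sub/a.txt"   # CP1252 bytes

# Recurse over the tree and convert each matching file.
find "$demo_dir" -type f -name '*.txt' | while IFS= read -r f; do
    cp1252_to_utf8 "$f"
done

converted=$(cat "$demo_dir/sub/a.txt")
rm -rf "$demo_dir"
printf '%s\n' "$converted"
```

After the loop, the accented letter and the curly quotes in the demo file are stored as multi-byte UTF-8 sequences instead of single CP1252 bytes.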

sed + removes all leading and trailing whitespace from each line on solaris system

I have a Solaris machine (SunOSsu1a 5.10 Generic_142900-15 sun 4vsparcSUNW,Netra-T2000).
The following sed syntax removes all leading and trailing whitespace from each line (I need to remove whitespace because it causes application problems).
sed 's/^[ \t]*//;s/[ \t]*$//' orig_file > new_file
But I noticed that sed also removes the "t" character from the end of each line.
Please advise how to fix the sed syntax/command in order to remove only the leading and trailing whitespace from each line (the solution can be also with Perl or AWK).
Examples (take a look at the last string - set_host)
1)
Original line before running sed command
pack/configuration/param[14]/action:set_host
another example (before I run sed)
+/etc/cp/config/Network-Configuration/Network-Configuration.xml:/cp-pack/configuration/param[8]/action:set_host
2)
the line after I run the sed command
pack/configuration/param[14]/action:set_hos
another example (after I run sed)
+/etc/cp/config/Network-Configuration/Network-Configuration.xml:/cp-pack/configuration/param[8]/action:set_hos
Just occurred to me you can use a character class:
sed 's/^[[:space:]]*//;s/[[:space:]]*$//'
This happens in your sed and in GNU sed with the --posix option because (evidently) POSIX interprets [ \t] as a space, a \, or a t. You can fix this by putting a literal tab instead of \t; the easiest way is probably Ctrl+V Tab. If that doesn't work, put the patterns in a file (with the literal tabs) and use sed -f patterns.sed oldfile > newfile.
As @aix noted, the problem is undoubtedly that your sed doesn't understand \t. While GNU sed does, many proprietary Unix flavors don't. HP-UX is one and I believe Solaris is too. If you can't install GNU sed, I'd look to Perl:
perl -pi.old -e 's{^\s+}{};s{\s+$}{}' file
...will trim one or more leading white space (^\s+) [spaces and/or tabs] together with trailing white space (\s+$) updating the file in situ leaving a backup copy as "file.old".
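If typing a literal tab with Ctrl+V is awkward (in a script, say), the shell can produce it instead. A minimal sketch with a made-up padded line, showing the character-class form next to the literal-tab form; the variable names are placeholders of mine:

```shell
# Hypothetical line padded with spaces and tabs on both sides.
padded() { printf '  \tpack/configuration/param[14]/action:set_host\t \n'; }

# POSIX character class: no reliance on sed understanding \t.
class_out=$(padded | sed 's/^[[:space:]]*//;s/[[:space:]]*$//')

# Literal tab: let printf produce the tab and splice it into the pattern.
tab=$(printf '\t')
literal_out=$(padded | sed "s/^[ $tab]*//;s/[ $tab]*\$//")

# Brackets show that no padding is left and the final "t" is intact.
printf '[%s]\n' "$class_out" "$literal_out"
```

Neither pattern contains a backslash-t, so the trailing "t" of set_host should be safe in any sed.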

sed to remove URLs from a file

I am trying to write a sed expression that can remove urls from a file
example
http://samgovephotography.blogspot.com/ updated my blog just a little bit ago. Take a chance to check out my latest work. Hope all is well:)
Meet Former Child Star & Author Melissa Gilbert 6/15/09 at LA's B&N https://hollywoodmomblog.com/?p=2442 Thx to HMB Contributor #kdpartak :)
But I can't get it to work:
sed 's/[\w \W \s]*http[s]*:\/\/\([\w \W]\)\+[\w \W \s]*/ /g' posFile
FIXED!!!!!
handles almost all cases, even malformed URLs
sed 's/[\w \W \s]*http[s]*[a-zA-Z0-9 : \. \/ ; % " \W]*/ /g' positiveTweets | grep "http" | more
The following removes http:// or https:// and everything up until the next space:
sed -e 's!http\(s\)\{0,1\}://[^[:space:]]*!!g' posFile
updated my blog just a little bit ago. Take a chance to check out my latest work. Hope all is well:)
Meet Former Child Star & Author Melissa Gilbert 6/15/09 at LA's B&N Thx to HMB Contributor #kdpartak :)
Edit:
I should have used:
sed -e 's!http[s]\?://\S*!!g' posFile
"[s]\?" is a far more readable way of writing "an optional s" compared to "\(s\)\{0,1\}"
"\S*" a more readable version of "any non-space characters" than "[^[:space:]]*"
I must have been using the sed that came installed with my Mac at the time I wrote this answer (brew install gnu-sed FTW).
There are better URL regular expressions out there (those that take into account schemes other than HTTP(S), for instance), but this will work for you, given the examples you give. Why complicate things?
The accepted answer provides the approach that I used to remove URLs, etc. from my files. However it left "blank" lines. Here is a solution.
sed -i -e 's/http[s]\?:\/\/\S*//g ; s/www\.\S*//g ; s/ftp:\S*//g' input_file
perl -i -pe 's/^'`echo "\012"`'${2,}//g' input_file
The GNU sed flags, expressions used are:
-i Edit in-place
-e [-e script] --expression=script : basically, add the commands in script
(expression) to the set of commands to be run while processing the input
^ Match start of line
$ Match end of line
? Match zero or one of preceding regular expression (written \? in GNU sed's basic syntax)
{2,} Match 2 or more of preceding regular expression
\S* Zero or more non-space characters; alternative to: [^[:space:]]*
However,
sed -i -e 's/http[s]\?:\/\/\S*//g ; s/www\.\S*//g ; s/ftp:\S*//g'
leaves nonprinting character(s), presumably \n (newlines). Standard sed-based approaches to remove "blank" lines, tabs and spaces, e.g.
sed -i 's/^[ \t]*//; s/[ \t]*$//'
do not work, here: if you do not use a "branch label" to process newlines, you cannot replace them using sed (which reads input one line at a time).
The solution is to use the following perl expression:
perl -i -pe 's/^'`echo "\012"`'${2,}//g'
which uses a shell substitution,
'`echo "\012"`'
to replace an octal value
\012
(i.e., a newline, \n), that occurs 2 or more times,
{2,}
(otherwise we would unwrap all lines), with something else; here:
//
i.e., nothing.
[The second reference below provides a wonderful table of these values!]
The perl flags used are:
-p Places a printing loop around your command,
so that it acts on each line of standard input
-i Edit in-place
-e Allows you to provide the program as an argument,
rather than in a file
References:
perl flags: Perl flags -pe, -pi, -p, -w, -d, -i, -t?
ASCII control codes: https://www.cyberciti.biz/faq/unix-linux-sed-ascii-control-codes-nonprintable/
remove URLs: sed to remove URLs from a file
branch labels: How can I replace a newline (\n) using sed?
GNU sed manual: https://www.gnu.org/software/sed/manual/sed.html
quick regex guide: https://www.gnu.org/software/sed/manual/html_node/Regular-Expressions.html
Example:
$ cat url_test_input.txt
Some text ...
https://stackoverflow.com/questions/4283344/sed-to-remove-urls-from-a-file
https://www.google.ca/search?dcr=0&ei=QCsyWtbYF43YjwPpzKyQAQ&q=python+remove++citations&oq=python+remove++citations&gs_l=psy-ab.3...1806.1806.0.2004.1.1.0.0.0.0.61.61.1.1.0....0...1c.1.64.psy-ab..0.0.0....0.-cxpNc6youY
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
https://bbengfort.github.io/tutorials/2016/05/19/text-classification-nltk-sckit-learn.html
http://datasynce.org/2017/05/sentiment-analysis-on-python-through-textblob/
https://www.google.ca/?q=halifax&gws_rd=cr&dcr=0&ei=j7UyWuGKM47SjwOq-ojgCw
http://www.google.ca/?q=halifax&gws_rd=cr&dcr=0&ei=j7UyWuGKM47SjwOq-ojgCw
www.google.ca/?q=halifax&gws_rd=cr&dcr=0&ei=j7UyWuGKM47SjwOq-ojgCw
ftp://ftp.ncbi.nlm.nih.gov/
ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/alignment_indices/20100804.alignment.index
Some more text.
$ sed -e 's/http[s]\?:\/\/\S*//g ; s/www\.\S*//g ; s/ftp:\S*//g' url_test_input.txt > a
$ cat a
Some text ...
Some more text.
$ perl -i -pe 's/^'`echo "\012"`'${2,}//g' a
Some text ...
Some more text.
$
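For comparison, if the goal is only to drop the lines left empty after the URLs are removed, sed's d command can delete them in the same pass. This is a sketch of that alternative, not the approach described above; it uses the portable [^[:space:]] spelling, example.com placeholder URLs, and it will also delete lines that were blank in the original file.

```shell
# Hypothetical input: text lines around lines that contain only a URL.
urls() {
    printf '%s\n' 'Some text ...' 'https://example.com/a?b=1' \
        'www.example.com/c' 'ftp://example.com/d' 'Some more text.'
}

# Remove the URLs, then delete every line that is empty or whitespace-only.
cleaned=$(urls | sed -e 's!http\(s\)\{0,1\}://[^[:space:]]*!!g' \
                     -e 's!www\.[^[:space:]]*!!g' \
                     -e 's!ftp:[^[:space:]]*!!g' \
                     -e '/^[[:space:]]*$/d')

printf '%s\n' "$cleaned"
```

Only the two text lines should remain, with no blank lines between them.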