Replacing Windows CRLF with Unix LF using Perl -- `Unrecognized switch: -g`?

Problem Background
We have several thousand large (more than 10M lines) text files of tabular data produced by a Windows machine, which we need to prepare for upload to a database.
We need to change the encoding of these files from cp1252 to utf-8, replace any bare Unix LF sequences (i.e. \n) with spaces, then replace the DOS line end sequences ("CR-LF", i.e. \r\n) with Unix line end sequences (i.e. \n).
The dos2unix utility is not available for this task.
We initially had a bash function that packaged these operations together using iconv and sed, with iconv doing the encoding and sed dealing with the LF/CRLF sequences. I'm trying to replace part of this bash function with a perl command.
Example Code
Based on some helpful code review, I want to change this function to a perl script.
The author of the code review suggested the following perl to replace CRLF (i.e. "\r\n") with LF ("\n").
perl -g -pe 's/(?<!\r)\n/ /g; s/\r\n/\n/g;'
The explanation for why this is better than what we had previously makes perfect sense, but this line fails for me with:
Unrecognized switch: -g (-h will show valid options).
More interestingly, the author of the code review also suggests it is possible to perform the decode/recode in a perl script, too, but I am completely unsure where to start.
Questions
Please can someone explain why the suggested answer fails with Unrecognized switch: -g (-h will show valid options)?
If it helps, the line is supposed to receive piped input from iconv as follows (though I am interested in learning how to use perl to do the decoding/recoding step, too):
iconv --from-code=CP1252 --to-code=UTF-8 "$1" | \
perl -g -pe 's/(?<!\r)\n/ /g; s/\r\n/\n/g;' \
> "$2"
(Highly simplified) example input for testing:
apple|orange|\n|lemon\r\nrasperry|strawberry|mango|\n\r\n
Desired output:
apple|orange| |lemon\nrasperry|strawberry|mango| \n

Perl v5.36.0 added the command line switch -g as an alias for -0777 ('gulp mode', which reads the whole input as a single record).
This works in Perl version v5.36.0:
s=$(printf "Line 1\nStill Line 1\r\nLine 2\r\nLine 3\r\n")
perl -g -pe 's/(?<!\r)\n/ /g; s/\r\n/\n/g;' <<<"$s"
Prints:
Line 1 Still Line 1
Line 2
Line 3
With any version of perl earlier than v5.36.0, you would use -0777 instead:
perl -0777 -pe 's/(?<!\r)\n/ /g; s/\r\n/\n/g;' <<<"$s"
# same
BTW, the conversion you are looking for is way easier in this case with awk, since it is close to the defaults.
Just do this:
awk -v RS="\r\n" '{gsub(/\n/," ")} 1' <<<"$s"
Line 1 Still Line 1
Line 2
Line 3
Or, if you have a file:
awk -v RS="\r\n" '{gsub(/\n/," ")} 1' file
This is superior to the posted perl solution since the file is processed record by record (each block of text separated by \r\n) versus having to read the entire file into memory.
(On Windows you may need to do awk -v RS="\r\n" -v ORS="\n" '...')
Another note:
You can get similar behavior from Perl by:
Setting the input record separator to the fixed string, $/="\r\n", in a BEGIN block;
Using the -l switch so every line has the input record separator removed;
Using tr for speedy replacement of \n with ' ';
Possibly setting the output record separator, $\="\n", on Windows.
Full command:
perl -lpE 'BEGIN{$/="\r\n"} tr/\n/ /' file

The error message is about the command line switch -g you use in perl -g -pe .... It is not about the /g modifier on the regex, which is valid (but useless here, since -p reads line by line and each line contains only a single \n anyway).
This switch simply does not exist with the perl version you are using. It was only added with perl 5.36, so you are likely using an older version. Try -0777 instead.

Related

sed or awk to change a specific number in a file on RHEL7

I need help figuring out the syntax or what command to use to find and replace a specific number in a file.
I need to replace the number 10 with 25 in a configuration file. I have tried the following:
sed 's/10/25/g' /etc/security/limits.conf
This changes other instances that contain 10, such as 1000 and 10000, to 2500 and 25000. I need to change just 10 to 25. Please help.
Thank you,
Joseph
The trick here is to limit the sed substitution to the line you want to change. For limits.conf you are best off matching the domain, type and item. So if you wanted to just change a limit for domain @student, type hard, item nproc, you'd use something like
sed '/@student.*hard.*nproc/s/10/25/g' /etc/security/limits.conf
sed -ri '/^#/!s/(^.*)([[:space:]]10$)/\1 25/' /etc/security/limits.conf
With extended regular expressions enabled (-r or -E), the /^#/! address processes only lines that don't start with a #. The substitution splits each such line into two groups and replaces the line with the first group followed by a space and 25. The $ ensures that the entry to replace is anchored at the end of the line.
Awk is another option:
awk -i inplace 'NF==4 && $4==10 { gsub("10","25",$4) }1' /etc/security/limits.conf
Check if the line has 4 space-delimited fields (NF==4) and the 4th field ($4) is 10. If this condition is met, replace 10 with 25 using gsub, and print all lines with 1.
The -i inplace option loads the in-place editing extension that ships with more recent versions of GNU awk (4.1 and later). If a compliant version is not available, use:
awk 'NF==4 && $4==10 { gsub("10","25",$4) }1' /etc/security/limits.conf > /etc/security/limits.tmp && mv -f /etc/security/limits.tmp /etc/security/limits.conf
Use this Perl one-liner, where \b stands for word break (so that 10 will not match 210 or 102):
perl -pe 's/\b10\b/25/g' in_file > out_file
Or to change the file in-place:
perl -i.bak -pe 's/\b10\b/25/g' in_file
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-p : Loop over the input one line at a time, assigning it to $_ by default. Add print $_ after each loop iteration.
-i.bak : Edit input files in-place (overwrite the input file). Before overwriting, save a backup copy of the original file by appending to its name the extension .bak.
The regex uses modifier /g : Match the pattern repeatedly.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perlrequick: Perl regular expressions quick start

Require exactly 1 trailing newline in a file

I have a mix of files with various ways of using trailing new lines. There are no carriage returns, it's only \n. Some files have multiple newlines and some files have no trailing newline. I want to edit the files in place.
How can I edit the files to have exactly 1 trailing newline?
To change text files in-place to have one and only one trailing newline:
sed -zi 's/\n*$/\n/' file
This requires GNU sed.
-z tells sed to read in the file using the NUL character as a separator. Since text files have no NUL characters, this has the effect of reading the whole file in at once.
-i tells GNU sed to change the file in place.
s/\n*$/\n/ tells sed to replace however many newlines there are at the end of the file with a single newline.
Replace all trailing new lines with one?
$text =~ s/\n+$/\n/;
This leaves the file with one newline at the end – if it had at least one to start with. If you want it to be there even if the file didn't have one, replace \n+ with \n*.
For the in-place specification, implying a one-liner:
perl -i -0777 -wpe 's/\n+$/\n/' file.txt
The meaning of the switches is explained in Command Switches in perlrun.
Here is a summary of the switches. Please see the above docs for precise explanations.
-i changes the file "in place." Note that data is still copied and temporary files used
-0777 reads the file whole. -0[oct|hex] sets the input record separator $/ to the character with that code, so plain -0 means NUL; the special value 0777 tells Perl to slurp the whole file
-w uses warnings. Not exactly the same as use warnings but better than nothing
-p the code in '' runs on each line of file in turn, like -n, and then $_ is printed
-e what follows between '' is executed as Perl code
-E is the same but also enables features, like say
Note that we can see the equivalent code by using core O and B::Deparse modules as
perl -MO=Deparse -wp -e 1
This prints
BEGIN { $^W = 1; }
LINE: while (defined($_ = <ARGV>)) {
'???';
}
continue {
print $_;
}
-e syntax OK
showing a script equivalent to the one liner with -w and -p.
perl -i -0 -pe 's/\n\n*$/\n/' input-file
The solutions posted so far read your whole input file into memory which will be an issue if your file is huge. This only reads contiguous empty lines into memory:
awk -i inplace '/./{printf "%s", buf; buf=""; print; next} {buf = buf $0 ORS}' file
The above uses GNU awk for inplace editing.

Pass command line parameters to perl via file?

Can command-line parameters be saved to a file, and the file then passed to perl to parse out the options? Like a response file (prefix the name with @) for some Microsoft tools.
I am trying to pass expression to perl via command line, like perl -e 'print "\n"', and Windows command prompt makes using double quotes a little hard.
There are several solutions, from most to least preferable.
Write your program to a file
If your one liner is too big or complicated, write it to a file and run it. This avoids messing with shell escapes. You can reuse it and debug it and work in a real editor.
perl path\to\some_program
Command-line options to perl can be put on the #! line, which is otherwise useless on Windows. Here's an example.
#!/usr/bin/perl -i.bak -p
# -i.bak Backs up the file.
# -p Puts each line into $_ and writes out the new value of $_.
# So this changes all instances in a file of " with '.
s{"}{'}g;
Use alternative quote delimiters
Perl has a slew of alternative ways to write quotes. Use them instead. This is good for both one liners as well as things like q[<tag key='value'>].
perl -e "print qq[\n]"
Escape the quote
^ is the cmd.exe escape character. So ^" is treated as a literal quote.
perl -e "print ^"\n^""
Pretty yucky. I'd prefer using qq[] and reserve ^" for when you need to print a literal quote.
perl -e "print qq[^"\n]"
Use the ASCII code
The ASCII and UTF-8 hex code for " is 22. You can supply this to Perl with qq[\x22].
perl -e "print qq[\x22\n]"
You can read the file into a string and then use
use Getopt::Long qw(GetOptionsFromString);
$ret = GetOptionsFromString($string, ...);
to parse the options from that.

Awk inside of qsub

I have a bash script in which I have a few qsubs. Each of them waits for a previous qsub to be done before starting.
My first qsub consists of sending files in a certain directory to a perl program and having the outfiles printed in a new directory. At the end, I echo the array with all my job names. This script works as intended.
mkdir -p /perl_files_dir
for ID_FILES in `ls Infiles_dir/*.txt`;
do
JOB_ID=`echo "perl perl_scirpt.pl $ID_FILES" | qsub -j oe `
JOB_ID_ARRAY="${JOB_ID_ARRAY}:$JOB_ID"
done
echo $JOB_ID_ARRAY
My second qsub is meant to sort all my previous files made with my perl script in a new outfile and to start after all these jobs are done (about 100 jobs) with depend=afterany. Again, this part is working fine.
SORT_JOB=`echo "sort -m -n perl_files_dir/*.txt >>sorted_file.txt" | qsub -j oe -W depend=afterany$JOB_ID_ARRAY`
SORT_ARRAY="${SORT_ARRAY}:$SORT_JOB"
My issue is that in my sorted file, I have a few columns I wish to remove (2 to 6), so I came up with this last line using awk piped to sed with another depend=afterany
SED=`echo "awk '{\$2="";\$3="";\$4="";\$5="";\$6=""; print \$0}' sorted_file.txt \
| sed 's/ //g' >final_file.txt" | qsub -j oe -W depend=afterany$SORT_ARRAY`
This last step creates final_file.txt, but leaves it empty. I added SED= before my echo because it would otherwise give me Command not found.
I tried without the pipe so it would just print everything. Unfortunately it prints nothing.
I assume it is not opening my sorted file and this is why my final file is empty after my sed. If it's the case, then why won't awk read it?
In my script, I am using variables to define my directories and files (with the correct path). I know my issue is not about finding my files or directories, since they are perfectly defined at the beginning and used throughout the script. I tried to write the whole path instead of a variable and I get the same results.
for ID_FILES in `ls Infiles_dir/*.txt`
Simplify this to
for ID_FILES in Infiles_dir/*.txt
ls lists the files you pass it (except when you pass it directories, then it lists their content). Rather than telling it to display a list of files and parse the output, use the list of files you already have! This is more reliable (parsing the output of ls will fail if the file names contain whitespace or wildcard characters), clearer and faster. Don't parse the output of ls.
SORT_JOB=`echo "sort -m -n perl_files_dir/*.txt >>sorted_file.txt" | qsub -j oe -W depend=afterany$JOB_ID_ARRAY`
You'd make your life simpler if you used the right form of quoting in the right place. Don't use backquotes, because it's difficult to know how to quote things inside. Use $(…) instead, it's exactly equivalent except that it is parsed in a sane way.
I recommend using a here document for the shell snippet that you're feeding to qsub. You have fewer quoting issues to worry about, and it's more readable.
While we're at it, always put double quotes around variable substitutions and command substitutions: "$some_variable", "$(some_command)". Annoyingly, $var in shell syntax doesn't mean “take the value of the variable var”, it means “take the value of the variable var, parse it as a list of wildcard patterns, and replace each pattern by the list of matching files if there are matching files”. This extra stuff is turned off if the substitution happens inside double quotes (or in a here document, by the way): "$var" means “take the value of the variable var”.
SORT_JOB=$(qsub -j oe -W depend="afterany$JOB_ID_ARRAY" <<'EOF'
sort -m -n perl_files_dir/*.txt >>sorted_file.txt
EOF
)
We now get to the snippet where the quoting was actually causing a problem.
SED=`echo "awk '{\$2="";\$3="";\$4="";\$5="";\$6=""; print \$0}' sorted_file.txt \
| sed 's/ //g' >final_file.txt" | qsub -j oe -W depend=afterany$SORT_ARRAY`
The string that becomes the argument to the echo command is:
awk '{$2=;$3=;$4=;$5=;$6=; print $0}' sorted_file.txt | sed 's/ //g' >final_file.txt
This is syntactically incorrect, and that's why you're not getting any output.
You didn't escape the double quotes inside what was meant to be the awk snippet. It's a lot clearer if you use a here document. Also, you don't need the SED= part. You added it because you had a command substitution (a command between backquotes), which substitutes the output of a command. But since you aren't interested in the output of the qsub command, don't capture its output, just execute it.
qsub -j oe -W depend="afterany$SORT_ARRAY" <<'EOF'
awk '{$2="";$3="";$4="";$5="";$6=""; print $0}' sorted_file.txt |
sed 's/ //g' >final_file.txt
EOF
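The difference between the two quoting styles can be seen without qsub by capturing the text each one produces. This sketch uses $(…) rather than backquotes and a shortened awk program; the variable names broken and fixed are made up for the illustration.

```shell
# Unescaped inner double quotes end and restart the outer string, so the
# empty-string literals vanish before awk ever sees the program.
broken=$(printf '%s\n' "awk '{\$2="";\$3=""; print \$0}' sorted_file.txt")

# A quoted here-document delimiter ('EOF') passes the text through untouched.
fixed=$(cat <<'EOF'
awk '{$2="";$3=""; print $0}' sorted_file.txt
EOF
)

printf '%s\n' "$broken" "$fixed"
```

The first line printed has lost its "" pairs, while the second keeps the awk program exactly as written.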
I'm not familiar with qsub, but presumably there's a way to get the error output and the return status of the commands it runs. Inspect that error output; you should see the errors from awk there.
The version of awk that I am using does not like the character escapes
awk --version
GNU Awk 3.1.7
spuder@cent64$ awk '{\$2="";\$3="";\$4=""; print \$0}' foo.txt
awk: {\$2="";\$3="";\$4=""; print \$0}
awk: ^ backslash not last character on line
Try the following syntax
awk '{for(i=2;i<=6;i++) $i="";print}' foo.txt
As a side note, if you are using Torque 4.x you may not be able to use a comma separated list of jobs with -W depend=, instead you may need to create a new PBS declarative (-W) for each job.
eg...
#Invalid syntax in newer versions of torque
qsub -W depend=foo,bar
Resources
backslash in gawk fields
Print all but the first three columns
http://docs.adaptivecomputing.com/torque/help.htm#topics/commands/qsub.htm#-W

sed to remove URLs from a file

I am trying to write a sed expression that can remove urls from a file
example
http://samgovephotography.blogspot.com/ updated my blog just a little bit ago. Take a chance to check out my latest work. Hope all is well:)
Meet Former Child Star & Author Melissa Gilbert 6/15/09 at LA's B&N https://hollywoodmomblog.com/?p=2442 Thx to HMB Contributor #kdpartak :)
But I don't get it:
sed 's/[\w \W \s]*http[s]*:\/\/\([\w \W]\)\+[\w \W \s]*/ /g' posFile
FIXED!!!!!
handles almost all cases, even malformed URLs
sed 's/[\w \W \s]*http[s]*[a-zA-Z0-9 : \. \/ ; % " \W]*/ /g' positiveTweets | grep "http" | more
The following removes http:// or https:// and everything up until the next space:
sed -e 's!http\(s\)\{0,1\}://[^[:space:]]*!!g' posFile
updated my blog just a little bit ago. Take a chance to check out my latest work. Hope all is well:)
Meet Former Child Star & Author Melissa Gilbert 6/15/09 at LA's B&N Thx to HMB Contributor #kdpartak :)
Edit:
I should have used:
sed -e 's!http[s]\?://\S*!!g' posFile
"[s]\?" is a far more readable way of writing "an optional s" compared to "\(s\)\{0,1\}"
"\S*" a more readable version of "any non-space characters" than "[^[:space:]]*"
I must have been using the sed that came installed with my Mac at the time I wrote this answer (brew install gnu-sed FTW).
There are better URL regular expressions out there (those that take into account schemes other than HTTP(S), for instance), but this will work for you, given the examples you give. Why complicate things?
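The first, more verbose expression sticks to POSIX syntax, so it should also work with a non-GNU sed. A small check on made-up lines (the example.com and example.org URLs and the helper name strip_urls are placeholders, not from the question):

```shell
# Hypothetical helper: strip http:// and https:// URLs using only POSIX sed syntax.
strip_urls() {
  sed -e 's!http\(s\)\{0,1\}://[^[:space:]]*!!g'
}

result=$(printf '%s\n' \
  'http://example.com/ updated my blog' \
  'see https://example.org/?p=1 for details' | strip_urls)
printf '%s\n' "$result"
```

The whitespace around each removed URL stays in place, so the first line keeps its leading space and the second has two spaces in a row.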
The accepted answer provides the approach that I used to remove URLs, etc. from my files. However it left "blank" lines. Here is a solution.
sed -i -e 's/http[s]\?:\/\/\S*//g ; s/www\.\S*//g ; s/ftp:\S*//g' input_file
perl -i -pe 's/^'`echo "\012"`'${2,}//g' input_file
The GNU sed flags, expressions used are:
-i Edit in-place
-e [-e script] --expression=script : basically, add the commands in script
(expression) to the set of commands to be run while processing the input
^ Match start of line
$ Match end of line
\? Match zero or one of preceding regular expression
{2,} Match 2 or more of preceding regular expression
\S* Zero or more non-space characters; alternative to: [^[:space:]]*
However,
sed -i -e 's/http[s]\?:\/\/\S*//g ; s/www\.\S*//g ; s/ftp:\S*//g'
leaves nonprinting character(s), presumably \n (newlines). Standard sed-based approaches to remove "blank" lines, tabs and spaces, e.g.
sed -i 's/^[ \t]*//; s/[ \t]*$//'
do not work, here: if you do not use a "branch label" to process newlines, you cannot replace them using sed (which reads input one line at a time).
The solution is to use the following perl expression:
perl -i -pe 's/^'`echo "\012"`'${2,}//g'
which uses a shell substitution,
'`echo "\012"`'
to replace an octal value
\012
(i.e., a newline, \n), that occurs 2 or more times,
{2,}
(otherwise we would unwrap all lines), with something else; here:
//
i.e., nothing.
[The second reference below provides a wonderful table of these values!]
The perl flags used are:
-p Places a printing loop around your command,
so that it acts on each line of standard input
-i Edit in-place
-e Allows you to provide the program as an argument,
rather than in a file
References:
perl flags: Perl flags -pe, -pi, -p, -w, -d, -i, -t?
ASCII control codes: https://www.cyberciti.biz/faq/unix-linux-sed-ascii-control-codes-nonprintable/
remove URLs: sed to remove URLs from a file
branch labels: How can I replace a newline (\n) using sed?
GNU sed manual: https://www.gnu.org/software/sed/manual/sed.html
quick regex guide: https://www.gnu.org/software/sed/manual/html_node/Regular-Expressions.html
Example:
$ cat url_test_input.txt
Some text ...
https://stackoverflow.com/questions/4283344/sed-to-remove-urls-from-a-file
https://www.google.ca/search?dcr=0&ei=QCsyWtbYF43YjwPpzKyQAQ&q=python+remove++citations&oq=python+remove++citations&gs_l=psy-ab.3...1806.1806.0.2004.1.1.0.0.0.0.61.61.1.1.0....0...1c.1.64.psy-ab..0.0.0....0.-cxpNc6youY
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
https://bbengfort.github.io/tutorials/2016/05/19/text-classification-nltk-sckit-learn.html
http://datasynce.org/2017/05/sentiment-analysis-on-python-through-textblob/
https://www.google.ca/?q=halifax&gws_rd=cr&dcr=0&ei=j7UyWuGKM47SjwOq-ojgCw
http://www.google.ca/?q=halifax&gws_rd=cr&dcr=0&ei=j7UyWuGKM47SjwOq-ojgCw
www.google.ca/?q=halifax&gws_rd=cr&dcr=0&ei=j7UyWuGKM47SjwOq-ojgCw
ftp://ftp.ncbi.nlm.nih.gov/
ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/alignment_indices/20100804.alignment.index
Some more text.
$ sed -e 's/http[s]\?:\/\/\S*//g ; s/www\.\S*//g ; s/ftp:\S*//g' url_test_input.txt > a
$ cat a
Some text ...
Some more text.
$ perl -i -pe 's/^'`echo "\012"`'${2,}//g' a
Some text ...
Some more text.
$
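If the goal is only to drop the lines that URL removal left empty, sed's d command deletes whole lines and so never needs to match the newline itself. The following is a sketch under that assumption, with a hypothetical helper name; note that it also removes lines that were already blank in the input.

```shell
# Hypothetical helper: delete lines that are empty or contain only whitespace.
drop_blank_lines() {
  sed -e '/^[[:space:]]*$/d'
}

# Stand-in for the file after URL removal: text lines separated by emptied lines.
result=$(printf '%s\n' 'Some text ...' '' '   ' '' 'Some more text.' | drop_blank_lines)
printf '%s\n' "$result"
```

With GNU sed the same expression can be applied to a file in place by adding -i.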