How can I cut out a section of a string using Perl? - perl

I need to cut some characters out of the middle of a string; the starting and ending positions of the character sequence to be cut will vary.
For example, say I have the sentence
The quick brown fox jumped over the lazy dog
I need to count forward from the first character until I get to fox, assign the character position of the f to a variable, continue counting forward until I get to 'the' and then cut out the characters between and including the initial f and the final e.
Note
There is an e in jumped which is between fox and the, this should be ignored, it must find the position of the e in the.

To remove a section of a string where you're not sure of all the intervening characters, you can use the substitution operator. If there's a match, the position of the beginning of the match (zero-indexed) is stored in $-[0] (or $LAST_MATCH_START[0] if you use English;):
use strict;
use warnings;
use 5.010;
my $string = 'The quick brown fox jumped over the lazy dog';
$string =~ s/fox.*the//;
say "Matched at char $-[0]" if defined $-[0];
say "New string: $string";
Output:
Matched at char 16
New string: The quick brown lazy dog
Which "the"?
Note that the regex I used is greedy, so it will gobble up every the until the last. For the string:
The quick brown fox jumped over the lazy dog and the sleepy cat
you will get:
Matched at char 16
New string: The quick brown sleepy cat
To stop at the first occurrence of the, change the substitution to:
s/fox.*?the//;
Whole words only
Both of the regexes above will still match partial words. The string:
The quick brown foxhole jumped over their lazy dog
gives:
Matched at char 16
New string: The quick brown ir lazy dog
To only match whole words* change the substitution to:
s/(?:^|\s+)\Kfox\s+.*\s+the(?=\s+|\z)//; # greedy
or
s/(?:^|\s+)\Kfox\s+.*?\s+the(?=\s+|\z)//; # non-greedy
* It's hard to define what counts as a whole word in an English sentence. The above expects a word to be surrounded on both sides by one or more spaces or to be at the beginning or end of the string, which excludes things like in-the-know, but also excludes "fox" and the,. This is obviously not a great definition.

I have the sentence
The quick brown fox jumped over the lazy dog
I need to count forward from the first character until I get to 'fox', asign the character position of 'f' to a variable, continue counting forward until I get to 'the' and then cut out the characters, including and between, 'f' and 'e'.
I am quoting your problem description because it indicates the C mindset with which you are approaching Perl. At a bit higher level than C, your problem is to actually to cut out the words between "brown" and "lazy". Perl allows you to directly express this idea:
$ perl -wE 'say join(" ", (split /\s+(?:fox|the)\s+/, "The quick brown fox jumped over the lazy dog")[0, 2])'
The quick brown lazy dog
Or, using the range operator:
$ perl -wE 'say join " ", grep !(/^fox$/ .. /^the$/), split " ", "The quick brown fox jumped over the lazy dog"'
The quick brown lazy dog
which literally reads "take all the words not between 'fox' and 'the', join them together using a single space as the word separator, and print the resulting sentence."
If the original sentence has many, many words, the first one might be more efficient as it will only ever create a three element list.
You can read more about the range operator in perldoc perlop. Since you are just beginning to learn Perl, you should read everything mentioned in perldoc perltoc at least once, including all the FAQ sections.

Related

substituting spaces for underscores using lookaheads in perl

I have files with many lines of the following form:
word -0.15636028 -0.2953045 0.29853472 ....
(one word preceding several hundreds floats delimited by blanks)
Due to some errors out of my control, the word sometimes has spaces in it.
a bbb c -0.15636028 -0.2953045 0.29853472 .... (several hundreds floats)
which I wish to substitute by underscores so to get:
a_bbb_c -0.15636028 -0.2953045 0.29853472 .... (several hundreds floats)
have tried for each line the following substitution code:
s/\s(?=(\s-?\d\.\d+)+)/_/g;
So lookarounds is apparently not the solution.
I'd be grateful for any clues.
Your idea for the lookahead is fine, but the question is how to replace only spaces in the part matched before the lookahead, when they are mixed with other things (the words, that is).
One way is to capture what precedes the first float (given by lookahead), and in the replacement part run another regex on what's been captured, to replace spaces
s{ (.*?) (?=\s+-?[0-9]+\.[0-9]) }{ $1 =~ s/\s+/_/gr }ex
Notes
Modifier /e makes the replacement part be evaluated as code; any valid Perl code goes
With s{}{} delimiters we can use s/// ones in the replacement part's regex
Regex in the replacement part, that changes spaces to _ in the captured text, has /r modifier so to return the modified string and leave the original unchanged. Thus we aren't attempting to change $1 (it's read only), and the modified string (being returned) is available as the replacement
Modifier /x allows use of spaces in patterns, for readability
Some assumptions must be made here. Most critical one is that the text to process is followed by a number in the given format, -?[0-9]+\.[0-9]+, and that there isn't such a number in the text itself. This follows the OP's sample and, more decidedly, the attempted solution
A couple of details with assumptions. (1) Leading digits are expected with [0-9]+\. -- if you can have numbers like .123 then use [0-9]*\. (2) The \s+ in the inner regex collapses multiple consecutive spaces into one _, so a b c becomes a_b_c (and not a__b_c)
In the lookahead I scoop up all spaces preceding the first float with \s+ -- and so they'll stay in front of the first float. This is as wanted with one space but with multiple ones it may be awkward
If they were included in the .*? capture (if the lookahead only has one space, \s) then we'd get an _ trailing the word(s). I thought that'd be more awkward. The ideal solution is to run another regex and clean that up, if such a case is possible and if it's a bother
An example
echo "a bbb c -0.15636028 -0.2953045" |
perl -wpe's{(.*?)(?=\s+-?[0-9]+\.[0-9])}{ $1 =~ s/\s+/_/gr }e'
prints
a_bbb_c -0.15636028 -0.2953045
Then to process all lines in a file you can do either
perl -wpe'...' file > new_file
and get a new_file with changes, or
perl -i.bak -wpe'...' file
to change the file in-place (that's -i), where .bak makes it save a backup.
Would something like this work for you:
s/\s+/_/g;
s/_(-?\d+\.)/ $1/g;
Use a negative lookahead to replace any spaces not followed by a float:
echo "a bbb cc -0.123232 -0.3232" | perl -wpe 's/ +(?! *-?\d+\.)/_/g'
Assuming from your comments your files look like that:
name float1 float2 float3
a bbb c -0.15636028 -0.2953045 0.29853472
abbb c -0.15636028 -0.2953045 0.29853472
a bbbc -0.15636028 -0.2953045 0.29853472
ab bbc -0.15636028 -0.2953045 0.29853472
abbbc -0.15636028 -0.2953045 0.29853472
Since you said in comments that the first field may contain digits, you can't use a lookahead that searches the first float to solve the problem. (you can nevertheless use a lookahead that counts the number of floats until the end of the line but it isn't very handy).
What I suggest is a solution based on fields number defined by the header first line.
You can use the header line to know the number of fields and replace spaces at the begining of other lines until the number of fields is the same.
You can use perl command line as awk like that:
perl -MEnglish -pae'$c=scalar #F if ($NR==1);for($i=0;$i<scalar(#F)-$c;$i++){s/\s+/_/}' file
The for loop counts the difference between the number of fields in the first row (stored in $c) and in the current line (given by scalar(#F) where #F is the fields array), and repeats the substitution.
The a switches the perl command line in autosplit mode and the -MEnglish makes available the number row variable as $NR (like the NR variable in awk).
It's possible to shorten it like that:
perl -pae'$c=#F if $.<2;$i=#F-$c;s/\s+/_/ while $i--' file

Capturing matches with sed

I have just started experimenting with sed and don't really get how does match capturing work: if I have a code like this for capturing two words sed 's/\([a-z]*\).*\([a-z]*\).*/\1 \2/' why isn't the second word captured?
Edit1: Let's say I have this string: "the brown fox jumps over the lazy dog". I want sed to match "the brown", but it only matches the first word
(Quoting Sundeep, just to make a Q/A pair.)
replace the dot in .* with space character...
sed 's/\([a-z]*\) *\([a-z]*\).*/\1 \2/'

process the document, adding the part of speech

There is one definition per line, the format is "WORDPartOfSpeech"
The task is to process the document, adding the part of speech
whenever it's defined. No re-formatting should be done.
For example, if the lexicon is
THE article
BIG adjective
BALL noun
and the document is
The big red ball fell.
Then the output should be
The/article big/adjective red ball/noun fell.
If I put the lexicon in a database table as 2 fields and I ran an SQL select that outputted as 1 comma separated line in the following format: "The/article,big/adjective,ball/noun" then how would I take that line and process it against the document so that it is outputted like above?
You should modify your sql query to preserve any words that don't match a term in your lexicon (perhaps by using an outer join; if you show us that query we could give you more specific advice). Then, assuming your output then looks something like this (with just a / following each term that didn't match your lexicon):
The/article big/adjective red/ ball/noun fell/.
You could clean it up with sed like this (assuming the string had been saved in a variable called $variablename:
sed 's_\/\([ .]\)_\1_g' <(echo "$variablename")
Explanation:
I used _ instead of / to delimit my s command for legibility. The syntax s/search/replace/g is synonymous with s_search_replace_g.
\/\([ .]\) tells sed to match anything with a literal / (escaped as \/) followed by either a space or a period [ .]. anything matching this pattern is stored into a reference because of the \( and \) surrounding the pattern.
\1 in the replacement pattern is the backreference I mentioned earlier. This works like a variable storing the matched portion we surrounded with parentheses in the search pattern. In effect, I've told sed to strip out any forward slashes that are followed by a space or a period, without stripping away the space or period itself.
Output:
The/article big/adjective red ball/noun fell.

How to restrict a find and replace to only one column within a CSV?

I have a 4-column CSV file, e.g.:
0001 # fish # animal # eats worms
I use sed to do a find and replace on the file, but I need to limit this find and replace to only the text found inside column 3.
How can I have a find and replace only occur on this one column?
Are you sure you want to be using sed? What about csvfix? Is your CSV nice and simple with no quotes or embedded commas or other nasties that make regexes...a less than satisfactory way of dealing with a general CSV file? I'm assuming that the # is the 'comma' in your format.
Consider using awk instead of sed:
awk -F# '$3 ~ /pattern/ { OFS= "#"; $3 = "replace"; }'
Arguably, you should have a BEGIN block that sets OFS once. For one line of input, it didn't make any odds (and you'd probably be hard-pressed to measure a difference on a million lines of input, too):
$ echo "pattern # pattern # pattern # pattern" |
> awk -F# '$3 ~ /pattern/ { OFS= "#"; $3 = "replace"; }'
pattern # pattern #replace# pattern
$
If sed still seems appealing, then:
sed '/^\([^#]*#[^#]*\)#pattern#\(.*\)/ s//\1#replace#\2/'
For example (and note the slightly different input and output – you can fix it to handle the same as the awk quite easily if need be):
$ echo "pattern#pattern#pattern#pattern" |
> sed '/^\([^#]*#[^#]*\)#pattern#\(.*\)/ s//\1#replace#\2/'
pattern#pattern#replace#pattern
$
The first regex looks for the start of a line, a field of non-at-signs, an at-sign, another field of non-at-signs and remembers the lot; it looks for an at-sign, the pattern (which must be in the third field since the first two fields have been matched already), another at-sign, and then the residue of the line. When the line matches, then it replaces the line with the first two fields (unchanged, as required), then adds the replacement third field, and the residue of the line (unchanged, as required).
If you need to edit rather than simply replace the third field, then you think about using awk or Perl or Python. If you are still constrained to sed, then you explore using the hold space to hold part of the line while you manipulate the other part in the pattern space, and end up re-integrating your desired output line from the hold space and pattern space before printing the line. That's nearly as messy as it sounds; actually, possibly even messier than it sounds. I'd go with Perl (because I learned it long ago and it does this sort of thing quite easily), but you can use whichever non-sed tool you like.
Perl editing the third field. Note that the default output is $_ which had to be reassembled from the auto-split fields in the array #F.
$ echo "pattern#pattern#pattern#pattern" | sh -x xxx.pl
> perl -pa -F# -e '$F[2] =~ s/\s*pat(\w\w)rn\s*/ prefix-$1-suffix /; $_ = join "#", #F; ' "$#"
pattern#pattern# prefix-te-suffix #pattern
$
An explanation. The -p means 'loop, reading lines into $_ and printing $_ at the end of each iteration'. The -a means 'auto-split $_ into the array #F'. The -F# means the field separator is #. The -e is followed by the Perl program. Arrays are indexed from 0 in Perl, so the third field is split into $F[2] (the sigil — the # or $ — changes depending on whether you're working with a value from the array or the array as a whole. The =~ is a match operator; it applies the regex on the RHS to the value on the LHS. The substitute pattern recognizes zero or more spaces \s* followed by pat then two 'word' characters which are remembered into $1, then rn and zero or more spaces again; maybe there should be a ^ and $ in there to bind to the start and end of the field. The replacement is a space, 'prefix-', the remembered pair of letters, and '-suffix' and a space. The $_ = join "#", #F; reassembles the input line $_ from the possibly modified separate fields, and then the -p prints that out. Not quite as tidy as I'd like (so there's probably a better way to do it), but it works. And you can do arbitrary transforms on arbitrary fields in Perl without much difficulty. Perl also has a module Text::CSV (and a high-speed C version, Text::CSV_XS) which can handle really complex CSV files.
Essentially break the line into three pieces, with the pattern you're looking for in the middle. Then keep the outer pieces and replace the middle.
/\([^#]*#[^#]*#\[^#]*\)pattern\([^#]*#.*\)/s//\1replacement\2/
\([^#]*#[^#]*#\[^#]*\) - gather everything before the pattern, including the 3rd # and any text before the math - this becomes \1
pattern - the thing you're looking for
\([^#]*#.*\) - gather everything after the pattern - this becomes \2
Then change that line into \1 then the replacement, then everything after pattern, which is \2
This might work for you:
echo 0001 # fish # animal # eats worms|
sed 's/#/&\n/2;s/#/\n&/3;h;s/\n#.*//;s/.*\n//;y/a/b/;G;s/\([^\n]*\)\n\([^\n]*\).*\n/\2\1/'
0001 # fish # bnimbl # eats worms
Explanation:
Define the field to be worked on (in this case the 3rd) and insert a newline (\n) before it and directly after it. s/#/&\n/2;s/#/\n&/3
Save the line in the hold space. h
Delete the fields either side s/\n#.*//;s/.*\n//
Now process the field i.e. change all a's to b's. y/a/b/
Now append the original line. G
Substitute the new field for the old field (also removing any newlines). s/\([^\n]*\)\n\([^\n]*\).*\n/\2\1/
N.B. That in step 4 the pattern space only contains the defined field, so any number of commands may be carried out here and the result will not affect the rest of the line.

What do these various pieces of syntax mean?

I'm trying to figure out the syntax of both the sed command and perl script:
sed 's/^EOR:$//' INPUTFILE |
perl -00 -ne '/
TAGA01:\s+(.*?)\n
.*
TAGCC08:\s+(.*?)\n
# and so on
/xs && print "$1 $2\n"'
Why is there a circumflex ^ in the sed command? The third slash / will replace all instances of EOR: with a blank line, correct?
I understand some of the Perl script. Looking at perlrun, -00 will slurp the stream in paragraph mode and -n starts a while <> loop.
Why is there the first slash / next to the apostrophe? The command searches for TAGXXXX:, but I am not sure what \s+(.*?) does. Does that put whatever is after the tag into a variable? How about the .* in the between tag searches? What does /ns do? What do the $1 and $2 refer to in the print line?
This was tough to find online, and if someone could kick me in the right direction, I'd appreciate it.
The circumflex ^ is regex for "start of line", and $ is regex for "end of line"; so sed will only remove lines which contain exactly "EOR:" and nothing else.
The Perl script is basically perl -00 -ne '/(re)g(ex)/ && print "re ex\n"' with a big ole regex instead of the simple placeholder I put here. In particular, the /x modifier allows you to split the regex over several lines. So the first / is the start of the regex and the final / is the end of the regex and the lines in between form the regex together.
The /s modifier changes how Perl interprets . in a regex; normally it will match any character except newline, but with this option, it includes newlines as well. This means that .* can match multiple lines.
\s matches a single whitespace character; \s+ matches as many whitespace characters as possible, but there has to be at least one.
(.*?) matches an arbitrary length of string; the dot matches any character, the asterisk says zero or more of any character, and the question mark modifies the asterisk repetition operator to match as short a string as possible instead of as long a string as possible. The parentheses cause the skipped expression to be captured in a back reference; the backrefs are named $1, $2, etc, as many as there are backreferences; the numbers correspond to the order of the opening parenthesis (so if you apply (a(b)) to the string "ab", $1 will be "ab" and $2 will be "b").
Finally, \n matches a literal newline. So the (.*?) non-greedy match will match up to the first newline, i.e. the tail of the line on which the TAGsomething was found. (I
imagine these are gene sequences, not "tags"?)
It doesn't really make sense to run sed separately; Perl would be quite capable of removing the EOR: lines before attempting to match the regex.
Let's see...
Yes, sed will empty the lines with EOR:
The first / in the Perl script means a regexp pattern. Concretely, it is searching for a pattern in the form below
The regex ends with "xs", which means that the regex will match multiple lines of the input
The script also will print as output the strings found in the tags (see below). The $1 and $2 mean the elements contained in the first pair of parentheses ($1) and in the second ($2).
. The form is this one:
TAGA01:<spaces><string1>
<whatever here>
TAGCC00:<spaces><string2>
In this case, $1 is <string1> and $2 is <string2>.