I would like to translate all instances of a character with two characters. The usual way I would do it is:
$text =~ s/a/aa/g;
I only want single instances of a character to be doubled. So aa would remain aa and not turn into aaaa.
I am thinking I have to use variables in the s/// statement but I cannot find any suitable pattern here or on the net.
Match instances of a that are not next to another a:
s/(?<!a)a(?!a)/aa/g;
Related
I have a command like this, it is marking words to appear in an index in the document:
sed -i "s/\b$line\b/\\\keywordis\{$line\}\{$wordis\}\{$definitionis\}/g" file.txt
The problem is, it is finding matches within existing matches, which means its e.g. "hello" is replaced with \keywordis{hello}{a common greeting}, but then "greeting" might be searched too, and \keywordis{hello}{a common \keywordis{greeting}{a phrase used when meeting someone}}...
How can I tell sed to perform the replacement, but ignore text that is already inside curly brackets?
Curley brackets in this case will always appear on the same line.
How can I tell sed to perform the replacement, but ignore text that is already inside curly brackets?
First tokenize input. Place something unique, like | or byte \x01 between every \keywordis{hello}{a common greeting} and store that in hold space. Something along s/\\the regex to match{hello}{a common greeting}/\x01&\x01/g'.
Ten iterate over elements in hold space. Use \n to separate elements already parsed from not parsed - input from output. If the element matches the format \keywordis{hello}{a common greeting}, just move it to the front before the newline in hold space, if it does not, perform the replacement. Here's an example: Identify and replace selective space inside given text file , it uses double newline \n\n as input/output separator.
Because, as you noted, replacements can have overlapping words with the patterns you are searching for, I believe the simplest would be after each replacement shuffling the pattern space like for ready output and starting the process all over for the current line.
Then on the end, shuffle the hold space to remove \x01 and newline and any leftovers and output.
Overall, it's Latex. I believe it would be simpler to do it manually.
By "eating" the string from the back and placing it in front of input/output separator inside pattern space, I simplified the process. The following program:
sed '
# add our input/output separator - just a newline
s/^/\n/
: loop
# l1000
# Ignore any "\keywords" and "{stuff}"
/^\([^\n]*\)\n\(.*\)\(\\[^{}]*\|{[^{}]*}\)$/{
s//\3\1\n\2/
b loop
}
# Replace hello followed by anthing not {}
# We match till the end because regex is greedy
# so that .* will eat everything.
/^\([^\n]*\)\n\(.*\)hello\([{}]*\)$/{
s//\\keywordis{hello}{a common greeting}\3\1\n\2/
b loop
}
# Hello was not matched - ignore anything irrelevant
# note - it has to match at least one character after newline
/^\([^\n]*\)\n\(.*\)\([^{}]\+\)$/{
s//\3\1\n\2/
b loop
}
s/\n//
' <<<'
\keywordis{hello}{hello} hello {some other hello} another hello yet
'
outputs:
\keywordis{hello}{hello} \keywordis{hello}{a common greeting} {some other hello} another \keywordis{hello}{a common greeting} yet
i have string like below
#a b#mail.com has joined
i want to show #a b c as mention, but hard to detect space, so i put the string like below before using regex
#``a b#mail.com`` has joined
So i want to detect start with
#``
and end with
``
can anyone help me the regex, i tried so much but still not working, here is the regex im testing
^#``.*``{3,}
You might use a capture group and capture any char except an # using a negated character class.
^#``([^#\r\n]+).*?``
The pattern matches
^#`` Start of string, match # and 2 backticks
( Capture group 1
[^#\r\n]+ Match 1+ occurrences of any char except # or a newline
) Close group 1
.*?`` Match as least as possible chars until 2 backticks
Regex demo
I have files with many lines of the following form:
word -0.15636028 -0.2953045 0.29853472 ....
(one word preceding several hundreds floats delimited by blanks)
Due to some errors out of my control, the word sometimes has spaces in it.
a bbb c -0.15636028 -0.2953045 0.29853472 .... (several hundreds floats)
which I wish to substitute by underscores so to get:
a_bbb_c -0.15636028 -0.2953045 0.29853472 .... (several hundreds floats)
have tried for each line the following substitution code:
s/\s(?=(\s-?\d\.\d+)+)/_/g;
So lookarounds is apparently not the solution.
I'd be grateful for any clues.
Your idea for the lookahead is fine, but the question is how to replace only spaces in the part matched before the lookahead, when they are mixed with other things (the words, that is).
One way is to capture what precedes the first float (given by lookahead), and in the replacement part run another regex on what's been captured, to replace spaces
s{ (.*?) (?=\s+-?[0-9]+\.[0-9]) }{ $1 =~ s/\s+/_/gr }ex
Notes
Modifier /e makes the replacement part be evaluated as code; any valid Perl code goes
With s{}{} delimiters we can use s/// ones in the replacement part's regex
Regex in the replacement part, that changes spaces to _ in the captured text, has /r modifier so to return the modified string and leave the original unchanged. Thus we aren't attempting to change $1 (it's read only), and the modified string (being returned) is available as the replacement
Modifier /x allows use of spaces in patterns, for readability
Some assumptions must be made here. Most critical one is that the text to process is followed by a number in the given format, -?[0-9]+\.[0-9]+, and that there isn't such a number in the text itself. This follows the OP's sample and, more decidedly, the attempted solution
A couple of details with assumptions. (1) Leading digits are expected with [0-9]+\. -- if you can have numbers like .123 then use [0-9]*\. (2) The \s+ in the inner regex collapses multiple consecutive spaces into one _, so a b c becomes a_b_c (and not a__b_c)
In the lookahead I scoop up all spaces preceding the first float with \s+ -- and so they'll stay in front of the first float. This is as wanted with one space but with multiple ones it may be awkward
If they were included in the .*? capture (if the lookahead only has one space, \s) then we'd get an _ trailing the word(s). I thought that'd be more awkward. The ideal solution is to run another regex and clean that up, if such a case is possible and if it's a bother
An example
echo "a bbb c -0.15636028 -0.2953045" |
perl -wpe's{(.*?)(?=\s+-?[0-9]+\.[0-9])}{ $1 =~ s/\s+/_/gr }e'
prints
a_bbb_c -0.15636028 -0.2953045
Then to process all lines in a file you can do either
perl -wpe'...' file > new_file
and get a new_file with changes, or
perl -i.bak -wpe'...' file
to change the file in-place (that's -i), where .bak makes it save a backup.
Would something like this work for you:
s/\s+/_/g;
s/_(-?\d+\.)/ $1/g;
Use a negative lookahead to replace any spaces not followed by a float:
echo "a bbb cc -0.123232 -0.3232" | perl -wpe 's/ +(?! *-?\d+\.)/_/g'
Assuming from your comments your files look like that:
name float1 float2 float3
a bbb c -0.15636028 -0.2953045 0.29853472
abbb c -0.15636028 -0.2953045 0.29853472
a bbbc -0.15636028 -0.2953045 0.29853472
ab bbc -0.15636028 -0.2953045 0.29853472
abbbc -0.15636028 -0.2953045 0.29853472
Since you said in comments that the first field may contain digits, you can't use a lookahead that searches the first float to solve the problem. (you can nevertheless use a lookahead that counts the number of floats until the end of the line but it isn't very handy).
What I suggest is a solution based on fields number defined by the header first line.
You can use the header line to know the number of fields and replace spaces at the begining of other lines until the number of fields is the same.
You can use perl command line as awk like that:
perl -MEnglish -pae'$c=scalar #F if ($NR==1);for($i=0;$i<scalar(#F)-$c;$i++){s/\s+/_/}' file
The for loop counts the difference between the number of fields in the first row (stored in $c) and in the current line (given by scalar(#F) where #F is the fields array), and repeats the substitution.
The a switches the perl command line in autosplit mode and the -MEnglish makes available the number row variable as $NR (like the NR variable in awk).
It's possible to shorten it like that:
perl -pae'$c=#F if $.<2;$i=#F-$c;s/\s+/_/ while $i--' file
I have one string
"8.53" I want my resulting string "853"
I have tried
the following code
tr|.||;
but its not replacing its giving 8.53 only .
I have tried another way using
tr|.|NULL|;
but its giving 8N53 can anyone please suggest me how to use tr to replace a character with NULL.
Thanks
You need to specify the d modifier to delete chars with no corresponding char:
tr/.//d;
Or you could use the (slower but more familiar) substitution operator:
s/\.//g;
You don't want tr because that transliterates characters from the 1st list with the corresponding character in the 2nd list (which was N in your example since that was the first character). You'll want the substitution operator.
my $var = "8.53";
$var =~ s/\.//;
print $var;
Add the g flag if there are multiple instances you want to replace (s/\.//g).
I've to match a regular-expression, stored in a variable:
#!/bin/env perl
use warnings;
use strict;
my $expr = qr/\s*(\w+(\[\d+\])?)\s+(\w+(\[\d+\])?)/sx;
$str = "abcd[3] xyzg[4:0]";
if ($str =~ m/$expr/) {
print "\n%%%%%%%%% $`-----$&-----$'\n";
}
else {
print "\n********* NOT MATCHED\n";
}
But I'm getting the outout in $& as
%%%%%%%%% -----abcd[3] xyzg-----[4:0]
But expecting, it shouldn't go inside the if clause.
What is intended is:
if $str = "abcd xyzg" => %%%%%%%%% -----abcd xyzg----- (CORRECT)
if $str = "abcd[2] xyzg" => %%%%%%%%% -----abcd[2] xyzg----- (CORRECT)
if $str = "abcd[2] xyzg[3] => %%%%%%%%% -----abcd[2] xyzg[3]----- (CORRECT)
if $str = "abcd[2:0] xyzg[3] => ********* NOT MATCHED (CORRECT)
if $str = "abcd[2:0] xyzg[3:0] => ********* NOT MATCHED (CORRECT)
if $str = "abcd[2] xyzg[3:0]" => ********* NOT MATCHED (CORRECT/INTENDED)
but output is %%%%%%%%% -----abcd[2] xyzg-----[3:0] (WRONG)
OR better to say this is not intended.
In this case, it should/my_expectation go to the else block.
Even I don't know, why $& take a portion of the string (abcd[2] xyzg), and $' having [3:0]?
HOW?
It should match the full, not a part like the above. If it didn't, it shouldn't go to the if clause.
Can anyone please help me to change my $expr pattern, so that I can have what is intended?
By default, Perl regexes only look for a matching substring of the given string. In order to force comparison against the entire string, you need to indicate that the regex begins at the beginning of the string and ends at the end by using ^ and $:
my $expr = qr/^\s*(\w+(\[\d+\])?)\s+(\w+(\[\d+\])?)$/;
(Also, there's no reason to have the /x modifier, as your regex doesn't include any literal whitespace or # characters, and there's no reason for the /s modifier, as you're not using ..)
EDIT: If you don't want the regex to match against the entire string, but you want it to reject anything in which the matching portion is followed by something like "[0:0]", the simplest way would be to use lookahead:
my $expr = qr/^\s*(\w+(\[\d+\])?)\s+(\w+(\[\d+\]|(?=[^[\w])|$ ))/x;
This will match anything that takes the following form:
beginning of the string (which your example in the comments seems to imply you want)
zero or more whitespace characters
one or more word characters
optional: [, one or more digits, ]
one or more whitespace characters
one or more word characters
one of the following, in descending order of preference:
[, one or more digits, ]
an empty string followed by (but not including!) a character that is neither [ nor a word character (The exclusion of word characters is to keep the regex engine from succeeding on "a[0] bc[1:2]" by only matching "a[0] b".)
end of string (A space is needed after the $ to keep it from merging with the following ) to form the name of a special variable, and this entails the reintroduction of the /x option.)
Do you have any more unstated requirements that need to be satisfied?
The short answer is your regexp is wrong.
We can't fix it for you without you explaining what you need exactly, and the community is not going to write a regexp exactly for your purpose because that's just too localized a question that only helps you this one time.
You need to ask something more general about regexps that we can explain to you, that will help you fix your regexp, and help others fix theirs.
Here's my general answer when you're having trouble testing your regexp. Use a regexp tool, like the regex buddy one.
So I'm going to give a specific answer about what you're overlooking here:
Let's make this example smaller:
Your pattern is a(bc+d)?. It will match: abcd abccd etc. While it will not match bcd nor bzd in the case of abzd it will match as matching only a because the whole group of bc+d is optional. Similarly it will match abcbcd as a dropping the whole optional group that couldn't be matched (at the second b).
Regexps will match as much of the string as they can and return a true match when they can match something and have satisfied the entire pattern. If you make something optional, they will leave it out when they have to including it only when it's present and matches.
Here's what you tried:
qr/\s*(\w+(\[\d+\])?)\s+(\w+(\[\d+\])?)/sx
First, s and x aren't needed modifiers here.
Second, this regex can match:
Any or no whitespace followed by
a word of at least one alpha character followed by
optionally a grouped square bracketed number with at least one digit (eg [0] or [9999]) followed by
at least one white space followed by
a word of at least one alpha character followed by
optionally a square bracketed number with at least one digit.
Clearly when you ask it to match abcd[0] xyzg[0:4] the colon ends the \d+ pattern but doesn't satisfy the \] so it backtracks the whole group, and then happily finds the group was optional. So by not matching the last optional group, your pattern has matched successfully.