Can Sed match matching brackets? - sed

My code has a ton of occurrences of something like:
idof(some_object)
I want to replace them with:
some_object["id"]
It sounds simple:
sed -i 's/idof(\([^)]\+\))/\1["id"]/g' source.py
The problem is that some_object might be something like idof(get_some_object()), or idof(my_class().get_some_object()), in which case, instead of getting what I want (get_some_object()["id"] or my_class().get_some_object()["id"]), I get get_some_object(["id"]) or my_class(["id"].get_some_object()).
Is there a way to have sed match closing bracket, so that it internally keeps track of any opening/closing brackets inside my (), and ignores those?
It needs to keep everything that's between those brackets: idof(ANYTHING) becomes ANYTHING["id"].

Using sed
$ sed -E 's/idof\(([[:alpha:][:punct:]]*)\)/\1["id"]/g' input_file
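With ERE (-E), idof and its opening parenthesis are matched literally and left out of the capture group.
Because the character class is matched greedily, the capture runs up to the last closing parenthesis of the run, so any nested parentheses in between are captured too; the final literal \) is matched outside the group and dropped.
[[:alpha:]] matches all alphabetic characters, upper and lower case, while [[:punct:]] matches punctuation characters including ().-{} and more. Digits and whitespace are in neither class, so an argument containing them is not matched, and two idof(...) calls separated only by punctuation are captured as one.
The g flag repeats the substitution for every match on the line.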

Theoretically, you can write a regex that handles idof(....) up to some fixed limit of nested () calls inside the argument. Such a regex has to spell out every possible combination of calls: idof(one(two(three))) can be matched with idof([^()]*([^()]*([^()]*)[^()]*)[^()]*), and idof(one(two(three)four(five))) with idof([^()]*([^()]*([^()]*)[^()]*([^()]*)[^()]*)[^()]*).
The following regex handles only some cases, but shows the complexity and the general approach. Writing a regex that handles all possible cases and "eats" everything in front of the trailing ) is left to the OP as an exercise in why it's better to use something else. Note that handling string literals such as ")" makes it more complex still.
The following Bash code:
sed '
: begin
# No idof? Just print the line!
/^\(.*\)idof(\([^)]*)\)/!n
# Note: regex is greedy - we start from the back!
# Note: using newline as a stack separator.
s//\1\n\2/
# hold the front
{ h ; x ; s/\n.*// ; x ; s/[^\n]*\n// ; }
: handle_brackets
# Eat everything before final ) up to some number of nested ((())) calls.
# Insert more jokes here.
: eat_brackets
/^[^()]*\(([^()]*\(([^()]*\(([^()]*\(([^()]*\(([^()]*\(([^()]*)\)\?[^()]*)\)\?[^()]*)\)\?[^()]*)\)\?[^()]*)\)\?[^()]*)\)/{
s//&\n/
# Hold the front.
{ H ; x ; s/\n\([^\n]*\)\n.*/\1/ ; x ; s/[^\n]*\n// ; }
b eat_brackets
}
/^\([^()]*\))/!{
s/^/ERROR: eating brackets did not work: /
q1
}
# Add the id after trailing ) and remove it.
s//\1["id"]/
# Join with hold space and clear the hold space for next round
{ H ; s/.*// ; x ; s/\n//g ; }
# Restart for another idof if in input.
b begin
' <<EOF
before idof(some_object) after
before idof(get_some_object()) after
before idof(my_class().get_some_object()) after
before idof(one(two(three)four)five) after
before idof(one(two(three)four)five) between idof(one(two(three)four)five) after
before idof( one(two(three)four)five one(two(three)four)five ) after
before idof(one(two(three(four)five)six(seven(eight)nine)ten) between idof(one(two(three(four)five)six(seven(eight)nine)ten) after
EOF
Will output:
before some_object["id"] after
before get_some_object()["id"] after
before my_class().get_some_object()["id"] after
before one(two(three)four)five["id"] after
before one(two(three)four)five["id"] between one(two(three)four)five["id"] after
before one(two(three)four)five one(two(three)four)five ["id"] after
ERROR: eating brackets did not work: one(two(three(four)five)six(seven(eight)nine)ten) after
The last line is not handled correctly because the (()()) case is not covered by the regex. One would have to write yet another regex to match it.
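Counting nested parentheses is exactly what regular expressions are bad at, so an alternative outside sed is a small scanner that tracks the nesting depth. The following is a minimal Python sketch of that idea, not part of either answer above. The function name replace_idof is hypothetical, and the sketch assumes that no parentheses appear inside string literals in the arguments.

```python
def replace_idof(text, name="idof"):
    """Rewrite every name(EXPR) in text as EXPR["id"], tracking nested parentheses."""
    out = []
    pos = 0
    marker = name + "("
    while True:
        start = text.find(marker, pos)
        if start == -1:
            # No further call: copy the rest of the text and stop.
            out.append(text[pos:])
            return "".join(out)
        # Skip hits that are the tail of a longer identifier, e.g. "pidof(".
        if start > 0 and (text[start - 1].isalnum() or text[start - 1] == "_"):
            out.append(text[pos:start + len(marker)])
            pos = start + len(marker)
            continue
        open_idx = start + len(name)  # index of the opening parenthesis
        depth = 0
        close_idx = -1
        for k in range(open_idx, len(text)):
            if text[k] == "(":
                depth += 1
            elif text[k] == ")":
                depth -= 1
                if depth == 0:
                    close_idx = k
                    break
        if close_idx == -1:
            # Unbalanced parentheses: leave the remainder untouched.
            out.append(text[pos:])
            return "".join(out)
        # Recurse so that an idof(...) nested inside the argument is rewritten too.
        inner = replace_idof(text[open_idx + 1:close_idx], name)
        out.append(text[pos:start])
        out.append(inner + '["id"]')
        pos = close_idx + 1


if __name__ == "__main__":
    print(replace_idof("before idof(my_class().get_some_object()) after"))
```

Unlike the fixed-depth regex, the depth counter has no nesting limit, which is the main reason to prefer this kind of approach.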

Related

How to match exact string in perl

I am trying to parse all the files and verify if any of the file content has strings TESTDIR or TEST_DIR
Files contents might look something like:-
TESTDIR = foo
include $(TESTDIR)/chop.mk
...
TEST_DIR := goldimage
MAKE_TESTDIR = var_make
NEW_TEST_DIR = tesing_var
Actually I am only interested in TESTDIR, $(TESTDIR) and TEST_DIR; the last two lines should be ignored. I am new to Perl. Can anyone help me out with a regex?
/\bTEST_?DIR\b/
\b means a "word boundary", i.e. the position between a word character and a non-word character. "Word" here has the Perl meaning: word characters are letters, digits, and underscores.
_? means "nothing or an underscore"
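To see the word-boundary behaviour on the sample lines from the question, here is a small sketch in Python rather than Perl. Python's re module also treats letters, digits and underscores as word characters, so \b behaves the same way for this input. The helper name mentions_testdir is made up for illustration.

```python
import re

# Same pattern as in the answer: TESTDIR or TEST_DIR as a whole word.
PATTERN = re.compile(r"\bTEST_?DIR\b")


def mentions_testdir(line):
    """Return True if the line uses TESTDIR or TEST_DIR as a standalone name."""
    return PATTERN.search(line) is not None


if __name__ == "__main__":
    sample = [
        "TESTDIR = foo",
        "include $(TESTDIR)/chop.mk",
        "TEST_DIR := goldimage",
        "MAKE_TESTDIR = var_make",
        "NEW_TEST_DIR = tesing_var",
    ]
    for line in sample:
        print(mentions_testdir(line), line)
```

The last two sample lines are rejected because the underscore in front of TEST is itself a word character, so there is no boundary at that position.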
Look at character sets.
Only a space allowed on either side:
/^(.* )?TEST_?DIR /
^ is the beginning of the line.
(.* )? means there may be some content (.*) before the name, but if there is, it must be followed by a space.
The space at the end says that whitespace must follow. Otherwise use ( .*)?$ at the end.
One of a given set of characters allowed:
If characters other than a space should be possible, you can use a character class []:
/^(.*[ \t(])?TEST_?DIR[) :=]/
(.*[ \t(])? means that in front of TEST_?DIR there may be a space, a \t (tab) or a (, or nothing if the line starts with the name itself.
Afterwards there must be one of space, :, = or ), followed by anything (the "=" of ":=" belongs to that "anything").
One of a given group allowed:
For this you need groups within (), with the alternatives divided by |:
/^(.*( |\t))?TEST_?DIR( | := | = )/
In this case the beginning could just as well stay [ \t], because each alternative there is a single character.
At the end there must be a single space, or := surrounded by spaces, or = surrounded by spaces, followed by anything.
You can use any combination:
/^(.*[ \t(])?TEST_?DIR([) =:]| :=| =|)/
Test it on Debuggex.com. (Use 'PCRE')

Avoiding duplicate items in a comma-separated list of two-letter words

I need to write a regex which allows a group of 2 chars only once. This is my current regex :
^([A-Z]{2},)*([A-Z]{2}){1}$
This allows me to validate something like this :
AL,RA,IS,GD
AL
AL,RA
The problem is that it validates also AL,AL and AL,RA,AL.
EDIT
Here there are more details.
What is allowed:
AL,RA,GD
AL
AL,RA
AL,IS,GD
What it shouldn't be allowed:
AL,RA,AL
AL,AL
AL,RA,RA
AL,IS,AL
IS,IS,AL
IS,GD,GD
IS,GD,IS
I need that every group of two characters appears only once in the sequence.
Try something like this expression:
/^(?:,?(\b\w{2}\b)(?!.*\1))+$/gm
I have no knowledge of swift, so take it with a grain of salt. The idea is basically to only match a whole line while making sure that no single matched group occurs at a later point in the line.
First of all, let's shorten your pattern. It can be easily achieved since the length of each comma-separated item is fixed and the list items are only made up of uppercase ASCII letters. So, your pattern can be written as ^(?:[A-Z]{2}(?:,\b)?)+$. See this regex demo.
Now, you need to add a negative lookahead that will check the string for any repeating two-letter sequence at any distance from the start of string, and within any distance between each. Use
^(?!.*\b([A-Z]{2})\b.*\b\1\b)(?:[A-Z]{2}(?:,\b)?)+$
See the regex demo
Possible implementation in Swift:
import Foundation  // provides range(of:options:) on String

func isValidInput(Input: String) -> Bool {
    return Input.range(of: #"^(?!.*\b([A-Z]{2})\b.*\b\1\b)(?:[A-Z]{2}(?:,\b)?)+$"#, options: .regularExpression) != nil
}
print(isValidInput(Input:"AL,RA,GD")) // true
print(isValidInput(Input:"AL,RA,AL")) // false
Details
^ - start of string
(?!.*\b([A-Z]{2})\b.*\b\1\b) - a negative lookahead that fails the match if, immediately to the right of the current location, there is:
.* - any 0+ chars other than line break chars, as many as possible
\b([A-Z]{2})\b - a two-letter word as a whole word
.* - any 0+ chars other than line break chars, as many as possible
\b\1\b - the same whole word as in Group 1. NOTE: The word boundaries here are not necessary in the current scenario where the word length is fixed, it is two, but if you do not know the word length, and you have [A-Z]+, you will need the word boundaries, or other boundaries depending on the situation
(?:[A-Z]{2}(?:,\b)?)+ - 1 or more sequences of:
[A-Z]{2} - two uppercase ASCII letters
(?:,\b)? - an optional sequence: , only if followed with a word char: letter, digit or _. This guarantees that , won't be allowed at the end of the string
$ - end of string.
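The pattern broken down above is not specific to Swift. As a quick cross-check, here is a sketch that runs it with Python's re module against the allowed and disallowed inputs from the question. The function name is_valid_list is hypothetical.

```python
import re

# The pattern from the answer: a negative lookahead rejects any repeated
# two-letter item, then the list format itself is matched.
NO_DUPLICATES = re.compile(r"^(?!.*\b([A-Z]{2})\b.*\b\1\b)(?:[A-Z]{2}(?:,\b)?)+$")


def is_valid_list(text):
    """Return True for a comma-separated list of distinct two-letter codes."""
    return NO_DUPLICATES.match(text) is not None


if __name__ == "__main__":
    for candidate in ["AL,RA,GD", "AL", "AL,RA,AL", "AL,AL", "IS,GD,GD"]:
        print(candidate, is_valid_list(candidate))
```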
You can use a negative lookahead with a back-reference:
^(?!.*([A-Z]{2}).*\1).*
if, as in all the examples in the question, it is known that the string contains only comma-separated pairs of capital letters. I will relax that assumption later in my answer.
The regex performs the following operations:
^ # match beginning of line
(?! # begin negative lookahead
.* # match 0+ characters (1+ OK)
([A-Z]{2}) # match 2 uppercase letters in capture group 1
.* # match 0+ characters (1+ OK)
\1 # match the contents of capture group 1
) # end negative lookahead
.* # match 0+ characters (the entire string)
Suppose now that one or more capital letters may appear between each pair of commas, before the first comma or after the last comma, but that only two-letter strings may not be repeated. Moreover, I assume the regex must confirm that the string has the desired form. Then the following regex could be used:
^(?=[A-Z]+(?:,[A-Z]+)*$)(?!.*(?:^|,)([A-Z]{2}),(?:.*,)?\1(?:,|$)).*
The regex performs the following operations:
^ # match beginning of line
(?= # begin pos lookahead
[A-Z]+ # match 1+ uc letters
(?:,[A-Z]+) # match ',' followed by 1+ uc letters in a non-cap grp
* # execute the non-cap grp 0+ times
$ # match the end of the line
) # end pos lookahead
(?! # begin neg lookahead
.* # match 0+ chars
(?:^|,) # match beginning of line or ','
([A-Z]{2}) # match 2 uc letters in cap grp 1
, # match ','
(?:.*,) # match 0+ chars, then ',' in non-cap group
? # optionally match non-cap grp
\1 # match the contents of cap grp 1
(?:,|$) # match ',' or end of line
) # end neg lookahead
.* # match 0+ chars (entire string)
If there is no need to check that the string contains only comma-separated strings of one or more uppercase letters, the positive lookahead at the beginning can be removed.
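The relaxed regex can be exercised the same way. The following sketch uses Python's re module and a made-up helper name, has_unique_pairs, to show that only two-letter items are checked for repetition while the overall format is still validated.

```python
import re

# The relaxed pattern from the answer: items may have any length, the format is
# checked by the positive lookahead, and only two-letter items must be unique.
RELAXED = re.compile(
    r"^(?=[A-Z]+(?:,[A-Z]+)*$)(?!.*(?:^|,)([A-Z]{2}),(?:.*,)?\1(?:,|$)).*"
)


def has_unique_pairs(text):
    """Return True if text is a valid list whose two-letter items are distinct."""
    return RELAXED.match(text) is not None


if __name__ == "__main__":
    for candidate in ["AL,RAX,GD", "ALB,RA,ALB", "AL,RAX,AL", "AL,al"]:
        print(candidate, has_unique_pairs(candidate))
```

Note that "ALB,RA,ALB" is accepted: the repeated item has three letters, so the restriction does not apply to it.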

sed: replace letter between square brackets

I have the following string:
signal[i]
signal[bg]
output [10:0]
input [i:1]
What I want is to replace the letters between square brackets (with underscores, for example) and to keep the other strings, which represent table declarations:
signal[_]
signal[__]
output [10:0]
input [i:1]
thanks
try:
awk '{gsub(/\[[a-zA-Z]+\]/,"[_]")} 1' Input_file
This globally replaces each bracketed run of letters (longest match) with [_]. The trailing 1 prints every line, edited or not.
EDIT: The above replaces all the letters with one single _. To get as many underscores as there are letters, the following may help.
awk '{q="";i=0;match($0,/\[[a-zA-Z]+\]/);VAL=substr($0,RSTART+1,RLENGTH-2);if(VAL){len=length(VAL);while(i<len){q=q"_";i++}};gsub(/\[[a-zA-Z]+\]/,"["q"]")}1' Input_file
OR
awk '{
  q=""; i=0;                  # reset per line so earlier matches do not leak into this one
  match($0,/\[[a-zA-Z]+\]/);
  VAL=substr($0,RSTART+1,RLENGTH-2);
  if(VAL){
    len=length(VAL);
    while(i<len){
      q=q"_";
      i++
    }
  };
  gsub(/\[[a-zA-Z]+\]/,"["q"]")
}
1
' Input_file
Will add explanation soon.
EDIT2: Following is the one with explanation purposes for OP and users.
awk '{
q=""; i=0;                           #### Reset q and i for every line so values from earlier lines are not reused.
match($0,/\[[a-zA-Z]+\]/);           #### Use the built-in match function to find the first [letters] group, as the OP requires.
VAL=substr($0,RSTART+1,RLENGTH-2);   #### VAL is the text between the brackets: the substring starts at RSTART+1 and is RLENGTH-2 characters long.
                                     #### RSTART and RLENGTH are built-in variables that are set whenever match finds a match.
if(VAL){                             #### If VAL is not empty, perform the following actions.
len=length(VAL);                     #### len holds the length of VAL.
while(i<len){                        #### Loop until i (initially 0) reaches len.
q=q"_";                              #### Append one "_" to q on each pass.
i++                                  #### Increment i by 1 each time.
}
};
gsub(/\[[a-zA-Z]+\]/,"["q"]")        #### Globally replace each [letters] group with [ followed by q (the underscores) and ].
}
1                                    #### The 1 prints every line, edited or not.
' Input_file                         #### The input file name.
Alternative gawk solution:
awk -F'\\[|\\]' '$2~/^[a-zA-Z]+$/{ gsub(/./,"_",$2); $2="["$2"]" }1' OFS= file
The output:
signal[_]
signal[__]
output [10:0]
input [i:1]
-F'\\[|\\]' - treating [ and ] as field separators
$2~/^[a-zA-Z]+$/ - performing the action only if the 2nd field consists of letters only, i.e. does not represent a table declaration
gsub(/./,"_",$2) - replace each character with _
This might work for you (GNU sed);
sed ':a;s/\(\[_*\)[[:alpha:]]\([[:alpha:]]*\]\)/\1_\2/;ta' file
Match an opening square bracket followed by any number of _'s and at least one alphabetic character, up to the closing square bracket; replace that first alphabetic character with an underscore and repeat until nothing is left to replace.
awk '{sub(/\[i\]/,"[_]"); sub(/\[bg\]/,"[__]")}1' file
signal[_]
signal[__]
output [10:0]
input [i:1]
The explanation is as follows: since a square bracket is a special character in a regex, it has to be escaped to be matched literally; after that, a plain sub for each known name is enough.
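For comparison, the "one underscore per letter" requirement is compact in a language whose substitution accepts a callback. This is a sketch in Python, not one of the answers above, and the function name mask_bracket_letters is hypothetical.

```python
import re

# A bracket pair that contains letters only; [10:0] and [i:1] do not qualify.
LETTERS_IN_BRACKETS = re.compile(r"\[([A-Za-z]+)\]")


def mask_bracket_letters(line):
    """Replace each letter inside [letters] with an underscore, keeping the length."""
    return LETTERS_IN_BRACKETS.sub(
        lambda m: "[" + "_" * len(m.group(1)) + "]", line
    )


if __name__ == "__main__":
    for line in ["signal[i]", "signal[bg]", "output [10:0]", "input [i:1]"]:
        print(mask_bracket_letters(line))
```

Because the replacement is computed per match, several bracket groups of different lengths on one line are each masked with the right number of underscores.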

How can I use sed to convert $$ blah $$ in TeX to \begin{equation} blah \end{equation}

I have files with entries of the form:
$$
y = x^2
$$
I'm looking for a way (specifically using sed) to convert them to:
\begin{equation}
y = x^2
\end{equation}
The solution should not rely on the form of the equation (which may also span multiple lines) nor on the text preceding the opening $$ or following the closing $$.
Thanks for the help.
sed '
/^\$\$$/ {
x
s/begin/&/
t use_end_tag
s/^.*$/\\begin{equation}/
h
b
: use_end_tag
s/^.*$/\\end{equation}/
h
}
'
Explanation:
sed maintains two buffers: the pattern space (pspace) and the hold space (hspace). It operates in cycles, where during each cycle it reads a line and executes the script for that line. pspace is usually auto-printed at the end of each cycle (unless the -n option is used), and then deleted before the next cycle. hspace holds its contents between cycles.
The idea of the script is that whenever $$ is seen, hspace is first checked to see if it contains the word "begin". If it does, then substitute the end tag; otherwise substitute the begin tag. In either case, store the substituted tag in the hold space so it can be checked next time.
sed '
/^\$\$$/ { # if line contains only $$
x # exchange pspace and hspace
s/begin/&/ # see if "begin" was in hspace
t use_end_tag # if it was, goto use_end_tag
s/^.*$/\\begin{equation}/ # replace pspace with \begin{equation}
h # set hspace to contents of pspace
b # start next cycle after auto-printing
: use_end_tag
s/^.*$/\\end{equation}/ # replace pspace with \end{equation}
h # set hspace to contents of pspace
}
'
This might work for you (GNU sed):
sed -r '1{x;s/^/\\begin{equation}\n\\end{equation}/;x};/\$\$/{g;P;s/(.*)\n(.*)/\2\n\1/;h;d}' file
Prime the hold space with the required strings. On encountering the marker print the first line and then swap the strings in anticipation of the next marker.
I can not help you with sed, but this awk should do:
awk '/\$\$/ && !f {$0="\\begin{equation}";f=1} /\$\$/ && f {$0="\\end{equation}";f=0}1' file
\begin{equation}
y = x^2
\end{equation}
The f=0 is not needed if the file contains only one equation.
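All of the answers above rely on the same idea: remember whether the previous $$ opened an equation, and alternate between the two tags. As an illustration of that toggle outside sed and awk, here is a minimal Python sketch. The function name convert_display_math is made up, and it assumes, as in the question, that each $$ sits on a line of its own.

```python
def convert_display_math(lines):
    """Replace alternating $$ lines with \\begin{equation} and \\end{equation}."""
    inside = False  # True while we are between an opening and a closing $$
    out = []
    for line in lines:
        if line.strip() == "$$":
            out.append("\\end{equation}" if inside else "\\begin{equation}")
            inside = not inside
        else:
            out.append(line)
    return out


if __name__ == "__main__":
    print("\n".join(convert_display_math(["$$", "y = x^2", "$$"])))
```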

Flip array index with sed

I have some java code declaring a 2d array that I want to flip.
Content is like:
zData[0][0] = 198;
zData[0][1] = 198;
zData[0][2] = 198;
...
And I want to flip indices to have
zData[0][0] = 198;
zData[1][0] = 198;
zData[2][0] = 198;
So I tried doing it with sed:
sed -r 's#zData[([0-9]*)][([0-9]*)]#zData[\2][\1]#g' DataSample1.java
But unfortunately sed says:
sed: -e expression #1, char 43: Unmatched ) or \)
Might the string "zData" hold kind of flag or option?
I tried not using the -r option but I have the same kind of message for:
sed 's#zData[\(\[\0\-\9\]\*\)][\(\[\0\-\9\]\*\)]#zData[\2][\1]#g' DataSample1.java
Thanks for your help
Simples:
$ sed -r 's/(zData)(\[[^]]+])(\[[^]]+])/\1\3\2/' file
zData[0][0] = 198;
zData[1][0] = 198;
zData[2][0] = 198;
Regexplanation:
# Match
(zData) # Capture the variable name we want to transpose
( # Start capture group for first index
\[ # Opening bracket escaped to mean literal [
[^]]+ # One or more non-] characters, i.e. the digits
] # The closing literal ] doesn't need escaping here.
) # Close the capture
(\[[^]]+]) # Same regexp as before for the second index
# Replace
\1\3\2 # Switch the indexes by rearranging the 2nd and 3rd capture groups
Note: Switch \[[^]]+] to \[[0-9]+] if that is clearer for you: instead of saying "match an opening square bracket followed by one or more non-closing-bracket characters followed by a closing bracket", you are saying "match an opening square bracket followed by one or more digits followed by a closing bracket".
Try that one:
sed 's#\([a-zA-Z0-9_-]\+\)\(\[[^]]*\]\)\(\[[^]]*\]\)\(.*$\)#\1\3\2\4#'
It adds four captures for the variable name, the first index, the second index and the rest and then switches order.
Edit: @Sudo_O's solution with extended regular expressions is much more readable. Thanks for that! Nevertheless, on some systems sed -r may not be available, since it is not part of POSIX.
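The same capture-and-swap can be checked quickly in Python. This is a sketch using the digit-only variant of the pattern mentioned in the note above; the function name flip_indices is hypothetical.

```python
import re

# Two bracketed numeric indices after zData, each captured without its brackets.
INDEX_PAIR = re.compile(r"zData\[([0-9]+)\]\[([0-9]+)\]")


def flip_indices(line):
    """Swap the two indices of every zData[i][j] occurrence on the line."""
    return INDEX_PAIR.sub(r"zData[\2][\1]", line)


if __name__ == "__main__":
    for line in ["zData[0][0] = 198;", "zData[0][1] = 198;", "zData[0][2] = 198;"]:
        print(flip_indices(line))
```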