sed seems to match pattern properly only when newline inserted - sed

I am currently running the following sed command:
sed 's/P(\(.*\))\\mid(\(.*\))/\\condprob{\1}{\2}/g' myfile.tex
Essentially, I have inherited an oddly formatted tex file, and want to replace everything like this:
P(<foo>)\mid(<bar>)
With this
\condprob{<foo>}{<bar>}
The file I am trying to run sed on contains the following line:
P(\vec{m}_i)\mid(t,h,\alpha) = \prod_{u\in\mathcal{U}} P(\vec{m}_{iu})\mid(t,h,\alpha)
Which I would like to change to this:
\condprob{\vec{m}_i}{t,h,\alpha} = \prod_{u\in\mathcal{U}} \condprob{\vec{m}_{iu}}{t,h,\alpha}
However, sed keeps missing the first \mid and instead gives me this:
\condprob{\vec{m}_i)\mid(t,h,\alpha) = \prod_{u\in\mathcal{U}} P(\vec{m}_{iu}}{t,h,\alpha}
If I add a line break at the = sign, it matches everything fine.
Can someone please a) help me resolve this, and b) perhaps tell me why it is happening?
Thanks.
Edit: thanks choroba and Sloopjon, you've both answered my why, and Sloopjon's solution is actually exactly what I was needing. choroba: I guess I will have to wait another day to learn perl.
For those who are interested, Sloopjon's solution, when translated into my problem, looks like this (match everything that isn't a closing parenthesis):
sed 's/P(\([^)]*\))\\mid(\([^)]*\))/\\condprob{\1}{\2}/g' myfile.tex

It looks like you expect P(\(.*\)) to match only P(\vec{m}_i), but the * quantifier is greedy, so it actually matches P(\vec{m}_i)\mid...P(\vec{m}_{iu}). There are two common fixes for this: use a non-greedy quantifier if your tool supports it, or change the pattern so that it only matches what you expect. For example, if you know that parentheses won't nest in this P() construct, change .* to [^)]*.
Edit: I also suggest that you look for a regex visualizer or debugger when you have a problem like this. For example, pasting your example into debuggex.com makes it clear what's happening.
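The difference is easy to reproduce from the shell. The sketch below runs both patterns against the line from the question so the greedy and the negated-class results can be compared side by side. The variable names are only for illustration.

```shell
# The problem line from the question; printf '%s\n' keeps the backslashes intact.
line='P(\vec{m}_i)\mid(t,h,\alpha) = \prod_{u\in\mathcal{U}} P(\vec{m}_{iu})\mid(t,h,\alpha)'

# Greedy: .* runs on to the last ")\mid(" on the line, so only one match is found.
greedy=$(printf '%s\n' "$line" | sed 's/P(\(.*\))\\mid(\(.*\))/\\condprob{\1}{\2}/g')

# Negated class: [^)]* cannot cross a closing parenthesis,
# so each P(...)\mid(...) pair is matched on its own.
fixed=$(printf '%s\n' "$line" | sed 's/P(\([^)]*\))\\mid(\([^)]*\))/\\condprob{\1}{\2}/g')

printf '%s\n' "$greedy" "$fixed"
```

The first line printed is the broken result from the question; the second has both occurrences converted. This only holds as long as the arguments themselves never contain a closing parenthesis.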

The problem is the greediness of the * quantifier. It matches as many times as it can, i.e. it doesn't stop at the first ).
You can try Perl, which features a "non-greedy" (frugal, lazy) quantifier, *?:
perl -pe 's/P\((.*?)\)\\mid\((.*?)\)/\\condprob{$1}{$2}/g'

Related

Substitution/Replacement problem with a data file Perl, Sublime

My file is in the format \D{5}\d\d or \D{5}\d, i.e. five non-digit characters followed by one or two digits,
example:
aahed9aalii5aargh9abaca9abaci9aback13
The \d part may be 1 or 2 digits. There are no spaces or line breaks in the entire document.
The goal is to create a .csv file dividing the \D{5} from the \d{1} or \d{2}.
I tried Sublime Text, Perl, TextEdit, and Pages.
In Sublime I understand how to find the (\D{5}) group, but not how to replace it with the same group followed by a comma.
I found the s(dog/cat) substitution example but could not get that to translate to Perl or Sublime.
Found the perl command line idea
(perl -pi.bak -e 's\/D{5}/D{5}\,/g' $filename) may not be exact
But could not decipher all the errors
I chose regex because the only commonality between the values is that every word has the same length throughout the document. There are no tabs, no parens, no spaces, and no fixed-length fields: nothing to get my hooks into.
The question:
How do I retain the original values in the replace/substitution function?
I realize what this board has to deal with in regard to duplicate questions. Do you realize, on my side, how difficult it is to search through all the previous questions when I am not sure what I am looking for?
I am not looking for someone to give me a fish; I am looking for someone to teach me how to fish.
If regex is not the answer, maybe I am missing something. Any guidance would be appreciated.
Thanks
The $1, $2, etc variables may be used to refer back to "captures" (parenthesized parts) within the most recent regexp.
echo aahed9aalii5aargh9abaca9abaci9aback13 | perl -pe 's/(\D{5})(\d*)/$1=$2,/g'
Outputs:
aahed=9,aalii=5,aargh=9,abaca=9,abaci=9,aback=13,
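The same capture idea works in sed, where the groups are written \( \) and referred to as \1 and \2. Here is a minimal sketch, assuming the goal is one word,number pair per CSV line. The variable names are hypothetical, and the replacement ends with a backslash followed by a real line break, which is how POSIX sed spells a newline there.

```shell
data='aahed9aalii5aargh9abaca9abaci9aback13'

# \1 is the five non-digits, \2 the one or two digits that follow.
# The backslash at the end of the first script line inserts a newline after each pair.
csv=$(printf '%s\n' "$data" | sed 's/\([^0-9]\{5\}\)\([0-9]\{1,2\}\)/\1,\2\
/g')

printf '%s\n' "$csv"
```

Run directly on a file, this leaves one empty line at the end, because the newline that was already there follows the last inserted one. The command substitution above strips it.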

Less ugly way to use sed to simply include a new line?

There are a lot of guides, handbooks, quick guides, and questions/answers about this, but none are simple and objective...
It is a classic problem: nearly all text editors crash on big XML or HTML files that are "all in one line", so we need to decide which tag will receive the \n and replace all occurrences of <tag by \n<tag. So simple. Why is it not simple to do in the terminal?
The best question/answer about this case does not solve it: Bash: How can I replace a string by new line in osx bash? An example using that solution, sed 's/<article/\'$'\n\n<article/g' file.htm, does not work; it needs some more exotic syntax, so it is not as simple as I am asking for in this question.
So this question is not about "any solution", but about "some simple/elegant solution".
If I understand what you are looking for you could try something like the following:
sed 's/<tag>/\n<tag>/g' file.htm
which is very close to the answer you linked.
It already looks quite simple to me: it replaces the tag with a newline character and writes the tag again.
However, I don't get the need for this '$' in your case.
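For what it's worth, the $'\n' in the linked answer is bash's ANSI-C quoting: it pastes a literal newline into the script, because \n in the replacement is a GNU sed extension that BSD sed (as shipped with macOS) has traditionally not interpreted. Below is a sketch of a form that should work with both, using a backslash followed by a real line break. The sample HTML and variable names are made up for illustration.

```shell
html='<p>intro</p><article>one</article><article>two</article>'

# A backslash followed by an actual line break is the POSIX way
# to put a newline into the replacement text.
split=$(printf '%s\n' "$html" | sed 's/<article/\
<article/g')

printf '%s\n' "$split"
```

Each <article now starts on its own line, and the rest of the text is unchanged.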

How to understand this perl multi-line writing command

I am trying to understand the perl commands below:
$my = << EOU;
This is an example.
Example too.
EOU
What is this construct called? Could somebody explain more about this "multi-line writing" command?
Essentially the syntax is allowing you to put anything unique as a marker so that it won't conflict with your contents. You can do this:
$my = <<ABCDEFG;
This is an example.
Example too.
BLAH
ABCDEFG
Everything from "This.." through "BLAH" will be assigned to the variable. Note that you shouldn't have a space after the << symbols, otherwise you will get a syntax error. It helps avoid adding CR characters or appending (.) everywhere, and is useful when passing data into another application (e.g. an ftp session). "Here document" (heredoc) is the correct term for this.
Everything between <<EOU and EOU is a multi-line string. With a bare terminator like this it behaves like a double-quoted string, so variables and backslash escapes are still interpolated; write <<'EOU' if you want the text taken literally, exactly as you typed it. It's nothing fancy: think of the markers as start and end quote marks.
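Perl borrowed this syntax from the Unix shell, and the quoting rule is the same in both: a bare terminator interpolates variables like a double-quoted string, while a single-quoted terminator takes the text literally. Here is a small sketch of that rule, written in shell rather than Perl to match the other command-line examples. The variable names are hypothetical.

```shell
name='example'

# Bare terminator: $name is expanded, as with <<EOU or <<"EOU" in Perl.
interpolated=$(cat <<EOU
This is an $name.
Example too.
EOU
)

# Quoted terminator: the text is taken literally, as with <<'EOU' in Perl.
literal=$(cat <<'EOU'
This is an $name.
Example too.
EOU
)

printf '%s\n' "$interpolated" "$literal"
```

The first pair of lines printed has the variable filled in; the second keeps the characters $name as typed.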

using sed with ? (question mark) special character

I have an infected website, and I am trying to clean it out using sed. Unfortunately I am unable to escape the question mark in the URL and I am really stuck here. I've searched the web for a possible solution, but unfortunately I didn't find a proper way to do so.
Just an explanation:
The injected code is similar to this one:
< iframe src=http://test.com/index.html?i=23123>< /iframe>
Note that I am not a pro, and that is why I need your help!
So my way to clear the code is:
sed -i '/< iframe src=http:\/\/test.com\/index.html\?i=23123>/,/< \/iframe>/d' index.html
Unfortunately that didn't help, and neither did any of my other attempts.
All help will be gratefully appreciated.
echo "< iframe src=http://test.com/index.html?i=23123>< /iframe>" \
| sed 's#< iframe src=http://test.com/index.html?i=23123>< /iframe>##'
Produces no output, which to me means this is successfully deleting your problem string.
Note that most seds will accept an alternate regex-replacement character; here I am using # because there are no #s in the search target. On some seds, you have to tell it "hey, I'm using an alternate" and escape the char, like s\#.....##.
I don't see why your attempt to quote the ? is failing. Did you try [?] and (worst case) [\?]? Are there 2nd level evaluations happening by the shell that you're not mentioning here? Does my simple example also fail?
As others will certainly tell you, your approach is strictly a bandaid, you need to figure out what the security hole is in your system and fix it. Your pages will get corrupted again. :-(
IHTH
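One likely explanation, assuming GNU sed: in basic regular expressions a plain ? is already a literal character, while \? is a GNU extension meaning "zero or one of the preceding item", so html\?i stops matching the literal text html?i. Below is a sketch with the ? left unescaped, and with it in a bracket expression to be explicit. The surrounding before/after text and the variable names are made up for illustration.

```shell
line='before< iframe src=http://test.com/index.html?i=23123>< /iframe>after'

# Unescaped ? is a literal character in a basic regular expression.
plain=$(printf '%s\n' "$line" | sed 's#< iframe src=http://test\.com/index\.html?i=23123>< /iframe>##')

# [?] is literal in basic and extended syntax alike, which sidesteps the question.
bracket=$(printf '%s\n' "$line" | sed 's#< iframe src=http://test\.com/index\.html[?]i=23123>< /iframe>##')

printf '%s\n' "$plain" "$bracket"
```

Both commands remove only the injected tag and keep the rest of the line, which is probably safer than deleting whole lines with a /start/,/end/d range.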

How to search and replace empty line after specific token?

I want to delete empty lines after lines containing a keyword, e.g. access modifiers like private:, public: or protected:, to fulfill our coding standard. I need a command line tool (Linux) for this, so please no Notepad++, Emacs, VS, or Vim solutions if they require user interaction. In other words, I want to do something like:
sed -i 's/private:\s*\n\s*\n/private:\n/g'
I've seen this question but was unable to extend it to my needs.
If I understand correctly, you want to remove empty lines which follow a line containing private:, public:, or protected:.
sed ':loop;/private:\|public:\|protected:/{n;/^$/d;Tloop}' inputfile
Explanation:
:loop create a label
/private:\|public:\|protected:/ will search for lines containing the pattern.
n;/^$/d will load the next line (n), check whether it is an empty line (/^$/), and if it is, delete the line (d).
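Tloop branches to label loop if no substitution has succeeded since the last input line was read. The script contains no s command, so the branch is always taken when it is reached, and it is reached only when the line was not empty (otherwise d has already ended the cycle). The newly read line is then checked against the keyword pattern itself.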
I am no sed guru, there might be more elegant ways to do this. There might also be more elegant ways to do this in awk, perl, python, whatever.
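To see the loop in action, here is a sketch on a made-up class body. It assumes a sed that accepts -E for extended regular expressions, splits the script into -e pieces, and uses an unconditional b loop in place of Tloop (the script contains no s command, so the two behave the same here). The sample input and variable names are hypothetical.

```shell
input='class A {
private:

    int x;
public:
    int y;
protected:

    int z;
};'

# :loop        label to jump back to
# /(...):/{    act only on lines containing one of the keywords
# n            print the keyword line and read the next line
# /^$/d        if that next line is empty, drop it and start a new cycle
# b loop       otherwise re-check the new line for a keyword
result=$(printf '%s\n' "$input" | sed -E \
    -e ':loop' \
    -e '/(private|public|protected):/{' \
    -e 'n' \
    -e '/^$/d' \
    -e 'b loop' \
    -e '}')

printf '%s\n' "$result"
```

Only the single empty line directly after each keyword is removed. A second consecutive empty line would be kept, since by then the keyword line is no longer in the pattern space.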
perl -0777 -pi -e 's/([ \t\r]*)(private|protected|public):[ \t\r]*\n[ \t\r]*\n/$1$2:\n/g' file
should do the trick, while also taking trailing and leading whitespace into account, which I didn't specify in the question as it is not a hard requirement.