Invalid reference \1 using sed when trying to print matching expression

Before I start, I already looked at this question, but it seems the solution was that they were not escaping the parentheses in their regex. I'm getting the same error, but I'm not grouping a regex. What I want to do is find all names/usernames in a lastlog file and return the UNs ONLY.
What I have:
s/^[a-z]+ |^[a-z]+[0-9]+/\1/p
I've seen many solutions that show how to do it in awk, which is great for future reference, but I want to do it using sed.
Edit for example input:
dzhu pts/15 n0000d174.cs.uts Wed Feb 17 08:31:22 -0600 2016
krobbins **Never logged in**
js24 **Never logged in**

You cannot use backreferences (such as \1) if you do not have any capture groups in the first part of your substitution command.
Assuming you want the first word in the line, here's a command you can run (it relies on GNU sed, since \s, \w, \+ and \? are GNU extensions):
sed -n 's/^\s*\(\w\+\)\s\?.*/\1/p'
Explanation:
-n suppresses the default behavior of sed to print each line it processes
^\s* matches the start of the line followed by any amount of whitespace
\(\w\+\) captures one or more word characters (letters, digits, and underscores)
\s\?.* matches zero or one whitespace character, followed by the rest of the line. Consuming the rest of the line means the substitution replaces the whole line, so only the captured word is left
\1 replaces the matched line with the captured group
The p flag prints lines that matched the expression. Combined with -n, this means only matches get printed out.
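As a quick check, the same idea can be tried on the three sample lines from the question. This is a minimal sketch, assuming usernames are lowercase letters optionally followed by digits (the shape the question's own pattern was aiming for). It sticks to POSIX basic regular expressions so it does not depend on GNU extensions, and the variable names are only illustrative.

```shell
# Sample lines copied from the question's lastlog excerpt.
input='dzhu pts/15 n0000d174.cs.uts Wed Feb 17 08:31:22 -0600 2016
krobbins **Never logged in**
js24 **Never logged in**'

# \( ... \) creates the capture group that \1 refers to; without it,
# sed reports "invalid reference \1".
users=$(printf '%s\n' "$input" | sed -n 's/^\([a-z][a-z]*[0-9]*\) .*/\1/p')

printf '%s\n' "$users"
```

The only difference from the failing command in the question is that the part to keep is wrapped in \( and \), and the rest of the line is matched by .* so that it gets replaced.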
I hope this helps!

Related

Extracting substring from inside bracketed string, where the substring may have spaces

I've got an application that has no useful api implemented, and the only way to get certain information is to parse string output. This is proving to be very painful...
I'm trying to achieve this in bash on SLES12.
Given I have the following strings:
QMNAME(QMTKGW01) STATUS(Running)
QMNAME(QMTKGW01) STATUS(Ended normally)
I want to extract the STATUS value, ie "Ended normally" or "Running".
Note that the line structure can move around, so I can't count on the "STATUS" being the second field.
The closest I have managed to get so far is to extract a single word from inside STATUS like so
echo "QMNAME(QMTKGW01) STATUS(Running)" | sed "s/^.*STATUS(\(\S*\)).*/\1/"
This works for "Running" but not for "Ended normally"
I've tried switching the \S* for [\S\s]* in both "grep -o" and "sed" but it seems to corrupt the entire regex.
This is purely a regex issue. With \S you asked for non-whitespace characters inside (..), but the failing case contains a space, so it does not match the pattern. Keep it simple by explicitly listing the characters allowed inside (..) as [a-zA-Z ]*, i.e. zero or more upper- and lower-case letters and spaces.
sed 's/^.*STATUS(\([a-zA-Z ]*\)).*/\1/'
Or use the character class [:alnum:] if you want digits too
sed 's/^.*STATUS(\([[:alnum:] ]*\)).*/\1/'
sed 's/.*STATUS(\([^)]*\)).*/\1/' file
Output:
Running
Ended normally
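To illustrate the point from the question that STATUS may not be the second field, here is a small sketch using the same [^)]* pattern. The second input line has its fields swapped and the third line has no STATUS at all; both are made up for this example. Adding -n and the p flag is an assumption about what is wanted: it drops lines without a STATUS instead of passing them through unchanged.

```shell
# First line is from the question; the other two are hypothetical:
# one with the fields in a different order, one with no STATUS field.
input='QMNAME(QMTKGW01) STATUS(Running)
STATUS(Ended normally) QMNAME(QMTKGW01)
QMNAME(QMTKGW02) TYPE(Local)'

# [^)]* matches everything up to the next closing parenthesis, spaces included.
# -n plus the p flag prints only lines where the substitution succeeded.
statuses=$(printf '%s\n' "$input" | sed -n 's/.*STATUS(\([^)]*\)).*/\1/p')

printf '%s\n' "$statuses"
```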
Extracting a substring matching a given pattern is a job for grep, not sed. We should use sed when we must edit the input string. (A lot of people use sed and even awk just to extract substrings, but that's wasteful in my opinion.)
So, here is a grep solution. We need to make some assumptions (in any solution) about your input - some are easy to relax, others are not. In your example the word STATUS is always capitalized, and it is immediately followed by the opening parenthesis (no space, no colon etc.). These assumptions can be relaxed easily. More importantly, and not easy to work around: there are no nested parentheses. You will want the longest substring of non-closing-parenthesis characters following the opening parenthesis, no matter what they are.
With these assumptions:
$ grep -oP '\bSTATUS\(\K[^)]*(?=\))' << EOF
> QMNAME(QMTKGW01) STATUS(Running)
> QMNAME(QMTKGW01) STATUS(Ended normally)
> EOF
Running
Ended normally
Explanation:
Command options: o to return only the matched substring; P to use Perl extensions (the \K marker and the lookahead). The regexp: we look for a word boundary (\b) - so the word STATUS is a complete word, not part of a longer word like SUBSTATUS; then the word STATUS and opening parenthesis. This is required for a match, but \K instructs that this part of the matched string will not be returned in the output. Then we seek zero or more non-closing-parenthesis characters ([^)]*) and we require that this be followed by a closing parenthesis - but the closing parenthesis is also not included in the returned string. That's a "lookahead" (the (?= ... ) construct).

gnu sed remove portion of line after pattern match with special characters

The goal is to use sed to return only the url from each line of FF extension Mining Blocker which uses this format for its regex lines:
{"baseurl":"*://002.0x1f4b0.com/*", "suburl":"*://*/002.0x1f4b0.com/*"},
{"baseurl":"*://003.0x1f4b0.com/*", "suburl":"*://*/003.0x1f4b0.com/*"},
the result should be:
002.0x1f4b0.com
003.0x1f4b0.com
One way would be to keep everything after suburl":"*://*/ then remove each occurrence of /*"},
I found https://unix.stackexchange.com/questions/24140/return-only-the-portion-of-a-line-after-a-matching-pattern but the special characters are a problem.
this won't work:
sed -n -e s#^.*suburl":"*://*/##g hosts
Would someone please show me how to mark the 2 asterisks in the string so they are seen by regex as literal characters, not wildcards?
edit:
sed -n 's#.*://\*/\([^/]\+\)/.*#\1#p' hosts
doesn't work, unfortunately.
regarding character substitution, thanks for directing me to the references.
I reduced the searched-for string to //*/ and used ASCII character codes like this:
sed -n -e s#^.*\d047\d047\d042\d047##g hosts
Unfortunately, that didn't output any changes to the lines.
My assumptions are:
^.*something specifies everything up to and including the last occurrence of "something" in a line
sed -n -e s#search##g deletes (replace with nothing) "search" within a line
So, this line:
sed -n -e s#^.*\d047\d047\d042\d047##g hosts
Should output everything after //*/ in each line...except it doesn't.
What is incorrect with that line?
Regarding deleting everything including and after the first / AFTER that first operation, yes, that's wanted too.
This might work for you (GNU sed):
sed -n 's#.*://\*/\([^/]\+\)/.*#\1#p' file
Match greedily (the longest string that matches) all characters up to ://*/, followed by a group of characters (which will be referred to as \1) that do not match a /, followed by a / and the rest of the line, and replace all of it with the group \1.
N.B. the sed substitution delimiters are arbitrary, in this case chosen to be # so as to make matching / easier. Also, the character * on the left-hand side of the substitution command may be interpreted as a metacharacter meaning zero or more of the previous character/group, so it is quoted as \* to prevent that. Finally, the option -n turns off the usual printing of the pattern space after all the sed commands have been executed. The p flag on the substitution command prints the pattern space following a successful substitution, so only URLs (or nothing) will appear in the output.
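A small sketch of both approaches on the two sample lines from the question may help. The first command is the single substitution described above, written with \{1,\} instead of the GNU-only \+. The second follows the two-step plan from the question (strip everything through suburl":"*://*/, then strip from the first remaining / onward). Variable names are illustrative, and the input is fed from a string rather than the hosts file.

```shell
# The two sample lines from the question.
input='{"baseurl":"*://002.0x1f4b0.com/*", "suburl":"*://*/002.0x1f4b0.com/*"},
{"baseurl":"*://003.0x1f4b0.com/*", "suburl":"*://*/003.0x1f4b0.com/*"},'

# One substitution: keep only the host that follows the literal ://*/
# The single quotes stop the shell from touching * and \.
hosts=$(printf '%s\n' "$input" | sed -n 's#.*://\*/\([^/]\{1,\}\)/.*#\1#p')

# Two steps: delete through suburl":"*://*/ and then delete from the first / on.
# No -n here, so the edited lines are printed by default.
hosts_two_step=$(printf '%s\n' "$input" | sed 's#^.*suburl":"\*://\*/##; s#/.*##')

printf '%s\n' "$hosts"
```

Note that -n without a p flag prints nothing at all, and an unquoted sed script lets the shell strip backslashes before sed sees them; both apply to the attempts shown in the question.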

Understanding working of sed in bash

#!/bin/bash
echo "the first application of sed"
sed -e 's/^\([0-9]\{3\}\)/(\1)/' s.txt
echo "the second application of sed"
sed -e 's/^\([0-9]\{3\}\)/(\1\+\1)/' s.txt
echo "see the original file"
cat s.txt
the first application of sed
(905)-123-3456
(905)-124-3456
(905)-125-3456
(905)-126-3456
(905)-127-3456
the second application of sed
(905+905)-123-3456
(905+905)-124-3456
(905+905)-125-3456
(905+905)-126-3456
(905+905)-127-3456
see the original file
905-123-3456
905-124-3456
905-125-3456
905-126-3456
905-127-3456
I'm just starting out in shell programming and for the last 2 hours I'm stuck with this code. I know the basic usage of sed but I cannot figure out what the line
sed -e 's/^\([0-9]\{3\}\)/(\1)/' s.txt
does. I know -e is expression, s is substitute. ^ indicates beginning of line but the part after that is confusing. Any ideas?
Ultimately, it is a manual-bashing exercise.
\( marks the start of a capture, up to the balanced \) — they can be nested, though these ones don't.
\{ marks the start of a repeat specification up to the following \} — they cannot be nested. In this case, you have \{3\} so this repeats the previous item, [0-9], three times.
The \1 in the replacement refers to the material captured by the first \( in the search pattern.
Hence:
s/^\([0-9]\{3\}\)/(\1)/
wraps the three digits at the start of the line in parentheses — as shown in your output. Because it is anchored, it happens just once. If a line doesn't start with three digits, nothing happens to that line as a result of this command.
The second example is only marginally different. It takes the sequence of three digits at the start of the line and replaces it with that sequence, a + mark, and the sequence again, all wrapped in parentheses — as shown in your output.
There are relatively few metacharacters in the replacement part of a s/// command; there are a lot of metacharacters in the search part. Further, there are different dialects in the search part — some variants of sed support 'extended regular expressions' instead of 'basic regular expressions' (which is what your example uses); others support Perl-like expressions (not quite the full PCRE — Perl Compatible Regular Expressions — as far as I know, but some notations from PCRE). For that, you need to read the manual for the sed you're using.
Let's break this down:
sed -e 's/^\([0-9]\{3\}\)/(\1)/' s.txt
The nomenclature of sed's substitute is like this:
s/search/replace/options
In your case, the search part is ^\([0-9]\{3\}\). In sed's default basic regular expressions, parentheses and curly braces only get their special meaning (grouping and repetition) when preceded by a \. If we drop the backslashes for understanding purposes, this is how it will look:
^([0-9]{3})
It means the line should start with a digit between 0 and 9, repeated 3 times. So basically, it's a 3-digit number at the start of the line (e.g. 123, 543 etc.).
The parentheses () group the 3-digit number, which can be referred to as the first group.
The replace part is (\1). Here the parentheses are literal characters, and \1 inserts the group captured in the search part, so the 3 digits are written back wrapped in parentheses.
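The escaped and unescaped spellings can be compared side by side. This is a sketch that assumes a sed accepting the -E flag for extended regular expressions (GNU and BSD sed both do); the second input line is made up to show that a line not starting with three digits is left alone.

```shell
# First line is from s.txt in the question; the second is a hypothetical
# line that does not begin with three digits.
input='905-123-3456
12-345'

# Basic regular expressions: \( \) group and \{ \} repeat.
bre=$(printf '%s\n' "$input" | sed -e 's/^\([0-9]\{3\}\)/(\1)/')

# Extended regular expressions: the same pattern without the backslashes.
ere=$(printf '%s\n' "$input" | sed -E 's/^([0-9]{3})/(\1)/')

printf '%s\n' "$bre"
```

In both forms the parentheses in the replacement are plain text; only \1 is special there.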

Perl one-liner: deleting a line with pattern matching

I am trying to delete bunch of lines in a file if they match with a particular pattern which is variable.
I am trying to delete a line which matches with abc12, abc13, etc.
I tried writing a C-shell script, and this is the code:
#!/bin/csh
foreach $x (12 13 14 15 16 17)
perl -ni -e 'print unless /abc$x/' filename
end
This doesn't work, but when I use the one-liner without a variable (abc12), it works.
I am not sure if there is something wrong with the pattern matching or if there is something else I am missing.
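Yes, the problem is the single quotes. They stop the shell from expanding $x, so Perl receives the literal text $x and treats it as one of its own (unset) variables rather than as 12, 13 and so on.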
Of course, you're also doing it very inefficiently, because you're processing each file multiple times.
If you're looking to remove lines abc12 to abc17 you can do this all in one go:
perl -n -i.bak -e 'print unless m/abc1[234567]/' filename
Try this
perl -n -i.bak -e 'print unless m/abc1[2-7]/' filename
Using the range [2-7] instead of [234567] changes nothing about what is matched; it only saves you three keystrokes.
man 1 bash: Pattern Matching
[...] Matches any one of the enclosed characters. A pair of characters separated by a hyphen denotes a range expression; any character that sorts between those two characters, inclusive, using the current locale's collating sequence and character set, is matched. If the first character following the [ is a ! or a ^ then any character not enclosed is matched.
A - may be matched by including it as the first or last character in the set. A ] may be matched by including it as the first character in the set.

Can I use the sed command to replace multiple empty line with one empty line?

I know there is a similar question in SO How can I replace mutliple empty lines with a single empty line in bash?. But my question is can this be implemented by just using the sed command?
Thanks
Give this a try:
sed '/^$/N;/^\n$/D' inputfile
Explanation:
/^$/N - match an empty line and append the next input line to the pattern space.
; - command delimiter, allows multiple commands on one line, can be used instead of separating commands into multiple -e clauses for versions of sed that support it.
/^\n$/D - if the pattern space now consists of nothing but a single newline, in other words two consecutive empty lines, then delete the first of them (more generally, D deletes the beginning of the pattern space up to and including the first newline and restarts the cycle without reading new input)
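A short run on made-up input shows the effect. The sample text is an assumption for illustration: three empty lines between a and b, and one between b and c.

```shell
# Hypothetical input: a, three empty lines, b, one empty line, c.
squeezed=$(printf 'a\n\n\n\nb\n\nc\n' | sed '/^$/N;/^\n$/D')

# Each run of empty lines is reduced to a single empty line.
printf '%s\n' "$squeezed"
```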
You can do this by removing empty lines first and then appending an empty line with the G command:
sed '/^$/d;G' text.txt
Edit2: the above command adds an empty line after every remaining line, not just between paragraphs. If this is not desired, you could do:
sed -n '1{/^$/p};{/./,/^$/p}'
Or, if you don't mind that all leading empty lines will be stripped, it may be written as:
sed -n '/./,/^$/p'
since the first expression only examines the first line and prints it if it is blank.
Here, the -n option suppresses automatic printing of the pattern space, /./,/^$/ defines a range from a line containing at least one character to the next empty line, and p prints the lines in that range.
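The difference between the G approach and the range approach can be seen on a small made-up input with a two-line paragraph. This is only a sketch: the text is hypothetical, and because the results are captured with command substitution, any trailing empty lines are not shown.

```shell
# Hypothetical input: a two-line paragraph, two empty lines, then one more line.
doubled=$(printf 'line1\nline2\n\n\nline3\n' | sed '/^$/d;G')
ranged=$(printf 'line1\nline2\n\n\nline3\n' | sed -n '/./,/^$/p')

# '/^$/d;G' puts an empty line after every line, even inside a paragraph.
printf '%s\n' "$doubled"
echo '-----'
# The range form keeps line1 and line2 together and leaves one empty line.
printf '%s\n' "$ranged"
```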