Using sed to eliminate all lines that do not match the desired form

I have a single column csv that looks something like this:
KFIG
KUNV
K~LK
K7RT
3VGT
Some of the data points are garbled in transmission. I need to keep only the entries that begin with a capital letter, followed by three characters that can each be a capital letter OR a number. For example, in the list above I would have to delete K~LK and 3VGT.
I know that to keep only the lines containing four or more capital letters in a row I can write
sed -n '/[A-Z]\{4,\}/p'
I just want to adjust this so that the last three characters can be capital letters or numbers. Any help would be appreciated.

Just use:
sed -n '/[A-Z][A-Z0-9]\{3,\}/p'
However, if these identifiers are really all there is in the file, I would suggest the following command instead. It ensures that the whole line is matched, so it will also reject, for example, identifiers more than 4 characters long:
sed -n '/^[A-Z][A-Z0-9]\{3\}$/p'
^ means "match zero-length string at the beginning of line";
\{3\} means "match exactly 3 occurrences of the previous atom", the previous atom being [A-Z0-9];
$ means "match zero-length string at the end of line".
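To see the anchored version at work on the sample data from the question, here is a minimal sketch. The function name filter_ids is made up for illustration; the sed expression is the one given above.

```shell
# filter_ids is a hypothetical helper name, not part of sed.
# It prints only lines that are exactly one capital letter
# followed by three capital letters or digits.
filter_ids() {
  sed -n '/^[A-Z][A-Z0-9]\{3\}$/p'
}

# Sample data from the question; K~LK and 3VGT should be dropped.
printf '%s\n' KFIG KUNV 'K~LK' K7RT 3VGT | filter_ids
```

With the sample input this should print KFIG, KUNV and K7RT. A five-character line such as KFIGX is rejected as well, whereas the unanchored version would let it through.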

Related

How to replace a character using sed with different lengths in preceding string

I have a file in which I want to replace "_" with "-" wherever it is part of a gene name. Examples of the gene names and my intended output are:
aa1c1_123 -> aa1c1-123
aa1c2_456 -> aa1c2-456
aa1c10_789 -> aa1c10-789
In essence, the first four characters are fixed, followed by 1 or 2 characters depending on the chromosome, an underscore, and then the remainder of the gene ID, which can vary in length and character. Importantly, the gene information column also contains other strings with underscores (e.g. "gene_id", "transcript_id", "five_prime_utr"), so a blanket sed -i.bak 's/_/-/g' file.gtf
can't be used.
Perhaps not the most elegant way, but this should work:
sed -i.bak 's/\([0-9a-z]\{4\}[0-9][0-9]\?\)_/\1-/g' file.gtf
i.e. capture a group (referenced by \1 in the substitution) consisting of 4 lowercase letters or digits followed by one digit and optionally a second digit, all followed by an underscore; if found, replace the match with the group's content and a dash. This should exclude your other occurrences, which consist only of letters and an underscore.
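As a quick check of that reasoning, the sketch below runs a variant of the substitution on a made-up GTF-like line (the sample line and the function name fix_gene_names are assumptions for illustration). It writes [0-9]\{1,2\} instead of [0-9][0-9]\?, since \? is a GNU extension; the two should behave the same here.

```shell
# fix_gene_names is a hypothetical helper name.
# It reads from stdin instead of editing a file in place, so nothing is overwritten.
fix_gene_names() {
  # 4 lowercase letters/digits, then 1 or 2 digits, then "_" -> same prefix plus "-"
  sed 's/\([0-9a-z]\{4\}[0-9]\{1,2\}\)_/\1-/g'
}

# Made-up sample line: the keys keep their underscores, the gene names do not.
printf '%s\n' 'gene_id "aa1c10_789"; transcript_id "aa1c2_456.1";' | fix_gene_names
```

Once the output looks right on a few real lines, the same expression can be used with -i.bak on the file itself.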

Sed command to break a comma-separated string up to a certain length

Example string
TEST,TEST1,TEST3,TEST4,TEST5
Expected output :
TEST,TEST1,
TEST3,TEST4,
TEST5
I want to split the data at the last comma before the 15th position.
Try this:
sed 's/.\{,15\},/&\n/g' <<< "string" # or
sed 's/.\{,15\},/&\n/g' file
.\{,15\}, matches a part of the input consisting of 0 to 15 characters followed by a comma. Since sed is greedy when matching patterns, it will match as many characters as it can.
&\n expands to the matched part followed by a line feed.
s/REGEXP/REPLACEMENT/g replaces every match against REGEXP with REPLACEMENT.
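Putting the pieces together on the example string gives the sketch below. It assumes GNU sed, because \n in the replacement is a GNU feature; it also writes the interval as \{0,15\} rather than \{,15\}, which is the more portable spelling. The function name split_at_commas is made up.

```shell
# split_at_commas is a hypothetical helper name; GNU sed is assumed
# because the replacement relies on \n producing a newline.
split_at_commas() {
  # Greedily take up to 15 characters that are followed by a comma,
  # then start a new line after that comma.
  sed 's/.\{0,15\},/&\n/g'
}

printf '%s\n' 'TEST,TEST1,TEST3,TEST4,TEST5' | split_at_commas
```

Note that each piece can be up to 16 characters long counting the comma, since the interval only limits what comes before it.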

Using sed to replace a number located between two other numbers

I need to replace a numeric value, that occurs in a specific line of a series of config files in a pattern like this:
string number_1 number_to_replace number_2
I want to obtain something like this:
string number_1 number_replaced number_2
The difficulties I encountered are:
number_1 or number_2 can be equal to number_to_replace, so a simple replacement is not possible.
number_1 and number_2 vary between config files so I don't know them in advance.
The closest attempt I got until now is:
echo "field 4 4 4" | sed 's/\s4\s/3/'
Which outputs:
field34 4
This is close. Since I want to replace the middle number, I added another "\s", trying to use the known fact that the line starts with a character:
echo "field 4 4 4" | sed 's/\s\s4\s/3/'
Which gives:
field 4 4 4
So, nothing is replaced this time. How can I proceed? A somewhat detailed explanation would be ideal, because my knowledge of replacement expressions that involve patterns is nearly zero.
Thanks.
You can do something like below, which matches your exact sequence of digits as in the example. You could replace 3 with any digit of your choice.
sed 's/\([0-9]\{1,\}\)[[:space:]]\([0-9]\{1,\}\)[[:space:]]\([0-9]\{1,\}\)/\1 3 \3/'
Notice that I've used the POSIX bracket expression to match the whitespace character which should be supported in any variant of sed you are using. Note that \s is supported in only the GNU variants.
Read literally, the regex matches one or more digits followed by a whitespace character, then one or more digits and a whitespace character, then one or more digits again. The captured groups are numbered starting from \1. Since your intention is to replace the 2nd number, you keep \1 and \3 and put the value of your choice between them.
If the extra escapes make it hard to read, use the -E flag for extended regex support. I've used the default BRE syntax here.
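For comparison, here is the same substitution written with -E and run on the line from the question. This is only a sketch: the function name replace_middle is made up, and the replacement value 3 is hard-coded as in the command above.

```shell
# replace_middle is a hypothetical helper name.
# ERE version of the same idea: three numbers separated by single
# whitespace characters; keep the first and third, replace the second with 3.
replace_middle() {
  sed -E 's/([0-9]+)[[:space:]]([0-9]+)[[:space:]]([0-9]+)/\1 3 \3/'
}

echo "field 4 4 4" | replace_middle
```

Because the outer numbers are captured rather than spelled out, this works the same way when number_1 and number_2 differ between config files or have several digits.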

SED command to remove words at the end of the string

I want to remove the last 2 words of each string in a file.
I am first trying this command to delete just the last word, but I couldn't get it to work. Can someone help me?
sed 's/\w*$//' <file name>
my strings are like this
Input:
asbc/jahsf/jhdsflk/jsfh/ -0.001 (exam)
I want to remove both numerical value and the one in brackets.
Output:
asbc/jahsf/jhdsflk/jsfh/
Using GNU sed:
$ sed -r 's/([[:space:]]+[-+.()[:alnum:]]+){2}$//' file
asbc/jahsf/jhdsflk/jsfh/
How it works
[[:space:]]+ matches one or more spaces.
[-+.()[:alnum:]]+ matches the 'words', which may consist of one or more plus or minus signs, periods, parens, or alphanumeric characters.
Note that, when a period is inside square brackets, [.], it is just a period, not a wildcard: it does not need to be escaped.
([[:space:]]+[-+.()[:alnum:]]+) matches one or more spaces followed by a word.
([[:space:]]+[-+.()[:alnum:]]+){2}$ matches two words and the spaces which precede them.
Note the use of character classes like [:space:] and [:alnum:]. Unlike old-fashioned ranges such as [a-zA-Z0-9], these classes are Unicode-safe.
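To check the pieces above against the input from the question, here is a minimal sketch. The function name strip_last_two is made up, and -E is used in place of -r, which current GNU sed also accepts.

```shell
# strip_last_two is a hypothetical helper name.
# Removes the last two "words" (and the whitespace before each) from every line.
strip_last_two() {
  sed -E 's/([[:space:]]+[-+.()[:alnum:]]+){2}$//'
}

printf '%s\n' 'asbc/jahsf/jhdsflk/jsfh/ -0.001 (exam)' | strip_last_two
```

If the trailing words can contain characters outside the bracket expression (a comma or a slash, say), they would need to be added to it, otherwise the line is left unchanged.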
OSX (BSD) sed
The above was tested on GNU sed. For BSD sed, try:
sed -E 's/([[:space:]][[:space:]]*[-+.()[:alnum:]][-+.()[:alnum:]]*){2}$//' file
To remove everything that follows a number with decimal places
This looks for a decimal number with optional sign and removes it, the spaces which precede it, and everything which follows it:
$ sed -r 's/[[:space:]]+[-+]?[[:digit:]]+[.][[:digit:]]+[[:space:]].*//' file
asbc/jahsf/jhdsflk/jsfh/
How it works:
[[:space:]]+ matches one or more spaces
[-+]? matches zero or one signs.
[[:digit:]]+ matches one or more digits.
[.] matches a decimal point (period).
[[:digit:]]+ matches one or more digits following the decimal point.
[[:space:]] matches a space following the number.
.* matches anything which follows.
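The same breakdown, tried on a couple of lines. The second input line is made up to show that everything after the decimal number goes, however many words follow; strip_from_decimal is a hypothetical name.

```shell
# strip_from_decimal is a hypothetical helper name.
# Deletes a signed decimal number, the whitespace before it,
# and everything after it on the line.
strip_from_decimal() {
  sed -E 's/[[:space:]]+[-+]?[[:digit:]]+[.][[:digit:]]+[[:space:]].*//'
}

printf '%s\n' \
  'asbc/jahsf/jhdsflk/jsfh/ -0.001 (exam)' \
  'some/other/path/ +12.5 (a) (b) (c)' | strip_from_decimal
```

One thing to keep in mind: the pattern requires a whitespace character after the number, so a line that ends right after the decimal number is left untouched.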
It looks like there is a tab between what you want to keep and what you want to get rid of. I don't have Linux in front of me, but try this:
sed 's/\t.*//'
This assumes your strings are always formatted similarly, which is what I take from your comment.
This might work for you (GNU sed):
sed -r 's/\s+\S+\s+\S+\s*$//' file
or if you prefer:
sed -r 's/(\s+\S+){2}\s*$//' file
This matches and removes: one or more whitespaces followed by one or more non-whitespaces twice followed by zero or more whitespaces at the end of the line.
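A small sketch of the second form, using an input with trailing blanks to exercise the \s*$ part. GNU sed is assumed, since \s and \S are GNU extensions; the function name drop_last_two_words is made up.

```shell
# drop_last_two_words is a hypothetical helper name; GNU sed assumed for \s and \S.
drop_last_two_words() {
  # (whitespace + non-whitespace) twice, then optional trailing whitespace, at end of line
  sed -E 's/(\s+\S+){2}\s*$//'
}

# The input deliberately ends in three spaces.
printf '%s\n' 'asbc/jahsf/jhdsflk/jsfh/ -0.001 (exam)   ' | drop_last_two_words
```

Since \S accepts any non-whitespace character, this form does not care what the last two words are made of.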

Sed expression that converts some matches to uppercase

This sed expression converts an input string into a two-line output string. Each of the two output lines is composed of substrings from the input. The first line needs to be converted to upper case:
s:random_stuff_\(choice1\|choice2\){\([^}]*\)}:\U\1\n\2:
The aim is to convert
random_stuff_choice1{This is a sentence that MAY contain AnyThing}
random_stuff_choice2{This is another similar sentence}
into
CHOICE1
This is a sentence that MAY contain AnyThing
CHOICE2
This is another similar sentence
The problem I have is that \U applies to everything following it, so the second line is also uppercased. Is it possible to make \U apply to the first match only?
With sed:
$ sed 's/.*\(choice[0-9]\+\){\([^}]*\)}/\U\1\n\E\2/' file
CHOICE1
This is a sentence that MAY contain AnyThing
CHOICE2
This is another similar sentence
With awk:
$ awk -F'{|}' 'gsub(/.*_/,""){print toupper($1)"\n"$2}' file
CHOICE1
This is a sentence that MAY contain AnyThing
CHOICE2
This is another similar sentence
Use \E to cancel the \U:
s:random_stuff_\(choice1\|choice2\){\([^}]*\)}:\U\1\E\n\2:
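Run against the two sample lines, the corrected expression can be tried like this. It assumes GNU sed, since \U, \E, \| and \n in the replacement are GNU extensions; the function name split_and_upcase is made up.

```shell
# split_and_upcase is a hypothetical helper name; GNU sed assumed.
# \U starts uppercasing, \E stops it, so only group 1 is affected.
split_and_upcase() {
  sed 's:random_stuff_\(choice1\|choice2\){\([^}]*\)}:\U\1\E\n\2:'
}

printf '%s\n' \
  'random_stuff_choice1{This is a sentence that MAY contain AnyThing}' \
  'random_stuff_choice2{This is another similar sentence}' | split_and_upcase
```

The mixed-case words in the first sentence come through unchanged, which is an easy way to confirm that \E took effect.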