How to replace a character using sed with different lengths in preceding string - sed

I have a file in which I want to replace the "_" string with "-" in cases where it makes up a part of my gene name. Examples of the gene names and my intended output are:
aa1c1_123 -> aa1c1-123
aa1c2_456 -> aa1c1-456
aa1c10_789 -> aa1c1-789
In essence, the first four characters are fixed, followed by 1 or 2 characters depending on the chromosome, an underscore and then the remainder of the gene ID which could vary in length and character. Important is that there are other strings in this gene information column contains other strings with underscores (e.g. "gene_id", "transcript_id", "five_prime_utr") so using sed -i.bak s/_/-/g' file.gtf
can't be done.

Perhaps not the most elegant way, but this should work:
sed -i.bak 's/\([0-9a-z]\{4\}[0-9][0-9]\?\)_/\1-/g' file.gtf
i.e. capture a group (referenced by \1 in the substitution) of 4 characters consisting of lower case letters and digits followed by exactly one digit and perhaps another digit, which is followed by an underscore; if found, replace it by the group's content and a dash. This should exclude your other occurrences consisting of only characters and an underscore.

Related

Sed command to break comma separated string upto certain length

Example string
TEST,TEST1,TEST3,TEST4,TEST5
Expected output :
TEST,TEST1,
TEST3,TEST4,
TEST5
I want to split data from comma before 15th position
Try this:
sed 's/.\{,15\},/&\n/g' <<< "string" # or
sed 's/.\{,15\},/&\n/g' file
.\{,15\}, matches a part of input consisting of 0 to 15 characters followed by a comma. since sed is greedy while matching patterns, it will match as much characters as it can.
&\n expands up to matched part followed by a line feed.
s/REGEXP/REPLACEMENT/g replaces every match against REGEXP with REPLACEMENT.

Using sed to replace a number located between two other numbers

I need to replace a numeric value, that occurs in a specific line of a series of config files in a pattern like this:
string number_1 number_to_replace number_2
I want to obtain something like this:
string number_1 number_replaced number_2
The difficulties I encountered are:
number_1 or number_2 can be equal to number_to_replace, so a simple replacement is not possible.
number_1 and number_2 vary between config files so I don't know them in advance.
The closest attempt I got until now is:
echo "field 4 4 4" | sed 's/\s4\s/3/'
Which ouputs:
field34 4
This is close, given that I want to replace the intermediate number I added another "\s" to try to use the known fact that the line starts with a character.
echo "field 4 4 4" | sed 's/\s\s4\s/3/'
Which gives:
field 4 4 4
So, nothing is replaced this time. How can I proceed? A somewhat detailed explanation would be ideal, because my knowledge of replacing expressions that involve patterns in nearly zero.
Thanks.
You can do something like below, which matches your exact sequence of digits as in the example. You could replace 3 with any digit of your choice.
sed 's/\([0-9]\{1,\}\)[[:space:]]\([0-9]\{1,\}\)[[:space:]]\([0-9]\{1,\}\)/\1 3 \3/'
Notice that I've used the POSIX bracket expression to match the whitespace character which should be supported in any variant of sed you are using. Note that \s is supported in only the GNU variants.
The literal meaning of the regex definition is to match a single digit followed by a space, then a digit and space and another digit. The captured groups are stored from \1. Since your intention is to remove the 2nd digit, you replace that with the word of your choice.
If the extra escapes causes it unreadable, use the -E flag for extended regex support. I've used the default BRE version

create a generic regex for a string in perl

I have tried to create regex for the below:
STRING sou_u02_mlpv0747_CCF_ASB001_LU_FW_ALERT|/opt/app/medvhs/mvs/applications/cm_vm5/fwhome/UnifiedLogging|UL_\d{8}_CCF_ASB001_LU_sou_u02_mlpv0747_Primary.log.csv|FATAL|red|1h||fw_alert
REGEX----> /^[^#]\w+\|[^\|]+\|\w+\|\w+\|\w*\|\w*\|([^\|]+|)\|\w*$/
I am unable to figure out the mistake here.
I created the above by referring another regex which working fine and given below
/^[^#]\w+\|[^\|]+\|([^\|]+|)\|[rm]\|(in|out|old|new|arch|missing)\|\w+\|([0-9-,]+|)\|\w*\|\w*$/
sou_u02_mlpv0747_CCF_ASB001_LU_ODR|/opt/app/medvhs/mvs/applications/cm_vm5/components/CCF_ASB001_LU/SPOOL/ODR||r|out|30m|0400-1959|30m|gprs_in_stag
can some one please help me. Any leads would be highly apprciated.
Let's start from a brief look at your source text (the first that you included).
It is composed of "sections" separated with | char.
This char (|) must be matched by \|. Remember about the preceding
backslash, otherwise, a "bare" | would mean the alternative separator
(you used it in one place).
And now take a look at each section (between |):
Some of them contain only a sequence of word chars (and can be matched
by \w+).
Other sections, however, contain also other chars, e.g. slashes,
backslash, braces and dots, so each such section is actually a sequence
of chars other than "|" and must be matched by [^|]+ (here,
between [ and ], the vertical bar may be unescaped).
Now let's write each section and its "type":
sou_u02_..._FW_ALERT - word chars.
/opt/app/.../UnifiedLogging - other chars (because of slashes).
UL_\d{8}_..._Primary.log.csv - other chars (because of \d{8}
and dots).
FATAL|red|1h - 3 sections composed of word chars.
An empty section, between 2 consecutive | chars.
fw_alert - word chars.
And now, how to match these groups, and the separating |:
Point 1: \w+\| - word chars and (escaped) vertical bar.
Point 2 and 3 (together): (?:[^|]+\|){2} - a non-capturing
group - (?:...), containing a sequence of "other" chars - [^|]+
and a vertical bar - \|, occurring 2 times {2}.
Point 4 (three "word char" groups): (?:\w+\|){3} - similiar to
the previous point.
Point 5: Just as in your solution - ([^|]+|)\|, a capturing group -
(...), with 2 alternatives ...|.... The first alternative is
[^|]+ (a sequence of "other" chars), and the second alternative
is empty. After the capturing group there is \| to match the vertical
bar.
Point 6: \w+ - word chars. This time no \|, as this is the last
section.
The regex assembled so far must be:
prepended with a ^ (start of string) and
appended with a $ (end of string).
So the whole regex, matching your source text can be:
^\w+\|(?:[^|]+\|){2}(?:\w+\|){3}([^|]+|)\|\w+$
Actually, the only capturing group can be written another way,
as ([^|]*) - without alternatives, but with * as the
repetition count, allowing also empty content.
Your choice, which variant to apply.
The third field
UL_\d{8}_CCF_ASB001_LU_sou_u02_mlpv0747_Primary.log.csv
Contains a backslash, \, braces { } and dots .. None of these can be matched by \w
Note also that there is no need to escape a pipe | inside a characters class: [^|]+ is fine

Using sed to eliminate all lines that do not match the desired form

I have a single column csv that looks something like this:
KFIG
KUNV
K~LK
K7RT
3VGT
Some of the datapoints are garbled in transmission. I need to keep only the entries that begin with a capital letter, then the other three digits could be a capital letter OR a number. For example, in the list above I would have to delete K~LK and 3VGT.
I know that to delete all but capital letters I can write
sed -n '/[A-Z]\{4,\}/p'
I just want to adjust this to where the last three digits could be capital letters or numbers. Any help would be appreciated.
Just use:
sed -n '/[A-Z][A-Z0-9]\{3,\}/p'
However, if these identifiers are really all that there is in the file, I would propose the following command (it will assure that the whole line is matched, so it will reject for example identifiers more than 4 characters long):
sed -n '/^[A-Z][A-Z0-9]\{3\}$/p'
^ means "match zero-length string at the beginning of line";
\{3\} means "match exactly 3 occurences of the previous atom", the previous atom being [A-Z0-9];
$ means "match zero-length string at the end of line".

sed - remove specific subscript from string

please provide me a sed oneliner which provides this output:
sdc3 sdc2
for Input :
sdc3[1] sdc2[0]
I mean remove all subscript value from the string ..
sed 's/\[[^]]*\]//g'
reads: substitute any string with literal "[" followed by zero or more characters that aren't a "]", and then the closing "]", with an empty string.
You need the [^]] bit to prevent greedy matching treating "[1] sdc2[0]" as a single match in your sample string.
As for your comment:
sed 's#\([^[ ]*\)\[[^]]*\]#/dev/\1#g'
I switch the seperator from the usual '/' to '#', just to avoid escaping the /dev/ bit you asked for (I won't say "for clarity")
the \(...\) bit matches a subgroup, here sdc2 or whatever, so we can refer to it in the replacement
the subgroup uses a similar character class to the one we used discarding the index: [^[ ] means any character except an "[" (again, to avoid greedily matching the index) or a space (assuming your values are space-delimited as per your post)
the replacement is now the literal "/dev/" followed by the first (and only) subgroup match
the g flag at the end tells it to perform multiple matches per line, instead of stopping at the first one