I have something like
chr1 162724289 162724421 CAAAATGTTTATAAGGACAGCCTGCTCTCTCCCCTCAGTACAGGGCAGCTGCTTGCCTGTGAACCAGTAAACAGCTCTGTGGTTTCATGGTTGCTCCCTCTCTCCCCAACCCTCACCTCTCAAGGCTGGACT chr1 162724414 162724421 ID=exon:ENST00000367921.3:5;Parent=ENST00000367921.3;gene_id=ENSG00000162733.12;transcript_id=ENST00000367921.3;gene_type=protein_coding;gene_status=KNOWN;gene_name=DDR2;transcript_type=protein_coding;transcript_status=KNOWN;transcript_name=DDR2-002;exon_number=5;exon_id=ENSE00001165686.1;level=2;protein_id=ENSP00000356898.3;ccdsid=CCDS1241.1;havana_gene=OTTHUMG00000034423.4;havana_transcript=OTTHUMT00000097650.1;tag=basic,appris_principal,CCDS
I would like to extract only the exon_number=5 from the 8th column. This is kind of a long one line command and, since I have other columns I want to keep, I guess that I cannot use awk -F ';'. I tried something like:
sed -E 's/ ID=*\(exon_number=[0-9]\)* \1/'
Desired output:
chr1 162724289 162724421 CAAAATGTTTATAAGGACAGCCTGCTCTCTCCCCTCAGTACAGGGCAGCTGCTTGCCTGTGAACCAGTAAACAGCTCTGTGGTTTCATGGTTGCTCCCTCTCTCCCCAACCCTCACCTCTCAAGGCTGGACT chr1 162724414 162724421 exon_number=5
Any advice would be great!
Thanks
With sed, you may match and remove exactly what you want:
sed -E 's/(.* )ID=[^[:space:]]*(exon_number=[0-9]+).*/\1\2/'
See the online sed demo
Explanation
-E - POSIX ERE syntax enabling option
(.* )ID=[^[:space:]]*(exon_number=[0-9]+).* - a regex pattern matching:
(.* ) - Group 1: any 0+ chars, as many as possible, and then a space
ID=[^[:space:]]* - ID= and 0+ non-whitespace chars
(exon_number=[0-9]+) - exon_number= and 1 or more digits (Group 2)
.* - the rest of the line
\1\2 - the replacement pattern inserts the contents of Group 1 and 2 into the resulting string.
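To see the two groups at work, here is a small sketch on a shortened, made-up line in the same layout as the question's data (the coordinates and IDs are placeholders, not real annotation):

```shell
# Shortened, made-up sample line in the same layout as the question's data.
line='chr1 100 200 ACGT chr1 150 200 ID=exon:T1:5;Parent=T1;exon_number=5;exon_id=E1;level=2'

# Group 1 keeps everything up to the space before ID=;
# group 2 keeps the exon_number=N pair; the rest of the line is dropped.
result=$(printf '%s\n' "$line" |
  sed -E 's/(.* )ID=[^[:space:]]*(exon_number=[0-9]+).*/\1\2/')

printf '%s\n' "$result"
```

Lines that do not contain ID= followed by exon_number= are left unchanged, since the substitution simply does not apply to them.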
EDIT: Since the OP changed the requirement, here is a solution for the updated requirement.
awk -F";" 'match($0,/exon_number=[0-9]+/){val=$1;sub(/ ID.*/,"",val);print val,substr($0,RSTART,RLENGTH)}' Input_file
Following simple awk may help you here.
awk 'match($0,/exon_number=[0-9]+/){print substr($0,RSTART,RLENGTH)}' Input_file
2nd solution: If your Input_file always has the same layout, simply print the value by field number.
awk -F";" '{print $11}' Input_file
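As a rough illustration of what match() leaves behind, the sketch below prints RLENGTH next to the extracted substring. The sample line is made up and shortened.

```shell
line='chr1 100 200 ACGT chr1 150 200 ID=exon:T1:5;Parent=T1;exon_number=5;exon_id=E1'

# match() returns 0 when nothing matches, so it doubles as the pattern;
# on success RSTART and RLENGTH delimit the matched text inside $0.
hit=$(printf '%s\n' "$line" |
  awk 'match($0, /exon_number=[0-9]+/) { print RLENGTH, substr($0, RSTART, RLENGTH) }')

printf '%s\n' "$hit"
```

Unlike the field-number approach, this does not depend on how many attributes precede exon_number.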
I have a file in which some lines contain a json object on a single line, and I want to extract the value of the window_indicator property.
A normal regular expression is: "window_indicator":\s*([\-\d\.]+) in which I want the value of the first match group.
Here it is working perfectly well: https://regex101.com/r/w9Iuch/1
I've settled on sed because it seems that grep has to print the whole line and can't limit to the match group value, and perl is overkill.
Unfortunately, sed isn't actually capable of doing this, is it?
# sed 's/("window_indicator:)/\1/' in.txt
sed: -e expression #1, char 26: invalid reference \1 on `s' command's RHS
# sed -E 's/("window_indicator":)/\1/p' in.txt
prints out every line of the file
# sed -rn 's/("window_indicator":)/\1/p' in.txt
prints the whole line
# sed -rn 's/("window_indicator":)/\1/' in.txt
nothing
With sed, you need to match the whole line, capture what you need, replace the whole match with Group 1 placeholder, and make sure you suppress the default line output and only print the new text after successful substitution:
sed -nE 's/.*"window_indicator":[[:space:]]*([-0-9.]+).*/\1/p' in.txt
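If only the first match is to be retrieved, put the substitution in a block addressed by the same regex and quit (q) inside it; a bare q after the s command would stop sed after the first input line whether or not it matched:
sed -nE '/.*"window_indicator":[[:space:]]*([-0-9.]+).*/{s//\1/p;q;}' in.txt
The empty regex in s//\1/p reuses the regex from the address.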
Note that \d is not supported in POSIX regex; it is replaced with the 0-9 range in the bracket expression here.
Details
n - suppress default line output
E - enables POSIX ERE flavor
.*"window_indicator":[[:space:]]*([-0-9.]+).* - finds
.* - any text
"window_indicator": - a fixed string
[[:space:]]* - zero or more whitespaces (GNU sed supports \s, too)
([-0-9.]+) - Group 1: one or more digits, - or .
.* - any text
\1 - replaces with Group 1 value
p - prints the result upon successful replacement
q - quits processing the stream.
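A minimal sketch of the pieces above on a made-up three-line file (the file contents are an assumption for illustration). Here q sits inside a block addressed by the regex, so it only fires on a matching line:

```shell
tmp=$(mktemp)
printf '%s\n' \
  'plain log line without JSON' \
  '{"name":"a","window_indicator": -1.5}' \
  '{"name":"b","window_indicator": 2}' > "$tmp"

re='.*"window_indicator":[[:space:]]*([-0-9.]+).*'

# All matches: -n silences default output, the p flag prints substituted lines only.
all=$(sed -nE "s/$re/\1/p" "$tmp")

# First match only: s// reuses the address regex, and q runs only inside the block.
first=$(sed -nE "/$re/{s//\1/p;q;}" "$tmp")

rm -f "$tmp"
printf 'all:\n%s\nfirst: %s\n' "$all" "$first"
```

The first line of the sample deliberately has no JSON, which is the case where an unconditional q would quit before anything is printed.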
With GNU grep, it is even easier:
grep -oP '"window_indicator":\s*\K[-\d.]+' in.txt
To get the first match,
grep -oP '"window_indicator":\s*\K[-\d.]+' in.txt | head -1
Here,
o - outputs matched texts only
P - enables the PCRE regex engine
"window_indicator":\s*\K[-\d.]+ - matches
"window_indicator": - a fixed string
\s* - zero or more whitespaces
\K - removes the text matched so far from the match value
[-\d.]+ - matches one or more -, . or digits.
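A quick sketch of the \K behaviour, assuming GNU grep built with PCRE support (the JSON line is made up):

```shell
json='{"name":"a","window_indicator": -1.5,"other":3}'

# -o prints only the matched text; \K discards the key and the whitespace
# matched so far, so only the number remains in the reported match.
value=$(printf '%s\n' "$json" | grep -oP '"window_indicator":\s*\K[-\d.]+')

printf '%s\n' "$value"
```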
1st solution: With your shown samples, please try the following awk code, though it is always advisable to use a JSON parser such as jq. In short, awk's match function is used with the regex "window_indicator":[0-9]+} to find the needed value. If the regex matches, the variable val is set to the matched substring of the current line. Then "window_indicator": and } are replaced with an empty string in val, and val is printed, which gives the needed value.
awk '
match($0,/"window_indicator":[0-9]+}/){
val=substr($0,RSTART,RLENGTH)
gsub(/"window_indicator":|}/,"",val)
print val
}
' Input_file
2nd solution: Using GNU grep with a positive lookbehind and a positive lookahead to get the expected output.
grep -oP '(?<="window_indicator":)\d+(?=})' Input_file
Using sed
$ sed -E 's/.*window_indicator":([0-9]+).*/\1/' input_file
0
Using grep
$ grep -Po '.*window_indicator":\K\d+' input_file
0
Using awk (GNU awk, since the three-argument match is a gawk extension)
$ awk '{match($0,/.*window_indicator":([0-9]+)/,arr);print arr[1]}' input_file
0
I have a file that looks like this:
chr4 StringTie exon 185054979 185055237 1000 + . gene_id `"MSTRG.41311"; transcript_id "ENST00000658673.1"; exon_number "2"; gene_name `"LINC02436"; ref_gene_id "ENSG00000250754.6";
chr4 StringTie exon 185069961 185070030 1000 + . gene_id `"MSTRG.41311"; transcript_id "ENST00000658673.1"; exon_number "3"; gene_name "LINC02436"; ref_gene_id "ENSG00000250754.6";
chr6 HAVANA exon 169067764 169068299 . + . gene_id "ENSG00000234519.2"; transcript_id "ENST00000666733.1"; exon_number "1"; gene_name "RP3-495K2.1";
I want to only keep the gene id information so the file will look like this:
MSTRG.41311
MSTRG.41311
ENSG00000234519.2
I have tried the following:
cat file.gtf|sed 's/!ENSG*//g'|sed 's/!ENSG*//g' > myfile.txt.
But this does not give me the desired output. I think this is because of the quotation marks which is a special character but I'm not sure.
Can someone help with this problem?
Thanks!
Try with this (GNU sed):
sed -E 's/gene_id/\x0/;s/.*\x0 `?"([^"]+)".*/\1/' input
Since gene_id occurs twice on the first two lines (the second time inside ref_gene_id, and you seem to be interested in the first occurrence on each line), I can't just go with sed 's/.*gene_id…, otherwise the .* will eat everything up to the last gene_id on the line.
Therefore, my approach is to pick the first gene_id on each line and change it into a \x0 character, via s/gene_id/\x0/ (since there's no greedy .* before gene_id, it will match the first on the line).
Once I've marked that position with \x0, I can use it to "anchor" the rest of the regex in the following substitution, where .*\x0 will match everything on each line up to and including (what was) the first gene_id on the line, and `?"([^"]+)".* matches the rest of the line while capturing with (…) the part between "s.
I've used -E for extended regex, so I can use (…) instead of \(…\), for instance.
Oh, the `? is just because you've put those backticks in the first two lines, so with ? (which'd be \? without the -E option) I required that zero or one backtick matches at that position. Not sure if it was a copy and paste mistake.
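The greedy-match problem and the marker trick can be compared side by side. This is a sketch assuming GNU sed; it uses a newline as the marker instead of \x0 (a swapped-in choice for this example, any character that cannot occur in the line would do), and the sample line is shortened and has no backticks:

```shell
line='chr4 StringTie exon 1 2 1000 + . gene_id "MSTRG.41311"; transcript_id "T1"; ref_gene_id "ENSG00000250754.6";'

# Greedy .* runs to the LAST gene_id, which here sits inside ref_gene_id.
greedy=$(printf '%s\n' "$line" | sed -E 's/.*gene_id "([^"]+)".*/\1/')

# Mark the FIRST gene_id with a newline, then anchor the capture on that marker.
first=$(printf '%s\n' "$line" | sed -E 's/gene_id/\n/;s/.*\n "([^"]+)".*/\1/')

printf 'greedy: %s\nfirst:  %s\n' "$greedy" "$first"
```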
This might work for you (GNU sed):
sed -En 's/.*\<gene_id\>[^"]*"([^"]*)".*/\1/p' file
Turn on extended regexp -E and off implicit printing -n as this is a filtering operation.
Match the word gene_id, make a back reference to the string between the next pair of double quotes and replace the whole line by the back reference printing the result.
Fast:
awk -v RS='[^[:alnum:]_.]+' 'f==1{print;f=0} $0=="gene_id"{f=1}'
100% POSIX:
awk -F '[^[:alnum:]_.]+' '{for (i=1; i<=NF; i++) {if ($i=="gene_id") {print $(i+1); next}}}'
Setting RS to a regular expression is not POSIX, but is commonly available.
You could adapt either to print any field, anywhere in the line.
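One way the field-scanning version might be adapted to other attributes is sketched below. get_attr is a hypothetical helper name, the sample line is shortened, and the separator is written as the ASCII range [^A-Za-z0-9_.]+ rather than [^[:alnum:]_.]+ purely as a conservative assumption about the installed awk:

```shell
gtf='chr4 StringTie exon 1 2 1000 + . gene_id "MSTRG.41311"; transcript_id "T1"; exon_number "2"; ref_gene_id "ENSG00000250754.6";'

# get_attr (hypothetical helper): print the token that follows the first
# field equal to the requested key, then move on to the next line.
get_attr() {
  awk -v key="$1" -F '[^A-Za-z0-9_.]+' '{
    for (i = 1; i <= NF; i++)
      if ($i == key) { print $(i + 1); next }
  }'
}

gene=$(printf '%s\n' "$gtf" | get_attr gene_id)
exon=$(printf '%s\n' "$gtf" | get_attr exon_number)

printf '%s %s\n' "$gene" "$exon"
```

Because _ is part of a field, ref_gene_id stays one token and is never mistaken for gene_id.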
You can also do this with cut -d"delimiter" -f field_number.
For example:
cut -d"\"" -f 2 file.gtf
The \ is used because a " cannot otherwise be placed between two other "s. Field 2 is the text between the first pair of double quotes, which is the gene_id value in the sample.
I am trying to remove the | at the end of $3 and insert a tab using sed, but currently only the | is getting removed and this will not work in my awk command later. Is there a better way? Thank you :).
input
chr1 955542 955763|AGRN
chr1 957570 957852|AGRN
chr1 976034 976270|AGRN
chr1 976542 976787|AGRN
sed
sed 's/<|>/TAB/g' input > out
current output
chr1 955542 955763AGRN
chr1 957570 957852AGRN
chr1 976034 976270AGRN
chr1 976542 976787AGRN
If you really want a two-step approach, where you replace the | chars. first and then feed the result to awk (instead of doing it all in awk - see Lars Fischer's comment on the question[1]), the simplest approach is:
tr '|' '\t' < input > out
Incidentally, your sed command doesn't produce the output you quote.
To do it in sed (which is overkill here, unless you want the convenience of in-place updating with -i), you'd need:
# GNU Sed
sed 's/|/\t/g' input
# BSD/OSX Sed, from bash/ksh/zsh:
sed 's/|/'$'\t''/g' input
# Fully POSIX-compliant (from a shell that doesn't support $'...' strings)
sed 's/|/'"$(printf '\t')"'/g' input
[1] To add an explanation: awk -F '[\t |]+' '...' sets -F (which sets special awk variable FS, the input field separator) to a regular expression that allows you to recognize not just the whitespace-separated tokens as fields, but also the two fields contained in tokens such as 955763|AGRN - which means there is no need for preprocessing the input.
Regex [\t |]+ means: consider any nonempty run of any mix of tabs, spaces, and pipe symbols a field separator.
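Both routes can be tried on a single line. This sketch assumes the input is tab-separated, which the question does not actually state:

```shell
line=$(printf 'chr1\t955542\t955763|AGRN')

# Two-step route: turn every | into a tab before handing the line to awk.
tabbed=$(printf '%s\n' "$line" | tr '|' '\t')

# One-step route: let awk treat any run of tabs, spaces and | as a separator.
fields=$(printf '%s\n' "$line" | awk -F '[\t |]+' '{ print NF ": " $3 " / " $4 }')

printf '%s\n%s\n' "$tabbed" "$fields"
```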
awk '{sub(/\|/,"\t")}1' file
chr1 955542 955763 AGRN
chr1 957570 957852 AGRN
chr1 976034 976270 AGRN
I have a file with multiple lines and for line 2 to the end of the file I want to swap fields 8 and 9. The file is comma separated and I'd like to do the swap inline so I can run it on a batch of files using * wildcard. If this can be accomplished similarly with awk then that works for me too.
example:
header1,header2,header3,...,header8,header9,...,headerN
field1.1,...,field1.9,field1.8,...,field1.N
field2.1,...,field2.9,field2.8,...,field2.N
field3.1,...,field3.9,field3.8,...,field3.N
...
I think the command would look similar to sed -r -i '2,$s/^(([^,]*,){8})([^,]*,)([^,]*,)(.*)/\1\3\2\4/' temp*.log,
but \2 is not what I expect, it is the 7th field. I know that \2 will not be the 8th field because I have double parentheses there, but I'm not sure how to fix it. Could somebody please explain what this equation is doing and specifically what [^,] is doing and how the {8} is applied?
Thanks in advance.
In awk, you might use:
awk -F',' 'BEGIN {OFS=","} {t = $8; $8 = $9; $9 = t; print}'
In sed, the command is more convoluted, but it could be done.
sed -e 's/^\(\([^,]*,\)\{7\}\)\([^,]*,\)\([^,]*,\)/\1\4\3/'
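Add the -i option for in-place editing if your version of sed (e.g. GNU or BSD) supports it; with a backup suffix, GNU sed wants it attached (-i.bak), while BSD sed takes it as a separate argument (-i .bak).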
This uses the universally available sed regexes (it would work on even archaic versions of sed). You could lose most of the backslashes if you used 'extended regular expressions' instead:
sed -r -i 's/^(([^,]*,){7})([^,]*,)([^,]*,)(.*)/\1\4\3\5/'
Note the nested remembered (captured) patterns. The outer set is \1, the inner set would be \2 but that gets repeated 7 times, so you'd have the seventh field as \2. Anyway, that's why the eighth and ninth columns are switched with \4 and \3. \5 is the remaining columns.
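To check the group numbering, the sketch below prints every group of the extended-regex form on a made-up row, written with a fifth group (.*) for the remainder, and then does the actual swap:

```shell
row='f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f11'

# Show what each numbered group captured.
groups=$(printf '%s\n' "$row" |
  sed -E 's/^(([^,]*,){7})([^,]*,)([^,]*,)(.*)/1=\1 2=\2 3=\3 4=\4 5=\5/')

# Swap fields 8 and 9 by emitting \4 before \3.
swapped=$(printf '%s\n' "$row" |
  sed -E 's/^(([^,]*,){7})([^,]*,)([^,]*,)(.*)/\1\4\3\5/')

printf '%s\n%s\n' "$groups" "$swapped"
```

Note that each captured field carries its trailing comma, so this form needs at least one more field after the ninth.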
(I note in passing that it would have been helpful to have some sample data in sufficiently the correct format to test with. It was a nuisance having to edit what is shown in the question to be able to test the code.)
If you need to do much CSV work, then either use Perl and its CSV modules (Text::CSV and Text::CSV_XS) or Python and its CSV module, or get CSVfix.
\2 is the second group in the RE.
Groups are numbered by the position of their opening (.
So in
'2,$s/^(([^,]*,){8})([^,]*,)([^,]*,)(.*)/\1\3\2\4/'
you can see, from left to right:
\1 = (([^,]*,){8})
\2 = ([^,]*,) (the inner group, repeated by {8})
\3 = ([^,]*,)
\4 = ([^,]*,)
and finally \5 = (.*)
In this specific case, \2 holds the last of the eight repetitions ({8}), i.e. the 8th field.
It seems that awk is the right tool:
awk -F',' -v OFS=',' '{t=$8;$8=$9;$9=t}7' file
This might work for you (GNU sed):
sed -ri '1!s/(,[^,]*)(,[^,]*)/\2\1/4' file
This swaps the 9th field with the 8th i.e. 8 / 2 = 4, if you wanted the 7th with the 8th:
sed -ri '1!{s/^/,/;s/(,[^,]*)(,[^,]*)/\2\1/4;s/^,//}' file
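The occurrence-number trick can be checked on a made-up row. The sketch drops -i and the 1! address so that it works on a single line from a pipe:

```shell
row='f1,f2,f3,f4,f5,f6,f7,f8,f9,f10'

# Non-overlapping pairs are (,f2)(,f3), (,f4)(,f5), (,f6)(,f7), (,f8)(,f9):
# the 4th pair is fields 8 and 9.
swap_8_9=$(printf '%s\n' "$row" | sed -E 's/(,[^,]*)(,[^,]*)/\2\1/4')

# A temporary leading comma shifts the pairing by one, so the 4th pair is (,f7)(,f8).
swap_7_8=$(printf '%s\n' "$row" | sed -E 's/^/,/;s/(,[^,]*)(,[^,]*)/\2\1/4;s/^,//')

printf '%s\n%s\n' "$swap_8_9" "$swap_7_8"
```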
I am trying to exchange two words in a line but it doesn't work. For example: "Today is my first day of university" should be "my is Today first day of university"
This is what I tried:
sed 's/\([a-zA-z0-9]*\)\([a-zA-z0-9]*\)\([a-zA-z0-9]*\)/\3\2\1/' filename.txt
What am I doing wrong?
I started with \s, which matches any whitespace character.
To match each word I use [^ ]*, which matches everything except spaces.
Then I add \s\+ to match the whitespace between words. Don't forget to write a space back in the replacement.
Here is an example:
sed 's#\([^ ]*\)\s\+#\1 #'
(I use # instead of / as the delimiter; \s and \+ are GNU sed extensions.)
sed -r 's/^(\w+)(\s+\w+\s+)(\w+)(.*)/\3\2\1\4/'
with your example:
kent$ echo "Today is my first day of university"|sed -r 's/^(\w+)(\s+\w+\s+)(\w+)(.*)/\3\2\1\4/'
my is Today first day of university
for your problem, awk is more straightforward:
awk '{t=$1;$1=$3;$3=t}1'
same input:
kent$ echo "Today is my first day of university"|awk '{t=$1;$1=$3;$3=t}1'
my is Today first day of university
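One difference between the two commands shows up when the words are separated by more than one space. The sketch below assumes GNU sed for \w and \s and uses an input with three spaces after the first word:

```shell
input='Today   is my first day of university'

# sed copies group 2 (the middle word with its surrounding whitespace) verbatim.
with_sed=$(printf '%s\n' "$input" | sed -E 's/^(\w+)(\s+\w+\s+)(\w+)(.*)/\3\2\1\4/')

# Assigning to a field makes awk rebuild $0 with single-space separators (OFS).
with_awk=$(printf '%s\n' "$input" | awk '{ t = $1; $1 = $3; $3 = t } 1')

printf '[%s]\n[%s]\n' "$with_sed" "$with_awk"
```

So the awk version is shorter, but it normalizes the spacing of the whole line; the sed version leaves spacing as it was.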
Try this:
sed -rn 's/(\w+\s)(\w+\s)(\w+\s)(.*)/\3\2\1\4/p' filename.txt
-n suppress automatic printing of pattern space
-r use extended regular expressions in the script
\s for whitespace
This might work for you (GNU sed):
sed -r 's/(\S+)\s+(\S+)\s+(\S+)/\3 \2 \1/' file
You are not accounting for whitespace.
Use [ \t]+ between words (with sed -E; in a basic regex, as in your command, GNU sed needs \+ instead of +).