I am trying to remove the | at the end of $3 and insert a tab using sed, but currently only the | is getting removed and this will not work in my awk command later. Is there a better way? Thank you :).
input
chr1 955542 955763|AGRN
chr1 957570 957852|AGRN
chr1 976034 976270|AGRN
chr1 976542 976787|AGRN
sed
sed 's/<|>/TAB/g' input > out
current output
chr1 955542 955763AGRN
chr1 957570 957852AGRN
chr1 976034 976270AGRN
chr1 976542 976787AGRN
If you really want a two-step approach, where you replace the | chars. first and then feed the result to awk (instead of doing it all in awk; see Lars Fischer's comment on the question[1]), the simplest approach is:
tr '|' '\t' < input > out
Incidentally, your sed command doesn't produce the output you quote.
To do it in sed (which is overkill here, unless you want the convenience of in-place updating with -i), you'd need:
# GNU Sed
sed 's/|/\t/g' input
# BSD/OSX Sed, from bash/ksh/zsh:
sed 's/|/'$'\t''/g' input
# Fully POSIX-compliant (from a shell that doesn't support $'...' strings)
sed 's/|/'"$(printf '\t')"'/g' input
[1] To add an explanation: awk -F '[\t |]+' '...' sets -F (which sets the special awk variable FS, the input field separator) to a regular expression that lets awk recognize not just the whitespace-separated tokens as fields, but also the two fields contained in tokens such as 955763|AGRN, which means there is no need to preprocess the input.
Regex [\t |]+ means: consider any nonempty run of any mix of tabs, spaces, and pipe symbols a field separator.
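As a quick check of that field splitting, here is a minimal sketch that feeds one line from the question to awk with the same -F value. The program body (printing NF and the last two fields) and the variable names are only illustrative; they are not part of the original answer.

```shell
# One sample line from the question.
line='chr1 955542 955763|AGRN'

# Any run of tabs, spaces, or pipes separates fields, so 955763 and AGRN
# arrive as $3 and $4 without a separate sed/tr pass.
fields=$(printf '%s\n' "$line" | awk -F '[\t |]+' '{ print NF ": " $3 " " $4 }')
printf '%s\n' "$fields"
```

This should report four fields, with 955763 and AGRN as the third and fourth.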
awk '{sub(/\|/,"\t")}1' file
chr1 955542 955763 AGRN
chr1 957570 957852 AGRN
chr1 976034 976270 AGRN
I have a file that looks like this:
chr4 StringTie exon 185054979 185055237 1000 + . gene_id `"MSTRG.41311"; transcript_id "ENST00000658673.1"; exon_number "2"; gene_name `"LINC02436"; ref_gene_id "ENSG00000250754.6";
chr4 StringTie exon 185069961 185070030 1000 + . gene_id `"MSTRG.41311"; transcript_id "ENST00000658673.1"; exon_number "3"; gene_name "LINC02436"; ref_gene_id "ENSG00000250754.6";
chr6 HAVANA exon 169067764 169068299 . + . gene_id "ENSG00000234519.2"; transcript_id "ENST00000666733.1"; exon_number "1"; gene_name "RP3-495K2.1";
I want to only keep the gene id information so the file will look like this:
MSTRG.41311
MSTRG.41311
ENSG00000234519.2
I have tried the following:
cat file.gtf|sed 's/!ENSG*//g'|sed 's/!ENSG*//g' > myfile.txt.
But this does not give me the desired output. I think this is because of the quotation marks, which are special characters, but I'm not sure.
Can someone help with this problem?
Thanks!
Try with this (GNU sed):
sed -E 's/gene_id/\x0/;s/.*\x0 `?"([^"]+)".*/\1/' input
Since gene_id occurs twice on the first two lines (and you seem to be interested in the first occurrence on each line), I can't just go with sed 's/.*gene_id…, otherwise the greedy .* will eat everything up to the last gene_id on the line.
Therefore, my approach is to pick the first gene_id on each line and change it into a \x0 character, via s/gene_id/\x0/ (since there's no greedy .* before gene_id, it will match the first one on the line).
Once I've marked that position with \x0, I can use it to "anchor" the rest of the regex in the following substitution, where .*\x0 will match everything on each line up to and including (what was) the first gene_id on the line, and `?"([^"]+)".* matches the rest of the line while capturing with (…) the part between "s.
I've used -E for extended regex, so I can use (…) instead of \(…\), for instance.
Oh, the `? is just because you've put those backticks in the first two lines, so with ? (which'd be \? without the -E option) I required that zero or one backtick matches at that position. Not sure if it was a copy and paste mistake.
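To see the greedy-match problem and the marker trick side by side, here is a small sketch. It makes two assumptions that differ from the answer: it uses @ as the marker instead of \x0 so that it also runs on non-GNU sed (which only works if @ never occurs in the data), and the sample line is the question's second line with the stray backtick left out.

```shell
# Second sample line from the question, without the stray backtick (assumption).
line='chr4 StringTie exon 185069961 185070030 1000 + . gene_id "MSTRG.41311"; transcript_id "ENST00000658673.1"; exon_number "3"; gene_name "LINC02436"; ref_gene_id "ENSG00000250754.6";'

# Naive attempt: the greedy .* runs up to the LAST "gene_id" (inside ref_gene_id).
greedy=$(printf '%s\n' "$line" | sed 's/.*gene_id "\([^"]*\)".*/\1/')

# Marker approach: tag the FIRST gene_id, then anchor the second substitution on the tag.
marked=$(printf '%s\n' "$line" | sed 's/gene_id/@/; s/.*@ "\([^"]*\)".*/\1/')

printf 'greedy: %s\nmarked: %s\n' "$greedy" "$marked"
```

The naive version returns the ref_gene_id value, while the marked version returns the first gene_id.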
This might work for you (GNU sed):
sed -En 's/.*\<gene_id\>[^"]*"([^"]*)".*/\1/p' file
Turn on extended regexp -E and off implicit printing -n as this is a filtering operation.
Match the word gene_id, make a back reference to the string between the next pair of double quotes and replace the whole line by the back reference printing the result.
Fast:
awk -v RS='[^[:alnum:]_.]+' 'f==1{print;f=0} $0=="gene_id"{f=1}'
100% POSIX:
awk -F '[^[:alnum:]_.]+' '{for (i=1; i<=NF; i++) {if ($i=="gene_id") {print $(i+1); next}}}'
Setting RS to a regular expression is not POSIX, but is commonly available.
You could adapt either to print any field, anywhere in the line.
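As a sketch of that adaptation, the POSIX variant can take the attribute name from a variable. The variable name key is hypothetical, and the input is the question's second sample line.

```shell
# Second sample line from the question, verbatim (the backtick is just another separator here).
line='chr4 StringTie exon 185069961 185070030 1000 + . gene_id `"MSTRG.41311"; transcript_id "ENST00000658673.1"; exon_number "3"; gene_name "LINC02436"; ref_gene_id "ENSG00000250754.6";'

# key (hypothetical name) selects which attribute to print; the token after it is its value.
name=$(printf '%s\n' "$line" |
  awk -v key='gene_name' -F '[^[:alnum:]_.]+' \
      '{ for (i = 1; i < NF; i++) if ($i == key) { print $(i + 1); next } }')
printf '%s\n' "$name"
```

One caveat with this field separator: a value containing characters outside [[:alnum:]_.], such as the hyphen in RP3-495K2.1 on the third sample line, would be split into several fields, so only its first part would be printed.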
You can also try doing this with cut -d"delimiter" -f column_numbers.
For example:
cut -d"\"" -f 2 file.gtf
The \ is used because a bare " cannot be placed between two other " characters. Splitting on ", the gene id is the second field.
I have something like
chr1 162724289 162724421 CAAAATGTTTATAAGGACAGCCTGCTCTCTCCCCTCAGTACAGGGCAGCTGCTTGCCTGTGAACCAGTAAACAGCTCTGTGGTTTCATGGTTGCTCCCTCTCTCCCCAACCCTCACCTCTCAAGGCTGGACT chr1 162724414 162724421 ID=exon:ENST00000367921.3:5;Parent=ENST00000367921.3;gene_id=ENSG00000162733.12;transcript_id=ENST00000367921.3;gene_type=protein_coding;gene_status=KNOWN;gene_name=DDR2;transcript_type=protein_coding;transcript_status=KNOWN;transcript_name=DDR2-002;exon_number=5;exon_id=ENSE00001165686.1;level=2;protein_id=ENSP00000356898.3;ccdsid=CCDS1241.1;havana_gene=OTTHUMG00000034423.4;havana_transcript=OTTHUMT00000097650.1;tag=basic,appris_principal,CCDS
I would like to extract only the exon_number=5 from the 8th column. This is kind of a long one line command and, since I have other columns I want to keep, I guess that I cannot use awk -F ';'. I tried something like:
sed -E 's/ ID=*\(exon_number=[0-9]\)* \1/'
Desired output:
chr1 162724289 162724421 CAAAATGTTTATAAGGACAGCCTGCTCTCTCCCCTCAGTACAGGGCAGCTGCTTGCCTGTGAACCAGTAAACAGCTCTGTGGTTTCATGGTTGCTCCCTCTCTCCCCAACCCTCACCTCTCAAGGCTGGACT chr1 162724414 162724421 exon_number=5
Any advice would be great!
Thanks
With sed, you may match and remove exactly what you want:
sed -E 's/(.* )ID=[^[:space:]]*(exon_number=[0-9]+).*/\1\2/'
See the online sed demo
Explanation
-E - POSIX ERE syntax enabling option
(.* )ID=[^[:space:]]*(exon_number=[0-9]+).* - a regex pattern matching:
(.* ) - Group 1: any 0+ chars, as many as possible, and then a space
ID=[^[:space:]]* - ID= and 0+ non-whitespace chars
(exon_number=[0-9]+) - exon_number= and 1 or more digits (Group 2)
.* - the rest of the line
\1\2 - the replacement pattern inserts the contents of Group 1 and 2 into the resulting string.
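The breakdown above can be tried on a shortened line. The sample below is an abbreviated, hypothetical version of the question's record (shorter sequence, fewer attributes), not the real data.

```shell
# Abbreviated stand-in for the question's line (hypothetical: sequence and attributes trimmed).
line='chr1 162724289 162724421 CAAAATG chr1 162724414 162724421 ID=exon:ENST00000367921.3:5;Parent=ENST00000367921.3;exon_number=5;exon_id=ENSE00001165686.1;level=2'

# Group 1 keeps everything up to the space before ID=; group 2 keeps exon_number=<digits>.
kept=$(printf '%s\n' "$line" |
  sed -E 's/(.* )ID=[^[:space:]]*(exon_number=[0-9]+).*/\1\2/')
printf '%s\n' "$kept"
```

The leading columns survive unchanged and the attribute column is reduced to exon_number=5.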
EDIT: The OP changed the requirement, so here is a solution for the updated question.
awk -F";" 'match($0,/exon_number=[0-9]+/){val=$1;sub(/ ID.*/,"",val);print val,substr($0,RSTART,RLENGTH)}' Input_file
The following simple awk may help you here.
awk 'match($0,/exon_number=[0-9]+/){print substr($0,RSTART,RLENGTH)}' Input_file
Second solution: if your Input_file always has the same layout, simply print the field by position.
awk -F";" '{print $11}' Input_file
Given a file ./wordslist.txt with <word> <number_of_apparitions> such as :
aš toto 39626
ir 35938
tai 33361
tu 28520
kad 26213
...
How can I exclude the end-of-line digits in order to collect data in output.txt such as:
aš toto
ir
tai
tu
kad
...
Note :
sed, find, cut or grep preferred. I cannot use something that keeps only [a-z] characters, since my data can contain ASCII letters, non-ASCII letters, Chinese characters, digits, etc.
I suggest (this works only if the word itself never contains a space):
cut -d " " -f 1 wordslist.txt > output.txt
Or, to strip just the trailing number:
sed -E 's/ [0-9]+$//' wordslist.txt > output.txt
Use awk to print the first word in this case (for a multi-word entry such as aš toto, this prints only aš).
awk '{print $1}' your_file > your_new_file
awk solution to simply print input line excluding last column
$ awk '{NF--; print}' wordslist.txt
aš toto
ir
tai
tu
kad
Note:
This will only work in some awks. Per POSIX, incrementing NF adds a null field, but decrementing NF is undefined behavior (thanks to @EdMorton for the info)
This doesn't check if last column is numeric and field separation in output will be single space only
If there can be empty lines in input file, use awk 'NF{NF--}1'
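Given the caveats above, one way to avoid relying on NF-- is to delete the last field with sub() instead. This is a different technique from the answer's, offered as a sketch: it assumes fields are separated by spaces or tabs, and unlike NF-- it leaves the remaining spacing untouched and passes empty lines through.

```shell
# Two sample lines from the question, with an empty line in between.
trimmed=$(printf '%s\n' 'aš toto 39626' '' 'ir 35938' |
  awk '{ sub(/[ \t]+[^ \t]+$/, "") } 1')   # drop the last blank-separated field
printf '%s\n' "$trimmed"
```

Like NF--, this does not check that the removed column is numeric.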
The following works :
sed -r 's/ [0-9]+$//g' wordslist.txt
I have a file with multiple lines and for line 2 to the end of the file I want to swap fields 8 and 9. The file is comma separated and I'd like to do the swap inline so I can run it on a batch of files using * wildcard. If this can be accomplished similarly with awk then that works for me too.
example:
header1,header2,header3,...,header8,header9,...,headerN
field1.1,...,field1.9,field1.8,...,field1.N
field2.1,...,field2.9,field2.8,...,field2.N
field3.1,...,field3.9,field3.8,...,field3.N
...
I think the command would look similar to sed -r -i '2,$s/^(([^,]*,){8})([^,]*,)([^,]*,)(.*)/\1\3\2\4/' temp*.log,
but \2 is not what I expect, it is the 7th field. I know that \2 will not be the 8th field because I have double parentheses there, but I'm not sure how to fix it. Could somebody please explain what this expression is doing, and specifically what [^,] is doing and how the {8} is applied?
Thanks in advance.
In awk, you might use:
awk -F',' 'BEGIN {OFS=","} {t = $8; $8 = $9; $9 = t; print}'
In sed, the command is more convoluted, but it could be done.
sed -e 's/^\(\([^,]*,\)\{7\}\)\([^,]*,\)\([^,]*,\)/\1\4\3/'
Add the -i.bak option if your version of sed (e.g. GNU or BSD) supports it; written without a space before the suffix, the form is accepted by both.
This uses the universally available sed regexes (it would work on even archaic versions of sed). You could lose most of the backslashes if you used 'extended regular expressions' instead:
sed -r -i 's/^(([^,]*,){7})([^,]*,)([^,]*,)(.*)/\1\4\3\5/'
Note the nested remembered (captured) patterns. The outer set is \1, the inner set would be \2 but that gets repeated 7 times, so you'd have the seventh field as \2. Anyway, that's why the eighth and ninth columns are switched with \4 and \3. \5 are the remaining columns.
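The numbering of nested groups is easier to see on a tiny input. The sketch below uses a made-up four-field line and prints what each back-reference holds; the labels in the replacement are illustrative only.

```shell
# Made-up four-field line; {2} stands in for the {7} used above.
shown=$(printf '%s\n' 'a,b,c,d' |
  sed -E 's/^(([^,]*,){2})([^,]*),(.*)/1=[\1] 2=[\2] 3=[\3] 4=[\4]/')
printf '%s\n' "$shown"
```

Group 1 holds both repeated fields (a,b,), while group 2, being inside the repetition, holds only the last iteration (b,).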
(I note in passing that it would have been helpful to have some sample data in sufficiently the correct format to test with. It was a nuisance having to edit what is shown in the question to be able to test the code.)
If you need to do much CSV work, then either use Perl and its CSV modules (Text::CSV and Text::CSV_XS) or Python and its CSV module, or get CSVfix.
\2 is the second group in the regex.
Groups are numbered by the position of their opening (.
So in
'2,$s/^(([^,]*,){8})([^,]*,)([^,]*,)(.*)/\1\3\2\4/'
you can see (in order of the opening parentheses):
\1 = (([^,]*,){8})
\2 = ([^,]*,)
\3 = ([^,]*,)
\4 = ([^,]*,)
and finally \5 = (.*)
In this specific case, \2 holds the last of the eight repetitions matched by {8}.
It seems that awk is the right tool:
awk -F',' -v OFS=',' '{t=$8;$8=$9;$9=t}7' file
This might work for you (GNU sed):
sed -ri '1!s/(,[^,]*)(,[^,]*)/\2\1/4' file
This swaps the 9th field with the 8th: each match covers two comma-prefixed fields, so the 4th match (8 / 2 = 4) is fields 8 and 9. If you wanted to swap the 7th with the 8th:
sed -ri '1!{s/^/,/;s/(,[^,]*)(,[^,]*)/\2\1/4;s/^,//}' file
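Both variants can be checked on a made-up line whose fields are simply numbered 1 to 10. This sketch drops -i and the 1! address so that it works on a pipe; the numeric occurrence flag on s is standard sed.

```shell
# Made-up line: each field's value is its own position.
row='1,2,3,4,5,6,7,8,9,10'

# 4th ",x,y" pair is fields 8 and 9.
swap_8_9=$(printf '%s\n' "$row" | sed -E 's/(,[^,]*)(,[^,]*)/\2\1/4')

# Prepending a comma shifts the pairing by one, so the 4th pair is fields 7 and 8.
swap_7_8=$(printf '%s\n' "$row" | sed -E 's/^/,/; s/(,[^,]*)(,[^,]*)/\2\1/4; s/^,//')

printf '%s\n%s\n' "$swap_8_9" "$swap_7_8"
```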
I'm trying to read a text file (using perl) where each line has several records, like this:
r1c1 & r1c2 & r1c3 \\
r2c1 & r2c2 & r2c3 \\
So, & is the record separator.
The Perl help says this:
$ perl -h
-0[octal] specify record separator (\0, if no argument)
Why you would use octal number is beyond me. But 046 is the octal ASCII of the separator &, so I tried this:
perl -046 -ane 'print join ",", @F; print "\n"' file.txt
where the desired output would be
r1c1,r1c2,r1c3 \\
r2c1,r2c2,r2c3 \\
But it doesn't work. How do you do it right?
I think you are mixing two separate things. The record separator that -0 affects is what divides the input up into "lines". -a makes each "line" then be split into @F, by default on whitespace. To change what -a splits on, use the -F switch, like -F'&'.
When in doubt about the perl command-line options, look at perldoc perlrun.
Also, if you use the -l option (perl -F'&' -lane ...), it will remove the end-of-line (EOL) character from every line before passing it to your script and will add it back for each print, so you don't need to put "\n" in your code. The fewer characters in a one-liner, the better.