Input:
rs001 A C T G C G T T
rs002 C C T T G G A A
out1:
rs001 AC TG CG TT
rs002 CC TT GG AA
out2 :
rs001 1 1 1 2
rs002 2 2 2 2
Ok so basically I want to replace any two similar nucleotides (like AA, CC, TT, or GG) to 2 and any two different (like AT, TA, CG, .. etc) to 1 taking into account that the input should be converted first to out1 then to out2. Also we have so many fields (like 200 columns) in each row, so loops are needed here.
This is what I tried:
cat input | awk '{ for (x = 2; x <= NF; x = x+2) print $x$(x+1) }'
Results are so weird, so can anyone please tell me why I can't get out1 ?! What mistakes I did in awk loops ?
Thanks in advance
For the first,
sed 's/ \([ACGT]\) / \1/g' input >out1
This will remove the space after every other nucleitude. It matches a nucleotide with a space on both sides; the next match will pick up where the previous ended.
For the second,
sed 's/\([ACGT]\)\1/2/g;s/[ACGT][ACGT]/1/g' out1 >out2
This replaces two adjacent identical letters with 2, then any remaining adjacent two letters with 1.
This assumes you have Linux; other sed dialects may require minor modifications.
awk '{
out1 = out2 = $1
for (i=2;i<=NF;i+=2) {
out1 = out1 FS $i $(i+1)
out2 = out2 FS ($i == $(i+1) ? 2 : 1)
}
print out1 > "out1"
print out2 > "out2"
}' input
Here's how you fix your awk script to get output 1:
awk '{ printf "%s ", $1; for (x = 2; x <= NF; x = x + 2) {printf "%s%s ", $x, $(x+1)} printf "\n"}' input
print adds a new line at the end by default, so you'll have to use formatted strings printf to specify where exactly you want the new lines.
(Also added printf "%s ", $1; at the start to print the header at the start of each line)
Edit: Triplee's solution looks much more elegant than mine - you should ditch awk and go with his =)
This might work for you (GNU sed):
sed -re 's/ (.) / \1/g;w out1' -e 's/([ACTG])\1/2/g;s/[ACTG]./1/g' file >out2
Related
I have file with some text lines. I need to print lines 3-7 and 11 if it has two "b". I did
sed -n '/b\{2,\}/p' file but it printed lines where "b" occurs two times in a row
You can use
sed -n '3,7{/b[^b]*b/p};11{/b[^b]*b/p}' file
## that is equal to
sed -n '3,7{/b[^b]*b/p};11{//p}' file
Note that b[^b]*b matches b, then any zero or more chars other than b and then a b. The //p in the second part matches the most recent pattern , i.e. it matches the same b[^b]*b regex.
Note you might also use b.*b regex if you want, but the bracket expressions tend to word faster.
See an online demo, tested with sed (GNU sed) 4.7:
s='11bb1
b222b
b n b
ww
ee
bb
rrr
fff
999
10
11 b nnnn bb
www12'
sed -ne '3,7{/b[^b]*b/p};11{/b[^b]*b/p}' <<< "$s"
Output:
b n b
bb
11 b nnnn bb
Only lines 3, 6 and 11 are returned.
Just use awk for simplicity, clarity, portability, maintainability, etc. Using any awk in any shell on every Unix box:
awk '( (3<=NR && NR<=7) || (NR==11) ) && ( gsub(/b/,"&") >= 2 )' file
Notice how if you need to change a range, add a range, add other line numbers, change how many bs there are, add other chars and/or strings to match, add some completely different condition, etc. it's all absolutely clear and trivial.
For example, want to print the line if there's exactly either 13 or 27 bs instead of 2 or more:?
awk '( (3<=NR && NR<=7) || (NR==11) ) && ( gsub(/b/,"&") ~ /^(13|27)$/ )' file
Want to print the line if the line number is between 23 and 59 but isn't 34?
awk '( 23<=NR && NR<=59 && NR!=34 ) && ( gsub(/b/,"&") >= 2 )' file
Try making similar changes to a sed script. I'm not saying you can't force it to happen, but it's not nearly as trivial, clear, portable, etc. as it is using awk.
I hope to parse a txt file that looks like this:
A a, b, c
B e
C f, g
The format I hope to get is:
A a
A b
A c
B e
C f
C g
I tried this:
perl -ane '#s=split(/\,/, $F[1]); foreach $k (#s){print "$F[0] $k\n";}' txt.txt
but it only works when there's no space after commas. In the original file, there is a space after each comma. What should I do?
$ perl -lane 'print "$F[0] $_" for map { tr/,//rd } #F[1..$#F]' input.txt
A a
A b
A c
B e
C f
C g
Use auto-split mode on whitespace like normal, and for each element of an array slice of #F from the second field to the last one, remove any commas (I used tr//d, the more usual s/// works too, of course) and print it with the first field prepended.
Alternatively, don't use -a because it splits too much.
perl -le'#F = split(" ", $_, 2); print "$F[0] $_" for split(/,\s*/, $F[1])'
I have this Input:
1 a
a
2 b b
3 c
c
4 d d
5 e e
6 f
f
7 g
g
I want this output using sed command
1 a a
2 b b
3 c c
4 d d
5 e e
6 f f
7 g g
I'm trying this without success
sed '/^[^0-9]/ x; N; { s/\n/ / }; n' file
Another in awk:
$ awk 'BEGIN{RS=""}{for(i=1;i<=NF;i+=3)print $i,$(i+1),$(i+2)}' file
1 a a
2 b b
3 c c
4 d d
5 e e
6 f f
7 g g
Explained:
$ awk 'BEGIN {
RS="" # prime awk to read in a paragraph of data
}
{
for(i=1;i<=NF;i+=3) # jump forward 3 fields at a time
print $i,$(i+1),$(i+2) # print 3 fields
}' file
awk 'NR>1 && /^[0-9]/ {print substr(s,2); s=""} {s=s FS $0} END {print substr(s,2)}' file
NR>1 && /^[0-9]/: If a line is not the first and begins with a digit,
{print substr(s,2); s=""}: print "s" without the leading space, then clear it.
{s=s FS $0}: On every line, append the current line to the value of "s". FS is a space by default.
edit: Added END condition to catch last line, hated it, made a better separate answer.
Made it simpler with awk:
awk 'NF==2 {printf("%s ", $0); next} 1' file
Basically, "Don't print a newline if there are only exactly two fields."
This might work for you (GNU sed):
sed '/^[0-9]/{:a;N;s/\n\([^0-9]\)/ \1/;ta;P;D}' file
If the current line begins with an integer, append the following line. If that line does not begin with an integer, replace the newline by a space and repeat. Otherwise print/delete the first line in the pattern space and repeat.
I have a data file that needs a new column of identifiers from 1 to 5. The final purpose is to split the data into five separate files with no leftover file (split leaves a leftover file).
Data:
aa
bb
cc
dd
ff
nn
ww
tt
pp
with identifier column:
aa 1
bb 2
cc 3
dd 4
ff 5
nn 1
ww 2
tt 3
pp 4
Not sure if this can be done with seq? Afterwards it will be split with:
awk '$2 == 1 {print $0}'
awk '$2 == 2 {print $0}'
awk '$2 == 3 {print $0}'
awk '$2 == 4 {print $0}'
awk '$2 == 5 {print $0}'
Perl to the rescue:
perl -pe 's/$/" " . $. % 5/e' < input > output
Uses 0 instead of 5.
$. is the line number.
% is the modulo operator.
the /e modifier tells the substitution to evaluate the replacement part as code
i.e. end of line ($) is replaced with a space concatenated (.) with the line number modulo 5.
$ awk '{print $0, ((NR-1)%5)+1}' file
aa 1
bb 2
cc 3
dd 4
ff 5
nn 1
ww 2
tt 3
pp 4
No need for that to create 5 separate files of course. All you need is:
awk '{print > ("file_" ((NR-1)%5)+1)}' file
Looks like you're happy with a perl solution that outputs 1-4 then 0 instead of 1-5 so FYI here's the equivalent in awk:
$ awk '{print $0, NR%5}' file
aa 1
bb 2
cc 3
dd 4
ff 0
nn 1
ww 2
tt 3
pp 4
I am going to offer a Perl solution even though it wasn't tagged because Perl is well suited to solve this problem.
If I understand what you want to do, you have a single file that you want to split into 5 separate files based on the position of a line in the data file:
the first line in the data file goes to file 1
the second line in the data file goes to file 2
the third line in the data file goes to file 3
...
since you already have the lines position in the file, you don't really need the identifier column (though you could pursue that solution if you wanted).
Instead you can open 5 filehandles and simply alternate which handle you write to:
use strict;
use warnings;
my $datafilename = shift #ARGV;
# open filehandles and store them in an array
my #fhs;
foreach my $i ( 0 .. 4 ) {
open my $fh, '>', "${datafilename}_$i"
or die "$!";
$fhs[$i] = $fh;
}
# open the datafile
open my $datafile_fh, '<', $datafilename
or die "$!";
my $row_number = 0;
while ( my $datarow = <$datafile_fh> ) {
print { $fhs[$row_number++ % #fhs] } $datarow;
}
# close resources
foreach my $fh ( #fhs ) {
close $fh;
}
I've got data in a large file (280 columns wide, 7 million lines long!) and I need to swap the first two columns. I think I could do this with some kind of awk for loop, to print $2, $1, then a range to the end of the file - but I don't know how to do the range part, and I can't print $2, $1, $3...$280! Most of the column swap answers I've seen here are specific to small files with a manageable number of columns, so I need something that doesn't depend on specifying every column number.
The file is tab delimited:
Affy-id chr 0 pos NA06984 NA06985 NA06986 NA06989
You can do this by swapping values of the first two fields:
awk ' { t = $1; $1 = $2; $2 = t; print; } ' input_file
I tried the answer of perreal with cygwin on a windows system with a tab separated file. It didn't work, because the standard separator is space.
If you encounter the same problem, try this instead:
awk -F $'\t' ' { t = $1; $1 = $2; $2 = t; print; } ' OFS=$'\t' input_file
Incoming separator is defined by -F $'\t' and the seperator for output by OFS=$'\t'.
awk -F $'\t' ' { t = $1; $1 = $2; $2 = t; print; } ' OFS=$'\t' input_file > output_file
Try this more relevant to your question :
awk '{printf("%s\t%s\n", $2, $1)}' inputfile
This might work for you (GNU sed):
sed -i 's/^\([^\t]*\t\)\([^\t]*\t\)/\2\1/' file
Have you tried using the cut command? E.g.
cat myhugefile | cut -c10-20,c1-9,c21- > myrearrangedhugefile
This is also easy in perl:
perl -pe 's/^(\S+)\t(\S+)/$2\t$1/;' file > outputfile
You could do this in Perl:
perl -F\\t -nlae 'print join("\t", #F[1,0,2..$#F])' inputfile
The -F specifies the delimiter. In most shells you need to precede a backslash with another to escape it. On some platforms -F automatically implies -n and -a so they can be dropped.
For your problem you wouldn't need to use -l because the last columns appears last in the output. But if in a different situation, if the last column needs to appear between other columns, the newline character must be removed. The -l switch takes care of this.
The "\t" in join can be changed to anything else to produce a different delimiter in the output.
2..$#F specifies a range from 2 until the last column. As you might have guessed, inside the square brackets, you can put any single column or range of columns in the desired order.
No need to call anything else but your shell:
bash> while read col1 col2 rest; do
echo $col2 $col1 $rest
done <input_file
Test:
bash> echo "first second a c d e f g" |
while read col1 col2 rest; do
echo $col2 $col1 $rest
done
second first a b c d e f g
Maybe even with "inlined" Python - as in a Python script within a shell script - but only if you want to do some more scripting with Bash beforehand or afterwards... Otherwise it is unnecessarily complex.
Content of script file process.sh:
#!/bin/bash
# inline Python script
read -r -d '' PYSCR << EOSCR
from __future__ import print_function
import codecs
import sys
encoding = "utf-8"
fn_in = sys.argv[1]
fn_out = sys.argv[2]
# print("Input:", fn_in)
# print("Output:", fn_out)
with codecs.open(fn_in, "r", encoding) as fp_in, \
codecs.open(fn_out, "w", encoding) as fp_out:
for line in fp_in:
# split into two columns and rest
col1, col2, rest = line.split("\t", 2)
# swap columns in output
fp_out.write("{}\t{}\t{}".format(col2, col1, rest))
EOSCR
# ---------------------
# do setup work?
# e. g. list files for processing
# call python script with params
python3 -c "$PYSCR" "$inputfile" "$outputfile"
# do some more processing
# e. g. rename outputfile to inputfile, ...
If you only need to swap the columns for a single file, then you can also just create a single Python script and statically define the filenames. Or just use an answer above.
awk swapping sans temp-variable :
echo '777777744444444464449: 317 647 14423 262927714037 : 0x2A29D5A1BAA7A95541' |
mawk '1; ($1 = $2 substr(_, ($2 = $1)^_))^_' FS=':' OFS=':'
777777744444444464449: 317 647 14423 262927714037 : 0x2A29D5A1BAA7A95541
317 647 14423 262927714037 :777777744444444464449: 0x2A29D5A1BAA7A95541