I have a file that contains lines in the following format. I would like to keep only the first column and the column containing a string in the format NC_XXXX.1.
484-2117 16 gi|9634679|ref|NC_002188.1| 188705 23 21M * 0 0 CGCGTACCAAAAGTAATAATT IIIIIIIIIIIIIIIIIIIII AS:i:-6 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:0G20 YT:Z:UU
787-1087 16 gi|21844535|ref|NC_004068.1| 7006 23 20M * 0 0 CTATACAACCTACTACCTCA IIIIIIIIIIIIIIIIIIII AS:i:-6 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:19T0 YT:Z:UU
.....
....
...
output:
484-2117 NC_002188.1
787-1087 NC_004068.1
Something like this in perl:
#!/usr/bin/env perl
use strict;
use warnings;
while (<DATA>) {
    my ( $id, $nc ) = m/^([\d\-]+).*(NC_[\d\.]+)/;
    print "$id $nc\n";
}
__DATA__
484-2117 16 gi|9634679|ref|NC_002188.1| 188705 23 21M * 0 0 CGCGTACCAAAAGTAATAATT IIIIIIIIIIIIIIIIIIIII AS:i:-6 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:0G20 YT:Z:UU
787-1087 16 gi|21844535|ref|NC_004068.1| 7006 23 20M * 0 0 CTATACAACCTACTACCTCA IIIIIIIIIIIIIIIIIIII AS:i:-6 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:19T0 YT:Z:UU
Output:
484-2117 NC_002188.1
787-1087 NC_004068.1
Which reduces to a one-liner:
perl -ne 'm/^([\d\-]+).*(NC_[\d\.]+)/ and print "$1 $2\n"' yourfile
Note - this specifically matches a first column made up of numbers and dashes; you could loosen it with a wider regex match.
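For instance, a looser sketch that accepts any non-blank token as the first column (same placeholder file name as above, so treat it as an illustration rather than a drop-in):
perl -ne 'm/^(\S+).*(NC_[\d.]+)/ and print "$1 $2\n"' yourfile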
awk to the rescue!
$ awk -F' +|\\|' '{for(i=2;i<=NF;i++) if($i ~ /^NC_[0-9.]+$/) {print $1,$i; next}}' file
484-2117 NC_002188.1
787-1087 NC_004068.1
If the separator is a tab character, you need to add it to the delimiter list:
$ awk -F' +|\\||\t' ...
With perl:
perl -anE'say "$F[0] ",(split /\|/, $F[2])[3]' file
or awk:
awk -F'\\|| +' '{print $1,$6}' file
Using GNU awk, the below could be a solution:
awk '{printf "%s %s\n",$1,gensub(/.*(NC_.*\.1).*/,"\\1",1,$0)}' file
Output
484-2117 NC_002188.1
787-1087 NC_004068.1
A more restrictive version would be
awk '{printf "%s %s\n",$1,gensub(/.*(NC_[[:digit:]]*\.1).*/,"\\1",1,$0)}' file
awk -F'[ |]' '{print $1,$10}' file
484-2117 NC_002188.1
787-1087 NC_004068.1
Related
I would like to search for an exact word that contains the special character /, like the word "/var/tmp".
I found some examples of how to search for a whole word, like this: sed '/\b$word\b/g'
But that works only on standard words, not on ones containing metacharacters.
Any ideas, please?
INPUT>
word="/var/tmp"
cat /etc/mtab | grep $word
/dev/mapper/vgroot-lvvar /tmp/var/tmp ext4 rw,seclabel,relatime 0 0
/dev/mapper/vgroot-lv_var_tmp /var/tmp xfs rw,seclabel,relatime,attr2,inode64,noquota 0 0
/dev/mapper/vgroot-lv_var_tmp /tmp/var/tmp xfs rw,seclabel,relatime,attr2,inode64,noquota 0 0
The OUTPUT works with the exact literal pattern, but not with a general variable $word
cat /etc/mtab | grep $word | sed '/\b\/var\/tmp\b/g'
/dev/mapper/vgroot-lv_var_tmp /var/tmp xfs rw,seclabel,relatime,attr2,inode64,noquota 0 0
With the variable the output is wrong; the filter does nothing here
cat /etc/mtab | grep $word | sed '/\b\$word\b/g'
/dev/mapper/vgroot-lvvar /tmp/var/tmp ext4 rw,seclabel,relatime 0 0
/dev/mapper/vgroot-lv_var_tmp /var/tmp xfs rw,seclabel,relatime,attr2,inode64,noquota 0 0
/dev/mapper/vgroot-lv_var_tmp /tmp/var/tmp xfs rw,seclabel,relatime,attr2,inode64,noquota 0 0
You can use sed with a different delimiter. To use it in address position you need to escape the very first one: \%.
Also, since your word can start (and probably end) with / (a non-word character), you need to handle the word boundaries manually, not with \b. Otherwise you will not get a match.
Below is a working example:
$ cat txt
asd fasd /asd fas
asd asd /asd/tmp/ asdf
/asd/tmp/ asdf
asdf
sadf asd asd/tmp
$ echo $word
/asd/tmp/
$ sed -nE "\%(^|\s)$word(\s|$)%p" txt
asd asd /asd/tmp/ asdf
/asd/tmp/ asdf
-n - tells sed not to print every line by default
-E - use extended regexp syntax
\% - a delimiter which shouldn't appear in $word
(^|\s) - match the beginning of the line or whitespace
(\s|$) - match whitespace or the end of the line
p - tells sed to print matching lines
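If $word itself can contain other characters that are special in regular expressions (a literal dot, for instance), a rough extra safeguard is to escape them before interpolating the variable; a sketch, assuming a POSIX shell, the same txt file, and that $word contains no % (the chosen delimiter):
$ esc=$(printf '%s' "$word" | sed 's/[][\.*^$()+?{}|]/\\&/g')
$ sed -nE "\%(^|\s)$esc(\s|$)%p" txt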
I am a bit new to Perl and wish to use it to extract reads of a specific length from my BAM (alignment) file.
The BAM file contains reads whose lengths range from 19 to 29 nt.
Here is an example of the first 2 reads:
YT:Z:UUA00182:193:HG2NLDMXX:1:1101:29884:1078 0 3R 6234066 42 22M * 0 0 TCACTGGGCTTTGTTTATCTCA FF:FFFF,FFFFFFFF:FFFFF AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:22
YT:Z:UUA00182:193:HG2NLDMXX:1:1101:1777:1094 16 4 1313373 1 24M * 0 0 TCGCATTCTTATTGATTTTCCTTT FFFFFFF,FFFFFFFFFFFFFFFF AS:i:0 XS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:24
I want to extract only those, which are, let's say, 21 nt in length.
I try to do this with the following code:
my $string = <STDIN>;
$length = samtools view ./file.bam | head | perl -F'\t' -lane'length @F[10]';
if ($length == 21){
print($string)
}
However, the program does not give any result...
Could anyone please suggest the right way of doing this?
Your question is a bit confusing. Is the code snippet supposed to be a Perl script or a shell script that calls a Perl one-liner?
Assuming that you meant to write a Perl script into which you pipe the output of samtools view:
#!/usr/bin/perl
use strict;
use warnings;
while (<STDIN>) {
    my @fields = split("\t", $_);
    # debugging, just to see what field is extracted...
    print "'$fields[10]' ", length($fields[10]), "\n";
    if (length($fields[10]) == 21) {
        print $_;
    }
}
exit 0;
With your test data in dummy.txt I get:
# this would be "samtools view ./file.bam | head | perl dummy.pl" in your case?
$ cat dummy.txt | perl dummy.pl
'FF:FFFF,FFFFFFFF:FFFFF' 22
'FFFFFFF,FFFFFFFFFFFFFFFF' 24
Your test data doesn't contain a sample with length 21 though, so the if clause is never executed.
Note that the 10th field in your sample input has a length of either 22 or 24. Also, the syntax you use is wrong. Here is a Perl one-liner to match the field with length 22:
$ cat pkom.txt
YT:Z:UUA00182:193:HG2NLDMXX:1:1101:29884:1078 0 3R 6234066 42 22M * 0 0 TCACTGGGCTTTGTTTATCTCA FF:FFFF,FFFFFFFF:FFFFF AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:22
YT:Z:UUA00182:193:HG2NLDMXX:1:1101:1777:1094 16 4 1313373 1 24M * 0 0 TCGCATTCTTATTGATTTTCCTTT FFFFFFF,FFFFFFFFFFFFFFFF AS:i:0 XS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:24
$ perl -lane ' print if length($F[9])==22 ' pkom.txt
YT:Z:UUA00182:193:HG2NLDMXX:1:1101:29884:1078 0 3R 6234066 42 22M * 0 0 TCACTGGGCTTTGTTTATCTCA FF:FFFF,FFFFFFFF:FFFFF AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:22
$
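Putting it together for the 21 nt reads asked about originally, a sketch (assuming samtools is on the PATH and file.bam is the alignment file from the question):
$ samtools view ./file.bam | perl -lane 'print if length($F[9]) == 21'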
I have a file that has around 500 rows and 480K columns, and I need to move columns 2, 3 and 4 to the end. My file is comma-separated; is there a quick way to do this using awk or sed?
You can try the solution below:
perl -F"," -lane 'print "@F[0]"," ","@F[4..$#F]"," ","@F[1..3]"' input.file
You can copy the columns to the end easily; actually moving them will take too long with 480K columns.
$ awk 'BEGIN{FS=OFS=","} {print $0,$2,$3,$4}' input.file > output.file
Another technique, just bash:
while IFS=, read -r a b c d e; do
echo "$a,$e,$b,$c,$d"
done < file
Testing with 5 fields:
$ cat foo
1,2,3,4,5
a,b,c,d,e
$ cat program.awk
{
$6=$2 OFS $3 OFS $4 OFS $1 # copy fields to the end and $1 too
sub(/^([^,],){4}/,"") # remove 4 first columns
$1=$5 OFS $1 # catenate current $5 (was $1) to $1
NF=4 # reduce NF
} 1 # print
Run it:
$ awk -f program.awk FS=, OFS=, foo
1,5,2,3,4
a,e,b,c,d
So theoretically this should work:
{
$480001=$2 OFS $3 OFS $4 OFS $1
sub(/^([^,],){4}/,"")
$1=$480000 OFS $1
NF=479999
} 1
EDIT: It did work.
Perhaps perl:
perl -F, -lane 'print join(",", @F[0,4..$#F,1,2,3])' file
or
perl -F, -lane '@x = splice @F, 1, 3; print join(",", @F, @x)' file
Another approach: regular expressions
perl -lpe 's/^([^,]+)(,[^,]+,[^,]+,[^,]+)(.*)/$1$3$2/' file
Timing it with a 500-line file, each line containing 480,000 fields:
$ time perl -F, -lane 'print join(",", @F[0,4..$#F,1,2,3])' file.csv > file2.csv
40.13user 1.11system 0:43.92elapsed 93%CPU (0avgtext+0avgdata 67960maxresident)k
0inputs+3172752outputs (0major+16088minor)pagefaults 0swaps
$ time perl -F, -lane '@x = splice @F, 1, 3; print join(",", @F, @x)' file.csv > file2.csv
34.82user 1.18system 0:38.47elapsed 93%CPU (0avgtext+0avgdata 52900maxresident)k
0inputs+3172752outputs (0major+12301minor)pagefaults 0swaps
And pure text manipulation is the winner
$ time perl -lpe 's/^([^,]+)(,[^,]+,[^,]+,[^,]+)(.*)/$1$3$2/' file.csv > file2.csv
4.63user 1.36system 0:20.81elapsed 28%CPU (0avgtext+0avgdata 20612maxresident)k
0inputs+3172752outputs (0major+149866minor)pagefaults 0swaps
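Since the question also mentions sed, the same pure text manipulation can be written as a sed one-liner; a sketch that should behave like the winning perl version, though I have not timed it against the 480K-column file:
$ sed -E 's/^([^,]+)(,[^,]+,[^,]+,[^,]+)(.*)/\1\3\2/' file.csv > file2.csv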
I have an input flat file like this with many rows:
Apr 3 13:30:02 aag8-ca-acs01-en2 CisACS_01_PassedAuth p1n5ut5s 1 0 Message-Type=Authen OK,User-Name=joe7@it.test.com,NAS-IP-Address=4.196.63.55,Caller-ID=az-4d-31-89-92-90,EAP Type=17,EAP Type Name=LEAP,Response Time=0,
Apr 3 13:30:02 aag8-ca-acs01-en2 CisACS_01_PassedAuth p1n6ut5s 1 0 Message-Type=Authen OK,User-Name=bobe@jg.test.com,NAS-IP-Address=4.197.43.55,Caller-ID=az-4d-4q-x8-92-80,EAP Type=17,EAP Type Name=LEAP,Response Time=0,
Apr 3 13:30:02 abg8-ca-acs01-en2 CisACS_01_PassedAuth p1n4ut5s 1 0 Message-Type=Authen OK,User-Name=jerry777@it.test.com,NAS-IP-Address=7.196.63.55,Caller-ID=az-4d-n6-4e-y2-90,EAP Type=17,EAP Type Name=LEAP,Response Time=0,
Apr 3 13:30:02 aca8-ca-acs01-en2 CisACS_01_PassedAuth p1n4ut5s 1 0 Message-Type=Authen OK,User-Name=frc777o.@it.test.com,NAS-IP-Address=4.196.263.55,Caller-ID=a4-4e-31-99-92-90,EAP Type=17,EAP Type Name=LEAP,Response Time=0,
Apr 3 13:30:02 aag8-ca-acs01-en2 CisACS_01_PassedAuth p1n4ut5s 1 0 Message-Type=Authen OK,User-Name=frc77@xed.test.com,NAS-IP-Address=4.136.163.55,Caller-ID=az-4d-4w-b5-s2-90,EAP Type=17,EAP Type Name=LEAP,Response Time=0,
I'm trying to grep the email addresses from the input file to see if they already exist in the master file.
Master flat file looks like this:
a44e31999290;frc777o.@it.test.com;20150403
az4d4qx89280;bobe@jg.test.com;20150403
0dbgd0fed04t;rrfuf@us.test.com;20150403
28cbe9191d53;rttuu4en@us.test.com;20150403
az4d4wb5s290;frc77@xed.test.com;20150403
d89695174805;ccis6n@cn.test.com;20150403
If the email doesn't exist in master I want a simple count.
So using the examples I hope to see: count=3, because bobe@jg.test.com and frc77@xed.test.com already exist in master but the others don't.
I tried various combinations of grep; the example below is from my last tests, but it is not working. I'm using grep within a Perl script to first capture the emails and then count them, but all I really need is the count of emails from the input file that don't exist in the master.
grep -o -P '(?<=User-Name=\).*(?=,NAS-IP-)' $infile $mstr > $new_emails;
Any help would be appreciated, Thanks.
I would use this approach in awk:
$ awk 'FNR==NR {FS=";"; a[$2]; next}
{FS="[,=]"; if ($4 in a) c++}
END{print c}' master file
3
This works by setting different field separators and storing/matching the emails, then printing the final count.
For the master file we use ; and get the 2nd field:
$ awk -F";" '{print $2}' master
frc777o.@it.test.com
bobe@jg.test.com
rrfuf@us.test.com
rttuu4en@us.test.com
frc77@xed.test.com
ccis6n@cn.test.com
For the data file, file (the one with all the info), we use either , or = and get the 4th field:
$ awk -F[,=] '{print $4}' file
joe7@it.test.com
bobe@jg.test.com
jerry777@it.test.com
frc777o.@it.test.com
frc77@xed.test.com
I think the below does what you want as a one-liner with diff and perl:
diff <( perl -F';' -anE 'say @F[1]' master | sort -u ) <( perl -pe 'm/User-Name=([^,]+),/; $_ = "$1\n"' data | sort -u ) | grep '^>' | perl -pe 's/> //;'
The diff <( command_a | sort -u ) <( command_b | sort -u ) | grep '^>' construct gives you the set difference of the two commands' outputs.
perl -F';' -anE 'say @F[1]' just splits each line of the file on ';' and prints the second field on its own line.
perl -pe 'm/User-Name=([^,]+),/; $_ = "$1\n"' extracts the specific field you wanted, ignoring the surrounding key=, and prints it on its own line implicitly.
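If all you actually need is the count of addresses that are not yet in the master, one hedged variation on the same set-difference idea uses comm and wc instead of diff (same master and data file names assumed; grep -P needs GNU grep):
$ comm -13 <(cut -d';' -f2 master | sort -u) <(grep -oP 'User-Name=\K[^,]+' data | sort -u) | wc -l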
I have values spread over two rows and want to collect all of the values and turn them into variables.
The output is from EMC storage:
Bus 0 Enclosure 0 Disk 0
State: Enabled
Bus 0 Enclosure 0 Disk 1
State: Enabled
Expected result:
Bus:0|Enclosure:0|Disk:0|State:Enabled
Or I just need somebody to point me in the right direction for getting the last row ...
This might work for you (GNU sed):
sed '/^Bus/!d;N;s/[0-9]\+/:&|/g;s/\s//g' file
To get only the last row:
sed '/^Bus/{N;h};$!d;x;s/[0-9]\+/:&|/g;s/\s//g' file
Try this awk:
$ awk '/^Bus/{for(i=1;i<=NF;i+=2) printf "%s:%s|", $i,$(i+1)}/^State/{printf "%s%s\n", $1, $2}' file
Bus:0|Enclosure:0|Disk:0|State:Enabled
Bus:0|Enclosure:0|Disk:1|State:Enabled
To handle multiple words in the last field, you can do:
$ awk '/^Bus/{for(i=1;i<=NF;i+=2) printf "%s:%s|", $i,$(i+1)}/^State/{printf "%s", $1; for (i=2;i<=NF;i++) printf "%s ", $i; print ""}' file
Bus:0|Enclosure:0|Disk:0|State:Enabled
Bus:0|Enclosure:0|Disk:1|State:hot space
perl -00anE 's/:// for @F; say join "|", map { $_%2 ? () : "$F[$_]:$F[$_+1]" } 0..$#F' file
output
Bus:0|Enclosure:0|Disk:0|State:Enabled
Bus:0|Enclosure:0|Disk:1|State:Enabled
With GNU awk you could do:
$ awk 'NR>1{$6=$6$7;NF--;print RS,$0}' RS='Bus' OFS='|' file
Bus|0|Enclosure|0|Disk|0|State:Enabled
Bus|0|Enclosure|0|Disk|1|State:Enabled
And for the last row only:
$ awk 'END{$6=$6$7;NF--;print RS,$0}' RS='Bus' OFS='|' file
Bus|0|Enclosure|0|Disk|1|State:Enabled