Add dipeptide frequency in this perl script based on sequence length

Add dipeptide frequency in this perl script based on sequence length - perl

I have a perl script to get the di-peptide counts (there are 400 combinations, for example- AA, AC, AD, AE...) from sequences (fasta format). But I would like to add the frequency based on the sequence lengths. I have a input with multiple sequences (myfile.fasta).
I tried to do it, but I got the wrong results. Im am not very familiar with perl.
My script:
use strict;
use warnings;
use Bio::SeqIO;
my #amino=qw/A C D E F G H I K L M N P Q R S T V W Y/;
my #comb=();
foreach my $a (#amino){
foreach my $b (#amino){
push (#comb,$a.$b)
}
}
my $in = Bio::SeqIO->new(-file => "myfile.fasta" , '-format' => 'Fasta');
while ( my $seq= $in->next_seq ) {
my #dipeps=($seq->seq()=~/(?=(.{2}))/g);
my %di_count=();
$di_count{$_}++ for #dipeps;
print $seq->id();
map{exists $di_count{$_}?print " ",$di_count{$_}:print " ",0}sort #comb;
print "\n";
}
I tried:
map{exists $di_count{$_}?print " ",$di_count{$_}:print " ",0}sort #comb/length;
map{exists $di_count{$_}?print " ",$di_count{$_}:print " ",0/length}sort #comb;
I also tried to define the length, such as:
my $seq_len = length($seq);
Also, I do not want to define the input file in the script, I would like to define like "perl script.pl input.fasta > result.txt". For that I should use:
open (S, "$ARGV[0]") || die "cannot open FASTA file to read: $!";

This is pretty ugly code (should be rewritten entirely), but I think you want:
my $length = #dipeps;
map{exists $di_count{$_}?print " ",$di_count{$_}/$length:print " ",0}sort #comb;

Related

Looking for specific text and save it into variable

I am looking for the easiest way to look for text in huge file and save it into same variables for later use.
The file format is:
>gi|24585363|ref|NP_724239.1| short neuropeptide F precursor [Drosophila melanogaster]
MFHLKRELSQGCALALICLVSLQMQQPAQAEVSSAQGTPLSNLYDNLLQREYAGPVVFPNHQVERKAQRS
PSLRLRFGRSDPDMLNSIVEKRWFGDVNQKPIRSPSLRLRFGRRDPSLPQMRRTAYDDLLERELTLNSQQ
QQQQLGTEPDSDLGADYDGLYERVVRKPQRLRWGRSVPQFEANNADNEQIERSQWYNSLLNSDKMRRMLV
ALQQQYEIPENVASYANDEDTDTDLNNDTSEFQREVRKPMRLRWGRSTGKAPSEQKHTPEETSSIPPKTQ
N
>gi|442619471|ref|NP_001262643.1| neuropeptide F, isoform C [Drosophila melanogaster]
MCQTMRCILVACVALALLAAGCRVEASNSRPPRKNDVNTMADAYKFLQDLDTYYGDRARVRFGKRGSLMD
ILRNHEMDNINLGKNANNGGEFARGFNEEEIF
>gi|442619469|ref|NP_001262642.1| neuropeptide F, isoform B [Drosophila melanogaster]
MCQTMRCILVACVALALLAAGCRVEASNSRPPRKNDVNTMADAYKFLQDLDTYYGDRARVRFGKRGSLMD
ILRNHEMDNINLGKNANNGGEFARGFNEEEIF
every sequence starts with ">"
I tried this:
open (FILE, $fastaFile);
while (<FILE>) {
chomp;
($name, $name2) = split(/:/);
print "Name: $name\n";
print "Name2: $name2\n";
} close (FILE);
exit;
I never needed to look for specific text. Perhaps it would be easy to use grep only I don´t know that.
The bigest problem for me is that I have in another file result from my other program and I need to find these results in another file.
My main program gave me these results:
>gi|24585363|ref|NP_724239.1|
>gi|442619469|ref|NP_001262642.1|
and I need to find it in the second file and save this into $name and into $sequence put the sequnces for that name.:
$name = ">gi|24585363|ref|NP_724239.1|"
$sequnce = "MFHLKRELSQGCALALICLVSLQMQQPAQAEVSSAQGTPLSNLYDNLLQREYAGPVVFPNHQVERKAQRS
PSLRLRFGRSDPDMLNSIVEKRWFGDVNQKPIRSPSLRLRFGRRDPSLPQMRRTAYDDLLERELTLNSQQ
QQQQLGTEPDSDLGADYDGLYERVVRKPQRLRWGRSVPQFEANNADNEQIERSQWYNSLLNSDKMRRMLV
ALQQQYEIPENVASYANDEDTDTDLNNDTSEFQREVRKPMRLRWGRSTGKAPSEQKHTPEETSSIPPKTQ
N"
Can anyone give me advice on how to proceed? If use grep or some another way to get there.

This should help get you what you want. You will need to install Bio Perl
#!/usr/bin/perl
use warnings;
use strict;
use Bio::SeqIO;
my $seqFile = Bio::SeqIO->new('-format' => 'fasta', '-file' => 'myFasta.fasta');
while((my $seqObj = $seqFile->next_seq())){
print "Seen Sequence " . $seqObj->display_id . "\n";
print "Sequence: " . $seqObj->seq() . "\n";
}

How can I put N's in my final string

I'm beginner in perl so im have some problems writing a script.
I want a script that put the letter N one certain number of times with basis in a length that I previous check. This Ns have to be in the final of a string inside a .txt. This strings begin with a > and have that 'face':
A1_23ABR2014_53_CC07.P10R_E07_009.ab1
attgccttttgctagcttatagaataataattcatataaacaaaaaatat
tttatattatttaaaaataaataaaccaaataaagtcattgttgatccaa
ttgaacaaatcatattccatccatttaaagcgtctggataatcaggaata
cgtctaggcattacattaaatccaagaaaatgcataggtaagaatgttaa
I already wrote that, but I don't know how to do next.
if $qend > $sendi{
my $leg1 = $qendi - $sendi;
open(my #final, '>>', 'contiggeral.fasta') or die;
while (N < $leg1) {
do N++ in #nomecontig
}
Thanks and sorry for my bad english.

The condition if non-modifier if must be enclosed in parentheses. Variables must start with a sigil (N has none). There is no in operator in Perl.
my $string = 'abc';
my $final_length = 20;
$string .= 'N' x ($final_length - length $string);
print $string, "\n";

calculating velocity from massive simulation data

I have simulation data for the velocity of water molecules. The format of the data is as below. I would like to describe the format of the data for clarity purposes, and it easily would lead to what I want to calculate.
A water molecule is made of three atoms: Oxygen(O) and two Hydrogen (H). Here I would name them O, H1, and H2.
The data below starts with line title 0 and the number 4335, saying it contains 4335 atoms (4335/3 = 1445 water molecules).
The first three numbers starting from the third row ( 0.0923365 0.0341984 -0.1248516 ) representing velocity for oxygen (O) atom at three Cartesian directions Ox, Oy, Oz. The next three numbers, in the same row representing velocities for hydrogen (H1) ==> H1x, H1y, H1z. And finally the first three numbers in fourth row representing velocities for hydrogen (H2) ==> H2x,H2y,H2z. finally, the following three numbers in the same fourth row representing velocities for oxygen atom.
These sequence is goes on for all 4335 atoms in 2170 lines including the top two lines in the data file and it repeats for the following section starting from title 1.
title 0
4335 2.0001000e+04
0.0923365 0.0341984 -0.1248516 -0.8946258 1.6688854 0.8259304
0.2890579 0.8051153 -1.5612963 0.0625492 -0.1361579 0.2869132
0.2343408 -0.0665305 1.0745378 -0.8375892 0.6953992 0.5149021
-0.1628550 0.0131844 0.0688080 0.2429340 0.2168210 -0.0289806
-0.3677613 0.2054004 -0.1511643 -0.3487551 -0.1454157 0.0801884
-0.9039297 -0.0682939 -0.2337404 -0.5605327 -0.0369157 0.2243892
-0.3100274 -0.2673132 -0.2093299 0.1975043 -0.4572202 -0.8410826
-0.6995287 -0.4123909 0.0649209 -0.1910519 0.2289656 0.2443295
-0.0279093 0.5790939 -0.0104249 -1.1961776 -0.5387340 0.1445187
-0.3188485 0.3789352 -0.0112114 0.7831523 0.6043882 -0.7131590
-0.7214440 -0.5358508 -0.3035673 -0.1549275 -0.1402387 -0.0101964
-0.2027608 1.5107149 0.2963312 -1.5104872 -0.1554981 -1.3323215
0.1097982 -0.1553742 0.3803437 0.0816858 0.0265007 0.4215823
0.1157368 0.2100116 0.4712551 0.1799426 -0.1260255 -0.2131755
0.1811777 -0.9442581 -0.6036636 0.9681703 -0.1523646 -0.3502441
0.0976771 0.0019619 -0.1832204 -0.0055989 0.2701100 -0.4416720
0.8496723 0.4070951 -0.0819204 0.1156806 -0.1619873 -0.0016126
-0.4051959 0.4263505 -0.9460036 0.4412067 0.1002270 0.5864405
-0.3831136 0.3240860 -0.0005143 -0.5667163 0.2618876 0.0103317
-0.6442209 0.3965833 -0.0778050 -0.2404238 -0.1339887 -0.1662417
0.3421198 0.7480828 -1.8316993 -0.4454920 -0.0499657 -0.1951254
-0.2895359 -0.1934811 -0.2674928 0.1255802 1.3522828 -0.2829485
-0.4129106 -0.6842645 -1.0147657 -0.1278501 -0.0597648 -0.1478294
-0.2519974 0.0665314 -0.0690079 -0.0480210 -0.1179547 -0.2091919
-0.1942484 0.2583650 -0.0734658 -0.1216313 0.5158040 -0.0676843
-0.3063602 0.8148463 -0.1959571 -0.1009838 -0.3394633 -0.0866587
.
. (goes on until line 2170)
.
0.1028815 -0.0844088 -0.2156557 -0.1698745 -0.2018967 -0.3863209
0.1793070 -0.1005802 0.1800752 -0.1404713 0.2216020 0.2236271
0.5192825 -0.7398186 0.0418758 0.0347715 -0.3457840 -0.1300237
-0.3089482 1.1125441 -0.4020403 0.2739744 -0.9062766 0.0012294
0.1498538 0.0883857 -0.0094638 0.0963565 -1.1027019 0.0115313
-0.0432824 0.3330713 0.0304943
title 1
4335 2.0002000e+04
-0.2082078 0.1774843 -0.1023302 -0.1100437 0.5973607 1.0627041
-0.2216015 0.0448885 -0.8415924 0.1691296 0.6008261 -0.0373434
0.9387534 -0.3642305 0.6756270 -0.6000357 0.6632088 1.0567899
-0.3234407 -0.1781680 -0.1936070 -0.4799916 -0.1522612 -0.2347461
0.1045985 0.1999704 -0.1482928 -0.0439331 0.0413923 0.1605458
0.3403952 -0.2012104 0.4851457 -0.9665228 0.2202362 0.0046218
.
. (goes on until line 2170)
.
What I want to calculate is the resultant velocity for each molecule and I would like to do this using Perl. The algorithm goes in this way.
First store the velocities for oxygen (O) and hydrogens (H1 & H2) in Ox,Oy,Oz, H1x,H1y,H1z and H2x,H2y,H2z respectively.
Next we define:
velocity_x = Ox + Hx + Hx
velocity_y = Oy + Hy + Hy
velocity_z = Oz + Hz + Hz
Finally calculate
resultant_velocity = sqrt(velocity_x**2 + velocity_y**2 + velocity_z**2)
and store the "resultant_velocity" into new file (the file should be title_0.dat). And the program shall calculate the velocities starting from title 1 until title 200 in the file.
I am a newbie at Perl, but I would like to do this operation in Perl since I find that it is very interesting. I can write simple "read and write" operations in Perl but found no idea how to split and assign the values to the variables and carryout the calculation though the calculation is high school standard.
#!/usr/bin/perl -w
$data_file="malto.dat";
open(DAT, $data_file) || die("Could not open file!");
#raw_data=<DAT>;
close(DAT);
while(<#raw_data>){
#columns=split /\s+/,$_;
if($columns[0]=~ m/ATOM/){
print "$columns[5], $columns[6], $columns[7]\n";
}
}
I would like to get some guidance from experts so that I can enhance my understanding of Perl while working on the code.
Appreciate any help.
Regards

Perhaps the following will assist you:
use strict;
use warnings;
use Math::Complex;
my $dataFile = 'malto.dat';
{
local $/ = 'title ';
open my $fh, '<', $dataFile or die $!;
while (<$fh>) {
chomp;
my #data = split or next;
my $titleNum = 'Title ' . shift #data;
my $atom = shift(#data) . ' ' . shift #data;
my $resultantVel = calcResultantVel( \#data );
print $titleNum, "\n";
print $atom, "\n";
print 'ResultantVel: ' . $resultantVel, "\n\n";
}
close $fh;
}
sub calcResultantVel {
my ($dataRef) = #_;
my ($velocity_x, $velocity_y, $velocity_z);
while ( my #nums = splice( #$dataRef, 0, 9 ) ) {
$velocity_x += $nums[0] + $nums[3] + $nums[6];
$velocity_y += $nums[1] + $nums[4] + $nums[7];
$velocity_z += $nums[2] + $nums[5] + $nums[8];
}
return sqrt( $velocity_x**2 + $velocity_y**2 + $velocity_z**2 );
}
The word and space title is used as the record separator, so each read takes in a chunk of data that's delimited by title. The chomp removes the record separator, and then the record is split on whitespace.
The zeroth element is the title number, and that's shifted off #data. The first and second elements of #data are the atom count, and they're shifted off, too. The remaining array elements are the Cartesian directions, and a reference to that array is send to the subroutine calcResultantVel.
The subroutine takes a chunk of nine elements at a time: three for O atom, three for the first H atom, and three for the second H atom, and a running sum is kept based upon the definition you've provided. Finally, the resultant velocity is returned.
Here's some sample output:
Title 0
4335 2.0001000e+04
ResultantVel: 13.2945751170603
Title 1
4335 2.0001000e+04
ResultantVel: 12.7696611061461
You can visually verify that it's working correctly. Since you "...can write simple 'read and write' operations in Perl...," the next step is to have it write the desired results to a file.
Hope this helps!

Here's my advice: break the job down into small components, and write a method for each meaningful part of the work. To wit:
use strict;
use warnings;
main(#ARGV); # Pass data file name on command line. Don't hard-code it.
sub main {
my $data_f = shift;
open(my $data_h, '<', $data_f) or die "$!: $data_f";
while (my $section = get_section($data_h)){
# Also write methods that can be called here to make
# desired computations, print output, etc.
}
}
sub get_section {
# Takes a file handle.
# Returns a hash reference containing all of the data
# for an entire section of the file.
my $h = shift;
return if eof($h);
chomp (my $title = <$h>);
my ($n_atoms) = <$h> =~ /^(\d+)/;
return {
'title' => $title,
'n_atoms' => $n_atoms,
'molecules' => get_molecules($h, $n_atoms / 3),
};
}
sub get_molecules {
my #molecules;
return \#molecules;
}
I have not written the get_molecules() method. It takes a file handle and an integer (N of molecules). It could return a reference to an array-of-arrays or maybe an array-of-hashes, with each inner array/hash holding the info for a single molecule.

Thanks for your help and guide. I have tried to modify your code as below. It works at least for my need.
#!/usr/bin/perl
###############
#use strict;
#use warnings;
use Math::Complex;
open OUTPUT, '>', "velocityOnly.dat" or die "Can't create filehandle: $!";
my $dataFile = 'velF1F2.vel';
{
local $/ = 'title ';
open my $FH, '<', $dataFile or die $!;
while (<$FH>) {
chomp;
my #data = split or next;
my $titleNum = 'Title ' . shift(#data);
my $atom = shift(#data) . ' ' . shift(#data);
#my $resultantVel = calcResultantVel( \#data );
#print OUTPUT "$titleNum", "\n";
print "$titleNum", "\n";
for my $i (1..1445)
{
$j=(9*($i-1));
$velocity_x = $data[($j+0)] + $data[($j+3)] + $data[($j+6)];
$velocity_y = $data[($j+1)] + $data[($j+4)] + $data[($j+7)];
$velocity_z = $data[($j+2)] + $data[($j+5)] + $data[($j+8)];
$velo = sprintf '%.3f',sqrt( $velocity_x**2 + $velocity_y**2 + $velocity_z**2 );
chomp $velo;
print "$velo","\n";
print OUTPUT "$velo\n";
}
#print 'ResultantVel: ' . $resultantVel, "\n\n";
}
close $FH;
}
But I would like to extend further by adding some other functionality for doing some complex calculations. The code
Before that, I need some guide on making the below code into subroutine. I am bit lost here. Your CODE actually add all the X, Y and Z and finally find the velocity. But what I want is not that. Each 9 values subsequently represent coordinate for a water molecule which contain three atoms.
(The number 1445 is number of molecules. Each molecule contain three atoms and each atom has three coordinates.So for a water molecule has 9 Cartesian coordinates.)
the i here represent number of water molecule
for my $i (1..1445)
{
$j=(9*($i-1));
$velocity_x = $data[($j+0)] + $data[($j+3)] + $data[($j+6)];
$velocity_y = $data[($j+1)] + $data[($j+4)] + $data[($j+7)];
$velocity_z = $data[($j+2)] + $data[($j+5)] + $data[($j+8)];
$velo = sprintf '%.3f',sqrt( $velocity_x**2 + $velocity_y**2 + $velocity_z**2 );
chomp $velo;
print "$velo","\n";
print OUTPUT "$velo\n";
}

Perl forming string random string combination

I have a file with around 25000 records, each records has more than 13 entries are drug names. I want to form all the possible pair combination for these entries. Eg: if a line has three records A, B, C. I should form combinations as 1) A B 2) A C 3)B C. Below is the code I got from internet, it works only if a single line is assigned to an array:
use Math::Combinatorics;
my #n = qw(a b c);
my $combinat = Math::Combinatorics->new(
count => 2,
data => [#n],
);
while ( my #combo = $combinat->next_combination ) {
print join( ' ', #combo ) . "\n";
}
The code I am using, it doesn't produce any output:
open IN, "drugs.txt" or die "Cannot open the drug file";
open OUT, ">Combination.txt";
use Math::Combinatorics;
while (<IN>) {
chomp $_;
#Drugs = split /\t/, $_;
#n = $Drugs[1];
my $combinat = Math::Combinatorics->new(
count => 2,
data => [#n],
);
while ( my #combo = $combinat->next_combination ) {
print join( ' ', #combo ) . "\n";
}
print "\n";
}
Can you please suggest me a solution to this problem?

You're setting #n to be an array containing the second value of the #Drugs array, try just using data => \#Drugs in the Math::Combinatorics constructor.
Also, use strict; use warnings; blahblahblah.

All pairs from an array are straightforward to compute. Using drugs A, B, and C as from your question, you might think of them forming a square matrix.
AA AB AC
BA BB BC
CA CB CC
You probably do not want the “diagonal” pairs AA, BB, and CC. Note that the remaining elements are symmetrical. For example, element (0,1) is AB and (1,0) is BA. Here again, I assume these are the same and that you do not want duplicates.
To borrow a term from linear algebra, you want the upper triangle. Doing it this way eliminates duplicates by construction, assuming that each drug name on a given line is unique. An algorithm for this is below.
Select in turn each drug q on the line. For each of these, perform steps 2 and 3.
Beginning with the drug immediately following q and then for each drug r in the rest of the list, perform step 3.
Record the pair (q, r).
The recorded list is the list of all unique pairs.
In Perl, this looks like
#! /usr/bin/env perl
use strict;
use warnings;
sub pairs {
my #a = #_;
my #pairs;
foreach my $i (0 .. $#a) {
foreach my $j ($i+1 .. $#a) {
push #pairs, [ #a[$i,$j] ];
}
}
wantarray ? #pairs : \#pairs;
}
my $line = "Perlix\tScalaris\tHashagra\tNextium";
for (pairs split /\t/, $line) {
print "#$_\n";
}
Output:
Perlix Scalaris
Perlix Hashagra
Perlix Nextium
Scalaris Hashagra
Scalaris Nextium
Hashagra Nextium

I've answered something like this before for someone else. For them, they had a question on how to combine a list of letters into all possible words.
Take a look at How Can I Generate a List of Words from a group of Letters Using Perl. In it, you'll see an example of using Math::Combinatorics from my answer and the correct answer that ikegami had. (He did something rather interesting with regular expressions).
I'm sure one of these will lead you to the answer you need. Maybe when I have more time, I'll flesh out an answer specifically for your question. I hope this link helps.

Counting the occurrence of a sub-graph in a graph

I have a 3 dimensional dataset that describes the gene interactions which can be formulated as a graph. The sample of dataset is:
a + b
b + c
c - f
b - d
a + c
f + g
g + h
f + h
'+' indicates that a gene on the left side positively regulates the gene on the right. In this data I want to count the sub-graph where a gene (say, x) positively regulates another gene (say, y), y in turn positively regulates yet another gene (say, z). Furthermore, z is also positively regulated by x. There are two such cases in above graph. I want to perform this search preferably using awk but any scripting language is fine. My apologies for being a too specific question and thanks in advance for the help.

Note: See the information regarding Graphviz below.
This should give you a starting point:
Edit: This version handles genes that are described by more than one character.
awk '
BEGIN { regdelim = "|" }
{
delim=""
if ($2 == "+") {
if (plus[$1]) delim=regdelim
plus[$1]=plus[$1] delim $3
}
else
if ($2 == "-") {
if (minus[$1]) delim=regdelim
minus[$1]=minus[$1] delim $3
}
}
END {
for (root in plus) {
split(plus[root],regs,regdelim)
for (reg in regs) {
if (plus[regs[reg]] && plus[root] ~ plus[regs[reg]]) {
print "Match: ", root, "+", regs[reg], "+", plus[regs[reg]]
}
}
}
}
' inputfile
In the BEGIN clause, set regdelim to a character that doesn't appear in your data.
I've omitted the processing code for the minus data.
Output:
Match: a + b + c
Match: f + g + h
Edit 2:
The version below allows you to search for arbitrary combinations. It generalizes the technique used in the original version so no code needs to be duplicated. It also fixes a couple of other bugslimitations.
#!/bin/bash
# written by Dennis Williamson - 2010-11-12
# for http://stackoverflow.com/questions/4161001/counting-the-occurrence-of-a-sub-graph-in-a-graph
# A (AB) B, A (AC) C, B (BC) C - where "(XY)" represents a + or a -
# provided by the positional parameters $1, $2 and $3
# $4 carries the data file name and is referenced at the end of the script
awk -v AB=$1 -v AC=$2 -v BC=$3 '
BEGIN { regdelim = "|" }
{
if ($2 == AB) {
if (regAB[$1]) delim=regdelim; else delim=""
regAB[$1]=regAB[$1] delim $3
}
if ($2 == AC) {
if (regAC[$1]) delim=regdelim; else delim=""
regAC[$1]=regAC[$1] delim $3
}
if ($2 == BC) {
if (regBC[$1]) delim=regdelim; else delim=""
regBC[$1]=regBC[$1] delim $3
}
}
END {
for (root in regAB) {
split(regAB[root],ABarray,regdelim)
for (ABindex in ABarray) {
split(regAC[root],ACarray,regdelim)
for (ACindex in ACarray) {
split(regBC[ABarray[ABindex]],BCarray,regdelim)
for (BCindex in BCarray) {
if (ACarray[ACindex] == BCarray[BCindex]) {
print " Match:", root, AB, ABarray[ABindex] ",", root, AC, ACarray[ACindex] ",", ABarray[ABindex], BC, BCarray[BCindex]
}
}
}
}
}
}
' "$4"
This can be called like this to do an exhaustive search:
for ab in + -; do for ac in + -; do for bc in + -; do echo "Searching: $ab$ac$bc"; ./searchgraph $ab $ac $bc inputfile; done; done; done
For this data:
a - e
a + b
b + c
c - f
m - n
b - d
a + c
b - e
l - n
f + g
b + i
g + h
l + m
f + h
a + i
a - j
k - j
a - k
The output of the shell loop calling the new version of the script would look like this:
Searching: +++
Match: a + b, a + c, b + c
Match: a + b, a + i, b + i
Match: f + g, f + h, g + h
Searching: ++-
Searching: +-+
Searching: +--
Match: l + m, l - n, m - n
Match: a + b, a - e, b - e
Searching: -++
Searching: -+-
Searching: --+
Searching: ---
Match: a - k, a - j, k - j
Edit 3:
Graphviz
Another approach would be to use Graphviz. The DOT language can describe the graph and gvpr, which is an "AWK-like"1 programming language, can analyze and manipulate DOT files.
Given the input data in the format as shown in the question, you can use the following AWK program to convert it to DOT:
#!/usr/bin/awk -f
BEGIN {
print "digraph G {"
print " size=\"5,5\""
print " ratio=.85"
print " node [fontsize=24 color=blue penwidth=3]"
print " edge [fontsize=18 labeldistance=5 labelangle=-8 minlen=2 penwidth=3]"
print " {rank=same; f l}"
m = "-" # ASCII minus/hyphen as in the source data
um = "−" # u2212 minus: − which looks better on the output graphic
p = "+"
}
{
if ($2 == m) { $2 = um; c = lbf = "red"; arr=" arrowhead = empty" }
if ($2 == p) { c = lbf = "green3"; arr="" }
print " " $1, "->", $3, "[taillabel = \"" $2 "\" color = \"" c "\" labelfontcolor = \"" lbf "\"" arr "]"
}
END {
print "}"
}
The command to run would be something like this:
$ ./dat2dot data.dat > data.dot
You can then create the graphic above using:
$ dot -Tpng -o data.png data.dot
I used the extended data as given above in this answer.
To do an exhaustive search for the type of subgraphs you specified, you can use the following gvpr program:
BEGIN {
edge_t AB, BC, AC;
}
E {
AB = $;
BC = fstedge(AB.head);
while (BC && BC.head.name != AB.head.name) {
AC = isEdge(AB.tail,BC.head,"");
if (AC) {
printf("%s %s %s, ", AB.tail.name, AB.taillabel, AB.head.name);
printf("%s %s %s, ", AC.tail.name, AC.taillabel, AC.head.name);
printf("%s %s %s\n", BC.tail.name, BC.taillabel, BC.head.name);
}
BC = nxtedge(BC, AB.head);
}
}
To run it, you could use:
$ gvpr -f groups.g data.dot | sort -k 2,2 -k 5,5 -k 8,8
The output would be similar to that from the AWK/shell combination above (under "Edit 2"):
a + b, a + c, b + c
a + b, a + i, b + i
f + g, f + h, g + h
a + b, a − e, b − e
l + m, l − n, m − n
a − k, a − j, k − j
1 Loosely speaking.

An unconventional approach using Perl is below.
#! /usr/bin/perl
use warnings;
use strict;
my $graph = q{
a + c
b + c
c - f
b - d
a + b
f + g
g + h
f + h
};
my $nodes = join ",", sort keys %{ { map +($_ => 1), $graph =~ /(\w+)/g } };
my $search = "$nodes:$nodes:$nodes:$graph";
my $subgraph = qr/
\A .*? (?<x>\w+) .*:
.*? (?<y>\w+) .*:
.*? (?<z>\w+) .*:
(?= .*^\s* \k<x> \s* \+ \s* \k<y> \s*$)
(?= .*^\s* \k<y> \s* \+ \s* \k<z> \s*$)
(?= .*^\s* \k<x> \s* \+ \s* \k<z> \s*$)
(?{ print "x=$+{x}, y=$+{y}, z=$+{z}\n" })
(?!)
/smx;
$search =~ /$subgraph/;
The regex engine is a powerful tool. For your problem, we describe the structure of a transitive subgraph and then allow the resulting machine to go find all of them.
Output:
x=a, y=b, z=c
x=f, y=g, z=h
A more general tool using this same technique is possible. For example, let's say you want to be able to specify gene patterns such as a+b+c;a+c or g1+g2-g3;g1+g3. I hope the meanings of these patterns are obvious.
In the front matter, I specify a minimum version of 5.10.0 because the code uses // and lexical $_. The code constructs dynamic regexes that will evaluate code, which the use re 'eval' pragma enables.
#! /usr/bin/perl
use warnings;
use strict;
use 5.10.0;
use re 'eval';
An identifier is a sequence of one or more “word characters,” i.e., letters, digits, or underscores.
my $ID = qr/\w+/;
Given a regex that accepts variable names, unique_vars searches some specification for all variable names and returns them without repetition.
sub unique_vars {
my($_,$pattern) = #_;
keys %{ { map +($_ => undef), /($pattern)/g } };
}
Compiling a gene pattern into a regex is a little hairy. It dynamically generates a search target and regex with the same form as the static one above.
The first part with multiple occurrences of comma-separated variables lets the regex engine try each possible value for each gene. Then the lookaheads, (?=...), scan the graph looking for edges with the desired properties. If all the lookaheads succeed, we record the hit.
The strange regex (?!) at the end is an unconditional failure that forces the matcher to backtrack and attempt the match with different genes. Because it's unconditional, the engine will evaluate all possibilities.
Calling the same closure from multiple threads concurrently will likely produce strange results.
sub compile_gene_pattern {
my($dataset,$pattern) = #_;
my #vars = sort +unique_vars $pattern, qr/[a-z]\d*/; # / for SO hilite
my $nodes = join ",", sort +unique_vars $dataset, $ID;
my $search = join("", map "$_:", ($nodes) x #vars) . "\n"
. $dataset;
my $spec = '\A' . "\n" . join("", map ".*? (?<$_>$ID) .*:\n", #vars);
for (split /;/, $pattern) {
while (s/^($ID)([-+])($ID)/$3/) {
$spec .= '(?= .*^\s* ' .
' \b\k<' . $1 . '>\b ' .
' \s*' . quotemeta($2) . '\s* ' .
' \b\k<' . $3 . '>\b ' .
' \s*$)' . "\n";
}
}
my %hits;
$spec .= '(?{ ++$hits{join "-", #+{#vars}} })' . "\n" .
'(?!) # backtrack' . "\n";
my $nfa = eval { qr/$spec/smx } || die "$0: INTERNAL: bad regex:\n$#";
sub {
%hits = (); # thread-safety? :-(
(my $_ = $search) =~ /$nfa/;
map [split /-/], sort keys %hits;
}
}
Read the dataset and let the user know about any problems.
sub read_dataset {
my($path) = #_;
open my $fh, "<", $path or die "$0: open $path: $!";
local $/ = "\n";
local $_;
my $graph;
my #errors;
while (<$fh>) {
next if /^\s*#/ || /^\s*$/;
if (/^ \s* $ID \s* [-+] \s* $ID \s* $/x) {
$graph .= $_;
}
else {
push #errors, "$.: syntax error";
}
}
return $graph unless #errors;
die map "$0: $path:$_\n", #errors;
}
Now we set it all into motion:
my $graphs = shift // "graphs.txt";
my $dataset = read_dataset $graphs;
my $ppp = compile_gene_pattern $dataset, "a+b+c;a+c";
print "#$_\n" for $ppp->();
my $pmp = compile_gene_pattern $dataset, "g1+g2-g3;g1+g3";
print "#$_\n" for $pmp->();
Given graphs.txt with contents
a + b
b + c
c - f
b - d
a + c
f + g
g + h
f + h
foo + bar
bar - baz
foo + baz
and then running the program, we get the following output:
a b c
f g h
foo bar baz

I assume that by "count the sub-graph" you mean counting the nodes in a sub-graph. If that's what you need, you can use any scripting language and will have to store the graph, first of all, by creating a structure or class where you store your graph, the node structure/class should look like this (this is not conforming the syntax of any language, this is only a plan for your application):
Node {color = 0; title = ""; minusNodeSet = null; plusNodeSet = null}
Where color = 0 (the defaul value of color means you haven't visited this node before), title will be 'a', 'b', 'c', and so on. minusNodeSet is a Set of Nodes where those nodes are stored, where a minus vertice points from our Node, plusNodeSet is a Set of Nodes where those nodes are stored, where a plus vertice points from our Node.
Now, we have an architecture and should use it in a depth-first algoritm:
int depth_first(Node actualNode)
{
if (actualNode.color == 1)
return;
number = 1;
actualNode.color = 1;
foreach actualNode.nodeSet as node do
if (node.color == 0)
number = number + depth_first(node);
return number;
}
If I misunderstood your question, please, tell me, to be able to edit my answer to be a more useful one.

The structure of the regex in my other answer resembles list-monad processing. Given that inspiration, a search for the transitive subgraphs is below as a literate Haskell. Copy-and-paste this answer to a file with the extension .lhs to get a working program. Be sure to surround the code sections, marked by leading >, with empty lines.
Thanks for the fun problem!
A bit of front matter:
> {-# LANGUAGE ViewPatterns #-}
> module Main where
> import Control.Monad (guard)
> import Data.List (nub)
> import Data.Map (findWithDefault,fromListWith,toList)
The name of a gene can be any string, and for a given Gene g, a function of type PosReg should give us all the genes that g positively regulates.
> type Gene = String
> type PosReg = Gene -> [Gene]
From a graph specified as in your question, we want triples of genes such that the is-positively-regulated-by relation is transitive, and subgraphs describes the desired properties. First, pick an arbitrary gene x from the graph. Next, choose one of the genes y that x positively regulates. For the transitive property to hold, z must be a gene that both x and y positively regulate.
> subgraphs :: String -> [(Gene,Gene,Gene)]
> subgraphs g = do
> x <- choose
> y <- posRegBy x
> z <- posRegBy y
> guard $ z `elem` posRegBy x
> return (x,y,z)
> where (choose,posRegBy) = decode g
With the simple parser in decode, we distill the list of genes in the graph and a PosReg function that gives all genes positively regulated by some other gene.
> decode :: String -> ([Gene], PosReg)
> decode g =
> let pr = fromListWith (++) $ go (lines g)
> gs = nub $ concatMap (\(a,b) -> a : b) $ toList pr
> in (gs, (\x -> findWithDefault [] x pr))
> where
> go ((words -> [a, op, b]):ls)
> | op == "+" = (a,[b]) : go ls
> | otherwise = go ls
> go _ = []
Finally, the main program glues it all together. For each subgraph found, print it to the standard output.
> main :: IO ()
> main = mapM_ (putStrLn . show) $ subgraphs graph
> where graph = "a + b\n\
> \b + c\n\
> \c - f\n\
> \b - d\n\
> \a + c\n\
> \f + g\n\
> \g + h\n\
> \f + h\n"
Output:
("a","b","c")
("f","g","h")

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Add dipeptide frequency in this perl script based on sequence length - perl

This is pretty ugly code (should be rewritten entirely), but I think you want: my $length = #dipeps; map{exists $di_count{$_}?print " ",$di_count{$_}/$length:print " ",0}sort #comb;

Related

Looking for specific text and save it into variable

How can I put N's in my final string

calculating velocity from massive simulation data

Perl forming string random string combination

Counting the occurrence of a sub-graph in a graph

Categories

Resources