What do the non-printable characters in the Perl symbol table represent? - perl

I just learned that in Perl, the symbol table for a given package is stored in a hash named after the package with two colons appended -- so, for example, the symbol table for the fictional module Foo::Bar would be %Foo::Bar::. The default symbol table is stored in %main::. Just for the sake of curiosity, I decided that I wanted to see what was in %main::, so I iterated through each key/value pair in the hash, printing them out as I went:
#! /usr/bin/perl
use v5.14;
use strict;
use warnings;
my $foo;
my $bar;
my %hash;
while( my ( $key, $value ) = each %:: ) {
say "Key: '$key' Value '$value'";
}
The output looked like this:
Key: 'version::' Value '*main::version::'
Key: '/' Value '*main::/'
Key: '' Value '*main::'
Key: 'stderr' Value '*main::stderr'
Key: '_<perl.c' Value '*main::_<perl.c'
Key: ',' Value '*main::,'
Key: '2' Value '*main::2'
...
I was expecting to see the STDOUT and STDERR file handles, and perhaps @INC and %ENV... what I wasn't expecting to see was non-printable characters. What the code block above doesn't show is that the third line of the output actually had a glyph indicating a non-printable character.
I ran the script and piped it as follows:
perl /tmp/asdf.pl | grep '[^[:print:]]' | while read line
do
echo $line
od -c <<< $line
echo
done
The output looked like this:
Key: '' Value '*main::'
0000000 K e y : ' 026 ' V a l u e '
0000020 * m a i n : : 026 ' \n
0000032
Key: 'ARNING_BITS' Value '*main::ARNING_BITS'
0000000 K e y : ' 027 A R N I N G _ B I
0000020 T S ' V a l u e ' * m a i n
0000040 : : 027 A R N I N G _ B I T S ' \n
0000060
Key: '' Value '*main::'
0000000 K e y : ' 022 ' V a l u e '
0000020 * m a i n : : 022 ' \n
0000032
Key: 'E_TRIE_MAXBUF' Value '*main::E_TRIE_MAXBUF'
0000000 K e y : ' 022 E _ T R I E _ M A
0000020 X B U F ' V a l u e ' * m a
0000040 i n : : 022 E _ T R I E _ M A X B
0000060 U F ' \n
0000064
Key: ' Value '*main:'
0000000 K e y : ' \b ' V a l u e '
0000020 * m a i n : : \b ' \n
0000032
Key: '' Value '*main::'
0000000 K e y : ' 030 ' V a l u e '
0000020 * m a i n : : 030 ' \n
0000032
So what are non-printable characters doing in the Perl symbol table? What are they symbols for?

Guru is on the right track: specifically, the answer is to be found in perlvar, which says:
"Perl variable names may also be a sequence of digits or a single punctuation or control character. These names are all reserved for special uses by Perl; for example, the all-digits names are used to hold data captured by backreferences after a regular expression match. Perl has a special syntax for the single-control-character names: It understands ^X (caret X) to mean the control-X character. For example, the notation $^W (dollar-sign caret W) is the scalar variable whose name is the single character control-W. This is better than typing a literal control-W into your program.
Since Perl 5.6, Perl variable names may be alphanumeric strings that begin with control characters (or better yet, a caret). These variables must be written in the form ${^Foo}; the braces are not optional. ${^Foo} denotes the scalar variable whose name is a control-F followed by two o's. These variables are reserved for future special uses by Perl, except for the ones that begin with ^_ (control-underscore or caret-underscore). No control-character name that begins with ^_ will acquire a special meaning in any future version of Perl; such names may therefore be used safely in programs. $^_ itself, however, is reserved."
If you want to print those names in a readable way, you could add a line like this to your code:
$key = '^' . ($key ^ '@') if $key =~ /^[\0-\x1f]/;
If the first character of $key is a control character, this will replace it with a caret followed by the corresponding letter (^A for control-A, ^B for control-B, etc.).
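As a minimal sketch of that idea, the snippet below wraps the one-liner in a helper and lists only the symbol-table entries whose names start with a control character. The name readable_name is hypothetical, not something Perl provides.

```perl
use strict;
use warnings;

# Hypothetical helper: show a leading control character in caret notation.
sub readable_name {
    my ($name) = @_;
    # String XOR with '@' (0x40) maps control-A..control-Z onto 'A'..'Z';
    # only the first character is affected because '@' is one character long.
    $name = '^' . ($name ^ '@') if $name =~ /^[\0-\x1f]/;
    return $name;
}

# Print only the entries whose names changed, i.e. the control-character ones.
for my $key (sort keys %main::) {
    my $shown = readable_name($key);
    print "Key: '$shown'\n" if $shown ne $key;
}
```

The exact list printed depends on the perl version and on which modules are loaded.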

Perl has special variables such as $", $, , $/ , $\ and so on. All of these are part of the symbol table, which is what you are seeing. You should be able to see @INC and %ENV in the symbol table as well.

Related

How to parse rows in my txt file properly using perl

I hope to parse a txt file that looks like this:
A a, b, c
B e
C f, g
The format I hope to get is:
A a
A b
A c
B e
C f
C g
I tried this:
perl -ane '@s=split(/\,/, $F[1]); foreach $k (@s){print "$F[0] $k\n";}' txt.txt
but it only works when there's no space after commas. In the original file, there is a space after each comma. What should I do?
$ perl -lane 'print "$F[0] $_" for map { tr/,//rd } @F[1..$#F]' input.txt
A a
A b
A c
B e
C f
C g
Use auto-split mode on whitespace as usual, and for each element of an array slice of @F from the second field to the last one, remove any commas (I used tr///d; the more usual s/// works too, of course) and print it with the first field prepended.
Alternatively, don't use -a because it splits too much.
perl -nle'@F = split(" ", $_, 2); print "$F[0] $_" for split(/,\s*/, $F[1])'
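The same two-step split can also be written as a small script, which may be easier to extend than a one-liner. This is only a sketch: expand_row is a hypothetical helper name, and the sample rows are the ones from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: turn "A a, b, c" into ("A a", "A b", "A c").
sub expand_row {
    my ($line) = @_;
    chomp $line;
    # A limit of 2 keeps everything after the first field in one piece.
    my ($key, $rest) = split ' ', $line, 2;
    return () unless defined $rest;
    $rest =~ s/\s+\z//;    # drop trailing whitespace, if any
    # Commas with optional surrounding whitespace separate the values.
    return map "$key $_", split /\s*,\s*/, $rest;
}

print "$_\n" for map { expand_row($_) } "A a, b, c\n", "B e\n", "C f, g\n";
```

Because the comma pattern allows whitespace on both sides, it handles both "a,b" and "a, b".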

Splitting an Array into n accessible parts within perl?

My goal is to take an array of letters and cut it up into "n" parts. In this case no more than 10 letters each piece. But I want these arrays to be stored into an array reference which I can access on a counter.
For example, I have the following script to split an array of English alphabetical letters into 1 array of 10 letters. But since the English Alphabet has 26 letters, I need 2 more arrays to access in an array reference.
#!/usr/bin/env perl
#split an array into parts.
use strict;
use warnings;
use feature 'say';
my @letters = ('A' .. 'Z');
say "These are my letters:";
for(@letters){print "$_ ";}
my @letters_selected = splice(@letters, 0, 10);
say "\nThese are my selected letters:";
for(@letters_selected){print "$_ ";}
The output is this:
These are my letters:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
These are my selected letters:
A B C D E F G H I J
This little script only gives me one piece of 10 letters of the alphabet. But I want all three pieces of 10 letters of the alphabet, so I would like to know how I can achieve this:
Goal:
Have an array reference called letters_selected of letters which contains all letters A - Z. But ... I can access all three pieces of size less than or equal to 10 letters like this.
foreach(@{$letters_selected[0]}){say "$_ ";}
returns: A B C D E F G H I J # These are the initial 10 elements of the alphabet.
foreach(@{$letters_selected[1]}){say "$_ ";}
returns: K L M N O P Q R S T # The next 10 after that.
foreach(@{$letters_selected[2]}){say "$_ ";}
returns: U V W X Y Z # The next no more than 10 after that.
Since splice is destructive to its target, you can keep applying it until the array is empty:
use warnings;
use strict;
use feature 'say';
my @letters = 'A'..'Z';
my @letter_groups;
push @letter_groups, [ splice @letters, 0, 10 ] while @letters;
say "@$_" for @letter_groups;
After this, @letters is empty, so make a copy and work with that if you still need the original.
Every time through, splice removes and returns elements from @letters, and [ ] makes an anonymous array from that list. A reference to it is pushed onto @letter_groups.
If fewer than 10 elements remain, splice removes and returns whatever is left, so the final call empties @letters and the while loop terminates.
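If the original array has to survive, one option is to do the splicing on a copy inside a sub. The sketch below assumes a hypothetical helper named chunks_of; the argument list is copied into @items, so the caller's array is left untouched.

```perl
use strict;
use warnings;

# Hypothetical helper: split a list into groups of at most $size elements.
sub chunks_of {
    my ($size, @items) = @_;    # @items is a copy of the caller's data
    die "chunk size must be positive\n" unless $size > 0;
    my @groups;
    # splice removes up to $size elements per pass until nothing is left.
    push @groups, [ splice @items, 0, $size ] while @items;
    return @groups;
}

my @letters = ('A' .. 'Z');
my @letters_selected = chunks_of(10, @letters);
print "@$_\n" for @letters_selected;
print scalar(@letters), " letters still in the original array\n";
```

Each group is then available as @{ $letters_selected[0] }, @{ $letters_selected[1] }, and so on, as the question asks.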

how to split the string with |(pipe) as delimiter

When I use split with | as the delimiter, it doesn't give me the expected output. Is there any solution?
use warnings;
my $exclude_list = "1213:sutrust.com,sutrust1.com,sutrust3.com|1321:line.com";
my @exclude_client = split(/|/, $exclude_list);
print "Printing excluse @exclude_client \n";
output
Printing excluse 1 2 1 3 : s u t r u s t . c o m , s u t r u s t 1 . c o m , s u t r u s t 3 . c o m | 1 3 2 1 : l i n e . c o m
Expected output:
Printing excluse 1213:sutrust.com,sutrust1.com,sutrust3.com 1321:line.com
You didn't use the | character as the delimiter, you used the | regular expression as the delimiter. That pattern always matches, so the result is splitting between every character. Escape the |.
split(/\|/, $exclude_list)
The pipe character is special in regular expressions (it means alternation), and split treats its first argument as a pattern, so the pipe needs to be escaped.
my @exclude_client = split(/\|/, $exclude_list);
You need to escape the pipe using a backslash: \| :
#!/usr/bin/perl -w
use strict;
my $exclude_list = "1213:sutrust.com,sutrust1.com,sutrust3.com|1321:line.com";
my @exclude_client = split(/\|/, $exclude_list);
foreach(@exclude_client){
print "Printing exclude $_ ";
}
Outputs:
Printing exclude 1213:sutrust.com,sutrust1.com,sutrust3.com Printing exclude 1321:line.com
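When the delimiter lives in a variable, \Q...\E (the interpolating form of quotemeta) escapes it so that no manual backslash is needed. A minimal sketch, with split_on_literal as a hypothetical helper name:

```perl
use strict;
use warnings;

# Hypothetical helper: split on a delimiter taken literally, not as a regex.
sub split_on_literal {
    my ($delim, $string) = @_;
    # \Q...\E quotes regex metacharacters, so "|" matches a literal pipe
    # instead of meaning "empty pattern OR empty pattern".
    return split /\Q$delim\E/, $string;
}

my $exclude_list = "1213:sutrust.com,sutrust1.com,sutrust3.com|1321:line.com";
my @exclude_client = split_on_literal('|', $exclude_list);
print "$_\n" for @exclude_client;
```

A character class, as in split /[|]/, is another way to write a literal pipe.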

How to Rewrite of One Line Code (or Less Line Code in command line) of this code in Perl?

I have code like this:
#!/usr/bin/perl
use strict;
use warnings;
my %proteins = qw/
UUU F UUC F UUA L UUG L UCU S UCC S UCA S UCG S UAU Y UAC Y UGU C UGC C UGG W
CUU L CUC L CUA L CUG L CCU P CCC P CCA P CCG P CAU H CAC H CAA Q CAG Q CGU R CGC R CGA R CGG R
AUU I AUC I AUA I AUG M ACU T ACC T ACA T ACG T AAU N AAC N AAA K AAG K AGU S AGC S AGA R AGG R
GUU V GUC V GUA V GUG V GCU A GCC A GCA A GCG A GAU D GAC D GAA E GAG E GGU G GGC G GGA G GGG G
/;
open(INPUT,"<dna.txt");
while (<INPUT>) {
tr/[a,c,g,t]/[A,C,G,T]/;
y/GCTA/CGAU/;
foreach my $protein (/(...)/g) {
if (defined $proteins{$protein}) {
print $proteins{$protein};
}
}
}
close(INPUT);
This code is related to my other question's answer: DNA to RNA and Getting Proteins with Perl
The output of the program is:
SIMQNISGREAT
How can I rewrite this code in Perl so that it runs from the command line with less code (a one-liner, if possible)?
PS 1: dna.txt looks like this:
TCATAATACGTTTTGTATTCGCCAGCGCTTCGGTGT
PS 2: If it makes the code shorter, it is acceptable to move the %proteins table into a file.
The only changes I would recommend making are simplifying your while loop:
while (<INPUT>) {
tr/acgt/ACGT/;
tr/GCTA/CGAU/;
foreach my $protein (/(...)/g) {
if (defined $proteins{$protein}) {
print $proteins{$protein};
}
}
}
Since y and tr are synonyms, you should only use one of them. I think tr reads better than y, so I picked tr. Further, you were calling them very differently; the version above should have the same effect and only mentions the letters you actually change. (In the original, the brackets and commas were being transliterated to themselves, which makes it much harder to see what is actually being changed.)
You might want to remove the open(INPUT,"<dna.txt"); and corresponding close(INPUT); lines, as they make it much harder to use your program in shell pipelines or with different input files. But that's up to you, if the input file will always be dna.txt and never anything different, this is alright.
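One way to keep the program readable while dropping the hard-coded file name is to move the translation into a sub. The sketch below reuses the codon table and sample DNA from the question; the sub name translate is hypothetical.

```perl
use strict;
use warnings;

my %proteins = qw/
UUU F UUC F UUA L UUG L UCU S UCC S UCA S UCG S UAU Y UAC Y UGU C UGC C UGG W
CUU L CUC L CUA L CUG L CCU P CCC P CCA P CCG P CAU H CAC H CAA Q CAG Q CGU R CGC R CGA R CGG R
AUU I AUC I AUA I AUG M ACU T ACC T ACA T ACG T AAU N AAC N AAA K AAG K AGU S AGC S AGA R AGG R
GUU V GUC V GUA V GUG V GCU A GCC A GCA A GCG A GAU D GAC D GAA E GAG E GGU G GGC G GGA G GGG G
/;

# Hypothetical helper: complement a DNA string to RNA and map codons to letters.
sub translate {
    my ($dna) = @_;
    # Uppercase first, then complement: G<->C, T->A, A->U.
    (my $rna = uc $dna) =~ tr/GCTA/CGAU/;
    # Codons missing from the table (for example stop codons) are skipped.
    return join '', map { $proteins{$_} } grep { exists $proteins{$_} } $rna =~ /(...)/g;
}

print translate('TCATAATACGTTTTGTATTCGCCAGCGCTTCGGTGT'), "\n";
```

To read from files named on the command line or from a pipe instead, the last line could be replaced with a loop such as print translate($_), "\n" while <>;.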
Somebody (@kamaci) called my name in another thread. This is the best I can come up with while keeping the protein table on the command line:
perl -nE'say+map+substr("FYVDINLHL%VEMKLQL%VEIKLQFYVDINLHCSGASTRPWSGARTRP%SGARTRPCSGASTR",(s/GGG/GGC/i,vec($_,0,32)&101058048)%63,1),/.../g' dna.txt
(Shell quoting, for Windows quoting swap ' and " characters). This version marks invalid codons with %, you can probably fix that by adding =~y/%//d at an appropriate spot.
Hint: This picks out 6 bits from the raw ASCII encoding of an RNA triple, giving 64 codes between 0 and 101058048; to get a string index, I reduce the result modulo 63, but this creates one double mapping which regrettably had to code two different proteins. The s/GGG/GGC/i maps one of them to another that codes the right protein.
Also note the parentheses before the % operator which both isolate the , operator from the argument list of substr and fix the precedence of & vs %. If you ever use that in production code, you're a bad, bad person.
#!/usr/bin/perl
%p=qw/UUU F UUC F UUA L UUG L UCU S UCC S UCA S UCG S UAU Y UAC Y UGU C UGC C UGG W
CUU L CUC L CUA L CUG L CCU P CCC P CCA P CCG P CAU H CAC H CAA Q CAG Q CGU R CGC R CGA R CGG R
AUU I AUC I AUA I AUG M ACU T ACC T ACA T ACG T AAU N AAC N AAA K AAG K AGU S AGC S AGA R AGG R
GUU V GUC V GUA V GUG V GCU A GCC A GCA A GCG A GAU D GAC D GAA E GAG E GGU G GGC G GGA G GGG G/;
$_=uc<DATA>;y/GCTA/CGAU/;map{print if$_=$p{$_}}/(...)/g
__DATA__
TCATAATACGTTTTGTATTCGCCAGCGCTTCGGTGT
Phew. Best I can come up with, at least this quickly. If you're sure the input is always already in uppercase, you can also drop the uc saving another two characters. Or if the input is always the same, you could assign it to $_ straight away instead of reading it from anywhere.
I guess I don't need to say that this code should not be used in production environments or anywhere else other than pure fun. When doing actual programming, readability almost always wins over compactness.
A few other versions I mentioned in the comments:
Reading %p and the DNA from files:
#!/usr/bin/perl
open A,"<p.txt";map{map{/(...)/;$p{$1}=chop}/(... .)/g}<A>;
open B,"<dna.txt";$_=uc<B>;y/GCTA/CGAU/;map{print if$_=$p{$_}}/(...)/g
From shell with perl -e:
perl -e 'open A,"<p.txt";map{map{/(...)/;$p{$1}=chop}/(... .)/g}<A>;open B,"<dna.txt";$_=uc<B>;y/GCTA/CGAU/;map{print if$_=$p{$_}}/(...)/g'
Most things have already been pointed out, especially that readability matters. I wouldn't try to reduce the program more than what follows.
use strict;
use warnings;
# http://stackoverflow.com/questions/5402405/
my $fnprot = shift || 'proteins.txt';
my $fndna = shift || 'dna.txt';
# build protein table
open my $fhprot, '<', $fnprot or die "open $fnprot: $!";
my %proteins = split /\s+/, do { local $/; <$fhprot> };
close $fhprot;
# process dna data
my @result;
open my $fhdna, '<', $fndna or die "open $fndna: $!";
while (<$fhdna>) {
    tr/acgt/ACGT/;
    tr/GCTA/CGAU/;
    push @result, map $proteins{$_}, grep defined $proteins{$_}, m/(...)/g;
}
close $fhdna;
# check correctness of result (given input as per original post)
my $expected = 'SIMQNISGREAT';
my $got = join '', @result;
die "@result is not expected" if $got ne $expected;
print "@result - $got\n";
The only "one-liner" thing I added is the push map grep m//g in the while loop. Note that Perl 5.10 adds the "defined or" operator - // - which allows you to write:
push @result, map $proteins{$_} // (), m/(...)/g;
Ah okay, the open do local $/ file slurp idiom is handy for slurping small files into memory. Hope you find it a bit inspiring. :-)
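To see the slurp idiom on its own, here is a small sketch that reads a whole handle into a hash. The helper name read_table is hypothetical, and an in-memory filehandle stands in for proteins.txt so the snippet runs without any files. It uses split ' ' rather than split /\s+/ so that leading whitespace in the file does not produce an empty first field.

```perl
use strict;
use warnings;

# Hypothetical helper: slurp a handle and split it into key/value pairs.
sub read_table {
    my ($fh) = @_;
    # With $/ undefined, a single read returns the entire contents.
    my $text = do { local $/; <$fh> };
    $text = '' unless defined $text;
    return split ' ', $text;
}

# Stand-in for proteins.txt: a filehandle opened on a string.
my $data = "UUU F UUC F\nAUG M\n";
open my $fh, '<', \$data or die "open: $!";
my %proteins = read_table($fh);
close $fh;
print "$_ => $proteins{$_}\n" for sort keys %proteins;
```

The local keeps the change to $/ confined to the do block, so later line-by-line reads behave normally.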
If you write the protein data to a separate file, space-delimited and without line breaks, you can load the whole table with a single read.
#!/usr/bin/perl
use strict;
use warnings;
open(INPUT, "<mydata.txt");
open(DATA, "<proteins.txt");
my %proteins = split(" ",<DATA>);
while (<INPUT>) {
tr/GCTA/CGAU/;
while(/(\w{3})/gi) {print $proteins{$1} if (exists($proteins{$1}))};
}
close(INPUT);
close(DATA);
The original foreach loop can be condensed as shown above; $1 holds the text captured by the parentheses in /(\w{3})/gi. Be aware that the /i flag only affects how the pattern matches: it does not change the case of the captured text, and the hash lookup is case-sensitive, so lowercase input still needs an uppercasing step such as tr/acgt/ACGT/ before the translation.

Counting the occurrence of a sub-graph in a graph

I have a 3 dimensional dataset that describes the gene interactions which can be formulated as a graph. The sample of dataset is:
a + b
b + c
c - f
b - d
a + c
f + g
g + h
f + h
'+' indicates that a gene on the left side positively regulates the gene on the right. In this data I want to count the sub-graphs where a gene (say, x) positively regulates another gene (say, y), and y in turn positively regulates yet another gene (say, z). Furthermore, z is also positively regulated by x. There are two such cases in the graph above. I would prefer to perform this search using awk, but any scripting language is fine. My apologies for such a specific question, and thanks in advance for the help.
Note: See the information regarding Graphviz below.
This should give you a starting point:
Edit: This version handles genes that are described by more than one character.
awk '
BEGIN { regdelim = "|" }
{
delim=""
if ($2 == "+") {
if (plus[$1]) delim=regdelim
plus[$1]=plus[$1] delim $3
}
else
if ($2 == "-") {
if (minus[$1]) delim=regdelim
minus[$1]=minus[$1] delim $3
}
}
END {
for (root in plus) {
split(plus[root],regs,regdelim)
for (reg in regs) {
if (plus[regs[reg]] && plus[root] ~ plus[regs[reg]]) {
print "Match: ", root, "+", regs[reg], "+", plus[regs[reg]]
}
}
}
}
' inputfile
In the BEGIN clause, set regdelim to a character that doesn't appear in your data.
I've omitted the processing code for the minus data.
Output:
Match: a + b + c
Match: f + g + h
Edit 2:
The version below allows you to search for arbitrary combinations. It generalizes the technique used in the original version so no code needs to be duplicated. It also fixes a couple of other limitations.
#!/bin/bash
# written by Dennis Williamson - 2010-11-12
# for http://stackoverflow.com/questions/4161001/counting-the-occurrence-of-a-sub-graph-in-a-graph
# A (AB) B, A (AC) C, B (BC) C - where "(XY)" represents a + or a -
# provided by the positional parameters $1, $2 and $3
# $4 carries the data file name and is referenced at the end of the script
awk -v AB=$1 -v AC=$2 -v BC=$3 '
BEGIN { regdelim = "|" }
{
if ($2 == AB) {
if (regAB[$1]) delim=regdelim; else delim=""
regAB[$1]=regAB[$1] delim $3
}
if ($2 == AC) {
if (regAC[$1]) delim=regdelim; else delim=""
regAC[$1]=regAC[$1] delim $3
}
if ($2 == BC) {
if (regBC[$1]) delim=regdelim; else delim=""
regBC[$1]=regBC[$1] delim $3
}
}
END {
for (root in regAB) {
split(regAB[root],ABarray,regdelim)
for (ABindex in ABarray) {
split(regAC[root],ACarray,regdelim)
for (ACindex in ACarray) {
split(regBC[ABarray[ABindex]],BCarray,regdelim)
for (BCindex in BCarray) {
if (ACarray[ACindex] == BCarray[BCindex]) {
print " Match:", root, AB, ABarray[ABindex] ",", root, AC, ACarray[ACindex] ",", ABarray[ABindex], BC, BCarray[BCindex]
}
}
}
}
}
}
' "$4"
This can be called like this to do an exhaustive search:
for ab in + -; do for ac in + -; do for bc in + -; do echo "Searching: $ab$ac$bc"; ./searchgraph $ab $ac $bc inputfile; done; done; done
For this data:
a - e
a + b
b + c
c - f
m - n
b - d
a + c
b - e
l - n
f + g
b + i
g + h
l + m
f + h
a + i
a - j
k - j
a - k
The output of the shell loop calling the new version of the script would look like this:
Searching: +++
Match: a + b, a + c, b + c
Match: a + b, a + i, b + i
Match: f + g, f + h, g + h
Searching: ++-
Searching: +-+
Searching: +--
Match: l + m, l - n, m - n
Match: a + b, a - e, b - e
Searching: -++
Searching: -+-
Searching: --+
Searching: ---
Match: a - k, a - j, k - j
Edit 3:
Graphviz
Another approach would be to use Graphviz. The DOT language can describe the graph and gvpr, which is an "AWK-like" [1] programming language, can analyze and manipulate DOT files.
Given the input data in the format as shown in the question, you can use the following AWK program to convert it to DOT:
#!/usr/bin/awk -f
BEGIN {
print "digraph G {"
print " size=\"5,5\""
print " ratio=.85"
print " node [fontsize=24 color=blue penwidth=3]"
print " edge [fontsize=18 labeldistance=5 labelangle=-8 minlen=2 penwidth=3]"
print " {rank=same; f l}"
m = "-" # ASCII minus/hyphen as in the source data
um = "−" # u2212 minus: − which looks better on the output graphic
p = "+"
}
{
if ($2 == m) { $2 = um; c = lbf = "red"; arr=" arrowhead = empty" }
if ($2 == p) { c = lbf = "green3"; arr="" }
print " " $1, "->", $3, "[taillabel = \"" $2 "\" color = \"" c "\" labelfontcolor = \"" lbf "\"" arr "]"
}
END {
print "}"
}
The command to run would be something like this:
$ ./dat2dot data.dat > data.dot
You can then create the graphic using:
$ dot -Tpng -o data.png data.dot
I used the extended data as given above in this answer.
To do an exhaustive search for the type of subgraphs you specified, you can use the following gvpr program:
BEGIN {
edge_t AB, BC, AC;
}
E {
AB = $;
BC = fstedge(AB.head);
while (BC && BC.head.name != AB.head.name) {
AC = isEdge(AB.tail,BC.head,"");
if (AC) {
printf("%s %s %s, ", AB.tail.name, AB.taillabel, AB.head.name);
printf("%s %s %s, ", AC.tail.name, AC.taillabel, AC.head.name);
printf("%s %s %s\n", BC.tail.name, BC.taillabel, BC.head.name);
}
BC = nxtedge(BC, AB.head);
}
}
To run it, you could use:
$ gvpr -f groups.g data.dot | sort -k 2,2 -k 5,5 -k 8,8
The output would be similar to that from the AWK/shell combination above (under "Edit 2"):
a + b, a + c, b + c
a + b, a + i, b + i
f + g, f + h, g + h
a + b, a − e, b − e
l + m, l − n, m − n
a − k, a − j, k − j
[1] Loosely speaking.
An unconventional approach using Perl is below.
#! /usr/bin/perl
use warnings;
use strict;
my $graph = q{
a + c
b + c
c - f
b - d
a + b
f + g
g + h
f + h
};
my $nodes = join ",", sort keys %{ { map +($_ => 1), $graph =~ /(\w+)/g } };
my $search = "$nodes:$nodes:$nodes:$graph";
my $subgraph = qr/
\A .*? (?<x>\w+) .*:
.*? (?<y>\w+) .*:
.*? (?<z>\w+) .*:
(?= .*^\s* \k<x> \s* \+ \s* \k<y> \s*$)
(?= .*^\s* \k<y> \s* \+ \s* \k<z> \s*$)
(?= .*^\s* \k<x> \s* \+ \s* \k<z> \s*$)
(?{ print "x=$+{x}, y=$+{y}, z=$+{z}\n" })
(?!)
/smx;
$search =~ /$subgraph/;
The regex engine is a powerful tool. For your problem, we describe the structure of a transitive subgraph and then allow the resulting machine to go find all of them.
Output:
x=a, y=b, z=c
x=f, y=g, z=h
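For comparison, a conventional hash-based search finds the same triples. This is only a sketch of the plain approach, not the regex technique shown above; positive_triangles is a hypothetical name, and the input format is assumed to be the "x + y" lines from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: return [x, y, z] triples with edges x+y, y+z and x+z.
sub positive_triangles {
    my ($graph) = @_;
    my %plus;    # $plus{$from}{$to} is true when $from positively regulates $to
    for my $line (split /\n/, $graph) {
        my ($from, $op, $to) = split ' ', $line;
        next unless defined $to && $op eq '+';
        $plus{$from}{$to} = 1;
    }
    my @hits;
    for my $x (sort keys %plus) {
        for my $y (sort keys %{ $plus{$x} }) {
            next unless exists $plus{$y};    # y has no outgoing + edges
            for my $z (sort keys %{ $plus{$y} }) {
                push @hits, [ $x, $y, $z ] if $plus{$x}{$z};
            }
        }
    }
    return @hits;
}

my $graph = "a + c\nb + c\nc - f\nb - d\na + b\nf + g\ng + h\nf + h\n";
print "x=$_->[0], y=$_->[1], z=$_->[2]\n" for positive_triangles($graph);
```

The exists check avoids autovivifying entries for genes that never appear on the left of a + edge.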
A more general tool using this same technique is possible. For example, let's say you want to be able to specify gene patterns such as a+b+c;a+c or g1+g2-g3;g1+g3. I hope the meanings of these patterns are obvious.
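In the front matter, I specify a minimum version of 5.10.0 because the code uses // and lexical $_. (Lexical $_ was later marked experimental and removed in Perl 5.24, so on a modern perl the my $_ declarations need to become ordinary named variables.) The code constructs dynamic regexes that will evaluate code, which the use re 'eval' pragma enables.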
#! /usr/bin/perl
use warnings;
use strict;
use 5.10.0;
use re 'eval';
An identifier is a sequence of one or more “word characters,” i.e., letters, digits, or underscores.
my $ID = qr/\w+/;
Given a regex that accepts variable names, unique_vars searches some specification for all variable names and returns them without repetition.
sub unique_vars {
    my($_,$pattern) = @_;
    keys %{ { map +($_ => undef), /($pattern)/g } };
}
Compiling a gene pattern into a regex is a little hairy. It dynamically generates a search target and regex with the same form as the static one above.
The first part with multiple occurrences of comma-separated variables lets the regex engine try each possible value for each gene. Then the lookaheads, (?=...), scan the graph looking for edges with the desired properties. If all the lookaheads succeed, we record the hit.
The strange regex (?!) at the end is an unconditional failure that forces the matcher to backtrack and attempt the match with different genes. Because it's unconditional, the engine will evaluate all possibilities.
Calling the same closure from multiple threads concurrently will likely produce strange results.
sub compile_gene_pattern {
    my($dataset,$pattern) = @_;
    my @vars = sort +unique_vars $pattern, qr/[a-z]\d*/; # / for SO hilite
    my $nodes = join ",", sort +unique_vars $dataset, $ID;
    my $search = join("", map "$_:", ($nodes) x @vars) . "\n"
               . $dataset;
    my $spec = '\A' . "\n" . join("", map ".*? (?<$_>$ID) .*:\n", @vars);
    for (split /;/, $pattern) {
        while (s/^($ID)([-+])($ID)/$3/) {
            $spec .= '(?= .*^\s* ' .
                     ' \b\k<' . $1 . '>\b ' .
                     ' \s*' . quotemeta($2) . '\s* ' .
                     ' \b\k<' . $3 . '>\b ' .
                     ' \s*$)' . "\n";
        }
    }
    my %hits;
    $spec .= '(?{ ++$hits{join "-", @+{@vars}} })' . "\n" .
             '(?!) # backtrack' . "\n";
    my $nfa = eval { qr/$spec/smx } || die "$0: INTERNAL: bad regex:\n$@";
    sub {
        %hits = (); # thread-safety? :-(
        (my $_ = $search) =~ /$nfa/;
        map [split /-/], sort keys %hits;
    }
}
Read the dataset and let the user know about any problems.
sub read_dataset {
    my($path) = @_;
    open my $fh, "<", $path or die "$0: open $path: $!";
    local $/ = "\n";
    local $_;
    my $graph;
    my @errors;
    while (<$fh>) {
        next if /^\s*#/ || /^\s*$/;
        if (/^ \s* $ID \s* [-+] \s* $ID \s* $/x) {
            $graph .= $_;
        }
        else {
            push @errors, "$.: syntax error";
        }
    }
    return $graph unless @errors;
    die map "$0: $path:$_\n", @errors;
}
Now we set it all into motion:
my $graphs = shift // "graphs.txt";
my $dataset = read_dataset $graphs;
my $ppp = compile_gene_pattern $dataset, "a+b+c;a+c";
print "@$_\n" for $ppp->();
my $pmp = compile_gene_pattern $dataset, "g1+g2-g3;g1+g3";
print "@$_\n" for $pmp->();
Given graphs.txt with contents
a + b
b + c
c - f
b - d
a + c
f + g
g + h
f + h
foo + bar
bar - baz
foo + baz
and then running the program, we get the following output:
a b c
f g h
foo bar baz
I assume that by "count the sub-graph" you mean counting the nodes in a sub-graph. If that's what you need, you can use any scripting language. First store the graph in a structure or class; the node structure/class should look something like this (this does not follow the syntax of any particular language, it is only a plan for your application):
Node {color = 0; title = ""; minusNodeSet = null; plusNodeSet = null}
Here color = 0 (the default value) means you haven't visited the node yet, and title will be 'a', 'b', 'c', and so on. minusNodeSet is the set of nodes reached by a minus edge leaving this node, and plusNodeSet is the set of nodes reached by a plus edge leaving this node.
Now that we have a structure, we can use it in a depth-first algorithm:
int depth_first(Node actualNode)
{
    if (actualNode.color == 1)
        return 0;
    number = 1;
    actualNode.color = 1;
    foreach node in (union of actualNode.plusNodeSet and actualNode.minusNodeSet) do
        if (node.color == 0)
            number = number + depth_first(node);
    return number;
}
If I misunderstood your question, please tell me so that I can edit my answer to make it more useful.
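The plan above could be sketched in Perl roughly as follows. This is an assumption-laden illustration: count_reachable is a hypothetical name, the graph is passed as a hash of adjacency lists instead of Node objects, a %seen hash plays the role of the color field, and an explicit stack replaces the recursion.

```perl
use strict;
use warnings;

# Hypothetical helper: count the nodes reachable from $start, depth first.
sub count_reachable {
    my ($edges, $start) = @_;    # $edges: { node => [ neighbouring nodes ] }
    my %seen;                    # replaces the "color" field of the plan
    my @stack = ($start);
    while (@stack) {
        my $node = pop @stack;
        next if $seen{$node}++;  # already visited
        push @stack, @{ $edges->{$node} || [] };
    }
    return scalar keys %seen;
}

# Edges from the question, ignoring whether they are + or -.
my %edges = (
    a => [qw(b c)],
    b => [qw(c d)],
    c => [qw(f)],
    f => [qw(g h)],
    g => [qw(h)],
);
print count_reachable(\%edges, 'a'), "\n";
```

The count includes the start node itself, matching the number = 1 initialisation in the pseudocode.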
The structure of the regex in my other answer resembles list-monad processing. Given that inspiration, a search for the transitive subgraphs is below as literate Haskell. Copy and paste this answer into a file with the extension .lhs to get a working program. Be sure to surround the code sections, marked by leading >, with empty lines.
Thanks for the fun problem!
A bit of front matter:
> {-# LANGUAGE ViewPatterns #-}
> module Main where
> import Control.Monad (guard)
> import Data.List (nub)
> import Data.Map (findWithDefault,fromListWith,toList)
The name of a gene can be any string, and for a given Gene g, a function of type PosReg should give us all the genes that g positively regulates.
> type Gene = String
> type PosReg = Gene -> [Gene]
From a graph specified as in your question, we want triples of genes such that the is-positively-regulated-by relation is transitive, and subgraphs describes the desired properties. First, pick an arbitrary gene x from the graph. Next, choose one of the genes y that x positively regulates. For the transitive property to hold, z must be a gene that both x and y positively regulate.
> subgraphs :: String -> [(Gene,Gene,Gene)]
> subgraphs g = do
>   x <- choose
>   y <- posRegBy x
>   z <- posRegBy y
>   guard $ z `elem` posRegBy x
>   return (x,y,z)
>   where (choose,posRegBy) = decode g
With the simple parser in decode, we distill the list of genes in the graph and a PosReg function that gives all genes positively regulated by some other gene.
> decode :: String -> ([Gene], PosReg)
> decode g =
>   let pr = fromListWith (++) $ go (lines g)
>       gs = nub $ concatMap (\(a,b) -> a : b) $ toList pr
>   in (gs, (\x -> findWithDefault [] x pr))
>   where
>     go ((words -> [a, op, b]):ls)
>       | op == "+" = (a,[b]) : go ls
>       | otherwise = go ls
>     go _ = []
Finally, the main program glues it all together. For each subgraph found, print it to the standard output.
> main :: IO ()
> main = mapM_ (putStrLn . show) $ subgraphs graph
>   where graph = "a + b\n\
>                 \b + c\n\
>                 \c - f\n\
>                 \b - d\n\
>                 \a + c\n\
>                 \f + g\n\
>                 \g + h\n\
>                 \f + h\n"
Output:
("a","b","c")
("f","g","h")