Counting the occurrence of a sub-graph in a graph - perl

I have a three-column dataset that describes gene interactions and can be formulated as a graph. A sample of the dataset:
a + b
b + c
c - f
b - d
a + c
f + g
g + h
f + h
'+' indicates that the gene on the left positively regulates the gene on the right. In this data I want to count the sub-graphs in which a gene (say, x) positively regulates another gene (say, y), and y in turn positively regulates a third gene (say, z), which is also positively regulated by x. There are two such cases in the graph above. I would prefer to perform this search using awk, but any scripting language is fine. My apologies if the question is too specific, and thanks in advance for the help.

Note: See the information regarding Graphviz below.
This should give you a starting point:
Edit: This version handles genes that are described by more than one character.
awk '
BEGIN { regdelim = "|" }
{
    delim=""
    if ($2 == "+") {
        if (plus[$1]) delim=regdelim
        plus[$1]=plus[$1] delim $3
    }
    else
        if ($2 == "-") {
            if (minus[$1]) delim=regdelim
            minus[$1]=minus[$1] delim $3
        }
}
END {
    for (root in plus) {
        split(plus[root],regs,regdelim)
        for (reg in regs) {
            if (plus[regs[reg]] && plus[root] ~ plus[regs[reg]]) {
                print "Match: ", root, "+", regs[reg], "+", plus[regs[reg]]
            }
        }
    }
}
' inputfile
In the BEGIN clause, set regdelim to a character that doesn't appear in your data.
I've omitted the processing code for the minus data.
Output:
Match: a + b + c
Match: f + g + h
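For comparison, the same idea can be sketched in Python with one set of positively regulated targets per gene. This is only an illustration and not part of the awk answer; the function name `count_feed_forward` and the inline sample are assumptions made for the example. Because it compares whole gene names through set intersection rather than a regex test, names that are substrings of one another should not produce false matches.

```python
from collections import defaultdict

def count_feed_forward(lines):
    """Return sorted (x, y, z) triples with x + y, y + z and x + z."""
    plus = defaultdict(set)  # gene -> set of genes it positively regulates
    for line in lines:
        fields = line.split()
        if len(fields) == 3 and fields[1] == "+":
            plus[fields[0]].add(fields[2])
    triples = []
    for x in list(plus):
        for y in plus[x]:
            # z must be positively regulated by both x and y
            for z in plus[x] & plus.get(y, set()):
                triples.append((x, y, z))
    return sorted(triples)

sample = """a + b
b + c
c - f
b - d
a + c
f + g
g + h
f + h""".splitlines()

matches = count_feed_forward(sample)
print(len(matches), matches)
```

On the sample from the question this finds the same two triples as the awk script, (a, b, c) and (f, g, h).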
Edit 2:
The version below allows you to search for arbitrary combinations. It generalizes the technique used in the original version so no code needs to be duplicated. It also fixes a couple of other limitations.
#!/bin/bash
# written by Dennis Williamson - 2010-11-12
# for http://stackoverflow.com/questions/4161001/counting-the-occurrence-of-a-sub-graph-in-a-graph
# A (AB) B, A (AC) C, B (BC) C - where "(XY)" represents a + or a -
# provided by the positional parameters $1, $2 and $3
# $4 carries the data file name and is referenced at the end of the script
awk -v AB="$1" -v AC="$2" -v BC="$3" '
BEGIN { regdelim = "|" }
{
    if ($2 == AB) {
        if (regAB[$1]) delim=regdelim; else delim=""
        regAB[$1]=regAB[$1] delim $3
    }
    if ($2 == AC) {
        if (regAC[$1]) delim=regdelim; else delim=""
        regAC[$1]=regAC[$1] delim $3
    }
    if ($2 == BC) {
        if (regBC[$1]) delim=regdelim; else delim=""
        regBC[$1]=regBC[$1] delim $3
    }
}
END {
    for (root in regAB) {
        split(regAB[root],ABarray,regdelim)
        for (ABindex in ABarray) {
            split(regAC[root],ACarray,regdelim)
            for (ACindex in ACarray) {
                split(regBC[ABarray[ABindex]],BCarray,regdelim)
                for (BCindex in BCarray) {
                    if (ACarray[ACindex] == BCarray[BCindex]) {
                        print " Match:", root, AB, ABarray[ABindex] ",", root, AC, ACarray[ACindex] ",", ABarray[ABindex], BC, BCarray[BCindex]
                    }
                }
            }
        }
    }
}
' "$4"
This can be called like this to do an exhaustive search:
for ab in + -; do for ac in + -; do for bc in + -; do echo "Searching: $ab$ac$bc"; ./searchgraph $ab $ac $bc inputfile; done; done; done
For this data:
a - e
a + b
b + c
c - f
m - n
b - d
a + c
b - e
l - n
f + g
b + i
g + h
l + m
f + h
a + i
a - j
k - j
a - k
The output of the shell loop calling the new version of the script would look like this:
Searching: +++
Match: a + b, a + c, b + c
Match: a + b, a + i, b + i
Match: f + g, f + h, g + h
Searching: ++-
Searching: +-+
Searching: +--
Match: l + m, l - n, m - n
Match: a + b, a - e, b - e
Searching: -++
Searching: -+-
Searching: --+
Searching: ---
Match: a - k, a - j, k - j
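The exhaustive search over all eight sign combinations can also be sketched in Python. This is a rough equivalent of the shell loop plus awk script, not a translation of it; `parse_edges`, `find_triples` and the `results` dictionary are names invented for this example, and the data is the extended dataset shown above.

```python
from collections import defaultdict
from itertools import product

DATA = """a - e
a + b
b + c
c - f
m - n
b - d
a + c
b - e
l - n
f + g
b + i
g + h
l + m
f + h
a + i
a - j
k - j
a - k"""

def parse_edges(text):
    """Turn 'a + b' lines into (source, sign, target) tuples."""
    return [tuple(line.split()) for line in text.splitlines() if line.strip()]

def find_triples(edges, ab, ac, bc):
    """Return sorted (a, b, c) with edges a ab b, a ac c and b bc c."""
    targets = defaultdict(set)  # (source, sign) -> set of targets
    for src, sign, dst in edges:
        targets[(src, sign)].add(dst)
    sources = {src for src, _, _ in edges}
    found = []
    for a in sources:
        for b in targets.get((a, ab), ()):
            # c must be reachable from a via ac and from b via bc
            common = targets.get((a, ac), set()) & targets.get((b, bc), set())
            found.extend((a, b, c) for c in common)
    return sorted(found)

edges = parse_edges(DATA)
results = {}
for ab, ac, bc in product("+-", repeat=3):
    key = ab + ac + bc
    results[key] = find_triples(edges, ab, ac, bc)
    print("Searching:", key, results[key])
```

Keying the lookup table on (source, sign) means a single pass over the data serves every sign combination.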
Edit 3:
Graphviz
Another approach would be to use Graphviz. The DOT language can describe the graph and gvpr, which is an "AWK-like" [1] programming language, can analyze and manipulate DOT files.
Given the input data in the format as shown in the question, you can use the following AWK program to convert it to DOT:
#!/usr/bin/awk -f
BEGIN {
    print "digraph G {"
    print " size=\"5,5\""
    print " ratio=.85"
    print " node [fontsize=24 color=blue penwidth=3]"
    print " edge [fontsize=18 labeldistance=5 labelangle=-8 minlen=2 penwidth=3]"
    print " {rank=same; f l}"
    m = "-" # ASCII minus/hyphen as in the source data
    um = "−" # u2212 minus: − which looks better on the output graphic
    p = "+"
}
{
    if ($2 == m) { $2 = um; c = lbf = "red"; arr=" arrowhead = empty" }
    if ($2 == p) { c = lbf = "green3"; arr="" }
    print " " $1, "->", $3, "[taillabel = \"" $2 "\" color = \"" c "\" labelfontcolor = \"" lbf "\"" arr "]"
}
END {
    print "}"
}
The command to run would be something like this:
$ ./dat2dot data.dat > data.dot
You can then render the graph as a PNG image using:
$ dot -Tpng -o data.png data.dot
I used the extended data as given above in this answer.
To do an exhaustive search for the type of subgraphs you specified, you can use the following gvpr program:
BEGIN {
    edge_t AB, BC, AC;
}
E {
    AB = $;
    BC = fstedge(AB.head);
    while (BC && BC.head.name != AB.head.name) {
        AC = isEdge(AB.tail,BC.head,"");
        if (AC) {
            printf("%s %s %s, ", AB.tail.name, AB.taillabel, AB.head.name);
            printf("%s %s %s, ", AC.tail.name, AC.taillabel, AC.head.name);
            printf("%s %s %s\n", BC.tail.name, BC.taillabel, BC.head.name);
        }
        BC = nxtedge(BC, AB.head);
    }
}
To run it, you could use:
$ gvpr -f groups.g data.dot | sort -k 2,2 -k 5,5 -k 8,8
The output would be similar to that from the AWK/shell combination above (under "Edit 2"):
a + b, a + c, b + c
a + b, a + i, b + i
f + g, f + h, g + h
a + b, a − e, b − e
l + m, l − n, m − n
a − k, a − j, k − j
[1] Loosely speaking.

An unconventional approach using Perl is below.
#! /usr/bin/perl
use warnings;
use strict;
my $graph = q{
a + c
b + c
c - f
b - d
a + b
f + g
g + h
f + h
};
my $nodes = join ",", sort keys %{ { map +($_ => 1), $graph =~ /(\w+)/g } };
my $search = "$nodes:$nodes:$nodes:$graph";
my $subgraph = qr/
\A .*? (?<x>\w+) .*:
.*? (?<y>\w+) .*:
.*? (?<z>\w+) .*:
(?= .*^\s* \k<x> \s* \+ \s* \k<y> \s*$)
(?= .*^\s* \k<y> \s* \+ \s* \k<z> \s*$)
(?= .*^\s* \k<x> \s* \+ \s* \k<z> \s*$)
(?{ print "x=$+{x}, y=$+{y}, z=$+{z}\n" })
(?!)
/smx;
$search =~ /$subgraph/;
The regex engine is a powerful tool. For your problem, we describe the structure of a transitive subgraph and then allow the resulting machine to go find all of them.
Output:
x=a, y=b, z=c
x=f, y=g, z=h
A more general tool using this same technique is possible. For example, let's say you want to be able to specify gene patterns such as a+b+c;a+c or g1+g2-g3;g1+g3. I hope the meanings of these patterns are obvious.
In the front matter, I specify a minimum version of 5.10.0 because the code uses // and lexical $_. The code constructs dynamic regexes that will evaluate code, which the use re 'eval' pragma enables.
#! /usr/bin/perl
use warnings;
use strict;
use 5.10.0;
use re 'eval';
An identifier is a sequence of one or more “word characters,” i.e., letters, digits, or underscores.
my $ID = qr/\w+/;
Given a regex that accepts variable names, unique_vars searches some specification for all variable names and returns them without repetition.
sub unique_vars {
  my($_,$pattern) = @_;
  keys %{ { map +($_ => undef), /($pattern)/g } };
}
Compiling a gene pattern into a regex is a little hairy. It dynamically generates a search target and regex with the same form as the static one above.
The first part with multiple occurrences of comma-separated variables lets the regex engine try each possible value for each gene. Then the lookaheads, (?=...), scan the graph looking for edges with the desired properties. If all the lookaheads succeed, we record the hit.
The strange regex (?!) at the end is an unconditional failure that forces the matcher to backtrack and attempt the match with different genes. Because it's unconditional, the engine will evaluate all possibilities.
Calling the same closure from multiple threads concurrently will likely produce strange results.
sub compile_gene_pattern {
  my($dataset,$pattern) = @_;
  my @vars   = sort +unique_vars $pattern, qr/[a-z]\d*/;  # / for SO hilite
  my $nodes  = join ",", sort +unique_vars $dataset, $ID;
  my $search = join("", map "$_:", ($nodes) x @vars) . "\n"
             . $dataset;
  my $spec = '\A' . "\n" . join("", map ".*? (?<$_>$ID) .*:\n", @vars);
  for (split /;/, $pattern) {
    while (s/^($ID)([-+])($ID)/$3/) {
      $spec .= '(?= .*^\s* ' .
               ' \b\k<' . $1 . '>\b ' .
               ' \s*' . quotemeta($2) . '\s* ' .
               ' \b\k<' . $3 . '>\b ' .
               ' \s*$)' . "\n";
    }
  }
  my %hits;
  $spec .= '(?{ ++$hits{join "-", @+{@vars}} })' . "\n" .
           '(?!) # backtrack' . "\n";
  my $nfa = eval { qr/$spec/smx } || die "$0: INTERNAL: bad regex:\n$@";
  sub {
    %hits = ();  # thread-safety? :-(
    (my $_ = $search) =~ /$nfa/;
    map [split /-/], sort keys %hits;
  }
}
Read the dataset and let the user know about any problems.
sub read_dataset {
  my($path) = @_;
  open my $fh, "<", $path or die "$0: open $path: $!";
  local $/ = "\n";
  local $_;
  my $graph;
  my @errors;
  while (<$fh>) {
    next if /^\s*#/ || /^\s*$/;
    if (/^ \s* $ID \s* [-+] \s* $ID \s* $/x) {
      $graph .= $_;
    }
    else {
      push @errors, "$.: syntax error";
    }
  }
  return $graph unless @errors;
  die map "$0: $path:$_\n", @errors;
}
Now we set it all into motion:
my $graphs = shift // "graphs.txt";
my $dataset = read_dataset $graphs;
my $ppp = compile_gene_pattern $dataset, "a+b+c;a+c";
print "@$_\n" for $ppp->();
my $pmp = compile_gene_pattern $dataset, "g1+g2-g3;g1+g3";
print "@$_\n" for $pmp->();
Given graphs.txt with contents
a + b
b + c
c - f
b - d
a + c
f + g
g + h
f + h
foo + bar
bar - baz
foo + baz
and then running the program, we get the following output:
a b c
f g h
foo bar baz
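What the regex engine does here, trying every assignment of genes to pattern variables and keeping those for which every required edge exists, can be written out explicitly. The Python sketch below is a plain brute-force version of that search, not a port of the Perl code; `parse_pattern`, `search` and the inline copy of graphs.txt are assumptions made for the example. Unlike the Perl version it accepts any identifier as a variable name, and it returns bindings in enumeration order rather than collecting them in a hash.

```python
import re
from itertools import product

GRAPH = """a + b
b + c
c - f
b - d
a + c
f + g
g + h
f + h
foo + bar
bar - baz
foo + baz"""

def parse_pattern(pattern):
    """Turn 'a+b+c;a+c' into [('a','+','b'), ('b','+','c'), ('a','+','c')]."""
    constraints = []
    for chain in pattern.split(";"):
        tokens = re.findall(r"\w+|[-+]", chain)
        for i in range(0, len(tokens) - 2, 2):
            constraints.append((tokens[i], tokens[i + 1], tokens[i + 2]))
    return constraints

def search(edges, pattern):
    """Return every binding of pattern variables to genes that satisfies all edges."""
    constraints = parse_pattern(pattern)
    variables = sorted({v for s, _, d in constraints for v in (s, d)})
    nodes = sorted({n for s, _, d in edges for n in (s, d)})
    edge_set = set(edges)
    hits = []
    # Enumerate all candidate bindings, as the backtracking matcher does.
    for values in product(nodes, repeat=len(variables)):
        binding = dict(zip(variables, values))
        if all((binding[s], op, binding[d]) in edge_set for s, op, d in constraints):
            hits.append(values)
    return hits

edges = [tuple(line.split()) for line in GRAPH.splitlines()]
for pattern in ("a+b+c;a+c", "g1+g2-g3;g1+g3"):
    for hit in search(edges, pattern):
        print(" ".join(hit))
```

The cost grows as the number of genes raised to the number of pattern variables, so this is only practical for small patterns.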

I assume that by "count the sub-graph" you mean counting the nodes in a sub-graph. If that's what you need, you can use any scripting language, but you first have to store the graph in a structure or class. The node structure/class could look like this (this does not conform to the syntax of any particular language; it is only a plan for your application):
Node {color = 0; title = ""; minusNodeSet = null; plusNodeSet = null}
Here color = 0 (the default value) means the node has not been visited yet, and title will be 'a', 'b', 'c', and so on. minusNodeSet is the set of nodes reached by a minus edge leaving this node, and plusNodeSet is the set of nodes reached by a plus edge leaving this node.
Now that we have a data structure, we can use it in a depth-first algorithm:
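int depth_first(Node actualNode)
{
    if (actualNode.color == 1)
        return 0;   // already visited, contributes nothing
    number = 1;
    actualNode.color = 1;
    // follow both plus and minus edges leaving this node
    foreach (union of actualNode.plusNodeSet and actualNode.minusNodeSet) as node do
        if (node.color == 0)
            number = number + depth_first(node);
    return number;
}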
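The plan above could be realized roughly as follows in Python. This is a sketch under assumptions: a dictionary of sets stands in for the Node class, a `visited` set replaces the color field, the traversal is iterative rather than recursive, and the names `build_graph` and `count_reachable` are made up for the example.

```python
from collections import defaultdict

def build_graph(lines):
    """Map each gene to the set of genes it regulates (either sign)."""
    graph = defaultdict(set)
    for line in lines:
        src, _sign, dst = line.split()
        graph[src].add(dst)
        graph[dst]  # touch the key so genes without outgoing edges exist too
    return graph

def count_reachable(graph, start):
    """Count the nodes reachable from start, including start itself."""
    visited = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in graph[node]:
            if nxt not in visited:
                visited.add(nxt)
                stack.append(nxt)
    return len(visited)

sample = ["a + b", "b + c", "c - f", "b - d", "a + c", "f + g", "g + h", "f + h"]
graph = build_graph(sample)
print(count_reachable(graph, "a"), count_reachable(graph, "f"))
```

Using an explicit stack avoids hitting Python's recursion limit on long regulation chains.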
If I misunderstood your question, please tell me so that I can edit my answer to make it more useful.

The structure of the regex in my other answer resembles list-monad processing. Given that inspiration, a search for the transitive subgraphs is below as literate Haskell. Copy and paste this answer to a file with the extension .lhs to get a working program. Be sure to surround the code sections, marked by leading >, with empty lines.
Thanks for the fun problem!
A bit of front matter:
> {-# LANGUAGE ViewPatterns #-}
> module Main where
> import Control.Monad (guard)
> import Data.List (nub)
> import Data.Map (findWithDefault,fromListWith,toList)
The name of a gene can be any string, and for a given Gene g, a function of type PosReg should give us all the genes that g positively regulates.
> type Gene = String
> type PosReg = Gene -> [Gene]
From a graph specified as in your question, we want triples of genes such that the is-positively-regulated-by relation is transitive, and subgraphs describes the desired properties. First, pick an arbitrary gene x from the graph. Next, choose one of the genes y that x positively regulates. For the transitive property to hold, z must be a gene that both x and y positively regulate.
> subgraphs :: String -> [(Gene,Gene,Gene)]
> subgraphs g = do
>     x <- choose
>     y <- posRegBy x
>     z <- posRegBy y
>     guard $ z `elem` posRegBy x
>     return (x,y,z)
>   where (choose,posRegBy) = decode g
With the simple parser in decode, we distill the list of genes in the graph and a PosReg function that gives all genes positively regulated by some other gene.
> decode :: String -> ([Gene], PosReg)
> decode g =
>   let pr = fromListWith (++) $ go (lines g)
>       gs = nub $ concatMap (\(a,b) -> a : b) $ toList pr
>   in (gs, (\x -> findWithDefault [] x pr))
>   where
>     go ((words -> [a, op, b]):ls)
>       | op == "+" = (a,[b]) : go ls
>       | otherwise = go ls
>     go _ = []
Finally, the main program glues it all together. For each subgraph found, print it to the standard output.
> main :: IO ()
> main = mapM_ (putStrLn . show) $ subgraphs graph
>   where graph = "a + b\n\
>                 \b + c\n\
>                 \c - f\n\
>                 \b - d\n\
>                 \a + c\n\
>                 \f + g\n\
>                 \g + h\n\
>                 \f + h\n"
Output:
("a","b","c")
("f","g","h")

Related

Extract each match+context of +/-N rows even if contiguous/overlapping (grep/sed/awk/powershell/etc)

I encountered an unexpected problem when using grep -B 10 -A 10 to get keywords + context rows. If there was a match within that context, then it did not count that as a separate match, but extended the context. This problem has been encountered before: Do not merge the context of contiguous matches with grep
Thus, when finding matches from:
a
b
match1
c
d
e
match2
f
match3
g
h
i
j
I'd like to get the following (with whatever delimiter). The example uses N=2 in either direction, but I would like a general solution where the number is easy to tweak.
b
match1
c
d
--
d
e
match2
f
match3
--
match2
f
match3
g
h
--
A small Python script was posted as an answer there, but that was nine years ago, so maybe there is a better way by now. I'm working with huge files (100M+ rows), so the Python script does not really do the job.
Maybe there is some new or old commandline tool that does this?
Since PowerShell is tagged here, you can do the following which should be easy on memory at the expense of speed. We could potentially read the file in other ways that are more efficient, but you lose some conciseness.
$StringMatch = 'match' # Text you want to match
$n = 2 # Context number or the number of lines above and below the match
$sectionEnd = $false
$tracker = [collections.queue]::new()
# You may want to feed in multiple files, which can be done with a surrounding foreach loop at this spot
get-content a.txt -readcount 1 | foreach-object -Process {
    $sectionEnd = $false
    $tracker.Enqueue($_)
    if ($tracker.count -gt ($n*2+1)) {
        $null = $tracker.Dequeue()
    }
    if ($tracker.count -eq ($n*2+1) -and $tracker.ToArray()[$n] -match $StringMatch) {
        $tracker
        "----------"
        $sectionEnd = $true
    }
} -End { # -End block can be removed if you don't want to output a final misaligned $tracker
    if (!$sectionEnd -and $tracker.ToArray() -match $StringMatch) {
        $tracker
        "---------"
    }
}
Here's a perl script named context.pl where $n controls the number of context lines.
BEGIN { $/ = "\n"; $\ = ""; }
LINE: while (defined($_ = <ARGV>)) {
    sub BEGIN {
        $n = 3;
    }
    {
        $j = $. % $n;
        if (/match/) {
            for ($i = $j; $i < $j + $n; ++$i) {
                print $buf{$i % $n};
            }
            print $_;
            $fp = tell ARGV;
            foreach $_ (1 .. $n) {
                unless (eof) {
                    $line = <ARGV>;
                    print $line;
                }
            }
            print "--\n";
            seek ARGV, $fp, 0;
        }
        $buf{$j} = $_;
    }
}
The script keeps a buffer holding the previous $n lines. Whenever a line matches, the buffer contents are printed, followed by the current line. $fp then saves the current file position, the next $n lines are read and printed, and the delimiter -- is printed. Finally, the file position is restored so that processing resumes at the line after the current match.
Here's a sample run with $n = 2:
$ perl context.pl ip.txt
a
b
match1
c
d
--
d
e
match2
f
match3
--
match2
f
match3
g
h
--
There are corner case issues. If the file doesn't end with $n lines after the last match, the results are mangled. For example, with the changed input as shown below and $n = 4:
$ cat ip.txt
a
b
match1
match2
c
$ perl context.pl ip.txt
a
b
match1
match2
c
--
b
match1
a
match2
c
--
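The windowing logic can also be done in a single forward pass without seeking, which sidesteps the end-of-file corner case. The Python sketch below is meant to illustrate that logic rather than to be fast enough for 100M-row files; the function name `context_blocks` and its `is_match` callback are assumptions made for the example.

```python
from collections import deque

def context_blocks(lines, is_match, n):
    """Yield one list per matching line: up to n lines before it, the match, up to n after."""
    before = deque(maxlen=n)  # the n most recent lines
    pending = []              # [block, lines_still_needed] for matches awaiting trailing context
    for line in lines:
        # Every open block receives the current line as trailing context.
        for entry in pending:
            entry[0].append(line)
            entry[1] -= 1
        while pending and pending[0][1] == 0:
            yield pending.pop(0)[0]
        if is_match(line):
            block = list(before) + [line]
            if n == 0:
                yield block
            else:
                pending.append([block, n])
        before.append(line)
    # Matches too close to the end of the input get whatever context exists.
    for block, _ in pending:
        yield block

sample = "a b match1 c d e match2 f match3 g h i j".split()
for block in context_blocks(sample, lambda s: "match" in s, 2):
    print("\n".join(block))
    print("--")
```

At most n blocks are open at any time, so memory use stays proportional to n squared lines regardless of file size.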

Add dipeptide frequency in this perl script based on sequence length

I have a perl script to get the di-peptide counts (there are 400 combinations, for example- AA, AC, AD, AE...) from sequences (fasta format). But I would like to add the frequency based on the sequence lengths. I have a input with multiple sequences (myfile.fasta).
I tried to do it, but I got the wrong results. I am not very familiar with perl.
My script:
use strict;
use warnings;
use Bio::SeqIO;
my @amino=qw/A C D E F G H I K L M N P Q R S T V W Y/;
my @comb=();
foreach my $a (@amino){
    foreach my $b (@amino){
        push (@comb,$a.$b)
    }
}
my $in = Bio::SeqIO->new(-file => "myfile.fasta" , '-format' => 'Fasta');
while ( my $seq= $in->next_seq ) {
    my @dipeps=($seq->seq()=~/(?=(.{2}))/g);
    my %di_count=();
    $di_count{$_}++ for @dipeps;
    print $seq->id();
    map{exists $di_count{$_}?print " ",$di_count{$_}:print " ",0}sort @comb;
    print "\n";
}
I tried:
map{exists $di_count{$_}?print " ",$di_count{$_}:print " ",0}sort @comb/length;
map{exists $di_count{$_}?print " ",$di_count{$_}:print " ",0/length}sort @comb;
I also tried to define the length, such as:
my $seq_len = length($seq);
Also, I do not want to hard-code the input file in the script; I would like to run it as "perl script.pl input.fasta > result.txt". Should I use the following for that?
open (S, "$ARGV[0]") || die "cannot open FASTA file to read: $!";
This is pretty ugly code (should be rewritten entirely), but I think you want:
my $length = @dipeps;
map{exists $di_count{$_}?print " ",$di_count{$_}/$length:print " ",0}sort @comb;
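For readers who want to see the whole calculation in one place, here is a rough Python sketch of the same idea: count overlapping dipeptides and divide each count by the number of dipeptides in the sequence (sequence length minus one), as the `$length = @dipeps` fix does. It uses a minimal hand-written FASTA reader instead of BioPerl; the names `read_fasta` and `dipeptide_frequencies` and the sample records are assumptions made for the example.

```python
from itertools import product

AMINO = "ACDEFGHIKLMNPQRSTVWY"
# All 400 two-letter combinations, in sorted order.
DIPEPTIDES = ["".join(p) for p in product(AMINO, repeat=2)]

def read_fasta(lines):
    """Yield (id, sequence) pairs from FASTA-formatted lines."""
    seq_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            seq_id, chunks = (line[1:].split() or [""])[0], []
        elif line:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)

def dipeptide_frequencies(seq):
    """Return overlapping dipeptide counts divided by the number of dipeptides."""
    total = len(seq) - 1
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(total):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
    return [counts[d] / total if total > 0 else 0.0 for d in DIPEPTIDES]

records = [">seq1 example", "AAC", ">seq2", "ACAC"]
for seq_id, seq in read_fasta(records):
    print(seq_id, " ".join(f"{f:.3f}" for f in dipeptide_frequencies(seq)))
```

To read a file named on the command line instead of the inline sample, the open file handle (for example `open(sys.argv[1])`) can be passed to `read_fasta` in place of `records`.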

What do the non-printable characters in the Perl symbol table represent?

I just learned that in Perl, the symbol table for a given module is stored in a hash that matches the module name -- so, for example, the symbol table for the fictional module Foo::Bar would be %Foo::Bar. The default symbol table is stored in %main::. Just for the sake of curiosity, I decided that I wanted to see what was in %main::, so iterated through each key/value pair in the hash, printing them out as I went:
#! /usr/bin/perl
use v5.14;
use strict;
use warnings;
my $foo;
my $bar;
my %hash;
while( my ( $key, $value ) = each %:: ) {
say "Key: '$key' Value '$value'";
}
The output looked like this:
Key: 'version::' Value '*main::version::'
Key: '/' Value '*main::/'
Key: '' Value '*main::'
Key: 'stderr' Value '*main::stderr'
Key: '_<perl.c' Value '*main::_<perl.c'
Key: ',' Value '*main::,'
Key: '2' Value '*main::2'
...
I was expecting to see the STDOUT and STDERR file handles, and perhaps @INC and %ENV... what I wasn't expecting to see was non-printable characters ... what the code block above doesn't show is that the third line of the output actually had a glyph indicating a non-printable character.
I ran the script and piped it as follows:
perl /tmp/asdf.pl | grep '[^[:print:]]' | while read line
do
echo $line
od -c <<< $line
echo
done
The output looked like this:
Key: '' Value '*main::'
0000000 K e y : ' 026 ' V a l u e '
0000020 * m a i n : : 026 ' \n
0000032
Key: 'ARNING_BITS' Value '*main::ARNING_BITS'
0000000 K e y : ' 027 A R N I N G _ B I
0000020 T S ' V a l u e ' * m a i n
0000040 : : 027 A R N I N G _ B I T S ' \n
0000060
Key: '' Value '*main::'
0000000 K e y : ' 022 ' V a l u e '
0000020 * m a i n : : 022 ' \n
0000032
Key: 'E_TRIE_MAXBUF' Value '*main::E_TRIE_MAXBUF'
0000000 K e y : ' 022 E _ T R I E _ M A
0000020 X B U F ' V a l u e ' * m a
0000040 i n : : 022 E _ T R I E _ M A X B
0000060 U F ' \n
0000064
Key: ' Value '*main:'
0000000 K e y : ' \b ' V a l u e '
0000020 * m a i n : : \b ' \n
0000032
Key: '' Value '*main::'
0000000 K e y : ' 030 ' V a l u e '
0000020 * m a i n : : 030 ' \n
0000032
So what are non-printable characters doing in the Perl symbol table? What are they symbols for?
Guru is on the right track: specifically, the answer is to be found in perlvar, which says:
"Perl variable names may also be a sequence of digits or a single punctuation or control character. These names are all reserved for special uses by Perl; for example, the all-digits names are used to hold data captured by backreferences after a regular expression match. Perl has a special syntax for the single-control-character names: It understands ^X (caret X) to mean the control-X character. For example, the notation $^W (dollar-sign caret W) is the scalar variable whose name is the single character control-W. This is better than typing a literal control-W into your program.
Since Perl 5.6, Perl variable names may be alphanumeric strings that begin with control characters (or better yet, a caret). These variables must be written in the form ${^Foo}; the braces are not optional. ${^Foo} denotes the scalar variable whose name is a control-F followed by two o's. These variables are reserved for future special uses by Perl, except for the ones that begin with ^_ (control-underscore or caret-underscore). No control-character name that begins with ^_ will acquire a special meaning in any future version of Perl; such names may therefore be used safely in programs. $^_ itself, however, is reserved."
If you want to print those names in a readable way, you could add a line like this to your code:
$key = '^' . ($key ^ '@') if $key =~ /^[\0-\x1f]/;
If the first character of $key is a control character, this will replace it with a caret followed by the corresponding letter (^A for control-A, ^B for control-B, etc.).
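The caret mapping is just an XOR of the character code with 0x40 (the code of '@'). The following Python sketch shows the arithmetic on the control characters seen in the od output above; the function name `caret_name` is an assumption made for the example.

```python
def caret_name(name):
    """Render a leading control character in caret notation."""
    if name and ord(name[0]) < 0x20:
        # XOR with 0x40 maps control-A (0x01) to 'A' (0x41), and so on.
        return "^" + chr(ord(name[0]) ^ 0x40) + name[1:]
    return name

for raw in ("\x16", "\x17ARNING_BITS", "\x12E_TRIE_MAXBUF", "\x08", "stderr"):
    print(caret_name(raw))
```

Applied to the keys from the question, this yields names such as ^V, ^WARNING_BITS and ^RE_TRIE_MAXBUF.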
Perl has special variables such as $", $, , $/ , $\ and so on. All these are part of the symbol table, which is what you are seeing. Also, you should be able to see @INC and %ENV in the symbol table as well.

Reformatting separated char to couples

Input:
rs001 A C T G C G T T
rs002 C C T T G G A A
out1:
rs001 AC TG CG TT
rs002 CC TT GG AA
out2 :
rs001 1 1 1 2
rs002 2 2 2 2
OK, so basically I want to replace any pair of identical nucleotides (AA, CC, TT, or GG) with 2 and any pair of different nucleotides (AT, TA, CG, etc.) with 1. The input should first be converted to out1 and then to out2. Each row has many fields (around 200 columns), so loops are needed here.
This is what I tried:
cat input | awk '{ for (x = 2; x <= NF; x = x+2) print $x$(x+1) }'
The results are weird. Can anyone please tell me why I can't get out1? What mistake did I make in the awk loop?
Thanks in advance
For the first,
sed 's/ \([ACGT]\) / \1/g' input >out1
This will remove the space after every other nucleotide. It matches a nucleotide with a space on both sides; the next match will pick up where the previous one ended.
For the second,
sed 's/\([ACGT]\)\1/2/g;s/[ACGT][ACGT]/1/g' out1 >out2
This replaces two adjacent identical letters with 2, then any remaining adjacent two letters with 1.
This assumes you have Linux; other sed dialects may require minor modifications.
awk '{
    out1 = out2 = $1
    for (i=2;i<=NF;i+=2) {
        out1 = out1 FS $i $(i+1)
        out2 = out2 FS ($i == $(i+1) ? 2 : 1)
    }
    print out1 > "out1"
    print out2 > "out2"
}' input
Here's how you fix your awk script to get output 1:
awk '{ printf "%s ", $1; for (x = 2; x <= NF; x = x + 2) {printf "%s%s ", $x, $(x+1)} printf "\n"}' input
print adds a new line at the end by default, so you'll have to use formatted strings printf to specify where exactly you want the new lines.
(Also added printf "%s ", $1; at the start to print the header at the start of each line)
Edit: Triplee's solution looks much more elegant than mine - you should ditch awk and go with his =)
This might work for you (GNU sed):
sed -re 's/ (.) / \1/g;w out1' -e 's/([ACTG])\1/2/g;s/[ACTG]./1/g' file >out2
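The two-step conversion (pair up adjacent columns, then code each pair as 2 if both letters match and 1 otherwise) can be sketched in Python as well. This is an illustration only; the function name `pair_columns` is an assumption, and it assumes an even number of nucleotide columns after the identifier.

```python
def pair_columns(line):
    """Return (out1_line, out2_line) for one input row."""
    fields = line.split()
    name, alleles = fields[0], fields[1:]
    # Join columns 2+3, 4+5, ... into two-letter genotypes.
    pairs = [a + b for a, b in zip(alleles[0::2], alleles[1::2])]
    # 2 for identical letters, 1 for different ones.
    codes = ["2" if p[0] == p[1] else "1" for p in pairs]
    return " ".join([name] + pairs), " ".join([name] + codes)

for row in ("rs001 A C T G C G T T", "rs002 C C T T G G A A"):
    out1, out2 = pair_columns(row)
    print(out1, "|", out2)
```

Because it slices the whole row at once, the same function works unchanged for rows with 200 columns.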

Gather the data with similar columns

I want to filter the data from a text file in unix.
I have text file in unix as below:
A 200
B 300
C 400
A 100
B 600
B 700
How could I create the data below from the data above using awk?
A 200 100
B 300 600 700
C 400
I am not very good at awk, and I believe awk/perl is best for this.
awk 'END {
    for (R in r)
        print R, r[R]
}
{
    r[$1] = $1 in r ? r[$1] OFS $2 : $2
}' infile
If the order of the values in the first field is important,
more code will be needed.
The solution will depend on your awk implementation and version.
Explanation:
r[$1] = $1 in r ? r[$1] OFS $2 : $2
Set the value of the array r element $1 to:
if the key $1 is already present: $1 in r, append OFS $2
to the existing value
otherwise set it to the value of $2
expression ? if true : if false is the ternary operator.
See ternary operation for more.
You could do it like this, but with Perl there's always more than one way to do it:
my %hash;
while(<>) {
    my($letter, $int) = split(" ");
    push @{ $hash{$letter} }, $int;
}
for my $key (sort keys %hash) {
    print "$key " . join(" ", @{ $hash{$key} }) . "\n";
}
Should work like that:
$ cat data.txt | perl script.pl
A 200 100
B 300 600 700
C 400
Not language-specific. More like pseudocode, but here's the idea :
- Get all lines in an array
- Set up a target dictionary of arrays
- Go through the array:
  - Split the string using ' ' (space) as the delimiter, into array parts
  - Check whether there is already a dictionary entry for `parts[0]` (e.g. 'A').
    If not, create it.
  - Add `parts[1]` (e.g. 100) to `dictionary(parts[0])`
And that's it! :-)
I'd do it, probably in Python, but that's rather a matter of taste.
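The steps above translate almost line for line into Python. This is a minimal sketch; the function name `group_by_key` and the inline sample are assumptions made for the example, and it relies on dictionaries keeping insertion order (Python 3.7+), so keys come out in order of first appearance.

```python
def group_by_key(lines):
    """Collect the second field of each line under its first field."""
    groups = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        # Create the entry on first sight, then append the value.
        groups.setdefault(parts[0], []).append(parts[1])
    return groups

sample = ["A 200", "B 300", "C 400", "A 100", "B 600", "B 700"]
for key, values in group_by_key(sample).items():
    print(key, " ".join(values))
```

`setdefault` covers both the "entry exists" and "create it" branches of the plan in a single call.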
Using GNU awk (asort is a gawk extension), sorting the output inside it:
awk '
    { data[$1] = (data[$1] ? data[$1] " " : "") $2 }
    END {
        for (i in data) {
            idx[++j] = i
        }
        n = asort(idx);
        for ( i=1; i<=n; i++ ) {
            print idx[i] " " data[idx[i]]
        }
    }
' infile
Using external program sort:
awk '
    { data[$1] = (data[$1] ? data[$1] " " : "") $2 }
    END {
        for (i in data) {
            print i " " data[i]
        }
    }
' infile | sort
For both commands output is:
A 200 100
B 300 600 700
C 400
Using sed:
Content of script.sed:
## First line. Newline will separate data, so add it after the content.
## Save it in 'hold space' and read next one.
1 {
s/$/\n/
h
b
}
## Append content of 'hold space' to current line.
G
## Search if first char (\1) in line was saved in 'hold space' (\4) and add
## the number (\2) after it.
s/^\(.\)\( *[0-9]\+\)\n\(.*\)\(\1[^\n]*\)/\3\4\2/
## If last substitution succeed, goto label 'a'.
ta
## Here last substitution failed, so it is the first appearance of the
## letter, add it at the end of the content.
s/^\([^\n]*\n\)\(.*\)$/\2\1/
## Label 'a'.
:a
## Save content to 'hold space'.
h
## In last line, get content of 'hold space', remove last newline and print.
$ {
x
s/\n*$//
p
}
Run it like:
sed -nf script.sed infile
And result:
A 200 100
B 300 600 700
C 400
This might work for you:
sort -sk1,1 file | sed ':a;$!N;s/^\([^ ]*\)\( .*\)\n\1/\1\2/;ta;P;D'
A 200 100
B 300 600 700
C 400