With this sentence:
my $sent = "Mapping and quantifying mammalian transcriptomes RNA-Seq";
We want to get all possible consecutive pairs of words.
my $var = ['Mapping and',
'and quantifying',
'quantifying mammalian',
'mammalian transcriptomes',
'transcriptomes RNA-Seq'];
Is there a compact way to do it?
Yes.
my $sent = "Mapping and quantifying mammalian transcriptomes RNA-Seq";
my @pairs = $sent =~ /(?=(\S+\s+\S+))\S+/g;
A variation that (perhaps unwisely) relies on operator evaluation order but doesn't rely on fancy regexes or indices:
my @words = split /\s+/, $sent;
my $last = shift @words;
my @var;
push @var, $last . ' ' . ($last = $_) for @words;
This works:
my @sent = split(/\s+/, $sent);
my @var = map { $sent[$_] . ' ' . $sent[$_ + 1] } 0 .. $#sent - 1;
i.e. just split the original string into an array of words, and then use map to iteratively produce the desired pairs.
I don't have it as a single line, but the following code should give you somewhere to start. It basically does it with a push and a regex with /g.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
$Data::Dumper::Indent = 1;
my $t1 = 'aa bb cc dd ee ff';
my $t2 = 'aa bb cc dd ee';
foreach my $txt ( $t1, $t2 )
{
my @a;
push( @a, $& ) while( $txt =~ /\G\S+(\s+\S+|)\s*/g );
print Dumper( \@a );
}
One-liner thanks to the syntax from @ysth:
my @a = $txt =~ /\G(\S+(?:\s+\S+|))\s*/g;
My regex is slightly different in that if you have an odd number of words, the last word still gets an entry.
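To make the difference between the two regex approaches concrete, here is a minimal sketch that wraps each one in a sub. The sub names `overlapping_pairs` and `chunked_pairs` are made up for illustration and do not come from the answers. The lookahead version yields a pair starting at every word, while the `\G` version consumes two words at a time.

```perl
use strict;
use warnings;

# Sliding window: the lookahead captures "word word" without consuming
# the second word, so the next match can start at it.
sub overlapping_pairs {
    my ($text) = @_;
    return $text =~ /(?=(\S+\s+\S+))\S+/g;
}

# Non-overlapping chunks: each match consumes up to two words, so a
# trailing odd word comes back on its own.
sub chunked_pairs {
    my ($text) = @_;
    return $text =~ /\G(\S+(?:\s+\S+|))\s*/g;
}

print join('|', overlapping_pairs('aa bb cc dd ee')), "\n";  # aa bb|bb cc|cc dd|dd ee
print join('|', chunked_pairs('aa bb cc dd ee')), "\n";      # aa bb|cc dd|ee
```

Only the first form matches what the question asked for; the second is useful when the words should be grouped two by two.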
I have a text file that has approximately 3,000 lines. 99% of the time I need all 3,000 lines. However, periodically I will grep out the lines I need and direct the output to another text file to use.
The only problem I have in doing so is that embedded in the text file is a 6-character string of numbers that indicates the line number. In order to use the file, this area needs to be correctly renumbered. (I don't need to re-sort the data, but I need to replace the current six characters with the new line number, and it must be padded with zeros!) Unfortunately, each entire row is one long run of data with no field separators!
For example, my first three rows might look something like:
20130918082020ZZ000001RANDOMDATAFOLLOWSAFTERTHISABCDEFGH
20130810112000ZZ000999MORERANDOMDATAFOLLOWSAFTERTHISABCD
20130810112000ZZ000027SILLMORERANDOMDATAFOLLOWSAFTERTHIS
The six characters at positions 17-22 (immediately following the "ZZ") need to be renumbered based on the current row number, so the above needs to look like:
20130918082020ZZ000001RANDOMDATAFOLLOWSAFTERTHISABCDEFGH
20130810112000ZZ000002MORERANDOMDATAFOLLOWSAFTERTHISABCD
20130810112000ZZ000003SILLMORERANDOMDATAFOLLOWSAFTERTHIS
Any ideas would be greatly appreciated!
Thanks,
KSL.
Here's the solution I came up with in Perl. It assumes that the numbering is always 6 digits after the ZZ sequence.
In convert.pl:
use strict;
use warnings;
my $i = 1; # or the value you want to start numbering
while (<STDIN>) {
my $replace = sprintf("%06d", $i++);
$_ =~ s/ZZ\d{6}/ZZ$replace/g;
print $_;
}
In data.dat:
20130918082020ZZ000001RANDOMDATAFOLLOWSAFTERTHISABCDEFGH
20130810112000ZZ000999MORERANDOMDATAFOLLOWSAFTERTHISABCD
20130810112000ZZ000027SILLMORERANDOMDATAFOLLOWSAFTERTHIS
To run:
cat data.dat | perl convert.pl
Output
20130918082020ZZ000001RANDOMDATAFOLLOWSAFTERTHISABCDEFGH
20130810112000ZZ000002MORERANDOMDATAFOLLOWSAFTERTHISABCD
20130810112000ZZ000003SILLMORERANDOMDATAFOLLOWSAFTERTHIS
If I were solving this, I would write a simple Python script that reads the lines, filters them as grep does, and keeps an internal counter.
As a simple hint, you can read each line into a string and access the number field with variablename[16:22] (Python slices are zero-based and exclude the end index, so this covers characters 17-22 of the line).
Python strings also have a replace method, which you can use to replace that value with the counter you maintain.
I hope this helps.
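The answer above gives no code. As a rough sketch of the same slice-and-replace idea, written in Perl to match the rest of this thread rather than in Python, four-argument substr can overwrite the field in place. The sub name `renumber_line` is hypothetical, and the sketch assumes the counter field always sits at positions 17-22.

```perl
use strict;
use warnings;

# Overwrite the six characters at 1-based positions 17-22 with a
# zero-padded counter. substr offsets are zero-based, hence 16.
sub renumber_line {
    my ($line, $n) = @_;
    substr($line, 16, 6, sprintf('%06d', $n));
    return $line;
}

my @rows = (
    '20130918082020ZZ000001RANDOMDATAFOLLOWSAFTERTHISABCDEFGH',
    '20130810112000ZZ000999MORERANDOMDATAFOLLOWSAFTERTHISABCD',
);
my $count = 0;
print renumber_line($_, ++$count), "\n" for @rows;
```

Because the sub works on a copy of the line, the original rows are left untouched.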
To do this in awk:
awk '{print substr($0,1,16) sprintf("%06d", NR) substr($0,23)}'
or
gawk 'match($0,/^(.*ZZ)[0-9]{6}(.*)/,a) {print a[1] sprintf("%06d",NR) a[2]}'
This is exactly the type of thing where unpack is useful.
#!/usr/bin/env perl
use v5.10.0;
use strict;
use warnings;
while( my $line = <> ){
chomp $line;
my @elem = unpack 'A16 A6 A*', $line;
$elem[1] = sprintf '%06d', $.;
# $. is the line number for the last used file handle
say @elem;
}
Actually looking at the lines, it looks like there is date information stored in the first 14 characters.
Assuming that at some point you might want to parse the lines for some reason you can use the following as an example of how you could use unpack to split up the lines.
#!/usr/bin/env perl
use v5.10.0; # say()
use strict;
use warnings;
use DateTime;
my @date_elem = qw'
year month day
hour minute second
';
my @elem_names = ( @date_elem, qw'
ZZ
line_number
random_data
');
while( my $line = <> ){
chomp $line;
my %data;
@data{ @elem_names } = unpack 'A4 (A2)6 A6 A*', $line;
# choose either this:
$data{line_number} = sprintf '%06d', $.;
say @data{@elem_names};
# or this:
$data{line_number} = $.;
printf '%04d' . ('%02d'x5) . "%2s%06d%s\n", @data{ @elem_names };
# the choice will affect the contents of %data
# this just shows the contents of %data
for( @elem_names ){
printf qq'%12s: "%s"\n', $_, $data{$_};
}
# you can create a DateTime object with the date elements
my $dt = DateTime->new(
(map{ $_, $data{$_} } @date_elem),
time_zone => 'floating',
);
say $dt;
print "\n";
}
Although it would be better to use a regular expression, so that you could throw out bogus data.
use v5.14; # /a modifier
...
my $rdate = join '', map{"(\\d{$_})"} 4, (2)x5;
my $rx = qr/$rdate (ZZ) (\d{6}) (.*)/xa; # qr'...' would not interpolate $rdate
while( my $line = <> ){
chomp $line;
my %data;
unless( @data{ @elem_names } = $line =~ $rx ){
die qq'unable to parse line "$line" ($.)';
}
...
It would be better still to use named capture groups, which were added in 5.10.
...
my $rx = qr'
(?<year> \d{4} ) (?<month> \d{2} ) (?<day> \d{2} )
(?<hour> \d{2} ) (?<minute> \d{2} ) (?<second> \d{2} )
ZZ
(?<line_number> \d{6} )
(?<random_data> .* )
'xa;
while( my $line = <> ){
chomp $line;
unless( $line =~ $rx ){
die qq'unable to parse line "$line" ($.)';
}
my %data = %+;
# for compatibility with previous examples
$data{ZZ} = 'ZZ';
...
[perl 5.8.8]
I have a sequence of names of things like:
names='foobar1304,foobar1305,foobar1306,foobar1307'
where the names differ only by a contiguous string of digits somewhere in the name. The strings of digits in any sequence are all of the same length, and the digit strings form a continuous numeric sequence with no skips, e.g. 003,004,005.
I want a compact representation like:
compact_name='foobar1304-7'
(The compact form is just a name, so its exact form is negotiable.)
There will usually only be <10 things, though some sets might span a decade, e.g.
'foobaz2205-11'
Is there some concise way to do this in perl? I'm not a big perl hacker, so be a little gentle...
Bonus points for handling embedded sequences like:
names='foobar33-pqq,foobar34-pqq,foobar35-pqq'
The ideal script would neatly fall back to 'firstname2301-lastname9922' in case it can't identify a sequence in the names.
I am not sure I got your specification, but it works somehow:
#!/usr/bin/perl
use warnings;
use strict;
use Test::More;
sub compact {
my $string = shift;
my ($name, $value) = split /=/, $string;
$name =~ s/s$// or die "Cannot create compact name for $name.\n"; #/ SO hilite bug
$name = 'compact_' . $name;
$value =~ s/^'|'$//g; #/ SO hilite bug
my @values = split /,/, $value; #/ SO hilite bug
my ($prefix, $first, $suffix) = $values[0] =~ /^(.+?)([0-9]+)(.*)$/;
my $last = $first + $#values;
my $same = 0;
$same++ while substr($first, 0, $same) eq substr($last, 0, $same);
$last = substr $last, $same - 1;
for my $i ($first .. $first + $#values) {
$values[$i - $first] eq ($prefix . $i . $suffix)
or die "Invalid sequence at $values[$i-$first].\n";
}
return "$name='$prefix$first-$last$suffix'";
}
is( compact("names='foobar1304,foobar1305,foobar1306,foobar1307'"),
"compact_name='foobar1304-7'");
is( compact("names='foobaz2205,foobaz2206,foobaz2207,foobaz2208,foobaz2209,foobaz2210,foobaz2211'"),
"compact_name='foobaz2205-11'");
is( compact("names='foobar33-pqq,foobar34-pqq,foobar35-pqq'"),
"compact_name='foobar33-5-pqq'");
done_testing();
Someone will surely post a more elegant solution, but the following
use strict;
use warnings;
my $names='foobar1308-xy,foobar1309-xy,foobar1310-xy,foobar1311-xy';
my @names = split /,/,$names;
my $pfx = lcp(@names);
my @nums = map { m/$pfx(\d*)/; $1 } @names;
my $first=shift @nums;
my $last = pop @nums;
my $suf=$names[0];
$suf =~ s/$pfx\d*//;
print "$pfx\{$first-$last}$suf\n";
#https://gist.github.com/3309172
sub lcp {
my $match = shift;
substr($match, (($match ^ $_) =~ /^\0*/, $+[0])) = '' for @_;
$match;
}
prints:
foobar13{08-11}-xy
I have an input variable, say $a. $a can be either number or string or mix of both.
My question is how can I strip off the variable to separate numeric digits and alphabetic characters?
Example;
$a can be 'AB9'
Here I should be able to store 'AB' in one variable and '9' in other.
How can I do that?
Check this version, it works with 1 or more numeric and alphabetic characters in a variable.
#!/usr/bin/perl
use strict;
use Data::Dumper;
my $var = '11a';
my (@digits, @alphabetics);
while ($var =~ /([a-zA-Z]+)/g) {
push @alphabetics, $1;
}
while ($var =~ /(\d+)/g) {
push @digits, $1;
}
print Dumper(\@alphabetics);
print Dumper(\@digits);
Here's one way to express it very concisely:
my ($digits) = $input =~ /(\d+)/;
my ($alpha) = $input =~ /([a-z]+)/i;
say 'digits: ' . ($digits // 'none');
say 'non-digits: ' . ($alpha // 'none');
It's important to use the match operator in list context here; otherwise it would only return whether the match succeeded.
If you want to get all occurrences in the input string, simply change the scalar variables in list context to proper arrays:
my @digits = $input =~ /(\d+)/g;
my @alpha = $input =~ /([a-z]+)/gi;
say 'digits: ' . join ', ' => @digits;
say 'non-digits: ' . join ', ' => @alpha;
For my $input = '42AB17C', the output is
digits: 42, 17
non-digits: AB, C
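The point about list context is easy to miss, so here is a small sketch of the difference. The variable names are illustrative only; the input is the same '42AB17C' used above.

```perl
use strict;
use warnings;

my $input = '42AB17C';

# Scalar context: the match returns true/false, not the captured text.
my $matched = $input =~ /(\d+)/;

# List context (note the parentheses): the match returns the captures.
my ($digits) = $input =~ /(\d+)/;

print "scalar context: $matched\n";   # prints 1
print "list context:   $digits\n";    # prints 42
```

The parentheses around `$digits` are what put the assignment, and therefore the match, into list context.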
I have a string in Perl that is 23 digits long. I need to cut it apart into different pieces: the first 2 digits in one variable, the next 3 in another variable, the next 4 in another variable, etc. Basically the 23 digits need to end up in 6 separate variables of 2, 3, 4, 4, 3, and 7 characters, in that order.
Any ideas how I can cut the string up like this?
There are lots of ways to do it, but the shortest is probably unpack:
my $string = '1' x 23;
my @values = unpack 'A2A3A4A4A3A7', $string;
If you need separate variables, you can use a list assignment:
my ($v1, $v2, $v3, $v4, $v5, $v6) = unpack 'A2A3A4A4A3A7', $string;
Expanding on Alex's method, rather than specify each start and end, use the list you gave of lengths.
#!/usr/bin/env perl
use strict;
use warnings;
my $string = "abcdefghijklmnopqrstuvw";
my $pos = 0;
my @split = map {
my $start = $pos;
my $end = $_;
$pos += $end;
substr( $string, $start, $end);
} (2,3,4,4,3,7);
print "$_\n" for @split;
This said you probably should look at unpack which is used for fixed width fields. I have no experience with it though.
You could use a regex, viz:
$string =~ /\d{2}\d{3}\d{4}\d{4}\d{3}\d{7}/
and capture each part by surrounding with brackets ().
You then find each capture in the variables $1, $2 ...
or get them all in the returned list
See perldoc perlre
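Spelled out, the regex approach might look like the following sketch. The sample string is made up, and the `^`/`$` anchors are an addition so that input which is not exactly 23 digits is rejected rather than partially matched.

```perl
use strict;
use warnings;

my $string = '12345678901234567890123';   # 23 digits, made-up sample

# In list context the match returns the six captured groups in order.
my @parts = $string =~ /^(\d{2})(\d{3})(\d{4})(\d{4})(\d{3})(\d{7})$/
    or die "not a 23-digit string: $string\n";

print join(',', @parts), "\n";   # 12,345,6789,0123,456,7890123
```

Compared with unpack, this validates the input as it splits it, at the cost of a longer pattern.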
You want to use substr (see perldoc -f substr).
$substring = substr($string, $start, $length);
I'd also use map on a list of [start, length] pairs to make your life easier:
$string = "123456789";
@values = map {substr($string, $_->[0], $_->[1])} ([1, 3], [4, 2] , ...);
Here's a sub that will do it, using the already discussed unpack.
sub string_slices {
my $str = shift;
return unpack( join( 'A', '', @_ ), $str );
}
Below is the sample csv file
date,type1,type2,.....
2009-07-01,n1,n2,.....
2009-07-02,n21,n22,....
and so on...
I want to add the values in each row and each column and print at the end and bottom of each line. i.e.
date,type1,type2
2009-07-01,n1,n2,.....row_total1
2009-07-02,n21,n22,....row_total2
Total,col_total1,col_total2,......total
Please suggest.
Less elegant and shorter:
$ perl -plaF, -e '$r=0;$r+=$F[$_],$c[$_]+=$F[$_]for 1..$#F;$_.=",$r";END{$c[0]="Total";print join",",@c}'
Quick and dirty, but should do the trick in basic cases. For anything more complex, use Text::CSV and an actual script.
An expanded version as it's getting a little hairy:
#! perl -plaF,
$r=0;
$r+=$F[$_], $c[$_]+=$F[$_] for 1..$#F;
$_.=",$r";
END { $c[0]="Total"; print join ",", @c }
Here is a straightforward way which you can easily build upon depending on your requirements:
use strict;
use warnings;
use 5.010;
use List::Util qw(sum);
use List::MoreUtils qw(pairwise);
use Text::ParseWords;
our ($a, $b);
my @header = parse_csv( scalar <DATA> );
my @total = (0) x @header;
output_csv( @header, 'row_total' );
for my $line (<DATA>) {
my @cols = parse_csv( $line );
my $label = shift @cols;
push @cols, sum @cols;
output_csv( $label, @cols );
@total = pairwise { $a + $b } @total, @cols;
}
output_csv( 'Total', #total );
sub parse_csv {
chomp( my $data = shift );
quotewords ',', 0, $data;
}
sub output_csv { say join ',' => @_ }
__DATA__
date,type1,type2
2009-07-01,1,2
2009-07-02,21,22
Outputs the expected:
date,type1,type2,row_total
2009-07-01,1,2,3
2009-07-02,21,22,43
Total,22,24,46
Some things to take away from above is the use of List::Util and List::MoreUtils:
# using List::Util::sum
my $sum_of_all_values_in_list = sum @list;
# using List::MoreUtils::pairwise
my @two_arrays_added_together = pairwise { $a + $b } @array1, @array2;
Also, while I've used Text::ParseWords in my example, you should really look into using Text::CSV. This module covers more bizarre CSV edge cases and also provides correct CSV composition (my output_csv() sub is pretty naive!).
/I3az/
Like JB's perlgolf candidate, except prints the end line totals and labels.
#!/usr/bin/perl -alnF,
use List::Util qw(sum);
chomp;
push @F, $. == 1 ? "total" : sum(@F[1..$#F]);
print "$_,$F[-1]";
for (my $i=1;$i<@F;$i++) {
$totals[$i] += $F[$i];
}
END {
$totals[0] = "Total";
print join(",",@totals);
};
Is this something that needs to be done for sure in a Perl script? There is no "quick and dirty" method to do this in Perl. You will need to read the file in, accumulate your totals, and write the file back out (processing input and output line by line would be the cleanest).
If this is a one-time report, or you are working with a competent user base, the data you want can most easily be produced with a spreadsheet program like Excel.
Whenever I work with CSV, I use the AnyData module. It may add a bit of overhead, but it keeps me from making mistakes ("Oh crap, that date column is quoted and has commas in it!?").
The process for you would look something like this:
use AnyData;
my @columns = qw/date type1 type2 type3/; ## Define your input columns.
my $input = adTie( 'CSV', 'input_file.csv', 'r', {col_names => join(',', @columns)} );
push @columns, 'total'; ## Add the total columns.
my $output = adTie( 'CSV', 'output_file.csv', 'o', {col_names => join(',', @columns)} );
my %totals;
while ( my $row = each %$input ) {
next if ($. == 1); ## Skip the header row. AnyData will add it to the output.
my $sum = 0;
foreach my $col (@columns[1..3]) {
$totals{$col} += $row->{$col};
$sum += $row->{$col};
}
$totals{total} += $sum;
$row->{total} = $sum;
$output->{$row->{date}} = $row;
}
$output->{Total} = \%totals;
print adDump( $output ); ## Prints a little table to see the data. Not required.
undef $input; ## Close the file.
undef $output;
Input:
date,type1,type2,type3
2009-07-01,1,2,3
2009-07-03,31,32,33
2009-07-06,61,62,63
"Dec 31, 1969",81,82,83
Output:
date,type1,type2,type3,total
2009-07-01,1,2,3,6
2009-07-03,31,32,33,96
2009-07-06,61,62,63,186
"Dec 31, 1969",81,82,83,246
Total,174,178,182,534
The following in Perl does what you want; it's not elegant but it works :-)
Call the script with the inputfile as argument, results in stdout.
chop($_ = <>);
print "$_,Total\n";
while (<>) {
chop;
@_ = split(/,/); # split no longer fills @_ implicitly (removed in Perl 5.12)
shift(@_);
$sum = 0;
for ($n = 0; 0 < scalar(@_); $n++) {
$c = shift(#_);
$sum += $c;
$sums[$n] += $c;
}
$total += $sum;
print "$_,$sum\n";
}
print "Total";
for ($n = 0; $n <= $#sums; $n++) {
print "," . $sums[$n];
}
print ",$total\n";
Edit: fixed for 0 values.
The output is like this:
date,type1,type2,type3,Total
2009-07-01,1, 2, 3,6
2009-07-02,4, 5, 6,15
Total,5,7,9,21