calculating velocity from massive simulation data

I have simulation data for the velocities of water molecules. I will first describe the format of the data, since that leads directly to what I want to calculate.
A water molecule is made of three atoms: one oxygen (O) and two hydrogens (H). Here I name them O, H1, and H2.
The data below starts with the line title 0, followed by a line beginning with the number 4335, which says the section contains 4335 atoms (4335/3 = 1445 water molecules).
The first three numbers in the third row (0.0923365 0.0341984 -0.1248516) are the velocity of the oxygen (O) atom along the three Cartesian directions: Ox, Oy, Oz. The next three numbers in the same row are the velocity of the first hydrogen (H1): H1x, H1y, H1z. The first three numbers in the fourth row are the velocity of the second hydrogen (H2): H2x, H2y, H2z. The remaining three numbers in the fourth row are the velocity of the oxygen atom of the next molecule.
This sequence goes on for all 4335 atoms, taking 2170 lines including the top two lines of the section, and then repeats for the following section, which starts with title 1.
title 0
4335 2.0001000e+04
0.0923365 0.0341984 -0.1248516 -0.8946258 1.6688854 0.8259304
0.2890579 0.8051153 -1.5612963 0.0625492 -0.1361579 0.2869132
0.2343408 -0.0665305 1.0745378 -0.8375892 0.6953992 0.5149021
-0.1628550 0.0131844 0.0688080 0.2429340 0.2168210 -0.0289806
-0.3677613 0.2054004 -0.1511643 -0.3487551 -0.1454157 0.0801884
-0.9039297 -0.0682939 -0.2337404 -0.5605327 -0.0369157 0.2243892
-0.3100274 -0.2673132 -0.2093299 0.1975043 -0.4572202 -0.8410826
-0.6995287 -0.4123909 0.0649209 -0.1910519 0.2289656 0.2443295
-0.0279093 0.5790939 -0.0104249 -1.1961776 -0.5387340 0.1445187
-0.3188485 0.3789352 -0.0112114 0.7831523 0.6043882 -0.7131590
-0.7214440 -0.5358508 -0.3035673 -0.1549275 -0.1402387 -0.0101964
-0.2027608 1.5107149 0.2963312 -1.5104872 -0.1554981 -1.3323215
0.1097982 -0.1553742 0.3803437 0.0816858 0.0265007 0.4215823
0.1157368 0.2100116 0.4712551 0.1799426 -0.1260255 -0.2131755
0.1811777 -0.9442581 -0.6036636 0.9681703 -0.1523646 -0.3502441
0.0976771 0.0019619 -0.1832204 -0.0055989 0.2701100 -0.4416720
0.8496723 0.4070951 -0.0819204 0.1156806 -0.1619873 -0.0016126
-0.4051959 0.4263505 -0.9460036 0.4412067 0.1002270 0.5864405
-0.3831136 0.3240860 -0.0005143 -0.5667163 0.2618876 0.0103317
-0.6442209 0.3965833 -0.0778050 -0.2404238 -0.1339887 -0.1662417
0.3421198 0.7480828 -1.8316993 -0.4454920 -0.0499657 -0.1951254
-0.2895359 -0.1934811 -0.2674928 0.1255802 1.3522828 -0.2829485
-0.4129106 -0.6842645 -1.0147657 -0.1278501 -0.0597648 -0.1478294
-0.2519974 0.0665314 -0.0690079 -0.0480210 -0.1179547 -0.2091919
-0.1942484 0.2583650 -0.0734658 -0.1216313 0.5158040 -0.0676843
-0.3063602 0.8148463 -0.1959571 -0.1009838 -0.3394633 -0.0866587
.
. (goes on until line 2170)
.
0.1028815 -0.0844088 -0.2156557 -0.1698745 -0.2018967 -0.3863209
0.1793070 -0.1005802 0.1800752 -0.1404713 0.2216020 0.2236271
0.5192825 -0.7398186 0.0418758 0.0347715 -0.3457840 -0.1300237
-0.3089482 1.1125441 -0.4020403 0.2739744 -0.9062766 0.0012294
0.1498538 0.0883857 -0.0094638 0.0963565 -1.1027019 0.0115313
-0.0432824 0.3330713 0.0304943
title 1
4335 2.0002000e+04
-0.2082078 0.1774843 -0.1023302 -0.1100437 0.5973607 1.0627041
-0.2216015 0.0448885 -0.8415924 0.1691296 0.6008261 -0.0373434
0.9387534 -0.3642305 0.6756270 -0.6000357 0.6632088 1.0567899
-0.3234407 -0.1781680 -0.1936070 -0.4799916 -0.1522612 -0.2347461
0.1045985 0.1999704 -0.1482928 -0.0439331 0.0413923 0.1605458
0.3403952 -0.2012104 0.4851457 -0.9665228 0.2202362 0.0046218
.
. (goes on until line 2170)
.
What I want to calculate is the resultant velocity for each molecule, and I would like to do this in Perl. The algorithm goes like this.
First, store the velocities of the oxygen (O) and the hydrogens (H1 and H2) in Ox, Oy, Oz, H1x, H1y, H1z and H2x, H2y, H2z respectively.
Next we define:
velocity_x = Ox + H1x + H2x
velocity_y = Oy + H1y + H2y
velocity_z = Oz + H1z + H2z
Finally calculate
resultant_velocity = sqrt(velocity_x**2 + velocity_y**2 + velocity_z**2)
and store the "resultant_velocity" values in a new file (the file should be title_0.dat). The program should calculate the velocities starting from title 1 until title 200 in the file.
I am a newbie at Perl, but I would like to do this in Perl because I find it very interesting. I can write simple "read and write" operations in Perl, but I have no idea how to split the values, assign them to variables and carry out the calculation, even though the calculation itself is high-school level.
#!/usr/bin/perl -w
$data_file="malto.dat";
open(DAT, $data_file) || die("Could not open file!");
@raw_data=<DAT>;
close(DAT);
while(<@raw_data>){
    @columns=split /\s+/,$_;
    if($columns[0]=~ m/ATOM/){
        print "$columns[5], $columns[6], $columns[7]\n";
    }
}
I would like to get some guidance from experts so that I can enhance my understanding of Perl while working on the code.
Appreciate any help.
Regards

Perhaps the following will assist you:
use strict;
use warnings;
use Math::Complex;
my $dataFile = 'malto.dat';
{
    local $/ = 'title ';
    open my $fh, '<', $dataFile or die $!;
    while (<$fh>) {
        chomp;
        my @data = split or next;
        my $titleNum = 'Title ' . shift @data;
        my $atom = shift(@data) . ' ' . shift @data;
        my $resultantVel = calcResultantVel( \@data );
        print $titleNum, "\n";
        print $atom, "\n";
        print 'ResultantVel: ' . $resultantVel, "\n\n";
    }
    close $fh;
}
sub calcResultantVel {
    my ($dataRef) = @_;
    my ($velocity_x, $velocity_y, $velocity_z);
    while ( my @nums = splice( @$dataRef, 0, 9 ) ) {
        $velocity_x += $nums[0] + $nums[3] + $nums[6];
        $velocity_y += $nums[1] + $nums[4] + $nums[7];
        $velocity_z += $nums[2] + $nums[5] + $nums[8];
    }
    return sqrt( $velocity_x**2 + $velocity_y**2 + $velocity_z**2 );
}
The string 'title ' (the word followed by a space) is used as the record separator, so each read takes in a chunk of data that's delimited by title. The chomp removes the record separator, and then the record is split on whitespace.
The zeroth element is the title number, and that's shifted off @data. The next two elements of @data come from the atom-count line, and they're shifted off, too. The remaining array elements are the Cartesian velocity components, and a reference to that array is sent to the subroutine calcResultantVel.
The subroutine takes a chunk of nine elements at a time: three for the O atom, three for the first H atom, and three for the second H atom, and a running sum is kept based upon the definition you've provided. Finally, the resultant velocity is returned.
Here's some sample output:
Title 0
4335 2.0001000e+04
ResultantVel: 13.2945751170603
Title 1
4335 2.0001000e+04
ResultantVel: 12.7696611061461
You can visually verify that it's working correctly. Since you "...can write simple 'read and write' operations in Perl...," the next step is to have it write the desired results to a file.
Hope this helps!

Here's my advice: break the job down into small components, and write a method for each meaningful part of the work. To wit:
use strict;
use warnings;
main(@ARGV); # Pass data file name on command line. Don't hard-code it.
sub main {
    my $data_f = shift;
    open(my $data_h, '<', $data_f) or die "$!: $data_f";
    while (my $section = get_section($data_h)){
        # Also write methods that can be called here to make
        # desired computations, print output, etc.
    }
}
sub get_section {
    # Takes a file handle.
    # Returns a hash reference containing all of the data
    # for an entire section of the file.
    my $h = shift;
    return if eof($h);
    chomp (my $title = <$h>);
    my ($n_atoms) = <$h> =~ /^(\d+)/;
    return {
        'title' => $title,
        'n_atoms' => $n_atoms,
        'molecules' => get_molecules($h, $n_atoms / 3),
    };
}
sub get_molecules {
    my @molecules;
    return \@molecules;
}
I have not written the get_molecules() method. It takes a file handle and an integer (N of molecules). It could return a reference to an array-of-arrays or maybe an array-of-hashes, with each inner array/hash holding the info for a single molecule.
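One way the missing get_molecules() could be filled in is sketched below. It assumes the layout described in the question (whitespace-separated numbers, nine per molecule, spread over several lines) and returns an array of hashes. The hash keys O, H1 and H2 are illustrative choices, not something the answer above prescribes.

```perl
use strict;
use warnings;

# Sketch only: reads 9 * $n_molecules numbers from the handle and groups
# them into one hash per molecule. The keys O, H1 and H2 are hypothetical.
sub get_molecules {
    my ($h, $n_molecules) = @_;
    my $needed = 9 * $n_molecules;
    my @values;

    # A molecule can straddle two lines, so collect numbers until we have enough.
    while (@values < $needed) {
        my $line = <$h>;
        last unless defined $line;
        push @values, split ' ', $line;
    }
    die "Section ended early: expected $needed values, got " . scalar(@values) . "\n"
        if @values < $needed;

    my @molecules;
    for my $i (0 .. $n_molecules - 1) {
        my @v = @values[ 9 * $i .. 9 * $i + 8 ];
        push @molecules, {
            O  => [ @v[0 .. 2] ],   # oxygen x, y, z
            H1 => [ @v[3 .. 5] ],   # first hydrogen x, y, z
            H2 => [ @v[6 .. 8] ],   # second hydrogen x, y, z
        };
    }
    return \@molecules;
}
```

With this shape, the x component of the second hydrogen of the first molecule would be read as $section->{molecules}[0]{H2}[0].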

Thanks for your help and guidance. I have tried to modify your code as below. It works, at least for my needs.
#!/usr/bin/perl
###############
#use strict;
#use warnings;
use Math::Complex;
open OUTPUT, '>', "velocityOnly.dat" or die "Can't create filehandle: $!";
my $dataFile = 'velF1F2.vel';
{
    local $/ = 'title ';
    open my $FH, '<', $dataFile or die $!;
    while (<$FH>) {
        chomp;
        my @data = split or next;
        my $titleNum = 'Title ' . shift(@data);
        my $atom = shift(@data) . ' ' . shift(@data);
        #my $resultantVel = calcResultantVel( \@data );
        #print OUTPUT "$titleNum", "\n";
        print "$titleNum", "\n";
        for my $i (1..1445)
        {
            $j=(9*($i-1));
            $velocity_x = $data[($j+0)] + $data[($j+3)] + $data[($j+6)];
            $velocity_y = $data[($j+1)] + $data[($j+4)] + $data[($j+7)];
            $velocity_z = $data[($j+2)] + $data[($j+5)] + $data[($j+8)];
            $velo = sprintf '%.3f',sqrt( $velocity_x**2 + $velocity_y**2 + $velocity_z**2 );
            chomp $velo;
            print "$velo","\n";
            print OUTPUT "$velo\n";
        }
        #print 'ResultantVel: ' . $resultantVel, "\n\n";
    }
    close $FH;
}
But I would like to extend this further by adding some other functionality for doing more complex calculations.
Before that, I need some guidance on turning the code below into a subroutine. I am a bit lost here. Your code actually adds up all the X, Y and Z values over the whole section and only then computes the velocity. That is not what I want. Each successive group of 9 values holds the coordinates for one water molecule, which contains three atoms.
(The number 1445 is the number of molecules. Each molecule contains three atoms and each atom has three coordinates, so a water molecule has 9 Cartesian coordinates.)
Here $i is the index of the water molecule:
for my $i (1..1445)
{
    $j=(9*($i-1));
    $velocity_x = $data[($j+0)] + $data[($j+3)] + $data[($j+6)];
    $velocity_y = $data[($j+1)] + $data[($j+4)] + $data[($j+7)];
    $velocity_z = $data[($j+2)] + $data[($j+5)] + $data[($j+8)];
    $velo = sprintf '%.3f',sqrt( $velocity_x**2 + $velocity_y**2 + $velocity_z**2 );
    chomp $velo;
    print "$velo","\n";
    print OUTPUT "$velo\n";
}
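A minimal sketch of how that loop might be wrapped in a subroutine is given below. It takes a reference to the flat list of numbers for one title section and returns one resultant value per molecule, instead of one sum for the whole section. The name molecule_velocities is hypothetical, and the molecule count is derived from the length of the list rather than hard-coded as 1445.

```perl
use strict;
use warnings;

# Sketch: one resultant velocity per molecule (9 consecutive values each:
# O x,y,z then H1 x,y,z then H2 x,y,z). The sub name is illustrative.
sub molecule_velocities {
    my ($data_ref) = @_;
    my $n_molecules = int( scalar(@$data_ref) / 9 );
    my @speeds;
    for my $i (0 .. $n_molecules - 1) {
        my $j  = 9 * $i;    # offset of this molecule's first value
        my $vx = $data_ref->[$j]     + $data_ref->[$j + 3] + $data_ref->[$j + 6];
        my $vy = $data_ref->[$j + 1] + $data_ref->[$j + 4] + $data_ref->[$j + 7];
        my $vz = $data_ref->[$j + 2] + $data_ref->[$j + 5] + $data_ref->[$j + 8];
        push @speeds, sqrt($vx**2 + $vy**2 + $vz**2);
    }
    return \@speeds;
}
```

Inside the read loop it could be called as printf OUTPUT "%.3f\n", $_ for @{ molecule_velocities(\@data) }; which keeps the formatting out of the calculation, so the same list can be reused for later computations.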

Related

Add new hash keys and then print in a new file

Previously, I posted a question about using a regex to match specific sequence identifiers (IDs).
Now I'm looking for some recommendations on how to print the data that I am looking for.
If you want to see the complete file, here's a GitHub link.
The script takes two files. The first file looks something like this (this is only part of the file):
AGY29650_2_NA netOGlyc-4.0.0.13 CARBOHYD 2 2 0.0804934 . .
AGY29650_2_NA netOGlyc-4.0.0.13 CARBOHYD 4 4 0.0925522 . .
AGY29650_2_NA netOGlyc-4.0.0.13 CARBOHYD 13 13 0.0250116 . .
AGY29650_2_NA netOGlyc-4.0.0.13 CARBOHYD 23 23 0.565981 . .
...
This file tells me when there is a value >= 0.5; this value is in the sixth column. When that happens, my script takes the first column (an ID, to be matched against the second file) and the fourth column (the position of a letter in the second file).
Here is my second file (again only a part):
>AGY29650.2|NA spike protein
MTYSVFPLMCLLTFIGANAKIVTLPGNDA...EEYDLEPHKIHVH*
As I said, the script takes the ID from the first file and matches it against the ID in the second file. When they are the same, it looks up the position (fourth column) in the sequence data.
For example, in file one the fourth row has a positive value (>= 0.5) and the position in the fourth column is 23.
The script then looks at position 23 in the sequence in the second file; here position 23 is the letter T:
MTYSVFPLMCLLTFIGANAKIV T LP
Once the script has found that letter, it should take the 2 letters to the right and the 2 letters to the left of the position of interest:
IVTLP
In the previous post, thanks to the help of some people on Stack, I could solve the problem caused by the difference between the IDs in each file (AGY29650_2_NA in file one versus AGY29650.2 in file two).
Now I am looking for help to obtain the output that I need to complete the script.
The script is incomplete because I couldn't find a way to print the output of interest, in this case the 5 letters from the second file: the letter at the position given in file one, plus 2 letters to its right and 2 to its left.
I have thousands of files like files one and two, so I need some help to complete the script with any idea that you recommend.
Here is the script:
use strict;
use warnings;
use Bio::SeqIO;

my $file = $ARGV[0];
my $in = $ARGV[1];
my %fastadata = ();
my @array_residues = ();
my $seqio_obj = Bio::SeqIO->new(-file => $in,
                                -format => "fasta" );
while (my $seq_obj = $seqio_obj->next_seq ) {
    my $dd = $seq_obj->id;
    my $ss = $seq_obj->seq;
    ###my $ee = $seq_obj->desc;
    $fastadata{$dd} = "$ss";
}

my $thres = 0.5; ### Selection of values in column N°5 with the following condition: >=0.5

# Open file
open (F, $file) or die; ### open the file or end the analyze
while(my $one = <F>) {### readline => F
    $one =~ s/\n//g;
    $one =~ s/\r//g;
    my @cols = split(/\s+/, $one); ### split columns
    next unless (scalar (@cols) == 7); ### the line must have 7 columns to add to the array
    my $val = $cols[5];

    if ($val >= 0.5) {
        my $position = $cols[3];
        my $id_list = $cols[0];
        $id_list =~ s/^\s*([^_]+)_([0-9]+)_([a-zA-Z0-9]+)/$1.$2|$3/;
        if (exists($fastadata{$id_list})) {
            my $new_seq = $fastadata{$id_list};
            my $subresidues = substr($new_seq, $position -3, 6);
        }
    }
}
close F;
I'm thinking of adding a push to collect the new data and then printing it to a new file.
My expected output is the letter at the position of each positive value (>= 0.5), in this case T (position 23), together with the 2 letters to its right and the 2 letters to its left.
In this case, with the example data on GitHub (link above), the expected output is:
IVTLP
Any recommendation or help is welcome.
Thanks!

The main problem seems to be that each line has 8 columns, not 7 as assumed in the script. Another small issue is that the extracted substring should have 5 characters, not 6 as assumed by the script. Here is a modified version of the loop that works for me:
open (F, $file) or die; ### open the file or end the analyze
while(my $one = <F>) {### readline => F
    chomp $one;
    my @cols = split(/\s+/, $one); ### split columns
    next unless (scalar @cols) == 8; ### the line must have 8 columns to add to the array
    my $val = $cols[5];
    if ($val >= 0.5) {
        my $position = $cols[3];
        my $id_list = $cols[0];
        $id_list =~ s/^\s*([^_]+)_([0-9]+)_([a-zA-Z0-9]+)/$1.$2|$3/;
        if (exists($fastadata{$id_list})) {
            my $new_seq = $fastadata{$id_list};
            my $subresidues = substr($new_seq, $position -3, 5);
            print $subresidues, "\n";
        }
    }
}
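One detail the loop above does not cover is a position within two residues of either end of the sequence. For position 1 or 2 the offset $position - 3 becomes negative, and Perl's substr then counts from the end of the string. The sketch below, with a hypothetical helper name residue_window, clips the window to the sequence instead.

```perl
use strict;
use warnings;

# Sketch: residue at a 1-based $position plus $flank residues on each side,
# clipped at both ends of the sequence. The helper name is illustrative.
sub residue_window {
    my ($seq, $position, $flank) = @_;
    $flank = 2 unless defined $flank;

    my $start = $position - 1 - $flank;    # convert to a 0-based offset
    $start = 0 if $start < 0;

    my $end = $position - 1 + $flank;
    $end = length($seq) - 1 if $end > length($seq) - 1;

    return substr($seq, $start, $end - $start + 1);
}

print residue_window('MTYSVFPLMCLLTFIGANAKIVTLPGNDA', 23), "\n";   # prints IVTLP
```

To write the windows to a new file, they could be pushed onto an array inside the loop and printed at the end to a filehandle opened for writing.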

Dynamic Loop outputs the same on each iteration

I am attempting to write a script to automate some data collection. Initially the script runs a series of commands which are carried out by the system. The output of these commands is stored in two text files. Following data collection, I am attempting to implement a for loop so that a third output file is generated which lists the value of interest from the first line of output 1 and the second line of output one, as well as the relative error. The following code completes the correct number of times, but returns the same values on all four lines. I suspect this has to do with the filehandler variable, but am unsure how to solve the issue.
for($ln = 1; $ln<5;$ln++){
    open($fh, '<',"theoretical.dat",<$ln>)
        or die "Could not open file 'theoretical.dat' $!";
    @line = split(' ',<$fh>);
    $v = $line[3];
    open($fh2, '<',"actual.dat",<$ln>)
        or die "Could not open file 'actual.dat' $!";
    @line = split(' ',<$fh2>);
    $v0 = $line[3];
    $e = abs(($v0-$v)/$v0);
    $rms = $rms + $e^2;
    my @result = ($v, $v0, $e);
    print "@result \n";
}
The output file code has been omitted. It contains an if/else depending upon if output should be piped into results.dat or appended.
Note that the data in question is stored as 4 numbers per line, only the fourth of which I wish to access with this script. From the output generated, it seems that $ln changes as expected on each iteration, but the line being read does not, despite the argument in the open call that is meant to select line number $ln.
I have tried undefing $fh and $fh2 after each loop, but it still outputs the same.

You can't specify the line number of a file on the open call. In fact reopening a file will cause it to be read again starting from the top.
Without seeing your data files I can't be sure, but I think you want something like this.
Note that you can use autodie instead of coding an explicit test for an open succeeding. You must also use strict and use warnings at the top of every Perl program, and declare all of your variables using my as close as possible to their first point of use. I have declared $rms outside the loop here so that it can accumulate an aggregate sum of squares instead of being destroyed and recreated each time around the loop.
use strict;
use warnings;
use autodie;
open my $theo_fh, '<', 'theoretical.dat';
open my $act_fh, '<', 'actual.dat';
my $rms = 0;
for my $ln (1 .. 4) {
    my $v_theo = (split ' ', <$theo_fh>)[3];
    my $v_act = (split ' ', <$act_fh>)[3];
    my $e = abs(($v_act - $v_theo) / $v_act);
    $rms += $e ** 2;    # ** is exponentiation; ^ is bitwise XOR in Perl
    my @result = ($v_theo, $v_act, $e);
    print "@result\n";
}
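If $rms is meant to end up as a root mean square, the accumulated sum of squares still has to be divided by the number of values and square-rooted after the loop. The sketch below shows that last step on in-memory lists. The function name relative_rms and the idea of passing the two columns as array references are assumptions made for the example, not part of the question.

```perl
use strict;
use warnings;

# Sketch: relative error per pair of values, plus the root mean square of
# those errors. Name and calling convention are illustrative.
sub relative_rms {
    my ($theo, $actual) = @_;
    my $sum_sq = 0;
    my @errors;
    for my $i (0 .. $#$actual) {
        my $e = abs( ($actual->[$i] - $theo->[$i]) / $actual->[$i] );
        push @errors, $e;
        $sum_sq += $e ** 2;    # exponentiation, not the XOR operator ^
    }
    my $rms = @errors ? sqrt( $sum_sq / scalar(@errors) ) : 0;
    return (\@errors, $rms);
}
```

A call such as my ($errors, $rms) = relative_rms(\@theoretical, \@actual); would return both the per-line errors and the single summary number.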

Parallelization of perl script

I have a Perl script I wish to parallelise.
It is composed of a while loop over a file of more than 11000 lines, nested inside another while loop over a file of 3400 lines, which makes it extremely slow.
open (FILE1, "File1.txt") or die "Can't open File1";
open (OUT, ">Outfile.txt");
while (<FILE1>)
{
    my @data=split (/ /, $_);
    my $RS=1;
    open (FILE2, "File2.txt") or die "Can't open File2";
    while (<FILE2>)
    {
        my @value=split (/ /, $_);
        if ($data[$RS] == 1) {print OUT $value[1];$RS++;}
        elsif ($data[$RS] == 2) {print OUT $value[2];$RS++;}
        elsif ($data[$RS] == 0) {print OUT $value[3];$RS++;}
    }
    close FILE2;
}
I'm looking for a way to do the equivalent of qsub with every line of File1 so I can send 3440 jobs. Any suggestions? I'd like to stay with perl if possible. I tried to insert this code inside of a bash script, but I don't really understand how to insert a language inside another one.
My File1 contains a list of IDs with information in columns. Each column is then related to a single line in File2. I'd like to be able to run the second loop for multiple IDs simultaneously instead of one after another.
File1
ID RS_10 RS_15 RS_30
23 1 0 1
34 2 2 0
45 1 1 0
23 0 0 2
10 2 1 1
File2
RS_10 A B C
RS_15 D E F
RS_30 G H I

The first rule of optimization is not to do it too early (i.e. jumping to premature conclusions without profiling your code).
The second rule would probably refer to caching.
The File2 of yours isn't very large. I'd say we load it into memory. This has the following advantages:
We do our parsing once and only once.
The file isn't obscenely large, so space isn't much of an issue.
We can create a data structure that makes lookups very simple.
About that first point: You split each line over three thousand times. Those cycles could have been better spent.
About that third point: you seem to do an index conversion:
1 → 1, 2 → 2, 0 → 3
Instead of testing for all values with an if/elsif-switch (linear complexity), we could use an array that does this translation (constant time lookups):
my @conversion = (3, 1, 2);
...;
print OUT $value[$conversion[$data[$RS++]]];
If this index conversion is constant, we could do it once and only once when parsing File2. This would look like
use strict; use warnings;
use autodie; # automatic error handling
my @file2;
{
    open my $file2, "<", "File2.txt";
    while (<$file2>) {
        my (undef, @vals) = split;
        # do the reordering. This is equivalent to @vals = @vals[2, 0, 1];
        unshift @vals, pop @vals;
        push @file2, \@vals;
    }
}
Now we can move on to iterating through File1. Printing the corresponding entry from File2 now looks like
open my $file1, "<", "File1.txt";
<$file1>; # remove header
while (<$file1>) {
    my ($id, @indices) = split;
    print $id, map $file2[$_][$indices[$_]], 0 .. $#indices;
    # but I guess you'd want some separator in between
    # If so, set the $, variable
}
This algorithm is still quadratic (the map is just a for-loop in disguise), but this should have a better constant factor. The output of above code given your example input is
23 A F G
34 B E I
45 A D I
23 C F H
10 B D G
(with $, = " "; $\ = "\n").
Where to go from here
This last step (looping through File1) could be parallelized, but this is unlikely to help much: IO is slow, communication between threads is expensive (IPC even more so), and the output would be in random order. We could spawn a bunch of workers, and pass unparsed lines in a queue:
use threads; # should be 1st module to be loaded
use Thread::Queue;
use constant NUM_THREADS => 4; # number of cores
# parse the File2 data here
my $queue = Thread::Queue->new;
my @threads = map threads->new(\&worker), 1 .. NUM_THREADS;
# enqueue data
$queue->enqueue($_) while <$file1>;
# end the queue
$queue->enqueue((undef) x NUM_THREADS); # $queue->end in newer versions
# wait for threads to complete
$_->join for @threads;
sub worker {
    # a named variable is used because lexical $_ (my $_) was removed in Perl 5.24
    while (defined(my $line = $queue->dequeue)) {
        my ($id, @indices) = split ' ', $line;
        print $id, map $file2[$_][$indices[$_]], 0 .. $#indices;
    }
}
Note that this copies the @file2 into all threads. Fun fact: for the example data, this threaded solution takes roughly 4× as long. This is mostly the overhead of thread creation, so this will be less of an issue for your data.
In any case, profile your code to see where you can optimize most effectively. I recommend the excellent Devel::NYTProf. E.g. for my non-threaded test run with this very limited data, the overhead implied by autodie and friends used more time than doing the actual processing. For you, the most expensive line would probably be
print $id, map $file2[$_][$indices[$_]], 0 .. $#indices;
but there isn't much we can do here inside Perl.

Perl forming string random string combination

I have a file with around 25000 records; each record has more than 13 entries, which are drug names. I want to form all possible pair combinations of these entries. E.g. if a line has three entries A, B, C, I should form the combinations 1) A B 2) A C 3) B C. Below is the code I got from the internet; it works only if a single line is assigned to an array:
use Math::Combinatorics;
my @n = qw(a b c);
my $combinat = Math::Combinatorics->new(
    count => 2,
    data => [@n],
);
while ( my @combo = $combinat->next_combination ) {
    print join( ' ', @combo ) . "\n";
}
The code I am using doesn't produce any output:
open IN, "drugs.txt" or die "Cannot open the drug file";
open OUT, ">Combination.txt";
use Math::Combinatorics;
while (<IN>) {
    chomp $_;
    @Drugs = split /\t/, $_;
    @n = $Drugs[1];
    my $combinat = Math::Combinatorics->new(
        count => 2,
        data => [@n],
    );
    while ( my @combo = $combinat->next_combination ) {
        print join( ' ', @combo ) . "\n";
    }
    print "\n";
}
Can you please suggest a solution to this problem?

You're setting @n to be an array containing only the second value of the @Drugs array. Try just using data => \@Drugs in the Math::Combinatorics constructor.
Also, use strict; use warnings; blahblahblah.

All pairs from an array are straightforward to compute. Using drugs A, B, and C as from your question, you might think of them forming a square matrix.
AA AB AC
BA BB BC
CA CB CC
You probably do not want the “diagonal” pairs AA, BB, and CC. Note that the remaining elements are symmetrical. For example, element (0,1) is AB and (1,0) is BA. Here again, I assume these are the same and that you do not want duplicates.
To borrow a term from linear algebra, you want the upper triangle. Doing it this way eliminates duplicates by construction, assuming that each drug name on a given line is unique. An algorithm for this is below.
1. Select in turn each drug q on the line. For each of these, perform steps 2 and 3.
2. Beginning with the drug immediately following q and then for each drug r in the rest of the list, perform step 3.
3. Record the pair (q, r).
The recorded list is the list of all unique pairs.
In Perl, this looks like
#! /usr/bin/env perl
use strict;
use warnings;
sub pairs {
    my @a = @_;
    my @pairs;
    foreach my $i (0 .. $#a) {
        foreach my $j ($i+1 .. $#a) {
            push @pairs, [ @a[$i,$j] ];
        }
    }
    wantarray ? @pairs : \@pairs;
}
my $line = "Perlix\tScalaris\tHashagra\tNextium";
for (pairs split /\t/, $line) {
    print "@$_\n";
}
Output:
Perlix Scalaris
Perlix Hashagra
Perlix Nextium
Scalaris Hashagra
Scalaris Nextium
Hashagra Nextium

I've answered something like this before for someone else. For them, they had a question on how to combine a list of letters into all possible words.
Take a look at How Can I Generate a List of Words from a group of Letters Using Perl. In it, you'll see an example of using Math::Combinatorics from my answer and the correct answer that ikegami had. (He did something rather interesting with regular expressions).
I'm sure one of these will lead you to the answer you need. Maybe when I have more time, I'll flesh out an answer specifically for your question. I hope this link helps.

Parsing files that use synonyms

If I had a text file with the following:
Today (is|will be) a (great|good|nice) day.
Is there a simple way I can generate a random output like:
Today is a great day.
Today will be a nice day.
Using Perl or UNIX utils?

Closures are fun:
#!/usr/bin/perl
use strict;
use warnings;
my @gens = map { make_generator($_, qr~\|~) } (
    'Today (is|will be) a (great|good|nice) day.',
    'The returns this (month|quarter|year) will be (1%|5%|10%).',
    'Must escape %% signs here, but not here (%|#).'
);
for ( 1 .. 5 ) {
    print $_->(), "\n" for @gens;
}
sub make_generator {
    my ($tmpl, $sep) = @_;
    my @lists;
    while ( $tmpl =~ s{\( ( [^)]+ ) \)}{%s}x ) {
        push @lists, [ split $sep, $1 ];
    }
    return sub {
        sprintf $tmpl, map { $_->[rand @$_] } @lists
    };
}
Output:
C:\Temp> h
Today will be a great day.
The returns this month will be 1%.
Must escape % signs here, but not here #.
Today will be a great day.
The returns this year will be 5%.
Must escape % signs here, but not here #.
Today will be a good day.
The returns this quarter will be 10%.
Must escape % signs here, but not here %.
Today is a good day.
The returns this month will be 1%.
Must escape % signs here, but not here %.
Today is a great day.
The returns this quarter will be 5%.
Must escape % signs here, but not here #.

Code:
#!/usr/bin/perl
use strict;
use warnings;
my $template = 'Today (is|will be) a (great|good|nice) day.';
for (1..10) {
    print pick_one($template), "\n";
}
exit;
sub pick_one {
    my ($template) = @_;
    $template =~ s{\(([^)]+)\)}{get_random_part($1)}ge;
    return $template;
}
sub get_random_part {
    my $string = shift;
    my @parts = split /\|/, $string;
    return $parts[rand @parts];
}
Logic:
Define template of output (my $template = ...)
Enter loop to print random output many times (for ...)
Call pick_one to do the work
Find all "(...)" substrings, and replace them with random part ($template =~ s...)
Print generated string
Getting random part is simple:
receive extracted substring (my $string = shift)
split it using | character (my @parts = ...)
return random part (return $parts[...)
That's basically all. Instead of using function you could put the same logic in s{}{}, but it would be a bit less readable:
$template =~ s{\( ( [^)]+ ) \)}
              { my @parts = split /\|/, $1;
                $parts[rand @parts];
              }gex;

Sounds like you may be looking for Regexp::Genex. From the module's synopsis:
#!/usr/bin/perl -l
use Regexp::Genex qw(:all);
$regex = shift || "a(b|c)d{2,4}?";
print "Trying: $regex";
print for strings($regex);
# abdd
# abddd
# abdddd
# acdd
# acddd
# acdddd

Use a regex to match each parenthetical (and the text inside it).
Use a string split operation (pipe delimiter) on the text inside of the matched parenthetical to get each of the options.
Pick one randomly.
Return it as the replacement for that capture.

Smells like a recursive algorithm
Edit: misread and thought you wanted all possibilities
#!/usr/bin/python
import re, random

def expand(line, all):
    result = re.search(r'\([^\)]+\)', line)
    if result:
        variants = result.group(0)[1:-1].split("|")
        for v in variants:
            expand(line[:result.start()] + v + line[result.end():], all)
    else:
        all.append(line)
    return all

line = "Today (is|will be) a (great|good|nice) day."
all = expand(line, [])
# choose a random possibility at the end:
print(random.choice(all))
A similar construct that produces a single random line:
def expand_rnd(line):
    result = re.search(r'\([^\)]+\)', line)
    if result:
        variants = result.group(0)[1:-1].split("|")
        choice = random.choice(variants)
        return expand_rnd(
            line[:result.start()] + choice + line[result.end():])
    else:
        return line
This will, however, fail on nested constructs.
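Nested groups such as (very (good|nice)|bad) could be handled by always resolving the innermost group first, i.e. a parenthesised group that contains no further parentheses. The Perl sketch below illustrates that idea under the assumption that parentheses are balanced and never meant literally. The name expand_random is hypothetical.

```perl
use strict;
use warnings;

# Sketch: repeatedly replace an innermost "(a|b|c)" group, one containing no
# parentheses itself, by one of its alternatives chosen at random.
sub expand_random {
    my ($line) = @_;
    1 while $line =~ s{\(([^()]*)\)}{
        my @parts = split /\|/, $1, -1;      # -1 keeps empty alternatives
        @parts ? $parts[rand @parts] : '';   # "()" expands to nothing
    }e;
    return $line;
}

print expand_random('Today is a (very (good|nice)|bad) day.'), "\n";
```

Each pass removes one pair of parentheses, so the loop terminates. Note that a nested alternative is less likely overall than a top-level one, since a choice is made at each level independently.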
