how to convert PHYLIP format to FASTA - perl

I have just started working with Perl and I have a question. I have a PHYLIP file and I need to convert it to FASTA. I have started writing a script. First, I removed the spaces in the lines; now I need to reflow the lines so that each line holds 60 amino acids, with each sequence identifier printed on its own line. Could someone give me some advice?

BioPerl's Bio::AlignIO module might help. It supports the PHYLIP sequence format:
phylip2fasta.pl
use strict;
use warnings;
use Bio::AlignIO;
# http://doc.bioperl.org/bioperl-live/Bio/AlignIO.html
# http://doc.bioperl.org/bioperl-live/Bio/AlignIO/phylip.html
# http://www.bioperl.org/wiki/PHYLIP_multiple_alignment_format
my ($inputfilename) = @ARGV;
die "must provide phylip file as 1st parameter...\n" unless $inputfilename;
my $in = Bio::AlignIO->new(-file => $inputfilename ,
-format => 'phylip',
-interleaved => 1);
my $out = Bio::AlignIO->new(-fh => \*STDOUT ,
-format => 'fasta');
while ( my $aln = $in->next_aln() ) {
$out->write_aln($aln);
}
$ perl phylip2fasta.pl test.phylip
>Turkey/1-42
AAGCTNGGGCATTTCAGGGTGAGCCCGGGCAATACAGGGTAT
>Salmo_gair/1-42
AAGCCTTGGCAGTGCAGGGTGAGCCGTGGCCGGGCACGGTAT
>H._Sapiens/1-42
ACCGGTTGGCCGTTCAGGGTACAGGTTGGCCGTTCAGGGTAA
>Chimp/1-42
AAACCCTTGCCGTTACGCTTAAACCGAGGCCGGGACACTCAT
>Gorilla/1-42
AAACCCTTGCCGGTACGCTTAAACCATTGCCGGTACGCTTAA
test.phylip http://evolution.genetics.washington.edu/phylip/doc/sequence.html
5 42
Turkey    AAGCTNGGGC ATTTCAGGGT
Salmo gairAAGCCTTGGC AGTGCAGGGT
H. SapiensACCGGTTGGC CGTTCAGGGT
Chimp     AAACCCTTGC CGTTACGCTT
Gorilla   AAACCCTTGC CGGTACGCTT
GAGCCCGGGC AATACAGGGT AT
GAGCCGTGGC CGGGCACGGT AT
ACAGGTTGGC CGTTCAGGGT AA
AAACCGAGGC CGGGACACTC AT
AAACCATTGC CGGTACGCTT AA
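In strict PHYLIP the name occupies the first 10 columns of each line in the first block, which is why `Salmo gair` in test.phylip comes out as `Salmo_gair`. If you want to see what that parsing step involves without BioPerl, here is a minimal sketch that assumes exactly that fixed-width layout. The helper name `split_phylip_line` is made up for illustration and is not part of BioPerl.

```perl
use strict;
use warnings;

# Hypothetical helper: split one line of the first PHYLIP block into
# (name, residues), assuming the name fills exactly the first 10 columns.
sub split_phylip_line {
    my ($line) = @_;
    $line =~ s/[\r\n]+\z//;                       # drop the line ending
    my $name = substr($line, 0, 10);              # fixed-width name field
    my $seq  = length($line) > 10 ? substr($line, 10) : '';
    $name =~ s/\s+\z//;                           # trim padding after the name
    $name =~ s/\s+/_/g;                           # inner spaces become underscores
    $seq  =~ s/\s+//g;                            # remove the column separators
    return ($name, $seq);
}

my ($name, $seq) = split_phylip_line("Salmo gairAAGCCTTGGC AGTGCAGGGT\n");
print ">$name\n$seq\n";
```

Splitting on whitespace instead would cut `Salmo gair` in two, so the fixed-width approach matters whenever names contain spaces.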

If you have access to BioPerl, I suggest using that (see other answer). If not, here is a quick script I used in an old HW assignment a few years ago. It may work for you.
One thing to note: it prints each FASTA sequence on a single line, so you have to edit the print statement at the end if you want a fixed number of amino acids (say, 70) per line.
#!/usr/bin/perl
use warnings;
use strict;
<DATA> =~ /(\d+)/; # first number is number of species
my $num_species = $1;
my $i = 0;
my @species;
my @acids;
# first $num_species rows have the species name
for ($i = 0; $i < $num_species; $i++) {
my @line = split /\s+/, <DATA>;
chomp @line;
push @species, shift (@line);
push @acids, join ("", @line);
}
# Get the rest of the AAs
$i = 0;
while (<DATA>) {
chomp;
$_ =~ s/\r//g; #remove \r
next if !$_;
$_ =~ s/\s+//g; #remove spaces
$acids[$i] .= $_;
$i = ($i + 1) % $num_species; # move on to the next species, wrapping around
}
# Print them
for ($i = 0; $i < $num_species; $i++) {
print "> ", $species[$i], "\n";
# remove the gaps ("-"); comment out the next line if you want to keep them
$acids[$i] =~ s/-//g;
print $acids[$i], "\n\n";
}
# Simple PHYLIP Amino Acid file
__DATA__
10 234
Cow MAYPMQLGFQ DATSPIMEEL LHFHDHTLMI VFLISSLVLY IISLMLTTKL
Carp MAHPTQLGFK DAAMPVMEEL LHFHDHALMI VLLISTLVLY IITAMVSTKL
Chicken MANHSQLGFQ DASSPIMEEL VEFHDHALMV ALAICSLVLY LLTLMLMEKL
Human MAHAAQVGLQ DATSPIMEEL ITFHDHALMI IFLICFLVLY ALFLTLTTKL
Loach MAHPTQLGFQ DAASPVMEEL LHFHDHALMI VFLISALVLY VIITTVSTKL
Mouse MAYPFQLGLQ DATSPIMEEL MNFHDHTLMI VFLISSLVLY IISLMLTTKL
Rat MAYPFQLGLQ DATSPIMEEL TNFHDHTLMI VFLISSLVLY IISLMLTTKL
Seal MAYPLQMGLQ DATSPIMEEL LHFHDHTLMI VFLISSLVLY IISLMLTTKL
Whale MAYPFQLGFQ DAASPIMEEL LHFHDHTLMI VFLISSLVLY IITLMLTTKL
Frog MAHPSQLGFQ DAASPIMEEL LHFHDHTLMA VFLISTLVLY IITIMMTTKL
THTSTMDAQE VETIWTILPA IILILIALPS LRILYMMDEI NNPSLTVKTM
TNKYILDSQE IEIVWTILPA VILVLIALPS LRILYLMDEI NDPHLTIKAM
S-SNTVDAQE VELIWTILPA IVLVLLALPS LQILYMMDEI DEPDLTLKAI
TNTNISDAQE METVWTILPA IILVLIALPS LRILYMTDEV NDPSLTIKSI
TNMYILDSQE IEIVWTVLPA LILILIALPS LRILYLMDEI NDPHLTIKAM
THTSTMDAQE VETIWTILPA VILIMIALPS LRILYMMDEI NNPVLTVKTM
THTSTMDAQE VETIWTILPA VILILIALPS LRILYMMDEI NNPVLTVKTM
THTSTMDAQE VETVWTILPA IILILIALPS LRILYMMDEI NNPSLTVKTM
THTSTMDAQE VETVWTILPA IILILIALPS LRILYMMDEV NNPSLTVKTM
TNTNLMDAQE IEMVWTIMPA ISLIMIALPS LRILYLMDEV NDPHLTIKAI
GHQWYWSYEY TDYEDLSFDS YMIPTSELKP GELRLLEVDN RVVLPMEMTI
GHQWYWSYEY TDYENLGFDS YMVPTQDLAP GQFRLLETDH RMVVPMESPV
GHQWYWTYEY TDFKDLSFDS YMTPTTDLPL GHFRLLEVDH RIVIPMESPI
GHQWYWTYEY TDYGGLIFNS YMLPPLFLEP GDLRLLDVDN RVVLPIEAPI
GHQWYWSYEY TDYENLSFDS YMIPTQDLTP GQFRLLETDH RMVVPMESPI
GHQWYWSYEY TDYEDLCFDS YMIPTNDLKP GELRLLEVDN RVVLPMELPI
GHQWYWSYEY TDYEDLCFDS YMIPTNDLKP GELRLLEVDN RVVLPMELPI
GHQWYWSYEY TDYEDLNFDS YMIPTQELKP GELRLLEVDN RVVLPMEMTI
GHQWYWSYEY TDYEDLSFDS YMIPTSDLKP GELRLLEVDN RVVLPMEMTI
GHQWYWSYEY TNYEDLSFDS YMIPTNDLTP GQFRLLEVDN RMVVPMESPT
RMLVSSEDVL HSWAVPSLGL KTDAIPGRLN QTTLMSSRPG LYYGQCSEIC
RVLVSAEDVL HSWAVPSLGV KMDAVPGRLN QAAFIASRPG VFYGQCSEIC
RVIITADDVL HSWAVPALGV KTDAIPGRLN QTSFITTRPG VFYGQCSEIC
RMMITSQDVL HSWAVPTLGL KTDAIPGRLN QTTFTATRPG VYYGQCSEIC
RILVSAEDVL HSWALPAMGV KMDAVPGRLN QTAFIASRPG VFYGQCSEIC
RMLISSEDVL HSWAVPSLGL KTDAIPGRLN QATVTSNRPG LFYGQCSEIC
RMLISSEDVL HSWAIPSLGL KTDAIPGRLN QATVTSNRPG LFYGQCSEIC
RMLISSEDVL HSWAVPSLGL KTDAIPGRLN QTTLMTMRPG LYYGQCSEIC
RMLVSSEDVL HSWAVPSLGL KTDAIPGRLN QTTLMSTRPG LFYGQCSEIC
RLLVTAEDVL HSWAVPSLGV KTDAIPGRLH QTSFIATRPG VFYGQCSEIC
GSNHSFMPIV LELVPLKYFE KWSASML--- ----
GANHSFMPIV VEAVPLEHFE NWSSLMLEDA SLGS
GANHSYMPIV VESTPLKHFE AWSSL----- -LSS
GANHSFMPIV LELIPLKIFE M-------GP VFTL
GANHSFMPIV VEAVPLSHFE NWSTLMLKDA SLGS
GSNHSFMPIV LEMVPLKYFE NWSASMI--- ----
GSNHSFMPIV LEMVPLKYFE NWSASMI--- ----
GSNHSFMPIV LELVPLSHFE KWSTSML--- ----
GSNHSFMPIV LELVPLEVFE KWSVSML--- ----
GANHSFMPIV VEAVPLTDFE NWSSSML-EA SL--
Output:
> Cow
MAYPMQLGFQDATSPIMEELLHFHDHTLMIVFLISSLVLYIISLMLTTKLTHTSTMDAQEVETIWTILPAIILILIALPSLRILYMMDEINNPSLTVKTMGHQWYWSYEYTDYEDLSFDSYMIPTSELKPGELRLLEVDNRVVLPMEMTIRMLVSSEDVLHSWAVPSLGLKTDAIPGRLNQTTLMSSRPGLYYGQCSEICGSNHSFMPIVLELVPLKYFEKWSASML
> Carp
MAHPTQLGFKDAAMPVMEELLHFHDHALMIVLLISTLVLYIITAMVSTKLTNKYILDSQEIEIVWTILPAVILVLIALPSLRILYLMDEINDPHLTIKAMGHQWYWSYEYTDYENLGFDSYMVPTQDLAPGQFRLLETDHRMVVPMESPVRVLVSAEDVLHSWAVPSLGVKMDAVPGRLNQAAFIASRPGVFYGQCSEICGANHSFMPIVVEAVPLEHFENWSSLMLEDASLGS
> Chicken
MANHSQLGFQDASSPIMEELVEFHDHALMVALAICSLVLYLLTLMLMEKLSSNTVDAQEVELIWTILPAIVLVLLALPSLQILYMMDEIDEPDLTLKAIGHQWYWTYEYTDFKDLSFDSYMTPTTDLPLGHFRLLEVDHRIVIPMESPIRVIITADDVLHSWAVPALGVKTDAIPGRLNQTSFITTRPGVFYGQCSEICGANHSYMPIVVESTPLKHFEAWSSLLSS
> Human
MAHAAQVGLQDATSPIMEELITFHDHALMIIFLICFLVLYALFLTLTTKLTNTNISDAQEMETVWTILPAIILVLIALPSLRILYMTDEVNDPSLTIKSIGHQWYWTYEYTDYGGLIFNSYMLPPLFLEPGDLRLLDVDNRVVLPIEAPIRMMITSQDVLHSWAVPTLGLKTDAIPGRLNQTTFTATRPGVYYGQCSEICGANHSFMPIVLELIPLKIFEMGPVFTL
> Loach
MAHPTQLGFQDAASPVMEELLHFHDHALMIVFLISALVLYVIITTVSTKLTNMYILDSQEIEIVWTVLPALILILIALPSLRILYLMDEINDPHLTIKAMGHQWYWSYEYTDYENLSFDSYMIPTQDLTPGQFRLLETDHRMVVPMESPIRILVSAEDVLHSWALPAMGVKMDAVPGRLNQTAFIASRPGVFYGQCSEICGANHSFMPIVVEAVPLSHFENWSTLMLKDASLGS
> Mouse
MAYPFQLGLQDATSPIMEELMNFHDHTLMIVFLISSLVLYIISLMLTTKLTHTSTMDAQEVETIWTILPAVILIMIALPSLRILYMMDEINNPVLTVKTMGHQWYWSYEYTDYEDLCFDSYMIPTNDLKPGELRLLEVDNRVVLPMELPIRMLISSEDVLHSWAVPSLGLKTDAIPGRLNQATVTSNRPGLFYGQCSEICGSNHSFMPIVLEMVPLKYFENWSASMI
> Rat
MAYPFQLGLQDATSPIMEELTNFHDHTLMIVFLISSLVLYIISLMLTTKLTHTSTMDAQEVETIWTILPAVILILIALPSLRILYMMDEINNPVLTVKTMGHQWYWSYEYTDYEDLCFDSYMIPTNDLKPGELRLLEVDNRVVLPMELPIRMLISSEDVLHSWAIPSLGLKTDAIPGRLNQATVTSNRPGLFYGQCSEICGSNHSFMPIVLEMVPLKYFENWSASMI
> Seal
MAYPLQMGLQDATSPIMEELLHFHDHTLMIVFLISSLVLYIISLMLTTKLTHTSTMDAQEVETVWTILPAIILILIALPSLRILYMMDEINNPSLTVKTMGHQWYWSYEYTDYEDLNFDSYMIPTQELKPGELRLLEVDNRVVLPMEMTIRMLISSEDVLHSWAVPSLGLKTDAIPGRLNQTTLMTMRPGLYYGQCSEICGSNHSFMPIVLELVPLSHFEKWSTSML
> Whale
MAYPFQLGFQDAASPIMEELLHFHDHTLMIVFLISSLVLYIITLMLTTKLTHTSTMDAQEVETVWTILPAIILILIALPSLRILYMMDEVNNPSLTVKTMGHQWYWSYEYTDYEDLSFDSYMIPTSDLKPGELRLLEVDNRVVLPMEMTIRMLVSSEDVLHSWAVPSLGLKTDAIPGRLNQTTLMSTRPGLFYGQCSEICGSNHSFMPIVLELVPLEVFEKWSVSML
> Frog
MAHPSQLGFQDAASPIMEELLHFHDHTLMAVFLISTLVLYIITIMMTTKLTNTNLMDAQEIEMVWTIMPAISLIMIALPSLRILYLMDEVNDPHLTIKAIGHQWYWSYEYTNYEDLSFDSYMIPTNDLTPGQFRLLEVDNRMVVPMESPTRLLVTAEDVLHSWAVPSLGVKTDAIPGRLHQTSFIATRPGVFYGQCSEICGANHSFMPIVVEAVPLTDFENWSSSMLEASL
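The script above prints each sequence on one line. Wrapping it at a fixed width, such as the 60 residues per line asked for in the question, could be sketched like this. The helper name `fasta_record` and the default width are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: format one FASTA record, wrapping the sequence
# at $width residues per line (60 if no width is given).
sub fasta_record {
    my ($id, $seq, $width) = @_;
    $width ||= 60;
    my @chunks = unpack "(A$width)*", $seq;   # fixed-size pieces; the last may be shorter
    return join "\n", ">$id", @chunks, '';
}

# 70 residues give one line of 60 followed by one line of 10.
print fasta_record('Cow', 'MAYPMQLGFQ' x 7);
```

In the script above, the final `print $acids[$i], "\n\n";` would then be replaced by a call such as `print fasta_record($species[$i], $acids[$i]);`.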

Related

Use Perl to loop over files and calculate the mean of each column

I'm new to Perl and I would like to learn how to use loops with it. I have multiple directories, and each directory contains a file named data.txt. The data.txt file has several columns. I basically need to use a loop to calculate the mean of each column for each data.txt file.
I have this command that does the job for one single file:
perl -lane 'for $c (0..$#F){$t[$c] += $F[$c]}; END{for $c (0..$#t){print $t[$c]/$.}}' data.txt
I wish to write a script where I visit every directory, read every file that's in it and apply the command.
Example:
data.txt:
-79.2335 0.4041 71.9143 1.3392 -0.7687 0.0212 -8.0934 1.1425
-74.4163 0.6188 60.0468 1.8782 -0.8540 0.0305 -15.1574 1.4755
-74.4118 0.6046 62.1771 1.8058 -0.9143 0.0304 -13.2272 1.3408
-74.3895 0.5935 66.4264 1.6532 -0.8509 0.0223 -8.8819 1.2670
-74.3192 0.5589 67.1619 1.4763 -0.9656 0.0274 -8.1090 1.1450
-73.8272 0.6274 61.6632 1.7554 -0.8840 0.0256 -13.0435 1.3641
-73.3525 0.5856 60.6622 1.7872 -0.8489 0.0222 -13.5014 1.3947
-73.3206 0.6275 53.3129 2.2961 -0.7962 0.0337 -20.8195 1.8538
-72.5461 0.5212 62.0359 1.4267 -0.9378 0.0240 -11.4203 1.0295
-72.3058 0.7225 56.2304 2.1480 -0.7539 0.0293 -16.7954 1.5952
-72.1180 0.6460 51.7954 2.0845 -0.8479 0.0265 -21.0355 1.4630
-72.0690 0.4905 58.8372 1.3918 -0.9866 0.0333 -14.1823 1.1045
-71.7949 0.5799 55.6006 1.9189 -0.8541 0.0313 -17.0112 1.4530
-71.3074 0.4482 45.9271 2.1135 -0.6637 0.0354 -25.9309 1.8761
-71.2542 0.4879 57.3196 1.5406 -0.9523 0.0281 -14.9113 1.2705
-71.2421 0.5480 47.9065 2.2445 -0.8107 0.0352 -24.2489 1.7997
-70.3751 0.5278 49.5489 1.8395 -0.8208 0.0371 -21.5205 1.4994
-69.2181 0.4823 54.8234 1.0645 -0.9897 0.0246 -15.3506 0.9369
-69.0456 0.4650 40.3798 2.0117 -0.6476 0.0360 -29.3403 1.7013
-66.5402 0.5006 42.1805 1.7872 -0.7692 0.0356 -25.1431 1.4522
Output:
-72.354355 0.552015 56.297505 1.77814 -0.845845 0.029485 -16.88618 1.408235
As your comments imply that you have a simple directory structure with one main directory called mean with 100s of subdirectories, each with a file called data.txt, the list of files can be compiled easily with a glob, and the math is fairly straightforward. This is a suggestion how it can be done.
I would not use $. to calculate the average, since it counts every line read, blank ones included, and depends on which filehandle was read last. Instead, use a count variable for each file and count only the non-blank lines.
use strict;
use warnings;
use feature 'say';
for my $data (glob "mean/*/data.txt") { # get list of files
open my $fh, '<', $data or die "Cannot open file '$data': $!";
my @sum;
my $count = 0;
while (<$fh>) {
$count++ if /\S/; # count non-blank lines
my @fields = split; # split on whitespace
for (0 .. $#fields) {
$sum[$_] += $fields[$_]; # sum columns
}
}
say $data; # file name
say join "\t", # 3. ...join them with tab and print
map $_/$count, # 2. ...for each sum, divide by count
@sum; # 1. Take list of sums...
}
Output:
mean/A/data.txt
-72.354355 0.552015 56.297505 1.77814 -0.845845 0.029485 -16.88618 1.408235
mean/B/data.txt
-142.354355 0.552015 56.297505 1.77814 -0.845845 0.029485 -16.88618 1.408235
mean/C/data.txt
-72.354355 17.152015 56.297505 1.77814 -0.845845 0.029485 -16.88618 1.408235
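If you want to reuse the averaging step, it can be pulled out into a subroutine that takes any filehandle. This is only a sketch of the same idea; the name `column_means` is made up, and the in-memory filehandle is there just to make the example self-contained.

```perl
use strict;
use warnings;

# Hypothetical helper: return the mean of each whitespace-separated
# column read from a filehandle, ignoring blank lines.
sub column_means {
    my ($fh) = @_;
    my @sum;
    my $count = 0;
    while (my $line = <$fh>) {
        next unless $line =~ /\S/;            # skip blank lines entirely
        $count++;
        my @fields = split ' ', $line;        # split on whitespace, no empty leading field
        $sum[$_] += $fields[$_] for 0 .. $#fields;
    }
    return unless $count;                     # no data: return an empty list
    return map { $_ / $count } @sum;
}

# Demonstration on an in-memory "file" that contains a blank line.
my $text = "1 10\n\n3 30\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
print join("\t", column_means($fh)), "\n";    # prints 2 and 20, tab-separated
```

Inside the glob loop above you would call it as `column_means($fh)` right after opening each data.txt.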
I am not a Perl expert, but this worked for me. It prints the results to the terminal; you could redirect the output to a file, or write directly to a file instead.
use 5.28.2;
use warnings;
use File::Find;
my ($inf, @sum);
for my $dir (glob "/mainDirectory/*"){ # finds files/subdirectories
if (! -d $dir) {
next; # keeps only directories
}
$inf= "$dir/data.txt";
say "$inf";
find(\&sum_columns, $inf);
}
sub sum_columns{
open (IN, "<", "$inf" ) or die "Cannot open file.\n $!";
while (<IN>){
my $line = $_;
chomp $line;
my @columns = split(/\s+/,$line);
for my $item (0 .. $#columns){
$sum[$item] += $columns[$item];
}
}
say "@sum"; # note: these are the column sums; divide by the line count to get means
@sum=();
}

How to find number of numerical data for each and every line in a file

Please help me count the numerical data in each line of a file,
and also find the line length. The code has to be written in Perl.
For example if I have a line such as:
INPUT:I was born on 24th october,1994.
Output:2
You could do something like this:
perl -ne 'BEGIN{my $x} $x += () = /[0-9]+/g; END{print($x . "\n")}' file
-n: causes Perl to assume the following loop around your program, which makes it iterate over filename arguments somewhat like sed -n or awk:
LINE:
while (<>) {
... # your program goes here
}
-e: may be used to enter one line of program;
() = /[0-9]+/g forces the match to be evaluated in list context, so it returns every sequence of one or more digits found in the default input ($_). A list assignment evaluated in scalar context yields the number of elements on its right-hand side, so $x += () = /[0-9]+/g adds the number of digit sequences found on the current line to $x. END{print($x . "\n")} prints $x after the whole file has been processed, so this one-liner reports a total for the file rather than a count per line. (The BEGIN{my $x} block is not needed: that lexical $x is visible only inside the BEGIN block, and the rest of the one-liner uses the package variable $x.)
% cat file
string 123 string 1 string string string
456 string
% perl -ne 'BEGIN{my $x} $x += () = /[0-9]+/g; END{print($x . "\n")}' file
3
%
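Since the question asks for a count on each line plus the line length, here is a minimal per-line sketch built on the same `() = /.../g` idiom. The helper name `count_numbers` is made up for illustration, and the sample lines are hard-coded instead of being read from a file.

```perl
use strict;
use warnings;

# Hypothetical helper: count the runs of digits in one line.
sub count_numbers {
    my ($line) = @_;
    my $count = () = $line =~ /[0-9]+/g;   # list assignment in scalar context gives the match count
    return $count;
}

for my $line ("I was born on 24th october,1994.", "no digits here") {
    printf "%d number(s), length %d: %s\n",
        count_numbers($line), length($line), $line;
}
```

To run it over a file, replace the hard-coded list with a `while (my $line = <$fh>)` loop and `chomp` each line before measuring its length.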
I'd do something like this
#!/usr/bin/perl
use warnings;
use strict;
my $file = 'num.txt';
open my $fh, '<', $file or die "Failed to open $file: $!\n";
while (my $line = <$fh>){
chomp $line;
my @num = $line =~ /([0-9.]+)/g;
print "On this line --- " .scalar(@num) . "\n";
}
close ($fh);
The input file I tested --
This should say 1
Line 2 should say 2
I want this line to say 5 so I have added 4 other numbers like 0.02 -1 and 5.23
The output as tested ----
On this line --- 1
On this line --- 2
On this line --- 5
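The regex ([0-9.]+) matches any run of digits and dots, so a decimal such as 0.02 is counted once. Plain ([0-9]+) is enough only if the input has no decimals, because it would count 0.02 as two numbers. Note that ([0-9.]+) also matches a lone ".", for example a full stop that does not directly follow a digit.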
Hope it helps.
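If stray dots are a concern, a stricter pattern is one option. The sketch below assumes a number is an optional minus sign, some digits, and an optional decimal part; the helper name `count_decimal_numbers` is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: count numbers, allowing an optional minus sign and
# an optional decimal part, so that a lone "." is never counted.
sub count_decimal_numbers {
    my ($line) = @_;
    my @num = $line =~ /(-?[0-9]+(?:\.[0-9]+)?)/g;
    return scalar @num;
}

print count_decimal_numbers("like 0.02 -1 and 5.23 ."), "\n";   # prints 3
```

With this pattern a hyphenated value such as 36072-2 counts as two numbers (36072 and -2), so adjust the sign handling if that is not what you want.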

insert null values in missing rows of file

I have a text file containing 24 hours of timestamped data at 10-minute intervals.
2016-02-06,00:00:00,ujjawal,36072-2,MT,37,0,1
2016-02-06,00:10:00,ujjawal,36072-2,MT,37,0,1
2016-02-06,00:20:00,ujjawal,36072-2,MT,37,0,1
2016-02-06,00:40:00,ujjawal,36072-2,MT,37,0,1
2016-02-06,00:50:00,ujjawal,36072-2,MT,42,0,2
2016-02-06,01:00:00,ujjawal,36072-2,MT,55,0,2
2016-02-06,01:10:00,ujjawal,36072-2,MT,41,0,2
2016-02-06,01:20:00,ujjawal,36072-2,MT,46,0,2
2016-02-06,01:30:00,ujjawal,36072-2,MT,56,0,3
2016-02-06,01:40:00,ujjawal,36072-2,MT,38,0,2
2016-02-06,01:50:00,ujjawal,36072-2,MT,49,0,2
2016-02-06,02:00:00,ujjawal,36072-2,MT,58,0,4
2016-02-06,02:10:00,ujjawal,36072-2,MT,43,0,2
2016-02-06,02:20:00,ujjawal,36072-2,MT,46,0,2
2016-02-06,02:30:00,ujjawal,36072-2,MT,61,0,2
2016-02-06,02:40:00,ujjawal,36072-2,MT,57,0,3
2016-02-06,02:50:00,ujjawal,36072-2,MT,45,0,2
2016-02-06,03:00:00,ujjawal,36072-2,MT,45,0,3
2016-02-06,03:10:00,ujjawal,36072-2,MT,51,0,2
2016-02-06,03:20:00,ujjawal,36072-2,MT,68,0,3
2016-02-06,03:30:00,ujjawal,36072-2,MT,51,0,2
2016-02-06,03:40:00,ujjawal,36072-2,MT,68,0,3
2016-02-06,03:50:00,ujjawal,36072-2,MT,67,0,3
2016-02-06,04:00:00,ujjawal,36072-2,MT,82,0,8
2016-02-06,04:10:00,ujjawal,36072-2,MT,82,0,5
2016-02-06,04:20:00,ujjawal,36072-2,MT,122,0,4
2016-02-06,04:30:00,ujjawal,36072-2,MT,133,0,3
2016-02-06,04:40:00,ujjawal,36072-2,MT,142,0,3
2016-02-06,04:50:00,ujjawal,36072-2,MT,202,0,1
2016-02-06,05:00:00,ujjawal,36072-2,MT,731,1,3
2016-02-06,05:10:00,ujjawal,36072-2,MT,372,0,7
2016-02-06,05:20:00,ujjawal,36072-2,MT,303,0,2
2016-02-06,05:30:00,ujjawal,36072-2,MT,389,0,3
2016-02-06,05:40:00,ujjawal,36072-2,MT,454,0,1
2016-02-06,05:50:00,ujjawal,36072-2,MT,406,0,6
2016-02-06,06:00:00,ujjawal,36072-2,MT,377,0,1
2016-02-06,06:10:00,ujjawal,36072-2,MT,343,0,5
2016-02-06,06:20:00,ujjawal,36072-2,MT,370,0,2
2016-02-06,06:30:00,ujjawal,36072-2,MT,343,0,9
2016-02-06,06:40:00,ujjawal,36072-2,MT,315,0,8
2016-02-06,06:50:00,ujjawal,36072-2,MT,458,0,3
2016-02-06,07:00:00,ujjawal,36072-2,MT,756,1,3
2016-02-06,07:10:00,ujjawal,36072-2,MT,913,1,3
2016-02-06,07:20:00,ujjawal,36072-2,MT,522,0,3
2016-02-06,07:30:00,ujjawal,36072-2,MT,350,0,7
2016-02-06,07:40:00,ujjawal,36072-2,MT,328,0,6
2016-02-06,07:50:00,ujjawal,36072-2,MT,775,1,3
2016-02-06,08:00:00,ujjawal,36072-2,MT,310,0,9
2016-02-06,08:10:00,ujjawal,36072-2,MT,308,0,6
2016-02-06,08:20:00,ujjawal,36072-2,MT,738,1,3
2016-02-06,08:30:00,ujjawal,36072-2,MT,294,0,6
2016-02-06,08:40:00,ujjawal,36072-2,MT,345,0,1
2016-02-06,08:50:00,ujjawal,36072-2,MT,367,0,6
2016-02-06,09:00:00,ujjawal,36072-2,MT,480,0,3
2016-02-06,09:10:00,ujjawal,36072-2,MT,390,0,3
2016-02-06,09:20:00,ujjawal,36072-2,MT,436,0,3
2016-02-06,09:30:00,ujjawal,36072-2,MT,1404,2,3
2016-02-06,09:40:00,ujjawal,36072-2,MT,346,0,3
2016-02-06,09:50:00,ujjawal,36072-2,MT,388,0,3
2016-02-06,10:00:00,ujjawal,36072-2,MT,456,0,2
2016-02-06,10:10:00,ujjawal,36072-2,MT,273,0,7
2016-02-06,10:20:00,ujjawal,36072-2,MT,310,0,3
2016-02-06,10:30:00,ujjawal,36072-2,MT,256,0,7
2016-02-06,10:40:00,ujjawal,36072-2,MT,283,0,3
2016-02-06,10:50:00,ujjawal,36072-2,MT,276,0,3
2016-02-06,11:00:00,ujjawal,36072-2,MT,305,0,1
2016-02-06,11:10:00,ujjawal,36072-2,MT,310,0,9
2016-02-06,11:20:00,ujjawal,36072-2,MT,286,0,3
2016-02-06,11:30:00,ujjawal,36072-2,MT,286,0,3
2016-02-06,11:40:00,ujjawal,36072-2,MT,247,0,7
2016-02-06,11:50:00,ujjawal,36072-2,MT,366,0,2
2016-02-06,12:00:00,ujjawal,36072-2,MT,294,0,2
2016-02-06,12:10:00,ujjawal,36072-2,MT,216,0,5
2016-02-06,12:20:00,ujjawal,36072-2,MT,233,0,1
2016-02-06,12:30:00,ujjawal,36072-2,MT,785,1,2
2016-02-06,12:40:00,ujjawal,36072-2,MT,466,0,1
2016-02-06,12:50:00,ujjawal,36072-2,MT,219,0,9
2016-02-06,13:00:00,ujjawal,36072-2,MT,248,0,6
2016-02-06,13:10:00,ujjawal,36072-2,MT,223,0,7
2016-02-06,13:20:00,ujjawal,36072-2,MT,276,0,8
2016-02-06,13:30:00,ujjawal,36072-2,MT,219,0,6
2016-02-06,13:40:00,ujjawal,36072-2,MT,699,1,2
2016-02-06,13:50:00,ujjawal,36072-2,MT,439,0,2
2016-02-06,14:00:00,ujjawal,36072-2,MT,1752,2,3
2016-02-06,14:10:00,ujjawal,36072-2,MT,203,0,5
2016-02-06,14:20:00,ujjawal,36072-2,MT,230,0,7
2016-02-06,14:30:00,ujjawal,36072-2,MT,226,0,1
2016-02-06,14:40:00,ujjawal,36072-2,MT,195,0,6
2016-02-06,14:50:00,ujjawal,36072-2,MT,314,0,1
2016-02-06,15:00:00,ujjawal,36072-2,MT,357,0,2
2016-02-06,15:10:00,ujjawal,36072-2,MT,387,0,9
2016-02-06,15:20:00,ujjawal,36072-2,MT,1084,1,3
2016-02-06,15:30:00,ujjawal,36072-2,MT,1295,2,3
2016-02-06,15:40:00,ujjawal,36072-2,MT,223,0,8
2016-02-06,15:50:00,ujjawal,36072-2,MT,254,0,1
2016-02-06,16:00:00,ujjawal,36072-2,MT,252,0,7
2016-02-06,16:10:00,ujjawal,36072-2,MT,268,0,1
2016-02-06,16:20:00,ujjawal,36072-2,MT,242,0,1
2016-02-06,16:30:00,ujjawal,36072-2,MT,254,0,9
2016-02-06,16:40:00,ujjawal,36072-2,MT,271,0,3
2016-02-06,16:50:00,ujjawal,36072-2,MT,244,0,7
2016-02-06,17:00:00,ujjawal,36072-2,MT,281,0,1
2016-02-06,17:10:00,ujjawal,36072-2,MT,190,0,8
2016-02-06,17:20:00,ujjawal,36072-2,MT,187,0,1
2016-02-06,17:30:00,ujjawal,36072-2,MT,173,0,9
2016-02-06,17:40:00,ujjawal,36072-2,MT,140,0,5
2016-02-06,17:50:00,ujjawal,36072-2,MT,147,0,6
2016-02-06,18:00:00,ujjawal,36072-2,MT,109,0,4
2016-02-06,18:10:00,ujjawal,36072-2,MT,99,0,1
2016-02-06,18:20:00,ujjawal,36072-2,MT,66,0,6
2016-02-06,18:30:00,ujjawal,36072-2,MT,67,0,4
2016-02-06,18:40:00,ujjawal,36072-2,MT,40,0,2
2016-02-06,18:50:00,ujjawal,36072-2,MT,52,0,3
2016-02-06,19:00:00,ujjawal,36072-2,MT,40,0,3
2016-02-06,19:10:00,ujjawal,36072-2,MT,30,0,2
2016-02-06,19:20:00,ujjawal,36072-2,MT,25,0,3
2016-02-06,19:30:00,ujjawal,36072-2,MT,35,0,4
2016-02-06,19:40:00,ujjawal,36072-2,MT,14,0,1
2016-02-06,19:50:00,ujjawal,36072-2,MT,97,0,7
2016-02-06,20:00:00,ujjawal,36072-2,MT,14,0,1
2016-02-06,20:10:00,ujjawal,36072-2,MT,12,0,4
2016-02-06,20:20:00,ujjawal,36072-2,MT,11,0,2
2016-02-06,20:30:00,ujjawal,36072-2,MT,12,0,1
2016-02-06,20:40:00,ujjawal,36072-2,MT,6,0,1
2016-02-06,20:50:00,ujjawal,36072-2,MT,13,0,2
2016-02-06,21:00:00,ujjawal,36072-2,MT,5,0,1
2016-02-06,21:10:00,ujjawal,36072-2,MT,12,0,2
2016-02-06,21:20:00,ujjawal,36072-2,MT,1,0,1
2016-02-06,21:30:00,ujjawal,36072-2,MT,21,0,2
2016-02-06,21:50:00,ujjawal,36072-2,MT,9,0,3
2016-02-06,22:00:00,ujjawal,36072-2,MT,2,0,1
2016-02-06,22:10:00,ujjawal,36072-2,MT,12,0,5
2016-02-06,22:20:00,ujjawal,36072-2,MT,1,0,1
2016-02-06,22:30:00,ujjawal,36072-2,MT,9,0,1
2016-02-06,22:40:00,ujjawal,36072-2,MT,13,0,1
2016-02-06,23:00:00,ujjawal,36072-2,MT,20,0,2
2016-02-06,23:10:00,ujjawal,36072-2,MT,10,0,3
2016-02-06,23:20:00,ujjawal,36072-2,MT,10,0,1
2016-02-06,23:30:00,ujjawal,36072-2,MT,6,0,1
2016-02-06,23:40:00,ujjawal,36072-2,MT,12,0,1
As you can see in the sample above, with a 10-minute interval there should be 144 rows in total for 24 hours, but some are missing: after the line with the time 2016-02-06,23:40:00, the row for 2016-02-06,23:50:00 is missing.
Similarly, after 2016-02-06,22:40:00 the row for 2016-02-06,22:50:00 is missing.
Can we insert the missing date and time followed by six nulls separated by commas, e.g. 2016-02-06,22:50:00,null,null,null,null,null,null, wherever a row is missing, by comparing the timestamps in the file against the full sequence from 2016-02-06,00:00:00 to 2016-02-06,23:50:00 (144 rows)?
Here is what I have tried:
I created a file, 2.csv, with one row for every expected date and time, and used the command below.
join -j 2 -o 1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,2.1,2.1,2.2 <(sort -k2 1.csv) <(sort -k2 2.csv)|grep "2016-02-06,21:30:00"| sort -u|sed "s/\t//g"> 3.txt
Part of the output is repetitive, like this:
2016-02-06,21:30:00
2016-02-06,21:30:00
2016-02-06,00:00:00,ujjawal,36072-2,MT,37,0,1
2016-02-06,21:30:00
2016-02-06,21:30:00
2016-02-06,00:10:00,ujjawal,36072-2,MT,37,0,1
2016-02-06,21:30:00
2016-02-06,21:30:00
2016-02-06,00:20:00,ujjawal,36072-2,MT,37,0,1
2016-02-06,21:30:00
2016-02-06,21:30:00
2016-02-06,00:40:00,ujjawal,36072-2,MT,37,0,1
2016-02-06,21:30:00
2016-02-06,21:30:00
2016-02-06,00:50:00,ujjawal,36072-2,MT,42,0,2
2016-02-06,21:30:00
any suggestions ?
I'd actually not cross reference a new csv file, and instead do it like this:
#!/usr/bin/env perl
use strict;
use warnings;
use Time::Piece;
my $last_timestamp;
my $interval = 600;
#read stdin line by line
while ( <> ) {
#extract date and time from this line.
my ( $date, $time, @fields ) = split /,/;
#parse the timestamp
my $timestamp = Time::Piece -> strptime ( $date . $time, "%Y-%m-%d%H:%M:%S" );
#set last if undefined.
$last_timestamp //= $timestamp;
#if there's a gap... :
#print "GAP detected at $timestamp: ",$timestamp - $last_timestamp,"\n";
#keep printing filler lines until the gap is closed
while ( $last_timestamp + $interval < $timestamp ) {
$last_timestamp += $interval;
print join ( ",", $last_timestamp -> strftime("%Y-%m-%d,%H:%M:%S"), ("null")x6),"\n";
}
$last_timestamp = $timestamp;
print;
}
Which for your sample gives me lines (snipped for brevity):
2016-02-06,22:40:00,ujjawal,36072-2,MT,13,0,1
2016-02-06,22:50:00,null,null,null,null,null,null
2016-02-06,23:00:00,ujjawal,36072-2,MT,20,0,2
Note - this is assuming the timestamps are exactly 600s apart. You can adjust the logic a little if that isn't a valid assumption, but it depends exactly what you're trying to get at that point.
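If the timestamps are not exactly on the 10-minute grid, one possible adjustment is to round each one down to the start of its slot before comparing. This is only a sketch of that idea; the helper name `slot_start` is made up, and it assumes the same date,time layout as the sample data.

```perl
use strict;
use warnings;
use Time::Piece;

my $interval = 600;   # seconds per slot, as in the script above

# Hypothetical helper: round a "date,time" stamp down to the start of
# its 10-minute slot and return it in the same format.
sub slot_start {
    my ($stamp) = @_;
    my $t = Time::Piece->strptime($stamp, '%Y-%m-%d,%H:%M:%S');
    $t -= $t->epoch % $interval;          # drop the seconds past the slot boundary
    return $t->strftime('%Y-%m-%d,%H:%M:%S');
}

print slot_start('2016-02-06,22:53:17'), "\n";   # prints 2016-02-06,22:50:00
```

Whether rounding down is right depends on what the data means; rounding to the nearest slot would need half an interval added before the modulo.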
Here's another Perl solution.
It initialises $date to the date contained in the first line of the file, with a time of 00:00:00.
It then fills the %values hash with records, using the value of $date as a key and incrementing it by ten minutes until the day of the month changes. These form the "default" values.
Then the contents of the file are used to overwrite all elements of %values for which we have an actual value. Any gaps remain set to their default from the previous step.
Finally, the hash is printed in sorted order, resulting in a full set of data with defaults inserted as necessary.
use strict;
use warnings 'all';
use Time::Piece;
use Time::Seconds 'ONE_MINUTE';
use Fcntl ':seek';
my $delta = 10 * ONE_MINUTE;
my $date = Time::Piece->strptime(<ARGV> =~ /^([\d-]+)/, '%Y-%m-%d');
my %values;
for ( my $day = $date->mday; $date->mday == $day; $date += $delta ) {
my $ds = $date->strftime('%Y-%m-%d,%H:%M:%S');
$values{$ds} = $ds. ',null' x 6 . "\n";
}
seek ARGV, 0, SEEK_SET;
while ( <ARGV> ) {
my ($ds) = /^([\d-]+,[\d:]+)/;
$values{$ds} = $_;
}
print $values{$_} for sort keys %values;
here is the answer..
cat 1.csv 2.csv|sort -u -t, -k2,2
...or a shell script:
#! /bin/bash
set -e
file=$1
today=$(head -1 $file | cut -d, -f1)
line=0
for (( h = 0 ; h < 24 ; h++ ))
do
for (( m = 0 ; m < 60 ; m += 10 ))
do
stamp=$(printf "%02d:%02d:00" $h $m)
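if [ $line -eq 0 ]; then IFS=',' read -r date time data || true; fi # "|| true": at EOF read fails, and set -e would otherwise abort before the last rows are filled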
if [ "$time" = "$stamp" ]; then
echo $date,$time,$data
line=0
else
echo $today,$stamp,null,null,null,null,null,null
line=1
fi
done
done <$file
I would write it like this in Perl.
This program expects the name of the input file as a parameter on the command line, and prints its output to STDOUT, which may be redirected as normal.
use strict;
use warnings 'all';
use feature 'say';
use Time::Piece;
use Time::Seconds 'ONE_MINUTE';
my $format = '%Y-%m-%d,%H:%M:%S';
my $delta = 10 * ONE_MINUTE;
my $next;
@ARGV = 'mydates.txt' unless @ARGV; # default input file when none is given on the command line
while ( <> ) {
my $new = Time::Piece->strptime(/^([\d-]+,[\d:]+)/, $format);
while ( $next and $next < $new ) {
say $next->strftime($format) . ',null' x 6;
$next += $delta;
}
print;
$next = $new + $delta;
}
while ( $next and $next->hms('') > 0 ) {
say $next->strftime($format) . ',null' x 6;
$next += $delta;
}
output
2016-02-06,00:00:00,ujjawal,36072-2,MT,37,0,1
2016-02-06,00:10:00,ujjawal,36072-2,MT,37,0,1
2016-02-06,00:20:00,ujjawal,36072-2,MT,37,0,1
2016-02-06,00:30:00,null,null,null,null,null,null
2016-02-06,00:40:00,ujjawal,36072-2,MT,37,0,1
2016-02-06,00:50:00,ujjawal,36072-2,MT,42,0,2
2016-02-06,01:00:00,ujjawal,36072-2,MT,55,0,2
2016-02-06,01:10:00,ujjawal,36072-2,MT,41,0,2
2016-02-06,01:20:00,ujjawal,36072-2,MT,46,0,2
2016-02-06,01:30:00,ujjawal,36072-2,MT,56,0,3
2016-02-06,01:40:00,ujjawal,36072-2,MT,38,0,2
2016-02-06,01:50:00,ujjawal,36072-2,MT,49,0,2
2016-02-06,02:00:00,ujjawal,36072-2,MT,58,0,4
2016-02-06,02:10:00,ujjawal,36072-2,MT,43,0,2
2016-02-06,02:20:00,ujjawal,36072-2,MT,46,0,2
2016-02-06,02:30:00,ujjawal,36072-2,MT,61,0,2
2016-02-06,02:40:00,ujjawal,36072-2,MT,57,0,3
2016-02-06,02:50:00,ujjawal,36072-2,MT,45,0,2
2016-02-06,03:00:00,ujjawal,36072-2,MT,45,0,3
2016-02-06,03:10:00,ujjawal,36072-2,MT,51,0,2
2016-02-06,03:20:00,ujjawal,36072-2,MT,68,0,3
2016-02-06,03:30:00,ujjawal,36072-2,MT,51,0,2
2016-02-06,03:40:00,ujjawal,36072-2,MT,68,0,3
2016-02-06,03:50:00,ujjawal,36072-2,MT,67,0,3
2016-02-06,04:00:00,ujjawal,36072-2,MT,82,0,8
2016-02-06,04:10:00,ujjawal,36072-2,MT,82,0,5
2016-02-06,04:20:00,ujjawal,36072-2,MT,122,0,4
2016-02-06,04:30:00,ujjawal,36072-2,MT,133,0,3
2016-02-06,04:40:00,ujjawal,36072-2,MT,142,0,3
2016-02-06,04:50:00,ujjawal,36072-2,MT,202,0,1
2016-02-06,05:00:00,ujjawal,36072-2,MT,731,1,3
2016-02-06,05:10:00,ujjawal,36072-2,MT,372,0,7
2016-02-06,05:20:00,ujjawal,36072-2,MT,303,0,2
2016-02-06,05:30:00,ujjawal,36072-2,MT,389,0,3
2016-02-06,05:40:00,ujjawal,36072-2,MT,454,0,1
2016-02-06,05:50:00,ujjawal,36072-2,MT,406,0,6
2016-02-06,06:00:00,ujjawal,36072-2,MT,377,0,1
2016-02-06,06:10:00,ujjawal,36072-2,MT,343,0,5
2016-02-06,06:20:00,ujjawal,36072-2,MT,370,0,2
2016-02-06,06:30:00,ujjawal,36072-2,MT,343,0,9
2016-02-06,06:40:00,ujjawal,36072-2,MT,315,0,8
2016-02-06,06:50:00,ujjawal,36072-2,MT,458,0,3
2016-02-06,07:00:00,ujjawal,36072-2,MT,756,1,3
2016-02-06,07:10:00,ujjawal,36072-2,MT,913,1,3
2016-02-06,07:20:00,ujjawal,36072-2,MT,522,0,3
2016-02-06,07:30:00,ujjawal,36072-2,MT,350,0,7
2016-02-06,07:40:00,ujjawal,36072-2,MT,328,0,6
2016-02-06,07:50:00,ujjawal,36072-2,MT,775,1,3
2016-02-06,08:00:00,ujjawal,36072-2,MT,310,0,9
2016-02-06,08:10:00,ujjawal,36072-2,MT,308,0,6
2016-02-06,08:20:00,ujjawal,36072-2,MT,738,1,3
2016-02-06,08:30:00,ujjawal,36072-2,MT,294,0,6
2016-02-06,08:40:00,ujjawal,36072-2,MT,345,0,1
2016-02-06,08:50:00,ujjawal,36072-2,MT,367,0,6
2016-02-06,09:00:00,ujjawal,36072-2,MT,480,0,3
2016-02-06,09:10:00,ujjawal,36072-2,MT,390,0,3
2016-02-06,09:20:00,ujjawal,36072-2,MT,436,0,3
2016-02-06,09:30:00,ujjawal,36072-2,MT,1404,2,3
2016-02-06,09:40:00,ujjawal,36072-2,MT,346,0,3
2016-02-06,09:50:00,ujjawal,36072-2,MT,388,0,3
2016-02-06,10:00:00,ujjawal,36072-2,MT,456,0,2
2016-02-06,10:10:00,ujjawal,36072-2,MT,273,0,7
2016-02-06,10:20:00,ujjawal,36072-2,MT,310,0,3
2016-02-06,10:30:00,ujjawal,36072-2,MT,256,0,7
2016-02-06,10:40:00,ujjawal,36072-2,MT,283,0,3
2016-02-06,10:50:00,ujjawal,36072-2,MT,276,0,3
2016-02-06,11:00:00,ujjawal,36072-2,MT,305,0,1
2016-02-06,11:10:00,ujjawal,36072-2,MT,310,0,9
2016-02-06,11:20:00,ujjawal,36072-2,MT,286,0,3
2016-02-06,11:30:00,ujjawal,36072-2,MT,286,0,3
2016-02-06,11:40:00,ujjawal,36072-2,MT,247,0,7
2016-02-06,11:50:00,ujjawal,36072-2,MT,366,0,2
2016-02-06,12:00:00,ujjawal,36072-2,MT,294,0,2
2016-02-06,12:10:00,ujjawal,36072-2,MT,216,0,5
2016-02-06,12:20:00,ujjawal,36072-2,MT,233,0,1
2016-02-06,12:30:00,ujjawal,36072-2,MT,785,1,2
2016-02-06,12:40:00,ujjawal,36072-2,MT,466,0,1
2016-02-06,12:50:00,ujjawal,36072-2,MT,219,0,9
2016-02-06,13:00:00,ujjawal,36072-2,MT,248,0,6
2016-02-06,13:10:00,ujjawal,36072-2,MT,223,0,7
2016-02-06,13:20:00,ujjawal,36072-2,MT,276,0,8
2016-02-06,13:30:00,ujjawal,36072-2,MT,219,0,6
2016-02-06,13:40:00,ujjawal,36072-2,MT,699,1,2
2016-02-06,13:50:00,ujjawal,36072-2,MT,439,0,2
2016-02-06,14:00:00,ujjawal,36072-2,MT,1752,2,3
2016-02-06,14:10:00,ujjawal,36072-2,MT,203,0,5
2016-02-06,14:20:00,ujjawal,36072-2,MT,230,0,7
2016-02-06,14:30:00,ujjawal,36072-2,MT,226,0,1
2016-02-06,14:40:00,ujjawal,36072-2,MT,195,0,6
2016-02-06,14:50:00,ujjawal,36072-2,MT,314,0,1
2016-02-06,15:00:00,ujjawal,36072-2,MT,357,0,2
2016-02-06,15:10:00,ujjawal,36072-2,MT,387,0,9
2016-02-06,15:20:00,ujjawal,36072-2,MT,1084,1,3
2016-02-06,15:30:00,ujjawal,36072-2,MT,1295,2,3
2016-02-06,15:40:00,ujjawal,36072-2,MT,223,0,8
2016-02-06,15:50:00,ujjawal,36072-2,MT,254,0,1
2016-02-06,16:00:00,ujjawal,36072-2,MT,252,0,7
2016-02-06,16:10:00,ujjawal,36072-2,MT,268,0,1
2016-02-06,16:20:00,ujjawal,36072-2,MT,242,0,1
2016-02-06,16:30:00,ujjawal,36072-2,MT,254,0,9
2016-02-06,16:40:00,ujjawal,36072-2,MT,271,0,3
2016-02-06,16:50:00,ujjawal,36072-2,MT,244,0,7
2016-02-06,17:00:00,ujjawal,36072-2,MT,281,0,1
2016-02-06,17:10:00,ujjawal,36072-2,MT,190,0,8
2016-02-06,17:20:00,ujjawal,36072-2,MT,187,0,1
2016-02-06,17:30:00,ujjawal,36072-2,MT,173,0,9
2016-02-06,17:40:00,ujjawal,36072-2,MT,140,0,5
2016-02-06,17:50:00,ujjawal,36072-2,MT,147,0,6
2016-02-06,18:00:00,ujjawal,36072-2,MT,109,0,4
2016-02-06,18:10:00,ujjawal,36072-2,MT,99,0,1
2016-02-06,18:20:00,ujjawal,36072-2,MT,66,0,6
2016-02-06,18:30:00,ujjawal,36072-2,MT,67,0,4
2016-02-06,18:40:00,ujjawal,36072-2,MT,40,0,2
2016-02-06,18:50:00,ujjawal,36072-2,MT,52,0,3
2016-02-06,19:00:00,ujjawal,36072-2,MT,40,0,3
2016-02-06,19:10:00,ujjawal,36072-2,MT,30,0,2
2016-02-06,19:20:00,ujjawal,36072-2,MT,25,0,3
2016-02-06,19:30:00,ujjawal,36072-2,MT,35,0,4
2016-02-06,19:40:00,ujjawal,36072-2,MT,14,0,1
2016-02-06,19:50:00,ujjawal,36072-2,MT,97,0,7
2016-02-06,20:00:00,ujjawal,36072-2,MT,14,0,1
2016-02-06,20:10:00,ujjawal,36072-2,MT,12,0,4
2016-02-06,20:20:00,ujjawal,36072-2,MT,11,0,2
2016-02-06,20:30:00,ujjawal,36072-2,MT,12,0,1
2016-02-06,20:40:00,ujjawal,36072-2,MT,6,0,1
2016-02-06,20:50:00,ujjawal,36072-2,MT,13,0,2
2016-02-06,21:00:00,ujjawal,36072-2,MT,5,0,1
2016-02-06,21:10:00,ujjawal,36072-2,MT,12,0,2
2016-02-06,21:20:00,ujjawal,36072-2,MT,1,0,1
2016-02-06,21:30:00,ujjawal,36072-2,MT,21,0,2
2016-02-06,21:40:00,null,null,null,null,null,null
2016-02-06,21:50:00,ujjawal,36072-2,MT,9,0,3
2016-02-06,22:00:00,ujjawal,36072-2,MT,2,0,1
2016-02-06,22:10:00,ujjawal,36072-2,MT,12,0,5
2016-02-06,22:20:00,ujjawal,36072-2,MT,1,0,1
2016-02-06,22:30:00,ujjawal,36072-2,MT,9,0,1
2016-02-06,22:40:00,ujjawal,36072-2,MT,13,0,1
2016-02-06,22:50:00,null,null,null,null,null,null
2016-02-06,23:00:00,ujjawal,36072-2,MT,20,0,2
2016-02-06,23:10:00,ujjawal,36072-2,MT,10,0,3
2016-02-06,23:20:00,ujjawal,36072-2,MT,10,0,1
2016-02-06,23:30:00,ujjawal,36072-2,MT,6,0,1
2016-02-06,23:40:00,ujjawal,36072-2,MT,12,0,1
2016-02-06,23:50:00,null,null,null,null,null,null

perl + numeration word or parameter in file

I need help with numbering text in a file.
I have a Linux machine and I need to write the script in Perl.
I have a file named file_db.txt.
The file contains parameters such as name, ParameterFromBook, NumberPage, BOOK_From_library, price, etc.
Each parameter is set to some value, as in name=elephant.
My question is how to do the following in Perl:
I want to append a number to each parameter name (the part before the "="). The number starts at 1 for the first occurrence of a given parameter and increases by 1 at each further occurrence of that same parameter, until EOF.
lidia
For example
file_db.txt before numbering
parameter=1
name=one
parameter=2
name=two
file_db.txt after parameters numbering
parameter1=1
name1=one
parameter2=2
name2=two
other examples
Example1 before
name=elephant
ParameterFromBook=234
name=star.world
ParameterFromBook=200
name=home_room1
ParameterFromBook=264
Example1 after parameters numbering
name1=elephant
ParameterFromBook1=234
name2=star.world
ParameterFromBook2=200
name3=home_room1
ParameterFromBook3=264
Example2 before
file_db.txt before numbering
lines_and_words=1
list_of_books=3442
lines_and_words=13
list_of_books=344224
lines_and_words=120
list_of_books=341
Example2 after
file_db.txt after parameters numbering
lines_and_words1=1
list_of_books1=3442
lines_and_words2=13
list_of_books2=344224
lines_and_words3=120
list_of_books3=341
It can be condensed to a one-line Perl script pretty easily, though I don't particularly recommend it if you want readability:
#!/usr/bin/perl
# %k counts how often each key has been seen; [^=]* stops at the first "="
s/^([^=]*)=/$k{$1}++;"$1$k{$1}="/e and print while <>;
This version reads from a file named in the script, rather than from the files given on the command line (or standard input):
#!/usr/bin/perl
open IN, "<", "/tmp/file" or die "Cannot open /tmp/file: $!";
s/^([^=]*)=/$k{$1}++;"$1$k{$1}="/e and print while <IN>;
The way I look at it, you probably want to number blocks and not just occurrences. So you probably want the number on each of the keys to be at least as great as the earliest repeating key.
my $in = \*::DATA;
my $out = \*::STDOUT;
my %occur;
my $num = 0;
while ( <$in> ) {
if ( my ( $pre, $key, $data ) = m/^(\s*)(\w+)=(.*)/ ) {
$num++ if $num < ++$occur{$key};
print { $out } "$pre$key$num=$data\n";
}
else {
$num++;
print;
}
}
__DATA__
name=elephant
ParameterFromBook=234
name=star.world
ParameterFromBook=200
name=home_room1
ParameterFromBook=264
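However, if you just want to give each key its own occurrence count, this is enough ($in and $out are the handles from the snippet above):
my %occur;
while ( <$in> ) {
    # pass through any line that is not of the form key=value
    my ( $pre, $key, $data ) = m/^(\s*)(\w+)=(.*)/
        or do { print { $out } $_; next };
    $occur{$key}++;
    print { $out } "$pre$key$occur{$key}=$data\n";
}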
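The two numbering schemes above only differ when some key is missing from a block. A minimal sketch that makes the difference visible, assuming the input is already in memory as a list of lines; the sub names number_by_occurrence and number_by_block are made up for this illustration, and unlike the answer's first snippet the block version here does not bump the counter on lines that are not key=value:

```perl
use strict;
use warnings;

# Number each key by how many times that particular key has been seen so far.
sub number_by_occurrence {
    my @lines = @_;
    my ( %occur, @out );
    for my $line (@lines) {
        if ( my ( $pre, $key, $data ) = $line =~ m/^(\s*)(\w+)=(.*)/ ) {
            $occur{$key}++;
            push @out, "$pre$key$occur{$key}=$data";
        }
        else {
            push @out, $line;    # not key=value: pass through unchanged
        }
    }
    return @out;
}

# Number each key by the block it falls in: the counter only moves forward
# when a key repeats more often than the current block number.
sub number_by_block {
    my @lines = @_;
    my ( %occur, @out );
    my $num = 0;
    for my $line (@lines) {
        if ( my ( $pre, $key, $data ) = $line =~ m/^(\s*)(\w+)=(.*)/ ) {
            $num++ if $num < ++$occur{$key};
            push @out, "$pre$key$num=$data";
        }
        else {
            push @out, $line;    # not key=value: pass through unchanged
        }
    }
    return @out;
}

# The first block has no "price" line, so the two schemes disagree on it.
my @input = ( 'name=a', 'name=b', 'price=5' );
print join( "\n", number_by_occurrence(@input) ), "\n";   # last line: price1=5
print join( "\n", number_by_block(@input) ), "\n";        # last line: price2=5
```

On the examples in the question, where every block contains every key, both subs should produce the same output.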
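Written out step by step:
open(my $fh, '<', 'file') or die "Cannot open file: $!";
my @lines = <$fh>;
close($fh);
my %tags;    # how many times each parameter name has been seen
foreach my $line (@lines)
{
    chomp $line;
    # split on the first "=" only, so the value may itself contain "="
    my ($name, $value) = split(/=/, $line, 2);
    next unless defined $value;    # skip lines without "="
    $tags{$name}++;
    print "$name$tags{$name}=$value\n";
}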
This looks like a CS101 assignment. Is it really good to ask for complete solutions instead of asking specific technical questions if you have difficulty?
If Perl is not a must, here's an awk version
$ cat file
name=elephant
ParameterFromBook=234
name=star.world
ParameterFromBook=200
name=home_room1
ParameterFromBook=264
$ awk -F"=" '{s[$1]++}{print $1s[$1],$2}' OFS="=" file
name1=elephant
ParameterFromBook1=234
name2=star.world
ParameterFromBook2=200
name3=home_room1
ParameterFromBook3=264

How can I match string order between two documents in Perl?

I have a problem writing a Perl program that matches the words in two documents. Let's say there are documents A and B.
I want to delete the words in document A that are not in document B.
Example 1:
A: I eat pizza
B: She go to the market and eat pizza
result: eat pizza
Example 2:
A: eat pizza
B: pizza eat
result: pizza
(the word order is relevant, so "eat" is deleted.)
I use Perl for the system, and the number of sentences in each document isn't large, so I don't think I need SQL.
The program is a subprogram of an automatic essay grading system for the Indonesian language (Bahasa).
Thanks,
Sorry if my question is a bit confusing. I'm really new to 'this world' :)
OK, I'm without access at the moment so this is not guaranteed to be 100% or even compile but should provide enough guidance:
Solution 1: (word order does not matter)
#!/usr/bin/perl -w
use strict;
use File::Slurp;
# "or" (not "||") keeps read_file in list context, so we get one element per line
my @B_lines = File::Slurp::read_file("B") or die "Error reading B: $!";
my %B_words = ();
foreach my $line (@B_lines) {
    map { $B_words{$_} = 1 } split(/\s+/, $line);
}
my @A_lines = File::Slurp::read_file("A") or die "Error reading A: $!";
my @new_lines = ();
foreach my $line (@A_lines) {
    # keep only the words of this line that also occur somewhere in B
    my @B_words_only = grep { $B_words{$_} } split(/\s+/, $line);
    push @new_lines, join(" ", @B_words_only) . "\n";
}
File::Slurp::write_file("A_new", @new_lines) || die "Error writing A_new: $!";
This should create a new file "A_new" that only contains A's words that are in B.
This has a slight bug: it replaces any run of multiple whitespace characters in file A with a single space, so
word1    word2  word3
will become
word1 word2 word3
It can be fixed, but doing so would be really annoying, so I didn't bother; it is only worth it if you absolutely require that whitespace be preserved 100% correctly.
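One possible way to keep the whitespace, as a rough sketch rather than a drop-in replacement for Solution 1: instead of splitting and re-joining, substitute each word in place and leave everything between words untouched. The helper name filter_keep_whitespace is made up for this illustration, and note that a removed word leaves the whitespace on both sides of it behind.

```perl
use strict;
use warnings;

# Hypothetical helper: blank out every word of $line that is not a key of
# %$keep, leaving all original whitespace exactly where it was.
sub filter_keep_whitespace {
    my ( $line, $keep ) = @_;
    # \S+ matches one word at a time; the whitespace between words is never touched
    $line =~ s/(\S+)/exists $keep->{$1} ? $1 : ''/ge;
    return $line;
}

# Build the lookup table from B's text, as in Solution 1.
my %B_words = map { $_ => 1 } split ' ', "She go to the market and eat pizza";
print filter_keep_whitespace( "I   eat  pizza\n", \%B_words );
```

In Solution 1 this would take the place of the grep/join pair inside the loop over A's lines.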
Solution 2: (word order matters BUT you can print words from file A out with no regard for preserving whitespace at all)
#!/usr/bin/perl -w
use strict;
use File::Slurp;
# scalar context (forced by "||") makes read_file return the whole file as one string
my @A_words = split(' ', File::Slurp::read_file("A") || die "Error reading A: $!");
my @B_words = split(' ', File::Slurp::read_file("B") || die "Error reading B: $!");
my $B_counter = 0;
foreach my $A_word (@A_words) {
    # look for this word in B, starting just after the previous match
    my $scan = $B_counter;
    ++$scan while $scan < scalar(@B_words) && $B_words[$scan] ne $A_word;
    next if $scan == scalar(@B_words);    # not found later in B: drop the word
    $B_counter = $scan + 1;               # later words must match further along in B
    print "$A_word ";
}
print "\n";
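A single forward scan through B, as in Solution 2, is greedy: an early match can use up a position late in B and hide words that would otherwise have matched. If the goal is to keep as many of A's words as possible in B's order, one standard technique is a longest common subsequence (LCS) over the two word lists. The sketch below is my reading of the requirement, not something the question states; the sub name lcs_words is made up, and when several answers of equal length exist it simply returns one of them.

```perl
use strict;
use warnings;

# Return the words of @$A_words that form a longest common subsequence
# with @$B_words (word order respected in both lists).
sub lcs_words {
    my ( $A_words, $B_words ) = @_;
    my ( $n, $m ) = ( scalar @$A_words, scalar @$B_words );

    # $len[$i][$j] = LCS length of A's words from $i on and B's words from $j on
    my @len = map { [ (0) x ( $m + 1 ) ] } 0 .. $n;
    for my $i ( reverse 0 .. $n - 1 ) {
        for my $j ( reverse 0 .. $m - 1 ) {
            if ( $A_words->[$i] eq $B_words->[$j] ) {
                $len[$i][$j] = $len[ $i + 1 ][ $j + 1 ] + 1;
            }
            else {
                my ( $skip_a, $skip_b ) = ( $len[ $i + 1 ][$j], $len[$i][ $j + 1 ] );
                $len[$i][$j] = $skip_a > $skip_b ? $skip_a : $skip_b;
            }
        }
    }

    # Walk the table from the start to recover one such subsequence.
    my ( $i, $j ) = ( 0, 0 );
    my @kept;
    while ( $i < $n && $j < $m ) {
        if ( $A_words->[$i] eq $B_words->[$j] ) {
            push @kept, $A_words->[$i];
            $i++;
            $j++;
        }
        elsif ( $len[ $i + 1 ][$j] >= $len[$i][ $j + 1 ] ) {
            $i++;    # dropping A's word loses nothing
        }
        else {
            $j++;    # skipping B's word loses nothing
        }
    }
    return @kept;
}

my @A = split ' ', "I eat pizza";
my @B = split ' ', "She go to the market and eat pizza";
print join( " ", lcs_words( \@A, \@B ) ), "\n";    # prints "eat pizza"
```

The table costs time and memory proportional to the product of the two word counts, which should be fine for essay-sized inputs.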
Solution 3 (why do we need Perl again? :) )
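You can do this trivially in shell without Perl (or via a system() call or backticks in a parent Perl script), provided A and B each hold one word per line and are sorted, since comm compares sorted lines: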
comm -12 A B | tr "\012" " "
To call this from Perl:
my $new_text = `comm -12 A B | tr "\012" " " `;
But see my last comment why this may be considered "bad Perl"... at least if you do this in a loop with very many files being iterated and care about performance.