perl sort question

I have some huge log files I need to sort. All entries contain a 32-bit hex number, which is the sort key I want to use.
Some entries are one-liners, like
bla bla bla 0x97860afa bla bla
Others are a bit more complex: they start with the same type of line as above and expand into a block of lines delimited by curly brackets, like the example below. In this case the entire block has to move to the position determined by the hex number. Block example:
bla bla bla 0x97860afc bla bla
bla bla {
blabla
bla bla {
bla
}
}
I can probably figure it out, but maybe there is a simple Perl or awk solution that will save me half a day.
Transferring comments from OP:
Indentation can be space or tab; I can adapt that in any proposed solution. I think Brian summarizes it well: specifically, you want to sort "items", where an item is a chunk of text that starts with a line containing a "0xNNNNNNNN" and contains everything up to (but not including) the next line containing a "0xNNNNNNNN" (where the N's change, of course). No lines are interspersed.
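The chunking described above can be sketched in a few lines (untested against the real logs; the sample data below is made up):

```perl
use strict;
use warnings;

# Sketch: group lines into records keyed by the most recent 0xNNNNNNNN
# line, then emit the records sorted numerically by that key.
my @records;    # each element: [ numeric key, text of record ]
while (my $line = <DATA>) {
    if ($line =~ /\b0x(\p{AHex}{8})\b/) {
        push @records, [ hex($1), $line ];   # a new record starts here
    } elsif (@records) {
        $records[-1][1] .= $line;            # continuation line
    }
}
print $_->[1] for sort { $a->[0] <=> $b->[0] } @records;

__DATA__
bla bla bla 0x97860afc bla bla
bla bla {
  blabla
}
bla bla bla 0x97860afa bla bla
```

For truly huge files this holds everything in memory, which is the trade-off the answers below discuss.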

Something like this might work (not tested):
my $line;
my $lastkey;
my %data;
while ($line = <>) {
    chomp $line;
    if ($line =~ /\b(0x\p{AHex}{8})\b/) {
        # Begin a new entry; the zero-padded line number keeps duplicate
        # hex keys unique and in input order
        my $unique_key = $1 . sprintf("%010d", $.); # cred to Brian Gerard for uniqueness
        $data{$unique_key} = $line;
        $lastkey = $unique_key;
    } else {
        # Continue an old entry
        $data{$lastkey} .= $line;
    }
}
# String sort works here: keys are fixed-width "0xNNNNNNNN" plus a padded line number
print $data{$_}, "\n" for (sort keys %data);
The problem is that you said "huge" log files, so storing the file in memory will probably be inefficient. However, if you want to sort it, I suspect you're going to need to do that.
If storing in memory is not an option, you can always just print the data to a file instead, with a format that will allow you to sort it by some other means.

For huge data files, I'd recommend Sort::External.
It doesn't look like you need to parse the brackets if the indentation does the job. Then you have to work on "breaks": when the indentation level returns to 0, you process the last record gathered, so you always look ahead one line.
So:
sub to_sort_form {
    my $buffer = $_[0];
    $_[0] = '';                                  # reset the caller's buffer
    my ( $id ) = $buffer =~ m/(0x\p{AHex}{8})/;  # grab the first candidate
    return "$id-:-$buffer";
}
sub to_source {
    my $val = shift;
    my ( $record ) = $val =~ m/-:-(.*)/;
    $record =~ s/\$--\^/\n/g;
    return $record;
}
my $sortex = Sort::External->new(
    mem_threshold => 1024**2 * 16,  # default: 1024**2 * 8 (8 MiB)
    cache_size    => 100_000,       # default: undef (disabled)
    sortsub       => sub { $Sort::External::a cmp $Sort::External::b },
    working_dir   => $temp_directory,  # default: see below
);
my $buffer = <>;
chomp $buffer;
$buffer .= '$--^';
while ( <> ) {
    my ( $indent ) = m/^(\s*)\S/;
    unless ( length $indent ) {
        $sortex->feed( to_sort_form( $buffer ));
    }
    chomp;
    $buffer .= $_ . '$--^';
}
$sortex->feed( to_sort_form( $buffer ));
$sortex->finish;
while ( defined( $_ = $sortex->fetch ) ) {
    print to_source( $_ );
}
Assumptions:
The string '$--^' does not appear in the data on its own.
You're not concerned about two 8-hex-digit strings appearing in one record.

If the files are not too big for memory, I would go with TLP's solution. If they are, you can modify it just a bit and print to a file as he suggests. Add this before the while loop (all untested, YMMV, caveat programmer, etc.):
my $currentInFile = "";
my $currentOutFileHandle = "";
And change the body of the while from the current if-else to
if ($currentInFile ne $ARGV) {
    if (fileno($currentOutFileHandle)) {
        if (!close($currentOutFileHandle)) {
            # whatever you want to do if you can't close the previous output file
        }
    }
    my $newOutFile = $ARGV . ".tagged";
    if (!open($currentOutFileHandle, ">", $newOutFile)) {
        # whatever you want to do if you can't open a new output file for writing
    }
    $currentInFile = $ARGV;
}
if (...conditional from TLP...) {
    # add more zeroes if the files really are that large :)
    $lastkey = $1 . " " . sprintf("%010d", $.);
}
if (fileno($currentOutFileHandle)) {
    print $currentOutFileHandle $lastkey . "\t" . $line . "\n";
}
else {
    # whatever you want to do if $currentOutFileHandle's gone screwy
}
Now you'll have a foo.log.tagged for each foo.log you fed it; the .tagged file contains exactly the contents of the original, but with "0xNNNNNNNN LLLLLLLLLL\t" (LLLLLLLLLL -> zero-padded line number) prepended to each line. sort(1) actually does a pretty good job of handling large data, though you'll want to look at the --temporary-directory argument if you think it will overflow /tmp with its temp files while chewing through the stuff you feed it. Something like this should get you started:
sort --output=/my/new/really.big.file --temporary-directory=/scratch/dir/on/roomy/partition *.tagged
Then trim away the tags if desired:
perl -pi -e 's/^[^\t]+\t//' /my/new/really.big.file
FWIW, I padded the line numbers to keep from having to worry about such things as line 10 sorting before line 2 if their hex keys were identical - since the hex numbers are the primary sort criterion, we can't just sort numerically.
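The effect of the zero-padding can be seen on a tiny made-up example: with identical hex keys, an unpadded line number 10 sorts before 2 in a plain string sort, while the padded form sorts correctly.

```perl
use strict;
use warnings;

# Two tagged records with the same hex key, differing only in line number.
my @unpadded = ("0x0000000a 10\tline ten",         "0x0000000a 2\tline two");
my @padded   = ("0x0000000a 0000000010\tline ten", "0x0000000a 0000000002\tline two");

# Plain string sort, as sort(1) would do in the C locale:
print +(sort @unpadded)[0], "\n";   # "10" wrongly sorts before "2"
print +(sort @padded)[0], "\n";     # padded "...0002" correctly comes first
```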

One way (untested)
perl -wne'BEGIN{ $key = " " x 10 }' \
-e '$key = $1 if /(0x[0-9a-f]{8})/;' \
-e 'printf "%s%.10d%s", $key, $., $_' \
inputfile \
| LC_ALL=C sort \
| perl -wpe'substr($_,0,20,"")'
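The final substr($_,0,20,"") relies on four-argument substr replacing part of the string in place: the tag is 10 characters of key plus 10 digits of line number. A quick illustration with made-up data:

```perl
use strict;
use warnings;

# Four-argument substr excises the fixed-width sort tag in place.
my $tagged = "0x97860afa0000000042bla bla bla 0x97860afa bla bla\n";
substr($tagged, 0, 20, "");   # remove 10-char key + 10-digit line number
print $tagged;
```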

TLP's solution worked nicely with some minor tweaks. Joining each entry into one line before sorting was a good idea; next I have to add post-parsing to restore the code blocks that got collapsed, but that is easy. Below is the final, tested version. Thank you all, Stack Overflow is awesome.
#!/usr/bin/perl -w
my $line;
my $lastkey;
my %data;
while ($line = <>) {
    chomp $line;
    if ($line =~ /\b(0x\p{AHex}{8})\b/) {
        # Begin a new entry
        #my $unique_key = $1 . $.; # cred to Brian Gerard for uniqueness
        my $unique_key = hex($1);
        $data{$unique_key} = $line;
        $lastkey = $unique_key;
    } else {
        # Continue an old entry
        $data{$lastkey} .= $line;
    }
}
print $data{$_}, "\n" for (sort { $a <=> $b } keys %data);

Related

Read a file into two hashes in order to retain the order

I am trying to read a file with user information categorized under a location. I want to fill in some of the fields using user input and output the file while keeping the fields under each location intact. For example - file:
[California]
$;FIrst_Name =
$;Last_Name=
$;Age =
[NewYork]
$;FIrst_Name =
$;Last_Name=
$;Age =
[Washington]
$;FIrst_Name =
$;Last_Name=
$;Age =
Once the user provides input from the command line, it should look like:
[California]
$;FIrst_Name = Jack
$;Last_Name= Daner
$;Age = 27
[NewYork]
$;FIrst_Name = Jill
$;Last_Name= XYZ
$;Age = 30
[Washington]
$;FIrst_Name = Kim
$;Last_Name= ABC
$;Age = 25
The order of First_Name, Last_Name and Age within each location can change, and even the order of locations can change, but each location section should remain separate and intact. I wrote the following code; part of it works for reading the whole file into one hash, but I am not able to preserve each location section within it. I tried using two hashes - can someone please help, as it is getting really complex for me? Thanks a lot. (I had another issue with a similar file as well, but unfortunately could not resolve it either.)
EDITED code
Open the file
use strict;
use warnings;
use Getopt::Long;
sub read_config {
my $phCmdLineOption = shift;
my $phConfig = shift;
my $sInputfile = shift;
open($input.file, "<$InputFile") or die "Error! Cannot open $InputFile for reading: $!";
while (<$input.file>) {
$_ =~ s/\s+$//;
next if ($_ =~ /^#/);
next if ($_ =~ /^$/);
if ($_ =~ m/^\[(\S+)\]$/) {
$sComponent = $1;
next;
}
elsif ($_ =~ m/^;;\s*(.*)/) {
$sDesc .= "$1.";
next;
}
elsif ($_ =~ m/\$;(\S+)\$;\s*=\s*(.*)/) {
$sParam = $1;
$sValue = $2;
if ((defined $sValue) && ($sValue !~ m/^\s*$/)) {
$phfield->{$sCategory}{$sParam} = ["$sValue", "$sDesc"];
}
else {
$field->{$sCategory}{$sParam} = [undef, "$sDesc"];
}
}
$sParam = $sValue = $sDesc = "";
next;
}
}
Write the new file -
sub write_config {
my $phCmdLineOption = shift;
my $phConfig = shift;
my $sOut = shift;
open(outfile, ">$sOut") or die " $!";
foreach $sCategory (sort {$a cmp $b} keys %{$fields}) {
print $outfile "[$sCategory]\n";
foreach $sParam (sort {$a cmp $b} keys %{$fields-{$sCategory}}) {
$sDesc = ((defined $phConfig->{$sCategory}{$sParam}[1]) $fields->{$sCategory}{$sParam}[1] : "");
print $outfile ";;$sDesc\n" if ((defined $sDesc) && ($sDesc !~ m/^$/));
$sValue = ((defined $fields->{$sCategory}{$sParam}[0]) ? $fields->{$sCategory}{$sParam}[0] : undef);
print $outfile "$sValue" if (defined $sValue);
print $outfile "\n";
}
print $outfile "\n";
}
close($outfile);
return;
Note - I have posted this question on the PerlMonks forum as well. Thanks a lot!
I think you're getting lost in the detail and skipping over some basics, which is unnecessarily complicating the problem. Those basics are:
Indent your code properly (it's amazing the difference this makes)
Always use the /x modifier on regexes, with lots of whitespace, to increase readability
When using lots of regexes, use the quote-regex operator, qr//, to separate regex definition from regex use
Apart from that, you were headed in the right direction, but there are a couple of insights on the algorithm you were missing, which further increased the complexity.
Firstly, for small-time parsing of data, look out for the possibility that matching one type of line immediately disqualifies matching of other types of line. All the elsifs aren't necessary, since a line that matches a category is never going to match a Last_Name or Age, and vice versa.
Secondly, when you get a match, see if you can do what's needed immediately rather than storing the result of the match for processing later. In this case, instead of saving a "component" or "category" in a variable, put it immediately into the hash you're building.
Thirdly, if you're updating text files that are not huge, consider working on a new version of the file and then, at the end of the program, declaring the current version old and the new version current. This reduces the chances of unintentionally modifying something in place and allows comparison of the update with the original after execution - if necessary, "rollback" of the change is trivially easy, which one of your users may be very grateful for one day.
Fourthly, and most of all, you've only got a couple of attributes or components to worry about, so deal with them in the concrete rather than the abstract. You can see below that I've looped over qw( First_Name Last_Name Age ) rather than all keys of the hash. Obviously, if you have to deal with open-ended or unknown attributes you can't do it this way, but in this case, AFAICT, your fields are fixed.
Here's a version that basically works given the above mentioned constraints.
#!/usr/bin/env perl
use v5.12 ;
use Getopt::Long ;

my %db ;                          # DB hash
my $dbf = "data.txt" ;            # DB file name
my $dbu = "data.new" ;            # updated DB file name
my $dbo = "data.old" ;            # Old DB file name
my ($cat, $first, $last, $age) ;  # Default is undef

GetOptions( 'cat=s'   => \$cat ,
            'first=s' => \$first ,
            'last=s'  => \$last ,
            'age=i'   => \$age
          );
die "Category option (--cat=...) is compulsory\n" unless $cat ;

open my $dbh, '<', $dbf or die "$dbf: $!\n";  # DB Handle
open my $uph, '>', $dbu or die "$dbu: $!\n";  # UPdate Handle

# REs for blank line, category header and attribute specification
my $blank_re = qr/ ^ \s* $ /x ;
my $cat_re   = qr/ ^ \[ (\w+) \] \s* $ /x ;
my $attr_re  = qr/ ^ \$ ; (?<key>\w+) \s* = \s* (?<val>\N*) $ /x ;

while ( <$dbh> ) {
    next unless /$cat_re/ ;
    my %obj = ( cat => $1 ) ;
    while ( <$dbh> ) {
        $obj{ $+{key} } = $+{val} if /$attr_re/ ;
        last if /$blank_re/ ;
    }
    $db{ $obj{cat} } = \%obj ;
}

# Grab existing obj, otherwise presume we're adding a new one
my $obref = $db{ $cat } // { cat => $cat } ;
$obref->{ First_Name } = $first if defined $first ;
$obref->{ Last_Name }  = $last  if defined $last ;
$obref->{ Age }        = $age   if defined $age ;

# Update the DB with the modified/new one
$db{ $obref->{cat} } = $obref ;

for (sort keys %db) {
    my $obref = $db{ $_ } ;
    printf $uph "[%s]\n", $obref->{ cat } ;
    for (qw( First_Name Last_Name Age )) {
        printf $uph '$;' . "%s = %s\n", $_, $obref->{ $_ } ;
    }
    print $uph "\n" ;
}

close $dbh ;
close $uph ;
rename $dbf , $dbo ;
rename $dbu , $dbf ;
exit 0 ;
User input here needs to be organized, and for this we can use named options for each field, plus one for the state. The Getopt::Long feature of reading options into a hash is useful here. We also need to associate the names of these options with field names. With that in hand it is simple to process the file, since we have a ready mechanism to identify lines of interest.
By putting lines on an array reference we can keep their order as well; that arrayref is the value for the section key in the hash. The hash is not necessary, but adds flexibility for future development. While we are at it, we can also keep the order of sections by using a simple auxiliary array.
use warnings;
use strict;
use Getopt::Long;
use feature qw(say);

# Translate between user input and field name ($;) in file
my ($o1, $o2, $o3) = qw(first last age);
my @tags = ('FIrst_Name', 'Last_Name', 'Age');
my %desc = ($tags[0] => $o1, $tags[1] => $o2, $tags[2] => $o3);

my (%input, $state);
GetOptions(\%input, "$o1=s", "$o2=s", "$o3=i", 'state=s' => \$state);

my $locinfo = 'loc_info.txt';
open my $in_fh, '<', $locinfo;

my (%conf, @sec_order, $section, $field);
while (my $line = <$in_fh>)
{
    chomp($line);
    next if $line =~ m/^\s*$/;
    # New section ([]), for hash and order-array
    if ($line =~ m/^\s*\[(.*)\]/) {
        push @sec_order, $section = $1;
        next;
    }
    # If we are in a wrong state just copy the line
    if ($section ne $state) {
        push @{$conf{$section}}, $line . "\n";
        next;
    }
    if (($field) = $line =~ m/^\$;\s*(.*?)\s*=/ ) {
        if (exists $input{$desc{$field}}) {
            # Overwrite what is there or append
            $line =~ s|^\s*(.*?=\s*)(.*)|$1 $input{$desc{$field}}|;
        }
    }
    else { warn "Unexpected line: |$line| --" }
    push @{$conf{$section}}, $line . "\n";
}
close $in_fh;

for (@sec_order) { say "[$_]"; say @{$conf{$_}}; }
Invocation
script.pl -state STATE -first FIRST_NAME -last LAST_NAME -age INT
Any option may be left out, in which case that field is not touched. A field supplied on the command line will be overwritten if it already has something. (This can be changed easily.) This works for a single-state entry as it stands, but is simple to modify if needed.
This is a basic solution. The first improvement would be to read the field names from the file itself, instead of having them hard-coded. (This would avoid the need to spot the typo FIrst and the inconsistent spacing before =, for one thing.) But the more refinements are added, the more one gets into template development. At some point soon it will be a good idea to use a module.
Note: The regex delimiter above is different than elsewhere (|) to avoid the editor coloring everything red.

Perl - Reading .txt files line-by-line and using compare function (printing non-matches only once)

I am really struggling and have spent about two full days banging my head against this, getting the same result every time I run this Perl script.
I have a Perl script that connects to a vendor tool and stores data for ~26 different elements within @data. There is a foreach loop for @data that breaks the 26 elements into $e->{'element1'}, $e->{'element2'}, $e->{'element3'}, $e->{'element4'}, etc.
I am also reading from the .txt files within a directory (line-by-line) and comparing the server names that exist within the text files with what exists in $e->{'element4'}.
The Problem:
Matches are working perfectly and only printing one line for each of the 26 elements when there is a match; however, non-matches produce one line for every entry within the .txt files (37 in all). So if there are 100 entries (each entry having 26 elements) stored within @data, then 100 x 37 entries are being printed.
So for every non-match in the if ($e->{'element4'} eq '6' && $_ =~ /$e->{'element7'}/i) statement below, I receive a printout saying there is not a match: 37 entries for the same identical 26 elements (because there are 37 total entries in all of the .txt files).
The Goal:
I need to print out only 1 line for each unique entry (a unique entry being $e->{'element1'} through $e->{'element26'}). It already prints 1 line for matches, but it prints 37 entries when there is not a match. I need to treat matches and non-matches differently.
Code:
foreach my $e (@data) {
    # Open the .txt files stored within $basePath and use for comparison:
    opendir( DIRC, $basePath . "/" ) || die("cannot open directory");
    my @files = ( readdir(DIRC) );
    my @MPG_assets = grep( /(.*?).txt/, @files );
    # Loop through each system name found and compare it with the data in SC for a match:
    foreach (@MPG_assets) {
        $filename = $_;
        open( MPGFILES, $basePath . "/" . $filename ) || die "cannot open the file";
        while (<MPGFILES>) {
            if ( $e->{'element4'} eq '6' && $_ =~ /$e->{'element7'}/i ) {
                ## THIS SECTION WORKS PERFECTLY AND ONLY PRINTS MATCHES WHERE $_
                ## (which contains the servernames (1 per line) in the .txt files)
                ## EQUALS $e->{'element7'}.
                print $e->{'element1'} . "\n";
                print $e->{'element2'} . "\n";
                print $e->{'element3'} . "\n";
                print $e->{'element4'} . "\n";
                print $e->{'element5'} . "\n";
                # ...
                print $e->{'element26'} . "\n";
            } else {
                ## **THIS SECTION DOES NOT WORK**. FOR EVERY NON-MATCH, THERE IS A
                ## LINE PRINTED WITH 26 IDENTICAL ELEMENTS BECAUSE IT'S LOOPING THRU
                ## THE 37 LINES IN THE *.TXT FILES.
                print $e->{'element1'} . "\n";
                print $e->{'element2'} . "\n";
                print $e->{'element3'} . "\n";
                print $e->{'element4'} . "\n";
                print $e->{'element5'} . "\n";
                # ...
                print $e->{'element26'} . "\n";
            } # End of 'if ($e->{'element4'} eq..' statement
        } # End of while loop
    } # End of 'foreach (@MPG_assets)'
} # End of 'foreach my $e (@data)'
I think I need something to identify unique entries and define which fields make up a unique entry, but honestly I have tried everything I know. If you would be so kind as to provide actual code fixes, that would be wonderful, because I am headed to production with this script quite soon. Also, I am looking for code (ideally) that is very human-readable, because I will need to document it so others can understand it.
Please let me know if you need additional information.
Personally, I would push all bad nodes to an array and pull what I wanted with List::MoreUtils. Also, a sample of the results might be helpful as well.
# Half pseudo-code
# =========================
use List::MoreUtils qw(uniq);
# ...
else
{
    while ( $i <= 26 ) {
        # note: concatenation, since single quotes don't interpolate
        push @nonMatches, $e->{ 'element' . $i++ };
    }
    my @badElements = uniq @nonMatches;
    foreach $element ( @badElements )
    {
        print $element;
    }
}
You're testing whether a particular line matches 'element 7', and if it does - printing all the elements. And if it doesn't match, you appear to be doing the same thing?
What are you trying to do in that else codeblock?
I will hazard a guess that what you might want to try is:
my %match_found_in;
my %match_found_of;
while (my $line = <MPGFILES>) {
    chomp $line;
    if ($e->{'element4'} eq '6' && $line =~ /$e->{'element7'}/i) {
        $match_found_in{$filename} = $line;
        $match_found_of{$e->{'element7'}} = $filename;
    }
}
foreach my $element ( keys %match_found_of ) {
    print "$element had a match in ", $match_found_of{$element}, "\n";
}
foreach my $filename ( keys %match_found_in ) {
    print "$filename had a match on: ", $match_found_in{$filename}, "\n";
}
Or am I missing what you're trying to accomplish?

How to skip splitting for some part of the line

Say I have the line lead=george wife=jane "his boy"=elroy. I want to split on spaces, but the "his boy" part should not be split; it should be considered as one field.
A normal split also splits "his boy", taking "his" as one part and "boy" as the second. How do I avoid this?
Following this, I tried
split " ", $_
Just came to know that this will work:
use strict; use warnings;
my $string = q(hi my name is 'john doe');
my @parts = $string =~ /'.*?'|\S+/g;
print map { "$_\n" } @parts;
But it does not look good. Is there any other simple way with split itself?
You could use Text::ParseWords for this:
use Text::ParseWords;

$list = "lead=george wife=jane \"his boy\"=elroy";
@words = quotewords('\s+', 0, $list);

$i = 0;
foreach (@words) {
    print "$i: <$_>\n";
    $i++;
}
Output:
0: <lead=george>
1: <wife=jane>
2: <his boy=elroy>
sub split_space {
    my ( $text ) = @_;
    while (
        $text =~ m/
            (                # group ($1)
                \"([^\"]+)\" # first try to find something in quotes ($2)
                |
                (\S+?)       # else minimal non-whitespace run ($3)
            )
            =
            (\S+)            # then maximal non-whitespace run ($4)
        /xg
    ) {
        my $key   = defined($2) ? $2 : $3;
        my $value = $4;
        print( "key=$key; value=$value\n" );
    }
}
split_space( 'lead=george wife=jane "his boy"=elroy' );
split_space( 'lead=george wife=jane "his boy"=elroy' );
Outputs:
key=lead; value=george
key=wife; value=jane
key=his boy; value=elroy
PP posted a good solution, but just to show that there is another cool way to do it, here is mine:
my $string = q~lead=george wife=jane "his boy"=elroy~;
my @split = split / (?=")/, $string;
my @split2;
foreach my $sp (@split) {
    if ($sp !~ /"/) {
        push @split2, $_ foreach split / /, $sp;
    } else {
        push @split2, $sp;
    }
}
use Data::Dumper;
print Dumper @split2;
Output:
$VAR1 = 'lead=george';
$VAR2 = 'wife=jane';
$VAR3 = '"his boy"=elroy';
I use a lookahead here to split first on the parts whose keys are inside quotes " ". After that, I loop through the complete array and split all the other parts, which are normal key=value pairs.
You can get the required result using a single regexp, which extracts the keys and the values and puts the result into a hash table.
(\w+|"[\w ]+") will match both a single word and multiple words on the key side.
The regexp captures only the key and the value, so the result of the match operation will be a list with the following content: key #1, value #1, key #2, value #2, etc.
The hash is automatically initialized with the appropriate keys and values when the match result is assigned to it.
Here is the code:
my $str = 'lead=george wife=jane "hello boy"=bye hello=world';
my %hash = ($str =~ m/(?:(\w+|"[\w ]+")=(\w+)(?:\s|$))/g);
## outputs the hash content
foreach $key (keys %hash) {
    print "$key => $hash{$key}\n";
}
and here is the output of this script
lead => george
wife => jane
hello => world
"hello boy" => bye

Using a regular expression with nested for loops, using Perl

I have two arrays:
@file_list holds a list of files in a directory, and
@name_list holds some names.
For example, these arrays could contain
@file_list = ('Bob_car', 'Bob_house', 'Bob_work', 'Fred_car', 'Fred_house', 'Fred_work', ...);
@name_list = ('Bob', 'Fred', ...);
(the real data is not that simple).
My goal is to compare each file with every name and see if they match. They match if the file string starts with the name.
I could then use these matches to sort the files into new directories, based on their corresponding name.
Here is my code:
for ( my $i = 0; $i < scalar @file_list ; $i++ )
{
    for ( my $j = 0; $j < @name_list ; $j++ )
    {
        if ( $file_list[ $i ] =~ m/^$name_list[ $j ]/ )
        {
            print "$file_list[ $i ] goes with $name_list[ $j ]\n";
        }
        else
        {
            print "no match\n";
        }
    }
}
However, I don't get any matches. I've tested the individual loops and they are working. Otherwise, is there something off about the regex?
About how the arrays were made:
For @name_list, the file containing the names is organized in a seemingly random way, just because of how it was used for something else. The names in that file are on several different lines, with lots of blank lines in between and lots of blank entries within lines. Names can appear more than once.
I used the following code to make #name_list:
while (my $line = <$OriginalFILE>)
{
    chomp $line;
    my @current_line = split( "\t", $line );
    for ( my $i = 0; $i < scalar @current_line ; $i ++ )
    {
        if ( $current_line[ $i ] =~ m/^\s*$/ )
        {
            # print "$current_line[$i] is blank\n";
        }
        else
        {
            push( @raw_name_list, $current_line[ $i ] );
        }
    } # end of for
} # end of while

# collect list without repeat instances of the same name
my %unique = ();
foreach my $name (@raw_name_list)
{
    $unique{$name} ++;
}
my @name_list = keys %unique;

foreach my $name ( @name_list )
{
    # print "$name\n";
    chomp $name;
    unless (mkdir $name, 0700)
    {
        die "Unable to create directory called $name\n";
    }
}
The array @file_list was made using:
opendir(DIR, $ARGV[1]);
my @file_list = grep ! /^\./, readdir DIR;
closedir(DIR);
# print @file_list;
@amon, here is what I did to test the loops and regex:
FILE: for my $file (@transposed_files) {
    print "$file\n";
    for my $name (@transposedunique) {
        print "i see this $name\n";
        if ($file =~ /^\Q$name\E/) {
            print "$file goes with $name\n";
            next FILE;
        }
    }
    # print "no match for $file\n";
}
Oh, and I transposed the arrays so that they would print to an outfile in separate rows.
Short version: You are building your name array wrong. Look at this line:
$unique{name} ++;
You are just incrementing the name entry of the hash. You probably wanted the $name variable.
The Longer Version
On English, and Foreach-Loops
Your code is a bit unperlish and looks more like C than like Perl. Perl is much closer to English than you might think. From the original wording of your question:
take the first element from @file_list and then compare that to each element in @name_list
You wrote this as
for (my $i = 0; $i < @file_list; $i++) {
    for (my $j = 0; $j < @name_list; $j++) {
        ...; # compare $file_list[$i] with $name_list[$j]
    }
}
I'd rather do
for my $file (@file_list) {
    for my $name (@name_list) {
        ...; # compare $file with $name
    }
}
and save myself from the hassle of array subscripting.
Building Correct Regexes
Your code contains the following test:
$file_list[ $i ] =~ m/^$name_list[ $j ]/
This will not do what you think if $name_list[$j] contains special characters like (, ., +. You can match the literal contents of a variable by enclosing it in \Q ... \E. This would make the code
$file =~ /^\Q$name\E/
(if used with my variant of the loop).
You could also go the nifty route and compare the leading substring directly:
$name eq substr $file, 0, length($name)
This expresses the same condition.
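A quick usage sketch of that substring comparison (made-up values):

```perl
use strict;
use warnings;

# Prefix test without a regex: compare the name against the same-length
# leading substring of the file name; no metacharacter escaping needed.
my $file = 'Bob_car';
my $name = 'Bob';
print "$file goes with $name\n"
    if $name eq substr $file, 0, length $name;
```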
On Loop Control
I will make two assumptions:
You are only interested in the first matching name for any file
You only want to print the no match message if no name was found
Perl allows us to break out of arbitrary loops, or restart the current iteration, or go directly to the next iteration, without using flags, as you would do in other languages. All we have to do is to label our loops like LABEL: for (...).
So once we have a match, we can start our search for the next file. Also, we only want to print no match if we left the inner loop without going to the next file. This code does it:
FILE: for my $file (@file_list) {
    for my $name (@name_list) {
        if ($file =~ /^\Q$name\E/) {
            print "$file goes with $name\n";
            next FILE;
        }
    }
    print "no match for $file\n";
}
The Zen of Negation
In your file parsing code, you express a condition
if ($field =~ /^\s*$/) {
} else {
    # do this stuff only if the field does not consist only of
    # zero or more whitespace characters
}
That description is far too complex. How about
if ($field =~ /\S/) {
    # do this stuff only if the field contains a non-whitespace character.
}
The same condition, but simpler, and more efficient.
Simplify your Parse
In short, your file parsing code can be condensed to
my %uniq;
while (<$OriginalFILE>) {
    chomp;
    $uniq{$_} = undef for grep /\S/, split /\t/;
}
my @name_list = sort { length($b) <=> length($a) } keys %uniq;
The split function takes a regex as first argument, and will split on $_ if no other string is specified. It returns a list of fields.
The grep function takes a condition and a list, and will return all elements of a list that match the condition. The current element is in $_, which regexes match by default. For explanation of the regex, see above.
Note: This still allows the fields to contain whitespace, even in leading position. To split on all whitespace, you can give split the special argument of a string containing a single space: split ' '. This would make the grep unnecessary.
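The difference between split / / and the special-case split ' ' is easy to see on a small made-up input:

```perl
use strict;
use warnings;

my $line = "  foo   bar\tbaz";
my @by_regex = split / /, $line;   # splits on single spaces only: keeps
                                   # empty leading fields, ignores the tab
my @by_space = split ' ', $line;   # special case: strips leading whitespace
                                   # and splits on any run of whitespace
print scalar @by_regex, " fields vs ", scalar @by_space, " fields\n";
```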
The for loop can also be used as a statement modifier, i.e. like EXPR for LIST. The current element is in $_. We assign something to the $_ entry in our %uniq hash (which is already initialized to the empty hash). This could be a number, but undef works as well.
The keys are returned in a seemingly random order. Since multiple names could match a file but we only want to select one match, we have to try the most specific name first. Therefore, I sort the names by their length in descending order.
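Why longest-first matters can be shown with two made-up names where one is a prefix of the other:

```perl
use strict;
use warnings;

# "Bobby_car" should pair with "Bobby", not with the shorter prefix "Bob".
my @name_list = sort { length($b) <=> length($a) } qw(Bob Bobby);
my ($match) = grep { 'Bobby_car' =~ /^\Q$_\E/ } @name_list;
print "$match\n";   # Bobby
```

With an ascending (or random) order, "Bob" would win first and "Bobby_car" would be filed under the wrong name.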
Your code seems to work for me. All I did was construct two arrays like this:
my @file_list = qw/Bob_car Bob_house Bob_work Fred_car Fred_house Fred_work/;
my @name_list = qw/Fred Bob Mary/;
Then running your code produces output like this:
no match
Bob_car goes with Bob
no match
no match
Bob_house goes with Bob
no match
no match
Bob_work goes with Bob
no match
Fred_car goes with Fred
no match
no match
Fred_house goes with Fred
no match
no match
Fred_work goes with Fred
no match
no match
So it looks like it's working.
A common problem with reading input from files or from a user is forgetting to strip the newline character from the end of the input. This could be your problem. If so, have a read about perldoc -f chomp, and just chomp each value as you add it to your array.
I'm always interested in doing things efficiently, so every time I see an O(N^2) algorithm, alarm bells ring for me. Why should it be O(N*M) and not O(N+M)?
my $re = join('|', map quotemeta, @name_list);
$re = qr/$re/;
for my $file (@file_list) {
    if ($file =~ /^($re)/) {
        my $name = $1;
        ...  # do what you need
    }
}
It looks like something is wrong in the loop. Follow the comments in the code:
for ( my $i = 0; $i < scalar @file_list ; $i++ )
{
    # use some string variable; assign it ""
    for ( my $j = 0; $j < @name_list ; $j++ )
    {
        if ( $file_list[ $i ] =~ m/^$name_list[ $j ]/ )
        {
            # assign the string variable the matched $name_list[$j]
            # and break out of the inner loop (last)
        }
    }
    # check the condition: if the string variable is not equal to "", a match
    # was found - print your requirement with the string value; else no match
}

Problems with user arguments in Perl

I'm currently trying to take user arguments (usually 2) that are text files, get the number of characters, lines, and words in each text file, and display them. My code currently adds them all together instead of listing them separately for each file. How do I list the file name and the number of lines, characters, and words for each file, based on the user arguments, without adding them together? Thank you for taking the time to read this.
#!/usr/bin/perl -w
use strict;

my $user_files = @ARGV;
chomp($user_files);

my @parts;
my $word_count = 0;
my $total_words = 0;
my $line_count = 0;

foreach my $line (<>)
{
    @parts = split (/\s+/, $line);
    $line_count += ($line =~ tr/\n//);
    $word_count += length($line) + 1;
    $total_words += scalar(@parts);
}

for (my $i = 0; $i < 1; $i++)
{
    print "File name:", @ARGV,
          "\t\t Word Count: ", $word_count,
          "\t\t Total words: ", $total_words,
          "\t\t Total lines: ", $line_count,
          "\n";
}
There are two basic things you need to change to enable this to work.
Use $ARGV - when reading across multiple files using <>, it contains the name of the current file
Store the data in a hash (that is keyed on $ARGV)
In this sample, I've retained all of your calculations (but I think you'll need to reconsider some of those) and made a few other changes to clean up your code a bit.
#!/usr/bin/perl
use strict;
use warnings; # better than '-w'

my %files; # Store all the data here

# While is better than foreach here, as it reads the file one line at a time.
# Each line goes into $_
while (<>) {
    # By default, split splits $_ on whitespace
    my @parts = split;
    # By default, tr/// works on $_
    $files{$ARGV}{line_count} += tr/\n//;
    # I think this calculation is wrong.
    # length() has no relation to word count. And why add 1 to it?
    $files{$ARGV}{word_count} += length($_) + 1;
    # Addition imposes scalar context, no need for the scalar keyword
    $files{$ARGV}{total_words} += @parts;
}

# Print all the information in the hash
foreach (keys %files) {
    print "File name: $_",
          "\t\t Word Count: $files{$_}{word_count}",
          "\t\t Total words: $files{$_}{total_words}",
          "\t\t Total lines: $files{$_}{line_count}",
          "\n";
}
This line:
foreach my $line (<>)
is reading all of the input files (or STDIN) as a single stream. You need to do something like:
for my $file (@user_files) {
    open my $fin, '<', $file or die $!;
    while ( my $line = <$fin> ) {
        # count stuff
    }
    close $fin;
    # print counted stuff
}
Also note that if you want to take multiple filenames as args:
my $user_files = @ARGV;
will only give you the number of args (an array in scalar context). You probably want:
my @user_files = @ARGV;
Also, the chomp on an arg is unnecessary.
In your script, you're counting across all the files before printing. Which is fine, but you probably want to store that data in an array or hash. That data structure might look like this:
$file_counts = [
    {
        $file_name1 => {
            characters => $characters,
            words      => $words,
            lines      => $lines,
        }
    },
    {
        $file_name2 => {
            characters => $characters,
            words      => $words,
            lines      => $lines,
        }
    },
];
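A sketch of filling a structure shaped like that while reading with <> ($ARGV holds the current file name; the counting expressions here are illustrative, not taken from the question):

```perl
use strict;
use warnings;

# Per-file counters keyed by file name, converted at the end into the
# array-of-hashes shape shown above.
my %counts;
if (@ARGV) {
    while (my $line = <>) {
        $counts{$ARGV}{lines}++;
        $counts{$ARGV}{words}      += () = $line =~ /\S+/g;  # count matches
        $counts{$ARGV}{characters} += length $line;
    }
}
my @file_counts = map { { $_ => $counts{$_} } } sort keys %counts;
for my $entry (@file_counts) {
    my ($file) = keys %$entry;
    printf "%s: %d lines, %d words, %d characters\n",
        $file, @{ $entry->{$file} }{qw(lines words characters)};
}
```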