Match two strings based on common substring - perl

I have a list of files that needs to be grouped in pairs. (I need to append an HTML 'File B' (body) to 'File A' (header) because I need to serve them statically without server-side includes).
Example:
/path/to/headers/.../matching_folder/FileA.html
/someother/path/to/.../matching_folder/body/FileB.html
Emphasizing with the ellipses that the paths are not of uniform length, nor is 'matching folder' always in the same position in the path.
It seems I need to match/join based on the common substring 'matching_folder', but I am stumped on scanning each string, storing, matching (excerpt):
my #dirs = ( $headerPath, $bodyPath );
my #files = ();
find( { wanted => \&wanted, no_chdir => 1 }, #dirs );
foreach my $file (#files) {
# pseudocode: append $file[0] to $file[1] if both paths contain same 'matching_folder'
};
sub wanted {
return unless -f and /(FileA\.html$)|(FileB\.html$)/i;
push #files, $_;
};

Hash the files by all the directory steps in their names.
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
use File::Find;
my $headerPath = 'headers';
my $bodyPath = 'bodies';
my #dirs = ($headerPath, $bodyPath);
my #files;
sub wanted {
return unless -f and /file.\.html$/;
push #files, $_;
};
find({ wanted => \&wanted, no_chdir => 1 }, #dirs);
my %common;
for my $file (#files) {
my #steps = split m(/), $file;
push #{ $common{$_} }, $file for #steps;
};
# All the headers and all the bodies share their prefixes,
# but that's not what we're interested in.
delete #common{qw{ bodies headers }};
for my $step (keys %common) {
next if 1 == #{ $common{$step} };
print "$step common for #{ $common{$step} }\n";
}
Tested on the following structure:
bodies/3/something/C/something2/fileA.html
bodies/2/junk/B/fileB.html
bodies/1/A/fileC.html
headers/a/B/fileD.html
headers/c/one/A/two/fileE.html
headers/b/garbage/C/fileF.html
Output:
B common for headers/a/B/fileD.html bodies/2/junk/B/fileB.html
C common for headers/b/garbage/C/fileF.html bodies/3/something/C/something2/fileA.html
A common for headers/c/one/A/two/fileE.html bodies/1/A/fileC.html

With the above, I can get to
for my $step (keys %common) {
next unless 2 == #{ $common{$step} }; # pairs
my #pairs = #{ $common{$step} };
my $html;
foreach my $f (#pairs) {
$html .= &readfile($f);
};
&writefile($html, $step . '.html');
}
And get what I need for now. Thanks all! (I love Perl, making hard things possible indeed).

Related

Passing Subroutine to Perl Subroutine

I'm trying to make a subroutine that mimics Perl6/Raku's dir. dir can search directories very easily and quickly, with file tests in a directory.
I'm trying to add a recursive option to dir.
I have a script that can accomplish this, but I'm trying to add a recursive option using File::Find script that I already know works:
use File::Find;
my #files;
find({ wanted => \&process_file, no_chdir => 1 }, '.');
sub process_file {
if ((-f $_) && ($_ =~ m/\.pl$/)) {#various tests on the files
push #files, $_;
# print "This is a file: $_\n";
}
}
However, I want to do everything in a single subroutine.
I have tried this:
use strict;
use warnings FATAL => 'all';
use feature 'say';
use autodie ':all';
use Carp 'confess';
use File::Find 'find';
sub dir { # a Perl5 mimic of Perl6/Raku's "dir"
my %args = ( # default arguments listed below
tests => 'f', dir => '.', regex => '.', recursive => 0,
#_, # argument pair list goes here
);
if ($args{tests} =~ m/[^rwxoRWXOezfdlpSbctugkTB]/){# sMAC returns are non-boolean, and are excluded
confess "There are some unrecognized characters in $args{tests}";
}
my #items = split '', $args{tests};
my $tests = '-' . join (' -', #items);
undef #items;
$args{regex} = qr/$args{regex}/;
opendir my $dh, $args{dir} or die "Can't opendir on $args{dir}: $!";
while (my $item = readdir $dh) {
if (($item !~ $args{regex}) or ($item =~ m/^\.{1,2}$/)) { next }
my $i = "$args{dir}/$item";
if (not defined $i) {
confess "\$i isn't defined.";
}
my $rv = eval "${tests} \$i" || 0;
if ($rv == 0) { next }
if ($args{dir} eq '.') {
push #items, $item
} else {
push #items, $i
}
}
# this is the new part, and what doesn't work
if ($args{recursive} == 1) {
find({
wanted => \sub {
if ((eval "${tests} $_") && ($_ =~ /$args{regex}/)) {
push #items, $_;
}
},
chdir => 1,
'.'
});
}
#items
}
foreach my $pl (dir(recursive => 1, regex => '\.pl$')) {
say $pl
}
this gives an error:
Odd number of elements in anonymous hash at clean.pl line 44
I think that this is because of the find function in the bottom, as the error indicates.
Passing one subroutine to another subroutine is somewhat similar to my question, but I don't see how to apply there to my case.
Passing wanted => sub and wanted => \sub gives the same error.
How can I get this wanted subroutine to pass to find without errors?
Perl tries to use every odd argument as hash key and every even one as hash value, so with an odd number of arguments you get that "odd number of arguments" error cause you put all arguments to find in an anonymous hash ref. The file name / path has no hash key, so move that out of the hashref.
Then pass the wanted routine like this:
wanted => sub { ... }
That said, my File::Find only knows no_chdir, not chdir, so you probably want to switch the value.
This works for me:
find(
{
wanted => sub {
if ((eval "${tests} $_") && ($_ =~ /$args{regex}/)) {
push #items, $_;
}
},
no_chdir => 0,
},
'.' # File outside the hash ref
);

Perl: save hash-references in a hash for Image::ExifTool

Trying for two days to save all the readable (non-binary) output of ExifTool into a hash with the path to the files as key:
use strict;
use warnings;
use Image::ExifTool;
use Data::Types qw(:all);
use v5.28;
my #directories = ("/some/filepath/to-pictures");
my #suffixes = qw(jpg jpeg gif png raw svg tif tiff psd orf nef eps cr2 arw);
my %file_catalog = ();
while (my $folder = shift #directories) {
opendir(DirHandle, "$folder") or die "Cannot open $folder\n";
my #files = readdir(DirHandle);
closedir(DirHandle);
foreach my $file (#files) {
my $file_string = "$folder/$file";
if (-f $file_string) {
my $sep_pos = rindex($file, ".");
my $end_chars = -($sep_pos - (length $file) + 1);
my $suffix = substr $file, $sep_pos + 1, $end_chars;
if (grep ( lc $suffix, #suffixes)) {
my $exif_tool = new Image::ExifTool;
my $info = $exif_tool->ImageInfo($file_string);
say "Datei: $file_string";
%file_catalog = ($file_string => $info);
foreach (keys %{$info}) {
say "Key: $_ => Value: $$info{ $_ }";
}
}
}
}
}
Also not working: %file_catalog = ($file_string => %{$info});.
But I either I don't get more than one picture's information in my %file_catalog - can you explain this to me?
Or I get a lot of problems with dereferencing ("Can't use string as a HASH ref while "strict refs" in use") or storing ("Odd number of elements in hash assignment") as you can see in my repo's commit history.
Of course, not all images have the same exif-informations, every image provides different key->values.
For test output:
my $hash_size = keys %file_catalog;
print "\n---------------------------\n";
print "In 'file_catalog'-hash recorded image-path (key)-/ exif-data (value)-hashes: ", $hash_size;
print "\n---------------------------\n";
my ($i, $j) = 0;
foreach my $key (sort keys %file_catalog) {
$i++;
printf "%s. key: $key\n", uc chr($i + ord('A') - 1 );
foreach my $inner_key (keys %{$file_catalog{ $key }}) {
$j++;
say "$j. inner key: $inner_key: $file_catalog{$key}{$inner_key}";
}
}
Maybe somebody can help me understanding. Also there might be a better construct than my neophyte code.
To add a new key-value to a hash you need:
$file_catalog{$file_string} = $info;

How to build hierarhical hash in Perl from directory tree

I am trying to build the structure like this.
{
"file1": "supersong.mp3",
"file2": "supersong2.mp3",
"file3": "text.txt",
"file4": "tex2t.txt",
"file5": "text3.txt",
"file6": "json.pl",
"directory_movies": [
"file1": "supersong.mp3",
"file2": "supersong2.mp3",
"file3": "text.txt",
"file4": "tex2t.txt",
"file5": "text3.txt",
"file6": "json.pl",
"directory_sub_movies": [
"file1": "supersong.mp3",
"file2": "supersong2.mp3",
"file3": "text.txt",
"file4": "tex2t.txt",
"file5": "text3.txt",
"file6": "json.pl",
]
]
};
So as any directory hierarchy in my case in unix. So we have simple files or directories, if it is directory it is nested hash and so on recursively.
I need to represent it as hash in perl, the easiest way I have found is to use File::Find module.
It works correctly but I cannot figure out how to save hierarchy in hash to be nested as above.
Here is my test script. That determines type of current item correctly.
sub path_checker {
if (-d $File::Find::name) {
print "Directory " . $_ . "\n";
}
elsif (-f $File::Find::name) {
print "File " . $_ . " Category is " . basename($File::Find::dir) . "\n";
}
}
sub parse_tree {
my ($class,$root_path) = #_;
File::Find::find(\&path_checker, $root_path);
}
Please help to modify it to create structure like I have described above. I would be very grateful.
Subfolders should also be hashes, not arrays,
use strict;
use warnings;
# use Data::Dumper;
use File::Find;
use JSON;
sub parse_tree {
my ($root_path) = #_;
my %root;
my %dl;
my %count;
my $path_checker = sub {
my $name = $File::Find::name;
if (-d $name) {
my $r = \%root;
my $tmp = $name;
$tmp =~ s|^\Q$root_path\E/?||;
$r = $r->{$_} ||= {} for split m|/|, $tmp; #/
$dl{$name} ||= $r;
}
elsif (-f $name) {
my $dir = $File::Find::dir;
my $key = "file". ++$count{ $dir };
$dl{$dir}{$key} = $_;
}
};
find($path_checker, $root_path);
return \%root;
}
print encode_json(parse_tree("/tmp"));

Push into end of hash in Perl

So what I am trying to do with the following code is push a string, let's say "this string" onto the end of each key in a hash. I'm completely stumped on how to do this. Here's my code:
use warnings;
use strict;
use File::Find;
my #name;
my $filename;
my $line;
my #severity = ();
my #files;
my #info = ();
my $key;
my %hoa;
my $xmlfile;
my $comment;
my #comments;
open( OUTPUT, "> $ARGV[0]" );
my $dir = 'c:/programs/TEST/Test';
while ( defined( $input = glob( $dir . "\\*.txt" ) ) ) {
open( INPUT, "< $input" );
while (<INPUT>) {
chomp;
if (/File/) {
my #line = split /:/;
$key = $line[1];
push #{ $hoa{$key} }, "Filename\n";
}
if ( /XML/ ... /File/ ) {
$xmlfile = $1;
push #{ $hoa{$key} }, "XML file is $xmlfile\n";
}
if (/Important/) {
push #{ $hoa{$key} }, "Severity is $_\n";
}
if (/^\D/) {
next if /Important/;
push #{ $hoa{$key} }, "Given comment is $_\n";
}
push #{ $hoa{$key} }, "this string\n";
}
}
foreach my $k ( keys %hoa ) {
my #list = #{ $hoa{$k} };
foreach my $l (#list) {
print OUTPUT $l, "\n";
}
}
}
close INPUT;
close OUTPUT;
Where I have "this string" is where I was trying to push that string onto the end of the array. However, what ended up happening was that it ended up printing "this string" three times, and not at the end of every key like I wanted. When I tried to put it outside the while() loop, it said that the value of $key was not initialized. So please, any help? And if you need any clarification on what I'm asking, just let me know. Thank you!
No offence, but there are so many issues in this code I don't even know where to start...
First, the 'initialization block' (all these my $something; my #somethings lines at the beginning of this script) is not required in Perl. In fact, it's not just 'redundant' - it's actually confusing: I had to move my focus back and forth every time I encountered a new variable just to check its type. Besides, even with all this $input var is still not declared as local; it's either missing in comments, or the code given has omissions.
Second, why do you declare your intention to use File::Find (good) - but then do not use it at all? It could greatly simplify all this while(glob) { while(<FH>) { ... } } routine.
Third, I'm not sure why you assign something to $key only when the line read is matched by /File/ - but then use its value as a key in all the other cases. Is this an attempt to read the file organized in sections? Then it can be done a bit more simple, either by slurp/splitting or localizing $/ variable...
Anyway, the point is that if the first line of the file scanned is not matched by /File/, the previous (i.e., from the previous file!) value is used - and I'm not quite sure that it's intended. And if the very first line of the first file is not /File/-matched, then an empty string is used as a key - again, it smells like a bug...
Could you please describe your task in more details? Give some test input/output results, perhaps... It'd be great to proceed in short tasks, organizing your code in process.
Your program is ill-conceived and breaks a lot of good practice rules. Rather than enumerate them all, here is an equivalent program with a better structure.
I wonder if you are aware that all of the if statements will be tested and possibly executed? Perhaps you need to make use of elsif?
Aside from the possibility that $key is undefined when it is used, you are also setting $xmlfile to $1 which will never be defined as there are no captures in any of your regular expressions.
It is impossible to tell from your code what you are trying to do, so we can help you only if you show us your output, input and say how to derive one from the other.
use strict;
use warnings;
use File::Find;
my ($outfile) = #ARGV;
my $dir = 'c:/programs/TEST/Test';
my %hoa;
my $key;
while (my $input = glob "$dir/*.txt") {
open my $in, '<', $input or die $!;
while (<$in>) {
chomp;
if (/File/) {
my $key = (split /:/)[1];
push #{ $hoa{$key} }, "Filename\n";
}
if (/XML/ ... /File/) {
my $xmlfile = $1;
push #{ $hoa{$key} }, "XML file is $xmlfile\n";
}
if (/Important/) {
push #{ $hoa{$key} }, "Severity is $_\n";
}
if (/^\D/) {
next if /Important/;
push #{ $hoa{$key} }, "Given comment is $_\n";
}
push #{ $hoa{$key} }, "this string\n";
}
close $in;
}
open my $out, '>', $outfile or die $!;
foreach my $k (keys %hoa) {
foreach my $l (#{ $hoa{$k} }) {
print $out $l, "\n";
}
}
close $out;
I suspect based on your code, that the line where $key is set is not called each time through the loop, and that you do not trigger any of the other if statements.
This would append "this string" to the end of the array. Based on that you are getting 3 of the "this strings" at the end of the array, I would suspect that two lines do not go through the if (/FILE/) or any of the other if statements. This would leave the $key value the same and at the end, you would append "this string" to the array, using whatever the last value of $key was when it was set.
This will append the string "this string" to every element of the hash %hoa, which elements are array refs:
for (values(%hoa)) { push #{$_}, "this string"; }
Put that outside your while loop, and you'll print "this string" at the end of each element of %hoa.
It will autovivify array refs where it finds undefined elements. It will also choke if it cannot dereference an element as an array, and will manipulate arrays by symbolic reference if it finds a simple scalar and is not running under strict:
my %autoviv = ( a => ['foo'], b => undef );
push #$_, "PUSH" for values %autoviv; # ( a => ['foo', 'PUSH'], b => ['PUSH'] )
my %fatal = ( a => {} );
push #$_, "PUSH" for values %fatal; # FATAL: "Not an ARRAY reference at..."
my %dangerous = (a => "foo");
push #$_, "PUSH" for values %dangerous; # Yikes! #foo is now ("PUSH")
use strict;
my %kablam = (a => "foo");
push #$_, "PUSH" for values %kablam; # "Can't use string ("foo") as an ARRAY ref ..."
As I understand it, traverse the hash with a map command to modify its keys. An example:
EDIT: I've edited because I realised that the map command can be assigned to the same hash. No need to create a new one.
#!/usr/bin/perl
use warnings;
use strict;
use Data::Dumper;
my %hash = qw|
key1 value1
key2 value2
key3 value3
|;
my %hash = map { $_ . "this string" => $hash{ $_ } } keys %hash;
print Dump \%hash;
Run it like:
perl script.pl
With following output:
$VAR1 = {
'key3this string' => 'value3',
'key2this string' => 'value2',
'key1this string' => 'value1'
};

Where can I find an array of the (un)assigned Unicode code points for a particular block?

At the moment, I'm writing these arrays by hand.
For example, the Miscellaneous Mathematical Symbols-A block has an entry in hash like this:
my %symbols = (
...
miscellaneous_mathematical_symbols_a => [(0x27C0..0x27CA), 0x27CC,
(0x27D0..0x27EF)],
...
)
The simpler, 'continuous' array
miscellaneous_mathematical_symbols_a => [0x27C0..0x27EF]
doesn't work because Unicode blocks have holes in them. For example, there's nothing at 0x27CB. Take a look at the code chart [PDF].
Writing these arrays by hand is tedious, error-prone and a bit fun. And I get the feeling that someone has already tackled this in Perl!
Perhaps you want Unicode::UCD? Use its charblock routine to get the range of any named block. If you want to get those names, you can use charblocks.
This module is really just an interface to the Unicode databases that come with Perl already, so if you have to do something fancier, you can look at the lib/5.x.y/unicore/UnicodeData.txt or the various other files in that same directory to get what you need.
Here's what I came up with to create your %symbols. I go through all the blocks (although in this sample I skip that ones without "Math" in their name. I get the starting and ending code points and check which ones are assigned. From that, I create a custom property that I can use to check if a character is in the range and assigned.
use strict;
use warnings;
digest_blocks();
my $property = 'My::InMiscellaneousMathematicalSymbolsA';
foreach ( 0x27BA..0x27F3 )
{
my $in = chr =~ m/\p{$property}/;
printf "%X is %sin $property\n",
$_, $in ? '' : ' not ';
}
sub digest_blocks {
use Unicode::UCD qw(charblocks);
my $blocks = charblocks();
foreach my $block ( keys %$blocks )
{
next unless $block =~ /Math/; # just to make the output small
my( $start, $stop ) = #{ $blocks->{$block}[0] };
$blocks->{$block} = {
assigned => [ grep { chr =~ /\A\p{Assigned}\z/ } $start .. $stop ],
unassigned => [ grep { chr !~ /\A\p{Assigned}\z/ } $start .. $stop ],
start => $start,
stop => $stop,
name => $block,
};
define_my_property( $blocks->{$block} );
}
}
sub define_my_property {
my $block = shift;
(my $subname = $block->{name}) =~ s/\W//g;
$block->{my_property} = "My::In$subname"; # needs In or Is
no strict 'refs';
my $string = join "\n", # can do ranges here too
map { sprintf "%X", $_ }
#{ $block->{assigned} };
*{"My::In$subname"} = sub { $string };
}
If I were going to do this a lot, I'd use the same thing to create a Perl source file that has the custom properties already defined so I can just use them right away in any of my work. None of the data should change until you update your Unicode data.
sub define_my_property {
my $block = shift;
(my $subname = $block->{name}) =~ s/\W//g;
$block->{my_property} = "My::In$subname"; # needs In or Is
no strict 'refs';
my $string = num2range( #{ $block->{assigned} } );
print <<"HERE";
sub My::In$subname {
return <<'CODEPOINTS';
$string
CODEPOINTS
}
HERE
}
# http://www.perlmonks.org/?node_id=87538
sub num2range {
local $_ = join ',' => sort { $a <=> $b } #_;
s/(?<!\d)(\d+)(?:,((??{$++1})))+(?!\d)/$1\t$+/g;
s/(\d+)/ sprintf "%X", $1/eg;
s/,/\n/g;
return $_;
}
That gives me output suitable for a Perl library:
sub My::InMiscellaneousMathematicalSymbolsA {
return <<'CODEPOINTS';
27C0 27CA
27CC
27D0 27EF
CODEPOINTS
}
sub My::InSupplementalMathematicalOperators {
return <<'CODEPOINTS';
2A00 2AFF
CODEPOINTS
}
sub My::InMathematicalAlphanumericSymbols {
return <<'CODEPOINTS';
1D400 1D454
1D456 1D49C
1D49E 1D49F
1D4A2
1D4A5 1D4A6
1D4A9 1D4AC
1D4AE 1D4B9
1D4BB
1D4BD 1D4C3
1D4C5 1D505
1D507 1D50A
1D50D 1D514
1D516 1D51C
1D51E 1D539
1D53B 1D53E
1D540 1D544
1D546
1D54A 1D550
1D552 1D6A5
1D6A8 1D7CB
1D7CE 1D7FF
CODEPOINTS
}
sub My::InMiscellaneousMathematicalSymbolsB {
return <<'CODEPOINTS';
2980 29FF
CODEPOINTS
}
sub My::InMathematicalOperators {
return <<'CODEPOINTS';
2200 22FF
CODEPOINTS
}
Maybe this?
my #list =
grep {chr ($_) =~ /^\p{Assigned}$/}
0x27C0..0x27EF;
#list = map { $_ = sprintf ("%X", $_ )} #list;
print "#list\n";
Gives me
27C0 27C1 27C2 27C3 27C4 27C5 27C6 27C7 27C8 27C9 27CA 27D0 27D1 27D2 27D3
27D4 27D5 27D6 27D7 27D8 27D9 27DA 27DB 27DC 27DD 27DE 27DF 27E0 27E1 27E2
27E3 27E4 27E5 27E6 27E7 27E8 27E9 27EA 27EB
I don't know why you wouldn't say miscellaneous_mathematical_symbols_a => [0x27C0..0x27EF], because that's how the Unicode standard is defined according to the PDF.
What do you mean when you say it doesn't "work"? If it's giving you some sort of error when you check the existence of the character in the block, then why not just weed them out of the block when your checker comes across an error?