Removing files with duplicate content from a single directory [Perl, or algorithm]

I have a folder with a large number of files, some of which have exactly the same contents. I want to remove files with duplicate contents: if two or more files with the same content are found, I'd like to leave one of them and delete the others.
Following is what I came up with, but I don't know if it works :) (haven't tried it yet).
How would you do it? Perl or general algorithm.
use strict;
use warnings;

my @files = <"./files/*.txt">;

my $current = 0;
while ( $current <= $#files ) {
    # read contents of $files[$current] into $contents1 scalar

    my $compareTo = $current + 1;
    while ( $compareTo <= $#files ) {
        # read contents of $files[$compareTo] into $contents2 scalar

        if ( $contents1 eq $contents2 ) {
            splice(@files, $compareTo, 1);
            # delete $files[$compareTo] here
        }
        else {
            $compareTo++;
        }
    }
    $current++;
}

Here's a general algorithm (edited for efficiency now that I've shaken off the sleepies -- and I also fixed a bug that no one reported)... :)
It's going to take forever (not to mention a lot of memory) if I compare every single file's contents against every other's. Instead, why don't we apply the same search to their sizes first, and then compare checksums for those files of identical size.
So then, when we calculate their sizes, we can use a hash table to do our matching for us, storing the matches together in arrayrefs:
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

my %files_by_size;
foreach my $file (@ARGV)
{
    push @{$files_by_size{-s $file}}, $file;  # store filename in the bucket for this file size (in bytes)
}
Now we just have to pull out the potential duplicates and check if they are the same (by creating a checksum for each, using Digest::MD5), using the same hashing technique:
while (my ($size, $files) = each %files_by_size)
{
    next if @$files == 1;

    my %files_by_md5;
    foreach my $file (@$files)
    {
        open my $filehandle, '<', $file or die "Can't open $file: $!";
        # enable slurp mode
        local $/;
        my $data = <$filehandle>;
        close $filehandle;

        my $md5 = md5_hex($data);
        push @{$files_by_md5{$md5}}, $file;  # store filename in the bucket for this MD5
    }

    while (my ($md5, $files) = each %files_by_md5)
    {
        next if @$files == 1;
        print "These files are equal: " . join(", ", @$files) . "\n";
    }
}
-fini

Perl, with the Digest::MD5 module.
use Digest::MD5;

my %seen = ();
while ( <*> ) {
    -d and next;
    my $filename = $_;
    print "doing .. $filename\n";
    my $md5 = getmd5($filename);
    if ( not defined $seen{$md5} ) {
        $seen{$md5} = $filename;
    }
    else {
        print "Duplicate: $filename and $seen{$md5}\n";
    }
}

sub getmd5 {
    my $file = shift;
    open( FH, "<", $file ) or die "Cannot open file: $!\n";
    binmode(FH);
    my $md5 = Digest::MD5->new;
    $md5->addfile(FH);
    close(FH);
    return $md5->hexdigest;
}
If Perl is not a must and you are working on *nix, you can use shell tools
find /path -type f -print0 | xargs -0 md5sum | \
awk '($1 in seen){ print "duplicate: "$2" and "seen[$1] } \
( ! ($1 in seen ) ) { seen[$1]=$2 }'

md5sum *.txt | perl -ne '
    chomp;
    ($sum, $file) = split(" ");
    push @{$files{$sum}}, $file;
    END {
        foreach (keys %files) {
            shift @{$files{$_}};
            unlink @{$files{$_}} if @{$files{$_}};
        }
    }
'

Perl is kinda overkill for this:
md5sum * | sort | awk 'seen[$1]++' | cut -b 35- | tr '\n' '\0' | xargs -0 rm
(The awk filter passes every line after the first occurrence of each checksum, so one copy of each set of duplicates survives.)
(If you are missing some of these utilities or they don't have these flags/functions,
install GNU findutils and coreutils.)

Variations on a theme:
md5sum *.txt | perl -lne '
    my ($sum, $file) = split " ", $_, 2;
    unlink $file if $seen{$sum}++;
'
No need to keep a list just to remove one entry from it and delete the rest: simply keep track of what you've seen before, and remove any file whose sum has already been seen. The 2-limit split is there to do the right thing with filenames containing spaces.
Also, if you don't trust this, just change the word unlink to print and it will output a list of files to be removed. You can even tee that output to a file, and then rm $(cat to-delete.txt) in the end if it looks good.
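For instance, a dry run of the same one-liner (print substituted for unlink; to-delete.txt is just an illustrative name) might look like this:
md5sum *.txt | perl -lne '
    my ($sum, $file) = split " ", $_, 2;   # 2-limit split keeps spaces in filenames intact
    print $file if $seen{$sum}++;          # print instead of unlink = dry run
' | tee to-delete.txt
Once the list looks good, xargs -d '\n' rm < to-delete.txt removes the files (and, unlike rm $(cat to-delete.txt), it survives filenames with spaces).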

a bash script is more expressive than perl in this case:
md5sum * | sort -k1 | uniq -w32 -d | cut -c 35- | xargs -d '\n' rm
(uniq -d prints only the first line of each duplicate group, so this removes exactly one file per group; re-run it, or use the awk variant above, if a group has more than two copies.)

I'd recommend that you do it in Perl, and use File::Find while you're at it.
Who knows what you're doing to generate your list of files, but you might want to combine it with your duplicate checking.
perl -MFile::Find -MDigest::MD5 -e '
    my %m;
    find(sub{
        if (-f && -r) {
            open(F, "<", $File::Find::name);
            binmode F;
            $d = Digest::MD5->new->addfile(F);
            if (exists($m{$d->hexdigest})) {
                $m{$d->hexdigest}[5]++;
                push @{$m{$d->hexdigest}[0]}, $File::Find::name;
            } else {
                $m{$d->hexdigest} = [[$File::Find::name],0,0,0,0,1];
            }
            close F
        }
    }, ".");
    foreach $d (keys %m) {
        if ($m{$d}[5] > 1) {
            print "Probable duplicates: ".join(" , ", @{$m{$d}[0]})."\n\n";
        }
    }'

Here is a way of filtering by size first and by md5 checksum second:
#!/usr/bin/perl
use strict; use warnings;

use Digest::MD5 qw( md5_hex );
use File::Slurp;
use File::Spec::Functions qw( catfile rel2abs );
use Getopt::Std;

my %opts;
getopt('de', \%opts);
$opts{d} = '.' unless defined $opts{d};
$opts{d} = rel2abs $opts{d};

warn sprintf "Checking %s\n", $opts{d};

my $files = get_same_size_files( \%opts );
$files = get_same_md5_files( $files );

for my $size ( keys %$files ) {
    for my $digest ( keys %{ $files->{$size} } ) {
        print "$digest ($size)\n";
        print "$_\n" for @{ $files->{$size}->{$digest} };
        print "\n";
    }
}

sub get_same_md5_files {
    my ($files) = @_;
    my %out;
    for my $size ( keys %$files ) {
        my %md5;
        for my $file ( @{ $files->{$size} } ) {
            my $contents = read_file $file, {binmode => ':raw'};
            push @{ $md5{ md5_hex($contents) } }, $file;
        }
        for my $k ( keys %md5 ) {
            delete $md5{$k} unless @{ $md5{$k} } > 1;
        }
        $out{$size} = \%md5 if keys %md5;
    }
    return \%out;
}

sub get_same_size_files {
    my ($opts) = @_;
    my $checker = defined($opts->{e})
                ? sub { scalar ($_[0] =~ /\.$opts->{e}\z/) }
                : sub { 1 };
    my %sizes;
    my @files = grep { $checker->($_) } read_dir $opts->{d};
    for my $file ( @files ) {
        my $path = catfile $opts->{d}, $file;
        next unless -f $path;
        my $size = (stat $path)[7];
        push @{ $sizes{$size} }, $path;
    }
    for my $k (keys %sizes) {
        delete $sizes{$k} unless @{ $sizes{$k} } > 1;
    }
    return \%sizes;
}

You might want to have a look at how I find duplicate files and remove them, though you'll have to modify it to your needs.
http://priyank.co.in/remove-duplicate-files

Related

Perl - two list matching elements

I am trying to grab the list of files that Jenkins has updated between the last build and the latest build, and store it in a Perl array.
I also have a list of files and folders in the source code which are considered sensitive to changes, like XXXX\yyy/., XXX/TTTT/FFF.txt, ... in FILE.TXT.
I want the script to tell me if any of these sensitive files were part of my changed files, and if so, list their names so that we can double-check the change with the development team before we trigger the build.
How should I achieve this, and how do I compare multiple files under one path with the changed files?
I have written the script below, which is called inside Jenkins with %workspace% as an argument.
It is not giving any matching result.
use warnings;
use Array::Utils qw(:all);

$url = `curl -s "http://localhost:8080/job/Rev-number/lastStableBuild/" | findstr "started by"`;
$url =~ /([0-9]+)/;
system("cd $ARGV[1]");
@difffiles = `svn diff -r $1:HEAD --summarize`;
chomp @difffiles;
foreach $f (@difffiles) {
    $f = substr( $f, 8 );
    $f = "$f\n";
}
open FILE, '/path/to/file'
    or die "Can't open file: $!\n";
@array = <FILE>;
@isect = intersect( @difffiles, @array );
print @isect;
I have managed to solve this issue using the Perl script below:
sub Verifysensitivefileschanges()
{
    $count1 = 0;
    @isect = intersect(@difffiles, @sensitive);
    #print "ISECT=@isect\n";
    foreach $t (@isect)
    {
        if (@isect) { print "Matching element found -- $t\n"; $count1 = 1; }
    }
    return $count1;
}

sub Verifysensitivedirschanges()
{
    $count2 = 0;
    foreach $g (@difffiles)
    {
        $dirs = dirname($g);
        $filename = basename($g);
        #print "$dirs\n";
        foreach $j (@array)
        {
            if ( "$j" =~ /\Q$dirs/ )
            { print "Sensitive path files changed under path $j and file name is $filename\n"; $count2 = 1; }
        }
    }
    return $count2;
}

Can't find file trying to move

I'm trying to clean up a directory that contains a lot of sub directories that actually belong in some of the sub directories, not the main directory.
For example, there is
Main directory
sub1
sub2
sub3
HHH
And HHH belongs in sub3. HHH has multiple text files inside of it (as well as some ..txt and ...txt files that I would like to ignore), and each of these text files has a string
some_pattern [sub3].
So, I attempted to write a script that looks into the file and then moves it into its corresponding directory
use File::Find;
use strict;
use warnings;
use File::Copy;

my $DATA = "D:/DATA/DATA_x/*";

my @dirs = grep { -d } glob $DATA;

foreach (@dirs) {
    if ($_ =~ m/HHH/) {
        print "$_\n";
        my $file = "$_/*";
        my @files = grep { -f } glob $file;
        foreach (@files) {
            print "file $_\n";
        }
        foreach (@files) {
            print "\t$_\n";
            my @folders = split('/', $_);
            if ($folders[4] eq '..txt' or $folders[4] eq '...txt') {
                print "$folders[4] ..txt\n";
            }
            foreach (@folders) {
                print "$_\n";
            }
            open(FH, '<', $_);
            my $value;
            while (my $line = <FH>) {
                if ($line =~ m/some_pattern/) {
                    ($value) = $line =~ /\[(.+?)\]/;
                    ($value) =~ s/\s*$//;
                    print "ident'$value'\n";
                    my $new_dir = "$folders[0]/$folders[1]/$folders[2]/$value/$folders[3]/$folders[4]";
                    print "making $folders[0]/$folders[1]/$folders[2]/$value/$folders[3]\n";
                    print "file is $folders[4]\n";
                    my $new_over_dir = "$folders[0]/$folders[1]/$value/$folders[2]/$folders[3]";
                    mkdir $new_over_dir or die "Can't make it $!";
                    print "going to swap\n '$_'\n for\n '$new_dir'\n";
                    move($_, $new_dir) or die "Can't $!";
                }
            }
        }
    }
}
It's saying
Can't make it No such file or directory at foo.pl line 57, <FH> line 82.
Why is it saying that it won't make a file that doesn't exist?
A while later: here is my final script:
use File::Find;
use strict;
use warnings;
use File::Copy;

my $DATA = "D:/DATA/DATA_x/*";

my @dirs = grep { -d } glob $DATA;

foreach (@dirs) {
    if ($_ =~ m/HHH/) {
        my $value;
        my @folders;
        print "$_\n";
        my $file = "$_/*";
        my @files = grep { -f } glob $file;
        foreach (@files) {
            print "file $_\n";
        }
        foreach (@files) {
            print "\t$_\n";
            @folders = split('/', $_);
            if ($folders[4] eq '..txt' or $folders[4] eq '...txt') {
                print "$folders[4] ..txt\n";
            }
            foreach (@folders) {
                print "$_\n";
            }
            open(FH, '<', $_);
            while (my $line = <FH>) {
                if ($line =~ m/some_pattern/) {
                    ($value) = $line =~ /\[(.+?)\]/;
                    ($value) =~ s/\s*$//;
                    print "ident'$value'\n";
                }
            }
        }
        if ($value) {
            print "value $value\n";
            my $dir1 = "/$folders[1]/$folders[2]/$folders[3]/$folders[4]/$folders[5]";
            my $dir2 = "/$folders[1]/$folders[2]/$folders[3]/$folders[4]/$value";
            system("cp -r $dir1 $dir2");
        }
    }
}
This works. It looks like part of my problem from before was that I was trying to run this on a directory in my D: drive--when I moved it to the C: drive, it worked fine without any permissions errors or anything. I did try to implement something with Path::Tiny, but this script was so close to being functional (and it was functional in a Unix environment), that I decided to just complete it.
You really should read the Path::Tiny documentation; it probably contains everything you need.
Some starting points, without error handling and so on...
use strict;
use warnings;
use feature qw(say);
use Path::Tiny;

my $start = path('D:/DATA/DATA_x');
my $iter  = path($start)->iterator({recurse => 1});

while ( my $curr = $iter->() ) {
    # select the needed files here - add more conditions if needed
    next if $curr->is_dir;                    # skip directories
    next if $curr =~ m/HHH.*\.{2,3}txt$/;     # skip the ..txt and ...txt files
    #say "$curr";
    my $content = $curr->slurp;
    if ( $content =~ m/some_pattern/ ) {
        # do something with the file
        say "doing something with $curr";
        my $newfilename = path("insert what you need here"); # create the needed new path for the file
        path($newfilename->dirname)->mkpath;  # make directories
        $curr->move($newfilename);            # move the file
    }
}
Are you sure of the directory path you are trying to create? The mkdir call will fail if any of the intermediate directories don't exist. If your code ensures that the variable $new_over_dir contains the directory path you have to create, you can use the make_path function from the core module File::Path to create the new directory instead of mkdir.
From the documentation of make_path:
The make_path function creates the given directories if they don't
exist before, much like the Unix command mkdir -p.
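A minimal sketch, assuming $new_over_dir already holds the full path to create:
use File::Path qw(make_path);

# creates every missing intermediate directory, like `mkdir -p`;
# make_path raises a fatal error itself if it cannot create the path
make_path($new_over_dir);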

How to create the next file or folder in a series of progressively numbered files?

Sorry for the bad title but this is the best I could do! :D
I have a script which creates a new project every time the specified function is called.
Each project must be stored in its own folder, with the name of the project. But, if you don't specify a name, the script will just name it "new projectX", where X is a progressive number.
With time the user could rename the folders or delete some, so every time the script runs, it checks for the smallest number available (not used by another folder) and creates the relevant folder.
Now I managed to make a program which I think works as wanted, but I would like to hear from you if it's OK or there's something wrong which I'm unable to spot, given my inexperience with the language.
while ( defined( $file = readdir $projects_dir ) )
{
    # check for files whose name starts with "new project"
    if ( $file =~ m/^new project/i )
    {
        push( @files, $file );
    }
}

# remove letters from filenames, only the number is left
foreach $file ( @files )
{
    $file =~ s/[a-z]//ig;
}

@files = sort { $a <=> $b } @files;

# find the smallest number available
my $smallest_number = 0;

foreach $file ( @files )
{
    if ( $smallest_number != $file )
    {
        last;
    }
    $smallest_number += 1;
}
print "Smallest number is $smallest_number";
Here's a basic approach for this sort of problem:
sub next_available_dir {
    my $n = 1;
    my $d;
    $n++ while -e ($d = "new project$n");
    return $d;
}

my $project_dir = next_available_dir();
mkdir $project_dir;
If you're willing to use a naming pattern that plays nicely with Perl's string auto-increment feature, you can simplify the code further, eliminating the need for $n. For example, newproject000.
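A sketch of that variant (the name newproject000 is just an assumed pattern; Perl's magic string increment carries through the trailing digits):
my $d = 'newproject000';   # $d++ yields 'newproject001', then 'newproject002', ...
$d++ while -e $d;          # skip names that are already taken
mkdir $d or die "Can't create $d: $!";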
I think I would use something like:
use strict;
use warnings;

sub new_project_dir
{
    my($base) = @_;
    opendir(my $dh, $base) || die "Failed to open directory $base for reading";
    my $file;
    my @numbers;
    while ($file = readdir $dh)
    {
        $numbers[$1] = 1 if ($file =~ m/^new project(\d+)$/);
    }
    closedir($dh) || die "Failed to close directory $base";
    my $i;
    my $max = $#numbers;
    for ($i = 0; $i < $max; $i++)
    {
        next if (defined $numbers[$i]);
        # Directory did not exist when we scanned the directory
        # But maybe it was created since then!
        my $dir = "new project$i";
        next unless mkdir "$base/$dir";
        return $dir;
    }
    # All numbers from 0..$max were in use...so try adding new numbers...
    while ($i < $max + 100)
    {
        my $dir = "new project$i";
        $i++;
        next unless mkdir "$base/$dir";
        return $dir;
    }
    # Still failed - give in...
    die "Something is amiss - all directories 0..$i in use?";
}
Test code:
my $basedir = "base";
mkdir $basedir unless -d $basedir;

for (my $j = 0; $j < 10; $j++)
{
    my $dir = new_project_dir($basedir);
    print "Create: $dir\n";
    if ($j % 3 == 2)
    {
        my $k = int($j / 2);
        my $o = "new project$k";
        rmdir "$basedir/$o";
        print "Remove: $o\n";
    }
}
Try this:
#!/usr/bin/env perl
use strict;
use warnings;

# get the current list of files
# see `perldoc -f glob` for details.
my @files = glob( 'some/dir/new\\ project*' );

# set to first name, in case there are no others
my $next_file = 'new project1';

# check for others
if( @files ){
    # a Schwartzian transform
    @files = map  { $_->[0] }               # get original
             sort { $a->[1] <=> $b->[1] }   # sort by second field, which holds the numbers
             map  { [ $_, do{ ( my $n = $_ ) =~ s/\D//g; $n } ] } # anonymous array of the original value and a digits-only copy
             @files;

    # last file name is the biggest
    $next_file = $files[-1];

    # add one to the full trailing number
    $next_file =~ s/(\d+)$/$1+1/e;
}

print "next file: $next_file\n";
Nothing wrong per se, but that's an awful lot of code to achieve a single objective (getting the minimum index of the directories).
A core module, a couple of subs and a few Schwartzian transforms will make the code more flexible:
use strict;
use warnings;
use List::Util 'min';

sub num { my ($n) = @_; $n =~ s|\D+||g; $n } # 'new project4' -> '4', 'new1_project4' -> '14' (!)

sub min_index {
    my ( $dir, $filter ) = @_;
    $filter = qr/./ unless defined $filter;  # match all if no filter specified
    opendir my $dirHandle, $dir or die $!;
    my @candidates = grep { /$filter/ } readdir $dirHandle;  # a directory handle can only be read once
    closedir $dirHandle;
    my $lowest_index = min              # get the smallest ...
        map  { num($_) }                # ... numerical value ...
        grep { -d "$dir/$_" }           # ... from all directories ...
        @candidates;                    # ... whose name matched the filter
    $lowest_index++ while grep { $lowest_index == num($_) } @candidates;
    return $lowest_index;
}

# Ready to use!
my $index = min_index( 'some/dir', qr/^new project/ );
my $new_project_name = "new project $index";

How can I compare the file list from a tar archive and a directory?

I am still learning Perl. Can anyone suggest Perl code to compare the file list of a .tar.gz archive against a directory path?
Let's say I have a tar.gz backup of the following directory tree, taken a few days back:
a/file1
a/file2
a/file3
a/b/file4
a/b/file5
a/c/file5
a/b/d/file and so on..
Now I want to compare files and directories under this path with the tar.gz backup file.
Please suggest Perl code to do that.
See Archive::Tar.
The Archive::Tar and File::Find modules will be helpful. A basic example is shown below. It just prints information about the files in a tar and the files in a directory tree.
It was not clear from your question how you want to compare the files. If you need to compare the actual content, the get_content() method in Archive::Tar::File will likely be needed. If a simpler comparison is adequate (for example, name, size, and mtime), you won't need much more than the methods used in the example below.
#!/usr/bin/perl
use strict;
use warnings;

# A utility function to display our results.
sub Print_file_info {
    print map("$_\n", @_), "\n";
}

# Print some basic information about files in a tar.
use Archive::Tar qw();
my $tar_file = 'some_tar_file.tar.gz';
my $tar = Archive::Tar->new($tar_file);

for my $ft ( $tar->get_files ){
    # The variable $ft is an Archive::Tar::File object.
    Print_file_info(
        $ft->name,
        $ft->is_file ? 'file' : 'other',
        $ft->size,
        $ft->mtime,
    );
}

# Print some basic information about files in a directory tree.
use File::Find;
my $dir_name = 'some_directory';
my @files;
find(sub {push @files, $File::Find::name}, $dir_name);

Print_file_info(
    $_,
    -f $_ ? 'file' : 'other',
    -s,
    (stat)[9],
) for @files;
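If you do need to compare contents, a minimal sketch along these lines should work; it assumes the names stored in the tar are relative paths that also exist under $dir_name, and it reuses $tar and Print_file_info from above:
for my $ft ( grep { $_->is_file } $tar->get_files ){
    my $on_disk = "$dir_name/" . $ft->name;
    if ( ! -e $on_disk ){
        Print_file_info( $ft->name, 'missing on disk' );
        next;
    }
    open my $fh, '<:raw', $on_disk or die "Can't open $on_disk: $!";
    my $disk_data = do { local $/; <$fh> };   # slurp the on-disk copy
    close $fh;
    Print_file_info( $ft->name, 'content differs' )
        if $disk_data ne $ft->get_content;    # get_content returns the archived bytes
}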
Perl is kind of overkill for this, really. A shell script would do fine. The steps you need to take though:
Extract the tar to a temporary folder somewhere.
diff -uR the two folders and redirect the output somewhere (or perhaps pipe to less as appropriate)
Clean up the temporary folder.
And you're done. Shouldn't be more than 5-6 lines. Something quick and untested:
#!/bin/sh
mkdir $TEMP/$$
tar -xz -f ../backups/backup.tgz -C $TEMP/$$
diff -uR $TEMP/$$ ./ | less
rm -rf $TEMP/$$
Here's an example that checks whether every file in an archive also exists in a folder.
# $1 is the file to test
# $2 is the base folder
for file in $( tar --list -f $1 | perl -pe'chomp;$_=qq["'$2'$_" ]' )
do
# work around bash deficiency
if [[ -e "$( perl -eprint$file )" ]]
then
echo " $file"
else
echo "no $file"
fi
done
This is how I tested this:
I removed / renamed config, then ran the following:
bash test Downloads/update-dnsomatic-0.1.2.tar.gz Downloads/
Which gave the output of:
"Downloads/update-dnsomatic-0.1.2/"
no "Downloads/update-dnsomatic-0.1.2/config"
"Downloads/update-dnsomatic-0.1.2/update-dnsomatic"
"Downloads/update-dnsomatic-0.1.2/README"
"Downloads/update-dnsomatic-0.1.2/install.sh"
I am new to bash / shell programming, so there is probably a better way to do this.
This might be a good starting point for a proper Perl program. It does what the question asked for, though.
It was just hacked together, and it ignores most of the best practices for Perl.
perl test.pl full \
Downloads/update-dnsomatic-0.1.2.tar.gz \
Downloads/ \
update-dnsomatic-0.1.2
#! /usr/bin/env perl
use strict;
use 5.010;
use warnings;

use autodie;
use Archive::Tar;
use File::Spec::Functions qw'catfile catdir';

my($action,$file,$directory,$special_dir) = @ARGV;

if( @ARGV == 1 ){
  $file = *STDIN{IO}; # only an action given: read the archive from standard input
}
if( @ARGV == 3 ){
  $special_dir = '';
}
sub has_file(_);
sub same_size($$);
sub find_missing(\%$);

given( lc $action ){

  # only compare names
  when( @{[qw'simple name names']} ){
    my @list = Archive::Tar->list_archive($file);
    say qq'missing file: "$_"' for grep{ ! has_file } @list;
  }

  # compare names, sizes, contents
  when( @{[qw'full aggressive']} ){
    my $next = Archive::Tar->iter($file);

    my( %visited );
    while( my $file = $next->() ){
      next unless $file->is_file;
      my $name = $file->name;
      $visited{$name} = 1;

      unless( has_file($name) ){
        say qq'missing file: "$name"';
        next;
      }

      unless( same_size( $name, $file->size ) ){
        say qq'different size: "$name"';
        next;
      }

      next unless $file->size;

      unless( same_checksum( $name, $file->get_content ) ){
        say qq'different checksums: "$name"';
        next;
      }
    }

    say qq'file not in archive: "$_"' for find_missing %visited, $special_dir;
  }
}

sub has_file(_){
  my($file) = @_;
  if( -e catfile $directory, $file ){
    return 1;
  }
  return;
}

sub same_size($$){
  my($file,$size) = @_;
  if( -s catfile($directory,$file) == $size ){
    return $size || '0 but true';
  }
  return; # empty list/undefined
}

sub same_checksum{
  my($file,$contents) = @_;
  require Digest::SHA1;
  my($outside,$inside);
  my $sha1 = Digest::SHA1->new;
  {
    open my $io, '<', catfile $directory, $file;
    $sha1->addfile($io);
    close $io;
    $outside = $sha1->digest;
  }
  $sha1->add($contents);
  $inside = $sha1->digest;
  return 1 if $inside eq $outside;
  return;
}

sub find_missing(\%$){
  my($found,$current_dir) = @_;
  my(@dirs,@files);
  {
    my $open_dir = catdir($directory,$current_dir);
    opendir my($h), $open_dir;
    while( my $elem = readdir $h ){
      next if $elem =~ /^[.]{1,2}[\\\/]?$/;
      my $path = catfile $current_dir, $elem;
      my $open_path = catfile $open_dir, $elem;
      given($open_path){
        when( -d ){
          push @dirs, $path;
        }
        when( -f ){
          push @files, $path unless $found->{$path};
        }
        default{
          die qq'not a file or a directory: "$path"';
        }
      }
    }
  }
  for my $path ( @dirs ){
    push @files, find_missing %$found, $path;
  }
  return @files;
}
After renaming config to config.rm, adding an extra char to README, changing a char in install.sh, and adding a file .test, this is what it output:
missing file: "update-dnsomatic-0.1.2/config"
different size: "update-dnsomatic-0.1.2/README"
different checksums: "update-dnsomatic-0.1.2/install.sh"
file not in archive: "update-dnsomatic-0.1.2/config.rm"
file not in archive: "update-dnsomatic-0.1.2/.test"

How do I read in the contents of a directory in Perl?

How do I get Perl to read the contents of a given directory into an array?
Backticks can do it, but is there some method using 'scandir' or a similar term?
opendir(D, "/path/to/directory") || die "Can't open directory: $!\n";
while (my $f = readdir(D)) {
    print "\$f = $f\n";
}
closedir(D);
EDIT: Oh, sorry, missed the "into an array" part:
my $d = shift;

opendir(D, "$d") || die "Can't open directory $d: $!\n";
my @list = readdir(D);
closedir(D);

foreach my $f (@list) {
    print "\$f = $f\n";
}
EDIT2: Most of the other answers are valid, but I wanted to comment on this answer specifically, in which this solution is offered:
opendir(DIR, $somedir) || die "Can't open directory $somedir: $!";
@dots = grep { (!/^\./) && -f "$somedir/$_" } readdir(DIR);
closedir DIR;
First, to document what it's doing since the poster didn't: it's passing the returned list from readdir() through a grep() that only returns those values that are files (as opposed to directories, devices, named pipes, etc.) and that do not begin with a dot (which makes the list name @dots misleading, but that's due to the change he made when copying it over from the readdir() documentation). Since it limits the contents of the directory it returns, I don't think it's technically a correct answer to this question, but it illustrates a common idiom used to filter filenames in Perl, and I thought it would be valuable to document. Another example seen a lot is:
@list = grep !/^\.\.?$/, readdir(D);
This snippet reads everything from the directory handle D except '.' and '..', since those are rarely wanted in a listing.
A quick and dirty solution is to use glob:
@files = glob ('/path/to/dir/*');
This will do it, in one line (note the '*' wildcard at the end):
@files = </path/to/directory/*>;
# To demonstrate:
print join(", ", @files);
IO::Dir is nice and provides a tied hash interface as well.
From the perldoc:
use IO::Dir;
$d = IO::Dir->new(".");
if (defined $d) {
    while (defined($_ = $d->read)) { something($_); }
    $d->rewind;
    while (defined($_ = $d->read)) { something_else($_); }
    undef $d;
}

tie %dir, 'IO::Dir', ".";
foreach (keys %dir) {
    print $_, " ", $dir{$_}->size, "\n";
}
So you could do something like:
tie %dir, 'IO::Dir', $directory_name;
my @dirs = keys %dir;
You could use DirHandle:
use DirHandle;
$d = new DirHandle ".";
if (defined $d)
{
    while (defined($_ = $d->read)) { something($_); }
    $d->rewind;
    while (defined($_ = $d->read)) { something_else($_); }
    undef $d;
}
DirHandle provides an alternative, cleaner interface to the opendir(), closedir(), readdir(), and rewinddir() functions.
Similar to the above, but I think the best version is (slightly modified) from "perldoc -f readdir":
opendir(DIR, $somedir) || die "can't opendir $somedir: $!";
@dots = grep { (!/^\./) && -f "$somedir/$_" } readdir(DIR);
closedir DIR;
You can also use the children method from the popular Path::Tiny module:
use Path::Tiny;
my @files = path("/path/to/dir")->children;
This creates an array of Path::Tiny objects, which are often more useful than just filenames if you want to do things to the files, but if you want just the names:
my @files = map { $_->stringify } path("/path/to/dir")->children;
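And since the children are objects, filtering and sorting them is direct; for example (a small sketch; the file tests work because the objects stringify to their paths):
use Path::Tiny;

# keep only plain files, sorted by size
my @files = sort { -s $a <=> -s $b }
            grep { $_->is_file }
            path("/path/to/dir")->children;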
Here's an example of recursing through a directory structure and copying files from a backup script I wrote.
use File::stat;   # object interface to stat(), needed for ->mtime below
use File::Copy;   # provides copy()

sub copy_directory {
    my ($source, $dest) = @_;
    my $start = time;

    # get the contents of the directory.
    opendir(D, $source);
    my @f = readdir(D);
    closedir(D);

    # recurse through the directory structure and copy files.
    foreach my $file (@f) {
        # Setup the full path to the source and dest files.
        my $filename = $source . "\\" . $file;
        my $destfile = $dest . "\\" . $file;

        # get the file info for the 2 files.
        my $sourceInfo = stat( $filename );
        my $destInfo   = stat( $destfile );

        # make sure the destination directory exists.
        mkdir( $dest, 0777 );

        if ($file eq '.' || $file eq '..') {
        } elsif (-d $filename) { # if it's a directory then recurse into it.
            #print "entering $filename\n";
            copy_directory($filename, $destfile);
        } else {
            # Only backup the file if it has been created/modified since the last backup
            if( (not -e $destfile) || ($sourceInfo->mtime > $destInfo->mtime ) ) {
                #print $filename . " -> " . $destfile . "\n";
                copy( $filename, $destfile ) or print "Error copying $filename: $!\n";
            }
        }
    }

    print "$source copied in " . (time - $start) . " seconds.\n";
}
from: http://perlmeme.org/faqs/file_io/directory_listing.html
#!/usr/bin/perl
use strict;
use warnings;

my $directory = '/tmp';

opendir (DIR, $directory) or die $!;

while (my $file = readdir(DIR)) {
    next if ($file =~ m/^\./);
    print "$file\n";
}
The following example (based on a code sample from perldoc -f readdir) gets all the files (not directories) beginning with a period from the open directory. The filenames are found in the array @dots.
#!/usr/bin/perl
use strict;
use warnings;

my $dir = '/tmp';

opendir(DIR, $dir) or die $!;

my @dots
    = grep {
        /^\./             # Begins with a period
        && -f "$dir/$_"   # and is a file
    } readdir(DIR);

# Loop through the array printing out the filenames
foreach my $file (@dots) {
    print "$file\n";
}

closedir(DIR);
exit 0;