The script reads files from an input directory that contains 5 different files. I am trying to set a priority on the files while processing them.
opendir ( INPUT_DIR, $ENV{INPUT_DIR} ) || die "Error in opening dir $ENV{INPUT_DIR}";
my @input_files = grep { !/^\./ } readdir(INPUT_DIR);
foreach my $input_file (@input_files)
{
    if ($input_file =~ m/^$proc_mask$/i)
    {
        # processing files
    }
}
For example, I have 5 files:
Creation.txt
Creation_extra.txt
Modify.txt
Modify_add.txt
Delete.txt
Once these input files are read, I want to set a priority so that Creation_extra.txt is processed first and then Delete.txt.
I am not able to set a priority when reading the files and then processing them.
If I understand you correctly, you want to be able to point out some high priority file names that should be processed before other files. Here's a way:
use strict;
use warnings;
use feature 'say';
my @files = <DATA>;                     # simulate reading dir
chomp @files;                           # remove newlines
my %prio;
@prio{ @files } = (0) x @files;         # set default prio = 0
my @high_prio = qw(Creation_extra.txt Delete.txt);   # high prio list
# to set high prio we only want existing files
for (@high_prio) {
    if (exists $prio{$_}) {             # check if file name exists
        $prio{$_} = 1;                  # set prio
    }
}
# now process files by sorting by prio, or alphabetical if same prio
for (sort { $prio{$b} <=> $prio{$a} || $a cmp $b } @files) {
    say;
}
__DATA__
Creation.txt
Creation_extra.txt
Modify.txt
Modify_add.txt
Delete.txt
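For the five file names above, this should print the two high-priority files first and then the rest alphabetically:
Creation_extra.txt
Delete.txt
Creation.txt
Modify.txt
Modify_add.txt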
I have 1500 files in one directory and I need to get some information out of every one and write it into a new, single file. The file names consist of a word and a number (Temp1, Temp2, Temp3 and so on) and it is important that the files are read in the correct order according to the numbers.
I did this using
my @files = <Temp*.csv>;
for my $file (@files)
{
    # this part appends the required data to a separate file and works fine
}
My problem now is that the files are not opened in the correct order: after file 1, file 100 gets opened.
Can anybody please give me a hint how I can make it read the files in the right order?
Thank you,
Ca
Sort the files naturally with Sort::Key::Natural natsort.
The following will automatically sort the files naturally, separating out alpha and numerical portions of the name for the appropriate sort logic.
use strict;
use warnings;
use Sort::Key::Natural qw(natsort);
for my $file ( natsort <Temp*.csv> ) {
    # this part appends the required data to a separate file and works fine
}
The following fake data should demonstrate this module in action:
use strict;
use warnings;
use Sort::Key::Natural qw(natsort);
print natsort <DATA>;
__DATA__
Temp100.csv
Temp8.csv
Temp20.csv
Temp1.csv
Temp7.csv
Outputs:
Temp1.csv
Temp7.csv
Temp8.csv
Temp20.csv
Temp100.csv
You can use a Schwartzian transform to read and sort the files in one step,
my @files =
    map  { $_->[0] }
    sort { $a->[1] <=> $b->[1] }
    map  { [ $_, /(\d+)/ ] } <Temp*.csv>;
or use a less efficient but more straightforward sort,
my @files = sort { ($a =~ /(\d+)/)[0] <=> ($b =~ /(\d+)/)[0] } <Temp*.csv>;
If the numbers are really important, you might want to build the file names from the numbers directly, with error reporting for missing files:
my @nums = 1 .. 1500; # or whatever the highest is
for my $num (@nums) {
my $file = "Temp$num.csv";
unless (-e $file) {
warn "Missing file: $file";
next;
}
...
# proceed as normal
}
If you need a file count, you can simply use your old glob:
my @files = <Temp*.csv>;
my $count = @files; # get the size of the array
my @nums = 1 .. $count;
On the other hand, if you control the process that prints the files, you might select a format that will automatically sort itself (see the sprintf sketch after this list), such as:
temp00001.csv
temp00002.csv
temp00003.csv
temp00004.csv
...
temp00101.csv
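For instance, a sprintf-based name gives you that zero padding (a small illustrative sketch; the field width of 5 is an arbitrary choice):
my $n    = 101;
my $name = sprintf 'temp%05d.csv', $n;   # yields "temp00101.csv"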
I have tried writing a script that sorts an input text file in descending order and prints only the top-usage customers.
The input text file contains lines of the form:
NAME,USAGE,IP
For example:
Abc,556,10.2.3.5
bbc,126,14.2.5.6
and so on. This is a very large file, and I am trying to avoid loading the whole file into memory.
I have tried the following script.
use warnings ;
use strict;
my %hash = ();
my $file = $ARGV[0];

open (my $fh, "<", $file) or die "Can't open the file $file: $!";
while (my $line = <$fh>)
{
    chomp ($line);
    my ($name, $key, $ip) = split /,/, $line;
    $hash{$key} = [ $name, $ip ];
}

my $count = 0;
foreach ( sort { $b <=> $a } keys %hash ) {
    my $value = $hash{$_};
    print "$_ @{$value}\n";
    last if (++$count == 5);
}
The output should be sorted based on usage and show the name and IP for each usage value.
I think you want to print the five lines of the file that have the highest value in the second column.
That can be done by a sort of insertion sort that checks each line of the file to see if it comes higher than the lowest of the five lines most recently found, but it's easier to just accumulate a sensible subset of the data, sort it, and discard all but the top five.
Here, I have an array @top containing lines from the file. When there are 100 lines in the array, it is sorted and reduced to the five maximal entries. Then the while loop continues to add lines to the array until it reaches the limit again or the end of the file has been reached, at which point the process is repeated. That way, no more than 100 lines from the file are ever held in memory.
I have generated a 1,000-line data file to test this, with random values between 100 and 2,000 in column 2. The output below is the result.
use strict;
use warnings 'all';
open my $fh, '<', 'usage.txt' or die $!;
my @top;

while ( <$fh> ) {
    push @top, $_;

    if ( @top >= 100 or eof ) {
        @top = sort {
            my ($aa, $bb) = map { (split /,/)[1] } ($a, $b);
            $bb <=> $aa;
        } @top;
        @top = @top[0..4];
    }
}

print @top;
output
qcmmt,2000,10.2.3.5
ciumt,1999,10.2.3.5
eweae,1998,10.2.3.5
gvhwv,1998,10.2.3.5
wonmd,1993,10.2.3.5
The standard way to do this is to create a priority queue that contains k items, where k is the number of items you want to return. So if you want the five lines that have the highest value, you'd do the following:
pq = new priority_queue
add the first five items in the file to the priority queue
for each remaining line in the file
if value > lowest value on pq
remove lowest value on the pq
add new value to pq
When you're done going through the file, pq will contain the five items with the highest value.
To do this in Perl, use the Heap::Priority module.
This will be faster and use less memory than the other suggestions.
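A minimal sketch of that idea without any module, keeping a small sorted array of [usage, line] pairs (it assumes the NAME,USAGE,IP format from the question, reads from STDIN, and skips any non-numeric usage field such as a header line):
use strict;
use warnings;

my $k = 5;
my @top;                     # [usage, original line] pairs, kept sorted ascending by usage

while (my $line = <STDIN>) {
    chomp $line;
    my (undef, $usage) = split /,/, $line;
    next unless defined $usage && $usage =~ /^\d+$/;   # skip header / malformed lines
    if (@top < $k or $usage > $top[0][0]) {
        shift @top if @top >= $k;                      # drop the current lowest
        @top = sort { $a->[0] <=> $b->[0] } @top, [ $usage, $line ];
    }
}
print "$_->[1]\n" for reverse @top;                    # highest usage first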
An algorithm that remembers the 5 biggest rows seen so far.
For each row, compare its value with the lowest remembered one. If it is bigger, insert it into the array just before the next bigger item and shift the lowest one out.
use warnings;
use strict;
my $file = $ARGV[0] ;
my @keys = (0, 0, 0, 0, 0);
my @res;
open (my $fh, "<", $file) or die "Can't open the file $file: $!";
while (<$fh>)
{
    my ($name, $key, $ip) = split /,/;
    next if ($key < $keys[0]);
    for (0..4) {
        if ($_ == 4 || $key < $keys[$_+1]) {
            @keys[0..$_-1] = @keys[1..$_] if ($_ > 0);
            $keys[$_] = $key;
            $res[$_] = [ $name, $ip ];
            last;
        }
    }
}
for (0..4) {
    print "$keys[4-$_] @{$res[4-$_]}";
}
Test on a file of 1M random rows (20 MB):
Last items (This algorithm):
Start 1472567980.91183
End 1472567981.94729 (duration 1.03546 seconds)
full sort in memory (Algorithm of @Rishi):
Start 1472568441.00438
End 1472568443.43829 (duration 2.43391 seconds)
sort by parts of 100 rows (Algorithm of @Borodin):
Start 1472568185.21896
End 1472568195.59322 (duration 10.37426 seconds)
I need to create a Perl script to check the first four characters of the file names of all the files in a path, and compare them with a text file containing those four-character sequences.
The idea is to check whether any file starting with a list of numbers is missing.
For Example. Files in path D:/temp are
1234-2041-123.txt
1194-2041-123.txt
3234-2041-123.txt
1574-2041-123.txt
I need to compare the first four letters of each filename - 1234, 1194, 3234, 1574 - with a text file containing the sequences 1234, 1194, 3234, 1574, 1111, 2222 and produce the output
File starting with 1111, 2222 is missing.
I hope I am clear.
I am able to take out the first four characters from the file name but cannot proceed further
my @files = <d:/temp/*>;
foreach my $file (@files) {
    my $xyz = substr $file, 8, 4;
    print $xyz . "\n";
}
Similarly to @F. Hauri's proposed solution, I propose a hash-based solution:
use strict;
use warnings;
use feature 'say';
use Data::Dumper;

# Get this list from the file. Here, 'auto' and 'cron' will exist in
# /etc, but 'fake' and 'mdup' probably won't.
my @expected_prefixes = qw| auto cron fake mdup |;

# Initialize entries for each prefix in the seed file to false
my %prefix_list;
@prefix_list{ @expected_prefixes } = ();

opendir my $dir, "/etc" or die $!;
while (my $file_name = readdir $dir) {
    my $first_four = substr $file_name, 0, 4;
    # Increment element for the prefix for found files
    $prefix_list{$first_four}++;
}

# Get list of prefixes with no matching files found in the directory
my @missing_files = grep { ! $prefix_list{$_} } keys %prefix_list;
say "Missing files: " . Dumper(\@missing_files);
This solution works by creating a hash from all of the values in the file prefixes.txt, then deleting elements from that hash as files are found starting with each sequence.
In addition, if any file name starts with a sequence that doesn't appear in the file then a warning is printed.
The output is simply a matter of listing all the elements of the hash that remain after this process.
use strict;
use warnings;
my %prefixes;

open my $fh, '<', 'prefixes.txt' or die $!;
while (<$fh>) {
    chomp;
    $prefixes{$_} = 1;
}

my @files = qw/
    1234-2041-123.txt
    1194-2041-123.txt
    3234-2041-123.txt
    1574-2041-123.txt
/;

for my $name (@files) {
    my $pref = substr $name, 0, 4;
    if ($prefixes{$pref}) {
        delete $prefixes{$pref};
    }
    else {
        warn qq{Prefix for file "$name" not listed};
    }
}

printf "File starting with %s is missing.\n", join ', ', sort keys %prefixes;
output
File starting with 1111, 2222 is missing.
I need to remove any lines that contain certain keywords in them from a huge list of text files I have in a directory.
For example, I need all lines with any of these keywords in them to be removed: test1, example4, coding9
This is the closest example to what I'm trying to do that I can find:
sed '/Unix\|Linux/d' *.txt
Note: the lines don't need to contain all the keywords to be removed, just one should remove it :)
It appears that you are looking for a one-liner to read and write back thousands of files with millions of lines. Personally, I wouldn't do it that way; I would prefer to write a quick and dirty Perl script. I tested this very briefly on simple files and it works, but since you are working with thousands of files and millions of lines, test whatever you write in a separate directory with a few of the files first so you can verify the results.
#!/usr/bin/perl
use strict;
use warnings;

# the initial directory to read from
my $directory = 'tmp';
opendir (DIR, $directory) or die $!;

my @keywords = ('woohoo', 'blah');

while (my $file = readdir(DIR)) {
    # ignore files that begin with a period
    next if ($file =~ m/^\./);

    # open the file
    open my $in, '<', "$directory/$file" or die $!;

    # initialize empty file_lines
    my @file_lines = ();

    # roll through and push the line into the new array if no keywords are found
    while (<$in>) {
        next if checkForKeyword($_);
        push @file_lines, $_;
    }
    close $in;

    # save in a temporary file for testing
    # just change these 2 variables to fit your needs
    my $save_directory = $directory . '-save';
    my $save_file = $file . '-tmp.txt';
    if (! -d $save_directory) {
        mkdir $save_directory or die $!;
    }
    my $new_file = $save_directory . '/' . $save_file;
    open my $out, '>', $new_file or die $!;
    print {$out} @file_lines;
    close $out;
}
closedir DIR;

# roll through each keyword and return 1 if found, '' if not
sub checkForKeyword
{
    my $line = shift;
    for (0 .. $#keywords) {
        my $k = $keywords[$_];
        if ($line =~ m/$k/) {
            return 1;
        }
    }
    return '';
}
I am using the Perl stat() function to get the size of a directory and its subdirectories. I have a list of about 20 parent directories which have a few thousand recursive subdirs, and every subdir has a few hundred records.
The main computing part of the script looks like this:
sub getDirSize {
my $dirSize = 0;
my @dirContent = <*>;
my $sizeOfFilesInDir = 0;
foreach my $dirContent (@dirContent) {
if (-f $dirContent) {
my $size = (stat($dirContent))[7];
$dirSize += $size;
} elsif (-d $dirContent) {
$dirSize += getDirSize($dirContent);
}
}
return $dirSize;
}
The script runs for more than one hour and I want to make it faster.
I tried the shell du command, but the output of du (converted to bytes) is not accurate, and it is also quite time-consuming.
I am working on HP-UNIX 11i v1.
With some help from sfink and samtregar on perlmonks, try this one out:
#!/usr/bin/perl
use warnings;
use strict;
use File::Find;
my $size = 0;
find( sub { $size += -f $_ ? -s _ : 0 }, shift(@ARGV) );
print $size, "\n";
Here we're recursing all subdirs of the specified dir, getting the size of each file, and we re-use the stat from the file test by using the special '_' syntax for the size test.
I tend to believe that du would be reliable enough though.
I once faced a similar problem, and used a parallelization approach to speed it up. Since you have ~20 top-tier directories, this might be a pretty straightforward approach for you to try.
Split your top-tier directories into several groups (how many groups is best is an empirical question), call fork() a few times and analyze directory sizes in the child processes. At the end of the child processes, write out your results to some temporary files. When all the children are done, read the results out of the files and process them.
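A rough sketch of that idea; the group count, the temporary file names under /tmp, and the File::Find-based getDirSize() helper are all illustrative choices, not part of the original suggestion:
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

# illustrative helper: recursive size of one directory in bytes
sub getDirSize {
    my $dir  = shift;
    my $size = 0;
    find( sub { $size += -s $_ if -f $_ }, $dir );
    return $size;
}

my @dirs   = @ARGV;        # the ~20 top-tier directories
my $groups = 4;            # how many child processes to use (tune empirically)
my @pids;

for my $g (0 .. $groups - 1) {
    # give every $groups-th directory to group $g
    my @work = @dirs[ grep { $_ % $groups == $g } 0 .. $#dirs ];
    next unless @work;

    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ($pid == 0) {       # child: measure its share, write the result, exit
        my $total = 0;
        $total += getDirSize($_) for @work;
        open my $out, '>', "/tmp/dirsize.$$" or die $!;
        print {$out} "$total\n";
        close $out;
        exit 0;
    }
    push @pids, $pid;      # parent: remember the child
}

# collect the per-group results once all children are done
my $grand_total = 0;
for my $pid (@pids) {
    waitpid $pid, 0;
    open my $in, '<', "/tmp/dirsize.$pid" or die $!;
    chomp( my $n = <$in> );
    close $in;
    unlink "/tmp/dirsize.$pid";
    $grand_total += $n;
}
print "$grand_total\n";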
Whenever you want to speed something up, your first task is to find out what's slow. Use a profiler such as Devel::NYTProf to analyze the program and find out where you should concentrate your efforts.
In addition to reusing that data from the last stat, I'd get rid of the recursion since Perl is horrible at it. I'd construct a stack (or a queue) and work on that until there is nothing left to process.
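A minimal sketch of that stack-based approach: the starting directory is taken from the command line, symlinks are not treated specially, and the stat result is reused through the special _ filehandle.
#!/usr/bin/perl
use strict;
use warnings;

my $size  = 0;
my @queue = @ARGV ? @ARGV : ('.');     # directories still to be processed

while (@queue) {
    my $dir = pop @queue;
    opendir my $dh, $dir or next;      # skip directories we cannot read
    for my $entry (readdir $dh) {
        next if $entry eq '.' or $entry eq '..';
        my $path = "$dir/$entry";
        stat $path;                    # stat once, then reuse the result via _
        if    (-f _) { $size += -s _ }
        elsif (-d _) { push @queue, $path }   # defer instead of recursing
    }
    closedir $dh;
}
print "$size\n";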
Below is another variant of getDirSize() which doesn't require a reference to a variable holding the current size and accepts a parameter to indicate whether sub-directories shall be considered or not:
#!/usr/bin/perl
print 'Size (without sub-directories): ' . getDirSize(".") . " bytes\n";
print 'Size (incl. sub-directories): ' . getDirSize(".", 1) . " bytes\n";
sub getDirSize
# Returns the size in bytes of the files in a given directory and eventually its sub-directories
# Parameters:
# $dirPath (string): the path to the directory to examine
# $subDirs (optional boolean): FALSE (or missing) = consider only the files in $dirPath, TRUE = include also sub-directories
# Returns:
# $size (int): the size of the directory's contents
{
my ($dirPath, $subDirs) = @_; # Get the parameters
my $size = 0;
opendir(my $DH, $dirPath);
foreach my $dirEntry (readdir($DH))
{
stat("${dirPath}/${dirEntry}"); # Stat once and then refer to "_"
if (-f _)
{
# This is a file
$size += -s _;
}
elsif (-d _)
{
# This is a sub-directory: add the size of its contents
$size += getDirSize("${dirPath}/${dirEntry}", 1) if ($subDirs && ($dirEntry ne '.') && ($dirEntry ne '..'));
}
}
closedir($DH);
return $size;
}
I see a couple of problems. One: @dirContent is explicitly set to <*>, and this will be reset each time you enter getDirSize. The result will be an infinite loop, at least until you exhaust the stack (since it is a recursive call). Secondly, there is a special filehandle notation for retrieving information from a stat call -- the underscore (_). See: http://perldoc.perl.org/functions/stat.html. Your code as-is calls stat three times for essentially the same information (-f, stat, and -d). Since file I/O is expensive, what you really want is to call stat once and then reference the data using "_". Here is some sample code that I believe accomplishes what you are trying to do.
#!/usr/bin/perl
my $size = 0;
getDirSize(".",\$size);
print "Size: $size\n";
sub getDirSize {
my $dir = shift;
my $size = shift;
opendir(D,"$dir");
foreach my $dirContent (grep(!/^\.\.?/,readdir(D))) {
stat("$dir/$dirContent");
if (-f _) {
$$size += -s _;
} elsif (-d _) {
getDirSize("$dir/$dirContent",$size);
}
}
closedir(D);
}
Bigs's answer is good. I modified it slightly, as I wanted to get the sizes of all the folders under a given path on my Windows machine.
This is how I did it.
#!/usr/bin/perl
use strict;
use warnings;
use File::stat;
my $dirname = "C:\\Users\\xxx\\Documents\\initial-docs";
opendir (my $DIR, $dirname) || die "Error while opening dir $dirname: $!\n";
my $dirCount = 0;
foreach my $dirFileName(sort readdir $DIR)
{
next if $dirFileName eq '.' or $dirFileName eq '..';
my $dirFullPath = "$dirname\\$dirFileName";
#only check if its a dir and skip files
if (-d $dirFullPath )
{
$dirCount++;
my $dirSize = getDirSize($dirFullPath, 1); #bytes
my $dirSizeKB = $dirSize/1000;
my $dirSizeMB = $dirSizeKB/1000;
my $dirSizeGB = $dirSizeMB/1000;
print("$dirCount - dir-name: $dirFileName - Size: $dirSizeMB (MB) ... \n");
}
}
print "folders in $dirname: $dirCount ...\n";
sub getDirSize
{
my ($dirPath, $subDirs) = @_; # Get the parameters
my $size = 0;
opendir(my $DH, $dirPath);
foreach my $dirEntry (readdir($DH))
{
stat("${dirPath}/${dirEntry}"); # Stat once and then refer to "_"
if (-f _)
{
# This is a file
$size += -s _;
}
elsif (-d _)
{
# This is a sub-directory: add the size of its contents
$size += getDirSize("${dirPath}/${dirEntry}", 1) if ($subDirs && ($dirEntry ne '.') && ($dirEntry ne '..'));
}
}
closedir($DH);
return $size;
}
1;
OUTPUT:
1 - dir-name: acct-requests - Size: 0.458696 (MB) ...
2 - dir-name: environments - Size: 0.771527 (MB) ...
3 - dir-name: logins - Size: 0.317982 (MB) ...
folders in C:\Users\xxx\Documents\initial-docs: 3 ...
If your main directory is overwhelmingly the largest consumer of directory and file inodes, then don't calculate it. Calculate the other part of the system and deduce the size of the rest from that (you can get used disk space from df in a couple of milliseconds). You might need to add a small 'fudge' factor to get to the same numbers. Also remember that if you calculate used space as root, you'll see some extra space compared to other users (5% is reserved in ext2/ext3 on Linux; I don't know about HP-UX).