Remove files less than n lines in Perl - perl

I'm writing a Perl script to remove files that have fewer than a given number of lines. What I have so far is
my $cmd = join('','wc -l ', $file); #prints number of lines to command line
if (system($cmd) < 4)
{
my $rmcmd = join('','rm ',$file);
system($rmcmd);
}
where $file is the name and location of a file.

There's no need to use system for this. Perl is perfectly capable of counting lines:
sub count_lines {
open my $fh, '<', shift;
while(local $_ = <$fh>) {} # loop through all lines
return $.;
}
unlink $file if count_lines($file) < 4;
I'm assuming your end goal is to have it search through a directory tree removing files with line count less than n. Check out File::Find and its nifty code generator find2perl to handle that part for you.

Related

Split file Perl

I want to split parts of a file. Here is what the start of the file looks like (it continues in same way):
Location Strand Length PID Gene
1..822 + 273 292571599 CDS001
906..1298 + 130 292571600 trxA
I want to split in Location column and subtract 822-1 and do the same for every row and add them all together. So that for these two results the value would be: (822-1)+1298-906) = 1213
How?
My code right now, (I don't get any output at all in the terminal, it just continue to process forever):
use warnings;
use strict;
my $infile = $ARGV[0]; # Reading infile argument
open my $IN, '<', $infile or die "Could not open $infile: $!, $?";
my $line2 = <$IN>;
my $coding = 0; # Initialize coding variable
while(my $line = $line2){ # reading the file line by line
# TODO Use split and do the calculations
my #row = split(/\.\./, $line);
my #row2 = split(/\D/, $row[1]);
$coding += $row2[0]- $row[0];
}
print "total amount of protein coding DNA: $coding\n";
So what I get from my code if I put:
print "$coding \n";
at the end of the while loop just to test is:
821
1642
And so the first number is correct (822-1) but the next number doesn't make any sense to me, it should be (1298-906). What I want in the end outside the loop:
print "total amount of protein coding DNA: $coding\n";
is the sum of all the subtractions of every line i.e. 1213. But I don't get anything, just a terminal that works on forever.
As a one-liner:
perl -nE '$c += $2 - $1 if /^(\d+)\.\.(\d+)/; END { say $c }' input.txt
(Extracting the important part of that and putting it into your actual script should be easy to figure out).
Explicitly opening the file makes your code more complicated than it needs to be. Perl will automatically open any files passed on the command line and allow you to read from them using the empty file input operator, <>. So your code becomes as simple as this:
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
my $total;
while (<>) {
my ($min, $max) = /(\d+)\.\.(\d+)/;
next unless $min and $max;
$total += $max - $min;
}
say $total;
If this code is in a file called adder and your input data is in add.dat, then you run it like this:
$ adder add.dat
1213
Update: And, to explain where you were going wrong...
You only ever read a single line from your file:
my $line2 = <$IN>;
And then you continually assign that same value to another variable:
while(my $line = $line2){ # reading the file line by line
The comment in this line is wrong. I'm not sure where you got that line from.
To fix your code, just remove the my $line2 = <$IN> line and replace your loop with:
while (my $line = <$IN>) {
# your code here
}

Run a script in multiple directories with multiple output files in Perl (problems comparing hash key values)

I have the script which looks something like this, which I want to use to search through the current directory I am in, open, all directories in that directory, open all files that match certain REs (fastq files that have a format such that every four lines go together), do some work with these files, and write some results to a file in each directory. (Note: the actual script does a lot more than this but I think I have a structural issue associated with the iteration over folders because the script works when a simplified version is used in one folder, and so I am posting a simplified version here)
#!user/local/perl
#Created by C. Pells, M. R. Snyder, and N. T. Marshall 2017
#Script trims and merges high throughput sequencing reads from fastq files for a specific primer set
use Cwd;
use warnings;
my $StartTime= localtime;
my $MasterDir = getcwd; #obtains a full path to the current directory
opendir (DIR, $MasterDir);
my #objects = readdir (DIR);
closedir (DIR);
foreach (#objects){
print $_,"\n";
}
my #Dirs = ();
foreach my $O (0..$#objects){
my $CurrDir = "";
if ((length ($objects[$O]) < 7) && ($O>1)){ #Checking if the length of the object name is < 7 characters. All samples are 6 or less. removing the first two elements: "." and ".."
$CurrDir = $MasterDir."/".$objects[$O]; #appends directory name to full path
push (#Dirs, $CurrDir);
}
}
foreach (#Dirs){
print $_,"\n";#checks that all directories were read in
}
foreach my $S (0..$#Dirs){
my #files = ();
opendir (DIR, $Dirs[$S]) || die "cannot open $Dirs[$S]: $!";
#files = readdir DIR; #reads in all files in a directory
closedir DIR;
my #AbsFiles = ();
foreach my $F (0..$#files){
my $AbsFileName = $Dirs[$S]."/".$files[$F]; #appends file name to full path
push (#AbsFiles, $AbsFileName);
}
foreach my $AF (0..$#AbsFiles){
if ($AbsFiles[$AF] =~ /_R2_001\.fastq$/m){ #finds reverse fastq file
my #readbuffer=();
#read in reverse fastq
my %RSeqHash;
my $c = 0;
print "Reading, reversing, complimenting, and trimming reverse fastq file $AbsFiles[$AF]\n";
open (INPUT1, $AbsFiles[$AF]) || die "Can't open file: $!\n";
while (<INPUT1>){
chomp ($_);
push(#readbuffer, $_);
if (#readbuffer == 4) {
$rsn = substr($readbuffer[0], 0, 45); #trims reverse seq name
$cc++ % 10000 == 0 and print "$rsn\n";
$RSeqHash{$rsn} = $readbuffer[1];
#readbuffer = ();
}
}
}
}
foreach my $AFx (0..$#AbsFiles){
if ($AbsFiles[$AFx] =~ /_R1_001\.fastq$/m){ #finds forward fastq file
print "Reading forward fastq file $AbsFiles[$AFx]\n";
open (INPUT2, $AbsFiles[$AFx]) || die "Can't open file: $!\n";
my $OutMergeName = $Dirs[$S]."/"."Merged.fasta";
open (OUT, ">", "$OutMergeName");
my $cc=0;
my #readbuffer = ();
while (<INPUT2>){
chomp ($_);
push(#readbuffer, $_);
if (#readbuffer == 4) {
my $fsn = substr($readbuffer[0], 0, 45); #trims forward seq name
#$cc++ % 10000 == 0 and print "$fsn\n$readbuffer[1]\n";
if ( exists($RSeqHash{$fsn}) ){ #checks to see if forward seq name is present in reverse seq hash
print "$fsn was found in Reverse Seq Hash\n";
print OUT "$fsn\n$readbuffer[1]\n";
}
else {
$cc++ % 10000 == 0 and print "$fsn not found in Reverse Seq Hash\n";
}
#readbuffer = ();
}
}
close INPUT1;
close INPUT2;
close OUT;
}
}
}
my $EndTime= localtime;
print "Script began at\t$StartTime\nCompleted at\t$EndTime\n";
Again, I know that the script works without iterating over folders. But with this version I just get empty output files. Due to the print functions I inserted in this script, I've determined that Perl cant find the variable $fsn as a key in the hash from INPUT2. I cant understand why because each file is there and it works when I don't iterate over folders so I know that the keys match. So either there is something simple I am missing or this is some sort of limitation to Perl's memory that I have found. Any help is appreciated!
Turns out my issue was with where I was declaring the hash. For some reason even though I only declare it after it finds the first input file. The script fails unless I declare the hash before the foreach loop that cycles through all items in #AbsFiles searching for the first input file, which is fine because it means that the hash is cleared in every new directory. But I don't understand why it failed like it was because it should only be declaring (or clearing) the hash when it finds the input file name. I guess I don't NEED to know why it didn't work before, but some help to understand would be nice.
I have to give credit to another user for helping me realize this. They attempted to answer my question but did not, and then gave me this hint about where I declare my hash in a comment on that answer. This answer has now been deleted so I can't credit that user for pointing me in this direction. I would love to know what they understand about Perl that I do not that made it clear to them that this was the problem. I apologize that I was busy with data analysis and a conference so I could not respond to that comment sooner.

Recursive grep in perl

I am new to perl. I have a directory structure. In each directory, I have a log file. I want to grep pattern from that file and do post processing. Right now I am grepping the pattern from those files using unix grep and putting into text file and reading that text file to do post processing, But I want to automate task of reading each file and grepping pattern from that file. In the code below the mdp_cgdis_1102.txt have grepped pattern from directories. I would really appreciate any help
#!usr/bin/perl
use strict;
use warnings;
open FILE, 'mdp_cgdis_1102.txt' or die "Cannot open file $!";
my #array = <FILE>;
my #arr;
my #brr;
foreach my $i (#array){
#arr = split (/\//, $i);
#brr = split (/\:/, $i);
print " $arr[0] --- $brr[2]";
}
It is unclear to me which part of the process needs automating. I'll go by "want to automate reading each file and grepping pattern from that file," whereby you presumably already have a list of files. If you actually need to build the file list as well see the added code below.
One way: pull all patterns from each file and store that in a hash (filename => arrayref-with-patterns)
my %file_pattern;
foreach my $file (#filelist) {
open my $fh, '<', $file or die "Can't open $file: $!";
$file_pattern{$file} = [ grep { /$pattern/ } <$fh> ];
close $fh;
}
The [ ] takes a reference to the list returned by grep, ie. constructs an "anonymous array", and that (reference) is assigned as a value to the $file key.
Now you can process your patterns, per log file
foreach my $filename (sort keys %file_pattern) {
print "Processing log $filename.\n";
my #patterns = #{$file_pattern{$filename}};
# Process the list of patterns in this log file
}
ADDED
In order to build the list of files #filelist used above, from a known list of directories, use core File::Find
module which recursively scans supplied directories and applies supplied subroutines
use File::Find;
find( { wanted => \&process_logs, preprocess => \&select_logs }, #dir_list);
Your subroutine process_logs() is applied to each file/directory that passed preprocessing by the second sub, with its name available as $File::Find::name, and in it you can either populate the hash with patterns-per-log as shown above, or run complete processing as needed.
Your subroutine select_logs() contains code to filter log files from all files in each directory, that File::Find would normally processes, so that process_file() only gets the log files.
Another way would be to use the other invocation
find(\&process_all, #dir_list);
where now the sub process_all() is applied to all entries (files and directories) found and thus this sub itself needs to ensure that it only processes the log files. See linked documentation.
The equivalent of
find ... -name '*.txt' -type f -exec grep ... {} +
is
use File::Find::Rule qw( );
my $base_dir_qfn = ...;
my $re = qr/.../;
my #log_qfns =
File::Find::Rule
->name(qr/\..txt\z/)
->file
->in($base_dir_qfn);
my $success = 1;
for my $log_qfn (#log_qfns) {
open(my $fh, '<', $log_qfn)
or do {
$success = 0;
warn("Can't open log file \"$log_qfn\": $!\n);
next;
};
while (<$fh>) {
print if /$re/;
}
}
exit(1) if !$success;
Use File::Find to traverse the directory.
In a loop go through all the logfiles:
Open the file
read it line by line
For each line, do a regular expression match (
if ($line =~ /pattern/) ) or use
if (index($line, $searchterm) >= 0) if you are looking for a certain static string.
If you find a match, print the line.
close the file
I hope that gives you enough pointers to get started. You will learn more if you find out how to do each of these steps in Perl by yourself (I pointed out the hard ones).

Copying files that starts with numbers to another directory

I've solved #1, But I got 2 question left.
Anyone that can help me?
From the directory local dir, list all the files that start with a number (0..9) and which file
extension is .song
Create the subdirectory ”local dir/selected” in which you will copy each one of these files after
numbering each (non blank) line in each one of them.
Print out, in a file called stats.txt, the following informations concerning each one of the files:
a) Number of (non blank) lines.
b) Number of paragraphs. A paragraph here is a block of text composed of non empty lines and
delimited at its beginning and at its end by either the beginning of the file, the end of the file or
by a blank line.
c) The mean size of a paragraph (in number of lines).
d) If, yes or no, all paragraphs in the file have the same length.
Bonus questions:
e) Detect each rhymes present in each file.
f) Give the ratio of rhyming lines towards the total number of lines.
For #1:
#!/usr/bin/perl
use strict;
use warnings;
my $directory = '/local_dir';
opendir (DIR, $directory) or die $!;
while (my $file = readdir(DIR))
{
# Use a regular expression to find files ending in .song.txt
next unless ($file =~ m/\.song.txt$/);
print "$file\n";
}
closedir(DIR);
exit 0;
Here's how you might go about doing the first and second step:
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;
my $src_dir = '';
my #files = glob $src_dir . qq([0-9]*.song.txt);
my $dest_dir = 'selected';
mkdir $dest_dir;
for my $file (#files) {
open my $fin, "<", $file;
open my $fout, ">", "$dest_dir/$file";
my $c = 1;
while (<$fin>) {
$_ = $c++ . " $_" unless /^$/;
print $fout $_;
}
}
Rather than using opendir, you can use glob to find the files that you are interested in, starting with a number [0-9] and ending with .song.txt. After creating the output directory, the for loop goes through these files and creates a new file in the destination directory with the same name. The while loop goes through each line of the input file. It adds a number to the start of each line if it is not blank, i.e. the start of the line is not followed immediately by the end of the line /^$/. Then it writes the line to the new file.
As for the third step, I think that it would be worth you having a go at that yourself and asking a new question if you get stuck.

Mass remove lines that contain certain words?

I need to remove any lines that contain certain keywords in them from a huge list of text files I have in a directory.
For example, I need all lines with any of these keywords in them to be removed: test1, example4, coding9
This is the closest example to what I'm trying to do that I can find:
sed '/Unix\|Linux/d' *.txt
Note: the lines don't need to contain all the keywords to be removed, just one should remove it :)
It appears that you are looking for some 1 liner command to read and write back to thousands of files and millions of lines. I wouldn't do it like that personally because I would prefer to write a quick and dirty script in Perl. I very briefly tested this on very simple files and it works but since you are working with thousands of files and millions of lines, I would test whatever you write in a test directory first with some of the files so that you can verify.
#!/usr/bin/perl
# the initial directory to read from
my $directory = 'tmp';
opendir (DIR, $directory) or die $!;
my #keywords = ('woohoo', 'blah');
while (my $file = readdir(DIR)) {
# ignore files that begin with a period
next if ($file =~ m/^\./);
# open the file
open F, $directory.'/'.$file || die $!;
# initialize empty file_lines
#file_lines = ();
# role through and push the line into the new array if no keywords are found
while (<F>) {
next if checkForKeyword($_);
push #file_lines, $_;
}
close F;
# save in a temporary file for testing
# just change these 2 variables to fit your needs
$save_directory = $directory.'-save';
$save_file = $file.'-tmp.txt';
if (! -d $save_directory) {
`mkdir $save_directory`;
}
$new_file = $save_directory.'/'.$save_file;
open S, ">$new_file" || die $!;
print S for #file_lines;
close S;
}
# role through each keyword and return 1 if found, return '' if not
sub checkForKeyword()
{
$line = shift;
for (0 .. $#keywords) {
$k = $keywords[$_];
if ($line =~ m/$k/) {
return 1;
}
}
return '';
}