I need to remove any lines that contain certain keywords from a huge number of text files I have in a directory.
For example, I need every line containing any of these keywords to be removed: test1, example4, coding9
This is the closest example to what I'm trying to do that I can find:
sed '/Unix\|Linux/d' *.txt
Note: a line doesn't need to contain all the keywords to be removed; any one of them is enough :)
It appears that you are looking for a one-liner command to read and write back thousands of files and millions of lines. I wouldn't do it like that personally; I would prefer to write a quick and dirty Perl script. I tested the script below very briefly on some simple files and it works, but since you are working with thousands of files and millions of lines, I would test whatever you write in a scratch directory first with a few of the files so that you can verify the results.
#!/usr/bin/perl
use strict;
use warnings;

# the initial directory to read from
my $directory = 'tmp';
opendir (DIR, $directory) or die $!;
my @keywords = ('woohoo', 'blah');

while (my $file = readdir(DIR)) {
    # ignore files that begin with a period
    next if ($file =~ m/^\./);

    # open the file
    open F, $directory.'/'.$file or die $!;

    # initialize empty file_lines
    my @file_lines = ();

    # roll through and push the line into the new array if no keywords are found
    while (<F>) {
        next if checkForKeyword($_);
        push @file_lines, $_;
    }
    close F;

    # save in a temporary file for testing
    # just change these 2 variables to fit your needs
    my $save_directory = $directory.'-save';
    my $save_file = $file.'-tmp.txt';
    if (! -d $save_directory) {
        mkdir $save_directory or die $!;
    }

    my $new_file = $save_directory.'/'.$save_file;
    open S, ">$new_file" or die $!;
    print S for @file_lines;
    close S;
}

# roll through each keyword and return 1 if found, '' if not
sub checkForKeyword
{
    my $line = shift;
    for (0 .. $#keywords) {
        my $k = $keywords[$_];
        if ($line =~ m/$k/) {
            return 1;
        }
    }
    return '';
}
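That said, if you really do want a one-liner for the keywords in your question, something like this should work (a sketch only; -i.bak edits each file in place and keeps a .bak backup of it, and I would still try it on copies first):

perl -i.bak -ne 'print unless /test1|example4|coding9/' *.txt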
I am trying to copy the content of three separate .vect files into one. I want to do this for all 5,000 files in the $fromdir directory.
When I run this program it generates just a single modified .vect file in the output directory. If I include the close(DATA) calls after individual while loops inside the foreach loop, I get the same behavior: a single output file in the output directory instead of the wanted 5,000 files.
I have done some reading, and at first thought I may not be opening the files. But if I print($vectfile) in the foreach loop, every file name in the directory is printed.
My second thought was that it was how I was closing the files, but I get the same behavior whether I close the file handles inside or outside the foreach loop.
My final thought was maybe I don't have write permission to the file or directory, but I don't know how to change this.
How can I get this loop to run all 5,000 times and not just once?
use strict;
use warnings;
use feature qw(say);

my $dir = "D:\\Downloads";

# And M3.1 and P3.1
my $subfolder = "A0.1";

my $fromdir = $dir . "\\" . $subfolder;

my @files = <$fromdir/*vect>;

# Top of file
my $readfiletop = "C:\\Users\\Owner\\Documents\\MoreKnotVis\\ScriptsForAdditionalDataSets\\VectFileHeader.vect";

# Bottom of file
my $readfilebottom = "C:\\Users\\Owner\\Documents\\MoreKnotVis\\ScriptsForAdditionalDataSets\\VectFileCloser.vect";

foreach my $vectfile ( @files ) {
    say("$vectfile");

    my $count = 0;
    my $readfilebody = $vectfile;
    my $out_file = "D:\\Downloads\\ColorsA0.1\\" . "$count" . ".vect";
    $count++;

    # open top part of each file
    open(DATA1, "<", $readfiletop) or die "Can't open '$readfiletop': $!";
    # open bottom part of each file
    open(DATA3, "<", $readfilebottom) or die "Can't open '$readfilebottom': $!";
    # open a file to read
    open(DATA2, "<", $vectfile) or die "Can't open '$vectfile': $!";
    # open a file to write to
    open(DATA4, ">", $out_file) or die "Can't open '$out_file': $!";

    # Copy data from VectFileTop file to another.
    while ( <DATA1> ) {
        print DATA4 $_;
    }

    # Copy the data from VectFileBody to another.
    while ( <DATA2> ) {
        print DATA4 $_, $_ if 8..12;
    }

    # Copy the data from VectFileBottom to another.
    while ( <DATA3> ) {
        print DATA4 $_;
    }
}

close( DATA1 );
close( DATA2 );
close( DATA3 );
close( DATA4 );

print("quit\n");
You construct the output file name with $count in it. But note what you do with this variable: inside the loop you first set it to 0, so the output file name is constructed with 0 in it; then you increment it, but this has no effect, because the variable is set back to 0 on the next pass through the loop.
The effect is that the loop executes the required number of times, but the output file name contains 0 as the "number" every time, so you keep overwriting the same file with new content.
Move the my $count = 0; instruction to before the loop and everything should be OK.
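In other words, hoist the counter out of the loop (a minimal sketch of the change, reusing the variable names from the question):

my $count = 0;    # initialise once, before the loop
foreach my $vectfile ( @files ) {
    my $out_file = "D:\\Downloads\\ColorsA0.1\\" . $count . ".vect";
    $count++;     # now each input file gets a fresh output file name
    # ... rest of the loop body unchanged ...
}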
You seem to be clinging to a specific form of code in fear of everything falling apart if you change a single thing. I recommend that you dare to stray a little further from the formula so that the code is more concise and readable.
The problem is that you reset your $count to zero before processing each input file, so all the output files have the same name and overwrite one another. The remaining output file contains only the data from the last input file.
Here's a refactoring of your code. I can't guarantee that it will run correctly, but it does compile and looks right.
I've added use autodie to avoid having to check the status of every IO operation.
I've used the same lexical file handle $fh for all the input files. Opening another file on a file handle that is already open will close it first, and a lexical file handle is closed automatically by perl when it goes out of scope at the end of the block.
I've used a while loop to iterate over the input file names instead of reading the whole list into an array, which unnecessarily uses an additional variable @files and wastes space.
I've used forward slashes instead of backslashes in all the file paths. This is fine in library calls on Windows: it is only a problem if they appear in command-line input.
I hope you'll agree that this form is more readable. I think you would have stood a much better chance of finding the problem if your code were in this form.
use strict;
use warnings;
use autodie;
use feature qw/ say /;

my $indir     = 'D:/Downloads';
my $subdir    = 'A0.1'; # And M3.1 and P3.1
my $extrasdir = 'C:/Users/Owner/Documents/MoreKnotVis/ScriptsForAdditionalDataSets';

my $outdir = "$indir/Colors$subdir";

my $topfile    = "$extrasdir/VectFileHeader.vect";
my $bottomfile = "$extrasdir/VectFileCloser.vect";

my $filenum;

while ( my $vectfile = glob "$indir/$subdir/*.vect" ) {

    say qq/Processing "$vectfile"/;

    $filenum++;
    open my $outfh, '>', "$outdir/$filenum.vect";

    my $fh;

    open $fh, '<', $topfile;
    print { $outfh } $_ while <$fh>;

    open $fh, '<', $vectfile;
    while ( <$fh> ) {
        print { $outfh } $_, $_ if 8..12;
    }

    open $fh, '<', $bottomfile;
    print { $outfh } $_ while <$fh>;
}
say 'DONE';
I have a script which looks something like this. I want to use it to search through the current directory, open all directories in that directory, open all files that match certain REs (fastq files, which have a format such that every four lines go together), do some work with these files, and write some results to a file in each directory. (Note: the actual script does a lot more than this, but I think I have a structural issue with the iteration over folders, because the script works when a simplified version is used in one folder, so I am posting a simplified version here.)
#!/usr/local/bin/perl
# Created by C. Pells, M. R. Snyder, and N. T. Marshall 2017
# Script trims and merges high throughput sequencing reads from fastq files for a specific primer set
use Cwd;
use warnings;

my $StartTime = localtime;
my $MasterDir = getcwd; # obtains a full path to the current directory
opendir (DIR, $MasterDir);
my @objects = readdir (DIR);
closedir (DIR);
foreach (@objects){
    print $_,"\n";
}
my @Dirs = ();
foreach my $O (0..$#objects){
    my $CurrDir = "";
    if ((length ($objects[$O]) < 7) && ($O>1)){ # checking if the length of the object name is < 7 characters (all samples are 6 or less) and skipping the first two elements: "." and ".."
        $CurrDir = $MasterDir."/".$objects[$O]; # appends directory name to full path
        push (@Dirs, $CurrDir);
    }
}
foreach (@Dirs){
    print $_,"\n"; # checks that all directories were read in
}
foreach my $S (0..$#Dirs){
    my @files = ();
    opendir (DIR, $Dirs[$S]) || die "cannot open $Dirs[$S]: $!";
    @files = readdir DIR; # reads in all files in a directory
    closedir DIR;
    my @AbsFiles = ();
    foreach my $F (0..$#files){
        my $AbsFileName = $Dirs[$S]."/".$files[$F]; # appends file name to full path
        push (@AbsFiles, $AbsFileName);
    }
    foreach my $AF (0..$#AbsFiles){
        if ($AbsFiles[$AF] =~ /_R2_001\.fastq$/m){ # finds reverse fastq file
            my @readbuffer = ();
            # read in reverse fastq
            my %RSeqHash;
            my $cc = 0;
            print "Reading, reversing, complementing, and trimming reverse fastq file $AbsFiles[$AF]\n";
            open (INPUT1, $AbsFiles[$AF]) || die "Can't open file: $!\n";
            while (<INPUT1>){
                chomp ($_);
                push(@readbuffer, $_);
                if (@readbuffer == 4) {
                    my $rsn = substr($readbuffer[0], 0, 45); # trims reverse seq name
                    $cc++ % 10000 == 0 and print "$rsn\n";
                    $RSeqHash{$rsn} = $readbuffer[1];
                    @readbuffer = ();
                }
            }
        }
    }
    foreach my $AFx (0..$#AbsFiles){
        if ($AbsFiles[$AFx] =~ /_R1_001\.fastq$/m){ # finds forward fastq file
            print "Reading forward fastq file $AbsFiles[$AFx]\n";
            open (INPUT2, $AbsFiles[$AFx]) || die "Can't open file: $!\n";
            my $OutMergeName = $Dirs[$S]."/"."Merged.fasta";
            open (OUT, ">", "$OutMergeName");
            my $cc = 0;
            my @readbuffer = ();
            while (<INPUT2>){
                chomp ($_);
                push(@readbuffer, $_);
                if (@readbuffer == 4) {
                    my $fsn = substr($readbuffer[0], 0, 45); # trims forward seq name
                    # $cc++ % 10000 == 0 and print "$fsn\n$readbuffer[1]\n";
                    if ( exists($RSeqHash{$fsn}) ){ # checks to see if forward seq name is present in reverse seq hash
                        print "$fsn was found in Reverse Seq Hash\n";
                        print OUT "$fsn\n$readbuffer[1]\n";
                    }
                    else {
                        $cc++ % 10000 == 0 and print "$fsn not found in Reverse Seq Hash\n";
                    }
                    @readbuffer = ();
                }
            }
            close INPUT1;
            close INPUT2;
            close OUT;
        }
    }
}
my $EndTime = localtime;
print "Script began at\t$StartTime\nCompleted at\t$EndTime\n";
Again, I know that the script works without iterating over folders, but with this version I just get empty output files. From the print statements I inserted in this script, I've determined that Perl can't find the variable $fsn as a key in the hash while reading INPUT2. I can't understand why, because every file is there, and it works when I don't iterate over folders, so I know the keys match. So either there is something simple I am missing, or I have hit some sort of limitation of Perl's memory. Any help is appreciated!
Turns out my issue was with where I was declaring the hash. The script fails unless I declare the hash before the foreach loop that cycles through all items in @AbsFiles searching for the first input file, even though I only declare it once the first input file is found. That placement is fine in one sense, because it means the hash is cleared in every new directory. But I don't understand why it failed the way it did, because it should only be declaring (or clearing) the hash when it finds the input file name. I guess I don't NEED to know why it didn't work before, but some help understanding would be nice.
I have to give credit to another user for helping me realize this. They attempted to answer my question but did not, and then gave me this hint about where I declare my hash in a comment on that answer. That answer has since been deleted, so I can't credit the user for pointing me in this direction. I would love to know what they understand about Perl that I do not that made it clear to them that this was the problem. I apologize that I was busy with data analysis and a conference, so I could not respond to that comment sooner.
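For what it's worth, the likely explanation is lexical scope: a my variable is visible only inside its innermost enclosing block, so the %RSeqHash declared inside the first foreach block and the %RSeqHash read inside the second foreach block are two different variables. Without use strict, the second one silently refers to the empty package hash %main::RSeqHash; with use strict it would have been a compile-time error instead of a silent mismatch. A minimal sketch of the effect:

# note: no 'use strict', exactly as in the original script
foreach my $AF (0 .. 0) {
    my %RSeqHash = ( key => 'value' );   # lexical: visible only inside this block
}
foreach my $AFx (0 .. 0) {
    # this reads the (empty) package hash %main::RSeqHash, not the
    # lexical hash above, so the lookup always fails
    print exists $RSeqHash{key} ? "found\n" : "not found\n";   # prints "not found"
}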
I'm writing a Perl script to remove files that have fewer than a given number of lines. What I have so far is
my $cmd = join('', 'wc -l ', $file); # prints number of lines to command line
if (system($cmd) < 4)
{
    my $rmcmd = join('', 'rm ', $file);
    system($rmcmd);
}
where $file is the name and location of a file.
There's no need to use system for this: Perl is perfectly capable of counting lines. (Note also that system returns the command's exit status, not its output, so comparing it to 4 doesn't test the line count anyway.)
sub count_lines {
    open my $fh, '<', shift;
    while (local $_ = <$fh>) {} # loop through all lines
    return $.;
}

unlink $file if count_lines($file) < 4;
I'm assuming your end goal is to have it search through a directory tree, removing files with a line count less than n. Check out File::Find and its nifty code generator find2perl to handle that part for you.
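A minimal sketch of that combination (the threshold and starting directory here are made-up values for illustration):

use strict;
use warnings;
use File::Find;

my $min_lines = 4;      # hypothetical threshold
my $start_dir = '.';    # hypothetical starting directory

sub count_lines {
    open my $fh, '<', shift or return;
    while (local $_ = <$fh>) {}   # read to EOF; $. ends up as the line count
    return $.;
}

find(sub {
    return unless -f;             # plain files only
    my $n = count_lines($_);
    unlink $_ if defined $n && $n < $min_lines;
}, $start_dir);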
I have a list of words and I want to group them depending on whether they are verbs/adjectives/nouns/etc. So, basically, I am looking for a Perl module which tells whether a word is a verb/noun/etc.
I googled but couldn't find what I was looking for. Thanks.
Lingua::EN::Tagger, Lingua::EN::Semtags::Engine, Lingua::EN::NamedEntity
See the Lingua::EN:: namespace on CPAN. Specifically, Link Grammar and perhaps Lingua::EN::Tagger can help you. WordNet also provides that kind of information, and you can query it using this Perl module.
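For example, a quick sketch using Lingua::EN::Tagger's documented add_tags method (the word list is made up; note that for a single word out of context the tagger can only guess at the most likely part of speech):

use strict;
use warnings;
use Lingua::EN::Tagger;

my $tagger = Lingua::EN::Tagger->new;

# hypothetical word list
for my $word (qw( run beautiful house quickly )) {
    # add_tags wraps each token in a part-of-speech tag, e.g. <nn>house</nn>
    my $tagged = $tagger->add_tags($word);
    print "$word => $tagged\n";
}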
The following Perl code will help you find all of these things in the text files in a folder. Just give it the path of the directory and it will process all the files at once, saving the results in a report.txt file.
#!/usr/local/bin/perl
use strict;
use warnings;
use 5.010;
use Lingua::EN::Fathom;

# Recursively walks the list of file names, analysing each file with
# Lingua::EN::Fathom and appending its readability report to report.txt
sub process_file
{
    my $x      = $_[0];        # index of the current file
    my @names  = @{$_[1]};     # list of file names
    my $length = $_[2];        # number of files

    # keep going while the index is still within the file list
    if ($x < $length)
    {
        my $text = Lingua::EN::Fathom->new();

        # Analyse the contents of the current text file
        my $dirlocation = "./2015/";
        my $path        = $dirlocation . $names[$x];
        $text->analyse_file($path);

        # TO DO: remove repetition
        my $num_chars             = $text->num_chars;
        my $num_words             = $text->num_words;
        my $percent_complex_words = $text->percent_complex_words;
        my $num_sentences         = $text->num_sentences;
        my $num_text_lines        = $text->num_text_lines;
        my $num_blank_lines       = $text->num_blank_lines;
        my $num_paragraphs        = $text->num_paragraphs;
        my $syllables_per_word    = $text->syllables_per_word;
        my $words_per_sentence    = $text->words_per_sentence;

        # unique_words returns a hash of word => frequency
        my %words = $text->unique_words;
        foreach my $word ( sort keys %words )
        {
            # print("$words{$word} :$word\n");
        }

        my $fog     = $text->fog;
        my $flesch  = $text->flesch;
        my $kincaid = $text->kincaid;

        # append this file's report to report.txt
        my $filename = 'report.txt';
        open(my $fh, '>>', $filename) or die "Could not open file '$filename' $!";
        say $fh $text->report;
        close $fh;
        say 'done';
        print($text->report);

        # recursively call with the next index
        process_file($x + 1, \@names, $length);
    }
    else
    {
        done();
    }
}

# Driver code: read every file name in ./2015 and process them all
opendir DIR1, "./2015" or die "cannot open dir: $!";
my @default_files = grep { ! /^\.\.?$/ } readdir DIR1;
closedir DIR1;

my $length = scalar @default_files;
print $length;

process_file(0, \@default_files, $length);

sub done
{
    print "Done!";
}
I'm writing a Perl script that takes two command line arguments: a directory and a year. In this directory is a ton of text files or HTML files (depending on the year). Let's say, for instance, the year is 2010, which means the directory contains files that look like <number>rank.html, with the number ranging from 2001 to 2212. I want the script to open each file individually, take part of the title in the HTML file, and print it to a text file. However, when I run my code it just prints the first file's title to the text file. It seems that it only ever opens the first file, 2001rank.html, and no others. I'll post the code below; thanks to anyone who helps.
my $directory = shift or "Must supply directory\n";
my $year = shift or "Must supply year\n";

unless (-d $directory) {
    die "Error: Directory must be a directory\n";
}
unless ($directory =~ m/\/$/) {
    $directory = "$directory/";
}

open COLUMNS, "> columns$year.txt" or die "Can't open columns file";

my $column_name;
for (my $i = 2001; $i <= 2212; $i++) {
    if ($year >= 2009) {
        my $html_file = $directory.$i."rank.html";
        open FILE, $html_file;
        # check if opened correctly, if not, skip it
        unless (defined fileno(FILE)) {
            print "skipping $html_file\n";
            next;
        }
        $/ = "\n";
        my $line = <FILE>;
        if (defined $line) {
            $column_name = "";
            $_ = <FILE> until m{</title>};
            $_ =~ m{<title>CIA - The World Factbook -- Country Comparison :: (.+)</title>}i;
            $column_name = $1;
        }
        else {
            close FILE;
            next;
        }
        close FILE;
    }
    else {
        my $text_file = $directory.$i."rank.txt";
        open FILE, $text_file;
        unless (defined fileno(FILE)) {
            print "skipping $text_file\n";
            next;
        }
        $/ = "\r";
        my $line = <FILE>;
        if (defined $line) {
            $column_name = "";
            $_ = <FILE> until /Rank/i;
            $_ =~ /Rank(\s+)Country(\s+)(.+)(\s+)Date/i;
            $column_name = $3;
        }
        else {
            close FILE;
            next;
        }
        close FILE;
    }
    print "Adding $column_name to text file\n";
    print COLUMNS "$column_name\n";
}
close COLUMNS;
In other words, $column_name gets set to the same thing on every pass through the loop, even though I know the HTML files are different.
You'll probably be able to debug this a lot faster if you convert to using lexical variables for your filehandles instead of globals, and turn on strict checking:
use strict;
use warnings;

while (...)
{
    # ...
    open my $filehandle, $html_file;
    # ...
    my $line = <$filehandle>;
}
This way, the filehandle(s) will go out of scope during each loop iteration, so you can more clearly see what exactly is being referenced and where. (Hint: you may have missed a condition where the filehandle gets closed, so it is improperly reused the next time around.)
For more on best practices with open and filehandles, see:
Why is three-argument open calls with autovivified filehandles a Perl best practice?
What's the best way to open and read a file in Perl?
Some other points:
Don't ever explicitly assign to $_; that's asking for trouble. Declare your own variable to hold your data: my $line = <$filehandle> (as in the example above).
Pull out your matches directly into variables, rather than using $1, $2 etc., and only use parentheses for the portions you actually need: my ($column_name) = ($line =~ m/Rank\s+Country\s+(.+)\s+Date/i);
Put the error conditions first, so the bulk of your code can be outdented one (or more) level(s). This improves readability: when the bulk of your algorithm is visible on the screen at once, you can better visualize what it is doing and catch errors.
If you apply the points above, I'm pretty sure you'll spot your error. I spotted it while making this last edit, but I think you'll learn more if you discover it yourself. (I'm not trying to be snooty; trust me on this!)
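(If you'd rather see it spelled out: because $_ keeps its value between iterations, the until m{</title>} test can succeed immediately on the line left over from the previous file, so nothing new is ever read. Here is a sketch of the HTML branch with the points above applied; untested:)

my $html_file = $directory . $i . "rank.html";
open my $fh, '<', $html_file or do {
    print "skipping $html_file\n";    # error condition first
    next;
};

my $column_name;
while (my $line = <$fh>) {            # our own variable; never assign to $_
    if ($line =~ m{<title>CIA - The World Factbook -- Country Comparison :: (.+)</title>}i) {
        $column_name = $1;
        last;
    }
}
# $fh closes automatically when it goes out of scope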
Your processing is similar for HTML and text files, so make your life easy and factor out the common part:
sub scrape {
    my($path,$pattern,$sep) = @_;

    unless (open FILE, $path) {
        warn "$0: skipping $path: $!\n";
        return;
    }

    local $/ = $sep;
    my $column_name;
    while (<FILE>) {
        next unless /$pattern/;
        $column_name = $1;
        last;
    }
    close FILE;

    ($path,$column_name);
}
Then make it specific for the two types of input:
sub scrape_html {
    my($directory,$i) = @_;
    scrape $directory.$i."rank.html",
           qr{<title>CIA - The World Factbook -- Country Comparison :: (.+)</title>}i,
           "\n";
}

sub scrape_txt {
    my($directory,$i) = @_;
    scrape $directory.$i."rank.txt",
           qr/Rank\s+Country\s+(.+)\s+Date/i,
           "\r";
}
Then your main program is straightforward:
my $directory = shift or die "$0: must supply directory\n";
my $year      = shift or die "$0: must supply year\n";

die "$0: $directory is not a directory\n"
    unless -d $directory;

# add trailing slash if necessary
$directory =~ s{([^/])$}{$1/};

my $columns_file = "columns$year.txt";
open COLUMNS, ">", $columns_file
    or die "$0: open $columns_file: $!";

for (my $i = 2001; $i <= 2212; $i++) {
    my $process = $year >= 2009 ? \&scrape_html : \&scrape_txt;
    my($path,$column_name) = $process->($directory,$i);

    next unless defined $path;

    if (defined $column_name) {
        print "$0: Adding $column_name to text file\n";
        print COLUMNS "$column_name\n";
    }
    else {
        warn "$0: no column name in $path\n";
    }
}

close COLUMNS or warn "$0: close $columns_file: $!\n";
Note how careful you have to be to close global filehandles. Please use lexical filehandles as in
open my $fh, $path or die "$0: open $path: $!";
Passing $fh as a parameter or stuffing it in hashes is much nicer. Also, lexical filehandles close automatically when they go out of scope. There's no chance of stomping on a handle someone else is already using.
Have you considered grep?
Grep out just the line from the HTML containing the title, and then process the output of grep.
It's simpler, as you would not have to write any file-handling code. You didn't say what you want to do with that title; if you only need a list, you might not need to write any code at all.
Try something like:
grep -ri title <directoryname>