Search a list file of files for keywords - perl

I have an array of files. Now I need to cat each file and search for a list of keywords which is in file keywords.txt.
my keywords.txt contains below
AES
3DES
MD5
DES
SHA-1
SHA-256
SHA-512
10.*
http://
www.
#john.com
john.com
and I'm expecting output as below
file jack.txt contains AES:5 (5 line number) http://:55
file new.txt contains 3DES:75 http://:105

Okay Here is my Code
use warnings;
use strict;
open STDOUT, '>>', "my_stdout_file.txt";
my $filename = $ARGV[2];
chomp ($filename);
open my $fh, q[<], shift or die $!;
my %keyword = map { chomp; $_ => 1 } <$fh>;
print "$fh\n";
while ( <> ) {
chomp;
my #words = split;
for ( my $i = 0; $i <= $#words; $i++ ) {
if ( $keyword{ $words[ $i ] } ) {
print "Keyword Found for file:$filename\n";
printf qq[$filename Line: %4d\tWord position: %4d\tKeyword: %s\n],
$., $i, $words[ $i ];
}
}
}
But the problem is its considering all the Arguments and trying to open the files for ARGV[2]. Actually i need to Open only ARGV[0] and ARGV[1]. ARGV[2] i kept for writing in output only.
Appreciate responses.
Thanks.

I think maybe there are multipe keywords in one line of those txt files. So below code for your reference:
my #keywords;
open IN, "keywords.txt";
while(<IN>) {
if(/^(.+)$/) {
push(#keywords, $1);
}
}
close(IN);
my #filelist=glob("*.txt");
foreach my $filename(#filelist) {
if(open my $fh, '<', $filename) {
my $ln=1;
while(my $line = <$fh>) {
foreach my $kw(#keywords) {
if($line=~/$kw/) {
print $filename.':'.$ln.':'.$kw."\n";
}
}
$ln++;
}
}
close($fh);
}

Ok but one thing i noticed here is..
the Keywords
http://
oracle.com
it is matching the whole word, instead it should try to match as part of strings.. Only these keywords.. Others should be searched as per the above.

Related

Perl code fails to match recursively (nested subs)

The code below loops through folders in “/data/results” directory and matches each .vcf file name, located in a sub-folder (two levels down) to the content of a matrix_key file.
This seem to work only for the first folder. I printed the content of each #matrix_key and it’s correct. The code always fails to match for the second folder. Here is where it fails to match:: if ( my $aref = first { index($sample_id, $_->[1]) != -1 } #matrix_key ) {
I’ve tried to run one folder at a time and it work great. I don’t understand why it fails when I put multiple folders in /data/results/? Could someone please suggest how to correct this issue? Thank you.
Here is an example of directory structure:
/data/results/
TestFolder1/
subfolder1/Variants/MD-14-11856_RNA_v2.vcf
subfoder2/Variants/SU-16-16117_RNA_v2.vcf
matrix.txt
matrixkey.txt
TestFolder2/
subfolder1/Variants/SU-15-2542_v2.vcf
subfolder2/Variants/SU-16-16117_v2.vcf
matrix.txt
matrixkey.txt
Example of #matrix_key:
Barcode SampleName
barcode_003 SU-15-2542
barcode-005 MD-14-11856
barcode-002 SU-16-16117
The code:
#!/usr/bin/perl
use warnings;
use strict;
use File::Copy qw(move);
use List::Util 'first';
use File::Find;
use File::Spec;
use Data::Dumper;
use File::Basename;
use File::Spec::Functions 'splitdir';
my $current_directory = "/data/results";
my #dirs = grep { -d } glob '/data/results/*';
if (grep -d, glob("$current_directory/*")) {
print "$current_directory has subfolder(s)\n";
}
else {
print "there are no folders\n";
die;
}
my %files;
my #matrix_key = ();
for my $dir ( #dirs ) {
print "the directory is $dir\n";
my $run_folder = (split '/', $dir)[3];
print "the folder is $run_folder\n";
my $key2 = $run_folder;
# checks if barcode matrix and barcode summary files exist
#shortens the folder names and unzips them.
#check if each sample is present in the matrix file for each folder.
my $location = "/data/results/".$run_folder;
my $matrix_key_file = "/data/results/".$run_folder."/matrixkey.txt";
open my $key, '<', $matrix_key_file or die $!; # key file
<$key>; # throw away header line in key file (first line)
#matrix_key = sort { length($b->[1]) <=> length($a->[1]) }
map [ split ], <$key>;
close $key or die $!;
print Dumper(#matrix_key) . "===\n\n";
find({ wanted => \&find_vcf, no_chdir=>1}, $location);
#find({ wanted => find_vcf, no_chdir=>1}, $location);
}
my $find_vcf = sub {
#sub find_vcf {
my $F = $File::Find::name;
if ($F =~ /vcf$/ ) {
print "$F\n";
$F =~ m|([^/]+).vcf$| or die "Can't extract Sample ID";
my $sample_id = $1; print "the short vcf name is: $sample_id\n";
if ( my $aref = first { index($sample_id, $_->[1]) != -1 } #matrix_key ) {
#the code fails to match sample_id to matrix_key
#even though it's printed out correctly
print "$sample_id \t MATCHES $aref->[1]\n";
print "\t$aref->[1]_$aref->[0]\n\n";
} else {
# handle all other possible exceptions
#print "folder name is $run_folder\n";
die("The VCF file doesn't match the Summary Barcode file: $sample_id\n");
}
}
}
The posted code appears to be a bit complicated for the job.
Here is one way to do what I understand from the question. It uses File::Find::Rule
use warnings;
use strict;
use File::Find::Rule;
use List::Util 'any';
my $base_dir = '/data/results';
my #dirs = File::Find::Rule->maxdepth(1)->directory->in($base_dir);
foreach my $dir (#dirs)
{
# Find all .vcx files anywhere in this dir or below
my #vcx_files = File::Find::Rule->file->name('*.vcx')->in($dir);
# Remove the path and .vcx extension
my #names = map { m|.*/(.+)\.vcx$| } #vcx_files;
# Find all text files to search, right in this folder
my #files = File::Find::Rule ->
maxdepth(1)->file->name('*.txt')->in($dir);
foreach my $file (#files)
{
open my $fh, '<', $file or die "Can't open $file: $!";
<$fh>; # drop the header line
# Get the second field on each line (with SampleName)
my #samples = map { (split)[1] } <$fh>;
# ... search #samples for #names ...
}
}
It is fine to use glob for non-recursive searches above, but given its treatment of spaces better use core File::Glob replacement for it.
There are other ways to organize traversal of directories and file searches, and there are many ways to compare two lists. Please clarify the overall objective so that I can add suitable code to search .vcx names vs. file content.
Please add checks, fix variable names, implement your policies for when things fail, etc.

Perl Help Needed: Replacing values

I am having an input file like this:
Input file
I need to replace the value #pSBSB_ID="*" of #rectype=#pRECTYPE="SBSB" with #pMEME_SSN="034184233", value of #pRECTYPE="SMSR", ..and have to delete the row where #rectype='#pRECTYPE="SMSR", '
Example:
So, after changes have been made, the file should be like this:
....#pRECTYPE="SBSB", #pGWID="17199269", #pINPUT_METHOD="E", #pGS08="005010X220A1", #pSBSB_FAM_UPDATE_CD="UP", #pSBSB_ID="034184233".....
....#pRECTYPE="SBEL", #pSBEL_EFF_DT="01/01/2013", #pSBEL_UPDATE_CD="TM", #pCSPD_CAT="M", #pCSPI_ID="MHMO1003"
.
.
.
Update
I tried below mentioned code:
Input file extension: mms and there are multiple files to process.
my $save_for_later;
my $record;
my #KwdFiles;
my $r;
my $FilePath = $ARGV[0];
chdir($FilePath);
#KwdFiles = <*>;
foreach $File(#KwdFiles)
{
unless(substr($File,length($File)-4,length($File)) eq '.mms')
{
next;
}
unless(open(INFILE, "$File"))
{
print "Unable to open file: $File";
exit(0);
}
print "Successfully opened the file: \"$File\" for processing\n\n";
while ( my $record = <INFILE> ) {
my %r = $record =~ /\#(\w+) = '(.*?)'/xg;
if ($r{rectype} eq "SMSR") {
$save_for_later = $r{pMEME_SSN};
next;
}
elsif ($r{rectype} eq "SBSB" and $r{pSBSB_ID} eq "*") {
$record =~ s|(\#pSBSB_ID = )'.*?'|$1'$save_for_later'|x;
}
close(INFILE);
}
}
But, I am still not getting the updated values in the file.
#!/usr/bin/perl
open IN, "< in.txt";
open OUT, "> out.txt";
my $CUR_RECID = 1^1;
while (<IN>) {
if ($CUR_RECID) {
s/recname='.+?'/recname='$CUR_RECID'/ if /rectype='DEF'/;
$CUR_RECID = 1^1;
print OUT;
}
$CUR_RECID = $1 if /rectype='ABC'.+?rec_id='(.+?)'/;
}
close OUT;
close IN;
Try that whole code. No need a separate function; This code does everything.
Run this script from your terminal with the files to be modified as arguments:
use strict;
use warnings;
$^I = '.bak'; #modify original file and create a backup of the old ones with .bak appended to the name
my $replacement;
while (<>) {
$replacement = $1 if m/(?<=\#pMEME_SSN=)("\d+")/; #assume replacement will be on the first line of every file.
next if m/^\s*\#pRECTYPE="SMSR"/;
s/(?<=\#pSBSB_ID=)("\*")/$replacement/g;
print;
}

Compare and replace two files in perl

I am trying to compare the string in two documents test1, test 2
Test 1:
<p><imagedata rid="rId7"></p>
...
<p><imagedata rid="rId8"></p>
Test2:
<imagesource Id="rId7" Target="image/image1.jpg"/>
...
<imagesource Id="rId9" Target="image/image2.jpg"/>
...
<imagesource Id="rId8" Target="image/image3.jpg"/>
What I want is, the first file should get replaced with the image target path like:
<p><imagedata src="image/image1.jpg"></p>
...
<p><imagedata rid="image/image3.jpg"></p>
I tried to extract the text from both files but I stuck to compare both strings
opendir(DIR, $filenamenew1);
our(#test1,#test2);
open fhr, "$filenamenew1/test1.txt";
open fhr1, "$filenamenew1/test2.txt";
my #line;
#line= <fhr>;
for (my $i=0;$i<=$#line;$i++)
{
if ($line[$i]=~m/rid="(rId[0-9])"/)
{
my $k = $1;
push (#test1, "$k");
}
}
my #file2;
#file2= <fhr1>;
for (my $i=0;$i<=$#file2;$i++)
{
if ($file2[$i]=~m/Id="(rId[0-9])"/)
{
my $k1 = $1;
push (#test2, "$k1");
foreach (#test1 = #test2)
{
print "equal";
}
}
}
One solution could be to read first the file with <imagesources> and save both the rid and the target in a hash. After that read the other file line by line and compare if the rid exists in the hash and do the substitution, something like:
Content of script.pl:
#!/usr/bin/env perl
use warnings;
use strict;
my (%hash);
open my $fh2, '<', shift or die;
open my $fh1, '<', shift or die;
while ( <$fh2> ) {
chomp;
if ( m/Id="(rId\d+)".*Target="([^"]*)"/i ) {
$hash{ $1 } = $2;
}
}
while ( <$fh1> ) {
if ( m/rId="([^"]+)"/i && defined $hash{ $1 } ) {
s//src="$hash{ $1 }"/;
}
print $_;
}
Run it like:
perl script.pl test2 test1
That yields:
<p><imagedata src="image/image1.jpg"></p>
...
<p><imagedata src="image/image3.jpg"></p>

How to write a correct name using combination of variable and string as a filehandler?

I want to make a tool to classify each line in input file to several files
but it seems have some problem in naming a filehandler so I can't go ahead , how do I solve?
here is my program
ARGV[0] is the input file
ARGV[1] is the number of classes
#!/usr/bin/perl
use POSIX;
use warnings;
# open input file
open(Raw,"<","./$ARGV[0]") or die "Can't open $ARGV[0] \n";
# create a directory class to store class files
system("mkdir","Class");
# create files for store class informations
for($i=1;$i<=$ARGV[1];$i++)
{
# it seems something wrong in here
open("Class$i",">","./Class/$i.class") or die "Can't create $i.class \n";
}
# read each line and random decide which class to store
while( eof(Raw) != 1)
{
$Line = readline(*Raw);
$Random_num = ceil(rand $ARGV[1]);
for($k=1;$k<=$ARGV[1];$k++)
{
if($Random_num == $k)
{
# Store to the file
print "Class$k" $Line;
last;
}
}
}
for($h=1;$h<=$ARGV[1];$h++)
{
close "Class$h";
}
close Raw;
thanks
Later I use the advice provided by Bill Ruppert
I put the name of filehandler into array , but it seems appear a syntax bug , but I can't correct it
I label the syntax bug with ######## A syntax error but it looks quite OK ########
here is my code
#!/usr/bin/perl
use POSIX;
use warnings;
use Data::Dumper;
# open input file
open(Raw,"<","./$ARGV[0]") or die "Can't open $ARGV[0] \n";
# create a directory class to store class files
system("mkdir","Class");
# put the name of hilehandler into array
for($i=0;$i<$ARGV[1];$i++)
{
push(#Name,("Class".$i));
}
# create files of classes
for($i=0;$i<=$#Name;$i++)
{
$I = ($i+1);
open($Name[$i],">","./Class/$I.class") or die "Can't create $I.class \n";
}
# read each line and random decide which class to store
while( eof(Raw) != 1)
{
$Line = readline(*Raw);
$Random_num = ceil(rand $ARGV[1]);
for($k=0;$k<=$#Name;$k++)
{
if($Random_num == ($k+1))
{
print $Name[$k] $Line; ######## A syntax error but it looks quite OK ########
last;
}
}
}
for($h=0;$h<=$#Name;$h++)
{
close $Name[$h];
}
close Raw;
thanks
To quote the Perl documentation on the print function:
If you're storing handles in an array or hash, or in general whenever you're using any expression more complex than a bareword handle or a plain, unsubscripted scalar variable to retrieve it, you will have to use a block returning the filehandle value instead, in which case the LIST may not be omitted:
print { $files[$i] } "stuff\n";
print { $OK ? STDOUT : STDERR } "stuff\n";
Thus, print $Name[$k] $Line; needs to be changed to print { $Name[$k] } $Line;.
How about this one:
#! /usr/bin/perl -w
use strict;
use POSIX;
my $input_file = shift;
my $file_count = shift;
my %hash;
open(INPUT, "<$input_file") || die "Can't open file $input_file";
while(my $line = <INPUT>) {
my $num = ceil(rand($file_count));
$hash{$num} .= $line
}
foreach my $i (1..$file_count) {
open(OUTPUT, ">$i.txt") || die "Can't open file $i.txt";
print OUTPUT $hash{$i};
close OUTPUT;
}
close INPUT;

Perl script.file handling issues

I have written a Perl script:
#!/usr/bin/perl
use strict;
use warnings;
my $file_name;
my $ext = ".text";
my $subnetwork2;
my %files_list = ();
opendir my $dir, "." or die "Cannot open directory: $!";
my #files = readdir $dir;
sub create_files() {
my $subnetwork;
open(MYFILE, 'file.txt');
while (<MYFILE>) {
if (/.subnetwork/) {
my #string = split /[:,\s]+/, $_;
$subnetwork = $string[2];
}
if (/.set/ && (defined $subnetwork)) {
my #string = split /[:,\s]+/, $_;
my $file = $subnetwork . $string[1];
open FILE, ">", "$file.text" or die $!;
close(FILE);
}
}
close(MYFILE);
}
sub create_hash() {
foreach (#files) {
if (/.text/) {
open($files_list{$_}, ">>$_") || die("This file will not open!");
}
}
}
sub init() {
open(MYFILE3, 'file.txt');
while (<MYFILE3>) {
if (/.subnetwork/) {
my #string3 = split /[:,\s]+/, $_;
$subnetwork2 = $string3[2];
last;
}
}
close(MYFILE3);
}
sub main_process() {
init;
create_files;
create_hash;
open(MYFILE1, 'file.txt');
while (<MYFILE1>) {
if (/.subnetwork/) {
my #string3 = split /[:,\s]+/, $_;
$subnetwork2 = $string3[2];
}
if (/.set/) {
my #string2 = split /[:,\s]+/, $_;
$file_name = $subnetwork2 . $string2[1] . $ext;
}
if (/.domain/ || /.end/ || ($. < 6)) {
my $domain = $_;
foreach (#files) {
if (/.text/ && /$subnetwork2/) {
prnt { $files_list{$_} } "$domain";
}
}
}
elsif ($. >= 6) {
print { $files_list{$file_name} } "$_";
}
}
close(MYFILE1);
foreach my $val (values %files_list) { close($val); }
closedir $dir;
}
main_process;
This script creates files in the current directory based upon the content of file.txt, and then open those files again.
Then it starts processing file.txt and redirects the lines according to the filename set dynamically.
This setting of the file name is also based upon the data in the file file.txt.
The problem that I am facing here is that the redirection is only to a single file. That means there is some problem with the file handle.
All the files that are expected to be created are created perfectly but the data goes into only one of them.
I doubt if there is a problem with the file handle that I am using while redirecting.
Could anyone please help?
Sample input file is below:
..cnai #Generated on Thu Aug 02 18:33:18 2012 by CNAI R21D06_EC01, user tcssrpi
..capabilities BASIC
.utctime 2012-08-02 13:03:18
.subnetwork ONRM_ROOT_MO:NETSim_BAG
.domain BSC
.set BAG01
AFRVAMOS="OFF"
AWBVAMOS="OFF"
ALPHA=0
AMRCSFR3MODE=1,3,4,7
AMRCSFR3THR=12,21,21
AMRCSFR3HYST=2,3,3
AMRCSFR3ICM=
AMRCSFR4ICM=
USERDATA=""
.set BAG02
AFRVAMOS="OFF"
AWBVAMOS="OFF"
ALPHA=0
AMRCSFR3MODE=1,3,4,7
AMRCSFR3THR=12,21,21
AMRCSFR3HYST=2,3,3
..end
The problem that i am facing is during execution:
> process.pl
Use of uninitialized value in ref-to-glob cast at process.pl line 79, <MYFILE1> line 6.
Can't use string ("") as a symbol ref while "strict refs" in use at process.pl line 79, <MYFILE1> line 6.
The problem i can understand is with this line:
print { $files_list{$_} } "$domain";
but i am unable to understand why!!
The output i need is :
> cat NETSim_BAGBAG01.text
.set BAG01
AFRVAMOS="OFF"
AWBVAMOS="OFF"
ALPHA=0
AMRCSFR3MODE=1,3,4,7
AMRCSFR3THR=12,21,21
AMRCSFR3HYST=2,3,3
AMRCSFR3ICM=
AMRCSFR4ICM=
USERDATA=""
> cat NETSim_BAGBAG02.text
.set BAG02
AFRVAMOS="OFF"
AWBVAMOS="OFF"
ALPHA=0
AMRCSFR3MODE=1,3,4,7
AMRCSFR3THR=12,21,21
AMRCSFR3HYST=2,3,3
>
Your problem in following lines:
open(PLOT,">>$_") || die("This file will not open!");
$files_list{$_}=*PLOT;
You should replace they with:
open($files_list{$_},">>$_") || die("This file will not open!");
This portion of your code is the key:
open(PLOT,">>$_") || die("This file will not open!");
$files_list{$_}=*PLOT;
The problem is that you are essentially using the filehandle PLOT as a global variable; every single entry in your hash is pointing to this same filehandle. Replace with something like this:
local *PLOT;
open(PLOT,">>$_") || die("This file will not open!");
$files_list{$_}=*PLOT;
You have got youself very entangled with this program. There is no need for the hash table or the multiple subroutines.
Here is a quick refactoring of your code that works with your data and writes files NETSim_BAG.BAG01.text and NETSim_BAG.BAG02.text. I put a dot between the subnet and the set to make the names a little clearer.
use strict;
use warnings;
my $out_fh;
open my $fh, '<', 'file.txt' or die $!;
my ($subnetwork, $set, $file);
while (<$fh>) {
if ( /^\.subnetwork\s+\w+:(\w+)/ ) {
$subnetwork = $1;
}
elsif ( /^\.set\s+(\w+)/ and $subnetwork) {
$set = $1;
$file = "$subnetwork.$set.text";
open $out_fh, '>', $file or die qq(Unable to open "$file" for output: $!);
print $out_fh;
}
elsif ( /^\.\.end/ ) {
undef $subnetwork;
undef $file;
}
if (/^[^.]/ and $file) {
print $out_fh $_;
}
}