I'm parsing the source code of many websites, a huge web of thousands of pages. Now I want to search it with Perl: I want to count the occurrences of a keyword.
To fetch the web pages I use curl and pipe the output to "grep -c", which doesn't work, so I want to use Perl. Can Perl be used for the whole job, including fetching the pages?
E.g.
cat RawJSpiderOutput.txt | grep parsed | awk -F " " '{print $2}' | xargs -I replaceStr curl replaceStr?myPara=en | perl -lne '$c++while/myKeywordToSearchFor/g;END{print$c}'
Explanation: The text file above contains usable and unusable URLs. With "grep parsed" I select the usable URLs. With awk I pick the 2nd column, which contains the bare URL. So far so good. Now to the question: with curl I fetch the source (appending a parameter, too) and pipe the whole source code of each page to Perl in order to count the occurrences of "myKeywordToSearchFor". I would love to do this in Perl only, if that is possible.
Thanks!
This uses Perl only (untested):
use strict;
use warnings;
use File::Fetch;
my $count = 0;
open my $SPIDER, '<', 'RawJSpiderOutput.txt' or die $!;
while (<$SPIDER>) {
    chomp;
    if (/parsed/) {
        my $url = (split)[1];
        $url .= '?myPara=en';
        my $ff = File::Fetch->new(uri => $url);
        # fetch() returns the path of the downloaded file
        my $fetched = $ff->fetch or die $ff->error;
        open my $FETCHED, '<', $fetched or die $!;
        while (<$FETCHED>) {
            # count every occurrence, not just every matching line
            $count++ while /myKeyword/g;
        }
        close $FETCHED;
        unlink $fetched;
    }
}
print "$count\n";
Try something more like,
perl -e 'while(<>){my @words = split;for my $word(@words){if($word=~/myKeyword/){++$c}}} print "$c\n"'
i.e.
while (<>) # as long as we're getting input (into “$_”)
{ my @words = split ' '; # split $_ (implicit) on whitespace, so we examine each word
for my $word (@words) # (and don't miss two keywords on one line)
{ if ($word =~ /myKeyword/) # whenever it's found,
{ ++$c } } } # increment the counter (auto-vivified)
print "$c\n" # and after end of file is reached, print the counter
or, spelled out strict-like
use strict;
my $count = 0;
while (my $line = <STDIN>) # except that <> is actually more magical than this
{ my @words = split ' ' => $line;
for my $word (@words)
{ if ($word =~ /myKeyword/)
{ ++$count; } } }
print "$count\n";
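If a single word can contain the keyword more than once, counting per word still undercounts. A minimal sketch of counting every non-overlapping occurrence in a string follows; count_occurrences and the sample text are made up for illustration, and the keyword is assumed to be literal text rather than a regex:

```perl
use strict;
use warnings;

# Count every (non-overlapping) occurrence of a literal keyword in a string.
# count_occurrences is a hypothetical helper name, not part of any module.
sub count_occurrences {
    my ($text, $keyword) = @_;
    # A global match in list context returns all matches; assigning that
    # list to () and then to a scalar yields the number of matches.
    my $count = () = $text =~ /\Q$keyword\E/g;
    return $count;
}

my $sample = "foo bar foo\nbaz foofoo\n";   # made-up input
print count_occurrences($sample, 'foo'), "\n";
```

The \Q...\E keeps characters such as . or + in the keyword from being treated as regex syntax; drop it if the keyword is meant to be a pattern.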
Here is a part of my script:
foreach $i ( @contact_list ) {
print "$i\n";
$e = "zcat $file_list2| grep $i";
print "$e\n";
$f = qx($e);
print "$f";
}
$e prints properly but $f gives a blank line even when $file_list2 has a match for $i.
Can anyone tell me why?
It is usually better to use Perl's grep than to pipe through the shell's grep:
@lines = `zcat $file_list2`; # move output of zcat to array
die('zcat error') if ($?); # will exit script with error if zcat is problem
# chomp(@lines) # this will remove "\n" from each line
foreach $i ( @contact_list ) {
print "$i\n";
@ar = grep (/$i/, @lines);
print @ar;
# print join("\n",@ar)."\n"; # in case of using chomp
}
A better solution is to avoid calling zcat altogether and use the zlib library through IO::Zlib:
http://perldoc.perl.org/IO/Zlib.html
use IO::Zlib;
# ....
# place your definition of $file_list2 and @contact_list here.
# ...
$fh = new IO::Zlib; $fh->open($file_list2, "rb")
or die("Cannot open $file_list2");
@lines = <$fh>;
$fh->close;
#chomp(@lines); #remove "\n" symbols from lines
foreach $i ( @contact_list ) {
print "$i\n";
@ar = grep (/$i/, @lines);
print (@ar);
# print join("\n",@ar)."\n"; #in case of using chomp
}
Your question leaves us guessing about many things, but a better overall approach would seem to be opening the file just once, and processing each line in Perl itself.
open(F, "zcat $file_list |") or die "$0: could not zcat: $!\n";
LINE:
while (<F>) {
######## FIXME: this could be optimized a great deal still
foreach my $i (@contact_list) {
if (m/$i/) {
print $_;
next LINE;
}
}
}
close (F);
If you want to squeeze out more from the inner loop, compile the regexes from @contact_list into a separate array before the loop, or perhaps combine them into a single regex if all you care about is whether one of them matched. If, on the other hand, you want to print all matches for one pattern only at the end when you know what they are, collect matches into one array per search expression, then loop them and print when you have grepped the whole set of input files.
Your problem is not reproducible without information about what's in $i, but I can guess that it contains some shell metacharacter which causes it to be processed by the shell before the grep runs.
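The "single regex" idea could look roughly like the sketch below. It assumes the contact entries are literal strings (hence quotemeta) and that the list is not empty; build_matcher and the sample entries are invented for illustration:

```perl
use strict;
use warnings;

# Build one alternation regex from a list of literal strings.
# build_matcher is a hypothetical helper name.
sub build_matcher {
    my @contacts = @_;
    # quotemeta escapes regex metacharacters so each entry matches literally
    my $alternation = join '|', map { quotemeta } @contacts;
    return qr/$alternation/;
}

my @contact_list = ('a.smith', 'bob+test');   # made-up entries
my $matcher = build_matcher(@contact_list);

for my $line ("call a.smith today\n", "call aXsmith today\n", "id=bob+test\n") {
    print $line if $line =~ $matcher;   # one regex test per input line
}
```

Because the entries are escaped, the dot in a.smith matches only a dot, so the second sample line is not printed. Matching inside Perl also sidesteps the shell interpreting metacharacters in $i.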
I have a directory with a list of image header files of the format
image1.hd
image2.hd
image3.hd
image4.hd
I want to search for the regular expression Image type:=4 in the directory and find the file number which has the first occurrence of this pattern. I can do this with a couple of pipes easily in bash:
grep -l 'Image type:=4' image*.hd | sed ' s/.*image\(.*\).hd/\1/' | head -n1
which returns 1 in this case.
This pattern match will be used in a perl script. I know I could use
my $number = `grep -l 'Image type:=4' image*.hd | sed ' s/.*image\(.*\).hd/\1/' | head -n1`
but is it preferable to use pure perl in such cases? Here is the best I could come up with using perl. It is very cumbersome:
my $tmp;
#want to find the planar study in current study
foreach (glob "$DIR/image*.hd"){
$tmp = $_;
open FILE, "<", "$_" or die $!;
while (<FILE>)
{
if (/Image type:=4/){
$tmp =~ s/.*image(\d+).hd/$1/;
}
}
close FILE;
last;
}
print "$tmp\n";
this also returns the desired output of 1. Is there a more effective way of doing this?
This is simple with the help of a couple of utility modules
use strict;
use warnings;
use File::Slurp 'read_file';
use List::MoreUtils 'firstval';
print firstval { read_file($_) =~ /Image type:=4/ } glob "$DIR/image*.hd";
But if you are restricted to core Perl, then this will do what you want
use strict;
use warnings;
my $firstfile;
while (my $file = glob 'E:\Perl\source\*.pl') {
open my $fh, '<', $file or die $!;
local $/;
if ( <$fh> =~ /Image type:=4/) {
$firstfile = $file;
last;
}
}
print $firstfile // 'undef';
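Both versions above print the file name rather than the number the question asks for. A core-only sketch that returns the number might look like this; first_matching_number is a made-up name, the directory is assumed to contain no whitespace (glob splits its argument on spaces), and the files are tried in numeric order, which differs from the shell's lexical order once there are ten or more files:

```perl
use strict;
use warnings;

# Return the number embedded in the name of the first image<N>.hd file in
# $dir whose contents match $pattern, or undef if no file matches.
# first_matching_number is a hypothetical helper name.
sub first_matching_number {
    my ($dir, $pattern) = @_;
    # Sort by the embedded number so image10.hd comes after image9.hd.
    my @files = sort { ($a =~ /(\d+)\.hd$/)[0] <=> ($b =~ /(\d+)\.hd$/)[0] }
                grep { /image\d+\.hd$/ } glob "$dir/image*.hd";
    for my $file (@files) {
        open my $fh, '<', $file or die "Cannot open $file: $!";
        while (my $line = <$fh>) {
            # stop at the first matching line and report the file's number
            return $1 if $line =~ $pattern && $file =~ /image(\d+)\.hd$/;
        }
    }
    return;
}

my $number = first_matching_number('.', qr/Image type:=4/);
print defined $number ? "$number\n" : "no match\n";
```

Reading line by line and returning early means later files are never opened once a match is found.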
** I have a follow-up question that is marked with '**' **
I was asked to write Perl code that replaces every { with {function(<counter>) and in every replacement the counter should get larger by 1. e.g. first replacement of { will be {function(0) ,
second replacement of { will be {function(1) etc.
It is supposed to do the replacement in every *.c and *.h file in a folder, including subfolders.
I wrote this code :
#!/usr/bin/perl
use Tie::File;
use File::Find;
$counter = 0;
$flag = 1;
@directories_to_search = 'd:\testing perl';
@newString = '{ function('.$counter.')';
$refChar = "{";
finddepth(\&fileMode, @directories_to_search);
sub fileMode
{
my @files = <*[ch]>; # get all files ending in .c or .h
foreach $file (@files) # go through all the .c and .h flies in the directory
{
if (-f $file) # check if it is a file or dir
{
my @lines;
# copy each line from the text file to the string @lines and add a function call after every '{' '
tie @lines, 'Tie::File', $file or die "Can't read file: $!\n";
foreach ( @lines )
{
if (s/{/@newString/g)
{
$counter++;
@newString = '{function('.$counter.')';
}
untie @lines; # free @lines
}
}
}
}
The code searches the directory d:\testing Perl and does the replacement but instead of getting
{function(<number>) I get {function(number1) function(number3) function(number5) function(number7) for instance for the first replacement I get
{function(0) function(2) function(4) function(6) and I wanted to get {function(0)
I really don't know what is wrong with my code.
An awk solution or any other Perl solution will also be great!
* I have a follow-up question.
Now I want my Perl program to do the same substitution in all the files, except on lines that contain both a '{'
and a '}'. So I modified the code this way:
#!/usr/bin/perl
use strict;
use warnings;
use Tie::File;
use File::Find;
my $dir = "C:/test dir";
# fill up our argument list with file names:
find(sub { if (-f && /\.[hc]$/) { push @ARGV, $File::Find::name } }, $dir);
$^I = ".bak"; # supply backup string to enable in-place edit
my $counter = 0;
# now process our files
#foreach $filename (@ARGV)
while (<>)
{
my @lines;
# copy each line from the text file to the string @lines and add a function call after every '{' '
tie @lines, 'Tie::File', $ARGV or die "Can't read file: $!\n";
#$_='{function(' . $counter++ . ')';
foreach (@lines)
{
if (!( index (@lines,'}')!= -1 )) # if there is a '}' in the same line don't add the macro
{
s/{/'{function(' . $counter++ . ')'/ge;
print;
}
}
untie @lines; # free @lines
}
What I was trying to do is go through all the files in @ARGV that I found in my dir and subdirs, and for each *.c or *.h file go line by line and check whether the line contains '}'. If it does, the program should skip the substitution; if it doesn't, the program should substitute '{' with '{function();'
unfortunately this code does not work.
I'm ashamed to say that I'm trying to make it work all day and still no go.
I would really appreciate some help.
Thank You!!
This is a simple matter of combining a finding method with an in-place edit. You could use Tie::File, but it is really the same end result. Also, needless to say, you should keep backups of your original files, always, when doing edits like these because changes are irreversible.
So, if you do not need recursion, your task is dead simple in Unix/Linux style:
perl -pi -we 's/{/"{ function(" . $i++ . ")"/ge' *.h *.c
Of course, since you seem to be using Windows, the cmd shell won't glob our arguments, so we need to do that manually. And we need to change the quotes around. And also, we need to supply a backup argument for the -i (in-place edit) switch.
perl -pi.bak -we "BEGIN { @ARGV = map glob, @ARGV }; s/{/'{ function(' . $i++ . ')'/ge" *.h *.c
This is almost getting long enough to make a script of.
If you do need recursion, you would use File::Find. Note that this code is pretty much identical in functionality to the one above.
use strict;
use warnings;
use File::Find;
my $dir = "d:/testing perl"; # use forward slashes in paths
# fill up our argument list with file names:
find(sub { if (-f && /\.[hc]$/) { push @ARGV, $File::Find::name } }, $dir);
$^I = ".bak"; # supply backup string to enable in-place edit
my $counter = 0;
# now process our files
while (<>) {
s/{/'{ function(' . $counter++ . ')'/ge;
print;
}
Don't be lulled into a false sense of security by the backup option: If you run this script twice in a row, those backups will be overwritten, so keep that in mind.
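For the follow-up (leave lines alone when they contain both a '{' and a '}'), one possible sketch is to wrap the substitution in a small function and skip such lines. instrument_line is a made-up name and the sample lines are invented; this is an assumption about what the follow-up wants, not tested against real C sources:

```perl
use strict;
use warnings;

my $counter = 0;

# Add the call after every '{', except on lines that also contain '}'.
# instrument_line is a hypothetical helper name.
sub instrument_line {
    my ($line) = @_;
    # leave the line untouched if it has both an opening and a closing brace
    return $line if $line =~ /\{/ && $line =~ /\}/;
    # /e evaluates the replacement once per match, so the counter advances
    # for every brace, even several on one line
    $line =~ s/\{/'{function(' . $counter++ . ')'/ge;
    return $line;
}

print instrument_line($_) for "int main() {\n", "if (x) { y(); }\n", "while (1) {\n";
```

Inside the while (<>) loop of the script above, the body would then become $_ = instrument_line($_); print; with no Tie::File needed, since the in-place edit already rewrites each file.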
$ perl -pi -e 's| (?<={) | q#function(# . ($i++) . q#)# |gex' *.c *.h
It can be done in a single line as below:
perl -pi -e 's/({)/"{function(".++$a.")"/ge;' your_file
I have just taken an example input file and tested too.
> cat temp
line-1 { { { {
line-2 { { {
line-3 { {
line-4 {
Now the execution:
> perl -pi -e 's/({)/"{function(".++$a.")"/ge;' temp
> cat temp
line-1 {function(1) {function(2) {function(3) {function(4)
line-2 {function(5) {function(6) {function(7)
line-3 {function(8) {function(9)
line-4 {function(10)
Using awk '/{/{gsub(/{/,"{function("i++")");print;next}{print}' and your code as input:
$ awk '/{/{gsub(/{/,"{function("i++")");print;next}{print}' file
sub fileMode
{function(0)
my @files = <*[ch]>; # get all files ending in .c or .h
foreach $file (@files) # go through all the .c and .h flies in the directory
{function(1)
if (-f $file) # check if it is a file or dir
{function(2)
my @lines;
# copy each line from the text file to the string @lines and add a function call after every '{function(3)' '
tie @lines, 'Tie::File', $file or die "Can't read file: $!\n";
foreach ( @lines )
{function(4)
if (s/{function(5)/@newString/g)
{function(6)
$counter++;
@newString = '{function(7)function('.$counter.')';
}
untie @lines; # free @lines
}
}
}
}
Note: The function number won't be incremented for inline nested {.
$ echo -e '{ { \n{\n-\n{' | awk '/{/{gsub(/{/,"{function("i++")");print;next}1'
{function(0) {function(0)
{function(1)
-
{function(2)
Explanation:
/{/ # For any lines that contain {
gsub( /{/ , "{function("i++")" ) # replace every { on the line with {function(i++), evaluating i++ once per line
print;next # print the line where the replacement happened and skip to the next
print # print all other lines unchanged
I have a simple .csv file from which I want to extract data and write it to a new file.
I want to write a script that reads in a file, reads each line, then splits the columns and writes them in a different order. If a line in the .csv contains 'xxx', it should not be written to the output file.
I have already managed to read in a file and create a secondary file, but I am new to Perl and still trying to work out the commands. The following is a test script I wrote to get to grips with Perl, and I was wondering if I could alter it to do what I need:
open (FILE, "c1.csv") || die "couldn't open the file!";
open (F1, ">c2.csv") || die "couldn't open the file!";
#print "start\n";
sub trim($);
sub trim($)
{
my $string = shift;
$string =~ s/^\s+//;
$string =~ s/\s+$//;
return $string;
}
$a = 0;
$b = 0;
while ($line=<FILE>)
{
chop($line);
if ($line =~ /xxx/)
{
$addr = $line;
$post = substr($line, length($line)-18,8);
}
$a = $a + 1;
}
print $b;
print " end\n";
Any help is much appreciated.
To manipulate CSV files it is better to use one of the available modules at CPAN. I like Text::CSV:
use Text::CSV;
my $csv = Text::CSV->new ({ binary => 1, empty_is_undef => 1 }) or die "Cannot use CSV: ".Text::CSV->error_diag ();
open my $fh, "<", 'c1.csv' or die "ERROR: $!";
$csv->column_names('field1', 'field2');
while ( my $l = $csv->getline_hr($fh)) {
next if ($l->{'field1'} =~ /xxx/);
printf "Field1: %s Field2: %s\n", $l->{'field1'}, $l->{'field2'}
}
close $fh;
If you need to do this only once and won't need the program later, you can do it with a one-liner:
perl -F, -lane 'next if /xxx/; @n=map { s/(^\s*|\s*$)//g;$_ } @F; print join(",", (map{$n[$_]} qw(2 0 1)));'
Breakdown:
perl -F, -lane
     ^^^   ^ <- split lines at ',' and store fields into array @F
next if /xxx/; #skip lines that contain xxx
@n=map { s/(^\s*|\s*$)//g;$_ } @F;
#trim spaces from the beginning and end of each field
#and store the result into new array @n
print join(",", (map{$n[$_]} qw(2 0 1)));
#recombine array @n into new order - here 2 0 1
#join them with comma
#print
Of course, for repeated use or in a bigger project you should use a CPAN module. The one-liner above also has many caveats; for example, it splits on every comma, including commas inside quoted fields.
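Spelled out as a script fragment, the one-liner's logic could be sketched as below. It shares the same caveat (a plain split on commas, so quoted fields containing commas will break); reorder_line, the column order and the sample rows are all made up for illustration:

```perl
use strict;
use warnings;

# Return the line with its fields trimmed and rearranged into @order,
# or undef if the line contains 'xxx' and should be dropped.
# reorder_line is a hypothetical helper name.
sub reorder_line {
    my ($line, @order) = @_;
    return if $line =~ /xxx/;
    chomp $line;
    my @fields = map {
        my $field = $_;
        $field =~ s/^\s+|\s+$//g;     # trim leading and trailing whitespace
        $field;
    } split /,/, $line, -1;           # -1 keeps trailing empty fields
    # a missing column becomes an empty string instead of a warning
    return join ',', map { $_ // '' } @fields[@order];
}

for my $row (" a , b ,c\n", "1,xxx,3\n", "x,y,z\n") {
    my $out = reorder_line($row, 2, 0, 1);
    print "$out\n" if defined $out;
}
```

To use it on files, read c1.csv line by line and print each defined result to a handle opened on c2.csv.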
I have the following script that takes in an input file, output file and
replaces the string in the input file with some other string and writes out
the output file.
I want to change the script to traverse through a directory of files
i.e. instead of prompting for input and output files, the script should take
as argument a directory path such as C:\temp\allFilesTobeReplaced\ and
search for a string x and replace it with y for all files under that
directory path and write out the same files.
How do I do this?
Thanks.
$file=$ARGV[0];
open(INFO,$file);
@lines=<INFO>;
print @lines;
open(INFO,">c:/filelist.txt");
foreach $file (@lines){
#print "$file\n";
print INFO "$file";
}
#print "Input file name: ";
#chomp($infilename = <STDIN>);
if ($ARGV[0]){
$file= $ARGV[0]
}
print "Output file name: ";
chomp($outfilename = <STDIN>);
print "Search string: ";
chomp($search = <STDIN>);
print "Replacement string: ";
chomp($replace = <STDIN>);
open(INFO,$file);
@lines=<INFO>;
open(OUT,">$outfilename") || die "cannot create $outfilename: $!";
foreach $file (@lines){
# read a line from file IN into $_
s/$search/$replace/g; # change the lines
print OUT $_; # print that line to file OUT
}
close(IN);
close(OUT);
The use of the perl single liner
perl -pi -e 's/original string/new string/' filename
can be combined with File::Find, to give the following single script (this is a template I use for many such operations).
use File::Find;
# search for files down a directory hierarchy ('.' taken for this example)
find(\&wanted, ".");
sub wanted
{
if (-f $_)
{
# for the files we are interested in call edit_file().
edit_file($_);
}
}
sub edit_file
{
my ($filename) = @_;
# you can re-create the one-liner above by localizing @ARGV as the list of
# files the <> will process, and localizing $^I as the name of the backup file.
local (@ARGV) = ($filename);
local($^I) = '.bak';
while (<>)
{
s/original string/new string/g;
}
continue
{
print;
}
}
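Since the question asks for the directory and the two strings to be passed in rather than hard-coded, here is a minimal sketch of the same template with parameters. replace_in_tree is a made-up name, and treating the search string as literal text (via \Q...\E) is an assumption; drop the quoting if it should be a regex:

```perl
use strict;
use warnings;
use File::Find;

# Replace every literal occurrence of $search with $replace in all plain
# files below $dir, editing in place and leaving *.bak backups.
# Returns the number of files processed. Hypothetical helper name.
sub replace_in_tree {
    my ($dir, $search, $replace) = @_;
    my @files;
    # collect plain files, skipping backups left by an earlier run
    find(sub { push @files, $File::Find::name if -f && !/\.bak$/ }, $dir);
    return 0 unless @files;       # an empty @ARGV would make <> read STDIN
    local @ARGV = @files;         # files that <> will iterate over
    local $^I   = '.bak';         # enable in-place editing with backups
    while (<>) {
        s/\Q$search\E/$replace/g;
        print;                    # goes to the file currently being edited
    }
    return scalar @files;
}

# usage: perl script.pl <directory> <search> <replace>
if (@ARGV == 3) {
    my $edited = replace_in_tree(@ARGV);
    print "edited $edited file(s)\n";
}
```

As with the template above, running it twice overwrites the .bak files, so keep a separate copy of anything important.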
You can do this with the -i param:
Just process all the files as normal, but include -i.bak:
#!/usr/bin/perl -i.bak
while ( <> ) {
s/before/after/;
print;
}
This should process each file and rename the original to original.bak. And of course you can do it as a one-liner, as mentioned by @Jamie Cook.
Try this
#!/usr/bin/perl -w
@files = <*>;
foreach $file (@files) {
print $file . "\n";
}
Also take a look at glob in Perl:
http://perldoc.perl.org/File/Glob.html
http://www.lyingonthecovers.net/?p=312
I know you can use a simple Perl one-liner from the command line, where filename can be a single filename or a list of filenames. You could probably combine this with bgy's answer to get the desired effect:
perl -pi -e 's/original string/new string/' filename
And I know it's trite but this sounds a lot like sed, if you can use gnu tools:
for i in `find ./allFilesTobeReplaced -type f`; do sed -i 's/original string/new string/g' "$i"; done
perl -pi -e 's#OLD#NEW#g' filename.
You can replace filename with the pattern that suits your file list.