perl - help parsing out number values from many small text files - perl

I have a number of files in a common directory (/home/test) with a common name:
ABC_1_20110508.out
ABC_1_20110509.out
ABC_1_20110510.out
..
Each text file has one record that looks like this:
(count, 553076)
I would like to strip out the numbers and just list them out in a file one at a time.
553076
1005
7778000
...
Can someone show me how to do this using perl?

use this regex:
/\(\w+, (\d+)\)/
you can also use the magic diamond operator to iterate over all of the files at once:
while (<>) {
# extract the number
/\(\w+, (\d+)\)/;
# print it out
print $1, "\n";
}
And if your perl script is called myscript.pl, the you can call it like this:
$ myscript.pl /home/test/ABC_1_*.out

Sounds like a one-liner to me:
$ perl -wne '/(\d+)/ && print "$1\n"' *.out > out.txt

Easiest way is to use the <> operator. When invoking a perl program without arguments, <> acts just like <STDIN>. If you call it arguments, <> will give you the contents of every file in #ARGV without you having to manually manage the filehandles.
Ex: ./your_script.pl /home/test/ABC_1_????????.out or cat /home/test/ABC_1_????????.out | ./your_script.pl. These would have the same effect.

Related

Perl open file from command line with wildcard

I am executing my script this way:
./script.pl -f files*
I looked at some other threads (like How can I open a file in Perl using a wildcard in the directory name?)
If i hard code the file name like it is written in this thread I get my desired result. If I take it from the command line it does not.
My options subroutine should save all the files I get this way in an array.
my #file;
sub Options{
my $i=0;
foreach my $opt (#ARGV){
switch ($opt){
case "-f" {
$i++;
### This part does not work:
#file= glob $ARGV[$i];
print Dumper("$ARGV[$i]"); #$VAR1 = 'files';
print Dumper(#file); #$VAR1 = 'files';
}
}
$i++;
}
}
It seems the execution is interpreted in advance and the wildcard (*) is dropped in the process.
Desired result: All files beginning with files are saved in an array, after execution from the command line.
I hope you get my problem. If not feel free to ask.
Thank you.
Well, first I'd suggest using a module to do args on command line:
Getopt::Long for example.
But otherwise your problem is simpler - your shell is expanding the 'file*' before perl gets it. (shell glob is getting there first).
If you do this with:
-f 'file*'
then it'll work properly. You should be able to see this - for example - if you just:
use Data::Dumper;
print Dumper \#ARGV;
I expect you'll see a much longer list than you thought.
However, I'd also point out - perl has a really nice feature you may be able to use (depending what you're doing with your files).
You can use <>, which automatically opens and reads all files specified on command line (in order).
Since your shell is already expanding the glob files* into a list of filenames, that's what the Perl program gets.
$ perl -E 'say #ARGV' files*
files1files2files3
There's no need to do that in Perl, if your shell can do it for you. If all you want is the filenames in an array, you already have #ARGV which contains those.

How to display lines in a file where it contains more than 5 comma in a line using egrep or awk

I have the lines in the following format:
enter image description here
Help is required to display the line alone containing more than 5 comma in a line in a separate file.
perl has a tr (translate) operator that returns the number of translations that occurred. We can use this to count substrings in a string.
cat file.txt | perl -ne 'print if tr/,// > 5'
Using egrep:
egrep '([^,]*,){6,}'
Using awk:
awk -F, 'NF>5{print}'
Using a sed which has an "extended regular expression option" (I'll assume -r here, but it could be -E):
sed -n -r -e '/([^,]*,){6,}/p'
Of course you have to be careful what you ask for. For example, if you have a CSV file with commas embedded within "values", and if you only want lines with more than five "values", then things get a little trickier for tools that are not CSV-aware.
Text in image looks like CSV.
then, using AWK...
awk -F'","' 'NF>5{print}'
like peak's above answer.
I think you already have answers to your raw question here. However, if what you're really asking is if you want to find how many rows have CSV fields that exceed 5, then I think you need something like Perl's Text::CSV module.
An example of this is the following string:
one,two,three,four,five,"six,seven"
This has six commas but only five fields. Do you want to see this line, or do you want to skip it? If you want to see it (as an exception -- a line with more than five commas), then use one of the methods already suggested.
If you don't, then you really want a CSV parser, and Perl's is quite nice -- more lightweight and easier than most languages, in my opinion:
use strict;
use Text::CSV;
my $csv = Text::CSV->new ( { binary => 1 } );
open my $IN, "<:encoding(utf8)", "file.csv" or die;
while (my $row = $csv->getline($IN)) {
if (#$row > 5) {
$csv->combine(#$row);
print $csv->string(), "\n";
}
}
close $IN;

Unix - filename and string result on same line

I need to search a directory that has hundreds or thousands of files, each containing XML with one or more instances of a specific string (begin/end tag with data).
I can get all the instances of the string by doing
grep -ho '<mytagname>..............<\/mytagname>' /home/xyzzy/mydata/*.XML > /home/mydata/tagvalues.txt
then a few sed commands to strip off the tags, so I wind up with a file just containing a list of values:
value001
value002
value003
(etc)
Ideally though, I'd like to have each line of the file to also include the filename so I can import into a database for analysis.
So my result would be something like this
fileAAA value001
fileAAA value002
fileAAA value003
fileBBB value004
Exact formatting of the above is flexible - could have spaces or other separator, it could even still include the begin/end tags.
The closest I've been able to get is with grep -o
fileAAA:value001
value002
value003
fileBBB:value004
A perl one-liner would seem ideal but I'm new enough to that, that I have no clue how to begin.
Could be done using a one-liner like so:
perl -lne 'print "$ARGV $1" if /<mytagname>(.*?)<\/mytagname>/' *.xml
However, I'd strongly recommend that you use an actual XML parser like XML::Twig or XML::LibXML
use strict;
use warnings;
use XML::LibXML;
for my $file (</home/xyzzy/mydata/*.XML>) {
my $doc = XML::LibXML->load_xml(location => $file);
for my $node ($doc->findnodes("//mytagname")) {
print "$file " . $node->textContent() . "\n";
}
}
What about awk?
awk -F'</?mytagname>' '$2 {print FILENAME,$2}' /home/xyzzy/mydata/*.XML
Explanation:
-F regex - set field delimiter must be a separate argument thus enclosed in its own quotes
$2 - if second field has a value
{print FILENAME,$2} - print filename SPACE the value of second field

What's the use of <> in Perl?

What's the use of <> in Perl. How to use it ?
If we simply write
<>;
and
while(<>)
what is that the program doing in both cases?
The answers above are all correct, but it might come across more plainly if you understand general UNIX command line usage. It is very common to want a command to work on multiple files. E.g.
ls -l *.c
The command line shell (bash et al) turns this into:
ls -l a.c b.c c.c ...
in other words, ls never see '*.c' unless the pattern doesn't match. Try this at a command prompt (not perl):
echo *
you'll notice that you do not get an *.
So, if the shell is handing you a bunch of file names, and you'd like to go through each one's data in turn, perl's <> operator gives you a nice way of doing that...it puts the next line of the next file (or stdin if no files are named) into $_ (the default scalar).
Here is a poor man's grep:
while(<>) {
print if m/pattern/;
}
Running this script:
./t.pl *
would print out all of the lines of all of the files that match the given pattern.
cat /etc/passwd | ./t.pl
would use cat to generate some lines of text that would then be checked for the pattern by the loop in perl.
So you see, while(<>) gets you a very standard UNIX command line behavior...process all of the files I give you, or process the thing I piped to you.
<>;
is a short way of writing
readline();
or if you add in the default argument,
readline(*ARGV);
readline is an operator that reads a line from the specified file handle. Reading from the special file handle ARGV will read from STDIN if #ARGV is empty or from the concatenation of the files named by #ARGV if it's not.
As for
while (<>)
It's a syntax error. If you had
while (<>) { ... }
it get rewritten to
while (defined($_ = <>)) { ... }
And as previously explained, that means the same as
while (defined($_ = readline(*ARGV))) { ... }
That means it will read lines from (previously explained) ARGV until there are no more lines to read.
It is called the diamond operator and feeds data from either stdin if ARGV is empty or each line from the files named in ARGV. This webpage http://docstore.mik.ua/orelly/perl/learn/ch06_02.htm explains it very well.
In many cases of programming with syntactical sugar like this, Deparse of O is helpful to find out what's happening:
$ perl -MO=Deparse -e 'while(<>){print 42}'
while (defined($_ = <ARGV>)) {
print 42;
}
-e syntax OK
Quoting perldoc perlop:
The null filehandle <> is special: it can be used to emulate the
behavior of sed and awk, and any other Unix filter program that takes
a list of filenames, doing the same to each line of input from all of
them. Input from <> comes either from standard input, or from each
file listed on the command line.
it takes the STDIN standard input:
> cat temp.pl
#!/usr/bin/perl
use strict;
use warnings;
my $count=<>;
print "$count"."\n";
>
below is the execution:
> temp.pl
3
3
>
so as soon as you execute the script it will wait till the user gives some input.
after 3 is given as input,it stores that value in $count and it prints the value in the next statement.

How do I best pass arguments to a Perl one-liner?

I have a file, someFile, like this:
$cat someFile
hdisk1 active
hdisk2 active
I use this shell script to check:
$cat a.sh
#!/usr/bin/ksh
for d in 1 2
do
grep -q "hdisk$d" someFile && echo "$d : ok"
done
I am trying to convert it to Perl:
$cat b.sh
#!/usr/bin/ksh
export d
for d in 1 2
do
cat someFile | perl -lane 'BEGIN{$d=$ENV{'d'};} print "$d: OK" if /hdisk$d\s+/'
done
I export the variable d in the shell script and get the value using %ENV in Perl. Is there a better way of passing this value to the Perl one-liner?
You can enable rudimentary command line argument with the "s" switch. A variable gets defined for each argument starting with a dash. The -- tells where your command line arguments start.
for d in 1 2 ; do
cat someFile | perl -slane ' print "$someParameter: OK" if /hdisk$someParameter\s+/' -- -someParameter=$d;
done
See: perlrun
Sometimes breaking the Perl enclosure is a good trick for these one-liners:
for d in 1 2 ; do cat kk2 | perl -lne ' print "'"${d}"': OK" if /hdisk'"${d}"'\s+/';done
Pass it on the command line, and it will be available in #ARGV:
for d in 1 2
do
perl -lne 'BEGIN {$d=shift} print "$d: OK" if /hdisk$d\s+/' $d someFile
done
Note that the shift operator in this context removes the first element of #ARGV, which is $d in this case.
Combining some of the earlier suggestions and adding my own sugar to it, I'd do it this way:
perl -se '/hdisk([$d])/ && print "$1: ok\n" for <>' -- -d='[value]' [file]
[value] can be a number (i.e. 2), a range (i.e. 2-4), a list of different numbers (i.e. 2|3|4) (or almost anything else, that's a valid pattern) or even a bash variable containing one of those, example:
d='2-3'
perl -se '/hdisk([$d])/ && print "$1: ok\n" for <>' -- -d=$d someFile
and [file] is your filename (that is, someFile).
If you are having trouble writing a one-liner, maybe it is a bit hard for one line (just my opinion). I would agree with #FM's suggestion and do the whole thing in Perl. Read the whole file in and then test it:
use strict;
local $/ = '' ; # Read in the whole file
my $file = <> ;
for my $d ( 1 .. 2 )
{
print "$d: OK\n" if $file =~ /hdisk$d\s+/
}
You could do it looping, but that would be longer. Of course it somewhat depends on the size of the file.
Note that all the Perl examples so far will print a message for each match - can you be sure there are no duplicates?
My solution is a little different. I came to your question with a Google search the title of your question, but I'm trying to execute something different. Here it is in case it helps someone:
FYI, I was using tcsh on Solaris.
I had the following one-liner:
perl -e 'use POSIX qw(strftime); print strftime("%Y-%m-%d", localtime(time()-3600*24*2));'
which outputs the value:
2013-05-06
I was trying to place this into a shell script so I could create a file with a date in the filename, of X numbers of days in the past. I tried:
set dateVariable=`perl -e 'use POSIX qw(strftime); print strftime("%Y-%m-%d", localtime(time()-3600*24*$numberOfDaysPrior));'`
But this didn't work due to variable substitution. I had to mess around with the quoting, to get it to interpret it properly. I tried enclosing the whole lot in double quotes, but this made the Perl command not syntactically correct, as it messed with the double quotes around date format. I finished up with:
set dateVariable=`perl -e "use POSIX qw(strftime); print strftime('%Y-%m-%d', localtime(time()-3600*24*$numberOfDaysPrior));"`
Which worked great for me, without having to resort to any fancy variable exporting.
I realise this doesn't exactly answer your specific question, but it answered the title and might help someone else!
That looks good, but I'd use:
for d in $(seq 1 2); do perl -nle 'print "hdisk$ENV{d} OK" if $_ =~ /hdisk$ENV{d}/' someFile; done
It's already written on the top in one long paragraph but I am also writing for lazy developers who don't read those lines.
Double quotes and single quote has big different meaning for the bash.
So please take care
Doesn't WORK perl '$VAR' $FILEPATH
WORKS perl "$VAR" $FILEPATH