New to Perl - Parsing file and replacing pattern with dynamic values - perl
I am very new to Perl and I am currently trying to convert a bash script to Perl.
My script converts nmon files (AIX / Linux performance monitoring tool): it takes the nmon files present in a directory, greps and redirects the specific section to a temp file, then greps and redirects the associated timestamps to another file.
Then it parses the data into a final csv file that will be indexed by a third tool to be exploited.
A sample NMON data looks like:
TOP,%CPU Utilisation
TOP,+PID,Time,%CPU,%Usr,%Sys,Threads,Size,ResText,ResData,CharIO,%RAM,Paging,Command,WLMclass
TOP,5165226,T0002,10.93,9.98,0.95,1,54852,4232,51220,311014,0.755,1264,PatrolAgent,Unclassified
TOP,5365876,T0002,1.48,0.81,0.67,135,85032,132,84928,38165,1.159,0,db2sysc,Unclassified
TOP,5460056,T0002,0.32,0.27,0.05,1,5060,616,4704,1719,0.072,0,db2kmchan64.v9,Unclassified
The field "Time" (seen as T0002, and really called ZZZZ in NMON) is a specific NMON timestamp; the real value of this timestamp is present later in the NMON file (in a dedicated section) and looks like:
ZZZZ,T0001,00:09:55,01-JAN-2014
ZZZZ,T0002,00:13:55,01-JAN-2014
ZZZZ,T0003,00:17:55,01-JAN-2014
ZZZZ,T0004,00:21:55,01-JAN-2014
ZZZZ,T0005,00:25:55,01-JAN-2014
The NMON format is very specific and can't be exploited directly without being parsed; each timestamp has to be associated with the corresponding value. (An NMON file is almost a concatenation of numerous different csv files, each with a different format, different fields and so on.)
I wrote the following bash script to parse the section I'm interested in (the "TOP" section, which represents top process CPU, memory and I/O stats per host):
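To make the association concrete, it boils down to one hash lookup: read the ZZZZ section once into a hash keyed by the Txxxx token, then resolve each TOP line's Time field against it. A minimal, self-contained sketch (the sample records are the ones shown above; the variable names are mine):

```perl
use strict;
use warnings;

# Two sample ZZZZ lines, as shown above.
my @zzzz = (
    "ZZZZ,T0001,00:09:55,01-JAN-2014",
    "ZZZZ,T0002,00:13:55,01-JAN-2014",
);

# Build a lookup table: "T0002" => "01-JAN-2014 00:13:55"
my %timestamps;
for my $line (@zzzz) {
    my (undef, $key, $time, $date) = split /,/, $line;
    $timestamps{$key} = "$date $time";
}

# The Time field of a TOP record (third column) now resolves in constant time.
my @top = split /,/, "TOP,5165226,T0002,10.93";
print "$timestamps{$top[2]}\n";    # 01-JAN-2014 00:13:55
```

This is the same join the bash script performs with `grep ${Time} ${timestampfile}` on every line, but done once up front instead of once per record.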
#!/bin/bash
# set -x
################################################################
# INFORMATION
################################################################
# nmon2csv_TOP.sh
# Convert TOP section of nmon files to csv
# CAUTION: This script is expected to be launched by the main workflow
# $DST and $DST_CONVERTED_TOP are exported by it; if not, this script will exit at launch time
################################################################
# VARS
################################################################
# Location of NMON files
NMON_DIR=${DST}
# Location of generated files
OUTPUT_DIR=${DST_CONVERTED_TOP}
# Temp files
rawdatafile=/tmp/temp_rawdata.$$.temp
timestampfile=/tmp/temp_timestamp.$$.temp
# Main Output file
finalfile=${DST_CONVERTED_TOP}/NMON_TOP_processed_at_date_`date '+%F'`.csv
###########################
# BEGIN OF WORK
###########################
# Verify exported vars are not null
if [ -z ${NMON_DIR} ]; then
echo -e "\nERROR: Var NMON_DIR is null!\n" && exit 1
elif [ -z ${OUTPUT_DIR} ]; then
echo -e "\nERROR: Var OUTPUT_DIR is null!\n" && exit 1
fi
# Remove temp and output files if they already exist
# (separate ifs: with elif, only the first existing file would be removed)
if [ -s ${rawdatafile} ]; then
rm -f ${rawdatafile}
fi
if [ -s ${timestampfile} ]; then
rm -f ${timestampfile}
fi
if [ -s ${finalfile} ]; then
rm -f ${finalfile}
fi
# Get current location
PWD=`pwd`
# Go to NMON files location
cd ${NMON_DIR}
# For each NMON file present:
# To restrict to only PROD env: `ls *.nmon | grep -E -i 'sp|gp|ge'`
for NMON_FILE in `ls *.nmon | grep -E -i 'sp|gp|ge'`; do
# Set Hostname identification
serialnum=`grep 'AAA,SerialNumber,' ${NMON_FILE} | awk -F, '{print $3}' OFS=, | tr '[:lower:]' '[:upper:]'`
hostname=`grep 'AAA,host,' ${NMON_FILE} | awk -F, '{print $3}' OFS=, | tr '[:lower:]' '[:upper:]'`
# Grep and redirect TOP Section
grep 'TOP' ${NMON_FILE} | grep -v 'AAA,version,TOPAS-NMON' | grep -v 'TOP,%CPU Utilisation' > ${rawdatafile}
# Grep and redirect associated timestamps (ZZZZ)
grep 'ZZZZ' ${NMON_FILE}> ${timestampfile}
# Begin of work
while IFS=, read TOP PID Time Pct_CPU Pct_Usr Pct_Sys Threads Size ResText ResData CharIO Pct_RAM Paging Command WLMclass
do
timestamp=`grep ${Time} ${timestampfile} | awk -F, '{print $4 " "$3}' OFS=,`
echo ${serialnum},${hostname},${timestamp},${Time},${PID},${Pct_CPU},${Pct_Usr},${Pct_Sys},${Threads},${Size},${ResText},${ResData},${CharIO},${Pct_RAM},${Paging},${Command},${WLMclass} \
| grep -v '+PID,%CPU,%Usr,%Sys,Threads,Size,ResText,ResData,CharIO,%RAM,Paging,Command,WLMclass' >> ${finalfile}
done < ${rawdatafile}
echo -e "INFO: Done for Serialnum: ${serialnum} Hostname: ${hostname}"
done
# Go back to initial location
cd ${PWD}
###########################
# END OF WORK
###########################
This works as wanted and generates a main csv file (you'll see in the code that I voluntarily don't keep the csv header in the file) which is a concatenation of all parsed hosts.
But I have a very large number of hosts to treat each day (around 3000 hosts); with the current code, in the worst cases, it can take a few minutes to generate data for one host, and multiplied by the number of hosts, minutes easily become hours...
So this code is really not performant enough to deal with such an amount of data.
10 hosts represent around 200,000 lines, which finally represents around 20 MB of csv file.
That's not that much, but I think that a shell script is probably not the best choice to manage such a process...
I guess that Perl shall be much better at this task (even if the shell script could probably be improved), but my knowledge of Perl is (currently) very poor; this is why I ask for your help... I think the code should be quite simple to do in Perl, but I can't get it to work as of now...
One guy used to develop a Perl script to manage NMON files and convert them to sql files (to dump these data into a database); I adapted it to use its features and, with the help of some shell scripts, I manage the sql files to get my final csv files.
But the TOP section was not integrated into that Perl script, and it can't be used for that without being redeveloped.
The code in question:
#!/usr/bin/perl
# Program name: nmon2mysql.pl
# Purpose - convert nmon.csv file(s) into mysql insert file
# Author - Bruce Spencer
# Disclaimer: this is provided "as is".
# Date - March 2007
#
$nmon2mysql_ver="1.0. March 2007";
use Time::Local;
#################################################
## Your Customizations Go Here ##
#################################################
# Source directory for nmon csv files
my $NMON_DIR=$ENV{DST_TMP};
my $OUTPUT_DIR=$ENV{DST_CONVERTED_CPU_ALL};
# End "Your Customizations Go Here".
# You're on your own, if you change anything beyond this line :-)
####################################################################
############# Main Program ############
####################################################################
# Initialize common variables
&initialize;
# Process all "nmon" files located in the $NMON_DIR
# @nmon_files=`ls $NMON_DIR/*.nmon $NMON_DIR/*.csv`;
@nmon_files=`ls $NMON_DIR/*.nmon`;
if (@nmon_files eq 0 ) { die ("No \*.nmon or csv files found in $NMON_DIR\n"); }
@nmon_files=sort(@nmon_files);
chomp(@nmon_files);
foreach $FILENAME ( @nmon_files ) {
@cols= split(/\//,$FILENAME);
$BASEFILENAME= $cols[@cols-1];
unless (open(INSERT, ">$OUTPUT_DIR/$BASEFILENAME.sql")) {
die("Can not open /$OUTPUT_DIR/$BASEFILENAME.sql\n");
}
print INSERT ("# nmon version: $NMONVER\n");
print INSERT ("# AIX version: $AIXVER\n");
print INSERT ("use nmon;\n");
$start=time();
@now=localtime($start);
$now=join(":",@now[2,1,0]);
print ("$now: Begin processing file = $FILENAME\n");
# Parse nmon file, skip if unsuccessful
if (( &get_nmon_data ) gt 0 ) { next; }
$now=time();
$now=$now-$start;
print ("\t$now: Finished get_nmon_data\n");
# Static variables (number of fields always the same)
#@static_vars=("LPAR","CPU_ALL","FILE","MEM","PAGE","MEMNEW","MEMUSE","PROC");
#@static_vars=("LPAR","CPU_ALL","FILE","MEM","PAGE","MEMNEW","MEMUSE");
@static_vars=("CPU_ALL");
foreach $key (@static_vars) {
&mk_mysql_insert_static($key);
$now=time();
$now=$now-$start;
print ("\t$now: Finished $key\n");
} # end foreach
# Dynamic variables (variable number of fields)
#@dynamic_vars=("DISKBSIZE","DISKBUSY","DISKREAD","DISKWRITE","DISKXFER","ESSREAD","ESSWRITE","ESSXFER","IOADAPT","NETERROR","NET","NETPACKET");
@dynamic_vars=("");
foreach $key (@dynamic_vars) {
&mk_mysql_insert_variable($key);
$now=time();
$now=$now-$start;
print ("\t$now: Finished $key\n");
}
close(INSERT);
# system("gzip","$FILENAME");
}
exit(0);
############################################
############# Subroutines ############
############################################
##################################################################
## Extract CPU_ALL data for Static fields
##################################################################
sub mk_mysql_insert_static {
my($nmon_var)=@_;
my $table=lc($nmon_var);
my @rawdata;
my $x;
my @cols;
my $comma;
my $TS;
my $n;
@rawdata=grep(/^$nmon_var,/, @nmon);
if (@rawdata < 1) { return(1); }
@rawdata=sort(@rawdata);
@cols=split(/,/,$rawdata[0]);
$x=join(",",@cols[2..@cols-1]);
$x=~ s/\%/_PCT/g;
$x=~ s/\(MB\)/_MB/g;
$x=~ s/-/_/g;
$x=~ s/ /_/g;
$x=~ s/__/_/g;
$x=~ s/,_/,/g;
$x=~ s/_,/,/g;
$x=~ s/^_//;
$x=~ s/_$//;
print INSERT (qq|insert into $table (serialnum,hostname,mode,nmonver,time,ZZZZ,$x) values\n| );
$comma="";
$n=@cols;
$n=$n-1; # number of columns -1
for($i=1;$i<@rawdata;$i++){
$TS=$UTC_START + $INTERVAL*($i);
@cols=split(/,/,$rawdata[$i]);
$x=join(",",@cols[2..$n]);
$x=~ s/,,/,-1,/g; # replace missing data ",," with a ",-1,"
print INSERT (qq|$comma("$SN","$HOSTNAME","$MODE","$NMONVER",$TS,"$DATETIME{$cols[1]}",$x)| );
$comma=",\n";
}
print INSERT (qq|;\n\n|);
} # end mk_mysql_insert
##################################################################
## Extract CPU_ALL data for variable fields
##################################################################
sub mk_mysql_insert_variable {
my($nmon_var)=@_;
my $table=lc($nmon_var);
my @rawdata;
my $x;
my $j;
my @cols;
my $comma;
my $TS;
my $n;
my @devices;
@rawdata=grep(/^$nmon_var,/, @nmon);
if ( @rawdata < 1) { return; }
@rawdata=sort(@rawdata);
$rawdata[0]=~ s/\%/_PCT/g;
$rawdata[0]=~ s/\(/_/g;
$rawdata[0]=~ s/\)/_/g;
$rawdata[0]=~ s/ /_/g;
$rawdata[0]=~ s/__/_/g;
$rawdata[0]=~ s/,_/,/g;
@devices=split(/,/,$rawdata[0]);
print INSERT (qq|insert into $table (serialnum,hostname,time,ZZZZ,device,value) values\n| );
$n=@rawdata;
$n--;
for($i=1;$i<@rawdata;$i++){
$TS=$UTC_START + $INTERVAL*($i);
$rawdata[$i]=~ s/,$//;
@cols=split(/,/,$rawdata[$i]);
print INSERT (qq|\n("$SN","$HOSTNAME",$TS,"$DATETIME{$cols[1]}","$devices[2]",$cols[2])| );
for($j=3;$j<@cols;$j++){
print INSERT (qq|,\n("$SN","$HOSTNAME",$TS,"$DATETIME{$cols[1]}","$devices[$j]",$cols[$j])| );
}
if ($i < $n) { print INSERT (","); }
}
print INSERT (qq|;\n\n|);
} # end mk_mysql_insert_variable
########################################################
### Get an nmon setting from csv file ###
### finds first occurrence of $search ###
### Return the selected column...$return_col ###
### Syntax: ###
### get_setting($search,$col_to_return,$separator)##
########################################################
sub get_setting {
my $i;
my $value="-1";
my ($search,$col,$separator)= @_; # search text, $col, $separator
for ($i=0; $i<@nmon; $i++){
if ($nmon[$i] =~ /$search/ ) {
$value=(split(/$separator/,$nmon[$i]))[$col];
$value =~ s/["']*//g; #remove non alphanum characters
return($value);
} # end if
} # end for
return($value);
} # end get_setting
#####################
## Clean up ##
#####################
sub clean_up_line {
# remove characters not compatible with nmon variable
# Max rrdtool variable length is 19 chars
# Variable can not contain special characters (% - () )
my ($x)=@_;
# print ("clean_up, before: $i\t$nmon[$i]\n");
$x =~ s/\%/Pct/g;
# $x =~ s/\W*//g;
$x =~ s/\/s/ps/g; # /s - ps
$x =~ s/\//s/g; # / - s
$x =~ s/\(/_/g;
$x =~ s/\)/_/g;
$x =~ s/ /_/g;
$x =~ s/-/_/g;
$x =~ s/_KBps//g;
$x =~ s/_tps//g;
$x =~ s/[:,]*\s*$//;
$retval=$x;
} # end clean up
##########################################
## Extract headings from nmon csv file ##
##########################################
sub initialize {
%MONTH2NUMBER = ("jan", 1, "feb",2, "mar",3, "apr",4, "may",5, "jun",6, "jul",7, "aug",8, "sep",9, "oct",10, "nov",11, "dec",12 );
@MONTH2ALPHA = ( "junk","jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec" );
} # end initialize
# Get data from nmon file, extract specific data fields (hostname, date, ...)
sub get_nmon_data {
my $key;
my $x;
my $category;
my %toc;
my @cols;
# Read nmon file
unless (open(FILE, $FILENAME)) { return(1); }
@nmon=<FILE>; # input entire file
close(FILE);
chomp(@nmon);
# Cleanup nmon data: remove trailing commas and colons
for($i=0; $i<@nmon;$i++ ) {
$nmon[$i] =~ s/[:,]*\s*$//;
}
# Get nmon/server settings (search string, return column, delimiter)
$AIXVER =&get_setting("AIX",2,",");
$DATE =&get_setting("date",2,",");
$HOSTNAME =&get_setting("host",2,",");
$INTERVAL =&get_setting("interval",2,","); # nmon sampling interval
$MEMORY =&get_setting(qq|lsconf,"Good Memory Size:|,1,":");
$MODEL =&get_setting("modelname",3,'\s+');
$NMONVER =&get_setting("version",2,",");
$SNAPSHOTS =&get_setting("snapshots",2,","); # number of readings
$STARTTIME =&get_setting("AAA,time",2,",");
($HR, $MIN)=split(/\:/,$STARTTIME);
if ($AIXVER eq "-1") {
$SN=$HOSTNAME; # Probably a Linux host
} else {
$SN =&get_setting("systemid",4,",");
$SN =(split(/\s+/,$SN))[0]; # "systemid IBM,SN ..."
}
$TYPE =&get_setting("^BBBP.*Type",3,",");
if ( $TYPE =~ /Shared/ ) { $TYPE="SPLPAR"; } else { $TYPE="Dedicated"; }
$MODE =&get_setting("^BBBP.*Mode",3,",");
$MODE =(split(/: /, $MODE))[1];
# $MODE =~s/\"//g;
# Calculate UTC time (seconds since 1970)
# NMON V9 dd/mm/yy
# NMON V10+ dd-MMM-yyyy
if ( $DATE =~ /[a-zA-Z]/ ) { # Alpha = assume dd-MMM-yyyy date format
($DAY, $MMM, $YR)=split(/\-/,$DATE);
$MMM=lc($MMM);
$MON=$MONTH2NUMBER{$MMM};
} else {
($DAY, $MON, $YR)=split(/\//,$DATE);
$YR=$YR + 2000;
$MMM=$MONTH2ALPHA[$MON];
} # end if
## Calculate UTC time (seconds since 1970). Required format for the rrdtool.
## timelocal format
## day=1-31
## month=0-11
## year = x -1900 (time since 1900) (seems to work with either 2006 or 106)
$m=$MON - 1; # jan=0, feb=1, ...
$UTC_START=timelocal(0,$MIN,$HR,$DAY,$m,$YR);
$UTC_END=$UTC_START + $INTERVAL * $SNAPSHOTS;
@ZZZZ=grep(/^ZZZZ,/,@nmon);
for ($i=0;$i<@ZZZZ;$i++){
@cols=split(/,/,$ZZZZ[$i]);
($DAY,$MON,$YR)=split(/-/,$cols[3]);
$MON=lc($MON);
$MON="00" . $MONTH2NUMBER{$MON};
$MON=substr($MON,-2,2);
$ZZZZ[$i]="$YR-$MON-$DAY $cols[2]";
$DATETIME{$cols[1]}="$YR-$MON-$DAY $cols[2]";
} # end ZZZZ
return(0);
} # end get_nmon_data
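As an aside, the script's get_setting helper is just a first-match column extractor over the slurped file. A hedged standalone sketch of the same idea (unlike the original, this version takes the lines as a parameter instead of reading the global @nmon, so it can be run on its own):

```perl
use strict;
use warnings;

# Minimal re-implementation of the get_setting idea:
# find the first line matching $search and return column $col,
# split on $separator, with quotes stripped; "-1" if nothing matches.
sub get_setting {
    my ($search, $col, $separator, @lines) = @_;
    for my $line (@lines) {
        if ($line =~ /$search/) {
            my $value = (split /$separator/, $line)[$col];
            $value =~ s/["']//g;    # remove quote characters
            return $value;
        }
    }
    return "-1";
}

my @nmon = ("AAA,host,myserver", "AAA,interval,240");
print get_setting("host", 2, ",", @nmon), "\n";      # myserver
print get_setting("interval", 2, ",", @nmon), "\n";  # 240
```

The original returns "-1" as a sentinel for "not found", which is why get_nmon_data later tests `$AIXVER eq "-1"` to detect Linux hosts.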
It almost does the job (I say almost because with recent NMON versions it can sometimes have issues when no data is present), and it does it much, much faster than my shell script would if I used it for these sections.
This is why I think Perl shall be a perfect solution.
Of course, I don't ask anyone to convert my shell script into something final in Perl, but at least to point me in the right direction :-)
I really thank anyone in advance for your help!
Normally I am strongly opposed to questions like this, but our production systems are down and until they are fixed I do not really have all that much to do...
Here is some code that might get you started. Please consider it pseudo code, as it is completely untested and probably won't even compile (I always forget some parentheses or semicolons and, as I said, the actual machines that can run code are unreachable), but I commented a lot and hopefully you will be able to modify it to your actual needs and get it to run.
use strict;
use warnings;
open INFILE, "<", "path/to/file.nmon" or die "Can't open input: $!"; # Open the file.
my @topLines; # Initialize variables.
my %timestamps;
while (<INFILE>) # This will walk over all the lines of the infile,
{ # storing the current line in $_.
chomp $_; # Remove newline at the end.
if ($_ =~ m/^TOP,\d/) # If the line is a TOP data line (skips the two TOP header lines)...
{
push @topLines, $_; # ...store it in the array for later use.
}
elsif ($_ =~ m/^ZZZZ/) # If it is in the ZZZZ section...
{
my @fields = split ',', $_; # ...split the line at commas...
my $timestamp = join ",", $fields[2], $fields[3]; # ...join the timestamp into a string as you wish...
$timestamps{$fields[1]} = $timestamp; # ...and store it in the hash with the Twhatever thing as key.
}
# This iteration could certainly be improved with more knowledge
# of how the file looks. For example the search could be cancelled
# after the ZZZZ section if the file is still long.
}
close INFILE;
open OUTFILE, ">", "path/to/output.csv" or die "Can't open output: $!"; # Open the file you want your output in.
foreach (@topLines) # Iterate through all elements of the array,
{ # once again storing the current value in $_.
my @fields = split ',', $_; # Probably not necessary, depending on how output should be formatted.
my $outstring = join ',', $fields[0], $fields[1], $timestamps{$fields[2]}; # And whatever other fields you care for.
print OUTFILE $outstring, "\n"; # Print.
}
close OUTFILE;
print "Done.\n";
Related
Perl Script Not Liking Date Extension
why do I receive the error complaining about the parenthesis ? sh: syntax error at line 1 : `)' unexpected when adding this date extension to the new file -- mv abc abc$(date +%Y%m%d%H%M%S) for it seems that it doesn't like that last parenthesis #!/usr/bin/perl # =========================================== # # Script to watch POEDIACK file size # # - Comments - # # script will check the file size of the POEDIACK file in # $LAWDIR/$PLINE/edi/in. # If it's > 1 gig, it will send notification via email # # # =========================================== # use strict; use POSIX qw(strftime); # get env vars from system my $LAWDIR = #ENV{'LAWDIR'}; my $PLINE = #ENV{'PLINE'}; #my $email_file = "/lsf10/monitors/poediack.email"; my $curr_date = strftime('%m%d%Y', localtime); my $ack_file = "$LAWDIR" . "/$PLINE" . "/edi/in/POEDIACK"; my $ack_location = "$LAWDIR" . "/$PLINE" . "/edi/in/"; my $mv_location = "$LAWDIR" . "/$PLINE" . "/edi/in/Z_files"; my $ack_file_limit = 10; #my $ack_file_limit = 1000000000; my $ack_file_size; if( -e $ack_file) { $ack_file_size = -s $ack_file; if ( $ack_file_size > $ack_file_limit ) { `compress -vf $ack_file`; `mv $mv_location\$ack_file.Z $mv_location\$ack_file.Z.$(date +%Y%m%d%H%M%S)`; } } else { print "POEDIACK File not found: $ack_file\n"; } ### end perl script ###
$( is being interpreted as a variable. It is the group ID of the process. You need to escape it. And you probably shouldn't escape $ack_file. `mv $mv_location$ack_file.Z $mv_location$ack_file.Z.\$(date +%Y%m%d%H%M%S)`; It's safer and faster to avoid complicated shell commands and use rename instead. use autodie; my $timestamp = strftime('%Y%m%d%H%M%S', localtime); rename "$mv_location$ack_file.Z", "$mv_location$ack_file.Z.$timestamp"; Or use an existing log rotator.
Modify Perl search in file to include only specified directories
I found the code sample below here. It searches for text in files, recursing through sub-directories, but I want to specify a subset of the first level of sub-directories to recurse through. E.g. suppose I'm in directory C:\ which contains directories bin, src, and Windows, and I want to recursively search for .h and .c files containing the text "include", I'd run the following with the MWE below, where my code is in textsearch.pl: perl textsearch.pl include "(\.)(h|c)($)" How can I modify this program to only search in bin and src but not Windows, while at the same time still recursing into sub-directories of bin and src? I.e. I'd like to be able to do something like the following: perl textsearch.pl include "(\.)(h|c)($)" src,bin I thought File::Find::Rule would help, but I'm having trouble figuring out how to apply it here. Also, if there's another much simpler way to do all this, I'd love to hear it. MWE I found: use strict; use warnings; use Cwd; use File::Find; use File::Basename; my ($in_rgx,$in_files,$simple,$matches,$cwd); sub trim($) { my $string = shift; $string =~ s/[\r\n]+//g; $string =~ s/\s+$//; return $string; } # 1: Get input arguments if ($#ARGV == 0) { # *** ONE ARGUMENT *** (search pattern) ($in_rgx,$in_files,$simple) = ($ARGV[0],".",1); } elsif ($#ARGV == 1) { # *** TWO ARGUMENTS *** (search pattern + filename or flag) if (($ARGV[1] eq '-e') || ($ARGV[1] eq '-E')) { # extended ($in_rgx,$in_files,$simple) = ($ARGV[0],".",0); } else { # simple ($in_rgx,$in_files,$simple) = ($ARGV[0],$ARGV[1],1); } } elsif ($#ARGV == 2) { # *** THREE ARGUMENTS *** (search pattern + filename + flag) ($in_rgx,$in_files,$simple) = ($ARGV[0],$ARGV[1],0); } else { # *** HELP *** (either no arguments or more than three) print "Usage: ".basename($0)." regexpattern [filepattern] [-E]\n\n" . "Hints:\n" . "*) If you need spaces in your pattern, put quotation marks around it.\n" . "*) To do a case insensitive match, use (?i) preceding the pattern.\n" . 
"*) Both patterns are regular expressions, allowing powerful searches.\n" . "*) The file pattern is always case insensitive.\n"; exit; } if ($in_files eq '.') { # 2: Output search header print basename($0).": Searching all files for \"${in_rgx}\"... (".(($simple) ? "simple" : "extended").")\n"; } else { print basename($0).": Searching files matching \"${in_files}\" for \"${in_rgx}\"... (".(($simple) ? "simple" : "extended").")\n"; } if ($simple) { print "\n"; } # 3: Traverse directory tree using subroutine 'findfiles' ($matches,$cwd) = (0,cwd); $cwd =~ s,/,\\,g; find(\&findfiles, $cwd); sub findfiles { # 4: Used to iterate through each result my $file = $File::Find::name; # complete path to the file $file =~ s,/,\\,g; # substitute all / with \ return unless -f $file; # process files (-f), not directories return unless $_ =~ m/$in_files/io; # check if file matches input regex # /io = case-insensitive, compiled # $_ = just the file name, no path # 5: Open file and search for matching contents open F, $file or print "\n* Couldn't open ${file}\n\n" && return; if ($simple) { # *** SIMPLE OUTPUT *** while (<F>) { if (m/($in_rgx)/o) { # /o = compile regex # file matched! $matches++; print "---" . # begin printing file header sprintf("%04d", $matches) . # file number, padded with 4 zeros "--- ".$file."\n"; # file name, keep original name # end of file header last; # go on to the next file } } } # *** END OF SIMPLE OUTPUT *** else { # *** EXTENDED OUTPUT *** my $found = 0; # used to keep track of first match my $binary = (-B $file) ? 1 : 0; # don't show contents if file is bin $file =~ s/^\Q$cwd//g; # remove current working directory # \Q = quotemeta, escapes string while (<F>) { if (m/($in_rgx)/o) { # /o = compile regex # file matched! if (!$found) { # first matching line for the file $found = 1; $matches++; print "\n---" . # begin printing file header sprintf("%04d", $matches) . 
# file number, padded with 4 zeros "--- ".uc($file)."\n"; # file name, converted to uppercase # end of file header if ($binary) { # file is binary, do not show content print "Binary file.\n"; last; } } print "[$.]".trim($_)."\n"; # print line number and contents #last; # uncomment to only show first line } } } # *** END OF EXTENDED OUTPUT *** # 6: Close the file and move on to the next result close F; } #7: Show search statistics print "\nMatches: ${matches}\n"; # Search Engine Source: http://www.adp-gmbh.ch/perl/find.html # Rewritten by Christopher Hilding, Dec 02 2006 # Formatting adjusted to my liking by Rene Nyffenegger, Dec 22 2006
The second parameter to the find() method can be a list of directories to scan. replace $cwd with #some_list_of_directories and you should be good to go
Summing a column of numbers in a text file using Perl
Ok, so I'm very new to Perl. I have a text file and in the file there are 4 columns of data(date, time, size of files, files). I need to create a small script that can open the file and get the average size of the files. I've read so much online, but I still can't figure out how to do it. This is what I have so far, but I'm not sure if I'm even close to doing this correctly. #!/usr/bin/perl open FILE, "files.txt"; ##array = File; while(FILE){ #chomp; ($date, $time, $numbers, $type) = split(/ /,<FILE>); $total += $numbers; } print"the total is $total\n"; This is how the data looks in the file. These are just a few of them. I need to get the numbers in the third column. 12/02/2002 12:16 AM 86016 a2p.exe 10/10/2004 11:33 AM 393 avgfsznew.pl 11/01/2003 04:42 PM 38124 c2ph.bat
Your program is reasonably close to working. With these changes it will do exactly what you want Always use use strict and use warnings at the start of your program, and declare all of your variables using my. That will help you by finding many simple errors that you may otherwise overlook Use lexical file handles, the three-parameter form of open, and always check the return status of any open call Declare the $total variable outside the loop. Declaring it inside the loop means it will be created and destroyed each time around the loop and it won't be able to accumulate a total Declare a $count variable in the same way. You will need it to calculate the average Using while (FILE) {...} just tests that FILE is true. You need to read from it instead, so you must use the readline operator like <FILE> You want the default call to split (without any parameters) which will return all the non-space fields in $_ as a list You need to add a variable in the assignment to allow for athe AM or PM field in each line Here is a modification of your code that works fine use strict; use warnings; open my $fh, '<', "files.txt" or die $!; my $total = 0; my $count = 0; while (<$fh>) { my ($date, $time, $ampm, $numbers, $type) = split; $total += $numbers; $count += 1; } print "The total is $total\n"; print "The count is $count\n"; print "The average is ", $total / $count, "\n"; output The total is 124533 The count is 3 The average is 41511
It's tempting to use Perl's awk-like auto-split option. There are 5 columns; three containing date and time information, then the size and then the name. The first version of the script that I wrote is also the most verbose: perl -n -a -e '$total += $F[3]; $num++; END { printf "%12.2f\n", $total / ($num + 0.0); }' The -a (auto-split) option splits a line up on white space into the array #F. Combined with the -n option (which makes Perl run in a loop that reads the file name arguments in turn, or standard input, without printing each line), the code adds $F[3] (the fourth column, counting from 0) to $total, which is automagically initialized to zero on first use. It also counts the lines in $num. The END block is executed when all the input is read; it uses printf() to format the value. The + 0.0 ensures that the arithmetic is done in floating point, not integer arithmetic. This is very similar to the awk script: awk '{ total += $4 } END { print total / NR }' First drafts of programs are seldom optimal — or, at least, I'm not that good a programmer. Revisions help. Perl was designed, in part, as an awk killer. There is still a program a2p distributed with Perl for converting awk scripts to Perl (and there's also s2p for converting sed scripts to Perl). And Perl does have an automatic (built-in) variable that keeps track of the number of lines read. It has several names. The tersest is $.; the mnemonic name $NR is available if you use English; in the script; so is $INPUT_LINE_NUMBER. So, using $num is not necessary. It also turns out that Perl does a floating point division anyway, so the + 0.0 part was unnecessary. This leads to the next versions: perl -MEnglish -n -a -e '$total += $F[3]; END { printf "%12.2f\n", $total / $NR; }' or: perl -n -a -e '$total += $F[3]; END { printf "%12.2f\n", $total / $.; }' You can tune the print format to suit your whims and fancies. 
This is essentially the script I'd use in the long term; it is fairly clear without being long-winded in any way. The script could be split over multiple lines if you desired. It is a simple enough task that the legibility of the one-line is not a problem, IMNSHO. And the beauty of this is that you don't have to futz around with split and arrays and read loops on your own; Perl does most of that for you. (Granted, it does blow up on empty input; that fix is trivial; see below.) Recommended version perl -n -a -e '$total += $F[3]; END { printf "%12.2f\n", $total / $. if $.; }' The if $. tests whether the number of lines read is zero or not; the printf and division are omitted if $. is zero so the script outputs nothing when given no input. There is a noble (or ignoble) game called 'Code Golf' that was much played in the early days of Stack Overflow, but Code Golf questions are no longer considered good questions. The object of Code Golf is to write a program that does a particular task in as few characters as possible. You can play Code Golf with this and compress it still further if you're not too worried about the format of the output and you're using at least Perl 5.10: perl -Mv5.10 -n -a -e '$total += $F[3]; END { say $total / $. if $.; }' And, clearly, there are a lot of unnecessary spaces and letters in there: perl -Mv5.10 -nae '$t+=$F[3];END{say$t/$.if$.}' That is not, however, as clear as the recommended version.
#!/usr/bin/perl use warnings; use strict; open my $file, "<", "files.txt"; my ($total, $cnt); while(<$file>){ $total += (split(/\s+/, $_))[3]; $cnt++; } close $file; print "number of files: $cnt\n"; print "total size: $total\n"; printf "avg: %.2f\n", $total/$cnt; Or you can use awk: awk '{t+=$4} END{print t/NR}' files.txt
Try doing this : #!/usr/bin/perl -l use strict; use warnings; open my $file, '<', "my_file" or die "open error [$!]"; my ($total, $count); while (<$file>){ chomp; next if /^$/; my ($date, $time, $x, $numbers, $type) = split; $total += $numbers; $count++; } print "the average is " . $total/$count . " and the total is $total"; close $file;
It is as simple as this: perl -F -lane '$a+=$F[3];END{print "The average size is ".$a/$.}' your_file tested below: > cat temp 12/02/2002 12:16 AM 86016 a2p.exe 10/10/2004 11:33 AM 393 avgfsznew.pl 11/01/2003 04:42 PM 38124 c2ph.bat Now the execution: > perl -F -lane '$a+=$F[3];END{print "The average size is ".$a/$.}' temp The average size is 41511 > explanation: -F -a says store the line in an array format.with the default separator as space or tab. so nopw $F[3] has you size of the file. sum up all the sizes in the 4th column untill all the lines are processed. END will be executed after processing all the lines in the file. so $. at the end will gives the number of lines. so $a/$. will give the average.
This solution opens the file and loops through each line of the file. It then splits the file into the five variables in the line by splitting on 1 or more spaces. open the file for reading, "<", and if it fails, raise an error or die "..." my ($total, $cnt) are our column total and number of files added count while(<FILE>) { ... } loops through each line of the file using the file handle and stores the line in $_ chomp removes the input record separator in $_. In unix, the default separator is a newline \n split(/\s+/, $_) Splits the current line represented by$_, with the delimiter \s+. \s represents a space, the + afterward means "1 or more". So, we split the next line on 1 or more spaces. Next we update $total and $cnt #!/usr/bin/perl open FILE, "<", "files.txt" or die "Error opening file: $!"; my ($total, $cnt); while(<FILE>){ chomp; my ($date, $time, $am_pm, $numbers, $type) = split(/\s+/, $_); $total += $numbers; $cnt++; } close FILE; print"the total is $total and count of $cnt\n";`
regex help on unix df
I need some help tweaking my code to look for another attribute in this unix df output: Ex. Filesystem Size Used Avail Capacity Mounted on /dev/ad4s1e 61G 46G 9.7G 83% /home So far I can extract capacity, but now I want to add Avail. Here is my perl line that grabs capacity. How do I get "Avail"?? Thanks! my #df = qx (df -k /tmp); my $cap; foreach my $df (#df) { ($cap) =($df =~ m!(\d+)\%!); }; print "$cap\n";
The easy perl way: perl -MFilesys::Df -e 'print df("/tmp")->{bavail}, "\n"'
This has the merit of producing a nice data structure for you to query all the info about each filesystem. # column headers to be used as hash keys my #headers = qw(name size used free capacity mount); my #df = `df -k`; shift #df; # get rid of the header my %devices; for my $line (#df) { my %info; #info{#headers} = split /\s+/, $line; # note the hash slice $info{capacity} = _percentage_to_decimal($info{capacity}); $devices{ $info{name} } = \%info; } # Change 12.3% to .123 sub _percentage_to_decimal { my $percentage = shift; $percentage =~ s{%}{}; return $percentage / 100; } Now the information for each device is in a hash of hashes. # Show how much space is free in device /dev/ad4s1e print $devices{"/dev/ad4s1e"}{free}; This isn't the simplest way to do it, but it is the most generally useful way to work with the df information putting it all in one nice data structure that you can pass around as needed. This is better than slicing it all up into individual variables and its a technique you should get used to. UPDATE: To get all the devices which have >60% capacity, you'd iterate through all the values in the hash and select those with a capacity greater than 60%. Except capacity is stored as a string like "88%" and that's not useful for comparison. We could strip out the % here, but then we'd be doing that everywhere we want to use it. Its better to normalize your data up front, that makes it easier to work with. Storing formatted data is a red flag. So I've modified the code above which reads from df to change the capacity from 88% to .88. Now its easier to work with. for my $info (values %devices) { # Skip to the next device if its capacity is not over 60%. next unless $info->{capacity} > .60; # Print some info about each device printf "%s is at %d%% with %dK remaining.\n", $info->{name}, $info->{capacity}*100, $info->{free}; } I chose to use printf here rather than interpolation because it makes it a bit easier to see what the string will look like when output.
Have you tried simply splitting on whitespace and taking the 4th and 5th columns?

    my @cols = (split(/\s+/, $_));
    my $avail = $cols[3];
    my $cap   = $cols[4];

(Fails if you have spaces in your device names, of course...)
Use split instead, and get the args from the resulting array. E.g.

    my @values = split /\s+/, $df;
    my $avail = $values[3];

Or:

    my ($filesystem, $size, $used, $avail, $cap, $mount) = split /\s+/, $df;
I think it is probably best to split the lines, skipping the first line. Since you don't mind using @df and $df, neither do I:

    my @df = qx(df -k /tmp);
    shift @df;  # Lose df heading line
    foreach my $df (@df) {
        my ($system, $size, $used, $avail, $capacity, $mount) = split / +/, $df;
        ....
    }

This gives you all the fields at once. Now you just need to interpret the 'G' and lose the '%', etc.
    foreach my $device (@df) {
        next unless $device =~ m{^/};
        my ($filesystem, $size, $used, $avail, $cap, $mounted) = split /\s+/, $device;
        # you take it from there.... ;)
    }
Lots of variations on a theme here. I would keep the first line, since it gives a nice header:

    $ perl -E '$,=" "; open my $fh, "-|", "df -k /tmp"; while(<$fh>) { @a=split; say @a[3,4] }'

On second thought, this is a lot cleaner:

    $ df -k /tmp | perl -naE '$,="\t"; say @F[3,4]'
    Available       Capacity
    20862392        92%

Final thought: don't use perl at all (fields 4 and 5 are Avail and Capacity):

    $ df -h /tmp | tr -s ' ' '\t' | cut -f 4,5

or

    $ df -h /tmp | awk '{print $4 "\t" $5}'
Perl Regular Expressions + delete line if it starts with #
How to delete lines if they begin with a "#" character, using Perl regular expressions? For example (need to delete the following examples):

    line="#a"
    line=" #a"
    line="# a"
    line=" # a"

... the required syntax:

    $line =~ s/......../..

or skip the loop if the line begins with "#". From my code:

    open my $IN, '<', $file or die "can't open '$file' for reading: $!";
    while ( defined( $line = <$IN> ) ) {
    .
    .
    .
You don't delete lines with s///. (In a loop, you probably want next;) In the snippet you posted, it would be:

    while ( defined( $line = <$IN> ) ) {
        if ($line =~ /^\s*#/) { next; }  # will skip the rest of the code if a line matches
        ...
    }

Shorter forms like /^\s*#/ and next; or next if /^\s*#/; are possible.

See perldoc perlre. Breaking down /^\s*#/:

    ^  - "the beginning of the line"
    \s - "a whitespace character"
    *  - "0 or more times"
    #  - just a #
Based off Aristotle Pagaltzis's answer you could do:

    perl -ni.bak -e 'print unless m/^\s*#/' deletelines.txt

Here, the -n switch makes perl put a loop around the code you provide, which will read all the files you pass on the command line in sequence. The -i switch (for "in-place") says to collect the output from your script and overwrite the original contents of each file with it. The .bak parameter to the -i option tells perl to keep a backup of the original file in a file named after the original file name, with .bak appended. For all of these bits, see perldoc perlrun.

deletelines.txt (initially):

    #a
    b
     #a
    # a
    c
      # a

becomes:

    b
    c
Program (cut & paste the whole thing including the DATA section, adjust the shebang line, run):

    #!/usr/bin/perl
    use strict;
    use warnings;

    while (<DATA>) {
        next if /^\s*#/;  # skip comments
        print;            # process data
    }

    __DATA__
    # comment
    data
    # another comment
    more data

Output:

    data
    more data
    $text =~ s/^\s*#.*\n//mg;

That will delete all of the lines starting with # in the entire string $text, without requiring that you loop through each line of the text manually. The /m modifier is what lets ^ match at the start of every line rather than only at the start of the string.