Quick question, and I'm sure it's something I'm doing completely wrong with variables, but here is the issue.
Code first:
#!/usr/bin/perl
use strict;
use warnings;
my $File = "file.txt";
my $CurrentLinesCount = `wc -l < $File` or die "wc failed: $?";
chomp($CurrentLinesCount);
sub GetStatistics() {
    if (-f $File) {
        print "Current Lines In File: $CurrentLinesCount\n";
    }
    else {
        exit;
    }
}
sub EditFile() {
    my $editfile = $File;
    my $text = "1234\n12345\n234324\n2342\n2343";
    open(MYFILE,">>$editfile") || die("Cannot Open File");
    print MYFILE "$text";
    close(MYFILE);
    sleep 5;
}
## MAIN
GetStatistics();
EditFile();
GetStatistics();
This is the output I get:
Current Lines In File: 258
Current Lines In File: 258
I verified that the file is being written and appended to. Can someone point me in the correct direction on how to have a variable set, updated, and then called again properly?
You call subs, not variables.
Try:
sub CurrentLinesCount {
    my $CurrentLinesCount = `wc -l < $File` or die "wc failed: $?";
    chomp($CurrentLinesCount);
    return $CurrentLinesCount;
}
...
print "Current Lines In File: ", CurrentLinesCount(), "\n";
You're only doing the call to wc once. Thus you're setting the value of $CurrentLinesCount once, and you get the same number when you print it twice.
You'll have to redo the
$CurrentLinesCount = `wc -l < $File` or die "wc failed: $?";
line after you append to the file.
Edit: Or put that line in the GetStatistics function, which would probably be a better place for it.
I would probably move the code block
my $CurrentLinesCount = `wc -l < $File` or die "wc failed: $?";
chomp($CurrentLinesCount);
to the GetStatistics subroutine, so the variable is updated whenever you call your sub.
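For example, a minimal sketch of that change (keeping the original variable names; the existence check from the original sub is left out):
sub GetStatistics {
    my $CurrentLinesCount = `wc -l < $File` or die "wc failed: $?";
    chomp($CurrentLinesCount);
    print "Current Lines In File: $CurrentLinesCount\n";
}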
As an optimization, you can count how many lines you added rather than recounting the whole file (unless another process may also be writing to the file).
use strict;
use warnings;
use FileHandle;
use IPC::Open2;

our $CurrentLinesCount;
our $file = "file.txt";

sub CountLines {
    my $File = shift;
    my $CurrentLinesCount = `wc -l < $File` or die "wc failed: $?";
    $CurrentLinesCount =~ s/\s+//g;
    return $CurrentLinesCount;
}

sub ShowStatistics {
    my $file = shift;
    if (-f $file) {
        print "Current Lines In File: $CurrentLinesCount\n";
    } else {
        exit;
    }
}

sub EditFile {
    my $editfile = shift;
    my $sleeptime = shift || 5;
    my $text = "1234\n12345\n234324\n2342\n2343";
    open(MYFILE,">>$editfile") || die("Cannot Open File");
    print MYFILE "$text";
    close(MYFILE);
    # Look here:
    my $pid = open2(*Reader, *Writer, "wc -l" );
    print Writer $text;
    close Writer;
    $CurrentLinesCount += <Reader>;
    sleep $sleeptime;
}

$CurrentLinesCount = CountLines($file);
ShowStatistics($file);
# EditFile updates $CurrentLinesCount
EditFile($file, 2);
ShowStatistics($file);
Still one too many globals for my taste, but I suppose this isn't a program of consequence. On the other hand, globals can be habit forming.
Note that wc doesn't count anything after the final "\n" when counting lines (it views "\n" as a line terminator). If you want to view "\n" as a line separator and count those trailing characters as a line, you'll need an alternate method of counting lines.
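If you need "\n" treated as a separator, a small pure-Perl counter along these lines would work (a sketch, assuming the file fits in memory):
sub count_lines_including_trailer {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my $data = do { local $/; <$fh> };   # slurp the whole file
    close $fh;
    my $count = () = $data =~ /\n/g;                # newlines, i.e. what wc -l reports
    $count++ if length($data) && $data !~ /\n\z/;   # text after the final "\n" counts as a line
    return $count;
}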
Related
I would like to know a fast/efficient way, in any program (awk/perl/python), to split a CSV file (say 10k columns) into multiple small files each containing 2 columns. I would be doing this on a Unix machine.
#contents of large_file.csv
1,2,3,4,5,6,7,8
a,b,c,d,e,f,g,h
q,w,e,r,t,y,u,i
a,s,d,f,g,h,j,k
z,x,c,v,b,n,m,z
I now want multiple files like this:
# contents of 1.csv
1,2
a,b
q,w
a,s
z,x
# contents of 2.csv
1,3
a,c
q,e
a,d
z,c
# contents of 3.csv
1,4
a,d
q,r
a,f
z,v
and so on...
I can do this currently with awk on small files (say 30 columns) like this:
awk -F, 'BEGIN{OFS=",";} {for (i=1; i < NF; i++) print $1, $(i+1) > i ".csv"}' large_file.csv
The above takes a very long time with large files and I was wondering if there is a faster and more efficient way of doing the same.
Thanks in advance.
The main hold up here is in writing so many files.
Here is one way
use warnings;
use strict;
use feature 'say';

my $file = shift // die "Usage: $0 csv-file\n";

my @lines = do { local @ARGV = $file; <> };
chomp @lines;

my @fhs = map {
    open my $fh, '>', "f${_}.csv" or die $!;
    $fh
}
1 .. scalar( split /,/, $lines[0] ) - 1;

for (@lines) {
    my ($first, @cols) = split /,/;
    say {$fhs[$_]} join(',', $first, $cols[$_])
        for 0..$#cols;
}
I didn't time this against any other approaches. Assembling data for each file first and then dumping it in one operation into each file may help, but first let us know how large the original CSV file is.
Opening so many output files at once (for @fhs filehandles) may pose problems. If that is the case then the simplest way is to first assemble all data and then open and write a file at a time
use warnings;
use strict;
use feature 'say';

my $file = shift // die "Usage: $0 csv-file\n";

open my $fh, '<', $file or die "Can't open $file: $!";

my @data;
while (<$fh>) {
    chomp;
    my ($first, @cols) = split /,/;
    push @{$data[$_]}, join(',', $first, $cols[$_])
        for 0..$#cols;
}

for my $i (0..$#data) {
    open my $fh, '>', $i+1 . '.csv' or die $!;
    say $fh $_ for @{$data[$i]};
}
This depends on whether the entire original CSV file, plus a bit more, can be held in memory.
Based on your shown samples and attempts, please try the following awk code. Since opening all the files at once may fail with the infamous "too many open files" error, this version collects all values into an array, prints them one by one in the END block, and closes each output file as soon as its contents have been printed.
awk '
BEGIN{ FS=OFS="," }
{
  for(i=1;i<NF;i++){
    value[i]=(value[i]?value[i] ORS:"") ($1 OFS $(i+1))
  }
}
END{
  for(i=1;i<NF;i++){
    outFile=i".csv"
    print value[i] > (outFile)
    close(outFile)
  }
}
' large_file.csv
I needed the same functionality and wrote it in bash.
Not sure if it will be faster than ravindersingh13's answer, but I hope it will help someone.
Current version: https://github.com/pgrabarczyk/csv-file-splitter
#!/usr/bin/env bash
set -eu
SOURCE_CSV_PATH="${1}"
LINES_PER_FILE="${2}"
DEST_PREFIX_NAME="${3}"
DEBUG="${4:-0}"
split_files() {
  local source_csv_path="${1}"
  local lines_per_file="${2}"
  local dest_prefix_name="${3}"
  local debug="${4}"

  _print_log "source_csv_path: ${source_csv_path}"
  local dest_prefix_path="$(pwd)/output/${dest_prefix_name}"
  _print_log "dest_prefix_path: ${dest_prefix_path}"

  local headline=$(awk "NR==1" "${source_csv_path}")
  local file_no=0

  mkdir -p "$(dirname ${dest_prefix_path})"

  local lines_in_files=$(wc -l "${source_csv_path}" | awk '{print $1}')
  local files_to_create=$(((lines_in_files-1)/lines_per_file))
  _print_log "There is ${lines_in_files} lines in file. I will create ${files_to_create} files per ${lines_per_file} (Last file may have less)"

  _print_log "Start processing."

  for (( start_line=1; start_line<=lines_in_files; )); do
    last_line=$((start_line+lines_per_file))
    file_no=$((file_no+1))
    local file_path="${dest_prefix_path}$(printf "%06d" ${file_no}).csv"

    if [ $debug -eq 1 ]; then
      _print_log "Creating file ${file_path} with lines [${start_line};${last_line}]"
    fi

    echo "${headline}" > "${file_path}"
    awk "NR>${start_line} && NR<=${last_line}" "${source_csv_path}" >> "${file_path}"

    start_line=$last_line
  done

  _print_log "Done."
}

_print_log() {
  local log_message="${1}"
  local date_time=$(date "+%Y-%m-%d %H:%M:%S.%3N")

  printf "%s - %s\n" "${date_time}" "${log_message}" >&2
}

split_files "${SOURCE_CSV_PATH}" "${LINES_PER_FILE}" "${DEST_PREFIX_NAME}" "${DEBUG}"
Execution:
bash csv-file-splitter.sh "sample.csv" 3 "result_" 1
Tried a solution using the module Text::CSV.
#! /usr/bin/env perl
use warnings;
use strict;
use utf8;
use open qw<:std :encoding(utf-8)>;
use autodie;
use feature qw<say>;
use Text::CSV;
my %hsh = ();
my $csv = Text::CSV->new({ sep_char => ',' });
print "Enter filename: ";
chomp(my $filename = <STDIN>);
open (my $ifile, '<', $filename);
while (<$ifile>) {
    chomp;
    if ($csv->parse($_)) {
        my @fields = $csv->fields();
        my $first = shift @fields;
        while (my ($i, $v) = each @fields) {
            push @{$hsh{($i + 1).".csv"}}, "$first,$v";
        }
    } else {
        die "Line could not be parsed: $_\n";
    }
}
close($ifile);

while (my ($k, $v) = each %hsh) {
    open(my $ifile, '>', $k);
    say {$ifile} $_ for @$v;
    close($ifile);
}
exit(0);
I am writing a small Perl program that checks for the patterns #start and #end. The goal is to create a separate file with the lines in between each start and end pattern. I am able to do this with the script below.
#!/usr/bin/perl
open(INFILE,"<","testcases") || die "Can't open file: $!";
my $binary;
my $tccounter=1;

while(<INFILE>)
{
    if(/^#start/i)
    {
        open(OUTFILE,">",$tccounter."_case.sh") || die "Can't open file: $!";
        print "start of the script\n";
        next;
    }
    elsif(/^#end/i)
    {
        ################################
        # Want to replace the previously
        # written line here with some
        # additional customized lines
        ################################
        close(OUTFILE);
        $tccounter++;
        print "End of the script\n";
        print "last line for this testcase is \n $binary\n";
        next;
    }
    else
    {
        $binary=$_ unless(/^\s*$/);
        print OUTFILE $_;
    }
}
But what I additionally need is to identify the last line that is being written to each file and then replace that line with some custom data.
For example, here in my case the last line for all the files is execute.
I want to replace the line "execute" in all the output files.
In the current output files last line is as below:
execute
expected out files last line should be
preline
execute
postline
Input file (testcases):
#start
line1
line 2
execute
#end
#start
line3
line 4
execute
#end
#start
line5
line 6
execute
#end
#start
line7
line 8
execute
#end
I suggest that you should buffer your output
If you push each line to an array instead of printing it then, once the #end tag is seen, it is simple to locate the last non-blank line in the array and replace it
Then the output file can be opened and the contents of the array printed to it
Here's an untested example
use strict;
use warnings 'all';
open my $fh, "<", "testcases" or die "Can't open input file: $!";
my $n;
my $i;
my $print;
my @buff;

while ( <$fh> ) {

    if ( /^#start/i ) {
        @buff = ();
        $i = undef;
        $print = 1;
        print "start of the script\n";
    }
    elsif ( /^#end/i ) {

        my $file = ++$n . "_case.sh";
        $print = 0;

        unless ( defined $i ) {
            warn "No data found in block $n";
            next;
        }

        splice @buff, $i, 1, "preline\n", $buff[$i], "postline\n";

        open my $fh, ">", $file or die qq{Can't open "$file" for output: $!};
        print $fh @buff;
        close $fh;

        print "End of the script\n";
    }
    elsif ( $print ) {
        push @buff, $_;
        $i = $#buff if /\S/;
    }
}
I think Borodin's answer is the way to go (I'm just not able to comment yet).
So the general algorithm is:
collect full record, from start marker to end marker
once end marker is reached, process record content. In your case:
find last non-empty line and surround it with others
print found line
write out file for record
repeat as needed
I couldn't resist and rewrote Borodin's solution using the flip-flop operator:
use strict;
use warnings;
open(my $in,'<','in.file') || die "Can't open file: $!";
my ($cnt,@rec);

while( <$in> ) {
    push(@rec,$_) if /^#start/i .. /^#end/i;   # collect record lines (using flip-flop operator)
    if( /^#end/i ) {                           # end of record reached?
        next if @rec <= 2;                     # ignore empty records

        # determine index of last nonempty line
        my ($lci) = grep {$rec[$_]=~/\S/} reverse (1..$#rec-1);   # ...except markers

        printf "last line for this testcase is \n%s\n",                    # print find
            splice @rec, $lci, 1, ("preline\n",$rec[$lci],"postline\n");   # surround with pre&post

        # write out result
        open(my $out,'>',++$cnt.'_case.sh') || die "Can't open file: $!";
        $out->print(@rec[1..$#rec-1]);   # ...except markers
        $out->close;

        @rec=();   # empty record for next use
    }
}
It can display the text in the file; however, after I add new text in gedit, it does not show the updated content.
sub start_thread {
    my @args = @_;
    print('Thread started: ', @args, "\n");

    open(my $myhandle,'<',@args) or die "unable to open file";   # typical open call
    for (;;) {
        while (<$myhandle>) {
            chomp;
            print $_."\n";
        }
        sleep 1;
        seek FH, 0, 1;   # this clears the eof flag on FH
    }
}
update video
https://docs.google.com/file/d/0B4hnKBXrOBqRWEdjTDFIbHJselk/edit?usp=sharing
https://docs.google.com/file/d/0B4hnKBXrOBqRcEFhU3k4dUN4cXc/edit?usp=sharing
How do I print $curpos for the updated data?
for (;;) {
    for ($curpos = tell($myhandle); $_ = <$myhandle>;
         $curpos = tell($myhandle)) {
        # search for some stuff and put it into files
        print $curpos."\n";
    }
    sleep(1);
    seek($myhandle, $curpos, 0);
}
Like I said - it works for me. Changes to your script are minimal - just minimal cleanup.
Script: test_tail.pl
#!/usr/bin/perl
sub tail_file {
    my $filename = shift;

    open(my $myhandle,'<',$filename) or die "unable to open file";   # typical open call
    for (;;) {
        print "About to read file...\n";
        while (<$myhandle>) {
            chomp;
            print $_."\n";
        }
        sleep 1;
        seek $myhandle, 0, 1;   # this clears the eof flag on $myhandle
    }
}
tail_file('/tmp/test_file.txt');
Then:
echo -e "aaa\nbbb\nccc\n" > /tmp/test_file.txt
# wait a bit
echo -e "ddd\neee\n" >> /tmp/test_file.txt
Meanwhile (in a different terminal);
$ perl /tmp/test_tail.pl
About to read file...
aaa
bbb
ccc
About to read file...
About to read file...
About to read file...
ddd
eee
Instead of this:
seek $myhandle, 0, 1; # this clears the eof flag on FH
Can you try something like this:
my $pos = tell $myhandle;
seek $myhandle, $pos, 0; # reset the file handle in an alternate way
The file system is trying to give you a consistent view of the file you are reading. To see the changes, you would need to reopen the file.
To see an example of this, try the following:
1. Create a file that has 100 lines of text in it, a man page, for instance:
man tail > foo
2. Print the file slowly:
cat foo | perl -ne 'print; sleep 1;'
3. While that is going on, in another shell or editor, try editing the file by deleting most lines
Result: The file will continue to print slowly, as if you never edited it. Only when you try to print it again, will you see the changes.
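For example, a minimal sketch of that reopen-the-file approach; it rereads the file on every pass and only prints lines it has not shown yet (the file name here is only illustrative):
use strict;
use warnings;

my $file = '/tmp/test_file.txt';
my $lines_seen = 0;

for (;;) {
    open my $fh, '<', $file or die "unable to open $file: $!";
    my @lines = <$fh>;
    close $fh;

    # print only the lines we have not shown yet
    print @lines[$lines_seen .. $#lines] if @lines > $lines_seen;
    $lines_seen = @lines;

    sleep 1;
}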
The following would also work:
my $TAIL = '/usr/bin/tail -f'; # Adjust accordingly
open my $fh, "$TAIL |"
or die "Unable to run $TAIL : $!";
while (<$fh>) {
# do something
}
I want to print certain lines from a text file in Unix. The line numbers to be printed are listed in another text file (one on each line).
Is there a quick way to do this with Perl or a shell script?
Assuming the line numbers to be printed are sorted.
open my $fh, '<', 'line_numbers' or die $!;
my @ln = <$fh>;

open my $tx, '<', 'text_file' or die $!;

foreach my $ln (@ln) {
    my $line;
    do {
        $line = <$tx>;
    } until $. == $ln and defined $line;
    print $line if defined $line;
}
$ cat numbers
1
4
6
$ cat file
one
two
three
four
five
six
seven
$ awk 'FNR==NR{num[$1];next}(FNR in num)' numbers file
one
four
six
You can avoid the limitations of some of the other answers (requirements for sorted lines) simply by using eof within the context of a basic while(<>) block. That will tell you when you've stopped reading line numbers and started reading data. Note that you need to reset $. when the switch occurs.
# Usage: perl script.pl LINE_NUMS_FILE DATA_FILE
use strict;
use warnings;
my %keep;
my $reading_line_nums = 1;
while (<>){
    if ($reading_line_nums){
        chomp;
        $keep{$_} = 1;
        $reading_line_nums = $. = 0 if eof;
    }
    else {
        print if exists $keep{$.};
    }
}
cat -n foo | join foo2 - | cut -d" " -f2-
where foo is your file with lines to print and foo2 is your file of line numbers
Here is a way to do this in Perl without slurping anything so that the memory footprint of the program is independent of the sizes of both files (it does assume that the line numbers to be printed are sorted):
#!/usr/bin/perl
use strict; use warnings;
use autodie;
@ARGV == 2
    or die "Supply src_file and filter_file as arguments\n";

my ($src_file, $filter_file) = @ARGV;

open my $src_h, '<', $src_file;
open my $filter_h, '<', $filter_file;

my $to_print = <$filter_h>;

while ( my $src_line = <$src_h> ) {
    last unless defined $to_print;
    if ( $. == $to_print ) {
        print $src_line;
        $to_print = <$filter_h>;
    }
}

close $filter_h;
close $src_h;
Generate the source file:
C:\> perl -le "print for aa .. zz" > src
Generate the filter file:
C:\> perl -le "print for grep { rand > 0.75 } 1 .. 52" > filter
C:\> cat filter
4
6
10
12
13
19
23
24
28
44
49
50
Output:
C:\> f src filter
ad
af
aj
al
am
as
aw
ax
bb
br
bw
bx
To deal with an unsorted filter file, you can modify the while loop:
while ( my $src_line = <$src_h> ) {
    last unless defined $to_print;
    if ( $. > $to_print ) {
        seek $src_h, 0, 0;
        $. = 0;
    }
    if ( $. == $to_print ) {
        print $src_line;
        $to_print = <$filter_h>;
    }
}
This would waste a lot of time if the contents of the filter file are fairly random because it would keep rewinding to the beginning of the source file. In that case, I would recommend using Tie::File.
I wouldn't do it this way with large files, but (untested):
open(my $fh1, "<", "line_number_file.txt") or die "Err: $!";
chomp(my @line_numbers = <$fh1>);
$_-- for @line_numbers;
close $fh1;

open(my $fh2, "<", "text_file.txt") or die "Err: $!";
my @lines = <$fh2>;
print @lines[@line_numbers];
close $fh2;
I'd do it like this:
#!/bin/bash
numbersfile=numbers
datafile=data
while read lineno; do
    sed -n "${lineno}p" "$datafile"
done < "$numbersfile"
Downside to my approach is that it will spawn a lot of processes so it will be slower than other options. It's infinitely more readable though.
This is a short solution using bash and sed
sed -n -e "$(cat num |sed 's/$/p/')" file
Where num is the file of numbers and file is the input file (tested on OS X Snow Leopard).
$ cat num
1
3
5
$ cat file
Line One
Line Two
Line Three
Line Four
Line Five
$ sed -n -e "$(cat num |sed 's/$/p/')" file
Line One
Line Three
Line Five
$ cat input
every
good
bird
does
fly
$ cat lines
2
4
$ perl -ne 'BEGIN{($a,$b) = `cat lines`} print if $.==$a .. $.==$b' input
good
bird
does
If that's too much for a one-liner, use
#! /usr/bin/perl
use warnings;
use strict;
sub start_stop {
    my($path) = @_;
    open my $fh, "<", $path
        or die "$0: open $path: $!";

    local $/;
    return ($1,$2) if <$fh> =~ /\s*(\d+)\s*(\d+)/;

    die "$0: $path: could not find start and stop line numbers";
}

my($start,$stop) = start_stop "lines";

while (<>) {
    print if $. == $start .. $. == $stop;
}
Perl's magic open allows for creative possibilities such as
$ ./lines-between 'tac lines-between|'
print if $. == $start .. $. == $stop;
while (<>) {
Here is a way to do this using Tie::File:
#!/usr/bin/perl
use strict; use warnings;
use autodie;
use Tie::File;
@ARGV == 2
    or die "Supply src_file and filter_file as arguments\n";

my ($src_file, $filter_file) = @ARGV;

tie my @source, 'Tie::File', $src_file, autochomp => 0
    or die "Cannot tie source '$src_file': $!";

open my $filter_h, '<', $filter_file;

while ( my $to_print = <$filter_h> ) {
    print $source[$to_print - 1];
}

close $filter_h;
untie @source;
What is a good/best way to count the number of characters, words, and lines of a text file using Perl (without using wc)?
Here's the Perl code. Counting words can be somewhat subjective, but I just say it's any string of characters that isn't whitespace.
open(FILE, "<file.txt") or die "Could not open file: $!";

my ($lines, $words, $chars) = (0,0,0);

while (<FILE>) {
    $lines++;
    $chars += length($_);
    $words += scalar(split(/\s+/, $_));
}

print("lines=$lines words=$words chars=$chars\n");
A variation on bmdhacks' answer that will probably produce better results is to use \s+ (or, even better, \W+) as the delimiter. Consider the string "The quick brown fox" with additional spaces between some of the words (they may not be obvious here). Using a delimiter of a single whitespace character will give a word count of six, not four. So, try:
open(FILE, "<file.txt") or die "Could not open file: $!";

my ($lines, $words, $chars) = (0,0,0);

while (<FILE>) {
    $lines++;
    $chars += length($_);
    $words += scalar(split(/\W+/, $_));
}

print("lines=$lines words=$words chars=$chars\n");
Using \W+ as the delimiter will stop punctuation (amongst other things) from counting as words.
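To illustrate the difference, here is a small comparison (the sample string is only illustrative):
my $line = "hello - world";
my @by_space   = split /\s+/, $line;   # ("hello", "-", "world"): the bare "-" counts as a word
my @by_nonword = split /\W+/, $line;   # ("hello", "world"): punctuation is not counted
printf "space: %d, non-word: %d\n", scalar @by_space, scalar @by_nonword;   # space: 3, non-word: 2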
The Word Count tool counts characters, words and lines in text files
Here. Try this Unicode-savvy version of the wc program.
It skips non-file arguments (pipes, directories, sockets, etc).
It assumes UTF-8 text.
It counts any Unicode whitespace as a word separator.
It also accepts alternate encodings if there is a .ENCODING at the end of the filename, like foo.cp1252, foo.latin1, foo.utf16, etc.
It also works with files that have been compressed in a variety of formats.
It gives counts of Paragraphs, Lines, Words, Graphemes, Characters, and Bytes.
It understands all Unicode linebreak sequences.
It warns about corrupted textfiles with linebreak errors.
Here’s an example of running it:
Paras Lines Words Graphs Chars Bytes File
2 2270 82249 504169 504333 528663 /tmp/ap
1 2404 11163 63164 63164 66336 /tmp/b3
uwc: missing linebreak at end of corrupted textfiile /tmp/bad
1* 2* 4 19 19 19 /tmp/bad
1 14 52 273 273 293 /tmp/es
57 383 1369 11997 11997 12001 /tmp/funny
1 657068 3175429 31205970 31209138 32633834 /tmp/lw
1 1 4 27 27 27 /tmp/nf.cp1252
1 1 4 27 27 34 /tmp/nf.euc-jp
1 1 4 27 27 27 /tmp/nf.latin1
1 1 4 27 27 27 /tmp/nf.macroman
1 1 4 27 27 54 /tmp/nf.ucs2
1 1 4 27 27 56 /tmp/nf.utf16
1 1 4 27 27 54 /tmp/nf.utf16be
1 1 4 27 27 54 /tmp/nf.utf16le
1 1 4 27 27 112 /tmp/nf.utf32
1 1 4 27 27 108 /tmp/nf.utf32be
1 1 4 27 27 108 /tmp/nf.utf32le
1 1 4 27 27 39 /tmp/nf.utf7
1 1 4 27 27 31 /tmp/nf.utf8
1 26906 101528 635841 636026 661202 /tmp/o2
131 346 1370 9590 9590 4486 /tmp/perl5122delta.pod.gz
291 814 3941 25318 25318 9878 /tmp/perl51310delta.pod.bz2
1 2551 5345 132655 132655 133178 /tmp/tailsort-pl.utf8
1 89 334 1784 1784 2094 /tmp/til
1 4 18 88 88 106 /tmp/w
276 1736 5773 53782 53782 53804 /tmp/www
Here ya go:
#!/usr/bin/env perl
#########################################################################
# uniwc - improved version of wc that works correctly with Unicode
#
# Tom Christiansen <tchrist@perl.com>
# Mon Feb 28 15:59:01 MST 2011
#########################################################################
use 5.10.0;
use strict;
use warnings FATAL => "all";
use sigtrap qw[ die untrapped normal-signals ];
use Carp;
$SIG{__WARN__} = sub {
    confess("FATALIZED WARNING: @_") unless $^S;
};

$SIG{__DIE__} = sub {
    confess("UNCAUGHT EXCEPTION: @_") unless $^S;
};
$| = 1;
my $Errors = 0;
my $Headers = 0;
sub yuck($) {
    my $errmsg = $_[0];
    $errmsg =~ s/(?<=[^\n])\z/\n/;
    print STDERR "$0: $errmsg";
}
process_input(\&countem);
sub countem {
    my ($_, $file) = @_;

    my (
        @paras, @lines, @words,
        $paracount, $linecount, $wordcount,
        $grafcount, $charcount, $bytecount,
    );

    if ($charcount = length($_)) {
        $wordcount = eval { @words = split m{ \p{Space}+ }x };
        yuck "error splitting words: $@" if $@;

        $linecount = eval { @lines = split m{ \R }x };
        yuck "error splitting lines: $@" if $@;

        $grafcount = 0;
        $grafcount++ while /\X/g;
        #$grafcount = eval { @lines = split m{ \R }x };
        yuck "error splitting lines: $@" if $@;

        $paracount = eval { @paras = split m{ \R{2,} }x };
        yuck "error splitting paras: $@" if $@;

        if ($linecount && !/\R\z/) {
            yuck("missing linebreak at end of corrupted textfiile $file");
            $linecount .= "*";
            $paracount .= "*";
        }
    }

    $bytecount = tell;
    if (-e $file) {
        $bytecount = -s $file;
        if ($bytecount != -s $file) {
            yuck "filesize of $file differs from bytecount\n";
            $Errors++;
        }
    }

    my $mask = "%8s " x 6 . "%s\n";
    printf $mask => qw{ Paras Lines Words Graphs Chars Bytes File } unless $Headers++;

    printf $mask => map( { show_undef($_) }
        $paracount, $linecount,
        $wordcount, $grafcount,
        $charcount, $bytecount,
    ), $file;
}

sub show_undef {
    my $value = shift;
    return defined($value)
        ? $value
        : "undef";
}

END {
    close(STDOUT) || die "$0: can't close STDOUT: $!";
    exit($Errors != 0);
}
sub process_input {
    my $function = shift();

    my $enc;

    if (@ARGV == 0 && -t) {
        warn "$0: reading from stdin, type ^D to end or ^C to kill.\n";
    }

    unshift(@ARGV, "-") if @ARGV == 0;

FILE:
    for my $file (@ARGV) {
        # don't let magic open make an output handle
        next if -e $file && ! -f _;

        my $quasi_filename = fix_extension($file);

        $file = "standard input" if $file eq q(-);

        $quasi_filename =~ s/^(?=\s*[>|])/< /;

        no strict "refs";
        my $fh = $file;   # is *so* a lexical filehandle! ☺
        unless (open($fh, $quasi_filename)) {
            yuck("couldn't open $quasi_filename: $!");
            next FILE;
        }
        set_encoding($fh, $file) || next FILE;

        my $whole_file = eval {
            use warnings "FATAL" => "all";
            local $/;
            scalar <$fh>;
        };

        if ($@) {
            $@ =~ s/ at \K.*? line \d+.*/$file line $./;
            yuck($@);
            next FILE;
        }

        $function->($whole_file, $file);

        unless (close $fh) {
            yuck("couldn't close $quasi_filename at line $.: $!");
            next FILE;
        }
    } # foreach file
}
sub set_encoding(*$) {
    my ($handle, $path) = @_;

    my $enc_name = "utf8";

    if ($path && $path =~ m{ \. ([^\s.]+) \z }x) {
        my $ext = $1;
        die unless defined $ext;

        require Encode;
        if (my $enc_obj = Encode::find_encoding($ext)) {
            my $name = $enc_obj->name || $ext;
            $enc_name = "encoding($name)";
        }
    }

    return 1 if eval {
        use warnings FATAL => "all";
        no strict "refs";
        binmode($handle, ":$enc_name");
        1;
    };

    for ($@) {
        s/ at .* line \d+\.//;
        s/$/ for $path/;
    }

    yuck("set_encoding: $@");

    return undef;
}
sub fix_extension {
    my $path = shift();
    my %Compress = (
        Z       => "zcat",
        z       => "gzcat",    # for uncompressing
        gz      => "gzcat",
        bz      => "bzcat",
        bz2     => "bzcat",
        bzip    => "bzcat",
        bzip2   => "bzcat",
        lzma    => "lzcat",
    );

    if ($path =~ m{ \. ( [^.\s] +) \z }x) {
        if (my $prog = $Compress{$1}) {
            return "$prog $path |";
        }
    }

    return $path;
}
I stumbled upon this while googling for a character count solution.
Admittedly, I know next to nothing about perl so some of this may be off base, but here are my tweaks of newt's solution.
First, there is a built-in line count variable anyway, so I just used that. This is probably a bit more efficient, I guess.
As it is, the character count includes newline characters, which is probably not what you want, so I chomped $_.
Perl also complained about the way the split() is done (implicit split, see: Why does Perl complain "Use of implicit split to @_ is deprecated"?) so I tweaked that.
My input files are UTF-8 so I opened them as such. That probably helps get the correct character count in the input file contains non-ASCII characters.
Here's the code:
open(FILE, "<:encoding(UTF-8)", "file.txt") or die "Could not open file: $!";
my ($lines, $words, $chars) = (0,0,0);
my #wordcounter;
while (<FILE>) {
chomp($_);
$chars += length($_);
#wordcounter = split(/\W+/, $_);
$words += #wordcounter;
}
$lines = $.;
close FILE;
print "\nlines=$lines, words=$words, chars=$chars\n";
There is the Perl Power Tools project whose goal is to reconstruct all the Unix bin utilities, primarily for those on operating systems deprived of Unix. Yes, they did wc. The implementation is overkill, but it is POSIX compliant.
It gets a little ridiculous when you look at the GNU compliant implementation of true.
Non-serious answer:
system("wc foo");
Reading the file in fixed-size chunks may be more efficient than reading line-by-line. The wc binary does this.
#!/usr/bin/env perl
use constant BLOCK_SIZE => 16384;

for my $file (@ARGV) {
    open my $fh, '<', $file or do {
        warn "couldn't open $file: $!\n";
        next;
    };

    my ($chars, $words, $lines) = (0, 0, 0);
    my ($new_word, $new_line);

    while ((my $size = sysread $fh, local $_, BLOCK_SIZE) > 0) {
        $chars += $size;
        $words += () = /\s+/g;
        $words-- if $new_word && /\A\s/;
        $lines += () = /\n/g;

        $new_word = /\s\Z/;
        $new_line = /\n\Z/;
    }
    $lines-- if $new_line;

    print "\t$lines\t$words\t$chars\t$file\n";
}
To be able to count CHARS and not bytes, consider this:
(Try it with Chinese or Cyrillic letters and file saved in utf8)
use utf8;
my $file='file.txt';
my $LAYER = ':encoding(UTF-8)';
open( my $fh, '<', $file )
|| die( "$file couldn't be opened: $!" );
binmode( $fh, $LAYER );
read $fh, my $txt, -s $file;
close $fh;
print length $txt,$/;
use bytes;
print length $txt,$/;
This may be helpful to Perl beginners.
I tried to simulate MS Word's word-counting functionality and added one more feature which wc in Linux does not show.
number of lines
number of words
number of characters with spaces
number of characters without spaces (wc will not give this in its output, but Microsoft Word shows it)
Here is the URL: Counting words, characters and lines in a file
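For reference, a minimal sketch of how those four counts could be computed (this is only an illustration, not the script behind the link):
use strict;
use warnings;

my $file = shift or die "Usage: $0 file\n";
open my $fh, '<', $file or die "Cannot open $file: $!";

my ($lines, $words, $chars_with_space, $chars_no_space) = (0, 0, 0, 0);
while (my $line = <$fh>) {
    $lines++;
    chomp $line;
    $chars_with_space += length $line;
    $chars_no_space   += ($line =~ tr/ \t//c);   # count everything except spaces and tabs
    $words            += () = $line =~ /\S+/g;   # runs of non-whitespace
}
close $fh;

print "lines=$lines words=$words chars(with spaces)=$chars_with_space chars(no spaces)=$chars_no_space\n";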