Counting records separated by CR/LF (carriage return and newline) in Perl - perl

I'm trying to create a simple script to read a text file that contains records of book titles. Each record is separated by a plain old blank line (\r\n\r\n). I need to count how many records are in the file.
For example, here is the input file:
record 1
some text

record 2
some text

...
I'm using a regex to check for carriage return and newline, but it fails to match. What am I doing wrong? I'm at my wits' end.
sub readInputFile {
    my $inputFile = $_[0]; #read first argument from the commandline as fileName
    open INPUTFILE, "+<", $inputFile or die $!; #Open File
    my $singleLine;
    my @singleRecord;
    my $recordCounter = 0;
    while (<INPUTFILE>) { # loop through the input file line-by-line
        $singleLine = $_;
        push(@singleRecord, $singleLine); # start adding each line to a record array
        if ($singleLine =~ m/\r\n/) { # check for carriage return and newline
            $recordCounter += 1;
            createHashTable(@singleRecord); # send record to make a hash table
            @singleRecord = (); # empty the current record to start a new record
        }
    }
    print "total records : $recordCounter \n";
    close(INPUTFILE);
}

It sounds like you are processing a Windows text file on Linux, in which case you want to open the file with the :crlf layer, which will convert all CRLF line-endings to the standard Perl \n ending.
If you are reading Windows files on a Windows platform then the conversion is already done for you, and you won't find CRLF sequences in the data you have read. If you are reading a Linux file then there are no CR characters in there anyway.
It also sounds like your records are separated by a blank line. Setting the built-in input record separator variable $/ to a null string will cause Perl to read a whole record at a time.
I believe this version of your subroutine is what you need. Note that people familiar with Perl will thank you for using lower-case letters and underscore for variables and subroutine names. Mixed case is conventionally reserved for package names.
You don't show create_hash_table so I can't tell what data it needs. I have chomped and split the record into lines, and passed a list of the lines in the record with the newlines removed. It would probably be better to pass the entire record as a single string and leave create_hash_table to process it as required.
sub read_input_file {
    my ($input_file) = @_;

    open my $fh, '<:crlf', $input_file or die $!;
    local $/ = '';

    my $record_counter = 0;

    while (my $record = <$fh>) {
        chomp $record;
        ++$record_counter;
        create_hash_table(split /\n/, $record);
    }

    close $fh;

    print "Total records : $record_counter\n";
}

You can do this more succinctly by changing Perl's record-separator, which will make the loop return a record at a time instead of a line at a time.
E.g. after opening your file:
local $/ = "\r\n\r\n";
my $recordCounter = 0;
$recordCounter++ while(<INPUTFILE>);
$/ holds Perl's global record-separator, and scoping it with local allows you to override its value temporarily until the end of the enclosing block, when it will automatically revert back to its previous value.
But it sounds like the file you're processing may actually have "\n\n" record-separators, or even "\r\r". You'd need to set the record-separator correctly for whatever file you're processing.
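If you'd rather not guess the separator at all, a rough alternative (not from the answers above; 'books.txt' is a placeholder filename) is to slurp the file and let the generic newline escape \R absorb whichever line-ending convention is in use:
open my $fh, '<', 'books.txt' or die "Can't open books.txt: $!";
my $text = do { local $/; <$fh> };            # slurp the whole file
close $fh;

# \R matches \n, \r\n or \r, so /\R{2,}/ splits on any flavour of blank line
my @records = grep { /\S/ } split /\R{2,}/, $text;
print "total records : ", scalar @records, "\n";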

If your files are not huge multi-gigabyte files, the easiest and safest way is to read the whole file and use the generic newline metacharacter \R.
This way, it also works if some file actually uses LF instead of CRLF (or even the old Mac standard CR).
Use it with split if you also need the actual records:
perl -ln -0777 -e 'my @records = split /\R\R/; print scalar(@records)' $Your_File
Or if you only want to count the records:
perl -ln -0777 -e 'my $count=()=/\R\R/g; print $count' $Your_File
For more details, see also my other answer here to a similar question.

Related

How to split a file (with sed) into numerous files according to a value found on each line?

I have several Company_***.csv files (although the separator is a tab, not a comma, so they should really be *.tsv, but never mind) which contain a header plus numerous data lines, e.g.
1stHeader 2ndHeader DateHeader OtherHeaders...
111111111 SOME STRING 2020-08-01 OTHER STRINGS..
222222222 ANOT STRING 2020-08-02 OTHER STRINGS..
I have to split them according to the 3rd column (here, it's a date).
Each file should be named like e.g. Company_2020_08_01.csv, Company_2020_08_02.csv and so on,
and should contain the same header on the 1st line plus the matching rows as the following lines.
At first I thought about saving (once) the header in a single file e.g.
sed -n '1w Company_header.csv' Company_*.csv
then parsing the files with a pattern for the date (hence the headers would be skipped) e.g.
sed -n '/\t2020-[01][0-9]-[0-3][0-9]\t/w somefilename.csv' Company_*.csv
... and at last, insert the (missing) header in each generated file.
But I'm stuck at step 2: I can't find how I could generate (dynamically) the "filename" expected by the w command, nor how to capture the date in the search pattern (because apparently this is just an address, not a search-replace "field" as in the s/regexp/replacement/[flags] command, so you can't have capturing groups ( ) in there).
So I wonder if this is actually doable with sed? Or should I look upon other tools e.g. awk?
Disclaimer: I'm quite a n00b with these commands so I'm just learning/starting from scratch...
Perl to the rescue!
perl -e 'while (<>) {
    $h = $_, next if $. == 1;
    $. = 0 if eof;
    @c = split /\t/;
    open my $out, ">>", "Company_" . $c[2] =~ tr/-/_/r . ".csv" or die $!;
    print {$out} $h unless tell $out;
    print {$out} $_;
}' -- Company_*.csv
The diamond operator <> in scalar context reads a line from the input.
The first line of each file is stored in the variable $h (see $. and eof).
split populates the @c array with the column values of each line.
$c[2] contains the date; using tr we translate dashes to underscores to create a filename from it. open opens the file for appending.
print prints the header if the file is empty (see tell)
and prints the current line, too.
Note that it only appends to the files, so don't forget to delete any output files before running the script again.
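If re-running the script is a concern, one possible variant (not part of the answer above; the %fh hash and the truncate-on-first-use idea are mine) keeps an output handle per date in a hash and truncates each file the first time that date is seen in the current run:
#!/usr/bin/perl
use strict;
use warnings;

my ($header, %fh);
while (<>) {
    $header = $_, next if $. == 1;     # remember each input file's header line
    $. = 0 if eof;                     # reset the line counter at the end of each file
    my @c = split /\t/;
    my $name = "Company_" . ($c[2] =~ tr/-/_/r) . ".csv";
    unless ($fh{$name}) {              # first time this date appears: truncate and write the header
        open $fh{$name}, '>', $name or die "$name: $!";
        print { $fh{$name} } $header;
    }
    print { $fh{$name} } $_;
}
Run it as, say, perl split_by_date.pl Company_*.csv (the script name is just an example). The handles stay open for the whole run, so a huge number of distinct dates could bump into the open-file limit.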

Perl script -- Multiple text file parsing and writing

Suppose I have a directory full of text files (raw text). What I need is a Perl script that will parse the directory's text files one by one (top to bottom) and save their contents in a new single file, appointed by me. In other words, I simply want to create a corpus of many documents. Note: these documents have to be separated by some tag, e.g. indicating the sequence in which they were parsed.
So far I have managed to follow some examples and I know how to read, write and parse text files. But I am not yet in a position to merge them into one script and handle many text files. Can you please provide some assistance? Thanks.
edit:
example code for writing to a file.
#!/usr/local/bin/perl
open (MYFILE, '>>data.txt');
print MYFILE "text\n";
close (MYFILE);
example code for reading a file.
#!/usr/local/bin/perl
open (MYFILE, 'data.txt');
while (<MYFILE>) {
    chomp;
    print "$_\n";
}
close (MYFILE);
I've also found out about the foreach loop, which can be used for tasks such as this, but I still don't know how to combine these pieces and achieve the result explained in the description.
The important points in this suggestion are:
the "magic" diamond operator (a.k.a. readline), which reads from each file in *ARGV,
the eof function, which tells if the next readline on the current filehandle will return any data
the $ARGV variable, that contains the name of the currently opened file.
With that intro, here we go!
#!/usr/bin/perl
use strict;      # Always!
use warnings;    # Always!

my $header = 1;  # Flag to tell us to print the header

while (<>) {     # read a line from a file
    if ($header) {
        # This is the first line, print the name of the file
        print "========= $ARGV ========\n";
        # reset the flag to a false value
        $header = undef;
    }
    # Print out what we just read in
    print;
}
continue {       # This happens before the next iteration of the loop
    # Check if we finished the previous file
    $header = 1 if eof;
}
To use it, just do: perl concat.pl *.txt > compiled.TXT

extracting paragraphs from text with perl

I want to extract the paragraphs from a text variable that was retrieved from the DB.
For extracting the paragraphs from a file handle I use the code below:
local $/ = undef;
@paragraphs = <STDIN>;
What is the best option to extract paragraphs from a text variable using Perl, and are there modules on CPAN that do this type of task?
You're almost there. Setting $/ to undef will slurp in the entire text in one go.
What you want is local $/ = ""; to enable paragraph mode, as per perldoc perlvar (emphasis my own):
$/
The input record separator, newline by default. This influences Perl's
idea of what a "line" is. Works like awk's RS variable, including
treating empty lines as a terminator if set to the null string (an
empty line cannot contain any spaces or tabs). You may set it to a
multi-character string to match a multi-character terminator, or to
undef to read through the end of file. Setting it to "\n\n" means
something slightly different than setting to "" , if the file contains
consecutive empty lines. Setting to "" will treat two or more
consecutive empty lines as a single empty line. Setting to "\n\n"
will blindly assume that the next input character belongs to the next
paragraph, even if it's a newline.
Of course, it is possible to get a filehandle to read from a string instead of a file:
use strict;
use warnings;
use autodie;

my $text = <<TEXT;
This is a paragraph.

Here's another one that
spans over multiple lines.

Last paragraph
TEXT

local $/ = "";

open my $fh, '<', \$text;

while ( <$fh> ) {
    print "New Paragraph: $_";
}

close $fh;
Output
New Paragraph: This is a paragraph.

New Paragraph: Here's another one that
spans over multiple lines.

New Paragraph: Last paragraph
You already have the answer for a script (local $/ = "";), but it may be worth noting that there is a shortcut for one-liners: the -00 option.
perl -00 -ne '$count++; END {print "Counted $count paragraphs\n"}' somefile.txt
From man perlrun :
-0[octal/hexadecimal]
specifies the input record separator ($/) [...]
The special value 00 will cause Perl to slurp files in paragraph
mode.
If the text is in a variable, for example:
$text = "Here is a paragraph.\nHere is another paragraph.";
or:
$text = 'Paragraph 1
Paragraph2';
You can simply get the paragraphs by splitting the text with "\n".
@paragraphs = split("\n", $text);
If your paragraphs are separated by double newlines or a combination of \n and \r (like in Windows) you can change the split command accordingly.
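As a hedged illustration of that adjustment (not part of the answer above, and reusing the $text variable), splitting on runs of blank lines in any line-ending style could look like this:
# split on one or more blank lines, whether the endings are \n, \r\n or \r
my @paragraphs = split /(?:\r\n|\r|\n){2,}/, $text;

# on any recent Perl, the generic newline \R says the same thing more briefly
my @paragraphs_alt = split /\R{2,}/, $text;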

Reading .fasta sequences to extract nucleotide data, and then writing to a TabDelimited file

Before I continue, I thought I'd refer readers to my previous problems with Perl, being a beginner to all of this.
These were my posts over the past few days, in chronological order:
How do I average column values from a tab-separated data... (Solved)
Why do I see no computed results in my output file? (Solved)
Using a .fasta file to compute relative content of sequences
Now as I've stated above, thanks to help from a few of you, I've managed to figure out the first two queries and I've really learnt from it. I'm truly grateful. For a person who knows nothing about this, and still feels like he doesn't, the help was practically a Godsend.
The last query remains unsolved and this is a continuation. I did have a look at some of the recommended texts, but as I'm trying to get this finished before Monday, I'm unsure if I've overlooked anything completely. Either way, I have had a go at attempting the task.
Just so you know, the task is to open and read a .fasta file (I think I've finally nailed something pretty well, hallelujah!), read each sequence, compute the relative G+C nucleotide content, and then write the names of the genes and their respective G+C content to a tab-delimited file.
Even though I've had a go at attempting this, I know that I am nowhere near ready to execute the program to provide the results that I'm after, which is why I'm reaching out to you guys again for some guidance, or examples of how to go about this. As with my previous, solved queries, I'd like it to be in a similar style to how I've already done them, even though it might not be the most convenient/efficient way. It just allows me to know what I'm doing each step of the way, even though it seems like I'm spamming it up!
Anyway, the .fasta file reads something like:
>label
sequence
>label
sequence
>label
sequence
I'm unsure how to open the .fasta file, so I'm not sure what labels apply to which, but I know that the genes should be labelled either gag, pol, or env. Do I need to open the .fasta file to know what I'm doing, or can I do it 'blindly' by going with the above format?
It may be perfectly obvious, but I'm still struggling with all of this. I'm feeling like I should have caught on by now!
Anyway, the current code I have is as follows:
#!/usr/bin/perl -w
# This script reads several sequences and computes the relative content of G+C of each sequence.
use strict;
my $infile = "Lab1_seq.fasta"; # This is the file path
open INFILE, $infile or die "Can't open $infile: $!"; # This opens file, but if file isn't there it mentions this will not open
my $outfile = "Lab1_SeqOutput.txt"; # This is the file's output
open OUTFILE, ">$outfile" or die "Cannot open $outfile: $!"; # This opens the output file, otherwise it mentions this will not open
my $sequence = (); # This sequence variable stores the sequences from the .fasta file
my $GC = 0; # This variable checks for G + C content
my $line; # This reads the input file one-line-at-a-time
while ($line = <INFILE>) {
chomp $line; # This removes "\n" at the end of each line (this is invisible)
foreach my $line ($infile) {
if($line = ~/^\s*$/) { # This finds lines with whitespaces from the beginning to the ending of the sequence. Removes blank line.
next;
} elsif($line = ~/^\s*#/) { # This finds lines with spaces before the hash character. Removes .fasta comment
next;
} elsif($line = ~/^>/) { # This finds lines with the '>' symbol at beginning of label. Removes .fasta label
next;
} else {
$sequence = $line;
}
}
{
$sequence =~ s/\s//g; # Whitespace characters are removed
return $sequence;
}
I'm not sure if anything's correct here, but executing it left me with a syntax error at line 35 (beyond the last line, and hence there isn't anything there!). It said it was at 'EOF'. That's about all I can point out. Otherwise I'm trying to figure out how to compute the quantities of the nucleotides G + C in each of the sequences, and then tabulate this properly in an output .txt file. I believe that's what is meant by a tab-delimited file?
In any case, I apologise if this query seems to be too lengthy, 'dumb' or a repeat, but in saying that, I couldn't find any information directly pertaining to this, so your help would be much appreciated, and the explanations for each step too if possible!!
Kindest.
You have an extra brace right near the end. This should work:
#!/usr/bin/perl -w
# This script reads several sequences and computes the relative content of G+C of each sequence.
use strict;
my $infile = "Lab1_seq.fasta"; # This is the file path
open INFILE, $infile or die "Can't open $infile: $!"; # This opens file, but if file isn't there it mentions this will not open
my $outfile = "Lab1_SeqOutput.txt"; # This is the file's output
open OUTFILE, ">$outfile" or die "Cannot open $outfile: $!"; # This opens the output file, otherwise it mentions this will not open

my $sequence = (); # This sequence variable stores the sequences from the .fasta file
my $GC = 0; # This variable checks for G + C content
my $line; # This reads the input file one-line-at-a-time

while ($line = <INFILE>) {
    chomp $line; # This removes "\n" at the end of each line (this is invisible)
    if ($line =~ /^\s*$/) { # This finds lines with whitespaces from the beginning to the ending of the sequence. Removes blank line.
        next;
    } elsif ($line =~ /^\s*#/) { # This finds lines with spaces before the hash character. Removes .fasta comment
        next;
    } elsif ($line =~ /^>/) { # This finds lines with the '>' symbol at beginning of label. Removes .fasta label
        next;
    } else {
        $sequence = $line;
    }
    $sequence =~ s/\s//g; # Whitespace characters are removed
    print OUTFILE $sequence;
}
Also I edited your return line: return belongs inside a subroutine, not in a loop like this. I suspect what you want is to print the sequence to a file, so I have done that. You may need to do some further transformation first to get it into a tab-separated format.
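The answer stops short of the G+C computation itself, so here is a rough, self-contained sketch of one way the rest of the task could go. This is not the answer's code: the %seq hash and the variable names are mine; only the file names follow the question.
#!/usr/bin/perl
use strict;
use warnings;

open my $in,  '<', 'Lab1_seq.fasta'     or die "Can't open Lab1_seq.fasta: $!";
open my $out, '>', 'Lab1_SeqOutput.txt' or die "Can't open Lab1_SeqOutput.txt: $!";

my ($label, %seq);
while (my $line = <$in>) {
    chomp $line;
    if ($line =~ /^>(\S+)/) {              # a '>' line starts a new record; capture the label
        $label = $1;
    } elsif (defined $label && $line =~ /\S/) {
        $seq{$label} .= $line;             # accumulate the sequence lines for that label
    }
}
close $in;

for my $gene (sort keys %seq) {
    my $gc  = ($seq{$gene} =~ tr/GCgc//);  # tr in scalar context counts the G and C bases
    my $pct = 100 * $gc / length $seq{$gene};
    printf {$out} "%s\t%.2f\n", $gene, $pct;   # gene <TAB> percent G+C
}
close $out;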

is there a way to designate the line token delimiter in Perl's file reader?

I'm reading a text file in via CGI, in Perl, and noticing that when the file is saved in Mac's TextEdit the line separator is recognized, but when I upload a CSV that is exported straight from Excel, it is not. I'm guessing it's a \n vs. \r issue, but it got me thinking that I don't know how to specify what I would like the line terminator token to be, if I didn't want the one it looks for by default.
Yes. You'll want to overwrite the value of $/. From perlvar
$/
The input record separator, newline by default. This influences Perl's idea of what a "line" is. Works like awk's RS variable, including treating empty lines as a terminator if set to the null string. (An empty line cannot contain any spaces or tabs.) You may set it to a multi-character string to match a multi-character terminator, or to undef to read through the end of file. Setting it to "\n\n" means something slightly different than setting to "", if the file contains consecutive empty lines. Setting to "" will treat two or more consecutive empty lines as a single empty line. Setting to "\n\n" will blindly assume that the next input character belongs to the next paragraph, even if it's a newline. (Mnemonic: / delimits line boundaries when quoting poetry.)
local $/; # enable "slurp" mode
local $_ = <FH>; # whole file now here
s/\n[ \t]+/ /g;
Remember: the value of $/ is a string, not a regex. awk has to be better for something. :-)
Setting $/ to a reference to an integer, scalar containing an integer, or scalar that's convertible to an integer will attempt to read records instead of lines, with the maximum record size being the referenced integer. So this:
local $/ = \32768; # or \"32768", or \$var_containing_32768
open my $fh, "<", $myfile or die $!;
local $_ = <$fh>;
will read a record of no more than 32768 bytes from FILE. If you're not reading from a record-oriented file (or your OS doesn't have record-oriented files), then you'll likely get a full chunk of data with every read. If a record is larger than the record size you've set, you'll get the record back in pieces. Trying to set the record size to zero or less will cause reading in the (rest of the) whole file.
On VMS, record reads are done with the equivalent of sysread, so it's best not to mix record and non-record reads on the same file. (This is unlikely to be a problem, because any file you'd want to read in record mode is probably unusable in line mode.) Non-VMS systems do normal I/O, so it's safe to mix record and non-record reads of a file.
See also "Newlines" in perlport. Also see $..
The variable has multiple names:
$/
$RS
$INPUT_RECORD_SEPARATOR
For the longer names, you need:
use English;
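For instance (a small illustration, not from the original answer):
use English qw( -no_match_vars );        # long names for the punctuation variables
local $INPUT_RECORD_SEPARATOR = "\r\n";  # exactly the same as: local $/ = "\r\n";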
Remember to localize carefully:
{
    local($/) = "\r\n";
    ...code to read...
}
If you are reading in a file with CRLF line terminators, you can open it with the CRLF discipline, or set the binmode of the handle to do automatic translation.
open my $fh, '<:crlf', 'the_csv_file.csv' or die "Oh noes $!";
This will transparently convert \r\n sequences into \n sequences.
You can also apply this translation to an existing handle by doing:
binmode( $fh, ':crlf' );
:crlf mode is typically default in Win32 Perl environments and works very well in practice.
For reading a CSV file, follow Robert-P's advice in his comment, and use a CSV module.
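For what it's worth, a minimal sketch of that route might look like the following (assuming the Text::CSV module is installed; the filename and the tab separator are only examples carried over from the context above):
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new({ binary => 1, sep_char => "\t", auto_diag => 1 });

# the :crlf layer normalises Windows line endings, as in the previous answer
open my $fh, '<:crlf', 'the_csv_file.csv' or die "Oh noes $!";
while (my $row = $csv->getline($fh)) {
    print join(' | ', @$row), "\n";      # $row is an array reference holding the fields
}
close $fh;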
But for the general case of reading lines from a file with different line-endings, what I generally do is slurp the file whole and split it on \R. If it's not a multi-gigabyte file, that should be the safest and easiest way.
So:
perl -ln -0777 -e 'my @lines = split /\R/;
    print length($_), " bytes split into ", scalar(@lines), " lines."' $YOUR_FILE
or in your script:
{
    local $/ = undef;
    open F, $YOUR_FILE or die;
    @lines = split /\R/, <F>;
    close F;
}
\R works with Unix LF (\x0A), Windows/Internet CRLF, and also with CR (\x0D) which was used by Macs in the nineties, but is in fact still used by some Mac programs.
From the perldoc :
\R matches a generic newline; that is, anything considered a linebreak
sequence by Unicode. This includes all characters matched by \v
(vertical whitespace), and the multi character sequence "\x0D\x0A"
(carriage return followed by a line feed, sometimes called the network
newline; it's the end of line sequence used in Microsoft text files
opened in binary mode)
Or see this much nicer and exhaustive explanation about \R in brian d foy's article The \R generic line ending, which even has a couple of fun videos.