Matching and splitting a specific line from a .txt file - perl

I'm looking for concise Perl equivalents (to use within scripts rather than one-liners) of a few things I'd otherwise do in bash/awk:
Count=$(awk '/reads/ && ! seen {print $1; seen=1}' < input.txt)
This trawls through a specified .txt file that contains a multitude of lines, including some in this format:
8523723 reads; of these:
1256265 reads; of these:
2418091 reads; of these:
Printing '8523723' and ignoring the remainder of the matchable lines (as I only wish to act on the first matched instance).
Secondly:
Count2=$(awk '/paired/ {sum+=$1} END{print sum}' < input.txt)
25 paired; of these:
15 paired; of these:
This would create a running total of the numbers on each matched line, printing 40.

The first one is (capturing the number so that, like the awk version, only '8523723' is printed rather than the whole line):
while (<>) {
    if (/(\d+) reads/) {
        print "$1\n";
        last;
    }
}
The second one is (say needs the feature pragma, or use v5.10;):
use feature 'say';
my $total = 0;
while (<>) {
    if (/(\d+) paired/) {
        $total += $1;
    }
}
say $total;
You could, no doubt, golf them both. But these versions are readable :-)
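If you do want one-liner forms to call from a shell script, rough equivalents would be (sketches, assuming input.txt as in the question):
Count=$(perl -ne 'if (/(\d+) reads/) { print $1; exit }' input.txt)
Count2=$(perl -ne '$t += $1 if /(\d+) paired/; END { print $t + 0 }' input.txt)
The exit in the first stops reading after the first match, like last in the script version; $t + 0 in the second prints 0 if nothing matched.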

Related

Splitting large text files with Perl

I have to split a large, 1.8 TB text file in two (I need only the second half of the file). The file has \n as the record separator.
I tried
perl -ne 'print if $. >= $line_to_start_from' test.txt > result.txt
on a much smaller, 115 MB test file, and it did the job but took 22 seconds.
Using this solution on a 1.8 TB file would take an unreasonably long time, so my question is whether there is a way in Perl to split huge files without looping over them?
By default, perl reads file input one line at a time. If your file contains lots of relatively short lines (and I'm assuming it does), perl will be a lot slower than utilities like split, which read from the file in bigger chunks.
For testing, I created a ~200MB file with very short lines:
$ perl -e 'print "123\n" for( 1 .. 50_000_000 );' >file_to_split
split can handle it pretty reasonably:
$ time split --lines=25000000 file_to_split half
real 0m1.266s
user 0m0.314s
sys 0m0.213s
And the naïve perl approach is much slower:
$ time perl -ne 'print if $. > 25_000_000' file_to_split >second_half
real 0m10.474s
user 0m10.257s
sys 0m0.222s
But you can use the $/ special variable to make perl read more than one line at a time, for example 16 KB of data at a time:
my $CHUNK_SIZE    = 16 * 1024;
my $SPLIT_AT_LINE = 25_000_000;
{
    local $/ = \$CHUNK_SIZE;
    my $lineNumber = 0;
    while (<>) {
        if ($lineNumber > $SPLIT_AT_LINE) {
            # everything from here on is in the second half
            print $_;
        }
        else {
            my $count = $_ =~ tr/\n/\n/;
            $lineNumber += $count;
            if ($lineNumber > $SPLIT_AT_LINE) {
                # we went past the split; keep only the tail of this buffer
                my $extra = $lineNumber - $SPLIT_AT_LINE;
                my @lines = split m/\n/, $_, $count - $extra + 1;
                print $lines[-1];
            }
        }
    }
}
If you don't care about overshooting the split by a few lines, you could make this code even simpler (a sketch follows the timings below). And this gets perl to do the same operation in a reasonable amount of time:
$ time perl test.pl file_to_split >second_half
real 0m0.678s
user 0m0.095s
sys 0m0.297s
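Here is a sketch of that simpler variant. It drops the exact bookkeeping and just prints from the chunk in which the boundary falls, so the second half may start a few lines early:
my $CHUNK_SIZE    = 16 * 1024;
my $SPLIT_AT_LINE = 25_000_000;

$/ = \$CHUNK_SIZE;            # read fixed-size chunks instead of lines
my $lineNumber = 0;
while (<>) {
    if ($lineNumber <= $SPLIT_AT_LINE) {
        $lineNumber += tr/\n/\n/;               # count newlines in this chunk
        next if $lineNumber <= $SPLIT_AT_LINE;  # still in the first half
        # boundary crossed mid-chunk: fall through and print the whole
        # chunk, starting the output a few lines before the exact split
    }
    print;
}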

get column list using sed/awk/perl

I have different files in the format below.
Scenario 1:
File1
no,name
1,aaa
20,bbb
File2
no,name,address
5,aaa,ghi
7,ccc,mn
I would like to print the longer of the two header lines, but only if the shorter header is a prefix of it (i.e. the common columns are in the same order).
Expected output for scenario 1:
no,name,address
Scenario 2:
File1
no,name
1,aaa
20,bbb
File2
no,age,name,address
5,2,aaa,ghi
7,3,ccc,mn
Expected results:
The message "Both file headers and positions are different"
I am interested in any short solution using bash / perl / sed / awk.
Perl solution:
perl -lne 'push @lines, $_;
    close ARGV;
    next if @lines < 2;
    @lines = sort { length $a <=> length $b } @lines;
    if (0 == index "$lines[1],", "$lines[0],") {
        print $lines[1];
    } else {
        print "Both file headers and positions are different";
    }' -- File1 File2
-n reads the input line by line and runs the code for each line
-l removes newlines from input and adds them to printed lines
closing the special file handle ARGV makes Perl open the next file and read from it instead of processing the rest of the currently opened file.
next makes Perl go back to the beginning of the code; the rest can only run once more than one header line has been read.
sort sorts the lines by length so that we know the longer one is in the second element of the array.
index is used to check whether the shorter header is a prefix of the longer one (both with a trailing comma appended, so e.g. no,names is correctly rejected)
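If you'd rather have this in a script than a one-liner, the same logic looks like this (a sketch; the file names are the ones from the question):
#!/usr/bin/perl
use strict;
use warnings;

# read only the header (first line) of each file
my @headers;
for my $file ('File1', 'File2') {
    open my $fh, '<', $file or die "cannot open $file: $!";
    chomp( my $header = <$fh> );
    push @headers, $header;
    close $fh;
}

# shorter header first
@headers = sort { length $a <=> length $b } @headers;

# prefix check with trailing commas, so no,name does not match no,names
if (0 == index "$headers[1],", "$headers[0],") {
    print "$headers[1]\n";
} else {
    print "Both file headers and positions are different\n";
}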

Removing lines containing a string from a file w/ perl

I'm trying to take a file INPUT and, if a line in that file contains a string, replace the line with something else (the entire line, including line breaks), or with nothing at all (remove the line as if it weren't there), writing all of this to a new file.
Here's that section of code...
while (<INPUT>) {
    if ($_ =~ / <openTag>/) {
        chomp;
        print OUTPUT "Some_Replacement_String";
    } elsif ($_ =~ / <\/closeTag>/) {
        chomp;
        print OUTPUT ""; # remove the line
    } else {
        chomp;
        print OUTPUT "$_\r\n"; # print the original line
    }
}
while(<INPUT>) should read one line at a time (if my understanding is correct) and store each line in the special variable $_
However, when I run the above code I get only the very first if condition's output, Some_Replacement_String, and only once (1 line, out of a file with 1.3m lines, where I expect 600,000 replacements). This obviously isn't the behavior I expect. If I do something like while(<INPUT>){ print OUTPUT $_; } I get a copy of the entire file, every line, so I know the entire file is being read (expected behavior).
What I'm trying to do is get a line, test it, do something with it, and move on to the next one.
If it helps with troubleshooting at all, if I use print $.; anywhere in that while statement (or after it), I get 1 returned. I expected this to be the "current line number for the last filehandle accessed", so by the time my while statement loops through the entire file, it should equal the number of lines in the file, not 1.
I've tried a few other variations of this code, but I think this is the closest I've come. I assume there's a good reason I'm not getting the behavior I expect, can anyone tell me what it is?
The problem you are describing indicates that your input file only contains one line. This may be because of a great many different things, such as:
You have changed the input record separator $/
Your input file does not contain the correct line endings
You are running your script with the -0777 switch
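If you suspect the line endings, a quick check is to count the raw CR and LF bytes in the file; a file with only \r endings (old Mac style) is read as a single line by default. A sketch (the file name is a placeholder):
use strict;
use warnings;

open my $fh, '<:raw', 'input.txt' or die "cannot open: $!";
my ($cr, $lf) = (0, 0);
while (read $fh, my $buf, 64 * 1024) {
    $cr += ($buf =~ tr/\r//);   # tr in a counting context; nothing is changed
    $lf += ($buf =~ tr/\n//);
}
print "CR: $cr, LF: $lf\n";     # LF of 0 with CR > 0 means no \n line endings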
Some notes on your code:
if ($_ =~ / <openTag>/) {
    chomp;
    print OUTPUT "Some_Replacement_String";
No need to chomp a line you are not using.
} elsif ($_ =~ / <\/closeTag>/) {
    chomp;
    print OUTPUT "";
This is quite redundant. You don't need to print an empty string (ever, really), or chomp a value you're not using.
} else {
    chomp;
    print OUTPUT "$_\r\n"; # print the original line
No need to remove newlines and then put them back. Also, normally you would use \n as your line ending, even on windows.
And, since you are chomping in every if-else clause, you might as well move that outside the entire if-block.
chomp;
if (....) {
But since you are never relying on line endings not being there, why bother using chomp at all?
When using the $_ variable, you can abbreviate some commands, such as you are doing with chomp. For example, a lone regex will be applied to $_:
} elsif (/ <\/closeTag>/) { # works splendidly
When, like above, you have a regex that contains slashes, you can choose another delimiter for your regex, so that you do not need to escape the slashes:
} elsif (m# </closeTag>#) {
But then you need to use the full notation of the m// operator, with the m in front.
So, in short
while (<INPUT>) {
    if (/ <openTag>/) {
        print OUTPUT "Some_Replacement_String";
    } elsif (m# </closeTag>#) {
        # do nothing
    } else {
        print OUTPUT $_; # print the original line
    }
}
And of course, the last two can be combined into one, with some negation logic:
} elsif (not m# </closeTag>#) {
    print OUTPUT $_;
}

Matched lines (with regex) being written to both output files, but they're supposed to be written to only one output file

I have a tab-delimited text file with several rows. I wrote a script in which I assign the rows to an array, and then I search through the array by means of regular expressions to find the rows that match certain criteria. When a match is found, I write it to Output 1. After going through all the if-statements (the regular expressions), if the criteria still aren't met, the line is written to Output 2.
It works 100% when it comes to matching criteria and writing to Output 1, but here is where my problem comes in:
The matched lines are also being written to Output 2, along with the unmatched lines. I am probably making a silly mistake, but I really just can't see it. If someone could have a look and help me out, I'd really appreciate it.
Thanks so much! :)
Inputfile sample:
skool school
losieshuis pension
prys prijs
eeu eeuw
lys lijs
water water
outoritêr outoritaire
#!/usr/bin/perl -w
use strict;
use warnings;
use open ':utf8';
use autodie;

open OSWNM, "<SecondWordsNotMatched.txt";
open ONIC, ">Output1NonIdenticalCognates.txt";
open ONC, ">Output2NonCognates.txt";

while (my $line = <OSWNM>)
{
    chomp $line;
    my @Row = $line;
    for (my $x = 0; $x <= $#Row; $x++)
    {
        my $RowWord = $Row[$x];
        # Match: anything, followed by 'y' or 'lê' or 'ê', followed by
        # anything, followed by a tab, followed by anything, followed by
        # 'ij' or 'leggen' or 'e', followed by anything
        if ($RowWord =~ /(.*)(y|lê|ê)(.*)(\t)(.*)(ij|leggen|e)(.*)/)
        {
            print ONIC "$RowWord\n";
        }
        # Match: anything, followed by 'eeu', followed by 'e' or 's'
        # (optional), followed by anything, followed by a tab, followed by
        # anything, followed by 'eeuw', followed by 'en' (optional)
        if ($RowWord =~ /(.*)(eeu)(e|s)?(\t)(.*)(eeuw)(en)?/)
        {
            print ONIC "$RowWord\n";
        }
        else
        {
            print ONC "$RowWord\n";
        }
    }
}
Inside your loop you essentially have:
if (A) {
    output to file1
}
if (B) {
    output to file1
} else {
    output to file2
}
So you'll output to file2 anything that doesn't satisfy B (regardless of whether A was satisfied or not), and output stuff that satisfies both A and B twice to file1.
If outputting twice was not intended, you should modify your logic to something like:
if (A or B) {
    output to file1
} else {
    output to file2
}
Or:
if (A) {
    output to file1
} elsif (B) {
    output to file1
} else {
    output to file2
}
(This second version allows you to do different processing for the A and B cases.)
If the double output was intended, you could do something like:
my $output_to_file2 = 1;
if (A) {
    output to file1
    $output_to_file2 = 0;
}
if (B) {
    output to file1
    $output_to_file2 = 0;
}
if ($output_to_file2) {
    output to file2
}
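Applied to your loop, the if (A or B) version would look like this (a sketch that reuses your handles and regexes unchanged):
if ($RowWord =~ /(.*)(y|lê|ê)(.*)(\t)(.*)(ij|leggen|e)(.*)/
    or $RowWord =~ /(.*)(eeu)(e|s)?(\t)(.*)(eeuw)(en)?/)
{
    print ONIC "$RowWord\n";    # matched either pattern: Output 1
}
else
{
    print ONC "$RowWord\n";     # matched neither: Output 2
}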

How can I swap two consecutive lines in Perl?

I have this part of a script for editing cue sheets, and I don't know how to swap two consecutive lines when they are found in this order:
/^TITLE.*?"$/
/^PERFORMER.*?"$/
so that they become
/^PERFORMER.*?"$/
/^TITLE.*?"$/
What would the solution be in my case?
use strict;
use warnings;
use File::Find;
use Tie::File;

my $dir_target = 'test';
find(\&c, $dir_target);

sub c {
    /\.cue$/ or return;
    my $fn = $File::Find::name;
    tie my @lines, 'Tie::File', $fn or die "could not tie file: $!";
    for (my $i = 0; $i < @lines; $i++) {
        if ($lines[$i] =~ /^REM (DATE|GENRE|REPLAYGAIN).*?$/) {
            splice(@lines, $i, 3);
        }
        if ($lines[$i] =~ /^\s+REPLAYGAIN.*?$/) {
            splice(@lines, $i, 1);
        }
    }
    untie @lines;
}
This may seem like overkill, but seeing that your files aren't very large, I'm tempted to leverage the following one-liner (either from the command line or via a system call).
The one-liner works by slurping all the lines in one shot, then leaving the rest of the work to a regex substitution which flips the order of the lines.
If you're using *nix:
perl -0777 -i -pe 's/(TITLE.*?")\n(PERFORMER.*?")/$2\n$1/g' file1 file2 ..
If you're using Windows, you'll need to create a backup of the existing files:
perl -0777 -i.bak -pe "s/(TITLE.*?\")\n(PERFORMER.*?\")/$2\n$1/g" file1 file2 ..
Explanation
Command Switches (see perlrun for more info)
-0777 (an octal number) enforces file-slurping behavior
-i enables in-place editing (no need to splice-'n'-dice!). Windows systems require that you provide a backup extension, hence the additional .bak
-p loops over your file(s) like -n, but also prints $_ after each pass, so the edited text is written back; plain -n would run the substitution and then silently discard the result (and since you're slurping, Perl treats the contents of each file as one "line")
-e allows Perl to recognize code within the command-line
Regex
The substitution regex captures each occurrence of a TITLE line and the consecutive PERFORMER line, storing them in $1 and $2 respectively, then flips the order of the two variables, separated by a newline.
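You can see the effect on a small slurped string (the sample data here is invented for illustration):
my $cue = qq{TITLE "Some Track"\nPERFORMER "Some Artist"\n};
$cue =~ s/(TITLE.*?")\n(PERFORMER.*?")/$2\n$1/g;
print $cue;
# prints:
# PERFORMER "Some Artist"
# TITLE "Some Track"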
Filename Arguments
You could use *nix to provide the filenames of the directories in question, but I'll leave that to someone else to figure out as I'm not too comfortable with Unix pipes just yet (see this John Siracusa answer for more guidance).
I would create a backup of your files before you try these one-liners though.
Well, since you're tying into an array, I'd just check $lines[$i] and $lines[$i+1] (as long as the +1 index exists, that is), and if the former matches TITLE and the latter PERFORMER, swap them. Unless perhaps you need to transpose these even if they're not consecutive?
Here's an option (this snippet would go inside your for loop, perhaps above the REM-checking line) if you know they'll be consecutive:
if ($i < $#lines and $lines[$i] =~ /^TITLE.*?"$/
    and $lines[$i+1] =~ /^PERFORMER.*?"$/) {
    my $tmp = $lines[$i];
    $lines[$i] = $lines[$i+1];
    $lines[$i+1] = $tmp;
}
Another option (which would work regardless of consecutiveness, and is arguably more elegant) would be to
use List::MoreUtils qw(first_index);
(up at the top, with your other use statements) and then do (inside &c, but outside the for loop):
my $title_idx     = first_index { /^TITLE.*?"$/ } @lines;
my $performer_idx = first_index { /^PERFORMER.*?"$/ } @lines;
if ($title_idx >= 0 and $performer_idx >= 0 and $title_idx < $performer_idx)
{
    # swap these lines:
    ($lines[$title_idx], $lines[$performer_idx]) =
        ($lines[$performer_idx], $lines[$title_idx]);
}
Is that what you're after?