Perl script to do global interpolation of numbers and timestamps - perl

I'm trying to write a Perl script which will do global interpolation of numbers and timestamps (timestamp format YYYYmmddHHMMSS, e.g. 20150124010502) with the system time. The format remains the same as in the original file, and the timestamp is reduced by one minute in each successive file.
The input files are file01.txt, file02.txt, file03.txt, file04.txt and so on. All the files have the same number, timestamp and size.
4947000219, 20150124010502 ,2
In the output files we want to replace the existing number and timestamp. The number needs to increment, and the timestamp should be replaced with the system time, formatted like in the original file.
Assuming the system time is "Mon Jan 19 13:39:57 IST 2015", the replacement timestamp is 20150119133957, and the minutes are reduced by one in each successive file.
The output files should look like this:
file01.txt 4947000219, 20150119133957 ,2
file02.txt 4947000220, 20150119133857 ,2
file03.txt 4947000221, 20150119133757 ,2
file04.txt 4947000222, 20150119133657 ,2
file05.txt 4947000223, 20150119133557 ,2
file06.txt 4947000224, 20150119133457 ,2
file07.txt 4947000225, 20150119133357 ,2
file08.txt 4947000226, 20150119133257 ,2
...
and so on.
Below is the Perl script we created, but it's not working.
#!/usr/bin/perl
use strict;
use File::Find;
use Time::Local;
use POSIX ();

my $n;
my @local = (localtime);
my $directory = "/home/Doc/Test";

chdir $directory;
opendir(DIR, ".") or die "couldn't open $directory: $!\n";

foreach my $file (readdir DIR) {
    next unless -f $file;
    open my $in_fh, "<$file";
    my @lines = <$in_fh>;
    close $in_fh;

    my $date = POSIX::strftime('%Y%m%d%H%M%S', @local);
    ++$n;
    $lines[0] =~ s~/(4947000219)/~$1+$n~ge;
    $lines[1] =~ s~/(20140924105028)/~$date-$n~ge;

    open my $out_fh, ">$file";
    print $out_fh @lines;
    close $out_fh;
}
closedir DIR;
closedir DIR;
Can anyone tell me what's wrong?

You are trying to subtract an integer from a date string to reduce the number of minutes:
$lines[1] =~ s~/(20140924105028)/~$date-$n~ge;
That won't work. Instead, subtract 60 seconds from the time value passed to localtime, and use strftime again for every file.
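For example (a minimal sketch, assuming a counter $n that is 0 for the first file):
use POSIX 'strftime';

# one minute (60 seconds) earlier per file, formatted like the original
my $stamp = strftime( '%Y%m%d%H%M%S', localtime( time - 60 * $n ) );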

This started out as a comment but the limited code formatting possibilities made it an answer instead.
I am not sure I understand you correctly. Are there input files in the same format as the output files? Otherwise I do not understand where you get the timestamp and number from. I will assume your input files look just like the output and you only want to change some numbers around. If that is not true, I still believe that my second point might help you out; the example, however, will not.
If I understand your problem correctly, I would assume the problem is in the s~~~ge operations.
First, you only replace a number in the first line and a timestamp in the second. It appears you lost a loop somewhere (there is even the indentation for it) and got confused about what @lines is. So first of all you need a loop over all your lines.
Second, from your example input it looks as if there are in fact no slashes. Your replacement, however, looks for a specific number in between slashes, removing those slashes in the process. I would assume the slashes are leftovers from a match operator you copied or something. But as your substitution operator uses ~ for separation, those slashes are literal.
Third, you look for one specific number, when the whole point of regular expressions is being better than plain search & replace.
If I am not mistaken, you are looking for something along the lines of
foreach my $currentLine (@lines) {
    ++$n;
    # #any number of non-spaces# #any number of digits#, #any number of digits#
    $currentLine =~ s~(\S*) (\d*), (\d*)~"$1 " . ($2 + $n) . ", " . ($date - $n)~ge;
}
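For completeness, here is a minimal sketch of the whole task that combines both observations. It assumes every input file holds a single line in exactly the sample format (ten-digit number, comma, timestamp, comma, count) and that the file names sort in processing order:
#!/usr/bin/perl
use strict;
use warnings;
use POSIX 'strftime';

my $directory = '/home/Doc/Test';    # taken from the question
my $now       = time;
my $n         = 0;

for my $file ( sort glob "$directory/file*.txt" ) {
    open my $in_fh, '<', $file or die "Can't read $file: $!";
    my @lines = <$in_fh>;
    close $in_fh;

    # system time minus one minute per file, in the YYYYmmddHHMMSS format
    my $stamp = strftime( '%Y%m%d%H%M%S', localtime( $now - 60 * $n ) );

    for (@lines) {
        # increment the leading number, swap in the new timestamp
        s/(\d{10}), \d{14}/ ($1 + $n) . ", $stamp" /e;
    }
    ++$n;

    open my $out_fh, '>', $file or die "Can't write $file: $!";
    print $out_fh @lines;
    close $out_fh;
}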


Check whether a field from a line of text matches a value

I have been using the following Perl code to extract text from multiple text files. It works fine.
Example of a couple of lines in one of the input files:
Fa0/19 CUTExyz notconnect 129 half 100 10/100BaseTX
Fa0/22 xyz MLS notconnect 1293 half 10 10/100BaseTX
What I need is to match the numbers in each line exactly (i.e. 129 is not matched by 1293) and print the corresponding lines.
It would also be nice to match a range of numbers while leaving specific numbers out, i.e. match 2 through 10 but not 11, then 12 through 20.
#!/perl/bin/perl
use warnings;
my @files = <c:/perl64/files/*>;
foreach $file ( @files ) {
    open( FILE, "$file" );
    while ( $line = <FILE> ) {
        print "$file $line" if $line =~ /123/n;
    }
    close FILE;
}
Thank you for the suggestions, but can it be done using the code structure above?
I suggest that you take a look at perldoc perlre.
You need to anchor your regex pattern. The easiest way is probably using \b which is a zero-width boundary between alphanumerics and non-alphanumerics.
#!/perl/bin/perl
use warnings;
use strict;

foreach my $file ( glob "c:/perl64/files/*" ) {
    open( my $input, '<', $file ) or die $!;
    while (<$input>) {
        print "$file $_" if m/\b123\b/;
    }
    close $input;
}
Note - you should use three-argument open with lexical file handles as above, because it is better practice.
I've also removed the n pattern modifier, as it appears redundant.
Following your edit, though, which gives us some source data, I'd suggest the solution is not to use a regex - your source data looks space-delimited. (Or maybe those are tabs?)
So I'd suggest you're better off using split, selecting the field you want, and testing it numerically, since you mention matching ranges. This is not a good fit for regexes, because they don't understand the numeric content.
Instead:
while ( <$input> ) {
    print if (split)[-4] == 129;
}
Note - I use -4 in the split, which indexes from the end of the list.
This is because column 3 contains spaces, so splitting on whitespace is going to produce the wrong result unless we count down from the end of the array. Using a negative index we get the right field each time.
If your data is tab-separated then you could use chomp and split /\t/, or potentially split on /\s{2,}/ to split on two or more spaces.
But by selecting the field, you can do numeric tests on it, like
if $fields[-4] > 100 and $fields[-4] < 200
etc.
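For the range requirement in the question (match 2 through 10 but not 11, then 12 through 20), a sketch along the same lines might be:
while ( <$input> ) {
    # assumes the field is numeric on every line
    my $val = (split)[-4];
    print "$file $_"
        if ( $val >= 2 and $val <= 10 )
        or ( $val >= 12 and $val <= 20 );
}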
I hope you don't get the answers you're asking for, which discard best practice because of your unfamiliarity with Perl. It is inappropriate to ask how to write an ugly solution because proper Perl is beyond your reach
As has been said repeatedly on this site, if you don't know how to do a job then you should hire someone who does know and pay them for their work. No other profession that I know has the expectation of getting quality work done for free
Here are a few notes on your code. Wherever you learned your techniques, you have been looking at a very outdated resource
Do you really have a root directory perl, so that your compiler is /perl/bin/perl? That's very unusual, and there is no need to use a shebang line in Windows
You must always add use strict and use warnings 'all' at the top of every Perl program you write, and declare all of your variables using my as close as possible to their first point of use. For some reason you do this with @files but not with $file
It is better to replace <c:/perl64/files/*> with glob 'C:/perl64/files/*'. Otherwise the code is less clear because Perl overloads the <> operator
Don't put variable names inside double quotes. It is unnecessary at best, and may cause bugs. So "$file" should be $file
Always use the three-parameter version of open, so that the second parameter is the open mode
Don't use global file handles. And always test whether the file has been opened correctly, dying with a message that includes $! (the reason for the failure) if the open fails
open( FILE, "$file" )
should be something like
open my $fh, '<', $file or die qq{Unable to open "$file" for input: $!}
Don't rely on regex patterns for everything. In this case it looks like split would be a better option, or perhaps unpack if your records have fixed-width fields. In my solution below I have used split on "more than one space", but if your real data is different from what you have shown (tab-delimited?) then this is not going to work
Note that Fa0/129 will also be matched by your current approach
This Perl program filters your data, printing lines where the fourth field, $fields[3] (delimited by more than one whitespace character), is numerically equal to 129
The output shown is produced when the input is the single file splitn.txt, containing the data shown in your question
use strict;
use warnings 'all';

for my $file ( glob 'C:/perl64/files/*' ) {
    open my $fh, '<', $file or die qq{Unable to open "$file" for input: $!};
    while ( my $line = <$fh> ) {
        my @fields = split /\s\s+/, $line;
        print "$file $line" if $fields[3] == 129;
    }
}
output
splitn.txt Fa0/19 CUTExyz notconnect 129 half 100 10/100BaseTX
Your question is unclear. When you say:
What I need is to match the numbers in each line exactly
That could mean a couple of things. It could mean that each line contains nothing but a single number which you want to match. In that case, using == is probably better than using a regular expression. Or it could mean that you have lots of text on a line and you only want to match complete numbers. In that case you should use \b (the "word boundary" anchor) - /\b123\b/.
If you're clearer in your questions (perhaps by giving us sample input) then people won't have to guess at your meaning.
A few more points on your code:
Always include both use strict and use warnings.
Always check the return value from open() and take appropriate action on failure.
Use lexical filehandles and 3-arg version of open().
No need to quote $file in your open() call.
Using $_ can simplify your code.
/n on the match operator has no effect unless your regex contains parentheses.
Putting that all together (and assuming my second interpretation of your question is correct), your code could look like this:
#!/perl/bin/perl
use strict;
use warnings;

my @files = <c:/perl64/files/*>;

foreach my $file (@files) {
    open my $file_h, '<', $file
        or die "Can't open $file: $!";

    while (<$file_h>) {
        print "$file $_" if /\b123\b/;
    }

    # No need to close $file_h as it is closed
    # automatically when the variable goes out
    # of scope.
}

Splitting one txt file into multiple txt files based on delimiter and naming them with a specific character

I have a text file that looks like this: http://www.uniprot.org/uniprot/?sort=score&desc=&compress=no&query=id:P01375%20OR%20id:P04626%20OR%20id:P08238%20OR%20id:P06213&format=txt.
This file consists of different entries that are divided by //. I think I have almost found a way to divide the txt file into multiple txt files whenever this specific pattern appears, but I still don't know how to name them after dividing and how to print them to a specific directory. I would like each divided file to carry a specific ID, which is in the first line, second column of each entry.
This is the code that I have written so far:
mkdir "spliced_files"; # directory where I would like to put all my split files

$/ = "//\n"; # divide them whenever // appears, with a newline after it

open (IDS, 'example.txt') or die "Cannot open"; # example.txt is an input file
my @ids = <IDS>;
close IDS;

my $entry = 25444; # number of entries or //\n patterns

my $i = 0;
while ($i eq $entry) {
    print $ids[$i];
};
$i++;
I am still having problems finding how to split all entries from the 'example.txt' file on "//\n" and print all of these separated files into the directory spliced_files. In addition, I would have to name all of these separated files with the ID that is specific to each of these files or entries (which appears in the first row, second column).
So I expect the output to be a number of files in the spliced_files directory, each of them named with their ID (first row, second column). For example, the name of the first file would be TNFA_HUMAN, of the second ERBB2_HUMAN, and so on.
You still look like you're programming by guesswork. And you haven't made use of any of the advice you have been given in answers to your previous questions. I strongly recommend that you spend a week working through a good beginners book like Learning Perl and come back when you understand more about how Perl works.
But here are some comments on your new code:
open (IDS, 'example.txt') or die "Cannot open";
You have been told that using lexical variables and the three-arg version of open() is a better approach here. You should also include $! in your error message, so you know what has gone wrong.
open my $ids_fh, '<', 'example.txt'
or die "Cannot open: $!";
Then later on (I added the indentation in the while loop to make things clearer)...
my $i = 0;
while ($i eq $entry) {
    print $ids[$i];
};
$i++;
The first time you enter this loop, $i is 0 and $entry is 25444. You compare them (as strings! You probably want ==, not eq) to see if they are equal. Clearly they are different, so your while loop exits immediately. Only after the loop has exited do you increment $i.
This code bears no relation at all to the description of your problem. I'm not going to give you the answer, but here is the structure of what you need to do:
mkdir "spliced_files";
local $/ = "//\n"; # Always localise changes to special vars
open my $ids_fh, '<', 'example.txt'
or die "Cannot open: $!";
# No need to read the whole file in one go.
# Process it a line at a time.
while (<$ids_fh>) {
# Your record (the whole thing, not just the first line) is in $_.
# You need to extract the ID value from that string. Let's assume
# you've stored in it $id
# Open a file with the right name
open my $out_fh, '>', "spliced_files/$id" or die $!;
# Print the record to the new file.
print $out_fh $_;
}
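For the extraction step that the structure leaves open, a hypothetical sketch (assuming the UniProt flat-file format from the link, where each record starts with a line like "ID   TNFA_HUMAN   Reviewed; ..."):
# inside the while loop, with the record in $_
my ($id) = /^ID\s+(\S+)/m;
next unless defined $id;    # skip any trailing junk after the last //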
But really, you need to take the time to learn about programming before you attack this task. Or, if you don't have the time for that, pay a programmer to do it for you.

Convert GMT datetime string to utc epoch in 1000s of large csv files

I have 1000s of csv files that consist of millions of rows that have integers, floats, nullable integers, and 2 types of GMT datetime string formats. Below is an example of such a row in one of the files:
2/20/2016 3:25,3,,87,340.3456,5/18/2013,5/19/2014,4,6
I'm interested in the quickest way to convert (in place) both types of GMT datetime formatted strings into UTC epochs.
For example, the above row would be converted into:
1455938740,3,,87,340.3456,1368835200,1400457600,4,6
Suppose the files are isolated, so all can be gathered by *.csv
Is there a way I could do this with linux commands? If not, what would you suggest then?
Updated Answer
With thanks to @Borodin's insights, my best solution would now be like this:
perl -MTime::Local -plne '
s|(\d+)\/(\d+)\/(\d+) (\d+):(\d+)|timegm(0,$5,$4,$2,$1-1,$3)|ge ;
s|(\d+)\/(\d+)\/(\d+)|timegm(0,0,0,$2,$1-1,$3)|ge' file.csv
And if that can be debugged and found to work, I would incorporate it into GNU Parallel like this:
function doit(){
    tmp=temp_$$
    perl -MTime::Local -plne '
        s|(\d+)\/(\d+)\/(\d+) (\d+):(\d+)|timegm(0,$5,$4,$2,$1-1,$3)|ge;
        s|(\d+)\/(\d+)\/(\d+)|timegm(0,0,0,$2,$1-1,$3)|ge' "$1" > $tmp && mv $tmp "$1"
}
export -f doit
find . -name \*.csv -print0 | parallel -0 doit {}
Original Answer
I'm afraid I am going to give you a very powerful fishing rod (more of a harpoon) rather than a ready-cooked fish supper, but I think you'll be able to work it out quite easily.
First, if you use the Time::Local module in Perl, you can pass it the seconds, minutes, hours, days, months and year and it will tell you the corresponding Epoch seconds:
# So, for 02:10:01 AM on 1st June 2016 (timelocal months are zero-based), you can do
perl -MTime::Local -e 'print timelocal(1,10,2,1,5,2016)'
1464743401
Second, if you start Perl with -plne switches, it will effectively apply the code you supply to each and every line of the input file and print the result and sort out all line endings for you - somewhat akin to how awk loops over input files. So, if your file is called file.csv and looks like this:
2/20/2016 3:25,3,,87,340.3456,5/18/2013,5/19/2014,4,6
2/21/2013 3:25,3,,87,340.3456,4/20/2013,6/20/2015,4,6
and you run a null program, it will just echo the input file:
perl -MTime::Local -plne '' file.csv
2/20/2016 3:25,3,,87,340.3456,5/18/2013,5/19/2014,4,6
2/21/2013 3:25,3,,87,340.3456,4/20/2013,6/20/2015,4,6
If we now do a substitution and replace all commas with elephants:
perl -MTime::Local -plne 's/,/elephant/g' file.csv
2/20/2016 3:25elephant3elephantelephant87elephant340.3456elephant5/18/2013elephant5/19/2014elephant4elephant6
2/21/2013 3:25elephant3elephantelephant87elephant340.3456elephant4/20/2013elephant6/20/2015elephant4elephant6
That seems to work - now you can also do what I call a "computed replacement" - I don't know what real Perl-folk call it. Anyway, you use an e modifier flag after the replacement to execute that code and calculate the replacement text:
perl -MTime::Local -plne 's|(\d+)\/(\d+)\/(\d+)|timelocal(0,0,0,$2,$1,$3)|ge' file.csv
1458432000 3:25,3,,87,340.3456,1371510000,1403132400,4,6
1363824000 3:25,3,,87,340.3456,1369004400,1437346800,4,6
And - in case you missed it - that is the answer. The (\d+) is a regex for "one or more digits" and the fact it is in parentheses means it is captured. The first such group is captured as $1, the second as $2 and so on. So, I am basically looking for one or more digits that I save as $1, followed by a slash then 1 or more digits that I capture as $2 followed by a slash and 1 or more digits that I capture as $3. Then, in the replacement part, I use the captured groups to formulate a date. The g modifier means I do ALL occurrences on each line.
I'll leave you to add further capture groups for the 24-hour time and put that into the timelocal() call (and remember that timelocal() months are zero-based, hence the $1-1 in the updated answer above).
The capture groups I have given are a little loose too - you may want
\d{1,2}\/\d{1,2}\/\d{4}
or something to mean 1 or 2 digits for the day, 1 or 2 digits for the month and exactly 4 digits for the year. You can look that up!
When you get that working, if you have thousands of files, I would suggest you use GNU Parallel to do the files in parallel. Try looking at my other answers on here, or Ole Tange's as he wrote it, and you will see something like:
function doit(){
    perl -plne '...' $1 ...
}
export -f doit
find . -name \*.csv -print0 | parallel -0 doit {}
As regards doing it in place, I think you will need to use a technique like this inside the doit() function. Basically it writes a new file and then, only if the Perl part worked (&& does that bit), it overwrites the original file with the temporary one:
tmp=$(mktemp ...)
perl -plne '...' "$1" > $tmp && mv $tmp "$1"
I suggest you make a backup before you do anything else - there is a lot to go wrong here. Good luck!
P.S. If you edit the tags under your question and add perl, I guess some Perl guru will help you out and maybe put the finishing touches on my suggestions and enlighten me/us as to what the real name is for the e modifier that does a "computed replacement".
Update
As hinted by Mark Setchell the timegm function from Time::Local is likely to be faster than the string parsing that Time::Piece provides
Here's a rewrite of my original solution which uses that module. The output is identical to that of the original
use strict;
use warnings 'all';
use Time::Local 'timegm';
while ( <DATA> ) {
    chomp;
    my @fields = split /,/;

    for ( @fields ) {
        next unless m{/};
        my ($mn, $dy, $yr, $h, $m, $s) = (/\d+/g, 0, 0, 0);
        $_ = timegm($s, $m, $h, $dy, $mn-1, $yr);
    }

    print join(',', @fields), "\n";
}
__DATA__
2/20/2016 3:25,3,,87,340.3456,5/18/2013,5/19/2014,4,6
output
1455938700,3,,87,340.3456,1368835200,1400457600,4,6
Original post
The Time::Piece module is small and quite fast. Here's a sample program that transforms your sample data
The algorithm is a simple one. Any field that doesn't contain a slash / is left alone, otherwise it is assumed to be a date/time field if there is also a colon : or just a date field if not
use strict;
use warnings 'all';
use feature 'say';
use Time::Piece ();
while ( <DATA> ) {
    chomp;
    my @fields = split /,/;

    for ( @fields ) {
        next unless m{/};
        my $fmt = /:/ ? '%m/%d/%Y %H:%M' : '%m/%d/%Y';
        $_ = Time::Piece->strptime($_, $fmt)->epoch;
    }

    print join(',', @fields), "\n";
}
__DATA__
2/20/2016 3:25,3,,87,340.3456,5/18/2013,5/19/2014,4,6
output
1455938700,3,,87,340.3456,1368835200,1400457600,4,6
The first field 1455938700 differs from your own expected output 1455938740 by forty seconds. That's odd, as there is no seconds value in the original data, and 1455938700 is exactly divisible by 60 whereas 1455938740 is not. So I stand by my computation.

Perl: string formatting in tab delimited file

I have no background in programming whatsoever, so I would appreciate it if you would explain how and why any code you recommend should be written the way it is.
I have a data matrix with 2,000+ samples, and I need to manipulate the format of one column.
I would like to manipulate the format of one of the columns so that it is easier to merge with my other matrix. For example, one column is known as the sample number (column #16). The format is currently similar to ABCD-A1-A0SD-01A-11D-A10Y-09, yet I would like it to be formatted as ABCD-A1-A0SD-01A. This will put it in the right format so that I can merge it with another matrix. I have not been able to find any information on how to proceed with this step.
The sample input should look like this:
ABCD-A1-A0SD-01A-11D-A10Y-09
ABCD-A1-A0SD-01A-11D-A10Y-09
ABCD-A1-A0SE-01A-11D-A10Y-09
ABCD-A1-A0SE-01A-11D-A10Y-09
ABCD-A1-A0SF-01A-11D-A10Y-09
ABCD-A1-A0SH-01A-11D-A10Y-09
ABCD-A1-A0SI-01A-11D-A10Y-09
I want the last three extensions removed. The output sample should look like this:
ABCD-A1-A0SD-01A
ABCD-A1-A0SD-01A
ABCD-A1-A0SE-01A
ABCD-A1-A0SE-01A
ABCD-A1-A0SF-01A
ABCD-A1-A0SH-01A
ABCD-A1-A0SI-01A
Finally, the matrix that I want to merge with has a different layout; in other words, the numbers of columns and rows are different. This is an issue when I tackle the next step, which is merging the two matrices together. The original matrix has about 52 columns and 2,000+ rows, whereas the merging matrix only has 15 columns and 467 rows.
Each row of the original matrix has mutational information for a patient. This means that the same patient with the same ID might appear many times. The second matrix contains the patient information, so no patients are repeated in that matrix. When merging the matrices, I want to make sure that every patient mutation (each row) is matched with its corresponding information from the merging matrix.
My sample code:
#!/usr/bin/perl
use strict;
use warnings;

my $file = 'sorted_samples_2.txt';
open(INFILE, $file) or die "Can't open file: $!\n";
open(my $outfile, '>', 'sorted_samples_changed.txt');

foreach my $line (<INFILE>) {
    print "The input line is $line\n";
    my @columns = split('\t', $line);
    ($columns[15]) = $columns[15] =~ /:((\w\w\w\w-\w\d-\w|\w\w-\d\d\w)+)$/;
    printf $outfile "@columns/n";
}
Issues: the code deletes the header and deletes the string in column 16.
A few issues about your code:
Good job on including use strict; and use warnings;. Keep doing that.
Anytime you're doing file or directory processing, include use autodie; as well.
Always use lexical file handles like $infh instead of globs like INFILE.
Use the 3 parameter form of open.
Always process a file line by line using a while loop. Using a for loop loads the entire file into memory
Don't forget to chomp your input from a file.
Use the line number variable $. if you want special logic for your header
The first parameter of split is a pattern, so use /\t/. The only exception to this is ' ', which has special meaning. Currently you're introducing a bug by using a single-quoted string.
When altering a value with a regex, try to focus on what you DO want instead of what you DON'T. In this case it looks like you want 4 groups separated by dashes, and then truncate the rest. Focus on matching those groups.
Don't use printf when you mean print.
The following applies these fixes to your script:
#!/usr/bin/perl
use strict;
use warnings;
use autodie;

my $infile  = 'sorted_samples_2.txt';
my $outfile = 'sorted_samples_changed.txt';

open my $infh,  '<', $infile;
open my $outfh, '>', $outfile;

while (my $line = <$infh>) {
    chomp $line;
    my @columns = split /\t/, $line;
    if ($. > 1) {
        $columns[15] =~ s/^(\w{4}-\w\d-\w{4}-\w{3}).*/$1/
            or warn "Unable to fix column at line $.";
    }
    print $outfh join("\t", @columns), "\n";
}
You need to define the scope of your variables with 'my' in the declaration itself when you use 'use strict'.
In your case, you should use my @sort = sort {....} in the first line, and
you should have an array reference $t defined somewhere to dereference it in the second line. You don't have @array declared anywhere in this code, which is the reason you got all those errors. Make sure you understand what you are doing before you do it.

get last few lines of file stored in variable

How could I get the last few lines of a file whose content is stored in a variable? On Linux I would use the tail command if it were in a file.
1) How can I do this in perl if the data is in a file?
2) How can I do this if the content of the file is in a variable?
To read the end of a file, seek near the end of the file and begin reading. For example,
open my $fh, '<', $file;
seek $fh, -1000, 2;    # whence value 2 means SEEK_END: start 1000 bytes before the end
my @lines = <$fh>;
close $fh;
print "Last 5 lines of $file are: ", @lines[-5 .. -1];
Depending on what is in the file or how many lines you want to look at, you may want to use a different magic number than -1000 above.
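If you would rather not pick a magic number at all, one approach (a sketch) is to grow the window until enough lines turn up or the whole file has been read:
my $want = 5;
my $size = 1024;
my @lines;
while (1) {
    seek $fh, -$size, 2 or seek $fh, 0, 0;   # fall back to the start for small files
    @lines = <$fh>;
    last if @lines > $want or $size >= -s $fh;
    $size *= 2;
}
# the first element may be a partial line, but the last $want are complete
print @lines[-$want .. -1] if @lines >= $want;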
You could do something similar with a variable, either
open my $fh, '<', \$the_variable;
seek $fh, -1000, 2;
or just
open my $fh, '<', \substr($the_variable, -1000);
will give you an I/O handle that produces the last 1000 characters in $the_variable.
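Putting the two together for the in-variable case (a sketch mirroring the file example above):
open my $fh, '<', \$the_variable or die $!;
seek $fh, -1000, 2;    # seek works on in-memory handles too
my @lines = <$fh>;
print "Last 5 lines are: ", @lines[-5 .. -1];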
The File::ReadBackwards module on the CPAN is probably what you want. You can use it thus. This will print the last three lines in the file:
use File::ReadBackwards;

my $bw = File::ReadBackwards->new("some_file");
print reverse map { $bw->readline() } (1 .. 3);
Internally, it seek()s to near the end of the file and looks for line endings, so it should be fairly efficient with memory, even with very big files.
To some extent, that depends how big the file is, and how many lines you want. If it is going to be very big you need to be careful, because reading it all into memory will take a lot longer than just reading the last part of the file.
If it is small, the easiest way is probably to File::Slurp it into memory, split it on record delimiters, and keep the last n records. In effect, something like:
# first, if the content is not yet in a string
my $string = File::Slurp::read_file($filename);

my @lines = split(/\n/, $string);
print join("\n", @lines[-10 .. -1]);
If it is large, too large to fit into memory, you might be better off using file system operations directly. When I did this, I opened the file, used seek() to read the last 4k or so, and repeated backwards until I had enough data for the number of records I needed.
Not a detailed answer, but the question could be a touch more specific.
I know this is an old question, but I found it while looking for a way to search for a pattern in the first and last k lines of a file.
For the tail part, in addition to seek (if the file is seekable), it saves some memory to use a rotating buffer, as follows (wrapped as a sub here; it returns the last $k lines, or fewer if less than $k are available):
sub last_k_lines {
    my ($fh, $k) = @_;
    my $i = 0;
    my @a;
    while (<$fh>) {
        $a[$i++ % $k] = $_;    # overwrite the oldest slot
    }
    # rotate the buffer so the lines come back in file order
    my @tail = splice @a, 0, $i % $k;
    splice @a, @a, 0, @tail;
    return @a;
}
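Called, for example, like this (a sketch; the sub name last_k_lines is introduced here for illustration):
open my $fh, '<', 'some_file' or die $!;
print last_k_lines($fh, 3);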
A lot has already been said on the file side, but if the content is already in a string, you can use the following regex:
my ($lines) = $str =~ /
    (
        (?:
            (?:(?<=^)|(?<=\n)) # match beginning of line (two lookbehinds because variable-length lookbehind is not supported)
            [^\n]*+            # match the line
            (?:\n|$)           # match the end of the line
        ){0,5}+                # match at least 0 and at most 5 lines
    )$                         # match must be at the end of the string
/sx;                           # s = treat string as single line
                               # x = allow whitespace and comments
This runs extremely fast. Benchmarking shows it between 40% and 90% faster than the split/join method (varying with the current load on the machine). This is presumably due to fewer memory manipulations. It is something you might want to consider if speed is essential. Otherwise, it's just interesting.
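For reference, a comparison along those lines can be run with the core Benchmark module. This is only a sketch: the sample $str is made up here, and the 40-90% figures above are the original poster's, not reproduced measurements:
use Benchmark 'cmpthese';

my $str = join '', map { "line $_\n" } 1 .. 10_000;

cmpthese( -2, {
    regex => sub {
        my ($tail) = $str =~ /((?:(?:(?<=^)|(?<=\n))[^\n]*+(?:\n|$)){0,5}+)$/sx;
    },
    split_join => sub {
        my $tail = join "\n", ( split /\n/, $str )[ -5 .. -1 ];
    },
} );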