File Handling in Perl: Reading a file from a specific line

File Handling in Perl: Reading a file from a specific line - perl

Is there no way in perl to start reading a file from a specific line number. Whenever we read a file in perl we do while(<$fileHandler>) which makes the perl interpreter to read the file from beginning. What to do if I don't want to read these lines?

You can skip lines from the beginning, and start processing with $start_line,
my $start_line = 10;
while(<$fileHandler>) {
next unless $. == $start_line .. undef;
# ..
}
The range operator .. also provides the following shorthand:
If either operand of scalar ".." is a constant expression, that operand is considered true if it is equal (==) to the current input line number (the $. variable).
Therefore the above can be reduced to:
while(<$fileHandler>) {
next unless 10 .. undef;
# ..
}

Because you can't know how long a line will be you must read from the beginning of the file.
If you know how wide your line will be because you have a fixed line width, or some other scheme then you can seek to that position in your file. Otherwise you have to read every character and search for the 'special' new line characters.
A text file is just a long list of characters. There is nothing special about lines.

Related

Perl - Detect from command line if file has only a specified character

This was the original question.
Using perl, how can I detect from the command line if a specified file contains only a specified character(s), like '0' for example?
I tried
perl -ne 'print if s/(^0*$)/yes/' filename
But it cannot detect all conditions, for example multiple lines, non-zero lines.
Sample input -
A file containing only zeros -
0000000000000000000000000000000000000000000000000000000000000
output - "yes"
Empty file
output - "no"
File containing zeros but has newline
000000000000000000
000000000000
output - "no"
File containing mixture
0324234-234-324000324200000
output - "no"

-0777 causes $/ to be set to undef, causing the whole file to be read when you read a line, so
perl -0777ne'print /^0+$/ ? "yes" : "no"' file
or
perl -0777nE'say /^0+$/ ? "yes" : "no"' file # 5.10+
Use \z instead of $ if want to make sure there's no trailing newline. (A text file should have a trailing newline.)

To print yes if a file contains at least one 0 character and nothing else, and otherwise no, write
perl -0777 -ne 'print /\A0+\z/ ? "yes" : "no"' myfile

I suspect you want a more generic solution than just detecting zeroes, but I haven't got time to write it for you till tomorrow. Anyway, here is what I think you need to do:
1. Slurp your entire file into a single string "s" and get its length (call it "L")
2. Get the first character of the string, using substr(s,0,1)
3. Create a second string that repeats the first character "L" times, using firstchar x L
4. Check the second string is equal to the slurped file
5. Print "No" if not equal else print "Yes"
If your file is big and you don't want to hold two copies in memory, just test character by character using substr(). If you want to ignore newlines and carriage returns, just use "tr" to delete them from "s" before step 2.

Decipher this sed one-liner

I want to remove duplicate lines from a file, without sorting the file.
Example of why this is useful to me: removing duplicates from Bash's $HISTFILE without changing the chronological order.
This page has a one-liner to do that:
http://sed.sourceforge.net/sed1line.txt
Here's the one-liner:
sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P'
I asked a sysadmin and he told me "you just copy the script and it works, don't go philosophising about this", which is fine, so I am asking here as it's a developer forum and I trust people might be like me, suspicious about using things they don't understand:
Could you kindly provide a pseudo-code explanation of what that "black magic" script is doing, please? I tried parsing the incantation in my head but especially the central part is quite hard.

I'll note that this script does not appear to work with my copy of sed (GNU sed 4.1.5) in my current locale. If I run it with LC_ALL=C it works fine.
Here's an annotated version of the script. sed basically has two registers, one is called "pattern space" and is used for (basically) the current input line, and the other, the "hold space", can be used by scripts for temporary storage etc.
sed -n ' # -n: by default, do not print
G # Append hold space to current input line
s/\n/&&/ # Add empty line after current input line
/^\([ -~]*\n\).*\n\1/d # If the current input line is repeated in the hold space, skip this line
# Otherwise, clean up for storing all input in hold space:
s/\n// # Remove empty line after current input line
h # Copy entire pattern space back to hold space
P # Print current input line'
I guess the adding and removal of an empty line is there so that the central pattern can be kept relatively simple (you can count on there being a newline after the current line and before the beginning of the matching line).
So basically, the entire input file (sans duplicates) is kept (in reverse order) in the hold space, and if the first line of the pattern space (the current input line) is found anywhere in the rest of the pattern space (which was copied from the hold space when the script started processing this line), we skip it and start over.
The regex in the conditional can be further decomposed;
^ # Look at beginning of line (i.e. beginning of pattern space)
\( # This starts group \1
[ -~] # Any printable character (in the C locale)
* # Any number of times
\n # Followed by a newline
\) # End of group \1 -- it contains the current input line
.*\n # Skip any amount of lines as necessary
\1 # Another occurrence of the current input line, with newline and all
If this pattern matches, the script discards the pattern space and starts over with the next input line (d).
You can get it to work independently of locale by changing [ -~] to [[:print:]]

The code doesn't work for me, perhaps due to some locale setting, but this does:
vvv
sed -n 'G; s/\n/&&/; /^\([^\n]*\n\).*\n\1/d; s/\n//; h; P'
^^^
Let's first translate this by the book (i.e. sed info page), into something perlish.
# The standard sed loop
my $hold = "";
while ($my pattern = <>) {
chomp $pattern;
$pattern = "$pattern\n$hold"; # G
$pattern =~ s/(\n)/$1$1/; # s/\n/&&/
if ($pattern =~ /^([^\n]*\n).*\n\1/) { # /…/
next; # d
}
$pattern =~ s/\n//; # s/\n//
$hold = $pattern; # h
$pattern =~ /^([^\n]*\n?)/; print $1; # P
}
OK, the basic idea is that the hold space contains all the lines seen so far.
G: At the beginning of each cycle, append that hold space to the current line. Now we have a single string consisting of the current line and all unique lines which preceeded it.
s/\n/&&/: Turn the newline which separates them into a double newline, so that we can match subsequent and non-subsequent duplicates the same, see the next step.
^\([^\n]*\n\).*\n\1/: Look through the current text for the following: at the beginning of all the lines (^) look for a first line including trailing newline (\([^\n]*\n\)), then anything (.*), then a newline (\n), and then that same first line including newline repeated again (\1). If two subsequent lines are the same, then the .* in the regular expression will match the empty string, but the two \n will still match due to the newline duplication in the preceding step. So basically this asks whether the first line appears again among the other lines.
d: If there is a match, this is a duplicate line. We discard this input, keep the hold space as it is as a buffer of all unique lines seen so far, and continue with the next line of input.
s/\n//: Otherwise, we continue and next turn the double newline back into a single newline.
h: We include the current line in our list of all unique lines.
P: And finally print this new unique line, up to the newline character.

For the actual problem to resolve, here is a simpler solution (at least it looks so) with awk:
awk '!_[$0]++' FILE
In short _[$0] is a counter (of appearance) for each unique line, for any line ($0) appearing for the second time _[$0] >= 1, thus !_[$0] evaluates to false, causing it not to be printed except its first time appearance.
See https://gist.github.com/ryenus/5866268 (credit goes to a recent forum I visited.)

Chopping lines in file for equal length but with new identifiers

I have a file, whose every 2nd line is of unequal length. I want to make these lines equal(every 2nd line of output should be equal to 10 characters) but with new identifier (every odd line).
FILE ->
>ZQMK36301EDYQE
ZHZHHEXZZHHZZHHZZXHHHEHHHZZZHHHZHXZHZ
>ZQMK36301EEMJ9
ZZZXHZHHXHHHEZZEEZZHZZZZXEZ
>ZQMK36301EOEM5
ZXHXHZZHEHHHXZEZHXXXHXHHHHXEHHHZHHHH
desired output ->
>ZQMK36301EDYQE
ZHZHHEXZZH
>ZQMK36301EDYQE#2
HZZHHZZXHH
>ZQMK36301EDYQE#3
HEHHHZZZHH
>ZQMK36301EEMJ9
ZZZXHZHHXH
>ZQMK36301EEMJ9#2
HHEZZEEZZH
>ZQMK36301EOEM5
ZXHXHZZHEH
>ZQMK36301EOEM5#2
HHXZEZHXXX
>ZQMK36301EOEM5#3
HXHHHHXEHH
Here if we take the first line which is identifier (>ZQMK36301EDYQE) and in its 2nd line it contains 37 characters. Now it will make 3 sequences of equal length (i:e 10) and if remaining characters are less than 10, we will throw that part. Now each new line of equal length has an identifier which is same as from which the part of sequence it came but followed by "#" and the number. I want to do this for whole file. Please help.
Thanks and Best regards,
Vikas

As a one-liner:
perl -nwle '
$i=0;
for my $add (<>=~/.{10}/g) {
printf "%s%s\n%s\n", $_, $i++ ? "#$i":"", $add;
}' inputfile
-n read file line-by-line and store line in $_. -l autochomps the input. We assume first line is header, and second is data. $i is the counter, so it is reset for each new line pair. The for loop list is made on the fly by reading one line <>, then extracting 10-character long strings from it with a regex. Then we just print the stuff, and make sure not to show the zero counter.

Why am I only getting one match?

I am parsing a log file trying to pull out the lines where the phrase "failures=" is a unique non-zero digit.
The first part of my perl one liner will pull out all the lines where "failures" are greater than zero. But that part of the log file repeats until a new failure occurs, i.e., after the first failure the log entries will be "failures=1" until the second error then it will read, "failures=2".
What I'd like to do is pull only the first line where that value changes and I thought I had it with this:
cat -n Logstats.out | perl -nle 'print "Line No. $1: failures=$2; eventDelta=$3; tracking_id=$4" if /\s(\d+)\t.*failures=(\d+).*eventDelta=(.\d+).*tracking_id="(\d+)"/ && $2 > 0' | perl -ne 'print unless $a{/failures=\d+/}++'
However, that only pulls the first non-zero "failure" line and nothing else. What do I need to change for it to pull all the unique values for "failures"?
thanks in advance!
Update: The amount of text in each line up to the "tracking_id" is more text than I can post. Sorry!
2011-09-06 14:14:18 [INFO] [logStats]: name=somename id=d6e6f4f0-4c0d-93b6-7a71-8e3100000030
successes=1 failures=0 skipped=0 eventDelta=41 original=188 simulated=229
totalDelta=41 averageDelta=41 min=0 max=41 averageOriginalDuration=188 averageSimulatedDuriation=229(txid = b3036eca-6288-48ef-166f-5ae200000646
date = 2011-09-02 08:00:00 type = XML xml
=

perl -ne 'print unless $a{/failures=\d+/}++'
does not work because a hash subscript is evaluated in scalar context, and the m// operator does not return the match in scalar context. Instead, it returns 1. So (since every line matches), what you wrote is equivalent to:
perl -ne 'print unless $a{1}++'
and I think you can see the problem there.
There's a number of ways to work around that, but I'd use:
perl -ne 'print unless /(failures=\d+)/ and $a{$1}++'
However, I'd do the whole thing in one call to perl, including numbering the lines:
perl -nle '
print "Line No. $.: failures=$1; eventDelta=$2; tracking_id=$3"
if /failures=(\d+).*?eventDelta=(.\d+).*?tracking_id="(\d+)"/
&& $1 > 0
&& !$seen{$1}++
' Logstats.out
($. automatically counts input lines in Perl. The line breaks can be removed if desired, but it will actually work as is.)

you could use a hash to store te results and print it:
perl -nle '$f{$2} ||= "Line No. $1: failures=$2; eventDelta=$3; tracking_id=$4" if /\s(\d+)\t.*failures=(\d+).*eventDelta=(.\d+ ).*tracking_id="(\d+)"/ && $2;END{ print $f{$_} for keys %f }' Logstats.out
(not tested due to missing input data...)
HTH,
Paul

Since your input does not match your regex, I can't really help you. But I can tell you that this is doing a lot of backtracking--and that's bad if there is a lot of data after the part that you're interested in.
So here is some alternative ideas:
qr{ \s # a single space
failures=(\d+) # the entry for failures
\s+ # at least one space
skipped=\d+ # skipped
\s+
eventDelta=(.\d+)
.*? # any number of non-newline characters *UNTIL* ...
\btracking_id="(\d+)" # another specified sequence of data
}x;
The parser will scan "skipped=" and then a group of digits a lot faster than scanning the rest of the line and backtracking when it fails back to 'eventDelta=', it is better to put it in, if you know it will always be there.
Since you don't put tracking_id in your example, I can't tell how it occurs, so in this case we used a non-greedy any match which will always be looking for the next sequence. Again, if there is a lot of data in the line, then you do not want to scan to then end and backtrack until you find that you've already read 'tracking_id="nnn"'. However, lookaheads cost processing time, it is still better to spell out 'skipped=' and all possible values then a non-greedy "any match".
You'll also notice that after accepting any data, I specify that tracking_id should appear at a word boundary, which disambiguates it from the possible--though not likely 'backtracking_id='.

Reading a large file in perl, record by record, with a dynamic record separator

I have a script that reads a large file line by line. The record separator ($/) that I would like to use is (\n). The only problem is that the data on each line contains CRLF characters (\r\n), which the program should not be considered the end of a line.
For example, here is a sample data file (with the newlines and CRLFs written out):
line1contents\n
line2contents\n
line3\r\ncontents\n
line4contents\n
If I set $/ = "\n", then it splits the third line into two lines. Ideally, I could just set $/ to a regex that matches \n and not \r\n, but I don't think that's possible. Another possibility is to read in the whole file, then use the split function to split on said regex. The only problem is that the file is too large to load into memory.
Any suggestions?

For this particular task, it sounds pretty straightforward to check your line ending and append the next line as necessary:
$/ = "\n";
...
while(<$input>) {
while( substr($_,-2) eq "\r\n" ) {
$_ .= <$input>;
}
...
}
This is the same logic used to support line continuation in a number of different programming contexts.
You are right that you can't set $/ to a regular expression.

dos2unix would put a UNIX newline character in for the "\r\n" and so wouldn't really solve the problem. I would use a regex that replaces all instances of "\r\n" with a space or tab character and save the results to a different file (since you don't want to split the line at those points). Then I would run your script on the new file.

Try using dos2unix on the file first, and then read in as normal.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

File Handling in Perl: Reading a file from a specific line - perl

Is there no way in perl to start reading a file from a specific line number. Whenever we read a file in perl we do while(<$fileHandler>) which makes the perl interpreter to read the file from beginning. What to do if I don't want to read these lines?

Related

Perl - Detect from command line if file has only a specified character

Decipher this sed one-liner

Chopping lines in file for equal length but with new identifiers

Why am I only getting one match?

Reading a large file in perl, record by record, with a dynamic record separator

Categories

Resources