Efficient way to read columns in a file using Perl

Efficient way to read columns in a file using Perl - perl

I have an input file like so, separated by newline characters.
AAA
BBB
BBA
What would be the most efficient way to count the columns (vertically), first with first, second with second etc etc.
Sample OUTPUT:
ABB
ABB
ABA
I have been using the following, but am unable to figure out how to remove the scalar context from it. Any hints are appreciated:
while (<#seq_prot>){
chomp;
my #sequence = map substr (#seq_prot, 1, 1), $start .. $end;
#sequence = split;
}
My idea was to use the substring to get the first letter of the input (A in this case), and it would cycle for all the other letters (The second A and B). Then I would increment the cycle number + 1 so as to get the next line, until I reached the end. Of course I can't seem to get the first part going, so any help is greatly appreciated, am stumped on this one.

Basically, you're trying to transpose an array.
This can be done easily using Array::Transpose
use warnings;
use strict;
use Array::Transpose;
die "Usage: $0 filename\n" if #ARGV != 1;
for (transpose([map {chomp; [split //]} <>])) {
print join("", map {$_ // " "} #$_), "\n"
}
For an input file:
ABCDEFGHIJKLMNOPQRS
12345678901234
abcdefghijklmnopq
ZYX
Will output:
A1aZ
B2bY
C3cX
D4d
E5e
F6f
G7g
H8h
I9i
J0j
K1k
L2l
M3m
N4n
O o
P p
Q q
R
S

You'll have to read in the file once for each column, or store the information and go through the data structure later.
I was originally thinking in terms of arrays of arrays, but I don't want to get into References.
I'm going to make the assumption that each line is the same length. Makes it simpler that way. We can use split to split your line into individual letters:
my = $line = "ABC"
my #split_line = split //, $line;
This will give us:
$split_line[0] = "A";
$split_line[1] = "B";
$split_line[2] = "C";
What if we now took each letter, and placed it into a #vertical_array.
my #vertical_array;
for my $index ( 0..##split_line ) {
$vertical_array[$index] .= "$split_line[$index];
}
Now let's do this with the next line:
$line = "123";
#split_line = split //, $line;
for my $index ( 0..##split_line ) {
$vertical_array[$index] .= "$split_line[$index];
}
This will give us:
$vertical_array[0] = "A1";
$vertical_array[1] = "B2";
$vertical_array[2] = "C3";
As you can see, I'm building the $vertical_array with each interation:
use strict;
use warnings;
use autodie;
use feature qw(say);
my #vertical_array;
while ( my $line = <DATA> ) {
chomp $line;
my #split_line = split //, $line;
for my $index ( 0..$#split_line ) {
$vertical_array[$index] .= $split_line[$index];
}
}
#
# Print out your vertical lines
#
for my $line ( #vertical_array ) {
say $line;
}
__DATA__
ABC
123
XYZ
BOY
FOO
BAR
This prints out:
A1XBFB
B2YOOA
C3ZYOR
If I had used references, I could probably have built an array of arrays and then flipped it. That's probably more efficient, but more complex. However, that may be better at handling lines of different lengths.

Related

Sorting 5th column in descending order error message

The text file I am trying to sort:
MYNETAPP01-NY
700000123456
Filesystem total used avail capacity Mounted on
/vol/vfiler_PROD1_SF_NFS15K01/ 1638GB 735GB 903GB 45% /vol/vfiler_PROD1_SF_NFS15K01/
/vol/vfiler_PROD1_SF_NFS15K01/.snapshot 409GB 105GB 303GB 26% /vol/vfiler_PROD1_SF_NFS15K01/.snapshot
/vol/vfiler_PROD1_SF_isci_15K01/ 2048GB 1653GB 394GB 81% /vol/vfiler_PROD1_SF_isci_15K01/
snap reserve 0TB 0TB 0TB ---% /vol/vfiler_PROD1_SF_isci_15K01/..
I am trying to sort this text file by its 5th column (the capacity field) in descending order.
When I first started this there was a percentage symbol mixed with the numbers. I solved this by substituting the the value like so: s/%/ %/g for #data;. This made it easier to sort the numbers alone. Afterwards I will change it back to the way it was with s/ %/%/g.
After running the script, I received this error:
#ACI-CM-L-53:~$ ./netapp.pl
Can't use string ("/vol/vfiler_PROD1_SF_isci_15K01/"...) as an ARRAY ref while "strict refs" in use at ./netapp.pl line 20, line 24 (#1)
(F) You've told Perl to dereference a string, something which
use strict blocks to prevent it happening accidentally. See
"Symbolic references" in perlref. This can be triggered by an # or $
in a double-quoted string immediately before interpolating a variable,
for example in "user #$twitter_id", which says to treat the contents
of $twitter_id as an array reference; use a \ to have a literal #
symbol followed by the contents of $twitter_id: "user \#$twitter_id".
Uncaught exception from user code:
Can't use string ("/vol/vfiler_PROD1_SF_isci_15K01/"...) as an ARRAY ref while "strict refs" in use at ./netapp.pl line 20, <$DATA> line 24.
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
open (my $DATA, "<raw_info.txt") or die "$!";
my $systemName = <$DATA>;
my $systemSN = <$DATA>;
my $header = <$DATA>;
my #data;
while ( <$DATA> ) {
#data = (<$DATA>);
}
s/%/ %/g for #data;
s/---/000/ for #data;
print #data;
my #sorted = sort { $b->[5] <=> $a->[5] } #data;
print #sorted;
close($DATA);

Here is an approach using Text::Table which will nicely align your output into neat columns.
#!/usr/bin/perl
use strict;
use warnings;
use Text::Table;
open my $DATA, '<', 'file1' or die $!;
<$DATA> for 1 .. 2; # throw away first two lines
chomp(my $hdr = <$DATA>); # header
my $tbl = Text::Table->new( split ' ', $hdr, 6 );
$tbl->load( map [split /\s{2,}/], sort by_percent <$DATA> );
print $tbl;
sub by_percent {
my $keya = $a =~ /(\d+)%/ ? $1 : '0';
my $keyb = $b =~ /(\d+)%/ ? $1 : '0';
$keyb <=> $keya
}
The output generated is:
Filesystem total used avail capacity Mounted on
/vol/vfiler_PROD1_SF_isci_15K01/ 2048GB 1653GB 394GB 81% /vol/vfiler_PROD1_SF_isci_15K01/
/vol/vfiler_PROD1_SF_NFS15K01/ 1638GB 735GB 903GB 45% /vol/vfiler_PROD1_SF_NFS15K01/
/vol/vfiler_PROD1_SF_NFS15K01/.snapshot 409GB 105GB 303GB 26% /vol/vfiler_PROD1_SF_NFS15K01/.snapshot
snap reserve 0TB 0TB 0TB ---% /vol/vfiler_PROD1_SF_isci_15K01/..
Update
To explain some of the advanced parts of the program.
my $tbl = Text::Table->new( split ' ', $hdr, 6 );
This creates the Text::Table object with the header split into 6 columns. Without the limit of 6 columns, it would have created 7 columns (because the last field, 'mounted on', also contains a space. It would have been incorrectly split into 2 columns for a total of 7).
$tbl->load( map [split /\s{2,}/], sort by_percent <$DATA> );
The statement above 'loads' the data into the table. The map applies a transformation to each line from <$DATA>. Each line is split into an anonymous array, (created by [....]). The split is on 2 or more spaces, \s{2,}. If that wasn't specified, then the data `snap reserve' with 1 space would have been incorrectly split.
I hope this makes whats going on more clear.
And a simpler example that doesn't align the columns like Text::Table, but leaves them in the form they originally were read might be:
open my $DATA, '<', 'file1' or die $!;
<$DATA> for 1 .. 2; # throw away first two lines
my $hdr = <$DATA>; # header
print $hdr;
print sort by_percent <$DATA>;
sub by_percent {
my $keya = $a =~ /(\d+)%/ ? $1 : '0';
my $keyb = $b =~ /(\d+)%/ ? $1 : '0';
$keyb <=> $keya
}

In addition to skipping the fourth line of the file, this line is wrong
my #sorted = sort { $b->[5] <=> $a->[5] } #data
But presumably you knew that as the error message says
at ./netapp.pl line 20
$a and $b are lines of text from the array #data, but you're treating them as array references. It looks like you need to extract the fifth "field" from both variables before you compare them, but no one can tell you how to do that

You code is quite far from what you want. Trying to change it as little as possible, this works:
#!/usr/bin/perl
use strict;
use warnings;
open (my $fh, "<", "raw_info.txt") or die "$!";
my $systemName = <$fh>;
my $systemSN = <$fh>;
my $header = <$fh>;
my #data;
while( my $d = <$fh> ) {
chomp $d;
my #fields = split '\s{2,}', $d;
if( scalar #fields > 4 ) {
$fields[4] = $fields[4] =~ /(\d+)/ ? $1 : 0;
push #data, [ #fields ];
}
}
foreach my $i ( #data ) {
print join("\t", #$i), "\n";
}
my #sorted = sort { $b->[4] <=> $a->[4] } #data;
foreach my $i ( #sorted ) {
$i->[4] .= '%';
print join("\t", #$i), "\n";
}
close($fh);
Let´s make a few things clear:
If using the $ notation, it is customary to define file variables in lower case as $fd. It is also typical to name the file descriptor as "fd".
You define but not use the first three variables. If you don´t apply chomp to them, the final CR will be added to them. I have not done it as they are not used.
You are defining a list with a line in each element. But then you need a list ref inside to separate the fields.
The separation is done using split.
Empty lines are skipped by counting the number of fields.
I use something more compact to get rid of the % and transform the --- into a 0.
Lines are added to list #data using push and turning the list to add into a list ref with [ #list ].
A list of list refs needs two loops to get printed. One traverses the list (foreach), another (implicit in join) the columns.
Now you can sort the list and print it out in the same way. By the way, Perl lists (or arrays) start at index 0, so the 5th column is 4.
This is not the way I would have coded it, but I hope it is clear to you as it is close to your original code.

grouping lines and concatenating them into one line in perl

This seems like a basic thing to do, but I can't figure out a simple way doing it without starting building lots of arrays etc. so my apologies if this is too simple.
I have a file of this format:
a,x1
a,x2
a,x3
b,x4
c,x5
c,x6
this is an edge list for a very big graph.
I need to convert it to the following format:
a,x1 x2 x3
b,x4
c,x5 x6
(this is another common format of graphs)
Is there a simple way of doing that in perl? you can assume that all the "a" and "b" are sorted, so once you got to a new starting node (say "b") there will be no going back (e.g. no more edges outgoing from "a")
Any advice would be appreciated.

Just keep the last "from" node in a variable that survives iterations of the loop.
#!/usr/bin/perl
use warnings;
use strict;
my $last = q();
while (<>) {
chomp;
my ($from, $to) = split /,/;
if ($from ne $last) {
print "\n" x (1 != $.), $from, ',';
$last = $from;
} else {
print ' ';
}
print $to;
}
print "\n";
"\n" x (1 != $.) prevents the newline being printed before the 1st line.
The same as a one-liner:
perl -aF, -ne 'chomp $F[1]; print "\n" x (1 != $.), "$F[0]," if $l ne $F[0];
print " " x ($l eq $F[0]), $F[1]; $l = $F[0] }{ print "\n"' < input

I'd do this in two stages. One to get the data into a useful data structure and another to print the data.
The data structure I have chosen is a hash of arrays. The keys of the hash are your 'a', 'b' and 'c' and the values are references to arrays containing the 'x1, 'x2', etc.
The code looks like this:
#!/usr/bin/perl
use strict;
use warnings;
# We use modern Perl (specifically say())`
use 5.010;
my %edges;
while (<DATA>) {
chomp;
my ($key, $val) = split /,/;
push #{$edges{$key}}, $val;
}
for (sort keys %edges) {
say join ',', $_, #{$edges{$_}};
}
__DATA__
a,x1
a,x2
a,x3
b,x4
c,x5
c,x6

Need to extract value from a file and subtract from variable in Perl

FILE:
1,2015-08-20,00:00:00,89,1007.48,295.551,296.66,
2,2015-08-20,03:00:00,85,1006.49,295.947,296.99,
3,2015-08-20,06:00:00,86,1006.05,295.05,296.02,
4,2015-08-20,09:00:00,85,1005.87,296.026,296.93,
5,2015-08-20,12:00:00,77,1004.96,298.034,298.87
code:
use IPC::System::Simple qw( capture capturex );
use POSIX;
my $tb1_file = '/var/egridmanage_pl/daily_pl/egrid-csv/test.csv';
open my $fh1, '<', $tb1_file or die qq{Unable to open "$tb1_file" for input: $!};
my #t1_temp_12 = map {
chomp;
my #t1_ft_12 = split /,/;
sprintf "%.0f", $t1_ft_12[6] if $t1_ft_12[2] eq '12:00:00';
} <$fh1>;
print "TEMP #t1_temp_12\n";
my $result = #t1_temp_12 - 273.14;
print "$result should equal something closer to 24 ";
$result value prints out -265.14 making me think the #t1_temp_12 is hashed
So I tried to do awk
my $12temp = capture("awk -F"," '$3 == "12:00:00" {print $7 - 273-.15}' test.csv");
I've tried using ``, qx, open, system all having the same error result using the awk command
But this errors out. When executing awk at command line i get the favoured results.

This looks like there's some cargo cult programming going on here. It looks like all you're trying to do is find the line for 12:00:00 and print the temperature in degrees C rather than K.
Which can be done like this:
#!/usr/bin/perl
use strict;
use warnings;
while (<DATA>) {
my #fields = split /,/;
print $fields[6] - 273.15 if $fields[2] eq "12:00:00";
}
__DATA__
1,2015-08-20,00:00:00,89,1007.48,295.551,296.66,
2,2015-08-20,03:00:00,85,1006.49,295.947,296.99,
3,2015-08-20,06:00:00,86,1006.05,295.05,296.02,
4,2015-08-20,09:00:00,85,1005.87,296.026,296.93,
5,2015-08-20,12:00:00,77,1004.96,298.034,298.87
Prints:
25.72
You don't really need to do map sprintf etc. (Although you could do a printf on that output if you do want to format it).
Edit: From the comments, it seems one of the sources of confusion is extracting an element from an array. An array is zero or more scalar elements - you can't just assign one to the other, because .... well, what should happen if there isn't just one element (which is the usual case).
Given an array, we can:
pop #array will return the last element (and remove it from the array) so you could my $result = pop #array;
[0] is the first element of the array, so we can my $result = $array[0];
Or we can assign one array to another: my ( $result ) = #array; - because on the left hand side we have an array now, and it's a single element - the first element of #array goes into $result. (The rest isn't used in this scenario - but you could do my ( $result, #anything_else ) = #array;
So in your example - if what you're trying to do is retrieve a value matching a criteria - the normal tool for the job would be grep - which filters an array by applying a conditional test to each element.
So:
my #lines = grep { (split /,/)[2] eq "12:00:00" } <DATA>;
print "#lines";
print $lines[0];
Which we can reduce to:
my ( $firstresult ) = grep { (split /,/)[2] eq "12:00:00" } <DATA>;
print $firstresult;
But as we want to want to transform our array - map is the tool for the job.
my ( $result ) = map { (split /,/)[6] - 273.15 } grep { (split /,/)[2] eq "12:00:00" } <DATA>;
print $result;
First we:
use grep to extract the matching elements. (one in this case, but doesn't necessarily have to be!)
use map to transform the list, so that that we turn each element into just it's 6th field, and subtract 273.15
assign the whole lot to a list containing a single element - in effect just taking the first result, and throwing the rest away.
But personally, I think that's getting a bit complicated and may be hard to understand - and would suggest instead:
my $result;
while (<DATA>) {
my #fields = split /,/;
if ( $fields[2] eq "12:00:00" ) {
$result = $fields[6] - 273.15;
last;
}
}
print $result;
Iterate your data, split - and test - each line, and when you find one that matches the criteria - set $result and bail out of the loop.

#t1_temp_12 is an array. Why are you trying to subtract an single value from it?
my $result = "#t1_temp_12 - 273.14";
Did you want to do this instead?
#t1_temp_12 = map {$_ - 273.14} #t1_temp_12;
As a shell one-liner, you could write your entire script as:
perl -F, -lanE 'say $F[6]-273.14 if $F[2] eq "12:00:00"' <<DATA
1,2015-08-20,00:00:00,89,1007.48,295.551,296.66,
2,2015-08-20,03:00:00,85,1006.49,295.947,296.99,
3,2015-08-20,06:00:00,86,1006.05,295.05,296.02,
4,2015-08-20,09:00:00,85,1005.87,296.026,296.93,
5,2015-08-20,12:00:00,77,1004.96,298.034,298.87
DATA
25.73

Perl Text-Parsing; Which algorithm is correct?

I am writing a Perl script that takes two files as input: one input is a tab-separated table with an identifier of interested in the second column, the second input is a list of identifiers that match the second column of the first file.
THE GOAL is print only those lines of the table which contain an identifier in the second column and to print each line only once. I have written three versions of this program and have been finding different numbers of lines printed in each.
Version 1:
# TAB-SEPARTED TABLE FILE
open (FILE, $file);
while (<FILE>) {
my $line = $_;
chomp $line;
# ARRAY CONTAINING EACH IDENTIFIER AS A SEPARATE ELEMENT
foreach(#refs) {
my $ref = $_;
chomp $ref;
if ( $line =~ $ref) { print "$line\n"; next; }
}
}
Version 2:
# ARRAY CONTAINING EVERY LINE OF THE TAB-SEPARATED TABLE AS A SEPARATE LINE
foreach(#doc) {
my $full = $_;
# IF LOOP FOR PRINTING THE HEADER BUT NOT COMPARING IT TO ARRAY BELOW
if ( $counter == 0 ) {
print "$full\n";
$counter++;
next; }
# EXTRACT IDENTIFIER FROM LINE
my #cells = split('\t', $full);
my $gene = $cells[1];
foreach(#refs) {
my $text = $_;
if ( $gene =~ $text && $counter == 1 ) { # COMPARE IDENTIFIER
print "$full\n";
next;
}
}
$counter--;
}
Version 3:
# LIST OF IDENTIFIERS
foreach(#refs) {
my $ref = $_;
# LIST OF EACH ROW OF THE TABLE
foreach(#doc) {
my $line = $_;
my #cells = split('\t', $line);
my $gene = $cells[1];
if ( $gene =~ $ref ) { print "$line\n"; next; }
}
}
Each of these approaches gives me different output and I do not understand why. I also do not understand if I can trust any of them to give me the right output. The right output should not contain any duplicate lines but more than one row might match any identifier from the list.
Sample Input File:
Position Symbol Name REF ALT
chr1:887801 NOC2L nucleolar complex associated 2 homolog (S. cerevisiae) A G
chr1:888639 NOC2L nucleolar complex associated 2 homolog (S. cerevisiae) T C
chr1:888659 NOC2L nucleolar complex associated 2 homolog (S. cerevisiae) T C
chr1:897325 KLHL17 kelch-like 17 (Drosophila) G C
chr1:909238 PLEKHN1 pleckstrin homology domain containing, family N member 1 G C
chr1:982994 AGRN agrin T C
chr1:1254841 CPSF3L cleavage and polyadenylation specific factor 3-like C G
chr1:3301721 PRDM16 PR domain containing 16 C T
chr1:3328358 PRDM16 PR domain containing 16 T C
List is pulled from a file that looks like this:
A1BG
A2M
A2ML1
AAK1
ABCA12
ABCA13
ABCA2
ABCA4
ABCC2
Its put into an array using this code:
open (REF, $ref_file);
while (<REF>) {
my $line = $_;
chomp $line;
push(#refs, $line);
}
close REF;

Whenever you hear "I need to look up something", think hashes.
What you can do is create a hash that contains the elements you want to pull out of file #1. Then, use a second hash to track whether or not you printed it before:
#!/usr/bin/env perl
use warnings;
use strict;
use feature qw(say);
use autodie; # This way, I don't have to check my open for failures
use constant {
TABLE_FILE => "file1.txt",
LOOKUP_FILE => "file2.txt",
};
open my $lookup_fh, "<", LOOKUP_FILE;
my %lookup_table;
while ( my $symbol = <$lookup_fh> ) {
chomp $symbol,
$lookup_table{$symbol} = 1;
}
close $lookup_fh;
open my $table_file, "<", TABLE_FILE;
my %is_printed;
while ( my $line = <$table_file> ) {
chomp $line;
my #line_array = split /\s+/, $line;
my $symbol = $line_array[1];
if ( exists $lookup_table{$symbol} and not exists $is_printed{$symbol} ) {
say $line;
$is_printed{$symbol} = 1;
}
}
Two loops, but much more efficient. In yours, if you had 100 items in the first file, and 1000 items in the second file, you would have to loop 100 * 1000 times or 1,000,000. In this, you only loop the total number of lines in both files.
I use the three-parameter method of the open command which allows you to handle files with names that start with | or <, etc. Also, I use variables for my file handles which make it easier to pass the file handle to a subroutine if so desired.
I use use autodie; which handles issues such as what if my file doesn't open. In your program, the program would continue on its merry way. If you don't want to use autodie, you need to do this:
open $fh, "<", $my_file or die qq(Couldn't open "$my_file" for reading);
I use two hashes. The first is %lookup_table which stores the Symbols you want to print. When I go through the first file, I can simply check if `$lookup_table{$symbol} exists. If it doesn't, I don't print it, if it does, I print it.
The second hash %is_printed keeps track of Symbols I've already printed. If $is_printed{$symbol} exists, I know I've already printed that line.
Even though you said the second table is tab separated, I use /\s+/ as the split regular expression. This will catch a tab, but it will also catch if someone used two tabs (to keep things looking nice) or accidentally typed a space before that tab.

I'm pretty sure this should work:
$ awk '
NR == FNR {Identifiers[$1]; next}
$2 in Identifiers {
$1 = ""; $0 = $0; if(!Printed[$0]++) {print}
}' identifiers_file.txt data_file.txt
Given identifiers_file.txt such as this (to which I added NOC2L since there were no matching identifiers in your sample):
A1BG
A2M
A2ML1
AAK1
ABCA12
ABCA13
ABCA2
ABCA4
ABCC2
NOC2L
then your output will be:
$ awk '
NR == FNR {Identifiers[$1]; next}
$2 in Identifiers {
$1 = ""; $0 = $0; if(!Printed[$0]++) {print}
}' idents.txt data.txt
NOC2L nucleolar complex associated 2 homolog (S. cerevisiae) A G
NOC2L nucleolar complex associated 2 homolog (S. cerevisiae) T C
If that's correct and you want a Perl version, you can just:
$ echo 'NR == FNR {Identifiers[$1]; next} $2 in Identifiers { $1 = ""; $0 = $0; if(!Printed[$0]++) {print} }' \
| a2p

I suggest you to mix first version and second and add hashes to them.
First version because it's good(clear way) parse your data file line by line.
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
open (REF, $ARGV[0]);
my %refs;
while (<REF>) {
my $line = $_;
chomp $line;
$refs{$line} = 0;
}
close REF;
#for head printing
$refs{'Symbol'} = 0;
open (FILE, $ARGV[1]);
while (<FILE>) {
my $line = $_;
my #cells = split('\t', $line);
my $gene = $cells[1];
#print $line, "\n" if exists $refs{$gene};
if(exists $refs{$gene} and $refs{$gene} == 0)
{
$refs{$gene}++;
print $line;
}
}
close FILE;

replace a string of characters with the line number

I have a text file that has approximately 3,000 lines. 99% of the time I need all 3,000 lines. However, periodically I will grep out the lines I need and direct the output to another text file to use.
The only problem I have in doing so, is: Embedded in the text file is a 6 character string of numbers that indicate the line number. In order to use the file, this area needs to be correctly renumbered...(I don't need to re-sort the data, but I need to replace the current six characters with the new line number. and it must be padded with zeros! Unfortuantely the entire rows is one long row of data with no field separators!
For example, my first three rows might look something like:
20130918082020ZZ000001RANDOMDATAFOLLOWSAFTERTHISABCDEFGH
20130810112000ZZ000999MORERANDOMDATAFOLLOWSAFTERTHISABCD
20130810112000ZZ000027SILLMORERANDOMDATAFOLLOWSAFTERTHIS
The six characters at positions 17-22 (Immediately following the "ZZ"), need be renumbered based on the current row number...so the above needs to look like:
20130918082020ZZ000001RANDOMDATAFOLLOWSAFTERTHISABCDEFGH
20130810112000ZZ000002MORERANDOMDATAFOLLOWSAFTERTHISABCD
20130810112000ZZ000003SILLMORERANDOMDATAFOLLOWSAFTERTHIS
Any ideas would be greatly appreciated!
Thanks,
KSL.

Here's the solution I came up with Perl. It assumes that the numbering is always 6 digits after the ZZ sequence.
In convert.pl:
use strict;
use warnings;
my $i = 1; # or the value you want to start numbering
while (<STDIN>) {
my $replace = sprintf("%06d", $i++);
$_ =~ s/ZZ\d{6}/ZZ$replace/g;
print $_;
}
In data.dat:
20130918082020ZZ000001RANDOMDATAFOLLOWSAFTERTHISABCDEFGH
20130810112000ZZ000999MORERANDOMDATAFOLLOWSAFTERTHISABCD
20130810112000ZZ000027SILLMORERANDOMDATAFOLLOWSAFTERTHIS
To run:
cat data.dat | perl convert.pl
Output
20130918082020ZZ000001RANDOMDATAFOLLOWSAFTERTHISABCDEFGH
20130810112000ZZ000002MORERANDOMDATAFOLLOWSAFTERTHISABCD
20130810112000ZZ000003SILLMORERANDOMDATAFOLLOWSAFTERTHIS

If I would solve this, I would create a simple python script to read those lines by filtering as grep does and using a internal counter from inside the python script.
As simple hints you can read each line in a string and access them using variablename[17:22] (17:22 is the position of the string you are trying to use).
Now, there is a method in the string in python which does the replace, just replace the values by the counter you create.
I hope this helps.

To do this in awk:
awk '{print substr($0,1,16) sprintf("%06d", NR) substr($0,23)}'
or
gawk 'match($0,/^(.*ZZ)[0-9]{6}(.*)/,a) {print a[1] sprintf("%06d",NR) a[2]}'

This is exactly the type of thing where unpack is useful.
#!/usr/bin/env perl
use v5.10.0;
use strict;
use warnings;
while( my $line = <> ){
chomp $line;
my #elem = unpack 'A16 A6 A*', $line;
$elem[1] = sprintf '%06d', $.;
# $. is the line number for the last used file handle
say #elem;
}
Actually looking at the lines, it looks like there is date information stored in the first 14 characters.
Assuming that at some point you might want to parse the lines for some reason you can use the following as an example of how you could use unpack to split up the lines.
#!/usr/bin/env perl
use v5.10.0; # say()
use strict;
use warnings;
use DateTime;
my #date_elem = qw'
year month day
hour minute second
';
my #elem_names = ( #date_elem, qw'
ZZ
line_number
random_data
');
while( my $line = <> ){
chomp $line;
my %data;
#data{ #elem_names } = unpack 'A4 (A2)6 A6 A*', $line;
# choose either this:
$data{line_number} = sprintf '%06d', $.;
say #data{#elem_names};
# or this:
$data{line_number} = $.;
printf '%04d' . ('%02d'x5) . "%2s%06d%s\n", #data{ #elem_names };
# the choice will affect the contents of %data
# this just shows the contents of %data
for( #elem_names ){
printf qq'%12s: "%s"\n', $_, $data{$_};
}
# you can create a DateTime object with the date elements
my $dt = DateTime->new(
(map{ $_, $data{$_} } #date_elem),
time_zone => 'floating',
);
say $dt;
print "\n";
}
Although it would be better to use a regular expression, so that you could throw out bogus data.
use v5.14; # /a modifier
...
my $rdate = join '', map{"(\\d{$_})"} 4, (2)x5;
my $rx = qr'$rdate (ZZ) (\d{6}) (.*)'xa;
while( my $line = <> ){
chomp $line;
my %data;
unless( #data{ #elem_names } = $line =~ $rx ){
die qq'unable to parse line "$line" ($.)';
}
...
It would be better still; to use named capture groups added in 5.10.
...
my $rx = qr'
(?<year> \d{4} ) (?<month> \d{2} ) (?<day> \d{2} )
(?<hour> \d{2} ) (?<minute> \d{2} ) (?<second> \d{2} )
ZZ
(?<line_number> \d{6} )
(?<random_data> .* )
'xa;
while( my $line = <> ){
chomp $line;
unless( $line =~ $rx ){
die qq'unable to parse line "$line" ($.)';
}
my %data = %+;
# for compatibility with previous examples
$data{ZZ} = 'ZZ';
...

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Efficient way to read columns in a file using Perl - perl

Related

Sorting 5th column in descending order error message

grouping lines and concatenating them into one line in perl

Need to extract value from a file and subtract from variable in Perl

Perl Text-Parsing; Which algorithm is correct?

replace a string of characters with the line number

Categories

Resources