How to validate data against the line text and generate different output files in Perl

I have a GMF file:
TSTARTCUSTEVSUMMROW_SIMPLE
CUSTEVSUMMROW_SIMPLE Airtime - Simple All In Plan (Monthly)|3040|MOU|0|0
CUSTEVSUMMROW_SIMPLE Airtime - Simple All In Plan (Monthly)|1758|MOU|8000|140640
CUSTEVSUMMROW_SIMPLE Airtime - Simple All In Talk and Text Prepaid|3040|MOU|0|0
TENDCUSTEVSUMMROW_SIMPLE
TSTARTCUSTEVSUMMROW_GPRS_SIMPLE
CUSTEVSUMMROW_GPRS_SIMPLE GPRS - Simple All In Plan (Monthly) - LTE|2860|MB|6500|488
TENDCUSTEVSUMMROW_GPRS_SIMPLE
TSTARTCUSTEVSUMMROW_TF
CUSTEVSUMMROW_TF Airtime - TRCF2000Pkg|2952|MOU|8000|236160|
TENDCUSTEVSUMMROW_TF
TSTARTCUSTEVSUMMROW_GPRS_TF
CUSTEVSUMMROW_GPRS_TF GPRS - TRCF2000Pkg - LTE|13493|MB|6500|1496
TENDCUSTEVSUMMROW_GPRS_TF
I'm able to process the file successfully with the code below:
if ( $line =~ m/^(CUSTEVSUMMROW_GPRS|CUSTEVSUMMROW).*?\s(.*?)\|(\d+)\|.*\|(.*?)$/ ) {
# do some logic
}
where a single output file is generated for both the TF and SIMPLE tags.
Instead of using elsif, how can I modify the logic so that TF and SIMPLE generate two different output files?

I suggest you start by splitting the line on pipe characters |, and then processing the first field in more detail.
This program works that way. The record is split into the @fields array, and then the first element, $fields[0], is removed and replaced with its constituent parts using splice together with a regex pattern.
I have displayed the resulting value of @fields using Data::Dump. You can use the values for whatever you need. Clearly $fields[1] is SIMPLE or TF, so you can alter your processing accordingly.
use strict;
use warnings;
use Data::Dump;
while ( <DATA> ) {
    next unless /^CUSTEVSUMMROW/;
    chomp;
    my @fields = split /\|/;
    splice @fields, 0, 1, $fields[0] =~ /(.+)_(\S+)\s+(.+)/;
    dd \@fields;
}
__DATA__
TSTARTCUSTEVSUMMROW_SIMPLE
CUSTEVSUMMROW_SIMPLE Airtime - Simple All In Plan (Monthly)|3040|MOU|0|0
CUSTEVSUMMROW_SIMPLE Airtime - Simple All In Plan (Monthly)|1758|MOU|8000|140640
CUSTEVSUMMROW_SIMPLE Airtime - Simple All In Talk and Text Prepaid|3040|MOU|0|0
TENDCUSTEVSUMMROW_SIMPLE
TSTARTCUSTEVSUMMROW_GPRS_SIMPLE
CUSTEVSUMMROW_GPRS_SIMPLE GPRS - Simple All In Plan (Monthly) - LTE|2860|MB|6500|488
TENDCUSTEVSUMMROW_GPRS_SIMPLE
TSTARTCUSTEVSUMMROW_TF
CUSTEVSUMMROW_TF Airtime - TRCF2000Pkg|2952|MOU|8000|236160|
TENDCUSTEVSUMMROW_TF
TSTARTCUSTEVSUMMROW_GPRS_TF
CUSTEVSUMMROW_GPRS_TF GPRS - TRCF2000Pkg - LTE|13493|MB|6500|1496
TENDCUSTEVSUMMROW_GPRS_TF
Output:
[
"CUSTEVSUMMROW",
"SIMPLE",
"Airtime - Simple All In Plan (Monthly)",
3040,
"MOU",
0,
0,
]
[
"CUSTEVSUMMROW",
"SIMPLE",
"Airtime - Simple All In Plan (Monthly)",
1758,
"MOU",
8000,
140640,
]
[
"CUSTEVSUMMROW",
"SIMPLE",
"Airtime - Simple All In Talk and Text Prepaid",
3040,
"MOU",
0,
0,
]
[
"CUSTEVSUMMROW_GPRS",
"SIMPLE",
"GPRS - Simple All In Plan (Monthly) - LTE",
2860,
"MB",
6500,
488,
]
[
"CUSTEVSUMMROW",
"TF",
"Airtime - TRCF2000Pkg",
2952,
"MOU",
8000,
236160,
]
[
"CUSTEVSUMMROW_GPRS",
"TF",
"GPRS - TRCF2000Pkg - LTE",
13493,
"MB",
6500,
1496,
]
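Since $fields[1] carries the tag, one way to route SIMPLE and TF rows to separate files without an elsif chain is a hash of filehandles keyed by that tag. This is only a sketch; the output file names out_SIMPLE.txt and out_TF.txt are invented for the example:

```perl
use strict;
use warnings;

# Open one output file per tag; the file names are placeholders.
my %out;
for my $tag (qw(SIMPLE TF)) {
    open $out{$tag}, '>', "out_$tag.txt" or die "open out_$tag.txt: $!";
}

while ( my $line = <DATA> ) {
    next unless $line =~ /^CUSTEVSUMMROW/;
    chomp $line;
    my @fields = split /\|/, $line;
    # Split the first field into prefix, tag (SIMPLE or TF), and description
    splice @fields, 0, 1, $fields[0] =~ /(.+)_(\S+)\s+(.+)/;
    my $tag = $fields[1];    # 'SIMPLE' or 'TF'
    print { $out{$tag} } join('|', @fields), "\n" if $out{$tag};
}
close $_ for values %out;

__DATA__
CUSTEVSUMMROW_SIMPLE Airtime - Simple All In Plan (Monthly)|3040|MOU|0|0
CUSTEVSUMMROW_TF Airtime - TRCF2000Pkg|2952|MOU|8000|236160
```

The same pattern extends to any number of tags: add the tag name to the qw() list and the dispatch through %out stays unchanged.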

Related

Perl: break down a string, with some unique constraints

I'm using Perl to feed data to an LCD display. The display is 8 characters wide. The strings of data to be displayed are always significantly longer than 8 characters. As such, I need to break the strings down into "frames" of 8 characters or less, and feed the "frames" to the display one at a time.
The display is not intelligent enough to do this on its own. The only convenience it offers is that strings of less than 8 characters are automatically centered on the display.
In the beginning, I simply sent the string 8 characters at a time - here goes 1-8, now 9-16, now 17-24, etc. But that wasn't especially nice-looking. I'd like to do something better, but I'm not sure how best to approach it.
These are the constraints I'd like to implement:
Fit as many words into a "frame" as possible
No starting/trailing space(s) in a "frame"
A symbol (i.e. hyphen, ampersand, etc.) with a space on both sides qualifies as a word
If a word is longer than 8 characters, simulate per-character scrolling
Break words longer than 8 characters at a slash or hyphen
Some hypothetical input strings, and desired output for each...
Electric Light Orchestra - Sweet Talkin' Woman
Electric
Light
Orchestr
rchestra
- Sweet
Talkin'
Woman
Quarterflash - Harden My Heart
Quarterf
uarterfl
arterfla
rterflas
terflash
- Harden
My Heart
Steve Miller Band - Fly Like An Eagle
Steve
Miller
Band -
Fly Like
An Eagle
Hall & Oates - Did It In A Minute
Hall &
Oates -
Did It
In A
Minute
Bachman-Turner Overdrive - You Ain't Seen Nothing Yet
Bachman-
Turner
Overdriv
verdrive
- You
Ain't
Seen
Nothing
Yet
Being a relative Perl newbie, I'm trying to picture how would be best to handle this. Certainly I could split the string into an array of individual words. From there, perhaps I could loop through the array, counting the letters in each subsequent word to build the 8-character "frames". Upon encountering a word longer than 8 characters, I could then repetitively call substr on that word (with offset +1 each time), creating the illusion of scrolling.
Is this a reasonable way to accomplish my goal? Or am I reinventing the wheel here? How would you do it?
The base question is to find all consecutive overlapping N-long substrings in a compact way.
Here it is in one pass with a regex; see the end for a way to do it using substr.
my $str = join '', "a".."k"; # 'Quarterflash';
my @eights = $str =~ /(?=(.{8}))/g;
This uses a lookahead which also captures, and in this way the regex crawls up the string character by character, capturing the "next" eight each time.
Once we are at it, here is also a basic solution for the problem. Add words to a buffer until it would exceed 8 characters, at which point it is added to an array of display-ready strings and cleared.
use warnings;
use strict;
use feature 'say';
my $str = shift // "Quarterflash - Harden My Heart";
my @words = split ' ', $str;

my @to_display;
my $buf = '';
foreach my $w (@words) {
    if (length $w > 8) {
        # Have to flush the buffer first, then deal with this long word
        push @to_display, $buf if length $buf;
        $buf = '';
        push @to_display, $w =~ /(?=(.{8}))/g;
    }
    elsif ( length($buf) + 1 + length($w) > 8 ) {
        push @to_display, $buf;
        $buf = $w;
    }
    elsif (length $buf != 0) { $buf .= ' ' . $w }
    else                     { $buf = $w }
}
push @to_display, $buf if $buf;

say for @to_display;
This is clearly missing some special/edge cases, in particular those involving non-word characters and hyphenated words, but that shouldn't be too difficult to add.†
Here is a way to get all consecutive 8-long substrings using substr
my @to_display = map { substr $str, $_, 8 } 0 .. length($str) - 8;
† Example, break a word with hyphen/slash when it has no spaces around it (per question)
my @parts = split m{\s+|(?<=\S)[-/](?=\S)}, $w;
The hyphen/slash is discarded as this stands; that can be changed by capturing the pattern as well and then filtering out elements consisting only of spaces:
my @parts = grep { /\S/ } split m{( \s+ | (?<=\S) [-/] (?=\S) )}x, $w;
These have only barely been tested. They can fit in the if (length $w > 8) branch.
The initial take: the regex was originally written with a two-part pattern. I'm keeping it here for the record and as an example of the pair-handling functions from List::Util.
The regex below matches and captures a character, followed by a lookahead for the next seven, which it also captures. This way the engine captures 1 and 7-long substrings as it moves along char by char. Then the consecutive pairs from the returned list are joined
my $str = join '', "a".."k"; # 'Quarterflash';
use List::Util qw(pairmap);
my @eights = pairmap { $a . $b } $str =~ /(. (?=(.{7})) )/gx;
# or
# use List::Util qw(pairs);
# my @eights = map { join '', @$_ } pairs $str =~ /(.(?=(.{7})))/g;

Unix script to transpose rows to columns with defined file layout as output

Script tried:
perl -nle '
    if ($. == 1)
    { (@a) = /([\w - .]+)(?=,|\s*$)/g }
    else
    {
        (@b) = /([\w - .]+)(?=,|\s*$)/g;
        print "$a[0]|$b[0]|$b[1]|$b[2]|$a[$_]|$b[$_+3]" foreach (0 .. $#a)
    }
' ip.txt > op.txt
Input data:
src,FI,QMA,PCG,PCC,PREI,G T
PIM2016.csv,MMR.S T - RED,334,114,120,34,123,725
Output with the latest script:
SRC|PIM2016.csv|MMRPPS|RED|SRC|334
SRC|PIM2016.csv|MMRPPS|RED|FI|114
SRC|PIM2016.csv|MMRPPS|RED|QMA|120
SRC|PIM2016.csv|MMRPPS|RED|PCG|34
SRC|PIM2016.csv|MMRPPS|RED|PCC|123
SRC|PIM2016.csv|MMRPPS|RED|PREI|725
SRC|PIM2016.csv|MMRPPS|RED|G T|
Required output:
SRC|PIM2016.csv|MMRPPS|S T - RED|FI|334
SRC|PIM2016.csv|MMRPPS|S T - RED|QMA|114
SRC|PIM2016.csv|MMRPPS|S T - RED|PCG|120
SRC|PIM2016.csv|MMRPPS|S T - RED|PCC|34
SRC|PIM2016.csv|MMRPPS|S T - RED|PREI|123
SRC|PIM2016.csv|MMRPPS|S T - RED|G T|725
I think your life gets a lot easier if you know about split()
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
my @head;

while (<>) {
    chomp;
    if ($. == 1) {
        @head = split /,/;
        next;
    }
    my @data = split /,/;
    say "$head[0]|$data[0]|$data[1]|$data[2]|$head[$_]|$data[$_+2]" for (1 .. $#head);
}
I've written it as a program rather than as a command line as I think it's too long to be run on the command line.
Also, I hope that the dot after "MMR" in your sample input is a typo for a comma. If that's not the case it gets a little more complex - but not very much.
Oh, and there's no "PPS" in your sample input, so I've no idea where that comes from in your sample output.

Skip bad CSV lines in Perl with Text::CSV

I have a script that is essentially still in testing.
I would like to use Text::CSV to break down large quantities of CSV files dumped hourly.
These files can be quite large and of inconsistent quality.
Sometimes I'll get strange characters or data, but the usual issue is lines that just stop.
"Something", "3", "hello wor
The unclosed quote is my biggest hurdle. The script just breaks: the error goes to stderr and my while loop is broken.
while (my $row = $csv->getline($data))
The error I get is...
# CSV_PP ERROR: 2025 - EIQ - Loose unescaped escape
I can't seem to do any kind of error handling for this. If I enable allow_loose_escapes, all I get instead is a lot of errors, because it considers the subsequent new lines as part of the same row.
Allowing the loose escape is not the answer. It just makes your program ignore the error and try to incorporate the broken line with your other lines, as you also mentioned. Instead you can try to catch the problem, and check your $row for definedness:
use strict;
use warnings;
use Text::CSV;
use feature 'say';
my $csv = Text::CSV->new({
    binary => 1,
    eol    => $/,
});

while (1) {
    my $row = $csv->getline(*DATA);
    $csv->eof and last;
    if (defined $row) {
        $csv->print(*STDOUT, $row);
    }
    else {
        say "==" x 10;
        say "Bad line, skipping";
        say $csv->error_diag();
        say "==" x 10;
    }
}
__DATA__
1,2,3,4
a,b,c,d
"Something", "3", "hello wor
11,22,33,44
For me this outputs:
1,2,3,4
a,b,c,d
====================
Bad line, skipping
2034EIF - Loose unescaped quote143
====================
11,22,33,44
If you want to save the broken lines, you can access them with $csv->error_input(), e.g.:
print $badlines $csv->error_input();
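A minimal sketch of that idea (the bad_lines.csv file name is my invention): open a separate handle and write the raw offending input to it whenever getline returns undef:

```perl
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new({ binary => 1, eol => $/ });

# bad_lines.csv is a placeholder name for this sketch
open my $badlines, '>', 'bad_lines.csv' or die "open bad_lines.csv: $!";

while (1) {
    my $row = $csv->getline(*DATA);
    $csv->eof and last;
    if (defined $row) {
        $csv->print(*STDOUT, $row);    # good rows pass through
    }
    else {
        # error_input() returns the raw input that failed to parse
        my $bad = $csv->error_input // '';
        chomp $bad;
        print {$badlines} $bad, "\n";
    }
}
close $badlines;

__DATA__
1,2,3,4
"Something", "3", "hello wor
11,22,33,44
```

Afterwards bad_lines.csv holds only the rejected lines, so they can be inspected or repaired separately from the clean output.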

Parsing a log file using perl

I have a log file where some of the entries look like this:
YY/MM/DD HH:MM:SS:MMM <Some constant text> v1=XXX v2=YYY v3=ZZZ v4=AAA AND BBB v5=CCC
and I'm trying to get it into a CSV format:
Date,Time,v1,v2,v3,v4,v5
YY/MM/DD,HH:MM:SS:MMM,XXX,YYY,ZZZ,AAA AND BBB,CCC
I'd like to do this in Perl - speaking personally, I could probably do it far quicker in other languages but I'd really like to expand my horizons a bit.
So far I can get as far as reading the file in and picking out only lines which meet my criteria but I can't seem to get the next stage done. I'll need to splice up the input line but so far I just can't work out how to do this. I've looked at s//and m// but they don't really give me what I want. If anyone can advise me how this can be done or give me pointers I'd much appreciate it.
Important points:
The values in the second part of the line are always in the same order, so mapping / re-organising is not necessarily a problem.
Some of the fields contain free text which is not quoted :( but as the labels all start with v<number>=, I'm hoping parsing this should still be possible.
Since there is no one delimiter, you'll need to try this a few different ways:
First, split on ' ', then take the first three values:
my @array = split / /, $line;
my ($date, $time, $constant) = splice @array, 0, 3;
Join the rest of the fields together again, and re-split on v\d+= to get the values:
my $rest = join ' ', @array;
# $rest should now be "v1=XXX v2=YYY ..."
my @values = split /\s*v\d+=/, $rest;
shift @values; # since the first element in @values will be empty
print join ',', $date, $time, @values;
Edit: Here's another approach that may be easier to follow, and is slightly more efficient. This takes advantage of the fact that your constant text occurs between the date/time and the value list.
# assume that CONSTANT is your constant text
my ($datetime, $valuelist) = split /\s*CONSTANT\s*/, $line;
my ($date, $time) = split / /, $datetime;
my @values = split /\s*v\d+=/, $valuelist;
shift @values;
print join(',', $date, $time, @values), "\n";
What have you tried with regular expressions and how has it failed? A regex with m// works fine for me:
#!/usr/bin/env perl
use strict;
use warnings;
print "Date,Time,v1,v2,v3,v4,v5\n";
while (my $line = <DATA>) {
my @matched = $line =~ m{^([^ ]+) ([^ ]+).*v1=(.*) v2=(.*) v3=(.*) v4=(.*) v5=(.*)};
print join(',', @matched), "\n";
}
__DATA__
YY/MM/DD HH:MM:SS:MMM <Some constant text> v1=XXX v2=YYY v3=ZZZ v4=AAA AND BBB v5=CCC
Two caveats:
1) v1 cannot contain the substring " v2=", v2 cannot contain " v3=", etc., but, with such a loose format, that's something that would likely cause problems for a human attempting to parse it, too.
2) This code assumes that there will always be v1 through v5. If there are fewer than five v<number> fields, the line will fail to match. If there are more, all additional fields will be appended to v5 (including their v<number>= tags).
In case the log is fixed-width, you're better off using unpack; you will see its benefits if the log grows very large (performance-wise).
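To illustrate that suggestion, here is a small unpack sketch; the template is invented for the date/time prefix shown in the question (A8 = 8 ASCII characters with trailing spaces stripped, x = skip one byte):

```perl
use strict;
use warnings;

# Fixed-width parse of the leading date/time fields.
# Template: 8 chars (date), skip space, 12 chars (time), skip space, rest.
my $line = "YY/MM/DD HH:MM:SS:MMM <Some constant text> v1=XXX";
my ($date, $time, $rest) = unpack 'A8 x A12 x A*', $line;

print "$date,$time\n";    # prints YY/MM/DD,HH:MM:SS:MMM
```

Because unpack walks the string by byte offsets rather than running a regex engine, it tends to be noticeably faster on large fixed-width logs, but the template must match the layout exactly.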

Reading a large file into Perl array of arrays and manipulating the output for different purposes

I am relatively new to Perl and have only used it for converting small files into different formats and feeding data between programs.
Now, I need to step it up a little. I have a file of DNA data that is 5,905 lines long, with 32 fields per line. The fields are not delimited by anything and vary in length within the line, but each field is the same size on all 5905 lines.
I need each line fed into a separate array from the file, and each field within the line stored as its own variable. I am having no problems storing one line, but I am having difficulties storing each line successively through the entire file.
This is how I separate the first line of the full array into individual variables:
my $SampleID     = substr("@HorseArray", 0, 7);
my $PopulationID = substr("@HorseArray", 9, 4);
my $Allele1A     = substr("@HorseArray", 14, 3);
my $Allele1B     = substr("@HorseArray", 17, 3);
my $Allele2A     = substr("@HorseArray", 21, 3);
my $Allele2B     = substr("@HorseArray", 24, 3);
...etc.
My issues are: 1) I need to store each of the 5905 lines as a separate array. 2) I need to be able to reference each line based on the sample ID, or a group of lines based on population ID and sort them.
I can sort and manipulate the data fine once it is defined in variables, I am just having trouble constructing a multidimensional array with each of these fields so I can reference each line at will. Any help or direction is much appreciated. I've poured over the Q&A sections on here, but have not found the answer to my questions yet.
Do not store each line in its own array. You need to construct a data structure. Start by reading the following tutorials from perldoc:
perlreftut
perldsc
perllol
Here's some starter code:
use strict;
use warnings;
# Array of data samples. We could use a hash as well; which is better
# depends on how you want to use the data.
my @sample;

while (my $line = <DATA>) {
    chomp $line;

    # Parse the input line
    my ($sample_id, $population_id, $rest) = split(/\s+/, $line, 3);

    # Extract A/B allele pairs
    my @pairs;
    while ($rest =~ /(\d{1,3})(\d{3})|(\d{1,3}) (\d{1,2})/g) {
        push @pairs, {
            A => defined $1 ? $1 : $3,
            B => defined $2 ? $2 : $4,
        };
    }

    # Add this sample to the list of samples. Store it as a hashref so
    # we can access attributes by name.
    push @sample, {
        sample     => $sample_id,
        population => $population_id,
        alleles    => \@pairs,
    };
}

# Print out all the values of alleles 2A and 2B for the samples in
# population py18. Note that array indexing starts at 0, so allele 2
# is at index 1.
foreach my $sample (grep { $_->{population} eq 'py18' } @sample) {
    printf("%s: %d / %d\n",
        $sample->{sample},
        $sample->{alleles}[1]{A},
        $sample->{alleles}[1]{B},
    );
}
__DATA__
00292-97 py17 97101 129129 152164 177177 100100 134136 163165 240246 105109 124124 166166 292292 000000 000000 000000
00293-97 py18 89 97 129139 148154 179179 84 90 132134 167169 222222 105105 126128 164170 284292 000000 000000 000000
00294-97 py17 91 97 129133 152154 177183 100100 134140 161163 240240 103105 120128 164166 290292 000000 000000 000000
00295-97 py18 97 97 131133 148162 177179 84100 132134 161167 240252 111111 124128 164166 284290 000000 000000 000000
I'd start by looping through the lines and parsing each into a hash of fields, and I'd build a hash for each index along the way.
my %by_sample_id;     # this will be a hash of hashes
my %by_population_id; # a hash of lists of hashes

foreach (<FILEHANDLE>) {
    chomp;  # remove newline
    my %h;  # new hash
    $h{SampleID}     = substr($_, 0, 7);
    $h{PopulationID} = substr($_, 9, 4);
    # etc...
    $by_sample_id{ $h{SampleID} } = \%h;                  # a reference to %h
    push @{ $by_population_id{ $h{PopulationID} } }, \%h; # pushes hashref onto list
}
Then, you can use either index to access the data in which you're interested:
say "Allele1A for sample 123123: ", $by_sample_id{123123}->{Allele1A};
say "all the Allele1A values for population 432432: ",
    join(", ", map { $_->{Allele1A} } @{ $by_population_id{432432} });
I'm going to assume this isn't a one-off program, so my approach would be slightly different.
I've done a fair amount of data-mashing, and after a while I get tired of writing queries against data structures.
So:
I would feed the data into an SQLite database (or another SQL database) and then write Perl queries against it using DBI. This cranks up the complexity well past a simple parse-and-hack, but after you've written several scripts running queries on the same data, it becomes obvious that this is a pain and there must be a better way.
You would have a schema that looks similar to this
create table brians_awesome_data (id integer, population_id varchar(32), chunk1 integer, chunk2 integer...);
Then, after you've used some of mobrule's and Michael's excellent parsing, you'd loop and do some INSERT INTO statements on your awesome_data table.
Then, you could use a CLI for your SQL program and do "select ... where ..." queries to quickly get the data you need.
Or, if it's more analytical/pipeliney, you could Perl up a script with DBI and get the data into your analysis routines.
Trust me, this is the better way to do it than writing queries against data structures over and over.
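A minimal sketch of that approach, assuming DBD::SQLite is available; the table and column names here are illustrative, and the two hard-coded rows stand in for the parsed data from the answers above:

```perl
use strict;
use warnings;
use DBI;

# In-memory SQLite database for illustration
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1, AutoCommit => 1 });

$dbh->do(q{
    CREATE TABLE samples (
        sample_id     TEXT,
        population_id TEXT,
        allele2a      INTEGER,
        allele2b      INTEGER
    )
});

my $ins = $dbh->prepare(
    'INSERT INTO samples (sample_id, population_id, allele2a, allele2b)
     VALUES (?, ?, ?, ?)');

# Rows would normally come from the substr/split parsing above;
# these two are hard-coded for the sketch.
$ins->execute('00293-97', 'py18', 148, 154);
$ins->execute('00292-97', 'py17', 152, 164);

# Query by population, mirroring the earlier grep example
my $rows = $dbh->selectall_arrayref(
    'SELECT sample_id, allele2a, allele2b FROM samples
     WHERE population_id = ? ORDER BY sample_id', undef, 'py18');

printf "%s: %d / %d\n", @$_ for @$rows;
```

Swapping :memory: for a file path gives you a persistent database you can also query from the sqlite3 command line, which is exactly the workflow described above.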