Perl - get first "word" from input string

I am trying to write a Perl program that reads lines from a text file and, for each line, extracts the first "word" and performs a different action based on the string that gets returned.
The main loop looks like this:
while(<AXM60FILE>) {
$inputline = $_;
($start) = ($inputline =~ /\A(.*?) /);
# perform something, based on the value of the string in $start
}
The input file is actually a parameter file, with the parameter_name and parameter_value, separated by a colon (":"). There can be spaces or tabs before or after the colon.
So, the file looks (for example) like the following:
param1: xxxxxxxxxxxx
param2 :xxxxxxxxxxxxx
param3 : xxxxxxxxxxxxxxxxx
param4:xxxxxxxxxxxxx
That "($start) = ($inputline =~ /\A(.*?) /);" works ok for the "param2" example and the "param3" example where the 1st word is terminated by a blank/space, but how can I handle the "param1" and "param4" situations, where the parameter_name is followed immediately by the colon?
Also, what about if the "whitespace" is a tab or tabs, instead of blank/space character?
Thanks,
Jim

This will cover all of your cases and then some:
my ($key, $value) = split /\s*:\s*/, $inputline, 2;
(Or, in English, split $inputline into a maximum of two elements separated by any amount of whitespace, a colon and any amount of whitespace.)
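As a rough sketch of how that split behaves on the sample lines from the question: the helper name parse_param, the removal of the line ending, and the trimming of leading whitespace are additions for illustration and are not part of the one-liner above.

```perl
use strict;
use warnings;

# Hypothetical helper: split one "name : value" line into its two parts.
sub parse_param {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;    # drop the line ending (assumption: it may be CRLF)
    $line =~ s/^\s+//;       # assumption: indentation before the name is noise
    # At most two pieces, so colons inside the value are left alone.
    my ($key, $value) = split /\s*:\s*/, $line, 2;
    return ($key, $value);
}

# \s also matches tabs, so the third line is handled like the others.
for my $line ("param1: xxx\n", "param2 :xxx\n", "param3\t:\txxx\n", "param4:xxx\n") {
    my ($key, $value) = parse_param($line);
    print "$key => $value\n";
}
```

A line with no colon at all would leave $value undefined, so real input may need a check for that.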

($start) = $inputline =~ /\A([^:\s]+)/;
This will match anything except whitespace and : at the beginning of the line.
Or using split:
($start) = split /[:\s]+/, $inputline, 2;

Related

Perl: how to format a string containing a tilde character "~"

I have run into an issue where a perl script we use to parse a text file is omitting lines containing the tilde (~) character, and I can't figure out why.
The sample below illustrates what I mean:
#!/usr/bin/perl
use warnings;
formline " testing1\n";
formline " ~testing2\n";
formline " testing3\n";
my $body_text = $^A;
$^A = "";
print $body_text
The output of this example is:
testing1
testing3
The line containing the tilde is dropped entirely from the accumulator. This happens whether there is any text preceding the character or not.
Is there any way to print the line with the tilde treated as a literal part of the string?
~ is special in forms (see perlform) and there's no way to escape it. But you can create a field for it and populate it with a tilde:
formline " \@testing2\n", '~';
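A minimal sketch contrasting the two cases, assuming a small wrapper (the name render is made up here) that runs a single formline call against an empty accumulator:

```perl
use strict;
use warnings;

# Hypothetical helper: run one formline call and return what it produced.
sub render {
    my ($picture, @values) = @_;
    local $^A = '';              # start with an empty accumulator
    formline $picture, @values;
    return $^A;
}

# "~" written in the picture and no fields supplied: the line is suppressed.
my $suppressed = render(" ~testing2\n");

# "~" supplied as data for a one-character "@" field: the line is kept.
my $kept = render(" \@testing2\n", '~');

print "picture tilde: line suppressed\n" if $suppressed eq '';
print $kept;
```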
The first argument to formline is the "picture" (template). That picture uses various characters to mean particular things. The ~ means to suppress output if the fields are blank. Since you supply no fields in your call to formline, your fields are blank and output is suppressed.
my @lines = ( '', 'x y z', 'x~y~z' );
foreach $line ( @lines ) { # forms don't use lexicals, so no my on control
write;
}
format STDOUT =
~ ID: @*
$line
.
The output doesn't have a line for the blank field because the ~ in the picture told it to suppress output when $line doesn't have anything:
ID: x y z
ID: x~y~z
Note that tildes coming from the data are just fine; they are like any other character.
Here's probably something closer to what you meant. Create a picture, @* (variable-width multiline text), and supply it with values to fill it:
while( <DATA> ) {
local $^A;
formline '@*', $_;
print $^A, "\n";
}
__DATA__
testing1
~testing2
testing3
The output shows the field with the ~:
testing1
~testing2
testing3
However, the question is very odd because the way you appear to be doing things seems like you aren't really doing what formats want to do. Perhaps you have some tricky thing where you're trying to take the picture from input data. But if you aren't going to give it any values, what are you really formatting? Consider that you may not actually want formats.

Reading CSV with Perl produces distorted lines

I am reading a CSV file using Perl 5.26.1 with lines that look like this:
B1_10,202337840166,R08C02,202337840166_R08C02.gtc
I'm reading this data into a hash that has the last element as a key, and the first as a value.
I read the file line by line (snippet only):
while (<$csv>) {
if (/^Sample/) { next }
say "-----start----\noriginal = $_";
chomp;
my @line = split /,/;
my $name = $line[0];
my $vcf = $line[3];
say "1st element = $name";
say "4th element = $vcf";
$vcf2dir{$vcf} = $name;
say "\$vcf2dir{$vcf} = '$name'";
say '-----end------';
}
which produces the following output:
-----start----
original = B1_10,202337840166,R08C02,202337840166_R08C02.gtc
1st element = B1_10
4th element = 202337840166_R08C02.gtc
} = 'B1_10'2337840166_R08C02.gtc
-----end-------
but it should look like
-----start----
original = B1_10,202337840166,R08C02,202337840166_R08C02.gtc
1st element = B1_10
4th element = 202337840166_R08C02.gtc
$vcf2dir{202337840166_R08C02.gtc} = 'B1_10'
-----end-------
and it shows strangely with the data printer package:
use DDP;
p %vcf2dir;
produces
{
' "B1_10"840166_R08C02.gtc
}
in other words, the last string is being cut up for some reason.
I have tried removing non-ascii characters with $_ =~ s/[[:^ascii:]]//g; but this still produces the same error.
I have no idea why Perl is ripping these strings apart :(
while (<$csv>) {
...
chomp;
My guess is that the input file has as line end \r\n (windows style) while you are executing the code in a UNIX like environment (Linux, Mac...) where the line end is \n. This means that $INPUT_RECORD_SEPARATOR is also \n and that chomp only removes the \n and leaves the \r. This left \r causes such strange output.
To fix this either fix the line endings in your input file, set $INPUT_RECORD_SEPARATOR to the expected separator or just do s{\r?\n\z}{} instead of chomp to handle both \r\n and \n line endings.
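A small sketch of the last option, assuming the file really does use \r\n line endings. The helper name strip_eol is made up, and the sample line is the one from the question with \r\n appended.

```perl
use strict;
use warnings;

# Hypothetical helper: remove either a "\r\n" or a "\n" line ending.
sub strip_eol {
    my ($line) = @_;
    $line =~ s{\r?\n\z}{};
    return $line;
}

my $raw = "B1_10,202337840166,R08C02,202337840166_R08C02.gtc\r\n";

# What chomp leaves behind when the record separator is "\n": the "\r" stays.
(my $chomped = $raw) =~ s/\n\z//;

my $last_chomped  = (split /,/, $chomped)[3];
my $last_stripped = (split /,/, strip_eol($raw))[3];

print "after chomp:     ", length($last_chomped),  " characters\n";
print "after strip_eol: ", length($last_stripped), " characters\n";
```

The extra character after chomp is the carriage return, which sends the terminal cursor back to the start of the line and makes later text overwrite the key.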
I ran your snippet against your line and it worked as expected
But I have seen behavior like what you show because of spurious Control-M's in my data.
Try filtering for Control-M's:
after your chomp, remove all Control-M's with the command below
s/\cM//g;

Best way to parse string in perl

To achieve the task below I have written the C-like Perl program below (as I am new to Perl), but I am not sure if this is the best way to achieve it.
Can someone please guide me?
Note: I am not asking for the full program, only where I can make improvements.
Thanks in advance
Input :
$str = "mail1, local<mail1@mail.local>, mail2@mail.local, <mail3@mail.local>, mail4 local<mail4@mail.local>"
Expected Output :
mail1, local<mail1@mail.local>
mail2@mail.local
<mail3@mail.local>
mail4, local<mail4@mail.local>
Sample Program
my $str="mail1, \@local<mail1\@mail.local>, mail2\@mail.local, <mail3\@mail.local>, mail4, local<mail4\@mail.local>";
my $count=0, @array, $flag=0, $tempStr="";
for my $c (split (//,$str)) {
if( ($count eq 0) and ($c eq ' ') ) {
next;
}
if($c) {
if( ($c eq ',') and ($flag eq 1) ) {
push @array, $tempStr;
$count=0;
$flag1=0;
$tempStr="";
next;
}
if( ($c eq '>' ) or ( $c eq '@' ) ) {
$flag=1;
}
$tempStr="$tempStr$c";
$count++;
}
}
if($count>0) {
push @array, $tempStr;
}
foreach my $var (@array) {
print "$var\n";
}
Edit:
Input:
Input is the output of above code.
Expected Output :
"mail1, local"<mail1@mail.local>
"mail4, local"<mail4@mail.local>
Sample Code:
$str =~ s/([^@>]+[@>][^,]+),\s*/$1\n/g;
my @addresses = split('\n',$str);
if(scalar @addresses) {
foreach my $address (@addresses) {
if (($address =~ /</) and ($address !~ /\"/) and ($address !~ /^</)){
$address="\"$address";
$address=~ s/</\"</g;
}
}
$str = join(',',@addresses);
}
print "$str\n";
As I see, you want to replace each:
comma and following spaces,
occurring after either @ or >,
with a newline.
To make such replacement, instead of writing a parsing program, you can use
a regex.
The search part can be as follows:
([^@>]+[@>][^,]+),\s*
Details:
( - Start of the 1st capturing group.
[^@>]+ - A non-empty sequence of chars other than @ or >.
[@>] - Either @ or >.
[^,]+ - A non-empty sequence of chars other than a comma.
) - End of the 1st capturing group.
,\s* - A comma and optional sequence of spaces.
The replace part should be:
$1 - The 1st capturing group.
\n - A newline.
So the whole program, much shorter than yours, can be as follows:
my $str='mail1, local<mail1@mail.local>, mail2@mail.local, <mail3@mail.local>, mail4, local<mail4@mail.local>';
print "Before:\n$str\n";
$str =~ s/([^@>]+[@>][^,]+),\s*/$1\n/g;
print "After:\n$str\n";
To replace all needed commas I used the g option.
Note that I put the source string in single quotes, otherwise Perl
would have complained about Possible unintended interpolation of @mail.
Edit
Your modified requirements must be handled in a different way.
"Ordinary" replacement is not an option, because now there are some
fragments to match and some fragments to ignore.
So the basic idea is to write a while loop with a matching regex:
(\w+),?\s+(\w+)(<[^>]+>), meaning:
(\w+) - First capturing group - a sequence of word chars (e.g. mail1).
,?\s+ - Optional comma and a sequence of spaces.
(\w+) - Second capturing group - a sequence of word chars (e.g. local).
(<[^>]+>) - Third capturing group - a sequence of chars other than >
(actual mail address), enclosed in angle brackets, e.g. <mail1@mail.local>.
Within each execution of the loop you have access to the groups
captured in this particular match ($1, $2, ...).
So the content of this loop is to print all these captured groups,
with required additional chars.
The code (again much shorter than yours) should look like below:
my $str = 'mail1, local<mail1@mail.local>, mail2@mail.local, <mail3@mail.local>, mail4 local<mail4@mail.local>';
while ($str =~ /(\w+),?\s+(\w+)(<[^>]+>)/g) {
print "\"$1, $2\"$3\n";
}
Here is an approach using split, which in this case also needs a careful regex
use warnings;
use strict;
use feature 'say';
my $string = # broken into two parts for readability
q(mail1, local<mail1@mail.local>, mail2@mail.local, )
. q(<mail3@mail.local>, mail4, local<mail4@mail.local>);
my @addresses = split /@.+?\K,\s*/, $string;
say for @addresses;
The split takes a full regex in its delimiter specification. In this case I figure that each record is delimited by a comma which comes after the email address, so @.+?,
To match a pattern only when it is preceded by another brings to mind a lookbehind before the comma. But those can't be of variable length, which is precisely the case here.
We can instead normally match the pattern @.+? and then use the \K form (of the lookbehind) which drops all previous matches so that they are not taken out of the string. Thus the above splits on ,\s* when that is preceded by the email address, @... (which isn't consumed).
It prints
mail1, local<mail1@mail.local>
mail2@mail.local
<mail3@mail.local>
mail4, local<mail4@mail.local>
The edit asks about quoting the description preceding <...> when it's there. A simple way is to make another pass once addresses have been parsed out of the string as above. For example
my @addresses = split /@.+?\K,\s*/, $string; #/ stop syntax highlight
s/(.+?,\s*.+?)</"$1"</ for @addresses;
say for @addresses;
The regex in a loop is one way to change elements of an array. I use it for its efficiency (changes elements in place), conciseness, and as a demonstration of the following properties.
In a foreach loop the index variable (or $_) is an alias for the currently processed element – so changing it changes that element. This is a known source of bugs when allowed unknowingly, which was another reason to show it in the above form.
The statement also uses the statement modifier and it is equivalent to
foreach my $elem (@addresses) {
$elem =~ s/(.+?,\s*.+?)</"$1"</;
}
This is often considered a more proper way to write it but I find that the other form emphasizes more clearly that elements are being changed, when that is the sole purpose of the foreach.

Replace comma with space in just one field - from a .CSV file

I have happened upon a problem with a program that parses through a CSV file with a few million records: two fields in each line have comments that users have put in, and sometimes they use commas within their comments. If commas are input, that field will be contained in double quotes. I need to replace any commas found in those fields with a space. Here is one such line from the file to give you an idea -
1925,47365,2,650187016,1,1,"MADE FOR DRAWDOWNS, NEVER P/U",16,IFC 8112NP,Standalone-6,,,44,10/22/2015,91607,,B24W02651,,"PA-3, PURE",4/28/2015,1,0,,1,MAN,,CUST,,CUSTOM MATCH,0,TRUE,TRUE,O,C48A0D001EF449E3AB97F0B98C811B1B,POS.MISTINT.V0000.UP.Q,PROD_SMISA_BK,414D512050524F445F504F5331393235906F28561D2F0020,10/22/2015 9:29,10/22/2015 9:30
NOTE - I do not have the Text::CSV module available to me, nor will it be made available in the server I am using.
Here is part of my code in parsing this file. The first thing I do is concatenate the very first three fields and prepend that concatenated field to each line. Then I want to clear out the commas in @fields[7,19], then format the DATE in three fields and the DATETIME in two fields. The only line I can't figure out is clearing out those commas -
my @data;
# Read the lines one by one.
while ( $line = <$FH> ) {
# split the fields, concatenate the first three fields,
# and add it to the beginning of each line in the file
chomp($line);
my @fields = split(/,/, $line);
unshift @fields, join '_', @fields[0..2];
# remove user input commas in fields[7,19]
$_ = for fields[7,19];
# format DATE and DATETIME fields for MySQL/sqlbatch60
$_ = join '-', (split /\//)[2,0,1] for @fields[14,20,23];
$_ = Time::Piece->strptime($_,'%m/%d/%Y %H:%M')->strftime('%Y-%m-%d %H:%M') for @fields[38,39];
# write the parsed record back to the file
push @data, \@fields;
}
If it is ONLY the eighth field that is troubling AND you know exactly how many fields there should be, you can do it this way
Suppose the total number of fields is always N
Split the line on commas ,
Separate and store the first six fields
Separate and store the last n fields, where n is N-8
Rejoin what remains with commas ,. This now forms field 8
and then do what ever you like to do with it. For example, write it to a proper CSV file
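The steps above can be sketched roughly as follows. The helper name merge_field and its parameters are hypothetical, and the sketch joins the leftover pieces with a space (what the question asks for) rather than with commas.

```perl
use strict;
use warnings;

# Hypothetical helper. $expected is the known field count (N above) and
# $bad is the 0-based index of the one field that may contain commas.
sub merge_field {
    my ($line, $expected, $bad) = @_;
    my @fields = split /,/, $line, -1;     # -1 keeps trailing empty fields
    my $extra  = @fields - $expected;      # number of commas typed by the user
    if ($extra > 0) {
        # Pull out the pieces of the over-split field and glue them together.
        my $merged = join ' ', splice(@fields, $bad, $extra + 1);
        splice(@fields, $bad, 0, $merged);
    }
    return @fields;
}

# Shortened, made-up line: 4 fields expected, the third may contain commas.
my @fields = merge_field('1925,47365,"MADE FOR DRAWDOWNS, NEVER P/U",16', 4, 2);
print scalar(@fields), " fields; third field is $fields[2]\n";
```

This only works when a single field per line can contain stray commas, which is the assumption stated at the start of this answer.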
Text::CSV_XS handles quoted commas just fine:
#!/usr/bin/perl
use warnings;
use strict;
use Text::CSV_XS qw{ csv };
my $aoa = csv(in => 'file.csv'); # The file contains the sample line.
print $aoa->[0][6];
Note: The two main versions below clean up one field. The most recent change in the question states that there are, in fact, two such fields. The third version, at the end, works with any number of bad fields.
All code has been tested with the supplied example and its variations.
Following clarifications, this deals with the case when the file need be processed by hand. A module is easily recommended for parsing .csv, but there is a problem here: reliance on the user to enter double quotes. If they end up not being there we have a malformed file.
I take it that the number of fields in the file is known with certainty and ahead of time.
The two independent solutions below use either array or string processing.
(1) The file is being processed line by line anyway, the line being split already. If there are more fields than expected, join the extra array elements by space and then overwrite the array with correct fields. This is similar to what is outlined in the answer by vanHoesel.
use strict;
use warnings;
my $num_fields = 39; # what should be, using the example
my $ibad = 6; # index of the malformed field-to-be
my @last = (-($num_fields-$ibad-1)..-1); # index-range, rest of fields
my $file = "file.csv";
open my $fh, '<', $file or die "Can't open $file: $!";
while (my $line = <$fh>) { # chomp it if needed
my @fields = split ',', $line;
if (@fields != $num_fields) {
# join extra elements by space
my $fixed = join ' ', @fields[$ibad..$ibad+@fields-$num_fields];
# overwrite array by good fields
@fields = (@fields[0..$ibad-1], $fixed, @fields[@last]);
}
# Process @fields normally
print "@fields";
}
close $fh;
(2) Preprocess the file, only checking for malformed lines and fixing them as needed. Uses string manipulations. (Or, the method above can be used.) The $num_fields and $ibad are the same.
while (my $line = <$fh>) {
# Number of fields: commas + 1 (tr|,|| counts number of ",")
my $have_fields = $line =~ tr|,|| + 1;
if ($have_fields != $num_fields) {
# Get indices of commas delimiting the bad field
my ($beg, $end) = map {
my $p = '[^,]*,' x $_;
$line =~ /^$p/ and $+[0]-1;
} ($ibad, $ibad+$have_fields-$num_fields);
# Replace extra commas and overwrite that part of the string
my $bad_field = substr($line, $beg+1, $end-$beg-1);
(my $fixed = $bad_field) =~ tr/,/ /;
substr($line, $beg+1, $end-$beg-1) = $fixed;
}
# Perhaps write the line out, for a corrected .csv file
print $line;
}
In the last line the bad part of $line is overwritten by assigning to substr, which this function allows. The new substring $fixed is constructed with commas changed (or removed, if desired), and used to overwrite the bad part of the $line. See docs.
If quotes are known to be there a regex can be used. This works with any number of bad fields.
while (my $line = <$fh>) {
$line =~ s/"([^"]+)"/join ' ', split(',', $1)/eg; # "
# process the line. note that double quotes are removed
}
If the quotes are to be kept move them inside the parentheses, to be captured as well.
This one line is all that need be done after while (...) { to clean up data.
The /e modifier makes the replacement side be evaluated as code, instead of being used as a double-quoted string. There the matched part of the line (between ") is split by comma and then joined by space, thus fixing the field. See the last item under "Search and replace" in perlretut.
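As a hedged illustration of the /e idea, here is a variant that keeps the quotes by capturing them. It swaps the split/join for tr with the /r flag, which needs Perl 5.14 or later; the helper name blank_quoted_commas is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: replace commas inside double-quoted fields with
# spaces, keeping the quotes. Requires Perl 5.14+ for tr///r.
sub blank_quoted_commas {
    my ($line) = @_;
    # The replacement side is code: tr///r returns a changed copy of $1.
    $line =~ s{("[^"]+")}{ $1 =~ tr/,/ /r }eg;
    return $line;
}

# Shortened, made-up line with two quoted fields.
print blank_quoted_commas('1,"MADE FOR DRAWDOWNS, NEVER P/U",16,,"PA-3, PURE",x'), "\n";
```

Like the substitution above, this relies on the quotes actually being present and balanced on every affected line.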
All code has been tested with multiple lines and multiple commas in the bad field.

What's happening in this Perl foreach loop?

I have this Perl code:
foreach (@tmp_cycledef)
{
chomp;
my ($cycle_code, $close_day, $first_date) = split(/\|/, $_,3);
$cycle_code =~ s/^\s*(\S*(?:\s+\S+)*)\s*$/$1/;
$close_day =~ s/^\s*(\S*(?:\s+\S+)*)\s*$/$1/;
$first_date =~ s/^\s*(\S*(?:\s+\S+)*)\s*$/$1/;
#print "$cycle_code, $close_day, $first_date\n";
$cycledef{$cycle_code} = [ $close_day, split(/-/,$first_date) ];
}
The value of tmp_cycledef comes from output of an SQL query:
select cycle_code,cycle_close_day,to_char(cycle_first_date,'YYYY-MM-DD')
from cycle_definition d
order by cycle_code;
What exactly is happening inside the for loop?
Huh, I'm surprised no one fixed it for you :)
It looks like the person who wrote this was trying to trim leading and trailing whitespace from each field. It's a really odd way to do that, and for some reason he was overly concerned with interior whitespace in each field despite his anchors.
I think that should be the same as trimming the whitespace around the delimiter in the split:
foreach (@tmp_cycledef)
{
s/^\s+//; s/\s+$//; # leading and trailing whitespace on the whole string
my ($cycle_code, $close_day, $first_date) = split(/\s*\|\s*/, $_, 3);
$cycledef{$cycle_code} = [ $close_day, split(/-/,$first_date) ];
}
The key to thinking about split is considering which parts of the string you want to throw away, not just what separates the fields that you want.
For the regex part, s/^\s*(\S*(?:\s+\S+)*)\s*$/$1/ strips leading and trailing whitespace.
Each row in @tmp_cycledef is a string formatted as "cycle_code | close_day | first_date".
my ($cycle_code, $close_day, $first_date) = split(/\|/, $_,3);
Split the string into three parts. The following regular expressions are used to strip leading and trailing whitespaces.
The last instruction of the loop creates an entry in the hash %cycledef indexed by $cycle_code. The entry is formatted using the following scheme:
[ $close_day, YYYY, MM, DD ]
where $first_date = "YYYY-MM-DD".
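Put together, the loop body can be sketched like this. The helper name parse_row and the sample rows are made up for illustration, and the whitespace is trimmed in the split pattern instead of with the three substitutions.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one "code | day | YYYY-MM-DD" row into a key
# and an array reference [ day, YYYY, MM, DD ].
sub parse_row {
    my ($row) = @_;
    $row =~ s/^\s+//;       # leading whitespace
    $row =~ s/\s+\z//;      # trailing whitespace, including the newline
    my ($cycle_code, $close_day, $first_date) = split /\s*\|\s*/, $row, 3;
    return ($cycle_code, [ $close_day, split(/-/, $first_date) ]);
}

my %cycledef;
for my $row (" 01 | 15 | 2010-01-31 \n", "02|28|2010-02-28\n") {   # made-up rows
    my ($code, $entry) = parse_row($row);
    $cycledef{$code} = $entry;
}
print join(',', @{ $cycledef{'01'} }), "\n";
```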
@tmp_cycledef: The output of the sql query is stored in this array
foreach (@tmp_cycledef) : For every element in this array.
chomp : remove the \n char from the end of every element.
my ($cycle_code, $close_day, $first_date) = split(/\|/, $_,3);
Split each element into 3 parts and assign each part to a variable. The arguments of split are "split(/PATTERN/,EXPR,LIMIT)".
$cycle_code =~ s/^\s*(\S*(?:\s+\S+)*)\s*$/$1/;
$close_day =~ s/^\s*(\S*(?:\s+\S+)*)\s*$/$1/;
$first_date =~ s/^\s*(\S*(?:\s+\S+)*)\s*$/$1/;
This regex part strips leading and trailing whitespace from each variable.
my god, it's been such a long time since I've read perl... but I'll give it a shot.
you grab a record from @tmp_cycledef, and chomp off the newline at the end, and split it up into the three variables: then, like S.Mark said, each substitution regex strips off the leading and trailing whitespace for each of the three variables. Finally, the values get pushed into a hash as a list, with some debugging code commented out right above it.
hth
Your query gives a set of rows that are stored in the array @tmp_cycledef.
We iterate over each row in the result using: foreach (@tmp_cycledef).
The result rows might have a trailing newline char; we get rid of it using chomp.
Next we split the row (which is now in $_) on the pipe and assign the first 3 pieces to $cycle_code, $close_day and $first_date respectively.
The split pieces might have leading and trailing white space; the next 3 lines remove the leading and trailing white space in the 3 variables.
Finally we make an entry in the hash %cycledef. The key used is $cycle_code and the value is an array whose first element is $close_day and whose remaining elements are the pieces obtained by splitting $first_date on the hyphen.