Perl parsing Text File with regular expression - perl

I have a file with the following random structures:
USMS 1362224754632|<REQ MSISDN="00966590832186" CONTRACT="580" SUBSCRIPTION="AAA" FORMAT="ascii" TEXT="L2"
or
USMS 1362224754632|<REQ MSISDN="00966590832186" CONTRACT="580" SUBSCRIPTION="BBB" THRESHOLDID="1" FORMAT="ascii" TEXT="L2"
I am trying to parse it with perl to get the values like the following:
1362224754632;00966590832186;580;AAA;L2
Below is the code:
if($Record =~ /USMS (.*?)|<REQ MSISDN="(.*?)" CONTRACT="(.*?)" SUBSCRIPTION="(.*?)" FORMAT="(.*?)" THRESHOLDID="(.*?)" TEXT="(.*?)"/)
{
print LOGFILE "$1;$2;$3;$4;$5;$6;$7\n";
}
elsif($Record =~ /USMS (.*?)|<REQ MSISDN="(.*?)" CONTRACT="(.*?)" SUBSCRIPTION="(.*?)" FORMAT="(.*?)" TEXT="(.*?)"/)
{
print LOGFILE "$1;$2;$3;$4;$5;$6\n";
}
But I always get:
;;;;;

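The pipe (|) is a special character in regular expressions: unescaped, it means alternation, so your pattern matches just the USMS (.*?) alternative and the captures come back empty. Escape it as \| to match a literal pipe. Also note that in your sample data THRESHOLDID comes before FORMAT, so the attributes in the pattern must appear in that order:
if($Record =~ /USMS (.*?)\|<REQ MSISDN="(.*?)" CONTRACT="(.*?)" SUBSCRIPTION="(.*?)" THRESHOLDID="(.*?)" FORMAT="(.*?)" TEXT="(.*?)"/)
Escape the pipe in the elsif branch in the same way.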
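To see the effect of the unescaped pipe in isolation, here is a minimal sketch. The shortened record and the variable names are illustrative, not taken from the question.

```perl
use strict;
use warnings;

# Shortened, illustrative record in the same shape as the question's data.
my $record = 'USMS 1362224754632|<REQ MSISDN="00966590832186" CONTRACT="580"';

# Unescaped: the pattern means "USMS (.*?)" OR "<REQ MSISDN=...".
# The left alternative matches immediately and the lazy (.*?) captures nothing;
# the group in the right alternative never takes part in the match.
my @unescaped = $record =~ /USMS (.*?)|<REQ MSISDN="(.*?)"/;

# Escaped: \| matches a literal pipe, so both groups capture.
my @escaped = $record =~ /USMS (.*?)\|<REQ MSISDN="(.*?)"/;

print join(';', map { $_ // '' } @unescaped), "\n";
print join(';', @escaped), "\n";
```

The first print shows only separators, which is the symptom described in the question; the second shows the id and the MSISDN.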

Instead of using a single regex, I would split the data into its separate sections first, then approach them separately.
my($usms_part, $request) = split / \s* \|<REQ \s* /x, $Record;
my($usms_id) = $usms_part =~ /^USMS (\d+)$/;
my %request;
while( $request =~ /(\w+)="(.*?)"/g ) {
$request{$1} = $2;
}
Rather than having to hard code all the possible key/value pairs, and their possible orderings, you can parse them generically in one piece of code.
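Put together, the split-then-hash approach might look like the sketch below. The helper name format_record and the choice of output fields are assumptions made for illustration, based on the output the question asks for.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one record into the semicolon-separated line
# requested in the question. Returns nothing if the record does not parse.
sub format_record {
    my ($record) = @_;
    my ($usms_part, $request) = split / \s* \|<REQ \s* /x, $record, 2;
    return unless defined $request;
    my ($usms_id) = $usms_part =~ /^USMS (\d+)$/ or return;

    # Collect every KEY="value" pair, whatever the order.
    my %request;
    while ( $request =~ /(\w+)="(.*?)"/g ) {
        $request{$1} = $2;
    }

    # Pick the wanted attributes by name; optional ones such as
    # THRESHOLDID are simply ignored.
    return join ';', $usms_id,
        map { $request{$_} // '' } qw(MSISDN CONTRACT SUBSCRIPTION TEXT);
}

my @records = (
    'USMS 1362224754632|<REQ MSISDN="00966590832186" CONTRACT="580" SUBSCRIPTION="AAA" FORMAT="ascii" TEXT="L2"',
    'USMS 1362224754632|<REQ MSISDN="00966590832186" CONTRACT="580" SUBSCRIPTION="BBB" THRESHOLDID="1" FORMAT="ascii" TEXT="L2"',
);
print format_record($_), "\n" for @records;
```

Because the fields are looked up by name, the same code handles both record layouts without a second regex.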

Change
(.*?)
to
([a-zA-Z0-9]*)

It looks like all you want is the fields contained in double-quotes.
That looks like this
use strict;
use warnings;
while (<DATA>) {
my @values = /"([^"]+)"/g;
print join(';', @values), "\n";
}
__DATA__
USMS 1362224754632|<REQ MSISDN="00966590832186" CONTRACT="580" SUBSCRIPTION="AAA" FORMAT="ascii" TEXT="L2"
USMS 1362224754632|<REQ MSISDN="00966590832186" CONTRACT="580" SUBSCRIPTION="BBB" THRESHOLDID="1" FORMAT="ascii" TEXT="L2"
output
00966590832186;580;AAA;ascii;L2
00966590832186;580;BBB;1;ascii;L2

Related

perl regex too greedy

I went through similar questions asked by other members and tried to apply their solutions, but they did not work for my issue. My pattern match and grouping are too greedy and do not stop at the first pipe (|). If I make the pattern more specific I think it could work, but I'm trying to figure out how to stop the match at the first instance of the pipe.
Here are a couple of lines:
09:30:00.063|IN:|8=FIX.4.2|9=206|35=D|34=5159|49=CLIENT|52=20191024-13:30:00.050|56=SERV|57=DEST|1=05033|11=ABZ5702|15=USD|21=1|38=2000|40=2|44=92.48|47=A|54=5|55=RC|60=20191024-13:30:00.050|111=0|114=N|336=X|5700=AP|9281=SOV|10=202
09:37:21.208|IN:|8=FIX.4.2|9=170|35=D|34=5184|49=CLIENT|52=20191024-13:37:21.206|56=SERV|57=ATXB|1=J5129|11=136404|15=USD|21=1|38=100|40=2|44=1.39|47=A|54=2|55=DIW|59=2|60=20191024-13:30:00.206|10=029
I'm expecting my perl script to return the following output from the above data:
09:30:00.063|13:30:00.050|ABZ5702
09:37:21.208|13:37:21.206|136404
I tried all of these and a few other variations, but could not get any of them to produce the above output:
#$msg =~ s/([^|]*).*|52=([^|]*).*|11=([^|]*).*/$1|$2|$3/;
$msg =~ s/(.+)\|??.*|52=([^|]*).*|11=([^|]*).*/$1|$2|$3/;
#$msg =~ s/^([^|]*).??|52=([^|]*).??|11=([^|]*).*/$1|$2|$3/;
#$msg =~ s/^([^\|??]*).*|52=([^\|??]*).*|11=([^\|??]*).*/$1|$2|$3/;
#$msg =~ s/(.*\|??).*|52=(.+\|??).*|11=(.+\|??).*/one $1|two $2|three $3/;
#$msg =~ s/(.*?|).*|52=(.*?|).*|11=(.*|?).*/$1|$2|$3/;
#$msg =~ /(.*)|??.*|52=(.*)|??.*|11=(.*)|??.*/$1|$2|$3/;
#$msg =~ s/|.*-[0-3][0-9]:/|/;
print "$msg\n";
I realize there is more than one way to skin the cat, but there are cases where I need to use the pattern-match approach. How can I get it to produce the expected output using pattern matching, with each group stopping at the first pipe (|)? Can someone tell me what I am doing wrong?
Try this:
s/(.*?)\|.*\|52=([^|]*).*\|11=([^|]*).*/$1 $2 $3/;
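A couple of the pipe delimiters needed escaping: an unescaped | means alternation. Note that this replacement joins the fields with spaces; use $1|$2|$3 in the replacement if you want pipes in the output.
You also need to look at non-greedy matching: https://docstore.mik.ua/orelly/perl/cookbook/ch06_16.htm
The first matching group is (.*?) instead of (.*). The ? after the quantifier means we match as little as possible.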
In general, for parsing FIX in perl, as long as there are no repeating groups, I would recommend splitting on | first and then creating a hash of tag-value pairs.
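A minimal sketch of that split-and-hash idea follows. The helper name summarize_fix_line is hypothetical, the sample line is shortened from the question, and the sketch assumes no repeating groups, so each tag appears at most once.

```perl
use strict;
use warnings;

# Hypothetical helper: split one log line on "|" and index the FIX
# fields by tag number, then build "logtime|sendingtime|clordid".
sub summarize_fix_line {
    my ($line) = @_;
    chomp $line;
    my ($log_time, $direction, @fields) = split /\|/, $line;

    my %fix;
    for my $field (@fields) {
        my ($tag, $value) = split /=/, $field, 2;
        $fix{$tag} = $value;
    }
    return unless defined $fix{52} && defined $fix{11};

    # Tag 52 (SendingTime) looks like 20191024-13:30:00.050;
    # keep only the part after the date, if there is one.
    my ($sending_time) = $fix{52} =~ /-(.*)$/;
    $sending_time //= $fix{52};

    return join '|', $log_time, $sending_time, $fix{11};
}

# Shortened version of the first sample line from the question.
my $line = '09:30:00.063|IN:|8=FIX.4.2|35=D|52=20191024-13:30:00.050|56=SERV|11=ABZ5702|10=202';
print summarize_fix_line($line), "\n";
```

Looking fields up by tag number means the result does not depend on where tags 52 and 11 sit in the message.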
I would do it a little differently: split the line into an array and work on the individual elements of the array.
A regex may be an acceptable solution for one particular case, if the format of the line is fixed and will never change.
use strict;
use warnings;
use Data::Dumper;
my $debug = 0;
while( my $line = <DATA> ) {
my @array = split /\|/, $line;
print Dumper(\@array) if $debug;
$array[7] =~ s/.+?-//;
$array[11] =~ s/\d+=//;
printf "%s\n", join '|', @array[0,7,11];
}
__DATA__
09:30:00.063|IN:|8=FIX.4.2|9=206|35=D|34=5159|49=CLIENT|52=20191024-13:30:00.050|56=SERV|57=DEST|1=05033|11=ABZ5702|15=USD|21=1|38=2000|40=2|44=92.48|47=A|54=5|55=RC|60=20191024-13:30:00.050|111=0|114=N|336=X|5700=AP|9281=SOV|10=202
09:37:21.208|IN:|8=FIX.4.2|9=170|35=D|34=5184|49=CLIENT|52=20191024-13:37:21.206|56=SERV|57=ATXB|1=J5129|11=136404|15=USD|21=1|38=100|40=2|44=1.39|47=A|54=2|55=DIW|59=2|60=20191024-13:30:00.206|10=029

Perl Single Quote replacement

I've been struggling for the last few days with a character replacement in Perl:
I have a string which is surrounded by single quotes, yet inside that string I have a name which contains a single quote, let's say O'Neil. Because the string is surrounded by single quotes, Perl treats the single quote in the name as the end of the string.
Surrounding the entire string in double quotes is not an option, since it's built from a URL.
Now, I did some research and didn't find anything, now I'm asking y'all:
I've tried to play around with the following syntax:
$Daten =~ s/\'/\\'/g; which of course doesn't work...
$Daten is the entire string which contains the name O'Neil.
Now, I want to replace the single quote with a backslash followed by a quote: ' -> \'
Does anyone have any ideas?
Best regards,
Ionut Sanda
Perhaps something like the following code would meet your requirements:
use strict;
use warnings;
my $debug = 1;
while( my $line = <DATA> ) {
$line =~ s/(.*)'(.+)'(.+)'(.*)/$1'$2\\'$3'$4/g;
print $line if $debug;
}
__DATA__
'USER1:O'NEILL:PATRICK:M:lastname_firstname@company.com'
datax 'USER1:O'NEILL:PATRICK:M:lastname_firstname@company.com' datay
output
'USER1:O\'NEILL:PATRICK:M:lastname_firstname@company.com'
datax 'USER1:O\'NEILL:PATRICK:M:lastname_firstname@company.com' datay
Well, as you do not provide a sample or your code I have to improvise
use strict;
use warnings;
my $debug = 1;
while( my $Daten = <DATA> ) {
$Daten =~ s/(.*)'(.+)'(.+)'(.*)/$1'$2\\'$3'$4/g; # Magic happens here
print $Daten if $debug;
}
__DATA__
'USER1:O'NEILL:PATRICK:M:lastname_firstname@company.com'
datax 'USER1:O'NEILL:PATRICK:M:lastname_firstname@company.com' datay
output
'USER1:O\'NEILL:PATRICK:M:lastname_firstname@company.com'
datax 'USER1:O\'NEILL:PATRICK:M:lastname_firstname@company.com' datay
Otherwise I do not have enough information to understand your problem (no sample of data, no snippet of the code).
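The substitution above handles exactly one inner quote. If the name could contain several, a possible variation is sketched below; the helper name escape_inner_quotes is hypothetical, and the sketch assumes each input holds a single quoted string, so the first and last single quotes are the delimiters.

```perl
use strict;
use warnings;

# Hypothetical helper: keep the first and last single quote as
# delimiters and backslash-escape every single quote between them.
sub escape_inner_quotes {
    my ($text) = @_;
    if ( $text =~ /^([^']*')(.*)('[^']*)$/s ) {
        my ($head, $inner, $tail) = ($1, $2, $3);
        $inner =~ s/'/\\'/g;
        return $head . $inner . $tail;
    }
    return $text;    # fewer than two quotes: nothing to escape
}

print escape_inner_quotes("datax 'USER1:O'NEILL:PATRICK' datay"), "\n";
```

Text with fewer than two single quotes is returned unchanged.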

Perl text filtering

How can I filter this text on a log file:
/servicios/busquedas?colecciones=29&orden=score&recursos=rango-1-20&query=%28%28%28texto%3A%28periodos+AND+contractuales%29%29+OR+%28title%3A%28periodos+AND+contractuales%29%29+OR+%28%28extra%3A%28periodos+AND+contractuales%29%29%5E0.5%29+OR+%28%28title%3A%28%22periodos+contractuales%22%7E15%29%29%5E5%29+OR+%28%28extra%3A%28%22periodos+contractuales%22%7E15%29%29%5E3%29+OR+%28%28texto%3A%28%22periodos+contractuales%22%7E15%29%29%5E3%29%29%29 tardo 0.115818977355957 (network 0.111818977355957)
To get only this:
periodos contractuales
I've tried it with split, but I can't find a consistent character to split on. The words periodos and contractuales change all the time!
When it's only the periodos and contractuales parts, this should work:
if ( $string =~ m{periodos} && $string =~ m{contractuales} ) {
print q{periodos contractuales};
}
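Since the question says the words change all the time, a more general option is to URL-decode the query parameter and read the terms out of it. The sketch below is only one way to do that: the helper name search_terms is hypothetical, the sample line is shortened from the question, and it assumes the terms always appear in a texto:(word AND word) clause.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the search words out of one log line.
sub search_terms {
    my ($line) = @_;

    # Grab the raw value of the query parameter.
    my ($query) = $line =~ /[?&]query=(\S+)/ or return;

    # Undo URL encoding: "+" is a space, %XX is a byte in hex.
    $query =~ tr/+/ /;
    $query =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;

    # Assumption: the first texto:( ... ) clause lists the words joined by AND.
    my ($terms) = $query =~ /texto:\(([^()"]+)\)/ or return;
    return join ' ', split /\s+AND\s+/, $terms;
}

# Shortened version of the log line from the question.
my $log_line = '/servicios/busquedas?colecciones=29&query=%28texto%3A%28periodos+AND+contractuales%29%29 tardo 0.115';
print search_terms($log_line), "\n";
```

If the log format differs from this assumption, only the texto pattern should need adjusting.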

Perl split() Function Not Handling Pipe Character Saved As A Variable

I'm running into a little trouble with Perl's built-in split function. I'm creating a script that edits the first line of a CSV file which uses a pipe for column delimitation. Below is the first line:
KEY|H1|H2|H3
However, when I run the script, here is the output I receive:
Col1|Col2|Col3|Col4|Col5|Col6|Col7|Col8|Col9|Col10|Col11|Col12|Col13|
I have a feeling that Perl doesn't like the fact that I use a variable to actually do the split, and in this case, the variable is a pipe. When I replace the variable with an actual pipe, it works perfectly as intended. How could I go about splitting the line properly when using pipe delimitation, even when passing in a variable? Also, as a silly caveat, I don't have permissions to install an external module from CPAN, so I have to stick with built-in functions and modules.
For context, here is the necessary part of my script:
our $opt_h;
our $opt_f;
our $opt_d;
# Get user input - filename and delimiter
getopts("f:d:h");
if (defined($opt_h)) {
&print_help;
exit 0;
}
if (!defined($opt_f)) {
$opt_f = &promptUser("Enter the Source file, for example /qa/data/testdata/prod.csv");
}
if (!defined($opt_d)) {
$opt_d = "\|";
}
my $delimiter = "\|";
my $temp_file = $opt_f;
my @temp_file = split(/\./, $temp_file);
$temp_file = $temp_file[0]."_add-headers.".$temp_file[1];
open(source_file, "<", $opt_f) or die "Err opening $opt_f: $!";
open(temp_file, ">", $temp_file) or die "Error opening $temp_file: $!";
my $source_header = <source_file>;
my @source_header_columns = split(/${delimiter}/, $source_header);
chomp(@source_header_columns);
for (my $i=1; $i<=scalar(@source_header_columns); $i++) {
print temp_file "Col$i";
print temp_file "$delimiter";
}
print temp_file "\n";
while (my $line = <source_file>) {
print temp_file "$line";
}
close(source_file);
close(temp_file);
The first argument to split is a compiled regular expression or a regular expression pattern. If you want to split on the text |, you need to pass a pattern that matches |.
quotemeta creates a pattern from a string that matches that string.
my $delimiter = '|';
my $delimiter_pat = quotemeta($delimiter);
split $delimiter_pat
Alternatively, quotemeta can be accessed as \Q..\E inside double-quoted strings and the like.
my $delimiter = '|';
split /\Q$delimiter\E/
The \E can even be omitted if it's at the end.
my $delimiter = '|';
split /\Q$delimiter/
I mentioned that split also accepts a compiled regular expression.
my $delimiter = '|';
my $delimiter_re = qr/\Q$delimiter/;
split $delimiter_re
If you don't mind hardcoding the regular expression, that's the same as
my $delimiter_re = qr/\|/;
split $delimiter_re
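The difference can be seen in a short, self-contained sketch. The variable names are illustrative, and the header is the one from the question without its trailing newline.

```perl
use strict;
use warnings;

my $header = "KEY|H1|H2|H3";

# In a double-quoted string "\|" is just "|", so the regex below is an
# alternation of two empty patterns and matches between every character.
my $delimiter = "\|";
my @per_char  = split /$delimiter/, $header;

# \Q...\E (quotemeta) escapes the pipe, so split sees a literal "|".
my @columns = split /\Q$delimiter\E/, $header;

print scalar(@per_char), " pieces without quoting\n";
print scalar(@columns),  " columns with quoting: @columns\n";
```

Without quoting, split returns one piece per character, which is why the script in the question wrote a Col header for every character of the line instead of one per column.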
First, | isn't special inside double quotes, so "\|" is just the string |. Setting $delimiter to "|" and quoting it later in the regex would work; setting $delimiter to "\\|" would also work on its own.
Second, | is special inside a regex, so you need to quote it there. The safest way to do that is to ask Perl to quote the string for you. Use the \Q...\E construct within the regex to mark the data you want quoted.
my @source_header_columns = split(/\Q${delimiter}\E/, $source_header);
see: http://perldoc.perl.org/perlre.html
It seems that all you want to do is count the fields in the header and print the new header. Might I suggest something a bit simpler than using split?
my $str="KEY|H1|H2|H3";
my $count=0;
$str =~ s/\w+/"Col" . ++$count/eg;
print "$str\n";
This works with almost any delimiter (except alphanumerics and the underscore). It also saves the number of fields in $count, in case you need it later.
Here's another version. This one uses character class brackets instead, to specify "any character but this", which is just another way of defining a delimiter. You can specify the delimiter on the command line. You can use your getopts as well, but I just used a simple shift.
my $d = shift || '[^|]';
if ( $d !~ /^\[/ ) {
$d = '[^' . $d . ']';
}
my $str="KEY|H1|H2|H3";
my $count=0;
$str =~ s/$d+/"Col" . ++$count/eg;
print "$str\n";
Inside a bracketed character class most regex metacharacters, including |, lose their special meaning, so you do not need to escape them (the exceptions are ], ^, - and \).
#!/usr/bin/perl
use Data::Dumper;
use strict;
my $delimeter="\\|";
my $string="A|B|C|DD|E";
my @arr=split(/$delimeter/,$string);
print Dumper(@arr)."\n";
output:
$VAR1 = 'A';
$VAR2 = 'B';
$VAR3 = 'C';
$VAR4 = 'DD';
$VAR5 = 'E';
It seems you need to define the delimiter as \\|

How can I detect symbols using regular expression in perl?

How can I use a regular expression to check whether a word starts or ends with a symbol character, and how can I process the text within the symbols?
Example:
(text) or te-xt, or tex't. or text?
change it to
(<t>text</t>) or <t>te-xt</t>, or <t>tex't</t>. or <t>text</t>?
help me out?
Thanks
I assume that "word" means alphanumeric characters from your example? If you have a list of permitted characters which constitute a valid word, then this is enough:
my $string = "x1 .text1; 'text2 \"text3;\"";
$string =~ s/([a-zA-Z0-9]+)/<t>$1<\/t>/g;
# Add more to character class [a-zA-Z0-9] if needed
print "$string\n";
# OUTPUT: <t>x1</t> .<t>text1</t>; '<t>text2</t> "<t>text3</t>;"
UPDATE
Based on your example you seem to want to DELETE dashes and apostrophes, if you want to delete them globally (e.g. whether they are inside the word or not), before the first regex, you do
$string =~ s/['-]//g;
I am using DVK's approach here, but with a slight modification. The difference is that her/his code would also put the tags around all words that don't contain/are next to a symbol, which (according to the example given in the question) is not desired.
#!/usr/bin/perl
use strict;
use warnings;
sub modify {
my $input = shift;
my $text_char = 'a-zA-Z0-9\-\''; # characters that are considered text
# if there is no symbol, don't change anything
if ($input =~ /^[a-zA-Z0-9]+$/) {
return $input;
}
else {
$input =~ s/([$text_char]+)/<t>$1<\/t>/g;
return $input;
}
}
my $initial_string = "(text) or te-xt, or tex't. or text?";
my $expected_string = "(<t>text</t>) or <t>te-xt</t>, or <t>tex't</t>. or <t>text</t>?";
# version BEFORE edit 1:
#my @aux;
# take the initial string apart and process it one word at a time
#my @string_list = split/\s+/, $initial_string;
#
#foreach my $string (@string_list) {
# $string = modify($string);
# push @aux, $string;
#}
#
# put the string together again
#my $final_string = join(' ', @aux);
# ************ EDIT 1 version ************
my $final_string = join ' ', map { modify($_) } split/\s+/, $initial_string;
if ($final_string eq $expected_string) {
print "it worked\n";
}
This strikes me as a somewhat long-winded way of doing it, but it seemed quicker than drawing up a more sophisticated regex...
EDIT 1: I have incorporated the changes suggested by DVK (using map instead of foreach). Now the syntax highlighting is looking even worse than before; I hope it doesn't obscure anything...
This reads standard input, processes it, and prints the result to standard output.
while (<>) {
s {
( [a-zA-Z]+ ) # word
(?= [,.)?] ) # a symbol
}
{<t>$1</t>}gx ;
print ;
}
You might need to change the character class to match your concept of a word.
I have used the x modifier to allow the regex to be spread over more than one line.
If the input is in a Perl variable, try
$string =~ s{
( [a-zA-Z]+ ) # word
(?= [,.)?] ) # a symbol
}
{<t>$1</t>}gx ;