Perl script that parses CSV file excluding the contents enclosed in []


Hi there. I am struggling with a Perl script that parses an eight-column CSV line into another CSV line using the split command, but I want to exclude all the text enclosed in square brackets []. The line looks like this:
128.39.120.51,0,49788,6,SYN,[8192:127:1:52:M1460,N,W2,N,N,S:.:Windows:XP/2000 (RFC1323+, w+, tstamp-):link:ethernet/modem],1,1399385680
I used the following script, but when I print $fields[7] it gives me N, one of the fields inside the [] above. I want print "$fields[7]" to give 1399385680, which is the last field in the line above. The script I tried was:
while (my $line = <LOG>) {
    chomp $line;
    my @fields=grep { !/^[\[.*\]]$/ } split ",", $line;
    my $timestamp=$fields[7];
    print "$fields[7]";
}
Thanks for your time. I will appreciate your help.

Always include use strict; and use warnings; at the top of EVERY Perl script.
Your "csv" file isn't proper csv. So the only thing I can suggest is to remove the contents in the brackets before you split:
use strict;
use warnings;
while (<DATA>) {
    chomp;
    s/\[.*?\]//g;
    my @fields = split ',', $_;
    my $timestamp = $fields[7];
    print "$timestamp\n";
}
__DATA__
128.39.120.51,0,49788,6,SYN,[8192:127:1:52:M1460,N,W2,N,N,S:.:Windows:XP/2000 (RFC1323+, w+, tstamp-):link:ethernet/modem],1,1399385680
Outputs:
1399385680
Obviously it is possible to also capture the contents of the bracketed fields, but you didn't say that was a requirement or goal.
Update
If you want to capture the bracket delimited field, one method would be to use a regex for capturing instead.
Note, this current regex requires that each field has a value.
chomp;
my @fields = $_ =~ /(\[.*?\]|[^,]+)(?:,|$)/g;
my $timestamp = $fields[7];
print "$timestamp";
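If empty fields are possible, the capturing regex above will silently skip them and shift the indexes. Here is a rough sketch of one way around that, assuming the bracketed field never contains a closing bracket of its own. The name split_bracketed is made up for this example, not something from a module.

```perl
use strict;
use warnings;

# Hypothetical helper: split a line on commas, keeping [...] as a single
# field and preserving empty fields.
sub split_bracketed {
    my ($line) = @_;
    my @fields;
    # Each iteration consumes one field plus the comma (or end of string) after it.
    while ($line =~ /\G(\[.*?\]|[^,]*)(,|\z)/g) {
        push @fields, $1;
        last if $2 eq '';    # no comma followed, so this was the last field
    }
    return @fields;
}

my $sample = '128.39.120.51,0,49788,6,SYN,[8192:127:1:52:M1460,N,W2,N,N,S:.:Windows:XP/2000 (RFC1323+, w+, tstamp-):link:ethernet/modem],1,1399385680';
my @fields = split_bracketed($sample);
print "$fields[7]\n";    # prints 1399385680
```

The \G anchor makes every match start exactly where the previous one ended, so nothing between fields can be skipped.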

Well, if you want to actually ignore the text between square brackets, you might as well get rid of it:
while ( my $line = <LOG> ) {
    chomp $line;
    $line =~ s,\[.*?\],,; # Delete all text between square brackets
    my @fields = split ",", $line;
    my $timestamp = $fields[7];
    print $fields[7], "\n";
}

Related

Perl, Why it has a extra white space in output

input:
output:
But my output has extra white space on the last two lines.
my output:
my code:
@content = <FILE>;
foreach $line (@content){
    if($line =~ /^#(\d+)/){
        $number = $1;
        $line =~ s/^#(\d+)/$content[$number-1]/;
    }
    print "$line";
}
Any help will be appreciated.

Here's a version of your code with sample input data. If you want people to help you with problems like this, then it's a good idea to make it as easy as possible for them. Posting images of your input data does not make it easy. Also, it's a good development trick to store sample data in the DATA filehandle so that the code and data are together in the same file.
#!/usr/bin/perl
use strict;
use warnings;
my @content = <DATA>;
foreach my $line (@content){
    if($line =~ /^#(\d+)/){
        my $number = $1;
        $line =~ s/^#(\d+)/$content[$number-1]/;
    }
    print "$line";
}
__DATA__
line A
line B
line C
#7
line D
#2
line E
I've also added use strict and use warnings to your code. In this case, they don't really help, but you should get into the habit of always including them in your Perl programs.
Your problem is here:
$line =~ s/^#(\d+)/$content[$number-1]/;
Each of the lines in your @content array will include a newline character at the end. But in this line you're replacing the # symbol and the following digits with a complete other line from the array. You're not replacing the original newline and you're adding another newline (from the replacement string), so the line ends up containing two newlines.
The easiest fix is to add the newline to the pattern you are matching.
$line =~ s/^#(\d+)\n/$content[$number-1]/;
Note that an experienced Perl programmer would write your code like this:
#!/usr/bin/perl
use strict;
use warnings;
my @content = <DATA>;
for (@content){
    s/^#(\d+)\n/$content[$1 - 1]/;
    print;
}
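Another way to avoid the doubled newline, shown here only as a sketch, is to chomp every line up front and add the newline back when printing. The helper name expand_refs and the sample lines are made up for illustration, and the sketch assumes every "#N" reference points at a line that exists.

```perl
use strict;
use warnings;

# Hypothetical helper: replace each "#N" line with line N (1-based).
sub expand_refs {
    my @content = @_;
    chomp @content;                      # strip the newlines once, up front
    for my $line (@content) {
        if ($line =~ /^#(\d+)$/) {
            $line = $content[$1 - 1];    # no stray newline comes along
        }
    }
    return @content;
}

# Prints "line A", "line B", "line A", "line C", one per line.
print "$_\n" for expand_refs("line A\n", "line B\n", "#1\n", "line C\n");
```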

Why is my Perl code not omitting newlines?

I'm reading this text file to get ONLY the words in it and ignore all kinds of whitespace:
hello
now
do you see this.sadslkd.das,msdlsa but
i hoohoh
And this is my Perl code:
#!usr/bin/perl -w
require 5.004;
open F1, './text.txt';
while ($line = <F1>) {
    #print $line;
    @arr = split /\s+/, $line;
    foreach $w (@arr) {
        if ($w !~ /^\s+$/) {
            print $w."\n";
        }
    }
    #print @arr;
}
close F1;
And this is the output:
hello
now
do
you
see
this.sadslkd.das,msdlsa
but
i
hoohoh
The output is showing two newlines but I am expecting the output to be just words. What should I do to just get words?

You should always use strict and use warnings (in preference to the -w command-line qualifier) at the top of every Perl program, and declare each variable at its first point of use using my. That way Perl will tell you about simple errors that you may otherwise overlook.
You should also use lexical file handles with the three-parameter form of open, and check the status to make sure it succeeded. There is little point in explicitly closing an input file unless you expect your program to run for an appreciable time, as Perl will close all files for you on exit.
Do you really need to require Perl v5.4? That version is fifteen years old, and if there is anything older than that installed then you have a museum!
Your program would be better like this:
use strict;
use warnings;
open my $fh, '<', './text.txt' or die $!;
while (my $line = <$fh>) {
    my @arr = split /\s+/, $line;
    foreach my $w (@arr) {
        if ($w !~ /^\s+$/) {
            print $w."\n";
        }
    }
}
Note: my apologies. The warnings pragma and lexical file handles were introduced only in v5.6, so that part of my answer is irrelevant. The latest version of Perl is v5.16 and you really should upgrade.
As Birei has pointed out, the problem is that, when the line has leading whitespace, there is an empty field before the first separator. Imagine if your data was comma-separated: then you would want Perl to report a leading empty field if the line started with a comma.
To extract all the non-space characters you can use a regular expression that does exactly that
my @arr = $line =~ /\S+/g;
and this can be emulated by using the default separator for split, which is a single quoted space (not a regular expression)
my @arr = split ' ', $line;
In this case split behaves like the awk utility and discards any leading empty fields as you expected.
This is even simpler if you let Perl use the $_ variable in the read loop, as all of the parameters for split can be defaulted:
while (<F1>) {
    my @arr = split;
    foreach my $w (@arr) {
        print "$w\n" if $w !~ /^\s+$/;
    }
}
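To see the difference between the two forms of split for yourself, here is a small sketch using a made-up line with leading whitespace:

```perl
use strict;
use warnings;

my $line = "   do you see\n";    # made-up sample with leading spaces

my @with_regex = split /\s+/, $line;    # leading whitespace yields an empty first field
my @awk_style  = split ' ',   $line;    # leading whitespace is skipped

print scalar(@with_regex), " fields with /\\s+/\n";    # 4 fields, the first one empty
print scalar(@awk_style),  " fields with ' '\n";       # 3 fields
```

That empty first field is what gets printed as a blank line in the original program, because an empty string does not match /^\s+$/.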

This line is the problem:
@arr=split(/\s+/,$line);
When the line starts with whitespace, \s+ matches that leading whitespace as a separator, so split returns an empty first field. Use ' ' instead.
@arr=split(' ',$line);

I believe that in this line:
if(!($w =~ /^\s+$/))
you wanted to skip printing when there is nothing in the word.
But the "+" in the regex forces it to match at least one whitespace character, so an empty string never matches and still gets printed.
If you change "\s+" to "\s*", you'll see that it works, because * means zero or more occurrences.

perl: printing new line instead in a single line

I am new to Perl.
When I try to print the values of the array along with one variable in the while loop,
the variable is printed on a new line.
while($line=<FH>)
{
    chomp($line);
    $tem = grep(/gooty/,$line);
    if($tem==1)
    {
        $Date=$date;
        @array=split(/\|/,$line);
        $sth = "INSERT INTO TABLE VALUES $array[1],$array[2],$date \n";
    }
}
print "$sth \n";
the output:
INSERT INTO TABLE VALUES alan ,777
,2012-07-31
instead i want the output as :
INSERT INTO TABLE VALUES alan ,777,2012-07-31
in single line

This is a common problem for new perl programmers. Say
while (defined($line = <FH>))
{
chomp $line; # Eliminate terminating newline if there
...
If the results are still not right, you may be trying to read a text file with MSDOS/Windows line endings using a version of Perl (like Cygwin) that doesn't handle them correctly. This can cause chomp to malfunction. You can work around the problem using this instead:
$line =~ s/[\r\n]+$//;
This cleans all end-of-line characters from the end of the line, no matter how many there are.
Additional notes on your code: You'll save lots of trouble for yourself with use strict; and use warnings;, which will require variable declarations with my and our. You don't need to call grep. Just say if ($line =~ /gooty/) {. If there is any chance of extra whitespace in your data, a better split pattern is \s*\|\s*. This will consume any whitespace around the vertical bar field separators. In that case you also want to use
$line =~ s/\s+$//;
instead of chomp $line. This will clean all whitespace from the end of line, which includes end-of-line characters.
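A quick way to check whether Windows line endings are the culprit is to compare what chomp and the substitution leave behind. This is only a sketch with a made-up record; it assumes the default input record separator ("\n").

```perl
use strict;
use warnings;

my $raw = "1|alan|777\r\n";    # made-up record with a Windows line ending

my $chomped = $raw;
chomp $chomped;                # removes "\n" only; the "\r" stays behind

my $trimmed = $raw;
$trimmed =~ s/\s+$//;          # removes "\r\n" and any other trailing whitespace

print "after chomp:      ", length($chomped), " characters\n";    # 11
print "after s/\\s+\$//: ", length($trimmed), " characters\n";    # 10
```

The leftover "\r" is invisible when printed, but it sends the cursor back to the start of the line, which can make the output look broken.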

You have a newline at the end of $line. chomp it, before splitting it, to get the desired output.
chomp $line;
perldoc -f chomp

I assume that you don't want your elements of @array enclosed by whitespace characters. Then we should trim them before printing them.
my $line = <FH>;
my $date = '2012-07-31'; # or whatever
if($line =~ /gooty/)
{
    my @array = split /[|]/, $line;
    foreach (@array) {
        s/^\s+//; # removes leading whitespace
        s/\s+$//; # removes trailing whitespace
    }
    print "INSERT INTO TABLE VALUES $array[1],$array[2],$date \n";
}
This should print the desired output.
But I cannot be sure until you show us the input you gave your code that produced the unexpected output. (Or could it be that you modify your $sth between the loop and the print statement? I see you appended two newlines?)
Btw: use strict; use warnings!

The cleanest approach, instead of using chomp, is to remove all trailing whitespace from the end of the line.
Start your loop with:
$line =~ s/\s+\z//;

How to parse multiple line, fixed-width file in perl?

I have a file that I need to parse in the following format. (All delimiters are spaces):
field name 1:            Multiple word value.
field name 2:            Multiple word value along
                         with multiple lines.
field name 3:            Another multiple word
                         and multiple line value.
I am familiar with how to parse a single line fixed-width file, but am stumped with how to handle multiple lines.

#!/usr/bin/env perl
use strict; use warnings;
my (%fields, $current_field);
while (my $line = <DATA>) {
    next unless $line =~ /\S/;
    if ($line =~ /^ \s+ ( \S .+ )/x) {
        if (defined $current_field) {
            # continuation line: append it with a separating space
            $fields{ $current_field } .= " $1";
        }
    }
    elsif ($line =~ /^(.+?) : \s+ (.+) \s+/x ) {
        $current_field = $1;
        $fields{ $current_field } = $2;
    }
}
use Data::Dumper;
print Dumper \%fields;
__DATA__
field name 1:            Multiple word value.
field name 2:            Multiple word value along
                         with multiple lines.
field name 3:            Another multiple word
                         and multiple line value.

Fixed-width says unpack to me. It is possible to parse with regexes and split, but unpack should be a safer choice, as it is the Right Tool for fixed width data.
I put the width of the first field to 12 and the empty space between to 13, which works for this data. You may need to change that. The template "A12A13A*" means "find 12 then 13 ascii characters, followed by any length of ascii characters". unpack will return a list of these matches. Also, unpack will use $_ if a string is not supplied, which is what we do here.
Note that if the first field is not fixed width up to the colon, as it appears to be in your sample data, you'll need to merge the fields in the template, e.g. "A25A*", and then strip the colon.
I chose array as the storage device, as I do not know if your field names are unique. A hash would overwrite fields with the same name. Another benefit of an array is that it preserves the order of the data as it appears in the file. If these things are irrelevant and quick lookup is more of a priority, use a hash instead.
Code:
use strict;
use warnings;
use Data::Dumper;
my $last_text;
my @array;
while (<DATA>) {
    # unpack the fields and strip spaces
    my ($field, undef, $text) = unpack "A12A13A*";
    if ($field) {   # If $field is empty, that means we have a multi-line value
        $field =~ s/:$//;                 # strip the colon
        $last_text = [ $field, $text ];   # store data in anonymous array
        push @array, $last_text;          # and store that array in @array
    } else {        # multi-line values get added to the previous line's data
        $last_text->[1] .= " $text";
    }
}
print Dumper \@array;
__DATA__
field name 1:            Multiple word value.
field name 2:            Multiple word value along
                         with multiple lines.
field name 3:            Another multiple word
                         and multiple line value
                         with a third line
Output:
$VAR1 = [
          [
            'field name 1',
            'Multiple word value.'
          ],
          [
            'field name 2',
            'Multiple word value along with multiple lines.'
          ],
          [
            'field name 3',
            'Another multiple word and multiple line value with a third line'
          ]
        ];
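If the field names are unique and lookup by name matters more than order, the same idea works with a hash. The sketch below assumes the merged layout mentioned earlier (25 columns of name, then the text); the sub name parse_fixed_width is made up, and the sample lines are built with sprintf only so that the column widths are exact.

```perl
use strict;
use warnings;

# Hypothetical helper: parse lines laid out as 25 columns of field name
# followed by the text. Returns a hash reference keyed by field name.
sub parse_fixed_width {
    my @lines = @_;
    my (%fields, $current);
    for my $line (@lines) {
        my ($name, $text) = unpack "A25A*", $line;
        if (length $name) {                 # a new field starts on this line
            ($current = $name) =~ s/:$//;   # drop the trailing colon
            $fields{$current} = $text;
        }
        elsif (defined $current) {          # continuation of the previous field
            $fields{$current} .= " $text";
        }
    }
    return \%fields;
}

my @sample = (
    sprintf("%-25s%s", "field name 1:", "Multiple word value."),
    sprintf("%-25s%s", "field name 2:", "Multiple word value along"),
    sprintf("%-25s%s", "",              "with multiple lines."),
);

my $fields = parse_fixed_width(@sample);
print "$_ => $fields->{$_}\n" for sort keys %$fields;
```

As noted above, a hash will overwrite an earlier field if the same name appears twice, so this only fits data where the names are unique.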

You could do this:
#!/usr/bin/perl
use strict;
use warnings;
my @fields;
open(my $fh, "<", "multi.txt") or die "Unable to open file: $!\n";
for (<$fh>) {
    if (/^\s/) {
        $fields[$#fields] .= $_;
    } else {
        push @fields, $_;
    }
}
close $fh;
If the line starts with white space, append it to the last element in @fields, otherwise push it onto the end of the array.
Alternatively, slurp the entire file and split with look-around:
#!/usr/bin/perl
use strict;
use warnings;
$/=undef;
open(my $fh, "<", "multi.txt") or die "Unable to open file: $!\n";
my @fields = split/(?<=\n)(?!\s)/, <$fh>;
close $fh;
It's not a recommended approach though.
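To see what the look-around pattern does without touching a file, here is a small sketch on an in-memory string. The sample text is made up; continuation lines start with whitespace.

```perl
use strict;
use warnings;

my $text = "name 1: one\n  more of one\nname 2: two\n";

# Split right after a newline (look-behind), but only when the next
# character is not whitespace (negative look-ahead).
my @records = split /(?<=\n)(?!\s)/, $text;

print scalar(@records), " records\n";    # 2 records
print "[$_]" for @records;
```

Each record keeps its own newlines, so the continuation line stays attached to the field it belongs to.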

You can change delimiter:
$/ = "\nfield name";
while (my $line = <FILE>) {
if ($line =~ /(\d+)\s+(.+)/) {
print "Record $1 is $2";
}
}

With Perl, how do I read records from a file with two possible record separators?

Here is what I am trying to do:
I want to read a text file into an array of strings. I want the string to terminate when the file reads in a certain character (mainly ; or |).
For example, the following text
Would you; please
hand me| my coat?
would be put away like this:
$string[0] = 'Would you;';
$string[1] = ' please hand me|';
$string[2] = ' my coat?';
Could I get some help on something like this?

This will do it. The trick to using split while preserving the token you're splitting on is to use a zero-width lookbehind match: split(/(?<=[;|])/, ...).
Note: mctylr's answer (currently the top rated) isn't actually correct -- it will split fields on newlines, b/c it only works on a single line of the file at a time.
gbacon's answer using the input record separator ($/) is quite clever--it's both space and time efficient--but I don't think I'd want to see it in production code. Putting one split token in the record separator and the other in the split strikes me as a little too unobvious (you have to fight that with Perl ...) which will make it hard to maintain. I'm also not sure why he's deleting multiple newlines (which I don't think you asked for?) and why he's doing that only for the end of '|'-terminated records.
# open file for reading, die with error message if it fails
open(my $fh, '<', 'data.txt') || die $!;
# set file reading to slurp (whole file) mode (note that this affects all
# file reads in this block)
local $/ = undef;
my $string = <$fh>;
# convert all newlines into spaces, not specified but as per example output
$string =~ s/\n/ /g;
# split string on ; or |, using a zero-width lookbehind (?<=) to preserve char
my (@strings) = split(/(?<=[;|])/, $string);
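The same steps can be tried on an in-memory string instead of a file. This is only a sketch: the text is the sample from the question, and the trailing whitespace is trimmed first so that the last element does not end in a space.

```perl
use strict;
use warnings;

my $text = "Would you; please\nhand me| my coat?\n";

$text =~ s/\s+\z//;    # drop the final newline
$text =~ s/\n/ /g;     # remaining newlines become spaces

# Split after every ; or | while keeping the separator on the left-hand piece.
my @strings = split /(?<=[;|])/, $text;

print "[$_]\n" for @strings;
# [Would you;]
# [ please hand me|]
# [ my coat?]
```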

One way is to inject another character, like \n, whenever your special character is found, then split on the \n:
use warnings;
use strict;
use Data::Dumper;
while (<DATA>) {
    chomp;
    s/([;|])/$1\n/g;
    my @string = split /\n/;
    print Dumper(\@string);
}
__DATA__
Would you; please hand me| my coat?
Prints out:
$VAR1 = [
'Would you;',
' please hand me|',
' my coat?'
];
UPDATE: The original question posed by James showed the input text on a single line, as shown in __DATA__ above. Because the question was poorly formatted, others edited the question, breaking the 1 line into 2. Only James knows whether 1 or 2 lines was intended.

I prefer @toolic's answer because it deals with multiple separators very easily.
However, if you wanted to overly complicate things, you could always try:
#!/usr/bin/perl
use strict; use warnings;
my @contents = ('');
while ( my $line = <DATA> ) {
    last unless $line =~ /\S/;
    $line =~ s{$/}{ };
    if ( $line =~ /^([^|;]+[|;])(.+)$/ ) {
        $contents[-1] .= $1;
        push @contents, $2;
    }
    else {
        # no separator on this line: the whole line belongs to the current record
        $contents[-1] .= $line;
    }
}
print "[$_]\n" for @contents;
__DATA__
Would you; please
hand me| my coat?

Something along the lines of
$text = <INPUTFILE>;
@string = split(/[;|]/, $text);
should do the trick more or less.
Edit: I've changed "/;|/" to "/[;|]/".

Let Perl do half the work for you by setting $/ (the input record separator) to vertical bar, and then extract semicolon-separated fields:
#!/usr/bin/perl
use warnings;
use strict;
my @string;
*ARGV = *DATA;
$/ = "|";
while (<>) {
    s/\n+$//;
    s/\n/ /g;
    push @string => $1 while s/^(.*;)//;
    push @string => $_;
}
for (my $i = 0; $i < @string; ++$i) {
    print "\$string[$i] = '$string[$i]';\n";
}
__DATA__
Would you; please
hand me| my coat?
Output:
$string[0] = 'Would you;';
$string[1] = ' please hand me|';
$string[2] = ' my coat?';
