How to parse a CSV file containing serialized PHP? - perl

I've just started dabbling in Perl, to try and gain some exposure to different programming languages - so forgive me if some of the following code is horrendous.
I needed a quick and dirty CSV parser that could receive a CSV file, and split it into file batches containing "X" number of CSV lines (taking into account that entries could contain embedded newlines).
I came up with a working solution, and it was going along just fine. However, one of the CSV files I'm trying to split contains serialized PHP data.
This seems to break the CSV parsing. As soon as I remove the serialization - the CSV file is parsed correctly.
Are there any tricks I need to know when it comes to parsing serialized data in CSV files?
Here is a shortened sample of the code:
use strict;
use warnings;
use Text::CSV_XS;
my $csv = Text::CSV_XS->new({ eol => $/, always_quote => 1, binary => 1 });
my $out;
my $in;
my $lines = 0;
open $in, "<:encoding(utf8)", "infile.csv" or die("cannot open input file infile.csv: $!");
open $out, ">", "outfile.000" or die("cannot open output file outfile.000: $!");
binmode($out, ":utf8");
while (my $line = $csv->getline($in)) {
    $lines++;
    $csv->print($out, $line);
}
I'm never able to get into the while loop shown above. As soon as I remove the serialized data, I suddenly am able to get into the loop.
Edit:
An example of a line that is causing me trouble (taken straight from Vim - hence the ^M):
"26","other","1","20,000 Subscriber Plan","Some text here.^M\
Some more text","on","","18","","0","","0","0","recurring","0","","payment","totalsend","0","tsadmin","R34bL9oq","37","0","0","","","","","","","","","","","","","","","","","","","","","","","0","0","0","a:18:{i:0;s:1:\"3\";i:1;s:1:\"2\";i:2;s:2:\"59\";i:3;s:2:\"60\";i:4;s:2:\"61\";i:5;s:2:\"62\";i:6;s:2:\"63\";i:7;s:2:\"64\";i:8;s:2:\"65\";i:9;s:2:\"66\";i:10;s:2:\"67\";i:11;s:2:\"68\";i:12;s:2:\"69\";i:13;s:2:\"70\";i:14;s:2:\"71\";i:15;s:2:\"72\";i:16;s:2:\"73\";i:17;s:2:\"74\";}","","","0","0","","0","0","0.0000","0.0000","0","","","0.00","","6","1"
"27","other","1","35,000 Subscriber Plan","Some test here.^M\
Some more text","on","","18","","0","","0","0","recurring","0","","payment","totalsend","0","tsadmin","R34bL9oq","38","0","0","","","","","","","","","","","","","","","","","","","","","","","0","0","0","a:18:{i:0;s:1:\"3\";i:1;s:1:\"2\";i:2;s:2:\"59\";i:3;s:2:\"60\";i:4;s:2:\"61\";i:5;s:2:\"62\";i:6;s:2:\"63\";i:7;s:2:\"64\";i:8;s:2:\"65\";i:9;s:2:\"66\";i:10;s:2:\"67\";i:11;s:2:\"68\";i:12;s:2:\"69\";i:13;s:2:\"70\";i:14;s:2:\"71\";i:15;s:2:\"72\";i:16;s:2:\"73\";i:17;s:2:\"74\";}","","","0","0","","0","0","0.0000","0.0000","0","","","0.00","","7","1"
"28","other","1","50,000 Subscriber Plan","Some text here.^M\
Some more text","on","","18","","0","","0","0","recurring","0","","payment","totalsend","0","tsadmin","R34bL9oq","39","0","0","","","","","","","","","","","","","","","","","","","","","","","0","0","0","a:18:{i:0;s:1:\"3\";i:1;s:1:\"2\";i:2;s:2:\"59\";i:3;s:2:\"60\";i:4;s:2:\"61\";i:5;s:2:\"62\";i:6;s:2:\"63\";i:7;s:2:\"64\";i:8;s:2:\"65\";i:9;s:2:\"66\";i:10;s:2:\"67\";i:11;s:2:\"68\";i:12;s:2:\"69\";i:13;s:2:\"70\";i:14;s:2:\"71\";i:15;s:2:\"72\";i:16;s:2:\"73\";i:17;s:2:\"74\";}","","","0","0","","0","0","0.0000","0.0000","0","","","0.00","","8","1"
"73","other","8","10,000,000","","","","0","","0","","0","0","recurring","0","","payment","","0","","","75","0","10000000","","","","","","","","","","","","","","","","","","","","","","","0","0","0","a:17:{i:0;s:1:\"3\";i:1;s:1:\"2\";i:2;s:2:\"59\";i:3;s:2:\"60\";i:4;s:2:\"61\";i:5;s:2:\"62\";i:6;s:2:\"63\";i:7;s:2:\"64\";i:8;s:2:\"65\";i:9;s:2:\"66\";i:10;s:2:\"67\";i:11;s:2:\"68\";i:12;s:2:\"69\";i:13;s:2:\"70\";i:14;s:2:\"71\";i:15;s:2:\"72\";i:16;s:2:\"74\";}","","","0","0","","0","0","0.0000","0.0000","0","","","0.00","","14","0"

The CSV you are trying to read escapes embedded quotes with backslash, but the default for Text::CSV_XS is to escape by doubling them. Try adding escape_char => '\\' to the Text::CSV_XS constructor.
You may also need allow_loose_escapes => 1 if the file uses backslash to escape other characters that don't strictly need it, such as newlines.
The other option is to change the writer to use doubled quotes instead of backslashes for escaping, which may or may not be possible. Doubling the quotes is the more common flavour of CSV. Programmatic parsers can generally read both (if told), but you won't be able to read the backslash variant in, for example, Excel.
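A minimal sketch of the constructor options suggested above. It uses the Text::CSV wrapper rather than Text::CSV_XS directly (the attribute names are the same), and the record is a shortened, hypothetical line in the style of the question's data, not the real file:

```perl
use strict;
use warnings;
use Text::CSV;

# escape_char tells the parser that \" is an embedded quote;
# allow_loose_escapes tolerates backslashes before other characters.
my $csv = Text::CSV->new({
    binary              => 1,
    escape_char         => '\\',
    allow_loose_escapes => 1,
}) or die "Cannot use CSV: " . Text::CSV->error_diag();

# Hypothetical, shortened record with serialized PHP in the third field.
my $line = q{"26","other","a:2:{i:0;s:1:\"3\";i:1;s:1:\"2\";}","1"};

$csv->parse($line) or die "Parse failed: " . $csv->error_diag();
my @fields = $csv->fields;

print "$_\n" for @fields;
```

With the default escape_char (a double quote) the same line is expected to fail to parse, which would match the symptom of never entering the while loop.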

Related

Removing extra commas from csv file in perl

I have multiple CSV files, each with a different number of entries and roughly 300 lines long.
The first line in each file contains the data labels
Person_id, person_name, person_email, person_address, person_recruitmentID, person_comments... etc
The rest of the lines in each file contain the data
"0001", "bailey", "123 fake, street", "bailey@mail.com", "0001", "this guy doesnt know how to get rid of, commas!"... etc
I want to get rid of commas that are in between quotation marks.
I'm currently going through the Text::CSV documentation but it's a slow process.
A good CSV parser will have no trouble with this since commas are inside the quoted fields, so you can simply parse the file with it.
A really nice module is Text::CSV_XS, which is loaded by default when you use the wrapper Text::CSV. The only thing to address in your data is the spaces between fields since they aren't in CSV specs, so I use the option for that in the example below.
If you really must remove the commas for further work, do that as the parser hands you each line.
use warnings;
use strict;
use feature 'say';
use Text::CSV;
my $file = 'commas_in_fields.csv';
my $csv = Text::CSV->new( { binary => 1, allow_whitespace => 1 } )
or die "Cannot use CSV: " . Text::CSV->error_diag ();
open my $fh, '<', $file or die "Can't open $file: $!";
my @headers = @{ $csv->getline($fh) }; # if there is a separate header line
while (my $line = $csv->getline($fh)) { # returns arrayref
tr/,//d for @$line; # delete commas from each field
say "@$line";
}
This uses tr on $_ in the for loop, thus changing the elements of the array iterated over themselves, for conciseness.
I'd like to repeat and emphasize what others have explained: do not parse CSV by hand, since trouble may await; use a library. This is akin to parsing XML and similar formats: no regex please, but libraries.
Let's get this out of the way: you cannot read a CSV by just splitting on commas. You've just demonstrated why; commas might be escaped or inside quotes. Those commas are totally valid, they're part of the data. Discarding them mangles the data in the CSV.
For this reason, and others, CSV files must be read using a CSV parsing library. To find which commas are data and which commas are structural also requires parsing the CSV using a CSV parsing library. So you won't be saving yourself any time by trying to remove the commas from inside quotes. Instead you'll give yourself more work while mangling the data. You'll have to use a CSV parsing library.
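To make the point concrete, here is a small sketch comparing a naive split with a parser on one line. The sample line is a shortened version of the question's data, and the use of the Text::CSV wrapper here is an assumption made for illustration:

```perl
use strict;
use warnings;
use Text::CSV;

# Shortened sample row; the third field contains a comma that is data.
my $line = q{"0001", "bailey", "123 fake, street", "bailey@mail.com"};

# Naive approach: every comma is treated as structural.
my @naive = split /,\s*/, $line;

# Parser approach: commas inside quotes stay part of the field.
my $csv = Text::CSV->new({ binary => 1, allow_whitespace => 1 })
    or die "Cannot use CSV: " . Text::CSV->error_diag();
$csv->parse($line) or die "Parse failed: " . $csv->error_diag();
my @fields = $csv->fields;

printf "split gave %d pieces, parser gave %d fields\n",
    scalar @naive, scalar @fields;
print "address: $fields[2]\n";
```

The split cuts the address in two and leaves the quote characters attached, while the parser returns the address as a single unquoted field.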
Text::CSV_XS is a very good, very fast CSV parsing library. It has a ton of features, most of which you do not need. Fortunately it has examples for doing most common actions.
For example, here's how you read and print each row from a file called file.csv.
use strict;
use warnings;
use autodie;
use v5.10; # for `say`
use Text::CSV_XS;
# Open the file.
open my $fh, "<", "file.csv";
# Create a new Text::CSV_XS object.
# allow_whitespace allows there to be whitespace between the fields
my $csv = Text::CSV_XS->new({
allow_whitespace => 1
});
# Read in the header line so it's not counted as data.
# Then you can use $csv->getline_hr() to read each row in as a hash.
$csv->header($fh);
# Read each row.
while( my $row = $csv->getline($fh) ) {
# Do whatever you want with the list of cells in $row.
# This prints them separated by semicolons.
say join "; ", @$row;
}

How to remove words with digits without removing a digit at the beginning of a string?

I'm doing some tweet sentiment analysis, and right now I'm trying to clean the data using perl on Ubuntu command line.
I have some data in the following format:
sentiment, 'text'
Where sentiment = {0, 4} and text is any valid string.
Right now I'm having trouble removing data such as this:
0,'My 21yo son has finally graduated from college!'
4,'The NT2000 is an awesome product!'
4,'what is good88guy doing on my following list?'
I want the following to look like this after:
0,'My son has finally graduated from college!'
4,'The is an awesome product!'
4,'what is doing on my following list?'
I don't want to remove the sentiment and I also need to remove the yo. Any ideas how I can write this script?
You may want to try this:
s/ ?( |[a-z]+)\d+( |[a-z]+|)? ?/ /simg;
DEMO
http://regex101.com/r/zW2nJ3
Sounds like you want the following:
s/\w*\d\w*\s*//g;
Your statement that you don't want things removed "from the beginning" is a little confusing, so you'll have to add more information to get a better answer.
One of the easiest ways to communicate what you want is to create a list of before and after strings, with each one demonstrating a special case.
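One reading of "not removing a digit at the beginning" is that the sentiment value must survive. A rough sketch under that assumption: split each line once at the first comma and apply the substitution only to the text part. The split is a simplification that ignores CSV quoting rules, and the sample lines are copied from the question:

```perl
use strict;
use warnings;

# Sample lines in the question's "sentiment,'text'" format.
my @lines = (
    "0,'My 21yo son has finally graduated from college!'",
    "4,'The NT2000 is an awesome product!'",
    "4,'what is good88guy doing on my following list?'",
);

my @cleaned;
for my $line (@lines) {
    # Split once at the first comma so the sentiment digit is left alone.
    my ($sentiment, $text) = split /,/, $line, 2;

    # Remove every word containing a digit, plus any whitespace after it.
    $text =~ s/\w*\d\w*\s*//g;

    push @cleaned, "$sentiment,$text";
}

print "$_\n" for @cleaned;
```

Applied to the whole line instead, the same substitution would also delete the leading 0 or 4, since a lone digit matches the pattern.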
Since your recent comments I have understood your problem a little better.
The data format you describe must be processed using Text::CSV, so as to account for quoted fields and comma separators.
This program should suit your needs as far as I understand them. It has use autodie to avoid the need for hand-coding exceptions if the input file cannot be opened, and Text::CSV reads the data from the file, specifying single quotes as field delimiters.
I have used the code from my original answer to process each line of the file, as it provides the best flexibility if your requirements need to be refined.
use strict;
use warnings;
use autodie;
use Text::CSV;
my $csv_proc = Text::CSV->new({ eol => $/, quote_char => "'" });
open my $fh, '<', 'myfile.txt';
while ( my $row = $csv_proc->getline($fh) ) {
my @fields = split ' ', $row->[1];
$row->[1] = join ' ', grep { not /\d/ } @fields;
$csv_proc->print(*STDOUT, $row);
}
output
0,'My son has finally graduated from college!'
4,'The is an awesome product!'
4,'what is doing on my following list?'

How to append lines to csv files with perl

I have two dozen .csv files, each about a thousand lines long, they've been created with Tie::Array::CSV. Now I want to append a line or two to each of them every day, what's the most efficient way to do that?
I suppose I could read each file into an array, add my data, and write array back to csv again but this means I risk losing all data if something goes wrong during overwrite, and if I create new files I need to figure out some system to automatically manage all those copies of copies that keep piling up every day.
If there's no module to append lines, how to do it manually with all the conventions and escape characters needed for proper csv so that they can be read back into perl without problems?
...
Thanks for the replies. I don't have enough reputation to add comments to answers directly, so I edited the original question.
I don't worry about losing my data too much. I can rebuild it from scratch; it would take time and manual intervention, but not enough to warrant running a daily backup system, though it's always a point to consider. Appending lines should be easier on processing time, too, as writing .csv to disk is relatively slow on my machines.
Opening files in append mode is what I didn't know. With solutions involving $csv->print($fh, $row);, should I be running this through a loop to add more than one line?
I've got another solution proposed on Perlmonks
use Tie::Array::CSV;
my $filename = 'tied.csv';
tie my @file, 'Tie::Array::CSV', $filename;
push(@file,[4,5,6]);
untie @file;
Would that work better? I don't have the opportunity to test it right now.
EDIT. Push() solution above worked magic. Consider this closed.
Use Text::CSV:
use Text::CSV;
my $csv = Text::CSV->new({ binary => 1, auto_diag => 1, eol => "\n"})
or die "Cannot use CSV: " . Text::CSV->error_diag();
# open in append mode
open my $fh, ">>", "file.csv" or die "Failed to open file: $!";
$csv->print($fh, [ "foo", "bar", "foo,bar" ]);
close $fh;
This appends a single line to file.csv using \n as the EOL character, comma as the field separator (this is the default), and quotes around fields containing commas:
foo,bar,"foo,bar"
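On the follow-up about adding more than one line: calling print once per row in a loop should be enough. A minimal sketch, in which the temporary file and the row data are hypothetical stand-ins for one of the real .csv files:

```perl
use strict;
use warnings;
use Text::CSV;
use File::Temp qw(tempfile);

my $csv = Text::CSV->new({ binary => 1, auto_diag => 1, eol => "\n" })
    or die "Cannot use CSV: " . Text::CSV->error_diag();

# Hypothetical target; a temp file stands in for an existing CSV file.
my (undef, $path) = tempfile(SUFFIX => '.csv', UNLINK => 1);

# Rows to append, one array reference per CSV line.
my @new_rows = ( [ 1, 2, 3 ], [ "foo", "bar", "foo,bar" ] );

# Open in append mode and print each row in turn.
open my $fh, '>>', $path or die "Failed to open $path: $!";
$csv->print($fh, $_) for @new_rows;
close $fh or die "Failed to close $path: $!";

# Read the file back to show what was written.
open my $check, '<', $path or die "Failed to reopen $path: $!";
my @written = <$check>;
close $check;

print @written;
```

Because the file is only ever opened with ">>", existing lines are never rewritten, which avoids the overwrite risk raised in the question.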

How to Access an Array in Perl for Regex

I have two inputs reading into my command prompt, the first being a series of words that are to be searched by the program I'm writing, and the second being the file that contains where the words are to be found. So, for instance, my command prompt reads perl WebScan.pl word WebPage000.htm
Now, I have no trouble accessing either of these inputs for printing, but I am having great difficulty accessing the contents of the webpage so I can apply regular expressions to remove HTML tags and get at the content. I realize that there is a subroutine available to do this without regular expressions that is far more effective, but I need to do this with regular expressions :(.
I can access the html file for printing with no trouble:
open (DATA, $ARGV[1]);
my @file = <DATA>;
print @file;
Which prints the entire code of the html page, but I am unable to pass regular expressions in order to remove html blocks. I keep receiving an error that says "Can't modify array dereference in s/// near," which is where I have my specific regular expression. I'm not sure how to get around this- I've tried converting the array into a scalar but then I am unable to access any of the data in the html at all (and no, it doesn't just print the number of values in the array :P)
How do I access the array's contents so I can use regular expressions to refine the desired output?
It sounds like you are doing something like @file =~ s/find/replace/;. You are getting that error because the left hand side of the regex binding operator imposes scalar context on its argument. An array in scalar context returns its length, but this value is read only. So when your substitution tries to perform the replacement, kaboom.
In order to process all of the lines of the file, you could use a foreach loop:
foreach my $line (@file) {$line =~ s/find/replace/}
or more succinctly as:
s/find/replace/ for @file;
However, if you are running regular expressions on an HTML file, chances are you will need them to match across multiple lines. What you are doing above is reading the entire file in, and storing each line as an element of @file. If you use one of Perl's iterative control structures on the array, you will not be able to match multiple lines. So you should instead read the file into a single scalar. You can then use $file =~ s/// as expected.
You can slurp the file into a single variable by temporarily clearing the input record separator $/:
my $file = do {local $/; <DATA>};
In general, regular expressions are the wrong tool for parsing HTML, but it sounds like this is a homework assignment, so in that case it's just practice anyway.
And finally, in modern Perl, you should use the three argument form of open with a lexical file handle and error checking:
open my $DATA, '<', $ARGV[1] or die "open error: $!";
my $file = do {local $/; <$DATA>};
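Putting the slurp and the substitution together, a rough sketch. The temporary file is a hypothetical stand-in for WebPage000.htm, and the tag-stripping pattern is a deliberately naive one chosen for illustration, not a robust HTML parser:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical stand-in for the HTML file, with a tag spanning two lines.
my ($tmp, $path) = tempfile(UNLINK => 1);
print {$tmp} "<p\n class=\"x\">Hello <b>world</b></p>\n";
close $tmp or die "close error: $!";

# Slurp the whole file into one scalar by clearing $/ locally.
open my $in, '<', $path or die "open error: $!";
my $html = do { local $/; <$in> };
close $in;

# Naive tag stripper: [^>] also matches newlines, so a tag that is
# split across lines is removed as a single match.
(my $text = $html) =~ s/<[^>]*>//g;

print $text;
```

Run line by line over an array, the same substitution would miss the opening tag, because its "<" and ">" sit in different elements.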

How can I read a continuously updating log file in Perl?

I have an application that generates logs every 5 seconds. The logs are in the format below.
11:13:49.250,interface,0,RX,0
11:13:49.250,interface,0,TX,0
11:13:49.250,interface,1,close,0
11:13:49.250,interface,4,error,593
11:13:49.250,interface,4,idle,2994215
and so on for other interfaces...
I am working to convert these into below CSV format:
Time,interface.RX,interface.TX,interface.close....
11:13:49,0,0,0,....
That part is simple so far, but the problem is that I have to produce the CSV data live, i.e. as soon as the log file is updated, the CSV should be updated too.
What I have tried to read the output and make the header is:
#!/usr/bin/perl -w
use strict;
use File::Tail;
my $head = ["Time"];
my $pos = {};
my $last_pos = 0;
my $current_event = [];
my $events = [];
my $file = shift;
$file = File::Tail->new($file);
while (defined($_ = $file->read)) {
    # next if $_ =~ ...;    # filters omitted in the original post
    my ($time, $interface, $count, $eve, $value) = split /[,\n]/, $_;
    my $key = $interface . "." . $eve;
    if (not defined $pos->{$key}) {
        # first time this interface.event pair is seen: give it the next column
        $last_pos += 1;
        $pos->{$key} = $last_pos;
        push @$head, $key;
    }
    print join(",", @$head) . "\n";
}
Is there any way to do this using Perl?
Module Text::CSV will allow you to both read and write CSV format files. Text::CSV will internally use Text::CSV_XS if it's installed, or it will fall back to using Text::CSV_PP (thanks to Brad Gilbert for improving this explanation).
Grouping the related rows together is something you will have to do; it is not clear from your example where the source date goes to.
Making sure that the CSV output is updated is primarily a question of ensuring that you have the output file line buffered.
As David M suggested, perhaps you should look at the File::Tail module to deal with the continuous reading aspect of the problem. That should allow you to continually read from the input log file.
You can then use the 'parse' method in Text::CSV to split up the read line, and the 'print' method to format the output. How you combine the information from the various input lines to create an output line is a mystery to me - I cannot see how the logic works from the example you give. However, I assume you know what you need to do, and these tools will give you the mechanisms you need to handle the data.
No-one can do much more to spoon-feed you the answer. You are going to have to do some thinking for yourself. You will have a file handle that can be continuously read via File::Tail; you will have a CSV structure for reading the data lines; you will probably have another CSV structure for the written output; you will have an output file handle that you ensure is flushed every time you write. Connecting these dots is now your problem.
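As a starting point for connecting those pieces, here is a rough sketch. The grouping rule (one output row per timestamp, columns in first-seen order, keys built as in the question) is an assumption, since the question does not spell it out. An in-memory list of hypothetical log lines stands in for what File::Tail's read method would return:

```perl
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new({ binary => 1, auto_diag => 1, eol => "\n" })
    or die "Cannot use CSV: " . Text::CSV->error_diag();

# Hypothetical stand-in for lines read one at a time from the log.
my @log = (
    "11:13:49.250,interface,0,RX,0\n",
    "11:13:49.250,interface,0,TX,0\n",
    "11:13:54.250,interface,0,RX,7\n",
    "11:13:54.250,interface,0,TX,9\n",
);

my (@columns, %seen, %row, $current);
my @out;    # finished rows: [ time, value per column ]

# Close off the row being built once its timestamp is complete.
sub flush_row {
    return unless defined $current;
    push @out, [ $current, map { $row{$_} // '' } @columns ];
    %row = ();
}

for my $line (@log) {
    chomp $line;
    $csv->parse($line) or next;
    my ($time, $name, $index, $event, $value) = $csv->fields;

    # A new timestamp means the previous group is finished.
    flush_row() if defined $current && $time ne $current;
    $current = $time;

    my $key = "$name.$event";    # same key shape as in the question
    push @columns, $key unless $seen{$key}++;
    $row{$key} = $value;
}
flush_row();

$| = 1;    # unbuffered STDOUT so each row appears immediately
$csv->print(\*STDOUT, [ 'Time', @columns ]);
$csv->print(\*STDOUT, $_) for @out;
```

In a live setup the loop body would sit inside the File::Tail read loop and each finished row would be printed as soon as it is flushed. The header is the awkward part, because new columns can appear after it has been written.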