I would like to optimise this Perl sub:
push_csv($string,$addthis,$position);
for placing strings inside a CSV string.
e.g. if $string="one,two,,four"; $addthis="three"; $position=2;
then push_csv($string,$addthis,$position) will change the value of $string = "one,two,three,four";
sub push_csv {
my #fields = split /,/, $_[0]; # split original string by commas;
$_[1] =~ s/,//g; # remove commas in $addthis
$fields[$_[2]] = $_[1]; # put the $addthis string into
# the array position $position.
$_[0] = join ",", #fields; # join the array with commas back
# into the string.
}
This is a bottleneck in my code, as it needs to be called a few million times.
If you are proficient in Perl, could you take a look at it, and suggest optimisation/alternatives? Thanks in advance! :)
EDIT:
Converting to #fields and back to string is taking time, I just thought of a way to speed it up where I have more than one sub call in a row. Split once, then push more than one thing into the array, then join once at the end.
For several reasons, you should be using Text::CSV to handle these low-level CSV details. Provided that you are able to install the XS version, my understanding is that it will run faster than anything you can do in pure Perl. In addition, the module will correctly handle all sorts of edge cases that you are likely to miss.
use Text::CSV;
my $csv = Text::CSV->new;
my $line = 'foo,,fubb';
$csv->parse($line);
my #fields = $csv->fields;
$fields[1] = 'bar';
$csv->combine(#fields);
print $csv->string; # foo,bar,fubb
Keep your array as an array in the first place, not as a ,-separated string?
You might want to have a look at Data::Locations.
Or try (untested, unbenchmarked, doesn't append new fields like your original can...)
sub push_csv {
$_[1] =~ y/,//d;
$_[0] =~ s/^(?:[^,]*,){$_[2]}\K[^,]*/$_[1]/;
return;
}
A few suggestions:
Use tr/,//d instead of s/,//g as it is faster. This is essentially the same as ysth's suggestion to use y/,//d
Perform split only as much as is needed. If $position = 1, and you have 10 fields, then you're wasting computation performing unnecessary splits and joins. The optional third argument to split can be leveraged to your advantage here. However, this does depend on how many consecutive empty fields you are expecting. It may not be worth it if you don't know ahead of time how many of these you have
You're quite right in wanting to perform multiple appends with one sub-call. There is no need to perform multiple splits and joins when one will do just as well
You really ought to be using Text::CSV, but here's how I would revise the implementation of your sub in pure Perl (assuming a maximum of one consecutive empty field):
sub push_csv {
my ( $items, $positions ) = #_[1..2];
# Test inputs
warn "No. of items to add & positions not equal"
and
return unless #{$items} == #{$positions};
my $maxPos; # Find the maximum position number
for my $position ( #{$positions} ) {
$maxPos ||= $position;
$maxPos = $position if $maxPos < $position;
}
my #fields = split /,/ , $_[0], $maxPos+2; # Split only as much as needed
splice ( #fields, $positions->[$_], 1, $items->[$_] ) for 0 .. $#{$items};
$_[0] = join ',' , #fields;
print $_[0],"\n";
}
Usage
use strict;
use warnings;
my $csvString = 'one,two,,four,,six';
my #missing = ( 'three', 'five' );
my #positions = ( 2, 4 );
push_csv ( $csvString, \#missing, \#positions );
print $csvString; # Prints 'one,two,three,four,five,six'
If you're hitting a bottleneck by splitting and joining a few million times... then don't split and join constantly. split each line once when it initially enters the system, pass that array (or, more likely, a reference to the array) around while doing your processing, and then do a single join to turn it into a string when you're ready for it to leave the system.
e.g.:
#!/usr/bin/env perl
use strict;
use warnings;
# Start with some CSV-ish data
my $initial_data = 'foo,bar,baz';
# Split it into an arrayref
my $data = [ split /,/, $initial_data ];
for (1 .. 1_000_000) {
# Pointless call to push_csv, just to make it run
push_csv($data, $_, $_ % 3);
}
# Turn it back into a string and display it
my $final_data = join ',', #$data;
print "Result: $final_data\n";
sub push_csv {
my ($data_ref, $value, $position) = #_;
$$data_ref[$position] = $value;
# Alternately:
# $data_ref->[$position] = $value;
}
Note that this simplifies things enough that push_csv becomes a single, rather simple, line of processing, so you may want to just do the change inline instead of calling a sub for it, especially if runtime efficiency is a key criterion - in this trivial example, getting rid of push_csv and doing it inline reduced run time by about 70% (from 0.515s to 0.167s).
Don't you think it might be easier to use arrays and splice, and only use join to create the comma separation at the end?
I really don't think using s/// repeatedly is a good idea if this is a major bottleneck in your code.
Related
I have written a script which collects marks of students and print the one who scored above 50.
Script is below:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my #array = (
'STUDENT1,90
STUDENT2,40
STUDENT3,30
STUDENT4,30
');
print Dumper(\#array);
my $class = "3";
foreach my $each_value (#array) {
print "EACH: $each_value\n";
my ($name, $score ) = split (/,/, $each_value);
if ($score lt 50) {
next;
} else {
print "$name, \"GOOD SCORE\", $score, $class";
}
}
Here I wanted to print data of STUDENT1, since his score is greater than 50.
So output should be:
STUDENT1, "GOOD SCORE", 90, 3
But its printing output like this:
STUDENT1, "GOOD SCORE", 90
STUDENT2, 3
Here some manipulation happens between 90 STUDENT2 which it discards to separate it.
I know I was not splitting data with new line character since we have single element in the array #array.
How can I split the element which is in array to new line, so that inside for loop I can split again with comma(,) to have the values in $name and $score.
Actually the #array is coming as an argument to this script. So I have to modify this script in order to parse right values.
As you already know your "array" only has one "element" with a string with the actual records in it, so it essentially is more a scalar than an array.
And as you suspect, you can split this scalar just as you already did with the newline as a separator instead of a comma. You can then put a foreach around the result of split() to iterate over the records.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my $records = 'STUDENT1,90
STUDENT2,40
STUDENT3,30
STUDENT4,30
';
my $class = "3";
foreach my $record (split("\n", $records)) {
my ($name, $score) = split(',', $record);
if ($score >= 50) {
print("$name, \"GOOD SCORE\", $score, $class\n");
}
}
As a small note, lt is a string comparison operator. The numeric comparisons use symbols, such as <.
Although you have an array, you only have a single string value in it:
my #array = (
'STUDENT1,90
STUDENT2,40
STUDENT3,30
STUDENT4,30
');
That's not a big deal. Dave Cross has already shown you have you can break that up into multiple values, but there's another way I like to handle multi-line strings. You can open a filehandle on a reference to the string, then read lines from the string as you would a file:
my $string = 'STUDENT1,90
STUDENT2,40
STUDENT3,30
STUDENT4,30
';
open my $string_fh, '<', \$string;
while( <$string_fh> ) {
chomp;
...
}
One of the things to consider while programming is how many times you are duplicating the data. If you have it in a big string then split it into an array, you've now stored the data twice. That might be fine and its usually expedient. You can't always avoid it, but you should have some tools in your toolbox that let you avoid it.
And, here's a chance to use indented here docs:
use v5.26;
my $string = <<~"HERE";
STUDENT1,90
STUDENT2,40
STUDENT3,30
STUDENT4,30
HERE
open my $string_fh, '<', \$string;
while( <$string_fh> ) {
chomp;
...
}
For your particular problem, I think you have a single string where the lines are separated by the '|' character. You don't show how you call this program or get the data, though.
You can choose any line ending you like by setting the value for the input record separator, $/. Set it to a pipe and this works:
use v5.10;
my $string = 'STUDENT1,90|STUDENT2,40|STUDENT3,30|STUDENT4,30';
{
local $/ = '|'; # input record separator
open my $string_fh, '<', \$string;
while( <$string_fh> ) {
chomp;
say "Got $_";
}
}
Now the structure of your program isn't too far away from taking the data from standard input or a file. That gives you a lot of flexibility.
The #array contains one element, Actually the for loop will working correct, you can fix it without any change in the for block just by replacing this array:
my #array = (
'STUDENT1,90',
'STUDENT2,40',
'STUDENT3,30',
'STUDENT4,30');
Otherwise you can iterate on them by splitting lines using new line \n .
I am learning Perl for work and I'm trying to practise with some basic programs.
I want my program to take a string from STDIN and modify it by taking the last character and putting it at the start of the string.
I get an error when I use variable $str in $str = <STDIN>.
Here is my code:
my $str = "\0";
$str = <STDIN>;
sub last_to_first {
chomp($str);
pop($str);
print $str;
}
last_to_first;
Exec :
Matrix :hi
Not an ARRAY reference at matrix.pl line 13, <STDIN> line 1.
Why your approach doesn't work
The pop keyword does not work on strings. Strings in Perl are not automatically cast to character arrays, and those array keywords only work on arrays.
The error message is Not an ARRAY reference because pop sees a scalar variable. References are scalars in Perl (the scalar here is something like a reference to the address of the actual array in memory). The pop built-in takes array references in Perl versions between 5.14 and 5.22. It was experimental, but got removed in the (currently latest) 5.24.
Starting with Perl 5.14, an experimental feature allowed pop to take a scalar expression. This experiment has been deemed unsuccessful, and was removed as of Perl 5.24.
How to make it work
You have to split and join your string first.
my $str = 'foo';
# turn it into an array
my #chars = split //, $str;
# remove the last char and put it at the front
unshift #chars, pop #chars;
# turn it back into a string
$str = join '', #chars;
print $str;
That will give you ofo.
Now to use that as a sub, you should pass a parameter. Otherwise you do not need a subroutine.
sub last_to_first {
my $str = shift;
my #chars = split //, $str;
unshift #chars, pop #chars;
$str = join '', #chars;
return $str;
}
You can call that sub with any string argument. You should do the chomp to remove the trailing newline from STDIN outside of the sub, because it is not needed for switching the chars. Always build your subs in the smallest possible unit to make it easy to debug them. One piece of code should do exactly one functionality.
You also do not need to initialize a string with \0. In fact, that doesn't make sense.
Here's a full program.
use strict;
use warnings 'all';
my $str = <STDIN>;
chomp $str;
print last_to_first($str);
sub last_to_first {
my $str = shift;
my #chars = split //, $str;
unshift #chars, pop #chars;
$str = join '', #chars;
return $str;
}
Testing your program
Because you now have one unit in your last_to_first function, you can easily implement a unit test. Perl brings Test::Simple and Test::More (and other tools) for that purpose. Because this is simple, we'll go with Test::Simple.
You load it, tell it how many tests you are going to do, and then use the ok function. Ideally you would put the stuff you want to test into its own module, but for simplicity I'll have it all in the same program.
use strict;
use warnings 'all';
use Test::Simple tests => 3;
ok last_to_first('foo', 'ofo');
ok last_to_first('123', '321');
ok last_to_first('qqqqqq', 'qqqqqq');
sub last_to_first {
my $str = shift;
my #chars = split //, $str;
unshift #chars, pop #chars;
$str = join '', #chars;
return $str;
}
This will output the following:
1..3
ok 1
ok 2
ok 3
Run it with prove instead of perl to get a bit more comprehensive output.
Refactoring it
Now let's change the implementation of last_to_first to use a regular expression substitution with s/// instead of the array approach.
sub last_to_first {
my $str = shift;
$str =~ s/^(.+)(.)$/$2$1/;
return $str;
}
This code uses a pattern match with two groups (). The first one has a lot of chars after the beginning of the string ^, and the second one has exactly one char, after which the string ends $. You can check it out here. Those groups end up in $1 and $2, and all we need to do is switch them around.
If you replace your function in the program with the test, and then run it, the output will be the same. You have just refactored one of the units in your program.
You can also try the substr approach from zdim's answer with this test, and you will see that the tests still pass.
The core function pop takes an array, and removes and returns its last element.
To manipulate characters in a string you can use substr, for example
use warnings;
use strict;
my $str = <STDIN>;
chomp($str);
my $last_char = substr $str, -1, 1, '';
my $new_str = $last_char . $str;
The arguments to substr mean: search the variable $str, at offset -1 (one from the back), for a substring of length 1, and replace that with an empty string '' (thus removing it). The substring that is found, here the last character, is returned. See the documentation page linked above.
In the last line the returned character is concatenated with the remaining string, using the . operator.
You can browse the list of functions broken down by categories at Perl functions by category.
Perl documentation has a lot of goodies, please look around.
Strings are very often manipulated using regular expressions. See the tutorial perlretut, the quick start perlrequick, the quick reference perlreref, and the full reference perlre.
You can also split a string into a character array and work with that. This is shown in detail in the answer by simbabque, which packs a whole lot more of good advice.
This is for substring function used for array variables:
my #arrays = qw(jan feb mar);
last_to_first(#arrays);
sub last_to_first
{
my #lists = #_;
my $last = pop(#lists);
#print $last;
unshift #lists, $last;
print #lists;
}
This is for substring function used for scalar variables:
my $str = "";
$str = <STDIN>;
chomp ($str);
last_to_first($str);
sub last_to_first
{
my $chr = shift;
my $lastchar = substr($chr, -1);
print $lastchar;
}
I want to able to read this CSV file into an array of arrays or hashes for manipulation. How can I go about it?
For example my file contains the following (the first line is the header):
Name,Age,Items,Available
John,29,laptop,mouse,Yes
Jane,28,desktop,keyboard,mouse,yes
Doe,56,tablet,keyboard,trackpad,touchpen,Yes
First column is name, second is Age, third is Items, But items can contain more than one thing separated by commas, and last column is Person availability.
How can I accurately read this?
Well-formed CSV quotes fields that contain a comma as part of the value. If your CSV is well-formed use the Text::CSV module:
use Text::CSV;
my $csv = Text::CSV->new();
while (my $row = $csv->getline(\*DATA)) {
my $name = $row->[0];
my $age = $row->[1];
my #items = split /,/, $row->[2];
my $available = $row->[3];
print "$name/$age/#items/$available\n";
}
__DATA__
Name,Age,Items,Available
John,29,"laptop,mouse",Yes
Jane,28,"desktop,keyboard,mouse",yes
Doe,56,"tablet,keyboard,trackpad",touchpen,Yes
Output:
Name/Age/Items/Available
John/29/laptop mouse/Yes
Jane/28/desktop keyboard mouse/yes
Doe/56/tablet keyboard trackpad touchpen/Yes
If your CSV is not well-formed you'll need to implement a custom parse based on knowledge of your data. Assuming that the Items column is the only multi-valued field you can split on a comma and then remove the fields with a known position. Whatever is left is the items.
while (my $line = <DATA>) {
chomp $line;
my #record = split /,/, $line;
my $name = shift #record;
my $age = shift #record;
my $available = pop #record;
my #items = #record;
print "$name/$age/#items/$available\n";
}
__DATA__
Name,Age,Items,Available
John,29,laptop,mouse,Yes
Jane,28,desktop,keyboard,mouse,yes
Doe,56,tablet,keyboard,trackpad,touchpen,Yes
Alternately, you could use array slicing to get the same result:
my ($name, $age, $available, #items) = #record[0, 1, -1, 2 .. #record - 2];
Since your data is, in reality, a properly-formatted CSV file, you can use the standard tools to read and store it
Here's the data I'm now assuming that you have
Name,Age,Items,Available
John,29,"laptop,mouse",Yes
Jane,28,"desktop,keyboard,mouse",yes
Doe,56,"tablet,keyboard,trackpad,touch pen",Yes
Solution
Like my original answer, this code uses Text::CSV to parse each line of input. But instead of having to reformat it, each row may be pushed directly onto array #data
Also as before, it conforms to the standard of reading from STDIN. But this time I have used Data::Dump to reveal the in-memory data structure that has been built. If you run it on the command line you should use
$ perl unpack_csv.pl text.csv
use strict;
use warnings 'all';
use Text::CSV;
my $csv = Text::CSV->new;
my #data;
while ( <> ) {
$csv->parse($_);
my #row = $csv->fields;
push #data, \#row;
}
use Data::Dump;
dd \#data;
Update
I now realise that the OP's file may well contain properly-formatted CSV data, which makes this answer superfluous
However the question has not been changed to show the real data, so I am leaving this answer here in case the question's subject line and content entices people with a problem that this will solve
I recommend that you use an intermediate program to format your CSV file properly. Once you have a standard-format file, the resulting output can then be processed using Perl with Text::CSV, Excel, or anything similar
This program uses Text::CSV to read your input data and write the Items column enclosed in quotes if necessary
It works by using Text::CSV->parse to split each line into fields, and then reserving the first two and final fields for new fields 1, 2 and 4. Whatever is left is joined with a comma , and used for field 3. The four resulting values are passed back to Text::CSV->combine and printed
It conforms to the standard of reading from STDIN and writing to STDOUT, so if you run it on the command line you should use
$ perl reformat_csv.pl text.csv > new_text.csv
use strict;
use warnings 'all';
use Text::CSV;
my $csv = Text::CSV->new;
while ( <> ) {
$csv->parse($_);
my #row = $csv->fields;
my $f1 = shift #row;
my $f2 = shift #row;
my $f4 = pop #row;
my $f3 = join ',', #row;
$csv->combine($f1, $f2, $f3, $f4);
print $csv->string, "\n";
}
output
Name,Age,Items,Available
John,29,"laptop,mouse",Yes
Jane,28,"desktop,keyboard,mouse",yes
Doe,56,"tablet,keyboard,trackpad,touchpen",Yes
So i have been working on this perl script that will analyze and count the same letters in different line spaces. I have implemented the count to a hash but am having trouble excluding a " - " character from the output results of this hash. I tried using delete command or next if, but am not getting rid of the - count in the output.
So with this input:
#extract = ------------------------------------------------------------------MGG-------------------------------------------------------------------------------------
And following code:
#Count selected amino acids.
my %counter = ();
foreach my $extract(#extract) {
#next if $_ =~ /\-/; #This line code does not function correctly.
$counter{$_}++;
}
sub largest_value_mem (\%) {
my $counter = shift;
my ($key, #keys) = keys %$counter;
my ($big, #vals) = values %$counter;
for (0 .. $#keys) {
if ($vals[$_] > $big) {
$big = $vals[$_];
$key = $keys[$_];
}
}
$key
}
I expect the most common element to be G, same as the output. If there is a tie in the elements, say G = M, if there is a way to display both in that would be great but not necessary. Any tips on how to delete or remove the '-' is much appreciated. I am slowly learning perl language.
Please let me know if what I am asking is not clear or if more information is needed, thanks again kindly for all the comments.
Your data doesn't entirely make sense, since it's not actually working perl code. I'm guessing that it's a string divided into characters. After that it sounds like you just want to be able to find the highest frequency character, which is essentially just a sort by descending count.
Therefore the following demonstrates how to count your characters and then sort the results:
use strict;
use warnings;
my $str = '------------------------------------------------------------------MGG-------------------------------------------------------------------------------------';
my #chars = split '', $str;
#Count Characteres
my %count;
$count{$_}++ for #chars;
delete $count{'-'}; # Don't count -
# Sort keys by count descending
my #keys = sort {$count{$b} <=> $count{$a}} keys %count;
for my $key (#keys) {
print "$key $count{$key}\n";
}
Outputs:
G 2
M 1
foreach my $extract(#extract) {
#next if $_ =~ /\-/
$_ setting is suppressed by $extract here.
(In this case, $_ keeps value from above, e.g. routine argument list, previous match, etc.)
Also, you can use character class for better readability:
next if $extract=~/[-]/;
So I have a text file with four sets of data on a line, such as aa bb username password. So far I have been able to parse through the first line of the file using substrings and indices, and assigning each of the four to variables.
My goal is to use an array and chomp through each line and assign them to the four variables, and than to match an user inputted argument to the first variable, and use the four variables in that correct line.
For example, this would be the text file:
"aa bb cc dd"
"ee ff gg hh"
And depending on whether the user inputs "aa" or "ee" as the argument, it would use that line's set of arguments in the code.
I am trying to get up a basic array and chomp through it based on a condition for the first variable, essentially.
Here is my code for the four variables for the first line, but like I said, this only works for the first line in the text file:
local $/;
open(FILE, $configfile) or die "Can't read config file 'filename' [$!]\n";
my $document = <FILE>;
close (FILE);
my $string = $document;
my $substring = " ";
my $Index = index($string, $substring);
my $aa = substr($string, 0, $Index);
my $newstring = substr($string, $Index+1);
my $Index2 = index($newstring, $substring);
my $bb = substr($newstring, 0, $Index2);
my $newstring2 = substr($newstring, $Index2+1);
my $Index3 = index($newstring2, $substring);
my $cc = substr($newstring2, 0, $Index3);
my $newstring3 = substr($newstring2, $Index3+1);
my $Index4 = index($newstring3, $substring);
my $dd = substr($newstring3, 0, $Index4);
First of all, you can parse your whole line using split instead of running index and substring on them:
my ( $aa, $bb, $cc, $dd ) = split /\s+/, $line;
Even better, use an array:
my #array = split /\s+/, $line;
I think you're saying that you need to store each array of command parts into another array of lines. Is that correct? Take a look at this tutorial on references available in the Perl Documentation.
Perl has three different types of variables. The problem is that each of the types of variables of these stores only a single piece of data. Arrays and hashes may store lots of data, but only one piece of data can be stored in each element of a hash or array.
References allow you to get around this limitation. A reference is simply a pointer to another piece of data. For example, if $line = aa bb cc dd, doing this:
my #command_list = split /\s+/ $line;
Will give you the following:
$command_list[0] = "aa";
$command_list[1] = "bb";
$command_list[2] = "cc";
$command_list[3] = "dd";
You want to store #command_list into another structure. What you need is a reference to #command_list. To get a reference to it, you merely put a backslash in front of it:
my $reference = \#command_list;
This could be put into an array:
my #array;
$array[0] = $reference;
Now, I'm storing an entire array into a single element of an array.
To get back to the original structure from the reference, you put the correct sigil. Since this is an array, you put # in front of it:
my #new_array = #{ $reference };
If you want the first item in your reference without using having to transport it into another array, you could simply treat #{ $reference } as an array itself:
${ $reference }[0] = "aa";
Or, use the magic -> which makes the syntax a bit cleaner:
$reference->[0] = "aa";
Go through the tutorial. This will help you understand the full power of references, and how they can be used. Your program would look something like this:
use strict;
use warnings;
use feature qw(say); #Better print that print
use autodie; #Kills your program if the file can't be open
my $file = [...] #Somehow get the file you're reading in...
open my $file_fh, "<", $file;
my #command_list;
while ( my $line = <$file_fh> ) {
chomp $line;
my #line_list = split /\s+/, $line;
push #command_list, \#line_list;
}
Note that push #command_list, \#line_list; is pushing a reference to one array into another. How do you get it back out? Simple:
for my $cmd_line_ref ( #command_list ) {
my $command = $cmd_line_ref->[0]; #This is the first element in your command
next unless $command eq $user_desires; # However you figure out what the user wants
my $line = join " ", #{ $cmd_line_ref } #Rejoins your command line once again
??? #Profit
}
Read the tutorial on references, and learn about join and split.
You are reading the whole file in the my $document = <FILE> line.
Try something like:
my #lines;
open my $file, '<', $configfile or die 'xxx';
while( <$file> ) {
chomp;
push #lines, [ split ]
}
And now #lines has an array of arrays with the information you need.
(EDIT) don't forget to lose the local $/; -- it's what is making you read the whole file at once.
my $document = <FILE> is reading in only the first line. Try using a while loop.
If you want to read all lines of the file at once - assuming it's a small file - you may want to use File::Slurp module:
use File::Slurp;
my #lines = File::Slurp::read_file($configfile);
foreach my $line (#lines) {
# do whatever
Also, you can use CPAN modules to split the strings into fields.
If they are single-space separated, simply read the whole file using a standard CSV parser (you can configure Text::CSV_XS to use any characater as separator). Example here: How can I parse downloaded CSV data with Perl?
If they are separated by random amount of whitespace, use #massa's advice below and use split function.