loading data from file into 2d array - perl

I am just starting with perl and would like some help with arrays please.
I am reading lines from a data file and splitting the line into fields:
open (INFILE, $infile);
do {
my $linedata = <INFILE>;
my @data = split ',', $linedata;
....
} until eof;
I then want to store the individual field values (in @data) in an array so that the array looks like the input data file, i.e. the first "row" of the array contains the first line of data from INFILE, and so on.
Each line of data from the infile contains 4 values, x, y, z and w. Once the data are all in the array, I have to pass the array to another program, which reads the x, y, z, w values and displays the w value on a screen at the point determined by x, y, z. I cannot pass the data to the other program row by row, as the program expects the data to be in a 2D matrix format.
Any help greatly appreciated.
Chris

That's not really that difficult. You just need to store each split result not in its own separate list but as an array reference that takes up one slot of a larger array:
my @all_data;
while (my $linedata = <INFILE>) {
chomp $linedata; # drop the newline so it doesn't stick to the last field
push # creates the next (n) slot(s) in an array
@all_data
, [ split ',',$linedata ]
# ^ we're pushing an *array* not just additional elements.
;
}
However, if you're reading a file in the well-known comma-separated values (CSV) format, have a look at something like Text::CSV, because there is more to parsing CSV than splitting on commas (quoted fields can themselves contain commas, for example).
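As a rough sketch of how the resulting structure can be used: each row is stored as an array reference, so a single cell is addressed as $all_data[row][col]. The sample data and the in-memory filehandle below are assumptions made only so the example runs on its own; a real script would open the data file instead.

```perl
use strict;
use warnings;

# Hypothetical sample input standing in for the real data file.
my $sample = "1,2,3,0.5\n4,5,6,0.7\n";
open my $in, '<', \$sample or die "Can't open in-memory file: $!";

my @all_data;
while (my $line = <$in>) {
    chomp $line;                            # remove the trailing newline
    push @all_data, [ split /,/, $line ];   # one array reference per row
}
close $in;

# Each row is an array reference; unpack it to get x, y, z and w.
for my $row (@all_data) {
    my ($x, $y, $z, $w) = @$row;
    print "w=$w at ($x, $y, $z)\n";
}
print "rows: ", scalar(@all_data), "\n";
```

Whether the receiving program wants @all_data itself or a reference to it (\@all_data) depends on that program's interface, which the question does not show.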

What does $variable{$2}++ mean in Perl?

I have a two-column data set in a tab-separated .txt file, and the perl script reads it as FH and this is the immediate snippet of code that follows:
while(<FH>)
{
chomp;
s/\r//;
/(.+)\t(.+)/;
$uniq_tar{$2}++;
$uniq_mir{$1}++;
push @{$mir_arr{$1}}, $2;
push @{$target{$2}}, $1;
}
When I try to print any of the above 4 variables, it says the variables are uninitialized.
And, when I tried to print $uniq_tar{$2}++; and $uniq_mir{$1}++;
It just prints some numbers which I cannot understand.
I would just like to know what this part of the code does in general:
$uniq_tar{$2}++;
The while loop puts each line of your file, in turn, into Perl's special variable $_.
/.../ is the match operator. By default it works on $_.
/(.+)\t(.+)/ is a regular expression inside the match operator. If the regex matches what is in $_, then the bits of the matching string that are inside the two pairs of parentheses are stored in Perl's special variables $1 and $2.
You have hashes called %uniq_tar and %uniq_mir. You access individual elements in a hash using $hashname{key}. So, $uniq_tar{$2} is finding the value in %uniq_tar associated with the key that is stored in $2 (that is, the part of your record after the tab).
$variable++ increments the number in $variable. So $uniq_tar{$2}++ increments the value that we found in the previous paragraph.
So, as zdim says, it's a frequency counter. You read each line in the file, and extract the bits of data before and after the tab in the line. You then increment the values in two hashes to count the number of occurrences of each of the strings.
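To make the counting concrete, here is a minimal sketch of the same loop run over a few made-up records. The record values are hypothetical, and the `next unless` guard is an addition that is not in the original snippet; it avoids reusing stale $1 and $2 when a line does not match.

```perl
use strict;
use warnings;

# Hypothetical tab-separated records (two columns each).
my @lines = ("mir1\tgeneA\n", "mir1\tgeneB\n", "mir2\tgeneA\r\n");

my (%uniq_tar, %uniq_mir, %mir_arr, %target);
for (@lines) {
    chomp;
    s/\r//;                        # drop a carriage return, if present
    next unless /(.+)\t(.+)/;      # skip lines that do not match
    $uniq_mir{$1}++;               # count occurrences of each first-column value
    $uniq_tar{$2}++;               # count occurrences of each second-column value
    push @{ $mir_arr{$1} }, $2;    # second-column values seen with each first-column value
    push @{ $target{$2} },  $1;    # first-column values seen with each second-column value
}

# Hash values are read back by key, not through $1 or $2.
for my $tar (sort keys %uniq_tar) {
    print "$tar seen $uniq_tar{$tar} time(s) with: @{ $target{$tar} }\n";
}
```

Printing by iterating over the keys, as in the last loop, is one way to inspect the hashes once the while loop has finished and $1 and $2 no longer hold useful values.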

Get the number of columns in an ASCII file

I have found many questions regarding CSV files, but none regarding a normal ASCII (.dat) file.
Assume I have a subroutine sub writeMyFile($data), which writes different values to an ASCII file my_file.dat. Each column is then a value, which I want to plot in another subroutine sub plotVals(), but for that I need to know the number of columns of my_file.dat, which is not always the same.
What is an easy and readable way in Perl to get the number of columns of an ASCII file my_file.dat?
Some sample input/output would be (note: file might have multiple rows):
In:
(first line on my_data1.dat) -19922 233.3442 12312 0 0
(first line on my_data2.dat) 0 0 0
Out:
(for my_data1.dat) 5
(for my_data2.dat) 3
You haven't really given us enough detail for any answer to be really helpful (explaining the format of your data file, for example, would have been a great help).
But let's assume that you have a file where the fields are separated by whitespace - something like this:
col1 col2 col3 col4 col5 col6 col7 col8
We know nothing about the columns, only that they are separated by varying amounts of white space.
We can open the file in the usual manner.
my $file = 'my_file.dat';
open my $data_fh, '<', $file or die "Can't open $file: $!";
We can read each record from the file in turn in the usual manner.
while (<$data_fh>) {
# Data is in $_. Let's remove the newline from the end.
chomp;
# Here we do other interesting stuff with the data...
}
Probably, a useful thing to do would be to split the record so that each field is stored in a separate element of an array. That's simple with split().
# By default, split() works on $_ and splits on whitespace (ignoring
# any leading whitespace), so this is equivalent to:
# my @data = split ' ', $_;
my @data = split;
Now we get to your question. We have all of our values in @data. But we don't know how many values there are. Luckily, Perl makes it simple to find out the number of elements in an array. We just assign the array to a scalar variable.
my $number_of_values = @data;
I think that's all the information you'll need. Depending on the actual format of your data file, you might need to change the split() line in some way - but without more information it's impossible for us to know what you need there.
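Putting those pieces together, a minimal sketch might look like the following. The helper name count_columns() is hypothetical, the file is assumed to be whitespace-separated, and an in-memory string stands in for my_file.dat so that the example is self-contained.

```perl
use strict;
use warnings;

# count_columns() is a hypothetical helper, not part of the original code.
# It returns the number of whitespace-separated fields on the first line.
sub count_columns {
    my ($fh) = @_;
    my $first = <$fh>;
    return 0 unless defined $first;    # empty file
    my @fields = split ' ', $first;    # ' ' also ignores leading whitespace
    return scalar @fields;
}

# In-memory sample standing in for my_file.dat.
my $sample = "-19922 233.3442   12312 0 0\n1 2 3 4 5\n";
open my $fh, '<', \$sample or die "Can't open in-memory file: $!";
my $columns = count_columns($fh);
close $fh;

print "columns: $columns\n";
```

This only inspects the first line, so it assumes every row of the file has the same number of columns.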
When reading the file in plotVals(), split each line on whatever delimiter you use in the data file, and count how many fields you get. I presume that you have to split the lines anyway to plot the individual data points, unless you call an external utility for doing the plotting. If you call an external utility for plotting, then it is enough to read one representative row (the first?) and count the fields in that.
Alternatively pass the data or some meta data (the number of columns) directly to plotVals().

Reading large amount of data stored in lines from csv

I need to read in a lot of data (~10^6 data points) from a *.csv-file.
the data is stored in lines
I can't know how many data points per line and how many lines are there before I read it in
the amount of data points per line can be different for each line
So the *.csv-file could look like this:
x Header
x1,x2
y Header
y1,y2,y3, ...
z Header
z1,z2
...
Right now I read in every line as string and split it at every comma. This is what my code looks like:
index = 1;
headerLine = textscan(csvFileHandle,'%s',1,'Delimiter','\n');
while ~isempty(headerLine{1})
dummy = textscan(csvFileHandle,'%s',1,'Delimiter','\n', ...
'BufSize',2^31 - 1);
rawData(index) = textscan(dummy{1}{1},'%f','Delimiter',',');
headerLine = textscan(csvFileHandle,'%s',1,'Delimiter','\n');
index = index + 1;
end
It's working, but it's pretty slow. Most of the time is used while splitting the string with textscan. (~95%).
I preallocated rawData with sample data, but that made almost no difference to the speed.
Is there a better way than mine to read in something like this?
If not, is there a faster way to split this string?
First suggestion: to read a single line as a string when looping over a file, just use fgetl (returns a nice single string so no faffing with cell arrays).
Also, you might consider (if possible) reading everything in a single go rather than making repeated reads from the file:
output = textscan(fid, '%*s%s','Delimiter','\n'); % skips headers with *
If the file is so big that you can't do everything at once, try to read in blocks (e.g. tackle 1000 lines at a time, parsing data as you go).
For converting the string, there are the options of str2num or strsplit+str2double, but the only thing I can think of that might be slightly quicker than textscan is sscanf. Since sscanf doesn't accept the delimiter as a separate input, put it in the format string (true, the last value isn't followed by a comma, but sscanf can handle that).
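lines = output{1}; % textscan returns a 1x1 cell holding the cell array of lines
data = cell(size(lines)); % preallocate one cell per data line
for n = 1:numel(lines)
data{n} = sscanf(lines{n},'%f,');
end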
Tests with a limited batch of test data suggest sscanf is a bit quicker (but this might depend on machine, version and data size).

Perl get array count so can start foreach loop at a certain array element

I have a file that I am reading in. I'm using perl to reformat the date. It is a comma-separated file. In one of the files, I know that element.0 is a zipcode and element.1 is a counter. Each row can have 1 to n cities. I need to know the number of elements from element.3 to the end of the line so that I can reformat them properly. I wanted to use a foreach loop starting at element.3 to format the other elements into a single string.
Any help would be appreciated. Basically I am trying to read in a csv file and create a cpp file that can then be compiled on another platform as a plug-in for that platform.
Best Regards
Michael Gould
You can do something like this to get the fields from a line:
my @fields = split /,/, $line;
To access all elements from 3 to the end, do this:
foreach my $city (@fields[3..$#fields])
{
#do stuff
}
(Note, based on your question I assume you are using zero-based indexing. Thus "element 3" is the 4th element).
Alternatively, consider Text::CSV to read your CSV file, especially if you have things like escaped delimiters.
Well if your line is being read into an array, you can get the number of elements in the array by evaluating it in scalar context, for example
my $elems = @line;
or to be really sure
my $elems = scalar(@line);
Although in that case the scalar is redundant, it's handy for forcing scalar context where it would otherwise be list context. You can also find the index of the last element of the array with $#line.
After that, if you want to get everything from element 3 onwards you can use an array slice:
my @threeonwards = @line[3 .. $#line];
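As a small illustration of the slice approach, the sketch below counts the elements from index 3 onwards and joins them into a single string. The input line, the meaning of element 2, and the '; ' separator are all assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical input line: zipcode, counter, one more field, then the cities.
my $line = "30301,2,GA,Atlanta,Decatur";
my @fields = split /,/, $line;

# If there are fewer than four fields, the range 3 .. $#fields is empty,
# so the slice is simply an empty list.
my @cities     = @fields[3 .. $#fields];
my $city_count = @cities;               # array in scalar context gives the count
my $joined     = join '; ', @cities;

print "$city_count cities: $joined\n";
```

If fields can contain quoted commas, the plain split would need to be replaced with a proper CSV parser such as Text::CSV.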

How to randomly select from a list of 47 names that are entered from a data file?

I have managed to input a number data file into a matrix but have been unable to do so for any data that is not a number.
I have a list of 47 names and am supposed to generate a random name from the list. I have tried to use the function textscan but was not getting anywhere. Also, how do I generate a random name from the list? All I have been able to do is generate a random number between 1 and 47.
Appreciate the replies. I should have said I need it in MATLAB sorry.
Here is a sample list of data in my data file
name01
name02
name03
and the code to read it:
fid = fopen('names.dat','rt');
headerChars = fgetl(fid);
data = fscanf(fid,'%f,%f,%f,%f',[4 47]).';
fclose(fid);
The above is what I have to read the data file into a matrix, but it is only reading the first line. (Yes, it was modified from a previous post here on this forum :/)
Edit: As per the helpful comments from mtrw, and the fixed formatting of the sample data file, I've updated my answer with more detail.
With a single name (i.e. "Bob", "Bob Smith", or "Smith, Bob") on each line of the file, you can use the function TEXTSCAN by specifying '%s' as the format argument (to denote reading a string) and the newline character '\n' as the 'Delimiter' (the character that separates the strings in the file):
fid = fopen('namefile.txt','r');
names = textscan(fid,'%s','Delimiter','\n');
fclose(fid);
Then it's a matter of randomly picking one of the names. You can use the function RANDI to generate a random integer in the range from 1 to the number of names read from the file (found using the NUMEL function):
names = names{1}; %# Get the contents from the cell returned by TEXTSCAN
selectedName = names{randi(numel(names))};
Sounds like you're halfway home. Take that random number and use it as an index for the list.
For example, if you randomly generate the number 23 then fetch the 23rd entry in the list which gives you a random name draw.
In a spreadsheet, use the RANDBETWEEN function to get a random number within your range. Use INDEX to get the actual cell value. For instance:
=INDEX(A1:A47, RANDBETWEEN(1, 47))
The above will work for your specific case of 47 names, assuming they're in column A. In general, you'd want something like:
=INDEX(MyNames, RANDBETWEEN(1, ROWS(MyNames)))
This assumes you've named your range of cells "MyNames" (for example, by selecting all the cells in your range and setting a name in the naming box). The above formula works by using the ROWS function to get the total number of rows in MyNames; INDEX counts rows relative to the top of the range, so the random row number runs from 1 to that total.