Get the number of columns in an ASCII file - perl

I have found many questions regarding CSV files, but none regarding a plain ASCII (.dat) file.
Assuming I have a subroutine sub writeMyFile($data), which writes different values in an ASCII file my_file.dat. Each column is then a value, which I want to plot in another subroutine sub plotVals(), but for that I need to know the number of columns of my_file.dat, which is not always the same.
What is an easy and readable way in Perl to get the number of columns of an ASCII file my_file.dat?
Some sample input/output would be (note: file might have multiple rows):
In:
(first line on my_data1.dat) -19922 233.3442 12312 0 0
(first line on my_data2.dat) 0 0 0
Out:
(for my_data1.dat) 5
(for my_data2.dat) 3

You haven't really given us enough detail for any answer to be really helpful (explaining the format of your data file, for example, would have been a great help).
But let's assume that you have a file where the fields are separated by whitespace - something like this:
col1 col2 col3 col4 col5 col6 col7 col8
We know nothing about the columns, only that they are separated by varying amounts of white space.
We can open the file in the usual manner.
my $file = 'my_file.dat';
open my $data_fh, '<', $file or die "Can't open $file: $!";
We can read each record from the file in turn in the usual manner.
while (<$data_fh>) {
# Data is in $_. Let's remove the newline from the end.
chomp;
# Here we do other interesting stuff with the data...
}
Probably, a useful thing to do would be to split the record so that each field is stored in a separate element of an array. That's simple with split().
# By default, split() works on $_ and splits on whitespace (ignoring any
# leading whitespace), so this is equivalent to:
# my @data = split ' ', $_;
my @data = split;
Now we get to your question. We have all of our values in @data. But we don't know how many values there are. Luckily, Perl makes it simple to find out the number of elements in an array. We just assign the array to a scalar variable.
my $number_of_values = @data;
I think that's all the information you'll need. Depending on the actual format of your data file, you might need to change the split() line in some way - but without more information it's impossible for us to know what you need there.
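Putting those pieces together, a minimal sketch might look like this. It assumes the fields are separated by whitespace and that the first line is representative of the whole file; the subroutine name count_columns is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the question): returns the number of
# whitespace-separated fields on the first line of $file, or 0 if the
# file is empty.
sub count_columns {
    my ($file) = @_;
    open my $data_fh, '<', $file or die "Can't open $file: $!";
    my $first_line = <$data_fh>;
    close $data_fh;
    return 0 unless defined $first_line;
    my @data = split ' ', $first_line;    # same rule as a bare split()
    return scalar @data;                  # array in scalar context = element count
}
```

With the sample first lines from the question, count_columns('my_data1.dat') should give 5 and count_columns('my_data2.dat') should give 3.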

When reading the file in plotVals(), split each line on whatever delimiter you use in the data file, and count how many fields you get. I presume that you have to split the lines anyway to plot the individual data points, unless you call an external utility for doing the plotting. If you call an external utility for plotting, then it is enough to read one representative row (the first?) and count the fields in that.
Alternatively, pass the data or some metadata (the number of columns) directly to plotVals().

Related

Iteration to Match Line Patterns from Text File and Then Parse out N Lines

I have a text file that contains three columns. Using perl, I'm trying to loop through the text file and search for a particular pattern...
Logic: IF column2 = 00z24aug2016 & column3 = e01. When this pattern is matched, I need to write the matched line and the next 3 lines to new files.
Text File:
site1,00z24aug2016,e01
site1,00z24aug2016,e01
site1,00z24aug2016,e01
site1,00z24aug2016,e01
site2,00z24aug2016,e02
site2,00z24aug2016,e02
site2,00z24aug2016,e02
site2,00z24aug2016,e02
Desired Output...
New File 1:
site1,00z24aug2016,e01
site1,00z24aug2016,e01
site1,00z24aug2016,e01
site1,00z24aug2016,e01
New File 2:
site2,00z24aug2016,e02
site2,00z24aug2016,e02
site2,00z24aug2016,e02
site2,00z24aug2016,e02
Based on your comment in response to zdim and Borodin, it appears that you're asking for pointers on how to do this with Perl rather than actual working code, so I am answering on that basis.
What you describe in the "logic" portion of your question is extremely simple and straightforward to do in Perl - the actual code would be far shorter than this description of it:
Start your program with use strict; use warnings; - this will catch most common errors and make debugging vastly easier!
Open your input file for reading (open(my $fh, '<', $file_name) or die "Failed to open $file_name: $!")
Read in each line of the file (my $line = <$fh>;)
Optionally use chomp to remove line endings
Use split to break the line into fields (my @column = split /,/, $line;)
Check the values of the second and third fields (note that arrays start counting from 0, not from 1, so these will be $column[1] and $column[2] rather than 2 and 3)
If the field values match your criteria, set a counter to 4 (the total number of lines to output)
If the counter is greater than zero, output the original $line and decrement the counter
The logic mentions "new files" but does not specify when a new output file should be created and when output should continue to be sent to the same file. Since this was not specified, I have ignored it and described all output going to a single destination.
Note, however, that your sample desired output does not match the described logic. According to the specified logic, the output should include the first seven lines of your example data, but not the final line (because none of the three lines preceding it include "e01").
So. Take this information, along with whatever you may already know about Perl, and try to write a solution. If you reach a point where you can't figure out how to make any further progress, post a new question (or update this one) containing a copy of your code and input data, so that we can run it ourselves, and a description of how it fails to work properly, then we'll be much more able to help you with that information (and more people will be willing to help if you can show that you made an effort to do it yourself first).
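For comparison once you have had a go yourself, the steps listed above could be sketched roughly as follows. This is only one possible reading of the logic: it works on a list of lines instead of a filehandle so that it is self-contained, the name select_lines is made up, and, as noted above, it sends all output to a single destination.

```perl
use strict;
use warnings;

# Hypothetical helper: takes the input lines and returns the lines that
# the described logic would output (each match plus the next three lines).
sub select_lines {
    my @lines = @_;
    my $counter = 0;
    my @output;
    for my $line (@lines) {
        chomp(my $copy = $line);
        my @column = split /,/, $copy;
        # The second and third fields live at indexes 1 and 2.
        if (@column >= 3 && $column[1] eq '00z24aug2016' && $column[2] eq 'e01') {
            $counter = 4;    # the matched line plus the next three
        }
        if ($counter > 0) {
            push @output, $copy;
            $counter--;
        }
    }
    return @output;
}
```

Run against the eight sample lines from the question, this returns the first seven lines, which is the mismatch with the desired output described above.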

Perl get array count so can start foreach loop at a certain array element

I have a file that I am reading in. I'm using perl to reformat the date. It is a comma separated file. In one of the files, I know that element.0 is a zipcode and element.1 is a counter. Each row can have 1-n number of cities. I need to know the number of elements from element.3 to the end of the line so that I can reformat them properly. I was wanting to use a foreach loop starting at element.3 to format the other elements into a single string.
Any help would be appreciated. Basically I am trying to read in a csv file and create a cpp file that can then be compiled on another platform as a plug-in for that platform.
Best Regards
Michael Gould
You can do something like this to get the fields from a line:
my @fields = split /,/, $line;
To access all elements from 3 to the end, do this:
foreach my $city (@fields[3..$#fields])
{
#do stuff
}
(Note, based on your question I assume you are using zero-based indexing. Thus "element 3" is the 4th element).
Alternatively, consider Text::CSV to read your CSV file, especially if you have things like escaped delimiters.
Well if your line is being read into an array, you can get the number of elements in the array by evaluating it in scalar context, for example
my $elems = @line;
or to be really sure
my $elems = scalar(@line);
Although in that case the scalar is redundant, it's handy for forcing scalar context where it would otherwise be list context. You can also find the index of the last element of the array with $#line.
After that, if you want to get everything from element 3 onwards you can use an array slice:
my @threeonwards = @line[3 .. $#line];
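As a small sketch of how the slice and the element count fit together: the sample line below is made up (including the placeholder in element 2, which the question does not describe), and cities_from_line is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper: returns everything from element 3 to the end of a
# comma-separated line (an empty list if the line has no such elements).
sub cities_from_line {
    my ($line) = @_;
    chomp $line;
    my @fields = split /,/, $line;
    return @fields[3 .. $#fields];
}

# Made-up sample line: zipcode, counter, placeholder, then the cities.
my @cities = cities_from_line("12345,2,XX,Springfield,Shelbyville\n");
my $count  = @cities;                 # array in scalar context gives the count
my $joined = join ' | ', @cities;     # format the cities as a single string
print "$count cities: $joined\n";
```

Simple splitting on commas breaks if a city name itself contains a comma, which is where Text::CSV becomes the safer choice.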

Using a .fasta file to compute relative content of sequences

So me being the 'noob' that I am, being introduced to programming via Perl just recently, I'm still getting used to all of this. I have a .fasta file which I have to use, although I'm unsure if I'm able to open it, or if I have to work with it 'blindly', so to speak.
Anyway, the file that I have contains DNA sequences for three genes, written in this .fasta format.
Apparently it's something like this:
>label
sequence
>label
sequence
>label
sequence
My goal is to write a script to open and read the file, which I have gotten the hang of now, but I have to read each sequence, compute the relative amounts of 'G' and 'C' within each sequence, and then write the names of the genes and their respective 'G' and 'C' content to a TAB-delimited file.
Would anyone be able to provide some guidance? I'm unsure what a TAB-delimited file is, and I'm still trying to figure out how to open a .fasta file to actually see the content. So far I've worked with .txt files which I can easily open, but not .fasta.
I apologise for sounding completely bewildered. I'd appreciate your patience. I'm not like you pros out there!!
I get that it's confusing, but you really should try to limit your question to one concrete problem, see https://stackoverflow.com/faq#questions
I have no idea what a ".fasta" file or 'G' and 'C' is, but it probably doesn't matter.
Generally:
Open input file
Read and parse data. If it's in some strange format that you can't parse, go hunting on http://metacpan.org for a module to read it. If you're lucky someone has already done the hard part for you.
Compute whatever you're trying to compute
Print to screen (standard out) or another file.
A "TAB-delimited" file is a file with columns (think Excel) where each column is separated by the tab ("\t") character, as a quick Google or Stack Overflow search would tell you.
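To make the general steps above concrete, here is a rough sketch in Perl. It assumes the layout shown in the question (a ">label" line followed by one or more sequence lines) and unique labels; the name gc_content and the sample data are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: reads FASTA records from a filehandle and returns a
# list of [label, G fraction, C fraction] entries in file order.
sub gc_content {
    my ($fh) = @_;
    my (@order, %seq, $label);
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^\s*$/;           # skip blank lines
        if ($line =~ /^>(.*)/) {            # a header line starts a new record
            $label = $1;
            push @order, $label;
            $seq{$label} = '';
        }
        elsif (defined $label) {
            $seq{$label} .= $line;          # a sequence may span several lines
        }
    }
    my @result;
    for my $name (@order) {
        my $len = length $seq{$name};
        next unless $len;                   # avoid dividing by zero
        my $g = ($seq{$name} =~ tr/Gg//);   # tr returns the number of matches
        my $c = ($seq{$name} =~ tr/Cc//);
        push @result, [ $name, $g / $len, $c / $len ];
    }
    return @result;
}

# Made-up sample data, read from an in-memory filehandle.
my $fasta = ">gene1\nGGCC\nAATT\n>gene2\nacgt\n";
open my $in, '<', \$fasta or die "Can't open in-memory file: $!";
# One TAB-delimited line per gene: name, G fraction, C fraction.
printf "%s\t%.2f\t%.2f\n", @$_ for gc_content($in);
close $in;
```

To write to a file instead of the screen, open an output filehandle and pass it to printf; a .fasta file is plain text, so it opens exactly like a .txt file.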
Here is an approach using 'awk' utility which can be used from the command line. The following program is executed by specifying its path and using awk -f <path> <sequence file>
# NR>1 means only look at lines after line 1, because you said the sequence starts on line 2
NR>1{
    # this for-loop goes through all bases in the line
    for (i=1;i<=length($0);i++) {
        # for each position encountered, "total" is increased by 1 for total bases
        total++
        # some bases are lowercase in some fasta files, so compare in upper case
        base = toupper(substr($0,i,1))
        # count every c or C, and every g or G
        if (base=="C")
            c++
        else if (base=="G")
            g++
    }
}
END{
    # this "END-block" prints the gene name and C, G content in percentage, separated by tabs
    print "Gene name\tG content:\t"(100*g/total)"%\tC content:\t"(100*c/total)"%"
}

How do I parse this file and store it in a table?

I have to parse a file and store it in a table. I was asked to use a hash to implement this. Give me simple means to do that, only in Perl.
-----------------------------------------------------------------------
L1234| Archana20 | 2010-02-12 17:41:01 -0700 (Mon, 19 Apr 2010) | 1 line
PD:21534 / lserve<->Progress good
------------------------------------------------------------------------
L1235 | Archana20 | 2010-04-12 12:54:41 -0700 (Fri, 16 Apr 2010) | 1 line
PD:21534 / Module<->Dir,requires completion
------------------------------------------------------------------------
L1236 | Archana20 | 2010-02-12 17:39:43 -0700 (Wed, 14 Apr 2010) | 1 line
PD:21534 / General Page problem fixed
------------------------------------------------------------------------
L1237 | Archana20 | 2010-03-13 07:29:53 -0700 (Tue, 13 Apr 2010) | 1 line
gTr:SLC-163 / immediate fix required
------------------------------------------------------------------------
L1238 | Archana20 | 2010-02-12 13:00:44 -0700 (Mon, 12 Apr 2010) | 1 line
PD:21534 / Loc Information Page
------------------------------------------------------------------------
I want to read this file and I want to perform a split or whatever to extract the following fields in a table:
the id that starts with L should be the first field in a table
Archana20 must be in the second field
timestamp must be in the third field
PD must be in the fourth field
Type (content preceding / must be in the last field)
My questions are:
How to ignore the --------… (separator line) in this file?
How to extract the above?
How to split since the file has two delimiters (|, /)?
How to implement it using a hash and what is the need for this?
Please provide some simple means so that I can understand since I am a beginner to Perl.
You will probably be working through the file line by line in a loop. Take a look at perldoc -f next. You can use regular expressions or a simpler match in this case, to make sure that you only skip appropriate lines.
You need to split first and then handle each field as needed after, I would guess.
Split on the primary delimiter (which appears to be ' | ' - more on that in a minute), then split the final field on its secondary delimiter afterwards.
I'm not sure if you are asking whether you need a hash or not. If so, you need to pick which item will provide the best set of (unique) keys. We can't do that for you since we don't know your data, but the first field (at a glance) looks about right. As for how to get something like this into a more complex data structure, you will want to look at perldoc perldsc eventually, though it might only confuse you right now.
One other thing, your data above looks like it has a semi-important typo in the first line. In that line only, there is no space between the first field and its delimiter. Everywhere else it's ' | '. I mention this only because it can matter for split. I nearly edited this, but maybe the data itself is irregular, though I doubt it.
I don't know how much of a beginner you are to Perl, but if you are completely new to it, you should think about a book (online tutorials vary widely and many are terribly out of date). A reasonably good introductory book is freely available online: Beginning Perl. Another good option is Learning Perl and Intermediate Perl (they really go together).
When you say "This is not a homework...to mean this will be a start to assess me in perl", I assume you mean that this is perhaps the first assignment you have at a new job or something, in which case it seems that if we just give you the answer it will actually harm you later, since they will assume you know more about Perl than you do.
However, I will point you in the right direction.
A. Don't use split, use regular expressions. You can learn about them by googling "perl regex"
B. Google "perl hash" to learn about perl hashes. The first result is very good.
Now to your questions:
regular expressions will help you ignore lines you don't want
regular expressions will extract items. Look up "capture variables"
Don't split, use regex
See point B above.
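As a rough illustration of points A and B, capture groups can pull the fields out of each line and a hash keyed on the id can hold them. This is only a sketch: the name parse_log and the hash keys are made up, and it assumes the part before the "/" is the PD field and the part after it is the type.

```perl
use strict;
use warnings;

# Hypothetical helper: takes the lines of the log and returns a hash keyed
# by id, where each value is a hash reference holding the other fields.
sub parse_log {
    my @lines = @_;
    my (%table, $id);
    for my $line (@lines) {
        next if $line =~ /^-+\s*$/;    # ignore the separator lines
        if ($line =~ /^(L\d+)\s*\|\s*(\S+)\s*\|\s*(.*?)\s*\|/) {
            # header line: id | name | timestamp | line count
            $id = $1;
            $table{$id} = { name => $2, timestamp => $3 };
        }
        elsif (defined $id && $line =~ m{^(\S+)\s*/\s*(.*?)\s*$}) {
            # detail line: PD / type, stored with the most recent id
            @{ $table{$id} }{qw(pd type)} = ($1, $2);
        }
    }
    return %table;
}
```

After my %table = parse_log(@lines); a single field is reachable as, for example, $table{L1235}{timestamp}. The \s* around each "|" also copes with the missing space after L1234 in the first record.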
If this file is line based then you can do a line by line based read in a while loop. Then skip those lines that aren't formatted how you wish.
After that, you can use regex as indicated in the other answer. I'd use that to split each line into an array and build a hash of lists for the records. Either after that (or before), clean up each record by trimming whitespace etc. If you use regex, then use the capture expressions to add to your list in that fashion. It's up to you.
The hash key is the first column, the list contains everything else. If you are just doing a direct insert, you can get away with a list of lists and just put everything in that instead.
The key for the hash would allow you to look at particular records for fast lookup. But if you don't need that, then an array would be fine.
You can try this one.
Points to know:
read the file line by line
use a regular expression to skip the '----' lines
after that, use the split function to populate a hash of arrays
#!/usr/bin/perl
use strict;
use warnings;

my $test_file = 'test.txt';
open(my $in, '<', $test_file) or die "Can't open $test_file: $!";

my (%seen, $id, $name, $timestamp);
while (my $line = <$in>) {
    chomp $line;
    next if $line =~ /^-/;    # skip the '---' separator lines
    if ($line =~ /\|/) {
        # header line: id | name | timestamp | line count
        # (trim whitespace around each field; s///r needs Perl 5.14+)
        ($id, $name, $timestamp) = map { s/^\s+|\s+$//gr } split /\|/, $line;
    }
    elsif (defined $id) {
        # detail line: PD / type, which belongs to the most recent id
        my ($PD, $type) = map { s/^\s+|\s+$//gr } split /\//, $line, 2;
        $seen{$id} = [$name, $timestamp, $PD, $type];    # hash of arrays
    }
}
close($in);

for my $key (sort keys %seen) {
    print "$key:@{ $seen{$key} }\n";
}

loading data from file into 2d array

I am just starting with perl and would like some help with arrays please.
I am reading lines from a data file and splitting the line into fields:
open (INFILE, $infile);
do {
my $linedata = <INFILE>;
my @data= split ',',$linedata;
....
} until eof;
I then want to store the individual field values (in @data) in an array so that the array looks like the input data file, i.e. the first "row" of the array contains the first line of data from INFILE etc.
Each line of data from the infile contains 4 values, x,y,z and w, and once the data are all in the array, I have to pass the array into another program which reads the x,y,z,w and displays the w value on a screen at the point determined by the x,y,z value. I cannot pass the data to the other program on a row-by-row basis as the program expects the data to be in a 2D matrix format.
Any help greatly appreciated.
Chris
That's not really that difficult, you just need to store the splits, not in their own separate list, but in an array, taking up a slot of a larger array:
my @all_data;
while (my $linedata = <INFILE>) {
chomp $linedata; # otherwise the last field keeps its newline
push # creates the next (n) slot(s) in an array
@all_data
, [ split ',',$linedata ]
# ^ we're pushing an *array* not just additional elements.
;
}
However, if you're just trying to read the commonly known comma-separated values (CSV) format, then have a look at something like Text::CSV, because fully handling CSV takes more than splitting on commas.
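Once the rows are stored this way, individual values are reached with two indexes. A self-contained sketch, using made-up sample data and a hypothetical load_matrix helper that reads from any filehandle:

```perl
use strict;
use warnings;

# Hypothetical helper: reads "x,y,z,w" lines from a filehandle and returns
# a reference to an array of array references (one inner array per line).
sub load_matrix {
    my ($fh) = @_;
    my @all_data;
    while (my $linedata = <$fh>) {
        chomp $linedata;
        next unless length $linedata;    # skip empty lines
        push @all_data, [ split /,/, $linedata ];
    }
    return \@all_data;
}

my $input = "1,2,3,10\n4,5,6,20\n";      # made-up sample data
open my $fh, '<', \$input or die "Can't open in-memory file: $!";
my $matrix = load_matrix($fh);
close $fh;

# Row $i, column $j is $matrix->[$i][$j]; w is column 3 of each row.
for my $row (@$matrix) {
    my ($x, $y, $z, $w) = @$row;
    print "w=$w at ($x, $y, $z)\n";
}
```

How the structure is handed to the other program depends on what that program accepts; within Perl, passing the single reference $matrix keeps the rows intact.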