Need to create regular expression for one file having sentence - perl

L02 TIME DEPOSITS 489,26,45,422.92
L18 DRAFTS ACCOUNT (IF CREDIT) 10,063.00 10,063.00
L21 SBI BILLS ACCOUNT (CONTRA) A18 37,51,432.00
A12A DEMAND LOANS 4,39,13,597.30
These are the lines I have in my file. I want to extract the amounts from each line that starts with either L or A and store them in variables.
This is what I have written
pattern =/[A-Z]\w+\s*([\d,.]*)\s*([\d,.])*/g
$first = $1;
$second= $2;

Your regex is looking for a string of \w and then spaces in the middle so it cannot match multiple words. The last * should be inside the parenthesis, like the first one (but see below). The [A-Z] matches any block capital while you say that you want A or L, so use [AL] instead.
my @amounts = $string =~ /^[AL]\w+ \s+ [A-Za-z ]* ([\d,.]*)/xg;
You don't want to literally repeat the pattern with * quantifier in order to account for a variable number of occurrences. What if 2 becomes 3 when requirements change? Four? Instead, you can capture all matches in an array and get exactly as many as there are.
The /x allows us to use spaces inside for readability.
Here is another approach, which is more flexible.
You need a pattern that matches a run of digits, commas, and periods, and only where such a run stands on its own in the string. You want this only on lines that start with A or L.
So skip lines which do not start with A or L, then match only the needed pattern.
use warnings;
use strict;
my $filename = '...';
open my $fh, '<', $filename or die "Can't open $filename: $!";
while (<$fh>)
{
next unless /^[AL]/; # skip if the line doesn't start with A or L
my @amounts = $_ =~ /\b ([\d,.]+) \b/xg;
print "@amounts\n" if @amounts;
}
close $fh;
Here you need to specify \b, the word boundary. Otherwise 02 in L02 is matched, for example.
With no matches the array is empty so we test, to not print empty lines. Adjust as suitable.
The next step in reducing reliance on regex details and making code more flexible is to split the line by spaces and process term by term. Then adjustments are far easier and changes can be absorbed.
For example, this helps with the change in data mentioned in a comment -- what if there is a date? The above regex would match the numeric parts, while the first one would just break down.
With a loop over fields on each line we can just skip the date, next if /\d{4}-\d{2}/;
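The field-by-field idea can be sketched as follows. This is only an illustration: the helper name amounts_in_line and the exact tests on each field are assumptions, not part of the original answer.

```perl
use strict;
use warnings;

# Hypothetical helper: return the amount-like fields of one line.
sub amounts_in_line {
    my ($line) = @_;
    return unless $line =~ /^[AL]/;          # only lines starting with A or L
    my @amounts;
    for my $field (split ' ', $line) {
        next if $field =~ /\d{4}-\d{2}/;     # explicitly skip a date-like field
        # keep fields made only of digits, commas and periods (at least one digit)
        push @amounts, $field if $field =~ /^[\d,.]+$/ && $field =~ /\d/;
    }
    return @amounts;
}

for my $line ('L18 DRAFTS ACCOUNT (IF CREDIT) 10,063.00 10,063.00',
              'L21 SBI BILLS ACCOUNT (CONTRA) A18 37,51,432.00') {
    my @amounts = amounts_in_line($line);
    print "@amounts\n" if @amounts;
}
```

Because each field is tested on its own, a new rule (another kind of field to skip, say) is one more line in the loop rather than a change to a long regex.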

Related

Sorting upper case and lower case with Perl

I am trying to sort upper-case and lower-case words in Perl. A block of text is saved as "electricity.txt"
in the .txt file:
Today's scientific question is: What in the world is electricity and
where does it go after it leaves the toaster?
Here is a simple experiment that will teach you an important
electrical lesson: On a cool dry day, scuff your feet along a carpet,
then reach your hand into a friend's mouth and touch one of his dental
fillings. Did you notice how your friend twitched violently and cried
out in pain? This teaches one that electricity can be a very powerful
force, but we must never use it to hurt others unless we need to learn
an important lesson about electricity.
Somehow, I can't get any uppercase word
and my code is
my %count;
my $openFileile = "electricity.txt";
open my $openFile, '<', $openFileile;
while (my $list = <$openFile>) {
chomp $list;
foreach my $word (split /\s+/, $list) {
$count{lc($word)}++;
}
}
printf "\n\nSorting Alphabetically with upper case words in front of lower-case words with the same initial characters\n";
foreach my $word (sort keys %count){
printf "%-31s \n", sort {"\$a" cmp uc"\$b"} lc($word);
}
Issue 1
The first problem is that the statement below stores only the lower-case version of each word
$count{lc($word)}++;
After the initial while loop %count has only lower-case words. That means your foreach loop can never retrieve the upper-case words.
Issue 2
Second issue is this statement
printf "%-31s \n", sort {"\$a" cmp uc"\$b"} lc($word);
I have no idea what you think that the sort will achieve -- it is sorting a list with only one element, lc($word), so doesn't actually do anything.
A working example
Taking the comments above into account, here is a version that outputs both upper & lower-case words (abbreviated)
use strict;
use warnings;
my %count;
#my $openFileile = "electricity.txt";
#open my $openFile, '<', $openFileile;
while (my $list = <DATA>) {
chomp $list;
foreach my $word (split /\s+/, $list) {
$count{$word}++;
}
}
printf "\n\nSorting Alphabetically with upper case words in front of lower-case words with the same initial characters\n";
foreach my $word (sort keys %count){
printf "%-31s \n", $word;
}
__DATA__
Today's scientific question is: What in the world is electricity and where does it go after it leaves the toaster?
Here is a simple experiment that will teach you an important electrical lesson: On a cool dry day, scuff your feet along a carpet, then reach your hand into a friend's mouth and touch one of his dental fillings. Did you notice how your friend twitched violently and cried out in pain? This teaches one that electricity can be a very powerful force, but we must never use it to hurt others unless we need to learn an important lesson about electricity.
That prints this
Sorting Alphabetically with upper case words in front of lower-case words with the same initial characters
Did
Here
On
This
Today's
What
a
about
after
along
...
use
very
violently
we
where
will
world
you
As Hunter McMillen's comment says, you are using lc on the words when creating the hash, therefore all of your original capitalization will be lost. Let's go through your code, as I spot some other mistakes.
First off, always use use strict; use warnings. Especially if you have a preference for long and complicated variable names. It will save you from typos and weird bugs.
open my $openFile, '<', $openFileile;
With open statements, it is idiomatic to check the return value of the open, to see if anything went wrong. And if it did, to report the error. I.e. add ..., or die "Cannot open '$openFileile': $!".
foreach my $word (split /\s+/, $list) {
If you are splitting on whitespace, you usually want split ' ' (a string containing a single space). This is a special case for split, and also its default mode: it splits on \s+ but also skips leading whitespace.
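A small sketch of the difference, using a made-up input line with leading spaces:

```perl
use strict;
use warnings;

my $line = "  leading spaces here\n";

my @by_regex = split /\s+/, $line;   # first field is an empty string
my @by_space = split ' ',   $line;   # leading whitespace is skipped

print scalar(@by_regex), " fields with /\\s+/\n";   # 4
print scalar(@by_space), " fields with ' '\n";      # 3
```

With /\s+/ the empty string before the leading whitespace becomes a field of its own, which would then be counted as a "word" in the hash.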
$count{lc($word)}++;
Here is your problem. All the words lose their original case.
printf "\n\nSorting Alphabetically with upper case words in front of lower-case words with the same initial characters\n";
printf is a special formatting print. If you do not intend to use that formatting, use the regular print to avoid problems.
printf "%-31s \n", sort {"\$a" cmp uc"\$b"} lc($word);
You cannot sort just one (1) word. You need at least 2 words to be able to sort.
Why are you using double quotes, and then escaping the variable sigil? I am guessing this is you testing different things to see what works. This looks very unlikely to do what you want.
"\$a" will just become $a -- a dollar sign plus an "a". This is what you do when you want to print the variable name, e.g. print "\$a is $a" (prints $a is 12, for example).
lc will have no effect, since all your words are already in lower case.
Even if lc and uc would work here, you cannot use uc like that in the sort subroutine. It upper-cases only one side of each comparison ($b), and you do not control which word ends up as $a and which as $b, so the comparison is inconsistent and the resulting order is unreliable.
Also uc will change all the letters to upper case (cat => CAT). You want ucfirst (cat => Cat).
When I clean up your code, and also make the variable names somewhat more reasonable, I get this below. Also, I removed your file open, since I use the internal DATA file handle to facilitate testing. You can just put back your own open, with the additions I described above.
use strict;
use warnings;
my %words;
while (my $line = <DATA>) {
for my $word (split ' ', $line) { # split on ' ' a single space removes leading and trailing whitespace
my $key = lc $word; # save lowercase word as key
$words{$key}{count}++; # separate count
$words{$key}{value} = $word; # word original formatting as value
}
}
# printf is used for special formatting, if you are not using that formatting, use regular print to avoid unnecessary interpolation of %
print "\nSorting Alphabetically with upper case words in front of lower-case words with the same initial characters\n";
for my $word (sort keys %words) {
printf "%-31s : %s\n", $words{$word}{value}, $words{$word}{count};
}
__DATA__
Today's scientific question is: What in the world is electricity and where does it go after it leaves the toaster?
Here is a simple experiment that will teach you an important electrical lesson: On a cool dry day, scuff your feet along a carpet, then reach your hand into a friend's mouth and touch one of his dental fillings. Did you notice how your friend twitched violently and cried out in pain? This teaches one that electricity can be a very powerful force, but we must never use it to hurt others unless we need to learn an important lesson about electricity.
And it prints
a : 5
about : 1
after : 1
along : 1
an : 2
and : 3
be : 1
but : 1
can : 1
carpet, : 1
cool : 1
...etc
As can be noticed, this differentiates between carpet and carpet, since you are only splitting on whitespace. It keeps the non-word characters and includes them in the hash. There are different ways to find words in a text. For example, instead of split you could use a regex:
my @words = $line =~ /\w+/g; # \w is word characters, plus numbers, and underscore _
Even this is simplistic, but will work better than your split. You can add characters to the regex as your needs require, for example: /[\w\-]+/ -- include dash for hyphenated words, e.g. mega-carpet. (Note that dash - has to be escaped when placed between other characters inside a character class bracket, otherwise it will be interpreted as a range, e.g. a-z.)
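Neither answer shows a comparator for the ordering asked about in the question. One possible reading of it (alphabetical regardless of case, with the capitalised form first when two words differ only in case) can be sketched like this; the helper name sort_upper_first is made up for the example:

```perl
use strict;
use warnings;

# Hypothetical helper: case-insensitive order; on a tie, plain string
# comparison puts upper-case letters (lower code points) first.
sub sort_upper_first {
    return sort { lc($a) cmp lc($b) or $a cmp $b } @_;
}

my @sorted = sort_upper_first(qw(where What a This the Today's The));
print "$_\n" for @sorted;
```

Both sides of the comparison are treated the same way, which is what keeps the ordering consistent.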

In Perl, can you use a variable for the whole of a match string?

I'm new to Perl, though not to programming, and am working through Learning Perl. The book has exercises to match successive lines of a small text file.
I had the idea of supplying match strings from STDIN, and going through the file for each one:
while(<STDIN>) {
chomp;
$regex = $_;
seek JUNK, 0, 0;
while(<JUNK>) {
chomp();
if(/$regex/) {
say;
}
}
say '';
}
This works fine, but I can't find a way to interpolate an entire match string, e.g.
/fred/i
into the predicate. I tried
if($$matcher) # with $matcher = '/fred/'
but Perl complained.
I imagine this is my ignorance, and should welcome enlightenment.
Regex modifiers (flags), such as /i, are part of the code telling Perl how to perform the match, not part of the pattern to be matched. This is why that doesn't work for you.
You have three ways to work around this (well, probably more, since this is Perl we're talking about, but three ways that I can think of straight off):
1) Use extended regex syntax and, when you want a case-insensitive match, enter (?i:fred), as suggested in comments on the question.
2) Use string eval to allow the use of the regular regex modifiers: if (eval "\$_ =~ $regex") { say } (The \$_ is escaped so that the eval'd code tests the current line, rather than having the line's text pasted into the code.) Note that this method will require you to also type the surrounding slashes, e.g., you'd have to enter /fred/i; just typing in fred would not work. Note also that it's a huge security hole to do this without validating your input first, since the user's entered text is executed as Perl code, just as if it were part of the original program. (Imagine if the user entered //, system("rm -rf /") - it would test against an empty regex, then delete all the files on your computer.) So probably not a recommended approach unless you really know what you're doing and/or you're the only one who will ever run the program.
3) The most complex, but also most correct, solution is to write a parser which inspects the user's entered string to see whether any special flags are present and then responds accordingly. A very simple example which allows the user to append /i for a case-insensitive search:
#!/usr/bin/env perl
use strict;
use warnings;
use 5.010;
while(<STDIN>) {
chomp;
my @parts = split '/', $_;
# If the user input starts with a /, the first part will be empty, so throw
# it away.
shift @parts unless $parts[0];
my $re = shift @parts;
my %flags;
for (@parts) {
for (split '') {
$flags{i} = 1 if $_ eq 'i';
}
}
my $f = join '', keys %flags;
say "Matched" if eval qq('foo' =~ /$re/$f);
}
This also uses string eval, so it is potentially vulnerable to the same kind of security issues as #2, but $re cannot contain any / characters (the split '/' would have ended $re immediately prior to the first /), which prevents code from being inserted there and $f can contain only the letter i (or any other flags you might choose to recognize if you expand on this). So it should be safe. (But, if anyone can demonstrate an exploit I missed, please tell me about it in comments!)
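For comparison, here is a sketch of the same parsing idea without string eval: the accepted flags are folded into the pattern as an embedded modifier and compiled with qr//. The helper name compile_user_regex and the flag whitelist (i, m, s, x) are assumptions made for this example. Perl refuses (?{ ... }) code blocks in patterns interpolated at run time unless use re 'eval' is in effect, so the input is treated as a pattern and nothing else.

```perl
use strict;
use warnings;

# Hypothetical helper: turn user input such as "/fred/i" or "fred" into a
# compiled regex, accepting only a small whitelist of flags.
sub compile_user_regex {
    my ($input) = @_;
    my ($pattern, $flags) = ($input, '');
    if ($input =~ m{^/(.*)/([a-z]*)$}s) {      # looks like /pattern/flags
        ($pattern, $flags) = ($1, $2);
    }
    die "Unsupported flag in '$flags'\n" if $flags =~ /[^imsx]/;
    my $source = length($flags) ? "(?$flags)$pattern" : $pattern;
    my $re = eval { qr/$source/ };             # block eval only traps a bad pattern
    die "Invalid pattern '$pattern': $@" unless defined $re;
    return $re;
}

my $re = compile_user_regex('/fred/i');
print "Matched\n" if 'Fred Flintstone' =~ $re;
```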
Problem
What you are trying to do can be summarized by:
my $regex = '/fred/i';
my @lines = (
'A line containing some words and Fred said Hello.',
'Another line. Here is a regex embedded in the line: /fred/i',
);
for ( @lines ) {
say if /$regex/;
}
Output:
Another line. Here is a regex embedded in the line: /fred/i
We see that the second line matches $regex, whereas we wanted the first line containing Fred to match the string fred with the (case insensitive) i flag added to the regex. The problem is that the characters / and i in $regex are taken as characters to be matched literally, i.e., they are not interpreted as special characters surrounding a Regex (as part of a Perl expression).
Note:
The character / is special as part of a Perl expression for a regular expression, but it is not special inside the Regex pattern. There are however characters that are special inside the pattern, the so-called meta characters:
\ | ( ) [ { ^ $ * + ? .
see perldoc quotemeta for more information.
A solution using extended patterns
Simply change the first line to:
my $regex = '(?i)fred'; # or alternatively: (?i:fred)
Regex flags can be added to a regex pattern using "Extended patterns" described in the manual perldoc perlre :
Extended Patterns
The syntax for most of these is a pair of parentheses with a question
mark as the first thing within the parentheses. The character after
the question mark indicates the extension.
[...]
(?adlupimnsx-imnsx)
(?^alupimnsx)
One or more embedded pattern-match modifiers, to be turned on (or
turned off if preceded by "-" ) for the remainder of the pattern or
the remainder of the enclosing pattern group (if any). This is
particularly useful for dynamically-generated patterns, such as those
read in from a configuration file, taken from an argument, or
specified in a table somewhere.
[...]
These modifiers are restored at the end of the enclosing group.
Alternatively the non-capturing form can be used:
(?:pattern)
(?adluimnsx-imnsx:pattern)
(?^aluimnsx:pattern)
This is for clustering, not capturing; it groups subexpressions like
"()" , but doesn't make backreferences as "()" does.
The question has been answered in the following comment:
Try (?i:fred), see Extended patterns in perldoc perlre for more information (comment by Håkon Hægland).
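A short sketch of how the two embedded forms differ when the pattern arrives as a plain string. The sample lines are the ones used in the answer above; the patterns themselves are chosen only for illustration.

```perl
use strict;
use warnings;

my @lines = (
    'A line containing some words and Fred said Hello.',
    'Another line. Here is a regex embedded in the line: /fred/i',
);

# (?i) applies to the rest of the pattern, (?i:...) only inside the group
for my $pattern ('(?i)fred', '(?i)fred SAID', '(?i:fred) SAID', 'fred') {
    my @hits = grep { /$pattern/ } @lines;
    printf "%-16s matches %d line(s)\n", $pattern, scalar @hits;
}
```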

Perl using regex to compare fields with multiple delimiters

I am studying Perl.
My data.txt file contains:
Lori:James Apple
Jamie:Eric Orange
My code below prints the first line "Lori:James Apple"
open(FILE,'data.txt');
while(<FILE>){
print if /James/;
}
But how do I modify my regular expression to search for a specific field?
For example, I'd like to use 2 delimiters ' ' and ':' to make each line contain 3 fields and check if the 3rd field of the first line is Apple. Which will be equivalent to awk -F'[ :]' '$3 = "Lori"' data.txt
One simple way with regex is to use the negated character class (also see it in perlretut)
open my $fh, '<', $file or die "Can't open $file: $!";
while (my $line = <$fh>)
{
my @fields = $line =~ /([^:\s]+)/g;
}
The [^...] matches any character other than those listed inside (after ^ which "negates"). The + quantifier means to match one-or-more times so the whole pattern matches a string of consecutive characters other than : and "white space." See docs for a precise description of \s. If you actually mean to skip only a single literal space use [^: ]. All this is captured by ().
The search keeps going through the string due to the global modifier /g, finding all such matches. Since it is in list context it returns the list of matches, which is assigned to the @fields array.
One can pick elements "on the fly" by indexing into the list, ($line =~ /([^:\s]+)/g)[2]. If we are matching $_ this is (/([^:\s]+)/g)[2].
I suggest a good read through perlretut, for starters.
On the other hand, it is often simpler and clearer to use split
my @fields = split /[:\s]/, $line;
This also uses regex for the pattern by which to split the string. The character class is not negated since here it specifies the delimiter itself, either : or \s (each delimiter may be either of these, they don't have to all be the same).
I would now like to answer the specific question, but the question isn't clear to me.
It asks to "check if the 3rd field of the first line is Apple", which can be done for example by
while (<$fh>)
{
if ( (/([^:\s]+)/g)[2] eq 'Apple' ) {
# ....
}
}
but it isn't clear what to do with it. Perhaps get the first field by what the third one is?
I suggest to get an array and then process. One can write a regex to identify and pick fields directly but that's more brittle and the regex itself then depends on the position (and number) of fields.
At this point we are in a guessing game. If you need more detail please clarify.
The given awk code would yield Lori James Lori and I don't see how that fits.
The short answer is - don't. Regular expressions are about pattern matching, and not context.
You can define a pattern that builds in delimiters and fields, but ... it's not the right tool for the job.
The answer is use split and then handle the fields separately.
open ( my $input, '<', 'data.txt' ) or die $!;
while(<$input>){
chomp;
my @fields = split /[\s:]/;
print "$_\n" if defined $fields[2] && $fields[2] eq "Apple"; # chomp removed the newline, so add it back
}
You can compact this further if you wish, but I'd advise caution - compressing your code at the expense of readability isn't a virtue.
Also - whilst we're at it:
open(FILE,'data.txt');
is bad style - it doesn't check for success, and it also uses a global file handle name. It would be much better to:
open ( my $input, '<', 'data.txt' ) or die $!;
The autodie pragma also does this implicitly.

Need further understanding in the next unless code I'm reading

I need help with 2 things in this code that I'm reading. First, I keep seeing this inside a while loop that reads a file:
while(<filename>){
next unless (/\w/);
chomp;
s/^\s*//;
s/^\s*$//;
my($name, $datatype, $io, $dummy) = split /\s*,\s*/, $_, 4;
}
So, I'm wondering what that is doing? Because there are commas in the same line being read, so wouldn't the commas make it go to the next iteration? So how would it split the lines if it is going to another iteration when the commas are being read?
Another one I'm stumped by is:
while (<AP>) {
chomp;
s/
//g;
}
I have no idea what that code is actually substituting...
Thanks!
The first snippet:
Reads a line from a filehandle called filename. This is a really bad name for a filehandle
It skips the processing if there is not even a single \w (word character) on the line.
The next unless (/\w/); is the same as next if not (/\w/). Note that there is no need for parenthesis -- next unless /\w/; is fine.
A word character is, from perlretut
\w matches a word character (alphanumeric or _), not just [0-9a-zA-Z_] but also digits and characters from non-roman scripts
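It removes (only) the newline with chomp. Then it removes leading whitespace, if any.
It blanks out a line that contains only whitespace (it does not remove the line; and since such lines were already skipped by the \w test, this substitution has no effect here).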
It splits the line by commas, allowing that they have spaces before and/or after. It also limits the number of terms returned, to 4. This means that it returns the first three comma-separated fields, and then all the rest as one string in the last element of the list
The second snippet is really bad, whatever it is meant to do. (Remove spaces on the line?)
Comments
It is far better to use lexical filehandles, rather than barenames. So you'd open a file as
open my $fh, '<', $filename or die "Can't open $filename: $!";
and read it by while (my $line = <$fh>) or by while (<$fh>).
Normally you'll see lines skipped if they have nothing other than spaces
next unless /\S/; # or
next if /^\s*$/;
Using \w also skips lines with some characters (other than what is matched by \w), which means that one had better be very sure that those are fine to skip.
Here it may be meant to skip a line with commas but no \w (comma is not matched by \w), for which split would return spaces (or empty strings) in a list. I find this a bit hidden and fragile. I'd drop lines with spaces only, and handle possible loose commas in processing. As it stands it doesn't help with ,,a, anyway, which yields ('', '', 'a'). So checking is probably needed in any case.
Note that altogether this code leaves trailing spaces. split only strips whitespace around the commas, and with the limit given as its optional third argument the rest of the line lands in the last field as is; the trailing spaces haven't been removed otherwise.
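The effect of the limit and the leftover trailing spaces can be seen in a small sketch. The input line and the variable $rest are invented for the example; the substitution and the split are the ones from the snippet.

```perl
use strict;
use warnings;

my $line = "  clk , logic, input , first, second  \n";
chomp $line;
$line =~ s/^\s*//;    # strip leading whitespace, as in the snippet

# at most 4 fields: three split off, the remainder kept as one string
my ($name, $datatype, $io, $rest) = split /\s*,\s*/, $line, 4;
print "[$name] [$datatype] [$io] [$rest]\n";

$rest =~ s/\s+$//;    # trailing whitespace has to be removed separately
print "[$rest]\n";
```

The first print shows the last field still carrying its inner comma and the trailing spaces; only the explicit s/\s+$// removes the latter.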

Perl Program to Mimic RNA Synthesis

Looking for suggestions on how to approach my Perl programming homework assignment to write an RNA synthesis program. I've summed and outlined the program below. Specifically, I'm looking for feedback on the blocks below (I'll number for easy reference). I've read up to chapter 6 in Elements of Programming with Perl by Andrew Johnson (great book). I've also read the perlfunc and perlop pod-pages with nothing jumping out on where to start.
Program Description: The program should read an input file from the command line, translate it into RNA, and then transcribe the RNA into a sequence of uppercase one-letter amino acid names.
Accept a file named on the command line
here I will use the <> operator
Check to make sure the file only contains acgt or die
if ( <> ne [acgt] ) { die "usage: file must only contain nucleotides \n"; }
Transcribe the DNA to RNA (Every A replaced by U, T replaced by A, C replaced by G, G replaced by C)
not sure how to do this
Take this transcription & break it into 3 character 'codons' starting at the first occurrence of "AUG"
not sure but I'm thinking this is where I will start a %hash variables?
Take the 3 character "codons" and give them a single letter Symbol (an uppercase one-letter amino acid name)
Assign a key a value using (there are 70 possibilities here so I'm not sure where to store or how to access)
If a gap is encountered a new line is started and process is repeated
not sure but we can assume that gaps are multiples of threes.
Am I approaching this the right way? Is there a Perl function that I'm overlooking that can simplify the main program?
Note
Must be self contained program (stored values for codon names & symbols).
Whenever the program reads a codon that has no symbol this is a gap in the RNA, it should start a new line of output and begin at the next occurrence of "AUG". For simplicity we can assume that gaps are always multiples of threes.
Before I spend any additional hours on research I am hoping to get confirmation that I'm taking the right approach. Thanks for taking time to read and for sharing your expertise!
1. here I will use the <> operator
OK, your plan is to read the file line by line. Don't forget to chomp each line as you go, or you'll end up with newline characters in your sequence.
2. Check to make sure the file only contains acgt or die
if ( <> ne [acgt] ) { die "usage: file must only contain nucleotides \n"; }
In a while loop, the <> operator puts the line read into the special variable $_, unless you assign it explicitly (my $line = <>).
In the code above, you're reading one line from the file and discarding it. You'll need to save that line.
Also, the ne operator compares two strings, not one string and one regular expression. You'll need the !~ operator here (or the =~ one, with a negated character class [^acgt]). If you need the test to be case-insensitive, look into the i flag for regular expression matching.
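A minimal sketch of such a check, using !~ with the negated character class and the i flag. The helper name is_dna is made up, and treating an empty line as invalid is an assumption of this example.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the (already chomped) line is non-empty and
# contains nothing other than the letters a, c, g, t in either case.
sub is_dna {
    my ($line) = @_;
    return (length($line) && $line !~ /[^acgt]/i) ? 1 : 0;
}

for my $line ('acgtTTGA', 'acgu', '') {
    printf "%-10s %s\n", "'$line'", is_dna($line) ? 'ok' : 'rejected';
}
```

In the real program the same test would sit inside the while (my $line = <>) loop, after chomp, with die in place of the printf.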
3. Transcribe the DNA to RNA (Every A replaced by U, T replaced by A, C replaced by G, G replaced by C).
As GWW said, check your biology. T->U is the only step in transcription. You'll find the tr (transliterate) operator helpful here.
4. Take this transcription & break it into 3 character 'codons' starting at the first occurrence of "AUG"
not sure but I'm thinking this is where I will start a %hash variables?
I would use a buffer here. Define a scalar outside the while(<>) loop. Use index to match "AUG". If you don't find it, put the last two bases on that scalar (you can use substr $line, -2, 2 for that). On the next iteration of the loop append (with .=) the line to those two bases, and then test for "AUG" again. If you get a hit, you'll know where, so you can mark the spot and start translation.
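The buffer idea can be sketched like this. To keep the example self-contained it works on a list of already chomped and transcribed lines rather than on <>, and the helper name from_first_start is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return the sequence from the first "AUG" onward,
# even if "AUG" straddles a line break; return nothing if there is none.
sub from_first_start {
    my @lines  = @_;
    my $buffer = '';                        # bases carried over between lines
    while (defined(my $line = shift @lines)) {
        $buffer .= $line;
        my $pos = index($buffer, 'AUG');
        if ($pos >= 0) {
            # found the start codon: keep it plus everything after it
            return join '', substr($buffer, $pos), @lines;
        }
        # keep only the last two bases; they could begin an "AUG"
        $buffer = substr($buffer, -2) if length($buffer) > 2;
    }
    return;                                 # no start codon anywhere
}

print from_first_start('CCA', 'UGG', 'CUU'), "\n";   # AUGGCUU
```

Here the start codon is split across the first two lines ("CCA" + "UGG"), which is exactly the case the two carried-over bases are meant to catch.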
5. Take the 3 character "codons" and give them a single letter Symbol (an uppercase one-letter amino acid name)
Assign a key a value using (there are 70 possibilities here so I'm not sure where to store or how to access)
Again, as GWW said, build a hash table:
%codons = ( AUG => 'M', ...).
Then you can use (e.g.) split to build an array of the current line you're examining, build codons three elements at a time, and grab the correct amino acid code from the hash table.
6. If a gap is encountered a new line is started and process is repeated
not sure but we can assume that gaps are multiples of threes.
See above. You can test for the existence of a gap with exists $codons{$current_codon}.
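A sketch of the lookup and the gap test together. The table is deliberately partial (four entries of the standard genetic code), the helper name translate_rna is made up, and a global regex match is used to cut the string into triplets instead of the split suggested above.

```perl
use strict;
use warnings;

# Partial table for illustration only; a complete program needs an entry
# for every codon it should recognise (there are 64 three-letter combinations).
my %codons = (
    AUG => 'M',   # methionine, also the start codon
    UUU => 'F',   # phenylalanine
    GGC => 'G',   # glycine
    AAA => 'K',   # lysine
);

# Hypothetical helper: translate codon by codon and stop at the first
# codon with no entry in the table (a "gap").
sub translate_rna {
    my ($rna) = @_;
    my $protein = '';
    for my $codon ($rna =~ /(.{3})/g) {
        last unless exists $codons{$codon};
        $protein .= $codons{$codon};
    }
    return $protein;
}

print translate_rna('AUGUUUGGCAAA'), "\n";   # MFGK
print translate_rna('AUGUUUNNNAAA'), "\n";   # MF
```

In the assignment, hitting a gap would start a new output line and resume at the next "AUG" rather than simply stopping.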
7. Am I approaching this the right way? Is there a Perl function that I'm overlooking that can simplify the main program?
You know, looking at the above, it seems way too complex. I built a few building blocks; the subroutines read_codon and translate: I think they help the logic of the program immensely.
I know this is a homework assignment, but I figure it might help you get a feel for other possible approaches:
use warnings; use strict;
use feature 'state';
# read_codon uses the state feature introduced in Perl 5.10.
# Both @buffer and $handle represent 'state' of this function:
# together they allow reading codons to be separated from processing
# the file line-by-line.
# Once read_codon is called for the first time, both are initialized.
# Since $handle is a state variable that is opened only once, the current
# file handle position is never reset. Similarly, @buffer always holds
# whatever was left from the previous call.
# The base case is that @buffer contains less than 3bp, in which case
# we need to read a new line, remove the "\n" character,
# split it and push the resulting list to the end of the @buffer.
# If we encounter EOF on the $handle, then we have exhausted the file,
# and the @buffer as well, so we 'return' undef.
# otherwise we pick the first 3bp of the @buffer, join them into a string,
# transcribe it and return it.
sub read_codon {
    my ($file) = @_;
    state @buffer;
    # open only on the first call; reopening would rewind to the first line
    state $handle = do {
        open my $fh, '<', $file or die "Can't open $file: $!";
        $fh;
    };
    if (@buffer < 3) {
        defined(my $new_line = <$handle>) or return;
        chomp $new_line;
        push @buffer, split //, $new_line;
    }
    return transcribe(
        join '',
        shift @buffer,
        shift @buffer,
        shift @buffer
    );
}
sub transcribe {
    my ($codon) = @_;
    $codon =~ tr/T/U/;
    return $codon;
}
# translate also uses the state feature introduced in Perl 5.10
# the $TRANSLATE state is initialized to 0
# as codons are passed to it,
# the sub updates the state according to start and stop codons.
# Since $TRANSLATE is a state variable, it is only initialized once,
# (the first time the sub is called)
# If the current state is 'translating',
# then the sub returns the appropriate amino-acid from the %codes table, if any.
# Thus this provides a logical way to the caller of this sub to determine whether
# it should print an amino-acid or not: if not, the sub will return undef.
# %codes could also be a state variable, but since it is not actually a 'state',
# it is initialized once, in a code block visible from the sub,
# but separate from the rest of the program, since it is 'private' to the sub
{
    our %codes = (
        AUG => 'M',
        ...
    );
    sub translate {
        my ($codon) = @_ or return;
        state $TRANSLATE = 0;
        $TRANSLATE = 1 if $codon =~ m/AUG/i;
        $TRANSLATE = 0 if $codon =~ m/U(AA|GA|AG)/i;
        return $codes{$codon} if $TRANSLATE;
        return;    # not translating: return undef, as described above
    }
}
I can give you a few hints on a few of your points.
I think your first goal should be to parse the file character by character, ensuring each one is valid, group them into sets of three nucleotides and then work on your other goals.
I think your biology is a bit off as well, when you transcribe DNA to RNA you need to think about what strands are involved. You may not need to "complement" your bases during your transcription step.
2. You should check this as you parse the file character by character.
3. You could do this with a loop and some if statements, or with a hash.
4. This could probably be done with a counter as you read the file character by character. Since you need to insert a space after every 3rd character.
5. This would be a good place to use a hash that's based on the amino acid codon table.
6. You'll have to look for the gap character as you parse the file. This seems to contradict your #2 requirement since the program says your text can only contain ATGC.
There are a lot of perl functions that could make this easier. There are also perl modules such as bioperl. But I think using some of these could defeat the purpose of your assignment.
Look at BioPerl and browse the source-modules for indicators on how to go about it.