I need to remove all of the duplicate lines from a file, but ignoring all appearances of these characters:
(),、“”。!?#
As an example, these two lines would be considered duplicates, so one of them would be deleted:
“This is a line。“
This is a line
Similarly, these three lines would be considered duplicates, and only one would remain:
This is another line、 with more words。
“This is another line with more words。”
This is another line! with more words!
It does not matter which of the duplicate lines is kept in the document.
After removing duplicates, the order of the remaining lines should not be changed.
Nearly all lines have important punctuation, but the punctuation might vary somewhat. Whichever line is kept might still have punctuation, so the punctuation should not be deleted in the final output.
How can I delete all of the duplicate lines in a file, while ignoring some characters?
From your example, you could just delete your symbols, and then remove your duplicates.
For instance :
$ cat foo
«This is a line¡»
This is another line! with more words¡
Similarly, these three lines would be considered duplicates, and only one would remain:
This is a line
This is another line, with more words!
This is another line with more words
$ tr --delete '¡!«»,' < foo | awk '!a[$0]++'
This is a line
This is another line with more words
Similarly these three lines would be considered duplicates and only one would remain:
$
Seems to do the job.
Edit :
From your question, it seemed like those symbols/punctuation marks did not matter. You should make that clear.
I don't have time to write it out, but I think the easy way is to read your file line by line and maintain an array of the lines already printed:
for each line:
    cleanedLine = stripFromSymbol(line)
    if cleanedLine not in AlreadyPrinted:
        AlreadyPrinted.push(cleanedLine)
        print line
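The pseudocode above could be fleshed out roughly as follows. This is a minimal sketch rather than the answerer's own code: the helper name dedupe_ignoring is made up for illustration, and the character list is copied from the question. It keeps the first line of each group of duplicates, leaves that line's punctuation alone, and preserves the original order.

```perl
use strict;
use warnings;
use utf8;    # the tr/// list below contains non-ASCII characters

# Hypothetical helper: returns the given lines with duplicates removed,
# where two lines count as duplicates if they are equal once the
# ignored characters have been deleted.
sub dedupe_ignoring {
    my @lines = @_;
    my ( %seen, @kept );
    for my $line (@lines) {
        # Build the comparison key on a copy; the original line is untouched.
        ( my $key = $line ) =~ tr/(),、“”。!?#//d;
        push @kept, $line unless $seen{$key}++;
    }
    return @kept;
}
```

To run this on a real file, the input would need to be decoded as UTF-8 (for example with use open qw/ :std :utf8 /) so that tr sees characters rather than bytes.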
This is one approach: collect the lines into arrays keyed on a normalized version. "Normalized" here means removing all the characters you don't want and squashing spaces too. It then picks the shortest version in each group to print/keep. That heuristic (which one to keep) wasn't really specified, so season to taste. Note that values %tree returns the groups in no particular order, so as written the original line order is not preserved. The code is a bit terse for production, so you might flesh it out for clarity.
use utf8;
use strictures;
use open qw/ :std :utf8 /;
my %tree;
while (my $original = <DATA>) {
    chomp $original;
    ( my $normalized = $original ) =~ tr/ (),、“”。!?#/ /sd;
    push @{$tree{$normalized}}, $original;
    #print "O:",$original, $/;
    #print "N:",$normalized, $/;
}
@{$_} = sort { length $a <=> length $b } @{$_} for values %tree;
print $_->[0], $/ for values %tree;
__DATA__
“This is a line。“
This is a line
This is a line
This is another line、 with more words。
This is another line with more words
This is another line! with more words!
Yields–
This is another line with more words
This is a line
I have a log file where errors are marked with "ERROR" at the beginning of a line, and the next line has the detailed text about the error. I would like to search for "ERROR" in reverse order so I can find the last error and print the next line or copy that line to a variable.
In the shell I can use the command below to achieve this. Can someone give me equivalent Perl code?
grep -A2 ERROR sapinst.log | tail -2
As the log file will be huge (~5000+ lines), I don't want to store it in an array.
Your file size is rather small and perl is pretty quick, so I wouldn't worry about reverse order that much. This little program reads lines of input from the files you specify on the command line (or standard input if you specify none), skips lines until it finds ERROR, then prints that and the next line:
#!perl
while( <> ) {
    next unless /ERROR/;
    print;
    print scalar <>;
}
From there you can use tail if you like. This Perl does the same as the grep you posted (although since you already have that solution I wonder why you want a different one).
If you don't want to use tail, keep track of the two lines you'll output and replace them when you find a new set:
my( $error_line, $next_line );
while( <> ) {
    next unless /ERROR/;
    $error_line = $_;
    $next_line = scalar <>;
}
print $error_line, $next_line;
If you have a recent enough perl, you can use the safer double diamond line input operator:
use v5.22;
my( $error_line, $next_line );
while( <<>> ) {
    next unless /ERROR/;
    $error_line = $_;
    $next_line = scalar <<>>;
}
print $error_line, $next_line;
You can use File::ReadBackwards, but you'll have to do the same task by remembering every line and then checking if the previous line had ERROR. For your data sizes, the benefit probably isn't apparent. If the simple solution isn't fast enough, it's time to get fancier (but not before then).
L02 TIME DEPOSITS 489,26,45,422.92
L18 DRAFTS ACCOUNT (IF CREDIT) 10,063.00 10,063.00
L21 SBI BILLS ACCOUNT (CONTRA) A18 37,51,432.00
A12A DEMAND LOANS 4,39,13,597.30
These are the lines I have in my file. I want to extract the amounts from each line that starts with either L or A and store them in variables.
This is what I have written
pattern =/[A-Z]\w+\s*([\d,.]*)\s*([\d,.])*/g
$first = $1;
$second= $2;
Your regex looks for a single run of \w characters followed by spaces, so it cannot match a description with multiple words. The last * should be inside the parentheses, like the first one (but see below). The [A-Z] matches any capital letter, while you say that you want A or L, so use [AL] instead.
my @amounts = $string =~ /^[AL]\w+ \s+ [A-Za-z ]* ([\d,.]*)/xg;
You don't want to literally repeat the pattern with * quantifier in order to account for a variable number of occurrences. What if 2 becomes 3 when requirements change? Four? Instead, you can capture all matches in an array and get exactly as many as there are.
The /x allows us to use spaces inside for readability.
Here is another approach, which is more flexible.
You need a pattern that matches a run made up only of digits, commas, and periods. You want this only on lines that start with A or L.
So skip lines which do not start with A or L, then match only the needed pattern.
use warnings;
use strict;
my $filename = '...';
open my $fh, '<', $filename or die "Can't open $filename: $!";
while (<$fh>)
{
    next unless /^[AL]/;  # skip if the line doesn't start with A or L
    my @amounts = $_ =~ /\b ([\d,.]+) \b/xg;
    print "@amounts\n" if @amounts;
}
close $fh;
Here you need to specify \b, the word boundary. Otherwise 02 in L02 is matched, for example.
With no matches the array is empty so we test, to not print empty lines. Adjust as suitable.
The next step in reducing reliance on regex details and making the code more flexible is to split the line by spaces and process it term by term. Then adjustments are far easier and changes can be absorbed.
For example, this helps with the change in data mentioned in a comment -- what if there is a date? The above regex would match the numeric parts, while the first one would just break down.
With a loop over fields on each line we can just skip the date, next if /\d{4}-\d{2}/;
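The field-by-field idea might look something like this. It is only a sketch under stated assumptions: the helper name amounts_in_line and the exact shape of the "amount" test are illustrative choices, not part of the answer above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the amount-like fields of one line,
# or an empty list if the line does not start with A or L.
sub amounts_in_line {
    my ($line) = @_;
    return unless $line =~ /^[AL]/;

    my @amounts;
    for my $field ( split ' ', $line ) {
        next if $field =~ /^\d{4}-\d{2}/;    # skip anything that looks like a date
        # Assumed amount format: digits and commas, optional decimal part.
        push @amounts, $field if $field =~ /^\d[\d,]*(?:\.\d+)?$/;
    }
    return @amounts;
}
```

Because each field is tested on its own, a code such as A18 in the middle of a line is rejected without needing \b, and new kinds of fields can be skipped with one more next line.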
I need help with two things in this code that I'm reading. First, I keep seeing this inside a while loop that reads a file:
while(<filename>){
    next unless (/\w/);
    chomp;
    s/^\s*//;
    s/^\s*$//;
    my($name, $datatype, $io, $dummy) = split /\s*,\s*/, $_, 4;
}
So, I'm wondering what that is doing? Because there are commas in the same line being read, so wouldn't the commas make it go to the next iteration? SO how would it split the lines if it is going to another iteration when the commas are being read?
Another one I'm stumped by is:
while (<AP>) {
chomp;
s/
//g;
}
I have no idea what that code is actually substituting...
Thanks!
The first snippet:
Reads a line from a filehandle called filename. This is a really bad name for a filehandle.
It skips the processing if there is not even a single \w (word character) on the line.
The next unless (/\w/); is the same as next if not (/\w/). Note that there is no need for parentheses -- next unless /\w/; is fine.
A word character is, from perlretut:
\w matches a word character (alphanumeric or _), not just [0-9a-zA-Z_] but also digits and characters from non-roman scripts
It removes (only) the newline with chomp. Then it removes leading whitespace, if any.
It then empties any line that holds only whitespace (s/^\s*$// replaces the whole content with an empty string). This does not skip such lines; they were already skipped by the \w test above.
It splits the line by commas, allowing that they have spaces before and/or after. It also limits the number of terms returned to 4. This means that it returns the first three comma-separated fields, and then all the rest as one string in the last element of the list.
The second snippet is really bad, whatever it is meant to do. (Remove spaces on the line?)
Comments
It is far better to use lexical filehandles, rather than barenames. So you'd open a file as
open my $fh, '<', $filename or die "Can't open $filename: $!";
and read it by while (my $line = <$fh>) or by while (<$fh>).
Normally you'll see lines skipped if they have nothing other than spaces
next unless /\S/; # or
next if /^\s*$/;
Using \w also skips lines with some characters (other than what is matched by \w), which means that one had better be very sure that those are fine to skip.
Here it may be meant to skip a line with commas but no \w (comma is not matched by \w), for which split would return spaces (or empty strings) in a list. I find this a bit hidden and fragile. I'd drop lines with spaces only, and handle possible loose commas in processing. As it stands it doesn't help with ,,a, anyway, which yields ('', '', 'a', '') with the limit of 4. So checking is probably needed in any case.
Note that altogether this code leaves trailing spaces. Nothing in the loop strips them, and the last field returned by split (limited to 4 fields by its optional third argument) keeps whatever follows the third separator, spaces included.
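A quick way to see what the limit and the trailing spaces do is to run the same split on a made-up line. The sample string here is invented for illustration and is not from the code in the question.

```perl
use strict;
use warnings;

# Same pattern and limit as in the question; the input line is hypothetical.
my $line   = "clk , wire, in, extra, stuff  ";
my @fields = split /\s*,\s*/, $line, 4;

# Show each field in brackets so that leftover spaces are visible.
print "[$_]\n" for @fields;
```

The first three fields come back trimmed around the commas, while the fourth holds the whole remainder, including its inner comma and the trailing spaces.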
I am trying to read all the data from a file into a Perl array, and then use a foreach loop to process every string in the array. The problem is that the foreach, instead of printing each individual line, is printing the entire array. I am using the following script.
while (<FILE>) {
    $_ =~ s/(\)|\()//g;
    push @array, $_;
}
foreach $n (@array) {
    print "$n\n";
}
Say for example the data in the array is @array=qw(He goes to the school everyday)
the array is getting printed properly but the foreach loop instead of printing every element on different line is printing the entire array.
After reading your comments, I am guessing that your problem is that your source file does not contain any newlines: I.e. the entire file is just one line. Some text editors just wrap the text without actually adding any line break characters.
There is no "solution" to that problem; You have to add line breaks where you want them. You could write a script to do it, but I doubt it would make much sense. It all depends on what you want to do with this text.
Here are my code suggestions for your snippet.
chomp(@array = <FILE>);
s/[()]//g for @array;
print "$_\n" for @array;
or
@array = <FILE>;
s/[()]//g for @array;
print @array;
Note that if you have a file from another filesystem, you may get \r characters left over at the end of your strings after chomp, causing the output to look corrupted, overwriting itself.
Additional notes:
(\)|\() is better written as a character class: [()].
@array = <FILE> will read the entire file into the array. No need to loop.
As shown in my examples, print can be given a list of items (e.g. an array) as arguments. And you can have a postfix loop to print sequentially.
With a (postfix) loop, all the loop elements are aliased to $_, which is a handy way to do substitutions on the array.
Since the entire file is just one line, you can split the string on whitespace and print every element of the array on a new line:
use strict;
use warnings;
open(FILE, 'YOURFILE') || die("could not open");
my $line = <FILE>;
my @array = split ' ', $line;
foreach my $n (@array)
{
    print "$n\n";
}
close(FILE);
Input File
In recent years many risk factors for the development of breast cancer that .....
Output
In
recent
years
many
risk
factors
for
the
development
of
breast
cancer
that
.....
Looking for suggestions on how to approach my Perl programming homework assignment to write an RNA synthesis program. I've summed and outlined the program below. Specifically, I'm looking for feedback on the blocks below (I'll number for easy reference). I've read up to chapter 6 in Elements of Programming with Perl by Andrew Johnson (great book). I've also read the perlfunc and perlop pod-pages with nothing jumping out on where to start.
Program Description: The program should read an input file from the command line, translate it into RNA, and then transcribe the RNA into a sequence of uppercase one-letter amino acid names.
Accept a file named on the command line
here I will use the <> operator
Check to make sure the file only contains acgt or die
if ( <> ne [acgt] ) { die "usage: file must only contain nucleotides \n"; }
Transcribe the DNA to RNA (Every A replaced by U, T replaced by A, C replaced by G, G replaced by C)
not sure how to do this
Take this transcription & break it into 3 character 'codons' starting at the first occurrence of "AUG"
not sure but I'm thinking this is where I will start a %hash variables?
Take the 3 character "codons" and give them a single letter Symbol (an uppercase one-letter amino acid name)
Assign a key a value using (there are 70 possibilities here so I'm not sure where to store or how to access)
If a gap is encountered a new line is started and process is repeated
not sure but we can assume that gaps are multiples of threes.
Am I approaching this the right way? Is there a Perl function that I'm overlooking that can simplify the main program?
Note
Must be self contained program (stored values for codon names & symbols).
Whenever the program reads a codon that has no symbol, this is a gap in the RNA; it should start a new line of output and begin at the next occurrence of "AUG". For simplicity we can assume that gaps are always multiples of threes.
Before I spend any additional hours on research I am hoping to get confirmation that I'm taking the right approach. Thanks for taking time to read and for sharing your expertise!
1. here I will use the <> operator
OK, your plan is to read the file line by line. Don't forget to chomp each line as you go, or you'll end up with newline characters in your sequence.
2. Check to make sure the file only contains acgt or die
if ( <> ne [acgt] ) { die "usage: file must only contain nucleotides \n"; }
In a while loop, the <> operator puts the line read into the special variable $_, unless you assign it explicitly (my $line = <>).
In the code above, you're reading one line from the file and discarding it. You'll need to save that line.
Also, the ne operator compares two strings, not one string and one regular expression. You'll need the !~ operator here (or the =~ one, with a negated character class [^acgt]). If you need the test to be case-insensitive, look into the i flag for regular expression matching.
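As a rough illustration of that check, here is one way it could be written. The helper name is_dna is hypothetical, and treating an empty line as valid is an assumption you may want to change.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the line holds nothing but a, c, g, t
# (in either case) once the trailing newline is removed.
sub is_dna {
    my ($line) = @_;
    chomp $line;
    # "Does not contain any character outside acgt", case-insensitively.
    return $line !~ /[^acgt]/i;
}
```

Inside the reading loop this could be used as: die "file must only contain nucleotides\n" unless is_dna($line);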
3. Transcribe the DNA to RNA (Every A replaced by U, T replaced by A, C replaced by G, G replaced by C).
As GWW said, check your biology. T->U is the only step in transcription. You'll find the tr (transliterate) operator helpful here.
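A minimal sketch of that step with tr, assuming the input may be lower case and the output should be upper case (the sub name transcribe_dna is made up for this example):

```perl
use strict;
use warnings;

# Hypothetical helper: upper-cases a DNA string and replaces T with U.
sub transcribe_dna {
    my ($dna) = @_;
    ( my $rna = uc $dna ) =~ tr/T/U/;    # work on a copy, leave $dna alone
    return $rna;
}

print transcribe_dna("atgc"), "\n";
```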
4. Take this transcription & break it into 3 character 'codons' starting at the first occurrence of "AUG"
not sure but I'm thinking this is where I will start a %hash variables?
I would use a buffer here. Define a scalar outside the while(<>) loop. Use index to match "AUG". If you don't find it, put the last two bases on that scalar (you can use substr $line, -2, 2 for that). On the next iteration of the loop append (with .=) the line to those two bases, and then test for "AUG" again. If you get a hit, you'll know where, so you can mark the spot and start translation.
5. Take the 3 character "codons" and give them a single letter Symbol (an uppercase one-letter amino acid name)
Assign a key a value using (there are 70 possibilities here so I'm not sure where to store or how to access)
Again, as GWW said, build a hash table:
%codons = ( AUG => 'M', ...).
Then you can use (e.g.) split to build an array from the current line you're examining, build codons three elements at a time, and grab the correct amino acid code from the hash table.
6. If a gap is encountered a new line is started and process is repeated
not sure but we can assume that gaps are multiples of threes.
See above. You can test for the existence of a gap with exists $codons{$current_codon}.
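Putting steps 4 to 6 together on a single string, a simplified sketch could look like this. The codon table is deliberately partial (only three real entries are filled in), the sub name translate_from_start is hypothetical, and the sketch simply stops at the first codon that is missing from the table instead of starting a new output line and looking for the next "AUG" as the assignment requires.

```perl
use strict;
use warnings;

# Partial table for illustration only; a real one needs every codon.
my %codons = (
    AUG => 'M',
    UUU => 'F',
    GGC => 'G',
);

# Hypothetical helper: translates from the first AUG until a codon
# without a table entry (a "gap") is reached.
sub translate_from_start {
    my ($rna) = @_;
    my $start = index $rna, 'AUG';
    return '' if $start < 0;    # no start codon at all

    my $protein = '';
    for my $codon ( substr( $rna, $start ) =~ /(...)/g ) {
        last unless exists $codons{$codon};
        $protein .= $codons{$codon};
    }
    return $protein;
}
```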
7. Am I approaching this the right way? Is there a Perl function that I'm overlooking that can simplify the main program?
You know, looking at the above, it seems way too complex. I built a few building blocks; the subroutines read_codon and translate: I think they help the logic of the program immensely.
I know this is a homework assignment, but I figure it might help you get a feel for other possible approaches:
use warnings; use strict;
use feature 'state';
# read_codon works by using the new state feature in Perl 5.10
# both @buffer and $handle represent 'state' on this function:
# Both permit abstracting reading codons from processing the file
# line-by-line.
# Once read_codon is called for the first time, both are initialized.
# Since $handle is a state variable and is opened only once, the current
# file handle position is never reset. Similarly, @buffer always holds
# whatever was left from the previous call.
# The base case is that @buffer contains less than 3bp, in which case
# we need to read a new line, remove the "\n" character,
# split it and push the resulting list to the end of the @buffer.
# If we encounter EOF on the $handle, then we have exhausted the file,
# and the @buffer as well, so we 'return' undef.
# otherwise we pick the first 3bp of the @buffer, join them into a string,
# transcribe it and return it.
sub read_codon {
    my ($file) = @_;
    state @buffer;
    state $handle;
    # Open only on the first call; calling open on every invocation
    # would re-open the file and start again from its first line.
    unless ($handle) {
        open $handle, '<', $file or die $!;
    }
    if (@buffer < 3) {
        my $new_line = scalar <$handle> or return;
        chomp $new_line;
        push @buffer, split //, $new_line;
    }
    return transcribe(
        join '',
        shift @buffer,
        shift @buffer,
        shift @buffer
    );
}
sub transcribe {
    my ($codon) = @_;
    $codon =~ tr/T/U/;
    return $codon;
}
# translate works by using the new state feature in Perl 5.10
# the $TRANSLATE state is initialized to 0
# as codons are passed to it,
# the sub updates the state according to start and stop codons.
# Since $TRANSLATE is a state variable, it is only initialized once,
# (the first time the sub is called)
# If the current state is 'translating',
# then the sub returns the appropriate amino-acid from the %codes table, if any.
# Thus this provides a logical way for the caller of this sub to determine whether
# it should print an amino-acid or not: if not, the sub will return undef.
# %codes could also be a state variable, but since it is not actually a 'state',
# it is initialized once, in a code block visible from the sub,
# but separate from the rest of the program, since it is 'private' to the sub
{
    our %codes = (
        AUG => 'M',
        ...
    );
    sub translate {
        my ($codon) = @_ or return;
        state $TRANSLATE = 0;
        $TRANSLATE = 1 if $codon =~ m/AUG/i;
        $TRANSLATE = 0 if $codon =~ m/U(AA|GA|AG)/i;
        return $codes{$codon} if $TRANSLATE;
    }
}
I can give you a few hints on a few of your points.
I think your first goal should be to parse the file character by character, ensuring each one is valid, group them into sets of three nucleotides and then work on your other goals.
I think your biology is a bit off as well, when you transcribe DNA to RNA you need to think about what strands are involved. You may not need to "complement" your bases during your transcription step.
2. You should check this as you parse the file character by character.
3. You could do this with a loop and some if statements or a hash.
4. This could probably be done with a counter as you read the file character by character. Since you need to insert a space after every 3rd character.
5. This would be a good place to use a hash that's based on the amino acid codon table.
6. You'll have to look for the gap character as you parse the file. This seems to contradict your #2 requirement since the program says your text can only contain ATGC.
There are a lot of perl functions that could make this easier. There are also perl modules such as bioperl. But I think using some of these could defeat the purpose of your assignment.
Look at BioPerl and browse the source-modules for indicators on how to go about it.