Discovering duplicate lines - perl

I've got a file of CSS elements, and I'm trying to check for any duplicate CSS elements, then output the lines that contain the duplicates.
###Test
###ABC
###test
##.hello
##.ABC
##.test
bob.com###Test
~qwerty.com###Test
~more.com##.ABC
###Test and ##.ABC already exist in the list more than once, and I'd like a way to output the lines where they are used: basically a case-sensitive duplicate check. Using the list above, I would like to generate something like this:
Line 1: ###Test
Line 7: bob.com###Test
Line 8: ~qwerty.com###Test
Line 5: ##.ABC
Line 9: ~more.com##.ABC
Something in bash, or maybe perl?
Thanks :)

I've been challenged by your problem, so I wrote you a script. Hope you like it. :)
#!/usr/bin/perl
use strict;
use warnings;
sub loadf($);
{
    my @file = loadf("style.css");
    my @inner = @file;
    my $l0 = 0; my $l1 = 0; my $l2 = 0; my $dc = 0;
    foreach my $line (@file) {
        $l1++;
        $line =~ s/^\s+//;
        $line =~ s/\s+$//;
        next unless length $line;    # nothing to look for on a blank line
        foreach my $iline (@inner) {
            $l2++;
            $iline =~ s/^\s+//;
            $iline =~ s/\s+$//;
            next if ($iline eq $line);
            # \Q...\E quotes regex metacharacters such as '.' in the selector
            if ($iline =~ /\b\Q$line\E$/) {
                $dc++;
                if ($l0 == 0) {
                    print "Line " . $l1 . ": " . $line . "\n";
                    $l0++;
                }
                print "Line " . $l2 . ": " . $iline . "\n";
            }
        }
        print "\n" unless($dc == 0);
        $dc = 0; $l0 = 0; $l2 = 0;
    }
}
sub loadf($) {
    my @file = ( );
    open(my $fh, '<', $_[0]) or die("Couldn't open " . $_[0] . ": $!\n");
    @file = <$fh>;
    close($fh);
    return @file;
}
__END__
This does exactly what you need. And sorry if it's a bit messy.

This seems to work:
sort -t '#' -k 2 inputfile
It groups them by the part after the # characters:
##.ABC
~more.com##.ABC
###ABC
##.hello
##.test
###test
bob.com###Test
~qwerty.com###Test
###Test
If you only want to see the unique values:
sort -t '#' -k 2 -u inputfile
Result:
##.ABC
###ABC
##.hello
##.test
###test
###Test
This pretty closely duplicates the example output in the question (it relies on some possibly GNU-specific features):
cat -n inputfile |
sed 's/^ *\([0-9][0-9]*\)/Line \1:/' |
sort -t '#' -k 2 |
awk -F '#+' '{if (! seen[$2]) { \
if ( count > 1) printf "%s\n", lines; \
count = 0; \
lines = "" \
}; \
seen[$2] = 1; \
lines = lines "\n" $0; ++count}
END {if (count > 1) print lines}'
Result:
Line 5: ##.ABC
Line 9: ~more.com##.ABC
Line 1: ###Test
Line 7: bob.com###Test
Line 8: ~qwerty.com###Test
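If you would rather stay in Perl, the same grouping idea can be sketched with a hash keyed on the selector. This is only a rough sketch: find_duplicate_selectors is a name I made up, and it assumes the selector is everything from the first # to the end of the line.

```perl
use strict;
use warnings;

# Hypothetical helper: group lines by selector (everything from the first
# '#' onwards) and return only the groups that contain more than one line.
sub find_duplicate_selectors {
    my @lines = @_;
    my (%seen, @order);
    my $n = 0;
    for my $text (@lines) {
        $n++;
        (my $trimmed = $text) =~ s/^\s+|\s+$//g;
        next unless $trimmed =~ /(#.*)$/;    # skip lines without a selector
        my $key = $1;
        push @order, $key unless $seen{$key};    # remember first-seen order
        push @{ $seen{$key} }, "Line $n: $trimmed";
    }
    return map { $seen{$_} } grep { @{ $seen{$_} } > 1 } @order;
}

# The sample list from the question
my @input = ('###Test', '###ABC', '###test', '##.hello', '##.ABC', '##.test',
             'bob.com###Test', '~qwerty.com###Test', '~more.com##.ABC');
for my $group (find_duplicate_selectors(@input)) {
    print "$_\n" for @$group;
    print "\n";
}
```

Because the hash key is compared exactly, the check stays case sensitive, and the groups come out in the order the selectors first appear in the file.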

I'd recommend using the uniq function from List::MoreUtils, if you can install it:
how-do-i-print-unique-elements-in-perl-array

Here is one way to do it, which is fairly easy to extend to multiple files if need be.
With this file find_dups.pl:
use warnings;
use strict;
my @lines;
while (<>) {                                   # read input lines
    s/^\s+//; s/\s+$//;                        # trim whitespace
    push @lines, {data => $_, line => $.} if $_;   # store useful data
}
@lines = sort {length $$a{data} <=> length $$b{data}} @lines;   # shortest first
while (@lines) {
    my ($line, @found) = shift @lines;
    my $re = qr/\Q$$line{data}\E$/;            # search token
    @lines = grep {                            # extract matches from @lines
        not $$_{data} =~ $re && push @found, $_
    } @lines;
    if (@found) {                              # write the report
        print "line $$_{line}: $$_{data}\n" for $line, @found;
        print "\n";
    }
}
then perl find_dups.pl input.css prints:
line 5: ##.ABC
line 9: ~more.com##.ABC
line 1: ###Test
line 7: bob.com###Test
line 8: ~qwerty.com###Test

Related

Ignore the first two lines with ## in perl

Hi all.
I'm a newbie in programming, especially in Perl. I would like to skip the first two lines in my dataset.
This is my code:
while (<PEPTIDELIST>) {
next if $_ !=~ "##";
chomp $_;
@data = split /\t/;
chomp $_;
next if /Sequence/;
chomp $_;
$npeptides++;
# print "debug: 0: $data[0] 1: $data[1] 2: $data[2] 3: $data[3] \n" if ( $debug );
my $pepseq = $data[1];
#print $pepseq."\n";
foreach my $header (keys %sequence) {
#print "looking for $pepseq in $header \n";
if ($sequence{$header} =~ /$pepseq/ ) {
print "matched $pepseq in protein $header" if ( $debug );
# my $in =<STDIN>;
if ( $header =~ /(ENSGALP\S+)\s.+(ENSGALG\S+)/ ) {
print "debug: $1 $2 have the pep = $pepseq \n\n" if ( $debug );
my $lprot = $1;
my $lgene = $2;
$gccount{$lgene}++;
$pccount{$lprot}++;
# print "$1" if($debug);
# print "$2" if ($debug);
print OUT "$pepseq,$1,$2\n";
}
}
}
my $ngenes = keys %gccount;
my $nprots = keys %pccount;
Somehow the peptide is not in the output list. Could you please point out where it goes wrong?
Thanks
If you want to skip lines that contain ## anywhere in them:
next if /##/;
If you only want to skip lines that start with ##:
next if /^##/;
If you always want to skip the first two lines, regardless of content:
next if $. < 3;
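next if $_ !=~ "##"; must be next if $_ =~ "##";
There is no !=~ operator: Perl parses it as != followed by ~, which is a numeric comparison rather than a pattern match.
The corrected line ignores the current line if $_ matches ##.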
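To show the skipping logic in isolation, here is a small sketch. The helper name data_rows and the sample rows are made up for illustration; it assumes tab-separated input with ## comment lines and a header row containing Sequence, as in the question.

```perl
use strict;
use warnings;

# Hypothetical helper: return the tab-separated data rows, skipping
# '##' comment lines and the 'Sequence' header row.
sub data_rows {
    my @rows;
    for my $line (@_) {
        chomp(my $copy = $line);
        next if $copy =~ /^##/;         # leading '##' marks a comment line
        next if $copy =~ /Sequence/;    # column header
        push @rows, [ split /\t/, $copy ];
    }
    return @rows;
}

# Made-up sample input: two comment lines, a header, one data row
my @rows = data_rows("## file header\n", "## generated by tool\n",
                     "Id\tSequence\n", "1\tPEPTIDE\n");
print "$_->[1]\n" for @rows;    # prints PEPTIDE
```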

How can I choose particular lines from a file with Perl

I have a file from which I want to take all the lines that start with CDS, plus the line below each of them.
These lines look like:
CDS 297300..298235
/gene="ENSBTAG00000035659"
I found this on your site:
open(FH,'FILE');
while ($line = <FH>) {
if ($line =~ /Pattern/) {
print "$line";
print scalar <FH>;
}
}
and it works great when the CDS is only one line.
Sometimes in my file it looks like
CDS join(complement(416559..416614),complement(416381..416392),
complement(415781..416087))
/gene="ENSBTAG00000047603"
or with even more lines in the CDS.
How can I take only the CDS lines and the next line with the ID?
Please, I need your help!
Thank you in advance.
Assuming the "next line" always contains /gene=, one can use the flip-flop operator.
while (<>) {
print if m{^CDS} ... m{/gene=};
}
Otherwise, you need to parse the CDS line. It might be sufficient to count parens.
my $depth = 0;
my $print_next = 0;
while (<>) {
if (/^CDS/) {
print;
$depth = tr/(// - tr/)//;
$print_next = 1;
}
elsif ($depth) {
print;
$depth += tr/(// - tr/)//;
}
elsif ($print_next) {
print;
$print_next = 0;
}
}
You need to break the input into outdented paragraphs. An outdented paragraph starts with a non-space character on its first line, and every following line starts with space characters.
Try:
#!/usr/bin/env perl
use strict;
use warnings;
# --------------------------------------
my $input_file = shift @ARGV;
my $para = undef; # holds partial paragraphs
open my $in_fh, '<', $input_file or die "could not open $input_file: $!\n";
while( my $line = <$in_fh> ){
# paragraphs are outdented, that is, start with a non-space character
if( $line =~ m{ \A \S }msx ){
# don't do if very first line of file
if( defined $para ){
# If paragraph starts with CDS
if( $para =~ m{ \A CDS \b }msx ){
process_CDS( $para );
}
# delete the old paragraph
$para = undef;
}
}
# add the line to the paragraph,
$para .= $line;
}
close $in_fh or die "could not close $input_file: $!\n";
# the last paragraph is not handled inside the loop, so do it now
if( defined $para ){
# If paragraph starts with CDS
if( $para =~ m{ \A CDS \b }msx ){
process_CDS( $para );
}
}
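The script above calls process_CDS but never defines it. One possible body, as a rough sketch only: it assumes that the location lines come first and that the ID is the first line whose first non-space character is a slash. This version returns the text instead of printing it, so the calls would become print process_CDS( $para ). The /note line in the sample is made up.

```perl
use strict;
use warnings;

# Hypothetical body for process_CDS: keep the location line(s) and the
# first qualifier line (for example /gene="..."), drop everything after it.
sub process_CDS {
    my ($para) = @_;
    my @keep;
    for my $line (split /\n/, $para) {
        push @keep, $line;
        last if $line =~ m{ \A \s* / }x;    # first qualifier line reached
    }
    return join("\n", @keep) . "\n";
}

my $para = "CDS  join(complement(416559..416614),complement(416381..416392),\n"
         . "     complement(415781..416087))\n"
         . "     /gene=\"ENSBTAG00000047603\"\n"
         . "     /note=\"made-up extra qualifier\"\n";
print process_CDS($para);    # prints the first three lines only
```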

Perl: How to print next line after matching a pattern?

I would like to print specific data after matching a pattern or line. I have a file like this:
#******************************
List : car
Design: S
Date: Sun 10:10
#******************************
b-black
g-green
r-red
Car Type No. color
#-------------------------------------------
N17 bg099 g
#-------------------------------------------
Total 1 car
#******************************
List : car
Design: L
Date: Sun 10:20
#******************************
b-black
g-green
r-red
Car Type No. color
#-------------------------------------------
A57 ft2233 b
#-------------------------------------------
Total 1 car
#******************************
List : car
Design: M
Date: Sun 12:10
#******************************
b-black
g-green
r-red
Car Type No. color
#-------------------------------------------
L45 nh669 g
#-------------------------------------------
Total 1 car
#. .
#. .
#.
#.
I want to print the data that follows the "Car Type ..." line and the dashed line "------", for example N17 and bg099. I have tried this but it does not work.
my @array;
While (@array = <FILE>) {
foreach my $line (@array) {
if ($line =~ m/(Car)((.*))/) {
my $a = $array[$i+2];
push (@array, $a);
}
if ($array[$i+2] =~ m/(.*)\s+(.*)\s+(.*)/) {
my $car_type = "$1";
print "$car_type\n";
}
}
}
Expected Output:
Car Type No.
N17 bg099
A57 ft2233
L45 nh669
.. ..
. .
while (<FILE>) { #read line by line
if ($_ =~ /^Car/) { #if the line starts with 'Car'
<FILE> or die "Bad car file format"; #read the first line after a Car line, which is '---', in order to get to the next line
my $model = <FILE>; #assign the second line after Car to $model, this is the line we're interested in.
$model =~ /^([^\s]+)\s+([^\s]+)/; #no need for if, assuming correct file format #capture the first two words. You can replace [^\s] with \w, but I prefer the first option.
print "$1 $2\n";
}
}
Or if you prefer a more compact solution:
while (<FILE>) {
if ($_ =~ /^Car/) {
<FILE> or die "Bad car file format";
print join(" ",(<FILE> =~ /(\w+)\s+(\w+)/))."\n";
}
}
Here's another option:
use strict;
use warnings;
print "Car Type\tNo.\n";
while (<>) {
if (/#-{32}/) {
print "$1\t$2\n" if <> =~ /(\S+)\s+(\S+)/;
<>;
}
}
Output:
Car Type No.
N17 bg099
A57 ft2233
L45 nh669
Usage: perl script.pl inFile [>outFile]
Edit: Simplified
I got your code to work with a couple of small tweaks.
It's still not perfect but it works.
"while" should be lower case.
You never increment $i.
The way you reuse @array is confusing at best, but if you just output $a you'll get your car data.
Code:
$file_to_get = "input_file.txt";
open (FILE, $file_to_get) or die $!;
my @array;
while (@array = <FILE>) {
$i = 0;
foreach my $line (@array) {
if ($line =~ m/(Car)((.*))/) {
my $a = $array[$i+2];
push (@array, $a);
print $a;
}
$i++;
}
}
close(FILE);
Something like this:
while (my $line = <>) {
next unless $line =~ /Car\s+Type/;
next unless $line = <> and $line =~ /^#----/;
next unless $line = <>;
my @fields = split ' ', $line;
print "@fields[0,1]\n";
}
A shell one-liner to do the same thing:
echo "Car Type No. "; \
grep -A 2 Type data.txt \
| grep -v -E '(Type|-)' \
| grep -o -E '(\w+ *\w+)'
perl -lne 'if(/Type/){$a=<>;$a=<>;$a=~m/^([^\s]*)\s*([^\s]*)\s/g; print $1." ".$2}' your_file
tested:
> perl -lne 'if(/Type/){$a=<>;$a=<>;$a=~m/^([^\s]*)\s*([^\s]*)\s/g; print $1." ".$2}' temp
N17 bg099
A57 ft2233
L45 nh669
If you want to use awk, you can do it like this:
> awk '/Type/{getline;if($0~/^#---*/){getline;print $1,$2}}' your_file
N17 bg099
A57 ft2233
L45 nh669
Solution using the Perl flip-flop operator. It assumes from the input that you always have a Total line at the end of the block of interest.
perl -ne '$a=/^#--/;$b=/^Total/;print if(($a .. $b) && !$a && !$b);' file

When comparing two files, how do I skip (ignore) blank lines?

I'm comparing line against line of two text files, ref.txt (reference) and log.txt. But there may be an arbitrary number of blank lines in either file that I'd like to ignore; how can I accomplish this?
ref.txt
one
two
three
end
log.txt
one
two
three
end
There would be no incorrect log lines in the output, in other words log.txt matches with ref.txt.
What I like to accomplish in pseudo code:
while (traversing both files at same time) {
if ($l is blank line || $r is blank line) {
if ($l is blank line)
skip to next non-blank line
if ($r is blank line)
skip to next non-blank line
}
#continue with line by line comparison...
}
My current code:
use strict;
use warnings;
my $logPath = $ARGV[0];
my $refLogPath = $ARGV[1];
my $r; #ref log line
my $l; #log line
open INLOG, $logPath or die $!;
open INREF, $refLogPath or die $!;
while (defined($l = <INLOG>) and defined($r = <INREF>)) {
#code for skipping blank lines?
if ($l ne $r) {
print $l, "\n"; #Output incorrect line in log file
$boolRef = 0; #false==0
}
}
If you are on a Linux platform, use:
diff -B ref.txt log.txt
The -B option causes changes that just insert or delete blank lines to be ignored
You can skip blank lines by comparing it to this regular expression:
next if $line =~ /^\s*$/
This will match any white space or newline characters which can potentially make up a blank line.
This way seems the most "perl-like" to me. No fancy loops or anything, just slurp the files and grep out the blank lines.
use warnings;
$f1 = "path/file/1";
$f2 = "path/file/2";
open(IN1, "<$f1") or die "Cannot open file: $f1 ($!)\n";
open(IN2, "<$f2") or die "Cannot open file: $f2 ($!)\n";
chomp(@lines1 = <IN1>); # slurp the files
chomp(@lines2 = <IN2>);
@l1 = grep(!/^\s*$/,@lines1); # get the files without empty lines
@l2 = grep(!/^\s*$/,@lines2);
# something like this to print the non-matching lines
for $i (0 .. $#l1) {
print "[$f1 $i]: $l1[$i]\n[$f2 $i]: $l2[$i]\n" if($l1[$i] ne $l2[$i]);
}
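You can loop to find the next non-blank line in each file, each time:
while (1) {
    # a line read from a file still ends in "\n", so test with a regex
    while (defined($l = <INLOG>) and $l =~ /^\s*$/) {}
    while (defined($r = <INREF>) and $r =~ /^\s*$/) {}
    if (!defined($l) or !defined($r)) {
        last;    # Perl's loop exit is "last", not "break"
    }
    if ($l ne $r) {
        print $l, "\n";
        $boolRef = 0;
    }
}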
man diff
diff -B ref.txt log.txt
# line skipping code
while (defined($l=<INLOG>) && $l =~ /^$/ ) {} # no-op loop exits with $l that has length
while (defined($r=<INREF>) && $r =~ /^$/ ) {} # no-op loop exits with $r that has length
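Putting the skipping loops together, a complete comparison could look roughly like this. It is a sketch only: next_nonblank and mismatched_lines are names I made up, and the demo reads from in-memory strings instead of ref.txt and log.txt.

```perl
use strict;
use warnings;

# Hypothetical helper: read the next line that is not blank
# (empty or whitespace only); returns undef at end of file.
sub next_nonblank {
    my ($fh) = @_;
    while (defined(my $line = <$fh>)) {
        return $line if $line =~ /\S/;
    }
    return;
}

# Compare two open filehandles line by line, ignoring blank lines.
# Returns the log lines that differ from the reference. Like the loop in
# the question, it stops as soon as either file runs out of lines.
sub mismatched_lines {
    my ($log_fh, $ref_fh) = @_;
    my @bad;
    while (1) {
        my $l = next_nonblank($log_fh);
        my $r = next_nonblank($ref_fh);
        last unless defined $l and defined $r;
        push @bad, $l if $l ne $r;
    }
    return @bad;
}

# Demo with in-memory files that differ only in their blank lines
my $log_text = "one\n\ntwo\nthree\n\nend\n";
my $ref_text = "one\ntwo\n\n\nthree\nend\n";
open my $log, '<', \$log_text or die $!;
open my $ref, '<', \$ref_text or die $!;
print "mismatch: $_" for mismatched_lines($log, $ref);    # prints nothing
```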

making one single line

keyword harry /
sally/
tally/
I want it so that whenever a line matches the keyword, it also looks for the "/" character, which signifies continuation of the line.
Then I want the output to be
keyword harry sally tally
==========================
My current code:
#!/usr/bin/perl
open (file2, "trial.txt");
$keyword_1 = keyword;
foreach $line1 (<file2>) {
s/^\s+|\s+$//g;
if ($line1 =~ $keyword_1) {
$line2 =~ (s/$\//g, $line1) ;
print " $line2 " ;
}
}
If the ===== lines in your question are supposed to be in the output, then use
#! /usr/bin/env perl
use strict;
use warnings;
*ARGV = *DATA; # for demo only; delete
sub print_line {
my($line) = @_;
$line =~ s/\n$//; # / fix Stack Overflow highlighting
print $line, "\n",
"=" x (length($line) + 1), "\n";
}
my $line = "";
while (<>) {
$line .= $line =~ /^$|[ \t]$/ ? $_ : " $_";
if ($line !~ s!/\n$!!) { # / ditto
print_line $line;
$line = "";
}
}
print_line $line if length $line;
__DATA__
keyword jim-bob
keyword harry /
sally/
tally/
Output:
keyword jim-bob
================
keyword harry sally tally
==========================
You did not specify what to do with the lines that do not contain the keyword. You might use this code as an inspiration, though:
#!/usr/bin/perl
use warnings;
use strict;
my $on_keyword_line;
while (<>) {
if (/keyword/ or $on_keyword_line) {
chomp;
if (m{/$}) {
chop;
$on_keyword_line = 1;
} else {
$on_keyword_line = 0;
}
print;
} else {
$on_keyword_line = 0;
print "\n";
}
}
A redo is useful when dealing with concatenating continuation lines.
my $line;
while ( defined( $line = <DATA> )) {
chomp $line;
if ( $line =~ s{/\s*$}{ } ) {
$line .= <DATA>;
redo unless eof(DATA);
}
$line =~ s{/}{};
print "$line\n";
}
__DATA__
keyword harry /
sally/
tally/
and
done!!!
$ ./test.pl
keyword harry sally tally and
done!!!
I think you need to simply concatenate all lines that end in a slash, regardless of the keyword.
I suggest this code.
Updated to account for the OP's comment that continuation lines are terminated by backslashes.
while (<>) {
s|\\\s*\z||;
print;
}
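If you need the joined lines as data rather than printed output, the same idea can be wrapped in a function. A rough sketch, with join_continued being a made-up name; the continuation marker defaults to / as in the question and can be swapped for a backslash.

```perl
use strict;
use warnings;

# Hypothetical helper: merge every line that ends in the continuation
# marker (default '/') with the line that follows it.
sub join_continued {
    my ($lines, $marker) = @_;
    $marker = '/' unless defined $marker;
    my (@out, $buffer);
    for my $line (@$lines) {
        (my $text = $line) =~ s/^\s+|\s+$//g;                 # trim both ends
        my $continues = ($text =~ s/\s*\Q$marker\E$//);       # strip the marker
        $buffer = defined $buffer ? "$buffer $text" : $text;
        next if $continues;
        push @out, $buffer;
        undef $buffer;
    }
    push @out, $buffer if defined $buffer;    # input ended on a continued line
    return @out;
}

print "$_\n" for join_continued([ "keyword harry /", "sally/", "tally/" ]);
# prints: keyword harry sally tally
```

For backslash-terminated lines you would call it as join_continued(\@lines, '\\').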