Compare two CSV files and show only the difference - perl

I have two CSV files:
File1.csv
Time, Object_Name, Carrier_Name, Frequency, Longname
2013-08-05 00:00, Alpha, Aircel, 917.86, Aircel_Bhopal
2013-08-05 00:00, Alpha, Aircel, 915.13, Aircel_Indore
File2.csv
Time, Object_Name, Carrier_Name, Frequency, Longname
2013-08-05 00:00, Alpha, Aircel, 917.86, Aircel_Bhopal
2013-08-05 00:00, Alpha, Aircel, 815.13, Aircel_Indore
These are sample input files; the actual files will have many more headers and values, so I cannot hard-code them.
In my expected output I want to keep the first two columns and the last column as they are, since they won't change, and the comparison should happen for the rest of the columns and values.
Expected output:
Time, Object_Name, Frequency, Longname
2013-08-05 00:00, 815.13, Aircel_Indore
How can I do this?

Please look at the links below; there are some example scripts:
http://bytes.com/topic/perl/answers/647889-compare-two-csv-files-using-perl
Perl: Compare Two CSV Files and Print out differences
http://www.perlmonks.org/?node_id=705049

If you are not bound to Perl, here is a solution using AWK:
#!/bin/bash
awk -v FS="," '
function filter_columns()
{
    return sprintf("%s, %s, %s, %s", $1, $2, $(NF-1), $NF);
}
NF != 0 && NR == FNR {
    if (NR == 1) {
        print filter_columns();
    } else {
        memory[line++] = filter_columns();
    }
}
NF != 0 && NR != FNR {
    if (FNR == 1) {
        line = 0;
    } else {
        new_line = filter_columns();
        if (new_line != memory[line++]) {
            print new_line;
        }
    }
}' File1.csv File2.csv
This outputs:
Time, Object_Name, Frequency, Longname
2013-08-05 00:00, Alpha, 815.13, Aircel_Indore
Here the explanation:
#!/bin/bash
# FS = "," makes awk split each line into fields using
# the comma as separator
awk -v FS="," '
# This function selects the columns you want. NF is the
# number of fields, therefore $NF is the content of
# the last column and $(NF-1) of the second-to-last.
function filter_columns()
{
    return sprintf("%s, %s, %s, %s", $1, $2, $(NF-1), $NF);
}
# This block processes just the first file, this is the aim
# of the condition NR == FNR. The condition NF != 0 skips the
# empty lines you have in your file. The block prints the header
# and then saves all the other lines in the array memory.
NF != 0 && NR == FNR {
    if (NR == 1) {
        print filter_columns();
    } else {
        memory[line++] = filter_columns();
    }
}
# This block processes just the second file (NR != FNR).
# Since the header has already been printed, it skips the first
# line of the second file (FNR == 1). The block compares each line
# against the one saved in the array memory (the corresponding
# line in the first file). The block prints just the lines
# that do not match.
NF != 0 && NR != FNR {
    if (FNR == 1) {
        line = 0;
    } else {
        new_line = filter_columns();
        if (new_line != memory[line++]) {
            print new_line;
        }
    }
}' File1.csv File2.csv
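If you do want to stay in Perl, here is a minimal sketch of the same idea (it reuses the question's File1.csv/File2.csv names and the same column selection, and assumes the two files contain the same rows in the same order, with no quoted commas):
#!/usr/bin/perl
use strict;
use warnings;

# Keep columns 1, 2, second-to-last and last, like filter_columns() above.
sub filter_columns {
    my @f = split /,/, shift;
    s/^\s+|\s+$//g for @f;                 # trim whitespace (and the trailing newline)
    return join(', ', @f[0, 1, $#f - 1, $#f]);
}

# Return the next non-blank line, mirroring the NF != 0 test in the awk.
sub next_data_line {
    my ($fh) = @_;
    while (defined(my $line = <$fh>)) {
        return $line if $line =~ /\S/;
    }
    return;
}

open my $fh1, '<', 'File1.csv' or die "File1.csv: $!";
open my $fh2, '<', 'File2.csv' or die "File2.csv: $!";

print filter_columns(next_data_line($fh1)), "\n";   # header, taken from File1
next_data_line($fh2);                                # skip File2's header

while (defined(my $old = next_data_line($fh1))) {
    my $new = next_data_line($fh2);
    last unless defined $new;
    my $filtered_new = filter_columns($new);
    print "$filtered_new\n" if $filtered_new ne filter_columns($old);
}
Like the awk version, it compares the filtered strings, so it relies on the rows of the two files lining up one-to-one.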

Answering @IlmariKaronen's questions would clarify the problem much better, but meanwhile I made some assumptions and took a crack at the problem - mainly because I needed an excuse to learn a bit of Text::CSV.
Here's the code:
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;
use Array::Compare;
use feature 'say';
open my $in_file, '<', 'infile.csv';
open my $exp_file, '<', 'expectedfile.csv';
open my $out_diff_file, '>', 'differences.csv';
my $text_csv = Text::CSV->new({ allow_whitespace => 1, auto_diag => 1 });
my $line = readline($in_file);
my $exp_line = readline($exp_file);
die 'Different column headers' unless $line eq $exp_line;
$text_csv->parse($line);
my @headers = $text_csv->fields();
my %all_differing_indices;
# array-of-arrays containing lists of "expected" rows for differing lines;
# only columns that differ from the input have values, others are empty
my @all_differing_rows;
my $array_comparer = Array::Compare->new(DefFull => 1);
while (defined($line = readline($in_file))) {
    $exp_line = readline($exp_file);
    if ($line ne $exp_line) {
        $text_csv->parse($line);
        my @in_fields = $text_csv->fields();
        $text_csv->parse($exp_line);
        my @exp_fields = $text_csv->fields();
        my @differing_indices = $array_comparer->compare([@in_fields], [@exp_fields]);
        @all_differing_indices{@differing_indices} = (1) x scalar(@differing_indices);
        my @output_row = ('') x scalar(@exp_fields);
        @output_row[0, 1, @differing_indices, $#exp_fields] = @exp_fields[0, 1, @differing_indices, $#exp_fields];
        $all_differing_rows[$#all_differing_rows + 1] = [@output_row];
    }
}
my @columns_needed = (0, 1, keys(%all_differing_indices), $#headers);
$text_csv->combine(@headers[@columns_needed]);
say $out_diff_file $text_csv->string();
for my $row_aref (@all_differing_rows) {
    $text_csv->combine(@{$row_aref}[@columns_needed]);
    say $out_diff_file $text_csv->string();
}
It works for the File1 and File2 given in the question and produces the Expected output (except that the Object_Name 'Alpha' is present in the data line - I'm assuming that's a typo in the question).
Time,Object_Name,Frequency,Longname
"2013-08-05 00:00",Alpha,815.13,Aircel_Indore

I've created a script for it with very powerful Linux tools. Link here...
Linux / Unix - Compare Two CSV Files
This project is about the comparison of two CSV files.
Let's assume that csvFile1.csv has XX columns and csvFile2.csv has YY columns.
The script I've written compares one (key) column from csvFile1.csv with another (key) column from csvFile2.csv. Each value from csvFile1.csv (a row of the key column) is compared to each value from csvFile2.csv.
If csvFile1.csv has 1,500 rows and csvFile2.csv has 15,000 rows, the total number of combinations (comparisons) will be 22,500,000. So this is a very helpful way to create an availability report script which, for example, could compare an internal product database with an external (supplier's) product database.
Packages used:
csvcut (cut columns)
csvdiff (compare two csv files)
ssconvert (convert xlsx to csv)
iconv
curlftpfs
zip
unzip
ntpd
proFTPD
You can find more on my blog (plus an example script):
http://damian1baran.blogspot.sk/2014/01/linux-unix-compare-two-csv-files.html
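As a rough illustration of the key-column comparison described above (this is not the blog script; the filenames, the 0-based key column indices and the plain comma splitting are assumptions), a Perl hash lookup avoids doing all 22,500,000 pairwise comparisons:
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical filenames and key-column positions (0-based); adjust to taste.
my ($file1, $key1) = ('csvFile1.csv', 0);
my ($file2, $key2) = ('csvFile2.csv', 0);

# Index every key from csvFile2.csv once...
my %in_file2;
open my $fh2, '<', $file2 or die "$file2: $!";
while (<$fh2>) {
    chomp;
    my @cols = split /,/;          # naive split; assumes no quoted commas
    $in_file2{ $cols[$key2] } = 1;
}
close $fh2;

# ...then a single pass over csvFile1.csv reports which keys are (not) available.
open my $fh1, '<', $file1 or die "$file1: $!";
while (<$fh1>) {
    chomp;
    my @cols = split /,/;
    my $key  = $cols[$key1];
    print "$key\t", ($in_file2{$key} ? "available" : "missing"), "\n";
}
close $fh1;
Indexing one file in a hash turns the quadratic cross-comparison into two linear passes.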

Related

sort and extract a certain number of rows from a file containing dates

I have dates in a txt file, like:
yyyymmdd
The raw data looks like:
20171115
20171115
20180903
...
20201231
There are more than 100k rows. I am trying to keep the "newest" 10k lines in one file, and the "oldest" 10k lines in a separate file.
I guess this must be a two-step process:
sort the lines,
then extract the 10k rows at the top, the "newest" (most recent dates), and the 10k rows towards the end of the file, i.e. the "oldest" (most ancient dates).
How could I achieve it using awk?
I even tried with perl, no luck though, so a perl one-liner would be highly appreciated as well.
Edit: I would prefer a clean, clever solution that I can learn from,
and not an optimization of my attempts.
Example with perl:
@dates = ('20170401', '20170721', '20200911');
@ordered = sort { &compare } @dates;
sub compare {
    $a =~ /(\d{4})(\d{2})(\d{2})/;
    $c = $3 . $2 . $1;
    $b =~ /(\d{4})(\d{2})(\d{2})/;
    $c = $3 . $2 . $1;
    $c <=> $d;
}
print "@ordered\n";
This is an answer using perl.
If you want the oldest on top, you can use the standard sort order:
@dates = sort @dates;
Reverse sort order, with the newest on top:
@dates = sort { $b <=> $a } @dates;
#                  ^^^
#                   |
#  numerical three-way comparison returning -1, 0 or +1
You can then extract 10000 of the entries from the top:
my $keep = 10000;
my @top = splice @dates, 0, $keep;
And 10000 from the bottom:
$keep = @dates unless(@dates >= $keep);
my @bottom = splice @dates, -$keep;
@dates will now contain the dates between the 10000 at the top and the 10000 at the bottom that you extracted.
You can then save the two arrays to files if you want:
sub save {
    my $filename = shift;
    open my $fh, '>', $filename or die "$filename: $!";
    print $fh join("\n", @_) . "\n" if(@_);
    close $fh;
}
save('top', @top);
save('bottom', @bottom);
A command-line script ("one"-liner) with Perl
perl -MPath::Tiny=path -we'
$f = shift; $n = shift//2;              # filename; number of lines or default
@d = sort +(path($f)->lines);           # sort lexicographically, ascending
$n = int @d/2 if 2*$n > @d;             # top/bottom lines, up to half of file
path("bottom.txt")->spew(@d[0..$n-1]);  # write files, top/bottom $n lines
path("top.txt")   ->spew(@d[$#d-$n+1..$#d])
' dates.txt 4
Comments
Needs a filename, and can optionally take the number of lines to take from the top and bottom; in this example 4 is passed (the default is 2), for easy tests with small files. There's no need to check for the filename since the library used to read it, Path::Tiny, does that
For the library (-MPath::Tiny) I specify the method name (=path) only for documentation; this isn't necessary since the library is a class, so =path may simply be removed
Sorting is alphabetical but that is fine with dates in this format; oldest dates come first but that doesn't matter since we'll split off what we need. To enforce numerical sorting, and while at it to sort in descending order, use sort { $b <=> $a } @d;. See sort
We check whether there are enough lines in the file for the desired number of lines to shave off from the (sorted) top and bottom ($n). If there aren't, then that's set to half the file
The syntax $#ary is the last index of the array @ary, and that is used to count off $n items from the back of the array of lines @d
This is written as a command-line program ("one-liner") merely because that was asked for. But that much code would be far more comfortable in a script.
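For reference, roughly the same logic written out as a small script, a sketch that keeps the one-liner's bottom.txt/top.txt output names:
#!/usr/bin/perl
use strict;
use warnings;
use Path::Tiny qw(path);

my $file = shift // die "Usage: $0 file [n]\n";
my $n    = shift // 2;                       # lines to take from top and bottom

my @d = sort +(path($file)->lines);          # lexicographic sort works for yyyymmdd
$n = int(@d / 2) if 2 * $n > @d;             # never take more than half the file

path("bottom.txt")->spew(@d[0 .. $n - 1]);           # oldest $n dates
path("top.txt")   ->spew(@d[$#d - $n + 1 .. $#d]);   # newest $n dates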
Given that your lines with dates will sort lexicographically, it is simple. Just use sort then split.
Given:
cat infile
20171115
20171115
20180903
20100101
20211118
20201231
You can sort then split that input file into files of 3 lines each:
split -l3 <(sort -r infile) dates
# -l10000 for a 10,000 line split
The result:
for fn in dates*; do echo "${fn}:"; cat "$fn"; done
datesaa:
20211118
20201231
20180903
datesab:
20171115
20171115
20100101
# files are named datesaa, datesab, datesac, ..., dateszz
# if you only want two blocks of 10,000 dates,
# just throw the remaining files away.
Given you may have significantly more lines than you are interested in, you can also sort to an intermediate file, then use head and tail to get the newest and oldest respectively:
sort -r infile >dates_sorted
head -n10000 dates_sorted >newest_dates
tail -n10000 dates_sorted >oldest_dates
Assumptions:
dates are not unique (per OP's comment)
results are dumped to two files newest and oldest
newest entries will be sorted in descending order
oldest entries will be sorted in ascending order
there's enough memory on the host to load the entire data file into memory (in the form of an awk array)
Sample input:
$ cat dates.dat
20170415
20171115
20180903
20131115
20141115
20131115
20141115
20150903
20271115
20271105
20271105
20280903
20071115
20071015
20070903
20031115
20031015
20030903
20011115
20011125
20010903
20010903
One idea using GNU awk:
x=5
awk -v max="${x}" '
{ dates[$1]++ }
END {
    count=0
    PROCINFO["sorted_in"]="@ind_str_desc"   # find the newest "max" dates
    for (i in dates) {
        for (n=1; n<=dates[i]; n++) {
            if (++count > max) break
            print i > "newest"
        }
        if (count > max) break
    }
    count=0
    PROCINFO["sorted_in"]="@ind_str_asc"    # find the oldest "max" dates
    for (i in dates) {
        for (n=1; n<=dates[i]; n++) {
            if (++count > max) break
            print i > "oldest"
        }
        if (count > max) break
    }
}
' dates.dat
NOTE: if a duplicate date shows up as rows #10,000 and #10,001, the #10,001 entry will not be included in the output
This generates:
$ cat oldest
20010903
20010903
20011115
20011125
20030903
$ cat newest
20280903
20271115
20271105
20271105
20180903
Here is a quick and dirty Awk attempt which collects the ten smallest and the ten largest entries from the file.
awk 'BEGIN { for(i=1; i<=10; i++) max[i] = min[i] = 0 }
NR==1 { max[1] = min[1] = $1; next }
(!max[10]) || ($1 > max[10]) {
for(i=1; i<=10; ++i) if(!max[i] || (max[i] < $1)) break
for(j=9; j>=i; --j) max[j+1]=max[j]
max[i] = $1 }
(!min[10]) || ($1 < min[10]) {
for(i=1; i<=10; ++i) if (!min[i] || (min[i] > $1)) break
for(j=9; j>=i; --j) min[j+1]=min[j]
min[i] = $1 }
END { for(i=1; i<=10; ++i) print max[i];
print "---"
for(i=1; i<=10; ++i) print min[i] }' file
For simplicity, this has some naïve assumptions (numbers are all positive, there are at least 20 distinct numbers, duplicates should be accounted for).
This avoids external dependencies by using a brute-force sort in native Awk. We keep two sorted arrays min and max with ten items in each, and shift off the values which no longer fit as we populate them with the largest and smallest numbers.
It should be obvious how to extend this to 10,000.
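If it helps, here is a hedged Perl sketch of the same bounded-insertion idea (not a translation of the awk above, just an illustration; the yyyymmdd filter and the limit variable $n are assumptions):
#!/usr/bin/perl
use strict;
use warnings;

my $n = 10;                  # bump to 10_000 for the real data
my (@max, @min);             # @max kept descending, @min kept ascending

while (my $line = <>) {
    chomp $line;
    next unless $line =~ /^\d{8}$/;              # only consider yyyymmdd lines

    # insert into @max (descending) and trim it back to $n entries
    my $i = 0;
    $i++ while $i < @max && $max[$i] >= $line;
    splice @max, $i, 0, $line if $i < $n;
    splice @max, $n if @max > $n;

    # insert into @min (ascending) and trim it back to $n entries
    $i = 0;
    $i++ while $i < @min && $min[$i] <= $line;
    splice @min, $i, 0, $line if $i < $n;
    splice @min, $n if @min > $n;
}

print "$_\n" for @max;       # newest $n dates, newest first
print "---\n";
print "$_\n" for @min;       # oldest $n dates, oldest first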
Same assumptions as with my [other answer], except newest data is in ascending order ...
One idea using sort and head/tail:
$ sort dates.dat | tee >(head -5 > oldest) | tail -5 > newest
$ cat oldest
20010903
20010903
20011115
20011125
20030903
$ cat newest
20180903
20271105
20271105
20271115
20280903
OP can add another sort if needed (eg, tail -5 | sort -r > newest).
For large datasets OP may also want to investigate other sort options, eg, -S (allocate more memory for sorting), --parallel (enable parallel sorting), etc.

Matching and splitting a specific line from a .txt file

I'm looking for concise Perl equivalents (to use within scripts rather than one-liners) to a few things I'd otherwise do in bash/awk:
Count=$(awk '/reads/ && ! seen {print $1; seen=1}' < input.txt)
Which trawls through a specified .txt file that contains a multitude of lines including some in this format:
8523723 reads; of these:
1256265 reads; of these:
2418091 reads; of these:
Printing '8523723' and ignoring the remainder of the matchable lines (as I only wish to act on the first matched instance).
Secondly:
Count2=$(awk '/paired/ {sum+=$1} END{print sum}' < input.txt)
25 paired; of these:
15 paired; of these:
Which would create a running total of the numbers on each matched line, printing 40.
The first one is:
while (<>) {
    if (/(\d+) reads/) {
        print "$1\n";
        last;
    }
}
The second one is:
my $total = 0;
while (<>) {
    if (/(\d+) paired/) {
        $total += $1;
    }
}
say $total;
You could, no doubt, golf them both. But these versions are readable :-)
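For completeness, golfed one-liner forms that drop straight into the shell assignments from the question might look like this (a sketch, assuming the same input.txt):
# first number on the first line matching 'reads'
Count=$(perl -ne 'if (/(\d+)\s+reads/) { print $1; exit }' input.txt)
# running total of the numbers on lines matching 'paired'
Count2=$(perl -ne '$sum += $1 if /(\d+)\s+paired/; END { print $sum + 0 }' input.txt)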

Filter unique lines

I have a file type called fasta which contains a header (e.g. "> 12122") followed by a string. I would like to remove duplicated strings in the file and keep only one of the duplicated strings (no matter which) and the corresponding header.
In the example below, AGGTTCCGGATAAGTAAGAGCC is duplicated
in:
>17-46151
AGGTTCCGGATAAGTAAGAGCC
>1-242
AGGTTCCGGATAAGTAAGAGCC
>18-41148
TCTTAACCCGGACCAGAAACTA
>43-16054
GTCCCACTCCGTAGATCTGTTC
>32-24116
TAGCATATCGAGCCTGAGAACA
>42-16312
TGATACGGATGTTATACGCAGC
out:
>1-242
AGGTTCCGGATAAGTAAGAGCC
>18-41148
TCTTAACCCGGACCAGAAACTA
>43-16054
GTCCCACTCCGTAGATCTGTTC
>32-24116
TAGCATATCGAGCCTGAGAACA
>42-16312
TGATACGGATGTTATACGCAGC
if order is mandatory
# Fields are delimited by newlines
awk -F "\n" '
BEGIN {
    # Records are delimited by ">"
    RS = ">"
}
# skip the first (empty) "record" caused by the leading ">"
NR > 1 {
    # if the string is not known yet, add it to the "order" array O
    if ( ! ( $2 in L ) ) O[++a] = $2
    # remember the (last) label seen for this string
    L[$2] = $1
}
# after reading the file
END {
    # display each (last known) label/string pair in the recorded order
    for ( i=1; i<=a; i++ ) printf( ">%s\n%s\n", L[O[i]], O[i])
}
' YourFile
if order is not mandatory
awk -F "\n" 'BEGIN{RS=">"}NR>1{L[$2]=$1}END{for (l in L) printf( ">%s\n%s\n", L[l], l)}' YourFile
$ awk '{if(NR%2) p=$0; else a[$0]=p}END{for(i in a)print a[i] ORS i}' file
>18-41148
TCTTAACCCGGACCAGAAACTA
>32-24116
TAGCATATCGAGCCTGAGAACA
>1-242
AGGTTCCGGATAAGTAAGAGCC
>43-16054
GTCCCACTCCGTAGATCTGTTC
>42-16312
TGATACGGATGTTATACGCAGC
Explained:
{
    if(NR%2)      # every first (of 2) lines: keep the header in p
        p=$0
    else          # every second line (the sequence) is the hash key
        a[$0]=p
}
END{
    for(i in a)   # output every unique key and its header
        print a[i] ORS i
}
Here's a quick one-line awk solution for you. It should be more immediate than the other answers because it runs line by line rather than queuing the data (and looping through it) until the end:
awk 'NR % 2 == 0 && !seen[$0]++ { print last; print } { last = $0 }' file
Explanation:
NR % 2 == 0 runs only on even numbered records (lines, NR)
!seen[$0]++ stores and increments values and returns true only when there were no values in the seen[] hash (!0 is 1, !1 is 0, !2 is 0, etc.)
(Skipping to the end) last is set to the value of each line after we're otherwise done with it
{ print last; print } will print last (the header) and then the current line (gene code)
Note: while this preserves the original order, it shows the first uniquely seen instance while the expected output showed the final uniquely seen instance:
>17-46151
AGGTTCCGGATAAGTAAGAGCC
>18-41148
TCTTAACCCGGACCAGAAACTA
>43-16054
GTCCCACTCCGTAGATCTGTTC
>32-24116
TAGCATATCGAGCCTGAGAACA
>42-16312
TGATACGGATGTTATACGCAGC
If you want the final uniquely seen instance, you can reverse the file before passing to awk and then reverse it back afterwards:
tac file |awk … |tac
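A Perl near-equivalent of that even-line idiom, offered only as a sketch (%seen plays the role of the awk seen[] array and $. is the current line number):
perl -ne 'if ($. % 2) { $head = $_ } elsif (!$seen{$_}++) { print $head, $_ }' file
Odd lines are remembered as the current header; an even line (the sequence) is printed together with its header only the first time it is seen.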

use sed (or awk, perl etc) to identify first occurrence of a markdown title

I have a series of files with yaml headers followed by markdown subtitles, looking something like this:
Minimal example input file:
---
layout: post
tags:
- might
- be
- variable
- number
- of
- these
category: ecology
---
my (h2 size) title
------------------
some text
possible other titles we don't want
-----------------------------------
more text more text
As I've tried to indicate, the size of the YAML header and the line on which the first subtitle appears vary, so I can't count on knowing the line numbers of any change ahead of time. I'd like to identify the first title (which should also be the first non-blank text following the closing ---). I'd then like to write that text into the YAML header like so, with the title we grabbed removed from the body text and the rest of the text left intact:
Target Output file
---
layout: post
tags:
- might
- be
- variable
- number
- of
- these
categories: ecology
title: my (h2 size) title
---
some text
possible other titles we don't want
-----------------------------------
more text more text
Seems like this should be a reasonable task for sed/awk or such, but my usage of these tools is quite elementary and I haven't been able to puzzle this one out.
I see I can search between words, sed '/word1/,/word2/p', but I'm not sure how to convert this to search between the second occurrence of ^---$ and the first occurrence of ^----+$ (a line with more than three dashes), how to then drop the extra blank lines, and how to paste the title into the YAML front matter above.
Perhaps with so many steps perl would be a better choice than sed, but that's one where I have even less familiarity. Thanks for any hints or advice.
Just do 2 passes - the first (when NR==FNR) to find the title and the line number that you want it printed before and the second to print it and the other lines when the line numbers are appropriate:
$ cat tst.awk
NR==FNR {
if (hdrEnd && !title && NF) {title = $0; titleStart=FNR; titleEnd=FNR+1 }
if (hdrStart && /^---$/) {hdrEnd = FNR }
if (!hdrStart && /^---$/) {hdrStart = FNR }
next
}
FNR == hdrEnd { print "title:", title }
(FNR < titleStart) || (FNR > titleEnd)
$ awk -f tst.awk file file
---
layout: post
tags:
- might
- be
- variable
- number
- of
- these
category: ecology
title: my (h2 size) title
---
some text
possible other titles we don't want
-----------------------------------
more text more text
hdrStart is the line number where the header starts, etc. If you want to skip more lines around the title than just the text and the subsequent line of dashes, just change how titleStart and titleEnd are populated to FNR-1 and FNR+2 or whatever. FNR (File Number of Records) is the current line number in just the currently open file, while NR (Number of Records) is the number of lines read so far in total across all previously and currently opened files.
If you don't want to specify the file name twice on the command line you can duplicate it in awks BEGIN section:
$ cat tst.awk
BEGIN{ ARGV[ARGC] = ARGV[ARGC-1]; ARGC++ }
NR==FNR {
if (hdrEnd && !title && NF) {title = $0; titleStart=FNR; titleEnd=FNR+1 }
if (hdrStart && /^---$/) {hdrEnd = FNR }
if (!hdrStart && /^---$/) {hdrStart = FNR }
next
}
FNR == hdrEnd { print "title:", title }
(FNR < titleStart) || (FNR > titleEnd)
then you only need to invoke the script as:
$ awk -f tst.awk file
EDIT: Actually - here's an alternative that doesn't do a 2-pass approach and is arguably simpler:
$ cat tst.awk
(state == 0) && /^---$/ { state=1; print; next }
(state == 1) && /^---$/ { state=2; next }
(state == 2) && /^./ { state=3; printf "title: %s\n---\n",$0; next }
(state == 3) && /^-+$/ { state=4; next }
state != 2 { print }
$ awk -f tst.awk file
---
layout: post
tags:
- might
- be
- variable
- number
- of
- these
category: ecology
title: my (h2 size) title
---
some text
possible other titles we don't want
-----------------------------------
more text more text
If you're familiar with state machines it should be obvious what it's doing, if not let me know.
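For readers more at home in Perl, a hedged line-for-line translation of that state machine might look like this (same states, same assumptions about the input):
#!/usr/bin/perl
use strict;
use warnings;

my $state = 0;
while (my $line = <>) {
    if    ($state == 0 && $line =~ /^---$/) { $state = 1; print $line; next }
    elsif ($state == 1 && $line =~ /^---$/) { $state = 2; next }
    elsif ($state == 2 && $line =~ /^./)    { $state = 3; print "title: $line---\n"; next }
    elsif ($state == 3 && $line =~ /^-+$/)  { $state = 4; next }
    print $line unless $state == 2;         # echo everything else, skipping blanks in state 2
}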
Some quick and dirty Perl code:
$/ = undef;                            # no input record separator, so the following reads the full file
my $all = <STDIN>;
my @parts = split(/^(---+)$/m, $all);  # split into sections delimited by all-dashes lines
my @head = split("\n", $parts[2]);     # split the header into lines
my @tit  = split("\n", $parts[4]);     # split the title section into lines
push @head, "title: " . pop @tit;      # move the title line from the title section into the header
$parts[2] = join("\n", @head) . "\n";  # rebuild the header
splice(@parts, 4, 2);                  # drop the now-empty title section and its dash underline
print join("", @parts);                # rebuild all and print to stdout
This might not be robust enough for you: it treats any line of three or more dashes as a delimiter, it assumes UNIX newlines, does not check that the title is non-blank, etc. But it might be useful as a starting point, or if you only need to run this once.
Another approach could be to read all the lines into an array in memory, loop over them to find the delimiter lines, and move the title line.
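A rough sketch of that array-based approach (it assumes, like the examples above, that the title's dash underline is the line immediately after it):
#!/usr/bin/perl
use strict;
use warnings;

my @lines = <>;                       # read the whole file into an array

# locate the closing "---" of the YAML header and the first non-blank line after it
my $dash_count = 0;
my ($close_idx, $title_idx);
for my $i (0 .. $#lines) {
    if ($lines[$i] =~ /^---$/) {
        $dash_count++;
        $close_idx = $i if $dash_count == 2;
        next;
    }
    if (defined $close_idx && $lines[$i] =~ /\S/) {
        $title_idx = $i;
        last;
    }
}
die "no title found\n" unless defined $title_idx;

my $title = $lines[$title_idx];
splice @lines, $title_idx, 2;                     # drop the title line and its dash underline
splice @lines, $close_idx, 0, "title: $title";    # insert it into the YAML header
print @lines;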
Maybe this Perl code will help you find a solution:
#!/usr/bin/env perl
use Modern::Perl;
use File::Slurp;

my @file_content = read_file('test.yml');
my ($start, $stop, $title);
foreach my $line (@file_content) {
    if ($line =~ m{ --- }xms) {
        if (!$start) {
            $start = 1;
        }
        else {
            $stop = 1;
            next;
        }
    }
    if ($line && $stop && $line =~ m{\w}xms) {
        $title = $line;
        last;
    }
}
say "Title: $title";
say "Title: $title";
Output with data from above:
Title: my (h2 size) title
Good old python:
with open("i.yaml") as fp:
    lines = fp.readlines()

c = False
i = 0
target = -1
for line in lines:
    i += 1
    if c:
        if line.strip() != "":
            source = i - 1
            c = False
    if line.strip() == "---":
        if i > 1:
            c = True
            target = i - 1

lines[target:target] = ["title: " + lines[source]]
del lines[source + 1]
del lines[source + 1]

with open("o.yaml", "w") as fp:
    fp.writelines(lines)

Matched lines (with regex) are being written to both output files, but they're supposed to be written to only one output file

I have a tab-delimited text file with several rows. I wrote a script in which I assign the rows to an array, and then I search through the array by means of regular expressions to find the rows that match certain criteria. When a match is found, it is written to Output1. If, after going through all the if-statements listed (the regular expressions), the criteria still aren't met, the line is written to Output2.
It works 100% when it comes to matching criteria and writing to Output1, but here is where my problem comes in:
The matched lines are also being written to Output2, along with the unmatched lines. I am probably making a silly mistake, but I really just can't see it. If someone could have a look and help me out, I'd really appreciate it.
Thanks so much! :)
Inputfile sample:
skool school
losieshuis pension
prys prijs
eeu eeuw
lys lijs
water water
outoritêr outoritaire
#!/usr/bin/perl -w
use strict;
use warnings;
use open ':utf8';
use autodie;

open OSWNM, "<SecondWordsNotMatched.txt";
open ONIC, ">Output1NonIdenticalCognates.txt";
open ONC, ">Output2NonCognates.txt";

while (my $line = <OSWNM>)
{
    chomp $line;
    my @Row = $line;
    for (my $x = 0; $x <= $#Row; $x++)
    {
        my $RowWord = $Row[$x];
        # Match: anything, followed by 'y' or 'lê' or 'ê', followed by anything, followed by
        # a tab, followed by anything, followed by 'ij' or 'leggen' or 'e', followed by anything
        if ($RowWord =~ /(.*)(y|lê|ê)(.*)(\t)(.*)(ij|leggen|e)(.*)/)
        {
            print ONIC "$RowWord\n";
        }
        # Match: anything, followed by 'eeu', followed by 'e' or 's' (optional), followed by
        # anything, followed by a tab, followed by anything, followed by 'eeuw', followed by 'en' (optional)
        if ($RowWord =~ /(.*)(eeu)(e|s)?(\t)(.*)(eeuw)(en)?/)
        {
            print ONIC "$RowWord\n";
        }
        else
        {
            print ONC "$RowWord\n";
        }
    }
}
Inside your loop you essentially have:
if (A) {
output to file1
}
if (B) {
output to file1
} else {
output to file2
}
So you'll output to file2 anything that doesn't satisfy B (regardless of whether A was satisfied or not), and output stuff that satisfies both A and B twice to file1.
If outputting twice was not intended, you should modify your logic to something like:
if (A or B) {
output to file1
} else {
output to file2
}
Or:
if (A) {
output to file1
} elsif (B) {
output to file1
} else {
output to file2
}
(This second version allows you to do different processing for the A and B cases.)
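Concretely, with the regexes and filehandles from the question, the first form would look something like this:
if (   $RowWord =~ /(.*)(y|lê|ê)(.*)(\t)(.*)(ij|leggen|e)(.*)/
    || $RowWord =~ /(.*)(eeu)(e|s)?(\t)(.*)(eeuw)(en)?/ )
{
    print ONIC "$RowWord\n";
}
else
{
    print ONC "$RowWord\n";
}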
If the double output was intended, you could do something like:
my $output_to_file2 = 1;
if (A) {
output to file1
$output_to_file2 = 0;
}
if (B) {
output to file1
$output_to_file2 = 0;
}
if ($output_to_file2) {
output to file2
}