Remove duplicates from CSV (i.e. remove both the original and the duplicate) - perl

Scenario: I have two CSV files. One is a trusted address file (trusted.csv); the other is a testing address file (testing.csv) that will contain duplicate addresses from the first file.
The problem: trusted.csv has already been used to print labels. I need to use testing.csv to generate more labels, but I can't have any duplicates. I tried merging the two CSV files, but I can't figure out how to remove both the duplicate entry and the offending original entry. Another problem is that I need to ignore case. sort -uf works like it should, but of course that means it leaves the original value.

As you are talking about sort, I assume a command-line solution is acceptable.
This is quite a heavy solution; I believe there is something better, but for the moment I have no better idea.
You need to keep the lines that don't match certain other lines (or remove those that do match). grep -v does that very well, and with the -i option added it ignores case. As you may have many duplicate lines to remove, -f will be your friend, since it lets you supply many patterns in a file. As with many *nix commands and file options, specifying - (a single dash) as the filename makes the command read from standard input rather than from a file on storage. To summarize: grep -i -f - -v ~/tmp/file will read the file ~/tmp/file and the patterns from standard input. It will keep all lines that don't match the patterns, and the matching will be done regardless of case.
Now you need to build the pattern list, which is the list of duplicated lines. uniq identifies duplicate adjacent lines; -d makes it print each duplicate once and -i makes it ignore case. To make duplicate lines adjacent, you can use sort, which with the -f option also ignores case. So sort -f ~/tmp/file | uniq -d -i prints each duplicated line exactly once.
Putting both parts together results in the following command: sort -f ~/tmp/file | uniq -d -i | grep -i -f - -v ~/tmp/file. sort groups lines that are identical (ignoring case) so that uniq can keep the ones that are duplicated, and those duplicates are then used as the patterns selecting the lines to remove.
Let's take an example. The file below has one letter per line (dup simply identifies lines that are duplicated):
a dup
b
c dup
a dup
d
C dup
e
f
c dup
A dup
The application of our pipe of filters results in:
sort -f ~/tmp/file | uniq -d -i | grep -i -f - -v ~/tmp/file
a
a
A                        b
b            a           d
c  ----->    c  ----->   e
c                        f
C
d
e
f
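Applied to the original scenario, a minimal sketch might look like this (new_labels.csv is a hypothetical output name, the files are assumed to have no header rows, and -F and -x are added so that each address line is matched as a complete literal line rather than as a regular-expression substring):
sort -f trusted.csv testing.csv | uniq -d -i | grep -i -F -x -v -f - testing.csv > new_labels.csv
This keeps only the lines of testing.csv whose address does not also appear (ignoring case) in trusted.csv or elsewhere in the merged data, so nothing that has already been printed gets a second label.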

Related

grep not performing very well on large files, is there an alternative?

I have a diff that essentially equates to either additional unique lines or to lines that have moved around in the file, and thus their line numbers have changed. To identify what is truly a new addition, I run this little perl snippet to separate the 'resolved' lines from the 'unresolved' lines:
perl -n -e'
/^\-([^\-].*?)\([^,\(]+,\d+,\d+\).*$/ && do { print STDOUT "$1\n"; next; };
/^\+([^\+].*?)\([^,\(]+,\d+,\d+\).*$/ && do { print STDERR "$1\n"; next; };
' "$delta" 1>resolved 2>unresolved
This is quite quick in fact and does the job, separating a 6000+ line diff into two 3000+ line files, removing any references to line numbers and unified diff decoration. Next comes the grep command that seems to run at 100% CPU for nearly 9 minutes (real):
grep -v -f resolved unresolved
This is essentially removing all resolved lines from the unresolved file. The output, after 9 minutes, is coincidentally 9 lines of output - the unique additions or unresolved lines.
Firstly, when I have used grep in the past, it's been pretty good at this, so why in this case is it being exceptionally slow and CPU hungry?
Secondly, is there a more efficient alternative way of removing lines from one file that are contained within another?
If the lines to be matched across the two files are supposed to be exact matches, you can use sort and uniq to do the job:
cat resolved resolved unresolved | sort | uniq -u
The only non-duplicated lines in the pipeline above will be lines in unresolved that are not in resolved. Note that it's important to specify resolved twice in the cat command: otherwise the uniq will also pick out lines unique to that file. This assumes that resolved and unresolved didn't have duplicated lines to begin with. But that's pretty easy to deal with: just sort and uniq them first
sort resolved | uniq > resolved.uniq
sort unresolved | uniq > unresolved.uniq
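With the deduplicated files in place, the combining step becomes (a minimal sketch; resolved.uniq is listed twice so that every line it contains occurs at least twice in the stream and is therefore dropped by uniq -u):
cat resolved.uniq resolved.uniq unresolved.uniq | sort | uniq -u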
Also, I've found fgrep to be significantly faster if I'm trying to match fixed strings, so that might be an alternative.
Grep is probably re-scanning the entire file for each and every pattern it's been told to find. You can try fgrep if it exists on your system, or grep -F if it doesn't; that forces grep to use the Aho-Corasick string matching algorithm (http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm), which attempts to match all strings simultaneously and therefore needs only one pass through the file.
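For this particular task, that would look something like the following (a sketch; -x is added on the assumption that the patterns in resolved are complete lines and should only match whole lines of unresolved):
grep -F -x -f resolved unresolved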

How to assign number for a repeating pattern

I am doing some calculations using Gaussian. From the Gaussian output file, I need to extract the input structure information. The output file contains more than 800 structure coordinates. What I did so far is collect all the input coordinates using a combination of the grep, awk and sed commands, like so:
grep -A 7 "Input orientation:" test.log | grep -A 5 "C" | awk '/C/{print "structure number"}1' | sed '/--/d' > test.out
This helped me to grep all the input coordinates and insert a line with "structure number". So now I have a file that contains a pattern which is being repeated in a regular fashion. The file is like the following:
structure Number
4.176801 -0.044096 2.253823
2.994556 0.097622 2.356678
5.060174 -0.115257 3.342200
structure Number
4.180919 -0.044664 2.251182
3.002927 0.098946 2.359346
5.037811 -0.103410 3.389953
Here, "Structure number" is being repeated. I want to write a number like "structure number:1", "structure number 2" in increasing order.
How can I solve this problem?
Thanks for your help in advance.
I am not familiar at all with a program called Gaussian, so I have no clue what the original input looked like. If someone posts an example, I might be able to give an even shorter solution.
However, as far as I understand it, the OP is content with the output of his/her code, except that he/she wants to append an increasing number to the lines inserted with awk.
This can be achieved with the following line (adjusting the OP's code):
grep -A 7 "Input orientation:" test.log | grep -A 5 "C" | awk '/C/{print "structure number"++i}1' | sed '/--/d' > test.out
Addendum:
Even without knowing the actual input, I am sure that one can at least get rid of the sed command leaving that piece of work to awk. Also, there is no need to quote a single character grep pattern:
grep -A 7 "Input orientation:" test.log | grep -A 5 C | awk '/C/{print "structure number"++i}!/--/' > test.out
I am not sure since I cannot test, but it should be possible to let awk do the grep's work, too. As a first guess I would try the following:
awk '/Input orientation:/{li=7}!li{next}{--li}/C/{print "structure number"++i;lc=5}!lc{next}{--lc}!/--/' test.log > test.out
While this might be a little bit longer in code it is an awk-only solution doing all the work in one process. If I had input to test with, I might come up with a shorter solution.
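If the intermediate file shown in the question already exists, another option is to number the marker lines in a separate awk pass (a sketch, assuming the marker lines are exactly "structure Number" as in the sample; test.numbered is a hypothetical output name):
awk '/^structure Number/{print $0": "++i; next}1' test.out > test.numbered
The counter ++i turns the markers into "structure Number: 1", "structure Number: 2", and so on, while the coordinate lines pass through unchanged.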

Perl script to compare two files but print in order

I have followed this question, perl compare two file and print the matching lines, and found the lines which match or don't match between two files using a hash.
But I find that the hash rearranges the lines, and I want the lines in order. I could write multiple for loops to get the results in order, but that is not as efficient as a hash. Has anyone faced this issue before, and could you please help with your solution?
Maybe I don't fully understand the question, but isn't
fgrep -xf file2 file1
enough? Or:
fgrep -xf file1 file2
Yes, it is not perl, but it is short, simple and fast...
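Note that because grep reads file1 sequentially, the matching lines come out in file1's original order, which is exactly the ordering requirement. A small hedged demonstration with throwaway files:
printf 'b\nc\na\n' > file1
printf 'a\nb\n' > file2
fgrep -xf file2 file1
This prints b and then a, i.e. the order of file1 rather than the order of file2.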
This can be done efficiently in two steps. Let's assume you have been able to find the "lines that match" but they are in the wrong order; then a simple grep can re-organize them. Assuming you have a script matchThem that takes two inputs (file1 and file2) and writes the matching lines to tempFile, the overall script will be:
matchThem file1 file2 > tempFile
grep -Fx -f tempFile file1
The -Fx flags mean:
-F : match fixed strings rather than regular expressions (much faster)
-x : only match whole lines
If you want a hash which keeps the insertion order, then try out the CPAN module Tie::IxHash.

I want to print a text file in columns

I have a text file which looks something like this:
jdkjf
kjsdh
jksfs
lksfj
gkfdj
gdfjg
lkjsd
hsfda
gadfl
dfgad
[very many lines, that is]
but would rather like it to look like
jdkjf kjsdh
jksfs lksfj
gkfdj gdfjg
lkjsd hsfda
gadfl dfgad
[and so on]
so I can print the text file on a smaller number of pages.
Of course, this is not a difficult problem, but I'm wondering if there is some excellent tool out there for solving problems like these.
EDIT: I'm not looking for a way to remove every other newline from a text file, but rather a tool which interprets text as "pictures" and then lays these out on the page nicely (by writing the appropriate whitespace symbols).
You can use this python code.
tables=input("Enter number of tables ")
matrix=[]
file=open("test.txt")
for line in file:
    matrix.append(line.replace("\n",""))
    if (len(matrix)==int(tables)):
        print (matrix)
        matrix=[]
file.close()
(Since you don't name your operating system, I'll simply assume Linux, Mac OS X or some other Unix...)
Your example looks like it can also be described by the expression "joining 2 lines together".
This can be achieved in a shell (with the help of xargs and awk) -- but only for an input file that is structured like your example (the result always puts 2 words on a line, irrespective of how many words each one contains):
cat file.txt | xargs -n 2 | awk '{ print $1" "$2 }'
This can also be achieved with awk alone (this time it really joins 2 full lines, irrespective of how many words each one contains):
awk '{printf $0 " "; getline; print $0}' file.txt
Or use sed --
sed 'N;s#\n# #' < file.txt
Also, xargs could do it:
xargs -L 2 < file.txt
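Two more standard tools are worth mentioning here: paste can read standard input twice per output line, joining consecutive lines with a tab, and GNU pr can lay the lines out in columns across the page (-a) without page headers (-t):
paste - - < file.txt
pr -2 -t -a file.txt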
I'm sure other people could come up with dozens of other, quite different methods and commandline combinations...
Caveats: you'll have to handle files with an odd number of lines explicitly; with some of the commands above, the last input line may not be processed correctly when the line count is odd.
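For instance, a hedged variant of the awk approach that keeps the final unpaired line of an odd-length file (and terminates it with a newline):
awk 'NR%2{printf "%s ",$0; next}{print}END{if(NR%2)print ""}' file.txt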

*nix: perform set union/intersection/difference of lists

I sometimes need to compare two text files. Obviously, diff shows the differences, but it also hides the similarities, which is kind of the point of diff.
Suppose I want to do other comparisons on these files: set union, intersection, and subtraction, treating each line as an element in the set.
Are there similarly simple common utilities or one-liners which can do this?
Examples:
a.txt
john
mary
b.txt
adam
john
$> set_union a.txt b.txt
john
mary
adam
$> set_intersection a.txt b.txt
john
$> set_difference a.txt b.txt
mary
Union: sort -u files...
Intersection: sort files... | uniq -d
Overall difference (elements which are just in one of the files):
sort files... | uniq -u
Mathematical difference (elements only once in one of the files):
sort files... | uniq -u | sort - <(sort -u fileX ) | uniq -d
The first two commands get me all unique elements. Then we merge this with the file we're interested in. Command breakdown for sort - <(sort -u fileX ):
The - will process stdin (i.e. the list of all unique elements).
<(...) runs a command, writes the output in a temporary file and passes the path to the file to the command.
So this gives us a mix of all the unique elements plus all the unique elements in fileX. The duplicates are then the unique elements which are only in fileX.
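Using the a.txt and b.txt from the question (and assuming neither file contains internal duplicate lines, which the intersection recipe relies on), the first two recipes behave like this:
$ sort -u a.txt b.txt
adam
john
mary
$ sort a.txt b.txt | uniq -d
john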
If you want to get the common lines between two files, you can use the comm utility.
A.txt :
A
B
C
B.txt
A
B
D
and then, using comm will give you :
$ comm <(sort A.txt) <(sort B.txt)
                A
                B
C
        D
In the first column, you have what is in the first file and not in the second.
In the second column, you have what is in the second file and not in the first.
In the third column, you have what is in the both files.
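comm can also suppress individual columns with -1, -2 and -3, which yields the three set operations directly (shown here with the a.txt and b.txt from the question; the <( ) process substitution assumes a shell like bash):
comm -12 <(sort a.txt) <(sort b.txt)   # intersection: lines in both files
comm -23 <(sort a.txt) <(sort b.txt)   # difference: lines only in a.txt
comm -13 <(sort a.txt) <(sort b.txt)   # difference: lines only in b.txt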
If you don't mind using a bit of Perl, and if your file sizes are reasonable such that the lines can be held in a hash, you could read the two files into two hashes (say %from_1 and %from_2) and then do:
# ...get common keys in an array...
my @both_things;
for (keys %from_1) {
    push @both_things, $_ if exists $from_2{$_};
}
# ...put unique things in an array...
my @once_only;
for (keys %from_1) {
    push @once_only, $_ unless exists $from_2{$_};
}
I can't comment on Aaron Digulla's answer, which despite being accepted does not actually compute the set difference.
The set difference A\B with the given inputs should only return mary, but the accepted answer also incorrectly returns adam.
This answer has an awk one-liner that correctly computes the set difference:
awk 'FNR==NR {a[$0]++; next} !a[$0]' b.txt a.txt
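For reference: the first file on the command line (here b.txt) is read while FNR==NR and its lines are stored as keys of the array a; lines of the second file are then printed only if they never appeared in the first. Swapping the file arguments gives the other difference (a small usage sketch with the question's files, which would print adam):
awk 'FNR==NR {a[$0]++; next} !a[$0]' a.txt b.txt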