*nix: perform set union/intersection/difference of lists - perl

I sometimes need to compare two text files. Obviously diff shows the differences, but it also hides the similarities, which is kind of the point.
Suppose I want to do other comparisons on these files: set union, intersection, and subtraction, treating each line as an element in the set.
Are there similarly simple common utilities or one-liners which can do this?
Examples:
a.txt
john
mary
b.txt
adam
john
$> set_union a.txt b.txt
john
mary
adam
$> set_intersection a.txt b.txt
john
$> set_difference a.txt b.txt
mary

Union: sort -u files...
Intersection: sort files... | uniq -d
Overall difference (elements which are just in one of the files):
sort files... | uniq -u
Mathematical difference (elements only once in one of the files):
sort files... | uniq -u | sort - <(sort -u fileX) | uniq -d
The first two commands in the pipe give us all elements that are unique overall. Then we merge this with the file we're interested in. Command breakdown for sort - <(sort -u fileX):
The - will process stdin (i.e. the list of all unique elements).
<(...) runs a command, makes its output available as if it were a file (via a named pipe or temporary file) and passes that file's path to the outer command.
So this gives us a mix of all unique elements plus all unique elements in fileX. The duplicates are then the unique elements which are only in fileX.
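Applied to the a.txt and b.txt from the question, that would look roughly like this (the uniq -d intersection assumes neither file repeats a line internally):
$ sort -u a.txt b.txt
adam
john
mary
$ sort a.txt b.txt | uniq -d
john
$ sort a.txt b.txt | uniq -u | sort - <(sort -u a.txt) | uniq -d
mary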

If you want to get the common lines between two files, you can use the comm utility.
A.txt :
A
B
C
B.txt
A
B
D
and then, using comm will give you:
$ comm <(sort A.txt) <(sort B.txt)
                A
                B
C
        D
In the first column, you have what is in the first file and not in the second.
In the second column, you have what is in the second file and not in the first.
In the third column, you have what is in both files.
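If you only want one of the sets rather than the three-column view, comm can also suppress individual columns: -1, -2 and -3 each hide the corresponding column. For example:
$ comm -12 <(sort A.txt) <(sort B.txt)    # lines in both files
A
B
$ comm -23 <(sort A.txt) <(sort B.txt)    # lines only in A.txt
C
$ comm -13 <(sort A.txt) <(sort B.txt)    # lines only in B.txt
D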

If you don't mind using a bit of Perl, and if your files are small enough that their lines fit comfortably into a hash in memory, you could read each file into a hash (one key per line) and then do:
# ...read each file into a hash, one key per line (file names from the example above)...
my (%from_1, %from_2);
open my $fh1, '<', 'a.txt' or die $!;
while (<$fh1>) { chomp; $from_1{$_} = 1 }
open my $fh2, '<', 'b.txt' or die $!;
while (<$fh2>) { chomp; $from_2{$_} = 1 }

# ...get common keys in an array...
my @both_things;
for (keys %from_1) {
    push @both_things, $_ if exists $from_2{$_};
}

# ...put keys unique to the first file in an array...
my @once_only;
for (keys %from_1) {
    push @once_only, $_ unless exists $from_2{$_};
}

I can't comment on Aaron Digulla's answer, which despite being accepted does not actually compute the set difference.
The set difference A\B with the given inputs should only return mary, but the accepted answer also incorrectly returns adam.
This answer has an awk one-liner that correctly computes the set difference:
awk 'FNR==NR {a[$0]++; next} !a[$0]' b.txt a.txt
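The same FNR==NR idiom (index the first file named on the command line, then filter the second) can also produce the intersection; a sketch along the same lines, not taken from that answer:
awk 'FNR==NR {a[$0]++; next} a[$0]' a.txt b.txt
With the example files this prints john, in the order the lines appear in b.txt.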

Related

Improving sed program - conditions

I use this code, based on this question:
$ names=(file1.txt file2.txt file3.txt) # Declare array
$ printf 's/%s/a-&/g\n' "${names[@]%.txt}" # Generate sed replacement script
s/file1/a-&/g
s/file2/a-&/g
s/file3/a-&/g
$ sed -f <(printf 's/%s/a-&/g\n' "${names[@]%.txt}") f.txt
TEXT
\connect{a-file1}
\begin{a-file2}
\connect{a-file3}
TEXT
75
How can I add conditions to solve the following problem?
names=(file1.txt file2.txt file3file2.txt)
I mean that there is a word in one of the file names that is repeated as part of another file name; then a- gets added multiple times.
I tried
sed -f <(printf 's/{%s}/{s-&}/g\n' "${files[@]%.tex}")
but the result is
\input{a-{file1}}
I need to match {%s} and place a- between { and %s.
It's not clear from the question how to resolve conflicting input. In particular, the code will replace any instance of file1 with a-file1, even things like 'foofile1'.
On the surface, the goal seems to be to change whole tokens (e.g., foofile1 should not be impacted by the file1 substitution). This could be achieved by adding a word-boundary assertion (\b) before and after the filename. This will prevent the pattern from matching inside other, longer file names.
printf 's/\\b%s\\b/a-&/g\n' "${names[@]%.txt}"
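Plugged into the original command, that might look like this (assuming GNU sed, which understands \b):
sed -f <(printf 's/\\b%s\\b/a-&/g\n' "${names[@]%.txt}") f.txt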
Since this explanation is too long for a comment, I am adding an answer here. I am not sure whether my previous answer was clear or not, but my answer takes care of this case and will only replace exact file names, NOT a mix of file names.
Let's say the following are the array value and the Input_file:
names=(file1.txt file2.txt file3file2.txt)
echo "${names[*]}"
file1.txt file2.txt file3file2.txt
cat file1
TEXT
\connect{file1}
\begin{file2}
\connect{file3}
TEXT
75
Now when we run the following code:
awk -v arr="${names[*]}" '
BEGIN{
    FS=OFS="{"
    num=split(arr,array," ")
    for(i=1;i<=num;i++){
        sub(/\.txt/,"",array[i])
        array1[array[i]"}"]
    }
}
$2 in array1{
    $2="a-"$2
}
1
' file1
The output will be as follows. You can see that file3 is NOT replaced since it was NOT present in the array value.
TEXT
\connect{a-file1}
\begin{a-file2}
\connect{file3}
TEXT
75

Remove duplicates from csv (ie remove original and the duplicate)

Scenario: I have two CSV files. One is a trusted address file (trusted.csv); the other is a testing address file (testing.csv) that will contain duplicate addresses from the first file.
The problem: trusted.csv has already been used to print labels. I need to use testing.csv to generate more labels, but I can't have any duplicates. I tried merging the two CSV files, but I can't figure out how to remove both the duplicate entry and the offending original entry. Another problem is that I need to ignore case. sort -uf works like it should, but of course that means it leaves the original value.
As you are talking about sort, I assume a command-line solution is OK.
This is quite a heavy solution; I believe there is something better, but for the moment I have no better idea.
You need to keep the lines that don't match some other lines (or remove those that do match). grep -v does that very well, and with the -i option added it doesn't care about case. As you may have many duplicate lines to remove, -f will be your friend, as it allows you to specify many patterns in a file. As with many *nix commands, specifying - (a single dash) as the file name makes the command read the data from standard input rather than from a file on storage. To summarize: grep -i -f - -v ~/tmp/file will read the file ~/tmp/file and the patterns from standard input. It will keep all lines that don't match the patterns, and the matching is done regardless of character case.
Now you need to build the pattern list, which is the list of duplicated lines. uniq identifies duplicate adjacent lines; -d makes it print duplicates once and -i makes it ignore case. To make duplicate lines adjacent, you can use sort, which with the -f option also ignores case. So sort -f ~/tmp/file | uniq -d -i prints each duplicated line once.
Putting both parts together results in the following command: sort -f ~/tmp/file | uniq -d -i | grep -i -f - -v ~/tmp/file. sort groups identical lines together so that uniq can keep those that are duplicated, which are then used as patterns to select the lines that will be removed.
Let's take an example. The file below has one letter per line (dup simply identifies lines that are duplicated):
a dup
b
c dup
a dup
d
C dup
e
f
c dup
A dup
The application of our pipe of filters results in:
sort -f ~/tmp/file | uniq -d -i | grep -i -f - -v ~/tmp/file
a
a
A
b            a            b
c   ----->   c   ----->   d
c                         e
C                         f
d
e
f
(The left column is the sorted file, the middle column is the duplicates printed by uniq -d -i, and the right column is the final output of the grep.)
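Applied to the trusted.csv/testing.csv scenario from the question, the same pipeline might look like this (just a sketch, assuming a duplicate means an identical full line compared case-insensitively; new_labels.csv is only a placeholder name):
sort -f trusted.csv testing.csv | uniq -d -i | grep -i -f - -v testing.csv > new_labels.csv
Since addresses often contain characters that are special in regular expressions, adding -F and -x to the grep, so that the patterns are treated as literal whole lines, may be safer.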

How to assign number for a repeating pattern

I am doing some calculations using gaussian. From the gaussian output file, I need to extract the input structure information. The output file contains more than 800 structure coordinates. What I did so far is collect all the input coordinates using a combination of grep, awk and sed commands, like so:
grep -A 7 "Input orientation:" test.log | grep -A 5 "C" | awk '/C/{print "structure number"}1' | sed '/--/d' > test.out
This helped me to grep all the input coordinates and insert a line with "structure number". So now I have a file that contains a pattern which is being repeated in a regular fashion. The file is like the following:
structure Number
4.176801 -0.044096 2.253823
2.994556 0.097622 2.356678
5.060174 -0.115257 3.342200
structure Number
4.180919 -0.044664 2.251182
3.002927 0.098946 2.359346
5.037811 -0.103410 3.389953
Here, "Structure number" is being repeated. I want to write a number like "structure number:1", "structure number 2" in increasing order.
How can I solve this problem?
Thanks for your help in advance.
I am not familiar at all with a program called gaussian, so I have no clue what the original input looked like. If someone posts an example I might be able to give an even shorter solution.
However, as far as I understand it, the OP is content with the output of his/her code, except that he/she wants to append an increasing number to the lines inserted with awk.
This can be achieved with the following line (adjusting the OP's code):
grep -A 7 "Input orientation:" test.log | grep -A 5 "C" | awk '/C/{print "structure number"++i}1' | sed '/--/d' > test.out
Addendum:
Even without knowing the actual input, I am sure that one can at least get rid of the sed command leaving that piece of work to awk. Also, there is no need to quote a single character grep pattern:
grep -A 7 "Input orientation:" test.log | grep -A 5 C | awk '/C/{print "structure number"++i}!/--/' > test.out
I am not sure since I cannot test, but it should be possible to let awk do the grep's work, too. As a first guess I would try the following:
awk '/Input orientation:/{li=7}!li{next}{--li}/C/{print "structure number"++i;lc=5}!lc{next}{--lc}!/--/' test.log > test.out
While this might be a little bit longer in code it is an awk-only solution doing all the work in one process. If I had input to test with, I might come up with a shorter solution.

Perl script to compare two files but print in order

I have followed this question (perl compare two files and print the matching lines) and found the lines which match or don't match between two files using a hash.
But I find that the hash rearranges the lines, and I want the lines in their original order. I can write multiple for loops to get the results in order, but that is not as efficient as a hash. Has anyone faced this issue before, and could you please share your solution?
Maybe I don't fully understand the question, but isn't
fgrep -xf file2 file1
enough? Or:
fgrep -xf file1 file2
Yes, it is not Perl, but it is short, simple and fast...
This can be done efficiently in two steps. Let's assume you have been able to find the "lines that match" but they are in the wrong order; then a simple grep can re-organize them. Assuming you have a script matchThem that takes two inputs (file1 and file2) and writes the matching lines to tempFile, the overall script will be:
matchThem file1 file2 > tempFile
grep -Fx -f tempFile file1
The -Fx flags mean:
-F : treat the patterns as fixed strings rather than regular expressions (faster, and nothing is interpreted as a wildcard)
-x : only match whole lines
If you want a hash which keeps the insertion order, then try out the CPAN module Tie::IxHash.

Bash alias for ls that prints multiple columns by "type"

I'm listing just the file basenames with an ls command like this, which I got from here:
ls --color -1 . | tr '\n' '\0' | xargs -0 -n 1 basename
I would like to list all the directories in the first column, all the executables in the next, all the regular files last (perhaps also with a column for each extension).
So the first (and main) "challenge" is to print multiple columns of different lengths.
Do you have any suggestions what commands I should be using to write that script? Should I switch to find? Or should I just write the script all in Perl?
I want to be able to optionally sort the columns by size too ;-) I'm not necessarily looking for a script to do the above, but perhaps some advice on ways to approach writing such a script.
#!/bin/bash
width=20
awk -F':' '
/directory/ {
    d[i++]=$1
    next
}
/executable/ {
    e[j++]=$1
    next
}
{
    f[k++]=$1
}
END{
    a[1]=i; a[2]=j; a[3]=k
    asort(a)
    printf("%-*.*s | \t%-*.*s | \t%-*.*s\n", w,w,"Directories", w,w,"Executables", w,w,"Files")
    print "------------------------------------------------------------------------"
    for (i=0; i<a[3]; i++)
        printf("%-*.*s |\t%-*.*s |\t%-*.*s\n", w,w,d[i], w,w,e[i], w,w,f[i])
}' w=$width < <(find . -exec file {} +)
Sample output HERE
This can be further improved upon by calculating the longest entry per column and using that as the width. I'll leave that as an exercise for the reader.