intersection with multiple files - diff

I have 4 text files in which each file has a single column of data (~2000 lines in each file). What I am trying to do is compare all of the files and determine the overlap between them. So I would want to know what is in file1 but not the other 3 files, what is in file2 but not in the other 3, what is in file1 and file2 only, etc. The ultimate goal is to make a Venn diagram with 4 overlapping circles showing the various overlaps between the files.
I have been racking my brain trying to figure out how to do this. I have been playing with the comm and diff commands but am having trouble applying them to all of the files at once. Would anyone have any suggestions on how to do this?
Thanks for any help or suggestions.

Assuming 4 files named a, b, c, d.
Lines existing in file a but not in any of the others (I assume ^ is a character not used in any of the files):
for l in $(sort -u a); do echo "$l"^$(grep -c "$l" b c d); done | grep 'b:0 c:0 d:0$' | cut -d\^ -f1
Lines existing in all of them:
for l in $(sort -u a); do echo "$l"^$(grep -c "$l" b c d); done | grep 'b:[1-9][0-9]* c:[1-9][0-9]* d:[1-9][0-9]*$' | cut -d\^ -f1
...
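If sorting the files is acceptable, comm gives another way to carve out the pieces of the Venn diagram. Here is a sketch, assuming the same four files a, b, c, d (the .sorted file names are just for this example):
# comm needs sorted input, so sort each file once
for f in a b c d; do sort -u "$f" > "$f.sorted"; done
# lines present in both a and b
comm -12 a.sorted b.sorted
# lines in a but not in b
comm -23 a.sorted b.sorted
# lines in a but in none of b, c, d (the "a only" region of the diagram)
comm -23 a.sorted b.sorted | comm -23 - c.sorted | comm -23 - d.sorted
Piping any of these through wc -l gives the counts needed for the diagram.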

Related

Remove duplicates from csv (ie remove original and the duplicate)

Scenario: I have two csv files. One is a trusted address file (trusted.csv); the other is a testing address file (testing.csv) that will contain duplicate addresses from the first file.
The problem: trusted.csv has already been used to print labels. I need to use testing.csv to generate more labels, but I can't have any duplicates. I tried merging the two csv files, but I can't figure out how to remove both the duplicate entry and the offending original entry. Another problem is that I need to ignore case. sort -uf works like it should, but of course that means it leaves the original value.
Since you are talking about sort, I assume a command-line solution is acceptable.
This is quite a heavy solution; I believe there is something better, but for the moment I have no better idea.
You need to keep the lines that don't match certain other lines (or, equivalently, remove those that do match). grep -v does that very well, and with the -i option it ignores case. Since you may have many duplicate lines to remove, -f is your friend: it lets you read many patterns from a file. As with many *nix commands, giving - (a single dash) as the filename makes the command read from standard input rather than from a file on storage. To summarize: grep -i -f - -v ~/tmp/file reads the file ~/tmp/file and the patterns from standard input; it keeps every line that matches none of the patterns, and matching is done regardless of character case.
Now you need to build the pattern list, which is the list of duplicated lines. uniq identifies duplicate adjacent lines; -d makes it print each duplicate once, and -i makes it ignore case. To make duplicate lines adjacent, you can use sort, which with the -f option also ignores case. So sort -f ~/tmp/file | uniq -d -i prints each duplicated line once.
Putting both parts together gives the following command: sort -f ~/tmp/file | uniq -d -i | grep -i -f - -v ~/tmp/file. sort groups identical lines together so that uniq can keep those that are duplicated, which are then used as patterns to select the lines to remove.
Let's take an example. The file below has one letter per line (dup simply identifies lines that are duplicated):
a dup
b
c dup
a dup
d
C dup
e
f
c dup
A dup
The application of our pipe of filters, stage by stage, results in:

sort -f ~/tmp/file
a
a
A
b
c
c
C
d
e
f

sort -f ~/tmp/file | uniq -d -i
a
c

sort -f ~/tmp/file | uniq -d -i | grep -i -f - -v ~/tmp/file
b
d
e
f
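For the two-file scenario in the question, the same idea can be turned around: instead of removing duplicates from a merged file, use trusted.csv as the pattern list and keep only the testing.csv lines that do not appear in it. A sketch, assuming one address per line in both files and no blank lines in trusted.csv (an empty pattern would match everything); the output file name is made up here:
grep -F -x -i -v -f trusted.csv testing.csv > labels_to_print.csv
-F treats the patterns as fixed strings, -x requires whole-line matches, -i ignores case, and -v keeps the lines that match none of the patterns.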

Importing a non-delimited text file into Matlab

I have a question: I have downloaded rainfall data as a text file (*.txt). It includes header text, footer text, and the data, with some blank lines between the data.
When I import the file into Matlab, Matlab cannot determine the columns and rows because there is no delimiter.
I have tried converting the text file to Excel, removing the unwanted columns and rows there (which is easy), and saving the result as another file. But my data runs to about 846,000 records (~24 h x 30 days x 12 months x 20 years), combined from many different files, so converting each one manually like that is not practical.
My adviser told me there is Matlab code that could do this well. Can anyone help me with this problem?
The original: https://drive.google.com/file/d/0By5tEg03EXCpekNaemItMF85ZWs/edit?usp=sharing
If you're on Mac or Linux I suggest converting these data files using the shell into a format Matlab will like rather than trying to make Matlab do it. This works on Windows too, but only if you have a unix-like shell installed such as MinGW, Cygwin or Git Bash.
For example, this converts the raw data section of the file you shared into CSV:
cat "$file" | sed 's: *:,:g' | sed 's:^,::' | grep '^[0-9]' > "$file".csv
You could then loop through all your raw data files and combine them into a single CSV like this:
for file in *.txt; do
cat "$file" | sed 's: *:,:g' | sed 's:^,::' | grep '^[0-9]' >> all.csv
done
If you need to preserve, for example, which year and which weather station each row came from, you could get a little fancier and capture those values at the beginning of each file, turning them into columns on each line. Here's an example that grabs the year and weather station ID and inserts them as columns before each day.
for file in *.txt; do
station="$(grep 'Station -' "$file" | sed 's: *Station - ::' | sed 's: .*::' | uniq)"
year="$(grep 'Water Year' "$file" | awk '{print $4}')"
cat "$file" | sed 's: *:,:g' | grep '^,[0-9]' |\
sed "s/^,/$station,$year,/" >> all.csv
done
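Either way, it is worth a quick sanity check on the combined file afterwards, for example that the row count is in the ballpark of the ~846,000 records mentioned in the question (the second command only applies to the station/year variant, where the year ends up in column 2):
wc -l all.csv
cut -d, -f2 all.csv | sort -u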

How to assign a number to a repeating pattern

I am doing some calculations using Gaussian. From the Gaussian output file, I need to extract the input structure information. The output file contains coordinates for more than 800 structures. What I have done so far is collect all the input coordinates using a combination of grep, awk and sed commands, like so:
grep -A 7 "Input orientation:" test.log | grep -A 5 "C" | awk '/C/{print "structure number"}1' | sed '/--/d' > test.out
This helped me to grep all the input coordinates and insert a line with "structure number" before each block. So now I have a file that contains a pattern repeated in a regular fashion. The file looks like the following:
structure Number
4.176801 -0.044096 2.253823
2.994556 0.097622 2.356678
5.060174 -0.115257 3.342200
structure Number
4.180919 -0.044664 2.251182
3.002927 0.098946 2.359346
5.037811 -0.103410 3.389953
Here, "Structure number" is being repeated. I want to write a number like "structure number:1", "structure number 2" in increasing order.
How can I solve this problem?
Thanks for your help in advance.
I am not familiar at all with the program called Gaussian, so I have no clue what the original input looked like. If someone posts an example, I might be able to give an even shorter solution.
However, as far as I understand it, the OP is content with the output of his/her code, except that he/she wants to append an increasing number to the lines inserted with awk.
This can be achieved with the following line (adjusting the OP's code):
grep -A 7 "Input orientation:" test.log | grep -A 5 "C" | awk '/C/{print "structure number"++i}1' | sed '/--/d' > test.out
Addendum:
Even without knowing the actual input, I am sure that one can at least get rid of the sed command leaving that piece of work to awk. Also, there is no need to quote a single character grep pattern:
grep -A 7 "Input orientation:" test.log | grep -A 5 C | awk '/C/{print "structure number"++i}!/--/' > test.out
I am not sure since I cannot test, but it should be possible to let awk do the grep's work, too. As a first guess I would try the following:
awk '/Input orientation:/{li=7}!li{next}{--li}/C/{print "structure number"++i;lc=5}!lc{next}{--lc}!/--/' test.log > test.out
While this might be a little bit longer in code it is an awk-only solution doing all the work in one process. If I had input to test with, I might come up with a shorter solution.
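If the intermediate file shown in the question already exists, the numbering can also be added in a separate pass over it; a sketch, assuming that file is called test.out:
awk '/^structure Number/ { $0 = $0 " " ++n } 1' test.out
This appends an increasing counter to every "structure Number" line and prints all other lines unchanged.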

*nix: perform set union/intersection/difference of lists

I sometimes need to compare two text files. Obviously, diff shows the differences; it also hides the similarities, which is kind of the point.
Suppose I want to do other comparisons on these files: set union, intersection, and subtraction, treating each line as an element in the set.
Are there similarly simple common utilities or one-liners which can do this?
Examples:
a.txt
john
mary
b.txt
adam
john
$> set_union a.txt b.txt
john
mary
adam
$> set_intersection a.txt b.txt
john
$> set_difference a.txt b.txt
mary
Union: sort -u files...
Intersection: sort files... | uniq -d
Overall difference (elements which are just in one of the files):
sort files... | uniq -u
Mathematical difference (elements only once in one of the files):
sort files... | uniq -u | sort - <(sort -u fileX ) | uniq -d
The first two commands get me all unique elements. Then we merge this with the file we're interested in. Command breakdown for sort - <(sort -u fileX ):
The - will process stdin (i.e. the list of all unique elements).
<(...) runs a command, writes the output in a temporary file and passes the path to the file to the command.
So this gives us a mix of all unique elements plus all unique elements in fileX. The duplicates are then the unique elements which are only in fileX.
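With the a.txt and b.txt from the question, the first two one-liners behave as expected (shown here purely as a sanity check):
$ sort -u a.txt b.txt
adam
john
mary
$ sort a.txt b.txt | uniq -d
john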
If you want to get the common lines between two files, you can use the comm utility.
A.txt :
A
B
C
B.txt
A
B
D
and then, using comm will give you (comm separates the three columns with tabs):
$ comm <(sort A.txt) <(sort B.txt)
                A
                B
C
        D
In the first column, you have what is in the first file and not in the second.
In the second column, you have what is in the second file and not in the first.
In the third column, you have what is in both files.
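comm can also suppress columns with -1, -2 and -3, which yields the question's set operations directly (again using a.txt and b.txt from the question):
comm -12 <(sort a.txt) <(sort b.txt)    # intersection: john
comm -23 <(sort a.txt) <(sort b.txt)    # only in a.txt: mary
comm -13 <(sort a.txt) <(sort b.txt)    # only in b.txt: adam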
If you don't mind using a bit of Perl, and if your files are small enough that their lines fit into hashes, you could read the two files into two hashes (say %from_1 and %from_2, keyed by line) and then do:
#...get common keys in an array...
my @both_things;
for (keys %from_1) {
    push @both_things, $_ if exists $from_2{$_};
}
#...put unique things in an array...
my @once_only;
for (keys %from_1) {
    push @once_only, $_ unless exists $from_2{$_};
}
I can't comment on Aaron Digulla's answer, which despite being accepted does not actually compute the set difference.
The set difference A\B with the given inputs should only return mary, but the accepted answer also incorrectly returns adam.
This answer has an awk one-liner that correctly computes the set difference:
awk 'FNR==NR {a[$0]++; next} !a[$0]' b.txt a.txt
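Applied to the question's files, it keeps exactly the line that is in a.txt but not in b.txt:
$ awk 'FNR==NR {a[$0]++; next} !a[$0]' b.txt a.txt
mary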

Bash alias for ls that prints multiple columns by "type"

I'm listing just the file basenames with an ls command like this, which I got from here:
ls --color -1 . | tr '\n' '\0' | xargs -0 -n 1 basename
I would like to list all the directories in the first column, all the executables in the next, all the regular files last (perhaps also with a column for each extension).
So the first (and main) "challenge" is to print multiple columns of different lengths.
Do you have any suggestions what commands I should be using to write that script? Should I switch to find? Or should I just write the script all in Perl?
I want to be able to optionally sort the columns by size too ;-) I'm not necessarily looking for a script to do the above, but perhaps some advice on ways to approach writing such a script.
#!/bin/bash
width=20
awk -F':' '
# file(1) prints "path: description"; lines mentioning "directory" go to d[],
# lines mentioning "executable" to e[], everything else to f[].
/directory/{
    d[i++]=$1
    next
}
/executable/{
    e[j++]=$1
    next
}
{
    f[k++]=$1
}
END{
    # Find the longest of the three lists so the loop prints every row.
    a[1]=i;a[2]=j;a[3]=k
    asort(a)            # gawk-specific; a[3] is now the largest count
    printf("%-*.*s | \t%-*.*s | \t%-*.*s\n", w,w,"Directories", w,w,"Executables", w,w,"Files")
    print "------------------------------------------------------------------------"
    for (i=0;i<a[3];i++)
        printf("%-*.*s |\t%-*.*s |\t%-*.*s\n", w,w,d[i], w,w,e[i], w,w,f[i])
}' w=$width < <(find . -exec file {} +)
Sample output HERE
This can be further improved upon by calculating the longest entry per column and using that as the width. I'll leave that as an exercise for the reader.
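If you do want dynamic widths, one possible starting point (a sketch, not a drop-in replacement for the script above) is a first awk pass that records the longest path seen in each bucket and prints the three lengths, which could then be fed back in as the per-column widths:
find . -exec file {} + | awk -F':' '
/directory/  { if (length($1) > wd) wd = length($1); next }
/executable/ { if (length($1) > we) we = length($1); next }
             { if (length($1) > wf) wf = length($1) }
END { print wd, we, wf }'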