BASH: comm (or similar) to compare multiple files - diff

I have the following problem: I would like to compare the content of 8 files, each containing a list like this
Sample1.txt Sample2.txt Sample3.txt
apple pineapple apple
pineapple apple pineapple
bananas bananas bananas
orange orange mango
grape nuts nuts
Using comm Sample1.txt Sample2.txt I can get something like this
grape     nuts      apple
                    pineapple
                    bananas
                    orange
meaning that the first column contains items found only in the first sample, the second column items found only in the second sample, and the third column the items common to both.
I would like to do the same but with 8 files (samples). With diff it is not possible, but in the end I would like to have
Sample1 Sample2 Sample3 ...Sample8 Things in common
grape nuts mango apple
pineapple
bananas
Is there a chance to do it with bash? Is there a command like diff that allows searching for differences across more than two files?
Thank you to everybody...I know this is a challenging question
Fabio

Here is my naive solution:
first=sample1.txt; for a in *.txt; do comm -12 "$first" "$a" > "temp_$a"; echo "comparing $first and $a, writing to temp_$a"; first="temp_$a"; cat "temp_$a"; done
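Note that comm only works on sorted input. If all you really need is the last column (the items common to all eight files), counting is another option; a sketch, assuming the files are named Sample1.txt ... Sample8.txt and each item is a single word:
# de-duplicate each file, then count in how many files every item occurs;
# an item counted 8 times is present in all eight samples
for f in Sample*.txt; do sort -u "$f"; done | sort | uniq -c | awk '$1 == 8 { print $2 }'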


I have a Tab separated file in Unix which has a data issue

I have to make sure each line has 4 columns, but the input data is quite a mess:
The first line is the header.
The second line is valid as it has 4 columns.
The third is also valid (it's OK if the Description field is null).
The ID field and, "god bless me", the last column Phnumber are not-null fields.
As one can see, the 4th line is messed up: because of a newline character in the Description column, it spans multiple lines.
ID Name Description Phnumber
1051 John 5674 I am doing good, is this task we need to fix 908342
1065 Rohit 9876246
10402 rob I am
doing good,
is this task we need to fix 908341
105552 "Julin rob hain" i know what to do just let me do it
"
"
"
"
"
"
908452 1051 Dave I am doing reporting this week 88889999
Maybe a screenshot will make it easier to see the problem
Each line starts with a number and ends with a number. Each line should have 4 columns.
desired output
ID Name Description Phnumber
1051 John 5674 I am doing good, is this task we need to fix 908342
1065 Rohit 9876246
10402 rob I am doing good, 563 is this task we need to fix 908341
105552 "Julin rob hain" i know what to do just let me do it 908452
1051 Dave I am doing reporting this week 88889999
The data shown is sample data; the actual file has 12 columns. Yes, in-between columns can contain numbers, and a few are date fields (like 2017-03-02).
This did the trick
cat file_name | perl -0pe 's/\n(?!([0-9]{6}|$)\t)//g' | perl -0pe 's/\r(?!([0-9]{6}|$)\t)//g' | sed '/^$/d'
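Broken out with comments, that pipeline reads roughly like this (a sketch; the six-digit assumption for the leading/trailing number columns comes from the regex and may need adjusting for the real 12-column file):
# remove line breaks that are NOT followed by the start of a new record
# (a six-digit number and a tab), i.e. breaks inside the Description column
perl -0pe 's/\n(?!([0-9]{6}|$)\t)//g' file_name |
  perl -0pe 's/\r(?!([0-9]{6}|$)\t)//g' |   # same treatment for stray carriage returns
  sed '/^$/d'                               # drop any lines left empty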
awk to the rescue!
This assumes that all-digit fields appear only as the first and last fields:
awk 'NR==1;
     NR>1 {for(i=1;i<=NF;i++)
            {if($i~/^[0-9]+$/) s=!s; printf "%s", $i (s?OFS:RS)}}' file
ID Name Description Phnumber
1051 John I am doing good, is this task we need to fix 908342
10423 rob I am doing good, is this task we need to fix 908341
1052 Julin rob hain i know what to do just let me do it " " " " " " 908452
1051 Dave I am doing reporting this week 88889999
Perhaps set OFS to \t to get more structure.
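For example, a sketch of the same program with OFS set to a tab (and the all-digit test anchored, as assumed above):
awk 'BEGIN { OFS = "\t" }
     NR == 1 { $1 = $1; print; next }       # retab the header line
     { for (i = 1; i <= NF; i++) {
         if ($i ~ /^[0-9]+$/) s = !s        # toggle on the all-digit first/last fields
         printf "%s", $i (s ? OFS : RS)     # tab inside a record, newline after it
       } }' file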

use perl to extract specific output lines

I'm endeavoring to create a system to generalize rules from input text. I'm using reVerb to create my initial set of rules. Using the following command[*], for instance:
$ echo "Bananas are an excellent source of potassium." | ./reverb -q | tr '\t' '\n' | cat -n
To generate output of the form:
1 stdin
2 1
3 Bananas
4 are an excellent source of
5 potassium
6 0
7 1
8 1
9 6
10 6
11 7
12 0.9999999997341693
13 Bananas are an excellent source of potassium .
14 NNS VBP DT JJ NN IN NN .
15 B-NP B-VP B-NP I-NP I-NP I-NP I-NP O
16 bananas
17 be source of
18 potassium
I'm currently piping the output to a file, which includes the preceding white space and numbers as depicted above.
What I'm really after is just the simple rule at the end, i.e. lines 16, 17 & 18. I've been trying to create a script to extract just that component and put it into a new file in the form of a Prolog clause, i.e. be source of(bananas, potassium).
Is that feasible? Can Prolog rules contain white space like that?
I think I'm locked into getting all that output from reVerb so, what would be the best way to extract the desirable component? With a Perl script? Or maybe sed?
*Later I plan to replace this with a larger input file as opposed to just single sentences.
This seems wasteful. Why not leave the tabs as they are, and use:
$ echo "Bananas are an excellent source of potassium." \
| ./reverb -q | cut --fields=16,17,18
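Given the numbered listing above, that should print just the three tab-separated fields, roughly:
bananas    be source of    potassium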
And yes, you can have rules like this in Prolog. See the answer by @mat. You need to know a bit of Prolog before you move on, I guess.
It is easier, however, to just make the string a valid name for a predicate:
be_source_of with underscores instead of spaces
or 'be source of' with spaces, and enclosed in single quotes.
You can probably use awk to do what you want with the three fields. See for example the printf command in awk. Or, you can parse it again from Prolog directly. Both are beyond the scope of your current question, I feel.
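For instance, something along these lines (a sketch: it assumes one tab-separated extraction per line, as reverb emits before the tr step, turns the spaces in the relation into underscores as suggested above, and appends to an assumed file rules.pl):
echo "Bananas are an excellent source of potassium." \
  | ./reverb -q \
  | awk -F '\t' '{
      rel = $17                    # the relation phrase, e.g. "be source of"
      gsub(/ /, "_", rel)          # underscores make it a plain Prolog atom
      printf "%s(%s, %s).\n", rel, $16, $18
    }' >> rules.pl
Arguments that contain spaces or start with an uppercase letter would still need quoting on the Prolog side.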
sed -n 'N;N
:cycle
$!{N
D
b cycle
}
s/\(.*\)\n\(.*\)\n\(.*\)/\2 (\1,\3)/p' YourFile
If the numbers are in the output and not just for reference, change the last sed substitution to
s/^ *[0-9]\{1,\} \{1,\}\(.*\)\n *[0-9]\{1,\} \{1,\}\(.*\)\n *[0-9]\{1,\} \{1,\}\(.*\)/\2 (\1,\3)/p
assuming the last 3 lines are the source of your "rules"
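With the three un-numbered lines as input, that first version prints something like:
be source of (bananas,potassium)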
Regarding the Prolog part of the question:
Yes, Prolog facts can contain whitespace like this, with suitable operator declarations present.
For example:
:- op(700, fx, be).
:- op(650, fx, source).
:- op(600, fx, of).
Example query and its result, to let you see the shape of terms that are created with this syntax:
?- write_canonical(be source of(a, b)).
be(source(of(a,b))).
Therefore, with these operator declarations, a fact like:
be source of(a, b).
is exactly the same as stating:
be(source(of(a,b))).
Depending on use cases and other definitions, it may even be an advantage to create facts of this kind (i.e., facts of the form be/1 instead of source_of/2). If this is the only kind of fact you need, you can simply write:
source_of(a, b).
This creates no redundant wrappers and is easier to use.
Or, as Boris suggested, you can use single quotes as in 'be source of'/2.

Perl - finding multiple substrings in a variable?

I'm very new to Perl, and I've been using this site for a few years looking for answers to questions I've had regarding other projects I've been working on, but I can't seem to find the answer to this one.
My script receives an HTTP response with data that I need to strip down to parse the correct information.
So far, I have this for the substring:
my $output = substr ($content, index($content, 'Down'), index($content, '>'));
This seems to do what it's supposed to: find the word 'Down' and then take the substring up until it finds a >.
However, the string 'Down' can appear many times within the response and this stops looking after it finds the first example of this.
How would I go about getting the substring to be recursive?
Thanks for your help :)
One way is like this:
my $x="aa Down aaa > bb Down bbb > cc Down ccc >";
while ($x =~/(Down[^>]+>)/g){
print $1;
}
Another solution, without explicit iteration, just storing what follows Down:
use Data::Dumper;
my $x="aa Down aaa > bb Down bbb > cc Down ccc >";
my @downs = $x =~ m!Down([^>]+)>!gis;
print Dumper(\@downs);
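If the content sits in a file rather than a variable, the same global match works as a one-liner (a sketch; response.html is an assumed file name, and a Down...> pair is assumed not to span lines):
perl -ne 'print "$1\n" while /Down([^>]+)>/g' response.html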

Sphinx Search and rank based on word position

Is it possible, using Sphinx Search, to have the weight of a result determined by the position of the words in a list?
For example, if you have rows with a column containing the following text:
Row #1: "dog, bird, horse, cat"
Row #2: "dog, bird, cat"
and then perform an OR search using "dog | cat", I would like row #2 to rank higher than #1 because both "dog" and "cat" were found and #2 has the two words closer together than #1.
Hope this makes sense.
Thanks
Michael
You can do this by using field level ranking. Use "SPH_RANK_EXPR" as your ranker and look at the field level factor "min_hit_pos" to tell which word matched first.
All the information can be found at http://sphinxsearch.com/docs/manual-2.0.4.html#weighting
If you look closely at the SPH_RANK_SPH04 ranking algorithm below, it includes min_hit_pos, but only gives credit to rows where the matched word is the first word.
sum((4*lcs+2*(min_hit_pos==1)+exact_hit)*user_weight)*1000+bm25
What you can do is use the same algorithm but change "2*(min_hit_pos==1)" to be something like this:-
(101-IF(min_hit_pos<100,min_hit_pos,100))
A row will get an extra 100 weight if matched on the first word, 99 if matched on the second word and so on until the 100th word, after which no more weight is given.
You can play around with the values and include a multiplier to see if the results are any better.
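Putting it together, a sketch of the modified expression as a SphinxQL query run through the MySQL client (the index name, port and selected columns are assumptions):
mysql -h 127.0.0.1 -P 9306 -e "
  SELECT id, WEIGHT() AS w
  FROM my_index
  WHERE MATCH('dog | cat')
  ORDER BY w DESC
  OPTION ranker = expr('sum((4*lcs + (101 - IF(min_hit_pos < 100, min_hit_pos, 100)) + exact_hit) * user_weight) * 1000 + bm25');"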
Hope that helps. Let me know if you have any questions.
Have you tried SPH_RANK_PROXIMITY ranking mode?
Otherwise, you could be more explicit and do a query like this with SPH_RANK_WORDCOUNT:
"dog cat"/1 | "dog cat"~10 | "dog cat"~8 | "dog cat"~6 | "dog cat"~4 | "dog cat"~3 | "dog cat"~2 | "dog cat"~1
or similar.

print column and count the string

I have a large, column-wise text file, space delimited
Name subject Result
John maths pass
John science fail
John history pass
John geography pass
Jack maths pass
jack history fail
kelly science pass
kelly history pass
I want to count, for each name (it is a long name list; each name should appear only once), how many subjects they passed. E.g. John passed 3 and similarly Jack passed 1. It should print the result as
Name Passcount
John 3
Jack 1
Kelly 2
Can anybody help with an awk or perl script? Thanks in advance.
You can try something like this -
awk '
BEGIN{ print "Name\tPasscount"}
NR>1{if ($3=="pass") a[$1]++}
END{ for (x in a) print x"\t"a[x]}' file
Test:
$ cat file
Name subject Result
John maths pass
John science fail
John history pass
John geography pass
Jack maths pass
jack history fail
kelly science pass
kelly history pass
$ awk 'BEGIN{ print "Name\tPasscount"} NR>1{if ($3=="pass") a[$1]++}END{ for (x in a) print x"\t"a[x]}' file
Name Passcount
Jack 1
kelly 2
John 3
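Note that for (x in a) returns the names in arbitrary order. With GNU awk you can ask for sorted keys (a sketch); with any other awk, print the header separately and pipe the END output through sort:
gawk 'BEGIN { print "Name\tPasscount" }
      NR > 1 && $3 == "pass" { a[$1]++ }
      END {
        PROCINFO["sorted_in"] = "@ind_str_asc"   # iterate the array keys in sorted order
        for (x in a) print x "\t" a[x]
      }' file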