Need perl/shell script to compare 2 files - perl

Hi, I have two files as below, and I need a script to compare them and find the matches. How can I achieve this?
file1 as a.txt :
Anirban
Ball
Cat
Dog
cow
file2 as b.txt :
I am Anirban
I am Ball
I am Cat_cat
I am Dog
I am cow
I am horse
I want output like this :
I am Anirban
I am Ball
I am Dog
I am cow
I tried grep -f b a, but it did not give exact matches.

This can be one way:
$ grep -wf a.txt b.txt
I am Anirban
I am Ball
I am Dog
I am cow
In your attempt you were not using grep -w, which is what you need here. Also note that you passed the files in the opposite order.
-f tells grep to read its patterns from a file.
-w matches only whole words.

Using awk
awk 'NR==FNR{a[$1];next} $NF in a' a.txt b.txt
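For reference, here is that two-file idiom run on the question's sample data: NR==FNR is true only while the first file is read, so its names become keys of the array a; a line of b.txt is then printed when its last field ($NF) is one of those keys, which is why "I am Cat_cat" is correctly skipped.

```shell
# Recreate the sample files from the question
printf '%s\n' Anirban Ball Cat Dog cow > a.txt
printf 'I am %s\n' Anirban Ball Cat_cat Dog cow horse > b.txt

# While reading a.txt (NR==FNR), store each name as an array key;
# for b.txt, print lines whose last field is a stored key.
awk 'NR==FNR{a[$1];next} $NF in a' a.txt b.txt
# I am Anirban
# I am Ball
# I am Dog
# I am cow
```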

Related

Improving sed program - conditions

I use this code according to this question.
$ names=(file1.txt file2.txt file3.txt) # Declare array
$ printf 's/%s/a-&/g\n' "${names[@]%.txt}" # Generate sed replacement script
s/file1/a-&/g
s/file2/a-&/g
s/file3/a-&/g
$ sed -f <(printf 's/%s/a-&/g\n' "${names[@]%.txt}") f.txt
TEXT
\connect{a-file1}
\begin{a-file2}
\connect{a-file3}
TEXT
75
How to make conditions that solve the following problem please?
names=(file1.txt file2.txt file3file2.txt)
I mean that there is a word in one file name that is repeated as part of another file name; then a- gets added multiple times.
I tried
sed -f <(printf 's/{%s}/{s-&}/g\n' "${files[@]%.tex}")
but the result is
\input{a-{file1}}
I need to find {%s} and place a- between { and %s.
It's not clear from the question how to resolve conflicting input. In particular, the code will replace any instance of file1 with a-file1, even inside things like 'foofile1'.
On the surface, the goal seems to be to change whole tokens only (e.g., foofile1 should not be affected by the file1 substitution). This can be achieved by adding word-boundary assertions (\b) before and after the filename, which prevents the pattern from matching inside other, longer file names.
printf 's/\\b%s\\b/a-&/g\n' "${names[@]%.txt}"
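As a sanity check with the repeated-substring array from the question, the generated script now carries the boundary assertions (each \\b in the printf format collapses to a literal \b in the output):

```shell
names=(file1.txt file2.txt file3file2.txt)
# Each \\b in the format string becomes a literal \b in the generated sed script:
printf 's/\\b%s\\b/a-&/g\n' "${names[@]%.txt}"
# s/\bfile1\b/a-&/g
# s/\bfile2\b/a-&/g
# s/\bfile3file2\b/a-&/g
```

Because \bfile2\b cannot match inside file3file2 (the preceding 3 is a word character, so there is no boundary there), a- is added only once per name.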
Since this explanation is too long for a comment, I am adding an answer here. I am not sure whether my previous answer was clear, but it handles this case and will only replace exact file names, NOT mixes of file names.
Let's say the following are the array value and Input_file:
names=(file1.txt file2.txt file3file2.txt)
echo "${names[*]}"
file1.txt file2.txt file3file2.txt
cat file1
TEXT
\connect{file1}
\begin{file2}
\connect{file3}
TEXT
75
Now when we run following code:
awk -v arr="${names[*]}" '
BEGIN{
  FS=OFS="{"
  num=split(arr,array," ")
  for(i=1;i<=num;i++){
    sub(/\.txt/,"",array[i])
    array1[array[i]"}"]
  }
}
$2 in array1{
  $2="a-"$2
}
1
' file1
Output will be as follows. You can see file3 is NOT replaced since it was NOT present in the array value.
TEXT
\connect{a-file1}
\begin{a-file2}
\connect{file3}
TEXT
75

Replace first occurrence of a pattern if not preceded with another pattern

Using GNU sed, I am trying to replace the first occurrence of a pattern in a file, but I don't want to replace it if another pattern appears before the match.
For example, if the file contains a line with "bird [number]", I want to replace the number with "0", but only if the word "cat" does not appear anywhere before that pattern.
Example text
dog cat - fish bird 123
dog fish - bird 1234567
dog - cat fish, lion bird 3456
Expected result:
dog cat - fish bird 123
dog fish - bird 0
dog - cat fish, lion bird 3456
I tried to combine How to use sed to replace only the first occurrence in a file? and Sed regex and substring negation solutions and came up with something like
sed -E '0,/cat.*bird +[0-9]+/b;/(bird +)[0-9]+/ s//\10/'
where 0,/cat.*bird +[0-9]+/b;/(bird +)[0-9]+/ should match the first occurrence of (bird +)[0-9]+ only if the cat.*bird +[0-9]+ pattern does not match, but I get
dog cat - fish bird 123
dog fish - bird 0
dog - cat fish, lion bird 0
The third line is also changed. How can I prevent that? I think it is related to address ranges, but I do not understand how to negate the second part of the address range.
This might work for you (GNU sed):
sed '/\<cat\>.*\<bird\>/b;s/\<\(bird\) \+[0-9]\+/\1 0/;T;:a;n;ba' file
If a line contains the word cat before the word bird, end processing for that line.
Otherwise, try to substitute the number following the word bird with zero. If that fails, end processing for that line; if it succeeds, read/print all following lines until the end of the file.
Might also be written:
sed -E '/cat.*bird/b;/(bird +)[0-9]+/{s//\10/;:a;n;ba}' file
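To check the first version against the example text, here is a sketch assuming the sample is saved as birds.txt (a hypothetical name) and that GNU sed is available (\<, \> and T are GNU extensions):

```shell
cat > birds.txt <<'EOF'
dog cat - fish bird 123
dog fish - bird 1234567
dog - cat fish, lion bird 3456
EOF
# Lines with cat before bird are skipped; after the first successful
# substitution, the :a;n;ba loop just passes the rest of the file through.
sed '/\<cat\>.*\<bird\>/b;s/\<\(bird\) \+[0-9]\+/\1 0/;T;:a;n;ba' birds.txt
# dog cat - fish bird 123
# dog fish - bird 0
# dog - cat fish, lion bird 3456
```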
sed is for doing simple s/old/new replacements, that is all. For anything else just use awk, e.g. with GNU awk instead of the GNU sed you were using:
$ awk 'match($0,/(.*bird\s+)[0-9]+(.*)/,a) && (a[1] !~ /cat/) {$0=a[1] 0 a[2]} 1' file
dog cat - fish bird 123
dog fish - bird 0
dog - cat fish, lion bird 3456

Terminal command to find unique pairs where order does not matter

I have a Python script my_script.py which generates a list of tab-separated pairings between two elements, one for each line:
$ python my_script.py
cat dog
dog wolf
cat dog
pig chicken
dog cat
I am looking to pipe the output of this script into a terminal command of some sort to filter out duplicate combinations, not just duplicate permutations. For duplicate permutations, I can use something like:
$ python my_script.py | sort | uniq
cat dog
dog cat
dog wolf
pig chicken
to remove the duplicate "cat dog".
The problem with this approach is that I am left with both "cat dog" and "dog cat", which for my purposes should be treated as the same (same combination). I know I could write another very simple Python script to perform the kind of filtering I am after, but I wanted to see whether there is an even simpler terminal command that will do the equivalent.
Here's one way using awk:
... | awk -F "\t" '!a[$1,$2]++ && !a[$2,$1]++'
Results:
cat dog
dog wolf
pig chicken
Explanation:
-F "\t" # sets the field (column) separator to a single tab character
!a[$1,$2]++ # adds column one and column two to a pseudo-multidimensional
# array if they haven't already been added to the array
!a[$2,$1]++ # does the same thing, but adds the columns in the opposite
# orientation.
Putting it all together:
So for every line of input, the line will be printed if and only if the first two fields (in either orientation) don't exist in the array. You can read more about how to emulate a multi-dimensional array here.
Caution: the script above doesn't produce any output for cases where $1==$2. You can test this via:
echo "dog dog" | awk '!a[$1,$2]++ && !a[$2,$1]++'|wc -l
Try this instead:
|awk '{if($1<$2)print $1,$2; else print $2,$1}'|sort|uniq
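A variant of the same idea in a single awk invocation (a sketch, not the answer's exact pipeline): normalize each pair into sorted order, then keep only the first occurrence. Unlike the sort|uniq pipeline, this preserves the first-seen order of pairs:

```shell
printf 'cat\tdog\ndog\twolf\ncat\tdog\npig\tchicken\ndog\tcat\n' |
awk -F'\t' -v OFS='\t' '
  { if ($1 > $2) { t = $1; $1 = $2; $2 = t } }  # put each pair in sorted order
  !seen[$0]++                                   # print a canonical pair only once
'
# cat     dog
# dog     wolf
# chicken pig
```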

Extract the part enclosed by a predefined multiline character sequence

Hope the AWK gurus can provide a solution to my problem.
I have a file that goes like this:
cat cat cat cat cat cat dog rat ate dog tit
dog cat dog dog dog rat dog pat ate cat dog
I have to use AWK to extract the pattern between the first occurring c and a d. Starting from the first c, a count should be kept of the number of c's and d's, such that when the count matches, the part between the first c and the matching d should be output to a file, including the number of the line on which the match for d occurred.
In this particular example the match occurs on the seventh dog, therefore the output will have to be:
cat cat cat cat cat cat dog rat ate dog tit
dog cat dog dog dog rat d
The match can go beyond just two lines! The output may or may not include the c and the d. The text contains all kinds of characters, including special ones!
For the print to occur, the count has to be matched.
Thanks in advance for the replies. Suggestions are always welcome.
EDIT: The capture of the pattern between c and d can be compromised as long as the condition is met and the line number of the exiting d is obtained :)
A few tips, without giving the full solution:
By default, awk considers each line as a record. The default record separator is RS="\n".
Depending on your version of awk, you may be able to set RS, the record separator, to a regex which matches either c or d. Then, for each record, you can check the RT variable, which will contain either c or d, depending on what has actually been matched. Starting from there, using a variable incremented on c, decremented on d you will be able to find the end of the match when it reaches 0.
You can then use a variable that contains your match so far, and keep concatenating RT and the new record to it, until you're done.
If you need to know the line number of the end of the match, you can set RS to a regex that either matches c, d, as previously, but also add the possibility to match \n. And by maintaining another counter variable incremented every time RT tells you that \n has been matched, you'll have your line number.
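Putting those tips together, here is a sketch using GNU awk (RS as a regex and the RT variable are gawk extensions; the file name animals.txt and the output wording are my own choices):

```shell
cat > animals.txt <<'EOF'
cat cat cat cat cat cat dog rat ate dog tit
dog cat dog dog dog rat dog pat ate cat dog
EOF

gawk 'BEGIN { RS = "[cd\n]" }            # split records on c, d, or newline
{
  if (RT == "\n") line++                 # count newlines to track the line number
  if (!started && RT == "c") {           # the match begins at the first c
    started = 1; depth = 1; buf = "c"; next
  }
  if (started) {
    buf = buf $0 RT                      # accumulate text plus the matched separator
    if (RT == "c") depth++
    else if (RT == "d") depth--
    if (depth == 0) {                    # counts balance: the matching d was found
      printf "%s\n", buf
      print "match ends on line " line + 1
      exit
    }
  }
}' animals.txt
# cat cat cat cat cat cat dog rat ate dog tit
# dog cat dog dog dog rat d
# match ends on line 2
```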
Here's a sed solution just for fun:
sed -rne ':r;$!{N;br};s/^[^c]*(.*d)[^d]*$/\1/;:a;h;s/[^cd]//g;' \
-e ':s;s/d(.*)c/c\1d/;ts;s/cd/c\nd/;T;y/c/d/;/^(d+)\n\1$/{g;i -------' \
-e 'p};g;s/d[^d]*d$/d/;ta'
This prints all satisfying sequences from longest to shortest.

*nix: perform set union/intersection/difference of lists

I sometimes need to compare two text files. Obviously diff shows the differences, but it also hides the similarities, which is kind of the point.
Suppose I want to do other comparisons on these files: set union, intersection, and subtraction, treating each line as an element in the set.
Are there similarly simple common utilities or one-liners which can do this?
Examples:
a.txt
john
mary
b.txt
adam
john
$> set_union a.txt b.txt
john
mary
adam
$> set_intersection a.txt b.txt
john
$> set_difference a.txt b.txt
mary
Union: sort -u files...
Intersection: sort files... | uniq -d
Overall difference (elements which are just in one of the files):
sort files... | uniq -u
Mathematical difference (elements only once in one of the files):
sort files... | uniq -u | sort - <(sort -u fileX ) | uniq -d
The first two commands get me all unique elements. Then we merge this with the file we're interested in. Command breakdown for sort - <(sort -u fileX ):
The - will process stdin (i.e. the list of all unique elements).
<(...) runs a command and passes a path (a named pipe or /dev/fd entry) from which its output can be read.
So this gives us a mix of all unique elements plus all unique elements from fileX. The duplicates are then the unique elements which appear only in fileX.
If you want to get the common lines between two files, you can use the comm utility.
A.txt :
A
B
C
B.txt
A
B
D
and then, using comm will give you :
$ comm <(sort A.txt) <(sort B.txt)
A
B
C
D
In the first column, you have what is in the first file and not in the second.
In the second column, you have what is in the second file and not in the first.
In the third column, you have what is in the both files.
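comm can also produce each set operation directly via its column-suppression flags; here is a sketch recreating A.txt and B.txt from above:

```shell
printf '%s\n' A B C > A.txt
printf '%s\n' A B D > B.txt
# comm -NM suppresses columns N and M, leaving only the one you want:
comm -12 <(sort A.txt) <(sort B.txt)   # intersection: lines in both (A, B)
comm -23 <(sort A.txt) <(sort B.txt)   # difference A\B: only in A.txt (C)
comm -13 <(sort A.txt) <(sort B.txt)   # difference B\A: only in B.txt (D)
```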
If you don't mind using a bit of Perl, and if your files are small enough that their lines can be held in hashes, you could read the two files into two hashes and do:
# ...get common keys in an array...
my @both_things;
for (keys %from_1) {
    push @both_things, $_ if exists $from_2{$_};
}
# ...put unique things in an array...
my @once_only;
for (keys %from_1) {
    push @once_only, $_ unless exists $from_2{$_};
}
I can't comment on Aaron Digulla's answer, which, despite being accepted, does not actually compute the set difference.
The set difference A\B with the given inputs should only return mary, but the accepted answer also incorrectly returns adam.
This answer has an awk one-liner that correctly computes the set difference:
awk 'FNR==NR {a[$0]++; next} !a[$0]' b.txt a.txt
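A quick check, recreating a.txt and b.txt from the question: the first file given to awk (b.txt) fills the array, and lines of a.txt that are absent from it are printed.

```shell
printf '%s\n' john mary > a.txt
printf '%s\n' adam john > b.txt
# While reading b.txt (FNR==NR), record each line; then print
# the lines of a.txt that were never recorded: A\B.
awk 'FNR==NR {a[$0]++; next} !a[$0]' b.txt a.txt
# mary
```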