How can I show lines in common (reverse diff)? - command-line

I have a series of text files for which I'd like to know the lines in common rather than the lines which are different between them. Command line Unix or Windows is fine.
File foo:
linux-vdso.so.1 => (0x00007fffccffe000)
libvlc.so.2 => /usr/lib/libvlc.so.2 (0x00007f0dc4b0b000)
libvlccore.so.0 => /usr/lib/libvlccore.so.0 (0x00007f0dc483f000)
libc.so.6 => /lib/libc.so.6 (0x00007f0dc44cd000)
File bar:
libkdeui.so.5 => /usr/lib/libkdeui.so.5 (0x00007f716ae22000)
libkio.so.5 => /usr/lib/libkio.so.5 (0x00007f716a96d000)
linux-vdso.so.1 => (0x00007fffccffe000)
So, given these two files above, the output of the desired utility would be akin to file1:line_number, file2:line_number == matching text (just a suggestion; I really don't care what the syntax is):
foo:1, bar:3 == linux-vdso.so.1 => (0x00007fffccffe000)

On *nix, you can use comm. The answer to the question is:
comm -1 -2 file1.sorted file2.sorted
# where file1 and file2 are sorted and piped into *.sorted
Here's the full usage of comm:
comm [-1] [-2] [-3] file1 file2
-1 Suppress the output column of lines unique to file1.
-2 Suppress the output column of lines unique to file2.
-3 Suppress the output column of lines duplicated in file1 and file2.
Also note that it is important to sort the files before using comm, as mentioned in the man pages.
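As a minimal sketch of the sort-then-comm steps above: the helper name common_lines is made up for illustration, and mktemp is assumed to be available (it is not in POSIX, but GNU and BSD systems ship it). Sorting and comparing under the same locale (LC_ALL=C here) keeps comm from complaining that its input is not in sorted order.

```shell
# common_lines FILE1 FILE2 (hypothetical helper): print lines present in both files.
common_lines() {
    # Temporary files hold the sorted copies; the originals are not modified.
    a=$(mktemp) && b=$(mktemp) || return 1
    # Use one locale for both sort and comm so they agree on the ordering.
    LC_ALL=C sort "$1" > "$a"
    LC_ALL=C sort "$2" > "$b"
    # -1 and -2 suppress the "unique to file1/file2" columns, leaving the common lines.
    LC_ALL=C comm -12 "$a" "$b"
    rm -f "$a" "$b"
}
```

Called as common_lines foo bar, it should print only the lines that appear in both files, in sorted order.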

I found this answer on a question listed as a duplicate. I find grep to be more administrator-friendly than comm, so if you just want the set of matching lines (useful for comparing CSV files, for instance) simply use
grep -F -x -f file1 file2
Or the shorter fgrep version (fgrep is an older name for grep -F; recent GNU grep releases warn that it is obsolescent):
fgrep -xf file1 file2
Plus, you can use file2* to glob and look for lines in common with multiple files, rather than just two.
Some other handy variations include
-n flag to show the line number of each matched line
-c to only count the number of lines that match
-v to display only the lines in file2 that differ (or use diff).
Using comm is faster, but that speed comes at the expense of having to sort your files first. It isn't very useful as a 'reverse diff'.

It was asked here before: Unix command to find lines common in two files
You could also try with Perl (credit goes here):
perl -ne 'print if ($seen{$_} .= @ARGV) =~ /10$/' file1 file2

I just learned the comm command from the answers, but I wanted to add something extra: if the files are not sorted, and you don't want to touch the original files, you can pipe the output of the sort command. This leaves the original files intact. It works in Bash, but I can't say about other shells.
comm -1 -2 <(sort file1) <(sort file2)
This can be extended to compare command output, instead of files:
comm -1 -2 <(ls /dir1 | sort) <(ls /dir2 | sort)

The easiest way to do it is:
awk 'NR==FNR{a[$0]++;next} a[$0]' file1 file2
This compares whole lines ($0; using $1 would compare only the first whitespace-separated field). The files do not need to be sorted.
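The question asked for output along the lines of "foo:1, bar:3 == text". The two-file awk idiom can be extended to produce it; the following is only a sketch, with the function name common_with_line_numbers invented for illustration. It assumes the first file is not empty (otherwise NR==FNR also holds for the second file) and reports the first occurrence in file1 for each matching line of file2.

```shell
# common_with_line_numbers FILE1 FILE2 (hypothetical helper)
common_with_line_numbers() {
    awk -v f1="$1" -v f2="$2" '
        # First file: remember the line number of the first occurrence of each line.
        NR == FNR { if (!($0 in first)) first[$0] = FNR; next }
        # Second file: report every line that was also seen in the first file.
        $0 in first { printf "%s:%d, %s:%d == %s\n", f1, first[$0], f2, FNR, $0 }
    ' "$1" "$2"
}
```

With the foo and bar files from the question, common_with_line_numbers foo bar should report the linux-vdso line as foo:1, bar:3.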

I think the diff utility itself, using its unified (-U) option, can be used to achieve this effect. Because the first column of diff's output marks whether a line is an addition or a deletion, we can look for lines that haven't changed.
diff -U1000 file_1 file_2 | grep '^ '
The number 1000 is chosen arbitrarily, big enough to be larger than any single hunk of diff output.
Here's the full, foolproof set of commands:
f1="file_1"
f2="file_2"
# Reading from stdin makes wc print only the count (no filename, and no cut needed).
lc1=$(wc -l < "$f1")
lc2=$(wc -l < "$f2")
lcmax=$(( lc1 > lc2 ? lc1 : lc2 ))
diff -U$lcmax "$f1" "$f2" | grep '^ ' | less
# Alternatively, use this grep to drop the lines starting
# with +, -, and @ (added, removed, and hunk-header lines).
# grep -vE '^[+@-]'
If you want to include the lines that are just moved around, you can sort the input before diffing, like so:
f1="file_1"
f2="file_2"
# Reading from stdin makes wc print only the count (no filename, and no cut needed).
lc1=$(wc -l < "$f1")
lc2=$(wc -l < "$f2")
lcmax=$(( lc1 > lc2 ? lc1 : lc2 ))
diff -U$lcmax <(sort "$f1") <(sort "$f2") | grep '^ ' | less

In Windows, you can use a PowerShell script with Compare-Object:
compare-object -IncludeEqual -ExcludeDifferent -PassThru (get-content A.txt) (get-content B.txt)> MATCHING.txt | Out-Null #Find Matching Lines
Compare-Object options:
-IncludeEqual without -ExcludeDifferent: everything
-ExcludeDifferent without -IncludeEqual: nothing

Just for information, I made a little tool for Windows doing the same thing as "grep -F -x -f file1 file2" (As I haven't found anything equivalent to this command on Windows)
Here it is:
http://www.nerdzcore.com/?page=commonlines
Usage is "CommonLines inputFile1 inputFile2 outputFile"
Source code is also available (GPL).

Related

How to rename a zero-padded file sequence efficiently in ZSH?

I have a picture sequence named with zero-padded numbers like so:
/path/to/file_07469.jpx
/path/to/file_07470.jpx
/path/to/file_07471.jpx
/path/to/file_07472.jpx
/path/to/file_07473.jpx
/path/to/file_07474.jpx
/path/to/file_07475.jpx
/path/to/file_07476.jpx
/path/to/file_07477.jpx
/path/to/file_07478.jpx
/path/to/file_07479.jpx
/path/to/file_07480.jpx
/path/to/file_07481.jpx
/path/to/file_07482.jpx
This is just an extract. It is thousands of files. I’d like to rename all files from a certain number on, adding / subtracting X. I’d love to use find with a regex.
#!/bin/zsh
shift=-1000
seqnumstart="$(echo "$1" | grep -Eo "\d+")"
bn="$(basename $1)"
bbn="$(echo "${bn%_*}")"
ext="$(echo "${bn##*.}")"
find "$(dirname $1)" -name "$bbn*$ext" -print0 | while read -d $'\0' file
do
seqnum="$(echo "$file" | grep -Eo "\d+")"
seqnum="$(echo "${seqnum#"${seqnum%%[!0]*}"}")"
if [[ "$seqnum" -ge "$seqnumstart" ]]; then
seqnumnew=$(($seqnum + $shift))
seqnumnew=$(printf %05d $seqnumnew)
filenew="$(echo $file | sed -E 's [0-9]+ '$seqnumnew' g')"
mv "$file" "$filenew"
fi
done
How can I improve my code? It is very slow. I'm on a Mac (zsh).
zmv is a utility in zsh that can do a lot of filename manipulation and looping for you. Try this:
zmv -n 'p/file_(<7000-7999>).jpx' 'p/file_$(printf "%05d" $(($1 - 1000))).jpx'
Some of the pieces:
zmv: an autoload function; use autoload -Uz zmv to make it available (this is usually added to .zshrc).
-n: no-op. With this option, zmv will just print what would have happened, giving you an idea if the command is correct. Remove this to actually mv the files.
(...): grouping operator for zmv. This identifies sections in the name that you want to change; this section is referenced in the 'to' argument as $1.
<7000-7999>: glob operator for a range. Note that leading zeroes are not always required.
$(printf "%05d" ...): zero-padding.
$((...)): arithmetic.
$1: reference to the parenthetical value in the 'from' argument. This is where zmv's magic happens - this is substituted for each matching filename.
As you likely know, you'll need to do the renaming in groups or in a specific order to avoid trying to change a name to a name that already exists. zmv will usually halt when it encounters collisions like that.
This is much faster:
#!/bin/zsh
shift=1000
seqnumstart="$(echo "$1" | grep -Eo "\d+")"
lastfile="$(find "$(dirname $1)" -name "*.jpx" | sort | tail -1)"
seqnumend="$(echo "$lastfile" | grep -Eo "\d+")"
bn="$(basename $1)"
bbn="$(echo "${bn%_*}")"
#extension
ext="$(echo "${bn##*.}")"
#basepath before the padded number
bp="$(echo "${1%_*}")"
function buildpath {
echo "$bp"_"$1"."$ext"
}
for i in {$seqnumstart..$seqnumend}
do
unpad="$(echo $i | sed 's/^0*//')"
seqnumnew="$(($unpad + $shift))"
seqnumnewpad="$(printf %05d $seqnumnew)"
op="$(buildpath "$i")"
np="$(buildpath "$seqnumnewpad")"
mv "$op" "$np"
done
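On the collision point raised above, here is a rough sketch in plain sh/bash (not zsh) of an order-aware rename. The function name shift_sequence and its arguments are hypothetical; it assumes names of the form file_NNNNN.jpx with five-digit padding and no newlines in paths. It walks the files in descending order for a positive offset and ascending order for a negative one, and skips any rename whose target already exists rather than overwriting it.

```shell
# shift_sequence DIR START OFFSET (hypothetical helper)
# Renames DIR/file_NNNNN.jpx to DIR/file_(NNNNN+OFFSET).jpx for NNNNN >= START.
shift_sequence() {
    dir=$1; start=$2; offset=$3
    # Descending order for a positive offset, ascending otherwise, so that
    # each target name has normally been vacated by an earlier rename.
    if [ "$offset" -gt 0 ]; then order=-r; else order=; fi
    printf '%s\n' "$dir"/file_*.jpx | sort $order | while IFS= read -r path; do
        [ -e "$path" ] || continue            # the glob matched nothing
        num=${path##*_}; num=${num%.jpx}      # e.g. 07469
        n=${num#"${num%%[!0]*}"}              # strip leading zeros (avoids octal)
        n=${n:-0}
        [ "$n" -ge "$start" ] || continue
        new=$(printf '%s/file_%05d.jpx' "$dir" $((n + offset)))
        if [ -e "$new" ]; then
            echo "skipping $path: $new already exists" >&2
            continue
        fi
        mv -- "$path" "$new"
    done
}
```

For example, shift_sequence /path/to 7469 -1000 would attempt the same shift as the script in the question. Only shell built-in expansions are used per file, so it avoids the grep and sed subprocesses of the original loop.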

sed with filename from pipe

In a folder I have many files with several parameters in filenames, e.g (just with one parameter) file_a1.0.txt, file_a1.2.txt etc.
These are generated by a C++ program and I need to take the last one generated (in time). I don't know a priori what the value of this parameter will be when the program terminates. After that I need to copy the 2nd line of this last file.
To copy the 2nd line of any file, I know that this sed command works:
sed -n 2p filename
I know also how to find the last generated file:
ls -rtl file_a*.txt | tail -1
Question:
How can I combine these two operations? It is certainly possible to pipe the 2nd operation into the sed operation, but I don't know how to pass the filename from the pipe as input to the sed command.
You can use this,
ls -rt1 file_a*.txt | tail -1 | xargs sed -n '2p'
(OR)
sed -n '2p' `ls -rt1 file_a*.txt | tail -1`
sed -n '2p' $(ls -rt1 file_a*.txt | tail -1)
Typically you can put a command in back ticks to put its output at a particular point in another command - so
sed -n 2p `ls -rt name*.txt | tail -1 `
Alternatively - and preferred, because it is easier to nest etc -
sed -n 2p $(ls -rt name*.txt | tail -1)
-r in ls is reverse order.
-r, --reverse
reverse order while sorting
But it is not a good idea to combine it with tail -1.
With the change below (head -1, without the -r option to ls), head can stop after the first line instead of tail having to read the whole listing to reach the last one:
sed -n 2p $(ls -t1 name*.txt | head -1 )
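A possible alternative that avoids parsing ls output altogether, sketched under the assumption that the shell's test builtin supports -nt (bash, dash and zsh do, although POSIX does not require it). The helper names newest_file and second_line_of_newest are made up for illustration.

```shell
# newest_file FILE... (hypothetical helper): print the most recently modified file.
newest_file() {
    newest=
    for f in "$@"; do
        [ -f "$f" ] || continue               # skip an unmatched glob pattern
        if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
            newest=$f
        fi
    done
    [ -n "$newest" ] && printf '%s\n' "$newest"
}

# second_line_of_newest FILE... : print line 2 of the newest file, if any.
second_line_of_newest() {
    f=$(newest_file "$@") && sed -n '2p' "$f"
}
```

Usage would be second_line_of_newest file_a*.txt. Because the glob is expanded by the shell, filenames containing spaces are handled as well.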
I was looking for a similar solution: taking the file names from a pipe of grep results to feed to sed. I've copied my answer here for the search & replace, but perhaps this example can help as it calls sed for each of the names found in the pipe:
this command to simply find all the files:
grep -i -l -r foo ./*
this one to exclude this_shell.sh (in case you put the command in a script called this_shell.sh), tee the output to the console to see what happened, and then use sed on each file name found to replace the text foo with bar:
grep -i -l -r --exclude "this_shell.sh" foo ./* | tee /dev/fd/2 | while read -r x; do sed -b -i 's/foo/bar/gi' "$x"; done
I chose this method, as I didn't like having all the timestamps changed for files not modified. Feeding the grep result allows only the files with target text to be looked at (thus likely may improve performance / speed as well)
Be sure to back up your files and test before using. May not work in some environments for files with embedded spaces. (?)
fwiw - I had some problems using the tail method, it seems that the entire dataset was generated before calling tail on just the last item.

find file names by searching the last lines for pattern

I have to find all files in a large number of large ASCII files which contain a specific pattern. At the moment I'm doing that with
grep -l <pattern> <files>
and it's very slow.
But I know that the pattern appears in the last 10 lines, if it exists. Is there an elegant possibility to search only the last lines to speed up the search, e.g. with awk?
You can simply print the filename while processing
for f in $files; do
# grep -l on a pipe would only report "(standard input)",
# so test quietly with -q and print the filename ourselves.
tail -n 10 "$f" | grep -q "$pattern" && echo "$f"
done
To see only a specific number of lines from the end of a file, the tail syntax is as follows.
tail [+ number] [-l] [-b] [-c] [-r] [-f] [-c number | -n number] [file]
You can then pipe its output to grep to do the search.
i.e.
tail -n 10 <fileName> | grep <pattern>

How to "grep" out specific line ranges of a file

There are often times I will grep -n whatever file to find what I am looking for. Say the output is:
1234: whatev 1
5555: whatev 2
6643: whatev 3
If I want to then just extract the lines between 1234 and 5555, is there a tool to do that? For static files I have a script that does wc -l of the file and then does the math to split it out with tail & head but that doesn't work out so well with log files that are constantly being written to.
Try using sed as mentioned on
http://linuxcommando.blogspot.com/2008/03/using-sed-to-extract-lines-in-text-file.html. For example use
sed '2,4!d' somefile.txt
to print from the second line to the fourth line of somefile.txt. (And don't forget to check http://www.grymoire.com/Unix/Sed.html, sed is a wonderful tool.)
The following command will do what you asked for "extract the lines between 1234 and 5555" in someFile.
sed -n '1234,5555p' someFile
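For large or still-growing log files it may help to stop reading once the last wanted line has been printed. A small sketch, with print_range as a hypothetical wrapper name:

```shell
# print_range FIRST LAST FILE (hypothetical helper)
# Print lines FIRST..LAST, then quit so the rest of the file is never read.
print_range() {
    sed -n "$1,$2p;$2q" "$3"
}
```

print_range 1234 5555 someFile prints the same lines as the command above, but sed exits at line 5555 instead of scanning to the end of the file.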
If I understand correctly, you want to find a pattern between two line numbers. The awk one-liner could be
awk '/whatev/ && NR >= 1234 && NR <= 5555' file
You don't need to run grep followed by sed.
Perl one-liner:
perl -ne 'if (/whatev/ && $. >= 1234 && $. <= 5555) {print}' file
Line numbers are OK if you can guarantee the position of what you want. Over the years, my favorite flavor of this has been something like this:
sed "/First Line of Text/,/Last Line of Text/d" filename
which deletes all lines from the first matched line to the last match, including those lines.
Use sed -n with "p" instead of "d" to print those lines instead. Way more useful for me, as I usually don't know where those lines are.
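The printing variant might look like the sketch below. The function name print_between is invented for illustration, and the patterns are assumed not to contain a slash, since they are placed directly inside sed's /.../ delimiters.

```shell
# print_between START_REGEX END_REGEX FILE (hypothetical helper)
# Print from the first line matching START_REGEX through the next line
# matching END_REGEX, inclusive; sed repeats this for any later ranges.
print_between() {
    sed -n "/$1/,/$2/p" "$3"
}
```

For instance, print_between 'First Line of Text' 'Last Line of Text' filename prints the block that the d version would delete.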
Put this in a file and make it executable:
#!/usr/bin/env bash
start=`grep -n $1 < $3 | head -n1 | cut -d: -f1; exit ${PIPESTATUS[0]}`
if [ ${PIPESTATUS[0]} -ne 0 ]; then
echo "couldn't find start pattern!" 1>&2
exit 1
fi
stop=`tail -n +$start < $3 | grep -n $2 | head -n1 | cut -d: -f1; exit ${PIPESTATUS[1]}`
if [ ${PIPESTATUS[0]} -ne 0 ]; then
echo "couldn't find end pattern!" 1>&2
exit 1
fi
stop=$(( $stop + $start - 1))
sed "$start,$stop!d" < $3
Execute the file with arguments (NOTE that the script does not handle spaces in arguments!):
Starting grep pattern
Stopping grep pattern
File path
To use with your example, use arguments: 1234 5555 myfile.txt
Includes lines with starting and stopping pattern.
If I want to then just extract the lines between 1234 and 5555, is
there a tool to do that?
There is also ugrep, a GNU/BSD grep compatible tool but one that offers a -K option (or --range) with a range of line numbers to do just that:
ugrep -K1234,5555 -n '' somefile.log
You can use the usual GNU/BSD grep options and regex patterns (but it also offers a lot more such as -K.)
If you want lines instead of line ranges, you can do it with perl: eg. if you want to get line 1, 3 and 5 from a file, say /etc/passwd:
perl -e 'while(<>){if(++$l~~[1,3,5]){print}}' < /etc/passwd

What are some one-liners that can output unique elements of the nth column to another file?

I have a file like this:
1 2 3
4 5 6
7 6 8
9 6 3
4 4 4
What are some one-liners that can output unique elements of the nth column to another file?
EDIT: Here's a list of solutions people gave. Thanks guys!
cat in.txt | cut -d' ' -f 3 | sort -u
cut -c 1 t.txt | sort -u
awk '{ print $2 }' cols.txt | sort -u
perl -anE 'say $F[0] unless $h{$F[0]}++' filename
In Perl before 5.10
perl -lane 'print $F[0] unless $h{$F[0]}++' filename
In Perl 5.10 and later
perl -anE 'say $F[0] unless $h{$F[0]}++' filename
Replace 0 with the column you want to output.
For j_random_hacker, here is an implementation that will use very little memory (but will be slower and requires more typing):
perl -lane 'BEGIN {dbmopen %h, "/tmp/$$", 0600; unlink "/tmp/$$.db" } print $F[0] unless $h{$F[0]}++' filename
dbmopen creates an interface between a DBM file (that it creates or opens) and the hash named %h. Anything stored in %h will be stored on disc instead of in memory. Deleting the file with unlink ensures that the file will not stick around after the program is done, but has no effect on the current process (since, according to POSIX rules, open filehandles are respected by the filesystem as real files).
Corrected: Thank you Mark Rushakoff.
$ cut -c 1 t.txt | sort | uniq
or
$ cut -c 1 t.txt | sort -u
1
4
7
9
Taking the unique values of the third column:
$ cat in.txt | cut -d' ' -f 3 | sort -u
3
4
6
8
cut -d' ' means to separate the input delimited by spaces, and the -f 3 part means take the third field. Finally, sort -u sorts the output, keeping only unique entries.
Say your file is "cols.txt" and you want the unique elements of the second column (uniq on its own only collapses adjacent duplicates, so sort first):
awk '{ print $2 }' cols.txt | sort -u
You might find the following article useful for learning more about such utilities:
Simplify data extraction using Linux text utilities
If you are using awk, there is no need for other commands:
awk '!_[$2]++{print $2}' file
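The same idea can be parameterized on the column number. This is a small sketch with a made-up function name, unique_column; it keeps the first occurrence of each value in input order, without sorting.

```shell
# unique_column N FILE (hypothetical helper)
# Print each distinct value of whitespace-separated column N once,
# in order of first appearance.
unique_column() {
    awk -v col="$1" '!seen[$col]++ { print $col }' "$2"
}
```

To write the result to another file, as the question asks, redirect it: unique_column 2 in.txt > out.txt.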