Grep only for the first match across a number of files - command-line

In general, I'm looking for a way to show each match of a grep command just once.
For the current use case, I intend to get a list of all programmers who have contributed files to a database. The files of interest are all written in Java, therefore the search pattern is "@author".
In the end, I'd like to get an enumeration of all the author shortcuts (at this point I do not even care in which files the pattern occurs).
The result should be similar to the example below:
pak@Q:~$ grep -r "@author" | [...]
@bsh
@janS
@Jan Snow
...
Edit: in case anyone is facing a similar problem, the command of interest is
grep -rh "@author" | sort -u

Pretty straightforward: you can sort the output and keep only the unique entries:
grep [...] | sort -u
If you're grepping across multiple files, you'll probably want the -h option, and perhaps -s to hide error messages:
Example
dir
├── a
│  └── File contents:
│  @author ed
│  @author frank
│  @author ben
│ 
└── b
  └── File contents:
  @author ben
  @author frank
  @author steve
From dir we run
$ grep -sh '@author' * | sort -u
Output:
@author ben
@author ed
@author frank
@author steve
More info
From the grep man page:
-h, --no-filename Suppress the prefixing of file names on output. This is the default when there is only one file (or only standard input) to search.
-s, --no-messages Suppress error messages about nonexistent or unreadable files.
From the sort man page:
sort - sort lines of text files
-u, --unique
with -c, check for strict ordering; without -c, output only the first of an equal run
Credit
Thanks to @EdMorton for the sort -u version. Originally I suggested the following (which remains valid):
grep -r "author" | sort | uniq

Related

Renaming files/dirs from one date format to another

So I am a coding newbie and have, for some time, wanted to edit the formatting of my fairly extensive live music library. I have looked around here and on various other resources to get to where I am, but I have hit a snag. I have directories named in the following ways:
02.10.90 | 23 East Caberet - Ardmore, PA
02.16.90 | The Paradise - Boston, MA
and I would like to rename these simply to
1990-02-10 | 23 East Caberet - Ardmore, PA
1990-02-16 | The Paradise - Boston, MA
I have been able to rename the date correctly using:
ls -1 | grep 90 | awk '{print $1}' | awk -F. '{printf "%s-%s-%s\n", "19"$3,$1,$2}' > list1.txt
and then pull the rest of the name using
ls -1 | grep 90 | awk '{first = $1; $1 = ""; print $0}'>list2.txt
So, I have a list of directories ranging from the years 1990-2004 that I would like to apply this to (they are all in different subdirectories, so I don't mind manually changing the "grep 90"). However, from the two separate lists that I generate, I can't figure out how to make it loop through each row and print "mv original_name list1.txt+list2.txt" so that it would read:
mv 02.10.90 | 23 East Caberet - Ardmore, PA 1990-02-10 | 23 East Caberet - Ardmore, PA
I scanned through many previous posts and couldn't quite figure out the last bit - or better yet, a more elegant solution! Any help is greatly appreciated, thank you in advance!
Don't parse the output of ls (google why). You don't need grep when you're using awk, nor do you need chains of awk commands, but for this task you wouldn't use any of those commands anyway.
The UNIX command to find files is named find, so start with that. This will find all directories whose names start with the given globbing pattern:
find . -type d -name '[0-9][0-9].[0-9][0-9].90 *'
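With the example directory names above, that should print something like:
./02.10.90 | 23 East Caberet - Ardmore, PA
./02.16.90 | The Paradise - Boston, MA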
Now that you've found the files, you need to do something with them. For your needs, IMHO the simplest approach is best, and that'd be:
find . -type d -name '[0-9][0-9].[0-9][0-9].90 *' -print0 |
while IFS= read -r -d '' old; do
    path="$(dirname "$old")"
    oldDirName="$(basename "$old")"
    # match MM.DD.YY, then the rest of the name
    if [[ $oldDirName =~ ([0-9]+)\.([0-9]+)\.([0-9]+)( .*) ]]; then
        newDirName="19${BASH_REMATCH[3]}-${BASH_REMATCH[1]}-${BASH_REMATCH[2]}${BASH_REMATCH[4]}"
        echo mv -- "${path}/${oldDirName}" "${path}/${newDirName}"
    fi
done
The above uses GNU find for -print0 and bash for BASH_REMATCH. Remove the echo once you've debugged it and are happy with what it's going to do.
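Against the sample names, the dry run should echo something like:
mv -- ./02.10.90 | 23 East Caberet - Ardmore, PA ./1990-02-10 | 23 East Caberet - Ardmore, PA
mv -- ./02.16.90 | The Paradise - Boston, MA ./1990-02-16 | The Paradise - Boston, MA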

finding most recent file version from list of file path names with jumbled file names

I recently lost a bunch of files from Eclipse in an accidental copy/replace dilemma. I was able to recover most of them, but I found in the Eclipse metadata folder a history of files, some of which are the ones I need. The path for the history is:
($WORKSPACE/.metadata/.plugins/org.eclipse.core.resources/.history).
Inside there are a bunch of folders like 3e, 2f, 1a, ff, etc., each with a couple of files named like "2054f7f9a0d30012175be7013ca49f5b". I was able to do a recursive grep with a keyword I know would be in the file and get back a list of file names (grep -R -l 'KEYWORD'), but now I can't figure out how to sort them by most recently modified.
Any help would be great, thanks!
You can try:
find $WORK.../.history -type f -printf '%T@\t%p\n' | sort -nr | cut -f2- | xargs grep 'your_pattern'
Decomposed:
the find finds all plain files and prints their modification time and path
the sort sorts them numerically in reverse, so the highest number (the latest modified) comes first
the cut removes the time from each line
the xargs runs its argument for each file it receives on its input;
in this case it will run the grep command, so
the first file in which grep finds the pattern is the most recently modified one
The above does not work when the filenames contain spaces, but hopefully that is not your case... The -printf works only with GNU find.
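If spaces do turn out to be a problem, a NUL-delimited variant of the same pipeline should work (a sketch; the -z options of sort and cut require a reasonably recent GNU coreutils):
find $WORK.../.history -type f -printf '%T@\t%p\0' | sort -znr | cut -zf2- | xargs -0 grep 'your_pattern'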
For the repetitive work, you can split the command into two parts:
find $WORK.../.history -type f -printf '%T@\t%p\n' | sort -nr | cut -f2- > /somewhere/FILENAMES_SORTED_BY_MODIF_TIME
so in the first step you save the list of filenames, sorted by modification time, somewhere, and afterwards you can repeatedly grep their content with:
< /somewhere/FILENAMES_SORTED_BY_MODIF_TIME xargs grep 'your_pattern'
The above command is usually written as
xargs grep 'your_pattern' < /somewhere/FILENAMES_SORTED_BY_MODIF_TIME
but bash is fine with the redirection at the start, and in this case it is simpler to change the grep pattern when the pattern is in the last position...
If you want to check the list of filenames with their modification times, you can break the above commands up as:
find $WORK.../.history -type f -printf "%T@\t%Tc\t%p\n" | sort -nr > /somewhere/FILENAMES_WITH_DATE
check the list (each line now contains a human-readable date too) and then use:
< /somewhere/FILENAMES_WITH_DATE cut -f3- | xargs grep 'your_pattern'
Note that you now need to use -f3- and not -f2- as in the first example.

comparing two directories with separate diff output per file

I need to see what has changed between two directories which contain different versions of a piece of software's source code. While I have found a way to get a single .diff file, how can I obtain a separate file for each changed file in the two directories? I need this because the "main" diff is about 6 MB and I wanted something more manageable.
I ran into this problem too, so I ended up with a few lines of shell script. It takes three arguments: the source and destination directories (as used for diff) and a target folder (which should exist) for the output.
It's a bit hacky, but maybe it will be useful for someone. Use it with care, especially if your paths contain special characters.
#!/bin/sh
DIFFARGS="-wb"
LANG=C
TARGET=$3
# escape the slashes in the source/destination paths for use in the sed pattern below
SRC=`echo $1 | sed -e 's/\//\\\\\\//g'`
DST=`echo $2 | sed -e 's/\//\\\\\\//g'`

if [ ! -d "$TARGET" ]; then
    echo "'$TARGET' is not a directory." >&2
    exit 1
fi

# list the files that differ, strip the "Files ... and ... differ" wrapper,
# then re-diff each file individually into its own .diff under $TARGET
diff -rqN $DIFFARGS "$1" "$2" | sed "s/Files $SRC\/\(.*\?\) and $DST\/\(.*\?\) differ/\1/" | \
while read file
do
    if [ ! -d "$TARGET/`dirname \"$file\"`" ]; then
        mkdir -p "$TARGET/`dirname \"$file\"`"
    fi
    diff $DIFFARGS -N "$1/$file" "$2/$file" > "$TARGET"/"$file.diff"
done
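A hypothetical invocation (the script name splitdiff.sh is made up here) would be:
$ ./splitdiff.sh src-old src-new diffs
which leaves one diffs/<file>.diff per changed file, mirroring the directory structure.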
If you want to compare source code, it is better to commit it to a version control system such as svn.
After you have done so, do a diff of your uploaded code and pipe it to file.diff:
svn diff --old svn:url1 --new svn:url2 > file.diff
A bash for loop will work for you. The following will diff two directories with C source code and produce a separate diff for each file.
for FILE in $(find <FIRST_DIR> -name '*.[ch]'); do
    DIFF=<DIFF_DIR>/$(echo $FILE | grep -o '[-_a-zA-Z0-9.]*$').diff
    diff -u $FILE <SECOND_DIR>/$FILE > $DIFF
done
Use the correct patch level for the lines starting with +++
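As with the renaming answer above, the $(find ...) loop splits on whitespace in file names; a sketch of a NUL-safe variant of the same idea (basename stands in for the grep -o trick):
find <FIRST_DIR> -name '*.[ch]' -print0 |
while IFS= read -r -d '' FILE; do
    diff -u "$FILE" "<SECOND_DIR>/$FILE" > "<DIFF_DIR>/$(basename "$FILE").diff"
done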

How to search inside a list of files for multiple values existing on the same line?

I am trying to search one directory containing a large number of HTML files, to find those files that contain two exact values on the same line. This should work:
grep -iwc 'word1' -sl | xargs grep -iwc 'word2' -s
But that only works on one file at a time. I tried something like this:
find . -iname '*html' | xargs grep -iwc 'word1' -sl | xargs grep -iwc 'word2' -s
But that seems to display files containing either of the two values, even when they are not on the same line.
The output should only be the file names and the number of occurrences like:
file.html:2
Is it possible to group those 2 greps? Or is there another way to do this search?
An extended regex may help. Something like this perhaps?
find . -iname '*html' | xargs egrep -iwl '(word1.*word2|word2.*word1)'
Since you only have two words you're looking for, it's not too hard to list off all the ways to order them.
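If you also want the per-file counts shown in the question (file.html:2), swapping -l for -c should work; grep prefixes each count with the file name when given multiple files (note that files with zero matches will be listed with :0 as well):
find . -iname '*html' | xargs egrep -iwc '(word1.*word2|word2.*word1)'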

Finding most commonly edited files in clearcase

We are currently planning a quality improvement exercise, and I would like to target the most commonly edited files in our ClearCase VOBs. Since we have just been through a bug-fixing phase, the most commonly edited files should give a good indication of where the most bug-prone code is, and therefore which code is most in need of quality improvement.
Does anyone know if there is a way of obtaining a top-100 list of the most edited files? Preferably this would cover edits happening on multiple branches.
(The previous answer was for a simpler case: single branch)
Since "most projects dev has not all happened on the one branch so the version numbers don't necessarily mean most edited", a "way to get number of check-ins across all branches" would be:
search all versions created since the date of the last bug fixing phase,
sort them by file,
then by occurrence.
Something along the lines of:
C:\Prog\cc\test\test>ct find -all -type f -ver "created_since(16-Oct-2009)" -exec "cleartool descr -fmt """%En~%Sn\n""""""%CLEARCASE_XPN%"""" | grep -v "\\0" | awk -F ~ "{print $1}" | sort | uniq -c | sort /R | head -100
Or, for Unix syntax:
$ ct find -all -type f -ver 'created_since(16-Oct-2009)' -exec 'cleartool descr -fmt "%En~%Sn\n" "%CLEARCASE_XPN%"' | grep -v "/0" | awk -F ~ '{print $1}' | sort | uniq -c | sort -rn | head -100
replace the date with that of the label marking the start of your bug-fixing phase
Again, note the double-quotes around the '%CLEARCASE_XPN%' to accommodate spaces within file names.
Here, '%CLEARCASE_XPN%' is used rather than '%CLEARCASE_PN%' because we need every version.
grep -v "/0" is here to exclude version 0 (/main/0, /main/myBranch/0, ...)
awk -F ~ "{print $1}" is used to only print the first part of each line:
C:\Prog\cc\test\test\a.txt~\main\mybranch\2 becomes C:\Prog\cc\test\test\a.txt
From there, the counting and sorting can begin:
sort to make sure every identical line is grouped
uniq -c to remove duplicate lines and precede each remaining line with a count of said duplicates
sort -rn (or sort /R for Windows) for having the most edited files at the top
head -100 for keeping only the 100 most edited files.
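To illustrate just the counting stages in isolation (hypothetical input lines):
$ printf 'a.txt\nb.txt\na.txt\n' | sort | uniq -c | sort -rn | head -100
      2 a.txt
      1 b.txt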
Again, GnuWin32 will come in handy for the Windows version of the one-liner.
(See the answer above for the more complicated case: multiple branches.)
First, use a dynamic view: easier and quicker to update its content and fiddle with its config spec rules.
If your bug-fixing has been made in a branch, starting from a given label, set-up a dynamic view with the following config spec as:
element * .../MY_BRANCH/LATEST
element * MY_STARTING_LABEL
element * /main/LATEST
Then you find all the files, with their current version number (closely related to the number of edits):
ct find . -type f -exec "cleartool desc -fmt """%Ln\t\t%En\n""" """%CLEARCASE_PN%""""|sort /R|head -100
This is the Windows syntax (note the triple "double-quotes" around %CLEARCASE_PN% in order to accommodate spaces within the file names).
The 'head' command comes from the GnuWin32 library.
The most edited versions are at the top of the list.
A Unix version would be:
$ ct find . -type f -exec 'cleartool desc -fmt "%Ln\t\t%En\n" "$CLEARCASE_PN"' | sort -rn | head -100
The most edited versions would be at the top.
Do not forget that for metrics, the raw numbers are not enough, trends are important too.