What order does find(1) list files in?

On extfs, if there are only file creations and no deletions in a directory, I would expect find . -type f to list the files either in the chronological order of their creation (or mtime) or, failing that, at least in reverse chronological order... depending on how a directory's contents are traversed.
But that isn't the behavior I'm seeing.
The following snippet, for example, creates a fresh set of directories and files:
#!/bin/bash -u
for i in a/ a/{1,2,3,4,5} b/ b/{1,2,3,4,5}; do
    if echo "$i" | egrep -q "/$"; then
        echo "Creating dir $i"
        mkdir -p "$i"
    else
        echo "Creating file $i"
        touch "$i"
    fi
    sleep 0.500
done
Output of the above snippet:
Creating dir a/
Creating file a/1
Creating file a/2
Creating file a/3
Creating file a/4
Creating file a/5
Creating dir b/
Creating file b/1
Creating file b/2
Creating file b/3
Creating file b/4
Creating file b/5
However, find lists the files in a somewhat random order. For example, a/2 doesn't follow a/1, and b/2 doesn't follow b/1:
$ find . -type f
./a/1
./a/3
./a/4
./a/2 <----
./a/5
./b/1
./b/3
./b/4
./b/2 <----
./b/5
Any idea why this should happen?
My main problem is: I have a very large volume storing hundreds of thousands of files. I need to traverse these files and directories in the order of their creation/modification (mtime) and pipe each file to another process for further processing, but I don't necessarily want to first build a temporary list of this large set of files and then sort it by mtime before piping it to my process.

find lists objects in the order that they are reported by the underlying filesystem implementation. You can tell ls to show you this "raw" order by passing it the -f option.
The order could be anything at all -- alphabetical, by mtime, by atime, by length of name, by permissions, or something completely different. The ordering can even vary from one listing to the next.
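For example, to see this raw order on the directory a created above (GNU coreutils; -f disables sorting and implies -a, while -U disables sorting without including dotfiles):
ls -f a
ls -U a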
It's common for filesystems to report in an order that reflects the filesystem's strategy for allocating directory slots to files. If this is some sort of hash-based strategy based on filename then the order can appear nonsensical. This is what happens with widely-used Linux and BSD filesystem implementations. Since you mention extfs this is probably what causes the ordering you're seeing.
So, if you need the output from find to be ordered in a particular way then you'll have to create that order yourself. Maybe based on something like:
find . -type f -exec ls -ltr --time-style=+%s {} \; | sort -n -k6
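With GNU find, a lighter-weight sketch of the same idea avoids spawning ls once per file: %T@ prints the mtime as seconds since the epoch and %p the path (filenames containing newlines would still confuse it):
find . -type f -printf '%T@ %p\n' | sort -n | cut -d' ' -f2-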

Related

find + cp sparse directory tree

I have a directory tree which, among other files, has files which match certain patterns. For the sake of the discussion, let's assume these are files matching *.foo, or *.bar, or baz*. I want to backup inside my zsh-script only files matching these pattern to a new directory.
The seemingly obvious solution,
find fromdir \( -name '*.{foo,bar}' -o -name 'baz*' \) -exec cp {} todir \;
does not work, because the destination directory for, e.g., fromdir/x/y/a.foo does not exist.
I was thinking of using rsync, but I know only how to exclude certain files from being copied, not how to restrict copying.
I can solve the problem by writing a small auxiliary script, mdcp1file, like this:
#!/bin/zsh
set -u
mkdir -p $2/$1:h  # Create destination directory if needed
cp $1 $2/$1:h     # Copy into that directory so the source structure is preserved
and use it in my find command instead of cp. I wonder whether there is an easier way to solve this problem, either by beefing up the -exec of my find, or by using rsync in a clever way.
As you mention that you make use of zsh, you could just do something like this:
cd /path/to/source/dir
cp --parents **/{*.{foo,bar},baz*}(.) /path/to/destination/dir
Here we make use of:
cp --parents: copies each file together with its (relative) parent directories, creating the missing folder structure under the destination
**: for matching over multiple directories
BRACE EXPANSION: A string of the form foo{xx,yy,zz}bar is expanded to the individual words fooxxbar, fooyybar and foozzbar. Left-to-right order is preserved. This construct may be nested. Commas may be quoted in order to include them literally in a word.
Glob Qualifier (.): Patterns used for filename generation may end in a list of qualifiers enclosed in parentheses. The qualifiers specify which filenames that otherwise match the given pattern will be inserted in the argument list. The . qualifier selects plain files only.
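For completeness, the rsync route mentioned in the question can be sketched with filter rules (untested; the trailing --exclude='*' drops everything not explicitly included, and -m / --prune-empty-dirs avoids creating directories that end up empty):
rsync -am --include='*/' --include='*.foo' --include='*.bar' --include='baz*' --exclude='*' fromdir/ todir/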

Logrotate files in multiple sub directories to backup location in same folder structure

I'm trying to use logrotate with very little experience. Currently I have the files rotating, compressing, and being renamed into the same folder. Now, instead of dropping the files in the same place, I need to have them dropped in another location. They also need to keep the same folder structure, and if it isn't there then the new folder needs to be created. All the compressed files need to be added without overwriting the existing files.
I'm thinking that olddir will drop them into a destination folder, but I'm not sure how to have it drop them into the corresponding folder, or create it if it's not already there.
Example source
var/log/device1/*.log
var/log/device2/*.log
var/log/device3/*.log
Example Destination to drop .gz files into
opt/archive/device1/
opt/archive/device2/
(needs to create opt/archive/device3 and put rotated file in here)
I didn't end up finding a way to move the files with logrotate, but I came up with a script to do the same sort of thing. It's pretty simplistic and won't work for more than one level of subfolders.
#!/bin/bash
source="/opt/log/host"
destination="/opt/archive/"
# Read paths line by line so names containing spaces survive
find "$source" -maxdepth 2 -type f -name "*.gz" | while read -r i
do
    # Remove /opt/log/host/ from the string
    dd="${i#"$source"/}"
    # Remove everything after the first / (i.e. keep the device folder)
    ff="${dd%%/*}"
    # Set the correct destination string
    ee="$destination$dd"
    # Create new folders if they do not exist
    mkdir -p -- "$destination$ff"
    # Move files
    mv -- "$i" "$ee"
done
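If rsync is an option, roughly the same move can be sketched in a single command (untested; --remove-source-files deletes each .gz after it has been copied, and -m avoids creating directories that stay empty on the destination):
rsync -am --include='*/' --include='*.gz' --exclude='*' --remove-source-files /opt/log/host/ /opt/archive/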

Folders not showing up in Bucket storage

So my problem is that I have a few files not showing up in gcsfuse when mounted. I see them in the online console and if I 'ls' with gsutil.
Also, if I manually create the folder in the bucket, I can then see the files inside it, but I need to create it first. Any suggestions?
gs://mybucket/
    dir1/
        ok.txt
        dir2
            lafu.txt
If I mount mybucket with gcsfuse and do 'ls' it only returns dir1/ok.txt.
Then I'll create the folder dir2 inside dir1 at the root of the mounting point, and suddenly 'lafu.txt' shows up.
By default, gcsfuse won't show a directory "implicitly" defined by a file with a slash in its name. For example, if your bucket contains an object named dir/foo.txt, you won't be able to find it unless there is also an object named dir/.
You can work around this by setting the --implicit-dirs flag, but there are good reasons why this is not the default. See the documentation for more information.
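For example, to mount with that flag enabled (a sketch; mybucket and the mount point path are placeholders):
gcsfuse --implicit-dirs mybucket /path/to/mount/point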
Google Cloud Storage doesn't have folders. The various interfaces use different tricks to pretend that folders exist, but ultimately there's just an object whose name contains a bunch of slashes. For example, "pictures/january/0001.jpg" is the full name of a single object.
If you need to be sure that a "folder" exists, put an object inside it.
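For instance, uploading a small placeholder object makes such a "folder" visible (a sketch; the path and the .keep name are arbitrary placeholders):
touch .keep
gsutil cp .keep gs://mybucket/path/to/folder/.keep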
@Brandon Yarbrough suggests creating the needed directory entries in the GCS bucket. This avoids the performance penalty described by @jacobsa.
Here is a bash script for doing so:
#!/bin/bash
# 1. Mount $BUCKET_NAME at $MOUNT_PT
# 2. Run this script
MOUNT_PT=${1:-$HOME/mnt}
BUCKET_NAME=$2
DEL_OUTFILE=${3:-y}   # Set to y or n
echo "Reading objects in $BUCKET_NAME"
OUTFILE=dir_names.txt
gsutil ls -r "gs://$BUCKET_NAME/**" | while read -r BUCKET_OBJ
do
    dirname "$BUCKET_OBJ"
done | sort -u > "$OUTFILE"
echo "Processing directories found"
while read -r DIR_NAME
do
    # Strip the gs://<bucket>/ prefix to get the path relative to the mount point
    LOCAL_DIR=$(echo "$DIR_NAME" | sed "s=gs://$BUCKET_NAME/==" | sed "s=gs://$BUCKET_NAME==")
    TARG_DIR="$MOUNT_PT/$LOCAL_DIR"
    if ! [ -d "$TARG_DIR" ]
    then
        echo "Creating $TARG_DIR"
        mkdir -p "$TARG_DIR"
    fi
done < "$OUTFILE"
if [ "$DEL_OUTFILE" = "y" ]
then
    rm "$OUTFILE"
fi
echo "Process complete"
I wrote this script, and have shared it at https://github.com/mherzog01/util/blob/main/sh/mk_bucket_dirs.sh.
This script assumes that you have mounted a GCS bucket locally on a Linux (or similar) system. The script first specifies the GCS bucket and location where the bucket is mounted. It then identifies all "directories" in the GCS bucket which are not visible locally, and creates them.
This (for me) fixed the issue with folders (and associated objects) not showing up in the mounted folder structure.
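Assuming the script is saved as mk_bucket_dirs.sh (the filename used in the repository above), a typical invocation might look like this, with the mount point, bucket name, and cleanup flag as arguments:
./mk_bucket_dirs.sh "$HOME/mnt" mybucket y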

Copy lines from multiple files in subfolders into one file

I'm very, very, very new to programming and trying to learn how to make tedious analysis tasks a little faster. I have a master folder (Master) with 50 experiment folders, and within each experiment folder is another set of folders holding text files. I want to extract 2 lines from one of the text files (the experiment title on line 7, the slope on line 104) and copy them to a new single file.
So far, all I have learned is how to extract the lines and add to a new file.
sed -n '7p; 104 p' reco.txt >> results.txt
How can I extract these two lines from all files 'reco.txt' in the subfolder of the folder 'Master' and export into a single text file?
As much explanation as you can bear would be great to help me learn.
You can use find in combination with xargs for this. On its own, you can get a list of all relevant files:
find . -name reco.txt -print
This finds all files named reco.txt in the current directory (.) or any subdirectories and writes them to standard output.
Now, normally you could use the -exec action of find, which runs a program for each file found; with the -exec ... + form, multiple results are combined into a single invocation (appended to the command line). Your particular sed expression only works on one file at a time, since sed numbers lines across all of its inputs as one continuous stream.
So, instead of -exec, you can use xargs which is essentially the same thing but with more control.
find Master -name reco.txt -print0 | xargs -0 -n1 sed -n '7p; 104 p' > results.txt
This does the following:
Searches in the directory Master or subdirectories for any file named reco.txt.
Outputs each filename with null-terminator instead of newline (-print0) -- this allows the full path to contain characters that usually need escaping (such as spaces)
Pipes the result into xargs, which does the following:
Accepts null-terminated strings (-0)
Only puts at most one file into each command (-n1)
Runs sed -n '7p; 104 p' on that file
Entire output is redirected to results.txt, which will overwrite any existing contents in the file.
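For reference, the same thing can be sketched with -exec alone, which likewise runs sed once per file (with GNU sed, the -s option would instead let a single invocation number each file separately):
find Master -name reco.txt -exec sed -n '7p; 104p' {} \; > results.txt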

Why does grep hang when run against the / directory?

My question is in two parts :
1) Why does grep hang when I grep all files under "/" ?
for example :
grep -r 'h' ./
(Note: right before the hang/crash, I see some "No such device or address" messages regarding sockets.)
Of course, I know that grep shouldn't run against a socket, but I would think that since sockets are just files in Unix, it should return a negative result rather than crashing.
2) Now, my follow-up question: in any case, how can I grep the whole filesystem? Are there certain *NIX directories which we should leave out when doing this? In particular, I'm looking for all recently written log files.
As @ninjalj said, if you don't use -D skip, grep will try to read all your device files, socket files, and FIFO files. In particular, on a Linux system (and many Unix systems), it will try to read /dev/zero, which appears to be infinitely long.
You'll be waiting for a while.
If you're looking for a system log, starting from /var/log is probably the best approach.
If you're looking for something that really could be anywhere in your file system, you can do something like this:
find / -xdev -type f -print0 | xargs -0 grep -H pattern
The -xdev argument to find tells it to stay within a single filesystem; this will avoid /proc and /dev (as well as any mounted filesystems). -type f limits the search to ordinary files. -print0 prints the file names separated by null characters rather than newlines; this avoids problems with files having spaces or other funny characters in their names.
xargs reads a list of file names (or anything else) on its standard input and invokes the specified command on everything in the list. The -0 option works with find's -print0.
The -H option to grep tells it to prefix each match with the file name. By default, grep does this only if there are two or more file names on its command line. Since xargs splits its arguments into batches, it's possible that the last batch will have just one file, which would give you inconsistent results.
Consider using find ... -name '*.log' to limit the search to files with names ending in .log (assuming your log files have such names), and/or using grep -I ... to skip binary files.
Note that all this depends on GNU-specific features. Some of these options might not be available on MacOS (which is based on BSD) or on other Unix systems. Consult your local documentation, and consider installing GNU findutils (for find and xargs) and/or GNU grep.
Before trying any of this, use df to see just how big your root filesystem is. Mine is currently 268 gigabytes; searching all of it would probably take several hours. A few minutes spent (a) restricting the files you search and (b) making sure the command is correct will be well worth the time you spend.
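Since the stated goal is recently written log files, here is a narrower sketch along the same lines (GNU find and grep assumed; -mmin -60, a hypothetical window, selects files modified within the last hour, and -I skips binary files):
find /var/log -xdev -type f -name '*.log' -mmin -60 -print0 | xargs -0 grep -IH pattern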
By default, grep tries to read every file. Use -D skip to skip device files, socket files and FIFO files.
If you keep seeing error messages, then grep is not hanging. Keep iotop open in a second window to see how hard your system is working to pull all the contents off its storage media into main memory, piece by piece. This operation is expected to be slow unless you have a very bare-bones system.
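For example, the search from the question could be retried as (GNU grep; -D skip avoids device, socket, and FIFO files, -I skips binary files):
grep -r -D skip -I 'h' /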
Now, my follow-up question: in any case, how can I grep the whole filesystem? Are there certain *NIX directories which we should leave out when doing this? In particular, I'm looking for all recently written log files.
Grepping the whole FS is very rarely a good idea. Try grepping the directory where the log files should have been written; likely /var/log. Even better, if you know anything about the names of the files you're looking for (say, they have the extension .log), then do a find or locate and grep the files reported by those programs.