How to grep over 1 million files? - perl

I need to grep about 1 million files. If there's a better way to do this, let me know. I was thinking there may be a faster way to do it in perl.
What I'm trying to do is export every line that contains the text httpsfile.
Here's what I'm trying to run:
grep 'httpsfile' * >> grepped.txt
Here's the error I'm getting:
-bash: /bin/grep: Argument list too long
Any help would be appreciated.

You can do it in parallel if you want:
ls > /tmp/files
parallel -a /tmp/files --xargs -s 100 grep 'httpsfile'
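A variant of the same idea that avoids parsing ls (and so also handles oddly named files) is to feed GNU parallel a null-delimited list straight from find; this is only a sketch and assumes GNU parallel and GNU find are installed:
find . -maxdepth 1 -type f -print0 | parallel -0 -X grep 'httpsfile' > grepped.txt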

Unless you have a lot of RAM and your one million files are already in the buffer cache, parallelizing won't be of any help, given that the operation will be I/O bound. So here is the fastest, still portable (POSIX) way:
find . -exec grep httpsfile {} + > grepped.txt
Note that unlike the accepted answer's solution, using find won't fail with oddly named files. Have a look at https://unix.stackexchange.com/questions/128985/why-not-parse-ls
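If all one million files sit directly in a single directory, a common refinement is to stop find from descending into subdirectories; -maxdepth is a GNU/BSD extension rather than POSIX, so this sketch assumes one of those platforms:
find . -maxdepth 1 -type f -exec grep -H httpsfile {} + > grepped.txt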

Try ls | xargs grep httpsfile.
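If some file names contain spaces or other special characters, a null-delimited variant is safer; printf is a shell builtin, so it is not subject to the argument-list limit that broke the original command (a sketch assuming bash and GNU xargs):
printf '%s\0' * | xargs -0 grep 'httpsfile' > grepped.txt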

Just change * to ./, or whatever the root directory that contains the 1 million files is. You might need to add -r as well to make grep recursive so that it looks into nested directories.
* in the shell expands out into all the files.
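Putting that together, the command would look something like:
grep -r 'httpsfile' ./ > grepped.txt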

Related

Sorting and removing duplicates from single or multiple large files

I have a 70 GB file with 400 million+ lines (JSON). My end goal is to remove duplicate lines so I have a fully "de-duped" version of the file. I am doing this on a machine with 8 cores and 64 GB of RAM.
I am also expanding on this thread, 'how to sort out duplicates from a massive list'.
Things I have tried:
Neek (JavaScript) - quickly runs out of memory
Using Awk (doesn't seem to work for this)
Using Perl (perl -ne 'print unless $dup{$_}++;') - again, runs out of memory
sort -u largefile > targetfile
does not seem to work. I think the file is too large.
Current approach:
Split the files into chunks of 5million lines each.
Sort/Uniq each of the files
for X in *; do sort -u --parallel=6 $X > sorted/s-$X; done
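(For reference, the splitting step described above could be done with something like the following; big.json is a hypothetical name for the 70 GB source file, and GNU split is assumed:)
split -l 5000000 big.json chunk-    # writes chunk-aa, chunk-ab, ...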
Now I have 80 individually sorted files. I am trying to re-merge/de-dupe them using sort -m. This seems to do nothing, as the file/line size ends up being the same.
Since sort -m does not seem to work, I am currently trying this:
cat *.json | sort > big-sorted.json
then I will try to run uniq with
uniq big-sorted.json > unique-sorted.json
Based on past experience, I do not believe this will work.
What is the best approach here? How do I re-merge the files and remove any duplicate lines at this point?
Update 1
As I suspected, cat * | sort > bigfile did not work. It just copied everything to a single file the way it was previously sorted (in individual files).
Update 2:
I also tried the following code:
cat *.json | sort --parallel=6 -m > big-sorted.json
The result was the same as the previous update.
I am fresh out of ideas.
Thanks!
After some trial and error, I found the solution:
sort -us -o out.json infile.json
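For the record, the individually sorted chunk files from the earlier step could likely have been merged and de-duplicated in a single pass; sort -m alone merges but keeps duplicates, which is why the earlier merge attempt left the line count unchanged, whereas adding -u drops them (a sketch, assuming the sorted chunks live in sorted/):
sort -m -u sorted/s-* -o big-deduped.json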

Why does grep hang when run against the / directory?

My question is in two parts :
1) Why does grep hang when I grep all files under "/" ?
for example :
grep -r 'h' ./
(Note: right before the hang/crash, I see some "no such device or address" messages regarding sockets.)
Of course, I know that grep shouldn't run against a socket, but I would think that since sockets are just files in Unix, it should return a negative result, rather than crashing.
2) Now, my follow-up question: In any case -- how can I grep the whole filesystem? Are there certain *NIX directories which we should leave out when doing this? In particular, I'm looking for all recently written log files.
As #ninjalj said, if you don't use -D skip, grep will try to read all your device files, socket files, and FIFO files. In particular, on a Linux system (and many Unix systems), it will try to read /dev/zero, which appears to be infinitely long.
You'll be waiting for a while.
If you're looking for a system log, starting from /var/log is probably the best approach.
If you're looking for something that really could be anywhere in your file system, you can do something like this:
find / -xdev -type f -print0 | xargs -0 grep -H pattern
The -xdev argument to find tells it to stay within a single filesystem; this will avoid /proc and /dev (as well as any mounted filesystems). -type f limits the search to ordinary files. -print0 prints the file names separated by null characters rather than newlines; this avoids problems with files having spaces or other funny characters in their names.
xargs reads a list of file names (or anything else) on its standard input and invokes the specified command on everything in the list. The -0 option works with find's -print0.
The -H option to grep tells it to prefix each match with the file name. By default, grep does this only if there are two or more file names on its command line. Since xargs splits its arguments into batches, it's possible that the last batch will have just one file, which would give you inconsistent results.
Consider using find ... -name '*.log' to limit the search to files with names ending in .log (assuming your log files have such names), and/or using grep -I ... to skip binary files.
Note that all this depends on GNU-specific features. Some of these options might not be available on MacOS (which is based on BSD) or on other Unix systems. Consult your local documentation, and consider installing GNU findutils (for find and xargs) and/or GNU grep.
Before trying any of this, use df to see just how big your root filesystem is. Mine is currently 268 gigabytes; searching all of it would probably take several hours. A few minutes spent (a) restricting the files you search and (b) making sure the command is correct will be well worth the time you spend.
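Putting those suggestions together for the "recently written log files" case, a sketch assuming GNU find and GNU grep (-mtime -1 restricts the search to files modified within the last day, and pattern is a placeholder):
find /var/log -xdev -type f -name '*.log' -mtime -1 -print0 | xargs -0 grep -IH pattern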
By default, grep tries to read every file. Use -D skip to skip device files, socket files and FIFO files.
If you keep seeing error messages, then grep is not hanging. Keep iotop open in a second window to see how hard your system is working to pull all the contents off its storage media into main memory, piece by piece. This operation is bound to be slow unless you have a very bare-bones system.
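For example, a sketch of that suggestion; pattern is a placeholder, and the 2>/dev/null can be dropped if you want to keep watching the error messages:
grep -r -D skip 'pattern' / 2>/dev/null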
Now, my follow-up question: In any case -- how can I grep the whole filesystem? Are there certain *NIX directories which we should leave out when doing this? In particular, I'm looking for all recently written log files.
Grepping the whole FS is very rarely a good idea. Try grepping the directory where the log files should have been written; likely /var/log. Even better, if you know anything about the names of the files you're looking for (say, they have the extension .log), then do a find or locate and grep the files reported by those programs.

How do I do a recursive find & replace within an SVN checkout?

How do I find and replace every occurrence of:
foo
with
bar
in every text file under the /my/test/dir/ directory tree (recursive find/replace).
BUT I want to be able to do it safely within an SVN checkout and not touch anything inside the .svn directories
Similar to this but now with the SVN restriction: Awk/Sed: How to do a recursive find/replace of a string?
There are several possibilities:
Using find:
Using find to create a list of all files, and then piping them to sed or the equivalent, as suggested in the answer you reference, is fairly straightforward, and only requires scanning through the files once.
You'd use one of the same answers as from the question you referenced, but adding -path '*/.svn' -prune -o after the find . in order to prune out the SVN directories. See this question for a discussion of using the prune option with find -- although note that they've got the pattern wrong. Thus, to print out all the files, you would use:
find . -path '*/.svn' -prune -o -type f -print
Then, you can pipe that into an xargs call or whatever to do the individual replacements, as suggested in the question you referenced. There is a lot of discussion there about different options, which I won't reproduce here, although I prefer the version from John Zwinck's answer:
find . -path '*/.svn' -prune -o -type f -exec sed -i 's/foo/bar/g' {} +
Using recursive grep:
If you have a system with GNU grep, you can use that to find the list of files as well. This is probably less efficient than find, but it does allow you to only call sed on the files that match, and I personally find the syntax a lot easier to remember (or figure out from manpages):
sed -i 's/foo/bar/g' `grep -l -R --exclude-dir='*/.svn' 'foo' .`
The -l option causes grep to only output the list of file names, rather than the matching lines.
Using a GUI editor:
Alternatively, if you're using Windows, do what I do -- get a copy of the NoteTab editor (available in a free version), and use its search-and-replace-on-disk command, which ignores hidden .svn directories automatically and just works.
Edit: Corrected find pattern to */.svn instead of .svn, added more details and some other possibilities. However, this depends on your platform and svn version: .svn without */ may be required in some cases, like on CentOS 7.
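If the */.svn pattern does not prune on your platform, matching on the directory name instead generally works; a sketch to the same effect as the command above:
find . -type d -name .svn -prune -o -type f -exec sed -i 's/foo/bar/g' {} +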
How about this?
grep -i "search_string" `find . -name "*.some_extension"`
That is a halfway solution to finding a search_string within files that have a specific extension... once you know which files contain the string, it can easily be extended to modify them by piping the list into sed...
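One way to finish that job, as a sketch assuming GNU grep and GNU sed (-Z and -0 keep file names with spaces intact; *.some_extension and search_string are the placeholders from the answer above):
find . -type f -name '*.some_extension' -exec grep -liZ 'search_string' {} + | xargs -0 sed -i 's/foo/bar/g'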

Why isn't this command taking the diff of two directories?

I am asked to diff two directories using Perl but I think something is wrong with my command,
$diff = system("sudo diff -r '/Volumes/$vol1' '/Volumes/$vol2\\ 1/' >> $diff.txt");
It doesn't display any output. Can someone help me with this? Thanks!
It seems that you want to store all differences in a string.
If this is the case, the command in the question is not going to work for a few reasons:
It's hard to tell whether it's intended or not, but the $diff variable is being used to set the filename storing the differences. Perhaps this should be diff.txt, not $diff.txt
The result of the diff command is saved in $diff.txt. It doesn't display anything in STDOUT. This can be remedied by omitting the >> $diff.txt part. If it also needs to be stored in a file, consider the tee command:
sudo diff -r dir1/ dir2/ | tee diff.txt
When the return value of a system call is assigned to a variable, it will be 0 upon success. To quote the documentation:
The return value is the exit status of the program as returned by the wait call.
This means that $diff won't store the differences, but the command exit status. A more sensible approach would be to use backticks. Doing this will allow $diff to store whatever is output to STDOUT by the command:
my $diff = `sudo diff -r dir1/ dir2/ | tee diff.txt`; # Not $diff.txt
Is it a must to use the sudo command? Avoid using it if even remotely possible:
my $diff = `diff -r dir1/ dir2/ | tee diff.txt`; # Not $diff.txt
A final recommendation
Let a good CPAN module take care of this task, as backtick calls can only go so far. Some have already been suggested here; it may be well worth a look.
Is sudo diff being prompted for a password?
If possible, take out the sudo from the invocation of diff, and run your script with sudo.
"It doesn't display and output." -- this is becuase you are saving the differences to a file, and then (presumably) not doing anything with that resulting file.
However, I expect "diff two directories using Perl" does not mean "use system() to do it in the shell and then capture the results". Have you considered doing this in the language itself? For example, see Text::Diff. For more nuanced control over what constitutes a "difference", you can simply read in each file and craft your own algorithm to perform the comparisons and compile the similarities and differences.
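A minimal sketch of the Text::Diff approach for a single pair of files (the directory and file names here are hypothetical; for whole trees you would walk one directory, e.g. with File::Find, and diff each file against its counterpart in the other):
use strict;
use warnings;
use Text::Diff;

my ($dir1, $dir2) = ('/Volumes/vol1', '/Volumes/vol2 1');   # hypothetical paths

# diff() accepts two file names and returns the differences as a string
my $diff = diff "$dir1/report.txt", "$dir2/report.txt", { STYLE => 'Unified' };
print $diff if length $diff;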
You might want to check out Test::Differences for a more flexible diff implementation.

using grep and find commands - basic questions to help me sort it out in my simple mind

I am back with a second no-brainer question, but I would like to get this straight in my head.
I have an assignment in which I am charged with providing a command to find a file named test in my home directory (one command using find, and one using grep). I understand that using find is just 'find ~/test', but using grep, wouldn't I have to search out a pattern within the file 'test'? Or is there a way to search for the file (using grep), even if the file is empty?
ls ~ | grep test
I understand that using find is just 'find ~/test'
No. find ~/test will also match every file or directory under the directory $HOME/test/. Rather, use find ~ -type f -name test.
The assignment sounds unclear. But yes, if you give any filenames to grep, it will look at the contents of the files and ignore the names of the files. Perhaps you can grep the output of another command? Maybe ls as #Reese suggested, or maybe a different find command.
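For instance, grep can be pointed at the output of find instead of at file contents; a sketch (the trailing $ anchors the match so only entries whose name is exactly test are reported):
find ~ | grep '/test$'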
ls -R ~ | grep test
Explanation: ls -R ~ will recursively list all files and directories in your home folder. grep test will narrow down that list to files (and directories) that have "test" in their name.