Is there a way to list files smaller than a certain size in HDFS, using the command line or even a Spark script?
Scala/Spark would be great, as it may run faster than the command line.
I have looked at the Apache FileSystem documentation but could not find much information.
You can use the command below to show files larger than 1000 bytes (roughly 1 KB):
hdfs dfs -ls -R / | awk '$5 > 1000'
Similarly, you can use the command below to show files smaller than 1000 bytes:
hdfs dfs -ls -R / | awk '$5 < 1000'
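If you only need the paths (say, to feed into another command), you can make the threshold a variable and print just the path column. A small variation on the same idea, where /your/dir and the 1 MB limit are placeholders and $8 is the path field in the default ls -R output:
hdfs dfs -ls -R /your/dir | awk -v max=1048576 '$5 < max {print $8}'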
Hope that helps.
I want to find the maximum file size in an HDFS directory. Does anyone have any idea how to find it? I'm on Hadoop 2.6.0.
I found hadoop fs -ls -S /url, which sorts output by file size, in the Hadoop 2.7.0 documentation, but it's not supported in 2.6.0. So is there a similar way to sort output files by size? Thank you!
You can make use of the hdfs fsck command to get the file sizes.
For example, when I execute hdfs fsck /tmp/ -files, I get the following output:
/tmp <dir>
/tmp/100GB <dir>
/tmp/100GB/Try <dir>
/tmp/100GB/Try/1.txt 5 bytes, 1 block(s): OK
/tmp/100GB/_SUCCESS 0 bytes, 0 block(s): OK
/tmp/100GB/part-m-00000 107374182400 bytes, 800 block(s): OK
/tmp/100GB/part-m-00001._COPYING_ 44163923968 bytes, 330 block(s):
/tmp/10GB <dir>
/tmp/10GB/_SUCCESS 0 bytes, 0 block(s): OK
/tmp/10GB/part-m-00000 10737418300 bytes, 81 block(s): OK
/tmp/1GB <dir>
/tmp/1GB/_SUCCESS 0 bytes, 0 block(s): OK
/tmp/1GB/part-m-00000 1073741900 bytes, 9 block(s): OK
/tmp/1GB/part-m-00001 1073741900 bytes, 9 block(s): OK
It recursively lists all the files under /tmp along with their sizes.
Now, to parse out the file with max size, you can execute the following command:
hdfs fsck /tmp/ -files | grep "/tmp/" | grep -v "<dir>" | gawk '{print $2, $1;}' | sort -n
This command does the following:
hdfs fsck /tmp/ -files - runs an HDFS file system check on the folder /tmp/ and reports on each of the files under /tmp/
grep "/tmp/" - greps for /tmp/ (the folder we want to search), leaving only the files and folders under /tmp/
grep -v "<dir>" - removes the directories from the output (since we only want files)
gawk '{print $2, $1;}' - prints the file size ($2) followed by the file name ($1)
sort -n - does a numeric sort on the file size, so the last file in the list is the file with the largest size
You can pipe the output to tail -1 to get the largest file.
For example, I got this output:
107374182400 /tmp/100GB/part-m-00000
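Putting it together, the full command with tail -1 appended is:
hdfs fsck /tmp/ -files | grep "/tmp/" | grep -v "<dir>" | gawk '{print $2, $1;}' | sort -n | tail -1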
Try this to find the largest file (-h is omitted so the sizes in column 5 are plain bytes and sort numerically):
hdfs dfs -ls /path | sort -r -n -k 5
Please try the command below.
hadoop fs -du Folder | sort -n -r | head -n 1
I need to free up some disk space on my web server and would like to ask whether running the command below would break anything.
My server is running CentOS 6 with cPanel/WHM.
$ find / -type f -name "*.tar.gz" -exec rm -i {} \;
Any help or advice will be greatly appreciated.
Thanks.
Well, you'll lose logs if they are already compressed, and uploaded files, if there are any. By default there shouldn't be any such files on a freshly installed system. Personally, I think it's wrong to just jettison whatever you can instead of trying to find the cause.
You can try finding what's occupying space by running:
du -hs / # shows how much root directory occupies
Compare that to the output of:
df -h # shows used space on disks
If the numbers are far apart, you probably have deleted files that are still held open, and a simple reboot will reclaim that space for you.
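As an alternative to rebooting, you can usually spot such files with lsof, if it's installed (a quick check, not specific to CentOS or cPanel):
lsof +L1    # lists open files with a link count below 1, i.e. deleted but still held open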
If the numbers do roughly match, you can proceed recursively:
cd <dir>; du -hs * # enter directory and calculate size of its contents
You can do that starting from / and descending into the biggest directory each time. Eventually you'll find where the space went. :)
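As a shortcut for the same idea (assuming GNU du and sort), this lists the largest directories on the root filesystem in one pass:
du -xh / 2>/dev/null | sort -h | tail -20    # -x stays on the / filesystem; sort -h understands the K/M/G suffixes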
PS: CentOS doesn't compress logs by default. You will not detect those logs by searching for archived files, but they can be huge. Compressing them is an easy way to get some space:
Turn on compression in /etc/logrotate.conf:
compress
Compress already rotated logs with:
cd /var/log; find . -type f | grep '.*-[0-9]\+$' | xargs -n1 gzip -9
How do you erase all files below a certain size with gsutil? We could use a script to filter the output of gsutil ls, but that sounds like overkill.
gsutil doesn't have any direct support (as in command-line flags) for operating only on files below a given size. Instead, you'd have to use a script, such as:
gsutil ls -l gs://your-bucket | awk '{if ($1 < 1024) print $NF}' | xargs some-command
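For example, to actually delete the matching objects you could pipe the names into gsutil rm. This is only a sketch: the numeric test on $1 skips gsutil's trailing TOTAL line, 1024 is a placeholder threshold, and it's worth swapping gsutil rm for echo first to preview what would be removed:
gsutil ls -l gs://your-bucket | awk '$1 ~ /^[0-9]+$/ && $1 < 1024 {print $NF}' | xargs gsutil rm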
I'm attempting to extract data from log files and organise it systematically. I have about 9 log files which are ~100 MB each in size.
What I'm trying to do is extract multiple chunks from each log file, and for each chunk extracted, create a new file and save the extracted data to it. Each chunk has a clear start and end point.
Basically, I have made some progress and am able to extract the data I need; however, I've hit a wall in trying to figure out how to create a new file for each matched chunk.
I'm unable to use a programming language like Python or Perl due to the constraints of my environment, so please excuse the messy command.
My command thus far:
find Logs\ 13Sept/Log_00000000*.log -type f -exec \
sed -n '/LRE Starting chunk/,/LRE Ending chunk/p' {} \; | \
grep -v -A1 -B1 "Starting chunk" > Logs\ 13Sept/Chunks/test.txt
The LRE Starting chunk and LRE Ending chunk lines are my boundaries. Right now my command works, but it saves all matched chunks to one file (whose size is becoming excessive).
How do I go about creating a new file for each match and adding the matched content to it? Keep in mind that each log file can hold multiple chunks and is not limited to one chunk per file.
You probably need something more programmable than sed; I'm assuming awk is available.
awk '
    # stop printing when the end marker is seen (the marker line itself is not printed)
    /LRE Ending chunk/ {printing = 0}
    # while inside a chunk, append the line to the current chunk file
    printing {print > ("chunk" n ".txt")}
    # start a new chunk on the start marker (again, the marker line itself is not printed)
    /LRE Starting chunk/ {printing = 1; n++}
' *.log
Try something like this:
find Logs\ 13Sept/Log_00000000*.log -type f -print | while read -r file; do \
    sed -n '/LRE Starting chunk/,/LRE Ending chunk/p' "$file" | \
    grep -v -A1 -B1 "Starting chunk" > "Logs 13Sept/Chunks/$(basename "$file").chunk.txt";
done
This loops over the find results, runs the sed/grep pipeline for each file, and creates one .chunk.txt file per input file in the Chunks directory.
Something like this perhaps?
find Logs\ 13Sept/Log_00000000*.log -type f -exec \
sed -n '/LRE Starting chunk/,/LRE Ending chunk/{;/LRE .*ing chunk/d;w\
'"{}.chunk"';}' {} \;
This uses sed's w command to write to a file named (inputfile).chunk. If that is not acceptable, perhaps you can use sh -c '...' to pass in a small shell script that wraps the sed command, as sketched below. (Or is a shell script also prohibited for some reason?)
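For example, a small sh -c wrapper around your existing pipeline (a sketch; it writes one .chunk.txt next to each log file rather than into the Chunks directory):
find Logs\ 13Sept/Log_00000000*.log -type f -exec sh -c '
    sed -n "/LRE Starting chunk/,/LRE Ending chunk/p" "$1" |
        grep -v -A1 -B1 "Starting chunk" > "$1.chunk.txt"
' sh {} \;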
Perhaps you could use csplit to do the splitting, then truncate the output files at the chunk end.
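For example, with GNU csplit and sed (a sketch for a single log file, shown here with a placeholder name; each piece keeps its marker lines, and the loop trims everything after the end marker):
csplit -z -f chunk 'Logs 13Sept/Log_00000000N.log' '/LRE Starting chunk/' '{*}'
for f in chunk*; do sed -i '/LRE Ending chunk/q' "$f"; done    # keep up to and including the end marker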
Is there a quick-and-dirty way to tell programmatically, in a shell script or in Perl, whether a path is located on a remote filesystem (NFS or the like) or a local one? Or is the only way to do this to parse /etc/fstab and check the filesystem type?
stat -f -c %T <filename> should do what you want. You might also want -l.
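For example, a minimal check in shell (a sketch, assuming GNU stat; the exact type names, such as nfs, depend on your system, and /path/to/check is a placeholder):
fstype=$(stat -f -c %T /path/to/check)    # filesystem type in human-readable form
case "$fstype" in
    nfs*|cifs*|smb*) echo "remote" ;;
    *)               echo "local"  ;;
esac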
You can use "df -T" to get the filesystem type for the directory, or use the -t option to limit reporting to specific types (like nfs) and if it comes back with "no file systems processed", then it's not one of the ones you're looking for.
df -T "$dir" | tail -1 | awk '{print $2;}'
If you use df on a directory, you get info only for the device it resides on, e.g. for the current directory:
df .
Then, you can just parse the output, e.g.
df . | tail -1 | awk '{print $1}'
to get the device name.
I have tested the following on Solaris 7, 8, 9 & 10 and it seems to be reliable:
/bin/df -g <filename> | tail -2 | head -1 | awk '{print $1}'
This should give you the filesystem type, rather than trying to match a "host:path" pattern in your mount point.
On some systems, the device number is negative for NFS files. Thus,
print "remote" if (stat($filename))[0] < 0