Find all gz files that are not empty - find

I'm desperately searching for a way to find, in a directory, all the gzip files that are not empty. The goal is to retrieve all the logs for a given date through ssh and rsync them into a local directory, but I can get 10k files depending on the date, and a lot of them are empty, so I would like to filter them before running the rsync.
I know I can find all the gz files like this:
ssh toto "find /logexport/proxies*/logs/ -type f -name '*20170511*.gz'" > test.txt
but I would like to filter out the empty ones. If they weren't gzipped I could use:
! -size 0
For now I rsync all the files into a folder and then I filter them like this:
for f in "${FOLDER}"/*; do
    if [[ $(gunzip -c "$f" | head -c1 | wc -c) == "0" ]]; then
        rm -f "$f"
    fi
done
Do you know how to combine the last command into the first one?
The goal is to get, through ssh, a list of all the gz files that contain data.

If the gzip files have no additional header information, such as a file name, then all of the empty gzip files should be 20 bytes long.
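Assuming that 20-byte figure holds for your files (worth a quick spot check with gzip -l on a known-empty archive), you could filter by size directly in the remote find, something like:
ssh toto "find /logexport/proxies*/logs/ -type f -name '*20170511*.gz' -size +20c" > test.txt
-size +20c matches only files strictly larger than 20 bytes, so the empty archives never make it into the list you feed to rsync.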

Related

Deleting lines from file not matching pattern with sed

I generate a list of everything in a directory (subdirectories and all files) with
ls -R $DIRPATH | awk '/:$/&&f{s=$0;f=0} /:$/&&!f{sub(/:$/,"");s=$0;f=1;next} NF&&f{ print s"/"$0 }' > filelist
and I would like to delete all files not ending in a certain file extension, for example .h. I am trying this with
sed -ne '/.h$/p' filelist > filelist_h
but this lets through files like C:/dev/boost/boost_1_59_0/boost/graph. How do I get this to match .h and not just anything ending in h?
find is the tool you are looking for:
find "$DIRPATH" -type f -name '*.h'

How to rename files downloaded with wget -r

I want to download an entire website using the wget -r command and change the name of the file.
I have tried with:
wget -r -o doc.txt "http....
hoping that the OS would automatically create files in order, like doc1.txt, doc2.txt, but it actually saves the stdout stream in that file.
Is there any way to do this with just one command?
Thanks!
-r tells wget to recursively get resources from a host.
-o file saves log messages to file instead of standard error. That is not what you are looking for; you probably mean -O file.
-O file stores the resource(s) in the given file, instead of creating a file in the current directory with the name of the resource. If used in conjunction with -r, it causes wget to store all resources concatenated into that file.
Since wget -r downloads and stores more than one file, recreating the server's file tree on the local system, it makes no sense to give a single file name to store everything in.
If what you want is to rename all downloaded files to match the pattern docX.txt, you can do it with a separate command after wget has finished:
wget -r http....
i=1
while read -r file
do
    mv "$file" "$(dirname "$file")/doc$i.txt"
    i=$(( i + 1 ))
done < <(find . -type f)

How can I extract multiple .gz log files in command line

I have a year's worth of log files that are all in .gz files. Is there a command I can use to extract them all at once into their current directory? I tried unzip *.gz but it doesn't work. Any other suggestions?
Shell script?
#!/bin/ksh
TEMPFILE=tempksh_$$.tmp        # create a temp file name
> $TEMPFILE                    # create the (empty) file
# build a throwaway script with one unzip command per .gz item
# and append it to TEMPFILE
ls -l | grep '.*\.gz$' \
    | awk '{printf "unzip %s;\n", $9;}' \
    >> $TEMPFILE
chmod 755 $TEMPFILE            # give run permissions
./$TEMPFILE                    # and run it
rm -f $TEMPFILE                # clean up
Untested, but I think you get the idea...
Actually, with a little fiddling it gets far simpler...
set -A ARR *.gz
for i in "${ARR[@]}"; do unzip "$i"; done
unset ARR
For Google's sake, since it took me here, it's as simple as this:
gzip -dc access.log.*.gz > access.log
As noted in a comment, you want to use gunzip, not unzip. unzip is for .zip files. gzip is for .gz files. Two completely different formats.
gunzip *.gz
or:
gzip -d *.gz
That will delete the .gz files after successfully decompressing them. If you'd like to keep all of the original .gz files, then:
gzip -dk *.gz
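If the year's worth of logs is spread over subdirectories rather than sitting in one directory, an untested sketch with find decompresses them in place:
find . -type f -name '*.gz' -exec gunzip {} +
Each file is decompressed next to its .gz source; add -k (where your gzip supports it) if you want to keep the originals.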

Grep data and output to file

I'm attempting to extract data from log files and organise it systematically. I have about 9 log files which are ~100 MB each.
What I'm trying to do is: Extract multiple chunks from each log file, and for each chunk extracted, I would like to create a new file and save this extracted data to it. Each chunk has a clear start and end point.
Basically, I have made some progress and am able to extract the data I need, however, I've hit a wall in trying to figure out how to create a new file for each matched chunk.
I'm unable to use a programming language like Python or Perl, due to the constraints of my environment. So please excuse the messy command.
My command thus far:
find Logs\ 13Sept/Log_00000000*.log -type f -exec \
sed -n '/LRE Starting chunk/,/LRE Ending chunk/p' {} \; | \
grep -v -A1 -B1 "Starting chunk" > Logs\ 13Sept/Chunks/test.txt
The LRE Starting chunk and LRE Ending chunk are my boundaries. Right now my command works, but it saves all matched chunks to one file (whose size is becoming excessive).
How do I go about creating a new file for each match and adding the matched content to it, keeping in mind that each log file could hold multiple chunks and is not limited to one chunk per file?
Probably need something more programmable than sed: I'm assuming awk is available.
awk '
    /LRE Ending chunk/   {printing = 0}
    printing             {print > ("chunk" n ".txt")}
    /LRE Starting chunk/ {printing = 1; n++}
' *.log
Try something like this:
find Logs\ 13Sept/Log_00000000*.log -type f -print | while read -r file; do
    sed -n '/LRE Starting chunk/,/LRE Ending chunk/p' "$file" | \
    grep -v -A1 -B1 "Starting chunk" > "Logs 13Sept/Chunks/$(basename "$file").chunk.txt"
done
This loops over the find results, runs the pipeline on each file, and creates one .chunk.txt in the Chunks directory per input file.
Something like this perhaps?
find Logs\ 13Sept/Log_00000000*.log -type f -exec \
sed -n '/LRE Starting chunk/,/LRE Ending chunk/{;/LRE .*ing chunk/d;w\
'"{}.chunk"';}' {} \;
This uses sed's w command to write to a file named (inputfile).chunk. If that is not acceptable, perhaps you can use sh -c '...' to pass in a small shell script to wrap the sed command with. (Or is a shell script also prohibited for some reason?)
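A rough, untested sketch of that sh -c wrapper, reusing the sed/grep pipeline from the question instead of the w trick (writing each result next to its input file is an assumption):
find Logs\ 13Sept/Log_00000000*.log -type f -exec sh -c '
    for f in "$@"; do
        # same range selection and filter as in the question, one output file per log
        sed -n "/LRE Starting chunk/,/LRE Ending chunk/p" "$f" \
            | grep -v -A1 -B1 "Starting chunk" > "$f.chunk.txt"
    done
' sh {} +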
Perhaps you could use csplit to do the splitting, then truncate the output files at the chunk end.
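An untested sketch of the csplit idea, assuming GNU csplit and GNU sed; the log name is a placeholder for one of your files:
# split at every "LRE Starting chunk" line into chunk_000.txt, chunk_001.txt, ...
csplit -z -f chunk_ -b '%03d.txt' Log_0000000001.log '/LRE Starting chunk/' '{*}'
# then truncate each piece at its "LRE Ending chunk" line
for c in chunk_*.txt; do
    sed -i -n '1,/LRE Ending chunk/p' "$c"
done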

How can I count the number of times that different PDF files are accessed in an Apache log file?

I have a log file which contains traffic for an entire server. The server serves multiple domains, but I know that all of the PDF files I want to count are in /some/directory/.
I know that I can get a list of all the PDF files I want if I grep that directory for the 'pdf' extension.
How can I then count how many times each PDF was accessed individually from the command line?
This is a bit longer than one line, but it will give you a better summary. You can adjust the path to the PDFs and the Apache access_log file, and just paste it into the command line or put it in a bash script:
for file in $(ls /path/to/pdfs | grep pdf)
do
    COUNT=$(grep -c "$file" access_log)
    echo "$file $COUNT"
done
Grep for the name of the pdf file in your log and use the -c option to count occurrences. For example:
grep -c myfile.pdf apache.log
If you have hundreds of files, create a single file with a list of all the filenames, e.g.
$ cat filelist.txt
foo.pdf
bar.pdf
and then use grep in a loop
while read -r filename
do
    COUNT=$(grep -c "$filename" apache.log)
    echo "$filename:$COUNT"
done < filelist.txt
This will print out how many times each pdf file occurred in the log.
Use grep to identify the rows with your pdf and then wc -l to count the rows found:
grep /your/pdf logfile | wc -l
You may also want to distinguish 200 responses from 302 responses, i.e. whether the user only accessed a page or fetched the full document (some PDF readers only load a page at a time).
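For example, a rough sketch that keeps only the 200 responses, assuming the common/combined log format where the request path is field 7 and the status code is field 9 (adjust /some/directory/ to your layout):
awk '$7 ~ /^\/some\/directory\/.*\.pdf/ && $9 == 200 {count[$7]++}
     END {for (f in count) print f, count[f]}' apache.log
This prints one line per PDF path with the number of 200 hits it received.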