Grep and extract specific data in multiple log files

I've got multiple log files in a directory and am trying to extract just the timestamp and one part of each log line, namely the value of the fulltext query parameter. The query parameters in a request are separated by an ampersand (&), as shown below.
Input
30/Mar/2022:00:27:36 +0000 [59823] -> GET
/libs/granite/omnisearch?p.guessTotal=1000&fulltext=798&savedSearches%40Delete=&
31/Mar/2022:00:27:36 +0000 [59823] -> GET
/libs/granite/omnisearch?p.guessTotal=1000&fulltext=Dyson+V7&savedSearches%40Delete=&
Intended Output
30/Mar/2022:00:27:36 -> 798
31/Mar/2022:00:27:36 -> Dyson+V7
I've got this command to recursively search over all the files in the directory.
grep -rn "/libs/granite/omnisearch" ~/Downloads/ReqLogs/ > output.txt
This prints the entire log line starting with the directory name, like so
/Users/****/Downloads/ReqLogs/logfile1_2022-03-31.log:6020:31/Mar/2022:00:27:36 +0000 [59823] -> GET /libs/granite/omnisearch?p.guessTotal=1000&fulltext=798&savedSearches%4
How do I manipulate this to achieve the intended output?

grep can return the whole line or the string which matched. For extracting a different piece of data from the matching lines, turn to sed or Awk.
awk -v search="/libs/granite/omnisearch" '$0 ~ search { s = $0; sub(/.*fulltext=/, "", s); sub(/&.*/, "", s); print $1 " -> " s }' ~/Downloads/ReqLogs/*
or
sed -n '\%/libs/granite/omnisearch%s/ .*fulltext=\([^&]*\)&.*/ -> \1/p' ~/Downloads/ReqLogs/*
The sed version is more succinct, but also somewhat more oblique.
\%...% uses the alternate delimiter % so that we can use literal slashes in our search expression.
The s/ .../ -> \1/p then says to take everything from the first space onward on the matching lines, capture whatever sits between fulltext= and the next &, replace the whole match with " -> " followed by the captured substring, and print the resulting line.
The -n flag turns off the default printing action, so that we only print the lines where the substitution succeeded.
The wildcard ~/Downloads/ReqLogs/* matches all files in that directory; if you really need to traverse subdirectories, too, perhaps add find to the mix.
find ~/Downloads/ReqLogs -type f -exec sed -n '\%/libs/granite/omnisearch%s/ .*fulltext=\([^&]*\)&.*/ -> \1/p' {} +
or similarly with the Awk command after -exec. The placeholder {} tells find where to add the name of the found file(s) and + says to put as many as possible in one go, rather than running a separate -exec for each found file. (If you want that, use \; instead of +.)
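For completeness, here is a minimal sketch of the find plus Awk variant run against throwaway sample data. The temporary directory, the file names a.log and sub/b.log, and the variable names are made up for the illustration; the two log lines are the ones from the question. The extra /fulltext=/ test is an addition of mine so that requests without that parameter are skipped rather than printed whole.

```shell
# Hypothetical sample data; stands in for ~/Downloads/ReqLogs.
logdir=$(mktemp -d)
mkdir -p "$logdir/sub"
printf '%s\n' '30/Mar/2022:00:27:36 +0000 [59823] -> GET /libs/granite/omnisearch?p.guessTotal=1000&fulltext=798&savedSearches%40Delete=&' > "$logdir/a.log"
printf '%s\n' '31/Mar/2022:00:27:36 +0000 [59823] -> GET /libs/granite/omnisearch?p.guessTotal=1000&fulltext=Dyson+V7&savedSearches%40Delete=&' 'some unrelated line' > "$logdir/sub/b.log"

# find walks subdirectories; awk keeps the timestamp ($1) and the fulltext value.
extracted=$(find "$logdir" -type f -exec awk -v search="/libs/granite/omnisearch" '
  $0 ~ search && /fulltext=/ {
    s = $0
    sub(/.*fulltext=/, "", s)   # drop everything up to and including fulltext=
    sub(/&.*/, "", s)           # drop the next & and everything after it
    print $1 " -> " s
  }' {} + | sort)

printf '%s\n' "$extracted"
# 30/Mar/2022:00:27:36 -> 798
# 31/Mar/2022:00:27:36 -> Dyson+V7

rm -rf "$logdir"
```

The sort is only there to make the order predictable, since find does not promise any particular traversal order.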

Related

Renaming multiple files with multiple field separator in awk

I need to rename files like PFSI4C.CSC.CCC.FSIContractData20211008.zip to TFSI4C.CSC.CCC.FSIContractData20211104.zip.
Every file name should start with "T" and end with the current system date plus ".zip".
I am trying to loop over files and it looks like this:
for FILENAME in PFSI4C.CSC.CCC.FSIContractData20211008; do
NEW_FILENAME_HEADER=`echo $FILENAME | awk -F "." '{ print $1"."$2"."$3 }'` # which would takes PFSI4C.CSC.CCC.
NEW_FILENAME_SUFFIX=`echo $FILENAME | awk -F "[.|Data20]" '{ print "."$4 }'` # this part where I can't figure out to take only "FSIContract"
NEW_FILENAME="${NEW_FILENAME_HEADER}.""${NEW_FILENAME_SUFFIX}""Data20""${DATE}".zip" # which should make "TFSI4C.CSC.CCC.FSIContractData20211104.zip."
mv $FILENAME $NEW_FILENAME
done
FYI, $DATE in our script is defined like this: DATE=`date +'%y%m%d'`, for example 211104
Thanks in advance!

With Perl's rename command you could try the following code. The -n option runs it in dry-run mode: it only prints each current file name and the name it would be changed to. Remove -n once you are satisfied with the output shown. DATE is a shell variable (DATE='20211104') holding the date that should appear in the new file name.
rename -n 's:^.(.*)\d{8}(\.zip)$:T$1$2:; s:\.zip$:'"$DATE"'.zip:' *.zip
Output will be as follows:
rename(PFSI4C.CSC.CCC.FSIContractData20211008.zip, TFSI4C.CSC.CCC.FSIContractData20211104.zip)
Explanation of rename code:
-n: runs the rename command in dry-run mode.
s:^.(.*)\d{8}(\.zip)$:T$1$2:; is the first substitution. It creates 2 capturing groups: the 1st holds everything from the 2nd character up to just before the 8 digits, and the 2nd holds .zip at the end of the file name. The match is replaced with T$1$2, which swaps the first character for T and drops the 8-digit date.
s:\.zip$:'"$DATE"'.zip: is the second substitution. It replaces the trailing .zip with the value of the shell variable DATE followed by .zip.

First of all you should get the current date with date +%Y%m%d (4-digit year) instead of date +%y%m%d (2-digit year). The following assumes you do that. If that is not an option, prepend 20 to $DATE.
If your file names all look like the example you show, bash substring expansion can do it. First compute the length, then extract the characters from the second one up to the last one before the date, prepend T, and append $DATE.zip:
len="${#FILENAME}"
NEW_FILENAME="T${FILENAME:1:$((len-13))}$DATE.zip"
But you could also use sed, which offers a bit more flexibility. For instance, it can deal with trailing dates that have a variable number of digits (note that \? is a GNU extension to basic regular expressions):
NEW_FILENAME=$(echo "$FILENAME" | sed 's/.\(.*[^0-9]\)\?[0-9]*\.zip/T\1'"$DATE"'.zip/')
Or, a bit more elegant with bash (here strings) and GNU sed or another sed that supports the -E option (for extended regular expressions):
NEW_FILENAME=$(sed -E 's/.(.*[^0-9])?[0-9]*\.zip/T\1'"$DATE"'.zip/' <<< "$FILENAME")
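To illustrate the "variable number of digits" point, here is a small sketch that runs the -E version over two names. The second name, PABC.Data211008.zip, is invented for the example, DATE is hard-coded instead of coming from date +%Y%m%d, and I added ^ and $ anchors that the command above does not have.

```shell
DATE=20211104   # fixed here for a reproducible result; normally DATE=$(date +%Y%m%d)

new_names=$(
  for f in PFSI4C.CSC.CCC.FSIContractData20211008.zip PABC.Data211008.zip; do
    # Drop the first character, keep everything up to the last non-digit,
    # then replace whatever digits precede .zip with $DATE.
    printf '%s\n' "$f" | sed -E 's/^.(.*[^0-9])?[0-9]*\.zip$/T\1'"$DATE"'.zip/'
  done
)

printf '%s\n' "$new_names"
# TFSI4C.CSC.CCC.FSIContractData20211104.zip
# TABC.Data20211104.zip
```

Both the 8-digit and the 6-digit date end up replaced by the full $DATE, which the fixed-length substring approach would not handle.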

Assumptions:
replace first character (P in OP's sample) with T
replace last 10 characters (YYMMDD.zip) with $DATE.zip (OP has already defined $DATE)
all files contain 20YYMMDD so we don't need to worry about names containing strings like 19YYMMDD and 21YYMMDD
One idea using parameter substitutions (which also eliminates the overhead of subprocess calls to execute various echo, awk and sed commands):
DATE='211104'
FILENAME='PFSI4C.CSC.CCC.FSIContractData20211008.zip'
NEWFILENAME="T${FILENAME/?}" # prepend "T"; "/?" => remove first character
NEWFILENAME="${NEWFILENAME/??????.zip}${DATE}.zip" # remove string "??????.zip"; append "${DATE}.zip"
echo mv "${FILENAME}" "${NEWFILENAME}"
This generates:
mv PFSI4C.CSC.CCC.FSIContractData20211008.zip TFSI4C.CSC.CCC.FSIContractData20211104.zip
Once OP is satisfied with the accuracy of the code, the echo can be removed to enable execution of the mv command.
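Wrapped in a loop, the same idea might look like the sketch below. It uses the anchored forms ${name#?} and ${name%??????.zip} instead of the / substitutions, which is my own variation, and it works in a temporary directory with two dummy files (the second name, PXYZ.Report20210930.zip, is made up) so that it can be run safely.

```shell
workdir=$(mktemp -d)   # scratch directory with hypothetical sample files
touch "$workdir/PFSI4C.CSC.CCC.FSIContractData20211008.zip" \
      "$workdir/PXYZ.Report20210930.zip"

DATE='211104'          # as in the answer above; normally DATE=$(date +%y%m%d)

for path in "$workdir"/P*.zip; do
  name=${path##*/}                    # strip the directory part
  new="T${name#?}"                    # remove the first character, prepend T
  new="${new%??????.zip}${DATE}.zip"  # remove the trailing YYMMDD.zip, append the new date
  mv -- "$path" "$workdir/$new"
done

renamed=$(ls "$workdir")
printf '%s\n' "$renamed"
# TFSI4C.CSC.CCC.FSIContractData20211104.zip
# TXYZ.Report20211104.zip

rm -rf "$workdir"
```

The # and % forms only ever match at the start and end of the value, so there is no risk of touching characters in the middle of the name.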

Extract filename from multiple lines in unix

I'm trying to extract the name of a file that has been generated by a Java program. This Java program prints multiple lines, and I know exactly what the format of the file name is going to be. The informational text that the Java program prints is as follows:
ABCASJASLEKJASDFALDSF
Generated file YANNANI-0008876_17.xml.
TDSFALSFJLSDJF;
I'm capturing the output in a variable and then applying a sed operator in the following format:
sed -n 's/.*\(YANNANI.\([[:digit:]]\).\([xml]\)*\)/\1/p'
The result set is:
YANNANI-0008876_17.xml.
However, my problem is that I want the extraction of the filename to stop at .xml. The last dot should never be extracted.
Is there a way to do this using sed?

Let's look at what your capture group actually captures:
$ grep 'YANNANI.\([[:digit:]]\).\([xml]\)*' infile
Generated file YANNANI-0008876_17.xml.
That's probably not what you intended:
\([[:digit:]]\) captures just a single digit (and the capture group around it doesn't do anything)
\([xml]\)* is "any of x, m or l, 0 or more times", so it matches the empty string (as above – or the line wouldn't match at all!), x, xx, lll, mxxxxxmmmmlxlxmxlmxlm, xml, ...
There is no way the final period is removed because you don't match anything after the capture groups
What would make sense instead:
Match "digits or underscores, 0 or more": [[:digit:]_]*
Match .xml, literally (escape the period): \.xml
Make sure the rest of the line (just the period, in this case) is matched by adding .* after the capture group
So the regex for the string you'd like to extract becomes
$ grep 'YANNANI.[[:digit:]_]*\.xml' infile
Generated file YANNANI-0008876_17.xml.
and to remove everything else on the line using sed, we surround regex with .*\( ... \).*:
$ sed -n 's/.*\(YANNANI.[[:digit:]_]*\.xml\).*/\1/p' infile
YANNANI-0008876_17.xml
This assumes you really meant . after YANNANI (any character).
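Since the question captures the program output in a variable, here is a short sketch of the same regex applied that way. The variable names java_output, xml_name and xml_name_grep are placeholders of mine, and the grep -o line is an alternative that relies on the -o option, which GNU and BSD grep provide.

```shell
# Stand-in for the captured output of the Java program.
java_output='ABCASJASLEKJASDFALDSF
Generated file YANNANI-0008876_17.xml.
TDSFALSFJLSDJF;'

# sed: keep only the captured file name on lines that match.
xml_name=$(printf '%s\n' "$java_output" |
  sed -n 's/.*\(YANNANI.[[:digit:]_]*\.xml\).*/\1/p')

# grep -o: print only the part of the line that matched.
xml_name_grep=$(printf '%s\n' "$java_output" |
  grep -o 'YANNANI.[[:digit:]_]*\.xml')

printf '%s\n' "$xml_name" "$xml_name_grep"
# YANNANI-0008876_17.xml
# YANNANI-0008876_17.xml
```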

You can call sed twice: first in printing mode and then in replacement mode:
sed -n 's/.*\(YANNANI.\([[:digit:]]\).\([xml]\)*\)/\1/p' | sed 's/\.$//g'
The second sed removes the trailing . from every line produced by the first sed.
Or you can go for an awk solution if you prefer:
awk '/.*YANNANI.[0-9]+.[0-9]+.xml/{print substr($NF,1,length($NF)-1)}'
This prints the last field of every line that matches the regex, using substr to cut off its last character.

Using command line to remove text?

I have a huge file that contains lines that follow this format:
New-England-Center-For-Children-L0000392290
Southboro-Housing-Authority-L0000392464
Crew-Star-Inc-L0000391998
Saxony-Ii-Barber-Shop-L0000392491
Test-L0000392334
What I'm trying to do is narrow it down to just this:
New-England-Center-For-Children
Southboro-Housing-Authority
Crew-Star-Inc
Test
Can anyone help with this?

Using GNU awk:
awk -F\- 'NF--' OFS=\- file
New-England-Center-For-Children
Southboro-Housing-Authority
Crew-Star-Inc
Saxony-Ii-Barber-Shop
Test
Set the input and output field separator to -.
NF contains number of fields. Reduce it by 1 to remove the last field.
Using sed:
sed 's/\(.*\)-.*/\1/' file
New-England-Center-For-Children
Southboro-Housing-Authority
Crew-Star-Inc
Saxony-Ii-Barber-Shop
Test
Simple greedy regex to match up to the last hyphen.
In replacement use the captured group and discard the rest.
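As a quick check of the greedy behaviour, the sketch below runs the sed command on a few lines and compares it with an equivalent form that anchors on the end of the line instead of capturing. The NoHyphenHere line is made up to show that lines without a hyphen pass through unchanged.

```shell
sample='New-England-Center-For-Children-L0000392290
Test-L0000392334
NoHyphenHere'

# Greedy .* runs to the last hyphen; the capture keeps what precedes it.
greedy=$(printf '%s\n' "$sample" | sed 's/\(.*\)-.*/\1/')

# Same effect without a capture group: delete the final hyphen and what follows.
anchored=$(printf '%s\n' "$sample" | sed 's/-[^-]*$//')

printf '%s\n' "$greedy"
# New-England-Center-For-Children
# Test
# NoHyphenHere
```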

Version 1 of the Question
The first version of the input was in the form of HTML and parts had to be removed both before and after the desired text:
$ sed -r 's|.*[A-Z]/([a-zA-Z-]+)-L0.*|\1|' input
Special-Restaurant
Eliot-Cleaning
Kennedy-Plumbing
Version 2 of the Question
In the revised question, it is only necessary to remove the text that starts with -L00:
$ sed 's|-L00.*||' input2
New-England-Center-For-Children
Southboro-Housing-Authority
Crew-Star-Inc
Saxony-Ii-Barber-Shop
Test
Both of these commands use a single "substitute" command. The command has the form s|old|new|.

The perl code for this would be: perl -nle 'print $1 if m{-.*?/(.*?-.*?)-}'
We can break the Regex down to matching the following:
- matches the hyphen between the city and state
.*? match the smallest set of character(s) that makes the Regex work, i.e. the State
/ matches the slash between the State and the data you want
( starts the capture of the data you are interested in
.*?-.*? will match the data you care about
) will close out the capture
- will match the dash before the L####### to give the regex something to match after your data. This will prevent the minimal Regex from matching 0 characters.
Then the print statement will print out what was captured (your data).

awk likes these things:
$ awk -F[/-] -v OFS="-" '{print $(NF-3), $(NF-2)}' file
Special-Restaurant
Eliot-Cleaning
Kennedy-Plumbing
This sets / and - as possible field separators. Based on them, it prints the fields at positions NF-3 and NF-2, separated by the delimiter -. Note that $NF stands for the last field, hence $(NF-1) is the penultimate one, etc.
This sed is also helpful:
$ sed -r 's#.*/(\w*-\w*)-\w*\.\w*</loc>$#\1#' file
Special-Restaurant
Eliot-Cleaning
Kennedy-Plumbing
It selects the block word-word that comes after a slash / and is followed by -word.word</loc> and the end of the line. Then, it prints back this block.
Update
Based on your new input, this can make it:
$ sed -r 's/(.*)-L\w*$/\1/' file
New-England-Center-For-Children
Southboro-Housing-Authority
Crew-Star-Inc
Saxony-Ii-Barber-Shop
Test
It selects everything up to the block -L + something + end of line, and prints it back.
You can use also another trick:
rev file | cut -d- -f2- | rev
Since you want every "-"-separated field except the last one, reverse each line, keep everything from the 2nd field onward, and then reverse the result back.

Here's how I'd do it with Perl:
perl -nle 'm{example[.]com/bp/(.*?)/(.*?)-L\d+[.]htm} && print $2' filename
Note: the original question was matching input lines like this:
<loc>http://www.example.com/bp/Lowell-MA/Special-Restaurant-L0000423916.htm</loc>
<loc>http://www.example.com/bp/Houston-TX/Eliot-Cleaning-L0000422797.htm</loc>
<loc>http://www.example.com/bp/New-Orleans-LA/Kennedy-Plumbing-L0000423121.htm</loc>
The -n option tells Perl to loop over every line of the file (but not print them out).
The -l option adds a newline onto the end of every print
The -e 'perl-code' option executes perl-code for each line of input
The pattern:
/regex/ && print
Will only print if the regex matches. If the regex contains capture parentheses you can refer to the first captured section as $1, the second as $2 etc.
If your regex contains slashes, it may be cleaner to use a different regex delimiter ('m' stands for 'match'):
m{regex} && print
If you have a modern Perl, you can use -E to enable modern features and use say instead of print to print with a newline appended:
perl -nE 'm{example[.]com/bp/(.*?)/(.*?)-L\d+[.]htm} && say $2' filename

This is very concise in Perl
perl -i.bak -lpe's/-[^-]+$//' myfile
Note that this will modify the input file in place but will keep a backup of the original data in a file called myfile.bak.

Read word from a file and return next word

Using a shell script, I want to look up a word in a text file and return the word in the neighbouring column.
For example, my input file looks like this:
AGE1 PERSON1
AGE2 PERSON2
AGE3 PERSON3
AGE4 PERSON4
I have a variable in my sh file that holds a PERSON's name.
I want to read the input text file and get that person's age.
Please help, I'm a beginner in shell scripting.

A slightly simpler solution is:
age=$( awk '$2==name { print $1 }' name="$name" input-file )

Building upon shellter's comment:
age=$(grep "$person_name" people_file.txt | cut -f1 -d' ')
I'll try to explain everything. First, I assume some things (but you can change them in your script):
Your file with the data you entered is called people_file.txt.
The person's name you want to find is in the variable $person_name.
The variable you want to store the result is $age.
Firstly, because the value of the $age variable is produced by running commands, we wrap them in $( and ): the shell runs the command (or series of commands) inside and substitutes the text they print.
We first need to find the line which contains the person's name. For that we use grep: grep regex file. Grep reads file line by line and prints every line that matches the regular expression regex. In our case we can simply search for the person's name directly (assuming it doesn't contain special characters, like the period or an asterisk). Note that we must place the variable between double quotes, otherwise a person's name that has a space in it might be split on the command line so that the first name is used as the regular expression and the surname as the file. If you want to search in a case-insensitive manner (for example, John will find a line with JOHN or john), you can use the -i flag: grep -i regex file. The selected lines are printed by grep to its output, but we feed those lines into the input of the next command with the pipe operator |.
Finally, we have a line (or many lines) with the results. Now we must extract the age. The cut command splits each line it reads from the input into fields, and only prints the fields you ask for. In this case, we ask for the first field with the -f1 option. Also, we specify that the space character is to be used as the delimiter (i.e. the character that separates the fields) with the -d' ' option.
If you have more than one line with the same person's name, we need to pipe the output of grep into a head command, so that we can have only the number of lines we want. We can tell head how many lines we want with the -n N option. So if you only want the first match:
age=$(grep "$person_name" people_file.txt | head -n 1 | cut -f1 -d' ')
Hope this helps a little =)
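One caveat worth seeing in action: grep matches substrings, so PERSON1 also selects a PERSON10 line. The sketch below contrasts that with an exact comparison on the second column in awk. The AGE10 PERSON10 row and the temporary file are additions for the demonstration.

```shell
people_file=$(mktemp)   # throwaway stand-in for people_file.txt
printf '%s\n' 'AGE1 PERSON1' 'AGE2 PERSON2' 'AGE10 PERSON10' > "$people_file"
person_name='PERSON1'

# Substring match: also picks up PERSON10.
loose=$(grep "$person_name" "$people_file" | cut -f1 -d' ')

# Exact match on the second column only.
exact=$(awk -v name="$person_name" '$2 == name { print $1 }' "$people_file")

printf 'loose: %s\n' $loose
# loose: AGE1
# loose: AGE10
printf 'exact: %s\n' "$exact"
# exact: AGE1

rm -f "$people_file"
```

Piping through head -n 1 would hide the extra line here, but only because the wanted row happens to come first.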

age=`
perl -nle'
BEGIN { $n = shift(@ARGV); }
print $1 if /^(\S+)\s+\Q$n\E$/;
' "$name" file
`
Tested with bash in sh mode.

Unable to remove carriage returns

Greetings!
I have been tasked with creating a report from files we receive from our hardware suppliers. I need to grep these files for two fields, 'Test_Version' and 'Model-Manufacturer'; for each field, I need to capture the corresponding value.
In a previous post, I found help to create a basic report like so:
find . -name "*.VER" -exec egrep -A 1 'Test_Version=|Model-Manufacturer:' {} ';'
Model-Manufacturer:^M
R22-100^M
Test_Version=2.6.3^M
Model-Manufacturer:^M
R16-300^M
Test_Version=2.6.3^M
However, the data that's output is riddled with DOS carriage returns "^M". My boss wants "Model-Manufacturer" to show like "Test_Version" i.e
Model-Manufacturer:R22-100
Test_Version=2.6.3
Model-Manufacturer:R16-300
Test_Version=2.6.3
Using sed, I attempted to remove the "^M" characters for "Model-Manufacturer" but to no avail:
find . -name "*.VER" -exec egrep -A 1 'Test_Version=|Model-Manufacturer:' {} ';' | sed 's/Model-Manufacturer:^M//g'
This command has no effect. What am I missing here?

Give this a try:
sed '/Model-Manufacturer:/s/\r//g'
If you also have newlines and you want to combine the two lines into one, you can use one of the techniques shown in the answers to your previous question.

You can remove the carriage returns using dos2unix if you have it, or using tr:
tr -d '\r' < file
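Stripping the carriage returns still leaves the model on its own line. A minimal sketch of one way to get the layout asked for in the question is to follow tr with a small awk step that holds back the newline after a bare Model-Manufacturer: line. The input is simulated with printf rather than taken from real .VER files.

```shell
# Simulated grep output with DOS line endings.
report=$(printf 'Model-Manufacturer:\r\nR22-100\r\nTest_Version=2.6.3\r\nModel-Manufacturer:\r\nR16-300\r\nTest_Version=2.6.3\r\n' |
  tr -d '\r' |
  awk '
    /^Model-Manufacturer:$/ { printf "%s", $0; next }  # no newline: the next line is appended
    { print }
  ')

printf '%s\n' "$report"
# Model-Manufacturer:R22-100
# Test_Version=2.6.3
# Model-Manufacturer:R16-300
# Test_Version=2.6.3
```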

If you're using Bash as your shell, or creating the script in vi, you should be able to do:
sed -e 's/<Ctrl-V><Ctrl-M>//g'
to remove the CRs.
Ctrl-V (the keystroke on your keyboard) inserts the next keystroke literally, and Ctrl-M is carriage return.
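If typing a literal control character into a script feels fragile, one alternative sketch is to let printf produce the carriage return and splice it into the sed expression. The variable names cr and cleaned are placeholders.

```shell
cr=$(printf '\r')   # a real carriage return character, no special keystrokes needed

# Remove a carriage return only when it is the last character on the line.
cleaned=$(printf 'Test_Version=2.6.3\r\n' | sed "s/${cr}\$//")

printf '%s\n' "$cleaned"
# Test_Version=2.6.3
```

This avoids relying on sed understanding \r, which not every implementation does.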