Renaming multiple files with multiple field separators in awk - command-line

I need to rename files like the following, e.g. from PFSI4C.CSC.CCC.FSIContractData20211008.zip to TFSI4C.CSC.CCC.FSIContractData20211104.zip.
Every file's name should start with "T" and end with the current system date + ".zip".
I am trying to loop over files and it looks like this:
for FILENAME in PFSI4C.CSC.CCC.FSIContractData20211008; do
    NEW_FILENAME_HEADER=`echo $FILENAME | awk -F "." '{ print $1"."$2"."$3 }'`   # which takes "PFSI4C.CSC.CCC"
    NEW_FILENAME_SUFFIX=`echo $FILENAME | awk -F "[.|Data20]" '{ print "."$4 }'` # this is the part where I can't figure out how to take only "FSIContract"
    NEW_FILENAME="${NEW_FILENAME_HEADER}.""${NEW_FILENAME_SUFFIX}""Data20""${DATE}.zip" # which should make "TFSI4C.CSC.CCC.FSIContractData20211104.zip"
    mv $FILENAME $NEW_FILENAME
done
FYI, $DATE in our script is defined like this: DATE=$(date +'%y%m%d'), for example 211104.
Thanks in advance!

With Perl's rename command you could try the following code. The -n option runs it in DRY RUN mode: it only prints which file name (current) would be changed to which file name (required); remove -n once you are satisfied with the shown output. Also, DATE (DATE='20211104') is a shell variable which contains the date value needed in the new file name.
rename -n 's:^.(.*)\d{8}(\.zip)$:T$1$2:; s:\.zip$:'"$DATE"'.zip:' *.zip
Output will be as follows:
rename(PFSI4C.CSC.CCC.FSIContractData20211008.zip, TFSI4C.CSC.CCC.FSIContractData20211104.zip)
Explanation of rename code:
-n: Runs the rename command in DRY RUN mode.
s:^.(.*)\d{8}(\.zip)$:T$1$2:;: The 1st substitution in the rename code. It creates two capturing groups: the 1st captures everything from the 2nd character onwards up to just before the 8 digits, and the 2nd captures the .zip at the end of the file name. The match is replaced with T$1$2 as per the requirement.
s:\.zip$:'"$DATE"'.zip:: The 2nd substitution in the rename code. It replaces the trailing .zip with the shell variable DATE followed by .zip, as per the requirement.
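If you prefer, the two substitutions could also be folded into a single one; this is just a sketch assuming the same DATE shell variable (e.g. DATE='20211104'):
rename -n 's:^.(.*)\d{8}\.zip$:T${1}'"$DATE"'.zip:' *.zip
Here ${1} is written with braces so Perl does not mistake the leading digits of $DATE for part of the capture-group number.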

First of all, you should get the current date with date +%Y%m%d (4-digit year) instead of date +%y%m%d (2-digit year). The following assumes you do that. Prepend 20 to $DATE if that is not an option.
If your file names all look like the example you show, bash substitutions can do it. First compute the length, extract the characters from the second one up to the last one before the date, prepend T, and append $DATE.zip:
len="${#FILENAME}"
NEW_FILENAME="T${FILENAME:1:$((len-13))}$DATE.zip"
But you could also use sed; it offers a bit more flexibility. For instance, it can deal with trailing dates made of a variable number of digits:
NEW_FILENAME=$(echo "$FILENAME" | sed 's/.\(.*[^0-9]\)\?[0-9]*\.zip/T\1'"$DATE"'.zip/')
Or, a bit more elegant with bash (here strings) and GNU sed or another sed that supports the -E option (for extended regular expressions):
NEW_FILENAME=$(sed -E 's/.(.*[^0-9])?[0-9]*\.zip/T\1'"$DATE"'.zip/' <<< "$FILENAME")
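As a quick sanity check of the parameter-expansion variant on the sample name (a sketch; DATE is hard-coded here only for the demonstration):
FILENAME='PFSI4C.CSC.CCC.FSIContractData20211008.zip'
DATE='20211104'
len="${#FILENAME}"
NEW_FILENAME="T${FILENAME:1:$((len-13))}$DATE.zip"
echo "$NEW_FILENAME"    # TFSI4C.CSC.CCC.FSIContractData20211104.zip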

Assumptions:
replace first character (P in OP's sample) with T
replace last 10 characters (YYMMDD.zip) with $DATE.zip (OP has already defined $DATE)
all files contain 20YYMMDD so we don't need to worry about names containing strings like 19YYMMDD and 21YYMMDD
One idea using parameter substitutions (which also eliminates the overhead of subprocess calls to execute various echo, awk and sed commands):
DATE='211104'
FILENAME='PFSI4C.CSC.CCC.FSIContractData20211008.zip'
NEWFILENAME="T${FILENAME/?}" # prepend "T"; "/?" => remove first character
NEWFILENAME="${NEWFILENAME/??????.zip}${DATE}.zip" # remove the "??????.zip" tail (? matches any single character); append "${DATE}.zip"
echo mv "${FILENAME}" "${NEWFILENAME}"
This generates:
mv PFSI4C.CSC.CCC.FSIContractData20211008.zip TFSI4C.CSC.CCC.FSIContractData20211104.zip
Once OP is satisfied with the accuracy of the code, the echo can be removed to enable execution of the mv command.
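To apply the same expansions to a whole directory, a loop along these lines should work (a sketch; the P*.zip glob is an assumption about how the source files are named, and echo keeps it a dry run):
DATE='211104'
for FILENAME in P*.zip; do
    NEWFILENAME="T${FILENAME/?}"                          # drop the leading character, prepend "T"
    NEWFILENAME="${NEWFILENAME/??????.zip}${DATE}.zip"    # swap the old YYMMDD.zip tail for ${DATE}.zip
    echo mv "${FILENAME}" "${NEWFILENAME}"                # remove echo once the output looks right
done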

Related

Extract filename from multiple lines in unix

I'm trying to extract the name of the file that has been generated by a Java program. This Java program spits out multiple lines, and I know exactly what the format of the file name is going to be. The text that the Java program spits out is as follows:
ABCASJASLEKJASDFALDSF
Generated file YANNANI-0008876_17.xml.
TDSFALSFJLSDJF;
I'm capturing the output in a variable and then applying a sed operator in the following format:
sed -n 's/.*\(YANNANI.\([[:digit:]]\).\([xml]\)*\)/\1/p'
The result set is:
YANNANI-0008876_17.xml.
However, my problem is that I want the extraction of the filename to stop at .xml. The last dot should never be extracted.
Is there a way to do this using sed?
Let's look at what your capture group actually captures:
$ grep 'YANNANI.\([[:digit:]]\).\([xml]\)*' infile
Generated file YANNANI-0008876_17.xml.
That's probably not what you intended:
\([[:digit:]]\) captures just a single digit (and the capture group around it doesn't do anything)
\([xml]\)* is "any of x, m or l, 0 or more times", so it matches the empty string (as above – or the line wouldn't match at all!), x, xx, lll, mxxxxxmmmmlxlxmxlmxlm, xml, ...
There is no way the final period is removed because you don't match anything after the capture groups
What would make sense instead:
Match "digits or underscores, 0 or more": [[:digit:]_]*
Match .xml, literally (escape the period): \.xml
Make sure the rest of the line (just the period, in this case) is matched by adding .* after the capture group
So the regex for the string you'd like to extract becomes
$ grep 'YANNANI.[[:digit:]_]*\.xml' infile
Generated file YANNANI-0008876_17.xml.
and to remove everything else on the line using sed, we surround the regex with .*\( ... \).*:
$ sed -n 's/.*\(YANNANI.[[:digit:]_]*\.xml\).*/\1/p' infile
YANNANI-0008876_17.xml
This assumes you really meant . after YANNANI (any character).
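Since the question mentions capturing the Java program's output in a variable first, the extraction could be wired up like this (a sketch; the java command and the $output variable are hypothetical stand-ins for however the output is actually captured):
output=$(java -jar generator.jar)   # hypothetical command producing the lines shown above
filename=$(printf '%s\n' "$output" | sed -n 's/.*\(YANNANI.[[:digit:]_]*\.xml\).*/\1/p')
echo "$filename"                    # YANNANI-0008876_17.xml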
You can call sed twice: first in printing and then in replacement mode:
sed -n 's/.*\(YANNANI.\([[:digit:]]\).\([xml]\)*\)/\1/p' | sed 's/\.$//g'
the second sed will remove the trailing . at the end of every line fetched by your first sed
or you can go for a awk solution as you prefer:
awk '/.*YANNANI.[0-9]+.[0-9]+.xml/{print substr($NF,1,length($NF)-1)}'
this will print the last field of every line that matches your regex, truncating its last character using substr.

Using command line to remove text?

I have a huge file that contains lines that follow this format:
New-England-Center-For-Children-L0000392290
Southboro-Housing-Authority-L0000392464
Crew-Star-Inc-L0000391998
Saxony-Ii-Barber-Shop-L0000392491
Test-L0000392334
What I'm trying to do is narrow it down to just this:
New-England-Center-For-Children
Southboro-Housing-Authority
Crew-Star-Inc
Test
Can anyone help with this?
Using GNU awk:
awk -F\- 'NF--' OFS=\- file
New-England-Center-For-Children
Southboro-Housing-Authority
Crew-Star-Inc
Saxony-Ii-Barber-Shop
Test
Set the input and output field separator to -.
NF contains number of fields. Reduce it by 1 to remove the last field.
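A minimal reproduction with GNU awk, using a here-document in place of the file; -v OFS=- sets the output separator up front and is effectively equivalent to the trailing OFS=\- above (a sketch):
awk -F- -v OFS=- 'NF--' <<'EOF'
New-England-Center-For-Children-L0000392290
Crew-Star-Inc-L0000391998
EOF
which prints:
New-England-Center-For-Children
Crew-Star-Inc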
Using sed:
sed 's/\(.*\)-.*/\1/' file
New-England-Center-For-Children
Southboro-Housing-Authority
Crew-Star-Inc
Saxony-Ii-Barber-Shop
Test
Simple greedy regex to match up to the last hyphen.
In the replacement, use the captured group and discard the rest.
Version 1 of the Question
The first version of the input was in the form of HTML and parts had to be removed both before and after the desired text:
$ sed -r 's|.*[A-Z]/([a-zA-Z-]+)-L0.*|\1|' input
Special-Restaurant
Eliot-Cleaning
Kennedy-Plumbing
Version 2 of the Question
In the revised question, it is only necessary to remove the text that starts with -L00:
$ sed 's|-L00.*||' input2
New-England-Center-For-Children
Southboro-Housing-Authority
Crew-Star-Inc
Saxony-Ii-Barber-Shop
Test
Both of these commands use a single "substitute" command. The command has the form s|old|new|.
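The delimiter after s is arbitrary; | was chosen above because the version-1 pattern itself contains a /. With / as the delimiter the same command still works, but the slash inside the pattern then has to be escaped (a sketch on the version-1 input):
$ sed -r 's/.*[A-Z]\/([a-zA-Z-]+)-L0.*/\1/' input
Special-Restaurant
Eliot-Cleaning
Kennedy-Plumbing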
The Perl code for this would be: perl -nle 'print $1 if (m{-.*?/(.*?-.*?)-})'
We can break the Regex down to matching the following:
- matches the dash that's between the city and state
.*? matches the smallest set of character(s) that makes the Regex work, i.e. the State
/ matches the slash between the State and the data you want
( starts the capture of the data you are interested in
.*?-.*? will match the data you care about
) will close out the capture
- will match the dash before the L####### to give the regex something to match after your data. This will prevent the minimal Regex from matching 0 characters.
Then the print statement will print out what was captured (your data).
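A quick check against one of the original <loc> lines (echo stands in for the real input file):
$ echo '<loc>http://www.example.com/bp/Lowell-MA/Special-Restaurant-L0000423916.htm</loc>' | perl -nle 'print $1 if (m{-.*?/(.*?-.*?)-})'
Special-Restaurant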
awk likes these things:
$ awk -F[/-] -v OFS="-" '{print $(NF-3), $(NF-2)}' file
Special-Restaurant
Eliot-Cleaning
Kennedy-Plumbing
This sets / and - as possible field separators. Based on them, it prints fields $(NF-3) and $(NF-2) separated by the delimiter -. Note that $NF stands for the last field, hence $(NF-1) is the penultimate one, etc.
This sed is also helpful:
$ sed -r 's#.*/(\w*-\w*)-\w*\.\w*</loc>$#\1#' file
Special-Restaurant
Eliot-Cleaning
Kennedy-Plumbing
It selects the block word-word after a slash / and followed by word.word</loc> + end_of_line. Then, it prints back this block.
Update
Based on your new input, this can make it:
$ sed -r 's/(.*)-L\w*$/\1/' file
New-England-Center-For-Children
Southboro-Housing-Authority
Crew-Star-Inc
Saxony-Ii-Barber-Shop
Test
It selects everything up to the block -L + something + end of line, and prints it back.
You can also use another trick:
rev file | cut -d- -f2- | rev
As what you want is every - separated field except the last one, let's get all of them but the last. How? By reversing the line, keeping everything from the 2nd field onwards, and then reversing back.
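Step by step on one sample line, so the mechanism is visible (a sketch):
$ echo 'Saxony-Ii-Barber-Shop-L0000392491' | rev
1942930000L-pohS-rebraB-iI-ynoxaS
$ echo 'Saxony-Ii-Barber-Shop-L0000392491' | rev | cut -d- -f2-
pohS-rebraB-iI-ynoxaS
$ echo 'Saxony-Ii-Barber-Shop-L0000392491' | rev | cut -d- -f2- | rev
Saxony-Ii-Barber-Shop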
Here's how I'd do it with Perl:
perl -nle 'm{example[.]com/bp/(.*?)/(.*?)-L\d+[.]htm} && print $2' filename
Note: the original question was matching input lines like this:
<loc>http://www.example.com/bp/Lowell-MA/Special-Restaurant-L0000423916.htm</loc>
<loc>http://www.example.com/bp/Houston-TX/Eliot-Cleaning-L0000422797.htm</loc>
<loc>http://www.example.com/bp/New-Orleans-LA/Kennedy-Plumbing-L0000423121.htm</loc>
The -n option tells Perl to loop over every line of the file (but not print them out).
The -l option adds a newline onto the end of every print
The -e 'perl-code' option executes perl-code for each line of input
The pattern:
/regex/ && print
will only print if the regex matches. If the regex contains capture parentheses you can refer to the first captured section as $1, the second as $2, etc.
If your regex contains slashes, it may be cleaner to use a different regex delimiter ('m' stands for 'match'):
m{regex} && print
If you have a modern Perl, you can use -E to enable modern features and use say instead of print to print with a newline appended:
perl -nE 'm{example[.]com/bp/(.*?)/(.*?)-L\d+[.]htm} && say $2' filename
This is very concise in Perl
perl -i.bak -lpe's/-[^-]+$//' myfile
Note that this will modify the input file in place but will keep a backup of the original data in a file called myfile.bak.
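To preview the result first without touching the file, you could drop -i and just let -p print to standard output (a sketch):
perl -lpe 's/-[^-]+$//' myfile
Once the output looks right, add -i.bak back to edit the file in place.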

Removing text with command line?

I have a huge list of locations in this form in a text file:
ar,casa de piedra,Casa de Piedra,20,,-49.985133,-68.914673
gr,riziani,Ríziani,18,,39.5286111,20.35
mx,tenextepec,Tenextepec,30,,19.466667,-97.266667
Is there any way with command line to remove everything that isn't between the first and second commas? For example, I want my list to look like this:
casa de piedra
riziani
tenextepec
With Perl:
perl -F/,/ -ane 'print $F[1]."\n"' file
Use cut(1):
cut -d, -f2 inputfile
With perl:
perl -pe 's/^.*?,(.*?),.*/$1/' filename
Breakdown of the above code
perl - the command to use the perl programming language.
-pe - flags.
e means "run this as perl code".
p means:
Set $_ to the first line of the file (given by filename)
Run the -e code
Print $_
Repeat from step 1 with the next line of the file
What -p actually does behind the scenes is explained in detail in the perlrun documentation.
s/^.*?,(.*?),.*/$1/ is a regular expression:
s/pattern/replacement/ looks for pattern in $_ and replaces it with replacement
.*? basically means "anything" (it's more complicated than that but outside the scope of this answer)
, is a comma (nothing special)
() capture whatever is in them and save it in $1
.* is another (slightly different) "anything" (this time it's more like "everything")
$1 is what we captured with ()
so the whole thing basically says to search in $_ for:
anything
a comma
anything (save this bit)
another comma
everything
and replace it with the bit it saved. This effectively saves the stuff between the first and second commas, deletes everything, and then puts what it saved into $_.
filename is the name of your text file
To review, the code goes through your file line by line, applies the regular expression to extract your needed bit, and then prints it out.
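A quick check of the substitution on a single sample line (echo stands in for the file):
$ echo 'gr,riziani,Ríziani,18,,39.5286111,20.35' | perl -pe 's/^.*?,(.*?),.*/$1/'
riziani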
If you want the result in a file, use this:
perl -pe 's/^.*?,(.*?),.*/$1/' filename > out.txt
and the result goes into a file named out.txt (created in whatever directory your terminal is currently in). This simply tells the shell to write the command's output to a file instead of the screen.
Also, if it isn't crucial to use the command line, you can just import the file into Excel (it's in CSV format) and work with it graphically.
With awk:
$ awk -F ',' '{ print $2 }' file

Read word from a file and return next word

Using a shell script, I want to read a word from a text file and return the next column's word.
For eg, my input file will be like
AGE1 PERSON1
AGE2 PERSON2
AGE3 PERSON3
AGE4 PERSON4
I have a variable in the sh file holding a PERSON's name.
I want to read the input text file and get the value of the person's age.
Please help, I'm a beginner in shell scripting.
A slightly simpler solution is:
age=$( awk '$2==name { print $1 }' name="$name" input-file )
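For example, with the question's data saved as input-file (a hypothetical file name) and the person's name in $name:
name='PERSON3'
age=$(awk '$2==name { print $1 }' name="$name" input-file)
echo "$age"    # AGE3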
Building upon shellter's comment:
age=$(grep "$person_name" people_file.txt | cut -f1 -d' ')
I'll try to explain everything. First, I assume some things (but you can change them in your script):
Your file with the data you entered is called people_file.txt.
The person's name you want to find is in the variable $person_name.
The variable you want to store the result is $age.
Firstly, because we need to use commands to generate the value of the $age variable, we must use $( and ) to run a command (or a series of commands), and replace itself with the text it captures from executing the command (or commands).
We first need to find the line which contains the person's name. For that we use grep: grep regex file. Grep will search file line by line until it finds a line that matches the regular expression regex. In our case we can simply search for the person's name directly (assuming it doesn't contain special characters, like the period or an asterisk). Note that we must place the variable between double quotes, otherwise a person's name that has a space in it might be split in the command line so that its first name is used as the regular expression and the surname as the file. If you want to search in a case insensitive manner (like for example: John will find a line with JOHN or john), you can use the -i flag: grep -i regex file. The selected lines will be printed by grep into its output, but we will pump those lines into the input of the next command with the pipe operator |.
Finally, we have a line (or many lines) with the results. Now we must extract the age. The cut command will split each line it reads from the input into fields and only print the fields you ask for. In this case, we ask for the first field with the -f1 option. Also, we specify that the space character is to be used as the delimiter (i.e. the character that separates the fields) with the -d ' ' option.
If you have more than one line with the same person's name, we need to pipe the output of grep into a head command, so that we can have only the number of lines we want. We can tell head how many lines we want with the -n N option. So if you only want the first match:
age=$(grep "$person_name" people_file.txt | head -n 1 | cut -f1 -d' ')
Hope this helps a little =)
age=`
perl -nle'
BEGIN { $n = shift(@ARGV); }
print $1 if /^(\S+)\s+\Q$n\E$/;
' "$name" file
`
Tested with bash in sh mode.

Bash/AWK/SED Match and ReWrite a string of numbers (date) in a line

I have a text file with the following contents repeating about 60 times coming from a converted .ics file:
Start Vak
Tijd van: 20120411T093000Z
Tijd tot: 20120411T100000Z
Klas(sen) en Docent(en): VPOS0A1 VPOS0A2 Mariel Kers
Vak: Ex. Verst. beperk.
Lokaal: 7.05
Einde Vak
I want to rewrite the "Tijd van" and "Tijd tot" values to become a proper date (in a bash script on a GNU/Linux system with awk, sed, grep, etc.). I tried to use awk to find it:
awk '/^Tijd.*[:digit:][:digit:]Z$/; { getline; print $0; }' rooster2.txt
and grep:
egrep '/^Tijd(.*)[:digit:][:digit:]Z$/' rooster2.txt
But neither of them even finds the line.
What I want is to get that date rewritten to a more bash-parsable time format like the epoch, or something like 31.04.2012 13:00:00. I do not want to replace or rewrite the whole line, just that specific string! Anything, either tips, examples or links, is welcome and very useful.
Try this (GNU sed):
sed -r 's/(Tijd ...: )(....)(..)(..).(..)(..)(..)./\1 \4.\3.\2 \5:\6:\7/' FILE
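A quick check against one sample line; this is the same command with the stray space after \1 dropped so the output is single-spaced (echo stands in for FILE):
$ echo 'Tijd van: 20120411T093000Z' | sed -r 's/(Tijd ...: )(....)(..)(..).(..)(..)(..)./\1\4.\3.\2 \5:\6:\7/'
Tijd van: 11.04.2012 09:30:00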
There are several issues with your awk code:
While [:digit:] refers to "any digit", you still need another pair of square brackets ([...]) for the character group: [[:digit:]]. (Just imagine you wanted "a, any digit, or _"; this would be [a[:digit:]_], the outer square brackets defining the character group.)
The semicolon (;) between your pattern (/.../) and the corresponding action ({...}) separates the two, so you have a pattern with no action, resulting in the standard action {print $0}, and a second action without a pattern, resulting in it being performed for all records (i.e. lines).
The getline asks awk to read the next record (i.e. line) before continuing.
Taking all that together your code does the following:
Print all lines matching /^Tijd.*[:digit:][:digit:]Z$/ (that is, none, since [:digit:] translates to "one of :, d, i, g, or t").
Additionally, for all lines: read the next line and print it.
Thus, it will print all but the first line (because that is the only one that is not the next one to any other line).
Assuming you just want to print the lines matching "starting with 'Tijd' and ending with two digits followed by a 'Z'" you could use the following code:
awk '/^Tijd.*[[:digit:]][[:digit:]]Z$/{ print $0; }' rooster2.txt
Since {print $0} is the standard action you could even shorten that to
awk '/^Tijd.*[[:digit:]][[:digit:]]Z$/' rooster2.txt
To solve your actual problem you could use something like the following:
awk '/^Tijd.*[[:digit:]][[:digit:]]Z$/{year=substr($NF,1,4);month=substr($NF,5,2);day=substr($NF,7,2);hour=substr($NF,10,2);min=substr($NF,12,2);sec=substr($NF,14,2);$NF=day"."month"."year" "hour":"min":"sec}1' rooster2.txt
This works as follows:
For records (i.e lines) matching the pattern (/.../), rearrange the last field ($NF) to your needs.
Print all records (i.e. lines) (1 is a pattern matching all records (i.e. lines) with no specified action, resulting in the standard one ({print $0}))
Note that GNU awk also has a strftime function.
However, that needs the timestamp to be in a different format.
If you want to use that you must still rearrange the field, first:
awk -v FORMAT="%c" '/^Tijd.*[[:digit:]][[:digit:]]Z$/{$NF=strftime(FORMAT,mktime(substr($NF,1,4)" "substr($NF,5,2)" "substr($NF,7,2)" "substr($NF,10,2)" "substr($NF,12,2)" "substr($NF,14,2)))}1' rooster2.txt
Now, you just need to adjust FORMAT to your needs to change the format.
See man strftime for details.
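If epoch seconds are what you are after (as mentioned in the question), GNU awk's mktime can print them directly; note that mktime interprets the fields as local time, so the trailing Z (UTC) is effectively ignored in this sketch:
awk '/^Tijd.*[[:digit:]][[:digit:]]Z$/{print mktime(substr($NF,1,4)" "substr($NF,5,2)" "substr($NF,7,2)" "substr($NF,10,2)" "substr($NF,12,2)" "substr($NF,14,2))}' rooster2.txt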
As a Ruby one-liner: requiring time for Time.parse, then replacing the matching regexp. You may look at the strftime method for formatting the time output.
$ ruby -rtime -ne 'puts $_.sub(/(Tijd (van|tot): )(.*)/) { $1 + Time.parse($3).strftime("%D %T") }' < yourfile.txt
Start Vak
Tijd van: 04/11/12 09:30:00
Tijd tot: 04/11/12 10:00:00
Klas(sen) en Docent(en): VPOS0A1 VPOS0A2 Mariel Kers
Vak: Ex. Verst. beperk.
Lokaal: 7.05
Einde Vak