Using grep to correct XML files - sed

I have a folder with subfolders that contains 2000 XML files.
I need to process all the files with BizTalk.
But some of the files have wrong tags:
<streetName> Bombay Crescent </addressRegion>
</streetName>.
I need to use grep to find and replace the wrong tags only,
i.e. with a for loop: find any XML file with the wrong tag and replace it.
Only the tag "streetName" is affected, and only "addressRegion" is in the wrong place.
I would like to do something like:
grep -Po where streetName and *** /addressRegion if the condition is true
replace /addressRegion with /streetName
Thanks in advance

The following looks for a <streetName> tag that is closed by </addressRegion> and changes addressRegion to streetName. It replaces all occurrences on the line. The street name must not contain any < signs, as that would break the matching.
sed -e 's:\(<streetName>[^<]*\)</addressRegion>:\1</streetName>:g'
The command reads its standard input and writes standard output.
With sed -i (GNU sed), the replacement is done in place in all the input files:
sed -i -e 's:\(<streetName>[^<]*\)</addressRegion>:\1</streetName>:g' folder/subfolder/*.xml
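The glob folder/subfolder/*.xml only covers a single directory, while the question mentions a folder with subfolders. A minimal sketch of one way to recurse is to let find hand the files to the same sed command. The temporary directory and the sample file a.xml are made up for the demonstration, and sed -i without a suffix assumes GNU sed.

```shell
# Build a throwaway tree with one broken file in a nested directory
# (hypothetical data, only for the demonstration).
workdir=$(mktemp -d)
mkdir -p "$workdir/folder/subfolder"
printf '%s\n' '<address><streetName>Bombay Crescent</addressRegion></address>' \
    > "$workdir/folder/subfolder/a.xml"

# find descends into every subdirectory; "-exec ... {} +" passes many
# file names to each sed invocation instead of starting one sed per file.
find "$workdir/folder" -type f -name '*.xml' \
    -exec sed -i -e 's:\(<streetName>[^<]*\)</addressRegion>:\1</streetName>:g' {} +

fixed=$(cat "$workdir/folder/subfolder/a.xml")
echo "$fixed"
rm -rf "$workdir"
```

Running it on a copy of the data first is a cheap safeguard, since -i overwrites the files.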


Using sed/awk to print ONLY words that contains matched pattern - Words starting with /pattern/ or Ending with /pattern/

I have the following output:
junos-vmx-x86-64-21.1R1.11.qcow2 metadata-usb-fpc0.img metadata-usb-fpc10.img
metadata-usb-fpc11.img metadata-usb-fpc1.img metadata-usb-fpc2.img metadata-usb-fpc3.img
metadata-usb-fpc4.img metadata-usb-fpc5.img metadata-usb-fpc6.img metadata-usb-fpc7.img
metadata-usb-fpc8.img metadata-usb-fpc9.img metadata-usb-re0.img metadata-usb-re1.img
metadata-usb-re.img metadata-usb-service-pic-10g.img metadata-usb-service-pic-2g.img
metadata-usb-service-pic-4g.img vFPC-20210211.img vmxhdd.img
The output came from the following script:
images_fld=$(for i in $(ls "$DIRNAME_IMG"); do echo ${i%%/}; done)
The previous output is saved in a variable called images_fld.
Problem:
I need to extract the values junos-vmx-x86-64-21.1R1.11.qcow2,
vFPC-20210211.img and vmxhdd.img. By values I mean the entire word.
The problem is that this directory containing all the files is always being updated, and new files are added constantly, which means that I can not rely on the line number ($N) to extract the name of those files.
I am trying to use awk or sed to achieve this.
Is there a way to:
match all files ending with .qcow2 and then extract the full file name? Like: junos-vmx-x86-64-21.1R1.11.qcow2
match all files starting with vFPC and then extract the full file name? Like: vFPC-20210211.img
match all files starting with vmxhdd and then extract the full file name? Like: vmxhdd.img
I am using those patterns because the file names tend to change with each version I am deploying. But patterns like .qcow2, vFPC or vmxhdd always remain the same, so I need to extract the entire string by matching only partial patterns. Is it possible? Thanks!
Note: I cannot rely on files ending with .img, as there are quite a lot of them, which would make it more difficult to extract the specific file names :/
This might work for you (GNU sed):
sed -nE '/\<\S+\.qcow2\>|\<(vFPC|vmxhdd)\S+\>/{s//\n&\n/;s/[^\n]*\n//;P;D}' file
If a string matches the required criteria, delimit it by newlines.
Delete up to and including the first newline.
Print/delete the first line and repeat.
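The three steps can be watched on a shortened sample of the listing. This is only a sketch: the sample string is a trimmed stand-in for the real directory contents, and the \<, \>, \S and \n escapes assume GNU sed.

```shell
# Shortened, hypothetical stand-in for the contents of $images_fld.
images_fld='junos-vmx-x86-64-21.1R1.11.qcow2 metadata-usb-fpc0.img vFPC-20210211.img vmxhdd.img'

# For every match: wrap it in newlines, drop everything before it,
# print it (P), then delete it (D) and rescan what is left of the line.
matches=$(printf '%s\n' "$images_fld" |
    sed -nE '/\<\S+\.qcow2\>|\<(vFPC|vmxhdd)\S+\>/{s//\n&\n/;s/[^\n]*\n//;P;D}')
echo "$matches"
```

Each matching word ends up on its own line, in the order it appears in the input.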
Thanks to KamilCuk I was able to solve the problem. Thank you! For anyone who may need this in the future: instead of using sed or awk, the solution was to use tail, tr and grep.
echo $images_fld | tail -f | tr ' ' '\n' | grep '\.qcow2$\|vFPC\|vmxhdd'
Basically, the only problem I was having was extracting the names of the files ending with .qcow2 and starting with vFPC or vmxhdd.
Thank you KamilCuk
Another solution, given by potong, is to use
echo $images_fld | sed -nE '/\<\S+\.qcow2\>|\<(vFPC|vmxhdd)\S+\>/{s//\n&\n/;s/[^\n]*\n//;P;D}'
which gives the same output as KamilCuk's! Thanks to both.
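Since the requirement is "ending with .qcow2" and "starting with vFPC or vmxhdd", a slightly stricter variant of the grep approach anchors each alternative. This is a sketch under the assumption that the names contain no whitespace; the sample value of images_fld is a shortened stand-in.

```shell
# Shortened, hypothetical stand-in for the real listing.
images_fld='junos-vmx-x86-64-21.1R1.11.qcow2 metadata-usb-fpc0.img metadata-usb-re.img vFPC-20210211.img vmxhdd.img'

# $images_fld is left unquoted on purpose so the shell splits it into
# words; printf then emits one name per line for grep to filter.
# "$" anchors the suffix and "^" anchors the prefixes.
selected=$(printf '%s\n' $images_fld | grep -E '\.qcow2$|^vFPC|^vmxhdd')
echo "$selected"
```

Without the ^ anchors, a name such as old-vFPC.img would also be selected, which may or may not be wanted.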

Adding a space before each capital letter in a selected set of lines in a yaml file using sed

I want to write a regex for a shell script. It should match only lines of this kind in a YAML file (lines with the summary tag, e.g. summary: Example Summary):
summary: GetMembersSavedSearchesByMemberId
What I want to do is add a space before each uppercase letter and get output like this:
summary: Get Members Saved Searches By Member Id
I tried this regex
matchregex="summary[:][[:space:]].\([A-Z]\)"
replacement="summary: .\1"
sed -e "s/${matchregex}/${replacement}/g"
It is not working. What is the correct way of writing this?
Would you please try the following:
sed -E '/^summary:/ s/([a-z])([A-Z])/\1 \2/g'
Result:
summary: Get Members Saved Searches By Member Id
This might work for you (GNU sed):
sed 's/\B[[:upper:]]/ &/g' file
Globally insert a space inside a word where the following character is uppercase.
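In many YAML files the summary key is indented, and other keys may hold camel-case values that should stay untouched. A sketch of the address-restricted form, assuming leading whitespace is allowed before summary: (the operationId line is a hypothetical neighbour added for contrast):

```shell
# Hypothetical YAML fragment; only the summary line should change.
yaml='  operationId: getMembersSavedSearches
  summary: GetMembersSavedSearchesByMemberId'

# The address limits the substitution to summary lines; each
# lowercase-uppercase pair gets a space inserted between the two letters.
result=$(printf '%s\n' "$yaml" |
    sed -E '/^[[:space:]]*summary:/ s/([a-z])([A-Z])/\1 \2/g')
echo "$result"
```

The operationId line passes through unchanged because the address does not select it.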

Improving sed program - conditions

I use this code according to this question.
$ names=(file1.txt file2.txt file3.txt) # Declare array
$ printf 's/%s/a-&/g\n' "${names[@]%.txt}" # Generate sed replacement script
s/file1/a-&/g
s/file2/a-&/g
s/file3/a-&/g
$ sed -f <(printf 's/%s/a-&/g\n' "${names[@]%.txt}") f.txt
TEXT
\connect{a-file1}
\begin{a-file2}
\connect{a-file3}
TEXT
75
How can I add conditions that solve the following problem?
names=(file1.txt file2.txt file3file2.txt)
I mean that a word in one file name is repeated as part of another file name, so a- gets added more than once.
I tried
sed -f <(printf 's/{%s}/{s-&}/g\n' "${files[@]%.tex}")
but the result is
\input{a-{file1}}
I need to match {%s} and place a- between { and %s.
It's not clear from the question how to resolve conflicting input. In particular, the code will replace any instance of file1 with a-file1, even inside things like 'foofile1'.
On the surface, the goal seems to be to change whole tokens (e.g., foofile1 should not be affected by the file1 substitution). This can be achieved by adding a word-boundary assertion (\b, a GNU sed extension) before and after the file name. It prevents the pattern from matching inside other, longer file names.
printf 's/\\b%s\\b/a-&/g\n' "${names[@]%.txt}"
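A small end-to-end check of the word-boundary idea is sketched below. It writes the generated script to a temporary file instead of using an array and process substitution, purely so the sketch also runs in plain sh. The file names rename.sed and f.txt and the sample input are hypothetical, and \b assumes GNU sed.

```shell
workdir=$(mktemp -d)

# Generate one substitution per name, with \b on both sides.
for name in file1.txt file2.txt file3file2.txt; do
    printf 's/\\b%s\\b/a-&/g\n' "${name%.txt}"
done > "$workdir/rename.sed"

# Hypothetical input: the last line contains file2 only as part of a longer name.
printf '%s\n' '\connect{file1}' '\begin{file2}' '\connect{file3file2}' \
    > "$workdir/f.txt"

renamed=$(sed -f "$workdir/rename.sed" "$workdir/f.txt")
echo "$renamed"
rm -rf "$workdir"
```

The file2 rule does not fire inside file3file2, because there is no word boundary between 3 and f, so that name receives a single a- prefix from its own rule.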
This explanation is too long for a comment, so I am adding an answer here. I am not sure whether my previous answer was clear, but it takes care of this case: it replaces exact file names only and NOT file names that merely contain them.
Let's say the following are the array value and the Input_file:
names=(file1.txt file2.txt file3file2.txt)
echo "${names[*]}"
file1.txt file2.txt file3file2.txt
cat file1
TEXT
\connect{file1}
\begin{file2}
\connect{file3}
TEXT
75
Now when we run following code:
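awk -v arr="${names[*]}" '
BEGIN{
  FS=OFS="{"
  num=split(arr,array," ")
  for(i=1;i<=num;i++){
    sub(/\.txt$/,"",array[i])   # strip the extension
    array1[array[i]"}"]         # key is the name plus the closing brace
  }
}
$2 in array1{                   # field 2 is exactly one of the known names
  $2="a-"$2
}
1
' file1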
The output will be as follows. You can see that file3 is NOT replaced, since it is not present in the array.
TEXT
\connect{a-file1}
\begin{a-file2}
\connect{file3}
TEXT
75

sed replace from csv include last character of search term

I am trying to replace a list of words found in a csv file with index markup (docbook). The csv is in this format:
testword[ -?],testword<indexterm><primary>testword</primary></indexterm>
This finds all occurrences of the testword with punctuation at the end. This part works. However, I need the final punctuation mark to be included in the replace part of the sed command.
sed -e 's/\(.*\)/s,\1,g/' index.csv > index.sed
sed -i -f index.sed file.xml
So e.g. This is a testword, in a test.
Would get replaced with This is a testword,<indexterm><primary>testword</primary></indexterm> in a test.
The problem is the line in the CSV file that drives the process: this is where you lose the punctuation.
Replacing the:
testword[ -?],testword<indexterm><primary>testword</primary></indexterm>
by:
testword\([ -?]\),testword\1<indexterm><primary>testword</primary></indexterm>
Would already solve your problem.
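The whole pipeline can be tried on the sentence from the question. This is a sketch using a temporary directory and leaving out -i so the result is captured rather than written back. LC_ALL=C is an added assumption that makes the [ -?] range behave as a plain ASCII range.

```shell
workdir=$(mktemp -d)

# CSV line with the punctuation captured in \( \) and reused as \1.
printf '%s\n' 'testword\([ -?]\),testword\1<indexterm><primary>testword</primary></indexterm>' \
    > "$workdir/index.csv"
printf '%s\n' 'This is a testword, in a test.' > "$workdir/file.xml"

# Turn each CSV line into an "s,search,replace,g" command.
sed -e 's/\(.*\)/s,\1,g/' "$workdir/index.csv" > "$workdir/index.sed"

indexed=$(LC_ALL=C sed -f "$workdir/index.sed" "$workdir/file.xml")
echo "$indexed"
rm -rf "$workdir"
```

The comma after testword survives because it is matched by the group and put back by \1.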

Using grep to adjust timecode

I'm trying to convert the timecodes in a file from one format to another: basically, to remove the milliseconds from the end of each timecode and update the file. This is to remove the extra milliseconds produced by transcription timecode software and make the file look tidy for a client.
Input looks like this:
00:50:34.00>INTERVIEWER
Why was it ............... script?
00:50:35.13>JOHN DOE
Because of the quality.
So I'm trying to use grep to match the timecode and got it working with following expression.
grep [0-9][0-9][:][0-9][0-9][:][0-9][0-9]\.[0-9][0-9] -P -o transcriptionFile.txt
Output looks like this:
00:50:34.00
00:50:35.13
So now I'm trying to take timecode and update the file with updated values like:
00:50:34
00:50:35
How do I do that? Should I use a pipe to push it over to sed so I can update the values in the file?
I've also tried to use sed with following command:
sed 's/[0-9][0-9][:][0-9][0-9][:][0-9][0-9]\.[0-9][0-9]/[0-9][0-9][:][0-9][0-9][:][0-9][0-9]/g' transcriptionFile.txt > outtranscriptionFile.txt
I get output, but it puts my regexp in the place where the timecode is supposed to be. Any ideas? Also, how can I trim the last 3 characters off the right-hand side of the timecode before I update the file?
Any tips or suggestions will be much appreciated.
Thanks :-)
With GNU sed:
$ sed -r 's/^([0-9]{2}:[0-9]{2}:[0-9]{2})\>\.[0-9]{2}/\1/' transcriptionFile.txt
00:50:34>INTERVIEWER
Why was it ............... script?
00:50:35>JOHN DOE
Because of the quality.
To edit the file in place, add the -i option:
sed -r -i 's/^([0-9]{2}:[0-9]{2}:[0-9]{2})\>\.[0-9]{2}/\1/' transcriptionFile.txt
Explanation:
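[0-9]{2}: matches two digits followed by a :. The three two-digit groups (hours, minutes, seconds) are captured together by the parentheses.
\>\.[0-9]{2} matches the end of a word, followed by a dot and two digits. In GNU sed, \> is an end-of-word anchor, not a literal >, so it is redundant here. The literal > after the milliseconds is not part of the match and is left in place.
Using the backreference \1, the matched text is replaced with the captured characters (the timecode without milliseconds).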
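For a quick check without touching the file, the same substitution can be run on the sample from the question. This sketch uses -E and drops the redundant \> anchor, which is a small deviation from the command above. The transcript variable simply holds the sample input.

```shell
transcript='00:50:34.00>INTERVIEWER
Why was it ............... script?
00:50:35.13>JOHN DOE
Because of the quality.'

# Keep hh:mm:ss (group 1) and drop the dot plus two digits after it.
trimmed=$(printf '%s\n' "$transcript" |
    sed -E 's/^([0-9]{2}:[0-9]{2}:[0-9]{2})\.[0-9]{2}/\1/')
echo "$trimmed"
```

Lines that do not start with a timecode pass through unchanged.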