Extract occurrences of pattern from multiple files - sed

I am trying to find all occurrences of three characters between a pair of spaces in all files in my current directory.
So far I have
sed 's/.../(&)/g'
I know it's not right; I think I'm stuck on something. How can I do that?

grep -rlE ' [a-zA-Z]{3} ' .
Explanation:
-E use extended regular expressions, so the {3} interval works without backslashes
-r grep recursively from the current folder (.)
-l only display file names, rather than all matching lines
The regex is ' [a-zA-Z]{3} ', which matches exactly three letters between a pair of spaces, anywhere within a given file.
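To see the behavior in isolation, here is a small sketch; the file name sample.txt and its contents are made up for the demonstration:

```shell
# Work in a scratch directory so the recursive search is predictable.
dir=$(mktemp -d) && cd "$dir"

# Only the second line has exactly three letters between a pair of
# spaces ("cat", and also "the").
printf 'a verylongword here\nfind the cat today\n' > sample.txt

# -E enables extended regexes so {3} works unescaped;
# -r recurses, -l prints only the names of matching files.
grep -rlE ' [a-zA-Z]{3} ' .
```

This prints ./sample.txt once, even though the second line matches in two places, because -l stops at the first match per file.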

Related

Using sed/awk to print ONLY words that contains matched pattern - Words starting with /pattern/ or Ending with /pattern/

I have the following output:
junos-vmx-x86-64-21.1R1.11.qcow2 metadata-usb-fpc0.img metadata-usb-fpc10.img
metadata-usb-fpc11.img metadata-usb-fpc1.img metadata-usb-fpc2.img metadata-usb-fpc3.img
metadata-usb-fpc4.img metadata-usb-fpc5.img metadata-usb-fpc6.img metadata-usb-fpc7.img
metadata-usb-fpc8.img metadata-usb-fpc9.img metadata-usb-re0.img metadata-usb-re1.img
metadata-usb-re.img metadata-usb-service-pic-10g.img metadata-usb-service-pic-2g.img
metadata-usb-service-pic-4g.img vFPC-20210211.img vmxhdd.img
The output came from the following script:
images_fld=$(for i in $(ls "$DIRNAME_IMG"); do echo ${i%%/}; done)
The previous output is saved in a variable called images_fld.
Problem:
I need to extract the values junos-vmx-x86-64-21.1R1.11.qcow2, vFPC-20210211.img, and vmxhdd.img. By "values" I mean the entire file names.
The problem is that the directory containing all the files is constantly being updated with new files, which means I cannot rely on a line number ($N) to extract the names of those files.
I am trying to use awk or sed to achieve this.
Is there a way to:
match all files ending with .qcow2 and then extract the full file name? Like: junos-vmx-x86-64-21.1R1.11.qcow2
match all files starting with vFPC and then extract the full file name? Like: vFPC-20210211.img
match all files starting with vmxhdd and then extract the full file name? Like: vmxhdd.img
I am using these patterns because the file names change with each version I am deploying, but the patterns .qcow2, vFPC, and vmxhdd always remain the same. For that reason, I need to extract the entire string by matching only a partial pattern. Is it possible? Thanks!
Note: I can not rely on files ending with .img as there are quite a lot of them, so it would make it more difficult to extract the specific file names :/
This might work for you (GNU sed):
sed -nE '/\<\S+\.qcow2\>|\<(vFPC|vmxhdd)\S+\>/{s//\n&\n/;s/[^\n]*\n//;P;D}' file
If a string matches the required criteria, delimit it by newlines.
Delete up to and including the first newline.
Print/delete the first line and repeat.
Thanks to KamilCuk I was able to solve the problem. Thank you! For anyone who may need this in the future: instead of using sed or awk, the solution was to put each word on its own line with tr and then grep for the patterns.
echo $images_fld | tr ' ' '\n' | grep '\.qcow2$\|vFPC\|vmxhdd'
Basically, all I needed was to extract the names of the files ending with .qcow2 or starting with vFPC or vmxhdd. Thank you KamilCuk.
Another solution given by potong is by using
echo $images_fld | sed -nE '/\<\S+\.qcow2\>|\<(vFPC|vmxhdd)\S+\>/{s//\n&\n/;s/[^\n]*\n//;P;D}'
which gives the same output as KamilCuk's! Thanks both
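As a self-contained sketch of the tr + grep approach, using a shortened version of the listing from the question (the variable contents are abridged; anchoring the patterns with ^ and $ makes the match a little stricter than the original):

```shell
# A shortened stand-in for the listing the question's script produced.
images_fld='junos-vmx-x86-64-21.1R1.11.qcow2 metadata-usb-fpc0.img vFPC-20210211.img vmxhdd.img'

# One word per line, then keep only the three patterns of interest:
# names ending in .qcow2, or starting with vFPC or vmxhdd.
echo "$images_fld" | tr ' ' '\n' | grep -E '\.qcow2$|^vFPC|^vmxhdd'
```

This prints junos-vmx-x86-64-21.1R1.11.qcow2, vFPC-20210211.img, and vmxhdd.img, one per line, regardless of where they sit in the listing.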

Extract filename from multiple lines in unix

I'm trying to extract the name of the file name that has been generated by a Java program. This Java program spits out multiple lines and I know exactly what the format of the file name is going to be. The information text that the Java program is spitting out is as follows:
ABCASJASLEKJASDFALDSF
Generated file YANNANI-0008876_17.xml.
TDSFALSFJLSDJF;
I'm capturing the output in a variable and then applying a sed operator in the following format:
sed -n 's/.*\(YANNANI.\([[:digit:]]\).\([xml]\)*\)/\1/p'
The result set is:
YANNANI-0008876_17.xml.
However, my problem is that I want the extraction of the filename to stop at .xml. The final dot should never be extracted.
Is there a way to do this using sed?
Let's look at what your capture group actually captures:
$ grep 'YANNANI.\([[:digit:]]\).\([xml]\)*' infile
Generated file YANNANI-0008876_17.xml.
That's probably not what you intended:
\([[:digit:]]\) captures just a single digit (and the capture group around it doesn't do anything)
\([xml]\)* is "any of x, m or l, 0 or more times", so it matches the empty string (as above – or the line wouldn't match at all!), x, xx, lll, mxxxxxmmmmlxlxmxlmxlm, xml, ...
There is no way the final period is removed because you don't match anything after the capture groups
What would make sense instead:
Match "digits or underscores, 0 or more": [[:digit:]_]*
Match .xml, literally (escape the period): \.xml
Make sure the rest of the line (just the period, in this case) is matched by adding .* after the capture group
So the regex for the string you'd like to extract becomes
$ grep 'YANNANI.[[:digit:]_]*\.xml' infile
Generated file YANNANI-0008876_17.xml.
and to remove everything else on the line using sed, we surround the regex with .*\( ... \).*:
$ sed -n 's/.*\(YANNANI.[[:digit:]_]*\.xml\).*/\1/p' infile
YANNANI-0008876_17.xml
This assumes you really meant . after YANNANI (any character).
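Putting it together, the three sample lines from the question can be fed straight into the corrected command:

```shell
# The Java program's output from the question, piped into the fixed sed:
# everything before YANNANI and everything after .xml (the final dot)
# is matched by the surrounding .* and so is dropped.
printf '%s\n' 'ABCASJASLEKJASDFALDSF' \
              'Generated file YANNANI-0008876_17.xml.' \
              'TDSFALSFJLSDJF;' |
  sed -n 's/.*\(YANNANI.[[:digit:]_]*\.xml\).*/\1/p'
```

Only the middle line matches, and the output is YANNANI-0008876_17.xml without the trailing period.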
You can call sed twice: first in printing and then in replacement mode:
sed -n 's/.*\(YANNANI.\([[:digit:]]\).\([xml]\)*\)/\1/p' | sed 's/\.$//g'
The second sed removes the trailing . from the end of every line the first sed prints.
or you can go for a awk solution as you prefer:
awk '/.*YANNANI.[0-9]+.[0-9]+.xml/{print substr($NF,1,length($NF)-1)}'
this will print the last field (and truncate the last char of it using substr) of all the lines that do match your regex.

Using grep to correct XML files

I have a folder and sub folder that contains 2000 xml files.
Need to process all the files with BizTalk systems.
But some of the files has wrong tags
<streetName>Bombay Crescent</addressRegion>
instead of the correct closing tag </streetName>.
I need to use grep to find and replace the wrong tags only.
I.e. with a for loop: find any xml file with the wrong tag and replace it.
Only the tag "streetName" is affected. And only "addressRegion" is in the wrong place.
will like to do
grep -Po where streetName and *** /addressRegion if the condition is true
replace /addressRegion with /streetName
Thanks in Advance
The following will look for a tag <streetName> whose matching closing tag is </addressRegion>, and will change addressRegion to streetName. It replaces all occurrences on a line. The street name must not contain any < signs, as that would break the matching.
sed -e 's:\(<streetName>[^<]*\)</addressRegion>:\1</streetName>:g'
The command reads its standard input and writes standard output.
sed -i will do the replacement in-place in all its input files:
sed -i -e 's:\(<streetName>[^<]*\)</addressRegion>:\1</streetName>:g' folder/subfolder/*.xml
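Since the question mentions files in both a folder and its sub folders, a find invocation reaches files at any depth, which the single glob above does not. A sketch, assuming GNU sed for -i; the directory layout and file contents below are made up for the demonstration:

```shell
# Build a throwaway folder/sub tree with one broken tag in it.
folder=$(mktemp -d)
mkdir -p "$folder/sub"
printf '<streetName>Bombay Crescent</addressRegion>\n' > "$folder/sub/a.xml"

# find hands every .xml file, however deeply nested, to sed -i (GNU sed).
find "$folder" -name '*.xml' -exec \
  sed -i -e 's:\(<streetName>[^<]*\)</addressRegion>:\1</streetName>:g' {} +

cat "$folder/sub/a.xml"
```

After the run the file contains <streetName>Bombay Crescent</streetName>, with correctly paired tags.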

Extract CentOS mirror domain names using sed

I'm trying to extract a list of CentOS domain names only from http://mirrorlist.centos.org/?release=6.4&arch=x86_64&repo=os
Truncating the "http://" or "ftp://" prefix and everything from the first "/" onward should result in a list like:
yum.phx.singlehop.com
mirror.nyi.net
bay.uchicago.edu
centos.mirror.constant.com
mirror.teklinks.com
centos.mirror.netriplex.com
centos.someimage.com
mirror.sanctuaryhost.com
mirrors.cat.pdx.edu
mirrors.tummy.com
I searched stackoverflow for the sed method but I'm still having trouble.
I tried doing this with sed
curl "http://mirrorlist.centos.org/?release=6.4&arch=x86_64&repo=os" | sed '/:\/\//,/\//p'
but it doesn't seem to do anything. Can you give me some advice?
Here you go:
curl "http://mirrorlist.centos.org/?release=6.4&arch=x86_64&repo=os" | sed -e 's?.*://??' -e 's?/.*??'
Your sed was completely wrong:
/x/,/y/ is a range. It selects multiple lines, from a line matching /x/ until a line matching /y/
The p command prints the selected range
Since all lines match both the start and end pattern you used, you effectively selected all lines. And, since sed echoes the input by default, the p command results in duplicated lines (all lines printed twice).
In my fix:
I used s??? instead of s/// because this way I didn't need to escape all the / in the patterns, so it's a bit more readable this way
I used two expressions with the -e flag:
s?.*://?? matches everything up until :// and replaces it with nothing
s?/.*?? matches everything from the first / to the end of the line and replaces it with nothing
The two expressions are executed in the given order
In modern versions of sed you can omit -e and separate the two expressions with ;. I stick to using -e because it's more portable.
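The two substitutions can be tried without hitting the network; the two mirror URLs below are made-up stand-ins for the live curl output:

```shell
# Strip everything up to :// with the first expression, then
# everything from the first remaining / with the second.
printf '%s\n' 'http://yum.phx.singlehop.com/centos/6.4/os/x86_64/' \
              'ftp://mirror.nyi.net/pub/centos/' |
  sed -e 's?.*://??' -e 's?/.*??'
```

This leaves only the bare domain names, yum.phx.singlehop.com and mirror.nyi.net, one per line.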

Using command line to lowercase all text in all files?

I've been trying out several commands such as
dd if=*.xml of=*.xml conv=lcase
to convert the content of all the xml files in my folder to lowercase. The file names are already lowercase; I'm trying to change the actual content to lowercase as well.
Can someone post the command to do this or tell me what I'm doing wrong? Thanks!
Use sed to edit files in place which will save you from writing a loop.
sed -ri 's/.+/\L\0/' *.xml
for i in *.xml; do tr A-Z a-z < $i > tmp && mv tmp $i; done
If your file names contain unusual characters (whitespace, newlines, control characters, etc), you may have to quote "$i", but since you say the names are all lowercase, I'm assuming that is not necessary.
I would go for:
sed -i -e 's/\(.*\)/\L\1/' *.xml
I see that you've tagged your question with ssh. You didn't specify it, but does this mean that you want to run this command at the end of an ssh command? In that case, you will need to escape the asterisk, since it is supposed to be expanded remotely, like this:
sed -i -e 's/\(.*\)/\L\1/' \*.xml
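The \L case-conversion escape is a GNU sed extension; a quick one-line check of what it does (the sample text is arbitrary):

```shell
# \L in the replacement lowercases everything that follows it;
# & stands for the whole matched line.
echo 'Hello XML World' | sed 's/.*/\L&/'
```

This prints hello xml world; on non-GNU seds (e.g. BSD/macOS) \L is not available and the tr loop above is the portable fallback.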