Extracting the contents between two different strings using bash or perl - perl

I have tried to scan through the other posts in stack overflow for this, but couldn't get my code work, hence I am posting a new question.
Below is the content of file temp.
<?xml version="1.0" encoding="UTF-8"?>
<env:Envelope xmlns:env="http://schemas.xmlsoap.org/soap/envelope/<env:Body><dp:response xmlns:dp="http://www.datapower.com/schemas/management"><dp:timestamp>2015-01-
22T13:38:04Z</dp:timestamp><dp:file name="temporary://test.txt">XJzLXJlc3VsdHMtYWN0aW9uX18i</dp:file><dp:file name="temporary://test1.txt">lc3VsdHMtYWN0aW9uX18i</dp:file></dp:response></env:Body></env:Envelope>
This file contains the base64 encoded contents of two files names test.txt and test1.txt. I want to extract the base64 encoded content of each file to seperate files test.txt and text1.txt respectively.
To achieve this, I have to remove the xml tags around the base64 contents. I am trying below commands to achieve this. However, it is not working as expected.
sed -n '/test.txt"\>/,/\<\/dp:file\>/p' temp | perl -p -e 's#<dp:file name="temporary://test.txt">##g'|perl -p -e 's#</dp:file>##g' > test.txt
sed -n '/test1.txt"\>/,/\<\/dp:file\>/p' temp | perl -p -e 's#<dp:file name="temporary://test1.txt">##g'|perl -p -e 's#</dp:file></dp:response></env:Body></env:Envelope>##g' > test1.txt
Below command:
sed -n '/test.txt"\>/,/\<\/dp:file\>/p' temp | perl -p -e 's#<dp:file name="temporary://test.txt">##g'|perl -p -e 's#</dp:file>##g'
produces output:
XJzLXJlc3VsdHMtYWN0aW9uX18i
<dp:file name="temporary://test1.txt">lc3VsdHMtYWN0aW9uX18i</dp:response> </env:Body></env:Envelope>`
Howeveer, in the output I am expecting only first line XJzLXJlc3VsdHMtYWN0aW9uX18i. Where I am commiting mistake?
When i run below command, I am getting expected output:
sed -n '/test1.txt"\>/,/\<\/dp:file\>/p' temp | perl -p -e 's#<dp:file name="temporary://test1.txt">##g'|perl -p -e 's#</dp:file></dp:response></env:Body></env:Envelope>##g'
It produces below string
lc3VsdHMtYWN0aW9uX18i
I can then easily route this to test1.txt file.
UPDATE
I have edited the question by updating the source file content. The source file doesn't contain any newline character. The current solution will not work in that case, I have tried it and failed. wc -l temp must output to 1.
OS: solaris 10
Shell: bash

sed -n 's_<dp:file name="\([^"]*\)">\([^<]*\).*_\1 -> \2_p' temp
I add \1 -> to show link from file name to content but for content only, just remove this part
posix version so on GNU sed use --posix
assuming that base64 encoded contents is on the same line as the tag around (and not spread on several lines, that need some modification in this case)
Thanks to JID for full explaination below
How it works
sed -n
The -n means no printing so unless explicitly told to print, then there will be no output from sed
's_
This is to substitute the following regex using _ to separate regex from the replacement.
<dp:file name=
Regular text
"\([^"]*\)"
The brackets are a capture group and must be escaped unless the -r option is used( -r is not available on posix). Everything inside the brackets is captured. [^"]* means 0 or more occurrences of any character that is not a quote. So really this just captures anything between the two quotes.
>\([^<]*\)<
Again uses the capture group this time to capture everything between the > and <
.*
Everything else on the line
_\1 -> \2
This is the replacement, so replace everything in the regex before with the first capture group then a -> and then the second capture group.
_p
Means print the line
Resources
http://unixhelp.ed.ac.uk/CGI/man-cgi?sed
http://www.grymoire.com/Unix/Sed.html

/usr/xpg4/bin/sed works well here.
/usr/bin/sed is not working as expected in case if the file contains just 1 line.
below command works for a file containing only single line.
/usr/xpg4/bin/sed -n 's_<env:Envelope\(.*\)<dp:file name="temporary://BackUpDir/backupmanifest.xml">\([^>]*\)</dp:file>\(.*\)_\2_p' securebackup.xml 2>/dev/null
Without 2>/dev/null this sed command outputs the warning sed: Missing newline at end of file.
This because of the below reason:
Solaris default sed ignores the last line not to break existing scripts because a line was required to be terminated by a new line in the original Unix implementation.
GNU sed has a more relaxed behavior and the POSIX implementation accept the fact but outputs a warning.

Related

How to use SED to find multiple paths in the same line and replace them with a different path?

I have a file with multiple paths in the same line:
cat modules.dep
kernel/mm/zsmalloc.ko:
kernel/crypto/lzo.ko:
kernel/drivers/char/tpm/tpm_vtpm_proxy.ko: kernel/drivers/char/tpm/tpm.ko
kernel/drivers/block/virtio_blk.ko:
kernel/drivers/block/zram/zram.ko: kernel/mm/zsmalloc.ko
kernel/drivers/nvdimm/virtio_pmem.ko: kernel/drivers/nvdimm/nd_virtio.ko
kernel/drivers/nvdimm/nd_virtio.ko:
kernel/drivers/net/virtio_net.ko: kernel/drivers/net/net_failover.ko kernel/net/core/failover.ko
kernel/drivers/net/net_failover.ko: kernel/net/core/failover.ko
extra/virtio_gpu/virtio-gpu.ko: kernel/drivers/virtio/virtio_dma_buf.ko
extra/wlan_simulation/virt_wifi_sim.ko: kernel/drivers/net/wireless/virt_wifi.ko
I would like to change it to:
/lib/modules/zsmalloc.ko:
/lib/modules/lzo.ko:
/lib/modules/tpm_vtpm_proxy.ko: /lib/modules/tpm.ko
/lib/modules/virtio_blk.ko:
/lib/modules/zram.ko: /lib/modules/zsmalloc.ko
/lib/modules/virtio_pmem.ko: /lib/modules/nd_virtio.ko
/lib/modules/nd_virtio.ko:
/lib/modules/virtio_net.ko: /lib/modules/net_failover.ko /lib/modules/failover.ko
/lib/modules/net_failover.ko: /lib/modules/failover.ko
/lib/modules/virtio-gpu.ko: /lib/modules/virtio_dma_buf.ko
/lib/modules/virt_wifi_sim.ko: /lib/modules/virt_wifi.ko
But my attempt:
sed -i 's/\(.*\)\//\/lib\/modules\//g' modules.load
works only, if there is just one path per line.
How can I achieve this, via sed, with multiple paths per line?
I am using sed from BusyBox in D(ASH) Standalone.
BusyBox v1.32.1-Magisk (2021-01-21 00:17:27 PST) multi-call binary.
Usage: sed [-i[SFX]] [-nrE] [-f FILE]... [-e CMD]... [FILE]...
or: sed [-i[SFX]] [-nrE] CMD [FILE]...
-e CMD Add CMD to sed commands to be executed
-f FILE Add FILE contents to sed commands to be executed
-i[SFX] Edit files in-place (otherwise sends to stdout)
Optionally back files up, appending SFX
-n Suppress automatic printing of pattern space
-r,-E Use extended regex syntax
If no -e or -f, the first non-option argument is the sed command string.
Remaining arguments are input files (stdin if none).
This sed should work:
sed -E 's~[^[:blank:]]+/~/lib/modules/~g' modules.dep
/lib/modules/zsmalloc.ko:
/lib/modules/lzo.ko:
/lib/modules/tpm_vtpm_proxy.ko: /lib/modules/tpm.ko
/lib/modules/virtio_blk.ko:
/lib/modules/zram.ko: /lib/modules/zsmalloc.ko
/lib/modules/virtio_pmem.ko: /lib/modules/nd_virtio.ko
/lib/modules/nd_virtio.ko:
/lib/modules/virtio_net.ko: /lib/modules/net_failover.ko /lib/modules/failover.ko
/lib/modules/net_failover.ko: /lib/modules/failover.ko
/lib/modules/virtio-gpu.ko: /lib/modules/virtio_dma_buf.ko
/lib/modules/virt_wifi_sim.ko: /lib/modules/virt_wifi.ko
[^[:blank:]]+/ finds 1+ non-whitespace characters followed by a / thus matching longest string until it gets a / in each of the multiple string per line.

Awk inside of qsub

I have a bash script in which I have a few qsubs. Each of them are waiting for a preivous qsub to be done before starting.
My first qsub consist of sending files in a certain directory to a perl program and having the outfiles printed in a new directory. At the end, I echo the array with all my jobs names. This script works as intented.
mkdir -p /perl_files_dir
for ID_FILES in `ls Infiles_dir/*.txt`;
do
JOB_ID=`echo "perl perl_scirpt.pl $ID_FILES" | qsub -j oe `
JOB_ID_ARRAY="${JOB_ID_ARRAY}:$JOB_ID"
done
echo $JOB_ID_ARRAY
My second qsub is meant to sort all my previous files made with my perl script in a new outfile and to start after all these jobs are done (about 100 jobs) with depend=afterany. Again, this part is working fine.
SORT_JOB=`echo "sort -m -n perl_files_dir/*.txt >>sorted_file.txt" | qsub -j oe -W depend=afterany$JOB_ID_ARRAY`
SORT_ARRAY="${SORT_ARRAY}:$SORT_JOB"
My issue is that in my sorted file, I have a few columns I wish to remove (2 to 6), so I came up with this last line using awk piped to sed with another depend=afterany
SED=`echo "awk '{\$2="";\$3="";\$4="";\$5="";\$6=""; print \$0}' sorted_file.txt \
| sed 's/ //g' >final_file.txt" | qsub -j oe -W depend=afterany$SORT_ARRAY`
This last step creates final_file.txt, but leaves it empty. I added SED= before my echo because it would otherwise give me Command not found.
I tried without the pipe so it would just print everything. Unfortunately it prints nothing.
I assume it is not opening my sorted file and this is why my final file is empty after my sed. If it's the case, then why won't awk read it?
In my script, I am using variables to define my directories and files (with the correct path). I know my issue is not about find my files or directories since they are perfectly defined at the beginning and used throughout the script. I tried to write the whole path instead of a variable and I get the same results.
for ID_FILES in `ls Infiles_dir/*.txt`
Simplify this to
for ID_FILES in Infiles_dir/*.txt
ls lists the files you pass it (except when you pass it directories, then it lists their content). Rather than telling it to display a list of files and parse the output, use the list of files you already have! This is more reliable (parsing the output of ls will fail if the file names contain whitespace or wildcard characters), clearer and faster. Don't parse the output of ls.
SORT_JOB=`echo "sort -m -n perl_files_dir/*.txt >>sorted_file.txt" | qsub -j oe -W depend=afterany$JOB_ID_ARRAY`
You'd make your life simpler if you used the right form of quoting in the right place. Don't use backquotes, because it's difficult to know how to quote things inside. Use $(…) instead, it's exactly equivalent except that it is parsed in a sane way.
I recommend using a here document for the shell snippet that you're feeding to qsub. You have fewer quoting issues to worry about, and it's more readable.
While we're at it, always put double quotes around variable substitutions and command substitutions: "$some_variable", "$(some_command)". Annoyingly, $var in shell syntax doesn't mean “take the value of the variable var”, it means “take the value of the variable var, parse it as a list of wildcard patterns, and replace each pattern by the list of matching files if there are matching files”. This extra stuff is turned off if the substitution happens inside double quotes (or in a here document, by the way): "$var" means “take the value of the variable var”.
SORT_JOB=$(qsub -j oe -W depend="afterany$JOB_ID_ARRAY" <<'EOF'
sort -m -n perl_files_dir/*.txt >>sorted_file.txt
EOF
)
We now get to the snippet where the quoting was actually causing a problem.
SED=`echo "awk '{\$2="";\$3="";\$4="";\$5="";\$6=""; print \$0}' sorted_file.txt \
| sed 's/ //g' >final_file.txt" | qsub -j oe -W depend=afterany$SORT_ARRAY`
The string that becomes the argument to the echo command is:
awk '{$2=;$3=;$4=;$5=;$6=; print $0}' sorted_file.txt | sed 's/ //g' >final_file.txt
This is syntactically incorrect, and that's why you're not getting any output.
You didn't escape the double quotes inside what was meant to be the awk snippet. It's a lot clearer if you use a here document. Also, you don't need the SED= part. You added it because you had a command substitution (a command between …), which substitutes the output of a command. But since you aren't interested in the output of the qsub command, don't take its output, just execute it.
qsub -j oe -W depend="afterany$SORT_ARRAY" <<'EOF'
awk '{$2="";$3="";$4="";$5="";$6=""; print $0}' sorted_file.txt |
sed 's/ //g' >final_file.txt
EOF
I'm not familiar with qsub, but presumably there's a way to get the error output and the return status of the commands it runs. Inspect that error output, you should have seen the errors from awk.
The version of awk that I am using, does not like the character escapes
awk --version
GNU Awk 3.1.7
spuder#cent64$ awk '{\$2="";\$3="";\$4=""; print \$0}' foo.txt
awk: {\$2="";\$3="";\$4=""; print \$0}
awk: ^ backslash not last character on line
Try the following syntax
awk '{for(i=2;i<=7;i++) $i="";print}' foo.txt
As a side note, if you are using Torque 4.x you may not be able to use a comma separated list of jobs with -W depend=, instead you may need to create a new PBS declarative (-W) for each job.
eg...
#Invalid syntax in newer versions of torque
qsub -W depend=foo,bar
Resources
backslash in gawk fields
Print all but the first three columns
http://docs.adaptivecomputing.com/torque/help.htm#topics/commands/qsub.htm#-W

Add text at the end of each line

I'm on Linux command line and I have file with
127.0.0.1
128.0.0.0
121.121.33.111
I want
127.0.0.1:80
128.0.0.0:80
121.121.33.111:80
I remember my colleagues were using sed for that, but after reading sed manual still not clear how to do it on command line?
You could try using something like:
sed -n 's/$/:80/' ips.txt > new-ips.txt
Provided that your file format is just as you have described in your question.
The s/// substitution command matches (finds) the end of each line in your file (using the $ character) and then appends (replaces) the :80 to the end of each line. The ips.txt file is your input file... and new-ips.txt is your newly-created file (the final result of your changes.)
Also, if you have a list of IP numbers that happen to have port numbers attached already, (as noted by Vlad and as given by aragaer,) you could try using something like:
sed '/:[0-9]*$/ ! s/$/:80/' ips.txt > new-ips.txt
So, for example, if your input file looked something like this (note the :80):
127.0.0.1
128.0.0.0:80
121.121.33.111
The final result would look something like this:
127.0.0.1:80
128.0.0.0:80
121.121.33.111:80
Concise version of the sed command:
sed -i s/$/:80/ file.txt
Explanation:
sed stream editor
-i in-place (edit file in place)
s substitution command
/replacement_from_reg_exp/replacement_to_text/ statement
$ matches the end of line (replacement_from_reg_exp)
:80 text you want to add at the end of every line (replacement_to_text)
file.txt the file name
How can this be achieved without modifying the original file?
If you want to leave the original file unchanged and have the results in another file, then give up -i option and add the redirection (>) to another file:
sed s/$/:80/ file.txt > another_file.txt
sed 's/.*/&:80/' abcd.txt >abcde.txt
If you'd like to add text at the end of each line in-place (in the same file), you can use -i parameter, for example:
sed -i'.bak' 's/$/:80/' foo.txt
However -i option is non-standard Unix extension and may not be available on all operating systems.
So you can consider using ex (which is equivalent to vi -e/vim -e):
ex +"%s/$/:80/g" -cwq foo.txt
which will add :80 to each line, but sometimes it can append it to blank lines.
So better method is to check if the line actually contain any number, and then append it, for example:
ex +"g/[0-9]/s/$/:80/g" -cwq foo.txt
If the file has more complex format, consider using proper regex, instead of [0-9].
You can also achieve this using the backreference technique
sed -i.bak 's/\(.*\)/\1:80/' foo.txt
You can also use with awk like this
awk '{print $0":80"}' foo.txt > tmp && mv tmp foo.txt
Using a text editor, check for ^M (control-M, or carriage return) at the end of each line. You will need to remove them first, then append the additional text at the end of the line.
sed -i 's|^M||g' ips.txt
sed -i 's|$|:80|g' ips.txt
sed -i 's/$/,/g' foo.txt
I do this quite often to add a comma to the end of an output so I can just easily copy and paste it into a Python(or your fav lang) array

Using variables in sed -f (where sed script is in a file rather than inline)

We have a process which can use a file containing sed commands to alter piped input.
I need to replace a placeholder in the input with a variable value, e.g. in a single -e type of command I can run;
$ echo "Today is XX" | sed -e "s/XX/$(date +%F)/"
Today is 2012-10-11
However I can only specify the sed aspects in a file (and then point the process at the file), E.g. a file called replacements.sed might contain;
s/XX/Thursday/
So obviously;
$ echo "Today is XX" | sed -f replacements.sed
Today is Thursday
If I want to use an environment variable or shell value, though, I can't find a way to make it expand, e.g. if replacements.txt contains;
s/XX/$(date +%F)/
Then;
$ echo "Today is XX" | sed -f replacements.sed
Today is $(date +%F)
Including double quotes in the text of the file just prints the double quotes.
Does anyone know a way to be able to use variables in a sed file?
This might work for you (GNU sed):
cat <<\! > replacements.sed
/XX/{s//'"$(date +%F)"'/;s/.*/echo '&'/e}
!
echo "Today is XX" | sed -f replacements.sed
If you don't have GNU sed, try:
cat <<\! > replacements.sed
/XX/{
s//'"$(date +%F)"'/
s/.*/echo '&'/
}
!
echo "Today is XX" | sed -f replacements.sed | sh
AFAIK, it's not possible. Your best bet will be :
INPUT FILE
aaa
bbb
ccc
SH SCRIPT
#!/bin/sh
STRING="${1//\//\\/}" # using parameter expansion to prevent / collisions
shift
sed "
s/aaa/$STRING/
" "$#"
COMMAND LINE
./sed.sh "fo/obar" <file path>
OUTPUT
fo/obar
bbb
ccc
As others have said, you can't use variables in a sed script, but you might be able to "fake" it using extra leading input that gets added to your hold buffer. For example:
[ghoti#pc ~/tmp]$ cat scr.sed
1{;h;d;};/^--$/g
[ghoti#pc ~/tmp]$ sed -f scr.sed <(date '+%Y-%m-%d'; printf 'foo\n--\nbar\n')
foo
2012-10-10
bar
[ghoti#pc ~/tmp]$
In this example, I'm using process redirection to get input into sed. The "important" data is generated by printf. You could cat a file instead, or run some other program. The "variable" is produced by the date command, and becomes the first line of input to the script.
The sed script takes the first line, puts it in sed's hold buffer, then deletes the line. Then for any subsequent line, if it matches a double dash (our "macro replacement"), it substitutes the contents of the hold buffer. And prints, because that's sed's default action.
Hold buffers (g, G, h, H and x commands) represent "advanced" sed programming. But once you understand how they work, they open up new dimensions of sed fu.
Note: This solution only helps you replace entire lines. Replacing substrings within lines may be possible using the hold buffer, but I can't imagine a way to do it.
(Another note: I'm doing this in FreeBSD, which uses a different sed from what you'll find in Linux. This may work in GNU sed, or it may not; I haven't tested.)
I am in agreement with sputnick. I don't believe that sed would be able to complete that task.
However, you could generate that file on the fly.
You could change the date to a fixed string, like
__DAYOFWEEK__.
Create a temp file, use sed to replace __DAYOFWEEK__ with $(date +%Y).
Then parse your file with sed -f $TEMPFILE.
sed is great, but it might be time to use something like perl that can generate the date on the fly.
To add a newline in the replacement expression using a sed file, what finally worked for me is escaping a literal newline. Example: to append a newline after the string NewLineHere, then this worked for me:
#! /usr/bin/sed -f
s/NewLineHere/NewLineHere\
/g
Not sure it matters but I am on Solaris unix, so not GNU sed for sure.

In-place replacement

I have a CSV. I want to edit the 35th field of the CSV and write the change back to the 35th field. This is what I am doing on bash:
awk -F "," '{print $35}' test.csv | sed -i 's/^0/+91/g'
so, I am pulling the 35th entry using awk and then replacing the "0" in the starting position in the string with "+91". This one works perfet and I get desired output on the console.
Now I want this new entry to get written in the file. I am thinking of sed's "in -place" replacement feature but this fetuare needs and input file. In above command, I cannot provide input file because my primary command is awk and sed is taking the input from awk.
Thanks.
You should choose one of the two tools. As for sed, it can be done as follows:
sed -ri 's/^(([^,]*,){34})0([^,]*)/\1+91\3/' test.csv
Not sure about awk, but #shellter's comment might help with that.
The in-place feature of sed is misnamed, as it does not edit the file in place. Instead, it creates a new file with the same name. eg:
$ echo foo > foo
$ ln -f foo bar
$ ls -i foo bar # These are the same file
797325 bar 797325 foo
$ echo new-text > foo # Changes bar
$ cat bar
new-text
$ printf '/new/s//newer\nw\nq\n' | ed foo # Edit foo "in-place"; changes bar
9
newer-text
11
$ cat bar
newer-text
$ ls -i foo bar # Still the same file
797325 bar 797325 foo
$ sed -i s/new/newer/ foo # Does not edit in-place; creates a new file
$ ls -i foo bar
797325 bar 792722 foo
Since sed is not actually editing the file in place, but writing a new file and then renaming it to the old file, you might as well do the same.
awk ... test.csv | sed ... > test.csv.1 && mv test.csv.1 test.csv
There is the misperception that using sed -i somehow avoids the creation of the temporary file. It does not. It just hides the fact from you. Sometimes abstraction is a good thing, but other times it is unnecessary obfuscation. In the case of sed -i, it is the latter. The shell is really good at file manipulation. Use it as intended. If you do need to edit a file in place, don't use the streaming version of ed; just use ed
So, it turned out there are numerous ways to do it. I got it working with sed as below:
sed -i 's/0\([0-9]\{10\}\)/\+91\1/g' test.csv
But this is little tricky as it will edit any entry which matches the criteria. however in my case, It is working fine.
Similar implementation of above logic in perl:
perl -p -i -e 's/\b0(\d{10})\b/\+91$1/g;' test.csv
Again, same caveat as mentioned above.
More precise way of doing it as shown by Lev Levitsky because it will operate specifically on the 35th field
sed -ri 's/^(([^,]*,){34})0([^,]*)/\1+91\3/g' test.csv
For more complex situations, I will have to consider using any of the csv modules of perl.
Thanks everyone for your time and input. I surely know more about sed/awk after reading your replies.
This might work for you:
sed -i 's/[^,]*/+91/35' test.csv
EDIT:
To replace the leading zero in the 35th field:
sed 'h;s/[^,]*/\n&/35;/\n0/!{x;b};s//+91/' test.csv
or more simply:
|sed 's/^\(\([^,]*,\)\{34\}\)0/\1+91/' test.csv
If you have moreutils installed, you can simply use the sponge tool:
awk -F "," '{print $35}' test.csv | sed -i 's/^0/+91/g' | sponge test.csv
sponge soaks up the input, closes the input pipe (stdin) and, only then, opens and writes to the test.csv file.
As of 2015, moreutils is available in package repositories of several major Linux distributions, such as Arch Linux, Debian and Ubuntu.
Another perl solution to edit the 35th field in-place:
perl -i -F, -lane '$F[34] =~ s/^0/+91/; print join ",",#F' test.csv
These command-line options are used:
-i edit the file in-place
-n loop around every line of the input file
-l removes newlines before processing, and adds them back in afterwards
-a autosplit mode – split input lines into the #F array. Defaults to splitting on whitespace.
-e execute the perl code
-F autosplit modifier, in this case splits on ,
#F is the array of words in each line, indexed starting with 0
$F[34] is the 35 element of the array
s/^0/+91/ does the substitution