How to scan logfile for xml values and combine them on one line - sed

I have a logfile with multiple lines like "DEBUG MDM payload:", and on separate lines after each one is the XML content, which can differ in length but always ends the same way. How can I use tr and sed, or another method, to combine all the XML content onto the same line as "DEBUG MDM payload:"?
2017-01-26T10:54:28.712+0100 - CORE {wff-device-thread-15 : deviceRestExecuteWorkflow.deviceRestExecuteActivity.EventQueueConsumer}|logger [{{Correlation,dhdjwdw-3a44-4b0d-aa52-dwdwdwdwdw}{Uri,PUT /S29112264/ios/mdm2/dwdwdwdw-c7d2-465c-be44-dwdwdwdw}{host,}}] - DEBUG MDM payload:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "">
<plist version="1.0">
Update: The following command prints what I want; I just can't figure out how to take that output and alter the file so the content ends up on one line.
sed -n "/\/ios\/mdm2\DEBUG MDM payload/,/<\/plist>/p"
Update 2: OK, a bit further now, but I still have tabs in some cases. Trying to find a way to remove the tabs next.
sed -n "/DEBUG MDM payload/,/<\/plist>/p" fake_log.txt | tr -d "\r" | tr -d "\n"
Update 3: OK, got it. Now is there a way to remove the existing lines and have this newly modified line added in its place?
sed -n "/DEBUG MDM payload/,/<\/plist>/p" fake_log.txt | tr -d "\r" | tr -d "\n" | tr -d "\t"

To replace newlines in place, sed is probably not the tool you're looking for. Try using Perl instead. This one-liner should do the trick.
perl -i -a -n -e '$matched=1 if /DEBUG MDM payload/; if ($matched && /<\/plist>/) {print "@F\n"; $matched=0;} elsif ($matched) {print "@F";} else {print "@F\n";}' log.txt
To do a dry run, just remove the -i option, and it will output instead of changing the file in place.
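For comparison, the same join can be sketched with awk. This is a minimal sketch, not the method from the answer above: the function name join_payload is hypothetical, and it assumes every payload starts on a line containing "DEBUG MDM payload" and ends on the line containing </plist>. Lines outside a payload are printed unchanged.

```shell
#!/bin/sh
# join_payload (hypothetical helper): print FILE with every block from
# "DEBUG MDM payload" through </plist> collapsed onto a single line.
join_payload() {
    awk '
        /DEBUG MDM payload/ { joining = 1 }      # a payload block starts here
        joining {
            gsub(/[\t\r]/, "")                   # drop tabs and carriage returns
            printf "%s", $0                      # emit the line without its newline
            if ($0 ~ /<\/plist>/) {              # last line of the block
                printf "\n"
                joining = 0
            }
            next
        }
        { print }                                # all other lines pass through
    ' "$1"
}
```

awk has no portable in-place mode, so one option is to write to a temporary file and move it over the original, for example: join_payload fake_log.txt > fake_log.txt.new && mv fake_log.txt.new fake_log.txt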


Trying to remove line and all following using sed but one line remains

I'm removing this text from the bottom of some files
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "">
<plist version="1.0">
(snipped for brevity) I'm using
sed -n -i '' '/<?xml version="1.0" encoding="UTF-8"?>/q;p' myfile.txt
The text is removed as expected but not the first line - I thought I was asking 'when you get to this line, remove it and everything following.'
(I seem to get everything removed when I just run this in a Terminal window, but not when saving the file.)
Mac user BTW.
Use this Perl one-liner:
perl -i.bak -pe 'last if m{\Q<?xml version="1.0" encoding="UTF-8"?>\E}; ' in_file
For multiple files:
find /path/to/files ... -exec perl -i.bak -pe 'last if m{\Q<?xml version="1.0" encoding="UTF-8"?>\E}; ' {} \;
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-p : Loop over the input one line at a time, assigning it to $_ by default. Add print $_ after each loop iteration.
-i.bak : Edit input files in-place (overwrite the input file). Before overwriting, save a backup copy of the original file by appending to its name the extension .bak.
\Q : Quote (disable) pattern metacharacters until \E.
\E : End either case modification or quoted section (think vi).
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perlre: Perl regular expressions (regexes)
This is my solution which removes all the lines I want to remove (in files in a directory).
for file in *.emlx; do
    sed -i '' '/<?xml version="1.0" encoding="UTF-8"?>/,$d' "$file"
done
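A variant of the loop above matches the marker as a fixed string rather than a regular expression, so characters such as ? and . need no escaping. This is a sketch under assumptions: truncate_at_marker is a hypothetical name, and the marker must not contain backslashes, because awk -v interprets escape sequences in the assigned value.

```shell
#!/bin/sh
# truncate_at_marker (hypothetical helper): delete the first line that
# contains MARKER, and every line after it, from FILE.
truncate_at_marker() {
    marker=$1
    file=$2
    # index() does a plain substring search, so no regex quoting is needed.
    awk -v m="$marker" 'index($0, m) { exit } { print }' "$file" > "$file.tmp" &&
        mv "$file.tmp" "$file"
}

for file in *.emlx; do
    [ -e "$file" ] || continue    # skip the literal pattern when nothing matches
    truncate_at_marker '<?xml version="1.0" encoding="UTF-8"?>' "$file"
done
```

Writing to a temporary file and renaming it avoids depending on sed -i, whose argument syntax differs between the BSD sed shipped with macOS and GNU sed.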

exiftool not showing space character when using -t or -T option

I am using the following command to save a tag with a space character:
exiftool -config xmp.config -overwrite_original -PropertyID=' ' /Users/admin/Downloads/Files/09913/1KingWithSofaBed_rm521_1.tif
Using the -X option, I can see that the space character was saved successfully:
exiftool -X -filename -PropertyID /Users/admin/Downloads/Files/09913/1KingWithSofaBed_rm521_1.tif
<?xml version='1.0' encoding='UTF-8'?>
<rdf:RDF xmlns:rdf=''>
<rdf:Description rdf:about='/Users/admin/Downloads/Files/09913/1KingWithSofaBed_rm521_1.tif'
xmlns:et='' et:toolkit='Image::ExifTool 11.84'
<XMP-xmp:PropertyID> </XMP-xmp:PropertyID>
The problem is that -t or -T does not show the space:
exiftool -t -filename -PropertyID /Users/admin/Downloads/Files/09913/1KingWithSofaBed_rm521_1.tif
File Name 1KingWithSofaBed_rm521_1.tif
Property ID
exiftool -T -filename -PropertyID /Users/admin/Downloads/Files/09913/1KingWithSofaBed_rm521_1.tif
In both cases the space is not present for the PropertyID field (I have checked the output with a hex editor).
Is this a limitation of exiftool, or is it possible to show the space using the -t or -T option?
The answer from Phil Harvey, the author of exiftool
You can use the (undocumented) -ec option (ExifTool 11.54 or later) to escape control characters using C-style escape sequences and preserve trailing newlines, nulls and newlines, etc
I tested it out and it seemed to preserve trailing spaces

Extracting the contents between two different strings using bash or perl

I have searched the other posts on Stack Overflow for this but couldn't get my code to work, so I am posting a new question.
Below is the content of file temp.
<?xml version="1.0" encoding="UTF-8"?>
<env:Envelope xmlns:env="<env:Body><dp:response xmlns:dp=""><dp:timestamp>2015-01-
22T13:38:04Z</dp:timestamp><dp:file name="temporary://test.txt">XJzLXJlc3VsdHMtYWN0aW9uX18i</dp:file><dp:file name="temporary://test1.txt">lc3VsdHMtYWN0aW9uX18i</dp:file></dp:response></env:Body></env:Envelope>
This file contains the base64-encoded contents of two files named test.txt and test1.txt. I want to extract the base64-encoded content of each file into separate files, test.txt and test1.txt respectively.
To achieve this, I have to remove the XML tags around the base64 contents. I am trying the commands below, but they are not working as expected.
sed -n '/test.txt"\>/,/\<\/dp:file\>/p' temp | perl -p -e 's#<dp:file name="temporary://test.txt">##g'|perl -p -e 's#</dp:file>##g' > test.txt
sed -n '/test1.txt"\>/,/\<\/dp:file\>/p' temp | perl -p -e 's#<dp:file name="temporary://test1.txt">##g'|perl -p -e 's#</dp:file></dp:response></env:Body></env:Envelope>##g' > test1.txt
Below command:
sed -n '/test.txt"\>/,/\<\/dp:file\>/p' temp | perl -p -e 's#<dp:file name="temporary://test.txt">##g'|perl -p -e 's#</dp:file>##g'
produces output:
<dp:file name="temporary://test1.txt">lc3VsdHMtYWN0aW9uX18i</dp:response> </env:Body></env:Envelope>
However, in the output I am expecting only the first string, XJzLXJlc3VsdHMtYWN0aW9uX18i. Where am I making a mistake?
When I run the command below, I get the expected output:
sed -n '/test1.txt"\>/,/\<\/dp:file\>/p' temp | perl -p -e 's#<dp:file name="temporary://test1.txt">##g'|perl -p -e 's#</dp:file></dp:response></env:Body></env:Envelope>##g'
It produces below string
I can then easily route this to test1.txt file.
I have edited the question by updating the source file content. The source file doesn't contain any newline characters, so the current solution will not work in that case; I have tried it and it failed. wc -l temp must output 1.
OS: solaris 10
Shell: bash
sed -n 's_<dp:file name="\([^"]*\)">\([^<]*\).*_\1 -> \2_p' temp
I added \1 -> to show the link from file name to content; for the content only, just remove that part.
This is the POSIX version, so on GNU sed use --posix.
This assumes the base64-encoded content is on the same line as the surrounding tag (not spread over several lines, which would need some modification).
Thanks to JID for the full explanation below.
How it works
sed -n
The -n means no printing, so unless explicitly told to print, there will be no output from sed.
s_
This substitutes the following regex, using _ to separate the regex from the replacement.
<dp:file name=
Regular text.
"\([^"]*\)"
The brackets are a capture group and must be escaped unless the -r option is used (-r is not available in POSIX sed). Everything inside the brackets is captured. [^"]* means 0 or more occurrences of any character that is not a quote. So really this just captures anything between the two quotes.
>\([^<]*\)
Again uses a capture group, this time to capture everything between the > and the next <.
.*
Everything else on the line.
_\1 -> \2
This is the replacement, so replace everything matched by the regex above with the first capture group, then a ->, and then the second capture group.
_p
Means print the line.
/usr/xpg4/bin/sed works well here.
/usr/bin/sed does not work as expected if the file contains just one line.
The command below works for a file containing only a single line.
/usr/xpg4/bin/sed -n 's_<env:Envelope\(.*\)<dp:file name="temporary://BackUpDir/backupmanifest.xml">\([^>]*\)</dp:file>\(.*\)_\2_p' securebackup.xml 2>/dev/null
Without 2>/dev/null, this sed command outputs the warning sed: Missing newline at end of file.
This is for the following reason:
The default Solaris sed ignores the last line so as not to break existing scripts, because a line was required to be terminated by a newline in the original Unix implementation.
GNU sed has a more relaxed behavior, and the POSIX implementation accepts the unterminated line but outputs a warning.
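When the whole document sits on one line and holds several <dp:file> elements, a single sed substitution reports only one of them per line. A minimal awk sketch that lists all of them follows. The name list_dp_files is hypothetical, and on Solaris 10 this probably needs nawk or /usr/xpg4/bin/awk instead of /usr/bin/awk; that is an assumption, not something tested here.

```shell
#!/bin/sh
# list_dp_files (hypothetical helper): print "NAME -> CONTENT" for every
# <dp:file name="NAME">CONTENT</dp:file> element in FILE, even when the
# whole document is a single line.
list_dp_files() {
    # RS="<" makes every tag start a new awk record, so each element
    # arrives as the record: dp:file name="NAME">CONTENT
    awk -v RS='<' '
        /^dp:file name="/ {
            split($0, parts, "\">")              # parts[1] holds the name, parts[2] the content
            name = parts[1]
            sub(/^dp:file name="/, "", name)     # strip the attribute prefix
            gsub(/[ \t\r\n]/, "", parts[2])      # remove whitespace from the payload
            print name " -> " parts[2]
        }
    ' "$1"
}
```

Each content field can then be passed to whatever base64 decoder is available on the system.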

Parsing HTML on the command line; How to capture text in <strong></strong>?

I'm trying to grab data from HTML output that looks like this:
<strong>Target1NoSpaces</strong><span class="creator"> ....
<strong>Target2 With Spaces</strong><span class="creator"> ....
I'm using a pipe train to whittle down the data to the targets I'm trying to hit. Here's my approach so far:
grep "/strong" output.html | awk '{print $1}'
Grep on "/strong" to get the lines with the targets; that works fine.
Pipe to 'awk '{print $1}'. That works in case #1 when the target has no spaces, but fails in case #2 when the target has spaces..only the first word is preserved as below:
Do you have any tips on hitting the target properly, either in my awk or in different command? Anything quick and dirty (grep, awk, sed, perl) would be appreciated.
Try pup, a command line tool for processing HTML. For example:
$ pup 'strong text{}' < file.html
Target2 With Spaces
To search via XPath, try xpup.
Alternatively, for a well-formed HTML/XML document, try html-xml-utils.
One way using mojolicious and its DOM parser:
perl -Mojo -E '
->each( sub { if ( $t = shift->text ) { say $t } } )'
Using Perl regex's look-behind and look-ahead feature in grep. It should be simpler than using awk.
grep -oP "(?<=<strong>).*?(?=</strong>)" file
Target2 With Spaces
This implementation of Perl's regex's multi-matching in Ruby could match values in multiple lines:
ruby -e 'ARGF.read.scan(/(?<=<strong>).*?(?=<\/strong>)/m).each{|e| puts "----------"; puts e;}' file
</strong><strong>Target D</strong><strong>Target E</strong>
Target D
Target E
Here's a solution using xmlstarlet
xml sel -t -v //strong input.html
Trying to parse HTML without a real HTML parser is a bad idea. Having said that, here is a very quick and dirty solution to the specific example you provided. It will not work when there is more than one <strong> tag on a line, when the tag runs over more than one line, etc.
awk -F '<strong>|</strong>' '/<strong>/ {print $2}' filename
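The one-tag-per-line limit can be relaxed by splitting records on < instead of on newlines. This is still a quick-and-dirty sketch rather than a parser: extract_strong is a hypothetical name, and it assumes there are no nested tags inside <strong> and no > characters inside attribute values.

```shell
#!/bin/sh
# extract_strong (hypothetical helper): print the text of every
# <strong>...</strong> element in FILE, one element per output record.
extract_strong() {
    # With RS="<" each tag starts a new record, so several elements on
    # one line, or one element spanning lines, are handled the same way.
    awk -v RS='<' '
        /^strong[ >]/ {
            sub(/^[^>]*>/, "")    # drop the rest of the opening tag
            print
        }
    ' "$1"
}
```

Called as extract_strong output.html, this prints one target per <strong> element, including targets that contain spaces.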
You never need grep with awk and the field separator doesn't have to be whitespace:
$ awk -F'<|>' '/strong/{print $3}' file
Target2 With Spaces
You should really use a proper parser for this however.
Since you tagged perl
perl -ne 'if(/(?:<strong>)(.*)(?:<\/strong>)/){print $1."\n";}' input.html
I am surprised no one mentions the W3C HTML-XML-utils:
curl -Ss |
hxnormalize -x |
hxselect -s '\n' strong
<strong class="fc-black-750 mb6">Stack Overflow
for Teams</strong>
To capture only content:
curl -Ss |
hxnormalize -x |
hxselect -s '\n' -c strong
Stack Overflow
for Teams

Xmlstarlet and sed to replace string in a file

I have a huge number of HTML files. I need to replace every , and " with the HTML entities &nsbquo; and &quot; respectively.
I need to succeed in two steps for this:
1) Find all the text between tags. I need to replace only in this text between tags.
2) Replace all required strings using sed
My command for this is :
xmlstarlet sel -t -v "*//p" "index.html" | sed 's/,/\&nsbquo/'
This works, but now I don't know how to write the changes back to the index.html file.
sed has the -i option, but for that I need to give sed the filename. In my case, I have to use | to filter the required string out of the HTML file.
Please help. I have searched for this for two days with no luck.
Thank you,
The main problem here is that in XML there is no difference between " and &quot;, so you can't use xmlstarlet to do this directly. You could replace " with a special string and then use sed to replace that with &quot;:
xmlstarlet ed -u "//p/text()" \
-x "str:replace(str:replace(., ',', '#NSBQUO#'), '\"', '#QUOT#')" \
quote.html | \
sed 's/#NSBQUO#/\&nsbquo\;/g; s/#QUOT#/\&quot\;/g' > quote-new.html
mv quote-new.html quote.html
NOTE: str:replace and other exslt functions were only added to xmlstarlet ed in version 1.3.0, so it was not available at the time this question was asked.