On a Unix system I have an input text file containing long multi-line strings.
I now want to remove line breaks only between two patterns (<remarks> and </remarks>), which can be on different lines.
Example input file:
text1 text2 <remarks> text3
text4 text5 </remarks> text6 text7 text8
Result output for the above input file should be:
text1 text2 <remarks> text3 text4 text5 </remarks> text6 text7 text8
I would prefer to use sed or Perl or maybe awk to do the job.
I do not see a solution, as the newlines can occur anywhere and the text is just some log messages.
Here is a more detailed look at the input file I need to process. It does not contain a root XML element, but for testing I might just add one manually. Also, there may be many "remarks" sections.
Inputfile Snippet (as it is very long), Filename is test:
<paymentTerm keyValue1="8" objectType="PAYMENTTERM" />
<paymentType keyValue1="20" objectType="PAYMENTTYPE" />
<priceList keyValue1="1" objectType="PRICELIST" />
<remarks>Zollanmeldung ab 250 €
Lager Adresse:
Hessen-Ring 456
D-64546 Mörfelden-Walldorff
eine Stunde vor Ankunft melden unter Mobile
Neu Spedition
A&R Logistics Group
Storkenburgstrasse 99
D-62546 Mörfelden-Walldorf
www.asp.de</remarks>
<salesPersons>
<PERSON keyValue1="2" keyValue2="SALESEMPLOYEE" objectType="PERSON" />
</salesPersons>
<shippingType keyValue1="5" objectType="SHIPPINGTYPE" />
As stated above I want to remove the linebreaks ONLY between the patterns "remarks" and "/remarks".
I tried the Perl XML Parsing suggested by borodin like this:
use strict;
use warnings 'all';
use XML::Twig;
use constant XML_FILE => 'test';
my $twig = XML::Twig->new(
twig_handlers => {
remarks => sub { $_->set_text($_->trimmed_text) }
}
);
$twig->parsefile(XML_FILE);
$twig->print;
It works, but prints everything on one line.
With GNU awk for multi-char RS:
$ awk -v RS='</?remarks>' -v ORS= '!(NR%2){gsub(/\n/,OFS)} {print $0 RT}' file
text1 text2 <remarks> text3 text4 text5 </remarks> text6 text7 text8
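For comparison, here is a minimal Python sketch of the same text-based idea: find each <remarks>...</remarks> span and replace the newlines inside it with spaces. The function name join_remarks is just an illustrative choice, and like any regex approach it assumes the tags are written literally and are not nested.

```python
import re

def join_remarks(text: str) -> str:
    """Replace line breaks with spaces, but only inside <remarks>...</remarks>."""
    def flatten(match: re.Match) -> str:
        # match.group(0) is one whole <remarks>...</remarks> span
        return match.group(0).replace("\n", " ")
    # DOTALL lets '.' cross line breaks; '*?' stops at the first closing tag
    return re.sub(r"<remarks>.*?</remarks>", flatten, text, flags=re.DOTALL)

sample = "text1 text2 <remarks> text3\ntext4 text5 </remarks> text6 text7 text8\n"
print(join_remarks(sample), end="")
```

Newlines outside the remarks spans are left alone, so the rest of the file keeps its line structure.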
XML can represent the same information in many different ways, and it is always a risk to try processing it using regular expressions. It is far better to use a proper XML module to process XML data. This solution uses XML::Twig.
In the constructor for the $twig object you can specify a callback which is called automatically every time a given XML element is encountered in the input.
The trimmed_text method removes leading and trailing whitespace from the text of the element, and turns any internal whitespace sequences, including line breaks, into a single space. That is exactly what you are asking for here, so a call to set_text is all that is necessary to update the string.
The file to be processed is specified by the XML_FILE constant, and you should modify that to specify the path to your own data file. The modified XML is printed to STDOUT.
use strict;
use warnings 'all';
use open qw/ :std :encoding(UTF-8) /;
use XML::Twig;
use constant XML_FILE => 'remarks.xml';
my $twig = XML::Twig->new(
keep_spaces => 1,
twig_handlers => {
remarks => sub { $_->set_text($_->trimmed_text) }
}
);
$twig->parsefile(XML_FILE);
$twig->print;
input
Your sample data is invalid XML, so I have edited it to look like this. I have added the XML declaration that you said in a comment that you had, and I have added a root element <data>
<?xml version="1.0" encoding="UTF-8"?>
<data>
<paymentTerm keyValue1="8" objectType="PAYMENTTERM" />
<paymentType keyValue1="20" objectType="PAYMENTTYPE" />
<priceList keyValue1="1" objectType="PRICELIST" />
<remarks>Zollanmeldung ab 250 €
Lager Adresse:
Hessen-Ring 456
D-64546 Mörfelden-Walldorff
eine Stunde vor Ankunft melden unter Mobile
Neu Spedition
A&amp;R Logistics Group
Storkenburgstrasse 99
D-62546 Mörfelden-Walldorf
www.asp.de</remarks>
<salesPersons>
<PERSON keyValue1="2" keyValue2="SALESEMPLOYEE" objectType="PERSON" />
</salesPersons>
<shippingType keyValue1="5" objectType="SHIPPINGTYPE" />
</data>
output
<?xml version="1.0" encoding="UTF-8"?>
<data>
<paymentTerm keyValue1="8" objectType="PAYMENTTERM"/>
<paymentType keyValue1="20" objectType="PAYMENTTYPE"/>
<priceList keyValue1="1" objectType="PRICELIST"/>
<remarks>Zollanmeldung ab 250 € Lager Adresse: Hessen-Ring 456 D-64546 Mörfelden-Walldorff eine Stunde vor Ankunft melden unter Mobile Neu Spedition A&amp;R Logistics Group Storkenburgstrasse 99 D-62546 Mörfelden-Walldorf www.asp.de</remarks>
<salesPersons>
<PERSON keyValue1="2" keyValue2="SALESEMPLOYEE" objectType="PERSON"/>
</salesPersons>
<shippingType keyValue1="5" objectType="SHIPPINGTYPE"/>
</data>
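The same parser-based idea can be sketched in Python with the standard library's xml.etree.ElementTree. This is only an illustration under a couple of assumptions: the input is well-formed XML with a root element, and each remarks element contains plain text with no child elements. The function name normalize_remarks is made up for the example.

```python
import xml.etree.ElementTree as ET

def normalize_remarks(xml_text: str) -> str:
    """Collapse all whitespace runs inside every <remarks> element to one space."""
    root = ET.fromstring(xml_text)
    for remarks in root.iter("remarks"):
        # split() with no arguments splits on any whitespace, including newlines,
        # and drops leading/trailing whitespace, much like trimmed_text
        remarks.text = " ".join((remarks.text or "").split())
    return ET.tostring(root, encoding="unicode")

doc = "<data>\n<remarks>Lager Adresse:\nA&amp;R Logistics\n  Group</remarks>\n<x/>\n</data>"
print(normalize_remarks(doc))
```

Whitespace between elements is kept, so the document outside the remarks elements still prints on separate lines.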
Related
How can I replace the following string:
<value>-myValue</value>
<value>1234</value>
And make it to be:
<value>-myValue</value>
<value>0</value>
Please take into account that there is a line break.
Script
sed -e '/<value>-myValue</,/<value>/{ /<value>[0-9][0-9]*</ s/[0-9][0-9]*/0/; }' data
From a line containing <value>-myValue< to the next line containing <value>, if the line matches <value>XX< where XX is a string of one or more digits, replace the string of digits with 0.
Input
This is not something to change
<value>-myValue</value>
<value>1234</value>
<value>myValue</value>
<value>1234</value>
nonsense
<value>-myValue</value>
<value>abcd</value>
<value>-myValue</value>
<value>4321</value>
stuffing
Output
This is not something to change
<value>-myValue</value>
<value>0</value>
<value>myValue</value>
<value>1234</value>
nonsense
<value>-myValue</value>
<value>abcd</value>
<value>-myValue</value>
<value>0</value>
stuffing
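For readers who want the logic spelled out, here is a rough Python sketch of a simplified version of that rule. It is an assumption on my part that only the line immediately after the -myValue line should be touched; the sed range is slightly more general. The name zero_after_marker is hypothetical.

```python
import re

def zero_after_marker(lines):
    """Replace the digits in a <value>NNN< line with 0 when the previous
    line contained <value>-myValue<."""
    out = []
    armed = False  # True when the previous line was the -myValue marker
    for line in lines:
        if armed and re.search(r"<value>\d+<", line):
            line = re.sub(r"\d+", "0", line, count=1)
        armed = "<value>-myValue<" in line
        out.append(line)
    return out

data = [
    "This is not something to change",
    "<value>-myValue</value>",
    "<value>1234</value>",
    "<value>myValue</value>",
    "<value>1234</value>",
    "<value>-myValue</value>",
    "<value>abcd</value>",
]
print("\n".join(zero_after_marker(data)))
```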
If this is XML, TLP is right that an XML parser would be superior. Continuing on with your sed approach, however, consider:
$ sed '/<value>-myValue/ {N; s/<value>[[:digit:]]\+/<value>0/}' file
<value>-myValue</value>
<value>0</value>
You can possibly simplify this a bit, depending on what criteria you specifically want to use:
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;
my $twig = XML::Twig->new( 'pretty_print' => 'indented_a' )->parse( \*DATA );
foreach my $value ( $twig->findnodes('//value') ) {
if ( $value->trimmed_text eq '-myValue'
and $value->next_sibling('value')
and $value->next_sibling('value')->text =~ m/^\d+$/ )
{
$value->next_sibling('value')->set_text('1234');
}
}
$twig->print;
__DATA__
<root>
<value>-myValue</value>
<value>0</value>
</root>
This outputs:
<root>
<value>-myValue</value>
<value>1234</value>
</root>
It parses your XML.
Looks for all nodes with a tag of value.
Checks that the node's text is -myValue and that it has a following value sibling.
Checks that the sibling is purely numeric, i.e. matches the regex ^\d+$.
Replaces the content of that sibling with 1234.
And it will work on XML regardless of formatting, which is the problem with XML: pretty fundamentally, there's a bunch of entirely valid things you can do that are semantically identical in XML.
I have a file that looks like this:
cat output_title.txt
C817491287 Cat: Nor Sus: something date: 02/26/14
C858151287 Cat: Nor Sus: really something date: 02/26/14
I would like to send an email in HTML format, using parameters from the file, e.g.
mine: first parameter starting with C
sus: ?
date: ?
How can I do this?
EDIT: CODE
open (FILE, 'output_title.txt');
while (<FILE>) {
chomp;
($chg, $Cat, $category, $sta, $stus, $sus, $open, $open_date) = split(" ");
print "Chnge is:$chg\n";
}
After doncoyote comments :
use strict;
use warnings;
open (FILE, 'output_title.txt');
while (<FILE>) {
my ($Cnum,$Cat,$Sus,$Date) = m!(C\d{9})\s+Cat:\s+(\w+)\s+Sus:\s([\w\s]*?)date:\s+([\d/]+)$! ;
print "Cnum:$Cnum\t";
print "Caty:$Cat\t";
print "Stus:$Sus\t";
print "opendate:$Date\n";
}
close (FILE); exit;
You may find that capturing the required variables with a regex pattern works better than split when there are slight but predictable differences in the text being extracted.
Something like this should handle the cases provided. It could be improved, but it makes an OK starting point off the top of my head.
my ( $Cnum, $Cat, $Sus, $Date )
= m!(C\d{9})\s+Cat:\s+(\w+)\s+Sus:\s([\w\s]*?)date:\s+([\d/]+)$!;
You should start by reading the perlretut documentation to understand what is going on. Basically, the escapes \w, \d and \s stand for a word character, a digit and a whitespace character (space, tab, newline) respectively. The parentheses capture the matched text and pass the captures as a list to the variables being assigned. The square brackets define a character class, i.e. a choice among several characters.
Quantifiers: + is one or more, * is zero or more, and curly braces give an exact count or a comma-separated min/max. Each one applies to the item it immediately follows. The question mark after * makes it non-greedy, and $ is the end-of-line anchor.
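To see what each capture group picks up, here is the same pattern tried out in Python, whose re module accepts this syntax unchanged. The helper name parse_line is just for illustration; note that the lazy third group keeps the trailing space before "date:", so it is stripped afterwards.

```python
import re

# Same pattern as in the Perl answer, compiled once
PATTERN = re.compile(r"(C\d{9})\s+Cat:\s+(\w+)\s+Sus:\s([\w\s]*?)date:\s+([\d/]+)$")

def parse_line(line: str):
    """Return (cnum, cat, sus, date) for a matching line, or None."""
    match = PATTERN.search(line.rstrip("\n"))
    if match is None:
        return None
    cnum, cat, sus, date = match.groups()
    return cnum, cat, sus.strip(), date

for line in [
    "C817491287 Cat: Nor Sus: something date: 02/26/14",
    "C858151287 Cat: Nor Sus: really something date: 02/26/14",
]:
    print(parse_line(line))
```

Lines that do not fit the pattern return None, which is easier to handle than the undefined variables you would get from a failed match.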
I'm pretty sure there are several methods to send html-mails from perl.
For example:
use MIME::Lite;
my $msg = MIME::Lite->new(
From => 'from_you@somedomain.com',
To => 'to_someone_else@someotherdomain.com',
Subject => "your mail subject",
Type => 'text/html',
Data => qq {
<body>
<table>
<tr> <td>$chg</td><td>$Cat</td>.....</tr>
</table>
</body>
},
);
$msg->send();
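As a language-neutral illustration of the same idea (build a text/html message from the parsed fields), here is a small Python sketch using the standard email package. It only constructs the message; actually sending it (for example via smtplib) is left out, and the helper name build_html_mail and the table layout are my own assumptions.

```python
from email.message import EmailMessage
from html import escape

def build_html_mail(sender, recipient, subject, rows):
    """Build an HTML mail whose body is a table with one <tr> per row."""
    cells = "\n".join(
        "<tr>" + "".join(f"<td>{escape(value)}</td>" for value in row) + "</tr>"
        for row in rows
    )
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    # subtype="html" makes the content type text/html instead of text/plain
    msg.set_content(f"<html><body><table>\n{cells}\n</table></body></html>", subtype="html")
    return msg

msg = build_html_mail(
    "from_you@somedomain.com",
    "to_someone_else@someotherdomain.com",
    "your mail subject",
    [("C817491287", "Nor", "something", "02/26/14")],
)
print(msg.get_content_type())
```

Escaping the values matters if the "Sus" text can ever contain characters such as < or &.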
I asked this question before but don't think I really explained it properly based on the answers given.
I have a file named backup.xml that is 28,000 lines and contains the phrase *** in it 766 times. I also have a file named list.txt that has 766 lines in it, each with different keywords.
What I basically need to do is insert each of the lines from list.txt into backup.xml to replace the 766 places *** is mentioned.
Here's an example of what's contained in list.txt:
Anaheim
Anchorage
Ann Arbor
Antioch
Apple Valley
Appleton
Here's an example of one of the lines with *** in it from backup.xml:
<title>*** Hosting Services - Company Review</title>
So, for example, the first line that has *** mentioned should be changed to this according to the sample above:
<title>Anaheim Hosting Services - Company Review</title>
Any help would be greatly appreciated. Thanks in advance!
In this case you can probably get away with treating the XML as pure text.
So read the XML file, and replace each occurrence of the marker with a line read from the keyword file:
#!/usr/bin/perl
use strict;
use warnings;
use autodie qw( open);
my $xml_file = 'backup.xml';
my $list_file = 'list.txt';
my $out_file = 'out.xml';
my $pattern='***';
# I assumed all files are utf8 encoded
open( my $xml, '<:utf8', $xml_file );
open( my $list, '<:utf8', $list_file );
open( my $out, '>:utf8', $out_file );
while( <$xml>)
{ s{\Q$pattern\E}{my $kw= <$list>; chomp $kw; $kw}eg;
print {$out} $_;
}
rename $out_file, $xml_file;
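The core of this approach, replacing each occurrence of the marker with the next keyword in order, can be sketched in a few lines of Python. This is an illustration only: the function name fill_markers is hypothetical, and it works on strings already read into memory rather than on the files themselves.

```python
import re

def fill_markers(xml_text, keywords, marker="***"):
    """Replace each occurrence of marker with the next keyword, in order."""
    supply = iter(keywords)
    # re.escape plays the role of \Q...\E; the callback pulls one keyword per match
    return re.sub(re.escape(marker), lambda match: next(supply), xml_text)

xml_text = "<title>*** Hosting Services - Company Review</title>\n<h1>***</h1>\n"
print(fill_markers(xml_text, ["Anaheim", "Anchorage"]), end="")
```

If there are more markers than keywords, next() raises StopIteration, which is a reasonable signal that the two files do not line up.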
How about this:
awk '{print NR-1 ",/\\*\\*\\*/{s/\\*\\*\\*/" $0 "/}"}' list.txt > list.sed
sed -f list.sed backup.xml
The first line uses awk to build a list of search/replace commands from list.txt, which the second line then executes via sed. Note that the 0,/regex/ address form is a GNU sed extension.
Using awk: the script reads backup.xml and, whenever a line contains ***, fetches the next word from list.txt. The BEGIN block removes list.txt from the argument list so that awk does not process it as input. The order of the arguments is therefore very important. I also assume that there is only one *** string per line.
awk '
BEGIN { listfile = ARGV[2]; --ARGC }
/\*\*\*/ {
getline word <listfile
sub( /\*\*\*/, word )
}
1 ## same as { print }
' backup.xml list.txt
If the two files correspond line by line, you can use the paste command to join lines from both files and then post-process the result.
paste list.txt backup.xml |
awk 'BEGIN {FS="\t"} {sub(/\*\*\*/, $1); print substr($0, length($1)+2)}'
The paste command will produce the following:
Anaheim \t <title>*** Hosting Services - Company Review</title>
The awk one-liner then replaces *** with the first field and strips the first field and the field separator (\t) that follows it.
Another variation is:
paste list.txt backup.xml |
awk 'BEGIN {FS="\t"} {sub(/\*\*\*/, $1); print $0}' |
cut -f 2-
I use Perl v5.10 (on Windows 7) + TT v2.22. When I use TT, I get an extra CR in the generated HTML for each source line:
Source text (windows format):
"Some_html" CR LF
Output text :
"Some_html" CR
CR LF
However, when I convert my source file to unix format, and then I run TT, I get :
Source text (unix format):
"Some_html" LF
Output text :
"Some_html" CR LF
(I use notepad++ to show the CR & LF characters; also to change unix <-> windows formats in the source template).
When I google the problem, I find a few posts about an extra ^M on Windows, but I could find neither an explanation of the root cause nor a true solution (just some workarounds for getting rid of the extra ^M).
Although not a real problem, I find it quite "unclean".
Is there some configuration option that I should turn on (I reviewed www.template-toolkit.org/docs/manual/Config.html but could not find anything)?
Is there some other solution (other than post-processing the output file)?
Thanks
Template Toolkit reads template source files in binary mode but writes its output in text mode. The template data (which contains CR LF) is therefore translated during output: each LF becomes CR LF, so a CR LF in the template ends up as CR CR LF.
The easiest solution to the problem is to write the output in binary mode (note the :raw layer in the open call):
my $tt = Template->new;
my $output_file = 'some_file.txt';
open my $out_fh, '>:raw', $output_file or die "$output_file: $!\n";
$tt->process('template', \%data, $out_fh) or die $tt->error();
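The translation described above is easy to reproduce in isolation. The following Python snippet is only a simulation of what a Windows text-mode write does (it does not involve Template Toolkit at all, and the function name windows_text_mode is made up), but it shows why CR LF input comes out as CR CR LF while LF input comes out as CR LF.

```python
def windows_text_mode(data: str) -> str:
    """Mimic a Windows text-mode write: every LF is written as CR LF."""
    return data.replace("\n", "\r\n")

# Template saved with Windows line endings, read in binary mode (CR LF kept)
print(repr(windows_text_mode('"Some_html"\r\n')))
# Template saved with Unix line endings
print(repr(windows_text_mode('"Some_html"\n')))
```

Writing through a :raw handle skips this translation step, which is why the CR LF from the template then survives unchanged.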
bvr's solution unfortunately doesn't work for output generated using [% FILTER redirect(...) %]. On Windows 10, template
[% FILTER redirect("bar.txt") %]
This text is for bar.txt.
[% END %]
This text is for foo.txt.
(with DOS-style CR-LF line endings) expanded through
#! /bin/perl
use strict;
use warnings;
use Template;
my $tt = Template->new({
OUTPUT_PATH => '.',
RELATIVE => 1,
}) || die "$Template::ERROR\n";
my $srcfile = 'foo.txt.tt';
my $tgtfile = 'foo.txt';
open my $ofh, '>:raw', $tgtfile or die;
$tt->process($srcfile, {}, $ofh, { binmode => ':raw' })
|| die $tt->error . "\n";
creates output file foo.txt with the expected CR-LF line endings, but creates bar.txt with bad CR-CR-LF line endings:
> od -c bar.txt
0000000 \r \r \n T h i s t e x t i s
0000020 f o r b a r . t x t . \r \r \n
0000037
I reported this problem to the TT author at https://github.com/abw/Template2/issues/63.
I found a simple workaround solution: In sub Template::_output (in Template.pm), change
my $bm = $options->{ binmode };
to
my $bm = $options->{ binmode } // $BINMODE;
Then in your main perl script set
$Template::BINMODE = ':raw';
Then you can process the template using
$tt->process($srcfile, {}, $tgtfile) || die $tt->error . "\n";
and get CR-LF line endings in both the main and redirected output.
Have a good day,
I found a very simple solution on this:
my $tt = Template->new({
...,
PRE_CHOMP => 1,
POST_CHOMP => 1,
...
});
This configuration tells the template engine to remove the whitespace and newline immediately before and after each template directive.
I have a lot of text files with fixed-width fields:
<c>     <c>       <c>
Dave    Thomas    123 Main
Dan     Anderson  456 Center
Wilma   Rainbow   789 Street
The rest of the files are in a similar format, where the <c> will mark the beginning of a column, but they have various (unknown) column & space widths. What's the best way to parse these files?
I tried using Text::CSV, but since there's no delimiter it's hard to get a consistent result (unless I'm using the module wrong):
my $csv = Text::CSV->new();
$csv->sep_char (' ');
while (<FILE>){
if ($csv->parse($_)) {
my @columns = $csv->fields();
print $columns[1] . "\n";
}
}
As user604939 mentions, unpack is the tool to use for fixed width fields. However, unpack needs to be passed a template to work with. Since you say your fields can change width, the solution is to build this template from the first line of your file:
my @template = map {'A'.length}        # convert each to 'A##'
               <DATA> =~ /(\S+\s*)/g;  # split first line into segments
$template[-1] = 'A*';                  # set the last segment to be slurpy
my $template = "@template";
print "template: $template\n";
my @data;
while (<DATA>) {
    push @data, [unpack $template, $_]
}
use Data::Dumper;
print Dumper \@data;
__DATA__
<c>     <c>       <c>
Dave    Thomas    123 Main
Dan     Anderson  456 Center
Wilma   Rainbow   789 Street
which prints:
template: A8 A10 A*
$VAR1 = [
[
'Dave',
'Thomas',
'123 Main'
],
[
'Dan',
'Anderson',
'456 Center'
],
[
'Wilma',
'Rainbow',
'789 Street'
]
];
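If it helps to see the technique outside of unpack, here is a rough Python sketch of the same two steps: measure each header segment (marker plus trailing spaces) to get the column widths, then slice every data line at those offsets, letting the last column run to the end of the line. The function names and the sample spacing (8 and 10 character columns, matching the template printed above) are assumptions for illustration.

```python
import re

def column_widths(header: str):
    """Width of each column = length of its marker plus the spaces after it."""
    return [len(segment) for segment in re.findall(r"\S+\s*", header.rstrip("\n"))]

def parse_fixed(lines):
    """Split every data line at the offsets implied by the header line."""
    header, *rows = lines
    widths = column_widths(header)
    starts = [sum(widths[:i]) for i in range(len(widths))]
    records = []
    for row in rows:
        row = row.rstrip("\n")
        fields = []
        for i, start in enumerate(starts):
            # the last column is open-ended, like 'A*' in the unpack template
            end = starts[i + 1] if i + 1 < len(starts) else None
            fields.append(row[start:end].rstrip())
        records.append(fields)
    return records

sample = [
    "<c>     <c>       <c>",
    "Dave    Thomas    123 Main",
    "Dan     Anderson  456 Center",
    "Wilma   Rainbow   789 Street",
]
for record in parse_fixed(sample):
    print(record)
```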
CPAN to the rescue!
DataExtract::FixedWidth not only parses fixed-width files, but (based on POD) appears to be smart enough to figure out column widths from header line by itself!
Just use Perl's unpack function. Something like this:
while (<FILE>) {
my ($first,$last,$street) = unpack("A9A25A50",$_);
# ... do something with $first, $last and $street ...
}
Inside the unpack template (the "A###" parts), the number after each A is the width of that field.
There are a variety of other format characters that you can mix and match, for example for integer fields.
If the file is fixed width, like mainframe files, then this should be the easiest approach.