Edit text file using sed or awk - sed

I have a sample text file as shown below:
>chr1 dna:chromosome chromosome:CanFam3.1:1:1:122678785:1 REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>chr10 dna:chromosome chromosome:CanFam3.1:1:1:122678785:1 REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>chr11 dna:chromosome chromosome:CanFam3.1:1:1:122678785:1 REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>chr12 dna:chromosome chromosome:CanFam3.1:1:1:122678785:1 REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>AAEX03020170.1 dna:chromosome chromosome:CanFam3.1:1:1:122678785:1 REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTATGTGAGAAGATAGCTGAA
>AAEX03022270.1 dna:chromosome chromosome:CanFam3.1:1:1:122678785:1 REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTATGTGAGAAGATAGCTGAA
>JH373398.1dna:chromosome chromosome:CanFam3.1:1:1:122678785:1 REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTATGTGAGAAGATAGCTGAA
>JH373568.1dna:chromosome chromosome:CanFam3.1:1:1:122678785:1 REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTATGTGAGAAGATAGCTGAA
The first four records start with chr1, chr10, chr11 and chr12, and the rest start with one of the common prefixes AAEX or JH.
I would like to delete all the data from the records starting with AAEX and JH, i.e. the output should look like this:
>chr1 dna:chromosome chromosome:CanFam3.1:1:1:122678785:1 REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>chr10 dna:chromosome chromosome:CanFam3.1:1:1:122678785:1 REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>chr11 dna:chromosome chromosome:CanFam3.1:1:1:122678785:1 REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>chr12 dna:chromosome chromosome:CanFam3.1:1:1:122678785:1 REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
The original file has many such records starting with 'AAEX' and 'JH', and I would like to convert it as shown above. Any help?

This should do the trick:
$ awk '/>[AJ]/{if(!f++)print ">chrX";next}NF' file
>chr1 dna:chromosome chromosome:CanFam3.1:1:1:122678785:1 REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>chr10 dna:chromosome chromosome:CanFam3.1:1:1:122678785:1 REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>chr11 dna:chromosome chromosome:CanFam3.1:1:1:122678785:1 REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>chr12 dna:chromosome chromosome:CanFam3.1:1:1:122678785:1 REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>chrX
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTATGTGAGAAGATAGCTGAA
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTATGTGAGAAGATAGCTGAA
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTATGTGAGAAGATAGCTGAA
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTATGTGAGAAGATAGCTGAA

This might work for you (GNU sed):
sed -r '/^>(AAEX|JH)/{x;/./{x;d};x;s/.*/>chrX/p;h;d};/^>/{x;s/.*//;x}' file
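Note that both commands above merge the AAEX/JH records under a single >chrX header rather than dropping them. If the goal is exactly the output shown in the question, and Perl is an option alongside sed and awk, a small filter along these lines might do. This is only a sketch: the subroutine name drop_records and the shortened sample records are made up for illustration, and it assumes every record begins with a '>' header line.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not from the answers above): returns the given lines
# minus every record whose header starts with one of the listed prefixes.
sub drop_records {
    my ($lines, @prefixes) = @_;
    my $alternation = join '|', map { quotemeta($_) } @prefixes;
    my $skip = 0;
    my @kept;
    for my $line (@$lines) {
        # Each header line decides whether the record that follows is skipped.
        if ($line =~ /^>/) {
            $skip = ($line =~ /^>(?:$alternation)/) ? 1 : 0;
        }
        push @kept, $line unless $skip;
    }
    return @kept;
}

# Shortened sample records, made up for this sketch.
my @sample = (
    ">chr1 dna:chromosome REF\n",
    "NNNNNNNNNN\n",
    ">AAEX03020170.1 dna:chromosome REF\n",
    "NNNNNNNNNN\n",
    "NNNNTATGTG\n",
    ">chr12 dna:chromosome REF\n",
    "NNNNNNNNNN\n",
    ">JH373398.1dna:chromosome REF\n",
    "NNNNTATGTG\n",
);

my @result = drop_records(\@sample, 'AAEX', 'JH');
print @result;
```

For a real file, the lines read from a file handle could be passed in place of @sample.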

Need to find HTML/XML nested tag levels using Perl

Is there any simple way to find the level of a tag in nested form, i.e. the number of parent elements with the same tag name?
Note: I'm planning to create a subroutine such that, if I pass it a scalar like the input below, it returns the output below as a scalar.
I need output like the following from the input, using Perl.
Input:
<sec>
<sec></sec>
<sec>
<sec></sec>
</sec>
</sec>
Output should be:
<sec level="1">
<sec level="2"></sec>
<sec level="2">
<sec level="3"></sec>
</sec>
</sec>
One approach, that uses XML::LibXML to generate a DOM tree from the XML, and then walks the tree adding an incrementing level attribute to matching tags:
#!/usr/bin/env perl
use warnings;
use strict;
use XML::LibXML;
# Recursively walk a DOM tree, and invoke callbacks on elements
sub walk_elements {
    my ($node, $callbacks) = @_;
    $callbacks->{pre}->($node) if $node->nodeType == XML_ELEMENT_NODE;
    for my $child ($node->childNodes) {
        walk_elements($child, $callbacks);
    }
    $callbacks->{post}->($node) if $node->nodeType == XML_ELEMENT_NODE;
}
sub add_levels {
    my ($xml, $tagname) = @_;
    my $dom = XML::LibXML->load_xml(string => $xml);
    my $level = 1;
    walk_elements($dom->getDocumentElement,
                  { pre => sub {
                        $_[0]->setAttribute('level', $level++)
                            if $_[0]->nodeName eq $tagname
                    },
                    post => sub { $level-- if $_[0]->nodeName eq $tagname }
                  }
                 );
    return $dom->toStringHTML; # Or toString for XML style tags
}
my $xml = <<EOXML;
<sec>
<sec></sec>
<sec>
<sec></sec>
</sec>
</sec>
EOXML
print add_levels($xml, 'sec');
Running this script outputs
<sec level="1">
<sec level="2"></sec>
<sec level="2">
<sec level="3"></sec>
</sec>
</sec>

perl - use command-line argument multiple times

I'm modifying a perl script in which the command line arguments are parsed like this:
if ($arg eq "-var1") {
    $main::variable1 = shift(@arguments)
} elsif ($arg eq "-var2") {
    $main::variable2 = shift(@arguments)
} elsif ($arg eq "var3") {
    $main::variable3 = shift(@arguments)
} ...
So there is a whole bunch of elsif statements to cover all command-line arguments.
I'm now in a situation where I want to use the argument '-var2' multiple times.
So my main::variable2 should probably be an array that contains all the values passed with "-var2".
I found that with Getopt::Long this can be achieved easily (see "Perl Getopt Using Same Option Multiple Times").
However the way that my script parses its command-line arguments is different. So I was wondering if it could be achieved, without having to change the way the arguments are parsed.
That's not your actual code, is it? It won't even compile.
I'd be really surprised if Getopt::Long can't solve your problem and it's really a better idea to use a library rather than writing your own code.
But changing your code to store -var2 options in an array is simple enough.
my ($variable1, @variable2, $variable3);
if ($arg eq "-var1") {
    $variable1 = shift(@arguments)
} elsif ($arg eq "-var2") {
    push @variable2, shift(@arguments)
} elsif ($arg eq "-var3") {
    $variable3 = shift(@arguments)
}
(I've also removed the main:: from your variables and added the, presumably missing, $s. It's really unlikely that you want to be using package variables rather than lexical variables.)
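For comparison, here is roughly what the same three options could look like with Getopt::Long, which ships with Perl. Treat it as a minimal sketch rather than a drop-in replacement: the wrapper name parse_options and the sample argument list are made up for illustration, and it relies on Getopt::Long's default configuration, which accepts long option names behind a single dash.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Getopt::Long qw( GetOptionsFromArray );

# Hypothetical wrapper for illustration: parses a list of arguments and
# returns the collected values. The option names mirror the question.
sub parse_options {
    my @args = @_;
    my ($variable1, @variable2, $variable3);
    GetOptionsFromArray(
        \@args,
        'var1=s' => \$variable1,
        'var2=s' => \@variable2,   # array ref: every -var2 value is appended
        'var3=s' => \$variable3,
    ) or die "Invalid command-line arguments\n";
    return ($variable1, \@variable2, $variable3);
}

# In a real script this would be parse_options(@ARGV).
my ($v1, $v2, $v3) = parse_options(qw( -var1 a -var2 b -var2 c -var3 d ));
print "var1=$v1 var2=@$v2 var3=$v3\n";
```

Binding an option to an array reference is what makes it repeatable; the scalar-bound options simply keep the last value given.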
This particular wheel already exists. Please don't try to reinvent it. That just makes it a pain for the people trying to use your script. There's no reason to force people to learn a whole new set of rules in order to execute your program.
use strict;
use warnings;
use feature qw( say );   # say() is not available unless enabled
use File::Basename qw( basename );
use Getopt::Long qw( );
my $foo;
my @bars;
my $baz;
sub help {
    my $prog = basename($0);
    print
"Usage:
  $prog [options]
  $prog --help
Options:
  --foo foo
      ...
  --bar bar
      May be used multiple times.
      ...
  --baz baz
      ...
";
    exit(0);
}
sub usage {
    if (@_) {
        my ($msg) = @_;
        chomp($msg);
        say STDERR $msg;
    }
    my $prog = basename($0);
    say STDERR "Try '$prog --help' for more information.";
    exit(1);
}
sub parse_args {
    Getopt::Long::Configure(qw( posix_default ));
    Getopt::Long::GetOptions(
        "help"  => \&help,
        "foo=s" => \$foo,
        "bar=s" => \@bars,   # array ref: --bar may be given repeatedly
        "baz=s" => \$baz,
    )
        or usage();
    !@ARGV
        or usage("Too many arguments");
    return @ARGV;
}
# main() stands for the script's own entry point (not shown here)
main(parse_args());
Well, it is good practice to document your code; you will appreciate it as soon as you return to make changes.
NOTE: on Linux, the perl-doc package must be installed to use the --man option to its full extent.
#!/usr/bin/perl
#
# Description:
# Describe purpose of the program
#
# Parameters:
# Describe parameters purpose
#
# Date: Tue Nov 29 1:18:00 UTC 2019
#
use warnings;
use strict;
use Getopt::Long qw(GetOptions);
use Pod::Usage;
use Data::Dumper;   # provides Dumper(), used for the --debug output below
my %opt;
GetOptions(
    'input|i=s'  => \$opt{input},
    'output|o=s' => \$opt{output},
    'debug|d'    => \$opt{debug},
    'help|?'     => \$opt{help},
    'man'        => \$opt{man}
) or pod2usage(2);
pod2usage(1) if $opt{help};
pod2usage(-exitval => 0, -verbose => 2) if $opt{man};
print Dumper(\%opt) if $opt{debug};
__END__

=head1 NAME

program - describe program's functionality

=head1 SYNOPSIS

program.pl [options]

 Options:
   -i,--input   input filename
   -o,--output  output filename
   -d,--debug   output debug information
   -?,--help    brief help message
   --man        full documentation

=head1 OPTIONS

=over 4

=item B<-i,--input>

Input filename

=item B<-o,--output>

Output filename

=item B<-d,--debug>

Print debug information.

=item B<-?,--help>

Print a brief help message and exits.

=item B<--man>

Prints the manual page and exits.

=back

B<This program> accepts B<several parameters> and operates with B<them> to produce some B<result>

=cut

Filling hash by reference in a procedure

I'm trying to call a procedure, which is filling a hash by reference. The reference to the hash is given as a parameter. The procedure fills the hash, but when I return, the hash is empty. Please see the code below.
What is wrong?
$hash_ref;
genHash ($hash_ref);
#hash is empty
sub genHash {
    my ($hash_ref)=(@_);
    #cut details; filling hash in a loop like this:
    $hash_ref->{$lid} = $sid;
    #hash is generated, filled and I can dump it
}
You might want to initialize the hash ref first,
my $hash_ref = {};
as autovivification happens inside the function, on a different lexical variable.
A (not so good) alternative is to use the elements of the @_ array, which are aliased directly to the caller's variables,
$_[0]{$lid} = $sid;
And by the way, consider adding use strict; use warnings; to all your scripts.
The caller's $hash_ref is undefined. The $hash_ref in the sub is therefore undefined too. $hash_ref->{$lid} = $sid; autovivifies the sub's $hash_ref, but nothing assigns that hash reference to the caller's $hash_ref.
Solution 1: Actually passing in a hash ref to assign to the caller's $hash_ref.
sub genHash {
    my ($hash_ref) = @_;
    ...
}
my $hash_ref = {};
genHash($hash_ref);
Solution 2: Taking advantage of the fact that Perl passes by reference.
sub genHash {
    my $hash_ref = $_[0] ||= {};
    ...
}
my $hash_ref;
genHash($hash_ref);
-or-
genHash(my $hash_ref);
Solution 3: If the hash is going to be empty initially, why not just create it in the sub?
sub genHash {
    my %hash;
    ...
    return \%hash;
}
my $hash_ref = genHash();
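To see the difference between the copied argument and the aliased $_[0] for yourself, a small experiment along these lines may help. It is only a sketch; the sub names fill_copy and fill_alias and the fixed key/value pair are made up for the demonstration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Copies the (undefined) argument into a new lexical; autovivification then
# creates a hash that only the sub's own variable points to.
sub fill_copy {
    my ($hash_ref) = @_;
    $hash_ref->{lid} = 'sid';
    return;
}

# Works on $_[0], which is an alias to the caller's variable, so the
# autovivified hash ends up in the caller's scalar.
sub fill_alias {
    $_[0]{lid} = 'sid';
    return;
}

my $copied;
fill_copy($copied);
print defined($copied) ? "copy: filled\n" : "copy: still undefined\n";

my $aliased;
fill_alias($aliased);
print defined($aliased) ? "alias: filled\n" : "alias: still undefined\n";
```

The first call leaves the caller's variable undefined, which is the situation in the question; the second fills it, as in Solution 2.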

Having trouble with Perl script that converts XML to hash

I have a Perl script to convert the XML file below into a hash:
<university>
<name>svu</name>
<location>ravru</location>
<branch>
<electronics>
<student name="xxx" number="12">
<semester number="1"subjects="7" rank="2"/>
</student>
<student name="xxx" number="15">
<semester number="1" subjects="7" rank="10"/>
<semester number="2" subjects="4" rank="1"/>
</student>
<student name="xxx" number="16">
<semester number="1"subjects="7" rank="2"/>
<semester number="2"subjects="4" rank="2"/>
</student>
</electronics>
</branch>
</university>.
.
.
.
.
.
<data>
<student name="msr" number="1" branch="computers" />
<student name="ksr" number="2" branch="electronics" />
<student name="lsr" number="3" branch="EEE" />
<student name="csr" number="4" branch="IT" />
<student name="msr" number="5" branch="MEC" />
<student name="ssr" number="6" branch="computers" />
<student name="msr" number="1" branch="CIV" />
.............................
..............................
.....................
</data>
How can I create a hash table for the data elements, with the name and number as the key and the branch as the value? I need this because some students have the same name and some students have the same number.
Using this hash key, I have to search the university node for each student and, if found, print the student's branch name.
I have written a script using XML::Simple, but I am not able to create the hash.
#!/usr/bin/perl
use warnings;
use strict;
use Data::Dumper;
use XML::Simple;
my $xml = new XML::Simple;
my $data = $xml->XMLin("data.xml", forcearray => [ 'student' , 'semister' ],
KeyAttr => { student => "+Name" } );
print Dumper($data);
Using Data::Dumper I am printing the whole XML document, but I need to print only the data node elements. Please help me do this.
I would probably write my own XML::Parser handler to combine attributes into key values (if that's something supported by XML::Simple I couldn't find it in the docs). This example should get you started:
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;
use Data::Dumper;
my %hash;
sub tag_start {
    my ($expat, $tagname) = (shift, shift);
    # the remaining elements of @_ are attribute name/value pairs
    my %a = @_; # attribute hash for this tag
    my $context = join('/', $expat->context()) || '';
    if ($context eq 'xml/data') {
        if ($tagname eq 'student') {
            push @{ $hash{"$a{name}:$a{number}"} ||= [] }, $a{branch};
        }
    }
    # } elsif ($context eq ...) { ... }   (other contexts would be handled here)
}
my $p = XML::Parser->new(Handlers => { Start => \&tag_start });
$p->parsefile('file.xml');
print Dumper \%hash;
Note that to get this to work I had to clean up your XML a bit by enclosing it in an <xml> tag and adding some missing spaces:
<xml>
<university>
<name>svu</name>
<location>ravru</location>
<branch>
<electronics>
<student name="xxx" number="12">
<semester number="1" subjects="7" rank="2"/>
</student>
<student name="xxx" number="15">
<semester number="1" subjects="7" rank="10"/>
<semester number="2" subjects="4" rank="1"/>
</student>
<student name="xxx" number="16">
<semester number="1" subjects="7" rank="2"/>
<semester number="2" subjects="4" rank="2"/>
</student>
</electronics>
</branch>
</university>
<data>
<student name="msr" number="1" branch="computers" />
<student name="ksr" number="2" branch="electronics" />
<student name="lsr" number="3" branch="EEE" />
<student name="csr" number="4" branch="IT" />
<student name="msr" number="5" branch="MEC" />
<student name="ssr" number="6" branch="computers" />
<student name="msr" number="1" branch="CIV" />
</data>
</xml>
Result:
$VAR1 = {
'ksr:2' => [
'electronics'
],
'msr:1' => [
'computers',
'CIV'
],
'csr:4' => [
'IT'
],
'ssr:6' => [
'computers'
],
'msr:5' => [
'MEC'
],
'lsr:3' => [
'EEE'
]
};
There is no need to use XML::Simple and XML::Fast together. Both perform essentially the same thing.
Invoking multiple XML parsers for the same functionality invites trouble in the form of undesired behavior, code that should work but doesn't, and debugging that will leave you holding your head in your hands because identically-named methods are treading on one another's toes.
I'd stick with XML::Fast for this case:
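use strict;
use warnings;
use XML::Fast;
# xml2hash expects the XML text itself rather than a filename, so read the file first
open my $fh, '<', 'data.xml' or die "Cannot open data.xml: $!";
my $xml = do { local $/; <$fh> };
close $fh;
my $data = xml2hash $xml, array => [ 'student', 'semester' ];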
Even if the structure is not exactly the desired one, $data can easily be post-processed and seasoned to taste (it is a data structure after all).
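As a rough idea of what such post-processing could look like, the sketch below builds the name:number lookup the question asks for. It runs on a hand-written stand-in for the parsed structure, so no XML module is needed to try it. The '-' prefix on the attribute keys is an assumption about what the parser produces (check your own structure with Data::Dumper), and the helper name branches_by_student is made up.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hand-written stand-in for the parsed <data> node. The '-' prefix on
# attribute keys is an assumption about the parser's output; adjust the key
# names to whatever Data::Dumper shows for your real structure.
my $data = {
    data => {
        student => [
            { '-name' => 'msr', '-number' => '1', '-branch' => 'computers' },
            { '-name' => 'ksr', '-number' => '2', '-branch' => 'electronics' },
            { '-name' => 'msr', '-number' => '1', '-branch' => 'CIV' },
        ],
    },
};

# Hypothetical helper: maps "name:number" to the list of branches seen.
sub branches_by_student {
    my ($students) = @_;
    my %branches;
    for my $student (@$students) {
        my $key = join ':', $student->{'-name'}, $student->{'-number'};
        push @{ $branches{$key} }, $student->{'-branch'};
    }
    return \%branches;
}

my $lookup = branches_by_student($data->{data}{student});
print "$_ => @{ $lookup->{$_} }\n" for sort keys %$lookup;
```

Keeping a list of branches per key, rather than a single value, preserves duplicates such as the two msr:1 entries.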

Perl script to read many values from an XML file

How can I read multiple values from an XML file using a Perl script?
I have an XML file like this:
<Provisioning>
<Appliance>
<ID>1</ID>
<SiteID></SiteID>
<IPAddress>10.52.32.230</IPAddress>
</Appliance>
<Appliance>
<ID>1</ID>
<SiteID></SiteID>
<IPAddress>10.52.32.530</IPAddress>
</Appliance>
<Appliance>
<ID>1</ID>
<SiteID></SiteID>
<IPAddress>10.52.32.730</IPAddress>
</Appliance>...
</Provisioning>
and I have written code like this:
use XML::Simple;
use Data::Dumper;
my $xml = new XML::Simple;
my $peermas = $xml->XMLin($masapplications);
print "file contents: $peermas \n";
print Dumper($peermas);
@masipaddr =+ $peermas->{Appliance}->{IPAddress}; #{Provisioning}->{Appliance}->{IPAddress};
print "The MAS ip: @masipaddr \n";
I am very new to Perl, and my code reads only one IP address, not the remaining two.
What should I do in this case?
Thanks in advance.
You already have all the information you need in $peermas. But if you need an array of the IP addresses, you can use:
my @massipaddr = map { $_->{IPAddress} } @{ $peermas->{Appliance} };
This map iterates over the array of hashes in $peermas->{Appliance} and collects each IPAddress into @massipaddr.
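One caveat, as far as I understand XML::Simple's defaults: with only a single <Appliance> in the file, $peermas->{Appliance} comes back as a plain hash rather than an array of hashes (unless the ForceArray option is used), and the map above would then fail. The sketch below handles both shapes. It uses hand-written stand-ins for the parsed data so it can be tried without the module, and the helper name appliance_ips is made up.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: collects every IPAddress from a structure shaped like
# the Dumper output of the parsed file. Accepts either an array of appliance
# hashes or a single appliance hash.
sub appliance_ips {
    my ($peermas) = @_;
    my $appliances = $peermas->{Appliance};
    return () unless defined $appliances;
    my @list = ref($appliances) eq 'ARRAY' ? @$appliances : ($appliances);
    return map { $_->{IPAddress} } @list;
}

# Stand-ins for parsed data (assumed shapes, using values from the question).
my $many = {
    Appliance => [
        { ID => '1', SiteID => {}, IPAddress => '10.52.32.230' },
        { ID => '1', SiteID => {}, IPAddress => '10.52.32.530' },
        { ID => '1', SiteID => {}, IPAddress => '10.52.32.730' },
    ],
};
my $single = {
    Appliance => { ID => '1', SiteID => {}, IPAddress => '10.52.32.230' },
};

my @masipaddr = appliance_ips($many);
print "The MAS ip: @masipaddr\n";

my @one = appliance_ips($single);
print "Single appliance: @one\n";
```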
Something like this perhaps:
#!/usr/bin/perl
use strict;
use warnings;
use XML::Simple;
my $xml = join '', <DATA>;
my $peermas = XMLin($xml);
foreach (@{$peermas->{Appliance}}) {
    print $_->{IPAddress}. "\n";
}
__DATA__
<Provisioning>
<Appliance>
<ID>1</ID>
<SiteID></SiteID>
<IPAddress>10.52.32.230</IPAddress>
</Appliance>
<Appliance>
<ID>1</ID>
<SiteID></SiteID>
<IPAddress>10.52.32.530</IPAddress>
</Appliance>
<Appliance>
<ID>1</ID>
<SiteID></SiteID>
<IPAddress>10.52.32.730</IPAddress>
</Appliance>...
</Provisioning>