Perl: converting a CSV file to XML

I'm trying to parse a csv file and convert it to XML. The .csv file consists of a list of entries, separated by commas. So, two sample entries look like this:
AP,AB,A123,B123,,,
MA,NA,M123,TEXT,TEXT,TEXT_VALUE
Some fields are blank, and the file has no header row.
Any suggestions on how to approach this? I haven't figured it out yet.
Thanks.

print "<myXML>\n";
while (<>) {
    print "<aRow>$_</aRow>\n";
}
print "</myXML>\n";
Or, using XML::LibXML, something like this:
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

my $doc  = XML::LibXML::Document->createDocument;
my $root = $doc->createElement('myXML');
$doc->setDocumentElement($root);
$root->appendChild($doc->createTextNode("\n"));

while (<>) {
    chomp;
    my $row = $doc->createElement('aRow');
    $root->appendChild($row);
    $row->appendChild($doc->createTextNode($_));
    $root->appendChild($doc->createTextNode("\n"));
}

print $doc->toString;
Of course, if you told us what the output should look like, we could probably come up with something a little more sophisticated!
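For instance, if each comma-separated field should become its own child element, a sketch along these lines could work. This is illustration only: the element names aRow and field are invented here, since the desired output was never specified.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;
use XML::LibXML;

my $csv  = Text::CSV->new({ binary => 1 });
my $doc  = XML::LibXML::Document->new('1.0', 'UTF-8');
my $root = $doc->createElement('myXML');
$doc->setDocumentElement($root);

while (my $line = <>) {
    chomp $line;
    $csv->parse($line) or die "Bad CSV line: $line";
    my $row = $doc->createElement('aRow');
    for my $value ($csv->fields) {
        # 'field' is a made-up tag name; empty CSV fields become empty elements
        my $field = $doc->createElement('field');
        $field->appendTextNode($value);
        $row->appendChild($field);
    }
    $root->appendChild($row);
}

print $doc->toString(1);   # 1 = pretty-print
```

Using Text::CSV here (rather than split) means quoted fields containing commas would still parse correctly.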

Related

Converting multiple xmls in 1 file to a csv

I have a text file that contains multiple XML documents, one per line, that look like this:
<queryResponse><entity><devicesDTO><clearedAlarms>1</clearedAlarms><warningAlarms>0</warningAlarms></devicesDTO></entity></queryResponse>
<queryResponse><entity><devicesDTO><clearedAlarms>2</clearedAlarms><warningAlarms>2</warningAlarms></devicesDTO></entity></queryResponse>
I would like to convert each line to a csv:
clearedAlarms, warningAlarms
1, 0
2, 2
Here's what I have now; it only lets me parse a single XML document and output the CSV. The file has since changed, and I now need to read a text file that contains multiple XML documents.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use XML::Simple;

# Elements that I want to see in my csv
my @Fields = qw{clearedAlarms warningAlarms};

open(my $out, '>', 'test.csv') or die "Output: $!\n";
print $out join(',', @Fields) . "\n";

my $xml = XMLin('test.xml', ForceArray => ['entity']);
foreach my $entity ( @{ $xml->{entity} } ) {
    print Dumper $entity;
}
foreach my $entity ( @{ $xml->{entity} } ) {
    print $out join( ',', @{ $entity->{devicesDTO} }{@Fields} ) . "\n";
}
There's More Than One Way To Do It, as the Perl slogan goes. If you don't want to use an XML module (as you mentioned, the file has changed and you now need to read a text file containing multiple XML documents), you can use the File::Grep module (https://metacpan.org/pod/File::Grep), which finds matches to a pattern in a series of files, together with Text::CSV_XS (https://metacpan.org/pod/Text::CSV_XS), which provides the CSV functions you can use as your requirements grow.
fmap BLOCK LIST
Performs a map operation on the files in LIST, using BLOCK as the mapping function. The results from BLOCK will be appended to the list that is returned at the end of the call.
csv This is a high-level function that aims at simple (user) interfaces. This can be used to read/parse a CSV file or stream (the default behavior) or to produce a file or write to a stream (define the out attribute).
use strict;
use warnings;
use File::Grep qw(fmap);
use Text::CSV_XS qw( csv );

my $data;
my $csv_file = 'test_file.csv';

# my @result = fmap { <block> } file_name;
# Replace *DATA with your file path.
# Check the pattern, extract the values, and push them onto an
# array of arrays.
fmap { (/<clearedAlarms>(.*?)<\/clearedAlarms><warningAlarms>(.*?)<\/warningAlarms>/ ? push(@$data, [$1, $2]) : () ) } *DATA;

if (@$data) {
    # Write the array of arrays as a csv file
    csv(in => $data, out => $csv_file, sep_char => ",", headers => [qw( clearedAlarms warningAlarms )]);
} else {
    print "\ndata not found (provide a proper message)\n";
}

__DATA__
__DATA__
<queryResponse><entity><devicesDTO><clearedAlarms>1</clearedAlarms><warningAlarms>0</warningAlarms></devicesDTO></entity></queryResponse>
<queryResponse><entity><devicesDTO><clearedAlarms>2</clearedAlarms><warningAlarms>2</warningAlarms></devicesDTO></entity></queryResponse>
Output (if you open the $csv_file file):
clearedAlarms,warningAlarms
1,0
2,2
Given the simplicity of the XML schema, this is easier to do with AnyData.
For instance:
#!/usr/bin/perl
# This script converts an XML file to CSV format.
# Load the AnyData XML-to-CSV conversion modules
use XML::Parser;
use XML::Twig;
use AnyData;

my $input_xml  = "test.xml";
my $output_csv = "test.csv";

# record_tag names the repeating record element in your XML
my $flags = { record_tag => 'devicesDTO' };
adConvert( 'XML', $input_xml, 'CSV', $output_csv, $flags );
Would convert your data structure (XML) into:
clearedAlarms, warningAlarms
1, 0
2, 2
If your file structure is always the same, you indeed don't need an XML parser. But you don't need anything else either. You can treat that input file like a slightly more complex CSV file that has weird delimiters.
split uses a pattern to turn a string into a list of strings. By default it eats the delimiter matches, so the delimiters disappear from the output. We can use a pattern that matches an XML tag as the delimiter. Note how I am using qr// with a delimiter that isn't the slash /, which avoids escaping the optional slash in the closing tag and makes the pattern more readable.
split qr{</?[^>]+>}, '<foo>123</foo>';
This will produce a data structure that looks like this (using Data::Printer to produce the output):
[
    [0] "",
    [1] 123
]
The first element is an empty string, which denotes the lack of any other characters before the first match of the delimiter pattern. We need to filter these out. That's easily done with grep.
grep { $_ ne q{} } split qr{</?[^>]+>}, '<foo>123</foo>';
Now our output is nice and clean.
[
    [0] 123
]
All we now need to do is apply this to a full file. Since our data only contains a couple of numbers, there is no need to use Text::CSV in this case.
use strict;
use warnings;
use feature 'say';

while (<DATA>) {
    chomp;
    say join ',', grep { $_ ne q{} } split qr{</?[^>]+>};
}
__DATA__
<queryResponse><entity><devicesDTO><clearedAlarms>1</clearedAlarms><warningAlarms>0</warningAlarms></devicesDTO></entity></queryResponse>
<queryResponse><entity><devicesDTO><clearedAlarms>2</clearedAlarms><warningAlarms>2</warningAlarms></devicesDTO></entity></queryResponse>
Please keep in mind that we can use a pattern match here because we don't actually parse the XML. Do not use regular expressions if you want to do any real XML parsing!
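For contrast, a sketch of real parsing of the same line-per-document file with XML::LibXML might look like this; unlike the regex, it would tolerate added attributes or whitespace in the tags.

```perl
use strict;
use warnings;
use feature 'say';
use XML::LibXML;

say 'clearedAlarms,warningAlarms';
while (my $line = <DATA>) {
    chomp $line;
    next unless $line =~ /\S/;    # skip blank lines
    # Each line is a complete XML document, so parse it on its own
    my $doc = XML::LibXML->load_xml(string => $line);
    # findvalue() returns the text content of the first matching node
    say join ',',
        $doc->findvalue('//clearedAlarms'),
        $doc->findvalue('//warningAlarms');
}

__DATA__
<queryResponse><entity><devicesDTO><clearedAlarms>1</clearedAlarms><warningAlarms>0</warningAlarms></devicesDTO></entity></queryResponse>
<queryResponse><entity><devicesDTO><clearedAlarms>2</clearedAlarms><warningAlarms>2</warningAlarms></devicesDTO></entity></queryResponse>
```

This prints the same header and rows as the regex version, at the cost of constructing a parser per line.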

Perl - Need to append duplicates in a file and write unique value only

I have searched a fair bit and hope I'm not duplicating something someone has already asked. I have what amounts to a CSV that is specifically formatted (as required by a vendor). There are four values that are being delimited as follows:
"Name","Description","Tag","IPAddresses"
The list is quite long (and there are ~150 unique names--only 2 in the sample below) but it basically looks like this:
"2B_AppName-Environment","desc","tag","192.168.1.1"
"2B_AppName-Environment","desc","tag","192.168.22.155"
"2B_AppName-Environment","desc","tag","10.20.30.40"
"6G_ServerName-AltEnv","desc","tag","1.2.3.4"
"6G_ServerName-AltEnv","desc","tag","192.192.192.40"
"6G_ServerName-AltEnv","desc","tag","192.168.50.5"
I am hoping for a way in Perl (or sed/awk, etc.) to come up with the following:
"2B_AppName-Environment","desc","tag","192.168.1.1,192.168.22.155,10.20.30.40"
"6G_ServerName-AltEnv","desc","tag","1.2.3.4,192.192.192.40,192.168.50.5"
So basically, the resulting file will APPEND the duplicates to the first match -- there should only be one line per each app/server name with a list of comma-separated IP addresses just like what is shown above.
Note that the "Description" and "Tag" fields don't need to be considered in the duplication removal/append logic -- let's assume these are blank for the example to make things easier. Also, in the vendor-supplied list, the "Name" entries are all already sorted to be together.
This short Perl program should suit you. It expects the path to the input CSV file as a parameter on the command line and prints the result to STDOUT. It keeps track of the appearance of new name fields in the @names array so that it can print the output in the order that each name first appears, and it takes the values for desc and tag from the first occurrence of each unique name.
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new({always_quote => 1, eol => "\n"});

my (@names, %data);

while (my $row = $csv->getline(*ARGV)) {
    my $name = $row->[0];
    if ($data{$name}) {
        $data{$name}[3] .= ',' . $row->[3];
    }
    else {
        push @names, $name;
        $data{$name} = $row;
    }
}

for my $name (@names) {
    $csv->print(*STDOUT, $data{$name});
}
output
"2B_AppName-Environment","desc","tag","192.168.1.1,192.168.22.155,10.20.30.40"
"6G_ServerName-AltEnv","desc","tag","1.2.3.4,192.192.192.40,192.168.50.5"
Update
Here's a version that ignores any record that doesn't have a valid IPv4 address in the fourth field. I've used Regexp::Common as it's the simplest way to get complex regex patterns right. It may need installing on your system.
use strict;
use warnings;
use Text::CSV;
use Regexp::Common;

my $csv = Text::CSV->new({always_quote => 1, eol => "\n"});

my (@names, %data);

while (my $row = $csv->getline(*ARGV)) {
    my ($name, $address) = @{$row}[0,3];
    next unless $address =~ $RE{net}{IPv4};
    if ($data{$name}) {
        $data{$name}[3] .= ',' . $address;
    }
    else {
        push @names, $name;
        $data{$name} = $row;
    }
}

for my $name (@names) {
    $csv->print(*STDOUT, $data{$name});
}
I would advise you to use a CSV parser like Text::CSV for this type of problem.
Borodin has already pasted a good example of how to do this.
One of the approaches that I'd advise you NOT to use is regular expressions.
The following one-liner demonstrates how one could do this, but this is a very fragile approach compared to an actual csv parser:
perl -0777 -ne '
    while (m{^((.*)"[^"\n]*"\n(?:(?=\2).*\n)*)}mg) {
        $s = $1;
        $s =~ s/"\n.*"([^"\n]+)(?=")/,$1/g;
        print $s
    }' test.csv
Outputs:
"2B_AppName-Environment","desc","tag","192.168.1.1,192.168.22.155,10.20.30.40"
"6G_ServerName-AltEnv","desc","tag","1.2.3.4,192.192.192.40,192.168.50.5"
Explanation:
Switches:
-0777: Slurp the entire file
-n: Creates a while(<>){...} loop for each “line” in your input file.
-e: Tells perl to execute the code on command line.
Code:
while (m{^((.*)"[^"\n]*"\n(?:(?=\2).*\n)*)}mg): Separate the text into matching sections.
$s =~ s/"\n.*"([^"\n]+)(?=")/,$1/g;: Join all ip addresses by a comma in matching sections.
print $s: Print the results.

How can I query Genbank and print the results to a fasta file?

I have been trying to write a code using BioPerl that will query Genbank for a specific protein and then print the results to a fasta file. So far the code I have works and I can print the results to the screen, but not to a file. I have done lots of researching on the BioPerl website and other sources (CPAN, PerlMonks, etc.) but I have not found anything that can solve my problem. I understand how to read something from a file and then print the output to a new file (using SeqIO), but the problem I am having seems to be that what I want the program to read is not stored in a text or FASTA file, but is the result of a database query. Help? I am very much a beginner, new to Perl/BioPerl and programming in general.
Here is the code I have so far:
#!/usr/bin/perl
use Bio::DB::GenBank;
use Bio::DB::Query::GenBank;
use Bio::Seq;

$query = "Homo sapiens[ORGN] AND TFII-I[TITL]";
$query_obj = Bio::DB::Query::GenBank->new(-db => 'protein', -query => $query);
$gb_obj = Bio::DB::GenBank->new;
$stream_obj = $gb_obj->get_Stream_by_query($query_obj);

while ($seq_obj = $stream_obj->next_seq) {
    print $seq_obj->desc, "\t", $seq_obj->seq, "\n";
}
So, what I want to do in the last line is instead of printing to the screen, print to a file in fasta format.
Thanks,
~Jay
Your code was actually quite close, you are returning a Bio::Seq object in your loop, and you just need to create a Bio::SeqIO object that can handle those objects and write them to a file ("myseqs.fasta" is the file in the example).
#!/usr/bin/env perl
use strict;
use warnings;
use Bio::DB::GenBank;
use Bio::DB::Query::GenBank;
use Bio::SeqIO;

my $query = "Homo sapiens[ORGN] AND TFII-I[TITL]";
my $query_obj = Bio::DB::Query::GenBank->new(-db => 'protein', -query => $query);
my $gb_obj = Bio::DB::GenBank->new(-format => 'fasta');
my $stream_obj = $gb_obj->get_Stream_by_query($query_obj);
my $seq_out = Bio::SeqIO->new(-file => ">myseqs.fasta", -format => 'fasta');

while (my $seq_obj = $stream_obj->next_seq) {
    $seq_out->write_seq($seq_obj);
}
Also, note that I added use strict; and use warnings; to the top of the script. That will help solve most "Why isn't this working?" type of questions by generating diagnostic messages, and it is a good idea to include those lines.
Assuming you have the data to make a FASTA sequence (which it seems you do), could you use the Bio::FASTASequence module's seq2file function? I've never used it, nor am I a bioinformatics expert; I just saw the option there and thought it might be useful to you.

new to Perl - CSV - find a string and print all numbers in that column

I've got a bunch of data in a CSV file, first row is all strings (all text and underscores), all subsequent rows are filled with numbers relating to said strings.
I'm trying to parse through the first line and find particular strings, remember which column that string was in, and then go through the rest of the file and get the data in the same column. I need to do this to three strings.
I've been using Text::CSV but I can't figure out how to get it to increment a counter until it finds the string in the first line and then go to the next line, get the data from that same column, etc. etc. Here's what I've tried so far:
while (<CSV>) {
    if ($csv->parse($data)) {
        my @field = $csv->fields;
        my $count = 0;
        for $column (@field) {
            print ++$count, " => ", $column, "\n";
        }
    } else {
        my $err = $csv->error_input;
        print "Failed to parse line: $err";
    }
}
Since $data is in line 1, it prints "1 => $data" 25 times (the number of lines in the CSV file). How do I get it to remember which column it found $data in? Also, since I know all of the strings are in line 1, how do I get it to parse line 1 only, find all of the strings in @data, and then parse the rest of the file, grabbing data from the necessary columns and putting it into a matrix or array of arrays?
Thanks for the help!
edit: I realized my questions were a bit poorly phrased. I don't know how to get the column number from CSV. How is this done?
Also, once I've got the column number, how do I tell Text::CSV to run through the subsequent lines and grab the data from only that column?
Try something like this:
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new({binary => 1});

my $thing_to_match = "blah";
my $matched_index;
my @stored_data = ();

# grabs lines below __DATA__ (near the end of the script)
while (my $row = $csv->getline(*DATA)) {
    my @fields = @$row;

    # If we haven't found the matched index yet, search for it.
    if (not defined $matched_index) {
        foreach my $i (0 .. $#fields) {
            $matched_index = $i if ($fields[$i] eq $thing_to_match);
        }
    }

    # NOTE: We're pushing a *reference* to an array!
    # Look at perldoc perldata
    push @stored_data, \@fields;
}

die "Column for '$thing_to_match' not found!" unless defined $matched_index;

foreach my $row (@stored_data) {
    print $row->[$matched_index] . "\n";
}

__DATA__
stuff,more stuff,yet more stuff
"yes, this thing, is one item",blah,blarg
1,2,3
The output is:
more stuff
blah
2
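If only that one column is needed, a variant of the same idea stores just the matched column instead of whole rows. This sketch assumes (as the question states) that the strings to search for are all in the first line, so that line can be treated as a header; the target string "more stuff" is chosen to match the sample data above.

```perl
use strict;
use warnings;
use Text::CSV;

my $csv    = Text::CSV->new({ binary => 1 });
my $target = 'more stuff';         # the header string to look for

# First row is the header: find the column index there.
my $header = $csv->getline(*DATA) or die "Empty input";
my ($index) = grep { $header->[$_] eq $target } 0 .. $#$header;
die "Column '$target' not found" unless defined $index;

# Remaining rows: keep only that column.
my @column;
while (my $row = $csv->getline(*DATA)) {
    push @column, $row->[$index];
}
print "$_\n" for @column;

__DATA__
stuff,more stuff,yet more stuff
"yes, this thing, is one item",blah,blarg
1,2,3
```

With the sample data this prints "blah" and then "2". For three target strings, repeat the grep for each target and collect three indices.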
I don't have time to write up a full example, but I wrote a module that might help you do this. Tie::Array::CSV uses some magic to make your csv file act like a Perl array of arrayrefs. In this way you can use your knowledge of Perl to interact with the file.
A word of warning though! One benefit of my module is that it is read/write. Since you only want read, be careful not to assign to it!
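For illustration only (check the module's documentation for the exact constructor options), usage might look something like this; the file name and target string are placeholders:

```perl
use strict;
use warnings;
use Tie::Array::CSV;

# Treat the CSV file as a 2-D Perl array. Reads AND writes go through
# to the file, so only read from it if reading is all you want.
tie my @file, 'Tie::Array::CSV', 'data.csv';

my $header = $file[0];                            # first row, as an array ref
my ($col) = grep { $header->[$_] eq 'blah' } 0 .. $#$header;
die "column not found" unless defined $col;

# Print that column from every data row
print $file[$_][$col], "\n" for 1 .. $#file;
```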

How can I check if contents of one file exist in another in Perl?

Requirement:-
File1 has contents like -
ABCD00000001,\some\some1\ABCD00000001,Y,,5 (this indicates there are 5 files in total in the unit)
File2 has contents as ABCD00000001
So what i need to do is check if ABCD00000001 from File2 exist in File1 -
if yes {
    print the output to Output.txt until it finds another ',Y,,X'
} else {
    no, keep checking
}
Anyone? Any help is greatly appreciated.
Hi Arkadiy, the output should be: for any filename from File2 (ABCD00000001 here), the lines in File1 from one ',Y,,' marker line to the next.
for ex :- file 1 structure will be :-
ABCD00000001,\some\some1\ABCD00000001,Y,,5
ABCD00000001,\some\some1\ABCD00000002
ABCD00000001,\some\some1\ABCD00000003
ABCD00000001,\some\some1\ABCD00000004
ABCD00000001,\some\some1\ABCD00000005
ABCD00000001,\some\some1\ABCD00000006,Y,,2
so out put should contain all line between
ABCD00000001,\some\some1\ABCD00000001,Y,,5 and
ABCD00000001,\some\some1\ABCD00000006,Y,,2
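The "from one marker line to the next" requirement maps naturally onto Perl's range (flip-flop) operator. A sketch, assuming the data looks exactly like the sample above and that in practice $name would be read from File2:

```perl
use strict;
use warnings;

my $name = 'ABCD00000001';   # placeholder: read this from File2

while (my $line = <DATA>) {
    # Flip on at the line that starts with our name AND carries a ,Y,,N
    # marker; flip off at the next ,Y,,N marker line. The three-dot form
    # `...` ensures the end test is not applied to the same line that
    # turned the range on, so both marker lines are printed.
    print $line if $line =~ /^\Q$name\E,.*,Y,,\d+/ ... $line =~ /,Y,,\d+\s*$/;
}

__DATA__
ABCD00000001,\some\some1\ABCD00000001,Y,,5
ABCD00000001,\some\some1\ABCD00000002
ABCD00000001,\some\some1\ABCD00000003
ABCD00000001,\some\some1\ABCD00000004
ABCD00000001,\some\some1\ABCD00000005
ABCD00000001,\some\some1\ABCD00000006,Y,,2
```

This prints all six sample lines, marker lines included. To write to Output.txt, open a filehandle and print to that instead of STDOUT.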
#!/usr/bin/perl -w
use strict;

my $optFile = "C:\\Documents and Settings\\rgolwalkar\\Desktop\\perl_scripts\\SampleOPT1.opt";
my $tifFile = "C:\\Documents and Settings\\rgolwalkar\\Desktop\\perl_scripts\\tif_to_stitch.txt";

print "Reading OPT file now\n";
open (OPT, $optFile);
my @opt_in_array = <OPT>;
close(OPT);
foreach (@opt_in_array) {
    print();
}

print "\nReading TIF file now\n";
open (TIF, $tifFile);
my @tif_in_array = <TIF>;
close(TIF);
foreach (@tif_in_array) {
    print();
}
So all it does is read the two files. (FYI, I am new to programming.)
Try breaking up your problem into discrete steps. It seems that you need to do this (although your question is not very clear):
open file1 for reading
open file2 for reading
read file1, line by line:
for each line in file1, check if there is particular content anywhere in file2
Which part are you having difficulty with? What code have you got so far? Once you have a line in memory, you can compare it to another string using a regular expression, or perhaps a simpler form of comparison.
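One simple form of comparison: load the IDs from File2 into a hash, then stream File1 once and test each line's first field against the hash. A sketch with placeholder file names:

```perl
use strict;
use warnings;

# Load every ID from file2 into a hash for O(1) membership tests.
open my $f2, '<', 'file2.txt' or die "file2.txt: $!";
my %wanted;
while (my $id = <$f2>) {
    chomp $id;
    $wanted{$id} = 1 if length $id;
}
close $f2;

# Stream file1 and print lines whose first field is one of the IDs.
open my $f1, '<', 'file1.txt' or die "file1.txt: $!";
while (my $line = <$f1>) {
    my ($first_field) = split /,/, $line;
    print $line if $wanted{$first_field};
}
close $f1;
```

This avoids re-scanning File2 for every line of File1, which matters once the files grow.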
OK, I'll bite (partially)...
First, general comments. Using strict and -w is good, but you are not checking the results of open or explicitly stating your desired read/write mode.
The contents of your OPT file kinda sorta looks like it is CSV and the second field looks like a Windows path, true? If so, use the appropriate library from CPAN to parse CSV and verify your file names. Misery and pain can be the result otherwise...
As Ether stated earlier, you need to read the file OPT then match the field you want. If the first file is CSV, first you need to parse it without destroying your file names.
Here is a small snippet that will parse your OPT file. At this point, all it does is print the fields, but you can add logic to match to the other file easily. Just read (slurp) the entire second file into a single string and match with your chosen field from the first:
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new();
my @opt_fields;

while (<DATA>) {
    if ($csv->parse($_)) {
        push @opt_fields, [ $csv->fields() ];
    } else {
        my $err = $csv->error_input;
        print "Failed to parse line: $err";
    }
}

foreach my $ref (@opt_fields) {
    # foreach my $field (@$ref) { print "$field\n"; }
    print "The anon array: @$ref\n";
    print "Use to match?: $ref->[0]\n";
    print "File name?: $ref->[1]\n";
}

__DATA__
ABCD00000001,\some\some1\ABCD00000001,Y,,5
ABCD00000001,\some\some1\ABCD00000002
ABCD00000001,\some\some1\ABCD00000003
ABCD00000001,\some\some1\ABCD00000004
ABCD00000001,\some\some1\ABCD00000005
ABCD00000001,\some\some1\ABCD00000006,Y,,2
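The matching step left open above could be sketched like this, placed just before the __DATA__ section so it can use the @opt_fields array built there (the TIF file name is a placeholder):

```perl
# Slurp the second file into a single string, then test each parsed
# OPT field against it with index().
my $tif_contents = do {
    open my $fh, '<', 'tif_to_stitch.txt' or die "tif_to_stitch.txt: $!";
    local $/;          # slurp mode: read the whole file at once
    <$fh>;
};

foreach my $ref (@opt_fields) {
    # index() returns the position of the substring, or -1 if absent
    if (index($tif_contents, $ref->[0]) >= 0) {
        print "Matched: $ref->[0] ($ref->[1])\n";
    }
}
```

A hash of the second file's IDs would scale better than repeated substring searches, but for small files this keeps the logic obvious.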