I have a large number of data dumps, each containing a huge amount of data, structured as follows:
Key1:.............. Value
Key2:.............. Other value
Key3:.............. Maybe another value yet

Key1:.............. Different value
Key3:.............. Invaluable
Key5:.............. Has no value at all
Which I would like to transform to something like:
Key1,Key2,Key3,Key5
Value,Other value,Maybe another value yet,
Different value,,Invaluable,Has no value at all
I mean:
Generate a collection of all the keys
Generate a header line with all the Keys
Map all the values to their correct "columns" (notice that in this example I have no "Key4", and Key3/Key5 interchanged)
Possibly in Perl, since it would be easier to use in various environments.
But I am not sure if this format is unusual, or if there is a tool that already does this.
This is fairly easy using hashes and the Text::CSV_XS module:
use strict;
use warnings;
use Text::CSV_XS;

my @rows;
my %headers;
{
    local $/ = "";    # paragraph mode: records are separated by blank lines
    while (<DATA>) {
        chomp;
        my %record;
        for my $line (split(/\n/)) {
            next unless $line =~ /^([^:]+):\.+\s(.+)/;
            $record{$1} = $2;
            $headers{$1} = $1;
        }
        push(@rows, \%record);
    }
}
unshift(@rows, \%headers);
my $csv = Text::CSV_XS->new({binary => 1, auto_diag => 1, eol => $/});
$csv->column_names(sort(keys(%headers)));
for my $row_ref (@rows) {
    $csv->print_hr(*STDOUT, $row_ref);
}
__DATA__
Key1:.............. Value
Key2:.............. Other value
Key3:.............. Maybe another value yet

Key1:.............. Different value
Key3:.............. Invaluable
Key5:.............. Has no value at all
Output:
Key1,Key2,Key3,Key5
Value,"Other value","Maybe another value yet",
"Different value",,Invaluable,"Has no value at all"
If your CSV format is 'complicated' - e.g. it contains commas, etc. - then use one of the Text::CSV modules. But if it isn't - and this is often the case - I tend to just work with split and join.
What's useful in your scenario, is that you can map key-values within a record quite easily using a regex. Then use a hash slice to output:
#!/usr/bin/env perl
use strict;
use warnings;

# set paragraph mode - records are blank-line separated.
local $/ = "";

my @rows;
my %seen_header;

# read STDIN or files on command line, just like sed/grep
while ( <> ) {
    # multi-line pattern that matches all the key-value pairs
    # and then inserts them into a hash.
    my %this_row = m/^(\w+):\.+ (.*)$/gm;
    push ( @rows, \%this_row );
    # add the keys we've seen to a hash, so we 'know' what we've seen.
    $seen_header{$_}++ for keys %this_row;
}

# extract the keys, make them unique and ordered.
# could set this by hand if you prefer.
my @header = sort keys %seen_header;

# print the header row
print join( ",", @header ), "\n";

# iterate the rows
foreach my $row ( @rows ) {
    # use a hash slice to select the values matching @header.
    # the map is so any undefined values (missing keys) don't report errors, they
    # just return blank fields.
    print join( ",", map { $_ // '' } @{$row}{@header} ), "\n";
}
This, for your sample input, produces:
Key1,Key2,Key3,Key5
Value,Other value,Maybe another value yet,
Different value,,Invaluable,Has no value at all
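The hash slice is what does the real work here. A minimal sketch of that idea on its own, assuming no value contains a comma or a quote; `row_to_csv` is a hypothetical helper name, not part of the answer's code:

```perl
use strict;
use warnings;

# Hypothetical helper: turn one record (a hash ref) into a CSV line,
# taking the columns in the order given by @$header.
sub row_to_csv {
    my ( $header, $row ) = @_;
    # @{$row}{@$header} is a hash slice: one value per header name,
    # undef where the record has no such key.
    return join ',', map { $_ // '' } @{$row}{@$header};
}

my @header = qw( Key1 Key2 Key3 Key5 );
my %record = (
    Key1 => 'Different value',
    Key3 => 'Invaluable',
    Key5 => 'Has no value at all',
);
print row_to_csv( \@header, \%record ), "\n";
# prints: Different value,,Invaluable,Has no value at all
```

Because the column order comes from the array and not from the hash, every row lines up with the header regardless of which keys a record happens to contain.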
If you want to be really clever, most of that initial loop can be replaced with:
my @rows = map { +{ m/^(\w+):\.+ (.*)$/gm } } <>;
The problem then is that you still need to build up the 'headers' array, and that gets a bit more complicated:
$seen_header{$_}++ for map { keys %$_ } @rows;
It works, but I don't think it's as clear about what's happening.
However the core of your problem may be the file size - that's where you have a bit of a problem, because you need to read the file twice - first time to figure out which headings exist throughout the file, and then second time to iterate and print:
#!/usr/bin/env perl
use strict;
use warnings;

open ( my $input, '<', 'your_file.txt' ) or die $!;
local $/ = "";

# first pass: collect every key that appears anywhere in the file
my %seen_header;
while ( <$input> ) {
    $seen_header{$_}++ for m/^(\w+):/gm;
}
my @header = sort keys %seen_header;
print join( ",", @header ), "\n";

# return to the start of file:
seek ( $input, 0, 0 );

# second pass: print one CSV line per record
while ( <$input> ) {
    my %this_row = m/^(\w+):\.+ (.*)$/gm;
    print join( ",", map { $_ // '' } @this_row{@header} ), "\n";
}
This will be slightly slower, as it has to read the file twice. But it will use far less memory, because it isn't holding the whole file in memory.
Unless you know all your keys in advance, and you can just define them, you'll have to read the file twice.
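If the keys really are known in advance, a single pass is enough. The sketch below assumes a fixed key list and reads from an in-memory string so that it is self-contained; `paragraph_to_csv` and the sample text are illustrative, not taken from the question:

```perl
use strict;
use warnings;

# Assumption: the full set of keys is known before any data is read.
my @header = qw( Key1 Key2 Key3 Key5 );

# Hypothetical helper: convert one blank-line-separated record to a CSV line.
sub paragraph_to_csv {
    my ( $header, $paragraph ) = @_;
    my %row = $paragraph =~ m/^(\w+):\.+ (.*)$/gm;
    return join ',', map { $_ // '' } @row{@$header};
}

my $text = <<'END';
Key1:.............. Value
Key2:.............. Other value

Key1:.............. Different value
Key5:.............. Has no value at all
END

# Read the sample from memory; a real script would open a file instead.
open my $in, '<', \$text or die $!;
{
    local $/ = "";    # paragraph mode
    print join( ',', @header ), "\n";
    while ( my $paragraph = <$in> ) {
        print paragraph_to_csv( \@header, $paragraph ), "\n";
    }
}
```

Each record is printed as soon as it is read, so memory use stays flat no matter how large the input is.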
This seems to work with the data you've given
use strict;
use warnings 'all';

my %data;
while ( <> ) {
    next unless /^(\w+):\W*(.*\S)/;
    push @{ $data{$1} }, $2;
}

use Data::Dump;
dd \%data;
output
{
  Key1 => ["Value", "Different value"],
  Key2 => ["Other value"],
  Key3 => ["Maybe another value yet", "Invaluable"],
  Key5 => ["Has no value at all"],
}
Here's my code:
my %hash = (
    '2564' => {
        'st_responsible'   => 'mname1',
        'critical'         => '',
        'last_modified_by' => 'teamname1',
        'transstatus'      => '',
        'rt_res'           => 'pname1'
    },
    '2487' => {
        'st_responsible'   => 'mname2',
        'critical'         => '',
        'last_modified_by' => 'teamname2',
        'transstatus'      => '',
        'rt_res'           => ''
    }
);
print "xnum,st_responsible,critical,last_modified_by,transstatus,rt_res\n";
foreach my $x_number (sort keys %hash)
{
    print "$x_number";
    foreach my $element (keys %{$hash{$x_number}})
    {
        print ",$hash{$x_number}{$element}";
    }
    print "\n";
}
Expected output
xnum,st_responsible,critical,last_modified_by,transstatus,rt_res
2487,mname2,,teamname2,,
2564,mname1,,teamname1,,pname1
Actual output
xnum,st_responsible,critical,last_modified_by,transstatus,rt_res
2487,mname2,,,teamname2,
2564,mname1,,,teamname1,pname1
Please let me know how I can preserve the order of this data structure and then write it to a CSV file.
I would suggest you'd be better off doing this with a hash slice, which is a way of extracting a list of values from a hash in a particular order:
# configure field order
my @order = qw ( st_responsible critical last_modified_by transstatus rt_res );

# print header row
print join (",", "xnum", @order ),"\n";

# iterate the rows
foreach my $key ( sort keys %hash ) {
    # extract hash slice and join it with commas
    print join ( ",", $key, @{$hash{$key}}{@order} ),"\n";
}
This gives:
xnum,st_responsible,critical,last_modified_by,transstatus,rt_res
2487,mname2,,teamname2,,
2564,mname1,,teamname1,,pname1
You can consider Text::CSV - but I'd suggest in this scenario it's overkill, best used when you've got quotes and quoted field separators to worry about. (And you don't).
If you have to deal with not just empty keys, but missing ones, you can make use of map:
my @order = qw ( st_responsible critical last_modified_by
                 transstatus missing rt_res extra_field_here );
print join (",", "xnum", @order ),"\n";
foreach my $key ( sort keys %hash ) {
    print join ( ",", $key, map { $_ // '' } @{$hash{$key}}{@order} ),"\n";
}
(Otherwise you'll get a warning about an undef value).
Perl doesn't guarantee the order of items in a hash; this is the root cause of the issue. Even two different hashes with the same keys can return those keys in a different order. The order may also differ between platforms, architectures and perl versions.
You need to define another array with the list of keys which you want to print in correct order.
my @keys = qw(st_responsible critical last_modified_by transstatus rt_res);
# inside your existing loop over $x_number:
foreach my $element (@keys) {
    print ",$hash{$x_number}{$element}";
}
EDIT: As you're trying to write a CSV file, consider using Text::CSV, which takes care of special characters, correct formatting and things like that.
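To make "special characters" concrete, here is a rough sketch of the quoting a CSV writer has to perform. `csv_field` and `csv_line` are hypothetical helpers written for illustration, not Text::CSV's API, and they only cover the common cases (commas, double quotes, line breaks); Text::CSV remains the safer choice for real data.

```perl
use strict;
use warnings;

# Hypothetical helper (not Text::CSV's API): quote one field when needed.
sub csv_field {
    my ($value) = @_;
    $value = '' unless defined $value;
    if ( $value =~ /[",\r\n]/ ) {
        $value =~ s/"/""/g;       # double any embedded quotes
        $value = qq{"$value"};    # wrap the whole field in quotes
    }
    return $value;
}

# Hypothetical helper: join already-ordered fields into one CSV line.
sub csv_line {
    return join( ',', map { csv_field($_) } @_ ) . "\n";
}

print csv_line( 2487, 'mname2', 'team, name', 'say "hi"' );
# prints: 2487,mname2,"team, name","say ""hi"""
```

A plain `join ","` would have written the third field as two columns, which is exactly the kind of silent corruption a CSV module prevents.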
There's probably a slicker way of achieving this, but give this a go:
use warnings;
use strict;

open my $csv_out, '>', 'out.csv' or die $!;
my @keys = qw(2487 2564);
my @vals = qw(st_responsible critical last_modified_by transstatus rt_res);
print $csv_out "xnum,st_responsible,critical,last_modified_by,transstatus,rt_res\n";
for my $key (@keys){
    print $csv_out "$key,";
    for my $vals (@vals){
        $vals eq $vals[-1] ? print $csv_out "$hash{$key}{$vals}\n" : print $csv_out "$hash{$key}{$vals},";
    }
}
This will print out comma-separated values to a csv file out.csv maintaining your original order (by iterating over arrays). If it's the last value it will print a newline.
--- OUTPUT ---
xnum,st_responsible,critical,last_modified_by,transstatus,rt_res
2487,mname2,,teamname2,,
2564,mname1,,teamname1,,pname1
I want to implement an ordered hash where the value of each key-value pair is another nested hash. I am unable to do so. I am not getting any errors, but nothing is being printed.
use Hash::Ordered;
use constant { lead_ID => 44671 , lag_ID => 11536 , start_time => time };
my $dict_lead=Hash::Ordered->new;
my $dict_lag=Hash::Ordered->new;
open(my $f1,"<","tcs_07may_nse_fo") or die "cant open input file";
open(my $f2,">","bid_ask_".&lead_ID) or die "cant open output file";
open(my $f3,">","ema_data/bid_ask_".&lag_ID) or die "cant open output file";
while(my $line =<$f1>){
    my @data=split(/,/,$line);
    chomp(@data);
    my ($tstamp, $instr) = (int($data[0]), $data[1]);
    if($instr==&lead_ID){
        $dict_lead->set($tstamp=>{"bid"=>$data[5],"ask"=>$data[6]});
    }
    if($instr==&lag_ID){
        $dict_lag->set($tstamp=>{"bid"=>$data[5],"ask"=>$data[6]});
    }
}
close $f1;
foreach my $key ($dict_lead->keys){
    my $spread=$dict_lead{$key}{"ask"}-$dict_lead{$key}{"bid"};
    %hash=$dict_lead->get($key);
    print $key.",".$hash{"ask"}."\n";
    print $f2 $key.",".$dict_lead{$key}{"bid"}.","
        .$dict_lead{$key}{"ask"}.",".$spread."\n";
}
foreach my $key ($dict_lag->keys){
    my $spread=$dict_lag{$key}{"ask"}-$dict_lag{$key}{"bid"};
    print $f3 $key.",".$dict_lag{$key}{"bid"}.","
        .$dict_lag{$key}{"ask"}.",".$spread."\n";
}
close $f2;
close $f3;
print "Ring destroyed in " , time() - &start_time , " seconds\n";
The output printed on my terminal is :
1430992791,
1430992792,
1430992793,
1430992794,
1430992795,
1430992796,
1430992797,
1430992798,
1430992799,
Ring destroyed in 24 seconds
I realize from the first column of output that I am able to insert the key in ordered hash. But I don't understand how to insert another hash as value for those keys. Also how would I access those values while iterating through the keys of the hash?
The output in the file corresponding to file handle $f2 is:
1430970394,,,0
1430970395,,,0
1430970396,,,0
1430970397,,,0
1430970398,,,0
1430970399,,,0
1430970400,,,0
First of all, I don't see why you want to use a module that keeps your hash in order. I presume you want your output ordered by the timestamp fields, and the data that you are reading from the input file is already ordered like that, but it would be simple to sort the keys of an ordinary hash and print the contents in order without relying on the incoming data being presorted
You have read an explanation of why your code isn't behaving as it should. This is how I would write a solution that hopefully behaves properly (although I haven't been able to test it beyond checking that it compiles)
Instead of a hash, I have chosen to use a two-element array to contain the ask and bid prices for each timestamp. That should make the code run fractionally faster as well as making it simpler and easier to read
It's also noteworthy that I have added use autodie, which makes perl check the status of IO operations such as open and chdir automatically and removes the clutter caused by coding those checks manually. I have also defined a constant for the path to the root directory of the files, and used chdir to set the working directory there. That removes the need to repeat that part of the path and reduces the length of the remaining file path strings
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
use autodie;

use Hash::Ordered;

use constant DIR     => '../tcs_nse_fo_merged';
use constant LEAD_ID => 44671;
use constant LAG_ID  => 11536;

chdir DIR;

my $dict_lead = Hash::Ordered->new;
my $dict_lag  = Hash::Ordered->new;

{
    open my $fh, '<', 'tcs_07may_nse_fo';
    while ( <$fh> ) {
        chomp;
        my @data   = split /,/;
        my $tstamp = int $data[0];
        my $instr  = $data[1];
        # each value is a two-element array: [ bid, ask ]
        if ( $instr == LEAD_ID ) {
            $dict_lead->set( $tstamp => [ @data[5,6] ] );
        }
        elsif ( $instr == LAG_ID ) {
            $dict_lag->set( $tstamp => [ @data[5,6] ] );
        }
    }
}

{
    my $file = 'ema_data/bid_ask_' . LEAD_ID;
    open my $out_fh, '>', $file;
    for my $key ( $dict_lead->keys ) {
        my $val = $dict_lead->get($key);
        my ($bid, $ask) = @$val;
        my $spread = $ask - $bid;
        print join(',', $key, $ask), "\n";
        print $out_fh join(',', $key, $bid, $ask, $spread), "\n";
    }
}

{
    my $file = 'ema_data/bid_ask_' . LAG_ID;
    open my $out_fh, '>', $file;
    for my $key ( $dict_lag->keys ) {
        my $val = $dict_lag->get($key);
        my ($bid, $ask) = @$val;
        my $spread = $ask - $bid;
        print $out_fh join(',', $key, $bid, $ask, $spread), "\n";
    }
}

printf "Ring destroyed in %d seconds\n", time - $^T;
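The opening remark of this answer, that an ordinary hash with sorted keys would do the job, can be sketched as follows. The sample quotes and the helper name `spread_lines` are made up for illustration; only core Perl is used.

```perl
use strict;
use warnings;

# Made-up sample data: timestamp => [ bid, ask ]
my %quotes = (
    1430992793 => [ 101.50, 101.90 ],
    1430992791 => [ 100.00, 100.40 ],
    1430992792 => [ 100.80, 101.00 ],
);

# Hypothetical helper: one "timestamp,bid,ask,spread" line per key,
# in ascending numeric key order.
sub spread_lines {
    my ($quotes) = @_;
    my @lines;
    for my $tstamp ( sort { $a <=> $b } keys %$quotes ) {
        my ( $bid, $ask ) = @{ $quotes->{$tstamp} };
        push @lines, join ',', $tstamp, $bid, $ask,
            sprintf( '%.2f', $ask - $bid );
    }
    return @lines;
}

print "$_\n" for spread_lines( \%quotes );
```

The numeric comparison `<=>` matters: a default string sort would order timestamps of different lengths incorrectly. The output order no longer depends on the order in which the input arrived.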
With ordered hashes constructed using Hash::Ordered, the hash is an object. Those objects have properties (e.g. an index; if you examine a Hash::Ordered object it will have more than just hash elements inside of it) and they provide methods for you to manipulate and access their data. So you need to use the supplied methods, like set, to access the hash, as you do in this line:
$dict_lead->set($tstamp=>{"bid"=>$data[5],"ask"=>$data[6]});
where you create a key using the scalar $tstamp and then associate it with an anonymous hash as its value.
But while you are using Hash::Ordered objects, your script also makes use of a plain data-structure (%hash) that you populate using $dict_lead->get($key) in your first foreach loop. All the normal techniques, idioms and rules for adding keys to a hash still apply in this case. You don't want to repeatedly copy the nested hash out of $dict_lead Hash::Ordered object into %hash here, you want to add the nested hash to %hash and associate it with a unique key.
Without sample data to test or a description of the expected output to compare against it is difficult to know for sure, but you probably just need to change:
%hash=$dict_lead->get($key);
to something like:
$hash{$key} = $dict_lead->get($key);
to populate your temporary %hash correctly. Or, since each key's value is an anonymous hash that is nested, you might instead want to try changing print $key.",".$hash{"ask"}."\n"; to:
print $key.",".$hash{$key}{"ask"}."\n"
There are other ways to "deeply" copy part of one nested data structure to another (see the Stack Overflow reference below) and you may be able to avoid using the temporary variable altogether, but these small changes might be all that is necessary in your case.
In general, in order to "insert another hash as a value for ... keys" you need to use a reference or an anonymous hash constructor ({ k => "v" , ... }). So e.g. to add one key:
my %sample_hash ;
$sample_hash{"key_0"} = { bid => "1000000" , timestamp => 1435242285 };
dd %sample_hash ;
Output:
("key_0", { bid => 1000000, timestamp => 1435242285 })
To add multiple keys from one hash to another:
my %to_add = ( key_1 => { bid => "1500000" , timestamp => 1435242395 },
key_2 => { bid => "2000000" , timestamp => 1435244898 } );
for my $key ( keys %to_add ) {
$sample_hash{$key} = $to_add{$key}
}
dd %sample_hash ;
Output (the order of the keys may vary):
(
  "key_1",
  { bid => 1500000, timestamp => 1435242395 },
  "key_0",
  { bid => 1000000, timestamp => 1435242285 },
  "key_2",
  { bid => 2000000, timestamp => 1435244898 },
)
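As for reading the nested values back while iterating over the keys, a minimal sketch with a plain hash of hash references looks like this. The data and the helper name `describe` are invented for illustration. As far as I know, the same dereferencing applies to the reference returned by `$dict_lead->get($key)`, but treat that as an assumption to check against the Hash::Ordered documentation.

```perl
use strict;
use warnings;

# Invented sample data: each value is a reference to a nested hash.
my %book = (
    key_0 => { bid => 1000000, ask => 1000500 },
    key_1 => { bid => 1500000, ask => 1500250 },
);

# Hypothetical helper: "key,bid,ask,spread" for every entry, sorted by key.
sub describe {
    my ($book) = @_;
    my @lines;
    for my $key ( sort keys %$book ) {
        my $inner  = $book->{$key};                  # reference to the nested hash
        my $spread = $inner->{ask} - $inner->{bid};  # arrow dereferences it
        push @lines, "$key,$inner->{bid},$inner->{ask},$spread";
    }
    return @lines;
}

print "$_\n" for describe( \%book );
# prints:
# key_0,1000000,1000500,500
# key_1,1500000,1500250,250
```

The point is that the value stored under each key is a single scalar (a reference), so it is fetched into a scalar and dereferenced with `->`, never assigned directly to a `%hash`.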
References
How can I combine hashes in Perl? ++
perldoc perlfaq4
perldoc perldsc
I keep learning about hashes and the various things you can do with them.
Today I have this question: how do I sort a hash by value when it has two keys? And how do I print it out?
I have a CSV file. I'm trying to store its values in the hash and sort it by value. That way I'll be able to print the biggest and the smallest value; I also need the date on which each value occurred.
So far I can print the hash, but I can't sort it.
#!/usr/bin/perl
#find openMin and openMax.
use warnings;
use strict;
my %pick;
my $key1;
my $key2;
my $value;
my $file= 'msft2.csv';
my $lines = 0;
my $date;
my $mm;
my $mOld = "";
my $open;
my $openMin;
my $openMax;
open (my $fh,'<', $file) or die "Couldnt open the $file:$!\n";
while (my $line=<$fh>)
{
    my @columns = split(',',$line);
    $date = $columns[0];
    $open = $columns[1];
    $mm = substr ($date,5,2);
    # the first line of the file holds the column names, which I don't need;
    # the data itself begins with the second line
    if ($lines>=1) {
        $key1 = $date;
        $key2 = "open";
        $value = $open;
        $pick{$key1}{"open"}=$value;
    }
    $lines++;
}
foreach $key1 (sort keys %pick) {
    foreach $key2 (keys %{$pick{$key1}}) {
        $value = $pick{$key1}{$key2};
        print "$key1 $key2 $value \n";
    }
}
exit;
1. Use a real CSV parser
Parsing a CSV with split /,/ works fine...unless one of your fields contains a comma. If you are absolutely, positively, 100% sure that your code will never, ever have to parse a CSV with a comma in one of the fields, feel free to ignore this. If not, I'd recommend using Text::CSV. Example usage:
use Text::CSV;

my $csv = Text::CSV->new( { binary => 1 } )
    or die "Cannot use CSV: " . Text::CSV->error_diag ();
open my $fh, "<", $file or die "Failed to open $file: $!";
while (my $line = $csv->getline($fh)) {
    print @$line, "\n";
}
$csv->eof or $csv->error_diag();
close $fh;
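To see why a bare split is fragile, the sketch below compares it with `parse_line` from the core module Text::ParseWords on a made-up row containing a quoted comma. This is only a demonstration of the failure mode; `parse_line` has its own quoting rules (it also treats backslashes and single quotes specially), so it is not a substitute for Text::CSV. The helper names are hypothetical.

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);    # ships with Perl

# Hypothetical helpers for the comparison.
sub naive_fields  { my ($line) = @_; return split /,/, $line }
sub quoted_fields { my ($line) = @_; return parse_line( ',', 0, $line ) }

my $line = '2014-01-02,"Smith, J",37.5';    # made-up sample row

my @naive  = naive_fields($line);     # splits inside the quoted field
my @parsed = quoted_fields($line);    # keeps "Smith, J" together

print scalar(@naive),  " fields from split\n";         # 4
print scalar(@parsed), " fields from parse_line\n";    # 3
```

With split, the name column is broken in two and every later column shifts right by one, which is the bug the advice above is meant to prevent.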
2. Sorting
I only see one secondary key in your hash: open. If you're trying to sort based on the value of open, do something like this:
my %hash = (
    foo => { open => "date1" },
    bar => { open => "date2" },
);
foreach my $key ( sort { $hash{$a}{open} cmp $hash{$b}{open} } keys %hash ) {
    print "$key $hash{$key}{open}\n";
}
(this assumes that the values you're sorting are not numeric. If the values are numeric (e.g. 3, -17.57) use the spaceship operator <=> instead of the string comparison operator cmp. See perldoc -f sort for details and examples.)
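For the numeric case in the question (smallest and largest opening price, together with the date), a sketch along these lines should work. The prices are invented and `dates_by_open` is a hypothetical helper name; the structure mirrors the question's `$pick{$date}{open}`.

```perl
use strict;
use warnings;

# Invented sample data in the question's shape: date => { open => price }
my %pick = (
    '2013-09-24' => { open => 32.87 },
    '2013-09-25' => { open => 32.49 },
    '2013-09-26' => { open => 32.64 },
);

# Hypothetical helper: dates ordered by ascending opening price.
sub dates_by_open {
    my ($pick) = @_;
    return sort { $pick->{$a}{open} <=> $pick->{$b}{open} } keys %$pick;
}

my @dates = dates_by_open( \%pick );
my ( $min_date, $max_date ) = @dates[ 0, -1 ];
print "openMin: $pick{$min_date}{open} on $min_date\n";
print "openMax: $pick{$max_date}{open} on $max_date\n";
```

Sorting the keys by the nested value keeps each price attached to its date, so the first and last elements of the sorted list give the minimum and the maximum directly.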
EDIT: You haven't explained what format your dates are in. If they are in YYYY-MM-DD format, sorting as above will work, but if they're in MM-DD-YYYY format, for example, 01-01-2014 would come before 12-01-2013. The easiest way to take care of this is to reorder the components of your date from most to least significant (i.e. year followed by month followed by day). You can do this using Time::Piece like this:
use Time::Piece;
my $date = "09-26-2013";
my $t = Time::Piece->strptime($date, "%m-%d-%Y");
print $t->strftime("%Y-%m-%d");
Another tidbit: in general you should only declare variables right before you use them. You gain nothing by declaring everything at the top of your program except decreased readability.
You could concatenate key1 and key2 into a single key:
my $key = "$key1 $key2";
$pick{$key} = $value;
I am new to Perl, and I have to write code that reads the contents of a file into an array and prints the output so that it looks like a hash. Here is an example entry:
my %amino_acids = (F => ["Phenylalanine", "Phe", ["TTT", "TTC"]])
The output should be in exactly the above format.
The lines of the file look like this:
"Methionine":"Met":"M":"AUG":"ATG"
"Phenylalanine":"Phe":"F":"UUU, UUC":"TTT, TTC"
"Proline":"Pro":"P":"CCU, CCC, CCA, CCG":"CCT, CCC, CCA, CCG"
I have to take the last group of codons (after the final colon) and ignore the first group.
Is it your intention to build the equivalent hash? Or do you really want the string format? This program uses Text::CSV to build the hash from the file and then dumps it using Data::Dump so that you have the string format as well.
use strict;
use warnings;
use Text::CSV;
use Data::Dump 'dump';
my $csv = Text::CSV->new({ sep_char => ':' });
open my $fh, '<', 'amino.txt' or die $!;
my %amino_acids;
while (my $data = $csv->getline($fh)) {
    $amino_acids{$data->[2]} = [
        $data->[0],
        $data->[1],
        [ $data->[4] =~ /[A-Z]+/g ]
    ];
}
print '$amino_acids = ', dump \%amino_acids;
output
$amino_acids = {
  F => ["Phenylalanine", "Phe", ["TTT", "TTC"]],
  M => ["Methionine", "Met", ["ATG"]],
  P => ["Proline", "Pro", ["CCT", "CCC", "CCA", "CCG"]],
}
Update
If you really don't want to install modules (it is a very straightforward process and makes the code much more concise and reliable) then this does what you need.
use strict;
use warnings;

open my $fh, '<', 'amino.txt' or die $!;
print "my %amino_acids = (\n";
while (<$fh>) {
    chomp;
    my @data = /[^:"]+/g;
    my @codons = $data[4] =~ /[A-Z]+/g;
    printf qq{  %s => ["%s", "%s", [%s]],\n},
        @data[2,0,1],
        join ', ', map qq{"$_"}, @codons;
}
print ")\n";
output
my %amino_acids = (
  M => ["Methionine", "Met", ["ATG"]],
  F => ["Phenylalanine", "Phe", ["TTT", "TTC"]],
  P => ["Proline", "Pro", ["CCT", "CCC", "CCA", "CCG"]],
)
Assuming you actually want valid perl as the output, this will do it:
open(my $IN, '<', 'input.txt') or die $!;
while(<$IN>){
    chomp;
    my @tmp = split(':',$_);
    if(@tmp != 5){
        # error on this line
        next;
    }
    # the fields keep their surrounding quotes, so only the separators
    # inside the codon group need to be turned into ","
    my $group = join('","',split(/,\s*/,$tmp[4]));
    print "\$amino_acids{$tmp[2]} = [$tmp[0],$tmp[1],[$group]];\n";
}
close $IN;
Using your sample lines, the output is:
$amino_acids{"M"} = ["Methionine","Met",["ATG"]];
$amino_acids{"F"} = ["Phenylalanine","Phe",["TTT","TTC"]];
$amino_acids{"P"} = ["Proline","Pro",["CCT","CCC","CCA","CCG"]];
@Borodin Thank you very much for your answer. Actually, I don't have to use Text::CSV or Data::Dump. I have to open a file and build the equivalent hash from the file. I am trying to do it without using either; hopefully it will help. Thanks again!
Perl has no special method to print hashes. What you should probably do is create a hash when reading the file:
while (<FILE>) {
    chomp;
    my @line = split ':';                             # split the line into an array
    $amino_acids{$line[0]} = [ @line[1..$#line] ];    # take elements 1..end
}
And then print out the hash one entry at a time:
foreach (keys %amino_acids) {
    print "$_ => [", (join ",", @{ $amino_acids{$_} }), "]\n";
}
Note that I didn't compile this, so it may need a small amount of work to get it done.
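For completeness, here is a module-free sketch that builds the hash in the shape the question asks for (one-letter code as key, last codon group only) and then prints it. `parse_amino_line` is a hypothetical helper, the two input lines are copied from the question, and the parsing assumes that no field contains a colon.

```perl
use strict;
use warnings;

# Hypothetical helper: parse one line such as
#   "Phenylalanine":"Phe":"F":"UUU, UUC":"TTT, TTC"
# into ( $letter, [ $name, $abbrev, \@dna_codons ] ).
sub parse_amino_line {
    my ($line) = @_;
    chomp $line;
    # split on colons, then strip the double quotes from each field
    my @field = map { ( my $f = $_ ) =~ tr/"//d; $f } split /:/, $line;
    return unless @field == 5;               # skip malformed lines
    my @codons = split /,\s*/, $field[4];    # last codon group only
    return ( $field[2], [ @field[ 0, 1 ], \@codons ] );
}

my %amino_acids;
for my $line (
    '"Methionine":"Met":"M":"AUG":"ATG"',
    '"Phenylalanine":"Phe":"F":"UUU, UUC":"TTT, TTC"',
) {
    my ( $letter, $entry ) = parse_amino_line($line) or next;
    $amino_acids{$letter} = $entry;
}

# print one entry per line, sorted by key for a stable order
for my $letter ( sort keys %amino_acids ) {
    my ( $name, $abbrev, $codons ) = @{ $amino_acids{$letter} };
    printf qq{%s => ["%s", "%s", [%s]]\n}, $letter, $name, $abbrev,
        join ', ', map qq{"$_"}, @$codons;
}
# prints:
# F => ["Phenylalanine", "Phe", ["TTT", "TTC"]]
# M => ["Methionine", "Met", ["ATG"]]
```

Keeping the parsing in its own function separates building the data structure from printing it, so the hash can be reused later in the program.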