I am trying to read a .csv file for the first time. I have gone through the link below:
http://metacpan.org/pod/Text::CSV_XS#Reading-a-CSV-file-line-by-line
I have a few doubts. You may call these silly questions, but I cannot figure out how exactly Perl reads a CSV file.
So, my doubts are:
First Question
What is the difference between reading the CSV file line by line and parsing the file?
I have a simple program where I am reading the CSV file line by line.
Below is my program:
#!/usr/bin/perl -w
use strict;
use Text::CSV;
use Data::Dumper;

my $csv     = Text::CSV->new();
my $my_file = "test.csv";

open( my $fl, "<", $my_file ) or die "can not open the file $!";
while ( my $ref_list = $csv->getline($fl) ) {
    print "$ref_list->[0]\n";
}
Below is the data in the CSV file:
"Emp_id","Emp_name","Location","Company"
102713,"raj","Banglore","abc"
403891,"Rakesh","Pune","Infy"
530201,"Kiran","Hyd","TCS"
503110,"raj","Noida","HCL"
Second Question:
If I want to get a specific Emp_id along with its Location, then how can I proceed?
Third Question:
If I want only the records for Emp_ids 102713, 530201 and 503110 (i.e. name, location, company name), then what should I do?
Thanks
A CSV file is a good representation of tabular data in a text format, but it is unsuitable as an in-memory representation. Because of that, we have to create an adequate representation. One such representation would be a hash:
my $hashref = {
    Emp_id   => ...,
    Emp_name => ...,
    Location => ...,
    Company  => ...,
};
If the header row is in the array @header, we can create this hash with:
my @header = ...;
my @row = @{ $csv->getline($fl) };   # turn the arrayref into an array
my $hashref = {};
for my $i (0 .. $#header) {
    $hashref->{ $header[$i] } = $row[$i];
}
# The $hashref now looks as described above
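As an aside, Text::CSV can do this header-to-hash mapping itself, via its column_names and getline_hr methods; a minimal sketch:

$csv->column_names( @{ $csv->getline($fl) } );   # first row becomes the column names
while ( my $hashref = $csv->getline_hr($fl) ) {
    print "$hashref->{Location}\n";
}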
We can then create lookup hashes that use the id values as keys. So %lookup looks like this:
my %lookup = (
    102713 => $hashref_to_first_line,
    ...,
);
We populate it by doing
$lookup{$row[0]} = $hashref;
after the above loop. We can then access a certain hashref with
my $a_certain_id_hashref = $lookup{102713};
or access certain elements directly with
my $a_certain_id_location = $lookup{102713}{Location};
If the key does not exist, these lookups simply return undef.
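Putting the pieces together, a minimal sketch of the whole lookup build (assuming the test.csv shown above):

my $csv = Text::CSV->new();
open my $fl, '<', 'test.csv' or die "can not open the file $!";

my @header = @{ $csv->getline($fl) };        # first row: column names
my %lookup;
while ( my $row = $csv->getline($fl) ) {
    my %record;
    @record{@header} = @$row;                # hash slice: header => values
    $lookup{ $record{Emp_id} } = \%record;   # index the records by Emp_id
}
print "$lookup{102713}{Location}\n";         # prints "Banglore"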
If the CSV file is too big, this might cause Perl to run out of memory. In that case, the hashes should be tied to files, but that is a different topic entirely.
Here's another option that addresses your second question and part of your third question:
use Modern::Perl;
use Text::CSV;

my @empID = qw/ 102713 530201 503110 /;

my $csv = Text::CSV->new( { binary => 1 } )
  or die 'Cannot use CSV: ' . Text::CSV->error_diag();

my $my_file = "test.csv";
open my $fl, '<', $my_file or die "can not open the file $!";

while ( my $ref_list = $csv->getline($fl) ) {
    if ( $ref_list->[0] ~~ @empID ) {
        say "Emp_id: $ref_list->[0] is Location: $ref_list->[2]";
    }
}

$csv->eof or $csv->error_diag();
close $fl;
Output:
Emp_id: 102713 is Location: Banglore
Emp_id: 530201 is Location: Hyd
Emp_id: 503110 is Location: Noida
The array @empID contains the ID(s) you're interested in. In the while loop, each Emp_id is checked using the smart match operator (Perl v5.10+) to see whether it's in the list of IDs. If so, the Emp_id and its corresponding Location are printed.
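Note that smart match was marked experimental in Perl 5.18, so a plain hash lookup is a safer alternative; a sketch of the same check without ~~:

my %want = map { $_ => 1 } @empID;   # build a lookup set once, before the loop
# then inside the while loop:
if ( $want{ $ref_list->[0] } ) {     # replaces the ~~ @empID test
    say "Emp_id: $ref_list->[0] is Location: $ref_list->[2]";
}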
I want to perform a vlookup-like process, but with multiple files, wherein the contents of the first column from all files (sorted and uniq-ed) are the reference values. Now I would like to store the key-value pairs from each file in its own hash and then print them together. Something like this:
file1: while(<$fh>){$hash1{$key}=$val} ... file2: while(<$fh>){$hash2{$key}=$val} ... file3: while(<$fh>){$hash3{$key}=$val} ... and so on
Then print it: print "$ref_val $hash1{$ref_val} $hash2{$ref_val} $hash3{$ref_val} ..."
$i = 1;
@FILES = @ARGV;
foreach $file (@FILES)
{
    open($fh, $file);
    $hname = "hash" . $i;   ## trying to create a unique hash by attaching a running number to the hash name
    while (<$fh>) { @d = split("\t"); $hname{$d[0]} = $d[7]; }
    $i++;
}
$set = $i - 1;   ## store this number for recreating the hash names during printing
open(FH, "ref_list.txt");
while (<FH>)
{
    chomp(); print "$_\t";
    ## here I run the loop recreating the hash names and printing the corresponding value
    for ($i = 1; $i <= $set; $i++) { $hname = "hash" . $i; print "$hname{$_}\t"; }
    print "\n";
}
Now this is where I am stuck: Perl takes %hname as the hash name instead of %hash1, %hash2, ...
Thanks in advance for the help and opinions.
The shown code attempts to use symbolic references to construct variable names at runtime. These can cause a lot of trouble and should not be used, except very occasionally in very specialized code.
Here is a way to read multiple files, each into a hash, and store them for later processing.
use warnings;
use strict;
use feature 'say';
use Data::Dump qw(dd);

my @files = @ARGV;

my @data;
for my $file (@files) {
    open my $fh, '<', $file or do {
        warn "Skip $file, can't open it: $!";
        next;
    };
    push @data, { map { (split /\t/, $_)[0,6] } <$fh> };
}

dd \@data;
Each hash associates the first column with the seventh (index 6), as clarified, for each line. A reference to such a hash, formed by { }, is added to the array for each file.
Note that when you add a key-value pair to a hash which already has that key, the new value overwrites the old. So if a string repeats in the first column of a file, the hash for that file will end up with the value (column 7) of its last occurrence. The OP doesn't discuss possible duplicates of this kind in the data files (only for the reference file); please clarify if needed.
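If duplicates should instead be kept, each key could hold an array reference that collects all of its values; a sketch that replaces the map line (and changes the shape of the data):

my %h;
while (<$fh>) {
    my ($key, $val) = (split /\t/)[0, 6];
    push @{ $h{$key} }, $val;   # a repeated key keeps all of its values
}
push @data, \%h;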
Data::Dump is used only for printing; if you don't wish to install it, use the core module Data::Dumper.
I am not sure that I get the use of that "reference file", but you can now go through the array of hash references and fetch values, one per file, as needed. Perhaps like
open my $fh_ref, '<', $ref_file or die "Can't open $ref_file: $!";
while (my $line = <$fh_ref>) {
    my $key = ... # retrieve the key from $line
    print "$key: ";
    foreach my $hr (@data) {
        print "$hr->{$key} ";
    }
    say '';
}
This will print key: followed by values for that string, one from each file.
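If, say, the key is simply the first tab-separated field of each reference line (an assumption about the reference file's layout), the elided step could be:

chomp $line;
my ($key) = split /\t/, $line;   # assumed layout: first field of the reference line is the key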
I have lots of data dumps with a pretty huge amount of data, structured as follows:
Key1:.............. Value
Key2:.............. Other value
Key3:.............. Maybe another value yet
Key1:.............. Different value
Key3:.............. Invaluable
Key5:.............. Has no value at all
Which I would like to transform to something like:
Key1,Key2,Key3,Key5
Value,Other value,Maybe another value yet,
Different value,,Invaluable,Has no value at all
I mean:
Generate a collection of all the keys
Generate a header line with all the Keys
Map all the values to their correct "columns" (notice that in this example I have no "Key4", and Key3/Key5 interchanged)
Possibly in Perl, since it would be easier to use in various environments.
But I am not sure if this format is unusual, or if there is a tool that already does this.
This is fairly easy using hashes and the Text::CSV_XS module:
use strict;
use warnings;
use Text::CSV_XS;

my @rows;
my %headers;

{
    local $/ = "";
    while (<DATA>) {
        chomp;
        my %record;
        for my $line (split(/\n/)) {
            next unless $line =~ /^([^:]+):\.+\s(.+)/;
            $record{$1} = $2;
            $headers{$1} = $1;
        }
        push(@rows, \%record);
    }
}

unshift(@rows, \%headers);

my $csv = Text::CSV_XS->new({binary => 1, auto_diag => 1, eol => $/});
$csv->column_names(sort(keys(%headers)));

for my $row_ref (@rows) {
    $csv->print_hr(*STDOUT, $row_ref);
}
__DATA__
Key1:.............. Value
Key2:.............. Other value
Key3:.............. Maybe another value yet
Key1:.............. Different value
Key3:.............. Invaluable
Key5:.............. Has no value at all
Output:
Key1,Key2,Key3,Key5
Value,"Other value","Maybe another value yet",
"Different value",,Invaluable,"Has no value at all"
If your CSV format is 'complicated' - e.g. it contains commas, etc. - then use one of the Text::CSV modules. But if it isn't - and this is often the case - I tend to just work with split and join.
What's useful in your scenario, is that you can map key-values within a record quite easily using a regex. Then use a hash slice to output:
#!/usr/bin/env perl
use strict;
use warnings;

# set paragraph mode - records are blank line separated.
local $/ = "";

my @rows;
my %seen_header;

# read STDIN or files on command line, just like sed/grep
while ( <> ) {
    # multi-line pattern that matches all the key-value pairs,
    # and then inserts them into a hash.
    my %this_row = m/^(\w+):\.+ (.*)$/gm;
    push( @rows, \%this_row );
    # add the keys we've seen to a hash, so we 'know' what we've seen.
    $seen_header{$_}++ for keys %this_row;
}

# extract the keys, make them unique and ordered.
# could set this by hand if you prefer.
my @header = sort keys %seen_header;

# print the header row
print join ",", @header, "\n";

# iterate the rows
foreach my $row ( @rows ) {
    # use a hash slice to select the values matching @header.
    # the map is so any undefined values (missing keys) don't report errors, they
    # just return blank fields.
    print join ",", map { $_ // '' } @{$row}{@header}, "\n";
}
This, for your sample input, produces:
Key1,Key2,Key3,Key5,
Value,Other value,Maybe another value yet,,
Different value,,Invaluable,Has no value at all,
If you want to be really clever, most of that initial building loop can be done with:
my @rows = map { { m/^(\w+):\.+ (.*)$/gm } } <>;
The problem then is that you still need to build up the 'headers' array, and that means a bit more:
$seen_header{$_}++ for map { keys %$_ } @rows;
It works, but I don't think it's as clear about what's happening.
However, the core of your problem may be the file size - that's where you have a bit of a problem, because you need to read the file twice - the first time to figure out which headings exist throughout the file, and the second time to iterate and print:
#!/usr/bin/env perl
use strict;
use warnings;

open ( my $input, '<', 'your_file.txt' ) or die $!;
local $/ = "";

my %seen_header;
while ( <$input> ) {
    $seen_header{$_}++ for m/^(\w+):/gm;
}
my @header = sort keys %seen_header;

# return to the start of file:
seek ( $input, 0, 0 );

while ( <$input> ) {
    my %this_row = m/^(\w+):\.+ (.*)$/gm;
    print join ",", map { $_ // '' } @this_row{@header}, "\n";
}
This will be slightly slower, as it has to read the file twice. But it won't use nearly as large a memory footprint, because it isn't holding the whole file in memory.
Unless you know all your keys in advance, and you can just define them, you'll have to read the file twice.
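For completeness, if the keys are known in advance, the first pass disappears; a sketch hard-coding the header from the sample data:

my @header = qw( Key1 Key2 Key3 Key5 );   # keys known in advance: no first pass needed
print join ",", @header, "\n";
while ( <$input> ) {
    my %this_row = m/^(\w+):\.+ (.*)$/gm;
    print join ",", map { $_ // '' } @this_row{@header}, "\n";
}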
This seems to work with the data you've given
use strict;
use warnings 'all';

my %data;
while ( <> ) {
    next unless /^(\w+):\W*(.*\S)/;
    push @{ $data{$1} }, $2;
}

use Data::Dump;
dd \%data;
output
{
Key1 => ["Value", "Different value"],
Key2 => ["Other value"],
Key3 => ["Maybe another value yet", "Invaluable"],
Key5 => ["Has no value at all"],
}
I’m working with multiple vcf files in a directory (Linux server) and also a tab delimited key file that contains the sample names and the corresponding barcodes.
Here is how the files are named:
RA_4090_v1_RA_4090_RNA_v1.vcf
RA_4090_dup_v1_RA_4090_dup_RNA_v1.vcf
RA_565_v1.vcf
RA_565_dup_v1.vcf
RA_HCC-78-2.vcf
Here are contents of the key file:
Barcode ID Sample Name
IonSelect-2 RA_4090
IonSelect-4 RA_565
IonSelect-6 RA_HCC-78-2
IonSelect-10 RA_4090_dup
IonSelect-12 RA_565_dup
I need to correlate the correct sample names with each .vcf file and then rename each .vcf file.
There is always one vcf file for each sample. However, sometimes the sample names begin with the same substring, and then it's impossible to match them up correctly, since the sample names are not standardized.
The following code works well when the sample names are distinct but fails if multiple sample names begin with the same substring. I have no idea how to account for multiple sample names that begin with the same substring.
Please suggest something that will work. Here is the current code:
#!/usr/bin/perl
use warnings;
use strict;
use File::Copy qw(move);

my $home = "/data/";
my $bam_directory = $home."test_all_runs/".$ARGV[0];
my $matrix_key = $home."test_all_runs/".$ARGV[0]."/key.txt";

my @matrix_key = ();
open(TXT2, "$matrix_key") or die "Can't open '$matrix_key': $!";
while (<TXT2>){
    push (@matrix_key, $_);
}
close(TXT2);

my @ant_vcf = glob "$bam_directory/*.vcf";
for my $tsv_file (@ant_vcf){
    my $matrix_barcode_vcf = "";
    my $matrix_sample_vcf = "";
    foreach (@matrix_key){
        chomp($_);
        my @matrix_key = split ("\t", $_);##
        if (index ($tsv_file,$matrix_key[1]) != -1) {
            $matrix_barcode_vcf = $matrix_key[0]; print $matrix_key[0];
            $matrix_sample_vcf = $matrix_key[1];
            chomp $matrix_barcode_vcf;
            chomp $matrix_sample_vcf;
            #print $bam_directory."/".$matrix_sample_id."_".$matrix_barcode.".bam";
            move $tsv_file, $bam_directory."/".$matrix_sample_vcf."_".$matrix_sample_vcf.".vcf";
        }
    }
}
The key to solving your problem is sorting the 'Sample Name' entries by length - longest first.
For example, RA_4090_dup should come before RA_4090 in the @matrix_key array, so that the longer string is tried for a match first. Then, after a match, you stop searching (I used first from the List::Util module, which has been part of core Perl since version 5.8).
#!/usr/bin/perl
use strict;
use warnings;
use List::Util 'first';

my @files = qw(
    RA_4090_v1_RA_4090_RNA_v1.vcf
    RA_4090_dup_v1_RA_4090_dup_RNA_v1.vcf
    RA_565_v1.vcf
    RA_565_dup_v1.vcf
    RA_HCC-78-2.vcf
);

open my $key, '<', 'junk.txt' or die $!;   # key file
<$key>;   # throw away header line in key file (first line)
my @matrix_key = sort { length($b->[1]) <=> length($a->[1]) } map [ split ], <$key>;
close $key or die $!;

for my $tsv_file (@files) {
    if ( my $aref = first { index($tsv_file, $_->[1]) != -1 } @matrix_key ) {
        print "$tsv_file \t MATCHES $aref->[1]\n";
        print "\t$aref->[1]_$aref->[0]\n\n";
    }
}
This produced this output:
RA_4090_v1_RA_4090_RNA_v1.vcf MATCHES RA_4090
RA_4090_IonSelect-2
RA_4090_dup_v1_RA_4090_dup_RNA_v1.vcf MATCHES RA_4090_dup
RA_4090_dup_IonSelect-10
RA_565_v1.vcf MATCHES RA_565
RA_565_IonSelect-4
RA_565_dup_v1.vcf MATCHES RA_565_dup
RA_565_dup_IonSelect-12
RA_HCC-78-2.vcf MATCHES RA_HCC-78-2
RA_HCC-78-2_IonSelect-6
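To actually rename the files rather than just print the matches, the matched pair can feed move from File::Copy, as in the original script (a sketch; $bam_directory stands for whatever directory your files live in):

use File::Copy qw(move);
# inside the loop, once $aref holds the matching key-file row:
move( $tsv_file, "$bam_directory/$aref->[1]_$aref->[0].vcf" )
    or warn "Can't rename $tsv_file: $!";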
After some research I decided to put my question here for more expert answers; I couldn't find a scenario quite like my problem, so here it goes...
I think it will take a few days for me to get something working - I can't even think about how to move forward now.
DB: 11gR2
OS: Unix
I'm trying to load multiple CSV files into an Oracle table using a Perl script. The requirements:
List which CSV files to work on, since the directory where the CSV files live contains many other files.
Open each CSV file and insert its rows into the table.
If there are any errors, roll back all inserts of that file and move on to the next file.
Record how many inserts were done for each file.
#!/usr/bin/perl
use warnings;
use strict;
use Text::CSV;
use DBD::Oracle;

my $exitStatus = 0;
my $dow = `date +%a`;        chomp $dow;
my $csvDow = `date -dd +%a`; chomp $csvDow;

# define logfile
my $logFile = "log.dbinserts";

# define csv file directory
my $csvLogDir = "Home/log/$csvDow";

# csv files in an array, listing all possible matches
opendir(my $dh, $csvLogDir) || die "can't opendir $csvLogDir : $!";
my @csvFile = grep { /csv.*host1/ && -f "$csvLogDir/$_" } readdir($dh);
chomp @csvFile;
closedir $dh;

foreach my $i (@csvFile)
{
    $logFile->("CSV File: $i");
}

foreach my $file (@csvFile)
{
    chomp ($file);
    $logFile->("Working under: $file");
    &insertRecords($csvLogDir."/".$file);
}
$logFile->("Exit status");

#----------------
sub insertRecords
{
    my $fileToInsert = shift;
    my $row;
    open my $fh, "<", $fileToInsert or die "$fileToInsert: $!";

    my $csv = Text::CSV->new ({
        binary    => 1,
        auto_diag => 1,
    });

    while ($row = $csv->getline ($fh))
    {
        print "first column : $row->[0]\n";
    }
    close $fh;
}
========
CSV File
=========
date, host, first, number1, number2
20141215 13:05:08, S1, John, 100, 100.20
20141215 13:06:08, S2, Ray, 200, 200.50
...
...
...
=========
Table - tab1
=========
Sample_Date
Server
First
N1
N2
For the first step, it depends on which criteria you need to select your CSV files.
If it's on the names of those CSVs, you could simply use opendir and get the list of files with readdir:
my $dirToScan = '/var/data/csv';
opendir(my $dh, $dirToScan ) || die "can't opendir $dirToScan : $!";
my @csvFiles = grep { /\.csv$/ && -f "$dirToScan/$_" } readdir($dh);
closedir $dh;
In this example you'll retrieve an array with all the files that end with .csv (within the chosen dir).
After that you'll need to use your foreach on the array.
You can find more examples and explanation in the opendir/readdir documentation.
I don't know the structure of your CSV, but I would advise using a module like Text::CSV. It's a simple CSV parser that wraps Text::CSV_XS if it's installed on your system (the XS version is faster than the PP version because it's written in C/XS), and falls back to Text::CSV_PP otherwise.
This module allows you to transform a CSV row into an array like this:
use Text::CSV;

my $file = "listed.csv";
open my $fh, "<", $file or die "$file: $!";

my $csv = Text::CSV->new ({
    binary    => 1,   # Allow special character. Always set this
    auto_diag => 1,   # Report irregularities immediately
});

while (my $row = $csv->getline($fh)) {
    print "first column : $row->[0]\n";
}
close $fh;
(from: perlmeme.org)
You'll need to open() each file (within the foreach loop) and pass the filehandle to the Text::CSV parser (you can declare the parser outside of the loop).
That's the easiest case, where you know the column numbers of your CSV; if you need to use the column names instead, you'll want the getline_hr() function (see the CPAN doc of Text::CSV).
And once you have your values (you should be within the foreach loop over your file list, and in the while loop that reads the rows of your CSV), you will need to insert this data into your database.
For this you'll need the DBD::Oracle module that will allow you to connect to the database.
Like every DBI connector, you'll need to instantiate a connection, using this syntax:
use DBI;
$dbh = DBI->connect("dbi:Oracle:$dbname", $user, $passwd);
And then in your loop (while you're reading your CSV rows) you should be able to do something like this:
$SQL = "INSERT INTO yourTable (foobar,baz) VALUES (?,?)";
$sth = $dbh->prepare($SQL);
$sth->execute($row->[0],$row->[1]);
Here you have three steps: you prepare the request, with the values replaced by '?' placeholders (you can also use declared variables instead, if you have a lot of columns); after the preparation you execute the request with the desired values (once again, you don't have to use anonymous placeholders).
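As a side note, the prepare only needs to happen once, before the row loop; only the execute belongs inside it:

my $sth = $dbh->prepare("INSERT INTO yourTable (foobar,baz) VALUES (?,?)");
while (my $row = $csv->getline($fh)) {
    $sth->execute($row->[0], $row->[1]);   # re-use the prepared statement for every row
}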
To catch whether a request failed, you only have to set RaiseError to 1 when the connection is declared, which would look something like this:
$dbh = DBI->connect("dbi:Oracle:$dbname", $user, $passwd,
    {
        PrintError => 1,
        PrintWarn  => 1,
        RaiseError => 1
    });
And then when executing the request (try/catch as written here needs the Try::Tiny module, loaded with use Try::Tiny;):
try
{
    $sth->execute($row->[0], $row->[1]);
}
catch
{
    warn "INSERT error : $_";
    $CSVhasFailures = 1;
};
You'll need to set the value of $CSVhasFailures to 0 before each CSV file.
After that, by testing the value of $CSVhasFailures at the end of the while loop, you can decide to execute a commit or a rollback using the commit and rollback methods provided via DBI/DBD::Oracle. Note that for a rollback to have any effect, you'll need to connect with AutoCommit set to 0.
If you want to count the number of inserts, you'll just have to put a $counter++ after the $sth->execute statement.
For more info on DBD::Oracle, I suggest you read the CPAN documentation page.
Last suggestion: begin step by step. List your CSV files, read the rows of each CSV, read a column, print a set of columns, and then insert your data into a temporary table.
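Putting the steps together, here is a minimal per-file transaction sketch (the table name tab1 and its columns come from the sample above; the connection string 'dbi:Oracle:mydb', the credentials, and the header-skipping line are assumptions):

#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;
use DBI;

# AutoCommit => 0 makes each file's inserts one transaction
my $dbh = DBI->connect('dbi:Oracle:mydb', 'user', 'passwd',
    { RaiseError => 1, AutoCommit => 0 });

my $csv = Text::CSV->new({ binary => 1, auto_diag => 1 });
my $sth = $dbh->prepare(
    'INSERT INTO tab1 (Sample_Date, Server, First, N1, N2) VALUES (?,?,?,?,?)');

for my $file (@ARGV) {
    open my $fh, '<', $file or do { warn "$file: $!"; next };
    <$fh>;                     # skip the header row (assumed present)
    my $count = 0;
    eval {
        while (my $row = $csv->getline($fh)) {
            $sth->execute(@$row);
            $count++;
        }
        $dbh->commit;
        print "$file: $count rows inserted\n";
        1;
    } or do {                  # any error: undo this file's inserts
        warn "$file failed, rolling back: $@";
        $dbh->rollback;
    };
    close $fh;
}
$dbh->disconnect;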
I'm writing my first Perl script and am reading a small text file line by line. The fields are delimited by ':', so I want to split each field into a hash, using the first field (name) as the key for each. Also, (I think) I want a big hash that holds all the information, or maybe just an array that holds each field, so I can print all the info on one line based on a pattern. I've not gotten far, as the %info assignment triggers the "odd number of elements in hash assignment" warning. Should I make it a regular array, and am I even going about this the right way? Basically, lines are in this order:
name:phone:address:date:salary
#!/usr/bin/perl -w
use strict;

print $#ARGV;
if ($#ARGV == -1)
{
    print "Script needs 1 argument please.\n";
    exit 1;
}
my $inFILE = $ARGV[0];

# open the file passed
open(IN, "$inFILE") || die "Cannot open: $!";   # open databook.txt

my %info = (my %name, my %phone, my %address, my %date, my %salary);
while (<IN>)
{
    %info = (split /:/)[1];
}
close($inFILE);
First of all, you should define your data structure depending on how you will use the parsed information. If you're going to use the name as an index to search the information, I suggest a nested hash, indexed by the name:
{name => {phone => ..., address => ..., date => ..., salary => ...}, ...}
If you're not going to use the name as an index, just store the information in an array:
[ {name => ..., address => ..., date => ..., salary => ...},
{name => ..., address => ..., date => ..., salary => ...}, ...]
In most cases I would use the first one.
Secondly, arrays and hashes in Perl are flat. So this:
my %info = (my %name, my %phone, my %address, my %date, my %salary);
doesn't make sense. Use a ref to store the data.
Last but not least, Perl has syntactic sugar for input files. Use <> to read the files named in the arguments, instead of opening them explicitly. This makes the program more "Perlish".
use strict;
use warnings;
use Data::Dumper;

my $info = {};
while (<>) {
    chomp;
    my @items = split /:/, $_;
    $info->{$items[0]} = { phone   => $items[1],
                           address => $items[2],
                           date    => $items[3],
                           salary  => $items[4] };
}
print Dumper $info;
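For reference, given a hypothetical databook.txt line such as joe:555-1234:12 Main St:2015-01-01:50000, running the script as perl script.pl databook.txt would dump something like:

$VAR1 = {
          'joe' => {
                     'phone' => '555-1234',
                     'address' => '12 Main St',
                     'date' => '2015-01-01',
                     'salary' => '50000'
                   }
        };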