Find doublet data in a CSV file - Perl

I'm trying to write a Perl script that checks whether a CSV file has doublet (duplicate) data in the last two columns. If doublet data is found, an additional column with the word "doublet" should be added.
For example, the original file looks like this:
cat,111,dog,555
cat,444,dog,222
mouse,333,dog,555
mouse,555,cat,555
The final output file should look like this:
cat,111,dog,555,doublet
cat,444,dog,222
mouse,333,dog,555,doublet
mouse,555,cat,555
I'm very much a newbie to Perl scripting, so I won't embarrass myself with what I've written so far. I tried to read through the file, splitting the data into two arrays: one with the first two columns, and the other with the last two columns.
The idea was then to check for doublets in the second array, add (push?) the additional "doublet" column to that array, and then merge the two arrays back together again(?).
Unfortunately my brain has now collapsed, and I need help from someone smarter than me to guide me in the right direction.
Any help would be very much appreciated, thanks.

This is not the most efficient way, but here is something to get you started. The script assumes that your input data is comma-separated and can contain any number of columns.
#!/usr/bin/env perl
use strict;
use warnings;

my %h;
my @lines;

while (<>) {
    chomp;
    push @lines, $_;                  # save each line
    my @fields = split /,/, $_;
    if (@fields > 1) {
        # keep track of how many times a doublet appears; join with ","
        # so that adjacent values cannot run together and collide
        $h{ join(',', @fields[-2, -1]) }++;
    }
}

# go back through the lines. If a doublet appears 2 or more times,
# append ',doublet' to the output.
foreach (@lines) {
    my $d = "";
    my @fields = split /,/, $_;
    if (@fields > 1 && $h{ join(',', @fields[-2, -1]) } >= 2) {
        $d = ",doublet";
    }
    print $_, $d, $/;
}
The syntax @fields[-2,-1] is an array slice that returns a list of the last two column values. join(',', ...) then concatenates them into a single string, which becomes the hash key (the "," separator ensures adjacent values cannot run together and produce a false match). $/ is the input record separator, which is a newline by default and is quicker to write than "\n".
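As a quick illustration of the slice and the join, using values from the sample input:

my @fields = ('cat', 111, 'dog', 555);
my @last2  = @fields[-2, -1];      # ('dog', 555)
my $key    = join(',', @last2);    # "dog,555"

Running the script on the sample input produces: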
cat,111,dog,555,doublet
cat,444,dog,222
mouse,333,dog,555,doublet
mouse,555,cat,555

Related

How to replace , with . in a large text file

I wish to use Perl to write a program that looks for latitude and longitude values in a large tab-delimited text file (100,000 rows) and replaces the , used in the lat/long values with a . . The file has multiple columns.
I.e. I want to change
51,2356 to 51.2356
Can someone show me how this is done?
Many thanks,
You don't need a "program" for that; you do things like this with one-liners, really. If you want to replace ALL , (commas) with . (dots) in the entire file (your question doesn't go into the specifics of the original file format), then the one-liner below does the trick:
perl -pi.bak -e 's/,/\./g;' your_file.txt
It will also back up your file to your_file.txt.bak before running the replacement.
A quick and dirty way is to replace ALL commas with dots (if that serves your requirement):
while (<$fh>) {
    s/,/\./g;    # replace every comma on this line with a dot
    print;       # write the modified line out
}
However, if there are other fields that might be affected, a better solution is to replace commas only in the desired columns.
Assuming the two fields you're interested in are the first and second columns of the tab-delimited file, the two fields should be matched first and then have their commas replaced with dots.
while (<$fh>) {
    # capture the two comma-decimal fields and the remainder of the line
    if (/^(\d+,\d+)\t(\d+,\d+)\t(.*)$/) {
        my ($lat, $lon, $rest) = ($1, $2, $3);
        $lat =~ s/,/\./g;
        $lon =~ s/,/\./g;
        $_ = join("\t", $lat, $lon, $rest);
    }
    print;
}
Here,
$lat and $lon will match the "latitude and longitude" columns.

A Perl script to process a CSV file, aggregating properties spread over multiple records

Sorry for the vague question, I'm struggling to think how to better word it!
I have a CSV file that looks a little like this, only a lot bigger:
550672,1
656372,1
766153,1
550672,2
656372,2
868194,2
766151,2
550672,3
868179,3
868194,3
550672,4
766153,4
The values in the first column are ID numbers and the second column could be described as a property (for want of a better word...). The ID number 550672 has properties 1, 2, 3, 4. Can anyone point me towards how to produce strings like that for all the ID numbers? My ideal output would be a new CSV file which looks something like:
550672,1;2;3;4
656372,1;2
766153,1;4
etc.
I am very much a Perl baby (only 3 days old!) so I would really appreciate direction rather than an outright solution; I'm determined to learn this stuff even if it takes me the rest of my days! I have tried to investigate it myself as best I can, although I think I've been encumbered by not really knowing what to search for. I am able to read in and parse CSV files (I even got so far as removing duplicate values!) but that is really where it drops off for me. Any help would be greatly appreciated!
I think it is best if I offer you a working program rather than a few hints. Hints can only take you so far, and if you take the time to understand this code it will be a good learning experience.
It is best to use Text::CSV whenever you are processing CSV data, as all the debugging has already been done for you.
use strict;
use warnings;

use Text::CSV;

my $csv = Text::CSV->new;

open my $fh, '<', 'data.txt' or die $!;

my %data;
while (my $line = <$fh>) {
    $csv->parse($line) or die "Invalid data line";
    my ($key, $val) = $csv->fields;
    push @{ $data{$key} }, $val;
}

for my $id (sort keys %data) {
    printf "%s,%s\n", $id, join ';', @{ $data{$id} };
}
Output
550672,1;2;3;4
656372,1;2
766151,2
766153,1;4
868179,3
868194,2;3
Firstly, props for seeking an approach rather than a solution.
As you've probably already found with Perl, There Is More Than One Way To Do It.
The approach I would take would be:
use strict;   # will save you big time in the long run
my %ids;      # use a hash table with the id as the key to accumulate the properties
open a file handle on the csv, or die
while (read another line from the file handle) {
    split the line into ID and property variables   # google the split function
    append the new property to the existing properties for this id in the hash table
        # if an entry doesn't exist already, it will be created
}
foreach my $key (keys %ids) {
    deduplicate the properties
    print/display/do whatever you need to do with the result
}
This approach means you will need to iterate over the whole set twice (once reading the file, once over the hash in memory), so depending on the size of the dataset that may be a problem.
A more sophisticated approach would be to use a hash of hashes to do the deduplication in the initial step, but depending on how quickly you want/need to get it working, that may not be worthwhile in the first instance; a sketch follows below.
Check out this question for a discussion on how to do the deduplication.
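A minimal sketch of that hash-of-hashes idea (the file name data.txt is an assumption; the inner hash keys collapse duplicate properties automatically):

use strict;
use warnings;

my %ids;    # $ids{$id}{$property} = 1 -- the inner hash deduplicates

open my $fh, '<', 'data.txt' or die "Cannot open data.txt: $!";
while (my $line = <$fh>) {
    chomp $line;
    my ($id, $property) = split /,/, $line;
    $ids{$id}{$property} = 1;    # storing the same property twice is harmless
}
close $fh;

for my $id (sort keys %ids) {
    print "$id," . join(';', sort keys %{ $ids{$id} }) . "\n";
}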
Well, open the file as stdin in Perl, assume each row has two columns, then iterate over all lines, using the left column as the hash key and gathering the right column values into an array referenced by that hash key. At the end of the input file you'll have a hash of arrays, so iterate over it, printing each hash key and its assigned array elements separated by ";" or any other sign you wish.
And here you go:
dtpwmbp:~ pwadas$ cat input.txt
550672,1
656372,1
766153,1
550672,2
656372,2
868194,2
766151,2
550672,3
868179,3
868194,3
550672,4
766153,4
dtpwmbp:~ pwadas$ cat bb2.pl
#!/opt/local/bin/perl
my %hash;
while (<>)
{
    chomp;
    my ($key, $value) = split /,/;
    push @{ $hash{$key} }, $value;
}
foreach my $key (sort keys %hash)
{
    print $key . "," . join(";", @{ $hash{$key} }) . "\n";
}
dtpwmbp:~ pwadas$ cat input.txt | perl -f bb2.pl
550672,1;2;3;4
656372,1;2
766151,2
766153,1;4
868179,3
868194,2;3
dtpwmbp:~ pwadas$
A one-liner that accumulates the properties per ID and prints everything once the end of input is reached:
perl -F"," -ane 'chomp($F[1]);$X{$F[0]}=$X{$F[0]}.";".$F[1];if(eof){for(keys %X){$X{$_}=~s/;//;print $_.",".$X{$_}."\n"}}'
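For readability, here is the same logic expanded into a script (a sketch; note that the one-liner prints its keys in unspecified hash order, while this version sorts them):

#!/usr/bin/env perl
use strict;
use warnings;

my %x;
while (<>) {
    chomp;
    my ($id, $prop) = split /,/;
    # append with a ';' separator, avoiding a leading ';' on the first value
    $x{$id} = defined $x{$id} ? "$x{$id};$prop" : $prop;
}
print "$_,$x{$_}\n" for sort keys %x;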
Another (not Perl) way, which incidentally is shorter and more elegant:
#!/opt/local/bin/gawk -f
BEGIN {FS=OFS=",";}
NF > 0 { IDs[$1]=IDs[$1] ";" $2; }
END { for (i in IDs) print i, substr(IDs[i], 2); }
The first line (after specifying the interpreter) sets the input field separator (FS) and the output field separator (OFS) to the comma. The second line checks whether the line has more than zero fields and, if so, appends ";" and $2 to the entry keyed by the ID number ($1). This is done for all lines.
The END block then prints these pairs out in an unspecified order; substr(IDs[i], 2) drops the leading ";". If you want them sorted, you have the option of GNU awk's asorti function, or of piping the output of this snippet to sort -t, -k1,1n.

How do I read a CSV file column by column in order to transpose it?

I have a data set in the following format:
snp,T2DG0200001,T2DG0200002,T2DG0200003,T2DG0200004
3_60162,AA,AA,AA,AA
3_61495,AA,AA,GA,GA
3_61466,GG,GG,CG,CG
The real data is much larger than this, extending to millions of rows and about a thousand columns. My eventual goal is to transpose this monstrosity and output the result to a text file (or CSV file or whatever, it doesn't matter).
I need to feed the data to my computer piece by piece so as not to overload my memory. I read the CSV file line by line, transpose it, and write to file; I then loop back and repeat the steps, appending to the text file as I go.
The problem, of course, is that I am supposed to append to the text file column by column instead of row by row if the result is to be the transpose of the original data file. But a friend told me that is not feasible in Perl. I am wondering if I can read the data column by column instead. Is there something similar to the getline method which I used in my original code:
while (my $row = $csv->getline ($fh)) {
that returns columns instead of rows? Something akin to the Unix cut command would be preferred, provided it does not require loading the entire file into memory.
A CSV is simply a text file: one long stream of characters with no index into it, so there is no random access to columns. Ideally, you would put the CSV into a database, which would then be able to do this directly.
However, barring that, I believe you could do this in Perl with a little cleverness. My approach would be something like this:
my @filehandles;
my $line = 0;
while (my $row = $csv->getline($fh))
{
    # open an output file for each column!
    if (not defined $filehandles[0])
    {
        for (0 .. $#$row)
        {
            open my $handle, '>', "column_$_.txt" or die "Oops!";
            push @filehandles, $handle;
        }
    }
    # print each column to its respective output file.
    for (0 .. $#$row)
    {
        print { $filehandles[$_] } $row->[$_] . ",";
    }
    # This is going to take a LONG time, so show some sign of life.
    print '.' if (($line++ % 1000) == 0);
}
At the end, each column will have been printed as a row in its own text file. Don't forget to close all the files, then open them all again for reading, and write them into a single output file one at a time. My guess is this would be slow, but fast enough to handle millions of rows as long as you don't have to do it very often. And it wouldn't face memory limitations.
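A minimal sketch of that recombination step, continuing from the code above (the output name transposed.csv is an assumption):

# first, flush and close all the column files
close $_ for @filehandles;

# then write each column file back out as one row of the final output
open my $out, '>', 'transposed.csv' or die $!;
for my $col (0 .. $#filehandles) {
    open my $in, '<', "column_$col.txt" or die $!;
    my $row = <$in>;        # each column file holds a single long line
    close $in;
    $row =~ s/,$//;         # drop the trailing comma left by the writer
    print $out "$row\n";
}
close $out;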
If the file does not fit in your computer's memory, your program has to read through it multiple times. There is no way around it.
There might be modules that obscure or hide this fact - like DBD::CSV - but those just do the same work behind the scenes.

perl while nest when read files

I want to compare two files so I wrote the following code:
while ($line1 = <FH1>) {
    while ($line2 = <FH2>) {
        next if $line1 > $line2;
        last if $line1 < $line2;
    }
    next;
}
My question here is: when the outer loop comes to the next line of file1 and then enters the inner loop, will the inner while statement read from the first line of file2 again, or continue where it left off on the previous iteration of the outer loop?
Thanks
You should always use strict and use warnings at the start of all your programs, and declare all variables at their point of first use. This applies especially when you are asking for help with your code.
Is all the data in your files numeric? If not, then enabling warnings would have told you that the < and > operators are for comparing numeric values rather than general strings.
Once a file has been read through completely - i.e. the second loop's while condition has terminated - you can read no more data from the file unless you open it again or use seek to rewind to the beginning.
In general it is better in these circumstances to read the smaller of the two files into memory and use the data from there; see the sketch at the end of this answer. If both files are very large then something special must be done.
What sort of file comparison are you trying to do? Are you making sure that the two files are identical, or that all data in the second file appears in the first, or something else? Please give an example of your two data files so that we can help you better.
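A minimal sketch of that idea, assuming file2.txt is the smaller file and that you want to report lines of file1.txt that do not appear in it (both the file names and that behaviour are illustrative assumptions):

use strict;
use warnings;

open my $fh2, '<', 'file2.txt' or die $!;
my %seen;
while (my $line = <$fh2>) {
    chomp $line;
    $seen{$line} = 1;      # remember every line of the smaller file
}
close $fh2;

open my $fh1, '<', 'file1.txt' or die $!;
while (my $line = <$fh1>) {
    chomp $line;
    print "only in file1: $line\n" unless $seen{$line};
}
close $fh1;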
The inner while loop will consume all the content of the FH2 filehandle once you have read the first line from the FH1 handle. If I can intuit what you want to accomplish, one way to go about it would be to read from both handles in the same statement:
while ( defined($line1 = <FH1>) && defined($line2 = <FH2>) ) {
# 'lt' is for string comparison, '<' is for numbers
if ($line1 lt $line2) {
# print a warning?
last;
}
}
The inner loop will continue from its last known position in FH2. If you want it to restart from the beginning of the file you need to put:
use Fcntl qw(SEEK_SET);
seek(FH2, 0, SEEK_SET);
before the inner while (note the argument order: filehandle, position, whence).
Documentation for seek is in perldoc -f seek.

Include file data in Perl array

I am trying to read all the data from a file into a Perl array, and then I am trying to use a foreach loop to process every string in the array. The problem is that the foreach, instead of printing each individual line, is printing the entire array. I am using the following script:
while (<FILE>) {
    $_ =~ s/(\)|\()//g;
    push @array, $_;
}
foreach $n (@array) {
    print "$n\n";
}
Say, for example, the data in the array is @array = qw(He goes to the school everyday).
The array itself is getting printed properly, but the foreach loop, instead of printing every element on a different line, is printing the entire array on one line.
After reading your comments, I am guessing that your problem is that your source file does not contain any newlines, i.e. the entire file is just one line. Some text editors just wrap the text without actually adding any line-break characters.
There is no "solution" to that problem; you have to add line breaks where you want them. You could write a script to do it, but I doubt it would make much sense. It all depends on what you want to do with this text.
Here's my code suggestions for your snippet.
chomp(@array = <FILE>);
s/[()]//g for @array;
print "$_\n" for @array;
or
@array = <FILE>;
s/[()]//g for @array;
print @array;
Note that if your file comes from another operating system, you may get \r characters left over at the end of your strings after chomp, causing the output to look corrupted, overwriting itself.
Additional notes:
(\)|\() is better written as a character class: [()].
@array = <FILE> will read the entire file into the array. No need to loop.
As shown in my examples, print can be given a list of items (e.g. an array) as arguments. And you can use a postfix loop to print the elements sequentially.
With a (postfix) loop, all the loop elements are aliased to $_, which is a handy way to do substitutions on the array.
Since the entire file is just one line, you can split the string on whitespace and print every element of the array on a new line:
use strict;
use warnings;

open(FILE, 'YOURFILE') || die("could not open: $!");
my $line = <FILE>;
my @array = split ' ', $line;
foreach my $n (@array) {
    print "$n\n";
}
close(FILE);
Input File
In recent years many risk factors for the development of breast cancer that .....
Output
In
recent
years
many
risk
factors
for
the
development
of
breast
cancer
that
.....