how to replace , to . in large text file - perl

I wish to use perl and write a program that looks for latitude and longitude values in a large tab delimited text file (100000 rows), and replaces the , used in the lat long values to a . . the file has multiple columns.
ie. I want to change it to.
51,2356 to 51.2356
can someone show me how this is done?
many thanks,

You don't need a "program" for that, you do things like this with one liners really. If you want to replace ALL , (commas) in file with . (dots) in entire file (your question doesn't go into specifics of original file format) then below does the trick:
perl -pi.bak -e 's/,/\./g;' your_file.txt
It will also backup your file before running replace to your_file.txt.bak.

Quick and dirty way is to replace ALL commas by dots ( if that will serve your requirement)
while (<$fh>) {
$_ =~ s/,/\./g
}
However, if there are other fields which might be affected, better solution would be to replace only the desired columns.
Assuming the two fields that you're interested are the first and second column of the tab-delimited file, ideally the two fields should be matched and then replaced with "dot".
while (<$fh>) {
if ($_ =~ /(\d+,\d+)\|(\d+,\d+)\|(.*)$/) {
my ($a, $b, $c) = ($1, $2, $3);
$a=~s/,/\./g;
$b=~s/,/\./g;
$_= $a."|".$b."|".$c;
}
}
Here,
$a and $b will match the "latitude and longitude" columns.

Related

How to split a file (with sed) into numerous files according to a value found on each line?

I have several Company_***.csv files (altough the separator's a tab not a comma; hence should be *.tsv, but never mind) which contains a header plus numerous data lines e.g
1stHeader 2ndHeader DateHeader OtherHeaders...
111111111 SOME STRING 2020-08-01 OTHER STRINGS..
222222222 ANOT STRING 2020-08-02 OTHER STRINGS..
I have to split them according to the 3rd column here, it's a date.
Each file should be named like e.g. Company_2020_08_01.csv Company_2020_08_02.csv & so one
and containing: same header on the 1st line + matching rows as the following lines.
At first I thought about saving (once) the header in a single file e.g.
sed -n '1w Company_header.csv' Company_*.csv
then parsing the files with a pattern for the date (hence the headers would be skipped) e.g.
sed -n '/\t2020-[01][0-9]-[0-3][0-9]\t/w somefilename.csv' Company_*.csv
... and at last, insert the (missing) header in each generated file.
But I'm stuck at step 2: I can't find how I could generate (dynamically) the "filename" expected by the w command, neither how to capture the date in the search pattern (because apparently this is just an address, not a search-replace "field" as in the s/regexp/replacement/[flags] command, so you can't have capturing groups ( ) in there).
So I wonder if this is actually doable with sed? Or should I look upon other tools e.g. awk?
Disclaimer: I'm quite a n00b with these commands so I'm just learning/starting from scratch...
Perl to the rescue!
perl -e 'while (<>) {
$h = $_, next if $. == 1;
$. = 0 if eof;
#c = split /\t/;
open my $out, ">>", "Company_" . $c[2] =~ tr/-/_/r . ".csv" or die $!;
print {$out} $h unless tell $out;
print {$out} $_;
}' -- Company_*.csv
The diamond operator <> in scalar context reads a line from the input.
The first line of each file is stored in the variable $h, see $. and eof
split populates the #c array by the column values for each line
$c[2] contains the date, using tr we translate dashes to underscores to create a filename from it. open opens the file for appending.
print prints the header if the file is empty (see tell)
and prints the current line, too.
Note that it only appends to the files, so don't forget to delete any output files before running the script again.

Find doublet data in csv file

I'm trying to write a Perl script that can check if a csv file has doublet data in the two last columns. If doublet data is found, an additional column with the word "doublet" should be added:
Example, the original file looks like this:
cat,111,dog,555
cat,444,dog,222
mouse,333,dog,555
mouse,555,cat,555
The final output file should look like this:
cat,111,dog,555,doublet
cat,444,dog,222
mouse,333,dog,555,doublet
mouse,555,cat,555
I'm very much a newbie to Perl scripting, so I won't expose myself with what i've written so far. I tried to read through the file splitting the data into two arrays, one with the first two columns, and the other with the last two columns
The idea was then to check for doublets in the second array, and add (push?) the additional column with the "doublets" information to that array, and then afterwards merge to two array back together again(?)
Unfortunately my brain has now collapsed, and I need help from someone smarter than me, to guide me in the right direction.
Any help would be very much appreciated, thanks.
This is not the most efficient way but here is something to get you started. Script assumes that your input data is comma separated and can contain any number of columns.
#!/usr/bin/env perl
use strict;
use warnings;
my %h;
my #lines;
while (<>) {
chomp;
push (#lines,$_); # save each line
my #fields = split(/,/,$_);
if(#fields > 1) {
$h{join("",#fields[-2,-1])}++; # keep track of how many times a doublet appears.
}
}
# go back through the lines. If doublet appears 2 or more times, append ',doublet' to the output.
foreach (#lines) {
my $d = "";
my #fields = split(/,/,$_);
if (#fields > 1 && $h{join("",#fields[-2,-1])} >= 2) {
$d = ",doublet";
}
print $_,$d,$/;
}
The syntax #fields[-2,-1] is an array slice that returns an array with the last two column values. Then, join("",...) concatenates them together and this becomes the key to the hash. $/ is the input record separator which is newline by default and is quicker to write than "\n"
cat,111,dog,555,doublet
cat,444,dog,222
mouse,333,dog,555,doublet
mouse,555,cat,555

A Perl script to process a CSV file, aggregating properties spread over multiple records

Sorry for the vague question, I'm struggling to think how to better word it!
I have a CSV file that looks a little like this, only a lot bigger:
550672,1
656372,1
766153,1
550672,2
656372,2
868194,2
766151,2
550672,3
868179,3
868194,3
550672,4
766153,4
The values in the first column are a ID numbers and the second column could be described as a property (for want of a better word...). The ID number 550672 has properties 1,2,3,4. Can anyone point me towards how I can begin solving how to produce strings such as that for all the ID numbers? My ideal output would be a new csv file which looks something like:
550672,1;2;3;4
656372,1;2
766153,1;4
etc.
I am very much a Perl baby (only 3 days old!) so would really appreciate direction rather than an outright solution, I'm determined to learn this stuff even if it takes me the rest of my days! I have tried to investigate it myself as best as I can, although I think I've been encumbered by not really knowing what to really search for. I am able to read in and parse CSV files (I even got so far as removing duplicate values!) but that is really where it drops off for me. Any help would be greatly appreciated!
I think it is best if I offer you a working program rather than a few hints. Hints can only take you so far, and if you take the time to understand this code it will give you a good learning experience
It is best to use Text::CSV whenever you are processing CSV data as all the debugging has already been done for you
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new;
open my $fh, '<', 'data.txt' or die $!;
my %data;
while (my $line = <$fh>) {
$csv->parse($line) or die "Invalid data line";
my ($key, $val) = $csv->fields;
push #{ $data{$key} }, $val
}
for my $id (sort keys %data) {
printf "%s,%s\n", $id, join ';', #{ $data{$id} };
}
output
550672,1;2;3;4
656372,1;2
766151,2
766153,1;4
868179,3
868194,2;3
Firstly props for seeking an approach not a solution.
As you've probably already found with perl, There Is More Than One Way To Do It.
The approach I would take would be;
use strict; # will save you big time in the long run
my %ids # Use a hash table with the id as the key to accumulate the properties
open a file handle on csv or die
while (read another line from the file handle){
split line into ID and property variable # google the split function
append new property to existing properties for this id in the hash table # If it doesn't exist already, it will be created
}
foreach my $key (keys %ids) {
deduplicate properties
print/display/do whatever you need to do with the result
}
This approach means you will need to iterate over the whole set twice (once in memory), so depending on the size of the dataset that may be a problem.
A more sophisticated approach would be to use a hashtable of hashtables to do the de duplication in the intial step, but depending on how quickly you want/need to get it working, that may not be worthwhile in the first instance.
Check out
this question
for a discussion on how to do the deduplication.
Well, open the file as stdin in perl, assume each row is of two columns, then iterate over all lines using left column as hash identifier, and gathering right column into an array pointed by a hash key. At the end of input file you'll get a hash of arrays, so iterate over it, printing a hash key and assigned array elements separated by ";" or any other sign you wish.
and here you go
dtpwmbp:~ pwadas$ cat input.txt
550672,1
656372,1
766153,1
550672,2
656372,2
868194,2
766151,2
550672,3
868179,3
868194,3
550672,4
766153,4
dtpwmbp:~ pwadas$ cat bb2.pl
#!/opt/local/bin/perl
my %hash;
while (<>)
{
chomp;
my($key, $value) = split /,/;
push #{$hash{$key}} , $value ;
}
foreach my $key (sort keys %hash)
{
print $key . "," . join(";", #{$hash{$key}} ) . "\n" ;
}
dtpwmbp:~ pwadas$ cat input.txt | perl -f bb2.pl
550672,1;2;3;4
656372,1;2
766151,2
766153,1;4
868179,3
868194,2;3
dtpwmbp:~ pwadas$
perl -F"," -ane 'chomp($F[1]);$X{$F[0]}=$X{$F[0]}.";".$F[1];if(eof){for(keys %X){$X{$_}=~s/;//;print $_.",".$X{$_}."\n"}}'
Another (not perl) way which incidentally is shorter and more elegant:
#!/opt/local/bin/gawk -f
BEGIN {FS=OFS=",";}
NF > 0 { IDs[$1]=IDs[$1] ";" $2; }
END { for (i in IDs) print i, substr(IDs[i], 2); }
The first line (after specifying the interpreter) sets the input FIELD SEPARATOR and the OUTPUT FIELD SEPARATOR to the comma. The second line checks of we have more than zero fields and if you do it makes the ID ($1) number the key and $2 the value. You do this for all lines.
The END statement will print these pairs out in an unspecified order. If you want to sort them you have to option of asorti gnu awk function or connecting the output of this snippet with a pipe to sort -t, -k1n,1n.

Extracting unique values from multiple files in Perl

I have several data files that are tab delimited. I need to extract all the unique values in a certain column of these data files (say column 25) and write these values into an output file for further processing. How might I do this in Perl? Remember I need to consider multiple files in the same folder.
edit: The code I've done thus far is like this.
#!/usr/bin/perl
use warnings;
use strict;
my #hhfilelist = glob "*.hh3";
for my $f (#hhfilelist) {
open F, $f || die "Cannot open $f: $!";
while (<F>) {
chomp;
my #line = split /\t/;
print "field is $line[24]\n";
}
close (F);
}
The question is how do I efficiently create the hash/array of unique values as I read each line of each file. Or is it faster if I populate the whole array and then remove duplicates?
Some tips on how to handle the problem:
Find files
For finding files within a directory, use glob: glob '.* *'
For finding files within a directory tree, use File::Find's find function
Open each file, use Text::CSV with \t character as the delimiter, extract wanted values and write to file
For Perl solution, please use Text::CSV module to parse flat (X-separated) files - the constructor accepts a parameter specifying a separator character. Do this for every file in a loop, with file list generated by either glob() for files in a given directory or File::Find for subdirectories as well
Then, to get the unique values, for each row, store the column #25 in a hash.
E.g. after retrieving the values:
$colref = $csv->getline($io);
$unique_values_hash{ $colref->[24] } = 1;
Then, iterate over hash keys and print to a file.
For non-Perl shell solution, you can simply do:
cat MyFile_pattern | awk -F'\t' 'print $25' |sort -u > MyUniqueValuesFile
You can replace awk with cut
Please note that non-Perl solution only works if the files don't contain TABs in the fields themselves and the columns aren't quoted.
perl -F/\\t/ -ane 'print"$F[24]\n" unless $seen{$F[24]}++' inputs > output
perl -F/\\t/ -ane 'print"$F[24]\n" unless $seen{$F[24]}++' *.hh3 > output
Command-line switches -F/\\t/ -an mean iterate through every line in every input file and split the line on the tab character into the array #F.
$F[24] refers to the value in the 25-th field of each line (between the 24-th and 25-th tab characters)
$seen{...} is a hashtable to keep track of which values have already been observed.
The first time a value is observed, $seen{VALUE} is 0 so Perl will execute the statement print"$F[24]\n". Every other time the value is observed, $seen{VALUE} will be non-zero and the statement won't be executed. This way each unique value gets printed out exactly once.
In a similar context to your larger script:
my #hhfilelist = glob "*.hh3";
my %values_in_field_25 = ();
for my $f (#hhfilelist) {
open F, $f || die "Cannot open $f: $!";
while (<F>) {
my #F = split /\t/;
$values_in_field_25{$F[24]} = 1;
}
close (F);
}
my #unique_values_in_field_25 = keys %values_in_field_25; # or sort keys ...

How can I filter out specific column from a CSV file in Perl?

I am just a beginner in Perl and need some help in filtering columns using a Perl script.
I have about 10 columns separated by comma in a file and I need to keep 5 columns in that file and get rid of every other columns from that file. How do we achieve this?
Thanks a lot for anybody's assistance.
cheers,
Neel
Have a look at Text::CSV (or Text::CSV_XS) to parse CSV files in Perl. It's available on CPAN or you can probably get it through your package manager if you're using Linux or another Unix-like OS. In Ubuntu the package is called libtext-csv-perl.
It can handle cases like fields that are quoted because they contain a comma, something that a simple split command can't handle.
CSV is an ill-defined, complex format (weird issues with quoting, commas, and spaces). Look for a library that can handle the nuances for you and also give you conveniences like indexing by column names.
Of course, if you're just looking to split a text file by commas, look no further than #Pax's solution.
Use split to pull the line apart then output the ones you want (say every second column), create the following xx.pl file:
while(<STDIN>) {
chomp;
#fields = split (",",$_);
print "$fields[1],$fields[3],$fields[5],$fields[7],$fields[9]\n"
}
then execute:
$ echo 1,2,3,4,5,6,7,8,9,10 | perl xx.pl
2,4,6,8,10
If you are talking about CSV files in windows (e.g., generated from Excel), you will need to be careful to take care of fields that contain comma themselves but are enclosed by quotation marks.
In this case, a simple split won't work.
Alternatively, you could use Text::ParseWords, which is in the standard library. Add
use Text::ParseWords;
to the top of Pax's example above, and then substitute
my #fields = parse_line(q{,}, 0, $_);
for the split.
You can use some of Perl's built in runtime options to do this on the command line:
$ echo "1,2,3,4,5" | perl -a -F, -n -e 'print join(q{,}, $F[0], $F[3]).qq{\n}'
1,4
The above will -a(utosplit) using the -F(ield) of a comma. It will then join the fields you are interested in and print them back out (with a line separator). This assumes simple data without nested comma's. I was doing this with an unprintable field separator (\x1d) so this wasn't an issue for me.
See http://perldoc.perl.org/perlrun.html#Command-Switches for more details.
Went looking didn't find a nice csv compliant filter program thats flexible to be useful for than just a one-of, so I wrote one. Enjoy.
Basic usage is:
bash$ csvfilter [-r <columnTitle>]* [-quote] <csv.file>
#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Long;
use Text::CSV;
my $always_quote=0;
my #remove;
if ( ! GetOptions('remove:s'=> \#remove,
'quote-always'=>sub {$always_quote=1;}) ) {
die "$0:invalid option (use --remove [--quote-always])";
}
my #cols2remove;
sub filter(#)
{
my #fields=#_;
my #r;
my $i=0;
for my $c (#cols2remove) {
my $p;
#if ( $i $i ) {
push(#r, splice(#fields, $i));
}
return #r;
}
# create just one if these
my $csvOut=new Text::CSV({always_quote=>$always_quote});
sub printLine(#)
{
my #fields=#_;
my $combined=$csvOut->combine(filter(#fields));
my $str=$csvOut->string();
if ( length($str) ) {
print "$str\n";
}
}
my $csv = Text::CSV->new();
my $od;
open($od, "| cat") || die "output:$!";
while () {
$csv->parse($_);
if ( $. == 1 ) {
my $failures=0;
my #cols=$csv->fields;
for my $rm (#remove) {
for (my $c=0; $c$b} #cols2remove);
}
printLine($csv->fields);
}
exit(0);
\
In addition to what people here said about processing comma-separated files, I'd like to note that one can extract the even (or odd) array elements using an array slice and/or map:
#myarray[map { $_ * 2 } (0 .. 4)]
Hope it helps.
My personal favorite way to do CSV is using the AnyData module. It seems to make things pretty simple, and removing a named column can be done rather easily. Take a look on CPAN.
This answers a much larger question, but seems like a good relevant bit of information.
The unix cut command can do what you want (and a whole lot more). It has been reimplemented in Perl.