Extracting unique values from multiple files in Perl

I have several data files that are tab delimited. I need to extract all the unique values in a certain column of these data files (say column 25) and write these values into an output file for further processing. How might I do this in Perl? Remember I need to consider multiple files in the same folder.
Edit: the code I have so far looks like this.
#!/usr/bin/perl
use warnings;
use strict;
my @hhfilelist = glob "*.hh3";
for my $f (@hhfilelist) {
    open F, $f or die "Cannot open $f: $!";
    while (<F>) {
        chomp;
        my @line = split /\t/;
        print "field is $line[24]\n";
    }
    close(F);
}
The question is how do I efficiently create the hash/array of unique values as I read each line of each file. Or is it faster if I populate the whole array and then remove duplicates?

Some tips on how to handle the problem:
Find files
For finding files within a directory, use glob: glob '.* *'
For finding files within a directory tree, use File::Find's find function
Open each file, use Text::CSV with \t character as the delimiter, extract wanted values and write to file

For a Perl solution, use the Text::CSV module to parse flat (X-separated) files; the constructor accepts a parameter specifying the separator character. Do this for every file in a loop, with the file list generated either by glob() for files in a given directory or by File::Find if you need subdirectories as well.
Then, to get the unique values, store column 25 of each row as a key in a hash.
E.g. after retrieving the values:
$colref = $csv->getline($io);
$unique_values_hash{ $colref->[24] } = 1;
Then, iterate over hash keys and print to a file.
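Putting those pieces together, a minimal sketch could look like the following (the *.hh3 pattern comes from the question, and unique_values.txt is just an assumed output name):
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new({ sep_char => "\t", binary => 1 })
    or die "Cannot create Text::CSV object";

my %unique_values_hash;
for my $file (glob "*.hh3") {
    open my $io, '<', $file or die "Cannot open $file: $!";
    while (my $colref = $csv->getline($io)) {
        $unique_values_hash{ $colref->[24] } = 1;    # column 25 is index 24
    }
    close $io;
}

open my $out, '>', 'unique_values.txt' or die "Cannot open output file: $!";
print {$out} "$_\n" for sort keys %unique_values_hash;
close $out;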
For a non-Perl shell solution, you can simply do:
cat MyFile_pattern | awk -F'\t' '{print $25}' | sort -u > MyUniqueValuesFile
You can also replace the awk command with cut, e.g. cut -f25 instead of awk -F'\t' '{print $25}' (cut uses TAB as its default delimiter).
Please note that the non-Perl solution only works if the fields themselves don't contain TABs and the columns aren't quoted.

perl -F/\\t/ -ane 'print"$F[24]\n" unless $seen{$F[24]}++' inputs > output
perl -F/\\t/ -ane 'print"$F[24]\n" unless $seen{$F[24]}++' *.hh3 > output
The command-line switches -F/\\t/ -an tell Perl to iterate over every line of every input file and split each line on the tab character into the array @F.
$F[24] refers to the value in the 25th field of each line (between the 24th and 25th tab characters).
$seen{...} is a hashtable to keep track of which values have already been observed.
The first time a value is observed, $seen{VALUE} is 0 so Perl will execute the statement print"$F[24]\n". Every other time the value is observed, $seen{VALUE} will be non-zero and the statement won't be executed. This way each unique value gets printed out exactly once.
In a similar context to your larger script:
my @hhfilelist = glob "*.hh3";
my %values_in_field_25 = ();
for my $f (@hhfilelist) {
    open F, $f or die "Cannot open $f: $!";
    while (<F>) {
        chomp;
        my @F = split /\t/;
        $values_in_field_25{$F[24]} = 1;
    }
    close(F);
}
my #unique_values_in_field_25 = keys %values_in_field_25; # or sort keys ...

Related

How to split a file (with sed) into numerous files according to a value found on each line?

I have several Company_***.csv files (although the separator is a tab, not a comma; hence they should be *.tsv, but never mind) which contain a header plus numerous data lines, e.g.
1stHeader 2ndHeader DateHeader OtherHeaders...
111111111 SOME STRING 2020-08-01 OTHER STRINGS..
222222222 ANOT STRING 2020-08-02 OTHER STRINGS..
I have to split them according to the 3rd column, which is a date.
Each file should be named like e.g. Company_2020_08_01.csv, Company_2020_08_02.csv and so on,
and contain the same header on the 1st line plus the matching rows as the following lines.
At first I thought about saving (once) the header in a single file e.g.
sed -n '1w Company_header.csv' Company_*.csv
then parsing the files with a pattern for the date (hence the headers would be skipped) e.g.
sed -n '/\t2020-[01][0-9]-[0-3][0-9]\t/w somefilename.csv' Company_*.csv
... and at last, insert the (missing) header in each generated file.
But I'm stuck at step 2: I can't find how I could generate (dynamically) the "filename" expected by the w command, nor how to capture the date in the search pattern (apparently that pattern is just an address, not a search-and-replace expression as in the s/regexp/replacement/[flags] command, so you can't have capturing groups ( ) in there).
So I wonder if this is actually doable with sed? Or should I look upon other tools e.g. awk?
Disclaimer: I'm quite a n00b with these commands so I'm just learning/starting from scratch...
Perl to the rescue!
perl -e 'while (<>) {
$h = $_, next if $. == 1;
$. = 0 if eof;
@c = split /\t/;
open my $out, ">>", "Company_" . $c[2] =~ tr/-/_/r . ".csv" or die $!;
print {$out} $h unless tell $out;
print {$out} $_;
}' -- Company_*.csv
The diamond operator <> in scalar context reads a line from the input.
The first line of each file is stored in the variable $h, see $. and eof
split populates the @c array with the column values of each line
$c[2] contains the date, using tr we translate dashes to underscores to create a filename from it. open opens the file for appending.
print prints the header if the file is empty (see tell)
and prints the current line, too.
Note that it only appends to the files, so don't forget to delete any output files before running the script again.

Counting records separated by CR/LF (carriage return and newline) in Perl

I'm trying to create a simple script to read a text file that contains records of book titles. Each record is separated by a blank line, i.e. a double line ending (\r\n\r\n). I need to count how many records are in the file.
For example here is the input file:
record 1
some text
record 2
some text
...
I'm using a regex to check for carriage return and newline, but it fails to match. What am I doing wrong? I'm at my wits' end.
sub readInputFile {
    my $inputFile = $_[0];                 # read first argument from the command line as fileName
    open INPUTFILE, "+<", $inputFile or die $!;   # open file
    my $singleLine;
    my @singleRecord;
    my $recordCounter = 0;
    while (<INPUTFILE>) {                  # loop through the input file line-by-line
        $singleLine = $_;
        push(@singleRecord, $singleLine);  # start adding each line to a record array
        if ($singleLine =~ m/\r\n/) {      # check for carriage return and newline
            $recordCounter += 1;
            createHashTable(@singleRecord); # send the record off to make a hash table
            @singleRecord = ();             # empty the current record to start a new record
        }
    }
    print "total records : $recordCounter \n";
    close(INPUTFILE);
}
It sounds like you are processing a Windows text file on Linux, in which case you want to open the file with the :crlf layer, which will convert all CRLF line-endings to the standard Perl \n ending.
If you are reading Windows files on a Windows platform then the conversion is already done for you, and you won't find CRLF sequences in the data you have read. If you are reading a Linux file then there are no CR characters in there anyway.
It also sounds like your records are separated by a blank line. Setting the built-in input record separator variable $/ to a null string will cause Perl to read a whole record at a time.
I believe this version of your subroutine is what you need. Note that people familiar with Perl will thank you for using lower-case letters and underscores for variable and subroutine names. Mixed case is conventionally reserved for package names.
You don't show create_hash_table so I can't tell what data it needs. I have chomped and split the record into lines, and passed a list of the lines in the record with the newlines removed. It would probably be better to pass the entire record as a single string and leave create_hash_table to process it as required.
sub read_input_file {
    my ($input_file) = @_;
    open my $fh, '<:crlf', $input_file or die $!;
    local $/ = '';
    my $record_counter = 0;
    while (my $record = <$fh>) {
        chomp $record;
        ++$record_counter;
        create_hash_table(split /\n/, $record);
    }
    close $fh;
    print "Total records : $record_counter\n";
}
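For completeness, a call to the subroutine together with a stand-in create_hash_table might look like this; the stub below is purely an assumption, since the real routine isn't shown in the question:
# Hypothetical stub - the real create_hash_table is not shown in the question.
sub create_hash_table {
    my @lines = @_;
    print "got a record with ", scalar @lines, " line(s)\n";
}

read_input_file($ARGV[0]);    # e.g. perl count_records.pl books.txt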
You can do this more succinctly by changing Perl's record-separator, which will make the loop return a record at a time instead of a line at a time.
E.g. after opening your file:
local $/ = "\r\n\r\n";
my $recordCounter = 0;
$recordCounter++ while(<INPUTFILE>);
$/ holds Perl's global record-separator, and scoping it with local allows you to override its value temporarily until the end of the enclosing block, when it will automatically revert back to its previous value.
But it sounds like the file you're processing may actually have "\n\n" record-separators, or even "\r\r". You'd need to set the record-separator correctly for whatever file you're processing.
If your files are not huge multi-gigabyte files, the easiest and safest way is to read the whole file and use the generic newline metacharacter \R.
This way, it also works if some file actually uses LF instead of CRLF (or even the old Mac standard CR).
Use it with split if you also need the actual records:
perl -ln -0777 -e 'my @records = split /\R\R/; print scalar(@records)' $Your_File
Or if you only want to count the records:
perl -ln -0777 -e 'my $count=()=/\R\R/g; print $count' $Your_File
For more details, see also my other answer here to a similar question.

Perl script to modify file

I have Oracle files that I need to compare to CVS files, but the problem is that there are many files that I want to ignore the first line(s) as part of the diff. I want to run a script that opens each file, and replaces the file contents in such a way that the final output is replacing 'CREATE OR REPLACE PACKAGE "TRON"."SOME_PACKAGE" IS' with 'CREATE OR REPLACE PACKAGE SOME_PACKAGE IS'. The problem I am having is that the statement can span several lines, so I have to consider a situation like 'CREATE OR REPLACE "TRON"."SOME_PACKAGE" IS'.
My approach (since this is part of a Jenkins job), is to loop through all the files in the workspace, modifying any files that meet this criteria. I can then use my existing Perl script that is using File::Compare and Text::Diff::Table.
I've been testing with Zaid's solution with little success, since it still is not dealing with scenarios where the command string spans multiple lines. (my changes):
use strict;
use warnings;
use Tie::File;
use Data::Dumper;
my @array;
tie @array, 'Tie::File', 'c:\cb_k_check_recon_mma.sps' or die "Unable to tie file";
my %unwanted = map  { $_ => 1 }
               map  { $_-1 .. $_-4, $_, $_+2 .. $_+4 }
               grep { $array[$_] =~ /^CREATE.*[IS|AS]$/ }
               0 .. $#array;
print Dumper \%unwanted;
@array = map { $array[$_] } grep { ! $unwanted{$_} } 0 .. $#array;
print Dumper \@array;
untie @array;
If the text can span several lines, for a single regex to work you need to read the file into a string, not line-by-line.
perl -0777 -pi.bak -e 's/CREATE\s+OR\s+REPLACE\s+PACKAGE\s+"TRON"\."SOME_PACKAGE"\s+IS/CREATE OR REPLACE PACKAGE SOME_PACKAGE IS/g' /path/*.pl
The -0777 switch tells perl to slurp the file, so the regex will only be run once. For that reason, I added the global /g modifier, in case more than one substitution per file is needed.
As you see, I use \s+ instead of a literal space, to match possible randomly inserted newlines. -pi in short means to perform an in-place edit on the target file(s), and .bak after -i means to save backups with that extension. It is advisable to save backups, but not required (except on Windows).
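If you would rather keep this as a small script inside the Jenkins job than as a one-liner, the same slurp-and-substitute idea might look roughly like the sketch below. The *.sps glob is only an assumption based on the file named in the question, and the capture group is a slight generalisation so that any package name is preserved, not just SOME_PACKAGE:
#!/usr/bin/perl
use strict;
use warnings;

# Assumed pattern for the workspace files - adjust to your Oracle sources.
for my $file (glob '*.sps') {
    open my $in, '<', $file or die "Cannot open $file: $!";
    my $text = do { local $/; <$in> };    # slurp the whole file
    close $in;

    # Strip the "TRON"."..." qualifier even when the statement spans several lines.
    $text =~ s/CREATE\s+OR\s+REPLACE\s+PACKAGE\s+"TRON"\."(\w+)"\s+IS/CREATE OR REPLACE PACKAGE $1 IS/g;

    # Note: unlike -i.bak above, no backup copy is kept here.
    open my $out, '>', $file or die "Cannot write $file: $!";
    print {$out} $text;
    close $out;
}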

How can I combine files into one CSV file?

If I have one file FOO_1.txt that contains:
FOOA
FOOB
FOOC
FOOD
...
and lots of other files FOO_files.txt. Each of them contains:
1110000000...
one line of 0s and 1s, one digit for each of the FOO_1 values (fooa, foob, ...)
Now I want to combine them to one file FOO_RES.csv that will have the following format:
FOOA,1,0,0,0,0,0,0...
FOOB,1,0,0,0,0,0,0...
FOOC,1,0,0,0,1,0,0...
FOOD,0,0,0,0,0,0,0...
...
What is a simple and elegant way to do that
(with a hash and arrays -> $hash{$key} = \@data )?
Thanks a lot for any help !
Yohad
If you can't describe your data and your desired result clearly, there is no way that you will be able to code it. Still, taking on a simple project like this is a good way to get started using a new language.
Allow me to present a simple method you can use to churn out code in any language, whether you know it or not. This method only works for smallish projects. You'll need to actually plan ahead for larger projects.
How to write a program:
1. Open up your text editor and write down what data you have. Make each line a comment.
2. Describe your desired results.
3. Start describing the steps needed to change your data into the desired form.
Numbers 1 & 2 completed:
#!/usr/bin/perl
use strict;
use warnings;
# Read data from multiple files and combine it into one file.
# Source files:
# Field definitions: has a list of field names, one per line.
# Data files:
# * Each data file has a string of digits.
# * There is a one-to-one relationship between the digits in the data file and the fields in the field defs file.
#
# Results File:
# * The results file is a CSV file.
# * Each field will have one row in the CSV file.
# * The first column will contain the name of the field represented by the row.
# * Subsequent values in the row will be derived from the data files.
# * The order of subsequent fields will be based on the order files are read.
# * However, each column (2-X) must represent the data from one data file.
Now that you know what you have, and where you need to go, you can flesh out what the program needs to do to get you there - this is step 3:
You know you need to have the list of fields, so get that first:
# Get a list of fields.
# Read the field definitions file into an array.
Since it is easiest to write CSV in a row oriented fashion, you will need to process all your files before generating each row. So you'll need someplace to store the data.
# Create a variable to store the data structure.
Now we read the data files:
# Get a list of data files to parse
# Iterate over list
# For each data file:
# Read the string of digits.
# Assign each digit to its field.
# Store data for later use.
We've got all the data in memory, now write the output:
# Write the CSV file.
# Open a file handle.
# Iterate over list of fields
# For each field
# Get field name and list of values.
# Create a string - comma separated string with field name and values
# Write string to file handle
# close file handle.
Now you can start converting comments into code. You could have anywhere from 1 to 100 lines of code for each comment. You may find that something you need to do is very complex and you don't want to take it on at the moment. Make a dummy subroutine to handle the complex task, and ignore it until you have everything else done. Now you can solve that complex, thorny sub-problem on its own.
Since you are just learning Perl, you'll need to hit the docs to find out how to do each of the subtasks represented by the comments you've written. The best resource for this kind of work is the list of functions by category in perlfunc. The Perl syntax guide will come in handy too. Since you'll need to work with a complex data structure, you'll also want to read from the Data Structures Cookbook.
You may be wondering how the heck you should know which perldoc pages you should be reading for a given problem. An article on Perlmonks titled How to RTFM provides a nice introduction to the documentation and how to use it.
The great thing, is if you get stuck, you have some code to share when you ask for help.
If I understand correctly your first file is your key order file, and the remaining files each contain a byte per key in the same order. You want a composite file of those keys with each of their data bytes listed together.
In this case you should open all the files simultaneously. Read one key from the key order file, read one byte from each of the data files. Output everything as you read it to you final file. Repeat for each key.
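A rough sketch of that simultaneous-read approach might look like this; it assumes the key file is FOO_1.txt, that the data files match FOO_*.txt, and that each data file holds exactly one digit per key:
#!/usr/bin/perl
use strict;
use warnings;

open my $keys, '<', 'FOO_1.txt' or die "Cannot open FOO_1.txt: $!";

# Open every data file up front (the glob pattern is an assumption).
my @data_fhs;
for my $file (sort glob 'FOO_*.txt') {
    next if $file eq 'FOO_1.txt';
    open my $fh, '<', $file or die "Cannot open $file: $!";
    push @data_fhs, $fh;
}

open my $out, '>', 'FOO_RES.csv' or die "Cannot open FOO_RES.csv: $!";
while (my $key = <$keys>) {
    chomp $key;
    # Read one character from each data file for this key.
    my @digits = map { my $c = getc $_; defined $c ? $c : '' } @data_fhs;
    print {$out} join(',', $key, @digits), "\n";
}
close $_ for $out, $keys, @data_fhs;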
It looks like you have many foo_files that have 1 line in them, something like:
1110000000
Which stands for
fooa=1
foob=1
fooc=1
food=0
fooe=0
foof=0
foog=0
fooh=0
fooi=0
fooj=0
And it looks like your foo_res is just a summation of those values? In that case, you don't need a hash of arrays, but just a hash.
my @foo_files = ();   # NOT SURE HOW YOU POPULATE THIS ONE
my @foo_keys  = qw(a b c d e f g h i j);
my %foo_hash  = map { ( $_, 0 ) } @foo_keys;   # initialize hash
foreach my $foo_file ( @foo_files ) {
    open( my $FOO, "<", $foo_file ) || die "Cannot open $foo_file\n";
    my $line = <$FOO>;
    close( $FOO );
    chomp($line);
    my @foo_values = split(//, $line);
    foreach my $indx ( 0 .. $#foo_keys ) {
        last if ( ! defined $foo_values[ $indx ] ); # or some kind of error checking if the input file doesn't have all the values
        $foo_hash{ $foo_keys[$indx] } += $foo_values[ $indx ];
    }
}
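If a per-key total is really what you are after, you could then write the hash out along the same lines (FOO_RES.csv is the name used in the question; the foo prefix is reconstructed from the key letters):
open( my $OUT, ">", "FOO_RES.csv" ) || die "Cannot open FOO_RES.csv\n";
foreach my $foo_key ( @foo_keys ) {
    print $OUT "foo$foo_key,$foo_hash{$foo_key}\n";
}
close( $OUT );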
It's pretty hard to understand what you are asking for, but maybe this helps?
Your specifications aren't clear. You can't have "lots of other files" all named FOO_files.txt, because that's only one name. So I'm going to take this as the files-with-data + filelist pattern. In this case, there are files named FOO*.txt, each containing "[01]+\n".
Thus the idea is to process all the files in the filelist file and to insert them all into a result file FOO_RES.csv, comma-delimited.
use strict;
use warnings;
use English qw<$OS_ERROR>;
use IO::Handle;
open my $foos, '<', 'FOO_1.txt'
    or die "I'm dead: $OS_ERROR";
@ARGV = sort map { chomp; "$_.txt" } <$foos>;
$foos->close;
open my $foo_csv, '>', 'FOO_RES.csv'
    or die "I'm dead: $OS_ERROR";
while ( my $line = <> ) {
    chomp $line;
    my ( $foo_name ) = ( $ARGV =~ /(.*)\.txt$/ );
    $foo_csv->print( join( ',', $foo_name, split //, $line ), "\n" );
}
$foo_csv->close;
You don't really need to use a hash. My Perl is a little rusty, so syntax may be off a bit, but basically do this:
open KEYFILE, "foo_1.txt" or die "cannot open foo_1 for reading";
open VALFILE, "foo_files.txt" or die "cannot open foo_files for reading";
open OUTFILE, ">foo_out.txt" or die "cannot open foo_out for writing";
my %output;
while (<KEYFILE>) {
    chomp( my $key = $_ );
    chomp( my $val = <VALFILE> );
    my @arrVal = split(//, $val);
    $output{$key} = \@arrVal;
    print OUTFILE $key . "," . join(",", @arrVal) . "\n";
}
Edit: Syntax check OK
Comment by Sinan: @Byron, it really bothers me that your first sentence says the OP does not need a hash yet your code has %output which seems to serve no purpose. For reference, the following is a less verbose way of doing the same thing.
#!/usr/bin/perl
use strict;
use warnings;
use autodie qw(:file :io);
open my $KEYFILE, '<', "foo_1.txt";
open my $VALFILE, '<', "foo_files.txt";
open my $OUTFILE, '>', "foo_out.txt";
while (my $key = <$KEYFILE>) {
    chomp $key;
    chomp( my $val = <$VALFILE> );
    print $OUTFILE join(q{,}, $key, split //, $val), "\n";
}
__END__

Searching/reading another file from awk based on current file's contents, is it possible?

I'm processing a huge file with (GNU) awk, (other available tools are: Linux shell tools, some old (>5.0) version of Perl, but can't install modules).
My problem: if some field1, field2, field3 contain X, Y, Z I must search for a file in another directory which contains field4, and field5 on one line, and insert some data from the found file to the current output.
E.g.:
Actual file line:
f1 f2 f3 f4 f5
X Y Z A B
Now I need to search for another file (in another directory), which contains e.g.
f1 f2 f3 f4
A U B W
And write to STDOUT $0 from the original file, and f2 and f3 from the found file, then process the next line of the original file.
Is it possible to do it with awk?
Let me start out by saying that your problem description isn't really that helpful. Next time, please just be more specific: You might be missing out on much better solutions.
So from your description, I understand you have two files which contain whitespace-separated data. In the first file, you want to match the first three columns against some search pattern. If found, you want to find all lines in another file which contain the fourth and fifth columns of the matching line in the first file. From those lines, you need to extract the second and third columns and then print the first column of the first file and the second and third from the second file. Okay, here goes:
#!/usr/bin/env perl -nwa
use strict;
use File::Find 'find';
my @search = qw(X Y Z);
# if you know in advance that the otherfile isn't
# huge, you can cache it in memory as an optimization.

# with any more columns, you want a loop here:
if ($F[0] eq $search[0]
    and $F[1] eq $search[1]
    and $F[2] eq $search[2])
{
    my @files;
    find(sub {
        return if not -f $_;
        # verbatim search for the columns in the file name.
        # I'm still not sure what your file-search criteria are, though.
        push @files, $File::Find::name if /\Q$F[3]\E/ and /\Q$F[4]\E/;
        # alternatively search for the combination:
        #push @files, $File::Find::name if /\Q$F[3]\E.*\Q$F[4]\E/;
        # or search *all* files in the search path?
        #push @files, $File::Find::name;
    }, '/search/path');

    foreach my $file (@files) {
        open my $fh, '<', $file or die "Can't open file '$file': $!";
        while (defined($_ = <$fh>)) {
            chomp;
            # order of fields doesn't matter per your requirement.
            my @cols = split ' ', $_;
            my %seen = map { ($_ => 1) } @cols;
            if ($seen{$F[3]} and $seen{$F[4]}) {
                print join(' ', $F[0], @cols[1,2]), "\n";
            }
        }
        close $fh;
    }
} # end if matching line
Unlike another poster's solution which contains lots of system calls, this doesn't fall back to the shell at all and thus should be plenty fast.
This is the type of work that got me to move from awk to perl in the first place. If you are going to accomplish this, you may actually find it easier to create a shell script that creates awk script(s) to query and then update in separate steps.
(I've written such a beast for reading/updating windows-ini-style files - it's ugly. I wish I could have used perl.)
I often see the restriction "I can't use any Perl modules", and when it's not a homework question, it's often just due to a lack of information. The article "Yes, even you can use CPAN" contains instructions on how to install CPAN modules locally without having root privileges. Another alternative is just to take the source code of a CPAN module and paste it into your program.
None of this helps if there are other, unstated, restrictions, like lack of disk space that prevent installation of (too many) additional files.
This seems to work for some test files I set up matching your examples. Involving perl in this manner (interposed with grep) is probably going to hurt the performance a great deal, though...
## perl code to do some dirty work
for my $line (`grep 'X Y Z' myhugefile`) {
    chomp $line;
    my ($a, $b, $c, $d, $e) = split(/ /, $line);
    my $cmd = 'grep -P "' . $d . ' .+? ' . $e . '" otherfile';
    for my $from_otherfile (`$cmd`) {
        chomp $from_otherfile;
        my ($oa, $ob, $oc, $od) = split(/ /, $from_otherfile);
        print "$a $ob $oc\n";
    }
}
EDIT: Use tsee's solution (above), it's much more well-thought-out.