I am new to Perl and trying to learn it. I have two files, 'file1' and 'file2', and I need to find which symbols in 'file1' are not in 'file2' for companyA in departments B and C.
File1
GTY
TTY
UJK
TRE
File2
departmentA_companyA.try=675 UJK 88 KKR
departmentA_companyB.try=878 UJK 37 TAR
departmentA_companyC.try=764 UJK 92 PAM
departmentB_companyA.try=675 UJK 88 KKR
departmentB_companyB.try=878 UJK 37 TAR
departmentB_companyC.try=764 UJK 92 PAM
departmentC_companyA.try=675 UJK 88 KKR
departmentC_companyB.try=878 UJK 37 TAR
departmentC_companyC.try=764 UJK 92 PAM
Create a list of all the symbols in file1
Go through file2. If the criteria match, delete the symbol from the list.
In this case, I'd suggest you use the keys of a hash to store this list ($symbols{$symbol} = 1;). This is because it's easy and cheap to delete from a hash (delete $symbols{$symbol};).
use strict;
use warnings;
use feature qw( say );

my %symbols;

{
    open(my $fh, '<', 'file1')
        or die("Can't open file1: $!\n");

    while (<$fh>) {
        chomp;
        ++$symbols{$_};
    }
}

{
    open(my $fh, '<', 'file2')
        or die("Can't open file2: $!\n");

    while (<$fh>) {
        chomp;
        my ($key, $val) = split /=/;
        my ($dept, $co) = split /[_.]/, $key;
        if ($co eq 'companyA' && ($dept eq 'departmentB' || $dept eq 'departmentC')) {
            my @symbols = split ' ', $val;
            delete @symbols{@symbols};
        }
    }
}

say for keys %symbols;
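Note that delete @symbols{@symbols} is a hash-slice delete: every token on a matching line is removed from %symbols in a single statement (tokens that aren't keys are simply ignored), so whatever remains at the end is exactly the set of symbols never seen for the selected company and departments.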
You can use a hash to count the number of times each symbol appears in the file, then print the ones that have a count of 0.
use strict;
use warnings;

open SYMS, '<', $ARGV[0] or die "Can't open $ARGV[0]: $!";
open INFILE, '<', $ARGV[1] or die "Can't open $ARGV[1]: $!";

my %symbols;
while (<SYMS>) {
    chomp;
    $symbols{$_} = 0;
}

while (<INFILE>) {
    my @F = split;
    next unless $F[0] =~ /companyA/;
    next unless $F[0] =~ /department[BC]/;
    ++$symbols{$F[1]} if defined $symbols{$F[1]};
    ++$symbols{$F[3]} if defined $symbols{$F[3]};
}

for my $symbol (keys %symbols) {
    print "$symbol\n" if $symbols{$symbol} == 0;
}
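Run it with both file names as arguments, symbols file first, e.g. perl count_symbols.pl file1 file2 (the script name here is hypothetical).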
I would like to know of a fast/efficient way in any program (awk/perl/python) to split a csv file (say 10k columns) into multiple small files each containing 2 columns. I would be doing this on a unix machine.
#contents of large_file.csv
1,2,3,4,5,6,7,8
a,b,c,d,e,f,g,h
q,w,e,r,t,y,u,i
a,s,d,f,g,h,j,k
z,x,c,v,b,n,m,z
I now want multiple files like this:
# contents of 1.csv
1,2
a,b
q,w
a,s
z,x
# contents of 2.csv
1,3
a,c
q,e
a,d
z,c
# contents of 3.csv
1,4
a,d
q,r
a,f
z,v
and so on...
I can do this currently with awk on small files (say 30 columns) like this:
awk -F, 'BEGIN{OFS=",";} {for (i=1; i < NF; i++) print $1, $(i+1) > i ".csv"}' large_file.csv
The above takes a very long time with large files and I was wondering if there is a faster and more efficient way of doing the same.
Thanks in advance.
The main holdup here is in writing so many files. Here is one way:
use warnings;
use strict;
use feature 'say';
my $file = shift // die "Usage: $0 csv-file\n";

my @lines = do { local @ARGV = $file; <> };
chomp @lines;

my @fhs = map {
    open my $fh, '>', "f${_}.csv" or die $!;
    $fh
}
1 .. scalar( split /,/, $lines[0] ) - 1;    # one output file per column after the first

for (@lines) {
    my ($first, @cols) = split /,/;
    say {$fhs[$_]} join(',', $first, $cols[$_])
        for 0..$#cols;
}
I didn't time this against any other approaches. Assembling the data for each file first and then dumping it into each file in one operation may help, but first let us know how large the original CSV file is.
Opening so many output files at once (for the @fhs filehandles) may pose problems. If that is the case, the simplest way is to first assemble all the data, then open and write one file at a time:
use warnings;
use strict;
use feature 'say';
my $file = shift // die "Usage: $0 csv-file\n";

open my $fh, '<', $file or die "Can't open $file: $!";

my @data;
while (<$fh>) {
    chomp;
    my ($first, @cols) = split /,/;
    push @{$data[$_]}, join(',', $first, $cols[$_])
        for 0..$#cols;
}

for my $i (0..$#data) {
    open my $fh, '>', $i+1 . '.csv' or die $!;
    say $fh $_ for @{$data[$i]};
}
This depends on whether the entire original CSV file, plus a bit more, can be held in memory.
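A middle ground between the two is to make several passes over the input, handling only a limited batch of columns (and thus of open filehandles) per pass. Here is a minimal sketch of that idea; the batch size of 100 is an arbitrary assumption, to be tuned to the system's open-file limit:
use warnings;
use strict;
use feature 'say';

my $file  = shift // die "Usage: $0 csv-file\n";
my $batch = 100;    # columns (and open filehandles) per pass -- assumed safe limit

open my $fh, '<', $file or die "Can't open $file: $!";
chomp( my $first_line = <$fh> );
my $ncols = () = split /,/, $first_line;    # count the fields on the first line

for ( my $from = 1; $from < $ncols; $from += $batch ) {
    my $to = $from + $batch - 1;
    $to = $ncols - 1 if $to > $ncols - 1;

    # open only this batch of output files
    my %out;
    for my $i ( $from .. $to ) {
        open $out{$i}, '>', "$i.csv" or die "Can't open $i.csv: $!";
    }

    seek $fh, 0, 0;    # rewind: one full pass over the input per batch
    while (<$fh>) {
        chomp;
        my @cols = split /,/;
        say { $out{$_} } join ',', $cols[0], $cols[$_] for $from .. $to;
    }
    close $_ for values %out;
}
This trades extra reads of the input for bounded memory and a bounded number of open files.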
Please try the following awk code with your shown samples. Opening all the output files together may fail with the infamous "too many open files" error, so to avoid that this version collects all values into an array, prints them one by one in the END block, and closes each output file as soon as its contents have been written:
awk '
BEGIN{ FS=OFS="," }
{
  for(i=1;i<NF;i++){
    value[i]=(value[i]?value[i] ORS:"") ($1 OFS $(i+1))
  }
}
END{
  for(i=1;i<NF;i++){
    outFile=i".csv"
    print value[i] > (outFile)
    close(outFile)
  }
}
' large_file.csv
I needed the same functionality and wrote it in bash.
Not sure if it will be faster than ravindersingh13's answer, but I hope it will help someone.
The current version is here: https://github.com/pgrabarczyk/csv-file-splitter
#!/usr/bin/env bash
set -eu
SOURCE_CSV_PATH="${1}"
LINES_PER_FILE="${2}"
DEST_PREFIX_NAME="${3}"
DEBUG="${4:-0}"
split_files() {
local source_csv_path="${1}"
local lines_per_file="${2}"
local dest_prefix_name="${3}"
local debug="${4}"
_print_log "source_csv_path: ${source_csv_path}"
local dest_prefix_path="$(pwd)/output/${dest_prefix_name}"
_print_log "dest_prefix_path: ${dest_prefix_path}"
local headline=$(awk "NR==1" "${source_csv_path}")
local file_no=0
mkdir -p "$(dirname ${dest_prefix_path})"
local lines_in_files=$(wc -l "${source_csv_path}" | awk '{print $1}')
local files_to_create=$(( (lines_in_files - 1 + lines_per_file - 1) / lines_per_file ))
_print_log "There are ${lines_in_files} lines in the file. I will create ${files_to_create} files of ${lines_per_file} lines each (the last file may have fewer)."
_print_log "Start processing."
for (( start_line=1; start_line<=lines_in_files; )); do
last_line=$((start_line+lines_per_file))
file_no=$((file_no+1))
local file_path="${dest_prefix_path}$(printf "%06d" ${file_no}).csv"
if [ $debug -eq 1 ]; then
_print_log "Creating file ${file_path} with lines [${start_line};${last_line}]"
fi
echo "${headline}" > "${file_path}"
awk "NR>${start_line} && NR<=${last_line}" "${source_csv_path}" >> "${file_path}"
start_line=$last_line
done
_print_log "Done."
}
_print_log() {
local log_message="${1}"
local date_time=$(date "+%Y-%m-%d %H:%M:%S.%3N")
printf "%s - %s\n" "${date_time}" "${log_message}" >&2
}
split_files "${SOURCE_CSV_PATH}" "${LINES_PER_FILE}" "${DEST_PREFIX_NAME}" "${DEBUG}"
Execution:
bash csv-file-splitter.sh "sample.csv" 3 "result_" 1
I tried a solution using the Text::CSV module.
#! /usr/bin/env perl
use warnings;
use strict;
use utf8;
use open qw<:std :encoding(utf-8)>;
use autodie;
use feature qw<say>;
use Text::CSV;
my %hsh = ();
my $csv = Text::CSV->new({ sep_char => ',' });
print "Enter filename: ";
chomp(my $filename = <STDIN>);
open (my $ifile, '<', $filename);
while (<$ifile>) {
    chomp;
    if ($csv->parse($_)) {
        my @fields = $csv->fields();
        my $first = shift @fields;
        while (my ($i, $v) = each @fields) {
            push @{$hsh{($i + 1).".csv"}}, "$first,$v";
        }
    } else {
        die "Line could not be parsed: $_\n";
    }
}
close($ifile);

while (my ($k, $v) = each %hsh) {
    open(my $ofile, '>', $k);
    say {$ofile} $_ for @$v;
    close($ofile);
}
exit(0);
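Run the script and type the name of the large CSV file at the prompt; it writes 1.csv, 2.csv, and so on into the current directory. Note that each @array, used here to iterate over the fields with their indices, requires Perl 5.12 or later.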
This is what is required:
File1
rama
krishna
mahadev
bentick
william
with a million records
File2
hello how are you
rama is king of ayadhya
krishna is king of dwarka
mahadev is great lord
this is what is you go
with a million records
Output required
*strings matched-below
rama is king of ayadhya
krishna is king of dwarka
mahadev is great lord
***strings unmatched-below *****
-----------------------------
bentick
william
and so on
I tried this, but it did not work:
#!/usr/bin/perl -w
use strict;

if (scalar(@ARGV) != 2) {
    printf STDERR "Usage: fgrep.pl smallfile bigfile\n";
    exit(2);
}

my ($small_file, $big_file) = ($ARGV[0], $ARGV[1]);
my ($small_fp, $big_fp, %small_hash, $field);

open($small_fp, "<", $small_file) || die "Can't open $small_file: " . $!;
open($big_fp, "<", $big_file) || die "Can't open $big_file: " . $!;

# store contents of small file in a hash
while (<$small_fp>) {
    chomp;
    $small_hash{$_} = undef;
}
close($small_fp);

# loop through big file and find matches
while (<$big_fp>) {
    # no need for chomp
    $field = (split(/ /, $_))[1];
    if (defined($field) && exists($small_hash{$field})) {
        printf("%s", $_);
    }
}
close($big_fp);
exit(0);
Correcting the small mistakes you made:
use strict;
use warnings;

if (scalar(@ARGV) != 2) {
    printf STDERR "Usage: fgrep.pl smallfile bigfile\n";
    exit(2);
}

my ($small_file, $big_file) = ($ARGV[0], $ARGV[1]);
my ($small_fp, $big_fp, %small_hash, $field);

open($small_fp, "<", $small_file) || die "Can't open $small_file: " . $!;
open($big_fp, "<", $big_file) || die "Can't open $big_file: " . $!;

# store contents of small file in a hash
while (<$small_fp>) {
    s/\s+//g;          # strip all whitespace, including the newline
    next unless $_;    # skip blank lines
    $small_hash{$_} = undef;
}
close($small_fp);

# loop through big file and find matches
while (<$big_fp>) {
    # no need for chomp
    $field = (split(/ /, $_))[0];
    if (defined($field) && exists($small_hash{$field})) {
        printf("%s", $_);
        $small_hash{$field}++;    # remember that this word was matched
    }
}
close($big_fp);

print "\n ***unmatched Strings***\n";
foreach my $key (keys %small_hash) {
    print "$key\n"
        unless $small_hash{$key};
}
exit(0);
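With the sample files shown above, this prints the three matched lines from File2, then the unmatched header, then bentick and william (keys returns the unmatched words in no particular order).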
You left whitespace in the names.
The first word is (split(/ /, $_))[0], not (split(/ /, $_))[1].
You forgot to record which words were found, so the unmatched ones could be printed at the end.
I want to read a file1 that contains, for every word, a numeric representation, for example:
clinton 279
capital 553
fond|fonds 1410
I read a second file; every time I find a number, I replace it with the corresponding word. Below is an example of the second file:
279 695 696 152 - 574
553 95 74 96 - 444
1410 74 95 96 - 447
The problem in my code is that it executes the subroutine only one time, and it only shows:
279 clinton
Normally in this example it should show 3 words; when I add print $b; in the subroutine, it shows the different numbers.
#!/usr/bin/perl
use strict;
use warnings;
use autodie;

my @a, my $i, my $k;
my $j;
my $fich_in = "C:\\charm\\ats\\4.con";
my $fich_nom = "C:\\charm\\ats\\ats_dict.txt";

open(FICH1, "<$fich_in") || die "Problème d'ouverture : $!";
open my $fh, '<', $fich_nom;

# here I put the file into a table
while (<FICH1>) {
    my $ligne = $_;
    chomp $ligne;
    my @numb = split(/-/, $ligne);
    my $b = $numb[0];
    $k = $#uniq + 1;
    print "$b\n";
    my_handle($fh, $b);
}

sub my_handle {
    my ($handle, $b) = @_;
    my $content = '';
    #print "$b\n";
    ## or line wise
    while (my $line = <$handle>) {
        my @liste2 = split(/\s/, $line);
        if ($liste2[1] == $b) {
            $i = $liste2[0];
            print "$b $i";
        }
    }
    return $i;
}

close $fh;
close(FIC1);
Your subroutine appears to run only once because its first call reads $fh all the way to end-of-file, so every later call finds nothing left to read. The common approach to problems like this is to hash the "dictionary" first, then iterate over the second file and look up replacements in the hash table:
#!/usr/bin/perl
use warnings;
use strict;
my $fich_in  = '4.con';         # the file of numbers to translate
my $fich_nom = 'ats_dict.txt';  # the dictionary: a word, then its numeric code

open my $Dict, '<', $fich_nom or die "Problème d'ouverture $fich_nom : $!";
open my $In,   '<', $fich_in  or die "Problème d'ouverture $fich_in : $!";

my %to_word;
while (<$Dict>) {
    my ($word, $code) = split;
    $to_word{$code} = $word;
}

while (<$In>) {
    my ($number_string, $final_num) = split / - /;
    my @words = split ' ', $number_string;
    $words[0] = $to_word{ $words[0] } || $words[0];
    print "@words - $final_num";
}
I'm very new to Perl and am working on a Bioinformatics project at University. I have FILE1 containing a list of positions, in the format:
99269
550
100
126477
1700
And FILE2 in the format:
517 1878 forward
700 2500 forward
2156 3289 forward
99000 100000 forward
22000 23000 backward
I want to compare every position in FILE1 to every range in values on FILE2, and if a position falls into one of the ranges then I want to print the position, range and direction.
So my expected output would be:
99269 99000 100000 forward
550 517 1878 forward
1700 517 1878 forward
Currently it runs with no errors, but it doesn't output any information, so I am unsure where I am going wrong! When I split the final 'if' rule it runs, but it will only work if the position is on exactly the same line as the range.
My code is as follows:
#!/usr/bin/perl
use strict;
use warnings;

my $outputfile = "/Users/edwardtickle/Documents/CC22CDS.txt";

open FILE1, "/Users/edwardtickle/Documents/CC22positions.txt"
    or die "cannot open > CC22: $!";
open FILE2, "/Users/edwardtickle/Documents/CDSpositions.txt"
    or die "cannot open > CDS: $!";
open( OUTPUTFILE, ">$outputfile" ) or die "Could not open output file: $!\n";

while (<FILE1>) {
    if (/^(\d+)/) {
        my $CC22 = $1;
        while (<FILE2>) {
            if (/^(\d+)\s+(\d+)\s+(\S+)/) {
                my $CDS1 = $1;
                my $CDS2 = $2;
                my $CDS3 = $3;
                if ( $CC22 > $CDS1 && $CC22 < $CDS2 ) {
                    print OUTPUTFILE "$CC22 $CDS1 $CDS2 $CDS3\n";
                }
            }
        }
    }
}
close(FILE1);
close(FILE2);
I have posted the same question on Perlmonks.
Because FILE2 is read through only once, it is compared only with the first line of FILE1. Subsequent lines of FILE1 are compared against the already-exhausted filehandle, so nothing is ever printed.
Stash the lines from FILE1 in an array first, and then compare each line of FILE2 with each array entry, as shown below.
#!/usr/bin/perl
use strict;
use warnings;

my $outputfile = "out.txt";

open FILE1, "file1.txt"
    or die "cannot open > CC22: $!";
open FILE2, "file2.txt"
    or die "cannot open > CDS: $!";
open( OUTPUTFILE, ">$outputfile" ) or die "Could not open output file: $!\n";

my @file1list;
while (<FILE1>) {
    if (/^(\d+)/) {
        push @file1list, $1;
    }
}

while (<FILE2>) {
    if (/^(\d+)\s+(\d+)\s+(\S+)/) {
        my $CDS1 = $1;
        my $CDS2 = $2;
        my $CDS3 = $3;
        for my $CC22 (@file1list) {
            if ( $CC22 > $CDS1 && $CC22 < $CDS2 ) {
                print OUTPUTFILE "$CC22 $CDS1 $CDS2 $CDS3\n";
            }
        }
    }
}
(There are also stylistic issues with the program, like capital letters for variable names, but I've ignored those; it's quite a nice program for a beginner.)
I thought I could simplify some of that by using split instead of regex, but I think my code is actually longer and more difficult to read! In any event, remember that split works great for problems like this:
# User config area
my $positions_file = 'input_positions.txt';
my $ranges_file    = 'input_ranges.txt';
my $output_file    = 'output_data.txt';

# Reading data
open my $positions_fh, "<", $positions_file or die "Can't open $positions_file: $!";
open my $ranges_fh,    "<", $ranges_file    or die "Can't open $ranges_file: $!";

chomp( my @positions = <$positions_fh> );

# Store the range data in an array of hash tables,
# to be used like $range_data[0] = {start => $start, end => $end, dir => $dir}
my @range_data;
while (<$ranges_fh>) {
    chomp;
    my ( $start, $end, $dir ) = split;    # splits $_ according to whitespace
    push @range_data, { start => $start, end => $end, dir => $dir };
    #print "start: $start, end: $end, direction: $dir\n";
} #/while

close $positions_fh;
close $ranges_fh;

# Data processing:
open my $output_fh, ">", $output_file or die "Can't open $output_file: $!";

# It feels like it should be more efficient to process one range at a time for all data points
foreach my $range (@range_data) {          # one range at a time
    foreach my $position (@positions) {    # check all positions against it
        if ( ( $range->{start} <= $position ) and ( $position <= $range->{end} ) ) {
            my $output_string = "$position " . $range->{start} . " " . $range->{end} . " " . $range->{dir} . "\n";
            print $output_fh $output_string;
        } #/if
    } #/foreach position
} #/foreach range

close $output_fh;
This code would probably run faster if the data processing was done during the while loop that's reading the range data.
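For instance, here is a minimal sketch of that variant, under the same file-name assumptions as above and with @positions already read in as before:
open my $ranges_fh, "<", $ranges_file or die "Can't open $ranges_file: $!";
open my $output_fh, ">", $output_file or die "Can't open $output_file: $!";

# check every position against each range as the range is read,
# so @range_data never needs to be built
while (<$ranges_fh>) {
    chomp;
    my ( $start, $end, $dir ) = split;
    for my $position (@positions) {
        print $output_fh "$position $start $end $dir\n"
            if $start <= $position and $position <= $end;
    }
}

close $ranges_fh;
close $output_fh;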
Your bug arose because you nested the file reads, so your inner loop went through the inner file's contents a single time and was then stuck at EOF.
The easiest solution is to load the inner-loop file entirely into memory first.
The following demonstrates using more Modern Perl techniques:
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
my $cc22file = "/Users/edwardtickle/Documents/CC22positions.txt";
my $cdsfile = "/Users/edwardtickle/Documents/CDSpositions.txt";
my $outfile = "/Users/edwardtickle/Documents/CC22CDS.txt";
my @ranges = do {
    # open my $fh, '<', $cdsfile;  # Using Fake Data instead below
    open my $fh, '<', \ "517 1878 forward\n700 2500 forward\n2156 3289 forward\n99000 100000 forward\n22000 23000 backward\n";
    map {[split]} <$fh>;
};

# open my $infh, '<', $cc22file;  # Using Fake Data instead below
open my $infh, '<', \ "99269\n550\n100\n126477\n1700\n";

# open my $outfh, '>', $outfile;  # Using STDOUT instead below
my $outfh = \*STDOUT;

CC22:
while (my $cc22 = <$infh>) {
    chomp $cc22;
    for my $cds (@ranges) {
        if ($cc22 > $cds->[0] && $cc22 < $cds->[1]) {
            print $outfh "$cc22 @$cds\n";
            next CC22;
        }
    }
    # warn "$cc22 No match found\n";
}
Outputs:
99269 99000 100000 forward
550 517 1878 forward
1700 517 1878 forward
I have a binary file that contains 3 files: a PNG, a PHP and a TGA file.
Here is the file to give you the idea: container.bin
The file is built this way:
The first 6 bytes are a pointer to the index, in this case 211794.
Then you have all 3 files stacked one after the other.
At offset 211794 you have the index, which tells you where each file starts and ends.
In this example you have:
[offset start] [offset end] [random data] [offset start] [name]
6 15149 asdf 6 Capture.PNG
15149 15168 4584 15149 index.php
15168 211794 12 15168 untilted.tga
Meaning that Capture.PNG starts at offset 6 and finishes at offset 15149, asdf is random data, and the start offset is repeated.
Now what I want is a Perl script that separates this binary file into the files it contains.
The script needs to read the first 6 bytes of the file (the header), jump to the index location, and use the list there to extract the files.
A mix of seek and read can be used to achieve the task:
#!/usr/bin/env perl

use strict;
use warnings;

use Fcntl 'SEEK_SET';

sub get_files_info {
    my ( $fh, $offset ) = @_;
    my %file;

    while (<$fh>) {
        chomp;
        my $split_count = my ( $offset_start, $offset_end, $random_data, $offset_start_copy,
            $file_name ) = split /\s/;
        next if $split_count != 5;

        if ( $offset_start != $offset_start_copy ) {
            warn "Start of offset mismatch: $file_name\n";
            next;
        }

        $file{$file_name} = {
            'offset_start' => $offset_start,
            'offset_end'   => $offset_end,
            'random_data'  => $random_data,
        };
    }

    return %file;
}

sub write_file {
    my ( $fh, $file_name, $file_info ) = @_;

    # jump to the start of this file's data and read exactly its length
    seek $fh, $file_info->{'offset_start'}, SEEK_SET;
    read $fh, my $contents,
        $file_info->{'offset_end'} - $file_info->{'offset_start'};

    open my $fh_out, '>', $file_name or die "Error opening file: $!";
    binmode $fh_out;
    print $fh_out $contents;

    print "Wrote file: $file_name\n";
}

open my $fh, '<', 'container.bin' or die "Error opening file: $!";
binmode $fh;

read $fh, my $offset, 6;
seek $fh, $offset, SEEK_SET;

my %file = get_files_info $fh, $offset;

for my $file_name ( keys %file ) {
    write_file $fh, $file_name, $file{$file_name};
}
The only real difficulty here is to make sure that both the input and output files are handled in binary mode. This can be achieved with binmode, as above, or by using the :raw PerlIO layer when the files are opened, as in the program below.
This program seems to do what you want. It first locates and reads the index block into a string, and then opens that string for input and reads the start and end position and name of each of the constituent files. Thereafter processing each file is simple.
Be aware that unless the formatting of the index block is more strict than you say, you can rely only on the first, second, and last whitespace-separated fields on each line since random text could contain spaces. There is also no way to specify a file name containing spaces.
The output, using Data::Dump, is there to demonstrate correct functionality and is not necessary for the functioning of the program.
use v5.10;
use warnings;

use Fcntl ':seek';
use autodie qw/ open read seek close /;

open my $fh, '<:raw', 'container.bin';
read $fh, my $index_loc, 6;
seek $fh, $index_loc, SEEK_SET;
read $fh, my ($index), 1024;

my %contents;
open my $idx, '<', \$index;
while (<$idx>) {
    my @fields = split;
    next unless @fields;
    $contents{$fields[-1]} = [ $fields[0], $fields[1] ];
}

use Data::Dump;
dd \%contents;

for my $file (keys %contents) {
    my ($start, $end) = @{ $contents{$file} };
    my $size = $end - $start;
    seek $fh, $start, SEEK_SET;
    my $nbytes = read $fh, my ($data), $size;
    die "Premature EOF" unless $nbytes == $size;
    open my $out, '>:raw', $file;
    print { $out } $data;
    close $out;
}
Output:
{
"Capture.PNG" => [6, 15149],
"index.php" => [15149, 15168],
"untilted.tga" => [15168, 211794],
}