Split large csv file into multiple files based on column(s) - perl

I would like to know of a fast/efficient way in any program (awk/perl/python) to split a CSV file (say, 10k columns) into multiple small files, each containing 2 columns. I would be doing this on a Unix machine.
# contents of large_file.csv
1,2,3,4,5,6,7,8
a,b,c,d,e,f,g,h
q,w,e,r,t,y,u,i
a,s,d,f,g,h,j,k
z,x,c,v,b,n,m,z
I now want multiple files like this:
# contents of 1.csv
1,2
a,b
q,w
a,s
z,x
# contents of 2.csv
1,3
a,c
q,e
a,d
z,c
# contents of 3.csv
1,4
a,d
q,r
a,f
z,v
and so on...
I can do this currently with awk on small files (say 30 columns) like this:
awk -F, 'BEGIN{OFS=",";} {for (i=1; i < NF; i++) print $1, $(i+1) > i ".csv"}' large_file.csv
The above takes a very long time with large files and I was wondering if there is a faster and more efficient way of doing the same.
Thanks in advance.

The main holdup here is writing so many files.
Here is one way:
use warnings;
use strict;
use feature 'say';

my $file = shift // die "Usage: $0 csv-file\n";

my @lines = do { local @ARGV = $file; <> };
chomp @lines;

my @fhs = map {
    open my $fh, '>', "f${_}.csv" or die $!;
    $fh
}
1 .. scalar( split /,/, $lines[0] );

for (@lines) {
    my ($first, @cols) = split /,/;
    say {$fhs[$_]} join(',', $first, $cols[$_])
        for 0..$#cols;
}
I didn't time this against any other approaches. Assembling data for each file first and then dumping it in one operation into each file may help, but first let us know how large the original CSV file is.
Opening so many output files at once (for @fhs filehandles) may pose problems. If that is the case, then the simplest way is to first assemble all the data and then open and write one file at a time:
use warnings;
use strict;
use feature 'say';

my $file = shift // die "Usage: $0 csv-file\n";

open my $fh, '<', $file or die "Can't open $file: $!";

my @data;
while (<$fh>) {
    chomp;
    my ($first, @cols) = split /,/;
    push @{$data[$_]}, join(',', $first, $cols[$_])
        for 0..$#cols;
}

for my $i (0..$#data) {
    open my $fh, '>', $i+1 . '.csv' or die $!;
    say $fh $_ for @{$data[$i]};
}
This depends on whether the entire original CSV file, plus a bit more, can be held in memory.
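If neither variant fits (too many filehandles open at once for the first, too much memory for the second), a middle ground is to buffer lines per output file and flush them in append mode every so many input lines. A rough sketch of that idea, not benchmarked; the flush threshold of 10,000 lines is arbitrary:
use warnings;
use strict;

my $file = shift // die "Usage: $0 csv-file\n";
my $flush_every = 10_000;              # arbitrary; tune to the available memory

open my $in, '<', $file or die "Can't open $file: $!";

my (@buf, @created);
while (<$in>) {
    chomp;
    my ($first, @cols) = split /,/;
    push @{ $buf[$_] }, "$first,$cols[$_]" for 0 .. $#cols;

    flush_buffers(\@buf, \@created) if $. % $flush_every == 0;
}
flush_buffers(\@buf, \@created);       # whatever is left after the last full chunk

sub flush_buffers {
    my ($buf, $created) = @_;
    for my $i (0 .. $#$buf) {
        next unless @{ $buf->[$i] };
        my $name = ($i + 1) . '.csv';
        # truncate on the first flush of a file, append afterwards
        open my $out, ($created->[$i] ? '>>' : '>'), $name
            or die "Can't write $name: $!";
        print {$out} "$_\n" for @{ $buf->[$i] };
        close $out;
        $created->[$i] = 1;
        $buf->[$i]     = [];
    }
}
Each flush opens only one output file at a time, so memory use is bounded by the chunk size while the number of simultaneously open filehandles stays small.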

Please try the following awk code with your shown samples. Opening all the output files together may fail with the infamous "too many open files" error, so to avoid that this collects all values into an array and prints them one by one in the END block, closing each output file as soon as its contents have been written.
awk '
BEGIN{ FS=OFS="," }
{
  for(i=1;i<NF;i++){
    value[i]=(value[i]?value[i] ORS:"") ($1 OFS $(i+1))
  }
}
END{
  for(i=1;i<NF;i++){
    outFile=i".csv"
    print value[i] > (outFile)
    close(outFile)
  }
}
' large_file.csv

I needed the same functionality and wrote it in bash.
Not sure if it will be faster than RavinderSingh13's answer, but I hope it will help someone.
Actual version: https://github.com/pgrabarczyk/csv-file-splitter
#!/usr/bin/env bash
set -eu

SOURCE_CSV_PATH="${1}"
LINES_PER_FILE="${2}"
DEST_PREFIX_NAME="${3}"
DEBUG="${4:-0}"

split_files() {
  local source_csv_path="${1}"
  local lines_per_file="${2}"
  local dest_prefix_name="${3}"
  local debug="${4}"

  _print_log "source_csv_path: ${source_csv_path}"
  local dest_prefix_path="$(pwd)/output/${dest_prefix_name}"
  _print_log "dest_prefix_path: ${dest_prefix_path}"

  local headline=$(awk "NR==1" "${source_csv_path}")
  local file_no=0

  mkdir -p "$(dirname "${dest_prefix_path}")"

  local lines_in_files=$(wc -l "${source_csv_path}" | awk '{print $1}')
  local files_to_create=$(((lines_in_files-1)/lines_per_file))
  _print_log "There are ${lines_in_files} lines in the file. I will create ${files_to_create} files with ${lines_per_file} lines each (the last file may have fewer)."

  _print_log "Start processing."

  for (( start_line=1; start_line<=lines_in_files; )); do
    last_line=$((start_line+lines_per_file))
    file_no=$((file_no+1))
    local file_path="${dest_prefix_path}$(printf "%06d" ${file_no}).csv"
    if [ "${debug}" -eq 1 ]; then
      _print_log "Creating file ${file_path} with lines [${start_line};${last_line}]"
    fi
    echo "${headline}" > "${file_path}"
    awk "NR>${start_line} && NR<=${last_line}" "${source_csv_path}" >> "${file_path}"
    start_line=$last_line
  done

  _print_log "Done."
}

_print_log() {
  local log_message="${1}"
  local date_time=$(date "+%Y-%m-%d %H:%M:%S.%3N")
  printf "%s - %s\n" "${date_time}" "${log_message}" >&2
}

split_files "${SOURCE_CSV_PATH}" "${LINES_PER_FILE}" "${DEST_PREFIX_NAME}" "${DEBUG}"
Execution:
bash csv-file-splitter.sh "sample.csv" 3 "result_" 1

Tried a solution using the module Text::CSV.
#! /usr/bin/env perl
use warnings;
use strict;
use utf8;
use open qw<:std :encoding(utf-8)>;
use autodie;
use feature qw<say>;
use Text::CSV;

my %hsh = ();
my $csv = Text::CSV->new({ sep_char => ',' });

print "Enter filename: ";
chomp(my $filename = <STDIN>);

open (my $ifile, '<', $filename);
while (<$ifile>) {
    chomp;
    if ($csv->parse($_)) {
        my @fields = $csv->fields();
        my $first = shift @fields;
        while (my ($i, $v) = each @fields) {
            push @{$hsh{($i + 1).".csv"}}, "$first,$v";
        }
    } else {
        die "Line could not be parsed: $_\n";
    }
}
close($ifile);

while (my ($k, $v) = each %hsh) {
    open(my $ifile, '>', $k);
    say {$ifile} $_ for @$v;
    close($ifile);
}
exit(0);

Related

How do I properly find double entries in two files in Perl?

Let's say I have two files with lists of ip-addresses. Lines in the first file are unique. Lines in the second may or may not be the same as in the first one.
What I need is to compare two files, and remove possible doubles from the second file in order to merge it with the base file later.
I've managed to write the following code and it seems to work properly, but I have a solid feeling that this code can be improved or I may be totally missing some important concept.
Are there any ways to solve the task without using complex data structures, i.e. hashrefs?
#!/usr/bin/perl
use strict;
use warnings;

my $base  = shift @ARGV;
my $input = shift @ARGV;
my $res   = 'result.txt';

open ("BASE", "<", "$base");
open ("INP",  "<", "$input");
open ("RES",  ">", "$res");

my $rhash = {}; # result hash

# create uniq table
while (my $line = <BASE>) { chomp($line); $rhash->{$line}{'res'} = 1; }

# create compare table, marking its entries as new and incrementing double keys
while (my $line = <INP>)  { chomp($line); $rhash->{$line}{'res'}++; $rhash->{$line}{'new'} = 1; }

close BASE;
close INP;

for my $line (sort keys %$rhash) {
    next if $line =~ /\#/; # removing comments
    # kinda diagnostic output of doubles
    printf("%-30s%3s%1s", $line, $rhash->{$line}{'res'}, "\n") if $rhash->{$line}{'res'} > 1;
    if (($rhash->{$line}{'new'}) and ($rhash->{$line}{'res'} < 2)) {
        print RES "$line\n"; # printing new uniq entries to result file
    }
}
close RES;
If I understand correctly, file1 and file2 each contain IPs (unique within each file), and you want the IPs that are in file2 but not in file1. If so, then maybe the following code achieves your goal.
Although it seems your code will do it, this might be clearer.
#!/usr/bin/perl
use strict;
use warnings;

my $base  = shift @ARGV;
my $input = shift @ARGV;
my $res   = 'result.txt';

open ("BASE", "<", "$base")  or die $!;
open ("INP",  "<", "$input") or die $!;
open ("RES",  ">", "$res")   or die $!;

my %seen;
while (my $line = <BASE>) {
    chomp $line;
    $seen{$line}++;
}
close BASE or die $!;

while (my $line = <INP>) {
    chomp $line;
    print RES "$line\n" unless $seen{$line}; # only in file2, not in file1
}
close INP or die $!;
close RES or die $!;
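For this particular task (lines of the second file that are not in the first), the same hash idea also fits in a one-liner. A sketch that relies on @ARGV still holding the second filename while the first file is being read; base.txt and input.txt are placeholder names, and the comment-skipping from your script is left out:
perl -ne 'if (@ARGV) { $seen{$_} = 1 } elsif (!$seen{$_}) { print }' base.txt input.txt > result.txt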

"Out of memory" error from Perl when processing files

I'm using Perl with Mojo::DOM to process a large batch of text files. I need to count the occurrences of all the words that end with certain suffixes.
Running this code keeps returning out of memory error messages for batches of over, say, 40 files.
Is there any way to accomplish this task more efficiently (less memory usage) than what I'm doing below?
#!/software/perl512/bin/perl
use strict;
use warnings;
use autodie;
use Mojo::DOM;

my $path = "/data/10K/2012";
chdir($path) or die "Can't chdir to $path $!";

# This program counts the total number of suffixes of a form in a given document.
my @sequence;
my %sequences;
my $file;
my $fh;
my @output;

# Reading in the data.
for my $file (<*.txt>) {
    my %affixes;
    my %word_count;
    my $data = do {
        open my $fh, '<', $file;
        local $/; # Slurp mode
        <$fh>;
    };
    my $dom  = Mojo::DOM->new($data);
    my $text = $dom->all_text();
    for (split /\s+/, $text) {
        if ($_ =~ /[a-zA-Z]+(ness|ship|dom|ance|ence|age|cy|tion|hood|ism|ment|ure|tude|ery|ity|ial)\b/ ) {
            ++$affixes{"affix_count"};
        }
        ++$word_count{"word_count"};
    }
    my $output = join ",", $file, $affixes{"affix_count"}, $word_count{"word_count"};
    push @output, ($output);
}
@output = sort @output;

open(my $fh3, '>', '/home/usr16/rcazier/PerlCode/affix_count.txt');
foreach (@output) {
    print $fh3 "$_\n";
}
close $fh3;
This is as near as I can get to a solution. It incorporates all the points that have been made in the comments, and solves the "Out of memory" error by leaving any HTML tags intact. It also leaves the result unsorted as the original code doesn't really do any useful sorting.
Because of the way you are looking for suffixed words, I think it's very unlikely that leaving HTML tags in your text files will pervert your results significantly.
#!/software/perl512/bin/perl
use strict;
use warnings;
use 5.010;
use autodie;

# Build and compile a regex that will match any of the suffixes that interest
# us, for later use in testing each "word" in the input file
#
my $suffix_re = do {
    my @suffixes = qw/ ness ship dom ance ence age cy tion hood ism ment ure tude ery ity ial /;
    my $alternation = join '|', @suffixes;
    qr/ (?: $alternation ) /xi;
};

# Set the directory that we want to examine. `autodie` will check the success
# of `chdir` for us
#
my $path = '/data/10K/2012';
chdir $path;

# Process every file with a `txt` file type
#
for my $filename ( grep -f, glob('*.txt') ) {
    warn qq{Processing "$filename"\n};
    open my ($fh), '<', $filename;
    my ($suffixes, $word_count) = (0, 0);
    while (<$fh>) {
        for (split) {
            ++$word_count;
            ++$suffixes if /\A[a-z]+$suffix_re\z/i;
        }
    }
    say join ',', $filename, $suffixes, $word_count if $suffixes;
}
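If you do want the counts gathered, sorted, and written to a single output file as in your original script, the same loop can collect one line per file and write everything at the end. A sketch of that variation, reusing the regex built above and keeping the output path from the question:
#!/software/perl512/bin/perl
use strict;
use warnings;
use 5.010;
use autodie;

# Same suffix regex and working directory as in the script above
my $suffix_re = do {
    my @suffixes = qw/ ness ship dom ance ence age cy tion hood ism ment ure tude ery ity ial /;
    my $alternation = join '|', @suffixes;
    qr/ (?: $alternation ) /xi;
};
chdir '/data/10K/2012';

my @output;
for my $filename ( grep -f, glob('*.txt') ) {
    open my ($fh), '<', $filename;
    my ($suffixes, $word_count) = (0, 0);
    while (<$fh>) {
        for (split) {
            ++$word_count;
            ++$suffixes if /\A[a-z]+$suffix_re\z/i;
        }
    }
    # collect instead of printing immediately
    push @output, join ',', $filename, $suffixes, $word_count;
}

# one write at the end, to the file used in the question
open my $out, '>', '/home/usr16/rcazier/PerlCode/affix_count.txt';
say {$out} $_ for sort @output;
close $out;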

Compare two files and write the lines from second file to first file

I have two files named test1 and test2. The contents of the files are as follows.
test1
nijin:qwe123
nijintest:qwerty
nijintest2:abcsdef
nijin2:qwehj
test2
nijin:qwe
nijintest2:abc
I have to change the values of nijin and nijintest2 in test1 to match those in test2, leaving all other values alone. I have tried all possible Perl replace commands without any success. Any help will be appreciated.
Edit
I have tried many open/close file functions to replace the entry, but none of them gives the required output. I have tried everything in "In Perl, how do I change, delete, or insert a line in a file, or append to the beginning of a file?", but with no luck.
This works, though it could probably still be compressed:
#!/usr/bin/env perl
use strict;
use warnings;

die "Usage: $0 map [file ...]\n" unless scalar(@ARGV) >= 1;

my %mapping;
open my $fh, "<", $ARGV[0] or die "Failed to open $ARGV[0] for reading";
while (<$fh>)
{
    my($key, $value) = ($_ =~ m/^([^:]*):(.*)/);
    $mapping{$key} = "$value\n";
}
close $fh;
shift;

while (<>)
{
    my($key) = ($_ =~ m/^([^:]*):/);
    $_ = "$key:$mapping{$key}" if (defined $mapping{$key});
    print;
}
If it is called sub.pl, you can run:
perl sub.pl test2 test1
perl sub.pl test2 <test1
perl sub.pl test2 test1 test3 test4
For the first two invocations, the output is:
nijin:qwe
nijintest:qwerty
nijintest2:abc
nijin2:qwehj
The following should work for importing any number of new files. I also included code for appending new entries, since you didn't specify how you wanted those handled.
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;

die "Usage: $0 dbfile [mapfiles ...]\n" if @ARGV < 2;

my $db = shift;
my %mapping = map {chomp; /([^:]*)/; $1 => $_} <>;

local @ARGV = $db;
local $^I = '.bak';
while (<>) {
    chomp;
    /([^:]*)/;
    print delete $mapping{$1} // $_, "\n";
}
#unlink "$db$^I"; # Uncomment if you want to delete backup

# Append new entries
open my $fh, '>>', $db;
$fh->print($_, "\n") for values %mapping;
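The key trick above is Perl's in-place edit: with @ARGV localized to the db file and $^I set, the while (<>) loop rewrites that file in place (keeping a .bak copy), so whatever gets printed replaces the line that was just read. A stripped-down illustration of the idiom, where notes.txt is just a hypothetical example file:
#!/usr/bin/env perl
use strict;
use warnings;

{
    local @ARGV = ('notes.txt');   # file(s) the <> loop will read and rewrite
    local $^I   = '.bak';          # turn on in-place editing, keep a backup
    while (<>) {
        print uc;                  # printed text replaces the original line
    }
}
Running it turns every line of notes.txt into upper case and leaves the original content in notes.txt.bak.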

Perl find and replace multiple(huge) strings in one shot

Based on a mapping file, I need to search for a string and, if it is found, append the replacement string to the end of the line.
I'm traversing the mapping file line by line and using the Perl one-liner below to append the strings.
Issues:
1. Huge number of find & replace entries: The mapping file has a huge number of entries (~7000), and the Perl one-liner takes ~1 second per entry, which boils down to ~1 hour to complete the entire replacement.
2. Not a simple find and replace: It is not a simple find & replace; if the string is found, the replacement string is appended to the end of the line.
If there is no efficient way to process this, I would even consider replacing rather than appending.
Mine is a Windows 7 64-bit environment and I'm using ActivePerl. No *nix support.
File Samples
Map.csv
findStr1,RplStr1
findStr2,RplStr2
findStr3,RplStr3
.....
findStr7000,RplStr7000
input.csv
col1,col2,col3,findStr1,....col-N
col1,col2,col3,findStr2,....col-N
col1,col2,col3,FIND-STR-NOT-EXIST,....col-N
output.csv (Expected Output)
col1,col2,col3,findStr1,....col-N,**RplStr1**
col1,col2,col3,findStr2,....col-N,**RplStr2**
col1,col2,col3,FIND-STR-NOT-EXIST,....col-N
Perl Code Snippet
One-Liner
perl -pe '/findStr/ && s/$/RplStr/' file.csv
open( INFILE, $MarketMapFile ) or die "Error occurred: $!";
my @data = <INFILE>;

my $cnt = 1;
foreach my $line (@data) {
    eval {
        # Remove end of line character.
        $line =~ s/\n//g;
        my ( $eNodeBID, $MarketName ) = split( ',', $line );
        my $exeCmd = 'perl -i.bak -p -e "/'.$eNodeBID.'\(M\)/ && s/$/,'.$MarketName.'/;" '.$CSVFile;
        print "\n $cnt Replacing $eNodeBID with $MarketName and cmd is $exeCmd";
        system($exeCmd);
        $cnt++;
    }
}
close(INFILE);
To do this in a single pass through your input CSV, it's easiest to store your mapping in a hash. 7000 entries is not particularly huge, but if you're worried about storing all of that in memory you can use Tie::File::AsHash.
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;
use Tie::File::AsHash;

tie my %replace, 'Tie::File::AsHash', 'map.csv', split => ',' or die $!;

my $csv = Text::CSV->new({ binary => 1, auto_diag => 1, eol => $/ })
    or die Text::CSV->error_diag;

open my $in_fh,  '<', 'input.csv'  or die $!;
open my $out_fh, '>', 'output.csv' or die $!;

while (my $row = $csv->getline($in_fh)) {
    push @$row, $replace{$row->[3]};
    $csv->print($out_fh, $row);
}

untie %replace;
close $in_fh;
close $out_fh;
map.csv
foo,bar
apple,orange
pony,unicorn
input.csv
field1,field2,field3,pony,field5,field6
field1,field2,field3,banana,field5,field6
field1,field2,field3,apple,field5,field6
output.csv
field1,field2,field3,pony,field5,field6,unicorn
field1,field2,field3,banana,field5,field6,
field1,field2,field3,apple,field5,field6,orange
I don't recommend screwing up your CSV format by only appending fields to matching lines, so I add an empty field if a match isn't found.
To use a regular hash instead of Tie::File::AsHash, simply replace the tie statement with
open my $map_fh, '<', 'map.csv' or die $!;
my %replace = map { chomp; split /,/ } <$map_fh>;
close $map_fh;
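If Text::CSV is not available and the data is known to contain no quoted fields or embedded commas, the whole single pass can also be done with plain split. A simplified sketch that only holds under that assumption; as above, an empty field is appended when there is no match so every output line keeps the same number of columns:
#!/usr/bin/perl
use strict;
use warnings;

# Load the find-string => replace-string pairs
open my $map_fh, '<', 'map.csv' or die $!;
my %replace = map { chomp; split /,/, $_, 2 } <$map_fh>;
close $map_fh;

open my $in_fh,  '<', 'input.csv'  or die $!;
open my $out_fh, '>', 'output.csv' or die $!;

while (my $line = <$in_fh>) {
    chomp $line;
    my @fields = split /,/, $line, -1;          # -1 keeps trailing empty fields
    push @fields, $replace{ $fields[3] } // ''; # empty field when there is no match
    print {$out_fh} join(',', @fields), "\n";
}

close $in_fh;
close $out_fh;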
This is untested code / pseudo-Perl; you'll need to polish it (strict, warnings, etc.):
# load the search and replace strings into memory
open($mapfh, "<", "mapfile");
%maplines = ();
while ( $mapline = <$mapfh> ) {
    chomp $mapline;
    ($findstr, $replstr) = split(/,/, $mapline);
    $maplines{$findstr} = $replstr;
}
close $mapfh;

open($ifh, "<", "inputfile");
while ($inputline = <$ifh>) {              # read an input line
    chomp $inputline;
    @input = split(/,/, $inputline);       # split it into a list
    if (exists $maplines{$input[3]}) {     # does this line match
        push @input, $maplines{$input[3]}; # add the replace str to the end
    }
    print join(',', @input), "\n";         # or print to an output file
}
close($ifh);

How can I print specific lines from a file in Unix?

I want to print certain lines from a text file in Unix. The line numbers to be printed are listed in another text file (one on each line).
Is there a quick way to do this with Perl or a shell script?
Assuming the line numbers to be printed are sorted.
open my $fh, '<', 'line_numbers' or die $!;
my @ln = <$fh>;

open my $tx, '<', 'text_file' or die $!;
foreach my $ln (@ln) {
    my $line;
    do {
        $line = <$tx>;
    } until $. == $ln and defined $line;
    print $line if defined $line;
}
$ cat numbers
1
4
6
$ cat file
one
two
three
four
five
six
seven
$ awk 'FNR==NR{num[$1];next}(FNR in num)' numbers file
one
four
six
You can avoid the limitations of some of the other answers (requirements for sorted lines) simply by using eof within the context of a basic while(<>) block. That will tell you when you've stopped reading line numbers and started reading data. Note that you need to reset $. when the switch occurs.
# Usage: perl script.pl LINE_NUMS_FILE DATA_FILE
use strict;
use warnings;

my %keep;
my $reading_line_nums = 1;

while (<>){
    if ($reading_line_nums){
        chomp;
        $keep{$_} = 1;
        $reading_line_nums = $. = 0 if eof;
    }
    else {
        print if exists $keep{$.};
    }
}
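Reusing the numbers and file samples from the awk answer above, and assuming the script is saved as script.pl, it produces the same output:
$ perl script.pl numbers file
one
four
six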
cat -n foo | join foo2 - | cut -d" " -f2-
where foo is your file with lines to print and foo2 is your file of line numbers. Note that join expects both inputs to be sorted on the join field, so once the line numbers go past 9 both may need to be sorted lexically first.
Here is a way to do this in Perl without slurping anything so that the memory footprint of the program is independent of the sizes of both files (it does assume that the line numbers to be printed are sorted):
#!/usr/bin/perl
use strict; use warnings;
use autodie;

@ARGV == 2
    or die "Supply src_file and filter_file as arguments\n";

my ($src_file, $filter_file) = @ARGV;

open my $src_h,    '<', $src_file;
open my $filter_h, '<', $filter_file;

my $to_print = <$filter_h>;

while ( my $src_line = <$src_h> ) {
    last unless defined $to_print;
    if ( $. == $to_print ) {
        print $src_line;
        $to_print = <$filter_h>;
    }
}

close $filter_h;
close $src_h;
Generate the source file:
C:\> perl -le "print for aa .. zz" > src
Generate the filter file:
C:\> perl -le "print for grep { rand > 0.75 } 1 .. 52" > filter
C:\> cat filter
4
6
10
12
13
19
23
24
28
44
49
50
Output:
C:\> f src filter
ad
af
aj
al
am
as
aw
ax
bb
br
bw
bx
To deal with an unsorted filter file, you can modify the while loop:
while ( my $src_line = <$src_h> ) {
    last unless defined $to_print;
    if ( $. > $to_print ) {
        seek $src_h, 0, 0;
        $. = 0;
    }
    if ( $. == $to_print ) {
        print $src_line;
        $to_print = <$filter_h>;
    }
}
This would waste a lot of time if the contents of the filter file are fairly random because it would keep rewinding to the beginning of the source file. In that case, I would recommend using Tie::File.
I wouldn't do it this way with large files, but (untested):
open(my $fh1, "<", "line_number_file.txt") or die "Err: $!";
chomp(my @line_numbers = <$fh1>);
$_-- for @line_numbers;
close $fh1;

open(my $fh2, "<", "text_file.txt") or die "Err: $!";
my @lines = <$fh2>;
print @lines[@line_numbers];
close $fh2;
I'd do it like this:
#!/bin/bash
numbersfile=numbers
datafile=data

while read lineno; do
    sed -n "${lineno}p" "$datafile"
done < "$numbersfile"
The downside to my approach is that it spawns a process per line number, so it will be slower than the other options. It's infinitely more readable, though.
This is a short solution using bash and sed:
sed -n -e "$(cat num |sed 's/$/p/')" file
where num is the file of numbers and file is the input file (tested on OS X Snow Leopard). The inner sed appends a p to each line number, turning num into a small sed script (1p, 3p, 5p) that prints exactly those lines.
$ cat num
1
3
5
$ cat file
Line One
Line Two
Line Three
Line Four
Line Five
$ sed -n -e "$(cat num |sed 's/$/p/')" file
Line One
Line Three
Line Five
$ cat input
every
good
bird
does
fly
$ cat lines
2
4
$ perl -ne 'BEGIN{($a,$b) = `cat lines`} print if $.==$a .. $.==$b' input
good
bird
does
If that's too much for a one-liner, use
#! /usr/bin/perl
use warnings;
use strict;

sub start_stop {
    my($path) = @_;
    open my $fh, "<", $path
        or die "$0: open $path: $!";
    local $/;
    return ($1,$2) if <$fh> =~ /\s*(\d+)\s*(\d+)/;
    die "$0: $path: could not find start and stop line numbers";
}

my($start,$stop) = start_stop "lines";

while (<>) {
    print if $. == $start .. $. == $stop;
}
Perl's magic open allows for creative possibilities such as
$ ./lines-between 'tac lines-between|'
which reads the script itself in reverse through a pipe and prints the lines between the start and stop numbers:
    print if $. == $start .. $. == $stop;
while (<>) {
Here is a way to do this using Tie::File:
#!/usr/bin/perl
use strict; use warnings;
use autodie;
use Tie::File;

@ARGV == 2
    or die "Supply src_file and filter_file as arguments\n";

my ($src_file, $filter_file) = @ARGV;

tie my @source, 'Tie::File', $src_file, autochomp => 0
    or die "Cannot tie source '$src_file': $!";

open my $filter_h, '<', $filter_file;

while ( my $to_print = <$filter_h> ) {
    print $source[$to_print - 1];
}

close $filter_h;
untie @source;