How to append lines to csv files with perl - perl

I have two dozen .csv files, each about a thousand lines long, they've been created with Tie::Array::CSV. Now I want to append a line or two to each of them every day, what's the most efficient way to do that?
I suppose I could read each file into an array, add my data, and write array back to csv again but this means I risk losing all data if something goes wrong during overwrite, and if I create new files I need to figure out some system to automatically manage all those copies of copies that keep piling up every day.
If there's no module to append lines, how to do it manually with all the conventions and escape characters needed for proper csv so that they can be read back into perl without problems?
...
Thanks for replies, I don't have enough reputation to add comments to answers directly so I eidted the original question.
I don't worry about losing my data too much, I can rebuild it from scratch, it would just take time and manual intervention but not enough to warrant running a daily backup system, thought it's always a point to consider. Appending lines should be easier on processing time, too, as writing .csv to disk is relatively slow on my machines.
Opening files in append mode is what I didn't know. WIth solutions involging$csv->print($fh, $row); should I be running this through a loop to add more than one line?
I've got another solution proposed on Perlmonks
use Tie::Array::CSV;
my $filename = 'tied.csv';
tie my #file, 'Tie::Array::CSV', $filename;
push(#file,[4,5,6]);
untie #file;
Would that work better? I don't have the opportunity to test it right now.
EDIT. Push() solution above worked magic. Consider this closed.

Use Text::CSV:
use Text::CSV;
my $csv = Text::CSV->new({ binary => 1, auto_diag => 1, eol => "\n"})
or die "Cannot use CSV: " . Text::CSV->error_diag();
# open in append mode
open my $fh, ">>", "file.csv" or die "Failed to open file: $!";
$csv->print($fh, [ "foo", "bar", "foo,bar" ]);
close $fh;
This appends a single line to file.csv using \n as the EOL character, comma as the field separator (this is the default), and quotes around fields containing commas:
foo,bar,"foo,bar"

Related

How to pipe to and read from the same tempfile handle without race conditions?

Was debugging a perl script for the first time in my life and came over this:
$my_temp_file = File::Temp->tmpnam();
system("cmd $blah | cmd2 > $my_temp_file");
open(FIL, "$my_temp_file");
...
unlink $my_temp_file;
This works pretty much like I want, except the obvious race conditions in lines 1-3. Even if using proper tempfile() there is no way (I can think of) to ensure that the file streamed to at line 2 is the same opened at line 3. One solution might be pipes, but the errors during cmd might occur late because of limited pipe buffering, and that would complicate my error handling (I think).
How do I:
Write all output from cmd $blah | cmd2 into a tempfile opened file handle?
Read the output without re-opening the file (risking race condition)?
You can open a pipe to a command and read its contents directly with no intermediate file:
open my $fh, '-|', 'cmd', $blah;
while( <$fh> ) {
...
}
With short output, backticks might do the job, although in this case you have to be more careful to scrub the inputs so they aren't misinterpreted by the shell:
my $output = `cmd $blah`;
There are various modules on CPAN that handle this sort of thing, too.
Some comments on temporary files
The comments mentioned race conditions, so I thought I'd write a few things for those wondering what people are talking about.
In the original code, Andreas uses File::Temp, a module from the Perl Standard Library. However, they use the tmpnam POSIX-like call, which has this caveat in the docs:
Implementations of mktemp(), tmpnam(), and tempnam() are provided, but should be used with caution since they return only a filename that was valid when function was called, so cannot guarantee that the file will not exist by the time the caller opens the filename.
This is discouraged and was removed for Perl v5.22's POSIX.
That is, you get back the name of a file that does not exist yet. After you get the name, you don't know if that filename was made by another program. And, that unlink later can cause problems for one of the programs.
The "race condition" comes in when two programs that probably don't know about each other try to do the same thing as roughly the same time. Your program tries to make a temporary file named "foo", and so does some other program. They both might see at the same time that a file named "foo" does not exist, then try to create it. They both might succeed, and as they both write to it, they might interleave or overwrite the other's output. Then, one of those programs think it is done and calls unlink. Now the other program wonders what happened.
In the malicious exploit case, some bad actor knows a temporary file will show up, so it recognizes a new file and gets in there to read or write data.
But this can also happen within the same program. Two or more versions of the same program run at the same time and try to do the same thing. With randomized filenames, it is probably exceedingly rare that two running programs will choose the same name at the same time. However, we don't care how rare something is; we care how devastating the consequences are should it happen. And, rare is much more frequent than never.
File::Temp
Knowing all that, File::Temp handles the details of ensuring that you get a filehandle:
my( $fh, $name ) = File::Temp->tempfile;
This uses a default template to create the name. When the filehandle goes out of scope, File::Temp also cleans up the mess.
{
my( $fh, $name ) = File::Temp->tempfile;
print $fh ...;
...;
} # file cleaned up
Some systems might automatically clean up temp files, although I haven't care about that in years. Typically is was a batch thing (say once a week).
I often go one step further by giving my temporary filenames a template, where the Xs are literal characters the module recognizes and fills in with randomized characters:
my( $name, $fh ) = File::Temp->tempfile(
sprintf "$0-%d-XXXXXX", time );
I'm often doing this while I'm developing things so I can watch the program make the files (and in which order) and see what's in them. In production I probably want to obscure the source program name ($0) and the time; I don't want to make it easier to guess who's making which file.
A scratchpad
I can also open a temporary file with open by not giving it a filename. This is useful when you want to collect outside the program. Opening it read-write means you can output some stuff then move around that file (we show a fixed-length record example in Learning Perl):
open(my $tmp, "+>", undef) or die ...
print $tmp "Some stuff\n";
seek $tmp, 0, 0;
my $line = <$tmp>;
File::Temp opens the temp file in O_RDWR mode so all you have to do is use that one file handle for both reading and writing, even from external programs. The returned file handle is overloaded so that it stringifies to the temp file name so you can pass that to the external program. If that is dangerous for your purpose you can get the fileno() and redirect to /dev/fd/<fileno> instead.
All you have to do is mind your seeks and tells. :-) Just remember to always set autoflush!
use File::Temp;
use Data::Dump;
$fh = File::Temp->new;
$fh->autoflush;
system "ls /tmp/*.txt >> $fh" and die $!;
#lines = <$fh>;
printf "%s\n\n", Data::Dump::pp(\#lines);
print $fh "How now brown cow\n";
seek $fh, 0, 0 or die $!;
#lines2 = <$fh>;
printf "%s\n", Data::Dump::pp(\#lines2);
Which prints
[
"/tmp/cpan_htmlconvert_DPzx.txt\n",
"/tmp/cpan_htmlconvert_DunL.txt\n",
"/tmp/cpan_install_HfUe.txt\n",
"/tmp/cpan_install_XbD6.txt\n",
"/tmp/cpan_install_yzs9.txt\n",
]
[
"/tmp/cpan_htmlconvert_DPzx.txt\n",
"/tmp/cpan_htmlconvert_DunL.txt\n",
"/tmp/cpan_install_HfUe.txt\n",
"/tmp/cpan_install_XbD6.txt\n",
"/tmp/cpan_install_yzs9.txt\n",
"How now brown cow\n",
]
HTH

Splitting one txt file into multiple txt files based on delimiter and naming them with a specific character

I have a text file that looks like this http://www.uniprot.org/uniprot/?sort=score&desc=&compress=no&query=id:P01375%20OR%20id:P04626%20OR%20id:P08238%20OR%20id:P06213&format=txt.
This file is contained of different entries that are divided with //. I think I have almost found the way how to divide txt file into multiple txt files whenever this specific pattern appears, but I still don't know how to name them after dividing and how to print them in specific directory. I would like that each file that is divided carries specific ID which is a first line nut second column in each entry.
This is the code that I have wrote so far:
mkdir "spliced_files"; #directory where I would like to put all my splitted files
$/="//\n"; # divide them whenever //appears and new line after this
open (IDS, 'example.txt') or die "Cannot open"; #example.txt is an input file
my #ids = <IDS>;
close IDS;
my $entry = 25444; #number of entries or //\n characters
my $i=0;
while ($i eq $entry) {
print $ids[$i];
};
$i++;
I am still having problem with finding how to split all entries from 'example.txt' file whenever "//\n" and to print all this seperated files into directory spliced_files. In addition I would have to name all of these seperated files with the ID that is specific for each of these files or entries (which appears in the first row, but only a second column).
So I expect output to be number of files in spliced_files directory, and each of them are named with their ID (first row, but only second column). For example name of the first file wiould be TNFA_HUMAN, od the second would be ERBB2_HUMAN and so on..)
You still look like you're programming by guesswork. And you haven't made use of any of the advice you have been given in answers to your previous questions. I strongly recommend that you spend a week working through a good beginners book like Learning Perl and come back when you understand more about how Perl works.
But here are some comments on your new code:
open (IDS, 'example.txt') or die "Cannot open";
You have been told that using lexical variables and the three-arg version of open() is a better approach here. You should also include $! in your error message, so you know what has gone wrong.
open my $ids_fh, '<', 'example.txt'
or die "Cannot open: $!";
Then later on (I added the indentation in the while loop to make things clearer)...
my $i=0;
while ($i eq $entry) {
print $ids[$i];
};
$i++;
The first time you enter this loop, $i is 1 and $entry is 25444. You compare them (as strings! You probably want ==, not eq) to see if they are equal. Clearly they are different, so your while loop exits. Once the loop exits, you increment $i.
This code bears no relation at all to the description of your problem. I'm not going to give you the answer, but here is the structure of what you need to do:
mkdir "spliced_files";
local $/ = "//\n"; # Always localise changes to special vars
open my $ids_fh, '<', 'example.txt'
or die "Cannot open: $!";
# No need to read the whole file in one go.
# Process it a line at a time.
while (<$ids_fh>) {
# Your record (the whole thing, not just the first line) is in $_.
# You need to extract the ID value from that string. Let's assume
# you've stored in it $id
# Open a file with the right name
open my $out_fh, '>', "spliced_files/$id" or die $!;
# Print the record to the new file.
print $out_fh $_;
}
But really, you need to take the time to learn about programming before you attack this task. Or, if you don't have the time for that, pay a programmer to do it for you.

Perl replacement operator doesn't work under Windows when patterns contain slashes

I want to replace a string with a path:
my $somedir = "D:/somedir/someotherdir";
system("perl -pi.bak -e \"s{STRING_TO_BE_REPLACED}{$somedir}\" $file");
but under Windows it replaces string with random symbols instead of slashes.
What's the problem?
I think it's got something to do with a syntax detail needed on Windows, but can't test now.
However, as you are in a Perl script, why go out with system and run another Perl interpreter? It is far more complex and inefficient since it involves a syscall or a shell, and starts another program. Also, it is far harder to get it right -- you need to deal with syntax details, quoting and escaping, for system, your system's command interpreter, the other instance of Perl, and the regex.
The code below reads the whole file into an array first, which is fine if files aren't huge. In general it is better to process a file line by line, and how to do what you need in that way is discussed in detail in a perlfaq5 page. See the comment at the end, with the link.
use warnings 'all';
use strict;
# your code ...
open my $fh, '<', $file or die "Can't open $file: $!";
my #lines = <$fh>;
# Change #lines in-place. See the comment
s/STRING_TO_BE_REPLACED/$somedir/ for #lines;
open $fh, '>', $file or die "Can't open $file for write: $!";
print $fh #lines;
close $fh;
When we open the $fh the second time it is closed and re-opened, so there is no need for an explicit close. When an existing file is opened for writing ('>') it is clobbered, so this replaces it.
It's more to write but it is better.
Comment on the in-place change to #lines This uses the fact that when iterating over an array if we change the index variable, here $_, the change is made in the original element. The index variable is like an alias for the array element. It says in perlsyn
If any element of LIST is an lvalue, you can modify it by modifying VAR inside the loop. Conversely, if any element of LIST is NOT an lvalue, any attempt to modify that element will fail. In other words, the foreach loop index variable is an implicit alias for each item in the list that you're looping over.
This has the benefit of not copying data and not touching elements that don't change so it is more efficient, potentially a lot more. However, it relies on a subtle property and thus it may be tricky and error prone, so I do not recommend it as a general practice.
To copy the array, with modifications, to a new one
my #lines_new;
foreach my $line (#lines) {
$line =~ s{STRING_TO_BE_REPLACED}{$somedir};
push #lines_new, $line;
}
This also changes #lines. If it need be kept intact do (my $new_line = $line) =~ s/.../. Then write #lines_new to $file. Somewhere in between these two is
#lines = map { s{STRING_TO_BE_REPLACED}{$somedir}; $_ } #lines;
what was posted originally. However, since the map changes elements of #lines and copies data to build the output list, while the whole statement also overwrites the array, on reflection I think it makes more sense to do either the in-place change or an explicit copy to a new array.
In principle it is better to not read the whole file at once but rather to process line by line, unless the file is small enough. In that case open the file for reading and new one for writing, and after you copy (with changes) the file over, move the new one to rewrite $file. See the topic in perlfaq5
The copied file is temporary, to be used to overwrite $file, so it can be named using the core module File::Temp to avoid accidents. To move a file use move from the core module File::Copy.

How to parse a CSV file containing serialized PHP?

I've just started dabbling in Perl, to try and gain some exposure to different programming languages - so forgive me if some of the following code is horrendous.
I needed a quick and dirty CSV parser that could receive a CSV file, and split it into file batches containing "X" number of CSV lines (taking into account that entries could contain embedded newlines).
I came up with a working solution, and it was going along just fine. However, as one of the CSV files that I'm trying to split, I came across one that contains serialized PHP code.
This seems to break the CSV parsing. As soon as I remove the serialization - the CSV file is parsed correctly.
Are there any tricks I need to know when it comes to parsing serialized data in CSV files?
Here is a shortened sample of the code:
use strict;
use warnings;
my $csv = Text::CSV_XS->new({ eol => $/, always_quote => 1, binary => 1 });
my $out;
my $in;
open $in, "<:encoding(utf8)", "infile.csv" or die("cannot open input file $inputfile");
open $out, ">outfile.000";
binmode($out, ":utf8");
while (my $line = $csv->getline($in)) {
$lines++;
$csv->print($out, $line);
}
I'm never able to get into the while loop shown above. As soon as I remove the serialized data, I suddenly am able to get into the loop.
Edit:
An example of a line that is causing me trouble (taken straight from Vim - hence the ^M):
"26","other","1","20,000 Subscriber Plan","Some text here.^M\
Some more text","on","","18","","0","","0","0","recurring","0","","payment","totalsend","0","tsadmin","R34bL9oq","37","0","0","","","","","","","","","","","","","","","","","","","","","","","0","0","0","a:18:{i:0;s:1:\"3\";i:1;s:1:\"2\";i:2;s:2:\"59\";i:3;s:2:\"60\";i:4;s:2:\"61\";i:5;s:2:\"62\";i:6;s:2:\"63\";i:7;s:2:\"64\";i:8;s:2:\"65\";i:9;s:2:\"66\";i:10;s:2:\"67\";i:11;s:2:\"68\";i:12;s:2:\"69\";i:13;s:2:\"70\";i:14;s:2:\"71\";i:15;s:2:\"72\";i:16;s:2:\"73\";i:17;s:2:\"74\";}","","","0","0","","0","0","0.0000","0.0000","0","","","0.00","","6","1"
"27","other","1","35,000 Subscriber Plan","Some test here.^M\
Some more text","on","","18","","0","","0","0","recurring","0","","payment","totalsend","0","tsadmin","R34bL9oq","38","0","0","","","","","","","","","","","","","","","","","","","","","","","0","0","0","a:18:{i:0;s:1:\"3\";i:1;s:1:\"2\";i:2;s:2:\"59\";i:3;s:2:\"60\";i:4;s:2:\"61\";i:5;s:2:\"62\";i:6;s:2:\"63\";i:7;s:2:\"64\";i:8;s:2:\"65\";i:9;s:2:\"66\";i:10;s:2:\"67\";i:11;s:2:\"68\";i:12;s:2:\"69\";i:13;s:2:\"70\";i:14;s:2:\"71\";i:15;s:2:\"72\";i:16;s:2:\"73\";i:17;s:2:\"74\";}","","","0","0","","0","0","0.0000","0.0000","0","","","0.00","","7","1"
"28","other","1","50,000 Subscriber Plan","Some text here.^M\
Some more text","on","","18","","0","","0","0","recurring","0","","payment","totalsend","0","tsadmin","R34bL9oq","39","0","0","","","","","","","","","","","","","","","","","","","","","","","0","0","0","a:18:{i:0;s:1:\"3\";i:1;s:1:\"2\";i:2;s:2:\"59\";i:3;s:2:\"60\";i:4;s:2:\"61\";i:5;s:2:\"62\";i:6;s:2:\"63\";i:7;s:2:\"64\";i:8;s:2:\"65\";i:9;s:2:\"66\";i:10;s:2:\"67\";i:11;s:2:\"68\";i:12;s:2:\"69\";i:13;s:2:\"70\";i:14;s:2:\"71\";i:15;s:2:\"72\";i:16;s:2:\"73\";i:17;s:2:\"74\";}","","","0","0","","0","0","0.0000","0.0000","0","","","0.00","","8","1""73","other","8","10,000,000","","","","0","","0","","0","0","recurring","0","","payment","","0","","","75","0","10000000","","","","","","","","","","","","","","","","","","","","","","","0","0","0","a:17:{i:0;s:1:\"3\";i:1;s:1:\"2\";i:2;s:2:\"59\";i:3;s:2:\"60\";i:4;s:2:\"61\";i:5;s:2:\"62\";i:6;s:2:\"63\";i:7;s:2:\"64\";i:8;s:2:\"65\";i:9;s:2:\"66\";i:10;s:2:\"67\";i:11;s:2:\"68\";i:12;s:2:\"69\";i:13;s:2:\"70\";i:14;s:2:\"71\";i:15;s:2:\"72\";i:16;s:2:\"74\";}","","","0","0","","0","0","0.0000","0.0000","0","","","0.00","","14","0"
The CSV you are trying to read escapes embedded quotes with backslash, but the default for Text::CSV_XS is to escape by doubling them. Try adding escape_char => '\\' to the Text::CSV_XS constructor.
You may also need allow_loose_escapes => 1 if it uses backslash to quote other things that don't strictly need it like newlines.
The other option is to change the writer to use doubled quotes instead of backslashes for escaping. Might or might not be possible. Doubling the quotes is the more common flavour of CSV and while programmatic parsers can generally read both (if told), you won't be able to read the variant with backslash e.g. in Excel.

How do I delete a random value from an array in Perl?

I'm learning Perl and building an application that gets a random line from a file using this code:
open(my $random_name, "<", "out.txt");
my #array = shuffle(<$random_name>);
chomp #array;
close($random_name) or die "Error when trying to close $random_name: $!";
print shift #array;
But now I want to delete this random name from the file. How I can do this?
shift already deletes a name from the array.
So does pop (one from the beginning, one from the end) - I would suggest using pop as it may be more efficient and being a random one, you don't care which on you use.
Or do you need to delete it from a file?
If that's the case, you need to:
A. get a count of names inside a file (if small, read it all in memory using File::Slurp, if large, either read it line-by-line and count or simply execute wc -l $filename command via backticks.
B. Generate a random # from 1 to <$ of lines> (say, $random_line_number
C. Read the file line by line. For every line read, WRITE it to another temp file (use File::Temp to generate temp files. Except do NOT write the line numbered $random_line_number to text file
D. Close temp file and move it instead of your original file
If the list contains filenames and you need to delete the file itself (the random file), use unlink() function. Don't forget to process return code from unlink() and, like with any IO operation, print error message containing $! which will be the text of system error on failure.
Done.
D.
When you say "delete this … from the list" do you mean delete it from the file? If you simply mean remove it from #array then you've already done that by using shift. If you want it removed from the file, and the order doesn't matter, simply write the remaining names in #array back into the file. If the file order does matter, you're going to have to do something slightly more complicated, such as reopen the file, read the items in in order, except for the one you don't want, and then write them all back out again. Either that, or take more notice of the order when you read the file.
If you need to delete a line from a file (its not entirely clear from your question) one of the simplest and most efficient ways is to use Tie::File to manipulate a file as if it were an array. Otherwise perlfaq5 explains how to do it the long way.