I am working on a Perl script that reads a huge file whose records are in the format below. The script might run on Solaris 10 or HP-UX 11.0.
Filename1 , col1, col2
Filename1 , col1, col2
Filename2 , col1, col2
Filename3 , col1, col2
When I read the file name in the first field of an input line, I need to create that file if it doesn't exist and print the rest of the fields to it. There might be 13000 unique file names in the input file. What is the maximum number of file handles that I can open on Solaris 10 or HP-UX 11? Will I be able to open 13000 file handles? I am planning to use a hash to store the file handles for writing to the files and closing them. Also, how can I easily get the unique file names from the first field across the whole file? Is there an easy way to do it rather than reading each line of the file?
The maximum number of file handles is OS-dependent (and configurable).
See the ulimit manual page.
However opening that many file handles is unreasonable. Have a rethink about your algorithm.
No, there's no way to get all the unique filenames without reading the entire file. But you can generate this list as you're processing the file. When you read a line, add the filename as the key of a hash. At the end, print the keys of the hash.
I don't know what your system allows, but you can open more file handles than your system permits using the FileCache module. This is a core Perl module, so you shouldn't even need to install it.
There is no way to get the first column out of a text file without reading the whole file, because text files don't really have an internal structure of columns or even lines; they are just one long string of data. The only way to find each "line" is to go through the whole file and look for newline characters.
However, even huge files are generally processed quite quickly by Perl. This is unlikely to be a problem. Here is simple code to get the unique filenames (assuming your file is opened as FILE):
my %files;
while (<FILE>) { /^(\S+)/ and $files{$1}++; }
This ends up with a count of how many times each file occurs. It assumes that your filenames don't contain any spaces. I did a quick test of this with >30,000 lines, and it was instantaneous.
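The two ideas above (collect the unique names while processing, and avoid holding thousands of handles open) can be combined in one pass. The following is a minimal sketch in Python rather than Perl, in the same spirit as FileCache but not a port of it. The names split_by_first_field and max_open are hypothetical, and it assumes the first field is a plain file name followed by a comma.

```python
import os
from collections import OrderedDict

def split_by_first_field(input_path, out_dir, max_open=100):
    """Write the remaining fields of each record to a file named by its first field.

    At most max_open output files are open at once (an assumed limit,
    to be kept below whatever ulimit reports). When the limit is hit,
    the least recently used handle is closed, and the file is reopened
    in append mode if its name shows up again.
    """
    handles = OrderedDict()   # file name -> open handle, least recently used first
    seen = set()              # every unique file name encountered so far
    try:
        with open(input_path) as src:
            for line in src:
                name, sep, rest = line.partition(",")
                name = name.strip()
                if not sep or not name:
                    continue  # skip blank or malformed lines
                if name in handles:
                    handles.move_to_end(name)
                else:
                    if len(handles) >= max_open:
                        _, oldest = handles.popitem(last=False)
                        oldest.close()
                    # Truncate on first sight, append when reopening later.
                    mode = "a" if name in seen else "w"
                    handles[name] = open(os.path.join(out_dir, name), mode)
                    seen.add(name)
                handles[name].write(rest.strip() + "\n")
    finally:
        for fh in handles.values():
            fh.close()
    return sorted(seen)
```

The return value is the list of unique file names, so no second pass over the input is needed. If the input happens to be sorted by file name, max_open=1 is enough.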
I want to test an application with a file containing 10000 lines of records (plus header and footer lines). I have a test file with 10 lines now, so I want to duplicate these lines 1000 times. I don't want to write C# code in my app to generate that file (it is only for testing), so I am looking for a different and simple way to do that.
What kind of tool can I use to do that? CMD? A Visual Studio/VS Code extension? Any thoughts?
If your data is textual, load the 10 records from your test file into an editor. Select all, copy, and paste at the end of the file. Each cycle doubles the number of lines. Repeat until the file has at least 10000 lines.
This procedure requires ceil(log_2(1000)) cycles, 10 in your case, in general ceil(log_2(<target_number_of_lines>/<base_number_of_lines>)).
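As a quick check of that count, here is a small Python sketch (the function name doubling_cycles is hypothetical) that simulates the doubling with integers instead of evaluating the logarithm, which sidesteps any floating-point rounding:

```python
def doubling_cycles(base_lines, target_lines):
    """Count how many copy-and-paste doublings turn base_lines into at least target_lines."""
    if base_lines <= 0:
        raise ValueError("base_lines must be positive")
    cycles = 0
    lines = base_lines
    while lines < target_lines:
        lines *= 2      # one select-all, copy, paste-at-end cycle
        cycles += 1
    return cycles

# 10 lines doubled 10 times gives 10 * 2**10 = 10240 lines, the first value >= 10000.
print(doubling_cycles(10, 10000))  # 10
```

Note that the result overshoots the target (10240 rather than 10000 lines), so the surplus lines have to be trimmed if an exact count matters.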
Alternative (large files)
Modern editors should not have performance problems here. However, the principle can be applied using a cat cli command. Assuming that you copy the original file into a file named dup0.txt proceed as follows:
cat dup0.txt dup0.txt >dup1.txt
cat dup1.txt dup1.txt >dup0.txt
leaving you with four times the original number of lines in dup0.txt. Repeat the pair of commands until the file is long enough.
So I'm reading in a .csv file, and it all works as I want bar one thing. The headers of the data have spaces, which I want later for displaying data to the user. However, these spaces get stripped when the csv file is read in via readtable (as they get used as the variable names). Again, no problem with this per se, but I still need the unmodified strings as well.
Two additional notes:
I'm happy for the strings to be stored separately from the main table if that makes things easier.
The actual .csv file I'm reading in is reasonably large (about 2 million data points), so in terms of computational cost, the less reading of the file the better.
Example read in code:
File = 'example.csv';
Import_Options = detectImportOptions( File, 'NumHeaderLines', 0 );
Data = readtable( File )
Example csv file (example.csv):
"this","is","an","example test"
"1","1","2","3"
"3","1","4","1"
"hot","hot","cold","hot"
You can simply read the first line with fgetl, thus grabbing the headers, before reading the entire file with readtable.
How to read a text file (Windows) in TCL?
I've written some PowerShell code which generates a text file with multiple values. The generated values serve as input data for further processing.
I need the required logic to read the content using TCL.
How can I do that?
To read a file holding text, assuming you know the file is called INPUT_DATA.TXT in the current directory:
set f [open "INPUT_DATA.TXT"]; # Or [open "INPUT_DATA.TXT" "r"]
set lineList [split [read $f] "\n"]
close $f
This puts a list of lines of text in the variable lineList. To do this it opens the file, which returns a “file handle” that I store in the variable f. Then (reading the next line of code from the innermost part outwards) I read the whole contents of the file from the file handle and split that big string by \n (newline) to get a list of the contents of all the lines in the file. Finally, I close that file handle; it's not usually a good idea to keep handles open when you don't need them, as the OS has a finite number available (though that finite number is pretty large).
Next, you'll need to do further work to get the code to understand the contents of the file. Alas, that's more data-format-dependent so there's not really a general rule.
If you were working with a binary file instead, you might instead do:
set f [open "INPUT_DATA.BIN" "rb"]
set data [read $f]
close $f
but binary data formats are far more varied than text data formats, so “what next?” is even more difficult to generalise for. Fortunately, binary data in Tcl isn't too hard; apart from that extra b in the open, binary data is just yet another string to Tcl, and Tcl's good at strings!
I hope you are all well.
So my question is about the procedure to open multiple raw data files that are compressed.
My file names are ordered, so I have for example: o_equities_20080528.tas.zip, o_equities_20080529.tas.zip, o_equities_20080530.tas.zip, ...
Thank you all in advance.
How much work this will be depends on whether:
You have enough space to extract all the files simultaneously into one folder
You need to be able to keep track of which file each record has come from (i.e. you can't tell just from looking at a particular record).
If you have enough space to extract everything and you don't need to track which records came from which file, then the simplest option is to use a wildcard infile statement, allowing you to import the records from all of your files in one data step:
infile "c:\yourdir\o_equities_*.tas" <other infile options as per individual files>;
This syntax works regardless of OS - it's a SAS feature, not shell expansion.
If you have enough space to extract everything in advance but you need to keep track of which records came from each file, then please refer to this page for an example of how to do this using the filevar option on the infile statement:
http://www.ats.ucla.edu/stat/sas/faq/multi_file_read.htm
If you don't have enough space to extract everything in advance, but you have access to 7-zip or another archive utility, and you don't need to keep track of which records came from each file, you can use a pipe filename and extract to standard output. If you're on a Linux platform then this is very simple, as you can take advantage of shell expansion:
filename cmd pipe "nice -n 19 gunzip -c /yourdir/o_equities_*.tas.zip";
infile cmd <other infile options as per individual files>;
On Windows it's the same sort of idea, but as you can't use shell expansion, you have to construct a separate filename for each zip file, or use some of 7-Zip's more arcane command-line options, e.g.:
filename cmd pipe "7z.exe e -an -ai!C:\yourdir\o_equities_*.tas.zip -so -y";
This will extract all files from all of the matching archives to standard output. You can narrow this down further via the 7-zip command if necessary. You will have multiple header lines mixed in with the data - you can use findstr to filter these out in the pipe before SAS sees them, or you can just choose to tolerate the odd error message here and there.
Here, the -an tells 7-zip not to read the zip file name from the command line, and the -ai tells it to expand the wildcard.
If you need to keep track of what came from where and you can't extract everything at once, your best bet (as far as I know) is to write a macro to process one file at a time, using the above techniques and add this information while you're importing each dataset.
I have a text file called playlist.pls which is dynamically created, and in the text file I have thousands of lines that look like this:
File000001=/home/ubu32sc/Documents/octave/pre/wavefn_0001.wav
File000002=/home/ubu32sc/Documents/octave/pre/wavefn_0002.wav
File000003=/home/ubu32sc/Documents/octave/pre/wavefn_0003.wav
File000004=/home/ubu32sc/Documents/octave/pre/wavefn_0004.wav
File000005=/home/ubu32sc/Documents/octave/pre/wavefn_0005.wav
File000006=/home/ubu32sc/Documents/octave/pre/wavefn_0006.wav
File000007=/home/ubu32sc/Documents/octave/pre/wavefn_0007.wav
File000008=/home/ubu32sc/Documents/octave/pre/wavefn_0008.wav
File000009=/home/ubu32sc/Documents/octave/pre/wavefn_0009.wav
File000010=/home/ubu32sc/Documents/octave/pre/wavefn_0010.wav etc...
I need to have the data in the text file split into several different files.
example:
The play1.pls file would contain:
File000001=/home/ubu32sc/Documents/octave/pre/wavefn_0001.wav
File000002=/home/ubu32sc/Documents/octave/pre/wavefn_0002.wav
File000003=/home/ubu32sc/Documents/octave/pre/wavefn_0003.wav
The play2.pls file would contain:
File000004=/home/ubu32sc/Documents/octave/pre/wavefn_0004.wav
File000005=/home/ubu32sc/Documents/octave/pre/wavefn_0005.wav
File000006=/home/ubu32sc/Documents/octave/pre/wavefn_0006.wav
The play3.pls file would contain:
File000007=/home/ubu32sc/Documents/octave/pre/wavefn_0007.wav
File000008=/home/ubu32sc/Documents/octave/pre/wavefn_0008.wav
File000009=/home/ubu32sc/Documents/octave/pre/wavefn_0009.wav
The play4.pls file would contain:
File000010=/home/ubu32sc/Documents/octave/pre/wavefn_0010.wav etc...
What's the best way to go about doing this? I was thinking about using Octave/MATLAB, but I think it would be overkill and resource-intensive to run a for loop over a text file with tens of thousands of lines. Is grep or Perl the proper thing to use, or should I use another type of program? If so, how could I do this with it?
I'm using 32-bit Ubuntu 10.04 with 6 GB of RAM.
Thanks
As you mentioned, MATLAB/Octave seems to be overkill if you just want to split a text file into multiple files.
There are a thousand ways to do this (especially on a Unix system), so just pick yours.
One of the possibilities is to use split which goes like this:
split --lines=3 file prefix
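Note that split names its output files prefixaa, prefixab and so on by default, not play1.pls, play2.pls. If the numbered names from the question matter, a short script is one alternative. Below is a minimal Python sketch of that; the function name split_playlist and its parameters are hypothetical, and the output files are written relative to the current directory unless prefix includes a path.

```python
from itertools import islice

def split_playlist(input_path, lines_per_file=3, prefix="play", suffix=".pls"):
    """Split input_path into numbered files holding lines_per_file lines each."""
    written = []
    with open(input_path) as src:
        index = 1
        while True:
            # Take the next block of lines; the last block may be shorter.
            chunk = list(islice(src, lines_per_file))
            if not chunk:
                break
            out_name = f"{prefix}{index}{suffix}"
            with open(out_name, "w") as dst:
                dst.writelines(chunk)
            written.append(out_name)
            index += 1
    return written
```

Calling split_playlist("playlist.pls") on the 10-line sample would produce play1.pls through play4.pls, with the last file holding the single remaining line. Only one chunk is held in memory at a time, so tens of thousands of lines are not a concern.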