How to read a text file (Windows) in TCL?
I've written some PowerShell code which generates a text file with multiple values. The generated values serve as input data for further processing.
I need the required logic to read the content using TCL.
How can I do that?
To read a file holding text, assuming you know the file is called INPUT_DATA.TXT in the current directory:
set f [open "INPUT_DATA.TXT"]; # Or [open "INPUT_DATA.TXT" "r"]
set lineList [split [read $f] "\n"]
close $f
This puts a list of lines of text in the variable lineList. To do this it opens the filename, which returns a “file handle” that I store in the variable f. Then (reading the next line of code from the innermost part outwards) I read the whole contents of the file from the file handle and split that big string by \n (newline) to get a list of all all the contents of the lines in the file. Finally, I close that file handle; they're not usually a good idea to keep open when you don't need them as the OS has a finite number available (though that finite number is pretty large).
Next, you'll need to do further work to get the code to understand the contents of the file. Alas, that's more data-format-dependent so there's not really a general rule.
If you were working with a binary file instead, you might instead do:
set f [open "INPUT_DATA.BIN" "rb"]
set data [read $f]
close $f
but binary data formats are far more varied than text data formats, so “what next?” is even more difficult to generalise for. Fortunately, binary data in Tcl isn't too hard; apart from that extra b in the open, binary data is just yet another string to Tcl, and Tcl's good at strings!
Related
Perl has modes for IO::File like r and w. Where are these documented? From perldoc IO::File
$fh = IO::File->new("file", "r");
I'm looking to find the character that corresponds to the mode to open the file for appending, and create it if it doesn't exist.
an ANSI C fopen() mode string ("w", "r+", etc.), it uses the basic Perl "open" operator (but protects any special characters).
So in man 3 fopen
The argument mode points to a string beginning with one of the
following sequences (possibly followed by additional characters,
as described below):
r Open text file for reading. The stream is positioned at
the beginning of the file.
r+ Open for reading and writing. The stream is positioned at
the beginning of the file.
w Truncate file to zero length or create text file for writ‐
ing. The stream is positioned at the beginning of the
file.
w+ Open for reading and writing. The file is created if it
does not exist, otherwise it is truncated. The stream is
positioned at the beginning of the file.
a Open for appending (writing at end of file). The file is
created if it does not exist. The stream is positioned at
the end of the file.
a+ Open for reading and appending (writing at end of file).
The file is created if it does not exist. The initial file
position for reading is at the beginning of the file, but
output is always appended to the end of the file.
I'm using Perl WWW::Mechanize package in order to fetch and process data from some websites. Usually my way of action is as follows:
Fetch a webpage
$mech->get("$url");
Save the webpage contents in a variable (BTW, I'm not sure if it's the right way to save this amount of text inside a scalar which, as far as I know, supposed to be used for a single value)
my $list = $mech->content();
Use a subroutine that I've created to write the contents of the variable to a text file. (The writetoFile subroutine includes few more features, like path and existing file validations..)
writeToFile("$filename.tmp","$path",$list);
Processing the text in a file created in the previous step by creating an additional file and save the processed content there (Then deleting the initial temporary file).
What I wonder about, is whether it is possible to perform the processing before storing the text in a file, directly inside the $list variable? The whole process is working as expected but I don't really like the logic behind it and it seems a bit inefficient as well, since I have to rewrite the same file multiple times.
EDIT:
Just to give a bit more information about what I'm actually after when I process the variable contents. So the data I fetch from the website in this case is actually a list of items separated by a blank line and the first line is irrelevant to me. So what I'm doing while processing this data is 2 things:
Remove the empty (CRLF) lines
Remove the first line if it includes a particular text.
Ideally I want to save the processed list (no blank spaces and first line removed) in a file without creating any additional files on the way. In order to save the file I would like to use the writeToFile sub (I wrote) since it also performs validation on whether such file already exists (If a file will be saved before final processing - the writeToFile will always rewrite the existing file).
Hope it makes sense.
You're looking for split. The pattern depends: use (?<=\n) split at a new line character and keep it. If that doesn't matter, use \R to include all sort of line breaks.
foreach my $line (split qr/\R/, $mech->content) {
…
}
Now the obligatory HTML-parsing-with-regex admonishment: if you get HTML source with Mechanize, parsing it line-by-line does not make much sense. You probably want to process the HTML-stripped text version of the document instead, or pass the HTML source to a parser such as Web::Query to declaratively get at the pieces you need.
I am new to MATLAB programming and some of the syntax escapes me. So I need a little help. Plus I need some complex looping ideas.
Here's the breakdown of what I have:
12 seperate .dat files, each titled something like output_1_x.dat, output_2_x.dat, etc.
each file is actually one piece of a whole that was seperated and processed
each .dat file is approx. 3.9 GB
Here's what I need to do:
create a single file containing all the data from each seperate file, i.e. I need to recreate the original file.
call this complete output file something like output_final.dat
it has to be done in MATLAB, there are no other alternatives (actually there maybe; see note below)
What is implied:
I will have to fread each 3.9 GBfile into chunks or packets, probably 100 mb at a time (using an imbedded loop?)
these packets will have to be read then written sequentially
after one file is read then written into output_final.dat, the next file is automatically read & written (the master loop).
Well, that's pretty much it. I did a search for 'merging mulitple files' and found this. That isn't exactly what I need to do...I don't need to take part of a file, or data from files, and write it to a new one. I'm simply...concatenating...? This would be simple in Java or Perl, but I only have MATLAB as a tool.
Note: I am however running KDE in OpenSUSE on a pretty powerful box. Maybe someone who is also an expert in terminal knows a command/script to do this from the kernel?
So on this site we usually would point you to whathaveyoutried.com but this question is well phrased.
I wont write the code but i will give you how I would do it. So first I am a bit confused about why you need to fread the file. Are you just appending one file onto the end of another?
You can actually use unix commands to achieve what you want:
files = dir('*.dat');
for i = 1:length(files)
string = sprintf('cat %s >> output_final.dat.temp', files(i).name);
unix(string);
end
That code should loop through all the files and pipe all of the content into output_final.dat.temp (then just rename it, we didn't want it to be included in anything);
But if you really want to use fread because you want to parse the lines in some manner then you can use the same process:
files = dir('*.dat');
fidF = fopen('output_final.dat', 'w');
for i = 1:length(files)
fid = fopen(files(i).name);
while(~feof(fid))
string = fgetl(fid) %You may choose to parse the string in some manner here
fprintf(fidF, '%s', string)
end
end
Just remember, if you are not parsing the lines this will take much much longer.
Hope this helps.
I suggest using a matlab.io.matfileclass objects on two of the files:
matObj1 = matfile('datafile1.mat')
matObj2 = matfile('datafile2.mat')
This does not load any data into memory. Then you can use the objects' methods to sequentialy save a variable from one file to another.
matObj1.varName = matObj2.varName
You can get all the variables in one file with fieldnames(mathObj1) and loop through to copy contents from one file to another. You can then clear some space by removing the copied fields. Or you can use a bit more risky procedure by directly moving the data:
matObj1.varName = rmfield(matObj2,'varName')
Just a disclaimer: haven't tried it, use at own risk.
I am working on a Perl script that opens a huge file and which has the records in the below format. Script might run in Solaris 10 or HP UX 11.0
Filename1 , col1, col2
Filename1 , col1, col2
Filename2 , col1, col2
Filename3 , col1, col2
When I read the first field file name of the input file I need to create a new file if it doesn't exists and print the rest of the fields to the file. There might be 13000 unique file names in the input file. What is the maximum number of file handles that I can open in Solaris 10 or hpux 11? Will I be able to open 13000 file handles? I am planning to use a hash to store the file handles for writing it to the files and closing it. Also how can I easily get the unique file name from the first field across the whole file? Is there a easy way to do it rather than reading each line of the file?
The maximum number of filehandles is OS depended (and is configurable)
See ulimit (manual page is here)
However opening that many file handles is unreasonable. Have a rethink about your algorithm.
No, there's no way to get all the unique filenames without reading the entire file. But you can generate this list as you're processing the file. When you read a line, add the filename as the key of a hash. At the end, print the keys of the hash.
I don't know what your system allows, but you can open more file handles than your system permits using the FileCache module. This is a core Perl module, so you shouldn't even need to install it.
There is no way to get the first column out of a text file without reading the whole file, because text files don't really have an internal structure of columns or even lines; they are just one long string of data. The only way to find each "line" is to go through the whole file and look for newline characters.
However, even huge files are generally processed quite quickly by Perl. This is unlikely to be a problem. Here is simple code to get the unique filenames (assuming your file is opened as FILE):
my %files;
while (<FILE>) { /^(\S+)/ and $files{$1}++; }
This ends up with a count of how many times each file occurs. It assumes that your filenames don't contain any spaces. I did a quick test of this with >30,000 lines, and it was instantaneous.
I am using a perl script to read in a file, but I'm not sure what encoding the file is in. Basically, my file is a list of book titles, but each book has other info associated with it (author, publication date, etc). So each book title is within a discrete chunk of data for the book. So I iterate through the file line by line until I find the regular expression '/Book Title: (.*)/' and take what's in the paren. Then, I create a separate .txt file with the name of the text file being my book. However, in my unix server, when I look at the name of the file, it's actually not, for example, 'LordOfTheFlies.txt' but rather 'LordOfTheFlies^M.txt'
What is this '^M'? Is that a weird end of line encoding I'm not taking into account? I tried chomp but it doesn't seem to be working. What is the best file encoding for working with perl?
It's the additional carriage return character that Windows systems insert before line feed characters (M == 13th letter, hence ASCII 13 is visualised as ^M).
It has nothing to do with file encoding, it's just the line ending policy biting you. Perl is usually good at handling line ending characters correctly, but if they occur somewhere else than the end of a line you have to do it yourself. You can use s/\r// instead of chomp() to get them out.
Before processing the file, you need to know the encoding of the file, which is determined by the producer of the file.
That "^M" is control-M, which is a carriage return, and is not needed in Unix file systems.Looks like the file is created in Unix and transferred to Windows. It can also be added with ftp when text file are transfered as binaries.
Try chop, instead of 'chomp'. Chomp removes the 'new line character'. s/\r// is also good.
For your general question, you might want to use appropriate module for the file type you have to make your life easier and better with Perl.