Command line CSV viewer with column-alignment for LARGE files

I would like to view my CSV files in a column-aligned format from the command line, with something like less, but my CSV files are sometimes gigabytes in size, and I'm using a small computer (netbook, 1GB RAM, 8GB HD, 1GHz processor), so I don't want to waste a lot of memory or processing power viewing the file.
I mention that I'd like to use something like less because I would like to be able to navigate around within the file.
cat FILE | column -s, -t | less is one thought, but cat is still going to try to print the whole file and I'm not sure how much buffering the pipes will use (if any) or what sort of caching less employs.
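One variation on that idea (just a sketch, assuming plain comma-separated fields with no quoted commas) avoids the extra cat and only aligns a window of the file, since column has to read all of its input before it can work out column widths:
# align only the first few thousand lines so column never buffers the whole file;
# -S makes less scroll wide rows horizontally instead of wrapping them
head -n 5000 FILE | column -s, -t | less -S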
This question is similar to this other question, but I'm specifically interested in viewing large files using minimal resources, preferably with tools already on the machine. I don't presently use vi or Emacs, and I think they'd both be overkill here. Vi, for instance, would be a 27MB install for a utility acting merely as a viewer.

First of all, less can open oversized files. Second, both vim (which I use with the Largefile plugin and with files over 8 GB) and emacs can do it.
But most of the time, viewing a big file in an 80x40 (or slightly bigger) terminal is useless, so you should filter it with something like (f)grep or process it with awk. If you want only the start or the end, then there are head and tail.
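For example (a rough sketch; the filename, field number, and pattern are made up):
# keep only the rows whose third field matches, then page the much smaller result
awk -F, '$3 == "ERROR"' bigfile.csv | less
# or just peek at the beginning or the end
head -n 100 bigfile.csv
tail -n 100 bigfile.csv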
HTH

Check out the head and tail commands.
Or, even better, download the Vim source and compile it yourself; that should be easy enough. The version 5.8 source is 1MB before decompressing (4MB after). Enjoy.

Reduce relocatable win32 Perl to as few files and bytes as possible

I'm trying to use a Perl program on a Windows HTCondor computing cluster. The way HTCondor on Windows works is that it copies all dependencies into a temporary directory (used as a chroot of sorts) and then deletes the directory after the specified outputs are moved to a designated place.
If I take only perl.exe and perl514.dll and make a job like this: perl -e "print qq/hello\n/" and tell the cluster to run it 200 times, then each replication winds up taking about 15 seconds, which is acceptable overhead. That's almost all time spent repeatedly copying the files over the network and then deleting them. echo_hello.bat run 200 times takes more like two seconds per replication.
The problem I have is that when I try to use my full-blown Perl distribution of 55MB and 2,289 files, a single "hello" rep takes something like four minutes of copying and deleting, which is unacceptable. When I try to do many runs, the disks on the machines grind to a halt trying to concurrently handle all the file operations across all the reps, so it doesn't work at all. I don't know how long it might take to eventually finish because I gave up after half an hour and no jobs had finished.
I figured PAR::Packer might fix the issue, but nope. I tried print_hello.exe created like this: pp -o print_hello.exe -e "print qq/hello\n/". It still makes things grind to a halt, apparently by swamping the filesystem. I think a PAR::Packer executable makes a ton of temporary files as it pulls out the files it needs from the archive. I think the Windows file system totally chokes when there are a bunch of concurrent small file operations.
So how can I go about cutting down the perl I built to something like 6MB and a dozen files? I'm really only using a tiny number of core modules and don't need most of the crap in bin and lib, but I have no idea how to proceed ripping out stuff in a sane way.
Is there an automated way to strip away un-needed files and modules?
I know TCL has a bunch of facilities for packing files into a single uncompressed archive that can then be accessed through a "virtual filesystem" without expanding the file. Is there some way to do this with Perl itself, sort of like with PAR? The problem is that PAR compresses everything and then has to extract to temporary files, rather than working directly through a virtual filesystem layer. (If I understand correctly.)
My usage of Perl is actually as a scripting layer. It's embedded in a simulation, so I'm really running my_simulation.exe, which depends on perl514.dll, but you get the idea. I also cannot realistically do anything to the HTCondor cluster other than use it. So there's no need to think outside the box about what I should be using instead of Perl or what I could administratively tweak in Windows and HTCondor. Thanks.
You can use Module::ScanDeps to get a list of the actual dependencies of your Perl program. I found it terrible that PAR::Packer took a significant amount of time unpacking the whole application, so I decided to build the executable myself.
Here is my ready-to-use script, which gathers Perl dependencies into a directory; it might be useful for you to reduce the number of Perl modules, e.g. by manually removing some dependencies after copying.
In theory (I have never tried it), your next step could be to merge all pure-Perl dependencies into a single file (like deps.pm), although that might be non-trivial due to Perl's autoload magic and some other tricks.
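As a rough illustration of the dependency-listing step (the script name is a placeholder, and this assumes the scandeps.pl utility that ships with Module::ScanDeps is on your PATH):
# static, recursive scan of everything the script uses or requires
scandeps.pl my_driver.pl
# additionally run the script so modules loaded at runtime are caught too
scandeps.pl -x my_driver.pl
Copying the listed module files (plus perl.exe and perl514.dll) into a single directory is then mostly mechanical.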
You can list the modules that are needed by your program using the very nice ListDependencies module.
To my knowledge it isn't downloadable anywhere, but it is simple to copy and paste into your own ListDependencies.pm file.
You should read the POD documentation within the module for usage instructions.

Emacs: trying to write something after saving provokes message "file changed on disk. Really edit the buffer?"

Emacs 24 on Ubuntu 14.
I have the file open only in Emacs, and it gives me this message constantly, after every save. That is annoying.
This is strange, because earlier everything worked fine, and I can hardly guess what I could have broken in the meantime. I'm a total newbie in Ubuntu, using it according to instructions found on the internet.
For now I'm using Emacs 23, and everything is fine there. I guess I need the open buffer to be auto-synchronized with the saved file right after saving. Anyway, how can I fix it?
It sounds like some other program on your computer is reading the file when it changes, and possibly even introducing changes (perhaps just to the modification time, rather than to the contents). It's hard to say off-hand just what that would be.
As a workaround, try M-x global-auto-revert-mode. It will only auto-revert if you have no local modifications since the last save. This is generally a nice mode to turn on if you use multiple editors, and I keep it enabled all the time.
Other ideas:
Check if any other process currently has the file open using fuser /path/to/filename.txt (note: it only shows open file descriptors, not processes that hold the file content in memory and write it later)
Do you use any non-standard filesystem? (check with df -h /path/to/filename.txt and mount)
Is your system time stable? (Manually check date, scan the output of dmesg for obvious errors concerning timekeeping, and look for errors related to NTP in the logfiles in /var/log/.) The corresponding commands are sketched below.
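Something along these lines (the grep patterns and the log file name are guesses for a typical Ubuntu setup):
fuser /path/to/filename.txt           # who currently has the file open?
df -h /path/to/filename.txt; mount    # which filesystem is it on?
date                                  # sanity-check the clock
dmesg | grep -i clock                 # kernel complaints about timekeeping
grep -i ntp /var/log/syslog           # NTP-related errors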

SVG to PDF (with Perl Cairo?)

In a Perl script, I try to convert SVG files to PDF. This works great by just calling Inkscape:
system "inkscape -D -z --file=$in --export-pdf=$out";
But it is enormously slow even for small 100 KB files (it can take minutes per file), causing the script to fail when running under a time-out constraint, e.g. on a webserver.
To speed things up, I have read about svg2pdf as a standalone tool, but I never found a binary for Win7 or managed to compile it, even with the libcairo DLLs present.
My last idea now is to use the CPAN module Cairo. I was hoping it could convert an SVG file to PDF, but in the documentation I only find drawings and surfaces, and no method to write/convert.
Does anyone have experience with that?
Making my comment an answer: you could try rsvg-convert, which is part of the librsvg library. It's probably faster than Inkscape, but it's still an external command.
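A typical invocation looks something like this (the filenames are placeholders; check rsvg-convert --help for the exact options your version supports):
# render the SVG directly to a single-page PDF
rsvg-convert -f pdf -o out.pdf in.svg
From the Perl script you would call it the same way you currently call Inkscape.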

Counting files and directories in a very large subversion repository

Here at work, we have a rather large Subversion repository. As part of our internal monitoring, we want a count of all files and directories for every revision in all our repositories. Problem is, one of them has around 29,000 revisions and contains around 300,000 directories, with almost 4 million files. Our previous method simply used the output of the 'svnlook' command in a Perl script to count everything. I've tried using the output of 'svnlook changed' to build a count, and it mostly works, but there is some rather annoying guesswork involved. As a side note, the repos are hosted on a Xen VM, so I/O performance is a bit of an issue. Anyone have a better way to do this?
Assuming you are talking about server-side repos:
svn list -R --xml file:///svnrepos/myrepo | grep kind=\"file\" | wc -l
It's not very fast, but it is accurate.
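To count directories as well, and to pin the count to a particular revision, something like the following should work (the repository URL and revision number are placeholders):
# files and directories as of revision 1234
svn list -R --xml -r 1234 file:///svnrepos/myrepo | grep -c 'kind="file"'
svn list -R --xml -r 1234 file:///svnrepos/myrepo | grep -c 'kind="dir"'
Looping that over all 29,000 revisions is likely to be slow, though.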
I'd look into the svnadmin dump delta format. I've played with it a little, but basically it's one huge patch-type file containing all the files and all the revisions. It's text in nature, so relatively straightforward to process with something like Perl, and it is fairly small compared to going through the whole of each revision one at a time.
You'd probably need to have a representation of all the files (if 4 million, maybe use SQLite for this) and update them as you pass through each revision. The delta does the revisions in order, so it ought to be relatively straightforward. (Maybe I am being optimistic.)
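Producing that dump is a single command (the repository path is a placeholder); the parsing and bookkeeping on top of it is the part you would write yourself:
# one delta-compressed dump containing every revision
svnadmin dump --deltas /svnrepos/myrepo > myrepo.dump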
How about something like:
find /svndir | wc -l
The output from find on Linux or Unix generates one line per file or directory, and it is recursive. Pipe the output to "wc -l" to count the lines.

Reading a very large (multi-gigabyte) file quickly in Perl

We are currently reading the file line by line, which takes a long time to get through the whole file.
We need to read the file quickly and then proceed with our commands.
The approach I tried using fork and an array only displays the first set of lines and does not proceed with the other sets.
Please help with this.
Reading a large file takes a fair bit of time - disks are slow, after all. Before you start looking at Perl, first try (assuming you're on a unix-type system):
time cat /path/to/your/large/file >/dev/null
The output will tell you how long it takes to just read that file from disk without doing anything to it. Alternately, open the file in your favorite text editor and time how long it takes to load. Once you have that time, compare it to how long your Perl program takes to read the file. Unless the Perl program takes significantly longer, you're not likely to be able to do anything about it because the time is being spent on getting the data from disk rather than on processing it.
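As a point of comparison (just a sketch; the path is a placeholder), you can time a bare line-by-line read in Perl the same way:
# how long Perl takes just to walk the file line by line, doing nothing with it
time perl -ne '1' /path/to/your/large/file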
Of course, that's assuming that you actually do need to read the entire file. If you can get by with only reading specific parts of it, then you could create an index file and use that to jump directly to the part that's of interest, but you haven't provided enough information for us to tell whether that would apply to your case or not.
If you need more specific help, please provide a better description of what you mean to accomplish and a small, runnable piece of Perl code which shows how you're currently reading and processing the file so that we can see whether you're doing anything particularly inefficient that can be improved on.