How to quickly get directory (and contents) size in cygwin perl

I have a perl script which monitors several windows network share drive usages. It currently monitors the free space of several network drives using the cygwin df command. I'd like to add the individual drive usages as well. When I use the
du -s | grep total
command, it takes forever. I need to look at the shared drive usages because there are several network drives that are shared from a single drive on the server. Thus, filling one network drive fills them all (yes, I know, not the best solution, not my choice).
So, I'd like to know if there is a quicker way to get the folder usage that doesn't take forever.

du -s works by recursively querying the size of every directory and file. If your filesystem implementation doesn't store this total value somewhere, this is the only way to determine disk usage. Therefore, you should investigate which filesystem and drivers you are using, and see if there is a way to directly query for this data. Otherwise, you're probably SOL and will have to suck up the time it takes to run du.
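To illustrate, here is a rough Python sketch of what du -s effectively has to do (the path is a placeholder): every file costs at least one stat call, and over a network share each of those is a round trip, which is where the time goes. (It sums apparent sizes, whereas du counts allocated blocks, but the access pattern is the same.)

import os

def tree_size(path):
    # Walk the whole tree and stat every entry -- this is the unavoidable work.
    total = 0
    for entry in os.scandir(path):
        if entry.is_dir(follow_symlinks=False):
            total += tree_size(entry.path)
        elif entry.is_file(follow_symlinks=False):
            total += entry.stat(follow_symlinks=False).st_size
    return total

if __name__ == "__main__":
    print(tree_size("/cygdrive/z/shared"))  # hypothetical mounted share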

1) The problem possibly lies in the fact that they are network drives - local du is acceptably fast in most cases. Are you doing du on the exact server where the disk is housed? If not, try to approach the problem from a different angle - run an agent on every server hosting the drives which calculates the local du summaries and then reports the totals to a central process (either via IPC or, heck, by writing a report into a file on that same shared filesystem).
2) If one of the drives is taking a significantly larger share of the space (on average) than the rest of them, you can optimize by doing du on all but the "biggest" one and then calculating the biggest one by subtracting the sum of the others from the df results (see the sketch after this list).
3) Also, to be perfectly honest, it sounds like a suboptimal solution from a design standpoint - while you indicated that it's not your choice, I'd strongly recommend that you post a question on how you can improve the design within the parameters you were given (on ServerFault, not SO).
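A rough sketch of option 2, assuming hypothetical mount points, that df and du are on the PATH, and that the shares really do live on one physical disk; everything is in kilobytes:

import subprocess

def du_kb(path):
    # `du -sk` prints "<kilobytes>\t<path>"
    out = subprocess.run(["du", "-sk", path], capture_output=True, text=True, check=True)
    return int(out.stdout.split()[0])

def df_used_kb(mount):
    # second line of `df -k` output, third column = "Used"
    out = subprocess.run(["df", "-k", mount], capture_output=True, text=True, check=True)
    return int(out.stdout.splitlines()[1].split()[2])

small_shares = ["/cygdrive/x", "/cygdrive/y"]   # placeholder shares
total_used = df_used_kb("/cygdrive/z")          # same underlying disk
biggest = total_used - sum(du_kb(p) for p in small_shares)
print("biggest share uses roughly", biggest, "KB")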

Related

What if my mmap virtual memory exceeds my computer’s RAM?

Background and Use Case
I have around 30 GB of data that never changes, specifically, every dictionary of every language.
A client requests the definition of a word, and I simply respond with it.
On every request I have to conduct an algorithmic search of my choice so I don't have to loop through the more than two hundred million words I have stored in my .txt file.
If I were to open the txt file and read it so I can search for the word, it would take forever due to the size of the file (and even if that file were broken down into smaller files, that is not feasible, nor is it what I want to do).
I came across the concept of mmap, mentioned to me as a possible solution to my problem by a very kind gentleman on Discord.
Problem
As I was learning about mmap I came across the fact that mmap does not store the data in RAM but rather in virtual memory. Well, regardless of which it is, my server or Docker instances may have no more than 64 GB of RAM, and that chunk of data taking 30 of them is quite painful and makes me feel like there needs to be a better alternative. Even in the worst-case scenario, if my server or Docker container does not have enough RAM for the data stored via mmap, then it is not feasible, unless I am wrong as to how this works, which is why I am asking this question.
Questions
Is there a better solution for my use case than mmap?
Will accessing such a large amount of data through mmap (so I don't have to open and read the file every time) allocate RAM equal to the amount of the file that I am accessing?
Lastly, if I was wrong about any specific statement I made in what I have written so far, please do correct me, as I am still learning a lot about mmap.
Requirements For My Specific Use Case
I may get a request from one client that has tens of words that I have to look up, so I need to be able to retrieve lots of data from the txt file efficiently.
The response to the client has to be as quick as possible, the quicker the better; I am talking ideally less than three seconds, or if that is impossible, then as quick as it can be.
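For illustration only, a minimal Python sketch of mapping a large read-only file: the kernel pages data in on demand and can evict it again under memory pressure, so the mapping does not pin RAM equal to the file size (though a 30 GB map does need a 64-bit process). The file name and the one-entry-per-line, tab-separated layout are assumptions, and a real lookup should use an index or a sorted file with binary search rather than a linear find:

import mmap

with open("dictionaries.txt", "rb") as f:                 # hypothetical 30 GB file
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        pos = mm.find(b"\nserendipity\t")                 # assumed "word\tdefinition" lines
        if pos != -1:
            end = mm.find(b"\n", pos + 1)
            if end == -1:
                end = len(mm)
            print(mm[pos + 1:end].decode("utf-8"))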

Will the depth of a file within a filesystem change the time taken to copy it?

I am trying to figure out whether or not the depth of a file in a filesystem will change the amount of time it takes to execute a "cp" bash command with that file.
By depth I mean how many parent directories it is contained in.
I tried running a few tests, but my results are pretty inconclusive, and when I try to reason it out, I can think of arguments either way.
What is the purpose of this?
Provided nothing is cached, the deeper the directory tree the more data has to be read from storage to get to the file - you have to find the name of the second dir, then the third within the second and so on. On the other hand if the file is big, the time needed to do this can be negligible in comparison.
Also, the mere startup of a command like cp is not without its cost.
If you are interested in how file systems work read this free book: http://www.nobius.org/~dbg/practical-file-system-design.pdf
Performance is a complicated subject, especially when physical storage media are involved. Without a proper understanding of how this works and a proper understanding of statistics, you can't perform a correct test.
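If you want to measure it yourself, here is a rough sketch that times opening and reading a small file at depth 1 versus depth 50 (directory names are arbitrary). Note that with warm caches the difference largely disappears; to see the path-resolution cost you would have to drop caches between runs (on Linux, echo 3 > /proc/sys/vm/drop_caches as root), which is part of why such tests are hard to get right:

import os, tempfile, time

base = tempfile.mkdtemp()
deep = os.path.join(base, *["d"] * 50)          # 50 nested directories
os.makedirs(deep)
for parent in (base, deep):
    with open(os.path.join(parent, "file.bin"), "wb") as f:
        f.write(b"x" * 1024)

for parent, label in ((base, "shallow"), (deep, "deep")):
    t0 = time.perf_counter()
    for _ in range(10000):
        with open(os.path.join(parent, "file.bin"), "rb") as f:
            f.read()
    print(label, round(time.perf_counter() - t0, 3), "seconds for 10000 open+read calls")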

system hardware related

Hi everybody,
First of all, thanks for your support and help. Now I want to know how to calculate the load on a computer system. I have heard about load sensors and other utilities that mostly provide the option of finding the temperature of the hard disk or system, like top or the system monitor, but I don't want to use them.
All I want to know is simply the load on the CPU (e.g. the CPU is 40% free), the load on memory (meaning the total usage of memory or the amount of free memory), free disk space, etc.
I don't need a software tool for finding these things.
All I want to know is: is there any program in C or in another language or script which can find out these things, one by one or simultaneously?
Are there any commands to find this?
Or can anybody explain how a system monitor works?
Waiting for your help.
It depends on your operating system. On Unix/Linux, there is the directory /proc, which contains a lot of files/directories with system information, such as /proc/loadavg or /proc/cpuinfo.
These aren't real files on your disk, but virtual files containing system information. You can just open them and read them with standard file functions, so it works with any programming language, from C to Python.
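As a minimal sketch (Linux only, in Python for brevity; the same reads work the same way from C), you can pull the load average and free memory straight out of /proc with ordinary file I/O:

def load_average():
    # /proc/loadavg starts with the 1-, 5- and 15-minute load averages
    with open("/proc/loadavg") as f:
        one, five, fifteen = f.read().split()[:3]
    return float(one), float(five), float(fifteen)

def mem_free_kb():
    # /proc/meminfo has lines like "MemFree:  123456 kB"
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemFree:"):
                return int(line.split()[1])

print("load averages:", load_average())
print("free memory:", mem_free_kb(), "kB")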

Do any common OS file systems use hashes to avoid storing the same content data more than once?

Many file storage systems use hashes to avoid duplication of the same file content data (among other reasons), e.g., Git and Dropbox both use SHA256. The file names and dates can be different, but as long as the content gets the same hash generated, it never gets stored more than once.
It seems this would be a sensible thing to do in an OS file system in order to save space. Are there any file systems for Windows or *nix that do this, or is there a good reason why none of them do?
This would, for the most part, eliminate the need for duplicate file finder utilities, because at that point the only space you would be saving would be for the file entry in the file system, which for most users is not enough to matter.
Edit: Arguably this could go on serverfault, but I feel developers are more likely to understand the issues and trade-offs involved.
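As an illustration of the idea in the question, a minimal sketch that hashes file contents with SHA-256 and reports duplicates. This is essentially what a user-space duplicate finder does; a deduplicating filesystem does the equivalent below the file API. The root path is a placeholder:

import hashlib, os

def sha256_of(path, bufsize=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

seen = {}                                         # digest -> first path seen
for dirpath, _, names in os.walk("/some/root"):   # hypothetical root
    for name in names:
        path = os.path.join(dirpath, name)
        digest = sha256_of(path)
        if digest in seen:
            print("same content:", path, "==", seen[digest])
        else:
            seen[digest] = path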
ZFS has supported deduplication since last month: http://blogs.oracle.com/bonwick/en_US/entry/zfs_dedup
Though I wouldn't call this a "common" filesystem (afaik, it is currently only supported by *BSD), it is definitely one worth looking at.
It would save space, but the time cost is prohibitive. The products you mention are already I/O bound, so the computational cost of hashing is not a bottleneck. If you hashed at the filesystem level, all I/O operations, which are already slow, would get worse.
NTFS has single instance storage.
NetApp has supported deduplication (that's what it's called in the storage industry) in the WAFL filesystem (yeah, not your common filesystem) for a few years now. This is one of the most important features found in enterprise filesystems today (and NetApp stands out because they support this on their primary storage as well, whereas other similar products support it only on their backup or secondary storage; they are too slow for primary storage).
The amount of duplicate data in a large enterprise with thousands of users is staggering. A lot of those users store the same documents, source code, etc. across their home directories. Reports of 50-70% of data being deduplicated are common, saving lots of space and tons of money for large enterprises.
All of this means that if you create any common filesystem on a LUN exported by a NetApp filer, then you get deduplication for free, no matter what filesystem is created in that LUN. Cheers.
btrfs supports online de-duplication of data at the block level, but an external tool is needed; I'd recommend duperemove.
It would require a fair amount of work to make this work in a file system. First of all, a user might be creating a copy of a file, planning to edit one copy while the other remains intact -- so when you eliminate the duplication, the hard link you created that way would have to provide COW (copy-on-write) semantics.
Second, the permissions on a file are often based on the directory into which that file's name is placed. You'd have to ensure that when you create your hidden hard link, that the permissions were correctly applied based on the link, not just the location of the actual content.
Third, users are likely to be upset if they make (say) three copies of a file on physically separate media to ensure against data loss from hardware failure, then find out that there was really only one copy of the file, so when that hardware failed, all three copies disappeared.
This strikes me as a bit like a second-system effect -- a solution to a problem long after the problem ceased to exist (or at least matter). With hard drives currently running less than US$100/terabyte, I find it hard to believe that this would save most people a whole dollar's worth of hard drive space. At that point, it's hard to imagine most people caring much.
There are file systems that do deduplication, which is sort of like this, but still noticeably different. In particular, deduplication is typically done on a basis of relatively small blocks of a file, not on complete files. Under such a system, a "file" basically just becomes a collection of pointers to de-duplicated blocks. Along with the data, each block will typically have some metadata for the block itself, that's separate from the metadata for the file(s) that refer to that block (e.g., it'll typically include at least a reference count). Any block that has a reference count greater than 1 will be treated as copy on write. That is, any attempt at writing to that block will typically create a copy, write to the copy, then store the copy of the block to the pool (so if the result comes out the same as some other block, deduplication will coalesce it with the existing block with the same content).
Many of the same considerations still apply though--most people don't have enough duplication to start with for deduplication to help a lot.
At the same time, especially on servers, deduplication at a block level can serve a real purpose. One really common case is dealing with multiple VM images, each running one of only a few choices of operating systems. If we look at the VM image as a whole, each is usually unique, so file-level deduplication would do no good. But they still frequently have a large chunk of data devoted to storing the operating system for that VM, and it's pretty common to have many VMs running only a few operating systems. With block-level deduplication, we can eliminate most of that redundancy. For a cloud server system like AWS or Azure, this can produce really serious savings.
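As a toy sketch of the block-level bookkeeping described above (purely illustrative; no real filesystem lays it out like this): files become lists of block hashes, the block store keeps one copy of each unique block plus a reference count, and a write to a shared block goes copy-on-write:

import hashlib

BLOCK = 4096
store = {}   # block hash -> {"data": bytes, "refs": int}

def put_block(data):
    h = hashlib.sha256(data).hexdigest()
    if h in store:
        store[h]["refs"] += 1                    # duplicate block: just bump the count
    else:
        store[h] = {"data": data, "refs": 1}
    return h

def write_file(content):
    return [put_block(content[i:i + BLOCK]) for i in range(0, len(content), BLOCK)]

def overwrite_block(file_blocks, index, new_data):
    old = file_blocks[index]
    store[old]["refs"] -= 1                      # copy-on-write: release the shared block...
    if store[old]["refs"] == 0:
        del store[old]
    file_blocks[index] = put_block(new_data)     # ...and store the modified copy

a = write_file(b"x" * 3 * BLOCK)
b = write_file(b"x" * 3 * BLOCK)                 # identical file: no new blocks stored
print(len(store), "unique block(s) for two 3-block files")
overwrite_block(b, 0, b"y" * BLOCK)              # modifying b no longer affects a
print(len(store), "unique block(s) after the copy-on-write")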

How many sub-directories should be put into a directory

At SO there has been much discussion about how many files in a directory are appropriate: on older filesystems, stay below a few thousand; on newer ones, below a few hundred thousand.
Generally the suggestion is to create sub-directories for every few thousand files.
So the next question is: what is the maximum number of sub-directories I should put into a directory? Nesting them too deep kills directory tree traversal performance. Is there such a thing as nesting them too shallow?
From a practicality standpoint, applications might not handle large numbers of directory entries well.
For example, Windows Explorer gets bogged down with several thousand directory entries (I've had Vista crash, but XP seems to handle it better).
Since you mention nesting directories, also keep in mind that there are limits to the length of fully qualified (with drive designator and path) filenames (see the Wikipedia 'filename' entry). This will vary with the operating system and file system (see the Wikipedia 'comparison of file systems' entry).
For Windows NTFS it is supposed to be 255; however, I have encountered problems with commands and API functions with fully qualified filenames at about 120 characters. I have also had problems with long path names on mapped network drives (at least with Vista and I.E. Explorer 7).
Also there are limitations on the nesting level of subdirectories. For example CD-ROM (ISO 9660) is limited to 8 directory levels (something to keep in mind if you would want to copy your directory structure to a CD-ROM, or another filesystem).
So there is a lot of inconsistency when you push the file system to extremes (while the file system may be able to handle it theoretically, apps and libraries may not).
It really depends on the OS you are using, as directory manipulations are done using system calls. For Unix-based OSes, i-node look-up algorithms are highly efficient and the number of files and folders in a directory does not matter. Maybe that's why there is no limit to it in Unix-based systems. However, on Windows, it varies from file system to file system.
Usually modern filesystems (like NTFS or ext3) don't have a problem with accessing files directly (i.e. if you are trying to open /foo/bar/baz.dat). Where you can run into problems is enumerating the subdirectories/files in a given directory (i.e. give me all the files/dirs from /foo). This can occur in multiple scenarios (for example while debugging, or during backup, etc.). I found that keeping the child count around a couple of hundred at most gave me acceptable response times.
Of course this varies from situation to situation, so do test :-)
My guess is as few as possible.
At the ISP I was working for (back in 2003) we had lots of user emails and web files. We structured them with MD5-hashed usernames, 3 levels deep (i.e. /home/a/b/c/abcuser). This resulted in maybe up to 100 users inside each third-level directory.
You can make a deeper structure, or keep user directories in a shallower structure, too. The best option is to try and see, but the smaller the directory count, the faster the lookup is.
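As a sketch of that kind of hashed fan-out (whether the level names come from the hash or from the username itself is my assumption, and the base path is a placeholder), the point is simply that the hash spreads users evenly so no single directory grows huge:

import hashlib, os

def user_dir(username, base="/home", levels=3):
    digest = hashlib.md5(username.encode("utf-8")).hexdigest()
    return os.path.join(base, *digest[:levels], username)

print(user_dir("abcuser"))   # e.g. /home/<h1>/<h2>/<h3>/abcuser, three hex digits of the hash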
I've come across a similar situation recently. We were using the file system to store serialized trade details. These would only be looked at infrequently and it wasn't worth the pain to store them in a database.
We found that Windows and Linux coped with a thousand or so files, but accessing them did get much slower - we organised them into sub-dirs in a logical grouping and this solved the problem.
It was also easier to grep them. Grepping through thousands of files is slower than changing to the correct sub-dir and grepping through a few hundred.
In the Windows API, the maximum path length is set at 260 characters. The Unicode functions extend this limit to 32,767 characters, which is used by the major file systems.
I found out the hard way that for UFS2 the limit is around 2^15 sub-directories. So while UFS2 and other modern filesystems work decently with a few hundred thousand files in a directory, it can only handle relatively few sub-directories. The non-obvious error message is "can't create link".
While I haven't tested ext2 I found various mailing list postings where the posters also had issues with more than 2^15 files on an ext2 filesystem.