Does Linux treat pages in read-only mmap files differently than "regular" pages in the page cache?

Let's say I have a file that's exactly 4096 bytes long.
(a) I open the file using open(2) and read its entire content. Close the file.
vs
(b) I open the file, mmap its fd with PROT_READ, read the entire content of the buffer, close the fd but do NOT release the buffer using munmap.
I assume that in both cases the file's single page will be put in the page cache (so a future attempt to read this file requires no disk I/O).
My question is -- when the system runs out of physical memory, and needs to drop some pages from the page cache, will it prefer to drop the "regular pages" (a) over the "mmaped pages" (b)?
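For concreteness, the two access patterns can be sketched in Perl roughly as follows; File::Map is assumed here as a convenience wrapper around a read-only (PROT_READ) mmap, and the file path is a placeholder:

    use strict;
    use warnings;
    use File::Map qw(map_file);   # assumed CPAN wrapper around mmap(2)

    my $path = '/tmp/exactly-4096-bytes.bin';   # hypothetical 4096-byte file

    # (a) plain read: the page ends up in the page cache, but the process
    #     keeps no mapping to it once the handle is closed.
    open my $fh, '<:raw', $path or die "open: $!";
    read $fh, my $buf, 4096;
    close $fh;

    # (b) read-only mapping: '<' maps the file read-only. Touching $map
    #     faults the page in, and the mapping stays alive after the
    #     underlying descriptor is closed, until $map goes out of scope
    #     (i.e. no explicit munmap).
    map_file my $map, $path, '<';
    my $first_page = substr $map, 0, 4096;   # forces the page to be read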

Related

INode File System, what is the extra space in a data block used for?

So, I am currently learning about inode-based file systems and have been asked to write a simple file system using inodes.
So far, I understand that there is an inode table that maps each inode to its data blocks through direct/indirect pointers.
Let's assume data gets written into a file and the data is stored in two blocks. Say each block is 512 bytes, and the file takes one full block and only 200 bytes of the second block. What happens with the rest of the space in that data block? Is it reserved for that file only, or do other files use this block?
Depending on the file system, that leftover space is most likely simply lost (internal fragmentation). I think the Reiser File System actually reclaimed this area, but I could be wrong.
Creating your own File System can be a challenging experience, but also an enjoyable experience. I have created a few myself and worked on another. If you are creating your own file system, you can have it do whatever you wish.
Look at the bottom of this page for a few that I am working on/with. The LeanFS in particular, uses Inodes as well. The SFS is a very simple file system. Each is well documented so that you can research and decide what you would like to do.
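To put a number on the example in the question, here is a tiny sketch (block and file sizes taken from the question) that computes how much slack is left in the last allocated block:

    use strict;
    use warnings;
    use POSIX qw(ceil);

    my $block_size = 512;          # bytes per data block, from the question
    my $file_size  = 512 + 200;    # one full block plus 200 bytes

    my $blocks = ceil($file_size / $block_size);       # blocks allocated
    my $slack  = $blocks * $block_size - $file_size;   # unused bytes

    printf "allocated %d blocks, %d bytes of slack in the last block\n",
           $blocks, $slack;

With 512-byte blocks and a 712-byte file, two blocks are allocated and 312 bytes of the second block sit unused unless the file later grows into them.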

Does IPFS provide block-level file copying feature?

Update 11-14-2019: Please see the following discussion for a feature request to IPFS: git-diff feature: Improve efficiency of IPFS for sharing updated file. Decrease file/block duplication
There is a .tar.gz file, file.tar.gz (~1 GB), which contains a data.txt file; it is stored in my IPFS repo and was pulled from another node (node-a).
I open the data.txt file and add a single character at a random location in the file (beginning of the file, middle of the file, or end of the file), then compress it again as file.tar.gz and store it in my IPFS repo.
When node-a wants to re-get the updated tar.gz file, a re-sync will take place.
I want to know whether, using IPFS, the entire 1 GB file will be synced, or whether there is a way by which only the parts of the file that changed (called the delta) get synced.
Similar question is asked for the Google Drive: Does Google Drive sync entire file again on making small changes?
The feature you are asking about is called "Block-Level File Copying". With this feature, when you make a change to a file, rather than copying the entire file from your hard drive to the cloud server again, only the parts of the file that changed (called the delta) get sent.
As far as I know, Dropbox, pCloud, and OneDrive offer block-level sync, though OneDrive only supports it for Microsoft Office documents.
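To illustrate the idea behind block-level (delta) detection, here is a minimal sketch that splits two versions of a file into fixed-size chunks, hashes each chunk, and reports which chunks differ. The file names and chunk size are made up for the example, and real tools (rsync, the IPFS chunker) use content-defined chunking rather than this naive fixed-offset scheme:

    use strict;
    use warnings;
    use Digest::SHA qw(sha256_hex);

    # Hypothetical file names; chunk size chosen arbitrarily for the sketch.
    my ($old_file, $new_file) = ('file.v1.tar.gz', 'file.v2.tar.gz');
    my $chunk_size = 256 * 1024;   # 256 KiB per chunk

    sub chunk_hashes {
        my ($path) = @_;
        open my $fh, '<:raw', $path or die "open $path: $!";
        my @hashes;
        while (read($fh, my $chunk, $chunk_size)) {
            push @hashes, sha256_hex($chunk);
        }
        close $fh;
        return \@hashes;
    }

    my ($old, $new) = (chunk_hashes($old_file), chunk_hashes($new_file));
    my $count = @$old > @$new ? scalar @$old : scalar @$new;
    for my $i (0 .. $count - 1) {
        my $same = defined $old->[$i] && defined $new->[$i]
                && $old->[$i] eq $new->[$i];
        printf "chunk %d: %s\n", $i, $same ? 'unchanged' : 'changed';
    }

Note that with a compressed archive, inserting a single character near the start shifts every byte after it, so a fixed-offset comparison like this would mark almost every later chunk as changed; that is exactly why content-defined chunking matters for this use case.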

MS Word (2007) - increased file size after removing content

MS Word (2007 in my case, but I had the same experience with 2010; I haven't used 2013 yet) surprises me with the file sizes it produces. I have a standard .docx of 96 kB; after changing one character (a 7 to a 6) and saving again, it was 101 kB. I had in mind that Word sometimes saves additional information, so I searched a bit and found that in the Office button menu (the round button in the upper left corner) there is Prepare and then Inspect Document. I chose to have the Properties removed, and also Headers and Footers. Then, after saving, the file size was 104 kB.
So, what is MS Word doing when saving documents after small changes or deleted content that makes the file size increase afterwards? And how can I get rid of this behaviour?
Word file sizes can increase if there's "dross" in the file: sometimes, a document becomes damaged and left-overs accumulate. If the damage is not critical, Word will work around it, but the "bad" information often remains in the file. Under some circumstances, Word encounters the problem every time it saves, which will cause file size to increase.
It can help to save the document to another file format, such as RTF, HTML or an earlier version of Word, and then open that file in Word again. Another thing you can try is to copy/paste the content into a new document WITHOUT any section breaks and WITHOUT the last paragraph mark (because "dross" often accumulates in the non-visible section information).
But these attempts should always be done on a COPY of the document because information can get lost in the dual conversion process.
According to support.microsoft.com/en-us/kb/111277, the file size of your Word document may increase unexpectedly in the following situations:
- The Allow Fast Saves option is turned on. A fast save appends the changes to the end of your document, which increases the size of the document. By contrast, when you turn off the Allow Fast Saves option and save the document, Word performs a full save, which incorporates all your revisions (instead of appending them). If you perform a full save after a file was fast saved, Word reduces the size of the file.
  Note: Even with the Allow Fast Saves option turned on, Word periodically performs a full save of your document. As a result, the file size of your document may change substantially between save operations.
- The option to Embed TrueType Fonts is selected. To check this, on the Tools menu, click Options and then click the Save tab.
- You are automatically saving versions of a document. On the File menu, click Versions. Check to see whether Automatically Save a Version on Close is selected.
- If you open a document from a previous version of Word, Word may temporarily allocate more disk space for the document than is actually necessary.

System.IO - Does BinaryReader/Writer read/write exactly what a file contains? (abstract concept)

I'm relatively new to C# and am attempting to adapt a text encryption algorithm I designed in wxMaxima into a Binary encryption program in C# using Visual Studio forms. Because I am new to reading/writing binary files, I am lacking in knowledge regarding what happens when I try to read or write to a filestream.
For example, instead of encrypting a text file as I've done in the past, say I want to encrypt an executable or any other form of binary file.
Here are a few things I don't understand:
When I open a file stream and use BinaryReader, will it read in an absolute duplicate of absolutely everything in the file? I want to be able to, for example, read in an entire file, delete the original file, then create a new file with the old name and write the entire binary stream back. Will this reproduce the original file exactly, or will there be some sort of corruption that must otherwise be accounted for?
Because it's an encryption program, I was hoping to add a feature that would low-level "format" the original file before deleting it, so it would be theoretically inaccessible even to someone combing through the physical data of a hard disk. If I use BinaryWriter to overwrite parts of the original file with gibberish, will it be put in the same spot on the hard disk, or will the file become fragmented and actually just redirect via the FAT to some other portion of the hard disk? Obviously there's no point in overwriting the original file with gibberish if it's not overwriting the original cluster on the hard disk.
For your first question: A BinaryReader is not what you want. The name is a bit misleading: it "Reads primitive data types as binary values in a specific encoding." You probably want a FileStream.
Regarding the second question: That will not be easy: please see the "How SDelete Works" section of SDelete for an explanation. Brief extract in case that link breaks in the future:
"Securely deleting a file that has no special attributes is relatively straight-forward: the secure delete program simply overwrites the file with the secure delete pattern. What is more tricky is securely deleting Windows NT/2K compressed, encrypted and sparse files, and securely cleansing disk free spaces.
Compressed, encrypted and sparse files are managed by NTFS in 16-cluster blocks. If a program writes to an existing portion of such a file, NTFS allocates new space on the disk to store the new data and, after the new data has been written, deallocates the clusters previously occupied by the file."

How to detect changing directory size in Perl

I am trying to find a way of monitoring directories in Perl, in particular the size of a directory, and upon detecting a change in directory size, perform a particular action.
The issue I have is with large files that require a noticeable amount of time to copy into this directory, i.e. > 100 MB. What happens (in Windows, not Unix) is that the system reserves enough disk space for the entire file, even though the copy is still in progress. This causes problems for me, because my script will try to perform an action on a file that has not finished copying over. I can easily detect directory size changes in Unix via 'du', but 'du' in Windows does not behave the same way.
Are there any accurate methods of detecting directory size changes in Perl?
Edit: Some points to clarify:
- My Perl script is only monitoring a particular directory, and upon detecting a new file or a new directory, performs an action on it. It is not copying any files; users on the network will be copying files into the directory I am monitoring.
- The problem occurs when a new file or directory appears (copied, not moved) that is significantly large (> 100 MB, but usually a couple of GB) and my program fires before this copy completes.
- In Unix I can easily 'du' to see that the file/directory in question is growing in size, and take the appropriate action
- In Windows the size is static, so I cannot detect this change
- opendir/readdir/closedir is not feasible, as some of the directories that appear may contain thousands of files, and I want to avoid the overhead of scanning them all on every poll.
Ideally I would like my program to be triggered on change, but I am not sure how to do this. As of right now it busy waits until it detects a change. The change in file/directory size is not in my control.
You seem to be working around the underlying issue rather than addressing it -- your program is not properly sending a notification when it is finished copying a file. Why not do that instead of using OS-specific mechanisms to try to indirectly determine when the operation is complete?
You can use Linux::Inotify2 or Win32::ChangeNotify to detect directory/file changes.
EDIT: File::ChangeNotify seems a better option (cross-platform & used by Catalyst)
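For example, a minimal File::ChangeNotify polling loop might look like the sketch below; the directory path and filter are placeholders, so check the module's documentation for the options available on your platform:

    use strict;
    use warnings;
    use File::ChangeNotify;

    # Watch a single directory (placeholder path) for any file changes.
    my $watcher = File::ChangeNotify->instantiate_watcher(
        directories => ['/path/to/incoming'],   # hypothetical directory
        filter      => qr/./,                   # match every file name
    );

    while (my @events = $watcher->wait_for_events) {
        for my $event (@events) {
            # $event->type is 'create', 'modify' or 'delete'
            printf "%s: %s\n", $event->type, $event->path;
        }
    }

From the script's point of view, wait_for_events simply blocks until there is something to report, which keeps the main loop simple even if the underlying watcher has to fall back to polling on some platforms.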
As I understand it, you are polling a directory with thousands of files. When you see a new file, there is an action that is taken on the file. This causes problems if the file is in use or still being copied, correct?
There are potentially several solutions:
1) Use flock to detect if the file is still in use by another process (test whether it works properly on your OS, file system, and Perl version); a small sketch follows this list.
2) Use a LockFile call on Windows. If it fails, the OS or another process is using that file.
3) Change the poll interval to a non-busy time on the server and take the directory offline while your process completes.
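As a rough sketch of option 1, the idea is to try a non-blocking exclusive lock and treat failure as "probably still being written". Whether the copying process actually holds a lock depends on your OS and file system, so treat this as a heuristic; the file name below is a placeholder:

    use strict;
    use warnings;
    use Fcntl qw(:flock);

    # Returns true if we could lock the file exclusively, i.e. it is
    # probably not being written by another process right now.
    sub looks_idle {
        my ($path) = @_;
        open my $fh, '<', $path or return 0;
        my $ok = flock($fh, LOCK_EX | LOCK_NB);
        flock($fh, LOCK_UN) if $ok;
        close $fh;
        return $ok;
    }

    my $idle = looks_idle('/path/to/incoming/bigfile.dat');   # hypothetical file
    print $idle ? "safe to process\n" : "still busy, try again later\n";

Even when the lock attempt succeeds, waiting briefly and trying the lock again is a cheap extra safeguard against racing the writer.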
Evaluating the size of a directory is something all but the most inexperienced Perl programmers should be able to do. You can write your own portable version of du in about 15 lines of code (a sketch follows this list) if you know about:
Either glob or opendir / readdir / closedir to iterate through the files in a directory
The filetest operators (-f file, -d file, etc.) to distinguish between regular files and directory names
The stat function or file size operator -s file to obtain the size of a file
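Putting those three pieces together, a minimal recursive directory-size function (no special handling for symlinks or unreadable directories) might look like this:

    use strict;
    use warnings;

    # Recursively sum the sizes of regular files under a directory.
    sub dir_size {
        my ($dir) = @_;
        my $total = 0;
        opendir my $dh, $dir or die "opendir $dir: $!";
        for my $entry (readdir $dh) {
            next if $entry eq '.' || $entry eq '..';
            my $path = "$dir/$entry";
            if (-f $path) {
                $total += -s $path;          # size of a regular file
            }
            elsif (-d $path) {
                $total += dir_size($path);   # recurse into subdirectories
            }
        }
        closedir $dh;
        return $total;
    }

    printf "%d bytes\n", dir_size($ARGV[0] // '.');

That covers the 'du'-style size check; on its own it will not tell you whether a file is still growing, which is the Windows-specific problem described above (the full size is reserved up front).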
There is a nice module called File::Monitor; it will detect new files, deleted files, changes in size, and any other attribute that can be checked with stat. It will then report those files for you.
http://metacpan.org/pod/File::Monitor
You set up a baseline scan, then set up a callback for each kind of change you are looking for; new changes can then be seen via:
    use File::Monitor;

    my $monitor = File::Monitor->new;

    $monitor->watch( {
        name     => 'somedir',
        recurse  => 1,
        callback => {
            files_created => sub {
                my ($name, $event, $change) = @_;
                # Do stuff
            },
        },
    } );

    $monitor->scan;   # first scan sets the baseline; later scans fire the callbacks
If you need to go deeper than one level, just configure the recursion to whatever depth you need. Once the scan finds new files, you can trigger your application to do what you want with those files.