How can we manage non-code files in TFS for designers, etc.?

Projects normally include a set of non-source-code files, such as interface images (PSDs, JPGs, ...). How can we manage these kinds of files with TFS, and how can graphic designers check their image files in and out to use them in applications like Photoshop?

You can simply add binary files (PSD, JPG, etc.) to your tree, with the following caveats:
Large files take more space on the server. A quote from http://social.msdn.microsoft.com/Forums/en-US/tfsversioncontrol/thread/6f642d0f-5459-4a14-a19d-ede34713bcf4 :
TFS does handle large (> 16mb) files differently. It does not perform Delta storage but instead stores a complete copy of each version. This is an optimization to make check-ins faster for those large files. There is no difference between text files and binary files. Small ones are Delta'd, large ones are Stored.
Large files are slower to download (see the same link above).
If there is a conflict (i.e. two people modify the same binary file at the same time), one of them has to resolve the conflict entirely by hand, e.g. by loading all three image versions in the image editor, comparing them, and merging the changes manually.

Why does the moodledata directory have this structure?

I know Moodle's internal files, such as uploaded images, are stored in the moodledata directory.
Inside, there are several directories:
moodledata/filedir/1c/01/1c01d0b6691ace075042a14416a8db98843b0856
moodledata/filedir/63/
moodledata/filedir/63/89/
moodledata/filedir/63/89/63895ece79c4a91666312d0a24db82fe3017f54d
moodledata/filedir/63/3c/
moodledata/filedir/63/37/
moodledata/filedir/63/a7/
What are these hashes?
What are the reasons behind this design, as opposed to, for example, WordPress's /year/month/file.jpg structure?
https://docs.moodle.org/dev/File_API_internals#File_storage_on_disk
Simple answer - files are stored based on the hash of their content (inspired by the way Git stores files internally).
This means that if you have the same file in multiple places (e.g. the same PDF or image in multiple courses), it is stored only once on the disk, even if the original filename is different.
On real sites this can involve a huge reduction in disk usage (obviously dependent on how much duplication there is on your site).
Moodledata files are stored according to the SHA1 hash of their content, in order to prevent duplication of content (for example, when the same file is uploaded twice under different names).
For further explanation of how to handle such files, you can read the official documentation of the File API:
https://docs.moodle.org/dev/File_API_internals
especially the "File storage on disk" part.
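To make the layout concrete, here is a minimal Python sketch (not Moodle's actual code) of content-addressed storage: the SHA1 of the file's bytes becomes its name, and the first two pairs of hex digits become two directory levels, which also keeps any single directory from growing too large:

import hashlib, os, shutil

def store(filedir: str, path: str) -> str:
    # Hash the content; identical content always maps to the same location.
    sha1 = hashlib.sha1(open(path, "rb").read()).hexdigest()
    dest_dir = os.path.join(filedir, sha1[0:2], sha1[2:4])
    dest = os.path.join(dest_dir, sha1)
    if not os.path.exists(dest):              # duplicates are stored only once
        os.makedirs(dest_dir, exist_ok=True)
        shutil.copyfile(path, dest)
    return dest

Storing the same PDF under two different course names would call store() twice but write the bytes to disk only once.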

I need to back up database files. I need something like GitHub that is happy with hundreds of gigabytes of data

I want a source-controlled environment for a fairly large amount of database data, in text form, before it's loaded into the DBMS. We've been using GitHub and it's great, but GitHub expects a repository to be less than 1 gigabyte and we have hundreds.
It could be in CVS or SVN, but tracking versions is important. The data is very static and is accessed only at low rates, say once a week for parts of it, once a month for more.
Any suggested places/services that do this? It doesn't have to be free, we'll happily pay a reasonable amount.
I maintain that this amount of data is incompatible with a version control system, which is made to record history, i.e. the evolution of mostly text files and small binary files.
It is certainly not compatible with a distributed VCS, where every clone copies the entire repository.
You need to look at cloud services for this type of storage.
The OP protests (downvote), stating that:
The files would be normal ASCII, except that GitHub has such small file-size limits that I ran them through ZIP compression.
They rarely change, and when the contents change, it's just a tiny number of lines within the file.
It's exactly what version control is about. Which 0.005% of the ASCII changed? Who changed it? When?
I maintain that:
hundreds of gigabytes are incompatible with most source control repository providers out there (it would even be incompatible with most internal enterprise repositories, and I am in a large company);
putting the files in a ZIP archive isn't practical, because a version control tool then cannot record the deltas between versions.
You need to keep separate:
the data (stored "elsewhere", as a large collection of plain-text files, certainly not on GitHub);
the metadata you want (author, date of modification), stored in a regular Git repo in association with "shell" data (i.e. files that are really "references", or a kind of "symlink", to the actual files kept elsewhere).
One system based on Git that provides this is git-annex, which uses your own cloud storage with (where implemented) the git-annex assistant: see its roadmap.
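The pointer-file idea is easy to illustrate. Below is a rough Python sketch of the concept only, not git-annex's actual format or mechanism, and the STORE path is hypothetical. The repository tracks a tiny text file holding a content hash, while the bulky bytes live in external storage keyed by that hash, so Git still records who changed what, and when:

import hashlib, os, shutil

STORE = "/mnt/bulk-storage"    # hypothetical external location, outside the repo

def add_pointer(repo_dir: str, path: str) -> str:
    data = open(path, "rb").read()
    key = hashlib.sha256(data).hexdigest()
    shutil.copyfile(path, os.path.join(STORE, key))   # big bytes go elsewhere
    pointer = os.path.join(repo_dir, os.path.basename(path) + ".ptr")
    with open(pointer, "w") as f:                     # tiny file is committed to Git
        f.write(key + "\n")
    return pointer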

Large Files in Source Control (TFS)

Recently at the office we have been talking about placing large files into our TFS repository. The files themselves are XML, usually 100-200MB in size, and sometimes as large as 1GB. We use them as data for automated testing and they are mostly static (one gets a minor tweak every year or so). Anyway, there is a notion that putting files like this into the repository is a no-no because they are "big" and that will make things "slow" (outside of the original check-in/out) but we don't really have any evidence to back this up.
So my question is: what are the pros/cons/implications of putting large static files into a source code repository like TFS (or SVN, Git, etc., for that matter)? Is it OK? Will it "fill up the server" or have some other dire consequence?
tl;dr: TFS is designed to handle large files gracefully. The largest hurdle you'll have to face is network bandwidth to upload/download the files. The second issue is that of storage space on the server. Assuming you've considered these two issues, you shouldn't have any other problems.
Network bandwidth: There is very little overhead in checking in or getting files, it should be as fast as a typical HTTP upload or download. If your clients are remote from the server, network-wise, they may benefit by having a TFS source control proxy on their local network to speed up downloads.
Note that unlike some version control systems, TFS does not compute and transmit deltas when uploading or downloading new content. That is to say, if a client had revision 4 of a large text file, and revision 5 had added a few lines at the end, some version control tools optimize this experience to only send the changed lines. TFS does not do this optimization, so if your files change frequently, clients will need to download the entirety of the file each time.
Server storage: Disk space on the server is fairly straightforward - you'll need enough space to hold the files, there's little overhead beyond that. TFS will not slow down just because your repository contains large files.
If these files get modified frequently, you will need to account for the disk space used by the revisions, also. TFS stores "deltas" between file revisions - that is, a binary difference between two versions. So if the file's contents change minimally between revisions as in the typical use case with text files, the storage cost should be inexpensive. However, if the entirety of the contents change as would be typical with binary files like images or DLLs, then you'll need enough disk space to store each revision. (Of course, you can destroy previous revisions in order to regain that space.)
One note on deltas in TFS: to reduce overhead at check-in time, the deltas between revisions are not computed immediately, there's a background "deltafication" job that runs nightly to compute the deltas to trim space. Until that point, each revision is stored in its entirety in the database. So if you have a very large text file with a lot of revisions happening daily, your disk space requirements will need to take this into account.
Client storage: Clients will need enough disk space to hold these files as well (although only at the revision they've downloaded). This can be mitigated in your workspace mappings, such that the large files are cloaked (or otherwise not included in your workspace) if they're not needed.
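For example, cloaking a folder so it is never downloaded is a one-line workspace change (the path below is hypothetical):

tf workfold /cloak $/TeamProject/TestData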
Caveat: Getting Historic Versions: If you find yourself requesting historical versions of large files frequently (for example: I want an ISO image seven changesets ago), then you're going to make the server apply the delta chain to get back to that revision. If you have multiple clients doing this concurrently, this could tax your memory.
If those files were constantly changing and their deltas were big, I would eventually expect a penalty in overall TFS performance. You clearly state that this is not the case, so, provided that your SQL Server has the capacity to house the storage, you should be able to proceed without any implications. A minor downside you may experience is when you're constructing new workspaces, where you would have to pull those files from the repository. Unfortunately this also happens during TFS Build, so it's possible that your builds will now take that much longer. How severe this is depends greatly on your network configuration and stability.
The biggest problem (inconvenience) you'll have is having to download these massive files to all your workspaces, or map them out. Consider putting them into a separate team project to make this easier (unless you want to include them in branches, in which case I'd put up with keeping everything in one team project).
If you have control of the XML format, then also consider a few tweaks to make the files smaller. This will improve the performance of store/get operations and also loading speed: shorten element and attribute names, reduce the number of decimal places you output for floating-point numbers, and so on. You will find that simple schemes like this can knock many megabytes off GB-sized files, and it's easy to write a quick XSLT transform or a bit of code to convert the files to the new format; a sketch follows.
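A rough Python sketch of such a shrinking pass, using only the standard library; the tag map and precision are hypothetical and would be tailored to your schema:

import xml.etree.ElementTree as ET

TAG_MAP = {"measurement": "m", "temperature": "t"}   # hypothetical renames

def shrink(in_path: str, out_path: str, precision: int = 3) -> None:
    tree = ET.parse(in_path)
    for elem in tree.iter():
        elem.tag = TAG_MAP.get(elem.tag, elem.tag)   # shorten known element names
        if elem.text:
            try:
                elem.text = str(round(float(elem.text), precision))
            except ValueError:
                pass                                  # not a number, leave as-is
    tree.write(out_path)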

A custom directory/folder merge tool

I am thinking about developing a custom directory/folder merge tool, partly to learn functional programming and partly to scratch a very personal itch.
I usually work on three different computers and I tend to accumulate lots of files (text, video, audio) locally and then painstakingly merge them for backup purposes. I am pretty sure I have dupes and unwanted files lying around wasting space. I am moving to a cloud backup solution as a secondary backup source and I want to save as much space as possible by eliminating redundant files.
I have a complex, deeply nested directory structure, and I want a tool that automatically walks down the folder tree and performs the merge. Another complication is that I use a mix of Linux and Windows, and many of my files have spaces in their names...
My initial thought was that I need to generate hashes for every file and compare hashes rather than file names (folder names, as well as file contents, could differ between source and target). Is RIPEMD-160 a good balance between performance and collision avoidance? Or is SHA-1 enough? Is SHA-256/512 overkill?
Which functional programming environment comes with a set of ready-made libraries for generating these hashes? I am leaning towards OCaml...
Check out the Unison file synchronizer.
I don't use it myself, but I have heard quite a few positive reviews. It is mature software built on a solid theoretical foundation.
Also, it is written in OCaml.
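Whatever tool or language you settle on, the hash-and-compare core the question describes is small. Here is a sketch in Python rather than OCaml, purely to illustrate the logic; the standard hashlib module provides SHA-1 and SHA-256 (RIPEMD-160 availability depends on the underlying OpenSSL build), and for deduplication either SHA choice gives collision odds you can safely ignore:

import hashlib, os
from collections import defaultdict

def file_hash(path: str, algo: str = "sha256") -> str:
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):   # stream in 1 MB chunks
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(root: str) -> dict:
    groups = defaultdict(list)                     # hash -> list of paths
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:                     # spaces in names are harmless here
            path = os.path.join(dirpath, name)
            groups[file_hash(path)].append(path)
    return {h: ps for h, ps in groups.items() if len(ps) > 1}

Comparing by content hash, as the question suggests, makes renamed or moved duplicates easy to spot regardless of path or spaces in names.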

How do you deal with lots of small files?

A product that I am working on collects several thousand readings a day and stores them as 64k binary files on an NTFS partition (Windows XP). After a year in production there are over 300,000 files in a single directory, and the number keeps growing. This has made accessing the parent/ancestor directories from Windows Explorer very time-consuming.
I have tried turning off the indexing service but that made no difference. I have also contemplated moving the file content into a database/zip files/tarballs but it is beneficial for us to access the files individually; basically, the files are still needed for research purposes and the researchers are not willing to deal with anything else.
Is there a way to optimize NTFS or Windows so that it can work with all these small files?
NTFS will actually perform fine with many more than 10,000 files in a directory, as long as you tell it to stop creating alternative file names compatible with 16-bit Windows platforms. By default, NTFS automatically creates an '8 dot 3' file name for every file that is created. This becomes a problem when there are many files in a directory, because Windows looks at the existing files to make sure the name it is creating isn't already in use. You can disable '8 dot 3' naming by setting the NtfsDisable8dot3NameCreation registry value to 1. The value is found in the HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\FileSystem registry path. It is safe to make this change, as '8 dot 3' names are only required by programs written for very old versions of Windows.
A reboot is required before this setting will take effect.
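If you'd rather script the change than edit the registry by hand, here is a minimal Python sketch of the same edit (it must run as Administrator; regedit or reg.exe works just as well):

import winreg

# Open the FileSystem key with write access and set the value to 1.
with winreg.OpenKey(
        winreg.HKEY_LOCAL_MACHINE,
        r"SYSTEM\CurrentControlSet\Control\FileSystem",
        0,
        winreg.KEY_SET_VALUE) as key:
    winreg.SetValueEx(key, "NtfsDisable8dot3NameCreation", 0,
                      winreg.REG_DWORD, 1)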
NTFS performance severely degrades after 10,000 files in a directory. What you do is create an additional level in the directory hierarchy, with each subdirectory having 10,000 files.
For what it's worth, this is the approach that the SVN folks took in version 1.5. They used 1,000 files as the default threshold.
The performance issue is being caused by the huge amount of files in a single directory: once you eliminate that, you should be fine. This isn't a NTFS-specific problem: in fact, it's commonly encountered with user home/mail files on large UNIX systems.
One obvious way to resolve this issue is to move the files into folders whose names are based on the file name. Assuming all your files have file names of similar length, e.g. ABCDEFGHI.db, ABCEFGHIJ.db, etc., create a directory structure like this:
ABC\
    DEF\
        ABCDEFGHI.db
    EFG\
        ABCEFGHIJ.db
Using this structure, you can quickly locate a file based on its name. If the file names have variable lengths, pick a maximum length, and prepend zeroes (or any other character) in order to determine the directory the file belongs in.
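A minimal sketch of that mapping in Python; the piece width and nesting depth are arbitrary choices, and short names are padded with zeroes per the suggestion above:

import os

def shard_by_prefix(root: str, filename: str, width: int = 3, depth: int = 2) -> str:
    # 'ABCDEFGHI.db' -> root/ABC/DEF/ABCDEFGHI.db
    stem = os.path.splitext(filename)[0]
    stem = stem.rjust(width * depth, "0")         # pad short names with zeroes
    parts = [stem[i * width:(i + 1) * width] for i in range(depth)]
    return os.path.join(root, *parts, filename)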
I have seen vast improvements in the past from splitting the files up into a nested hierarchy of directories by, e.g., first then second letter of filename; then each directory does not contain an excessive number of files. Manipulating the whole database is still slow, however.
I have run into this problem lots of times in the past. We tried storing by date, zipping files below the date level so you don't have lots of small files, and so on. All of them were band-aids for the real problem of storing the data as lots of small files on NTFS.
You can go to ZFS or some other file system that handles small files better, but still stop and ask if you NEED to store the small files.
In our case we eventually went to a system where all of the small files for a given date were appended, TAR-fashion, with simple delimiters for parsing. The disk files went from 1.2 million to under a few thousand. They actually loaded faster, because NTFS can't handle small files very well and the drive was better able to cache a 1MB file anyway. In our case, the access and parse time to find the right part of the file was minimal compared to the actual storage and maintenance of the stored files.
You could try using something like Solid File System.
This gives you a virtual file system that applications can mount as if it were a physical disk. Your application sees lots of small files, but just one file sits on your hard drive.
http://www.eldos.com/solfsdrv/
If you can calculate names of files, you might be able to sort them into folders by date, so that each folder only have files for a particular date. You might also want to create month and year hierarchies.
Also, could you move files older than say, a year, to a different (but still accessible) location?
Finally (and again, this requires you to be able to calculate names), you'll find that directly accessing a file is much faster than trying to open it via Explorer. For example, running
notepad.exe "P:\ath\to\your\filen.ame"
from the command line should actually be pretty quick, assuming you know the path of the file you need without having to get a directory listing.
One common trick is to simply create a handful of subdirectories and divvy up the files.
For instance, Doxygen, an automated code documentation program which can produce tons of html pages, has an option for creating a two-level deep directory hierarchy. The files are then evenly distributed across the bottom directories.
Aside from placing the files in sub-directories...
Personally, I would develop an application that keeps the interface to that folder the same, i.e. all files are presented as individual files. Then, in the background, the application would take these files and combine them into larger files (and since the sizes are always 64k, getting the data you need should be relatively easy), getting rid of the mess you have. A sketch of the idea follows.
That way you can still make it easy for the researchers to access the files they want, while you keep more control over how everything is structured.
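A minimal sketch of the pack-and-index idea in Python; the function names and JSON index format are illustrative, not a recommendation of a specific container. Each file's bytes are appended to one archive while (offset, length) is recorded per name, so an individual reading can still be fetched on demand:

import json, os

def pack(src_dir: str, archive: str, index: str) -> None:
    entries = {}
    with open(archive, "wb") as out:
        for name in sorted(os.listdir(src_dir)):
            data = open(os.path.join(src_dir, name), "rb").read()
            entries[name] = (out.tell(), len(data))   # remember where it landed
            out.write(data)
    with open(index, "w") as f:
        json.dump(entries, f)

def fetch(archive: str, index: str, name: str) -> bytes:
    offset, length = json.load(open(index))[name]
    with open(archive, "rb") as f:
        f.seek(offset)                                # jump straight to the reading
        return f.read(length)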
Having hundreds of thousands of files in a single directory will indeed cripple NTFS, and there is not really much you can do about that. You should reconsider storing the data in a more practical format, like one big tarball or in a database.
If you really need a separate file for each reading, you should sort them into several subdirectories instead of keeping them all in the same directory. You can do this by creating a hierarchy of directories and placing the files in different ones depending on the file name. This way you can still store and load your files knowing just the file name.
The method we use is to take the last few characters of the file name, reverse them, and create one-character directories from them. Consider the following files, for example:
1.xml
24.xml
12331.xml
2304252.xml
you can sort them into directories like so:
data/1.xml
data/24.xml
data/1/3/3/12331.xml
data/2/5/2/4/0/2304252.xml
This scheme will ensure that you will never have more than 100 files in each directory.
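A sketch of that mapping in Python: the scheme stops two characters short of the full reversed name, and since those two remaining characters can only take 100 values (00-99), each leaf directory is bounded at 100 files:

import os

def shard_reversed(root: str, filename: str) -> str:
    # '12331.xml' -> root/1/3/3/12331.xml ; '24.xml' -> root/24.xml
    stem = os.path.splitext(filename)[0]
    parts = list(reversed(stem))[:max(0, len(stem) - 2)]
    return os.path.join(root, *parts, filename)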
Consider pushing them to another server that uses a filesystem friendlier to massive quantities of small files (Solaris w/ZFS for example)?
If there are any meaningful, categorical, aspects of the data you could nest them in a directory tree. I believe the slowdown is due to the number of files in one directory, not the sheer number of files itself.
The most obvious, general grouping is by date, and gives you a three-tiered nesting structure (year, month, day) with a relatively safe bound on the number of files in each leaf directory (1-3k).
Even if you are able to improve the filesystem/file-browser performance, it sounds like this is a problem you will run into again in another 2 or 3 years... just looking at a list of 0.3-1 million files is going to incur a cost, so it may be better in the long term to find ways to look at only smaller subsets of the files.
Using tools like 'find' (under cygwin, or mingw) can make the presence of the subdirectory tree a non-issue when browsing files.
Rename the folder each day with a time stamp.
If the application is saving the files into C:\Readings, then set up a scheduled task to rename C:\Readings at midnight and create a new empty folder.
Then you will get one folder for each day, each containing several thousand files.
You can extend the method further to group by month. For example, C:\Readings becomes C:\Archive\September\22.
You have to be careful with your timing to ensure you are not trying to rename the folder while the product is saving to it.
To create a folder structure that will scale to a large unknown number of files, I like the following system:
Split the filename into fixed-length pieces, and then create nested folders for each piece except the last.
The advantage of this system is that the depth of the folder structure grows only with the length of the filename. So if your files are automatically generated in a numeric sequence, the structure is only as deep as it needs to be.
12.jpg -> 12.jpg
123.jpg -> 12\123.jpg
123456.jpg -> 12\34\123456.jpg
This approach does mean that folders contain files and sub-folders, but I think it's a reasonable trade off.
And here's a beautiful PowerShell one-liner to get you going!
$s = '123456'
# Insert '\' after each two-character piece (the lookahead skips the final piece),
# drop everything after the last '\', then append the full name: 12\34\123456
-join (( $s -replace '(..)(?!$)', '$1\' -replace '[^\\]*$','' ), $s )