How to tune EGit for large repositories?

The problem:
I find EGit great and use it intensively, but it can be incredibly slow. It gets frustrating when it takes several minutes to complete operations that the C implementation of Git (CGit) finishes in less than a couple of seconds.
All operations are significantly slower than in CGit. For example, switching branches takes tens of seconds compared to near instant, and a rebase can take several minutes compared to less than a couple of seconds.
Some details:
History size: 10114 commits as reported with: git rev-list HEAD --count
Current Working directory size: 63.7 MB
Current .git size: 77.4 MB
Largest file size: 4.0 MB
OS: Linux - CentOS 5.5
File System: ext3
JVM: Oracle - Java(TM) SE Runtime Environment (build 1.7.0_21-b11)
EGit and JGit version: 3.0.0.201306101825-r
I was previously running 2.3 but did not notice any change in performance after upgrading.
Could suitable window cache settings help:
I found the following quote in JGit's bugzilla here:
...EGit had to expose UI to allow users to configure it when working on
bigger repositories.
That sounds like it fits my case. So I looked around in Eclipse and, under Window -> Preferences -> Team -> Git, found the Git Window Cache settings.
But how do I use them?
What do the different controls actually do? Has anyone had any success in getting EGit to be more responsive by using them?

Recommendations
As Matthias Sohn suggested, the Window cache limit appears to be the most significant of these parameters.
For me, increasing it from "10 m" to "500 m" made a huge difference to how responsive EGit was.
Details of each parameter
From the source code† of WindowCacheConfig.java:
Window size
packedGitWindowSize: size in bytes of a single window mapped or read in from the pack file
Default: 8 k
Window cache limit
packedGitLimit: maximum number of bytes of heap memory to dedicate to caching pack file data.
Default: 10 m
Delta base cache limit
deltaBaseCacheLimit: maximum number of bytes to cache in delta base cache for inflated, recently accessed objects, without delta chains.
Default: 10 m
Stream File Threshold
streamFileThreshold: the size threshold beyond which objects must be streamed.
Objects smaller than this size can be obtained as a contiguous byte array, while objects bigger than this size require using an ObjectStream.
Default: 50 m
Use virtual memory mapping
packedGitMMAP: true enables use of Java NIO virtual memory mapping for windows; false reads entire window into a byte[] with standard read calls.
Default: Unchecked
Not presented on the preferences page
packedGitOpenFiles: maximum number of streams to open at a time. Open packs count against the process limits.
Default: 128
† Thanks to Jens Theeß, whose comment on Matthias Sohn's answer contains a pointer to the source code.
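If you drive JGit outside Eclipse (scripts, headless builds), the same limits can be applied programmatically. Here is a minimal sketch, assuming the org.eclipse.jgit.storage.file.WindowCacheConfig class and its install() method as shipped with JGit 3.x; the 500 MiB value mirrors the recommendation above, the rest stay at their defaults:

import org.eclipse.jgit.storage.file.WindowCacheConfig;

public class WindowCacheTuning {
    public static void main(String[] args) {
        WindowCacheConfig cfg = new WindowCacheConfig();
        cfg.setPackedGitWindowSize(8 * 1024);          // "Window size": 8 KiB per mapped window
        cfg.setPackedGitLimit(500L * 1024 * 1024);     // "Window cache limit": 500 MiB of heap for pack data
        cfg.setDeltaBaseCacheLimit(10 * 1024 * 1024);  // "Delta base cache limit": 10 MiB
        cfg.setStreamFileThreshold(50 * 1024 * 1024);  // "Stream file threshold": stream objects >= 50 MiB
        cfg.setPackedGitMMAP(false);                   // "Use virtual memory mapping": unchecked
        cfg.setPackedGitOpenFiles(128);                // not exposed on the preferences page
        cfg.install();                                 // apply to the JVM-wide window cache
    }
}

JGit can also populate such a config object from a git configuration (keys such as core.packedGitLimit) via WindowCacheConfig.fromConfig(...), though whether a given tool reads those keys is version-dependent, so check the documentation for your JGit version.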

EGit 3.5.0 will bring a huge performance fix for large repositories - without "tuning" anything. See https://bugs.eclipse.org/bugs/show_bug.cgi?id=440722
You can use the EGit nightly build update site to get the fix immediately:
http://download.eclipse.org/egit/updates-nightly

Increase the window cache limit to a larger value; it defines the size of JGit's cache used to map pack files into memory.
Do you run gc (either from EGit or native git) on your repository? JGit/EGit don't run it automatically (yet). You can check the number of loose objects in EGit's Repositories view (click "Properties" and select the "Statistics" tab). A JGit gc sketch follows this list.
How many commits does your repository have? (Age in years doesn't say anything, since a repository could be 10 years old with either 2 commits or 2 million commits.)
Which OS / filesystem are you using?
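For the gc point above, here is a minimal sketch of triggering a garbage collection through JGit's porcelain API (assuming JGit 3.x's Git.open(...) and gc() commands; "/path/to/repo" is a placeholder, and running native git gc in the working directory achieves the same effect):

import java.io.File;
import org.eclipse.jgit.api.Git;

public class RepoGc {
    public static void main(String[] args) throws Exception {
        Git git = Git.open(new File("/path/to/repo"));  // open by working-tree directory
        git.gc().call();                                // pack loose objects, prune unreachable ones
        git.getRepository().close();
    }
}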

Related

I need to backup database files. Need something like github that is happy with 100s of gigabytes of data

I want a source-controlled environment for a fairly large amount of database data, in text, before it's loaded into the DBMS. We've been using GitHub and it's great. But they expect a repository to be less than 1 gigabyte and we have hundreds.
It could be in CVS or SVN, but tracking versions is important. The data is very static and is accessed only at low rates, say once a week for parts of it, once a month for more.
Any suggested places/services that do this? It doesn't have to be free, we'll happily pay a reasonable amount.
I can confirm that this amount of data is incompatible with a version control system (which is made to record the history, i.e. the evolution, of mostly text files and small binary files).
It is certainly not compatible with a distributed VCS, where any clone would copy the entire repository.
You need to look at cloud services for this type of storage.
The OP protests (downvote), stating that:
They would be normal ASCII except that GitHub has such small file size limits that I ran them through ZIP compression.
They rarely change, and when the contents change, it's just a tiny number of lines within the file.
It's exactly what version control is about. Which 0.005% of the ASCII changed? Who changed it? When?
I maintain that:
hundreds of gigabytes is incompatible with most source control repo providers out there (it would even be incompatible with most internal enterprise repos, and I am in a large company)
putting them in a zip file isn't practical, in that a version control system wouldn't be able to record the deltas.
You need to keep separate:
the data (stored "elsewhere" as a large collection of plain text files, certainly not on GitHub)
the metadata you want (author, date of modification), stored in a regular git repo in association with "shell" data (i.e. files which are actually "references", or a kind of "symlink", to the actual files kept elsewhere)
One system, based on Git, that provides this is git-annex, which uses your own cloud storage with (if implemented) the git-annex assistant: see its roadmap.

Data management in matlab versus other common analysis packages

Background:
I am analyzing large amounts of data using an object-oriented composition structure for sanity and easy analysis. Oftentimes the highest level of my OO structure is an object that is about 2 GB when saved. Loading the data into memory is not always an issue, and populating sub-objects and then higher-level objects based on their content is much more memory efficient (in terms of the Java heap) than just loading a lot of MAT-files directly.
The Problem:
Saving these objects that are > 2 GB will often fail. It is a somewhat well-known problem that I have gotten around by deleting a number of sub-objects until the total size is below 2-3 GB. This happens regardless of how powerful the computer is; a machine with 16 GB of RAM and 8 cores will still fail to save the objects correctly. Back-versioning the save does not help either.
Questions:
Is this a problem that others have solved somehow in MATLAB? Is there an alternative that I should look into that still has a lot of high level analysis and will NOT have this problem?
Questions welcome, thanks.
I am not sure this will help, but here: do you make sure to use a recent version of the MAT-file format? Check, for instance, save. Quoting from the page:
'-v7.3' 7.3 (R2006b) or later Version 7.0 features plus support for data items greater than or equal to 2 GB on 64-bit systems.
'-v7' 7.0 (R14) or later Version 6 features plus data compression and Unicode character encoding. Unicode encoding enables file sharing between systems that use different default character encoding schemes.
Also, could your object by any chance be, or contain, a graphics handle object? In that case, it is wise to use hgsave.

Large Files in Source Control (TFS)

Recently at the office we have been talking about placing large files into our TFS repository. The files themselves are XML, usually 100-200MB in size, and sometimes as large as 1GB. We use them as data for automated testing and they are mostly static (one gets a minor tweak every year or so). Anyway, there is a notion that putting files like this into the repository is a no-no because they are "big" and that will make things "slow" (outside of the original check-in/out) but we don't really have any evidence to back this up.
So my question is, what are the pros / cons / implications of putting large static files into a source code repository like TFS (or SVN, Git, etc. for that matter) Is it OK? Will it "fill up the server" or have some other dire consequence?
tl;dr: TFS is designed to handle large files gracefully. The largest hurdle you'll have to face is network bandwidth to upload/download the files. The second issue is that of storage space on the server. Assuming you've considered these two issues, you shouldn't have any other problems.
Network bandwidth: There is very little overhead in checking in or getting files, it should be as fast as a typical HTTP upload or download. If your clients are remote from the server, network-wise, they may benefit by having a TFS source control proxy on their local network to speed up downloads.
Note that unlike some version control systems, TFS does not compute and transmit deltas when uploading or downloading new content. That is to say, if a client had revision 4 of a large text file, and revision 5 had added a few lines at the end, some version control tools optimize this experience to only send the changed lines. TFS does not do this optimization, so if your files change frequently, clients will need to download the entirety of the file each time.
Server storage: Disk space on the server is fairly straightforward - you'll need enough space to hold the files, there's little overhead beyond that. TFS will not slow down just because your repository contains large files.
If these files get modified frequently, you will need to account for the disk space used by the revisions, also. TFS stores "deltas" between file revisions - that is, a binary difference between two versions. So if the file's contents change minimally between revisions as in the typical use case with text files, the storage cost should be inexpensive. However, if the entirety of the contents change as would be typical with binary files like images or DLLs, then you'll need enough disk space to store each revision. (Of course, you can destroy previous revisions in order to regain that space.)
One note on deltas in TFS: to reduce overhead at check-in time, the deltas between revisions are not computed immediately, there's a background "deltafication" job that runs nightly to compute the deltas to trim space. Until that point, each revision is stored in its entirety in the database. So if you have a very large text file with a lot of revisions happening daily, your disk space requirements will need to take this into account.
Client storage: Clients will need to have enough disk space to contain these files also (although only at the revision that they've downloaded.) This can be mitigated in your workspace mappings such that the large files are cloaked (or otherwise not included in your workspace) if they're not needed.
Caveat: Getting Historic Versions: If you find yourself requesting historical versions of large files frequently (for example: I want an ISO image seven changesets ago), then you're going to make the server apply the delta chain to get back to that revision. If you have multiple clients doing this concurrently, this could tax your memory.
If those files were constantly changing and their deltas were big, I would eventually expect a penalty in overall TFS performance. You clearly state that this is not the case, so, provided that your SQL Server has the capacity to house the storage, I believe you should be able to proceed without any implications. A minor downside you may experience is when you're constructing new workspaces, where you would have to pull those files from the repository. Unfortunately this also happens during TFS Build, so it's possible that your builds will now take that much longer. The severity of this depends greatly on your network setup/stability.
The biggest problem (inconvenience) you'll have is having to download these massive files to all your workspaces, or map them out. Consider putting them into a separate team project to make this easier (unless you want to include them in branches, in which case I'd advise keeping everything in one team project).
If you have control of the XML format then also consider a few tweaks to make the files smaller. This will improve the performance of store/get operations and also loading speed. Shorten element and attribute names, reduce the number of decimal places you are outputting for floating-point numbers, etc. You will find that simple schemes like this will knock many megabytes off the size of GB-sized files, and it's easy to knock up a quick XSLT transform or code to convert the files quickly over to the new format.
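For the XSLT suggestion in the last answer, here is a minimal sketch using the JAXP transformer bundled with the JDK; shrink.xslt is a hypothetical stylesheet you would write for your schema (renaming verbose elements, rounding numbers), and the file names are placeholders:

import java.io.File;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class ShrinkXml {
    public static void main(String[] args) throws Exception {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new File("shrink.xslt")));   // hypothetical stylesheet
        t.transform(new StreamSource(new File("testdata.xml")),               // original large file
                    new StreamResult(new File("testdata-small.xml")));        // compacted output
    }
}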

How much can SQLite store on the iPhone?

I have an idea for a webapp for the iPhone but it's unknown to me how much data can be stored in mobile Safari's SQLite DB. I tried searching through the Apple docs but found nothing:
Safari Client-Side Storage and Offline Applications Programming Guide: Using the JavaScript Database
Most of these answers are totally wrong. Safari will not allow you to create SQLite databases over 50MB (or expand existing databases beyond that size).
This is a limit imposed by Safari - as other people have noted, SQLite itself supports much larger databases that you can use from native apps. But webapps are limited to 50MB.
It might be useful to note that this is per database - if you really need the extra space, you can create multiple databases, although this would obviously cause a lot of hassle.
It's as the other posters say. You're only limited by the drive space on the device.
You also need to consider your in-memory footprint though. There is a finite amount of memory on the iPhone, and in general it's quite small, so the amount of data/hydrated objects you'll be able to keep in memory is another potential limitation for your app.
There are a LOT of people answering who have clearly never tested it. I am on the latest version of iOS (4.3.3) and have set up a system to create multiple databases and keep them under 45 MB, but found that the 50 MB cap is for the site as a whole. So, no matter how much you split the data up, it is still restricted to an aggregate cap of 50 MB.
The database size limit on Safari mobile is 50 MB per site, not per database. I have tested this. Even if you have an extra empty database, you cannot add to it if the total size of all databases on a single site is 50 MB.
What's also worth noting is that characters are saved as double bytes in WebSQL; that is, 2 million characters will take 4 megabytes on disk, not 2 megabytes.
You are only limited by the amount of free space on the device.
I'm not sure. If you were doing your own application you'd be limited by free space on the device and to some extent in memory footprint (as Bryan McLemore points out).
However, since you're looking at using JavaScript inside of Safari there's no easy way to tell. According to the document you found, it looks like it may be limited per site, but there's nothing telling you how much. I'd suggest writing a quick script to fill up the database and figure out how much it actually is. After that, I'd probably halve that value and assume I'd always be able to use that much.
Be sure to report back so we'll all know!
It's most likely 32 terabytes... which is well over the available disk space.
I reached this number by multiplying the maximum page size by the maximum page count listed at the bottom of the SQLite limits page.
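For reference, that figure presumably comes from the limits documented at the time: a maximum page size of 32768 bytes and a maximum page count of 2^30 gives 32768 × 1,073,741,824 ≈ 3.5 × 10^13 bytes, i.e. roughly 32 tebibytes. The limits quoted further below are newer and give the larger 140-terabyte figure.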
Limits In SQLite
"Limits" in the context of this article means sizes or quantities that can not be exceeded. We are concerned with things like the maximum number of bytes in a BLOB or the maximum number of columns in a table.
SQLite was originally designed with a policy of avoiding arbitrary limits. Of course, every program that runs on a machine with finite memory and disk space has limits of some kind. But in SQLite, those limits were not well defined. The policy was that if it would fit in memory and you could count it with a 32-bit integer, then it should work.
Unfortunately, the no-limits policy has been shown to create problems. Because the upper bounds were not well defined, they were not tested, and bugs (including possible security exploits) were often found when pushing SQLite to extremes. For this reason, newer versions of SQLite have well-defined limits and those limits are tested as part of the test suite.
As of version 3.6.19 (all statistics in the report are against that release of SQLite), the SQLite library consists of approximately 65.7 KSLOC of C code. (KSLOC means thousands of "Source Lines Of Code" or, in other words, lines of code excluding blank lines and comments.) By comparison, the project has 690 times as much test code and test scripts - 45409.7 KSLOC.
The default storage limit on the iPhone seems to be 5 MB.
davibe has done some work to raise the limit up to 1GB with his PhoneGap plugin.
https://github.com/davibe/Phonegap-SQLitePlugin
The plugin calls the native sqlite3 API, with a wrapper on the JavaScript side.
The relevant code extracted from sqlite.js is:
update origins set quota = '999999999999' where origin = 'file__0';
"update databases set estimatedSize = '999999999999' where name = '" + dbName + "';";
Caution: my iPhone is jailbroken! But I don't suspect that this changes anything.
The limit of 50MB is no longer correct.
On my iPhone 4S with iOS 6.1 I have a database of 58.66 MB (448496 records) for my webclip (website pinned to the springboard).
No special tricks, just standard HTML5 usage.
Maximum Database Size
Please refer to the official SQLite site.
Every database consists of one or more "pages". Within a single database, every page is the same size, but different databases can have page sizes that are powers of two between 512 and 65536, inclusive. The maximum size of a database file is 2147483646 pages. At the maximum page size of 65536 bytes, this translates into a maximum database size of approximately 1.4e+14 bytes (140 terabytes, or 128 tebibytes, or 140,000 gigabytes or 128,000 gibibytes).
This particular upper bound is untested since the developers do not have access to hardware capable of reaching this limit. However, tests do verify that SQLite behaves correctly and sanely when a database reaches the maximum file size of the underlying filesystem (which is usually much less than the maximum theoretical database size) and when a database is unable to grow due to disk space exhaustion.

tfs database size - version control

I have TFS installed on a single server and am running out of space on the disk. (We've been using the instance for about 2 years now.)
Looking at the tables in SQL Server, what seems to be the culprit is the tbl_content table; it is at 70 GB. If I do a get on the entire source tree for all projects it is only about 8 GB of data.
Is this just all the histories of the files? It seems like a 10:1 ratio for just the histories... since I would think the deltas would be very small.
Does anyone know if that is a reasonable size given 8 GB of source (and 2 yrs of activity)? And if not what to look at to 'fix' this?
Thanks
I can't help with the ratio question at the moment, sorry. For a short-term fix you might check to see if there is any space within the DB files that can be freed up. You may have done this already, but if not:
-- Free space remaining inside each database file, in MB
SELECT name,
       size/128.0 - CAST(FILEPROPERTY(name, 'SpaceUsed') AS int)/128.0 AS AvailableSpaceInMB
FROM sys.database_files;
If the statement above returns some space you want to recover you can look into a one time DBCC SHRINKDATABASE or DBCC SHRINKFILE along with scheduling routine SQL maintenance plan that may include defragmenting the database.
DBCC SHRINKDATABASE and DBCC SHRINKFILE aren't things you should do on a regular basis, because SQL Server needs some "swap" space to move things around for optimal performance. So neither should be relied upon as your long term fix, and both could cause some noticeable performance degradation of TFS response times.
JB
Are you seeing data growth every day, even when no activity occurs on the system? If the answer is yes, are you storing any binaries outside of the 8 GB of source somewhere?
The reason I ask is that if TFS is unable to calculate a delta, or if the file exceeds the size limit for delta generation, TFS will duplicate the entire binary file. I don't have the link with me (it's on my work machine), but it describes this scenario and how to fix it, in the event that this is the cause of your problems.