GitHub - Keep only the last n versions of large files

Our GitHub repo has grown considerably in size, and cloning it on a new machine takes a lot of time and bandwidth. This is mainly because we keep JAR files in the repo. I understand we could use something like Nexus to host the JARs instead.
As a short-term fix, is there a way we can keep only the last n versions of a file on GitHub?
Thanks

Ideally, you might consider activating Git LFS in order to store only references to your JARs instead of the binary artifacts themselves.
That works on GitHub, but be aware that GitHub applies bandwidth and storage quotas to LFS.
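If you go that route, switching the JARs over is mostly a matter of tracking the extension; a minimal sketch (the commit message is just an example):
git lfs install
git lfs track "*.jar"
git add .gitattributes
git commit -m "Track JAR files with Git LFS"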
The alternative is to rewrite history: instead of BFG, try the new git filter-repo, which replaces both the older git filter-branch and BFG.
You would need a commit callback that removes those files from any commit older than your cutoff date.
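A rough sketch of what such a callback could look like; the 180-day cutoff and the .jar pattern are assumptions for illustration, not something from your setup:
git filter-repo --commit-callback '
  # Drop JAR modifications from commits older than the cutoff (assumed: ~180 days),
  # so only recent JAR versions survive in the rewritten history.
  import time
  cutoff = time.time() - 180 * 24 * 3600
  if int(commit.committer_date.split()[0]) < cutoff:
    commit.file_changes = [c for c in commit.file_changes
                           if not c.filename.lower().endswith(b".jar")]
'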

Related

Importing existing files in repo to new LFS storage (using SourceTree and Git LFS)

TL;DR: What is the least intrusive way of migrating existing files to LFS storage, once LFS has been initialized on the repository and a suitable .gitattributes file has been prepared (but not yet pushed), for a Unity project using GitHub, Git LFS and SourceTree?
Software/services used:
Unity 2020 LTS
GitHub
Git LFS (supported natively by GitHub)
SourceTree
Links I'm using:
GitHub's own migration tutorial
GitHub's documentation for the migrate command
A .gitattributes file which handles casing differences in extensions using glob character classes (I think).
Another answered question very much like this one.
I also took a look at BFG Repo-Cleaner, but it seems like overkill for this task.
Motivation for posting
I've been trying to find a guide describing the best way to do this, but most seem to be outdated or don't include enough information for me to confidently fire these import commands at my repo. Others focus a lot on how to set up LFS with a specific hosting backend, like Bitbucket Cloud, but I can't find any focusing on Git LFS and SourceTree.
SourceTree has direct integration with Git LFS, but there seems to be no UI implementation of the migration process for existing files in the repository, so I think I have to rely on the terminal for this part (which I honestly haven't had to use in years, so there's that).
I also have a question about an option for this command that I don't see mentioned in any of my searches: --fixup
The setup
It's not a huge repo; it's a main branch with a few dead branches that don't matter. I just want all the files with certain extensions converted to LFS. My initial thought was to run the migrate command with --no-rewrite, because I liked the idea of compacting all the changes into one commit with no changes to the commit history. But when I realized that would mean the old file versions stay in the repo for posterity, I changed my mind: I want the repo to shrink as well. That means replacing the files with pointer files back through the commit history, so the actual files are eliminated from the repo completely and exist only in LFS storage. (See the sketch of both modes below.)
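For reference, the two modes look roughly like this; the file path and extension are placeholders, not my actual assets:
# Option A: one new commit, history untouched (old versions stay in the repo)
git lfs migrate import --no-rewrite Assets/Audio/Theme.wav
# Option B: rewrite all branches so past commits hold pointer files instead
git lfs migrate import --everything --include="*.wav"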
My questions and options, as I see them
In that other answered question I linked to, a full answer isn't given, but it's very close. Though the answer is very informative, it doesn't answer these questions:
How do I handle case-sensitivity of extensions when writing migrate commands with --include and --exclude options? Do I need to either go through all existing files in the repo and normalize their casing so I can use one-liners, or alternatively run a separate command for each casing permutation of the extension found in the repo? (Ignore this question if --fixup understands the casing format mentioned in 2. and works for my purposes.)
I see that there is a --fixup option for the migrate command, which isn't mentioned in any search results I'm getting on this topic, except in GitHub's documentation for the migrate command. It says:
"Infer --include and --exclude filters on a per-commit basis based on the .gitattributes files in a repository. In practice, this option imports any filepaths which should be tracked by Git LFS according to the repository's .gitattributes file(s), but aren't already pointers."
Is there some reason why the --fixup option isn't recommended anywhere, when it seems to do exactly what I need? Does it not rewrite the history, or something?
I hope it works, because since --fixup reads the .gitattributes files, it should pick up the casing patterns from them, and I wouldn't have to worry about missing files due to different casings, as I would with the --include option.
This is the casing-format used in the .gitattributes file (example for .wav files):
*.[Ww][Aa][Vv]
It works for .gitattributes' normal operation, but it doesn't work when you use the same format in the --include or --exclude options of the migrate command, e.g.:
git lfs migrate import --include="*.[Ww][Aa][Vv]"
...does not work. This is what causes the casing issue mentioned in 1.
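One hedged workaround for finding out what I'd even need to cover: list every casing of an extension that actually occurs anywhere in history, then write an --include per variant. Something like (using .wav as the example):
# Print the distinct casings of ".wav" found in any commit on any branch
git log --all --name-only --pretty=format: | grep -iE '\.wav$' | sed 's/.*\.//' | sort -u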
I have LFS initialized, and my .gitattributes file sits (uncommitted) at the root of the project. So committing and pushing .gitattributes, and then doing 2., seems to be what I want, right? I should be able to open the terminal in SourceTree and put in this line:
git lfs migrate import --fixup
or do I need --everything in order to affect all branches?
git lfs migrate import --everything --fixup
Also, if --fixup doesn't work, is it just a bunch of these?:
git lfs migrate import --everything --include="*.WAV"
git lfs migrate import --everything --include="*.wav"
...followed by a:
git push --force
or, to push all the rewritten branches:
git push --force --all
???
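(For what it's worth, --include takes a comma-separated list, so several casings can at least go in one call, and git lfs migrate info should preview what a given filter would convert. A sketch, assuming those two casings are the only ones present:)
git lfs migrate info --everything --include="*.wav,*.WAV"
git lfs migrate import --everything --include="*.wav,*.WAV"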
That's about where I'm at. I'm not entirely sure which way to go here, or how each of these commands affects the result. I'd appreciate any input on what the best course would be. If anything, I've tried to be as informative as I can and to include the best sources I've found on the subject, to help anyone else asking the same questions.
EDIT: I found this issue at the Git LFS repository, which describes a caveat with the --fixup option: it takes into account when the existing files were added to the repo, and if they were added before the changes to the .gitattributes file, they aren't included. So it is recommended to either rebase, or to use filter-branch or filter-repo to add/change the .gitattributes file at the root commit, so that --fixup will register the files. The issue deals with some more caveats, like having to uninstall LFS first, but eventually arrives at that solution. This is all just to avoid having to do separate calls for each extension-casing permutation. I don't want to follow any of those suggestions, since some people seem to run into trouble with them, and I can't get --fixup to work with info, meaning I can't preview what it will change. So I'm going with --include calls: I'll look up all the casing permutations of the extensions among the existing files and do --include calls for those. Please, if you do have any good information on the subject, I (and probably many others) would love to hear easier solutions to this problem.
Thanks in advance and best regards, Jonas Tingmose.

Cannot push backup to GitHub due to files that are too large

I am aware that there are several related questions about this, but I have a slightly backward problem, where there already is a commit history to preserve.
I have a project repo that I worked on last year together with several other people, and I have successfully pushed the changes to it (an education-license Bitbucket repo). I have attempted to make a backup of this repo on my private GitHub, since the edu repo will be wiped eventually. Some of the files are larger than GitHub's file limit, so I have packaged these files into segmented archives that fit within GitHub's file-size limit. The problem is that the history of the larger files is still there, and the push is aborted once those files get processed.
Is there any way to change the history of these files for my backup version only, without mangling the original version history? Essentially, skipping the big files in the GitHub copy's history and jumping straight to the segmented-packages version.
The most related issue I've managed to find is this one: Can't push to GitHub because of large file which I already deleted,
which mentions that you should not run the command without understanding the consequences (which, I gather, would be changing the history).
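A hedged sketch of one way to do that, using a throwaway mirror clone so the original Bitbucket history is never touched (the URLs are placeholders, and 100M matches GitHub's per-file limit; note that git filter-repo removes the origin remote after rewriting, hence the new remote):
# Fresh mirror clone: the rewrite happens here, not in the working repo
git clone --mirror https://bitbucket.org/team/project.git backup.git
cd backup.git
# Drop every blob over 100 MB from the entire history
git filter-repo --strip-blobs-bigger-than 100M
# Push the rewritten history to the private GitHub backup
git remote add github https://github.com/me/project-backup.git
git push --mirror github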

Clone huge 16 GB Git repo with Eclipse Neon

Is there any way I can clone a huge Git repository (16+ GB) using the Git integration of the latest Eclipse Neon?
I'm cloning over an HTTP connection.
First I ran into timeouts, but then I increased the remote connection timeout to 1800 seconds in the Eclipse configuration.
After that, the clone almost completed, but at the very end it always failed with Premature EOF.
I also increased http.postBuffer to 524288000 (as many users suggested on Stack Overflow), but that did not help much.
I also tried cloning the master branch only, but again, I was stuck with the same error message.
Is EGit not capable of handling such a big repo over HTTP?
The only Git-related way to clone such a huge Git repo would be through the recent (February 2017) GVFS (Git Virtual File System).
As tweeted, for a 270GB repo:
“The Windows codebase has over 3.5M files. With GVFS (Git Virtual File System), cloning now takes a few minutes instead of 12+ hours.”
See github.com/Microsoft/GVFS.
GVFS is based on a Git fork: github.com/Microsoft/git.
And it is based on a protocol whose specifications are described here.
This is not yet supported by EGit, or even by regular Git, for now.
Depending on what you want to do with the repo, a shallow clone may be the solution (it won't bring the full git history): https://www.perforce.com/blog/141218/git-beyond-basics-using-shallow-clones
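For example (the URL is a placeholder):
# Clone only the latest commit instead of 16+ GB of history
git clone --depth 1 http://my-url/project.git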
Also, for such a big repo, consider using Git LFS in the future: https://git-lfs.github.com/
Finally, I've seen many huge Git repos that became so big because they held files that were never supposed to be stored in Git (executables, binaries, videos, audio, and so on). If something like that happened by mistake, you can remove it from history using filter-branch; see the sketch below. Check this SO answer: How to remove/delete a large file from commit history in Git repository? or this GitHub article: https://help.github.com/articles/remove-sensitive-data/
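Purging one oversized file from every commit looks roughly like this (the path is a placeholder):
# Rewrite all refs, deleting the file from each commit's index
git filter-branch --index-filter \
  'git rm --cached --ignore-unmatch path/to/huge-file.bin' \
  --prune-empty --tag-name-filter cat -- --all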
EDIT:
Microsoft has been developing GVFS, which may be a solution in the near future (I think it's still not ready, but I haven't tested it).
Do you really have a code project that's 16GB? That's pretty crazy, man!
I think the least painful way to go about this is to open your shell and just type git clone http://my-url/project.git, and then try to see if you can make the repository somewhat smaller.
Eventually, I ended up cloning the repository over an SSH connection.
This works fine, even from within Eclipse (using EGit).
I had to create an SSH key in the Eclipse preferences, since PuTTY's PPK format is not compatible with Eclipse. Then I managed to clone the entire repository.
It seems HTTP is just not suited to downloading a 16+ GB chunk. :)

Can I clean out a Mercurial Repository that has large unused history?

My team has a Mercurial repository with a long history, including large files that are no longer part of the project. The repository is getting so large that it often times out when attempting to clone from the Google Code hosting site. Can we cull the repository so that files that are not in the tip are removed entirely from the history, yet keep the history of all the other active files?
The ConvertExtension can do this. See its --filemap option.
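A sketch under assumed paths; hg convert reads the filemap and writes a new, smaller repository next to the original:
# Enable the extension once in ~/.hgrc:
#   [extensions]
#   convert =
#
# filemap.txt contains one directive per line, e.g.:
#   exclude assets/old-videos
#   exclude build/huge.dat
hg convert --filemap filemap.txt old-repo new-repo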

Inefficient handling of file renames in Mercurial

When I rename a file using Mercurial, and then commit without any changes, why does it still send the full file to the repository? (I can tell because the subsequent push to the remote repository shows how much data is being transferred). Isn't it obvious to it that it simply needs a rename?
I'm using the latest version of TortoiseHG under Windows, and the file in question is a 20MB text file.
This is a known deficiency in the storage format used by Mercurial. You can search for "lightweight copies" for the full story, but briefly, the problem is that a new revlog is created for the new file name when you rename. The new revlog starts with a compressed snapshot of the full file — this is normally not a big problem, but it's still bigger than a zero-sized delta.
There's little you can do about it right now, unless you want to patch your Mercurial and run experimental code. The good news is that you just have to wait: the patches we've been working on will be able to convert your existing repository into a more space-efficient one automatically. This will happen when you hg clone over the network, or when you use hg clone --pull locally.