Azure private pipeline agent .git folder size

We recently moved from hosted to private agents, for reasons that are not relevant to this question. The problem we're having now is that the private agent runs out of disk space. I've checked why this is the case, and it turns out that for one of the workspaces the agent creates, the .git folder grows to over 20 GB during the day, while the repository itself is only a few GB. What can explain this excessive growth?
Some extra info:
We build from different branches using the same pipeline (so it re-uses the same workspace).
We do not clean the workspace between runs, since this would require us to re-get the entire repository on each build, which slows the build. (I understand adding the clean option would solve our problem, but it would also slow down all builds, which we don't want.)
We used to use fetchDepth: 1 in our pipelines, but we recently removed it, since it is no longer necessary on private agents, where the sources are cached between runs.
Edit:
To clarify, I'm looking for a way to avoid running out of disk space on the agents without losing the ability to cache source files.
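For context, the disk usage can be inspected with something like the following, run in the workspace's sources directory on the agent (the exact workspace layout here is an assumption):
du -sh .git .git/objects/pack      # total .git size vs. the pack files inside it
git count-objects -v -H            # loose vs. packed object counts and sizes
git for-each-ref refs/remotes      # remote-tracking branches the agent has accumulated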

When I run the same pipeline with different branches, the .git folder size does indeed increase.
The root cause of this issue appears to be the pack files in .git/objects/pack.
Git packs the fetched objects, and if your source files are large enough, the pack files will also take up a lot of space.
You could try using the BFG tool or Git commands to remove the unneeded objects.
For more detailed information, you could refer to this ticket: Remove large .pack file created by git
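A minimal cleanup sketch, assuming a bash script step at the end of the pipeline (or a scheduled job) and the predefined BUILD_SOURCESDIRECTORY environment variable; it only removes objects that become unreachable once stale remote-tracking branches are pruned, so the cached sources stay usable and later builds still avoid a full re-fetch:
cd "$BUILD_SOURCESDIRECTORY"     # the agent's checkout directory for this pipeline
git remote prune origin          # drop remote-tracking refs for branches deleted on the server
git gc --prune=now               # repack and delete objects that are no longer reachable
git lfs prune || true            # trim cached LFS objects; only relevant if the repo uses Git LFS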

Related

How to fix excessively large size of unpacked object files in Git (with GitLFS)?

I'm having an odd issue with Git on my Azure DevOps build agents. We have a large repo that uses Git LFS; while git lfs prune keeps the size of .git/lfs/objects down, some of our environments have begun accumulating massive numbers and sizes of objects in .git/objects that do not get cleaned by either a git gc or a git lfs prune.
For a sense of the scale here, the .git pack file is about 2GB, the LFS objects folder is about 1.4GB, and the .git/objects files that won't pack are about 105GB!!! Every single one of the files begins with an x as the first character.
On a typical developer's machine, the entire repo checked out is around 5GB, so something is very, very off, but nothing I try will clean up the files. Any ideas what their source is and/or how to clean them, short of simply periodically nuking the entire repository and re-pulling it?
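A hedged diagnostic pass, assuming a shell in the affected working copy, might narrow down what the x-prefixed files are before resorting to re-cloning:
git count-objects -v -H                        # loose vs. packed object sizes as git sees them
find .git/objects -type f -name 'x*' | head    # sample a few of the odd x-prefixed files
du -sh .git/objects .git/lfs/objects           # compare the plain object store with the LFS store
git fsck --full 2>&1 | head                    # check whether git reports them as dangling or corrupt
git lfs prune --dry-run                        # show what an LFS prune would remove, without removing anything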

Cannot push backup to GitHub due to files that are too large

I am aware that this issue has several related questions, but I have a slightly backward problem, where there already is a commit history to preserve.
I have a project repo I worked on last year together with several other people. I have successfully pushed the changes to this repo (education license for a Bitbucket repo). I have attempted to make a backup of this repo on my private GitHub, since the Edu repo will be wiped eventually. Some of the files are larger than GitHub's file size limit, so I have packaged these files into segmented packages that fit within that limit. The problem is that the history of these larger files is still there, and the push is aborted once those files are processed.
Is there any way to change the history of these files only for my backup version, without mangling the original version history? Essentially, skip the big files in the GitHub commit history and go straight to the segmented-packages version.
The most closely related issue I've managed to find is this one: Can't push to GitHub because of large file which I already deleted,
which mentions that you should not run the command without understanding the consequences (which, as I understand it, would be changing the history).
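One possible approach, sketched under the assumption that git-filter-repo is installed and with placeholder URLs, is to rewrite history only in a throwaway mirror clone used for the backup, leaving the original repository untouched; the 100M threshold is an assumption matching GitHub's per-file limit:
git clone --mirror https://bitbucket.org/your-team/project.git project-backup.git   # placeholder URL
cd project-backup.git
git filter-repo --strip-blobs-bigger-than 100M     # drop the oversized blobs from this copy's history only
git remote add github https://github.com/your-user/project-backup.git               # placeholder URL
git push --mirror github
The original Bitbucket history is never rewritten; only the mirror that gets pushed to GitHub loses the large blobs.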

How do I handle a large number of files as an input to a build when using VSTS?

To set expectations, I'm new to build tooling. We're currently using a hosted agent, but we're open to other options.
We've got a local application that kicks off a build using the VSTS API. The hosted build tasks involve a Get sources step that clones a GitHub repo onto the file system in VSO. The next steps are to copy over a large number of files (upwards of about 10,000), build the solution, and run the tests.
The problem is that the cloned GitHub repo is on the file system in Visual Studio Online, while my 10,000 input files are on a local machine. Copying them over for every build seems like a bit much, especially since we plan on doing CI and may have many builds kicked off per day.
What is the best way to move the input files into the cloned repo so that we can build? Should we be using a hosted agent for this, or is it best to do this on our local system? I've looked in the VSO docs but haven't found an answer there. I'm not sure if I'm asking the right questions here.
There are a few ways to handle this situation; you can follow the one that is closest to your setup.
Option 1. Add the large files to the GitHub repo
If the local files are only related to the code in the GitHub repo, you should add them to the same repo so that all the required files are cloned in the Get sources step; then you can build directly, without a copy-files step.
Option 2. Manage the large files in another Git repo, and add that repo as a submodule of the GitHub repo
If the local large files are also used by other code, you can manage them in a separate repo and add that repo as a submodule of the GitHub repo with git submodule add <URL for the separate repo>. In your VSTS build definition, select Checkout submodules in the Get sources step. The large files can then be used directly when you build the GitHub code (see the sketch after these options).
Option 3. Use a private agent on your local machine
If you don't want to add the large files to the GitHub repo or to a separate Git repo for some reason, you can use a private agent instead. But the build run time may not improve noticeably, because the only change is the difference between copying the local files to a server and copying them to the same local machine.
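A minimal sketch of Option 2, with a placeholder URL and folder name, run inside a clone of the GitHub repo that the build checks out:
git submodule add https://example.com/your-org/build-inputs.git build-inputs    # placeholder repo and folder
git commit -m "Add build inputs as a submodule"
git push
With Checkout submodules enabled in the Get sources step (or submodules: true on a YAML checkout step), the agent then clones build-inputs automatically before the build runs.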

Git repository size increases on every Eclipse remote synchronization

I'm using Eclipse remote synchronization to upload a PHP project from my Windows workstation to a FreeBSD VM, where my web server is running. To perform the synchronization, Eclipse creates a .ptp-sync directory on both machines, where it stores the Git objects.
Initially the project is ~1MB, but after every save (which triggers a sync) the size increases to 2MB, 3MB, 5MB, 10MB, etc., on both machines. After a number of synchronizations it reaches hundreds of MB, then GBs; once it even reached 11GB. The synchronization then starts to take 1-2 minutes instead of the initial 1-2 seconds. In such cases I have to delete both .ptp-sync dirs and initialize the Eclipse sync again.
I noticed that the largest files are in .ptp-sync\objects\pack\
In my last test, after 3 saves (and syncs), the repo size grew in steps: 77MB - 138MB - 267MB - 396MB. Just before that I tried
git -C .ptp-sync --work-tree=. gc --prune
which reduced the size from 140MB to 77MB, but after 396MB it no longer reduced anything. The next save made the repo 779MB.
One of my guesses was that .ptp-sync is not being ignored, which causes it to be pushed every time, even though /.ptp-sync is in the .gitignore file and also in Eclipse Preferences->Remote Development->Synchronized Projects->File Filtering.
P.S. And of course this does not happen on my colleague's setup, which is pretty much the same: he also uses Windows and Eclipse with a copy of the same VM.
I figured out how to handle this situation. As I guessed, even though the .ptp-sync directory was added to .gitignore, it wasn't actually being ignored, which caused it to be re-committed on every repack.
The solution is to add the line /.ptp-sync/ to .ptp-sync/info/exclude. Apparently the synchronization doesn't use .gitignore by default.
For different ways of ignoring files, the following link gives some info: https://help.github.com/articles/ignoring-files/
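A minimal sketch of that fix, run from the project root on the workstation and repeated on the VM; the paths follow the question, and shrinking the already-bloated pack afterwards is an assumption about the obvious next step:
echo '/.ptp-sync/' >> .ptp-sync/info/exclude      # stop the sync repo from committing its own objects
git -C .ptp-sync --work-tree=. gc --prune=now     # repack and drop the now-unreferenced objects
du -sh .ptp-sync                                  # confirm the size has gone back down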

Emulating symlink-like behaviour in a source control repository

Suppose I have the following (desired) folder structure:
*CommonProject
*Project#1
----> CommonProject(link)
*Project#2
----> CommonProject(link)
Where CommonProject is the location of the source belonging to that project, and CommonProject(link) is merely a soft link to the main location. If we imagine this as a tree view in a visual client, then when I expand Project#1 I will see CommonProject there as a subdirectory, even though the files are not actually stored there.
The purpose of this is to enable the following behaviour:
When I check out Project#1, I get the files associated with that project as well as a subfolder CommonProject containing all of its files (as if Project#1 contained a copy of the files in the version control repository). Now if I were to modify CommonProject's files inside Project#1 and submit my changes to the repository, the changes would go into the CommonProject location (no file is actually stored under Project#1 in the repository). And if I were then to sync Project#2, since it also contains a symlink to CommonProject, it would now get my updates.
Essentially the duplication of files only exists on my machine, but in the repository there is only one version of CommonProject.
I know Perforce can't do this without juggling 3 specs, which is very complicated and error-prone, especially when a lot of people do it. Is there a source control system out there that can do this? (A pointer to some docs on how it can be done would be a plus.)
Thank you.
Subversion can directly store symlinks in the repository. This only works on operating systems that support symlinks, though, as svn just stores the symlink the same way it would any other file.
I think what you really want, though, is to link to separate projects. Subversion supports this through externals, and Git through submodules. Another alternative is to manage this sort of thing in your build process, so that the shared static resources are gathered when you initialize the build. Generally, a utilities library that changes often is going to cause stability problems when you update it, so you can do this manually (or with clever scripts) when you need to.
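As a hedged illustration of the submodule route, assuming a move to Git is an option and using placeholder URLs, each consuming project embeds CommonProject as a submodule, so the repository stores it once while every checkout gets a local copy:
cd 'Project#1'
git submodule add https://example.com/team/CommonProject.git CommonProject    # placeholder URL
git commit -m "Add CommonProject as a submodule"
cd '../Project#2'
git submodule add https://example.com/team/CommonProject.git CommonProject    # placeholder URL
git commit -m "Add CommonProject as a submodule"
git clone --recurse-submodules https://example.com/team/Project1.git          # consumers get CommonProject populated
Commits made inside the CommonProject submodule go to the CommonProject repository itself, so Project#2 picks them up on its next submodule update.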
You'd probably be much better off just storing the projects in a flat directory (one directory per project, all at the same level) and using whatever your build system or IDE is to link all the stuff together.