I'm working on a project that stores lots of repositories from GitHub. Many objects are shared by many repositories, so I want to know whether GitHub stores each object only once to save storage, and how it does this (if it's not a secret).
I have not found a satisfying answer, just some speculation that GitHub does this.
GitHub has not done this. GitHub stores each "repository network" individually, where a repository network is:
An original repository
Forks of that repository
The repositories within a "repository network" can share objects with each other, using Git's "alternates" mechanism. This allows Git to consider other object database locations beyond just the normal storage within the repository.
When you create a repository on GitHub, you create a single, bare repository on disk, with a normal on-disk object database backing it up. When you create a fork from that repository, GitHub will:
Create a new "alternates" area for the repository network.
Move the repository's objects into the alternates area.
Set up the original repository to know about the new alternates area.
Set up the new fork to know about the new alternates area.
Once this is done, the original repository and all of the repositories forked from it share objects through the alternates area.
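To make the idea concrete, here is a minimal sketch of a content-addressed object store with alternates. It is a toy model in Python, not GitHub's implementation: the `ObjectStore` class and the variable names are made up for illustration. The only part taken from Git itself is the blob ID, which in a SHA-1 repository is the hash of the header `blob <size>\0` followed by the content. Because identical content always hashes to the same ID, an object that already exists in the shared area never needs to be stored again in a fork.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to a blob with this content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


class ObjectStore:
    """Toy object database (hypothetical): local objects plus alternates."""

    def __init__(self, alternates=None):
        self.objects = {}                        # object ID -> content
        self.alternates = list(alternates or [])  # other stores to consult

    def has(self, oid: str) -> bool:
        return oid in self.objects or any(a.has(oid) for a in self.alternates)

    def write(self, content: bytes) -> str:
        oid = git_blob_id(content)
        # Store locally only if no reachable store already has the object.
        if not self.has(oid):
            self.objects[oid] = content
        return oid

    def read(self, oid: str) -> bytes:
        if oid in self.objects:
            return self.objects[oid]
        for alt in self.alternates:
            if alt.has(oid):
                return alt.read(oid)
        raise KeyError(oid)


# One shared area per repository network, as described above.
network = ObjectStore()
original = ObjectStore(alternates=[network])
fork = ObjectStore(alternates=[network])

readme = b"hello\n"
oid = network.write(readme)   # the original's objects live in the shared area
fork.write(readme)            # same content, same ID: no second copy is made

print(oid)
print(len(network.objects), len(fork.objects))
```

In this sketch the fork can read the object even though its own store is empty, which is the effect the alternates setup is meant to achieve.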
But this is the limit of shared object storage. There's no big database of objects that everybody can share (nor should there be for scalability and security reasons).
(Note: although I worked on the Git Infrastructure team at GitHub, this information is not confidential.)
Sources
These two talks at the Git Merge conference, and the article that follows them, discuss GitHub's Git repository storage:
Scaling at GitHub, a talk by Patrick Reynolds at Git Merge 2016.
Top Ten Worst Repositories to Host on GitHub, a talk by Carlos Martín Nieto at Git Merge 2017.
Counting Objects, an article by Vicent Martí.
On GitHub, say I forked a project but did not want to display it on my profile because the project relates to something that I am not allowed to work on for non-compete reasons.
When I try to set the fork to private, I receive this error message on GitHub:
"For security reasons, you cannot change the visibility of a fork."
What are the potential security implications of changing the visibility of a fork?
When you perform a push on GitHub, the data is pushed into the repository for your fork. Then, if there are multiple forks, those objects are moved into an alternate that is shared by all repositories in that network, forks included. This saves a lot of space when there are many forks, and it makes pull requests much easier, since the objects are already present in the main repository.
However, in practice it means that all objects in all forks in the network are visible through any fork. As a result, if your fork were private, then someone who knew an object ID could view it through the main repository and see that data. This would be a security problem, so GitHub doesn't allow it.
I think it's just a GitHub product design decision rather than a matter of security.
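The leak described above can be sketched with a toy model. This is only an illustration in Python under the assumption that every repository in a network reads from one shared object pool; the `Repo` class, the repository names, and the object ID `abc123` are all hypothetical placeholders, not real GitHub internals.

```python
class Repo:
    """Toy repository (hypothetical) backed by a pool shared with its network."""

    def __init__(self, name, shared_pool):
        self.name = name
        self.shared_pool = shared_pool  # the same dict for every repo in the network

    def push(self, oid, content):
        # Pushed objects end up in the shared alternate.
        self.shared_pool[oid] = content

    def read(self, oid):
        # Any repository in the network can resolve any object ID in the pool.
        return self.shared_pool.get(oid)


shared_pool = {}
upstream = Repo("upstream", shared_pool)
private_fork = Repo("private-fork", shared_pool)  # imagine this fork were private

# "abc123" stands in for a real object ID.
private_fork.push("abc123", b"confidential change")

# Someone who knows the ID asks the public upstream repository for it.
leaked = upstream.read("abc123")
print(leaked)
```

Here the object pushed to the supposedly private fork is readable through the public repository, which is the situation the visibility restriction avoids.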
We have a UI monorepo using an Nx workspace, and we share code across multiple teams. Is there a way:
To allow team members access only to the modules they own?
To create PRs which can be viewed and approved only by module owners (their team)?
No, read access applies to the entire repo. With a monorepo, a user either has read access to everything or no access at all. See: Information about Managing Teams
Read access for PRs is the same as above, but you can require certain groups of approvers for PRs that include certain paths when merging to particular branches. See Information about Code Owners and Protected Branches
No, there is no way to restrict access to only part of a repository. The Git documentation is very clear that anyone who can read or write to a repository can access all of the contents of that repository. From the gitnamespaces(7) manual page:
The fetch and push protocols are not designed to prevent one side from stealing data from the other repository that was not intended to be shared. If you have private data that you need to protect from a malicious peer, your best option is to store it in another repository. This applies to both clients and servers.
If you need granular permissions, you need multiple repositories. I generally recommend against monorepos because they usually end up growing very large and then performing poorly (well after it's too late to fix), but this is also another reason why they're a bad idea.
As for PRs which can be approved by module owners, it depends on the platform. GitHub has the CODEOWNERS file, which can be used to mandate that files owned by certain teams require a review from that team.
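As a rough sketch of how path-based ownership works, the Python below maps changed paths to the teams that would need to review them. It is a simplification, not GitHub's matcher: real CODEOWNERS patterns follow gitignore-style rules, while this sketch only understands `*` and directory prefixes. It does keep one documented behaviour, namely that the last matching pattern wins. The team names and directory layout are invented for the example.

```python
# Hypothetical rules, in the order they would appear in a CODEOWNERS file.
RULES = [
    ("*", ["@example-org/ui-platform"]),              # default owners
    ("/libs/cart/", ["@example-org/cart-team"]),
    ("/libs/checkout/", ["@example-org/checkout-team"]),
]


def owners_for(path, rules=RULES):
    """Return the owners of a path; a later matching rule overrides earlier ones."""
    owners = []
    for pattern, rule_owners in rules:
        # Simplified matching: "*" matches everything, otherwise match by prefix.
        if pattern == "*" or ("/" + path).startswith(pattern):
            owners = rule_owners
    return owners


def required_reviewers(changed_paths, rules=RULES):
    """Collect every team that owns at least one changed path."""
    required = set()
    for path in changed_paths:
        required.update(owners_for(path, rules))
    return sorted(required)


print(owners_for("libs/cart/src/index.ts"))
print(required_reviewers(["libs/cart/a.ts", "libs/checkout/b.ts"]))
```

Note that this only controls who must approve a change. Everyone with read access to the repository can still see every module.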
There are two popular models of collaborative development on GitHub:
Fork & pull
Shared repository model.
How to check which model the given repository uses? How to change it?
This isn't something that's formally included in a GitHub repository's settings. It's something that is determined by the repository's permissions, and it's simply a recognition of some common conventions.
Do you have push access? Then (for you) the repository is effectively "shared":
The shared repository model is more prevalent with small teams and organizations collaborating on private projects. Everyone is granted push access to a single shared repository and topic branches are used to isolate changes.
Pull requests are especially useful in the fork & pull model because they provide a way to notify project maintainers about changes in your fork. However, they're also useful in the shared repository model where they're used to initiate code review and general discussion about a set of changes before being merged into a mainline branch.
If not, then it's effectively "fork & pull" (again, for you):
The fork & pull model lets anyone fork an existing repository and push changes to their personal fork without requiring access be granted to the source repository. The changes must then be pulled into the source repository by the project maintainer. This model reduces the amount of friction for new contributors and is popular with open source projects because it allows people to work independently without upfront coordination.
Note that in both cases I said "for you". It is possible and common to grant a core group of committers push access ("shared model"), while still accepting pull requests from outsiders ("fork & pull"). If this were a setting, it wouldn't be on the repository. It would be a setting for each user who may have access to the repository.
And there are many other possible models, one obvious one being a private repository where certain users may fork the repository and submit pull requests. All other users wouldn't have any access to such a repository at all.
I have to maintain a base version of code and different "variations" of that code with client specific modifications for different clients sites.
It would be much easier to force all clients on the same variant, or have a super variant that encompasses all clients' needs. However that is not the nature of my world and I can't change it.
Given this environment, what is the best way to use GitHub?
I can create a separate repository for each version, or I can create one repository with separate branches. In either case I see how I can use GitHub as a storage medium and version control for each variant, but I don't see how I can use GitHub to help manage the code divergence.
thanks
Based on your question I would recommend a single repo, with a master branch. You can use other branches for the "variations".
The tricky part is when a commit is made to master that won't cleanly merge into another branch. You can do a merge commit with git merge, but I prefer to do a git rebase onto the new HEAD of master. This way people using master could easily pull in changes from that branch.
Can someone provide me with a cheat sheet for GitHub collaboration for a team of two who want equal access/rights to the repo? I am confused about the need to use forking, which appears to make sense for a large open source project with dispersed devs but seems like overkill when my partner and I sit 10 feet from each other.
Thanks,
Doug
If you have a small team and want everyone to have access to the repo, you can just grant them collaborator permission in the repo's admin settings. Forking isn't required if your scenario doesn't require it. (Although forking can be useful, you're also partly right: if you have a small team and know all the other team members and don't mind giving them read/write access, there's no need to fork.)
The difference between the fork and pull model and the shared repository model is explained in GitHub's documentation:
(https://help.github.com/articles/about-collaborative-development-models/)
About collaborative development models
The way you use pull requests depends on the type of development model you use in your project.
There are two main types of development models with which you'd use pull requests. In the fork and pull model, anyone can fork an existing repository and push changes to their personal fork without needing access to the source repository. The changes can be pulled into the source repository by the project maintainer. When you open a pull request proposing changes from your fork's branch to a branch in the source (upstream) repository, you can allow anyone with push access to the upstream repository to make changes to your pull request. This model is popular with open source projects as it reduces the amount of friction for new contributors and allows people to work independently without upfront coordination.
In the shared repository model, collaborators are granted push access to a single shared repository and topic branches are created when changes need to be made. Pull requests are useful in this model as they initiate code review and general discussion about a set of changes before the changes are merged into the main development branch. This model is more prevalent with small teams and organizations collaborating on private projects.