How to manage a large number of repositories on GitHub?

I am a student, and so far I have been using GitHub to keep all of my projects. But the number of repos is growing, and it is getting hard to manage. Is there a way to group repositories into something like folders, or is there some other service besides GitHub where I can do that?

Related

Building a Dataset From a Large Number of GitHub Repositories for an NLP Project

I am working on a machine learning project for which I need as many code repositories as possible. What is the easiest way to download/clone a large number of GitHub repositories? I assume I could use the GitHub API to search for the repositories I want to download, e.g. repositories with > 1k stars, and then run git clone on each of them. However, in addition to being incredibly slow, I am pretty sure GitHub has some throttling mechanism, which would make this even slower once I implement a retry mechanism to work around it. I would appreciate some alternative ideas.
Thanks!
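One way to approach the bulk-clone idea is to combine the GitHub search API (which caps results at 100 per page) with shallow clones, and to back off when the rate limit is hit. This is a sketch, not a complete crawler: the star threshold and destination directories are illustrative, and unauthenticated requests get a much lower rate limit than requests with a token.

```python
# Sketch: enumerate high-star repositories via the GitHub search API and
# shallow-clone them. The endpoint and headers are the real GitHub REST API;
# the 1000-star threshold and destination paths are illustrative.
import json
import subprocess
import time
import urllib.error
import urllib.request

API = "https://api.github.com/search/repositories"

def search_url(min_stars: int, page: int, per_page: int = 100) -> str:
    """Build a search URL for repositories above a star threshold."""
    return f"{API}?q=stars:%3E{min_stars}&sort=stars&per_page={per_page}&page={page}"

def clone_cmd(full_name: str, dest: str) -> list[str]:
    """Shallow clone (history truncated to one commit) keeps downloads small."""
    return ["git", "clone", "--depth", "1",
            f"https://github.com/{full_name}.git", dest]

def fetch_page(min_stars: int, page: int) -> list[dict]:
    """Fetch one page of search results, sleeping until the rate limit resets."""
    req = urllib.request.Request(search_url(min_stars, page),
                                 headers={"Accept": "application/vnd.github+json"})
    try:
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)["items"]
    except urllib.error.HTTPError as err:
        if err.code == 403:  # rate limited: wait for the advertised reset time
            reset = int(err.headers.get("X-RateLimit-Reset", time.time() + 60))
            time.sleep(max(reset - time.time(), 1))
            return fetch_page(min_stars, page)
        raise

if __name__ == "__main__":
    for repo in fetch_page(1000, 1):
        subprocess.run(clone_cmd(repo["full_name"], repo["name"]), check=True)
```

Shallow clones (`--depth 1`) are the main speedup here: for an NLP corpus you usually only need the latest tree, not the full history.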

Allow Issues to have multiple Repositories or auto clone issue to two repositories

Our project has a common setup where we handle the frontend and backend in two repositories.
We use GitHub Projects to manage both repos, but we are running into the issue that we essentially need to clone one task so it can become an issue in a single repository.
Is there a way to let one issue belong to two repositories, or to auto-clone tasks with certain labels, to speed up or improve this process? I am open to creative responses, but I would prefer something free.
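A GitHub issue always belongs to exactly one repository, but the "create an issue" REST endpoint makes auto-cloning straightforward to script. This is a hedged sketch: the endpoint and payload fields are the real GitHub API, while the token, repo names, and the "mirror" label convention are assumptions for illustration.

```python
# Sketch: copy an issue into a second repository via the GitHub REST API
# (POST /repos/{owner}/{repo}/issues). The "Mirrored from ..." back-link
# and the "mirror" label are an invented convention, not a GitHub feature.
import json
import urllib.request

def mirror_payload(issue: dict, source_url: str) -> dict:
    """Build the request body for the cloned issue, linking to the original."""
    return {
        "title": issue["title"],
        "body": f"Mirrored from {source_url}\n\n{issue.get('body') or ''}",
        "labels": list(issue.get("labels", [])) + ["mirror"],
    }

def create_issue(owner: str, repo: str, payload: dict, token: str) -> dict:
    """POST the cloned issue; requires a token with repo scope."""
    req = urllib.request.Request(
        f"https://api.github.com/repos/{owner}/{repo}/issues",
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {token}",
                 "Accept": "application/vnd.github+json"},
        method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

A script like this could run from a GitHub Actions workflow triggered on the `labeled` issue event, which keeps the whole setup free for public repositories.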

Given a GitHub repo, how can I see whether one or more of its forks has measurable activity?

Context
I often find valuable projects with many forks. Sometimes I would like to add something to the original functionality, so I fork the project. However, to avoid reinventing the wheel, I would like to make sure no one has done that work before. So I review the existing forks, but usually they are just stale copies. With 100 forks, this is tedious work.
Question
Given a GitHub repo, how can I see whether one or more of its forks has measurable activity? I would like to do this to filter out the "just another copy" forks.
GitHub's network graph is one of its features for understanding connections between repositories. It provides a:
Timeline of the most recent commits to this repository and its network ordered by most recently pushed to.
This would show the most recently active forks just below the main repository. You can access it at https://github.com/<user>/<repo>/network, e.g. here's one from one of my repos.
I don't think GitHub itself provides an easy way to do this, but some third-party tools have been written to help with that.
gitpop3, for example, supports sorting the forks of a given repository by number of stars, forks, or commits, or by last modified date.
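The same filtering can be scripted against the API: the real "list forks" endpoint returns a `pushed_at` timestamp per fork, which is a reasonable proxy for activity. This sketch assumes an unauthenticated request and a one-year cutoff, both of which you would adjust.

```python
# Sketch: keep only forks with recent pushes, via GET /repos/{owner}/{repo}/forks.
# The one-year cutoff and the example repo in __main__ are illustrative.
import json
import urllib.request
from datetime import datetime, timedelta, timezone

def active_forks(forks: list[dict], cutoff_iso: str) -> list[dict]:
    """Keep forks whose last push is newer than the cutoff, newest first.
    GitHub's ISO 8601 timestamps compare correctly as plain strings."""
    return sorted((f for f in forks if (f.get("pushed_at") or "") > cutoff_iso),
                  key=lambda f: f["pushed_at"], reverse=True)

def list_forks(owner: str, repo: str, page: int = 1) -> list[dict]:
    """Fetch one page (max 100) of a repository's forks."""
    url = f"https://api.github.com/repos/{owner}/{repo}/forks?per_page=100&page={page}"
    req = urllib.request.Request(url, headers={"Accept": "application/vnd.github+json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    cutoff = (datetime.now(timezone.utc) - timedelta(days=365)).strftime("%Y-%m-%dT%H:%M:%SZ")
    for fork in active_forks(list_forks("octocat", "Hello-World"), cutoff):
        print(fork["full_name"], fork["pushed_at"])
```

Note that `pushed_at` only tells you the fork received commits; you would still need to diff against the parent to see whether those commits add anything new.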

How to link pull requests in different Azure DevOps repositories?

We have several repositories in Azure DevOps, some in the same project and others in different projects.
I'm trying to find out how pull requests could be linked in some fashion.
I know we could put URLs to the other pull requests in the description, which I'm guessing might be our only option.
However, it would be useful to be able to see some status across pull requests in different repositories.
I'm picturing a project that involves changes in different repos; at the end of the project, we'd want to merge all of these pull requests at the same time. So we need some means of making it obvious, from any one pull request, which other pull requests are also required. Obviously I'm not expecting a one-click merge/complete-all button.
But there would need to be some obvious link.
Maybe there's an extension that might help, or something obvious I've missed.
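The "URLs in the description" convention can at least be made machine-readable: Azure DevOps PR URLs have a fixed shape, and the real Pull Requests REST API reports each PR's status (`active`, `completed`, or `abandoned`). This sketch parses linked PR URLs out of a description and queries each one; the organization name and personal access token are placeholders.

```python
# Sketch: extract cross-linked PR URLs from a description and check each
# one's status via GET .../_apis/git/repositories/{repo}/pullrequests/{id}.
# The PAT-based Basic auth scheme is Azure DevOps' standard for scripts.
import base64
import json
import re
import urllib.request

PR_LINK = re.compile(
    r"https://dev\.azure\.com/(?P<org>[^/]+)/(?P<project>[^/]+)"
    r"/_git/(?P<repo>[^/]+)/pullrequest/(?P<id>\d+)")

def linked_prs(description: str) -> list[dict]:
    """Extract org/project/repo/id for every PR URL in a description."""
    return [m.groupdict() for m in PR_LINK.finditer(description or "")]

def pr_status(org: str, project: str, repo: str, pr_id: str, pat: str) -> str:
    """Return 'active', 'completed', or 'abandoned' for one pull request."""
    url = (f"https://dev.azure.com/{org}/{project}/_apis/git/repositories/"
           f"{repo}/pullrequests/{pr_id}?api-version=7.0")
    auth = base64.b64encode(f":{pat}".encode()).decode()
    req = urllib.request.Request(url, headers={"Authorization": f"Basic {auth}"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["status"]
```

Run periodically (or from a pipeline), this gives a dashboard-style view of whether every PR in the set is still open, without any extension.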

Does GitHub store each object just once?

I'm doing a project that involves storing lots of repositories from GitHub. Many objects are shared by many repositories, so I want to know whether GitHub stores each object only once to save storage, and if so, how it does this (if it's not a secret).
I have not found any satisfying answer, just some guesses that GitHub does this.
GitHub does not do this globally. GitHub stores each "repository network" individually, where a repository network is:
An original repository
Forks of that repository
Each "repository network" can share objects between them, using Git's "alternates" mechanism. This allows Git to consider other object database locations beyond just the normal storage within the repository.
When you create a repository on GitHub, you create a single, bare repository on disk, with a normal on-disk object database backing it up. When you create a fork from that repository, GitHub will:
Create a new "alternates" area for the repository network.
Move the repository's objects into the alternates area.
Set up the original repository to know about the new alternates area.
Set up the new fork to know about the new alternates area.
When this happens, the repositories in the network share objects: the same object is stored once for the original repository and all of its forks.
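The alternates mechanism described above is plain Git, so it can be reproduced locally. This sketch builds a bare "origin" with one commit, creates an empty bare "fork", and points the fork's `objects/info/alternates` file at the origin's object store; the fork can then resolve the commit despite holding no objects itself. The repo names and single-commit setup are illustrative, and it assumes `git` is on PATH.

```python
# Sketch: demonstrate Git's alternates mechanism with two local bare repos.
import os
import subprocess
import tempfile

def run(*args, cwd=None):
    """Run a git command, returning its stdout."""
    return subprocess.run(args, cwd=cwd, check=True,
                          capture_output=True, text=True).stdout

def demo_alternates() -> str:
    base = tempfile.mkdtemp()
    orig = os.path.join(base, "orig.git")
    fork = os.path.join(base, "fork.git")
    work = os.path.join(base, "work")
    # Original bare repo, populated through a working clone.
    run("git", "init", "--bare", "-q", orig)
    run("git", "clone", "-q", orig, work)
    with open(os.path.join(work, "file.txt"), "w") as f:
        f.write("hello\n")
    run("git", "add", "file.txt", cwd=work)
    run("git", "-c", "user.email=a@b", "-c", "user.name=t",
        "commit", "-q", "-m", "init", cwd=work)
    run("git", "push", "-q", "origin", "HEAD:master", cwd=work)
    head = run("git", "rev-parse", "HEAD", cwd=work).strip()
    # The "fork": an empty bare repo that borrows objects via alternates.
    run("git", "init", "--bare", "-q", fork)
    with open(os.path.join(fork, "objects", "info", "alternates"), "w") as f:
        f.write(os.path.join(orig, "objects") + "\n")
    # The fork can now resolve the commit although its own store is empty.
    return run("git", "cat-file", "-t", head, cwd=fork).strip()
```

GitHub's production setup differs (one shared object area per network rather than one repo borrowing from another), but the underlying `objects/info/alternates` file is the same mechanism.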
But this is the limit of shared object storage. There's no big database of objects that everybody can share (nor should there be for scalability and security reasons).
(Note: although I worked on the Git Infrastructure team at GitHub, this information is not confidential.)
Sources
These two talks at the Git Merge conference, and the article below, discuss GitHub's Git repository storage:
Scaling at GitHub, a talk by Patrick Reynolds at Git Merge 2016.
Top Ten Worst Repositories to Host on GitHub, a talk by Carlos Martín Nieto at Git Merge 2017.
Counting Objects, an article by Vicent Martí.