Building a Dataset From a Large Number of GitHub Repositories for an NLP Project

I am working on a Machine Learning project for which I need as many code repositories as possible. What is the easiest way to download/clone a large number of GitHub repositories? I assume I could use the GitHub API to search for the repositories I want, e.g. repositories with > 1k stars, and then run git clone on each of them. However, in addition to being incredibly slow, I am pretty sure GitHub has some throttling mechanism, which would make this even slower once I add a retry mechanism to work around the throttling. I would appreciate some alternative ideas.
Thanks!
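One way to make the API-plus-clone approach workable is to page through the Search API, stay under the rate limit, and use shallow clones to cut transfer time. A rough sketch of that idea, using curl and jq (the star threshold, page count, and sleep interval are illustrative, not tuned):

    # Collect clone URLs for popular repositories (search returns at most
    # 1000 results; an authenticated token raises the search rate limit).
    for page in $(seq 1 10); do
      curl -s -H "Authorization: token $GITHUB_TOKEN" \
        "https://api.github.com/search/repositories?q=stars:%3E1000&sort=stars&per_page=100&page=$page" \
        | jq -r '.items[].clone_url' >> repos.txt
      sleep 5   # stay well under the search rate limit
    done

    # Shallow clones skip the history, which is usually all an NLP corpus needs.
    while read -r url; do
      git clone --depth 1 "$url" || echo "$url" >> failed.txt   # retry these later
    done < repos.txt

To get past the 1000-result cap, you can slice the query into disjoint star ranges (e.g. stars:1000..2000, stars:2000..5000) and page through each slice.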

Related

How not to expose the whole codebase to a remote developer?

For a small startup, I employed some remote developers. However, I only want to reveal the necessary code to a given developer, not the entire source code.
Is this kind of feature offered by GitHub? If not, please suggest a workaround.
Many thanks
With Git repositories on GitHub there is no way to prevent a developer from cloning the whole repository, and GitHub can't filter the contents of a repository to leave out part of the data. Permissions on GitHub can only prevent access to a repository entirely, make the whole repository read-only, or grant write access to it.
If you really want to limit access, you'll need to split your solution into multiple pieces, each in their own git repository. You can then set permissions for each repository in GitHub.
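If you do split things up, a private umbrella repository can stitch the pieces back together for the people who are allowed to see everything. A minimal sketch using submodules (repository names are hypothetical):

    # Umbrella repo referencing the restricted pieces as submodules
    git init product && cd product
    git submodule add git@github.com:acme/frontend.git frontend
    git submodule add git@github.com:acme/backend.git backend
    git commit -m "Compose product from component repos"

    # A contractor with access only to 'frontend' clones just that repo;
    # full-access developers clone everything in one go:
    git clone --recurse-submodules git@github.com:acme/product.git

GitHub's per-repository permissions then do the filtering: anyone without access to a component simply can't fetch that submodule.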
As a developer myself, I caution you against this. A developer with only part of the sources would have a hard time verifying that their changes work the way you intend, and it might make it much harder for them to debug any issues that come up during development.

Allow issues to have multiple repositories, or auto-clone an issue to two repositories

Our project has a common setup where we handle the frontend and backend in two repositories.
We use GitHub Projects to manage both repos, but we keep running into the issue that we essentially need to clone one task so it can become an issue in each repository.
Is there a way to let one issue span two repositories, or to auto-clone tasks with certain labels, to speed up or improve this process? I am open to creative responses, but I would prefer something free.
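There is no built-in way to attach one issue to two repositories, but the REST API makes the copying straightforward to script. A rough sketch with curl and jq (owner/repo names and the issue number are placeholders):

    # Copy issue 42 from the frontend repo into the backend repo:
    # GET the source issue, then POST a new one with the same title/body.
    src=$(curl -s -H "Authorization: token $GITHUB_TOKEN" \
      "https://api.github.com/repos/acme/frontend/issues/42")

    curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" \
      "https://api.github.com/repos/acme/backend/issues" \
      -d "$(jq -n --argjson i "$src" \
            '{title: $i.title, body: ($i.body + "\n\nMirrored from " + $i.html_url)}')"

A free GitHub Actions workflow triggered on the issues "labeled" event could run the same two calls automatically whenever a task gets a particular label.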

Are there best-practice guidelines for maintaining a repository?

Are there best-practice guidelines for maintaining a GitHub repository? I've contributed to many open source projects and used GitHub for projects that I work on solo, but now I'm working with a team of six developers, including myself, to build a system, and I've been placed in charge of maintaining the repository. Nothing gets merged into our main branch without my approval. As little as I know about maintaining a GitHub repository, of everyone in the organization (two team members are consultants), I have the most experience with the process.
But I've never maintained a GitHub repository, and while I'm doing OK, I know that there must be a body of knowledge out there of how to handle this correctly. I just haven't been able to find it.
One hurdle I've been jumping over repeatedly, for example, is merge conflicts. Usually they're minor, but not always. Is there some known system available that allows me to enforce who has the ability to edit which files at any given time, for example?
And yes, I realize this may not be the best Stack Exchange forum, but none of the others seemed more suited to the topic.
The Cloud Native Computing Foundation (CNCF) serves as the vendor-neutral home for many of the fastest-growing open source projects, including Kubernetes, Prometheus, and Envoy.
As such, it can be used as a starting point for your own project: see contribute.cncf.io/maintainers/github/, which offers:
a template, so you are sure to have your README, LICENSE and other important files
labels, to better classify your issues
Also add a clear "release and maintenance policy", and you should be in good shape.
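On the question of enforcing who can edit which files: GitHub can't lock files, but a CODEOWNERS file combined with branch protection comes close, because pull requests that touch a path then require a review from that path's owner. An illustrative .github/CODEOWNERS (paths and names are made up):

    # Reviews from these owners are required once branch protection
    # enables "Require review from Code Owners".
    /frontend/   @alice
    /backend/    @bob @carol
    *.sql        @acme/db-team

This won't stop two people from editing the same file on different branches, but it makes ownership explicit and keeps risky paths behind a designated reviewer.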

What is the best Git workflow for collaborating on a project with several people?

I have been looking into the best way to collaborate on a GitHub project with several people while maintaining code quality and avoiding pushing mistakes to the trunk. I have found numerous resources on how a lot of people prefer to do it. From what I gather, the preferred way is to create a new branch for each new piece of development or fix, open a pull request, and then merge to trunk. I want to know what other options there are that seem to work for people.
See Git Flow.
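For reference, the branch-per-change cycle described in the question usually looks like this (branch names are illustrative):

    git checkout -b feature/login main   # each change starts on its own branch
    # ...commit work...
    git push -u origin feature/login     # publish the branch
    # open a pull request, get a review, then merge to trunk and delete the branch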

What are the advantages of a distributed version control for a team that is effectively never distributed?

When working remotely, our team only has access to our source code by remote desktop into our office PCs, so we never really work offline. Does a distributed version control system like Mercurial or Git still give us advantages over our current centralized Subversion setup? If so, what are they? Are there any drawbacks or pitfalls? I've read in numerous places that shifting to distributed version control requires a change in thinking. Can someone explain what needs to change in this regard?
As explained in the differences between DVCS and CVCS (Centralized VCS), the main advantages are:
local commits (you can commit more often in private branches, then clean up the history you want to push to other repos)
publication process (you can pull from multiple repos, or quickly establish intermediate repos to push to, where you can do intermediate tasks like continuous integration tests)
That last point required the most "change in thinking" and is a bit scary ("I can pull from any repo?!")
But once you realize the benefits, you can really have more productive development cycles because you are able to monitor (by fetching commits from your peers) the development of some of your colleagues. If they are developing a function that you need, you can start integrating it sooner.
(The thing to remember with a DVCS is that it doesn't prevent the setup of a "central" repo for other developers to pull from.)
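To make the first point (local commits) concrete: cleaning up a private branch before publishing it typically looks like this (branch names are illustrative):

    git checkout feature/parser
    git rebase -i main    # squash fixups and reword messages; the branch is
                          # still private, so rewriting its history is safe
    git push origin feature/parser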
As for continuous integration, instead of pushing directly from your repo to a central server in charge of CI, you can push to a local repo on your desktop, which will run all the tests before automatically pushing the code (if "green") to a "central" repo.
It is so effective that you can now push to the official central repo a code that "never breaks the build", rendering your CI server pretty much useless ;)
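A hedged sketch of that gating setup: a post-receive hook in the intermediate (bare) repository on your desktop that runs the tests and only forwards the code when they pass. The paths, the remote name central, and the make test command are all assumptions:

    #!/bin/sh
    # post-receive hook: materialize the pushed code, test it, forward if green.
    REPO=$(pwd)                  # hooks run from the bare repository's directory
    WORKDIR=/tmp/ci-checkout     # hypothetical scratch checkout
    mkdir -p "$WORKDIR"
    git --work-tree="$WORKDIR" checkout -f main
    if (cd "$WORKDIR" && unset GIT_DIR && make test); then
      git --git-dir="$REPO" push central main   # 'central' is a configured remote
    fi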
I would recommend HgInit as a very thorough explanation of just how svn is improved upon by a decentralized toolset. It will also help you to understand the conceptual differences.
One of the big improvements I'd like to emphasize is the notion of merge tracking. Subversion didn't have this feature at all until 1.5, and with the difference in the way it treats revisions and branches, it will probably never be as good as the decentralized tools can be. Nobody likes merges. Might as well reduce as much of that pain as you can. Also see: Why is branching and merging easier in Mercurial than in Subversion?.
The biggest change in thinking for me when making the switch from Subversion was getting over the idea that history is strictly linear and that branching is nothing but copying code to another directory. Note that in Git and Mercurial, you don't check out a subdirectory of the repository. You won't see 'git checkout http://github.com/project/branches/v2.0' or anything like it. Eric Sink wrote a really good explanation of the difference in the way the history is stored. I recommend taking a look.
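The contrast shows up in the commands themselves: in Subversion a branch is a directory copy inside the repository tree, while in Git a branch is just a movable pointer into one shared history (URLs are illustrative):

    # Subversion: branching copies a directory
    svn copy http://svn.example.com/repo/trunk \
             http://svn.example.com/repo/branches/v2.0 -m "Create v2.0 branch"

    # Git: branching creates a pointer; nothing is copied
    git branch v2.0
    git checkout v2.0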
The development machines might stand next to each other, but the source code is still distributed between them. That the machines are in close physical proximity really doesn't matter for managing source code changes made by different developers.