github repo counts by language (and historical data)? - github

I'm interested in getting a count of github repos for a certain set of languages (with historical data if possible.)
Here are things I've tried to start collecting the stats myself:
Screen scraping a page like:
https://github.com/search?q=language%3Aperl&type=&ref=simplesearch
Using the github API:
https://api.github.com/legacy/repos/search/KEYWORD?language=perl
but unfortunately this seems to require a KEYWORD to get any results. Also, I only need a count not the meta data on each repo.
I'm also interested in historical data, and it seems that those stats might already be available somewhere.
Any ideas on better ways to get repo counts by language and/or historical data?

You can try this:
https://api.github.com/search/repositories?q=language:Python
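If only the count matters, the search response carries a `total_count` field, so the per-repo metadata can be ignored. Here is a minimal Python sketch of that idea; the helper names are made up for illustration, `per_page=1` is just an assumption to keep the payload small, and unauthenticated requests are subject to rate limits.

```python
import json
import urllib.parse
import urllib.request

SEARCH_URL = "https://api.github.com/search/repositories"


def build_count_url(language: str) -> str:
    """Build a search URL that asks for a single result for the given language."""
    query = urllib.parse.urlencode({"q": f"language:{language}", "per_page": 1})
    return f"{SEARCH_URL}?{query}"


def extract_total_count(payload: dict) -> int:
    """Read the overall match count from a parsed search response."""
    return int(payload["total_count"])


def fetch_repo_count(language: str) -> int:
    """Perform the HTTP request (needs network access) and return the count."""
    with urllib.request.urlopen(build_count_url(language)) as response:
        return extract_total_count(json.load(response))
```

Calling `fetch_repo_count("perl")` on a schedule and storing the result would be one way to start building your own history going forward.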
Also, you can query the GitHub Archive.
Using the BigQuery command-line interface, the query would be:
bq query 'SELECT repository_language, count(repository_language) as repos
FROM [githubarchive:github.timeline]
WHERE type="CreateEvent" and repository_fork == "false"
GROUP BY repository_language
ORDER BY repos DESC'
This query returns the number of non-fork repository creation events per language, which approximates the number of repos per language.

Related

Github GraphQL v4 API nested pagination (Multiple pagination cursors can not be followed in a single query)

Let's paint a hypothetical picture for discussion.
Let's say a large company has 200 organizations each with 250 repositories and each of those repositories has 300 contributors.
Let's say I would like to build up a GraphQL query that answers the question:
Give me all contributors (and their privileges) of all repositories of all organizations in my account.
Obviously, pagination is needed.
But the way it is currently implemented, a pagination cursor is provided for each list of contributors, each list of repositories, and each list of organizations.
As a result, it is not possible to complete the query by following a single pagination cursor.
It is not clear to me that the query can be completed at all due to the ambiguity of specifying a pagination cursor for one list of contributors for one org/repo combo versus the next org/repo combo.
Thanks
Your initial query structure looks something like this (simplified):
query {
  organizations(first: 10) {
    repositories(first: 20) {
      contributors(first: 30) {
        name,
        privileges
      }
    }
  }
}
Now imagine this query would return a single pagination cursor. What should the next page look like?
next 10 organizations (with first 20 repositories, with first 30 contributors)
same 10 organizations, but next 20 repositories (with first 30 contributors)
same 10 organizations, with the same 20 repositories, but next 30 contributors
some wild mix of the above
When you build your own GraphQL API, you can design your cursor pagination according to your needs. But the GitHub API has to serve a wide range of consumers, so they chose a very flexible schema design that lets clients fetch exactly the data they need, without overfetching. The trade-off is that in some cases it takes additional roundtrips to get all the data you need.
Let's look at this from a frontend perspective:
After the initial request you will display the first 10 orgs, and for each org the first 20 repos, and for each repo the first 30 contributors.
Now the user can decide which data they want more of:
either load more orgs, or
load more repos for a specific org, or
load more contributors for a specific repo
Each of these decisions will result in a simple paginated query with one of the cursors the GitHub API provided. No need for an all-mighty pagination cursor.
(I highly doubt that there's a UI/UX use case where you want to paginate everything at once.)
So for this case I'd say the GitHub API is perfectly suited as it is. In my opinion it's not reasonable to display 200 * 250 * 300 = 15,000,000 contributors at once, because from a user's perspective that's just way too much.
Let's look at this from a backend perspective:
If you want to gather the data you described for analysis, aggregation or something similar on your backend server, and you already know that you need all the data, you may be able to skip pagination entirely by providing a large number for first. (This may not work for GitHub's API: as far as I know, first is limited to a maximum of 100 entries per page.)
Even if you are forced to use pagination, you are able to cache the results. Of course it still takes a few hundred roundtrips to the GitHub API, but this can be a scheduled job that runs once every night.
And because at this point you've already written all the necessary code, it's easy to implement some kind of partial refresh. For example if you know that "repo 42 of org 13" is pretty active, you're able to just refetch the data for this specific repo (on demand or in a shorter interval) and update your cache.
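To make the backend approach concrete, here is a rough Python sketch of following one cursor per list, outermost list first. It assumes each page has the usual connection shape (`nodes` plus `pageInfo` with `hasNextPage` and `endCursor`); the three `fetch_*` callables are hypothetical wrappers around whatever GraphQL requests you end up writing, and the `id` and `name` fields are placeholders.

```python
from typing import Callable, Iterator, Optional, Tuple

# A page is assumed to look like:
# {"nodes": [...], "pageInfo": {"hasNextPage": bool, "endCursor": str or None}}
FetchPage = Callable[[Optional[str]], dict]


def paginate(fetch_page: FetchPage) -> Iterator[dict]:
    """Yield every node of one connection by following its own cursor."""
    cursor = None
    while True:
        page = fetch_page(cursor)
        yield from page["nodes"]
        info = page["pageInfo"]
        if not info["hasNextPage"]:
            break
        cursor = info["endCursor"]


def walk_all(fetch_orgs, fetch_repos, fetch_contributors) -> Iterator[Tuple[str, str, str]]:
    """Walk orgs, then repos per org, then contributors per repo.

    Each level is paginated independently, so no combined cursor is needed.
    """
    for org in paginate(fetch_orgs):
        for repo in paginate(lambda cursor: fetch_repos(org["id"], cursor)):
            for person in paginate(lambda cursor: fetch_contributors(repo["id"], cursor)):
                yield org["id"], repo["id"], person["name"]
```

Because every list is walked with its own cursor, the ambiguity from the question never comes up: a contributor cursor is only ever used together with the org/repo pair it was issued for.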
I don't know your specific use case, but as long as you don't need (nearly) live updates of this huge data set, I'd say that GitHub's API is sufficient and flexible enough for most people's requirements.

Why do I receive different results when searching repositories?

I am trying to find the most recently updated repos on GitHub.
I use these two methods:
https://api.github.com/search/repositories?q=user:github+sort:updated+&per_page=5&type=all
https://api.github.com/users/github/repos?type=all&sort=updated&per_page=5
Why do I get different repos? Which method is correct?
On the GitHub website I see results like those from the first link:
https://github.com/github
I went through the results of both requests. It looks like in the first case sort:updated uses the pushed_at field to sort the results. In the second case, sort=updated uses the updated_at field. So, depending on which field you want to sort by, you can use either. Strangely, I could not find any documentation of this difference.
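A small Python illustration of why the two orderings can disagree. The repository names and timestamps below are invented sample data, not real API output; the point is only that sorting the same records by `pushed_at` and by `updated_at` can give different results.

```python
# Invented sample records that mimic the two timestamp fields on a repo object.
repos = [
    {"name": "alpha", "pushed_at": "2015-03-01T10:00:00Z", "updated_at": "2015-03-05T08:00:00Z"},
    {"name": "beta", "pushed_at": "2015-03-04T12:00:00Z", "updated_at": "2015-03-04T12:30:00Z"},
    {"name": "gamma", "pushed_at": "2015-02-20T09:00:00Z", "updated_at": "2015-03-06T07:00:00Z"},
]


def names_sorted_by(items, field):
    """Return repo names, newest first, ordered by the given timestamp field.

    ISO 8601 timestamps in the same timezone sort correctly as plain strings.
    """
    return [r["name"] for r in sorted(items, key=lambda r: r[field], reverse=True)]


by_pushed = names_sorted_by(repos, "pushed_at")    # ['beta', 'alpha', 'gamma']
by_updated = names_sorted_by(repos, "updated_at")  # ['gamma', 'alpha', 'beta']
```

If you need a consistent ordering across both endpoints, one option is to fetch the results and re-sort them client-side on the field you care about.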

GitHub API - latest public repositories

I would like to list public GitHub repositories with the latest create/update/push timestamps (for me any of these is acceptable). Can I achieve this with the GitHub API?
I have tried the following:
Tried using /repositories endpoint, and use the link header to navigate to the last page. However, the link header I receive only has first and next links, whereas I need a last link.
Tried using /search/repositories endpoint. This will work as long as I have a keyword or filter in the q parameter, but it will not accept an empty q parameter.
I got in touch with GitHub support, and there are two solutions to this:
Use binary search on the since parameter of the /repositories endpoint to find the last page.
Cons: may quickly exhaust the API rate limit.
Use the /search/repositories endpoint with an always-true predicate such as stars>=0.
Cons: likely to cause a query timeout or incomplete results.
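For the first option, this is roughly how the binary search could look in Python. It is only a sketch: `has_results` is a hypothetical callable that you would implement as a `GET /repositories?since=<id>` request returning whether the response list is non-empty, and every call to it costs one API request.

```python
from typing import Callable, Optional


def find_last_nonempty_since(has_results: Callable[[int], bool], start: int = 0) -> Optional[int]:
    """Return the largest `since` value whose page is still non-empty.

    Assumes has_results(since) is True below some threshold and False from
    there on, which holds because `since` only returns repos with larger IDs.
    """
    if not has_results(start):
        return None

    # Phase 1: grow the step exponentially until we overshoot into an empty page.
    lo, step = start, 1
    while has_results(lo + step):
        lo += step
        step *= 2
    hi = lo + step  # known to be empty

    # Phase 2: classic binary search between the last non-empty and first empty value.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if has_results(mid):
            lo = mid
        else:
            hi = mid
    return lo
```

The number of requests grows with the logarithm of the highest repository ID, so a single search is cheap, but repeating it frequently is what eats into the rate limit.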

Ckan API, List more information on rest dataset

I am using CKAN 2.6.0.
According to the documentation: http://docs.ckan.org/en/latest/api/legacy-api.html
I am trying to use the /rest/dataset endpoint. It works (only for public data, but it works), yet it only returns an array of dataset names and nothing else. An example can be found here: http://demo.ckan.org/api/1/rest/dataset
Is there a way to get a complete listing of datasets? I also tried the search endpoint, and it returns the same array.
For example, I would like to get the title, description, tags, file types, etc., like in the image below:
The REST api is deprecated/unmaintained and has been for a long time. Follow the up-to-date API documentation here.
package_search is your best bet: http://demo.ckan.org/api/action/package_search
That gives you a batch of datasets. Get more by paging through using the 'start' and 'rows' parameters.
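As a sketch of that paging loop in Python: the code assumes the usual action API response shape, with the datasets under `result.results` and the total under `result.count`. The `fetch` callable is a hypothetical wrapper that requests the URL built by `build_search_url` and returns the parsed JSON, and the page size of 100 is an arbitrary choice.

```python
from typing import Callable, Iterator
from urllib.parse import urlencode


def build_search_url(base_url: str, start: int, rows: int) -> str:
    """Build a package_search URL for one page of results."""
    return f"{base_url}/api/action/package_search?" + urlencode({"start": start, "rows": rows})


def iter_datasets(fetch: Callable[[int, int], dict], rows: int = 100) -> Iterator[dict]:
    """Yield every dataset by advancing `start` until `count` is reached."""
    start = 0
    while True:
        result = fetch(start, rows)["result"]
        batch = result["results"]
        if not batch:
            break  # defensive stop if the server returns an empty page
        yield from batch
        start += len(batch)
        if start >= result["count"]:
            break
```

Each yielded dataset is the full package dictionary, so fields such as title, notes, tags and resources are available without a second request per dataset.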
If you simply want them all, then it's much better still to use a bulk download that some sites offer, such as data.gov.uk, which supplies it complete as a simple JSONL download: Meta-data for data.gov.uk datasets.

In Github, is there a way to search for pull requests created by any author from a provided list?

For my team's weekly builds, I go through all pull requests from the company GitHub and pull out the PRs associated to my team. This requires an annoying sieving step that requires a walk-through of the company's previous week of code contribution.
I looked at the official GitHub search documentation (HERE) and found the "author" field could be used to narrow down the search in the way I want, but when I try this at https://github.com/pulls it only works on one author at a time.
Is there a way to search across a list of authors?
For a little extra context, my team operates across a large list of repos, all of which are under a blanket organization which houses all repos across the company.
Make sure that you are using the full search at https://github.com/search.
Then simply add extra author:<name> fields to your query. The search engine will OR these fields together. For example:
is:pr author:username1 author:username2
(Note that this only works on https://github.com/search. The search syntax on other pages, like https://github.com/pulls, is severely limited and does not support searching by multiple authors. If you try the same search on https://github.com/pulls, GitHub will simply ignore all but one author that you list.)
To limit it to repositories by a specific owner, add the user:<owner> field to the query.
Using the route github.com/search instead of github.com/pulls is the "right" answer in some sense, but I like the format of the /pulls page better. When working in a small team my approach is to use /pulls but substitute "involves" for "author", like this (for reference, the same query using /search and "author").
You will get "extra" hits where the author is someone outside the list, but it's another trick to know. (Names in the examples picked at random from recent public PRs)
You could simply use the advanced search for that: https://github.com/search/advanced 🤗
Option 1: Using Github's Search Query Language
Go to https://github.com/search
Type in a query following the format of this example (replacing author:* with your usernames).
Example: is:pr repo:zino-hofmann/graphql-flutter author:apackin author:kvenn
Explained
is:pr - only PRs (since Github treats Issues and PRs both as "Issues")
repo: - only show PRs in that repo
author: - only show PRs for these authors
It shows as "Issues", but the list will only include PRs.
Option 2: Fancy Bookmark/Alfred/Spotlight Search
You can modify the query params in the following URL to have the list of people on your team.
Replace <username1,2,3,4> with your teammates' GitHub usernames.
Replacing <your_company> with your company URL (or removing that entirely if not on enterprise).
https://github.<your_company>.com/search?q=author%3A<username1>+author%3A<username2>+author%3A<username3>+author%3A<username4>+is%3Apr&type=Issues
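If the team list changes often, the bookmark URL can also be generated. Here is a minimal Python sketch; the function name and the `host` parameter are made up for illustration, and the usernames in the usage note are placeholders.

```python
from urllib.parse import urlencode


def build_pr_search_url(authors, host="github.com"):
    """Build a search URL listing PRs opened by any of the given authors.

    `host` is an assumption: swap in your enterprise hostname if you have one.
    """
    terms = [f"author:{name}" for name in authors] + ["is:pr"]
    query = urlencode({"q": " ".join(terms), "type": "Issues"})
    return f"https://{host}/search?{query}"
```

For example, `build_pr_search_url(["username1", "username2"])` yields a URL with the same `author%3A...+author%3A...+is%3Apr&type=Issues` shape as the one above.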
Option 3: Using Github's Advanced Search UI
You can use Github's "Advanced Search" to achieve what you're looking for without needing to learn Github's query language.
For public repos: http://github.com/search/advanced
For internal/enterprise repos: http://github.<your_company>.com/search/advanced
You can use the fields below for filtering:
To filter for specific repos, use "Advanced options" -> "In these repositories"
To filter for specific authors, use "Issues options" -> "Opened by the author"
It uses query params under the hood, so you can generate the search with the UI and then copy and paste the resulting URL (to use for Option 2).
Note: You'll need to add "is:pr" to the resulting search query, no way to do that in the UI.