Find a string in a GitHub Pull Request

I'd like to build a bot to let you know if a certain string, like DONT_MERGE_ME, appears in a GitHub Pull Request, so I can block the merge with a failed check and add a helpful comment for the developer.
Let's say you had committed code like the following, which you don't want to accidentally merge with your PR (e.g. you're hacking around).
const bar = 'some-hack-value'; // DONT_MERGE_ME
Given the PR id, I'd like to figure out if the PR still has the string DONT_MERGE_ME in it. However,
the GitHub Code Search API has many limits, like 384KiB max file size and only searches the default branch
the GitHub Commit Search API only searches the default branch
the GitHub Pull Requests Search API only searches by title/body/comment
Given the above limitations, it looks like the only way to figure this out, for a given PR id and commit, would be to find all commits in the PR up to this commit, download the diffs, and sum them up.
Is there a simpler way to do this with the GitHub API?

The approach I would recommend is to subscribe to the pull_request event. If payload.action is either opened or synchronize, load the diff of the pull request and look for the string in all lines that have been changed.
You can preview the diff response for a pull request by adding .diff to any pull request URL, e.g. https://patch-diff.githubusercontent.com/raw/gr2m/sandbox/pull/194.diff
Find the lines starting with a + and look for your string in them
If you use the JavaScript octokit package, you can load a pull request like this
const { data: diff } = await octokit.rest.pulls.get({ owner, repo, pull_number, mediaType: { format: "diff" } });
Also check out the TODO GitHub App; its source is open source, too.
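If you don't use octokit, here's a minimal sketch of that webhook flow in Python (Flask and requests are my assumptions here, not part of the original answer):

import requests
from flask import Flask, request

app = Flask(__name__)
GITHUB_TOKEN = "..."  # hypothetical token with read access to the repo

@app.route("/webhook", methods=["POST"])
def on_pull_request():
    if request.headers.get("X-GitHub-Event") != "pull_request":
        return "", 204
    payload = request.get_json()
    if payload["action"] not in ("opened", "synchronize"):
        return "", 204
    pr = payload["pull_request"]
    # Request the diff representation of the pull request
    diff = requests.get(
        pr["url"],
        headers={
            "Accept": "application/vnd.github.v3.diff",
            "Authorization": f"token {GITHUB_TOKEN}",
        },
    ).text
    # Only inspect added lines; skip the +++ file headers
    flagged = [
        line for line in diff.splitlines()
        if line.startswith("+") and not line.startswith("+++")
        and "DONT_MERGE_ME" in line
    ]
    if flagged:
        print(f"PR #{pr['number']} still contains DONT_MERGE_ME")
    return "", 200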

I discovered (thanks for the tip, @Gregor) that there is a GitHub API for getting the pull request as a diff, if you pass certain headers.
Here's how we can get the delta for a repo's PR:
const axios = require('axios');

const pullId = 14956; // NOTE: 73 files changed!
const repoFullname = 'eslint/eslint';
const url = `https://api.github.com/repos/${repoFullname}/pulls/${pullId}`;
// Request the diff representation via the Accept header
const requestConfig = { headers: { Accept: 'application/vnd.github.v3.diff' } };
const diffStr = (await axios.get(url, requestConfig)).data;
Then, we can use the parse-diff library to parse the diff, filter on the added changes, and match the content we want.
const parse = require('parse-diff');

// Search for this word
const KEYWORD = 'Requirements';
// Analyze all files
const files = parse(diffStr);
const filesWithMatchingAdds = files
  .map(file => ({
    file: file.to,
    adds: file.chunks
      .map(chunk => chunk.changes
        // Only look for added lines
        .filter(change => change.type === 'add')
        // That match our keyword
        .filter(change => change.content.includes(KEYWORD)))
      .flat(), // collapse into one array
  }))
  // Only files with at least one match
  .filter(file => file.adds.length);
Output looks something like
[
  {
    "file": "tests/tools/internal-rules/multiline-comment-style.js",
    "adds": [
      {
        "type": "add",
        "add": true,
        "ln": 4,
        "content": "+// Requirements"
      }
    ]
  }
]
Full gist here.

Related

GitHub REST and GraphQL API are returning different data

I am scraping some data from GitHub. The RESTful URL to this particular PR shows that it has a merge_commit_sha value: https://api.github.com/repos/ansible/ansible/pulls/15088
However, when I try to get the same PR using the GitHub GraphQL API, it shows it does not have any mergeCommit value.
{
  resource(url: "https://github.com/ansible/ansible/pull/15088") {
    ... on PullRequest {
      id
      number
      title
      merged
      mergeCommit {
        message
      }
    }
  }
}
For context, the PR of interest is actually merged and should have a merge commit value. I am looking for an explanation of the difference between these two APIs.
The link posted in the other answer contains the explanation: Git no longer has the original commit (which makes sense).
Presumably the original commit SHA is still stored, but the GraphQL API actually checks whether Git still has it, whereas the REST API doesn't.
If you search for the commit SHA the API returns, you can't find it in the repo.
https://github.com/ansible/ansible/commit/d7b54c103050d9fc4965e57b7611a70cb964ab25
Since this is a very old pull request on an active repo, there's a good chance some old commits were cleaned up, or other maintenance was done on the repo. It's hard to tell, as that kind of maintenance obviously isn't version controlled.
Another option is the pull request was merged with fast-forward, which does not involve a merge commit. But that wouldn't explain the SHA on the REST API response.
So probably at some point they removed old merge commits to save some space, or something similar. Some objects still point to removed SHAs, but GraphQL API filters on existing objects.
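If you want to verify this yourself, here's a quick sketch in Python (requests is my choice here) that checks whether a SHA still resolves via the REST commits endpoint:

import requests

sha = "d7b54c103050d9fc4965e57b7611a70cb964ab25"
url = f"https://api.github.com/repos/ansible/ansible/commits/{sha}"
resp = requests.get(url)
# 200 means the object still exists in the repository;
# a 404/422 means the SHA no longer resolves to a commit
print(resp.status_code)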
It feels like a bug to me, because if you query another PR, such as 45454, it does return the mergeCommit:
{
  "data": {
    "resource": {
      "id": "MDExOlB1bGxSZXF1ZXN0MjE0NDYyOTY2",
      "number": 45454,
      "title": "win_say - fix up syntax and test issues (#45450)",
      "merged": true,
      "mergeCommit": {
        "message": "win_say - fix up syntax and test issues (#45450)\n\n\n(cherry picked from commit c9c141fb6a51d6b77274958a2340fa54754db692)",
        "oid": "f2d5954d11a1707cdb70b01dfb27c722b6416295"
      }
    }
  }
}
I also found that others have encountered the same problem, along with another similar issue. I suggest raising the issue with GitHub support.

Github API get user code merged into master branch of specific repo

I have a question regarding GitHub's REST API. I am looking through the documentation but can't seem to find what I am looking for. Is there a way to look up how much code a user has merged into the master branch of a specific repo over a period of time?
Essentially, I would like to get a result like:
user X committed N lines of code in the range of May 2020 to May 2021. Something like that.
GITHUB API
Unfortunately, the GitHub API doesn't provide that information (yet?); you'll only get the number of contributions the user made to the repository via the List repository contributors service.
WORKAROUND
There seems to be a workaround:
Consulting this URL: https://github.com/<repo_owner>/<repo_name>/graphs/contributors, I observed that the following request was made to fill in the contributor data:
Request URL: https://github.com/<repo_owner>/<repo_name>/graphs/contributors-data
Request Method: GET
Request Headers: Not Sure...
This endpoint will return something similar to the following object:
[
  {
    "total": N,
    "author": {
      "id": <user_id>,
      "login": "<user_name>",
      "avatar": "<user_avatar>",
      "path": "/<user_name>",
      "hovercard_url": "/users/<user_name>/hovercard"
    },
    "weeks": [
      {
        "w": 1586044800,
        "a": 0,
        "d": 0,
        "c": 0
      },
      {
        "w": 1586649600,
        "a": 0,
        "d": 0,
        "c": 0
      },
      ...
    ]
  }
]
In this response, you'll have a list of contributors with each contributor's details, where:
total: the number of contributions this user made to the repository.
author: the GitHub user's data.
weeks: a list of weekly contributions, where for each week:
w: the week's start timestamp
a: the number of lines added
d: the number of lines deleted
c: the number of commits made
Therefore, to get what you want, you'll have to sum the weekly contributions for each user (see the sketch below).
Observation: I understand that to do so, you need to have access to the repository insights (they may not be available to every user).
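Here's a minimal sketch of that summation in Python, assuming you've already fetched the JSON above into a contributors list (the date range is the one from the question):

from datetime import datetime, timezone

start = datetime(2020, 5, 1, tzinfo=timezone.utc).timestamp()
end = datetime(2021, 5, 31, tzinfo=timezone.utc).timestamp()

for contributor in contributors:  # the list shown above
    added = sum(
        week["a"] for week in contributor["weeks"]
        if start <= week["w"] <= end  # "w" is the week's start timestamp
    )
    login = contributor["author"]["login"]
    print(f"{login} added {added} lines in the range")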

Using GitHub's API to get lines of code added/deleted per commit (on a branch)?

The following gets a raw list of commits for a project's master branch:
https://api.github.com/repos/<organization_name>/<repo_name>/commits?page=0&per_page=30
Question 1: How can one get a similar list but for a specific <branchname>?
Question 2: The list of commits above doesn't include any data about the lines of code added/deleted per commit (i.e., a very rough productivity metric). Is there a way to get this data in the query?
You can fetch a specific branch with the sha={branchName} param on the /commits endpoint:
sha string SHA or branch to start listing commits from. Default: the repository’s default branch (usually master).
https://api.github.com/repos/<org_name>/<repo_name>/commits?sha=<branchName>&page=0&per_page=30
To get per-file changes for each commit, you'd need to check the url field of each commit entity in the response of the above URL. From that new endpoint call, you will get more detailed information about that single commit. The files field there contains the changes in that commit: both added and removed code per file.
An example with my repo:
https://api.github.com/repos/buraequete/orikautomation/commits?sha=master&page=0&per_page=30
If we get the first commit's url:
https://api.github.com/repos/buraequete/orikautomation/commits/89792e6256dfccc5e9151d81bf04145ba02fef8f
which contains the changes you want in the files field, as a list:
"files": [
{
"sha": "8aaaa7de53bed57fc2865d2fd84897211c3e70b6",
"filename": "lombok.config",
"status": "added",
"additions": 1,
"deletions": 0,
"changes": 1,
"blob_url": "https://github.com/buraequete/orikautomation/blob/89792e6256dfccc5e9151d81bf04145ba02fef8f/lombok.config",
"raw_url": "https://github.com/buraequete/orikautomation/raw/89792e6256dfccc5e9151d81bf04145ba02fef8f/lombok.config",
"contents_url": "https://api.github.com/repos/buraequete/orikautomation/contents/lombok.config?ref=89792e6256dfccc5e9151d81bf04145ba02fef8f",
"patch": "## -0,0 +1 ##\n+lombok.accessors.chain = true"
},
...
]
Sorry, but I don't think there is a way to get those per-file changes in the original /commits listing call; you have to make multiple calls...
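A minimal sketch of that two-step flow in Python (requests is my choice here; the repo and branch come from the example above):

import requests

base = "https://api.github.com/repos/buraequete/orikautomation"
commits = requests.get(f"{base}/commits",
                       params={"sha": "master", "per_page": 30}).json()

for entry in commits:
    # Each listing entry carries a url to the detailed commit resource
    detail = requests.get(entry["url"]).json()
    stats = detail["stats"]  # aggregate additions/deletions for this commit
    print(detail["sha"][:7], f"+{stats['additions']} -{stats['deletions']}")
    for f in detail["files"]:
        print("  ", f["filename"], f"+{f['additions']} -{f['deletions']}")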

Recommended way to list all repos/commits for a given user using github3.py

I'm building a GitHub application to pull commit information from our internal repos. I'm using the following code to iterate over all commits:
gh = login(token=gc.ACCESS_TOKEN)
for repo in gh.iter_repos():
    for commit in repo.iter_commits():
        print(commit.__dict__)
        print(commit.additions)
        print(commit.author)
        print(commit.commit)
        print(commit.committer)
        print(commit.deletions)
        print(commit.files)
        print(commit.total)
The additions/deletions/total values are all coming back as 0, and the files attribute is always []. When I click on the url, I can see that this is not the case. I've verified through curl calls that the API indeed has record of these attributes.
Reading more in the documentation, it seems that iter_commits is deprecated in favor of iter_user_commits. Might this be why it is not returning all the information about the commits? However, this method does not return any repositories for me when I use it like this:
gh = login(token=gc.ACCESS_TOKEN)
user = gh.user()
for repo in gh.iter_user_repos(user):
In short, I'm wondering what the recommended method is to get all commits for all the repositories a user has access to.
There's nothing wrong with iter_repos with a logged in GitHub instance.
Here's what's happening (this is described in github3.py's documentation): when listing a resource from GitHub's API, not all of the attributes are actually returned. If you want all of the information, you have to request the information for each commit. In short, your code should look like this:
gh = login(token=gc.ACCESS_TOKEN)
for repo in gh.iter_repos():
    for commit in repo.iter_commits():
        commit.refresh()
        print(commit.additions)
        print(commit.deletions)
        # etc.

How can I get a list of all pull requests for a repo through the github API?

I want to obtain a list of all pull requests on a repo through the github API.
I've followed the instructions at http://developer.github.com/v3/pulls/ but when I query /repos/:owner/:repo/pulls it's consistently returning fewer pull requests than displayed on the website.
For example, when I query the torvalds/linux repo I get 9 open pull requests (there are 14 on the website). If I add ?state=closed I get a different set of 11 closed pull requests (the website shows around 20).
Does anyone know where this discrepancy arises, and if there's any way to get a complete list of pull requests for a repo through the API?
You can get all pull requests (closed, open, merged) via the state parameter.
Just set state=all in the GET query, like this:
https://api.github.com/repos/:owner/:repo/pulls?state=all
For more info: check the Parameters table at https://developer.github.com/v3/pulls/#list-pull-requests
Edit: As per Tomáš Votruba's comment:
The default value is per_page=30. The maximum is per_page=100. To get more than 100 results, you need to call it multiple times: &page=1, &page=2, ...
PyGithub (https://github.com/PyGithub/PyGithub), a Python library to access the GitHub API v3, enables you to get paginated resources.
For example,
from github import Github

g = Github(login_or_token=YOUR_TOKEN, per_page=100)
r = g.get_repo(REPO_NUMBER)
for pull in r.get_pulls(state='all'):
    ...  # you can access each pull here
See the documentation (http://pygithub.readthedocs.io/en/latest/index.html).
With Github's new official CLI (command line interface):
gh pr list --repo OWNER/REPO
which would produce something like:
Showing 2 of 2 pull requests in OWNER/REPO
#62 Doing something that-weird-branch-name
#58 My PR title wasnt-inspired-branch
See the gh documentation for additional details, options, and installation instructions.
There is a way to get a complete list, and you're doing it. What are you using to communicate with the API? I suspect you may not be doing something correctly. For example (there are only 13 open pull requests currently), using my API wrapper (github3.py), I get all of the open pull requests. An example of how to do it without my wrapper in Python is:
import requests
r = requests.get('https://api.github.com/repos/torvalds/linux/pulls')
len(r.json()) == 13
and I can also get that result (vaguely) in cURL by counting the results myself: curl https://api.github.com/repos/torvalds/linux/pulls.
If you, however, run into a repository with more than 25 (or 30) pull requests that's an entirely different issue but most certainly it is not what you're encountering now.
If you want to retrieve all pull requests (commits, comments, issues etc) you have to use pagination.
https://developer.github.com/v3/#pagination
The GET request "pulls" will only return open pull-requests.
If you want to get all pull-requests either you do set the parameter state to all, or you use issues.
Extra information
If you need other data from GitHub, such as issues, then you can identify pull requests from issues, and you can then retrieve each pull request no matter if it is closed or open. It will also give you a couple more attributes (mergeable, merged, merge_commit_sha, number of commits, etc.).
If an issue is a pull request, it will contain a pull_request attribute. Otherwise, it is just an issue.
From the API: https://developer.github.com/v3/pulls/#labels-assignees-and-milestones
"Every pull request is an issue, but not every issue is a pull request. For this reason, “shared” actions for both features, like manipulating assignees, labels and milestones, are provided within the Issues API."
Edit: I just found that issues behave similarly to pull requests, so one would need to retrieve them all by setting the state parameter to all (see the sketch below).
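A minimal sketch of that issue-based approach in Python (requests is my choice here; the repo is just an example):

import requests

url = "https://api.github.com/repos/torvalds/linux/issues"
issues = requests.get(url, params={"state": "all", "per_page": 100}).json()

for issue in issues:
    # Issues that are really pull requests carry a "pull_request" key
    if "pull_request" in issue:
        print(f"#{issue['number']} is a pull request: {issue['title']}")
    else:
        print(f"#{issue['number']} is a plain issue: {issue['title']}")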
You can also use the GraphQL API v4 to request all pull requests for a repo. It returns all pull requests by default if you don't specify the states field:
{
  repository(name: "material-ui", owner: "mui-org") {
    pullRequests(first: 100, orderBy: {field: CREATED_AT, direction: DESC}) {
      totalCount
      nodes {
        title
        state
        author {
          login
        }
        createdAt
      }
    }
  }
}
Try it in the explorer
The search API should help: https://help.github.com/enterprise/2.2/user/articles/searching-issues/
q = repo:org/name is:pr ...
GitHub provides a "Link" header which specifies the previous, next, and last URLs to fetch the values. E.g., a Link header response:
<https://api.github.com/repos/:owner/:repo/pulls?state=all&page=2>; rel="next", <https://api.github.com/repos/:owner/:repo/pulls?state=all&page=15>; rel="last"
rel="next" suggests the next set of values.
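Here's a minimal sketch of following those Link headers in Python (requests is my choice here; it parses the Link header into resp.links):

import requests

url = "https://api.github.com/repos/torvalds/linux/pulls"
params = {"state": "all", "per_page": 100}
pulls = []

while url:
    resp = requests.get(url, params=params)
    pulls.extend(resp.json())
    # Follow rel="next" until the header no longer provides one
    url = resp.links.get("next", {}).get("url")
    params = None  # the next URL already carries its query string

print(len(pulls), "pull requests fetched")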
Here's a snippet of Python code that retrieves information about all pull requests from a specific GitHub repository and parses it into a nice DataFrame:
import pandas as pd

organization = 'pvlib'
repository = 'pvlib-python'
state = 'all'  # other options include 'closed' or 'open'
page = 1  # initialize page number to 1 (first page)
dfs = []  # create empty list to hold individual dataframes

# Note it is necessary to loop, as each request retrieves a maximum of 30 entries
while True:
    url = f"https://api.github.com/repos/{organization}/{repository}/pulls?" \
          f"state={state}&page={page}"
    dfi = pd.read_json(url)
    if dfi.empty:
        break
    dfs.append(dfi)  # add dataframe to list of dataframes
    page += 1  # advance onto the next page

df = pd.concat(dfs, axis='rows', ignore_index=True)
# Create a new column with usernames
df['username'] = pd.json_normalize(df['user'])['login']