Why am I able to bypass pagination when I call the same field twice (with different queries) in GitHub's GraphQL API

I noticed something I don't understand while trying to get the number of open issues per repository for a user.
When I use the following query I am asked to perform pagination (as expected) -
query {
user(login:"armsp"){
repositories{
nodes{
name
issues(states: OPEN){
totalCount
}
}
}
}
}
The error message after running the above -
{
"data": {
"user": null
},
"errors": [
{
"type": "MISSING_PAGINATION_BOUNDARIES",
"path": [
"user",
"repositories"
],
"locations": [
{
"line": 54,
"column": 5
}
],
"message": "You must provide a `first` or `last` value to properly paginate the `repositories` connection."
}
]
}
However when I do the following I actually get all the results which doesn't make any sense to me -
query {
user(login:"armsp"){
repositories{
totalCount
}
repositories{
nodes{
name
issues(states: OPEN){
totalCount
}
}
}
}
}
Shouldn't I be asked for pagination in the second query too ?

TLDR; This appears to be a bug. There's no way to bypass the limit applied when fetching a list of resources.
Limiting responses like this is a common feature of public APIs -- if the response could include thousands or millions of results, it'll tie up a lot of server resources to fulfill it all at once. Allowing users to make those sort of queries is both costly and a potential security risk.
GitHub's intent appears to be to always limit the number of results when fetching a list of resources. This isn't well documented on the GraphQL side, but matches the behavior of their REST API:
Requests that return multiple items will be paginated to 30 items by default. You can specify further pages with the ?page parameter. For some resources, you can also set a custom page size up to 100 with the ?per_page parameter.
For connections, it looks like the check for the first or last parameter is only run when the nodes field is present in the selection set. This makes sense, since this is ultimately the field we want to limit -- requesting other fields like totalCount or totalDiskUsage, even without a limit argument, is harmless with regard to the above concerns.
Things get funky when you consider how GraphQL handles selection sets with selections that have the same name. Without getting into the nitty gritty details, GraphQL will let you request the same field multiple times. If the field in question has a selection set, it will effectively merge the selection sets into a single one. So
query {
user(login:"armsp") {
repositories {
totalCount
}
repositories {
totalDiskUsage
}
}
}
becomes and is equivalent to
query {
user(login:"armsp") {
repositories {
totalCount
totalDiskUsage
}
}
}
Side note: The above does not hold true if you explicitly give one of the fields an alias since then the two fields have different response names.
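To make the merging rule concrete, here is a minimal Python sketch of it. This is only an illustration, not how GitHub or any GraphQL server implements it: the `merge_selections` function and the tuple-based representation of a selection set are made up for this example, and it ignores arguments (real GraphQL also requires fields with the same response name to have identical arguments before they can be merged).

```python
def merge_selections(fields):
    """Merge fields that share a response name into one selection set.

    `fields` is a list of (response_name, sub_fields) pairs. `sub_fields`
    is None for a leaf field, or another list of such pairs.
    Hypothetical representation, chosen only for this illustration.
    """
    grouped = {}
    for response_name, sub_fields in fields:
        bucket = grouped.setdefault(response_name, [])
        if sub_fields is not None:
            # Same response name: pool the child selections together.
            bucket.extend(sub_fields)
    # Recurse so that nested duplicates are merged as well.
    return {
        name: (merge_selections(subs) if subs else None)
        for name, subs in grouped.items()
    }


# `repositories` requested twice, as in the query above.
twice = [
    ("repositories", [("totalCount", None)]),
    ("repositories", [("totalDiskUsage", None)]),
]

# `repositories` requested once with both fields.
once = [
    ("repositories", [("totalCount", None), ("totalDiskUsage", None)]),
]

# Second field aliased (the alias name `disk` is made up), so the
# response names differ and nothing is merged.
aliased = [
    ("repositories", [("totalCount", None)]),
    ("disk", [("totalDiskUsage", None)]),
]

print(merge_selections(twice) == merge_selections(once))
print(list(merge_selections(aliased)))
```

The first two inputs collapse to the same structure, while the aliased one keeps two separate entries, which is the behaviour described in the side note.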
All that to say, technically this query:
query {
user(login:"armsp"){
repositories{
totalCount
}
repositories{
nodes{
name
issues(states: OPEN){
totalCount
}
}
}
}
}
should also blow up with the same MISSING_PAGINATION_BOUNDARIES error. The fact that it doesn't means the selection set merging is somehow borking the check that's in place. This is clearly a bug. However, even though this appears to "work", it still doesn't get around whatever limits GitHub has applied at the storage layer -- you will always get at most 100 results even when exploiting the above bug.

Related

Get Product Data from shopify GraphQL for over 10000 Products

I have an extremely large selection of products in a collection (140,000). Getting the data for 250 of them is fine, but I need a list of tags for all 140,000 products, so I have created a bulkOperationRunQuery to get the data. Here is the query I run:
mutation {
bulkOperationRunQuery(
query: """
{
products{
edges{
node{
id
tags
}
}
}
}
"""
) {
bulkOperation {
id
status
}
userErrors {
field
message
}
}}
This works but takes far too long to process. How can I make it quicker? Is there a set limit on the request?
That is all you get for a massive ask like that. If you have 140,000 products, you ask for them once. Then you have them, and speed should be of little consequence. There is no need to repeat yourself by asking for them again and again. If you are interested in changes, just listen to product change webhooks. Save yourself a lot of grief that way.
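Once the one-off bulk operation has finished, the tags can be pulled out of its result locally. The sketch below assumes the result has already been downloaded and is in JSONL form (one JSON object per line, each with the `id` and `tags` requested in the query). The function name `collect_tags` and the sample lines are made up for illustration.

```python
import json


def collect_tags(jsonl_lines):
    """Build a {product_id: [tags]} mapping from bulk-result lines.

    Assumes each non-empty line is one JSON object with `id` and `tags`
    keys, matching the fields selected in the bulk query.
    """
    tags_by_product = {}
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        record = json.loads(line)
        tags_by_product[record["id"]] = record.get("tags", [])
    return tags_by_product


# Made-up sample data standing in for the downloaded result file.
sample_lines = [
    '{"id": "gid://shopify/Product/1", "tags": ["sale", "summer"]}',
    '{"id": "gid://shopify/Product/2", "tags": []}',
    "",
]

print(collect_tags(sample_lines))
```

Storing this mapping once and then updating it from product change webhooks avoids re-running the bulk query.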

Pagination on reactionGroups in the GitHub GraphQL API

I’m trying to extract the username of all the users that have reacted to an issue (and how they reacted) with the GitHub GraphQL API. I’ve only been able to extract a maximum of 11 users per reaction group per query, and I haven’t found a way to successfully paginate the queries - the same users are returned each time.
Here’s an example of my query using an issue with many reactions:
{
repository(owner: "mapbox", name: "mapbox-gl-js") {
issue(number: 3184) {
reactionGroups {
content
reactors(first: 30) {
totalCount
pageInfo {
hasNextPage
endCursor
}
edges {
node {
... on User {
login
}
}
}
}
}
}
}
}
For THUMBS_UP reactions this correctly returns totalCount: 77. However, there are only 11 usernames returned (not the 30 requested). The value of hasNextPage in pageInfo is false, and using the returned cursor value or modifying the reactors query to last:30 instead of first:30 has no impact on which 11 users are returned.
Is there a way I can modify my query to get this working (I’m new to GraphQL) or is this a current limitation of the API? Thanks!
(I've also asked this on the GitHub community forums, but no reply yet - see here)

Firestore rules and data structure

I have a question regarding data structure and rules ... I have content on which users can vote. Something like this:
Firestore object:
{
name: "Cat",
description: "A cat named Cat",
votes: 56
}
Now ... I want authenticated users to be able to have update access to the votes, but not to any other values of the object and of course read rights since the content has to be displayed.
I did this because I wanted to avoid additional queries when displaying the content.
Should I create another collection "votes" maybe where the votes are kept and for each document make an additional request to get them?
In rules, you have access to the state of the data both before and after the writes - so you can test specific fields to be sure they have not changed:
function existing() {
return resource.data;
}
function resulting() {
return request.resource.data;
}
function matchField(fieldName) {
return existing()[fieldName] == resulting()[fieldName];
}
....
allow update: if matchField("name") && matchField("description")
....
The functions just make the rule easier to read.
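If it helps to reason about the rule outside of Firestore, its logic can be mirrored in plain Python. This is only a model for thinking about or unit-testing the condition, not Firestore rules syntax: `update_allowed` is a made-up name, and the two dictionaries stand in for `resource.data` and `request.resource.data`.

```python
def match_field(existing, resulting, field_name):
    """True when the field has the same value before and after the write.

    Note: a missing key compares as None here, whereas the real rules
    engine treats access to a missing field as an error.
    """
    return existing.get(field_name) == resulting.get(field_name)


def update_allowed(existing, resulting):
    """Mirror of the rule: name and description must be unchanged."""
    return (
        match_field(existing, resulting, "name")
        and match_field(existing, resulting, "description")
    )


before = {"name": "Cat", "description": "A cat named Cat", "votes": 56}
vote_only = {"name": "Cat", "description": "A cat named Cat", "votes": 57}
renamed = {"name": "Dog", "description": "A cat named Cat", "votes": 57}

print(update_allowed(before, vote_only))  # True: only votes changed
print(update_allowed(before, renamed))    # False: name was modified
```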

How can I get branch count on a repository via GitHub API?

I'm working on a UI which lists all repositories of a given user or organization. This is using a tree format, where the first level is the repositories, and the second level of hierarchy (child nodes) are to be each branch, if expanded.
I'm using a mechanism that deliberately doesn't require me to pull a list of all branches of a given repo, because the API has rate limits on API calls. Instead, all I have to do is instruct it how many child nodes it contains, without actually assigning values to them (until the moment the user expands it). I was almost sure that fetching a list of repos includes branch count in the result, but to my disappointment, I don't see it. I can only see count of forks, stargazers, watchers, issues, etc. Everything except branch count.
The intention of the UI is that it will know in advance the number of branches to populate the child nodes, but not actually fetch them until after user has expanded the parent node - thus immediately showing empty placeholders for each branch, followed by asynchronous loading of the actual branches to populate. Again, since I need to avoid too many API calls. As user scrolls, it will use pagination to fetch only the page(s) it needs to show to the user, and keep it cached for later display.
Specifically, I'm using the Virtual TreeView for Delphi:
procedure TfrmMain.LstInitChildren(Sender: TBaseVirtualTree; Node: PVirtualNode;
var ChildCount: Cardinal);
var
L: Integer;
R: TGitHubRepo;
begin
L:= Lst.GetNodeLevel(Node);
case L of
0: begin
//TODO: Return number of branches...
R:= TGitHubRepo(Lst.GetNodeData(Node));
ChildCount:= R.I['branch_count']; //TODO: There is no such thing!!!
end;
1: ChildCount:= 0; //Branches have no further child nodes
end;
end;
Is there something I'm missing that allows me to get repo branch count without having to fetch a complete list of all of them up-front?
You can use the new GraphQL API instead. This allows you to tailor your queries and results to just what you need. Rather than grabbing the count and then later filling in the branches, you can do both in one query.
Try out the Query Explorer.
query {
repository(owner: "octocat", name: "Hello-World") {
refs(first: 100, refPrefix:"refs/heads/") {
totalCount
nodes {
name
}
},
pullRequests(states:[OPEN]) {
totalCount
}
}
}
{
"data": {
"repository": {
"refs": {
"totalCount": 3,
"nodes": [
{
"name": "master"
},
{
"name": "octocat-patch-1"
},
{
"name": "test"
}
]
},
"pullRequests": {
"totalCount": 192
}
}
}
}
Pagination is done with cursors. First you get the first page, up to 100 at a time, but we're using just 2 here for brevity. The response will contain a unique cursor.
{
repository(owner: "octocat", name: "Hello-World") {
pullRequests(first:2, states: [OPEN]) {
edges {
node {
title
}
cursor
}
}
}
}
{
"data": {
"repository": {
"pullRequests": {
"edges": [
{
"node": {
"title": "Update README"
},
"cursor": "Y3Vyc29yOnYyOpHOABRYHg=="
},
{
"node": {
"title": "Just a pull request test"
},
"cursor": "Y3Vyc29yOnYyOpHOABR2bQ=="
}
]
}
}
}
}
You can then ask for more elements after the cursor. This will get the next 2 elements.
{
repository(owner: "octocat", name: "Hello-World") {
pullRequests(first:2, after: "Y3Vyc29yOnYyOpHOABR2bQ==", states: [OPEN]) {
edges {
node {
title
}
cursor
}
}
}
}
Queries can be written like functions and passed arguments. The arguments are sent in a separate bit of JSON. This allows the query to be a simple unchanging string.
This query does the same thing as before.
query NextPullRequestPage($pullRequestCursor:String) {
repository(owner: "octocat", name: "Hello-World") {
pullRequests(first:2, after: $pullRequestCursor, states: [OPEN]) {
edges {
node {
title
}
cursor
}
}
}
}
{
"pullRequestCursor": "Y3Vyc29yOnYyOpHOABR2bQ=="
}
{ "pullRequestCursor": null } will fetch the first page.
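On the client side, the paging loop around that query might look like the Python sketch below. It assumes a `run_query` callable that sends the `NextPullRequestPage` query with the given variables and returns the decoded JSON response; that callable, the `fake_run_query` stand-in, the short cursors, and the third title are made up so the example runs without network access.

```python
def fetch_all_titles(run_query):
    """Collect every pull request title by following edge cursors.

    `run_query(variables)` is assumed to return a dict shaped like the
    responses shown above.
    """
    titles = []
    cursor = None  # a null cursor asks for the first page
    while True:
        response = run_query({"pullRequestCursor": cursor})
        edges = response["data"]["repository"]["pullRequests"]["edges"]
        if not edges:
            break  # an empty page means there is nothing left
        titles.extend(edge["node"]["title"] for edge in edges)
        cursor = edges[-1]["cursor"]  # continue after the last element
    return titles


def fake_run_query(variables):
    """Stand-in for the real API call, serving canned pages."""
    pages = {
        None: [("Update README", "c1"), ("Just a pull request test", "c2")],
        "c2": [("Third pull request", "c3")],
        "c3": [],
    }
    edges = [
        {"node": {"title": title}, "cursor": cursor}
        for title, cursor in pages[variables["pullRequestCursor"]]
    ]
    return {"data": {"repository": {"pullRequests": {"edges": edges}}}}


print(fetch_all_titles(fake_run_query))
```

This version spends one extra request to discover the empty last page. Asking for `pageInfo { hasNextPage endCursor }` in the query would let the loop stop without it.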
The GraphQL API's rate limit calculations are more complex than the REST API's. Instead of calls per hour, you get 5000 points per hour. Each query costs a certain number of points, which roughly corresponds to how much it costs GitHub to compute the results. You can find out how much a query costs by asking for its rateLimit information. If you pass it dryRun: true it will just tell you the cost without running the query.
{
rateLimit(dryRun:true) {
limit
cost
remaining
resetAt
}
repository(owner: "octocat", name: "Hello-World") {
refs(first: 100, refPrefix: "refs/heads/") {
totalCount
nodes {
name
}
}
pullRequests(states: [OPEN]) {
totalCount
}
}
}
{
"data": {
"rateLimit": {
"limit": 5000,
"cost": 1,
"remaining": 4979,
"resetAt": "2019-08-21T05:13:56Z"
}
}
}
This query costs just one point. I have 4979 points remaining and I'll get my rate limit reset at 05:13 UTC.
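A client can use those numbers to budget its calls. Here is a small Python sketch working from the `rateLimit` object in the response above; the helper names `queries_left` and `seconds_until_reset` are made up for this example.

```python
from datetime import datetime, timezone


def queries_left(rate_limit):
    """How many more queries of this cost fit in the remaining budget."""
    cost = max(rate_limit["cost"], 1)  # guard against dividing by zero
    return rate_limit["remaining"] // cost


def seconds_until_reset(rate_limit, now):
    """Seconds from `now` (timezone-aware) until the limit resets."""
    reset_at = datetime.strptime(
        rate_limit["resetAt"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return max((reset_at - now).total_seconds(), 0.0)


# The rateLimit object from the response shown above.
rate_limit = {
    "limit": 5000,
    "cost": 1,
    "remaining": 4979,
    "resetAt": "2019-08-21T05:13:56Z",
}

print(queries_left(rate_limit))
print(seconds_until_reset(
    rate_limit, datetime(2019, 8, 21, 5, 0, 0, tzinfo=timezone.utc)
))
```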
The GraphQL API is extremely flexible. You should be able to do more with it using fewer GitHub resources and less programming to work around rate limits.

REST, cross-references and performances, which compromise?

After reading this excellent thread REST Complex/Composite/Nested Resources about nested structures in REST responses, I still have a question. What's the best choice in terms of performance about the response ?
Let's take an example.
I have a Category object, which contains some Questions. Those Questions contain some Answers. All of these structures have meta-information.
Now, when querying a URL like GET http://<base_url>/categories/, should I include only a description of the Categories, or also a description of their Questions? And if so, the full description or a simplified one?
In other words, which of these is the best solution? SOLUTION 1 :
{
"results":[
{
'id':1,
'name':'category1',
'description':'foobar',
'questions':[
{
'id':1234,
'question':'My question',
'author' : 4235345,
'answers':[
{
'id':56786,
'user':456,
'votes':6,
'answer':"It's an answer !"
},
{
'id':3486,
'user':4564,
'votes':2,
'answer':"It's another answer !"
},
]
},
...
]
}
...
]
}
OR SOLUTION 2 :
{
"results":[
{
'id':1,
'name':'category1',
'description':'foobar',
'questions':[
{
'id':1234,
'url':'http://foobar/questions/1234',
'answers':[
{
'id':56786,
'url':'http://foobar/answers/56786'
},
{
'id':3486,
'url':'http://foobar/answers/3486'
},
]
},
...
]
}
...
]
}
OR SOLUTION 3 :
{
"results":[
{
'id':1,
'name':'category1',
'description':'foobar',
'questions':'http://foobar/categories/1/questions'
}
...
]
}
Or maybe another solution ?
Thanks !
That depends on what the application will do with the data. If it is only going to display a list of categories, then it is very inefficient to transfer all the data it could ever need at once, especially if there are many categories, since that will increase the response time the user sees (an absolute no-no).
These scenarios depend heavily on application and usage of data.
One optimization that we can do is, we can create two requests,
GET http://<base_url>/categories
Which will return minimal data immediately and another request,
GET http://<base_url>/categories?all=true
Which will return all data.
Then the client app can make some clever optimizations. For example, when the user asks for categories, request one is sent and its data is rendered immediately. After getting the list of categories, the user will be idle for some time looking at it, and we can use this opportunity to fetch all the data using request two.
However, as I said this will largely depend on the application.
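As a rough illustration of that two-request idea, here is a Python sketch of the client side. Everything in it is an assumption made for the example: `fetch` stands for whatever function performs the GET and decodes the response, `render` for whatever draws the list, and the fake versions below exist only so the sketch runs on its own.

```python
from concurrent.futures import ThreadPoolExecutor


def load_categories(fetch, render):
    """Render the minimal list first, then prefetch the full data.

    Returns a Future whose result is the full payload, so the caller
    can pick it up later when the user drills into a category.
    """
    summary = fetch("/categories")  # request one: small and fast
    render(summary)

    executor = ThreadPoolExecutor(max_workers=1)
    # Request two runs in the background while the user is reading.
    future = executor.submit(fetch, "/categories?all=true")
    executor.shutdown(wait=False)  # do not block; the task still runs
    return future


# Fake transport and renderer, for demonstration only.
FAKE_RESPONSES = {
    "/categories": [{"id": 1, "name": "category1"}],
    "/categories?all=true": [
        {"id": 1, "name": "category1", "questions": [{"id": 1234}]}
    ],
}
rendered = []


def fake_fetch(path):
    return FAKE_RESPONSES[path]


full_data = load_categories(fake_fetch, rendered.append)
print(rendered)
print(full_data.result())
```

Whether the prefetch is worth it depends on how often users actually open a category, which is why this stays application-specific.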