Github's diffing algorithm for PRs: can you swap it? - github

(Note: GitHub usage questions appear to be in bounds; see "GitHub issues asked on Stack Overflow".)
PRs that move code around, even within the same function, look very onerous on GitHub, even when they do nothing else. I have created a very basic PR to demonstrate this: https://github.com/tommyjcarpenter/github_test/pull/1/commits/2afb07ec5c6b56724bd10c6b56386299493bbb43. All the repo does is define two functions, and the PR moves the first below the second. The diff on GitHub shows 20 lines removed and 21 added. One would expect the diff to be shown as a trivial "code move".
Now imagine this with a lot more functions and a lot more trivial code moves.
It seems Git itself IS able to detect such changes; see "Using Git diff to detect code movement + How to use diff options".
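To make "detecting a code move" concrete, here is a minimal sketch of the idea in Python. It is not how Git or GitHub implement it (Git's own option for this is `git diff --color-moved`); the function name `classify_changes` and the line-by-line matching rule are assumptions made for illustration only.

```python
import difflib


def classify_changes(old_lines, new_lines):
    """Split a diff into moved, purely deleted and purely added lines.

    A removed line counts as "moved" when an identical line was added
    elsewhere in the new version (lines are matched one-to-one).
    """
    removed, added = [], []
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            removed.extend(old_lines[i1:i2])
        if tag in ("insert", "replace"):
            added.extend(new_lines[j1:j2])

    unmatched_added = list(added)
    moved, deleted = [], []
    for line in removed:
        if line in unmatched_added:
            # The same text reappears elsewhere: treat it as a move.
            unmatched_added.remove(line)
            moved.append(line)
        else:
            deleted.append(line)
    return moved, deleted, unmatched_added
```

For a change that only swaps two functions, every removed line finds a matching added line, so the "deleted" and "added" lists come back empty and the whole change is reported as moved.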
So, is there a way to swap out the diffing algorithm so that PRs like this don't look so onerous? Does GitHub use its own internal algorithm, or does it use your default diffing algorithm?
(EDIT: this also appears to make account-level contributions on GitHub a bit misleading: someone who just moves code around may be shown as making a huge number of additions and deletions to a repository, giving the impression that they are a large contributor when in fact they didn't contribute any functionality.)

I suspect that if this sort of feature existed, it would be a hidden feature, like hiding whitespace changes in diffs.
The challenge I see for GitHub in implementing different diffing algorithms is the implication for contribution metrics. Contribution metrics (i.e. contributors, PR size) would have to be footnoted with the algorithm used in order to properly audit and review changes.
As a workaround, you could separate formatting commits from functional-change commits, so that the two can at least be distinguished in the commit history.

Related

Is it possible to filter weblate push per language by completeness rate

General situation
We have a project on GitHub and in hosted.weblate in the libre plan.
The weblate project contains 3 components and a glossary.
Two languages are already completely translated and we use the continuous localization workflow.
There are some additional languages from the very kind community. However, these are not complete(d yet), so they cause problems in the front end (like showing no text or the plain string.variable.name).
What should be achieved?
We would like to have the incomplete languages only available after they are complete (at least have no empty strings). So they should be pushed only if either a manual flag is set/removed or at a certain completeness level. Is there a way or best practice on how to deal with that?
Ideas to achieve it (but no idea if this is possible)
One idea would be to commit changes only for languages that have reached a certain overall completeness level. For the languages that are already complete, we would ideally keep the continuous translation workflow. Manual commits are also problematic, since they would commit the incomplete languages as well.
Is there a way to set a flag or achieve a .gitignore-like behaviour for certain languages in Weblate? Once they no longer have empty strings, we could of course activate the languages manually.
I've set up a translation project for Syncthing on Hosted Weblate, which was previously handled through Transifex. There was already some tooling in place which respects this completeness filter, and it was pretty easy to adapt to Weblate.
Basically we don't push Weblate changes back to the upstream repo, but pull them in through a regularly called script, together with some other housekeeping tasks like an authors list. The script checks the statistics on every available language and if the completion is above 95 percent, the language is added to a "valid" list, which the GUI uses to offer choices. Translations previously on that list drop off only if they fall below 75 percent completion.
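The two-threshold rule described above (join the list above 95 percent, drop off only below 75 percent) can be sketched roughly as follows. This is not the actual Syncthing script: the function name, the constants' names and the shape of the `completion` dictionary are assumptions, and fetching the statistics from Weblate is left out.

```python
ADD_THRESHOLD = 95.0   # percent a language must exceed to join the valid list
DROP_THRESHOLD = 75.0  # already-listed languages are dropped below this


def update_valid_languages(completion, previously_valid):
    """Return the sorted list of languages the GUI should offer.

    completion: dict mapping language code -> percent translated
                (assumed to come from the translation statistics).
    previously_valid: language codes that were on the list last run.
    """
    previously_valid = set(previously_valid)
    valid = set()
    for lang, percent in completion.items():
        if percent > ADD_THRESHOLD:
            valid.add(lang)
        elif lang in previously_valid and percent >= DROP_THRESHOLD:
            # Hysteresis: an existing language survives a moderate dip.
            valid.add(lang)
    return sorted(valid)
```

The gap between the two thresholds keeps a language from flickering in and out of the GUI every time a few new source strings are added.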
In any case, the script downloads the most recent translation files and commits them to the upstream repo for archival reasons. When this happens, Weblate picks up the new commits and rebases its internal repo. That also allows integrating translation contributions from other sources easily.
It is currently not possible, but there is an issue tracking this: https://github.com/WeblateOrg/weblate/issues/3745

"This comparison is taking too long to generate." error on github

I'm working on a project that has a large number of json files that are never reviewed in pull requests but occasionally need to be changed. Recently we had to make minor changes to them, and github isn't allowing me to create a pull request with those changes. Instead it gives me:
This comparison is taking too long to generate.
Unfortunately it looks like we can’t render this comparison for you right now. It might be too big, or there might be something weird with your repository.
I checked the diff locally and the actual code changes are pretty minor (maybe 200 lines changed), but there are millions of changed lines in these json files. Is there any way to tell Github to ignore them? Right now I am unable to make a PR so the changes can't go through our normal company review process.
I've tried using a .gitattributes file with *.json linguist-generated=true; unfortunately, that had no effect.
Edit: As suggested in the accepted answer, I contacted github support about this case. Their suggestion was to create a new branch with a small commit, create the PR, and then merge the actual branch that I want to deploy into it. This will update the PR, and while the diff still won't display, it will let me create a PR.
I have just been able to solve this problem via the gh command line tool!
gh pr create
Maybe this option did not exist when the original question was posted, but I'll definitely be doing this in future.
When you compare two branches on GitHub, GitHub has to go compute the diff for those changes using Git and then render it for you. In your case, you have millions of lines of changes, and Git doesn't perform very well in this case because the algorithm that's used to compute diffs is O((N + M)D). Thus, if you have a number of differences proportional to the number of lines, the algorithm is essentially O(N²). Having a large N makes that even worse.
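To see where the D factor comes from, here is a minimal sketch of the greedy shortest-edit-script search from Myers' diff algorithm, which that O((N + M)D) bound describes. It only computes D (the number of inserted plus deleted lines), and it is a textbook version written for illustration; Git's real implementation adds heuristics and is not reproduced here.

```python
def myers_edit_distance(a, b):
    """Return D, the minimum number of single-item insertions and deletions
    needed to turn sequence a into sequence b."""
    n, m = len(a), len(b)
    max_d = n + m
    # v[k] is the furthest x reached so far on diagonal k, where k = x - y.
    v = {1: 0}
    for d in range(max_d + 1):           # outer loop runs D + 1 times
        for k in range(-d, d + 1, 2):    # up to d + 1 diagonals per round
            if k == -d or (k != d and v[k - 1] < v[k + 1]):
                x = v[k + 1]       # step down: take an item from b
            else:
                x = v[k - 1] + 1   # step right: drop an item from a
            y = x - k
            # Follow matching items for free along the diagonal.
            while x < n and y < m and a[x] == b[y]:
                x += 1
                y += 1
            v[k] = x
            if x >= n and y >= m:
                return d
    return max_d
```

When only a handful of lines differ, the outer loop stops after a few rounds. When millions of lines differ, D grows with the file size, and so does the work done in every round.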
GitHub has a limit on how long a request can take, so your large number of changes is just not going to render in the interface. It may be possible to choose the branches you want even though the diff won't render and still open the pull request. If not, you may need to resort to using the API, which won't generate the diff for rendering and therefore is likely to work a little better.
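As a rough sketch of the API route, the snippet below builds a call to GitHub's REST endpoint for creating a pull request (`POST /repos/{owner}/{repo}/pulls`) using only the Python standard library. The owner, repository, branch names and token you pass in are placeholders you would supply yourself; the helper names are made up for this example.

```python
import json
import urllib.request


def build_pull_request(owner, repo, head, base, title, token):
    """Build (but do not send) a "create a pull request" API request."""
    url = f"https://api.github.com/repos/{owner}/{repo}/pulls"
    payload = json.dumps({"title": title, "head": head, "base": base})
    return urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def create_pull_request(request):
    """Send the request and return the web URL of the new pull request."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)["html_url"]
```

Splitting the build step from the send step makes it easy to inspect the request before anything is posted to a real repository.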
I would encourage you to let GitHub Support know about this, if you haven't been able to find a way to do it through the UI, since they can notify someone to make sure the interface is usable to create a PR even if the diff can't render. You probably aren't the first person to encounter this.
You may also want to store these files outside of Git on some sort of artifact server and pull them down to your repository based on hash, in which case you wouldn't have this pathological case.
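One way to read "pull them down based on hash" is to commit only a small manifest of file names and SHA-256 digests, and let a script decide which files need to be fetched. The sketch below covers just the comparison step; the manifest format and the function names are assumptions, and the download from the artifact server is deliberately left out.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stale_artifacts(manifest, directory):
    """Return the names whose local copy is missing or differs from the manifest.

    manifest: dict mapping file name -> expected hex SHA-256 digest.
    """
    stale = []
    for name, expected in manifest.items():
        local = Path(directory) / name
        if not local.is_file() or sha256_of(local) != expected:
            stale.append(name)
    return sorted(stale)
```

With this layout, a change to the large JSON files shows up in Git as a few changed digests in the manifest, which is a diff any review tool can render.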

How to compare a file between different GitHub repositories (for clarity in a pull request)?

When creating a GitHub Pull Request, it is often that a file (script, lib, etc.) may be completely replaced (or introduced) with one from another repo. Sometimes, the file requires small changes. I'm trying to establish a standard for my team for how to communicate where the file came from and what changed. In the same way that you can craft a URL to highlight a specific change in a single repo, I'd like to be able to highlight a change across repos.
The reality may very well be that GitHub does not offer this. (I do a lot of research before asking questions. Consequently, the answer is often, "you couldn't find an answer because it is impossible.") In which case an alternative will be needed. One possibility might be to generate a diff in markdown and add it as a comment. (Notice I improved that answer back in 2016.)
"One possibility might be to generate a diff in markdown and add it as a comment."
Good idea.
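A minimal sketch of that idea, assuming you already have both versions of the file as text: Python's `difflib` produces a unified diff, which is then wrapped in a `diff` code fence so that GitHub colours the added and removed lines in the comment. The function name and the labels are illustrative, not an established convention.

```python
import difflib


def markdown_diff(upstream_text, local_text, upstream_label, local_label):
    """Return a Markdown comment body containing a unified diff."""
    diff_lines = difflib.unified_diff(
        upstream_text.splitlines(),
        local_text.splitlines(),
        fromfile=upstream_label,
        tofile=local_label,
        lineterm="",
    )
    fence = "`" * 3  # built at runtime so this example nests cleanly
    return "\n".join([fence + "diff", *diff_lines, fence])
```

Using the source repository and path as `upstream_label` records where the file came from in the diff header itself, which covers both halves of the question: the origin and what changed.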
One alternative which would not depend on a PR comment would be to use git notes. GitHub has not displayed them since 2014, and they have been criticised, but in your case they would remain a possible way to leave... well, a note describing where some of the PR files are coming from.

Is there any way to merge in a visual merge tool on windows that simultaneously shows annotations?

I am using Mercurial, but I imagine that any merge tool that is aware of the version control system below it could do several things that a merge tool which is not aware of the version control system and only sees two "files" in two different folders, could never do.
I have been using KDIFF3, and recently tried BeyondCompare, and neither of them will do this, at least not that I could figure out.
What I want to do is best shown in this picture, an annotation column and perhaps even ability to open other windows from those annotation columns so I could browse specific versions of specific files to see context when trying to do a merge.
In the image here, I am showing a two way merge, but the same applies for a three way merge. To the right or to the left of the actual file content being shown, I would like a gutter or a right side annotation column showing some kind of annotation of where this change came from. Since Mercurial hex ids are relatively unfriendly and unhelpful, and since repository-local revision numbers are repository local, I think that a short text description based on commit comments would be most helpful. Of course, with Mercurial, 99% of these commit comments are going to say "Merge", and nothing else. (Groan.) But let's pretend for a minute that we weren't using tools and workflows that left us that crippled at merge time, and instead, that we could have a useful commit comment show up each time:
Right now the workflow for complex merges looks like this for me:
Using my distributed version control tool (mercurial), pull changes from another repository which is in effect a branch. Merge. The merge window for TortoiseHg is usually where I start all this from. This in turn lets me configure a merge tool (beyond compare or Kdiff3).
However, it does not appear that there is any merge tool (that I have seen) that can be told, "Hey, you're not just merging two way or three way with different versions of a file in two completely different folders, with the names I told you; those files also have a complete edit history available to you, so you can show your human the actual context: the commits that those line changes came from, with their commit comments, often having a bug number as part of the commit, which will give the person doing the merge the ability to see What in the Heck is Really Going on."
I would change from Mercurial to Git, for example, even, for a real merge experience that didn't force me to do manually what I think my tools could be doing for me automatically. I'm using Mercurial, TortoiseHG, and KDIFF3, and if I could just change from KDIFF3 to some other tool, or do ANYTHING at all to get annotations and merges together on one screen, I would like to do so.

Keeping experimental history out of shared repository in Mercurial

I'm fairly new to Mercurial, but one of the advantages I see using Mercurial is that while writing a feature you can be more free to experiment, check in changes, share them, etc, while still maintaining a "clean" repo for the finished feature.
The issue is one of history. If I tried 6 different ways to get something to work, now I'm stuck with all of the history for all my mistakes. What I'd like to do is go through and clean up my changes and "collapse" them into one changeset that can be pushed into a shared repository. This is complicated by the fact that I might pull in new changesets from the shared repository, and have those changesets intermingled with my own.
The best way I know of to do that is to use hg export to create a patch of my changes since cloning, clone a fresh repository, and apply the patch to the fresh repository.
Those steps seem a little bit cumbersome and easy to mess up, particularly if this methodology is rolled out to the whole dev team, some of whom are a little resistant to change (don't get me started). TortoiseHg makes the process slightly better since you can highlight the changesets you want to be included in an export.
My question is this: Am I making this more complex than it needs to be? Is there a better workflow I can use to ease my troubles? Is it too much to expect a clean history where entire (small-ish) features are included in one changeset?
Or maybe my whole question could be summed up this way:
Is there an equivalent for this in mercurial? Collapsing a git repository's history
Although I think you should reconsider your use of branches in Mercurial (as per my comment on your post), using named branches doesn't really help with your concern of maintaining useless or unnecessary history - it just organizes them a bit.
I would recommend a combination of these tools:
mercurial queues
histedit (a separate extension originally; bundled with Mercurial since version 2.3)
the mq changeset strip feature
to rework a messy history before pushing to a blessed or master repo. The easiest thing would be to use strip to permanently remove any changeset with no children. Once you've done that you can use mq or histedit to combine, relocate, or modify existing commits. Histedit will even let you redo the comment associated with a changeset.
Some pitfalls:
In your opening paragraph you mention sharing changesets during feature development. Please understand that once you've shared a changeset it's not a good idea to modify it using mq, histedit, or strip. Using these extensions can change the revision hash, which will make the modified changeset look like a new changeset to everyone else.
Also, I agree with Paul Nathan's comment that mq (and histedit) are power features and can easily destroy a history. It's a good idea to make a safety clone before using these extensions.
Named branches are the simplest solution. Each experimental approach gets its own branch. This retains the history of the experiments.
The next solution is to have a fresh clone for each experiment. The working one gets pushed back to the main repo.
The next solution - and probably what you are really looking for - is the mq extension, which can "squash" a series of patches into a single commit. I consider mq to be "advanced", and "subject to accidentally shooting yourself in the foot". I also don't care to squash my commits - I like having my version history present for reference.