Can you get the number of lines of code from a GitHub repository? - github

In a GitHub repository you can see “language statistics”, which displays the percentage of the project that’s written in a language. It doesn’t, however, display how many lines of code the project consists of. Often, I want to quickly get an impression of the scale and complexity of a project, and the count of lines of code can give a good first impression. 500 lines of code implies a relatively simple project, 100,000 lines of code implies a very large/complicated project.
So, is it possible to get the lines of code written in the various languages from a GitHub repository, preferably without cloning it?
The question “Count number of lines in a git repository” asks how to count the lines of code in a local Git repository, but:
You have to clone the project, which could be massive. Cloning a project like Wine, for example, takes ages.
You would count lines in files that wouldn’t necessarily be code, like i18n files.
If you count just (for example) Ruby files, you’d potentially miss massive amounts of code in other languages, like JavaScript. You’d have to know beforehand which languages the project uses. You’d also have to repeat the count for every language the project uses.
All in all, this is potentially far too time-intensive for “quickly checking the scale of a project”.

You can run something like
git ls-files | xargs wc -l
which will give you the total count.
You can also narrow it down, for example to just the JavaScript files (the $ keeps .json files out of the match):
git ls-files | grep '\.js$' | xargs wc -l
Or use this handy little tool: https://line-count.herokuapp.com/
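If you would rather get a per-extension breakdown in one pass instead of repeating the grep for every language, the same idea can be sketched in Python. This is only a sketch under my own assumptions: the function names are made up for illustration, and "lines" means newline characters, roughly what wc -l reports.

```python
import subprocess
from collections import Counter
from pathlib import Path


def count_lines_by_extension(paths):
    """Return a Counter mapping file suffix (e.g. '.js') to total line count."""
    totals = Counter()
    for path in map(Path, paths):
        try:
            with path.open("rb") as handle:
                # Count newline bytes, roughly what `wc -l` reports.
                lines = sum(chunk.count(b"\n") for chunk in handle)
        except OSError:
            # Skip entries that cannot be read (submodules, broken symlinks).
            continue
        totals[path.suffix or "(none)"] += lines
    return totals


def tracked_files(repo_dir="."):
    """List files tracked by Git in repo_dir (NUL-separated to survive odd names)."""
    out = subprocess.run(
        ["git", "ls-files", "-z"], cwd=repo_dir, check=True, capture_output=True
    ).stdout
    return [str(Path(repo_dir) / name.decode()) for name in out.split(b"\0") if name]
```

Inside a clone, count_lines_by_extension(tracked_files()).most_common() lists the biggest file types first.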

A shell script, cloc-git
You can use this shell script to count the number of lines in a remote Git repository with one command:
#!/usr/bin/env bash
git clone --depth 1 "$1" temp-linecount-repo &&
printf "('temp-linecount-repo' will be deleted automatically)\n\n\n" &&
cloc temp-linecount-repo &&
rm -rf temp-linecount-repo
Installation
This script requires CLOC (“Count Lines of Code”) to be installed. cloc can probably be installed with your package manager – for example, brew install cloc with Homebrew. There is also a docker image published under mribeiro/cloc.
You can install the script by saving its code to a file cloc-git, running chmod +x cloc-git, and then moving the file to a folder in your $PATH such as /usr/local/bin.
Usage
The script takes one argument, which is any URL that git clone will accept. Examples are https://github.com/evalEmpire/perl5i.git (HTTPS) or git@github.com:evalEmpire/perl5i.git (SSH). You can get this URL from any GitHub project page by clicking “Clone or download”.
Example output:
$ cloc-git https://github.com/evalEmpire/perl5i.git
Cloning into 'temp-linecount-repo'...
remote: Counting objects: 200, done.
remote: Compressing objects: 100% (182/182), done.
remote: Total 200 (delta 13), reused 158 (delta 9), pack-reused 0
Receiving objects: 100% (200/200), 296.52 KiB | 110.00 KiB/s, done.
Resolving deltas: 100% (13/13), done.
Checking connectivity... done.
('temp-linecount-repo' will be deleted automatically)
171 text files.
166 unique files.
17 files ignored.
http://cloc.sourceforge.net v 1.62 T=1.13 s (134.1 files/s, 9764.6 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Perl                           149           2795           1425           6382
JSON                             1              0              0            270
YAML                             2              0              0            198
-------------------------------------------------------------------------------
SUM:                           152           2795           1425           6850
-------------------------------------------------------------------------------
Alternatives
Run the commands manually
If you don’t want to bother saving and installing the shell script, you can run the commands manually. An example:
$ git clone --depth 1 https://github.com/evalEmpire/perl5i.git
$ cloc perl5i
$ rm -rf perl5i
Linguist
If you want the results to match GitHub’s language percentages exactly, you can try installing Linguist instead of CLOC. According to its README, you need to gem install linguist and then run linguist. I couldn’t get it to work (issue #2223).

I created an extension for Google Chrome browser - GLOC which works for public and private repos.
Counts the number of lines of code of a project from:
project detail page
user's repositories
organization page
search results page
trending page
explore page

If you go to the graphs/contributors page, you can see a list of all the contributors to the repo and how many lines they've added and removed.
Unless I'm missing something, subtracting the aggregate number of lines deleted from the aggregate number of lines added among all contributors should yield the total number of lines of code in the repo. (EDIT: it turns out I was missing something after all. Take a look at orbitbot's comment for details.)
UPDATE:
This data is also available in GitHub's API. So I wrote a quick script to fetch the data and do the calculation:
'use strict';
// Sums (additions - deletions) over every week of every contributor.
async function countGithub(repo) {
  const response = await fetch(`https://api.github.com/repos/${repo}/stats/contributors`);
  const contributors = await response.json();
  const lineCounts = contributors.map(contributor => (
    contributor.weeks.reduce((lineCount, week) => lineCount + week.a - week.d, 0)
  ));
  // Start from 0 so that an empty contributor list does not throw.
  const lines = lineCounts.reduce((lineTotal, lineCount) => lineTotal + lineCount, 0);
  window.alert(lines);
}
countGithub('jquery/jquery'); // or count anything you like
Just paste it in a Chrome DevTools snippet, change the repo and click run.
Disclaimer (thanks to lovasoa):
Take the results of this method with a grain of salt, because for some repos (sorich87/bootstrap-tour) it results in negative values, which might indicate there's something wrong with the data returned from GitHub's API.
UPDATE:
Looks like this method to calculate total line numbers isn't entirely reliable. Take a look at orbitbot's comment for details.

You can clone just the latest commit using git clone --depth 1 <url> and then perform your own analysis using Linguist, the same software GitHub uses. That's the only way I know of to get lines of code.
Another option is to use the API to list the languages the project uses. It doesn't give them in lines but in bytes. For example...
$ curl https://api.github.com/repos/evalEmpire/perl5i/languages
{
"Perl": 274835
}
Though take that with a grain of salt, that project includes YAML and JSON which the web site acknowledges but the API does not.
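Since the languages endpoint reports bytes rather than lines, about the most you can derive from it is each language's share of the total. A minimal sketch, assuming the response has already been parsed into a dict (the function name is just for illustration):

```python
def language_shares(bytes_by_language):
    """Convert {language: bytes} into {language: percent of total}, largest first."""
    total = sum(bytes_by_language.values())
    if total == 0:
        return {}
    ranked = sorted(bytes_by_language.items(), key=lambda item: item[1], reverse=True)
    return {language: round(100 * size / total, 1) for language, size in ranked}
```

For the perl5i response above this yields 100% Perl, which again shows that the API leaves out the YAML and JSON files.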
Finally, you can use code search to ask which files match a given language. This example asks which files in perl5i are Perl. https://api.github.com/search/code?q=language:perl+repo:evalEmpire/perl5i. It will not give you lines, and you have to ask for the file size separately using the returned url for each file.

Not currently possible on GitHub.com or via its APIs
I have talked to customer support and confirmed that this cannot be done on github.com. They have passed the suggestion along to the GitHub team though, so hopefully it will be possible in the future. If so, I'll be sure to edit this answer.
Meanwhile, Rory O'Kane's answer is a brilliant alternative based on cloc and a shallow repo clone.

From @Tgr's comment, there is an online tool:
https://codetabs.com/count-loc/count-loc-online.html

You can use tokei:
cargo install tokei
git clone --depth 1 https://github.com/XAMPPRocky/tokei
tokei tokei/
Output:
===============================================================================
 Language                     Files      Lines       Code   Comments     Blanks
===============================================================================
 BASH                             4         48         30         10          8
 JSON                             1       1430       1430          0          0
 Shell                            1         49         38          1         10
 TOML                             2         78         65          4          9
-------------------------------------------------------------------------------
 Markdown                         4       1410          0       1121        289
 |- JSON                          1         41         41          0          0
 |- Rust                          1         47         38          5          4
 |- Shell                         1         19         16          0          3
 (Total)                                  1517         95       1126        296
-------------------------------------------------------------------------------
 Rust                            19       3750       3123        119        508
 |- Markdown                     12        358          5        302         51
 (Total)                                  4108       3128        421        559
===============================================================================
 Total                           31       6765       4686       1255        824
===============================================================================
Tokei has support for badges:
Count Lines
[![](https://tokei.rs/b1/github/XAMPPRocky/tokei)](https://github.com/XAMPPRocky/tokei)
By default the badge shows the repo's LoC (lines of code). You can also make it show a different category by using the ?category= query string, which can be one of code, blanks, files, lines, or comments.
Count Files
[![](https://tokei.rs/b1/github/XAMPPRocky/tokei?category=files)](https://github.com/XAMPPRocky/tokei)
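If you generate READMEs for several repos, the badge markup above is easy to build programmatically. A small sketch that follows the URL shape shown in the two examples; the function name and the validation are my own additions, not part of tokei:

```python
from urllib.parse import urlencode

# Categories listed in the answer above.
CATEGORIES = {"code", "blanks", "files", "lines", "comments"}


def tokei_badge_markdown(owner, repo, category=None):
    """Return Markdown for a tokei.rs badge that links to the GitHub repo."""
    badge = f"https://tokei.rs/b1/github/{owner}/{repo}"
    if category is not None:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        badge += "?" + urlencode({"category": category})
    return f"[![]({badge})](https://github.com/{owner}/{repo})"
```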

You can use GitHub API to get the sloc like the following function
function getSloc(repo, tries) {
  // repo is the repo's path; it must start with a slash, e.g. "/jquery/jquery"
  if (!repo) {
    return Promise.reject(new Error("No repo provided"));
  }
  // GitHub's API may return an empty object the first time it is accessed,
  // so we try several times and then stop
  if (tries === 0) {
    return Promise.reject(new Error("Too many tries"));
  }
  let url = "https://api.github.com/repos" + repo + "/stats/code_frequency";
  return fetch(url)
    .then(x => x.json())
    // each entry is [timestamp, additions, deletions]; deletions are negative
    .then(x => x.reduce((total, changes) => total + changes[1] + changes[2], 0))
    .catch(err => getSloc(repo, tries - 1));
}
Personally, I made a Chrome extension which shows the number of SLOC on both the GitHub project list and the project detail page. You can also set your personal access token to access private repositories and bypass the API rate limit.
You can download from here https://chrome.google.com/webstore/detail/github-sloc/fkjjjamhihnjmihibcmdnianbcbccpnn
Source code is available here https://github.com/martianyi/github-sloc
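For use outside the browser, the same summation can be sketched in Python. The helper names, the retry count and the delay are my own choices; the only thing taken from the API is that each row is [timestamp, additions, deletions] with deletions as negative numbers, and that the statistics endpoints answer 202 while the data is still being computed.

```python
import json
import time
import urllib.request


def total_from_code_frequency(weeks):
    """Sum additions and (negative) deletions over [timestamp, additions, deletions] rows."""
    return sum(additions + deletions for _timestamp, additions, deletions in weeks)


def fetch_code_frequency(repo, tries=5, delay=2.0):
    """Fetch weekly stats for 'owner/name', retrying while GitHub has no data yet."""
    url = f"https://api.github.com/repos/{repo}/stats/code_frequency"
    for _ in range(tries):
        with urllib.request.urlopen(url) as response:
            if response.status == 200:
                return json.load(response)
        time.sleep(delay)  # e.g. 202: statistics are still being computed
    raise RuntimeError("statistics not ready after several tries")
```

Something like total_from_code_frequency(fetch_code_frequency("jquery/jquery")) then gives the net line count, with the same caveats about accuracy as the other stats-based answers.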

Hey all this is ridiculously easy...
Create a new branch from your first commit
When you want to find out your stats, create a new PR from main
The PR will show you the number of changed lines - as you're doing a PR from the first commit all your code will be counted as new lines
And the added benefit is that if you don't approve the PR and just leave it in place, the stats (No of commits, files changed and total lines of code) will simply keep up-to-date as you merge changes into main. :) Enjoy.

Firefox add-on Github SLOC
I wrote a small Firefox add-on that prints the number of lines of code on GitHub project pages: Github SLOC

npm install sloc -g
git clone --depth 1 https://github.com/vuejs/vue/
sloc ".\vue\src" --format cli-table
rm -rf ".\vue\"
Instructions and Explanation
Install sloc from npm, a command line tool (Node.js needs to be installed).
npm install sloc -g
Clone shallow repository (faster download than full clone).
git clone --depth 1 https://github.com/facebook/react/
Run sloc and specify the path that should be analyzed.
sloc ".\react\src" --format cli-table
sloc supports formatting the output as a cli-table, as json or csv. Regular expressions can be used to exclude files and folders (Further information on npm).
Delete repository folder (optional)
PowerShell: rm -r -force ".\react\" or on Mac/Unix: rm -rf ./react/
It is also possible to get details for every file with the --details option:
sloc ".\react\src" --format cli-table --details

Open terminal and run the following:
curl -L "https://api.codetabs.com/v1/loc?github=username/reponame"

If the question is "can you quickly get NUMBER OF LINES of a github repo", the answer is no as stated by the other answers.
However, if the question is "can you quickly check the SCALE of a project", I usually gauge a project by looking at its size. Of course the size will include deltas from all active commits, but it is a good metric as the order of magnitude is quite close.
E.g.
How big is the "docker" project?
In your browser, enter api.github.com/repos/ORG_NAME/PROJECT_NAME
i.e. api.github.com/repos/docker/docker
In the response hash, you can find the size attribute:
{
...
size: 161432,
...
}
This should give you an idea of the relative scale of the project. The number seems to be in KB, but when I checked it on my computer it's actually smaller, even though the order of magnitude is consistent. (161432KB = 161MB, du -s -h docker = 65MB)
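If you check sizes often, pulling the field out of the response is a one-liner. A minimal sketch, assuming (as this answer does) that size is in kilobytes and that 1 MB = 1000 KB; the function name is hypothetical:

```python
import json


def repo_size_mb(repo_json):
    """Return the 'size' field of a /repos/ORG_NAME/PROJECT_NAME response in MB."""
    size_kb = json.loads(repo_json)["size"]  # assumed to be kilobytes
    return round(size_kb / 1000, 1)
```

With the docker example above, a body containing "size": 161432 comes out as 161.4.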

Pipe the output from the number of lines in each file to sort to organize files by line count.
git ls-files | xargs wc -l |sort -n

This is so easy if you are using VS Code and you clone the project first. Just install the Lines of Code (LOC) VS Code extension and then run LineCount: Count Workspace Files from the Command Palette.
The extension shows summary statistics by file type and it also outputs result files with detailed information by each folder.

There is another online tool that counts lines of code for public and private repos without having to clone/download them - https://klock.herokuapp.com/

None of the answers here satisfied my requirements. I only wanted to use existing utilities. The following script will use basic utilities:
Git
GNU or BSD awk
GNU or BSD sed
Bash
Get total lines added to a repository (subtracts lines deleted from lines added).
#!/bin/bash
git diff --shortstat 4b825dc642cb6eb9a060e54bf8d69288fbee4904 HEAD | \
sed 's/[^0-9,]*//g' | \
awk -F, '!($2 > 0) {$2="0"};!($3 > 0) {$3="0"}; {print $2-$3}'
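If you are not tied to sed and awk, the parsing step can also be sketched in Python. This is an alternative under my own assumptions, not the script above: it matches the "insertions(+)" and "deletions(-)" labels explicitly, so a summary that only mentions deletions is not misread as insertions.

```python
import re


def net_lines(shortstat):
    """Return insertions minus deletions from a `git diff --shortstat` summary line."""
    insertions = re.search(r"(\d+) insertions?\(\+\)", shortstat)
    deletions = re.search(r"(\d+) deletions?\(-\)", shortstat)
    added = int(insertions.group(1)) if insertions else 0
    removed = int(deletions.group(1)) if deletions else 0
    return added - removed
```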
Get lines of code filtered by specified file types of known source code (e.g. *.py files or add more extensions, etc).
#!/bin/bash
git diff --shortstat 4b825dc642cb6eb9a060e54bf8d69288fbee4904 HEAD -- '*.py' '*.java' '*.js' | \
sed 's/[^0-9,]*//g' | \
awk -F, '!($2 > 0) {$2="0"};!($3 > 0) {$3="0"}; {print $2-$3}'
4b825dc642cb6eb9a060e54bf8d69288fbee4904 is the id of the "empty tree" in Git and it's always available in every repository.
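The reason this id is the same everywhere is that Git object ids are just hashes of a short header plus the content, and the empty tree has no content. A small sketch that recomputes it, assuming a repository using the default SHA-1 object format (the helper name is mine):

```python
import hashlib


def git_object_id(object_type, content):
    """Compute a Git object id: SHA-1 over '<type> <size>' + NUL + content."""
    header = f"{object_type} {len(content)}".encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# A tree with no entries has empty content.
EMPTY_TREE = git_object_id("tree", b"")
```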
Sources:
My own scripting
How to get Git diff of the first commit?
Is there a way of having git show lines added, lines changed and lines removed?

shields.io has a badge that can count up all the lines for you. For example, it can count the lines of the Raycast extensions repo.

You can use sourcegraph, an open source search engine for code. It can connect to your GitHub account, index the content, and then on the admin section you would see the number of lines of code indexed.

I made an NPM package specifically for this: a CLI tool that takes the directory path and the folders/files to ignore.
It goes like this:
npm i -g @quasimodo147/countlines
to get the countlines command in your terminal.
Then you can do
countlines . node_modules build dist

Related

Get last commit for every file of a file list in Mercurial

I have an hg repository and I would like to know the last commit date of every file in sources/php/dracca/endpoint/wiki/**/Wiki*.php
So far, I have this one liner:
find sources/php/dracca/endpoint/wiki/ -name "Wiki*.php" -exec hg log --limit 1 --template "{date|shortdate}" {} \; -exec echo {} \;
But this seems utterly slow, as (I suppose) find makes one hg call per file, leading to 15 seconds of computation for the (say) ~40 files I have in there...
Is there a faster way?
The output of this command looks like:
2019-09-20 sources/php/dracca/endpoint/wiki/characters/colmarr/WikiCharactersColmarrEndpoint.php
2019-09-20 sources/php/dracca/endpoint/wiki/characters/dracquints/allgroup/WikiCharactersDracquintsAllgroupEndpoint.php
...
It might be changed a bit if needed (I won't mind having, say, 1 date and then the list of files changed for that date, or whatever like this)
Even with find+exec, you can shorten the chain by dropping the last exec and using a modified template, {date|shortdate}\n
You can use the (accepted) perl-ism from this question, or ask somebody to update the mentioned lof extension to current Mercurial (code from 2012 will not work now).
Alternatives (dirty ugly hacks)
In any case, you can (and have to) call hg only once and perform some post-processing of the results.
Before trying these tricks, read hg help filesets and build one common fileset for your files (I suppose it can be just set:sources/php/dracca/endpoint/wiki/**/Wiki*.php, but this is to be tested!)
After it, you can:
Perform hg log like this
hg log setup.* --template "{files % '{file} {rev} {date|shortdate}\n'}"
(I used simple pattern for test, you have to have own fileset)
get output in such form
setup.py 1163 2018-11-07
README.md 1162 2018-11-07
setup.py 1162 2018-11-07
hggit/git_handler.py 1124 2018-05-01
setup.py 1124 2018-05-01
setup.cfg 1118 2017-11-27
setup.py 1117 2017-11-27
hggit/git2hg.py 1111 2017-11-27
hggit/overlay.py 1111 2017-11-27
setup.py 1111 2017-11-27
…
(there are some unwanted files, because I output all files of every revision that touches a file of interest, without filtering). You have to grep only the needed files, sort by columns 1 and 2, and use the date of the latest revision of each file
Use hg grep. For the above test-pattern
hg grep "." -I setup.* --files-with-matches -d -q
(find any changes, output only filename+revision, short date)
you'll get something like
setup.py:1163:2018-11-07
setup.cfg:1118:2017-11-27
and the third column will be the last modification date of the file you need
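The post-processing described for the hg log output (keep only the wanted files, then take the newest revision of each) could look roughly like this in Python. It is a sketch assuming lines of the form "file rev date" as shown above; the function name and the wanted callback are hypothetical.

```python
def latest_revision_per_file(log_lines, wanted=None):
    """Map each file to (rev, date) of its highest revision in 'file rev date' lines."""
    latest = {}
    for line in log_lines:
        # Split from the right so that file names containing spaces survive.
        parts = line.strip().rsplit(" ", 2)
        if len(parts) != 3 or not parts[1].isdigit():
            continue
        path, rev, date = parts[0], int(parts[1]), parts[2]
        if wanted is not None and not wanted(path):
            continue
        if path not in latest or rev > latest[path][0]:
            latest[path] = (rev, date)
    return latest
```

Passing wanted=lambda p: p.startswith("setup.") on the sample output keeps setup.py at revision 1163 and setup.cfg at 1118, and drops README.md and the hggit files.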

Taking github repo public causes problems with Dist::Zilla

I have a module, built with Dist::Zilla. I have Dist::Zilla set up to automatically push changes out to my GitHub repo. Works great when the repo is private.
However, as soon as I make the repo public, I start getting errors during the build process. Specifically, these lines in the dist.ini
[Bugtracker]
web = http://github.com/myaccount/%s/issues
If I comment out these lines, it works. With these lines left in, I get an error:
Duplication of element resources.bugtracker.web at /Users/me/perl5/perlbrew/perls/perl-5.24.1/lib/site_perl/5.24.4/Dist/Zilla.pm line 595.
OK, so fine, I comment out the lines. However, another problem crops up. The version number of my builds no longer autoincrements and is stuck at the same number every time I try to release a build.
Is there some configuration setting I need to change with Dist::Zilla so it will play nice with public github repos? Here is the full dist.ini file:
name = Module-Test
author = me
license = Perl_5
copyright_holder = Me
copyright_year = 2018
[Repository]
;[Bugtracker]
;web = http://github.com/sdondley/%s/issues
[Git::NextVersion]
[GitHub::Meta]
[PodVersion]
[PkgVersion]
[NextRelease]
[Run::AfterRelease]
run = mv Changes tmp && cp %n-%v/Changes Changes
[InstallGuide]
[PodWeaver]
[ReadmeAnyFromPod]
type = markdown
location = root
phase = release
[Git::Check]
[Git::Commit]
allow_dirty = README.mkdn
allow_dirty = Changes
allow_dirty = INSTALL
[Git::Tag]
[Git::Push]
[Run::AfterRelease / MyAppAfter]
run = mv tmp/Changes Changes
[GatherDir]
[AutoPrereqs]
[PruneCruft]
[PruneFiles]
filename = weaver.ini
filename = README.mkdn
filename = dist.ini
filename = .gitignore
[ManifestSkip]
[MetaYAML]
[License]
[Readme]
[ExtraTests]
[ExecDir]
[ShareDir]
[MakeMaker]
[Manifest]
[TestRelease]
[FakeRelease]
Your [Bugtracker] entry leads to duplication because you are also setting the bugtracker through [GitHub::Meta]. Choose one or the other.
As for version number management, note that [Git::NextVersion] is based on your git tags. Make sure that these tags are present in your local repository and have the correct format. That plugin uses a command line invocation similar to this to obtain all tags:
git rev-list --simplify-by-decoration --pretty=%d HEAD | grep -oE 'tag: [^,)\s]+'
Public GitHub repos should not be a problem for Dist::Zilla – this is exactly the setup most dzil distros use anyway. But interactions between multiple plugins can lead to hard to track down bugs, especially since the order of plugins is important. It can help to organize your plugins by the phase in which they run, and to test whether the problem persists after removing optional plugins. It also tends to be better to start with a simple dist.ini and add plugins as pain points in your development process become apparent.
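To check quickly which of your tags would count as version tags, you can paste the output of the git rev-list command above into something like the following. It is a sketch in Python rather than the plugin's actual code, and the ^v(.+)$ pattern is my assumption about the default tag format; adjust it if you configured a different one.

```python
import re

# Same tag extraction as the grep expression above.
TAG_PATTERN = re.compile(r"tag: ([^,)\s]+)")
# Assumption: version tags look like v1.2.3.
VERSION_PATTERN = re.compile(r"^v(.+)$")


def version_tags(decoration_lines):
    """Return the version part of every tag that matches VERSION_PATTERN."""
    versions = []
    for line in decoration_lines:
        for tag in TAG_PATTERN.findall(line):
            match = VERSION_PATTERN.match(tag)
            if match:
                versions.append(match.group(1))
    return versions
```

If this returns an empty list for your repository, the tags are either missing locally or not in the expected format, which would explain a version number that never increments.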

Download latest GitHub release

I'd like to have "Download Latest Version" button on my website which would represent the link to the latest release (stored at GitHub Releases). I tried to create release tag named "latest", but it became complicated when I tried to load new release (confusion with tag creation date, tag interchanging, etc.). Updating download links on my website manually is also a time-consuming and scrupulous task. I see the only way - redirect all download buttons to some html, which in turn will redirect to the actual latest release.
Note that my website is hosted at GitHub Pages (static hosting), so I simply can't use server-side scripting to generate links. Any ideas?
You don't need any scripting to generate a download link for the latest release. Simply use this format:
https://github.com/:owner/:repo/zipball/:branch
Examples:
https://github.com/webix-hub/tracker/zipball/master
https://github.com/iDoRecall/selection-menu/zipball/gh-pages
If for some reason you want to obtain a link to the latest release download, including its version number, you can obtain that from the get latest release API:
GET /repos/:owner/:repo/releases/latest
Example:
$.get('https://api.github.com/repos/idorecall/selection-menu/releases/latest', function (data) {
$('#result').attr('href', data.zipball_url);
});
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<a id="result">Download latest release (.ZIP)</a>
GitHub now provides a "Latest release" button on the release page of a project, after you have created your first release.
In the example you gave, this button links to https://github.com/reactiveui/ReactiveUI/releases/latest
You can use the following, where:
${Organization} is the GitHub user or organization
${Repository} is the repository name
curl -L https://api.github.com/repos/${Organization}/${Repository}/tarball > ${Repository}.tar.gz
The top level directory in the .tar.gz file has the sha hash of the commit in the directory name which can be a problem if you need an automated way to change into the resulting directory and do something.
The method below will strip this out, and leave the files in a folder with a predictable name.
mkdir ${Repository}
curl -L https://api.github.com/repos/${Organization}/${Repository}/tarball | tar -zxv -C ${Repository} --strip-components=1
Since February 18th, 2015, the GitHub V3 release API has a get latest release API.
GET /repos/:owner/:repo/releases/latest
See also "Linking to releases".
Still, the name of the asset can be tricky.
Git-for-Windows, for instance, requires a command like:
curl -IkLs -o NUL -w %{url_effective} \
https://github.com/git-for-windows/git/releases/latest|\
grep -o "[^/]*$"| sed "s/v//g"|\
xargs -I T echo \
https://github.com/git-for-windows/git/releases/download/vT/PortableGit-T-64-bit.7z.exe \
-o PortableGit-T-64-bit.7z.exe| \
sed "s/.windows.1-64/-64/g"|sed "s/.windows.\(.\)-64/.\1-64/g"|\
xargs curl -kL
The first 3 lines extract the latest version 2.35.1.windows.2
The rest will build the right URL
https://github.com/git-for-windows/git/releases/download/
v2.35.1.windows.2/PortableGit-2.35.1.2-64-bit.7z.exe
^^^^^^^^^^^^^^^^^ ^^^^^^^^^
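The renaming rule hidden in those sed calls is easier to see when written out. A sketch of it in Python, derived only from the two substitutions and the example URL above (the function name is mine):

```python
import re


def portable_git_url(tag):
    """Build the PortableGit download URL for a tag such as 'v2.35.1.windows.2'."""
    version = tag[1:] if tag.startswith("v") else tag
    # '.windows.1' is dropped entirely; '.windows.N' becomes '.N' for later builds.
    short = re.sub(r"\.windows\.1$", "", version)
    short = re.sub(r"\.windows\.(\d+)$", r".\1", short)
    return (
        "https://github.com/git-for-windows/git/releases/download/"
        f"{tag}/PortableGit-{short}-64-bit.7z.exe"
    )
```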
Maybe you could use some client-side scripting and dynamically generate the target of the link by invoking the GitHub API, through some jQuery magic?
The Releases API exposes a way to retrieve the list of all the releases from a repository. For instance, this link returns a JSON-formatted list of all the releases of the ReactiveUI project.
Extracting the first one would return the latest release.
Within this payload:
The html_url attribute will hold the first part of the url to build (ie. https://github.com/{owner}/{repository}/releases/{version}).
The assets array lists the downloadable archives. Each asset bears a name attribute.
Building the target download url is only a few string operations away.
Insert the download/ keyword between the releases/ segment from the html_url and the version number
Append the name of the asset to download
Resulting url will be of the following format: https://github.com/{owner}/{repository}/releases/download/{version}/name_of_asset
For instance, in the JSON payload from the ReactiveUI link above, we've got html_url: "https://github.com/reactiveui/ReactiveUI/releases/5.99.0" and one asset with name: "ReactiveUI.6.0.Preview.1.zip".
As such, the download url is https://github.com/reactiveui/ReactiveUI/releases/download/5.99.0/ReactiveUI.6.0.Preview.1.zip
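The string operations above can be sketched as follows. The function name is hypothetical, and the handling of a tag/ segment is an assumption on my part for payloads where html_url looks like .../releases/tag/{version}.

```python
def asset_download_url(html_url, asset_name):
    """Insert 'download/' after 'releases/' in html_url and append the asset name."""
    base, separator, version = html_url.rpartition("/releases/")
    if not separator:
        raise ValueError("html_url does not contain /releases/")
    # Assumption: some payloads use .../releases/tag/{version}.
    if version.startswith("tag/"):
        version = version[len("tag/"):]
    return f"{base}/releases/download/{version}/{asset_name}"
```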
If you are using PHP, try the following code:
function getLatestTagUrl($repository, $default = 'master') {
    $file = @json_decode(@file_get_contents("https://api.github.com/repos/$repository/tags", false,
        stream_context_create(['http' => ['header' => "User-Agent: Vestibulum\r\n"]])
    ));
    return sprintf("https://github.com/$repository/archive/%s.zip", $file ? reset($file)->name : $default);
}
Function usage example
echo 'Download';
I didn't see this answer here, but it was quite helpful for me while running continuous integration tests: this one-liner, which only requires curl, searches a GitHub repo's releases and downloads the latest version
https://gist.github.com/steinwaywhw/a4cd19cda655b8249d908261a62687f8
I use it to run PHPStan on our repository using the following script
https://gist.github.com/rvanlaak/7491f2c4f0c456a93f90e31774300b62
If you are trying to download from any Linux (even old or tiny versions), or from a bash script, then the failproof way is using this command:
wget https://api.github.com/repos/$OWNER/$REPO/releases/latest -O - | awk -F \" -v RS="," '/browser_download_url/ {print $(NF-1)}' | xargs wget
Do not forget to replace $OWNER and $REPO with the right owner and repository names. The command downloads a JSON page with the data of the latest release; then awk gets the value of the browser_download_url key.
If you are on a really old Linux or a tiny embedded system with a small wget, the download name can be a problem. In that case you can always use the ultra-reliable:
URL=$(wget https://api.github.com/repos/$OWNER/$REPO/releases/latest -O - | awk -F \" -v RS="," '/browser_download_url/ {print $(NF-1)}'); wget $URL -O $(basename "$URL")
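Where Python is available, a JSON parser is less fragile than splitting the response on commas with awk. A minimal sketch, assuming the body of /releases/latest has already been downloaded into a string (the function name is mine):

```python
import json


def asset_urls(release_json):
    """Return the browser_download_url of every asset in a release payload."""
    release = json.loads(release_json)
    return [asset["browser_download_url"] for asset in release.get("assets", [])]
```

A release without uploaded assets simply yields an empty list.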
As noted by @Dan Dascalescu in a comment on the accepted answer, there are some projects (roughly 30%) which do not bother to file formal releases, so neither the "Latest release" button nor the /releases/latest API call would return useful data.
To reliably fetch the latest release for a GitHub project, you can use lastversion.

Why does hg clone of hg.netbeans.org/main report "9 integrity errors"?

I just finished cloning the (huge) netbeans repository for
the second time. I found that I couldn't
successfully pull changes, after my first attempt to clone, earlier this week.
I guessed that some intermittent error had
corrupted the repository the first time around... that appears not to
be the case.
I'm using hg 1.3.1 on Ubuntu 9.4 (32-bit).
I cloned with hg clone http://hg.netbeans.org/main main
hg verify (below) ends with:
9 warnings encountered!
9 integrity errors encountered!
incidentally, the size of 00manifest.d is 1.1GiB, is that normal?
What could be causing this? Where does one even report this kind of error?
(assuming for the moment that it's not a PEBKAC.)
This should give you an idea of what I'm seeing (repetitive bits removed to save space):
[smithma@oberon:~/w/netbeans/main]
$ { hg --version ; echo ; echo ; hg --debug verify ; } | tee
../netbeans-main-hg-verify.txt
Mercurial Distributed SCM (version 1.3.1)
Copyright (C) 2005-2009 Matt Mackall and others
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
manifest@?: rev 149491 points to unexpected changeset 149752
(expected 149754)
[...SNIP...]
repository uses revlog format 1
checking changesets
checking manifests
crosschecking files in changesets and manifests
checking files
applemenu/src/org/netbeans/modules/applemenu/layer.xml@?: rev 12
points to unexpected changeset 149753
(expected 41473 46378 56815 59563 66079 70568 71017 83303 103972 105432 135060 137239 147766 149755)
warning: cnd.repository/src/org/netbeans/modules/cnd/repository/disk/UnitImpl.java@74688:
copy source revision is nullid cnd.repository/src/org/netbeans/modules/cnd/repository/disk/UnitDiskRepository.java:000000000000
[...SNIP...]
defaults/src/org/netbeans/modules/defaults/mf-layer.xml@?: rev 74
points to unexpected changeset 149753
(expected 25730 25732 25733 25741 25746 25747 25752 25768 26270 26561
27350 27495 27539 27566 27776 28203 28741 29191 29244 29364 29582
32476 33848 34406 35712 35713 36197 38355 40775 40854 42144 43593 44912
46378 46644 46697 46757 48145 48325 49166 50888 54548 54616 54618
55792 56816 56868 56895 56915 57513 58323 59288 59456 59563 59709 60225
66549 67160 67595 76198 77297 85585 86938 87361 93609 93755 113163
113177 117980 117992 124182 124475 135060 147766 149755)
[...SNIP...]
118132 files, 151874 changesets, 591274 total revisions
9 warnings encountered!
9 integrity errors encountered!
First, no, it's not a PEBKAC. The errors from verify are fixable; the best way is probably to contact a Mercurial dev to write a script fixing the broken linkrevs.
The huge manifest could be dealt with using contrib/shrink-revlog.py; from a quick test, I think it would shrink to approximately 50MB.

Find deleted files in Mercurial repository history, quickly?

You can use hg grep, but it searches the contents of all files.
What if I just want to search the file names of deleted files to recover one?
I tried hg grep -I <file-name-pattern> <pattern> but this seems to return no results.
using templates is simple:
$ hg log --template "{rev}: {file_dels}\n"
Update for Mercurial 1.6
You can use revsets for this too:
hg log -r "removes('**')"
(Edit: Note the double * - a single one detects removals from the root of the repository only.)
Edit: As Mathieu Longtin suggests, this can be combined with the template from dfa's answer to show you which files each listed revision removes:
hg log -r "removes('**')" --template "{rev}: {file_dels}\n"
That has the virtue (for machine-readability) of listing one revision per line, but you can make the output prettier for humans by using % to format each item in the list of deletions:
hg log -r "removes('**')" --template "{rev}:\n{file_dels % '{file}\n'}\n"
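If you only care about certain names, the one-revision-per-line output of the {rev}: {file_dels} template is easy to filter afterwards. A sketch in Python under my own assumptions: the function name is hypothetical, and since {file_dels} separates names with spaces, file names that themselves contain spaces would be split apart.

```python
from fnmatch import fnmatch


def revisions_deleting(log_lines, pattern):
    """Return (rev, [matching deleted files]) for '{rev}: {file_dels}' lines."""
    hits = []
    for line in log_lines:
        rev, _, files = line.partition(": ")
        matched = [name for name in files.split() if fnmatch(name, pattern)]
        if rev.isdigit() and matched:
            hits.append((int(rev), matched))
    return hits
```

For example, revisions_deleting(lines, "*.c") returns only the revisions that removed at least one .c file.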
If you are using TortoiseHg workbench, a convenient way is to use the revision filter. Just hit ctrl+s, and then type
removes("**/FileYouWantToFind.txt")
**/ indicates that you want to search recursively in your repository.
You can use * wildcard in the filename too. You can combine this query with other revision sets using and, or operators.
There is also this Advanced Query Editor:
I have taken the other answers and improved on them.
Added --no-merges. On a large project with dev teams, there will be lots of merges; --no-merges will filter out the log noise.
Changed removes("**") to sort(removes("**"), -rev). For a large project with over 100K changesets, this will get to the latest files removed a lot faster. This reverses the order, starting at tip instead of at rev 0.
Added {author} and {desc} to the output. This gives context as to why the files were removed, by displaying the log comment and who did it.
So for my use case, it was hg log --template "File(s) deleted in rev {rev}: {author} \n {desc}\n {file_dels % '\n {file}'}\n\n" -r 'sort(removes("**"), -rev)' --no-merges
Sample output:
File(s) deleted in rev 52363: Ansariel
STORM-2141: Fix various inventory floater related issues:
* Opening new inventory via Control-Shift-I shortcut uses legacy and potentinally dangerous code path
* Closing new inventory windows don't release memory
* During shutdown legacy and inoperable code for inventory window cleanup is called
* Remove old and unused inventory legacy code
indra/newview/llfloaterinventory.cpp
indra/newview/llfloaterinventory.h
File(s) deleted in rev 51951: Ansariel
Remove readme.md file - again...
README.md
File(s) deleted in rev 51856: Brad Payne (Vir Linden) <vir@lindenlab.com>
SL-276 WIP - removed avatar_skeleton_spine_joints.xml
indra/newview/character/avatar_skeleton_spine_joints.xml
File(s) deleted in rev 51821: Brad Payne (Vir Linden) <vir@lindenlab.com>
SL-276 WIP - removed avatar_XXX_orig.xml files.
indra/newview/character/avatar_lad_orig.xml
indra/newview/character/avatar_skeleton_orig.xml
Search for a specific file you deleted efficiently, and format the result nicely:
hg log --template "File(s) deleted in rev {rev}: {file_dels % '\n {file}'}\n\n" -r 'removes("**/FileYouWantToFind.txt")'
Sample output:
File(s) deleted in rev 33336:
class/WebEngineX/Database/RawSql.php
File(s) deleted in rev 34468:
class/PdoPlus/AccessDeniedException.php
class/PdoPlus/BulkInsert.php
class/PdoPlus/BulkInsertInfo.php
class/PdoPlus/CannotAddForeignKeyException.php
class/PdoPlus/DuplicateEntryException.php
class/PdoPlus/Escaper.php
class/PdoPlus/MsPdo.php
class/PdoPlus/MyPdo.php
class/PdoPlus/MyPdoException.php
class/PdoPlus/NoSuchTableException.php
class/PdoPlus/PdoPlus.php
class/PdoPlus/PdoPlusException.php
class/PdoPlus/PdoPlusStatement.php
class/PdoPlus/RawSql.php
from project root
hg status . | grep "\!" >> /tmp/filesmissinginrepo.txt