I would like to download publicly available data from Google Cloud Storage. However, because I need to be in a Python 3.x environment, it is not possible to use gsutil. I can download individual files with wget as
wget http://storage.googleapis.com/path-to-file/output_filename -O output_filename
However, commands like
wget -r --no-parent https://console.cloud.google.com/path_to_directory/output_directoryname -O output_directoryname
do not seem to work; they just download an index file for the directory. Based on some initial attempts, rsync and curl fare no better. Any idea how to download publicly available data on Google Cloud Storage as a directory?
The approach you mentioned above does not work because Google Cloud Storage doesn't have real "directories". As an example, "path/to/some/files/file.txt" is the entire name of that object. A similarly named object, "path/to/some/files/file2.txt", just happens to share the same naming prefix.
As for how you could fetch these files: The GCS APIs (both XML and JSON) allow you to do an object listing against the parent bucket, specifying a prefix; in this case, you'd want all objects starting with the prefix "path/to/some/files/". You could then make individual HTTP requests for each of the objects specified in the response body. That being said, you'd probably find this much easier to do via one of the GCS client libraries, such as the Python library.
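For illustration, here is a minimal Python 3 sketch of that listing-plus-fetching approach against the JSON API, using the requests library; the bucket and prefix names are placeholders, and pagination (the nextPageToken field) is ignored for brevity:

import requests

bucket = "my-public-bucket"      # hypothetical bucket name
prefix = "path/to/some/files/"   # hypothetical prefix

# List all objects whose names start with the prefix (public data needs no auth).
listing = requests.get(
    f"https://www.googleapis.com/storage/v1/b/{bucket}/o",
    params={"prefix": prefix},
).json()

for obj in listing.get("items", []):
    name = obj["name"]
    if name.endswith("/"):  # skip zero-byte "directory" placeholder objects
        continue
    # Make an individual HTTP request for each object, as described above.
    resp = requests.get(f"https://storage.googleapis.com/{bucket}/{name}")
    with open(name.rsplit("/", 1)[-1], "wb") as f:
        f.write(resp.content)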
Also, gsutil currently has a GitHub issue open to track adding support for Python 3.
I'm trying to recursively fetch all pages linked from a Moin wiki page. I've tried many different wget recursive options, which all have the same result: only the html file from the given URL gets downloaded, not any of the pages linked from that html page.
If I use the --convert-links option, wget correctly translates the unfetched links to the right web links. It just doesn't recursively download those linked pages.
wget --verbose -r https://wiki.gnome.org/Outreachy
--2017-03-02 10:34:03-- https://wiki.gnome.org/Outreachy
Resolving wiki.gnome.org (wiki.gnome.org)... 209.132.180.180, 209.132.180.168
Connecting to wiki.gnome.org (wiki.gnome.org)|209.132.180.180|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘wiki.gnome.org/Outreachy’
wiki.gnome.org/Outreachy [ <=> ] 52.80K 170KB/s in 0.3s
2017-03-02 10:34:05 (170 KB/s) - ‘wiki.gnome.org/Outreachy’ saved [54064]
FINISHED --2017-03-02 10:34:05--
Total wall clock time: 1.4s
Downloaded: 1 files, 53K in 0.3s (170 KB/s)
I'm not sure if it's failing because the wiki's html links don't end with .html. I've tried using various combinations of --accept='[a-zA-Z0-9]+', --page-requisites, and --accept-regex='[a-zA-Z0-9]+' to work around that, with no luck.
I'm not sure if it's failing because the wiki has html pages like https://wiki.gnome.org/Outreachy that link to page URLs like https://wiki.gnome.org/Outreachy/Admin and https://wiki.gnome.org/Outreachy/Admin/GettingStarted. Maybe wget is confused because there will need to be an HTML page and a directory with the same name? I also tried using -nd, but no luck.
The linked html pages are all relative to the base wiki URL (e.g. Outreachy history page). I've also tried adding --base="https://wiki.gnome.org/" with no luck.
At this point, I've tried a whole lot of different wget options, read several stack overflow and unix.stackexchange.com questions, and nothing I've tried has worked. I'm hoping there's a wget expert who can look at this particular wiki page and figure out why wget is failing to recursively fetch linked pages. The same options work fine on other domains.
I've also tried httrack, with the same result. I'm running Linux, so please don't suggest Windows or proprietary tools.
This seems to be caused by the following tag in the wiki:
<meta name="robots" content="index,nofollow">
If you are sure you want to ignore the tag, you can make wget ignore it using -e robots=off:
wget -e robots=off --verbose -r https://wiki.gnome.org/Outreachy
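As a side note, if you want to confirm that a page carries this tag before ignoring it, a small Python sketch like the following (not part of the original answer; it uses requests plus the standard-library HTML parser) will print the robots meta content:

import requests
from html.parser import HTMLParser

class RobotsMetaFinder(HTMLParser):
    """Collect the content of any <meta name="robots"> tag."""
    def __init__(self):
        super().__init__()
        self.robots = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and (a.get("name") or "").lower() == "robots":
            self.robots = a.get("content")

finder = RobotsMetaFinder()
finder.feed(requests.get("https://wiki.gnome.org/Outreachy").text)
print(finder.robots)  # prints "index,nofollow" if the tag is present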
I'm attempting to create a mirror of a WordPress site with clean URLs (i.e. http://example.org/foo not http://example.org/foo.php). When Wget mirrors the site, it gives all pages and links a ".html" extension (i.e. http://example.org/foo.html).
Is it possible to set options for Wget to create a clean URL structure, so that the mirrored file corresponding to the page "http://example.org/foo" would be "/foo/index.html" and the link to that page would be "http://example.org/foo"? If so, how?
If I understand your question correctly, you're asking for what is the default behaviour of Wget.
Wget will only add the extension to the local copy if the --adjust-extension option has been passed to it. Quoting the man page for Wget:
--adjust-extension
If a file of type application/xhtml+xml or text/html is downloaded and the URL does not end with the regexp \.[Hh][Tt][Mm][Ll]?, this option will cause the suffix .html to be appended to the local filename. This is useful, for instance, when you're mirroring a remote site that uses .asp pages, but you want the mirrored pages to be viewable on your stock Apache server. Another good use for this is when you're downloading CGI-generated materials. A URL like http://example.com/article.cgi?25 will be saved as article.cgi?25.html.
However, what you seem to be asking for, that Wget saves example.org/foo as /foo/index.html, is actually the default behaviour. If you're seeing some other output, you should post the complete output of Wget with the --debug switch.
I have the following problem. I need to mirror a password-protected site. It sounds like a simple task:
wget -m -k -K -E --cookies=on --keep-session-cookies --load-cookies=myCookies.txt http://mysite.com
In myCookies.txt I keep the proper session cookie. This works until wget comes across the logout page - then the session is invalidated and, effectively, further mirroring is useless.
I tried adding the --reject option, but it works only with file types - I can block only html file downloads or swf file downloads; I can't say
--reject http://mysite.com/*.php?type=Logout*
Any ideas how to skip certain URLs in wget? Maybe there is another tool that can do the job (it must work on MS Windows).
What if you first download (or even just touch) the logout page, and then
wget --no-clobber --your-original-arguments
This should skip the logout page, as it has already been downloaded.
(Disclaimer: I didn't try this myself)
I have also encountered this problem and later solved it like this: "--reject-regex logout" (see the wget-dev tips for more).
On GitHub, is there a way I can see the number of downloads for a repo?
Update 2019:
Ustin's answer points to:
API /repos/:owner/:repo/traffic/clones, to get the total number of clones and a breakdown per day or week, but only for the last 14 days.
API /repos/:owner/:repo/releases/:release_id, to get the number of downloads of your assets (files attached to the release) via the download_count field mentioned below, but, as commented, only for the most recent 30 releases.
Update 2017
You can still use the GitHub API to get the download count for your releases (which is not exactly what was asked).
See "Get a single release", the download_count field.
There is no longer a traffic screen mentioning the number of repo clones.
Instead, you have to rely on third-party services like:
GitItBack (at www.netguru.co/gititback), but even that does not include the number of clones.
githubstats0, mentioned below by Aveek Saha.
www.somsubhra.com/github-release-stats (web archive), mentioned below.
For instance, here is the number for the latest git for Windows release
Update August 2014
GitHub also proposes the number of clones for repo in its Traffic Graph:
See "Clone Graphs"
Update October 2013
As mentioned below by andyberry88, and as I detailed last July, GitHub now proposes releases (see its API), which include a download_count field.
Michele Milidoni, in his (upvoted) answer, does use that field in his python script.
(very small extract)
c.setopt(c.URL, 'https://api.github.com/repos/' + full_name + '/releases')
for p in myobj:
    if "assets" in p:
        for asset in p['assets']:
            print(asset['name'] + ": " + str(asset['download_count']) +
                  " downloads")
Original answer (December 2010)
I am not sure you can see that information (if it is recorded at all), because I don't see it in the GitHub Repository API:
$ curl http://github.com/api/v2/yaml/repos/show/schacon/grit
---
repository:
  :name: grit
  :owner: schacon
  :source: mojombo/grit # The original repo at top of the pyramid
  :parent: defunkt/grit # This repo's direct parent
  :description: Grit is a Ruby library for extracting information from a
    git repository in an object oriented manner - this fork tries to
    intergrate as much pure-ruby functionality as possible
  :forks: 4
  :watchers: 67
  :private: false
  :url: http://github.com/schacon/grit
  :fork: true
  :homepage: http://grit.rubyforge.org/
  :has_wiki: true
  :has_issues: false
  :has_downloads: true
You can only see if it has downloads or not.
Adam Jagosz reports in the comments:
I got it to work with
curl -H "Accept: application/vnd.github.v3+json" https://api.github.com/repos/:user/:repo/releases
A couple of things that I had wrong:
I needed an actual GitHub release (not just a git tag, even though GitHub does display those under releases, ugh).
And the release needs an asset file other than the automatically added zipped source in order to get the download count.
I have written a small web application in JavaScript that shows the number of downloads of all the assets in the available releases of any project on GitHub. You can try the application here: http://somsubhra.github.io/github-release-stats/
VISITOR count should be available under your dashboard > Traffic (or stats or insights).
GitHub has deprecated its download support and now supports 'Releases' - https://github.com/blog/1547-release-your-software. To create a release, either use the GitHub UI or create an annotated tag (http://git-scm.com/book/ch2-6.html) and add release notes to it in GitHub. You can then upload binaries, or 'assets', to each release.
Once you have some releases, the GitHub API supports getting information about them, and their assets.
curl -i \
https://api.github.com/repos/:owner/:repo/releases \
-H "Accept: application/vnd.github.manifold-preview+json"
Look for the 'download_count' entry. There's more info at http://developer.github.com/v3/repos/releases/. This part of the API is still in the preview period at the moment, so it may change.
Update Nov 2013:
GitHub's releases API is now out of the preview period so the 'Accept' header is no longer needed - http://developer.github.com/changes/2013-11-04-releases-api-is-official/
It won't do any harm to continue to add the 'Accept' header though.
I made a web app that shows GitHub release statistics in a clean format:
https://hanadigital.github.io/grev/
As mentioned, the GitHub API returns the download count of binary release files. I developed a little script to easily get the download count from the command line.
Formerly, there were two ways to download code from GitHub: clone (or download as a zip) a .git repo, or upload a file (for example, a binary) for later download.
When you download a repo (by cloning or downloading it as a zip), GitHub doesn't count the number of downloads, due to technical limitations. Cloning a repository is a read-only operation; no authentication is required, and the operation can be done via many protocols, including HTTPS, the same protocol the web page uses to show the repo in the browser. That makes it very difficult to count.
See: http://git-scm.com/book/en/Git-on-the-Server-The-Protocols
Recently, GitHub deprecated the download functionality, because they consider GitHub to be focused on building software, not on distributing binaries.
See: https://github.com/blog/1302-goodbye-uploads
To check the number of times a release file/package was downloaded, you can go to https://githubstats0.firebaseapp.com
It gives you a total download count and a breakdown of total downloads per release tag.
Very late, but here is the answer you want:
https://api.github.com/repos/[git username]/[git project]/releases
Next, find the id of the release you are looking for in the data. It should be near the top, next to the URLs. Then, navigate to
https://api.github.com/repos/[git username]/[git project]/releases/[id]/assets
The field named download_count is your answer.
EDIT: Capitalization matters in your username and project name.
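For instance, here is a rough Python sketch of that two-step lookup using the requests library (the owner and repo names are placeholders):

import requests

owner, repo = "git-username", "git-project"  # placeholders

# Step 1: list the releases and pick out a release id (near the top of each entry).
releases = requests.get(f"https://api.github.com/repos/{owner}/{repo}/releases").json()
release_id = releases[0]["id"]

# Step 2: list that release's assets and read each download_count field.
assets = requests.get(
    f"https://api.github.com/repos/{owner}/{repo}/releases/{release_id}/assets"
).json()
for asset in assets:
    print(asset["name"], asset["download_count"])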
The GitHub API no longer provides the needed information. Take a look at the releases page mentioned in Stan Towianski's answer. As we discussed in the comments to that answer, the GitHub API only reports the downloads of 1 of the three files he offers per release.
I have checked the solutions provided in some other answers to this question. VonC's answer presents the essential part of Michele Milidoni's solution. I installed his gdc script with the following result:
# ./gdc stant
mdcsvimporter.mxt: 37 downloads
mdcsvimporter.mxt: 80 downloads
How-to-use-mdcsvimporter-beta-16.zip: 12 downloads
As you can clearly see, gdc does not report the download count of the tar.gz and zip files.
If you want to check without installing anything, try the web page where Somsubhra has deployed the solution mentioned in his answer. Fill in 'stant' as the GitHub username and 'mdcsvimporter2015' as the repository name and you will see something like:
Download Info:
mdcsvimporter.mxt(0.20MB) - Downloaded 37 times.
Last updated on 2015-03-26
Alas, once again the report omits the downloads of the tar.gz and zip files. I have carefully examined the information that GitHub's API returns, and it is not provided anywhere. The download_count that the API does return is far from complete nowadays.
I ended up writing a scraper script to find my clone count:
#!/bin/sh
#
# This script requires:
# apt-get install html-xml-utils
# apt-get install jq
#
USERNAME=dougluce
PASSWORD="PASSWORD GOES HERE, BE CAREFUL!"
REPO="dougluce/node-autovivify"
TOKEN=`curl https://github.com/login -s -c /tmp/cookies.txt | \
hxnormalize | \
hxselect 'input[name=authenticity_token]' 2>/dev/null | \
perl -lne 'print $1 if /value=\"(\S+)\"/'`
curl -X POST https://github.com/session \
-s -b /tmp/cookies.txt -c /tmp/cookies2.txt \
--data-urlencode commit="Sign in" \
--data-urlencode authenticity_token="$TOKEN" \
--data-urlencode login="$USERNAME" \
--data-urlencode password="$PASSWORD" > /dev/null
curl "https://github.com/$REPO/graphs/clone-activity-data" \
-s -b /tmp/cookies2.txt \
-H "x-requested-with: XMLHttpRequest" | jq '.summary'
This'll grab the data from the same endpoint that GitHub's clone graph uses and spit out the totals from it. The data also includes per-day counts; replace .summary with just . to see those pretty-printed.
To try to make this more clear:
for this github project: stant/mdcsvimporter2015
https://github.com/stant/mdcsvimporter2015
with releases at
https://github.com/stant/mdcsvimporter2015/releases
go to the following URL over http or https (note the added "api." and "/repos"):
https://api.github.com/repos/stant/mdcsvimporter2015/releases
you will get JSON output in which you can search for "download_count":
"download_count": 2,
"created_at": "2015-02-24T18:20:06Z",
"updated_at": "2015-02-24T18:20:07Z",
"browser_download_url": "https://github.com/stant/mdcsvimporter2015/releases/download/v18/mdcsvimporter-beta-18.zip"
or on command line do:
wget --no-check-certificate https://api.github.com/repos/stant/mdcsvimporter2015/releases
Based on VonC's and Michele Milidoni's answers, I've created this bookmarklet which displays download statistics of GitHub-hosted released binaries.
Note: because of the way browsers implement Content Security Policy, bookmarklets can temporarily violate some CSP directives and may not function properly when running on GitHub while CSP is enabled.
Though it's highly discouraged, you can disable CSP in Firefox as a temporary workaround. Open up about:config and set security.csp.enable to false.
I have created three solutions to fetch the download count and other statistics for GitHub releases. Each of these implementations is able to accumulate the GitHub API's paginated results, which means that calculating the total number of downloads won't be an issue.
Web Application
https://qwertycube.com/github-release-stats/
Available as a PWA
Supports the GitHub API pagination
Node.js Implementation
https://github.com/kefir500/github-release-stats
Available via NPM
Written in TypeScript, compiled to JavaScript
Can be used as a command-line tool
Can be used as a Node.js module
Can be used in a browser environment
Supports the GitHub API pagination
Python Implementation
https://github.com/kefir500/ghstats
Available via PyPI
Can be used as a command-line tool
Can be used as a Python module
Supports the GitHub API pagination
New implementation:
A port to a GitHub composite action, to reuse the workflow code base.
https://github.com/andry81-devops/github-accum-stats
With additional features:
Can count traffic clones and/or views.
Can use GitHub composite action to reuse workflow code base: https://docs.github.com/en/actions/creating-actions/creating-a-composite-action
GitHub workflow file example:
.github/workflows/accum-gh-clone-stats.yml
Previous implementation (marked as obsolete):
This implementation is based on GitHub Actions plus statistics accumulation into a separate repository: https://github.com/andry81-devops/github-clone-count-badge
Based on: https://github.com/MShawon/github-clone-count-badge
With some advantages:
The repository being tracked and the repository storing the traffic statistics are different, and you can browse the statistics directly as a commit list: https://github.com/{{REPO_OWNER}}/{{REPO}}--gh-stats/commits/master/traffic/clones
The workflow uses the accum-traffic-clones.sh bash script to accumulate traffic clones.
The script accumulates statistics both into a single file and into a set of files grouped by year and allocated per day: traffic/clones/by_year/YYYY/YYYY-MM-DD.json
GitHub workflow file example:
.github/workflows/myrepo-gh-clone-stats.yml
As already stated, you can get information about your Releases via the API.
For those using WordPress, I developed this plugin: GitHub Release Downloads. It allows you to get the download count, links and more information for releases of GitHub repositories.
To address the original question, the shortcode [grd_count user="User" repo="MyRepo"] will return the number of downloads for a repository. This number corresponds to the sum of all download count values of all releases for one GitHub repository.
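This is not the plugin's code, but a rough Python equivalent of what the shortcode computes, summing download_count over every asset of every release (the user and repo names are the placeholders from the shortcode, and pagination beyond 100 releases is omitted):

import requests

def total_downloads(user, repo):
    releases = requests.get(
        f"https://api.github.com/repos/{user}/{repo}/releases",
        params={"per_page": 100},
    ).json()
    # Sum the download_count of every asset of every release.
    return sum(asset["download_count"]
               for release in releases
               for asset in release.get("assets", []))

print(total_downloads("User", "MyRepo"))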
Answer from 2019:
For the number of clones, you can use https://developer.github.com/v3/repos/traffic/#clones (but be aware that it returns counts only for the last 14 days); see the sketch after this list.
To get the download count of your assets (files attached to a release), you can use https://developer.github.com/v3/repos/releases/#get-a-single-release (specifically, the "download_count" property of the items in the assets list of the response).
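Here is a minimal Python sketch of the clones call (OWNER, REPO, and TOKEN are placeholders; the token needs push access to the repository):

import requests

OWNER, REPO = "octocat", "Hello-World"  # placeholders
TOKEN = "your-personal-access-token"    # must have push access to the repo

data = requests.get(
    f"https://api.github.com/repos/{OWNER}/{REPO}/traffic/clones",
    headers={"Authorization": f"token {TOKEN}",
             "Accept": "application/vnd.github.v3+json"},
).json()
print(data["count"], "clones,", data["uniques"], "unique cloners (last 14 days)")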
There's a nice Chrome extension that does exactly what you want:
GitHub Release Downloads
11 years later...
Here's a small Python 3 snippet to retrieve the download counts of the last 100 release assets:
import requests
owner = "twbs"
repo = "bootstrap"
h = {"Accept": "application/vnd.github.v3+json"}
u = f"https://api.github.com/repos/{owner}/{repo}/releases?per_page=100"
r = requests.get(u, headers=h).json()
r.reverse() # older tags first
for rel in r:
    if rel['assets']:
        tag = rel['tag_name']
        dls = rel['assets'][0]['download_count']
        pub = rel['published_at']
        print(f"Pub: {pub} | Tag: {tag} | Dls: {dls}")
Pub: 2013-07-18T00:03:17Z | Tag: v1.2.0 | Dls: 1193
Pub: 2013-08-19T21:20:59Z | Tag: v3.0.0 | Dls: 387786
Pub: 2013-10-30T17:07:16Z | Tag: v3.0.1 | Dls: 102278
Pub: 2013-11-06T21:58:55Z | Tag: v3.0.2 | Dls: 381136
...
Pub: 2020-12-07T16:24:37Z | Tag: v5.0.0-beta1 | Dls: 93943
Here is a Python solution using the PyGithub package (pip install PyGithub):
from github import Github

g = Github("your oauth key")  # create a token from the settings page

for repo in g.get_user().get_repos():
    if repo.name == "yourreponame":
        releases = repo.get_releases()
        for i in releases:
            if i.tag_name == "yourtagname":
                for j in i.get_assets():
                    print("{} date: {} download count: {}".format(j.name, j.updated_at, j.download_count))