Nexus 3: get more than 9999 artifacts via REST

I'm querying all artifacts in a huge Nexus raw repository (more than 200,000 artifacts) using the Nexus REST API: curl -u username:pw -X GET 'https://xxx.xxx/service/rest/v1/search?repository=xxx&group=xxx'
It works well with small directories, but after the 9999th artifact, I get the error:
nested: QueryPhaseExecutionException[Result window is too large, from + size must be less than or equal to: [10000] but was [10050]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level parameter.]; }
I won't be able to change the limit, so is there any way I can get all the artifacts?
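One possible workaround: the assets listing endpoint (/service/rest/v1/assets) pages with a continuationToken and, as far as I know, is not served by the Elasticsearch search index, so it may not hit the result-window cap. A minimal Python sketch, where the base URL, repository name, credentials, and the client-side group filter are placeholders:
import requests

base = "https://xxx.xxx/service/rest/v1/assets"   # assets listing, not the search index
auth = ("username", "pw")
params = {"repository": "xxx"}

items = []
while True:
    resp = requests.get(base, params=params, auth=auth)
    resp.raise_for_status()
    data = resp.json()
    items.extend(data["items"])
    token = data.get("continuationToken")
    if not token:
        break                      # no more pages
    params["continuationToken"] = token

# Optional client-side filter, since this endpoint has no group parameter;
# for a raw repository the group corresponds to the directory part of the path.
# items = [i for i in items if i["path"].startswith("xxx/")]
print(f"Fetched {len(items)} assets")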

Related

Kubernetes object size limitations

I am dealing with CRDs and creating custom resources. I need to keep lots of information about my application in the custom resource. As per the official docs, etcd works with requests up to 1.5MB. I am hitting errors like:
"error": "Request entity too large: limit is 3145728"
I believe the limit specified in the error is 3MB. Any thoughts on this? Is there any way around this problem?
The "error": "Request entity too large: limit is 3145728" is probably the default response from the Kubernetes handler for objects larger than 3MB, as you can see here, at L305 of the source code:
expectedMsgFor1MB := `etcdserver: request is too large`
expectedMsgFor2MB := `rpc error: code = ResourceExhausted desc = trying to send message larger than max`
expectedMsgFor3MB := `Request entity too large: limit is 3145728`
expectedMsgForLargeAnnotation := `metadata.annotations: Too long: must have at most 262144 bytes`
etcd does indeed have a 1.5MB limit for processing a file, and you will find in the etcd documentation a suggestion to try the --max-request-bytes flag, but it would have no effect on a GKE cluster because you don't have that permission on the master node.
But even if you did, it would not be ideal, because this error usually means that you are embedding the objects instead of referencing them, which would degrade your performance (a minimal sketch of the referencing approach follows the list below).
I highly recommend that you consider instead these options:
Determine whether your object includes references that aren't used;
Break up your resource;
Consider a volume mount instead;
There's a request for a new API resource, File (or BinaryData), that could apply to your case. It's very fresh, but it's good to keep an eye on.
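If the bulky data can live outside the custom resource, here is a minimal sketch of the referencing idea using the official Kubernetes Python client; the ConfigMap name, namespace, and key are hypothetical, and note that ConfigMaps are subject to the same etcd limits, so truly large payloads belong in a volume or object store:
from kubernetes import client, config

config.load_kube_config()            # or config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

# The custom resource would only carry a reference, e.g. spec.payloadRef: my-app-payload
payload_ref = "my-app-payload"       # hypothetical ConfigMap name stored in the CR
namespace = "default"

cm = v1.read_namespaced_config_map(payload_ref, namespace)
payload = cm.data["payload.json"]    # hypothetical key; the CR itself stays small
print(f"Loaded {len(payload)} bytes without inflating the custom resource")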
If you still need help let me know.
This happened to me when I put some large files in my Helm chart directory. Removing those files helped me resolve my issue.
Check the size of the files in the directory that contains the templates and values.yaml of your chart release (the directory name is usually the same as the chart's name).
du <directory-path> --max-depth=1
# if you want it to be more readable add -h switch
du -h <directory-path> --max-depth=1
Make sure you do not have any irrelevant files there if the total size exceeds 3145728 bytes. (source)
If you are using Helm, check whether you have large files such as log files, and add a .helmignore file:
.DS_Store
# Common VCS dirs
.gitignore
.bzr/
.bzrignore
.hg/
.hgignore
.svn/
# Common backup files
*.swp
*.bak
*.tmp
*~
# Various IDEs
.project
.idea/
*.tmproj
.vscode/
*.log

How to compress a list of files into a single gzip file using elasticluster, grid-engine-tools, and Google Cloud

I want to start by thanking you all for your help ahead of time, as this will help clear up a detail left out of the readthedocs.io guide. What I need is to compress several files into a single gzip file; however, the guide only shows how to compress a list of files as individual gzipped files. Again, I appreciate any help, as there are very few resources and little documentation for this setup. (If there is some extra info, please include links to sources.)
After I had set up the grid engine, I ran through the samples in the guide.
Am I right in assuming there is not a script for combining multiple files into one gzip using grid-computing-tools?
Are there any solutions on the Elasticluster Grid Engine setup to compress multiple files into 1 gzip?
What changes can be made to the grid-engine-tools to make it work?
EDIT
The reason we are considering a cluster is that we expect multiple operations to occur simultaneously (files zipped up per order), happening systematically, so that a vendor can download a single compressed file per order.
Let me state the definition of the problem, and you can let me know if I understood it correctly, as both Matt and I provided the exact same solution and somehow it doesn't seem sufficient.
Problem Definition
You have an Order defining the start of a task to process some data.
The processing of data would be split among several compute nodes, each producing a resulting file stored on GS directories.
The goal is:
Collect the files from GS bucket (that were produced by each of the nodes),
Archive the collection of files as one file,
Then compress that archive, and
Push it back to a different GS location.
Let me know if I summarized it properly,
Thanks,
Paul
Are the files in question in Cloud Storage?
Are the files in question on a local or network drive?
In your description, you indicate "What I need is to compress several files into a single gzip". It isn't clear to me that a cluster of computers is needed for this. It sounds more like you just want to use tar along with gzip.
The tar utility will create an archive file and can compress it as well. For example:
$ # Create a directory with a few input files
$ mkdir myfiles
$ echo "This is file1" > myfiles/file1.txt
$ echo "This is file2" > myfiles/file2.txt
$ # (C)reate a compressed archive
$ tar cvfz archive.tgz myfiles/*
a myfiles/file1.txt
a myfiles/file2.txt
$ # (V)erify the archive
$ tar tvfz archive.tgz
-rw-r--r-- 0 myuser mygroup 14 Jul 20 15:19 myfiles/file1.txt
-rw-r--r-- 0 myuser mygroup 14 Jul 20 15:19 myfiles/file2.txt
To extract the contents use:
$ # E(x)tract the archive contents
$ tar xvfz archive.tgz
x myfiles/file1.txt
x myfiles/file2.txt
UPDATE:
In your updated problem description, you have indicated that you may have multiple orders processed simultaneously. If the frequency in which results need to be tar-ed is low, and providing the tar-ed results is not extremely time-sensitive, then you could likely do this with a single node.
However, as the scale of the problem ramps up, you might take a look at using the Pipelines API.
Rather than keeping a fixed cluster running, you could initiate a "pipeline" (in this case a single task) when a customer's order completes.
A call to the Pipelines API would start a VM whose sole purpose is to download the customer's files, tar them up, and push the resulting tar file into Cloud Storage. The Pipelines API infrastructure does the copying from and to Cloud Storage for you. You would effectively just need to supply the tar command line.
There is an example that does something similar here:
https://github.com/googlegenomics/pipelines-api-examples/tree/master/compress
This example will download a list of files and compress each of them independently. It could be easily modified to tar the list of input files.
Take a look at the https://github.com/googlegenomics/pipelines-api-examples github repository for more information and examples.
-Matt
So there are many ways to do it, but the thing is that you cannot directly compress a collection of files (or a directory) into one file on Google Storage; you would need to perform the tar/gzip combination locally before transferring it.
If you want you can have the data compressed automatically via:
gsutil cp -Z
Which is detailed at the following link:
https://cloud.google.com/storage/docs/gsutil/commands/cp#changing-temp-directories
And the nice thing is that you retrieve uncompressed results from compressed data on Google Storage, because it has the ability to perform Decompressive Transcoding:
https://cloud.google.com/storage/docs/transcoding#decompressive_transcoding
You will notice on the last line in the following script:
https://github.com/googlegenomics/grid-computing-tools/blob/master/src/compress/do_compress.sh
The following line will basically copy the current compressed file to Google Cloud Storage:
gcs_util::upload "${WS_OUT_DIR}/*" "${OUTPUT_PATH}/"
What you will need to do is first perform the tar/gzip on the files in the local scratch directory, and then gsutil copy the compressed file over to Google Storage, making sure that all the files that need to be compressed are in the scratch directory before starting to compress them. Most likely you would need to SSH copy (scp) them to one of the nodes (i.e. the master), and then have the master tar/gzip the whole directory before sending it over to Google Storage. I am assuming each GCE instance has its own scratch disk, but the "gsutil cp" transfer is very fast when working on GCE.
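As a rough illustration of that first option (this is not part of the grid-computing-tools scripts; the paths and bucket name below are made up), the tar-then-upload step could be as small as:
import subprocess
import tarfile

scratch_dir = "/mnt/scratch/order-1234"          # where the per-node results were gathered
archive_path = "/mnt/scratch/order-1234.tar.gz"  # the single compressed archive
gcs_destination = "gs://my-output-bucket/orders/"

# Create one gzip-compressed tar archive from the scratch directory.
with tarfile.open(archive_path, "w:gz") as tar:
    tar.add(scratch_dir, arcname="order-1234")

# Copy the archive to Google Cloud Storage; gsutil must be installed and authenticated.
subprocess.check_call(["gsutil", "cp", archive_path, gcs_destination])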
Since Google Storage is fast at data transfers with Google Compute instances, the easiest second option to pursue is to comment out lines 66-69 in the do_compress.sh file:
https://github.com/googlegenomics/grid-computing-tools/blob/master/src/compress/do_compress.sh
This way no compression happens, but the copy still happens on the last line via gcs_util::upload, so that all the uncompressed files are transferred to the same Google Storage bucket. Then, using "gsutil cp" from the master node, you would copy them back locally in order to compress them via tar/gz, and then copy the compressed archive back to the bucket using "gsutil cp".
Hope it helps but it's tricky,
Paul

Github repository statistics

Is there a tool available that generates statistics about a GitHub repository? By statistics I mean analysis of commits and issues related to the code.
For example:
which file has the most changes (commits)
which file is most related with bugs (with a specific label)
...
You can find some statistics under the Pulse and Graphs tabs in a GitHub repository. The Graphs tab includes a Commits chart. Anyone with push access to a repository can view its traffic, including full clones (not fetches), visitors from the past 14 days, referring sites, and popular content in the Traffic graph.
If you want to drill down into commit volume on a particular file, you can use the commits API endpoint. Here is an example call:
curl -u <USER>:<API_KEY> https://api.github.com/repos/<OWNER>/<REPO>/commits
You have the option to pass in a path and since/until dates to narrow your results.
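For example, here is a small Python sketch that counts the commits touching a single path in a date range; the owner, repo, path, dates, and token are placeholders, while the endpoint and parameters are the documented ones:
import requests

owner, repo = "OWNER", "REPO"
headers = {"Accept": "application/vnd.github.v3+json",
           "Authorization": "token <API_KEY>"}
params = {
    "path": "src/main.py",            # only commits touching this path
    "since": "2023-01-01T00:00:00Z",  # ISO 8601 timestamps
    "until": "2023-12-31T23:59:59Z",
    "per_page": 100,
}

count, page = 0, 1
while True:
    params["page"] = page
    commits = requests.get(f"https://api.github.com/repos/{owner}/{repo}/commits",
                           params=params, headers=headers).json()
    if not isinstance(commits, list) or not commits:
        break                         # no more pages (or an error object)
    count += len(commits)
    page += 1

print(f"{count} commits touched {params['path']} in the given range")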
The herdstat tool can be used to generate the contribution graphs known from GitHub user profiles, but for individual repositories or aggregated across multiple repositories, e.g., all repositories in a GitHub organization.
The tool is packaged as a Docker image and can be used as follows:
docker run --name herdstat-dev -it herdstat/herdstat:v0.4.0 \
/herdstat -r herdstat contribution-graph -u 2022-12-31
docker cp $(docker ps -aqf "name=herdstat-dev"):/contribution-graph.svg .
The graph is generated for a 52-week time frame ending with the date specified via the -u / --until flag. The repositories to be analyzed are given using the -r / --repositories flag. It can be set either to owners like herdstat or to repositories like herdstat/herdstat. In the first case, all public repositories owned by herdstat are analyzed.
Depending on your use case, you can use the herdstat GitHub Action to perform the analysis in an automated fashion, e.g., nightly or as part of your CI/CD pipeline, and embed the resulting graph into your README file using an HTML image tag.
The tool is still in its early stages, but it generates informative and nice looking contribution charts like this one.
Disclaimer: I am the author of herdstat.

Counting and analyzing commits in Github organization (not repo)

I'd like to count the commits of 2012 in http://github.com/plone and http://github.com/collective
Are there any tools to do this - provide statistics for Github organizations?
Do I need to write my own script to scrape the repositories, check them out individually, and count commits?
Here is how I'd do it:
use the GitHub API to enumerate the repositories (see the JSON for Plone for an example). Loop over the JSON result and, for each repository:
Check out the repository (the git_url URL) with git clone --bare; only the git info, no working copy. This creates a <repository_name>.git directory, say plone.event.git if you cloned git://github.com/plone/plone.event.git.
Count the revisions with git --git-dir=<git_directory> rev-list HEAD --count; this outputs the count to stdout, so subprocess.check_output() should do the job just fine.
Remove the .git directory again
That only requires 2 API calls, so you avoid being rate limited; paging through all the commits with the API would require far too many requests to count all the repository commits, and checking out a bare repository copy is faster anyway (see the sketch below).
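A rough Python sketch of that recipe; the 2012 date filter (rev-list --since/--until), the error guard for empty repositories, and the page size are my own additions and assumptions:
import shutil
import subprocess
import requests

org = "plone"
repos = requests.get(f"https://api.github.com/orgs/{org}/repos",
                     params={"per_page": 100}).json()   # >100 repos would need paging

total = 0
for repo in repos:
    clone_dir = repo["name"] + ".git"
    subprocess.check_call(["git", "clone", "--bare", "--quiet",
                           repo["clone_url"], clone_dir])
    try:
        count = subprocess.check_output(
            ["git", "--git-dir", clone_dir, "rev-list", "--count",
             "--since=2012-01-01", "--until=2012-12-31", "HEAD"])
        total += int(count.strip())
    except subprocess.CalledProcessError:
        pass                                             # e.g. an empty repository
    shutil.rmtree(clone_dir)                             # remove the bare clone again

print(f"{org}: {total} commits in 2012")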
The herdstat tool can be used to generate the contribution graphs known from GitHub user profiles, but for individual repositories or aggregated across multiple repositories, e.g., all repositories in a GitHub organization.
The tool is packaged as a Docker image and can be used as follows:
docker run --name herdstat-dev -it herdstat/herdstat:v0.4.0 \
/herdstat -r plone contribution-graph -u 2012-12-31
docker cp $(docker ps -aqf "name=herdstat-dev"):/contribution-graph.svg .
The graph is generated for a 52 week time frame ending with the date specified via the -u / --until flag.
The second command copies the generated graph from the container to the current directory.
The tool is still in its early stages, but it generates informative and nice-looking contribution charts like this one, which also include the number of contributions in the respective time frame.
Disclaimer: I am the author of herdstat.

Github: Can I see the number of downloads for a repo?

In Github, is there a way I can see the number of downloads for a repo?
Update 2019:
Ustin's answer points to:
API /repos/:owner/:repo/traffic/clones, to get the total number of clones and a breakdown per day or week, but only for the last 14 days.
API /repos/:owner/:repo/releases/:release_id, for getting the download count of your assets (files attached to the release) via the download_count field mentioned below, but, as commented, only for the most recent 30 releases.
Update 2017
You can still use the GitHub API to get the download count for your releases (which is not exactly what was asked).
See "Get a single release", the download_count field.
There is no longer a traffic screen mentioning the number of repo clones.
Instead, you have to rely on third-party services like:
GitItBack (at www.netguru.co/gititback), but even that does not include the number of clones.
githubstats0, mentioned below by Aveek Saha.
www.somsubhra.com/github-release-stats (web archive), mentioned below.
For instance, here is the number for the latest git for Windows release
Update August 2014
GitHub also shows the number of clones for a repo in its Traffic graph:
See "Clone Graphs"
Update October 2013
As mentioned below by andyberry88, and as I detailed last July, GitHub now offers releases (see its API), which have a download_count field.
Michele Milidoni, in his (upvoted) answer, does use that field in his python script.
(very small extract)
c.setopt(c.URL, 'https://api.github.com/repos/' + full_name + '/releases')
for p in myobj:
    if "assets" in p:
        for asset in p['assets']:
            print(asset['name'] + ": " + str(asset['download_count']) +
                  " downloads")
Original answer (December 2010)
I am not sure you can see that information (if it is recorded at all), because I don't see it in the GitHub Repository API:
$ curl http://github.com/api/v2/yaml/repos/show/schacon/grit
---
repository:
:name: grit
:owner: schacon
:source: mojombo/grit # The original repo at top of the pyramid
:parent: defunkt/grit # This repo's direct parent
:description: Grit is a Ruby library for extracting information from a
git repository in an object oriented manner - this fork tries to
intergrate as much pure-ruby functionality as possible
:forks: 4
:watchers: 67
:private: false
:url: http://github.com/schacon/grit
:fork: true
:homepage: http://grit.rubyforge.org/
:has_wiki: true
:has_issues: false
:has_downloads: true
You can only see if it has downloads or not.
Adam Jagosz reports in the comments:
I got it to work with
curl -H "Accept: application/vnd.github.v3+json" https://api.github.com/repos/:user/:repo/releases
A couple of things that I had wrong:
I needed an actual GitHub release (not just a git tag, even though GitHub does display those under releases, ugh).
And the release needs an asset file other than the automatically added zipped source in order to get a download count.
I have written a small web application in JavaScript that shows the number of downloads of all the assets in the available releases of any project on GitHub. You can try the application here: http://somsubhra.github.io/github-release-stats/
The visitor count should be available under your dashboard > Traffic (or Stats or Insights):
GitHub has deprecated download support and now supports 'Releases': https://github.com/blog/1547-release-your-software. To create a release, either use the GitHub UI or create an annotated tag (http://git-scm.com/book/ch2-6.html) and add release notes to it in GitHub. You can then upload binaries, or 'assets', to each release.
Once you have some releases, the GitHub API supports getting information about them, and their assets.
curl -i \
https://api.github.com/repos/:owner/:repo/releases \
-H "Accept: application/vnd.github.manifold-preview+json"
Look for the 'download_count' entry. There's more info at http://developer.github.com/v3/repos/releases/. This part of the API is still in the preview period at the moment, so it may change.
Update Nov 2013:
GitHub's releases API is now out of the preview period so the 'Accept' header is no longer needed - http://developer.github.com/changes/2013-11-04-releases-api-is-official/
It won't do any harm to continue to add the 'Accept' header though.
I had made a web app that shows GitHub release statistics in a clean format:
https://hanadigital.github.io/grev/
As mentioned, the GitHub API returns the download count of binary release files. I developed a little script to easily get the download count from the command line.
Formerly, there were two methods of downloading code from GitHub: clone or download a .git repo as a zip, or upload a file (for example, a binary) for later download.
When you download a repo (clone or download as zip), GitHub doesn't count the number of downloads due to technical limitations. Cloning a repository is a read-only operation and requires no authentication. It can be done via many protocols, including HTTPS, the same protocol the web page uses to show the repo in the browser. It's very difficult to count.
See: http://git-scm.com/book/en/Git-on-the-Server-The-Protocols
More recently, GitHub deprecated the download functionality, because they see GitHub as focused on building software, not on distributing binaries.
See: https://github.com/blog/1302-goodbye-uploads
To check the number of times a release file/package was downloaded you can go to https://githubstats0.firebaseapp.com
It gives you a total download count and a breakdown of total downloads per release tag.
Very late, but here is the answer you want:
https://api.github.com/repos/[git username]/[git project]/releases
Next, find the id of the release you are looking for in the data. It should be near the top, next to the URLs. Then navigate to
https://api.github.com/repos/[git username]/[git project]/releases/[id]/assets
The field named download_count is your answer.
EDIT: Capitalization matters in your username and project name.
The GitHub API does not provide the needed information anymore. Take a look at the releases page mentioned in Stan Towianski's answer. As we discussed in the comments to that answer, the GitHub API only reports the downloads of one of the three files he offers per release.
I have checked the solutions provided in some other answers to this question. VonC's answer presents the essential part of Michele Milidoni's solution. I installed his gdc script with the following result:
# ./gdc stant
mdcsvimporter.mxt: 37 downloads
mdcsvimporter.mxt: 80 downloads
How-to-use-mdcsvimporter-beta-16.zip: 12 downloads
As you can clearly see, gdc does not report the download count of the tar.gz and zip files.
If you want to check without installing anything, try the web page where Somsubhra has deployed the solution mentioned in his answer. Fill in 'stant' as the GitHub username and 'mdcsvimporter2015' as the repository name and you will see something like:
Download Info:
mdcsvimporter.mxt(0.20MB) - Downloaded 37 times.
Last updated on 2015-03-26
Alas, once again only a report without the downloads of the tar.gz and zip files. I have carefully examined the information that Github's API returns, but it is not provided anywhere. The download_count that the API does return is far from complete nowadays.
I ended up writing a scraper script to find my clone count:
#!/bin/sh
#
# This script requires:
# apt-get install html-xml-utils
# apt-get install jq
#
USERNAME=dougluce
PASSWORD="PASSWORD GOES HERE, BE CAREFUL!"
REPO="dougluce/node-autovivify"
TOKEN=`curl https://github.com/login -s -c /tmp/cookies.txt | \
hxnormalize | \
hxselect 'input[name=authenticity_token]' 2>/dev/null | \
perl -lne 'print $1 if /value=\"(\S+)\"/'`
curl -X POST https://github.com/session \
-s -b /tmp/cookies.txt -c /tmp/cookies2.txt \
--data-urlencode commit="Sign in" \
--data-urlencode authenticity_token="$TOKEN" \
--data-urlencode login="$USERNAME" \
--data-urlencode password="$PASSWORD" > /dev/null
curl "https://github.com/$REPO/graphs/clone-activity-data" \
-s -b /tmp/cookies2.txt \
-H "x-requested-with: XMLHttpRequest" | jq '.summary'
This'll grab the data from the same endpoint that GitHub's clone graph uses and spit out the totals from it. The data also includes per-day counts; replace .summary with just . to see those pretty-printed.
To try to make this more clear:
for this github project: stant/mdcsvimporter2015
https://github.com/stant/mdcsvimporter2015
with releases at
https://github.com/stant/mdcsvimporter2015/releases
go to the following URL over http or https (note the added "api." and "/repos"):
https://api.github.com/repos/stant/mdcsvimporter2015/releases
you will get this json output and you can search for "download_count":
"download_count": 2,
"created_at": "2015-02-24T18:20:06Z",
"updated_at": "2015-02-24T18:20:07Z",
"browser_download_url": "https://github.com/stant/mdcsvimporter2015/releases/download/v18/mdcsvimporter-beta-18.zip"
or on command line do:
wget --no-check-certificate https://api.github.com/repos/stant/mdcsvimporter2015/releases
Based on VonC and Michele Milidoni answers I've created this bookmarklet which displays downloads statistics of github hosted released binaries.
Note: Because of browser Content Security Policy enforcement, bookmarklets can violate some CSP directives and may not function properly when run on GitHub while CSP is enabled.
Though it's highly discouraged, you can disable CSP in Firefox as a temporary workaround. Open up about:config and set security.csp.enable to false.
I have created three solutions to fetch the download count and other statistics for GitHub releases. Each of these implementations is able to accumulate the GitHub API pagination results, which means that calculating the total number of downloads won't be an issue (a minimal sketch of this kind of pagination follows the lists below).
Web Application
https://qwertycube.com/github-release-stats/
Available as a PWA
Supports the GitHub API pagination
Node.js Implementation
https://github.com/kefir500/github-release-stats
Available via NPM
Written in TypeScript, compiled to JavaScript
Can be used as a command-line tool
Can be used as a Node.js module
Can be used in a browser environment
Supports the GitHub API pagination
Python Implementation
https://github.com/kefir500/ghstats
Available via PyPI
Can be used as a command-line tool
Can be used as a Python module
Supports the GitHub API pagination
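For reference, a minimal Python sketch of the kind of pagination these tools perform, summing download_count over every page of the releases endpoint; owner and repo are placeholders, and unauthenticated calls are subject to rate limits:
import requests

owner, repo = "twbs", "bootstrap"
url = f"https://api.github.com/repos/{owner}/{repo}/releases"
headers = {"Accept": "application/vnd.github.v3+json"}

total = 0
page = 1
while True:
    releases = requests.get(url, params={"per_page": 100, "page": page},
                            headers=headers).json()
    if not isinstance(releases, list) or not releases:
        break                        # no more pages (or an API error object)
    for release in releases:
        total += sum(asset["download_count"] for asset in release["assets"])
    page += 1

print(f"Total downloads across all release assets: {total}")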
New implementation:
Ported into a GitHub composite action to reuse the workflow code base.
https://github.com/andry81-devops/github-accum-stats
With additional features:
Can count traffic clones and/or views.
Can use a GitHub composite action to reuse the workflow code base: https://docs.github.com/en/actions/creating-actions/creating-a-composite-action
GitHub workflow file example:
.github/workflows/accum-gh-clone-stats.yml
Previous implementation (marked as obsolete):
This implementation is based on GitHub Actions plus statistics accumulation into a separate repository: https://github.com/andry81-devops/github-clone-count-badge
based on: https://github.com/MShawon/github-clone-count-badge
With some advantages:
The repository being tracked and the repository storing the traffic statistics are different, and you can browse the statistics directly as a commit list: https://github.com/{{REPO_OWNER}}/{{REPO}}--gh-stats/commits/master/traffic/clones
The workflow uses the accum-traffic-clones.sh bash script to accumulate traffic clones.
The script accumulates statistics both into a single file and into a set of files grouped by year and allocated per day: traffic/clones/by_year/YYYY/YYYY-MM-DD.json
GitHub workflow file example:
.github/workflows/myrepo-gh-clone-stats.yml
As already stated, you can get information about your Releases via the API.
For those using WordPress, I developed this plugin: GitHub Release Downloads. It allows you to get the download count, links and more information for releases of GitHub repositories.
To address the original question, the shortcode [grd_count user="User" repo="MyRepo"] will return the number of downloads for a repository. This number corresponds to the sum of all download count values of all releases for one GitHub repository.
Answer from 2019:
For the number of clones, you can use https://developer.github.com/v3/repos/traffic/#clones (but be aware that it returns the count only for the last 14 days).
To get the download count of your assets (files attached to a release), you can use https://developer.github.com/v3/repos/releases/#get-a-single-release (specifically the "download_count" property of the items in the assets list of the response).
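A small Python sketch of the clones call; it needs a token with push access to the repository, and owner, repo, and token are placeholders:
import requests

owner, repo = "OWNER", "REPO"
headers = {"Accept": "application/vnd.github.v3+json",
           "Authorization": "token <PERSONAL_ACCESS_TOKEN>"}

clones = requests.get(f"https://api.github.com/repos/{owner}/{repo}/traffic/clones",
                      headers=headers).json()
print(clones["count"], "clones from", clones["uniques"], "unique cloners in the last 14 days")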
There's a nice Chrome extension that does exactly what you want:
GitHub Release Downloads
11 years later...
Here's a small python3 snippet to retrieve the download count of the last 100 release assets:
import requests
owner = "twbs"
repo = "bootstrap"
h = {"Accept": "application/vnd.github.v3+json"}
u = f"https://api.github.com/repos/{owner}/{repo}/releases?per_page=100"
r = requests.get(u, headers=h).json()
r.reverse() # older tags first
for rel in r:
    if rel['assets']:
        tag = rel['tag_name']
        dls = rel['assets'][0]['download_count']
        pub = rel['published_at']
        print(f"Pub: {pub} | Tag: {tag} | Dls: {dls}")
Pub: 2013-07-18T00:03:17Z | Tag: v1.2.0 | Dls: 1193
Pub: 2013-08-19T21:20:59Z | Tag: v3.0.0 | Dls: 387786
Pub: 2013-10-30T17:07:16Z | Tag: v3.0.1 | Dls: 102278
Pub: 2013-11-06T21:58:55Z | Tag: v3.0.2 | Dls: 381136
...
Pub: 2020-12-07T16:24:37Z | Tag: v5.0.0-beta1 | Dls: 93943
Demo
Here is a Python solution using the PyGithub package (pip install PyGithub):
from github import Github

g = Github("youroauth key")  # create a token from the settings page

for repo in g.get_user().get_repos():
    if repo.name == "yourreponame":
        releases = repo.get_releases()
        for release in releases:
            if release.tag_name == "yourtagname":
                for asset in release.get_assets():
                    print("{} date: {} download count: {}".format(
                        asset.name, asset.updated_at, asset.download_count))