How to measure language popularity via GitHub Archive data?

I'm attempting to measure programming language popularity via:
The number of stars on repos in combination with...
The programming languages used in the repo and...
The total bytes of code in each language (recognizing that some languages are more/less verbose)
Conveniently, there is a massive trove of GitHub data provided by GitHub Archive and hosted on BigQuery. The only problem is that I don't see "language" available in any of the payloads for the various event types in GitHub Archive.
Here's the BigQuery query I've been running to find if, and where, language may be populated in the GitHub Archive data:
SELECT *
FROM [githubarchive:month.201612]
WHERE JSON_EXTRACT(payload, "$.repository.language") IS NULL
LIMIT 100
Can someone please provide insight into whether I'll be able to utilize GitHub Archive data in this way, and how I can go about doing so? Or will I need to pursue some other approach? I see that there is also a github_repos public dataset on BigQuery, and it does have some language metrics, but those metrics seem to cover all time. I'd prefer to get some sort of monthly metric eventually (i.e., of the "active" repos in a given month, what were the most popular languages?).
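For reference, this is the sort of all-time query the github_repos dataset supports (an untested sketch, assuming its language column is a repeated record of name/bytes pairs):
#standardSQL
SELECT l.name AS language, SUM(l.bytes) AS bytes
FROM `bigquery-public-data.github_repos.languages`, UNNEST(language) AS l
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10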
Any advice is appreciated!

You can do this with BigQuery, combining GitHub Archive and GHTorrent.
To get the top languages by pull requests last December (copied from http://mads-hartmann.com/2015/02/05/github-archive.html):
SELECT COUNT(*) c, JSON_EXTRACT_SCALAR(payload, '$.pull_request.base.repo.language') lang
FROM [githubarchive:month.201612]
WHERE JSON_EXTRACT_SCALAR(payload, '$.pull_request.base.repo.language') IS NOT NULL
GROUP BY 2
ORDER BY 1 DESC
LIMIT 10
To count the stars each project gained during that month (a star shows up as a WatchEvent):
SELECT COUNT(*) c, repo.name
FROM [githubarchive:month.201612]
WHERE type='WatchEvent'
GROUP BY 2
ORDER BY 1 DESC
LIMIT 10
For a quick language vs bytes view, you can use GHTorrent:
SELECT language, SUM(bytes) bytes
FROM [ghtorrent-bq:ght.project_languages]
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
Or to look at the actual files, see the contents of GitHub on BigQuery.
Now you can mix these queries to get the results you want!
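For example, here's one way to mix the first two (an untested sketch in the same legacy SQL) that attributes each repo's December stars to the language reported on its pull requests:
SELECT lang.language AS language, SUM(stars.c) AS stars
FROM (
  -- stars (WatchEvents) per repo during the month
  SELECT repo.name AS name, COUNT(*) AS c
  FROM [githubarchive:month.201612]
  WHERE type = 'WatchEvent'
  GROUP BY 1
) stars
JOIN (
  -- one row per repo/language pair seen on pull requests
  SELECT repo.name AS name,
    JSON_EXTRACT_SCALAR(payload, '$.pull_request.base.repo.language') AS language
  FROM [githubarchive:month.201612]
  WHERE JSON_EXTRACT_SCALAR(payload, '$.pull_request.base.repo.language') IS NOT NULL
  GROUP BY 1, 2
) lang ON stars.name = lang.name
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10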

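Another option is to count each month's events per language directly: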
SELECT
JSON_EXTRACT_SCALAR(payload, '$.pull_request.head.repo.language') AS language,
COUNT(1) AS usage
FROM [githubarchive:month.201601]
GROUP BY language
HAVING language IS NOT NULL
ORDER BY usage DESC

Related

MySQL Workbench - script storing query results in an array and performing calculations?

Firstly, this is part of my college homework.
Now that's out of the way: I need to write a query that will get the number of free apps in a DB as a percentage of the total number of apps, sorted by what category the app is in.
I can get the number of free apps and also the number of total apps by category. Now I need to find the percentage, and this is where it goes a bit pear-shaped.
Here is what I have so far:
-- find total number of apps per category
select @totalAppsPerCategory := count(*), category_primary
from apps
group by category_primary;
-- find number of free apps per category
select @freeAppsPerCategory := count(*), category_primary
from apps
where (price = 0.0)
group by category_primary;
-- find percentage of free apps per category
set @totals = @freeAppsPerCategory / @totalAppsPerCategory * 100;
select @totals, category_primary
from apps
group by category_primary;
It then lists the categories, but the percentage listed for each category is exactly the same value.
I had initially thought to use an array, but from what I have read, MySQL does not seem to support arrays.
I'm a bit lost as to how to proceed from here.
Finally figured it out. Since I had been saving the previous results in variables, the calculation was not done on a row-by-row basis; each variable held a single overall value, which is why all the percentages were identical. So the calculation needed to be part of the query.
Here's what I came up with:
SELECT
    category_primary,
    CONCAT(FORMAT(COUNT(CASE WHEN price = 0 THEN 1 END) / COUNT(*) * 100,
                  1),
           '%') AS FreeAppSharePercent
FROM
    apps
GROUP BY category_primary
-- order by the numeric share rather than the formatted string
ORDER BY COUNT(CASE WHEN price = 0 THEN 1 END) / COUNT(*) DESC;
The query then returns one row per category with its free-app share, largest first.
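For comparison, the same per-category share can be computed by joining the two original counting queries instead of using variables (a sketch; total_ct and free_ct are made-up aliases):
SELECT t.category_primary,
       CONCAT(FORMAT(COALESCE(f.free_ct, 0) / t.total_ct * 100, 1), '%') AS FreeAppSharePercent
FROM (SELECT category_primary, COUNT(*) AS total_ct
      FROM apps
      GROUP BY category_primary) AS t
LEFT JOIN (SELECT category_primary, COUNT(*) AS free_ct
           FROM apps
           WHERE price = 0
           GROUP BY category_primary) AS f
  ON f.category_primary = t.category_primary;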

BigQuery for GitHub: How to get the most-starred repos for a specific language

I want to get the list of repos with the most stars using BigQuery. I wrote a query, but I am not sure about the result:
SELECT
  repo.name,
  COUNT(JSON_EXTRACT_SCALAR(payload, '$.pull_request.base.repo.stargazers_count')) AS Stars
FROM `githubarchive.year.2019`
WHERE JSON_EXTRACT_SCALAR(payload, '$.pull_request.base.repo.language') = 'Java'
GROUP BY repo.name
ORDER BY Stars DESC
LIMIT 100
Can anyone check this with me, and also add a new column with the repo URL?
That's a good start - but note that this query goes over 1 TB of data, and will quickly consume your monthly free quota.
I recommend starting by extracting all the interesting rows (like the Java-related ones) into a new table, then running your future queries against that smaller table.
This query will give you the results you want:
SELECT repo.name
, MAX(CAST(JSON_EXTRACT_SCALAR(payload, '$.pull_request.base.repo.stargazers_count') AS INT64)) stars
FROM `githubarchive.month.201912`
WHERE JSON_EXTRACT_SCALAR(payload, '$.pull_request.base.repo.language') = 'Java'
AND type='PullRequestEvent'
GROUP by repo.name
ORDER BY stars DESC
I'm only looking at repos that had pull requests during December 2019. It might be a good sign of the repo being alive.
Since I'm only looking at December, the query cost is 1/12th.
MAX() gives you the highest stargazers_count reported across that month's pull requests - in practice, the star count around the time of the repo's latest pull request.
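As for the repo URL you asked about: repo.name is already in owner/repo form, so the URL can be derived from it (a sketch, assuming the standard github.com URL scheme):
SELECT repo.name
  , CONCAT('https://github.com/', repo.name) AS url
  , MAX(CAST(JSON_EXTRACT_SCALAR(payload, '$.pull_request.base.repo.stargazers_count') AS INT64)) AS stars
FROM `githubarchive.month.201912`
WHERE JSON_EXTRACT_SCALAR(payload, '$.pull_request.base.repo.language') = 'Java'
AND type='PullRequestEvent'
GROUP BY repo.name
ORDER BY stars DESC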
Now let me create and share with you a much smaller table, to get the top starred repositories by language:
CREATE TABLE `fh-bigquery.github_extracts.201912_repo_lang_stars`
AS
SELECT repo.name
, MAX(CAST(JSON_EXTRACT_SCALAR(payload, '$.pull_request.base.repo.stargazers_count') AS INT64)) stars
, JSON_EXTRACT_SCALAR(payload, '$.pull_request.base.repo.language') lang
FROM `githubarchive.month.201912`
WHERE type='PullRequestEvent'
GROUP by repo.name, lang
# 28.1 sec elapsed, 161.7 GB processed
;
SELECT lang
, COUNT(*) repos
, ARRAY_AGG(STRUCT(name, stars) ORDER BY stars DESC LIMIT 3) repo
FROM `fh-bigquery.github_extracts.201912_repo_lang_stars`
GROUP BY lang
ORDER BY repos DESC
# 1.4 sec elapsed, 52.2 MB processed
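With the extract in place, per-language lookups are cheap. For example, to pull the top Java repos out of it (using the columns defined above):
SELECT name, stars
FROM `fh-bigquery.github_extracts.201912_repo_lang_stars`
WHERE lang = 'Java'
ORDER BY stars DESC
LIMIT 10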
https://twitter.com/felipehoffa/status/1220470249310445572

How to pull GitHub timeline data from BigQuery

I am having trouble accessing the GitHub timeline from BigQuery.
I was using the following query:
SELECT repository_name, actor_attributes_company, payload_ref_type, payload_action, type, created_at
FROM githubarchive:github.timeline
WHERE repository_organization = 'foo'
AND created_at > '2014-07-01'
and everything was working great. Now, it looks like the githubarchive:github.timeline table is no longer available. I've been looking around and I found another table:
SELECT repository_name, actor_attributes_company, payload_ref_type, payload_action, type, created_at
FROM publicdata:samples.github_timeline
WHERE repository_organization = 'foo'
AND created_at > '2014-07-01'
This query works but returns zero rows. When I remove the created_at restriction, it works but only returns a few rows from 2012, so it looks like this is just sample data.
Does anyone know how to pull live timeline data from GitHub?
Indeed, publicdata:samples.github_timeline has only sample data.
For the real GitHub Archive documentation, look at http://www.githubarchive.org/
I wrote an article yesterday about querying it:
https://medium.com/@hoffa/analyzing-github-issues-and-comments-with-bigquery-c41410d3308
Sample query:
SELECT repo.name,
  JSON_EXTRACT_SCALAR(payload, '$.action') action,
  COUNT(*) c
FROM [githubarchive:month.201606]
WHERE type IN ('IssuesEvent')
AND repo.name IN ('kubernetes/kubernetes', 'docker/docker', 'tensorflow/tensorflow')
GROUP BY 1,2
ORDER BY 3 DESC
As Mikhail points out, there's also another dataset with all of GitHub's code:
https://medium.com/@hoffa/github-on-bigquery-analyze-all-the-code-b3576fd2b150
Check out the githubarchive BigQuery project.
It has three datasets: day, month, and year, with daily, monthly, and yearly data respectively.
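For example (legacy SQL; day tables are named YYYYMMDD, month tables YYYYMM, year tables YYYY):
SELECT type, COUNT(*) c
FROM [githubarchive:day.20160601]
GROUP BY 1
ORDER BY 2 DESC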
Check out https://cloudplatform.googleblog.com/2016/06/GitHub-on-BigQuery-analyze-all-the-open-source-code.html for more details

ABAP 7.40 SELECT .. ENDSELECT UP TO n ROWS syntax?

Update: The question should be withdrawn; the grammar is correct. Apparently, SAP defines ABAP via a grammar, which is then modified by additional rules in plain text. I missed this second part.
I'm looking at the ABAP Keyword Documentation 7.40 (SELECT -> SELECT additions). For the addition UP TO n ROWS, it gives the example:
DATA: wa_scustom TYPE scustom.
SELECT *
FROM scustom
WHERE custtype = 'B'
ORDER BY discount DESCENDING
INTO #wa_scustom
UP TO 3 ROWS.
ENDSELECT.
I verified that code in an SAP 7.40 system and got the error
Row 7: "INTO" is not valid here. '.' is expected
On the other hand, the following code is accepted, although it is not covered by the grammar of SELECT as given in the document: UP TO n ROWS should come after the FROM clause.
SELECT COUNT(*) UP TO 1 ROWS
FROM MARC
WHERE matnr eq 100.
As we are writing a tool that automatically generates ABAP code, it would be nice to know what's legal and what's not. Is there a "definitive" document? In general, is it worth trying to contact someone at SAP for corrections? (You see, I'm somewhat alien to the SAP world.) If yes, who would that be?
I can't check it right now, but I suspect that you have the "INTO ..." and "UP TO ..." parts placed after the "WHERE ..." and "ORDER BY ..." parts.
Documentation states that the syntax of SELECT is:
SELECT result
INTO target
FROM source
[WHERE condition]
[GROUP BY fields]
[HAVING cond]
[ORDER BY fields].
with the remark "The FROM clause ... can also be placed before the INTO clause." There are no remarks saying that you can insert WHERE/GROUP BY/HAVING/ORDER BY between SELECT/INTO/FROM.
So, give a try to:
SELECT *
FROM scustom
INTO #wa_scustom
UP TO 3 ROWS
WHERE custtype = 'B'
ORDER BY discount DESCENDING.
ENDSELECT.
Indeed, there's a syntax imposed by SAP that only allows "UP TO n ROWS" and other arguments to come BEFORE the WHERE clause. However, there is more flexibility in newer ABAP server releases.
When generalizing, please use the older syntax. It will still work on newer versions, as SAP has a strong backward compatibility policy.
"Be conservative in what you send, be liberal in what you accept".
Something like this:
SELECT {fields}
FROM {db table}
INTO {work area}
UP TO {n} ROWS
WHERE {field} {condition} {value}
ORDER BY {field} {ASCENDING/DESCENDING}.
ENDSELECT.
Hope it helps.

Count total number of pages globally

We are preparing for an upgrade of Confluence. We are documenting all the current information about our Confluence instance before we do the upgrade.
I need to know the total number of pages that exist globally within Confluence.
From the administrator's console I can see this information:
Content (All Versions) 91892
Content (Current Versions) 18194
It is my understanding that this information covers all content, including blog posts, attachments, comments, etc., not specifically the page count.
I have found some macros that will tell me how many pages exist in a particular space, but that does not quite do what I need, as we currently have over 200 spaces.
The database we have this on is: Microsoft SQL Server.
Confluence Version: 3.4.2
Is there any way to get the total number of pages within Confluence globally?
Any help would be greatly appreciated.
Yes, the system information page lists the total count of ALL content elements.
If you want to know the figures by content type, you could query the database with the following SQL:
All versions:
SELECT contenttype, count(*) FROM content c GROUP BY c.contenttype
Current versions:
SELECT contenttype, count(*) FROM content c WHERE c.prevver is null GROUP BY c.contenttype
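So for the page count specifically, filtering on the content type should work (a sketch, assuming this Confluence version stores pages with contenttype = 'PAGE'):
SELECT COUNT(*) AS current_pages
FROM content c
WHERE c.contenttype = 'PAGE'
AND c.prevver IS NULL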