How to pull GitHub timeline data from BigQuery

I am having trouble accessing the GitHub timeline from BigQuery.
I was using the following query:
SELECT repository_name, actor_attributes_company, payload_ref_type, payload_action, type, created_at
FROM githubarchive:github.timeline
WHERE repository_organization = 'foo'
AND created_at > '2014-07-01'
and everything was working great. Now, it looks like the githubarchive:github.timeline table is no longer available. I've been looking around and I found another table:
SELECT repository_name, actor_attributes_company, payload_ref_type, payload_action, type, created_at
FROM publicdata:samples.github_timeline
WHERE repository_organization = 'foo'
AND created_at > '2014-07-01'
This query works but returns zero rows. When I remove the created_at restriction it works but only returns a few rows from 2012, so it looks like this is just sample data.
Does anyone know how to pull live timeline data from GitHub?

Indeed, publicdata:samples.github_timeline has only sample data.
For the real GitHub Archive documentation, look at http://www.githubarchive.org/
I wrote an article yesterday about querying it:
https://medium.com/@hoffa/analyzing-github-issues-and-comments-with-bigquery-c41410d3308
Sample query:
SELECT repo.name,
JSON_EXTRACT_SCALAR(payload, '$.action') action,
COUNT(*) c
FROM [githubarchive:month.201606]
WHERE type IN ('IssuesEvent')
AND repo.name IN ('kubernetes/kubernetes', 'docker/docker', 'tensorflow/tensorflow')
GROUP BY 1, 2
ORDER BY 3 DESC
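If you prefer BigQuery standard SQL instead of the legacy dialect above, the same query would look roughly like this (an untested sketch):
-- standard SQL uses dots (and backticks) instead of colons in table references
SELECT repo.name,
JSON_EXTRACT_SCALAR(payload, '$.action') AS action,
COUNT(*) AS c
FROM `githubarchive.month.201606`
WHERE type = 'IssuesEvent'
AND repo.name IN ('kubernetes/kubernetes', 'docker/docker', 'tensorflow/tensorflow')
GROUP BY 1, 2
ORDER BY 3 DESC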
As Mikhail points out, there's also another dataset with all of GitHub's code:
https://medium.com/@hoffa/github-on-bigquery-analyze-all-the-code-b3576fd2b150

Check out the githubarchive BigQuery project.
It has three datasets (day, month, and year) holding daily, monthly, and yearly data respectively.
See https://cloudplatform.googleblog.com/2016/06/GitHub-on-BigQuery-analyze-all-the-open-source-code.html for more details.
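For example (an untested sketch, assuming the table naming scheme described on githubarchive.org), a single day's table can be queried the same way as the monthly one:
SELECT type, COUNT(*) c
FROM [githubarchive:day.20160601]
GROUP BY 1
ORDER BY 2 DESC
Swap in [githubarchive:month.201606] or [githubarchive:year.2016] for coarser granularity.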

Related

I can't show Grafana which time field it should use for chart building

I have a Postgresql DataSource with the following table:
It's basically a log table. All I want is to show on a chart how many successful records (with http_status == 200) I have per hour. Sounds simple, right? I wrote this query:
SELECT
count(http_status) AS "success_total_count_per_hour",
date_trunc('hour', created_at) "log_date"
FROM logs
WHERE
http_status = 200
GROUP BY log_date
ORDER BY log_date
It gives me the following result:
Looks good to me. I'm going ahead and trying to put it into Grafana:
OK, I get it, I have to help Grafana understand which field holds the time.
I go to the Query Builder and I see that it breaks my query completely. Since that moment I have been completely lost. Here is the Query Builder screen:
How do I explain to Grafana what I want? I just want a simple chart like:
Sorry for the rough picture, but I think you get the idea. Thanks for any help.
Your time column (e.g. created_at) should be of type TIMESTAMP WITH TIME ZONE.*
Use a time condition; Grafana has a macro for it, so it's easy, e.g. WHERE $__timeFilter(created_at)
You want hourly grouping, so you need to write the select for that. Again, Grafana has a macro: $__timeGroupAlias(created_at,1h,0)
So final Grafana SQL query (not tested, so it may need some minor tweaks):
SELECT
$__timeGroupAlias(created_at,1h,0),
count(*) AS value,
'success_total_count_per_hour' as metric
FROM logs
WHERE
$__timeFilter(created_at)
AND http_status = 200
GROUP BY 1
ORDER BY 1
*See the Grafana docs: https://grafana.com/docs/grafana/latest/datasources/postgres/
The macros are documented there, including variants for the case when your time column is a UNIX timestamp.
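For instance, if created_at were stored as a UNIX timestamp instead, the same query would swap in the epoch macros (an untested sketch):
SELECT
$__unixEpochGroupAlias(created_at,1h),
count(*) AS value,
'success_total_count_per_hour' AS metric
FROM logs
WHERE
$__unixEpochFilter(created_at)
AND http_status = 200
GROUP BY 1
ORDER BY 1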

Redshift - Extracting data from table based on filter condition

I have some sales data that shows whether a store has made a sale or not. I am trying to pull out all stores that have made no sale to date. Given below is the sample data I am working with.
store_name,sale_made,count
store_a,0,100
store_a,1,23
store_b,1,18
store_c,0,32
store_d,0,50
store_d,1,70
Expected output:
store_name,sale_made,count
store_c,0,32
The reason: only store_c in that list has a sale_made = 0 row and no sale_made = 1 row.
You need something like this:
SELECT store_name FROM table
GROUP BY 1
HAVING SUM(sale_made)=0
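If you also want the sale_made and count columns shown in the expected output, wrap that in a subquery (a sketch; sales stands in for your actual table name):
SELECT store_name, sale_made, count
FROM sales
WHERE store_name IN (
  -- stores whose sale_made values sum to zero, i.e. no sale ever made
  SELECT store_name
  FROM sales
  GROUP BY 1
  HAVING SUM(sale_made) = 0
)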

Inaccurate COUNT DISTINCT Aggregation with Date dimension in Google Data Studio

When I aggregate values in Google Data Studio with a date dimension on a PostgreSQL Connector, I see buggy behaviour. The symptom is that performing COUNT(DISTINCT) returns the same value as COUNT():
My theory is that it has something to do with the aggregation on the data occurring after the count has already happened. If I attempt the exact same aggregation on the same data in an exported CSV instead of directly from a PostgreSQL Connector Data Source, the issue does not reproduce:
My PostgreSQL Connector is connecting to Amazon Redshift (jdbc:postgresql://*******.eu-west-1.redshift.amazonaws.com) with the following custom query:
SELECT
userid,
submissionid,
date
FROM mytable
Workaround
If I stop using the default date field for the Date Dimension and aggregate my own dates directly within the SQL query (date_byweek), the COUNT(DISTINCT) aggregation works as expected:
SELECT
userid,
submissionid,
to_char(date,'YYYY-IW') as date_byweek
FROM mytable
While this workaround solves my immediate problem, it sucks because I miss out on all the date functionality provided by Data Studio (Hierarchy Drill Down, Date Range filtering, etc.). Not to mention that it reduces my confidence in what else may be "buggy" within the product 😞
How to Reproduce
If you'd like to re-create the issue, using the following data as a PostgreSQL Data Source should suffice:
> SELECT * FROM mytable
userid submissionid
-------- -------------
1 1
2 2
1 3
1 4
3 5
> COUNT(DISTINCT userid) -- ERROR: Returns 5 when data source is PostgreSQL
> COUNT(DISTINCT userid) -- EXPECTED: Returns 3 when data source is CSV (exported from same PostgreSQL query above)
I'm happy to report that as of Sep 17 2020, there's a workaround.
Data Studio added the DATETIME_TRUNC function (see https://support.google.com/datastudio/answer/9729685), which lets you add a custom field that truncates the original date to whatever granularity you want, without triggering the distinct bug.
Attempting to set the display granularity in the report still triggers the bug (i.e., you'll still see Oct 1 2020 12:00:00 instead of Oct 2020).
This can be solved by creating a SECOND custom field, which just returns the first; you can then add IT to the report, change the display granularity, and everything will work OK.
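For example (a sketch; the field names are hypothetical, and date is the original column), the two calculated fields could be:
week_start: DATETIME_TRUNC(date, WEEK)
week_start_display: week_start
Add week_start_display to the report and set the display granularity on it; week_start stays untouched, so the distinct counts remain correct.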
I had the same issue with the MySQL connector, but my problem was solved when I changed the date field format in the DB from DATETIME (YYYY-MM-DD HH:MM:SS) to INT (a Unix timestamp). After connecting this table to Google Data Studio, I set the type for this field to Date (YYYYMMDD) and everything works as expected. Hope this may help you :)
In this Google forum there is a curious solution by Damien Choizit that involves combining your data source with itself. It works well for me.
https://support.google.com/datastudio/thread/13600719?hl=en&msgid=39060607
It says:
I figured out a solution in my case: I used a Blend Data joining the same data source with itself twice on the corresponding join key(s), then I specified a date range dimension only on the left side and selected the columns I wanted to CTD aggregate as "dimensions" (and not metrics!) on the right side.

How to measure language popularity via Github Archive data?

I'm attempting to measure programming language popularity via:
The number of stars on repos in combination with...
The programming languages used in the repo and...
The total bytes of code in each language (recognizing that some languages are more/less verbose)
Conveniently, there is a massive trove of Github data provided by Github Archive, and hosted by BigQuery. The only problem is that I don't see "language" available in any of the payloads for the various event types in Github Archive.
Here's the BigQuery query I've been running to find whether, and where, language is populated in the GitHub Archive data:
SELECT *
FROM [githubarchive:month.201612]
WHERE JSON_EXTRACT(payload, "$.repository.language") IS NOT NULL
LIMIT 100
Can someone please provide insight into whether I'll be able to use GitHub Archive data this way, and how I can go about doing so? Or will I need to pursue some other approach? I see that there is also a github_repos public dataset on BigQuery, and it does have some language metrics, but those metrics seem to cover all time. I'd prefer to eventually get some sort of monthly metric (i.e., of the "active" repos in a given month, what were the most popular languages?).
Any advice is appreciated!
You can do this with BigQuery, GitHub Archive, and GHTorrent.
To get the languages by pull requests last December (copy-pasted from http://mads-hartmann.com/2015/02/05/github-archive.html):
SELECT COUNT(*) c, JSON_EXTRACT_SCALAR(payload, '$.pull_request.base.repo.language') lang
FROM [githubarchive:month.201612]
WHERE JSON_EXTRACT_SCALAR(payload, '$.pull_request.base.repo.language') IS NOT NULL
GROUP BY 2
ORDER BY 1 DESC
LIMIT 10
To find the number of stars per project:
SELECT COUNT(*) c, repo.name
FROM [githubarchive:month.201612]
WHERE type='WatchEvent'
GROUP BY 2
ORDER BY 1 DESC
LIMIT 10
For a quick language vs bytes view, you can use GHTorrent:
SELECT language, SUM(bytes) bytes
FROM [ghtorrent-bq:ght.project_languages]
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
Or to look at the actual files, see the contents of GitHub on BigQuery.
Now you can mix these queries to get the results you want!
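For example, here's an untested sketch that joins the monthly star counts with the pull-request language mapping to approximate stars per language:
SELECT lang.language AS language, SUM(stars.c) AS stars
FROM (
  -- stars (WatchEvents) per repo for the month
  SELECT repo.name AS name, COUNT(*) AS c
  FROM [githubarchive:month.201612]
  WHERE type = 'WatchEvent'
  GROUP BY 1
) stars
JOIN EACH (
  -- repo-to-language mapping taken from pull request payloads
  SELECT repo.name AS name,
  JSON_EXTRACT_SCALAR(payload, '$.pull_request.base.repo.language') AS language
  FROM [githubarchive:month.201612]
  WHERE JSON_EXTRACT_SCALAR(payload, '$.pull_request.base.repo.language') IS NOT NULL
  GROUP EACH BY 1, 2
) lang ON stars.name = lang.name
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10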
SELECT
JSON_EXTRACT_SCALAR(payload, '$.pull_request.head.repo.language') AS language,
COUNT(1) AS usage
FROM [githubarchive:month.201601]
GROUP BY language
HAVING language IS NOT NULL
ORDER BY usage DESC

Crystal Reports Filter by Most Recent Date of Field

I have a report I am creating through an ODBC connection. The report includes several invoices, where each invoice has several products. There is also a table which contains all the historical price changes for each product (field: unit-price). Currently, duplicate product records are being pulled, one for each time there was a price change. Therefore, I need to filter my data so that only the most recent unit-price is shown (date field: effective-date). How can I do this via the "Select Expert"?
In short, show the product's unit-price for the most recent effective-date.
Thank you!
You'll need to create a sql-expression field to get the most-recent effective date, then use this field in the record-selection formula.
// {%MAX_EFFECTIVE_DATE}
// most-likely you'll need to alias the table in the main report for this to work
(
SELECT Max(effective_date)
FROM price_history
WHERE product_id = price_history_alias.product_id
)
Record-selection formula:
{price_history_alias.effective_date}={%MAX_EFFECTIVE_DATE}
Instead of doing it in the Select Expert, group by effective date and set the ordering to Descending.