BigQuery github dataset returns wrong results - github

I'm trying to run some queries against bigquery-public-data:github_repos.files, which was last updated on May 25, 2018, 2:07:03 AM. In theory it contains data for all files on GitHub, as the table description says:
File metadata for all files at HEAD.
Join with [bigquery-public-data:github_repos.contents] on id columns
to search text.
I have a tool called goreleaser. To use it, users create a file named .goreleaser.yaml. To get an idea of how many repositories use it, I have been using GitHub search with a query like filename:goreleaser extension:yaml extension:yml path:/ (you can see the results on this link).
This shows 1k+ results and matches all of these possible names:
goreleaser.yml
goreleaser.yaml
.goreleaser.yml
.goreleaser.yaml
The problem is that GitHub shows a result count above 1k, but you can only paginate through the first 1k or so. I wrote some code in Go using the API; you can see it here.
Anyway, I tried to do something similar with bigquery, here is my foolish attempt:
SELECT repo_name, path
FROM [bigquery-public-data:github_repos.files]
WHERE REGEXP_MATCH(path, r'\.?goreleaser.ya?ml')
This will include the vendored tools, which is not ok, but that's not the problem. The problem is that even with the vendored tools, it only shows ~500 results, not 1k.
PS: I also tried a simplified version that matches path with LIKE, with the same results.
So either I'm doing something horribly wrong, or this table does not include all the data it says it does, or GitHub search is lying to me.
Any advice?
Thanks!

Not every project in GitHub is mirrored on BigQuery's repo dataset.
Let's look at all projects that got more than 40 stars in April, vs what we can find mirrored in BigQuery's repos:
SELECT COUNT(name) april_projects_gt_stars, COUNT(repo_name) projects_mirrored
FROM (
  SELECT DISTINCT repo_name, name, c
  FROM `bigquery-public-data.github_repos.files` a
  RIGHT JOIN (
    SELECT repo.name, COUNT(*) c
    FROM `githubarchive.month.201804`
    WHERE type='WatchEvent'
    GROUP BY 1
    HAVING c>40
  ) b
  ON repo_name=name
)
9522 vs 3995. Why?
Only open source projects are mirrored. This is decided by the detected open source license: if GitHub can't tell what license a project is using, the project can't be mirrored.
New projects: The pipeline might miss some new projects. Please report them.
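Separately from the mirroring gap, the pattern in the question is unanchored and the dot before ya?ml is unescaped, so it matches vendored copies and some unrelated names as well. Here is a minimal Python sketch of a stricter match, assuming only the four file names listed in the question, at the repository root, should count. The helper name is_root_goreleaser_config and the sample paths are hypothetical.

```python
import re

# Anchored pattern: optional leading dot, literal "goreleaser", an escaped
# dot, then "yml" or "yaml". fullmatch() requires the whole path to match,
# so any path that contains a directory separator is rejected.
ROOT_CONFIG = re.compile(r"\.?goreleaser\.ya?ml")

# The pattern from the question, kept as-is for comparison.
ORIGINAL = re.compile(r"\.?goreleaser.ya?ml")


def is_root_goreleaser_config(path: str) -> bool:
    """Return True only for a goreleaser config at the repository root."""
    return ROOT_CONFIG.fullmatch(path) is not None


# Hypothetical sample paths.
paths = [
    ".goreleaser.yml",
    "goreleaser.yaml",
    "vendor/github.com/goreleaser/goreleaser/.goreleaser.yml",
    "docs/goreleaser-yaml.md",
]
for p in paths:
    # Columns: path, matched by the original pattern, matched by the strict one.
    print(p, bool(ORIGINAL.search(p)), is_root_goreleaser_config(p))
```

The same idea should carry over to the query by anchoring the pattern with ^ and $, although that only narrows the matches and does nothing about repositories that are missing from the dataset.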

Related

Azure DevOps Boards - display query result on a board

How can I develop an extension that displays a query result on a board? Unfortunately, this is not possible in Azure DevOps out of the box. I've found two extensions on the marketplace that do what I need:
AA Query Board
Query based boards
but these extensions have not been updated for a long time and I couldn't contact the authors (I need to change a few things in order to use them internally at my company).
I've also found the topic "Add tabs on query result pages", so it looks quite easy to add a new tab to the query result menu, but I can't find any information on how to get the data (work items) from the query result in order to display them.
The rest of the extension just displays this data in a grid, which should also be quite easy, but getting the query result data is what's blocking me.
There is a Query Results Widget that you can use to display the query results on the Dashboards under Overview.
1. First, create a shared query if one does not exist yet and save it to the Shared Queries folder. (You can click Column options on the Editor page to add or remove the columns shown in the results.)
Alternatively, drag and drop the query from the My Queries folder to the Shared Queries folder.
2. Go to Dashboards under Overview, click Edit, then search for and add the Query Results widget.
3. Click the gear icon on the Query Results widget to configure it and select the query you want to display. The query results will then be displayed on the dashboard.
Update:
There are some other ways to show query results on dashboards, for example:
You can select your shared query, click More actions (the three dots), and click Add to dashboard. This displays a simple total count of the query results.
You can also create different charts for the query results and add them to dashboards.
Select your shared query, go to the Charts tab, choose New Chart, and select a chart type. After you have configured the chart, click the three dots on the chart and add it to a dashboard.
Eventually I managed to contact the author of the "AA Query Board" extension, and it turns out that he has a public repository on GitHub with the source code of the extension, so anyone can look up how it's done or build on it.
Link to the repository: https://github.com/staticnz/aa_query_board

How to query advanced issue handling on github (use of milestones and projects)?

I'd like to get the repositories that make the most active use of milestones and/or projects. By "most active" I mean something like most cards moved on a project board or most issues added to a milestone.
I tried GH Archive, which has yearly datasets on Google BigQuery. I ran this query:
SELECT
JSON_EXTRACT(payload, '$.action')
FROM
[githubarchive:year.2017]
WHERE
type in ("IssuesEvent")
and JSON_EXTRACT(payload, '$.action') in ("milestoned", "labeled", "assigned")
LIMIT
20
and this query
SELECT
type
FROM
[githubarchive:year.2017]
WHERE
type IN ("MilestoneEvent",
"ProjectEvent",
"ProjectCardEvent")
LIMIT
20
Both return zero results. Does GH Archive not import all events? Am I making a mistake in the queries? Is there another source where I can get this information?
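One possible cause worth ruling out for the first query (an assumption, not a confirmed diagnosis): JSON_EXTRACT returns a JSON-formatted value, so a string comes back with its double quotes, whereas JSON_EXTRACT_SCALAR returns the bare text. When compared with an unquoted literal such as "milestoned", the quoted value would never match. The Python sketch below only imitates that distinction with the json module. The helper names and the sample payload are hypothetical, and nothing here calls BigQuery.

```python
import json


def extract_json(payload: str, key: str) -> str:
    """Imitate JSON_EXTRACT: return the value re-encoded as JSON text."""
    return json.dumps(json.loads(payload)[key])


def extract_scalar(payload: str, key: str) -> str:
    """Imitate JSON_EXTRACT_SCALAR: return the bare scalar as a string."""
    return str(json.loads(payload)[key])


payload = '{"action": "milestoned"}'  # hypothetical sample payload
wanted = {"milestoned", "labeled", "assigned"}

# The JSON-encoded value still carries its quotes, so it is not in the set.
print(extract_json(payload, "action") in wanted)
# The bare scalar compares equal to the plain literal.
print(extract_scalar(payload, "action") in wanted)
```

If this is the cause, switching the WHERE clause to JSON_EXTRACT_SCALAR would be the thing to try. It would not explain the second query, and whether GH Archive records these actions at all is a separate question that this sketch does not answer.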

TYPO3 7.6 Backend module to list values from several tables

I have been struggling for some time now and I can't really find anyone having done the same thing before.
I'm creating a backend module in TYPO3 7.6 which belongs to a shop extension.
The shop extension with the backend module was created with the extension builder. The shop has the following three models:
Product (products which can be ordered through the shop)
Productsorder (link to the customer)
ProductsorderPosition (the ordered product, the ordered amount and size and the link to the Productsorder)
The customers are of a model type from a different extension. These customers are linked to fe_users.
What I want to do in my backend module is show an overview of all these orders, listed with the customer, some information about the fe_user and, of course, the product. I have written an SQL query that does exactly that:
SELECT p.productname, p.productpriceperpiece,
pop.amount, pop.size,
h.name, h.address, h.zipcode, h.city, h.email, h.phone,
f.first_name, f.last_name, f.email
FROM `tx_gipdshop_domain_model_productorderposition` AS pop
JOIN `tx_gipdshop_domain_model_product` AS p ON pop.products = p.uid
JOIN `tx_gipdshop_domain_model_productsorder` AS po ON pop.productorder = po.uid
JOIN `tx_gipleasedisturbhotels_domain_model_hotel` AS h ON po.hotel = h.uid
JOIN `fe_users` AS f ON h.feuser = f.uid
If I use this query from the product repository it gives back the right amount of data records but they're of type product and the products are all "empty" (uid = 0 etc).
I've added an additional action for this in the product controller (getOrdersAction) and in the repository containing the query I've added a method findAllOrders.
I'm still rather a beginner in TYPO3 but I can somehow understand why it returns data sets of type Product when the query is called from the ProductRepository. But what I do not know is how I can get all the information from the query above and list it in the backend module.
I've already thought about moving the query to the ProductsorderPositionRepository, but I would probably face a similar problem: it would only return the information from the ProductsorderPosition and everything else would be left out.
Can someone point me in the right direction?
Would I need to create another model with separate repository and controller? Isn't there an easier way?
If you need more information, just ask! ;)
First of all, you are running a joined query that mixes subsets of data from multiple tables. There is nothing wrong with that.
Because of this, however, there is no single "model" that represents the mixed result set.
If you use the default query mechanism in a repository, the magic behind the repository assumes that the result of the query statement maps to the base model defined for that repository.
Moving the query function to another repository does not solve the problem.
You have not shown the code that executes the SQL statement, so I assume you used the repository's query object to execute it, something like this:
$result = $query->statement('
SELECT p.productname, p.productpriceperpiece,
pop.amount, pop.size,
h.name, h.address, h.zipcode, h.city, h.email, h.phone,
f.first_name, f.last_name, f.email
FROM `tx_gipdshop_domain_model_productorderposition` AS pop
JOIN `tx_gipdshop_domain_model_product` AS p ON pop.products = p.uid
JOIN `tx_gipdshop_domain_model_productsorder` AS po ON pop.productorder = po.uid
JOIN `tx_gipleasedisturbhotels_domain_model_hotel` AS h ON po.hotel = h.uid
JOIN `fe_users` AS f ON h.feuser = f.uid', NULL);
or that you used the query building methods.
First solution
The first and simplest solution would be to retrieve the result as a plain PHP array. Before TYPO3 7.0 you could have done this with:
$query->getQuerySettings()->setReturnRawQueryResult(TRUE);
With TYPO3 7.0 this deprecated method was removed from the core.
The only way is to define the query and call $query->execute(TRUE); for now.
This should return the data in pure array form.
This is the simplest one, but as we are in the Extbase context, this should not be suffering enough.
Second Solution - no, just an idea that I would try next
The second solution means some work for you and is only a suggestion for now, because I have not tried it myself.
Create a model with the properties and getter/setters for the result columns of your query
Create a corresponding repository
Third solution
Not nice, but if nothing else works, fall back to the old TYPO3 v4 query methods:
$GLOBALS['TYPO3_DB']->exec_SELECTgetRows([...]);
and replace this with the QueryBuilder in/for TYPO3 v8.
This is really not nice.
I hope I could point you in the right direction, even without giving a complete solution.

Search for files on GitHub, where the repo has more than X stars

Is it possible to search GitHub for a particular filename AND restrict results to repos with some number of stars? I want to find all repositories with 100+ stars that have a webpack.config.json file.
You can search for a particular file name like so:
filename:webpack.config.json
And you can search for repos with some number of stars like so:
stars:>100
But there doesn't seem to be a way to combine the syntax to limit file searches.
"But there doesn't seem to be a way to combine the syntax to limit file searches."
That is because stars: is a repository selector, as opposed to filename:, which is a code selector.
You would need to query the GitHub data on BigQuery in order to combine the two search criteria effectively.
However, as the OP Don P adds in the comments:
It looks like there is no dataset for repos.
Another approach would be using a GitHub GraphQL query, looking for:
a TreeEntry,
with a StarOrder for the result

Get list of all files with the user who checked in the latest version in TFS

Is there a way in TFS to get a list of files under source control together with the user who checked in the latest version (or the version you have locally)?
The closest functionality I can find is in the Source Control Explorer window, where you can see each file with its latest check-in date, but not the user who checked it in.
There is currently no way to do this from the VS Source Control Explorer. The best you can get is the web version of TFS, where you will see the name of the user in the comments section, along with the changeset number and any comment.
If that doesn't work for you, you can use either the TFS API or a SQL query against the TFS database. The following SQL should give you the result.
SELECT TOP 10
V.ChildItem AS [FileName],
I.DisplayName AS [ChangedBy],
CS.CreationDate AS [ChangeDate]
FROM tbl_Changeset CS
INNER JOIN tbl_Identity I
ON I.IdentityID = CS.OwnerID
INNER JOIN tbl_Version V
ON V.VersionFrom = CS.ChangesetID