perform aggregate computation on GitHub data - REST

I have a GitHub (Enterprise) organization with multiple repositories. Each repository contains one or more .properties files.
Some of these .properties files will be contained in folders that contain "i18n" in their path or filename.
These .properties files would be relevant for translation processes.
As a basic step: I would need to get the average/min/max frequency of commits that involve translation-relevant files (as defined above) for each one of the repositories.
As an ideal scenario: I would also need to determine how many key-value pairs were changed/added/deleted by each commit on average, to better estimate the resulting workload for the translation process.
What I tried so far:
GitHub GraphQL API v4: it seems to me that the API is very well suited for searching, but not as much for computing aggregations.
GitHub REST API v3: specific commits can be searched, but not based on file extension. While file extension is a query criterion for files themselves, it is not for commits.
Any hint on how to achieve this?
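One way to sketch this against the REST API v3 (Python with requests; the Enterprise base URL, token and repository names below are placeholders): list a repository's commits, fetch each commit's file list, keep the commits that touch i18n .properties files, and aggregate the gaps between them. The per-file additions/deletions returned by the same commit endpoint could also serve as a rough line-level proxy for the key-value changes of the ideal scenario.

import statistics
from datetime import datetime

import requests

API_BASE = "https://github.example.com/api/v3"   # GitHub Enterprise base URL (placeholder)
HEADERS = {"Authorization": "token YOUR_TOKEN"}  # personal access token (placeholder)

def is_translation_relevant(path):
    # ".properties" extension plus "i18n" somewhere in the path or filename, as defined above
    return path.endswith(".properties") and "i18n" in path.lower()

def relevant_commit_dates(owner, repo):
    """Yield the author dates of commits touching at least one translation-relevant file."""
    page = 1
    while True:
        commits = requests.get(f"{API_BASE}/repos/{owner}/{repo}/commits",
                               params={"per_page": 100, "page": page},
                               headers=HEADERS).json()
        if not commits:
            break
        for c in commits:
            # The list endpoint omits the file list, so fetch each commit individually
            detail = requests.get(f"{API_BASE}/repos/{owner}/{repo}/commits/{c['sha']}",
                                  headers=HEADERS).json()
            if any(is_translation_relevant(f["filename"]) for f in detail.get("files", [])):
                yield datetime.fromisoformat(c["commit"]["author"]["date"].rstrip("Z"))
        page += 1

def frequency_stats(owner, repo):
    """Average/min/max gap in days between consecutive translation-relevant commits."""
    dates = sorted(relevant_commit_dates(owner, repo))
    gaps = [(b - a).days for a, b in zip(dates, dates[1:])]
    if not gaps:
        return None
    return {"avg_days_between": statistics.mean(gaps),
            "min_days_between": min(gaps),
            "max_days_between": max(gaps)}

print(frequency_stats("my-org", "my-repo"))  # owner/repo are placeholders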

Related

Google Data Fusion reading files from multiple sub folders in a bucket and need to place in another folder in side sub folder

Example
sameer/student/land/compressed files
sameer/student/pro/uncompressed files
sameer/employee/land/compressed files
sameer/employee/pro/uncompressed files
In the above example I need to read files from all LAND folders present in different sub-directories, process them, and place them in the PRO folders within the same sub-directories.
For this I have taken two GCS nodes, one for the source and another for the sink.
In the GCS source I have provided the path gs://sameer/. It is reading files from all sub-folders and merging them into one file, placing it in the sink path.
Expected output: all files should be placed in the sub-directories they were fetched from.
I can achieve the expected output by running the pipeline separately for each folder.
What I am asking is whether this can be done with a single pipeline run.
It seems like your use case is simply moving files. In that case, I would suggest using the Action plugin GCS Move or GCS Copy.
It seems like the task you are trying to carry out is not possible to do in one single Data Fusion pipeline, at least at the time of writing this.
In a pipeline, all the sources and sinks have to be connected. Otherwise you will get the following error:
'Invalid DAG. There is an island made up of stages ...'
This means it is not possible to parallelise several uncompression tasks, one for each folder of files, inside the same pipeline.
At the same time, if you were to use something like the following schema, the outputs would be aggregated and replicated over all of the sinks:
Finally, I would say that the only case in which you can parallelise a task between several sources and several sinks is when using multiple database tables. By means of the following plugins (2) and (3) you can process data from multiple table inputs and export the output to multiple tables. If you would like to see all available plugins for Data Fusion, please check the following link (4).
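Outside of Data Fusion, if the requirement is essentially copying or moving files while preserving the sub-folder layout (as the GCS Move/Copy suggestion above implies), a short script against the Cloud Storage client library can do it in one pass. A minimal sketch, assuming the bucket and land/pro folder names from the example above:

# Sketch (not Data Fusion): copy every object under a ".../land/..." prefix to the
# matching ".../pro/..." prefix, using the google-cloud-storage client library.
from google.cloud import storage

BUCKET = "sameer"  # bucket name taken from the example paths above

client = storage.Client()
bucket = client.bucket(BUCKET)

for blob in client.list_blobs(BUCKET):
    parts = blob.name.split("/")
    if "land" in parts:
        # e.g. student/land/file.gz -> student/pro/file.gz
        target = "/".join("pro" if p == "land" else p for p in parts)
        bucket.copy_blob(blob, bucket, target)
        # blob.delete()  # uncomment to move instead of copy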

Advanced search on GitHub excluding a specific repository

I'm trying to figure out if there's any way to use the various fields defined in the GitHub advanced search form to effectively exclude hits from a specific repo. In other words, I want to do a code search for all hits landing outside a given repository, an inverse repository search if you will.
I may be able to tune the size field with an inequality, but I'm hoping there's something I may be overlooking that has this sort of search in mind. My specific use case is that there's a major monorepo on our remote but there's a small constellation of support repositories which reuse some bits of the main repo that need to be refactored. I'm trying to identify those source hits in the smaller repos that need to be upgraded.
https://github.com/search/advanced?q=test&type=Repositories
Use -repo in the normal search. You can exclude a repository by prepending a hyphen (-).
foo_library -repo:owner1/repoX -repo:owner2/repo
See also docs.github.com or github.community.
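If you want to drive the same exclusion from a script, the query string can be passed to the REST code-search endpoint unchanged. A sketch, assuming the API honours the -repo: exclusion qualifier the same way the web search does (token and query are placeholders):

# Sketch: code search excluding specific repositories via the REST search API.
import requests

TOKEN = "YOUR_TOKEN"  # personal access token (placeholder)
query = "foo_library -repo:owner1/repoX -repo:owner2/repo"

resp = requests.get("https://api.github.com/search/code",
                    params={"q": query},
                    headers={"Authorization": f"token {TOKEN}",
                             "Accept": "application/vnd.github.v3+json"})
for item in resp.json().get("items", []):
    print(item["repository"]["full_name"], item["path"])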

What file name should I use to identify GitHub's userName/repoName structure for analysis on my local filesystem without nested folders?

I am analyzing GitHub repositories.
To keep track of the same resource throughout multiple calculations I want to use the userName and repoName in combination as an identifier.
First I thought a simple
userName_repoName
would do, but apparently repoNames can now contain underscores as well (userNames can't, so splitting after the first underscore would work okay, I guess).
So my question is:
Do you have any advice for me on how to create a cross-platform failsafe identifier from a GitHub userName and repoName?
I want to avoid nested folders as it is easier to keep track of same-level folders than multiple nested ones (one user with multiple repos)
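One possible approach, sketched below with hypothetical helper names: percent-encode the owner/repo pair, so the slash becomes a reversible, filesystem-safe token and no assumption about underscores is needed.

# Sketch: a reversible, cross-platform identifier for "owner/repo".
# quote(..., safe="") turns the slash into "%2F", which is a legal character sequence
# in file and folder names on Windows, macOS and Linux, and can be decoded back.
from urllib.parse import quote, unquote

def repo_id(owner, repo):
    return quote(f"{owner}/{repo}", safe="")

def parse_repo_id(identifier):
    owner, repo = unquote(identifier).split("/", 1)
    return owner, repo

print(repo_id("some-user", "repo_with_underscores"))  # some-user%2Frepo_with_underscores

Alternatively, since userNames cannot contain underscores (as noted above), splitting userName_repoName on the first underscore also recovers both parts.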

looking for file in repo with certain properties

In GitHub, is it possible to search for repos with more than 100 stars that contain a file called "foo.txt"?
I am trying to use their search interface.
Not exactly, considering the GitHub search limitations (or "Considerations"):
Only the default branch is considered. In most cases, this will be the master branch.
Only files smaller than 384 KB are searchable.
Only repositories with fewer than 500,000 files are searchable.
You must always include at least one search term when searching source code.
For example, searching for language:go is not valid, while amazing language:go is.
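A workaround sketch, under assumptions (standard REST v3 endpoints; pagination, rate limits and truncated trees are ignored for brevity): first a repository search with the stars:>100 qualifier, then a scan of each repository's default-branch tree for the file name.

# Sketch: find repos with >100 stars that contain a file named "foo.txt".
import requests

API = "https://api.github.com"
HEADERS = {"Authorization": "token YOUR_TOKEN",  # placeholder token
           "Accept": "application/vnd.github.v3+json"}

# Step 1: repository search with the stars qualifier
repos = requests.get(f"{API}/search/repositories",
                     params={"q": "stars:>100", "per_page": 50},
                     headers=HEADERS).json().get("items", [])

# Step 2: look for the file anywhere in each repository's default-branch tree
for repo in repos:
    owner, name = repo["full_name"].split("/")
    branch = repo["default_branch"]
    tree = requests.get(f"{API}/repos/{owner}/{name}/git/trees/{branch}",
                        params={"recursive": "1"}, headers=HEADERS).json()
    if any(entry["path"].split("/")[-1] == "foo.txt" for entry in tree.get("tree", [])):
        print(repo["full_name"])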

What is deliverbl in UCM ClearCase?

I am wondering about 'deliverbl' baselines: how important are they for historical purposes or during development? All I know is that they are created during a deliver and include activities.
If I am exporting major baselines from ClearCase to a different SCM, should I consider 'deliverbl' as a major baseline?
They represent merges (from a child stream to another stream), so:
they should be considered, as they show the code "merged" (with potential conflicts resolved)
but you might have more trouble exporting the merge hyperlink, which shows the source of the merge
I like to export those baselines because their naming convention shows the stream destination name, as well as the date of the deliver, so you are left with clues about a merge between two branches.
However, they are unlabelled, so you might:
either need to convert those to a full baseline (cleartool chbl -full; see the sketch after this list)
or decide to not bother with them and leave those out of your export.
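If you take the first route, the conversion can be scripted. A minimal sketch wrapping cleartool (the stream and PVOB selectors are placeholders, and it relies on the 'deliverbl' naming prefix described above):

# Sketch: promote "deliverbl" baselines of a stream to full baselines before export.
import subprocess

STREAM = "stream:my_int_stream@/vobs/my_pvob"   # stream selector (placeholder)
PVOB = "/vobs/my_pvob"                          # PVOB tag (placeholder)

# List the stream's baseline names, one per line
names = subprocess.run(["cleartool", "lsbl", "-fmt", "%n\\n", "-stream", STREAM],
                       capture_output=True, text=True, check=True).stdout.splitlines()

for name in names:
    if name.startswith("deliverbl"):
        # chbl -full turns the incremental deliver baseline into a full one
        subprocess.run(["cleartool", "chbl", "-full", f"baseline:{name}@{PVOB}"], check=True)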