What file name should I use to identify GitHub's userName/repoName structure for analysis on my local filesystem without nested folders?

I am analyzing GitHub repositories.
To keep track of the same resource throughout multiple calculations I want to use the userName and repoName in combination as an identifier.
First I thought a simple
userName_repoName
would do, but apparently repoNames can now contain underscores as well (userNames can't, so splitting after the first underscore would work okay, I guess; see the sketch below).
So my question is:
Do you have any advice for me on how to create a cross-platform failsafe identifier from a GitHub userName and repoName?
I want to avoid nested folders, as it is easier to keep track of same-level folders than of multiple nested ones (one user with multiple repos).
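A minimal sketch of that splitting idea in Python, assuming (as the question notes, and as GitHub's signup rules state) that userNames may contain only alphanumerics and hyphens, never underscores, so the first underscore is always the separator:

def repo_dir_name(user_name, repo_name):
    # userNames cannot contain "_", so the first "_" unambiguously marks the split
    return f"{user_name}_{repo_name}"

def split_dir_name(dir_name):
    # str.partition() splits on the first occurrence only
    user_name, _, repo_name = dir_name.partition("_")
    return user_name, repo_name

Round-tripping works because the separator cannot occur in the first component: split_dir_name(repo_dir_name("my-user", "my_repo")) returns ("my-user", "my_repo").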

Related

Nextcloud - mass removal of collaborative tags from files

Due to an oversight in a flow routine that was meant to tag certain folders on upload into the cloud, a huge number of unwanted files were also tagged in the process. Now there are thousands upon thousands of files that have the wrong tag and need to be untagged. Neither doing this by hand nor re-uploading with the correct flow routine is really a workable option. Is there a way to do the following:
Crawl through every entry in a folder
If it's a file, untag it; if it's a folder, don't
Everything I found about tags and Nextcloud dealt with handling them at upload time, never with going over existing files to change their tags.
Is this possible?
Nextcloud stores this data in the configured database, so you can simply remove the assignments from the DB.
The assignments are stored in oc_systemtag_object_mapping, while the tags themselves are in oc_systemtag. Once you have found the ID of the tag to remove (let's say 4), you can simply remove all its assignments:
DELETE FROM oc_systemtag_object_mapping WHERE systemtagid = 4;
If you would like to do this only for a specific folder, it does not get much more complicated. Files (including their folder structure!) are stored in oc_filecache, and oc_systemtag_object_mapping.objectid references oc_filecache.fileid. So with some joining and LIKEing, you can limit the rows to delete. If your tag is also used for non-file objects, your condition should include oc_systemtag_object_mapping.objecttype = 'files'.
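For example, as a hedged sketch in MySQL/MariaDB multi-table DELETE syntax (the tag ID 4 and the folder path 'files/SomeFolder/%' are placeholders for your own values):

DELETE m
FROM oc_systemtag_object_mapping m
JOIN oc_filecache f ON f.fileid = m.objectid
WHERE m.systemtagid = 4
  AND m.objecttype = 'files'
  AND f.path LIKE 'files/SomeFolder/%';

As with any manual database surgery, take a backup first and run the equivalent SELECT before the DELETE.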

Advanced search on GitHub excluding a specific repository

I'm trying to figure out if there's any way to use the various fields defined in the GitHub advanced search form to effectively exclude hits from a specific repo. In other words, I want to do a code search for all hits landing outside a given repository, an inverse repository search if you will.
I may be able to tune the size field with an inequality, but I'm hoping there's something I may be overlooking that has this sort of search in mind. My specific use case is that there's a major monorepo on our remote but there's a small constellation of support repositories which reuse some bits of the main repo that need to be refactored. I'm trying to identify those source hits in the smaller repos that need to be upgraded.
https://github.com/search/advanced?q=test&type=Repositories
Use -repo in the normal search: you can exclude a repository by prepending a hyphen (-).
foo_library -repo:owner1/repoX -repo:owner2/repo
See also docs.github.com or github.community.

perform aggregate computation on GitHub data

I have a GitHub (Enterprise) organization with multiple repositories. Each repository contains one or more .properties files.
Some of these .properties files will be contained in folders that contain "i18n" in their path or filename.
These .properties files would be relevant for translation processes.
As a basic step: I would need to get the average/min/max frequency of commits that involve translation-relevant files (as defined above) for each one of the repositories.
As an ideal scenario: I would also need to determine how many key-values were changed/added/deleted by each commit on average, to better determine the resulting workload for the translation process.
What I tried so far:
GitHub GraphQL API v4: it seems to me that the API is very well suited to searching, but not so much to computing aggregations.
GitHub REST API v3: specific commits can be searched for, but not by file extension. While file extension is a query criterion for files themselves, it is not one for commits.
Any hint on how to achieve this?
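In the absence of server-side aggregation, one workable fallback is to page through commits with the REST API and filter/aggregate client-side. A minimal Python sketch, assuming a GitHub Enterprise host github.example.com and an access token (both placeholders), using the documented /repos/{owner}/{repo}/commits endpoints:

import requests

API = "https://github.example.com/api/v3"   # placeholder Enterprise base URL
HEADERS = {"Authorization": "token <YOUR_TOKEN>"}  # placeholder token

def is_translation_file(path):
    # "translation-relevant" as defined in the question
    return path.endswith(".properties") and "i18n" in path.lower()

def translation_commits(owner, repo):
    # Page through the commit list; the list endpoint omits per-file details,
    # so each commit is fetched individually to inspect its files.
    page = 1
    while True:
        r = requests.get(f"{API}/repos/{owner}/{repo}/commits",
                         headers=HEADERS,
                         params={"per_page": 100, "page": page})
        r.raise_for_status()
        commits = r.json()
        if not commits:
            return
        for c in commits:
            detail = requests.get(c["url"], headers=HEADERS).json()
            if any(is_translation_file(f["filename"])
                   for f in detail.get("files", [])):
                yield detail
        page += 1

Average/min/max commit frequencies per repository can then be computed from the yielded commits in plain Python; for the key-level statistics, each file object in the commit detail also carries a patch field whose added/removed lines could be parsed for .properties keys.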

Mongo schema: Todo-list with groups

I want to learn Mongo and decided to create a more complex todo application for learning purposes.
The basic idea is a task-list where tasks are grouped in folders. Users may have different access to those folders (read, write) and tasks may be moved to other folders. Usually (especially for syncing) tasks will be requested by-folder and not alone.
Basically I thought about three approaches and would like to hear your opinion for them. Maybe I missed some points or just have the wrong way of thinking.
A - List of References
Collections: User, Folder, Task
Folders contain references to Users
Folders contain references to Tasks
Problem
When updating a Task, a reference to the Folder is needed. Either that reference is stored within the Task (redundancy) or it must be passed with each API call.
B - Subdocuments
Collections: User, Folder
Folders contain references to Users
Tasks are subdocuments within Folders
Problem
There is no way to update a Task without knowing its Folder. Both still need to be transmitted, as in A, but there is no redundancy.
C - References
Collections: User, Folder, Task
Folders contain references to Users
Tasks keep a reference to their Folders
Problem
Requesting a folder means searching in a long list instead of having direct references (A) or just returning the folder (B).
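To make the three approaches concrete, here are hypothetical example documents (all field names are illustrative, not part of the question):

// A - folder holds lists of references
{ _id: "folder1", name: "Inbox", users: ["user1"], tasks: ["task1", "task2"] }
// B - tasks are embedded subdocuments
{ _id: "folder1", name: "Inbox", users: ["user1"], tasks: [{ title: "Buy milk", done: false }] }
// C - each task points at its folder
{ _id: "task1", title: "Buy milk", folder: "folder1" }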
If you don't need any metadata for the folder except the name, you could also go with:
Collections: User, Task
Task has field folder
User has arrays read_access and write_access
Then
You can get a list of all folders with
db.task.distinct("folder")
The folders a specific user can access are automatically retrieved when you retrieve the user document, so they are basically known at login.
You can get all tasks a user can read with
db.task.find( { folder: { $in: read_access } } )
with read_access being the respective array you got from your user's document. The same goes for write_access.
You can find all tasks within a folder with a simple find query for the folder name.
Renaming a folder can be achieved with one update query on each of the collections.
Creating a folder or moving a task to another folder can also be achieved in simple manners.
So without metadata for folders, that is what I would do. If you need metadata for folders, it can become a little more complicated, but basically you could manage it independently of the tasks and users above, using a folder collection that contains the metadata, with _id being the folder name referenced in user and task.
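As a hedged sketch of the day-to-day operations under this schema, in mongo shell syntax ("old", "new" and someTaskId are placeholders):

// rename a folder: one update per collection that stores the name
db.task.updateMany({ folder: "old" }, { $set: { folder: "new" } })
// the positional $ operator updates the first matching array element,
// which is enough as long as a folder name appears at most once per array
db.user.updateMany({ read_access: "old" }, { $set: { "read_access.$": "new" } })
db.user.updateMany({ write_access: "old" }, { $set: { "write_access.$": "new" } })
// move a task to another folder
db.task.updateOne({ _id: someTaskId }, { $set: { folder: "new" } })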
Edit:
Comparison of the different approaches
I stumbled over this link, which might be of interest to you. It contains a discussion of transitioning from a relational database model to Mongo. The difference being that in a relational database you usually aim for third normal form, where one of the goals is to avoid bias towards any particular access pattern, whereas in MongoDB you can model your data to best fit your access patterns (while keeping in mind not to introduce possible data anomalies through redundancy).
So with that in mind:
your model A is the way you would do it in a relational database (each type of information in one table, referenced by id)
model B is tailored to an access pattern where you always list a complete folder and tasks are only edited when the folder is open (if you retrieve one folder, you have all its tasks without an additional query)
C is a different relational model than A, and I think a little closer to third normal form (without knowing the exact tables)
My suggestion does not support folder access as optimally as B, but it makes it easier to show and edit single tasks
Problems that could come up with these schemas: since A and C are basically relational, you can get problems with foreign keys, because MongoDB does not enforce foreign-key constraints (e.g. you could delete a folder while tasks still reference it in C, or delete a task without deleting its reference in the folder in A). You could circumvent this by enforcing the constraint from the application. For B, the 16 MB document limit could become a problem, circumventable by allowing folders to split into multiple documents when they reach a certain task count.
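Enforcing such a constraint from the application could, as a hedged sketch in mongo shell syntax (collection names as in approach C; "folder1" is a placeholder), look like this:

// only delete a folder once no task references it any more (approach C)
// note: this check-then-delete is not atomic; concurrent writers need extra care
if (db.task.countDocuments({ folder: "folder1" }) === 0) {
    db.folder.deleteOne({ _id: "folder1" });
}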
So, new conclusion: I think A and C might not show you the advantages of MongoDB (and might even be more work to build in MongoDB than in SQL), since they are what you would do in a traditional relational database, which is not what MongoDB was designed for (e.g. there is no join statement and there are no foreign-key constraints). In sum, B best matches your access pattern "Usually (especially for syncing) tasks will be requested by-folder" while still allowing you to easily edit and move tasks once the folder is opened.

split a Mercurial repo into different baby repos

The situation is: I once placed some conceptually related code into one package in the hope of interweaving it gradually later, but it turns out the parts eventually became independent of each other (they can be safely separated). Therefore, I have decided it's time to split them into different packages, but I'm not sure how to do it in a way that also keeps the respective version-control history for each sub-package. Any ideas?
The Convert extension included with the standard distribution is used for this purpose. Specifically, check out the --filemap option, which can include, exclude and rename files and directories when converting from one repository to another.
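A hedged sketch, assuming the sub-package lives in a directory named pkg_a (a placeholder) and should become the root of the new repository. A filemap file, say filemap.txt, would contain:

include pkg_a
rename pkg_a .

and the conversion is then run once per sub-package:

hg convert --filemap filemap.txt original-repo pkg_a-repo

Each resulting repository keeps only the history of the files it includes.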