What is the best practice to move data files to a GitHub repo?

I am fairly new to GitHub, so please bear with me if I'm asking something very basic here.
I have a GitHub repository with a folder structure like this:
company --> scripts --> python --> python_scripts.py
company --> inbound --> Data files
company --> outbound --> Data files
The data files in the inbound and outbound folders add up to ~2 GB and keep growing daily. What is the best practice for storing data files in a Git repo?

You should not store these data files in your repository. They are data that your programs operate on, not part of the source code, and adding them will just bloat the repository.
You should remove the directories in question with git rm -r --cached DIRECTORY and then add them to .gitignore. You should then store them in some other location that's a better fit for data, like an artifact server or a cloud storage bucket, or just locally on the affected system.
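For example, assuming the paths are company/inbound and company/outbound relative to the repository root, the cleanup might look roughly like this:
# stop tracking the data directories but keep them on disk
git rm -r --cached company/inbound company/outbound
# ignore them from now on
echo "company/inbound/" >> .gitignore
echo "company/outbound/" >> .gitignore
git add .gitignore
git commit -m "Stop tracking data files"
Note that this only stops tracking the files going forward; anything already committed remains in the repository's history unless you rewrite it (for example with git filter-repo).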

I would recommend using Git Large File Storage (LFS); see https://git-lfs.github.com/
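A minimal sketch, assuming git-lfs is installed and the folder layout from the question:
git lfs install                       # set up the LFS hooks once per machine
git lfs track "company/inbound/**"    # store matching files as LFS pointers
git lfs track "company/outbound/**"
git add .gitattributes
git commit -m "Track data folders with Git LFS"
Keep in mind that hosted LFS storage has its own quotas, so check your provider's limits before pushing multi-gigabyte data.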

Related

Configure Eleventy data folder for GitHub

How can I point the _data folder at another Git repository? The whole project is cloned from GitHub and I want to keep it that way for easy updates, but the _data folder needs to sync with a separate data repository. How can I configure this?
It's actually just one YAML file: _data/authorlist.yaml
What should I put into package.js if I want a script to sync that YAML file with GitHub?
Thank you.
You can use Git submodules to keep a shared repository of data files that you can use in multiple projects. This lets you keep a reference to the data repository in each project and pull updates to it with a single command. The great thing is that submodules are a built-in feature of Git, so they're independent of any npm scripts, environments (as a bash script would be), or frameworks. See the link above for documentation on how to set up and work with submodules.
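For example (the data-repo URL is a placeholder, and this assumes _data isn't already tracked in the main repository):
# add the shared data repository as a submodule at _data
git submodule add https://github.com/yourname/data-repo.git _data
git commit -m "Add _data submodule"
# later, pull the latest commits from the data repository
git submodule update --remote _data
git add _data
git commit -m "Update _data submodule"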

Can I use an external location as a separate server for git-lfs?

I have a repository on GitHub.com and I need to add a large file to it (>1 GB). git-lfs seems to be the solution, but GitHub probably offers only up to 2 GB for free. Can I use some other location as a separate large-file server while the actual code stays on GitHub?
I tried pointing the LFS endpoint at an Azure DevOps repo while keeping the Git origin on GitHub.com, but it does not seem to work that way.
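For reference, this is roughly what I tried, writing a custom endpoint into .lfsconfig (the Azure DevOps URL is just a placeholder):
# point LFS at a different server than the Git remote
git config -f .lfsconfig lfs.url https://dev.azure.com/myorg/myproject/_git/my-lfs-storage/info/lfs
git add .lfsconfig
git commit -m "Use external LFS endpoint"
git lfs env    # shows which endpoint LFS will actually use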

How to get source code from a GitHub data export?

I decided to back up all my GitHub data and found this: https://help.github.com/en/github/understanding-how-github-uses-and-protects-your-data/requesting-an-archive-of-your-personal-accounts-data
I managed to get the .tar.gz file and it seems to contain all my repositories, but there is no source code in there. Judging by the size, it looks like some kind of archive in objects/pack/*.pack
Is there any way to access original source code?
"it looks like some kind of archive in objects/pack/*.pack"
According to Download a user migration archive:
The archive will also contain an attachments directory that includes all attachment files uploaded to GitHub.com and a repositories directory that contains the repository's Git data.
Those might be bare repositories or bundles.
Once uncompressed, try to git clone one of those folders (into a new empty folder).
The OP johnymachine confirms in the comments:
git clone .\repositories\username\repository.git\ .\repository\
Meaning repository.git is a bare repo (Git database only, no files checked out)

Is it safe for multiple users to use a Git repo on a shared network drive?

We're using Eclipse (with the eGit plugin) and I want to host a repo on a shared network drive that several users have write access to.
Can users all point at the same original repo (on the shared drive) or would it be better for each user to clone the repo to their local drive, work off this local version, and push changes to the networked original as required?
Eclipse seems to allow you to "import" (to your Eclipse workspace) a project from a Git repo, but that imported project doesn't seem to be monitored by Git until you choose to "Share project". At this step the working directory becomes that of the repo's working dir for that project. Presumably all users sharing this project would have the same working dir i.e. that of the repo on the shared drive.
I am not clear on the implications of this, but it doesn't seem like a good idea, on first inspection! How will it handle basic problems like 2 users trying to open the same file for editing simultaneously, for instance?
Thanks.
It's better that each person has their own repo.
Clone your current repository as a bare repo and place it on the network drive.
e.g.
git clone --bare /path/to/current/cool_project cool_project.git
Move the cool_project.git to your network drive, and get everyone to clone from that. Bare repos don't have a working directory, hence the name, so they are safe to push to.
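Each developer then clones from the share and pushes back to it, roughly like this (the network path is just an example):
git clone /mnt/network_drive/cool_project.git   # personal working copy
cd cool_project
# ...edit and commit locally...
git push origin master                          # publish to the shared bare repo
git pull                                        # pick up everyone else's changes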
See chapter 4 of the Pro Git book, Git on the Server, and specifically section 4.2, for more details.
From the sound of it, you are talking about each user pointing to the Git repository over the network, rather than having individual repositories on each developer's computer and pushing to a 'central' repository. If I am reading your question correctly, that is not a great way to take advantage of what Git has to offer. Git is a distributed version control system, so everyone should have their own repository and push changes to a central repository that you run your CI builds from.

Version control of uploaded images to file system

After reading Storing Images in DB - Yea or Nay? I think that the file system is the right place for storing images. But I would like to know how you handle backup/version control of uploaded images in your different environments (dev/stage/prod) and for network load balancing?
These problems are pretty easy to handle when working with a database, e.g. making a backup from the production environment and restoring the DB in the development environment.
What do you think of using, for example, Git to handle version control of uploaded files?
Production Environment:
An image is uploaded to a shared folder on the web server.
Meta data is stored in the database
The image is automatically added to a git repository
Developer at work:
Checks out the source code.
Runs a script to restore the database.
Runs a script to get the latest images.
I think the solution above is pretty smooth for the developer: the images are under version control and the environments can be isolated from each other.
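As a rough sketch of what I have in mind (paths and branch name are just placeholders):
# on the web server, run after images have been uploaded to the shared folder
cd /var/www/uploads
git add .
git commit -m "Add uploaded images"
git push origin master
# on a developer machine, to get the latest images
cd ~/project/uploads
git pull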
For us, version control isn't as important as distribution. Metadata is added via the web admin and the images are dropped on the admin server. Rsync scripts push those out to the cluster that serves prod images. For dev/test, we just rsync from the prod master server back to the dev server.
The rsync is great for load balancing and distribution. If you sub in git for the admin/master server, you have a pretty good solution.
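A push/pull with rsync might look something like this (host names and paths are made up):
# push new images from the admin/master server out to a web node
rsync -az --delete /srv/images/ web01:/srv/images/
# refresh a dev/test box from the prod master
rsync -az prod-master:/srv/images/ /srv/images/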
If you're OK with a backup that preserves file history at the time of backup (as opposed to version control with every revision), then some adaptation of this may help:
Automated Snapshot-style backups with rsync.
It can work, but I would store those images in a git repository which would then be a submodule of the git repo with the source code.
That way, a strong relationship exists between the code and the images, even though the images are in their own repo.
Plus, it avoids issues with git gc or git prune being less efficient with a large number of binary files: if the images are in their own repo, with few variations for each of them, the maintenance on that repo is fairly light, whereas the source code repo can evolve much more dynamically, with the usual Git maintenance commands in play.