How to skip big files for GitHub in Machine Learning Repo? - github

I'm new to GitHub and Machine Learning.
I've been using Conda and Jupyter Notebook for my tests in ML.
It all was fine.
But I know that it's better to use VS Code (easier to code?) and GitHub (to promote and share my code?). I don't really care about version control because I'm only taking my first steps.
But nevertheless I created a GitHub account, and I'm trying to create a repo and push my existing folders with Python files. These folders also contain the raw and modified data used in the code: .csv and .xlsx files. Some of them are 100 MB+.
I use a Mac M1 and I've tried creating a .gitignore_global file, and it works: when I run git add . from the Terminal, the files listed in .gitignore_global are not pushed (uploaded).
I've also created a .gitignore file in my working directory.
And I use find ./* -size +100M | cat >> .gitignore to add these files to the .gitignore (and it does add them).
But when I run git init -b main, git add ., git commit -m "First commit", git remote add origin <REMOTE_URL> and git push -u origin main, it still tries to upload the 100 MB+ files.
I've tried deleting the whole .git subfolder and the repo on the site, but it doesn't help.
What should I do in order not to upload (push) these files?
How do you use GitHub for DataScience / Machine Learning with these limitations?
It's really impossible not to use all the data files...
Please see above. I've tried several ways
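One likely cause, offered as a guess rather than a confirmed diagnosis: find prints paths with a leading ./ (for example ./data/big.csv), and .gitignore patterns written that way do not match anything. The sketch below strips that prefix before the paths are appended. The function name list_big_files is made up for illustration, and the .git folder is skipped on the assumption that you run it from the repository root.

```shell
# Hypothetical helper (not from the original post): print every file above a
# size threshold as a path relative to the current directory, without "./".
list_big_files() {
  # $1 is a find(1) size expression such as +100M
  find . -type f -size "$1" ! -path './.git/*' | sed 's|^\./||'
}

# Usage, from the repository root:
#   list_big_files +100M >> .gitignore
#   git rm --cached path/to/big.csv   # only if that file was already staged
```

Note that ignoring a file does not remove it from commits that already contain it. If a large file was committed earlier, the push will still try to upload it until that commit is gone, which is why starting over with a fresh git init after fixing .gitignore can be the simplest route for a brand-new repo.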

Related

Amazon Sagemaker Failed: Please check if you have a directory that has same name as the git repo?

I was working in Sagemaker, and noticed that my notebook instance was behind my GitHub repo, as I had just pushed to it from outside Sagemaker. I couldn't seem to pull, so I deleted the directory within Jupyter and git cloned my updated repo. It worked fine, but once I was done working I haven't been able to reinitialize the notebook since. Sagemaker simply says
Failure reason
Please check if you have a directory that has same name as the git repo.
I cloned from the same repo, so I don't imagine that the directory name changed. Maybe the directory is in the wrong place? Either way, how do I go in and change things if I can't open the notebook? Not sure what to do about this.
Do you have a lifecycle configuration (LCC) script that clones the repository? I can't think of another reason why the notebook would fail to start (I assume what you meant by 'reinitialize'). If you do, remove the LCC script to start your notebook. Or you could add a condition to check for the folder and clone if it does not exist, something like -
# clone only if the folder does not exist yet
if [ ! -d "$FOLDER" ]; then
  git -C /home/sagemaker-user clone "$REPOSITORY_URL"
# if you want to pull the latest changes when restarting, uncomment the lines below
# else
#   cd "$FOLDER"
#   git pull "$REPOSITORY_URL"
fi
I work at AWS and my opinions are my own.

Jupyter Notebook file is taking forever to upload on Github

I was trying to upload one of my Jupyter Notebook files on GitHub, but it's taking forever to upload.
The file is not that big either, about 17 KB. I also only have this problem with this one notebook.
Here's the screen shot.
Any kind of help or suggestions are highly appreciated.
Try using Git Bash to push your changes instead of uploading files directly on GitHub; it is less error-prone and usually faster as well. To do so, follow the steps below:
Download and install the latest version of Git Bash from here - https://git-scm.com/
Right-click on any desired location on your system.
Click “Git Bash Here”.
git config --global user.name "your name"
git config --global user.email "your email"
Go back to your GitHub account – open your project – click on “clone” – copy the HTTPS link.
git clone <HTTPS link>
A clone of your GitHub project will be created at that location on your computer.
Open the folder and paste your content.
Make sure content is not empty
Right-click inside the cloned folder where you have pasted your content.
Click “Git Bash Here” again.
You will see (master) after your location path, or (main), depending on the repository's default branch.
git add .
Try git status to check if all your changes are marked in green.
git commit -m "Some message"
git push origin master (use main instead if that is the branch name shown in the prompt)
Hope this helps! Good luck!
You could try and:
clone the repository, add the file locally, commit and push
check on github.com if your remote repository has a .gitattributes file with lfs directives in it.
Maybe that repository, managed by LFS, has reached some upload limit which would prevent any new upload.
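The .gitattributes check mentioned above can be done locally after cloning. This is a minimal sketch, assuming LFS rules appear in the usual form with a filter=lfs attribute; the function name has_lfs_rules is made up for illustration.

```shell
# Hypothetical helper: succeed if .gitattributes in the given directory
# (default: current directory) contains at least one Git LFS rule.
has_lfs_rules() {
  attrs="${1:-.}/.gitattributes"
  [ -f "$attrs" ] && grep -q 'filter=lfs' "$attrs"
}

# Usage, from the root of the cloned repository:
#   if has_lfs_rules; then echo "LFS rules found"; else echo "no LFS rules"; fi
```

If no rule is found, LFS quotas are probably not what is blocking the upload, and the clone, commit and push route is the next thing to try.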

SourceTree permanent local discard

I am new to Sourcetree and source control in general. I am working on an Android project with a few other people and we use Bitbucket for the repository. I have learned the basics but don't want to track certain files locally, specifically a lot of the gradle and iml files. But I think Stop tracking will remove those from the repo. Is there a way to have Sourcetree ignore any changes I make to certain files locally without deleting them from the repo?
Thank you in advance
You can create a file named .gitignore in the root of the project and list in it every directory that git should exclude, like:
my_folder
my_folder2
The above would be excluded from the files git tracks.
If you are already tracking the files, this command removes them from the index (they stay on disk, but the next commit removes them from the repository):
git rm -r --cached <folder>
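If the goal is to keep the files in the repo and only hide your own local edits, git's skip-worktree flag is one option that is sometimes used for this. It is a different technique from the .gitignore approach above, and the function names below are made up for illustration. Treat it as a sketch: a later pull that changes the same files can still stop and ask you to resolve them.

```shell
# Hypothetical helpers, not part of the original answer.
# Keep the files in the repository but stop git from noticing local edits.
ignore_local_changes() {
  git update-index --skip-worktree -- "$@"
}

# Undo the above so local edits show up in "git status" again.
track_local_changes() {
  git update-index --no-skip-worktree -- "$@"
}

# Usage, from inside the repository:
#   ignore_local_changes app/app.iml
```

The flag is stored only in your local index, so teammates are unaffected and nothing is deleted from the remote.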

Rstudio: Changing origin for git version control of project

I originally set up git in Rstudio while enrolled in the Data Scientist's Toolbox course at Coursera. Unfortunately, I did this in my phd project. The repository no longer exists on github. I am now attempting to write my thesis in rmarkdown using knitr and bookdown. I would like to use version control, both to learn proper git workflow and to have a structured back up of everything I have done in my thesis. However, I have been unable to change the version control repository in Rstudio.
I am unable to change this in the Tools > Version control > Project setup > Git/SVN menu. The Origin: textbox is unchangeable.
I tried creating a new project using the old phd project's working directory. This also cloned the version control settings.
How do I change the origin to accomplish what is described above?
Git, GitHub and RStudio are different things. Git is a local version control tool. You can connect your local repo to a GitHub account, which is built on git, by pushing and pulling. RStudio just provides a user interface for git and can push the repo to any git-based remote server (not only GitHub, but also GitLab).
So for your issue: if you do not want to pay GitHub for a private repo, all of your code would be public, and I don't think that is a good idea before you have finished your thesis. But version control can be done locally with git alone. Just use the git shell for it.
However, as a student you can get private repos on GitHub. Just register and find the student package. Then remove the URL of the old remote repo: cd to your working directory on the command line and use the following command to find your remote URL (most likely you will find origin):
git remote -v
Then use this to remove them:
git remote rm origin
Now you can use version control locally. If you want to connect this repo to your remote private GitHub repo, use this:
git remote add origin https://github.com/[YourUsername]/[YourRepoName].git
RStudio will pick up this git information and support your subsequent operations. A project in RStudio is not the same thing as a git repo, although projects support git as a version control tool. So you need git on the command line or in a shell to solve your problem.
This can be done by opening /your.project/.git/config
and editing the remote origin line(s), e.g. changing from git to https.
Restart Rstudio & you'll be prompted for your github username & password.
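The same change can be made without opening .git/config by hand, using git's own remote subcommands. A minimal sketch, where set_origin is a made-up helper name and the URL in the usage comment is the placeholder from the earlier answer:

```shell
# Hypothetical helper: point "origin" at a new URL, adding the remote if it
# does not exist yet.
set_origin() {
  if git remote get-url origin >/dev/null 2>&1; then
    git remote set-url origin "$1"
  else
    git remote add origin "$1"
  fi
}

# Usage, from the project directory:
#   set_origin https://github.com/[YourUsername]/[YourRepoName].git
#   git remote -v   # confirm the new URL
```

git remote set-url rewrites the same line in .git/config that the manual edit would change, so RStudio should see the new origin after a restart either way.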
This is what worked for me for migrating from github to Azure
Go to the top right Git window in RStudio and click on the gear. Now click Shell (to open the terminal there).
#remove origin
git remote rm origin
#add new origin like Azure for me via HTTPS
git remote add origin https://USER@dev.azure.com/USER/PROJECT/_git/REPONAME
#push your local repo
git push -u origin --all
#in my case put in the PAT password if you needed to generate one.
After testing, I found some clues.
RStudio is actually not very smart about this setting.
It first searches for the git files in the Rproject folder where your Rproject file is located;
if it cannot find them, it goes up to the folder that contains your Rproject folder.
However, for version control you only need the code files, while an Rproject may contain big files such as .RData, pictures, etc.
I haven't found a way to manually override this logic. The only thing you can do is delete the current git repository settings (the .git folder and 2 other git setting files); RStudio may then ask whether you want to init a new one.

Cloning repository using Fossil?

I tried to clone a repository to my home computer using Fossil scm, but instead of getting the folders, I ended up with a _FOSSIL_ file.
The steps I used were:
made a directory called Fossils
used fossil clone command which resulted in a .fossil file in Fossils
made another directory Work and used fossil open to open the .fossil file from Fossils.
This resulted in a file named _FOSSIL_ in Work.
Any ideas for what I'm doing wrong?
That looks perfectly normal. The _FOSSIL_ file indicates a checkout (aka work dir). If there's no other file in your Work directory, that means your repository is empty; or at least, that the branch you checked out (trunk by default) is empty.
What does fossil timeline show?
What occurs when you clone https://www.fossil-scm.org like:
fossil clone https://www.fossil-scm.org fossil.fossil
then
fossil open fossil.fossil
I have not heard of a FOSSIL file before. Try above step in its own directory and on more than one OS to see if the results are the same or similar to what you have now.
A sample way to use Fossil is very similar to other VCSs, apart from the initial step of setting up a repository (with either the init or the clone command).
Generally, a Fossil repository is a database file (SQLite db). So init or clone commands create that local database (commonly given a .fossil extension). Some users prefer to keep all of the "fossils" in a separate directory (e.g. ~/fossils, ~/archive, ~/museum).
Once the fossil repository db has been created, it may be opened/checked-out into a working directory, in fact, as many directories as wanted (some users prefer to keep one work-dir per active branch). This is initially done with open command from within the working directory.
After that a user can do all of the familiar VCS operations, such as checkout or create branches, edit files, commit changes, pull/push etc.
In the working directory Fossil also creates its local config database (also SQLite), named _FOSSIL_ (Windows), or .fslckout (Linux).
So the sample flow to clone and open a remote repo could be:
mkdir ~/fossils
fossil clone <remote-url> ~/fossils/aproject.fossil
mkdir aproject
cd aproject
fossil open ~/fossils/aproject.fossil
fossil user default <my-remote-username> --user <my-remote-username>
fossil status
On Windows the sequence is effectively the same; just use paths with backslashes and your user profile directory. By the way, Fossil commands accept Unix-style paths on Windows as well.
You may also be interested in checking out the ChiselApp service, which offers free public Fossil repositories; there are lots of projects there to clone and contribute to, or you can create your own.
Of course, one may try to clone Fossil's own repo from the remote-url https://fossil-scm.org
More help from the official Quick Start guide.