Control tracked version of external dependency

I am trying to set up a DVC repository for machine learning data with different tagged versions of the dataset. I do this with something like:
$ cd /raid/ml_data # folder on a data drive
$ git init
$ dvc init
$ [add data]
$ [commit to dvc, git]
$ git tag -a 1.0.0
$ [add or change data]
$ [commit to dvc, git]
$ git tag -a 1.1.0
I have multiple projects that each need to reference some version of this dataset. The problem is I can't figure out how to set up those projects to reference a specific version. I'm able to track the HEAD of the repo with something like:
$ cd ~/my_proj # different drive than the remote
$ mkdir data
$ git init
$ dvc init
$ dvc remote add -d local /raid/ml_data # add the remote on my data drive
$ dvc cache dir /raid/ml_data/.dvc/cache # tell DVC to use the remote cache
$ dvc checkout
$ dvc run --external -d /raid/ml_data -o data/ cp -r /raid/ml_data data
This gets me the latest version of the dataset, symlinked into my data folder, but what if I want some projects to use the 1.0.0 version and some to use the 1.1.0 version, or another version? Or for that matter, if I update the dataset to 2.0.0 but don't want my existing projects to necessarily track HEAD and instead keep the version with which they were set up?
It's important to me to not create a ton of local copies of my dataset as the /home drive is much smaller than the /raid drive and some of these datasets are huge.

I think you are looking for the data access set of commands.
In your particular case, dvc import makes sense:
$ dvc import /raid/ml_data data
if you want to get the most recent version (HEAD). Then you will be able to update it with the dvc update command (if 2.0.0 is released, for example).
$ dvc import /raid/ml_data data --rev 1.0.0
if you'd like to "fix" it to the specific version.
Avoiding copies
Also make sure that symlinks are enabled for the second project, as described in the Large Dataset Optimization guide:
$ dvc config cache.type reflink,hardlink,symlink,copy
(the --global, --local, and --system config modifiers apply the setting for everyone at once, for just one project, and so on)
See the detailed instructions in that guide.
Overall, it's a great setup, and it looks like you got pretty much everything right. Please don't hesitate to follow up or post other questions here; we'll help you with this.

Related

Publicly verify open source code was deployed as is

Let’s say we have an open source project on GitHub with its source code. I want to deploy it to a server and give people access to some informative page or tool that shows, in a trustworthy way, that what was deployed to the server is exactly the code that is in the repository.
Is there something that can help with this? Maybe an open source tool like Travis-CI that can help verify that a deploy was done using the latest code from X branch? Or perhaps there is a known way to do this using some kind of checksum for a deployable source code?
Any help/guidance would be much appreciated.
This is a build issue: you need to include in your compiled deliverable a checksum that shows which sources it was compiled from.
How you do that depends on the language you compile.
Go, for instance, would use build flags (as in this example):
go build -i -v -ldflags="-X main.version=$(git describe --always --long --dirty)" github.com/MyUserName/MyProject
Travis-CI would use the same ldflags, but with a fixed value.
This example simply adds the Git commit as a flag:
script:
- go get -t -v ./...
- diff -u <(echo -n) <(gofmt -d .)
- go vet $(go list ./... | grep -v /vendor/)
- go test -v -race ./...
# Only build binaries from the latest Go release.
- if [ "${LATEST}" = "true" ]; then gox -os="linux darwin windows" \
-arch="amd64" -output="logshare.." \
-ldflags "-X main.Rev=`git rev-parse --short HEAD`" -verbose ./...; fi
Again, this is a build step, before the deployment step.
It is illustrated here for Go, but the idea remains the same for any other language.
At runtime, the program can display its version and tell the user which GitHub reference it was built from; users can then check that this reference is the one they expect in the repository.
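The check a user would do with that reference can be sketched in shell. This is only an illustration under assumptions: matches_branch_tip is a hypothetical helper, the throwaway repository stands in for a clone of the project, and the reported variable stands in for whatever revision string the deployed program prints.

```shell
# Hypothetical helper: succeeds when the revision reported by a deployed
# program resolves to the same commit as the tip of the given branch.
matches_branch_tip() {
  reported_commit=$(git rev-parse --verify --quiet "$1^{commit}") || return 1
  branch_commit=$(git rev-parse --verify --quiet "$2^{commit}") || return 1
  [ "$reported_commit" = "$branch_commit" ]
}

# Throwaway repository standing in for a clone of the project (assumption).
demo_repo=$(mktemp -d)
cd "$demo_repo"
git init -q
git -c user.name=demo -c user.email=demo@example.com \
  commit -q --allow-empty -m "release"
branch=$(git symbolic-ref --short HEAD)

# The kind of string a build would embed, as in the ldflags examples above.
reported=$(git rev-parse --short HEAD)

if matches_branch_tip "$reported" "$branch"; then
  echo "deployed revision $reported is the tip of $branch"
else
  echo "deployed revision $reported does not match $branch"
fi
```

Note that this only confirms the reported revision is the branch tip. It still relies on trusting the build that embedded the string.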
Alternative approach: signing a Docker image
Your Travis-CI setup could then apply the signing in its build stages before sharing that image.
But you will need to manage the Docker Content Trust (DCT) keys.

Basic Github Repo Creation

I just created my first Github Repo through the Github Bash on Windows 10.
I ran:
$ mkdir Projects
$ mkdir Projects/DataScientistsToolbox
$ mkdir Projects/DataScientistsToolbox/sample
$ cd Projects/DataScientistsToolbox/sample
$ git init
Initialised empty Git repository in /home/osboxes/Projects/DataScientistsToolbox/sample/.git/
$ ls -la
I am really struggling to understand this code. So I created three directories: Projects, DataScientistsToolbox, and sample.
What does the cd command on my code do?
Does the git init code run the creation?
What does the ls -la do?
Lastly, I can't seem to find where the repo is saved on my computer, is it located on the desktop or in a special spot?
Thank you, and sorry about the large number of questions.
The cd command lets you change your working directory.
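Yes, git init creates (initializes) a new repository on your machine in your current working directory.
ls lists the files and directories in the current working directory. The -l flag prints one entry per line with details such as permissions, size, and date, and -a also shows hidden entries whose names start with a dot, such as .git.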
pwd makes your machine print the working directory. Use it to find out where your repository was created.
GitHub's documentation explains how to create a GitHub repository, and a list of basic Unix commands may help you get started with Unix systems.
You created only one repository with git init. cd stands for change directory; it's like double-clicking a folder to open it.
So basically your repo is in Projects/DataScientistsToolbox/sample.
ls is used for listing all files in the current directory you're in. -la are flags for different styles of displaying (try running just ls).
Also, none of these commands has anything to do with GitHub. git init is part of Git, and mkdir, cd, and ls are ordinary shell commands.
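To see this for yourself, here is a small sketch that repeats the steps from the question in a throwaway directory. The temporary directory from mktemp and the variable names workdir and repo_dir are my own choices, not part of the original commands, and it assumes git is installed.

```shell
# Work in a throwaway directory so nothing in the real home directory is touched.
workdir=$(mktemp -d)
cd "$workdir"

# mkdir -p creates the whole chain of nested directories in one step.
mkdir -p Projects/DataScientistsToolbox/sample
cd Projects/DataScientistsToolbox/sample

# git init creates a hidden .git directory here; that directory is the repository.
git init -q

# pwd prints the absolute path of the current directory, i.e. where the repo lives.
repo_dir=$(pwd)
echo "repository location: $repo_dir"

# Plain ls skips names starting with a dot; ls -la shows them, including .git.
ls
ls -la
```

The plain ls prints nothing because the only entry is hidden, while ls -la lists .git along with the . and .. entries.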

What's a good version control system to use to version my whole local filesystem?

My specific use case is that I want to methodically document everything significant I do while setting up my new workstation (running Mac OS X Lion).
I would like to version control, in the same repository, files that are at totally different places on my file system, for instance files in /etc, ~/, /Libraries, etc.
Some thoughts/details on my requirements:
This repo will be for personal use only. I'll use a GUI client to browse my settings history.
I initially wanted to use Git, hosted in one large Github private repository, but as you can't clone subfolders the way you would do it with SVN, I'd have to create symlinks everywhere, which does not seem convenient.
So, would I be better off setting up a local SVN server and just checking in the files I want, when I want to version them?
You can use Mercurial, Git, ..., and then simply ignore all the files you don't want to version. Create the repository in the root and track the rest. Like (for Mercurial):
$ cd /
$ hg init
$ echo '^(?!(etc|Libraries))' > .hgignore # ignore everything outside etc and Libraries
$ hg add
$ hg commit -m "initial checkin"
An alternative is to use more specialized tools such as etckeeper that are made for tracking configuration data.

Retrieve old version of a file without changing working copy parent

How do you get a copy of an earlier revision of a file in Mercurial without making that the new default working copy of the file in your workspace?
I've found the hg revert command and I think it does what I want but I'm not sure.
I need to get a copy of an earlier revision of my code to work with for a few minutes. But I don't want to disturb the current version which is working fine.
So I was going to do this:
hg revert -r 10 myfile.pls
Is there a way to output it to a different directory so my current working version of the file is not disturbed? Something like:
hg revert -r 10 myfile.pls > c:\temp\dump\myfile_revision10.pls
The cat command can be used to retrieve any revision of a file:
$ hg cat -r 10 myfile.pls
You can redirect the output to another file with
$ hg cat -r 10 myfile.pls > old.pls
or by using the --output flag. If you need to do this for several files, then take a look at the archive command, which can do this for an entire project, e.g.,
$ hg archive -r 10 ../revision-10
This creates the folder revision-10 which contains a snapshot of your repository as it looked in revision 10.
However, most of the time you should just use the update command to check out an earlier revision. Update is the command you use to bring the working copy up to date after pulling in new changes, but it can also be used to move your working copy back to an older revision if needed. So
$ hg update -r 10 # go back
(look at your files, test, etc...)
$ hg update # go back to the tip
The command you use is this:
hg cat -r 10 myfile.pls > C:\temp\dump\myfile_revision10.pls
Knowing a bit of Unix helps with Mercurial commands. Perhaps cat should have a built in alias print or something similar.

Mercurial: Get non-versioned copy of an earlier version of a file

How do I get a non-versioned copy of an older version of a file from a mercurial repository?
Edit: I have changed somefile.png (binary file) in my local copy. I am looking for a command which will allow me to get an earlier version of somefile.png so that I can compare it with my modified copy (using an image viewer) before I commit changes. How can I do that?
The command you are looking for is cat:
hg cat [OPTION]... FILE...
output the current or given revision of files
hg cat -o outputfile.png -r revision somefile.png
You can then compare somefile.png with outputfile.png
If you mean the equivalent of svn export, that would be:
hg archive ..\project.export
See also this TipsAndTricks section:
Make a clean copy of a source tree, like CVS export
hg clone source export
rm -rf export/.hg
or using the archive command
cd source
hg archive ../export
The same thing, but for a tagged release:
hg clone --noupdate source export-tagged
cd export-tagged
hg update mytag
rm -rf .hg
or using the archive command
cd source
hg archive -r mytag ../export-tagged
There's a tip on hgtip that might get you most of the way there:
Merging binary files
While it specifically talks about merging, the diff stuff should be generic enough to do it outside a merge...
I'm not sure I understand the question. You could just copy it somewhere else using the normal file copy tools of your operating system.