How do you access DVC remote storage to view the file content? - data-management

I am very new to DVC and I encounter a few problems with remote storage. I stored my data into dvc remote storage here (.dvc/config file):
[core]
    remote = dvc-remote
['remote "dvc-remote"']
    url = /tmp/dvc-storage
Questions:
Where can I access it in my file explorer? Or is there any way to check for the content inside without dvc pull?
I first stored a dataset in this remote storage. After retrieving it, I deleted some pictures from the dataset and pushed it back to the storage. Is my original dataset overwritten, or are the original dataset files kept?
I only ran dvc add on the dataset for the DVC remote, so why is it that in Iterative Studio the path of my other files changed to /tmp/dvc-storage/d5/df97ac43b0 as well?

To recap, your DVC project's default remote is a local directory (/tmp/dvc-storage). OK.
All your data files are in /tmp/dvc-storage, so that's where you could point your file explorer, but this type* of DVC remote (local directory) is not meant for direct human handling. The files have been renamed and reorganized in the same way as the project cache.
Basically, the directory structure (let's call it the space dimension) AND the data versions (the time dimension) are flattened into a content-addressable data store. This is why you see all those 2-letter directories containing long hex file names (similar to Git's object storage).
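For illustration, a hypothetical listing of such a store might look like this (hash values are made up and truncated, like the one in your Studio path):
$ tree /tmp/dvc-storage
/tmp/dvc-storage
├── d5
│   └── df97ac43b0...   # contents of one tracked file, named after its MD5 hash
└── f3
    └── 0b14c6d2a1...   # another file (or a .dir listing describing a tracked directory)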
By default nothing is deleted from the cache (or remote storage) during regular dvc operations. The data store is append-only for the most part. This way you can git checkout and dvc checkout (or dvc pull) the data for a previous project version (past Git repo commit).
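For example, to get the data matching an older commit back into your workspace (the commit reference is a placeholder):
git checkout <old-commit>   # move the repo, including .dvc files, to a past revision
dvc checkout                # restore the matching data versions from the local cache
dvc pull                    # or fetch them from remote storage if they aren't cached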
You'd have to specifically garbage collect certain data from cache or storage locations using dvc gc, and even then it's designed to try preserving stuff you might need in the future.
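A rough sketch of what that looks like (pick the scope flags that match what you actually want to keep):
dvc gc --workspace             # keep only data referenced by the current workspace
dvc gc --all-commits --cloud   # keep data referenced by any commit, and also clean the remote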
Note that dvc add does not affect remote storage, it only works with the local cache. You need to dvc push and dvc pull to sync the data cache with a DVC remote.
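So a typical round trip looks something like this (the dataset path is just an example):
dvc add data/dataset   # hash the data and move it into the local cache (.dvc/cache)
dvc push               # upload cached objects to the default remote (/tmp/dvc-storage here)
dvc pull               # in another clone or machine, download them back into its cache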
Wrt the Studio UI, I'm not sure where you see that path, but it's correct (as is hopefully clearer now). You'd get the same from dvc get --show-url, so maybe reading that reference helps.
* Note that DVC remotes can integrate with cloud versioning on Amazon S3, Azure Blob Storage, and Google Cloud Storage (probably more in the future). This means that if you use those remote types and enable this feature, you'll see the same directory structure as in your project folder (not the obfuscated cache structure). Cloud-versioned remotes are easier to handle directly (although even then, direct handling may not be ideal).
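If you go that route, the feature is enabled per remote; a minimal sketch, with the remote name and bucket as placeholders:
dvc remote add -d myremote s3://my-bucket/dvc
dvc remote modify myremote version_aware true   # keep files under their original names and rely on the bucket's object versioning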

Related

The "created" date of a file is lost after pushing my files to GitHub

The "created" date is lost after I push my files to git. When I clone my repository the "created" date is the current date. Is that normal?
When you clone a repository, Git does not check files out using the original time specified in your commits. Instead, it creates the files as normal, using the current time.
This is in fact normal, and it also has the nice benefit of working properly with Make, which uses file times to determine whether a file needs to be rebuilt. Since Git always uses the current time, files Git has checked out will be considered changed, and Make will rebuild any products that depend on them.
Yes that's normal.
As far as file metadata (created, last modified, executable or not, etc.) goes, Git only saves whether the file is executable. The other values, like when it was created, are managed entirely by your filesystem, independent of Git.
When you clone the repository, the files are created on your filesystem at that moment, so the created metadata of each file is the current date.
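You can see that the mode bit is the only such metadata Git tracks, e.g. with git ls-files (example output, hashes and file names made up):
$ git ls-files --stage
100644 8baef1b4abc478178b004d62031cf7fe6db6f903 0   README.md
100755 1f7391f92b6a3792204e07e99f71f643cc35e7e1 0   deploy.sh
The 100644 vs. 100755 mode is the executable flag; no per-file creation or modification timestamps are stored in the tree objects.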

What is Best practice to Move DATA FILES to GITHUB Repo

I am slightly new to GitHub, so please bear with me if I'm asking something very basic here.
I have a GitHub repository where I have a folder structure like below:
company --> scripts --> python --> python_scripts.py
company --> inbound --> Data files
company --> outbound --> Data files
The size of my data files in the inbound and outbound folders is ~2 GB and keeps increasing daily. What is the best practice for storing data files in a Git repo?
You should not store these data files in your repository. They are data that your programs operate on and are not part of the source code. Adding them to your repository will just bloat the repository.
You should remove the directories in question with git rm -r --cached DIRECTORY and then add them to .gitignore. You should then store them in some other location that's a better fit for data, like an artifact server or a cloud storage bucket, or just locally on the affected system.
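For the inbound and outbound folders from your layout, that would look roughly like:
git rm -r --cached inbound outbound        # stop tracking them, but keep the files on disk
printf 'inbound/\noutbound/\n' >> .gitignore
git add .gitignore
git commit -m "Stop tracking data directories"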
I would recommend using Git Large File Storage (LFS), see https://git-lfs.github.com/
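If you do go with LFS, the basic setup is (the tracked patterns are just an example):
git lfs install                # one-time setup per machine
git lfs track "inbound/**"     # tell LFS which paths to manage (this writes .gitattributes)
git lfs track "outbound/**"
git add .gitattributes
git commit -m "Track data files with Git LFS"
Keep in mind that hosting providers typically put quotas on LFS storage and bandwidth, so 2 GB of daily-growing data may still be a poor fit.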

cf push - how can I update/push selected files (modified files) using CloudFoundry

Pushing the entire code every time is time consuming and it's not a good practice.
How can I perform incremental push? Is there a way?
Expanding on jimmc's comment:
Cloud Foundry clients (CLI, Java Client, etc) automatically do incremental push of application bits. Here's how it works:
When a CF client is given a directory to push, it gets a list of files in the directory and all subdirectories. When a client is given an archive (.jar, .war, .zip) to push, it explodes the archive locally on the client machine. Only the first level of the archive is exploded; any embedded archives (e.g. .jar files in a .war file) are not. It then gets a list of files in the exploded archive.
The client then calculates a SHA for each file and sends the list of files with SHAs to the CF resource matching API. CF will respond with a list of files that it already has (e.g. from a previous push). The client then sends only the files that CF doesn't already have.
push should be capable of sync:
$ cf p -h
NAME:
push - Push a new app or sync changes to an existing app
However, by default, cf push recursively pushes the contents of the current working directory.
Note: If you want to push more than a single file, but not the entire contents of a directory, consider using a .cfignore file to tell cf push what to exclude.
Example .cfignore file contents:
tmp/
log/
my_unnecessary_file.txt
When you execute your next cf push to deploy the application, it will omit the files and directories listed in your .cfignore file.

getting started with fossil

I just got started with fossil. My reasons for selecting fossil are:
cross-platform
single executable
single repository file (typical extension .fossil)
supposedly easy to use (but aren't they all?)
I have several questions. Context: Suppose I want to keep track of changes to every file inside several directories, aptly named dir1, dir2, etc. Suppose I want to keep a copy on a USB stick. Suppose I want to keep a copy on another partition of the same disc as I move back and forth between Linux and Windows partitions. I'm the only user and may not always have access to the internet.
I would like to store dir1.fossil outside of dir1. Can I do that? The user-manual instructions tell me to create dir1.fossil from inside dir1, and that's where the dir1.fossil files are currently created in my setup. Ideally I'd like my dir1.fossil, dir2.fossil, etc. files to be stored together in another directory, e.g. one named fossilreposdir located at the root. Is that possible?
I would like to stick a usb flash drive into my laptop and push/pull repositories from it in a plug-and-play manner.
If possible I would also like to push/pull repositories across my windows and linux partitions without using the usb stick.
If setting it up is too much of a headache (for my poor head), I will resort to simple copy-pasting of the .fossil repositories back and forth.
Yes.
Yes.
What do you want to use? Me, I use Dropbox to hold my repositories. Then every machine registered with Dropbox has access to all my repositories.
# go into the working directory
cd ../dir1
# create the repository somewhere else
fossil new ../fossilreposdir/test.fsl
# open that repository in the local working directory
fossil open ../fossilreposdir/test.fsl
# add files
fossil addremove
# commit
fossil ci -m "initial commit"

Version control of uploaded images to file system

After reading Storing Images in DB - Yea or Nay? I think that the file system is the right place for storing images. But I would like to know how you handle backup/version control of uploaded images in your different environments (dev/stage/prod) and for network load balancing?
These problems are pretty easy to handle when working with a database, e.g. making a backup of the production environment and restoring the DB in the development environment.
What do you think of using, for example, Git to handle version control of the uploaded files?
Production Environment:
An image is uploaded to a shared folder on the web server.
Meta data is stored in the database
The image is automatically added to a git repository
Developer at work:
Checks out the source code.
Runs a script to restore the database.
Runs a script to get the latest images.
I think the solution above is pretty smooth for the developer, the images will be under version control and the environments can be isolated from each other.
For us, the version control isn't as important as the distribution. Meta data is added via the web admin and the images are dropped on the admin server. Rsync scripts push those out to the cluster that serves prod images. For dev/test, we just rsync from prod master server back to the dev server.
The rsync is great for load balancing and distribution. If you sub in git for the admin/master server, you have a pretty good solution.
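A minimal sketch of that kind of rsync distribution (host names and paths are made up):
# push newly uploaded images from the admin/master server out to a prod web node
rsync -az --delete /srv/admin/images/ web1:/srv/www/images/
# refresh a dev box from the prod master
rsync -az prod-master:/srv/admin/images/ /srv/dev/images/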
If you're OK with a backup that preserves file history at the time of backup (as opposed to version control with every revision), then some adaptation of this may help:
Automated Snapshot-style backups with rsync.
It can work, but I would store those images in a git repository which would then be a submodule of the git repo with the source code.
That way, a strong relationship exists between the code and the images, even though the images are in their own repo.
Plus, it avoids issues with git gc or git prune being less efficient with a large number of binary files: if images are in their own repo, with few variations for each of them, the maintenance on that repo is fairly light, whereas the source code repo can evolve much more dynamically, with the usual git maintenance commands in play.
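Wiring that up would look something like this (repository URLs and the submodule path are placeholders):
# inside the source-code repo: reference the images repo as a submodule
git submodule add https://example.com/git/images.git assets/images
git commit -m "Add images repo as a submodule"
# later, clone the code together with the images
git clone --recurse-submodules https://example.com/git/app.git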