How does DVC store differences on the directory level into DVC cache? - version-control

Can someone explain how DVC stores differences at the directory level in the DVC cache?
I understand that DVC-files (.dvc) are metafiles used to track data and models and to reproduce pipeline stages. However, it is not clear to me how the process of creating branches, committing to them, and switching back to master is saved as differences.

Short version:
The .dvc file contains the md5 of a JSON file inside the cache that describes the current state of the directory.
When the directory is updated, the .dvc file gets a new md5 and a new JSON file is created with the updated state of the directory.
In git you store the .dvc file, so that DVC knows (based on the md5) where to look for information about the directory.
Longer version:
Let me break down the individual steps of directory handling with DVC.
Let's assume you have a data directory that you want to put under DVC control:
├── 1
└── 2
You run dvc add data to make DVC track your directory. As a result, DVC produces a data.dvc file. As you noted, this file contains the metadata required to connect your git repository with your data storage. Inside this file (among other things) you can see:
- md5: f437247ec66d73ba66b0ade0246fcb49.dir
  path: data
The md5 part is used to store information about the directory in the DVC cache (.dvc/cache):
(dvc3.7) ➜ repo$ tree .dvc/cache
├── 26
│   └── ab0db90d72e28ad0ba1e22ee510510
├── b0
│   └── 26324c6904b2a9cb4b88d6d61c81d1
└── f4
    └── 37247ec66d73ba66b0ade0246fcb49.dir
If you open the file with the .dir suffix, you will see that it describes the current state of data:
(dvc3.7) ➜ repo$ cat .dvc/cache/f4/37247ec66d73ba66b0ade0246fcb49.dir
[{"md5": "b026324c6904b2a9cb4b88d6d61c81d1", "relpath": "1"},
{"md5": "26ab0db90d72e28ad0ba1e22ee510510", "relpath": "2"}]
As you can see, the individual files (1 and 2) are described by entries in this file.
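The mapping from an md5 to a location in the cache can be sketched in a few lines of Python. This is only an illustration of the layout visible in the tree output above (first two hex characters as the directory, the rest as the file name); cache_path is a hypothetical helper, not part of DVC's API.

```python
from pathlib import PurePosixPath


def cache_path(md5: str, cache_dir: str = ".dvc/cache") -> str:
    """Split a checksum into '<first two chars>/<rest>' under the cache dir.

    Assumption: this mirrors the layout shown in the tree output; it is
    not taken from DVC's source code.
    """
    return str(PurePosixPath(cache_dir) / md5[:2] / md5[2:])


# The directory manifest referenced from data.dvc
print(cache_path("f437247ec66d73ba66b0ade0246fcb49.dir"))
# One of the files listed inside that manifest
print(cache_path("b026324c6904b2a9cb4b88d6d61c81d1"))
```

Both the .dir manifest and the individual files follow the same naming scheme, which is why they sit side by side in .dvc/cache.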
When you change your directory:
(dvc3.7) ➜ repo$ echo 3 >> data/3
(dvc3.7) ➜ repo$ dvc commit data.dvc
The content of data.dvc will be updated:
- md5: 12f4b7d54a32e58818e27fba28376fba.dir
  path: data
And there is a new file inside the cache:
├── 12
│   └── f4b7d54a32e58818e27fba28376fba.dir
(dvc3.7) ➜ repo$ cat .dvc/cache/12/f4b7d54a32e58818e27fba28376fba.dir
[{"md5": "b026324c6904b2a9cb4b88d6d61c81d1", "relpath": "1"},
{"md5": "26ab0db90d72e28ad0ba1e22ee510510", "relpath": "2"},
{"md5": "6d7fce9fee471194aa8b5b6e47267f03", "relpath": "3"}]
From the perspective of git, the only change is inside data.dvc
(assuming you ran git commit after adding data with 1 and 2 inside):
diff --git a/data.dvc b/data.dvc
index 098aec5..88d1a90 100644
--- a/data.dvc
+++ b/data.dvc
@@ -1,6 +1,6 @@
-md5: a427c5bf8680fbf8d1951806b28b82fe
+md5: 1b674d61c195eea7a6b14f176c020b9c
-- md5: f437247ec66d73ba66b0ade0246fcb49.dir
+- md5: 12f4b7d54a32e58818e27fba28376fba.dir
   path: data
   cache: true
   metric: false
NOTE: The first md5 is the checksum of the data.dvc file itself, so it had to change when the directory md5 changed.
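To see what actually differs between the two directory states, the two .dir manifests can be compared entry by entry. The following is a minimal sketch using the JSON shown in this answer; load_manifest and diff_manifests are hypothetical helpers written for illustration, not DVC functions.

```python
import json

OLD = """[{"md5": "b026324c6904b2a9cb4b88d6d61c81d1", "relpath": "1"},
{"md5": "26ab0db90d72e28ad0ba1e22ee510510", "relpath": "2"}]"""

NEW = """[{"md5": "b026324c6904b2a9cb4b88d6d61c81d1", "relpath": "1"},
{"md5": "26ab0db90d72e28ad0ba1e22ee510510", "relpath": "2"},
{"md5": "6d7fce9fee471194aa8b5b6e47267f03", "relpath": "3"}]"""


def load_manifest(text: str) -> dict:
    """Turn a .dir JSON document into a {relpath: md5} mapping."""
    return {entry["relpath"]: entry["md5"] for entry in json.loads(text)}


def diff_manifests(old: dict, new: dict) -> dict:
    """Classify paths as added, removed, or modified between two states."""
    common = set(old) & set(new)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "modified": sorted(path for path in common if old[path] != new[path]),
    }


result = diff_manifests(load_manifest(OLD), load_manifest(NEW))
print(result)  # {'added': ['3'], 'removed': [], 'modified': []}
```

Files 1 and 2 keep their md5, so their cached copies are reused; only the new manifest and the content of file 3 are added to the cache.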


TemplateNotFound in Airflow

I have the following dir structure
├── ConfigSpark.yaml
├── project1
│   ├── dags
│   │   └──
│   └── sparkjob
│   └──
└── sparkutils
I'm trying to import the ConfigSpark.yaml file in my SparkKubernetesOperator using:
job= SparkKubernetesOperator(
task_id = 'job',
My dag is returning the following error:
jinja2.exceptions.TemplateNotFound: /opt/airflow/dags/ConfigSpark.yaml
I've noticed that if the DAG is in the same directory as ConfigSpark.yaml, my tasks run perfectly. Why does my task fail when I place the DAG in a subfolder?
I've checked my values.yaml file and airflowHome is /opt/airflow and defaultAirflowRepository is apache/airflow.
What is happening?
Airflow searches for the template file (ConfigSpark.yaml in your case) relative to the directory in which the DAG file is stored. Since your DAG is in a subfolder, the file is not found with your code.
If you store the template file in the same folder as your DAG file (/project1/dags), or in a nested folder inside /project1/dags, you can specify the path from there in your task:
job = SparkKubernetesOperator(
Which would read the template file from /project1/dags/path/to/ConfigSpark.yaml.
However, if the folder your template file is stored in is not a child of the folder your DAG file is stored in, the above won't work. In that case you can specify template_searchpath on the DAG-level:
with DAG(..., template_searchpath="/opt/airflow/dags/repo/dags") as dag:
job = SparkKubernetesOperator(
This path (for example /opt/airflow/dags) is added to the Jinja searchpath and that way ConfigSpark.yaml will be found.
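The lookup order described above can be illustrated with a small simulation. This is a sketch that assumes the DAG's own folder is checked first and the template_searchpath entries afterwards; find_template is a hypothetical stand-in for the Jinja loader and does not use Airflow itself.

```python
import os
import tempfile


def find_template(name, dag_folder, template_searchpath=()):
    """Return the first existing match, checking the DAG folder first."""
    for folder in [dag_folder, *template_searchpath]:
        candidate = os.path.join(folder, name)
        if os.path.isfile(candidate):
            return candidate
    raise FileNotFoundError(name)


# Recreate the layout from the question in a temporary directory:
# ConfigSpark.yaml at the top level, the DAG in project1/dags.
root = tempfile.mkdtemp()
dag_folder = os.path.join(root, "project1", "dags")
os.makedirs(dag_folder)
with open(os.path.join(root, "ConfigSpark.yaml"), "w") as fh:
    fh.write("# placeholder content\n")

try:
    find_template("ConfigSpark.yaml", dag_folder)
    found_without_searchpath = True
except FileNotFoundError:
    found_without_searchpath = False

found = find_template("ConfigSpark.yaml", dag_folder, template_searchpath=[root])
print(found_without_searchpath)  # False
print(found)
```

Without the extra search path the file is not found from the subfolder, which corresponds to the TemplateNotFound error in the question.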

Is it possible to auto-generate documentation for pytest tests?

I have a project which contains only pytest tests, without modules or classes, and which tests a remote project.
E.g. structure ->
Tests look like
Service requires credentials (app_id, app_key) to be passed using the Basic Auth
import base64
import pytest
import authorising.auth
from authorising.resources import Service


@pytest.fixture
def service_settings(service_settings):
    "Set auth mode to app_id/app_key"
    service_settings.update({"backend_version": Service.Auth_app})
    return service_settings


def test_basic_auth_app_id_key(application):
    """Test client access with Basic HTTP Auth using app id and app key

    Configure Api/Service to use App ID / App Key Authentication
    and Basic HTTP Auth to pass the credentials.
    """
    credentials = application.authobj.credentials
    encoded = base64.b64encode(
    response = application.test_request()
    assert response.status_code == 200
    assert response.request.headers["Auth"] == "Basic %s" % encoded
Is it possible to auto-generate documentation from the docstrings, e.g. using Sphinx?
You can use sphinx-apidoc to generate test documentation automatically from the Python docstrings.
For instance, if you have directory structure like below
|-- rst
|-- html
sphinx-apidoc -o docs/rst tests
sphinx-build -a -b html docs/rst docs/html -j auto
All your HTML docs will be under docs/html.
sphinx-apidoc supports several other options; see its documentation for details.
When using sphinx, you should add your test-folder to the Python path in the file:
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'tests')))
Then in each rst file you can simply write:
.. automodule:: test_basic_auth_app
If you also want to document the test results, take a look at Sphinx-Test-Reports.

How to import data from cloud firestore to the local emulator?

I want to be able to run cloud functions locally and debug against a copy from the production data.
Is there a way to copy the data that is online to the local firestore emulator?
This can be accomplished through a set of commands in terminal on the existing project:
1. Login to firebase and Gcloud:
firebase login
gcloud auth login
2. See a list of your projects and connect to one:
firebase projects:list
firebase use your-project-name
gcloud projects list
gcloud config set project your-project-name
3. Export your production data to gcloud bucket with chosen name:
gcloud firestore export gs://
4. Now copy this folder to your local machine. I do that directly in the functions folder:
Note: Don't miss the dot ( . ) at the end of the command below.
cd functions
gsutil -m cp -r gs:// .
5. Now we just want to import this folder. This should work with the basic command, thanks to a recent update from the Firebase team:
firebase emulators:start --import ./your-choosen-folder-name
Check out my article on Medium about it and a shorthanded script to do the job for you
Note: It's better to use a separate bucket for this, as copying into your project bucket will create the folder in your Firebase Storage.
If you are interested in gsutil arguments like -m, you can see them described by executing gsutil --help.
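The export, copy, and import steps above can be assembled programmatically so they are easy to review before running. This is a sketch only: the bucket name and export folder name are hypothetical placeholders (the original paths are not shown here), and the commands are printed rather than executed.

```python
import shlex


def build_commands(bucket, export_name, local_dir="."):
    """Build the three commands as argument lists.

    bucket and export_name are hypothetical values chosen by the caller.
    """
    remote = f"gs://{bucket}/{export_name}"
    local_copy = f"{local_dir.rstrip('/')}/{export_name}"
    return [
        # Step 3: export production data to the bucket
        ["gcloud", "firestore", "export", remote],
        # Step 4: copy the export folder to the local machine
        ["gsutil", "-m", "cp", "-r", remote, local_dir],
        # Step 5: start the emulators with the downloaded data
        ["firebase", "emulators:start", "--import", local_copy],
    ]


commands = build_commands("my-backup-bucket", "my-export")
for command in commands:
    print(shlex.join(command))
```

Each argument list could be passed to subprocess.run(command, check=True) once the values have been checked.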
My method is somewhat manual, but it does the trick. I've shared it in this useful Github thread, but I'll list the steps I took here in case you find them useful:
1. Go to my local Firebase project path.
2. Start the emulators using: firebase emulators:start
3. Manually create some mockup data using the GUI at http://localhost:4000/firestore with the buttons provided: + Start Collection and + Add Document.
4. Export this data locally using: firebase emulators:export ./mydirectory
5. For the project data located in Firebase Database / Cloud Firestore, I exported a single collection like this: gcloud firestore export gs:// --collection-ids=myCollection The export is now located under Firebase Storage in a folder with a timestamp as its name (I didn't use a prefix for my test).
6. Download this folder to a local drive with: gsutil cp -r gs:// ./production_data_export NOTE: I did this in a Windows environment. gsutil will throw the error "OSError: The filename, directory name, or volume label syntax is incorrect" if the folder name contains characters that are invalid in Windows (e.g. colons), or the error "OSError: Invalid argument.9.0 B]" if an inner file in the folder has invalid characters too. To be able to download the export locally, rename these to valid Windows names (e.g. by removing the colons) like this: gsutil mv gs:// gs://
7. Once downloaded, imitate the local export structure by renaming the folder to firestore_export and copying the firebase-export-metadata.json file from the local export folder. To make this concrete, here's the structure I got:
$ tree .
├── local_data_export
│   ├── firebase-export-metadata.json
│   └── firestore_export
│       ├── all_namespaces
│       │   └── all_kinds
│       │       ├── all_namespaces_all_kinds.export_metadata
│       │       └── output-0
│       └── firestore_export.overall_export_metadata
└── production_data_export
    ├── firebase-export-metadata.json
    └── firestore_export
        ├── all_namespaces
        │   └── kind_myCollection
        │       ├── all_namespaces_kind_myCollection.export_metadata
        │       ├── output-0
        │       └── output-1
        └── firestore_export.overall_export_metadata
8 directories, 9 files
Finally, start the local emulator pointing to this production data to be imported: firebase emulators:start --import=./mock_up_data/production_data_export/
You should see the imported data at: http://localhost:4000/firestore/
This should assist readers for now, while we await a more robust solution from the Firebase folks.
You can use the firestore-backup-restore to export and import your production data as JSON files.
I wrote a quick hack to allow for importing these JSON in the Firebase Simulator Firestore instance.
I proposed a pull request and made this npm module in the meantime.
You can use it this way:
const firestoreService = require('@crapougnax/firestore-export-import')
const path = require('path')
// list of JSON files generated with the export service
// Must be in the same folder as this script
const collections = ['languages', 'roles']
// Start your firestore emulator for (at least) firestore
// firebase emulators:start --only firestore
// Initiate Firebase Test App
const db = firestoreService.initializeTestApp('test', {
uid: 'john',
email: '',
// Start importing your data
let promises = []
try { =>
path.resolve(__dirname, `./${collection}.json`),
} catch (err) {
Obviously, since this data won't persist in the emulator, you'll typically inject it in the before() function of your test suite, or even before every test.
There is no built-in way to copy data from a cloud project to the local emulator. Since the emulator doesn't persist any data, you will have to re-generate the initial data set on every run.
I was able to make some npm scripts to import from remote to local emulator and vice-versa.
"serve": "yarn build && firebase emulators:start --only functions,firestore --import=./firestore_export",
"db:update-local-from-remote": "yarn db:backup-remote && gsutil -m cp -r gs:// .",
"db:update-remote-from-local": "yarn db:backup-local && yarn db:backup-remote && gsutil -m cp -r ./firestore_export gs:// && yarn run db:import-remote",
"db:import-remote": "gcloud firestore import gs://",
"db:backup-local": "firebase emulators:export --force .",
"db:rename-remote-backup-folder": "gsutil mv gs:// gs://$(date +%d-%m-%Y-%H-%M)",
"db:backup-remote": "yarn db:rename-remote-backup-folder && gcloud firestore export gs://"
So you can export the local Firestore data to the remote with:
npm run db:update-remote-from-local
Or, to update your local Firestore data with the remote data, run:
npm run db:update-local-from-remote
These operations back up the remote Firestore data first, making a copy of it and storing it in Firebase Storage.
I was about to add a CLI option to firebase-tools, but I'm pretty happy with the node-firestore-import-export package.
yarn add -D node-firestore-import-export
"scripts": {
"db:export": "firestore-export -a ./serviceAccountKey.json -b ./data/firestore.json",
"db:import": "firestore-import -a ./serviceAccountKey.json -b ./data/firestore.json",
"db:emulator:export": "export FIRESTORE_EMULATOR_HOST=localhost:8080 && yarn db:export",
"db:emulator:import": "export FIRESTORE_EMULATOR_HOST=localhost:8080 && yarn db:import",
"db:backup": "cp ./data/firestore.json ./data/firestore-$(date +%d-%m-%Y-%H-%M).json",
"dev": "firebase emulators:start --import=./data --export-on-exit=./data",
You will need to create a service account in the firebase console.
You can replace the GCLOUD_PROJECT environment variable with hard coded values.
mv ~/Downloads/myProjectHecticKeyName.json ./serviceAccountKey.json
That being said, the gcloud tools are definitely the way to go in production, as you will need s3 backups anyway.
You can also use the fire-import npm package to import both Firestore and Firebase Storage data.
There is also a way to import data into the local emulator from Google Cloud Storage without any commands:
1. Export Firestore to a Google Cloud Storage bucket by clicking More in Google Cloud.
2. Choose your desired file in the Google Cloud Storage bucket.
3. Open the terminal (the Google terminal shell near the search bar).
4. In the terminal, click Open editor.
5. Right-click the desired file in the online VSCode and click Download. This should start the download of a .tar file, which is in fact your exported Firestore data.
6. Create a folder in your project root (for example, 'firestore-local-data').
7. Extract the data from the .tar archive into this folder.
8. Run firebase emulators:start --import ./firestore-local-data
This should do the trick.
I wrote a little script to be able to do that:
const db = admin.firestore();
const collections = ['albums', 'artists'];
let rawData: any;
for (const i in collections) {
rawData = fs.readFileSync(`./${collections[i]}.json`);
const arr = JSON.parse(rawData);
for (const j in arr) {
.then(val => console.log(val))
.catch(err => console.log('ERRO: ', err))

Yocto recipe to update /etc/fstab

I'm having trouble updating the /etc/fstab of my Linux distribution, when building it with Yocto. I'm pretty new to Yocto, so maybe I'm off my rocker.
My latest attempt is to add a recipe named base-files_%.bbappend.
mount_smackfs () {
    cat >> ${IMAGE_ROOTFS}/etc/fstab <<EOF
# Generated from smack-userspace
smackfs /smack smackfs smackfsdefault=* 0 0
EOF
}
But, the output /etc/fstab on the distribution hasn't changed. So the questions are:
Is there a better way to do this?
How can I tell if my .bbappend file was actually executed?
ROOTFS_POSTPROCESS_COMMAND is handled in image recipes and not in package recipes. You have 2 possibilities.
Update your fstab in base-files_%.bbappend:
do_install_append () {
    cat >> ${D}${sysconfdir}/fstab <<EOF
# Generated from smack-userspace
smackfs /smack smackfs smackfsdefault=* 0 0
EOF
}
Update the fstab in your image's recipe: in this case, you just append what you wrote above (in your post) to the image's recipe.
Create a new layer using
yocto-layer create mylayer
Inside it, create a folder called recipes-core, and inside that folder create another folder called base-files.
Inside this folder create a file called base-files_%.bbappend, with the following content:
Create another folder called base-files, inside which you should put a file called fstab with your configuration.
Make sure to enable your new layer in bblayers.conf and it will work correctly; there is no need to create any other append recipe.
I had this issue and solved it using this method today.
Given the following directory structure:
└── recipes-core/
    └── base-files/
        ├── base-files/
        │   └── fstab
        └── base-files_%.bbappend
and the following content for the recipe base-files_%.bbappend in question
DESCRIPTION = "Allows to customize the fstab"
PR = "r0"
SRC_URI += " \
    file://fstab \
"
do_install_append() {
    install -m 0644 ${WORKDIR}/fstab ${D}${sysconfdir}/
}
You can specify the fstab you want in that file and include this in your own custom layer. Once the compilation is finished you will have the custom fstab on the target system.

Custom CoffeeScript compilation with GruntJS

I have a Gruntfile, which takes all *.coffee files from a certain folder, and compiles them to JS files, keeping the same folder structure (if any).
So with a folder structure like:
| |
| |
It will generate the same folder structure, but with JS files instead of Coffeescript files. I would like to have a separate rule for the widgets folder and the file, ie. I want to compile all of the contents of widgets and into one single file.
How can I exclude one file and one folder from the files property of the grunt object? This is the code I'm currently running:
files: [{
  expand: true,
  cwd: '<%= %>/scripts',
  src: '{,*/}*.{coffee,litcoffee,}',
  dest: '.tmp/scripts',
  ext: '.js'
}]
I've also seen that there are two syntaxes for declaring files: one as an object and one as an array (what I have above). What is the difference, and would the other declaration help me more in my case?
The documentation on configuring Grunt tasks covers what you want. There are actually four ways to define a files property, one of which is deprecated.
Here is a CoffeeScript version, because it is shorter. Use the Files Array Format with exclusion patterns for the compile subtask and the Compact Format for the compileJoined subtask. I hope you use grunt-contrib-coffee; grunt-coffee has been unmaintained for almost two years now and doesn't seem to have a join option.
module.exports = (grunt) ->
  grunt.initConfig
    params: app: '.' # ignore this, it's just that this file works as expected.
    coffee:
      compile:
        files: [
          cwd: '<%= %>/scripts'
          expand: yes
          src: ['**/*.{coffee,litcoffee,}' # everything coffee in the scripts dir
                '!' # exclude this
                '!widgets/**/*'] # and these
          dest: '.tmp/scripts'
          ext: '.js'
          extDot: 'first' # to make .js files from files
        ]
      compileJoined:
        options: join: yes
        # sadly you can't use expand here, so you'll have to do cwd "by hand".
        src: [
          '<%= %>/scripts/'
          '<%= %>/scripts/widgets/**/*.{coffee,litcoffee,}'
        ]
        dest: '.tmp/special.js'
  grunt.loadNpmTasks 'grunt-contrib-coffee'
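How the exclusion patterns interact can be sketched outside of Grunt. The Python snippet below assumes patterns are processed in order and that a leading ! removes matches from the set collected so far; the file names are hypothetical, and fnmatch is only an approximation of Grunt's globbing (its * also crosses directory separators, so a plain *.coffee is used instead of **/*.coffee).

```python
from fnmatch import fnmatchcase


def expand(patterns, paths):
    """Apply inclusion and '!' exclusion patterns in order."""
    selected = []
    for pattern in patterns:
        if pattern.startswith("!"):
            # Exclusion: drop everything matched so far that fits the pattern.
            selected = [p for p in selected if not fnmatchcase(p, pattern[1:])]
        else:
            # Inclusion: add new matches, keeping the original order.
            for p in paths:
                if fnmatchcase(p, pattern) and p not in selected:
                    selected.append(p)
    return selected


# Hypothetical file names, for illustration only.
paths = ["a.coffee", "special.coffee", "vendor/b.coffee", "widgets/c.coffee"]
patterns = ["*.coffee", "!special.coffee", "!widgets/*"]

print(expand(patterns, paths))  # ['a.coffee', 'vendor/b.coffee']
```

Order matters in this model: an exclusion placed before the inclusion it is meant to filter would have no effect.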
Here's some sample output from the script; it seems to work:
$ tree scripts
├── vendor
│ └──
└── widgets
2 directories, 4 files
$ rm -rf .tmp
$ grunt coffee
Running "coffee:compile" (coffee) task
>> 2 files created.
Running "coffee:compileJoined" (coffee) task
>> 1 files created.
Done, without errors.
$ tree .tmp
├── scripts
│   ├── d.js
│   └── vendor
│       └── b.js
└── special.js
2 directories, 3 files
$ cat scripts/
variableInC_coffee = "a variable"
$ cat scripts/widgets/
variableInC_coffee = variableInC_coffee.replace /\s+/, '_'
$ cat .tmp/special.js
(function() {
var variableInC_coffee;
variableInC_coffee = "a variable";
variableInC_coffee = variableInC_coffee.replace(/\s+/, '_');