DVC experiments with large data in Kubernetes

We have a computer vision project. Raw data is stored in S3, and every day the labeling team sends a new increment of labeled data. We want to automate training on this new data. We use DVC for reproducible pipelines, MLflow for logging and deploying models, and Airflow for scheduling executions in K8s. We can also create a new branch, modify the model params or architecture, and trigger the training pipeline manually in GitLab CI; that pipeline does the same thing as the Airflow task.
We want to check out the raw data that the label team has labeled onto a PV, to avoid pulling huge amounts of data from S3 on every run. Each run of the DVC pipeline pulls the new labeled data and the corresponding raw data from S3, runs preprocessing, trains the model, and calculates metrics. In DVC we version the pipeline code, the labeled data, and the model params. But we do not version the raw and preprocessed data, which means that only one pipeline can run at a time.
We could version the raw and preprocessed data and use a shared cache in DVC, but that produces a lot of replicas in the cache and in the working area: to add new labeled data we have to run dvc unprotect raw_data, which copies the cached data into our workspace (the PV in K8s).
How can we track the integrity of the raw data and keep the ability to run several experiments at the same time, without producing a lot of copies of the data? Is storing the data on a PV in K8s the optimal way to do this? Should we use a shared cache?
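As a concrete illustration of the copies issue, here is a minimal sketch (the repo URL, path, and rev are placeholders) of reading a DVC-tracked file directly from the remote with dvc.api.open, which is one way an experiment could avoid materializing the whole dataset on the PV:

```python
# Sketch only: reading a DVC-tracked file straight from the remote for a given
# revision, so one experiment does not need a full checkout in its workspace.
# The repo URL, path, rev, and remote name below are placeholders.
import dvc.api

with dvc.api.open(
    "data/labeled/batch_2021_01.csv",                  # hypothetical DVC-tracked path
    repo="https://gitlab.example.com/cv/project.git",  # hypothetical repo
    rev="experiment-branch",                           # any branch/tag/commit
    remote="s3-remote",                                # DVC remote name, if not the default
    mode="r",
) as f:
    for line in f:
        ...  # feed preprocessing without copying the data onto the PV
```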

Related

Difference between DataFlow and Pipelines

I do not understand the difference between a Data Flow and a Pipeline in Azure Data Factory.
I have read and seen that a Data Flow can transform data without writing any line of code.
But I have made a pipeline and this is exactly the same thing.
Thanks
A Pipeline is an orchestrator and does not transform data. It manages a series of one or more activities, such as Copy Data or Execute Stored Procedure. Data Flow is one of these activity types and is very different from a Pipeline.
Data Flow performs row and column level transformations, such as parsing values, calculations, adding/renaming/deleting columns, even adding or removing rows. At runtime a Data Flow is executed in a Spark environment, not the Data Factory execution runtime.
A Pipeline can run without a Data Flow, but a Data Flow cannot run without a Pipeline.
Firstly, the Data Flow activity needs to be executed inside a pipeline. So I suspect you are really comparing the Copy activity and the Data Flow activity, as both are used for transferring data from a source to a sink.
I have read and seen that a Data Flow can transform data without writing any line of code.
You could look at the overview of Data Flow: it allows data engineers to develop graphical data transformation logic without writing code. All the transformation steps are built through a visual interface.
I have made a pipeline and this is exactly the same thing.
The Copy activity can be used for data transfer, but it has many limitations around column mapping. So, if you just need simple, straightforward data movement, the Copy activity is enough. To meet more specific needs, you will find many built-in transformations in the Data Flow activity, for example Derived Column, Aggregate, and Sort.
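To make the orchestration point concrete, here is a rough sketch using the azure-mgmt-datafactory Python SDK (subscription, resource group, factory, and dataset names are placeholders, and exact model signatures vary between SDK versions): the pipeline is just a list of activities, and a Copy activity or an Execute Data Flow activity are simply different items in that list.

```python
# Sketch only: names in angle brackets are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    BlobSink, BlobSource, CopyActivity, DatasetReference, PipelineResource,
)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# A Copy activity: pure data movement from one dataset (source) to another (sink).
copy_activity = CopyActivity(
    name="CopyRawData",
    inputs=[DatasetReference(type="DatasetReference", reference_name="SourceBlobDataset")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="SinkBlobDataset")],
    source=BlobSource(),
    sink=BlobSink(),
)

# The pipeline itself only orchestrates: it holds a list of activities.
# A Data Flow would appear here as another activity type (Execute Data Flow)
# and would run its transformations on Spark when the pipeline triggers it.
pipeline = PipelineResource(activities=[copy_activity])
adf_client.pipelines.create_or_update(
    "<resource-group>", "<data-factory-name>", "CopyOnlyPipeline", pipeline
)
```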

Do I need a storage (of some sort) when pulling data in Azure Data factory

*Data newbie here*
Currently, to run analytics reports on data pulled from Dynamics 365, I use Power BI.
The issue with this is that Power BI is quite slow at processing large data. I carry out a number of transform steps (e.g. merge, join, deleting or renaming columns, etc.), so when I try to run a query in Power BI with said steps, it takes a long time to complete.
So, as a solution, I decided to make use of Azure Data Factory(ADF). The plan is to use ADF to pull the data from CRM (i.e. Dynamics 365), perform transformations and publish the data. Then I'll use Power BI for visual analytics.
My question is:
What Azure service will I need in addition to Data Factory? Will I need to store the data I pull from CRM somewhere, like Azure Data Lake or Blob storage? Or can I do the transformation on the fly, right after the data is ingested?
Initially, I thought I could use the 'copy' activity to ingest data from CRM and start playing with the data. But with the copy activity I needed to provide a sink (a destination for the data, which has to be storage of some sort).
I also thought I could make use of the 'lookup' activity. I tried to use it, but I'm getting errors (no exception message is produced).
I have scoured the internet for a similar process (i.e. Dynamics 365 -> Data Factory -> Power BI), but I've not been able to find any.
Most of the processes I've seen, however, utilise some sort of data storage right after data ingest.
All responses welcome, even if you believe I am going about this the wrong way.
Thanks.
A few things here:
The Copy activity just moves data from a source to a sink. It doesn't modify it on the fly.
The Lookup activity is just for looking up some attributes to use later in the same pipeline.
ADF cannot publish a dataset to Power BI (although it may be able to push to a streaming dataset).
Your approach is correct, but you need that last step of transforming the data. You have a lot of options here, but since you are already familiar with Power BI you can use Wrangling Dataflows, which let you take a file from the data lake, apply some Power Query, and save a new file in the lake. You can also use Mapping Dataflows, Databricks, or any other data transformation tool.
Lastly, you can pull the files from the data lake with Power BI to build your report on this new data.
Of course, as always in Azure, there are a lot of ways to solve a problem or architect a service; this is the one I consider simplest for you.
Hope this helped!
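As a small illustration of that "storage in the middle" step, here is a sketch (connection string, container, and folder prefix are made up) that simply lists what a Copy activity has landed in Blob storage, which is the layer Power BI or a dataflow would read from next:

```python
# Sketch: inspect the files a Copy activity landed in a staging container.
# Connection string, container name, and prefix are placeholders.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-connection-string>")
container = service.get_container_client("crm-raw")

for blob in container.list_blobs(name_starts_with="dynamics365/accounts/"):
    # Power BI (or a Wrangling/Mapping Data Flow) would pick these files up next.
    print(blob.name, blob.size, blob.last_modified)
```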

Azure Data Factory: Data Lifecycle Management and cleaning up stale data

I'm working on a requirement to reduce the cost of data storage. It includes the following tasks:
Being able to remove files from File Share and blobs from Blob Storage, based on their last modified date.
Being able to change the tier of individual blobs, based on their last modified date.
Does Azure Data Factory have built-in activities to take care of these tasks? What's the best approach for automating the clean-up process?
1. Being able to remove files from File Share and blobs from Blob Storage, based on their last modified date.
This requirement can be covered by ADF's built-in Delete activity.
Create a Blob Storage dataset, then follow this example and configure the last-modified date range: https://learn.microsoft.com/en-us/azure/data-factory/delete-activity#clean-up-the-expired-files-that-were-last-modified-before-201811
Please also consider a backup strategy, since files removed by the Delete activity cannot be restored.
2. Being able to change the tier of individual blobs, based on their last modified date.
There is no built-in ADF feature for this. However, I notice your profile shows you are a .NET developer, so see this case: Azure Java SDK - set block blob to cool storage tier on upload. It shows that the tier can be changed in SDK code, and it's easy to create an Azure Function for such a simple task. Moreover, ADF supports the Azure Function activity.
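To show what that function could do, here is a rough Python sketch (the connection string, container name, and 30-day threshold are assumptions; the Java SDK case above does the equivalent in Java):

```python
# Sketch of the re-tiering logic an Azure Function could run on a schedule.
# Connection string, container name, and age threshold are placeholders.
from datetime import datetime, timedelta, timezone
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-connection-string>")
container = service.get_container_client("archive")
cutoff = datetime.now(timezone.utc) - timedelta(days=30)

for blob in container.list_blobs():
    if blob.last_modified < cutoff:
        # Move blobs untouched for 30 days to the Cool tier.
        container.get_blob_client(blob.name).set_standard_blob_tier("Cool")
```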

Grafana snapshots - is the needed data stored or fetched from the source?

We want to use Grafana to show measuring data. Now, our measuring setup creates a huge amount of data that is saved in files. We keep the files as-is and do post-processing on them directly with Spark ("Data Lake" approach).
We now want to create some visualization and I thought of setting up Cassandra on the cluster running Spark and HDFS (where the files are stored). There will be a service (or Spark-Streaming job) that dumps selected channels from the measuring data files to a Kafka topic and another job that puts them into Cassandra. I use this approach because we have other stream processing jobs that do on the fly calculations as well.
I now thought of writing a small REST service that makes Grafana's Simple JSON datasource usable to pull the data in and visualize it. So far so good, but as the amount of data we are collecting is huge (sometimes about 300MiB per minute) the Cassandra database should only hold the most recent few hours of data.
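For reference, a minimal sketch of what that Simple JSON /query endpoint could look like (Flask is just an example here, and the Cassandra lookup is stubbed out):

```python
# Rough sketch of the endpoints the Simple JSON datasource expects.
# Flask is only an example; the Cassandra query is stubbed out.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/", methods=["GET"])
def health():
    return "ok"  # used by Grafana's "Test connection"

@app.route("/search", methods=["POST"])
def search():
    return jsonify(["channel_a", "channel_b"])  # selectable targets

@app.route("/query", methods=["POST"])
def query():
    req = request.get_json()
    targets = [t["target"] for t in req["targets"]]
    # Here we would query Cassandra for req["range"]["from"] .. req["range"]["to"].
    # Grafana expects [[value, unix_timestamp_ms], ...] per target.
    return jsonify([
        {"target": t, "datapoints": [[42.0, 1500000000000]]} for t in targets
    ])

if __name__ == "__main__":
    app.run(port=8080)
```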
My question now is: if someone looks at the data, finds something interesting, and creates a snapshot of a dashboard or panel (or a certain event occurs and a snapshot is taken automatically), and the original data is then deleted from Cassandra, can the snapshot still be viewed? Is the data saved with it? Or does the snapshot only save metadata and query the data source anew?
According to Grafana docs:
Dashboard snapshot
A dashboard snapshot is an instant way to share an interactive dashboard publicly. When created, we strip sensitive data like queries (metric, template and annotation) and panel links, leaving only the visible metric data and series names embedded into your dashboard. Dashboard snapshots can be accessed by anyone who has the link and can reach the URL.
So the data is saved inside the snapshot and no longer depends on the original data source.
As far as I understand, a local snapshot is stored in the Grafana DB. At your data scale, using external storage (WebDAV, etc.) for snapshots may be a better option.

Data streaming to Google Cloud ML Engine

I found that Google ML Engine expects data in Cloud Storage, BigQuery, etc. Is there any way to stream data to ML Engine? For example, imagine that I need to use data from a WordPress or Drupal site to create a TensorFlow model, say a spam detector. One way is to export the whole data set as CSV and upload it to Cloud Storage using the google-cloud-php library. The problem here is that, for every minor change, we have to upload the whole data set again. Is there any better way?
By minor change, do you mean "when you get new data, you have to upload everything (the old and new data) again to GCS"? One idea is to export just the new data to GCS on some schedule, producing many CSV files over time. You can write your trainer to take a file pattern and expand it using get_matching_files/Glob, or to accept multiple file paths.
You can also modify your training code to start from an old checkpoint and train over just the new data (which is in its own file) for a few steps.
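A hedged sketch of both ideas combined (bucket paths, column names, and the model are placeholders; tf.io.gfile.glob is the TF 2.x spelling of get_matching_files):

```python
# Sketch: expand a GCS file pattern, warm-start from the last checkpoint,
# and train a few steps over only the newest export. All paths and columns
# below are placeholders.
import tensorflow as tf

# Tiny stand-in model; assume the real spam detector is defined here.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation="sigmoid")])
model.compile(optimizer="adam", loss="binary_crossentropy")

ckpt = tf.train.Checkpoint(model=model)
manager = tf.train.CheckpointManager(ckpt, "gs://my-bucket/checkpoints", max_to_keep=3)
ckpt.restore(manager.latest_checkpoint)  # no-op on the very first run

# Only the newest export; older CSVs stay in GCS untouched.
new_files = tf.io.gfile.glob("gs://my-bucket/exports/comments-2020-05-*.csv")
new_data = tf.data.experimental.make_csv_dataset(
    new_files, batch_size=64, label_name="is_spam", num_epochs=1
)

def pack(features, label):
    # Assumes all feature columns are numeric; stack them into one tensor.
    return tf.stack([tf.cast(v, tf.float32) for v in features.values()], axis=1), label

model.fit(new_data.map(pack), epochs=1)  # a few steps over the new data only
manager.save()
```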