Cleanup Assets and Blobs after media encoding job - azure-media-services

My current workflow leaves several copies of the original assets and blobs that should be cleaned up. I want to make sure I only keep the assets needed to play back the videos that have been encoded. I am also wondering if there is a more efficient way of creating encoded assets; it seems the main improvement would be uploading the blob directly to a Media Services container instead of having to copy the blob.
I am using the following workflow:
From my website, a video file is uploaded to a container that is not associated with the Media Services account
After the file is uploaded, a queue message is created for the blob
An Azure WebJob receives the queue message
The uploaded blob is copied to the Media Services container
A Media Services asset is created from the copied blob
A Media Encoder job is started from the new asset using the "H264 Adaptive Bitrate MP4 Set 720p" preset
After the job is complete, the original blob, the first asset, and the queue message are deleted

As you already mentioned, one optimization is to eliminate uploading the media file to storage that is not associated with the Media Services account. Also, since you are already using Azure queues, you can use them to be notified when the job is done. With the proposed changes your workflow becomes:
In the UI, call asset creation before the upload starts.
The user uploads directly to the storage account associated with the Media Services account. See https://stackoverflow.com/a/28951408/774068
Once the upload is finished, trigger creation of the media job with an Azure queue notification attached to it. See https://learn.microsoft.com/en-us/azure/media-services/media-services-dotnet-check-job-progress-with-queues
Listen for the queue message signalling job completion and delete the source asset once the message is received. You can use Azure Functions for this (a minimal sketch follows below). See https://learn.microsoft.com/en-us/azure/azure-functions/functions-bindings-storage
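A minimal sketch of that last step, assuming a Python queue-triggered Azure Function and the v3 management SDK (azure-mgmt-media); the linked article uses the older .NET SDK, and the notification payload shape and asset-name field here are placeholders:

import json
import azure.functions as func
from azure.identity import DefaultAzureCredential
from azure.mgmt.media import AzureMediaServices

# Placeholders: substitute your own subscription, resource group and account.
SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "<resource-group>"
ACCOUNT_NAME = "<media-services-account>"

def main(msg: func.QueueMessage) -> None:
    # The queue carries job-state notifications; the exact payload depends on
    # how you publish them, so treat this parsing as an assumption.
    notification = json.loads(msg.get_body().decode("utf-8"))
    if notification.get("state") != "Finished":
        return

    client = AzureMediaServices(DefaultAzureCredential(), SUBSCRIPTION_ID)
    # Delete only the source (input) asset; keep the encoded output asset so
    # the video can still be streamed.
    client.assets.delete(RESOURCE_GROUP, ACCOUNT_NAME, notification["inputAssetName"])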

How to zip objects in object storage

How would you go about organizing a process for zipping objects that reside in object storage?
For context, our users sometimes request an extraction of their entire data from the app - think of "Downloading Twitter archive" feature of Twitter.
Our users are able to upload files, so the extracted data must contain files stored in object storage (Google Cloud Storage). The requested data must be packed into a single .zip archive.
A naive approach would look like this:
download all files from object storage to disk,
zip all files into an archive,
put the .zip back in object storage,
send a link to download the .zip file back to the user.
However, there are multiple disadvantages here:
sometimes the files for even a single user add up to gigabytes,
if the process of zipping is interrupted, it has to start over.
What's a reasonable way to design a process that generates a .zip archive of user files that originally reside in object storage?
Unfortunately, your naive approach is the only way because Cloud Storage offers no compute abilities. Archiving files requires compute, memory, and temporary storage.
The key is to choose a service, such as Compute Engine, that can meet your file-processing requirements: multi-gigabyte files, fast compression, and high-speed networking.
Another issue will be the time it takes to download, zip, and upload. That calls for an asynchronous, event-based design: start the file processing and notify the user (email, message, web inbox, etc.) once it is complete.
You could make the process synchronous and display a progress bar, but that will complicate the design.
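One way to reduce the memory and disk pressure on whatever compute service you pick is to stream objects straight out of GCS into the archive and stream the archive back into GCS. A rough sketch with the google-cloud-storage Python client (the bucket, prefix, and archive names are placeholders; zipfile can write to a non-seekable upload stream on Python 3.6+, but verify against your runtime):

import shutil
import zipfile
from google.cloud import storage

def zip_user_files(bucket_name: str, user_prefix: str, archive_name: str) -> None:
    client = storage.Client()
    bucket = client.bucket(bucket_name)

    # Stream the archive directly into a new GCS object instead of staging it on disk.
    archive_blob = bucket.blob(archive_name)
    with archive_blob.open("wb") as archive_stream:
        with zipfile.ZipFile(archive_stream, "w", zipfile.ZIP_DEFLATED) as zf:
            for blob in client.list_blobs(bucket_name, prefix=user_prefix):
                # Copy each object into its zip entry chunk by chunk; force_zip64
                # because individual files may exceed 2 GiB.
                entry = zipfile.ZipInfo(blob.name)
                with blob.open("rb") as src, zf.open(entry, "w", force_zip64=True) as dst:
                    shutil.copyfileobj(src, dst)

This keeps the worker's footprint roughly constant regardless of archive size; if the job is interrupted, only the partially written archive object is lost.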

How does Azure Purview perform Data Lineage in Azure Data Factory when there are multiple Copy Data Activities on the same Source?

My particular scenario is this (see screenshot: Data Factory Pipeline):
I have a .txt file in Azure Blob Storage.
I copy this file from Blob to Azure SQLDB
I copy the same file to an archive location in the same blob container
I then delete the file after it's archived
When I triggered the Pipeline in Azure Data Factory, Purview gave me a data lineage that only showed the Archive copy activity, and never showed the BLOB to Azure SQLDB activity.
Refer to this screenshot for lineage: Purview Data Lineage
When I navigate to the Azure SQLDB destination in Purview, it says no data lineage is available for this asset.
Here is what I have done or thought could be the reason:
Maybe the copy activities need to be in different pipelines. I tested this and same result occurs
Maybe because I deleted the file, it's not picking up the Blob source to Azure SQLDB copy activity. I will be testing this, but I think it's unlikely since it did pick up the Blob source to Blob archive destination copy activity
Maybe it only picks up the last copy activity for a given source, and not all of them. I tested this and it did not change the data lineage. It is possible that I need to do something in Azure Purview to "reset" the data lineage, but I think it uses the last pipeline run for the source; I noticed it did update the data lineage when I separated the pipeline into two pipelines (one for loading Azure SQLDB, and the other for archiving the blob file)
I'm fairly stuck on this one... I will completely remove the archiving and see what happens, but according to all of the Microsoft Documentation, Data Lineage for Azure Blob and Azure SQLDB is supported, so this should be working. If anyone has answers or ideas, I would love to hear them.
Update: My newest theory is that there is a time lag between when you run a pipeline and when the data lineage is refreshed in Purview... I am going to try disconnecting the Data Factory and reconnecting.
Update #2: Deleting the Data Factory connection and reconnecting did nothing from what I can tell. I have been playing with how to delete assets via the REST API, which is apparently the only way to delete assets/relationships in Purview... I think my next step will be to delete the Purview account and storage.
Update #3: Even after spinning up a new Purview account, the lineage still fails to show Blob to Azure SQLDB. I am wondering if it's because of the ForEach logic I have, but that doesn't make sense because the archive copy activity was in the ForEach as well. I'm at a loss. I have other copy activities from Blob to Azure SQLDB that work, but not this one.
Thanks
After a LOT of testing, I believe the problem is that Purview does not know how to handle Copy activities that include additional columns.
Does NOT work: With additional columns
Works: Without additional columns
The ONLY difference was the fact that one had additional columns mapped and the other did not. Slight design gap...
I have created this Azure Purview Feature Request! https://feedback.azure.com/forums/932437-azure-purview/suggestions/42357196-allow-data-lineage-to-be-performed-on-azure-data-f
Please vote for this so it can be implemented in a future release.

Copy image from REST to Blob using Data Factory

In my scenario I need to consume an external REST API. One of the fields in the response is a URL to an image. What I'm trying to achieve is to grab the image behind that URL and store it in blob storage. This would be easy with a Function or WebJob, but is there a way to do it with Data Factory on its own?
Based on my research, only the HTTP connector supports downloading a file; use it as the source dataset of a Copy activity, with a Blob Storage dataset as the sink.

Google Cloud Dataflow: while in PubSub streaming mode, TextIO.Read uses massive amounts of vCPU time

I'm using Google Cloud Platform to transfer data from an Azure server to a BigQuery table (working nice and smoothly, functionally speaking).
The pipeline looks like this (see screenshot: Dataflow streaming pipeline).
The 'FetchMetadata' part of the pipeline is a simple TextIO.Read implementation where I read a 66-line .csv file with metadata from a GCP Storage bucket:
// Dataflow SDK 1.x API: read the metadata CSV from the GCS bucket
PCollection<String> metaLine = p.apply(TextIO.Read.named("FetchMetadata")
    .from("gs://my-bucket"));
When I use my pipeline in Batch mode this works like a charm: first the metadata file is loaded in the pipeline in less than a second of vCPU time, and then the data itself is loaded. Now when running in Streaming mode I would love to replicate that behaviour to some extent, but when I use the same code there is a problem: when running the pipeline for just 15 minutes (actual time), the TextIO.Read block uses a whopping 4 hours of vCPU time. For a pipeline that will be permanently running for a low-budget project this is unacceptable.
So my question: is it possible to change the code so the file is periodically read again (if the file changes I want the pipeline to be updated, say hourly) and not continuously like it's doing right now?
I've found some documentation that mentions TextIO.Read.Bound, which looks like a good place to start solving this issue, but it's no solution for my periodic update problem (as far as I know).
I was stuck in a similar situation. The way I solved this problem is a bit different. I would like the community's insights into this solution.
I had files being updated every hour in a GCS bucket. I followed the blog post about scheduling Dataflow jobs from App Engine or Google Cloud Functions.
I had an App Engine endpoint configured to receive the object change notifications from the GCS bucket which contained the files to be processed. For every file that was created (an update is also a create operation in an object store), the App Engine application would submit a job to Google Dataflow. The job would read the lines from the file (the file name is in the HTTP request body) and publish them to a Google Pub/Sub topic.
A streaming pipeline was then subscribed to the Google Pub/Sub topic and would process and output the relevant rows to BigQuery. This way, the streaming pipeline ran at the minimum worker count when idle, the ingest of the files happened through a batch pipeline, and the streaming pipeline scaled with the volume of publications on the Google Pub/Sub topic.
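For illustration, the streaming half of that design looks roughly like this. This is a sketch in the Beam Python SDK for brevity (the original pipeline is Java), and the subscription, table, and schema names are placeholders:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    (p
     | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
           subscription="projects/<project>/subscriptions/<ingest-sub>")
     | "DecodeLines" >> beam.Map(lambda msg: msg.decode("utf-8"))
     # Placeholder transform: turn each published CSV line into a BigQuery row dict.
     | "ToRow" >> beam.Map(lambda line: {"raw_line": line})
     | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
           "<project>:<dataset>.<table>",
           schema="raw_line:STRING",
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))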
In the tutorial for submitting jobs to Google Dataflow, the jar is executed from a terminal. I modified the code to submit the job to Google Dataflow using templates, which can be executed with parameters. This way the job submission becomes very lightweight while still creating a job for every new file uploaded to the GCS bucket. Please refer to this link for details about executing Google Dataflow job templates.
Note: Please mention in the comments if the answer needs to be modified for the code snippets of the dataflow job template and app engine application and I will update the answer accordingly.
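As a rough sketch of the App Engine side of that submission, assuming the template has already been staged to GCS and using the Dataflow REST API through google-api-python-client (the template path, job-name scheme, and the inputFile parameter name are placeholders):

from googleapiclient.discovery import build

def launch_ingest_job(project: str, file_name: str) -> dict:
    """Launch a staged Dataflow template that reads the uploaded file and publishes its lines to Pub/Sub."""
    dataflow = build("dataflow", "v1b3")
    # Job names must be lowercase letters, digits and dashes.
    job_name = ("ingest-" + file_name.replace("/", "-").replace(".", "-")).lower()
    request = dataflow.projects().templates().launch(
        projectId=project,
        gcsPath="gs://<bucket>/templates/ingest-to-pubsub",  # placeholder: staged template path
        body={
            "jobName": job_name,
            "parameters": {"inputFile": "gs://<bucket>/" + file_name},  # assumed template parameter
        },
    )
    return request.execute()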

How to Sync Queryable Metadata with Cloud Blob Storage

I am trying to understand the general architecture and components needed to link metadata with blob objects stored in the cloud, such as Azure Blob Storage or AWS.
Consider an application which allows users to upload blob files to the cloud. With each file there would be a myriad of metadata describing the file, its cloud URL, and perhaps the emails of users the file is shared with.
In this case, the file gets saved to the cloud and the metadata goes into some type of database somewhere else. How would you go about doing this transactionally, so that it is guaranteed that both the file and the metadata were saved? If one of the two fails, the application would need to notify the user so that another attempt can be made.
There's no built-in mechanism to span transactions across two disparate systems, such as Neo4j/MongoDB and Azure/AWS blob storage as you mentioned. This is up to your app to manage, and how you go about it is really a matter of opinion/discussion.
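One common app-level pattern (my own sketch, not something prescribed above) is a two-phase write: record the metadata in a pending state, upload the blob, then mark the record committed, compensating if either step fails. The metadata-store calls below (save_pending, mark_committed, mark_failed) are hypothetical placeholders for whatever database you use; the blob calls are from the azure-storage-blob v12 Python SDK:

from azure.storage.blob import BlobServiceClient

def upload_with_metadata(db, conn_str: str, container: str, name: str, data: bytes, metadata: dict) -> str:
    """Upload a blob and its metadata so that a failure in either step is detectable."""
    blob_service = BlobServiceClient.from_connection_string(conn_str)
    blob_client = blob_service.get_blob_client(container=container, blob=name)

    # Phase 1: record the metadata first, flagged as pending (db is a placeholder store).
    record_id = db.save_pending(name=name, url=blob_client.url, **metadata)
    try:
        # Phase 2: upload the actual file.
        blob_client.upload_blob(data, overwrite=True)
    except Exception:
        # Compensate: mark the record failed so the app can notify the user and retry.
        db.mark_failed(record_id)
        raise

    # Phase 3: only now is the blob/metadata pair considered committed.
    db.mark_committed(record_id)
    return record_id

Queries should then treat only committed records as authoritative, and a periodic sweeper can clean up blobs whose records are still pending or failed.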