I am exploring options to create a data mesh using Azure native services and need a federated data catalogue. Could the following be accomplished:
1. Create Azure Data Platform 1, with Purview Data Catalogue 1 cataloging the related data assets.
2. Create Azure Data Platform 2, with Purview Data Catalogue 2 cataloging the related data assets.
3. Create a federated Purview Data Catalogue 3, to which Purview Data Catalogues 1 and 2 sync their contents automatically.
4. Allow users to search all of the company's assets from Purview Data Catalogue 3 and request access to the data in Azure Data Platform 1 or Azure Data Platform 2 in order to consume it.
Thanks
CK
Azure Purview does not currently have a built-in feature for this.
The recommended approach would be to use Collections together with data access policies to support the access control.
If you must support scanning via one Purview instance and copying to another, you may subscribe to data catalog events using the Purview Kafka Endpoint and then use some compute to relay them to the other catalog.
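For example, a minimal Python consumer sketch, assuming the source account's Atlas Kafka endpoint is reachable through its managed Event Hubs namespace (the connection string is a placeholder); the forwarding step is left as a stub, since the notification usually has to be mapped into whatever the target catalog accepts (its ATLAS_HOOK topic or the Atlas REST API):

```python
import json
from azure.eventhub import EventHubConsumerClient

SOURCE_CONN_STR = "<purview-1-managed-event-hubs-connection-string>"  # placeholder

def forward_to_target_catalog(notification: dict) -> None:
    # Stub: map the notification and push it to Purview Data Catalogue 3 here,
    # e.g. via its Atlas REST endpoints or its ATLAS_HOOK topic.
    print("would forward:", notification)

def on_event(partition_context, event):
    # Each event is an Atlas entity-change notification from the source catalog.
    notification = json.loads(event.body_as_str())
    forward_to_target_catalog(notification)
    partition_context.update_checkpoint(event)

consumer = EventHubConsumerClient.from_connection_string(
    SOURCE_CONN_STR, consumer_group="$Default", eventhub_name="atlas_entities"
)
with consumer:
    consumer.receive(on_event=on_event, starting_position="-1")  # read from start
```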
You should also consider doing a one-time Purview migration if Purview Data Catalog 1 and 2 already have scanned assets.
I have a setup that consists of devices sending data to Azure IoT Hub, which uses message routing (to storage endpoints) so the data ends up as blobs in a container. The frequency of the data pushes is high. On the other end, I want to be able to query my blob container to pull files for specific dates.
I came across blob index tags, which look like a promising way to query and are supported by the Azure SDK for .NET.
I was thinking of adding tags to each blob, e.g. processedDate: <dd/mm/yyyy>, which would help me query on them later.
I found that when uploading blobs manually it is possible to add the tags, but I am not sure how, or where, to configure this in the message-routing flow where the blobs are created on the fly. So I am looking for a way to add those tags in flight as the blobs are pushed to the container.
Any help on this will be much appreciated.
Thanks much!
At present, Azure IoT Hub doesn't have a feature to populate custom endpoint metadata such as headers, properties, or tags.
However, for a storage custom endpoint like yours, you can use an Event Grid-triggered Azure Function to tag the blob based on your needs.
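As an illustration, here is a minimal sketch of such a function using the Python v2 programming model; the storage-key app setting name and the processedDate tag format are assumptions to adapt to your setup:

```python
# function_app.py -- tags each newly created blob so it can be found via blob index tags.
import os
from datetime import datetime, timezone

import azure.functions as func
from azure.storage.blob import BlobClient

app = func.FunctionApp()

@app.event_grid_trigger(arg_name="event")
def tag_new_blob(event: func.EventGridEvent):
    data = event.get_json()
    blob_url = data["url"]  # BlobCreated events carry the full URL of the new blob
    blob = BlobClient.from_blob_url(
        blob_url,
        credential=os.environ["STORAGE_ACCOUNT_KEY"],  # hypothetical app setting
    )
    # Tag the blob with the date it landed so it can be queried later.
    blob.set_blob_tags(
        {"processedDate": datetime.now(timezone.utc).strftime("%d/%m/%Y")}
    )
```

Subscribe the function to the storage account's Microsoft.Storage.BlobCreated events so it runs for every blob that message routing writes.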
My company has a lot of data (database: PostgreSQL) and the requirement now is to add a search feature on top of it; we have been asked to use Azure Cognitive Search.
I want to know how we can transform the data and send it to the Azure search engine.
There are a few cases we have to handle:
1. How will we transfer and upload the existing data into the search engine's index?
2. What is the easiest way to update the data in the search engine when new records arrive in our production database? (For now we are using Java back-end code to transform the data and update the index, but it is very time consuming.)
3. What is the best way to handle a change to the existing database structure? How do we update the indexer without a lot of rework, i.e. without recreating the indexers every time?
4. Is there any way to automatically update the index whenever database records change?
You can either write code to push data from your PostgreSQL database into the Azure Search index via the /docs/index API, or you can configure an Azure Search indexer to do the data ingestion. The upside of configuring an indexer is that you can also have it monitor the data source on a schedule for updates and have those updates reflected in the search index automatically, for example via the SQL Integrated Change Tracking Policy.
PostgreSQL is a supported data source for Azure Search indexers, although this data source is in preview (not yet generally available).
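If you go the push route, a minimal sketch with the azure-search-documents SDK could look like this; the table, index name, and field names are hypothetical:

```python
import psycopg2
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

search_client = SearchClient(
    endpoint="https://<your-service>.search.windows.net",
    index_name="products",                       # hypothetical existing index
    credential=AzureKeyCredential("<admin-key>"),
)

conn = psycopg2.connect("<postgres-connection-string>")
cur = conn.cursor()
cur.execute("SELECT id, name, description FROM products")  # hypothetical table

docs = [
    {"id": str(row[0]), "name": row[1], "description": row[2]}
    for row in cur.fetchall()
]
# merge-or-upload keeps the call idempotent if the same rows are pushed again.
result = search_client.merge_or_upload_documents(documents=docs)
print(f"Indexed {sum(1 for r in result if r.succeeded)} documents")
```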
Besides the answer above, which involves coding on your end, there is a solution you may implement using the Azure Data Factory PostgreSQL connector with a custom query that tracks recent records, and a pipeline copy activity that sinks to an Azure Blob Storage account.
Then, within Data Factory, you can chain a copy activity that writes to an Azure Cognitive Search index and add a trigger so the pipeline runs at specified times.
Once the staged data is in the storage account in DelimitedText format, you can also use the built-in Azure Blob indexer with change tracking enabled.
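As a rough sketch of that last step (all names, keys, and the schedule are placeholders, and the target index is assumed to exist already):

```python
from datetime import timedelta

from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexerClient
from azure.search.documents.indexes.models import (
    IndexingParameters,
    IndexingParametersConfiguration,
    IndexingSchedule,
    SearchIndexer,
    SearchIndexerDataContainer,
    SearchIndexerDataSourceConnection,
)

client = SearchIndexerClient(
    endpoint="https://<your-service>.search.windows.net",
    credential=AzureKeyCredential("<admin-key>"),
)

# Data source: the container that the Data Factory pipeline stages the files into.
data_source = SearchIndexerDataSourceConnection(
    name="staged-postgres-extract",
    type="azureblob",
    connection_string="<storage-connection-string>",
    container=SearchIndexerDataContainer(name="staged-data"),
)
client.create_or_update_data_source_connection(data_source)

# Indexer: parse the staged files as delimited text and run on a schedule.
# Change detection for blob sources is based on each blob's LastModified timestamp.
indexer = SearchIndexer(
    name="staged-postgres-indexer",
    data_source_name="staged-postgres-extract",
    target_index_name="products",            # hypothetical existing index
    schedule=IndexingSchedule(interval=timedelta(hours=1)),
    parameters=IndexingParameters(
        configuration=IndexingParametersConfiguration(
            parsing_mode="delimitedText",
            first_line_contains_headers=True,
            query_timeout=None,  # SQL-only setting; leave unset for blob sources
        )
    ),
)
client.create_or_update_indexer(indexer)
```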
I want to catalog data stored in CSV files in Azure Blob Storage. I looked for a way to get metadata from Blob Storage and found that Data Catalog is an option. The thing is, a CSV file is handled as a blob type and we cannot profile it. I want CSV files in Blob Storage to act as tables.
Is this possible using Azure Data Catalog?
Yes, you can use Data Catalog, but I would recommend Azure Purview (it is still possible through Data Catalog).
Registering assets from a data source copies the assets’ metadata to Azure, but the data remains in the existing data-source location.
For updated Data Catalog features, please use the new Azure Purview service, which offers unified data governance for your entire data estate.
Introduction to Azure Purview (preview) - Azure Purview
This article provides an overview of Azure Purview, including its features and the problems it addresses. Azure Purview enables any user to register, discover, understand, and consume data sources.
This article outlines how to register an Azure Blob Storage account in Purview and set up a scan.
For more information on blob index tags: they categorize data in your storage account using key-value tag attributes. These tags are automatically indexed and exposed as a searchable multi-dimensional index so you can easily find data. This article shows you how to set, get, and find data using blob index tags: Use blob index tags to manage and find data on Azure Blob Storage.
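For example, a short sketch of querying by tag with the Python SDK; the account URL, credential, and the dataset tag are placeholder assumptions:

```python
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient(
    account_url="https://<account>.blob.core.windows.net",
    credential="<account-key>",
)

# Tag filters use a simple SQL-like syntax; values are compared as strings.
results = service.find_blobs_by_tags("\"dataset\" = 'sales-2022'")
for blob in results:
    print(blob.container_name, blob.name)
```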
I'm doing a small project and I don't know how to connect IBM Watson with a Django backend; even after looking for docs, I can't find examples, documentation, or tutorials.
Basically, I want to create jobs (notebook runs) remotely, but I need to send an ID to each notebook, because when I run a notebook I need to specify which file it is going to process from Cloud Storage ("MY-PROJECT-COS"). The figure below describes that situation.
The pipeline I want to implement is like the figure below, and this problem has stopped the whole project. I would really appreciate any suggestions, recommendations, or solutions.
You should check the Watson Data APIs, especially the Create a job and Start a run for a job API calls. Use the request body to pass the specific ID.
You can use a collection of Watson Data REST APIs associated with Watson Studio and Watson Knowledge Catalog to manage data-related assets and connections in analytics projects and catalogs on IBM Cloud Pak for Data.
Catalog data: Use the catalog and asset APIs to create catalogs to administer your assets, associate properties with those assets, and organize the users who use the assets. Assets can be notebooks or connections to files, database sources, or data assets from a connection.
Govern data: Use the governance and workflows APIs to implement data policies and a business glossary that fits to your organization to control user access rights to assets and to uncover data quality and data lineage.
Add and find data: Use the discovery, search, and connections APIs to add and find data within your projects and catalogs.
You can also access a local version of this API docs on each Cloud Pak for Data installation:
https://{cpd_cluster_host}/data-api/api-explorer
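As a heavily hedged sketch, assuming a job for the notebook already exists, a Django view (or any Python code) could start a run and pass the ID as an environment variable; the host, token, IDs, and the exact request-body fields should be verified against the API explorer above:

```python
import requests

CPD_HOST = "https://<cpd_cluster_host>"   # placeholder
TOKEN = "<bearer-token>"                  # placeholder
PROJECT_ID = "<project-guid>"             # placeholder
JOB_ID = "<job-guid>"                     # placeholder; from the Create a job call

headers = {"Authorization": f"Bearer {TOKEN}", "Content-Type": "application/json"}

# Start a run of the notebook job, passing the file it should process from
# "MY-PROJECT-COS" as an environment variable (FILE_ID is an assumed name).
payload = {
    "job_run": {
        "configuration": {
            "env_variables": ["FILE_ID=<object-key-in-MY-PROJECT-COS>"]
        }
    }
}
resp = requests.post(
    f"{CPD_HOST}/v2/jobs/{JOB_ID}/runs",
    params={"project_id": PROJECT_ID},
    headers=headers,
    json=payload,
)
resp.raise_for_status()
print(resp.json())
```

Inside the notebook, the value can then be read with os.environ (e.g. os.environ["FILE_ID"]) to pick the right file from the bucket.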
I am planning to use Azure Data Factory for an ETL process. I would like to know whether Azure Data Factory uses the metamodel that is captured in the Data Catalog. Please advise.
No, currently you can't reuse metadata stored in Azure Data Catalog in Azure Data Factory directly. You could try to reuse some of the metadata by retrieving data assets via the REST API (https://learn.microsoft.com/en-us/rest/api/datacatalog/data-catalog-data-asset), but I think it will be faster to do the setup in Azure Data Factory. Also be aware that the main focus of Data Factory is on data movement and orchestration. For big-data transformations you would use, for example, Databricks activities; for "classic" ETL you would integrate SSIS.
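If you do want to experiment with that route, here is a hedged sketch of calling the Data Catalog search API from Python; the catalog name, API version, and query are assumptions, and the bearer token has to come from an Azure AD sign-in with access to the catalog:

```python
import requests

CATALOG = "DefaultCatalog"          # most tenants use the default catalog name
TOKEN = "<azure-ad-bearer-token>"   # placeholder

resp = requests.get(
    f"https://api.azuredatacatalog.com/catalogs/{CATALOG}/search/search",
    params={"searchTerms": "name:customers", "api-version": "2016-03-30"},
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()
for result in resp.json().get("results", []):
    # Each result wraps a registered asset's metadata; inspect the payload to
    # decide which fields to map onto your Data Factory datasets.
    print(result)
```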