Azure IoT Hub message routing push with blob index tags

I have a setup in which devices send data to Azure IoT Hub, which routes the messages (via message routing to a storage endpoint) into blobs in a container. The data is pushed at high frequency. On the other end, I want to be able to query my blob container to pull files for specific dates.
I came across blob index tags, which look like a promising way to query blobs and are supported by the Azure SDK for .NET.
I was thinking of adding tags to each blob, e.g. processedDate: <dd/mm/yyyy>, which would let me query on them later.
I found that it is possible to add tags when uploading blobs manually, but I am not sure how (or where) to configure this in the message routing flow, where blobs are created on the fly. So I am looking for a way to add those tags in flight as the blobs are pushed to the container.
Any help on this will be much appreciated.
Thanks much!

Presently, Azure IoT Hub doesn't have a feature to set headers, properties, blob index tags, etc. on data it writes to a custom endpoint.
However, for a storage custom endpoint like yours, you can use an Event Grid-triggered Azure Function to tag each blob after it is created, based on your needs.
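As a sketch of that approach (shown in Python for brevity, even though the question mentions the .NET SDK): an Event Grid-triggered function can derive a processedDate tag from the blob path that IoT Hub writes (the default file name format for a storage endpoint is {iothub}/{partition}/{YYYY}/{MM}/{DD}/{HH}/{mm}) and set it with azure-storage-blob. The tag name, the ISO date format (which makes range queries work), and the helper names are assumptions; adjust the regex to your endpoint's configured file name format.

```python
import re

# Default IoT Hub storage-endpoint path format (an assumption; it is
# configurable on the routing endpoint):
#   {iothub}/{partition}/{YYYY}/{MM}/{DD}/{HH}/{mm}
PATH_RE = re.compile(r"/(\d{4})/(\d{2})/(\d{2})/\d{2}/\d{2}")

def tags_from_blob_path(blob_path: str) -> dict:
    """Derive index tags from the blob path written by IoT Hub routing."""
    m = PATH_RE.search(blob_path)
    if not m:
        raise ValueError(f"unexpected blob path: {blob_path}")
    year, month, day = m.groups()
    # ISO-style dates sort lexicographically, so tag range queries
    # match chronological order.
    return {"processedDate": f"{year}-{month}-{day}"}

def apply_tags(blob_url: str, credential) -> None:
    """Tag one blob; would be called from the Event Grid trigger handler.
    Requires: pip install azure-storage-blob"""
    from azure.storage.blob import BlobClient
    blob = BlobClient.from_blob_url(blob_url, credential=credential)
    blob.set_blob_tags(tags_from_blob_path(blob.blob_name))
```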

Tool for Azure cognitive search similar to Logstash?

My company has a lot of data (database: PostgreSQL) and the requirement now is to add a search feature over it; we have been asked to use Azure Cognitive Search.
I want to know how we can transform the data and send it to the Azure search engine.
There are a few cases we have to handle:
1. How do we transfer and upload the existing data into the search engine's index?
2. What is the easiest way to keep the index updated with new records in our production database? (For now we use Java back-end code to transform the data and update the index, but it is very time consuming.)
3. What is the best way to handle changes to the existing database structure? How do we update the indexer without lots of rework recreating the indexers every time?
4. Is there any way to automatically update the index whenever database records change?
You can either write code to push data from your PostgreSQL database into the Azure Search index via the /docs/index API, or you can configure an Azure Search indexer to do the data ingestion. The upside of configuring an indexer is that you can also have it monitor the data source on a schedule for updates and reflect those updates into the search index automatically, for example via a SQL Integrated Change Tracking Policy.
PostgreSQL is a supported data source for Azure Search indexers, although the data source is in preview (not yet generally available).
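A minimal sketch of the push approach with the azure-search-documents Python SDK. The index name, the field names (id, content), and the row shape are assumptions; only the row-to-document mapping runs without a service.

```python
# pip install azure-search-documents

def rows_to_documents(rows):
    """Map DB rows (dicts) to search documents. Azure Search requires a
    string key field, assumed here to be named 'id'."""
    return [{"id": str(r["id"]), "content": r["content"]} for r in rows]

def push(rows, endpoint, index_name, api_key):
    """Push rows into an existing index; merge-or-upload makes re-runs
    idempotent for already-indexed keys."""
    from azure.core.credentials import AzureKeyCredential
    from azure.search.documents import SearchClient
    client = SearchClient(endpoint, index_name, AzureKeyCredential(api_key))
    client.merge_or_upload_documents(documents=rows_to_documents(rows))
```

In practice you would batch large row sets (the service caps documents per request), but the shape of the call stays the same.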
Besides the answer above, which involves coding on your end, there is a solution you can implement with the Azure Data Factory PostgreSQL connector: use a custom query that tracks recent records, and create a pipeline activity that sinks to an Azure Blob Storage account.
Then, within Data Factory, you can link a pipeline activity that copies to an Azure Cognitive Search index, and add a trigger so the pipeline runs at specified times.
Once the staged data is in the storage account in delimitedText format, you can also use the built-in Azure Blob indexer with change tracking enabled.

Is it possible to catalog data inside csv files inside Azure Blob Storage using Azure Data Catalog?

I want to catalog data stored in CSV files in Azure Blob Storage. I looked for a way to get metadata about Blob Storage and found that Data Catalog is an option. The thing is, a CSV file is handled as a blob type, so it cannot be profiled. I want the CSV files in blob storage to act as tables.
Is this possible using Azure Data Catalog?
Yes, you can use Data Catalog. However, for updated Data Catalog features, use the new Azure Purview service, which offers unified data governance for your entire data estate; I would recommend Azure Purview (though this is still possible through Data Catalog).
Registering assets from a data source copies the assets' metadata to Azure, but the data remains in the existing data-source location.
Introduction to Azure Purview (preview) - Azure Purview
This article provides an overview of Azure Purview, including its features and the problems it addresses. Azure Purview enables any user to register, discover, understand, and consume data sources.
This article outlines how to register an Azure Blob Storage account in Purview and set up a scan.
Blob index tags categorize data in your storage account using key-value tag attributes. These tags are automatically indexed and exposed as a searchable multi-dimensional index so you can easily find data. For more information on how to set, get, and find data using blob index tags, see Use blob index tags to manage and find data on Azure Blob Storage.
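For example, if tags hold ISO-formatted dates, a date-range lookup can be sketched with the Python storage SDK's find_blobs_by_tags. The container name, the processedDate tag, and the helper names are assumptions; only the filter-string builder runs without a storage account.

```python
def date_range_filter(container: str, start: str, end: str) -> str:
    """Build a blob index tag filter expression. Assumes processedDate tags
    are ISO dates (yyyy-mm-dd), so string comparison matches date order."""
    return (
        f"@container = '{container}' AND "
        f'"processedDate" >= \'{start}\' AND "processedDate" <= \'{end}\''
    )

def find_blobs(service_client, container: str, start: str, end: str):
    """service_client: an azure.storage.blob.BlobServiceClient."""
    return list(service_client.find_blobs_by_tags(
        date_range_filter(container, start, end)))
```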

How to connect an external Django backend with IBM Watson (Notebook)?

I'm doing a small project, and I don't know how to connect IBM Watson with a Django backend; even looking through the docs, I can't find examples, documentation, or tutorials.
Basically, I want to create jobs (notebook runs) remotely, but I need to send an ID to each notebook, because when I run a notebook I need to specify which file it should process from Cloud Storage ("MY-PROJECT-COS"). The figure below describes this situation.
The pipeline I want to implement is shown in the figure below, and this problem has stopped the whole project. I would really appreciate any suggestions, recommendations, or solutions.
You should check the Watson Data APIs, especially the Create a job and Start a run for a job API calls. Use the request body to pass the specific ID.
You can use a collection of Watson Data REST APIs associated with Watson Studio and Watson Knowledge Catalog to manage data-related assets and connections in analytics projects and catalogs on IBM Cloud Pak for Data.
Catalog data: Use the catalog and asset APIs to create catalogs to administer your assets, associate properties with those assets, and organize the users who use the assets. Assets can be notebooks or connections to files, database sources, or data assets from a connection.
Govern data: Use the governance and workflows APIs to implement data policies and a business glossary that fits your organization, control user access rights to assets, and uncover data quality and data lineage.
Add and find data: Use the discovery, search, and connections APIs to add and find data within your projects and catalogs.
You can also access a local version of these API docs on each Cloud Pak for Data installation:
https://{cpd_cluster_host}/data-api/api-explorer
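A hedged sketch of starting a job run from a Django view (or any Python code) and passing the file ID in the request body. The env_variables shape, query parameters, and endpoint path should be checked against the Watson Data API reference; passing the ID as an environment variable the notebook reads is an assumption.

```python
import json
import urllib.request

def build_job_run_payload(file_id: str) -> dict:
    """Request body for POST /v2/jobs/{job_id}/runs. The notebook would read
    FILE_ID from its environment to pick the Cloud Object Storage file."""
    return {"job_run": {"configuration": {"env_variables": [f"FILE_ID={file_id}"]}}}

def start_run(host: str, project_id: str, job_id: str, token: str, file_id: str):
    """Fire a job run; `host` is the API host, `token` a bearer token."""
    url = f"https://{host}/v2/jobs/{job_id}/runs?project_id={project_id}"
    req = urllib.request.Request(
        url,
        data=json.dumps(build_job_run_payload(file_id)).encode(),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```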

Moving image metadata from Azure Blob Storage into CosmosDB

I have several images in a container in Blob Storage with metadata I want to store in Cosmos DB. The images are JPEGs. Would I have to pull the data into Data Factory, or is there a simpler method?
I think the simplest way would be to write a small console app that loops through all your blob containers and items and inserts into Cosmos DB using its SDK.
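A minimal sketch of such a console-style loop in Python with azure-storage-blob and azure-cosmos. The id scheme (blob name with '/' replaced, since Cosmos DB item ids cannot contain slashes) and the document shape are assumptions; only the mapping helper runs without live accounts.

```python
# pip install azure-storage-blob azure-cosmos

def blob_to_document(container: str, blob_name: str, metadata) -> dict:
    """Shape a Cosmos DB item from a blob's name and metadata dict."""
    return {"id": blob_name.replace("/", ":"),
            "container": container,
            **(metadata or {})}

def copy_metadata(blob_service, cosmos_container, container_name: str) -> None:
    """blob_service: azure.storage.blob.BlobServiceClient
    cosmos_container: azure.cosmos ContainerProxy for the target container."""
    source = blob_service.get_container_client(container_name)
    # include=["metadata"] makes list_blobs return each blob's metadata
    for props in source.list_blobs(include=["metadata"]):
        cosmos_container.upsert_item(
            blob_to_document(container_name, props.name, props.metadata))
```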
There are many ways to loop through images stored in Blob Storage and store them in Cosmos DB. Please refer to the links below; with a little tweaking, the required functionality can be achieved.
1. Using ADF: Link
2. Using an Azure Logic App: Link 1, Link 2
3. Writing custom code and populating the result into Cosmos DB: Link

Is there any way to call Bing-ads api through a pipeline and load the data into Bigquery through Google Data Fusion?

I'm creating a pipeline in Google Data Fusion to export my Bing Ads data into BigQuery using my Bing Ads developer token. I couldn't find any suitable data sources to add to my pipeline in Data Fusion. Is fetching data from API calls even supported in Google Data Fusion, and if so, how can it be done?
HTTP based sources for Cloud Data Fusion are currently in development and will be released by Q3. Could you elaborate on your use case a little more, so we can make sure that your requirements will be covered by those plugins? For example, are you looking to build a batch or real-time pipeline?
In the meantime, you have the following two more immediate options/workarounds:
If you are OK with storing the data in a staging area in GCS before loading it into BigQuery, you can use the HTTPToHDFS plugin that is available in the Hub, with a path that starts with gs:///path/to/file.
Alternatively, we welcome contributions, so you can also build the plugin yourself using the Cloud Data Fusion APIs. We are happy to guide you and can point you to documentation and samples.