Using Azure Data Factory to retrieve blob URLs from a container - azure-data-factory

I'm using Azure and have some blobs in a container. What I'm looking for is to copy the URLs of these blobs into a JSON file using Azure Data Factory or a data flow.
Here's an example of a URL:
The expected result is:
[{'url':'urlToBlob1'},{'url':'urlToBlob2'},{'url':'urlToBlob3'},...]
Is there a way to achieve that using Azure Data Factory or a data flow?

I could not find an activity or out-of-the-box method to achieve this. However, if you look at the REST API:
The List Blobs operation returns a list of the blobs under the specified container.
In version 2013-08-15 and newer, the EnumerationResults element
contains a ServiceEndpoint attribute specifying the blob endpoint, and
a ContainerName field specifying the name of the container. In
previous versions these two attributes were combined together in the
ContainerName field. Also in version 2013-08-15 and newer, the Url
element under Blob has been removed.
Check out the MS doc List blobs and snapshots.
An example shows the result of a listing operation that returns blobs and snapshots in a container named mycontainer.
The request URI is as follows:
GET https://myaccount.blob.core.windows.net/mycontainer?restype=container&comp=list&include=snapshots&include=metadata
The sample XML response contains a <Blob> block for each blob in the container.
<Blob>
  <Name>blob3.txt</Name>
  <Url>https://myaccount.blob.core.windows.net/mycontainer/blob3.txt</Url>
  <Properties>
    <Last-Modified>Wed, 09 Sep 2009 09:20:03 GMT</Last-Modified>
    <Etag>0x8CBFF45D911FADF</Etag>
    <Content-Length>16384</Content-Length>
    <Content-Type>image/jpeg</Content-Type>
    <Content-Encoding />
    <Content-Language />
    <Content-MD5 />
    <Cache-Control />
    <x-ms-blob-sequence-number>3</x-ms-blob-sequence-number>
    <BlobType>PageBlob</BlobType>
    <LeaseStatus>locked</LeaseStatus>
  </Properties>
  <Metadata>
    <Color>yellow</Color>
    <BlobNumber>03</BlobNumber>
    <SomeMetadataName>SomeMetadataValue</SomeMetadataName>
  </Metadata>
</Blob>
You can find the blob URL between the <Url> and </Url> tags.
Next, you can parse the XML output to store the URL values in an array variable in the pipeline.
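For illustration, here is a minimal Python sketch of the same idea, assuming a container that allows anonymous reads or a SAS token with List permission (the account, container, and sas_token values are placeholders). It calls List Blobs, then builds the JSON array of URLs from the service endpoint, container name, and blob names, since newer service versions omit the <Url> element:

import json
import urllib.request
import xml.etree.ElementTree as ET

# Placeholders - substitute your own account, container, and SAS token.
account = "myaccount"
container = "mycontainer"
sas_token = ""  # e.g. "?sv=..." with List permission; empty for a public container

endpoint = "https://{}.blob.core.windows.net".format(account)
sep = "&" if sas_token else "?"
list_url = "{}/{}{}{}restype=container&comp=list".format(endpoint, container, sas_token, sep)

with urllib.request.urlopen(list_url) as resp:
    tree = ET.fromstring(resp.read())

# Build each URL from endpoint + container + blob name.
urls = [{"url": "{}/{}/{}".format(endpoint, container, name.text)}
        for name in tree.iter("Name")]

with open("blob_urls.json", "w") as f:
    json.dump(urls, f)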

Related

Extract only data (without response element) using MarkLogic search API with REST

When making a REST endpoint call to MarkLogic, is there an option to get only the data in the response, without the additional metadata?
I am trying to make such connections using tools like Tableau, Qlik, Denodo, etc.
Options I use in the POST body to the URL http://localhost:8000/v1/search:
<search xmlns="http://marklogic.com/appservices/search">
  <options>
    <extract-document-data selected="all"></extract-document-data>
    <transform-results apply="raw" />
  </options>
</search>
Result I get:
<search:response snippet-format="raw" total="150" start="1" page-length="10" selected="all" xmlns:search="http://marklogic.com/appservices/search">
  <search:result index="1" uri="/doc/21_doc.xml" path="fn:doc("/doc/21_doc.xml")" score="0" confidence="0" fitness="0" href="/v1/documents?uri=%2Fdoc%2F21_doc.xml" mimetype="application/xml" format="xml">
    <root>
      <col1>data1</col1>
      <col2>data1</col2>
      <col3>data1</col3>
    </root>
    <search:extracted kind="element">
      <root>
        <col1>data1</col1>
        <col2>data1</col2>
        <col3>data1</col3>
      </root>
    </search:extracted>
  </search:result>
  <search:metrics>
    <search:query-resolution-time>PT0.0005236S</search:query-resolution-time>
    <search:snippet-resolution-time>PT0.0001001S</search:snippet-resolution-time>
    <search:extract-resolution-time>PT0.0003971S</search:extract-resolution-time>
    <search:total-time>PT0.0021503S</search:total-time>
  </search:metrics>
</search:response>
Expected:
<search:extracted kind="element">
  <root>
    <col1>data1</col1>
    <col2>data1</col2>
    <col3>data1</col3>
  </root>
</search:extracted>
Also, why am I getting data in both extracted and result elements?
Both snippeting and data extraction provide access to content. Use either, or use them for different purposes. If you only want the extracted data, then use:
<transform-results apply="empty-snippet" />
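For illustration, a minimal Python sketch of that request, using the localhost:8000 instance from the question and placeholder credentials (MarkLogic REST endpoints typically use digest authentication):

import requests
from requests.auth import HTTPDigestAuth

options = '''<search xmlns="http://marklogic.com/appservices/search">
  <options>
    <extract-document-data selected="all"/>
    <transform-results apply="empty-snippet"/>
  </options>
</search>'''

resp = requests.post(
    "http://localhost:8000/v1/search",
    data=options,
    headers={"Content-Type": "application/xml", "Accept": "application/xml"},
    auth=HTTPDigestAuth("user", "password"),  # placeholder credentials
)
print(resp.text)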
HTH!
If you want the original documents, specify an Accept header of multipart/mixed and specify only the content value for the category URI parameter.
See: https://docs.marklogic.com/REST/POST/v1/search
and https://docs.marklogic.com/guide/rest-dev/bulk#id_65903
As an alternative, you may find it easier to provide rows to Business Intelligence Tools by creating TDE indexes and paging over an Optic query with the /v1/rows endpoint.
See: https://docs.marklogic.com/guide/rest-dev/search#id_34628
and https://docs.marklogic.com/REST/POST/v1/rows
Hoping that helps,

I want to list all the subfolders inside a container using the Azure Storage API

I have a hierarchy:
my container
  --images
    --img01
    --img02
I want to make a REST request and get the result: <BlobPrefix>images</BlobPrefix>
As shown at this link https://learn.microsoft.com/en-us/rest/api/storageservices/fileservices/enumerating-blob-resources under the section "Delimited Blob List"
But I am not getting this result; instead, I am getting all the resource names recursively, as though I am not supplying any delimiter.
String to Sign is:
GET /account/mycontainer
x-ms-date:Tue, 07 Feb 2017 05:38:21 GMT
x-ms-version:2016-05-31
comp:list
delimeter:/
maxresults:4
restype:container
URL I am sending:
http://account.blob.core.windows.net/mycontainer?restype=container&comp=list&delimeter=/&maxresults=4
In result I am getting:
<xml>
  <blobs>
    <blob><Name>images/img01</Name></blob>
    <blob><Name>images/img02</Name></blob>
  </blobs>
</xml>
However based on above mentioned link. The response should be:
<xml>
  <blobs>
    <BlobPrefix>
      <Name>images/</Name>
    </BlobPrefix>
  </blobs>
</xml>
According to your description, I checked Delimited Blob List and tested this method on my side. To simplify this issue, I created a container and set "Container Public Access Level" to "Public read access for container and blobs". Here is the file structure of my blob container:
List blobs without specifying the delimiter:
https://brucechen.blob.core.windows.net/brucechen?restype=container&comp=list
Result:
List blobs specifying the delimiter '/':
https://brucechen.blob.core.windows.net/brucechen?restype=container&comp=list&delimiter=%2F
Result:
According to your stringToSign, the parameter name delimeter is misspelled; you need to change it to delimiter. For more details, you could refer to List Blobs.
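For illustration, a minimal Python sketch against a public-read container (the account and container names are placeholders), sending the correctly spelled delimiter parameter and reading back the virtual-folder prefixes:

import urllib.request
import xml.etree.ElementTree as ET

# Placeholder account/container; assumes public read access, so no auth is needed.
url = ("https://myaccount.blob.core.windows.net/mycontainer"
       "?restype=container&comp=list&delimiter=%2F")

with urllib.request.urlopen(url) as resp:
    tree = ET.fromstring(resp.read())

# Virtual folders come back as <BlobPrefix><Name>images/</Name></BlobPrefix>.
for prefix in tree.iter("BlobPrefix"):
    print(prefix.find("Name").text)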

SharePoint REST API for Fetching custom Metadata Columns

I am Developing a plugin that exports Documents to SharePoint Repository after processing them. Along with the document, I need to send values for Custom Metadata columns defined in SharePoint.
I have figured out how to send across Files and Metadata to a specified location.
Problem: Initially, I do not know what custom metadata columns are available in the given folder.
Could someone shed light on any REST web service that can fetch the available metadata columns for a given location in the repository?
Note: I am using pure Java for REST Requests using Apache HTTP Client.
The REST url for retrieving the custom fields in a list is:
_api/web/lists/GetByTitle('Custom List')/fields
I don't know much about parsing JSON in Java, but this will give you a list of all the columns and extensive details about them. I displayed some of the data returned below.
DefaultValue : null
Description : ""
EnforceUniqueValues : false
Id : "fa564e0f-0c70-4ab9-b863-0177e6ddd123"
Indexed : false
InternalName : "Title"
ReadOnlyField : false
Required : false
StaticName : "Title"
Title : "Title"
FieldTypeKind : 2
TypeAsString : "Text"
TypeDisplayName : "Single line of text"
If you need to get the available columns of a specific folder, and not a library:
_api/web/getfolderbyserverrelativeurl('/Shared%20Documents/Folder')/ListItemAllFields
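For illustration, a minimal Python sketch of the fields call, with placeholder site URL, list title, and credentials (the asker's Java/Apache HttpClient code would follow the same request shape):

import requests

# Placeholder site and credentials; substitute the authentication your
# farm requires (NTLM, OAuth bearer token, etc.).
site = "https://yourserver/sites/yoursite"
resp = requests.get(
    site + "/_api/web/lists/GetByTitle('Custom List')/fields",
    headers={"Accept": "application/json;odata=verbose"},
    auth=("user", "password"),
)
for field in resp.json()["d"]["results"]:
    print(field["InternalName"], field["TypeAsString"])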
SharePoint 2013 has a REST API endpoint that can retrieve and filter metadata columns if you request the information through a POST using CAML. If your requests were made from SharePoint itself, you would use the master page's RequestDigest, but since you are doing it remotely, you have to get this parameter by querying /_api/contextinfo and reading the FormDigestValue. Here is an article on it:
http://www.cleverworkarounds.com/2013/09/23/how-to-filter-on-a-managed-metadata-column-via-rest-in-sharepoint-2013/
Also, you must enable CORS on your SharePoint data repository.
<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <system.webServer>
    <httpProtocol>
      <customHeaders>
        <add name="Access-Control-Allow-Origin" value="*" />
      </customHeaders>
    </httpProtocol>
  </system.webServer>
</configuration>
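To illustrate the digest handshake described above, here is a minimal sketch (same placeholder site and credentials) that fetches the FormDigestValue and carries it on subsequent write requests:

import requests

site = "https://yourserver/sites/yoursite"  # placeholder
session = requests.Session()
session.auth = ("user", "password")  # placeholder credentials
session.headers["Accept"] = "application/json;odata=verbose"

# POST to contextinfo with an empty body to obtain the form digest.
info = session.post(site + "/_api/contextinfo").json()
digest = info["d"]["GetContextWebInformation"]["FormDigestValue"]

# Subsequent POSTs carry the digest in the X-RequestDigest header.
session.headers["X-RequestDigest"] = digest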

How to lookup HBase REST API (Stargate) if the row-keys are reversed urls

I am using nutch2.2.1 + hbase0.90.4, and wanting to access the data via the HBase REST API Stargate. If I seed nutch with a url (eg. www.usatoday.com), the reversed url becomes the HBase row-key in the designated table ('webpage'). I can lookup the data via the hbase shell as follows:
hbase(main):001:0> get 'webpage', 'com.usatoday.www:http/'
COLUMN CELL
f:fi timestamp=1404762373394,value=\x00'\x8D\x00
f:ts timestamp=1404762373394, value=\x00\x00\x01G\x12\\xB5\xB3
mk:_injmrk_ timestamp=1404762373394, value=y
mk:dist timestamp=1404762373394, value=0
mtdt:_csh_ timestamp=1404762373394, value=?\x80\x00\x00
s:s timestamp=1404762373394, value=?\x80\x00\x00
However, I am having trouble using the REST API. Presumably I need to do some pretty simple URL encoding to escape the colon before 'http' that is causing trouble for me?
For eg., I get a HTTP 404 when I try
curl http://localhost:8900/webpage/com.usatoday.www:http/
also when I try
curl http://localhost:8900/webpage/com.usatoday.www%3Ahttp/
I know that the REST API is working fine, as I can create a row called 'row3' in a table called 'test' and look it up with
curl http://localhost:8900/test/row3
to see the following expected result:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?><CellSet><Row key="cm93Mw=="><Cell timestamp="1404761922130" column="Y2Y6Yw==">dGhpcyBpcyBzb3J0YSB3b3JraW5nIG5vdw==</Cell></Row></CellSet>
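Note that Stargate base64-encodes the row key, column name, and cell value in the XML response; decoding the strings above shows the round trip:

import base64

print(base64.b64decode("cm93Mw==").decode())  # row3
print(base64.b64decode("Y2Y6Yw==").decode())  # cf:c
print(base64.b64decode("dGhpcyBpcyBzb3J0YSB3b3JraW5nIG5vdw==").decode())  # this is sorta working now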
Thanks for the help!
Needed to URL encode the forward slash as well. The following works.
curl http://localhost:8900/webpage/com.usatoday.www%3Ahttp%2F
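In Python, a minimal sketch of producing that escaped key (quote with safe='' escapes the slash as well as the colon):

import urllib.parse

row_key = "com.usatoday.www:http/"
encoded = urllib.parse.quote(row_key, safe="")
print(encoded)  # com.usatoday.www%3Ahttp%2F
print("http://localhost:8900/webpage/" + encoded)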

Referring to Dumped LinkedIn API data

Quick problem here that I don't know how to solve, and I thought you guys could give me a heads-up on which way to go. I have successfully pulled all my connection data using the LinkedIn REST API, both in XML and JSON, and dumped it (the former using the cPickle module). The issue is that I need to refer to a single field within the data, so I decided to use XML as it seemed to be the easiest one to use by far. When I refer to the specific field in my .pickle file, it gives me the following error:
AttributeError: 'str' object has no attribute 'location'
However, opening the pickle file with notepad, I can see that all my connections do have their location field stored in the XML format. It's very strange!
Here's the code that refers to it:
import cPickle
connections_data = 'linkedin_connections.pickle'
response = cPickle.load(open(connections_data))
print response
locations = [ec.location for ec in response]
I have a print statement set up to show what's in my file, and all of the data appears as normal XML output from the people call of the REST API. The XML data appears as follows:
<person>
  <id>ID_number</id>
  <first-name>blah</first-name>
  <last-name>blah</last-name>
  <headline>Business Development and Sales Executive at Computaris</headline>
  <picture-url>picture_url</picture-url>
  <api-standard-profile-request>
    <url>profile_request</url>
    <headers total="1">
      <http-header>
        <name>x-li-auth-token</name>
        <value>name</value>
      </http-header>
    </headers>
  </api-standard-profile-request>
  <site-standard-profile-request>
    <url>request_url</url>
  </site-standard-profile-request>
  <location>
    <name>location</name>
    <country>
      <code>country_code</code>
    </country>
  </location>
  <industry>industry</industry>
Any help will be much appreciated.
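For what it's worth, the error suggests cPickle.load is returning the raw XML as a single string, so iterating over response yields characters, which have no location attribute. A minimal sketch, assuming the pickle really holds one well-formed XML document like the sample above, that parses it first:

import cPickle
import xml.etree.ElementTree as ET

response = cPickle.load(open('linkedin_connections.pickle'))

# response is a string of XML, not a list of objects - parse it first.
root = ET.fromstring(response)

# Collect the <name> text under each <location> element.
locations = [loc.findtext('name') for loc in root.iter('location')]
print(locations)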