Not able to read data from Google Cloud Platform in StreamSets Data Collector

I am trying to create a pipeline in StreamSets Data Collector to read data from a Google Cloud Platform bucket and load the data into the same bucket with a different file name.
The data file in the bucket is in JSON format.
I used the Google Cloud Storage origin in StreamSets Data Collector and set the following properties:
Common Prefix = gs://<my-bucket-name>/<json-file-name>
Prefix Pattern = https://storage.cloud.google.com/<my-bucket-name>/<json-file-name>
Could someone correct this configuration or suggest an alternative?

This is documented in Common Prefix, Prefix Pattern, and Wildcards.
Common prefix is a path common to all the files you want to read.
Prefix pattern contains wildcards specifying the files you want to read.
Neither of these should contain the bucket name (since that is configured separately) or the protocol. In your case, it looks like you can use something like:
Common prefix: /
Prefix pattern: *.json (or some other wildcard that matches your files)
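If it helps to sanity-check which objects the origin should pick up, you could list the bucket outside of Data Collector. This is a minimal sketch with the google-cloud-storage Python client, assuming a hypothetical bucket name and application default credentials; fnmatch only roughly approximates the origin's wildcard matching:
import fnmatch
from google.cloud import storage

BUCKET_NAME = "my-bucket"     # hypothetical bucket name
PREFIX_PATTERN = "*.json"     # same wildcard you would put in Prefix Pattern

client = storage.Client()
for blob in client.list_blobs(BUCKET_NAME):   # common prefix "/" means the bucket root
    if fnmatch.fnmatch(blob.name, PREFIX_PATTERN):
        print(blob.name)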

Related

Pick files with specific filenames in Azure Data Factory

I am looking to copy files from blob storage to another blob store using Azure Data Factory. However, I want to pick only files whose names start with, say, AAABBBCCC, XXXYYYZZZ, or MMMNNNOOO, and ignore the rest.
In ADF, use the Copy Activity and the wildcard path to set your matching file patterns.
You could use a prefix to pick the files that you want to copy, and this sample shows how to copy from blob to blob using Azure Data Factory.
prefix: Specifies a string that filters the results to return only blobs whose name begins with the specified prefix.
// List blobs start with "AAABBBCCC" in the container
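// Note: 'client' here is assumed to be a BlobContainerClient for the source container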
await foreach (BlobItem blobItem in client.GetBlobsAsync(prefix: "AAABBBCCC"))
{
Console.WriteLine(blobItem.Name);
}
With the ADF setting:
Set Wildcard paths to AAABBBCCC*. For more details, see here.

Is it possible to rename all files/blobs in google storage that start with a given string?

I can see in the docs how to rename files (blobs) in a bucket. However, I am looking for a way to rename only those files/blobs that start with a specific prefix in a bucket (i.e., renaming using a wildcard like gs://my-bucket/loc/my-file*).
I don't see a blob "name" property in the docs to match against after getting the list of blobs in a bucket.
Is there a way to do that using Python language?
Thanks
Using gsutil, it is possible to use wildcard naming to rename all files starting with a given prefix to another name.
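Since the question asks about Python, here is a minimal sketch of the same idea with the google-cloud-storage client, assuming a hypothetical bucket and prefix. Each listed blob does expose a name attribute, and rename_blob copies the object to the new name and then deletes the original:
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-bucket")                     # hypothetical bucket name

for blob in bucket.list_blobs(prefix="loc/my-file"):    # everything matching loc/my-file*
    new_name = blob.name.replace("loc/my-file", "loc/renamed-file", 1)
    bucket.rename_blob(blob, new_name)                  # copy to new name, then delete original
    print(blob.name, "->", new_name)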

Different S3 behavior using different endpoints?

I'm currently writing code to use Amazon's S3 REST API and I notice different behavior where the only difference seems to be the Amazon endpoint URI that I use, e.g., https://s3.amazonaws.com vs. https://s3-us-west-2.amazonaws.com.
Examples of different behavior for the GET Bucket (List Objects) call:
Using one endpoint, it includes the "folder" in the results, e.g.:
/path/subfolder/
/path/subfolder/file1.txt
/path/subfolder/file2.txt
and, using the other endpoint, it does not include the "folder" in the results:
/path/subfolder/file1.txt
/path/subfolder/file2.txt
Using one endpoint, it represents "folders" using a trailing / as shown above and, using the other endpoint, it uses a trailing _$folder$:
/path/subfolder_$folder$
/path/subfolder/file1.txt
/path/subfolder/file2.txt
Why the differences? How can I make it return results in a consistent manner regardless of endpoint?
Note that I get these same odd results even if I use Amazon's own command-line AWS S3 client, so it's not my code.
And the contents of the buckets should be irrelevant anyway.
Your assertion notwithstanding, your issue is exactly about the content of the buckets, and not something S3 is doing -- the S3 API has no concept of folders. None. The S3 console can display folders, but this is for convenience -- the folders are not really there -- or if there are folder-like entities, they're irrelevant and not needed.
In Amazon S3, buckets and objects are the primary resources, where objects are stored in buckets. Amazon S3 has a flat structure with no hierarchy like you would see in a typical file system. However, for the sake of organizational simplicity, the Amazon S3 console supports the folder concept as a means of grouping objects. Amazon S3 does this by using key name prefixes for objects.
http://docs.aws.amazon.com/AmazonS3/latest/UG/FolderOperations.html
So why are you seeing this?
Either you've been using EMR/Hadoop, or some other code written by someone who took a bad example and ran with it... or code that has been doing things differently from the way they should have been done for quite some time.
Amazon EMR is a web service that uses a managed Hadoop framework to process, distribute, and interact with data in AWS data stores, including Amazon S3. Because S3 uses a key-value pair storage system, the Hadoop file system implements directory support in S3 by creating empty files with the <directoryname>_$folder$ suffix.
https://aws.amazon.com/premiumsupport/knowledge-center/emr-s3-empty-files/
This may have been something the S3 console did many years ago, and apparently (since you don't report seeing them in the console) it still supports displaying such objects as folders in the console... but the S3 console no longer creates them this way, if it ever did.
I've mirrored the bucket "folder" layout exactly
If you create a folder in the console, an empty object with the key "foldername/" is created. This in turn is used to display a folder that you can navigate into, and upload objects with keys beginning with that folder name as a prefix.
The Amazon S3 console treats all objects that have a forward slash "/" character as the last (trailing) character in the key name as a folder
http://docs.aws.amazon.com/AmazonS3/latest/UG/FolderOperations.html
If you just create objects using the API, then "my/object.txt" appears in the console as "object.txt" inside folder "my" even though there is no "my/" object created... so if the objects are created with the API, you'd see neither style of "folder" in the object listing.
That is probably a bug in the API endpoint which includes the "folder" - S3 internally doesn't actually have a folder structure, but instead is just a set of keys associated with files, where keys (for convenience) can contain slash-separated paths which then show up as "folders" in the web interface. There is the option in the API to specify a prefix, which I believe can be any part of the key up to and including part of the filename.
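For illustration, here is a rough boto3 (Python) sketch of prefix/delimiter listing, using a hypothetical bucket and prefix; "folders" only show up as CommonPrefixes when a Delimiter is passed:
import boto3

s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="my-bucket", Prefix="path/", Delimiter="/")

for obj in resp.get("Contents", []):
    print("object:", obj["Key"])          # real objects, including any zero-byte "folder" markers
for cp in resp.get("CommonPrefixes", []):
    print("prefix:", cp["Prefix"])        # synthetic "folders" rolled up by the delimiter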
EMR's s3 client is not the apache one, so I can't speak accurately about it.
In ASF Hadoop releases (and HDP, CDH):
The older s3n:// client uses $folder$ as its folder delimiter.
The newer s3a:// client uses / as its folder marker, but will handle $folder$ if there. At least it used to; I can't see where in the code it does now.
The S3A clients strip out all folder markers when you list things; S3A uses them to simulate empty dirs and deletes all parent markers when you create child file/dir entries.
Whatever you have that processes the GET results should just ignore entries ending with "/" or _$folder$ (see the sketch below).
As to why they are different, the local EMRFS is a different codepath, using dynamo for implementing consistency. At a guess, it doesn't need to mock empty dirs, as the DDB tables will host all directory entries.
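As a sketch of that "just ignore the markers" approach, assuming boto3 and a hypothetical bucket, you could filter out both marker styles when listing:
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

for page in paginator.paginate(Bucket="my-bucket", Prefix="path/"):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if key.endswith("/") or key.endswith("_$folder$"):   # skip both styles of folder marker
            continue
        print(key)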

Google storage api list storage bucket with "/" in the name

I am trying to list all objects in a bucket (Google Storage) using the Google Storage API. The bucket is nested like a folder, such as "my-bucket/sub-folder". I got the following error:
com.google.api.client.googleapis.json.GoogleJsonResponseException: 404 Not Found
If I use a bucket name without a "/", it works fine. How can I list a bucket like a folder structure?
Google Cloud Storage buckets do not have slashes in their name. In the example above, the bucket is named "my-bucket" and the object is named something like "sub-folder/object.txt" or just "object.txt".
It's useful to remember that GCS does not have any real notion of folders. There are only buckets and objects in buckets. If you have a subdirectory named "dir" in bucket named "mybucket", and that subdirectory has 5 objects in it, what you really have is 5 objects named "dir/obj1", "dir/obj2", etc, all still within bucket "mybucket."
A number of tools (like gsutil and the GCS web-based storage browser) make it appear that there are folders, through use of markers and prefixes in the API -- even though as noted, there really are just objects that have slashes in the name.
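The same prefix/delimiter idea in code, sketched here with the Python client for brevity and a hypothetical bucket (the Java API accepts equivalent prefix and delimiter parameters):
from google.cloud import storage

client = storage.Client()
blobs = client.list_blobs("my-bucket", prefix="sub-folder/", delimiter="/")

for blob in blobs:
    print("object:", blob.name)   # objects directly under sub-folder/
for p in blobs.prefixes:          # "subdirectories"; populated once the iterator has been consumed
    print("prefix:", p)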

Determine S3 file last modified timestamp

I have a Scala Play 2 app and am using the AWS S3 API to read files from S3. I need to determine the last modified timestamp for a file; what's the best way to do that? Is it getObjectMetadata, or perhaps listObjects, or something else? If possible, I would like to determine the timestamps for multiple files in one call. Are there other open source libraries built on top of the AWS S3 APIs?
In the AWS Java SDK, an S3 object is represented by S3ObjectSummary, which has a getLastModified method that returns the last modified timestamp.
Ideally, just list all of the files using listObjects and then call getObjectSummaries on the returned object.
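The answer above refers to the AWS Java SDK; as a rough equivalent sketched with boto3 in Python (bucket name and prefix are hypothetical), a single listing call returns the last modified timestamps for many objects at once:
import boto3

s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="my-bucket", Prefix="reports/")

for obj in resp.get("Contents", []):
    # Each listing entry already carries LastModified, so no per-object metadata call is needed.
    print(obj["Key"], obj["LastModified"])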