Has anyone achieved this functionality before? It's equivalent to ls -ltr *xyz* in Unix, and I would like to achieve the same in my Cloud Dataflow code.
Any lead would be appreciated.
Thank you.
It is possible to do this filtering on the client side. Here is an example that uses the google-cloud Java client library to access the Google Cloud Storage APIs.
The example below lists all files in the root directory of the bucket that match the given regular expression pattern.
I've used regular expressions instead of the glob patterns that shell commands like ls support, since regular expressions are more flexible.
I would recommend going through the Java library documentation for google-cloud.
Example
import com.google.api.gax.paging.Page;
import com.google.cloud.storage.Blob;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.Storage.BlobListOption;
import com.google.cloud.storage.StorageOptions;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
/**
* An example which lists the files in the specified GCS bucket matching the
* specified regular expression pattern.
*
* <p>Run it as PROGRAM_NAME <BUCKET_NAME> <REGEX_MATCH_PATTERN>
*/
public class ListBlobsSample {
  public static void main(String[] args) throws IOException {
    // Instantiates a Storage client.
    Storage storage = StorageOptions.getDefaultInstance().getService();
    // The name of the GCS bucket.
    String bucketName = args[0];
    // The regular expression for matching blobs in the GCS bucket.
    // Example: '.*abc.*'
    String matchExpr = args[1];
    List<String> results = listBlobs(storage, bucketName, Pattern.compile(matchExpr));
    System.out.println("Results: " + results.size() + " items.");
    for (String result : results) {
      System.out.println("Blob: " + result);
    }
  }

  // Lists all blobs in the bucket matching the regular expression.
  // Example pattern: '.*abc.*'
  private static List<String> listBlobs(Storage storage, String bucketName, Pattern matchPattern)
      throws IOException {
    List<String> results = new ArrayList<>();
    // Only list blobs in the current directory
    // (otherwise you also get results from the sub-directories).
    BlobListOption listOptions = BlobListOption.currentDirectory();
    Page<Blob> blobs = storage.list(bucketName, listOptions);
    for (Blob blob : blobs.iterateAll()) {
      if (!blob.isDirectory() && matchPattern.matcher(blob.getName()).matches()) {
        results.add(blob.getName());
      }
    }
    return results;
  }
}
Using just prefix matching
If you instead need to match just prefixes in the object names, the Objects: list API supports it.
You need to specify the prefix query parameter in the request when doing GET https://www.googleapis.com/storage/v1/b/bucket/o. This is also supported by the Java client library (you specify it while building the BlobListOption you pass to storage.list()).
prefix (string): Filter results to objects whose names begin with this prefix.
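For instance, a minimal sketch using the same Java client (the bucket name and prefix here are illustrative, and storage is a client built as in the example above):
// Server-side filtering: only objects whose names begin with "xyz" are returned.
Page<Blob> blobs = storage.list("my-bucket", BlobListOption.prefix("xyz"));
for (Blob blob : blobs.iterateAll()) {
  System.out.println(blob.getName());
}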
gsutil
gsutil supports such queries, and it does the filtering solely on the client side (in some cases it issues multiple requests too).
GCS supports prefix queries, so you can efficiently list xyz*; but to list *xyz* you would have to list the entire bucket and filter on the client.
The following may not be exactly what your use case needs, but it shows how to narrow down the results to a certain prefix, after which you can apply your regular expression to the remaining names.
Storage storage = StorageOptions.getDefaultInstance().getService();
Bucket bucket = storage.get(bucketName);
Storage.BlobListOption blobListOption = Storage.BlobListOption.prefix(prefixPattern);
for (Blob blob : bucket.list(blobListOption).iterateAll()) {
  System.out.println(blob);
}
I need to run multiple queries from a single .SQL file but with different params.
I've tried something like this, but it does not work, as BigQueryIO.Read consumes only PBegin.
public PCollection<KV<String, TestDitoDto>> expand(PCollection<QueryParamsBatch> input) {
  PCollection<KV<String, Section1Dto>> section1 = input.apply("Read Section1 from BQ",
      BigQueryIO
          .readTableRows()
          .fromQuery(ResourceRetriever.getResourceFile("query/test/section1.sql"))
          .usingStandardSql()
          .withoutValidation())
      .apply("Convert section1 to Dto", ParDo.of(new TableRowToSection1DtoFunction()));
}
Are there any other ways to put params from existing PCollection inside my BigQueryIO.read() invocation?
Are the different queries/parameters available at pipeline construction time? If so, you could just create multiple read transforms and combine the results, for example using a Flatten transform.
The Beam Java BigQuery source does not currently support reading a PCollection of queries. The Python BQ source does, though.
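A minimal sketch of that approach, assuming both queries are known when the pipeline is constructed (the query strings and transform names are illustrative):
PCollection<TableRow> section1 = pipeline.apply("Read Section1",
    BigQueryIO.readTableRows()
        .fromQuery("SELECT * FROM dataset.section1")
        .usingStandardSql());
PCollection<TableRow> section2 = pipeline.apply("Read Section2",
    BigQueryIO.readTableRows()
        .fromQuery("SELECT * FROM dataset.section2")
        .usingStandardSql());
// Flatten merges the per-query PCollections into a single PCollection.
PCollection<TableRow> combined = PCollectionList.of(section1).and(section2)
    .apply("Combine results", Flatten.pCollections());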
I've come up with the following solution: don't use BigQueryIO, but rather the regular GCP client library for accessing BigQuery, marking the client as transient and initializing it each time in a method with the @Setup annotation, since it is not Serializable.
public class DenormalizedCase1Fn extends DoFn<*> {
  private transient BigQuery bigQuery;

  @Setup
  public void initialize() {
    this.bigQuery = BigQueryOptions.newBuilder()
        .setProjectId(bqProjectId.get())
        .setLocation(LOCATION)
        .setRetrySettings(RetrySettings.newBuilder()
            .setRpcTimeoutMultiplier(1.5)
            .setInitialRpcTimeout(Duration.ofSeconds(5))
            .setMaxRpcTimeout(Duration.ofSeconds(30))
            .setMaxAttempts(3).build())
        .build().getService();
  }

  @ProcessElement
  ...
I am recording the application through WireMock using the Java DSL. Do we have an option to customize the mapping file names, instead of getting the filename that WireMock generates?
Example: searchpanel_arrivalairport_th-72f9b8b7-076f-4102-b6a8-aa38710fde1b.json (generated by WireMock using Java)
I am expecting the above file to follow my desired naming convention, like
searchpanel_airport_LGW.json
Custom filenames can be added by customizing StubMappingJsonRecorder.
I added a CustomStubMappingJsonRecorder and overrode the writeToMappingAndBodyFile method:
if (fileName != null && !fileName.equals("")) {
  mappingFileName = fileName + "-mapping.json";
  bodyFileName = fileName + "-body.json";
} else {
  mappingFileName = UniqueFilenameGenerator.generate(request.getUrl(),
      "mapping", fileId);
  bodyFileName = UniqueFilenameGenerator.generate(request.getUrl(), "body",
      fileId, ContentTypes.determineFileExtension(request.getUrl(),
          response.getHeaders().getContentTypeHeader(), body));
}
There's no easy way to do this at the moment. It is, however, possible. As @santhiya-ps says, you need to write your own implementation of RequestListener, probably using StubMappingJsonRecorder as a template.
You can't extend it and override writeToMappingAndBodyFile, as that method is private, but that is the method you probably want to change.
import com.github.tomakehurst.wiremock.common.*;
import com.github.tomakehurst.wiremock.core.*;
import com.github.tomakehurst.wiremock.http.*;
import java.util.List;
import static com.github.tomakehurst.wiremock.core.WireMockApp.*;

class NameTemplateStubMappingJsonRecorder implements RequestListener {
  private final FileSource mappingsFileSource;
  private final FileSource filesFileSource;
  private final Admin admin;
  private final List<CaseInsensitiveKey> headersToMatch;
  private final IdGenerator idGenerator = new VeryShortIdGenerator();

  public NameTemplateStubMappingJsonRecorder(Admin admin) {
    this.mappingsFileSource = admin.getOptions().filesRoot().child(MAPPINGS_ROOT);
    this.filesFileSource = admin.getOptions().filesRoot().child(FILES_ROOT);
    this.admin = admin;
    this.headersToMatch = admin.getOptions().matchingHeaders();
  }

  @Override
  public void requestReceived(Request request, Response response) {
    // TODO copy StubMappingJsonRecorder changing as required...
  }
}
You can then register your RequestListener like so:
WireMockServer wireMockServer = new WireMockServer();
wireMockServer.addMockServiceRequestListener(
new NameTemplateStubMappingJsonRecorder(wireMockServer)
);
wireMockServer.start();
As long as you still store the mapping files in the expected directory (the FileSource mappingsFileSource above, which will be ${rootDir}/mappings, where rootDir is configured as explained in Configuration - File Locations), they should be loaded successfully, as all files with the json extension in that directory are loaded as mappings.
It would be much easier if StubMappingJsonRecorder took a strategy for generating these names, so it might be worth creating an issue on the WireMock repo asking for an easier way to do this. I'd suggest getting agreement on a basic design before raising a PR, though.
In the project we need to change the collection name suffix every day based on the date.
So one day the collection is named:
samples_22032019
and the next day it is:
samples_23032019
Every day I need to change the suffix and recompile the spring-boot application because of this. Is there any way I can change this so the collection/table name is calculated dynamically based on the current date? Any advice for MongoRepository?
Assuming the below is your bean, you can use the @Document annotation with Spring Expression Language (SpEL) to resolve the suffix at runtime, as shown below:
@Document(collection = "samples_#{T(com.yourpackage.Utility).getDateSuffix()}")
public class Samples {
  private String id;
  private String name;
}
Now keep your date-suffix function in a utility method that Spring can resolve at runtime. SpEL is handy in such scenarios.
package com.yourpackage;

public class Utility {
  public static final String getDateSuffix() {
    // Add your real logic here; the below is for representational purposes only.
    return DateTime.now().toDate().toString();
  }
}
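For the ddMMyyyy suffix shown in the question, the utility might look like this (a minimal sketch using java.time; the exact pattern is an assumption based on the example names):
package com.yourpackage;

import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class Utility {
  // Produces e.g. "22032019", so the collection resolves to "samples_22032019".
  public static String getDateSuffix() {
    return LocalDate.now().format(DateTimeFormatter.ofPattern("ddMMyyyy"));
  }
}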
HTH!
Make a cron job run daily to generate the new name for your collection and execute the code below. Here I am getting the collection using MongoDatabase; then, using MongoNamespace, we can rename the collection.
To get the old/new collection names you can write a separate method.
@Component
public class RenameCollectionTask {
  @Scheduled(cron = "${cron}")
  public void renameCollection() {
    // creating the mongo client object
    final MongoClient client = new MongoClient(HOST_NAME, PORT);
    // selecting the mongo database
    final MongoDatabase database = client.getDatabase("databaseName");
    // selecting the mongo collection
    final MongoCollection<Document> collection = database.getCollection("oldCollectionName");
    // creating the new namespace
    final MongoNamespace newName = new MongoNamespace("databaseName", "newCollectionName");
    // renaming the collection
    collection.renameCollection(newName);
    System.out.println("Collection has been renamed");
    // closing the client
    client.close();
  }
}
To assign the name of the collection dynamically you can refer to the SpEL approach above, so that a restart will not be required every time.
The renameCollection() method has the following limitations:
1) It cannot move a collection between databases.
2) It is not supported on sharded collections.
3) You cannot rename views.
Refer to the MongoDB documentation for details.
We are trying to provide a clean URI structure for external endpoints to pull json information from CQ5.
For example, if you want to fetch information about a particular users history (assuming you have permissions etc), ideally we would like the endpoint to be able to do the following:
/bin/api/user/abc123/phone/555-klondike-5/history.json
In the URI, we would specify /bin/api/user/{username}/phone/{phoneNumber}/history.json so that it is very easy to leverage the dispatcher to invalidate caching changes etc. without invalidating a broad swath of cached information.
We would like to use a sling servlet to handle the request; however, I am not aware of how to put variables into the path.
It would be great if there were something like @PathParam from JAX-RS to add to the sling path variable, but I suspect it's not available.
The other approach we had in mind was to use a selector to recognise when we are accessing the API, so we could return whatever we wanted from the path, but that would require a single sling servlet to handle all of the requests. I am not happy with that approach, as it glues a lot of unrelated code together.
Any help with this would be appreciated.
UPDATE:
If we use an OptingServlet and put some logic inside the accepts function, we can stack a series of sling servlets and make acceptance decisions from the path with a regex.
Then, during execution, the path itself can be parsed for the variables, as in the sketch below.
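A minimal sketch of that idea, assuming Sling's OptingServlet interface (the class name, regex, and response body are illustrative; how the servlet is registered and resolved is a separate concern, e.g. the suffix approach described in another answer):
import java.io.IOException;
import java.util.regex.Pattern;
import javax.servlet.ServletException;
import org.apache.sling.api.SlingHttpServletRequest;
import org.apache.sling.api.SlingHttpServletResponse;
import org.apache.sling.api.servlets.OptingServlet;
import org.apache.sling.api.servlets.SlingSafeMethodsServlet;

public class UserHistoryServlet extends SlingSafeMethodsServlet implements OptingServlet {

  // Accept only requests shaped like .../user/{username}/phone/{phoneNumber}/history.json
  private static final Pattern PATH_PATTERN =
      Pattern.compile(".*/user/[^/]+/phone/[^/]+/history\\.json");

  @Override
  public boolean accepts(SlingHttpServletRequest request) {
    return PATH_PATTERN.matcher(request.getRequestURI()).matches();
  }

  @Override
  protected void doGet(SlingHttpServletRequest request, SlingHttpServletResponse response)
      throws ServletException, IOException {
    // During execution, the variables are parsed back out of the path.
    String[] parts = request.getRequestURI().split("/");
    String username = parts[parts.length - 4];    // e.g. "abc123"
    String phoneNumber = parts[parts.length - 2]; // e.g. "555-klondike-5"
    response.setContentType("application/json");
    response.getWriter().write("{\"user\":\"" + username + "\",\"phone\":\"" + phoneNumber + "\"}");
  }
}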
If the data that you provide comes from the JCR repository, the best approach is to structure it exactly as you want the URLs to be; that's the recommended way of doing things with Sling.
If the data is external, you can create a custom Sling ResourceProvider that you mount on the /bin/api/user path and that acquires or generates the corresponding data based on the rest of the path.
The Sling test suite's PlanetsResourceProvider is a simple example of that, see http://svn.apache.org/repos/asf/sling/trunk/launchpad/test-services/src/main/java/org/apache/sling/launchpad/testservices/resourceprovider/
The Sling resources docs at https://sling.apache.org/documentation/the-sling-engine/resources.html document the general resource resolution mechanism.
It is now possible to integrate Jersey (JAX-RS) with CQ. We were able to create a primitive prototype to say "Hello" to the world.
https://github.com/hstaudacher/osgi-jax-rs-connector
With this we can use @PathParam to map the requests.
Thanks and Regards,
San
There is no direct way to create such dynamic paths. You could register a servlet under /bin/api/user.json and provide the rest of the path as a suffix:
/bin/api/user.json/abc123/phone/555-klondike-5/history
^                 ^
|                 |
servlet path      suffix starts here
then you could parse the suffix manually:
@SlingServlet(paths = "/bin/api/user", extensions = "json")
public class UserServlet extends SlingSafeMethodsServlet {
  @Override
  protected void doGet(SlingHttpServletRequest request, SlingHttpServletResponse response)
      throws ServletException, IOException {
    String suffix = request.getRequestPathInfo().getSuffix();
    String[] split = StringUtils.split(suffix, '/');
    // parse the split path and check if the path is valid;
    // if the path is not valid, send 404:
    // response.sendError(HttpURLConnection.HTTP_NOT_FOUND);
  }
}
The RESTful way to approach this would be to store the information in the structure that you want to use, i.e. /content/user/abc123/phone/555-klondike-5/history/ would contain all the history nodes for that path.
With that structure, you can obtain an out-of-the-box JSON response by simply calling
/content/user/abc123/phone/555-klondike-5/history.json
Or, if you need something in a specific JSON format, you could use the sling resource resolution to serve a custom JSON response.
Excited to share this! I've spent about a week solving this, and finally have the best answer.
First: Try to use Jersey
The osgi-jax-rs-connector suggested by kallada is best, but I couldn't get it working on Sling 8. I lost a full day trying; all I have to show for it are spooky class-not-found errors and dependency issues.
Solution: The ResourceProvider
Bertrand's link is for Sling 9 only, which isn't released yet. So here's how you do it in Sling 8 and older!
Two Files:
ResourceProvider
Servlet
The ResourceProvider
The purpose of this is only to listen to all requests at /service and then produce a "Resource" at that virtual path, which doesn't actually exist in the JCR.
@Component
@Service(value = ResourceProvider.class)
@Properties({
  @Property(name = ResourceProvider.ROOTS, value = "service/image"),
  @Property(name = ResourceProvider.OWNS_ROOTS, value = "true")
})
public class ImageResourceProvider implements ResourceProvider {

  @Override
  public Resource getResource(ResourceResolver resourceResolver, String path) {
    AbstractResource abstractResource;
    abstractResource = new AbstractResource() {
      @Override
      public String getResourceType() {
        return TypeServlet.RESOURCE_TYPE;
      }

      @Override
      public String getResourceSuperType() {
        return null;
      }

      @Override
      public String getPath() {
        return path;
      }

      @Override
      public ResourceResolver getResourceResolver() {
        return resourceResolver;
      }

      @Override
      public ResourceMetadata getResourceMetadata() {
        return new ResourceMetadata();
      }
    };
    return abstractResource;
  }

  @Override
  public Resource getResource(ResourceResolver resourceResolver, HttpServletRequest httpServletRequest, String path) {
    return getResource(resourceResolver, path);
  }

  @Override
  public Iterator<Resource> listChildren(Resource resource) {
    return null;
  }
}
The Servlet
Now you just write a servlet that handles the resources coming from that path. This is accomplished by handling any resource whose resource type matches the one produced by the ResourceProvider listening at that path.
@SlingServlet(
    resourceTypes = TypeServlet.RESOURCE_TYPE,
    methods = {"GET", "POST"})
public class TypeServlet extends SlingAllMethodsServlet {

  static final String RESOURCE_TYPE = "mycompany/components/service/myservice";

  @Override
  protected void doGet(SlingHttpServletRequest request, SlingHttpServletResponse response)
      throws ServletException, IOException {
    final String[] pathParts = request.getResource().getPath().split("/");
    final String id = pathParts[pathParts.length - 1];
    response.setContentType("text/html");
    PrintWriter out = response.getWriter();
    try {
      out.print("<html><body>Hello, received this id: " + id + "</body></html>");
    } finally {
      out.close();
    }
  }
}
Obviously your servlet would do something much more clever, such as process the "path" String more intelligently and probably produce JSON.
I am using the Google Apps Drive API 3.0 (the Documents List API). This API is deprecated, but it is in its maintenance phase.
I want to find a path of Google Drive document.
e.g. test/test1/test2/test3/testDoc.txt
As of now, I am able to retrieve all the documents, but without the directory path.
I want to show the whole path of a drive document.
I believe there is no API to retrieve the whole parent path or parent link.
The getFolders() method of DocumentListEntry is now deprecated and cannot show the folder path.
I investigated and found that there is one more method, getParentLinks(), which just shows the immediate parent links. It returns a List, over which I cannot iterate again to find each parent's own parent link.
public class MyClass {
  private static final String DOCS_BASE_URL = "https://docs.google.com/feeds/";
  private static final String DOCS_URL = "/private/full";
  private static final String adminEmail = "admin@mytest.com";
  private static final String password = "password";
  private static final String projectKey = "MyProject";

  public static void main(String[] args) {
    try {
      URL queryUrl = new URL(DOCS_BASE_URL + adminEmail + DOCS_URL);
      DocumentQuery docQry = new DocumentQuery(queryUrl);
      DocsService docService = new DocsService(projectKey);
      docService.setUserCredentials(adminEmail, password);
      docQry.setStringCustomParameter("showfolders", "true");
      DocumentListFeed docFeed = docService.query(docQry, DocumentListFeed.class);
      Iterator<DocumentListEntry> documentEntry = docFeed.getEntries().iterator();
      while (documentEntry.hasNext()) {
        DocumentListEntry docsEntry = documentEntry.next();
        // Complex logic to find the whole directory path (that I don't understand :P)
      }
    } catch (Exception exception) {
      System.out.println("Error Occured " + exception.getMessage());
    }
  }
}
Any inputs are welcome.
Thanks.
To solve this you need to stop thinking in terms of folders and paths, and think in terms of labels (aka parents, aka collections). A file can (optionally) have one or more labels/parents/collections. Each parent can in turn have one or more label/parent/collection. So to get the "path" of a file, you need to recursively get the parents of its parent. Remember that a file can have multiple parents, each of which can also have multiple parents, thus a file can have multiple paths.
Taking your example "test/test1/test2/test3/testDoc.txt": assuming you have the ID of testDoc.txt, you can get its DocumentListEntry and call getParentLinks, which returns a list of URLs for the DocumentListEntry of each of its parents; in your case, just "test3". Get the DocumentListEntry for test3, and repeat to get test2, etc.
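Here is a minimal sketch of that traversal using the deprecated Documents List API classes from the question (the helper name and the use of a generic Exception are illustrative; docService is an authenticated DocsService as in the question's code):
// Recursively builds every path from a root down to the given entry.
// A file with multiple parents yields multiple paths.
private static List<String> getPaths(DocsService docService, DocumentListEntry entry)
    throws Exception {
  String name = entry.getTitle().getPlainText();
  List<String> paths = new ArrayList<String>();
  List<Link> parents = entry.getParentLinks();
  if (parents.isEmpty()) {
    // No parents: the entry is at the root, so the path is just its name.
    paths.add(name);
    return paths;
  }
  for (Link parent : parents) {
    // Fetch the parent's own entry and recurse up the hierarchy.
    DocumentListEntry parentEntry =
        docService.getEntry(new URL(parent.getHref()), DocumentListEntry.class);
    for (String parentPath : getPaths(docService, parentEntry)) {
      paths.add(parentPath + "/" + name);
    }
  }
  return paths;
}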
It might sound complicated, but once you accept that the thing you're calling a folder is not a container of files, but simply a property of the file, it makes more sense.