GCP Cloud Storage - Wildcard Prefix List - google-cloud-storage

This is a question about how to accomplish a certain task with the GCP Cloud Storage API.
I have a bucket with a "folder" structure as follows:
ID / Year / Month / Day / FILES
I need to search for all files with the following format: ID/2016/04/03/. I had hoped I could use a * in the prefix (*/2016/04/03/), but this does not work.
Anyone know a way to make this happen without iterating every top level folder myself?

There is no API support for wildcard expressions - just for prefix queries.
When you say "iterating every top level folder myself" it sounds like you mean manually listing them in your client code? You can avoid doing that by doing a query that specifies delimiter="/" and prefix="" to find the top-level "folders". You would then iterate over that list and construct prefix queries to list the individual objects within the given date-named folder.
If you can restructure your object names so that the date comes first, you can avoid the extra prefix+delimiter query and iteration, e.g.,
Year / Month / Day / ID / FILES
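The two-step listing described above (one delimiter query to find the top-level "folders", then one prefix query per folder) could be sketched roughly as follows. This assumes `client` is a `google.cloud.storage.Client`; the helper names `date_prefixes` and `list_objects_for_date` are made up for illustration.

```python
from typing import Iterable, Iterator, List


def date_prefixes(top_level_prefixes: Iterable[str], date_path: str) -> List[str]:
    """Join each top-level "folder" prefix (e.g. "ID1/") with a date path."""
    suffix = date_path.strip("/") + "/"
    return [p.rstrip("/") + "/" + suffix for p in top_level_prefixes]


def list_objects_for_date(client, bucket_name: str, date_path: str) -> Iterator[str]:
    """Yield names of objects stored under <ID>/<date_path>/ for every top-level ID.

    `client` is assumed to behave like google.cloud.storage.Client.
    """
    # Query 1: empty prefix + delimiter="/" reports the top-level "folders".
    top_level = client.list_blobs(bucket_name, prefix="", delimiter="/")
    for _ in top_level:
        pass  # .prefixes is only filled in once the iterator has been consumed

    # Query 2..n: one plain prefix query per top-level "folder".
    for prefix in date_prefixes(sorted(top_level.prefixes), date_path):
        for blob in client.list_blobs(bucket_name, prefix=prefix):
            yield blob.name
```

A call such as `list(list_objects_for_date(client, "my-bucket", "2016/04/03"))` would then issue one request for the top-level prefixes plus one listing per ID; the bucket name here is a placeholder.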

GetMetadata to get the full file directory in Azure Data Factory

I am working through a use case where I want to load all the folder names that were loaded into an Azure Database into a different "control" table, but am having problems with using GetMetadata activity properly.
The purpose of this use case would be to skip all of the older folders (that were already loaded) and only focus on the new folder, get the ".gz" file, and load it into an Azure Database. At a high level, I thought I would use the GetMetadata activity to send all of the folder names to a stored procedure. That stored procedure would then load those folder names with a status of '1' (meaning successful).
That table would then be used in a separate pipeline that loads files into a database. I would use a Lookup activity to compare against the already loaded folders, and if one of them doesn't match, that would be the folder to get the file from (the source is an S3 bucket).
The folder structure is nested in the YYYY/MM/DD format (ex: 2019/12/27 where each day a new folder is created and a "gz" file is placed there).
I created an ADF pipeline using the "GetMetadata" activity pointing to the blob storage that has already had the folders loaded into it.
However, when I run this pipeline I only get the top three folder names: 2019, 2018, 2017.
Is it possible to get not only the top-level folder name but to go all the way down to the day level? So instead of the output being "2019" it would be "2019/12/26", the next one would be "2019/12/27", plus all of the months and days from 2017 and 2018.
If anyone has faced this issue, any insight would be greatly appreciated.
Thank you
You can also use a wildcard placeholder in this case if you have a defined, unchanging folder structure.
Use as directory: storageroot / * / * / * / filename
For example I used csvFiles / * / * / * / * / * / * / *.csv
to get all files that have this structure:
csvFiles / topic / subtopic / country / year / month / day
Then you get all files in this folder structure.
Based on the statements in the Get Metadata activity doc, childItems only returns elements from the specific path; it won't include items in subfolders.
I suppose you have to use a ForEach activity to loop over the childItems array layer by layer to flatten the whole structure. At the same time, use a Set Variable activity to concatenate the complete folder path. Then use an If Condition activity: when you detect that the element type is a file, not a folder, you can call the SP you mentioned in your question.
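The layer-by-layer flattening is easier to see outside of pipeline JSON. Below is a rough sketch of the same logic in Python, not an ADF pipeline: `get_child_items` is a hypothetical stand-in for a Get Metadata call on one folder, returning entries shaped like childItems (`name` and `type`), and `SAMPLE_TREE` is made-up data.

```python
from typing import Callable, Dict, List

# Shape of one childItems entry: {"name": "...", "type": "Folder" or "File"}
ChildItem = Dict[str, str]

# Made-up folder tree keyed by folder path, used only for illustration.
SAMPLE_TREE: Dict[str, List[ChildItem]] = {
    "": [{"name": "2019", "type": "Folder"}],
    "2019": [{"name": "12", "type": "Folder"}],
    "2019/12": [
        {"name": "26", "type": "Folder"},
        {"name": "27", "type": "Folder"},
    ],
    "2019/12/26": [{"name": "a.gz", "type": "File"}],
    "2019/12/27": [{"name": "b.gz", "type": "File"}],
}


def flatten_files(
    root: str, get_child_items: Callable[[str], List[ChildItem]]
) -> List[str]:
    """Walk the tree one layer at a time and return the full path of every file."""
    files: List[str] = []
    pending = [root]
    while pending:
        next_layer: List[str] = []
        for folder in pending:  # plays the role of the ForEach activity
            for item in get_child_items(folder):
                # plays the role of Set Variable: build the complete path
                path = f"{folder}/{item['name']}" if folder else item["name"]
                if item["type"] == "File":  # plays the role of If Condition
                    files.append(path)
                else:
                    next_layer.append(path)
        pending = next_layer
    return files
```

With the sample data, `flatten_files("", lambda p: SAMPLE_TREE.get(p, []))` returns the two day-level file paths; in a real pipeline each `get_child_items` call would correspond to one Get Metadata execution.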

Deleting substrings throughout a column

I am trying to clean up several columns in a table to create a clear attribution/reference for reporting on my digital marketing campaigns. The goal is to keep one part of a string while deleting all others. All strings within my marketing campaigns have symbols separating each substring.
Attached are pictures of my current table and of the desired table.
I am essentially trying to keep only one part of the structure of a string and delete all other substrings. I have already managed to do this successfully by applying the following formula, given to me in a separate thread.
update adwords
set campaign = substring(campaign from '%-%-#"%#"' for '#')
where campaign like '%-%-%';
This worked perfectly, however, I do not fully understand why and have not found a clear answer thus far on this forum.
How would I apply this to future rows? Ad group and match type can be used for this purpose.
Many Thanks.
First thing: You do not modify source data. Do ETL instead, and transform it to a final stage. Do that periodically, which also takes care of new data.
You could just create a trigger which should work for all new data, but there are 2 caveats with that:
Failure will lead to missing data and you not being able to QA it.
If you modify the source data in an incorrect way by mistake, you cannot undo it unless you have a backup, and even then it's just too hard.
So instead look at ETL tools like Talend or Pentaho Kettle; create your own ETL scripts, or whatever. Use Jenkins to schedule all of this periodically and you're set.
Now, about the transformation itself.
for '#'
declares # as the escape character, so each #" in the pattern is a marker rather than a literal quote; the two markers delimit the part of the pattern whose match is returned.
substring(campaign from '%-%-#"%#"' for '#')
thus selects whatever matches the part of the pattern between the two #" markers. % is a wildcard, the same as in LIKE comparisons, so everything matched by the last % is returned. This is better done with regular expressions:
substring(campaign from '.*?-.*?-(.*)')
For the second column the regex would be ^(.*?)\s*\{
And for the third one - similar: ^(.*?)\s*\}
I would create the new table like this:
CREATE TABLE aw_final AS
SELECT
    substring(campaign FROM '^\w{2}-\w+-(.*)$') AS campaign,
    substring(ad_group FROM '^(\w+)\s*\{\w+\}$') AS ad_group,
    substring(match_type FROM '^(\w+)\s*\}$') AS match_type
FROM adwords
WHERE campaign ~ '^\w{2}-\w+-(.*)$';
But if you must do an update, this would be how:
UPDATE adwords SET
    campaign = substring(campaign FROM '^\w{2}-\w+-(.*)$'),
    ad_group = substring(ad_group FROM '^(\w+)\s*\{\w+\}$'),
    match_type = substring(match_type FROM '^(\w+)\s*\}$')
WHERE campaign ~ '^\w{2}-\w+-(.*)$';
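To see what the three patterns capture before running them against the table, they can be tried out in Python. This is only a sketch: Python's `re` module is used here in place of PostgreSQL's regex engine (the two should agree for these simple patterns), and the sample strings are invented, since the original screenshots are not available.

```python
import re
from typing import Optional, Pattern

# The same patterns as in the SQL above.
CAMPAIGN_RE = re.compile(r"^\w{2}-\w+-(.*)$")
AD_GROUP_RE = re.compile(r"^(\w+)\s*\{\w+\}$")
MATCH_TYPE_RE = re.compile(r"^(\w+)\s*\}$")


def first_group(pattern: Pattern[str], value: str) -> Optional[str]:
    """Return the first capture group, or None when the pattern does not match.

    This mirrors substring(... FROM pattern), which yields NULL on no match.
    """
    match = pattern.search(value)
    return match.group(1) if match else None


# Hypothetical values, chosen only to fit the shape of each pattern.
print(first_group(CAMPAIGN_RE, "UK-Search-Summer Sale"))  # Summer Sale
print(first_group(AD_GROUP_RE, "Shoes {exact}"))          # Shoes
print(first_group(MATCH_TYPE_RE, "exact }"))              # exact
print(first_group(CAMPAIGN_RE, "no hyphens here"))        # None
```

Values that do not match yield None here, just as the SQL version yields NULL, which is worth checking for before overwriting a column in place.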

Geofire TableView - CircleQuery Users for leaderboard [duplicate]

I'm trying to figure out how to query with filter with Geofire.
Suppose I have restaurants with different category. and I want to add that category to my query. How do I go about this?
One way I have now is querying the key with Geofire, run the for loop through each key and get the restaurant, and insert the appropriate restaurant to the array.
These seems so inefficient. Is there any other way to go about this?
Ideally I will have the filtered results, and only load each item when they're about to be shown.
Cheers!
Firebase queries can only filter by one condition. Geofire already does quite some "magic" to allow it to filter on both longitude and latitude. Adding another property to that equation might be possible, but is well beyond what Geofire handles by default. See GeoFire: How to add extra conditions within the query?
If you only ever want to access one category at a time, you can put the restaurants in a top-level node per category and point Geofire to one category.
/category1
    item1
        g: "pns0h0mf2u"
        l: [-53.435719, 140.808716]
    item2
        g: "u417k3dwub"
        l: [56.83069, 1.94822]
/category2
    item3
        g: "8m3rz3s480"
        l: [30.902225, -166.66809]
/items
    item1: ...
    item2: ...
    item3: ...
In the above example, we have two categories: category1 with 2 items and category2 with just 1 item. For each item, we see the data that Geofire uses: a geohash and the latitude and longitude. We also keep a single list with the other properties of these 3 items.
But more commonly, you simply do the extra filtering in client-side code. If you're worried about the performance of that: measure it, share the code, JSON data and measurements.
This is an old question, but I've seen it in a few places on the web, so I thought I might share one trick I've used.
The Problem
If you have a large collection in your database, maybe containing hundreds of thousands of keys, for example, it might not be feasible to grab them all. If you're trying to filter results based on location in addition to other criteria, you're stuck with something like:
Execute the location query
Loop through each returned geofire key and grab the corresponding data in the database
Check each returned piece of data to see if it matches the other criteria
Unfortunately, that's a lot of network requests, which is quite slow.
More concretely, let's say we want to get all users within e.g. 100 miles of a particular location that are male and between ages 20 and 25. If there are 10,000 users within 100 miles, that means 10,000 network requests to grab the user data and compare their gender and age.
The Workaround:
You can store the data you need for your comparisons in the geofire key itself, separated by a delimiter. Then, you can just split the keys returned by the geofire query to get access to the data. You still have to filter through them, but it's much faster than sending hundreds or thousands of requests.
For instance, you could use the format:
UserID*gender*age, which might look something like facebook:1234567*male*24. The important points are:
Separate data points by a delimiter
Use a valid character for the delimiter -- "It can include any unicode characters except for . $ # [ ] / and ASCII control characters 0-31 and 127."
Use a character that is not going to be found elsewhere in your database - I used *, but that might not work for you. Do not use any characters from -0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz, since those are fair-game for keys generated by firebase's push()
Choose a consistent order for the data - in this case, UserID first, then gender, then age.
You can store up to 768 bytes of data in firebase keys, which goes a long way.
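The encode/split/filter steps could look roughly like this in Python. It is a sketch of the key format only and does not touch Geofire or Firebase; the function names are made up, and the 768-byte limit and forbidden characters are taken from the points above.

```python
from typing import Iterable, List, Tuple

DELIMITER = "*"
FORBIDDEN_CHARS = set(".$#[]/")
MAX_KEY_BYTES = 768


def encode_key(user_id: str, gender: str, age: int) -> str:
    """Pack the filterable fields into one key: UserID*gender*age."""
    parts = [user_id, gender, str(age)]
    for part in parts:
        bad = any(
            ch == DELIMITER or ch in FORBIDDEN_CHARS or ord(ch) < 32 or ord(ch) == 127
            for ch in part
        )
        if bad:
            raise ValueError(f"invalid character in key part: {part!r}")
    key = DELIMITER.join(parts)
    if len(key.encode("utf-8")) > MAX_KEY_BYTES:
        raise ValueError("key exceeds 768 bytes")
    return key


def decode_key(key: str) -> Tuple[str, str, int]:
    """Split a key back into (user_id, gender, age)."""
    user_id, gender, age = key.split(DELIMITER)
    return user_id, gender, int(age)


def matching_user_ids(
    keys: Iterable[str], gender: str, min_age: int, max_age: int
) -> List[str]:
    """Filter the keys returned by a location query without extra requests."""
    result = []
    for key in keys:
        user_id, key_gender, age = decode_key(key)
        if key_gender == gender and min_age <= age <= max_age:
            result.append(user_id)
    return result
```

Here `matching_user_ids(keys, "male", 20, 25)` would run over whatever keys the location query returned, so only the users that pass the filter need a follow-up read.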
Hope this helps!

Selecting records with a huge "where data set"

Background Info
C#
MS MVC 4
Sql Azure
Linq - Identities
Problem at hand:
Selecting records in an Items table where zip code is within a certain range of miles.
Items Table
id (PK)
Title
Body
ZipCode (Int)
Summary of Progress:
I have a class which uses the 2013 US Gazetteer zip code and tabulation areas to gather zip codes and assess distances between zip codes. It is basically a .csv/.txt file that I open into a stream and convert to POCOs in order to process distances. That much of the equation is working fine; however, selecting a list of Items from an Items table based on this list of zip codes is where I'm not sure what to do.
Scenario
User A wants to search for items within a 25-mile radius of zip code 46324.
User A hits search and in the background my class returns a list of 124 zip codes within a 25-mile radius.
Question: What is the best way (performance wise) to retrieve items in my Item table using this list of zipcodes?
Possible Solutions
I thought about creating a dynamic query using the T-SQL IN keyword within my WHERE clause and simply supplying this list as the parameters. This does not seem to be a very performance-oriented way of doing this; however, considering my current architecture, I do not see any other way.
I also thought about incorporating a sort of paging functionality that will only take the first 5 zip codes to return results followed by the next 5 and so on and so on. This will involve more work but it definitely would seem to be a better performance choice.
Any ideas?
I stumbled across your question purely by chance searching for something else, and also I see it's quite old, but I thought I'd give you a comment none the less:
What I would do in this case is actually allow the database to do the search and the C# to do the calcs. You have a class in C# which calculates the distances? Then why not save the distance from each zip code to each zip code in a "lookup table" in sql.
Doing it this way makes sure that the data is calculated once but you let the sql find the right data for you.
ie:
Create a table with from_zip, to_zip, distance fields
Calculate and populate table once at the beginning
Query by saying "select * from zip_lookup where from_zip = bla and distance between 0 and 100" or something like that
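As a rough illustration of the lookup-table idea, here is a small sketch using Python's built-in sqlite3 in place of SQL Azure. The table and column names follow the list above; the items, the extra zip codes, and the distances are made up, and the join against the Items table is one possible way to use the lookup, not something stated in the answer.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with an items table and a zip lookup table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE items (
            id INTEGER PRIMARY KEY, title TEXT, body TEXT, zip_code INTEGER
        );
        CREATE TABLE zip_lookup (
            from_zip INTEGER, to_zip INTEGER, distance REAL,
            PRIMARY KEY (from_zip, to_zip)
        );
        """
    )
    # Populated once up front; these distances are invented for the demo.
    conn.executemany(
        "INSERT INTO zip_lookup VALUES (?, ?, ?)",
        [(46324, 46324, 0.0), (46324, 46320, 3.1), (46324, 46201, 150.0)],
    )
    conn.executemany(
        "INSERT INTO items (title, body, zip_code) VALUES (?, ?, ?)",
        [("Bike", "", 46324), ("Desk", "", 46320), ("Lamp", "", 46201)],
    )
    return conn


def items_within(conn: sqlite3.Connection, from_zip: int, max_miles: float):
    """Let the database find items whose zip code is within max_miles of from_zip."""
    cursor = conn.execute(
        """
        SELECT i.id, i.title, z.distance
        FROM items AS i
        JOIN zip_lookup AS z ON z.to_zip = i.zip_code
        WHERE z.from_zip = ? AND z.distance <= ?
        ORDER BY z.distance, i.id
        """,
        (from_zip, max_miles),
    )
    return cursor.fetchall()
```

With this shape the list of nearby zip codes never has to travel from the application to the database; the radius becomes a single parameter and the join does the rest.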

CQ5 JCR query in multiple paths

I need to have a JCR SQL query of this form :
select * from jcr:content where cq:template like '%myTemplate%' and ( jcr:path like '%path1%' or jcr:path like '%path2%')
But I get an exception saying "incorrect use of property jcr:path". Is there a quick workaround for this? The number of paths to search in may vary each time based on user selection.
I tried the following query in CQ's Query tool and it worked.
SELECT * FROM [cq:PageContent] WHERE [cq:PageContent].[cq:template] LIKE '%content%' AND ( isdescendantnode('/content/geometrixx/fr/') OR isdescendantnode('/content/geometrixx/en/'))
But ISDESCENDANTNODE requires an absolute path, and I think relative ones wouldn't work.
As you've noticed, it's impossible to use multiple path comparisons in the JCR queries. You have a few options here:
Create a few queries, one per path.
Add some custom attribute to jcr:content node marking the pages you are interested in and use it instead of paths.
Iterate over path1 and path2 subtrees rather than query them.