Google Cloud Storage - Split files by value in the file

I need to read a file from Google Cloud Storage and split it into multiple files based on transaction_date, which is a field in the file. The data is about 6 TB in size (broken into multiple files). What's the most effective way to achieve this? Do I have to use Dataflow or Dataproc, or is there another, simpler way to do this?

I take it to mean that you want to write a separate (sharded) file per transaction_date. There isn't any direct support for this in the TextIO.Write that ships with Dataflow, but since it sounds like you have a special case where you know the date range in advance, you can manually create ~11 different filtered TextIO.Write transforms:
PCollection<Record> input = ...;
for (Date transaction_date : known_transaction_dates) {
  input
      .apply(Filter.by(record -> transaction_date.equals(record.getTransactionDate()))) // accessor name is illustrative
      .apply(TextIO.Write.to(
          String.format("gs://my-bucket/output/%s", transaction_date)));
}
This is certainly not ideal. For BigQueryIO there is a feature to write to a different table based on the windowing of the data - similar functionality added to TextIO might address your use case. Otherwise, data-dependent writes of various sorts are on our radar and include cases like yours.
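For reference, later Beam releases added FileIO.writeDynamic(), which covers exactly this kind of data-dependent write. A rough sketch, assuming a Record type with an illustrative getTransactionDate() accessor and a toString() that produces the output line:

input.apply(FileIO.<String, Record>writeDynamic()
    // Route each record to a destination keyed by its transaction date.
    .by(record -> record.getTransactionDate().toString())
    .withDestinationCoder(StringUtf8Coder.of())
    // Format each record as a line of text and write it with the plain text sink.
    .via(Contextful.fn(Record::toString), TextIO.sink())
    .to("gs://my-bucket/output/")
    // One set of output shards per date under gs://my-bucket/output/.
    .withNaming(date -> FileIO.Write.defaultNaming("transaction_date-" + date, ".txt")));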

Related

KubeFlow, handling large dynamic arrays and ParallelFor with current size limitations

I've been struggling to find a good solution for this for the past day and would like to hear your thoughts.
I have a pipeline which receives a large & dynamic JSON array (containing only stringified objects),
I need to be able to create a ContainerOp for each entry in that array (using dsl.ParallelFor).
This works fine for small inputs.
Right now the array comes in as a file HTTP URL due to the pipeline input argument size limitations of Argo and Kubernetes (or that is what I understood from the current open issues), but when I try to read the file from one Op to use as input for the ParallelFor, I encounter the output size limitation.
What would be a good & reusable solution for such a scenario?
Thanks!
the array comes in as a file HTTP URL due to pipeline input argument size limitations of Argo and Kubernetes
Usually the external data is first imported into the pipeline (downloaded and output). Then the components use inputPath and outputPath to pass big data pieces as files.
The size limitation only applies to data that you consume by value (inputValue) instead of as a file.
The loops consume the data by value, so the size limit applies to them.
What you can do is make this data smaller. For example, if your data is a JSON list of big objects [{obj1}, {obj2}, ..., {objN}], you can transform it into a list of indexes [1, 2, ..., N], pass that list to the loop, and then inside the loop have a component that uses the index and the data file to select a single piece to work on (N -> {objN}).

Filtering GCS URIs before job execution

I have a frequent use case I couldn't solve.
Let's say I have a filepattern like gs://mybucket/mydata/*/files.json where * is supposed to match a date.
Imagine I want to keep 251 dates (this is an example; say a big number of dates, but without a meta-pattern like 2019* to match them).
For now, I have two options:
create a TextIO for every single file, which is overkill and fails almost every time (graph too large)
read ALL the data and then filter it within my job, which is also overkill when you have 10 TB of data while you only need 10 GB, for instance
In my case, I would like to just do something like this (pseudocode):
Read(LIST[uri1,uri2,...,uri251])
and have this instruction spawn a single TextIO task in the graph.
I am sorry if I missed something, but I couldn't find a way to do it.
Thanks
OK, I found it; the naming was misleading me:
Example 2: reading a PCollection of filenames.
Pipeline p = ...;
// E.g. the filenames might be computed from other data in the pipeline, or
// read from a data source.
PCollection<String> filenames = ...;
// Read all files in the collection.
PCollection<String> lines =
    filenames
        .apply(FileIO.matchAll())
        .apply(FileIO.readMatches())
        .apply(TextIO.readFiles());
(Quoted from Apache Beam documentation https://beam.apache.org/releases/javadoc/2.13.0/org/apache/beam/sdk/io/TextIO.html)
So we need to generate a PCollection of URIs (with Create.of) or read it from the pipeline, then match all the URIs (or patterns, I guess) and then read all the files.
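To make that concrete, a minimal sketch of the explicit-URI approach (the bucket paths and dates below are only placeholders):

// Build the explicit list of URIs outside the pipeline (placeholder paths).
List<String> uris = Arrays.asList(
    "gs://mybucket/mydata/2019-01-01/files.json",
    "gs://mybucket/mydata/2019-03-17/files.json"
    /* ... up to 251 entries ... */);

PCollection<String> lines = p
    .apply(Create.of(uris))          // a PCollection of filenames (or patterns)
    .apply(FileIO.matchAll())        // expand each entry into matched files
    .apply(FileIO.readMatches())     // open each matched file
    .apply(TextIO.readFiles());      // read the lines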

Watson Studio Data Refine

I have two questions regarding the Data Refinery process in Watson Studio.
I have 200 columns of data, and when it was first loaded into the platform, everything defaulted to the string type. How do I
Change column types in a batch
Specify the data types when I am uploading the data as a CSV file
Regarding the first question, you can use operations (under "Code an operation...") available from the dplyr R library. For example, to convert every column whose values can be parsed as doubles to a numeric (double) type, you can use something like this:
mutate_all(~ ifelse(is.na(as.double(.x)),.x,as.double(.x)))
As for the second question, I think this is not possible as long as you upload the data directly via the browser.

Spring Batch: making FlatFileItemReader support three different tokenizers for the same data file

If a single fixed-width ASCII text file has records in three different fixed-width formats, does Spring Batch's org.springframework.batch.item.file.FlatFileItemReader have a mechanism to map three different types of tokenizers to the same data file?
Take a look at the PatternMatchingCompositeLineTokenizer. It allows you to map a pattern (using * and ? wildcards) to a LineTokenizer implementation. You can read more about it in the documentation here: http://docs.spring.io/spring-batch/trunk/apidocs/org/springframework/batch/item/file/transform/PatternMatchingCompositeLineTokenizer.html
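A sketch of how that wiring might look, assuming each record type starts with a distinct literal prefix (the HDR/DTL/TRL prefixes, field names, and column ranges here are made up for illustration):

import java.util.HashMap;
import java.util.Map;
import org.springframework.batch.item.file.transform.FixedLengthTokenizer;
import org.springframework.batch.item.file.transform.LineTokenizer;
import org.springframework.batch.item.file.transform.PatternMatchingCompositeLineTokenizer;
import org.springframework.batch.item.file.transform.Range;

public class CompositeTokenizerConfig {

    public static PatternMatchingCompositeLineTokenizer compositeTokenizer() {
        // One fixed-width tokenizer per record layout (illustrative layouts).
        FixedLengthTokenizer header = new FixedLengthTokenizer();
        header.setNames("type", "batchId", "createdDate");
        header.setColumns(new Range(1, 3), new Range(4, 13), new Range(14, 21));

        FixedLengthTokenizer detail = new FixedLengthTokenizer();
        detail.setNames("type", "account", "amount");
        detail.setColumns(new Range(1, 3), new Range(4, 18), new Range(19, 30));

        FixedLengthTokenizer trailer = new FixedLengthTokenizer();
        trailer.setNames("type", "recordCount");
        trailer.setColumns(new Range(1, 3), new Range(4, 10));

        // Route each line to the right tokenizer by its prefix (* is a wildcard).
        Map<String, LineTokenizer> tokenizers = new HashMap<>();
        tokenizers.put("HDR*", header);
        tokenizers.put("DTL*", detail);
        tokenizers.put("TRL*", trailer);

        PatternMatchingCompositeLineTokenizer tokenizer = new PatternMatchingCompositeLineTokenizer();
        tokenizer.setTokenizers(tokenizers);
        return tokenizer;
    }
}

The returned tokenizer is then plugged into the line mapper of the FlatFileItemReader as usual; if each record type also needs its own FieldSetMapper, the companion PatternMatchingCompositeLineMapper is typically used instead.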

Fastest method of checking if multiple different strings are a substring of a 2nd string

Context:
I'm creating a program which will sort and rename my media files, named e.g. The.Office.s04e03.DIVX.WaREZKiNG.avi, into an organized folder structure: a folder for each TV series, a folder for each season inside it, and the media files inside those.
The problem:
I am unsure as to what the best method is for reading a file name and determining what part of that name is the TV show. For example, in "The.Office.s04e03.DIVX.WaREZKiNG.avi", The Office is the name of the series. I decided to have a list of all TV shows and to check if each show is a substring of the file name, but as far as I know this means I have to check every single series against the name for every file.
My question: How should I determine if a string contains one of many other strings?
Thanks
The Aho-Corasick algorithm [1] efficiently solves the "does this possibly long string exactly contain any of these many short strings" problem.
However, I suspect this isn't really the problem you want to solve. It seems to me that you want something to extract the likely components from a string that is in one of possibly many different formats. I suspect that having a few different regexps for likely providers, video formats, season/episode markers, perhaps a database of show names, etc, is really what you want. Then you can independently run these different 'information extractors' on your filenames to pull out their structure.
[1] http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm
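To illustrate the "information extractor" idea, a rough sketch using plain java.util.regex (the pattern and class name are made up for this example):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EpisodeExtractor {
    // Illustrative pattern only: a show name, then an sNNeNN marker; everything after is ignored.
    private static final Pattern EPISODE = Pattern.compile(
            "^(?<show>.+?)[. _-]+s(?<season>\\d{1,2})e(?<episode>\\d{1,2})",
            Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) {
        Matcher m = EPISODE.matcher("The.Office.s04e03.DIVX.WaREZKiNG.avi");
        if (m.find()) {
            String show = m.group("show").replace('.', ' ');     // "The Office"
            int season = Integer.parseInt(m.group("season"));    // 4
            int episode = Integer.parseInt(m.group("episode"));  // 3
            System.out.printf("%s - season %d, episode %d%n", show, season, episode);
        }
    }
}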
It depends on the overall structure of the filenames in general. For instance, is the series name always first? If so, a tree structure works well. Is there a standard separator between words (a period in your example)? If so, you can split the string on it and create a case-insensitive hashtable of interesting words to boost performance.
However, extracting seasons and episodes becomes more difficult. A simple solution would be to implement an algorithm to handle each format you uncover, although by using hints you could create an interesting parser if you wanted to. (Likely overkill, however.)
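A small sketch of that split-and-lookup idea, assuming a case-insensitive set of known show names (the entries and the filename are placeholders); the longest run of leading tokens that matches a known show wins:

import java.util.Set;
import java.util.TreeSet;

public class ShowLookup {
    public static void main(String[] args) {
        // Case-insensitive set of known shows (placeholder entries).
        Set<String> knownShows = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        knownShows.add("the office");
        knownShows.add("breaking bad");

        String[] tokens = "The.Office.s04e03.DIVX.WaREZKiNG.avi".split("\\.");

        // Try progressively longer prefixes of the token list as the candidate show name.
        StringBuilder candidate = new StringBuilder();
        String match = null;
        for (String token : tokens) {
            if (candidate.length() > 0) {
                candidate.append(' ');
            }
            candidate.append(token);
            if (knownShows.contains(candidate.toString())) {
                match = candidate.toString();
            }
        }
        System.out.println(match); // prints "The Office"
    }
}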