KubeFlow, handling large dynamic arrays and ParallelFor with current size limitations - kubernetes

I've been struggling to find a good solution for this matter for the past day and would like to hear your thoughts.
I have a pipeline which receives a large & dynamic JSON array (containing only stringified objects),
I need to be able to create a ContainerOp for each entry in that array (using dsl.ParallelFor).
This works fine for small inputs.
Right now the array comes in as an HTTP URL to a file, because of the size limits on pipeline input arguments in Argo and Kubernetes (or that is what I understood from the currently open issues). But when I try to read the file in one Op and use its output as the input for the ParallelFor, I hit the output size limitation.
What would be a good & reusable solution for such a scenario?
Thanks!

the array comes in as a file http url due to pipeline input arguments size limitations of argo and Kubernetes
Usually, external data is first imported into the pipeline (downloaded and written as an output). Components then use inputPath and outputPath to pass large pieces of data as files.
The size limitation applies only to data that you consume by value (using inputValue) rather than as a file.
The loops consume the data by value, so the size limit applies to them.
What you can do is make this data smaller. For example, if your data is a JSON list of big objects [{obj1}, {obj2}, ... , {objN}], you can transform it into a list of indexes [1, 2, ... , N] and pass that list to the loop. Inside the loop, a component takes the index and the full data and selects the single piece to work on: N -> {objN}.
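The index approach can be sketched in plain Python. This is only an illustration of the two pieces of logic, not Kubeflow component code: the function names `make_index_list` and `select_item` are hypothetical, the sample array is made up, and the sketch uses 0-based indexes rather than 1..N. In a real pipeline the big array would presumably travel between components as a file, and only the small index list and each index would be passed by value.

```python
import json


def make_index_list(json_array_text):
    """Return a small JSON list of positions, one per entry in the big array."""
    items = json.loads(json_array_text)
    return json.dumps(list(range(len(items))))


def select_item(json_array_text, index):
    """Return the single entry that one loop iteration should work on."""
    items = json.loads(json_array_text)
    return items[index]


# Made-up input: a JSON array whose entries are stringified objects.
big_array = json.dumps(
    [json.dumps({"id": i, "payload": "x" * 10}) for i in range(3)]
)

# This short list is what the loop would iterate over.
indexes = make_index_list(big_array)

# Inside the loop, each iteration resolves its index back to one entry.
second_entry = json.loads(select_item(big_array, 1))
```

The list handed to the loop grows only with the number of entries, not with the size of each object, which is what keeps it under the by-value limit.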

Related

Filtering GCS uris before job execution

I have a frequent use case I couldn't solve.
Let's say I have a filepattern like gs://mybucket/mydata/*/files.json where * is supposed to match a date.
Imagine I want to keep 251 dates (this is an example, let's say a big number of dates but without a meta-pattern to match them like 2019* or else).
For now, I have two options:
create a TextIO for every single file, which is overkill and fails almost every time (graph too large)
read ALL the data and then filter it within my job, which is also overkill when you have 10 TB of data but only need 10 GB, for instance
In my case, I would like to just do something like that (pseudo code) :
Read(LIST[uri1,uri2,...,uri251])
And I would like this instruction to spawn a single TextIO task in the graph.
I am sorry if I missed something, but I couldn't find a way to do it.
Thanks
OK, I found it; the naming was misleading me:
Example 2: reading a PCollection of filenames.
Pipeline p = ...;
// E.g. the filenames might be computed from other data in the pipeline, or
// read from a data source.
PCollection<String> filenames = ...;
// Read all files in the collection.
PCollection<String> lines =
    filenames
        .apply(FileIO.matchAll())
        .apply(FileIO.readMatches())
        .apply(TextIO.readFiles());
(Quoted from Apache Beam documentation https://beam.apache.org/releases/javadoc/2.13.0/org/apache/beam/sdk/io/TextIO.html)
So we need to generate a PCollection of URIs (with Create/of) or read it from the pipeline, then match all the URIs (or patterns, I guess), and then read all the files.
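Building the list of URIs is ordinary string work that happens before the pipeline is constructed. A minimal sketch in Python, assuming the date directories are named in ISO format (YYYY-MM-DD), which the question does not state; `uris_for_dates` and the sample dates are hypothetical:

```python
from datetime import date, timedelta

# Same layout as the file pattern in the question, with the wildcard
# replaced by a placeholder for one concrete date.
URI_TEMPLATE = "gs://mybucket/mydata/{day}/files.json"


def uris_for_dates(days):
    """Expand a list of dates into one explicit URI per date."""
    return [URI_TEMPLATE.format(day=d.isoformat()) for d in days]


# Made-up selection: three dates with no common prefix pattern.
wanted = [date(2019, 1, 1) + timedelta(days=7 * k) for k in range(3)]
uris = uris_for_dates(wanted)
```

The resulting list is what would be turned into the PCollection of filenames, so the graph contains one read step no matter how many dates are selected.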

Working with many inputs (Matlab)

I'm new to Matlab and I need some suggestions on how to deal with having many inputs to a function.
The program reads data from multiple elements and stores them in an array, which I'm doing in a loop. The problem is that if I enter the wrong information for one element, I must re-enter all the data from the beginning. I believe there must be a better way to input these data, such as reading them from an external file.
The problem with the external file, as far as I know, is reading multiple arrays from a single file, hence the need for multiple external files. I believe there must be a better way for that as well.
As noted by @beaker, you can use save and load to store the data. You can store multiple variables in a given file without a problem.

Features of GStreamer

Does GStreamer have the following functionalities/features, or is it possible to implement them on top of GStreamer:
Time windows: set up the graph such that a sink pad of one element does not just receive the current frame, but also n previous frames and m future frames. Including when seeking to a new position.
No data copies when passing data between elements, instead reusing the same buffer.
Having shared data between multiple elements on different branches, which changes with time but is buffered in such a way that all elements get the same value for it for the same frame index.
Q1) Time windows
You need to write your plugin using GstAdapter.
Q2) No data copies when passing data between elements
It's done by default. No data is copied from element to element unless required; only a pointer to a GstBuffer instance is passed. If an element such as an encoder or filter needs to work on a buffer to produce new data, a new GstBuffer instance is created with the newly generated data in GstMemory, obviously.
Q3) Having shared data between multiple elements
Not sure exactly what you mean. Is it possible to achieve what you want by sharing GstMemory? Take a look at gst_memory_share(), gst_buffer_copy_region(), or gst_adapter_get_buffer().

Handling Arrays in soap webservice testing using fitnesse

Is there a way to dynamically create tables in wiki?
Use case: I'm trying to mimic something similar to SOAP Sonar in FitNesse.
SOAP Sonar:
1. Once we import the WSDL, SOAP Sonar generates inputs for the operations in the WSDL.
2. Choose an operation, enter the input, and then execute the operation.
3. In the case of arrays, we can select the size of the array and enter values in the respective array.
FitNesse:
1. I'm able to achieve point 1 using the soapui jars.
2. I'm able to achieve point 2 using the xmlhttptest fixture.
I'm stuck on the 3rd point. Is there a way I can do this in FitNesse? (My idea is that from point 1 I can get sample input for each operation, from which I will know that there are arrays/complex types present in input.xml. But how do we represent this in the wiki dynamically?)
Thanks in advance
What I've done in the past is use ListFixture (and MapFixture) to dynamically fill a List (and Map/Hashes for each element's properties) and then use these as input values to a XmlHttpTest's feature to create the body to be sent using a FreeMarker template (which allows iteration over a list, which I use to create elements in the array based on the list).
But this gets quite complex quickly. Is that level of flexibility truly required? I found that quite often hard coding the number of elements in arrays/lists in the wiki is simpler to do and makes the test far easier to understand/maintain.
In most cases I prefer to create a script (or scenario) with the right number of elements for the test case(s), with the request in the wiki page. The use of scenarios allows me to test with different values (but the same number of elements). Another element count gets its own script/scenario.
Being able to dynamically change the number of elements is only worthwhile if you need to test for many different counts, otherwise the added complexity of dynamically creating the body is just not worth it.

How should I store my large MATLAB data files during analysis?

I am having issues with 'data overload' while processing point cloud data in MATLAB. This is what I am currently doing:
I begin with my raw data files, each on the order of ~30 MB.
I then do initial processing on them to extract n individual objects and remove outlying points, which are all combined into a 1 x n structure, testset, saved into testset.mat (~100 MB).
So far so good. Now things become complicated:
For each point in each object in testset, I will compute one of a number of features, which ends up being a matrix of some size (for each point). The size of the matrix, and some other properties of the computation, are parameters of the calculations. I save these computed features in a 1 x n cell array, each cell of which contains an array of the matrices for each point.
I then save this cell array in a .mat file, where the name specifies the parameters, the name of the test data used, and the types of features extracted. For example:
testset_feature_type_A_5x5_0.2x0.2_alpha_3_beta_4.mat
Now for each of these files, I then do some further processing (using a classification algorithm). Again there are more parameters to set.
So now I am in a tricky situation, where each final piece of the initial data has come through some path, but the path taken (and the parameters set along that path) are not intrinsically held with the data itself.
So my question is:
Is there a better way to do this? Can anyone who has experience in working with large datasets in MATLAB suggest a way to store the data and the parameter settings more efficiently, and more integrally?
Ideally, I would be able to look up a certain piece of data without having to use regex on the file strings—but there is also an incentive to keep individually processed files separate to save system memory when loading them in (and to help prevent corruption).
The time taken for each calculation (some ~2 hours) prohibits computing data 'on the fly'.
For a similar problem, I have created a class structure that does the following:
Each object is linked to a raw data file
For each processing step, there is a property
The set method of the properties saves the data to file (in a directory with the same name as the raw data file), stores the file name, and updates a "status" property to indicate that this step is done.
The get method of the properties loads the data if the file name has been stored and the status indicates "done".
Finally, the objects can be saved/loaded, so that I can do some processing now, save the object, later load it and I immediately know how far along the particular data set is in the processing pipeline.
Thus, the only data in memory is the data that is currently being worked on, and you can easily know which data set is at which processing stage. Furthermore, if you set up your methods to accept arrays of objects, you can do very convenient batch processing.
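The structure described above can be sketched as follows. This is a rough Python rendering of the pattern, not the original MATLAB class: the class name `ProcessedDataset`, the step names, the pickle file format, and the choice to store the parameters in the same file as the result are all assumptions made for illustration.

```python
import os
import pickle
import tempfile


class ProcessedDataset:
    """Tracks one raw data file and keeps each step's result on disk."""

    def __init__(self, raw_path, work_root):
        self.raw_path = raw_path
        # One directory per raw file, named after the file (no extension).
        name = os.path.splitext(os.path.basename(raw_path))[0]
        self.work_dir = os.path.join(work_root, name)
        os.makedirs(self.work_dir, exist_ok=True)
        self.status = {}  # step name -> "done"
        self.files = {}   # step name -> path of the saved result

    def set_step(self, step, data, params):
        """Save a step's result (with its parameters) and mark it done."""
        path = os.path.join(self.work_dir, step + ".pkl")
        with open(path, "wb") as fh:
            pickle.dump({"params": params, "data": data}, fh)
        self.files[step] = path
        self.status[step] = "done"

    def get_step(self, step):
        """Load a step's result from disk, or return None if not done."""
        if self.status.get(step) != "done":
            return None
        with open(self.files[step], "rb") as fh:
            return pickle.load(fh)


# Made-up usage: a temporary directory stands in for the data folder.
work_root = tempfile.mkdtemp()
ds = ProcessedDataset("scan_01.raw", work_root)
ds.set_step("features", [[1, 2], [3, 4]], {"window": "5x5", "alpha": 3})
loaded = ds.get_step("features")
```

Only the small bookkeeping object stays in memory; each result is read back from disk when it is asked for, and the parameters that produced it come back with it instead of being parsed out of a file name.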
I'm not completely sure if this is what you need, but the save command allows you to store multiple variables inside a single .mat file. If your parameter settings are, for example, stored in an array, then you can save this together with the data set in a single .mat file. Upon loading the file, both the dataset and the array with parameters are restored.
Or do you want to be able to load the parameters without loading the file? Then I would personally opt for the cheap solution of having a second set of files with just the parameters (but similar filenames).