I want to upload data to a Google Cloud Storage object at an arbitrary offset (not chunk by chunk). Ideally the target object could have an unknown size.
Is there any way to do this with the JSON API?
You can't do this directly. Uploads to a single object must be done chunk by chunk.
However, multiple objects may be composed into a single, final object. In other words, you could upload all of your pieces as separate objects and then call "compose" on them to produce the correct final object. There are some limits to this approach, though. There's a maximum number of original elements that can be composed together (it's 1024). You'll also need to take care of deleting the original pieces once you're done composing, or you'll be storing twice the data.
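For illustration, here is a minimal sketch of that approach using the Python client library (which calls the JSON API's objects.compose method under the hood); the bucket and object names are made up:

```python
from google.cloud import storage  # pip install google-cloud-storage

client = storage.Client()
bucket = client.bucket("my-upload-bucket")  # hypothetical bucket

# Each piece was uploaded earlier as its own object.
pieces = [bucket.blob(f"parts/part-{i:04d}") for i in range(3)]

# Compose the pieces into the final object (a single compose request
# accepts at most 32 source objects, so large uploads need several rounds).
final = bucket.blob("final-object")
final.compose(pieces)

# Delete the originals so the data is not stored twice.
for piece in pieces:
    piece.delete()
```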
On a hypothetical node structure like:
NodeA:
-Subnode1: 000000001
-Subnode2: "thisIsAVeeeeeeeeeeeryLoooongString"
I would like to update NodeA every X minutes, only writing to it, never reading it. Subnode1 would be a timestamp which I set with Server.TimeStamp, and Subnode2 would be a changing string.
I would like to know whether just referencing 'NodeA' makes Firebase read the contents of the whole node, and if it does, whether there is a way to avoid it, since Subnode2 can be quite heavy and I would like to control when it is read.
Clarifications:
I'm not reading the node with any query function. My question is whether the referenced nodes (obtained with dbReference = fbbase.GetReference(path)) are read automatically when the app starts.
I know I could use a different reference for each node, but then I would incur additional upload costs, since it would mean two different connections (yes, uploads also have costs depending on the frequency).
I'm using Firebase SDK for Unity.
Thanks in advance.
If you query NodeA, it will pull down the entire contents of that node, including all of its children.
If you want just a specific child, query it instead. You can certainly build a path to Subnode1 if you want.
There is no way to exclude a certain child from a query, while getting all others. If you don't want all children, you must query each desired child individually.
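The question uses the Unity SDK, but the path idea is the same in every SDK; as a rough sketch with the Python Admin SDK (the key file and database URL are placeholders):

```python
import firebase_admin
from firebase_admin import credentials, db

cred = credentials.Certificate("service-account.json")  # placeholder key file
firebase_admin.initialize_app(cred, {
    "databaseURL": "https://example-project.firebaseio.com"  # placeholder URL
})

# Reading the parent pulls down everything under it, including the big Subnode2.
whole_node = db.reference("NodeA").get()

# Reading a deeper path pulls down only that child.
timestamp_only = db.reference("NodeA/Subnode1").get()
```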
Firebase RTDB charges for storage volume and data downloads. If you are simply updating the record in the node, you should not incur costs other than minor network costs.
A reference does not incur any fees; that being said, reads and writes do.
The reason is that a reference is just a potential location for a document or a query, and does not necessarily exist until its contents have been populated by an update snapshot.
When you read or write to a node, your data plus overhead is charged based on the current cost model per KB.
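To make the write-only update concrete, here is a sketch with the same Python Admin SDK (the Unity SDK's update call works the same way): update() only sends the new values and never downloads the node first.

```python
from firebase_admin import db  # assumes the app was initialized as above

# Only the new values travel over the wire; Subnode2's current (possibly
# large) contents are never read or downloaded.
db.reference("NodeA").update({
    "Subnode1": {".sv": "timestamp"},  # server-side timestamp placeholder (REST '.sv' convention)
    "Subnode2": "someNewVeryLongString",
})
```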
I've been struggling to find a good solution to this problem for the past day and would like to hear your thoughts.
I have a pipeline which receives a large & dynamic JSON array (containing only stringified objects),
I need to be able to create a ContainerOp for each entry in that array (using dsl.ParallelFor).
This works fine for small inputs.
Right now the array comes in as a file HTTP URL due to the pipeline input argument size limitations of Argo and Kubernetes (or that is what I understood from the current open issues), but when I try to read the file in one Op to use it as input for the ParallelFor, I run into the output size limitation.
What would be a good & reusable solution for such a scenario?
Thanks!
the array comes in as a file HTTP URL due to the pipeline input argument size limitations of Argo and Kubernetes
Usually the external data is first imported into the pipeline (downloaded and output). Then the components use inputPath and outputPath to pass big data pieces as files.
The size limitation only applies for the data that you consume as value instead of file using inputValue.
The loops consume the data by value, so the size limit applies to them.
What you can do is make this data smaller. For example, if your data is a JSON list of big objects [{obj1}, {obj2}, ..., {objN}], you can transform it into a list of indexes [1, 2, ..., N], pass that small list to the loop, and then inside the loop have a component that uses the index plus the full data to select the single piece to work on: N -> {objN}.
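A rough sketch of that index trick with the KFP v1 SDK (the function and pipeline names are made up, and the data is assumed to be reachable at an HTTP URL, as in the question):

```python
from kfp import dsl
from kfp.components import create_component_from_func


def get_indexes(data_url: str) -> list:
    """Download the big JSON array and return only a small list of indexes."""
    import json
    import urllib.request
    with urllib.request.urlopen(data_url) as f:
        items = json.load(f)
    return list(range(len(items)))


def process_item(index: int, data_url: str):
    """Download the array again and work on just the item at `index`."""
    import json
    import urllib.request
    with urllib.request.urlopen(data_url) as f:
        item = json.load(f)[index]
    print(f"processing item {index}: {item[:100]}...")


get_indexes_op = create_component_from_func(get_indexes)
process_item_op = create_component_from_func(process_item)


@dsl.pipeline(name="fan-out-by-index")
def fan_out_pipeline(data_url: str):
    indexes = get_indexes_op(data_url=data_url)
    # Only the small index list passes through the loop arguments,
    # so the Argo/Kubernetes size limit is not hit.
    with dsl.ParallelFor(indexes.output) as idx:
        process_item_op(index=idx, data_url=data_url)
```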
Does GStreamer have the following functionalities/features, or is it possible to implement them on top of GStreamer:
Time windows: set up the graph such that a sink pad of one element does not just receive the current frame, but also n previous frames and m future frames. Including when seeking to a new position.
No data copies when passing data between elements, instead reusing the same buffer.
Having shared data between multiple elements on different branches, data that changes with time but is buffered in such a way that all elements get the same value for it at the same frame index.
Q1) Time windows
You need to write your plugin using GstAdapter.
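Plugins are normally written in C, but just to illustrate the GstAdapter idea, here is a small sketch with the Python bindings (GstBase.Adapter wraps the same object): accumulate incoming buffers and only pull a window out once enough data is available. The window size and the dummy input are made up.

```python
import gi
gi.require_version("Gst", "1.0")
gi.require_version("GstBase", "1.0")
from gi.repository import Gst, GstBase

Gst.init(None)

WINDOW_BYTES = 4 * 1024  # hypothetical window size (n past + m future frames worth of data)
adapter = GstBase.Adapter()

def on_new_buffer(buf: Gst.Buffer):
    """Called for every incoming buffer (e.g. from a chain function or pad probe)."""
    adapter.push(buf)
    # Only process once a full window has accumulated. take_buffer() removes
    # the data from the adapter; use get_buffer() instead if the window should
    # stay in place for overlapping/sliding windows.
    while adapter.available() >= WINDOW_BYTES:
        window = adapter.take_buffer(WINDOW_BYTES)
        # ... run the windowed processing on `window` here ...

# Feed it some dummy data to show the flow:
for _ in range(8):
    on_new_buffer(Gst.Buffer.new_wrapped(b"\x00" * 1024))
```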
Q2) No data copies when passing data between elements
It's done by default. No data is copied from element to element unless required; only a pointer to the GstBuffer instance is passed along. If an element is an encoder or a filter that needs to work on a buffer to produce new data, then a new GstBuffer instance is created with the newly generated data in GstMemory, of course.
Q3) Having shared data between multiple elements
Not sure exactly what you mean. Is it possible to achieve what you want by sharing GstMemory? Take a look at gst_memory_share(), gst_buffer_copy_region(), or gst_adapter_get_buffer().
I'm composing several files into one and then performing a "rewrite" operation to reset componentCount so it won't block further compositions (to avoid the 1024-component problem). But the resulting rewritten object's componentCount property keeps increasing, as if it were just a "rename" request.
It is stated in documentation (https://cloud.google.com/storage/docs/json_api/v1/objects/rewrite):
When you rewrite a composite object where the source and destination
are different locations and/or storage classes, the result will be a
composite object containing a single component (and, as always with
composite objects, it will have only a crc32c checksum, not an MD5).
It is not clear to me what they mean by "different locations": different object names and/or different buckets?
Is there a way to reset this count w/o downloading and uploading resulting composite?
Locations refers to where the source and destination buckets are located geographically (us-east1, asia, etc.) -- see https://cloud.google.com/about/locations
If your rewrite request is between buckets in different locations and/or storage classes, the operation copies bytes, and for a composite object it results in a new object with a component count of 1. Otherwise the operation completes without copying bytes, and in that case the component count of a composite object is not changed.
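For reference, a cross-location rewrite can be driven like this with the Python client (bucket names are made up); the destination's component_count should come back as 1 when bytes were actually copied:

```python
from google.cloud import storage

client = storage.Client()
src = client.bucket("my-composite-bucket").blob("big-composite")      # hypothetical
dst = client.bucket("my-other-region-bucket").blob("big-composite")   # hypothetical

# objects.rewrite may need several calls for large objects; loop on the token.
token, rewritten, total = dst.rewrite(src)
while token is not None:
    token, rewritten, total = dst.rewrite(src, token=token)

dst.reload()
print(dst.component_count)  # 1 when bytes were actually copied
```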
It's no longer necessary to reset the component count using rewrite or download/upload because there's no longer a restriction on the component count. Composing > 1024 parts is allowed.
https://cloud.google.com/storage/docs/composite-objects
I am having issues with 'data overload' while processing point cloud data in MATLAB. This is what I am currently doing:
I begin with my raw data files, each in the order of ~30Mb each.
I then do initial processing on them to extract n individual objects and remove outlying points, which are all combined into a 1 x n structure, testset, saved into testset.mat (~100Mb).
So far so good. Now things become complicated:
For each point in each object in testset, I will compute one of a number of features, which ends up being a matrix of some size (for each point). The size of the matrix, and some other properties of the computation, are parameters of the calculations. I save these computed features in a 1 x n cell array, each cell of which contains an array of the matrices for each point.
I then save this cell array in a .mat file whose name specifies the parameters, the name of the test data used, and the types of features extracted. For example:
testset_feature_type_A_5x5_0.2x0.2_alpha_3_beta_4.mat
Now for each of these files, I then do some further processing (using a classification algorithm). Again there are more parameters to set.
So now I am in a tricky situation, where each final piece of the initial data has come through some path, but the path taken (and the parameters set along that path) are not intrinsically held with the data itself.
So my question is:
Is there a better way to do this? Can anyone who has experience in working with large datasets in MATLAB suggest a way to store the data and the parameter settings more efficiently, and more integrally?
Ideally, I would be able to look up a certain piece of data without having to use regex on the file strings—but there is also an incentive to keep individually processed files separate to save system memory when loading them in (and to help prevent corruption).
The time taken for each calculation (some ~2 hours) prohibits computing data 'on the fly'.
For a similar problem, I have created a class structure that does the following:
- Each object is linked to a raw data file.
- For each processing step, there is a property.
- The set method of each property saves the data to a file (in a directory with the same name as the raw data file), stores the file name, and updates a "status" property to indicate that the step is done.
- The get method of each property loads the data if a file name has been stored and the status indicates "done".
- Finally, the objects can be saved/loaded, so that I can do some processing now, save the object, load it later, and immediately know how far along the particular data set is in the processing pipeline.
Thus, the only data in memory is the data that is currently being worked on, and you can easily know which data set is at which processing stage. Furthermore, if you set up your methods to accept arrays of objects, you can do very convenient batch processing.
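The question is about MATLAB, but to make the structure concrete, here is the same pattern sketched in Python (the file layout and the single generic "step" are made up):

```python
import os
import pickle

class DataSet:
    """Tracks one raw data file and lazily saves/loads each processing step."""

    def __init__(self, raw_file: str):
        self.raw_file = raw_file
        self.out_dir = os.path.splitext(raw_file)[0]  # directory named after the raw file
        os.makedirs(self.out_dir, exist_ok=True)
        self.status = {}   # step name -> "done"
        self._cache = {}   # step name -> loaded data

    def _path(self, step: str) -> str:
        return os.path.join(self.out_dir, step + ".pkl")

    def set_step(self, step: str, data) -> None:
        """Save the result of a processing step to disk and mark it as done."""
        with open(self._path(step), "wb") as f:
            pickle.dump(data, f)
        self.status[step] = "done"
        self._cache.pop(step, None)

    def get_step(self, step: str):
        """Load the result of a step from disk only when it is asked for."""
        if self.status.get(step) != "done":
            raise KeyError(f"step {step!r} has not been computed yet")
        if step not in self._cache:
            with open(self._path(step), "rb") as f:
                self._cache[step] = pickle.load(f)
        return self._cache[step]
```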
I'm not completely sure if this is what you need, but the save command allows you to store multiple variables inside a single .mat file. If your parameter settings are, for example, stored in an array, then you can save this together with the data set in a single .mat file. Upon loading the file, both the dataset and the array with parameters are restored.
Or do you want to be able to load the parameters without loading the file? Then I would personally opt for the cheap solution of having a second set of files with just the parameters (but similar filenames).