I have two questions regarding the Data Refinery process in Watson Studio (WS).
I have 200 columns of data, and when the file was first loaded into the platform, every column defaulted to the string type. How do I:
1. change column types in a batch?
2. specify the data types when uploading the data as a CSV file?
Regarding the first question, you can use operations (via Code an operation...) available from the dplyr R library. For example, to convert all numeric columns to the Double type, you can use something like this:
mutate_all(~ ifelse(is.na(as.double(.x)),.x,as.double(.x)))
As for the second question, I do not think this is possible as long as you upload the data directly via the browser.
I am working on physics simulation research. I have a large fixed grid in one of my projects that does not vary with time. The fields on the grid, on the other hand, vary with time in the simulation. I need to use VTK to record the field data at each step for visualization (ParaView).
The method I am using is to write a separate *.vtu file to disk at each time step. This basically serves the purpose, but actually writes a lot of duplicate data (re-recording the geometry of the mesh at each step), which not only consumes more disk space, but also wastes time on encoding and parsing.
I would like a way to write the mesh information only once and, for the remaining steps, write only the new field data, while still producing the same visualization. Please let me know whether VTK and ParaView provide such an interface and how to implement it.
Using .pvtu files and referring to the same .vtu as the Piece for each step should do the trick.
See this similar post on the ParaView Discourse, and the .pvtu documentation.
EDIT
This seems to be a side effect of the format; it is not supported by the writer.
The correct solution is to use another file format ...
Let me provide my own research findings for reference.
As Nico said, with a combination of .pvtu/.vtu files we could, in theory, store the geometry in a separate .vtu file that is referenced by a .pvtu file. Setting the NumberOfPieces attribute of the .pvtu file to 1 would make it reference only one separate .vtu file.
However, the VTK library does not expose a dedicated interface to control how the .vtu files are written. No matter how the writer is configured, as long as its input contains geometry, the writer will write the geometry information to disk, and this step cannot be skipped through the exposed interface.
However, it is indeed possible to make multiple .pvtu files point to the same .vtu file by manually editing the Piece node in each .pvtu file, and ParaView can recognize and visualize such a file group properly.
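To make the manual edit concrete, here is a minimal sketch in Python of what that rewrite could look like. The file names, the glob pattern, and the assumption of a single Piece per step are all hypothetical; .pvtu files are small, metadata-only XML, so a plain XML parser is enough:

import glob
import xml.etree.ElementTree as ET

SHARED_VTU = "mesh_0000_0.vtu"  # hypothetical name of the one .vtu file you keep

for pvtu_path in sorted(glob.glob("mesh_*.pvtu")):
    tree = ET.parse(pvtu_path)                       # .pvtu files contain no binary payload
    grid = tree.getroot().find("PUnstructuredGrid")
    for piece in grid.findall("Piece"):
        piece.set("Source", SHARED_VTU)              # point every Piece at the surviving .vtu
    tree.write(pvtu_path, xml_declaration=True)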
I did not proceed to try adding arrays to the unstructured grid and using pvtu output.
So, I think the conclusion is:
- If you don't want to dive into VTK's library code and XML implementation, this approach doesn't make much sense.
- If you are willing to write the full series of files, delete most of the .vtu files, and then point all of the .pvtu files' Piece nodes at the only surviving .vtu file by editing them, you can save a lot of disk space, but you will not shorten the write, read, and parse times.
- If you implement an XML writer yourself, you can in theory achieve all the requirements, but it requires a lot of coding work.
I've been struggling to find a good solution for this matter for the past day and would like to hear your thoughts.
I have a pipeline which receives a large & dynamic JSON array (containing only stringified objects),
I need to be able to create a ContainerOp for each entry in that array (using dsl.ParallelFor).
This works fine for small inputs.
Right now the array comes in as an HTTP URL to a file, due to the pipeline input argument size limitations of Argo and Kubernetes (or that is what I understood from the current open issues), but when I try to read the file in one Op to use as input for the ParallelFor, I hit the output size limitation.
What would be a good & reusable solution for such a scenario?
Thanks!
the array comes in as an HTTP URL to a file due to pipeline input argument size limitations of Argo and Kubernetes
Usually the external data is first imported into the pipeline (downloaded and written out as an output). Then the components use inputPath and outputPath to pass big pieces of data around as files.
The size limitation only applies to data that you consume as a value (using inputValue) instead of as a file.
The loops consume the data by value, so the size limit applies to them.
What you can do is make this data smaller. For example, if your data is a JSON list of big objects [{obj1}, {obj2}, ... , {objN}], you can transform it into a list of indexes [1, 2, ... , N], pass that list to the loop, and then inside the loop have a component that uses the index and the data file to select a single piece to work on (N -> {objN}).
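To make that concrete, here is a rough sketch against the KFP v1 SDK. The component names (download_data, make_indexes, process_item), their bodies, and the pipeline wiring are illustrative assumptions, not a prescribed implementation; the point is that the big JSON file travels between steps as a file artifact via InputPath/OutputPath, while only the small index list is consumed by value in the loop:

from kfp import dsl
from kfp.components import create_component_from_func, InputPath, OutputPath


def download_data(url: str, data_path: OutputPath(str)):
    # Import the external JSON array into the pipeline as a file artifact.
    import urllib.request
    urllib.request.urlretrieve(url, data_path)


def make_indexes(data_path: InputPath(str)) -> list:
    # Emit only a small list [0, 1, ..., N-1]; this is what passes by value.
    import json
    with open(data_path) as f:
        return list(range(len(json.load(f))))


def process_item(index: int, data_path: InputPath(str)):
    # Re-read the big file and pick out the single entry for this index.
    import json
    with open(data_path) as f:
        item = json.load(f)[index]
    print(item)


download_op = create_component_from_func(download_data)
indexes_op = create_component_from_func(make_indexes)
process_op = create_component_from_func(process_item)


@dsl.pipeline(name="fan-out-over-indexes")
def my_pipeline(data_url: str):
    data = download_op(data_url)                  # file artifact: the value size limit does not apply
    indexes = indexes_op(data.output)             # small JSON list of ints
    with dsl.ParallelFor(indexes.output) as idx:  # the loop only consumes the index list by value
        process_op(idx, data.output)              # each iteration selects one object from the file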
We are doing a POC on Druid to check whether it fits our use cases. We are able to ingest data, but we are not sure about the following:
How Druid supports schemaless input: let's say the input dimensions are at the end user's discretion, so there is no defined schema. The onus then lies on the application to identify new dimensions, identify their data types, and ingest them. Is there any way to achieve this?
How Druid supports data type changes: let's say that over time (say after ingesting 100 GB of data), there is a need to change the data type of a dimension from string to long, or long to string (or another type). What is the recommended way to do this without hampering ongoing ingestion?
I looked over docs but could not get a substantial overview for both use cases.
For question 1, I'd ingest everything as strings and figure it out later. It should be possible to query string columns in Druid as numbers.
The possible behaviours are explained in https://github.com/apache/incubator-druid/issues/4888:
1. Consider the values to be zero and do not try to parse string values. (This seems to be the current behaviour.)
2. Try to parse string values, and consider them zero if they are not parseable, or null, or multiple-valued.
One current inconsistency is that with expression-based column selectors (anything that goes through Parser/Expr) the behavior is (2). See IdentifierExpr + how it handles strings that are treated as numbers. But with direct column selectors the behavior is (1). In particular this means that e.g. a longSum aggregator behaves differently if it's "fieldName" : "x" vs. "expression" : "x" even though you might think they should behave the same.
You can follow the entire discussion here: https://github.com/apache/incubator-druid/issues/4888
For question 2, I think it is necessary to reindex the data:
- http://druid.io/docs/latest/ingestion/update-existing-data.html
- http://druid.io/docs/latest/ingestion/schema-changes.html
I hope this helps
1) In such cases, you don't need to specify any dimension columns in the Druid ingestion spec, and Druid will treat all columns that are not the timestamp as dimensions.
More detail about this approach can be found here:
Druid Schema less Ingestion
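To make that concrete, here is a minimal sketch of the relevant dimensionsSpec fragment. The real spec is JSON; it is shown here as an equivalent Python dict for illustration, and the excluded column name is purely hypothetical:

# Sketch only: an empty "dimensions" list tells Druid to treat every input
# column that is not the timestamp (and not excluded) as a string dimension.
dimensions_spec = {
    "dimensionsSpec": {
        "dimensions": [],                      # empty => schemaless discovery
        "dimensionExclusions": ["ignore_me"],  # hypothetical column to skip
    }
}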
2) For the 2nd question, you can make changes to the schema, and Druid will create new segments with the new data type while your old segments still use the old data type.
If you want all your segments to use the new data type, you can reindex all the segments. Please check out this link for a further description of reindexing: http://druid.io/docs/latest/ingestion/update-existing-data.html
Additional info on schema changes can be found here:
http://druid.io/docs/latest/ingestion/schema-changes.html
I need to read a file from Google Cloud Storage and split it into multiple files based on transaction_date, which is a field in the file. The file is about 6 TB in size (broken into multiple files). What's the most effective way to achieve this? Do I have to use Dataflow or Dataproc, or is there another, simpler way to do this?
I take it to mean that you want to write a separate (sharded) file per transaction_date. There isn't any direct support for this in the TextIO.Write that ships with Dataflow, but it sounds like you have a special case where you know the date range, so you can manually create ~11 different filtered TextIO.Write transforms:
PCollection<Record> input = ...
for (Date transaction_date : known_transaction_dates) {
  input.apply(Filter.by(<record has this date>))
       .apply(TextIO.Write.to(
           String.format("gs://my-bucket/output/%s", transaction_date)));
}
This is certainly not ideal. For BigQueryIO there is a feature to write to a different table based on the windowing of the data; similar functionality added to TextIO might address your use case. Otherwise, data-dependent writes of various sorts are on our radar and include cases like yours.
!!! UPDATED !!!
We have spreadsheets of complex product data coming in from multiple sources (internal, customers, vendors).
Since the authorship is so diverse, it's impractical to try governing formatting details such as column order and the number of header-rows.
These CSV spreadsheets will be uploaded to our DB via an existing form.
(My first Zend_Form ... I'm almost done with it)
The user needs to see a sample from a given spreadsheet so they can map the columns and the start row.
To achieve that, I need to generate an html table of that dynamic content, and weave the form elements in and around the table data.
The user would select which values are to be found in each column, and identify the first row of data (after any header rows).
CLICK HERE to see an example.
(NOTE: Most of my work here is under an NDA, so contrived examples are the best we can get :)
In this example, I'd expect the output to be:
$_POST('first_row' => 2, 'column0' => 'mi', 'column1' => 'lName', 'column2' => 'fName', 'column3' => 'gender')
With all those specifics mapped/defined, the uploaded spreadsheet can then be parsed and accurate data can be added to the product_history database.
Is ZF a good tool for this particular problem, or should I just write something from scratch?
How would you approach this?
I am finally JUST BARELY starting to get this ZF stuff straight in my head, and this one has got me totally lost :)
Any and All advice appreciated.
~ Mo
I think using Zend_Form would be helpful in your situation.
The tricky part, of course, is that your forms are going to be largely generated dynamically, on the fly, based on the header and first-row content of the CSV file.
Whether you use Zend_Form, pure PHP, or some other solution, a lot of what you will be doing is the same (analyzing the CSV, providing dynamic inputs based on the CSV, and then error-checking the selections). I think Zend_Form has the advantage of letting you do something like this very cleanly.
Given Zend_Form's nature, e.g. how it validates existing forms based on the elements added to the Zend_Form itself, you need to take a special approach with the form. Basically, after the user uploads the CSV once, you will create a Zend_Form object based on the number of columns, their positions in the CSV, and their names.
Since you don't want to make the user upload the CSV multiple times if they make incorrect selections, I would parse the CSV into some sort of structure, maybe a simple object or array, and then build your Zend_Form based on that data. This way, you can save that structure to the session and keep regenerating the form from the parsed data without having to read the file each time. This matters because the main challenge with Zend_Form and dynamic forms is that the form needs all of its elements and their properties not only when you display it, but also when you validate it and re-display the validated form.
I remember seeing this functionality many years ago in a PHP script, which I found is still available. Perhaps you could look at it for ideas. I won't post the link here since the screenshots and script are mostly adult-website related and the site is NSFW for the most part, but it is called TGPX by JMBSoft. The 7th of the 8 screenshots on the main product page shows the import process, where it lets the user map fields to data, which is exactly what you are doing.
Hope my advice is helpful, feel free to comment with any questions.