Writing Files in HDFS- Using TextIO API vs Java - BufferedWriter(new FileWriter()) - apache-beam

We have a requirement to create a file for each employee ID. We used GroupByKey and were able to build the data structure
< ID, Iterable < Employee > >, where we converted the Iterable < Employee > to a List.
Then we created a folder and a file for each employee ID.
So for each < EmployeeID, List < Employee > > element, we loop over the rows and write each one to the file with a BufferedWriter. Is this good enough, or do we need to use TextIO to do the same thing? The question is whether TextIO would give a drastic performance improvement over writing every row with a BufferedWriter.
Thanks

It is surprisingly difficult to write files in a way that produces well-defined results and no data loss or duplication in case of failures. You can get a glimpse of this complexity by looking at the implementation of the WriteFiles transform, used by TextIO under the hood. So TextIO handles this complexity for you, and I strongly recommend using it, if you can - rather than using hand-crafted code. You probably want the write().to(DynamicDestinations) version.
If you're doing something that TextIO definitely cannot do, I still recommend looking at the implementation of WriteFiles to understand what else needs to be done to make your code resilient against failures.
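To give a feel for one of the failure cases mentioned above, here is a minimal sketch of a "write to a temporary file, then rename" pattern, which is one ingredient of writing files safely. It is written in Python against a local filesystem, not HDFS, and it is not Beam's implementation; the function name `write_employee_file` and the file layout are assumptions made up for this illustration.

```python
import os
import tempfile
from pathlib import Path


def write_employee_file(out_dir, employee_id, rows):
    """Write all rows for one employee so readers never see a partial file.

    Hypothetical helper for illustration; not part of Beam or TextIO.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    final_path = out_dir / f"{employee_id}.txt"

    # Write to a temporary file in the same directory first.
    fd, tmp_name = tempfile.mkstemp(dir=out_dir, prefix=f".{employee_id}-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            for row in rows:
                tmp.write(row + "\n")
            tmp.flush()
            os.fsync(tmp.fileno())
        # Rename into place; a retried write replaces the old result
        # instead of appending to it or leaving a half-written file.
        os.replace(tmp_name, final_path)
    except BaseException:
        # Remove the partial temporary file before re-raising.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise
    return final_path


demo_dir = tempfile.mkdtemp()
first = write_employee_file(demo_dir, "E42", ["row 1", "row 2"])
# Simulate a retry of the same bundle: the result is identical, not duplicated.
second = write_employee_file(demo_dir, "E42", ["row 1", "row 2"])
```

A plain BufferedWriter loop that is retried after a worker failure can leave truncated or duplicated rows behind. This sketch covers only that one case and says nothing about the other concerns a distributed runner has to handle.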


How to implement Featuretools into my ML Process?

I am exploring the possibility of adding Featuretools to my pipeline so that I can create new features from my DataFrame.
Currently I am using GridSearchCV with a Pipeline embedded inside it. Since Featuretools creates new features by aggregating over columns, such as STD(column), I feel it is susceptible to data leakage. Their FAQ gives an example approach for tackling this, but it is not suitable for the Pipeline structure I am using.
Idea 0: I would love to integrate it directly into my Pipeline, but it does not seem to be compatible with Pipelines. It would use the fold's training data to construct features and then transform the fold's test data, K times. At the end, during the refit=True stage of GridSearchCV, it would use the whole data set to construct the features. If you have an example that shows otherwise, you are very welcome to share it.
Idea 1: I can switch to a manual CV structure that is not embedded in a Pipeline. Inside it, I can use the training data to construct new features and transform the test data with them. This runs K times. At the end, all the data can be used to construct the final model.
This is the safest option, at the cost of extra time and complexity.
Idea 2: Use it with the whole data set and ignore the possibility of leakage. I am not in favor of this, of course. But when I look at the project's GitHub page, all the examples combine the train and test data, create the features from the whole data set, and only then split into train and test for modeling.
https://github.com/Featuretools/predict-taxi-trip-duration/blob/master/NYC%20Taxi%203%20-%20Simple%20Featuretools.ipynb
If the developers of the project work this way, I could give the whole-data approach a chance.
What do you think? I would love to hear about your experiences with Featuretools.
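For reference, the manual CV structure of Idea 1 can be sketched roughly as follows. This is plain Python with a hand-rolled per-group mean standing in for a Featuretools aggregation; it does not use the Featuretools API. The helper names, the column names and the toy rows are all assumptions made up for the illustration.

```python
from statistics import mean


def fit_group_means(train_rows, key, value):
    """Learn per-group means from the training fold only."""
    groups = {}
    for row in train_rows:
        groups.setdefault(row[key], []).append(row[value])
    per_group = {k: mean(v) for k, v in groups.items()}
    fallback = mean(row[value] for row in train_rows)
    return per_group, fallback


def add_group_mean(rows, key, per_group, fallback, name):
    """Attach the learned aggregate; unseen groups get the training-wide mean."""
    return [{**row, name: per_group.get(row[key], fallback)} for row in rows]


def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size


# Toy data, made up for the example.
rows = [
    {"customer": "a", "amount": 10.0},
    {"customer": "a", "amount": 20.0},
    {"customer": "b", "amount": 5.0},
    {"customer": "b", "amount": 15.0},
    {"customer": "c", "amount": 40.0},
    {"customer": "a", "amount": 30.0},
]

folds = []
for train_idx, test_idx in k_fold_indices(len(rows), k=3):
    train = [rows[i] for i in train_idx]
    test = [rows[i] for i in test_idx]
    # The aggregate is fitted on the training fold and only applied to the test fold.
    per_group, fallback = fit_group_means(train, "customer", "amount")
    train_feat = add_group_mean(train, "customer", per_group, fallback, "MEAN(amount)")
    test_feat = add_group_mean(test, "customer", per_group, fallback, "MEAN(amount)")
    folds.append((train_feat, test_feat))
    # A model would be fitted on train_feat and scored on test_feat here.
```

The point of the structure is that no test row contributes to the aggregate it receives, which is exactly what building the features from the whole data set gives up.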

How to use VTK to efficiently write time-varying field data on a fixed mesh?

I am working on physics simulation research. I have a large fixed grid in one of my projects that does not vary with time. The fields on the grid, on the other hand, vary with time in the simulation. I need to use VTK to record the field data in each step for visualization (Paraview).
The method I am using is to write a separate *.vtu file to disk at each time step. This basically serves the purpose, but actually writes a lot of duplicate data (re-recording the geometry of the mesh at each step), which not only consumes more disk space, but also wastes time on encoding and parsing.
I would like to have a way to write the mesh information only once, and the rest of the time only new field data is written, while being able to guarantee the same visualization. Please let me know if VTK and Paraview provide such an interface and how to implement it.
Using a .pvtu file that refers to the same .vtu as its Piece for each step should do the trick.
See this similar post on the ParaView discourse, and the pvtu doc
EDIT
This seems to be a side effect of the format; it is not supported by the writer.
The correct solution is to use another file format ...
Let me provide my own research findings for reference.
As Nico said, with the combination of pvtu/vtu files, we could in theory store the geometry in a separate vtu file that is referenced by a pvtu file. Setting the NumberOfPieces attribute of the pvtu file to 1 produces only one separate vtu file.
However, the VTK library does not expose a dedicated interface to control how vtu files are written. No matter how it is configured, as long as the writer's input contains geometry, the writer will write the geometry to disk, and this step cannot be skipped through the exposed interface.
It is nevertheless possible to make multiple pvtu files point to the same vtu file by manually editing the Piece node in each pvtu file, and ParaView recognizes and visualizes such a file group properly.
I did not proceed to try adding arrays to the unstructured grid and using pvtu output.
So, I think the conclusion is:
If you don't want to dive into VTK's library code and XML implementation, then this approach doesn't make sense.
If you are willing to write a series of files, delete most of the vtu files, and then point the Piece node of every pvtu file to the only surviving vtu file by editing the pvtu files, you can save a lot of disk space, but you will not shorten the write, read, and parse times.
If you implement an XML writer yourself, you can in theory meet all the requirements, but it requires a lot of coding work.
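The manual edit of the Piece node described above could be scripted along these lines. This is only a sketch using Python's standard XML library rather than VTK; the function name `repoint_pieces` and the file names `step_0007.pvtu` and `mesh.vtu` are made up for the example, and the sample file is a stripped-down stand-in for what a real writer produces.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def repoint_pieces(pvtu_path, shared_vtu_name):
    """Make every <Piece> in a .pvtu file reference the same .vtu file.

    Hypothetical helper; returns the number of Piece nodes that were edited.
    """
    tree = ET.parse(str(pvtu_path))
    pieces = tree.getroot().findall("./PUnstructuredGrid/Piece")
    for piece in pieces:
        piece.set("Source", shared_vtu_name)
    tree.write(str(pvtu_path), xml_declaration=True, encoding="utf-8")
    return len(pieces)


# Minimal stand-in for a .pvtu file; real files declare more arrays.
SAMPLE = """<?xml version="1.0"?>
<VTKFile type="PUnstructuredGrid" version="0.1" byte_order="LittleEndian">
  <PUnstructuredGrid GhostLevel="0">
    <PPoints>
      <PDataArray type="Float32" NumberOfComponents="3"/>
    </PPoints>
    <Piece Source="step_0007.vtu"/>
  </PUnstructuredGrid>
</VTKFile>
"""

demo = Path(tempfile.mkdtemp()) / "step_0007.pvtu"
demo.write_text(SAMPLE, encoding="utf-8")
changed = repoint_pieces(demo, "mesh.vtu")
```

This only automates the hand edit and inherits its limitations: the writer still emits every vtu file first, so write time is unchanged.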

What's the best way to store a huge Map object populated at runtime to be reused by another tool?

I'm writing a Scala tool that encodes ~300 JSON Schema files into files of a different format and saves them to disk. I need these schemas again later for instantiating JSON data files; more precisely, I don't need the whole schemas, only a few fields of each.
I was thinking that the best solution could be to populate a Map object (while the tool encodes the schemas) containing only the info that I need, and later re-use it (in another run of the tool) as an already compiled and populated map.
I've got two questions:
1. Is this really the most performant solution? and
2. How can I save the Map object, created at runtime, on disk as a file that can be later built/executed with the rest of my code?
I've read several posts about serialization and storing objects, but I'm not entirely sure whether these are what I need. I'm also not sure this is the best solution, and I would like to hear an opinion from people with more experience than me.
What I would like to achieve is an elegant solution that allows me to lookup values from a map generated by another tool.
The whole process of compiling/building/executing sometimes is still confusing to me, so apologies if the question is trivial.
To answer your question:
I think using an embedded KV store would be more efficient, considering the number of files and the amount of traversal.
Here is a small wiki on "How to use RocksJava". You can treat it as a KV store. https://github.com/facebook/rocksdb/wiki/RocksJava-Basics
You can use the reference below to serialize and deserialize an object in Scala and put it into RocksDB as a key-value pair, as I mentioned in the comment.
Convert Any type in scala to Array[Byte] and back
To use RocksDB, the dependency below in your build will suffice:
"org.rocksdb" % "rocksdbjni" % "5.17.2"
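To illustrate the embedded key-value store idea independently of RocksDB, here is a minimal sketch in Python rather than Scala. It uses the standard library's `dbm.dumb` as a stand-in for RocksDB and JSON as the serialization format; the function names, the schema name `person` and its fields are assumptions made up for the example.

```python
import dbm.dumb
import json
import os
import tempfile


def save_lookup(path, lookup):
    """Persist a dict of schema name -> selected fields to an on-disk key-value store."""
    with dbm.dumb.open(path, "n") as db:
        for name, fields in lookup.items():
            # Keys and values are stored as bytes, as in most embedded KV stores.
            db[name.encode("utf-8")] = json.dumps(fields).encode("utf-8")


def load_fields(path, name):
    """Fetch one entry in a later run without rebuilding the whole map."""
    with dbm.dumb.open(path, "r") as db:
        raw = db.get(name.encode("utf-8"))
    return None if raw is None else json.loads(raw.decode("utf-8"))


# First run: the encoder populates the store while it processes the schemas.
store = os.path.join(tempfile.mkdtemp(), "schemas")
save_lookup(store, {"person": {"required": ["name"], "id_field": "pid"}})

# Later run: another tool looks up only the entry it needs.
person = load_fields(store, "person")
missing = load_fields(store, "unknown")
```

The same shape carries over to RocksDB: serialize each value to a byte array, put it under a byte-array key, and read individual keys back on demand.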
Thanks.

Need some help understanding Vocabulary of Interlinked Dataset (VoID) in Linked Open Data

I have been trying to understand VoID in Linked Open Data. It would be great if anyone could help clarify some of my confusions.
Does it need to be stored in a separate file, or can it be included in the RDF dataset itself? If so, how do I query it? (A sample query would be really helpful)
How is the information in VoID used in real life?
Does it need to be stored in a separate file, or can it be included in the RDF dataset itself? If so, how do I query it? (A sample query would be really helpful)
In theory not, but for practical purposes yes. In the end the information is encoded in triples, so it doesn't really matter in what file you put them and you could argue that it's best to actually put the VoID info into the data files and serve these triples with your data as meta-info. It's query-able as all other forms of RDF, either load it into some SPARQL endpoint or use a library that can directly load RDF files. This however also shows the reason why a separate file makes sense: instead of having to load potentially large data files just to get some dataset meta info, it makes sense to offer the meta-data in its own file.
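As a rough illustration of "it's all just triples", the sketch below reads a tiny VoID description and collects the properties of every resource typed as void:Dataset. It uses only the Python standard library and a deliberately simplified N-Triples parser; a real setup would use an RDF library or a SPARQL endpoint. The dataset URI, title, triple count and endpoint are made-up values.

```python
import re

VOID = "http://rdfs.org/ns/void#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical VoID description in N-Triples; every value here is invented.
VOID_NT = """\
<http://example.org/dataset> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://rdfs.org/ns/void#Dataset> .
<http://example.org/dataset> <http://purl.org/dc/terms/title> "Example dataset" .
<http://example.org/dataset> <http://rdfs.org/ns/void#triples> "12345"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://example.org/dataset> <http://rdfs.org/ns/void#sparqlEndpoint> <http://example.org/sparql> .
"""

# Handles only IRI subjects/predicates and IRI or simple literal objects.
TRIPLE = re.compile(r'^<([^>]+)>\s+<([^>]+)>\s+(?:<([^>]+)>|"([^"]*)"\S*)\s*\.$')


def parse_triples(text):
    """Parse a small subset of N-Triples into (subject, predicate, object) tuples."""
    triples = []
    for line in text.splitlines():
        match = TRIPLE.match(line.strip())
        if match:
            s, p, iri, literal = match.groups()
            triples.append((s, p, iri if iri is not None else literal))
    return triples


def describe_datasets(triples):
    """Collect the properties of every subject typed as void:Dataset.

    Roughly what the SPARQL pattern `?ds a void:Dataset ; ?p ?o` would select.
    """
    datasets = {s for s, p, o in triples if p == RDF_TYPE and o == VOID + "Dataset"}
    return {d: {p: o for s, p, o in triples if s == d and p != RDF_TYPE} for d in datasets}


info = describe_datasets(parse_triples(VOID_NT))
```

Because the metadata is only four triples here, the example also shows why a separate small file is convenient: nothing else has to be loaded to answer questions about the dataset.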
How is the information in VoID used in real life?
VoID is already used in several scenarios, but it is mostly a recommendation and a good idea. The most prominent use case I know of is getting your dataset shown in the LOD Cloud. You currently have to register it with datahub.io and add a VoID file (example from my associations dataset).
Other examples (sadly many defunct nowadays) can be found here: http://semanticweb.org/wiki/VoID.html

How can I build a generic dataset-handling Perl library?

I want to build a generic Perl module for handling and analysing biomedical character-separated datasets, which can most certainly be used on any kind of dataset that contains a mixture of categorical (A,B,C,..), continuous (1.2,3,881..) and identifier (XXX1,XXX2...) variables. The plan is to have people initialize the module and then use some arguments to point to the data file(s), the place where the analysis reports should be placed, and the structure of the data.
By structure of the data I mean which variable is in which place and its name/type. And this is where I need some enlightenment. I am baffled as to how to do this in a clean way. Obviously, having people create a simple schema file, be it XML or some other format, would be the cleanest, but maybe not everyone enjoys doing something like this.
The solutions I can think of are:
Create a configuration file in XML or similar and with a prespecified format.
Pass the information during initialization of the module.
Use the first row of the data as headers and try to guess types (ouch)
Surely there must be a "canonical" way of doing this that is also usable and efficient.
This doesn't answer your question directly, but have you checked CPAN? It might have the module you need already. If not, it might have similar modules -- related either to biomedical data or simply to delimited data handling -- that you can mine for good ideas, both concerning formats for metadata and your module's API.
Any of the approaches you've listed could make sense. It all depends on how complex the data structures and their definitions are. What will make something like this useful to people is whether it saves them time and effort. So, your decision should be based on which approach best satisfies the need to make:
use of the module easy
reuse of data definitions easy
the data definition language sufficiently expressive to describe all known use cases
the data definition language sufficiently simple that an infrequent user can spend minimal time with the docs before getting real work done.
For example, if I just need to enter the names of the columns and their types (and there are only 4 well defined types), doing this each time in a script isn't too bad. Unless I have 350 columns to deal with in every file.
However, if large, complicated structure definitions are common, then a more modular reuse oriented approach is better.
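The "names of the columns and their types" case could look roughly like the sketch below, where the structure is passed in during initialization. It is written in Python rather than Perl purely as an illustration; the `Dataset` class, the three kind names and the sample columns are assumptions, not an existing module.

```python
import csv
import io

# Converters for the three variable kinds named in the question.
CONVERTERS = {
    "categorical": str,
    "continuous": float,
    "identifier": str,
}


class Dataset:
    """Delimited dataset whose column layout is passed in at construction."""

    def __init__(self, schema, delimiter=","):
        # schema: ordered list of (column_name, kind) pairs
        unknown = [kind for _, kind in schema if kind not in CONVERTERS]
        if unknown:
            raise ValueError(f"unknown column kinds: {unknown}")
        self.schema = list(schema)
        self.delimiter = delimiter

    def read(self, handle):
        """Yield one dict per row, with values converted according to the schema."""
        for fields in csv.reader(handle, delimiter=self.delimiter):
            if len(fields) != len(self.schema):
                raise ValueError(f"expected {len(self.schema)} fields, got {len(fields)}")
            yield {
                name: CONVERTERS[kind](value)
                for (name, kind), value in zip(self.schema, fields)
            }


# Made-up column layout and two tab-separated rows.
schema = [("patient_id", "identifier"), ("group", "categorical"), ("dose", "continuous")]
data = io.StringIO("XXX1\tA\t1.2\nXXX2\tB\t881\n")
records = list(Dataset(schema, delimiter="\t").read(data))
```

Because the schema is an ordinary list, the same definition could later be loaded from a configuration file without changing the reading code, which keeps the reuse-oriented option open.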
If your data description language is difficult to work with, you can mitigate the issue a bit by providing a configuration tool that allows one to create and edit data schemas.
rx might be worth looking at, as well as the Data::Rx module on the CPAN. It provides schema checking for JSON, but there is nothing inherent in the model that makes it JSON-only.