Read a model (.pkl and .scl) file stored in HDFS - pyspark

My question is similar to this question, but differs in one respect.
I have two models, model_1.scl and model_2.pkl, saved in HDFS, that I want to load:
pickle.load(open('model_1.scl', 'rb'))
pickle.load(open('model_2.pkl', 'rb'))
For a text file, I can simply use:
sc.textFile('hdfs://abc_cluster/user/user_1/textfile.txt').collect()
But what about non-text files, such as my model files? Will pickle.load() fail on a non-text file read this way?
Is there any way to read a model file stored on HDFS?
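One option worth trying is SparkContext.binaryFiles, which reads whole files from HDFS as (path, bytes) pairs that can then be unpickled. A minimal sketch, assuming the same cluster path prefix as the text-file example above:
import pickle
# binaryFiles yields (path, raw bytes) pairs for every file matching the pattern
raw = sc.binaryFiles('hdfs://abc_cluster/user/user_1/model_2.pkl')
# deserialize each file's bytes with pickle and bring the objects back to the driver
model_2 = raw.map(lambda kv: pickle.loads(kv[1])).collect()[0]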

Related

How to use VTK to efficiently write time-varying field data on a fixed mesh?

I am working on physics simulation research. I have a large fixed grid in one of my projects that does not vary with time. The fields on the grid, on the other hand, vary with time in the simulation. I need to use VTK to record the field data in each step for visualization (Paraview).
The method I am using is to write a separate *.vtu file to disk at each time step. This basically serves the purpose, but actually writes a lot of duplicate data (re-recording the geometry of the mesh at each step), which not only consumes more disk space, but also wastes time on encoding and parsing.
I would like to have a way to write the mesh information only once, and the rest of the time only new field data is written, while being able to guarantee the same visualization. Please let me know if VTK and Paraview provide such an interface and how to implement it.
Using .pvtu and referring to the same .vtu as a Piece for each step should do the trick.
See this similar post on the ParaView Discourse, and the pvtu documentation.
EDIT
This seems to be a side effect of the format; it is not supported by the writer.
The correct solution is to use another file format ...
Let me provide my own research findings for reference.
As Nico said, with a combination of pvtu/vtu files, we could theoretically store the geometry in a separate vtu file referenced by a pvtu file. Setting the NumberOfPieces attribute of the pvtu file to 1 would enable the construction of only one separate vtu file.
However, the VTK library does not expose a dedicated interface to control the writing process of vtu files. No matter how it is configured, as long as the writer's input contains geometry, the writer will write the geometry information to disk, and this step cannot be skipped through the exposed interface.
However, it is indeed possible to make multiple pvtu files point to the same vtu file by manually editing the Piece node in each pvtu file, and ParaView can recognize and visualize such a file group properly.
I did not proceed to try adding arrays to the unstructured grid and using pvtu output.
So, I think the conclusion is:
If you don't want to dive into VTK's library code and XML implementation, then this approach doesn't make sense.
If you are willing to write the full series of files, delete most of the vtu files, and then point all the pvtu files' Piece nodes to the only surviving vtu file by editing the pvtu files (see the sketch after this list), you can save a lot of disk space, but you will not shorten the write, read, and parse times.
If you implement an XML writer by yourself, you can achieve all the requirements in theory, but it requires a lot of coding work.
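As a rough illustration of the manual editing described above, the Piece nodes of the generated .pvtu files can be rewritten in a few lines of Python; the file names here are hypothetical, and the .pvtu files are assumed to follow the standard VTKFile/PUnstructuredGrid/Piece layout:
import glob
import xml.etree.ElementTree as ET

shared_vtu = 'mesh_step_0000.vtu'  # the single .vtu file kept on disk

for pvtu_path in glob.glob('mesh_step_*.pvtu'):
    tree = ET.parse(pvtu_path)
    grid = tree.getroot().find('PUnstructuredGrid')
    # redirect every Piece to the surviving .vtu file
    for piece in grid.findall('Piece'):
        piece.set('Source', shared_vtu)
    tree.write(pvtu_path)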

Is there any way to convert pyspark random forest model to pmml?

I have trained a RandomForest in PySpark 2.1, but saved it as a PySpark model file.
from pyspark.ml.classification import RandomForestClassifier

rf_model = RandomForestClassifier(featuresCol='features',
                                  labelCol='click',
                                  maxDepth=10,
                                  maxBins=32,
                                  numTrees=100)
model = rf_model.fit(dftrain)
model_path = 'hdfs://hacluster/user/model'
model.save(model_path)
But now we have downloaded the model without the dftrain data, and we cannot access HDFS right now. Is there any way to convert the model file to PMML without the exact training data?
I already know about pyspark2pmml and jpmml-sparkml; both take the training data as input. Like:
from jpmml_sparkml import toPMMLBytes
pmmlBytes = toPMMLBytes(sc, dftrain, pipelineModel)
print(pmmlBytes)
The JPMML-SparkML library (either directly or via the PySpark2PMML wrapper library) is still your only option. However, you should check out its README file to refresh your knowledge of it - your example uses an outdated API (the toPMMLBytes utility method instead of the PMMLBuilder#buildByteArray builder method).
Regarding the need for the training dataset: JPMML-SparkML needs to know the schema of the training dataset (in the form of an org.apache.spark.sql.types.StructType object), not the actual data. This schema is used for getting column names, data types, and other metadata.
If you don't have the original schema available, then it shouldn't be difficult to create one programmatically.
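For example, a schema can be rebuilt by hand and wrapped in an empty DataFrame; a minimal sketch, where the column names and types are placeholders and pipeline_model stands for a fitted PipelineModel (which is what JPMML-SparkML converts):
from pyspark.sql.types import StructType, StructField, DoubleType, IntegerType
from pyspark2pmml import PMMLBuilder

# placeholder schema -- use the column names and types your pipeline was actually fitted on
schema = StructType([
    StructField('click', IntegerType(), True),
    StructField('feature_1', DoubleType(), True),
    StructField('feature_2', DoubleType(), True),
])

# an empty DataFrame is enough to carry the schema metadata
schema_df = spark.createDataFrame([], schema)

# pipeline_model is the fitted PipelineModel containing the random forest
pmml_builder = PMMLBuilder(sc, schema_df, pipeline_model)
pmml_builder.buildFile('rf_model.pmml')
Whether an empty DataFrame is accepted may depend on the JPMML-SparkML version; if not, a DataFrame with a single dummy row of the right types should serve the same purpose.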

Create spark dataframe from multiple directories on S3

Basically, I need to create a Spark DataFrame from multiple directories on S3.
The directory structure under the root directory is as follows:
s3://some-bucket/data/date=2018-04-01/
s3://some-bucket/data/date=2018-04-02/
..
s3://some-bucket/data/date=2018-04-30/
s3://some-bucket/data/date=2018-05-01/
...
Now I need to create a data frame for specific dates (e.g. 10 days from 2018-04-26).
What's the best approach to do it?
I know I can create one data frame per directory (e.g. one for 2018-04-26, one for 2018-04-27, etc.) and then union all the data frames to get a single data frame. I'm not sure if there is extra overhead with this approach. Is there a way to specify a list of directories as input for a data frame?
The programming language I use is Scala.
Thanks
I have done it in Python; I am sure there is a Scala equivalent for the same.
Spark's read functions use variable-length arguments to take multiple paths as input. Call the spark.read function with all paths separated by ','.
dataframe = spark.read.parquet(file1_path,file2_path,file3_path,...)
For future reference: what if you have all the paths in a list? Just place an asterisk before the list when calling the read function (the * unpacks the list into variable-length arguments).
dataframe = spark.read.parquet(*file_path_list)
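For the original question (10 days starting 2018-04-26), the path list itself can be generated and unpacked the same way; a short sketch reusing the bucket layout from the question:
from datetime import date, timedelta

# build the partition paths for 10 consecutive days starting 2018-04-26
start = date(2018, 4, 26)
paths = ['s3://some-bucket/data/date={}/'.format(start + timedelta(days=i))
         for i in range(10)]

df = spark.read.parquet(*paths)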

Working with many inputs (Matlab)

I'm new to Matlab and I need some suggestions on how to deal with having many inputs to a function.
The program reads data from multiple elements and stores them in an array, which I'm doing in a loop. The problem is that if I input the wrong information about one element, I must re-input all the data over again. I believe there must be a better way to input these data, like reading them from an external file, for example.
The problem with an external file would be, as far as I know, reading multiple arrays from a single file, hence the need for multiple external files - and I also believe there must be a better way.
As noted by @beaker, you can use save and load to store the data. You can store multiple variables in a given file without a problem.

Efficient way to load csv file in spark/scala

I am trying to load a CSV file in Scala from Spark. I see that we can do it using the two different syntaxes below:
sqlContext.read.format("csv").options(option).load(path)
sqlContext.read.options(option).csv(path)
What is the difference between these two and which gives the better performance?
Thanks
There's no difference.
So why do both exist?
The .format(fmt).load(path) method is a flexible, pluggable API that allows adding more formats without having to re-compile Spark - you can register aliases for custom Data Source implementations and have Spark use them; "csv" used to be such a custom implementation (outside of the packaged Spark binaries), but it is now part of the project.
There are shorthand methods for "built-in" data sources (like csv, parquet, json...) which make the code a bit simpler (and are verified at compile time).
Eventually, they both create a CSV Data Source and use it to load the data.
Bottom line: for any supported format, you should opt for the "shorthand" method, e.g. csv(path).
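A quick way to convince yourself of the equivalence (shown here in PySpark for brevity; the path and options are only illustrative):
path = 'data/sample.csv'
opts = {'header': 'true', 'inferSchema': 'true'}

df1 = spark.read.format('csv').options(**opts).load(path)
df2 = spark.read.options(**opts).csv(path)

# both calls resolve to the same CSV data source and produce the same schema
assert df1.schema == df2.schema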