Loading your own dataset of numbers (CSV) in PyTorch - neural-network

I want to use my own dataset, which consists of numbers, in PyTorch; it is available as a CSV file, for example. What is the easiest way to load this into PyTorch? So far I only know how to use already existing datasets in PyTorch, but I don't want to do that.

You need to create a custom class that inherits from PyTorch's Dataset class.
Then, you need to wrap it with a DataLoader.
Follow this tutorial for an in-depth explanation:
https://pytorch.org/tutorials/beginner/basics/data_tutorial.html#creating-a-custom-dataset-for-your-files

The easiest way to import your dataset would be to:
Use the pandas package to load your CSV file:
import pandas as pd
data = pd.read_csv("filename.csv")
Then, implement a very simple PyTorch Dataset class as described here.
You will finally pass your instance of Dataset as the first parameter of a PyTorch DataLoader.
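For concreteness, here is a minimal sketch of such a Dataset wrapping a pandas DataFrame; it assumes the CSV contains only numeric columns and that the last column is the target, so adjust the slicing to your file's layout:
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader

class CsvDataset(Dataset):
    """Wraps a numeric CSV file loaded with pandas."""
    def __init__(self, csv_path):
        df = pd.read_csv(csv_path)
        # Assumption: all columns except the last are features, the last is the target.
        self.features = torch.tensor(df.iloc[:, :-1].values, dtype=torch.float32)
        self.targets = torch.tensor(df.iloc[:, -1].values, dtype=torch.float32)

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        return self.features[idx], self.targets[idx]

dataset = CsvDataset("filename.csv")
loader = DataLoader(dataset, batch_size=32, shuffle=True)
for x_batch, y_batch in loader:
    ...  # feed each batch to your model here
The DataLoader then takes care of batching and shuffling for you.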

Related

Is there any way to convert pyspark random forest model to pmml?

I have trained a RandomForest in PySpark 2.1, but saved it as a PySpark model file.
rf_model = RandomForestClassifier(featuresCol='features',
                                  labelCol='click',
                                  maxDepth=10,
                                  maxBins=32,
                                  numTrees=100)
model = rf_model.fit(dftrain)
model_path = 'hdfs://hacluster/user/model'
model.save(model_path)
But now we have downloaded the model without the dftrain data and cannot access HDFS right now. Is there any way to convert the model file to PMML without the exact training data?
I already know about pyspark2pmml and jpmml-sparkml, but both take the training data as input. Like:
from jpmml_sparkml import toPMMLBytes
pmmlBytes = toPMMLBytes(sc, dftrain, pipelineModel)
print(pmmlBytes)
The JPMML-SparkML library (either directly or via the PySpark2PMML wrapper library) is still your only option. However, you should check out its README file to refresh your knowledge of it - your example uses an outdated API (the toPMMLBytes utility method instead of the PMMLBuilder#buildByteArray builder method).
Regarding the need for the training dataset: JPMML-SparkML needs to know the schema of the training dataset (in the form of an org.apache.spark.sql.types.StructType object), not the actual data. This schema is used to get column names, data types, and other metadata.
If you don't have the original schema available, then it shouldn't be difficult to create one programmatically.
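For illustration, a rough sketch of how that could look with the current PySpark2PMML API; the column names and types below are hypothetical and must match the columns of the original training DataFrame, and pipelineModel, spark and sc are assumed to already be in scope (with pipelineModel being a fitted PipelineModel that includes the feature-engineering stages):
from pyspark.sql.types import StructType, StructField, DoubleType, IntegerType
from pyspark2pmml import PMMLBuilder

# Hypothetical schema: recreate the column names and types of the original dftrain.
schema = StructType([
    StructField("click", IntegerType(), True),
    StructField("feature_1", DoubleType(), True),
    StructField("feature_2", DoubleType(), True),
])
# An empty DataFrame should be enough here, since only the metadata is inspected.
emptyDf = spark.createDataFrame([], schema)

pmmlBuilder = PMMLBuilder(sc, emptyDf, pipelineModel)
pmmlBuilder.buildFile("rf_model.pmml")
Whether an empty DataFrame is accepted may depend on the library version, but the point stands: only the schema matters, not the rows.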

DeepLearning4J - Acquiring Data and Train Model

I'm trying to create the simplest possible neural network and train it with some data.
To that end I created a test.csv with the following pattern:
number,number+1;
number2,number2+1
...
I'm trying to do a linear regression with the network...
But I can't find a way to acquire the data; DataSetIterator does not work.
How do I fit the data, and how do I test it?
In our examples, we encourage people to use DataVec + RecordReaderDataSetIterator.
DataVec has all of the various data-loading components.
I'm not sure what you mean by "DataSetIterator not working" without seeing any code, but it seems like you didn't really look at our examples.
There are multiple examples in there of a CSV record reader that you can use for both regression and classification use cases.
Consider reorienting your data pipeline to use those.
Those examples are always found here:
https://github.com/deeplearning4j/dl4j-examples
If you follow any of those, the same pattern emerges:
Record reader for whatever data format -> RecordReaderDataSetIterator
The iterator allows you to specify, via its constructors, common options such as whether it is a regression or not, which column your label is, etc.

How to load a PMML model?

I'm following the instructions of PMML model export - spark.mllib to create a K-means model.
val numClusters = 10
val numIterations = 10
val clusters = KMeans.train(data, numClusters, numIterations)
// Save and load model: export to PMML
println("PMML Model:\n" + clusters.toPMML("/kmeans.xml"))
But I don't know how to load the PMML after that.
I'm trying
val sameModel = KMeansModel.load(sc, "/kmeans.xml")
and appears:
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/kmeans.xml/metadata
Any idea?
Best regards
As stated in the documentation (for the version you seem to be interested in, 1.6.1, and also for the latest available, 2.1.0), Spark supports exporting to PMML only. The load method actually expects to retrieve a model saved in Spark's own format; this is why load expects a certain path to exist and why the exception was thrown.
If you trained the model with Spark, you can save it and load it later.
If you need to load a model that has not been trained in Spark and has been saved as PMML, you can use jpmml-spark to load and evaluate it.
In my limited experience with spark.mllib's KMeans, this is not possible, but you could develop the feature yourself.
spark.mllib's KMeansModel is PMMLExportable:
class KMeansModel @Since("1.1.0") (@Since("1.0.0") val clusterCenters: Array[Vector])
  extends Saveable with Serializable with PMMLExportable {
That's why you can use toPMML, which saves a model in the PMML XML format.
(Again, I've got very little experience with Spark MLlib.) My understanding is that KMeans is all about centroids, and that's what is loaded when you do KMeansModel.load, which in turn uses KMeansModel.SaveLoadV1_0.load to read the centroids and create a KMeansModel:
new KMeansModel(localCentroids.sortBy(_.id).map(_.point))
For KMeansModel.toPMML, Spark MLlib uses pmml-model's PMML (as you can see here):
new PMML("4.2", header, null)
I'd recommend exploring pmml-model's PMML class to see how to do the saving and loading, as that's beyond Spark's realm.
Side notes
Why would you even want to use Spark to serve the model after you've trained it? It is indeed possible, but you may be wasting your cluster resources by having Spark host the model.
In my limited understanding, the sole purpose of Spark MLlib is to use Spark's features like distribution and parallelism to handle large datasets when building models, and then to use those models without the Spark machinery afterwards.
I must be missing something important in my narrow view...
You could use PMML4S-Spark to load a PMML model to evaluate it in Spark, for example:
import org.pmml4s.spark.ScoreModel
val model = ScoreModel.fromFile("/kmeans.xml")
The model is a Spark ML transformer, so you can make predictions against a DataFrame:
val scoreDf = model.transform(df)
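If you want to score the same PMML file from plain Python rather than from Spark, the PMML4S family also has a Python wrapper, pypmml; a minimal hedged sketch (the input field names below are hypothetical and must match the PMML's DataDictionary):
from pypmml import Model

model = Model.fromFile("/kmeans.xml")
# Hypothetical field names; use the ones declared in the PMML's DataDictionary.
result = model.predict({"field_0": 1.2, "field_1": 3.4})
print(result)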
PMML files are actually XML files with schemas defined by the Data Mining Group (DMG). For that reason, you can either write your own deserializer based on the schema published on the DMG's PMML page (linked below) or use third-party libraries.
I am researching the JPMML library for incorporating Python-prepared models in a Spring application.
Information here:
https://github.com/jpmml
http://dmg.org/pmml/v4-1/GeneralStructure.html

Usage of a Naive Bayes model for prediction

Hi all, I am new to Scala and Spark MLlib.
I have a dataset of diseases along with their symptoms, in the following format:
Disease,symptom1 symptom2 symptom3
I have almost 300 entries in the above-mentioned format in a CSV file.
I want to achieve the following functionality:
If a user gives an input of symptoms, namely Symptom1, Symptom2, Symptom3, the model must be able to predict the disease.
I have the following questions:
Which machine learning model should I use to achieve this functionality?
I have gone through some models and found the Naive Bayes model; correct me if I'm wrong.
Can I provide text input to a Naive Bayes model?
Is there any sample code available to achieve this functionality?
You can use any of the classification algorithms present in Spark MLlib. For further reference, read the official docs and go through this post from the Databricks blog: https://databricks.com/blog/2015/07/29/new-features-in-machine-learning-pipelines-in-spark-1-4.html
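As a hedged sketch of what such a pipeline could look like in PySpark (the file path, column names and parameters below are assumptions, not taken from your data), you could index the disease as the label, tokenize the space-separated symptoms, turn them into TF-IDF features, and fit a Naive Bayes classifier:
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, IDF, StringIndexer
from pyspark.ml.classification import NaiveBayes

# Assumed CSV layout: disease,symptoms (symptoms as one space-separated string).
df = spark.read.csv("diseases.csv", header=False).toDF("disease", "symptoms")

indexer = StringIndexer(inputCol="disease", outputCol="label", handleInvalid="keep")
tokenizer = Tokenizer(inputCol="symptoms", outputCol="tokens")
tf = HashingTF(inputCol="tokens", outputCol="rawFeatures", numFeatures=1024)
idf = IDF(inputCol="rawFeatures", outputCol="features")
nb = NaiveBayes(featuresCol="features", labelCol="label")

model = Pipeline(stages=[indexer, tokenizer, tf, idf, nb]).fit(df)

# Predict the disease index for a new set of symptoms;
# IndexToString can map the numeric prediction back to the disease name.
test = spark.createDataFrame([("unknown", "Symptom1 Symptom2 Symptom3")],
                             ["disease", "symptoms"])
model.transform(test).select("prediction").show()
So yes, Naive Bayes is a reasonable choice, and text input works once it has been tokenized and vectorized as above.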

Add extra attribute to data loaded with Mongoimport

In MongoDB, is there a way to add an extra attribute to documents (--TSV, --headerline) created with mongoimport?
I don't have control over the data being imported; however, I need to be able to distinguish one imported data set from another, and there are no attributes in the file to distinguish one import from another.
I think your best option would be to write your own script to parse the CSV/TSV and import it into MongoDB. It would take under 10 lines of Python, for example:
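Here is a hedged sketch with pymongo (the file name, database and collection names, and the importBatch field are made up for illustration):
import csv
from bson import ObjectId
from pymongo import MongoClient

batch_id = ObjectId()  # one tag shared by every document of this import run
collection = MongoClient("mongodb://localhost:27017")["mydb"]["mycollection"]

with open("data.tsv", newline="") as f:
    reader = csv.DictReader(f, delimiter="\t")  # the header line supplies field names
    docs = [dict(row, importBatch=batch_id) for row in reader]

collection.insert_many(docs)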
Alternatively, if nothing else is inserting into the collection and your import runs are far enough apart, you could just do something like this between runs:
db.collection.update({extraField:null}, {$set:{extraField: ObjectId()}}, false, true)
This would work best with an index on {extraField:1}.