I am building a time prediction model in deeplearning4j for text processing. It takes the number of words, sentences, and characters as input features and produces a time as output. While mapping input data to output I am having difficulty transforming these values and telling the network which output value corresponds to which set of input values.
Also, should I reduce dimensionality and use just x1 and y instead of x1-x4?
training-data.csv has the columns below, with 100 rows of values.
x1,x2,x3,x4 (inputs) y (output)
I tried using a SequenceRecordReader and SequenceRecordReaderDataSetIterator, which can handle variable-length inputs.
Below is my code:
public static void main(String[] args) throws Exception
{
// Initializing parameters
final Logger log = LoggerFactory.getLogger(MainExpert.class);
final int seed =123;
final int numInput = 4;
final int numOutput = 1;
final int numHidden = 20;
final double learningRate = 0.015;
final int batchSize =30;
final int nEpochs =30;
//final int inputFeatures =4;
//Constructing Training data
final File baseFolder =new File("/home/aj/my/samples/corpus");
final File testFolder = new File("/home/aj/my/samples/corpus/train_data_0.csv");
SequenceRecordReader trainReader = new CSVSequenceRecordReader(0,",");
trainReader.initialize(new NumberedFileInputSplit(baseFolder.getAbsolutePath() + "/train_data_%d.csv",0,0));
DataSetIterator trainIterator = new SequenceRecordReaderDataSetIterator(trainReader,batchSize,-1,4,true);
SequenceRecordReader testReader = new CSVSequenceRecordReader(0,",");
testReader.initialize(new NumberedFileInputSplit(baseFolder.getAbsolutePath() + "/test_data_%d.csv",0,0));
DataSetIterator testIterator = new SequenceRecordReaderDataSetIterator(testReader,batchSize,-1,4,true);
DataSet trainData = trainIterator.next();
System.out.println(trainData);
DataSet testData = testIterator.next();
NormalizerMinMaxScaler normalizer = new NormalizerMinMaxScaler(0, 1);
normalizer.fitLabel(true);
normalizer.fit(trainData);
normalizer.transform(trainData);
normalizer.transform(testData);
//Configuring Network
log.info("Building Model");
MultiLayerConfiguration config = new NeuralNetConfiguration.Builder()
.seed(seed)
.iterations(1)
.optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)
.learningRate(learningRate)
.updater(Updater.NESTEROVS).momentum(0.9)
.list()
.layer(0, new DenseLayer.Builder()
.nIn(numInput)
.nOut(numHidden)
.weightInit(WeightInit.XAVIER)
.activation(Activation.RELU)
.build())
.layer(1, new DenseLayer.Builder()
.nIn(numHidden)
.nOut(numHidden)
.weightInit(WeightInit.XAVIER)
.activation(Activation.RELU)
.build())
.layer(2, new OutputLayer.Builder(LossFunction.MSE)
.nIn(numHidden)
.nOut(numOutput)
.weightInit(WeightInit.XAVIER)
.activation(Activation.IDENTITY)
.build())
.pretrain(false).backprop(true).build();
//Initializing network
log.info("initlizing model");
MultiLayerNetwork model = new MultiLayerNetwork(config);
model.init();
model.setListeners(new ScoreIterationListener(1));
log.info("Training Model");
for(int i=0;i<nEpochs;i++)
{
model.fit(trainData);
}
//Evaluation
RegressionEvaluation reval = new RegressionEvaluation(1);
INDArray feat = testData.getFeatureMatrix();
INDArray labels = testData.getLabels();
INDArray prediction = model.output(feat);
reval.eval(labels, prediction);
System.out.println(reval.stats());
}
}
My data has four input values and one output value.
But I get an exception:
org.deeplearning4j.exception.DL4JInvalidInputException: Input that is not a matrix; expected matrix (rank 2), got rank 3 array with shape [1, 4, 107]
We have an end to end csv classifier example here: https://github.com/deeplearning4j/dl4j-examples/blob/master/dl4j-examples/src/main/java/org/deeplearning4j/examples/recurrent/seqClassification/UCISequenceClassificationExample.java
An RNN can handle multivariate input; in fact, I encourage it. Having only one input feature doesn't do much for you.
I don't see the need to reduce it down to x1 and y.
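If you would rather keep the feedforward (DenseLayer/OutputLayer) setup from the question instead of switching to an RNN, the rank-3 vs. rank-2 mismatch can also be avoided by reading each CSV row as an independent example rather than as a sequence. The following is only a rough sketch of that alternative, not code from the answer above; it reuses baseFolder, batchSize, nEpochs and model from the question and assumes the label sits in column 4:

import org.datavec.api.records.reader.RecordReader;
import org.datavec.api.records.reader.impl.csv.CSVRecordReader;
import org.datavec.api.split.FileSplit;
import org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator;

// Read train_data_0.csv as plain rows: columns 0-3 are features, column 4 is the target.
// Skip 0 header lines; comma is the default delimiter.
RecordReader rowReader = new CSVRecordReader(0);
rowReader.initialize(new FileSplit(new File(baseFolder, "train_data_0.csv")));

// labelIndexFrom = 4, labelIndexTo = 4, regression = true -> rank-2 [batch, 4] features
DataSetIterator trainIter = new RecordReaderDataSetIterator(rowReader, batchSize, 4, 4, true);

// The same network configuration can then be fit on the iterator directly
for (int i = 0; i < nEpochs; i++) {
    trainIter.reset();
    model.fit(trainIter);
}

If needed, the min-max normalizer can still be fit on this iterator (normalizer.fit(trainIter)) and set as its pre-processor before training.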
Related
In the project I am currently working on, we are using Spark as the computation engine for one of our workflows.
The workflow is as follows:
We have a product catalogue being served from several pincodes. A user logged in from any particular pincode should be able to see the least available cost across all serving pincodes.
The least cost is calculated as follows:
product price + dist(pincode1, pincode2)
where pincode2 is the user's pincode and pincode1 is the source pincode. Apply the above formula for all source pincodes and pick the least expensive one; for example, if two sources offer the product at 500 and 520 and the distance terms to the user are 30 and 5, the delivered costs are 530 and 525, so the second source wins.
My core Spark logic looks like this:
pincodes.javaRDD().cartesian(pincodePrices.javaRDD()).mapPartitionsToPair(new PairFlatMapFunction<Iterator<Tuple2<Row,Row>>, Row, Row>() {
@Override
public Iterator<Tuple2<Row, Row>> call(Iterator<Tuple2<Row, Row>> t)
throws Exception {
MongoClient mongoclient = MongoClients.create("mongodb://localhost");
MongoDatabase database = mongoclient.getDatabase("catalogue");
MongoCollection<Document>pincodeCollection = database.getCollection("pincodedistances");
List<Tuple2<Row,Row>> list =new LinkedList<>();
while (t.hasNext()) {
Tuple2<Row, Row>tuple2 = t.next();
Row pinRow = tuple2._1;
Integer srcPincode = pinRow.getAs("pincode");
Row pricesRow = tuple2._2;
Row pricesRow1 = (Row)pricesRow.getAs("leastPrice");
Integer buyingPrice = pricesRow1.getAs("buyingPrice");
Integer quantity = pricesRow1.getAs("quantity");
Integer destPincode = pricesRow1.getAs("pincodeNum");
if(buyingPrice!=null && quantity>0) {
BasicDBObject dbObject = new BasicDBObject();
dbObject.append("sourcePincode", srcPincode);
dbObject.append("destPincode", destPincode);
//System.out.println(srcPincode+","+destPincode);
Number distance;
if(srcPincode.intValue()==destPincode.intValue()) {
distance = 0;
}else {
Document document = pincodeCollection.find(dbObject).first();
distance = document.get("distance", Number.class);
}
double margin = 0.02;
Long finalPrice = Math.round(buyingPrice+(margin*buyingPrice)+distance.doubleValue());
//Row finalPriceRow = RowFactory.create(finalPrice,quantity);
StructType structType = new StructType();
structType = structType.add("finalPrice", DataTypes.LongType, false);
structType = structType.add("quantity", DataTypes.LongType, false);
Object values[] = {finalPrice,quantity};
Row finalPriceRow = new GenericRowWithSchema(values, structType);
list.add(new Tuple2<Row, Row>(pinRow, finalPriceRow));
}
}
mongoclient.close();
return list.iterator();
}
}).reduceByKey((priceRow1,priceRow2)->{
Long finalPrice1 = priceRow1.getAs("finalPrice");
Long finalPrice2 = priceRow2.getAs("finalPrice");
if(finalPrice1.longValue()<finalPrice2.longValue())return priceRow1;
return priceRow2;
}).collect().forEach(tuple2->{
// Business logic to push computed price to mongodb
});
I am able to get the correct answer; however, mapPartitionsToPair is taking quite a bit of time (~22 seconds for just 12k records).
After browsing the internet I found that mapPartitions performs better than mapPartitionsToPair, but I am not sure how to emit (key, value) pairs from mapPartitions and then sort them.
Is there any alternative to the above transformations? Any better approach would be highly appreciated.
Spark cluster: standalone (1 executor, 6 cores)
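On the specific point of emitting (key, value) pairs from mapPartitions: a minimal sketch (reusing the Row types from the question and eliding the pricing/distance logic) is to return Tuple2 elements from mapPartitions and then wrap the result with JavaPairRDD.fromJavaRDD so that reduceByKey is available again:

JavaRDD<Tuple2<Row, Row>> priced = pincodes.javaRDD().cartesian(pincodePrices.javaRDD())
    .mapPartitions(iter -> {
        List<Tuple2<Row, Row>> out = new LinkedList<>();
        while (iter.hasNext()) {
            Tuple2<Row, Row> tuple2 = iter.next();
            // ... same MongoDB lookup and final-price computation as in the question ...
            // out.add(new Tuple2<>(tuple2._1, finalPriceRow));
        }
        return out.iterator();
    });

JavaPairRDD<Row, Row> least = JavaPairRDD.fromJavaRDD(priced)
    .reduceByKey((priceRow1, priceRow2) -> {
        Long finalPrice1 = priceRow1.getAs("finalPrice");
        Long finalPrice2 = priceRow2.getAs("finalPrice");
        return finalPrice1 <= finalPrice2 ? priceRow1 : priceRow2;
    });

Functionally this is equivalent to the mapPartitionsToPair version; whether it is measurably faster is not guaranteed, since both run the same per-partition work, including the per-record MongoDB lookups.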
My game lets the user modify the terrain at runtime, but now I need to save that terrain. I've tried directly saving the terrain's heightmap to a file, but this takes almost two minutes to write for a 513x513 heightmap.
What would be a good way to approach this? Is there any way to optimize the writing speed, or am I approaching this the wrong way?
public static void Save(string pathraw, TerrainData terrain)
{
//Get full directory to save to
System.IO.FileInfo path = new System.IO.FileInfo(Application.persistentDataPath + "/" + pathraw);
path.Directory.Create();
System.IO.File.Delete(path.FullName);
Debug.Log(path);
//Get the width and height of the heightmap, and the heights of the terrain
int w = terrain.heightmapWidth;
int h = terrain.heightmapHeight;
float[,] tData = terrain.GetHeights(0, 0, w, h);
//Write the heights of the terrain to a file
for (int y = 0; y < h; y++)
{
for (int x = 0; x < w; x++)
{
//Mathf.Round is to round up the floats to decrease file size, where something like 5.2362534 becomes 5.24
System.IO.File.AppendAllText(path.FullName, (Mathf.Round(tData[x, y] * 100) / 100) + ";");
}
}
}
As a sidenote, the Mathf.Round doesn't seem to influence the saving time too much, if at all.
You are making a lot of small individual file IO calls. File IO is always time-consuming and expensive, since each call involves opening the file, writing to it, saving it, and closing it again.
Instead, I would rather build the complete string first, e.g. using a StringBuilder, which is also more efficient than something like
var someString
for(...)
{
someString += "xyz"
}
because the latter always allocates a new string.
Then use e.g. a FileStream and StreamWriter.WriteAsync(string) to write asynchronously.
Also, rather use Path.Combine instead of directly concatenating strings with /. Path.Combine automatically uses the correct path separators for the OS it is used on.
And instead of FileInfo.Directory.Create, rather use Directory.CreateDirectory, which doesn't throw an exception if the directory already exists.
Something like
using System.IO;
...
public static void Save(string pathraw, TerrainData terrain)
{
//Get full directory to save to
var filePath = Path.Combine(Application.persistentDataPath, pathraw);
var path = new FileInfo(filePath);
Directory.CreateDirectory(path.DirectoryName);
// makes no sense to delete
// ... rather simply overwrite the file if exists
//File.Delete(path.FullName);
Debug.Log(path);
//Get the width and height of the heightmap, and the heights of the terrain
var w = terrain.heightmapWidth;
var h = terrain.heightmapHeight;
var tData = terrain.GetHeights(0, 0, w, h);
// put the string together
// StringBuilder is more efficient than using
// someString += "xyz" because latter always allocates a new string
var stringBuilder = new StringBuilder();
for (var y = 0; y < h; y++)
{
for (var x = 0; x < w; x++)
{
// also add the linebreak if needed
stringBuilder.Append(Mathf.Round(tData[x, y] * 100) / 100).Append(';').Append('\n');
}
}
using (var file = File.Open(filePath, FileMode.OpenOrCreate, FileAccess.Write))
{
using (var streamWriter = new StreamWriter(file, Encoding.UTF8))
{
streamWriter.WriteAsync(stringBuilder.ToString());
}
}
}
You might want to specify how exactly the numbers shall be printed with a certain precision like e.g.
(Mathf.Round(tData[x, y] * 100) / 100).ToString("0.00000000");
I have univariate time-series data: just a TimeStamp and a Value. Now I want to extrapolate (forecast) this Value for the next day/month/year. I know there are methods such as Box-Jenkins (ARIMA), etc.
Spark has linear regression and I tried it, but I did not get satisfactory results. Has anybody tried simple time-series forecasting in Spark? Can you share your implementation approach?
PS: I checked the user mailing list for this issue; almost all the questions about it are unanswered there.
Yes, I have already applied ARIMA in Spark for univariate time series.
public static void main(String args[])
{
System.setProperty("hadoop.home.dir", "C:/winutils");
SparkSession spark = SparkSession
.builder().master("local")
.appName("Spark-TS Example")
.config("spark.sql.warehouse.dir", "file:///C:/Users/abc/Downloads/Spark/sparkdemo/spark-warehouse/")
.getOrCreate();
Dataset<String> lines = spark.read().textFile("C:/Users/abc/Downloads/thunderbird/Time series/trainingvector_arima.csv");
Dataset<Double> doubleDataset = lines.map(line -> Double.parseDouble(line.toString()),
Encoders.DOUBLE());
List<Double> doubleList = doubleDataset.collectAsList();
//scala.collection.immutable.List<Object> scalaList = new
Double[] doubleArray = new Double[doubleList.size()];
doubleArray = doubleList.toArray(doubleArray);
double[] values = new double[doubleArray.length];
for(int i = 0; i< doubleArray.length; i++)
{
values[i] = doubleArray[i];
}
Vector tsvector = Vectors.dense(values);
System.out.println("Ts vector:" + tsvector.toString());
//ARIMAModel arimamodel = ARIMA.fitModel(1, 0, 1, tsvector, true, "css-bobyqa", null);
ARIMAModel arimamodel = ARIMA.autoFit(tsvector, 1, 1, 1);
Vector forcst = arimamodel.forecast(tsvector, 10);
System.out.println("forecast of next 10 observations: " + forcst);
}
This code works for me. Pass whatever values you want to forecast as the input data.
I have a CSV file with the following format:
productname, review of the product
Now, using Mallet, I have to train a classifier so that when a test dataset containing product reviews is given as input, it tells me which product a particular review belongs to.
Help with the Mallet Java API would be appreciated.
Here is a little example suited to your case:
public static void main(String[] args) throws IOException {
//prepare instance transformation pipeline
ArrayList<Pipe> pipes = new ArrayList<Pipe>();
pipes.add(new Target2Label());
pipes.add(new CharSequence2TokenSequence());
pipes.add(new TokenSequence2FeatureSequence());
pipes.add(new FeatureSequence2FeatureVector());
SerialPipes pipe = new SerialPipes(pipes);
//prepare training instances
InstanceList trainingInstanceList = new InstanceList(pipe);
trainingInstanceList.addThruPipe(new CsvIterator(new FileReader("datasets/training.txt"), "(.*),(.*)", 2, 1, -1));
//prepare test instances
InstanceList testingInstanceList = new InstanceList(pipe);
testingInstanceList.addThruPipe(new CsvIterator(new FileReader("datasets/testing.txt"), "(.*),(.*)", 2, 1, -1));
ClassifierTrainer trainer = new NaiveBayesTrainer();
Classifier classifier = trainer.train(trainingInstanceList);
System.out.println("Accuracy: " + classifier.getAccuracy(testingInstanceList));
}
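Once the classifier is trained, the usual Mallet pattern for labelling a single new, unlabeled review is to push it through the same pipe and read the best label off the resulting Classification. A rough sketch (the review text and instance name below are just placeholders):

// Classify a single new, unlabeled review with the trained classifier
InstanceList unlabeled = new InstanceList(pipe);
unlabeled.addThruPipe(new Instance("great battery life, average camera", null, "new-review", null));
Classification result = classifier.classify(unlabeled.get(0));
System.out.println("Predicted product: " + result.getLabeling().getBestLabel());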
I just want to merge two ArrayLists and have all the contents in one ArrayList. Both lists contain objects that are instances of the same class.
The object references themselves are different, though. However, I am getting an unexpected size for the combined ArrayList. I use Java 1.4.
ArrayList a1 = new ArrayList();
ArrayList b1 = new ArrayList();
ClassA cls1A = new ClassA();
ClassA cls1B = new ClassA();
a1.add(cls1A);
b1.add(cls1B);
// a1.size() is 100
// b1.size() is 50
// Merge the two ArrayList contents into one
// 1st method and its result
a1.addAll(b1);
// Expected result: a1.size() == 150
// Obtained result: a1.size() == 6789
// 2nd method and its result
Collections.copy(a1, b1);
// Expected result: a1.size() == 150
// Obtained result: a1.size() == 6789
How can I get an ArrayList whose size is the combined size of both lists?
I came up with the following solution: first get the size of both ArrayLists, then increase the capacity of the ArrayList into which the two are merged (using the ensureCapacity method of ArrayList), and then add the objects of the second ArrayList starting from the last index of the first.
a1.add(cls1A);
b1.add(cls1B);
int p = a1.size();
int w = b1.size();
int j = (p + w);
a1.ensureCapacity(j);
for (int r = 0; r < w; r++)
{
    // Java 1.4 has no generics, so the element must be cast back to ClassA
    ClassA element = (ClassA) b1.get(r);
    a1.add(p, element);
    p++;
}
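For comparison, a minimal self-contained check (using plain Object elements instead of ClassA, and hypothetical sizes of 100 and 50) shows that addAll by itself already yields the combined size when the lists contain exactly the elements you expect:

import java.util.ArrayList;

public class MergeCheck
{
    public static void main(String[] args)
    {
        // Pre-generics (Java 1.4 style) lists filled with dummy elements
        ArrayList a1 = new ArrayList();
        ArrayList b1 = new ArrayList();
        for (int i = 0; i < 100; i++) a1.add(new Object());
        for (int i = 0; i < 50; i++) b1.add(new Object());

        a1.addAll(b1);
        System.out.println(a1.size()); // prints 150
    }
}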