Training the classifier in MALLET - classification

I have a CSV file with the following format:
productname, review of the product
Now, using MALLET, I have to train the classifier so that if a test dataset containing product reviews is given as input, it tells me which product each review belongs to.
Help with the MALLET Java API would be appreciated.

Here is a little example suited to your case:
public static void main(String[] args) throws IOException {
    // Prepare the instance transformation pipeline
    ArrayList<Pipe> pipes = new ArrayList<Pipe>();
    pipes.add(new Target2Label());
    pipes.add(new CharSequence2TokenSequence());
    pipes.add(new TokenSequence2FeatureSequence());
    pipes.add(new FeatureSequence2FeatureVector());
    SerialPipes pipe = new SerialPipes(pipes);

    // Prepare training instances: regex group 2 is the data (review),
    // group 1 is the target (product name)
    InstanceList trainingInstanceList = new InstanceList(pipe);
    trainingInstanceList.addThruPipe(new CsvIterator(new FileReader("datasets/training.txt"), "(.*),(.*)", 2, 1, -1));

    // Prepare test instances
    InstanceList testingInstanceList = new InstanceList(pipe);
    testingInstanceList.addThruPipe(new CsvIterator(new FileReader("datasets/testing.txt"), "(.*),(.*)", 2, 1, -1));

    ClassifierTrainer trainer = new NaiveBayesTrainer();
    Classifier classifier = trainer.train(trainingInstanceList);
    System.out.println("Accuracy: " + classifier.getAccuracy(testingInstanceList));
}
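To actually label a new review (which is what the question asks for), you can pipe the raw text through the same pipeline and ask the trained classifier for its best label. A minimal sketch, to be placed at the end of main; the review text is a placeholder, and passing an existing product name as the dummy target avoids growing the label alphabet:

// Hypothetical new review; the target is not used at prediction time
InstanceList unseen = new InstanceList(pipe);
unseen.addThruPipe(new Instance("great battery life but average camera", "someProduct", "review1", null));
Classification result = classifier.classify(unseen.get(0));
System.out.println("Predicted product: " + result.getLabeling().getBestLabel());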

Related

Spark mapPartitionsToPair execution time

In the current project I am working on, we are using Spark as the computation engine for one of our workflows.
The workflow is as follows:
We have a product catalog served from several pincodes. A user logged in from any particular pincode should be able to see the least available cost across all serving pincodes.
The least cost is calculated as:
least cost = product price + dist(pincode1, pincode2)
where pincode2 is the user's pincode and pincode1 is a source pincode. Apply the above formula for all source pincodes and pick the least one.
My core Spark logic looks like this:
pincodes.javaRDD().cartesian(pincodePrices.javaRDD()).mapPartitionsToPair(new PairFlatMapFunction<Iterator<Tuple2<Row, Row>>, Row, Row>() {
    @Override
    public Iterator<Tuple2<Row, Row>> call(Iterator<Tuple2<Row, Row>> t) throws Exception {
        // One MongoDB connection per partition
        MongoClient mongoclient = MongoClients.create("mongodb://localhost");
        MongoDatabase database = mongoclient.getDatabase("catalogue");
        MongoCollection<Document> pincodeCollection = database.getCollection("pincodedistances");
        List<Tuple2<Row, Row>> list = new LinkedList<>();
        while (t.hasNext()) {
            Tuple2<Row, Row> tuple2 = t.next();
            Row pinRow = tuple2._1;
            Integer srcPincode = pinRow.getAs("pincode");
            Row pricesRow = tuple2._2;
            Row pricesRow1 = (Row) pricesRow.getAs("leastPrice");
            Integer buyingPrice = pricesRow1.getAs("buyingPrice");
            Integer quantity = pricesRow1.getAs("quantity");
            Integer destPincode = pricesRow1.getAs("pincodeNum");
            if (buyingPrice != null && quantity > 0) {
                BasicDBObject dbObject = new BasicDBObject();
                dbObject.append("sourcePincode", srcPincode);
                dbObject.append("destPincode", destPincode);
                //System.out.println(srcPincode+","+destPincode);
                Number distance;
                if (srcPincode.intValue() == destPincode.intValue()) {
                    distance = 0;
                } else {
                    // One distance lookup per (source, destination) pair
                    Document document = pincodeCollection.find(dbObject).first();
                    distance = document.get("distance", Number.class);
                }
                double margin = 0.02;
                Long finalPrice = Math.round(buyingPrice + (margin * buyingPrice) + distance.doubleValue());
                //Row finalPriceRow = RowFactory.create(finalPrice,quantity);
                StructType structType = new StructType();
                structType = structType.add("finalPrice", DataTypes.LongType, false);
                structType = structType.add("quantity", DataTypes.LongType, false);
                Object[] values = {finalPrice, quantity};
                Row finalPriceRow = new GenericRowWithSchema(values, structType);
                list.add(new Tuple2<Row, Row>(pinRow, finalPriceRow));
            }
        }
        mongoclient.close();
        return list.iterator();
    }
}).reduceByKey((priceRow1, priceRow2) -> {
    Long finalPrice1 = priceRow1.getAs("finalPrice");
    Long finalPrice2 = priceRow2.getAs("finalPrice");
    if (finalPrice1.longValue() < finalPrice2.longValue()) return priceRow1;
    return priceRow2;
}).collect().forEach(tuple2 -> {
    // Business logic to push the computed price to MongoDB
});
I get the correct answer; however, mapPartitionsToPair is taking a bit of time (~22 secs for just 12k records).
After browsing the internet I found that mapPartitions performs better than mapPartitionsToPair, but I am not sure how to emit (key, value) pairs from mapPartitions and then sort them.
Any alternative to the above transformations, or any better approach, would be highly appreciated.
Spark cluster: standalone (1 executor, 6 cores)
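For reference, mapPartitions can emit Tuple2 elements directly, and the result can then be wrapped into a pair RDD before reduceByKey. A rough sketch of the plumbing only (the per-partition pricing logic above is elided, and whether this is measurably faster than mapPartitionsToPair would need profiling):

// Emit Tuple2 values from mapPartitions, keeping the same per-partition logic
JavaRDD<Tuple2<Row, Row>> priced = pincodes.javaRDD()
        .cartesian(pincodePrices.javaRDD())
        .mapPartitions((FlatMapFunction<Iterator<Tuple2<Row, Row>>, Tuple2<Row, Row>>) t -> {
            List<Tuple2<Row, Row>> list = new LinkedList<>();
            // ... same per-partition Mongo lookup and pricing logic as above ...
            return list.iterator();
        });
// Wrap the Tuple2 RDD so reduceByKey becomes available again
JavaPairRDD<Row, Row> pairs = JavaPairRDD.fromJavaRDD(priced);

After that, reduceByKey can be called on pairs exactly as before.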

How to perform a merge operation between two Eclipse resources programmatically

As of now I have two Eclipse resource objects (org.eclipse.emf.ecore.resource.Resource); now I want to implement a merge operation between both resource objects using EMF.
You can compare and merge EMF models using EMF Compare.
Here is a code snippet from the FAQ showing how to launch EMF Compare programmatically:
public void compare(File model1, File model2) {
    // Build file URIs from the two models passed in
    URI uri1 = URI.createFileURI(model1.getAbsolutePath());
    URI uri2 = URI.createFileURI(model2.getAbsolutePath());
    Resource.Factory.Registry.INSTANCE.getExtensionToFactoryMap().put("xmi", new XMIResourceFactoryImpl());

    ResourceSet resourceSet1 = new ResourceSetImpl();
    ResourceSet resourceSet2 = new ResourceSetImpl();
    resourceSet1.getResource(uri1, true);
    resourceSet2.getResource(uri2, true);

    // Two-way comparison: the third argument is the common ancestor (none here)
    IComparisonScope scope = new DefaultComparisonScope(resourceSet1, resourceSet2, null);
    Comparison comparison = EMFCompare.builder().build().compare(scope);

    List<Diff> differences = comparison.getDifferences();
    // Let's merge every single diff; the standalone registry comes pre-populated with mergers
    IMerger.Registry mergerRegistry = IMerger.RegistryImpl.createStandaloneInstance();
    IBatchMerger merger = new BatchMerger(mergerRegistry);
    merger.copyAllLeftToRight(differences, new BasicMonitor());
}
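Since you already have the two Resource objects in memory, the scope can (as far as I can tell from the EMF Compare API) be built from them directly instead of loading resource sets from files; resource1 and resource2 below stand for your existing objects:

// Build the comparison scope straight from the in-memory resources;
// the third argument is again the common ancestor (null when there is none)
IComparisonScope scope = new DefaultComparisonScope(resource1, resource2, null);
Comparison comparison = EMFCompare.builder().build().compare(scope);
// ...then collect the differences and merge exactly as above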

Transform training data into an array in a neural network

I am trying a time-prediction model in deeplearning4j for text processing, which takes number of words, sentences, and characters as input features and produces time as output. But while mapping input data to output, I am having difficulty transforming these values and telling the network which output values correspond to which inputs.
Also, should I reduce the dimensionality to just x1 and y instead of x1-x4?
training-data.csv has the columns below, with 100 values:
x1,x2,x3,x4 (inputs) y (output)
I tried using SequenceRecordReader and an iterator, which can capture variable inputs.
Below is my code:
public static void main(String[] args) throws Exception
{
    // Initializing parameters
    final Logger log = LoggerFactory.getLogger(MainExpert.class);
    final int seed = 123;
    final int numInput = 4;
    final int numOutput = 1;
    final int numHidden = 20;
    final double learningRate = 0.015;
    final int batchSize = 30;
    final int nEpochs = 30;
    //final int inputFeatures = 4;

    // Constructing training data
    final File baseFolder = new File("/home/aj/my/samples/corpus");
    final File testFolder = new File("/home/aj/my/samples/corpus/train_data_0.csv");
    SequenceRecordReader trainReader = new CSVSequenceRecordReader(0, ",");
    trainReader.initialize(new NumberedFileInputSplit(baseFolder.getAbsolutePath() + "/train_data_%d.csv", 0, 0));
    DataSetIterator trainIterator = new SequenceRecordReaderDataSetIterator(trainReader, batchSize, -1, 4, true);
    SequenceRecordReader testReader = new CSVSequenceRecordReader(0, ",");
    testReader.initialize(new NumberedFileInputSplit(baseFolder.getAbsolutePath() + "/test_data_%d.csv", 0, 0));
    DataSetIterator testIterator = new SequenceRecordReaderDataSetIterator(testReader, batchSize, -1, 4, true);

    DataSet trainData = trainIterator.next();
    System.out.println(trainData);
    DataSet testData = testIterator.next();

    NormalizerMinMaxScaler normalizer = new NormalizerMinMaxScaler(0, 1);
    normalizer.fitLabel(true);
    normalizer.fit(trainData);
    normalizer.transform(trainData);
    normalizer.transform(testData);

    // Configuring the network
    log.info("Building Model");
    MultiLayerConfiguration config = new NeuralNetConfiguration.Builder()
            .seed(seed)
            .iterations(1)
            .optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)
            .learningRate(learningRate)
            .updater(Updater.NESTEROVS).momentum(0.9)
            .list()
            .layer(0, new DenseLayer.Builder()
                    .nIn(numInput)
                    .nOut(numHidden)
                    .weightInit(WeightInit.XAVIER)
                    .activation(Activation.RELU)
                    .build())
            .layer(1, new DenseLayer.Builder()
                    .nIn(numHidden)
                    .nOut(numHidden)
                    .weightInit(WeightInit.XAVIER)
                    .activation(Activation.RELU)
                    .build())
            .layer(2, new OutputLayer.Builder(LossFunction.MSE)
                    .nIn(numHidden)
                    .nOut(numOutput)
                    .weightInit(WeightInit.XAVIER)
                    .activation(Activation.IDENTITY)
                    .build())
            .pretrain(false).backprop(true).build();

    // Initializing the network
    log.info("Initializing model");
    MultiLayerNetwork model = new MultiLayerNetwork(config);
    model.init();
    model.setListeners(new ScoreIterationListener(1));

    log.info("Training Model");
    for (int i = 0; i < nEpochs; i++)
    {
        model.fit(trainData);
    }

    // Evaluation on the single, already-normalized test batch
    RegressionEvaluation reval = new RegressionEvaluation(1);
    INDArray feat = testData.getFeatureMatrix();
    INDArray labels = testData.getLabels();
    INDArray prediction = model.output(feat);
    reval.eval(labels, prediction);
    System.out.println(reval.stats());
}
}
My data has four input values and one output value.
But I get an exception:
org.deeplearning4j.exception.DL4JInvalidInputException: Input that is not a matrix; expected matrix (rank 2), got rank 3 array with shape [1, 4, 107]
We have an end-to-end CSV classifier example here: https://github.com/deeplearning4j/dl4j-examples/blob/master/dl4j-examples/src/main/java/org/deeplearning4j/examples/recurrent/seqClassification/UCISequenceClassificationExample.java
An RNN can handle multivariate input; in fact, I encourage it. Having only 1 input feature doesn't do much for you.
I don't see the need to reduce it down to x1 and y.
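The rank-3 exception itself comes from the iterator: SequenceRecordReaderDataSetIterator produces arrays of shape [miniBatchSize, features, timeSteps], while a DenseLayer expects rank-2 [miniBatchSize, features] input. If your 100 rows are independent examples rather than one long sequence, a plain (non-sequence) reader should hand the dense network the shape it expects. A sketch under that assumption, reusing the file and batch size from your code (label in column index 4, regression mode):

// Non-sequence CSV reader: one example per row
RecordReader reader = new CSVRecordReader(0, ",");
reader.initialize(new FileSplit(new File("/home/aj/my/samples/corpus/train_data_0.csv")));
// labelIndexFrom == labelIndexTo == 4, regression = true -> single label column, rank-2 features
DataSetIterator trainIterator = new RecordReaderDataSetIterator(reader, 30, 4, 4, true);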

Is it possible to merge two streams using the CopyTo method?

Is it possible to merge two (or more) Streams (or MemoryStreams) using the CopyTo method?
For instance, I have two source streams, s1 and s2. I'm creating the destination MemoryStream:
MemoryStream omDest = new MemoryStream();
If I copy one stream, everything is fine:
s1.CopyTo(omDest);
But if I copy both, the second one overwrites the first one.
I'd appreciate your help.
Thank you.
You can set the position of the target stream to the length of the first stream after CopyTo, like this:
memoryStream1.CopyTo(target);
target.Position = memoryStream1.Length;
memoryStream2.CopyTo(target);
This way, the copy of the second stream starts at that position instead of overwriting the first one.
Full Code:
using (MemoryStream target = new MemoryStream(30))
{
    using (MemoryStream mem1 = new MemoryStream(new byte[] { 1, 2, 3, 4, 5 }))
    {
        mem1.CopyTo(target);
        target.Position = mem1.Length;
    }
    using (MemoryStream mem2 = new MemoryStream(new byte[] { 6, 7, 8, 9, 10 }))
    {
        mem2.CopyTo(target);
    }
    foreach (byte b in target.ToArray())
    {
        Console.Write(b + ",");
    }
}
EDIT:
To make it simpler, you could also use the WriteTo method (which writes the entire contents of a MemoryStream to another stream), so you wouldn't need to reset the position.

How to do time-series simple forecast?

I have time-series uni-variate data, so just timestamp and value. Now I want to extrapolate (forecast) this value for the next day/month/year. I know there are methods such as Box-Jenkins (ARIMA), etc.
Spark has linear regression and I tried it, but I did not get satisfactory results. Has anybody tried simple time-series forecasting in Spark? Can you share your implementation approach?
PS: I checked the user mailing list for this issue; almost all the questions regarding it are unanswered there.
Yes, I have already applied ARIMA in Spark for uni-variate time series (the ARIMA and ARIMAModel classes below come from the spark-ts library, not Spark itself):
public static void main(String args[])
{
    System.setProperty("hadoop.home.dir", "C:/winutils");
    SparkSession spark = SparkSession
            .builder().master("local")
            .appName("Spark-TS Example")
            .config("spark.sql.warehouse.dir", "file:///C:/Users/abc/Downloads/Spark/sparkdemo/spark-warehouse/")
            .getOrCreate();

    // Read one value per line and parse it to a double
    Dataset<String> lines = spark.read().textFile("C:/Users/abc/Downloads/thunderbird/Time series/trainingvector_arima.csv");
    Dataset<Double> doubleDataset = lines.map(
            (MapFunction<String, Double>) line -> Double.parseDouble(line),
            Encoders.DOUBLE());
    List<Double> doubleList = doubleDataset.collectAsList();

    // Unbox into a primitive array for the dense vector
    Double[] doubleArray = new Double[doubleList.size()];
    doubleArray = doubleList.toArray(doubleArray);
    double[] values = new double[doubleArray.length];
    for (int i = 0; i < doubleArray.length; i++)
    {
        values[i] = doubleArray[i];
    }
    Vector tsvector = Vectors.dense(values);
    System.out.println("Ts vector:" + tsvector.toString());

    //ARIMAModel arimamodel = ARIMA.fitModel(1, 0, 1, tsvector, true, "css-bobyqa", null);
    ARIMAModel arimamodel = ARIMA.autoFit(tsvector, 1, 1, 1);
    Vector forcst = arimamodel.forecast(tsvector, 10);
    System.out.println("forecast of next 10 observations: " + forcst);
}
This code works for me. Pass whatever values you want to forecast as the input data.