Load custom file type in Spark - scala

I've generated a model using OpenNLP, and now I want to read the model in Spark (with Scala) as an RDD and then use it to predict some values.
Is there a way to load file types other than .txt, .csv, and .parquet in Scala?
Thanks.

What you want to load is a model, not data. If the model you have built is serializable, you can define a global singleton object that holds the model and a function that does the prediction, then use that function in an RDD map. For example:
object OpenNLPModel {
val model = //load the OpenNLP model here
def predict(s: String): String = { model.predict(s) }
}
myRdd.map(OpenNLPModel.predict)
Read the Spark programming guide for more information.

I've just found out the answer.
public DoccatModel read(String path) throws IOException {
Configuration conf = new Configuration();
//Get the filesystem - HDFS
FileSystem fs = FileSystem.get(URI.create(path), conf);
FSDataInputStream in = null;
DoccatModel model = null;
try {
//Open the path mentioned in HDFS
in = fs.open(new Path(path));
model = new DoccatModel(in);
} finally {
IOUtils.closeStream(in);
}
return model;
}
You have to use the FileSystem class to read a file from HDFS.
Cheers!

Related

Writing object fields with fixed length to file with Spring batch

Spring Batch provides FixedLengthTokenizer to read data, but I do not see a FixedLengthLineAggregator. How do I write an object to a flat file so that each field is written with a fixed length?
You can do this with FormatterLineAggregator. Just set your fields and set your formats using the String.format() syntax.
@Bean
public FormatterLineAggregator<MyObject> myLineAggregator() {
FormatterLineAggregator<MyObject> lineAggregator = new FormatterLineAggregator<>();
lineAggregator.setFieldExtractor(myBeanWrapperFieldExtractor());
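// '-' and '0' cannot be combined in one specifier (java.util.Formatter rejects it), so the number is zero-padded only
lineAggregator.setFormat("%-5s%09d%20s");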
return lineAggregator;
}
@Bean
public BeanWrapperFieldExtractor<MyObject> myBeanWrapperFieldExtractor() {
BeanWrapperFieldExtractor<MyObject> fieldExtractor = new BeanWrapperFieldExtractor<MyObject>();
fieldExtractor.setNames(new String[]{"fieldOne", "fieldTwo", "fieldThree"});
return fieldExtractor;
}
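
To see what such a format string produces, here is a minimal plain-Java sketch that calls String.format() directly, without Spring Batch. The class name FixedWidthFormatDemo, the helper formatLine, and the sample values are made up for illustration, and the format "%-5s%09d%20s" is an assumption (5-character left-aligned string, 9-digit zero-padded integer, 20-character right-aligned string).

```java
import java.util.Locale;

public class FixedWidthFormatDemo {

    // Assumed layout: 5 chars left-aligned, 9 digits zero-padded, 20 chars right-aligned.
    static final String FORMAT = "%-5s%09d%20s";

    // Hypothetical helper: formats three field values into one fixed-width line.
    static String formatLine(String fieldOne, int fieldTwo, String fieldThree) {
        return String.format(Locale.ROOT, FORMAT, fieldOne, fieldTwo, fieldThree);
    }

    public static void main(String[] args) {
        String line = formatLine("AB", 42, "hello");
        // Brackets make the padding visible.
        System.out.println("[" + line + "]");
        System.out.println(line.length()); // 5 + 9 + 20 = 34
    }
}
```

Note that a width in String.format() is a minimum, not a maximum: a value longer than its width is not cut off and will shift the following fields. For strings, adding a precision (for example "%-5.5s") truncates to at most five characters.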

Spring Batch: How to write to csv file from a Collection

I need to write to a CSV file from a Collection.
I created a class:
public class ItemWriterForCSVFile extends FlatFileItemWriter<Map<String, String>> {
private LineAggregator<Map<String, String>> createLineAggregator() {
DelimitedLineAggregator<Map<String, String>> lineAggregator = new DelimitedLineAggregator<>();
lineAggregator.setDelimiter(",");
BeanWrapperFieldExtractor<Map<String, String>> fieldExtractor = createFieldExtractor();
lineAggregator.setFieldExtractor(fieldExtractor);
return lineAggregator;
}
private BeanWrapperFieldExtractor<Map<String, String>> createFieldExtractor() {
BeanWrapperFieldExtractor<Map<String, String>> extractor = new BeanWrapperFieldExtractor<>();
extractor.setNames(fields);
return extractor;
}
}
Currently I have the code above, which extracts fields from an object.
I need to change it to use a Map as a dynamic structure.
I saw that PassThroughFieldExtractor can write a collection, but I haven't found a suitable example in Java.
Any help is appreciated.

Load a package and inside classes

I'm creating a package with some classes that I generate with wsimport on the fly, and now I'm trying to load it so I can use it. How can I do this, natively or with a library like Byte Buddy? I tried the code below to load each class in the package:
File [] files = new File("<Path to package in filesystem with classes (*.class)>").listFiles();
List<URL> classUrls = new ArrayList<>();
for(File file : files) {
classUrls.add(new URL("file://" + file.getAbsolutePath()));
}
URL[] classz = new URL[classUrls.size()];
classz = classUrls.toArray(classz);
URLClassLoader child = new URLClassLoader(classz);
Class.forName("com.abc.external.resources.genwn239aqyhmfz.SomeClass", true, child);
But I'm still getting (Package: com.abc.external.resources.genwn239aqyhmfz.SomeClass)
java.lang.ClassNotFoundException: com.abc.external.resources.genwn239aqyhmfz.SomeClass
The rules for the class path are no different from the rules you have to obey when launching your application. The class path entries are neither class files nor the directories containing them, but the roots of your package structure.
So if the class you want to load is com.abc.external.resources.genwn239aqyhmfz.SomeClass, the class path entry has to be the directory containing the com directory, which contains the abc directory, and so on. If you know the expected fully qualified name of one of the classes, it's easy to find the right directory: just traverse up the file hierarchy as many times as the qualified name has package components. However, when you don't know the name beforehand, finding it can be tricky. Here is a sketch:
// pass the expected name of one class contained in f or null if not known
static void loadClasses(File f, String predictedName)
throws IOException, ClassNotFoundException {
File[] classes = f.listFiles((d,n)->n.endsWith(".class"));
if(classes == null || classes.length == 0) {
System.err.println("no classes or not a directory");
return;
}
if(predictedName == null) predictedName = predictName(classes[0]);
for(int p = predictedName.indexOf('.'); p >= 0; p = predictedName.indexOf('.', p+1))
f = f.getParentFile();
URLClassLoader classLoader = new URLClassLoader(new URL[] { f.toURI().toURL() });
String packageName = predictedName.substring(0, predictedName.lastIndexOf('.')+1);
for(File cf: classes) {
String name = cf.getName();
name = name.substring(0, name.length()-6); // strip off ".class"
Class<?> cl = classLoader.loadClass(packageName+name);
// what do you wanna do with the classes?
System.out.println(cl);
}
}
private static String predictName(File classFile) throws IOException {
byte[] data = Files.readAllBytes(classFile.toPath());
return new ClassLoader() {
String getName() {
return defineClass(null, data, 0, data.length).getName();
}
}.getName();
}
The predictName implementation is a very simple one. If the class has dependencies on classes within the same file hierarchy which the JVM immediately tries to resolve, it will fail, as we don't have the necessary information yet. In that case, only a bytecode parsing library that can extract the name without loading the class would help. But that exceeds the scope of this question…
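
For illustration only, here is a rough sketch of what reading the name without loading the class could look like in plain Java, without a bytecode library. It walks the constant pool of the class file just far enough to resolve the this_class entry. The class name ClassNameReader and the method readClassName are invented for this example, and the tag sizes are taken from the class file format in the JVM specification; a newer class file version with additional constant pool tags would be rejected with an exception rather than handled.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ClassNameReader {

    // Reads the fully qualified name from class file bytes without defining the class.
    static String readClassName(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        if (data.readInt() != 0xCAFEBABE) {
            throw new IOException("not a class file");
        }
        data.readUnsignedShort(); // minor version
        data.readUnsignedShort(); // major version
        int count = data.readUnsignedShort(); // valid entries are 1..count-1
        String[] utf8 = new String[count];
        int[] classNameIndex = new int[count];
        for (int i = 1; i < count; i++) {
            int tag = data.readUnsignedByte();
            switch (tag) {
                case 1: utf8[i] = data.readUTF(); break;                       // Utf8
                case 7: classNameIndex[i] = data.readUnsignedShort(); break;   // Class
                case 8: case 16: case 19: case 20: skip(data, 2); break;       // String, MethodType, Module, Package
                case 15: skip(data, 3); break;                                 // MethodHandle
                case 3: case 4: case 9: case 10: case 11:
                case 12: case 17: case 18: skip(data, 4); break;               // Integer, Float, refs, NameAndType, (Invoke)Dynamic
                case 5: case 6: skip(data, 8); i++; break;                     // Long, Double occupy two slots
                default: throw new IOException("unknown constant pool tag " + tag);
            }
        }
        data.readUnsignedShort(); // access flags
        int thisClass = data.readUnsignedShort(); // index of this class's Class entry
        return utf8[classNameIndex[thisClass]].replace('/', '.');
    }

    // readFully is used because skipBytes may skip fewer bytes than requested.
    private static void skip(DataInputStream data, int n) throws IOException {
        data.readFully(new byte[n]);
    }

    public static void main(String[] args) throws IOException {
        // Demo input: the class file of java.lang.String shipped with the JDK.
        try (InputStream in = String.class.getResourceAsStream("String.class")) {
            System.out.println(readClassName(in));
        }
    }
}
```

Because nothing is defined or linked, this does not trigger resolution of dependencies, so it could stand in for predictName when the simple approach fails. For anything beyond the name, a real bytecode library is the safer choice.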

scala: Mocking my Scala object that has an external dependency

I have an object like this:
// I want to test this Object
object MyObject {
protected val retryHandler: HttpRequestRetryHandler = new HttpRequestRetryHandler {
def retryRequest(exception: IOException, executionCount: Int, context: HttpContext): Boolean = {
true // implementation
}
}
private val connectionManager: PoolingHttpClientConnectionManager = new PoolingHttpClientConnectionManager
val httpClient: CloseableHttpClient = HttpClients.custom
.setConnectionManager(connectionManager)
.setRetryHandler(retryHandler)
.build
def methodPost = {
//create new context and new Post instance
val post = new HttpPost("url")
val res = httpClient.execute(post, HttpClientContext.create)
// check response code and then take action based on response code
}
def methodPut = {
// same as methodPost except use HttpPut instead HttpPost
}
}
I want to test this object by mocking dependent objects like httpClient. How can I achieve this? Can I do it using Mockito, or is there a better way? If yes, how? Is there a better design for this class?
Your problem is: you created hard-to-test code. You can turn here to watch some videos to understand why that is.
The short answer: directly calling new in your production code always makes testing harder. You could be using Mockito spies (see here on how that works).
But: the better answer would be to rework your production code; for example to use dependency injection. Meaning: instead of creating the objects your class needs itself (by using new) ... your class receives those objects from somewhere.
The typical (Java) approach would be something like:
public MyClass() { this(new SomethingINeed()); }
MyClass(SomethingINeed incoming) { this.something = incoming; }
In other words: the normal usage path still calls new directly; but for unit testing you provide an alternative constructor that you can use to inject the thing(s) your class under test depends on.
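
A self-contained sketch of that two-constructor pattern is shown here, with no mocking library involved. All names (ConstructorInjectionDemo, Transport, RealTransport, Service, submit) are hypothetical stand-ins, with Transport playing the role of something like an HTTP client; the test double is just a lambda.

```java
public class ConstructorInjectionDemo {

    // Hypothetical dependency: returns an HTTP-like status code for a POST.
    interface Transport {
        int post(String url);
    }

    // Stand-in for the production implementation; a real one would do network I/O.
    static class RealTransport implements Transport {
        @Override
        public int post(String url) {
            throw new UnsupportedOperationException("would perform a real network call");
        }
    }

    static class Service {
        private final Transport transport;

        // Normal usage path: creates the real dependency itself.
        Service() {
            this(new RealTransport());
        }

        // Test path: the dependency is handed in from outside.
        Service(Transport transport) {
            this.transport = transport;
        }

        // Logic under test: treats status 200 as success.
        boolean submit(String url) {
            return transport.post(url) == 200;
        }
    }

    public static void main(String[] args) {
        Service ok = new Service(url -> 200);      // fake that always succeeds
        Service failing = new Service(url -> 500); // fake that always fails
        System.out.println(ok.submit("http://example.invalid"));      // true
        System.out.println(failing.submit("http://example.invalid")); // false
    }
}
```

The same idea carries over to the Scala object in the question if it is turned into a class that takes the HTTP client as a constructor parameter; a Mockito mock could then be passed in place of the lambda.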

InvalidFeatureException in jpmml while trying to load in java

Link to my PMML file, generated with R:
pmml file on google drive
Here is my Java code:
PMML model = null;
File inputFilePath = new File("/home/equation/iris_rf.pmml");
try (InputStream is = new FileInputStream(inputFilePath)) {
model = org.jpmml.model.PMMLUtil.unmarshal(is);
} catch (Exception e) {
throw e;
}
// construct a tree predictor based on the PMML
ModelEvaluator<TreeModel> modelEvaluator = new TreeModelEvaluator(model);
System.out.println(modelEvaluator.getSummary());
Exception:
Exception in thread "main" org.jpmml.evaluator.InvalidFeatureException: PMML
at org.jpmml.evaluator.ModelEvaluator.selectModel(ModelEvaluator.java:528)
at org.jpmml.evaluator.tree.TreeModelEvaluator.<init>(TreeModelEvaluator.java:64)
at com.girnarsoft.Pmml.main(Pmml.java:24)
Any idea why I'm getting this error?
You must instantiate the org.jpmml.evaluator.ModelEvaluator subclass that matches the top-level Model element of your PMML file.
Currently, you're instantiating org.jpmml.evaluator.tree.TreeModelEvaluator, which corresponds to the TreeModel element. However, you should be instantiating org.jpmml.evaluator.mining.MiningModelEvaluator instead, as the top-level Model element in your PMML file is the MiningModel element.
In general, you should construct ModelEvaluator instances using the ModelEvaluatorFactory#newModelEvaluator(PMML) factory method.