XGBoost4j-spark prediction on sparse vector from local model - scala

I am running on Databricks. I am trying to use an xgboost model that was trained locally in R to make distributed predictions with xgboost4j-spark in Scala. The data is in a DataFrame with a features column of sparse vectors from org.apache.spark.ml.linalg.Vectors.sparse. I have successfully trained an unrelated model on the data in this format.
The data looks like this:
train_sparse.filter("ID == 1").show(false)
+---+------------------------------------------+
|ID |feature_vector                            |
+---+------------------------------------------+
|1  |(4056,[0,1,1097,2250],[26.0,1.0,1.0,57.0])|
+---+------------------------------------------+
A bridge class had to be created first to load in the local model.
%scala
package ml.dmlc.xgboost4j.scala.spark2

import ml.dmlc.xgboost4j.scala.Booster
import ml.dmlc.xgboost4j.scala.spark.XGBoostRegressionModel

class XGBoostRegBridge(
    uid: String,
    _booster: Booster) {
  val xgbRegressionModel = new XGBoostRegressionModel(uid, _booster)
}
import ml.dmlc.xgboost4j.scala.spark2._
import ml.dmlc.xgboost4j.scala.XGBoost
val model = XGBoost.loadModel("/dbfs/FileStore/tmp/xgb53.model")
val bri = new XGBoostRegBridge("uid", model)
bri.xgbRegressionModel.setFeaturesCol("feature_vector")
var pred = bri.xgbRegressionModel.transform(train_sparse)
pred.show()
Job aborted due to stage failure.
Caused by: XGBoostError: [17:36:06] /workspace/jvm-packages/xgboost4j/src/native/xgboost4j.cpp:159: [17:36:06] /workspace/jvm-packages/xgboost4j/src/native/xgboost4j.cpp:78: Check failed: jenv->ExceptionOccurred():
Stack trace:
[bt] (0) /local_disk0/tmp/libxgboost4j3687488462117693459.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x53) [0x7f0ff8810843]
[bt] (1) /local_disk0/tmp/libxgboost4j3687488462117693459.so(XGBoost4jCallbackDataIterNext+0xd10) [0x7f0ff880d960]
[bt] (2) /local_disk0/tmp/libxgboost4j3687488462117693459.so(xgboost::data::SimpleDMatrix::SimpleDMatrix<xgboost::data::IteratorAdapter<void*, int (void*, int (*)(void*, XGBoostBatchCSR), void*), XGBoostBatchCSR> >(xgboost::data::IteratorAdapter<void*, int (void*, int (*)(void*, XGBoostBatchCSR), void*), XGBoostBatchCSR>*, float, int)+0x2f8) [0x7f0ff8902268]
[bt] (3) /local_disk0/tmp/libxgboost4j3687488462117693459.so(xgboost::DMatrix* xgboost::DMatrix::Create<xgboost::data::IteratorAdapter<void*, int (void*, int (*)(void*, XGBoostBatchCSR), void*), XGBoostBatchCSR> >(xgboost::data::IteratorAdapter<void*, int (void*, int (*)(void*, XGBoostBatchCSR), void*), XGBoostBatchCSR>*, float, int, std::string const&, unsigned long)+0x45) [0x7f0ff88f79b5]
[bt] (4) /local_disk0/tmp/libxgboost4j3687488462117693459.so(XGDMatrixCreateFromDataIter+0x152) [0x7f0ff881e682]
[bt] (5) /local_disk0/tmp/libxgboost4j3687488462117693459.so(Java_ml_dmlc_xgboost4j_java_XGBoostJNI_XGDMatrixCreateFromDataIter+0x96) [0x7f0ff880b7b6]
[bt] (6) [0x7f1020017ee7]
Stack trace:
[bt] (0) /local_disk0/tmp/libxgboost4j3687488462117693459.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x53) [0x7f0ff8810843]
[bt] (1) /local_disk0/tmp/libxgboost4j3687488462117693459.so(XGBoost4jCallbackDataIterNext+0xdc4) [0x7f0ff880da14]
[bt] (2) /local_disk0/tmp/libxgboost4j3687488462117693459.so(xgboost::data::SimpleDMatrix::SimpleDMatrix<xgboost::data::IteratorAdapter<void*, int (void*, int (*)(void*, XGBoostBatchCSR), void*), XGBoostBatchCSR> >(xgboost::data::IteratorAdapter<void*, int (void*, int (*)(void*, XGBoostBatchCSR), void*), XGBoostBatchCSR>*, float, int)+0x2f8) [0x7f0ff8902268]
[bt] (3) /local_disk0/tmp/libxgboost4j3687488462117693459.so(xgboost::DMatrix* xgboost::DMatrix::Create<xgboost::data::IteratorAdapter<void*, int (void*, int (*)(void*, XGBoostBatchCSR), void*), XGBoostBatchCSR> >(xgboost::data::IteratorAdapter<void*, int (void*, int (*)(void*, XGBoostBatchCSR), void*), XGBoostBatchCSR>*, float, int, std::string const&, unsigned long)+0x45) [0x7f0ff88f79b5]
[bt] (4) /local_disk0/tmp/libxgboost4j3687488462117693459.so(XGDMatrixCreateFromDataIter+0x152) [0x7f0ff881e682]
[bt] (5) /local_disk0/tmp/libxgboost4j3687488462117693459.so(Java_ml_dmlc_xgboost4j_java_XGBoostJNI_XGDMatrixCreateFromDataIter+0x96) [0x7f0ff880b7b6]
[bt] (6) [0x7f1020017ee7]
It looks like an iterator error of some sort, but I'm not using a custom iterator.

Answering my own question: I just needed bri.xgbRegressionModel.setMissing(0.0F) and now it works. The default missing value is NaN, and the implicit zeros of the sparse vectors apparently have to match the model's missing value.

Related

Scala toBinaryString on a Short

In my console, when I try to check (0xFFFF.toShort).toBinaryString, it returns
(0xFFFF.toShort).toBinaryString
res1: String = 11111111111111111111111111111111
Shouldn't it return 1111111111111111, as in 16 ones (16 bits)?
How do I fix this?
Thanks
The bug is https://github.com/scala/bug/issues/10216: the extension methods are only defined for Int.
The workaround is to supply them yourself for Byte and Short, as shown here: https://github.com/scala/scala/pull/8383/files
For example
scala> implicit class MyRichByte(val b: Byte) extends AnyVal {
| def toBinaryString: String = java.lang.Integer.toBinaryString(java.lang.Byte.toUnsignedInt(b))
| }
class MyRichByte
scala> (0xFFFF.toByte).toBinaryString
val res4: String = 11111111
In the REPL, type // print followed by Tab (tab completion) to see the desugaring:
scala> (0xFFFF.toShort).toBinaryString //print
scala.Predef.intWrapper(65535.toShort.toInt).toBinaryString // : String
Or use Ammonite's desugar:
@ desugar((0xFFFF.toShort).toBinaryString)
res5: Desugared = scala.Predef.intWrapper(65535.toShort.toInt).toBinaryString
toBinaryString is a method of Int, and there is an implicit conversion from Short to Int (intWrapper).
0xFFFF.toShort evaluates to -1, which is in turn widened to the Int -1, and that is how we get the 32 ones.
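Following the pull request linked above, a Short version can be sketched the same way (the enclosing object here is just for packaging):

```scala
object ShortBits {
  // Widen through the unsigned 16-bit value so only 16 bits are printed
  implicit class MyRichShort(val s: Short) extends AnyVal {
    def toBinaryString: String =
      java.lang.Integer.toBinaryString(java.lang.Short.toUnsignedInt(s))
  }
}
```

With ShortBits._ imported, (0xFFFF.toShort).toBinaryString yields "1111111111111111", i.e. 16 ones.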

How to use a custom deserializer to deserialize classes with a predefined deserializer (java.util.UUID)

In a class of my program, there is an id field of type java.util.UUID. When the field is mapped and stored in MongoDB, it becomes LUUID() (legacy UUID, subtype 0x03).
I would like to register a custom serializer and deserializer on the ObjectMapper so that the field is stored as a standard UUID (subtype 0x04) in the database, while staying a java.util.UUID in the main program.
My attempt:
I managed to use a custom serializer to write the UUID object to MongoDB, but I cannot map it back to java.util.UUID when deserializing.
I have two helper functions that convert between java.util.UUID and org.bson.types.Binary (which is stored as a UUID object in MongoDB):
fun fromStandardBinaryUUID(binary: Binary): UUID
fun toStandardBinaryUUID(uuid: UUID): Binary
Serializer.kt
class UUIDSerializer(t: Class<UUID>) : StdSerializer<UUID>(t) {
    @Throws(IOException::class)
    override fun serialize(uuid: UUID, jsonGenerator: JsonGenerator,
                           serializerProvider: SerializerProvider?) {
        jsonGenerator.writeObject(toStandardBinaryUUID(uuid))
    }
}
Deserializer.kt
class UUIDDeserializer(t: Class<*>) : StdDeserializer<UUID>(t) {
    constructor() : this(UUID::class.java)

    @Throws(IOException::class, JsonProcessingException::class)
    override fun deserialize(parser: JsonParser, deserializer: DeserializationContext?): UUID {
        val binary = parser.readValueAs(Binary::class.java)
        val uuid = fromStandardBinaryUUID(binary)
        return uuid // garbage value, not the actual id
    }
}
Code for registration
val uuidSerializer = UUIDSerializer(UUID::class.java)
val uuidDeserializer = UUIDDeserializer(UUID::class.java)
val module = SimpleModule("UUIDSerializer");
module.addDeserializer(UUID::class.java, uuidDeserializer).addSerializer(UUID::class.java, uuidSerializer);
KMongoConfiguration.registerBsonModule(module)
Implementation of fromStandardBinaryUUID and toStandardBinaryUUID:
fun fromStandardBinaryUUID(binary: Binary): UUID {
    var msb: Long = 0
    var lsb: Long = 0
    val uuidBytes = binary.data
    for (i in 8..15) {
        lsb = lsb shl 8
        lsb = lsb or uuidBytes[i].toLong() and 0xFFL
    }
    for (i in 0..7) {
        msb = msb shl 8
        msb = msb or uuidBytes[i].toLong() and 0xFFL
    }
    return UUID(msb, lsb)
}

fun toStandardBinaryUUID(uuid: UUID): Binary {
    var msb = uuid.mostSignificantBits
    var lsb = uuid.leastSignificantBits
    val uuidBytes = ByteArray(16)
    for (i in 15 downTo 8) {
        uuidBytes[i] = (lsb and 0xFFL).toByte()
        lsb = lsb shr 8
    }
    for (i in 7 downTo 0) {
        uuidBytes[i] = (msb and 0xFFL).toByte()
        msb = msb shr 8
    }
    return Binary(0x04.toByte(), uuidBytes)
}
A unit test with a serialization and then a deserialization is available here: https://github.com/Litote/kmongo/commit/97afc3ee309dd2b25e46ba95bb9678fb10006c63
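One thing worth checking in fromStandardBinaryUUID above: Kotlin's infix or and and share the same precedence and associate left, so lsb or uuidBytes[i].toLong() and 0xFFL parses as (lsb or uuidBytes[i].toLong()) and 0xFFL, which masks away everything already accumulated in lsb; the mask needs its own parentheses. For reference, here is the intended big-endian round trip sketched in Scala (names are illustrative, not part of any library):

```scala
import java.util.UUID

object UuidBytes {
  // RFC 4122 layout: bytes 0-7 hold the most significant long (big-endian),
  // bytes 8-15 the least significant long.
  def toBytes(uuid: UUID): Array[Byte] = {
    var msb = uuid.getMostSignificantBits
    var lsb = uuid.getLeastSignificantBits
    val bytes = new Array[Byte](16)
    for (i <- 15 to 8 by -1) { bytes(i) = (lsb & 0xffL).toByte; lsb >>= 8 }
    for (i <- 7 to 0 by -1)  { bytes(i) = (msb & 0xffL).toByte; msb >>= 8 }
    bytes
  }

  def fromBytes(bytes: Array[Byte]): UUID = {
    var msb = 0L
    var lsb = 0L
    // Parenthesize the mask so it applies to the byte, not the accumulator
    for (i <- 0 to 7)  msb = (msb << 8) | (bytes(i).toLong & 0xffL)
    for (i <- 8 to 15) lsb = (lsb << 8) | (bytes(i).toLong & 0xffL)
    new UUID(msb, lsb)
  }
}
```

The round trip UuidBytes.fromBytes(UuidBytes.toBytes(u)) == u then holds for any UUID.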

Reading from InputStream as an Iterator[Byte] or Array[Byte]

I am representing a data object as an Iterator[Byte], which is created from an InputStream instance.
The problem lies in that Byte is a signed integer type from -128 to 127, while the read method of InputStream returns an unsigned value from 0 to 255 as an Int. This is particularly problematic since, by convention, -1 denotes the end of the stream.
What is the best way to alleviate the incompatibility between these two types? Is there an elegant way of converting between one to another? Or should I just use Int instead of Bytes, even though it feels less elegant?
def toByteIterator(in: InputStream): Iterator[Byte] = {
  Iterator.continually(in.read).takeWhile(-1 !=).map { elem =>
    convert // need to convert unsigned int to Byte here
  }
}

def toInputStream(_it: Iterator[Byte]): InputStream = {
  new InputStream {
    val (it, _) = _it.duplicate
    override def read(): Int = {
      if (it.hasNext) it.next() // need to convert Byte to unsigned int
      else -1
    }
  }
}
Yes, you can convert byte to int and vice versa easily.
First, int to byte can be converted with just toByte:
scala> 128.toByte
res0: Byte = -128
scala> 129.toByte
res1: Byte = -127
scala> 255.toByte
res2: Byte = -1
so your elem => convert could be just _.toByte.
Second, a signed byte can be converted to an unsigned int with a handy function in java.lang.Byte, called toUnsignedInt:
scala> java.lang.Byte.toUnsignedInt(-1)
res1: Int = 255
scala> java.lang.Byte.toUnsignedInt(-127)
res2: Int = 129
scala> java.lang.Byte.toUnsignedInt(-128)
res3: Int = 128
so you can write java.lang.Byte.toUnsignedInt(it.next()) in your second piece of code.
However, the last method is only available since Java 8. I don't know about its alternatives in older versions of Java, but its actual implementation is astonishingly simple:
public static int toUnsignedInt(byte x) {
    return ((int) x) & 0xff;
}
so all you need is just to write
it.next().toInt & 0xff
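Putting the two conversions together, the question's functions could look like this (a sketch; the duplicate call from the question is dropped for brevity):

```scala
import java.io.InputStream

def toByteIterator(in: InputStream): Iterator[Byte] =
  // read() yields 0..255, or -1 at end of stream; toByte reinterprets the bits
  Iterator.continually(in.read()).takeWhile(_ != -1).map(_.toByte)

def toInputStream(it: Iterator[Byte]): InputStream = new InputStream {
  override def read(): Int =
    if (it.hasNext) it.next().toInt & 0xff // back to 0..255
    else -1                                // end of stream
}
```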
Unfortunately this comes down to an awkward aspect of InputStream's design: if you use read(), you will have that problem. You should use read(byte[]) instead.
But as you say, you could also use Int. That is up to you.
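A sketch of the read(byte[]) route mentioned above, which sidesteps the per-byte conversion entirely (chunk size is arbitrary; this assumes a blocking stream, where read returns at least one byte or -1):

```scala
import java.io.InputStream

def readChunks(in: InputStream, chunkSize: Int = 8192): Iterator[Array[Byte]] =
  Iterator
    .continually {
      val buf = new Array[Byte](chunkSize)
      val n = in.read(buf) // number of bytes read, or -1 at end of stream
      if (n < 0) Array.empty[Byte] else buf.take(n)
    }
    .takeWhile(_.nonEmpty)
```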

Scala native methods in inner classes

I have:
class XCClass
{
  @native protected def createWin(displayPtr: Long, width: Int, height: Int, backGroundColour: Int = white.value,
      borderColour: Int = darkblue.value, borderWidth: Int = 0, xPosn: Int = 0, yPosn: Int = 0): Long

  @native protected def xOpen(): Long

  System.load("/sdat/projects/prXCpp/Release/libprXCpp.so")

  // Code edited out

  class Window(width: Int, height: Int, backGroundColour: ColourInt = white, borderColour: ColourInt = darkblue,
      borderWidth: Int = 0, xPosn: Int = 0, yPosn: Int = 0)
  {
    val xWinPtr = createWin(xServPtr, width, height, backGroundColour.value, borderColour.value, borderWidth, xPosn, yPosn)

    @native def drawLine(x1: Int, y1: Int, x2: Int, y2: Int): Unit
  }
}
The first two methods work fine, but calling the native method on the inner class:
object XApp extends App
{
  val xc: XCClass = XCClass()
  val win: xc.Window = xc.Window(800, 600)
  win.drawLine(20, 20, 40, 40)
  readLine()
}
Exception in thread "main" java.lang.UnsatisfiedLinkError: pXClient.XCClass$Window.drawLine(IIII)V
Here's the C++ signature
extern "C" JNIEXPORT void JNICALL Java_pXClient_XCClass$Window_drawLine(JNIEnv * env, jobject c1, Display *dpy,
Window win, jint x1, jint y1, jint x2, jint y2)
I tried using an underscore instead of the $ sign, and having no inner name at all but that failed as well.
Edit2: I managed to get javah to work just before seeing Robin's answer and it gave
JNIEXPORT void JNICALL Java_pXClient_XCClass_00024Window_drawLine
(JNIEnv *, jobject, jint, jint, jint, jint);
Edit4: It worked fine once I'd corrected errors in my code. It seems that the JVM will import a native function with the wrong parameter signature as long as the name is correct.
I just did a quick test with a .java file and javah, and a $ is represented as _00024.
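The mangling rules javah applies are mechanical, per the JNI specification: '.' and '/' become '_', '_' becomes "_1", and any other character becomes "_0" plus four lowercase hex digits. A hypothetical helper (not part of any library) that reproduces where _00024 comes from:

```scala
object JniName {
  // Mangle one identifier per the JNI spec's name-escaping rules
  private def mangle(s: String): String = s.flatMap {
    case c if c.isLetterOrDigit && c < 128 => c.toString
    case '.' | '/'                         => "_"
    case '_'                               => "_1"
    case c                                 => f"_0${c.toInt}%04x"
  }

  def nativeName(className: String, method: String): String =
    s"Java_${mangle(className)}_${mangle(method)}"
}
```

JniName.nativeName("pXClient.XCClass$Window", "drawLine") reproduces the javah output above.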

reading from binary file Scala

How do I read a binary file in chunks in Scala?
This was what I was trying to do
val fileInput = new FileInputStream("tokens")
val dis = new DataInputStream(fileInput)
var value = dis.readInt()
var i=0;
println(value)
The value printed is a huge number, whereas the first value should be 1.
Because you're seeing 16777216 where you'd expect to have a 1, it sounds like the problem is the endianness of the file is different than the JVM is expecting. (That is, Java always expects big endian/network byte order and your file contains numbers in little endian.)
That's a problem with a well-established gamut of solutions.
For example this page has a class that wraps the input stream and makes the problem go away.
Alternatively this page has functions that will read from a DataInputStream.
This StackOverflow answer has various snippets that will simply convert an int, if that's all you need to do.
Here's a Scala snippet that will add methods to read little endian numbers from the file.
The simplest answer to your question of how to fix it is to simply swap the bytes around as you read them. You could do that by replacing your line that looks like
var value = dis.readInt()
with
var value = java.lang.Integer.reverseBytes(dis.readInt())
If you want to make that a bit more concise, you can either implicitly add readXLE() methods to DataInput or extend DataInputStream with readXLE() methods. Unfortunately, the Java authors decided that the readX() methods should be final, so we can't override those to provide a transparent reader for little-endian files.
object LittleEndianImplicits {
  implicit def dataInputToLittleEndianWrapper(d: DataInput) = new DataInputLittleEndianWrapper(d)

  class DataInputLittleEndianWrapper(d: DataInput) {
    def readLongLE(): Long = java.lang.Long.reverseBytes(d.readLong())
    def readIntLE(): Int = java.lang.Integer.reverseBytes(d.readInt())
    def readCharLE(): Char = java.lang.Character.reverseBytes(d.readChar())
    def readShortLE(): Short = java.lang.Short.reverseBytes(d.readShort())
  }
}

class LittleEndianDataInputStream(i: InputStream) extends DataInputStream(i) {
  def readLongLE(): Long = java.lang.Long.reverseBytes(super.readLong())
  def readIntLE(): Int = java.lang.Integer.reverseBytes(super.readInt())
  def readCharLE(): Char = java.lang.Character.reverseBytes(super.readChar())
  def readShortLE(): Short = java.lang.Short.reverseBytes(super.readShort())
}
object M {
  def main(a: Array[String]): Unit = {
    println("// Regular DIS")
    val d = new DataInputStream(new java.io.FileInputStream("endian.bin"))
    println("Int 1: " + d.readInt())
    println("Int 2: " + d.readInt())

    println("// Little Endian DIS")
    val e = new LittleEndianDataInputStream(new java.io.FileInputStream("endian.bin"))
    println("Int 1: " + e.readIntLE())
    println("Int 2: " + e.readIntLE())

    import LittleEndianImplicits._
    println("// Regular DIS with readIntLE implicit")
    val f = new DataInputStream(new java.io.FileInputStream("endian.bin"))
    println("Int 1: " + f.readIntLE())
    println("Int 2: " + f.readIntLE())
  }
}
The "endian.bin" file mentioned above contains a big-endian 1 followed by a little-endian 1. Running M.main() above prints:
// Regular DIS
Int 1: 1
Int 2: 16777216
// Little Endian DIS
Int 1: 16777216
Int 2: 1
// Regular DIS with readIntLE implicit
Int 1: 16777216
Int 2: 1
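As an alternative to reverseBytes, java.nio.ByteBuffer can be told the byte order directly; a minimal sketch:

```scala
import java.nio.{ByteBuffer, ByteOrder}

object LittleEndianDemo extends App {
  // 01 00 00 00 is the little-endian encoding of the int 1
  val raw = Array[Byte](1, 0, 0, 0)

  val le = ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN)
  println(le.getInt()) // prints 1

  // The same bytes read with the default (big-endian) order give 16777216
  println(ByteBuffer.wrap(raw).getInt()) // prints 16777216
}
```

For whole files you can wrap the result of java.nio.file.Files.readAllBytes the same way and pull ints out of the buffer in order.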