Can mlflow.spark's saved model be loaded as a Spark/Scala Pipeline? - scala

Our algorithm engineer is developing a machine learning model using PySpark and MLflow. He is trying to save the model with the mlflow.spark API, so the model format is the native Spark MLlib format. Can the model be loaded from Spark Scala code? It seems that MLflow is quite restricted for cross-language usage.

The MLflow Java/Scala client does not have feature parity with the MLflow Python client, as it is missing the concepts of Projects and Models. However, you can read a PySpark-generated Spark ML model from Scala using the downloadArtifacts method:
https://mlflow.org/docs/latest/java_api/org/mlflow/tracking/MlflowClient.html#downloadArtifacts-java.lang.String-java.lang.String-
%python
import mlflow
# Log the fitted PySpark model; the "sparkml" sub-artifact is the native Spark MLlib format
mlflow.spark.log_model(model, "spark-model")

%scala
import org.mlflow.tracking.MlflowClient
import org.apache.spark.ml.PipelineModel

val client = new MlflowClient()  // reads the tracking URI from MLFLOW_TRACKING_URI
// Download the native Spark ML part of the logged model to a local path, then load and score
val modelPath = client.downloadArtifacts(runId, "spark-model/sparkml").getAbsolutePath
val model = PipelineModel.load(modelPath)
val predictions = model.transform(data)

Related

Tobit model in pyspark

I am new to PySpark and have a requirement to create a Tobit model in it.
I am looking for a source code sample of a Tobit model in PySpark. It would be great if there were a built-in package in PySpark for creating a Tobit model. Kindly assist.

Does Apache Spark 2.2 support user-defined types (UDT)?

From the JIRA ticket Hide UserDefinedType in Spark 2.0, it seems that Spark hid the UDT API starting with version 2.0.
Does an alternative function or API exist in version 2.2 that would let us define a UserDefinedType? I wish to use a custom type in a DataFrame or in Structured Streaming.
There is no alternative API and UDT remains private (https://issues.apache.org/jira/browse/SPARK-7768).
Generic encoders (org.apache.spark.sql.Encoders.kryo and org.apache.spark.sql.Encoders.javaSerialization) serve a similar purpose for Dataset, but they are not a direct replacement (a sketch follows the links below):
How to store custom objects in Dataset?
Questions about the future of UDTs and Encoders
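As an illustration, here is a minimal sketch of the generic-encoder workaround; the GeoPoint case class and the session setup are hypothetical, and the object ends up stored as one opaque binary column rather than as typed fields:

import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

// Hypothetical custom type with no built-in Spark SQL mapping
case class GeoPoint(lat: Double, lon: Double)

val spark = SparkSession.builder.master("local[*]").appName("kryo-encoder").getOrCreate()

// Kryo serializes the whole object into a single binary column, which stands in
// for a UDT at the cost of losing columnar access to the individual fields
implicit val geoPointEncoder: Encoder[GeoPoint] = Encoders.kryo[GeoPoint]

val points = spark.createDataset(Seq(GeoPoint(59.33, 18.07), GeoPoint(48.86, 2.35)))
points.show()  // a single binary "value" column holding the serialized bytes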

Import different db drivers in Slick

Slick 3 has an "import api" pattern for using a specific database driver, e.g.
import slick.driver.H2Driver.api._
...DAO implementation...
or
import slick.driver.PostgresDriver.api._
...DAO implementation...
How do I use PostgreSQL in production and H2 in unit tests?
Use DatabaseConfig instead. As the Slick documentation states:
On top of the configuration syntax for Database, there is another layer in the form of DatabaseConfig which allows you to configure a Slick driver plus a matching Database together. This makes it easy to abstract over different kinds of database systems by simply changing a configuration file.
Instead of importing database-specific drivers, first obtain a DatabaseConfig:
val dbConfig = DatabaseConfig.forConfig[JdbcProfile]("<db_name>")
And then import the api from it:
import dbConfig.driver.api._
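As a rough sketch of how this separates production from test databases (the "appdb" key, the table, and the DAO names are hypothetical; the package layout assumes Slick 3.0/3.1, where DatabaseConfig lives in slick.backend and profiles in slick.driver; in Slick 3.2+ these moved to slick.basic and slick.jdbc):

// src/main/resources/application.conf (production):
//   appdb {
//     driver = "slick.driver.PostgresDriver$"
//     db { url = "jdbc:postgresql://localhost/app", driver = org.postgresql.Driver }
//   }
// src/test/resources/application.conf (unit tests) overrides the same key with H2:
//   appdb {
//     driver = "slick.driver.H2Driver$"
//     db { url = "jdbc:h2:mem:test", driver = org.h2.Driver }
//   }
import slick.backend.DatabaseConfig
import slick.driver.JdbcProfile

class UserDao(val dbConfig: DatabaseConfig[JdbcProfile]) {
  import dbConfig.driver.api._  // profile-agnostic: the same code runs on H2 and PostgreSQL

  class Users(tag: Tag) extends Table[(Long, String)](tag, "users") {
    def id = column[Long]("id", O.PrimaryKey)
    def name = column[String]("name")
    def * = (id, name)
  }
  val users = TableQuery[Users]

  def findAll() = dbConfig.db.run(users.result)
}

// Production code and unit tests construct the DAO the same way; only the
// configuration file on the classpath decides which database is used.
val dao = new UserDao(DatabaseConfig.forConfig[JdbcProfile]("appdb"))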

compute nvl(max(c.EmployeeId),0) in slick using scala play?

I am using Slick in the Play framework. I need to know how to write the following line in Slick:
compute nvl(max(s.version),0)
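One possible translation, as a minimal sketch with a hypothetical table mapping: .max on a column query yields an Option column (MAX over an empty table is NULL), and .getOrElse(0) is rendered as COALESCE(MAX(version), 0), the portable equivalent of NVL(MAX(version), 0):

import slick.driver.PostgresDriver.api._

// Hypothetical mapping; only the "version" column matters for the aggregate
class Suppliers(tag: Tag) extends Table[(Long, Int)](tag, "suppliers") {
  def id = column[Long]("id", O.PrimaryKey)
  def version = column[Int]("version")
  def * = (id, version)
}
val suppliers = TableQuery[Suppliers]

// .max is Rep[Option[Int]]; .getOrElse(0) fills in the default on the database side
val maxVersionAction: DBIO[Int] = suppliers.map(_.version).max.getOrElse(0).result
// db.run(maxVersionAction) then yields a Future[Int]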

Spark SQL data source: how to use a type erasure hack to pass RDD[InternalRow] back as RDD[Row]

I am creating a Spark SQL data source for a corporate internal data source, following the code of the JDBC data source in the Spark SQL source code.
In Spark 1.5, there is this comment in JDBCRelation.scala (the JDBC data source code):
// Rely on a type erasure hack to pass RDD[InternalRow] back as RDD[Row]
even though InternalRow and Row do not share a common base class.
How does this work? Thanks.
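For illustration, here is a minimal sketch of the idea (hypothetical function name, not the actual Spark source). Generic type parameters are erased on the JVM, so an RDD[InternalRow] and an RDD[Row] have the same runtime class and the cast never throws; and because the JDBC relation reports needConversion = false, Spark's planner knows the returned rows are really InternalRows and never touches them as Row:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.InternalRow

// Only the erased runtime type RDD is ever checked, so this cast always succeeds;
// the element type exists purely at compile time
def passBackAsRows(internalRows: RDD[InternalRow]): RDD[Row] =
  internalRows.asInstanceOf[RDD[Row]]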