I am new to PySpark. I have a requirement to create a Tobit model in PySpark.
I am looking for a source code sample of a Tobit model in PySpark. It would be great if there were a built-in package in PySpark to create a Tobit model. Kindly assist.
Related
Our algorithm engineer is developing a machine learning model using PySpark and MLflow. He's trying to save the model using the mlflow.spark API, and the model format is the native Spark MLlib format. Could the model be loaded from Spark Scala code? It seems that MLflow is quite restricted for cross-language usage.
The MLflow Java/Scala client does not have feature parity with MLflow Python, as it is missing the concepts of Projects and Models. However, you can read a PySpark-generated Spark ML model from Scala using the downloadArtifacts method:
https://mlflow.org/docs/latest/java_api/org/mlflow/tracking/MlflowClient.html#downloadArtifacts-java.lang.String-java.lang.String-
%python
import mlflow.spark
# Log the fitted Spark ML model in the native Spark MLlib format
mlflow.spark.log_model(model, "spark-model")
%scala
import org.apache.spark.ml.PipelineModel
import org.mlflow.tracking.MlflowClient
val client = new MlflowClient()
// Download the logged artifacts and load them back as a Spark ML PipelineModel
val modelPath = client.downloadArtifacts(runId, "spark-model/sparkml").getAbsolutePath
val model = PipelineModel.load(modelPath)
val predictions = model.transform(data)
Is it possible to generate random Avro data for a specified schema using the org.apache.avro library?
I need to produce this data with Kafka.
I tried to find some kind of random data generator for tests; however, I have only stumbled upon tools for such data generation or GenericRecord usage. The tools are not very suitable for me as they have a specific file dependency (reading the schema from a file and so on), and GenericRecord instances have to be generated one by one, as I understand it.
Are there any other solutions for Java/Scala?
UPDATE: I have found this class, but it does not seem to be accessible from org.apache.avro version 1.8.2.
The reason you need to read a file is that it supplies a Schema, which defines the fields that need to be created and their types.
That is not a hard requirement, though; nothing prevents you from creating random Generic or Specific Records against a schema built in code via Avro's SchemaBuilder class.
See this repo for an example; it uses a POJO generated from an AVSC schema (which, again, could instead be built with SchemaBuilder) compiled into a Java class.
Even the class you linked to uses a schema file.
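To make that concrete, here is a minimal sketch in Scala against Avro's Java API: it builds a schema with SchemaBuilder and fills GenericRecords with random values. The record name and fields are invented for illustration.
import scala.util.Random
import org.apache.avro.SchemaBuilder
import org.apache.avro.generic.{GenericData, GenericRecord}

object RandomAvroData {
  // Build the schema in code instead of reading an .avsc file
  val schema = SchemaBuilder
    .record("User").namespace("com.example")
    .fields()
    .requiredString("name")
    .requiredInt("age")
    .endRecord()

  // Produce one random record matching the schema
  def randomRecord(rnd: Random): GenericRecord = {
    val record = new GenericData.Record(schema)
    record.put("name", rnd.alphanumeric.take(8).mkString)
    record.put("age", rnd.nextInt(100))
    record
  }

  def main(args: Array[String]): Unit = {
    val rnd = new Random()
    (1 to 5).map(_ => randomRecord(rnd)).foreach(println)
  }
}
Records produced this way can then be serialized and sent to Kafka just like records read from a file.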
So I personally would probably use Avro4s (https://github.com/sksamuel/avro4s) in conjunction with ScalaCheck's (https://www.scalacheck.org) Gen to model such tests.
You could use ScalaCheck to generate random instances of case classes and Avro4s to convert them to generic records, extract their schema, etc.
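As a rough sketch of that combination (the User case class is invented, and the exact Avro4s entry points vary a little between library versions):
import org.scalacheck.Gen
import com.sksamuel.avro4s.{AvroSchema, RecordFormat}

// Stand-in case class for whatever your record looks like
case class User(name: String, age: Int)

object RandomUsers {
  // ScalaCheck generator for random User instances
  val userGen: Gen[User] = for {
    name <- Gen.alphaStr.map(_.take(8))
    age  <- Gen.choose(0, 99)
  } yield User(name, age)

  def main(args: Array[String]): Unit = {
    val schema = AvroSchema[User]    // derive the Avro schema from the case class
    val format = RecordFormat[User]  // converts instances to GenericRecords
    val records = Gen.listOfN(5, userGen).sample.getOrElse(Nil).map(format.to)
    records.foreach(println)
    println(schema)
  }
}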
There's also avro-mocker (https://github.com/speedment/avro-mocker), though I don't know how easy it is to hook into the code.
I'd just use Podam (http://mtedone.github.io/podam/) to generate POJOs and then output them to Avro using the Java Avro library: https://avro.apache.org/docs/1.8.1/gettingstartedjava.html#Serializing
I have a CSV file, but I have to assign data types without knowing its data types or schema.
I mean, everything should be done via Scala programming. Is that possible?
I don't know how to write code in Scala; does anyone know how?
I am new to Scala and only have a basic idea of it. I know how to read a CSV file, but I don't have its schema or data types. Is it possible?
Could you give me an answer along these lines:
1) loading the CSV file,
2) how to analyse it using Spark SQL,
3) how to set the data types automatically in Scala?
There is schema inference, which you can enable in code as an option when reading the file. Do you need a code example? I can provide Scala code.
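For instance, a minimal sketch (assuming Spark 2.x or later; the file path and view name are placeholders):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-infer").getOrCreate()

// 1) Load the CSV file and let Spark infer each column's type from the data
val df = spark.read
  .option("header", "true")       // first line holds the column names
  .option("inferSchema", "true")  // sample the data to guess the data types
  .csv("/path/to/file.csv")

// 3) The data types were set automatically; inspect them
df.printSchema()

// 2) Analyse it with Spark SQL
df.createOrReplaceTempView("my_table")
spark.sql("SELECT COUNT(*) AS row_count FROM my_table").show()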
I am planning to use JPA entities to create my data model. As part of the design, we have been asked to document the data model and keep it updated. I was asked to use PlantUML to create the database model, but I would like to see how to get started on converting JPA entity --> PlantUML macro code --> data model. Seeking help on how I can tackle this request.
You can build your model in Structurizr and then export it to PlantUML.
https://github.com/structurizr/java-extensions/blob/master/docs/plantuml.md
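A rough sketch of that flow in Scala against the Structurizr Java API (the element names are invented, and the PlantUML writer class name and package may differ between versions of the java-extensions library):
import java.io.StringWriter
import com.structurizr.Workspace
import com.structurizr.io.plantuml.PlantUMLWriter

object ExportToPlantUml {
  def main(args: Array[String]): Unit = {
    // Describe the data model in code (here written by hand from the JPA entities)
    val workspace = new Workspace("Data model", "Documentation of the JPA data model")
    val model = workspace.getModel
    val system = model.addSoftwareSystem("Order Service", "Owns the order entities")

    val views = workspace.getViews
    val contextView = views.createSystemContextView(system, "context", "System context")
    contextView.addAllElements()

    // Export the workspace as PlantUML source that can be committed with the docs
    val writer = new PlantUMLWriter()
    val out = new StringWriter()
    writer.write(workspace, out)
    println(out.toString)
  }
}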
I am creating a Spark SQL data source for a corporate internal data source, following the code of the JDBC data source in the Spark SQL source code.
In Spark 1.5, there is this comment in JDBCRelation.scala (the JDBC data source code):
// Rely on a type erasure hack to pass RDD[InternalRow] back as RDD[Row]
where InternalRow and Row do not share a common base class.
How does this work? Thanks.
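For context, the pattern that comment sits above looks roughly like the sketch below (a simplified illustration, not the actual JDBCRelation code). Because JVM generic type parameters are erased at runtime, an RDD[InternalRow] can be cast to RDD[Row] with asInstanceOf even though the element types are unrelated; as far as I recall, the relation also reports needConversion = false, so Spark's execution path keeps treating the elements as InternalRow and the mismatch is never observed.
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.InternalRow

// Simplified sketch: the scan really produces InternalRow objects,
// but the PrunedFilteredScan contract declares RDD[Row].
def buildScan(internalRows: RDD[InternalRow]): RDD[Row] = {
  // Type parameters are erased at runtime, so this cast always succeeds; it only
  // changes the static element type. Spark never actually uses the elements as
  // external Rows here, so the "lie" is harmless.
  internalRows.asInstanceOf[RDD[Row]]
}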