I have a model that will be used in Java. I would like to reduce development time by exporting functions written in R to PMML.
As an experiment I tried function_to_pmml, which creates an incomplete PMML fragment lacking headers.
A normal PMML file includes headers and the surrounding top-level document structure.
The Java developers cannot use the output from function_to_pmml("1 + 3/5 - (4 * 2)").
How do I get a complete PMML document?
I was thinking I might be able to do something like add_attributes, but I haven't found a working example of this.
You can export R expressions (in formula form) to PMML using the R2PMML package.
I have a set of .xml documents that I want to parse.
I have previously tried to parse them using methods that take the file contents and dump them into a single cell, but I've noticed this doesn't work in practice: I'm seeing slower and slower run times, often with a single task taking tens of hours to run.
My first transform takes the .xml contents and puts them into a single cell, and a second transform takes this string and uses Python's xml library to parse the string into a document. From this document I'm then able to extract properties and return a DataFrame.
I'm using a UDF to conduct the process of mapping the string contents to the fields I want.
How can I make this faster / work better with large .xml files?
For this problem, we're going to combine a couple of different techniques to make this code both testable and highly scalable.
Theory
When parsing raw files, you have a couple of options you can consider:
❌ You can write your own parser to read bytes from files and convert them into data Spark can understand.
This is discouraged whenever possible due to the engineering time required and the unscalable architecture. It doesn't take advantage of distributed compute, since you must bring the entire raw file to your parsing method before you can use it. This is not an effective use of your resources.
⚠ You can use a parser library not made for Spark, such as the Python xml library mentioned in the question
While this is less difficult than writing your own parser, it still does not take advantage of distributed computation in Spark. It is easier to get something running, but it will eventually hit a performance limit because it does not use the low-level Spark functionality that is only exposed when writing a Spark library.
✅ You can use a Spark-native raw file parser
This is the preferred option in all cases as it takes advantage of low-level Spark functionality and doesn't require you to write your own code. If a low-level Spark parser exists, you should use it.
In our case, we can use the Databricks parser to great effect.
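For illustration, here is a minimal sketch of the core read call; the rowTag value and the file path are placeholders, it assumes the spark-xml .jar configured in the next section is on the classpath, and the full transform appears later in this answer:

from pyspark.sql import SparkSession

# Sketch only: "tag" and the path are placeholder values for illustration.
# Assumes the com.databricks:spark-xml package is available on the classpath.
spark_session = SparkSession.builder.getOrCreate()
parsed_df = spark_session.read.format("xml").options(rowTag="tag").load("/path/to/file.xml")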
In general, you should also avoid using the .udf method, as it is often used in place of functionality already available in the Spark API. UDFs are not as performant as native methods and should be used only when no other option is available.
A good example of UDFs covering up hidden problems is string manipulation of column contents; while you technically can use a UDF to do things like splitting and trimming strings, these operations already exist in the Spark API and will be orders of magnitude faster than your own code, as sketched below.
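For instance, a minimal sketch of the native approach (the DataFrame and column contents here are hypothetical, not taken from the question):

from pyspark.sql import SparkSession, functions as F

# Hypothetical data purely for illustration.
spark = SparkSession.builder.getOrCreate()
input_df = spark.createDataFrame([("  a,b,c  ",)], ["raw_col"])

# Trim and split with native Spark functions instead of a Python UDF.
cleaned_df = (
    input_df
    .withColumn("trimmed", F.trim("raw_col"))
    .withColumn("parts", F.split(F.trim("raw_col"), ","))
)
cleaned_df.show()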
Design
Our design is going to use the following:
Low-level Spark-optimized file parsing done via the Databricks XML Parser
Test-driven raw file parsing as explained here
Wire the Parser
First, we need to add the .jar to the spark_session available inside Transforms. Thanks to recent improvements, this argument, when configured, allows you to use the .jar both in Preview/Test and at full build time; previously this would have required a full build, but not anymore.
We need to go to our transforms-python/build.gradle file and add 2 blocks of config:
Enable the pytest plugin
Enable the condaJars argument and declare the .jar dependency
My /transforms-python/build.gradle now looks like the following:
buildscript {
    repositories {
        // some other things
    }

    dependencies {
        classpath "com.palantir.transforms.python:lang-python-gradle-plugin:${transformsLangPythonPluginVersion}"
    }
}

apply plugin: 'com.palantir.transforms.lang.python'
apply plugin: 'com.palantir.transforms.lang.python-defaults'

dependencies {
    condaJars "com.databricks:spark-xml_2.13:0.14.0"
}

// Apply the testing plugin
apply plugin: 'com.palantir.transforms.lang.pytest-defaults'

// ... some other awesome features you should enable
After applying this config, you'll want to restart your Code Assist session by clicking on the bottom ribbon and hitting Refresh.
After refreshing Code Assist, we now have the low-level functionality needed to parse our .xml files; now we need to test it!
Testing the Parser
If we adopt the same style of test-driven development as here, we end up with /transforms-python/src/myproject/datasets/xml_parse_transform.py with the following contents:
from transforms.api import transform, Output, Input
from transforms.verbs.dataframes import union_many


def read_files(spark_session, paths):
    # Parse each raw file with the spark-xml reader and union the results.
    parsed_dfs = []
    for file_name in paths:
        parsed_df = spark_session.read.format('xml').options(rowTag="tag").load(file_name)
        parsed_dfs += [parsed_df]
    output_df = union_many(*parsed_dfs, how="wide")
    return output_df


@transform(
    the_output=Output("my.awesome.output"),
    the_input=Input("my.awesome.input"),
)
def my_compute_function(the_input, the_output, ctx):
    session = ctx.spark_session
    input_filesystem = the_input.filesystem()
    hadoop_path = input_filesystem.hadoop_path
    # Build the full list of raw file paths found in the input dataset.
    files = [hadoop_path + "/" + file_name.path for file_name in input_filesystem.ls()]
    output_df = read_files(session, files)
    the_output.write_dataframe(output_df)
... an example file /transforms-python/test/myproject/datasets/sample.xml with contents:
<tag>
    <field1>
        my_value
    </field1>
</tag>
And a test file /transforms-python/test/myproject/datasets/test_xml_parse_transform.py:
from myproject.datasets import xml_parse_transform
from pkg_resources import resource_filename


def test_parse_xml(spark_session):
    file_path = resource_filename(__name__, "sample.xml")
    parsed_df = xml_parse_transform.read_files(spark_session, [file_path])
    assert parsed_df.count() == 1
    assert set(parsed_df.columns) == {"field1"}
We now have:
A distributed-compute, low-level .xml parser that is highly scalable
A test-driven setup that we can quickly iterate on to get our exact functionality right
Cheers
I would like to paste a data frame from the R environment into the LaTeX part (question or solution part) when creating exercises in r-exams. Later the exercises will be imported into Moodle. Is that possible in r-exams? We saw it is possible when the object is a matrix via $\Sexpr{toLatex(matrix_obj)}$, but a similar approach does not seem to work with data frames. Thank you!
A data.frame would usually be included as a {tabular} in LaTeX, and there are various packages for automatic conversion, such as xtable, or the function kable() in knitr. For PDF output this works nicely, including any vertical and/or horizontal lines in the table. However, for HTML-based output (as for Moodle) the table itself is converted correctly but without any lines.
An overview of a couple of solutions is available in:
Different copies of question with table for Moodle with R-Exams
Moreover, Kenji Sato has proposed to inject some dedicated CSS code to handle the table formatting in HTML. We are currently working on some automated way of including this in R/exams:
https://www.kenjisato.jp/en/post/2020/07/moodle-bordered-table/
I'm developing a Modelica library and need to produce a document with source code listings. I'd like to be able to include the source of the Modelica models without annotations.
I could manually edit them out, but I'm looking for a more automated strategy. I'm guessing the most convenient and straightforward approach is to use some tool to save .mo files with no annotations and include those in my document (I'm using \lstinputlisting in LaTeX).
Is it possible to do this? I have access to Dymola, OpenModelica and JModelica. Dymola is obviously capable of producing such a listing, as it's able to include it in the automatically generated documentation (File > Export > HTML...). I've been looking into scripting with Dymola and OpenModelica, but haven't found a way to do this either.
JModelica seems like it could be a good option, but I don't have experience working with Python. If this is possible and someone gives me some pointers, I'm willing to look into it myself. I found a mention of a prettyprint function that might do the job, but I'm not sure where to start; I can't even find a reference to that function in the latest documentation.
It would also be more convenient for me to find a way of doing it with Dymola/OpenModelica (whether through the UI or by using a script). Have I missed something?
I think you could use saveTotalModel("total.mo", MyModelName) in OpenModelica. This will strip most annotations (though not the ones used for code generation, if I remember correctly) and pretty-print the source code, including all dependencies. Then you just copy-paste the models/packages that you want to include in the listing. Or, if you prefer, you can do something like the following to include only the code for a particular model:
loadModel(Modelica);
loadFile("MyModel.mo");
saveTotalModel("total.mo", MyModel.A.B);
clear();
loadFile("total.mo");
str := list(MyModel.A.B);
writeFile("MyModel.A.B.listing", str);
The graph options with Zeppelin are pretty basic, so I am looking for an example of how to do something simple, like a bar chart, with d3.js. From what I can tell, that would be the best graphing library to use to create stunning graphs.
Anyway, my question is how to pass data to the JavaScript code. With regular Zeppelin charts you write Scala or other code and then save that in a DataFrame. Then on the next line you use the %sql option, write a SQL command, and buttons appear to let you graph the data.
But what I have found looking on the internet gives no indication of how data created in the Scala code section can be passed to the Angular section where you put the d3.js code.
Some examples I found, like this one, put all the HTML and JavaScript in one giant print statement in the Scala code: https://rawkintrevo.org/2016/09/20/gelly-on-apache-flink/
And then there is an example like this one, Using d3.js with Apache Zeppelin, where the Zeppelin line is all JavaScript, but the data is just a locally created array.
So I need (1) an example and (2) some understanding of how RDDs and DataFrames can be passed into the JavaScript code, which of course is on a different line than the Scala code. How do you bring objects in the Scala section of the notebook into scope for the JavaScript section?
You can refer to the Zeppelin docs for a good getting-started guide to creating a custom visualization. Also, you might want to check out the code of some of the built-in visualizations.
Regarding how data from DataFrames is passed to JS: I'm pretty sure z.show or %sql triggers dataFrame.take(${zeppelin.spark.maxResult}), which collects the RDD[T] as a Seq[T] on the driver, whose elements are then used to render the graphs.
Alternatively, if you have a JavaScript graph defined in another paragraph, you can also use z.angularBind("values", rdd.take(maxResult)) to send the data to the Angular view. There's a really nice answer here on the subject which might help.
Hope you find this helpful.
I am working on a library of mathematical functions. As part of the Scaladoc, I would like to include the formula of each function. E.g.
/**
* Sum squared function:
* \(f(x) = \sum_i^n x_i^2\)
*/
def sumSquared[T](x: Seq[T]) = x.map(xi => xi * xi).sum
I am using MathJax to display the formula. It works if I manually edit the generated html to include the required MathJax javascript, but I want to automate this.
So far the only solutions I've found are:
Is there a way to include math formulae in Scaladoc?
How to run bash script after generating scaladoc using doc task?
If these are the only options then okay; however, I'd like to do this using only sbt (no external scripts). Is there a way to do this by maybe setting scalacOptions as in How to ScalaDoc?
I added an answer to Is there a way to include math formulae in Scaladoc? with an sbt task that does the job, but still using MathJax.
I don't know of any code that would actually replace the LaTeX formulas with images in the generated API files.