Programmatic configuration for Great Expectations - PySpark

I'm looking into integrating a validation framework into an existing PySpark project. The official documentation has plenty of examples of how to configure Great Expectations using JSON/YAML files. However, in my case the table schemas are defined as Python classes, and I'm aiming to keep the validation definitions in these classes.
While playing around, I noticed that this kind of pattern can be used to validate single expectations without any config files:
from pyspark.sql import Row, SparkSession
from great_expectations.dataset import SparkDFDataset
from great_expectations.core.expectation_validation_result import ExpectationValidationResult

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([
    Row(x=1, y="foo"),
    Row(x=2, y=None),
])
ds = SparkDFDataset(df)
expectation: ExpectationValidationResult = ds.expect_column_values_to_not_be_null("y")
print(expectation.success)
where expectation.success is either True or False. However, I'm aiming to build expectation suites and generate reports using programmatic configuration, but I can't find any references on how to do it. This is what I tried to hack together, but it leads to a runtime exception:
ds.append_expectation(ExpectationConfiguration(
    expectation_type="expect_column_values_to_not_be_null",
    kwargs={'column': 'y', 'result_format': 'BASIC'},
))
engine = SparkDFExecutionEngine(
    force_reuse_spark_context=True,
)
validator = Validator(
    execution_engine=engine,
    expectation_suite=ds.get_expectation_suite(),
)
res = validator.validate()
Any pointers on how to configure Great Expectations without config files (or minimal files) are highly appreciated!

Batches are required for validating expectation suites. Here is a working example:
import uuid

from pyspark.sql import Row, SparkSession
from great_expectations.core.batch import Batch, BatchDefinition
from great_expectations.core.expectation_configuration import ExpectationConfiguration
from great_expectations.core.expectation_suite import ExpectationSuite
from great_expectations.core.id_dict import IDDict
from great_expectations.execution_engine import SparkDFExecutionEngine
from great_expectations.validator.validator import Validator

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([
    Row(x=1, y="foo"),
    Row(x=2, y=None),
])
engine = SparkDFExecutionEngine(
    force_reuse_spark_context=True,
)
validator = Validator(
    execution_engine=engine,
    expectation_suite=ExpectationSuite(
        expectation_suite_name="my_suite",
        expectations=[
            ExpectationConfiguration(
                expectation_type="expect_column_values_to_not_be_null",
                kwargs={"column": "y", "result_format": "BASIC"},
            ),
            ExpectationConfiguration(
                expectation_type="expect_column_values_to_not_be_null",
                kwargs={"column": "x", "result_format": "BASIC"},
            ),
        ],
    ),
    batches=[
        Batch(
            data=df,
            batch_definition=BatchDefinition(
                datasource_name="foo",
                data_connector_name="foo",
                data_asset_name="foo",
                batch_identifiers=IDDict(ge_batch_id=str(uuid.uuid1())),
            ),
        ),
    ],
)
res = validator.validate()
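From there you can inspect the returned object to build a simple report. This is a minimal sketch, assuming the usual attributes of the validation result returned by validator.validate() (success, results, expectation_config), so treat the attribute names as my assumption rather than gospel:
# Rough sketch: inspect the suite-level outcome and each individual expectation.
print(res.success)  # True only if every expectation in the suite passed
for r in res.results:
    cfg = r.expectation_config
    print(cfg.expectation_type, cfg.kwargs.get("column"), r.success)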


Is there a standard way to determine what features are available for a given crate?
I'm trying to read Postgres timezones, and this says to use the postgres = "0.17.0-alpha.1" crate's with-time or with-chrono features.
When I try this in my Cargo.toml:
[dependencies]
postgres = { version = "0.17.0-alpha.1", features = ["with-time"] }
I get this error:
error: failed to select a version for `postgres`.
... required by package `mypackage v0.1.0 (/Users/me/repos/mypackage)`
versions that meet the requirements `^0.17.0-alpha.1` are: 0.17.0, 0.17.0-alpha.2, 0.17.0-alpha.1
the package `mypackage` depends on `postgres`, with features: `with-time` but `postgres` does not have these features.
Furthermore, the crate page for postgres 0.17.0 says nothing about these features, so I don't even know if they're supposed to be supported or not.
It seems like there would be something on docs.rs about it?
Crates uploaded to crates.io (and thus to docs.rs) will show what feature flags exist. crates.io issue #465 suggests placing the feature list on the crate's page as well.
Beyond that, the only guaranteed way to see what features are available is to look at the Cargo.toml for the crate. This generally means that you need to navigate to the project's repository, find the correct file for the version you are interested in, and read it.
You are primarily looking for the [features] section, but also for any dependencies that are marked as optional = true, as optional dependencies count as implicit feature flags.
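For illustration, the pattern to look for in a Cargo.toml is something like this (a made-up example, not taken from the postgres crate):
[dependencies]
# Hypothetical optional dependency: because it is optional, Cargo creates an
# implicit feature flag named `chrono` that enables it when requested.
chrono = { version = "0.4", optional = true }

[features]
# Explicit feature that turns on the optional dependency above.
time-support = ["chrono"]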
Unfortunately, there's no requirement that the crate author document what each feature flag does. Good crates will document their feature flags in one or more locations:
As comments in Cargo.toml
Their README
Their documentation
See also:
How do you enable a Rust "crate feature"?
For the postgres crate, we can start at crates.io, then click "repository" to go to the repository. We then find the right tag (postgres-v0.17.0), then read the Cargo.toml:
[features]
with-bit-vec-0_6 = ["tokio-postgres/with-bit-vec-0_6"]
with-chrono-0_4 = ["tokio-postgres/with-chrono-0_4"]
with-eui48-0_4 = ["tokio-postgres/with-eui48-0_4"]
with-geo-types-0_4 = ["tokio-postgres/with-geo-types-0_4"]
with-serde_json-1 = ["tokio-postgres/with-serde_json-1"]
with-uuid-0_8 = ["tokio-postgres/with-uuid-0_8"]
You can use the cargo-feature crate to view and manage your dependencies' features:
> cargo install cargo-feature --locked
> cargo feature postgres
Avaliable features for `postgres`
array-impls = ["tokio-postgres/array-impls"]
with-bit-vec-0_6 = ["tokio-postgres/with-bit-vec-0_6"]
with-chrono-0_4 = ["tokio-postgres/with-chrono-0_4"]
with-eui48-0_4 = ["tokio-postgres/with-eui48-0_4"]
with-eui48-1 = ["tokio-postgres/with-eui48-1"]
with-geo-types-0_6 = ["tokio-postgres/with-geo-types-0_6"]
with-geo-types-0_7 = ["tokio-postgres/with-geo-types-0_7"]
with-serde_json-1 = ["tokio-postgres/with-serde_json-1"]
with-smol_str-01 = ["tokio-postgres/with-smol_str-01"]
with-time-0_2 = ["tokio-postgres/with-time-0_2"]
with-time-0_3 = ["tokio-postgres/with-time-0_3"]
with-uuid-0_8 = ["tokio-postgres/with-uuid-0_8"]
with-uuid-1 = ["tokio-postgres/with-uuid-1"]
Note: this only works if the crate is already in Cargo.toml as a dependency.
The list of features is also displayed when using cargo add:
> cargo add postgres
Updating crates.io index
Adding postgres v0.19.4 to dependencies.
Features:
- array-impls
- with-bit-vec-0_6
- with-chrono-0_4
- with-eui48-0_4
- with-eui48-1
- with-geo-types-0_6
- with-geo-types-0_7
- with-serde_json-1
- with-smol_str-01
- with-time-0_2
- with-time-0_3
- with-uuid-0_8
- with-uuid-1
It seems like there would be something on docs.rs about it?
Other people thought so too! This was added to docs.rs in late 2020. You can view the list of features from the crate's documentation by navigating to "Feature flags" in the top bar.
You still won't see the list of features for postgres 0.17.0, though, since old documentation isn't regenerated when new functionality is added to the site, but any recently published version will have it available.
Note: this is only available on docs.rs and not generated when running cargo doc.
The feature-set of a dependency can be retrieved programmatically via the cargo-metadata crate:
use cargo_metadata::MetadataCommand;

fn main() {
    let metadata = MetadataCommand::new()
        .exec()
        .expect("failed to get metadata");

    let dependency = metadata
        .packages
        .iter()
        .find(|package| package.name == "postgres")
        .expect("failed to find dependency");

    println!("{:#?}", dependency.features);
}
This prints something like:
{
    "with-time-0_3": [
        "tokio-postgres/with-time-0_3",
    ],
    "with-smol_str-01": [
        "tokio-postgres/with-smol_str-01",
    ],
    "with-chrono-0_4": [
        "tokio-postgres/with-chrono-0_4",
    ],
    "with-uuid-0_8": [
        "tokio-postgres/with-uuid-0_8",
    ],
    "with-uuid-1": [
        "tokio-postgres/with-uuid-1",
    ],
    "array-impls": [
        "tokio-postgres/array-impls",
    ],
    "with-eui48-1": [
        "tokio-postgres/with-eui48-1",
    ],
    "with-bit-vec-0_6": [
        "tokio-postgres/with-bit-vec-0_6",
    ],
    "with-geo-types-0_6": [
        "tokio-postgres/with-geo-types-0_6",
    ],
    "with-geo-types-0_7": [
        "tokio-postgres/with-geo-types-0_7",
    ],
    "with-eui48-0_4": [
        "tokio-postgres/with-eui48-0_4",
    ],
    "with-serde_json-1": [
        "tokio-postgres/with-serde_json-1",
    ],
    "with-time-0_2": [
        "tokio-postgres/with-time-0_2",
    ],
}
This is how other tools like cargo add and cargo feature provide dependency and feature information.

OpenAPI 3 MicroProfile, using @SecurityRequirementsSet in @OpenAPIDefinition

I'm working on switching from OpenAPI v2 to OpenAPI 3; we'll be using MicroProfile 2.0, with Projects as the plugin that will generate the OAS from the code.
Now, we have different headers that are needed as a kind of authorization. I know I can put them on each resource, but that seems like a lot of repetition, so I thought it was a good idea to put them in the JaxRSConfiguration file as @SecurityScheme definitions and add them as security to the @OpenAPIDefinition.
@OpenAPIDefinition(
    info = @Info(
        ...
    ),
    security = {
        @SecurityRequirement(name = Header1),
        @SecurityRequirement(name = Header2)
    },
    components = @Components(
        securitySchemes = {
            @SecurityScheme( apiKeyName = Header1 ... ),
            @SecurityScheme( apiKeyName = Header2 ... )
        }
    ),
    ...
)
This is working; however, it generates an OR, while it should be an AND (i.e. without the -):
security:
- Header1: []
- Header2: []
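What I'd like it to generate instead is both headers combined into a single requirement object, which OpenAPI treats as an AND, roughly:
security:
- Header1: []
  Header2: []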
I thought I could use @SecurityRequirementsSet inside security in the @OpenAPIDefinition, but unfortunately this is not allowed. I know I can use it on each call, or on top of the class containing the different calls, but as this is still a sort of repetition, I would prefer to have it as a general security requirement.
Does anybody know how I can achieve this?

Adding a Configuration at runtime

I'd like to add a configuration within a plugin at runtime and probably also register it in projectConfigurations. Is there a way to manipulate the state like that? I assume this needs to be done in a command, and the solution is buried somewhere deep inside the AttributeMap, but I'm not sure how to approach that beast.
def addConfiguration = Command.single( "add-configuration" )( ( state, configuration ) => {
  val extracted = Project extract state
  val c = configurations.apply( configuration )
  // Add c to the state ...
  ???
} )
Background
I'm working on an sbt-slick-codegen plugin, which has multi-database support as an essential feature. Now, to configure databases in sbt, I took a very simple approach: everything is stuffed into a Map.
databases in SlickCodegen ++= Map(
  "default" -> Database(
    Authentication(
      url = "jdbc:postgresql://localhost:5432/db",
      username = Some( "db_user_123" ),
      password = Some( "p#assw0rd" )
    ),
    Driver(
      jdbc = "org.postgresql.Driver",
      slick = slick.driver.PostgresDriver,
      user = Some( "com.example.MySlickProfile" )
    ),
    container = "Tables",
    generator = new SourceCodeGenerator( _ ),
    identifier = Some( "com.example" ),
    excludes = Seq( "play_evolutions" ),
    cache = true
  ),
  "user_db" -> Database( ... )
)
Now, this works, but it defeats the purpose of sbt: it is hard to set up and difficult to maintain and update. I'd instead prefer configuration to work like this:
container in Database := "MyDefaultName",
url in Database( "backup" ) := "jdbc:postgresql://localhost:5432/database",
cache in Database( "default" ) := false
This approach is already working on the current development branch, but only for Database( x ) configurations that are known at compile time, which leads to the next problem:
On top of that, I'm planning to add a module with PlayFramework support. The idea is to have a setting SettingKey[Seq[Config]]( "configurations" ) that accepts Typesafe Config instances, reads the Slick configurations from them, and generates appropriate Database( x ) configurations. But for this step, I am out of ideas.

How do I create a Spark RDD from Accumulo 1.6 in spark-notebook?

I have a Vagrant image with Spark Notebook, Spark, Accumulo 1.6, and Hadoop all running. From the notebook, I can manually create a Scanner and pull test data from a table I created using one of the Accumulo examples:
val instanceNameS = "accumulo"
val zooServersS = "localhost:2181"
val instance: Instance = new ZooKeeperInstance(instanceNameS, zooServersS)
val connector: Connector = instance.getConnector( "root", new PasswordToken("password"))
val auths = new Authorizations("exampleVis")
val scanner = connector.createScanner("batchtest1", auths)
scanner.setRange(new Range("row_0000000000", "row_0000000010"))
for (entry: Entry[Key, Value] <- scanner) {
    println(entry.getKey + " is " + entry.getValue)
}
This will give the first ten rows of table data.
When I try to create the RDD thusly:
val rdd2 =
    sparkContext.newAPIHadoopRDD(
        new Configuration(),
        classOf[org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat],
        classOf[org.apache.accumulo.core.data.Key],
        classOf[org.apache.accumulo.core.data.Value]
    )
I get an RDD returned to me that I can't do much with due to the following error:
java.io.IOException: Input info has not been set.
    at org.apache.accumulo.core.client.mapreduce.lib.impl.InputConfigurator.validateOptions(InputConfigurator.java:630)
    at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.validateOptions(AbstractInputFormat.java:343)
    at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.getSplits(AbstractInputFormat.java:538)
    at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:98)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1367)
    at org.apache.spark.rdd.RDD.count(RDD.scala:927)
This totally makes sense in light of the fact that I haven't specified any parameters as to which table to connect with, what the auths are, etc.
So my question is: What do I need to do from here to get those first ten rows of table data into my RDD?
update one
Still doesn't work, but I did discover a few things. Turns out there are two nearly identical packages,
org.apache.accumulo.core.client.mapreduce
&
org.apache.accumulo.core.client.mapred
Both have nearly identical members, except that some of the method signatures are different. I'm not sure why both exist, as there's no deprecation notice that I could see. I attempted to implement Sietse's answer with no joy. Below is what I did, and the responses:
import org.apache.hadoop.mapred.JobConf
import org.apache.hadoop.conf.Configuration
val jobConf = new JobConf(new Configuration)

import org.apache.hadoop.mapred.JobConf
import org.apache.hadoop.conf.Configuration
jobConf: org.apache.hadoop.mapred.JobConf = Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml

Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml
AbstractInputFormat.setConnectorInfo(jobConf,
"root",
new PasswordToken("password")
AbstractInputFormat.setScanAuthorizations(jobConf, auths)
AbstractInputFormat.setZooKeeperInstance(jobConf, new ClientConfiguration)
val rdd2 =
    sparkContext.hadoopRDD(
        jobConf,
        classOf[org.apache.accumulo.core.client.mapred.AccumuloInputFormat],
        classOf[org.apache.accumulo.core.data.Key],
        classOf[org.apache.accumulo.core.data.Value],
        1
    )
rdd2: org.apache.spark.rdd.RDD[(org.apache.accumulo.core.data.Key, org.apache.accumulo.core.data.Value)] = HadoopRDD[1] at hadoopRDD at :62
rdd2.first
java.io.IOException: Input info has not been set.
    at org.apache.accumulo.core.client.mapreduce.lib.impl.InputConfigurator.validateOptions(InputConfigurator.java:630)
    at org.apache.accumulo.core.client.mapred.AbstractInputFormat.validateOptions(AbstractInputFormat.java:308)
    at org.apache.accumulo.core.client.mapred.AbstractInputFormat.getSplits(AbstractInputFormat.java:505)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
    at org.apache.spark.rdd.RDD.take(RDD.scala:1077)
    at org.apache.spark.rdd.RDD.first(RDD.scala:1110)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:64)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:69)
    at ...
edit 2
re: Holden's answer - still no joy:
AbstractInputFormat.setConnectorInfo(jobConf,
"root",
new PasswordToken("password")
AbstractInputFormat.setScanAuthorizations(jobConf, auths)
AbstractInputFormat.setZooKeeperInstance(jobConf, new ClientConfiguration)
InputFormatBase.setInputTableName(jobConf, "batchtest1")
val rddX = sparkContext.newAPIHadoopRDD(
    jobConf,
    classOf[org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat],
    classOf[org.apache.accumulo.core.data.Key],
    classOf[org.apache.accumulo.core.data.Value]
)
rddX: org.apache.spark.rdd.RDD[(org.apache.accumulo.core.data.Key, org.apache.accumulo.core.data.Value)] = NewHadoopRDD[0] at newAPIHadoopRDD at :58
Out[15]: NewHadoopRDD[0] at newAPIHadoopRDD at :58
rddX.first
java.io.IOException: Input info has not been set.
    at org.apache.accumulo.core.client.mapreduce.lib.impl.InputConfigurator.validateOptions(InputConfigurator.java:630)
    at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.validateOptions(AbstractInputFormat.java:343)
    at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.getSplits(AbstractInputFormat.java:538)
    at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:98)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
    at org.apache.spark.rdd.RDD.take(RDD.scala:1077)
    at org.apache.spark.rdd.RDD.first(RDD.scala:1110)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:61)
    at ...
edit 3 -- progress!
I was able to figure out why the 'input INFO not set' error was occurring. The eagle-eyed among you will no doubt see that the following code is missing a closing ')':
AbstractInputFormat.setConnectorInfo(jobConf, "root", new PasswordToken("password")
As I'm doing this in spark-notebook, I'd been clicking the execute button and moving on because I wasn't seeing an error. What I forgot was that the notebook is going to do what spark-shell does when you leave off a closing ')': it will wait forever for you to add it. So the error was the result of the setConnectorInfo method never getting executed.
Unfortunately, I'm still unable to shove the Accumulo table data into an RDD that's usable to me. When I execute
rddX.count
I get back
res15: Long = 10000
which is the correct response; there are 10,000 rows of data in the table I pointed to. However, when I try to grab the first element of data thusly:
rddX.first
I get the following error:
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0.0 in stage 0.0 (TID 0) had a not serializable result: org.apache.accumulo.core.data.Key
Any thoughts on where to go from here?
edit 4 -- success!
The accepted answer plus comments got me 90% of the way there, except for the fact that the Accumulo Key/Value need to be converted into something serializable. I got this working by invoking the .toString() method on both. I'll try to post complete working code soon, in case anyone else runs into the same issue.
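For reference, the conversion I describe above looks roughly like this (an untested sketch based on my description, reusing the rddX from earlier):
// Accumulo's Key and Value are not Java-serializable, so convert them to
// plain strings before pulling anything back to the driver.
val rddStr = rddX.map { case (key, value) => (key.toString, value.toString) }
rddStr.first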
Generally with custom Hadoop InputFormats, the information is specified using a JobConf. As @Sietse pointed out, there are some static methods on AccumuloInputFormat that you can use to configure the JobConf. In this case, I think what you would want to do is:
val jobConf = new JobConf()   // Create a job conf

// Configure the job conf with our Accumulo properties
AccumuloInputFormat.setConnectorInfo(jobConf, principal, token)
AccumuloInputFormat.setScanAuthorizations(jobConf, authorizations)
val clientConfig = new ClientConfiguration().withInstance(instanceName).withZkHosts(zooKeepers)
AccumuloInputFormat.setZooKeeperInstance(jobConf, clientConfig)
AccumuloInputFormat.setInputTableName(jobConf, tableName)

// Create an RDD using the jobConf
val rdd2 = sc.newAPIHadoopRDD(jobConf,
    classOf[org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat],
    classOf[org.apache.accumulo.core.data.Key],
    classOf[org.apache.accumulo.core.data.Value]
)
Note: After digging into the code, it seems the "is configured" property is set based in part on the class which is called (which makes sense, to potentially avoid conflicts with other packages), so when we go and get it back in the concrete class later, it fails to find the "is configured" flag. The solution to this is to not use the abstract classes (see https://github.com/apache/accumulo/blob/bf102d0711103e903afa0589500f5796ad51c366/core/src/main/java/org/apache/accumulo/core/client/mapreduce/lib/impl/ConfiguratorBase.java#L127 for the implementation details). If you can't call this method on the concrete implementation with spark-notebook, then using spark-shell or a regularly built application is probably the easiest solution.
It looks like those parameters have to be set through static methods: http://accumulo.apache.org/1.6/apidocs/org/apache/accumulo/core/client/mapred/AccumuloInputFormat.html. So try setting the non-optional parameters and run again. It should work.

ScalaJS to emit EmberJS code?

I want to write Scala code that can then be translated to EmberJS code.
Can it be done? If not out of the box, any suggestions as to how that could be achieved by hacking Scala.js?
Regards.
Scala.js can emit any kind of JavaScript code, so technically the answer is yes. However, since Ember requires you to define "components" using classes of sorts of its own design, it can be a bit ugly to write in Scala.js. For example, this snippet taken from the Ember front page:
App.GravatarImageComponent = Ember.Component.extend({
  size: 200,
  email: '',
  gravatarUrl: function() {
    var email = this.get('email'),
        size = this.get('size');
    return 'http://www.gravatar.com/avatar/' + hex_md5(email) + '?s=' + size;
  }.property('email', 'size')
});
would have to be written in Scala.js as:
import scala.scalajs.js
import js.Dynamic.{global => g, literal => lit}

g.App.GravatarImageComponent = g.Ember.Component.extend(lit(
  size = 200,
  email = "",
  gravatarUrl = ({ (ths: js.Dynamic) =>
    val email = ths.get("email")
    val size = ths.get("size")
    s"http://www.gravatar.com/avatar/${g.hex_md5(email)}?s=$size"
  }: js.ThisFunction).asInstanceOf[js.Dynamic].property("email", "size")
))
which is, well ... JavaScript-ish.
Powerful UI libraries for JavaScript tend to rely so much on the dynamic and weird aspects of JavaScript that they don't fit super well in Scala.js. I am planning to write a React.js-like UI library specifically designed for Scala.js this semester to address this issue.