Great Expectations v3 API in AWS Glue 3.0 - PySpark

I'm trying to run a validation in the pipeline using Great Expectations on AWS Glue 3.0.
Here's my initial attempt to create the data context at runtime, based on their docs:
def create_context():
    logger.info("Create DataContext Config.")
    data_context_config = DataContextConfig(
        config_version=2,
        plugins_directory=None,
        config_variables_file_path=None,
        # concurrency={"enabled": "true"},
        datasources={
            "my_spark_datasource": DatasourceConfig(
                class_name="Datasource",
                execution_engine={
                    "class_name": "SparkDFExecutionEngine",
                    "module_name": "great_expectations.execution_engine",
                },
                data_connectors={
                    "my_spark_dataconnector": {
                        "module_name": "great_expectations.datasource.data_connector",
                        "class_name": "RuntimeDataConnector",
                        "batch_identifiers": [""],
                    }
                },
            )
        },
        stores={
            "expectations_S3_store": {
                "class_name": "ExpectationsStore",
                "store_backend": {
                    "class_name": "TupleS3StoreBackend",
                    "bucket": data_profile_s3_store_bucket,
                    "prefix": "expectations/",
                    "s3_put_options": {"ACL": "bucket-owner-full-control"},
                },
            },
            "validations_S3_store": {
                "class_name": "ValidationsStore",
                "store_backend": {
                    "class_name": "TupleS3StoreBackend",
                    "bucket": data_profile_s3_store_bucket,
                    "prefix": "validations/",
                    "s3_put_options": {"ACL": "bucket-owner-full-control"},
                },
            },
            "evaluation_parameter_store": {"class_name": "EvaluationParameterStore"},
            "checkpoint_S3_store": {
                "class_name": "CheckpointStore",
                "store_backend": {
                    "class_name": "TupleS3StoreBackend",
                    "suppress_store_backend_id": "true",
                    "bucket": data_profile_s3_store_bucket,
                    "prefix": "checkpoints/",
                    "s3_put_options": {"ACL": "bucket-owner-full-control"},
                },
            },
        },
        expectations_store_name="expectations_S3_store",
        validations_store_name="validations_S3_store",
        evaluation_parameter_store_name="evaluation_parameter_store",
        checkpoint_store_name="checkpoint_S3_store",
        data_docs_sites={
            "s3_site": {
                "class_name": "SiteBuilder",
                "store_backend": {
                    "class_name": "TupleS3StoreBackend",
                    "bucket": data_profile_s3_store_bucket,
                    "prefix": "data_docs/",
                    "s3_put_options": {"ACL": "bucket-owner-full-control"},
                },
                "site_index_builder": {
                    "class_name": "DefaultSiteIndexBuilder",
                    "show_cta_footer": True,
                },
            }
        },
        anonymous_usage_statistics={"enabled": True},
    )
    # Pass the DataContextConfig as a project_config to BaseDataContext
    context = BaseDataContext(project_config=data_context_config)
    logger.info("Create Checkpoint Config.")
    checkpoint_config = {
        "name": "my_checkpoint",
        "config_version": 1,
        "class_name": "Checkpoint",
        "run_name_template": "ingest_date=%YYYY-%MM-%DD",
        "expectation_suite_name": data_profile_expectation_suite_name,
        "runtime_configuration": {
            "result_format": {
                "result_format": "COMPLETE",
                "include_unexpected_rows": True,
            }
        },
        "evaluation_parameters": {},
    }
    context.add_checkpoint(**checkpoint_config)
    # logger.info(f'GE Data Context Config: "{data_context_config}"')
    return context
Using this, I get an error saying it is attempting to run operations on a stopped Spark context.
Is there a better way to use the Spark datasource in Glue 3.0?
I want to stay on Glue 3.0 as much as possible to avoid having to maintain two versions of Glue jobs.

You can fix this by setting force_reuse_spark_context to True; here is a quick example (YML):
config_version: 3.0
datasources:
  my_spark_datasource:
    class_name: Datasource
    module_name: great_expectations.datasource
    data_connectors:
      my_spark_dataconnector:
        class_name: RuntimeDataConnector
        module_name: great_expectations.datasource.data_connector
        batch_identifiers: {}
    execution_engine:
      class_name: SparkDFExecutionEngine
      force_reuse_spark_context: true
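If you would rather keep the in-code DataContextConfig from the question instead of a YML file, the same flag goes on the execution engine. A minimal sketch of just the datasource piece (the run_id batch identifier name is an assumption; the rest of the config stays as in the question):
from great_expectations.data_context.types.base import DatasourceConfig

# Sketch: same datasource as in the question, plus the flag that makes
# SparkDFExecutionEngine attach to the SparkSession Glue has already started
# instead of creating (and later stopping) its own.
my_spark_datasource = DatasourceConfig(
    class_name="Datasource",
    execution_engine={
        "class_name": "SparkDFExecutionEngine",
        "module_name": "great_expectations.execution_engine",
        "force_reuse_spark_context": True,
    },
    data_connectors={
        "my_spark_dataconnector": {
            "module_name": "great_expectations.datasource.data_connector",
            "class_name": "RuntimeDataConnector",
            "batch_identifiers": ["run_id"],  # assumed identifier name
        }
    },
)
# Pass it as datasources={"my_spark_datasource": my_spark_datasource} to DataContextConfig.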
Another thing I would like to add is that you can define the context in a YML file and upload it to S3. Then, you can parse this file in the Glue job with the function below:
import os

import boto3
import yaml
from great_expectations.data_context import BaseDataContext
from great_expectations.data_context.types.base import DataContextConfig


def parse_data_context_from_S3(bucket: str, prefix: str = ""):
    object_key = os.path.join(prefix, "great_expectations.yml")
    print(f"Parsing s3://{bucket}/{object_key}")
    s3 = boto3.session.Session().client("s3")
    s3_object = s3.get_object(Bucket=bucket, Key=object_key)["Body"]
    datacontext_config = yaml.safe_load(s3_object.read())
    project_config = DataContextConfig(**datacontext_config)
    context = BaseDataContext(project_config=project_config)
    return context
Your CI/CD pipeline can easily replace the store backends in the YML file while deploying it to your environments (dev, hom, prod).
If you are using the RuntimeDataConnector, you should have no problem using Glue 3.0. The same does not apply if you are using the InferredAssetS3DataConnector and your datasets are encrypted using KMS. In this case, I was only able to use Glue 2.0.
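Once the context is built (either way), a validation run against a Spark DataFrame from the Glue job looks roughly like the sketch below; the data asset name and the run_id batch identifier value are assumptions, while the checkpoint and suite names are the ones from the question:
from great_expectations.core.batch import RuntimeBatchRequest

# Sketch: df is the Spark DataFrame produced by the Glue job.
# The batch_identifiers keys must match the ones declared on the RuntimeDataConnector.
batch_request = RuntimeBatchRequest(
    datasource_name="my_spark_datasource",
    data_connector_name="my_spark_dataconnector",
    data_asset_name="my_glue_table",              # assumed asset name
    runtime_parameters={"batch_data": df},
    batch_identifiers={"run_id": "2022-01-01"},   # assumed identifier value
)

results = context.run_checkpoint(
    checkpoint_name="my_checkpoint",
    validations=[
        {
            "batch_request": batch_request,
            "expectation_suite_name": data_profile_expectation_suite_name,
        }
    ],
)
if not results["success"]:
    raise ValueError("Great Expectations validation failed")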

Related

Implement I18n localization in Strapi local plugins

I generated a local plugin and created an article model using:
"pluginOptions": {
"i18n": {
"localized": true
}
},
inside its article.settings.json file, in order to make some specific fields translatable using the Internationalization (i18n) plugin.
The problem is that while running the command:
strapi develop --watch-admin
I end up with the following errors:
error Something went wrong in the model "Article" with the attribute "localizations"
error TypeError: Cannot read property "uid" of undefined
Removing "pluginOptions" instead gets my local plugin running, but without any translatable field or the articles__translations pivot table that should be generated in my MySQL database.
"pluginOptions" is the very same parameter that gets generated in the model settings when creating a collection type using the Content-Types Builder, but I can't get it to work for a local plugin.
Here is my article.settings.json:
plugins/blog/models/article.settings.json
{
  "kind": "collectionType",
  "collectionName": "articles",
  "info": {
    "name": "article"
  },
  "options": {
    "draftAndPublish": false,
    "timestamps": true,
    "populateCreatorFields": true,
    "increments": true,
    "comment": ""
  },
  "pluginOptions": {
    "i18n": {
      "localized": true
    }
  },
  "attributes": {
    "title": {
      "pluginOptions": {
        "i18n": {
          "localized": true
        }
      },
      "type": "string",
      "required": true,
      "maxLength": 255,
      "minLength": 3
    },
    "slug": {
      "pluginOptions": {
        "i18n": {
          "localized": true
        }
      },
      "type": "uid",
      "targetField": "title",
      "required": true
    },
    "featured": {
      "pluginOptions": {
        "i18n": {
          "localized": false
        }
      },
      "type": "boolean",
      "default": false
    },
    "published_date": {
      "pluginOptions": {
        "i18n": {
          "localized": false
        }
      },
      "type": "datetime"
    }
  }
}
You can use the content-type-builder plugin as a workaround: instead of creating the content type under the content-types folder, you create it programmatically.
As an example of a very simple tag content type:
{
  "singularName": "tag",
  "pluralName": "tags",
  "displayName": "tag",
  "description": "",
  "draftAndPublish": false,
  "pluginOptions": {
    "i18n": {
      "localized": true
    }
  },
  "attributes": {
    "label": {
      "type": "string",
      "pluginOptions": {
        "i18n": {
          "localized": true
        }
      },
      "unique": true
    }
  }
}
Note that this JSON schema is a bit different from the ones in plugin/server/content-types.
Then you can create the content type programmatically like this:
import { Strapi } from "@strapi/strapi";
import tag from "../content-types/tag.json";
import page from "../content-types/page.json";

export default ({ strapi }: { strapi: Strapi }) => ({
  async createContentComponent() {
    if (!tag) return null;
    try {
      const components: any = [];
      const contentType = await strapi
        .plugin("content-type-builder")
        .services["content-types"].createContentType({
          contentType: tag,
          components,
        });
      return contentType;
    } catch (e) {
      console.log("error", e);
      return null;
    }
  },
});
This is exactly how the admin panel creates content types using the Content-Type Builder UI, and it works with pluginOptions.i18n.localized: true.
One approach would be to do this, e.g., in the bootstrap phase of the plugin, where you could also check whether the content types have already been created; see the sketch below.
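A rough sketch of such a bootstrap hook, where the plugin name ("my-plugin"), the service key ("contentTypes"), and the content type UID ("api::tag.tag") are assumptions to adapt to your setup:
// server/bootstrap.ts (sketch; names are assumptions)
import { Strapi } from "@strapi/strapi";

export default async ({ strapi }: { strapi: Strapi }) => {
  // Skip creation if the content type already exists from a previous start.
  if (strapi.contentTypes["api::tag.tag"]) {
    return;
  }
  await strapi
    .plugin("my-plugin")
    .service("contentTypes")
    .createContentComponent();
};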
As a bonus, you can also create components that otherwise would not work.
Hope that helps.
Links:
Create components programmatically in a plugin: https://github.com/strapi/strapi-plugin-seo/blob/main/server/services/seo.js
Create content types:
https://github.com/strapi/strapi/blob/88caa92f878a068926255dd482180202f53fcdcc/packages/core/content-type-builder/server/controllers/content-types.js#L48
EDIT:
You could also keep the original schema and use this function to transform it, at least for now, as long as the other approach is not working:
https://github.com/strapi/strapi/blob/1eab2fb08c7a4d3d40a5a7ff3b2f137ce0afcf8a/packages/core/content-type-builder/server/services/content-types.js#L37

Cannot parametrize any value under placement.managedCluster.config

My goal is to create a Dataproc workflow template from Python code, and I want to be able to parametrize the placement.managedCluster.config.gceClusterConfig.subnetworkUri field during template instantiation.
I read the template from a JSON file like:
{
  "id": "bigquery-extractor",
  "placement": {
    "managed_cluster": {
      "config": {
        "gce_cluster_config": {
          "subnetwork_uri": "some-subnet-name"
        },
        "software_config": {
          "image_version": "1.5"
        }
      },
      "cluster_name": "some-name"
    }
  },
  "jobs": [
    {
      "pyspark_job": {
        "args": [
          "job_argument"
        ],
        "main_python_file_uri": "gs:///path-to-file"
      },
      "step_id": "extract"
    }
  ],
  "parameters": [
    {
      "name": "CLUSTER_NAME",
      "fields": [
        "placement.managedCluster.clusterName"
      ]
    },
    {
      "name": "SUBNETWORK_URI",
      "fields": [
        "placement.managedCluster.config.gceClusterConfig.subnetworkUri"
      ]
    },
    {
      "name": "MAIN_PY_FILE",
      "fields": [
        "jobs['extract'].pysparkJob.mainPythonFileUri"
      ]
    },
    {
      "name": "JOB_ARGUMENT",
      "fields": [
        "jobs['extract'].pysparkJob.args[0]"
      ]
    }
  ]
}
The code snippet I use:
# Imports assumed by this snippet:
from google.api_core.client_options import ClientOptions
from google.api_core.exceptions import AlreadyExists
from google.cloud import dataproc_v1 as dataproc

options = ClientOptions(api_endpoint="{}-dataproc.googleapis.com:443".format(region))
client = dataproc.WorkflowTemplateServiceClient(client_options=options)
template_file = open(path_to_file, "r")
template_dict = eval(template_file.read())
print(template_dict)
template = dataproc.WorkflowTemplate(template_dict)
full_region_id = "projects/{project_id}/regions/{region}".format(project_id=project_id, region=region)
try:
    client.create_workflow_template(
        parent=full_region_id,
        template=template
    )
except AlreadyExists as err:
    print(err)
    pass
When I try to run this code, I get the following error:
google.api_core.exceptions.InvalidArgument: 400 Invalid field path placement.managed_cluster.configuration.gce_cluster_config.subnetwork_uri: Field gce_cluster_config does not exist.
The behavior is the same if I try to parametrize placement.managedCluster.config.softwareConfig.imageVersion; then I get
google.api_core.exceptions.InvalidArgument: 400 Invalid field path placement.managed_cluster.configuration.software_config.image_version: Field software_config does not exist.
But if I exclude the fields under placement.managedCluster.config from the parameters map, the template is created successfully.
I didn't find any documented restriction on parametrizing these fields. Is there one? Or am I just doing something wrong?
This doc lists the parameterizable fields. It seems that, under managedCluster, only the managed cluster name is parameterizable:
Managed cluster name. Dataproc will use the user-supplied name as the name prefix, and append random characters to create a unique cluster name. The cluster is deleted at the end of the workflow.
I don't see managedCluster.config listed as parameterizable.
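In practice that means dropping the SUBNETWORK_URI entry (the subnetwork would then be hard-coded, or templated outside of Dataproc's parameterization) and keeping only the targets the question already shows to work. A trimmed sketch of the parameters block from the template above:
"parameters": [
  {
    "name": "CLUSTER_NAME",
    "fields": ["placement.managedCluster.clusterName"]
  },
  {
    "name": "MAIN_PY_FILE",
    "fields": ["jobs['extract'].pysparkJob.mainPythonFileUri"]
  },
  {
    "name": "JOB_ARGUMENT",
    "fields": ["jobs['extract'].pysparkJob.args[0]"]
  }
]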

How to update an existing AWS API Gateway using CloudFormation Template

I have the following Swagger definition file that I was able to import into an existing AWS API Gateway through the "Import API" option in the AWS Console. Now, I would like to do the same thing using a CloudFormation template. I would like to know if I can update an existing AWS API Gateway with these paths through a CloudFormation template. I have read the AWS documentation, but I couldn't find any information. The AWS::ApiGateway::RestApi resource has no way of referring to an existing API Gateway. The existing API Gateway was created manually from the AWS Console (i.e., not created through a CloudFormation template).
{
  "openapi": "3.0.1",
  "info": {
    "title": "Common API",
    "description": "defaultDescription",
    "version": "0.3"
  },
  "servers": [
    {
      "url": "http://localhost:32780"
    }
  ],
  "paths": {
    "/catalogs": {
      "get": {
        "description": "Auto generated using Swagger Inspector",
        "parameters": [
          {
            "name": "language",
            "in": "query",
            "required": false,
            "style": "form",
            "explode": true,
            "example": "en"
          },
          {
            "name": "category",
            "in": "query",
            "required": false,
            "style": "form",
            "explode": true,
            "example": "region"
          },
          {
            "name": "subcategory",
            "in": "query",
            "required": false,
            "style": "form",
            "explode": true,
            "example": "group"
          }
        ],
        "responses": {
          "200": {
            "description": "Auto generated using Swagger Inspector",
            "content": {
              "application/json;charset=UTF-8": {
                "schema": {
                  "type": "string"
                },
                "examples": {}
              }
            }
          }
        },
        "servers": [
          {
            "url": "http://localhost:32780"
          }
        ]
      },
      "servers": [
        {
          "url": "http://localhost:32780"
        }
      ]
    }
  }
}
Since you have already created your API from the Console and are trying to update it, I'm not sure whether CloudFormation can help, but you can give it a try, as CloudFormation is capable of updating an API deployed under the same API name or API key.
So, note down the name of the API you created from the Console and try creating/deploying the API with the same name through CloudFormation.
RestAPI:
  Type: AWS::Serverless::Api
  Properties:
    Name: !Sub "your ApiName from the console"
    StageName: !Sub "dev"
    DefinitionBody:
      "Fn::Transform":
        Name: "AWS::Include"
        Parameters:
          Location: !Sub "s3://${TemporaryBucket}/openapi.yaml"
Instead of pulling the definition from S3, the API definition/body can also be defined in the CloudFormation template itself for ease, as shown in the sketch below.
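For example, an inlined body would look roughly like this sketch; only the first path is shown, and the full OpenAPI document from the question would go under DefinitionBody:
RestAPI:
  Type: AWS::Serverless::Api
  Properties:
    Name: !Sub "your ApiName from the console"
    StageName: dev
    DefinitionBody:
      openapi: "3.0.1"
      info:
        title: "Common API"
        version: "0.3"
      paths:
        /catalogs:
          get:
            description: "Auto generated using Swagger Inspector"
            responses:
              "200":
                description: "Auto generated using Swagger Inspector"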

Get all vertices having a labelname

I am using IBM Graph in Bluemix and am new to this.
I created a graph named 'test' using the GUI provided by Bluemix and uploaded the sample 'Music Festival' data provided by IBM into that graph.
Now I am trying to query all the vertices having the label 'attendee' using the query below.
def gt = graph.traversal();
gt.V().hasLabel("attendee");
But I am getting an error:
Error: Error encountered evaluating script def gt = graph.traversal();gt.V().hasLabel("attendee"); with reason com.thinkaurelius.titan.core.TitanException: Could not find a suitable index to answer graph query and graph scans are disabled: [(~label = attendee)]:VERTEX
Not sure what I am doing wrong.
Can somebody tell me where I am going wrong?
How can I get rid of this error and get the expected output?
Thanks
@Radhika, your Gremlin query is a valid Gremlin query. However, some vendors (such as IBM Graph and Titan) chose to only allow users to start their traversals with a step that is indexed, to make sure your queries perform well. Calling hasLabel() by itself will give you the "Could not find a suitable index..." error, as you can't create indexes for labels. What you need to do is follow this step with a step that uses an indexed property, as in this query:
def gt = graph.traversal();
gt.V().hasLabel("band").has("genre","pop");
An index for genre has already been created in the schema for the sample music festival data, as you can see below:
{
  "propertyKeys": [
    { "name": "name", "dataType": "String", "cardinality": "SINGLE" },
    { "name": "gender", "dataType": "String", "cardinality": "SINGLE" },
    { "name": "age", "dataType": "Integer", "cardinality": "SINGLE" },
    { "name": "genre", "dataType": "String", "cardinality": "SINGLE" },
    { "name": "monthly_listeners", "dataType": "String", "cardinality": "SINGLE" },
    { "name": "date", "dataType": "String", "cardinality": "SINGLE" },
    { "name": "time", "dataType": "String", "cardinality": "SINGLE" }
  ],
  "vertexLabels": [
    { "name": "attendee" },
    { "name": "band" },
    { "name": "venue" }
  ],
  "edgeLabels": [
    { "name": "bought_ticket", "multiplicity": "MULTI" },
    { "name": "advertised_to", "multiplicity": "MULTI" },
    { "name": "performing_at", "multiplicity": "MULTI" }
  ],
  "vertexIndexes": [
    { "name": "vByName", "propertyKeys": ["name"], "composite": true, "unique": false },
    { "name": "vByGender", "propertyKeys": ["gender"], "composite": true, "unique": false },
    { "name": "vByGenre", "propertyKeys": ["genre"], "composite": true, "unique": false }
  ],
  "edgeIndexes": [
    { "name": "eByBoughtTicket", "propertyKeys": ["time"], "composite": true, "unique": false }
  ]
}
That's why the above query works and you need to do the same.
1. If you don't have a schema, create one. You can model it after the one above or follow the API doc.
2. Create a vertex/label index for the properties that you'll start your traversals from. In this example, that's name, gender and genre for the vertex properties and time for the edge property.
3. Call the schema endpoint to add your schema to your graph (a rough sketch follows below).
4. It's recommended to create your schema before adding any data to your graph so that you don't have to reindex later. That'll save you a lot of time.
5. Once you create your schema, you can't modify what you created already, but you can add new properties/indexes later on.
For the exact code to use, look at the official code samples for Java and Node.js.
I hope that helps

How can I have multiple API Gateway paths with GET requests in the awsm.json?

I'm trying to create an endpoint with many path parameters:
/api/v1/{option1}
/api/v1/{option1}/{option2}
/api/v1/{option1}/{option2}/{option3}
Using the JAWS awsm.json, I want to create GET methods for all 3 routes. How (if possible) can I accomplish this using the Serverless Framework?
CF:
{
  "lambda": {
    "envVars": [],
    "deploy": true,
    "package": {
      "optimize": {
        "builder": "browserify",
        "minify": true,
        "ignore": [],
        "exclude": [
          "aws-sdk"
        ],
        "includePaths": []
      },
      "excludePatterns": []
    },
    "cloudFormation": {
      "Description": "",
      "Handler": "aws_modules/static/handler.handler",
      "MemorySize": 1024,
      "Runtime": "nodejs",
      "Timeout": 6
    }
  },
  "apiGateway": {
    ..path => /api/v1/{firstname}..
  }
}
At the moment, there is no way to do this via the Serverless Framework.
One thing I found out is that you can omit values in the URL, so the omitted parameter is treated as blank. For example:
api/v1/option1//option3
Here option2 is considered blank, so this sort of solves the issue, except that the user needs to add the extra slashes.