Is it possible to run spark-submit task on databricks using archives parameter? - pyspark

My problem is the following:
I'm trying to run a job with a spark_submit_task, but I have an environment to build.
However, the --archives parameter does not install my environment inside the cluster at runtime. Example job:
{
  "name": "my_test",
  ...
  "new_cluster": {
    ...
    "spark_conf": {
      "spark.databricks.delta.preview.enabled": "true"
    },
    "spark_env_vars": {
      "PYSPARK_PYTHON": "./environment/bin/python"
    },
    ...
  },
  "spark_submit_task": {
    "parameters": [
      "--archives",
      "dbfs:/teste_path/pyspark_conda_env.tar.gz#environment",
      "dbfs:/teste_path/my_script.py"
    ]
  }
}
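For reference, such a job definition is normally created through the Databricks Jobs API; here is a minimal sketch using the requests library. The workspace URL, token, and the cluster fields elided in the question are placeholders, not values from the question:

# Sketch only: POST the job spec above to the Jobs API; host, token and cluster fields are placeholders.
import requests

DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"                                  # placeholder

job_spec = {
    "name": "my_test",
    "new_cluster": {
        # fill in the cluster fields elided in the question, e.g.:
        "spark_version": "<runtime-version>",
        "node_type_id": "<node-type>",
        "num_workers": 1,
        "spark_conf": {"spark.databricks.delta.preview.enabled": "true"},
        "spark_env_vars": {"PYSPARK_PYTHON": "./environment/bin/python"},
    },
    "spark_submit_task": {
        "parameters": [
            "--archives",
            "dbfs:/teste_path/pyspark_conda_env.tar.gz#environment",
            "dbfs:/teste_path/my_script.py",
        ]
    },
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
resp.raise_for_status()
print(resp.json())  # contains the new job_id on success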

Related

Great Expectations v3 API in AWS Glue 3.0

I'm trying to do validation in the pipeline using Great Expectations on AWS Glue 3.0.
Here's my initial attempt to create the data context at runtime, based on their docs:
# Imports assumed by this snippet (Great Expectations v3-era API); logger,
# data_profile_s3_store_bucket and data_profile_expectation_suite_name are
# defined elsewhere in the Glue job.
from great_expectations.data_context import BaseDataContext
from great_expectations.data_context.types.base import DataContextConfig, DatasourceConfig


def create_context():
    logger.info("Create DataContext Config.")
    data_context_config = DataContextConfig(
        config_version=2,
        plugins_directory=None,
        config_variables_file_path=None,
        # concurrency={"enabled": "true"},
        datasources={
            "my_spark_datasource": DatasourceConfig(
                class_name="Datasource",
                execution_engine={
                    "class_name": "SparkDFExecutionEngine",
                    "module_name": "great_expectations.execution_engine",
                },
                data_connectors={
                    "my_spark_dataconnector": {
                        "module_name": "great_expectations.datasource.data_connector",
                        "class_name": "RuntimeDataConnector",
                        "batch_identifiers": [""],
                    }
                },
            )
        },
        stores={
            "expectations_S3_store": {
                "class_name": "ExpectationsStore",
                "store_backend": {
                    "class_name": "TupleS3StoreBackend",
                    "bucket": data_profile_s3_store_bucket,
                    "prefix": "expectations/",
                    "s3_put_options": {"ACL": "bucket-owner-full-control"},
                },
            },
            "validations_S3_store": {
                "class_name": "ValidationsStore",
                "store_backend": {
                    "class_name": "TupleS3StoreBackend",
                    "bucket": data_profile_s3_store_bucket,
                    "prefix": "validations/",
                    "s3_put_options": {"ACL": "bucket-owner-full-control"},
                },
            },
            "evaluation_parameter_store": {"class_name": "EvaluationParameterStore"},
            "checkpoint_S3_store": {
                "class_name": "CheckpointStore",
                "store_backend": {
                    "class_name": "TupleS3StoreBackend",
                    "suppress_store_backend_id": "true",
                    "bucket": data_profile_s3_store_bucket,
                    "prefix": "checkpoints/",
                    "s3_put_options": {"ACL": "bucket-owner-full-control"},
                },
            },
        },
        expectations_store_name="expectations_S3_store",
        validations_store_name="validations_S3_store",
        evaluation_parameter_store_name="evaluation_parameter_store",
        checkpoint_store_name="checkpoint_S3_store",
        data_docs_sites={
            "s3_site": {
                "class_name": "SiteBuilder",
                "store_backend": {
                    "class_name": "TupleS3StoreBackend",
                    "bucket": data_profile_s3_store_bucket,
                    "prefix": "data_docs/",
                    "s3_put_options": {"ACL": "bucket-owner-full-control"},
                },
                "site_index_builder": {
                    "class_name": "DefaultSiteIndexBuilder",
                    "show_cta_footer": True,
                },
            }
        },
        anonymous_usage_statistics={"enabled": True},
    )
    # Pass the DataContextConfig as a project_config to BaseDataContext
    context = BaseDataContext(project_config=data_context_config)
    logger.info("Create Checkpoint Config.")
    checkpoint_config = {
        "name": "my_checkpoint",
        "config_version": 1,
        "class_name": "Checkpoint",
        "run_name_template": "ingest_date=%YYYY-%MM-%DD",
        "expectation_suite_name": data_profile_expectation_suite_name,
        "runtime_configuration": {
            "result_format": {
                "result_format": "COMPLETE",
                "include_unexpected_rows": True,
            }
        },
        "evaluation_parameters": {},
    }
    context.add_checkpoint(**checkpoint_config)
    # logger.info(f'GE Data Context Config: "{data_context_config}"')
    return context
Using this I get an error saying "attempting to run operations on stopped spark context".
Is there a better way to use the Spark datasource in Glue 3.0?
I want to stay on Glue 3.0 as much as possible, to avoid having to maintain two versions of Glue jobs.
You can fix this by setting force_reuse_spark_context to true; here is a quick example (YML):
config_version: 3.0
datasources:
  my_spark_datasource:
    class_name: Datasource
    module_name: great_expectations.datasource
    data_connectors:
      my_spark_dataconnector:
        class_name: RuntimeDataConnector
        module_name: great_expectations.datasource.data_connector
        batch_identifiers: {}
    execution_engine:
      class_name: SparkDFExecutionEngine
      force_reuse_spark_context: true
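If you would rather keep the programmatic DataContextConfig from the question, the same flag can be added to the execution_engine block of the DatasourceConfig (a minimal sketch, assuming the v3-era API used above):

# Sketch: the datasource from the question with force_reuse_spark_context added,
# so GE reuses the SparkSession Glue has already started instead of creating a new one.
from great_expectations.data_context.types.base import DatasourceConfig

my_spark_datasource = DatasourceConfig(
    class_name="Datasource",
    execution_engine={
        "class_name": "SparkDFExecutionEngine",
        "module_name": "great_expectations.execution_engine",
        "force_reuse_spark_context": True,
    },
    data_connectors={
        "my_spark_dataconnector": {
            "module_name": "great_expectations.datasource.data_connector",
            "class_name": "RuntimeDataConnector",
            "batch_identifiers": [""],
        }
    },
)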
Another thing I would like to add is that you can define the context in a YML file and upload it to S3. Then, you can parse this file in the glue job with the function below:
# Imports assumed by this helper (boto3, PyYAML, and the GE v3-era context classes).
import os

import boto3
import yaml
from great_expectations.data_context import BaseDataContext
from great_expectations.data_context.types.base import DataContextConfig


def parse_data_context_from_S3(bucket: str, prefix: str = ""):
    object_key = os.path.join(prefix, "great_expectations.yml")
    print(f"Parsing s3://{bucket}/{object_key}")
    s3 = boto3.session.Session().client("s3")
    s3_object = s3.get_object(Bucket=bucket, Key=object_key)["Body"]
    datacontext_config = yaml.safe_load(s3_object.read())
    project_config = DataContextConfig(**datacontext_config)
    context = BaseDataContext(project_config=project_config)
    return context
Your CI/CD pipeline can easily replace the store backends in the YML file while deploying it to your environments (dev, hom, prod).
If you are using the RuntimeDataConnector, you should have no problem using Glue 3.0. The same does not apply if you are using the InferredAssetS3DataConnector and your datasets are encrypted using KMS. In this case, I was only able to use Glue 2.0.
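For completeness, here is a rough sketch of how the parsed context can be used against a Spark DataFrame inside the Glue job, assuming the GE v3 RuntimeBatchRequest API, the my_checkpoint checkpoint from the question, an existing expectation suite, and a data connector that declares a run_id batch identifier (the bucket, suite, and identifier names are placeholders):

# Hypothetical usage inside the Glue job; bucket, suite and identifier names are placeholders.
from great_expectations.core.batch import RuntimeBatchRequest
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a")], ["id", "value"])  # stand-in for the job's real data

context = parse_data_context_from_S3(bucket="my-ge-config-bucket", prefix="ge/")

batch_request = RuntimeBatchRequest(
    datasource_name="my_spark_datasource",
    data_connector_name="my_spark_dataconnector",
    data_asset_name="my_dataset",                # free-form label for the data being validated
    runtime_parameters={"batch_data": df},       # pass the in-memory Spark DataFrame
    batch_identifiers={"run_id": "manual_run"},  # keys must match the connector's batch_identifiers
)

results = context.run_checkpoint(
    checkpoint_name="my_checkpoint",
    validations=[
        {
            "batch_request": batch_request,
            "expectation_suite_name": "my_expectation_suite",
        }
    ],
)
print(results["success"])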

EMR - Airflow to run scala jar file airflow.exceptions.AirflowException

I am trying to run a Scala JAR file from Airflow using EMR, and the JAR file is designed to read from MSSQL (mssql-jdbc) and PostgreSQL.
From Airflow, I'm able to create the cluster.
My SPARK_STEPS looks like this:
SPARK_STEPS = [
    {
        'Name': 'Trigger_Source_Target',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': ['spark-submit',
                     '--master', 'yarn',
                     '--jars', '/mnt/MyScalaImport.jar',
                     '--class', 'org.classname',
                     's3://path/SNAPSHOT.jar',
                     'SQL_Pwd', 'PostgreSQL_PWD', 'loadtype'],
        }
    }
]
After this, I have JOB_FLOW_OVERRIDES defined:
JOB_FLOW_OVERRIDES = {
    "Name": "pfdt-cluster-airflow",
    "LogUri": "s3://path/elasticmapreduce/",
    "ReleaseLabel": "emr-6.4.0",
    "Applications": [
        {"Name": "Spark"},
    ],
    "Instances": {
        "InstanceGroups": [
            {
                "Name": "Master nodes",
                "Market": "ON_DEMAND",
                "InstanceRole": "MASTER",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 1,
            }
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
        "TerminationProtected": False,
        'Ec2KeyName': 'pem_file_name',
        "Ec2SubnetId": "subnet-123"
    },
    'BootstrapActions': [
        {
            'Name': 'import custom Jars',
            'ScriptBootstrapAction': {
                'Path': 's3://path/subpath/copytoolsjar.sh',
                'Args': []
            }
        }
    ],
    'Configurations': [
        {
            'Classification': 'spark-defaults',
            'Properties': {
                'spark.jars': 's3://jar_path/mssql-jdbc-8.4.1.jre8.jar'
            }
        }
    ],
    "VisibleToAllUsers": True,
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
    "Tags": [
        {"Key": "Environment", "Value": "Development"},
    ],
}
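For context, this is roughly how SPARK_STEPS and JOB_FLOW_OVERRIDES are usually wired together in the DAG, including the watch_step sensor referenced further down. This is a sketch only; import paths and connection ids depend on your amazon provider version:

# Sketch of the create-cluster -> add-steps -> watch-step chain; not the exact DAG from the question.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import (
    EmrAddStepsOperator,
    EmrCreateJobFlowOperator,
)
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

with DAG("emr_scala_job", start_date=datetime(2022, 1, 1), schedule_interval=None) as dag:
    create_cluster = EmrCreateJobFlowOperator(
        task_id="create_cluster",
        job_flow_overrides=JOB_FLOW_OVERRIDES,   # as defined above
        aws_conn_id="aws_default",
        emr_conn_id="emr_default",
    )
    add_steps = EmrAddStepsOperator(
        task_id="add_steps",
        job_flow_id="{{ task_instance.xcom_pull(task_ids='create_cluster', key='return_value') }}",
        steps=SPARK_STEPS,                       # as defined above
        aws_conn_id="aws_default",
    )
    watch_step = EmrStepSensor(
        task_id="watch_step",
        job_flow_id="{{ task_instance.xcom_pull(task_ids='create_cluster', key='return_value') }}",
        step_id="{{ task_instance.xcom_pull(task_ids='add_steps', key='return_value')[0] }}",
        aws_conn_id="aws_default",
    )
    create_cluster >> add_steps >> watch_step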
To copy the Scala .jar file from S3 to the local filesystem, I have a shell script which does the work. Path: s3://path/subpath/copytoolsjar.sh
aws s3 cp s3://path/SNAPSHOT.jar /mnt/MyScalaImport.jar
On triggering the Airflow DAG, it fails at the watch_step node.
The errors I'm getting are:
stdout.gz =>
stderr.gz =>
22/04/08 13:38:23 INFO CodeGenerator: Code generated in 25.5907 ms
Exception in thread "main" java.sql.SQLException: No suitable driver
at java.sql.DriverManager.getDriver(DriverManager.java:315)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.$anonfun$driverClass$2(JDBCOptions.scala:108)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:108)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:38)
How do I resolve this issue?
I have my JARs at:
s3://path/subpath/mssql-jdbc-8.4.1.jre8.jar
s3://path/subpath/postgresql-42.2.24.jar
To get the JAR files (mssql-jdbc-8.4.1.jre8.jar, postgresql-42.2.24.jar) onto the cluster, use the bootstrap step:
'BootstrapActions': [
    {
        'Name': 'import custom Jars',
        'ScriptBootstrapAction': {
            'Path': 's3://path/subpath/copytoolsjar.sh',
            'Args': []
        }
    }
]
In the copytoolsjar.sh file, write the command as:
aws s3 cp s3://path/SNAPSHOT.jar /mnt/MyScalaImport.jar && bash -c "sudo aws s3 cp s3://path/subpath/mssql-jdbc-8.4.1.jre8.jar /usr/lib/spark/jars/" && bash -c "sudo aws s3 cp s3://path/subpath/postgresql-42.2.24.jar /usr/lib/spark/jars/"
That will do the job.
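A related note: the "No suitable driver" exception can also be avoided by naming the driver class explicitly in the JDBC options, so Spark does not rely on DriverManager auto-detection. Here is a minimal PySpark sketch with placeholder connection details (the job in the question is Scala, but the options are the same):

# Sketch only: placeholder host/db/credentials; the point is the explicit "driver" option.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-driver-example").getOrCreate()

mssql_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://mssql-host:1433;databaseName=mydb")
    .option("dbtable", "dbo.my_table")
    .option("user", "my_user")
    .option("password", "my_password")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")  # explicit driver class
    .load()
)

pg_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://pg-host:5432/mydb")
    .option("dbtable", "public.my_table")
    .option("user", "my_user")
    .option("password", "my_password")
    .option("driver", "org.postgresql.Driver")  # explicit driver class
    .load()
)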

(Cadence) getting "deployment contains nonexisting contract" error when trying to deploy to Flow testnet

I'm trying to deploy a hello world smart contract to testnet. This is the contract I'm trying to deploy:
./contracts/NonFungibleToken.cdc
pub contract NonFungibleToken {

    // Declare a stored state field in HelloWorld
    //
    pub let greeting: String

    // Declare a function that can be called by anyone
    // who imports the contract
    //
    pub fun hello(): String {
        return self.greeting
    }

    init() {
        self.greeting = "Hello World!"
    }
}
This is my config file (flow.json):
{
  "emulators": {
    "default": {
      "port": 3569,
      "serviceAccount": "emulator-account"
    }
  },
  "contracts": {
    "NonFungibleToken": "./contracts/NonFungibleToken.cdc"
  },
  "networks": {
    "emulator": "127.0.0.1:3569",
    "mainnet": "access.mainnet.nodes.onflow.org:9000",
    "testnet": "access.devnet.nodes.onflow.org:9000"
  },
  "accounts": {
    "emulator-account": {
      "address": "f8d6e0586b0a20c7",
      "key": "privatekey"
    },
    "testnet-account": {
      "address": "0x2ca684c2732d60e6",
      "key": "privatekey"
    }
  },
  "deployments": {
    "emulator": {
      "emulator-account": [
        "NonFungibleToken"
      ]
    },
    "testnet": {
      "testnet-account": [
        "NonFungibleToken"
      ]
    }
  }
}
When I try to deploy, this is the error I get:
MacBook-Air:nft-app alberthu$ flow project deploy
❌ Config Error: deployment contains nonexisting contract NonFungbileToken
Does anyone know how to fix this issue?
Ah, the problem was that I needed to add the --network=testnet flag:
flow project deploy --network=testnet

Error connecting to environment 1 Org Local Fabric: Error querying channels: 14 UNAVAILABLE: failed to connect to all addresses

I am unable to run my IBM EVote blockchain application in Hyperledger Fabric. I am using IBM EVote in VS Code (v1.39) on Ubuntu 16. When I start my local Fabric (1 Org Local Fabric), I am facing the above error.
Following is my local_fabric_connection.json file:
{
  "name": "local_fabric",
  "version": "1.0.0",
  "client": {
    "organization": "Org1",
    "connection": {
      "timeout": {
        "peer": {
          "endorser": "300"
        },
        "orderer": "300"
      }
    }
  },
  "organizations": {
    "Org1": {
      "mspid": "Org1MSP",
      "peers": [
        "peer0.org1.example.com"
      ],
      "certificateAuthorities": [
        "ca.org1.example.com"
      ]
    }
  },
  "peers": {
    "peer0.org1.example.com": {
      "url": "grpc://localhost:17051"
    }
  },
  "certificateAuthorities": {
    "ca.org1.example.com": {
      "url": "http://localhost:17054",
      "caName": "ca.org1.example.com"
    }
  }
}
and following is the snapshot
Based on your second image, it doesn't look like your 1 Org Local Fabric started properly in the first place (you have no gateways, and for some reason your wallets aren't grouped together!).
If you tear down your 1 Org Local Fabric and then start it again, hopefully it'll work.

Hyperledger IROHA get_acc_ast_tx in CLI mode doesn't work

I just finished the Pluralsight course and completed the tutorial in the official project documentation without problems. Nevertheless, using the CLI I could not use the functions get_acc_ast_tx and get_acc_tx. I checked that the peer keys and the configuration files correspond to the genesis file, where admin#test is allowed to use these functions, and I get:
[2019-12-08 04:55:57.883070400] [E] [CLI/ResponseHandler/Query]: Query is stateless invalid.
The genesis file I use is the initial one of the git repository:
{
  "blockV1": {
    "payload": {
      "transactions": [{
        "payload": {
          "reducedPayload": {
            "commands": [{
              "addPeer": {
                "peer": {
                  "address": "127.0.0.1:10001",
                  "peerKey": "bddd58404d1315e0eb27902c5d7c8eb0602c16238f005773df406bc191308929"
                }
              }
            }, {
              "createRole": {
                "roleName": "admin",
                "permissions": ["can_add_peer", "can_add_signatory", "can_create_account", "can_create_domain", "can_get_all_acc_ast", "can_get_all_acc_ast_txs", "can_get_all_acc_detail", "can_get_all_acc_txs", "can_get_all_accounts", "can_get_all_signatories", "can_get_all_txs", "can_get_blocks", "can_get_roles", "can_read_assets", "can_remove_signatory", "can_set_quorum"]
              }
            }, {
              "createRole": {
                "roleName": "user",
                "permissions": ["can_add_signatory", "can_get_my_acc_ast", "can_get_my_acc_ast_txs", "can_get_my_acc_detail", "can_get_my_acc_txs", "can_get_my_account", "can_get_my_signatories", "can_get_my_txs", "can_grant_can_add_my_signatory", "can_grant_can_remove_my_signatory", "can_grant_can_set_my_account_detail", "can_grant_can_set_my_quorum", "can_grant_can_transfer_my_assets", "can_receive", "can_remove_signatory", "can_set_quorum", "can_transfer"]
              }
            }, {
              "createRole": {
                "roleName": "money_creator",
                "permissions": ["can_add_asset_qty", "can_create_asset", "can_receive", "can_transfer"]
              }
            }, {
              "createDomain": {
                "domainId": "test",
                "defaultRole": "user"
              }
            }, {
              "createAsset": {
                "assetName": "coin",
                "domainId": "test",
                "precision": 2
              }
            }, {
              "createAccount": {
                "accountName": "admin",
                "domainId": "test",
                "publicKey": "313a07e6384776ed95447710d15e59148473ccfc052a681317a72a69f2a49910"
              }
            }, {
              "createAccount": {
                "accountName": "test",
                "domainId": "test",
                "publicKey": "716fe505f69f18511a1b083915aa9ff73ef36e6688199f3959750db38b8f4bfc"
              }
            }, {
              "appendRole": {
                "accountId": "admin#test",
                "roleName": "admin"
              }
            }, {
              "appendRole": {
                "accountId": "admin#test",
                "roleName": "money_creator"
              }
            }],
            "quorum": 1
          }
        }
      }],
      "txNumber": 1,
      "height": "1",
      "prevBlockHash": "0000000000000000000000000000000000000000000000000000000000000000"
    }
  }
}
I use the Hyperledger Docker image, on macOS Catalina.
I followed the tutorial according to this manual: https://iroha.readthedocs.io/en/latest/build/index.html
Thank you very much for the help.
Unfortunately, the CLI is rather outdated – we are working on a new solution for it, but meanwhile it is better to use one of the available SDKs – for Java, Python, JS, or iOS (if you prefer mobile development).
All of them contain examples, so it should not be too tricky to use them. Although, if you encounter any issues, please contact us using one of the chats here.
This is due to the outdated CLI. A newer version that is being developed will replace it, but it is not yet ready.
The exact problem is that pagination metadata was added for these queries in Iroha, but the CLI was not updated to set it properly. The protobuf transport allows the CLI to send a query without some fields that were added later, but Iroha refuses to handle it.
You can use one of the client libraries, which are always kept up to date: https://iroha.readthedocs.io/en/latest/develop/libraries.html
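As an illustration, here is a rough sketch of the same query through the Python SDK (the iroha package). The private key, account ids, and torii port are placeholders, and the exact pagination kwargs may differ slightly between SDK versions:

# Sketch only: querying account asset transactions with the Iroha Python SDK.
# The private key, account ids, and torii port below are placeholders.
from iroha import Iroha, IrohaCrypto, IrohaGrpc

ADMIN_ACCOUNT_ID = "admin@test"
ADMIN_PRIVATE_KEY = "<admin private key hex>"   # placeholder

iroha = Iroha(ADMIN_ACCOUNT_ID)
net = IrohaGrpc("127.0.0.1:50051")              # default torii port

# Build the query with the pagination metadata the outdated CLI does not send.
query = iroha.query(
    "GetAccountAssetTransactions",
    account_id=ADMIN_ACCOUNT_ID,
    asset_id="coin#test",
    page_size=10,
)
IrohaCrypto.sign_query(query, ADMIN_PRIVATE_KEY)

response = net.send_query(query)
print(response)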