get s3 url path of metaflow artifact - netflix-metaflow

Is there a way to get the full s3 url path of a metaflow artifact, which was stored in a step?
I looked at Metaflow's DataArtifact class but didn't see an obvious s3 path property.

Yep, you can do
Flow('MyFlow')[42]['foo'].task.artifacts.bar._object['location']
where MyFlow is the name of your flow, 42 is the run ID, foo is the step under consideration and bar is the artifact from that step.
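For reference, a minimal sketch of that lookup with the client API spelled out. The flow name "MyFlow", run ID 42, step "foo" and artifact "bar" are placeholders, and _object is an internal attribute, so this may change between Metaflow versions:

from metaflow import Run

run = Run("MyFlow/42")              # or Flow("MyFlow")["42"]
task = run["foo"].task              # the task of step "foo"
location = task.artifacts.bar._object["location"]  # S3 URL recorded in the artifact metadata
print(location)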

Based on Savin's answer, I've written a helper function that returns the S3 URL of an artifact given a run ID and the artifact's name:
from typing import List, Union

from metaflow import Flow, Metaflow, Run, namespace


class DataArtifactNotFoundError(Exception):
    """Custom exception raised when no matching artifact is found in the run."""


def get_artifact_s3url_from_run(
    run: Union[str, Run], name: str, legacy_names: List[str] = [], missing_ok: bool = False
) -> str:
    """
    Given a Metaflow Run and a key, scans the run's tasks and returns the S3 URL of the artifact with that key.

    NOTE: use get_artifact_from_run() if you want the artifact itself, not the S3 URL to the artifact.

    This allows us to find data artifacts even in flows that did not finish. If we change the name of an artifact,
    we can support backwards compatibility by also passing in the legacy keys. Note: we can avoid this by resuming a
    specific run and adding a node which re-maps the artifact to another key. This will assign the run a new ID.

    Args:
        run: a metaflow.Run() object, or a run ID such as "MyFlow/42"
        name: name of the attribute to look for in task.data
        legacy_names: backup names to check
        missing_ok: whether to allow an artifact to be missing

    Returns:
        the S3 URL of the artifact

    Raises:
        DataArtifactNotFoundError: if the artifact is not found and missing_ok=False
        ValueError: if the flow is not found, or the flow is found but the run ID is not
    """
    namespace(None)  # allows us to access all runs in all namespaces
    names_to_check = [name] + legacy_names

    if isinstance(run, str):
        try:
            run = Run(run)
        except Exception as e:
            # run ID not found. See if the flow exists and list its runs to help debugging
            flow_name = run.split(sep="/")[0]
            try:
                flow = Flow(flow_name)
            except Exception as e2:
                raise ValueError(f"Could not find flow {flow_name}. Available flows: {Metaflow().flows}") from e2
            raise ValueError(f"Could not find run ID {run}. Possible values: {list(flow.runs())}") from e

    for name_ in names_to_check:
        for step_ in run:
            for task in step_:
                print(f"task {task} artifacts: {task.artifacts} \n \n")  # debug output
                if task.artifacts is not None and name_ in task.artifacts:
                    # https://stackoverflow.com/a/66361249/4212158
                    return getattr(task.artifacts, name_)._object["location"]

    if not missing_ok:
        raise DataArtifactNotFoundError(
            f"No data artifact with name {name} found in {run}. Also checked legacy names: {legacy_names}"
        )
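
A rough usage sketch (the run ID "MyFlow/42" and the artifact name "bar" are placeholders):

s3_url = get_artifact_s3url_from_run("MyFlow/42", "bar")
print(s3_url)  # an s3:// URL under your Metaflow datastore root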

Related

Karate-Gatling: Not able to run scenarios based on tags

I am trying to run a performance test on a scenario tagged as perf from the feature file below:

#tag1 #tag2 #tag3
Background:
  user login

#tag4 #perf
Scenario: scenario1

#tag4
Scenario: scenario2
Below is my .scala file setup:

class PerfTest extends Simulation {

  val protocol = karateProtocol()

  val getTags = scenario("Name goes here").exec(karateFeature("classpath:filepath"))

  setUp(
    getTags.inject(
      atOnceUsers(1)
    ).protocols(protocol)
  )
}
I have tried passing the tags from the command line, as well as passing the tag as an argument to the exec method in the Scala setup.
Terminal command-
mvn clean test-compile gatling:test "-Dkarate.env={env}" "-Dkarate.options= --tags #perf"
.scala update: I have also tried passing the tag as an argument in the karateFeature call.
val getTags = scenario("Name goes here").exec(karateFeature("classpath:filepath", "#perf"))
Both scenarios are being executed with either approach. Any pointers on how I can force only the test with the perf tag to run?
I wanted to share my findings here. I realized it works fine when I pass the tag info in the .scala file.
My scenario with the perf tag was a combination of a GET and a POST call, as I needed some data from the GET call to pass into the POST call. That's why I was seeing both calls when running the performance test.
I did not find any reference in the Karate-Gatling documentation to passing tags in the terminal execution command, so I am assuming that might not be a valid case.

Unzip artifact for REST API Gateway in CDK

I'm currently passing, through parameterOverrides, both the S3 bucket name and the object key.
However, the key in fact points to a zipped file (that contains the YAML):
export class BusinessAssetApi extends SpecRestApi {
  constructor(scope: Construct, id: string, bucketName: string, key: string) {
    const bucket = Bucket.fromBucketName(scope, "openapi-bucket", bucketName)
    super(scope, id, {
      deploy: true,
      deployOptions: {
        stageName: STAGE_NAME,
      },
      apiDefinition: ApiDefinition.fromBucket(bucket, key),
    })
  }
}
Now, I want to know if there's a smart way to unzip the file and get the yaml file instead, or if there is a smarter way to save the artifact with a specific filename and/or file extension?
fromBucket is intended for cases where you store your configuration files, or other needed files, directly in an S3 bucket. It isn't really intended for an artifact (I'm assuming you are getting this artifact from an earlier step in a CodePipeline?), and so you have run into the primary drawback of this design: the zip file is not the configuration, and fromBucket does not unzip.
If you have a repo as your base point, and it's part of your pipeline or where you are running cdk deploy from, you can use fromAsset instead, but that is a bit more convoluted in terms of getting the file there.
The only solution I know of in this situation is to store the file, as part of your pipeline process, directly in an S3 bucket and then pass that as part of your parameters into your next stack.
Alternatively, if you really have no other choice, you could write a bit of code to grab the zip out of the artifact (and its keys out of the pipeline event), unzip it in code, and use fromInline instead... but that probably won't work as expected.
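If you do try the "unzip it in code" route, here is a minimal sketch using boto3 and the standard zipfile module. The bucket, key and the openapi.yaml file name inside the artifact are assumptions; adjust them to whatever your pipeline actually produces:

import io
import zipfile

import boto3

s3 = boto3.client("s3")

def extract_yaml_from_artifact(bucket: str, key: str, yaml_name: str = "openapi.yaml") -> str:
    # Download the zipped pipeline artifact and return the contents of the YAML file inside it.
    obj = s3.get_object(Bucket=bucket, Key=key)
    with zipfile.ZipFile(io.BytesIO(obj["Body"].read())) as archive:
        with archive.open(yaml_name) as f:
            return f.read().decode("utf-8")

The returned string could then be re-uploaded as a plain .yaml object (so ApiDefinition.fromBucket can point at it) or passed to ApiDefinition.fromInline.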

Unexpected error while passing variable group variables (Azure DevOps) to YAML pipeline

I'm a newbie to both Azure DevOps and Terraform, but I'm trying to deploy a pipeline using a YAML file.
I have tried to run a terraform plan using a YAML file, passing in variables from Azure DevOps, but I got the following error:
2021-11-24T18:39:46.4604561Z Error: "name" may only contain alphanumeric characters, dash, underscores, parentheses and periods
2021-11-24T18:39:46.4604832Z
2021-11-24T18:39:46.4605940Z on modules/aks/main.tf line 2, in resource "azurerm_resource_group" "aks-resource-group":
2021-11-24T18:39:46.4606436Z 2: name = var.resource_group_name
2021-11-24T18:39:46.4606609Z
2021-11-24T18:39:46.4606722Z
2021-11-24T18:39:46.4606818Z
2021-11-24T18:39:46.4607525Z Error: Error: Subnet: (Name "#{vnet_subnet_name}#" / Virtual Network Name "#{vnet_name}#" / Resource Group "RG-XX-XXXX-XXXXX-001") was not found
2021-11-24T18:39:46.4608006Z
2021-11-24T18:39:46.4608580Z on modules/aks/main.tf line 16, in data "azurerm_subnet" "subnet-project":
2021-11-24T18:39:46.4609335Z 16: data "azurerm_subnet" "subnet-project" {
The 'name' has the following format in the variable group in the Azure DevOps UI:
RG-XX-XXXX-XXXXX-001
This is the snippet of the YAML file where I included the replace tokens task:

  displayName: 'Replace Secrets'
  inputs:
    targetFiles: |
      variables.tfvars
    encoding: 'utf-8'
    actionOnMissing: fail
    tokenPattern: #{MyVar}#
And this is a sample of the variables I have in a variable group:
(screenshot: variable-group-sample)
Also, I replace the terraform.tfvars file with something like this:
resource_group_name = "#{resource_group_name}#"
I have checked the name entered in the UI several times, but I feel the error is pointing to something else I cannot see.
Has anyone experienced something related to this error?
Thank you in advance!
tokenPattern: #{MyVar}#
It is looking for the literal pattern #{MyVar}# to replace. Not "something contained between #{ and }#", but the actual value #{MyVar}#. I'm guessing it's expecting a regular expression, but I'm not familiar with that task.
So the end result is that your #{token values}# aren't getting replaced.
Assuming you're using https://marketplace.visualstudio.com/items?itemName=qetza.replacetokens, you probably want to specify tokenPrefix: #{ and tokenSuffix: }# instead of using tokenPattern.
Now, having said that...
There is no reason for you to be using token replacement on a tfvars file. You should create different tfvars files for each environment, then pass in a tfvars file via the -var-file argument to Terraform. Secrets can be passed in on the command line via -var 'foo=bar'
Storing variables that represent application or deployment configuration in Azure DevOps (or GitHub, or any other CI system) is a big, big anti-pattern, because it's tightly coupling your deployment process to a particular platform. If you're sourcing all of your variables from Azure DevOps, you can't easily test locally or migrate to a different CI/CD provider like GitHub Actions in the future.
For values that shouldn't be in source control, such as secrets, you should use a secret provider like Azure Key Vault and integrate it with your application (or, in this case, use a data resource in Terraform to pull the necessary secrets automatically at deployment time).

automate uploading of glue script

We are currently using CloudFormation to create a Glue job (via CodeBuild and CodePipeline). The one thing we are stuck on is how to automate deploying the code that goes into the Glue job.
Our current relevant piece of the cloudformation template looks like this:
MyJob:
  Type: AWS::Glue::Job
  Properties:
    Command:
      Name: glueetl
      ScriptLocation: "s3://aws-glue-scripts//your-script-file.py"
    DefaultArguments:
      "--job-bookmark-option": "job-bookmark-enable"
    ExecutionProperty:
      MaxConcurrentRuns: 2
    MaxRetries: 0
    Name: cf-job1
    Role: !Ref MyJobRole
The problem is the "ScriptLocation". It looks like it is required to be an S3 location. How can we automate the upload of this? The code is in a .py file in our Git repository, and I assume it is uploaded to the artifact repository as part of the CodeBuild process, but how do we access it?
Would like to hear how others are doing this. Thanks!
EDIT: I was able to find a similar Stack Overflow post: AWS Glue automatic job creation, but the answers there don't really give a solution or address the question posed.
I've written a tool to handle the upload of stack dependencies, including CloudFormation nested templates and non-inline Lambda functions.
Currently AWS Glue is not handled, since I haven't tried it in any project yet, but it should be easy to extend the tool to support Glue.
The dependencies are defined in a separate config file, and a piece of code within the tool is responsible for processing the config. Here are the sample configs:
Nested CloudFormation templates:
# DEPENDS=( <ParameterName>=<NestedTemplate> )
#
# Required: Yes if has nested template, otherwise No
# Default: None
# Syntax:
# <ParameterName>: The name of template parameter that is referred at the
# value of nested template property `TemplateURL`.
# <NestedTemplate>: A local path or a S3 URL starting with `s3://` or
# `https://` pointing to the nested template.
# The nested templates at local is going to be uploaded
# to S3 Bucket automatically during the deployment.
# Description:
# Double quote the pairs which contain whitespaces or special characters.
# Use `#` to comment out.
# ---
# Example:
# DEPENDS=(
# NestedTemplateFooURL=/path/to/nested/foo/stack.json
# NestedTemplateBarURL=/path/to/nested/bar/stack.json
# )
Lambda functions:
# LAMBDA=( <S3BucketParameterName>:<S3KeyParameterName>=<LambdaFunction> )
#
# Required: Yes if has None-inline Lambda Function, otherwise No
# Default: None
# Syntax:
# <S3BucketParameterName>: The name of template parameter that is referred
# at the value of Lambda property `Code.S3Bucket`.
# <S3KeyParameterName>: The name of template parameter that is referred
# at the value of Lambda property `Code.S3Key`.
# <LambdaFunction>: A local path or a S3 URL starting with `s3://` pointing
# to the Lambda Function.
# The Lambda Functions at local is going to be zipped and
# uploaded to S3 Bucket automatically during the deployment.
# Description:
# Double quote the pairs which contain whitespaces or special characters.
# Use `#` to comment out.
# ---
# Example:
# LAMBDA=(
# S3BucketForLambdaFoo:S3KeyForLambdaFoo=/path/to/LambdaFoo.py
# S3BucketForLambdaBar:S3KeyForLambdaBar=s3://mybucket/LambdaBar.py
# )
The tool is written in bash and comes in two parts:
xsh: works as a bash library framework.
xsh-lib/aws: a library for xsh.
The code you may need to expand is located in xsh-lib/aws/functions/cfn/deploy.sh.
The example deploy command looks like:
$ xsh aws/cfn/deploy -C /path/to/your/template-and-config-dir -t stack.json -c sample.conf
I'm considering abstracting the dependencies, such as CloudFormation templates, Lambda functions and Glue scripts, into a single interface for both configs and handlers.
This will make it easier to add new dependency handlers to the deployer.

How to fix PipelineParam from discarding all information except for name in Kubeflow Pipeline

I'm trying to write an application using Kubeflow Pipelines. I'm running into trouble when passing parameters to the pipeline (the main Python function decorated with @kfp.dsl.pipeline). The parameters should be automatically converted into a PipelineParam with name, value, and other info. However, it seems that everything except for the name is being discarded. I'm on an Ubuntu server.
I've tried uninstalling/reinstalling and updating Kubeflow, tried installing several of the most recent versions of kfp (0.1.23, 0.1.22, 0.1.20, 0.1.18), as well as installing on my local machine.
import kfp.dsl


def print_pipeline_param():
    return kfp.dsl.PipelineParam("Name", value="Value")


@kfp.dsl.pipeline(
    name='Test Pipeline',
    description='Test pipeline'
)
def test_pipeline(output_file='/output.txt'):
    print(print_pipeline_param())
    print(output_file)


if __name__ == '__main__':
    import kfp.compiler as compiler
    compiler.Compiler().compile(test_pipeline, __file__ + '.tar.gz')
The result of running this is:
{{pipelineparam:op=;name=Name;value=Value;type=;}}
{{pipelineparam:op=;name=output-file;value=;type=;}
I should be getting '/output.txt' in the "value" field, but the only field populated is the name. This only happens when passing in parameters to the main pipeline function. This also happens when directly passing in a PipelineParam like so:
@kfp.dsl.pipeline(
    name='Test Pipeline',
    description='Test pipeline'
)
def test_pipeline(output_file=kfp.dsl.PipelineParam("Output File", value="/output.txt")):
    print(output_file)
Prints out: {{pipelineparam:op=;name=output-file;value=;type=;}