How to iterate over an XCom within a task_group to keep a history of dynamically generated tasks within it? - airflow-2.x

I implemented a DAG with a task_group that loops over the contents of a file. This isn't ideal; I wish I could loop through the contents of an XCom, but unless I'm mistaken, that isn't possible.
import json
import logging

from airflow.decorators import task_group
from airflow.operators.dummy import DummyOperator
from dtm.migration.helpers.tasks import (
    check_compliance_of_objects_sizes,
)
from airflow.providers.google.cloud.transfers.gcs_to_gcs import GCSToGCSOperator
from airflow.providers.google.cloud.sensors.gcs import (
    GCSObjectExistenceSensor,
)


@task_group(group_id="copy_task_group")
def copy_and_verify(
    inp_parameters_path: str,
    impersonated_service_account: str,
):
    try:
        with open(inp_parameters_path, "r") as f:
            inp_parameters = json.load(f)
        for entry in inp_parameters:
            sensor_source_file = GCSObjectExistenceSensor(
                task_id=f"sensor_source_file_{entry}",
                bucket=inp_parameters[entry]["source"]["bucket"],
                object=(
                    f"{inp_parameters[entry]['source']['prefix']}"
                    f"/{inp_parameters[entry]['source']['object']}"
                ),
                impersonation_chain=impersonated_service_account,
            )
            copy_file = GCSToGCSOperator(
                task_id=f"copy_file_{entry}",
                source_bucket=inp_parameters[entry]["source"]["bucket"],
                source_object=(
                    f"{inp_parameters[entry]['source']['prefix']}"
                    f"/{inp_parameters[entry]['source']['object']}"
                ),
                destination_bucket=inp_parameters[entry]["destination"]["bucket"],
                destination_object=(
                    f"{inp_parameters[entry]['destination']['prefix']}"
                    f"/{inp_parameters[entry]['destination']['object']}"
                ),
                impersonation_chain=impersonated_service_account,
            )
            check_size = check_compliance_of_objects_sizes(
                date_of_execution="{{ds_nodash}}",
                data_to_check=inp_parameters[entry],
                impersonation_chain=impersonated_service_account,
            )
            sensor_destination_file = GCSObjectExistenceSensor(
                task_id=f"sensor_destination_file_{entry}",
                bucket=inp_parameters[entry]["destination"]["bucket"],
                object=(
                    f"{inp_parameters[entry]['destination']['prefix']}"
                    f"/{inp_parameters[entry]['destination']['object']}"
                ),
                impersonation_chain=impersonated_service_account,
            )
            end_op = DummyOperator(task_id=f"end_{entry}")
            (
                sensor_source_file
                >> copy_file
                >> check_size
                >> sensor_destination_file
                >> end_op
            )
    except FileNotFoundError:
        logging.info(
            f"File {inp_parameters_path} is generated in a prior task."
        )
As you can see, I used try...except FileNotFoundError: because the error (file not found) is raised at DAG parse time, since inp_parameters_path hasn't yet been created by the upstream tasks.
The inp_parameters_path file is created upstream, but unfortunately it seems that in order to display the logged steps of the task_group after execution I need to keep this file, which I can't, because it changes from one day to the next.
Example of the removal of the task_group contents after execution:
The contents were still present before remove_mapping_file ran.
How can I pass an XCom to a task_group the same way I can with a task? If that's not possible, how do I archive the generated inp_parameters_path files per execution so that I can come back and browse past runs of the DAG?
If it can help, here's how my DAG operates:
@dag(
    catchup=False,
    schedule_interval="@daily",
    max_active_runs=1,
    dag_id="migrate_dtm_gcs_data_history",
    start_date=datetime(2022, 12, 1),
    dagrun_timeout=timedelta(minutes=20),
    tags=[
        "migration",
        "dtm",
    ],
    default_args=default_args,
)
def template_dag():
    map = convert_csv_migration_to_map(
        source_csv=f"{COMPOSER_GCS_LOC_PATH}/dags/dtm/migration/helpers/gcs_migration_mapping_archetype.csv",
        delimiter=",",
    )
    updated_map_1 = update_mapping_with_gcp_project_and_buckets_ids(
        src_mig_map=map,
        env="dev",
        impersonation_chain=IMPERSONATED_SERVICE_ACCOUNT,
    )
    list_files_and_prefixes = GCSListObjectsOperator(
        task_id="list_files_and_prefixes",
        bucket=LEGACY_DTM_HISTORY_BUCKET,
        prefix="raw_data/datamart/{{ds_nodash}}",
        impersonation_chain=IMPERSONATED_SERVICE_ACCOUNT,
    )
    updated_map_2 = match_data_with_migration_map(
        src_mig_map=updated_map_1,
        files_and_prefixes=list_files_and_prefixes.output,
    )
    list_of_maps = flatten_to_input_files_granularity(
        src_mig_map=updated_map_2,
    )
    gcs_migration_maps_path = (
        f"{COMPOSER_GCS_LOC_PATH}/dags/dtm/migration"
        f"/helpers/gcs_migration_list_of_maps_{{ds_nodash}}.json"
    )
    write_list_of_maps_into_dict = gen_dict_file_from_list_of_map(
        file_name=gcs_migration_maps_path,
        list_of_maps=list_of_maps,
    )
    copy_task_group = copy_and_verify(
        inp_parameters_path=gcs_migration_maps_path,
        impersonated_service_account=IMPERSONATED_SERVICE_ACCOUNT,
    )
    remove_mapping_file = BashOperator(
        task_id="remove_mapping_file",
        bash_command=f"rm -vf {gcs_migration_maps_path}",
    )
    (
        map
        >> updated_map_1
        >> list_files_and_prefixes
        >> updated_map_2
        >> list_of_maps
        >> write_list_of_maps_into_dict
        >> copy_task_group
        >> remove_mapping_file
    )


dag = template_dag()
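One direction that looks promising, if the environment can run Airflow 2.3+: dynamic task mapping expands a task over an XCom value directly, so the intermediate file would disappear. A minimal sketch, with hypothetical task names standing in for my helpers (not my actual implementation):

from airflow.decorators import task

@task
def read_parameters() -> list:
    # hypothetical stand-in for gen_dict_file_from_list_of_map:
    # return the list of mapping entries via XCom instead of a JSON file
    return [{"entry": "a"}, {"entry": "b"}]

@task
def copy_and_verify_one(entry: dict):
    # hypothetical per-entry worker (sensor -> copy -> size check)
    print(entry)

# expand() creates one mapped task instance per element of the XCom list
copy_and_verify_one.expand(entry=read_parameters())

Failing that, archiving instead of deleting would keep past runs browsable, e.g. swapping the rm in remove_mapping_file for a move into a dated folder of your choosing.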

Related

Unable to get repr for <class 'albumentations.core.composition.Compose'>

I am trying to run a code repository downloaded from GitHub, following its instructions, but I am getting the following error:
TypeError: __init__() missing 1 required positional argument: 'image_paths'
I am getting this error at code line 63 (preprocessing=preprocessing). When I start the program in debug mode, it shows the following error:
Unable to get repr for <class 'albumentations.core.composition.Compose'>
import matplotlib.pyplot as plt
import numpy as np
import os
import sys
import torch
from skimage import io
from utils import adjust_sar_contrast, compute_building_score, plot_images

sys.path.append('/home/salman/Downloads/SpaceNet_SAR_Buildings_Solutions-master/4-motokimura/tmp/work')
from spacenet6_model.configs.load_config import get_config_with_previous_experiment
from spacenet6_model.datasets import SpaceNet6TestDataset
from spacenet6_model.models import get_model
from spacenet6_model.transforms import get_augmentation, get_preprocess

# select previous experiment to load
exp_id = 14
exp_log_dir = "/home/salman/Downloads/SpaceNet_SAR_Buildings_Solutions-master/4-motokimura/tmp/logs"  # None: use default

# select device to which the model is loaded
cuda = True
if cuda:
    device = 'cuda'
    os.environ['CUDA_VISIBLE_DEVICES'] = '0'
else:
    device = 'cpu'
    os.environ['CUDA_VISIBLE_DEVICES'] = ''

# overwrite default config with previous experiment
config = get_config_with_previous_experiment(exp_id=exp_id, exp_log_dir=exp_log_dir)

# overwrite additional hyper parameters
config.MODEL.DEVICE = device
config.WEIGHT_ROOT = "/home/salman/Downloads/SpaceNet_SAR_Buildings_Solutions-master/4-motokimura/tmp/weights/"
config.MODEL.WEIGHT = f"/home/salman/Downloads/SpaceNet_SAR_Buildings_Solutions-master/4-motokimura/tmp/weights/exp_{exp_id:04d}/model_best.pth"
config.INPUT.MEAN_STD_DIR = "/home/salman/Downloads/SpaceNet_SAR_Buildings_Solutions-master/4-motokimura/tmp/work/models/image_mean_std/"
config.INPUT.TEST_IMAGE_DIR = "/home/salman/data/SN6_buildings_AOI_11_Rotterdam_test_public/test_public/AOI_11_Rotterdam/SAR-Intensity"
config.INPUT.SAR_ORIENTATION = "/home/salman/Downloads/SpaceNet_SAR_Buildings_Solutions-master/4-motokimura/tmp/work/static/SAR_orientations.txt"
config.TRAIN_VAL_SPLIT_DIR = "/home/salman/Downloads/data/spacenet6/split"
config.PREDICTION_ROOT = "/home/salman/Downloads/data/spacenet6/predictions"
config.POLY_CSV_ROOT = "/home/salman/Downloads/data/spacenet6/polygons"
config.CHECKPOINT_ROOT = "/home/salman/Downloads/data/spacenet6/ceckpoints"
config.POLY_OUTPUT_PATH = "/home/salman/Downloads/data/spacenet6/val_polygons"
config.freeze()
print(config)

model = get_model(config)
model.eval()

from glob import glob
image_paths = glob(os.path.join(config.INPUT.TEST_IMAGE_DIR, "*.tif"))
# print(image_paths)

preprocessing = get_preprocess(config, is_test=True)
augmentation = get_augmentation(config, is_train=False)

test_dataset = SpaceNet6TestDataset(
    config,
    augmentation=augmentation,
    preprocessing=preprocessing
)
test_dataset_vis = SpaceNet6TestDataset(
    config,
    augmentation=augmentation,
    preprocessing=None
)

channel_footprint = config.INPUT.CLASSES.index('building_footprint')
channel_boundary = config.INPUT.CLASSES.index('building_boundary')

score_thresh = 0.5
alpha = 1.0

start_index = 900
N = 20
for i in range(start_index, start_index + N):
    image_vis = test_dataset_vis[i]['image']
    image = test_dataset[i]['image']
    x_tensor = image.unsqueeze(0).to(config.MODEL.DEVICE)
    pr_score = model.module.predict(x_tensor)
    pr_score = pr_score.squeeze().cpu().numpy()
    pr_score_building = compute_building_score(
        pr_score[channel_footprint],
        pr_score[channel_boundary],
        alpha=alpha
    )
    pr_mask = pr_score_building > score_thresh
    rotated = test_dataset[i]['rotated']
    if rotated:
        image_vis = np.flipud(np.fliplr(image_vis))
        pr_mask = np.flipud(np.fliplr(pr_mask))
    plot_images(
        SAR_intensity_0=(adjust_sar_contrast(image_vis[:, :, 0]), 'gray'),
        building_mask_pr=(pr_mask, 'viridis')
    )
The function which this code calls is given below:
def get_spacenet6_preprocess(config, is_test):
    """
    """
    mean_path = os.path.join(
        config.INPUT.MEAN_STD_DIR,
        config.INPUT.IMAGE_TYPE,
        'mean.npy'
    )
    mean = np.load(mean_path)
    mean = mean[np.newaxis, np.newaxis, :]

    std_path = os.path.join(
        config.INPUT.MEAN_STD_DIR,
        config.INPUT.IMAGE_TYPE,
        'std.npy'
    )
    std = np.load(std_path)
    std = std[np.newaxis, np.newaxis, :]

    if is_test:
        to_tensor = albu.Lambda(
            image=functools.partial(_to_tensor)
        )
    else:
        to_tensor = albu.Lambda(
            image=functools.partial(_to_tensor),
            mask=functools.partial(_to_tensor)
        )

    preprocess = [
        albu.Lambda(
            image=functools.partial(
                _normalize_image,
                mean=mean,
                std=std
            )
        ),
        to_tensor,
    ]
    return albu.Compose(preprocess)
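No accepted fix is recorded here, but the traceback itself points at the missing piece: SpaceNet6TestDataset evidently requires an image_paths argument that the snippet above never passes. A hedged sketch, assuming the constructor accepts the glob() result under that keyword (taken from the error message, not from the repo's documentation):

# Guess based on the TypeError alone: hand the dataset the list of test
# images built earlier with glob(); 'image_paths' is the name the
# traceback reports as missing.
test_dataset = SpaceNet6TestDataset(
    config,
    image_paths=image_paths,
    augmentation=augmentation,
    preprocessing=preprocessing,
)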

Airflow - How to push xcom from ecs operator?

In my airflow dag, I have an ecs_operator task followed by a python operator task. I want to push some messages from the ECS task to the python task using the XCom feature of Airflow. I tried the option do_xcom_push=True with no result. Find below a sample DAG.
dag = DAG(
    dag_name, default_args=default_args, schedule_interval=None)

start = DummyOperator(task_id='start', dag=dag)
end = DummyOperator(task_id='end', dag=dag)

ecs_operator_args = {
    'launch_type': 'FARGATE',
    'task_definition': 'task-def:2',
    'cluster': 'cluster-name',
    'region_name': 'region',
    'network_configuration': {
        'awsvpcConfiguration': {}
    }
}

ecs_task = ECSOperator(
    task_id='x_com_test',
    **ecs_operator_args,
    do_xcom_push=True,
    params={'my_param': 'Parameter-1'},
    dag=dag)

def pull_function(**kwargs):
    ti = kwargs['ti']
    msg = ti.xcom_pull(task_ids='x_com_test', key='the_message')
    print("received message: '%s'" % msg)

pull_task = PythonOperator(
    task_id='pull_task',
    python_callable=pull_function,
    provide_context=True,
    dag=dag)

start >> ecs_task >> pull_task >> end
You need to set up a CloudWatch log group for the container.
ECSOperator needs to be extended to support pushing to XCom:
from collections import deque

from airflow.utils import apply_defaults
from airflow.contrib.operators.ecs_operator import ECSOperator


class MyECSOperator(ECSOperator):
    @apply_defaults
    def __init__(self, xcom_push=False, **kwargs):
        super(MyECSOperator, self).__init__(**kwargs)
        self.xcom_push_flag = xcom_push

    def execute(self, context):
        super().execute(context)
        if self.xcom_push_flag:
            return self._last_log_event()

    def _last_log_event(self):
        if self.awslogs_group and self.awslogs_stream_prefix:
            task_id = self.arn.split("/")[-1]
            stream_name = "{}/{}".format(self.awslogs_stream_prefix, task_id)
            events = self.get_logs_hook().get_log_events(self.awslogs_group, stream_name)
            last_event = deque(events, maxlen=1).pop()
            return last_event["message"]
dag = DAG(
    dag_name, default_args=default_args, schedule_interval=None)

start = DummyOperator(task_id='start', dag=dag)
end = DummyOperator(task_id='end', dag=dag)

ecs_operator_args = {
    'launch_type': 'FARGATE',
    'task_definition': 'task-def:2',
    'cluster': 'cluster-name',
    'region_name': 'region',
    'awslogs_group': '/aws/ecs/myLogGroup',
    'awslogs_stream_prefix': 'myStreamPrefix',
    'network_configuration': {
        'awsvpcConfiguration': {}
    }
}

ecs_task = MyECSOperator(
    task_id='x_com_test',
    **ecs_operator_args,
    xcom_push=True,
    params={'my_param': 'Parameter-1'},
    dag=dag)

def pull_function(**kwargs):
    ti = kwargs['ti']
    msg = ti.xcom_pull(task_ids='x_com_test', key='return_value')
    print("received message: '%s'" % msg)

pull_task = PythonOperator(
    task_id='pull_task',
    python_callable=pull_function,
    provide_context=True,
    dag=dag)

start >> ecs_task >> pull_task >> end
ecs_task will take the last event from the log group before it finishes executing, and push it to XCom.
Apache-AWS has a new commit that pretty much implements what @Бојан-Аџиевски mentioned above, so you don't need to write your custom ECSOperator. Available as of version 1.1.0.
All you have to do is provide do_xcom_push=True when calling the ECSOperator, along with the correct awslogs_group and awslogs_stream_prefix.
Make sure your awslogs_stream_prefix follows the following format:
prefix-name/container-name
As this is what ECS directs logs to.
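A minimal sketch of the built-in behaviour, reusing the placeholder names from the examples above (a sketch, not a verified configuration; check your amazon provider version is at least 1.1.0):

ecs_task = ECSOperator(
    task_id='x_com_test',
    launch_type='FARGATE',
    task_definition='task-def:2',
    cluster='cluster-name',
    region_name='region',
    network_configuration={'awsvpcConfiguration': {}},
    awslogs_group='/aws/ecs/myLogGroup',
    awslogs_stream_prefix='myStreamPrefix/container-name',  # prefix-name/container-name
    do_xcom_push=True,  # the last CloudWatch log line is pushed as 'return_value'
    dag=dag,
)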

Apache Airflow - trigger/schedule DAG rerun on completion (File Sensor)

Good Morning.
I'm trying to set up a DAG to:
Watch/sense for a file to hit a network folder
Process the file
Archive the file
Using the tutorials online and StackOverflow I have been able to come up with the following DAG and operator that successfully achieve the objectives; however, I would like the DAG to be rescheduled or rerun on completion so it starts watching/sensing for another file.
I attempted to set a variable max_active_runs:1 and then a schedule_interval: timedelta(seconds=5); this does reschedule the DAG, but it starts queuing tasks and locks the file.
Any ideas welcome on how I could rerun the DAG after the archive_task?
Thanks
DAG CODE
from airflow import DAG
from airflow.operators import PythonOperator, OmegaFileSensor, ArchiveFileOperator
from datetime import datetime, timedelta
from airflow.models import Variable

default_args = {
    'owner': 'glsam',
    'depends_on_past': False,
    'start_date': datetime.now(),
    'provide_context': True,
    'retries': 100,
    'retry_delay': timedelta(seconds=30),
    'max_active_runs': 1,
    'schedule_interval': timedelta(seconds=5),
}

dag = DAG('test_sensing_for_a_file', default_args=default_args)

filepath = Variable.get("soucePath_Test")
filepattern = Variable.get("filePattern_Test")
archivepath = Variable.get("archivePath_Test")

sensor_task = OmegaFileSensor(
    task_id='file_sensor_task',
    filepath=filepath,
    filepattern=filepattern,
    poke_interval=3,
    dag=dag)

def process_file(**context):
    file_to_process = context['task_instance'].xcom_pull(
        key='file_name', task_ids='file_sensor_task')
    file = open(filepath + file_to_process, 'w')
    file.write('This is a test\n')
    file.write('of processing the file')
    file.close()

proccess_task = PythonOperator(
    task_id='process_the_file',
    python_callable=process_file,
    provide_context=True,
    dag=dag
)

archive_task = ArchiveFileOperator(
    task_id='archive_file',
    filepath=filepath,
    archivepath=archivepath,
    dag=dag)

sensor_task >> proccess_task >> archive_task
FILE SENSOR OPERATOR
import os
import re
from datetime import datetime

from airflow.models import BaseOperator
from airflow.plugins_manager import AirflowPlugin
from airflow.utils.decorators import apply_defaults
from airflow.operators.sensors import BaseSensorOperator


class ArchiveFileOperator(BaseOperator):
    @apply_defaults
    def __init__(self, filepath, archivepath, *args, **kwargs):
        super(ArchiveFileOperator, self).__init__(*args, **kwargs)
        self.filepath = filepath
        self.archivepath = archivepath

    def execute(self, context):
        file_name = context['task_instance'].xcom_pull(
            'file_sensor_task', key='file_name')
        os.rename(self.filepath + file_name, self.archivepath + file_name)


class OmegaFileSensor(BaseSensorOperator):
    @apply_defaults
    def __init__(self, filepath, filepattern, *args, **kwargs):
        super(OmegaFileSensor, self).__init__(*args, **kwargs)
        self.filepath = filepath
        self.filepattern = filepattern

    def poke(self, context):
        full_path = self.filepath
        file_pattern = re.compile(self.filepattern)
        directory = os.listdir(full_path)
        for files in directory:
            if re.match(file_pattern, files):
                context['task_instance'].xcom_push('file_name', files)
                return True
        return False


class OmegaPlugin(AirflowPlugin):
    name = "omega_plugin"
    operators = [OmegaFileSensor, ArchiveFileOperator]
Dmitri's method worked perfectly.
I also found in my reading that setting schedule_interval=None and then using the TriggerDagRunOperator worked equally well:
trigger = TriggerDagRunOperator(
    task_id='trigger_dag_RBCPV99_rerun',
    trigger_dag_id="RBCPV99_v2",
    dag=dag)

sensor_task >> proccess_task >> archive_task >> trigger
Set schedule_interval=None and use the airflow trigger_dag command from a BashOperator to launch the next execution at the completion of the previous one.
trigger_next = BashOperator(
    task_id="trigger_next",
    bash_command="airflow trigger_dag 'your_dag_id'", dag=dag)

sensor_task >> proccess_task >> archive_task >> trigger_next
You can start your first run manually with the same airflow trigger_dag command, and then the trigger_next task will automatically trigger the next one. We have used this in production for many months now and it runs perfectly.

Twisted (Scrapy) and Postgres

I'm using Scrapy (which is built on Twisted) and also Postgres as a database.
After a while my connections seem to fill up and then my script gets stuck. I checked this with the query SELECT * FROM pg_stat_activity; and read that it's caused by Postgres having no built-in connection pool.
I read about txpostgres and PgBouncer; PgBouncer regrettably isn't an option, so what else can I do to avoid this problem?
So far I use the following pipeline:
import psycopg2
from twisted.enterprise import adbapi
import logging
from datetime import datetime

import scrapy
from scrapy.exceptions import DropItem


class PostgreSQLPipeline(object):
    """ PostgreSQL pipeline class """

    def __init__(self, dbpool):
        self.logger = logging.getLogger(__name__)
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        dbargs = dict(
            host=settings['POSTGRESQL_HOST'],
            database=settings['POSTGRESQL_DATABASE'],
            user=settings['POSTGRESQL_USER'],
            password=settings['POSTGRESQL_PASSWORD'],
        )
        dbpool = adbapi.ConnectionPool('psycopg2', **dbargs)
        return cls(dbpool)

    def process_item(self, item, spider):
        d = self.dbpool.runInteraction(self._insert_item, item, spider)
        d.addErrback(self._handle_error, item, spider)
        d.addBoth(lambda _: item)
        return d

    def _insert_item(self, txn, item, spider):
        """Perform an insert or update."""
        now = datetime.utcnow().replace(microsecond=0).isoformat(' ')
        txn.execute(
            """
            SELECT EXISTS(
                SELECT 1
                FROM expose
                WHERE expose_id = %s
            )
            """, (
                item['expose_id'],
            )
        )
        ret = txn.fetchone()[0]
        if ret:
            self.logger.info("Item already in db: %r" % (item))
            txn.execute(
                """
                UPDATE expose
                SET last_seen=%s, offline=0
                WHERE expose_id=%s
                """, (
                    now,
                    item['expose_id']
                )
            )
        else:
            self.logger.info("Item stored in db: %r" % (item))
            txn.execute(
                """
                INSERT INTO expose (
                    expose_id,
                    title
                ) VALUES (%s, %s)
                """, (
                    item['expose_id'],
                    item['title']
                )
            )
        # Write image info (path, original url, ...) to db, CONSTRAIN to expose.expose_id
        for image in item['images']:
            txn.execute(
                """
                INSERT INTO image (
                    expose_id,
                    name
                ) VALUES (%s, %s)
                """, (
                    item['expose_id'],
                    image['path'].replace('full/', '')
                )
            )

    def _handle_error(self, failure, item, spider):
        """Handle occurred on db interaction."""
        # do nothing, just log
        self.logger.error(failure, failure.printTraceback())
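One knob that may help, as a sketch rather than a definitive fix: adbapi.ConnectionPool accepts cp_min/cp_max keyword arguments, so the pool itself can be capped instead of relying on an external bouncer. Mirroring the from_settings method above:

# Sketch: cap the Twisted pool so idle psycopg2 connections can't pile up.
# cp_min/cp_max/cp_reconnect are standard adbapi.ConnectionPool kwargs;
# the sizes here are illustrative, not tuned.
dbpool = adbapi.ConnectionPool(
    'psycopg2',
    cp_min=1,           # keep at least one connection open
    cp_max=5,           # hard ceiling on concurrent connections
    cp_reconnect=True,  # reopen connections Postgres drops
    **dbargs,
)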

How can I export Jira issues to BitBucket

I've just moved my project's code from java.net to BitBucket, but my Jira issue tracking is still hosted on java.net. Although BitBucket does have some options for linking to an external issue tracker, I don't think I can use it for java.net, not least because I don't have the admin privileges needed to install the DVCS connector.
So I thought an alternative option would be to export and then import the issues into the BitBucket issue tracker. Is that possible?
Progress so far
So I tried following the steps in both informative answers below, using OSX, but I hit a problem - I'm rather confused about what the script would actually be called, because the answers talk about export.py but no script exists with that name, so I renamed the one I downloaded.
sudo easy_install pip (OSX)
pip install jira
pip install configparser
easy_install -U setuptools
Go to https://bitbucket.org/reece/rcore, select the downloads tab, download the zip and unzip it, and rename the folder to reece (for some reason git clone https://bitbucket.org/reece/rcore fails with an error)
cd reece/rcore
Save script as export.py in rcore subfolder
Replace iteritems with items in export.py
Replace iteritems with items in types/immutabledict.py
Create .config in rcore folder
Create .config/jira-issues-move-to-bitbucket.conf containing
jira-username=paultaylor
jira-hostname=https://java.net/jira/browse/JAUDIOTAGGER
jira-password=password
Run python export.py --jira-project jaudiotagger
gives
macbook:rcore paul$ python export.py --jira-project jaudiotagger
Traceback (most recent call last):
  File "export.py", line 24, in <module>
    import configparser
ImportError: No module named configparser
I needed to run pip install as root, so I did
sudo pip install configparser
and that worked, but now
python export.py --jira-project jaudiotagger
gives
  File "export.py", line 35, in <module>
    from jira.client import JIRA
ImportError: No module named jira.client
You can import issues into BitBucket; they just need to be in the appropriate format. Fortunately, Reece Hart has already written a Python script to connect to a Jira instance and export the issues.
To get the script to run I had to install the Jira Python package as well as the latest version of rcore (if you use pip you get an incompatible previous version, so you have to get the source). I also had to replace all instances of iteritems with items in the script and in rcore/types/immutabledict.py to make it work with Python 3. You will also need to fill in the dictionaries (priority_map, person_map, etc) with the values your project uses. Finally, you need a config file to exist with the connection info (see comments at the top of the script).
The basic command line usage is export.py --jira-project <project>
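For instance, an invocation using the optional flags defined in the script's parse_args (the project key and component names here are placeholders, not from the original posts):

python export.py --jira-project JRA --jira-components web,cli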
Once you've got the data exported, see the instructions for importing issues to BitBucket
#!/usr/bin/env python
"""extract issues from JIRA and export to a bitbucket archive

See:
https://confluence.atlassian.com/pages/viewpage.action?pageId=330796872
https://confluence.atlassian.com/display/BITBUCKET/Mark+up+comments
https://bitbucket.org/tutorials/markdowndemo/overview

2014-04-12 08:26 Reece Hart <reecehart@gmail.com>

Requires a file ~/.config/jira-issues-move-to-bitbucket.conf
with content like
[default]
jira-username=some.user
jira-hostname=somewhere.jira.com
jira-password=ur$pass
"""

import argparse
import collections
import configparser
import glob
import itertools
import json
import logging
import os
import pprint
import re
import sys
import zipfile

from jira.client import JIRA

from rcore.types.immutabledict import ImmutableDict

priority_map = {
    'Critical (P1)': 'critical',
    'Major (P2)': 'major',
    'Minor (P3)': 'minor',
    'Nice (P4)': 'trivial',
}
person_map = {
    'reece.hart': 'reece',
    # etc
}
issuetype_map = {
    'Improvement': 'enhancement',
    'New Feature': 'enhancement',
    'Bug': 'bug',
    'Technical task': 'task',
    'Task': 'task',
}
status_map = {
    'Closed': 'resolved',
    'Duplicate': 'duplicate',
    'In Progress': 'open',
    'Open': 'new',
    'Reopened': 'open',
    'Resolved': 'resolved',
}


def parse_args(argv):
    def sep_and_flatten(l):
        # split comma-sep elements and flatten list
        # e.g., ['a','b','c,d'] -> set('a','b','c','d')
        return list( itertools.chain.from_iterable(e.split(',') for e in l) )

    cf = configparser.ConfigParser()
    cf.readfp(open(os.path.expanduser('~/.config/jira-issues-move-to-bitbucket.conf'),'r'))

    ap = argparse.ArgumentParser(
        description = __doc__
    )
    ap.add_argument(
        '--jira-hostname', '-H',
        default = cf.get('default','jira-hostname',fallback=None),
        help = 'host name of Jira instances (used for url like https://hostname/, e.g., "instancename.jira.com")',
    )
    ap.add_argument(
        '--jira-username', '-u',
        default = cf.get('default','jira-username',fallback=None),
    )
    ap.add_argument(
        '--jira-password', '-p',
        default = cf.get('default','jira-password',fallback=None),
    )
    ap.add_argument(
        '--jira-project', '-j',
        required = True,
        help = 'project key (e.g., JRA)',
    )
    ap.add_argument(
        '--jira-issues', '-i',
        action = 'append',
        default = [],
        help = 'issue id (e.g., JRA-9); multiple and comma-separated okay; default = all in project',
    )
    ap.add_argument(
        '--jira-issues-file', '-I',
        help = 'file containing issue ids (e.g., JRA-9)'
    )
    ap.add_argument(
        '--jira-components', '-c',
        action = 'append',
        default = [],
        help = 'components criterion; multiple and comma-separated okay; default = all in project',
    )
    ap.add_argument(
        '--existing', '-e',
        action = 'store_true',
        default = False,
        help = 'read existing archive (from export) and merge new issues'
    )

    opts = ap.parse_args(argv)
    opts.jira_components = sep_and_flatten(opts.jira_components)
    opts.jira_issues = sep_and_flatten(opts.jira_issues)
    return opts


def link(url,text=None):
    return "[{text}]({url})".format(url=url,text=url if text is None else text)


def reformat_to_markdown(desc):
    def _indent4(mo):
        i = "\n    "
        return i + mo.group(1).replace("\n",i)

    def _repl_mention(mo):
        return "@" + person_map[mo.group(1)]

    #desc = desc.replace("\r","")
    desc = re.sub("{noformat}(.+?){noformat}",_indent4,desc,flags=re.DOTALL+re.MULTILINE)
    desc = re.sub(opts.jira_project+r"-(\d+)",r"issue #\1",desc)
    desc = re.sub(r"\[~([^]]+)\]",_repl_mention,desc)
    return desc


def fetch_issues(opts,jcl):
    jql = [ 'project = ' + opts.jira_project ]
    if opts.jira_components:
        jql += [ ' OR '.join([ 'component = '+c for c in opts.jira_components ]) ]
    if opts.jira_issues:
        jql += [ ' OR '.join([ 'issue = '+i for i in opts.jira_issues ]) ]
    jql_str = ' AND '.join(["("+q+")" for q in jql])
    logging.info('executing query ' + jql_str)
    return jcl.search_issues(jql_str,maxResults=500)


def jira_issue_to_bb_issue(opts,jcl,ji):
    """convert a jira issue to a dictionary with values appropriate for
    POSTing as a bitbucket issue"""
    logger = logging.getLogger(__name__)

    content = reformat_to_markdown(ji.fields.description) if ji.fields.description else ''

    if ji.fields.assignee is None:
        resp = None
    else:
        resp = person_map[ji.fields.assignee.name]

    reporter = person_map[ji.fields.reporter.name]

    jiw = jcl.watchers(ji.key)
    watchers = [ person_map[u.name] for u in jiw.watchers ] if jiw else []

    milestone = None
    if ji.fields.fixVersions:
        vnames = [ v.name for v in ji.fields.fixVersions ]
        milestone = vnames[0]
        if len(vnames) > 1:
            logger.warn("{ji.key}: bitbucket issues may have only 1 milestone (JIRA fixVersion); using only first ({f}) and ignoring rest ({r})".format(
                ji=ji, f=milestone, r=",".join(vnames[1:])))

    issue_id = extract_issue_number(ji.key)

    bbi = {
        'status': status_map[ji.fields.status.name],
        'priority': priority_map[ji.fields.priority.name],
        'kind': issuetype_map[ji.fields.issuetype.name],
        'content_updated_on': ji.fields.created,
        'voters': [],
        'title': ji.fields.summary,
        'reporter': reporter,
        'component': None,
        'watchers': watchers,
        'content': content,
        'assignee': resp,
        'created_on': ji.fields.created,
        'version': None,  # ?
        'edited_on': None,
        'milestone': milestone,
        'updated_on': ji.fields.updated,
        'id': issue_id,
    }
    return bbi


def jira_comment_to_bb_comment(opts,jcl,jc):
    bbc = {
        'content': reformat_to_markdown(jc.body),
        'created_on': jc.created,
        'id': int(jc.id),
        'updated_on': jc.updated,
        'user': person_map[jc.author.name],
    }
    return bbc


def extract_issue_number(jira_issue_key):
    return int(jira_issue_key.split('-')[-1])


def jira_key_to_bb_issue_tag(jira_issue_key):
    return 'issue #' + str(extract_issue_number(jira_issue_key))


def jira_link_text(jk):
    return link("https://invitae.jira.com/browse/"+jk,jk) + " (Invitae access required)"


if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger(__name__)

    opts = parse_args(sys.argv[1:])
    dir_name = opts.jira_project
    if opts.jira_components:
        dir_name += '-' + ','.join(opts.jira_components)

    if opts.jira_issues_file:
        issues = [i.strip() for i in open(opts.jira_issues_file,'r')]
        logger.info("added {n} issues from {opts.jira_issues_file} to issues list".format(n=len(issues),opts=opts))
        opts.jira_issues += issues

    opts.dir = os.path.join('/','tmp',dir_name)
    opts.att_rel_dir = 'attachments'
    opts.att_abs_dir = os.path.join(opts.dir,opts.att_rel_dir)
    opts.json_fn = os.path.join(opts.dir,'db-1.0.json')
    if not os.path.isdir(opts.att_abs_dir):
        os.makedirs(opts.att_abs_dir)

    opts.jira_issues = list(set(opts.jira_issues))  # distinctify

    jcl = JIRA({'server': 'https://{opts.jira_hostname}/'.format(opts=opts)},
               basic_auth=(opts.jira_username,opts.jira_password))

    if opts.existing:
        issues_db = json.load(open(opts.json_fn,'r'))
        existing_ids = [ i['id'] for i in issues_db['issues'] ]
        logger.info("read {n} issues from {fn}".format(n=len(existing_ids),fn=opts.json_fn))
    else:
        issues_db = dict()
        issues_db['meta'] = {
            'default_milestone': None,
            'default_assignee': None,
            'default_kind': "bug",
            'default_component': None,
            'default_version': None,
        }
        issues_db['attachments'] = []
        issues_db['comments'] = []
        issues_db['issues'] = []
        issues_db['logs'] = []
        issues_db['components'] = [ {'name':v.name} for v in jcl.project_components(opts.jira_project) ]
        issues_db['milestones'] = [ {'name':v.name} for v in jcl.project_versions(opts.jira_project) ]
        issues_db['versions'] = issues_db['milestones']

    # bb_issue_map: bb issue # -> bitbucket issue
    bb_issue_map = ImmutableDict( (i['id'],i) for i in issues_db['issues'] )

    # jk_issue_map: jira key -> bitbucket issue
    # contains only items migrated from JIRA (i.e., not preexisting issues with --existing)
    jk_issue_map = ImmutableDict()

    # issue_links is a dict of dicts of lists, using JIRA keys
    # e.g., links['CORE-135']['depends on'] = ['CORE-137']
    issue_links = collections.defaultdict(lambda: collections.defaultdict(lambda: []))

    issues = fetch_issues(opts,jcl)
    logger.info("fetch {n} issues from JIRA".format(n=len(issues)))

    for ji in issues:
        # Pfft. Need to fetch the issue again due to bug in JIRA.
        # See https://bitbucket.org/bspeakmon/jira-python/issue/47/, comment on 2013-10-01 by ssonic
        ji = jcl.issue(ji.key,expand="attachments,comments")

        # create the issue
        bbi = jira_issue_to_bb_issue(opts,jcl,ji)
        issues_db['issues'] += [bbi]
        bb_issue_map[bbi['id']] = bbi
        jk_issue_map[ji.key] = bbi
        issue_links[ji.key]['imported from'] = [jira_link_text(ji.key)]

        # add comments
        for jc in ji.fields.comment.comments:
            bbc = jira_comment_to_bb_comment(opts,jcl,jc)
            bbc['issue'] = bbi['id']
            issues_db['comments'] += [bbc]

        # add attachments
        for ja in ji.fields.attachment:
            att_rel_path = os.path.join(opts.att_rel_dir,ja.id)
            att_abs_path = os.path.join(opts.att_abs_dir,ja.id)
            if not os.path.exists(att_abs_path):
                open(att_abs_path,'w').write(ja.get())
                logger.info("Wrote {att_abs_path}".format(att_abs_path=att_abs_path))
            bba = {
                "path": att_rel_path,
                "issue": bbi['id'],
                "user": person_map[ja.author.name],
                "filename": ja.filename,
            }
            issues_db['attachments'] += [bba]

        # parent-child is task-subtask
        if hasattr(ji.fields,'parent'):
            issue_links[ji.fields.parent.key]['subtasks'].append(jira_key_to_bb_issue_tag(ji.key))
            issue_links[ji.key]['parent task'].append(jira_key_to_bb_issue_tag(ji.fields.parent.key))

        # add links
        for il in ji.fields.issuelinks:
            if hasattr(il,'outwardIssue'):
                issue_links[ji.key][il.type.outward].append(jira_key_to_bb_issue_tag(il.outwardIssue.key))
            elif hasattr(il,'inwardIssue'):
                issue_links[ji.key][il.type.inward].append(jira_key_to_bb_issue_tag(il.inwardIssue.key))

        logger.info("migrated issue {ji.key}: {ji.fields.summary} ({components})".format(
            ji=ji,components=','.join(c.name for c in ji.fields.components)))

    # append links section to content
    # this section shows both task-subtask and "issue link" relationships
    for src,dstlinks in issue_links.iteritems():
        if src not in jk_issue_map:
            logger.warn("issue {src}, with issue_links, not in jk_issue_map; skipping".format(src=src))
            continue
        links_block = "Links\n=====\n"
        for desc,dsts in sorted(dstlinks.iteritems()):
            links_block += "* **{desc}**: {links} \n".format(desc=desc,links=", ".join(dsts))
        if jk_issue_map[src]['content']:
            jk_issue_map[src]['content'] += "\n\n" + links_block
        else:
            jk_issue_map[src]['content'] = links_block

    id_counts = collections.Counter(i['id'] for i in issues_db['issues'])
    dupes = [ k for k,cnt in id_counts.iteritems() if cnt>1 ]
    if dupes:
        raise RuntimeError("{n} issue ids appear more than once from existing {opts.json_fn}".format(
            n=len(dupes),opts=opts))

    json.dump(issues_db,open(opts.json_fn,'w'))
    logger.info("wrote {n} issues to {opts.json_fn}".format(n=len(id_counts),opts=opts))

    # write zipfile
    os.chdir(opts.dir)
    with zipfile.ZipFile(opts.dir + '.zip','w') as zf:
        for fn in ['db-1.0.json']+glob.glob('attachments/*'):
            zf.write(fn)
            logger.info("added {fn} to archive".format(fn=fn))
NOTE: I'm writing a new answer because writing this in a comment would be horrible, but most of the credit goes to @Turch's answer.
My steps (in OSX and Debian machines, both worked fine):
apt-get install python-pip (Debian) or sudo easy_install pip (OSX)
pip install jira
pip install configparser
easy_install -U setuptools (not sure if really needed)
Download or clone the source code from https://bitbucket.org/reece/rcore/ in your home folder, for example. Note: don't download using pip, it will get the 0.0.2 version and you need the 0.0.3.
Download the Python script created by Reece, mentioned by @Turch, and place it inside the rcore folder.
Follow the instructions by @Turch: I also had to replace all instances of iteritems with items in the script and in rcore/types/immutabledict.py to make it work with Python 3. You will also need to fill in the dictionaries (priority_map, person_map, etc.) with the values your project uses. Finally, you need a config file to exist with the connection info (see comments at the top of the script). Note: I used a hostname like jira.domain.com (no http or https).
(This change did the trick for me) I had to change part of line 250 from 'https://{opts.jira_hostname}/' to 'http://{opts.jira_hostname}/'
To finish, run the script as @Turch mentioned: The basic command line usage is export.py --jira-project <project>
The file was placed in /tmp/.zip for me.
The file was perfectly accepted in the BitBucket importer today.
Hooray for Reece and Turch! Thanks guys!