Using input function with remote files in snakemake - kubernetes

I want to use a function to read inputs file paths from a dataframe and send them to my snakemake rule. I also have a helper function to select the remote from which to pull the files.
from snakemake.remote.GS import RemoteProvider as GSRemoteProvider
from snakemake.remote.SFTP import RemoteProvider as SFTPRemoteProvider
from os.path import join
import pandas as pd
configfile: "config.yaml"
units = pd.read_csv(config["units"]).set_index(["library", "unit"], drop=False)
TMP= join('data', 'tmp')
def access_remote(local_path):
""" Connnects to remote as defined in config file"""
provider = config['provider']
if provider == 'GS':
GS = GSRemoteProvider()
remote_path = GS.remote(join("gs://" + config['bucket'], local_path))
elif provider == 'SFTP':
SFTP = SFTPRemoteProvider(
username=config['user'],
private_key=config['ssh_key']
)
remote_path = SFTP.remote(
config['host'] + ":22" + join(base_path, local_path)
)
else:
remote_path = local_path
return remote_path
def get_fastqs(wc):
"""
Get fastq files (units) of a particular library - sample
combination from the unit sheet.
"""
fqs = units.loc[
(units.library == wc.library) &
(units.libtype == wc.libtype),
"fq1"
]
return {
"r1": list(map(access_remote, fqs.fq1.values)),
}
# Combine all fastq files from the same sample / library type combination
rule combine_units:
input: unpack(get_fastqs)
output:
r1 = join(TMP, "reads", "{library}_{libtype}.end1.fq.gz")
threads: 12
run:
shell("cat {i1} > {o1}".format(i1=input['r1'], o1=output['r1']))
My config file contains the bucket name and provider, which are passed to the function. This works as expected when running simply snakemake.
However, I would like to use the kubernetes integration, which requires passing the provider and bucket name in the command line. But when I run:
snakemake -n --kubernetes --default-remote-provider GS --default-remote-prefix bucket-name
I get this error:
ERROR :: MissingInputException in line 19 of Snakefile:
Missing input files for rule combine_units:
bucket-name/['bucket-name/lib1-unit1.end1.fastq.gz', 'bucket-name/lib1-unit2.end1.fastq.gz', 'bucket-name/lib1-unit3.end1.fastq.gz']
The bucket is applied twice (once mapped correctly to each element, and once before the whole list (which gets converted to a string). Did I miss something ? Is there a good way to work around this ?

Related

Output file not created when running a R command in a Nextflow file?

I am trying to run a nextflow pipeline but the output file is not created.
The main.nf file looks like this:
#!/usr/bin/env nextflow
nextflow.enable.dsl=2
process my_script {
"""
Rscript script.R
"""
}
workflow {
my_script
}
In my nextflow.config I have:
process {
executor = 'k8s'
container = 'rocker/r-ver:4.1.3'
}
The script.R looks like this:
FUN <- readRDS("function.rds");
input = readRDS("input.rds");
output = FUN(
singleCell_data_input = input[[1]], savePath = input[[2]], tmpDirGC = input[[3]]
);
saveRDS(output, "output.rds")
After running nextflow run main.nf the output.rds is not created
Nextflow processes are run independently and isolated from each other from inside the working directory. For your script to be able to find the required input files, these must be localized inside the process working directory. This should be done by defining an input block and declaring the files using the path qualifier, for example:
params.function_rds = './function.rds'
params.input_rds = './input.rds'
process my_script {
input:
path my_function_rds
path my_input_rds
output:
path "output.rds"
"""
#!/usr/bin/env Rscript
FUN <- readRDS("${my_function_rds}");
input = readRDS("${my_input_rds}");
output = FUN(
singleCell_data_input=input[[1]], savePath=input[[2]], tmpDirGC=input[[3]]
);
saveRDS(output, "output.rds")
"""
}
workflow {
function_rds = file( params.function_rds )
input_rds = file( params.input_rds )
my_script( function_rds, input_rds )
my_script.out.view()
}
In the same way, the script itself would need to be localized inside the process working directory. To avoid specifying an absolute path to your R script (which would not make your workflow portable at all), it's possible to simply embed your code, making sure to specify the Rscript shebang. This works because process scripts are not limited to Bash1.
Another way, would be to make your Rscript executable and move it into a directory called bin in the the root directory of your project repository (i.e. the same directory as your 'main.nf' Nextflow script). Nextflow automatically adds this folder to the $PATH environment variable and your script would become automatically accessible to each of your pipeline processes. For this to work, you'd need some way to pass in the input files as command line arguments. For example:
params.function_rds = './function.rds'
params.input_rds = './input.rds'
process my_script {
input:
path my_function_rds
path my_input_rds
output:
path "output.rds"
"""
script.R "${my_function_rds}" "${my_input_rds}" output.rds
"""
}
workflow {
function_rds = file( params.function_rds )
input_rds = file( params.input_rds )
my_script( function_rds, input_rds )
my_script.out.view()
}
And your R script might look like:
#!/usr/bin/env Rscript
args <- commandArgs(trailingOnly = TRUE)
FUN <- readRDS(args[1]);
input = readRDS(args[2]);
output = FUN(
singleCell_data_input=input[[1]], savePath=input[[2]], tmpDirGC=input[[3]]
);
saveRDS(output, args[3])

Container keeps on crashing while creating a deployment from a docker image in minikube

i have docker image containing python files which should download satellite imageries from scihub website. The docker image is working fine. Now when i want to create the deployment thorugh kubectl so that i can expose it as a service, its's container keeps on crashing. That's what the pod description says when seen through kubectl describe pod.
this is how i am trying to deploy sudo kubectl run back --image=back:latest --port=8080 --image-pull-policy Never. i also tried changing the port but it did not work. Here are the files within docker image.
Docker File
FROM python:3.7-stretch
COPY . /code
WORKDIR /code
RUN pip install -r requirements.txt
ENTRYPOINT ["python", "ingestion.py"]
** ingestion **
import os
import shutil
import logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(name)s - %(message)s')
logger = logging.getLogger("ingestion")
import requests
import datahub
scihub_username = os.environ["scihub_username"]
scihub_password = os.environ["scihub_password"]
result_url = "http://" + os.environ["CDINRW_BASE_URL"] + "/jobs/" + os.environ["CDINRW_JOB_ID"] + "/results"
logger.info("Searching the Copernicus Open Access Hub")
scenes = datahub.search(username=scihub_username,
password=scihub_password,
producttype=os.getenv("producttype"),
platformname=os.getenv("platformname"),
days_back=os.getenv("days_back", 2),
footprint=os.getenv("footprint"),
max_cloud_cover_percentage=os.getenv("max_cloud_cover_percentage"),
start_date = os.getenv("start_date"),
end_date = os.getenv("end_date"))
logger.info("Found {} relevant scenes".format(len(scenes)))
job_results = []
for scene in scenes:
# do not donwload a scene that has already been ingested
if os.path.exists(os.path.join("/out_data", scene["title"]+".SAFE")):
logger.info("The scene {} already exists in /out_data and will not be downloaded again.".format(scene["title"]))
filename = scene["title"]+".SAFE"
else:
logger.info("Starting the download of scene {}".format(scene["title"]))
filename = datahub.download(scene, "/tmp", scihub_username, scihub_password, unpack=True)
logger.info("The download was successful.")
shutil.move(filename, "/out_data")
result_message = {"description": "test",
"type": "Raster",
"format": "SAFE",
"filename": os.path.basename(filename)}
job_results.append(result_message)
res = requests.put(result_url, json=job_results, timeout=60)
res.raise_for_status()
*datahub
import logging
import os
import urllib.parse
import zipfile
import requests
# constructing URLs for querying the data hub
_BASE_URL = "https://scihub.copernicus.eu/dhus/"
SITE = {}
SITE["SEARCH"] = _BASE_URL + "search?format=xml&sortedby=beginposition&order=desc&rows=100&start={offset}&q="
_PRODUCT_URL = _BASE_URL + "odata/v1/Products('{uuid}')/"
SITE["CHECKSUM"] = _PRODUCT_URL + "Checksum/Value/$value"
SITE["SAFEZIP"] = _PRODUCT_URL + "$value"
logger = logging.getLogger(__name__)
def _build_search_url(producttype=None, platformname=None, days_back=2, footprint=None, max_cloud_cover_percentage=None, start_date=None, end_date=None):
search_terms = []
if producttype:
search_terms.append("producttype:{}".format(producttype))
if platformname:
search_terms.append("platformname:{}".format(platformname))
if start_date and end_date:
search_terms.append(
"beginPosition:[{}+TO+{}]".format(start_date, end_date))
elif days_back:
search_terms.append(
"beginPosition:[NOW-{}DAYS+TO+NOW]".format(days_back))
if footprint:
search_terms.append("footprint:%22Intersects({})%22".format(
footprint.replace(" ", "+")))
if max_cloud_cover_percentage:
search_terms.append("cloudcoverpercentage:[0+TO+{}]".format(max_cloud_cover_percentage))
url = SITE["SEARCH"] + "+AND+".join(search_terms)
return url
def _unpack(zip_file, directory, remove_after=False):
with zipfile.ZipFile(zip_file) as zf:
# This assumes that the zipfile only contains the .SAFE directory at root level
safe_path = zf.namelist()[0]
zf.extractall(path=directory)
if remove_after:
os.remove(zip_file)
return os.path.normpath(os.path.join(directory, safe_path))
def search(username, password, producttype=None, platformname=None ,days_back=2, footprint=None, max_cloud_cover_percentage=None, start_date=None, end_date=None):
""" Search the Copernicus SciHub
Parameters
----------
username : str
user name for the Copernicus SciHub
password : str
password for the Copernicus SciHub
producttype : str, optional
product type to filter for in the query (see https://scihub.copernicus.eu/userguide/FullTextSearch#Search_Keywords for allowed values)
platformname : str, optional
plattform name to filter for in the query (see https://scihub.copernicus.eu/userguide/FullTextSearch#Search_Keywords for allowed values)
days_back : int, optional
number of days before today that will be searched. Default are the last 2 days. If start and end date are set the days_back parameter is ignored
footprint : str, optional
well-known-text representation of the footprint
max_cloud_cover_percentage: str, optional
percentage of cloud cover per scene. Can only be used in combination with Sentinel-2 imagery.
(see https://scihub.copernicus.eu/userguide/FullTextSearch#Search_Keywords for allowed values)
start_date: str, optional
start point of the search extent has to be used in combination with end_date
end_date: str, optional
end_point of the search extent has to be used in combination with start_date
Returns
-------
list
a list of scenes that match the search parameters
"""
import xml.etree.cElementTree as ET
scenes = []
search_url = _build_search_url(producttype, platformname, days_back, footprint, max_cloud_cover_percentage, start_date, end_date)
logger.info("Search URL: {}".format(search_url))
offset = 0
rowsBreak = 5000
name_space = {"atom": "http://www.w3.org/2005/Atom",
"opensearch": "http://a9.com/-/spec/opensearch/1.1/"}
while offset < rowsBreak: # Next pagination page:
response = requests.get(search_url.format(offset=offset), auth=(username, password))
root = ET.fromstring(response.content)
if offset == 0:
rowsBreak = int(
root.find("opensearch:totalResults", name_space).text)
for e in root.iterfind("atom:entry", name_space):
uuid = e.find("atom:id", name_space).text
title = e.find("atom:title", name_space).text
begin_position = e.find(
"atom:date[#name='beginposition']", name_space).text
end_position = e.find(
"atom:date[#name='endposition']", name_space).text
footprint = e.find("atom:str[#name='footprint']", name_space).text
scenes.append({
"id": uuid,
"title": title,
"begin_position": begin_position,
"end_position": end_position,
"footprint": footprint})
# Ultimate DHuS pagination page size limit (rows per page).
offset += 100
return scenes
def download(scene, directory, username, password, unpack=True):
""" Download a Sentinel scene based on its uuid
Parameters
----------
scene : dict
the scene to be downloaded
path : str
the path where the file will be downloaded to
username : str
username for the Copernicus SciHub
password : str
password for the Copernicus SciHub
unpack: boolean, optional
flag that defines whether the downloaded product should be unpacked after download. defaults to true
Raises
------
ValueError
if the size of the downloaded file does not match the Content-Length header
ValueError
if the checksum of the downloaded file does not match the checksum provided by the Copernicus SciHub
Returns
-------
str
path to the downloaded file
"""
import hashlib
md5hash = hashlib.md5()
md5sum = requests.get(SITE["CHECKSUM"].format(
uuid=scene["id"]), auth=(username, password)).text
download_path = os.path.join(directory, scene["title"] + ".zip")
# overwrite if path already exists
if os.path.exists(download_path):
os.remove(download_path)
url = SITE["SAFEZIP"].format(uuid=scene["id"])
rsp = requests.get(url, auth=(username, password), stream=True)
cl = rsp.headers.get("Content-Length")
size = int(cl) if cl else -1
# Actually fetch now:
with open(download_path, "wb") as f: # Do not read as a whole into memory:
written = 0
for block in rsp.iter_content(8192):
f.write(block)
written += len(block)
md5hash.update(block)
written = os.path.getsize(download_path)
if size > -1 and written != size:
raise ValueError("{}: size mismatch, {} bytes written but expected {} bytes to write!".format(
download_path, written, size))
elif md5sum:
calculated = md5hash.hexdigest()
expected = md5sum.lower()
POD events
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning BackOff 2m39s (x18636 over 2d19h) kubelet, minikube Back-off restarting failed container
the system which wants to use this service already has another main front end service running(which just runs the application ) on 8081 so maybe i need to expose this on the same port. How can i make the deployments running?

How to use multiple pytest conftest files in one test run with a duplicated parser.addoption?

I have a pytest testing project running selenium tests that has a structure like:
ProjRoot
|
|_Pytest.ini
|_____________TestFolderA
| |
| |_test_folderA_tests1.py
| |_test_folderA_tests2.py
|
|____________TestFolderB
| |
| |_test_folderB_test1.py
| |_test_folderA_tests2.py
|
|
|___________TestHelperModules
| |
| |_VariousTestHelperModules
|
|____________DriversAndTools
|___(contains chromedriver.exe, firefox profile folder etc)
I have a confTest.py file which I currently run in the ProjRoot, which I use as a setup and tear down for establishing the browser session for each test that is run. It runs each test twice. Once for Chrome and once for Firefox. In my tests I just utilise the resulting driver fixture. The conftest file is as below:
#conftest.py
import pytest
import os
import rootdir_ref
from selenium.webdriver.common.keys import Keys
import time
from webdriverwrapper.pytest import *
from webdriverwrapper import Chrome
from webdriverwrapper import DesiredCapabilities
from webdriverwrapper import Firefox
from webdriverwrapper import FirefoxProfile
#when running tests from command line we should be able to pass --url=www..... for a different website, check what order these definitions need to be in
def pytest_addoption(parser):
parser.addoption('--url', default='https://test1.testsite.com.au')
#pytest.fixture(scope='function')
def url(request):
return request.config.option.url
browsers = {
'firefox': Firefox,
'chrome': Chrome,
}
#pytest.fixture(scope='function',
params=browsers.keys())
def browser(request):
if request.param == 'firefox':
firefox_capabilities = DesiredCapabilities.FIREFOX
firefox_capabilities['marionette'] = True
firefox_capabilities['handleAlerts'] = True
theRootDir = os.path.dirname(rootdir_ref.__file__)
ffProfilePath = os.path.join(theRootDir, 'DriversAndTools', 'FirefoxSeleniumProfile')
geckoDriverPath = os.path.join(theRootDir, 'DriversAndTools', 'geckodriver.exe')
profile = FirefoxProfile(profile_directory=ffProfilePath)
print (ffProfilePath)
print (geckoDriverPath)
b = browsers[request.param](firefox_profile=profile, capabilities=firefox_capabilities, executable_path=geckoDriverPath)
elif request.param == 'chrome':
desired_cap = DesiredCapabilities.CHROME
desired_cap['chromeOptions'] = {}
desired_cap['chromeOptions']['args'] = ['--disable-plugins', '--disable-extensions']
theRootDir = os.path.dirname(rootdir_ref.__file__)
chromeDriverPath = os.path.join(theRootDir, 'DriversAndTools', 'chromedriver.exe')
b = browsers[request.param](chromeDriverPath)
else:
b = browsers[request.param]()
request.addfinalizer(lambda *args: b.quit())
return b
#pytest.fixture(scope='function')
def driver(browser, url):
driver = browser
driver.maximize_window()
driver.get(url)
return driver
What I’d like to do is have a conftest file in each Test Folder instead of the ProjRoot. But if I take this existing conftest file and put it in each test folder and then run pytest from the project root using
python –m pytest
letting pytest pickup the test directories from pytest.ini (expecting the test folders to run with their respectively contained conftest files) I have issues with the parser.addoption --url already having been added. The end of the error message is:
ClientScripts\conftest.py:19: in pytest_addoption
parser.addoption('--url', default='https://test1.coreplus.com.au/coreplus01')
..\..\..\VirtEnv\VirtEnv\lib\site-packages\_pytest\config.py:521: in addoption
self._anonymous.addoption(*opts, **attrs)
..\..\..\VirtEnv\VirtEnv\lib\site-packages\_pytest\config.py:746: in addoption
raise ValueError("option names %s already added" % conflict)
E ValueError: option names {'--url'} already added
The purpose of the --url addoption is so I can override the defaults in the conftest file at commandline if I want to point them all to a different url at the same time, but otherwise let them default to running to different url's as specified in their conftest files.
I had a similar issue.
Error was gone after removing all cached files and venv.

How can I export Jira issues to BitBucket

Ive just moved my projects code from java.net to BitBucket. But my jira issue tracking is still hosted on java.net, although BitBucket does have some options for linking to an external issue tracker I don't think I can use it for java.net, not least because I do not have the admin priviledges need to install the DVCS connector.
So I thought an alternative option would be to export and then import the issues into BitBucket issue tracker, is that possible ?
Progress so far
So I tried following the steps in both informative answers using OSX below but I hit a problem - I'm rather confused about what the script would actually be called because in the answers it talks about export.py but no such script exists with that name so I renamed the one I downloaded.
sudo easy_install pip (OSX)
pip install jira
pip install configparser
easy_install -U setuptools
Go to https://bitbucket.org/reece/rcore, select downloads tab, download zip and unzip, and rename to reece ( for some reason git clone https://bitbucket.org/reece/rcore fails with error)
cd reece/rcore
Save script as export.py in rcore subfolder
Replace iteritems with items in import.py
Replace iteritems with types/immutabledict.py
Create .config in rcore folder
Create .config/jira-issues-move-to-bitbucket.conf containing
jira-username=paultaylor
jira-hostname=https://java.net/jira/browse/JAUDIOTAGGER
jira-password=password
Run python export.py --jira-project jaudiotagger
gives
macbook:rcore paul$ python export.py --jira-project jaudiotagger
Traceback (most recent call last):
File "export.py", line 24, in <module>
import configparser
ImportError: No module named configparser
- Run python export.py --jira-project jaudiotagger
I need to run pip insdtall as root so did
sudo pip install configparser
and that worked
but now
python export.py --jira.project jaudiotagger
gives
File "export.py" line 35, in <module?
from jira.client import JIRA
ImportError: No module named jira.client
You can import issues into BitBucket, they just need to be in the appropriate format. Fortunately, Reece Hart has already written a Python script to connect to a Jira instance and export the issues.
To get the script to run I had to install the Jira Python package as well as the latest version of rcore (if you use pip you get an incompatible previous version, so you have to get the source). I also had to replace all instances of iteritems with items in the script and in rcore/types/immutabledict.py to make it work with Python 3. You will also need to fill in the dictionaries (priority_map, person_map, etc) with the values your project uses. Finally, you need a config file to exist with the connection info (see comments at the top of the script).
The basic command line usage is export.py --jira-project <project>
Once you've got the data exported, see the instructions for importing issues to BitBucket
#!/usr/bin/env python
"""extract issues from JIRA and export to a bitbucket archive
See:
https://confluence.atlassian.com/pages/viewpage.action?pageId=330796872
https://confluence.atlassian.com/display/BITBUCKET/Mark+up+comments
https://bitbucket.org/tutorials/markdowndemo/overview
2014-04-12 08:26 Reece Hart <reecehart#gmail.com>
Requires a file ~/.config/jira-issues-move-to-bitbucket.conf
with content like
[default]
jira-username=some.user
jira-hostname=somewhere.jira.com
jira-password=ur$pass
"""
import argparse
import collections
import configparser
import glob
import itertools
import json
import logging
import os
import pprint
import re
import sys
import zipfile
from jira.client import JIRA
from rcore.types.immutabledict import ImmutableDict
priority_map = {
'Critical (P1)': 'critical',
'Major (P2)': 'major',
'Minor (P3)': 'minor',
'Nice (P4)': 'trivial',
}
person_map = {
'reece.hart': 'reece',
# etc
}
issuetype_map = {
'Improvement': 'enhancement',
'New Feature': 'enhancement',
'Bug': 'bug',
'Technical task': 'task',
'Task': 'task',
}
status_map = {
'Closed': 'resolved',
'Duplicate': 'duplicate',
'In Progress': 'open',
'Open': 'new',
'Reopened': 'open',
'Resolved': 'resolved',
}
def parse_args(argv):
def sep_and_flatten(l):
# split comma-sep elements and flatten list
# e.g., ['a','b','c,d'] -> set('a','b','c','d')
return list( itertools.chain.from_iterable(e.split(',') for e in l) )
cf = configparser.ConfigParser()
cf.readfp(open(os.path.expanduser('~/.config/jira-issues-move-to-bitbucket.conf'),'r'))
ap = argparse.ArgumentParser(
description = __doc__
)
ap.add_argument(
'--jira-hostname', '-H',
default = cf.get('default','jira-hostname',fallback=None),
help = 'host name of Jira instances (used for url like https://hostname/, e.g., "instancename.jira.com")',
)
ap.add_argument(
'--jira-username', '-u',
default = cf.get('default','jira-username',fallback=None),
)
ap.add_argument(
'--jira-password', '-p',
default = cf.get('default','jira-password',fallback=None),
)
ap.add_argument(
'--jira-project', '-j',
required = True,
help = 'project key (e.g., JRA)',
)
ap.add_argument(
'--jira-issues', '-i',
action = 'append',
default = [],
help = 'issue id (e.g., JRA-9); multiple and comma-separated okay; default = all in project',
)
ap.add_argument(
'--jira-issues-file', '-I',
help = 'file containing issue ids (e.g., JRA-9)'
)
ap.add_argument(
'--jira-components', '-c',
action = 'append',
default = [],
help = 'components criterion; multiple and comma-separated okay; default = all in project',
)
ap.add_argument(
'--existing', '-e',
action = 'store_true',
default = False,
help = 'read existing archive (from export) and merge new issues'
)
opts = ap.parse_args(argv)
opts.jira_components = sep_and_flatten(opts.jira_components)
opts.jira_issues = sep_and_flatten(opts.jira_issues)
return opts
def link(url,text=None):
return "[{text}]({url})".format(url=url,text=url if text is None else text)
def reformat_to_markdown(desc):
def _indent4(mo):
i = " "
return i + mo.group(1).replace("\n",i)
def _repl_mention(mo):
return "#" + person_map[mo.group(1)]
#desc = desc.replace("\r","")
desc = re.sub("{noformat}(.+?){noformat}",_indent4,desc,flags=re.DOTALL+re.MULTILINE)
desc = re.sub(opts.jira_project+r"-(\d+)",r"issue #\1",desc)
desc = re.sub(r"\[~([^]]+)\]",_repl_mention,desc)
return desc
def fetch_issues(opts,jcl):
jql = [ 'project = ' + opts.jira_project ]
if opts.jira_components:
jql += [ ' OR '.join([ 'component = '+c for c in opts.jira_components ]) ]
if opts.jira_issues:
jql += [ ' OR '.join([ 'issue = '+i for i in opts.jira_issues ]) ]
jql_str = ' AND '.join(["("+q+")" for q in jql])
logging.info('executing query ' + jql_str)
return jcl.search_issues(jql_str,maxResults=500)
def jira_issue_to_bb_issue(opts,jcl,ji):
"""convert a jira issue to a dictionary with values appropriate for
POSTing as a bitbucket issue"""
logger = logging.getLogger(__name__)
content = reformat_to_markdown(ji.fields.description) if ji.fields.description else ''
if ji.fields.assignee is None:
resp = None
else:
resp = person_map[ji.fields.assignee.name]
reporter = person_map[ji.fields.reporter.name]
jiw = jcl.watchers(ji.key)
watchers = [ person_map[u.name] for u in jiw.watchers ] if jiw else []
milestone = None
if ji.fields.fixVersions:
vnames = [ v.name for v in ji.fields.fixVersions ]
milestone = vnames[0]
if len(vnames) > 1:
logger.warn("{ji.key}: bitbucket issues may have only 1 milestone (JIRA fixVersion); using only first ({f}) and ignoring rest ({r})".format(
ji=ji, f=milestone, r=",".join(vnames[1:])))
issue_id = extract_issue_number(ji.key)
bbi = {
'status': status_map[ji.fields.status.name],
'priority': priority_map[ji.fields.priority.name],
'kind': issuetype_map[ji.fields.issuetype.name],
'content_updated_on': ji.fields.created,
'voters': [],
'title': ji.fields.summary,
'reporter': reporter,
'component': None,
'watchers': watchers,
'content': content,
'assignee': resp,
'created_on': ji.fields.created,
'version': None, # ?
'edited_on': None,
'milestone': milestone,
'updated_on': ji.fields.updated,
'id': issue_id,
}
return bbi
def jira_comment_to_bb_comment(opts,jcl,jc):
bbc = {
'content': reformat_to_markdown(jc.body),
'created_on': jc.created,
'id': int(jc.id),
'updated_on': jc.updated,
'user': person_map[jc.author.name],
}
return bbc
def extract_issue_number(jira_issue_key):
return int(jira_issue_key.split('-')[-1])
def jira_key_to_bb_issue_tag(jira_issue_key):
return 'issue #' + str(extract_issue_number(jira_issue_key))
def jira_link_text(jk):
return link("https://invitae.jira.com/browse/"+jk,jk) + " (Invitae access required)"
if __name__ == '__main__':
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
opts = parse_args(sys.argv[1:])
dir_name = opts.jira_project
if opts.jira_components:
dir_name += '-' + ','.join(opts.jira_components)
if opts.jira_issues_file:
issues = [i.strip() for i in open(opts.jira_issues_file,'r')]
logger.info("added {n} issues from {opts.jira_issues_file} to issues list".format(n=len(issues),opts=opts))
opts.jira_issues += issues
opts.dir = os.path.join('/','tmp',dir_name)
opts.att_rel_dir = 'attachments'
opts.att_abs_dir = os.path.join(opts.dir,opts.att_rel_dir)
opts.json_fn = os.path.join(opts.dir,'db-1.0.json')
if not os.path.isdir(opts.att_abs_dir):
os.makedirs(opts.att_abs_dir)
opts.jira_issues = list(set(opts.jira_issues)) # distinctify
jcl = JIRA({'server': 'https://{opts.jira_hostname}/'.format(opts=opts)},
basic_auth=(opts.jira_username,opts.jira_password))
if opts.existing:
issues_db = json.load(open(opts.json_fn,'r'))
existing_ids = [ i['id'] for i in issues_db['issues'] ]
logger.info("read {n} issues from {fn}".format(n=len(existing_ids),fn=opts.json_fn))
else:
issues_db = dict()
issues_db['meta'] = {
'default_milestone': None,
'default_assignee': None,
'default_kind': "bug",
'default_component': None,
'default_version': None,
}
issues_db['attachments'] = []
issues_db['comments'] = []
issues_db['issues'] = []
issues_db['logs'] = []
issues_db['components'] = [ {'name':v.name} for v in jcl.project_components(opts.jira_project) ]
issues_db['milestones'] = [ {'name':v.name} for v in jcl.project_versions(opts.jira_project) ]
issues_db['versions'] = issues_db['milestones']
# bb_issue_map: bb issue # -> bitbucket issue
bb_issue_map = ImmutableDict( (i['id'],i) for i in issues_db['issues'] )
# jk_issue_map: jira key -> bitbucket issue
# contains only items migrated from JIRA (i.e., not preexisting issues with --existing)
jk_issue_map = ImmutableDict()
# issue_links is a dict of dicts of lists, using JIRA keys
# e.g., links['CORE-135']['depends on'] = ['CORE-137']
issue_links = collections.defaultdict(lambda: collections.defaultdict(lambda: []))
issues = fetch_issues(opts,jcl)
logger.info("fetch {n} issues from JIRA".format(n=len(issues)))
for ji in issues:
# Pfft. Need to fetch the issue again due to bug in JIRA.
# See https://bitbucket.org/bspeakmon/jira-python/issue/47/, comment on 2013-10-01 by ssonic
ji = jcl.issue(ji.key,expand="attachments,comments")
# create the issue
bbi = jira_issue_to_bb_issue(opts,jcl,ji)
issues_db['issues'] += [bbi]
bb_issue_map[bbi['id']] = bbi
jk_issue_map[ji.key] = bbi
issue_links[ji.key]['imported from'] = [jira_link_text(ji.key)]
# add comments
for jc in ji.fields.comment.comments:
bbc = jira_comment_to_bb_comment(opts,jcl,jc)
bbc['issue'] = bbi['id']
issues_db['comments'] += [bbc]
# add attachments
for ja in ji.fields.attachment:
att_rel_path = os.path.join(opts.att_rel_dir,ja.id)
att_abs_path = os.path.join(opts.att_abs_dir,ja.id)
if not os.path.exists(att_abs_path):
open(att_abs_path,'w').write(ja.get())
logger.info("Wrote {att_abs_path}".format(att_abs_path=att_abs_path))
bba = {
"path": att_rel_path,
"issue": bbi['id'],
"user": person_map[ja.author.name],
"filename": ja.filename,
}
issues_db['attachments'] += [bba]
# parent-child is task-subtask
if hasattr(ji.fields,'parent'):
issue_links[ji.fields.parent.key]['subtasks'].append(jira_key_to_bb_issue_tag(ji.key))
issue_links[ji.key]['parent task'].append(jira_key_to_bb_issue_tag(ji.fields.parent.key))
# add links
for il in ji.fields.issuelinks:
if hasattr(il,'outwardIssue'):
issue_links[ji.key][il.type.outward].append(jira_key_to_bb_issue_tag(il.outwardIssue.key))
elif hasattr(il,'inwardIssue'):
issue_links[ji.key][il.type.inward].append(jira_key_to_bb_issue_tag(il.inwardIssue.key))
logger.info("migrated issue {ji.key}: {ji.fields.summary} ({components})".format(
ji=ji,components=','.join(c.name for c in ji.fields.components)))
# append links section to content
# this section shows both task-subtask and "issue link" relationships
for src,dstlinks in issue_links.iteritems():
if src not in jk_issue_map:
logger.warn("issue {src}, with issue_links, not in jk_issue_map; skipping".format(src=src))
continue
links_block = "Links\n=====\n"
for desc,dsts in sorted(dstlinks.iteritems()):
links_block += "* **{desc}**: {links} \n".format(desc=desc,links=", ".join(dsts))
if jk_issue_map[src]['content']:
jk_issue_map[src]['content'] += "\n\n" + links_block
else:
jk_issue_map[src]['content'] = links_block
id_counts = collections.Counter(i['id'] for i in issues_db['issues'])
dupes = [ k for k,cnt in id_counts.iteritems() if cnt>1 ]
if dupes:
raise RuntimeError("{n} issue ids appear more than once from existing {opts.json_fn}".format(
n=len(dupes),opts=opts))
json.dump(issues_db,open(opts.json_fn,'w'))
logger.info("wrote {n} issues to {opts.json_fn}".format(n=len(id_counts),opts=opts))
# write zipfile
os.chdir(opts.dir)
with zipfile.ZipFile(opts.dir + '.zip','w') as zf:
for fn in ['db-1.0.json']+glob.glob('attachments/*'):
zf.write(fn)
logger.info("added {fn} to archive".format(fn=fn))
NOTE: I'm writing a new answer because writing this in a comment would be horrible, but most of the credit goes to #Turch's answer.
My steps (in OSX and Debian machines, both worked fine):
apt-get install python-pip (Debian) or sudo easy_install pip (OSX)
pip install jira
pip install configparser
easy_install -U setuptools (not sure if really needed)
Download or clone the source code from https://bitbucket.org/reece/rcore/ in your home folder, for example. Note: don't download using pip, it will get the 0.0.2 version and you need the 0.0.3.
Download the Python script created by Reece, mentioned by #Turch, and place it inside of the rcore folder.
Follow the instructions by #Turch: I also had to replace all instances of iteritems with items in the script and in rcore/types/immutabledict.py to make it work with Python 3. You will also need to fill in the dictionaries (priority_map, person_map, etc) with the values your project uses. Finally, you need a config file to exist with the connection info (see comments at the top of the script). Note: I used hostname like jira.domain.com (no http or https).
(This change did the trick for me) I had to change part of the line 250 from 'https://{opts.jira_hostname}/' to 'http://{opts.jira_hostname}/'
To finish, run the script like #Turch mentioned: The basic command line usage is export.py --jira-project <project>
The file was placed in /tmp/.zip for me.
The file was perfectly accepted in the BitBucket importer today.
Hooray for Reece and Turch! Thanks guys!

Match a running ipython notebook to a process

My server runs many long running notebooks, and I'd like to monitor the notebooks memory.
Is there a way to match between the pid or process name and a notebook?
Since the question is about monitoring notebooks' memory, I've written a complete example showing the memory consumption of the running notebooks. It is based on the excellent #jcb91 answer and a few other answers (1, 2, 3, 4).
import json
import os
import os.path
import posixpath
import subprocess
import urllib2
import pandas as pd
import psutil
def show_notebooks_table(host, port):
"""Show table with info about running jupyter notebooks.
Args:
host: host of the jupyter server.
port: port of the jupyter server.
Returns:
DataFrame with rows corresponding to running notebooks and following columns:
* index: notebook kernel id.
* path: path to notebook file.
* pid: pid of the notebook process.
* memory: notebook memory consumption in percentage.
"""
notebooks = get_running_notebooks(host, port)
prefix = long_substr([notebook['path'] for notebook in notebooks])
df = pd.DataFrame(notebooks)
df = df.set_index('kernel_id')
df.index.name = prefix
df.path = df.path.apply(lambda x: x[len(prefix):])
df['pid'] = df.apply(lambda row: get_process_id(row.name), axis=1)
# same notebook can be run in multiple processes
df = expand_column(df, 'pid')
df['memory'] = df.pid.apply(memory_usage_psutil)
return df.sort_values('memory', ascending=False)
def get_running_notebooks(host, port):
"""Get kernel ids and paths of the running notebooks.
Args:
host: host at which the notebook server is listening. E.g. 'localhost'.
port: port at which the notebook server is listening. E.g. 8888.
username: name of the user who runs the notebooks.
Returns:
list of dicts {kernel_id: notebook kernel id, path: path to notebook file}.
"""
# find which kernel corresponds to which notebook
# by querying the notebook server api for sessions
sessions_url = posixpath.join('http://%s:%d' % (host, port), 'api', 'sessions')
response = urllib2.urlopen(sessions_url).read()
res = json.loads(response)
notebooks = [{'kernel_id': notebook['kernel']['id'],
'path': notebook['notebook']['path']} for notebook in res]
return notebooks
def get_process_id(name):
"""Return process ids found by (partial) name or regex.
Source: https://stackoverflow.com/a/44712205/304209.
>>> get_process_id('kthreadd')
[2]
>>> get_process_id('watchdog')
[10, 11, 16, 21, 26, 31, 36, 41, 46, 51, 56, 61] # ymmv
>>> get_process_id('non-existent process')
[]
"""
child = subprocess.Popen(['pgrep', '-f', name], stdout=subprocess.PIPE, shell=False)
response = child.communicate()[0]
return [int(pid) for pid in response.split()]
def memory_usage_psutil(pid=None):
"""Get memory usage percentage by current process or by process specified by id, like in top.
Source: https://stackoverflow.com/a/30014612/304209.
Args:
pid: pid of the process to analyze. If None, analyze the current process.
Returns:
memory usage of the process, in percentage like in top, values in [0, 100].
"""
if pid is None:
pid = os.getpid()
process = psutil.Process(pid)
return process.memory_percent()
def long_substr(strings):
"""Find longest common substring in a list of strings.
Source: https://stackoverflow.com/a/2894073/304209.
Args:
strings: list of strings.
Returns:
longest substring which is found in all of the strings.
"""
substr = ''
if len(strings) > 1 and len(strings[0]) > 0:
for i in range(len(strings[0])):
for j in range(len(strings[0])-i+1):
if j > len(substr) and all(strings[0][i:i+j] in x for x in strings):
substr = strings[0][i:i+j]
return substr
def expand_column(dataframe, column):
"""Transform iterable column values into multiple rows.
Source: https://stackoverflow.com/a/27266225/304209.
Args:
dataframe: DataFrame to process.
column: name of the column to expand.
Returns:
copy of the DataFrame with the following updates:
* for rows where column contains only 1 value, keep them as is.
* for rows where column contains a list of values, transform them
into multiple rows, each of which contains one value from the list in column.
"""
tmp_df = dataframe.apply(
lambda row: pd.Series(row[column]), axis=1).stack().reset_index(level=1, drop=True)
tmp_df.name = column
return dataframe.drop(column, axis=1).join(tmp_df)
Here is an example output of show_notebooks_table('localhost', 8888):
I came here looking for the simple answer to this question, so I'll post it for anyone else looking.
import os
os.getpid()
This is possible, although I could only think of the rather hackish solution I outline below. In summary:
Get the ports each kernel (id) is listening on from the corresponding json connection files residing in the server's security directory
Parse the output of a call to netstat to determine which pid is listening to the ports found in step 1
Query the server's sessions url to find which kernel id maps to which session, and hence to which notebook. See the ipython wiki for the api. Although not all of it works for me, running IPython 2.1.0, the sessions url does.
I suspect there is a much simpler way, but I'm not sure as yet where to find it.
import glob
import os.path
import posixpath
import re
import json
import subprocess
import urllib2
# the url and port at which your notebook server listens
server_path = 'http://localhost'
server_port = 8888
# the security directory of the notebook server, containing its connections files
server_sec_dir = 'C:/Users/Josh/.ipython/profile_default/security/'
# part 1 : open all the connection json files to find their port numbers
kernels = {}
for json_path in glob.glob(os.path.join(server_sec_dir, 'kernel-*.json')):
control_port = json.load(open(json_path, 'r'))['control_port']
key = os.path.basename(json_path)[7:-5]
kernels[control_port] = {'control_port': control_port, 'key': key}
# part2 : get netstat info for which processes use which tcp ports
netstat_ouput = subprocess.check_output(['netstat', '-ano'])
# parse the netstat output to map ports to PIDs
netstat_regex = re.compile(
"^\s+\w+\s+" # protocol word
"\d+(\.\d+){3}:(\d+)\s+" # local ip:port
"\d+(\.\d+){3}:(\d+)\s+" # foreign ip:port
"LISTENING\s+" # connection state
"(\d+)$" # PID
)
for line in netstat_ouput.splitlines(False):
match = netstat_regex.match(line)
if match and match.lastindex == 5:
port = int(match.group(2))
if port in kernels:
pid = int(match.group(5))
kernels[port]['pid'] = pid
# reorganize kernels to use 'key' as keys
kernels = {kernel['key']: kernel for kernel in kernels.values()}
# part 3 : find which kernel corresponds to which notebook
# by querying the notebook server api for sessions
sessions_url = posixpath.join('%s:%d' % (server_path, server_port),
'api','sessions')
response = urllib2.urlopen(sessions_url).read()
for session in json.loads(response):
key = session['kernel']['id']
if key in kernels:
nb_path = os.path.join(session['notebook']['path'],
session['notebook']['name'])
kernels[key]['nb_path'] = nb_path
# now do what you will with the dict. I just print a pretty list version:
print json.dumps(kernels.values(), sort_keys=True, indent=4)
outputs (for me, at the moment):
[
{
"key": "9142896a-34ca-4c01-bc71-e5709652cac5",
"nb_path": "2015/2015-01-16\\serhsdh.ipynb",
"pid": 11436,
"port": 56173
},
{
"key": "1ddedd95-5673-45a6-b0fb-a3083debb681",
"nb_path": "Untitled0.ipynb",
"pid": 11248,
"port": 52191
},
{
"key": "330343dc-ae60-4f5c-b9b8-e5d05643df19",
"nb_path": "ipynb\\temp.ipynb",
"pid": 4680,
"port": 55446
},
{
"key": "888ad49b-5729-40c8-8d53-0e025b03ecc6",
"nb_path": "Untitled2.ipynb",
"pid": 7584,
"port": 55401
},
{
"key": "26d9ddd2-546a-40b4-975f-07403bb4e048",
"nb_path": "Untitled1.ipynb",
"pid": 10916,
"port": 55351
}
]
Adding to the Dennis Golomazov's answer to:
Make the code compatible with Python 3
Allow to login into a password-protected session
I replaced the get_running_notebooks function by this one (source):
import requests
import posixpath
import json
def get_running_notebooks(host, port, password=''):
"""
Get kernel ids and paths of the running notebooks.
Args:
host: host at which the notebook server is listening. E.g. 'localhost'.
port: port at which the notebook server is listening. E.g. 8888.
Returns:
list of dicts {kernel_id: notebook kernel id, path: path to notebook file}.
"""
BASE_URL = 'http://{0}:{1}/'.format(host, port)
# Get the cookie data
s = requests.Session()
url = BASE_URL + 'login?next=%2F'
resp = s.get(url)
xsrf_cookie = resp.cookies['_xsrf']
# Login with the password
params = {'_xsrf': xsrf_cookie, 'password': password}
res = s.post(url, data=params)
# Find which kernel corresponds to which notebook
# by querying the notebook server api for sessions
url = posixpath.join(BASE_URL, 'api', 'sessions')
ret = s.get(url)
#print('Status code:', ret.status_code)
# Get the notebook list
res = json.loads(ret.text)
notebooks = [{'kernel_id': notebook['kernel']['id'],
'path': notebook['notebook']['path']} for notebook in res]
return notebooks
Here is a solution that solves the access issue mentioned in other posts by first obtaining the access-token via jupyter lab list.
import requests
import psutil
import re
import os
import pandas as pd
# get all processes that have a ipython kernel and get kernel id
dfp = pd.DataFrame({'p': [p for p in psutil.process_iter() if 'ipykernel_launcher' in ' '.join(p.cmdline())]})
dfp['kernel_id'] = dfp.p.apply(lambda p: re.findall(r".+kernel-(.+)\.json", ' '.join(p.cmdline()))[0])
# get url to jupyter server with token and open once to get access
urlp = requests.utils.parse_url([i for i in os.popen("jupyter lab list").read().split() if 'http://' in i][0])
s = requests.Session()
res = s.get(urlp)
# read notebook list into dataframe and get kernel id
resapi = s.get(f'http://{urlp.netloc}/api/sessions')
dfn = pd.DataFrame(resapi.json())
dfn['kernel_id'] = dfn.kernel.apply(lambda item: item['id'])
# merge the process and notebook dataframes
df = dfn.merge(dfp, how = 'inner')
# add process info as desired
df['pid'] = df.p.apply(lambda p: p.pid)
df['mem [%]'] = df.p.apply(lambda p: p.memory_percent())
df['cpu [%]'] = df.p.apply(lambda p: p.cpu_percent())
df['status'] = df.p.apply(lambda p: p.status())
# reduce to columns of interest and sort
dfout = df.loc[:,['name','pid','mem [%]', 'cpu [%]','status']].sort_values('mem [%]', ascending=False)
I have asked similar question and in order to make it a duplicate I "reverse engineer" Dennis Golomazov's answer with focus on matching notebooks in a generic way (also manually).
Get json from the api/sessions path of your Jupyter server (i.e. https://localhost:8888/api/sessions in most cases).
Parse the json. It is a sequence of session objects (nested dicts if parsed with json module). Their .path attributes point to the notebook file, and .kernel.id is the kernel id (which is a part of path passed as an argument of python -m ipykernel_launcher, in my case `{PATH}/python -m ipykernel_launcher -f {HOME}/.local/share/jupyter/runtime/kernel-{ID}.json).
Find PID of process run with that path (e.g. by pgrep -f {ID}).