I'm trying to run a peak calling tool within a conda environment using snakemake.
The script looks like this (I've only included the lines connected to the problem):
rule all:
    input:
        expand('{project}/{organism}/{mapper}/seacr/{pattern}.auc.threshold.bed', pattern = PATTERN, sample = IDS, organism = config['org'], project = config['project'], mapper = config['mapper']) # SEACR - run the peak calling

rule seacr_run:
    input:
        IP = '{project}/{organism}/{mapper}/seacr/IP_{PATTERN}.bedgraph',
        IgG = '{project}/{organism}/{mapper}/seacr/IgG_{PATTERN}.bedgraph',
    output:
        bed1 = '{project}/{organism}/{mapper}/seacr/{PATTERN}.auc.threshold.bed',
    shell:
        '''
        bash /fs/home/yeroslaviz/SEACR/SEACR_1.3.sh {input.IP} 0.01 non stringent {output.bed1}
        '''
When running a dry-run of the snakemake command (-nps) I get the correct command printed to STDOUT:
> snakemake -nps /fs/pool/pool-bcfngs/scripts/P193.ChipSeq.Snakemake -j 100
...
Building DAG of jobs...
Job counts:
count jobs
1 all
1 seacr_run
2
[Tue Mar 3 13:56:19 2020]
rule seacr_run:
input: P193/Mmu.GrCm38/bowtie2/seacr/IP_H3K4m3.bedgraph, P193/Mmu.GrCm38/bowtie2/seacr/IgG_H3K4m3.bedgraph
output: P193/Mmu.GrCm38/bowtie2/seacr/H3K4m3.auc.threshold.bed
jobid: 22
wildcards: project=P193, organism=Mmu.GrCm38, mapper=bowtie2, PATTERN=H3K4m3
bash /fs/home/yeroslaviz/SEACR/SEACR_1.3.sh P193/Mmu.GrCm38/bowtie2/seacr/IP_H3K4m3.bedgraph 0.01 non stringent P193/Mmu.GrCm38/bowtie2/seacr/H3K4m3.auc.threshold.bed
[Tue Mar 3 13:56:19 2020]
localrule all:
...
Job counts:
count jobs
1 all
1 seacr_run
2
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
When running the command above on the command line, the tool works without problems. But when I try to run it within the Snakemake workflow I get the following error:
Waiting at most 5 seconds for missing files.
MissingOutputException in line 67 of /fs/pool/pool-bcfngs/scripts/P193.ChipSeq.Snakemake:
Missing files after 5 seconds:
P193/Mmu.GrCm38/bowtie2/seacr/H3K4m3.auc.threshold.bed
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Can anyone explain what is happening?
Thanks
I’m trying to solve a production problem. We receive an error file daily (Mon-Fri) from the bank that contains error records. These records are 94 bytes in length. On occasion there will be some error codes in the file that cause some significant problems when processed by a system at the State.
I was asked to “filter out” the error records from the file that gets sent to the State. To do this, I created a one-line FindStr command (below) to locate records containing the error code “R02” (no quotes) in positions 4-6 of the records, and remove them.
FindStr /V "R02" INPUT_FILE > OUTPUT_FILTERED_FILE_%DATE%_%TIME%
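(Side note: this matches "R02" anywhere in the record, not only in positions 4-6. If that ever becomes a problem, FindStr's /R regex switch should be able to anchor the match, with each "." matching one arbitrary character in positions 1-3:)
FindStr /V /R "^...R02" INPUT_FILE > OUTPUT_FILTERED_FILE_%DATE%_%TIME%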
This worked as I had hoped and the requesting users were happy, BUT one of the managers found that the record immediately preceding the record containing the error code ALSO MUST BE DELETED from the file, because it is associated with the record containing the error code. The problem is that this preceding record does NOT contain an error code in it. There is a 6 digit number in positions 89-94 that could be related to the error record, but I don't want to guess, or over-complicate the script.
As you can see in the example data (below), there are 5 error records (5, 7, 9, 11, and 27), containing error code “R02.” My FindStr command worked for removing the R02 records and creating an output file without any of the error records (containing “R02”).
WHAT I NEED NOW is to be able to remove the “associated records” that go with the R02 records in the example data below. Done properly, the following records would be removed from the example file in a single process: 4, 5, 6, 7, 8, 9, 10, 11, 26 and 27.
I need to delete both the “R02” error records AND the associated record above each of those error records simultaneously, and write the output to a NEW file, leaving the original file intact AS-IS, because it is retained locally by our user department.
Below is what the INPUT record content looks like (error codes in BOLD), with a record number prepended for reference purposes. Sorry, but I can’t supply the full 94-byte record images due to security issues. Below that is what my desired output file should look like.
I don’t know if this can be done with FindStr, but I’m sure PowerShell can do the job, BUT I know nothing about PowerShell. The script will be executed on Windows Server.
Can anyone help me with creating a script that will accomplish the processing to transform the input file into the desired output file?
Thanks very much in advance for your assistance.
****** Example Data ******
Input File
Nbr - - Record Content - -
01 HEADER RECORD
02 CONTROL RECORD
03 5200SAN
04 62112200
05 799**R02**12
06 62112200
07 799**R02**12
08 62112200
09 799**R02**12
10 62112200
11 799**R02**12
12 82000000
13 5200SAN
14 62112200
15 798C0312
16 62112200
17 798C0312
18 62112200
19 798C0312
20 62112200
21 798C0312
22 62112200
23 798C0312
24 82000000
25 5200SAN
26 62112200
27 799**R02**12
28 TRAILER RECORD
Desired New Output File
Nbr - - Record Content - -
01 HEADER RECORD
02 CONTROL RECORD
03 5200SAN
# DELETED #
# DELETED #
# DELETED #
# DELETED #
# DELETED #
# DELETED #
# DELETED #
# DELETED #
12 82000000
13 5200SAN
14 62112200
15 798C0312
16 62112200
17 798C0312
18 62112200
19 798C0312
20 62112200
21 798C0312
22 62112200
23 798C0312
24 82000000
25 5200SAN
# DELETED #
# DELETED #
28 TRAILER RECORD
The following PowerShell is untested, but it should do basically what you're asking for. There may very well be bugs in my logic, but this will give you the basic framework of what needs to happen.
[cmdletbinding()]
Param
(
    [string] $InputFilePath
)

# Read the text file
$InputFile = Get-Content $InputFilePath

# Get the time (HH = 24-hour clock, so filenames stay unambiguous)
$Time = Get-Date -Format "yyyyMMdd_HHmmss"

# Set up the output file name
$OutputFileFiltered = "Output_Filtered_File_$Time.txt"

# Initialize the array used to hold the output
$OutputStrings = @()

# Loop through each line in the file.
# Check the line ahead for "R02" and either keep the current
# line or skip the associated/error pair.
for ($i = 0; $i -lt $InputFile.Length - 1; $i++)
{
    if ($InputFile[$i + 1] -notmatch "R02")
    {
        # The next record does not contain "R02"; keep the current record
        $OutputStrings += $InputFile[$i]
    }
    else
    {
        # The next record does contain "R02"; drop the current (associated)
        # record, and bump $i so the R02 record itself is skipped as well
        $i++
    }
}

# Add the trailer record to the output
$OutputStrings += $InputFile[$InputFile.Length - 1]

# Write the output to a file
$OutputStrings | Out-File $OutputFileFiltered
Save that as FilterScript.ps1 (or whatever you prefer) and execute it in PowerShell with the following:
.\FilterScript.ps1 -InputFilePath "C:\Path\To\Your\InputFile.txt"
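If the server's execution policy blocks running local scripts, you can bypass it for a single run with PowerShell's standard flags:
powershell -ExecutionPolicy Bypass -File .\FilterScript.ps1 -InputFilePath "C:\Path\To\Your\InputFile.txt"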
I've just moved servers and have an overlap of crontabs running.
Both servers are set to BST, but one sends me the log at 08:00 BST and the old one at 09:00 BST.
The crontab entry for both is
0 9 * * * /root/phpmaillog.sh > /dev/null 2>&1
Mystery?
This is my entry in /etc/crontab, CentOS 6.6:
0 0 */1 * * fredrik /home/fredrik/google-cloud-sdk/bin/gsutil -d -m rsync -r -C [src] [dst] &> [log]
And I'm getting this error: OSError: [Errno 13] Permission denied: '/.config'
The command runs fine if executed in the shell. I've noticed I cannot run 0 0 */1 * * fredrik gsutil ... without the full path to gsutil, so I'm assuming I'm missing something in the environment in which cron is running...?
Here's the full traceback:
Traceback (most recent call last):
  File "/home/fredrik/google-cloud-sdk/bin/bootstrapping/gsutil.py", line 68, in <module>
    bootstrapping.PrerunChecks(can_be_gce=True)
  File "/home/fredrik/google-cloud-sdk/bin/bootstrapping/bootstrapping.py", line 279, in PrerunChecks
    CheckCredOrExit(can_be_gce=can_be_gce)
  File "/home/fredrik/google-cloud-sdk/bin/bootstrapping/bootstrapping.py", line 167, in CheckCredOrExit
    cred = c_store.Load()
  File "/home/fredrik/google-cloud-sdk/bin/bootstrapping/../../lib/googlecloudsdk/core/credentials/store.py", line 195, in Load
    account = properties.VALUES.core.account.Get()
  File "/home/fredrik/google-cloud-sdk/bin/bootstrapping/../../lib/googlecloudsdk/core/properties.py", line 393, in Get
    return _GetProperty(self, _PropertiesFile.Load(), required)
  File "/home/fredrik/google-cloud-sdk/bin/bootstrapping/../../lib/googlecloudsdk/core/properties.py", line 618, in _GetProperty
    value = callback()
  File "/home/fredrik/google-cloud-sdk/bin/bootstrapping/../../lib/googlecloudsdk/core/properties.py", line 286, in <lambda>
    'account', callbacks=[lambda: c_gce.Metadata().DefaultAccount()])
  File "/home/fredrik/google-cloud-sdk/bin/bootstrapping/../../lib/googlecloudsdk/core/credentials/gce.py", line 179, in Metadata
    _metadata_lock.lock(function=_CreateMetadata, argument=None)
  File "/usr/lib64/python2.6/mutex.py", line 44, in lock
    function(argument)
  File "/home/fredrik/google-cloud-sdk/bin/bootstrapping/../../lib/googlecloudsdk/core/credentials/gce.py", line 178, in _CreateMetadata
    _metadata = _GCEMetadata()
  File "/home/fredrik/google-cloud-sdk/bin/bootstrapping/../../lib/googlecloudsdk/core/credentials/gce.py", line 73, in __init__
    _CacheIsOnGCE(self.connected)
  File "/home/fredrik/google-cloud-sdk/bin/bootstrapping/../../lib/googlecloudsdk/core/credentials/gce.py", line 186, in _CacheIsOnGCE
    config.Paths().GCECachePath()) as gcecache_file:
  File "/home/fredrik/google-cloud-sdk/bin/bootstrapping/../../lib/googlecloudsdk/core/util/files.py", line 465, in OpenForWritingPrivate
    MakeDir(full_parent_dir_path, mode=0700)
  File "/home/fredrik/google-cloud-sdk/bin/bootstrapping/../../lib/googlecloudsdk/core/util/files.py", line 44, in MakeDir
    os.makedirs(path, mode=mode)
  File "/usr/lib64/python2.6/os.py", line 150, in makedirs
    makedirs(head, mode)
  File "/usr/lib64/python2.6/os.py", line 157, in makedirs
    mkdir(name, mode)
OSError: [Errno 13] Permission denied: '/.config'
Thanks to Mike and jterrace for helping me get this working. In the end, I had to revise these environment variables: PATH, HOME, and BOTO_CONFIG (on top of the defaults cron provides).
PATH=/sbin:/bin:/usr/sbin:/usr/bin:/home/fredrik/google-cloud-sdk/bin
HOME=/home/fredrik
BOTO_CONFIG="/home/fredrik/.config/gcloud/legacy_credentials/[your-email-address]/.boto"
# Example of job definition:
# .---------------- minute (0 - 59)
# | .------------- hour (0 - 23)
# | | .---------- day of month (1 - 31)
# | | | .------- month (1 - 12) OR jan,feb,mar,apr ...
# | | | | .---- day of week (0 - 6) (Sunday=0 or 7) OR sun,mon,tue,wed,thu,fri,sat
# | | | | |
# * * * * * user-name command to be executed
0 0 */1 * * fredrik gsutil -d -m rsync -r -C /local-folder/ gs://my-bucket/my-folder/ > /logs/gsutil.log 2>&1
The > gsutil.log 2>&1 redirects both stdout and stderr to the same file. It also overwrites the log file each time gsutil runs; to make it append instead, use >> gsutil.log 2>&1. This should be safe on both Linux and OS X.
I'm noticing that the debug flag -d creates enormous log files on large data volumes, so I might opt out of that flag, personally.
You're probably getting a different boto config file when running from cron. Please try running the following both ways (as root, and then via cron), and see if you get different config file lists for the two cases:
gsutil -D ls 2>&1 | grep config_file_list
The reason this happens is that cron unsets most environment variables before running jobs, so you need to manually set the BOTO_CONFIG environment variable in your cron script before running gsutil, i.e.,:
BOTO_CONFIG="/root/.boto"
gsutil rsync ...
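If you'd rather keep it on the crontab line itself, the same thing works inline, because cron hands the command to sh (a sketch in /etc/crontab form, with the username field; substitute your own paths and rsync arguments):
0 0 * * * root BOTO_CONFIG="/root/.boto" gsutil rsync ...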
I believe you're getting this error because the HOME environment variable is not set when running under cron. Try setting HOME=/home/fredrik.
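Cron also accepts variable assignments at the top of the crontab, above the job lines, so one way to do that (reusing the entry from the question) would be:
HOME=/home/fredrik
0 0 */1 * * fredrik /home/fredrik/google-cloud-sdk/bin/gsutil -d -m rsync -r -C [src] [dst] &> [log]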
Because cron runs in a very limited environment, you need to source your .bash_profile to get your environment config.
* * * * * source ~/.bash_profile && your_cmd_here
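One caveat: cron runs job commands with /bin/sh by default, and source is a bash-ism, so depending on your distribution you may also need to set the shell at the top of the crontab (or use the POSIX . builtin instead of source):
SHELL=/bin/bash
* * * * * source ~/.bash_profile && your_cmd_here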
For anyone trying to manage images with gsutil from PHP running under Apache:
I made a new directory called apache-shared and chgrp/chown'd it to www-data (or whichever user your Apache runs as; run "top" to check), copied the .boto file into the directory, and ran the following without issue:
shell_exec('export BOTO_CONFIG=/apache-shared/.boto && export PATH=/sbin:/bin:/usr/sbin:/usr/bin:/home/user/google-cloud-sdk/bin && gsutil command image gs://bucket');
In my /etc/crontab file I write:
* * * * * PLACK_ENV=development -I /home/adrian/app/lib/ /home/adrian/app/script/db/log_to_db.pl
This is meant to make the cron job run every minute. The job runs the log_to_db.pl Perl script, which inserts data into my database.
When I run in my terminal
PLACK_ENV=development -I /home/adrian/app/lib/ /home/adrian/app/script/db/log_to_db.pl
It's OK! The script runs.
But the cron job isn't working!
What can be wrong?
PS: My script starts like
#!/usr/bin perl
....
My cron log prints:
Jul 8 20:29:01 dev0001 crond[1829]: (*system*) RELOAD (/etc/crontab)
Jul 8 20:29:01 dev0001 crond[1829]: (CRON) bad username (/etc/crontab)
Jul 8 20:30:01 dev0001 crond[1829]: (*system*) RELOAD (/etc/crontab)
Jul 8 20:30:01 dev0001 crond[1829]: (CRON) bad username (/etc/crontab)
Jul 8 20:30:01 dev0001 CROND[13504]: (root) CMD (/usr/lib64/sa/sa1 -S DISK 1 1)
You need a username field when putting the entry in the system crontab:
* * * * * adrian PLACK_ENV=development -I /home/adrian/app/lib/ /home/adrian/app/script/db/log_to_db.pl
But as @jithin said, putting this in your user crontab (crontab -e) might make more sense.
Don't edit the crontab file directly. Instead use crontab -e and add the cron entry.
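Note that entries added via crontab -e have no username field, so the equivalent of the system-crontab line above would be:
* * * * * PLACK_ENV=development -I /home/adrian/app/lib/ /home/adrian/app/script/db/log_to_db.pl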