How do I loop over several files, keeping the base name for further processing? - perl

I have multiple text files that need to be tokenised, POS-tagged and run through NER. I am using the C&C taggers and have worked through their tutorial, but I am wondering whether there is a way to tag multiple files at once rather than one by one.
At the moment I am tokenising the files as follows:
bin/tokkie --input working/tutorial/example.txt --quotes delete --output working/tutorial/example.tok
then Part of Speech tagging:
bin/pos --input working/tutorial/example.tok --model models/pos --output working/tutorial/example.pos
and lastly Named Entity Recognition:
bin/ner --input working/tutorial/example.pos --model models/ner --output working/tutorial/example.ner
I am not sure how to write a loop that does this and keeps the output file name the same as the input, but with an extension representing the tagging that was applied. I was thinking of a bash script, or perhaps Perl to open the directory, but I am not sure how to enter the C&C commands so that the script understands them.
At the moment I am doing it manually and it's pretty time consuming to say the least!

Untested, and the directory handling likely needs adjusting.
use strict;
use warnings;
use autodie qw(:all);    # makes a failing system() die; requires IPC::System::Simple
use File::Basename qw(basename);
for my $text_file (glob 'working/tutorial/*.txt') {
    # Strip the directory and the .txt suffix to get the bare name.
    my $base_name = basename($text_file, '.txt');
    # Tokenise.
    system 'bin/tokkie',
        '--input'  => "working/tutorial/$base_name.txt",
        '--quotes' => 'delete',
        '--output' => "working/tutorial/$base_name.tok";
    # Part of Speech tagging.
    system 'bin/pos',
        '--input'  => "working/tutorial/$base_name.tok",
        '--model'  => 'models/pos',
        '--output' => "working/tutorial/$base_name.pos";
    # Named Entity Recognition.
    system 'bin/ner',
        '--input'  => "working/tutorial/$base_name.pos",
        '--model'  => 'models/ner',
        '--output' => "working/tutorial/$base_name.ner";
}

In Bash:
#!/bin/bash
dir='working/tutorial'
for file in "$dir"/*.txt
do
    noext=${file%.txt}    # path without the .txt extension
    bin/tokkie --input "$file" --quotes delete --output "$noext.tok"
    bin/pos --input "$noext.tok" --model models/pos --output "$noext.pos"
    bin/ner --input "$noext.pos" --model models/ner --output "$noext.ner"
done
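One detail the loop depends on: if no .txt file matches, the shell leaves the pattern unexpanded and the body runs once on the literal string working/tutorial/*.txt. Below is a minimal sketch of a guard for that case, with the name derivation pulled into a helper. The helper name outputs_for and the temporary demo directory are made up for illustration, and the C&C commands are left out so the snippet runs anywhere.

```shell
#!/bin/sh
# Hypothetical helper: print the three output paths for one .txt input path.
outputs_for() {
    noext=${1%.txt}    # strip a trailing ".txt"; the directory part is kept
    printf '%s\n' "$noext.tok" "$noext.pos" "$noext.ner"
}

demo_dir=$(mktemp -d)    # stand-in for working/tutorial
touch "$demo_dir/example.txt" "$demo_dir/second file.txt"

for file in "$demo_dir"/*.txt; do
    [ -e "$file" ] || continue    # skip the literal pattern when nothing matched
    outputs_for "$file"
done

rm -r "$demo_dir"
```

In the real loop the three bin/ commands would take the place of the outputs_for call. Quoting "$file" everywhere keeps names with spaces intact.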

Related

mapping values are not allowed in this context in "<unicode string>"

In my loop, I run a dbt command and save the output to a .yml file. The following command works and generates a schema in my .yml file accurately:
for file in models/l30_mart/*.sql; do
table=$(basename "$file" .sql)
dbt run-operation generate_model_yaml --args "{\"model_name\": \"$table\"}" > test.yml
done
However, in the example above, I am saving the test.yml file in the root directory. When I try to save the file in another path for example models/l30_mart/test.yml like this, it doesn't work:
for file in models/l30_mart/*.sql; do
table=$(basename "$file" .sql)
dbt run-operation generate_model_yaml --args "{\"model_name\": \"$table\"}" > models/l30_mart/test.yml
done
In this case, when I open the test.yml file, I see this:
12:06:42 Running with dbt=1.0.1
12:06:43 Encountered an error:
Compilation Error
The schema file at models/l30_mart/test.yml is invalid because no version is specified. Please consult the documentation for more information on schema.yml syntax:
https://docs.getdbt.com/docs/schemayml-files
What am I missing out on?
If I try something like this to save different files with the extracted tablename variable as the filename, it also doesn't work:
for file in models/l30_mart/*.sql; do
table=$(basename "$file" .sql)
dbt run-operation generate_model_yaml --args "{\"model_name\": \"$table\"}" > models/l30_mart/$table.yml
done
In this case, the files either have this output:
20:39:44 Running with dbt=1.0.1
20:39:45 Encountered an error:
Compilation Error
The schema file at models/l30_mart/**firsttable.yml** is invalid because no version is specified. Please consult the documentation for more information on schema.yml syntax:
https://docs.getdbt.com/docs/schemayml-files
or this (eg in the secondtablename.yml file):
20:39:48 Running with dbt=1.0.1
20:39:49 Encountered an error:
Parsing Error
Error reading dbt_4flow: l30_mart/firstablename.yml - Runtime Error
Syntax error near line 2
------------------------------
1 | 20:39:44 Running with dbt=1.0.1
2 | 20:39:45 Encountered an error:
3 | Compilation Error
4 | The schema file at models/l30_mart/firsttablename.yml is invalid because no version is specified. Please consult the documentation for more information on schema.yml syntax:
5 |
Raw Error:
------------------------------
mapping values are not allowed in this context
in "<unicode string>", line 2, column 31
Note that the secondtablename.yml mentions the firsttablename.yml.
I don't know dbt, but the likely explanation is that dbt parses all *.yml files in that target directory whenever you call it. The shell sets up the output redirection before it starts dbt, so the *.yml file already exists (initially empty) by the time dbt runs. Since dbt expects the file to contain a version, you get an error.
To check whether this assessment is correct, write into a temporary file:
for file in models/l30_mart/*.sql; do
    target_file=$(mktemp)
    table=$(basename "$file" .sql)
    dbt run-operation generate_model_yaml --args "{\"model_name\": \"$table\"}" > "$target_file"
    mv "$target_file" models/l30_mart/test.yml
done
(Be aware of mktemp shenanigans if you're using macOS)
Edit: Since dbt seems to be affected by the files existing, you can also try to generate all files and move them into the correct directory afterwards:
target_dir=$(mktemp -d)
for file in models/l30_mart/*.sql; do
    table=$(basename "$file" .sql)
    dbt run-operation generate_model_yaml --args "{\"model_name\": \"$table\"}" > "$target_dir/$table.yml"
done
mv "$target_dir"/*.yml models/l30_mart/
rmdir "$target_dir"
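The shell behaviour this answer relies on can be seen in isolation, without dbt. The sketch below lists a directory while redirecting the listing into a file inside that same directory; the function name list_into_self and the file name out.yml are made up for the demonstration. It says nothing about what dbt does internally, only that the output file exists before the command starts.

```shell
#!/bin/sh
# List a directory while redirecting the listing into a file inside that
# same directory, then print what was captured.
list_into_self() {
    demo_dir=$(mktemp -d)
    ls "$demo_dir" > "$demo_dir/out.yml"    # the shell creates out.yml before ls runs
    cat "$demo_dir/out.yml"
    rm -r "$demo_dir"
}

list_into_self    # prints: out.yml
```

The directory was empty when the command line was typed, yet the listing contains out.yml, which matches the error messages above complaining about the very file being written.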

Search xml for a value using sed

I have the XML file below:
<documents>
<document><title>some title1</title><abstract>Some abstract1</abstract></document>
<document><title>some title2</title><abstract>Some abstract2</abstract></document>
<document><title>some title3</title><abstract>Some abstract3</abstract></document>
<document><title>some title4</title><abstract>Some abstract4</abstract></document>
</documents>
I am trying to write a ksh script to fetch the abstract value for the document whose title is title4.
xmllint and xmlstarlet are not available on my machine (access issues).
I have tried:
sed -n '/abstract/{s/.*<abstract>//;s/<\/abstract.*//;p;}' connections.xml
How can I modify this to search based on a title?
Based on the example you have given:
sed -n '/title>.*title4<\/title>/{s#.*<abstract>##;s#</abstract>.*##;p}' file
Will give you:
Some abstract4
grep approach:
grep -Poz '<title>.*?title4</title><abstract>\K[^<>]+(?=</abstract>)' connections.xml && echo ""
The output:
Some abstract4
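Since the goal is a script, the sed answer can be wrapped in a small function that takes the title as an argument. This is only a sketch under two assumptions: each document sits on one line, as in the sample, and the title text contains no characters that are special in a sed regular expression (such as / . * [ ]). The name abstract_for is made up.

```shell
#!/bin/sh
# Hypothetical helper: print the abstract of every document whose title
# contains the given text. Usage: abstract_for TITLE FILE
abstract_for() {
    sed -n "/<title>[^<]*${1}[^<]*<\/title>/{s#.*<abstract>##;s#</abstract>.*##;p;}" "$2"
}

# Demo on a copy of the sample data.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<documents>
<document><title>some title1</title><abstract>Some abstract1</abstract></document>
<document><title>some title4</title><abstract>Some abstract4</abstract></document>
</documents>
EOF
abstract_for title4 "$sample"    # prints: Some abstract4
rm -f "$sample"
```

The match is a substring match, so passing just title would print every abstract. The function syntax used here is also accepted by ksh.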

Puppet onlyif and unless conditional test from boolean data in Hiera and CLI script output

I am running Puppet v3.0 on RHEL 6 and am doing package management via the exec resource.
I would like to add a number of control gates into my manifest via onlyif and unless.
First I would like to use booleans as defined in Hiera [ auto lookup function ].
Secondly I would like to use booleans from a bash script running diff <() <().
I'm using the following Hiera data:
---
my-class::package::patch_now:
0
my-class::package::package_list:
acl-2.2.49-6.el6-x86_64
acpid-1.0.10-2.1.el6-x86_64
...etc
and my manifest is as follows:
# less package.pp
class my-class::package(
  $package_list,
  $patch_now,
){
  exec {'patch_packages':
    provider  => shell,
    path      => [ "/bin/", "/usr/bin/" ],
    logoutput => true,
    timeout   => 100,
    command   => "yum update -e0 -d0 -y $package_list",
    unless    => "/path/to/my-diff.script 2>&1 > /dev/null",
    onlyif    => "test 0 -eq $patch_now",
  }
}
How would I test the booleans (0|1) from Hiera and a CLI diff.script with unless and onlyif in the context above ?
I'm assuming that you mean to install all listed packages in one sweep if $patch_now is set.
You should not test for that using onlyif. That is meant to verify some state on the agent system. If the master is aware of your data, you should use conditionals in the manifest structure.
if $patch_now {
exec { ... }
}
But do use true and false instead of 1 and 0 as the value for the flag - both 1 and 0 are equal to true in boolean context!
Your YAML looks funny, anyway.
To define a single value:
my-class::package::patch_now: false
To define an array:
my-class::package::package_list:
- acl-2.2.49-6.el6-x86_64
- acpid-1.0.10-2.1.el6-x86_64
- ...
When you use the array in your class, you cannot just put it in a string such as "yum update -e0 -d0 -y $package_list", for that will expand to "yum update -e0 -d0 -y acl-2.2.49-6.el6-x86_64acpid-1.0.10-2.1.el6-x86_64...", without spaces between the elements.
To concatenate the elements with spaces, use the join function from the stdlib module.
$packages = join($package_list, ' ')
...
"yum update -e0 -d0 -y $packages"
I honestly don't get how your diff <() <() is supposed to work. The whole approach looks a little convoluted. I suspect that with a little tweaking, your diff script could probably perform the updates on its own (so that the exec just runs this script with different parameters).
EDIT after receiving more info in your comment.
To make this work cleanly, I recommend the following.
Have Puppet transfer your Hiera data to the agent:
file { '/opt/wanted-packages': content => inline_template('<%= package_list * "\n" %>') }
The diff will then work like you suggested, only simpler.
diff /opt/wanted-packages <(facter ...)
Just make sure that the exec requires the file and you should be fine.
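For the unless gate, what matters is the exit status of the check: Puppet skips the exec when the unless command exits 0. Below is a sketch of such a check, assuming two plain-text lists with one package per line. The name packages_match and both file arguments are hypothetical, and how the installed list is produced is left open here.

```shell
#!/bin/sh
# Hypothetical check: exit 0 when both lists hold the same packages
# (order-insensitive), non-zero otherwise.
# Usage: packages_match WANTED_FILE INSTALLED_FILE
packages_match() {
    wanted_sorted=$(mktemp)
    installed_sorted=$(mktemp)
    sort -u "$1" > "$wanted_sorted"
    sort -u "$2" > "$installed_sorted"
    cmp -s "$wanted_sorted" "$installed_sorted"    # silent; only the status matters
    status=$?
    rm -f "$wanted_sorted" "$installed_sorted"
    return "$status"
}

# Demo: same packages in a different order count as a match.
a=$(mktemp); b=$(mktemp)
printf '%s\n' acl acpid > "$a"
printf '%s\n' acpid acl > "$b"
packages_match "$a" "$b" && echo "in sync"
rm -f "$a" "$b"
```

When silencing such a script in an unless line, note that 2>&1 > /dev/null still lets stderr through to the original stdout; > /dev/null 2>&1 discards both streams.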

Rake does not recognize rules with multiple extensions

I generate PDFs from Markdown files using Rake. If a Markdown file is filename.md, I like the PDF to be filename.md.pdf not filename.pdf, so that autocompletion works the way I like and so that it's clear what the source of the PDF file is.
I have this Rake file, which works fine.
MDFILES = FileList["*.md"]
PDFS = MDFILES.ext("pdf")
desc "Build PDFs of all chapters"
task :pdfs => PDFS
# Build PDFs from Markdown source
rule ".pdf" => ".md" do |t|
sh "pandoc #{t.source} -o #{t.name}"
end
If I run rake pdfs or rake filename.pdf the PDFs are generated as expected, but the PDFs are named filename.pdf.
But I want the Rakefile to be this instead:
MDFILES = FileList["*.md"]
PDFS = MDFILES.ext("md.pdf")
desc "Build PDFs of all chapters"
task :pdfs => PDFS
# Build PDFs from Markdown source
rule "md.pdf" => ".md" do |t|
sh "pandoc #{t.source} -o #{t.name}"
end
Running rake pdfs or rake filename.md.pdf returns the error Don't know how to build task 'filename.md.pdf'.
How can I produce filenames the way I want?
By the way, this type of rule works fine with Make, to wit:
%.md.pdf : %.md
	pandoc $< -o $@
I've had a similar problem myself recently when I attempted to specify an extension with multiple dots in a rule. I solved it by using a different rule syntax as described here.
Try something like this for your rule:
rule( /\.md\.pdf$/ => [
  proc {|task_name| task_name.sub(/\.md\.pdf$/, '.md') }
]) do |t|
  sh "pandoc #{t.source} -o #{t.name}"
end

How to change the order of rsync in symfony deployment task

I want to deploy a part of my symfony application, say, it's like a module.
I want to exclude all files first, and then include only the files of
my new module.
For deployment I use the following symfony task
php symfony project:deploy production -t
The parameter -t prints all files to the output that are included in this dry run of rsync.
The content of config/rsync_exclude.txt is only *, since I want to exclude everything:
*
In config/rsync_include.txt I list all the files and folders for the inclusion:
config/
config/mysupermodule.yml
lib/model/doctrine/
lib/model/doctrine/MySuperclass.php
lib/model/doctrine/MySuperclassTable.php
lib/
lib/MySuperLibrary/
lib/MySuperLibrary/*
The symfony task builds the following rsync command:
rsync --dry-run -azC --force --delete --progress --exclude-from=config/rsync_exclude.txt --include-from=config/rsync_include.txt -e "ssh -p22" ./ user@www.server.com:/test_deployment/
Problem 1: The task doesn't sync any files.
Solution to 1: Change the order: include first, then exclude.
I figured out that it works if I restate the requirement like this:
I want to include all files of my new module and then exclude all others.
This means using the following command:
rsync --dry-run -azC --force --delete --progress --include-from=config/rsync_include.txt --exclude-from=config/rsync_exclude.txt -e "ssh -p22" ./ user@www.server.com:/test_deployment/
The rsync works.
Problem 2: How can I change the order of the rsync when using the symfony task?
The symfony task first excludes, then includes.
Solution 2: ?
It is NOT possible.
But you can edit the deployment task in lib/task/project/sfProjectDeployTask.class.php.
Replace this (line 145 to 154 in SF 1.4):
if (file_exists($options['rsync-dir'].'/rsync_exclude.txt'))
{
$parameters .= sprintf(' --exclude-from=%s/rsync_exclude.txt', $options['rsync-dir']);
}
if (file_exists($options['rsync-dir'].'/rsync_include.txt'))
{
$parameters .= sprintf(' --include-from=%s/rsync_include.txt', $options['rsync-dir']);
}
with this:
if (file_exists($options['rsync-dir'].'/rsync_include.txt'))
{
$parameters .= sprintf(' --include-from=%s/rsync_include.txt', $options['rsync-dir']);
}
if (file_exists($options['rsync-dir'].'/rsync_exclude.txt'))
{
$parameters .= sprintf(' --exclude-from=%s/rsync_exclude.txt', $options['rsync-dir']);
}
In short: swap these two if statements.
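The order matters because rsync checks each file against the filter rules in the order they were given and acts on the first rule that matches, so a leading * exclude hides everything before any include is considered. For anyone calling rsync directly instead of through the symfony task, the same include-first order can be sketched as a shell helper. The name filter_args is made up, and the sketch assumes the config path contains no spaces.

```shell
#!/bin/sh
# Hypothetical helper: build the rsync filter options for a config directory,
# include rules first so that they take effect before a catch-all exclude.
# Usage: filter_args CONFIG_DIR
filter_args() {
    args=""
    if [ -f "$1/rsync_include.txt" ]; then
        args="$args --include-from=$1/rsync_include.txt"
    fi
    if [ -f "$1/rsync_exclude.txt" ]; then
        args="$args --exclude-from=$1/rsync_exclude.txt"
    fi
    printf '%s\n' "${args# }"    # drop the leading space
}

filter_args config    # prints nothing but an empty line if neither file exists
```

A dry run would then look like rsync --dry-run -az $(filter_args config) ./ user@www.server.com:/test_deployment/ with the substitution left unquoted on purpose so that it splits into separate options.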
Alternatively, change the approach.
Use only the exclude file, and list in it only the directories that have changed but that you don't want to sync.
If your modules/, app/, ... directories haven't changed, there is no need to put them in the exclude file, because they will stay the same on both servers anyway.