How do I abstract a set of modules to a variable to be included in multiple pipelines? - wyam

I'm trying to abstract a set of modules to use them in multiple pipelines. I have this before my first pipeline:
IModule[] textReplacement = new IModule[] {
    Replace(" -- ", "—"),
    Replace("--", "—"),
    Trace("Text replacement performed...")
};
Then, in a pipeline:
Pipelines.Add("Pages",
ReadFiles("*.md"),
Concat(textReplacement),
WriteFiles("*.html)
);
When I execute, Text replacement performed... is written to the console, so the execution flow is working through those modules. However:
Text replacement does not actually occur. (Or, if it does, it's not persisted in the documents that continue down the pipeline.)
An empty document is added to the document set.
The documentation for Concat clearly states:
The specified modules are executed with an empty initial document and then outputs the original input documents without modification concatenated with the results from the specified module sequence.
I just can't figure out why this is, why it would be needed or helpful, or how to get rid of it. I can't have an empty document floating around in my document set, or else it causes errors in subsequent pipelines.

The trick here is to use the LINQ .Concat() method to form a single array consisting of the modules specific to each pipeline combined with the common module array that's declared before the pipelines (textReplacement in the example). This works because IPipelineCollection.Add() accepts a params IModule[] array.
IModule[] textReplacement = new IModule[] {
    Replace(" -- ", "—"),
    Replace("--", "—"),
    Trace("Text replacement performed...")
};

Pipelines.Add("Pages",
    new[]
    {
        // Modules before the common set
        ReadFiles("*.md")
    }
    .Concat(textReplacement) // The common set
    .Concat(new[]
    {
        // Modules after the common set
        WriteFiles("*.html")
    })
    .ToArray()
);
Granted, this is pretty awkward. Of course there are other ways to create a single array to feed to IPipelineCollection.Add() besides using .Concat(). For example, you could create a List<IModule> before each pipeline using List<T>.Add() and List<T>.AddRange() to create the aggregate sequence of modules and then just convert it to an array when creating the pipeline. You could also write an extension method that knows how to concatenate multiple sequences into a single array.
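For illustration, here is a rough sketch of that List<IModule> variant; the variable name is made up for the example, and it assumes the generic collections namespace is available in your config script:
List<IModule> pagesModules = new List<IModule>();
pagesModules.Add(ReadFiles("*.md"));        // modules before the common set
pagesModules.AddRange(textReplacement);     // the common set
pagesModules.Add(WriteFiles("*.html"));     // modules after the common set

Pipelines.Add("Pages", pagesModules.ToArray());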
In the future this will become much easier with the introduction of a special module named Modules, designed specifically for this purpose, which acts as a container of child modules (https://github.com/Wyamio/Wyam/issues/197):
Modules textReplacement = Modules(
    Replace(" -- ", "—"),
    Replace("--", "—"),
    Trace("Text replacement performed...")
);

Pipelines.Add("Pages",
    ReadFiles("*.md"),
    textReplacement,
    WriteFiles("*.html")
);

Related

Handle POST data sent as array

I have an html form which sends a hidden field and a radio button with the same name.
This allows people to submit the form without picking from the list (but records a zero answer).
When the user does select a radio button, the form posts BOTH the hidden value and the selected value.
I'd like to write a perl function to convert the POST data to a hash. The following works for standard text boxes etc.
#!/usr/bin/perl
use CGI qw(:standard);

sub GetForm {
    my %form;
    foreach my $p (param()) {
        $form{$p} = param($p);
    }
    return %form;
}
However, when faced with two form inputs with the same name, it just returns the first one (i.e. the hidden one).
I can see that the inputs are included in the POST header as an array but I don't know how to process them.
I'm working with legacy code so I can't change the form unfortunately!
Is there a way to do this?
I have an html form which sends a hidden field and a radio button with the same name.
This allows people to submit the form without picking from the list (but records a zero answer).
That's an odd approach. It would be easier to leave the hidden input out and treat the absence of the data as a zero answer.
However, if you want to stick to your approach, read the documentation for the CGI module.
Specifically, the documentation for param:
When calling param(), if the parameter is multivalued (e.g. from multiple selections in a scrolling list), you can ask to receive an array. Otherwise the method will return the first value.
Thus:
$form{$p} = [ param($p) ];
However, you do seem to be reinventing the wheel. There is a built-in method to get a hash of all parameters:
my $form = CGI->new->Vars;
That said, the documentation also says:
CGI.pm is no longer considered good practice for developing web applications, including quick prototyping and small web scripts. There are far better, cleaner, quicker, easier, safer, more scalable, more extensible, more modern alternatives available at this point in time. These will be documented with CGI::Alternatives.
So you should migrate away from this anyway.
Replace
$form{$p} = param($p); # Value of first field named $p
with
$form{$p} = ( multi_param($p) )[-1]; # Value of last field named $p
or
$form{$p} = ( grep length, multi_param($p) )[-1]; # Value of last field named $p
# that has a non-blank value
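Putting that together, a sketch of the fixed subroutine, assuming a CGI.pm recent enough to provide multi_param; it keeps the last non-blank value for each name, so the selected radio button wins over the hidden field:
sub GetForm {
    my %form;
    foreach my $p (param()) {
        # multi_param returns all values submitted under this name;
        # take the last one that is non-blank
        $form{$p} = ( grep length, multi_param($p) )[-1];
    }
    return %form;
}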

Get specific data from Xapian database with Perl

I'm writing a perl script to retrieve search results from a Xapian database.
I use the Search::Xapian module and tried the basic Xapian query example. This basic program allows me to make a query and get an array of results sorted by relevance. My problem is that the get_data() method returns the whole of the document's data (url, filename, abstract, author, ...) mixed together as a single string.
I searched the CPAN module's documentation for a method to get each piece of data one by one, but I didn't find one.
Is it possible to get the filename, url, author, ... one by one to put them in specific variables?
You haven't posted the code that produces this, or details of your setup. Looking at the simplesearch.pl example, rather than printing the data out, assign what you want to a variable:
# Display the results.
printf "%i results found.\n", $mset->get_matches_estimated();
printf "Results 1-%i:\n", $mset->size();
foreach my $m ($mset->items()) {
    printf "%i: %i%% docid=%i [%s]\n", $m->get_rank() + 1, $m->get_percent(),
        $m->get_docid(), $m->get_document()->get_data();
}
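Xapian itself treats the document data as an opaque blob, so how to split it depends entirely on how your indexer stored it. If it was written as key=value lines (the convention used by Omega's omindex, e.g. url=..., sample=..., caption=...), a sketch like this would pull the fields into a hash; adjust the keys to whatever your indexer actually wrote:
my %fields;
foreach my $line (split /\n/, $m->get_document()->get_data()) {
    # split each line on the first "=" into a key and a value
    my ($key, $value) = split /=/, $line, 2;
    $fields{$key} = $value;
}
print "url: $fields{url}\n" if exists $fields{url};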

Spark: How to structure a series of side effect actions inside mapping transformation to avoid repetition?

I have a spark streaming application that needs to take these steps:
Take a string, apply some map transformations to it
Map again: If this string (now an array) has a specific value in it, immediately send an email (or do something OUTSIDE the spark environment)
collect() and save in a specific directory
apply some other transformation/enrichment
collect() and save in another directory.
As you can see, with lazily evaluated computations this would perform the OUTSIDE action twice. I am trying to avoid caching, as at several hundred lines per second this would kill my server.
I am also trying to maintain the order of operations, though this is not as important: is there a solution I do not know of?
EDIT: my program as of now:
kafkaStream;
lines = take the value, discard the topic;
lines.foreachRDD {
    splittedRDD = arg.map { split the string };
    assRDD = splittedRDD.map { associate to a table };
    flaggedRDD = assRDD.map { add a boolean parameter under an if condition + send mail };
    externalClass.saveStaticMethod( flaggedRDD.collect() and save in file );
    enrichRDD = flaggedRDD.map { enrich with external data };
    externalClass.saveStaticMethod( enrichRDD.collect() and save in file );
}
I put the saving part after the email so that if something goes wrong with it at least the mail has been sent.
In the end, these are the approaches I found:
1. In the DStream transformation before the side-effecting one, make a copy of the DStream: one copy goes on with the transformations, the other gets the .foreachRDD{ outside action } (see the sketch after this list). There is no major downside to this, as it is just one more RDD on a worker node.
2. Extract the { outside action } from the transformation and keep track of the mails already sent, filtering out elements whose mail has already gone. This is almost a superfluous operation, as it will end up filtering out all of the RDD's elements.
3. Caching before going on (although I was trying to avoid it, there was not much else to do).
If you are trying to avoid caching, solution 1 is the way to go.
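For illustration only, a sketch of solution 1 at the DStream level; the helper names (splitTheString, addFlag, sendMail and so on) are placeholders, not from the original program, and without caching each output operation recomputes the lineage:
val flaggedStream = lines
  .map(splitTheString)       // placeholder transformations
  .map(associateToTable)
  .map(addFlag)

// One branch performs only the outside action (the mail), on the executors
flaggedStream.foreachRDD { rdd =>
  rdd.filter(_.flagged).foreach(record => sendMail(record))
}

// The other branch carries on with the saving and enrichment
flaggedStream.foreachRDD { rdd =>
  externalClass.saveStaticMethod(rdd.collect())
  val enriched = rdd.map(enrichWithExternalData)
  externalClass.saveStaticMethod(enriched.collect())
}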

Can I start a Service Now workflow via an external SOAP call?

I would like to make a call into the ServiceNow SOAP web service to start an instance of a specific workflow.
I can find the WSDL for functions like incident.do but seem to be missing the step needed to find the proper table/endpoint for workflows to start.
If you want to start a Workflow via SOAP I think the only way to do this is to create a Scripted Web-Service or a Custom Processor.
In there you will have to define a script which starts your Workflow.
var w = new Workflow();
var context = w.startFlow(id, current, current.operation(), getVars());
In this wiki article you can find API Methods for Workflows.
The tricky bit is getting the variables into the Workflow.
While this sounds easy, in fact it isn't.
If your workflow runs on the table sc_req_item (which is likely if you are dealing with Request Fulfillment), you first need to set the Property (sys_properties) glide.workflow.enable_input_variables to true, because otherwise, you will not be able to add normal Input variables to your workflow.
Then, add the Input variables to the workflow. Note that some nifty datatypes are available there, for example the "Data Structure" type.
All Input variables are treated like custom columns (in fact, they are columns of a workflow-specific table). That is why their names start with u_.
Let's say you define an input variable called u_dynamic_vars (datatype "Data Structure").
Here is how to call the workflow:
var wf_name = "Name of your workflow";
// Instantiate JSON machinery
var parser = new JSON();
// Declare an instance of workflow.js
var wf = new Workflow();
// Get the workflow id
var wfId = wf.getWorkflowFromName(wf_name);
// Start the workflow, passing along an object containing name/value pairs
// mapping to the inputs expected by the workflow
var vars = {};
// Prepare the JSON data structure
var obj = { "name": "George", "lastname": "Washington" };
// Encode the data
vars.u_dynamic_vars = parser.encode(obj);
vars.u_new_email = "inject@new.com";
// Get a specific RITM
var gr = new GlideRecord("sc_req_item");
gr.get("18d8e9740f4013002f504c6be1050e48");
gs.print(gr.number);
// Start the workflow with a "current" record
wf.startFlow(wfId, gr, "update", vars);
// You may also pass null, in which case current is null.
wf.startFlow(wfId, null, "update", vars);
In the workflow, you then unpack the data like so:
// Let's unpack it. For some reason, instantiating the parser won't work here...
payload = JSON.parse(workflow.variables.u_dynamic_vars);
gs.print("payload.name: " + payload.name);
Also note that a workflow does not necessarily need to run on a table.
To achieve this, choose "global" as table name when defining the workflow.

How do I dynamically build a search block in sunspot?

I am converting a Rails app from using acts_as_solr to sunspot.
The app uses the field search capability in solr that was exposed in acts_as_solr. You could give it a query string like this:
title:"The thing to search"
and it would search for that string in the title field.
In converting to sunspot I am parsing out field specific portions of the query string and I need to dynamically generate the search block. Something like this:
Sunspot.search(table_clazz) do
keywords(first_string, :fields => :title)
keywords(second_string, :fields => :description)
...
paginate(:page => page, :per_page => per_page)
end
This is complicated by also needing to do duration (seconds, integer) ranges and negation if the query requires it.
On the current system users can search for something in the title, excluding records with something else in another field and scoping by duration.
In a nutshell, how do I generate these blocks dynamically?
I recently did this kind of thing using instance_eval to evaluate procs (created elsewhere) in the context of the Sunspot search block.
The advantage is that these procs can be created anywhere in your application yet you can write them with the same syntax as if you were inside a sunspot search block.
Here's a quick example to get you started for your particular case:
def build_sunspot_query(conditions)
  condition_procs = conditions.map { |c| build_condition(c) }
  Sunspot.search(table_clazz) do
    condition_procs.each { |c| instance_eval(&c) }
    paginate(:page => page, :per_page => per_page)
  end
end

def build_condition(condition)
  Proc.new do
    # write this code as if it was inside the sunspot search block
    keywords condition[:words], :fields => condition[:field].to_sym
  end
end

conditions = [{ words: "tasty pizza", field: "title" },
              { words: "cheap", field: "description" }]
build_sunspot_query conditions
By the way, if you need to, you can even instance_eval a proc inside of another proc (in my case I composed arbitrarily-nested 'and'/'or' conditions).
Sunspot provides a method called Sunspot.new_search which lets you build the search conditions incrementally and execute it on demand.
An example provided by the Sunspot's source code:
search = Sunspot.new_search do
  with(:blog_id, 1)
end
search.build do
  keywords('some keywords')
end
search.build do
  order_by(:published_at, :desc)
end
search.execute

# This is equivalent to:
Sunspot.search do
  with(:blog_id, 1)
  keywords('some keywords')
  order_by(:published_at, :desc)
end
With this flexibility, you should be able to build your query dynamically. Also, you can extract common conditions to a method, like so:
def blog_facets
  lambda { |s|
    s.facet(:published_year)
    s.facet(:author)
  }
end
search = Sunspot.new_search(Blog)
search.build(&blog_facets)
search.execute
I have solved this myself. The solution I used was to compile the required scopes as strings, concatenate them, and then eval them inside the search block.
This required a separate query builder library that interrogates the Solr indexes to ensure that a scope is not created for a non-existent index field.
The code is very specific to my project, and too long to post in full, but this is what I do:
1. Split the search terms
This gives me an array of the terms, or terms plus fields:
['field:term', 'non field terms']
2. This is passed to the query builder.
The builder converts the array to scopes, based on what indexes are available. This method is an example that takes the model class, field and value and returns the scope if the field is indexed.
def convert_text_query_to_search_scope(model_clazz, field, value)
  if field_is_indexed?(model_clazz, field)
    escaped_value = value.gsub(/'/, "\\\\'")
    "keywords('#{escaped_value}', :fields => [:#{field}])"
  else
    ""
  end
end
3. Join all the scopes
The generated scopes are joined with join("\n") and the result is evaled inside the search block.
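For illustration, a rough sketch of that last step; the variable names here are made up for the example:
scope_strings = search_terms.map do |field, value|
  convert_text_query_to_search_scope(model_clazz, field, value)
end
scope_code = scope_strings.reject(&:empty?).join("\n")

Sunspot.search(model_clazz) do
  # the string is evaluated in the context of the search DSL block
  eval(scope_code) unless scope_code.empty?
  paginate(:page => page, :per_page => per_page)
end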
This approach allows the user to select the models they want to search, and optionally to do field-specific searching. The system will then only search the models with any specified fields (or common fields), ignoring the rest.
The method to check if the field is indexed is:
# based on http://blog.locomotivellc.com/post/6321969631/sunspot-introspection
def field_is_indexed?(model_clazz, field)
  # first part returns an array of all indexed fields - text and other types - plus ':class'
  Sunspot::Setup.for(model_clazz).all_field_factories.map(&:name).include?(field.to_sym)
end
And if anyone needs it, a check for sortability:
def field_is_sortable?(classes_to_check, field)
  if field.present?
    classes_to_check.each do |table_clazz|
      return false if !Sunspot::Setup.for(table_clazz).field_factories.map(&:name).include?(field.to_sym)
    end
    return true
  end
  false
end