pig script fails when macro is used twice

I have a Pig script that uses a macro twice, on the same relation but with different parameters; each use filters the relation on a different field. The macro looks more or less as follows:
DEFINE doubleGroupJoin (mainField, mainRelation) returns out {
    valid = FILTER $mainRelation BY $mainField != '';
    r1 = FOREACH (GROUP valid BY $mainField) GENERATE
        field1_1, field1_2, ...;
    r2 = FOREACH (GROUP valid BY ($mainField, otherfield1, ...)) GENERATE
        field2_1, field2_2, ...;
    $out = FOREACH (JOIN r1 BY field1_1, r2 BY field1_2) GENERATE
        final1, final2, ...;
};
In the script I have the following:
-- Output1
finalR1 = doubleGroupJoin('field1', initialData);
STORE finalR1 INTO '$output/R1';
-- Output2
finalR2 = doubleGroupJoin('field2', initialData);
STORE finalR2 INTO '$output/R2';
If I comment out either the Output1 or the Output2 block, the job works fine, but if I use both I get the following error:
java.lang.ClassCastException: org.apache.pig.data.BinSedesTuple cannot be cast to java.lang.String
at org.apache.pig.backend.hadoop.HDataType.getWritableComparableTypes(HDataType.java:106)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Map.collect(PigGenericMapReduce.java:111)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:284)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:277)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Using Pig 0.12.0 here. Any suggestion on why this might be happening?

Related

Use matlab script to check symbol types in state chart

I am writing a script that opens a Simulink model and checks the data types of all logical comparisons. My model has four comparisons: two are logical operator blocks, and the other two are condition statements in a state chart.
Using a script I can check the data types of the logical operators, as shown below, but I'm having trouble finding documentation on how to check the symbol data types in the state chart.
load_system('comparison.slx');
RelOps = find_system('comparison','BlockType','RelationalOperator');
for j = 1:numel(RelOps)
    currRelOp = get_param(RelOps(j),'Operator');
    if strcmp(currRelOp,'==')
        currpc = get_param(RelOps(j),'PortConnectivity');
        currpc = currpc{1,1};
        srcblktype = get_param(currpc(2).SrcBlock,'BlockType');
        if strcmp(srcblktype,'Constant')
            srcblkdatatype = get_param(currpc(2).SrcBlock,'OutDataTypeStr');
        end
    end
end
I thought I would post this on Stack Overflow while continuing my search. So far I have a script that can find all the variable names:
compList = {'==' '!=' '<' '>' '<=' '>='};
charts = find(sfroot,'-isa','Stateflow.Chart');
st = find(charts(1),'-isa','Stateflow.Transition');
inputList = {};
for i = 1:numel(st)
    condStr = st(i).LabelString;
    if ~strcmp(condStr,'')
        if contains(condStr, compList)
            splitStr = split(condStr,compList);
            for j = 1:numel(splitStr)
                inputStr = char(splitStr(j));
                inputStr = erase(inputStr,{' ','[',']'});
                inputList = [inputList {inputStr}];
            end
        end
    end
    % For loop searching for data types
end

How to use pattern matching callbacks in Dash for Julia

I'm new to Julia, but not so new to Dash. I'm trying to build my first app with Dash for Julia, but I can't get a pattern-matching callback to work properly. Here's the part of the code that's giving me trouble:
callback!(
    app,
    Output((type= "filter_", index= ALL), "options"),
    Input("inputs", "data"),
    State((type= "filter_", index= ALL), "value"),
) do inputs, filters
    list_outs = []
    list_vals = []
    for i in 1:length(filters)
        push!(list_outs, [(label= input, value= input) for input in inputs])
    end
    return list_outs
end
What I'm trying to do here is to use the available inputs of the data set, already stored in "inputs", to set the filters' options, creating as many sets of options as there are dropdowns.
The problem here is, I guess, in the format of the output I'm returning: it says "Invalid number of output values for {"index":["ALL"],"type":"filter_"}.options. Expected 3, got 1"
Sadly, I found nothing of use about how to use pattern matching callbacks with julia; I tried passing the output both as an array and as a tuple, but to no avail.
Any help is welcomed, thank you all!

This error arises because, when a callback has a single Output, its return value is automatically wrapped in an array so that later processing is uniform; in your case it becomes [list_outs]. Treating an Output with a match pattern as a single output as well is a bug on my side. I have opened an issue and will try to fix it in the near future.
Right now you can work around this problem by using Output as an array:
using Dash
using DashHtmlComponents
using DashCoreComponents
app = dash()
app.layout = html_div() do
    dcc_input(id = "input", value = "A,B,C"),
    dcc_dropdown(id = (type="filter_", index = 1)),
    dcc_dropdown(id = (type="filter_", index = 2)),
    dcc_dropdown(id = (type="filter_", index = 3)),
    dcc_dropdown(id = (type="filter_", index = 4))
end
callback!(
    app,
    [Output((type= "filter_", index= ALL), "options")], # Declare the output explicitly as a multiple output
    Input("input", "value"),
    State((type= "filter_", index= ALL), "value"),
) do input, filters
    inputs = split(input, ",")
    list_outs = []
    list_vals = []
    for i in 1:length(filters)
        push!(list_outs, [(label= input, value= input) for input in inputs])
    end
    return [list_outs] # Accordingly, we return the result inside an additional array
end
run_server(app, debug = true)

Including time as an explicit variable in constraint in a Pyomo Model

I am using Pyomo to model a semi-batch reaction.
Consider an ODE system that describes a semi-batch reactor in which one of the reactants is fed at a given volume flow for t1 units of time and the reaction goes on until t_end, with t1 < t_end.
To specify the stop in the flow, I can use a conditional rule (assume t1 = 3.5*60):
def _vol_flow_in_schedule(mod,t):
    if t<=3.5*60:
        return mod.vol_flow_in[t] == (12.3/1000)/(3.5*60)
    else:
        return mod.vol_flow_in[t] == 0
m1.vol_flow_in_schedule = Constraint(m1.time,rule=_vol_flow_in_schedule)
which will create a discontinuity (and then my model does not converge). What I want to do is use a sigmoidal function that will transition the flow to zero without a discontinuity.
To implement the sigmoidal though I need to refer to the time variable itself.
The below MATLAB code gives me the result I want:
t=[0:1:500];
acc=2; %Acceleration parameter, higher values yields sharper change.
time_of_step=3.5*60;
init_value = (12.3/1000)/(3.5*60);
end_value = 0;
sigmoidal=(init_value+(end_value-init_value)/2)...
+((end_value-init_value)/2)*atan((t-time_of_step)*acc)/atan(max(t));
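For readers more at home in Python, the MATLAB expression above can be sketched as a plain function. This is only an illustration of the formula, not Pyomo code: the name smooth_step and its keyword arguments are hypothetical, and the default values are the ones from the snippet above.

```python
import math


def smooth_step(t, t_step=3.5 * 60, init_value=(12.3 / 1000) / (3.5 * 60),
                end_value=0.0, acc=2.0, t_max=500.0):
    """Arctangent-smoothed transition from init_value to end_value around t_step."""
    # Half of the total change; negative here because the flow drops to zero.
    half_range = (end_value - init_value) / 2
    # Same terms as the MATLAB line: midpoint plus a scaled arctangent,
    # normalised by atan(max(t)) as in the original.
    return (init_value + half_range
            + half_range * math.atan((t - t_step) * acc) / math.atan(t_max))


# Print a few sample points around the switching time.
for t in (0, 200, 210, 220, 500):
    print(t, smooth_step(t))
```

At t = t_step the arctangent term vanishes, so the function returns exactly the midpoint between the two levels. Because the normalisation uses atan(max(t)) rather than the limit pi/2, the value at t = 500 ends up very slightly below end_value with these parameters.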
This implementation, however, needs the time variable explicitly in the function. How can I access the time variable inside the Pyomo rule? I tried the code below, but I get a "Cannot treat the scalar component 't_of_step' as an array" error:
m1.init_value = Param(initialize = (12.3/1000)/(3.5*60))
m1.end_value = Param(initialize = 0)
m1.t_of_step = Param(initialize = 210)
m1.acc = Param(initialize = 5)
.
.
def _vol_flow_sigmoidal (mod,t):
    return mod.vol_flow_in[t] == (mod.init_value+(mod.end_value-mod.init_value)/2)+((mod.end_value-mod.init_value)/2)*atan((t-mod.t_of_step)*mod.acc)/atan(1500)
m1.vol_flow_sigmoidal = Constraint(m1.time,rule=_vol_flow_sigmoidal)
Hopefully I've described clearly what I am after. Any hints are most welcome.
Thanks!
Sal

How are you declaring the m1.time index?
My guess is that you are using a NumPy array to initialize the m1.time index. There is a known problem in Pyomo (see Issue #31) where the NumPy operator overloading and the Pyomo operator overloading end up fighting with each other (basically, NumPy gets fooled into thinking Pyomo scalars are actually indexed and attempts to treat them like arrays).
I was able to reproduce the error with the following complete example:
# pyomo 4.4.1
from pyomo.environ import *
import numpy as np
m1 = ConcreteModel()
m1.time = Set(initialize=np.array([0,100,200,300,400,500]))
m1.vol_flow_in = Var(m1.time)
m1.init_value = Param(initialize = (12.3/1000)/(3.5*60))
m1.end_value = Param(initialize = 0)
m1.t_of_step = Param(initialize = 210)
m1.acc = Param(initialize = 5)
def _vol_flow_sigmoidal (mod,t):
    return mod.vol_flow_in[t] == (mod.init_value+(mod.end_value-mod.init_value)/2)\
        +((mod.end_value-mod.init_value)/2)*atan((t-mod.t_of_step)*mod.acc)/atan(1500)
m1.vol_flow_sigmoidal = Constraint(m1.time,rule=_vol_flow_sigmoidal)
There are two alternatives that do work, both based on avoiding using NumPy arrays to initialize Pyomo Sets. You can either completely avoid Numpy:
m1.time = Set(initialize=[0,100,200,300,400,500])
or explicitly cast the NumPy array to a list:
timeArray = np.array([0,100,200,300,400,500])
m1.time = Set(initialize=timeArray.tolist())
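A small check, independent of Pyomo, may help show what the cast changes. The sketch below assumes only that NumPy is installed; the variable names are illustrative. Elements read from a NumPy array are NumPy scalar types, which bring NumPy's operator overloading into expressions such as t - mod.t_of_step, whereas tolist() yields built-in Python ints. That this is what triggers the error is a plausible reading of the explanation above, not something verified here.

```python
import numpy as np

time_array = np.array([0, 100, 200, 300, 400, 500])

# An element taken directly from the array is a NumPy scalar type.
numpy_element = time_array[0]

# tolist() converts every element to a built-in Python int.
plain_element = time_array.tolist()[0]

print(type(numpy_element).__name__, isinstance(numpy_element, np.integer))
print(type(plain_element).__name__, isinstance(plain_element, np.integer))
```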
Finally, for completeness, two other notes:
1. This also applies to initializing ContinuousSet objects in pyomo.dae.
2. You will see the same behavior even if you avoid the explicit Pyomo Set declaration. That is, the following will also generate the error:
m1.time = np.array([0,100,200,300,400,500])
# ...
m1.vol_flow_sigmoidal = Constraint(m1.time,rule=_vol_flow_sigmoidal)
This is because Pyomo will quietly create the Set object for you behind the scenes as m1.vol_flow_sigmoidal_index and then use that Set to index the Constraint.

How do I refer to an outside alias from inside a piglatin macro?

I have an alias which I want to use in a macro:
foo = ....;
define my_macro (z) returns y {
    $y = join $z by id, foo by id;
};
a = my_macro(b);
Alas, I get the error:
Undefined alias: macro_my_macro_foo_0
I can, of course, pass foo as an argument:
define my_macro (foo, z) returns y {
    $y = join $z by id, $foo by id;
};
a = my_macro(foo,b);
Is this the right way?
If foo is actually a relatively complicated object, will it be recalculated for each macroexpansion of my_macro?

Yes, the second approach is the right one: you need to pass the alias as an argument to the macro, otherwise it will not be visible inside the macro.
Conversely, an alias defined inside the macro is not accessible outside under its own name. If you want to access it, use the expanded name format macro_<my macro_name>_<alias name suffixed with an instance>.
I have simulated both options:
1. accessing an outside alias from inside the macro (using an argument)
2. accessing an alias defined inside the macro from outside (using the macro-expanded name format)
Example
in.txt
a,10,1000
b,20,2000
c,30,3000
in1.txt
10,aaa
20,bbb
30,ccc
Pigscript:
define my_macro (foo,z) returns y {
    $y = join $z by g1, $foo by f2;
    test = FOREACH $y generate $0,$2;
};
foo = LOAD 'in.txt' USING PigStorage(',') AS (f1,f2,f3);
b = LOAD 'in1.txt' USING PigStorage(',') AS (g1,g2);
C = my_macro(foo,b);
DUMP C;
--DUMP macro_my_macro_test_0;
Output of option1:
DUMP C
(10,aaa,a,10,1000)
(20,bbb,b,20,2000)
(30,ccc,c,30,3000)
Output of option2:
DUMP macro_my_macro_test_0
(10,a)
(20,b)
(30,c)
There are some restrictions on macros, for example:
1. not allowed inside a nested FOREACH statement
2. not allowed to use any Grunt commands
3. not allowed to include a user-defined schema
I suggest referring to the documentation linked below; it gives a better idea of how macros work and how to use them in a Pig script.
http://pig.apache.org/docs/r0.13.0/cont.html#macros

Loop through values of a SPSS variable inside of a Macro

How can I pass the values of a specific variable to a list-processing loop inside a macro?
Let's say, as a simplified example, I've got a variable foo which contains the values 1, 4, 12, 33 and 51.
DATA LIST FREE / foo (F2) .
BEGIN DATA
1
4
12
33
51
END DATA.
And a macro that does some stuff with those values.
For testing reasons this Macro will just echo those values.
I'd like to find a way to run a routine that works like the following:
DEFINE !testmacro (list !CMDEND)
!DO !i !IN (!list)
ECHO !QUOTE(!i).
!DOEND.
!ENDDEFINE.
!testmacro list = 1 4 12 33 51. * <- = values from foo.

This is a situation where using the Python APIs would be a good choice.

I made myself a little bit familiar with Python recently :-)
So this is what I worked out.
If the variable is numeric:
BEGIN PROGRAM PYTHON.
import spss,spssdata
foolist = [element[0] for element in spssdata.Spssdata('foo').fetchall()]
foostring = " ".join(str(int(i)) for i in foolist)
spss.Submit("!testmacro list = %(foostring)s." %locals())
END PROGRAM.
If the variable is a string:
BEGIN PROGRAM PYTHON.
import spss,spssdata
foolist = [element[0].strip() for element in spssdata.Spssdata('bar').fetchall()]
foostring = " ".join(foolist)
spss.Submit("!testmacro list = %(foostring)s." %locals())
END PROGRAM.
Variants
Duplicates removed and list is ordered
BEGIN PROGRAM PYTHON.
import spss,spssdata
foolist = sorted(set([element[0] for element in spssdata.Spssdata('foo').fetchall()]))
foostring = " ".join(str(int(i)) for i in foolist)
spss.Submit("!testmacro list = %(foostring)s." %locals())
END PROGRAM.
Duplicates removed and items in order of first appearance in the dataset
Here, I use a function which I retrieved from Peter Bengtsson's homepage (peterbe.com)
BEGIN PROGRAM PYTHON.
import spss,spssdata
def uniquify(seq, idfun=None):
    # order preserving
    if idfun is None:
        def idfun(x): return x
    seen = {}
    result = []
    for item in seq:
        marker = idfun(item)
        if marker in seen: continue
        seen[marker] = 1
        result.append(item)
    return result
foolist = uniquify([element[0] for element in spssdata.Spssdata('foo').fetchall()])
foostring = " ".join(str(int(i)) for i in foolist)
spss.Submit("!testmacro list = %(foostring)s." %locals())
END PROGRAM.
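The string-building step shared by the variants above can also be tried outside SPSS. The sketch below deliberately avoids the spss and spssdata modules so that it runs in any Python 3.7+ interpreter; build_macro_call is a hypothetical helper name, and the assumption that numeric values arrive as floats is inferred from the int() conversion used above.

```python
def build_macro_call(values, macro_name="!testmacro"):
    """Return an SPSS macro call listing the distinct values in first-seen order."""
    # dict.fromkeys keeps the first occurrence of each key and preserves
    # insertion order (guaranteed from Python 3.7 on).
    distinct = list(dict.fromkeys(values))
    # Assumption: numeric values come in as floats, so whole numbers are
    # rendered as ints; anything else is converted to a stripped string.
    rendered = " ".join(
        str(int(v)) if isinstance(v, float) and v.is_integer() else str(v).strip()
        for v in distinct
    )
    return "%s list = %s." % (macro_name, rendered)


command = build_macro_call([1.0, 4.0, 12.0, 4.0, 33.0, 51.0, 1.0])
print(command)  # prints: !testmacro list = 1 4 12 33 51.
```

Inside a BEGIN PROGRAM block, the resulting string would be passed to spss.Submit in the same way as foostring is above.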
Non-Python Solution
Not that I recommend it, but there is a way to do this without Python.
I got the basic idea from an SPSS programming book; it goes as follows:
Use the WRITE command to create a text file containing the desired command and variable values, and include it with the INSERT command.
DATASET COPY foolistdata.
DATASET ACTIVATE foolistdata.
AGGREGATE OUTFILE=* MODE=ADDVARIABLES
/BREAK
/NumberOfCases=N.
* Variable which contains the command as string in the first case.
STRING macrocommand (A18).
IF ($casenum=1) macroCommand = "!testmacro list = ".
EXECUTE.
* variable which contains a period (.) in the last case,
* for the ending of the command string.
STRING commandEnd (A1).
IF ($casenum=NumberOfCases) commandEnd = ".".
* Write the 'table' with the command and variable values into a textfile.
WRITE OUTFILE="macrocommand.txt" /macrocommand bar commandEnd.
EXECUTE.
* Macrocall.
INSERT FILE ="macrocommand.txt".
