What is the " p " in beam please? - apache-beam

I see this "p | " at the beginning of beam pipelines and I do not understand what this p means.
example code of a beam pipeline from this introduction tutorial: https://beam.apache.org/documentation/sdks/python-streaming/
lines = p | 'read' >> ReadFromText(known_args.input)
​...
​counts = (lines
​| 'split' >> (beam.ParDo(WordExtractingDoFn())
​.with_output_types(six.text_type))
​| 'pair_with_one' >> beam.Map(lambda x: (x, 1))
​| 'group' >> beam.GroupByKey()
​| 'count' >> beam.Map(count_ones))
​...
​output = counts | 'format' >> beam.Map(format_result)
​# Write the output using a "Write" transform that has side effects.
​output | 'write' >> WriteToText(known_args.output)
What I understand:
I understand the concept of p-collection, p-transforms and the aim of beam which is to treat streaming data and batched data the same way.
What I don't understand:
What is this p ? What are the parentheses for ? The pipes ? the >> ?
It looks like bash style code but nowhere is it explained.
Please does someone have an explanation or a link to an actual tutorial that takes it from the start ?

p is the variable to start the pipeline (p = beam.Pipeline()), this is also referred as PBegin.
| separates each PTransform (operation).
>> is use between the | and the PTransform in case you want to name it.
The parentheses are there so Python doesn't complain about the multiline.
There's a set of tutorials in GCP that start from the very basics, with exercises in every "chapter". They are notebooks. To get them, you'd need to go to "Dataflow > Notebooks > Create Instance" and then, "Open Jupyterlab Notebook" and there should be a folder called Tutorials. Disclaimer, you'd need to pay for the instance hours and I was part of the team who added them.
There's also something called Katas, which is free, but I haven't gone through it thoroughly, so not sure if they start from the very beginning.

In this tutorial one can find a good introduction:
https://beam.apache.org/documentation/programming-guide/

Related

Need to explain the kdb/q script to save partitioned table

I'm trying to understand this snippet code from:
https://code.kx.com/q/kb/loading-from-large-files/
to customize it by myself (e.x partition by hours, minutes, number of ticks,...):
$ cat fs.q
\d .Q
/ extension of .Q.dpft to separate table name & data
/ and allow append or overwrite
/ pass table data in t, table name in n, : or , in g
k)dpfgnt:{[d;p;f;g;n;t]if[~&/qm'r:+en[d]t;'`unmappable];
{[d;g;t;i;x]#[d;x;g;t[x]i]}[d:par[d;p;n];g;r;<r f]'!r;
#[;f;`p#]#[d;`.d;:;f,r#&~f=r:!r];n}
/ generalization of .Q.dpfnt to auto-partition and save a multi-partition table
/ pass table data in t, table name in n, name of column to partition on in c
k)dcfgnt:{[d;c;f;g;n;t]*p dpfgnt[d;;f;g;n]'?[t;;0b;()]',:'(=;c;)'p:?[;();();c]?[t;();1b;(,c)!,c]}
\d .
r:flip`date`open`high`low`close`volume`sym!("DFFFFIS";",")0:
w:.Q.dcfgnt[`:db;`date;`sym;,;`stats]
.Q.fs[w r#]`:file.csv
But I couldn't find any resources to give me detail explain. For example:
if[~&/qm'r:+en[d]t;'`unmappable];
what does it do with the parameter d?
(Promoting this to an answer as I believe it helps answer the question).
Following on from the comment chain: in order to translate the k code into q code (or simply to understand the k code) you have a few options, none of which are particularly well documented as it defeats the purpose of the q language - to be the wrapper which obscures the k language.
Option 1 is to inspect the built-in functions in the .q namespace
q).q
| ::
neg | -:
not | ~:
null | ^:
string | $:
reciprocal| %:
floor | _:
...
Option 2 is to inspect the q.k script which creates the above namespace (be careful not to edit/change this):
vi $QHOME/q.k
Option 3 is to lookup some of the nuggets of documentation on the code.kx website, for example https://code.kx.com/q/wp/parse-trees/#k4-q-and-qk and https://code.kx.com/q/basics/exposed-infrastructure/#unary-forms
Options 4 is to google search for reference material for other/similar versions of k, for example k2/k3. They tend to be similar-ish.
Final point to note is that in most of these example you'll see a colon (:) after the primitives....this colon is required in q/kdb to use the monadic form of the primitive (most are heavily overloaded) while in k it is not required to explicitly force the monadic form. This is why where will show as &: in the q reference but will usually just be & in actual k code

use perl to extract specific output lines

I'm endeavoring to create a system to generalize rules from input text. I'm using reVerb to create my initial set of rules. Using the following command[*], for instance:
$ echo "Bananas are an excellent source of potassium." | ./reverb -q | tr '\t' '\n' | cat -n
To generate output of the form:
1 stdin
2 1
3 Bananas
4 are an excellent source of
5 potassium
6 0
7 1
8 1
9 6
10 6
11 7
12 0.9999999997341693
13 Bananas are an excellent source of potassium .
14 NNS VBP DT JJ NN IN NN .
15 B-NP B-VP B-NP I-NP I-NP I-NP I-NP O
16 bananas
17 be source of
18 potassium
I'm currently piping the output to a file, which includes the preceding white space and numbers as depicted above.
What I'm really after is just the simple rule at the end, i.e. lines 16, 17 & 18. I've been trying to create a script to extract just that component and put it to a new file in the form of a Prolog clause, i.e. be source of(banans, potassium).
Is that feasible? Can Prolog rules contain white space like that?
I think I'm locked into getting all that output from reVerb so, what would be the best way to extract the desirable component? With a Perl script? Or maybe sed?
*Later I plan to replace this with a larger input file as opposed to just single sentences.
This seems wasteful. Why not leave the tabs as they are, and use:
$ echo "Bananas are an excellent source of potassium." \
| ./reverb -q | cut --fields=16,17,18
And yes, you can have rules like this in Prolog. See the answer by #mat. You need to know a bit of Prolog before you move on, I guess.
It is easier, however, to just make the string a a valid name for a predicate:
be_source_of with underscores instead of spaces
or 'be source of' with spaces, and enclosed in single quotes.
You can use probably awk to do what you want with the three fields. See for example the printf command in awk. Or, you can parse it again from Prolog directly. Both are beyond the scope of your current question, I feel.
sed -n 'N;N
:cycle
$!{N
D
b cycle
}
s/\(.*\)\n\(.*\)\n\(.*\)/\2 (\1,\3)/p' YourFile
if number are in output and not jsut for the reference, change last sed action by
s/\^ *[0-9]\{1,\} \{1,\}\(.*\)\n *[0-9]\{1,\} \{1,\}\(.*\)\n *[0-9]\{1,\} \{1,\}\(.*\)/\2 (\1,\3)/p
assuming the last 3 lines are the source of your "rules"
Regarding the Prolog part of the question:
Yes, Prolog facts can contain whitespace like this, with suitable operator declarations present.
For example:
:- op(700, fx, be).
:- op(650, fx, source).
:- op(600, fx, of).
Example query and its result, to let you see the shape of terms that are created with this syntax:
?- write_canonical(be source of(a, b)).
be(source(of(a,b))).
Therefore, with these operator declarations, a fact like:
be source of(a, b).
is exactly the same as stating:
be(source(of(a,b)).
Depending on use cases and other definitions, it may even be an advantage to create this kind of facts (i.e., facts of the form be/1 instead of source_of/2). If this is the only kind of facts you need, you can simply write:
source_of(a, b).
This creates no redundant wrappers and is easier to use.
Or, as Boris suggested, you can use single quotes as in 'be source of'/2.

Find all differences between .mat files

I am looking for a way to list the differences between two .mat files, something that can be usefull for many people.
Though I searched everywhere I could think of, I have not found anything that meets my requirements:
Pick 2 mat files
Find the differences
Save them properly
The closest I have come is visdiff. As long as I stay within matlab, it will allow me to browse the differences, but when I save the result it only shows me the top level.
Here is a simplified example of what my files typically look like:
a = 6;
b.c.d = 7;
b.c.e = 'x';
save f1
f = a;
clear a
b.c.e = 'y';
save f2
visdiff('f1.mat','f2.mat')
If I click here on b, I can find the difference. However if I run this and use 'file>save', I am not able to click on b. Thus I still don't know what has been changed.
Note: I don't have Simulink
Hence my question is:
How can I show all differences between 2 mat files to someone without Matlab
Here are the answers that I personally consider to be most suitable for different situations:
Answer for users with Simulink
General answer
Answer displaying all value differences
Find all differences between mat files without MATLAB?
You can find the differences between HDF5 based .mat files with the HDF5 Tools.
Example
Let me shorten your MATLAB example and assume you create two mat files with
clear ; a = 6 ; b.c = 'hello' ; save -v7.3 f1
clear ; a = 7 ; b.e = 'world' ; save -v7.3 f2
Outside MATLAB use
h5ls -v -r f1.mat
to get a listing about the kind of data included f1.mat:
Opened "f1.mat" with sec2 driver.
/ Group
Location: 1:96
Links: 1
/a Dataset {1/1, 1/1}
Attribute: MATLAB_class scalar
Type: 6-byte null-terminated ASCII string
Data: "double"
Location: 1:2576
Links: 1
Storage: 8 logical bytes, 8 allocated bytes, 100.00% utilization
Type: native double
/b Group
Attribute: MATLAB_class scalar
Type: 6-byte null-terminated ASCII string
Data: "struct"
Location: 1:800
Links: 1
/b/c Dataset {5/5, 1/1}
Attribute: H5PATH scalar
Type: 2-byte null-terminated ASCII string
Data: "/b"
Attribute: MATLAB_class scalar
Type: 4-byte null-terminated ASCII string
Data: "char"
Attribute: MATLAB_int_decode scalar
Type: native int
Data: 2
Location: 1:1832
Links: 1
Storage: 10 logical bytes, 10 allocated bytes, 100.00% utilization
Type: native unsigned short
Use of
h5ls -d -r f1.mat
returns the values of the stored data:
/ Group
/a Dataset {1, 1}
Data:
(0,0) 6
/b Group
/b/c Dataset {5, 1}
Data:
(0,0) 104, 101, 108, 108, 111
The data 104, 101, 108, 108, 111 represents the word hello, which can be seen with
h5ls -d -r f1.mat | tail -1 | awk '{FS=",";printf("%c%c%c%c%c \n",$2,$3,$4,$5,$6)}'
You can get the same listing for f2.mat and compare the two outputs with the tool of your choice.
Comparison also works directly with HDF5 Tools. To compare the two numbers a from both files use
h5diff -r f1.mat f2.mat /a
which will show you the values and their difference
dataset: </a> and </a>
size: [1x1] [1x1]
position a a difference
------------------------------------------------------------
[ 0 0 ] 6 7 1
1 differences found
attribute: <MATLAB_class of </a>> and <MATLAB_class of </a>>
0 differences found
Remarks
There are a few more commands and options in the HDF5 Tools, which may help to get your real problem solved.
Binary distributions are available for Linux and Windows from The HDF Group. For OS X you can get them installed via MacPorts. If needed there is also a GUI: HDFView.
If you have simulink you can use Simulink.saveVars to generate an m-file that upon execution creates the same variables in work space:
a = 6;
b.c.d = 7;
b.c.e = 'x';
Simulink.saveVars('f1');
f = a;
clear a
b.c.e = 'y';
Simulink.saveVars('f2');
visdiff('f1.m','f2.m')
as illustrated in this sctreenshot
Note that by default it limits the number of elements in arrays to 1000 and you can increase it to 10000. Arrays larger than that limit will be saved in a separate mat-file.
UPDATE: From R2014a a new function similar to Simulink.saveVars has been added to MATLAB. see matlab.io.saveVariablesToScript
This is only part of the answer, but maybe it helps.
You could use gencode, a Matlab function that generates Matlab code from a variable such that running the code reproduces the variable. You do this for all of the variables in each mat-file (takes some programming, but should be doable) and put the results in different .m-files.
Then you use a standard text comparison tool (maybe even visdiff) to compare the .m-files.
There are several good tools to compare XML-Files, this I would proceed this way:
Download struct2xml.m
Load both matfiles
Export each with struct2xml
compare, using XMLSpy or similar
Simple general answer, without displaying value differences
Due to the insight I gained from the answers of #BHF, #Daniel R and #Dennis Jaheruddin, I have managed to find a simple scalable solution:
[fs1, fs2, er] = comp_struct(load('f1.mat'),load('f2.mat'))
Note that it works for .mat containing an arbritrary number of variables.
This uses the Compare Structures - File Exchange submission.
Answer for small files, displaying all value differences
Based on the suggestion by #A. Donda I have tried to use gencode to create a variable for everything.
Though it works for my toy example, it is quite slow and tells me that I exceed the allowed amount of variables for my real .mat files.
Anyway, for those who are looking for something that works with small files, I will post this option:
wList=who;
for iLoop = 1:numel(wList)
eval(['generated_' wList{iLoop} '= gencode(' wList{iLoop} ');'])
for jLoop = 1:numel(eval(['generated_' wList{iLoop}]))
eval(['generated_' wList{iLoop} '_' num2str(jLoop) '= generated_' wList{iLoop} '(' num2str(jLoop) ');' ])
end
end
Though it may work, I don't feel like this is the best way to go.
General answer, without displaying value differences
Due to the insight I gained from the answers of #BHF and #Daniel R I have managed to find a reasonably scalable solution.
Step 1: Save all variables from each files as a single struct
This uses the Save workspace to struct - File Exchange submission.
Here are the steps to take assuming you want to compare f1.mat and f2.mat:
clear
load f1
myStruct1 = ws2struct;
save myStruct1 myStruct1
clear
load f2
myStruct2 = ws2struct;
save myStruct2 myStruct2
clear
load myStruct1
load myStruct2
Step 2: Compare the structs
This uses the Compare Structures - File Exchange submission
Given that you want to compare myStruct1 and myStruct2 you can simply call:
[fs1, fs2, er] = comp_struct(myStruct1,myStruct2)
I was positively surprised at how readable the list of differences in er is, here is the output for the example that was used in the question:
er =
's2 is missing field a'
's1(1).b(1).c(1).e and s2(1).b(1).c(1).e do not match'
Note that it will not show values, from a technical point of view it is probably not too hard to change the m file if value difference displays are desirable. However, especially if there are some big matrices I suppose this could result in problematic output.

how to use expressions as function parameters in powershell

This is a very simple task in every language I have ever used, but I can't seem to figure it out in PowerShell. An example of what I'm talking about in C:
abs(x + y)
The expression x + y is evaluated, and the result passed to abs as the parameter... how do I do that in PowerShell? The only way I have figured out so far is to create a temporary variable to store the result of the expression, and pass that.
PowerShell seems to have very strange grammar and parsing rules that are constantly catching me by surprise, just like this situation. Does anyone know of documentation or a tutorial that explains the basic underlying theory of the language? I can't believe these are all special cases, there must be some rhyme or reason that no tutorial I have yet read explains. And yes, I've read this question, and all of those tutorials are awful. I've pretty much been relegated to learning from existing code.
In your case, simply surrounding the expression with parenthesis will allow you to pass it to your function.
You need to do this because PowerShell has more than one parsing mode depending on the beginning of the command.
Expression mode is similar to how most other languages parse - numbers are numbers and strings are quoted.
Command mode treats everything as a string except for variables and parenthesis. Strings here don't need to be quoted.
1+2 Expression mode - starts with number
"string" Expression mode - starts with quote
string Command mode - starts with letter
& "string" Command mode - starts with &
. "string" Command mode - starts with . and a space
.123 Expression mode - starts with . and number (without space)
.string Command mode - starts with a . that is part of a command name
You can mix modes in a single line by enclosing the commands with parenthesis.
You can see this effect if you define function abs in the following way:
function Abs($value)
{
Write-Host $args
if($value -lt 0) { -$value } else { $value }
}
Abs 1 + 2
#Prints: + 2
#Returns: 1
Abs 1+2
#Prints:
#Returns: 1+2
Abs (1 + 2)
#Prints:
#Returns: 3
Abs (1+2)
#Prints:
#Returns: 3
Not sure I follow exactly the problem you are encountering. The following shows that this works as expected in PowerShell:
PS> function abs($val) {Write-Host "`$val is $val"; if ($val -ge 0) {$val} `
else {-$val}}
PS> abs (3+4)
$val is 7
7
As for the best docs on the language, I highly recommend Windows PowerShell in Action by Bruce Payette (he's the guy behind the grammar). If you can wait a few more months (or are OK with an eletronic copy), there is a second edition due out this summer that has been updated for PowerShell 2.0.
You may also find my Effective PowerShell series helpful. Checkout item #10 on PowerShell Parsing Modes.

Limiting the amount of information printed by Perl debugger

One of my pet peeves with debugging Perl code (in command line debbugger, perl -d) is the fact that mistakenly printing (via x command) the contents of a huge datastructure is guaranteed to freeze up your terminal for forever and a half while 100s of pages of data are printed. Epecially if that happens across slowish network.
As such, I'd like to be able to limit the amount of data that x prints.
I see two approaches - I'd be willing to try either if someone knows how to do.
Limit the amount of data any single command in debugger prints.
Better yet, somehow replace the built-in x command with a custom Perl method (which would calculate the "size" of the data structure, and refuse to print its contents without confirmation).
I'm specifically asking "how to replace x with custom code" - building a Good Enough "is the data structure too big" Perl method is something I can likely do on my own without too much effort although I see enough pitfalls preventing the "perfect" one from being a fairly frustrating endeavour. Heck, merely doing Data::Dumper->Dump and taking the length of the string might do the trick :)
Please note that I'm perfectly well aware of how to manually avoid the issue by recursively examining layers of datastructure (e.g. print the ref, print the count of keys/array elements, etc...)... the whole point is I want to be able to avoid thoughtlessly typing x $huge_pile_of_data without thinking - or stumbling on a bug populating said huge pile of data into what should be a scalar.
The x command takes an optional argument for the maximum depth to display. That's not quite the same as limiting the amount of data to N pages, but it's definitely useful to prevent overload.
DB<1> %h = (a => { b => { c => 1 } } )
DB<2> x %h
0 'a'
1 HASH(0x1d5ff44)
'b' => HASH(0x1d61424)
'c' => 1
DB<3> x 2 %h
0 'a'
1 HASH(0x1d5ff44)
'b' => HASH(0x1d61424)
You can specify the default depth to print via the o command, e.g.
DB<1>o dumpDepth=1
Add that to your .perldb file to apply it to all debugger sessions.
Otherwise, it looks like the x command invokes DB::dumpit() which is just a wrapper for dumpval.pl (or, more specifically, the main::dumpValue() sub it defines). You could modify/replace that script as you see fit. I'm not sure how you'd make it interactive, though.
The | command in the debugger pipes another command's output to your pager, e.g.
DB<1> |x %huge_datastructure