Apache Beam no watermark_estimator_provider while running SDFBoundedSourceReader - apache-beam

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run_pipeline(pipeline_options):
    # s3_input is the S3 path of the input file (defined elsewhere)
    with beam.Pipeline(options=pipeline_options) as p:
        data = p | 'read' >> beam.io.ReadFromText(s3_input)
        # Attach a common dummy key so that all lines can be batched together
        data = data | beam.Map(lambda x: ('dk', x))
        data = data | 'Group into batches' >> beam.GroupIntoBatches(10)
        # Drop the key, keeping only the batch of lines
        data = data | beam.Map(lambda x: x[1])
        data | beam.Map(print)

def run_direct():
    # https://beam.apache.org/documentation/runners/direct/
    pipeline_options = PipelineOptions([
        "--runner=DirectRunner",
        "--direct_num_workers=1",
        "--direct_running_mode=multi_threading"
    ])
    run_pipeline(pipeline_options)

run_direct()
AttributeError: 'apache_beam.runners.common.MethodWrapper' object has no attribute 'watermark_estimator_provider' [while running 'read/Read/SDFBoundedSourceReader/ParDo(SDFBoundedSourceDoFn)/pair']
The pipeline reads a file from S3, adds a common key to every line, and then groups the lines into batches.
Essentially, I want to batch the lines.

However, if I don't read from a file, it works:
with beam.Pipeline() as pipeline:
    batches_with_keys = (
        pipeline
        | 'Create produce' >> beam.Create([1,2,3,4,5,6,7,8])
        | beam.Map(lambda x: ('dk', x))
        | 'Group into batches' >> beam.GroupIntoBatches(3)
        | beam.Map(lambda x: x[1])
        | beam.Map(print))
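For reference, the batching that both pipelines aim for can be sketched in plain Python, outside of Beam. This is only an illustration of the intended result; the helper name `batches` is made up for this sketch, and Beam's GroupIntoBatches does not promise the same ordering or batch boundaries.

```python
from itertools import islice

def batches(items, size):
    """Yield consecutive lists of at most `size` items (hypothetical helper)."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch

result = list(batches([1, 2, 3, 4, 5, 6, 7, 8], 3))
print(result)
```

With a batch size of 3, the eight inputs end up in two full batches and one partial batch.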

Can you try using the s3io instead of beam.io.ReadFromText?

Related

Extracting Specific Field from String in Scala

My DataFrame returns the result below as a String:
QueryResult{status='success', finalSuccess=true, parseSuccess=true, allRows=[{"cbcnt":0}], signature={"cbcnt":"number"}, info=N1qlMetrics{resultCount=1, errorCount=0, warningCount=0, mutationCount=0, sortCount=0, resultSize=11, elapsedTime='5.080179ms', executionTime='4.931124ms'}, profileInfo={}, errors=[], requestId='754d19f6-7ec1-4609-bf2a-54214d06c57c', clientContextId='542bc4c8-1a56-4afb-8c2f-63d81e681cb4'} |
QueryResult{status='success', finalSuccess=true, parseSuccess=true, allRows=[{"cbcnt":"2021-07-30T00:00:00-04:00"}], signature={"cbcnt":"String"}, info=N1qlMetrics{resultCount=1, errorCount=0, warningCount=0, mutationCount=0, sortCount=0, resultSize=11, elapsedTime='5.080179ms', executionTime='4.931124ms'}, profileInfo={}, errors=[], requestId='754d19f6-7ec1-4609-bf2a-54214d06c57c', clientContextId='542bc4c8-1a56-4afb-8c2f-63d81e681cb4'}
I just want the value of "cbcnt", for example:
"cbcnt":0 <-- the numeric part of this
Expected Output
col
----
0
2021-07-30
Tried:
.withColumn("CbRes",regexp_extract($"Col", """"cbcnt":(\S*\d+)""", 1))
Output
col
----
0
"2021-07-30 00:00:00   <-- additional text is returned along with the date
Using the PySpark function regexp_extract:
from pyspark.sql import functions as F
df = ...  # a DataFrame with a column "text" that contains the input data
df.withColumn("col", F.regexp_extract("text", r'"cbcnt":(\d+)', 1)).show()
Extract via regex:
val value = "QueryResult{status='success', finalSuccess=true, parseSuccess=true, allRows=[{\"cbcnt\":0}], signature={\"cbcnt\":\"number\"}, info=N1qlMetrics{resultCount=1, errorCount=0, warningCount=0, mutationCount=0, sortCount=0, resultSize=11, elapsedTime='5.080179ms', executionTime='4.931124ms'}, profileInfo={}, errors=[], requestId='754d19f6-7ec1-4609-bf2a-54214d06c57c', clientContextId='542bc4c8-1a56-4afb-8c2f-63d81e681cb4'} |"
val regex = """"cbcnt":(\d+)""".r.unanchored
val s"${regex(result)}" = value
println(result)
Output:
0
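The patterns above only capture the numeric case. As a rough sketch of one way to cover both rows from the question (a bare number and a quoted timestamp of which only the date part is wanted), the alternation below tries the date form first. It uses Python's `re` module purely for illustration; the name `extract_cbcnt` and the shortened sample rows are assumptions, and whether the same pattern behaves identically inside Spark's regexp_extract has not been checked here.

```python
import re

# Try a YYYY-MM-DD date first (optionally preceded by a quote), then fall back to a plain integer.
CBCNT = re.compile(r'"cbcnt":"?(\d{4}-\d{2}-\d{2}|\d+)')

def extract_cbcnt(text):
    """Return the captured cbcnt value as a string, or None if absent (hypothetical helper)."""
    match = CBCNT.search(text)
    return match.group(1) if match else None

# Shortened versions of the two rows in the question.
rows = [
    'QueryResult{status=\'success\', allRows=[{"cbcnt":0}], signature={"cbcnt":"number"}}',
    'QueryResult{status=\'success\', allRows=[{"cbcnt":"2021-07-30T00:00:00-04:00"}], signature={"cbcnt":"String"}}',
]
values = [extract_cbcnt(row) for row in rows]
print(values)
```

The date alternative has to come first; otherwise `\d+` would stop at the year.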

How can I set the level of precision for Raku's sqrt?

With Perl, one could use bignum to set the level of precision for all operators. As in:
use bignum ( p => -50 );
print sqrt(20); # 4.47213595499957939281834733746255247088123671922305
With Raku I have no problems with rationals, since I can use Rat / FatRat, but I don't know how to get a higher level of precision from sqrt:
say 20.sqrt # 4.47213595499958
As stated in Elizabeth's answer, sqrt returns a Num, so it has limited precision; see that answer for more detail.
For that reason I created a Raku class, BigRoot, which uses Newton's method and FatRat types to calculate roots. You may use it like this:
use BigRoot;
# Can change precision level (Default precision is 30)
BigRoot.precision = 50;
my $root2 = BigRoot.newton's-sqrt: 2;
# 1.41421356237309504880168872420969807856967187537695
say $root2.WHAT;
# (FatRat)
# Can use other root numbers
say BigRoot.newton's-root: root => 3, number => 30;
# 3.10723250595385886687766242752238636285490682906742
# Numbers can be Int, Rational and Num:
say BigRoot.newton's-sqrt: 2.123;
# 1.45705181788431944566113502812562734420538186940001
# Can use other rational roots
say BigRoot.newton's-root: root => FatRat.new(2, 3), number => 30;
# 164.31676725154983403709093484024064018582340849939498
# Results are rounded:
BigRoot.precision = 8;
say BigRoot.newton's-sqrt: 2;
# 1.41421356
BigRoot.precision = 7;
say BigRoot.newton's-sqrt: 2;
# 1.4142136
In general it seems to be pretty fast (at least compared to Perl's bigfloat)
Benchmarks:
|---------------------------------------|-------------|------------|
| sqrt with 10_000 precision digits | Raku | Perl |
|---------------------------------------|-------------|------------|
| 20000000000 | 0.714 | 3.713 |
|---------------------------------------|-------------|------------|
| 200000.1234 | 1.078 | 4.269 |
|---------------------------------------|-------------|------------|
| π | 0.879 | 3.677 |
|---------------------------------------|-------------|------------|
| 123.9/12.29 | 0.871 | 9.667 |
|---------------------------------------|-------------|------------|
| 999999999999999999999999999999999 | 1.208 | 3.937 |
|---------------------------------------|-------------|------------|
| 302187301.3727 / 123.30219380928137 | 1.528 | 7.587 |
|---------------------------------------|-------------|------------|
| 2 + 999999999999 ** 10 | 2.193 | 3.616 |
|---------------------------------------|-------------|------------|
| 91200937373737999999997301.3727 / π | 1.076 | 7.419 |
|---------------------------------------|-------------|------------|
If you want to implement your own sqrt using Newton's method, this is the basic idea behind it:
sub newtons-sqrt(:$number, :$precision) returns FatRat {
    # Stop once two successive guesses differ by at most 10^-(precision + 1)
    my FatRat $error = FatRat.new: 1, 10 ** ($precision + 1);
    # Seed the iteration with the native (limited-precision) square root
    my FatRat $guess = (sqrt $number).FatRat;
    my FatRat $input = $number.FatRat;
    my FatRat $diff = $input;
    while $diff > $error {
        my FatRat $new-guess = $guess - (($guess ** 2 - $input) / (2 * $guess));
        $diff = abs($new-guess - $guess);
        $guess = $new-guess;
    }
    # Round the result to the requested number of decimal places
    return $guess.round: FatRat.new: 1, 10 ** $precision;
}
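For readers who want to try the same idea outside Raku, here is a minimal Python sketch that mirrors the routine above, with `fractions.Fraction` standing in for FatRat. The function name `newtons_sqrt` and its positional parameters are choices made for this sketch and are not part of BigRoot.

```python
from fractions import Fraction
from math import sqrt

def newtons_sqrt(number, precision):
    """Approximate sqrt(number) as a Fraction rounded to `precision` decimal places."""
    error = Fraction(1, 10 ** (precision + 1))
    target = Fraction(number)
    # Seed with the limited-precision float square root, converted exactly.
    guess = Fraction(sqrt(number))
    diff = target
    while diff > error:
        new_guess = guess - (guess * guess - target) / (2 * guess)
        diff = abs(new_guess - guess)
        guess = new_guess
    scale = 10 ** precision
    # round() on a Fraction returns an int, so this keeps exact rational arithmetic.
    return Fraction(round(guess * scale), scale)

root2 = newtons_sqrt(2, 50)
print(root2.numerator, "/", root2.denominator)
```

Because every step uses exact rationals, the only rounding happens in the final line of the function.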
In Rakudo, sqrt is implemented using the sqrt_n NQP opcode. The _n suffix indicates that it only supports native nums, which implies limited precision.
Internally, I'm pretty sure this just maps to the sqrt functionality of one of the underlying math libraries that MoarVM uses.
I guess what we need is an ecosystem module that exports a sqrt function based on Rational arithmetic. That would give you the option of a higher-precision sqrt at the expense of performance, which in turn might turn out to be interesting enough to integrate into core.

Apple turicreate always return the same label

I'm test-driving turicreate to solve a classification problem in which the data consists of rows (q,w,e,r,t,y,u,i,o,p,label), where 'q..p' is a sequence of 10 characters (for now of 2 types, + and -), like this:
q,w,e,r,t,y,u,i,o,p,label
-,+,+,e,e,e,e,e,e,e,type2
+,+,e,e,e,e,e,e,e,e,type1
-,+,e,e,e,e,e,e,e,e,type2
'e' is just a padding character, so that the vectors have a fixed length of 10.
Note: the data is significantly skewed toward one label (90% of it), and the dataset is small (< 100 points).
I use Apple's vanilla script to prepare and process the data (derived from here):
import turicreate as tc
# Load the data
data = tc.SFrame('data.csv')
# Note, for sake of investigating why predictions do not work on Swift, the model is deliberately over-fitted, with split 1.0
train_data, test_data = data.random_split(1.0)
print(train_data)
# Automatically picks the right model based on your data.
model = tc.classifier.create(train_data, target='label', features = ['q','w','e','r','t','y','u','i','o','p'])
# Generate predictions (class/probabilities etc.), contained in an SFrame.
predictions = model.classify(train_data)
# Evaluate the model, with the results stored in a dictionary
results = model.evaluate(train_data)
print("***********")
print(results['accuracy'])
print("***********")
model.export_coreml("MyModel.mlmodel")
Note: the model is over-fitted on the whole data (for now). Convergence seems OK:
PROGRESS: Model selection based on validation accuracy:
PROGRESS: ---------------------------------------------
PROGRESS: BoostedTreesClassifier : 1.0
PROGRESS: RandomForestClassifier : 0.9032258064516129
PROGRESS: DecisionTreeClassifier : 0.9032258064516129
PROGRESS: SVMClassifier : 1.0
PROGRESS: LogisticClassifier : 1.0
PROGRESS: ---------------------------------------------
PROGRESS: Selecting BoostedTreesClassifier based on validation set performance.
And the classification works as expected (although over-fitted).
However, when I use the mlmodel in my code, it always returns the same label, here 'type2', no matter the input. The rule here is: type1 = only "+" and "e"; type2 = all other combinations.
I tried using the text_classifier, but the results are far less accurate.
I have no idea what I'm doing wrong.
Just in case someone wants to check, for a small data set, here's the raw data.
q,w,e,r,t,y,u,i,o,p,label
-,+,+,e,e,e,e,e,e,e,type2
-,+,e,e,e,e,e,e,e,e,type2
+,+,-,+,e,e,e,e,e,e,type2
-,-,+,-,e,e,e,e,e,e,type2
+,e,e,e,e,e,e,e,e,e,type1
-,-,+,+,e,e,e,e,e,e,type2
+,-,+,-,e,e,e,e,e,e,type2
-,+,-,-,e,e,e,e,e,e,type2
+,-,-,+,e,e,e,e,e,e,type2
+,+,e,e,e,e,e,e,e,e,type1
+,+,-,-,e,e,e,e,e,e,type2
-,+,-,e,e,e,e,e,e,e,type2
-,-,-,-,e,e,e,e,e,e,type2
-,-,e,e,e,e,e,e,e,e,type2
-,-,-,e,e,e,e,e,e,e,type2
+,+,+,+,e,e,e,e,e,e,type1
+,-,+,+,e,e,e,e,e,e,type2
+,+,+,e,e,e,e,e,e,e,type1
+,-,-,-,e,e,e,e,e,e,type2
+,-,-,e,e,e,e,e,e,e,type2
+,+,+,-,e,e,e,e,e,e,type2
+,-,e,e,e,e,e,e,e,e,type2
+,-,+,e,e,e,e,e,e,e,type2
-,-,+,e,e,e,e,e,e,e,type2
+,+,-,e,e,e,e,e,e,e,type2
e,e,e,e,e,e,e,e,e,e,type1
-,+,+,-,e,e,e,e,e,e,type2
-,-,-,+,e,e,e,e,e,e,type2
-,e,e,e,e,e,e,e,e,e,type2
-,+,+,+,e,e,e,e,e,e,type2
-,+,-,+,e,e,e,e,e,e,type2
And the Swift code:
// Helper
extension MyModelInput {
    public convenience init(v: [String]) {
        self.init(q: v[0], w: v[1], e: v[2], r: v[3], t: v[4], y: v[5], u: v[6], i: v[7], o: v[8], p: v[9])
    }
}

let classifier = MyModel()
let data = ["-,+,+,e,e,e,e,e,e,e,e", "-,+,e,e,e,e,e,e,e,e,e", "+,+,-,+,e,e,e,e,e,e,e", "-,-,+,-,e,e,e,e,e,e,e","+,e,e,e,e,e,e,e,e,e,e"]
data.forEach { (tt) in
    let gg = MyModelInput(v: tt.components(separatedBy: ","))
    if let prediction = try? classifier.prediction(input: gg) {
        print(prediction.labelProbability)
    }
}
The Python code saves a MyModel.mlmodel file, which you can add to any Xcode project and use with the code above.
Note: the Python part works fine. For example:
+---+---+---+---+---+---+---+---+---+---+-------+
| q | w | e | r | t | y | u | i | o | p | label |
+---+---+---+---+---+---+---+---+---+---+-------+
| + | + | + | + | e | e | e | e | e | e | type1 |
+---+---+---+---+---+---+---+---+---+---+-------+
is labelled as expected. But when using the Swift code, the label comes out as type2. This thing is driving me berserk (and yes, I checked that the new mlmodel replaces the old one whenever I create a new version, including in Xcode).

Can't demonstrate why !A&B OR !A&C OR !C&B = !C&B OR !A&C

I'm beginning a course on boolean logic and I was given this boolean equality to prove. After a few hours of research I tried Wolfram Alpha, but unlike with other equations it doesn't explain step by step how it simplified the longer expression. It's also pretty easy to see with truth tables that the (!A&B) term isn't necessary in the expression, but I can't demonstrate it algebraically. How should I do it?
The expression:
!A&B OR !A&C OR !C&B = !C&B OR !A&C
And a link to the Wolfram Alpha input: Wolfram
Thanks in advance, have a nice day.
Here is a derivation
!A&B | !A&C | !C&B
= !A&B&(C | !C) | !A&C&(B | !B) | !C&B&(A | !A) // x & T = x
= !A&B&C | !A&B&!C | !A&B&C | !A&!B&C | A&B&!C | !A&B&!C // distributive
= !A&B&C | !A&B&!C | !A&!B&C | A&B&!C // x | x = x
= !A&B&!C | A&B&!C | !A&B&C | !A&!B&C // commutative
= B&!C&(!A | A) | !A&C&(B | !B) // distributive
= B&!C | !A&C // x | !x = T, x & T = x
There are two ways of proving these kinds of equalities. One is formal: find a chain of equalities that arrives at the target formula. The other is intuitive: understand why the equality holds. Let me try the latter.
In your case, after rewriting the left hand side of your equation we have to show that:
(!C&B OR !A&C) OR !A&B = !C&B OR !A&C
which has the form p OR q = p, right?
So the question becomes: when does p OR q = p hold? In other words, when does q add nothing to p? Well, if p is a consequence of q, then q adds nothing to p. That is, if q -> p (i.e., p is a consequence of q), then p OR q = p (please prove this formally!).
So, we have to show that !C&B OR !A&C is a consequence of !A&B. But this is easy because !A&B=true implies A=false and B=true. So, if C=false we have !C&B=true and if C=true then !A&C = true. Hence in both cases we have !C&B OR !A&C = true.
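Both arguments can be cross-checked by brute force over the eight truth assignments. The short Python sketch below is not a proof in the algebraic sense; it only confirms the equality and the implication q -> p used above. The function names `lhs`, `rhs` and `q` are chosen just for this check.

```python
from itertools import product

def lhs(a, b, c):
    # !A&B OR !A&C OR !C&B
    return (not a and b) or (not a and c) or (not c and b)

def rhs(a, b, c):
    # !C&B OR !A&C
    return (not c and b) or (not a and c)

def q(a, b, c):
    # The term claimed to be redundant: !A&B
    return not a and b

assignments = list(product([False, True], repeat=3))
# The two sides agree on every assignment.
equivalent = all(lhs(*v) == rhs(*v) for v in assignments)
# Whenever q holds, the right-hand side p holds too (q -> p).
q_implies_p = all(rhs(*v) for v in assignments if q(*v))
print(equivalent, q_implies_p)
```

Both flags come out true, which matches the derivation and the intuitive argument.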

Summary table using Coffeescript?

I've programmed a summary chart, and I'm trying to add a "Totals" line that sums up three columns. The exercise is comprised of a number of questions, and the participant can increase, decrease or maintain a certain dollar amount. Column B shows the initial values (where the slider starts). Column C shows the amount by which the participant increased or decreased the value, and the last column shows the resulting dollar amount (B + C).
Here's an example of the summary chart
Column A   | B    | C     | D
Question 1 | 100$ | +28$  | 128$
Question 2 | 150$ | (10$) | 140$
Totals     | 250$ | +18$  | 268$
So far I've been able to program the totals for column B and D, but I can't figure out how to show a total for column C.
class window.CustomSimulator extends window.TaxSimulator
  constructor: (@options = {}) ->
    super
    @updateTable()

  update: ->
    super
    @updateTable()

  updateTable: ->
    self = this
    $table = $('<table class="table table-condensed"><thead><tr><th>Category</th><th>Before</th><th>Your Choice</th><th>After</th></tr></thead></table>')
    $tbody = $('<tbody></tbody>')
    before_total = 0
    after_total = 0
    # Build one summary row per slider and accumulate the column B and D totals
    @scope.find('.slider').each ->
      $this = $(this)
      $parents = $this.parents('tr')
      category = $parents.find('.header').clone().children().remove().end().text().replace(/How would you adjust service levels and property tax funding for /, '').replace('?', '')
      before = self.tipScaler($this, $this.data('initial'))
      your_choice = $parents.find('.value').text()
      after = $parents.find('.tip-content').text()
      if $parents.find('.key').text() == 'Decrease:'
        css_class = 'decrease'
        your_choice = "(#{your_choice})"
      else
        css_class = 'increase'
      $tr = $("""<tr><td>#{category}</td><td>#{before}</td><td class="table-#{css_class}">#{your_choice}</td><td>#{after}</td></tr>""")
      $tbody.append($tr)
      before_total += parseFloat(before.replace('$', ''))
      after_total += parseFloat(after.replace('$', ''))
    before_total = SimulatorHelper.number_to_currency(before_total, precision: 2, strip_insignificant_zeros: true)
    after_total = SimulatorHelper.number_to_currency(after_total, precision: 2, strip_insignificant_zeros: true)
    # The middle footer cell (column C total) is still empty
    $("""<tfoot><tr><th>Totals</th><th>#{before_total}</th><th></th><th>#{after_total}</th></tr></tfoot>""").appendTo($table)
    $table.append($tbody)
    $('#summary-table').html($table)
I'm pretty new at this so I'm not sure if this is enough information.
Thanks!
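Since column D is defined as B + C, the column C total should equal the D total minus the B total, and it can also be summed directly if decreases shown in parentheses are treated as negative. The Python sketch below illustrates that arithmetic on the example chart; it is not CoffeeScript and not the simulator's API, and the helper name `parse_amount` is made up for this example.

```python
def parse_amount(text):
    """Parse strings such as '+28$', '(10$)' or '100$' into a signed float (hypothetical helper)."""
    cleaned = text.strip()
    # Amounts wrapped in parentheses are decreases.
    negative = cleaned.startswith('(') and cleaned.endswith(')')
    number = float(cleaned.strip('()').replace('$', '').replace('+', '').replace(',', ''))
    return -number if negative else number

# Rows from the example chart: (category, before, your choice, after)
rows = [
    ("Question 1", "100$", "+28$", "128$"),
    ("Question 2", "150$", "(10$)", "140$"),
]
before_total = sum(parse_amount(before) for _, before, _, _ in rows)
choice_total = sum(parse_amount(choice) for _, _, choice, _ in rows)
after_total = sum(parse_amount(after) for _, _, _, after in rows)
print(before_total, choice_total, after_total)
```

Either route gives the same column C total for the example, so the difference of the two totals already computed in updateTable may be the simpler option.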