How to print the iteration value using a PySpark for loop

I am trying to print a threshold for the dataframe values using PySpark.
Below is the R code I wrote, but I want the same thing in PySpark and I am unable to figure out how to do it. Any help will be greatly appreciated!
The values dataframe looks something like this:
vote
0.3
0.1
0.23
0.45
0.9
0.80
0.36
# loop through all link weight values, from the lowest to the highest
for (i in 1:nrow(values)){
  # print status
  print(paste0("Iterations left: ", nrow(values) - i, " Threshold: ", values[i, w_vote]))
}
What I am trying in PySpark is the following, but I am stuck here:
for row in values.collect():
    print('iterations left:', row - i, 'Threshold:', ...)

Every language or tool handles things differently. Below is an answer along the lines of what you tried:
df = sqlContext.createDataFrame([
    [0.3],
    [0.1],
    [0.23],
    [0.45],
    [0.9],
    [0.80],
    [0.36]
], ["vote"])
values = df.collect()
total_values = len(values)

# Rows returned by collect() come back in no guaranteed order, so here they are
# sorted in ascending order of the "vote" column at the Python level.
# To sort at the Spark level instead, use df.sort("vote").collect() before iterating.
# enumerate() gives us the index of each row.
for index, row in enumerate(sorted(values, key=lambda x: x.vote, reverse=False)):
    print('iterations left:', total_values - (index + 1), "Threshold:", row.vote)
iterations left: 6 Threshold: 0.1
iterations left: 5 Threshold: 0.23
iterations left: 4 Threshold: 0.3
iterations left: 3 Threshold: 0.36
iterations left: 2 Threshold: 0.45
iterations left: 1 Threshold: 0.8
iterations left: 0 Threshold: 0.9
Using collect is discouraged: if you are dealing with big data, it will break your program.
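If the driver-side loop is the concern, here is a rough sketch of the same output computed mostly at the Spark level (the Window/row_number approach is my own assumption, not part of the original answer):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Number the rows by ascending vote at the Spark level.
w = Window.orderBy("vote")
total = df.count()
ranked = df.withColumn("iterations_left", F.lit(total) - F.row_number().over(w))

# toLocalIterator() streams results to the driver one partition at a time,
# instead of materialising everything at once like collect().
for row in ranked.toLocalIterator():
    print('iterations left:', row.iterations_left, 'Threshold:', row.vote)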

Related

Using a grouped z-score over a rolling window

I would like to calculate a z-score over a bin based on the data of a rolling look-back period.
Example
Today's visitor count during [9:30-9:35) should be z-score normalized based on the (mean, std) of the visitors that visited during [9:30-9:35) over the last 3 days.
My current attempts both raise InvalidOperationError. Is there a way in polars to calculate this?
import pandas as pd  # needed for the date ranges below
import polars as pl

def z_score(col: str, over: str, alias: str):
    # calculate z-score normalized `col` over `over`
    return (
        (pl.col(col) - pl.col(col).over(over).mean()) / pl.col(col).over(over).std()
    ).alias(alias)

df = pl.from_dict(
    {
        "timestamp": pd.date_range("2019-12-02 9:30", "2019-12-02 12:30", freq="30s").union(
            pd.date_range("2019-12-03 9:30", "2019-12-03 12:30", freq="30s")
        ),
        "visitors": [(e % 2) + 1 for e in range(722)]
    }
# 5 minute bins for grouping [9:30-9:35) -> 930
).with_column(
    pl.col("timestamp").dt.truncate(every="5m").dt.strftime("%H%M").cast(pl.Int32).alias("five_minute_bin")
).with_column(
    pl.col("timestamp").dt.truncate(every="3d").alias("daytrunc")
)
# normalize visitor amount for each 5 min bin over the rolling 3 day window using z-score.
# not rolling but also wont work (InvalidOperationError: window expression not allowed in aggregation)
# df.with_column(
# z_score("visitors", "five_minute_bin", "normalized").over("daytrunc")
# )
# won't work either (InvalidOperationError: window expression not allowed in aggregation)
#df.groupby_rolling(index_column="daytrunc", period="3i").agg(z_score("visitors", "five_minute_bin", "normalized"))
For example, with 4 days of data, four data points per day, lying in two time bins ({0,0}, {0,1}) and ({1,0}, {1,1}):
Input:
Day 0: x_d0_{0,0}, x_d0_{0,1}, x_d0_{1,0}, x_d0_{1,1}
Day 1: x_d1_{0,0}, x_d1_{0,1}, x_d1_{1,0}, x_d1_{1,1}
Day 2: x_d2_{0,0}, x_d2_{0,1}, x_d2_{1,0}, x_d2_{1,1}
Day 3: x_d3_{0,0}, x_d3_{0,1}, x_d3_{1,0}, x_d3_{1,1}
Output:
Day 0: norm_x_d0_{0,0} = nan, norm_x_d0_{0,1} = nan, norm_x_d0_{1,0} = nan, norm_x_d0_{1,1} = nan
Day 1: norm_x_d1_{0,0} = nan, norm_x_d1_{0,1} = nan, norm_x_d1_{1,0} = nan, norm_x_d1_{1,1} = nan
Day 2: norm_x_d2_{0,0} = nan, norm_x_d2_{0,1} = nan, norm_x_d2_{1,0} = nan, norm_x_d2_{1,1} = nan
Day 3: norm_x_d3_{0,0} = (x_d3_{0,0} - np.mean([x_d0_{0,0}, x_d0_{0,1}, x_d1_{0,0}, ..., x_d3_{0,1}])) / np.std([x_d0_{0,0}, x_d0_{0,1}, x_d1_{0,0}, ..., x_d3_{0,1}]), ...
The key here is to use over to restrict your calculations to the five-minute bins, and then use the rolling functions to get the rolling mean and standard deviation over days, restricted by those five-minute bin keys. five_minute_bin works as in your code, and I believe a truncated day_bin is necessary so that, for example, 9:33 on one day will include 9:31 and 9:34 from the same day as well as 9:31 from 2 days ago.
from datetime import datetime
import polars as pl

days = 5

pl.DataFrame(
    {
        "timestamp": pl.concat(
            [
                pl.date_range(
                    datetime(2019, 12, d, 9, 30), datetime(2019, 12, d, 12, 30), "30s"
                )
                for d in range(2, days + 2)
            ]
        ),
        "visitors": [(e % 2) + 1 for e in range(days * 361)],
    }
).with_columns(
    five_minute_bin=pl.col("timestamp").dt.truncate(every="5m").dt.strftime("%H%M"),
    day_bin=pl.col("timestamp").dt.truncate(every="1d"),
).with_columns(
    standardized_visitors=(
        (
            pl.col("visitors")
            - pl.col("visitors").rolling_mean("3d", by="day_bin", closed="right")
        )
        / pl.col("visitors").rolling_std("3d", by="day_bin", closed="right")
    ).over("five_minute_bin")
)
Now, that said, when trying out the code for this, I found that polars doesn't handle non-unique values in the by-column of the rolling functions correctly, so the same values in the same 5-minute bin don't end up with the same standardized values. I opened a bug report here: https://github.com/pola-rs/polars/issues/6691. For large amounts of real-world data this shouldn't actually matter much, unless your data systematically differs in distribution within the 5-minute bins.

Decay chain simulation - with significantly different time scales

I would like to simulate a decay chain with Python. Normally, (in a loop over all nuclides) one calculates the number of decays per time step and updates the number of mother and daughter nuclei.
My problem is that the decay chain contains half-lives on very different time scales, i.e.
0.0001643 seconds for Po-214 and 307106512477175.9 seconds (= 1600 years) for Ra-226.
Using the same time step for all nuclides seems useless.
Is there a simulation method, preferably in Python, that can be used to handle this case?
Don't use time steps for this. Use event scheduling.
Half-lives can be expressed as exponential decay, and the conversion between half-life and rate of decay is straightforward. Start with the number of both types of nuclei, and schedule exponential inter-event times to figure out when the next decay of each type will occur. Whichever type has the lower time, decrement the corresponding number of nuclei and schedule the next decay for that type (and if need be, increment the count of whatever it decays into).
This can easily be generalized to multiple distinct event types by using a priority queue ordered by time of occurrence to determine which event will be the next one performed. This is the underlying principle behind discrete event simulation.
Update
This approach works with individual decay events, but we can leverage two important properties when we have exponential inter-event times.
The first is to note that exponentially distributed inter-event times mean these are Poisson processes. The superposition property tells us that the union of two independent Poisson processes, each having rate λ, is a Poisson process with rate 2λ. Simple induction shows that if we have n independent Poisson processes with the same rate λ, their superposition is a Poisson process with rate nλ.
The second property is that the exponential distribution is memoryless. This means that when a Poisson event occurs, we can generate the time to the next event by generating a new exponentially distributed time at the current rate and adding it to the current time.
You haven't provided any information about what you want in the way of output, so I arbitrarily decided to print a report showing the time and the current numbers of nuclides whenever that number was halved. I also printed a report every 10 years, given the long half-life of Ra-226.
I converted half-lives to rates using the link provided at the top of the post, and then to means, since that's what numpy's exponential generator is parameterized to use. That's an easy conversion, since means and rates are inverses of each other.
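As a quick illustration of that conversion (a sketch using the same constants that appear in the implementation below):
from math import log

log_2 = log(2)
seconds_per_year = 365.25 * 24 * 60 * 60

ra_226_half_life = 1590 * seconds_per_year   # seconds
ra_rate = log_2 / ra_226_half_life           # decay rate of a single nuclide (per second)
ra_mean = 1 / ra_rate                        # mean time to decay = half-life / ln(2)

# Superposition + memorylessness: for a population of n nuclides, the time to the
# next decay anywhere in the population is exponential with mean ra_mean / n.
n = 1_000_000_000
print(ra_mean / n)  # expected seconds until the next Ra-226 decay in the population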
Here's a Python implementation with comments:
from numpy.random import default_rng
from math import log

rng = default_rng()

# This creates a list of entries of quantities that will trigger a report.
# I've chosen to go with successive halvings of the original quantity.
def generate_report_qtys(n0):
    report_qty = []
    divisor = 2
    while divisor < n0:
        report_qty.append(n0 // divisor)  # append next half-life qty to array
        divisor *= 2
    return report_qty

seconds_per_year = 365.25 * 24 * 60 * 60
po_214_half_life = 0.0001643  # seconds
ra_226_half_life = 1590 * seconds_per_year
log_2 = log(2)
po_mean = po_214_half_life / log_2  # mean time to decay for a single po_214 nuclide
ra_mean = ra_226_half_life / log_2  # ditto for ra_226

po_n = po_n0 = 1_000_000_000
ra_n = ra_n0 = 1_000_000_000
time = 0.0

# Generate a report when the following sets of half-lives are reached
po_report_qtys = generate_report_qtys(po_n0)
ra_report_qtys = generate_report_qtys(ra_n0)

# Initialize first event times for each type of event:
# - first entry is polonium next event time
# - second entry is radium next event time
# - third entry is next ten year report time
next_event_time = [
    rng.exponential(po_mean / po_n),
    rng.exponential(ra_mean / ra_n),
    10 * seconds_per_year
]

# Print column labels and initial values
print("time,po_214,ra_226,time_in_years")
print(f"{time},{po_n},{ra_n},{time / seconds_per_year}")

while time < ra_226_half_life:
    # Find the index of the next event time. Index tells us the event type.
    min_index = next_event_time.index(min(next_event_time))
    if min_index == 0:
        po_n -= 1  # decrement polonium count
        time = next_event_time[0]  # update clock to the event time
        if po_n > 0:
            next_event_time[0] += rng.exponential(po_mean / po_n)  # determine next event time for po
        else:
            next_event_time[0] = float('Inf')
        # print report if this is a half-life occurrence
        if len(po_report_qtys) > 0 and po_n == po_report_qtys[0]:
            po_report_qtys.pop(0)  # remove this occurrence from the list
            print(f"{time},{po_n},{ra_n},{time / seconds_per_year}")
    elif min_index == 1:
        # same as above, but for radium
        ra_n -= 1
        time = next_event_time[1]
        if ra_n > 0:
            next_event_time[1] += rng.exponential(ra_mean / ra_n)
        else:
            next_event_time[1] = float('Inf')
        if len(ra_report_qtys) > 0 and ra_n == ra_report_qtys[0]:
            ra_report_qtys.pop(0)
            print(f"{time},{po_n},{ra_n},{time / seconds_per_year}")
    else:
        # update clock, print ten year report
        time = next_event_time[2]
        next_event_time[2] += 10 * seconds_per_year
        print(f"{time},{po_n},{ra_n},{time / seconds_per_year}")
Run times are proportional to the number of nuclides. Running with a billion of each took 831.28s on my M1 MacBook Pro, versus 2.19s for a million of each. I also ported this to Crystal, a compiled Ruby-like language, which produced comparable results in 32 seconds for a billion of each nuclide. I would recommend using a compiled language if you intend to run larger-sized problems, but I will also point out that with half-life reporting as I did it, smaller population sizes give virtually identical results and are obtained much more rapidly.
I would also suggest that if you want to use this approach for a more complex model, you should use a priority queue of tuples containing time and type of event to store the set of pending future events rather than a simple list.
Last but not least, here's some sample output:
time,po_214,ra_226,time_in_years
0.0,1000000000,1000000000,0.0
0.0001642985647308265,500000000,1000000000,5.20630734690935e-12
0.0003286071415481526,250000000,1000000000,1.0412931957694901e-11
0.0004929007624958987,125000000,1000000000,1.5619082645571865e-11
0.0006571750701843468,62500000,1000000000,2.082462133319222e-11
0.0008214861652253772,31250000,1000000000,2.6031325741671646e-11
0.0009858208114474198,15625000,1000000000,3.1238776442043114e-11
0.0011502417677631668,7812500,1000000000,3.6448962144243124e-11
0.0013145712145548718,3906250,1000000000,4.165624808460947e-11
0.0014788866075394896,1953125,1000000000,4.686308868670272e-11
0.0016432124609700412,976562,1000000000,5.2070260760325286e-11
0.001807832817519779,488281,1000000000,5.728676507465013e-11
0.001972981254301889,244140,1000000000,6.252000324175124e-11
0.0021372947080755688,122070,1000000000,6.772678239395799e-11
0.002301139510796509,61035,1000000000,7.29187108904514e-11
0.0024642826956509244,30517,1000000000,7.808840645837847e-11
0.0026302282280720344,15258,1000000000,8.33469030620844e-11
0.0027944471221414947,7629,1000000000,8.855068579808016e-11
0.002954014120737834,3814,1000000000,9.3607058861822e-11
0.0031188370035748177,1907,1000000000,9.882998084692174e-11
0.003282466175503322,953,1000000000,1.0401507641592902e-10
0.003457552492113242,476,1000000000,1.0956322699169905e-10
0.003601851131916978,238,1000000000,1.1413577496124477e-10
0.0037747824699194033,119,1000000000,1.1961563838566314e-10
0.0039512825256332275,59,1000000000,1.252085876503038e-10
0.004124330529803301,29,1000000000,1.3069214800248755e-10
0.004337121375518753,14,1000000000,1.3743508300754027e-10
0.004535068261934763,7,1000000000,1.437076413268044e-10
0.004890820999020369,3,1000000000,1.5498076529965425e-10
0.004909065046898487,1,1000000000,1.555588842908994e-10
315576000.0,0,995654793,10.0
631152000.0,0,991322602,20.0
946728000.0,0,987010839,30.0
1262304000.0,0,982711723,40.0
1577880000.0,0,978442651,50.0
1893456000.0,0,974185269,60.0
2209032000.0,0,969948418,70.0
2524608000.0,0,965726762,80.0
2840184000.0,0,961524848,90.0
3155760000.0,0,957342148,100.0
3471336000.0,0,953178898,110.0
3786912000.0,0,949029294,120.0
4102488000.0,0,944898063,130.0
4418064000.0,0,940790494,140.0
4733640000.0,0,936699123,150.0
5049216000.0,0,932622334,160.0
5364792000.0,0,928565676,170.0
5680368000.0,0,924523267,180.0
5995944000.0,0,920499586,190.0
6311520000.0,0,916497996,200.0
6627096000.0,0,912511030,210.0
6942672000.0,0,908543175,220.0
7258248000.0,0,904590364,230.0
7573824000.0,0,900656301,240.0
7889400000.0,0,896738632,250.0
8204976000.0,0,892838664,260.0
8520552000.0,0,888956681,270.0
8836128000.0,0,885084855,280.0
9151704000.0,0,881232862,290.0
9467280000.0,0,877401861,300.0
9782856000.0,0,873581425,310.0
10098432000.0,0,869785364,320.0
10414008000.0,0,866002042,330.0
10729584000.0,0,862234212,340.0
11045160000.0,0,858485627,350.0
11360736000.0,0,854749939,360.0
11676312000.0,0,851032010,370.0
11991888000.0,0,847329028,380.0
12307464000.0,0,843640016,390.0
12623040000.0,0,839968529,400.0
12938616000.0,0,836314000,410.0
13254192000.0,0,832673999,420.0
13569768000.0,0,829054753,430.0
13885344000.0,0,825450233,440.0
14200920000.0,0,821859757,450.0
14516496000.0,0,818284787,460.0
14832072000.0,0,814727148,470.0
15147648000.0,0,811184419,480.0
15463224000.0,0,807655470,490.0
15778800000.0,0,804139970,500.0
16094376000.0,0,800643280,510.0
16409952000.0,0,797159389,520.0
16725528000.0,0,793692735,530.0
17041104000.0,0,790239221,540.0
17356680000.0,0,786802135,550.0
17672256000.0,0,783380326,560.0
17987832000.0,0,779970864,570.0
18303408000.0,0,776576174,580.0
18618984000.0,0,773197955,590.0
18934560000.0,0,769836170,600.0
19250136000.0,0,766488931,610.0
19565712000.0,0,763154778,620.0
19881288000.0,0,759831742,630.0
20196864000.0,0,756528400,640.0
20512440000.0,0,753237814,650.0
20828016000.0,0,749961747,660.0
21143592000.0,0,746699940,670.0
21459168000.0,0,743450395,680.0
21774744000.0,0,740219531,690.0
22090320000.0,0,736999181,700.0
22405896000.0,0,733793266,710.0
22721472000.0,0,730602000,720.0
23037048000.0,0,727427544,730.0
23352624000.0,0,724260327,740.0
23668200000.0,0,721110260,750.0
23983776000.0,0,717973915,760.0
24299352000.0,0,714851218,770.0
24614928000.0,0,711740161,780.0
24930504000.0,0,708645945,790.0
25246080000.0,0,705559170,800.0
25561656000.0,0,702490991,810.0
25877232000.0,0,699436919,820.0
26192808000.0,0,696394898,830.0
26508384000.0,0,693364883,840.0
26823960000.0,0,690348242,850.0
27139536000.0,0,687345934,860.0
27455112000.0,0,684354989,870.0
27770688000.0,0,681379178,880.0
28086264000.0,0,678414567,890.0
28401840000.0,0,675461363,900.0
28717416000.0,0,672522494,910.0
29032992000.0,0,669598412,920.0
29348568000.0,0,666687807,930.0
29664144000.0,0,663787671,940.0
29979720000.0,0,660901676,950.0
30295296000.0,0,658027332,960.0
30610872000.0,0,655164886,970.0
30926448000.0,0,652315268,980.0
31242024000.0,0,649481821,990.0
31557600000.0,0,646656096,1000.0
31873176000.0,0,643841377,1010.0
32188752000.0,0,641041609,1020.0
32504328000.0,0,638253759,1030.0
32819904000.0,0,635479981,1040.0
33135480000.0,0,632713706,1050.0
33451056000.0,0,629962868,1060.0
33766632000.0,0,627223350,1070.0
34082208000.0,0,624494821,1080.0
34397784000.0,0,621778045,1090.0
34713360000.0,0,619076414,1100.0
35028936000.0,0,616384399,1110.0
35344512000.0,0,613702920,1120.0
35660088000.0,0,611035112,1130.0
35975664000.0,0,608376650,1140.0
36291240000.0,0,605729994,1150.0
36606816000.0,0,603093946,1160.0
36922392000.0,0,600469403,1170.0
37237968000.0,0,597854872,1180.0
37553544000.0,0,595254881,1190.0
37869120000.0,0,592663681,1200.0
38184696000.0,0,590085028,1210.0
38500272000.0,0,587517782,1220.0
38815848000.0,0,584961743,1230.0
39131424000.0,0,582420312,1240.0
39447000000.0,0,579886455,1250.0
39762576000.0,0,577362514,1260.0
40078152000.0,0,574849251,1270.0
40393728000.0,0,572346625,1280.0
40709304000.0,0,569856166,1290.0
41024880000.0,0,567377753,1300.0
41340456000.0,0,564908008,1310.0
41656032000.0,0,562450828,1320.0
41971608000.0,0,560005832,1330.0
42287184000.0,0,557570018,1340.0
42602760000.0,0,555143734,1350.0
42918336000.0,0,552729893,1360.0
43233912000.0,0,550326162,1370.0
43549488000.0,0,547932312,1380.0
43865064000.0,0,545550017,1390.0
44180640000.0,0,543178924,1400.0
44496216000.0,0,540814950,1410.0
44811792000.0,0,538462704,1420.0
45127368000.0,0,536123339,1430.0
45442944000.0,0,533792776,1440.0
45758520000.0,0,531469163,1450.0
46074096000.0,0,529157093,1460.0
46389672000.0,0,526854383,1470.0
46705248000.0,0,524564196,1480.0
47020824000.0,0,522282564,1490.0
47336400000.0,0,520011985,1500.0
47651976000.0,0,517751635,1510.0
47967552000.0,0,515499791,1520.0
48283128000.0,0,513257373,1530.0
48598704000.0,0,511022885,1540.0
48914280000.0,0,508798440,1550.0
49229856000.0,0,506582663,1560.0
49545432000.0,0,504379227,1570.0
49861008000.0,0,502186693,1580.0
50176584000.0,0,500000869,1590.0
Expanded for More than 2 Nuclides
I mentioned that for more than a couple of nuclides you'd want to use a priority queue to track which decays occur next. I reorganized the code around functions, which allows greater flexibility in expanding the scope of the problem. Here you go:
#!/usr/bin/env python3
from numpy.random import default_rng
from math import log
import heapq

SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60
LOG_2 = log(2)
rng = default_rng()

def generate_report_qtys(n0):
    report_qty = []
    divisor = 2
    while divisor < n0:
        report_qty.append(n0 // divisor)  # append next half-life qty to array
        divisor *= 2
    return report_qty

po_n0 = 10_000_000
ra_n0 = 10_000_000
mu_n0 = 10_000_000

# mean is half-life / LOG_2
properties = dict(
    po_214 = dict(
        mean = 0.0001643 / LOG_2,
        qty = po_n0,
        report_qtys = generate_report_qtys(po_n0)
    ),
    ra_226 = dict(
        mean = 1590 * SECONDS_PER_YEAR / LOG_2,
        qty = ra_n0,
        report_qtys = generate_report_qtys(ra_n0)
    ),
    made_up = dict(
        mean = 75 * SECONDS_PER_YEAR / LOG_2,
        qty = mu_n0,
        report_qtys = generate_report_qtys(mu_n0)
    )
)

nuclide_names = [name for name in properties.keys()]

def population_mean(nuclide):
    return properties[nuclide]['mean'] / properties[nuclide]['qty']

def report():  # isolate as single point of maintenance even though it's a one-liner
    nuc_qtys = [str(properties[nuclide]['qty']) for nuclide in nuclide_names]
    print(f"{time},{time / SECONDS_PER_YEAR}," + ','.join(nuc_qtys))

def decay_event(nuclide):
    properties[nuclide]['qty'] -= 1
    current_qty = properties[nuclide]['qty']
    if current_qty > 0:
        heapq.heappush(event_q, (time + rng.exponential(population_mean(nuclide)), nuclide))
    rep_qty = properties[nuclide]['report_qtys']
    if len(rep_qty) > 0 and current_qty == rep_qty[0]:
        rep_qty.pop(0)  # remove this occurrence from the list
        report()

def report_event():
    heapq.heappush(event_q, (time + 10 * SECONDS_PER_YEAR, 'report_event'))
    report()

event_q = [(rng.exponential(population_mean(nuclide)), nuclide) for nuclide in nuclide_names]
event_q.append((0.0, "report_event"))
heapq.heapify(event_q)

time = 0.0  # simulated time

print("time(seconds),time(years)," + ','.join(nuclide_names))  # column labels

while time < 1600 * SECONDS_PER_YEAR:
    time, event_id = heapq.heappop(event_q)
    if event_id == 'report_event':
        report_event()
    else:
        decay_event(event_id)
To add more nuclides, add more entries to the properties dictionary, following the template of the current entries.
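For instance, a hypothetical extra entry (the name and 3-year half-life are made up purely for illustration) would be added alongside the others, before nuclide_names and event_q are built:
# Hypothetical extra nuclide -- name and half-life are illustrative only.
extra_n0 = 10_000_000
properties["made_up_2"] = dict(
    mean = 3 * SECONDS_PER_YEAR / LOG_2,   # 3-year half-life converted to a mean
    qty = extra_n0,
    report_qtys = generate_report_qtys(extra_n0)
)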

ORTOOLS - CPSAT - Objective to minimize a value by intervals

In my model in OR-Tools CP-SAT, I am computing a variable called salary_var (among others). I need to minimize an objective; let's call it "taxes".
To compute the taxes, the formula is not linear but is organised in brackets:
if salary_var is below 10084, the tax rate is 0%
between 10085 and 25710, the tax rate is 11%
between 25711 and 73516, the tax rate is 30%
and 41% above that
For example, if salary_var is 30000, then the taxes are:
(25710-10085) * 0.11 + (30000-25711) * 0.3 = 1718.75 + 1286.70 = 3005.45
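(As a plain-Python cross-check of that bracket arithmetic, using the breakpoints from the question; the helper name bracket_tax is illustrative only.)
def bracket_tax(salary):
    # 0% below 10085, 11% for 10085-25710, 30% for 25711-73516, 41% above 73516
    part_11 = max(0, min(salary, 25710) - 10085)
    part_30 = max(0, min(salary, 73516) - 25711)
    part_41 = max(0, salary - 73516)
    return 0.11 * part_11 + 0.30 * part_30 + 0.41 * part_41

print(bracket_tax(30000))  # 3005.45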
My question: how can I efficiently code my "taxes" objective?
Thanks for your help
Seb
This task looks rather strange; there is not much context, and some parts of the task might touch not-so-nice areas of finite-domain based solvers (large domains or scaling / divisions during solving).
Therefore: consider this as an idea / template!
Code
from ortools.sat.python import cp_model

# Data
INPUT = 30000
INPUT_UB = 1000000
TAX_A = 11
TAX_B = 30
TAX_C = 41

# Helpers

# new variable which is constrained to be equal to: given input-var MINUS constant
# can get negative / wrap-around
def aux_var_offset(model, var, offset):
    aux_var = model.NewIntVar(-INPUT_UB, INPUT_UB, "")
    model.Add(aux_var == var - offset)
    return aux_var

# new variable which is equal to the given input-var IFF >= 0; else 0
def aux_var_nonnegative(model, var):
    aux_var = model.NewIntVar(0, INPUT_UB, "")
    model.AddMaxEquality(aux_var, [var, model.NewConstant(0)])
    return aux_var

# Model
model = cp_model.CpModel()

# vars
salary_var = model.NewIntVar(0, INPUT_UB, "salary")
tax_component_a = model.NewIntVar(0, INPUT_UB, "tax_11")
tax_component_b = model.NewIntVar(0, INPUT_UB, "tax_30")
tax_component_c = model.NewIntVar(0, INPUT_UB, "tax_41")

# constraints
model.AddMinEquality(tax_component_a, [
    aux_var_nonnegative(model, aux_var_offset(model, salary_var, 10085)),
    model.NewConstant(25710 - 10085)])
model.AddMinEquality(tax_component_b, [
    aux_var_nonnegative(model, aux_var_offset(model, salary_var, 25711)),
    model.NewConstant(73516 - 25711)])
model.Add(tax_component_c == aux_var_nonnegative(model,
    aux_var_offset(model, salary_var, 73516)))

tax_full_scaled = tax_component_a * TAX_A + tax_component_b * TAX_B + tax_component_c * TAX_C

# Demo
model.Add(salary_var == INPUT)

solver = cp_model.CpSolver()
status = solver.Solve(model)

print(list(map(lambda x: solver.Value(x), [tax_component_a, tax_component_b, tax_component_c, tax_full_scaled])))
Output
[15625, 4289, 0, 300545]
Remarks
As implemented:
uses scaled solving
produces scaled solution (300545)
no fiddling with non-integral / ratio / rounding stuff BUT large domains
Alternative:
Maybe something around AddDivisionEquality
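To get the unscaled tax back from the demo above, a small post-processing step (not part of the original snippet) is enough, since the components were multiplied by integer percentages:
# tax_full_scaled is 100x the real tax because 11/30/41 are whole percentages.
print(solver.Value(tax_full_scaled) / 100)  # 3005.45 for the demo salary of 30000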
Edit in regard to Laurent's comments
In some scenarios it might make sense to solve the scaled problem but still be able to reason more easily about the real, unscaled values.
If I interpret the comment correctly, the following would be a demo (which I was not aware of, and it's cool!):
Updated Demo Code (partial)
# Demo -> Attempt of demonstrating the objective-scaling suggestion
model.Add(salary_var >= 30000)
model.Add(salary_var <= 40000)
model.Minimize(salary_var)
model.Proto().objective.scaling_factor = 0.001 # DEFINE INVERSE SCALING
solver = cp_model.CpSolver()
solver.parameters.log_search_progress = True # SCALED BACK OBJECTIVE PROGRESS
status = solver.Solve(model)
print(list(map(lambda x: solver.Value(x), [tax_component_a, tax_component_b, tax_component_c, tax_full_scaled])))
print(solver.ObjectiveValue()) # SCALED BACK OBJECTIVE
Output (excerpt)
...
...
#1 0.00s best:30 next:[30,29.999] fixed_bools:0/1
#Done 0.00s
CpSolverResponse summary:
status: OPTIMAL
objective: 30
best_bound: 30
booleans: 1
conflicts: 0
branches: 1
propagations: 0
integer_propagations: 2
restarts: 1
lp_iterations: 0
walltime: 0.0039022
usertime: 0.0039023
deterministic_time: 8e-08
primal_integral: 1.91832e-07
[15625, 4289, 0, 300545]
30.0

RankingMetrics in Spark (Scala)

I am trying to use spark RankingMetrics.meanAveragePrecision.
However, it seems like it's not working as expected.
val t2 = (Array(0,0,0,0,1), Array(1,1,1,1,1))
val r = sc.parallelize(Seq(t2))
val rm = new RankingMetrics[Int](r)
rm.meanAveragePrecision // Double = 0.2
rm.precisionAt(5) // Double = 0.2
t2 is a tuple where the left array indicates the actual values and the right array the predicted values (1 = relevant document, 0 = non-relevant).
If we calculate the average precision for t2 we get:
(0/1 + 0/2 + 0/3 + 0/4 + 1/5) / 5 = 1/25
But RankingMetrics returns 0.2 for meanAveragePrecision, when it should be 1/25.
Thanks.
I think that the problem is your input data. Since your predicted/actual data contains relevance scores, I think you should be looking at binary classification metrics rather than ranking metrics if you want to evaluate using the 0/1 scores.
RankingMetrics is expecting two lists/arrays of ranked items instead, so if you replace the scores with the document ids it should work as expected. Here is an example in PySpark, with two lists that only match on the 5th item:
from pyspark.mllib.evaluation import RankingMetrics

rdd = sc.parallelize([(['a','b','c','d','z'], ['e','f','g','h','z'])])
metrics = RankingMetrics(rdd)

for i in range(1, 6):
    print(i, metrics.precisionAt(i))

print('meanAveragePrecision', metrics.meanAveragePrecision)
print('Mean precisionAt', sum([0, 0, 0, 0, 0.2]) / 5)
Which produced:
1 0.0
2 0.0
3 0.0
4 0.0
5 0.2
meanAveragePrecision 0.04
Mean precisionAt 0.04
Basically, RankingMetrics works with two lists on each row:
The first list is the items being recommended; order matters here
The second list is the relevant items; these are unordered
For example in PySpark (But should be equivalent for Scala or Java),
recs_rdd = sc.parallelize([
    (
        ['item1', 'item2', 'item3'],  # Recommendations in order
        ['item3', 'item2']            # Relevant items - Unordered
    ),
    (
        ['item3', 'item1', 'item2'],  # Recommendations in order
        ['item3', 'item2']            # Relevant items - Unordered
    ),
])
from pyspark.mllib.evaluation import RankingMetrics
rankingMetrics = RankingMetrics(recs_rdd)
print("MAP: ", rankingMetrics.meanAveragePrecision)
This prints a MAP value of 0.7083333333333333, which is calculated as
(
(1/2 + 2/3) / 2
+ (1/1 + 2/3) / 2
) / 2
Which equals 0.708333
With
row 1 as (1/2 + 2/3) / 2
1/2 : 1 item in positions 2 or lower is relevant
2/3 : 2 items in positions 3 or lower are relevant
2 : row 1 has 2 relevant items
row 2 as (1/1 + 2/3) / 2
1/1 : 1 item in position 1 is relevant
2/3 : 2 items in positions 3 or lower are relevant
2 : row 2 has 2 relevant items
and / 2 because there are 2 rows
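A quick plain-Python check of that breakdown (no Spark required):
# Average precision per row, then the mean across the two rows.
row1_ap = (1/2 + 2/3) / 2   # relevant items appear at ranks 2 and 3; 2 relevant items
row2_ap = (1/1 + 2/3) / 2   # relevant items appear at ranks 1 and 3; 2 relevant items
print((row1_ap + row2_ap) / 2)  # 0.7083333333333333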

PySpark : how to split data without randomnize

There is a function that can randomly split data:
trainingRDD, validationRDD, testRDD = RDD.randomSplit([6, 2, 2], seed=0)
I'm curious if there is a way to generate the same partitioning (train 60 / valid 20 / test 20) but without randomizing; let's say using the data in its current order, so the first 60% becomes train, the next 20% validation, and the last 20% test.
Is there a way to split the data like that but without randomizing?
The basic issue here is that unless you have an index column in your data, there is no concept of "first rows" and "next rows" in your RDD; it's just an unordered set. If you have an integer index column, you could do something like this:
train = RDD.filter(lambda r: r['index'] % 5 <= 2)       # indexes 0, 1, 2 -> 60%
validation = RDD.filter(lambda r: r['index'] % 5 == 3)  # index 3 -> 20%
test = RDD.filter(lambda r: r['index'] % 5 == 4)        # index 4 -> 20%
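If there is no index column to begin with, one way to add one (a sketch using zipWithIndex; whether the resulting order is meaningful still depends on how the data was written and read) is:
# zipWithIndex assigns consecutive indexes following the RDD's current partition order.
indexed = RDD.zipWithIndex()  # yields (record, index) pairs

train = indexed.filter(lambda x: x[1] % 5 <= 2).map(lambda x: x[0])       # 60%
validation = indexed.filter(lambda x: x[1] % 5 == 3).map(lambda x: x[0])  # 20%
test = indexed.filter(lambda x: x[1] % 5 == 4).map(lambda x: x[0])        # 20%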