How to apply best fit distributions in pyspark?

I am currently migrating code from Python to PySpark. In one step I find the best-fit distribution using a modified version of the function from "Fitting empirical distribution to theoretical ones with Scipy (Python)?": I apply best_fit_distribution to each group of Ids and save the output in a dictionary. Is there a way to do that in PySpark? I have looked into PySpark's statistics support and could not find any library that helps.
The project requires this step to run in PySpark, so keeping it in plain Python is not an option.
import warnings
import numpy as np
import pandas as pd
import scipy.stats as st
def best_fit_distribution(data, bins=200, ax=None):
    # Histogram of the data; x is converted from bin edges to bin centres
    y, x = np.histogram(data, bins=bins, density=True)
    x = (x + np.roll(x, -1))[:-1] / 2.0
    # Distributions to check
    distribution_list = [st.alpha, st.chi2, st.pearson3]  # This is an example
    # Best holders
    best_distribution = st.norm
    best_params = (0.0, 1.0)
    best_sse = np.inf
    for distribution in distribution_list:
        try:
            with warnings.catch_warnings():
                warnings.filterwarnings("ignore")
                # Fit the distribution; params = (shape args..., loc, scale)
                params = distribution.fit(data)
                arg = params[:-2]
                loc = params[-2]
                scale = params[-1]
                # Fitted PDF at the bin centres and its squared error against the histogram
                pdf = distribution.pdf(x, *arg, loc=loc, scale=scale)
                sse = np.sum(np.power(y - pdf, 2.0))
                # If an axis is passed in, add the fitted PDF to the plot
                try:
                    if ax:
                        pd.Series(pdf, x).plot(ax=ax)
                except Exception:
                    pass
                # Identify if this distribution is better
                if best_sse > sse > 0:
                    best_distribution = distribution
                    best_params = params
                    best_sse = sse
        except Exception:
            pass
    return (best_distribution.name, best_params)
This is an example and description of my df:
Id    Values
8     59.25
8     25.1
8     39.0333
8     138.3737
8     79.5002
8     52.9
8     0.1674
9     33.8667
9     0.75
9     78.05
9     76.9167
9     14.6667
9     80.3166
9     32.7333
9     0.8333
9     76.95
9     84.4
9     23.1667
9     23.1
9     76.6667
summary   Id      Values
count     34052   1983107
min       8       0.0
max       2558    59646.1712
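One possible direction, offered as a sketch rather than a definitive answer: Spark 3.0+ provides `groupBy(...).applyInPandas(...)`, which hands each group to a plain pandas function, so the SciPy fitting can stay as it is. In the code below, `fit_group`, `fit_all`, the candidate list, the bin count and the output schema are all illustrative assumptions; the column names `Id` and `Values` are taken from the table above.

```python
import warnings

import numpy as np
import pandas as pd
import scipy.stats as st

# Candidate distributions (an example list, replace with your own).
CANDIDATES = [st.norm, st.expon]


def fit_group(pdf: pd.DataFrame) -> pd.DataFrame:
    """Fit every candidate to one Id's values and keep the lowest-SSE fit."""
    data = pdf["Values"].to_numpy(dtype=float)
    y, edges = np.histogram(data, bins=50, density=True)
    x = (edges[:-1] + edges[1:]) / 2.0  # bin centres
    best_name, best_params, best_sse = "norm", (0.0, 1.0), np.inf
    for dist in CANDIDATES:
        try:
            with warnings.catch_warnings():
                warnings.simplefilter("ignore")
                params = dist.fit(data)  # (shape args..., loc, scale)
                sse = float(np.sum((y - dist.pdf(x, *params)) ** 2))
        except Exception:
            continue  # skip candidates that fail to fit
        if 0 < sse < best_sse:
            best_name, best_params, best_sse = dist.name, params, sse
    # One output row per group, matching the schema used in fit_all.
    return pd.DataFrame({
        "Id": [pdf["Id"].iloc[0]],
        "distribution": [best_name],
        "params": [[float(p) for p in best_params]],
    })


def fit_all(sdf):
    """sdf is assumed to be a Spark DataFrame with columns Id and Values."""
    schema = "Id long, distribution string, params array<double>"
    return sdf.groupBy("Id").applyInPandas(fit_group, schema=schema)
```

If a dictionary is still needed on the driver, something like `{r["Id"]: (r["distribution"], r["params"]) for r in fit_all(df).collect()}` should rebuild it, assuming the number of distinct Ids is small enough to collect.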

Related

A bug is encountered with the "Maskable PPO" training with custom Env setup

I encountered an error while using the SB3-contrib Maskable PPO action-masking algorithm.
File ~\anaconda3\lib\site-packages\sb3_contrib\common\maskable\distributions.py:231, in MaskableMultiCategoricalDistribution.apply_masking(self, masks)
228 masks = th.as_tensor(masks)
230 # Restructure shape to align with logits
--> 231 masks = masks.view(-1, sum(self.action_dims))
233 # Then split columnwise for each discrete action
234 split_masks = th.split(masks, tuple(self.action_dims), dim=1)
RuntimeError: shape '[-1, 1600]' is invalid for input of size 800
I am running a learning program in which the action is a MultiBinary space with 800 selections of 0 or 1.
The action space is defined as below:
self.action_space = spaces.MultiBinary(800)
Within the custom environment class, an "action_mask" function was created that returns a list of 800 boolean values.
Now, when I follow the documentation and start training the model, the error message above appears:
import os
import time
from sb3_contrib import MaskablePPO
from sb3_contrib.common.maskable.utils import get_action_masks
from Equities_RL_Env import Equities_RL_Env
models_dir = "models/V1 31-Jul/"
logdir = f"logs/{time.strftime('%d %b %Y %H-%M', time.localtime())}/"
# Create the output directories if they do not exist yet
if not os.path.exists(models_dir):
    os.makedirs(models_dir)
if not os.path.exists(logdir):
    os.makedirs(logdir)
# Normalize_frame, historical_frame and pf are defined elsewhere in my project
env = Equities_RL_Env(Normalize_frame(historical_frame), pf)
env.reset()
model = MaskablePPO('MlpPolicy', env, verbose=1, tensorboard_log=logdir)
TIMESTEPS = 1000
iters = 0
# Train in chunks of TIMESTEPS and save a checkpoint after each chunk
while iters <= 1000000:
    iters += 1
    model.learn(total_timesteps=TIMESTEPS, reset_num_timesteps=False, tb_log_name="PPO")
    model.save(f"{models_dir}/{TIMESTEPS*iters}")
Is there a way to define that shape within the custom environment?
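A reading of the traceback, offered as a hedged sketch: the mask is reshaped to `sum(self.action_dims)` = 1600 columns, which is 2 × 800, and then split into per-action chunks. That suggests each binary action expects two mask entries (one per possible value) rather than one. The helper below is hypothetical, not part of SB3-contrib. It assumes that the 800 booleans mean "this action may be set to 1", that value 0 is always allowed, and that within each pair the entry for value 0 comes first.

```python
import numpy as np

N_ACTIONS = 800  # size of spaces.MultiBinary(800) in the question


def expand_binary_mask(can_select) -> np.ndarray:
    """Turn one bool per binary action into two bools (value 0, value 1) per action."""
    can_select = np.asarray(can_select, dtype=bool)
    mask = np.ones((can_select.size, 2), dtype=bool)  # column 0: value 0, assumed always allowed
    mask[:, 1] = can_select                           # column 1: value 1, allowed only where flagged
    return mask.reshape(-1)                           # flattened as [a0=0, a0=1, a1=0, a1=1, ...]


flags = np.zeros(N_ACTIONS, dtype=bool)
flags[3] = True                        # only action 3 may be switched on
full_mask = expand_binary_mask(flags)
print(full_mask.shape)                 # (1600,)
```

Under these assumptions, the environment's mask function would return `full_mask` (length 1600) instead of the 800-element list.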

SqueezeNet Deep Compression

Does anyone know where or how to obtain the 0.47 MB version of SqueezeNet?
In other words, how can the weight bit width be set to 6 instead of 8?
I cannot find where to make this change in the SqueezeNet generation code.
With the following method I got a 0.77 MB model. Let's assume we have a SqueezeNet_model. We can convert SqueezeNet to a TensorFlow Lite model:
import os
import tensorflow as tf
converter = tf.lite.TFLiteConverter.from_keras_model(SqueezeNet_model)
tflite_model = converter.convert()
open("SqueezeNet_model.tflite", "wb").write(tflite_model)
Then we can use post-training quantization to decrease the size of the model:
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_quant_model = converter.convert()
open("SqueezeNet_Quant_model.tflite", "wb").write(tflite_quant_model)
print("Quantized model in MB:", os.path.getsize('SqueezeNet_Quant_model.tflite') / float(2**20))  # I got a 0.77 MB model
Finally, we can test our model with:
import numpy as np
# Load TFLite model and allocate tensors.
interpreter = tf.lite.Interpreter(model_path="SqueezeNet_Quant_model.tflite")
interpreter.allocate_tensors()
# Get input and output tensors.
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
# Test model on some input data (x_test and y_test are your own test set, y_test one-hot encoded).
input_shape = input_details[0]['shape']
acc = 0
for i in range(len(x_test)):
    input_data = np.array(x_test[i].reshape(input_shape), dtype=np.float32)
    interpreter.set_tensor(input_details[0]['index'], input_data)
    interpreter.invoke()
    output_data = interpreter.get_tensor(output_details[0]['index'])
    # Count a hit when the predicted class matches the label
    if np.argmax(output_data) == np.argmax(y_test[i]):
        acc += 1
acc = acc / len(x_test)
print(acc * 100)

An issue with argument "sortv" of function seqIplot()

I'm trying to plot individual sequences with the function seqIplot() in TraMineR. These individual sequences represent work trajectories, reported by a school's former graduates via a web questionnaire.
Using argument "sortv", I'd like to sort my sequences according to the order of the levels of one covariate, the year of graduation, named "PROMO".
"PROMO" is a factor variable contained in a data frame named "covariates.seq", gathering covariates together:
str(covariates.seq)
'data.frame': 733 obs. of 6 variables:
 $ ID_SQ           : Factor w/ 733 levels "1","2","3","5",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ SEXE            : Factor w/ 2 levels "Féminin","Masculin": 1 1 1 1 2 1 1 2 2 1 ...
 $ PROMO           : Factor w/ 6 levels "1997","1998",..: 1 2 2 4 4 3 2 2 2 2 ...
 $ DEPARTEMENT     : Factor w/ 10 levels "BC","GCU","GE",..: 1 4 7 8 7 9 9 7 7 4 ...
 $ NIVEAU_ADMISSION: Factor w/ 2 levels "En Premier Cycle",..: NA 1 1 1 1 1 NA 1 1 1 ...
 $ FILIERE_SECTION : Factor w/ 4 levels "Cursus Classique",..: NA 4 2 NA 1 1 NA NA 4 3 ..
I'm also using "SEXE", the graduates' gender, as a grouping variable. To plot the individual sequences this way, I use the following command:
seqIplot(sequences, group = covariates.seq$SEXE,
sortv = covariates.seq$PROMO,
cex.axis = 0.7, cex.legend = 0.7)
I expected that, by using a process time axis (with the year of graduation as sequence-dependent origin), sorting the sequences according to the order of the levels of "PROMO" would give a plot with groups of sequences from the longest (for the older graduates) to the shortest (for the younger graduates).
But I've got an issue: in the output plot, the sequences don't appear to be correctly sorted according to the levels of "PROMO". Indeed, by using "sortv = covariates.seq$PROMO" as in the command above, the plot doesn't show groups of sequences from the longest to the shortest, as expected. It looks like the plot obtained without using the argument "sortv" (see Figures below).
Without using argument "sortv"
Using "sortv = covariates.seq$PROMO"
Note that I have 733 individual sequences in my object "sequences", created as follows:
labs <- c("En poste", "Au chômage (d'au moins 6 mois)", "Autre situation (d'au moins 6 mois)", "En poursuite d'études (thèse ou hors thèse)", "En reprise d'études / formation (d'au moins 6 mois)")
codes <- c("En poste", "Au chômage", "Autre situation", "En poursuite d'études", "En reprise d'études / formation")
sequences <- seqdef(situations, alphabet = labs, states = codes,
                    left = NA, right = "DEL", missing = NA,
                    cnames = as.character(seq(0, 7400/365, 1/365)),
                    xtstep = 365)
The values of the covariates are sorted in the same order as the individual sequences. The covariate "PROMO" doesn't contain any missing value.
Something's going wrong, but what?
Thank you in advance for your help,
Best,
Arnaud.
Using a factor as sortv argument in seqIplot works fine as illustrated by the example below:
sdc <- c("aabbccdd","bbbccc","aaaddd","abcabcab")
sd <- seqdecomp(sdc, sep="")
seq <- seqdef(sd)
fac <- factor(c("2000","2001","2001","2000"))
par(mfrow=c(1,3))
seqIplot(seq, with.legend=FALSE)
seqIplot(seq, sortv=fac, with.legend=FALSE)
seqlegend(seq)

downsampling rate with movement data (first point equal from the original matrix)

I was wondering whether the procedure I used to downsample my data is appropriate. I followed the documented syntax y = downsample(x,n):
downsamp_rate = 40;
downsampled_data = downsample(X,downsamp_rate);
My doubt is why the first column of both matrices (the original and the downsampled one) is exactly the same, keeping the same data, while the other columns have been reduced to a lower sample rate.
Thank you so much!
Best!
Edit: sample data. I pasted the data below, but I can also upload the .mat files.
Original data.
column 1 column 2 column 3
-0,593600000000000 -0,592699999999996 -0,591899999999995
2,42180000000000 2,41010000000000 2,40360000000000
1,78550000000000 1,79020000000000 1,79530000000000
-1,30590000000000 -1,31520000000000 -1,31530000000000
-0,707800000000003 -0,712699999999999 -0,727700000000003
-0,986500000000001 -0,996000000000002 -1,00460000000000
-0,989699999999999 -0,989699999999999 -0,989699999999999
1,23500000000000 1,22970000000000 1,21880000000000
0,122899999999998 0,127899999999997 0,128899999999998
0,938300000000003 0,937500000000002 0,936200000000004
0,248600000000004 0,248500000000002 0,248700000000002
-0,381499999999996 -0,393199999999999 -0,393699999999997
0,294099999999997 0,279299999999999 0,271299999999997
-0,223200000000001 -0,223699999999999 -0,227299999999997
0,0879999999999992 0,117300000000004 0,122500000000003
-0,167899999999999 -0,170999999999999 -0,174800000000003
-0,687499999999996 -0,697199999999998 -0,701600000000002
-0,681700000000002 -0,682200000000000 -0,683000000000000
1,19659999999999 1,19670000000000 1,19490000000000
-0,565500000000008 -0,565199999999999 -0,557400000000008
Downsampled data
column 1 column 2 column 3
-0,593600000000000 0,821900000000003 0,936300000000001
2,42180000000000 1,14610000000000 -0,255400000000000
1,78550000000000 2,86550000000000 3,66890000000000
-1,30590000000000 7,01950000000000 12,9564000000000
-0,707800000000003 3,05920000000000 0,852999999999998
-0,986500000000001 -0,372200000000000 -0,951000000000002
-0,989699999999999 -0,988000000000000 -1,21730000000000
1,23500000000000 5,79700000000000 3,40880000000000
0,122899999999998 5,32230000000000 5,19260000000000
0,938300000000003 4,88130000000000 7,55900000000000
0,248600000000004 4,79290000000000 2,96620000000000
-0,381499999999996 -0,400000000000000 0,641500000000000
0,294099999999997 -0,131400000000004 -1,20040000000000
-0,223200000000001 1,49610000000000 1,59030000000000
0,0879999999999992 0,418700000000000 -0,0114999999999976
-0,167899999999999 0,0149999999999983 -0,857500000000000
-0,687499999999996 -0,593100000000002 0,119700000000000
-0,681700000000002 -0,170000000000003 0,126799999999999
1,19659999999999 1,17670000000000 1,15780000000000
-0,565500000000008 8,89019999999999 6,58569999999999
A possible explanation for your output is a periodic input signal with a period length of downsamp_rate-1. A short demonstration:
>> X=repmat(1:39,1,10);
>> downsampled_data = downsample(X,downsamp_rate);
>> downsampled_data
downsampled_data =
Columns 1 through 9
1 2 3 4 5 6 7 8 9
Column 10
10
So take a look at your rows 40, 41 and 42. I assume the first value is identical to that of your rows 1, 2 and 3.
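For readers without MATLAB, the demonstration above can be approximated in NumPy. This sketch uses the slice `x[::n]` as a stand-in for `downsample(x, n)`: both keep the first sample and then every n-th sample after it, so the first sample of the output always equals the first sample of the input.

```python
import numpy as np

downsamp_rate = 40
x = np.tile(np.arange(1, 40), 10)   # 1..39 repeated 10 times, like repmat(1:39,1,10)
downsampled = x[::downsamp_rate]    # keep samples 0, 40, 80, ... (the first one is always kept)
print(downsampled)                  # the values 1 through 10
print(downsampled[0] == x[0])       # True
```

Because the period (39) is one less than the downsampling rate (40), each retained sample lands one step further into the period, which is why the output walks through 1, 2, 3 and so on.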

POS tagging in Scala

I tried to POS tag a sentence in Scala using the Stanford parser, as shown below:
import edu.stanford.nlp.parser.lexparser.LexicalizedParser
import edu.stanford.nlp.trees.Tree
val lp: LexicalizedParser = LexicalizedParser.loadModel("edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz")
lp.setOptionFlags("-maxLength", "50", "-retainTmpSubcategories")
val s = "I love to play"
val parse: Tree = lp.apply(s)  // this is the line that fails to compile
val taggedWords = parse.taggedYield()
println(taggedWords)
I got the error "type mismatch; found : java.lang.String required: java.util.List[_ <: edu.stanford.nlp.ling.HasWord]" on the line val parse: Tree = lp.apply(s).
I don't know whether this is the right way of doing it or not. Are there any other easy ways of POS tagging a sentence in Scala?
You might like to consider the FACTORIE toolkit (http://github.com/factorie/factorie). It is a general library for machine learning and graphical models that happens to include an extensive suite of natural language processing components (tokenization, token normalization, morphological analysis, sentence segmentation, part-of-speech tagging, named entity recognition, dependency parsing, mention finding, coreference).
Furthermore it is written entirely in Scala, and it is released under the Apache License.
Documentation is currently sparse, but will be improving in the coming months.
For example, once Maven-based installation is finished you can type at the command line:
bin/fac nlp --pos1 --parser1 --ner1
to launch a socket-listening multi-threaded NLP server. Then query it by piping plain text to its socket number:
echo "Mr. Jones took a job at Google in New York. He and his Australian wife moved from New South Wales on 4/1/12." | nc localhost 3228
The output is then
1 1 Mr. NNP 2 nn O
2 2 Jones NNP 3 nsubj U-PER
3 3 took VBD 0 root O
4 4 a DT 5 det O
5 5 job NN 3 dobj O
6 6 at IN 3 prep O
7 7 Google NNP 6 pobj U-ORG
8 8 in IN 7 prep O
9 9 New NNP 10 nn B-LOC
10 10 York NNP 8 pobj L-LOC
11 11 . . 3 punct O
12 1 He PRP 6 nsubj O
13 2 and CC 1 cc O
14 3 his PRP$ 5 poss O
15 4 Australian JJ 5 amod U-MISC
16 5 wife NN 6 nsubj O
17 6 moved VBD 0 root O
18 7 from IN 6 prep O
19 8 New NNP 9 nn B-LOC
20 9 South NNP 10 nn I-LOC
21 10 Wales NNP 7 pobj L-LOC
22 11 on IN 6 prep O
23 12 4/1/12 NNP 11 pobj O
24 13 . . 6 punct O
Of course there is a programmatic API to all this functionality as well.
import cc.factorie._
import cc.factorie.app.nlp._
val doc = new Document("Education is the most powerful weapon which you can use to change the world.")
DocumentAnnotatorPipeline(pos.POS1).process(doc)
for (token <- doc.tokens)
println("%-10s %-5s".format(token.string, token.posLabel.categoryValue))
will output:
Education NN
is VBZ
the DT
most RBS
powerful JJ
weapon NN
which WDT
you PRP
can MD
use VB
to TO
change VB
the DT
world NN
. .
I found a very simple way to do POS tagging in Scala.
Step 1
Download Stanford tagger version 3.2.0 from the link below:
http://nlp.stanford.edu/software/stanford-postagger-2013-06-20.zip
Step 2
Add the stanford-postagger jar from that folder to your project, and also place the english-left3words-distsim.tagger file from the models folder in your project.
Then, with the code below, you can POS tag a sentence in Scala:
import edu.stanford.nlp.tagger.maxent.MaxentTagger
// Load the tagger model placed in the project in Step 2
val tagger = new MaxentTagger("english-left3words-distsim.tagger")
val art_con = "My name is Rahul"
val tagged = tagger.tagString(art_con)
println(tagged)
Output: My_PRP$ name_NN is_VBZ Rahul_NNP
I believe the API of the Stanford Parser has changed somewhat, as it does sometimes. apply has the signature public Tree apply(java.util.List<? extends HasWord> words), which is what you see in the error message.
What you should use now is parse, which has the signature public Tree parse(java.lang.String sentence).