ggpmisc::stat_poly_eq() does not consider weighting by a variable - ggpmisc

How can I get ggpmisc::stat_poly_eq() to take into account weighting by a variable? I don't see in the documentation a parameter to include weight as it exists in geom_smooth and if I try to force it within aes() it results in 'Warning: Ignoring unknown parameters: weight.' I am aware of other workarounds to print the correct equation as shown here but it would be so much cleaner to have it in stat_poly_eq(). Has anybody figured this out or am I missing something? Thanks everyone.
stuff<-structure(list(var = 2:31, mean = c(17026.5, 11028.6842105263,
13113.1111111111, 11087.3679395386, 9863.8664212548, 10060.676012167,
9378.01091924399, 9790.67922990444, 8569.95788246269, 8839.68511390887,
7656.50625471556, 7370.78564257028, 7939.13425925926, 7541.83192090395,
8845.67474747475, 8023.03099415205, 6373.05976190476, 6337.93259803922,
6824.79901960784, 7450.80769230769, 6651.81884057971, 5548.59722222222,
7802.78205128205, 3627.07407407407, 2471, 2248.33333333333, 1368.7,
2104.25, 742, 2097.5), n = c(2L, 19L, 150L, 419L, 1562L, 3178L,
3880L, 2965L, 4288L, 2780L, 2209L, 664L, 54L, 59L, 66L, 57L,
50L, 34L, 34L, 26L, 23L, 18L, 13L, 9L, 7L, 7L, 5L, 2L, 4L, 2L
)), .Names = c("var", "mean", "n"), row.names = c(NA, -30L), class = "data.frame")
ggplot(data=stuff, aes(x=var, y=mean))+
geom_point()+
geom_smooth(aes(weight=n), method='lm', formula = y~x)+
stat_poly_eq(aes(label=paste(..eq.label.., ..rr.label.., sep="~~~")),
formula=y~x, label.x.npc=0.8, label.y.npc=0.8,
coef.digits=3, parse=TRUE)
[Figure: plot produced from the stuff data]
#currently, the equation printed in the plot does not correspond to the coefficients obtained with weight=n
truecoeff<-lm(data=stuff, mean~var, weights=n)
truecoeff

Thanks for reporting this! I have created an Issue from your message. This is now fixed in the development version to be submitted before 2018-07-31 to CRAN as 'ggpmisc' 0.3.0.

Getting Import Error quite randomly when using plotly express and having multiple graphs on one page

Relatively new to Dash, and this is a problem that has been vexing me for months now. I am making a multi-page app that shows some basic data trends using cards, and graphs embedded within cardbody. 30% of the time, the app works well without any errors and the other 70% it throws either one of the following:
ImportError: cannot import name 'ValidatorCache' from partially initialized module 'plotly.validator_cache' (most likely due to a circular import)
OR
ImportError: cannot import name 'Layout' from partially initialized module 'plotly.graph_objects' (most likely due to a circular import)
Both these appear quite randomly and I usually refresh the app to make them go away. But obviously I am doing something wrong. I have a set of dropdowns that trigger callbacks on graphs. I have been wracking my head about this. Any help/leads would be appreciated. The only pattern I see in the errors is they seem to emerge when the plotly express graphs are being called in the callbacks.
What am I doing wrong? I have searched all over online for help but nothing yet.
Sharing with some relevant snippets of code (this may be too long and many parts not important to the question, but to give you a general idea of what I have been working towards)
import dash
import dash_bootstrap_components as dbc
from dash.dependencies import Input, Output, State
import dash_core_components as dcc
import dash_html_components as html
import plotly.graph_objs as go
from plotly.subplots import make_subplots
import plotly.express as px
card_content1_1 = [
dbc.CardHeader([dbc.Row([html.H5("SALES VOLUME TREND", className = "font-weight-bold text-success"),
dbc.Button(
html.I(className="fa fa-window-maximize"),
color="success",
id="sales_maximize",
className="ml-auto",
# href="www.cogitaas.com"
)
])]),
dbc.CardBody(
[dcc.Graph(
id='sales_graph',
figure={},
style={'height':'30vh'}
# className="mt-5"
)])]
card_stacked_discount = [
dbc.CardHeader([dbc.Row([html.H5("VOLUMES UNDER DIFFERENT DISCOUNT LEVELS", className="font-weight-bold text-info text-center"),
dbc.Button(
html.I(className="fa fa-window-maximize"),
color="info",
id="discount_maximize",
className="ml-auto",
# href="www.cogitaas.com"
)
])]),
dbc.CardBody(
[dcc.Dropdown(
id = 'stacked_discount_dropdown',
options =stacked_discount_options,
value=stacked_discount_options[0].get('value'),
style={'color':'black'},
# multi=True
),
dcc.Graph(
id='stacked_discount_graph',
figure={},
style={'height':'30vh'}
)])]
cards = html.Div(
[
dbc.Row(
[
dbc.Col(dbc.Card(card_content1_1, color="success", outline=True,
style={'height':'auto'}), width=8),
],
className="mb-4",
),
dbc.Row(
[
dbc.Col(dbc.Card(card_stacked_discount, color="info", outline=True), width=8),
dbc.Col(dbc.Card([
dbc.Row([
dbc.Col(dbc.Card(disc_sub_title, color="info", inverse=True)),
]),
html.Br(),
dbc.Row([
dbc.Col(dbc.Card(disc_sub_card1, color="info", outline=True)),
]),
]), width=4)
],
className="mb-4",
),
]
)
tab1_content = dbc.Card(
dbc.CardBody(
[cards,]
),
className="mt-3",
)
tabs = dbc.Tabs([dbc.Tab(tab1_content, label="Data", label_style={'color':'blue'}, tab_style={"margin-left":"auto"}),])
content = html.Div([
html.Div([tabs]),
],id="page-content")
app.layout = html.Div([dcc.Location(id="url"), content])
@app.callback(
    dash.dependencies.Output('sales_graph', 'figure'),
    [dash.dependencies.Input('platform-dropdown', 'value'),
     dash.dependencies.Input('signature-dropdown', 'value'),
     dash.dependencies.Input('franchise-dropdown', 'value'),
     dash.dependencies.Input('sales_maximize', 'n_clicks'),
     dash.dependencies.Input('time-dropdown', 'value'),
     ])
def update_sales_graph(plat, sign, fran, maximize, time_per):
    print(str(time_per)+"Test")
    # A single selection arrives as a string; wrap it in a list
    time_ax=[]
    if isinstance(time_per,str):
        time_ax.append(time_per)
        time_per=time_ax
    # Filter the data for the selected period
    if (time_per is None) or ('Full Period' in (time_per)):
        dff = df[(df.Platform==plat) & (df.Signature==sign) & (df.Franchise==fran)]
    elif ('YTD' in time_per):
        dff = df[(df.Platform == plat) & (df.Signature == sign) & (df.Franchise == fran) & (df.year==2020)]
    else:
        dff = df[(df.Platform==plat) & (df.Signature==sign) & (df.Franchise==fran) & (df.Qtr_yr.isin(time_per))]
    fig = px.area(dff, x='Date', y='Qty_Orig', color_discrete_sequence=px.colors.qualitative.Dark2)
    fig.add_trace(go.Scatter(x=dff['Date'], y=dff['Outliers'], mode = 'markers', name='Outliers',
                             line=dict(color='darkblue')))
    fig.add_trace(go.Scatter(x=dff['Date'], y=dff['bestfit'], name='Long Term Trend',
                             line=dict(color='darkblue')))
    fig.update_layout(font_family="Rockwell",
                      title={'text': fran + " Volume Trend",
                             'y': 0.99,
                             # 'x': 0.15,
                             # 'xanchor': 'auto',
                             'yanchor': 'top'
                             },
                      legend=dict(
                          orientation="h",
                          # y=-.15, yanchor="bottom", x=0.5, xanchor="center"
                      ),
                      yaxis_visible=False, yaxis_showticklabels=False,
                      xaxis_title=None,
                      margin=dict(l=0, r=0, t=0, b=0, pad=0),
                      plot_bgcolor='White',
                      paper_bgcolor='White',
                      )
    fig.update_xaxes(showgrid=False, zeroline=True)
    fig.update_yaxes(showgrid=False, zeroline=True)
    # Open the figure in a separate window when the maximize button triggered the callback
    changed_id = [p['prop_id'] for p in dash.callback_context.triggered][0]
    if 'maximize' in changed_id:
        fig.show()
    return fig
@app.callback(
    dash.dependencies.Output('stacked_discount_graph', 'figure'),
    [dash.dependencies.Input('platform-dropdown', 'value'),
     dash.dependencies.Input('signature-dropdown', 'value'),
     dash.dependencies.Input('franchise-dropdown', 'value'),
     dash.dependencies.Input('discount_maximize', 'n_clicks'),
     dash.dependencies.Input('stacked_discount_dropdown', 'value'),
     dash.dependencies.Input('time-dropdown', 'value'),
     ])
def stacked_discount(plat, sign, fran, maximize, sales_days, time_per):
    # A single selection arrives as a string; wrap it in a list
    time_ax=[]
    if isinstance(time_per,str):
        time_ax.append(time_per)
        time_per=time_ax
    # else:
    # time_per=list(time_per)
    # Filter the data for the selected period
    if (time_per is None) or ('Full Period' in (time_per)):
        df_promo = df_promo_vol[(df_promo_vol.Platform==plat) & (df_promo_vol.Signature==sign) & (df_promo_vol.Franchise==fran)]
    elif ('YTD' in time_per):
        df_promo = df_promo_vol[(df_promo_vol.Platform == plat) & (df_promo_vol.Signature == sign) & (df_promo_vol.Franchise == fran) & (df_promo_vol.Year==2020)]
    else:
        df_promo = df_promo_vol[(df_promo_vol.Platform==plat) & (df_promo_vol.Signature==sign) & (df_promo_vol.Franchise==fran) & (df_promo_vol.Qtr_yr.isin(time_per))]
    color_discrete_map = {
        "0 - 10": "orange",
        "10 - 15": "green",
        "15 - 20": "blue",
        "20 - 25": "goldenrod",
        "25 - 30": "magenta",
        "30 - 35": "red",
        "35 - 40": "aqua",
        "40 - 45": "violet",
        "45 - 50": "brown",
        "50 + ": "black"
    }
    category_orders = {'disc_range': ['0 - 10', '10 - 15', '15 - 20', '20 - 25', '25 - 30', '30 - 35', '35 - 40',
                                      '40 - 45', '45 - 50', '50 + ']}
    if (sales_days is None) or (sales_days == 'sales_act'):
        fig = px.bar(df_promo, x='start', y='units_shipped', color='disc_range',
                     color_discrete_map=color_discrete_map,
                     category_orders=category_orders,
                     )
    else:
        fig = px.bar(df_promo, x='start', y='Date', color="disc_range",
                     color_discrete_map=color_discrete_map,
                     category_orders=category_orders,
                     )
    fig.update_layout(font_family="Rockwell",
                      title={'text': fran + " Sales Decomposition",
                             'y': 0.99,
                             'x': 0.1,
                             # 'xanchor': 'auto',
                             'yanchor': 'top'
                             },
                      legend=dict(
                          orientation="h",
                          # y=-.15, yanchor="bottom", x=0.5, xanchor="center"
                      ),
                      # yaxis_visible=False, yaxis_showticklabels=False,
                      xaxis_title=None,
                      margin=dict(l=0, r=0, t=30, b=30, pad=0),
                      plot_bgcolor='White',
                      paper_bgcolor='White',
                      )
    fig.update_xaxes(showgrid=False, zeroline=True)
    fig.update_yaxes(showgrid=False, zeroline=True)
    # Open the figure in a separate window when the maximize button triggered the callback
    changed_id = [p['prop_id'] for p in dash.callback_context.triggered][0]
    if 'maximize' in changed_id:
        fig.show()
    return fig
Well, it appears I may have stumbled onto an answer. I was using pretty much the same inputs for multiple callbacks, and that could have been interfering with the sequencing of inputs. Once I integrated the code into one callback with multiple outputs, the problem seems to have disappeared.
Was dealing with this same issue where everything in my app worked fine, then I made an entirely separate section & callback that started throwing those circular import errors.
Was reluctant to re-arrange my (rightfully) separated callbacks to be just a single one and found you can fix the issue by just simply importing what the script says it's failing to get. In my case, plotly was trying to import the ValidatorCache and Layout so adding these to the top cleared the issue and now my app works as expected. Hope this helps someone experiencing a similar issue.
from plotly.graph_objects import Layout
from plotly.validator_cache import ValidatorCache

Spark : subtractByKey issue (pyspark)

I'm having an issue with subtractByKey.
I have 2 files:
The first one looks like this (client ID + client mail):
client_id emails
4A85FD8E-197D-2AE3-B939-A527AFF16A04 imperdiet.non.vestibulum@mon***tur.com
D48D530C-CF68-DAF1-18F0-E0A0A03F3E06 rutrum.urna@estm***ncus.net:facilisis@i****m.ca
40815230-25DC-9EA0-01D1-2706B4B56958 iaculis.nec.eleifend@gr****nc.net
...
and the second one : (Only Mail)
pharetra@P****s.com
ut.aliquam@o****m.org
erat@a****e.edu
....
Some lines in the first file can have 2 (or more) mails, in this format:
mail:mail
What I did:
test1=sc.textFile("file1")
test2=sc.textFile("file2")
test3=test1.subtractByKey(test2)
and the result is:
[(u'A', u'B'), (u'A', u'D'), (u'A', u'1'), (u'A', u'D'), (u'A', u'D'), (u'A', u'B'), (u'A', u'F'), (u'A', u'E'), (u'A', u'9'), (u'A', u'5'), (u'A', u'9'), (u'A', u'6'), (u'c', u'l'), (u'E', u'8'), (u'E', u'4'), (u'E', u'6'), (u'E', u'6'), (u'E', u'7'), (u'E', u'5'), (u'E', u'5'), (u'E', u'5'), (u'E', u'2'), (u'E', u'8'), (u'C', u'2'), (u'C', u'5'), (u'C', u'6'), (u'C', u'C'), (u'C', u'E'), (u'C', u'3'), (u'C', u'F'), (u'C', u'4'), (u'C', u'B'), (u'C', u'F'), (u'C', u'F'), (u'C', u'8'), (u'C', u'0'), (u'1', u'D'), (u'1', u'2'), (u'1', u'3'), (u'1', u'8'), (u'1', u'0'), (u'1', u'F'), ... ]
I wanted to delete clients in the first file who had their mails in the second file but it did not work.
Note: I am not very familiar with pyspark, but the Spark API should be the same. Also note that sc.textFile() returns an RDD of plain strings, not key/value pairs, so subtractByKey() has no real key to work with; the single-character pairs in your output come from indexing into each line.
First, split each line into a (client_id, emails) pair:
rdd1=sc.textFile("file1").map(lambda line: (line.split(" ")[0], line.split(" ")[1]))
This gives you an RDD like
[(4A85FD8E-197D-2AE3-B939-A527AFF16A04,imperdiet.non.vestibulum@mon***tur.com)]
Since a line may hold several emails, split them with flatMapValues():
rdd2 = rdd1.flatMapValues(lambda email: email.split(":"))
This gives you a pair RDD in which each element contains exactly one email.
Now swap the key and the value:
rdd3=rdd2.map(lambda kv: (kv[1], kv[0]))
You now have an RDD that uses the email as the key and the UUID as the value, such as
[(imperdiet.non.vestibulum@mon***tur.com, 4A85FD8E-197D-2AE3-B939-A527AFF16A04)]
Next, find which UUIDs have an email contained in file2. To do that, load the second file as a pair RDD:
secondRdd = sc.textFile("file2").map(lambda line: (line, 1))
Then join and reshape the result:
rdd4 = rdd3.join(secondRdd).map(lambda kv: (kv[1][0], kv[0]))
If everything went right, rdd4 has the format (UUID, email) and represents all the users whose email occurs in file2.
Finally, do a subtractByKey() against the rdd1 we built at the start, i.e. rdd1.subtractByKey(rdd4).
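To check the logic of these steps without a cluster, here is a minimal plain-Python sketch that mirrors them on in-memory lists. It is not Spark code: the function name clients_without_listed_emails and the sample data are made up for illustration, and it assumes each line of the first file is "client_id", a single space, then the colon-separated emails.

```python
def clients_without_listed_emails(client_lines, email_lines):
    """Return (client_id, emails) pairs for clients with no email in email_lines."""
    # Step 1: (client_id, emails) pairs, like rdd1.
    pairs = [tuple(line.split(" ", 1)) for line in client_lines if line.strip()]
    # Steps 2 and 3: one (email, client_id) pair per email, like rdd2 then rdd3.
    by_email = [(email, cid) for cid, emails in pairs for email in emails.split(":")]
    # Step 4: client IDs with at least one email in the second file (the join, rdd4).
    listed = {line.strip() for line in email_lines}
    matched_ids = {cid for email, cid in by_email if email in listed}
    # Step 5: drop the matched clients, like rdd1.subtractByKey(rdd4).
    return [(cid, emails) for cid, emails in pairs if cid not in matched_ids]


# Hypothetical sample data in the same shape as the two files.
file1 = [
    "id-1 a@example.com",
    "id-2 b@example.com:c@example.com",
    "id-3 d@example.com",
]
file2 = ["c@example.com", "x@example.com"]

remaining = clients_without_listed_emails(file1, file2)
print(remaining)
```

Client id-2 is dropped because one of its two emails appears in the second list; the other two clients are kept with their original email strings.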

How to make distinct tuples in Scala

I have an RDD and I want to create a new RDD with unique values, but I get an error.
The code:
val rdd = sc.textFile("/user/ergorenova/socialmedia/allus/archivosOrigen").map( _.split(",", -1) match {
case Array(caso, canal, lote, estado, estadoo, estadooo, fechacreacioncaso, fechacierrecaso, username, clientid, nombre, apellido, ani, email) =>(canal, username, ani, email)
}).distinct
val twtface = rdd.map {
case ( canal, username, ani, email ) =>
val campoAni = "ANI"
(campoAni , ani , canal , username)
}.distinct()
twtface.take(3).foreach(println)
This is the CSV file
caso2,canal2,lote,estado3,estado4,estado5,fechacreacioncaso2,fechacierrecaso2,username,clientid,nombre,apellido,ani,email
2694464,Twitter,Redes Sociales Movistar - Twitter,Cerrado por Abandono – Robot,,,16/04/2015 23:57:51,17/04/2015 6:00:19,kariniseta,158,,,22,mmmm@test.com
2694464,Twitter,Redes Sociales Movistar - Twitter,Cerrado por Abandono – Robot,,,16/04/2015 23:57:51,17/04/2015 6:00:19,kariniseta,158,,,22,mmmm@test.com
2635376,Facebook,Redes Sociales Movistar - Facebook,Cerrado por Abandono – Robot,,,03/04/2015 20:20:18,04/04/2015 2:30:06,martin.saggini,1126,,,,
2635376,Facebook,Redes Sociales Movistar - Facebook,Cerrado por Abandono – Robot,,,03/04/2015 20:20:18,04/04/2015 2:30:06,martin.saggini,1126,,,,
Error:
scala.MatchError: [Ljava.lang.String;@dea08cc (of class [Ljava.lang.String;)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:21)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:21)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:211)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I assume the error is due to a missing/additional newline in your csv file.
Your split and match assumes that every line of the csv has exactly 14 fields. Depending on the encoding or text editor you use, you may have additional new lines at the end of the document.
My suggestion would be to validate each line and add a catch-all case that gives you a more detailed error message, that way you will avoid the ambiguous MatchError.
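The validation idea can be sketched as follows. This is in Python rather than Scala, purely as an illustration of the check; the function name parse_line and the shortened sample line are assumptions, while the expected count of 14 fields and the positions of canal, username, ani and email come from the pattern in the question.

```python
EXPECTED_FIELDS = 14  # number of names in the Array(...) pattern from the question


def parse_line(line):
    """Return (canal, username, ani, email), or raise with a descriptive message."""
    # str.split(",") keeps trailing empty fields, like Scala's split(",", -1).
    fields = line.rstrip("\r\n").split(",")
    if len(fields) != EXPECTED_FIELDS:
        # The catch-all branch: report what was actually seen instead of a bare match failure.
        raise ValueError(
            "expected %d fields but got %d in line %r"
            % (EXPECTED_FIELDS, len(fields), line)
        )
    return fields[1], fields[8], fields[12], fields[13]


# Hypothetical line with the same layout as the CSV in the question.
good = "2694464,Twitter,lote,estado,,,16/04/2015 23:57:51,17/04/2015 6:00:19,kariniseta,158,,,22,mmmm@test.com"
print(parse_line(good))
```

An empty trailing line produces a single empty field, so it fails the count check and the error message shows the offending line.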

Required tag missing error on MarketDataRequest query

I have a FIX44.MarketDataRequest query based on the Trade Client example from the QuickFIX/n example applications, but I can't get anything useful from the server with it. I always get the error Required tag missing for this query.
Here is my code:
MDReqID mdReqID = new MDReqID("MARKETDATAID");
SubscriptionRequestType subType = new SubscriptionRequestType(SubscriptionRequestType.SNAPSHOT);
MarketDepth marketDepth = new MarketDepth(0);
QuickFix.FIX44.MarketDataRequest.NoMDEntryTypesGroup marketDataEntryGroup = new QuickFix.FIX44.MarketDataRequest.NoMDEntryTypesGroup();
marketDataEntryGroup.Set(new MDEntryType(MDEntryType.BID));
QuickFix.FIX44.MarketDataRequest.NoRelatedSymGroup symbolGroup = new QuickFix.FIX44.MarketDataRequest.NoRelatedSymGroup();
symbolGroup.Set(new Symbol("EUR/USD"));
QuickFix.FIX44.MarketDataRequest message = new QuickFix.FIX44.MarketDataRequest(mdReqID, subType, marketDepth);
message.AddGroup(marketDataEntryGroup);
message.AddGroup(symbolGroup);
Here is the generated outbound application level message (ToApp):
8=FIX.4.4|9=158|35=V|34=2|49=order.DEMOSUCD.123321|50=DEMOSUSD|52=20141223-07:02:33.226|56=demo.fxgrid|128=DEMOSUSD|262=MARKETDATAID|263=0|264=0|267=1|269=0|146=1|55=EUR/USD|10=232
Here is the received ToAdmin message:
8=FIX.4.4|9=149|35=3|34=2|49=demo.fxgrid|52=20141223-07:02:36.510|56=order.DEMOSUCD.123321|57=DEMOSUSD|115=DEMOSUSD|45=2|58=Required tag missing|371=265|372=V|373=1|10=136
If I understand correctly, the pair 371=265 (RefTagID=MDUpdateType) after 58=Required tag missing indicates which tag is missing, i.e. MDUpdateType is missing. But this is strange, because this tag is optional for MarketDataRequest.
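Reading the reject this way can be sketched in a few lines of Python. This is only an illustration, not part of QuickFIX/n: parse_fix and describe_reject are made-up helper names, and the sample string is a shortened copy of the reject above with "|" standing in for the non-printing SOH delimiter. The tag names are the standard FIX 4.4 ones (45 RefSeqNum, 58 Text, 371 RefTagID, 372 RefMsgType, 373 SessionRejectReason).

```python
SOH = "\x01"  # the real FIX field delimiter

# Standard FIX 4.4 tags that explain a session-level Reject (35=3).
TAG_NAMES = {
    45: "RefSeqNum",
    58: "Text",
    371: "RefTagID",
    372: "RefMsgType",
    373: "SessionRejectReason",
}


def parse_fix(raw, delimiter=SOH):
    """Split a raw FIX message into an ordered list of (tag, value) pairs."""
    fields = []
    for part in raw.split(delimiter):
        if part:
            tag, _, value = part.partition("=")
            fields.append((int(tag), value))
    return fields


def describe_reject(raw, delimiter=SOH):
    """Pick out the fields that say what was rejected and why."""
    values = dict(parse_fix(raw, delimiter))
    return {name: values.get(tag) for tag, name in TAG_NAMES.items()}


# Shortened copy of the reject from the question, '|' used instead of SOH.
reject = "8=FIX.4.4|35=3|34=2|49=demo.fxgrid|45=2|58=Required tag missing|371=265|372=V|373=1"
info = describe_reject(reject, delimiter="|")
print(info)
```

Here RefMsgType V and RefSeqNum 2 point back at the MarketDataRequest sent with 34=2, and RefTagID 265 names the tag the counterparty expected.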
UPDATE
Here is my FIX config file:
[DEFAULT]
FileStorePath=store
FileLogPath=log
ConnectionType=initiator
ReconnectInterval=60
CheckLatency=N
[SESSION]
BeginString=FIX.4.4
TargetCompID=demo.fxgrid
SenderCompID=XXX
SenderSubID=YYY
StartTime=00:00:00
EndTime=00:00:00
HeartBtInt=30
SocketConnectPort=ZZZ
SocketConnectHost=X.X.X.X
DataDictionary=FIX44.xml
ResetOnLogon=Y
ResetOnLogout=Y
ResetOnDisconnect=Y
Username=AAA
Password=BBB

Sorting and obtaining the top 5 rows from a dataview

I have the below data table
DataTable dtEquityHoldings = new DataTable("EquityHoldings");
dtEquityHoldings.Columns.Add("EquityHoldings");
dtEquityHoldings.Columns.Add("PortfolioWeight(%)");
dtEquityHoldings.Columns.Add("MarketValue", typeof(double));
dtEquityHoldings.Columns.Add("Beta", typeof(double));
dtEquityHoldings.Columns.Add("Beta Adj. Delta", typeof(double));
dtEquityHoldings.Columns.Add("Vega", typeof(double));
dtEquityHoldings.Rows.Add("Santarus Inc", "0.81%", 882380.26, -1.56, 114709.43, 24937.23);
dtEquityHoldings.Rows.Add("Dell Inc", "1.21%", 1318123.60, 1.3, 324757.27, 47923.72);
dtEquityHoldings.Rows.Add("JPMorgan and Chase Co", "2.95%", 3213607.12, 1.12, 258414.50, 38472.28);
dtEquityHoldings.Rows.Add("Qualcomm Inc", "1.38%", 1503314.52, 1, 315608.54, 36938.75);
dtEquityHoldings.Rows.Add("Nokia", "2.45%", 2668927.95, 0.87, -346960.63, 39283.23);
dtEquityHoldings.Rows.Add("Rite Aid Corp", "1.84%", 2004419.36, 0.82, 139526.19, 92374.56);
dtEquityHoldings.Rows.Add("General Electric", "3.80%", 4139561.72, -0.78, 538143.02, 23947.83);
dtEquityHoldings.Rows.Add("Microsoft Corp", "2.06%", 2244078.20, 0.78, 454383.09, 42938.44);
dtEquityHoldings.Rows.Add("Johnson & Johnson", "4.47%", 4869431.81, 0.63, 633026.14, 82374.23);
dtEquityHoldings.Rows.Add("Power Inc.", "3.46%", 3769179.88, 0.13, 493374.82, 12930.02);
I want to sort by the Beta column, take the top 5 rows, and bind them to the grid.
I am using a DataView as follows:
DataView dvData = new DataView(dtEquityHoldings);
dvData.ToTable().AsEnumerable().OrderBy(r => r["Beta"]).Take(5);
dataGridView1.DataSource = dvData;
This is not working. Please help.
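You're using LINQ. LINQ methods don't modify the original collection; they return a new sequence, so you have to use the result. Binding a list of DataRow objects would show DataRow's own properties instead of your columns, so copy the rows into a new table first:
dataGridView1.DataSource = dvData.ToTable().AsEnumerable().OrderBy(r => r["Beta"]).Take(5).CopyToDataTable();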