How to convert geom_point(aes()) + geom_vline(aes()) to Plotly? - r-plotly

I found this tutorial online that helps convert ggplot2's geom_abline() to a Plotly graph: https://plotly.com/ggplot2/geom_abline/
It looks like we can make this conversion simply by calling ggplotly():
library(ggplot2)
library(plotly)
p <- ggplot(data, aes(x=x_val, y=y_val, colour=color_val)) +
  geom_point() +
  geom_vline(aes(xintercept=xintercept_val), colour=color_val)
ggplotly(p)
However, I cannot convert my ggplot2 graph into a Plotly graph with the following code:
# notice that both my x_val and xintercept_val are dates.
# here's my ggplot2 code:
gg <- ggplot(data) +
  geom_point(aes(
    x_val,
    y_val,
    color=color_val,
    shape=shape_val
  )) +
  geom_vline(aes(
    xintercept=xintercept_val,
    color=color_val
  ))
ggplotly(gg)
Here's a screenshot of my ggplot2 graph (I cropped out the legends):
Here's a screenshot of my Plotly graph using ggplotly(gg):
Not sure why the vertical lines aren't showing up in Plotly.

It looks like you have run into a bug in ggplotly (it may be worth raising an issue on GitHub). Internally, ggplotly converts dates to numerics (it does the same with categorical variables). However, inspecting the JSON representation with plotly_json() shows that the xintercept values in geom_vline are not converted, which is why the lines don't show up. As a workaround, you can do the conversion yourself with as.numeric().
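To make the numeric representation concrete: as.numeric() on an R Date gives the number of days since 1970-01-01. The sketch below reproduces that arithmetic in Python (chosen only because it runs standalone; days_since_epoch is a made-up helper name, not part of plotly or ggplot2). It illustrates the kind of value the workaround passes to xintercept and does not show ggplotly's internals.

```python
from datetime import date


def days_since_epoch(d: date) -> int:
    """Days between 1970-01-01 and d (hypothetical helper, for illustration only)."""
    return (d - date(1970, 1, 1)).days


# The value as.numeric(as.Date("2020-01-01")) would give in R
print(days_since_epoch(date(2020, 1, 1)))  # 18262
print(days_since_epoch(date(2020, 2, 1)))  # 18293
```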
As you provided no data, I use a simple example dataset from the plotly website, to which I added some dates. Try this:
dat <- read.table(header=TRUE, text='
cond xval yval
control 11.5 10.8
control 9.3 12.9
control 8.0 9.9
control 11.5 10.1
control 8.6 8.3
control 9.9 9.5
control 8.8 8.7
control 11.7 10.1
control 9.7 9.3
control 9.8 12.0
treatment 10.4 10.6
treatment 12.1 8.6
treatment 11.2 11.0
treatment 10.0 8.8
treatment 12.9 9.5
treatment 9.1 10.0
treatment 13.4 9.6
treatment 11.6 9.8
treatment 11.5 9.8
treatment 12.0 10.6
')
dat$xval <- rep(as.Date(paste0("2020-", 1:10, "-01")), 2)
max_date1 <- dat[dat$cond == "control", "xval"][which.max(dat[dat$cond == "control", "yval"])]
max_date2 <- dat[dat$cond == "treatment", "xval"][which.max(dat[dat$cond == "treatment", "yval"])]
# The basic scatterplot
p <- ggplot(dat, aes(x=xval, y=yval, colour=cond)) +
  geom_point()
# Add colored lines for the date of the max yval of each group
p <- p +
  geom_vline(aes(xintercept=as.numeric(max_date1)), colour="green") +
  geom_vline(aes(xintercept=as.numeric(max_date2)), colour="lightblue")
p
fig <- ggplotly(p)
fig
Gives me this plot:


How to apply best fit distributions in pyspark?

I am currently working on a migration from Python to PySpark, and in one step I find the best-fit distribution using a modified version of the function from "Fitting empirical distribution to theoretical ones with Scipy (Python)?". I apply best_fit_distribution to each group of Ids and save the output in a dictionary. Is there some way to do that in PySpark? I have been researching PySpark statistics and I can't find any library that could help me.
For this project I need to do this part in PySpark, so keeping the original Python is not an option.
import warnings

import numpy as np
import pandas as pd
import scipy.stats as st


def best_fit_distribution(data, bins=200, ax=None):
    # Histogram of the data; convert bin edges to bin centres
    y, x = np.histogram(data, bins=bins, density=True)
    x = (x + np.roll(x, -1))[:-1] / 2.0
    # Distributions to check
    distribution_list = [st.alpha, st.chi2, st.pearson3]  # This is an example
    # Best holders
    best_distribution = st.norm
    best_params = (0.0, 1.0)
    best_sse = np.inf
    for distribution in distribution_list:
        try:
            with warnings.catch_warnings():
                warnings.filterwarnings("ignore")
                params = distribution.fit(data)
                arg = params[:-2]
                loc = params[-2]
                scale = params[-1]
                pdf = distribution.pdf(x, *arg, loc=loc, scale=scale)
                sse = np.sum(np.power(y - pdf, 2.0))
                # if an axis was passed in, add the fitted PDF to the plot
                try:
                    if ax:
                        pd.Series(pdf, x).plot(ax=ax)
                except Exception:
                    pass
                # identify if this distribution is better
                if best_sse > sse > 0:
                    best_distribution = distribution
                    best_params = params
                    best_sse = sse
        except Exception:
            pass
    return (best_distribution.name, best_params)
This is an example and description of my df:
Id   Values
8    59.25
8    25.1
8    39.0333
8    138.3737
8    79.5002
8    52.9
8    0.1674
9    33.8667
9    0.75
9    78.05
9    76.9167
9    14.6667
9    80.3166
9    32.7333
9    0.8333
9    76.95
9    84.4
9    23.1667
9    23.1
9    76.6667

summary   Id      Values
count     34052   1983107
min       8       0.0
max       2558    59646.1712
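The "apply a fit to each group of Ids and keep the results in a dictionary" pattern described above can be sketched in plain Python as follows. This is only an illustration of the per-group shape of the computation, not a PySpark solution: fit_per_group and normal_fit are hypothetical names, and normal_fit is a stand-in that fits a normal distribution with the standard library instead of calling best_fit_distribution. In PySpark 3.0 and later, a grouped pandas function applied with groupBy("Id").applyInPandas(...) appears to be the closest equivalent hook, but check that against your Spark version.

```python
from collections import defaultdict
from statistics import NormalDist


def fit_per_group(rows, fit):
    """Collect values per Id, then call fit(values) once per group."""
    groups = defaultdict(list)
    for group_id, value in rows:
        groups[group_id].append(value)
    return {group_id: fit(values) for group_id, values in groups.items()}


def normal_fit(values):
    """Stand-in for best_fit_distribution: returns (name, (loc, scale))."""
    dist = NormalDist.from_samples(values)
    return ("norm", (dist.mean, dist.stdev))


# A few (Id, Values) rows taken from the example table
rows = [(8, 59.25), (8, 25.1), (8, 39.0333), (9, 33.8667), (9, 0.75), (9, 78.05)]
result = fit_per_group(rows, normal_fit)
for group_id, (name, params) in result.items():
    print(group_id, name, params)
```

Any function that takes the values of one group and returns a (name, params) tuple could be passed in place of normal_fit.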

With gtsummary, is it possible to have N on a separate row to the column name?

gtsummary by default puts the number of observations in a by group beside the label for that group. This increases the width of the table... with many groups or large N, the table would quickly become very wide.
Is it possible to get gtsummary to report N on a separate row beneath the label? E.g.
> data(mtcars)
> mtcars %>%
+ select(mpg, cyl, vs, am) %>%
+ tbl_summary(by = am) %>%
+ as_tibble()
# A tibble: 6 x 3
  `**Characteristic**` `**0**, N = 19`   `**1**, N = 13`
  <chr>                <chr>             <chr>
1 mpg                  17.3 (14.9, 19.2) 22.8 (21.0, 30.4)
2 cyl                  NA                NA
3 4                    3 (16%)           8 (62%)
4 6                    4 (21%)           3 (23%)
5 8                    12 (63%)          2 (15%)
6 vs                   7 (37%)           7 (54%)
would become
# A tibble: 7 x 3
  `**Characteristic**` `**0**`           `**1**`
  <chr>                <chr>             <chr>
1                      N = 19            N = 13
2 mpg                  17.3 (14.9, 19.2) 22.8 (21.0, 30.4)
3 cyl                  NA                NA
4 4                    3 (16%)           8 (62%)
5 6                    4 (21%)           3 (23%)
6 8                    12 (63%)          2 (15%)
7 vs                   7 (37%)           7 (54%)
(I only used as_tibble so that it was easy to show what I mean by editing it manually...)
Any idea?
Thanks!
Here is one way you could do this:
library(tidyverse)
library(gtsummary)
mtcars %>%
  select(mpg, cyl, vs, am) %>%
  # create a new variable to display N in table
  mutate(total = 1) %>%
  # this is just to reorder variables for table
  select(total, everything()) %>%
  tbl_summary(
    by = am,
    # this is to specify you only want N (and no percentage) for new total variable
    statistic = total ~ "N = {N}") %>%
  # this is a gtsummary function that allows you to edit the header
  modify_header(all_stat_cols() ~ "**{level}**")
First, I am making a new variable that is just total observations (called total)
Then, I am customizing the way I want that variable statistic to be displayed
Then I am using gtsummary::modify_header() to remove N from the header
Additionally, if you use the flextable print engine, you can add a line break in the header itself:
mtcars %>%
  select(mpg, cyl, vs, am) %>%
  tbl_summary(by = am) %>%
  # this is a gtsummary function that allows you to edit the header;
  # "\n" puts N on its own line within the header
  modify_header(all_stat_cols() ~ "**{level}**\nN = {n}") %>%
  as_flex_table()
Good luck!
@kittykatstat already posted two fantastic solutions! I'll just add a slight variation :)
If you want to use the {gt} package to print the table and you're outputting to HTML, you can use the HTML tag <br> to add a line break in the header row (very similar to the \n solution already posted).
library(gtsummary)
mtcars %>%
  select(mpg, cyl, vs, am) %>%
  # in mtcars, am is 0 = automatic, 1 = manual
  dplyr::mutate(am = factor(am, labels = c("Automatic", "Manual"))) %>%
  tbl_summary(by = am) %>%
  # this is a gtsummary function that allows you to edit the header
  modify_header(stat_by = "**{level}**<br>N = {N}")

leaflet map not rendering in html document using rmarkdown

Here is my code
---
date: "7 December 2018"
output: html_document
---
## 7 December 2018
```{r, echo=FALSE}
library(leaflet)
library(jsonlite)
citibike <- fromJSON("http://citibikenyc.com/stations/json")
stations <- citibike$stationBeanList
m = leaflet(stations) %>% addTiles(urlTemplate = 'http://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png') %>% addCircles(lat = ~latitude, lng = ~longitude, weight = 5, radius = ~availableBikes, popup = paste("Station:", stations$stationName, "<br>", stations$availableBikes, "available bikes", "<br>", stations$availableDocks, "available docks")) %>% addControl("Available Bikes in NYC on 12/07/2018", position = "topright")
m
```
Using Knit, the html document created only shows the date but not the map. The map is created without any problem when using that code in the console of RStudio.
I have downloaded the latest version of leaflet from github. I use Windows 10.
platform       x86_64-w64-mingw32
arch           x86_64
os             mingw32
system         x86_64, mingw32
status
major          3
minor          5.1
year           2018
month          07
day            02
svn rev        74947
language       R
version.string R version 3.5.1 (2018-07-02)
nickname       Feather Spray

Python GGPlot Syntax to Annotate Chart with Statsmodels Variable?

I have not been able to locate complete Yhat documentation to answer this question, so, using the R version of ggplot as a guide, I've attempted to iteratively back into a solution.
What is the correct syntax for annotating a Python ggplot plot with text in general, and more specifically with a variable from Statsmodels? (Everything works except the last line of the code block below.)
from ggplot import *
ggplot(aes(x='rundiff', y='winpct'), data=mlb_df) +\
geom_point() + geom_text(aes(label='team'),hjust=0, vjust=0, size=10) +\
stat_smooth(method='lm', color='blue') +\
ggtitle('Contenders vs Pretenders') +\
ggannotate('text', x = 4, y = 7, label = 'R^2')
Thanks.
You can use geom_text as a provisional solution
from ggplot import *
import pandas as pd
# DataFrame.from_items was removed in pandas 1.0; the dict constructor builds the same frame
dataText = pd.DataFrame({'x': [4], 'y': [7], 'text': ['R^2']})
ggplot(aes(x='rundiff', y='winpct'), data=mlb_df) +\
geom_point() + geom_text(aes(label='team'),hjust=0, vjust=0, size=10) +\
stat_smooth(method='lm', color='blue') +\
ggtitle('Contenders vs Pretenders') +\
geom_text(aes(x='x', y='y', label='text'), data=dataText)

How to match daily data from monthly using Matlab?

I have monthly macroeconomic data series and I am planning to use them for a weekly (every Monday) regression analysis. How can I match a data point that is released once a month to my date template (about four dates during that month), carrying it forward until the next point is released, and so on?
for u=2:size(daily,1)
    l=find(dailytemplate(u)==monthly)
    %# when the monthly date is not equal to my daily template
    if isempty(l)
        %# I need cleverer code for this part to find the previous release
        dailyclose(u)=dailyclose(u-1)
    else
        dailyclose(u)=monthlyclose(l)
    end
end
UPDATE from comment
I have the following monthly macro data, and I want to use them to fill the weekly dates. For example, on 31/03/2012 the M_input was 2.7, so every weekly date after that should have
W_output=2.7
until 30/04/2012. From then on the weekly W_output will be 2.3, which is the new monthly point, M_input. The following table gives examples of the weekly W_output and the monthly M_input:
Weekly date   W_output   Monthly date   M_input
08/06/2012    1.7        30/06/2012     1.7
01/06/2012    1.7        31/05/2012     1.7
25/05/2012    2.3        30/04/2012     2.3
18/05/2012    2.3        31/03/2012     2.7
11/05/2012    2.3        29/02/2012     2.9
04/05/2012    2.3        31/01/2012     2.9
27/04/2012    2.7        31/12/2011     3
20/04/2012    2.7
format long g
%Create a vector of dates (what I am assuming your date template looks like: March 31 and the next 10 weekly dates that follow it)
datetemplate = [datenum('2012/03/31')];
for i = 1:10
    datetemplate(i + 1) = datetemplate(i) + 7;
end
datetemplate';
%Your macroeconomic inputs and their dates
macrochangedate = [datenum('2012/03/31');datenum('2012/04/30')]
macrochangedate = [macrochangedate [2.7; 2.3]]
%Later releases overwrite earlier ones, so each date ends up with the latest release on or before it
for i = 1:size(macrochangedate,1)
    result(datetemplate >= macrochangedate(i,1)) = macrochangedate(i,2);
end
Results:
result =
2.7
2.7
2.7
2.7
2.7
2.3
2.3
2.3
2.3
2.3
2.3
datestr(datetemplate)
ans =
31-Mar-2012
07-Apr-2012
14-Apr-2012
21-Apr-2012
28-Apr-2012
05-May-2012
12-May-2012
19-May-2012
26-May-2012
02-Jun-2012
09-Jun-2012
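The same "most recent release on or before each date" lookup can also be written as a sorted search rather than a loop over releases. Below is a minimal sketch in Python using only the standard library, with a few release dates and values copied from the table in the question. latest_release is a hypothetical helper name, and the sketch assumes the release dates are sorted in ascending order.

```python
from bisect import bisect_right
from datetime import date


def latest_release(release_dates, release_values, query):
    """Value of the most recent release on or before query, or None if there is none."""
    i = bisect_right(release_dates, query)
    return release_values[i - 1] if i else None


# Monthly releases (must be sorted by date) and some weekly dates from the question
release_dates = [date(2012, 3, 31), date(2012, 4, 30), date(2012, 5, 31)]
release_values = [2.7, 2.3, 1.7]
weekly = [date(2012, 4, 20), date(2012, 4, 27), date(2012, 5, 4), date(2012, 6, 1)]

w_output = [latest_release(release_dates, release_values, d) for d in weekly]
print(w_output)  # [2.7, 2.7, 2.3, 1.7]
```

A weekly date that falls before the first release gets None here, so it is worth deciding explicitly how those dates should be treated.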