I'm having trouble deciphering the documentation for changing tick frequency and date formatting with pandas.
For example:
import numpy as np
import pandas as pd
import pandas.io.data as web
import matplotlib as mpl
%matplotlib inline
mpl.style.use('ggplot')
mpl.rcParams['figure.figsize'] = (8,6)
# grab some price data
px = web.DataReader('AAPL', "yahoo", '2010-12-01')['Adj Close']
px_m = px.asfreq('M', method='ffill')
rets_m = px_m.pct_change()
rets_m.plot(kind='bar')
generates this plot:
Yikes. How can I get the ticks to be every month or quarter or something sensible? And how can the date formatting be changed to get rid of times?
I've tried various things with ax.set_xticks() and ax.xaxis.set_major_formatter but haven't been able to figure it out.
If you use the plot method in pandas, the set_major_locator and set_major_formatter methods of matplotlib are likely to fail. It might just be easier to adjust the ticks manually if you want to stay with pandas' plot methods.
# True if it is the first month of a quarter, False otherwise
xtick_idx = np.hstack((True,
                       np.diff(rets_m.index.quarter) != 0))
# Year-Quarter string for the tick labels.
xtick = ['{0:d} quarter {1:d}'.format(*item)
         for item in zip(rets_m.index.year, rets_m.index.quarter)]
ax = rets_m.plot(kind='bar')
# Only put ticks on the 1st months of each quarter
ax.xaxis.set_ticks(np.arange(len(xtick))[xtick_idx])
# Adjust the ticklabels
ax.xaxis.set_ticklabels(np.array(xtick)[xtick_idx])
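To also drop the time component from the labels, one option is to format the index with strftime and reuse the same positional trick. The following is a minimal sketch, assuming a synthetic monthly series in place of rets_m; the name rets, the made-up values, and the '%Y-%m' label format are illustrative choices, not part of the question.

```python
import numpy as np
import pandas as pd

# Synthetic monthly returns standing in for rets_m (values are made up).
idx = pd.date_range("2011-01-01", periods=12, freq="MS")
rets = pd.Series(np.linspace(-0.05, 0.06, 12), index=idx)

# A bar plot places its bars at integer positions 0..n-1,
# so ticks are selected by position rather than by date.
is_quarter_start = np.hstack(([True], np.diff(rets.index.quarter) != 0))
tick_positions = np.flatnonzero(is_quarter_start)

# strftime controls exactly what is shown, so no time-of-day appears.
tick_labels = rets.index[is_quarter_start].strftime("%Y-%m")

print(tick_positions.tolist())
print(list(tick_labels))
```

The resulting tick_positions and tick_labels could then be passed to ax.xaxis.set_ticks and ax.xaxis.set_ticklabels in the same way as in the snippet above.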
I am trying to draw a bipartite graph for my data set, which is like below:
source       target        weight
reduce       energy        25
reduce       consumption   25
energy       pennsylvania  4
energy       natural       4
consumption  balancing     4
The code I am using to build the graph is below:
C_2021 = nx.Graph()
C_2021.add_nodes_from(df_final_2014['source'], bipartite=0)
C_2021.add_nodes_from(df_final_2014['target'], bipartite=1)
edges = df_final_2014[['source', 'target','weight']].apply(tuple, axis=1)
C_2021.add_weighted_edges_from(edges)
But when I check whether it is bipartite with the code below, it returns False.
nx.is_bipartite(C_2021)
Could you please advise what the issue is?
The previous issue is resolved, but when I try to plot the bipartite graph with the steps below, I do not get a proper result. I would appreciate any help:
top_nodes_2021 = set(n for n,d in C_2021.nodes(data=True) if d['bipartite']==0)
top_nodes_2021
the output of the above is:
{'reduce'}
bottom_nodes_2021 = set(C_2021) - top_nodes_2021
bottom_nodes_2021
the output of the above is:
{'balancing', 'consumption', 'energy', 'natural', 'pennsylvania '}
then plot it by:
pos = nx.bipartite_layout(C_2021,top_nodes_2021)
plt.figure(figsize=[8,6])
# Pass that layout to nx.draw
nx.draw(C_2021,pos,node_color='#A0CBE2',edge_color='black',width=0.2,
edge_cmap=plt.cm.Blues,with_labels=True)
and the result is:
It works for me using your code: nx.is_bipartite(C_2021) returns True. Check the example below:
import sys
if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO
import networkx as nx
import pandas as pd
data = StringIO('''source;target;weight
reduce;energy;25
reduce;consumption;25
energy;pennsylvania ;4
energy;natural;4
consumption;balancing;4
''')
df_final_2014 = pd.read_csv(data, sep=";")
C_2021 = nx.Graph()
C_2021.add_nodes_from(df_final_2014['source'], bipartite=0)
C_2021.add_nodes_from(df_final_2014['target'], bipartite=1)
edges = df_final_2014[['source', 'target','weight']].apply(tuple, axis=1)
C_2021.add_weighted_edges_from(edges)
nx.is_bipartite(C_2021)
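Finally, to draw the graph, get the bipartite sets from the graph itself. The attributes you assigned during creation (i.e. bipartite=0 and bipartite=1) are wrong: 'energy' and 'consumption' appear in both the source and target columns, so the second add_nodes_from call overwrites their bipartite attribute with 1, which is why only 'reduce' ends up in your top set.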
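The idea of recovering the two sets from the edges alone can be sketched with a plain breadth-first two-colouring. This is only an illustration in pure Python, not networkx's implementation; the helper name two_colour is made up for the example, and the edge list is copied from the question (including the trailing space in 'pennsylvania ').

```python
from collections import deque

edges = [
    ("reduce", "energy"),
    ("reduce", "consumption"),
    ("energy", "pennsylvania "),
    ("energy", "natural"),
    ("consumption", "balancing"),
]

# Build an undirected adjacency map from the edge list.
adjacency = {}
for u, v in edges:
    adjacency.setdefault(u, set()).add(v)
    adjacency.setdefault(v, set()).add(u)


def two_colour(adjacency):
    """Breadth-first 2-colouring; raises ValueError if the graph has an odd cycle."""
    colour = {}
    for start in adjacency:
        if start in colour:
            continue
        colour[start] = 0
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in colour:
                    # Neighbours always get the opposite colour.
                    colour[neighbour] = 1 - colour[node]
                    queue.append(neighbour)
                elif colour[neighbour] == colour[node]:
                    raise ValueError("graph is not bipartite")
    return colour


colour = two_colour(adjacency)
top = {n for n, c in colour.items() if c == 0}
bottom = {n for n, c in colour.items() if c == 1}
print(sorted(top))
print(sorted(bottom))
```

Here 'energy' and 'consumption' land on one side and every other node on the other, a split that differs from the one implied by tagging nodes by column.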
Use the following commands:
import matplotlib.pyplot as plt
from networkx.algorithms import bipartite
top_nodes_2021, bottom_nodes_2021 = bipartite.sets(C_2021)
pos = nx.bipartite_layout(C_2021, top_nodes_2021)
plt.figure(figsize=[8,6])
# Pass that layout to nx.draw
nx.draw(C_2021,pos,node_color='#A0CBE2',edge_color='black',width=0.2,
edge_cmap=plt.cm.Blues,with_labels=True)
With the following result:
I have this code:
import pantab
import pandas as pd
import datetime
df = pd.DataFrame([
[datetime.date(2018,2,20), 4],
[datetime.date(2018,2,20), 4],
], columns=["date", "num_of_legs"])
pantab.frame_to_hyper(df, "example.hyper", table="animals")
which causes this error:
TypeError: Invalid value "datetime.date(2018, 2, 20)" found (row 0 column 0)
Is there a fix?
This has apparently been a problem since 2020. You have to know which columns are datetime columns, because pandas stores both dates and strings with the object dtype. There's a world where data scientists don't care about dates, but not mine, apparently. Anyway, here's the solution, awaiting the day when pandas adds a date dtype:
https://github.com/innobi/pantab/issues/100
And just to reduce it down to what I did:
def createHyper(xlsx, filename, tablename):
    # Convert every column whose name contains 'date' to a proper datetime dtype
    for name in xlsx.columns:
        if 'date' in name.lower():
            xlsx[name] = pd.to_datetime(xlsx[name], errors='coerce', utc=True)
    # Write the converted frame to the .hyper file and hand back an open handle
    pantab.frame_to_hyper(xlsx, filename, table=tablename)
    return open(filename, 'rb')
errors='coerce' makes it so that dates like 1/1/3000, which fall outside the range pandas can represent by default, become NaT instead of raising an error, which is handy in a world of SCD2 tables.
utc=True was necessary for me because my dates were timezone-sensitive; yours may not be.
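As a small illustration of the conversion step on its own, the sketch below shows a column of datetime.date objects starting out with the object dtype and becoming a timezone-aware datetime column after pd.to_datetime. The second row ("not a date") is made-up data added only to show what errors='coerce' does; pantab itself is not called here.

```python
import datetime

import pandas as pd

df = pd.DataFrame(
    {
        # The second value is deliberately invalid (illustrative data only).
        "date": [datetime.date(2018, 2, 20), "not a date"],
        "num_of_legs": [4, 4],
    }
)

# datetime.date objects and strings are stored with the generic object dtype.
original_dtype = df["date"].dtype
print(original_dtype)

# errors="coerce" replaces values that cannot be converted with NaT;
# utc=True makes the resulting column timezone-aware (UTC).
df["date"] = pd.to_datetime(df["date"], errors="coerce", utc=True)
print(df["date"].dtype)
print(df["date"].isna().tolist())
```

After this step the column has a datetime dtype rather than object, which is the form the workaround above hands to pantab.frame_to_hyper.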
LS,
None of the answers on this topic helped me solve my update problem. I think it has to do with dfs = df.sort_values(by=['E']). I use the latest versions of all the libraries. The examples on the Bokeh website work fine with my configuration. Via an update button, I want to let the user select the preferred sort order. The two other sort buttons will be added once this part works.
Here is my code:
import pandas as pd
from bokeh.plotting import figure, curdoc
from bokeh.models import ColumnDataSource
from bokeh.layouts import gridplot
from bokeh.models import Button
df = pd.DataFrame(dict(A=["AMS", "LHR", "FRA", "PTY", "CGD"], S=[7,-5,-3,3,2], E=[8,3,-2,5,8], C=[5,2,7,-3,-4]))
source = ColumnDataSource(df)
options = dict(plot_width=300, plot_height=200,
tools="pan,wheel_zoom,box_zoom,box_select,lasso_select")
button = Button(label="Change Sort to E")
p1 = figure(y_range=source.data['A'].tolist(), title="S", **options)
p1.hbar(y='A', right="S", height=0.2, source=source)
p2 = figure(y_range=source.data['A'].tolist(), title="E", **options)
p2.hbar(y="A", right="E", height=0.2, source=source)
p3 = figure(y_range=source.data['A'].tolist(), title="C", **options)
p3.hbar(y="A", right="C", height=0.2, source=source)
def update():
    dfs = df.sort_values(by=['E'])
    source.data = ColumnDataSource.from_df(dfs)
button.on_click(update)
p = gridplot([[button], [p1, p2, p3]], toolbar_location="right")
curdoc().add_root(p)
I run the server via: bokeh serve --show app.py --port 5009
Thank you very much for making the update work.
If you want to change the order of things on a categorical axis, you have to update the range. The order of the factors on the axis is specified exactly by the order of factors you configure for the range, so you will need to re-order the factors to match the sort order you want. So, something like:
p1.y_range.factors = new_sorted_factors
See
https://docs.bokeh.org/en/latest/docs/user_guide/categorical.html#sorted
for a complete (standalone) example.
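A minimal sketch of how new_sorted_factors might be derived from the question's DataFrame is shown below. The helper name sorted_factors is hypothetical, and a stable sort is chosen here as an assumption so that tied values (AMS and CGD both have E = 8) keep their original relative order. Only pandas is used; the assignment to the Bokeh ranges is left as a comment.

```python
import pandas as pd

df = pd.DataFrame(dict(A=["AMS", "LHR", "FRA", "PTY", "CGD"],
                       S=[7, -5, -3, 3, 2],
                       E=[8, 3, -2, 5, 8],
                       C=[5, 2, 7, -3, -4]))


def sorted_factors(frame, by):
    """Return the labels in column A ordered by the values in column `by`."""
    # mergesort is stable, so ties keep their original order.
    return frame.sort_values(by=by, kind="mergesort")["A"].tolist()


new_sorted_factors = sorted_factors(df, "E")
print(new_sorted_factors)

# Inside the button callback of the Bokeh app, the list would then be
# assigned to the range of each figure, for example:
#     for p in (p1, p2, p3):
#         p.y_range.factors = new_sorted_factors
```

Because the three figures in the question each build their own y_range, each range would need the new factor list for all three plots to follow the chosen order.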
I noticed that the skewness returned by scipy.stats does not look correct; pandas' skew() actually gives better results.
I have recently been trying to replicate a classic paper, Expected Stock Returns and Volatility by French & Schwert. I use S&P 500 data from 1928 to 1984. I follow the formula in the paper for the standard deviation of the return, and I am able to get the same results for the mean and the std dev of the std dev.
However, when I use the scipy.stats.skew function, I can't get a value for the std dev of the S&P return. The function returns "nan", where clearly it should return a value.
When I switched to pandas' skew(), it returned the correct value, as in the paper.
Clearly, something is wrong with the scipy.stats.skew() function.
scipy.stats.skew()
pandas.skew()
Results by Scipy.stats.skew()
['Adj Close_gspc', 'Adj Close_gspc_lag', 'SP_Return', 'SP_Return_square',
'SP_Return_lag', 'SP_varianceMon', 'SP_varianceMon_sqrRoot']
array([ 0.6922229 , 0.69186265, -0.11292165, 4.23571807, -1.9556035 ,
5.39873607, nan])
results by pandas:
Adj Close_gspc 0.693745
Adj Close_gspc_lag 0.693384
SP_Return -0.113170
SP_Return_square 4.245033
SP_Return_lag -1.959904
SP_varianceMon 5.410609
SP_varianceMon_sqrRoot 2.800919
dtype: float64
You haven't provided enough information or sample code to reproduce the nan that you get.
To make scipy.stats.skew compute the same value as the skew() method in Pandas, add the argument bias=False.
Here's an example.
First, the imports:
In [21]: import numpy as np
In [22]: import pandas as pd
In [23]: from scipy.stats import skew
Generate some data:
In [24]: np.random.seed(8675309)
In [25]: x = np.random.weibull(0.2, size=15)
Compute the skew with scipy and with Pandas:
In [26]: skew(x, bias=False)
Out[26]: 3.7582525674514544
In [27]: pd.Series(x).skew()
Out[27]: 3.7582525674514544
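To make the relationship between the two estimators concrete, here is a sketch that writes both formulas out with NumPy: the biased skewness g1 = m3 / m2^1.5 (what scipy.stats.skew computes by default) and the adjusted version G1 = g1 * sqrt(n(n-1)) / (n-2) (what bias=False and pandas compute). The function names and sample values are made up for illustration, and this is not scipy's own code. The last lines show one possible source of a nan, offered only as an assumption since the question's data is not shown: a NaN in the input propagates through the plain formula, while pandas skips NaN by default.

```python
import numpy as np
import pandas as pd


def biased_skew(values):
    """Biased skewness g1 = m3 / m2**1.5, with no NaN handling."""
    values = np.asarray(values, dtype=float)
    deviations = values - values.mean()
    m2 = np.mean(deviations ** 2)
    m3 = np.mean(deviations ** 3)
    return m3 / m2 ** 1.5


def adjusted_skew(values):
    """Adjusted skewness G1 = g1 * sqrt(n * (n - 1)) / (n - 2)."""
    n = len(values)
    return biased_skew(values) * np.sqrt(n * (n - 1)) / (n - 2)


x = np.array([1.0, 2.0, 2.5, 4.0, 10.0])
print(biased_skew(x), adjusted_skew(x), pd.Series(x).skew())

# A single NaN makes the plain formula return nan ...
x_with_nan = np.append(x, np.nan)
print(biased_skew(x_with_nan))
# ... whereas pandas drops NaN values before computing the skew.
print(pd.Series(x_with_nan).skew())
```

If the column that returns nan contains missing values, passing nan_policy='omit' to scipy.stats.skew, or dropping the NaN values first, would be worth trying.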
If you simply create a FixedBody, give it a set of coordinates, and then ask for them back, you get a different position:
>>> import ephem
>>> TestStar = ephem.FixedBody()
>>> TestStar._ra, TestStar._dec = '12:43:20', '-45:34:12'
>>> TestStar.compute()
>>> print TestStar.ra, TestStar.dec
12:44:15.34 -45:39:46.8
I now understand that this is because, as documented, the FixedBody is by default at the J2000 epoch, however the default observer's epoch is the moment that said observer is created, and it seems that that is the default for when you don't specify an observer.
However if I try to compensate for that:
>>> TestStar4 = ephem.FixedBody()
>>> TestStar4._ra, TestStar4._dec, TestStar4._epoch = '12:43:20', '-45:34:12', '2000/01/01 12:00:00'
>>> TestSite2 = ephem.Observer()
>>> TestSite2.lat, TestSite2.lon, TestSite2.date = 0,0,'2000/01/01 12:00:00'
>>> TestStar4.compute(TestSite2)
>>> print TestStar4.ra, TestStar4.dec
12:43:19.42 -45:33:51.9
You get an almost identical RA, but a DEC that is different by 20 arcseconds for this example.
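The size of that offset can be checked directly from the two positions quoted above. The sketch below uses only the standard library; the helper name sexagesimal is made up for the example, and the RA difference is scaled by cos(dec) to turn it into an on-sky angle.

```python
import math


def sexagesimal(text):
    """Parse 'a:b:c' into a float in units of the leading field (hours or degrees)."""
    text = text.strip()
    sign = -1.0 if text.startswith("-") else 1.0
    first, minutes, seconds = (abs(float(part)) for part in text.split(":"))
    return sign * (first + minutes / 60.0 + seconds / 3600.0)


# Input coordinates and the values printed after compute(), from the session above.
ra_in, dec_in = sexagesimal("12:43:20"), sexagesimal("-45:34:12")
ra_out, dec_out = sexagesimal("12:43:19.42"), sexagesimal("-45:33:51.9")

# Declination difference in arcseconds.
d_dec_arcsec = (dec_out - dec_in) * 3600.0
# RA difference: hours -> degrees (x15) -> arcseconds, scaled by cos(dec).
d_ra_arcsec = (ra_out - ra_in) * 15.0 * 3600.0 * math.cos(math.radians(dec_in))
total_arcsec = math.hypot(d_ra_arcsec, d_dec_arcsec)

print(round(d_ra_arcsec, 1), round(d_dec_arcsec, 1), round(total_arcsec, 1))
```

For scale, the constant of annual aberration is about 20.5 arcseconds, so the offset is of the size that apparent-place corrections produce; whether aberration, nutation, or something else dominates in this case is not established here.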
I'm specifically trying to get the J2000 coordinates of some stars in a WEBDA catalog which provide relative coordinates for most stars.
For example see this random cluster:
http://www.univie.ac.at/webda/cgi-bin/frame_list.cgi?ic0166
The "Coordinates J2000" only has information on 9 stars and almost all stars have information in the "XY positions" link. The center and scale of these XY positions is a bit arbitrary but can be found in the site.
However, if I don't know why that 20 arcsecond difference in coordinates is there, I don't know when my system will fail.
Ok at this point I imagine the discrepancy is due to some correction factor.
I know now I want to use the Astrometric Geocentric Position, so:
>>> import ephem
>>> TestStar = ephem.FixedBody()
>>> TestStar._ra, TestStar._dec = '12:43:20', '-45:34:12'
>>> TestStar.compute()
>>> print TestStar.a_ra, TestStar.a_dec
12:43:20 -45:34:12
Simple enough (just hadn't understood that part of the manual, sorry).
I'm still curious as to which of the corrections affects this the most, but I can carry on without knowing for now.