If you simply create a FixedBody, give it a set of coordinates, and then ask for them back, you get a different position:
>>> import ephem
>>> TestStar = ephem.FixedBody()
>>> TestStar._ra, TestStar._dec = '12:43:20', '-45:34:12'
>>> TestStar.compute()
>>> print TestStar.ra, TestStar.dec
12:44:15.34 -45:39:46.8
I now understand that this is because, as documented, a FixedBody's coordinates are in the J2000 epoch by default, whereas an observer's default epoch is the moment that observer is created; the same epoch-of-now default apparently applies when you don't specify an observer at all.
However if I try to compensate for that:
>>> TestStar4 = ephem.FixedBody()
>>> TestStar4._ra, TestStar4._dec, TestStar4._epoch = '12:43:20', '-45:34:12', '2000/01/01 12:00:00'
>>> TestSite2 = ephem.Observer()
>>> TestSite2.lat, TestSite2.lon, TestSite2.date = 0,0,'2000/01/01 12:00:00'
>>> TestStar4.compute(TestSite2)
>>> print TestStar4.ra, TestStar4.dec
12:43:19.42 -45:33:51.9
You get an almost identical RA, but a Dec that differs by about 20 arcseconds in this example.
I'm specifically trying to get the J2000 coordinates of some stars in a WEBDA catalog, which provides relative coordinates for most stars.
For example see this random cluster:
http://www.univie.ac.at/webda/cgi-bin/frame_list.cgi?ic0166
The "Coordinates J2000" only has information on 9 stars and almost all stars have information in the "XY positions" link. The center and scale of these XY positions is a bit arbitrary but can be found in the site.
However if don't know why that 20 arcsecond difference in coordinates is there, I don't know when my system will fail.
OK, at this point I imagine the discrepancy is due to some correction factor.
I now know I want to use the astrometric geocentric position, so:
>>> import ephem
>>> TestStar = ephem.FixedBody()
>>> TestStar._ra, TestStar._dec = '12:43:20', '-45:34:12'
>>> TestStar.compute()
>>> print TestStar.a_ra, TestStar.a_dec
12:43:20 -45:34:12
Simple enough (just hadn't understood that part of the manual, sorry).
I'm still curious which of the corrections contributes most to the difference, but I can carry on without knowing for now.
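For anyone comparing the two sets of attributes, here is a minimal sketch (Python 3 syntax, not from the original post) of what I mean: a_ra/a_dec should reproduce the catalog J2000 values for any observer date as long as the observer's epoch is left at its J2000 default, while ra/dec are the apparent, equinox-of-date values that also include corrections such as nutation and aberration.
import ephem
star = ephem.FixedBody()
star._ra, star._dec, star._epoch = '12:43:20', '-45:34:12', ephem.J2000
obs = ephem.Observer()
obs.lat, obs.lon = 0, 0
obs.date = '2020/01/01 00:00:00'   # any date; a_ra/a_dec should stay put
obs.epoch = ephem.J2000            # epoch of the astrometric output (the default anyway)
star.compute(obs)
print(star.a_ra, star.a_dec)   # astrometric geocentric, J2000 -> matches the input
print(star.ra, star.dec)       # apparent topocentric, equinox of date -> differs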
I really like the heatmap, but what I need are the numbers behind the heatmap (AKA correlation matrix).
Is there an easy way to extract the numbers?
It was a bit hard to track down, but starting from the documentation, specifically the report structure, and then digging into the function get_correlation_items(summary) and looking at its usage in the source, we get to this call, which essentially loops over each of the correlation types in the summary. To obtain the summary object, we look up the caller and find that it is get_report_structure(summary); tracing how the summary argument is produced shows that it is simply the description_set property, as shown here.
Given the above, we can now do the following using version 2.9.0:
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport
df = pd.DataFrame(
    np.random.rand(100, 5),
    columns=["a", "b", "c", "d", "e"]
)
profile = ProfileReport(df, title="StackOverflow", explorative=True)
correlations = profile.description_set["correlations"]
print(correlations.keys())
dict_keys(['pearson', 'spearman', 'kendall', 'phi_k'])
To see a specific correlation do:
correlations["phi_k"]["e"]
a 0.000000
b 0.112446
c 0.289983
d 0.000000
e 1.000000
Name: e, dtype: float64
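If you need a whole matrix as plain numbers rather than a single column, note that each entry of the correlations dict appears to be an ordinary pandas DataFrame (at least in version 2.9.0 used above), so the usual pandas/NumPy conversions apply. A small sketch (the file name is just an example):
phi_k_df = correlations["phi_k"]           # full phi_k matrix as a DataFrame
phi_k_arr = phi_k_df.to_numpy()            # same matrix as a NumPy array
phi_k_df.to_csv("phi_k_correlations.csv")  # or persist it for later use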
I'm following the tutorial here for implementing a change point kernel in gpflow.
However, I have 3 inputs and 1 output, and I would like the changepoint kernel to act on the first input dimension only, with standard kernels on the other two input dimensions. I'm getting the following error:
InvalidArgumentError: Incompatible shapes: [2000,3,1] vs. [3,2000,1] [Op:Mul] name: mul/
Below is a minimum working example. Could anyone please let me know where I'm going wrong?
gpflow version 2.0.0.rc1
import pandas as pd
import gpflow
from gpflow.utilities import print_summary

df_all = pd.read_csv(
    'https://raw.githubusercontent.com/ipan11/gp/master/dataset.csv')

# Training dataset in numpy format
X = df_all[['X1', 'X2', 'X3']].to_numpy()
Y1 = df_all['Y'].to_numpy().reshape(-1, 1)

# Changepoint kernel only on first dimension and standard kernels for the other two dimensions
base_k1 = gpflow.kernels.Matern32(lengthscale=0.2, active_dims=[0])
base_k2 = gpflow.kernels.Matern32(lengthscale=2., active_dims=[0])
k1 = gpflow.kernels.ChangePoints(
    [base_k1, base_k2], [.4], steepness=5)
k2 = gpflow.kernels.Matern52(lengthscale=[1., 1.], active_dims=[1, 2])
k_all = k1 + k2
print_summary(k_all)

m1 = gpflow.models.GPR(data=(X, Y1), kernel=k_all, mean_function=None)
print_summary(m1)

opt = gpflow.optimizers.Scipy()

def objective_closure():
    return -m1.log_marginal_likelihood()

opt_logs = opt.minimize(objective_closure, m1.trainable_variables,
                        options=dict(maxiter=100))
The correct answer would be to move the active_dims=[0] from the base_k* kernels to the ChangePoints() kernel,
k1 = gpflow.kernels.ChangePoints([base_k1, base_k2], [0.4], steepness=5, active_dims=[0])
but this is currently not supported in GPflow 2, which is a bug. I've opened an issue on github, and will update this answer once it's fixed (if you feel up to having a go at fixing this bug, feel free to open a pull request, help always welcome!).
Dealing with low amounts of data and with overfitting when using k-fold cross-validation [GridSearchCV]
I am completely stumped as to how to get better estimates from my model. It seems that when I run my code, I get negative accuracies. How can I improve the cross_val_score (or the testing scores, or whatever you want to call them) so that I can predict values more reliably?
I tried adding more data (from 50 to 200+).
I tried random parameters (and realized this was a naive approach).
I also tried cleaning my data with StandardScaler on the features.
Anyone have any suggestions?
from sklearn.neural_network import MLPRegressor
from sklearn import preprocessing
import requests
import json
from calendar import monthrange
import numpy as np
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.preprocessing import scale

r = requests.get('https://www.alphavantage.co/query?function=TIME_SERIES_WEEKLY_ADJUSTED&symbol=W&apikey=QYQ2D6URDOKNUGF4')
#print(r.text)
y = json.loads(r.text)

#print(y["Monthly Adjusted Time Series"].keys())
keysInResultSet = y["Weekly Adjusted Time Series"].keys()
#print(keysInResultSet)

featuresListTemp = []
labelsListTemp = []
count = 0
for i in keysInResultSet:
    #print(i)
    count = count + 1
    #print(y["Monthly Adjusted Time Series"][i])
    tmpList = []
    tmpList.append(count)
    featuresListTemp.append(tmpList)
    strValue = y["Weekly Adjusted Time Series"][i]["5. adjusted close"]
    numValue = float(strValue)
    labelsListTemp.append(numValue)

print("TOTAL SET")
print(featuresListTemp)
print(labelsListTemp)
print("---")

arrTestInput = []
arrTestOutput = []

print("SCALING SET")
X_train = np.array(featuresListTemp)
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
print(X_train_scaled)

product_model = MLPRegressor()
#10.0 ** -np.arange(1, 10)
#todo : once found general settings, iterate through some more seeds to find one that can be used on the training
parameters = {'learning_rate': ['constant', 'adaptive'],
              'solver': ['lbfgs', 'adam'],
              'tol': 10.0 ** -np.arange(1, 4),
              'verbose': [True],
              'early_stopping': [True],
              'activation': ['tanh', 'logistic'],
              'learning_rate_init': 10.0 ** -np.arange(1, 4),
              'max_iter': [4000],
              'alpha': 10.0 ** -np.arange(1, 4),
              'hidden_layer_sizes': np.arange(1, 11),
              'random_state': np.arange(1, 3)}

clf = GridSearchCV(product_model, parameters, n_jobs=-1)
clf.fit(X_train_scaled, labelsListTemp)
print(clf.score(X_train_scaled, labelsListTemp))
print(clf.best_params_)

best_params = clf.best_params_

newPM = MLPRegressor(hidden_layer_sizes=(best_params['hidden_layer_sizes']), #try reducing the layer size / increasing it and playing around with resultFit variable
                     batch_size='auto',
                     power_t=0.5,
                     activation=best_params['activation'],
                     solver=best_params['solver'], #non scaled input
                     learning_rate=best_params['learning_rate'],
                     max_iter=best_params['max_iter'],
                     learning_rate_init=best_params['learning_rate_init'],
                     alpha=best_params['alpha'],
                     random_state=best_params['random_state'],
                     early_stopping=best_params['early_stopping'],
                     tol=best_params['tol'])

scores = cross_val_score(newPM, X_train_scaled, labelsListTemp, cv=10, scoring='neg_mean_absolute_error')
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
print(scores)
Output from the grid search onward:
0.9142644531564619
{'activation': 'logistic', 'alpha': 0.001, 'early_stopping': True, 'hidden_layer_sizes': 7, 'learning_rate': 'constant', 'learning_rate_init': 0.1, 'max_iter': 4000, 'random_state': 2, 'solver': 'lbfgs', 'tol': 0.01, 'verbose': True}
Accuracy: -21.91 (+/- 58.89)
[ -32.87854574 -105.0632913   -22.89836453   -7.33154414  -22.38773819
   -3.3786339    -1.7658796    -3.78002866   -4.78734308  -14.81212738]
{'activation': 'logistic', 'alpha': 0.01, 'early_stopping': True, 'hidden_layer_sizes': 30, 'learning_rate': 'constant', 'learning_rate_init': 0.1, 'max_iter': 4000, 'random_state': 2, 'solver': 'lbfgs', 'tol': 0.1, 'verbose': True}
{'activation': 'tanh', 'alpha': 0.01, 'early_stopping': True, 'hidden_layer_sizes': 99, 'learning_rate': 'constant', 'learning_rate_init': 0.1, 'max_iter': 4000, 'random_state': 1, 'solver': 'lbfgs', 'tol': 0.01, 'verbose': True}
Both configurations stated above will work for the sample set. Thanks all; please let me know if there are any questions.
This can be solved by scaling down your other parameters, e.g. instead of 10.0 ** -np.arange(1, 3) use 10.0 ** -np.arange(1, 2), giving a more limited set. Start removing parameters that you know are correct (very hard to do, but one candidate is learning_rate='constant', since I noticed that all my best fits had a constant learning rate regardless of the other parameters).
This is mostly a time optimization, but it will also help with overfitting as you increase the number of nodes in the network. The idea is that you want to increase the fit by some degree without losing too much of the generalization properties of the true function once you perform your first grid search.
You should start your grid search by making sure that the number of hidden nodes is somewhere between the number of input nodes and the number of output nodes.
Once you find a decent fit, you can improve it by increasing the number of nodes, taking care not to add so many that you lose the generalization power of the true function. Before you even think about scaling up, start reducing the complexity of the parameters, so that your second grid search runs on an increased number of nodes with more general parameters.
In other words, the second grid search carries over the more general parameters identified in the initial search while increasing the number of network nodes.
I know this is confusing, but it's what helped me fit this decently.
For anyone struggling, I would try to:
0) generalize after performing a search and getting a decent model;
1) use that generalization in a second search with an increased number of nodes (a sketch of such a narrowed second search is shown after this list);
2) play with the alpha parameter while scaling up (the rest of the parameters can be generalized);
3) add a few different seeds or remove them, depending on the situation;
4) remember that while changing tol alters the fit, it is also highly dependent on the number of iterations; depending on the case, a reasonable value might be .01 or .001 (reasonable meaning however many iterations you are willing to wait for a result / a chance to converge). If tol is set too low, you will run out of iterations because no run ever gets a chance to stop early.
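To make that concrete, here is a sketch of what a narrowed second-pass grid search could look like. It is not from my original run; the fixed values are only assumptions carried over from the best parameters reported above, and it reuses X_train_scaled and labelsListTemp from the earlier code.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import GridSearchCV
# Hypothetical second pass: the "general" parameters from the first search
# (solver, activation, learning rate, tol, ...) are fixed on the estimator,
# and only the network size, alpha and the seed are still swept.
base_model = MLPRegressor(solver='lbfgs',
                          activation='logistic',
                          learning_rate='constant',
                          learning_rate_init=0.1,
                          max_iter=4000,
                          early_stopping=True,
                          tol=0.01)
second_pass = {
    'hidden_layer_sizes': np.arange(5, 40, 5),  # scale the network up gradually
    'alpha': 10.0 ** -np.arange(1, 4),          # keep tuning regularization
    'random_state': np.arange(1, 5),            # try a few seeds
}
clf2 = GridSearchCV(base_model, second_pass,
                    scoring='neg_mean_absolute_error', cv=10, n_jobs=-1)
# clf2.fit(X_train_scaled, labelsListTemp)
# print(clf2.best_params_, clf2.best_score_)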
I would like to use the astropy package to compute the time of equinoxes and solstices. I have worked before with the pyephem package, and it provides easy functions exactly for this: one can, for example, say
>>> print(ephem.next_equinox(ephem.now()))
2019/9/23 07:50:14
and get the time of the next equinox. However, there are no such functions in astropy, so I thought I might try to compute the times by the definition: the vernal equinox is the moment when the ecliptic longitude of the Sun is zero; the summer solstice is the moment when the ecliptic longitude of the Sun is 90° etc.
So it seems that getting the ecliptic longitude of the Sun would be the essential step, and then I could somehow solve that function for time:
def sunEclipticLongitude(t):
    sun = astropy.coordinates.get_body('sun', t)
    eclipticOfDate = astropy.coordinates.GeocentricTrueEcliptic(equinox=t)
    sunEcliptic = sun.transform_to(eclipticOfDate)
    return sunEcliptic.lon.deg
My first thought was to use something from scipy.optimize to solve this function for time, but at this point, I got stuck. The Sun's longitude is an angle, so there are obviously many solutions for lon=0 (this year's equinox, next year's equinox ...) How do I find the next time (from a particular origin, for example now) when the Sun's longitude is zero? How do I find the previous time when it was zero? Also, the vernal equinox seems to be a particularly nasty case for solving, since the function has a discontinuity at that point – it jumps from 360 to 0. How to handle that?
To answer your first question, the optimization strategy will still work for periodic functions, but the solution will depend on your optimization starting point. Just pick a point that's already pretty close. You know that equinoxes are Mar 21 and Sep 20 (or something like that), and solstices are June 21 and Dec 21. Pick an arbitrary time of day on those dates, and you'll find the right solution.
As for the discontinuity in the angle, it isn't really there... It's just that in this particular convention you elect to represent angles as numbers between 0 and 360 degrees. But mathematically, an angle of 361 degrees means exactly the same thing as 1 degree and has just as much right to exist.
As a result, continuous periodic functions like sin(), cos() etc. will not show any discontinuity at that value of the angle (or any other value of the angle that you pick as minimum or maximum).
As you expected, the syzygies can be located using scipy.optimize via a root-finding function, as in the code snippet below. But rather than looking for the next syzygy, it looks for the nearest one, by picking points 44 days before and after the given date and expecting that there will be one syzygy in between. Since Earth doesn't have seasons shorter than 88 days within plus or minus 50,000 years, that's hopefully a pretty safe bet, and the astropy ephemerides aren't accurate over that span anyway. I employed sin(angle*2) in the (poorly named) linearize(angle) function below to convert ecliptic longitudes to a function that crosses zero at each quadrant.
The code is also in a gist, which might get refined: see find syzygy / equinox / solstice with astropy
# Find the nearest syzygy (solstice or equinox) to a given date.
# Solution for https://stackoverflow.com/questions/55838712/finding-equinox-and-solstice-times-with-astropy
# TODO: need to ensure we're using the right sun position functions for the syzygy definition....
import math
from astropy.time import Time, TimeDelta
import astropy.coordinates
from scipy.optimize import brentq
from astropy import units as u
# We'll usually find a zero crossing if we look this many days before and after
# given time, except when it is within a few days of a cross-quarter day.
# But going over 50,000 years back, season lengths can vary from 85 to 98 days!
# https://individual.utoronto.ca/kalendis/seasons.htm#seasons
delta = 44.0
def mjd_to_time(mjd):
    "Return a Time object corresponding to the given Modified Julian Date."
    return Time(mjd, format='mjd', scale='utc')

def sunEclipticLongitude(mjd):
    "Return ecliptic longitude of the sun in degrees at given time (MJD)"
    t = mjd_to_time(mjd)
    sun = astropy.coordinates.get_body('sun', t)
    # TODO: Are these the right functions to call for the definition of syzygy? Geocentric? True?
    eclipticOfDate = astropy.coordinates.GeocentricTrueEcliptic(equinox=t)
    sunEcliptic = sun.transform_to(eclipticOfDate)
    return sunEcliptic.lon.deg

def linearize(angle):
    """Map angle values in degrees near the quadrants of the circle
    into smooth functions crossing zero, for root-finding algorithms.

    Note that for angles near 90 or 270, increasing angles yield decreasing results
    >>> linearize(5) > 0 > linearize(355)
    True
    >>> linearize(95) > 0 > linearize(85)
    False
    """
    return math.sin(math.radians(angle * 2))

def map_syzygy(t):
    "Map times into linear functions crossing zero at each syzygy"
    return linearize(sunEclipticLongitude(t))

def find_nearest_syzygy(t):
    """Return the precise Time of the nearest syzygy to the given Time,
    which must be within 43 days of one syzygy.
    """
    syzygy_mjd = brentq(map_syzygy, t.mjd - delta, t.mjd + delta)
    syzygy = mjd_to_time(syzygy_mjd)
    syzygy.format = 'isot'
    return syzygy

if __name__ == '__main__':
    import doctest
    doctest.testmod()

    t0 = Time('2019-09-23T07:50:10', format='isot', scale='utc')
    td = TimeDelta(1.0 * u.day)
    seq = t0 + td * range(0, 365, 15)
    for t in seq:
        try:
            syzygy = find_nearest_syzygy(t)
        except ValueError as e:
            print(f'{e=}, {t.value=}, {t.mjd-delta=}, {map_syzygy(t.mjd-delta)=}')
            continue
        print(f'{t.value=}, {syzygy.value=}, {sunEclipticLongitude(syzygy)=}')
I'm having trouble deciphering the documentation for changing tick frequency and date formatting with pandas.
For example:
import numpy as np
import pandas as pd
import pandas.io.data as web
import matplotlib as mpl
%matplotlib inline
mpl.style.use('ggplot')
mpl.rcParams['figure.figsize'] = (8,6)
# grab some price data
px = web.DataReader('AAPL', "yahoo", '2010-12-01')['Adj Close']
px_m = px.asfreq('M', method='ffill')
rets_m = px_m.pct_change()
rets_m.plot(kind='bar')
generates this plot:
Yikes. How can I get the ticks to be every month or quarter or something sensible? And how can the date formatting be changed to get rid of times?
I've tried various things with ax.set_xticks() and ax.xaxis.set_major_formatter but haven't been able to figure it out.
If you use the plot method in pandas, the set_major_locator and set_major_formatter methods of matplotlib are likely to fail. It might just be easier to adjust the ticks manually if you want to stay with pandas' plot methods.
# True if it is the first month of a quarter, False otherwise
xtick_idx = np.hstack((True,
                       np.diff(rets_m.index.quarter) != 0))
# Year-Quarter string for the tick labels.
xtick = ['{0:d} quarter {1:d}'.format(*item)
         for item in zip(rets_m.index.year, rets_m.index.quarter)]

ax = rets_m.plot(kind='bar')
# Only put ticks on the 1st months of each quarter
ax.xaxis.set_ticks(np.arange(len(xtick))[xtick_idx])
# Adjust the ticklabels
ax.xaxis.set_ticklabels(np.array(xtick)[xtick_idx])
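If you are not tied to the pandas plot method, another option (just a sketch, not part of the original answer, and assuming rets_m from the question is a Series with a DatetimeIndex) is to draw the bars with matplotlib directly against real datetime values, so that the usual date locators and formatters do work:
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
rets_clean = rets_m.dropna()                   # pct_change leaves a leading NaN
fig, ax = plt.subplots()
ax.bar(rets_clean.index, rets_clean.values, width=20)            # width is in days on a date axis
ax.xaxis.set_major_locator(mdates.MonthLocator(interval=3))      # one tick per quarter
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))      # dates without times
fig.autofmt_xdate()
plt.show()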