Convert string data in HDF5 File to float Format - type-conversion

I need to convert string data from an HDF5 file to float format so that I can use it in a sky plot (Astropy) with l, b coordinates. The data is available here:
https://wwwmpa.mpa-garching.mpg.de/~ensslin/research/data/faraday2020.html
(Faraday Sky 2020)
The code I have written so far is:
from astropy import units as u
from astropy.coordinates import SkyCoord
import matplotlib.pyplot as plt
import numpy as np
import h5py
dat = []
ggl=[]
ggb=[]
f1= h5py.File('/home/nikita/faraday_2020/faraday2020.hdf5','r')
data = f1.get('faraday_sky_mean')
faraday_sky_mean = np.array(data)
data1 = f1.get('faraday_sky_std')
faraday_sky_std = np.array(data1)
n1 = 0
for line in f1:
    s = line.split()
    dat.append(s)
    n1 = n1 +1
#
for i in range(0,n1):
    ggl.append(float(dat[i][0])) # galactic coordinates input
    ggb.append(float(dat[i][1]))
f1.close()
However I am getting the error:
ggl.append(float(dat[i][0])) # galactic coordinates input
ValueError: could not convert string to float: 'faraday_sky_mean'
Please help with this. Thanks.

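What you asked and what (I think) you need are two different things.
This line is NOT the way to read an HDF5 file: for line in f1:
Iterating over an h5py File object yields the names of its datasets (here the string 'faraday_sky_mean'), not rows of numbers, which is why float() fails.
You need to use an HDF5 API to read it (h5py is one of many).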
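To see why the loop in the question fails, here is a minimal sketch that builds a tiny stand-in file and reads it back. The file name demo.hdf5 and the three sample values are made up for illustration; only the dataset names mirror the real file.

```python
import os
import tempfile

import h5py
import numpy as np

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical stand-in file: real dataset names, made-up values.
    path = os.path.join(tmp_dir, "demo.hdf5")
    with h5py.File(path, "w") as hdf:
        hdf.create_dataset("faraday_sky_mean", data=np.array([1.5, -2.0, 0.25]))
        hdf.create_dataset("faraday_sky_std", data=np.array([0.1, 0.2, 0.3]))

    with h5py.File(path, "r") as hdf:
        # Iterating over the file yields dataset names (strings), not data rows.
        names = [name for name in hdf]
        # Indexing by name and slicing with [:] reads the values into a NumPy array.
        mean = hdf["faraday_sky_mean"][:]

print(names)
print(mean.dtype, mean.tolist())

# Passing a dataset name to float() reproduces the error from the question.
try:
    float(names[0])
except ValueError as err:
    print("ValueError:", err)
```

The values are already numeric once they are read through the dataset, so no string-to-float conversion is needed.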
I think you want to read the datasets faraday_sky_mean and faraday_sky_std and load the arrays into the lists ggl and ggb. To do that, use this code. It creates two lists with 3145728 values each (the arrays are float64; tolist() converts the values to Python floats).
with h5py.File('faraday2020.hdf5','r') as hdf:
    print(list(hdf.keys()))
    faraday_sky_mean = hdf['faraday_sky_mean'][:]
    faraday_sky_std = hdf['faraday_sky_std'][:]
    print(faraday_sky_mean.shape, faraday_sky_mean.dtype)
    print(f'Max Mean={max(faraday_sky_mean)}, Min Mean={min(faraday_sky_mean)}')
    print(faraday_sky_std.shape, faraday_sky_std.dtype)
    print(f'Max StdDev={max(faraday_sky_std)}, Min StdDev={min(faraday_sky_std)}')
    ggl = faraday_sky_mean.tolist()
    print(len(ggl),type(ggl[0]))
    ggb = faraday_sky_std.tolist()
    print(len(ggb),type(ggb[0]))
The procedure above saves the data as both NumPy arrays and Python lists. If you only need the lists (don't need the arrays), you can shorten the code as shown below:
with h5py.File('faraday2020.hdf5','r') as hdf:
    ggl = hdf['faraday_sky_mean'][:].tolist()
    print(len(ggl),type(ggl[0]))
    ggb = hdf['faraday_sky_std'][:].tolist()
    print(len(ggb),type(ggb[0]))

Related

Import csv as MATLAB struct

I have a large csv log file. Here is a simplified sample:
ts,a.b.c,a.b.d,a.b.e,b.c,b.d,c.d.e,c.d.f,c.g
2021-03-29 06:38:39,1.0000,2,3,28.20,1,2,3,4
2021-03-29 06:38:40,1.0000,2,3,28.20,1,2,3,0.000000
I am using MATLAB's Import Data tool to import it, but, unfortunately, it removes all dots from the header and imports all variables as, e.g.: abc, abd, abe etc.
What is an efficient way to import a csv like the one above as structs?
I am looking for a way to have the data imported as structs a, b and c for this particular log file, so that I can easily access variables as a.b.c or c.d.f.
Here is what I came up with, by simply using readtable.
function res = log_import(logfile)
    log_table = readtable(logfile);
    res = [];
    for i = 1:width(log_table)
        str_path = log_table.Properties.VariableDescriptions{i};
        fields = strsplit(str_path,'.');
        res = setfield(res, fields{1:end}, log_table{:, i});
    end
end

How to write a flexible multiple exponential fit

I'd like to write a more or less universal fit function for the general function
$f(t) = \sum_i a_i \exp(-t/\tau_i)$
for some data I have.
Below is example code for a biexponential function, but I would like to be able to fit a monoexponential or a triexponential function with as few code changes as possible.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
t = np.linspace(0, 10, 100)
a_1 = 1
a_2 = 1
tau_1 = 5
tau_2 = 1
data = 1*np.exp(-t/5) + 1*np.exp(-t/1)
data += 0.2 * np.random.normal(size=t.size)
def func(t, a_1, tau_1, a_2, tau_2): # plus more exponential functions
    return a_1*np.exp(-t/tau_1)+a_2*np.exp(-t/tau_2)
popt, pcov = curve_fit(func, t, data)
print(popt)
plt.plot(t, data, label="data")
plt.plot(t, func(t, *popt), label="fit")
plt.legend()
plt.show()
In principle I thought of redefining the function to a general form
def func(t, a, tau): # with a and tau as a list
    tmp = 0
    for i in range(len(a)):
        tmp += a[i]*np.exp(-t/tau[i])
    return tmp
and passing the arguments to curve_fit in the form of lists or tuples. However I get a TypeError as shown below.
TypeError: func() takes 4 positional arguments but 7 were given
Is there any way to rewrite the code so that the input parameters passed to curve_fit alone "determine" the degree of the multiexponential function? So that passing
a = (1,)
results in a monoexponential function whereas passing
a = (1, 2, 3)
results in a triexponential function?
Regards
Yes, that can be done easily with NumPy broadcasting:
def func(t, a, taus): # plus more exponential functions
    a=np.array(a)[:,None]
    taus=np.array(taus)[:,None]
    return (a*np.exp(-t/taus)).sum(axis=0)
func accepts 2 lists, converts them into 2-dim np.array, computes a matrix with all the exponentials and then sums it up. Example:
t=np.arange(100).astype(float)
out=func(t,[1,2],[0.3,4])
plt.plot(out)
Keep in mind a and taus must be the same length, so sanitize your inputs as you see fit. Or you could also directly pass np.arrays instead of lists.
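One caveat: curve_fit passes the fit parameters to the model as separate positional arguments, so a function that takes two lists cannot be handed to it directly. A minimal sketch of one way to bridge that, assuming the parameters are packed as a flat sequence (a_1, tau_1, a_2, tau_2, ...). The name multi_exp and the starting values in p0 are illustrative, not part of the answer above. Because the function takes *params, curve_fit cannot infer the number of parameters on its own, so the length of p0 sets the number of exponentials:

```python
import numpy as np
from scipy.optimize import curve_fit


def multi_exp(t, *params):
    """Sum of exponentials; params is the flat sequence a_1, tau_1, a_2, tau_2, ..."""
    a = np.asarray(params[0::2])[:, None]     # amplitudes as a column vector
    taus = np.asarray(params[1::2])[:, None]  # decay times as a column vector
    # Broadcasting gives one row per exponential; summing over rows adds them up.
    return (a * np.exp(-t / taus)).sum(axis=0)


t = np.linspace(0, 10, 100)
data = 1 * np.exp(-t / 5) + 1 * np.exp(-t / 1)  # noise-free biexponential test data

# Four starting values -> two exponentials; six would request a triexponential fit.
p0 = [1.2, 4.0, 0.8, 1.5]
popt, pcov = curve_fit(multi_exp, t, data, p0=p0)
print(popt)
```

With noisy data the result depends on the starting values, so p0 should be a rough guess of the expected amplitudes and decay times rather than arbitrary numbers.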

PySpark: Error "Cannot pickle standard input" on function map

I'm trying to learn how to use PySpark.
I'm using spark-2.2.0 with Python 3.
I'm facing a problem now and I can't find where it comes from.
My project is to adapt an algorithm written by a data scientist so that it can be distributed. The code below is what I have to use to extract features from images, and I have to adapt it to extract the features with PySpark.
import json
import sys
# Dependencies can be installed by running:
# pip install keras tensorflow h5py pillow
# Run script as:
# ./extract-features.py images/*.jpg
from keras.applications.vgg16 import VGG16
from keras.models import Model
from keras.preprocessing import image
from keras.applications.vgg16 import preprocess_input
import numpy as np
def main():
    # Load model VGG16 as described in https://arxiv.org/abs/1409.1556
    # This is going to take some time...
    base_model = VGG16(weights='imagenet')
    # Model will produce the output of the 'fc2' layer, which is the penultimate neural network layer
    # (see the paper above for more details)
    model = Model(input=base_model.input, output=base_model.get_layer('fc2').output)
    # For each image, extract the representation
    for image_path in sys.argv[1:]:
        features = extract_features(model, image_path)
        with open(image_path + ".json", "w") as out:
            json.dump(features, out)
def extract_features(model, image_path):
    img = image.load_img(image_path, target_size=(224, 224))
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)
    x = preprocess_input(x)
    features = model.predict(x)
    return features.tolist()[0]
if __name__ == "__main__":
    main()
I have written the beginning of the code:
rdd = sc.binaryFiles(PathImages)
base_model = VGG16(weights='imagenet')
model = Model(input=base_model.input, output=base_model.get_layer('fc2').output)
rdd2 = rdd.map(lambda x : (x[0], extract_features(model, x[0][5:])))
rdd2.collect()[0]
When I try to extract the features, there is an error:
~/Code/spark-2.2.0-bin-hadoop2.7/python/pyspark/cloudpickle.py in save_file(self, obj)
623 return self.save_reduce(getattr, (sys,'stderr'), obj=obj)
624 if obj is sys.stdin:
--> 625 raise pickle.PicklingError("Cannot pickle standard input")
626 if hasattr(obj, 'isatty') and obj.isatty():
627 raise pickle.PicklingError("Cannot pickle files that map to tty objects")
PicklingError: Cannot pickle standard input
I tried several things and here is my first result. I know that the error comes from the line below in the function extract_features:
features = model.predict(x)
and when I run this line outside of a map function or PySpark, it works fine.
I think the problem comes from the object "model" and its serialisation with PySpark.
Maybe this isn't a good way to distribute the work with PySpark; if you have any clue that could help me, I will take it.
Thanks in advance.

How can I create a POSIXct vector in ffdf?

I've had a look around and can't quite seem to get a grasp of what is going on with this. I'm using R in Eclipse. The file I'm trying to import is 700 MB with around 15 million rows and 6 columns. Because I was having problems loading it, I have started using the ff package.
library(ff)
FDF = read.csv.ffdf(file='C:\\Users\\William\\Desktop\\R Data\\GBPUSD.1986.2014.txt', header = FALSE, colClasses=c('factor','factor','numeric','numeric','numeric','numeric'), sep=',')
names(FDF)= c('Date','Time','Open','High','Low','Close')
#names the columns in the ffdf file
dim(FDF)
# produces dimensions of the file
I then want to create a POSIXct sequence which will later be joined against the imported file. I tried:
tm1 = seq(as.POSIXct("1986/12/1 00:00"), as.POSIXct("2014/09/04 23:59"),"mins")
tm1 = data.frame (DateTime=strftime(tm1,format='%Y.%m.%d %H:%M'))
However, R kept crashing. I then tested this in RStudio and saw that there were constraints on the vector. It did, however, produce the correct
dim(tm1)
names(tm1)
So I went back into Eclipse thinking this was something to do with memory allocation. I've attempted the following;
library(ff)
tm1 = as.ffdf(seq(as.POSIXct("1986/12/1 00:00"), as.POSIXct("2014/09/04 23:59"),"mins"))
tm1 = as.ffdf(DateTime=strftime(tm1,format='%Y.%m.%d %H:%M'))
names(tm1) = c('DateTime')
dim(tm1)
names(tm1)
This gives an error of
no applicable method for 'as.ffdf' applied to an object of class "c('POSIXct', 'POSIXt')"
I can't seem to work around this. I then tried ...
library(ff)
tm1 = as.ff(seq(as.POSIXct("1986/12/1 00:00"), as.POSIXct("2014/09/04 23:59"),"mins"))
tm1 = as.ff(DateTime=strftime(tm1,format='%Y.%m.%d %H:%M'))
This produces the output dates, but not in the correct format. In addition, when ...
dim(tm1)
names(tm1)
were executed, they both returned NULL.
Question
How can I produce a POSIXct seq in the format I require above?
Well, we got there in the end.
I believe the problem was the RAM available while the full vector was being created. Because of this, I broke the vector into three parts, converted each into ffdf format to free up RAM, and then used rbind to bind them together.
The problem with formatting the vector once created was, I believe, also due to RAM access. Every time I tried this, R crashed.
Even with the workaround below, my machine (4 GB of RAM) is slowing down. I've ordered some more RAM and hope this will smooth future operations.
Below is the working code:
library(ff)
library(ffbase)
tm1 = seq(from = as.POSIXct('1986-12-01 00:00'), to = as.POSIXct('2000-12-01 23:59'), by = 'min')
tm1 = data.frame(DateTime=strftime(tm1, format='%Y.%m.%d %H:%M'))
# create data frame within memory constraints
tm1 = as.ffdf(tm1)
# converts to ffdf format
memory.size()
tm2 = seq(from = as.POSIXct('2000-12-02 00:00'), to = as.POSIXct('2010-12-01 23:59'), by = 'min')
tm2 = data.frame(DateTime=strftime(tm2, format='%Y.%m.%d %H:%M'))
# create data frame within memory constraints
tm2 = as.ffdf(tm2)
memory.size()
tm3 = seq(from = as.POSIXct('2010-12-2 00:00'), to = as.POSIXct('2014-09-04 23:59'), by = 'min')
tm3 = data.frame(DateTime=strftime(tm3, format='%Y.%m.%d %H:%M'))
memory.size()
tm3 = as.ffdf(tm3)
# converts to ffdf format
memory.size()
tm4 = rbind(tm1, tm2, tm3)
# binds ffdf objects into one
dim(tm4)
# checks the row numbers

Training an LSTM neural network to forecast time series in pybrain, python

I have a neural network created using PyBrain and designed to forecast time series.
I am using the sequential dataset function, and trying to use a sliding window of 5 previous values to predict the 6th. One of my problems is that I can't figure out how to create the required dataset by appending the 5 previous values to the inputs and the 6th as an output.
I am also unsure of how exactly to forecast values in the series once the network is trained.
Posting my code below:
from pybrain.datasets import SupervisedDataSet
from pybrain.datasets import SequentialDataSet
from pybrain.tools.shortcuts import buildNetwork
from pybrain.supervised.trainers import BackpropTrainer
from pybrain.supervised.trainers import RPropMinusTrainer
from pylab import ion, ioff, figure, draw, contourf, clf, show, hold, plot
from pybrain.structure import RecurrentNetwork
from pybrain.structure import FeedForwardNetwork
from pybrain.structure import LinearLayer, SigmoidLayer, TanhLayer
from pybrain.structure import FullConnection
from pybrain.structure import LSTMLayer
from pybrain.structure import BiasUnit
from pybrain.rl.learners.valuebased import Q
import pybrain
import matplotlib as plt
import translate
import time
import pickle
import scipy as sp
import numpy as np
import pylab as pl
import itertools
#Opening data from database
data = translate.translate(3600)
time, price, volume = zip(*data)
#Creating data lists instead of tuples
timeList = []
priceList = []
volumeList = []
for record in time:
    timeList.append(record)
for record in price:
    priceList.append(record)
for record in volume:
    volumeList.append(record)
#Creating lookback window and target
datain = priceList[:5]
dataout = priceList[6]
print datain
print dataout
#Creating the dataset
ds = SequentialDataSet(5, 1)
for x, y in itertools.izip(datain, dataout):
    ds.newSequence()
    ds.appendLinked(tuple(x), tuple(y))
    print (x, y)
print ds
#Building the network
n = RecurrentNetwork()
#Create the network modules
n.addInputModule(SigmoidLayer(5, name = 'in'))
n.addModule(LSTMLayer(100, name = 'LSTM'))
n.addModule(LSTMLayer(100, name = 'LSTM2'))
n.addOutputModule(SigmoidLayer(1, name = 'out'))
#Add the network connections
n.addConnection(FullConnection(n['in'], n['LSTM'], name = 'c_in_to_LSTM'))
n.addConnection(FullConnection(n['in'], n['LSTM2'], name = 'c_in_to_LSTM2'))
n.addConnection(FullConnection(n['LSTM'], n['out'], name = 'c_LSTM_to_out'))
n.addConnection(FullConnection(n['LSTM2'], n['out'], name = 'c_LSTM2_to_out'))
n.sortModules()
n.randomize()
#Creating the trainer
trainer = BackpropTrainer(n, ds)
#Training the network
#for i in range (1000):
# print trainer.train()
#Make predictions
#Plotting the results
pl.plot(time, price)
pl.show()
The above code gives:
TypeError: izip argument #2 must support iteration
I have seen the question linked below; however, I haven't been successful with it:
Event Sequences, Recurrent Neural Networks, PyBrain
First question on this great site, any help is appreciated
#Creating lookback window and target
datain = priceList[:5]
dataout = priceList[6]
Not an expert, but it seems your datain is a list (of length 5) while dataout is not.
I'd guess the TypeError says it all. Whereas priceList[:5] is a list and hence iterable, priceList[6] is a single element (and, since indexing starts at 0, it is the 7th value, not the 6th).
You'd probably want something like
datain = priceList[:5]
dataout = priceList[5:6]
which will make dataout a list with a single element, the 6th value.
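For the sliding-window part of the question, here is a minimal sketch in plain Python, independent of PyBrain, of how the (5 inputs, 1 target) pairs could be built for the whole series. The helper name sliding_windows and the sample prices are made up for illustration. Each resulting pair is a tuple of inputs and a one-element tuple for the target, which is the shape the question's ds.appendLinked(...) call expects.

```python
def sliding_windows(series, window=5):
    """Return (inputs, target) pairs: `window` consecutive values and the value that follows."""
    pairs = []
    for start in range(len(series) - window):
        inputs = tuple(series[start:start + window])  # the `window` previous values
        target = (series[start + window],)            # the next value, as a 1-tuple
        pairs.append((inputs, target))
    return pairs


prices = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0]  # made-up sample data
pairs = sliding_windows(prices, window=5)
for inputs, target in pairs:
    print(inputs, "->", target)
```

A series shorter than window + 1 values yields no pairs, so it is worth checking the length of the price list before building the dataset.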