Read and merge large tables on a computer cluster

I need to merge several large tables (up to 10 GB each) into a single one. To do so I am using a computer cluster with 50+ cores and 10+ GB of RAM that runs on Linux.
I always end up with an error message like: "Cannot allocate vector of size X Mb".
Given that commands like memory.limit(size=X) are Windows-specific and not accepted, I cannot find a way to merge my large tables.
Any suggestion is welcome!
This is the code I use:
library(parallel)
no_cores <- detectCores() - 1
cl <- makeCluster(no_cores)
temp = list.files(pattern="*.txt$")
gc()
Here the error occurs:
myfiles = parLapply(cl, temp, function(x) read.csv(x,
                                                   header = TRUE,
                                                   sep = ";",
                                                   stringsAsFactors = FALSE,
                                                   encoding = "UTF-8",
                                                   na.strings = c("NA", "99", "")))
myfiles.final = do.call(rbind, myfiles)

You could just use merge, for example:

mergedTable <- merge(table1, table2, by = "dbSNP_RSID")

If your samples have overlapping column names, then you'll find that the mergedTable has (for example) columns called Sample1.x and Sample1.y. This can be fixed by renaming the columns before or after the merge.
Reproducible example:

x <- data.frame(dbSNP_RSID = paste0("rs", sample(1e6, 1e5)),
                matrix(paste0(sample(c("A", "C", "T", "G"), 1e7, replace = TRUE),
                              sample(c("A", "C", "T", "G"), 1e7, replace = TRUE)), ncol = 100))
y <- data.frame(dbSNP_RSID = paste0("rs", sample(1e6, 1e5)),
                matrix(paste0(sample(c("A", "C", "T", "G"), 1e7, replace = TRUE),
                              sample(c("A", "C", "T", "G"), 1e7, replace = TRUE)), ncol = 100))
colnames(x)[2:101] <- paste0("Sample", 1:100)
colnames(y)[2:101] <- paste0("Sample", 101:200)
mergedDf <- merge(x, y, by = "dbSNP_RSID")

One way to approach this is with Python and dask. A dask dataframe is stored mostly on disk rather than in RAM, which lets you work with larger-than-RAM data, and it can parallelize computations for you. A nice tutorial on ways to work with big data can be found in this Kaggle post, which might also be helpful for you. I also suggest checking out the docs on dask performance here. To be clear, if your data fits in RAM, a regular R data frame or pandas DataFrame will be faster.
Here's a dask solution, which assumes the tables have named columns so the concat operation can align them. Please add to your question if there are any other special requirements about the data we need to consider.
import dask.dataframe as dd
import glob

tmp = glob.glob("*.txt")

dfs = []
for f in tmp:
    # read the large tables
    ddf = dd.read_table(f)
    # make a list of all the dfs
    dfs.append(ddf)

# row-wise concat of the data
dd_all = dd.concat(dfs)

# repartition the df to 1 partition for saving
dd_all = dd_all.repartition(npartitions=1)

# save the data
# provide a list of one name if you don't want the partition number appended
dd_all.to_csv(['all_big_files.tsv'], sep='\t')
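Since plain pandas is faster when everything fits in RAM, here is a minimal pandas sketch of the same concat (assuming tab-separated files, as the dask snippet above does; adjust sep to match your files):

```python
import glob
import pandas as pd

tmp = glob.glob("*.txt")

# read each table and concatenate row-wise, entirely in memory
dfs = [pd.read_csv(f, sep='\t') for f in tmp]
df_all = pd.concat(dfs, ignore_index=True)

# save the combined table
df_all.to_csv('all_big_files.tsv', sep='\t', index=False)
```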
If you just want to cat all the tables together, you can do something like this in plain Python (you could also use Linux cat/paste).
with open('all_big_files.tsv', 'w') as O:
    file_number = 0
    for f in tmp:
        with open(f, 'rU') as F:
            if file_number == 0:
                # keep the header from the first file
                for line in F:
                    line = line.rstrip()
                    O.write(line + '\n')
            else:
                # skip the header line
                l = F.readline()
                for line in F:
                    line = line.rstrip()
                    O.write(line + '\n')
        file_number += 1

Marginal Means accounting for the random effect uncertainty

When we have repeated measurements on an experimental unit, these units typically cannot be considered 'independent' and need to be modeled in a way that gives us valid estimates of our standard errors.
When I compare the intervals obtained by computing the marginal means for the treatment using a mixed model (treating the unit as a random effect) with those obtained by first averaging over the unit and THEN running a simple linear model on the averaged responses, I get the exact same uncertainty intervals.
How do we incorporate the uncertainty of the measurements on the unit into the uncertainty of what we think our treatments look like?
In order to really propagate all the uncertainty, shouldn't we see what the treatment looks like, averaged over "all possible measurements" on a unit?
``` r
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(emmeans)
library(lme4)
#> Loading required package: Matrix
library(ggplot2)
tmp <- structure(list(treatment = c("A", "A", "A", "A", "A", "A", "A",
"A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B", "B", "B",
"B", "B", "B", "B"), response = c(151.27333548, 162.3933313,
159.2199999, 159.16666725, 210.82, 204.18666667, 196.97333333,
194.54666667, 154.18666667, 194.99333333, 193.48, 191.71333333,
124.1, 109.32666667, 105.32, 102.22, 110.83333333, 114.66666667,
110.54, 107.82, 105.62000069, 79.79999821, 77.58666557, 75.78666928
), experimental_unit = c("A-1", "A-1", "A-1", "A-1", "A-2", "A-2",
"A-2", "A-2", "A-3", "A-3", "A-3", "A-3", "B-1", "B-1", "B-1",
"B-1", "B-2", "B-2", "B-2", "B-2", "B-3", "B-3", "B-3", "B-3"
)), row.names = c(NA, -24L), class = c("tbl_df", "tbl", "data.frame"
))
### Option 1 - Treat the experimental unit as a random effect since there are
### 4 repeat observations for the same unit
lme4::lmer(response ~ treatment + (1 | experimental_unit), data = tmp) %>%
emmeans::emmeans(., ~ treatment) %>%
as.data.frame()
#> treatment emmean SE df lower.CL upper.CL
#> 1 A 181.0794 10.83359 4 151.00058 211.1583
#> 2 B 101.9683 10.83359 4 71.88947 132.0472
#ggplot(.,aes(treatment, emmean)) +
#geom_pointrange(aes(ymin = lower.CL, ymax = upper.CL))
### Option 2 - instead of treating the unit as random effect, we average over the
### 4 repeat observations, and run a simple linear model
tmp %>%
group_by(experimental_unit) %>%
summarise(mean_response = mean(response)) %>%
mutate(treatment = c(rep("A", 3), rep("B", 3))) %>%
lm(mean_response ~ treatment, data = .) %>%
emmeans::emmeans(., ~ treatment) %>%
as.data.frame()
#> treatment emmean SE df lower.CL upper.CL
#> 1 A 181.0794 10.83359 4 151.00058 211.1583
#> 2 B 101.9683 10.83359 4 71.88947 132.0472
#ggplot(., aes(treatment, emmean)) +
#geom_pointrange(aes(ymin = lower.CL, ymax = upper.CL))
### Whether we include a random effect for the unit, or average over it and THEN model it, we find no difference in the
### marginal means for the treatments
### How do we incorporate the variation of the repeat measurements into the marginal means of the treatments?
### Do we then ignore the variation in the 'subsamples' and simply average over them PRIOR to modeling?
```

<sup>Created on 2021-07-31 by the [reprex package](https://reprex.tidyverse.org) (v2.0.0)</sup>
emmeans() does take into account the errors of random effects. This is what I get when I remove the complex sequences of pipes:
> mmod = lme4::lmer(response ~ treatment + (1 | experimental_unit), data = tmp)
> emmeans(mmod, "treatment")
treatment emmean SE df lower.CL upper.CL
A 181 10.8 4 151.0 211
B 102 10.8 4 71.9 132
Degrees-of-freedom method: kenward-roger
Confidence level used: 0.95
This is the same as shown in the question. If I fit a fixed-effects model that accounts for experimental units as a fixed effect, I get:
> fmod = lm(response ~ treatment + experimental_unit, data = tmp)
> emmeans(fmod, "treatment")
NOTE: A nesting structure was detected in the fitted model:
experimental_unit %in% treatment
treatment emmean SE df lower.CL upper.CL
A 181 3.25 18 174.2 188
B 102 3.25 18 95.1 109
Results are averaged over the levels of: experimental_unit
Confidence level used: 0.95
The SEs of the latter results are considerably lower, and that is because the random variations in experimental_unit are modeled as fixed variations.
Apparently the piping you did accounts for the variation of the random effects and includes those in the EMMs. I think that is because you did things separately for each experimental unit and somehow combined those results. I'm not very comfortable with a sequence of pipes that is 7 steps long, and I don't understand why that results in just one set of means.
I recommend against the as.data.frame() at the end. That zaps out annotations that can be helpful in understanding what you have. If you are doing that to get more digits of precision, I'd claim those are digits you don't need; it just exaggerates the precision you are entitled to claim.
Notes on some follow-up comments
Subsequently, I am convinced that the piped operations in the second part of the OP do indeed amount to computing the mean of each EU, then analyzing those means.
Let's look at that in the context of the formal model. We have (sorry MathJax doesn't work on stackoverflow, but I'll leave the markup there anyway)
$$ Y_{ijk} = \mu + \tau_i + U_{ij} + E_{ijk} $$
where $Y_{ijk}$ is the kth response measurement on the jth EU in the ith treatment, and the rhs terms represent respectively the overall mean, the (fixed) treatment effects, the (random) EU effects, and the (random) error effects. We assume the random effects are all mutually independent. With a balanced design, the EMMs are just the marginal means:
$$ \bar Y_{i..} = \mu + \tau_i + \bar U_{i.} + \bar E_{i..} $$
where a '.' subscript means we averaged over that subscript. If there are n EUs per treatment and m measurements on each EU, we get that
$$ Var(\bar Y_{i..}) = \sigma^2_U / n + \sigma^2_E / (mn) $$
Now, if we aggregate the data on EUs ahead of time, we are starting with
$$ \bar Y_{ij.} = \mu + \tau_i + U_{ij} + \bar E_{ij.} $$
However, if we then compute marginal means by averaging over j, we get exactly the same thing as we did before with $\bar Y_{i..}$, and the variance is exactly as already shown. That is why it doesn't matter if we aggregated first or not.
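As a quick numerical check against the output above: each treatment in this dataset has $n = 3$ EUs with $m = 4$ measurements each, so both analyses are estimating

$$ \widehat{Var}(\bar Y_{i..}) = \hat\sigma^2_U / 3 + \hat\sigma^2_E / 12 = (10.834)^2 \approx 117.4 $$

which is the squared SE reported for each treatment EMM.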

Trying to install a corpus for countVectorizer in sklearn package

I am trying to load a corpus from my local drive into Python in one go with a for loop, reading each text file and saving it for analysis with CountVectorizer. But I am only getting the last file. How do I get the contents of all of the files stored for analysis with CountVectorizer?
This code only brings out the text from the last file in the folder.
folder_path = "folder"

#import and read all files in animal_corpus
for filename in glob.glob(os.path.join(folder_path, '*.txt')):
    with open(filename, 'r') as f:
        txt = f.read()
        print(txt)

MyList = [txt]

## Create a CountVectorizer object that you can use
MyCV1 = CountVectorizer()
## Call your MyCV1 on the data
DTM1 = MyCV1.fit_transform(MyList)
## get col names
ColNames = MyCV1.get_feature_names()
print(ColNames)
## convert DTM to DF
MyDF1 = pd.DataFrame(DTM1.toarray(), columns=ColNames)
print(MyDF1)
This code works, but it would not scale to the huge corpus I am preparing it for.
#import and read text files
f1 = open("folder/animal_1.txt",'r')
f1r = f1.read()
f2 = open("/folder/animal_2.txt",'r')
f2r = f2.read()
f3 = open("/folder/animal_3.txt",'r')
f3r = f3.read()
#reassemble corpus in python
MyCorpus=[f1r, f2r, f3r]
## Create a CountVectorizer object that you can use
MyCV1 = CountVectorizer()
## Call your MyCV1 on the data
DTM1 = MyCV1.fit_transform(MyCorpus)
## get col names
ColNames=MyCV1.get_feature_names()
print(ColNames)
## convert DTM to DF
MyDF2 = pd.DataFrame(DTM1.toarray(), columns=ColNames)
print(MyDF2)
I figured it out. Just gotta keep grinding.
MyCorpus = []

#import and read all files in animal_corpus
for filename in glob.glob(os.path.join(folder_path, '*.txt')):
    with open(filename, 'r') as f:
        txt = f.read()
        MyCorpus.append(txt)
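With MyCorpus built this way, the vectorization step from the question can be reused as-is, for example:

```python
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

## Create a CountVectorizer object and fit it on the whole corpus
MyCV1 = CountVectorizer()
DTM1 = MyCV1.fit_transform(MyCorpus)

## convert the document-term matrix to a DataFrame, one row per file
MyDF = pd.DataFrame(DTM1.toarray(), columns=MyCV1.get_feature_names())
print(MyDF)
```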

Callback function ModelCheckpoint causes an error in Keras

I seem to get this error when I am using the ModelCheckpoint callback.
I read in a GitHub issue that the solution would be to make use of model.get_weights(), but I am implicitly only storing that, since I only store the one with the best weights.
Keras only seems to save weights using HDF5, which makes me ask: is there any other way to store them using the Keras API, and if so, how? If not, how do I store them?
Made an example to recreate the problem:
#!/usr/bin/python
import glob, os
import sys
from os import listdir
from os.path import isfile, join
import numpy as np
import warnings
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from keras.utils import np_utils
from keras import metrics
import keras
from keras import backend as K
from keras.models import Sequential
from keras.optimizers import SGD, Adam
from keras.layers.core import Dense, Activation, Lambda, Reshape,Flatten
from keras.layers import Conv1D,Conv2D,MaxPooling2D, MaxPooling1D, Reshape
#from keras.utils.visualize_util import plot
from keras.models import Model
from keras.layers import Input, Dense
from keras.layers.merge import Concatenate, Add
import h5py
import random
import tensorflow as tf
import math
from keras.callbacks import CSVLogger
from keras.callbacks import ModelCheckpoint
if len(sys.argv) < 5:
    print "Missing Arguments!"
    print "python keras_convolutional_feature_extraction.py <workspace> <totale_frames> <fbank-dim> <window-height> <batch_size>"
    print "Example:"
    print "python keras_convolutional_feature_extraction.py deltas 15 40 5 100"
    sys.exit()
total_frames = int(sys.argv[2])
total_frames_with_deltas = total_frames*3
dim = int(sys.argv[3])
window_height = int(sys.argv[4])
inserted_batch_size = int(sys.argv[5])
stride = 1
splits = ((dim - window_height)+1)/stride
#input_train_data = "/media/carl/E2302E68302E443F/"+str(sys.argv[1])+"/fbank/org_train_total_frames_"+str(total_frames)+"_dim_"+str(dim)+"_winheig_"+str(window_height)+"_batch_"+str(inserted_batch_size)+"_fws_input"
#output_train_data ="/media/carl/E2302E68302E443F/"+str(sys.argv[1])+"/fbank/org_train_total_frames_"+str(total_frames)+"_dim_"+str(dim)+"_winheig_"+str(window_height)+"_batch_"+str(inserted_batch_size)+"_fws_output"
#input_test_data = "/media/carl/E2302E68302E443F/"+str(sys.argv[1])+"/fbank/org_test_total_frames_"+str(total_frames)+"_dim_"+str(dim)+"_winheig_"+str(window_height)+"_batch_"+str(1)+"_fws_input"
#output_test_data = "/media/carl/E2302E68302E443F/"+str(sys.argv[1])+"/fbank/org_test_total_frames_"+str(total_frames)+"_dim_"+str(dim)+"_winheig_"+str(window_height)+"_batch_"+str(1)+"_fws_output"
#train_files =[f for f in listdir(input_train_data) if isfile(join(input_train_data, f))]
#test_files =[f for f in listdir(input_test_data) if isfile(join(input_test_data, f))]
#print len(train_files)
np.random.seed(100)
print "hallo"
def train_generator():
    while True:
        # input = random.choice(train_files)
        # h5f = h5py.File(input_train_data+'/'+input, 'r')
        # train_input = h5f['train_input'][:]
        # train_output = h5f['train_output'][:]
        # h5f.close()
        train_input = np.random.randint(100,size=((inserted_batch_size,splits*total_frames_with_deltas,window_height,3)))
        train_list_list = []
        train_input = train_input.reshape((inserted_batch_size,splits*total_frames_with_deltas,window_height,3))
        train_input_list = np.split(train_input,splits*total_frames_with_deltas,axis=1)
        for i in range(len(train_input_list)):
            train_input_list[i] = train_input_list[i].reshape(inserted_batch_size,window_height,3)
        #for i in range(len(train_input_list)):
        #    train_input_list[i] = train_input_list[i].reshape(inserted_batch_size,33,window_height,1,3)
        train_output = np.random.randint(5, size = (1,total_frames,5))
        middle = int(math.ceil(total_frames/2))
        train_output = train_output[:,middle:middle+1,:].reshape((inserted_batch_size,1,5))
        #print train_output.shape
        #print len(train_input_list)
        #print train_input_list[0].shape
        yield (train_input_list, train_output)
print "hallo"
def test_generator():
    while True:
        # input = random.choice(test_files)
        # h5f = h5py.File(input_test_data+'/'+input, 'r')
        # test_input = h5f['test_input'][:]
        # test_output = h5f['test_output'][:]
        # h5f.close()
        test_input = np.random.randint(100,size=((inserted_batch_size,splits*total_frames_with_deltas,window_height,3)))
        test_input = test_input.reshape((inserted_batch_size,splits*total_frames_with_deltas,window_height,3))
        test_input_list = np.split(test_input,splits*total_frames_with_deltas,axis=1)
        #test_input_list = np.split(test_input,45,axis=3)
        for i in range(len(test_input_list)):
            test_input_list[i] = test_input_list[i].reshape(inserted_batch_size,window_height,3)
        #for i in range(len(test_input_list)):
        #    test_input_list[i] = test_input_list[i].reshape(inserted_batch_size,33,window_height,1,3)
        test_output = np.random.randint(5, size = (1,total_frames,5))
        middle = int(math.ceil(total_frames/2))
        test_output = test_output[:,middle:middle+1,:].reshape((inserted_batch_size,1,5))
        yield (test_input_list, test_output)
print "hallo"
def fws():
    #print "Inside"
    # Params:
    # batch , lr, decay , momentum, epochs
    #
    #Input shape: (batch_size,40,45,3)
    #output shape: (1,15,50)
    # number of unit in conv_feature_map = splitd
    next(train_generator())
    model_output = []
    list_of_input = [Input(shape=(8,3)) for i in range(splits*total_frames_with_deltas)]
    output = []

    #Conv
    skip = total_frames_with_deltas
    for steps in range(total_frames_with_deltas):
        conv = Conv1D(filters = 100, kernel_size = 8)
        column = 0
        for _ in range(splits):
            #print "column " + str(column) + "steps: " + str(steps)
            output.append(conv(list_of_input[(column*skip)+steps]))
            column = column + 1

    #print len(output)
    #print splits*total_frames_with_deltas
    conv = []
    for section in range(splits):
        column = 0
        skip = splits
        temp = []
        for _ in range(total_frames_with_deltas):
            temp.append(output[((column*skip)+section)])
            column = column + 1
        conv.append(Add()(temp))

    #print len(conv)
    output_conc = Concatenate()(conv)
    #print output_conc.get_shape
    output_conv = Reshape((splits, -1))(output_conc)
    #print output_conv.get_shape

    #Pool
    pooled = MaxPooling1D(pool_size = 6, strides = 2)(output_conv)
    reshape = Reshape((1,-1))(pooled)

    #Fc
    dense1 = Dense(units = 1024, activation = 'relu', name = "dense_1")(reshape)
    #dense2 = Dense(units = 1024, activation = 'relu', name = "dense_2")(dense1)
    dense3 = Dense(units = 1024, activation = 'relu', name = "dense_3")(dense1)
    final = Dense(units = 5, activation = 'relu', name = "final")(dense3)

    model = Model(inputs = list_of_input , outputs = final)
    sgd = SGD(lr=0.1, decay=1e-1, momentum=0.9, nesterov=True)
    model.compile(loss="categorical_crossentropy", optimizer=sgd , metrics = ['accuracy'])
    print "compiled"

    model_yaml = model.to_yaml()
    with open("model.yaml", "w") as yaml_file:
        yaml_file.write(model_yaml)
    print "Model saved!"

    log = CSVLogger('/home/carl/kaldi-trunk/dnn/experimental/yesno_cnn_50_training_total_frames_'+str(total_frames)+"_dim_"+str(dim)+"_window_height_"+str(window_height)+".csv")
    filepath = 'yesno_cnn_50_training_total_frames_'+str(total_frames)+"_dim_"+str(dim)+"_window_height_"+str(window_height)+"weights-improvement-{epoch:02d}-{val_acc:.2f}.hdf5"
    checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_weights_only=True, mode='max')
    print "log"
    #plot_model(model, to_file='model.png')

    print "Fit"
    hist_current = model.fit_generator(train_generator(),
                                       steps_per_epoch=444,#len(train_files),
                                       epochs = 10000,
                                       verbose = 1,
                                       validation_data = test_generator(),
                                       validation_steps=44,#len(test_files),
                                       pickle_safe = True,
                                       workers = 4,
                                       callbacks = [log,checkpoint])

fws()
Execute the script with: python name_of_script.py yesno 50 40 8 1
which gives me the full traceback below.
Error:
carl#ca-ThinkPad-T420s:~/Dropbox$ python mini.py yesno 50 40 8 1
Using TensorFlow backend.
Couldn't import dot_parser, loading of dot files will not be possible.
hallo
hallo
hallo
compiled
Model saved!
log
Fit
/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py:2252: UserWarning: Expected no kwargs, you passed 1
kwargs passed to function are ignored with Tensorflow backend
warnings.warn('\n'.join(msg))
Epoch 1/10000
2017-05-26 13:01:45.851125: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-05-26 13:01:45.851345: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-05-26 13:01:45.851392: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
443/444 [============================>.] - ETA: 4s - loss: 100.1266 - acc: 0.3138Epoch 00000: saving model to yesno_cnn_50_training_total_frames_50_dim_40_window_height_8weights-improvement-00-0.48.hdf5
Traceback (most recent call last):
File "mini.py", line 205, in <module>
File "mini.py", line 203, in fws
File "/usr/local/lib/python2.7/dist-packages/keras/legacy/interfaces.py", line 88, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1933, in fit_generator
callbacks.on_epoch_end(epoch, epoch_logs)
File "/usr/local/lib/python2.7/dist-packages/keras/callbacks.py", line 77, in on_epoch_end
callback.on_epoch_end(epoch, logs)
File "/usr/local/lib/python2.7/dist-packages/keras/callbacks.py", line 411, in on_epoch_end
self.model.save_weights(filepath, overwrite=True)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 2503, in save_weights
save_weights_to_hdf5_group(f, self.layers)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 2746, in save_weights_to_hdf5_group
f.attrs['layer_names'] = [layer.name.encode('utf8') for layer in layers]
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (/tmp/pip-4rPeHA-build/h5py/_objects.c:2684)
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (/tmp/pip-4rPeHA-build/h5py/_objects.c:2642)
File "/usr/local/lib/python2.7/dist-packages/h5py/_hl/attrs.py", line 93, in __setitem__
self.create(name, data=value, dtype=base.guess_dtype(value))
File "/usr/local/lib/python2.7/dist-packages/h5py/_hl/attrs.py", line 183, in create
attr = h5a.create(self._id, self._e(tempname), htype, space)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (/tmp/pip-4rPeHA-build/h5py/_objects.c:2684)
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (/tmp/pip-4rPeHA-build/h5py/_objects.c:2642)
File "h5py/h5a.pyx", line 47, in h5py.h5a.create (/tmp/pip-4rPeHA-build/h5py/h5a.c:1904)
RuntimeError: Unable to create attribute (Object header message is too large)
If you look at the amount of data Keras is trying to save under the layer_names attribute (inside the output HDF5 file being created), you will find that it takes more than 64 kB.
np.asarray([layer.name.encode('utf8') for layer in model.layers]).nbytes
>> 77100
I quote from https://support.hdfgroup.org/HDF5/faq/limits.html:
Is there an object header limit and how does that affect HDF5 ?
There is a limit (in HDF5-1.8) of the object header, which is 64 KB.
The datatype for a dataset is stored in the object header, so there is
therefore a limit on the size of the datatype that you can have. (See
HDFFV-1089)
The code above was (almost entirely) copied from the traceback:
File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 2746, in save_weights_to_hdf5_group
f.attrs['layer_names'] = [layer.name.encode('utf8') for layer in layers]
I am using the numpy asarray method to get the figure fast, but h5py arrives at a similar figure (I guess); see https://github.com/h5py/h5py/blob/master/h5py/_hl/attrs.py#L102 if you want to find the exact figure.
Anyway, either you will need to implement your own methods for saving/loading of the weights (or use existing workarounds), or you need to give a really short name to ALL the layers inside your model :), something like this:
list_of_input = [Input(shape=(8,3), name=('i%x' % i)) for i in range(splits*total_frames_with_deltas)]
conv = Conv1D(filters = 100, kernel_size = 8, name='cv%x' % steps)
conv.append(Add(name='add%x' % section)(temp))
output_conc = Concatenate(name='ct')(conv)
output_conv = Reshape((splits, -1), name='rs1')(output_conc)
pooled = MaxPooling1D(pool_size = 6, strides = 2, name='pl')(output_conv)
reshape = Reshape((1,-1), name='rs2')(pooled)
dense1 = Dense(units = 1024, activation = 'relu', name = "d1")(reshape)
dense2 = Dense(units = 1024, activation = 'relu', name = "d2")(dense1)
dense3 = Dense(units = 1024, activation = 'relu', name = "d3")(dense1)
final = Dense(units = 5, activation = 'relu', name = "fl")(dense3)
Don't forget to name all the layers, because the (numpy) string array into which the layer names are converted uses the size of the longest string for every individual string in it when it is saved!
After renaming the layers as proposed above (which takes almost 26 kB), the model is saved successfully. Hope this elaborate answer helps someone.
Update: I have just made a PR to Keras which should fix the issue without implementing any custom loading/saving methods, see 7508
A simple solution, albeit possibly not the most elegant, could be to run a while loop with epochs = 1 (see the sketch below):
- Get the weights at the end of every epoch, together with the accuracy and the loss.
- Save the weights to file 1 with model.get_weights.
- If the accuracy is greater than at the previous epoch (i.e. loop), store the weights in a different file (file 2).
- Run the loop again, loading the weights from file 1.
- Break the loop with a manual early stopping, so that it stops if the loss does not improve for a certain number of loops.
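A rough sketch of that loop, assuming a compiled model and the generators from the question (file names and the patience value are just illustrative):

```python
import pickle
import numpy as np

best_acc = -np.inf
best_loss = np.inf
stale = 0

while stale < 10:  # manual early stopping after 10 loops without improvement
    hist = model.fit_generator(train_generator(), steps_per_epoch=444,
                               epochs=1, verbose=1,
                               validation_data=test_generator(),
                               validation_steps=44)
    val_acc = hist.history['val_acc'][-1]
    val_loss = hist.history['val_loss'][-1]

    # file 1: always keep the latest weights
    pickle.dump(model.get_weights(), open('weights_latest.pkl', 'wb'))

    # file 2: keep a separate copy whenever accuracy improves
    if val_acc > best_acc:
        best_acc = val_acc
        pickle.dump(model.get_weights(), open('weights_best.pkl', 'wb'))

    # track the loss for the manual early stopping
    if val_loss < best_loss:
        best_loss = val_loss
        stale = 0
    else:
        stale += 1

    # reload from file 1 before the next loop iteration
    model.set_weights(pickle.load(open('weights_latest.pkl', 'rb')))
```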
You can use get_weights() together with numpy.save.
It's not the best solution, because it will save several files, but it actually works.
The problem is that you won't have the "optimizer" saved with the current states. But you can perhaps work around that by using smaller learning rates after loading.
Custom callback using numpy.save:
def myCallback(epoch,logs):
global storedLoss
#do your comparisons here using the "logs" var.
print(logs)
if (logs['loss'] < storedLoss):
storedLoss = logs['loss']
for i in range(len(model.layers)):
WandB = model.layers[i].get_weights()
if len (WandB) > 0: #necessary because some layers have no weights
np.save("W" + "-" + str(i), WandB[0],False)
np.save("B" + "-" + str(i), WandB[1],False)
#remember that get and set weights use a list: [weights,biases]
#it may happen (not sure) that there is no bias, and thus you may have to check it (len(WandB)==1).
The logs var brings a dictionary with named metrics, such as "loss" and "accuracy" (if you used it).
You can store the losses within the callback in a global var and compare whether each loss is better or worse than the last.
When fitting, use the lambda callback:
from keras.callbacks import LambdaCallback
model.fit(...,callbacks=[LambdaCallback(on_epoch_end=myCallback)])
In the example above, I used the LambdaCallback, which has more possibilities than just on_epoch_end.
For loading, do a similar loop:
#you have to create the model first and then set the layers
def loadModel(model):
for i in range(len(model.layers)):
WandBForCheck = model.layers[i].get_weights()
if len (WandBForCheck) > 0: #necessary because some layers have no weights
W = np.load(Wfile + str(i))
B = np.load(Bfile + str(i))
model.layers[i].set_weights([W,B])
See follow-up at https://github.com/fchollet/keras/issues/6766 and https://github.com/farizrahman4u/keras-contrib/pull/90.
I saw the YAML, and the root cause is probably that you have so many Inputs. A few Inputs with many dimensions are preferable to many Inputs, especially if you can use scanning and batch operations to do everything efficiently.
Now, ignoring that entirely, here is how you can save and load your model if it has too much stuff to save as JSON efficiently:
You can pass save_weights_only=True. That won't save optimizer weights, so it isn't a great solution.
I just put together a PR for saving model weights and optimizer weights but not the configuration. When you want to load, first instantiate and compile the model as you did when you were going to train it, then use load_all_weights to load the model and optimizer weights into that model. I'll try to merge it soon so you can use it from the master branch.
You could use it something like this:
from keras.callbacks import LambdaCallback
from keras_contrib.utils.save_load_utils import save_all_weights, load_all_weights
# do some stuff to create and compile model
# use `save_all_weights` as a callback to checkpoint your model and optimizer weights
model.fit(..., callbacks=[LambdaCallback(on_epoch_end=lambda epoch, logs: save_all_weights(model, "checkpoint-{:05d}.h5".format(epoch)))])
# use `load_all_weights` to load model and optimizer weights into an existing model
# if not compiled (no `model.optimizer`), this will just load model weights
load_all_weights(model, 'checkpoint-1337.h5')
So I don't endorse the model, but if you want to get it to save and load anyway, this should probably work for you.
As a side note, if you want to save weights in a different format, something like this would work.
pickle.dump([K.get_value(w) for w in model.weights], open("save.p", "wb"))
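A matching load could push those arrays back into the model's variables, for example:

```python
import pickle
from keras import backend as K

saved = pickle.load(open("save.p", "rb"))
for w, value in zip(model.weights, saved):
    K.set_value(w, value)  # restore each variable in the same order it was saved
```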
Cheers
Your model architecture must be too large to be saved.
Use get_weights and set_weights to save and load the model, respectively.
Do not use the ModelCheckpoint callback; once training ends, just save the weights with pickle (a short sketch follows below).
Have a look at this link: Unable to save DataFrame to HDF5 ("object header message is too large")
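A minimal sketch of that get_weights/pickle approach (the file name is illustrative):

```python
import pickle

# once training ends, dump the weight arrays
with open('final_weights.pkl', 'wb') as f:
    pickle.dump(model.get_weights(), f)

# later: rebuild the same architecture, then restore the weights
with open('final_weights.pkl', 'rb') as f:
    model.set_weights(pickle.load(f))
```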

Aligning and italicising table column headings using Rmarkdown and pander

I am writing an R Markdown document knitting to PDF, with tables taken from portions of the lists returned by ezANOVA (from the ez package). The tables are made using the pander package. A toy Rmarkdown file with a toy dataset is below.
---
title: "Table Doc"
output: pdf_document
---
```{r global_options, include=FALSE}
#set global knit options parameters.
knitr::opts_chunk$set(fig.width=12, fig.height=8, fig.path='Figs/',
echo=FALSE, warning=FALSE, message=FALSE, dev = 'pdf')
```
```{r, echo=FALSE}
# toy data
id <- rep(c(1,2,3,4), 5)
group1 <- factor(rep(c("A", "B"), 10))
group2 <- factor(rep(c("A", "B"), each = 10))
dv <- runif(20, min = 0, max = 10)
df <- data.frame(id, group1, group2, dv)
```
``` {r anova, echo = FALSE}
library(ez)
library(plyr)
library(pander)
# create anova object
anOb <- ezANOVA(df,
                dv = dv,
                wid = id,
                between = c(group1, group2),
                type = 3,
                detailed = TRUE)
# extract the output table from the anova object, reduce it down to only desired columns
anOb <- data.frame(anOb[[1]][, c("Effect", "F", "p", "p<.05")])
# format entries in columns
anOb[,2] <- format( round (anOb[,2], digits = 1), nsmall = 1)
anOb[,3] <- format( round (anOb[,3], digits = 4), nsmall = 1)
pander(anOb, justify = c("left", "center", "center", "right"))
```
Now I have a few problems:
a) For the last three columns I would like to have the column heading in the table aligned in the center, but the actual column entries underneath those headings aligned to the right.
b) I would like to have the column headings 'F' and 'p' in italics, and the 'p' in the 'p<.05' column in italics also, but the rest in normal font, so they read *F*, *p* and *p*<.05.
I tried renaming the column headings using plyr::rename like so
anOb <- rename(anOb, c("F" = "italic(F)", "p" = "italic(p)", "p<.05" = ""))
But it didn't work
In markdown, you have to use the markdown syntax for italics, which is wrapping the text between stars or underscores:
> names(anOb) <- c('Effect', '*F*', '*p*', '*p<.05*')
> pander(anOb)
-----------------------------------------
Effect *F* *p* *p<.05*
--------------- ------ -------- ---------
(Intercept) 52.3 0.0019 *
group1 1.3 0.3180
group2 2.0 0.2261
group1:group2 3.7 0.1273
-----------------------------------------
If you want to do that in a programmatic way, you can also use the pandoc.emphasis helper function to add the stars to a string.
But your other problem is due to a bug in the package, for which I've just proposed a fix on GH. Please feel free to give that branch a try and report back on GH -- I will try to get some time later this week to clean up the related unit tests and merge the branch if everything seems to be OK.

How to add meta_data to Pandas dataframe?

I use pandas DataFrames heavily, and I need to attach some data to a dataframe, for example to record the creation time of the dataframe, an additional description of the dataframe, etc.
I just can't find a reserved field of the DataFrame class to keep such data.
So I changed the core\frame.py file to add a line _reserved_slot = {} to solve my issue. I post the question here just to ask: is it OK to do so? Or is there a better way to attach metadata to a dataframe/column/row etc.?
#----------------------------------------------------------------------
# DataFrame class

class DataFrame(NDFrame):
    _auto_consolidate = True
    _verbose_info = True
    _het_axis = 1
    _col_klass = Series

    _AXIS_NUMBERS = {
        'index': 0,
        'columns': 1
    }

    _reserved_slot = {}  # Add by bigbug to keep extra data for dataframe

    _AXIS_NAMES = dict((v, k) for k, v in _AXIS_NUMBERS.iteritems())
EDIT: (add a demo for witingkuo's approach)
>>> df = pd.DataFrame(np.random.randn(10,5), columns=list('ABCDEFGHIJKLMN')[0:5])
>>> df
A B C D E
0 0.5890 -0.7683 -1.9752 0.7745 0.8019
1 1.1835 0.0873 0.3492 0.7749 1.1318
2 0.7476 0.4116 0.3427 -0.1355 1.8557
3 1.2738 0.7225 -0.8639 -0.7190 -0.2598
4 -0.3644 -0.4676 0.0837 0.1685 0.8199
5 0.4621 -0.2965 0.7061 -1.3920 0.6838
6 -0.4135 -0.4991 0.7277 -0.6099 1.8606
7 -1.0804 -0.3456 0.8979 0.3319 -1.1907
8 -0.3892 1.2319 -0.4735 0.8516 1.2431
9 -1.0527 0.9307 0.2740 -0.6909 0.4924
>>> df._test = 'hello'
>>> df2 = df.shift(1)
>>> print df2._test
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "D:\Python\lib\site-packages\pandas\core\frame.py", line 2051, in __getattr__
(type(self).__name__, name))
AttributeError: 'DataFrame' object has no attribute '_test'
>>>
This is not supported right now. See https://github.com/pydata/pandas/issues/2485. The reason is that the propagation of these attributes is non-trivial. You can certainly assign data, but almost all pandas operations return a new object, where the assigned data will be lost.
Your _reserved_slot will become a class variable. That might not work if you want to assign different values to different DataFrames. You can probably just assign what you want to the instance directly:
In [6]: import pandas as pd
In [7]: df = pd.DataFrame()
In [8]: df._test = 'hello'
In [9]: df._test
Out[9]: 'hello'
I think a decent workaround is putting your dataframe into a dictionary with your metadata as other keys. So if you have a dataframe with cashflows, like:
df = pd.DataFrame({'Amount': [-20, 15, 25, 30, 100]},index=pd.date_range(start='1/1/2018', periods=5))
You can create your dictionary with additional metadata and put the dataframe there
out = {'metadata': {'Name': 'Whatever', 'Account': 'Something else'}, 'df': df}
and then access the dataframe as out['df'] and the metadata as out['metadata'].
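For example, continuing the snippet above:

```python
import pandas as pd

df = pd.DataFrame({'Amount': [-20, 15, 25, 30, 100]},
                  index=pd.date_range(start='1/1/2018', periods=5))
out = {'metadata': {'Name': 'Whatever', 'Account': 'Something else'}, 'df': df}

print(out['metadata']['Name'])  # 'Whatever'
print(out['df'])                # the cashflow dataframe, unchanged
```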