How to calculate a raster mean/average from individual maps using GDAL in Python? - python-3.7

I have six monthly raster maps of ET data in TIFF format, one for each month from April to September, and would like to average them into a single mean ET map:
ETmaps_average.tif (this is the map I need)
Any ideas?
I would prefer to do this with the GDAL package in Python 3.7. Thanks.

I have made a few assumptions here; if they hold, this should solve your problem:
All data can fit in memory
All images have the same size (and the same geotransform)
All images have a single band
You should be able to modify the code if some of these assumptions do not hold.
from osgeo import gdal
import numpy as np

file_paths = ['''List of paths to your files''']

# We build one large np array of all images (this requires that all data fits in memory)
res = []
for f in file_paths:
    ds = gdal.Open(f)
    res.append(ds.GetRasterBand(1).ReadAsArray())  # We assume that all rasters have a single band
stacked = np.dstack(res)  # We assume that all rasters have the same dimensions
mean = np.mean(stacked, axis=-1)

# Finally save a new raster with the result.
# This assumes that all inputs have the same geotransform since we just copy the first.
# The copy also keeps the first raster's data type, so integer inputs lose the fractional part of the mean.
driver = gdal.GetDriverByName('GTiff')
result = driver.CreateCopy('ETmaps_average.tif', gdal.Open(file_paths[0]))
result.GetRasterBand(1).WriteArray(mean)  # overwrite the copied values with the mean
result = None  # close the dataset so it is flushed to disk
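If the rasters do not all fit in memory, or if they contain nodata cells, the mean can be accumulated one raster at a time instead of stacking everything. The following is a minimal sketch under those assumptions and is not part of the answer above: `running_mean` and the nodata value -9999 are hypothetical, chosen only for illustration, and the small arrays stand in for what `ReadAsArray()` would return.

```python
import numpy as np


def running_mean(arrays, nodata=None):
    """Average equally shaped 2D arrays one at a time, skipping nodata cells."""
    total = None
    count = None
    for arr in arrays:
        arr = np.asarray(arr, dtype=np.float64)
        if total is None:
            total = np.zeros(arr.shape, dtype=np.float64)
            count = np.zeros(arr.shape, dtype=np.int64)
        # A cell is valid if it is finite and not equal to the nodata flag
        valid = np.isfinite(arr)
        if nodata is not None:
            valid &= arr != nodata
        total[valid] += arr[valid]
        count[valid] += 1
    if total is None:
        raise ValueError("no input arrays")
    # Cells that were never valid stay NaN
    mean = np.full(total.shape, np.nan)
    np.divide(total, count, out=mean, where=count > 0)
    return mean


# Two tiny stand-in "rasters"; -9999 marks a missing cell (hypothetical value)
april = np.array([[1.0, 2.0], [3.0, -9999.0]])
may = np.array([[3.0, 4.0], [-9999.0, -9999.0]])
mean_map = running_mean([april, may], nodata=-9999.0)
print(mean_map)
```

With GDAL, a generator such as `(gdal.Open(f).GetRasterBand(1).ReadAsArray() for f in file_paths)` could be passed as `arrays`, and the band's `GetNoDataValue()` as `nodata`, so that only one raster is held in memory at a time.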


Different results with torchvision transforms

Correct me if I am wrong.
The 'classic' way to pass images through torchvision transforms is to
use Compose, as on its doc page. This, however, requires a PIL Image as input.
An alternative is to use ConvertImageDtype with torch.nn.Sequential. This bypasses
the need for a PIL Image, and in my case it is much faster because I work with numpy arrays.
My problem is that the results are not identical.
Below is an example with a custom Normalize.
I would like to use torch.nn.Sequential (tr) because it is faster for my needs,
but the error compared to Compose (tr2) is very large (~810).
from PIL import Image
import torchvision.transforms as T
import numpy as np
import torch

o = np.random.rand(64, 64, 3) * 255
o = np.array(o, dtype=np.uint8)
i = Image.fromarray(o)
# Tensor pipeline: uint8 tensor -> resize -> float in [0, 1] -> normalize
tr = torch.nn.Sequential(
    T.Resize(224, interpolation=T.InterpolationMode.BICUBIC),
    T.ConvertImageDtype(torch.float),
    T.Normalize([0.48145466, 0.4578275, 0.40821073], [0.26862954, 0.26130258, 0.27577711]),
)
# PIL pipeline: PIL Image -> resize -> float tensor in [0, 1] -> normalize
tr2 = T.Compose([
    T.Resize(224, interpolation=T.InterpolationMode.BICUBIC),
    T.ToTensor(),
    T.Normalize((0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711)),
])
out = tr(torch.from_numpy(o).permute(2,0,1).contiguous())
out2 = tr2(i)
print(((out - out2) ** 2).sum())
The interpolation method seems to matter A LOT, and if I use the default BILINEAR the error is ~7, but I need to use BICUBIC.
The problem seems to lie in ConvertImageDtype vs ToTensor, because if I replace
ToTensor with ConvertImageDtype results are identical (cannot do the other way around
because ToTensor is not a subclass of Module and I cannot use it with nn.Sequential).
However, the following gives identical results
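tr = torch.nn.Sequential(
    T.ConvertImageDtype(torch.float),
    T.Normalize([0.48145466, 0.4578275, 0.40821073], [0.26862954, 0.26130258, 0.27577711]))
tr2 = T.Compose([
    T.ToTensor(),
    T.Normalize((0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711))])
out = tr(torch.from_numpy(o).permute(2,0,1).contiguous())
out2 = tr2(i)
print(((out - out2) ** 2).sum())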
This means that the interpolation changes something in the results, which matters only
when I use ToTensor vs ConvertImageDtype.
Any input is appreciated.
This is documented in the torchvision Resize docs:
The output image might be different depending on its type: when downsampling, the interpolation of PIL images and tensors is slightly different, because PIL applies antialiasing. This may lead to significant differences in the performance of a network. Therefore, it is preferable to train and serve a model with the same input types. See also below the antialias parameter, which can help making the output of PIL images and tensors closer.
Passing antialias=True to Resize produces almost identical results.
This is interesting, because the documentation says that antialias
can be set to True for InterpolationMode.BILINEAR only.
Yet I am using BICUBIC and it still works.

How to handle flag/exception values

In Paraview, I am working with a dataset that uses the value -99999 as a flag value. I'd like to be able to manipulate the dataset without these values causing issues with things like glyphs and colorbars. Nominally, I'd like the data to be "ignored".
A little about the data: I've got both scalar and vector point data, sitting on a fixed 2D spatial mesh at set temporal intervals.
Although -99999 is very far beyond the values the data might otherwise show, using a threshold filter isn't an option because the flag can occur at different places at different times. The way Paraview's threshold filter works means that the point ID to a fixed point in space will change as the number of filtered points changes through time.
In case it matters, data are in a netCDF file that is read in via an XMF header file and the XDMF Reader since the CF reader doesn't work (possibly because of my unstructured triangular mesh). The netCDF data have the _FillValue global attribute, however this doesn't appear to be getting picked up on by Paraview.
You could use a Programmable Filter to replace values at or below -99999 with NaN. Provided the data is not a vtkMultiBlockDataSet, you can use the following script in the Programmable Filter:
import numpy as np
from vtk.numpy_interface import dataset_adapter as dsa
# name of the array
name = 'name'
# limit
limit = -99999
array = inputs[0].PointData[name].copy()
array[array<=limit] = np.nan
out = dsa.WrapDataObject(self.GetOutput())
out.PointData.append(array, name)
Note: if the data of interest is cell data, replace PointData with CellData in the script.
Note 2: the script was tested on ParaView 5.6.
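The replacement step itself can be tried outside ParaView with plain NumPy. This is a minimal sketch, assuming the flag is exactly -99999 and that integer arrays should be promoted to float (NaN cannot be stored in an integer array); `mask_flags` is a hypothetical helper name, not a ParaView or VTK function.

```python
import numpy as np

FLAG = -99999  # flag value from the question


def mask_flags(values, flag=FLAG):
    """Return a float copy of `values` with flagged entries replaced by NaN."""
    out = np.array(values, dtype=np.float64)  # copy; promotes integers to float
    out[out <= flag] = np.nan
    return out


scalars = mask_flags([1.5, -99999, 3.0])
vectors = mask_flags([[1, 2, 0], [-99999, -99999, -99999]])

# NaN-aware reductions skip the flagged points
print(np.nanmin(scalars), np.nanmax(scalars))
# Rows of the vector array that are entirely flagged
print(np.isnan(vectors).all(axis=1))
```

Because the array keeps its length, point IDs stay fixed through time, which is the property the threshold filter could not provide.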

Does KernelDensity.estimate in pyspark.mllib.stat.KernelDensity work when input data is normally distributed?

Does pyspark's KernelDensity.estimate work correctly on a dataset that is normally distributed? I get an error when I try that. I have filed an issue: "KernelDensity.estimate in pyspark.mllib.stat.KernelDensity throws net.razorvine.pickle.PickleException when input data is normally distributed (no error when data is not normally distributed)".
Example code:
from pyspark.mllib.stat import KernelDensity
vecRDD = sc.parallelize(colVec)
kd = KernelDensity()
kd.setSample(vecRDD)
kd.setBandwidth(3.0)
# Find density estimates for the given values
densities = kd.estimate(samplePoints)
When the data is NOT Gaussian, I get, for example:
For reference, using Scala, for Gaussian data,
val vecRDD = sc.parallelize(colVec)
val kd = new KernelDensity().setSample(vecRDD).setBandwidth(3.0)
// Find density estimates for the given values
val densities = kd.estimate(samplePoints)
I get:
I faced the same issue and was able to track down the issue to a very minimal test case. If you're using Numpy in Python to generate the data in the RDD, then that's the problem!
import numpy as np
from pyspark.mllib.stat import KernelDensity
kd = KernelDensity()
kd.setSample(sc.parallelize([0.0, 1.0, 2.0, 3.0])) # THIS WORKS
# kd.setSample(sc.parallelize([0.0, np.float32(1.0), 2.0, 3.0])) # THIS FAILS
kd.estimate([0.0, 1.0])
If this was your issue as well, simply convert the NumPy data to Python base types until the Spark issue is fixed. You can do that with the .item() method of NumPy scalars; np.asscalar did the same thing but was deprecated in NumPy 1.16 and later removed.
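The conversion can be done in one place before the data is handed to Spark. A minimal sketch, assuming the sample is a flat sequence of numbers; `to_python_floats` is a hypothetical helper name, not a Spark or NumPy function.

```python
import numpy as np


def to_python_floats(values):
    """Convert a sequence of NumPy (or mixed) numbers to built-in Python floats."""
    return [float(v) for v in np.asarray(values).ravel()]


samples = to_python_floats(np.array([0.0, 1.0, 2.0, 3.0], dtype=np.float32))
print(samples, type(samples[1]))
```

The resulting list can then be used as in the test case above, for example `kd.setSample(sc.parallelize(samples))`. For a NumPy array, `array.tolist()` gives the same built-in types.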

Do I have to preprocess test data using neural networks?

I am using Keras (version 2.0.0) and I'd like to make use of pretrained models like e.g. VGG16.
To get started, I ran the example from the Keras documentation site for extracting features with VGG16:
from keras.applications.vgg16 import VGG16
from keras.preprocessing import image
from keras.applications.vgg16 import preprocess_input
import numpy as np
model = VGG16(weights='imagenet', include_top=False)
img_path = 'elephant.jpg'
img = image.load_img(img_path, target_size=(224, 224))
x = image.img_to_array(img)
x = np.expand_dims(x, axis=0)
x = preprocess_input(x)
features = model.predict(x)
The preprocess_input() function used here bothers me
(looking at the source code, it zero-centers the input by the mean pixel).
Do I really have to preprocess input data (validation/test data) before using a trained model?
If yes: does that mean I always have to know which preprocessing steps were performed during the training phase?
If no: does preprocessing of validation/test data cause a bias?
I appreciate your help.
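Yes, you should use the preprocessing step. The pretrained weights were learned on inputs preprocessed this way, so validation and test data must go through the same step. You can retrain the model without it, but the first layers will then have to learn to center your data, which is a waste of parameters.
If you do not recenter, performance will suffer.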
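To make the recentering concrete, here is a minimal NumPy sketch of the mean-pixel zero-centering that, as far as I know, preprocess_input performs for VGG16: the channels are flipped from RGB to BGR and the ImageNet mean pixel is subtracted. `zero_center` and `BGR_MEAN` are hypothetical names used only for this illustration, not Keras API.

```python
import numpy as np

# ImageNet per-channel means in BGR order, as used by Caffe-style VGG preprocessing
BGR_MEAN = np.array([103.939, 116.779, 123.68], dtype=np.float32)


def zero_center(batch_rgb):
    """Flip RGB to BGR and subtract the mean pixel from a (N, H, W, 3) batch."""
    batch_bgr = np.asarray(batch_rgb, dtype=np.float32)[..., ::-1]
    return batch_bgr - BGR_MEAN


# A dummy batch holding one 4x4 image whose pixels all equal the mean pixel in RGB order
mean_rgb = np.array([123.68, 116.779, 103.939], dtype=np.float32)
dummy = np.tile(mean_rgb, (1, 4, 4, 1))
centered = zero_center(dummy)
print(centered.shape, np.abs(centered).max())
```

Whatever function is used, the same one has to be applied to training, validation and test images alike.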
Great thread on reddit :

How to feed image data to Caffe through HDF5, and are there existing examples?

I have had a hard time using Caffe with HDF5 for image classification and regression tasks. For some reason, training on HDF5 always fails right at the beginning: the train and test loss very quickly drop close to zero. After trying all the usual tricks, such as reducing the learning rate and adding ReLU and dropout, nothing helped, so I started to suspect that the HDF5 data I am feeding to Caffe is wrong.
I am currently working on a common dataset (the Oxford 102 category flower dataset, which also has public code). I started out with the ImageData and LMDB layers for classification, and both worked very well. Finally I used the HDF5 data layer for fine-tuning; the training prototxt is unchanged except for the data layer, which uses HDF5 instead. Again, at the start of learning the loss drops from 5 to 0.14 at iteration 60 and to 0.00146 at iteration 100, which seems to indicate that the HDF5 data is incorrect.
I have two image-and-label-to-HDF5 snippets on GitHub. Both seem to generate an HDF5 dataset, but for some reason these datasets do not work with Caffe.
I wonder whether anything is wrong with this data, what would make this example run with HDF5, or whether you have HDF5 examples for classification or regression; those would help me a lot.
One snippet is shown below:
import numpy as np
import h5py
import caffe

def generateHDF5FromText2(label_num):
    print('\nplease wait...')
    HDF5_FILE = ['hdf5_train.h5', 'hdf5_test1.h5']
    # store the training and testing data path and labels
    LIST_FILE = ['train.txt', 'test.txt']
    for kk, list_file in enumerate(LIST_FILE):
        # read train.txt or test.txt to extract all the image paths and labels into lists
        path_list = []
        label_list = []
        with open(list_file, buffering=1) as hosts_file:
            for line in hosts_file:
                line = line.rstrip()
                array = line.split(' ')
                lab = int(array[1])
                path_list.append(array[0])
                label_list.append(lab)
        print(len(path_list), len(label_list))
        # init the temp data and labels storage for HDF5
        datas = np.zeros((len(path_list), 3, 227, 227), dtype='f4')
        labels = np.zeros((len(path_list), 1), dtype="f4")
        for ii, _file in enumerate(path_list):
            # feed the image and label data to the TEMP data
            # (the load and resize call names are missing from the post; caffe.io is assumed)
            img = caffe.io.load_image( _file )
            img = caffe.io.resize( img, (227, 227, 3) ) # resize to fixed size
            img = np.transpose( img , (2,0,1))
            datas[ii] = img
            labels[ii] = int(label_list[ii])
        # store the temp data and label into the HDF5
        with h5py.File("/data2/"+HDF5_FILE[kk], 'w') as f:
            f['data'] = datas
            f['label'] = labels
One input transformation that seems to happen in the original net and is missing from your HDF5 creation is mean subtraction.
You should obtain the mean file (it looks like "imagenet_mean.binaryproto" in your example), read it into Python and subtract it from each image.
BTW, the mean file can give you a clue as to the scale of the input images (whether pixel values should be in the [0..1] range or in [0..255]).
You might find it useful to convert the binaryproto to a numpy array.
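Once the mean is available as a numpy array, the subtraction is a single broadcasted operation. This is a minimal sketch, assuming the mean has already been loaded into a (3, H, W) array with the same channel order and value scale as the images; `subtract_mean` is a hypothetical helper, and the 256x256 mean size is an assumption used only for the stand-in data.

```python
import numpy as np


def subtract_mean(datas, mean):
    """Subtract a (C, H, W) mean image from a (N, C, h, w) batch.

    If the mean image is larger than the batch images, its central crop is used.
    """
    _, _, h, w = datas.shape
    top = (mean.shape[1] - h) // 2
    left = (mean.shape[2] - w) // 2
    mean_crop = mean[:, top:top + h, left:left + w]
    return datas - mean_crop[np.newaxis].astype(datas.dtype)


# Stand-in data: 2 images of 3x227x227 and a 3x256x256 mean (shapes are assumptions)
datas = np.full((2, 3, 227, 227), 120.0, dtype="f4")
mean = np.full((3, 256, 256), 110.0, dtype="f4")
centered = subtract_mean(datas, mean)
print(centered.shape, centered.mean())
```

In the snippet from the question, this would be applied to `datas` just before it is written to the HDF5 file, after making sure the pixel range matches that of the mean file.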