How to create a feature collection with mean values of certain features and the location/shape of a different feature collection? - merge

I have two feature collections: one contains point data with measurements, and the other contains polygons marking the different clusters where several measurements were taken close together. In Google Earth Engine I am trying to create a new (point) feature collection (or edit the polygon features) that contains the average measured values for every separate cluster.
I used the following code to join the two different feature collections (so that the polygons also contain the point data):
var mean = ee.Filter.intersects({
  leftField: '.geo',
  rightField: '.geo'
});
var saveAllJoin = ee.Join.saveAll({
  matchesKey: 'Measurements',
});
var intersect = saveAllJoin.apply(Clusters, Measurements, mean);
However, since multiple measurements are taken within one cluster, this results in a FeatureCollection whose features each contain a list of the measurement points located within that cluster. Instead, I am looking for the average measured values as a property of the polygons. What is the way to do this in Google Earth Engine (or possibly in QGIS)?
I have tried using ee.FeatureCollection.map in order to calculate the mean value at every individual polygon:
var clust = ee.FeatureCollection(Clusters).map(function(feature) {
  var meanClay = Measurements.reduceColumns({
    reducer: ee.Reducer.mean(),
    selectors: ['Clay']
  });
  return feature.set('mean', meanClay);
});
Now if I print the variable "clust" I get a FeatureCollection with mean values for the measured attribute Clay. However, every feature gets the same value (the mean of all the measured points, instead of just those within a specific cluster).
For clarity: I have a shapefile with 78 measurement locations (points), which are loaded into my script as a FeatureCollection and contain the measured values. Besides this I also have a shapefile with polygons indicating 16 areas where a cluster of measurements was performed (so around 4-5 measurements in each cluster). Now I am trying to get the average of all the measurements (points) within each polygon (cluster), for every individual polygon.

I solved it by using filterBounds() on the point data. Here filterBounds() filters the point data down to only the points within a specific cluster.
var clust = ee.FeatureCollection(Clusters).map(function(feature) {
  var meanClay = ee.FeatureCollection(Measurements.filterBounds(feature.geometry()))
    .reduceColumns({
      reducer: ee.Reducer.mean(),
      selectors: ['Clay']
    });
  return feature.set(meanClay);
});
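For the QGIS/Python route, the same per-cluster averaging logic (filter the points to each polygon, then take the mean) can be sketched in plain Python. The data and the rectangular "clusters" below are made-up stand-ins for the real geometries:

```python
# Toy stand-ins: clusters as axis-aligned rectangles, measurements as (x, y, clay) points.
clusters = {"A": (0, 0, 10, 10), "B": (20, 0, 30, 10)}  # (xmin, ymin, xmax, ymax)
measurements = [
    (1, 2, 30.0), (3, 4, 34.0),    # points inside cluster A
    (22, 5, 10.0), (25, 8, 14.0),  # points inside cluster B
]

def contains(rect, x, y):
    xmin, ymin, xmax, ymax = rect
    return xmin <= x <= xmax and ymin <= y <= ymax

# The analogue of filterBounds + reduceColumns: filter points per cluster, then average.
mean_clay = {}
for name, rect in clusters.items():
    vals = [clay for x, y, clay in measurements if contains(rect, x, y)]
    mean_clay[name] = sum(vals) / len(vals) if vals else None
```

With real shapefiles the point-in-rectangle test would be replaced by a proper point-in-polygon predicate (e.g. a spatial join in QGIS), but the filter-then-reduce structure is the same.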

Related

Convert a numpy array of network values to a labelled node list with values

I have some graphs built with NetworkX with labelled nodes (names). I have computed trophic levels with the trophic tools script and obtained a numpy array of trophic values.
I want to create a node list of these values with the corresponding labels, similar to other topological indices (e.g. nx.degree_centrality is easily interpretable, as every node name is followed by its value).
Can someone suggest how to merge or convert the numpy array into a labelled node list?
Thanks in advance!
I realised that the algorithm doesn't produce a real Laplacian and that every entry in the array is simply the node's trophic value. I assumed the order of the array values is the same as that of the original node list, and the final data seems to agree with that (plants with lower values and predators with higher values).
This is the code to join node names with their trophic values, in case someone needs to compute a similar trophic index:
import pandas as pd
# ta is the trophic tools module; G is the NetworkX graph with labelled nodes.
trophic_levels = ta.trophic_levels(G)
trophic_levels_list = list(trophic_levels)
trophic_levels_series = pd.Series(trophic_levels_list)
# One row per node, indexed by node name, in G.nodes order.
G_nodes = pd.DataFrame.from_dict(dict(G.nodes(data=True)), orient='index')
G_nodes.reset_index(inplace=True)
G_nodes['trophic_level'] = trophic_levels_series
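A lighter-weight alternative, assuming (as concluded above) that the array order matches `list(G.nodes)`, is to zip the node names with the values directly. Plain lists stand in for the graph's node list and the numpy array here:

```python
# Stand-ins: in practice nodes = list(G.nodes) and values = ta.trophic_levels(G).
nodes = ["plant", "herbivore", "predator"]
values = [1.0, 2.0, 3.0]

# One labelled mapping, analogous to the dict returned by nx.degree_centrality.
trophic_by_node = dict(zip(nodes, values))
```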

How to get the distance from a source to all points within a maximum distance with graph-tool (using Dijkstra algorithm)

I'm trying to use graph-tool to quickly calculate the distance from a source vertex to all vertices within a maximum distance, using a cost property that I have available for each edge.
I suppose I have to use the dijkstra_search function, but how do I specify the stopping criterion? I have a working example, but I think it traverses the entire graph (taking several seconds, as it's the entire road network of Holland).
Second, what's the fastest way of generating a list of (vertex-id, distance) pairs once the search finishes?
Turns out it was pretty easy:
import graph_tool.all as gt
import numpy as np

# Distance map initialised to infinity for every vertex.
dm = G.new_vp("double", np.inf)
# Stops exploring once distances exceed max_dist; unreached vertices keep np.inf.
gt.shortest_distance(G, source=sourcenode, weights=EdgePropertyMap,
                     dist_map=dm, max_dist=0.1)
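For the second part, building the (vertex-id, distance) list: the distance map's `.a` attribute exposes it as a numpy array (one entry per vertex), so the unreached vertices, which are still at `np.inf`, can be filtered out vectorially. A toy array stands in for `dm.a` below:

```python
import numpy as np

# Toy stand-in for dm.a after shortest_distance with max_dist=0.1.
dist = np.array([0.0, 0.05, np.inf, 0.09, np.inf])

# Keep only vertices actually reached within max_dist.
reached = np.flatnonzero(np.isfinite(dist))
pairs = list(zip(reached.tolist(), dist[reached].tolist()))
```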

How can I find a max value of a selected region in a fit?

I am trying to find the max value of a curve-fitted plot within a certain region of the plot. I have a 4th-order fit, and when I use max(x), the answer is an extrapolated value, while I am actually looking for the max value of the 'bump' in my data.
So the question: how do I select the max for only a certain region of the data while using a cfit? Or how do I exclude a part of the fit?
LF = pol4Fit(L,F);
Coefs= coeffvalues(LF);
This code only gives the optimum (the max value) of the real data points:
L_opt = feval(LF,L);
[F_opt,Num_Length] = max(L_opt);
Opt_Length = L(Num_Length);
So now I was trying something like y = max(LF(F)), but this does not select a specific region.
Try to evaluate only the region you are interested in.
For instance, let's say the specific region is a vector named S.
You can simply rewrite your code as below:
L_opt = feval(LF,S);
Use the specific domain region S instead of the whole domain L, and the fit is only evaluated on the region you are concerned with. Then using the max function should work properly for you.
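The same restrict-then-maximize idea can be sketched with numpy as a stand-in for the MATLAB cfit (toy data; `np.polyfit`/`np.polyval` play the roles of pol4Fit and feval):

```python
import numpy as np

# Toy data with a bump near x = 4.
L = np.linspace(0, 10, 50)
F = -(L - 4.0) ** 2 + 16.0

coefs = np.polyfit(L, F, 4)        # 4th-order fit, like pol4Fit
S = np.linspace(2.0, 6.0, 401)     # the region of interest, the analogue of S
fit_vals = np.polyval(coefs, S)    # evaluate the fit only on S

i = int(np.argmax(fit_vals))
peak_x, peak_y = float(S[i]), float(fit_vals[i])
```

Because the fit is evaluated only on S, the extrapolated tails outside the region can no longer win the max.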

Clusters merge threshold

I'm working with mean shift; this procedure calculates where every point in the data set converges. I can also calculate the Euclidean distance between the coordinates where two distinct points converged, but I have to give a threshold to say: if (distance < threshold), then these points belong to the same cluster and I can merge them.
How can I find the correct value to use as the threshold?
(I can use any value, and the result depends on it, but I need the optimal value.)
I've implemented mean-shift clustering several times and have run into this same issue. Depending on how many iterations you're willing to shift each point for, or what your termination criterion is, there is usually some post-processing step where you have to group the shifted points into clusters. Points that theoretically shift to the same mode need not practically end up directly on top of each other.
I think the best and most general way to do this is to use a threshold based on the kernel bandwidth, as suggested in the comments. In the past my code to do this post processing has usually looked something like this:
threshold = 0.5 * kernel_bandwidth
clusters = []
for p in shifted_points:
    cluster = findExistingClusterWithinThresholdOfPoint(p, clusters, threshold)
    if cluster is None:
        # create a new cluster with p as its first point
        clusters.append([p])
    else:
        # add p to the existing cluster
        cluster.append(p)
For the findExistingClusterWithinThresholdOfPoint function I usually use the minimum distance of p to each currently defined cluster.
This seems to work pretty well. Hope this helps.

iPhone hard computation and caching

I have a problem. I have a database with 500k records. Each record stores latitude, longitude, species of animal, and date of observation. I must draw a grid (15x10) above a MapKit view that shows the concentration of species in each grid cell. Each cell is a 32x32 box.
If I calculate this at run time it is very slow.
Does somebody have an idea how to cache it? In memory or in the database?
Data structure:
Observation:
Latitude
Longitude
Date
Specie
some other unimportant data
Screen sample:
(screenshot: http://img6.imageshack.us/img6/7562/20091204201332.png)
Each red box's opacity shows the count of species in that region.
Code that I use now (data is a select from the database: all observations in the map region):
for (int row = 0; row < rows; row++)
{
    for (int column = 0; column < columns; column++)
    {
        speciesPerBox = 0;
        minG = boxes[row][column].longitude;
        if (column != columns-1) {
            maxG = boxes[row][column+1].longitude;
        } else {
            maxG = buttomRight.longitude;
        }
        maxL = boxes[row][column].latitude;
        if (row != rows-1) {
            minL = boxes[row+1][column].latitude;
        } else {
            minL = buttomRight.latitude;
        }
        for (int i = 0; i < sightingCount; i++) {
            l = data[i].latitude;
            g = data[i].longitude;
            if (l >= minL && l < maxL && g >= minG && g < maxG) {
                hasSpecie = NO;
                for (int j = 0; j < speciesPerBox; j++) {
                    if (speciesCountArray[j] == data[i].specie) {
                        hasSpecie = YES;
                    }
                }
                if (hasSpecie == NO) {
                    speciesCountArray[speciesPerBox] = data[i].specie;
                    speciesPerBox++;
                }
            }
        }
        mapData[row][column].count = speciesPerBox;
    }
}
Since your data is static, you can pre-compute the species count for each grid cell and store that in the database instead of all the location coordinates.
Since you have 15 x 10 = 150 cells, you'll end up with 150 * [number of species] records in the database, which should be a much smaller number.
Also, make sure you have indexes on the proper columns. Otherwise, your queries will have to scan every single record over and over again.
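A sketch of that precomputation with sqlite (table and column names are made up for illustration): aggregate distinct species per cell once, index the small result table, and query it at render time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE observation (cell_row INTEGER, cell_col INTEGER, specie TEXT);
INSERT INTO observation VALUES (0, 0, 'fox'), (0, 0, 'fox'), (0, 0, 'owl'), (1, 2, 'fox');
-- Precompute once: number of distinct species per grid cell.
CREATE TABLE cell_counts AS
  SELECT cell_row, cell_col, COUNT(DISTINCT specie) AS species_count
  FROM observation GROUP BY cell_row, cell_col;
-- Index so per-cell lookups don't scan the whole table.
CREATE INDEX idx_cell ON cell_counts (cell_row, cell_col);
""")
count = conn.execute(
    "SELECT species_count FROM cell_counts WHERE cell_row=0 AND cell_col=0"
).fetchone()[0]
```

At display time only the 150-row cell_counts table is touched, never the 500k raw observations.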
The loop for (int i=0; i<sightingCount; i++) is killing your performance, especially the large number of if (l>=minL && l<maxL && g>=minG && g<maxG) tests, where most of the sightings will be rejected.
How large is sightingCount?
First you should use a kind of spatial optimization, e.g. a simple one: store species count lists per cell (let's call them "zones"). Define those zones rather large, so that you do not waste space. But smaller zones provide better performance, and too-small zones will reverse the effect. So make it configurable and test different zone sizes to find a good compromise!
When it's time to sum up the number of species in a cell for rendering, determine which zones the given cell overlaps (a rather simple and fast "rectangle overlap" test). Then you only have to check the species counts of those zones. This greatly reduces the iterations of your inner loop!
That's the idea behind most "spatial optimizations": divide and conquer. Here you divide your space, and then you can reject a large number of irrelevant "sightings" early with minimal effort (the added effort is the rectangle-overlap test, but each test rejects multiple sightings, whereas your current code tests every single sighting for relevance).
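The zone idea can be sketched in Python (all names hypothetical): zones are fixed-size lat/lon bins holding per-zone species sets, and a cell only visits the few bins it overlaps instead of every sighting. Note the lookup is coarse at zone granularity; the answer's per-zone species counts have the same property.

```python
from collections import defaultdict

ZONE = 1.0  # zone size in degrees; make this configurable and tune it

def zone_key(lat, lon):
    return (int(lat // ZONE), int(lon // ZONE))

# Build once: species sets per zone instead of raw sightings.
sightings = [(52.1, 4.2, "fox"), (52.2, 4.3, "owl"), (58.9, 4.2, "fox")]
zones = defaultdict(set)
for lat, lon, specie in sightings:
    zones[zone_key(lat, lon)].add(specie)

def species_in_cell(min_lat, max_lat, min_lon, max_lon):
    """Union the species of only those zones overlapping the cell rectangle."""
    found = set()
    for zr in range(int(min_lat // ZONE), int(max_lat // ZONE) + 1):
        for zc in range(int(min_lon // ZONE), int(max_lon // ZONE) + 1):
            found |= zones.get((zr, zc), set())
    return found
```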
In a second step, also apply some obvious code optimizations: e.g. minL and maxL do not change per column, so computing them can be moved to the outer loop (just before for (int column = 0; ...)).
As the latitudes of the grid rows are evenly distributed, you can even remove them from your grid cells, which saves some time in your iteration. Here is an example (the spatial optimization not included):
maxL = boxes[0][0].latitude;
incL = (maxL - buttomRight.latitude) / rows;   // per-row latitude step
minL = maxL - incL;
for (int row = 0; row < rows; row++)
{
    for (int column = 0; column < columns; column++)
    {
        speciesPerBox = 0;
        minG = boxes[row][column].longitude;
        if (column != columns-1) {
            maxG = boxes[row][column+1].longitude;
        } else {
            maxG = buttomRight.longitude;
        }
        ...
    }
    ...
    maxL = minL;     // bottom edge of this row becomes the top edge of the next
    minL -= incL;    // step one row further south
}
Maybe this also works for the longitude loop, but longitude may not be evenly distributed (i.e. "incG" would differ at each step).
Note that the spatial optimization will make a huge difference, the loop optimizations only a small (but still worth) difference.
With 500k records this sounds like a job for Core Data, preferably Core Data on a desktop. If the data isn't being updated in real time, you should process the data on heavier hardware and just use the iPhone to display it. That would massively simplify the app, because you would just store the value for each map cell.
Even if you did want to process it on the iPhone, you should have the app process the data once and save the results. There appears to be no reason to have the app recalculate the species value of every map cell every time it wants to display a cell.
I would suggest creating an entity in Core Data to represent observations, then another entity to represent geographical squares. Set a relationship between the squares and the observations that fall within each square. Then create a calculated species value in the square entity. You would then only have to recalculate the species value if one of the observations changed.
This is the kind of problem that object graphs were created for, even if the data is being continuously updated. Core Data would only perform the calculations needed to accommodate the small number of observation objects that changed at any given time, and it would do so in a highly optimized manner.
Edit01:
Approaching the problem from a completely different angle, within Core Data:
(1) Create an object graph of observation records such that each observation object has a reciprocal relationship to the other observation objects closest to it geographically. This would create an object graph that looks like a flat, irregular net.
(2) Create methods on the observation-record class that (a) determine if the record lies within the bounds of an arbitrary geographic square, (b) ask each of its related records whether they are also in the square, and (c) return its own species count and the count of all the related records.
(3) Divide your map into some reasonably small squares, e.g. one second of arc square. Within each square select one linked record and add it to a list. Choose some percentage of all records, like 1 in every 100 or 1,000, so that you cut the list down from 500k to a sublist that can be quickly searched by a brute-force predicate. Let's call the records in this list the gridflags.
(4) When the user zooms in, use brute force to find all the gridflag records within the geographical grid. Then ask each gridflag record to send messages to each of its linked records to see (a) whether they're inside the grid, (b) what their species count is, and (c) what the count is for their linked records that are also within the grid. (Use a flag to make sure each record is only queried once per search, to prevent runaway recursion.)
This way, you only have to find one record inside each arbitrarily sized grid cell and that record will find all the other records for you. Instead of stepping through each record to see which record goes in what cell every time, you just have to process the records in each cell and those immediately adjacent. As you zoom in, the number of records you actually query shrinks instead of remaining constant. If a grid cell has only a handful of records, you only have to query a handful of records.
This would take some effort and time to set up, but once you did, it would be rather efficient, especially when zoomed way in. For the top level, just use a preprocessed static map.
Hope I explained that well enough. It's hard to convey verbally.