More Efficient Way of Calculating Population from Data Grid and Overlapping Polygon? - mongodb

Hi folks! Apologies if this is a duplicate question; I've done some research on the topic but don't know if I'm heading in the right direction.
I have converted gridded population density data into a MongoDB collection. Each document holds a geometry object defining a population density cell as a five-node polygon (the fifth node matching the first) and a float value with the population of that region. Even though the database is huge, the cells are indexed with a 2dsphere index, so I can quickly retrieve the cell documents that intersect a geo-polygon describing a weather event or some other geofence.
The issue comes when I try to add all of the cells up. It takes an exceedingly long time, especially if the polygon covers a significant geographic area. The population data are 1 km^2 cells, and summing them can take several seconds or, in the worst case, minutes!
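For reference, pushing the summation into the database with an aggregation pipeline (rather than summing returned documents in application code) looks roughly like the sketch below. The connection string, collection, and field names ("pop_cells", "geometry", "population") are assumptions, not the actual schema.

# Hedged sketch: let MongoDB do the summation with $geoIntersects + $group,
# so only a single document comes back. Names below are placeholders.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
cells = client["census"]["pop_cells"]           # hypothetical db/collection

event_polygon = {                               # GeoJSON polygon for the event/geofence
    "type": "Polygon",
    "coordinates": [[
        [-97.5, 35.0], [-97.0, 35.0], [-97.0, 35.5],
        [-97.5, 35.5], [-97.5, 35.0],
    ]],
}

pipeline = [
    # The 2dsphere index on "geometry" drives this stage.
    {"$match": {"geometry": {"$geoIntersects": {"$geometry": event_polygon}}}},
    # Sum the per-cell population on the server.
    {"$group": {"_id": None, "total_population": {"$sum": "$population"}}},
]

result = list(cells.aggregate(pipeline))
total = result[0]["total_population"] if result else 0.0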
I had the thought of creating a quadtree-like structure in the database: a lower-resolution node set stored as a separate collection, and so on down the levels. When calculating population, I could start with the lowest-resolution set and work my way down the node "tree" through several database calls until there are no more matches. While that would increase my database calls significantly, it would reduce the sheer number of elements I have to add up at the end, which is what takes the most computational time.
I could build these data bottom-up, finding neighbours and adding the four population values that make up each node of the next lower-resolution set (a sketch of one such level follows below). This will, of course, inflate the database size and increase the number of queries to the database for a single population request.
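As an illustration of that bottom-up step, here is a rough sketch that builds one coarser level by summing 2x2 blocks of cells. It assumes the cells sit on a regular lon/lat grid with a known origin and spacing; the grid constants and all collection/field names are placeholders.

# Hedged sketch: build one coarser "pyramid" level by summing 2x2 blocks.
from collections import defaultdict
from pymongo import MongoClient

LON0, LAT0 = -180.0, -90.0          # assumed grid origin
STEP = 0.008333                     # assumed cell size in degrees (~1 km)

db = MongoClient("mongodb://localhost:27017")["census"]
fine, coarse = db["pop_cells"], db["pop_cells_level1"]   # hypothetical names

sums = defaultdict(float)
for cell in fine.find({}, {"geometry": 1, "population": 1}):
    lon, lat = cell["geometry"]["coordinates"][0][0]     # first vertex of the ring
    i = int((lon - LON0) // (2 * STEP))                  # 2x2 block indices
    j = int((lat - LAT0) // (2 * STEP))
    sums[(i, j)] += cell["population"]

docs = []
for (i, j), pop in sums.items():
    w, s = LON0 + i * 2 * STEP, LAT0 + j * 2 * STEP
    e, n = w + 2 * STEP, s + 2 * STEP
    ring = [[w, s], [e, s], [e, n], [w, n], [w, s]]      # closed five-node ring
    docs.append({"geometry": {"type": "Polygon", "coordinates": [ring]},
                 "population": pop})

if docs:
    coarse.insert_many(docs)
    coarse.create_index([("geometry", "2dsphere")])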
I haven't seen much of this done with databases. I'd like to keep it in a database (PostgreSQL would also be an option), since that gives me the ability to quickly geo-query by point or area. And since I'm returning the result from an API call, response time is of the essence!
Any advice or places to research would be greatly appreciated!!!

Related

scipy.interpolate.griddata slow due to unnecessary data

I have a map with a 600*600 equidistant x,y grid with associated scalar values.
I have around 1000 x,y coordinates at which I would like to get the bilinearly interpolated map values. Those are randomly placed in an inner central area of the map of around 400*400 size.
I decided to go with the griddata function with method linear. My understanding is that with linear interpolation I would only need the three nearest grid positions around each coordinate to get well-defined interpolated values. So I would require around 3000 data points of the map to perform the interpolation. The full 360k data points are unnecessary for this task.
Stupidly throwing the complete map in results in long execution times of half a minute. Since it's easy to narrow the map down to the area of interest, I could reduce execution time to nearly 20%.
I am now wondering if I overlooked something in my assumption that I only need the three nearest neighbours for my task. And if not, whether there is a fast solution to filter those 3000 out of the 360k. I assume looping 3000 times over the 360k lines will take longer than just throwing in the inner map.
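For context, the slow call described above looks roughly like this (the array contents are placeholders):

# Rough sketch of the setup: griddata triangulates all 360k grid points,
# which is where most of the half-minute goes. Data here are placeholders.
import numpy as np
from scipy.interpolate import griddata

gx, gy = np.meshgrid(np.arange(600.0), np.arange(600.0))
grid_points = np.column_stack([gx.ravel(), gy.ravel()])     # 360k (x, y) pairs
grid_values = np.random.rand(grid_points.shape[0])          # the map's scalar values

query_points = np.random.uniform(100, 500, size=(1000, 2))  # inner-area coordinates

interpolated = griddata(grid_points, grid_values, query_points, method="linear")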
Edit: I also had a look at a comparison of the results with the full 600*600 grid and with the reduced data points. I am actually surprised and concerned by the observation that the interpolation results partly differ significantly.
So I found out that RegularGridInterpolator is the way to go for me. It's fast, and I already have a regular grid.
I tried to sort out my findings on the differences in interpolation values and found that griddata shows unexpected behaviour for me.
Check out the issue I created for details.
https://github.com/scipy/scipy/issues/17378
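A minimal sketch of the RegularGridInterpolator approach mentioned above, using the same placeholder data:

# Minimal sketch: interpolate on the regular grid directly instead of
# triangulating scattered points. Data here are placeholders.
import numpy as np
from scipy.interpolate import RegularGridInterpolator

x = np.arange(600.0)
y = np.arange(600.0)
values = np.random.rand(600, 600)                 # the map's scalar values

interp = RegularGridInterpolator((x, y), values, method="linear")

query_points = np.random.uniform(100, 500, size=(1000, 2))
interpolated = interp(query_points)               # shape (1000,)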

Compare all data in database at the same time (real time)

I have a problem with my Android app. I have a value x (whatever it is) and I have data in the database, and I want to compare the value of x with all the data in the database at the same time, in real time.
The app is using SQLite.
I used a loop, but when the database is large my app lags while comparing all the data. My code is:
public void Check_Distance(Location Current_Location, ArrayList<Location> LocationArrayList1)
{
    double Distance;
    for (int i = 0; i < LocationArrayList1.size(); i++)
    {
        Distance = distanceBetween(Current_Location, LocationArrayList1.get(i));
        if (Distance <= 0.1 * 1000) { // if the distance is less than 100 m, play a sound
            Notification_Sound();
        }
    }
}
You can't look at every record in the database at the exact same time. That's called quantum computing, and is currently an active research area where people far smarter than you or I are currently spending millions of dollars to try and create a machine that can do this kind of parallel processing.
That being said, you can make your algorithm more efficient, but that takes some effort. Both of the approaches below are based on the idea of very quickly eliminating the majority of the locations that are obviously too far away, and performing more in-depth checks only on those that could be in range.
One method is to sort the locations in ascending order in two arrays - one by North/South and the other by East/West. Find all entries within a given distance of the current position in each list, then combine the results to get a list of points within a box of X distance from the location. This box will have a much smaller number of points within it that you can then apply an iterative, circular, distance based approach to.
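A rough, language-agnostic sketch of that sort-and-intersect pre-filter (shown here in Python; in a real app you would sort the two arrays once up front rather than on every check):

# Hedged sketch of the pre-filter: two sorted views of the points, a range
# query on each, and an intersection of the survivors. Names are placeholders.
from bisect import bisect_left, bisect_right

def candidates_within_box(points, center, half_size):
    """points: list of (lat, lon). Returns indices whose lat AND lon fall
    within +/- half_size of the center, i.e. a coarse bounding box."""
    by_lat = sorted(range(len(points)), key=lambda i: points[i][0])
    by_lon = sorted(range(len(points)), key=lambda i: points[i][1])
    lats = [points[i][0] for i in by_lat]
    lons = [points[i][1] for i in by_lon]

    lat_ids = set(by_lat[bisect_left(lats, center[0] - half_size):
                         bisect_right(lats, center[0] + half_size)])
    lon_ids = set(by_lon[bisect_left(lons, center[1] - half_size):
                         bisect_right(lons, center[1] + half_size)])
    # Only these few candidates need the exact (circular) distance check.
    return lat_ids & lon_ids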
Another is to create a quadtree. This would subdivide the map area into a set of bounding volumes, where each volume holds either a set of points or further bounding volumes. You can then place your search area down and find all the quadtree volumes that intersect your circular search area, greatly reducing the number of locations you need to do a true distance check on.

What's the most efficient way to use large data from Excel in my C# code?

I ran a computer simulation of my pendulum to measure the time taken to reach the lowest point, for every velocity and every angle.
As you can imagine there is a lot of data: thousands of lines covering all angles and velocities.
On every frame, I will measure the velocity and angle of the pendulum and look for the closest data in my Excel spreadsheet.
How can I go about this to make sure it's not too CPU-intensive?
Should I create a massive array where every element corresponds to a certain angle? For example, myArray[30] would hold all velocities and times for my data between 30.0 and 30.999 degrees. (That way I would avoid lots of if statements.)
Or should I keep everything in my Excel spreadsheet?
Any suggestion?
The best approach in my opinion would be dividing your data into intervals based on its distribution, since you have to access that data every frame. Then, when you measure the velocity and angle, you can find the matching interval and access only that part of your data.
I would find the maximum and minimum of your data points while importing into Unity and then split that range into buckets of width (maximum - minimum) / NumOfIntervals. Let's say your interval size is 5 for each angle. When you get an angle of 17, you can do (int)(17 / 5) = 3 and go to index 3 in your structure (assuming indexes start from zero). This can be a dictionary or an array of instances of an arbitrary class, depending on your data.
I can try to help further if you can share the structure of your data, but in my opinion an even distribution of the data across the intervals is important.
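A small sketch of that bucketing idea (shown in Python for brevity; the row layout and names are assumptions):

# Hedged sketch: pre-sort the simulation rows into fixed-width angle buckets
# once, then each frame only search the bucket the measured angle falls into.
INTERVAL = 5.0   # degrees per bucket, as in the example above

def build_buckets(rows):
    """rows: iterable of (angle_deg, velocity, time) tuples.
    Returns a dict: bucket index -> list of rows in that angle interval."""
    buckets = {}
    for angle, velocity, time in rows:
        buckets.setdefault(int(angle // INTERVAL), []).append((angle, velocity, time))
    return buckets

def lookup(buckets, angle, velocity):
    """Return the row in the angle's bucket closest to (angle, velocity)."""
    bucket = buckets.get(int(angle // INTERVAL), [])
    return min(bucket,
               key=lambda r: (r[0] - angle) ** 2 + (r[1] - velocity) ** 2,
               default=None)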

Clarification on OVERPASS_MAX_QUERY_AREA_SIZE default (OSMnx, Overpass API)

I am using OSMnx to query the Overpass API. I've noticed that it has a fairly large default for the maximum query area size:
OVERPASS_MAX_QUERY_AREA_SIZE = 50*1000*50*1000
This value is used to subdivide "larger" polygons into chunks to submit to the Overpass API.
I'd like to understand why the area is so large. For example, the entirety of San Francisco (~50 sq miles) is "simplified" to a single query.
Key questions:
Is there any advantage to reducing query sizes submitted to the Overpass API?*
Is there any advantage to reducing the complexity of shapes/polygons being submitted to the Overpass API (that is, using rectangles with just 4 corner coordinates), versus more complex polygons?**
*Note: Example query that I would be running (looking for the ways that would constitute a walk network):
[out:json][timeout:180];(way["highway"]["area"!~"yes"]["highway"!~"cycleway|motor|proposed|construction|abandoned|platform|raceway"]["foot"!~"no"]["service"!~"private"]["access"!~"private"](37.778007,-122.445467,37.783454,-122.438958);>;);out;
**Note: This question is partially answered in this other post. That said, that question does not focus completely on the performance implications, and is not asked in the context of the variable area threshold used in OSMnx to subdivide "larger" geometries.
max_query_area_size appears to be a heuristic value someone came up with after a number of test runs. From the Overpass API side, this figure has pretty much no meaning on its own.
It may be completely off for different kinds of queries, or even in a different area than SF. As an example: for infrequent tags, it's usually better to go with a rather large bounding box than to fire off a huge number of queries with tiny bounding boxes.
For some statement types, though, a large bounding box may cause significantly longer processing times. In that case, splitting the area into smaller pieces may help. Some queries might even consume too much memory, which forces you to split your bounding box into smaller pieces.
As you didn't mention the kind of query you want to run, it's very difficult to provide general advice. It's like asking for the best way to write SQL statements without providing any additional context.
Using bounding boxes instead of (poly:...) has performance advantages. If you can specify a bounding box, use the bounding box filter rather than passing four lat/lon pairs to the poly filter, as in the comparison sketched below.
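For illustration, here is the walk-network query from above sent once with a bounding-box filter and once with an equivalent (poly:...) filter. The endpoint URL and the use of requests are assumptions for this sketch, not how OSMnx itself submits queries.

# Hedged sketch: same query, bounding-box filter vs. poly filter.
import requests

OVERPASS_URL = "https://overpass-api.de/api/interpreter"
FILTERS = ('["highway"]["area"!~"yes"]'
           '["highway"!~"cycleway|motor|proposed|construction|abandoned|platform|raceway"]'
           '["foot"!~"no"]["service"!~"private"]["access"!~"private"]')

# Bounding-box form (south, west, north, east), as in the query above.
bbox_query = (f'[out:json][timeout:180];(way{FILTERS}'
              '(37.778007,-122.445467,37.783454,-122.438958);>;);out;')

# The same rectangle expressed as a polygon of lat/lon pairs.
poly = ("37.778007 -122.445467 37.778007 -122.438958 "
        "37.783454 -122.438958 37.783454 -122.445467")
poly_query = f'[out:json][timeout:180];(way{FILTERS}(poly:"{poly}");>;);out;'

bbox_elements = requests.post(OVERPASS_URL, data={"data": bbox_query}).json()["elements"]
poly_elements = requests.post(OVERPASS_URL, data={"data": poly_query}).json()["elements"]
print(len(bbox_elements), len(poly_elements))   # same network, different filter cost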

Determine in which polygons a point is

I have tremendous flows of point data (in 2D): thousands of points every second. On this map I have several fixed polygons (dozens to a few hundred of them).
I would like to determine in real time (on the order of a few milliseconds on a rather powerful laptop), for each point, which polygons it lies in (the polygons can intersect).
I thought I'd use the ray casting algorithm.
Nevertheless, I need a way to preprocess the data, to avoid scanning every polygon.
I am therefore considering tree approaches (PM quadtree or R-tree?). Is there any other relevant method?
Is there a good PM quadtree implementation you would recommend (in whatever language, preferably C(++), Java or Python)?
I have developed a library of several multi-dimensional indexes in Java; it can be found here. It contains an R*-tree, an STR-tree, four quadtrees (two for points, two for rectangles) and a critbit tree (which can be used for spatial data by interleaving the coordinates). I also developed the PH-tree.
These are all rectangle/point-based trees, so you would have to convert your polygons into rectangles, for example by calculating their bounding boxes. For every returned bounding box you would then have to check manually whether the polygon really contains your point.
If your rectangles are not too elongated, this should still be efficient.
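A minimal sketch of that bounds-then-exact-check idea, here using Shapely rather than the Java library (the polygons and the query point are placeholders):

# Hedged sketch: cheap bounding-box test first, exact containment test only
# on the surviving candidates.
from shapely.geometry import Point, Polygon
from shapely.prepared import prep

polygons = [Polygon([(0, 0), (4, 0), (4, 4), (0, 4)]),
            Polygon([(2, 2), (8, 2), (8, 8), (2, 8)])]
bounds = [p.bounds for p in polygons]          # (minx, miny, maxx, maxy)
prepared = [prep(p) for p in polygons]         # fast repeated containment tests

def containing_polygons(x, y):
    pt = Point(x, y)
    hits = []
    for i, (minx, miny, maxx, maxy) in enumerate(bounds):
        if minx <= x <= maxx and miny <= y <= maxy:   # cheap rectangle test
            if prepared[i].contains(pt):              # exact test on candidates only
                hits.append(i)
    return hits

print(containing_polygons(3, 3))   # -> [0, 1]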
I usually find the PH-tree the most efficient: it has fast build times and very fast query times when a point intersects 100 rectangles or fewer (even better with 10 or fewer). STR-/R*-trees are better with larger overlap sizes (1000+). The quadtrees are a bit unreliable; they have problems with numeric precision when inserting millions of elements.
Assuming a 3D tree with 1 million rectangles and on average one result per query, the PH-Tree requires about 3 microseconds per query on my desktop (i7 4xxx), i.e. 300 queries per millisecond.