Geopandas convert crs - pyspark

I have created a geopandas GeoDataFrame with 50 million records which contain Latitude/Longitude in CRS 3857, and I want to convert it to 4326. Since the dataset is huge, geopandas is unable to convert it. How can I execute this in a distributed manner?
import geopandas as gpd
from shapely.geometry import Point

def convert_crs(sdf):
    df = sdf.toPandas()
    gdf = gpd.GeoDataFrame(
        df.drop(['Longitude', 'Latitude'], axis=1),
        crs={'init': 'epsg:4326'},
        geometry=[Point(xy) for xy in zip(df.Longitude, df.Latitude)])
    return gdf

result_gdf = convert_crs(grid_df)

See: https://github.com/geopandas/geopandas/issues/1400
This is very fast and memory efficient:
from pyproj import Transformer

# Build one reusable transformer; to go from EPSG:3857 back to EPSG:4326
# (as asked in the question), swap the two CRS arguments.
trans = Transformer.from_crs(
    "EPSG:4326",
    "EPSG:3857",
    always_xy=True,
)
xx, yy = trans.transform(df["Longitude"].values, df["Latitude"].values)
df["X"] = xx
df["Y"] = yy

See the geopandas docs on installation and make sure you have the latest versions of geopandas and PyGEOS installed. From the installation docs:
Work is ongoing to improve the performance of GeoPandas. Currently, the fast implementations of basic spatial operations live in the PyGEOS package (but work is under way to contribute those improvements to Shapely). Starting with GeoPandas 0.8, it is possible to optionally use those experimental speedups by installing PyGEOS.
Note the caveat that to_crs will ignore & drop any z coordinate information, so if this is important you unfortunately cannot use these speedups and something like dask_geopandas may be required.
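If you do hit that caveat, or simply want to parallelise the reprojection, a minimal dask_geopandas sketch could look like the following; the partition count and the one-point stand-in data are assumptions, and the real gdf would be built as in the question:

import dask_geopandas
import geopandas as gpd

# Stand-in for the real 50M-point GeoDataFrame from the question (geometry in EPSG:3857).
gdf = gpd.GeoDataFrame(geometry=gpd.points_from_xy([1898084.0], [6623998.0]), crs="EPSG:3857")

dgdf = dask_geopandas.from_geopandas(gdf, npartitions=16)  # partition count is arbitrary
result = dgdf.to_crs("EPSG:4326").compute()                # partition-wise reprojection, back to GeoPandas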
However, with a recent version of geopandas and PyGEOS installed, converting the CRS of 50 million points should be possible. The following generates 50 million random points (<1s), creates a GeoDataFrame with geometries from the points in WGS84 (18s), converts them all to Web Mercator (1m21s) and then converts them back to WGS84 (54s):
In [1]: import geopandas as gpd, pandas as pd, numpy as np
In [2]: %%time
...: n = int(50e6)
...: lats = np.random.random(size=n) * 180 - 90
...: lons = np.random.random(size=n) * 360 - 180
...:
...:
CPU times: user 613 ms, sys: 161 ms, total: 774 ms
Wall time: 785 ms
In [3]: %%time
...: df = gpd.GeoDataFrame(geometry=gpd.points_from_xy(lons, lats, crs="epsg:4326"))
...:
...:
CPU times: user 11.7 s, sys: 4.66 s, total: 16.4 s
Wall time: 17.8 s
In [4]: %%time
...: df_mercator = df.to_crs("epsg:3857")
...:
...:
CPU times: user 1min 1s, sys: 13.7 s, total: 1min 15s
Wall time: 1min 21s
In [5]: %%time
...: df_wgs84 = df_mercator.to_crs("epsg:4326")
...:
...:
CPU times: user 39.4 s, sys: 9.59 s, total: 49 s
Wall time: 54 s
I ran this on a 2021 Apple M1 Max chip with 32 GB of memory using GeoPandas v0.10.2 and PyGEOS v0.12.0. Real memory usage peaked at around 9 GB, so it's possible your computer is facing memory constraints, or the runtime may be the issue. If so, additional debugging details and the full workflow would definitely be helpful! But this seems like a workflow that should be doable on most computers - you may need to partition the data and work through it in chunks if you're facing memory constraints, but it's within a single order of magnitude of what most computers should be able to handle.
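If memory rather than runtime is the limiting factor, partitioning can be as simple as reprojecting the frame in slices. A rough sketch (the chunk size is arbitrary; the saving comes from the projection machinery only ever seeing one slice at a time):

import pandas as pd
import geopandas as gpd

def to_crs_chunked(gdf, target_crs="epsg:4326", chunk_size=5_000_000):
    # Reproject one slice at a time to cap the transient memory used by to_crs.
    pieces = [
        gdf.iloc[start:start + chunk_size].to_crs(target_crs)
        for start in range(0, len(gdf), chunk_size)
    ]
    return pd.concat(pieces)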

I hope this answer is fair enough, because it will effectively solve your problem for any size of dataset. And it's a well-trodden kind of answer to how to deal with data that's too big for memory.
Answer: store your data in PostGIS.
You would then have two options for doing what you want:
1) Do data manipulations in PostGIS, using its geo-spatial SQL syntax. The database will do the memory management for you.
2) Retrieve data a chunk at a time, do the manipulation in GeoPandas and rewrite to the database (see the sketch after this answer).
In my experience it's solid, reliable and pretty well integrated with GeoPandas via GeoAlchemy2.
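A rough sketch of option 2) using GeoPandas' PostGIS helpers; the connection string, table and column names, and chunk size are all hypothetical, and to_postgis needs GeoAlchemy2 installed:

import geopandas as gpd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost:5432/mydb")  # hypothetical connection
chunk_size = 1_000_000

offset = 0
while True:
    # Pull one chunk, ordered by primary key so chunks don't overlap.
    sql = f"SELECT * FROM grid_points ORDER BY id LIMIT {chunk_size} OFFSET {offset}"
    chunk = gpd.read_postgis(sql, engine, geom_col="geom")
    if chunk.empty:
        break
    # Reproject in GeoPandas and append the result to an output table.
    chunk.to_crs("EPSG:4326").to_postgis("grid_points_4326", engine, if_exists="append")
    offset += chunk_size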

Related

A memory efficient and fast alternative to usual `expm()` function in MATLAB or Python?

People have asked similar questions before, but none has a satisfactory answer. I'm trying to solve the Lindblad master equation, and the matrices I'm trying to simulate are of order 10000 x 10000. The problem is the exponentiation of the matrix, which consumes a lot of RAM.
The MATLAB and Python expm() functions take around 20 s and 80 s, respectively, for a matrix of size 1000 x 1000. The code is shown below.
pd = makedist('Normal');
N = 1000;
r = random(pd ,[N, N]);
t0 = tic;
r = expm(r);
t_total = toc(t0);
The problem comes when I try to do the same for a matrix of size 10000 x 10000. Whenever I apply expm(), the RAM usage grows until it takes all the RAM and swap on my PC (I have 128 GB RAM and a 64-core CPU), and it's the same in both MATLAB and SciPy. I don't understand what is taking so much RAM, how I can run expm() efficiently, or whether it is possible at all. Even if I could do it efficiently in another language, that would be really helpful!
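One direction worth trying (not from the original post): if, as in propagating a Lindblad master equation, the exponential is only ever applied to a state vector, SciPy can compute that action without materialising the dense 10000 x 10000 exponential at all. A sketch using a sparse random stand-in matrix:

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import expm_multiply

n = 10_000
# Sparse stand-in for the Liouvillian; Lindblad superoperators are typically very sparse.
A = sp.random(n, n, density=1e-4, format="csr", dtype=np.float64)
v = np.random.rand(n)  # stand-in for the vectorised density matrix

# Computes expm(A) @ v without ever forming the dense matrix exponential.
w = expm_multiply(A, v)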

How to precalculate expensive Expressions in Polars (in groupby-s and in general)?

I'm having a hard time dealing with the fact that in a groupby I can't efficiently capture a group sub-dataframe with an Expr, perform an expensive operation on it once, and then return several different aggregations. I can sort of do it (see the example), but my solution is unreadable and looks like it carries unnecessary overhead because of all those lists. Is there a proper, or a completely different, way to do it?
Take a look at this example:
import polars as pl
import numpy as np
df = pl.DataFrame(np.random.randint(0,10,size=(1000000, 3)))
expensive = pl.col('column_1').cumprod().ewm_std(span=10).alias('expensive')
%%timeit
(
    df
    .groupby('column_0')
    .agg([
        expensive.sum().alias('sum'),
        expensive.median().alias('median'),
        *[expensive.max().alias(f'max{x}') for x in range(10)],
    ])
)
417 ms ± 38.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
(
    df
    .groupby('column_0')
    .agg(expensive)
    .select([
        pl.col('expensive').arr.eval(pl.element().sum()).arr.first().alias('sum'),
        pl.col('expensive').arr.eval(pl.element().median()).arr.first().alias('median'),
        *[pl.col('expensive').arr.eval(pl.element().max()).arr.first().alias(f'max{x}') for x in range(10)]
    ])
)
95.5 ms ± 9.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
We can see that precomputing the expensive part is beneficial, but actually doing it involves this .arr.eval(pl.element().<aggfunc>()).arr.first() pattern that really bothers me, both for readability and flexibility. Try as I might, I can't see a better solution.
I'm not sure whether the problem is specific to groupbys; if your solution involves selects, please share that as well.
Use explode instead of arr.eval like this:
%%timeit
df \
    .groupby('column_0') \
    .agg(expensive) \
    .explode('expensive') \
    .groupby('column_0') \
    .agg([
        pl.col('expensive').sum().alias('sum'),
        pl.col('expensive').median().alias('median'),
        *[pl.col('expensive').max().alias(f'max{x}') for x in range(10)]
    ])
On my machine the run times were
Your first example: 320 ms ± 18.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Your second: 80.8 ms ± 1.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Mine: 63 ms ± 507 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Another method, which turns out to be slightly slower than the above, is to compute the expensive expression as a window function, which then skips the explode:
%%timeit
df.select(['column_0', expensive.over('column_0')]).groupby('column_0').agg([
    pl.col('expensive').sum().alias('sum'),
    pl.col('expensive').median().alias('median'),
    *[pl.col('expensive').max().alias(f'max{x}') for x in range(10)]
])
This last one returned in 69.7 ms ± 911 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Read a file as SciPy sparse matrix directly

Is it possible to read a space-separated file, with each line containing float numbers, directly as a SciPy sparse matrix?
Given: a space-separated file containing ~56 million rows, with 25 space-separated floating point numbers in each row and a lot of sparsity.
Output: convert the file into a SciPy CSR sparse matrix as fast as possible.
Maybe there are better solutions out there, but this one worked for me after a lot of suggestions from #CJR (some of which I couldn't take into account).
There may also be a better solution using HDF5, but this solution uses a Pandas DataFrame, finishes in 6.7 minutes, and takes around 50 GB of RAM on a 32-core machine for 56,651,070 rows with 25 space-separated floating point numbers in each row and a lot of sparsity.
import numpy as np
import scipy.sparse as sps
import pandas as pd
import time
import swifter

start_time = time.time()
input_file_name = "df"
sep = " "

# Each row of the input ends up as a single space-separated string in 'array_column'.
df = pd.read_csv(input_file_name)

# Parse every string into a float array in parallel (swifter dispatches string work to dask).
df['array_column'] = df['array_column'].swifter.allow_dask_on_strings().apply(
    lambda x: np.fromstring(x, sep=sep))

# Stack the per-row arrays into one dense 2-D array and convert it to CSR.
df_np_sp_matrix = sps.csr_matrix(np.stack(df['array_column'].to_numpy()))

print("--- %s seconds ---" % (time.time() - start_time))
Output:
--- 406.22810888290405 seconds ---
Matrix Size.
df_np_sp_matrix
Output:
<56651070x25 sparse matrix of type '<class 'numpy.float64'>'
with 508880850 stored elements in Compressed Sparse Row format>
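For comparison, a chunked variant that never holds more than one dense block in memory might look like this (the chunk size is arbitrary, and a header-less file with 25 space-separated floats per line is assumed, as in the question):

import pandas as pd
import scipy.sparse as sps

blocks = []
# Read the space-separated file in chunks and convert each chunk to CSR immediately,
# so only one dense chunk is ever held in memory at a time.
for chunk in pd.read_csv("df", sep=" ", header=None, dtype="float64", chunksize=2_000_000):
    blocks.append(sps.csr_matrix(chunk.to_numpy()))

matrix = sps.vstack(blocks, format="csr")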

Postgis ST_Distance_Sphere giving about 1.7 too high result

I am new to PostGIS and spatial stuff, and I am struggling with a quite simple query.
I have two test records in places, where the column addressLocation is a POINT with the following values:
(51.122711,17.031819)
(51.122805,17.035522)
I am trying to make a query:
SELECT *
FROM places
WHERE ST_Distance_Sphere("addressLocation"::geometry, ST_MakePoint(51.122711, 17.033686)) <= 255;
51.122711, 17.033686 is roughly at the center between these two points, and the distances measured on Google Maps are about 125 and 128 meters.
The issue is that (51.122805, 17.035522) only got into the results with 205 as the limit, and the other one with 210.
I was looking through the PostGIS docs and cannot find any explanation for such inaccuracy.
Coordinates in PostGIS must be expressed as longitude/latitude, while in Google Maps they are expressed as latitude/longitude.
Your query is computing distances in Yemen:
Select ST_DistanceSphere(st_geomFromText('POINT(51.122711 17.031819)'),
                         st_geomFromText('POINT(51.122711 17.033686)'));
st_distancesphere
-------------------
207.60121386
While by swapping the coordinates, the points are in Poland and the distance is:
Select ST_DistanceSphere(st_geomFromText('POINT(17.031819 51.122711)'),
                         st_geomFromText('POINT(17.033686 51.122711)'));
st_distancesphere
-------------------
130.30184168

How to compute & plot Equal Error Rate (EER) from FAR/FRR values using matlab

I have the following values for FAR/FRR. I want to compute the EER and then plot it in MATLAB.
FAR FRR
19.64 20
21.29 18.61
24.92 17.08
19.14 20.28
17.99 21.39
16.83 23.47
15.35 26.39
13.20 29.17
7.92 42.92
3.96 60.56
1.82 84.31
1.65 98.33
26.07 16.39
29.04 13.13
34.49 9.31
40.76 6.81
50.33 5.42
66.83 1.67
82.51 0.28
Is there any MATLAB function available to do this? Can somebody explain this to me? Thanks.
Let me try to answer your question.
1) For your data, the EER can be the mean/max/min of [19.64, 20].
1.1) The idea of EER is to measure system performance (the lower the better) against another system by finding the point where the False Alarm Rate (FAR) and the False Reject Rate (FRR, or miss rate) are equal (if not exactly equal, then at least nearly equal, i.e. with the minimum distance between them).
Referring to your data, [19.64, 20] gives the minimum distance, so it could be used as the EER. You can take the mean/max/min of these two values; however, since the point is to compare between systems, make sure the other systems use the same method (mean/max/min) to pick their EER value.
The difference among mean/max/min can be ignored if there is a large amount of data. In some speaker verification tasks there are 100k data samples.
2) To understand EER, it is better to compute it yourself. Here is how.
Two things you need to know:
A) The system score for each test case (trial)
B) The true/false label for each trial
Once you have A and B, you can create [trial, score, true/false] tuples and sort them by score. Then loop through the scores, e.g. from min to max; at each step, treat that score as the threshold and compute the FAR and FRR. After looping through all the scores, find the FAR and FRR with the "equal" value.
For the code, you can refer to my pyeer.py, in the function processDataTable2:
https://github.com/StevenLOL/Research_speech_speaker_verification_nist_sre2010/blob/master/SRE2010/sid/pyeer.py
This function is written for the NIST SRE 2010 evaluation.
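As an illustration only (not the pyeer.py code), a minimal NumPy sketch of that procedure might look like this; the scores/labels are made up, and a similarity score is assumed (a trial is accepted when its score is at or above the threshold):

import numpy as np

def compute_eer(scores, labels):
    # labels: 1 = target trial, 0 = impostor trial
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)

    far, frr = [], []
    for threshold in np.sort(scores):      # each score in turn is the threshold
        accepted = scores >= threshold
        far.append(np.mean(accepted[~labels]))   # impostors wrongly accepted
        frr.append(np.mean(~accepted[labels]))   # targets wrongly rejected

    far, frr = np.array(far), np.array(frr)
    idx = np.argmin(np.abs(far - frr))      # the "equal" point (minimum FAR-FRR distance)
    return (far[idx] + frr[idx]) / 2.0      # mean of the two, as discussed above

# Made-up example:
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2])
labels = np.array([0,   0,   1,    1,   1,   0])
print(compute_eer(scores, labels))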
4) There are other measures similar to EER, such as minDCF, which only plays with the weights of FAR and FRR. You can refer to the "Performance Measure" section of http://www.nist.gov/itl/iad/mig/sre10results.cfm
5) You can also refer to this package https://sites.google.com/site/bosaristoolkit/ and to DETware_v2.1.tar.gz at http://www.itl.nist.gov/iad/mig/tools/ for computing and plotting the EER in MATLAB.
Plotting in DETware_v2.1:
Pmiss=1:50;Pfa=50:-1:1;
Plot_DET(Pmiss/100.0,Pfa/100.0,'r')
FAR(t) and FRR(t) are parameterized by threshold, t. They are cumulative distributions, so they should be monotonic in t. Your data is not shown to be monotonic, so if it is indeed FAR and FRR, then the measurements were not made in order. But for the sake of clarity, we can order:
FAR FRR
1 1.65 98.33
2 1.82 84.31
3 3.96 60.56
4 7.92 42.92
5 13.2 29.17
6 15.35 26.39
7 16.83 23.47
8 17.99 21.39
9 19.14 20.28
10 19.64 20
11 21.29 18.61
12 24.92 17.08
13 26.07 16.39
14 29.04 13.13
15 34.49 9.31
16 40.76 6.81
17 50.33 5.42
18 66.83 1.67
19 82.51 0.28
This is for increasing FAR, which assumes a distance score; if you have a similarity score, then FAR would be sorted in decreasing order.
Loop over FAR until it is larger than FRR, which occurs at row 11. Then interpolate the cross over value between rows 10 and 11. This is your equal error rate.
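For the sorted table above, that interpolation is just the crossing point of two line segments; a small sketch using rows 10 and 11:

import numpy as np

# Rows 10 and 11 of the sorted table: FAR crosses FRR between these two points.
far = np.array([19.64, 21.29])
frr = np.array([20.00, 18.61])

# Linear interpolation: solve far[0] + t*dfar == frr[0] + t*dfrr for t in [0, 1].
t = (frr[0] - far[0]) / ((far[1] - far[0]) - (frr[1] - frr[0]))
eer = far[0] + t * (far[1] - far[0])
print(eer)  # ~19.8%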