I'm trying to solve a system of differential equations in Python.
I have a system composed of two equations in two variables, A and B.
The initial conditions are A0 = 1e17 and B0 = 0, and the two variables change simultaneously.
I wrote the following code using odeint:
import numpy as np
from scipy.integrate import odeint

def dmdt(m, t):
    A, B = m
    dAdt = A - B
    dBdt = (A - B)*A
    return [dAdt, dBdt]

# Create time domain
t = np.linspace(0, 100, 1)

# Initial condition
A0 = 1e17
B0 = 0
m0 = [A0, B0]

solution = odeint(dmdt, m0, t)
The output I obtain is different from the expected one, but I don't understand where the error is.
Can someone help me?
Thanks
Since B' = (A-B)*A = A*A', we have A*A' - B' = 0, and one concludes
B = 0.5*(A^2 - A0^2)
Inserted into the first equation that gives
A' = A - 0.5*A^2 + 0.5*A0^2
= 0.5*(A0^2+1 - (A-1)^2)
This means that the A dynamic has two fixed points, at about A0+1 and -A0+1; A is growing inside that interval, and the upper fixed point is stable. However, in standard floating-point numbers there is no difference between 1e17 and 1e17+1. If you want to see the difference, you have to encode it separately.
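A quick way to see this in Python:

import numpy as np
print(1e17 + 1 == 1e17)   # True: 1e17 + 1 rounds back to 1e17 in double precision
print(np.spacing(1e17))   # 16.0, the gap between adjacent doubles near 1e17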
Also note that the standard error tolerances atol and rtol, somewhere in the range between 1e-6 and 1e-9, are totally incompatible with the scales of the problem as originally stated, which also highlights the need to rescale and shift the problem into a more manageable range of values.
Setting A = A0+u with |u| in an expected scale of 1..10 then gives
B = 0.5*u*(2*A0+u)
u' = A0+u - 0.5*u*(2*A0+u) = (1-u)*A0 + u - 0.5*u^2
This now suggests that the time scale be reduced by A0, set t=s/A0. Also, B = A0*v. Insert the direct parametrizations into the original system to get
du/ds = dA/dt / A0 = (A0+u-A0*v)/A0 = 1 + u/A0 - v
dv/ds = dB/dt / A0^2 = (A0+u-A0*v)*(A0+u)/A0^2 = (1+u/A0-v)*(1+u/A0)
u(0)=v(0)=0
Now in floating point, and for the expected range of u, we get 1 + u/A0 == 1, so effectively u'(s) = v'(s) = 1 - v, which gives
u(s) = v(s) = 1 - exp(-s)
A(t) = A0 + 1-exp(-A0*t) + very small corrections
B(t) = A0*(1-exp(-A0*t)) + very small corrections
The system in s,u,v should be well-computable by any solver in the default tolerances.
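A minimal sketch of checking this, integrating the rescaled (s, u, v) system with solve_ivp at default tolerances and comparing against the closed form 1 - exp(-s):

import numpy as np
from scipy.integrate import solve_ivp

A0 = 1e17

def duv_ds(s, y):
    u, v = y
    rate = 1 + u/A0 - v               # common factor in both equations
    return [rate, rate*(1 + u/A0)]

s_eval = np.linspace(0, 10, 101)
sol = solve_ivp(duv_ds, (0, 10), [0.0, 0.0], t_eval=s_eval)
u, v = sol.y

# Recover the original variables via t = s/A0, A = A0 + u, B = A0*v
print(np.max(np.abs(u - (1 - np.exp(-s_eval)))))   # small, within the solver tolerance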
I have been trying to compute the Wasserstein distance between two one-dimensional Gaussian distributions with means 0.0 and 4.0 and variances 9.0 and 16.0, respectively. I used scipy.optimize.linprog with the "interior-point" method, as described in the following link:
https://yetanothermathprogrammingconsultant.blogspot.com/2019/10/scipy-linear-programming-large-but-easy.html.
However, my code has been running for more than 17 hours and has still not finished solving a 300 x 300 LP problem (i.e. 300 source nodes and 300 destination nodes). The blog post says it should be possible to solve a problem with 1000 source nodes and 1000 destination nodes, i.e. an LP with 1,000,000 (one million) decision variables. What is wrong with my code? Why does it take such a long time? Do we need large memory (or clusters) to solve such problems?
My code:
from datetime import datetime
start_time = datetime.now()

import numpy as np
from scipy.optimize import linprog
import scipy

# Initializing the LP matrix
Piprob = np.zeros(500*500).reshape(500, 500)

def Piprobmin(Krv, rhoi, rhoj):
    r1 = np.shape(Krv)[0]
    r2 = np.shape(Krv)[1]
    print("r1,r2", r1, r2)
    # Computing the LP matrix, which has just two ones in each column
    pmat = np.zeros((r1+r2)*(r1*r2)).reshape((r1+r2), (r1*r2))
    for i in range(r1+r2):
        for j in range(r1*r2):
            if (i < r1) and (j < ((i+1)*r2)) and (j >= (i*r2)):
                pmat[i][j] = 1
            if i >= r1:
                for k in range(r1*r2):
                    if j == (i-r1) + (k*r2):
                        pmat[i][j] = 1
    # Flattening the cost matrix into a one-dimensional array
    krvf = Krv.flatten()
    tempr = np.append(rhoi, rhoj)
    Xv = []  # list for the joint probability matrix elements
    res = scipy.optimize.linprog(c=krvf, method='interior-point', A_eq=pmat, b_eq=tempr,
                                 options={'sparse': True, 'disp': True})
    print("res=\n", res)
    wv = res.fun
    for l1 in range(r1*r2):
        Xv.append(res.x[l1])
    Yv = np.array(Xv)
    Yv = Yv.reshape(r1, r2)
    # Returning Yv (the joint probability matrix) and wv (the minimized Wasserstein distance)
    return Yv, wv

Piprob, W = Piprobmin(K, result1, result2)  # K: cost matrix, result1/result2: the two marginals
end_time = datetime.now()
print('Duration: {}'.format(end_time - start_time))
The cost matrix is 300 x 300 and each marginal has 300 points (600 equality constraints in total). I verified that my cost matrix is symmetric and non-negative, and that each marginal sums to one, as they are probabilities.
In the blog post the word sparse is used many times. Not without reason. It is extremely important to store the A matrix as a sparse matrix. Otherwise, you will not be able to handle large problems. The blog post discusses the difference in memory requirements of the transportation LP matrix in great detail, so this point should have been hard to miss.
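A rough back-of-the-envelope estimate makes the point:

rows, cols = 2000, 1000000
dense_bytes = rows * cols * 8       # float64 entries
nnz = 2 * cols                      # the transportation matrix has two nonzeros per column
coo_bytes = nnz * (8 + 8 + 8)       # value + row index + column index (int64)
print(dense_bytes / 1e9)            # ~16 GB for the dense matrix
print(coo_bytes / 1e9)              # ~0.05 GB stored as sparse COO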
Here is some example code on how to set up a transportation model with 1000 source nodes and 1000 destination nodes using scipy.optimize.linprog. Again, the LP matrix has 2,000 rows and 1,000,000 columns and is stored sparse.
import numpy as np
import scipy as sp
import scipy.sparse as sparse
import scipy.optimize as opt
from memory_profiler import profile

def GenerateData(M, N):
    np.random.seed(123)
    # form objective function
    c = np.random.uniform(0, 10, (M, N))
    # supply, demand
    s = np.random.uniform(0, 15, M)
    d = np.random.uniform(0, 10, N)
    assert np.sum(d) <= np.sum(s), "supply too small"
    #print('c', c)
    #print('s', s)
    #print('d', d)
    return {'c': c, 's': s, 'd': d, 'n': N, 'm': M}

def FormLPData(data):
    rhs = np.append(data['s'], -data['d'])
    # form A
    # column (i,j) = N*i+j has two nonzeroes:
    #   +1 at row i   (supply row, rhs  s(i))
    #   -1 at row M+j (demand row, rhs -d(j), i.e. sum_i x(i,j) >= d(j))
    N = data['n']
    M = data['m']
    NZ = 2*N*M
    irow = np.zeros(NZ, dtype=int)
    jcol = np.zeros(NZ, dtype=int)
    value = np.zeros(NZ)
    for i in range(M):
        for j in range(N):
            k = N*i + j
            k1 = 2*k
            k2 = k1 + 1
            irow[k1] = i
            jcol[k1] = k
            value[k1] = 1.0
            irow[k2] = M + j
            jcol[k2] = k
            value[k2] = -1.0
    A = sparse.coo_matrix((value, (irow, jcol)))
    #print('A', A)
    #print('rhs', rhs)
    return {'A': A, 'rhs': rhs}

@profile
def run():
    # dimensions
    M = 1000  # sources
    N = 1000  # destinations
    data = GenerateData(M, N)
    lpdata = FormLPData(data)
    res = opt.linprog(c=np.reshape(data['c'], M*N), A_ub=lpdata['A'], b_ub=lpdata['rhs'],
                      options={'sparse': True, 'disp': True})

if __name__ == '__main__':
    run()
So it looks like you totally missed the whole point of the blog post.
I am working on a project using Spark and Scala and I am looking for a hierarchical clustering algorithm, which is similar to scipy.cluster.hierarchy.fcluster or sklearn.cluster.AgglomerativeClustering, which will be useable for large amounts of data.
MLlib for Spark implements Bisecting k-means, which needs the number of clusters as input. Unfortunately, in my case I don't know the number of clusters, and I would prefer to use a distance threshold as the input parameter, as is possible in the two Python implementations above.
If anyone would know the answer, I would be very grateful.
So I had the same problem, and after looking high and low I found no answers, so I will post what I did here in the hope that it helps someone else and that maybe someone will build on it.
The basic idea of what I did was to use bisecting k-means recursively to keep splitting clusters in half until all points in a cluster were within a specified distance of the centroid. I was using GPS data, so I have a little bit of extra machinery to deal with that.
The first step is to create a model that will cut the data in half. I used bisecting k-means, but I think this would work with any of the pyspark clustering methods, as long as you can get the distance to the centroid.
import pyspark.sql.functions as f
from pyspark import SparkContext, SQLContext
from pyspark.sql.types import FloatType
from pyspark.sql.window import Window
from pyspark.ml.clustering import BisectingKMeans
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

bkm = BisectingKMeans().setK(2).setSeed(1)
assembler = VectorAssembler(inputCols=['lat', 'long'], outputCol="features")
adf = assembler.transform(locAggDf)  # locAggDf contains my location info
model = bkm.fit(adf)
# predictions holds the original data plus a "prediction" column with the assigned cluster number
predictions = model.transform(adf)
predictions.persist()
The next step is our recursive function. The idea here is that we specify some distance from the centroid, and if any point in a cluster is farther away than that distance, we cut the cluster in half. When a cluster is tight enough to meet the condition, I add it to a result array that I use to build the final clustering.
def bisectToDist(model, predictions, bkm, precision, result=[]):
    centers = model.clusterCenters()
    # row[0] is the predicted cluster number, row[1] is the unit, row[2] is the point's lat, row[3] is the point's long
    # centers[row[0]] is the lat/long of the center: centers[row[0]][0] = lat, centers[row[0]][1] = long
    distUdf = f.udf(
        lambda row: getDistWrapper((centers[row[0]][0], centers[row[0]][1], row[1]), (row[2], row[3], row[1])),
        FloatType())  # getDistWrapper is how I calculate the lat/long distance, but you can define any distance metric
    predictions = predictions.withColumn('dist', distUdf(
        f.struct(predictions.prediction, predictions.encodedPrecisionUnit, predictions.lat, predictions.long)))

    # create a df of all rows that were in clusters that had a point outside of the threshold
    toBig = predictions.join(
        predictions.groupby('prediction').agg({"dist": "max"}).filter(f.col('max(dist)') > precision).select(
            'prediction'), ['prediction'], 'leftsemi')

    # this could probably be improved
    # get all cluster numbers that were too big
    listids = toBig.select("prediction").distinct().rdd.flatMap(lambda x: x).collect()

    # if all data points are within the specified distance of the centroid we can return the clustering
    if len(listids) == 0:
        return predictions

    # assuming binary clustering for now, k must be 2
    # if one of the two clusters was small enough, there will be no further recursion for that cluster;
    # we must save it and return it at this depth, while the cluster that was too big is cut in half in the loop below
    if len(listids) == 1:
        ok = predictions.join(
            predictions.groupby('prediction').agg({"dist": "max"}).filter(
                f.col('max(dist)') <= precision).select(
                'prediction'), ['prediction'], 'leftsemi')

    for clusterId in listids:
        # get all of the pieces that were too big
        part = toBig.filter(toBig.prediction == clusterId)
        # we now need to refit on the subset of the data
        assembler = VectorAssembler(inputCols=['lat', 'long'], outputCol="features")
        adf = assembler.transform(part.drop('prediction').drop('features').drop('dist'))
        model = bkm.fit(adf)
        # predictions now holds the new subclustering and we are ready for recursion
        predictions = model.transform(adf)
        result.append(bisectToDist(model, predictions, bkm, precision, result=result))

    # return anything that was given and already good
    if len(listids) == 1:
        return ok
Finally we can call the function and build the resulting dataframe
result = []
bisectToDist(model, predictions, bkm, precision, result=result)  # precision is your chosen distance threshold

# drop any Nones; these can happen in recursive (non top-level) calls
result = [r for r in result if r]
r = result[0]
r = r.withColumn('subIdx', f.lit(0))
result = result[1:]
idx = 1
for r1 in result:
    r1 = r1.withColumn('subIdx', f.lit(idx))
    r = r.unionByName(r1)
    idx = idx + 1

# each of the subclusters has a 0 or 1 prediction; to renumber them 0..n I added the following
r = r.withColumn('delta', r.subIdx * 100 + r.prediction)
r = r.withColumn('delta', r.delta - f.lag(r.delta, 1).over(Window.orderBy("delta"))).fillna(0)
r = r.withColumn('ddelta', f.when(r.delta != 0, 1).otherwise(0))
r = r.withColumn('spacialLocNum', f.sum('ddelta').over(Window.orderBy(['subIdx', 'prediction'])))
# spacialLocNum should be the final clustering
Admittedly this is quite convoluted and slow, but it does get the job done. Hope this helps!
I am writing a simple code to output some large matrices to disk that will subsequently be read in Matlab.
I have written the following code, which exemplifies the writing for one such matrix. I am concerned with two things:
1. Efficiency in writing to disk (looking for something that is not too slow)
2. Easily being able to read it in Matlab
PROGRAM WriteDisk
   character(80) :: filename = ' '
   INTEGER :: indt
   INTEGER :: ind1, n1 = 161
   INTEGER :: ind2, n2 = 20
   INTEGER :: ind3, n3 = 2
   INTEGER :: ind4, n4 = 2
   INTEGER :: ind5, n5 = 21
   INTEGER :: ind6, n6 = 20
   INTEGER :: ind7, n7 = 2
   INTEGER :: ind8, n8 = 2
   INTEGER :: dummy
   REAL, ALLOCATABLE :: m1(:,:,:,:,:,:,:,:,:)

   ALLOCATE(m1(2,n1,n2,n3,n4,n5,n6,n7,n8))

   dummy = 1
   do ind8 = 1,n8
      do ind7 = 1,n7
         do ind6 = 1,n6
            do ind5 = 1,n5
               do ind4 = 1,n4
                  do ind3 = 1,n3
                     do ind2 = 1,n2
                        do ind1 = 1,n1
                           m1(2,ind1,ind2,ind3,ind4,ind5,ind6,ind7,ind8) = dummy
                           dummy = dummy + 1
                        end do
                     end do
                  end do
               end do
            end do
         end do
      end do
   end do

   indt = 1
   write(filename,'(a,i0,a)') 'PF_m1_', indt, '.txt'
   OPEN(UNIT=25, FILE=filename, STATUS='replace', ACTION='write')
   WRITE(25, *) m1(2,:,:,:,:,:,:,:,:)
   CLOSE(UNIT=25)
END PROGRAM
The program above writes the matrix m1 as a 4327680 x 5 array of numbers. This makes it cumbersome to reshape in Matlab (although totally possible): when I open it in Matlab I have to do something like the following to get the matrix back in its original shape:
n1 = 161;
n2 = 20;
n3 = 2;
n4 = 2;
n5 = 21;
n6 = 20;
n7 = 2;
n8 = 2;
m1 = load('PF_m1_1.txt'); %This is a two dimensional matrix that needs to be transposed and reshaped TWICE to get the original matrix
m1 = m1';
m1 = m1(:);
m1 = reshape(m1, n1,n2,n3,n4,n5,n6,n7,n8)
Is there any way to write it as a single vector, with element m1(2,1,1,1,1,1,1,1,1) as the first element, m1(2,2,1,1,1,1,1,1,1) as the second element, ..., and m1(2,end,end,end,end,end,end,end,end) as the last element?
Or any way that I am not aware of to quickly save it directly as a .mat file?
"Is there anyway to write it as a single vector with element with element m1(2,1,1,1,1,1,1,1,1) as first element, m1(2,2,1,1,1,1,1,1,1) as second element, ... , m1(2,end,end,end,end,end,end,end,end) as last element, etc?"
Yes, this is the default Fortran column-major order. This is the order in which your file is already written. There is nothing you have to do.
"This makes it cumbersome to reshape it in Matlab (although totally possible), as in Matlab I need to do the following:
m1 = reshape(m1, n1,n2,n3,n4,n5,n6,n7,n8)"
Reshape just updates the internal descriptor. It should be a very fast operation. Completely negligible. Even if it needed to shuffle the data, it would still be much quicker than reading from the hard-drive.
"I am concerned with two things: 1. Efficiency in writing to disk (looking for something that is not too slow)"
Use unformatted (also known as binary) I/O:
OPEN(UNIT=25,FILE=filename,ACCESS='stream',STATUS='replace',ACTION='write')
WRITE(25) m1(2,:,:,:,:,:,:,:,:)
CLOSE(UNIT=25)
"2. Easily being able to read it in Matlab"
To read it in Matlab, learn how to read binary data; see for example "Read and write from/to a binary file in Matlab", the Matlab documentation for fread (https://www.mathworks.com/help/matlab/ref/fread.html), and loads of other resources.
Don't forget to tell Matlab the right dimensions. Or store the dimensions in the first bytes of the data file (a header).
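As a sanity check of the layout, a small Python sketch (assuming the stream file written above and the compiler's default 4-byte REAL) can read the raw values back; they come out in exactly the column-major order discussed, so a single reshape restores the original shape:

import numpy as np

dims = (161, 20, 2, 2, 21, 20, 2, 2)                  # n1 .. n8
data = np.fromfile('PF_m1_1.txt', dtype=np.float32)   # use float64 if compiled with 8-byte reals
m1 = data.reshape(dims, order='F')                    # column-major, as Fortran and Matlab store arrays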
The model I'm working on is a multinomial logit choice model. It's a very specific dataset so other existing MNLogit libraries don't fit with my data.
So basically, it's a very complex function which takes 11 parameters and returns a loglikelihood value. Then I need to find the optimal parameter values that can minimize the loglikelihood using scipy.optimize.minimize.
Here are the problems that I encounter with different methods:
'Nelder-Mead': it works well and always gives me the correct answer. However, it's EXTREMELY slow. For another function with a more complicated setup, it takes 15 hours to reach the optimal point, while the same function takes only 1 hour in Matlab using fminunc (which uses BFGS by default).
'BFGS': This is the method used by Matlab. It works well for any simple function. However, for the function I have, it always fails to converge and returns 'Desired error not necessarily achieved due to precision loss.' I've spent lots of time playing around with the options but still couldn't get it to work.
'Powell': It quickly converges successfully but returns a wrong answer. The code is printed below (x0 is the correct answer; Nelder-Mead works for whatever initial value), and you can get the data here: https://www.dropbox.com/s/aap2dhor5jyxy94/data.csv
Thanks!
import pandas as pd
import numpy as np
from scipy.optimize import minimize

# https://www.dropbox.com/s/aap2dhor5jyxy94/data.csv
df = pd.read_csv('data.csv', index_col=0)
dfhh = df.hh
B = df.ix[:,'b0':'b4'].values       # NT*5
P = df.ix[:,'p1':'p4'].values       # NT*4
F = df.ix[:,'f1':'f4'].values       # NT*4
SDV = df.ix[:,'lagb1':'lagb4'].values

def Li(x):
    b1 = x[0]   # coeff on prices
    b2 = x[1]   # coeff on features
    a = x[2:7]  # remaining 4 values are the alphas
    E = np.exp(a + b1*P + b2*F)     # (1*4) + (NT*4) + (NT*4), build matrix (NT*J) for each exp()
    E = np.insert(E, 0, 1, axis=1)  # (NT*5)
    denom = E.sum(1)
    return -np.log((B * E).sum(1) / denom).sum()

x0 = np.array([-32.31028223, 0.23965953, 0.84739154, 0.25418215, -3.38757007, -0.38036966])
np.random.seed(0)
x0 = x0 + np.random.rand(6)

minL = minimize(Li, x0, method='Nelder-Mead', options={'xtol': 1e-8, 'disp': True})
# minL = minimize(Li, x0, method='BFGS')
# minL = minimize(Li, x0, method='Powell', options={'xtol': 1e-12, 'ftol': 1e-12})

print minL
Update 03/07/14: simpler version of the code above.
Now Powell works well with a very small tolerance; however, Powell is slower than Nelder-Mead in this case. BFGS still fails to work.