Calculate descriptors with RDKit

I am trying to calculate all the descriptors (both 2D and 3D) for a list of molecules with RDKit in Python. When I run:
MolecularDescriptorCalculator.CalcDescriptors(mol, simplelist)
it returns:
AttributeError: 'Mol' object has no attribute 'simpleList'

To calculate all the RDKit descriptors, you can use the following code:

from rdkit import Chem
from rdkit.Chem import rdMolDescriptors
import numpy as np

descriptor_names = list(rdMolDescriptors.Properties.GetAvailableProperties())
get_descriptors = rdMolDescriptors.Properties(descriptor_names)

Calculate the descriptors from SMILES strings:

def smi_to_descriptors(smile):
    mol = Chem.MolFromSmiles(smile)
    descriptors = []
    if mol:
        descriptors = np.array(get_descriptors.ComputeProperties(mol))
    return descriptors

If the SMILES are in a pandas DataFrame:

dataset['descriptors'] = dataset.SMILES.apply(smi_to_descriptors)
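For a quick sanity check you can call the helper directly (a minimal example; 'CCO' is simply the SMILES string for ethanol):

feats = smi_to_descriptors('CCO')
print(len(feats))  # number of available properties computed
print(feats[:5])   # first few descriptor values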

Looks like you are using the API slightly wrong: you need to initialize the MolecularDescriptorCalculator class first with the list of descriptors you require.

from rdkit.ML.Descriptors.MoleculeDescriptors import MolecularDescriptorCalculator

simplelist = ['TPSA']  # add the names of the required descriptors to this list
calculator = MolecularDescriptorCalculator(simplelist)
descriptors = calculator.CalcDescriptors(mol)
print(descriptors)
[Out]:
(21.259999999999998,)
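If you want the calculator to cover every available 2D descriptor rather than a hand-picked list, one common approach is to pull the names from rdkit.Chem.Descriptors (a sketch only; note that _descList is technically a private attribute):

from rdkit.Chem import Descriptors
from rdkit.ML.Descriptors.MoleculeDescriptors import MolecularDescriptorCalculator

all_names = [name for name, _ in Descriptors._descList]
calculator = MolecularDescriptorCalculator(all_names)
descriptors = calculator.CalcDescriptors(mol)  # tuple with one value per descriptor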


Convert string data in an HDF5 file to float format

I need to convert string data from an HDF5 file to float format to use in a sky plot (Astropy) with l, b coordinates. The data is available here:
https://wwwmpa.mpa-garching.mpg.de/~ensslin/research/data/faraday2020.html
(Faraday Sky 2020)
The code I have written so far is:

from astropy import units as u
from astropy.coordinates import SkyCoord
import matplotlib.pyplot as plt
import numpy as np
import h5py

dat = []
ggl = []
ggb = []

f1 = h5py.File('/home/nikita/faraday_2020/faraday2020.hdf5', 'r')
data = f1.get('faraday_sky_mean')
faraday_sky_mean = np.array(data)
data1 = f1.get('faraday_sky_std')
faraday_sky_std = np.array(data1)

n1 = 0
for line in f1:
    s = line.split()
    dat.append(s)
    n1 = n1 + 1

for i in range(0, n1):
    ggl.append(float(dat[i][0]))  # galactic coordinates input
    ggb.append(float(dat[i][1]))
f1.close()
However, I am getting this error:

ggl.append(float(dat[i][0]))  # galactic coordinates input
ValueError: could not convert string to float: 'faraday_sky_mean'

Please help with this. Thanks.
What you asked and what (I think) you need are two different things.
This line is NOT the way to read an HDF5 file: for line in f1:
You need to use an HDF5 API to read it (h5py is one of many).
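Before reading anything, it helps to see what the file actually contains. Here is a minimal h5py sketch (visititems walks every group and dataset in the file):

import h5py

with h5py.File('faraday2020.hdf5', 'r') as hdf:
    # print each object's name along with its shape and dtype
    hdf.visititems(lambda name, obj: print(name, obj))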
I think you want to read the datasets faraday_sky_mean and faraday_sky_std and load the arrays into the lists ggl and ggb. To do that, use the code below. It will create two lists with 3145728 float64 values in each.
with h5py.File('faraday2020.hdf5', 'r') as hdf:
    print(list(hdf.keys()))
    faraday_sky_mean = hdf['faraday_sky_mean'][:]
    faraday_sky_std = hdf['faraday_sky_std'][:]
    print(faraday_sky_mean.shape, faraday_sky_mean.dtype)
    print(f'Max Mean={max(faraday_sky_mean)}, Min Mean={min(faraday_sky_mean)}')
    print(faraday_sky_std.shape, faraday_sky_std.dtype)
    print(f'Max StdDev={max(faraday_sky_std)}, Min StdDev={min(faraday_sky_std)}')

ggl = faraday_sky_mean.tolist()
print(len(ggl), type(ggl[0]))
ggb = faraday_sky_std.tolist()
print(len(ggb), type(ggb[0]))
The procedure above saves the data as both NumPy arrays and Python lists. If you only need the lists (don't need the arrays), you can shorten the code as shown below:

with h5py.File('faraday2020.hdf5', 'r') as hdf:
    ggl = hdf['faraday_sky_mean'][:].tolist()
    print(len(ggl), type(ggl[0]))
    ggb = hdf['faraday_sky_std'][:].tolist()
    print(len(ggb), type(ggb[0]))
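One note on the sky plot goal: these 3145728 values form a HEALPix map (3145728 = 12 x 512^2, i.e. nside=512), so the galactic l, b coordinates are not stored in the file; they come from the pixel indices. A hypothetical sketch with healpy (an extra dependency, not part of the question's code):

import healpy as hp
import numpy as np

nside = hp.npix2nside(len(ggl))  # 512 for 3145728 pixels
# galactic longitude and latitude of every pixel centre, in degrees
l, b = hp.pix2ang(nside, np.arange(len(ggl)), lonlat=True)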

Import csv as MATLAB struct

I have a large csv log file. Here is a simplified sample:
ts,a.b.c,a.b.d,a.b.e,b.c,b.d,c.d.e,c.d.f,c.g
2021-03-29 06:38:39,1.0000,2,3,28.20,1,2,3,4
2021-03-29 06:38:40,1.0000,2,3,28.20,1,2,3,0.000000
I am using MATLAB's Import Data tool to import it but, unfortunately, the tool removes all dots from the header and imports the variables as, e.g., abc, abd, abe, etc.
What is an efficient way to import a csv like the one above as structs?
I am looking for a way to have the data imported as structs a, b and c for this particular log file, so that I can easily access variables as a.b.c or c.d.f.
Here is what I came up with, simply using readtable:

function res = log_import(logfile)
    log_table = readtable(logfile);
    res = [];
    for i = 1:width(log_table)
        % the original (dotted) column heading for this variable
        str_path = log_table.Properties.VariableDescriptions{i};
        fields = strsplit(str_path, '.');
        % grow the nested struct along the dotted field path
        res = setfield(res, fields{1:end}, log_table{:, i});
    end
end

Batch processing with Nvidia's TensorRT

I converted the trained model to ONNX format and then created the TensorRT engine file from the ONNX model. I used the snippet below to do this:
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np
import tensorrt as trt
import torch

# logger to capture errors, warnings, and other information during the build and inference phases
TRT_LOGGER = trt.Logger()

def build_engine(onnx_file_path):
    # initialize TensorRT engine and parse ONNX model
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network()
    parser = trt.OnnxParser(network, TRT_LOGGER)

    # parse ONNX
    with open(onnx_file_path, 'rb') as model:
        print('Beginning ONNX file parsing')
        parser.parse(model.read())
    print('Completed parsing of ONNX file')

    # allow TensorRT to use up to 1GB of GPU memory for tactic selection
    builder.max_workspace_size = 1 << 30
    # we have only one image in batch
    builder.max_batch_size = 1
    # use FP16 mode if possible
    if builder.platform_has_fast_fp16:
        builder.fp16_mode = True

    # generate TensorRT engine optimized for the target platform
    print('Building an engine...')
    engine = builder.build_cuda_engine(network)
    context = engine.create_execution_context()
    print("Completed creating Engine")
    return engine, context

# build the engine (this call was implicit in the original snippet; ONNX_FILE_PATH is the path to the .onnx model)
engine, context = build_engine(ONNX_FILE_PATH)

# get sizes of input and output and allocate memory required for input data and for output data
for binding in engine:
    if engine.binding_is_input(binding):  # we expect only one input
        input_shape = engine.get_binding_shape(binding)
        input_size = trt.volume(input_shape) * engine.max_batch_size * np.dtype(np.float32).itemsize  # in bytes
        device_input = cuda.mem_alloc(input_size)
    else:  # and one output
        output_shape = engine.get_binding_shape(binding)
        # create page-locked memory buffers (i.e. won't be swapped to disk)
        host_output = cuda.pagelocked_empty(trt.volume(output_shape) * engine.max_batch_size, dtype=np.float32)
        device_output = cuda.mem_alloc(host_output.nbytes)

stream = cuda.Stream()
# preprocess input data
host_input = np.array(preprocess_image("turkish_coffee.jpg").numpy(), dtype=np.float32, order='C')
cuda.memcpy_htod_async(device_input, host_input, stream)

# run inference
context.execute_async(bindings=[int(device_input), int(device_output)], stream_handle=stream.handle)
cuda.memcpy_dtoh_async(host_output, device_output, stream)
stream.synchronize()

# postprocess results
output_data = torch.Tensor(host_output).reshape(engine.max_batch_size, output_shape[0])
postprocess(output_data)
The code above works correctly for a batch size of one image, but I want it to run with larger batch sizes. One thing that clearly needs to change is:

builder.max_batch_size = 1

What else do I have to change for it to work correctly with a batch size greater than one? In my opinion, another thing I have to change is going from sync to async, right?

stream.synchronize()

How can I solve this for a batch size greater than one? (See the sketch after the system details below.)
My system:
torch:1.2.0
torchvision:0.4.0
albumentations:0.4.5
onnx:1.4.1
opencv-python:4.2.0.34
cuda:10.0
ubuntu:18.04
tensorrt: 5.x/6.x
Another solution is to use an optimization profile in TRT 7.x, but I want to know how I can solve this problem with the 5.x/6.x versions. Is it possible?
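For reference, a minimal, untested sketch of what the implicit-batch flow might look like under TensorRT 5.x/6.x (BATCH_SIZE and image_paths are hypothetical names; preprocess_image and the buffers come from the code above, whose allocation already scales with engine.max_batch_size):

BATCH_SIZE = 4                       # hypothetical batch size
builder.max_batch_size = BATCH_SIZE  # set before building the engine

# stack BATCH_SIZE preprocessed images into one contiguous array
# (assumes each preprocessed image carries a leading batch axis of 1)
batch = np.concatenate([preprocess_image(p).numpy() for p in image_paths], axis=0)
host_input = np.ascontiguousarray(batch, dtype=np.float32)
cuda.memcpy_htod_async(device_input, host_input, stream)

# implicit-batch API: pass the actual batch size to execute_async
context.execute_async(batch_size=BATCH_SIZE,
                      bindings=[int(device_input), int(device_output)],
                      stream_handle=stream.handle)
cuda.memcpy_dtoh_async(host_output, device_output, stream)
stream.synchronize()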

PySpark: Error "Cannot pickle standard input" on function map

I'm trying to learn to use PySpark.
I'm using spark-2.2.0 with Python 3.
I'm facing a problem now and I can't find where it comes from.
My project is to adapt an algorithm written by a data scientist so that it runs distributed. The code below is what I have to use to extract the features from images, and I have to adapt it to extract the features with PySpark.
import json
import sys

# Dependencies can be installed by running:
# pip install keras tensorflow h5py pillow
# Run script as:
# ./extract-features.py images/*.jpg

from keras.applications.vgg16 import VGG16
from keras.models import Model
from keras.preprocessing import image
from keras.applications.vgg16 import preprocess_input
import numpy as np

def main():
    # Load model VGG16 as described in https://arxiv.org/abs/1409.1556
    # This is going to take some time...
    base_model = VGG16(weights='imagenet')
    # Model will produce the output of the 'fc2' layer, which is the penultimate neural network layer
    # (see the paper above for more details)
    model = Model(input=base_model.input, output=base_model.get_layer('fc2').output)
    # For each image, extract the representation
    for image_path in sys.argv[1:]:
        features = extract_features(model, image_path)
        with open(image_path + ".json", "w") as out:
            json.dump(features, out)

def extract_features(model, image_path):
    img = image.load_img(image_path, target_size=(224, 224))
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)
    x = preprocess_input(x)
    features = model.predict(x)
    return features.tolist()[0]

if __name__ == "__main__":
    main()
I have written the beginning of the code:
rdd = sc.binaryFiles(PathImages)
base_model = VGG16(weights='imagenet')
model = Model(input=base_model.input, output=base_model.get_layer('fc2').output)
rdd2 = rdd.map(lambda x: (x[0], extract_features(model, x[0][5:])))  # x[0][5:] strips the 'file:' prefix from the path
rdd2.collect()[0]
When I try to extract the features, I get this error:
~/Code/spark-2.2.0-bin-hadoop2.7/python/pyspark/cloudpickle.py in save_file(self, obj)
    623             return self.save_reduce(getattr, (sys,'stderr'), obj=obj)
    624         if obj is sys.stdin:
--> 625             raise pickle.PicklingError("Cannot pickle standard input")
    626         if hasattr(obj, 'isatty') and obj.isatty():
    627             raise pickle.PicklingError("Cannot pickle files that map to tty objects")

PicklingError: Cannot pickle standard input
I tried multiple things, and here is my first result. I know that the error comes from the line below in the method extract_features:

features = model.predict(x)

When I run this line outside of a map function or PySpark, it works fine, so I think the problem comes from the model object and its serialization with PySpark.
Maybe I am not distributing this with PySpark in the right way; if you have any clue to help me, I will take it.
Thanks in advance.
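A common workaround for this kind of pickling error (an untested sketch, using the helpers and variables from the code above, not a confirmed fix for this exact setup) is to build the Keras model on the executors instead of shipping it from the driver, e.g. with mapPartitions so each partition loads VGG16 only once:

def extract_features_partition(paths):
    # build the (unpicklable) model here, on the executor
    base_model = VGG16(weights='imagenet')
    model = Model(input=base_model.input, output=base_model.get_layer('fc2').output)
    for path in paths:
        yield path, extract_features(model, path)

rdd = sc.binaryFiles(PathImages)
# keep only the file path (stripping the 'file:' prefix), then extract per partition
rdd2 = rdd.map(lambda x: x[0][5:]).mapPartitions(extract_features_partition)
print(rdd2.collect()[0])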

GLM.fit() in Matlab vs. Python Statsmodels: why the different results?

In what ways is Matlab's glmfit implemented differently than Python statsmodels' GLM.fit()?
Here is a comparison of their results on my dataset:

[Figure: the 209 fitted weights from each implementation]

The weights were generated by running a GLM fit on:

V: (100000, 209) predictor variable (design matrix)
y: (100000, 1) response variable

Sum of squared errors between the two weight vectors: 18.140615678
A Specific Example
Why are these different? First, here's a specific example in Matlab:
yin = horzcat(y,ones(size(y)));
[weights_mat, d0, st0]=glmfit(V, yin,'binomial','probit','off',[],[],'off');
Let's try the equivalent in Python:
import statsmodels.api as sm
import numpy as np

## set up GLM
y = np.concatenate((y, np.ones([len(y), 1])), axis=1)
sm_probit_Link = sm.genmod.families.links.probit
glm_binom = sm.GLM(sm.add_constant(y), sm.add_constant(V_design_matrix),
                   family=sm.families.Binomial(link=sm_probit_Link))
# statsmodels GLM format: glm_binom = sm.GLM(data.endog, data.exog, family)

## Run GLM fit
glm_result = glm_binom.fit()
weights_py = glm_result.params

## Compare the difference
weights_mat_import = Matpy.get_output('w_output.mat', 'weights_mat')  # imports Matlab variables
print(SSE(weights_mat_import, weights_py))
Let's Check The Docs
glmfit in Matlab:
[b,dev,stats] = glmfit(X,y,distr)
GLM.fit() setup in Python (documentation):
glm_model = sm.GLM(endog, exog, family=None, offset=None, exposure=None, missing='none', **kwargs)
glm_model.fit(start_params=None, maxiter=100, method='IRLS', tol=1e-08, scale=None, cov_type='nonrobust', cov_kwds=None, use_t=None, **kwargs)
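My current guess at a closer equivalent (an untested sketch; variable names follow the Matlab snippet): glmfit with yin = [y, ones] means "y successes out of 1 trial", which statsmodels accepts as the plain 0/1 vector y, and the final 'off' in the Matlab call turns the constant term off, so add_constant should not be applied to either argument:

import statsmodels.api as sm

# probit link, binomial family, no intercept (mirrors 'binomial', 'probit', constant 'off')
probit_link = sm.genmod.families.links.probit()
glm_binom = sm.GLM(y, V, family=sm.families.Binomial(link=probit_link))
weights_py = glm_binom.fit().params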
How might we get Matlab glmfit results with Statsmodels?
Thank you!