On finding chirality using RDKit - chemistry

In the paper "Graph Networks as a Universal Machine Learning Framework for Molecules and Crystals", the authors introduce chirality as an atom feature input for analyzing the QM9 dataset. I was trying to recreate this atom feature as follows:
Chirality: (categorical) R, S, or not a Chiral center (one-hot encoded).
The code I used is:
from chainer_chemistry import datasets
from chainer_chemistry.dataset.preprocessors.ggnn_preprocessor import GGNNPreprocessor
from rdkit import Chem
import numpy as np

dataset, dataset_smiles = datasets.get_qm9(GGNNPreprocessor(), return_smiles=True)

for i in range(len(dataset_smiles)):
    mol = Chem.MolFromSmiles(dataset_smiles[i])
    Chem.AssignAtomChiralTagsFromStructure(mol)
    chiral_cc = Chem.FindMolChiralCenters(mol)
    if len(chiral_cc) != 0:
        print(chiral_cc)
The output shows no chiral centers for this dataset. When I use includeUnassigned=True, the code gives a list of tuples, but instead of "R"/"S" I get "?". I was wondering if there is a mistake in my implementation. If this is expected, any thoughts on how chirality was assigned in the above paper?

The reason you see '?' instead of R/S is that FindMolChiralCenters reports '?' for unassigned stereocenters.
That is, if the SMILES does not specify the configuration with the @ / @@ chirality tags (for example [C@H] or [C@@H]), the center is considered unassigned.
I ran your code, and none of the SMILES in the dataset have any assigned stereocenters.
For example, here are some of the SMILES and the configurations reported for them:
CC1C(C=O)N2CC12C
[(1, '?'), (2, '?'), (5, '?'), (7, '?')]
CC1N(C=O)C2CC12C
[(1, '?'), (5, '?'), (7, '?')]
CC1N(C=O)C2CC12O
[(1, '?'), (5, '?'), (7, '?')]
CN1C(C=O)C2CC21C
[(2, '?'), (5, '?'), (7, '?')]
O=CC1C(O)C2(O)CC12
[(2, '?'), (3, '?'), (5, '?'), (8, '?')]
CC1C(=O)C=CC1C=O
[(1, '?'), (6, '?')]
O=CC1C=CC(=O)C1O
[(2, '?'), (7, '?')]
C#CC1C=CC(=O)C1C
[(2, '?'), (7, '?')]
So as you can see, none of these SMILES carry @ or @@ chirality tags, which is why every configuration shows up as '?'.
The RDKit documentation for FindMolChiralCenters is also helpful here.
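As a quick sanity check (my own example, not from the dataset or the paper), a SMILES that does carry an @@ tag gets a proper R/S label from the very same call; for L-alanine I would expect something like [(1, 'S')]:
from rdkit import Chem

# L-alanine written with an explicit @@ chirality tag on the alpha carbon
mol = Chem.MolFromSmiles('C[C@@H](C(=O)O)N')

# includeUnassigned=True would additionally list untagged centers as '?'
print(Chem.FindMolChiralCenters(mol, includeUnassigned=True))
# expected output along the lines of: [(1, 'S')]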

Related

OR-Tools: Is there any way to get raw combinations and permutations?

For example, you call something like permutations([1,2,3,4], 2) or combinations([1,2,3,4], 2) and get a list of permutations/combinations like [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), ...]. Is it possible using exactly OR-Tools?
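For reference (this is not an OR-Tools answer, just an illustration of the target output described above), Python's standard itertools produces exactly these lists:
from itertools import combinations, permutations

# 2-element combinations of [1, 2, 3, 4]
print(list(combinations([1, 2, 3, 4], 2)))
# [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]

# 2-element permutations: order matters, so (2, 1), (3, 1), ... also appear
print(list(permutations([1, 2, 3, 4], 2)))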

Compare 2 pyspark dataframe columns and change values of another column based on it

I have generated a dataframe from a graph algorithm that I have written. The problem is that I want the value of the underlying component id to stay essentially the same after every run of the graph code.
This is a sample dataframe generated:
df = spark.createDataFrame(
    [
        (1, 'A1'),
        (1, 'A2'),
        (1, 'A3'),
        (2, 'B1'),
        (2, 'B2'),
        (3, 'B3'),
        (4, 'C1'),
        (4, 'C2'),
        (4, 'C3'),
        (4, 'C4'),
        (5, 'D1'),
    ],
    ['old_comp_id', 'db_id']
)
After another run the values change completely, so the new run has values like these,
df2 = spark.createDataFrame(
    [
        (2, 'A1'),
        (2, 'A2'),
        (2, 'A3'),
        (3, 'B1'),
        (3, 'B2'),
        (3, 'B3'),
        (1, 'C1'),
        (1, 'C2'),
        (1, 'C3'),
        (1, 'C4'),
        (4, 'D1'),
    ],
    ['new_comp_id', 'db_id']
)
So what I need to do is compare the values between the two dataframes above and change the component id based on the associated database id:
if the db_id is the same in both runs, update the component id to be the one from the 1st dataframe;
if it is different (i.e. new), assign a completely new comp_id (new_comp_id = max(old_comp_id) + 1).
This is what I have come up with so far:
old_ids = df.groupBy("old_comp_id").agg(F.collect_set(F.col("db_id")).alias("old_db_id"))
new_ids = df2.groupBy("new_comp_id").agg(F.collect_set(F.col("db_id")).alias("new_db_id"))
joined = new_ids.join(old_ids,old_ids.old_comp_id == new_ids.new_comp_id,"outer")
joined.withColumn("update_comp", F.when( F.col("new_db_id") == F.col("old_db_id"), F.col('old_comp_id')).otherwise(F.max(F.col("old_comp_id")+1))).show()
In order to use an aggregate function alongside non-aggregated columns, you should use window functions.
First, outer-join the two DataFrames on the db_id (renaming df2's db_id to new_db_id so the two sides stay distinguishable):
from pyspark.sql.functions import when, col, max
df2 = df2.withColumnRenamed("db_id", "new_db_id")
joinedDF = df.join(df2, df["db_id"] == df2["new_db_id"], "outer")
Then build the window spec: order by old_comp_id descending over the whole joined DataFrame (no partition keys), so that the running max seen by every row is the overall maximum of old_comp_id.
from pyspark.sql.window import Window
from pyspark.sql.functions import desc
windowSpec = Window\
    .orderBy(desc("old_comp_id"))\
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
Then build the running max column using the windowSpec:
from pyspark.sql.functions import max
maxCompId = max(col("old_comp_id")).over(windowSpec)
Then apply it in the select: keep the old_comp_id whenever the db_id already existed in the first run, otherwise assign max(old_comp_id) + 1.
joinedDF.select(col("db_id"), when(col("old_comp_id").isNotNull(), col("old_comp_id")).otherwise(maxCompId + 1).alias("updated_comp")).show()
For more information, please refer to the documentation (http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Window)
Hope this helps
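Putting the steps together, here is a compact end-to-end sketch of the same approach (it assumes the new_db_id rename from the join step above); the extra ('E1') row in the new run is a hypothetical addition of mine, purely to exercise the max(old_comp_id) + 1 branch:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# hypothetical extra db_id that only exists in the new run
df2b = df2.union(spark.createDataFrame([(6, 'E1')], ['new_comp_id', 'new_db_id']))

joined = df.join(df2b, df["db_id"] == df2b["new_db_id"], "outer")

# running max over the whole joined DataFrame (highest old_comp_id comes first)
w = Window.orderBy(F.desc("old_comp_id")).rowsBetween(Window.unboundedPreceding, Window.currentRow)
max_old = F.max(F.col("old_comp_id")).over(w)

joined.select(
    F.coalesce(F.col("db_id"), F.col("new_db_id")).alias("db_id"),
    F.when(F.col("old_comp_id").isNotNull(), F.col("old_comp_id"))
     .otherwise(max_old + 1)
     .alias("updated_comp"),
).show()
# 'E1' should end up with updated_comp = 6 (max old_comp_id is 5, plus 1)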

Pyspark | Transform RDD from key with list of values > values with list of keys

In PySpark, how do I transform an input RDD where every key has a list of values into an output RDD where every value has a list of the keys it belongs to?
Input
[(1, ['a','b','c','e']), (2, ['b','d']), (3, ['a','d']), (4, ['b','c'])]
Output
[('a', [1, 3]), ('b', [1, 2, 4]), ('c', [1, 4]), ('d', [2,3]), ('e', [1])]
Flatten and swap the key/value pairs on the RDD first, then groupByKey:
rdd.flatMap(lambda r: [(k, r[0]) for k in r[1]]).groupByKey().mapValues(list).collect()
# [('a', [1, 3]), ('e', [1]), ('b', [1, 2, 4]), ('c', [1, 4]), ('d', [2, 3])]
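A self-contained version of the same idea (assuming an existing SparkContext named sc; variable names are mine):
rdd = sc.parallelize([(1, ['a', 'b', 'c', 'e']), (2, ['b', 'd']), (3, ['a', 'd']), (4, ['b', 'c'])])

# emit one (value, key) pair per list entry, then collect the keys per value;
# sorted instead of list only makes each key list deterministic
flipped = rdd.flatMap(lambda r: [(v, r[0]) for v in r[1]]).groupByKey().mapValues(sorted)
print(flipped.collect())
# e.g. [('a', [1, 3]), ('b', [1, 2, 4]), ('c', [1, 4]), ('d', [2, 3]), ('e', [1])]  (tuple order may vary)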

Pyspark: How to calculate avg and count in a single groupBy? [duplicate]

This question already has answers here:
Multiple Aggregate operations on the same column of a spark dataframe
(6 answers)
Closed 4 years ago.
I would like to calculate avg and count in a single group by statement in Pyspark. How can I do that?
df = spark.createDataFrame(
    [(1, 'John', 1.79, 28, 'M', 'Doctor'),
     (2, 'Steve', 1.78, 45, 'M', None),
     (3, 'Emma', 1.75, None, None, None),
     (4, 'Ashley', 1.6, 33, 'F', 'Analyst'),
     (5, 'Olivia', 1.8, 54, 'F', 'Teacher'),
     (6, 'Hannah', 1.82, None, 'F', None),
     (7, 'William', 1.7, 42, 'M', 'Engineer'),
     (None, None, None, None, None, None),
     (8, 'Ethan', 1.55, 38, 'M', 'Doctor'),
     (9, 'Hannah', 1.65, None, 'F', 'Doctor')],
    ['Id', 'Name', 'Height', 'Age', 'Gender', 'Profession'])
# This only shows avg, but I also need count right next to it. How can I do that?
df.groupBy("Profession").agg({"Age":"avg"}).show()
df.show()
Thank you.
For the same column:
from pyspark.sql import functions as F
df.groupBy("Profession").agg(F.mean('Age'), F.count('Age')).show()
If you're able to use different columns:
df.groupBy("Profession").agg({'Age':'avg', 'Gender':'count'}).show()

Spark Scala: Distance between elements of RDDs [closed]

Closed. This question needs details or clarity. It is not currently accepting answers. Closed 7 years ago.
I have 2 RDDs with time series, like
rdd1.take(5)
[(1, 25.0)
(2, 50.23)
(3, 65.0)
(4, 7.23)
(5, 12.0)]
and
rdd2.take(5)
[(1, 85.0)
(2, 3.23)
(3, 9.0)
(4, 23.23)
(5, 65.0)]
I would like to find the distance between each element of the first RDD and each element of the second, and get the following:
result.take(5)
[((1,1): (25.0-85.0)**2),
((1,2): (25.0 - 3.23)**2),
.....
((1,5): (25.0 - 65.0)**2),
.....
((2,1): (50.23 - 85.0)**2),
.....
((5,5): (12.0 - 65.0)**2),
]
The number of elements can be from 10 000 to billions.
Please, help me.
@Mohit is right: you are looking for the cartesian product of your two RDDs, after which you map and compute your distance.
Here is an example:
val rdd1 = sc.parallelize(List((1, 25.0), (2, 50.23), (3, 65.0), (4, 7.23), (5, 12.0)))
val rdd2 = sc.parallelize(List((1, 85.0), (2, 3.23), (3, 9.0), (4, 23.23), (5, 65.0)))
val result = rdd1.cartesian(rdd2).map {
  case ((a, b), (c, d)) => ((a, c), math.pow(b - d, 2))
}
Now, let's see what it looks like:
result.take(10).foreach(println)
# ((1,1),3600.0)
# ((1,2),473.93289999999996)
# ((1,3),256.0)
# ((1,4),3.1328999999999985)
# ((1,5),1600.0)
# ((2,1),1208.9529000000002)
# ((2,2),2209.0)
# ((2,3),1699.9128999999998)
# ((2,4),728.9999999999998)
# ((2,5),218.1529000000001)
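For comparison (not part of the original answer), the same cartesian-and-map pattern written in PySpark would look roughly like this:
rdd1 = sc.parallelize([(1, 25.0), (2, 50.23), (3, 65.0), (4, 7.23), (5, 12.0)])
rdd2 = sc.parallelize([(1, 85.0), (2, 3.23), (3, 9.0), (4, 23.23), (5, 65.0)])

# pair every element of rdd1 with every element of rdd2, then square the difference
result = rdd1.cartesian(rdd2).map(lambda p: ((p[0][0], p[1][0]), (p[0][1] - p[1][1]) ** 2))
print(result.take(3))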
What you are looking for is the Cartesian product. This gives you the pairing of each element of RDD1 with each element of RDD2.
Since you are dealing with a billion-size dataset, make sure your infrastructure supports it.
A similar question may help you further.