Validation report: extend_v1
Model: extend_v1
Samples Generated: 500
Generated On: 2026-01-05 00:28
Generation Quality Metrics

| Uniqueness | Novelty | SELFIES fidelity |
|---|---|---|
| 0.9960 | 0.6426 | 0.0152 |
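The report does not state how these metrics are computed. A common convention, assumed here, is that uniqueness is the fraction of distinct strings among the generated samples, and novelty is the fraction of those distinct strings that do not appear in the training set. The sketch below follows that convention. The function name `uniqueness_and_novelty` and the toy inputs are hypothetical, and in practice the strings would first be canonicalized so that different spellings of the same molecule compare equal.

```python
def uniqueness_and_novelty(generated, training):
    """Return (uniqueness, novelty) under the assumed definitions.

    uniqueness = distinct generated / total generated
    novelty    = distinct generated not in training / distinct generated
    """
    if not generated:
        raise ValueError("generated must not be empty")
    unique = set(generated)
    novel = unique - set(training)
    return len(unique) / len(generated), len(novel) / len(unique)


# Toy data: four samples, one duplicate, one molecule already in the training set.
generated = ["CCO", "CCO", "CCN", "c1ccccc1"]
training = ["CCO", "CCC"]
uniqueness, novelty = uniqueness_and_novelty(generated, training)
print(f"uniqueness={uniqueness:.4f} novelty={novelty:.4f}")
```

Under these assumed definitions, a uniqueness of 0.9960 over 500 samples would correspond to 498 distinct strings, and 320 novel strings out of 498 would give 0.6426. This matches the reported values but does not confirm the definitions used.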
Descriptor Statistics

| Descriptor | Average | Minimum | Maximum |
|---|---|---|---|
| MW | 246.2045 | 97.1600 | 496.0100 |
| LogP | 1.9321 | -2.2100 | 7.6400 |
| SA_Score | 3.5984 | 1.5400 | 6.9800 |
| QED | 0.6220 | 0.1950 | 0.9430 |
| Fsp3 | 0.5701 | 0.0000 | 1.0000 |
| RotatableBonds | 3.0820 | 0.0000 | 12.0000 |
| RingCount | 2.5800 | 0.0000 | 9.0000 |
| TPSA | 47.8160 | 0.0000 | 138.8300 |
| RadicalElectrons | 0.0120 | 0.0000 | 2.0000 |
Descriptor Distributions

Development Notes
The model was trained on SMILES strings encoded as SELFIES. Training data was obtained from the QM9 and ZINC datasets. Training was stopped when the loss plateaued even after the learning rate had been reduced to 1e-5.
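The stopping rule described above can be sketched without any framework. The helper below is a hypothetical illustration, not the training code used for this model. It lowers the learning rate when the loss stops improving and signals a stop once the loss plateaus again at the minimum learning rate of 1e-5. The class name, the initial learning rate, `factor`, `patience`, and `min_delta` are all assumed values.

```python
class PlateauStopper:
    """Reduce the learning rate on a loss plateau; stop at the minimum rate."""

    def __init__(self, lr=1e-3, min_lr=1e-5, factor=0.1, patience=2, min_delta=1e-4):
        self.lr = lr
        self.min_lr = min_lr
        self.factor = factor
        self.patience = patience      # epochs without improvement that count as a plateau
        self.min_delta = min_delta    # smallest decrease that counts as an improvement
        self.best = float("inf")
        self.stale_epochs = 0

    def step(self, loss):
        """Record one epoch's loss; return True when training should stop."""
        if loss < self.best - self.min_delta:
            self.best = loss
            self.stale_epochs = 0
            return False
        self.stale_epochs += 1
        if self.stale_epochs < self.patience:
            return False
        # Plateau detected: stop if the rate is already at its floor.
        if self.lr <= self.min_lr:
            return True
        # Otherwise reduce the rate, snapping to min_lr to absorb float rounding.
        new_lr = self.lr * self.factor
        self.lr = self.min_lr if new_lr <= self.min_lr * (1 + 1e-9) else new_lr
        self.stale_epochs = 0
        return False


# Toy loss curve: improves for three epochs, then stays flat.
losses = [1.0, 0.8, 0.7] + [0.7] * 10
stopper = PlateauStopper()
stopped_at = None
for epoch, loss in enumerate(losses):
    if stopper.step(loss):
        stopped_at = epoch
        break
print(stopped_at, stopper.lr)
```

In PyTorch, the learning-rate part of this logic is available as `torch.optim.lr_scheduler.ReduceLROnPlateau`; the stop condition would still need to be checked in the training loop.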
The PyTorch dataset was defined as follows:

    import json

    import pandas as pd
    import selfies as sf
    import torch
    from torch.utils.data import Dataset


    class ChempleterDataset(Dataset):
        """
        PyTorch Dataset for SELFIES molecular representations.

        :param selfies_file: Path to CSV file containing SELFIES strings in a "selfies" column.
        :type selfies_file: str
        :param stoi_file: Path to JSON file mapping SELFIES symbols to integer tokens.
        :type stoi_file: str
        :returns: Integer tensor representation of tokenized molecule with dtype=torch.long.
        :rtype: torch.Tensor
        """

        def __init__(self, selfies_file, stoi_file):
            super().__init__()
            # Load all SELFIES strings into memory.
            selfies_dataframe = pd.read_csv(selfies_file)
            self.data = selfies_dataframe["selfies"].to_list()
            # Load the symbol-to-integer vocabulary.
            with open(stoi_file) as f:
                self.selfies_to_integer = json.load(f)

        def __len__(self):
            return len(self.data)

        def __getitem__(self, index):
            molecule = self.data[index]
            # Wrap the symbol sequence in start and end markers.
            symbols_molecule = ["[START]"] + list(sf.split_selfies(molecule)) + ["[END]"]
            # Raises KeyError for a symbol missing from the vocabulary.
            integer_molecule = [
                self.selfies_to_integer[symbol] for symbol in symbols_molecule
            ]
            return torch.tensor(integer_molecule, dtype=torch.long)
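The dataset expects `stoi_file` to exist already. One way such a mapping could be built is sketched below. This is an assumption about the preprocessing, not the project's actual script. The function name `build_stoi`, the `[PAD]` token at index 0, and the sorted symbol order are hypothetical choices. To keep the sketch dependency-free, symbols are split with a regular expression on bracketed tokens; in the real pipeline `sf.split_selfies` would play that role.

```python
import json
import re

# Matches one bracketed SELFIES symbol such as "[C]" or "[=C]".
SYMBOL_PATTERN = re.compile(r"\[[^\[\]]+\]")


def build_stoi(selfies_strings, special_tokens=("[PAD]", "[START]", "[END]")):
    """Map each symbol to an integer, with special tokens first."""
    symbols = set()
    for selfies in selfies_strings:
        symbols.update(SYMBOL_PATTERN.findall(selfies))
    # Sort the remaining symbols so the mapping is reproducible across runs.
    ordered = list(special_tokens) + sorted(symbols - set(special_tokens))
    return {symbol: index for index, symbol in enumerate(ordered)}


corpus = ["[C][C][O]", "[C][=C][N]"]
stoi = build_stoi(corpus)

# Encode the first molecule the same way __getitem__ does.
tokens = ["[START]"] + SYMBOL_PATTERN.findall(corpus[0]) + ["[END]"]
encoded = [stoi[token] for token in tokens]

print(json.dumps(stoi, indent=2))
print(encoded)
```

Writing the result with `json.dump(stoi, f)` would produce a file in the shape the dataset reads. Because `__getitem__` returns tensors of different lengths, batching with a `DataLoader` would also need a `collate_fn` that pads them, for example with `torch.nn.utils.rnn.pad_sequence`, which is why the sketch reserves an index for `[PAD]`.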