Validation report: extend_v2

Model: extend_v2

Samples Generated: 500

Generated On: 2026-01-04 20:40

Generation Quality Metrics

Uniqueness    Novelty    SELFIES fidelity
1.0000        0.9920     0.0077
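The report does not state how these metrics are computed. The sketch below uses one common convention and is an assumption, not the report's own code: uniqueness is the fraction of generated strings that are distinct, and novelty is the fraction of distinct generated strings that do not occur in the training set. It also assumes the strings were canonicalised beforehand so that string equality means molecule equality. The function names are hypothetical, and SELFIES fidelity is left out because its definition is not given here.

```python
def uniqueness(generated):
    """Fraction of generated strings that are distinct (assumed definition)."""
    if not generated:
        return 0.0
    return len(set(generated)) / len(generated)


def novelty(generated, training):
    """Fraction of distinct generated strings absent from the training set
    (assumed definition)."""
    unique_generated = set(generated)
    if not unique_generated:
        return 0.0
    return len(unique_generated - set(training)) / len(unique_generated)


# Toy strings, purely illustrative; not taken from the report.
generated = ["CCO", "CCO", "CCN", "c1ccccc1"]
training = ["CCO", "CCC"]

print(uniqueness(generated))         # 3 distinct strings out of 4 samples
print(novelty(generated, training))  # 2 of the 3 distinct strings are unseen
```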

Descriptor Statistics

Descriptor           Average   Minimum   Maximum
MW                  238.5955   83.0900  448.3600
LogP                  1.9655   -1.9000    6.3100
SA_Score              3.4320    1.1600    6.8300
QED                   0.6327    0.2250    0.9430
Fsp3                  0.5788    0.0000    1.0000
RotatableBonds        2.9760    0.0000   11.0000
RingCount             2.4040    0.0000    6.0000
TPSA                 44.7686    0.0000  127.8700
RadicalElectrons      0.0060    0.0000    1.0000
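As a rough illustration of how a table like this can be produced, the sketch below reduces per-molecule descriptor values to average, minimum and maximum. It assumes the descriptor values have already been computed elsewhere. The `summarize` helper and the toy numbers are hypothetical and are not taken from the report.

```python
from statistics import mean


def summarize(descriptor_values):
    """Map each descriptor name to (average, minimum, maximum).

    Descriptors with no values are skipped.
    """
    return {
        name: (mean(values), min(values), max(values))
        for name, values in descriptor_values.items()
        if values
    }


# Hypothetical per-molecule values for three molecules.
values = {
    "MW": [100.0, 200.0, 300.0],
    "RingCount": [0, 2, 4],
}

for name, (average, minimum, maximum) in summarize(values).items():
    print(f"{name:<18}{average:>10.4f}{minimum:>10.4f}{maximum:>10.4f}")
```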

Descriptor Distributions

[Figure: descriptor distributions]

Development Notes

The model was trained using SELFIES representations generated from randomized SMILES strings. The training corpus was derived from the QM9 and ZINC datasets. Randomization was applied at every data access by converting each SMILES string into a non-canonical, randomly ordered form before encoding it into SELFIES.

Training was stopped once the loss plateaued and showed no further improvement even after the learning rate was lowered to 1e-5. Convergence was slower than when training on canonical (non-randomized) SMILES. This is expected: the same molecule appears in different randomized forms across batches, which increases input variability and reduces batch-to-batch consistency.
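The stopping rule can be pictured with a small plateau check. This is only a sketch under assumptions: the report gives the final learning rate (1e-5) but not the starting rate, the schedule, the patience, or the improvement threshold, so every default below is hypothetical, as is the function name.

```python
def find_stop_epoch(losses, learning_rates=(1e-3, 1e-4, 1e-5),
                    patience=2, min_delta=1e-4):
    """Return (stop_epoch, learning_rate_at_stop) for a list of epoch losses.

    The learning rate moves one step down the (assumed) schedule whenever
    the loss fails to improve by min_delta for `patience` epochs in a row.
    Training stops when that happens at the last learning rate.
    stop_epoch is None if the stop rule is never met.
    """
    best = float("inf")
    stale = 0   # consecutive epochs without a meaningful improvement
    stage = 0   # index into learning_rates
    for epoch, loss in enumerate(losses):
        if loss < best - min_delta:
            best, stale = loss, 0
            continue
        stale += 1
        if stale < patience:
            continue
        if stage < len(learning_rates) - 1:
            stage += 1   # plateau: lower the learning rate and keep going
            stale = 0
        else:
            return epoch, learning_rates[stage]   # plateau at the lowest rate
    return None, learning_rates[stage]


# Hypothetical loss curve, not taken from the report.
losses = [1.0, 0.8, 0.7, 0.7, 0.7, 0.65, 0.65, 0.65, 0.65, 0.65]
print(find_stop_epoch(losses))
```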

A key challenge introduced by SMILES randomization is vocabulary stability. Because SELFIES strings are generated on the fly for each batch, symbols may appear that were not present when the stoi (symbol-to-index) mapping was originally constructed. This can cause errors during tokenization if the mapping is incomplete. Two mitigations are possible: make multiple passes through the dataset beforehand to build a more complete stoi mapping, or update the stoi mapping on the fly during the pipeline.
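A minimal sketch of the multi-pass idea follows. It makes several assumptions: `extend_stoi` and `split_symbols` are hypothetical helpers, the regex is only a dependency-free stand-in for `sf.split_selfies`, the SELFIES strings are toy examples, and new symbols are assumed to take the next free index (which only works if existing indices are contiguous from 0).

```python
import re

SPECIAL_SYMBOLS = ["[START]", "[END]"]


def split_symbols(selfies_string):
    """Stand-in for sf.split_selfies: bracketed symbols and '.' separators."""
    return re.findall(r"\[[^\]]*\]|\.", selfies_string)


def extend_stoi(stoi, selfies_strings):
    """Add unseen symbols to stoi in place; return the newly added symbols."""
    added = []
    for selfies_string in selfies_strings:
        for symbol in split_symbols(selfies_string):
            if symbol not in stoi:
                # Assumes indices are contiguous, so len(stoi) is the next free one.
                stoi[symbol] = len(stoi)
                added.append(symbol)
    return added


stoi = {symbol: index for index, symbol in enumerate(SPECIAL_SYMBOLS)}

# Each "pass" stands for one sweep over the dataset with fresh randomization.
first_pass = ["[C][C][O]", "[C][=C][C]"]
second_pass = ["[O][C][C]", "[C][Branch1][C][N][C]"]

print(extend_stoi(stoi, first_pass))
print(extend_stoi(stoi, second_pass))
print(len(stoi))
```

Passes can be repeated until one adds no new symbols, although that does not prove the mapping is complete. Growing the mapping during training is more delicate, because the model's embedding and output layers are sized to the vocabulary and would have to be resized whenever a symbol is added.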

The PyTorch dataset was defined as follows:

import json
import logging

import pandas as pd
import selfies as sf
import torch
from rdkit import Chem
from torch.utils.data import Dataset

logger = logging.getLogger(__name__)


class ChempleterRandomisedSmilesDataset(Dataset):
    """
    PyTorch Dataset for SELFIES molecular representations.

    :param smiles_file: Path to CSV file containing SMILES strings in a "smiles" column.
    :type smiles_file: str
    :param stoi_file: Path to JSON file mapping SELFIES symbols to integer tokens.
    :type stoi_file: str
    :returns: Integer tensor representation of tokenized molecule with dtype=torch.long.
    :rtype: torch.Tensor
    """

    def __init__(self, smiles_file, stoi_file):
        super().__init__()
        smiles_dataframe = pd.read_csv(smiles_file)
        self.data = smiles_dataframe["smiles"].to_list()
        with open(stoi_file) as f:
            self.selfies_to_integer = json.load(f)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        molecule_in_smiles = self.data[index]

        # Try randomisation: re-write the molecule as a random, non-canonical SMILES.
        molecule_in_selfies = None
        molecule = Chem.MolFromSmiles(molecule_in_smiles)
        if molecule is not None:
            try:
                molecule_in_selfies = sf.encoder(
                    Chem.MolToSmiles(molecule, canonical=False, doRandom=True)
                )
            except Exception as e:
                logger.error(f"SELFIES encoding error for randomised SMILES: {e}")

        # Fall back to the stored SMILES if parsing or randomised encoding failed,
        # so that molecule_in_selfies is always defined before tokenization.
        if molecule_in_selfies is None:
            molecule_in_selfies = sf.encoder(molecule_in_smiles)

        symbols_molecule = (
            ["[START]"] + list(sf.split_selfies(molecule_in_selfies)) + ["[END]"]
        )
        integer_molecule = []

        # Check that every symbol exists in stoi before mapping it to an integer.
        for symbol in symbols_molecule:
            if symbol not in self.selfies_to_integer:
                raise RuntimeError(
                    f"Molecule symbol not found in stoi. Add {symbol} in stoi with correct integer mapping."
                )
            integer_molecule.append(self.selfies_to_integer[symbol])

        return torch.tensor(integer_molecule, dtype=torch.long)