Validation report: extend_v2
============================

Model: extend_v2

Samples Generated: 500  

Generated On: 2026-01-04 20:40

Generation Quality Metrics
--------------------------

========== ======= ================
Uniqueness Novelty SELFIES fidelity
========== ======= ================
1.0000     0.9920  0.0077          
========== ======= ================


Descriptor Statistics
---------------------

================ ======== ======= ========
Descriptor       Average  Minimum Maximum 
================ ======== ======= ========
MW               238.5955 83.0900 448.3600
LogP             1.9655   -1.9000 6.3100  
SA_Score         3.4320   1.1600  6.8300  
QED              0.6327   0.2250  0.9430  
Fsp3             0.5788   0.0000  1.0000  
RotatableBonds   2.9760   0.0000  11.0000 
RingCount        2.4040   0.0000  6.0000  
TPSA             44.7686  0.0000  127.8700
RadicalElectrons 0.0060   0.0000  1.0000  
================ ======== ======= ========


Descriptor Distributions
------------------------

.. image:: extend_v2.png
   :alt: Descriptor Distributions
   :align: center


Development Notes
------------------------

The model was trained using SELFIES representations generated from randomized SMILES strings. 
The training corpus was derived from the QM9 and ZINC datasets. 
Randomization was applied at every data access by converting each SMILES string into a non-canonical, randomly ordered form before encoding it into SELFIES.

Training was stopped once the loss reached a clear plateau, even after lowering the learning rate to 1e-5. 
Compared to training on canonical (non-randomized) SMILES, convergence was slower. This behavior is expected, as the same molecule can appear in different randomized forms across batches, increasing input variability and reducing batch-to-batch consistency.

A key challenge introduced by SMILES randomization is vocabulary stability. Because SELFIES are generated on the fly for each batch, new symbols may appear that were not present when the stoi mapping was originally constructed. 
This can result in index errors during tokenization if the mapping is incomplete. Multiple runs through the dataset to get a more complete stoi mapping may be needed or update the stoi on the fly during the pipeline.

PyTorch dataset was defined as follows:

.. code::

   class ChempleterRandomisedSmilesDataset(Dataset):
      """
      PyTorch Dataset for SELFIES molecular representations.

      :param smiles_file: Path to CSV file containing SMILES strings in a "smiles" column.
      :type smiles_file: str
      :param stoi_file: Path to JSON file mapping SELFIES symbols to integer tokens.
      :type stoi_file: str
      :returns: Integer tensor representation of tokenized molecule with dtype=torch.long.
      :rtype: torch.Tensor
      """

      def __init__(self, smiles_file, stoi_file):
         super().__init__()
         smiles_dataframe = pd.read_csv(smiles_file)
         self.data = smiles_dataframe["smiles"].to_list()
         with open(stoi_file) as f:
               self.selfies_to_integer = json.load(f)

      def __len__(self):
         return len(self.data)

      def __getitem__(self, index):
         molecule_in_smiles = self.data[index]

         # try randomisation
         molecule = Chem.MolFromSmiles(molecule_in_smiles)
         if molecule is not None:
               try:
                  molecule_in_selfies = sf.encoder(
                     Chem.MolToSmiles(molecule, canonical=False, doRandom=True)
                  )
               except Exception as e:
                  logger.error(f"SELFIES encoding error for randomised SMILES: {e}")
         else:
               molecule_in_selfies = sf.encoder(molecule_in_smiles)

         symbols_molecule = (
               ["[START]"] + list(sf.split_selfies(molecule_in_selfies)) + ["[END]"]
         )
         integer_molecule = []

         # check if all symbols exist in stoi
         for symbol in symbols_molecule:
               if symbol not in self.selfies_to_integer:
                  raise RuntimeError(
                     f"Molecule symbol not found in stoi. Add {symbol} in stoi with correct integer mapping."
                  )
               else:
                  integer_molecule.append(self.selfies_to_integer[symbol])

         return torch.tensor(integer_molecule, dtype=torch.long)