Validation report: extend_v1
============================

Model: extend_v1

Samples Generated: 500  

Generated On: 2026-01-05 00:28

Generation Quality Metrics
--------------------------

========== ======= ================
Uniqueness Novelty SELFIES fidelity
========== ======= ================
0.9960     0.6426  0.0152          
========== ======= ================

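The report does not define how the metrics above are computed. The following is a minimal sketch under assumed definitions: uniqueness as the fraction of generated strings that are distinct, and novelty as the fraction of distinct generated strings absent from the training set. The function names are hypothetical, and string equality stands in for molecule identity, so inputs are assumed to be canonicalized beforehand. SELFIES fidelity is left out because its definition is not stated.

```python
def uniqueness(generated):
    """Fraction of generated strings that are distinct (assumed definition)."""
    if not generated:
        return 0.0
    return len(set(generated)) / len(generated)


def novelty(generated, training):
    """Fraction of distinct generated strings not seen in training (assumed definition)."""
    unique = set(generated)
    if not unique:
        return 0.0
    return len(unique - set(training)) / len(unique)


# Hypothetical toy data; these are not samples from extend_v1.
generated = ["CCO", "CCO", "CCN", "c1ccccc1"]
training = ["CCO"]
print(uniqueness(generated), novelty(generated, training))
```

The reported values are at least consistent with these definitions (for instance, 498/500 = 0.9960 and 320/498 ≈ 0.6426), though the underlying counts are not stated in the report.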

Descriptor Statistics
---------------------

================ ======== ======= ========
Descriptor       Average  Minimum Maximum 
================ ======== ======= ========
MW               246.2045 97.1600 496.0100
LogP             1.9321   -2.2100 7.6400  
SA_Score         3.5984   1.5400  6.9800  
QED              0.6220   0.1950  0.9430  
Fsp3             0.5701   0.0000  1.0000  
RotatableBonds   3.0820   0.0000  12.0000 
RingCount        2.5800   0.0000  9.0000  
TPSA             47.8160  0.0000  138.8300
RadicalElectrons 0.0120   0.0000  2.0000  
================ ======== ======= ========


Descriptor Distributions
------------------------

.. image:: extend_v1.png
   :alt: Descriptor Distributions
   :align: center


Development Notes
-----------------

The model was trained on molecules from the QM9 and ZINC datasets, with each SMILES string encoded as SELFIES.
Training was stopped once the loss plateaued even after the learning rate had been reduced to 1e-5.
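
The stopping rule can be sketched in plain Python as follows. This is an illustration under assumptions, not the training code that was used: the class name ``PlateauStopper``, the initial learning rate, the reduction factor, the patience, and the improvement threshold are all hypothetical. Only the floor of 1e-5 comes from the notes above.

```python
import math


class PlateauStopper:
    """Reduce the learning rate on a loss plateau; stop once it plateaus at min_lr."""

    def __init__(self, lr=1e-3, min_lr=1e-5, factor=0.1, patience=3, min_delta=1e-4):
        # All defaults except min_lr are hypothetical values chosen for illustration.
        self.lr = lr
        self.min_lr = min_lr
        self.factor = factor
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.stale_epochs = 0

    def step(self, loss):
        """Record one epoch's loss and return True when training should stop."""
        if loss < self.best - self.min_delta:
            # Meaningful improvement: remember it and reset the plateau counter.
            self.best = loss
            self.stale_epochs = 0
            return False
        self.stale_epochs += 1
        if self.stale_epochs < self.patience:
            return False
        # A plateau was detected. Stop if the learning rate is already at its floor.
        if self.lr <= self.min_lr or math.isclose(self.lr, self.min_lr):
            return True
        # Otherwise reduce the learning rate (never below the floor) and keep going.
        self.lr = max(self.lr * self.factor, self.min_lr)
        self.stale_epochs = 0
        return False


# Hypothetical usage with made-up loss values.
stopper = PlateauStopper(lr=1e-3, patience=2)
for epoch, loss in enumerate([0.9, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7]):
    if stopper.step(loss):
        print(f"stopping after epoch {epoch} at lr={stopper.lr:g}")
        break
```

In a PyTorch training loop the reduction step could instead be delegated to ``torch.optim.lr_scheduler.ReduceLROnPlateau``; whether the original training did so is not stated.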

The PyTorch dataset was defined as follows:

.. code:: python

   import json

   import pandas as pd
   import selfies as sf
   import torch
   from torch.utils.data import Dataset


   class ChempleterDataset(Dataset):
       """
       PyTorch Dataset for SELFIES molecular representations.

       :param selfies_file: Path to CSV file containing SELFIES strings in a "selfies" column.
       :type selfies_file: str
       :param stoi_file: Path to JSON file mapping SELFIES symbols to integer tokens.
       :type stoi_file: str
       :returns: Integer tensor representation of tokenized molecule with dtype=torch.long.
       :rtype: torch.Tensor
       """

       def __init__(self, selfies_file, stoi_file):
           super().__init__()
           selfies_dataframe = pd.read_csv(selfies_file)
           self.data = selfies_dataframe["selfies"].to_list()
           # Load the symbol-to-integer vocabulary.
           with open(stoi_file) as f:
               self.selfies_to_integer = json.load(f)

       def __len__(self):
           return len(self.data)

       def __getitem__(self, index):
           molecule = self.data[index]
           # Wrap the SELFIES symbols in start/end markers before mapping them to integers.
           symbols_molecule = ["[START]"] + list(sf.split_selfies(molecule)) + ["[END]"]
           integer_molecule = [
               self.selfies_to_integer[symbol] for symbol in symbols_molecule
           ]
           return torch.tensor(integer_molecule, dtype=torch.long)
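
The dataset expects a JSON vocabulary that contains ``[START]``, ``[END]``, and every SELFIES symbol in the data. How that file was produced is not described here, so the following is only a sketch of one way to build such a mapping. The function name ``build_stoi`` and the index ordering are hypothetical, and the regular expression is an assumed stand-in for ``sf.split_selfies`` that captures bracketed symbols only.

```python
import json
import re

# Assumption: every SELFIES symbol is a bracketed token such as "[C]" or "[=N]".
SYMBOL_PATTERN = re.compile(r"\[[^\[\]]+\]")


def build_stoi(selfies_strings, specials=("[START]", "[END]")):
    """Map special tokens and all observed SELFIES symbols to consecutive integers."""
    symbols = set()
    for selfies_string in selfies_strings:
        symbols.update(SYMBOL_PATTERN.findall(selfies_string))
    # Special tokens come first, then the remaining symbols in sorted order
    # so that the mapping is reproducible across runs.
    ordered = list(specials) + sorted(symbols - set(specials))
    return {symbol: index for index, symbol in enumerate(ordered)}


# Hypothetical toy input; a real vocabulary would be built from the full training set.
stoi = build_stoi(["[C][C][O]", "[C][=C][N]"])
print(json.dumps(stoi, indent=2))
```

Writing the result with ``json.dump`` would give a file in the form that ``ChempleterDataset`` reads through its ``stoi_file`` argument.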