Validation report: extend_v1

Model: extend_v1

Samples Generated: 500

Generated On: 2026-01-05 00:28

Generation Quality Metrics

Uniqueness    Novelty    SELFIES fidelity
0.9960        0.6426     0.0152
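
The report does not state how these metrics are computed. As a rough sketch under common conventions, uniqueness is assumed here to be the fraction of distinct strings among the generated samples, and novelty the fraction of distinct samples that do not appear in the training set. The function names and the toy data are hypothetical, and the strings are assumed to be canonicalized before comparison:

```python
def uniqueness(generated):
    """Fraction of generated strings that are distinct."""
    if not generated:
        return 0.0
    return len(set(generated)) / len(generated)


def novelty(generated, training):
    """Fraction of distinct generated strings absent from the training set."""
    unique = set(generated)
    if not unique:
        return 0.0
    return len(unique - set(training)) / len(unique)


# Toy data for illustration only (not taken from the report).
generated = ["CCO", "CCO", "CCN", "c1ccccc1"]
training = ["CCO", "CCC"]

print(f"Uniqueness: {uniqueness(generated):.4f}")
print(f"Novelty:    {novelty(generated, training):.4f}")
```

Other definitions exist (for example, computing novelty over all valid samples rather than distinct ones), so the values in this report may follow a different convention.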

Descriptor Statistics

Descriptor           Average     Minimum     Maximum
MW                  246.2045     97.1600    496.0100
LogP                  1.9321     -2.2100      7.6400
SA_Score              3.5984      1.5400      6.9800
QED                   0.6220      0.1950      0.9430
Fsp3                  0.5701      0.0000      1.0000
RotatableBonds        3.0820      0.0000     12.0000
RingCount             2.5800      0.0000      9.0000
TPSA                 47.8160      0.0000    138.8300
RadicalElectrons      0.0120      0.0000      2.0000

Descriptor Distributions

[Figure: Descriptor Distributions]

Development Notes

The model was trained on SMILES strings converted to SELFIES. The training data came from the QM9 and ZINC datasets. Training was stopped once the loss plateaued and no longer improved even after the learning rate had been reduced to 1e-5.
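
The stopping rule described above can be sketched in plain Python. This is only an illustration: the notes give the final learning rate (1e-5) but not the initial rate, the reduction factor, the patience, or the improvement threshold, so every default below other than `min_lr` is an assumed value, and `PlateauStopper` is a hypothetical name rather than the code used for this model.

```python
import math


class PlateauStopper:
    """Reduce the learning rate on a loss plateau; stop once it is at the floor.

    All defaults except min_lr are assumptions, not values from the report.
    """

    def __init__(self, lr=1e-3, factor=0.1, patience=3, min_delta=1e-4, min_lr=1e-5):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_delta = min_delta
        self.min_lr = min_lr
        self.best = float("inf")
        self.stale_epochs = 0

    def step(self, loss):
        """Record one epoch's loss; return True when training should stop."""
        if loss < self.best - self.min_delta:
            # Meaningful improvement: remember it and reset the counter.
            self.best = loss
            self.stale_epochs = 0
            return False
        self.stale_epochs += 1
        if self.stale_epochs < self.patience:
            return False
        # Plateau detected. If the rate is already at the floor, stop.
        if self.lr < self.min_lr or math.isclose(self.lr, self.min_lr):
            return True
        # Otherwise lower the rate and give training another chance.
        self.lr = max(self.lr * self.factor, self.min_lr)
        self.stale_epochs = 0
        return False


stopper = PlateauStopper()
losses = [1.0, 0.8, 0.7] + [0.7] * 20  # synthetic losses that flatten out
for epoch, loss in enumerate(losses):
    if stopper.step(loss):
        print(f"stopping at epoch {epoch}, lr={stopper.lr:.0e}")
        break
```

In a PyTorch training loop, `torch.optim.lr_scheduler.ReduceLROnPlateau` with its `min_lr` argument covers the rate-reduction half of this logic; the stop condition would still need to be checked separately.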

The PyTorch dataset was defined as follows:

import json

import pandas as pd
import selfies as sf
import torch
from torch.utils.data import Dataset


class ChempleterDataset(Dataset):
    """
    PyTorch Dataset for SELFIES molecular representations.

    :param selfies_file: Path to CSV file containing SELFIES strings in a "selfies" column.
    :type selfies_file: str
    :param stoi_file: Path to JSON file mapping SELFIES symbols to integer tokens.
    :type stoi_file: str
    :returns: Integer tensor representation of tokenized molecule with dtype=torch.long.
    :rtype: torch.Tensor
    """

    def __init__(self, selfies_file, stoi_file):
        super().__init__()
        selfies_dataframe = pd.read_csv(selfies_file)
        self.data = selfies_dataframe["selfies"].to_list()
        # Load the symbol-to-integer vocabulary.
        with open(stoi_file) as f:
            self.selfies_to_integer = json.load(f)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        molecule = self.data[index]
        # Wrap the symbol sequence in start and end markers.
        symbols_molecule = ["[START]"] + list(sf.split_selfies(molecule)) + ["[END]"]
        # Raises KeyError if a symbol is missing from the vocabulary.
        integer_molecule = [
            self.selfies_to_integer[symbol] for symbol in symbols_molecule
        ]
        return torch.tensor(integer_molecule, dtype=torch.long)
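
The dataset expects a JSON vocabulary at `stoi_file`, but the notes do not show how that file was built. One possible approach, given here only as a sketch, is to collect every symbol in the corpus and number them, reserving entries for `[START]` and `[END]`. The helper names `build_stoi` and `encode`, the symbol ordering, and the regex splitter (a dependency-free stand-in for `sf.split_selfies`) are all assumptions:

```python
import json
import re

# Stand-in for sf.split_selfies: matches bracketed symbols and "." separators.
_SYMBOL = re.compile(r"\[[^\[\]]*\]|\.")


def build_stoi(selfies_strings):
    """Map each symbol to an integer; the special tokens take the lowest ids."""
    symbols = set()
    for selfies in selfies_strings:
        symbols.update(_SYMBOL.findall(selfies))
    vocab = ["[START]", "[END]"] + sorted(symbols)
    return {symbol: index for index, symbol in enumerate(vocab)}


def encode(selfies, stoi):
    """Mirror ChempleterDataset.__getitem__, returning a plain list of ints."""
    symbols = ["[START]"] + _SYMBOL.findall(selfies) + ["[END]"]
    return [stoi[symbol] for symbol in symbols]


corpus = ["[C][C][O]", "[C][=C][N]"]  # toy SELFIES strings, not the training data
stoi = build_stoi(corpus)
print(json.dumps(stoi))
print(encode("[C][C][O]", stoi))
```

Writing the mapping out with `json.dump(stoi, f)` would produce a file in the shape that `ChempleterDataset` loads. Sorting the symbols keeps the ids stable across runs for a given corpus.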