Validation report: bridge_v1 ============================ Model: bridge_v1 Samples Generated: 500 Generated On: 2026-01-04 20:40 Generation Quality Metrics -------------------------- ========== ======= ================ Uniqueness Novelty SELFIES fidelity ========== ======= ================ 0.8900 1.0000 0.0083 ========== ======= ================ Descriptor Statistics --------------------- ================ ======== ======== ======== Descriptor Average Minimum Maximum ================ ======== ======== ======== MW 235.1938 133.1500 303.4500 LogP 3.0860 0.4700 6.1200 SA_Score 2.2645 1.0600 5.8900 QED 0.7688 0.3260 0.9020 Fsp3 0.1675 0.0000 0.3800 RotatableBonds 3.9620 0.0000 9.0000 RingCount 2.3200 1.0000 4.0000 TPSA 24.6952 0.0000 63.6900 RadicalElectrons 0.0240 0.0000 1.0000 ================ ======== ======== ======== Descriptor Distributions ------------------------ .. image:: bridge_v1.png :alt: Descriptor Distributions :align: center Development Notes ------------------------ The model was trained on SELFIES sequences derived from SMILES strings, with data sourced from the QM9 and ZINC datasets. In contrast to the earlier setup in extend models, two additional control tokens—[MASK] and [BRIDGE] were introduced to explicitly model fragment bridging. As before in extend models, SMILES randomization was applied at data loading time using non-canonical RDKit SMILES to increase sequence diversity. Training was stopped once the loss reached a clear plateau, even after lowering the learning rate to 1e-5. Bridging ^^^^^^^^^^^^^^ Unlike the extend models, which learned to extend SMILES directly, this approach reformulates the task as a bridge completion problem. During dataset construction, sufficiently long SELFIES sequences are split into three parts: * an initial fragment (frag1), * a contiguous segment treated as the bridge, * and an end fragment (frag2). and the model is trained to predict the bridge tokens after [BRIDGE], followed by [END]. At inference time, two fragments are joined together as ``frag1 + [MASK] + frag2 + [BRIDGE]`` to form the prompt. Loss Masking ^^^^^^^^^^^^^^^^^ The training loss is computed only on the predicted bridge tokens. All target positions corresponding to [START], frag1, [MASK], frag2, and [BRIDGE] are ignored by the cross-entropy loss (set to zero as this alos the padding index). Fragmentation constraints ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Fragmentation is attempted only for molecules with sufficient length. For SELFIES sequences longer than 10 symbols: * frag1 length is sampled randomly, * bridge length is sampled between 1 and 10 tokens, * frag2 contains the remainder. Shorter sequences are left unfragmented and passed through unchanged, so in principle, this model can also extend by leaving frag2 empty. PyTorch dataset was defined as follows: .. code:: class ChempleterRandomisedBridgeDataset(Dataset): """ PyTorch Dataset for SELFIES molecular representations. :param smiles_file: Path to CSV file containing SMILES strings in a "smiles" column. :type smiles_file: str :param stoi_file: Path to JSON file mapping SELFIES symbols to integer tokens. :type stoi_file: str :returns: Integer tensor representation of tokenized molecule with dtype=torch.long. :rtype: torch.Tensor """ def __init__(self, smiles_file, stoi_file): super().__init__() smiles_dataframe = pd.read_csv(smiles_file) self.data = smiles_dataframe["smiles"].to_list() with open(stoi_file) as f: self.selfies_to_integer = json.load(f) def __len__(self): return len(self.data) def __getitem__(self, index): molecule_in_smiles = self.data[index] # try randomisation molecule = Chem.MolFromSmiles(molecule_in_smiles) if molecule is not None: try: molecule_in_selfies = sf.encoder( Chem.MolToSmiles(molecule, canonical=False, doRandom=True) ) except Exception as e: molecule_in_selfies = sf.encoder(molecule_in_smiles) logger.error(f"SELFIES encoding error for randomised SMILES: {e}") else: molecule_in_selfies = sf.encoder(molecule_in_smiles) symbols = list(sf.split_selfies(molecule_in_selfies)) # try fragmentation if len(symbols) > 10: len_frag1 = random.randint(1, len(symbols) - 8) len_bridge = random.randint( 1, 10 ) # bridge len ranges from 1 to 10.# this would be constraint later len_frag2 = min(len_frag1 + len_bridge, len(symbols) - 1) frag1 = symbols[:len_frag1] bridge = symbols[len_frag1:len_frag2] frag2 = symbols[len_frag2:] else: frag1 = symbols bridge = [] frag2 = [] symbols_molecule = ( ["[START]"] + frag1 + ["[MASK]"] + frag2 + ["[BRIDGE]"] + bridge + ["[END]"] ) integer_molecule = [] # check if all symbols exist in stoi for symbol in symbols_molecule: if symbol not in self.selfies_to_integer: raise RuntimeError( f"Molecule symbol not found in stoi. Add {symbol} in stoi with correct integer mapping." ) else: integer_molecule.append(self.selfies_to_integer[symbol]) return torch.tensor(integer_molecule, dtype=torch.long)