🧪 Introduction to SFcalculator_torch: the basics
This notebook demonstrates how to use SFcalculator_torch, a PyTorch-based toolkit for computing structure factors from atomic models in macromolecular crystallography and cryo-electron microscopy.
You'll learn how to:
- Load a PDB structure and MTZ reflection data
- Compute structure factors for the protein and solvent
- Combine them into a total model structure factor
- Evaluate model quality via R-factors
- Optionally optimize scaling factors to improve fit
The imports necessary for this notebook:
import torch
import numpy as np
import matplotlib.pyplot as plt
import SFC_Torch as sfc
import reciprocalspaceship as rs
from reciprocalspaceship.algorithms import scale_merged_intensities
📂 2. Loading Experimental Data and the Atomic Model
SFCalculator's core capabilities are calculating structure factors from a model and comparing them to observations in the form of R factors. For likelihood-based calculations, please see the ROCKET codebase.
Let's first fetch a dataset from the PDB. SFCalculator provides a utility function to fetch entries from the PDB database by entry ID; it also supports batch fetching (a short sketch follows the directory listing below).
df = sfc.fetch_pdb(idlist=["7lvc"], outpath="./")
100%|██████████| 1/1 [00:03<00:00, 3.61s/it]
It will attempt to download the sequence FASTA file, molecular model CIF file, structure factor CIF file, PDB file, and structure factor MTZ file when available. A CSV file indicates which file formats were fetched for each entry ID.
!ls
7lvc.cif 7lvc.pdb refined_model.pdb 7lvc.fasta 7lvc-sf.cif SFcalculator_torch_demo_with_patterson.html 7lvc_FW_scaled.mtz fetchpdb.csv SFcalculator_torch_demo_with_patterson.ipynb
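Since fetch_pdb accepts a list, batch fetching is just a matter of passing multiple entry IDs; a minimal sketch (the second ID here is purely illustrative):

df = sfc.fetch_pdb(idlist=["7lvc", "1ubq"], outpath="./")  # "1ubq" added only for illustration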
Then we use reciprocalspaceship to inspect our dataset:
ds = rs.read_cif("7lvc-sf.cif")
ds.head()
| H | K | L | FreeR_flag | I(+) | SIGI(+) | I(-) | SIGI(-) | FWT | PHWT | DELFWT | PHDELWT |
|---|---|---|------------|------|---------|------|---------|-----|------|--------|---------|
| 16 | 11 | 3 | 12 | 78.059998 | 7.19 | 79.589996 | 5.27 | 110.150002 | 124.489998 | 46.810001 | -55.5 |
| 16 | 11 | 2 | 9 | 170.080002 | 11.66 | 163.039993 | 8.03 | 175.240005 | 90.769997 | 41.369999 | -89.389999 |
| 16 | 11 | 1 | 12 | 98.040001 | 5.8 | 91.389999 | 5.36 | 142.119995 | -103.32 | 8.52 | 76.629997 |
| 16 | 10 | 7 | 8 | 111.849998 | 8.45 | 98.43 | 5.57 | 157.660004 | 156.199997 | 5.93 | 155.820007 |
| 16 | 10 | 6 | 0 | 9.92 | 2.98 | 7.66 | 2.28 | 16.26 | -164.720001 | 38.34 | 13.24 |
We see that we have separate Friedel mates recorded for the intensities. There are also amplitude columns, but these are 2Fo-Fc amplitudes (FWT), Fo-Fc amplitudes (DELFWT), and their phases (PHWT, PHDELWT), not raw observations! (See Global Phasing's Kwiki for more details.) As a consequence, we should first "stack" the Friedel mates and then French-Wilson scale the intensities, for example like this:
ds = ds.stack_anomalous()
ds_scaled = scale_merged_intensities(
ds,
intensity_key="I",
sigma_key="SIGI",
output_columns=["FW-I", "FW-SIGI", "F", "SIGF"],
inplace=False,
mean_intensity_method="anisotropic",
bw=2.0,
minimum_sigma=1e-6,
)
ds_scaled.write_mtz("7lvc_FW_scaled.mtz")
ds_scaled.head()
| H | K | L | FreeR_flag | I | SIGI | FWT | PHWT | DELFWT | PHDELWT | dHKL | CENTRIC | FW-I | FW-SIGI | F | SIGF |
|---|---|---|------------|---|------|-----|------|--------|---------|------|---------|------|---------|---|------|
| 16 | 11 | 3 | 12 | 78.059998 | 7.19 | 110.150002 | 124.489998 | 46.810001 | -55.5 | 1.900489 | False | 77.590569 | 7.19 | 8.799018 | 0.409693 |
| 16 | 11 | 2 | 9 | 170.080002 | 11.66 | 175.240005 | 90.769997 | 41.369999 | -89.389999 | 1.902241 | False | 168.864914 | 11.66 | 12.987023 | 0.449588 |
| 16 | 11 | 1 | 12 | 98.040001 | 5.8 | 142.119995 | -103.32 | 8.52 | 76.629997 | 1.903295 | False | 97.741295 | 5.8 | 9.882053 | 0.293788 |
| 16 | 10 | 7 | 8 | 111.849998 | 8.45 | 157.660004 | 156.199997 | 5.93 | 155.820007 | 1.921573 | False | 111.219727 | 8.45 | 10.538425 | 0.401649 |
| 16 | 10 | 6 | 0 | 9.92 | 2.98 | 16.26 | -164.720001 | 38.34 | 13.24 | 1.926292 | False | 9.845529 | 2.971575 | 3.098016 | 0.497819 |
Now, let's read these data, together with a model, into an SFcalculator object: the core object that keeps track of our calculations.
pdb_path = "7lvc.cif"
mtz_path = "7lvc_FW_scaled.mtz"
sf = sfc.SFcalculator(
    pdb_path, mtz_path,
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
    anomalous=True, mode="xray",
    expcolumns=["F", "SIGF"],
    freeflag="FreeR_flag", testset_value=0, wavelength=1.892,
)
print("Model loaded with", sf.n_atoms, "atoms in space group", sf.space_group.hm)
print("Number of reflections in dataset: ", len(sf.HKL_array))
2025-07-16 11:56:57.851 | WARNING | SFC_Torch.Fmodel:init_pdb:147 - Can't find wavelength record in the PDB file, or it doesn't match your input wavelength!
2025-07-16 11:56:57.988 | WARNING | SFC_Torch.Fmodel:init_mtz:357 - Can't find wavelength record in the MTZ file, or it doesn't match with other sources
Model loaded with 2985 atoms in space group P 21 21 21
Number of reflections in dataset:  27967
🧮 3. Computing Structure Factors
For an accurate and efficient calculation, we first inspect the data to determine an appropriate solvent percentage and a reasonable grid size for the FFT:
sf.inspect_data(verbose=True)
Solvent Percentage: tensor(0.2636, device='cuda:0') Grid size: [64, 90, 180]
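These quantities are stored on the calculator object; for instance, sf.gridsize (which we will reuse below when building the Patterson grid) can be read back directly:

print(sf.gridsize)  # the FFT grid chosen by inspect_data, [64, 90, 180] for this dataset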
Next, we calculate the contributions from the protein and from the solvent to the overall structure factor. If we want their actual complex values as separate variables, we can add Return=True. Internally, they are stored as sfcalculator.Fprotein_HKL and sfcalculator.Fmask_HKL.
F_protein = sf.calc_fprotein(Return=True)
F_solvent = sf.calc_fsolvent(Return=True)
print(F_protein)
tensor([-77.0283+111.0421j, -9.1909+192.4632j, -31.6068-130.6637j, ..., -29.2952+10.9436j, 2.5986+11.8835j, -14.9516-75.6051j], device='cuda:0')
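As a quick sanity check (a minimal sketch using plain torch ops), the returned tensor should match the internally stored copy, and amplitudes and phases can be recovered directly from the complex values:

assert torch.allclose(F_protein, sf.Fprotein_HKL)  # returned tensor matches the stored copy
amps = torch.abs(F_protein)      # structure factor amplitudes |F|
phases = torch.angle(F_protein)  # phases in radians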
We also need to calculate some global scale parameters (stored internally as sfcalculator.kmasks, sfcalculator.uanisos, and sfcalculator.kisos). To get an initial, analytical estimate, we can run sf.init_scales. A more optimized value can be obtained using sf.get_scales_adam().
%%time
sf.init_scales(requires_grad=True)
CPU times: user 76.9 ms, sys: 16 ms, total: 92.9 ms Wall time: 91.6 ms
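The initialized scales are now attached to the object and, because we passed requires_grad=True, they participate in autograd (a quick check, assuming kisos is stored as a tensor):

print(sf.kisos.requires_grad)  # True: the scales are trainable parameters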
The total structure factor has already been calculated at this point. We can retrieve it using:
Fmodel = sf.calc_ftotal()
It is calculated differentiably with respect to all parameters that require gradients, which can be the scale factors (initialized above with requires_grad=True) or the atomic model coordinates.
Fmodel
tensor([-6.3304+9.1257j, -0.7553+15.8170j, -2.5975-10.7382j, ..., -2.4075+0.8994j, 0.2136+0.9766j, -1.2287-6.2131j], device='cuda:0', grad_fn=<IndexPutBackward0>)
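To see this differentiability in action, here is a minimal sketch with a toy scalar loss (not a refinement target); we assume the scales initialized above are leaf tensors, so gradients reach them:

loss = torch.sum(torch.abs(Fmodel))  # toy loss on the model amplitudes, purely to demonstrate autograd
loss.backward()
print(sf.kisos.grad)  # populated if kisos is a leaf tensor with requires_grad=True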
📏 4. Evaluating Model Quality with R-factors
We can get a sense of the overall fit of the model to the data using R factors:
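As a reminder, the conventional R factor compares observed and calculated amplitudes, evaluated separately over the working and free reflection sets:

$$R = \frac{\sum_{hkl} \big|\, |F_{\mathrm{obs}}| - |F_{\mathrm{model}}| \,\big|}{\sum_{hkl} |F_{\mathrm{obs}}|}$$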
print(f"R-factors (work, free): {100 * sf.get_rfactors()[0]:.4}% and {100 * sf.get_rfactors()[1]:.4}%")
R-factors (work, free): 15.34% and 18.88%
We can use the Adam optimizer to estimate more accurate scales and see that this helps:
%%time
sf.get_scales_adam(n_steps=100, initialize=False)
CPU times: user 9.2 s, sys: 511 ms, total: 9.71 s Wall time: 10.8 s
print(f"R-factors (work, free): {100 * sf.get_rfactors()[0]:.4}% and {100 * sf.get_rfactors()[1]:.4}%")
R-factors (work, free): 14.71% and 18.19%
or use an LBFGS optimizer,
%%time
sf.get_scales_lbfgs(initialize=True)
print(f"R-factors (work, free): {100 * sf.get_rfactors()[0]:.4}% and {100 * sf.get_rfactors()[1]:.4}%")
R-factors (work, free): 14.79% and 18.29%
CPU times: user 4.3 s, sys: 7.81 ms, total: 4.31 s
Wall time: 4.34 s
and we can obtain more granular information using SFcalculator.summarize():
sf.summarize()
Resolution     N_work  N_free  <Fobs>  <|Fmodel|>  R_work  R_free  k_mask  k_iso
49.52 - 11.37      93       7    21.8        19.4   0.344   0.221   0.068  0.052
11.37 -  9.12     107       2    24.0        21.9   0.237   0.764   0.065  0.061
 9.12 -  7.39     173      11    23.2        21.8   0.245   0.226   0.058  0.075
 7.39 -  5.99     334      17    19.7        18.1   0.226   0.305   0.050  0.075
 5.99 -  4.86     634      22    24.2        23.0   0.199   0.196   0.022  0.073
 4.86 -  3.94    1157      72    29.2        28.2   0.156   0.178   0.019  0.070
 3.94 -  3.20    2169     142    24.4        23.9   0.147   0.170   0.012  0.083
 3.20 -  2.59    4068     227    16.5        16.1   0.146   0.189   0.003  0.085
 2.59 -  2.10    7596     388    11.9        11.7   0.135   0.179   0.000  0.083
 2.10 -  1.70   10260     488     7.8         7.6   0.132   0.178   0.000  0.085
r_work: 0.148  r_free: 0.183
Number of outliers: 0
💾 6. Exporting a Refined PDB Model
sf.savePDB("refined_model.pdb")
🧬 7. Bonus: Computing and Visualizing a Patterson Map
A Patterson map is the Fourier transform of |F_obs|² (squared structure factor amplitudes). It reveals interatomic vector densities and is especially useful when phase information is unavailable, e.g. as a target during Molecular Replacement.
We'll compute a Patterson map and visualize a central slice through the origin.
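Concretely, for a fractional vector $\mathbf{u} = (u, v, w)$ and unit-cell volume $V$, the Patterson function is

$$P(\mathbf{u}) = \frac{1}{V} \sum_{hkl} |F_{\mathrm{obs}}(hkl)|^2 \cos\big(2\pi(hu + kv + lw)\big),$$

which requires only the measured amplitudes, no phases.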
We first generate an array of Patterson vectors and then calculate the Patterson map values at those points:
%%time
# Fractional Patterson vectors (u, v, w) at which to evaluate the map
patterson_uvw_arr_frac = sfc.patterson.uvw_array_frac(sf.unit_cell, 1.0, 20.0, 0.3)
# Patterson map values at those vectors, computed from the observed amplitudes sf.Fo
Pu_o = sfc.patterson.Patterson_torch(patterson_uvw_arr_frac,
                                     sf.Fo, sf.HKL_array, sf.unit_cell.volume,
                                     sharpen=True, remove_origin=True,
                                     PARTITION=10000)
CPU times: user 233 ms, sys: 17.3 ms, total: 251 ms Wall time: 52.6 ms
Then we fill the grid from these two arrays, assigning each sampled vector to its nearest grid point by simple rounding for now (unsampled grid points remain zero):
# Empty map on the FFT grid determined earlier by inspect_data
patterson_grid = np.zeros(sf.gridsize)
grid_size_arr = np.array(sf.gridsize)
# Convert fractional coordinates to grid indices by rounding, then clip to bounds
scaled_coords = sfc.utils.assert_numpy(patterson_uvw_arr_frac) * grid_size_arr
integer_indices = np.round(scaled_coords, decimals=0).astype(int)
integer_indices = np.clip(integer_indices, 0, grid_size_arr - 1)
# Write each Patterson value to its nearest grid point
patterson_grid[tuple(integer_indices.T)] = sfc.utils.assert_numpy(Pu_o)
patterson_grid.shape
(64, 90, 180)
# Physical extent of the map along each axis: the unit-cell edge lengths in Å
actual_grid_size = [sf.unit_cell.a, sf.unit_cell.b, sf.unit_cell.c]
print("Grid size in Angstroms:", actual_grid_size)
Grid size in Angstroms: [34.297, 45.552, 99.035]
Then we visualize a slice through the origin (the w = 0 section):
show_range_x = int(20.0 / actual_grid_size[0] * sf.gridsize[0])
show_range_y = int(20.0 / actual_grid_size[1] * sf.gridsize[1])
slice_to_show = patterson_grid[:show_range_x, :show_range_y, 0]
fig, ax = plt.subplots(figsize=(5, 4))
im = ax.imshow(
slice_to_show,
extent=[0, 20.0, 0, 20.0],
origin='lower',
aspect='equal',
cmap='bwr',
vmin=-0.2, vmax=0.2
)
ax.set_xlabel(r"X-axis ($\AA$)")
ax.set_ylabel(r"Y-axis ($\AA$)")
ax.set_title("Patterson Map Slice")
fig.colorbar(im, ax=ax, label="Intensity")
plt.show()
🧠 Summary and Next Steps
In this notebook, you've learned to:
✅ Load and parse a crystallographic model and reflection data
✅ Compute structure factors using atomic scattering
✅ Evaluate model quality with R-work and R-free
✅ Optimize scale parameters with PyTorch
✅ Save and analyze your results
🚀 What’s Next?
- Try anomalous=True for SAD/MAD data
- Swap mode="xray" for mode="cryoem" to use cryo-EM scattering
- Integrate with custom loss functions for machine learning (a minimal sketch follows)
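As a pointer for that last item, here is a minimal sketch of a custom loss (hypothetical amplitude-MSE target; we assume sf.Fo holds the observed amplitudes, as in the Patterson example above):

loss = torch.mean((torch.abs(sf.calc_ftotal()) - sf.Fo) ** 2)  # amplitude MSE, purely illustrative
loss.backward()  # gradients flow back to the scales (and coordinates, if enabled)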