🧪 Introduction to SFcalculator_torch: the basics¶

This notebook demonstrates how to use SFcalculator_torch — a PyTorch-based toolkit for computing structure factors from atomic models in macromolecular crystallography and cryo-electron microscopy.

You'll learn how to:

  • Load a PDB structure and MTZ reflection data
  • Compute structure factors for the protein and solvent
  • Combine them into a total model structure factor
  • Evaluate model quality via R-factors
  • Optionally optimize scaling factors to improve fit

🛠️ 1. Installation and Imports¶

If you haven't installed SFCalculator yet, you can install it as follows (see the example below):

  1. Install PyTorch

  2. pip install sfcalculator-torch
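
For example, in a notebook cell (the exact PyTorch command depends on your platform and CUDA version; see pytorch.org):

!pip install torch                 # or the platform-specific command from pytorch.org
!pip install sfcalculator-torch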

The imports necessary for this notebook:

In [1]:
import torch
import numpy as np
import matplotlib.pyplot as plt
import SFC_Torch as sfc
import reciprocalspaceship as rs
from reciprocalspaceship.algorithms import scale_merged_intensities

📂 2. Loading Experimental Data and the Atomic Model¶

SFCalculator's core capabilities are the calculation of structure factors from a model and their comparison to observations in the form of R factors. For likelihood-based calculations, please see the ROCKET codebase.

Let's first fetch a dataset from the PDB. SFCalculator provides a utility function to fetch entries from the PDB database by entry ID; it also supports batch fetching.

In [2]:
df = sfc.fetch_pdb(idlist=["7lvc"], outpath="./")
100%|██████████| 1/1 [00:03<00:00,  3.61s/it]

It will attempt to download the sequence FASTA file, molecular model CIF file, structure factor CIF file, PDB file, and structure factor MTZ file when available. A CSV file indicates which file formats were fetched for each entry ID.

In [2]:
!ls
7lvc.cif	    7lvc.pdb	  refined_model.pdb
7lvc.fasta	    7lvc-sf.cif   SFcalculator_torch_demo_with_patterson.html
7lvc_FW_scaled.mtz  fetchpdb.csv  SFcalculator_torch_demo_with_patterson.ipynb

Then we use reciprocalspaceship to inspect our dataset.

In [3]:
ds = rs.read_cif("7lvc-sf.cif")
ds.head()
Out[3]:
 H  K  L  FreeR_flag        I(+)  SIGI(+)        I(-)  SIGI(-)         FWT         PHWT     DELFWT     PHDELWT
16 11  3          12   78.059998     7.19   79.589996     5.27  110.150002   124.489998  46.810001       -55.5
16 11  2           9  170.080002    11.66  163.039993     8.03  175.240005    90.769997  41.369999  -89.389999
16 11  1          12   98.040001      5.8   91.389999     5.36  142.119995      -103.32       8.52   76.629997
16 10  7           8  111.849998     8.45       98.43     5.57  157.660004   156.199997       5.93  155.820007
16 10  6           0        9.92     2.98        7.66     2.28       16.26  -164.720001      38.34       13.24

We see that we have separate Friedel mates recorded for the intensities. There are also amplitude columns, but these are 2Fo-Fc amplitudes (FWT) and Fo-Fc amplitudes (DELFWT) with their phases, not raw observations! (see Global Phasing's Kwiki for more details). As a consequence, we should first "stack" the Friedel mates and then French-Wilson scale them, for example like this:

In [4]:
ds = ds.stack_anomalous()
ds_scaled = scale_merged_intensities(
    ds,
    intensity_key="I",
    sigma_key="SIGI",
    output_columns=["FW-I", "FW-SIGI", "F", "SIGF"],
    inplace=False,
    mean_intensity_method="anisotropic",
    bw=2.0,
    minimum_sigma=1e-6,
)
ds_scaled.write_mtz("7lvc_FW_scaled.mtz")

ds_scaled.head()
Out[4]:
 H  K  L  FreeR_flag           I   SIGI         FWT         PHWT     DELFWT     PHDELWT      dHKL  CENTRIC        FW-I   FW-SIGI          F      SIGF
16 11  3          12   78.059998   7.19  110.150002   124.489998  46.810001       -55.5  1.900489    False   77.590569      7.19   8.799018  0.409693
16 11  2           9  170.080002  11.66  175.240005    90.769997  41.369999  -89.389999  1.902241    False  168.864914     11.66  12.987023  0.449588
16 11  1          12   98.040001    5.8  142.119995      -103.32       8.52   76.629997  1.903295    False   97.741295       5.8   9.882053  0.293788
16 10  7           8  111.849998   8.45  157.660004   156.199997       5.93  155.820007  1.921573    False  111.219727      8.45  10.538425  0.401649
16 10  6           0        9.92   2.98       16.26  -164.720001      38.34       13.24  1.926292    False    9.845529  2.971575   3.098016  0.497819

Now, let's read in these data, together with a model, into an SFCalculator object: the core object that keeps track of our calculations.

In [5]:
pdb_path = "7lvc.cif"
mtz_path = "7lvc_FW_scaled.mtz"

sf = sfc.SFcalculator(pdb_path, mtz_path, device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
                      anomalous=True, mode="xray",
                      expcolumns=["F","SIGF"],
                      freeflag="FreeR_flag", testset_value=0, wavelength=1.892)

print("Model loaded with", sf.n_atoms, "atoms in space group", sf.space_group.hm)
print("Number of reflections in dataset: ", len(sf.HKL_array))
2025-07-16 11:56:57.851 | WARNING  | SFC_Torch.Fmodel:init_pdb:147 - Can't find wavelength record in the PDB file, or it doesn't match your input wavelength!
2025-07-16 11:56:57.988 | WARNING  | SFC_Torch.Fmodel:init_mtz:357 - Can't find wavelength record in the MTZ file, or it doesn't match with other sources
Model loaded with 2985 atoms in space group P 21 21 21
Number of reflections in dataset:  27967

🧮 3. Computing Structure Factors¶

For an accurate and efficient calculation, we first inspect the data to determine an appropriate solvent percentage and a reasonable grid size for the FFT:

In [6]:
sf.inspect_data(verbose=True)
Solvent Percentage: tensor(0.2636, device='cuda:0')
Grid size: [64, 90, 180]

Next, we calculate the contributions from the protein and from the solvent to the overall structure factor. If we want their actual complex values as a separate variable, we can add Return=True. Internally, they are stored as sfcalculator.Fprotein_HKL and sfcalculator.Fmask_HKL.

In [7]:
F_protein = sf.calc_fprotein(Return=True)
F_solvent = sf.calc_fsolvent(Return=True)
In [8]:
print(F_protein)
tensor([-77.0283+111.0421j,  -9.1909+192.4632j, -31.6068-130.6637j,
         ..., -29.2952+10.9436j,   2.5986+11.8835j,
        -14.9516-75.6051j], device='cuda:0')

We also need to calculate some global scale parameters (stored internally as sfcalculator.kmasks, sfcalculator.uanisos, and sfcalculator.kisos). To get an initial, analytical estimate, we can run sf.init_scales. A more optimized value can be obtained using sf.get_scales_adam().

In [9]:
%%time
sf.init_scales(requires_grad=True)
CPU times: user 76.9 ms, sys: 16 ms, total: 92.9 ms
Wall time: 91.6 ms

At this point, the total structure factor has already been calculated. We can retrieve it using:

In [10]:
Fmodel = sf.calc_ftotal()

It is calculated differentiably with respect to the parameters created with requires_grad=True, which can be scale factors or the atomic model coordinates (see the sketch after the next cell).

In [11]:
Fmodel
Out[11]:
tensor([-6.3304+9.1257j, -0.7553+15.8170j, -2.5975-10.7382j,  ...,
        -2.4075+0.8994j,  0.2136+0.9766j, -1.2287-6.2131j], device='cuda:0',
       grad_fn=<IndexPutBackward0>)
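
Because Fmodel carries a grad_fn, any scalar loss built from it can be backpropagated to those parameters. A minimal sketch, assuming sf.Fo holds the observed amplitudes on the same device (it is used that way in the Patterson section below) and that sf.kisos is a leaf tensor created by init_scales(requires_grad=True):

# Toy amplitude loss; backward() populates .grad on the scale parameters
loss = torch.mean((torch.abs(Fmodel) - sf.Fo) ** 2)
loss.backward()
print(sf.kisos.grad)  # gradient of the loss w.r.t. the isotropic scales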

📏 4. Evaluating Model Quality with R-factors¶

We can get a sense of the overall fit of the model to the data using R factors:

In [12]:
print(f"R-factors (work, free): {100 * sf.get_rfactors()[0]:.4}% and {100 * sf.get_rfactors()[1]:.4}%")
R-factors (work, free): 15.34% and 18.88%
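
For reference, the conventional R factor is R = Σ | |Fobs| − |Fmodel| | / Σ |Fobs|, accumulated separately over the working and free sets. A quick sanity check of the same quantity over all reflections, again assuming sf.Fo holds the observed amplitudes:

# Textbook R factor over all reflections; get_rfactors() additionally
# splits the working and free sets
Fobs = sf.Fo
Fcalc = torch.abs(Fmodel)
R_overall = (torch.sum(torch.abs(Fobs - Fcalc)) / torch.sum(Fobs)).item()
print(f"Overall R: {100 * R_overall:.2f}%")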

We can use the Adam optimizer to estimate more accurate scales and see that this helps:

In [13]:
%%time
sf.get_scales_adam(n_steps=100, initialize=False)
CPU times: user 9.2 s, sys: 511 ms, total: 9.71 s
Wall time: 10.8 s
In [14]:
print(f"R-factors (work, free): {100 * sf.get_rfactors()[0]:.4}% and {100 * sf.get_rfactors()[1]:.4}%")
R-factors (work, free): 14.71% and 18.19%

or use an LBFGS optimizer,

In [15]:
%%time
sf.get_scales_lbfgs(initialize=True)
print(f"R-factors (work, free): {100 * sf.get_rfactors()[0]:.4}% and {100 * sf.get_rfactors()[1]:.4}%")
R-factors (work, free): 14.79% and 18.29%
CPU times: user 4.3 s, sys: 7.81 ms, total: 4.31 s
Wall time: 4.34 s

and we can obtain more granular information using sf.summarize():

In [16]:
sf.summarize()
Resolution       N_work  N_free  <Fobs> <|Fmodel|>  R_work  R_free  k_mask   k_iso
49.52 - 11.37        93       7    21.8      19.4   0.344   0.221   0.068   0.052
11.37 - 9.12        107       2    24.0      21.9   0.237   0.764   0.065   0.061
9.12 - 7.39         173      11    23.2      21.8   0.245   0.226   0.058   0.075
7.39 - 5.99         334      17    19.7      18.1   0.226   0.305   0.050   0.075
5.99 - 4.86         634      22    24.2      23.0   0.199   0.196   0.022   0.073
4.86 - 3.94        1157      72    29.2      28.2   0.156   0.178   0.019   0.070
3.94 - 3.20        2169     142    24.4      23.9   0.147   0.170   0.012   0.083
3.20 - 2.59        4068     227    16.5      16.1   0.146   0.189   0.003   0.085
2.59 - 2.10        7596     388    11.9      11.7   0.135   0.179   0.000   0.083
2.10 - 1.70       10260     488     7.8       7.6   0.132   0.178   0.000   0.085
r_work: 0.148  
r_free: 0.183  
Number of outliers: 0      

💾 5. Exporting a Refined PDB Model¶

In [17]:
sf.savePDB("refined_model.pdb")

🧬 6. Bonus: Computing and Visualizing a Patterson Map¶

A Patterson map is the Fourier transform of |F_obs|² (squared structure factor amplitudes). It reveals interatomic vector densities and is especially useful when phase information is unavailable, e.g. as a target during Molecular Replacement.
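
Concretely, for a unit cell of volume $V$:

$$P(u, v, w) = \frac{1}{V} \sum_{hkl} |F_{hkl}|^2 \cos\big(2\pi (hu + kv + lw)\big)$$

so the map can be computed from measured amplitudes alone, with no phases required.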

We'll compute a Patterson map and visualize a central slice through the origin.

We first generate an array of Patterson vectors and then evaluate the Patterson map on those points.

In [18]:
%%time
patterson_uvw_arr_frac = sfc.patterson.uvw_array_frac(sf.unit_cell, 1.0, 20.0, 0.3)
Pu_o = sfc.patterson.Patterson_torch(patterson_uvw_arr_frac, 
                                     sf.Fo, sf.HKL_array, sf.unit_cell.volume,
                                     sharpen=True, remove_origin=True,
                                     PARTITION=10000)
CPU times: user 233 ms, sys: 17.3 ms, total: 251 ms
Wall time: 52.6 ms

Then we fill in the grid with the above two arrays, using simple rounding operations for now.

In [19]:
# Accumulate the Patterson values onto the FFT grid by rounding each
# fractional (u, v, w) coordinate to its nearest grid point
patterson_grid = np.zeros(sf.gridsize)
grid_size_arr = np.array(sf.gridsize)
scaled_coords = sfc.utils.assert_numpy(patterson_uvw_arr_frac) * grid_size_arr
integer_indices = np.round(scaled_coords, decimals=0).astype(int)
integer_indices = np.clip(integer_indices, 0, grid_size_arr - 1)  # keep indices inside the grid
patterson_grid[tuple(integer_indices.T)] = sfc.utils.assert_numpy(Pu_o)
In [20]:
patterson_grid.shape
Out[20]:
(64, 90, 180)
In [21]:
actual_grid_size = [sf.unit_cell.a, sf.unit_cell.b, sf.unit_cell.c]
print("Grid size in Angstroms:", actual_grid_size)
Grid size in Angstroms: [34.297, 45.552, 99.035]

Then we visualize a slice of the map in the w = 0 plane:

In [22]:
show_range_x = int(20.0 / actual_grid_size[0] * sf.gridsize[0])
show_range_y = int(20.0 / actual_grid_size[1] * sf.gridsize[1])
slice_to_show = patterson_grid[:show_range_x, :show_range_y, 0]
fig, ax = plt.subplots(figsize=(5, 4))
im = ax.imshow(
    slice_to_show,
    extent=[0, 20.0, 0, 20.0],
    origin='lower', 
    aspect='equal',  
    cmap='bwr',
    vmin=-0.2, vmax=0.2
)

ax.set_xlabel(r"X-axis ($\AA$)")
ax.set_ylabel(r"Y-axis ($\AA$)")
ax.set_title("Patterson Map Slice")
fig.colorbar(im, ax=ax, label="Intensity")

plt.show()
[Figure: "Patterson Map Slice", a 20 Å × 20 Å slice through the w = 0 plane, rendered with the 'bwr' colormap.]

🧠 Summary and Next Steps¶

In this notebook, you've learned to:

✅ Load and parse a crystallographic model and reflection data
✅ Compute structure factors using atomic scattering
✅ Evaluate model quality with R and R-free
✅ Optimize scale parameters with PyTorch
✅ Save and analyze your results

🚀 What’s Next?¶

  • Try anomalous=True for SAD/MAD data
  • Swap mode="xray" with mode="cryoem" for cryo-EM scattering
  • Integrate with custom loss functions for machine learning (a minimal sketch follows)
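
As a starting point for that last item, here is a hedged sketch of a custom-loss loop. It assumes sf.kisos is a leaf tensor created with requires_grad=True and that calc_ftotal() can be re-run each step to rebuild the graph; the loss and the choice of parameters to update are illustrative, not a prescribed API:

# Hedged sketch: refine the isotropic scales against a toy amplitude loss
optimizer = torch.optim.Adam([sf.kisos], lr=1e-3)
for step in range(50):
    optimizer.zero_grad()
    Fm = sf.calc_ftotal()                             # rebuild Fmodel with the current scales
    loss = torch.mean((torch.abs(Fm) - sf.Fo) ** 2)   # swap in your own loss here
    loss.backward()
    optimizer.step()
print(f"final loss: {loss.item():.4f}")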