Crystalite is a lightweight diffusion Transformer for crystal generation and crystal structure prediction. This post covers its chemistry-aware atom encoding, geometry-aware attention mechanism, and benchmark results.
Discovering new crystalline materials remains a difficult search problem and a central challenge in modern materials science.
Generative models are therefore attractive because they can learn to propose candidate crystals directly from data. However, crystal generation is not simply a matter of predicting discrete atom labels. Atoms occupy positions inside a periodic lattice, distances wrap across unit-cell boundaries, and small geometric errors can change the physical plausibility of a structure. This is the main reason why much of the recent literature has relied on geometry-aware and often equivariant graph neural networks.
Crystalite begins from a narrower question: how much of this geometric structure must be built into the backbone itself? Put differently, can a diffusion Transformer remain competitive if the inductive bias is placed more carefully? Recent diffusion-transformer results suggest that lighter backbones can indeed be competitive when the representation and geometry are handled well.
A crystal is naturally described by three coupled objects: atom identities $ \mathbf{A} $, fractional coordinates $ \mathbf{F} $, and lattice geometry $ \mathbf{L} $:
\[\mathcal{C} = (\mathbf{A}, \mathbf{F}, \mathbf{L}), \qquad \mathbf{A} \in \{0,1\}^{N \times N_Z}, \quad \mathbf{F} \in [0,1)^{N \times 3}, \quad \mathbf{L} \in \mathbb{R}^{3 \times 3}.\]An important observation is that fractional coordinates should not be regarded as Cartesian coordinates in another form. A row of $ \mathbf{F} $ specifies the position of an atom within the unit cell relative to the lattice basis, whereas $ \mathbf{L} $ specifies the lattice basis itself in real space. As a result, the same fractional arrangement can correspond to substantially different Cartesian structures under different lattices, and real-space positions are only well defined when $ \mathbf{F} $ and $ \mathbf{L} $ are considered jointly.
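The point that real-space positions only exist jointly through $\mathbf{F}$ and $\mathbf{L}$ can be made concrete with a small sketch. The `Crystal` class and its `cartesian` helper below are illustrative, not Crystalite's actual API; only the array shapes follow the definition above.

```python
import numpy as np

class Crystal:
    """Minimal container for C = (A, F, L); shapes follow the text."""
    def __init__(self, A, F, L):
        self.A = np.asarray(A)          # one-hot atom types, (N, N_Z)
        self.F = np.asarray(F) % 1.0    # fractional coords, wrapped into [0, 1)
        self.L = np.asarray(L)          # lattice basis, (3, 3)

    def cartesian(self):
        # Real-space positions are only defined jointly by F and L:
        # each row of F is expressed in the lattice basis.
        return self.F @ self.L

# One atom at the cell center of a cubic cell with a = 4 Å.
c = Crystal(A=np.eye(1, 5), F=[[0.5, 0.5, 0.5]], L=4.0 * np.eye(3))
print(c.cartesian())  # → [[2. 2. 2.]]
```

Note how the same `F` under a different `L` (say, a doubled or skewed cell) would yield entirely different Cartesian positions, which is exactly why the two channels must be modeled jointly.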
This representation immediately explains why the problem is specialized. The unit cell is repeated periodically in all directions. Local environments matter, but only through the lattice that defines how the crystal repeats in real space. Geometric relations must be computed under the minimum-image convention rather than inside an isolated box.
Equivariant GNNs are well matched to this setting. They are designed to process neighborhoods, distances, directions, and symmetry transformations in a controlled way, which has enabled strong results in both crystal structure prediction and crystal generation. At the same time, incorporating equivariance in this way can make the resulting models architecturally complex and relatively slow at sampling time, when the denoiser must be evaluated repeatedly. Representative examples include CDVAE, DiffCSP, FlowMM, MatterGen, and OMatG.
Crystalite does not argue that geometry can be ignored. Rather, it asks whether geometry can be introduced in a more economical way. The model retains a standard diffusion Transformer backbone and concentrates its crystal-specific structure in the representation and the attention mechanism.
The first modification concerns atom identity. In many crystal generators, atom types are represented by one-hot vectors. This is convenient, but chemically it is a poor geometry: every element is orthogonal to every other element. Sodium is no closer to lithium than it is to xenon.
This matters especially when diffusion is formulated over a continuous atom-type signal rather than with an explicitly discrete diffusion process. In that setting, one-hot atom identity provides no notion of chemical proximity: nearby vectors in representation space do not correspond to chemically similar elements, and chemically plausible substitutions are not encoded in the representation.
Crystalite replaces one-hot atom identity with Subatomic Tokenization, a compact descriptor built from periodic and electronic structure: period, group, block, and valence-shell occupancy. This gives the atom representation a more chemically meaningful structure.
In a simplified form, the token for element $k$ can be written as
\[\mathbf{h}_k = \big[ \mathrm{onehot}(r_k), \mathrm{onehot}(g_k), \mathrm{onehot}(b_k), s_k/2,\, p_k/6,\, d_k/10,\, f_k/14 \big].\]These descriptors are standardized, balanced across feature groups, optionally PCA-compressed, and then treated as continuous atom tokens. During sampling, a predicted token is decoded back to a valid element by nearest-token matching. This makes the denoising problem better aligned with chemical structure: nearby errors in token space correspond more naturally to chemically related species.
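A minimal sketch of this construction follows, assuming a maximum period of 7, group of 18, and four blocks (s, p, d, f); the exact feature layout, standardization, and PCA step in Crystalite may differ. It also shows the nearest-token decoding used at sampling time.

```python
import numpy as np

def onehot(i, n):
    v = np.zeros(n)
    v[i - 1] = 1.0
    return v

def subatomic_token(period, group, block, s, p, d, f):
    # Period / group / block as one-hots, valence occupancy normalized
    # by shell capacity (2, 6, 10, 14), as in the formula above.
    blocks = {"s": 1, "p": 2, "d": 3, "f": 4}
    return np.concatenate([
        onehot(period, 7),
        onehot(group, 18),
        onehot(blocks[block], 4),
        [s / 2, p / 6, d / 10, f / 14],
    ])

# Na: period 3, group 1, s-block, 3s^1.  Li: period 2, group 1, s-block, 2s^1.
# Xe: period 5, group 18, p-block, 5s^2 5p^6.
na = subatomic_token(3, 1, "s", 1, 0, 0, 0)
li = subatomic_token(2, 1, "s", 1, 0, 0, 0)
xe = subatomic_token(5, 18, "p", 2, 6, 0, 0)

# Unlike one-hot identities, Na is now closer to Li than to Xe.
print(np.linalg.norm(na - li) < np.linalg.norm(na - xe))  # → True

elements = {"Li": li, "Na": na, "Xe": xe}

def decode(token):
    # Nearest-token matching back to a valid element.
    return min(elements, key=lambda e: np.linalg.norm(elements[e] - token))

noisy = na + 0.05 * np.random.default_rng(0).normal(size=na.shape)
print(decode(noisy))  # → Na
```

This is the sense in which nearby errors in token space map to chemically related species: a small perturbation of the sodium token still decodes to sodium, and its nearest neighbors are alkali metals rather than arbitrary elements.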
\[\mathcal{C} = (\mathbf{H}, \mathbf{F}, \mathbf{y}), \qquad \mathbf{H} \in \mathbb{R}^{N \times d_H}, \quad \mathbf{F} \in [0,1)^{N \times 3}, \quad \mathbf{y} \in \mathbb{R}^{6}.\]These are the three continuous channels denoised jointly by the model: atom tokens $ \mathbf{H} $, fractional coordinates $ \mathbf{F} $, and a compact latent description of the lattice $ \mathbf{y} $.
The second key design choice concerns geometry. Crystals are periodic objects, so geometric quantities should respect the torus structure of fractional coordinates. The subtle point is that Crystalite uses two closely related constructions here: a wrapped fractional residual for the coordinate loss, and a metric-aware periodic-image search for the geometry-aware attention module introduced below.
\[\boldsymbol{\delta}^{\mathrm{wrap}}_{ij} = \mathrm{wrap}(\mathbf{f}_i - \mathbf{f}_j), \qquad \mathrm{wrap}(\mathbf{u}) = \mathbf{u} - \mathrm{round}(\mathbf{u}), \qquad d^{\mathrm{wrap}}_{ij} = \left\| \boldsymbol{\delta}^{\mathrm{wrap}}_{ij} \mathbf{L} \right\|_2.\]This wrapped residual is the right object for the coordinate loss because the model predicts fractional coordinates directly. For the geometry-aware attention bias, the goal is slightly different: we want the shortest periodic displacement under the lattice metric, not just a componentwise wrap. In practice, the model searches over a small set of nearby periodic images and keeps the one with the smallest real-space norm:
\[\Delta \mathbf{f}_{ij}^{\star} = \arg\min_{\mathbf{r} \in \Omega_R} \left\| \left( \mathbf{f}_i - \mathbf{f}_j + \mathbf{r} \right)\mathbf{L} \right\|_2.\]For orthogonal cells these two notions coincide, but for skewed lattices they do not have to. That is why it is useful to distinguish between wrapped fractional residuals in the loss and metric-aware minimum-image geometry in the attention module. The backbone itself is still intentionally simple: one token per atom, one additional global token for the lattice, and a standard diffusion-conditioned Transformer trunk.
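The distinction is easiest to see in code. The sketch below implements both constructions under the assumption of a small image search window $R = 1$ (the actual window used by Crystalite is not specified here); the skewed-cell lattice is a deliberately extreme illustration.

```python
import numpy as np
from itertools import product

def wrap(u):
    # Componentwise wrapped fractional residual, each entry in [-0.5, 0.5].
    u = np.asarray(u, dtype=float)
    return u - np.round(u)

def min_image(fi, fj, L, R=1):
    # Metric-aware minimum image: search periodic images r in {-R..R}^3
    # and keep the displacement with the smallest real-space norm under L.
    best_df, best_d = None, np.inf
    for r in product(range(-R, R + 1), repeat=3):
        df = np.asarray(fi) - np.asarray(fj) + np.asarray(r)
        d = np.linalg.norm(df @ L)
        if d < best_d:
            best_df, best_d = df, d
    return best_df, best_d

# Orthogonal cell: wrap and minimum-image coincide.
L_orth = 4.0 * np.eye(3)
fa, fb = np.array([0.9, 0.1, 0.0]), np.array([0.1, 0.9, 0.0])
df_min, _ = min_image(fa, fb, L_orth)
print(np.allclose(wrap(fa - fb), df_min))  # → True

# Strongly skewed cell: the componentwise wrap is no longer shortest.
L_skew = np.array([[1.0, 0.0, 0.0], [0.6, 0.1, 0.0], [0.0, 0.0, 1.0]])
fc, fd = np.array([0.30, 0.00, 0.0]), np.array([0.75, 0.10, 0.0])
d_wrap = np.linalg.norm(wrap(fc - fd) @ L_skew)
_, d_min = min_image(fc, fd, L_skew)
print(d_min < d_wrap)  # → True
```

In the skewed cell, stepping to a neighboring image along the second lattice vector produces a shorter real-space displacement than the componentwise wrap, which is precisely the case the attention module needs to handle.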
\[\mathbf{t}_i^{\mathrm{atom}} = E_H(\mathbf{h}_i) + E_F(\mathbf{f}_i), \qquad \mathbf{t}^{\mathrm{lat}} = E_{\mathrm{lat}}(\mathbf{y}).\]
This gives the model a clean decomposition between local atomic information and global cell information. The lattice is parameterized through a lower-triangular latent, which keeps the representation unconstrained while ensuring positive diagonal entries:
\[\mathbf{L}(\mathbf{y}) = \begin{bmatrix} e^{y_1} & 0 & 0 \\ y_2 & e^{y_3} & 0 \\ y_4 & y_5 & e^{y_6} \end{bmatrix}.\]The geometry-specific piece of this backbone is the Geometric Enhancement Module (GEM). GEM computes periodic pairwise features from atom positions and the lattice, then feeds them into attention as a bias rather than replacing the Transformer with a heavier message-passing architecture.
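The lattice decoding above is straightforward to write down. The sketch below shows why this parameterization is convenient: any $\mathbf{y} \in \mathbb{R}^6$ decodes to a valid lattice with positive volume, since the determinant is a product of exponentials.

```python
import numpy as np

def lattice_from_latent(y):
    # Lower-triangular lattice with exponentiated diagonal entries:
    # y is unconstrained in R^6, yet det(L) = exp(y1 + y3 + y6) > 0.
    y1, y2, y3, y4, y5, y6 = y
    return np.array([
        [np.exp(y1), 0.0,        0.0],
        [y2,         np.exp(y3), 0.0],
        [y4,         y5,         np.exp(y6)],
    ])

L = lattice_from_latent(np.zeros(6))
print(L)                      # y = 0 decodes to the identity lattice
print(np.linalg.det(L) > 0)   # → True, for any y
```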
Concretely, GEM adds a geometry-dependent bias directly to the attention logits:
\[A_{\mathrm{geom}} = \frac{QK^\top}{\sqrt{d}} + B_{\mathrm{geom}}.\]
The bias term $B_{\mathrm{geom}}$ is built from periodic pairwise geometry: metric-aware minimum-image displacements, normalized distances, Fourier or radial features, and a compact lattice descriptor. The resulting attention is still standard self-attention, but it is informed by crystal geometry at the point where token interactions are decided.
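As a rough sketch of the mechanism, the code below biases attention logits with a simple function of pairwise minimum-image distances. The real GEM features (Fourier/radial terms, the lattice descriptor, and how they are mixed into $B_{\mathrm{geom}}$) are richer; `geom_attention` and the distance-only bias here are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def geom_attention(Q, K, V, dists, scale=1.0):
    # Standard self-attention with an additive geometry bias on the
    # logits; here closer atoms simply get a larger bias (illustrative).
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d)
    B_geom = -scale * dists
    return softmax(logits + B_geom) @ V

rng = np.random.default_rng(0)
N, d = 4, 8
Q, K, V = rng.normal(size=(3, N, d))
dists = rng.uniform(0.5, 5.0, size=(N, N))   # stand-in for min-image distances
np.fill_diagonal(dists, 0.0)
out = geom_attention(Q, K, V, dists)
print(out.shape)  # → (4, 8)
```

The key design point survives the simplification: the interaction pattern is still decided by standard dot-product attention, and geometry enters only as an additive term on the logits, so the backbone stays a plain Transformer.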
This is the central modeling idea. Crystalite does not attempt to remove geometric inductive bias. Instead, it introduces a softer and more modular form of that bias inside the attention mechanism.
We evaluate Crystalite in two settings. In crystal structure prediction (CSP), the composition is known and the model predicts the structure. In de novo generation, the model generates atom types, coordinates, and lattice jointly from noise.
The CSP results are particularly direct. Across the reported benchmarks, Crystalite achieves the strongest results in both match rate and RMSE. The RMSE improvements are especially notable, because they suggest better geometric refinement rather than merely better recovery of the correct structural mode.
| Model | MP-20 MR ↑ | MP-20 RMSE ↓ | MPTS-52 MR ↑ | MPTS-52 RMSE ↓ | Alex-MP-20 MR ↑ | Alex-MP-20 RMSE ↓ |
|---|---|---|---|---|---|---|
| CDVAE | 33.90 | 0.1045 | 5.34 | 0.2106 | -- | -- |
| DiffCSP | 51.49 | 0.0631 | 12.19 | 0.1786 | -- | -- |
| FlowMM | 61.39 | 0.0566 | 17.54 | 0.1726 | -- | -- |
| CrystalFlow | 62.02 | 0.0710 | 22.71 | 0.1548 | -- | -- |
| KLDM | 65.83 | 0.0517 | 23.93 | 0.1276 | -- | -- |
| OMatG | 63.75 | 0.0720 | 25.15 | 0.1931 | 64.71 | 0.1251 |
| Crystalite | 66.05 | 0.0329 | 31.49 | 0.0701 | 67.52 | 0.0335 |
This interpretation matches the reported ablations: GEM has only a modest effect on match rate, but it consistently improves structural accuracy.
The de novo generation setting is more nuanced, because validity, novelty, uniqueness, and stability pull in different directions. For that reason, the most informative summary target is often S.U.N., which rewards structures that are simultaneously stable, unique, and novel. On this benchmark, Crystalite achieves the strongest S.U.N. result while remaining substantially faster than competing baselines.
| Model | Struct. Val. ↑ | Comp. Val. ↑ | Unique ↑ | Novel ↑ | U.N. ↑ | Stable ↑ | S.U.N. ↑ | wdist-ρ ↓ | wdist N-ary ↓ | Time / 1k ↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| FlowMM | 93.03 | 83.15 | 97.44 | 85.00 | 83.99 | 46.05 | 31.64 | 1.389 | 0.075 | 1560 |
| CrystalDiT | 77.82 | 67.28 | 90.88 | 59.33 | 56.86 | 83.41 | 41.70 | 0.202 | 0.171 | 73.72 |
| DiffCSP | 99.93 | 82.10 | 96.90 | 89.53 | 87.89 | 50.28 | 38.60 | 0.192 | 0.344 | 237 |
| MatterGen | 99.78 | 83.72 | 98.10 | 91.14 | 90.26 | 51.70 | 42.29 | 0.088 | 0.184 | 2639 |
| ADiT | 99.52 | 90.15 | 90.25 | 59.80 | 56.91 | 76.90 | 36.76 | 0.231 | 0.089 | 84.81 |
| Crystalite | 99.61 | 81.94 | 95.33 | 79.15 | 77.12 | 70.97 | 48.55 | 0.046 | 0.125 | 22.36 / 5.14† |
We also evaluate Crystalite on LeMat GenBench:
| Model | Valid (%) ↑ | Unique (%) ↑ | Novel (%) ↑ | Stable (%) ↑ | Metastable (%) ↑ | SUN (%) ↑ | MSUN (%) ↑ | E Above Hull (eV) ↓ | Relax. RMSD (Å) ↓ |
|---|---|---|---|---|---|---|---|---|---|
| **Pre-Relaxed Models** | | | | | | | | | |
| Crystalite | 97.20 | 95.80 | 53.20 | 12.70 | 51.60 | 1.50 | 22.60 | 0.0905 | 0.1320 |
| OMatG | 96.40 | 95.20 | 51.20 | 11.60 | 49.80 | 1.00 | 18.00 | 0.0956 | 0.0759 |
| MatterGen | 95.70 | 95.10 | 70.50 | 2.00 | 33.40 | 0.20 | 15.00 | 0.1834 | 0.3878 |
| PLaID++ | 96.00 | 77.80 | 24.20 | 12.40 | 60.70 | 1.00 | 7.60 | 0.0854 | 0.1286 |
| WyFormer-DFT | 95.20 | 95.00 | 66.40 | 3.70 | 24.80 | 0.40 | 7.80 | 0.2708 | 0.4173 |
| WyFormer | 93.40 | 93.00 | 66.40 | 0.50 | 15.70 | 0.10 | 1.90 | 0.4988 | 0.8121 |
| **Non-Pre-Relaxed Models** | | | | | | | | | |
| DiffCSP | 95.70 | 94.80 | 66.20 | 2.30 | 29.80 | 0.10 | 8.50 | 0.2747 | 0.3794 |
| DiffCSP++ | 95.30 | 95.10 | 62.00 | 1.00 | 26.40 | 0.20 | 5.00 | 0.4093 | 0.6933 |
| SymmCD | 73.40 | 73.00 | 47.00 | 1.40 | 18.60 | 0.10 | 2.40 | 0.8761 | 0.8720 |
| CrystalFormer | 69.90 | 69.40 | 31.80 | 1.40 | 28.80 | 0.00 | 3.10 | 0.7039 | 0.6585 |
| ADiT | 90.60 | 87.80 | 26.00 | 0.40 | 36.50 | 0.00 | 1.00 | 0.3333 | 0.3794 |
| Crystal-GFN | 51.70 | 51.70 | 51.70 | 0.00 | 0.00 | 0.00 | 0.00 | 2.0858 | 1.8665 |
A good generative model should not only produce plausible samples, but continue to generate diverse candidates as the sampling budget grows. Studying that behavior at scale is only practical when generation itself is cheap. This is exactly where Crystalite is useful: it is fast enough for large-scale generation, and it preserves uniqueness and the unique-and-novel rate more effectively as more crystals are sampled.
Crystalite shows that strong crystal generation does not require pushing all of the geometric inductive bias into a heavy backbone. Instead, chemical structure can be encoded in the atom representation, while periodic geometry can be placed where it matters most: in the loss and in attention through GEM.
In practice, that gives a useful balance: Crystalite remains easier to train, sample from, and extend than many geometry-heavy alternatives while still reaching state-of-the-art crystal structure prediction results and strong de novo generation performance. In the main generation comparison, it also achieves the highest S.U.N. score while sampling one to two orders of magnitude faster than leading GNN-based baselines, depending on the reference model and inference setting.
That speed matters for more than runtime alone. Fast sampling makes conditional generation, guidance, and steering much easier to use in practice, because adding search loops or external constraints no longer multiplies an already expensive denoising cost. In that sense, Crystalite is not only an efficient generator, but also a more practical foundation for controllable and scalable materials discovery workflows.