In SFX, the number of indexed snapshot patterns needed to solve a structure depends on the SNR of the individual patterns, the symmetry of the crystal, and the shot-to-shot variability of the parameters that govern the diffraction (such as the chaotic spectrum of FEL pulses)6. These factors influence the final accuracy of the merged data. De novo determination of a protein structure for which no homologous structure exists requires experimental phasing of the SFX data, and for phase determination the data must have high resolution and very high multiplicity7,10,35‑38. The large shot-by-shot variations in X-ray intensity and photon energy can make experimental phasing of XFEL data very challenging.
We tested the self-seeded XFEL for SFX because the self-seeded beam that we achieved performs extremely well. A previous study did not show any difference in the SFX data quality metrics compared to SASE11, but the peak spectral brightness of our XFEL is about ten times higher than that of the XFEL used previously and 40 times higher than that of SASE, with excellent stability. We expect that reducing the relative bandwidth from ΔE/E = 1.3×10⁻³ (SASE) to ΔE/E = 1.9×10⁻⁵ (SS) will sharpen diffraction patterns, especially those collected at large scattering angles, which determine the attainable resolution. We also expect an increase in the filtration rate of raw data owing to the higher spectral intensity of the self-seeded XFEL compared to SASE.
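To put the two relative bandwidths on an absolute scale, the short sketch below converts them to electronvolts. It assumes that ΔE/E is taken relative to the photon energy corresponding to the X-ray wavelength listed in Table 1 (1.2782 Å); the constant and variable names are illustrative and not from the original analysis.

```python
# Convert the quoted relative bandwidths into absolute bandwidths (eV).
# Assumption: dE/E refers to the photon energy at the Table 1 wavelength.

HC_KEV_ANGSTROM = 12.398419843  # h*c expressed in keV * angstrom


def photon_energy_ev(wavelength_angstrom: float) -> float:
    """Photon energy in eV for a wavelength given in angstrom."""
    return HC_KEV_ANGSTROM / wavelength_angstrom * 1000.0


wavelength = 1.2782                        # angstrom, from Table 1
rel_bw = {"SASE": 1.3e-3, "SS": 1.9e-5}    # dE/E values quoted in the text

energy_ev = photon_energy_ev(wavelength)
abs_bw_ev = {mode: energy_ev * bw for mode, bw in rel_bw.items()}
narrowing = rel_bw["SASE"] / rel_bw["SS"]  # how much narrower SS is than SASE

print(f"photon energy: {energy_ev:.1f} eV")
for mode, bw in abs_bw_ev.items():
    print(f"{mode}: dE = {bw:.2f} eV")
print(f"bandwidth narrowing factor: {narrowing:.1f}")
```

Under these assumptions the self-seeded bandwidth is a fraction of an electronvolt, compared with more than ten electronvolts for SASE, a narrowing of roughly 68 times.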
We performed a demonstration experiment by determining the three-dimensional structure of lysozyme from chicken egg white and comparing the results obtained using the narrowband HXRSS FEL with those obtained using the broadband SASE FEL (see Methods for the crystal preparation and experimental conditions).
We collected and processed three data sets with different numbers of images for both self-seeded and SASE modes: SS1/SASE1 (111,467/101,443), SS2/SASE2 (38,510/38,686), and SS3/SASE3 (20,209/20,530). The indexing rates were substantial in all cases. For example, in SS1, 70,656 of the collected images (63.4%) were identified as crystal hits, and 33,663 of these hits (47.6%) were indexed. The indexing rates of the self-seeding data sets were higher than those of the SASE data sets (Table 1). SFX data quality metrics such as SNR [or I/σ], multiplicity, Rsplit (i.e., the consistency of merged intensities between two half-datasets separated from the full dataset), and correlation coefficient [CC*] strongly depend on the number of images, as is known (Fig. 4, Supplementary Table 2). However, the self-seeding data show better metrics than the SASE data at high resolution, unlike a previous report11. Remarkably, the self-seeding data sets had roughly twice the multiplicity of the SASE data sets at all resolutions (Fig. 4b; overall ratios of about 1.7 to 1.9 in Table 1), so the final accuracy of the merged data is improved, even with the same number of hit images (see Methods for SFX data processing).
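As a reminder of what Rsplit and CC* measure, the following minimal sketch computes both from two half-dataset intensity lists. It assumes the definitions commonly used in SFX data processing, Rsplit = (1/√2)·Σ|I₁ − I₂| / (½·Σ(I₁ + I₂)) and CC* = √(2·CC½ / (1 + CC½)); the exact implementation used for the data in this work is described in Methods and may differ. The intensities below are made-up toy values, not measured data.

```python
import math


def r_split(half1, half2):
    """Rsplit between two half-dataset intensity lists (same reflections)."""
    num = sum(abs(a - b) for a, b in zip(half1, half2))
    den = 0.5 * sum(a + b for a, b in zip(half1, half2))
    return num / (math.sqrt(2.0) * den)


def pearson_cc(x, y):
    """Pearson correlation coefficient (CC1/2 when x, y are half-datasets)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def cc_star(cc_half):
    """CC* estimated from CC1/2; undefined (NaN) for non-positive CC1/2."""
    if cc_half <= 0.0:
        return float("nan")
    return math.sqrt(2.0 * cc_half / (1.0 + cc_half))


# Toy merged intensities for the same five reflections in each half (made up).
half_a = [100.0, 250.0, 40.0, 900.0, 15.0]
half_b = [110.0, 240.0, 50.0, 880.0, 10.0]

cc_half = pearson_cc(half_a, half_b)
print(f"Rsplit = {100.0 * r_split(half_a, half_b):.2f} %")
print(f"CC1/2  = {cc_half:.4f}, CC* = {cc_star(cc_half):.4f}")
```

Lower Rsplit and higher CC* both indicate better agreement between the two halves, which is why these metrics improve as more patterns are merged.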
Table 1: Statistics of data collection, phasing, and model refinement for three data sets each in self-seeded (SS1, SS2, and SS3) and SASE (SASE1, SASE2, and SASE3) modes. The models for the self-seeded mode (SS1 and SS2) and the SASE mode (SASE1 and SASE2) were refined from 38.8 to 1.75 Å; those for SS3 and SASE3 were refined from 38.8 to 1.85 Å. The higher-resolution shells (1.93–1.85 Å) for SS3 and SASE3 must be < 1.85 Å for validity. All models have one monomer in the asymmetric unit and adopt nearly identical structures, with r.m.s. deviations of ~0.05 Å for 129 Cα atom pairs.
| Data sets (lysozyme) | SS1 | SASE1 | SS2 | SASE2 | SS3 | SASE3 |
| --- | --- | --- | --- | --- | --- | --- |
| A. Data collection | | | | | | |
| Space group | P4₃2₁2 | P4₃2₁2 | P4₃2₁2 | P4₃2₁2 | P4₃2₁2 | P4₃2₁2 |
| Unit cell length (Å) | a = 77.56, b = 77.56, c = 37.32 | a = 77.56, b = 77.56, c = 37.32 | a = 77.88, b = 77.88, c = 37.32 | a = 77.88, b = 77.88, c = 37.32 | a = 77.88, b = 77.88, c = 37.32 | a = 77.88, b = 77.88, c = 37.32 |
| Unit cell angle (°) | α, β, γ = 90.0 | α, β, γ = 90.0 | α, β, γ = 90.0 | α, β, γ = 90.0 | α, β, γ = 90.0 | α, β, γ = 90.0 |
| X-ray wavelength (Å) | 1.2782 | 1.2782 | 1.2782 | 1.2782 | 1.2782 | 1.2782 |
| Number of collected images | 111,467 | 101,443 | 38,510 | 38,686 | 20,209 | 20,530 |
| Number of hits | 70,656 | 70,656 | 27,926 | 27,926 | 12,377 | 12,377 |
| Number of indexed images | 33,663 | 28,301 | 14,256 | 11,809 | 7,091 | 5,686 |
| Indexing rate from hits (%) | 47.64 | 40.05 | 51.05 | 42.29 | 57.29 | 45.94 |
| Number of merged images | 33,663 | 28,301 | 27,926 | 27,926 | 12,377 | 12,377 |
| Resolution range (Å)ᵃ | 38.8–1.75 (1.78–1.75) | 38.8–1.75 (1.78–1.75) | 38.8–1.75 (1.78–1.75) | 38.8–1.75 (1.78–1.75) | 38.8–1.85 (1.93–1.85) | 38.8–1.85 (1.93–1.85) |
| Total / unique reflections | 6,274,437 / 23,307 | 3,786,572 / 23,307 | 2,555,216 / 23,307 | 1,593,005 / 23,307 | 1,213,951 / 23,307 | 753,928 / 23,307 |
| Multiplicity | 269.2 (189.6) | 162.4 (108.9) | 108.0 (73.6) | 55.9 (37.5) | 49.9 (35.4) | 28.2 (19.7) |
| Completeness (%)ᵃ | 100.0 (100.0) | 100.0 (100.0) | 100.0 (100.0) | 100.0 (100.0) | 100.0 (100.0) | 100.0 (100.0) |
| CC*ᵃ,ᵇ | 0.992 (0.944) | 0.993 (0.855) | 0.979 (0.824) | 0.981 (0.353) | 0.971 (0.799) | 0.963 (0.523) |
| <I/σ(I)>ᵃ | 6.2 (1.77) | 5.6 (1.29) | 4.0 (1.17) | 3.1 (0.22) | 2.9 (1.17) | 2.3 (0.33) |
| Rsplit (%)ᵃ,ᶜ | 12.8 (52.5) | 12.7 (74.0) | 20.9 (79.2) | 21.3 (100) | 25.0 (68.9) | 30.5 (100) |
| B. Model refinement | | | | | | |
| Rwork / Rfree (%)ᵈ | 20.7 / 23.2 | 20.9 / 24.5 | 22.4 / 24.4 | 21.4 / 25.9 | 22.1 / 24.7 | 21.6 / 26.4 |
| Protein: no. of non-hydrogen atoms / average B-factor (Å²) | 1,001 / 19.8 | 1,001 / 29.6 | 1,001 / 21.3 | 1,001 / 30.2 | 1,001 / 20.0 | 1,001 / 27.2 |
| Water: no. of non-hydrogen atoms / average B-factor (Å²) | 94 / 30.1 | 62 / 36.4 | 89 / 31.4 | 65 / 38.6 | 95 / 31.4 | 88 / 37.5 |
| RMS deviations from ideal geometry | | | | | | |
| Bond lengths (Å) / bond angles (°) | 0.009 / 0.927 | 0.008 / 0.917 | 0.010 / 1.137 | 0.007 / 0.875 | 0.011 / 1.293 | 0.007 / 0.887 |
| PDB code | 7BYO | 7BYP | 7D01 | 7D02 | 7D04 | 7D05 |
| Ramachandran plot (%) | | | | | | |
| Favoured / Outliers | 99.2 / 0.0 | 99.2 / 0.0 | 96.85 / 0.0 | 98.4 / 0.0 | 98.4 / 0.0 | 99.21 / 0.0 |
| Rotamer outliers | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
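The hit rates, indexing rates, and multiplicity ratios discussed in the text follow directly from the counts in Table 1. The sketch below recomputes them; the dictionary layout and function names are illustrative, and only values printed in Table 1 are used as inputs.

```python
# Recompute hit rate, indexing rate, and SS/SASE multiplicity ratios
# from the counts and overall multiplicities listed in Table 1.

table1 = {
    "SS1":   {"collected": 111_467, "hits": 70_656, "indexed": 33_663, "multiplicity": 269.2},
    "SASE1": {"collected": 101_443, "hits": 70_656, "indexed": 28_301, "multiplicity": 162.4},
    "SS2":   {"collected": 38_510,  "hits": 27_926, "indexed": 14_256, "multiplicity": 108.0},
    "SASE2": {"collected": 38_686,  "hits": 27_926, "indexed": 11_809, "multiplicity": 55.9},
    "SS3":   {"collected": 20_209,  "hits": 12_377, "indexed": 7_091,  "multiplicity": 49.9},
    "SASE3": {"collected": 20_530,  "hits": 12_377, "indexed": 5_686,  "multiplicity": 28.2},
}


def hit_rate(name: str) -> float:
    """Percentage of collected images classified as crystal hits."""
    row = table1[name]
    return 100.0 * row["hits"] / row["collected"]


def indexing_rate(name: str) -> float:
    """Percentage of hits that were successfully indexed."""
    row = table1[name]
    return 100.0 * row["indexed"] / row["hits"]


def multiplicity_ratio(ss: str, sase: str) -> float:
    """Overall multiplicity of a self-seeded set relative to its SASE pair."""
    return table1[ss]["multiplicity"] / table1[sase]["multiplicity"]


for name in table1:
    print(f"{name:6s} hit rate {hit_rate(name):5.1f} %  "
          f"indexing rate {indexing_rate(name):5.2f} %")

for ss, sase in [("SS1", "SASE1"), ("SS2", "SASE2"), ("SS3", "SASE3")]:
    print(f"{ss}/{sase} multiplicity ratio: {multiplicity_ratio(ss, sase):.2f}")
```

Because each SS/SASE pair uses the same number of hits, the difference in indexing rate within a pair reflects only how many of those hits could be indexed.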
For quality assessment, we performed molecular replacement (MR) with Phaser-MR in PHENIX40, using a model of lysozyme (Protein Data Bank code 1VDS) as the search model, conducted atomic model refinement using phenix.refine, and then inspected mFo-DFc omit maps40 (see Methods for structure determination, refinement, and analysis). To compare and analyze the structures and their electron density maps without bias or error, we performed structure determination using the same numbers of hit images for the self-seeding and SASE data sets (SS1/SASE1, SS2/SASE2, and SS3/SASE3).
After refinement, we compared the SS1 and SASE1 models with their electron density maps and found apparent improvements in the 2mFo-DFc maps of the self-seeded mode (Fig. 5a), even though lysozyme is a globular protein with some buried residues that interact strongly with other residues. For a clearer view, we computed bias-free mFo-DFc omit maps for selected residues (Fig. 5b). Comparison of the mFo-DFc omit maps at 1.75-Å resolution (Fig. 5b) clearly shows that the maps of the ten residues (Phe21/Ala28/Tyr41/Trp46/Phe52/Asn62/Tyr71/Trp81/Trp126/Trp141) are not blurred in self-seeded mode; the maps, including the side chains and the main chains (carboxyl groups, nitrogens on the peptide backbones, and α-carbons), are sharper than those obtained in SASE mode. For instance, in the Phe21 and Asn62 maps, β-carbons and side chains are revealed clearly only in self-seeded mode. Each omit map was calculated from a model refined after the residue in question had been deleted from the original structure (Supplementary Table 3).
Comparative analysis of the mFo-DFc electron density maps of the ten residues reveals the superiority of the self-seeded data sets over the SASE data sets (Table 2, Supplementary Fig. 5). For example, even though the data-quality metrics of the SS3 data are inferior to those of SASE1 (the SS3 data set has one-fourth as many indexed images as SASE1), the omit maps of the ten residues from the SS3 data are better than those from the SASE data. B-factors are crystallographic parameters that may explain this large difference. The average B-factors41 of both the protein and the solvent water models are lower in the models from the self-seeded mode than in those from the SASE mode, and the average B-factors are independent of the number of indexed images (Table 1: Model refinement). These traits indicate that the atomic displacement fluctuations are weaker when a narrowband self-seeded FEL is used than when a broadband SASE FEL is used. The reduced fluctuations might help improve the refinement of the model through sharpened electron density maps. The overall sharpening of the omit maps obtained from the self-seeding data reflects data of phasing quality obtained from fewer patterns. The high quality of the data obtained in self-seeded mode is a result of the use of recurrent shots from a highly stable self-seeded XFEL.
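To illustrate how the B-factor differences relate to atomic displacement, the sketch below converts the average protein B-factors of Table 1 into root-mean-square displacements. It assumes the standard isotropic relation B = 8π²⟨u²⟩, which is textbook crystallography rather than something stated in this work; the function and variable names are illustrative.

```python
import math


def rms_displacement(b_factor: float) -> float:
    """RMS atomic displacement (angstrom) from an isotropic B-factor (A^2).

    Assumes the isotropic Debye-Waller relation B = 8 * pi^2 * <u^2>.
    """
    return math.sqrt(b_factor / (8.0 * math.pi ** 2))


# Average protein B-factors (A^2) from Table 1, section B.
protein_b = {
    "SS1": 19.8, "SASE1": 29.6,
    "SS2": 21.3, "SASE2": 30.2,
    "SS3": 20.0, "SASE3": 27.2,
}

rms = {name: rms_displacement(b) for name, b in protein_b.items()}

for ss, sase in [("SS1", "SASE1"), ("SS2", "SASE2"), ("SS3", "SASE3")]:
    print(f"{ss}: {rms[ss]:.3f} A   {sase}: {rms[sase]:.3f} A   "
          f"difference: {rms[sase] - rms[ss]:.3f} A")
```

Under this assumption, the self-seeded models correspond to an average displacement of about 0.5 Å, roughly 0.1 Å smaller than in the paired SASE models. The refined B-factor also absorbs effects other than true atomic motion, so this is an apparent displacement only.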