6.2 Related sequences

We will attempt to reproduce the full-length alignment presented in Walls et al. (2020), supplementary material “Data S1” (see footnote 1). The exercise is limited to creating the best possible automatic alignment, without manual editing.

The sequences aligned in the paper are listed in the following table, which gives the accession code and the short name used in the supplemental full-length alignment.

Spike Glycoprotein S accession codes

Accession code    Short name
YP_009724390.1    SARS-CoV-2
QHR63300.2        SARSr-CoV_RaTG13
AAP13441.1        SARS-CoV_Urbani
AAP13567.1        SARS-CoV_CUHK-W1
AAS00003.1        SARS-CoV_GZ02
AAV97988.1        SARS-CoV_A031
AAV91631.1        SARS-CoV_A022
ALK02457.1        WIV16
AGZ48828.1        WIV1
AVP78042.1        SARSr-CoV_ZXC21
AVP78031.1        SARSr-CoV_ZC45
Q3I5J5.1          SARSr-CoV_Rp3
ACU31032.1        SARSr-CoV_Rs672

NOTE: A link for direct download will be provided in the exercise in section 6.3 below.
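For readers who would rather retrieve the records programmatically, the sketch below shows one possible approach. It is not the method used in the exercise. It assumes that all thirteen accessions resolve in the NCBI protein database through the E-utilities `efetch` endpoint, and the helper names (`efetch_url`, `fetch_fasta`, `rename_headers`) are made up for this illustration. The header handling also assumes that the accession appears in the first word of each FASTA header, either alone or as one of several `|`-separated fields.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Accession code -> short name, copied from the table above.
SPIKE_ACCESSIONS = {
    "YP_009724390.1": "SARS-CoV-2",
    "QHR63300.2": "SARSr-CoV_RaTG13",
    "AAP13441.1": "SARS-CoV_Urbani",
    "AAP13567.1": "SARS-CoV_CUHK-W1",
    "AAS00003.1": "SARS-CoV_GZ02",
    "AAV97988.1": "SARS-CoV_A031",
    "AAV91631.1": "SARS-CoV_A022",
    "ALK02457.1": "WIV16",
    "AGZ48828.1": "WIV1",
    "AVP78042.1": "SARSr-CoV_ZXC21",
    "AVP78031.1": "SARSr-CoV_ZC45",
    "Q3I5J5.1": "SARSr-CoV_Rp3",
    "ACU31032.1": "SARSr-CoV_Rs672",
}

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accessions):
    """Build an efetch URL asking for protein records in FASTA format."""
    query = urlencode({
        "db": "protein",
        "id": ",".join(accessions),
        "rettype": "fasta",
        "retmode": "text",
    })
    return f"{EFETCH_URL}?{query}"


def fetch_fasta(accessions):
    """Download the records as one multi-FASTA string (needs network access)."""
    with urlopen(efetch_url(accessions)) as response:
        return response.read().decode("utf-8")


def rename_headers(fasta_text, names):
    """Replace each FASTA header by the short name mapped to its accession.

    Headers whose accession is not in `names` are left unchanged.
    """
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            words = line[1:].split()
            first_word = words[0] if words else ""
            # The accession may stand alone or be one of several '|' fields.
            match = next((f for f in first_word.split("|") if f in names), None)
            out.append(">" + names[match] if match else line)
        else:
            out.append(line)
    return "\n".join(out)
```

With network access, `rename_headers(fetch_fasta(SPIKE_ACCESSIONS), SPIKE_ACCESSIONS)` would return a multi-FASTA string whose headers use the short names from the table, which keeps the labels in the resulting alignment readable.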

We’ll use T-Coffee (Thompson 2009) to create the alignment. T-Coffee is a complex, multi-algorithm system that can also take advantage of online database searches. The online version can be accessed at http://www.tcoffee.org/ and offers several ways to align sequences. Here are a few of the options listed on the web site:

  • T-Coffee: aligns DNA, RNA or proteins using the default T-Coffee
  • M-Coffee: aligns DNA, RNA or proteins by combining the output of popular aligners
  • Expresso: aligns protein sequences using structural information
  • PSI-Coffee: aligns distantly related proteins using homology extension (slow and accurate)

PSI-Coffee uses the EBI web services to run the BLAST searches required for the homology extension procedure remotely, at the EBI.

Expresso is the most accurate mode of T-Coffee and creates structure-based alignments. Expresso fetches PDB structures whose similarity to the original sequence is higher than 30% (by default); these structures can then be used as templates.
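To make the idea of a 30% cutoff concrete, here is a toy sketch of filtering candidate templates by percent identity. It is only an illustration: it uses plain percent identity over ungapped columns of a pairwise alignment, which is an assumption made here and not a description of how Expresso itself scores similarity. The function names and the input layout are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


def usable_templates(candidates, min_identity=30.0):
    """Keep the names of templates whose identity to the query exceeds the cutoff.

    `candidates` maps a template name to a pair of aligned strings
    (query, template) of equal length, with '-' marking gaps.
    """
    return [
        name
        for name, (query, template) in candidates.items()
        if percent_identity(query, template) > min_identity
    ]
```

For example, a template sharing 3 of 4 aligned residues with the query (75% identity) would be kept, while one sharing 1 of 4 (25%) would be dropped.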

In the exercise below we’ll use a command line that in effect combines Expresso and PSI-Coffee.
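As a rough orientation before the exercise, the sketch below assembles `t_coffee` command lines for the web-site options described above. The mode names (`mcoffee`, `expresso`, `psicoffee`, `accurate`) and the `-outfile` flag are taken from my reading of the T-Coffee command-line documentation and should be checked against the installed version. The `accurate` entry is included on the assumption that it is the documented mode mixing structural and profile information; it is not necessarily the command used in the exercise. The file name `spike.fasta` and the helper `t_coffee_command` are hypothetical.

```python
import shlex

# Web-site option -> value passed to "-mode" (None means a default run).
# These mode names are assumptions to verify with the local T-Coffee install.
MODES = {
    "T-Coffee": None,
    "M-Coffee": "mcoffee",
    "Expresso": "expresso",
    "PSI-Coffee": "psicoffee",
    "Accurate": "accurate",  # assumed to mix structure and profile information
}


def t_coffee_command(fasta_path, flavour="T-Coffee", outfile=None):
    """Return the argument list for a t_coffee run; nothing is executed here."""
    cmd = ["t_coffee", fasta_path]
    mode = MODES[flavour]
    if mode is not None:
        cmd += ["-mode", mode]
    if outfile is not None:
        cmd += ["-outfile", outfile]
    return cmd


def as_shell(cmd):
    """Render the argument list as a single, safely quoted shell command."""
    return shlex.join(cmd)
```

The argument list returned by `t_coffee_command("spike.fasta", "Expresso")` can be passed to `subprocess.run` where T-Coffee is installed, or rendered with `as_shell` for pasting into a terminal.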

Since T-Coffee is complex and complicated to install, the exercise below will be presented in Docker. Readers who do not have ready access to Docker should try out the options on the T-Coffee web site.

The T-Coffee Server is hosted by the Centre for Genomic Regulation (CRG) of Barcelona.

References

Thompson, Steven. 2009. “An Introduction to Multiple Sequence Alignment — and the T-Coffee Shop. Beyond Just Aligning Sequences: How Good Can You Make Your Alignment, and so What?” In Bioinformatics for Systems Biology, 283–313. https://doi.org/10.1007/978-1-59745-440-7_15.

Walls, A. C., Y. J. Park, M. A. Tortorici, A. Wall, A. T. McGuire, and D. Veesler. 2020. “Structure, Function, and Antigenicity of the SARS-CoV-2 Spike Glycoprotein.” Cell 181 (2): 281–92. https://doi.org/10.1016/j.cell.2020.02.058.


  1. The link to the full-length sequence alignment is not trivial to find. It can be found at https://bit.ly/3dthDBS and was archived at https://bit.ly/2zXNP2R.