2.1 Get Sequences
We’ll work with the protein sequence to start with.
These preparation steps will be performed within a web browser:
TASK
Open a web browser and follow the instructions below:
- Go to: https://www.ncbi.nlm.nih.gov/protein/ This will take you to the NCBI “protein” database.
- For a precise search of the Spike protein from SARS-CoV-2 only, enter the following search string in the text field:
surface glycoprotein[All Fields] AND "Severe acute respiratory syndrome coronavirus 2"[Organism]
When this was first written on April 16, 2020, the result was a list of 795 proteins. During revisions, the list had grown to 4239 on May 26, 2020, then to 5465 on June 18, 2020 and 7847 on June 25, 2020. This growth also affects the numbers for the alternate option that selects only unique sequences (see below): from 85 items originally to 942 (May 26) and 1198 (both June 18 and June 25).
Many examples will keep using the shorter list, but the commands carry over to larger sequence lists as well.
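Because the counts change over time, it can be handy to re-check them without the browser. The following is a minimal sketch, assuming the public NCBI E-utilities `esearch` service and network access; the helper names `build_esearch_url` and `count_hits` are made up for this illustration and are not part of the original workflow. The search string is the same one entered in the text field above.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities "esearch" service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# The search string from the task above, unchanged.
QUERY = (
    'surface glycoprotein[All Fields] AND '
    '"Severe acute respiratory syndrome coronavirus 2"[Organism]'
)


def build_esearch_url(term, db="protein"):
    """Build an esearch URL that returns JSON and no record IDs (retmax=0)."""
    params = {"db": db, "term": term, "retmax": 0, "retmode": "json"}
    return ESEARCH_URL + "?" + urlencode(params)


def count_hits(term, db="protein"):
    """Send the query to NCBI (needs network access) and return the hit count."""
    import json
    from urllib.request import urlopen

    with urlopen(build_esearch_url(term, db)) as response:
        reply = json.load(response)
    return int(reply["esearchresult"]["count"])


# Building the URL does not contact NCBI; it only shows what would be requested.
url = build_esearch_url(QUERY)
print(url)

# To run the live query, uncomment the next line:
# print(count_hits(QUERY))
```

The number returned by a live query will differ from the figures quoted above, since the database keeps growing.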
A box at the top of the page provides an alternate option:
See the results of this search (85 items) in our new Identical Protein Groups database.
The new Identical Protein Groups (IPG) database [5] makes it easier to find protein information by searching against groups of protein records, where each group represents a unique protein sequence.
We’ll choose this option to avoid carrying many proteins that are 100% identical.
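To make the idea of "identical protein groups" concrete, here is a small sketch of grouping records by exact sequence identity. It only illustrates the concept and is not how NCBI builds the IPG database; the accessions and sequences are made-up toy values, and `group_identical` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical toy records: (accession, sequence). Not real NCBI data.
records = [
    ("prot_A", "MFVFLVLLPLVSS"),
    ("prot_B", "MFVFLVLLPLVSS"),  # identical to prot_A
    ("prot_C", "MFVFFVLLPLVSS"),  # differs from prot_A at one residue
]


def group_identical(records):
    """Map each distinct sequence to the list of accessions that carry it."""
    groups = defaultdict(list)
    for accession, sequence in records:
        groups[sequence.upper()].append(accession)
    return dict(groups)


groups = group_identical(records)
print(len(records), "records ->", len(groups), "unique sequences")
for sequence, accessions in groups.items():
    print(sequence, accessions)
```

Here three records collapse into two groups, which is the same kind of reduction that takes the full protein list down to the much shorter list of unique sequences.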
- Click on the link: results of this search (85 items). (That number will increase with time; see the remark above.)
At the top of the page, it should say Summary and 20 per page by default. On the same line, locate the Send to menu, which we’ll use to save the files:
- Click on Send to and select the File option
- Under Format select FASTA
- Click the “Create File” button. The file will automatically download to your default location, most likely “Downloads” within your user area. The default file name is sequence.fasta.txt.
Note: The original file with only 85 sequences is available for download as spike_raw_85.fa
Note: You can repeat the save step if you wish to download a comma-delimited (.csv) file with all proteins and their associated nucleic acid codes by selecting “Identical Protein Groups” as the format.
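Once the FASTA file has been saved, a quick sanity check is to count its records and compare the result with the number of items shown on the web page. The sketch below assumes a standard FASTA layout in which every record starts with a `>` header line; the demo data are made up, and the file path in the comment is an assumption to adjust for your system.

```python
import io


def count_fasta_records(handle):
    """Count header lines (those starting with '>') in an open FASTA file."""
    return sum(1 for line in handle if line.startswith(">"))


# Small in-memory stand-in for the downloaded file (made-up records).
demo = io.StringIO(">seq1 example\nMFVF\nLVLL\n>seq2 example\nMFVFF\n")
n_demo = count_fasta_records(demo)
print(n_demo)

# With the real download (path is an assumption; adjust as needed):
# with open("sequence.fasta.txt") as handle:
#     print(count_fasta_records(handle))
```

If the count does not match the number of items listed in the browser, the download may have been incomplete, or the database may have grown between the search and the download.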