
2.3 Checking sequences
There are two things we want to check:
1. Are there any “partial” sequences? The complete sequence is 1273 amino acids in length.
2. Are there any **X** within the sequence? An X represents an unknown or undefined amino acid, caused by uncertainty in the nucleotide sequencing, so a sequence containing X is not 100% complete.
We can quickly accomplish both tasks with the command grep, which finds lines that match a pattern. In our case the pattern will be either the word partial or a capital X.
We check for “partial” with fgrep (often read as “fast grep”: it only matches fixed strings, not regular expressions, and is equivalent to grep -F). We just need to provide the pattern followed by the file name. The -i option makes the search case-insensitive. We print only the first three matching sequence names.
>QJR94977.1 surface glycoprotein, partial [Severe acute respiratory syndrome coronavirus 2]
>QJR93825.1 surface glycoprotein, partial [Severe acute respiratory syndrome coronavirus 2]
>QJR92925.1 surface glycoprotein, partial [Severe acute respiratory syndrome coronavirus 2]
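A listing like the one above could come from a command of the following shape. This is a minimal sketch, not necessarily the exact command used here: it builds a tiny made-up FASTA file (demo_spike.fa, with hypothetical DEMO accession names) so that it runs on its own, and it uses grep -F, which behaves like fgrep.

```shell
# Build a small, made-up FASTA file so the example is self-contained.
# The DEMO accession names and the short sequences are hypothetical.
demo_dir=$(mktemp -d)
demo_fa="$demo_dir/demo_spike.fa"
cat > "$demo_fa" <<'EOF'
>DEMO1.1 surface glycoprotein [Severe acute respiratory syndrome coronavirus 2]
MFVFLVLLPLVSSQCVNLTT
>DEMO2.1 surface glycoprotein, partial [Severe acute respiratory syndrome coronavirus 2]
LLALHRSYLTPGDSXXXXXX
>DEMO3.1 surface glycoprotein, Partial [Severe acute respiratory syndrome coronavirus 2]
NGVEGFNCYFPLQSYGFQPT
>DEMO4.1 surface glycoprotein [Severe acute respiratory syndrome coronavirus 2]
MFVFLVLLPLVSSQCVNLTT
EOF

# -F: treat the pattern as a fixed string (what fgrep does)
# -i: ignore case, so "partial" and "Partial" both match
# head -3: keep only the first three matching lines
partial_names=$(grep -F -i partial "$demo_fa" | head -3)
echo "$partial_names"

# -c counts matching lines instead of printing them
partial_count=$(grep -F -i -c partial "$demo_fa")
echo "partial sequences: $partial_count"

# Remove the temporary demo file
rm -r "$demo_dir"
```

On the real data, the demo file would be replaced by spike_raw.fa. The count is a quick way to see how many sequences will be dropped later.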
Note: You can also use the more or less command to inspect the complete spike_raw.fa file one screen at a time: press the space bar to scroll one screenful, and q to quit. We’ll remove these “partial” sequences later, for the final set, after we check for X.
Next we check for X and retain only the last 3 lines on the screen with tail:
LLALHRSYLTPGDSXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSF
NGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPXKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFL
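The three lines above are sequence lines containing X. A sketch of how such a check might look follows, again on a tiny made-up file so that it runs on its own (demo_x.fa and its DEMO record names are hypothetical). Header lines are filtered out first so that a capital X in a sequence name cannot be mistaken for an unknown residue; that filtering step is an extra precaution assumed here, not something the text prescribes.

```shell
# Build a small, made-up FASTA file; names and sequences are hypothetical.
demo_dir=$(mktemp -d)
demo_fa="$demo_dir/demo_x.fa"
cat > "$demo_fa" <<'EOF'
>DEMO1.1 surface glycoprotein [Severe acute respiratory syndrome coronavirus 2]
MFVFLVLLPLVSSQCVNLTT
RTQLPPAYTNSFTRGVYYPD
>DEMO2.1 surface glycoprotein [Severe acute respiratory syndrome coronavirus 2]
LLALHRSYLTPGDSXXXXXX
XXXXXXXRFPNITNLCPFGE
>DEMO3.1 surface glycoprotein [Severe acute respiratory syndrome coronavirus 2]
ATVCGPXKSTNLVKNKCVNF
NFNGLTGTGVLTESNKKFLP
EOF

# grep -v '^>' drops the header lines,
# grep X keeps lines with a capital X (no -i: lowercase x is not wanted),
# tail -3 keeps only the last three matching lines.
x_lines=$(grep -v '^>' "$demo_fa" | grep X | tail -3)
echo "$x_lines"

# Count the sequence lines that contain at least one X
x_line_count=$(grep -v '^>' "$demo_fa" | grep -c X)
echo "sequence lines containing X: $x_line_count"

# Remove the temporary demo file
rm -r "$demo_dir"
```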
Since the names of the sequences are not retained in this output, we do not know which sequences contain the Xs, but some clearly do.
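If the names are wanted, one possible approach (a sketch that goes slightly beyond grep, assuming awk is available) is to remember each header line and print it the first time one of its sequence lines contains X. The file demo_x.fa and its DEMO record names are made up for illustration.

```shell
# Build a small, made-up FASTA file; names and sequences are hypothetical.
demo_dir=$(mktemp -d)
demo_fa="$demo_dir/demo_x.fa"
cat > "$demo_fa" <<'EOF'
>DEMO1.1 surface glycoprotein [Severe acute respiratory syndrome coronavirus 2]
MFVFLVLLPLVSSQCVNLTT
RTQLPPAYTNSFTRGVYYPD
>DEMO2.1 surface glycoprotein [Severe acute respiratory syndrome coronavirus 2]
LLALHRSYLTPGDSXXXXXX
XXXXXXXRFPNITNLCPFGE
>DEMO3.1 surface glycoprotein [Severe acute respiratory syndrome coronavirus 2]
ATVCGPXKSTNLVKNKCVNF
NFNGLTGTGVLTESNKKFLP
EOF

# Header line: store it and reset the per-record flag.
# Sequence line with X: print the stored header once per record.
x_names=$(awk '
  /^>/             { header = $0; reported = 0; next }
  /X/ && !reported { print header; reported = 1 }
' "$demo_fa")
echo "$x_names"

# Number of records that contain at least one X
x_name_count=$(printf '%s\n' "$x_names" | grep -c '^>')
echo "sequences containing X: $x_name_count"

# Remove the temporary demo file
rm -r "$demo_dir"
```

Here DEMO2.1 is reported only once even though two of its lines contain X, and DEMO1.1 is not reported at all.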
In the next section we’ll use “piped” command lines to remove sequences with X and sequences that are “partial”.