3.1 Removing partial and sequences with X | A sequence alignment and analysis of SARS-CoV-2 spike glycoprotein

3.1 Removing partial and sequences with `X`

In order to remove the “unwanted” sequences, those that are “partial” or contain one or more X, we’ll take advange of the format of the file containing all the “unique” sequences spike_raw.fa. The file is in a multi-sequence “FastA” format⁷.

We will want to remove files that:

contain the word “partial” within the description line.
contain any number of X within the sequence data.

In FastA format each sequence starts with a “greater than” symbol > followed immediately by the sequence name. Everything after the first blank space is considered “annotation”. Subsequent lines collectively make up the (protein) sequence. A new line with the > symbol marks the beginning of a new, separate sequence.

Below we can see the first 2 lines of the first 2 sequences within the file. The -A1 qualifier shows one line after the line containing the searched pattern >. (The -- marks separate output results.)

grep -A1 ">" spike_raw.fa  | head -6

>QJU70245.1 surface glycoprotein [Severe acute respiratory syndrome coronavirus 2]
MFVFLVLXPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFD
--
>QJU70329.1 surface glycoprotein [Severe acute respiratory syndrome coronavirus 2]
MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFD
--

“Algorithm”:

Here is a proposed method to find and edit out “unwanted” sequences from the multi-sequence file. It is not necessary the most efficient or the most elegant, but has the advantage of using existing tools and does not require any programming.

We’ll use “word processing” methods to achieve the goal with “Unix utilities” used for replacing and substituting characters, including “new line” (return character.)

Each protein sequence will be converted from multiple lines containging name, annotation and sequence into a single line containing all information and sent within the “data stream” pipeline.

At that point the grep command can simply “weed-out” the lines (hence the sequences) that contain the word “partial” or the letter X.

Finally, the “return characters” are added again to recreate to the original, multi-line format.

Here is a “final” command that will accomplish the task. Results are saved into a file: spike_filtered.fa.

The command below is “split” after the pipe character | to allow writing one command per line for more clarity:

The reader is encouraged to add each command after the | pipe sign one at a time press return and observe the effect it has…

Hint: optionally add head to avoid long outputs.

cat spike_raw.fa |
  sed -e 's/>/*>/g' -e 's/$/#/g' |
  tr -d '\n' |
  tr '*' '\n' |
  fgrep -v partial |
  fgrep -v X  |
  tr '#' '\n' |
  sed 1d > spike_filtered.fa

Here is an explanation of each line:

cat spike_raw.fa: send the content of file into the data stream (standard input/output.)
sed -e 's/>/*>/g' -e 's/$/#/g': substitute > into *> and execute (-e) another coommand to exchange “end of line” (represented by $) with #. This will be used later to re-establish the multiline format.
tr -d '\n' - translate utility: delete all “new-lines” (return characters) in essence transforming the whole into a single line.
tr '*' '\n' - re-establish a return before the > character: now each sequence and its name and annotation is represented into a single line.
fgrep -v partial and fgrep -v X: fast grep recongizes the patterns and inverts the selection (-v) to retain only the sequences that don’t match these patterns. They are the sequences we want to keep.
tr '#' '\n': convert each # character into a “new line” (return character.) This re-establishes the multi-line format for the sequence.
sed 1d removes the first line which would be just a return character introduced in the previous step when replacing # character with “new line.”
> spike_filtered.fa: redirect standard output (with symbol >) into a file named spike_filtered.fa.

Command Design Note:

The cat command on the first line could be omitted if the file name was provided after the first sed command as the input argument. However, writing the command this way may make it easier for some readers to better understand how the “data stream” is initiated. In addition the “flow” of commands is only from left to right if started with cat.

https://en.wikipedia.org/wiki/FASTA_format ↩︎