Chapter 3 Complete sequences dataset

We now have a set of sequences that are “unique” since they were obtained from the Identical Protein Groups, but some of them are “partial” and others contain one or more X characters that stand for undetermined amino acids.

In the next section we’ll prepare a “dataset” that only contains sequences that are “complete” (full length) and do not contain any X.

The method is to use existing Unix utilities, without a mouse or a graphical interface. Most importantly, no sequence is removed by manual editing. Therefore, if or when an update at NCBI increases the number of downloadable “unique” sequences, the steps below can be repeated as a “script”, without the risk of introducing the errors that manual editing might cause.

The reader is encouraged to review or learn about the commands and concepts below; see the suggested materials in the Introduction, Chapter 1.

The following Unix utilities will be used:

  • cat - print file(s) onto the screen (standard output)
  • sed - stream editor - modify data “on the fly”, e.g. substitute one or more character strings for another.
  • tr - translate characters (i.e. replace or delete)
  • fgrep - fixed-string grep (same as grep -F) - file pattern searcher that matches literal strings
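
As a quick illustration of the four utilities listed above, the following sketch runs each one on a tiny made-up file. The file name toy.fasta and its two sequences are hypothetical and exist only for this example. grep -F is used in place of fgrep because the two are equivalent and recent versions of GNU grep print a warning when fgrep is called.

```shell
# Create a tiny, made-up FASTA file (hypothetical data, for illustration only).
printf '>seq1 toy protein\nMKVLA\n>seq2 toy protein, partial\nMKXLA\n' > toy.fasta

# cat: print the file onto the screen (standard output).
cat toy.fasta

# sed: substitute one character string for another on each line.
sed 's/toy/example/' toy.fasta

# tr: replace every X with a question mark.
# tr only reads standard input, so the file is fed to it with "<".
tr 'X' '?' < toy.fasta

# grep -F (same as fgrep): print only the lines containing the literal string.
grep -F 'partial' toy.fasta
```

None of these commands changes toy.fasta itself: each one reads the file and writes its result to the screen.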

The following Unix key concepts will be used:

  • Standard streams: standard input, standard output
  • Data “streams”: “redirection” and “piping” of “standard input/output”

The following symbols will be used:

  • | - “pipe” symbol, to receive the “standard” stream of data from the previous command or pass it to the next command.
  • > - “redirect” final standard output into the named file.
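
The two symbols above can be tried out on a small made-up file before they are applied to real data. This is only a sketch: the file names demo.txt and headers.txt and the two toy records are assumptions chosen for the illustration, not part of the dataset.

```shell
# Made-up input, written with a "here-document" so the example is self-contained.
cat > demo.txt <<'EOF'
>seq1 complete
MKVLA
>seq2 partial
MKXLA
EOF

# Pipe: the output of cat becomes the input of grep -F,
# and the output of grep -F becomes the input of tr.
# Result on screen: the header lines, with the ">" character deleted.
cat demo.txt | grep -F '>' | tr -d '>'

# Redirect: the same pipeline, but the final standard output
# is stored in the named file instead of being shown on screen.
cat demo.txt | grep -F '>' | tr -d '>' > headers.txt

# Check the content of the new file.
cat headers.txt
```

Note that > replaces the content of the named file if it already exists, so a file name that is not already in use is the safer choice.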