psm_utils.io
Parsers for proteomics search results from various search engines.
- psm_utils.io.read_file(filename: str | Path, *args, filetype: str = 'infer', **kwargs)
Read PSM file into PSMList.
- Parameters:
filename (str) – Path to file.
filetype (str, optional) – File type. Any PSM file type with read support. See psm_utils tag in Supported file formats.
*args (tuple) – Additional arguments are passed to the psm_utils.io reader.
**kwargs (dict, optional) – Additional keyword arguments are passed to the psm_utils.io reader.
- psm_utils.io.write_file(psm_list: PSMList, filename: str | Path, *args, filetype: str = 'infer', show_progressbar: bool = False, **kwargs)
Write PSMList to PSM file.
- Parameters:
psm_list (PSMList) – PSM list to be written.
filename (str) – Path to file.
filetype (str, optional) – File type. Any PSM file type with write support. See psm_utils tag in Supported file formats.
show_progressbar (bool, optional) – Show progress bar for conversion process. (default: False)
*args (tuple) – Additional arguments are passed to the psm_utils.io writer.
**kwargs (dict, optional) – Additional keyword arguments are passed to the psm_utils.io writer.
- psm_utils.io.convert(input_filename: str | Path, output_filename: str | Path, input_filetype: str = 'infer', output_filetype: str = 'infer', show_progressbar: bool = False)
Convert a PSM file from one format into another.
- Parameters:
input_filename (str) – Path to input file.
output_filename (str) – Path to output file.
input_filetype (str, optional) – File type. Any PSM file type with read support. See psm_utils tag in Supported file formats.
output_filetype (str, optional) – File type. Any PSM file type with write support. See psm_utils tag in Supported file formats.
show_progressbar (bool, optional) – Show progress bar for conversion process. (default: False)
Examples
Convert a MaxQuant msms.txt file to an MS²PIP peprec file, while inferring the applicable file types from the file extensions:
>>> from psm_utils.io import convert
>>> convert("msms.txt", "filename_out.peprec")
Convert a MaxQuant msms.txt file to an MS²PIP peprec file, while explicitly specifying both file types:
>>> convert(
...     "filename_in.msms",
...     "filename_out.peprec",
...     input_filetype="msms",
...     output_filetype="peprec"
... )
Note that file types can only be inferred for specific file names and/or extensions, such as msms.txt or *.peprec.
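To illustrate why only some file names can be inferred, the sketch below matches a file name against a small table of patterns. The pattern table and the `infer_filetype` helper are hypothetical and are not the psm_utils implementation; only the `msms.txt` and `*.peprec` cases are taken from the note above.

```python
import re
from pathlib import Path

# Hypothetical pattern table; only these two entries come from the note above.
FILENAME_PATTERNS = {
    "msms": re.compile(r"^msms\.txt$"),
    "peprec": re.compile(r"^.*\.peprec$"),
}


def infer_filetype(filename):
    """Return the first file type tag whose pattern matches the file name."""
    name = Path(filename).name
    for filetype, pattern in FILENAME_PATTERNS.items():
        if pattern.match(name):
            return filetype
    raise ValueError(f"Could not infer file type from {name!r}; pass it explicitly.")


print(infer_filetype("results/msms.txt"))     # msms
print(infer_filetype("filename_out.peprec"))  # peprec
```

A name such as `filename_in.msms` matches neither pattern in this sketch, which is the situation where the file type has to be passed explicitly.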
psm_utils.io.idxml
Interface with OpenMS idXML PSM files.
Notes
idXML supports multiple peptide hits (identifications) per spectrum. Each peptide hit is parsed as an individual PSM object.
psm_utils.io.maxquant
Interface to MaxQuant msms.txt PSM files.
- class psm_utils.io.maxquant.MSMSReader(filename: str | Path, *args, **kwargs)
Reader for MaxQuant msms.txt PSM files.
- Parameters:
filename (str, pathlib.Path) – Path to PSM file.
decoy_prefix (str, optional) – Protein name prefix used to denote decoy protein entries. Default: "DECOY_".
Examples
MSMSReader supports iteration:
>>> from psm_utils.io.maxquant import MSMSReader
>>> for psm in MSMSReader("msms.txt"):
...     print(psm.peptidoform.proforma)
WFEELSK
NDVPLVGGK
GANLGEMTNAGIPVPPGFC[+57.022]VTAEAYK
...
Or a full file can be read at once into a PSMList object:
>>> reader = MSMSReader("msms.txt")
>>> psm_list = reader.read_file()
psm_utils.io.mzid
Reader and writer for the HUPO-PSI mzIdentML format.
See psidev.info/mzidentml for more info on the format.
- class psm_utils.io.mzid.MzidReader(filename: str | Path, *args, **kwargs)
Reader for mzIdentML PSM files.
- Parameters:
filename (str, pathlib.Path) – Path to PSM file.
Examples
MzidReader supports iteration:
>>> from psm_utils.io.mzid import MzidReader
>>> for psm in MzidReader("peptides_1_1_0.mzid"):
...     print(psm.peptidoform.proforma)
ACDEK
AC[Carbamidomethyl]DEFGR
[Acetyl]-AC[Carbamidomethyl]DEFGHIK
Or a full file can be read at once into a psm_utils.psm_list.PSMList object:
>>> mzid_reader = MzidReader("peptides_1_1_0.mzid")
>>> psm_list = mzid_reader.read_file()
- class psm_utils.io.mzid.MzidWriter(filename: str | Path, *args, show_progressbar: bool = False, **kwargs)
Writer for mzIdentML PSM files.
- Parameters:
filename (str, pathlib.Path) – Path to PSM file.
show_progressbar (bool, optional) – Show progress bar for conversion process. (default: False)
Notes
Unlike other psm_utils.io writer classes, MzidWriter does not support writing a single PSM to a file with the write_psm() method. Only writing a full PSMList to a file at once with the write_file() method is currently supported.
psm_utils.io.peptide_record
Interface with Peptide Record PSM files.
Peptide Record (or PEPREC) is a legacy PSM file type developed at CompOmics as input format for MS²PIP. It is a simple and flexible delimited text file where each row represents a single PSM. Required columns are:
spec_id: Spectrum identifier; usually the identifier used in the spectrum file.
peptide: Simple, stripped peptide sequence (e.g., ACDE).
modifications: Amino acid modifications in a custom format (see below).
Depending on the use case, more columns can be required or optional:
charge: Peptide precursor charge.
observed_retention_time: Observed retention time.
predicted_retention_time: Predicted retention time.
label: Target/decoy: 1 for target PSMs, -1 for decoy PSMs.
score: Primary search engine score (e.g., the score used for q-value calculation).
Peptide modifications are denoted as a pipe-separated list of pipe-separated location → label pairs for each modification. The location is an integer counted starting at 1 for the first amino acid. 0 is reserved for N-terminal modifications and -1 for C-terminal modifications. Unmodified peptides can be marked with a hyphen (-). For example:
PEPREC modification(s) | Explanation |
---|---|
- | Unmodified |
Full PEPREC example:
spec_id,modifications,peptide,charge
peptide1,-,ACDEK,2
peptide2,2|Carbamidomethyl,ACDEFGR,3
peptide3,0|Acetyl|2|Carbamidomethyl,ACDEFGHIK,2
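The notation can be made concrete with a small parser. The sketch below is not the psm_utils implementation; `parse_modifications` and `to_proforma_like` are hypothetical helpers that only handle the location|label pairs described above, applied to the example rows.

```python
import csv
import io

PEPREC_TEXT = """spec_id,modifications,peptide,charge
peptide1,-,ACDEK,2
peptide2,2|Carbamidomethyl,ACDEFGR,3
peptide3,0|Acetyl|2|Carbamidomethyl,ACDEFGHIK,2
"""


def parse_modifications(modifications):
    """Split a PEPREC modification string into (location, label) pairs."""
    if modifications in ("-", ""):
        return []
    fields = modifications.split("|")
    if len(fields) % 2 != 0:
        raise ValueError(f"Expected location|label pairs, got {modifications!r}")
    return [(int(loc), label) for loc, label in zip(fields[::2], fields[1::2])]


def to_proforma_like(peptide, modifications):
    """Render a ProForma-style string; 0 is N-terminal and -1 is C-terminal."""
    residues = list(peptide)
    n_term = c_term = ""
    for location, label in parse_modifications(modifications):
        if location == 0:
            n_term += f"[{label}]-"
        elif location == -1:
            c_term += f"-[{label}]"
        else:
            # Locations are 1-based, so shift by one for the Python index.
            residues[location - 1] += f"[{label}]"
    return n_term + "".join(residues) + c_term


for row in csv.DictReader(io.StringIO(PEPREC_TEXT)):
    print(to_proforma_like(row["peptide"], row["modifications"]))
```

For the three example rows this prints ACDEK, AC[Carbamidomethyl]DEFGR, and [Acetyl]-AC[Carbamidomethyl]DEFGHIK.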
Attention
Labile, unlocalized, and fixed modifications are not encoded in the Peptide Record notation. To encode fixed modifications, use apply_fixed_modifications() before writing to Peptide Record.
- class psm_utils.io.peptide_record.PeptideRecordReader(filename: str | Path, *args, **kwargs)
Reader for Peptide Record PSM files.
- Parameters:
filename (str, pathlib.Path) – Path to PSM file.
Examples
PeptideRecordReader supports iteration:
>>> from psm_utils.io.peptide_record import PeptideRecordReader
>>> for psm in PeptideRecordReader("peprec.txt"):
...     print(psm.peptidoform.proforma)
ACDEK
AC[Carbamidomethyl]DEFGR
[Acetyl]-AC[Carbamidomethyl]DEFGHIK
Or a full file can be read at once into a PSMList object:
>>> peprec_reader = PeptideRecordReader("peprec.txt")
>>> psm_list = peprec_reader.read_file()
- class psm_utils.io.peptide_record.PeptideRecordWriter(filename, *args, **kwargs)
Writer for Peptide Record PSM files.
- Parameters:
filename (str, Path) – Path to PSM file
- write_psm(psm: PSM)
Write a single PSM to a new or existing Peptide Record PSM file.
- Parameters:
psm (PSM) – PSM object to write.
Examples
To write single PSMs to a file, PeptideRecordWriter must be opened as a context manager. Then, within the context, write_psm() can be called:
>>> with PeptideRecordWriter("peprec.txt") as writer:
...     writer.write_psm(psm)
- psm_utils.io.peptide_record.peprec_to_proforma(peptide: str, modifications: str, charge: int | None = None) → Peptidoform
Convert Peptide Record notation to Peptidoform.
- Parameters:
peptide (str) – Stripped peptide sequence.
modifications (str) – Modifications in Peptide Record notation (e.g., 4|Oxidation).
charge (int, optional) – Precursor charge state.
- Returns:
peptidoform – Peptidoform
- Return type:
psm_utils.peptidoform.Peptidoform
- Raises:
InvalidPeprecModificationError – If a PEPREC modification cannot be parsed.
- psm_utils.io.peptide_record.proforma_to_peprec(peptidoform: Peptidoform)
Convert Peptidoform to Peptide Record notation.
- Parameters:
peptidoform (psm_utils.peptidoform.Peptidoform) –
- Returns:
peptide (str) – Stripped peptide sequence
modifications (str) – Modifications in Peptide Record notation
charge (int, optional) – Precursor charge state, if available, else None
Notes
Labile, unlocalized, and fixed modifications are not encoded in the Peptide Record notation. To encode fixed modifications, use apply_fixed_modifications() before writing to Peptide Record.
- psm_utils.io.peptide_record.from_dataframe(peprec_df: DataFrame) → PSMList
Convert Peptide Record pandas DataFrame into PSMList.
- Parameters:
peprec_df (pandas.DataFrame) – Peptide Record DataFrame
- Returns:
psm_list – PSMList object
- Return type:
PSMList
- psm_utils.io.peptide_record.to_dataframe(psm_list: PSMList) → DataFrame
Convert PSMList object into Peptide Record pandas DataFrame.
- Parameters:
psm_list (PSMList) –
- Return type:
pandas.DataFrame
Examples
>>> psm_list = PeptideRecordReader("peprec.csv").read_file()
>>> psm_utils.io.peptide_record.to_dataframe(psm_list)
    spec_id    peptide               modifications  charge  label  ...
0  peptide1      ACDEK                           -       2      1  ...
1  peptide2    ACDEFGR           2|Carbamidomethyl       3      1  ...
2  peptide3  ACDEFGHIK  0|Acetyl|2|Carbamidomethyl       2      1  ...
psm_utils.io.percolator
Reader and writer for Percolator Tab PIN/POUT PSM files.
The tab-delimited input and output formats for Percolator are defined on the Percolator GitHub Wiki pages.
Notes
While PercolatorTabReader supports reading the peptide notation with preceding and following amino acids (e.g., R.ACDEK.F), PercolatorTabWriter simply writes peptides in ProForma format, without preceding and following amino acids.
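The flanking-residue notation can be handled with a single regular expression. The sketch below is an illustration only and not the PercolatorTabReader implementation; the pattern and the `strip_flanking_residues` helper are assumptions, including the use of `-` as a flanking character at a protein terminus.

```python
import re

# Assumed pattern: one flanking residue (or "-"), a dot, the peptide itself,
# a dot, and one flanking residue (or "-").
FLANKED_PEPTIDE = re.compile(r"^(?P<before>[A-Z-])\.(?P<peptide>.+)\.(?P<after>[A-Z-])$")


def strip_flanking_residues(peptide):
    """Return the bare peptide, dropping flanking amino acids if present."""
    match = FLANKED_PEPTIDE.match(peptide)
    return match.group("peptide") if match else peptide


print(strip_flanking_residues("R.ACDEK.F"))  # ACDEK
print(strip_flanking_residues("ACDEK"))      # ACDEK
```

Because the peptide group is greedy, dots inside a mass-shift label such as `[+57.021]` stay part of the peptide; only the outermost dots are treated as separators.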
- class psm_utils.io.percolator.PercolatorTabReader(filename: str | Path, score_column=None, retention_time_column=None, mz_column=None, *args, **kwargs)
Reader for Percolator Tab PIN/POUT PSM file.
As the score, retention time, and precursor m/z are often embedded as feature columns, but not with a fixed column name, their respective column names need to be provided as parameters to the class. If not provided, these properties will not be added to the resulting PSM. Nevertheless, they will still be added to its rescoring_features property dictionary, along with the other features.
- Parameters:
filename (str, pathlib.Path) – Path to PSM file.
score_column (str, optional) – Name of the column that holds the primary PSM score.
retention_time_column (str, optional) – Name of the column that holds the retention time.
mz_column (str, optional) – Name of the column that holds the precursor m/z.
- class psm_utils.io.percolator.PercolatorTabWriter(filename: str | Path, style: str = 'pin', feature_names: list[str] | None = None, *args, **kwargs)
Writer for Percolator TSV “PIN” and “POUT” PSM files.
- Parameters:
filename (str, pathlib.Path) – Path to PSM file.
style (str) – Percolator Tab style. One of {pin, pout}. If pin, the columns PSMId, Label, ScanNr, Peptide and Proteins are written alongside the requested feature names (see feature_names). If pout, the columns PSMId, Label, score, q-value, posterior_error_prob, peptide, and proteinIds are written.
feature_names (list[str], optional) – List of feature names to extract from PSMs and write to file. List values should correspond to keys in the rescoring_features property. If None, no rescoring features will be written to the file. If appending to an existing file, the existing header will be used to determine the feature names. Only has effect with pin style.
- psm_utils.io.percolator.join_pout_files(target_filename: str | Path, decoy_filename: str | Path, output_filename: str | Path)
Join target and decoy Percolator Out (POUT) files into a single Percolator Tab file.
- Parameters:
target_filename (str, Path) –
decoy_filename (str, Path) –
output_filename (str, Path) –
psm_utils.io.tsv
Reader and writer for a simple, lossless psm_utils TSV format.
Most PSM file formats will introduce a loss of some information when reading, writing, or converting with psm_utils.io due to differences between file formats. In contrast, PSMList objects can be written to, or read from, this simple TSV format without any information loss (with exception of the free-form spectrum attribute).
The format follows basic TSV rules, using tab as delimiter, and supports quoting when a field contains the delimiter. Peptidoforms are written in the HUPO-PSI ProForma 2.0 notation.
Required and optional columns equate to the required and optional attributes of PSM. Dictionary items in provenance_data, metadata, and rescoring_features are flattened to separate columns, each with their column names prefixed with provenance:, meta:, and rescoring:, respectively.
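The flattening rule can be sketched with plain dictionaries. This is not the psm_utils implementation; `flatten_psm` and `unflatten_row` are hypothetical helpers and the PSM is represented as a simple dict, using values from the example file in this section.

```python
# Attribute name -> column prefix, as described above.
PREFIXES = {
    "provenance_data": "provenance:",
    "metadata": "meta:",
    "rescoring_features": "rescoring:",
}


def flatten_psm(psm):
    """Flatten the nested dictionaries of a PSM-like dict into prefixed columns."""
    row = {}
    for key, value in psm.items():
        if key in PREFIXES:
            for sub_key, sub_value in (value or {}).items():
                row[PREFIXES[key] + sub_key] = sub_value
        else:
            row[key] = value
    return row


def unflatten_row(row):
    """Rebuild the nested dictionaries from prefixed column names."""
    psm = {attribute: {} for attribute in PREFIXES}
    for column, value in row.items():
        for attribute, prefix in PREFIXES.items():
            if column.startswith(prefix):
                psm[attribute][column[len(prefix):]] = value
                break
        else:
            psm[column] = value
    return psm


psm = {
    "peptidoform": "RNVIDKVAK/2",
    "spectrum_id": "_3_2_1",
    "provenance_data": {"filename": "pyro.t.xml.pin"},
    "metadata": {},
    "rescoring_features": {"hyperscore": 20.3, "PepLen": 9},
}
row = flatten_psm(psm)
print(list(row))
```

The printed column names are peptidoform, spectrum_id, provenance:filename, rescoring:hyperscore, and rescoring:PepLen, and `unflatten_row(row)` returns a dict equal to the original `psm`.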
Examples
psm_utils TSV file, compatible with HUPO-PSI Universal Spectrum Identifier:
peptidoform spectrum_id run collection
VLHPLEGAVVIIFK/2 17555 Adult_Frontalcortex_bRP_Elite_85_f09 PXD000561
...
peptidoform spectrum_id run collection spectrum is_decoy score precursor_mz retention_time protein_list source provenance:filename rescoring:ExpMass rescoring:CalcMass rescoring:hyperscore rescoring:deltaScore rescoring:frac_ion_b rescoring:frac_ion_y rescoring:Mass rescoring:dM rescoring:absdM rescoring:PepLen rescoring:Charge2 rescoring:Charge3 rescoring:Charge4 rescoring:enzN rescoring:enzC rescoring:enzInt
RNVIDKVAK/2 _3_2_1 False 20.3 1042.64 ['DECOY_sp|Q8U0H4_REVERSED|RTCB_PYRFU-tRNA-splicing-ligase-RtcB-OS=Pyrococcus-furiosus...'] percolator pyro.t.xml.pin 1042.64 1042.64 20.3 6.6 0.444444 0.333333 1042.64 0.0003 0.0003 9 1 0 0 1 0 1
KHLEQHPK/2 _4_2_1 False 26.5 1016.56 ['sp|Q8TZD9|RS15_PYRFU-30S-ribosomal-protein-S15-OS=Pyrococcus-furiosus-(strain-ATCC...'] percolator pyro.t.xml.pin 1016.56 1016.56 26.5 18.5 0.375 0.75 1016.56 0.001 0.001 8 1 0 0 1 0 0
...
- class psm_utils.io.tsv.TSVReader(filename: str | Path, *args, **kwargs)
Reader for psm_utils TSV format.
- Parameters:
filename (str, pathlib.Path) – Path to PSM file.
- class psm_utils.io.tsv.TSVWriter(filename: str | Path, example_psm: PSM | None = None, *args, **kwargs)
Writer for psm_utils TSV format.
- Parameters:
filename (str, pathlib.Path) – Path to PSM file.
example_psm (psm_utils.psm.PSM, optional) – Example PSM, required to extract the column names when writing to a new file. Should contain all fields that are to be written to the PSM file, i.e., all items in the provenance_data, metadata, and rescoring_features attributes. In other words, items that are not present in the example PSM will not be written to the file, even though they are present in other PSMs passed to write_psm() or write_file().
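The effect of fixing the columns from an example PSM can be mimicked with the standard library. The sketch below does not use TSVWriter; it assumes already-flattened rows (plain dicts, hypothetical values) and uses `csv.DictWriter` with `extrasaction="ignore"` to show that a field missing from the example row is silently dropped.

```python
import csv
import io

# Stand-in for the example PSM: its keys define the header.
example_row = {"peptidoform": "ACDEK/2", "spectrum_id": "1", "rescoring:hyperscore": 20.3}
# This row carries an extra feature that the example row does not have.
other_row = {
    "peptidoform": "ACDEFGR/3",
    "spectrum_id": "2",
    "rescoring:hyperscore": 26.5,
    "rescoring:deltaScore": 18.5,
}

buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=list(example_row),
    delimiter="\t",
    extrasaction="ignore",  # drop keys that are not in the header
    lineterminator="\n",
)
writer.writeheader()
writer.writerows([example_row, other_row])
print(buffer.getvalue())
```

The output has three columns only; `rescoring:deltaScore` does not appear because it was absent from the row that defined the header.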
psm_utils.io.xtandem
Interface with X!Tandem XML PSM files.
Notes
In X!Tandem XML, N/C-terminal modifications are encoded as normal modifications and are therefore parsed accordingly. Any information on which modifications are N/C-terminal is therefore lost.
N-terminal modification in X!Tandem XML:
<aa type="M" at="1" modified="42.01057" />
Consecutive modifications, i.e., a modified residue that is modified further, are encoded in X!Tandem XML as two distinct modifications on the same site. However, in psm_utils, multiple modifications on the same site are not supported. While parsing X!Tandem XML PSMs, the mass shift labels of these two modifications will therefore be summed into a single modification.
For example, carbamidomethylation of cysteine (57.02200) plus ammonia-loss (-17.02655) will be parsed as one modification with mass shift 39.99545, which corresponds to the combined modification Pyro-carbamidomethyl (monoisotopic mass 39.994915):
<aa type="C" at="189" modified="57.02200" />
<aa type="C" at="189" modified="-17.02655" />
[+39.99545]
Although X!Tandem XML allows multiple peptide/protein identifications per entry, only the first peptide/protein per entry is parsed.
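The summing of mass shifts on one site, described above, can be sketched with the standard library XML parser. This is an illustration and not the XTandemReader implementation; the `sum_mass_shifts_per_site` helper is hypothetical and the `<domain>` wrapper is only there to make the snippet well-formed.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

XML_SNIPPET = """
<domain>
  <aa type="C" at="189" modified="57.02200" />
  <aa type="C" at="189" modified="-17.02655" />
</domain>
"""


def sum_mass_shifts_per_site(xml_text):
    """Sum the 'modified' mass shifts of all <aa> elements sharing an 'at' site."""
    shifts = defaultdict(float)
    for aa in ET.fromstring(xml_text.strip()).iter("aa"):
        shifts[int(aa.get("at"))] += float(aa.get("modified"))
    return dict(shifts)


for site, shift in sum_mass_shifts_per_site(XML_SNIPPET).items():
    print(site, f"[{shift:+.5f}]")  # 189 [+39.99545]
```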
- class psm_utils.io.xtandem.XTandemReader(filename: str | Path, *args, decoy_prefix='DECOY_', **kwargs)
Reader for X!Tandem XML PSM files.
- Parameters:
filename (str, pathlib.Path) – Path to PSM file.
decoy_prefix (str, optional) – Protein name prefix used to denote decoy protein entries. Default: "DECOY_".
Examples
XTandemReader supports iteration:
>>> from psm_utils.io.xtandem import XTandemReader
>>> for psm in XTandemReader("pyro.t.xml"):
...     print(psm.peptidoform.proforma)
WFEELSK
NDVPLVGGK
GANLGEMTNAGIPVPPGFC[+57.022]VTAEAYK
...
Or a full file can be read at once into a PSMList object:
>>> reader = XTandemReader("pyro.t.xml")
>>> psm_list = reader.read_file()