PHIDRA
¶
Protein Homology Identification via Domain-Related Architecture
A simple way to search and validate identified Pfam domains of interest against a curated InterProScan Domain Architecture (IDA) file to check whether or not your proteins match a domain composition found in the InterPro Database.
Description¶
A Python-based package of scripts that performs an initial homology search using MMseqs2
to identify targeted proteins of interest in small or large datasets. Top hits are searched against the Pfam database using pfam_scan
, and verified domains are checked and compared against a custom input InterProScan Domain Architecture (IDA) file relative to your target protein of interest.
A recursive search is then performed using the full-length proteins with validated IDAs as the subject database and the original input as the query, with initial matches filtered out. This process captures potentially more distant proteins that may have been missed in the initial homology search but are functionally relevant.
➡️ Official documentation is hosted on PHIDRA.
Getting started¶
Dependencies¶
phidra
can be run either on a machine with a properly set up Python environment or by creating a custom Conda environment if you do not wish to change your current setup (recommended).- MMseqs2 required for initial and recursive homology search.
- HMMER required for Pfam database creation and protein domain identification through
pfam_scan
. - Python >= 3.8
Installation¶
- Conda setup (recommended) Help
#Create environment with >= Python 3.8
conda create --name phidra python=3.8
#Activate environment
conda activate phidra
#Install hmmer/mmseqs2 (Required)
conda install -c conda-forge -c bioconda mmseqs2
conda install bioconda::hmmer
#Clone tool into working directory
git clone https://github.com/zschreib/phidra
cd phidra
#Grabs required python packages
pip install -r requirements.txt
- Python setup >= Python 3.8, requires MMseqs2 and HMMER to be already set up.
git clone https://github.com/zschreib/phidra
cd phidra
pip install -r requirements.txt
- Tool and version check. If you receive any errors here, do not proceed until fixed.
mmseqs -h
hmmscan -h
python phidra_run.py -v
Setting up Pfam Database (Optional)¶
- If you already have a custom or existing Pfam-HMM formatted database you can skip this step.
- You can optionally include the Pfam-B database to perform a less restrictive search, which can help identify more remote or novel domain relationships that may be missed by the curated Pfam-A profiles.
mkdir pfam_database
cd pfam_database
wget http://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.dat.gz
wget http://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz
gunzip -c Pfam-A.hmm.dat.gz > Pfam-A.hmm.dat
gunzip -c Pfam-A.hmm.gz > Pfam-A.hmm
rm Pfam-A.hmm.gz Pfam-A.hmm.dat.gz
hmmpress Pfam-A.hmm
Creating an InterProScan Domain Architecture (IDA) file¶
1. Identify essential domains¶
- For DNA polymerase A, I have identified PF00476 (DNA_pol_A) as the core domain.
- Search domain on InterPro:
https://www.ebi.ac.uk/interpro/entry/pfam/PF00476/domain_architecture/
and grab the full IDA profile.
2. Download domain architecture data in TSV format for all or selected domain combinations¶
File should include (by default): * IDA ID (unique hash) * Domain combinations * Protein counts within IDA * Representative sequence * Representative length * Domain positions/coordinates
3. Sample IDA file in TSV format:¶
IDA ID IDA Text Unique Proteins Representative Accession Representative Length Representative Domains
82911c7e8cf0ed5121595d5944b2a2f9a2c4f49e PF02739:IPR020046-PF01367:IPR020045-PF01612:IPR002562-PF00476:IPR001098 22766 P00582 928 PF02739{5_3_exonuc_N}:IPR020046{5-3_exonucl_a-hlix_arch_N}[9-170],PF01367{5_3_exonuc}:IPR020045{DNA_polI_H3TH}[171-272],PF01612{DNA_pol_A_exo1}:IPR002562{3'-5'_exonuclease_dom}[330-516],PF00476{DNA_pol_A}:IPR001098{DNA-dir_DNA_pol_A_palm_dom}[551-925]
d4033548d92ed83469d80faa39cea59a849060a8 PF00476:IPR001098 18435 P00581 704 PF00476{DNA_pol_A}:IPR001098{DNA-dir_DNA_pol_A_palm_dom}[333-701]
16983c3a921f1409678537d9977eca4ed176e7c2 PF02739:IPR020046-PF01367:IPR020045-PF22619:IPR054690-PF00476:IPR001098 10607 Q04957 877 PF02739{5_3_exonuc_N}:IPR020046{5-3_exonucl_a-hlix_arch_N}[4-170],PF01367{5_3_exonuc}:IPR020045{DNA_polI_H3TH}[172-266],PF22619{DNA_polI_exo1}:IPR054690{DNA_polI_exonuclease}[318-457],PF00476{DNA_pol_A}:IPR001098{DNA-dir_DNA_pol_A_palm_dom}[498-875]
ce54c2b3dc9b223aa1f4992353c0972accb51c74 PF02739:IPR020046-PF01367:IPR020045-PF00476:IPR001098 6038 O84500 866 PF02739{5_3_exonuc_N}:IPR020046{5-3_exonucl_a-hlix_arch_N}[3-166],PF01367{5_3_exonuc}:IPR020045{DNA_polI_H3TH}[167-259],PF00476{DNA_pol_A}:IPR001098{DNA-dir_DNA_pol_A_palm_dom}[495-865]
f54dc0def957e931720f1460942192debd6092ca PF01612:IPR002562-PF00476:IPR001098 4576 Q05254 595 PF01612{DNA_pol_A_exo1}:IPR002562{3'-5'_exonuclease_dom}[19-210],PF00476{DNA_pol_A}:IPR001098{DNA-dir_DNA_pol_A_palm_dom}[243-587]
4. Analyze architectures¶
Core domain (PF00476) commonly associates with: * PF02739 (5' exonuclease N-terminal) * PF01367 (5' exonuclease) * PF01612 (DNA polymerase A exonuclease)
5. Use for validation¶
- Compare unknown or known sequence hits to the subject database against known IDA profiles.
- Explore domain organization and composition.
Domain architectures help validate protein predictions by ensuring essential functional domains are present, and they also provide structured, functionally meaningful features that can be leveraged as high-quality inputs for machine-learning models to improve training and downstream classification.
Running the tool¶
usage: phidra_run.py [-h] [-v] -i INPUT_FASTA -db SUBJECT_DB -pfam PFAM_HMM_DB -ida IDA_FILE -f FUNCTION -o OUTPUT_DIR
[-t THREADS] [-e EVALUE]
Identifies homologous proteins and associated Pfam domains from input protein sequences, while comparing against
InterPro Domain Architectures to analyze domain-level similarities and functional relationships.
Help:
-h, --help Show this help message and exit
-v, --version Show program version and exit
Required arguments:
-i INPUT_FASTA, --input_fasta INPUT_FASTA
Query FASTA for mmseqs search (default: None)
-db SUBJECT_DB, --subject_db SUBJECT_DB
Subject FASTA for mmseqs createdb (default: None)
-pfam PFAM_HMM_DB, --pfam_hmm_db PFAM_HMM_DB
Pfam HMM format database path (default: None)
-ida IDA_FILE, --ida_file IDA_FILE
IDA TSV file (default: None)
-f FUNCTION, --function FUNCTION
User label for this run (default: None)
-o OUTPUT_DIR, --output_dir OUTPUT_DIR
Base output directory (default: None)
Optional arguments:
-t THREADS, --threads THREADS
Threads for tools supporting -cpu/--threads (default: 1)
-e EVALUE, --evalue EVALUE
E-value threshold for mmseqs easy-search (e.g., 1E-3, 1e-5) (default: 1E-3)
Example run using the provided examples directory:
python phidra_run.py -i examples/query/polA_test.fa -db examples/subject/pola_16_k12_ref.fa -pfam [pfam_DB_location] -ida examples/IDA/polA_IDA_list.tsv -f examples/output -t 5 -e 1E-3
Results output summary¶
output/
├── final_results/
│ ├── pfam_coverage_report.tsv # Summary of Pfam domain coverage across all search hits
│ ├── summary.tsv # Summary table combining counts for each iteration
│ ├── unvalidated_ida_pfams/ # Pfam domains found but not validated by iterative domain architecture (IDA)
│ │ ├── domains.fa # FASTA of individual Pfam domain hits
│ │ ├── full_proteins.fa # Full-length proteins containing those domains
│ │ └── pfam_unvalidated_merged_report.tsv # Merged table of unvalidated domain results
│ └── validated_ida_pfams/ # Pfam domains validated by IDA recursion
│ ├── domains.fa # FASTA of validated Pfam domain hits
│ ├── full_proteins.fa # Full-length proteins containing validated domains
│ └── pfam_validated_merged_report.tsv # Merged table of validated domain results
│
├── mmseqs/
│ ├── initial/ # First-pass MMseqs2 search output
│ │ ├── bits.tsv # Top hit table by best bitscore
│ │ ├── hits.fa # FASTA of sequences with best e-value
│ │ ├── hits.tsv # Top hit table by best e-value
│ │ └── res.m8 # MMseqs2 table output (m8 format) for all significant initial hits
│ └── recursive/ # Secondary MMseqs2 search using hits as queries
│ └── res.m8 # MMseqs2 table output (m8 format) for all significant recursive hits
│ # No recursive hits so tophit/FASTA not created
└── pfam/
├── initial/ # Initial Pfam HMMER domain search results
│ ├── pfam_coverage_report.tsv # Coverage summary of initial Pfam search
│ ├── unvalidated_ida_report/ # Domains not validated by user IDA but hit Pfam domain
│ │ ├── domains.fa # FASTA of unvalidated Pfam domain hits
│ │ ├── full_proteins.fa # Full-length proteins for unvalidated hits
│ │ └── pfam_unvalidated_report.tsv # Tabular report of unvalidated domain results
│ └── validated_ida_report/ # Domains validated by IDA
│ ├── domains.fa # FASTA of validated Pfam domain hits
│ ├── full_proteins.fa # Full-length proteins for validated hits
│ └── pfam_validated_report.tsv # Tabular report of validated domain results
└── recursive/ # Results from recursive Pfam search
(empty) # No recursive hits so pfam files not created
- The named output directory contains the complete output of the
phidra
pipeline. - Includes homology search results, IDA-validated and unvalidated Pfam domain calls with both sequence-level and domain-level FASTA files, and comprehensive summary tables.
- A detailed description of each file is provided above to help navigate and interpret the results.
Looking for Version 1?¶
- Browse the stable v1.x branch
- Or check the v1.0.0 release
Citation¶
If you found this tool useful, please cite:
Schreiber, Z. D. PHIDRA: Protein Homology Identification via Domain-Related Architecture. GitHub. Retrieved Sept 27, 2025, from https://github.com/zschreib/phidra/.
Authors¶
Contributor’s name and contact info
zschreib.dev@gmail.com
License¶
This project is licensed under the GNU General Public License v3.0 see the LICENSE file for more details.
Acknowledgments¶
Inspiration, code snippets, etc. * MMseqs2 * InterPro * pfam_scan