1. Introduction
The MoCha Oncogenic MOI Annotator (MOMA) utility is a sequencing-platform-agnostic tool that annotates variants from an NGS assay and classifies them as Mutations of Interest (MOIs) or Variants of Unknown Significance (VUSes). These classifications are based on annotation data from Annovar, which indicates each variant’s location, functional effect, population frequency, etc., and on mapping the variants to OncoKB, where each variant’s Oncogenicity and Effect can be determined.
In addition, simple filtering is performed on the output data to remove variants that are not clinically relevant. We remove variants that are above a set population frequency as determined from gnomAD, ExAC, and 1000G, as well as non-coding variants (e.g. intronic) and synonymous variants. The remaining calls are then mapped to OncoKB, and the variant’s Oncogenicity and Oncogenic Effect annotations are added to the variant call if there is a match.
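The three filters described above can be sketched roughly as follows. This is only an illustration and not MOMA's actual code: the record keys (`pop_freq`, `location`, `effect`), the set of non-coding locations, and the 1% frequency cutoff are all assumptions made for the example, since the real threshold and field names are not stated here.

```python
# Hypothetical cutoff; the actual MOMA population frequency threshold is not stated here.
MAX_POP_FREQ = 0.01

# Assumed set of non-coding location labels (modeled on typical Annovar values).
NONCODING_LOCATIONS = {"intronic", "intergenic", "UTR3", "UTR5", "upstream", "downstream"}


def is_reportable(variant, max_pop_freq=MAX_POP_FREQ):
    """Return True if a variant survives the three filters described above.

    `variant` is a dict with assumed (hypothetical) keys:
      pop_freq : dict mapping a population database name to an allele frequency (or None)
      location : variant location label, e.g. "exonic" or "intronic"
      effect   : functional effect label, e.g. "missense" or "synonymous"
    """
    # 1. Drop variants that are too common in any population database.
    freqs = [f for f in variant.get("pop_freq", {}).values() if f is not None]
    if any(f > max_pop_freq for f in freqs):
        return False
    # 2. Drop non-coding variants.
    if variant.get("location") in NONCODING_LOCATIONS:
        return False
    # 3. Drop synonymous variants.
    if variant.get("effect") == "synonymous":
        return False
    return True
```

Only variants for which `is_reportable` returns `True` would go on to the OncoKB mapping step.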
MOMA is designed to be compatible with VCF files from all NGS platforms, either natively or with the addition of a simple helper script that modifies the input to be compatible with the annotation pipeline. MOMA can also run starting from a MAF file. In fact, the first step of the MOMA pipeline when starting from a VCF file is to annotate the data and generate a MAF file that the tool will use for downstream processing. As shown below, the spirit of the tool is to be able to accommodate any kind of data, using helper scripts to stage the data in a way that can be easily processed through the rest of the tool.
1.1. MOMA Workflow
Each MOMA run should start with a VCF file from the assay platform of choice. From that VCF, a Single Base Substitution 6 (SBS-6) matrix is generated and the number of base changes in each of the 6 categories is tallied. The number of transitions, the number of transversions, the transition to transversion ratio (Ts/Tv), and a deamination score are also computed. See the SBS-6 Output section for details.
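To make the SBS-6 tally concrete, here is a minimal sketch of how substitutions can be collapsed into the 6 pyrimidine-reference classes and how Ts/Tv follows from those counts. This is not the code in the package; the function names are hypothetical, and the deamination score is left out because its definition is not given in this section.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
SBS6_CLASSES = ("C>A", "C>G", "C>T", "T>A", "T>C", "T>G")
TRANSITIONS = {"C>T", "T>C"}  # purine<->purine or pyrimidine<->pyrimidine changes


def sbs6_class(ref, alt):
    """Return the SBS-6 class for a single-base substitution, or None otherwise."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in COMPLEMENT or alt not in COMPLEMENT or ref == alt:
        return None  # indels, MNVs, and non-ACGT alleles are skipped
    if ref in ("A", "G"):
        # Purine reference: report the change as seen on the opposite strand.
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}>{alt}"


def sbs6_metrics(substitutions):
    """Tally SBS-6 classes, transitions, and transversions from (ref, alt) pairs."""
    counts = Counter({cls: 0 for cls in SBS6_CLASSES})
    for ref, alt in substitutions:
        cls = sbs6_class(ref, alt)
        if cls is not None:
            counts[cls] += 1
    ts = sum(counts[cls] for cls in TRANSITIONS)
    tv = sum(counts.values()) - ts
    return {
        "counts": dict(counts),
        "ts": ts,
        "tv": tv,
        "ts_tv": ts / tv if tv else None,  # undefined when there are no transversions
    }
```

For example, `sbs6_metrics([("G", "A"), ("C", "T"), ("G", "T")])` counts two C>T changes (transitions) and one C>A change (a transversion).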
If your pipeline outputs copy number variants (CNVs) and / or translocations (gene fusions), MOMA may be able to handle those as well. As of this version, MOMA handles these data natively for the Oncomine Comprehensive Assay (OCA) and TruSight Oncology 500 ctDNA. Because these kinds of data vary widely in their reporting format, helper scripts will almost certainly need to be written to handle output from other platforms. However, once the data are formatted appropriately with a helper script, the output can be annotated with OncoKB and merged into the final dataset just like the platforms mentioned above.
Note
In order to process these extra variant types, additional input files may be required. See MOMA Output for details.
Typical output from a MOMA run will be a final MOMA Report as a CSV file, as well as a directory of some other informative output files. The table below indicates these files and their purpose.
| Output File | Description |
|---|---|
| .log | Log file from the pipeline, containing useful information about the run, version numbers for resource files, and information about some of the variants that have been filtered out. |
| .annovar.txt | Text file output from Annovar that will be used to generate a MAF file. |
| .annovar.vcf | The VCF version of the Annovar output. Not currently used for anything. |
| .maf | MAF file of annotated variants. This file is somewhat truncated from the full set of MAF fields, but can still be parsed by many conventional MAF parsing tools. |
| .moma_report_<date>.csv | The final MOMA report. |
| .sbs_metrics.csv | Data for the Single Base Substitution 6 (SBS-6) matrix as determined from the VCF. Useful in determining whether there are deamination artifacts, for example. |
More details on the expected output from MOMA can be found in the MOMA Output section.
2. Installation
As this package is a collection of Perl and Python scripts, installation is simple and requirements are few. The package can be installed on any *nix system with at least 8 GB of RAM for smaller analyses (e.g. Oncomine Comprehensive Assay), and up to about 32 GB of RAM for larger WES analyses.
2.1. Requirements
The following packages and tools are required to run this plugin:
- Python3
- Perl v5.26+
- Annovar
- vcftools
- samtools
- bedtools
Be sure that these elements have been properly installed and are available in
your $PATH.
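A quick way to confirm the command-line tools are on your $PATH is a check along these lines. This is just a sketch: the executable names are assumed to be the standard ones, and Annovar is not checked here because it is copied into the package rather than found on $PATH.

```python
import shutil

# Assumed executable names for the requirements listed above.
REQUIRED_TOOLS = ("python3", "perl", "vcftools", "samtools", "bedtools")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that cannot be found on $PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

An empty list from `missing_tools()` means every tool was found.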
2.1.1. Scripting Languages
All of the code for this utility is written in Bash, Perl, or Python3. You can probably use whichever system Perl is available to you (tested with 5.26.x to 5.31.x), as there are no major, specialized language requirements (see below for module requirements). Python3 must be installed, as it is required for MOMA; Python 2.7 will not work. Python 2.7 reached end of life on January 1, 2020, and its continued use is discouraged.
Additionally, there are some Python and Perl libraries that should be installed as they are not typically part of the standard installation:
| Language | Package / Library |
|---|---|
| Perl | Text::CSV, Sort::Versions, Log::Log4perl |
| Python3 | pysam, natsort |
Note
The Python natsort library is required as well, but to help control versioning, a copy of this library has been included in the package.
These libraries can be installed using the typical tools / methods normally used
to install these components (e.g. cpan or its more streamlined cousin cpanm
for Perl and pip3 for Python3). Be sure that the modules can be found in
your $PERLLIB / $PYTHONPATH, and can be loaded.
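To confirm that the libraries can actually be loaded, something like the following can be used. It is a sketch with hypothetical helper names: the Python check looks the module up without importing it, and the Perl check simply asks `perl` to load the module and exit.

```python
import importlib.util
import shutil
import subprocess


def python_module_available(name):
    """Return True if the top-level Python module `name` can be found."""
    return importlib.util.find_spec(name) is not None


def perl_module_available(name, perl="perl"):
    """Return True if `perl -M<name> -e 1` succeeds, i.e. the module loads."""
    if shutil.which(perl) is None:
        return False  # no Perl interpreter found, so nothing can be loaded
    result = subprocess.run(
        [perl, f"-M{name}", "-e", "1"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```

For this package you would call `python_module_available("pysam")` and, for example, `perl_module_available("Text::CSV")`.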
2.1.2. Annovar Variant Annotation Package
In order to determine the coding sequence change, protein change, variant location, population frequency value, etc. for each variant in the VCF file, the data need to be annotated. Annovar was chosen due to its flexibility (the tool uses databases for this task, which can easily be modified and created), speed, and lighter footprint / resources. There certainly are other tools that can be used for the same task (e.g. VEP), and we are not endorsing one tool as being better than any other. For the purposes of MOMA, however, we decided that Annovar would be a good resource for the annotation component during the development of the tool.
To ensure that the data coming from the pipeline are always the same, and since Annovar cannot be distributed with this package, it must be installed and moved into this package as instructed below. We do not recommend using any other Annovar installation on your system at this time, and a check has been set up to ensure that a local copy of Annovar has been copied into the package.
2.1.2.1. Annovar Package Installation
As indicated above, Annovar is employed by this plugin to do variant annotation, and as such needs to be downloaded and installed along with some databases. Installation instructions can be found in the Annovar documentation.
Once downloaded, the Annovar installation should be put into the lib dir in
the MOMA package without the database files (essentially just the Annovar Perl
scripts). We’ll store the database files in a different location in the next
step.
2.1.2.2. Annovar Database Installation
We use some of the publicly distributed Annovar databases for this package, along with some custom ones. Since they are far too large to keep within this GitHub repo, they need to be obtained from a different resource.
Todo
I need to figure out a way to distribute these database files! Maybe I can get a tarball or something of these that can be used?
The following are the databases used by this plugin:
Required Annovar Databases
- Custom Databases
  - hg19_trunc_refGene, hg19_trunc_refGeneMrna (custom refGene database)
  - hg19_cosmic89_noEnst (custom COSMICv89 database)
- Default Databases
  - hg19_knownGene
  - hg19_avsnp142
  - hg19_dbnsfp35a
  - hg19_clinvar_20190305
  - hg19_popfreq_all_20150413
  - hg19_gnomad_exome
Since the required Annovar databases total more than 40 GB in size, they cannot
be included in this repository. Instead, you must download the databases
following the instructions in the Annovar documentation. Once
the files are downloaded, they should be decompressed and stored in a directory
called annovar_db in the resource dir of the package (full path:
mocha_oncogenic_moi_annotator/resource/annovar_db/).
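Once the databases are in place, a check like the one below can confirm that nothing is missing. It is a sketch only: the file names are taken from the directory tree shown later in this document, the `.idx` index files are not checked, and the function name is hypothetical.

```python
from pathlib import Path

# Main database files, as they appear in the package directory tree.
REQUIRED_DB_FILES = (
    "hg19_trunc_refGene.txt",
    "hg19_trunc_refGeneMrna.fa",
    "hg19_cosmic89_noEnst.txt",
    "hg19_knownGene.txt",
    "hg19_avsnp142.txt",
    "hg19_dbnsfp35a.txt",
    "hg19_clinvar_20190305.txt",
    "hg19_popfreq_all_20150413.txt",
    "hg19_gnomad_exome.txt",
)


def missing_databases(db_dir, required=REQUIRED_DB_FILES):
    """Return the required database files that are not present in `db_dir`."""
    db_dir = Path(db_dir)
    return [name for name in required if not (db_dir / name).is_file()]
```

Calling `missing_databases("mocha_oncogenic_moi_annotator/resource/annovar_db")` should return an empty list when the installation is complete.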
The custom databases, as indicated above, can be obtained from here:
Todo
Maybe I can put these on S3 for download?
2.1.2.3. Human Reference hg19
For some steps of the pipeline a human reference hg19 (GRCh37) FASTA file is
required. This file can be obtained from the
UCSC Golden Path Repository.
Once downloaded, place this file in the resource directory. To conserve
space, this file should be gzipped, and it will be indexed during the first run.
2.2. Setting up MOMA
Todo
Write a setup script that can help automate all of this.
Setting up MOMA is as simple as getting the package, getting Annovar, getting the necessary resource files (i.e. Annovar database and human reference files), and placing everything in the correct location.
1. Clone the MOMA repository from the MOMA GitHub repository.
2. Download the latest version of Annovar from the Annovar source.
3. Create a directory called annovar in the lib dir in the package root. The full path should be: mocha_oncogenic_moi_annotator/lib/annovar/
4. Move the Annovar Perl scripts downloaded in Step 2 into this new directory. You’ll typically see 6 scripts:
   - annotate_variation.pl
   - coding_change.pl
   - convert2annovar.pl
   - retrieve_seq_from_fasta.pl
   - table_annovar.pl
   - variants_reduction.pl

   We don’t need all of the Annovar scripts for this package, but there’s no reason to exclude any for now.
5. Get the human reference hg19 fasta.gz file and place it into the resource dir within the package root.
6. Create a new directory in resource called annovar_db, get the Annovar database files, and place them into the new directory.
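Until a proper setup script exists, the directory staging in the steps above could be automated with something like this. It is a rough sketch and not part of the package: the function name is hypothetical, it copies rather than moves the scripts, and it does not download Annovar, the databases, or the reference FASTA for you.

```python
import shutil
from pathlib import Path


def stage_annovar(annovar_src, package_root):
    """Copy the Annovar Perl scripts into lib/annovar and create resource/annovar_db.

    annovar_src  : directory holding the unpacked Annovar download
    package_root : root of the MOMA package
    Returns the sorted list of script names that were copied.
    """
    package_root = Path(package_root)
    script_dir = package_root / "lib" / "annovar"
    db_dir = package_root / "resource" / "annovar_db"
    script_dir.mkdir(parents=True, exist_ok=True)
    db_dir.mkdir(parents=True, exist_ok=True)

    copied = []
    # Only the Perl scripts are staged; database files are stored separately.
    for script in sorted(Path(annovar_src).glob("*.pl")):
        shutil.copy2(script, script_dir / script.name)
        copied.append(script.name)
    return copied
```

After running it, the Annovar database files still need to be placed in resource/annovar_db by hand.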
In the end, you should have a directory tree that looks like this:
lib
├── annovar
│ ├── annotate_variation.pl
│ ├── coding_change.pl
│ ├── convert2annovar.pl
│ ├── retrieve_seq_from_fasta.pl
│ ├── table_annovar.pl
│ └── variants_reduction.pl
├── bin
│ └── natsort
├── logger.py
├── natsort
│ ├── compat
│ ├── __init__.py
│ ├── __main__.py
│ ├── natsort.py
│ ├── ns_enum.py
│ ├── __pycache__
│ ├── unicode_numbers.py
│ ├── unicode_numeric_hex.py
│ └── utils.py
├── NonHotspotRules.pm
├── __pycache__
│ ├── logger.cpython-36.pyc
│ ├── logger.cpython-37.pyc
│ ├── utils.cpython-36.pyc
│ └── utils.cpython-37.pyc
└── utils.py
LICENSE.txt
MoCha_Oncogenic_Mutation_Annotator.py
moma_plugin.py
resource
├── annovar_db
│ ├── hg19_avsnp142.txt
│ ├── hg19_avsnp142.txt.idx
│ ├── hg19_clinvar_20190305.txt
│ ├── hg19_clinvar_20190305.txt.idx
│ ├── hg19_cosmic89_noEnst.txt
│ ├── hg19_cytoBand.txt
│ ├── hg19_dbnsfp35a.txt
│ ├── hg19_dbnsfp35a.txt.idx
│ ├── hg19_gnomad_exome.txt
│ ├── hg19_gnomad_exome.txt.idx
│ ├── hg19_knownGene.txt
│ ├── hg19_popfreq_all_20150413.txt
│ ├── hg19_popfreq_all_20150413.txt.idx
│ ├── hg19_trunc_refGeneMrna.fa
│ └── hg19_trunc_refGene.txt
├── blacklisted_vars.txt
├── count.txt
├── gene_reference.csv
├── hg19.fasta.gz
├── hg19.fasta.gz.fai
├── hg19.fasta.gz.gzi
├── mocha_tso500_ctdna_hotspots_v1.072018.bed
├── moma_cnv_lookup.tsv
├── moma_fusion_genes.tsv
├── moma_hotspot_lookup.txt
└── non-hotspot_rules.json
run_moma_pipeline.py
scripts
├── annovar2maf.pl
├── annovar_wrapper.sh
├── calc_tmb.pl
├── calc_tstv_deam.py
├── collate_moma_reports.pl
├── get_cnvs.pl
├── get_fusions.pl
├── get_var_counts_from_moma_results.pl
├── moma2rave.py
├── moma.pl
├── simplify_vcf.pl
├── tso500_cnvs.pl
├── tso500_fusions.pl
└── usage.sh
templates
├── barcode_block.html
├── barcode_summary.html
└── progress_block.html
test
├── gen_tests
│ ├── 19-31014-002-Q_cfNA_rep1.clean.maf
│ ├── 19-32096-004-SCRN_cfTNA_rep1.clean.annotated.filtered.maf
│ ├── H7T7_cfTNA_rep3.clean.annotated.filtered.maf
│ ├── H7T7_cfTNA_rep3.clean.maf
│ ├── nhs_tests.json
│ ├── nhs_test.truncmaf
│ ├── sample.truncmaf
│ └── test_nonhs_rules_module.pl
├── ocav3
│ ├── 0CFDXX_IonXpress_079.vcf
│ └── oca.vcf
├── tso500
│ ├── Horizon_2-000_rep1.cnv.fc.txt
│ ├── Horizon_2-000_rep1.fusion.txt
│ ├── Horizon_2-000_rep1.vcf
│ ├── Lovo_nuc.vcf
│ ├── npDNA121_MS_rep1.vcf
│ ├── poolcf88_RDH_rep1.vcf
│ ├── W313718105511_cfNA.vcf
│ ├── W313718105515_cfNA.vcf
│ └── W313718105516_cfNA.vcf
└── wes
  ├── 114434.consensus.vcf
  └── 128128~338-R~L42~WES.merged.vcf
_version.py
Note
There are some files and scripts in this package that are not currently used, but are intended for use in Ion Torrent Plugins downstream. They can be ignored for now.
2.3. Running Tests
Included in the package is a set of test VCF files that can be run through MOMA.
You can find these located in the test directory within the package.
Todo
Set up a simple test harness script to run all tests at once and compare the data.
You can simply attempt to process each platform-specific VCF file through MOMA following the usage instructions in the Running MOMA section. If these tests complete successfully, you have a fully working instance, ready to process samples.
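To see which VCF files are available for each platform before running them, a small helper like this can be used. It is only a sketch with a hypothetical function name; it lists the files and does not run MOMA itself, since the command line is covered in the Running MOMA section.

```python
from pathlib import Path


def collect_test_vcfs(test_dir):
    """Map each platform subdirectory of `test_dir` to its sorted VCF file names."""
    vcfs = {}
    platform_dirs = sorted(p for p in Path(test_dir).iterdir() if p.is_dir())
    for platform_dir in platform_dirs:
        found = sorted(f.name for f in platform_dir.glob("*.vcf"))
        if found:  # skip directories with no VCFs (e.g. MAF-only test data)
            vcfs[platform_dir.name] = found
    return vcfs
```

Each file returned by `collect_test_vcfs("test")` can then be processed through MOMA in turn.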