Identifying Gene Targets for Malaria with Excel
Malaria is one of the great global health problems today, taking a large toll on people in the tropics and subtropics. Malaria is caused by Plasmodium, eukaryotic parasites that are transmitted to humans in the saliva of mosquitoes. This disease sickens over 200 million people annually and kills nearly half a million, the majority of whom are children under age five. The best malaria vaccine currently available has less than 50% efficacy, drug resistance to antimalarials is on the rise, and current antimalarials do not effectively block transmission from infected individuals back to the mosquitoes that spread the disease. New antimalarial strategies are therefore needed. In this exercise, you will use Excel as a statistical tool to determine which proteins would be good targets for a malaria vaccine.
Activity 1: Understanding the Plasmodium Life Cycle in Malaria and the Data Available
Instructions for Activity 1: Take detailed notes on the content below within a digital document. Your goal is to fully understand the life cycle of Plasmodium and the important parts (cells, genes, RNA, proteins, etc.) and interactions within this system so that you can understand this issue. Use the internet to look up words that you need defined or explained, using only reputable sources. For instance, if you find the image below leaves you with questions, you could search the internet using terms such as "Plasmodium Life Cycle". This could bring you to this accurate resource from the CDC which you can use as part of your journey to understanding the topic more fully. Also, as you are reading below, if you need a refresher on how genes and proteins are related, you can try search terms that include words such as "central dogma, molecular biology, gene expression, transcription, translation" in addition to the words genes and proteins. ISB's Central Dogma Game may also help you remember the connection between these important cellular parts and processes.
In addition to taking detailed notes, you should also use your document to keep track of your questions, wonderings, references, useful figures, etc.
Figure 1. Life cycle of the Plasmodium parasite, the causative agent of malaria.
Plasmodium parasites have a complex life cycle that involves transmitting back-and-forth between mammalian hosts (including humans) and female Anopheles mosquitoes (Figure 1). When a mosquito takes blood from an infected host, Plasmodium gametocyte cells rapidly mature into oocyst sporozoite cells (OO-spz) in the mosquito gut, and salivary gland sporozoite cells (SG-spz) in the mosquito salivary gland. SG-spz will then be transmitted to a mammalian host when the mosquito takes a blood meal. Though OO-spz and SG-spz have identical genetic makeup, they have highly different capabilities. OO-spz can effectively invade mosquito salivary glands, but are barely able to infect a host. Conversely, SG-spz are highly infectious to mammals, but are unable to re-invade the salivary glands.
How can organisms with identical genetics function so differently? We must hypothesize that these identical organisms are expressing different genes that allow them to thrive in different environments. This is not a unique feature of Plasmodium. Think about the cells in your body. Each cell carries 100% of your genetic information. However, an eye cell, for example, only expresses the genes needed to perform its role in the eye, which differ from the genes expressed in a skin cell of the hand. Therefore, the answer to this puzzling phenomenon in Plasmodium cell types is in the proteins.
How can we identify the genes that are required for SG-spz to be infectious in a host?
We can hypothesize that genes and proteins that are upregulated, or increased, in Infectious Sporozoites (UIS) are essential for the parasite to transition between the mosquito and mammal, and are therefore ideal targets for new therapies to prevent malaria.
There are two types of genes that we were interested in identifying:
1) Genes for which the protein abundance increases in SG-spz, indicating that the protein will likely be required for the parasite to successfully invade the mammalian host.
2) Genes for which transcript abundance increases, but no protein is made. This second scenario indicates a gene that is being subjected to “translational repression”, a process in which the parasite transcribes the gene but sequesters the mRNA, in preparation for when a signal indicates that the protein is required. It is known that Plasmodium uses translational repression in the gametocyte stage as it waits to be transmitted to mosquitoes. We hypothesized that the same process happens in sporozoites as they wait to be transmitted to a host.
What data is available and how was it collected and prepared?
In the lab, OO-spz and SG-spz cells were harvested from mosquitoes.
RNA-seq: Next-generation sequencing was used to quantify transcript abundance for each gene in three biological replicates each of OO-spz and SG-spz. The values in these data have been modified (through a process called normalization) to allow direct comparison of the values among all six data sets.
Proteomics: Proteins were extracted and treated with an enzyme that cuts the long proteins into shorter peptides. The peptides were sequenced and quantified by liquid chromatography-mass spectrometry (LC-MS). The proteomics data are given as peptide spectrum matches (PSMs), the number of times the instrument identified a peptide belonging to a given protein. Most proteins are identified from multiple peptides, so the total PSM count for a protein is the sum of the PSMs from all peptides that came from that protein. PSMs are integers, but in some cases a peptide has a sequence that could belong to more than one protein. In these cases, the PSM count for the peptide is split among all of the proteins it could belong to. As a result, the PSM values for some proteins are not whole numbers. In general, the number of PSMs increases linearly with increasing protein concentration, but with several caveats (discussed below). Because proteins cannot be amplified the way that DNA can, a great deal more sample is required to perform proteomics compared to RNA-seq, and the sensitivity is lower. Consequently, we were only able to perform a single biological replicate each for OO-spz and SG-spz, and fewer total proteins were identified than transcripts from the same samples.
Activity 2: Using Data to Identify and Analyze Transcripts and Proteins with Excel
Instructions for Activity 2 and the Reflection: Using the same digital document you started above, take notes on the content below. As you begin to use Excel or Google Sheets, take screenshots of all of your steps and add those to your document in order to track your ongoing analysis and findings. You can also link to your Excel sheet and to other work within your digital document. As you move through the activities below in your Excel sheet, feel free to add tabs that preserve each of your data analyses and/or outputs. You can rename each tab to keep track of what each one shows (e.g. "Abundant OO-spz Transcripts", or the shortened "Sorted OO RNA"). Use the internet to look up words that you need defined or explained, using only reputable sources. You can also use the internet to remind you HOW to do things in Excel (such as how to use functions to quickly find averages for hundreds of rows of data). Keep track of all of your thinking and questions within your document.
Identify transcripts and proteins that are Upregulated in Infectious Sporozoites (UIS), i.e., that are significantly more abundant in SG-spz compared to OO-spz.
Identify transcripts that are UIS and/or highly abundant in SG-spz, but for which little or no protein is produced, suggesting that the transcript may be translationally repressed.
The data are given as an array in an Excel spreadsheet. Each row is a unique gene. Each column is a unique experiment. More information about any gene (including its sequence, literature references, and lots of data from other experiments) can be found at PlasmoDB.org.
The published data analysis involved several complex algorithms designed to account for experimental variability inherent to RNA-seq. However, you can still identify important changes in transcript abundance through a fairly straightforward analysis of the raw data we have provided:
Link to download Malaria Parasite Data
Save a working copy of the dataset to your computer or Google Drive. Let's start analyzing the data!
1) What are the most abundant transcripts in OO-spz? In SG-spz? Is there any overlap?
For each gene, compute the average read counts for SG-spz and OO-spz.
Sort the average read counts in descending order. What are some of the most abundant genes for SG-spz and OO-spz?
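For readers who want to cross-check their spreadsheet, the averaging and sorting steps above can be sketched in Python. This is only an illustration: the gene names and read counts are made up, and the helper names `average_counts` and `rank_by_abundance` are hypothetical, not part of the lesson's dataset or of any library.

```python
from statistics import mean

# Made-up normalized read counts, for illustration only:
# gene ID -> (three OO-spz replicates, three SG-spz replicates)
read_counts = {
    "gene_A": ([120.0, 100.0, 110.0], [900.0, 1000.0, 950.0]),
    "gene_B": ([500.0, 520.0, 480.0], [510.0, 490.0, 500.0]),
    "gene_C": ([5.0, 0.0, 1.0], [300.0, 280.0, 320.0]),
}


def average_counts(counts):
    """Return gene -> (mean OO-spz, mean SG-spz), like averaging each row."""
    return {gene: (mean(oo), mean(sg)) for gene, (oo, sg) in counts.items()}


def rank_by_abundance(averages, column, top_n=5):
    """Sort genes by one average column (0 = OO-spz, 1 = SG-spz), descending."""
    ranked = sorted(averages, key=lambda gene: averages[gene][column], reverse=True)
    return ranked[:top_n]


averages = average_counts(read_counts)
top_oo = rank_by_abundance(averages, column=0)
top_sg = rank_by_abundance(averages, column=1)

# Genes that appear in the top 2 of both lists (use a larger cut for real data).
overlap = set(top_oo[:2]) & set(top_sg[:2])
print(top_oo, top_sg, overlap)
```

In a spreadsheet, the same result comes from an average column for each sample type followed by a descending sort on that column.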
Answer: Three transcripts are in the top 5 for both SG-spz and OO-spz: CSP, TRAP, and an uncharacterized protein with the gene ID PY17X_1452200. CSP and TRAP are required by the parasite for invading host tissues. No one knows what PY17X_1452200 does – it is a good target for future studies. The #1 transcript in SG-spz is called UIS4. This transcript was identified as UIS in previous experiments. It is required by the parasite for invading the liver.
2) What transcripts are most up-regulated in SG-spz? Are any of these already designated as UIS?
1. Use a t-test to compute a p-value for each transcript.
In MS Excel or in Google Sheets, use the formula =TTEST(Range 1, Range 2, tails, type) for each sample row to compute the p-value for the RNA-seq values.
Range 1 is the three SG-spz values, Range 2 is the three OO-spz values.
Tails = 2, indicating a two-tailed t-test. We use this type (as opposed to a 1-tailed test) because the OO-spz value can be either greater than or less than the SG-spz value we are comparing it against.
Type = 2. The three options for this argument are:
Type 1: Paired test, e.g., the same sample is tested under two different conditions
Type 2: Two-sample equal variance (homoscedastic) test
Type 3: Two-sample unequal variance (heteroscedastic) test
This analysis compares two distinct samples that nonetheless have similar variance, so Type 2 is used.
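To see what the spreadsheet formula is doing for one row, here is a minimal Python sketch of a two-tailed, two-sample equal-variance t-test. It uses only the standard library and approximates the p-value by numerically integrating the t distribution, so it is a teaching aid rather than a replacement for Excel or a statistics package. The function names and the example values are hypothetical.

```python
import math
from statistics import mean, variance


def t_pdf(x, df):
    """Probability density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def equal_variance_t_test(sample_1, sample_2, steps=2000):
    """Two-tailed, two-sample equal-variance t-test (tails = 2, type = 2).

    Returns (t statistic, p-value). The p-value is approximated with
    Simpson's rule, so `steps` must be an even number.
    """
    n1, n2 = len(sample_1), len(sample_2)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(sample_1) + (n2 - 1) * variance(sample_2)) / df
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    if std_err == 0:
        raise ValueError("both samples have zero variance; t is undefined")
    t = (mean(sample_1) - mean(sample_2)) / std_err

    # Simpson's rule for the area under the density between 0 and |t|.
    upper = abs(t)
    h = upper / steps
    area = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area *= h / 3

    p_value = max(0.0, 1 - 2 * area)  # probability left in both tails
    return t, p_value


# Made-up replicate values for a single transcript.
sg_values = [10.0, 12.0, 14.0]
oo_values = [1.0, 2.0, 3.0]
t_stat, p = equal_variance_t_test(sg_values, oo_values)
print(round(t_stat, 3), round(p, 5))
```

With three replicates per group there are only 4 degrees of freedom, which is one reason the p-values from this kind of analysis should be read together with the fold-change.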
2. Correct for multiple hypothesis testing by calculating a false discovery rate (FDR), e.g. using the Benjamini-Hochberg method:
Sort the p-values in ascending order.
Copy the values and paste them into the on-line calculator at https://www.sdmproject.com/utilities/?show=FDR
Copy and paste the output values back into the spreadsheet. A cut-off of <0.05 (5% false discovery rate) is a common choice. (Excel reminder: You may need to "transpose" your values into your Excel sheet to display them in a column rather than a row. Using new tabs as working sheets helps keep track of your steps and will allow you to go back to double-check issues should they arise. And of course, your final columns of values can always be pasted back into your master sheet. Another good reminder: If you find something hard or tedious in Excel, you can probably find an easy workaround with a quick internet search.)
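If you would like to see how the correction works rather than treating the calculator as a black box, the following is a minimal Python sketch of the standard Benjamini-Hochberg adjustment. It is an illustration under the usual textbook definition; the on-line calculator may differ in details such as rounding. The function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values never
    # increase as the rank decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values for four transcripts.
p_values = [0.01, 0.04, 0.03, 0.005]
fdr = benjamini_hochberg(p_values)
passing = [i for i, q in enumerate(fdr) if q < 0.05]
print(fdr, passing)
```

Each p-value is multiplied by the number of tests and divided by its rank, which is why sorting in ascending order is the first step of the spreadsheet procedure.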
3. Calculate a fold-change ratio for each gene (the amount by which the transcript abundance increases in SG-spz vs OO-spz) by dividing the average value for SG-spz by the average value for OO-spz.
4. Since calculating a ratio is only possible for genes that have values in both OO-spz and SG-spz, do one of the following for transcripts that are only detected in one sample type or the other:
Analyze transcripts without ratios separately from those with (e.g., for transcripts only detected in SG-spz, sort by abundance and find the most abundant transcripts only found in SG-spz). For those with ratios, simply sort the list to find the transcripts with the highest fold-change.
In order to generate ratios for every transcript, substitute read count values of zero with a non-zero number. A simple and conservative method is to replace all values of zero with the lowest non-zero value in the data set, e.g. 0.36 in SG-spz replicate 1 or 1.0 in SG-spz replicate 2.
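The zero-replacement option and the fold-change calculation can be sketched in Python as follows. This sketch assumes that "the lowest non-zero value in the data set" is taken separately for each replicate column, as the example values above suggest. All numbers and helper names are made up for illustration.

```python
from statistics import mean


def floor_zeros(column):
    """Replace zeros in one replicate column with its lowest non-zero value."""
    lowest = min(value for value in column if value > 0)
    return [value if value > 0 else lowest for value in column]


def fold_change(sg_values, oo_values):
    """Average SG-spz abundance divided by average OO-spz abundance."""
    return mean(sg_values) / mean(oo_values)


genes = ["gene_A", "gene_B", "gene_C"]
# Each inner list is one replicate column; positions match `genes`.
oo_columns = [[10.0, 0.0, 0.5], [20.0, 0.0, 1.0], [30.0, 2.0, 1.5]]
sg_columns = [[40.0, 60.0, 1.0], [40.0, 70.0, 1.0], [40.0, 80.0, 1.0]]

oo_columns = [floor_zeros(column) for column in oo_columns]
sg_columns = [floor_zeros(column) for column in sg_columns]

# zip(*columns) turns the replicate columns back into one row per gene.
ratios = {
    gene: fold_change(sg_row, oo_row)
    for gene, sg_row, oo_row in zip(genes, zip(*sg_columns), zip(*oo_columns))
}
print(ratios)
```

Replacing zeros with a small positive value keeps the ratio finite while still giving a large fold-change to transcripts that were essentially absent in OO-spz.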
Answer: UIS4 (PY17X_0502200) is the most highly up-regulated transcript in SG-spz relative to OO-spz. This protein is required for invasion of the liver. Several other UIS proteins are also significantly up-regulated, including UIS2 and UIS3. Interestingly, many of the most up-regulated transcripts are annotated simply as “conserved Plasmodium protein, unknown function”. These are high-priority targets for follow-up studies.
Analyzing the proteomics data is not as straightforward as analyzing the RNA-seq data. One challenge is that there is only one biological replicate of each sample type, so it is not possible to perform a t-test to identify significantly up-regulated proteins. In the manuscript, we employed a variety of statistical methods to conservatively identify proteins that show evidence for being truly different between the two sample types. However, it is also possible to glean important information from the data by observing general trends.
Question: What are the most abundant proteins in OO-spz? In SG-spz? Is there any overlap? How do the relative protein abundances for CSP and TRAP compare with their transcript abundances?
For each protein, first calculate a spectral abundance factor (SAF) by dividing the PSMs by protein length. This simple normalization accounts for the fact that a longer protein would be expected to produce more detectable peptides than a shorter protein at the same concentration.
Sort the proteins by SAF in descending order.
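As a quick illustration of why the length normalization matters, here is a small Python sketch of the SAF calculation and sort. The protein names, PSM counts, and lengths are invented, and the function name is hypothetical.

```python
def spectral_abundance_factor(psms, protein_length):
    """SAF = PSMs divided by protein length (in amino acids)."""
    return psms / protein_length


# Made-up values: protein -> (PSMs in one sample, protein length)
proteins = {
    "protein_X": (120.0, 400),
    "protein_Y": (120.0, 1200),
    "protein_Z": (30.5, 100),
}

saf = {
    name: spectral_abundance_factor(psms, length)
    for name, (psms, length) in proteins.items()
}
ranked = sorted(saf, key=saf.get, reverse=True)
print(ranked)
```

Here protein_X and protein_Y have the same PSM count, but the shorter protein_X ranks higher, and the short protein_Z ranks first despite having the fewest PSMs.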
Answer: The majority of the most abundant proteins are so-called “house-keeping” proteins: tubulin, actin, histone, GAPDH. CSP and TRAP are both in the top 10% of the most abundant proteins, but not #1 and #2 like their transcripts. This discrepancy arises from both technical reasons and biological reasons. On the technical side, differences in how transcript abundance and protein abundance are measured lead to variance in the instrument readouts for the two different methods. On the biological side, RNA and protein abundance both fluctuate over time, and proteins may persist for a while or be rapidly degraded or even secreted out of the parasite. Since we have taken a snapshot of protein and RNA abundance at a single time point, it is therefore expected that we will observe some deviation from a pure 1:1 ratio of RNA:protein. That said, we can still observe general trends, e.g., that high-abundance transcripts tend to produce proteins at relatively high abundance.
Question: What proteins are most up-regulated in SG-spz? Are any of these already designated as UIS? Compare the changes in transcript abundance versus protein abundance for UIS2, UIS3 and UIS4. Are the transcripts and proteins similarly regulated between OO-spz and SG-spz?
Calculate a fold-change ratio for each protein by dividing the PSMs for SG-spz by the value for OO-spz.
As with the RNA-seq data, it is only possible to calculate a ratio for proteins detected in both samples. The same two strategies may also be used: either consider the proteins with and without ratios separately, or replace all zeros with a value of 1 PSM.
Because there is only one replicate per sample type, we cannot use a t-test. Furthermore, ratios obtained from small PSM values near the limit of detection are less reliable than those obtained from large values. For example, a ratio of 4 PSMs to 1 PSM is technically a four-fold difference, but so small a difference could just be instrumental noise, whereas a ratio of 100 PSMs to 25 PSMs likely represents a real change in protein abundance. For our analysis, we only called a protein UIS if it was in the top 50% of the most abundant proteins and it was at least six-fold upregulated in SG-spz.
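The two-part rule described above can be sketched in Python. Several details are assumptions made for this illustration and are not stated in the text: abundance is ranked by SAF in SG-spz, zeros in OO-spz are replaced with 1 PSM before computing the ratio, and "top 50%" is taken as the top half of the ranked list. The data and the function name are invented.

```python
def call_uis_proteins(proteins, min_fold=6.0, top_fraction=0.5):
    """Return proteins that are abundant in SG-spz and at least min_fold up.

    proteins: name -> (OO-spz PSMs, SG-spz PSMs, protein length)
    Assumption: abundance is ranked by SG-spz SAF (PSMs / length).
    Assumption: zero PSMs are replaced with 1 PSM before taking the ratio.
    """
    saf = {name: sg / length for name, (oo, sg, length) in proteins.items()}
    ranked = sorted(saf, key=saf.get, reverse=True)
    abundant = set(ranked[: int(len(ranked) * top_fraction)])

    uis = []
    for name, (oo, sg, length) in proteins.items():
        fold = max(sg, 1.0) / max(oo, 1.0)
        if name in abundant and fold >= min_fold:
            uis.append(name)
    return uis


# Made-up values: name -> (OO-spz PSMs, SG-spz PSMs, length)
example = {
    "protein_1": (0.0, 60.0, 300),     # not detected in OO-spz, abundant in SG-spz
    "protein_2": (25.0, 100.0, 1000),  # abundant, but only four-fold up
    "protein_3": (1.0, 4.0, 800),      # four-fold up on very few PSMs
    "protein_4": (2.0, 14.0, 2000),    # seven-fold up, but low abundance
}
uis_calls = call_uis_proteins(example)
print(uis_calls)
```

Requiring both conditions filters out large ratios that rest on only a handful of PSMs, which is the concern raised in the paragraph above.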
Answer: The most highly abundant proteins in SG-spz that were not detected at all in OO-spz include many that are known to be important for the sporozoite to successfully infect the mammalian host, e.g. GEST, SPELD, UIS3, P36 and P36p. Importantly, parasites lacking P36 and P36p invade the liver but are unable to mature, and so they die without infecting the blood. Genetically modified parasites lacking these proteins are being developed as a next-generation malaria vaccine. Similarly, the proteins with the largest fold-change ratios are PLP1, CelTOS, and UIS2, all of which the sporozoite uses to invade and infect the mammalian host.
While UIS2 and UIS3 are abundant both as transcript and protein in SG-spz, UIS4 shows evidence of being translationally repressed: the transcript is highly up-regulated in SG-spz, but the protein is almost undetectable. It is known that UIS4 is highly expressed soon after the sporozoite invades the liver in the mammalian host. Our data suggest that the sporozoite makes lots of UIS4 transcript but does not translate it into protein, instead waiting for some signal that it is in the liver. Having all of the transcript pre-made will allow the parasite to produce UIS4 protein very rapidly when it is needed.
How did this activity help find malaria vaccine targets, and how could we use this process for other parasites?
As you answer this question, consider how gene expression plays a role in an organism's dynamic response to its environment. Consider this specifically in the context of a parasite that must spend time both inside and outside of a host. Also consider what this means in the context of vaccine development.
(If you get stuck, here are a few hints and thoughts to get you going: If a gene is upregulated, we can anticipate that certain proteins will ultimately be made. And if we can see what genes a parasite upregulates in preparation for or soon after infecting a host, we can identify proteins that we should target with drugs or vaccines. This process of identifying so-called "essential" genes - genes that, if deleted, prevent the parasite from completing its life cycle or being successful in a host - is a huge part of parasitology research.)
Students, please take this 1-minute survey, now that you've completed this activity. We are interested in learning about your experience so we can improve these resources. All responses to this survey are anonymous, all questions are optional, and your feedback is much appreciated.
Lindner, Scott E., Kristian E. Swearingen, Melanie J. Shears, Michael P. Walker, Erin N. Vrana, Kevin J. Hart, Allen M. Minns, Photini Sinnis, Robert L. Moritz, and Stefan H. I. Kappe. 2019. “Transcriptomics and Proteomics Reveal Two Waves of Translational Repression during the Maturation of Malaria Parasite Sporozoites.” Nature Communications 10 (1): 4964. https://doi.org/10.1038/s41467-019-12936-6.
Contributors and Supporters
Funding to support the development of this lesson was provided by the Institute for Systems Biology Innovator Award (ISB Project #10520010000) and National Science Foundation Award DBI-1565166. The content of these pages was created by students for students with the help of teachers and scientists. The views expressed herein are those of the authors and do not necessarily reflect the views of NSF.