Proteins and COVID-19

by Betemariyam Gessessee

What are Proteins?

According to the Oxford Dictionary, a protein is "any of a class of nitrogenous organic compounds that consist of large molecules composed of one or more long chains of amino acids and are an essential part of all living organisms, especially as structural components of body tissues such as muscle, hair, collagen, etc., and as enzymes and antibodies."

What do they do?

Proteins play a massive role both inside and outside of the cell, and remain vital "...for the structure, function, and regulation of the body's tissues and organs." Although we cannot see them with our own eyes, proteins have a profound impact on our health. In the context of the COVID-19 pandemic in 2020, they play an incredible role in our well-being.

Take antibodies for example: "antibodies are specialized, Y-shaped proteins that bind like a lock-and-key to the body's foreign invaders", whether those invaders are viruses, bacteria, fungi or parasites. In other words, for our bodies to combat the many viruses and bacteria that linger in our environment, our immune system employs proteins; these antibodies are produced by white blood cells called lymphocytes.

An important idea to consider when talking about proteins is that, much like the factors that surround and shape our everyday lives, proteins have their own unique and complicated system. They are able to interact and engage with one another, much like how people work and talk with one another. In fact, proteins do not carry out their functions alone but rather in complexes; I will be looking into exactly how they operate as teams, quite literally.

Proteins in Yeast

To dive even deeper into the realm of proteins, there is incredible research that shows tens of thousands of protein interactions in yeast alone! That is one of my motivations for choosing protein networks as my system: the same kinds of complex mechanisms are also present within our own biological system.

My Scope

Given the vast number of protein networks, it's important that I pick an exact setting to observe. As the information above might suggest, I looked into the interconnections of proteins in the presence of the COVID-19 virus.

Unfortunately, my research is limited to the internet and the resources uploaded to it. As a computational modeler, however, this isn't all too bad. I have the luxury of opening and reusing data that has been published by other researchers, so I can incorporate the measurements from their published research into my own. "The Biological General Repository for Interaction Datasets (BioGrid), is a curated biological database of protein-protein interactions, genetic interactions, chemical interactions, and post-translational modifications."

I will be using the open-source data in this repository as it relates to protein-protein interactions. And since there is a viral component to our protein network's environment, I can expand my scope of observation to genetic interactions as well. All of this is accessible for free.


CoronavirusDataSet-DESKTOP-FKIOSO8.xlsx

Understanding the data

The data above is extremely complex and very extensive. For my project, however, being able to understand just one of the interactions is sufficient. Why? When looking at a visualization of the data above, which is my final project, I can identify a specific protein that stands out in the COVID-19/larger protein process shown in the visualization. I can then work backwards to tie the science behind that specific protein to the visualization.

For example, imagine that my network/system was air traffic. If I had obtained data on the number of people at airports, the number of airports, the number of flights, flight models and IDs, the number of destinations, etc., I could potentially visualize a network of flights across entire regions. However, because this data would be so large and extensive, like the protein-protein-gene interaction data above, I can choose to focus on an area that is very dense in my visualization. In other words, if I set my source nodes to be departures and my terminal nodes to be arrivals, and let all other aspects of the data be attributes of the edges/relationships between them, I can see areas where a significant phenomenon is occurring within my model. Furthermore, I can see exactly how significant and vital an area is by evaluating its density within my model.
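The air traffic analogy can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the airport codes and flights are made up, and "density" is approximated here by simply counting how many edges touch each node.

```python
from collections import Counter

# Hypothetical flights as (departure, arrival) pairs; this is made-up data.
flights = [
    ("SEA", "LAX"),
    ("SEA", "JFK"),
    ("LAX", "JFK"),
    ("ORD", "JFK"),
    ("JFK", "SEA"),
    ("ORD", "SEA"),
]


def count_connections(edges):
    """Count how many edges touch each node (incoming plus outgoing)."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    return degree


degree = count_connections(flights)

# The nodes with the highest counts sit in the densest parts of the network.
for airport, connections in degree.most_common():
    print(airport, connections)
```

The same counting works for any list of source/target pairs, which is why the analogy carries over to protein interactions.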

That is what I did for the extensive data set above: I visualized the model and chose areas where there seemed to be a "cluster", an area of dense interaction. Once I find such a cluster, I can research more about that specific point. For now, however, an important element of the data that I bear in mind is the column outlined in red. That is how I identify specific genes and potentially significant areas.

How to begin visualization

"Cytoscape is an open source bioinformatics software platform for visualizing molecular interaction networks and integrating with gene expression profiles and other state data." In simpler terms, this program is the perfect computational tool for understanding and interpreting our large and extensive data. Cytoscape does most of the heavy lifting, so to speak, by generating the visualization itself; all I need to do is upload our network data as a file, which in this case is the Excel file.

Once I have imported my network as a file, I need to set my parameters. What I mean by this is that Cytoscape needs to be told where each relationship starts and stops in the data. It also asks which characteristics should be considered when creating those edges/relationships. For this reason I set Entrez Gene Interactor A as my source node and Entrez Gene Interactor B as my target node. For all remaining columns, I set those describing an aspect of Interactor A as source node attributes and those describing an aspect of Interactor B as target node attributes. When I do this, I am essentially adjusting how Cytoscape views the data so that it can understand which elements are characteristics of other elements.
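To make the column mapping concrete, here is a rough Python sketch of the same idea outside of Cytoscape. It is not how Cytoscape works internally. The rows are made-up placeholder values, and the exact header spellings are an assumption about the exported file.

```python
import csv
import io

SOURCE_COLUMN = "Entrez Gene Interactor A"  # chosen as the source node
TARGET_COLUMN = "Entrez Gene Interactor B"  # chosen as the target node

# Made-up rows in the shape described above; the real file has far more.
header = [SOURCE_COLUMN, TARGET_COLUMN,
          "BioGRID ID Interactor A", "BioGRID ID Interactor B"]
rows = [
    ["1001", "2001", "9001", "9101"],
    ["1001", "2002", "9001", "9102"],
    ["1002", "2001", "9002", "9101"],
]
sample_text = "\n".join("\t".join(line) for line in [header] + rows)


def build_network(table_text):
    """Turn a tab-separated table into an edge list plus node attributes."""
    edges = []
    node_attributes = {}
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    for record in reader:
        source = record[SOURCE_COLUMN]
        target = record[TARGET_COLUMN]
        edges.append((source, target))
        for column, value in record.items():
            if column in (SOURCE_COLUMN, TARGET_COLUMN):
                continue
            # Columns about Interactor A describe the source node,
            # columns about Interactor B describe the target node.
            if column.endswith("Interactor A"):
                node_attributes.setdefault(source, {})[column] = value
            elif column.endswith("Interactor B"):
                node_attributes.setdefault(target, {})[column] = value
    return edges, node_attributes


edges, node_attributes = build_network(sample_text)
print(edges)
print(node_attributes)
```

Each row becomes one edge, and every other column is attached to whichever end of the edge it describes.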

Definition of Entrez

The term is best understood through the definition from the National Center for Biotechnology Information (NCBI): "Entrez Gene generates unique integers (GeneID) as stable identifiers for genes ... for a subset of model organisms". Furthermore, "It tracks those identifiers and uses them to integrate multiple types of information including...summary descriptions...reports of pathways and protein interactions, associated markers and phenotypes. Because the GeneID is used to represent gene-specific information in other databases at NCBI, the full Entrez Gene report includes a wealth of links to gene-specific literature citations, sequences, variations, homologs and databases outside of NCBI."

Entrez Gene Interactor A, for example, allows different repositories to recognize that specific gene/protein. So the numbers under that term in the Excel sheet are identifiers for genes, and they are useful in recognizing the protein interactions within the visualizations I will make later on. Also, notice that the gene-specific information is present in databases outside of NCBI. Luckily, BioGrid has its own encyclopedia for all of these Entrez Gene interactors; they are called BioGrid IDs. Columns D and E hold the values I will refer to once I have successfully visualized the data. In simplest terms, those numbers are identifiers for different genes and proteins, and through the BioGrid repository I can find the exact literature that helps me better understand what they do and why they do it.

Visualize Using Cytoscape

Data Visual 1

Data Visual 2

Data Visual 3

Data Visual 4


Data Visual 5

Data Visual 6

Data Visual 7

Data Visual 8

What do you see?

All those dots that you see are the BioGrid IDs. Each one has its own characteristics. The lines that you see are the interactions of the protein-protein-gene relationship. I will be using one of these layouts to highlight the very dense areas. Keep in mind that although the forms of the visualizations are different, I used the same data with the same source node, target node and attributes. In other words, if I were to zoom in and magnify deeply into each of these layouts, I would see the BioGrid IDs labeled on each of the nodes.

Take a quick look at the animation below to get a glimpse of how Cytoscape displays the data.

Zooming in, you can see that each line is actually an arrow. Although it is hard to see in the exported image, the writing on each arrow reads "interacts with". So each of those arrows represents an interaction with a BioGrid ID. Now that I have a visual for the data, I can locate a specific ID and look it up inside BioGrid's repository. In doing this, I will not only be able to find a reason for some of the patterns in the visual image but also recognize the relationships that these genes and proteins have with one another.

Picking an emphasis point

Central Nodes

Here I pinpoint the nodes that sit at the center of many interactions. I will later identify these nodes by their BioGrid Interactor A/B identification. Under the Marquee setting in Cytoscape I can see a circular region around each of them; in this context the circles act more like spheres of influence. The circles are formed by the overlapping, gathered edges of the nodes.

Density

Zooming out of the image and simply selecting all of the edges directed into and out of these nodes illustrates just how crucial these nodes are to the entire system. Removing those nodes would mean the loss of a lot of data and a lot of protein-protein interactions. That is what I mean by density.
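One rough way to put a number on this idea is to count how many edges would disappear if a node were removed. The sketch below uses a made-up toy network, not my real data, and the function name is just something I chose for illustration.

```python
def edges_lost_if_removed(edges, node):
    """Count the edges that start or end at the given node."""
    return sum(1 for source, target in edges if node in (source, target))


# Made-up toy network: "hub" touches four of the five edges.
toy_edges = [
    ("hub", "a"),
    ("hub", "b"),
    ("c", "hub"),
    ("d", "hub"),
    ("a", "b"),
]

lost = edges_lost_if_removed(toy_edges, "hub")
share_lost = lost / len(toy_edges)
print(f"Removing 'hub' drops {lost} of {len(toy_edges)} edges ({share_lost:.0%}).")
```

A node whose removal wipes out most of the edges is exactly the kind of central, dense node I am looking for.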

Identifying by BioGrid ID

By selecting these nodes I can pull up their characteristics and attributes from the table that Cytoscape holds. From this feature, I can identify the nodes by their BioGrid ID. If you look under BioGrid ID Interactor A, you can see all of the numbers assigned to represent each specific gene/protein. The node at the top left corner of the image above, "Central Node", is named 4383871. The node at the top right of that same image is named 4383868. The node at the bottom left corner of that image is named 4383858, and lastly the node at the bottom right of the image is named 4383947. All of these characteristics and more can be found in the lower section of our Cytoscape network. These seem to be the nodes present in the densest parts of the image I generated.

The IDs above represent different proteins, all under the severe acute respiratory syndrome coronavirus 2 organism. Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is an enveloped, positive-sense, single-stranded RNA virus that causes coronavirus disease 2019 (COVID-19).

The following descriptions of each ID are taken directly from the NCBI repository as well as the BioGrid repository.

4383871

  1. Gene Symbol/Description: ORF7b

  2. Gene Type: Protein Coding

  3. 415 Physical Interactions

  4. Subcellular location: Host membrane

  5. Lineage: Viruses; Riboviria; Orthornavirae; Pisuviricota; Pisoniviricetes; Nidovirales; Cornidovirineae; Coronaviridae; Orthocoronavirinae; Betacoronavirus; Sarbecovirus

  6. Summary: Virus particles include the RNA genetic material and structural proteins needed for invasion of host cells. Once inside the cell the infecting RNA is used to encode structural proteins that make up virus particles, nonstructural proteins that direct virus assembly, transcription, replication and host control and accessory proteins whose function has not been determined. ORF7b encodes a viral accessory protein. Based on its similarity to other coronavirus proteins, ORF7b protein is thought to localize to the Golgi compartment.

4383868

  1. Gene Symbol: ORF3a

  2. Gene Description: ORF3a Protein

  3. Gene Type: Protein Coding

  4. 370 Physical Interactions

  5. Lineage: Viruses; Riboviria; Orthornavirae; Pisuviricota; Pisoniviricetes; Nidovirales; Cornidovirineae; Coronaviridae; Orthocoronavirinae; Betacoronavirus; Sarbecovirus

  6. Summary: Virus particles include the RNA genetic material and structural proteins needed for invasion of host cells. Once inside the cell the infecting RNA is used to encode structural proteins that make up virus particles, nonstructural proteins that direct virus assembly, transcription, replication and host control and accessory proteins whose function has not been determined. ORF3a encodes a viral accessory protein. Based on its similarity to other coronavirus proteins, ORF3a protein is thought to be a protein with ion channel activity (viroporin) that activates the NLRP3 inflammasome. ORF3a may also play a role in virus replication and pathogenesis.

  7. Additional information/function: Forms homotetrameric potassium sensitive ion channels (viroporin) and may modulate virus release. Up-regulates expression of fibrinogen subunits FGA, FGB and FGG in host lung epithelial cells. Induces apoptosis in cell culture. Downregulates the type 1 interferon receptor by inducing serine phosphorylation within the IFN alpha-receptor subunit 1 (IFNAR1) degradation motif and increasing IFNAR1 ubiquitination.

4383858

  1. Gene Symbol: ORF1ab

  2. Gene Description: ORF1a polyprotein; ORF1ab polyprotein

  3. Gene Type: Protein Coding

  4. 30 Physical Interactions

  5. Lineage: Viruses; Riboviria; Orthornavirae; Pisuviricota; Pisoniviricetes; Nidovirales; Cornidovirineae; Coronaviridae; Orthocoronavirinae; Betacoronavirus; Sarbecovirus

  6. Summary: Once inside the cell the infecting RNA is used to encode structural proteins that make up virus particles, nonstructural proteins that direct virus assembly, transcription, replication and host control and accessory proteins whose function has not been determined. ORF1ab, the largest gene, contains overlapping open reading frames that encode polyproteins PP1ab and PP1a. The polyproteins are cleaved to yield 16 nonstructural proteins, NSP1-16. Production of the longer (PP1ab) or shorter protein (PP1a) depends on a -1 ribosomal frameshifting event. The proteins, based on similarity to other coronaviruses, include the papain-like proteinase protein (NSP3), 3C-like proteinase (NSP5), RNA-dependent RNA polymerase (NSP12, RdRp), helicase (NSP13, HEL), endoRNAse (NSP15), 2'-O-Ribose-Methyltransferase (NSP16) and other nonstructural proteins. SARS-CoV-2 nonstructural proteins are responsible for viral transcription, replication, proteolytic processing, suppression of host immune responses and suppression of host gene expression. The RNA-dependent RNA polymerase is a target of antiviral therapies.

4383947

  1. Gene Symbol: sarsp1

  2. Gene Description: replicase polyprotein 1AB

  3. Gene type: protein coding

  4. 36 Physical Interactions

  5. Lineage: Viruses; Riboviria; Orthornavirae; Pisuviricota; Pisoniviricetes; Nidovirales; Cornidovirineae; Coronaviridae; Orthocoronavirinae; Betacoronavirus; Sarbecovirus
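The four entries above can also be kept as a small lookup table keyed by BioGrid ID. This is a minimal sketch: the IDs, gene symbols and interaction counts are copied from the lists above, while the dictionary layout and field names are my own choice for illustration.

```python
# BioGrid ID -> details copied from the four entries listed above.
central_nodes = {
    "4383871": {"symbol": "ORF7b", "physical_interactions": 415},
    "4383868": {"symbol": "ORF3a", "physical_interactions": 370},
    "4383858": {"symbol": "ORF1ab", "physical_interactions": 30},
    "4383947": {"symbol": "sarsp1", "physical_interactions": 36},
}


def describe(biogrid_id):
    """Return a one-line description of a node, looked up by BioGrid ID."""
    entry = central_nodes[biogrid_id]
    return (f"{biogrid_id}: {entry['symbol']} "
            f"({entry['physical_interactions']} physical interactions)")


# Rank the nodes from most to fewest recorded physical interactions.
ranked_ids = sorted(
    central_nodes,
    key=lambda node_id: central_nodes[node_id]["physical_interactions"],
    reverse=True,
)

for node_id in ranked_ids:
    print(describe(node_id))
```

Looking a node up by its ID this way mirrors what I do by hand when I select a node in Cytoscape and read its row in the table.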

Analysis

When I initially looked at our visualization, there wasn't much that I could pull from it; the same goes for our initial Excel data sheet. But now that I have some context for exactly what those nodes are and what they mean, I have a better understanding of why they are so densely represented in our data. I can draw a couple of explanations and conclusions from my newfound data. First, I have discovered that 2 of my 4 nodes are actually not single proteins but polyproteins, and as defined by Webster's dictionary these proteins can be cleaved into separate smaller proteins with different biological functions. In other words, these nodes don't just do one thing, but rather a multitude of things. The following examples illustrate exactly what those sub-proteins are and what they do. Also note that all the descriptions are pulled directly from the UniProt repository.

"Replicase polyprotein 1ab: Multifunctional protein involved in the transcription and replication of viral RNAs. Contains the proteinases responsible for the cleavages of the polyprotein.

Host translation inhibitor nsp1: Inhibits host translation by interacting with the 40S ribosomal subunit. The nsp1-40S ribosome complex further induces an endonucleolytic cleavage near the 5'UTR of host mRNAs, targeting them for degradation. Viral mRNAs are not susceptible to nsp1-mediated endonucleolytic RNA cleavage thanks to the presence of a 5'-end leader sequence and are therefore protected from degradation. By suppressing host gene expression, nsp1 facilitates efficient viral gene expression in infected cells and evasion from host immune response.

Non-structural protein 2: May play a role in the modulation of host cell survival signaling pathway by interacting with host PHB and PHB2. Indeed, these two proteins play a role in maintaining the functional integrity of the mitochondria and protecting cells from various stresses.

Non-structural protein 3: Responsible for the cleavages located at the N-terminus of the replicase polyprotein. In addition, PL-PRO possesses a deubiquitinating/deISGylating activity and processes both 'Lys-48'- and 'Lys-63'-linked polyubiquitin chains from cellular substrates. Participates together with nsp4 in the assembly of virally-induced cytoplasmic double-membrane vesicles necessary for viral replication. Antagonizes innate immune induction of type I interferon by blocking the phosphorylation, dimerization and subsequent nuclear translocation of host IRF3. Prevents also host NF-kappa-B signaling.

Non-structural protein 4: Participates in the assembly of virally-induced cytoplasmic double-membrane vesicles necessary for viral replication. (By similarity)

3C-like proteinase: Cleaves the C-terminus of replicase polyprotein at 11 sites (PubMed:32321856). Recognizes substrates containing the core sequence [ILMVF]-Q-|-[SGACN] (PubMed:32198291, PubMed:32272481). Also able to bind an ADP-ribose-1''-phosphate (ADRP) (By similarity) (PubMed:32198291, PubMed:32272481).

Non-structural protein 6: Plays a role in the initial induction of autophagosomes from host reticulum endoplasmic. Later, limits the expansion of these phagosomes that are no longer able to deliver viral components to lysosomes. (By similarity)

Non-structural protein 7: Plays a role in viral RNA synthesis (PubMed:32358203, PubMed:32277040). Forms a hexadecamer with nsp8 (8 subunits of each) that may participate in viral replication by acting as a primase. Alternatively, may synthesize substantially longer products than oligonucleotide primers (By similarity).

Non-structural protein 8: Plays a role in viral RNA synthesis (PubMed:32358203, PubMed:32277040). Forms a hexadecamer with nsp7 (8 subunits of each) that may participate in viral replication by acting as a primase. Alternatively, may synthesize substantially longer products than oligonucleotide primers (By similarity).

Non-structural protein 9: May participate in viral replication by acting as a ssRNA-binding protein.

Non-structural protein 10: Plays a pivotal role in viral transcription by stimulating both nsp14 3'-5' exoribonuclease and nsp16 2'-O-methyltransferase activities. Therefore plays an essential role in viral mRNAs cap methylation."

All of the above information is quoted directly from UniProt: https://www.uniprot.org/uniprot/P0DTD1

What this means

As I outlined above, the nature of these proteins suggests that they are multifaceted and responsible for many diverse biological phenomena. And as the descriptions above show, many of these proteins play a significant role in the life of the coronavirus. Furthermore, taking a deeper dive into the nature of polyproteins, I learn from the Handbook of Proteolytic Enzymes that they may play a "regulatory role in virus life cycle". And so, one conclusion that I can draw is that the denser the area around a node, the more capable or multifaceted that node is. That is a general interpretation to help us understand the visualizations.

However, that conclusion may be wrong. Two of my chosen nodes do not seem to be polyproteins, so to understand their nature and why they are so dense I looked into their function. One common description is present for those two proteins: "ORF7b encodes a viral accessory protein" and ORF3a also "encodes a viral accessory protein". Each of those nodes does in fact perform a number of things, but one common denominator is that both encode accessory proteins. For more detail on accessory proteins, take a quick glance at the summary for each of those proteins above. So what do I learn? In addition to the discovery about the polyproteins, I can now see that a recurring pattern in our data for the first two proteins is tied to their nature in encoding viral accessory proteins. Through this process they gain a lot of interactions: 4383871 and 4383868 have 415 and 370 physical interactions, respectively, according to the BioGrid repository.

To bring this whole project full circle, I have discovered 2 main things: 1) polyproteins' ability to be cleaved may be a reason why they manage to have such vital roles in our network, and 2) proteins involved in encoding viral accessories seem to be exposed to hundreds of interactions in their process. With both of these observations combined, I can now see that even though my layout might change, as long as I have the same data I can interpret any visualization based on these 2 concepts.
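The comparison behind these two observations can be sketched in a few lines of Python. The interaction counts are the ones listed for each ID above; the grouping into "accessory" and "polyprotein" follows my reading of the summaries, and the data layout is only an illustration.

```python
from statistics import mean

# (BioGrid ID, group, physical interactions) as reported earlier on this page.
nodes = [
    ("4383871", "accessory", 415),    # ORF7b
    ("4383868", "accessory", 370),    # ORF3a
    ("4383858", "polyprotein", 30),   # ORF1ab
    ("4383947", "polyprotein", 36),   # replicase polyprotein 1AB
]


def average_interactions_by_group(records):
    """Average the physical interaction counts within each group."""
    grouped = {}
    for _node_id, group, interactions in records:
        grouped.setdefault(group, []).append(interactions)
    return {group: mean(counts) for group, counts in grouped.items()}


averages = average_interactions_by_group(nodes)
for group, value in averages.items():
    print(group, value)
```

With only four nodes this is a description of my sample rather than a statistical result, but it puts the two groups side by side in one place.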


Cool looking Visualization I also Generated

References

Handbook of Proteolytic Enzymes. ScienceDirect. [accessed 2020 Oct 20]. https://www.sciencedirect.com/book/9780123822192/handbook-of-proteolytic-enzymes

Project BGRID. Result Summary. BioGRID Search for Protein Interactions, Chemical Interactions, and Genetic Interactions. [accessed 2020 Oct 20]. https://thebiogrid.org/4383947/summary/severe-acute-respiratory-syndrome-related-coronavirus/nsp9ab.html

Structure - NCBI. National Center for Biotechnology Information. [accessed 2020 Oct 20]. https://www.ncbi.nlm.nih.gov/structure?LinkName=gene_structure

Bai L, Liu Y, Mu Y, Anwar A, He C, Yan Y, Li Y, Yu X. Heterotrimeric G-Protein γ Subunit CsGG3.2 Positively Regulates the Expression of CBF Genes and Chilling Tolerance in Cucumber. Frontiers. 2018 Mar 29 [accessed 2020 Oct 20]. https://www.frontiersin.org/articles/10.3389/fpls.2018.00488/full



The content of these pages was created by students for students with the help of educators and scientists. The views expressed herein are those of the authors and do not necessarily reflect the views of NSF or ISB.