Role of Conformational Changes in DNA Recognition by Transcription Factors Having Helix-Turn-Helix Motif

DNA-binding proteins play a crucial role in various cellular processes. Some of them have a distinctive structural arrangement, the helix-turn-helix (HTH) motif, which enables them to bind to DNA molecules in a specific manner. Upon binding, conformational changes occur in both the protein and the DNA, so it is essential to quantify and characterise these changes. This work quantifies and characterises the conformational changes in the protein structures by calculating the root-mean-square deviation (RMSD) between the free state and the DNA-bound state of each protein. A database containing a significant number of DNA-bound and unbound pairs was curated. Each unbound structure is pre-processed to remove chains other than those found in its corresponding DNA-bound structure, and it is then superimposed on the DNA-bound structure obtained from the Protein Data Bank (PDB) to calculate the RMSD. After the conformational changes have been quantified using RMSD values, the proteins are categorised into six types based on the observed changes. The primary objective of the study is a comprehensive analysis of conformational changes in DNA-binding proteins and of their correlation with various protein properties. By exploring these correlations, the study looks for patterns or dependencies between the extent of conformational change and the diverse properties of DNA-binding proteins. Such insights can shed light on the functional implications of structural variation in DNA-protein interactions and on how protein dynamics contribute to DNA recognition and binding. In addition, this work attempts to reproduce DNA-bound complexes from the available free (unbound) structures in order to assess the reliability and accuracy of docking methods in predicting and reconstructing complexes between DNA and protein molecules.


Introduction
Transcription factors have specific structural DNA-binding domains that are critical to their specificity and affinity. A DNA-binding domain (DBD) is an independently folded protein domain with at least one structural motif that recognises single-stranded or double-stranded DNA [1]. A DBD may have a general affinity for DNA or may recognise a particular DNA sequence, known as a recognition sequence. Some DNA-binding domains also include nucleic acids in their folded structure [2]. Examples of proteins that bind DNA include nucleases, which cleave DNA molecules; various polymerases; transcription factors, which regulate transcription; and histones, which are essential for chromosomal packing and transcription in the cell nucleus. The two roles of DNA binding, structural and transcription-related, can occasionally overlap. DNA replication, repair, storage and modification (for instance methylation) are regulated by DNA-binding domains whose activities involve DNA structure. Four distinct structural motifs have been proposed for the DNA-binding domains of proteins that help control transcription in eukaryotes: the helix-turn-helix (HTH), two varieties of zinc finger, and the leucine zipper [3]. The helix-turn-helix is a basic structural motif capable of binding DNA. A short amino acid sequence connects the two helices on each monomer, which bind to the major groove of DNA. The four known forms of HTH are di-helical (the simplest), tri-helical, tetra-helical and winged HTH [4]. A recent study focused on the DNA-binding domain of MarR, which helps regulate resistance to different antibiotics and forms a winged helix-turn-helix with an extra alpha helix at the C terminus; this is one variation of the HTH motif [5]. Proteins with HTH binding motifs play a major role not only in transcription but also in DNA repair and regulation, in interactions between proteins in various signalling pathways, and in RNA metabolism. Moreover, the catalytic domains of some enzymes are also known to contain an HTH domain [6].
IJFMR23057331 • Volume 5, Issue 5, September-October 2023 • Email: editor@ijfmr.com
A fundamental question arises: how do DNA-binding proteins achieve precise recognition of their binding sites? Multiple studies propose that both DNA and proteins undergo conformational changes to facilitate the formation of functional complexes. These conformational alterations significantly influence the stability, specificity and cooperativity of interactions within the complex [7]. It is hypothesised that these dynamic changes enable accurate and selective binding between the components of the protein-DNA complex and thereby contribute to the overall functionality of the system.
The primary objective of our investigation is to examine the conformational changes that take place when a helix-turn-helix (HTH) motif binds to the major groove of DNA. Several studies have indicated that, although protein structures in their unbound states may already carry information relevant to their interaction with DNA, the most significant conformational alterations occur when the protein engages specific DNA targets at distinct locations. These changes play a crucial role in determining the protein's ability to recognise and bind its DNA targets effectively. By analysing them, we aim to gain insight into the structural dynamics and molecular mechanisms underlying protein-DNA interactions and to identify the regions and residues of the protein that are involved in recognition and binding. Such knowledge can deepen the understanding of the interplay between proteins and DNA and may aid the development of strategies for targeting DNA-binding proteins in various biological processes. The study is also expected to provide insight into the stability and specificity of the protein-DNA complex.
Furthermore, this study attempts to reproduce DNA-bound complexes from existing unbound structures, thereby evaluating the effectiveness of docking methods [8]. By using the available free structures, we assess how accurately docking methodologies predict and reconstruct DNA-bound complexes, which indicates the feasibility and reliability of docking for studying complex formation between DNA and proteins. Ultimately, this research contributes to the understanding of protein-DNA interactions and supports the development of improved computational tools for investigating them.

Methodology
2.1 Tools used
(a) BlastClust - BlastClust is a program that groups biological sequences, either nucleotide or protein, according to their similarity, and it is frequently used in bioinformatics. The input sequences are supplied in FASTA format. During clustering, each sequence in the dataset is compared with every other sequence using the BLAST algorithm, and the resulting alignment scores are used to assess the degree of similarity between sequences. Sequences are assigned to clusters according to user-defined criteria, usually a percentage identity or similarity threshold, and sequences scoring above the threshold are gathered in one cluster. This allows groups of related sequences to be found in the FASTA dataset.
(b) PyMOL - PyMOL is a widely used and robust molecular visualisation program. It is mainly intended for visualising and analysing three-dimensional molecular structures such as proteins, nucleic acids, ligands and complexes. Its functions include molecular visualisation, molecular editing, structural analysis, molecular dynamics, image rendering for publication, and scripting and automation. Its user-friendly interface, range of features and customisable options make it indispensable for researchers in many branches of structural biology and drug development. It is also used in later stages of this work to align each protein-DNA docked structure with the native complex, which allows the accuracy and reliability of the predicted docked structures to be evaluated against the experimentally determined complexes.
(c) PDBsum - PDBsum is a widely used online database that provides detailed information on and analysis of the protein structures available in the PDB. It offers comprehensive summaries of protein structures, including annotations, secondary structure assignments, ligand binding sites and protein-protein interactions. PDBsum also generates interactive visualisations and schematic diagrams that help in interpreting protein structures. It is often used to explore protein structures, investigate protein-ligand interactions and study the structural and functional characteristics of proteins. In this work, information on the binding sites of DNA-bound proteins was retrieved from this database.
(d) DP-Bind - DP-Bind is a web server and computational tool designed to predict DNA-binding residues in proteins. It uses a machine learning approach on sequence-derived features to identify and classify DNA-binding residues within protein sequences. It is particularly useful for studying protein-DNA interactions, as it can provide insight into binding mechanisms and aid the characterisation of DNA-binding proteins. The protein sequence is submitted to the DP-Bind server in FASTA format to obtain predictions of potential DNA-binding residues.
(e) HADDOCK 2.4 - HADDOCK (High Ambiguity Driven protein-protein DOCKing) is a widely used bioinformatics tool for predicting and modelling the three-dimensional structures of protein-protein complexes. It is also widely used for protein-nucleic acid docking. It employs a flexible docking algorithm that combines information from various sources, such as biochemical and biophysical data, to generate models of the interaction. Its "data-driven" approach derives ambiguous interaction restraints from experimental data or bioinformatics predictions, which allows different binding modes and the flexibility of the interacting molecules to be explored. It is frequently used in structural biology and drug discovery research to study protein-protein interactions and to aid the design of therapeutics targeting protein complexes [9].
(f) AutoDock Vina - AutoDock Vina is a popular molecular docking program used in drug discovery. It predicts how small molecules bind to proteins by estimating their binding energy, which aids drug design.

Database preparation
• The search for proteins containing the DNA-binding helix-turn-helix motif was conducted across three databases: InterPro, CATH and Structural Classification of Proteins (SCOP). Among these, CATH yielded entries for the query "Helix turn Helix", from which the DNA-binding proteins were selected.
• InterPro returned entries for the query "DNA-binding helix turn helix". These entries were downloaded in TSV format, and an advanced search of all the IDs yielded the corresponding PDB entries. Nuclear Magnetic Resonance (NMR) structures and structures with a resolution value greater than 3 Å were filtered out, and the remaining structures were processed further. Some of these proteins were in the free state and others were bound to nucleic acids. Since the work concerns proteins that are bound to DNA and take part in transcription, the DNA-bound entries were selected manually.
• SCOP, however, did not produce any result for the aforementioned queries. This can be attributed to the fact that different databases classify and annotate proteins in different ways, so it is not unusual for the number and type of results to vary between databases.
To find the bound-unbound pairs available for these proteins, the PDB IDs were mapped to UniProt. Several UniProt entries were returned, and only the unique ones were retained for both the protein-DNA complexes and the free proteins. FASTA sequences were fetched for these UniProt IDs and clustered with BlastClust so that similar UniProt entries fell into one group; this step removes the redundancy that would arise if closely related entries were all kept. Two separate lists of clustered UniProt IDs were generated, one for the complexes and one for the free proteins, and their intersection gave the UniProt IDs for which both complex and free structures are available. To obtain the PDB entries associated with each UniProt ID, a Python script was written that produced an Excel sheet listing all PDB entries corresponding to each UniProt ID, for both the protein-DNA complexes and the free proteins. To find the best-matching PDB entries for superimposition, protein BLAST (blastp) was used for sequence alignment. Pairs with sequence identity and query cover both above 95% were taken as the best matches and used for superimposition and RMSD calculation.
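The pairing logic described above (intersecting the UniProt IDs of the bound and free sets, then keeping only PDB pairs whose sequence identity and query cover both exceed 95%) can be sketched as follows. This is a minimal illustration and not the script used in the study: the function names and the tuple layout of an alignment hit are assumptions introduced here.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical layout of one alignment hit between a free and a bound entry:
# (free_pdb_id, bound_pdb_id, percent_identity, percent_query_cover)
Hit = Tuple[str, str, float, float]


def common_uniprot_ids(bound_ids: Set[str], free_ids: Set[str]) -> List[str]:
    """Return the UniProt IDs that have both a DNA-bound and a free structure."""
    return sorted(bound_ids & free_ids)


def best_pairs(hits: Dict[str, List[Hit]], threshold: float = 95.0) -> Dict[str, Hit]:
    """Keep, per UniProt ID, the best hit whose identity and cover exceed threshold."""
    selected: Dict[str, Hit] = {}
    for uniprot_id, candidates in hits.items():
        passing = [h for h in candidates if h[2] > threshold and h[3] > threshold]
        if passing:
            # Rank by identity first, then by query cover.
            selected[uniprot_id] = max(passing, key=lambda h: (h[2], h[3]))
    return selected
```

UniProt IDs with no hit above the threshold are simply left out of the result, which mirrors the reduction from the initial list of candidates to the final set of pairs.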

Data Pre-processing
In this step, data cleaning is performed by eliminating the heteroatoms present in the structure. Additionally, chains in the free protein structure that differ from those present in the DNA-bound protein are removed. This ensures that the subsequent analysis focuses solely on the components and conformations involved in the protein-DNA interaction. Removing extraneous elements and non-matching chains refines the dataset and allows a more accurate and targeted examination of the DNA-bound protein structure and its associated properties.
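The cleaning step can be sketched in plain Python as shown below. This is an illustrative sketch under stated assumptions, not the procedure actually used: the function name is hypothetical, the input is assumed to be in the legacy fixed-column PDB format (chain identifier in column 22), and the set of chains to keep is assumed to have been read from the DNA-bound structure beforehand.

```python
from typing import Iterable, List, Set


def clean_free_structure(pdb_lines: Iterable[str], keep_chains: Set[str]) -> List[str]:
    """Drop HETATM records and coordinate records of chains not in keep_chains."""
    cleaned: List[str] = []
    for line in pdb_lines:
        record = line[:6].strip()
        if record == "HETATM":
            continue  # heteroatoms (waters, ions, ligands) are removed
        # Column 22 (index 21) holds the chain identifier in the PDB format.
        # A TER record that lacks a chain field is dropped as well.
        if record in ("ATOM", "ANISOU", "TER") and line[21:22] not in keep_chains:
            continue
        cleaned.append(line.rstrip("\n"))
    return cleaned
```

All other records (header, remarks and so on) pass through unchanged, so the result can be written back to a file and loaded for superimposition.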

RMSD Calculation
PyMOL was used to calculate the root-mean-square deviation (RMSD) between the DNA-bound protein structure and the corresponding protein structure in the free state. The PDB files of both structures were loaded, and the two structures were superimposed with the align feature, which performs a sequence-based alignment followed by structural superposition and reports the RMSD value.
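For reference, the quantity reported after superposition is the standard RMSD over matched atom pairs. The notation below is introduced here for illustration only: N is the number of matched atom pairs, x_i and y_i are the coordinates of the i-th atom in the free and DNA-bound structures, and R and t are the rotation and translation found by the superposition.

```latex
\mathrm{RMSD} \;=\; \min_{R,\,\mathbf{t}} \sqrt{\frac{1}{N}\sum_{i=1}^{N}
\left\lVert R\,\mathbf{x}_i + \mathbf{t} - \mathbf{y}_i \right\rVert^{2}}
```

Note that PyMOL's align command by default also runs refinement cycles that discard poorly matching atom pairs, so the reported value may be computed over fewer atoms than were initially aligned.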

Docking
The docking process was conducted on a set of rigid proteins, selected on the basis of RMSD values below 2 Å. This criterion was used as an indicator of greater stability, as proteins with smaller RMSD values are typically associated with higher structural certainty. Conversely, proteins with larger RMSD values are considered more uncertain in terms of their conformational integrity.
To assess the quality of the protein-DNA docked structures, the RMSD of each docked complex was calculated relative to the native protein-DNA complex available in the database. This quantifies the structural deviation between the docked structure and the experimentally determined reference complex and thus measures the accuracy and reliability of the docking method. Lower RMSD values indicate closer agreement between the docked and native complexes, whereas larger values indicate significant disparities between the predicted and known structures.
Multiple methods exist for docking DNA onto proteins. First, blind docking explores potential binding sites across the whole protein surface without prior knowledge of specific binding regions. Second, known DNA-binding sites, obtained using PDBsum from protein-DNA complexes available in the PDB, can serve as reference points for the docking process. Third, binding sites can be predicted with tools such as DP-Bind for proteins not bound to DNA, which extends protein-DNA docking beyond experimentally determined complexes.
When the docked protein-DNA complexes were aligned with their respective native complexes from the PDB, a notable issue arose: PyMOL reported RMSD values only for the protein component and disregarded the DNA. A dedicated Python script was therefore developed to compute the RMSD specifically for the DNA segments, so that structural deviations between the docked and native DNA could be analysed and the overall complex conformation evaluated more precisely.
For a total of 84 UniProt IDs, lists of the corresponding PDB IDs were obtained for both the free proteins and the complexes, and sequence alignment with protein BLAST (blastp) was used to select the best-matching PDB pairs. A total of 39 UniProt IDs were finalised, and further studies were performed on these 39 proteins. The table below lists them along with the PDB IDs of the free and DNA-bound structures, respectively, and the type of conformational change each protein undergoes on binding DNA.
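A DNA-only RMSD of the kind mentioned above could be computed along the following lines. This is a minimal sketch and not the script developed in the study. It assumes that the docked and native complexes are already superimposed on their protein parts, that both files use the legacy fixed-column PDB format with the DNA residue names DA, DT, DG and DC, and that the DNA chains share chain identifiers and residue numbering; the function names are hypothetical.

```python
import math
from typing import Dict, Iterable, Tuple

# Residue names used for DNA nucleotides in PDB files.
DNA_RESIDUES = {"DA", "DT", "DG", "DC"}

AtomKey = Tuple[str, int, str]  # (chain ID, residue number, atom name)
Coord = Tuple[float, float, float]


def dna_atoms(pdb_lines: Iterable[str]) -> Dict[AtomKey, Coord]:
    """Collect DNA atom coordinates keyed by chain, residue number and atom name."""
    atoms: Dict[AtomKey, Coord] = {}
    for line in pdb_lines:
        if not line.startswith(("ATOM", "HETATM")):
            continue
        if line[17:20].strip() not in DNA_RESIDUES:
            continue
        key = (line[21], int(line[22:26]), line[12:16].strip())
        atoms[key] = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
    return atoms


def dna_rmsd(docked_lines: Iterable[str], native_lines: Iterable[str]) -> float:
    """RMSD over DNA atoms present in both structures, without any refitting."""
    docked = dna_atoms(docked_lines)
    native = dna_atoms(native_lines)
    shared = sorted(docked.keys() & native.keys())
    if not shared:
        raise ValueError("no matching DNA atoms between the two structures")
    total = sum(
        (a - b) ** 2
        for key in shared
        for a, b in zip(docked[key], native[key])
    )
    return math.sqrt(total / len(shared))
```

Because no refitting is done on the DNA itself, the value reflects how far the docked DNA sits from its native position relative to the protein, which is the quantity of interest when judging a docking pose.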

RMSD and characterisation of conformational changes
This section presents the RMSD values of the superimposed proteins, accompanied by a classification of the type of conformational change observed, as visualised using PyMOL. The RMSD values serve as quantitative measures of the structural dissimilarity between the protein conformations under investigation. Identifying the specific type of conformational change helps to show how the protein structure alters upon binding to a DNA molecule. This information provides insight into the structural dynamics and functional implications of the protein-DNA interaction and into the molecular mechanisms involved.
1. The structure shown here is of the Csp231I C protein. The DNA-bound form is obtained from PDB ID 4JCX (green) and the free form from PDB ID 3LFP (red). It shows negligible conformational change (SS) and a low RMSD value of 0.324 Å, indicating that this protein is rigid in nature.
RMSD values appear to be higher when the numbers of bend and non-bend structures are "somewhat similar" and lower otherwise. Additionally, the percentage of strands (and other structures) shows a negative correlation with the RMSD values. This might be because the strands bind to the minor grooves and fit snugly therein, so that there is no significant conformational change and the RMSD value is relatively low.
The docking experiments show that, if the true DNA-binding site residues of the protein are known, a correct native-like complex can be generated. Sequence-based prediction using DP-Bind did not help, as the predicted binding sites were too far away. Blind docking did not yield any significant results.
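Trend statements such as the negative correlation between the percentage of strands and the RMSD noted above can be made quantitative with a correlation coefficient. The sketch below computes the Pearson coefficient in plain Python; it is offered only as an illustration, since the source does not state which statistic, if any, was used, and the function name is an assumption.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length with at least 2 points")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)
```

Passing the per-protein property values (for example the percentage of strand residues) as x and the corresponding RMSD values as y gives a number between -1 and 1; values near zero correspond to the "no significant trend" cases.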

3. Results and Discussion
3.1 Database outcome
The numbers of bound and unbound pairs retrieved from the InterPro and CATH databases by following the defined procedure are shown below in Figure 1 and Figure 2, respectively.

Figure 1 - Summary of Results from InterPro

3.2

Figure 4 - Pie Chart Showing What Percentage and Number of Proteins Have a Particular Type of Conformational Change

Figure 6 - Superimposed Structure of Csp231I C Protein

Figure 9 - Superimposed Structure of MerR Family Regulator Protein

Figure 12 - RMSD vs Length of Protein
2. RMSD values do not show any significant trend with the DNA length.
Figure 13 - RMSD vs Length of DNA
3. RMSD values do not display any significant trend with the number of arginine residues.
Figure 14 - RMSD vs Number of Arginine Residues
4. RMSD values display an increasing trend with the number of positively charged amino acids.
Figure 15 - RMSD vs Number of Positively Charged Amino Acids
5. RMSD values display an increasing trend with the number of negatively charged amino acids.
Figure 16 - RMSD vs Number of Negatively Charged Amino Acids
6. RMSD values do not display any significant trend with the number of polar amino acid residues.
Figure 17 - RMSD vs Number of Polar Amino Acid Residues
7. RMSD values do not display any significant trend with the number of hydrophobic amino acid residues.
Figure 18 - RMSD vs Number of Hydrophobic Amino Acid Residues
8. RMSD values do not display any significant trend with the number of residues in helical conformation.
Figure 19 - RMSD vs Number of Residues in Helical Conformation
9. RMSD values do not display any significant trend with the number of residues in coil secondary structure.
Figure 20 - RMSD vs Number of Residues in Coiled Secondary Structure
10.
Figure 21 - RMSD vs Number of Residues in Bend Secondary Structure
11. RMSD values display an increasing trend with the percentage of residues in helical conformation (in the unbound structure of protein).
Figure 22 - RMSD vs Percentage of Residues in Helical Conformation
12. RMSD values display an increasing trend with the percentage of residues in coiled secondary structure (in the unbound structure of protein).
Figure 23 - RMSD vs Percentage of Coiled Secondary Structure
13. RMSD values display an increasing trend with the percentage of residues in the bend secondary structure (in the unbound structure of protein).
Figure 24 - RMSD vs Percentage of Residues in the Bend Secondary Structure
14. RMSD values display a decreasing trend with the percentage of residues in strands and other structures (in the unbound structure of protein).
Figure 25 - RMSD vs Percentage of Residues in the Strands and Other Structures

4.5 Docking results
The protein initially not bound to DNA, obtained from PDB ID 6LTZ, is subjected to docking.
Case 1 - The DNA-binding sites are not known and blind docking is performed using AutoDock Vina. The RMSD value is 48.93 Å.

Figure 26 - Aligned Structure of Putative Antitoxin HigA3 when Binding Sites are Unknown

Figure 27 - Aligned Structure of Putative Antitoxin HigA3 when Binding Sites are Predicted

Figure 28 - Aligned Structure of Putative Antitoxin HigA3 when Binding Sites are Known

Table 1 - List of 39 Proteins along with their RMSD Values and Conformational Changes Type