This page describes different features of the web server and provides some theoretical background. For a quick hands-on manual please see Tutorial. Technical documentation of the algorithm is available here.
To start modelling you can go directly to the Request form and send a new request. In the simplest case, when the default dataset and modelling parameters are used all you need to do is to specify a genomic region you are interested in. In a more advanced scenario you may want to play with modelling parameters or to provide your own data.
Shape of the modelled structures is dependent not only on the interactions data itself, but also on the segments boundaries. A most prominent feature of our model are interaction complexes which form characteristic "clumps" of chromatin loops. Simplifying things a bit, each segment will form a separate clump. Thus, if a region consists of only a few segments then it will be represented as a small number of large clumps, as opposed to multi-segment regions, which will consist of a large number of smaller clumps. Additionally, the higher number of segments the higher the chromatin chain flexibility will be. Because of these reasons users are encouraged to inspect segment boundaries before running the simulations. A data browser can be used for visual inspection and possible adjustments.
Our modelling is based on the topological domains and chromatin loops which constitute the basic units on the genome organization. We use a hierarchical approach, in which we first model the low-resolution structure (where every bead corresponds to approximately megabase-sized regions) reflecting the relative placement of the domains, and then we use it to guide the high-resolution modelling in which chromatin loops are shaped (here every bead corresponds to anything from a few kb to tens or hundred of kb long chromatin fiber, depending on the region).
Topological domains are regions with abundant interactions throughout the region and substantially fewer interactions with other regions (ie. loci within a domain tend to interact with each other rather then with loci located in other domains). While domains appear as clearly discernable blocks along the diagonal of a Hi-C or ChIA-PET heatmaps there is no agreement as to their precise definition - different studies report different number of identified domains depending, among the others, on the sequencing coverage and domain calling algorithm.
The term "topologically associating domains" (TADs) is usually used in the context of Hi-C technique, while for ChIA-PET a "chromosome contact domains" (CCDs) is used. CCDs differ from TADs in the way they are defined - TADs are defined based on the Hi-C heatmaps (which aggregates all the Hi-C interactions), and CCDs are based on PET clusters solely. Our algorithm is independent from a specific method and can work for any type of regions, and to make it clear (and to highlight the distinction between the biological domains and the corresponding computational object) we will use a term "segments" to denote domains.
We base our approach on ChIA-PET data, where one can distinguish PET clusters, strong interactions that appear multiple times in the dataset, and which are anchored at the binding peaks, and singletons, which correspond to weak interactions. In principle, one can use any experimental technique that captures the genome-wide interactions between different loci, as long as a distinction between weak and strong interactions can be made. Particularly, Hi-C data can be used - in that case loop calling should be performed and the identified loops would be treated as a strong interactions, whereas the remaining contacts would be treated as weak ones.
The main idea of the algorithm is to identify the segments (we use strong interactions for that, but one can use also weak ones or any other data source) and model their spatial locations using heatmaps obtained from aggregating weak interactions, and then model all segments sequentially based on the strong interactions
You can browse either the default dataset (for that please select cell line and chromosome from the list, and click on the "Load" button), or your own data - simply select a file after clicking on the "Browse" button. The file should be in bedpe-like format described here.
After the dataset is loaded an interaction plot will appear (similar to the one presented on Fig 1. Please note that due to presentation reasons we only display interactions shorter than 5Mb). You can control the plot interactively using a mouse (details below). An important feature of the tool is the possibility to create a segment split, which can be later exported to a text file and used when submitting a request. Segments boundaries are marked with blue rectangles. To add a new boundary left-click anywhere in the boundary markers area. You can click-and-drag boundary markers to move them to a desired location. To remove an existing boundary right-click on it.
Rather than create a split from scratch you can use the automatic splitting tool which makes the split based on the size of gaps between interaction networks and relative segment sizes (this feature only works for uploaded data, for the default dataset our current, manually adjusted split is presented). You are strongly advised to visually inspect the split to make sure it is reasonable.
In the simplest mode to run the simulations one only needs to specify the genomic region. In this case a default dataset (see this FAQ question for details) and modelling parameters are used. For a description of sample results please see Tutorial.
You can run the simulation using your own data. You need to provide 2 files, for strong and weak interactions.
While our modelling was designed specifically for the ChIA-PET in principle you can use any source of interaction frequency data, as long as you can identify weak and strong interactions.
For the interaction files we use a custom data format similar to the bedpe format used by BEDTools (see specification), which was in turn based on the bed format of the UCSC Genome Browser (see specification). The format is as follows:
chr1 start1 end1 chr2 start2 end2 score [factor]where
For example, following is an extract from the default dataset:
chr1 838908 841157 chr1 911279 912011 11 CTCF chr1 838908 841157 chr1 918286 922335 5 CTCF chr1 838908 841157 chr1 997732 1000167 7 CTCF chr1 874878 875243 chr1 884840 885469 4 RNAPII chr1 893388 893956 chr1 901890 902912 4 RNAPII chr1 894575 894962 chr1 993755 994444 4 RNAPII chr1 898120 898390 chr1 916799 917446 4 RNAPII chr1 918286 922335 chr1 1055642 1059032 4 CTCF chr1 918286 922335 chr1 967152 969271 5 CTCF chr1 918286 922335 chr1 997732 1000167 75 CTCF chr1 919232 919815 chr1 999113 999871 4 RNAPII chr1 919433 919993 chr1 927913 928829 7 RNAPII
Segment definition should be stored in a tab-delimited BED file (detailed description available here) containing locations of segment splits. The format is as follows:
chr start endwhere
A sample excerpt from the default dataset (note that here the third column is identical to the second, but this is an arbitrary choice):
chr1 1634152 1634152 chr1 3735966 3735966 chr1 6072764 6072764 chr1 8966211 8966211 chr1 11059337 11059337 chr1 13354769 13354769 chr1 16818853 16818853 chr1 20011431 20011431 chr1 20994644 20994644 chr1 23774789 23774789
For clarity we group the options in sections corresponding to the ones present in the request form.
One of the crucial factors impacting the shape of the resulting structures is the split of genomic region into segments. Ideally, segments should reflect the underlying biological structures - topological domains. While domains are usually identified based on the aggregated contact maps (which would mostly correspond to singletons in case of ChIA-PET data) for our heuristic splits-finding algorithm we use PET clusters alone. This stems from the observation that strong interactions should appear mostly within domains, which was confirmed by our analysis (ie. segments based on PET clusters are very similar to the ones that can be deducted from singleton heatmaps).
In the current version we don't allow for displaying heatmaps in the data browser due to technical reasons. Still, users (specifically those submitting their own data) are encouraged to visually inspect the resulting segments and to compare them with heatmaps generated by other means, if possible.
While there is no single best split for a given data, there are some general guidelines that can be given. First, we prefer approximately megabase-sized segments - this is for both practical and theoretical reasons. Practically speaking, topological domains length is usually of this order, and we want segments to correspond to domains. On the other hand, for the computational reasons it is convenient to have segments of similar length, and that are not very short (one of the reasons is that this makes the heatmap-related biases smaller). Of course, some of domains may be very short or very long. Below we provide a short description how such cases can be handled.
Sometimes PET clusters form clearly separated networks of interactions. More often the gaps between those networks are not so large. Below 2 regions presenting both those cases are presented (note that for clarity only CTCF interactions are showed). Please note that in both the examples segments are approximately 1-2Mb long (with the exception of the last segment in the second figure). Actually, these examples were cherry-picked to show some ideal situation, and most of the genomic regions are much more difficult to divide into segments. In our dataset we often encounter single, separated chromatin loops, or very dense, but short regions. This is presented on the following figure: And to give a more dramatic example: As we mentioned above, we prefer relatively long and equally-sized segments. Thus, rather than creating a separate segment for every visible interaction network (including trivial, single-loop networks) we group multiple networks together. This allows us to obtain more meaningful singleton heatmaps and thus to model more reasonable structures on the low-resolution segment level. Please note while we may group several networks together for the low-resolution modelling stage, on the high-resolution levels (ie. anchors and subanchors levels) we treat them separately (generally, they would form separate "clumps" of loops, please see Modelling overview section for an illustration).
Apart from the segment split 2 parameters have the largest impact on the resulting structures: frequency and genomic distance scales. Simply put, the former controls the size of structure on the segment level, while the latter - the size of genomic loops. To produce reasonable structures these 2 values need to be properly adjusted in relation to each other.
In more details, frequency scaling controls how the interaction frequency between segments (i.e. the number of interactions between them) translates to the physical distance between the corresponding beads. While the related Power parameter controls the rate the distance changes when interaction frequency increases, Scale is simply a value that makes all the distances proportionally smaller or bigger. If the value is too large the segment-level beads will be positioned very far from each other resulting in a structure resembling a number of isolated small clumps (see figure below, left). On the other extremum, if the value is too small the beads will be very close to each other, and the chromatin loops from different segments will strongly overlap, to the point where the whole region looks like one large and dense clump (below, right). Finally, when the parameters are adjusted properly we can obtain a structure that is more consistent with the current models of genome organization (below, center).
We note that in principle small datasets may be problematic for our algorithm, because the low-resolution structure is constructed based on the number of singleton interactions between segments. If the number of singletons is small then some of the segments may have no corresponding singletons, in which case we are not able to properly position the corresponding bead relative to others due to lack of information. Similarly, an input file with only a single PET cluster interaction could be problematic as well, as the resulting plots and 3D structure would not be very informative. For these reasons we require that there is at least 1 PET cluster positioned in every segment and at least 10 singletons anchored at every segment. Please note that it is difficult to define a lower limit guaranteeing that the simulation will be successful, so, if working with sparse datasets or very small regions, please be aware that specific combinations of clusters, singletons and segment boundaries may yield unexpected results. This, however, should not be the case if the real-life experimental data are used.
Please contact us if you have any troubles or questions concerning the method.
Our method makes some assumptions about the chromatin organization and its relevance to the chromosome conformation capture (3C) based techniques, such as ChIA-PET and Hi-C. While there seem to be a general consensus as to the chromatin organization, specifically as to the existence and role of topological domains and chromatin loops, it it yet not clear how well these structures can be described using frequency interaction data.
Of particular problem is the relation between frequency of contacts between domains and spatial distances between them. While there were some functional relations suggested there are no clear experimental evidence for one or the other, on the contrary, it is suggested that different chromosomal regions may exhibit different scalings (the simplest example being eu- and heterochromatin). Currently our method support the inverse power law relationship. It does not currently allow for defining different scalings for different regions, but we are working in this direction. The default parameters were chosen based on our analysis (this will be reported elsewhere) and should work well in a typical case, but they may not be suitable for all the data. Below we present some simple examples of how one can adjust the parameters to produce a reasonable structures, but users are encouraged.
One should also be aware that both ChIA-PET and Hi-C data are population based, ie. the data is generated from millions of cells. Thus, a single model created based on that data does not represent a true conformation exhibited by a particular cell, but rather an average over the whole cell population. There are 2 main ways of dealing with this problem. The first is to treat the single resulting structure as a consensus model that represents the general shape. Thus the structural features need to be interpreted as the general tendencies of the population rather than a strict description of a chromatin. In the second approach one tries to create an ensemble of structures which may better reflect the possible cell subpopulations. Both methods have their advantages and limitations, and the best approach may depend on the ultimate goal of the analysis. In our case we are concerned more with the general landscape of the genome organization, and thus in our method we the consensus approach.
A more detailed discussion of these (and some other issues related to the interaction frequency data) can be found for example in [Lajoie et al, The Hitchhiker’s guide to Hi-C analysis: Practical guidelines, Methods, 2015].
What are singletons and clusters?
Our modelling is based on ChIA-PET data, and the terms singletons and PET clusters denote the interactions that appeared only once (or only a very small number of times) and that appeared very frequently in the dataset obtained from the experiment, respectively. The basic interpretation is that PET clusters represent very confident, strong interactions, and thus can be used to identify chromatin loops, while singletons, which are not so confident (but are much more abundant) can be used to infer the spatial relationship between different contact domains, chromatin loops, or loci on a particular chromatin loops.
Why some of the singletons have high frequency?
In our dataset for an interaction to be classified as a cluster it must fulfill a number of requirements:
Do I need to use ChIA-PET data?
No. While our modelling was designed specifically for the ChIA-PET in principle you can use any source of interaction frequency data, as long as you can identify weak and strong interactions.
Can I use Hi-C data for the simulation?
Yes, you can. You should supply your Hi-C interactions as singletons. To identify clusters you should perform loops calling.
How long should the computations take?
It well depends on the size of the region. Modeling of the sample region, chr1:8966211-11059337, which contains less than 200 clusters and around 16000 singletons takes around 4 minutes, while modeling of chromosome 1 usually takes up to 2 hours. These times were measured when the job queue was empty, so it may take longer if the request needs to wait in a queue. Note that the default settings were used for the calculations, and some of the settings (particularly "Steps" options and enabling the subanchor heatmaps) may impact the computation time significantly.
Note the maximal running time (ie. since the status changed to "Started") is set to 48 hours, after which jobs are killed. Please let us know if your computations take much longer than expected or if you need to increase the time limit.
What is the data format?
Please see Submitting custom data section
What is the default dataset?
By default we use the original dataset that was used in our recent publication (Tang et al,
CTCF-Mediated Human 3D Genome Architecture Reveals Chromatin Topology for Transcription, Cell, 2015).
This dataset contains 30 714 highly confident PET clusters with convergent or tandem CTCF motifs
orientation, 100261 RNAPII clusters and 10 479 226 singletons (including 10 430 481 true
singletons [frequency < 4] and 48 745 interaction with frequency >3, but without a CTCF motif,
an incorrect CTCF motif orientation or without the corresponding cohesin peak).
This dataset is available here.
Is the modeling software available?
Yes, the modeling software we use is publicly available.
Is there any other similar software available?
We are currently aware of only one similar web server, mainly AutoChrom3D, which allows for reconstructing genomic structures from Hi-C data for a number of selected human and mouse cell lines. It also allows to submit user's own data. There are however several similar software packages available for download, for example TADBit, ChromSDE, ShRec3D, BACH, MCMC5C or MOGEN. These packages usually use Hi-C data for the modeling.