Septic Shock


Bioinformatics in the Septic Shock Consortium

 

A. Overview: Project Integration

To achieve the goals of this project, vast amounts of heterogeneous data must be collected, analyzed, warehoused and federated. Our goal is to make the research environment more conducive to using multiple techniques on single systems and to integrate the data from the different Core projects for a system-level understanding. One barrier is the lack of a common user interface to multiple data streams and tools. This limits familiarity with the new and varied experimental methods and makes it difficult to integrate data.

The data integration problem ultimately needs to be solved “at the data source”. Over the five years of the project, we will adopt a common data acquisition and analysis framework that will achieve the goal of “integration at the source” as well as speed the inclusion of data from new sources.

To start quickly, we will negotiate standards for data transmission with our data creators. In the long term, we will deliver data handling tools to the labs that have common interfaces allowing multiple kinds of data to be handled in a more intuitive way, as well as provide better annotation and tracking of the data. Each lab must have a comprehensive Laboratory Information Management System (LIMS) to collect and analyze data. Data will move through the Cores and Bridging Projects within an integrated consortium database (SBEAMS) where it can have different statuses such as “creator private” or “project private”. Private data may be put into the archive to compare with other data to test its validity. Any data without such an annotation is considered “validated” and is transferred to the public Science Database. A “What’s New” page will show recent and expected updates to the database. The same web interfaces, tools and annotation capabilities will be available to Project members and the general scientific community. This goes far beyond the normal web sites for communication and simple data browsing. In addition to simple downloads of flat files and source code, we will provide “portals” for collaboration and “hooks” for data access equivalent to the access internal to our collaboration (§ D.2). This high level of public data access will actively promote external experimentation and data federation efforts.
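
To make the release logic concrete, the short sketch below models the status annotations described above in Python; the class, function and dataset names are hypothetical illustrations, not the actual SBEAMS schema or API.

```python
# Illustrative sketch only: the status names come from the text above, but the
# class and function names here are hypothetical, not SBEAMS's actual API.
from enum import Enum
from dataclasses import dataclass

class ReleaseStatus(Enum):
    CREATOR_PRIVATE = "creator private"   # visible only to the data creator
    PROJECT_PRIVATE = "project private"   # visible to consortium members
    VALIDATED = "validated"               # no private annotation remains

@dataclass
class DataSet:
    name: str
    status: ReleaseStatus

def publishable(ds: DataSet) -> bool:
    """Only datasets whose private annotations have been cleared move on to the
    public Science Database."""
    return ds.status is ReleaseStatus.VALIDATED

datasets = [
    DataSet("array_batch_7", ReleaseStatus.PROJECT_PRIVATE),
    DataSet("icat_run_3", ReleaseStatus.VALIDATED),
]
public_queue = [ds.name for ds in datasets if publishable(ds)]
print(public_queue)  # -> ['icat_run_3']
```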

The data model has a significant impact on future modeling efforts. We will extend the current gene-based systems to connected graphs of sequence data, genes, gene products, etc. The graphs can then represent the global networks of physical interactions or signaling. Within a cell, we explore the activation of the subnetworks that are established by development, stimulated by perturbation or malfunctioning owing to disease.
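
As a minimal illustration of this graph-based data model, the sketch below connects a gene, its product and an interaction partner so that subnetworks can be extracted by ordinary graph operations. The networkx library is used purely as an example (it is not the consortium's chosen toolkit), and the gene names are the familiar yeast examples from Figure 2.

```python
# Sketch only: a graph whose nodes are genes and gene products, with typed
# edges; networkx is used here purely for illustration.
import networkx as nx

g = nx.MultiDiGraph()
g.add_node("GAL1", kind="gene")
g.add_node("Gal1p", kind="protein")
g.add_node("Gal4p", kind="protein")

g.add_edge("GAL1", "Gal1p", relation="encodes")
g.add_edge("Gal4p", "GAL1", relation="protein-DNA")       # directed regulation
g.add_edge("Gal4p", "Gal1p", relation="protein-protein")  # physical interaction

# An "active subnetwork" is then simply an induced subgraph over selected nodes.
active = g.subgraph(["GAL1", "Gal4p"])
print(list(active.edges(data="relation")))
```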

Finally, we must build the tools that allow data to be pulled together, understood and eventually modeled. There are specific tools that will be refined for work with sequence data, transcriptomics and proteomics. These will be integrated to examine gene networks and pathways. These tools should use intuitive graphical interfaces to lower the barrier of accessing and using the heterogeneous data. They will be built toward guiding future experimentation. We are committed to the open source model for our tools. VERA, SAM and Dapple (§B.1) are all currently distributed. The first release of Cytoscape (§C.6) will occur in February 2002 and uses the Twiki environment for collaborative development (§D.2). SBEAMS (§B.3) will also be distributed.

As specifically requested, the locations of the answers to the 8 Bioinformatics Queries are given in §D.4.

B. Preliminary Results

We have made significant progress in: 1) data acquisition, analysis and modeling of microarray data, 2) databases for microarrays and proteomics and 3) integration and graphical display of gene networks/pathways.

B.1 Data Acquisition, Analysis and Modeling of Microarray Data

Dapple (Buhler et al. 2000) isolates and quantifies the fluorescent spots on a microarray. It uses advanced pattern-matching techniques and example-based machine learning to differentiate between accurately and inaccurately isolated spots. It produces quantitative measurements with high confidence and minimal human intervention. Source code is freely available (www.systemsbiology.org/Default.aspx?pagename=proteomicssoftware). QuantArray is another alternative used in this step (see §II.B of the Genomics Core) with different characteristics for exploration and automation.
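
As an illustration of the example-based classification idea (this is not Dapple's actual algorithm; the spot features and training values below are invented), a nearest-neighbor classifier trained on a few hand-labeled spots can flag poorly isolated spots automatically:

```python
# Hypothetical sketch of example-based spot quality classification; the
# features (diameter, circularity, signal/background) and labels are invented
# and do not reflect Dapple's real feature set.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hand-labeled training spots: [diameter_px, circularity, signal/background]
X_train = np.array([[12, 0.95, 8.0],   # good
                    [11, 0.90, 6.5],   # good
                    [20, 0.40, 1.2],   # bad (smeared)
                    [ 5, 0.30, 1.1]])  # bad (dust speck)
y_train = np.array(["good", "good", "bad", "bad"])

# 1-NN is used only because the toy training set is tiny.
clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

new_spots = np.array([[13, 0.92, 7.1], [18, 0.45, 1.4]])
print(clf.predict(new_spots))  # -> ['good' 'bad']
```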

Next, we have routines for filtering, normalization, and merging the results of multiple experiments, followed by SAM and VERA (V. Thorsson; see Ideker et al. 2000c) to quantify experimental variability using multiple measurements and to test for differential expression. Our statistical model describes the multiplicative, additive, and correlative errors influencing a microarray experiment, and model parameters are estimated from the observed intensities for all genes using the method of maximum likelihood (VERA). A generalized likelihood ratio test is performed for each gene to determine whether, under the model, the gene shows significantly different expression between two biological conditions (SAM). This test is both more sensitive and more specific than the naïve criteria of a 1.8x or 2x change in expression. VERA and SAM were written by V. Thorsson and are freely available from the VERA and SAM web site.
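
In equations, one plausible rendering of such a model (a sketch for orientation only; the exact parameterization implemented in VERA is given in Ideker et al. 2000c) is:

```latex
% Sketch of a two-channel error model; the precise form used by VERA is in
% Ideker et al. 2000c.
\begin{align*}
  x_g &= \mu_{x,g}\,(1 + \varepsilon_{x,g}) + \delta_{x,g}, \qquad
  y_g  = \mu_{y,g}\,(1 + \varepsilon_{y,g}) + \delta_{y,g},
\end{align*}
where $(\varepsilon_{x,g},\varepsilon_{y,g})$ are multiplicative and
$(\delta_{x,g},\delta_{y,g})$ additive errors, each pair drawn from a
bivariate normal whose correlation captures the ``correlative'' error between
channels. VERA fits the shared error parameters
$\hat\theta=(\sigma_\varepsilon,\sigma_\delta,\rho_\varepsilon,\rho_\delta)$
to all genes by maximum likelihood; SAM then tests each gene with a
generalized likelihood ratio,
\[
  \lambda_g=\frac{\max_{\mu_{x,g}=\mu_{y,g}} L_g(\mu,\hat\theta)}
                 {\max_{\mu_{x,g},\,\mu_{y,g}} L_g(\mu_{x,g},\mu_{y,g},\hat\theta)},
  \qquad -2\ln\lambda_g \;\sim\; \chi^2_1 \ \text{approximately under } H_0,
\]
so a large $-2\ln\lambda_g$ indicates differential expression.
```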

For proteomics, we have developed tools and web-interfaced software for determining relative expression levels in ICAT experiments (described in the Proteomics Core).

B.2 Visualization of large-scale data and biological networks

Genome-scale datasets often reflect the global state of the cell. For example, we observe global expression changes using DNA microarrays or ICAT-based proteomics, or physical interactions among components. A powerful and intuitive method for visualizing data is a cellular or informational network. Our preliminary work on yeast illustrates the way that we will work with mouse data. Figure 1 illustrates the utility and intuitiveness of graphical modeling approaches. Colored classes of proteins show the clustering of functional groups and illustrate the principle of “guilt by association” in protein networks. The production of Figure 2 involved some basic needs assessment and graphical tool construction together with laborious human arrangement and interpretation of the network. This clarifies what needs to be done with the data to achieve the scientific goals. We are now in a phase of automating this to make it straightforward for any GLUE investigator to do. This preliminary work also shows the utility of yeast data to guide priorities for mouse experiments. We will continue to develop our network exploration and modeling efforts (§C.6).

Fig 1: The network of protein-protein interactions in yeast (Schwikowski et al. 2000, Uetz et al. 2000). The layout used a dynamical algorithm that treats the connections as springs with highly local “degeneracy repulsion” (Fruchterman and Reingold 1991). These are complicated graphs and we are working to display global and local information in more intuitive ways. Exploration of informational networks will be accelerated by providing a direct and interactive interface to a wide variety of analysis tools.

Fig 2: Integration of gene- and protein-expression with a physical-interaction network (Ideker et al. 2000b). Yeast gene expression ratios (growing on galactose vs. not growing on galactose) are superimposed on a network of protein-DNA and protein-protein interactions. Each node represents a gene, and the connections represent protein→DNA (directed edge) or protein-protein (undirected edge) interactions. Nodes are annotated with a corresponding gene name and a cluster number (genes with identical cluster numbers had similar expression patterns over a multitude of perturbation experiments). The gray scale represents the gene expression ratio in yeast cells between the two conditions, with black being a large ratio. Outer circles show the mRNA expression ratio. Inner circles show protein expression ratios given by ICAT. Thus, we have begun the integration of four types of data: mRNA expression, protein expression, protein-protein interactions and protein-DNA interactions.

B.3 Current Databases:

SBEAMS (Systems Biology Experiment Analysis Management System, http://db.systemsbiology.net/projects/local/sbeams.php) is a framework for collecting, storing, and accessing data from microarrays, proteomics and other ISB facilities, but is modular and extensible to other experimental data (lead developer is E. Deutsch). SBEAMS users request experiments from an ISB core facility through a web interface and receive notice when the processed data are available in the database for subsequent analysis and annotation. Remote investigators submit requests that include information about samples requiring transfer. They are then given tracking numbers to use when shipping to remote core facilities.

SBEAMS tracks data acquisition for microarray experiments, from sample preparation through spotting, hybridization, quantitation, and derivation of expression measures, as well as for Proteomics experiments, from sample preparation through MALDI plate spotting, mass spectrometry, sequence database searches, and annotation. This integrated system combines a relational database management system (RDBMS) back end, a collection of tools to store, manage, and query experiment information and results in the RDBMS, a web front end for querying the database and providing integrated access to remote data sources, and an interface to existing programs for clustering and other analysis. Since all data from each step of the experiment are warehoused in a modular schema in the RDBMS, quality control and data analysis tasks are greatly simplified.
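
A toy sketch of this layering is shown below; the table and column names are invented for illustration (they are not the real SBEAMS schema), and sqlite3 stands in for the production RDBMS:

```python
# Minimal, hypothetical sketch of the layered idea described above (RDBMS back
# end + engine-independent access layer); the schema is invented and does not
# reflect the real SBEAMS tables. sqlite3 stands in for the RDBMS.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sample        (sample_id INTEGER PRIMARY KEY, description TEXT);
    CREATE TABLE hybridization (hyb_id INTEGER PRIMARY KEY, sample_id INTEGER,
                                date TEXT);
    CREATE TABLE quantitation  (hyb_id INTEGER, gene TEXT, ratio REAL,
                                quality REAL);
""")
conn.execute("INSERT INTO sample VALUES (1, 'LPS-stimulated macrophages')")
conn.execute("INSERT INTO hybridization VALUES (10, 1, '2002-02-01')")
conn.executemany("INSERT INTO quantitation VALUES (?, ?, ?, ?)",
                 [(10, "Tnf", 3.2, 0.95), (10, "Actb", 1.0, 0.99)])

def run_query(sql, params=()):
    """Engine-independent helper: every tool goes through one access point."""
    return conn.execute(sql, params).fetchall()

# Because every processing step is warehoused, a QC query is a simple join.
print(run_query("""
    SELECT s.description, q.gene, q.ratio
    FROM sample s JOIN hybridization h ON h.sample_id = s.sample_id
                  JOIN quantitation q  ON q.hyb_id    = h.hyb_id
    WHERE q.quality > ?""", (0.9,)))
```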

Each investigator first stores and manages the data unique to his or her experiment, including the annotated history of executing the experiment. An automated pipeline processes the raw data into gene expression measures with data quality estimates, or into protein matches and quality scores. The investigator may use built-in tools or custom scripts to correlate, explore and annotate the experimental results. There are enough tools and data sources that we refer the reader to the URL above for details.

Our group is contributing to new standards as a member of the MGED Group (Microarray Gene Expression Database, sourceforge.net/projects/mged/) and is committed to making SBEAMS compliant with open standards like MIAME (Minimum Information About a Microarray Experiment) and MAGE-ML for data exchange and storage formats. We have built SBEAMS in advance of these standards using clear schemas that can be easily translated. We take the same approach to proteomics, where standards will not arrive for some time; here we follow the standard descriptions of gene products as much as possible. SBEAMS will allow Internet access to the data via a public web front end once they are fully processed and released by the investigators.

Fig 3: A screenshot of the SBEAMS web interface. The upper left window is the main screen inviting the authenticated user to choose modules to begin work (Microarray, Proteomics and some smaller ISB projects). The SBEAMS core module handles user authentication, work group management, permissions management, simplified engine-independent SQL database access API, web form abstraction, tabular data rendering, and much more. One or more additional (experiment/project specific) SBEAMS modules are then invoked after the core module. These modules provide specific functionality to manage and browse microarray, proteomics, etc. experiments.

C. Proposed Research

Information technology has made great changes, but these are trivial compared to what’s ahead. Science information systems are similar to those in libraries where card catalogs have been replaced, but we still walk to the library and remove the only copy of most of the information. In biological information systems, even the card catalog of existing information is incomplete. A system that provides this information and database access to reduced and raw data for reanalysis will be a fundamental tool for research and education in the 21st century.

Information technology advances at a rapid pace: disk storage and computational power double every 12-18 months, compounding to roughly 1,000-fold improvements per decade. However, the number of records in GenBank is increasing at the same exponential rate, because it is driven by the same technology. Scientific data acquisition uses sensors that are chips with millions of electronic elements and high-speed interconnects, so data rates increase as rapidly as computing power. Scientific analysis systems are based on finding relations, so the computational workload often increases as the square of the data rate, outstripping the hardware advance of computers at a frightening rate and requiring us to think carefully about how our systems scale. Biology is still in a transition stage where technology is being rapidly incorporated, creating an even more rapid advance.

A key feature of exponentials is that they integrate to exponentials. Hence, half of all the scientific data ever collected are just a few years old. In the next 5 years, we will collect ten times more data than we have to date. It’s imperative to look a few years out to put together systems to manage and integrate the data at the source. This will be the focus of the informatics core. From the very start, we will be “moving upstream” to create an integrated system at the source of data production. Over the 5 years of this project, this will be a significant and broad contribution to biomedical research.

C.1 Data Acquisition, Analysis and Management

To achieve our goals, vast amounts of heterogeneous data must be collected, analyzed, integrated, visualized and modeled. Biologists are just beginning to cope with large amounts of data and with the use of multiple techniques on single systems, which requires integrating the data for a larger understanding. In other fields, the concern is that existing paradigms will not scale to Petabyte data sets. In biology, the data are extremely diverse, but the raw data in our projects are unlikely to exceed tens of Terabytes (TB) and the science archive system should remain below a few TB. As a result, we can use proven techniques to handle our own data. The Cores generating the largest quantity of data are Genomics and Proteomics, both co-located with the Informatics team. To be more specific, all of the operational proteomic data will be ~10 TB, while the data that will normally be queried to examine expression will be tens of Gigabytes (GB). Similarly, all of the raw microarray data (scanned images) from Genomics will be ~200 GB, while the quantitation of spots reduces this to just a few GB. This is roughly 3-5 times the data currently being managed in these areas. The additional data from Genomics (SNPs and sequencing) come from well-established techniques at ISB, and modules are being developed for SBEAMS in collaboration with other ongoing projects (§I.B of the Genomics Core). The lead for SNP processing is K. Deutsch; for sequencing, it is M. Whiting for databases and G. Glusman for analysis. The primary need of the Biological Reagents Core is to track physical resources and to characterize them through links to gene-based data including function, homology and structure. The results of prior experiments with reagents must be accessible, and this information must also be reachable from the data records in Genomics and Proteomics. The Forward Genetics and Animal Model Cores will take a relatively small amount of data on mouse mutants and then link to the larger volumes of data collected by the Proteomics and Genomics Cores. Clearly, computer equipment purchases will largely be driven by the proteomic data.

We will balance the need to build a long-term solution without impeding short-term progress. From the start, “data delivery” to the central database will use XML, employing emerging standards where available. While this simple step gets warm nods, we expect backpedaling on implementation. A typical example is the researcher who says: “We use Jack Straw’s Excel file format; this is the standard in the field. We don’t want to switch to that XML stuff.” Here, we stress the importance of encapsulation to our approach. That is, practices within a lab are not dictated, but when it comes time to pull data together, the output needs to be standardized and self-describing, e.g. XML-based. As the project evolves, SBEAMS will become a full data acquisition Framework with modules for each type of experiment that simplify the process of data collection (a “carrot” approach to standards).
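
For illustration, the sketch below turns a tab-delimited lab export into a minimal self-describing XML document; the element names are invented for this example and are not the consortium schema (nor MAGE-ML):

```python
# Illustrative only: converts a tab-delimited lab export into a minimal
# self-describing XML document. The element names are invented for this sketch.
import csv, io
import xml.etree.ElementTree as ET

lab_export = io.StringIO("gene\tratio\tquality\nTnf\t3.2\t0.95\nActb\t1.0\t0.99\n")

root = ET.Element("expression_set", attrib={
    "lab": "example_lab", "protocol": "two_color_array", "units": "ratio"})
for row in csv.DictReader(lab_export, delimiter="\t"):
    m = ET.SubElement(root, "measurement", attrib={"gene": row["gene"]})
    ET.SubElement(m, "ratio").text = row["ratio"]
    ET.SubElement(m, "quality").text = row["quality"]

print(ET.tostring(root, encoding="unicode"))
```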

C.2 Integration via a Data Acquisition/Analysis Framework

Wherever possible, we will adopt and extend proven standards and systems from other fields. In general, biological data have greater commonality with other scientific data than with the commercial data collected by Walmart. The main difference is that one normally executes scientific queries over a dataset that is much smaller than the raw data. An example is the set of genes that show significant changes on a microarray versus the original microarray image. A user might want to return to the original array for a reanalysis, but it would live in a distinct archive to maximize the performance of routine queries that do not require reanalysis. We will borrow heavily from data-intensive projects in Earth science, astronomy and particle physics. Lake has been involved in collaborative projects with scientists in each of these areas and continues to serve on NASA advisory committees for the management of earth and space science data, providing good ties to these efforts. In some cases, tools and modules can be transferred. In most cases, we simply speed development by understanding the lessons learned from past successes and failures in other fields.

To integrate data at the source, we need a flexible and modular data acquisition and analysis framework. There are good examples in other scientific fields, such as the ROOT Framework in particle physics ( Rademakers and Brun 1998 ) that is used for applications in physics, financial analysis, neural networks and pharmaceuticals. ROOT shows the value of a Framework, but it is used in physics labs where the general level of computing knowledge is advanced and the support system is strong. It shows the value of building a culture of shared open source tools. However, we will use a core system that has commercial strength and support.

The current “enterprise solutions” are problematic. Our GLUE consortium members have not requested the use or support of any of these systems. In a project this size, we might accommodate the license costs across our consortium, but doing so could create a barrier for community use of our data. In addition to being expensive, the systems are all proprietary: source code is not available and scripting languages are limited, although the APIs are well enough defined that our tools could be adapted to these systems. The available packages are limited in scope yet complex; they take a few person-months to install, require expensive additional software (in both licensing and support), and some must be run on particularly expensive hardware. Finally, most of the current “enterprise solutions” address some problems in analysis and persistent storage, but do little for data acquisition.

The design goals that we have articulated are also those being promoted by IBM with their Life Sciences Framework product (components adopted from their proven eCommerce Framework). Very little of this product currently exists, but we are negotiating a strategic alliance to participate in its development. Should this be slow to materialize, we will move forward with SBEAMS, maintaining flexible design goals to enable a later migration of our tools to a dominant platform in biological information management.

C.3 The Database

Our current approach to databases has been to build tools that work with multiple data management systems by using standard SQL, Perl and Java. We have databases that run under both MySQL and Microsoft SQL Server. As part of our IBM collaboration, we will shift to DB2, which is free to university researchers. This is critical for educational use, where license costs can be prohibitive. DB2 is well suited to scientific data, where the raw or processed data (e.g., an image or a spectrum) are much larger than the metadata tags.

In data analysis, there are standard operations for data reduction and preparation as well as exploratory operations. The former are straightforward to optimize, but the latter pose interesting challenges. Generally, scientists explore the multi-dimensional properties of the data with no two queries being exactly alike. They will start with small queries of limited scope, then gradually make their queries more complex but on a hit-or-miss basis. Several styles of queries need to be supported including manual browsing (look at objects one-by-one, and manually/interactively explore their properties) as well as sweeping searches with complex constraints and cross-identifications with other datasets.

However, we propose to augment these with access modes that enable much more efficient exploration. Examples include: (1) find all objects similar to one set of objects and dissimilar from another set of objects, and report statistics on such data; (2) construct a model to distinguish one training set of objects from another, then go through the entire catalog and report the objects that fit the model; or (3) cluster the data and provide summaries of the clusters, allowing the user to drill down further into each cluster and so forth. These queries are not straightforward to express in SQL; in fact, it is often very difficult for users to define them. Such data-driven queries are fundamental to data mining systems. These functions are of particular interest to the Proteomics Core, where new methods of assessing the reliability of protein identification are being explored.
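
As a sketch of query style (3), the fragment below clusters a stand-in expression matrix, summarizes each cluster and then drills into one of them; it is purely illustrative and uses generic libraries rather than the project's own tools:

```python
# A sketch of query style (3) above: cluster, summarize, then drill down.
# Purely illustrative; the data are simulated and none of this is project code.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in data: 300 genes x 5 conditions of log expression ratios.
X = rng.normal(size=(300, 5))

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Summary pass: size and mean profile of each cluster.
for k in range(km.n_clusters):
    members = np.where(km.labels_ == k)[0]
    print(f"cluster {k}: {len(members)} genes, "
          f"mean profile {np.round(km.cluster_centers_[k], 2)}")

# Drill-down pass: inspect the members of one interesting cluster.
focus = 2
print("genes in cluster", focus, "->", np.where(km.labels_ == focus)[0][:10])
```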

Biologists often analyze their data in collaboration with statisticians and scientists from other fields. Wherever possible, we will use tools that are familiar to their likely collaborators. As an example, we will bind “R” (www.r-project.org), the GNU implementation of the S language underlying the popular statistics package S-PLUS (Chambers 1998), to the analysis component of our Framework.
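
One simple way such a binding might look (a sketch under the assumption that R is installed and Rscript is on the PATH, not the project's actual mechanism) is to hand data from the analysis layer to R and read back the result:

```python
# Sketch only: pass a vector of expression ratios to R and read back a p-value.
# Assumes R is installed so that the "Rscript" executable is on the PATH.
import subprocess

ratios = [3.2, 2.8, 3.5, 1.1, 0.9, 1.0]
r_vector = "c(" + ",".join(str(v) for v in ratios) + ")"
r_code = f"x <- {r_vector}; cat(t.test(x[1:3], x[4:6])$p.value)"

result = subprocess.run(["Rscript", "-e", r_code],
                        capture_output=True, text=True, check=True)
print("p-value from R:", float(result.stdout))
```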

C.4 Data Integration

Our collaboration will focus on integrating data by providing consistent schema and building tools for data acquisition and analysis. Data generated by our project will be transmitted to a central source and maintained in a relational database. However, there will always be foreign data that must be integrated to achieve our scientific goals. A minimal list includes YPD, COGS, BIND, DIP, MIPS, GO and GENBANK.

As with most database issues, there are two different approaches to integrating key datasets NOT generated by our consortia: warehousing versus federation. We take a mixed approach in which we warehouse all our project-generated data and some key datasets that can be wholly transferred, but we also recognize the importance of some federation for three reasons: 1) greater flexibility, 2) some commercial data sources such as YPD and Celera cannot be sucked into a warehouse (though the similar databases SGD and the public human genome do not have that constraint) and 3) data federation is evolving to a Peer to Peer (P2P) model with “automated registration” (this jargon basically means that biological data could eventually become like “Napster”). Once we have accepted some federation in our approach, we again confront a choice between a relational model and an object or semi-structured one (again, we are using a relational system for our Project-generated data). Neither approach is proven. To date, successful systems have all been built on the warehouse model (guiding us to warehouse wherever possible), but the assertion of ownership for key data is forcing the issue of federation. We have strong ties with the leading groups in each approach to federation. In the relational area, IBM has the product Discovery Link, based on the Garlic research project at Almaden. Our University of Washington collaborator Alon Halevy has built Tukwila (Ives et al. 2000b, data.cs.washington.edu/integration/tukwila), which is the basis for the data integration company Nimble (www.nimble.com). IBM’s Life Sciences Framework blurs, or outright confuses, some of these differences: it is XML-based, yet built on Discovery Link, which is not. It is important, but still groundbreaking, to make one of these federation approaches work.
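
The sketch below conveys the wrapper/mediator idea behind federation in miniature: every source, whether the local warehouse or a remote service, sits behind the same small query interface so results can be merged. All class and gene names are hypothetical; this is not Discovery Link, Tukwila or SBEAMS code.

```python
# Toy sketch of data federation via uniform wrappers; all names are invented.
from typing import Protocol

class Source(Protocol):
    def find_interactions(self, gene: str) -> list: ...

class WarehouseSource:
    """Project data already loaded into the local relational warehouse."""
    def __init__(self, edges): self.edges = edges
    def find_interactions(self, gene):
        return [e for e in self.edges if gene in e]

class RemoteSource:
    """Stand-in for a federated source queried over the network (e.g. through
    a wrapper that speaks XML); here it just wraps a dictionary."""
    def __init__(self, table): self.table = table
    def find_interactions(self, gene):
        return [(gene, partner) for partner in self.table.get(gene, [])]

def federated_query(gene, sources):
    seen = set()
    for src in sources:
        seen.update(src.find_interactions(gene))
    return sorted(seen)

sources = [WarehouseSource([("Tnf", "Traf2")]),
           RemoteSource({"Tnf": ["Tnfrsf1a"]})]
print(federated_query("Tnf", sources))
```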

C.5 Analysis and Modeling

In building a data system, we choose abstractions. Current data systems are mostly repositories for one or more model systems that are “gene based” in their organization. There are many who stress the importance of building cell-based systems. We want a system that captures the information with an appropriate structure and builds connections that are valuable for modeling the system, its responses to perturbations and its faults that result in disease. Currently, we view the interaction network as the most appropriate. The global network as shown in Figure 1 is the representation of the model organism. A cell state is characterized by its active subnetworks (as determined by expression data and genetic studies). Perturbations and disease are examined as changes in the activity of subnetworks. The bridging project on the Modeling of the Innate Immune Response will test and extend this approach. We will clearly document this Bridging Project to demonstrate the approach and provide training materials for the web site.

Several of the Core staff will focus on needs assessment, building tools for the other cores and aiding in the creation and execution of data pipelines. N. Goodman, E. Deutsch, K. Deutsch and M. Whiting have proven that they are extremely capable in these areas. In the case of proteomics, the Informatics core staff has worked closely with J. Eng on database issues, pipelining and automation, but he writes the core modules of the analysis software, which are the standard-bearers for the field. For microarrays, the staff in the Informatics core is responsible for building the data pipeline, analysis modules and the database. In all cases, laboratory personnel are responsible for executing the pipelines to clean and analyze the data. Any other procedure would be inconsistent with quality control.

The Informatics core will work with Genomics and Proteomics to examine global variation in “reference states”. The Rosetta compendium (Hughes et al. 2000) found that the reference wild-type state showed significant variation in 55 genes (mostly respiration and metabolism) across their 63 control experiments, compared to a total of 288 genes that varied over their 300 deletion experiments (most of the 55 are included in the 288). We will run 30 control arrays during the first year and request a control run of 3 arrays each month thereafter to monitor any possible drifts. Since ICAT has greater precision as well as greater cost, we will run 5 controls during the first 6 months and then one each quarter thereafter. Variance in these control sets will be interesting to the broader mouse model community and will provide early datasets to exercise our tool development.
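
A minimal sketch of this kind of drift monitoring is shown below; the simulated data, the variability statistic and the 3x-median threshold are all invented for the example and are not project policy:

```python
# Illustrative sketch of drift monitoring over control arrays; the data are
# simulated and the flagging rule is invented for the example.
import numpy as np

rng = np.random.default_rng(1)
# Stand-in control data: 30 control arrays x 1000 genes of log2 ratios that
# should be near zero in a stable reference state.
controls = rng.normal(loc=0.0, scale=0.15, size=(30, 1000))
controls[:, :5] += np.linspace(0, 3, 30)[:, None]   # simulate 5 drifting genes

gene_sd = controls.std(axis=0)
drifting = np.where(gene_sd > 3 * np.median(gene_sd))[0]
print(f"{len(drifting)} genes exceed 3x the median control variability:",
      drifting[:10])
```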

C.6 Visualization

We need a visualization system that serves as an advanced query system and actively connects to tools that are executing. This has many applications: “visual debugging” uses the human eye to discover subtle patterns that betray errors and guide solutions, and “visual development” provides insights for complex problems. Finally, “visual querying” integrates visualization with the query engine for scalable visualization and data exploration. Users can roam through large datasets without loading all of the data into memory. We imagine that most users will interact with the system through the visual interface. Presenting scientific data in an intuitive and appealing fashion goes beyond the mathematical and computer graphics aspects. We will work with a graphic artist to improve our visual interface designs and explore new ways of presenting data.

Visualization of large scientific data poses a significant challenge, both technical and conceptual. Correlations in the data may have a dimensionality > 3, and the graphical networks shown in Figures 1 and 2 are clearly extremely complex. There is currently no clear and intuitive way to present and navigate such data. The best current method is Kohonen nets, a.k.a. self-organizing maps (Kohonen 1990, Tamayo et al. 1999).


C.7 Network Exploration and Modeling

Cytoscape is software developed jointly by P. Shannon, A. Markiel, B. Schwikowski (ISB) and T. Ideker (Whitehead Institute) for the display, exploration and analysis of gene networks/pathways. The scaffold of the program is the graphical representation of the network (shown to the right): a collection of nodes and edges with attached attributes and measurements to represent a biological system. One can selectively display all the genes involved in phosphorylation (or any term in a linked database), or color all the genes as a function of their expression level in microarray experiments. This is done in a variety of visual styles.

The program uses the physical interactions (protein-protein and protein-DNA) in the yeast genome and mRNA expression data to calculate "active transcriptional paths”. The search is guided by a sophisticated scoring system that uses the probability of differential expression (currently calculated with SAM) to score the significance of a suggested subnetwork. The adjacent figure shows the results from a run of this module, giving the size, score, and set of conditions for each of the top scoring subnetworks. While this method does not directly extract a signaling pathway, it can be used to identify candidate genes that are involved in a pathway but do not show change at the transcriptional level. For example, a regulatory protein that itself does not change in expression would be implicated through its connections to the genes it regulates.
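
For orientation, the fragment below shows one standard way to aggregate per-gene significance into a size-normalized subnetwork score; it is a sketch, not the exact scoring function implemented in the active-paths module, whose details appear in the accompanying publication:

```python
# A sketch of aggregating per-gene significance into a subnetwork score using
# standard z-score aggregation; not the exact scoring function shipped with
# Cytoscape's active-paths module.
from scipy.stats import norm

def subnetwork_score(p_values):
    """Convert each gene's probability of differential expression into a
    z-score and combine them, normalizing by subnetwork size so that scores of
    different-sized subnetworks are comparable."""
    z = [norm.ppf(1.0 - p) for p in p_values]
    return sum(z) / len(z) ** 0.5

# Invented p-values: the regulator Gal3 barely changes at the mRNA level but is
# pulled into the subnetwork through its connections.
candidate = {"Gal4": 0.001, "Gal80": 0.002, "Gal3": 0.40}
print(round(subnetwork_score(candidate.values()), 2))
```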

The window below shows the genes in the highest scoring component, with links to their GO classifications (Ashburner et al. 2000) and their raw expression data. A more extensive protein-mRNA software tool will be developed that includes simultaneous clustering of protein and mRNA data, statistical assessment of protein-mRNA relations on subsets of data (e.g., genes of a given cellular role, or localization).

The main program is written in Java, while the active subnetworks code is C++ integrated with Cytoscape via a JNI wrapper. The program uses the graph libraries yFiles and LEDA, which are inexpensive for academics and may be freely distributed to other academics within a compiled binary. The computational techniques will be presented in February 2002, accompanied by a software release (via a TWiki site) that includes source, binaries (with yFiles and LEDA), documentation, a help system, and a suite of tests in accordance with accepted software engineering practices. We are actively working with both internal and external collaborators to bring in new kinds of data, and we are also developing new analysis modules. For example, later releases of Cytoscape will include network modeling algorithms currently in development in the Modeling Bridging Project.

D. Project Organization, Facilities, Community Delivery

D.1 Coordination with other Core and Bridging Projects:

Each core and bridging project will be required to designate a “Data Liaison” to the Informatics core, and a member of the Bioinformatics Team will be designated as a Liaison to each of the Cores. They will serve as single points of contact for coordination. They will be responsible for answering or forwarding questions from investigators and will participate in schema design, data delivery plans, and needs assessments for tools that enable the scientists to use the data most productively. The Liaisons will also participate in the development and dissemination of new training materials (§D.2). That is, they will play a critical role in providing content for the web site. The travel budget of this Core reflects the need to bring the Liaisons together for regular meetings.

D.2 Advanced Technology for Education, Collaboration and Data Delivery

There are a variety of new technologies that we will use to enhance collaboration and deliver educational materials, results and data to the community. The use of simple web sites continues to be highly effective. Ours will be www.septicshock.org. Beyond universal web site access, we will maintain TWiki (twiki.sourceforge.net) collaboration sites that require registration and permit changes that are tracked under revision control. This provides a simple, but effective, way to build study groups and facilitate remote collaboration. We have built several sites, the most recent one being for the Cytoscape effort. We will link digital streaming video training materials and PowerPoint presentations to illustrate the use of the data and software products (both prepared training materials and seminars that illustrate the scientific context of their use). We devote 0.5 FTE to managing the web site. This is sufficient because all other informatics staff and the Data Liaisons will be involved in the labor-intensive task of producing the content.

Our Project is committed to a policy of complete and rapid public release of all data and technologies (Rowen et al. 2000). With the release of software products comes the difficulty of supporting a variety of platforms, each with its own minor differences. The software produced by Lake’s astrophysics group is used at over 100 sites in 19 countries. They set up scripts to automatically configure the software using the GNU autoconf utility. There are similar tools for bug tracking and a “FAQ-o-matic” within TWiki.

We will provide a simple web interface to the data so that users can browse the available data and perform simple queries. However, web-based systems are too often “gates” rather than “portals”: they limit what can be done with data and often restrict data transfer. To avoid this, we will support SQL-based queries as well as XML transport. Remote users will then be able to write and store scripts as well as to build their own locally annotated data subsets. They can also create wrappers to federate our data with others of interest. We will also permit downloading our data for warehousing by others.

D.3 Infrastructure Built on Scaleable Servers and National Resources

ISB is committed to a state-of-the-art computing system, a program of experimentation with new technology and the use of high-end national resources. ISB is a member of the Pacific NW GigaPOP Consortia and connects to Internet 2 via dedicated fiber and gigabit ethernet.

We have a strategic alliance with the Arctic Region Supercomputing Center (ARSC). Lake has been the largest user at this center as well as one of the top users at the national centers NPACI, NCSA and PSC. ARSC operates differently from many of the other national centers, focusing on a smaller number of extremely demanding projects. ARSC also connects to the NW GigaPOP, providing excellent transfer speeds between the two sites. The main use of their supercomputing facilities by the GLUE Consortia will be for Proteomic Data processing and analysis (see the section with this title in the Proteomics Core). The port of SEQUEST to the ARSC architectures has begun in collaboration with Tom Baring of their staff. Baring is also our liaison for porting sequence analysis tools there (work being carried out by J. Roach, G. Glusman and K. Deutsch). Our alliance with ARSC also provides us with access to deep technical expertise in large archives, high-end visualization and parallel computing. At ISB, our team understands how to use commodity hardware to solve complex tasks. Lake is Project Scientist of the NASA project that spawned the Beowulf approach to commodity supercomputing, and his group built the first Beowulf using Compaq Alpha CPUs. ISB currently operates a large Beowulf with mixed CPUs. The commodity clusters have been and will remain important to processing Proteomic Data. Generally, we will use commodity clusters and cheap EIDE disk for data acquisition and analysis. The operational system will use machines designed for higher throughput with large numbers of users.

Large data set access is primarily I/O limited. There will be queries for which all the data must be scanned linearly, even with the best indexing scheme. Acceptable I/O performance can be achieved with expensive, ultra-fast storage systems or with many commodity servers in parallel. The latter approach is scalable for a well-designed system. In other collaborations, we have achieved 1 GB/sec I/O rates using commodity hardware.

D.4 Location of Answers to Bioinformatics Queries:

1.      What are the data release policies and what are the associated intellectual property issues? (Executive Summary, §A, §D.2)

2.      How will the data be available to the scientific community? (§A, D.2)

3.      What is the nature and structure of the data? (§B.3, §C.3, SBEAMS website referenced in §B.3)

4.      What is the structure of the database? (§B.3, §C.3, SBEAMS website referenced in §B.3)

5.      What is the mechanism of communication between the sites and database managers? (§D.1, §D.2)

6.      What are the key interacting databases? How will the data be linked? (§C.4; note that the response is different for key datasets generated by our project versus those that we do not control.)

7.      How will progress be available to the public? (§D.2)

8.      What experience in bioinformatics is available to your group (Personnel description in Budget justification, Biosketches, §C.4, §D.3) and what resources can you draw on? (§D.3)


References

Ashburner, M. et al. 2000. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25: 25-9.

Buhler, J., Ideker, T., and Haynor, D. 2000. Dapple: an improved method for processing DNA microarrays. Technical Report. University of Washington, Seattle, WA 98195-2350.

Chambers, J.M. 1998. Programming with Data. Springer, New York. (http://cm.bell-labs.com/cm/ms/departments/sia/Sbook).

Fruchterman, T. and Reingold, E. 1991. Graph drawing by force-directed placement. Softw. Pract. Exp. 21: 1129-1164.

Gray, J. 1999. Turing Award Lecture: What Next? A dozen remaining IT problems. http://research.microsoft.com/~Gray/. presented at the ACM Federated Research Computer Conference in Atlanta.

Hughes, T. R. et al. 2000. Functional Discovery via a Compendium of Expression Profiles, Cell, 102: 109-26.

Ideker, T.E., Thorsson, V., Ranish, J.A., Christmas, R., Buhler, J., Bumgarner, R., Aebersold, R., and Hood, L. 2000b. Integrated genomic and proteomic analysis of a systematically perturbed metabolic network. Science, 292, 4.

Ideker, T.E., Thorsson, V., Siegel, A., and Hood, L.E. 2000c. Testing for differentially expressed genes by maximum likelihood analysis of microarray data. Journal of Computational Biology, 7, 805.

Ives, Z.G., Halevy, A., and Weld, D.S. 2000a. Efficient Evaluation of Regular Path Expressions on Streaming XML Data. Technical Report. University of Washington, Seattle, WA 98195.

Ives, Z.G., Halevy, A., Weld, D.S., Florescu, D., and Friedman, M. 2000b. Adaptive Query Processing for Internet Applications. IEEE Data Engineering Bulletin 23.

Kohonen, T. 1990. The Self-Organizing Map. Proceedings of the IEEE 78: 1464-1479.

Pottinger, R. and Halevy, A. 2000. A Scalable Algorithm for Answering Queries Using Views. In Int. Conf. on Very Large Data Bases (VLDB).

Rademakers, R. and Brun, R. 1998. ROOT - An Object Oriented Data Analysis Framework. Linux Journal 51: 27.

Rowen, L., Wong, G.K., Lane, R.P., and Hood, L. 2000. Intellectual property. Publication rights in the era of open data release policies. Science 289: 1881.

Schwikowski, B., Uetz, P., and Fields, S. 2000. A network of protein-protein interactions in yeast. Nature Biotechnology 18.

Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E.S., and Golub, T.R. 1999. Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc Natl Acad Sci U S A 96: 2907-12.

Uetz, P. et al. 2000. A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 403: 623-7.