Hyperdatabases: An Infrastructure for the Information Space

Hans-Jörg Schek

What limits do database systems reach these days?

Schek: Perhaps “limits” is not really the correct term. The question actually should be whether database systems can still serve as the central infrastructure for the development of information systems. About 20 years ago, relational databases replaced the first generation of database systems. At the time, the goal was to build a better platform for the development of data-intensive applications. SQL was even considered as a language for end users, a language they could use for direct, ad hoc queries of the database. Relational databases and SQL were intended to create an interface that supported decision making for important strategies.

Today, we find ourselves in a network that holds an unmanageable amount of information. We talk about an information space that offers limitless sources of information, consumers of information, and connections between them. Databases continue to play an important, although less prominent, role in an information space. They appear “only” in the role of reliable, high-performance storage servers that can exist at any point in the information space and perform their duties there. But for some 10 years now, databases have no longer been the linchpin and infrastructure for the development of distributed applications.

What potential for development remains in functionality, scalability, performance, availability, or usability?

Schek: For this question, let’s assume that the database system plays the role of a reliable storage server, as is the case in the architecture of SAP R/3. Despite the high standard reached by commercial systems, there are still exciting questions. I’d like to mention two. First, we want the ability to scale out, and second, we need database systems that optimize themselves. The scale-out characteristic was called for a few years ago by Jim Gray and differs from the usual scale-up characteristic. It means that when we operate a database server on a computer cluster built from standard hardware and software components, we achieve better performance simply by increasing the size of the cluster. The second topic, self-optimizing databases, has been a dream for a long time, but is becoming a necessity these days given the growing complexity of systems. Particularly in the context of database clusters, the question arises of optimal data distribution for replication and partitioning, followed by the question of query optimization. The demand for scale-out scalability increasingly requires automatic self-configuration, such as that called for in IBM’s Autonomic Computing Program.

You’re researching hyperdatabases right now. What are they exactly, and what opportunities and advantages do they offer?

Schek: I’d like to offer two definitions. The first defines a hyperdatabase as a database over databases. The second states that a hyperdatabase is a basic software layer available on every computer in an information space, much like the TCP/IP network layer. With the first definition – a database over databases – we’re less interested in the administration of the actual data. We’re much more interested in how we can manage, use, and combine distributed services and service calls. We ask ourselves what infrastructure should be located at the next higher level above databases and how it will simplify dealing with numerous information services (Web services). A hyperdatabase is such an infrastructure. Much like a database works with data, a hyperdatabase works with services. Accordingly, it is located where we see middleware today.

With the second definition, we start with distribution. The network components handle the transmission and routing of bytes between points in the information space. In this context, the layers of the hyperdatabase handle the processing and routing of application processes, also called transactional processes or just flows. Flows combine several service calls, specify alternatives, and provide for error handling. The hyperdatabase guarantees the execution and termination properties of flows even when they are processed in parallel. To avoid bottlenecks and reduce the number of vulnerable central components, we use ideas from peer-to-peer and grid computing that were developed in distributed computing and network technology. You can find more details on our Web site: http://www-dbs.inf.ethz.ch.
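To make the idea of a flow more concrete, here is a minimal Python sketch. It is an illustration only, not the hyperdatabase implementation: the names `Step`, `run_flow`, and `FlowAborted` are hypothetical, and undoing completed steps in reverse order is an assumed way of handling errors. Each step lists alternative service calls that are tried in order; if every alternative of a step fails, the steps already completed are compensated and the flow aborts.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


class FlowAborted(Exception):
    """Raised when a step fails on every one of its alternatives."""


@dataclass
class Step:
    # All names here are hypothetical and chosen for illustration.
    name: str
    alternatives: List[Callable[[], str]]            # service calls, tried in order
    compensate: Optional[Callable[[], None]] = None  # undoes the step if the flow aborts


def run_flow(steps: List[Step]) -> Dict[str, str]:
    """Run the steps in order and return each step's result by name."""
    completed: List[Step] = []
    results: Dict[str, str] = {}
    for step in steps:
        for call in step.alternatives:
            try:
                results[step.name] = call()
            except Exception:
                continue  # this alternative failed, try the next one
            completed.append(step)
            break
        else:
            # No alternative succeeded: undo completed steps in reverse order.
            for finished in reversed(completed):
                if finished.compensate is not None:
                    finished.compensate()
            raise FlowAborted(step.name)
    return results
```

A flow built this way either runs every step to completion or leaves no uncompensated step behind, which is one simple reading of a guaranteed termination property. Concurrency control between parallel flows is left out of the sketch.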

The amount and type of information change constantly. How can you keep such an information space under control so that current and consistent data is always available?

Schek: This is an extremely important issue and an additional reason for naming this a hyperdatabase. Just as a modern database automatically ensures that changes to data are controlled (that is, correct and consistent), the hyperdatabase should regulate changes in the information space. Consider the appearance and disappearance of information providers and the maintenance of the related dependencies in the information space. To handle such changes, the appropriate transactional processes are started when specific events occur. For example, when a new service provider is registered, every point of the information space that must know about the registration is automatically supplied with the new information.
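The registration example can be sketched in a few lines of Python. This is a simplified, in-process illustration under stated assumptions: the class `InformationSpace` and its methods are hypothetical names, notification is a plain synchronous callback, and the distributed, transactional execution described in the interview is not modeled.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

# A listener receives the event name and the provider's address.
Listener = Callable[[str, str], None]


class InformationSpace:
    """Hypothetical registry that informs interested points about providers."""

    def __init__(self) -> None:
        self.providers: DefaultDict[str, List[str]] = defaultdict(list)
        self._listeners: DefaultDict[str, List[Listener]] = defaultdict(list)

    def subscribe(self, kind: str, listener: Listener) -> None:
        # A point in the information space declares interest in one kind of service.
        self._listeners[kind].append(listener)

    def register_provider(self, kind: str, address: str) -> None:
        # The registration event triggers notification of every interested point.
        self.providers[kind].append(address)
        for notify in self._listeners[kind]:
            notify("registered", address)

    def unregister_provider(self, kind: str, address: str) -> None:
        # A disappearing provider is propagated in the same way.
        self.providers[kind].remove(address)
        for notify in self._listeners[kind]:
            notify("unregistered", address)
```

Only the points that subscribed to a given kind of service are notified, so a registration reaches exactly those parts of the information space that depend on it.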

What new application areas, such as mobile devices, do you see? Which ones do you regard as particularly important?

Schek: Here I’d have to follow up on what I said earlier. The information space does not consist only of more or less stationary components, but increasingly consists of mobile information providers and users. Today, we already have personal digital assistants (PDAs) and mobile telephones. In the future, new techniques of human-machine communication and invisible computers will enable or simplify interaction in the information space. In certain situations, mobile components are consciously or necessarily separated from the network; they expect to receive information intended for them and any changed information as soon as they are switched on. Another form of mobile information appears with health monitoring. If we want to, we will be able to carry various sensors with us that collaborate with other information sources, such as electronic patient files. Along with the involvement of a physician, this data can help warn us of impending danger so that we can take countermeasures. The technology will enable us to remain mobile, even when we’re at an advanced age or need care, and still receive professional attention in case of an emergency.

I see the main use of information technology in preventive healthcare and in post-hospitalization care, particularly through developments in pervasive computing and in dealing with the resulting ubiquitous information. For that reason, I have very gladly accepted another professorial chair at the Tyrolean University for Medical Informatics and Technology (UMIT) in Innsbruck. At UMIT, I collaborate with ETH in Zurich to link research into hyperdatabases with medical information systems and, above all, to bring mobility and sensor data more strongly into our previous work.

How can information and data be personalized, and what problems arise from doing so?

Schek: That’s a very hot topic right now, even though research into the issue has been going on for decades. But we’re more aware of it today. We all experience getting results from an Internet search engine in an amazingly short time. But we often don’t get what we really want. The question of relevant information depends upon the person, the situation, and the spatial and temporal context. Of course, we can imagine subscribing to specific information, defining changed information, and having a hyperdatabase infrastructure that reliably supplies us with both. However, we’re still far away from evaluating the relevance of the information correctly and using appropriate relevance feedback.

A simple example can clarify the situation. When I’m on my way to the airport and end up in a traffic jam, I’m primarily interested in knowing if I’ll make my flight. After all, it might be delayed. In this situation, it’s less important for me to know that the value of my stock portfolio has changed significantly, even though I have a subscription that supplies that information. In the context of our research into hyperdatabases, we’re in the process of extracting and maintaining many features of various types from multimedia objects. The more features we can extract, the more effectively we can improve relevance and relevance feedback. But we’re still a long way from intelligent retrieval and personalized information.

Do you cooperate with industry, for example with SAP, and what’s your view of such cooperation?

Schek: In the past, we had various cooperative agreements with industry, all in the area of middleware and infrastructure for information systems. Our partners included Telekurs, ABB, and Schindler. The most important cooperation right now is occurring with Microsoft Research, which sponsors research into database clusters, and with SAP, which sponsors SAP R/3 training for students. We have good contact with IBM for Web services and flow engines. In addition, as I noted earlier, we’re in the process of establishing cooperation in medical informatics with firms related to UMIT.

What’s your personal motto?

Schek: (Almost) never give up….