Position Paper for VRML '95

Getting Together in Cyberspace

Robert Rockwell, Ph.D.
Chief Scientist

blaxxun interactive
Mittererstr. 9, D-80336 Munich, Germany

Abstract. A central challenge to VR technology is the need to support social interaction in multi-user virtual environments. To meet this challenge, we will need to define semantic frameworks in terms of which people can convey to each other what they are up to. This paper considers some of the language/control issues involved in "socializing" cyberspace, and the implications for an evolving online architecture of VR products, services, and construction tools.

The Problem. Human reality is profoundly social. If virtual reality is to become a recognizably human place to visit and work in, then it must support people who want to meet there and work together. Facilities for effective multi-user interaction will therefore be crucial to the future evolution of online virtual environments.

Existing online services provide only the most narrow of channels for human interaction: a thin pipeline of text, the occasional bit-mapped image. The growing availability of such higher-bandwidth channels as ISDN has encouraged a variety of alternative developments, including both audio and video links. But these are mostly seen as enhancements to the telephone, rather than as tools for building and operating persistent virtual worlds - their users remain conceptually at their real-world desks.

In contrast, while VR developers are rapidly expanding their ability to model and present complex virtual environments, most of current VR technology is insistently single-user: you can't fit two heads under most helmets! The few examples of (non-immersive) virtual meeting-places capture only a small fraction of the multiplex flow of human social interplay.

This is not surprising, since the tools currently available for building such services address only the simplest level of scene description and object behavior - the Virtual Reality Modeling Language (VRML) being a good example. Having agreed very quickly on a syntax for representing static scenes, its designers now confront the very much more difficult task of extending that syntax to include object behaviors, direct user manipulation, and interaction among multiple user-manipulated "avatars".

The lively discussion over what (not) to include in the next revision(s) of VRML is one indicator of how much information must be captured in any simulation of human social interplay. The sheer hours of talk-show time and pages of magazine space devoted to body talk and fashion statements, spin doctors and image makeovers, should serve as a warning: people have a lot going on when they interact. Human communication is not just multi-threaded, but multi-channel: lots of different (and not necessarily consistent) messages conveyed simultaneously in several media.

That is not to say that cyberspace encounters ought to be photo-realistic proxies for face-to-face meetings. (You don't have to read Dilbert to learn that "one of the best things about the Internet is that nobody knows you're a dog.") But it does suggest that the 'interactivity' issue now hovering on the horizon with VRML 2.0 involves rather more than finding the best syntax for telling an object to alter its relationship to other objects in a virtual world. What we need is a way to model how people relate: their gestures and explicit references, knowledge sharing and negotiation, context recognition and, occasionally, deliberate deception.

I submit that these sociocultural issues are central to the challenge of VR. Enabling people to communicate effectively when they encounter each other in virtual environments - which also means enabling them to capture and re-use the common understandings built up during such encounters - surely this is at the heart of the whole idea of "implementing cyberspace."

Online VR will be a success to the extent that it enables the (co-)evolution of cybercommunities: shared worlds that emerge out of the interplay among the people who meet in them. With that goal firmly in mind, this paper asks two questions: What is an interaction? And who can see or do what when?

Agreeing on the answers to these questions - or at least, agreeing what these questions mean, and that they are questions we need to answer - is a first step toward defining the requirements for a "VR toolkit" adequate to the design and construction of genuinely social virtual worlds.

What is an interaction? This is not the first time anyone has tried to answer this question. There is an enormous literature on social action (sociological, anthropological, philosophical), and there is an important line of computer science research which focuses on the role language plays in supporting (and constraining) people's social interaction [cf. W&F86]. Nodding gratefully in the direction of these esteemed ancestors (but leaving the detailed bibliography to the graduate students), let us consider the bare bones of what happens when two people engage in any social interaction: they must recognize and identify each other, establish contact, and then conduct the interaction itself.

Each of these interaction phases is accompanied by its own "protocol," in the sense that term is defined in the layered OSI metamodel for machine communication promulgated by the International Organization for Standardization [ISO]. We will want to consider that model more closely in a moment; but first, consider how moving from real to virtual space complicates these simple procedures.

Recognition/Identification: Since people may choose to represent themselves online by any arbitrary avatar, various issues related to recognition and identification crop up. For one thing, limited bandwidth and CPU-cycles prohibit most representations from looking much like the person represented. For another, many participants will consciously decide to use varying (and perhaps deliberately misleading) representations of themselves. In consequence, the rich set of experience that everybody has with classifying others on the basis of appearance (a.k.a. prejudice) doesn't work anymore. Will we need something like "trusted personality servers" to establish the match between person and representation? More likely, new rules and new behaviors will emerge. As a consequence of the loss of visual clues about a personality, other characteristics will become more important (e.g., voice, conversation style, or interest profiles). It might be sensible to define filters that allow one to specify one's interests and to "highlight" others who match this profile.
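The interest-profile filter suggested above can be sketched in a few lines. This is a purely illustrative sketch, not part of any VRML proposal; the names `match_score` and `highlight` and the overlap threshold are assumptions of mine.

```python
# Illustrative sketch: "highlighting" other visitors whose declared interest
# profiles overlap one's own. All names and the threshold are hypothetical.

def match_score(mine: set, theirs: set) -> float:
    """Jaccard overlap between two interest profiles (0.0 to 1.0)."""
    if not mine and not theirs:
        return 0.0
    return len(mine & theirs) / len(mine | theirs)

def highlight(my_interests, others, threshold=0.25):
    """Return the names of visitors whose profiles overlap mine enough to highlight."""
    mine = set(my_interests)
    return [name for name, interests in others.items()
            if match_score(mine, set(interests)) >= threshold]
```

A client could run such a filter locally over the profiles that other avatars publish, rendering the matches more prominently in the scene.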

Contact protocols. In the real world, the rules for making (or breaking) contact among people are relatively well defined. There are formal protocols for introductions (via shared acquaintances, or "Do you come here often?"), and less formal initiators for specific "talk sessions" among people already in frequent contact ("Have you got a minute?"). The rules of Mail/Forum behavior have been consciously evolved over the past decade ("netiquette"). But translating such forms into VR contexts is not always straightforward. For example, "lurking" is no longer invisible, but as conspicuous - and perhaps as inappropriate - as "holding back" in a therapy group. It would be useful to have a way of gently, nonverbally, signalling our (lack of?) interest...

Interaction protocols. Face-to-face communication is both multi-threaded and multi-channel, with different (and not always consistent) messages being conveyed by body talk and fashion statements, by shifts in the volume, speed and precision of speech, etc. Most of this is lost in current VR settings: your avatar registers very little more than the fact of your presence. We need to provide avatars with modes to match our moods - virtual equivalents for smiling, blushing, etc. Perhaps more importantly, we need ways of managing many-sided conversations: recent work in the CSCW domain could be valuable here (e.g. on multi-user "whiteboards" and support for distributed real-time meetings).
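A multi-channel avatar update of the kind described here might bundle position, gesture, and mood into a single message. The sketch below is a modern illustration under my own assumptions (the field names and the JSON encoding are not from any VRML specification):

```python
# Hypothetical sketch: one avatar update carrying several simultaneous
# expressive channels. Field names and encoding are illustrative only.
import json

def avatar_update(position, gesture=None, mood=None, utterance=None):
    """Bundle parallel channels into one message; absent channels are omitted."""
    channels = {"position": position, "gesture": gesture,
                "mood": mood, "utterance": utterance}
    return json.dumps({k: v for k, v in channels.items() if v is not None})
```

The point of the design is that the channels travel together but remain distinct, so a receiver can render a wave, a mood, and a spoken line at once - including the inconsistent combinations that real social signalling is full of.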

Cultural differences. Meeting people with a very different cultural background was a rare exception until a century ago. High-speed mass transport (and especially mass tourism) has made the once fundamental distinction between "my people" and the rest of the world increasingly porous. As world-wide networks begin to bring people from all over the world to meet online, no quick course in "netiquette for newbies" will be sufficient to bridge the cultural gaps across which web surfers will meet and greet each other.

Ironically, the much greater bandwidth of VR communication is part of the problem. The thin pipeline of ASCII carries only the bare bones of a message, with little or no cultural meta-information flowing alongside. Under these circumstances, Dogbert may be a thoroughly satisfactory conversation partner. VR will be more nearly like a videophone: we will want to think hard about what background we want to include, and what others can be expected to read into the symbols we use to represent ourselves.

This last point is somewhat counter-intuitive. We tend to assume that increasing bandwidth and providing people with more options ought to make communication easier, not harder. But in fact, adding a new layer of possibility may be like moving from Basic to C++ - with so many additional options, simply saying hello, world! may require a considerable effort.

The OSI metamodel. The main VRML mailing list recently had a long thread devoted to participants' difficulty with this issue of cultural differences. The thread began with the assertion that VR would surely be a contribution to cross-cultural understanding, since it would support traffic in pictures that everyone could understand (after all, everyone can appreciate a sunset...) Gradually, it became clear - and here the contribution of those with multicultural experience was critical - that, while everyone could indeed appreciate a sunset, how each one interpreted the sunset might well be deeply different. What the group was confronting was the difference between syntax and semantics, and thus also between protocols and interfaces: the key distinctions which underlie and unify the Open Systems Interconnection (OSI) model of machine communication (see Fig. 1).

[Fig. 1. Two peers within the same layer exchange messages via a shared protocol; each layer builds on the services of the layer below.]

The OSI model is built up by distinguishing communication layers, in which peers exchange messages based on a shared protocol. The lexicon or vocabulary of these messages is provided by the layer below; the function of each layer is to provide a similar vocabulary of semantically more abstract services to clients operating in the layer(s) above.

The OSI model provides a consistent framework for organizing discussions of computer system communications. It makes it easy to separate, for example, discussions on Ethernet vs. Token Ring from discussions on TCP/IP vs. IPX/SPX, while at the same time opening up discussions as to how either of the latter semantically more abstract protocols might be implemented using the interface (i.e. lexicon of connected services) provided by either of the former.
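The layering idea itself can be sketched very compactly: each layer offers a small "vocabulary" of services to the layer above by framing that layer's messages in its own terms. This is a toy illustration of the principle, not of any real OSI implementation; the class and method names are mine.

```python
# Illustrative sketch of OSI-style layering: each layer wraps the message
# handed down from above with its own framing, then delegates downward.
class Layer:
    def __init__(self, name, below=None):
        self.name = name
        self.below = below          # the layer whose services we build on

    def send(self, payload):
        frame = f"{self.name}[{payload}]"   # add this layer's framing
        return self.below.send(frame) if self.below else frame
```

Stacking two layers shows the nesting: a session-layer message travels inside a link-layer frame, just as a semantic message in a virtual world would travel inside the protocol and connectivity layers beneath it.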

Quite apart from its relevance to system networking, the metamodel can help us locate some similarities among the various linguistic difficulties described above. What is useful is the basic idea of a "layer" as a class of messages couched in a common vocabulary, whose function it is to produce a set of higher-order abstractions which can be used to support communication in the next higher layer. In each of our examples above, one way to describe the problem would be to say that a layer (or two!) is missing. Introducing VR to the Web provides a radically extended vocabulary of possibilities for which we have yet to invent an appropriate semantics.

It is the same sort of problem a Pascal programmer has when first confronting Smalltalk or LISP. Even if the expressive possibilities inherent in runtime polymorphism or self-modifying strings seem conceptually clear, at first you have trouble finding anything sensible to say in such a language. You still live in a semantic world grown from different lexical roots; the things you want to express don't need the new vocabulary. At some point, often through interacting with others who have made the new terrain their own, you begin to find your own practical uses for the new constructs. You experience what American psychologists call a "gestalt shift" (and Germans call an "Aha!-Effekt"). Suddenly, the new possibilities become part of your language. It is not that your knowledge of the syntax is any better; it is that now you yourself have things to say that can't be expressed any other way.

In OSI terms, what you have done for yourself is create a new communications layer. From the raw potential of a lexicon, you have built a realm of discourse. Enhancing the current generation of VR technology to support genuinely social environments is a challenge of the same kind: we must learn not just to build new things, but to think new things.

Online and Off at Work and Play. Fortunately, we don't have to start from scratch. The frontier is already well populated with a variety of proto-cyberians. Our task will be to see how much we can learn from those who have already socialized their own corners of what, in only a few more years, will surely seem to be a single converging cyberspace community.

The existing groups range from the virtually anarchic playtime "chat" groups through a broad spectrum of somewhat more stable and well-ordered Newsgroups, BBSs, and Forums, to tightly structured mission-critical business teams following model-driven program plans.

This spectrum is spread out in two dimensions: structure vs. anarchy and work vs. play. That these vary independently is shown by the fruitful tension between roll-your-own "groupware" and formally mandated "workflow" advocates in the business world, and by the distance in both style and substance between the random chat at any of the online cafés and the extraordinarily elaborated complexity of something like WaxWeb.

At present, there is probably more serious experimentation with genuinely social VR in the world of play than in the world of work. (Although if we stretch a point and include videoconferencing as an initial form of telepresence, the business world is clearly holding its own). In any case, the important thing to note is the extra semantic layers added by the addition of a historical dimension to any online community.

The crucial historical issue is that of group memory. Again, we have a spectrum: from the almost wholly transitory chat groups through more or less well-tended Email-archives to the kind of organically growing information-tree supported by Lotus NOTES to the precisely prescribed structure of the document store and process control procedures required by a Pentagon software project managed under MilSpec 2167a. The spectrum is only partly about increasing formality and control. It is at least as much about increasing size and complexity. How many people are actively involved with each other in the community? For how long?

At one end of the spectrum our Social Interaction Support system is handling a shifting flow of short bursts of unstructured information exchanged among people grouped in twos and threes for at most a few minutes at a time, of which nothing is stored, not even the fact that the conversations have happened. At the other end, we have long-lived groups with hundreds of members (often arranged into several layers of subgroups with overlapping memberships), all exchanging complex information structures that are intricately linked to one another, where the whole point is to build up a consultable history of the group's work.
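This group-memory spectrum can be pictured as a retention policy attached to each communication channel. The sketch below is hypothetical (the class and policy names are mine, not from any existing system), but it makes the design point concrete: whether a conversation leaves a trace is a property of the channel, not of the conversation.

```python
# Hypothetical sketch: the group-memory spectrum as a per-channel
# retention policy, from transient chat (nothing kept) to a full archive.
class Channel:
    def __init__(self, retention="transient"):
        self.retention = retention
        self.log = []               # the channel's consultable history

    def post(self, author, text):
        if self.retention != "transient":
            self.log.append((author, text))   # only persistent channels remember
        return text
```

A chat café would instantiate transient channels; a project team would instantiate archived ones, and could then build its group memory out of the accumulated log.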

Who can see or do what when? Considering these more complex models of potential sociality in cyberspace brings another familiar software problem to the fore in an unexpected guise: the issue of scopes of action and visibility. If you change something in a space I am visiting, do I see the change? Does the change persist so that I see it tomorrow? What about changes that produce "ripple effects"?

What makes this question difficult is that it has two obvious answers, both clearly right, and both mutually exclusive.

One answer is to say that virtual spaces should be like real spaces, in which you can move naturally and produce expected effects. If I toss something into the disintegrator or send it through the beamer to another planet, I don't want to find it back on the floor/shelf when I return.

But of course the example makes it clear that we want virtual spaces to provide all sorts of effects that we can't get in "analogue" reality: disintegrators and beamers and alternative bodies and all sorts of other cool stuff we can count on the world-designers to dream up for us.

The problem for VRML interactivity is what constraints to provide "for free." The world-designer will have a tough enough time without having to re-invent physics along the way. That is why we want at least some form of (perhaps adjustable) gravity, and would like things not to fall through the floor unless we tell them to ...

The semantic analogues to such physical constraints are not instantly obvious, but will be equally critical to making worlds which can be not only constructed but co-habited. Perhaps the best-known such problem is that of version management. On the one hand we don't want to force everyone to live in the same cutout world: interaction means being able to change your physical and social surroundings. On the other hand, if everyone can change anything at will, there is no chance to achieve any kind of meaningful growth.
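One common resolution of this tension is to scope changes: a shared, persistent base world plus per-visitor overlays for local tinkering. The sketch below is my own illustration of that idea, not a VRML mechanism; all names are hypothetical.

```python
# Illustrative sketch: per-visitor overlays on a shared base world, so that
# local experimentation need not disturb the persistent scene everyone shares.
class World:
    def __init__(self):
        self.base = {}        # persistent state, visible to all visitors
        self.overlays = {}    # per-visitor local changes

    def set(self, key, value, visitor=None):
        if visitor is None:
            self.base[key] = value    # a shared change: persists for everyone
        else:
            self.overlays.setdefault(visitor, {})[key] = value

    def get(self, key, visitor=None):
        local = self.overlays.get(visitor, {})
        return local.get(key, self.base.get(key))   # overlay shadows base
```

The version-management question then becomes a policy question: who is allowed to promote an overlay change into the shared base, and when.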

Gregory Bateson identified this as the problem of "flexibility budgeting," and it is clearly one of the rules of the god-game. If you want to make worlds that can evolve, you have to allow them to achieve some measure of stability, however tenuous. More exactly: there must be ways of protecting core systems from random tinkering, without stifling the flow of adaptive invention.

Three timescopes. A minimal solution would seem to require three clearly separated classes of function. Alongside the features needed to enable interaction itself, it will be equally critical to provide features for managing or administering the environment. This is the familiar domain of visitor registration, archive maintenance, etc. Beyond this is a third domain, which is where you can change those things not under parametric control (so to speak): the class of functions needed to evolve or enhance the realm. Note that this distinction cuts across the intra-layer triad of connectivity, protocol and semantics. The result is a requirements matrix of 9 cells:




[Fig. 2. Functionality of a Communications Layer. Each layer answers its own questions:
    Semantics    - What? Why?   (the power of comm. rules)
    Protocol     - How? When?   (structure)
    Connectivity - Who? Where?  (making links; the network)]

Thus, for example, VRML-based worlds must provide features for adding both vocabulary and whole new ways of using words (e.g. metaphor, linked quote, self-explaining neologisms). Administrators must be able both to admit members and to create roles, etc.
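The requirements matrix itself is simple enough to write down directly. This is an illustrative rendering of the 9-cell structure described above, with two of the paper's example requirements filled in; the key names are my own shorthand.

```python
# Illustrative sketch: the 9-cell requirements matrix, crossing the three
# timescopes (interact / administer / evolve) with the intra-layer triad
# (connectivity / protocol / semantics). Cell contents here are examples.
TIMESCOPES = ("interact", "administer", "evolve")
LAYERS = ("connectivity", "protocol", "semantics")

matrix = {(t, l): [] for t in TIMESCOPES for l in LAYERS}
matrix[("administer", "semantics")].append("create roles")
matrix[("evolve", "semantics")].append("add vocabulary and new ways of using words")
```

Filling in all nine cells for a given platform or toolset would be one concrete way to audit whether it supports not only interaction, but also administration and evolution, at every layer.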

This is not to say that all these capabilities must be incorporated into any particular language, platform or toolset. What is best implemented at the platform level and what in the tools that build on and exploit platform capabilities is a subtle design matter, and outside the scope of these brief notes. My point here is rather to broaden and complicate our view of the problem space: to insist that enabling social action in cyberspace is fundamentally about enabling the construction of layered communication patterns, and that co-adaptation must be designed into the foundation.

Online communities. VR technology opens up the possibility of capturing shared experience in the fabric of shared settings: cybercommunities built by the people who meet in them, using as their building blocks the same symbols that constitute the substance of their interaction.

Meeting this challenge, redeeming this promise, means learning to model more than just shapes in space. It means inventing ways to make such spaces genuinely co-habitable, and to distill our experiences there into meanings that can be built into the spaces themselves. The goal of our next generation of design work should be to insure that people can truly share the worlds they meet each other in. That means - crucially - devising means of layering meanings onto its models.

If that can be accomplished, cyberspace will happily populate itself.


W&F86 Winograd, Terry & Fernando Flores, Understanding Computers and Cognition: A New Foundation for Design, Ablex (1986).

About the Author.

Dr. Rockwell's commitment to computer graphics for interactive teamwork dates from his pioneering 1974 proposal for a geographic data visualization system to evaluate the social and environmental impact of government policies. After co-authoring one of the very first CASE workbenches for MIS applications in 1980, he became deeply involved in the design, development, and practical deployment of Softlab's Maestro, the world's first collaborative environment for enterprise-scale software development. As Technical Director 1990-92 of the 7-nation Eureka Software Factory project, he created the ESF CoRe, a conceptual reference model for software factories. In 1995 he became co-founder and Chief Scientist of blaxxun interactive, an international cyberspace technology company.

© 2001 blaxxun interactive. All rights reserved.