An environment for user interface softbots
Robert St. Amant and Ajay Dudani
Department of Computer Science
North Carolina State University
Raleigh, NC 27695-7534
stamant@, abdudani@
ABSTRACT
A user interface softbot is a software agent that controls an interactive system through the graphical user interface, relying on visual information from the system rather than an application programming interface or access to source code. Interface softbots have acted as autonomous agents in applications such as drawing and data recording, and the core vision processing algorithms have been incorporated into cognitive models for simple problem-solving tasks. Building interface softbots is still a time-consuming task, unfortunately, requiring experience with complex program components as well as the details of the visual interface. We have developed a prototype development environment that facilitates the development of interface softbots, streamlining the programming process and making it more accessible to new developers.
Categories and Subject Descriptors
D.2.6 [Software]: Software Engineering—Programming Environments; H.5.2 [Information Interfaces and Presentation]: User Interfaces; I.2.5 [Artificial Intelligence]: Programming Languages and Software—Expert system tools and techniques
General Terms
Human Factors
Keywords
Interface softbots, Agents, Programming Environments
1. INTRODUCTION
Over the past few years our research has focused on the concept of user interface softbots, or ibots: agents that treat the user interface of graphical applications as their environment [9, 12]. Rather than controlling an application through a programmatic interface (i.e., an API) or by modifications to its source code, interface softbots control an application using the same mechanisms as a human user. An interface softbot runs image processing algorithms over the visual display of an application, identifying its controls and data input widgets, and executes mouse and keyboard actions to control its behavior. Conceptually, interface softbots occupy a space between conventional software agents and physical robots, in that they require vision processing and motor capabilities but operate in an environment far simpler than the real world. Interface softbots provide rich opportunities for research in cognitive modeling [9], as well as autonomous agents and intelligent user interfaces [10].
There are clear disadvantages in such an approach to building agents (for example, consider efficiency). Ritter et al. [5] discuss a number of alternative approaches to this problem in the context of cognitive modeling agents, including building or modifying applications so that they can provide agents with the information they request, modifying the windowing system so that agents can retrieve relevant information, or instrumenting the operating system to include the relevant hooks.
A full discussion of these possibilities is beyond the scope of this paper. In some situations, however, the interface softbot approach can be a reasonable choice: sometimes we may be interested explicitly in the visual environment of an application, rather than only in the functionality that it supports. Also, we gain several benefits from building agents that can interact directly with interfaces the way that people do [10]. We can learn about the interaction properties of a user interface as a problem-solving environment. This kind of information is of intrinsic value to researchers and developers in human-computer interaction. We also can learn about the strengths and limitations of agents in a problem-solving environment. This kind of information can be useful to cognitive modelers. We can use the user interface of an application as a general, instrumentable testbed for modeling. The interface provides a flexible testbed in which we can pose problems to an agent that are simultaneously tractable and realistic. Finally, an interface softbot must build and maintain a model of the visual environment in which it exists; this information can give developers insight into the capabilities and limitations of the interactive systems they build.
We have pursued these ideas along two lines. First, we have built interface softbots as autonomous agents in a number of different applications. In general, they are appropriate for situations in which the functions of an API are inappropriate for a task [8], or in which it is important for problem-solving activity to take place at the level of abstraction of the visual user interface [4]. Our interface softbots can perform simple editing tasks in a word processor (Notepad), draw pictures in Adobe Illustrator [11], even play Microsoft Solitaire [12]. We have also designed and implemented models, based on the ACT-R cognitive architecture [1], to perform editing tasks, to navigate through a user interface as a form of exploration [4], and to play a simple off-the-shelf driving game. This last application is not yet complete; it relies on a general framework still under development, which has the goal of integrating cognitive modeling into an environment for evaluating conventional user interfaces [6].
Unfortunately, developing basic interface softbot functionality is not always easy, and this can offset the benefits of building agents and cognitive models that follow these constraints. The problems are related mainly to the image processing requirements:
• Pattern issues. In some cases a specific visual pattern (e.g., a novel interaction widget) cannot yet be recognized by an ibot. In such cases the developer must produce an appropriate specification for the image processing algorithms. To do this, however, one needs a close understanding of the capabilities and limitations of these algorithms, and this knowledge is gained only at significant cost to a developer.
• State issues. It is not always easy for the developer to know exactly what the ibot can "see" at any point in time. Certainly this information (at the bit/pixel level) is accessible, but because the ibot's representation is necessarily incomplete, the developer may not have complete knowledge of its visual input.
• Timing issues. The ibot does not have a time-continuous representation of the visual interface. Instead, it must poll the display. If this polling occurs during a period in which the interface has not yet reached quiescence (e.g., the ibot has selected a menu option to launch an application, but that application's main window has not yet appeared), then expectations about the state of the interface may not be met. (A sketch of an active-wait loop that addresses this issue appears after this list.)
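To make the timing issue concrete, the following Java sketch shows one way an active-wait loop might detect quiescence by polling the display. It is an illustration only, not code from the ibot substrate; the ScreenCapture interface and its captureHash method are hypothetical stand-ins for the actual image processing calls.

```java
// Minimal sketch of an "active wait" helper: poll the display until it
// stops changing for a while, or give up after a timeout.
// ScreenCapture/captureHash are hypothetical stand-ins for the substrate.
public final class QuiescenceWaiter {

    public interface ScreenCapture {
        long captureHash();  // assumed: a digest of the watched screen region
    }

    private final ScreenCapture screen;
    private final long pollIntervalMs;
    private final long stableForMs;

    public QuiescenceWaiter(ScreenCapture screen, long pollIntervalMs, long stableForMs) {
        this.screen = screen;
        this.pollIntervalMs = pollIntervalMs;
        this.stableForMs = stableForMs;
    }

    /** Returns true if the display stopped changing before the timeout elapsed. */
    public boolean waitForQuiescence(long timeoutMs) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        long lastHash = screen.captureHash();
        long stableSince = System.currentTimeMillis();
        while (System.currentTimeMillis() < deadline) {
            Thread.sleep(pollIntervalMs);
            long hash = screen.captureHash();
            if (hash != lastHash) {
                lastHash = hash;                       // display changed; restart the clock
                stableSince = System.currentTimeMillis();
            } else if (System.currentTimeMillis() - stableSince >= stableForMs) {
                return true;                           // stable long enough: quiescent
            }
        }
        return false;                                  // timed out before quiescence
    }
}
```

An agent that has just selected a menu option to launch an application could call waitForQuiescence before forming expectations about the new window's contents.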
We have begun to address these difficulties by building a programming environment for developing interface softbots. Our focus in this project is not explicitly on cognitive modeling or on the intelligence of agent actions; we have established elsewhere that interface softbot technology is compatible with ACT-R and other modeling architectures as well as with common AI planning approaches. Our development environment concentrates on the visual interface between an application and an ibot. In this environment a developer can see a textual representation of the objects that an ibot can recognize on the screen, can filter this visual information by object type, can designate specific visual patterns to be recognized as objects, and can carry out further related tasks. The result is a simpler, more streamlined programming process for the developers of cognitive models or interface agents.
In the remainder of this paper, we discuss a working example of an interface softbot that interacts with a chat application, showing how the environment works in practice. We then explain how the environment supports development of interface softbots. The environment is not an end in itself; rather, it provides a testbed for our experimentation with tools for models and agents that can interact visually with applications.
2. A WORKING EXAMPLE
Our example deals with an interface softbot that interacts with a chat system, MSN Messenger. We would expect the following as a minimal set of capabilities for the agent: it should be able to see the list of users who are added as potential chat partners; it should be able to send messages to a particular partner; it should know when a new message has been received from the other person; and it should be able to read what has been written by the other person. From the perspective of an agent developer, these capabilities abstract away many of the less relevant characteristics of agent behavior, to concentrate on the intelligent aspects of making conversation. The advantage of working through an interface softbot in the first place is that the richness of real-world interaction (with different systems and different users) is retained.
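These capabilities can be stated as a small programming interface. The Java sketch below is our own summary of the list above, with hypothetical names; it is not the API of the actual system.

```java
import java.util.List;

// Hypothetical summary of the minimal chat-agent capabilities listed above.
public interface ChatAgentCapabilities {
    List<String> visiblePartners();                  // the list of potential chat partners
    void sendMessage(String partner, String text);   // send a message to one partner
    boolean hasNewMessage(String partner);           // detect that a reply has arrived
    String readNewMessage(String partner);           // read what the partner wrote
}
```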
The sample interaction that follows, partially illustrated in Figure 1, describes what is required on the part of the system, focusing on the interface softbot functions rather than the intelligent behavior aspects. Recall that the ibot must interact with the chat application through its visual user interface, rather than more directly through a programmatic interface. (A code sketch of the resulting interaction loop appears at the end of the example.)
1. Read screen. The ibot captures the contents of the display, performs appropriate screen processing, and creates data structures that store information about the coordinates, colors, and other properties of screen objects (e.g., windows, buttons, characters) on the screen.
2. Start chat. The ibot searches for the name of the particular chat partner. Because the ibot has identified all the characters on the screen, it can now combine these characters to find the visual representation of this name. Once the location of the name is found on the screen, the ibot double-clicks that coordinate to start the chat session.
3. Read screen. A new window for the chat session opens; the ibot must now re-process the screen to identify the window and all of the new information that it contains.
4. Send message. From the most recent screen processing operation, the ibot now has information on the location of the text area where messages are written. It now types in the text for a message (supplied by an external controller or generated internally) by simulating the pressing of keys on the keyboard.
5. Active wait. At this point the ibot iteratively polls the display until new text is detected. (This comparison can occur either at the bit level or the object level, depending on the needs of the application.) An altered display indicates that the other end has responded to the last message sent. When a difference is found, the ibot collects the new characters, processes them, and generates a new response.
6. End. The process continues until the ibot detects an end condition, either by processing the messages it receives or when an internally generated state condition is reached.
Figure 1: The Chat environment
This example has been implemented just as described above. The ibots in this test application interact with each other, with messages generated either by the Alice system or by a user. In the latter case, the user provides input by interacting with the ibot rather than directly with the chat application. We emphasize that this example is not meant to represent a useful application for user interface agents in general (although such an arrangement might provide a plausible means of input for a disabled user), but rather to show some of the possibilities and challenges of building interface softbots that perform visual analysis of their environments.
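The six numbered steps correspond closely to a simple control loop. The Java sketch below shows one plausible shape for that loop; the Ibot interface and every method name in it (readScreen, findText, doubleClickAt, typeText, waitForNewText) are hypothetical placeholders for the substrate's perception and action calls, and generateReply stands in for whatever produces the conversation (the Alice system or a human user).

```java
// Sketch of the chat interaction loop described above. All names are
// hypothetical placeholders; only the control structure is the point.
public class ChatLoopSketch {

    public static void runChat(Ibot ibot, String partnerName) throws InterruptedException {
        ibot.readScreen();                               // 1. Read screen
        ScreenPoint name = ibot.findText(partnerName);   // locate the partner's name
        ibot.doubleClickAt(name);                        // 2. Start chat
        ibot.readScreen();                               // 3. Read screen (new chat window)

        String outgoing = "hello";
        while (true) {
            ibot.typeText(outgoing);                     // 4. Send message
            String incoming = ibot.waitForNewText();     // 5. Active wait (poll the display)
            if (isEndCondition(incoming)) {              // 6. End
                break;
            }
            outgoing = generateReply(incoming);          // external controller or chat system
        }
    }

    static boolean isEndCondition(String text) {
        return text == null || text.contains("bye");
    }

    static String generateReply(String incoming) {
        return "You said: " + incoming;                  // trivial placeholder
    }

    // Hypothetical stand-ins for the substrate's perception/action calls.
    interface Ibot {
        void readScreen();
        ScreenPoint findText(String text);
        void doubleClickAt(ScreenPoint p);
        void typeText(String text);
        String waitForNewText() throws InterruptedException;
    }

    record ScreenPoint(int x, int y) {}
}
```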
3. DEVELOPMENT ENVIRONMENT
We have described the basic visual processing functionality for ibots elsewhere [9]; in this section we concentrate on the chat application. In the code to implement the chat ibot, following our existing development procedures, user code for the chat functionality would be interleaved with code for the perception and action functionality, in a serial process. This is problematic from a design viewpoint. The developer must be aware of state and timing issues, as described above, and must carefully integrate the low-level image processing functionality with the higher-level reasoning functionality in the agent code.
A better solution, which we have implemented, has the image processing code and the higher-level user code running at the same time, in parallel. For example, in the example of MSN Messenger above, the code called Read Screen to analyze the display to extract information and generate data structures for further processing. Under our new arrangement, Read Screen runs in parallel, creating objects that other code can simply access to get the current values. Timing and state issues are taken care of on the image processing side, rather than in interleaved, explicit code.
This arrangement leads to what we can think of as an ibot "engine." The job of the engine is to continuously scan the user interface on behalf of external programs. External programs receive the information they need about the current state of the user interface from the engine. The advantage of this approach is that agent functionality outside the scope of perception and action does not need to include code to scan the user interface and do the relevant processing; instead, an agent program can focus on how it is going to use the user interface. Another advantage is that, if several programs need information about the current state of the user interface at the same time, none of them is required to do the screen capture and the low-level processing and identification of screen objects; instead, they all get that information from the engine.
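A minimal Java sketch of such an engine follows, under the assumption that a background thread repeatedly scans the display and publishes its results; the scanScreen method and the ScreenObject record are hypothetical stand-ins for the real image processing substrate.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Sketch of an ibot "engine": a background thread continuously scans the
// display and publishes the latest set of recognized screen objects, which
// client programs read without doing any image processing themselves.
public class IbotEngineSketch {

    // Hypothetical description of a recognized widget or character group.
    public record ScreenObject(String type, String label, int x, int y, int w, int h) {}

    private final AtomicReference<List<ScreenObject>> current =
            new AtomicReference<>(List.of());
    private volatile boolean running;

    public void start(long scanIntervalMs) {
        running = true;
        Thread scanner = new Thread(() -> {
            while (running) {
                current.set(scanScreen());               // refresh the published snapshot
                try {
                    Thread.sleep(scanIntervalMs);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }, "ibot-engine-scanner");
        scanner.setDaemon(true);
        scanner.start();
    }

    public void stop() { running = false; }

    /** Hook: the latest snapshot of recognized objects. */
    public List<ScreenObject> currentObjects() {
        return current.get();
    }

    /** Hook: the snapshot filtered by object type (e.g., "button", "window"). */
    public List<ScreenObject> objectsOfType(String type) {
        return current.get().stream().filter(o -> o.type().equals(type)).toList();
    }

    // Placeholder for the actual screen capture and pattern matching.
    private List<ScreenObject> scanScreen() {
        return List.of();   // the real substrate would return recognized widgets here
    }
}
```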
From the developer's point of view, whenever he wants to use the functions provided by an ibot, he simply starts the ibot engine. The engine provides specific hooks to the internal functions of ibots, which the developer can call directly. The developer can then program applications on top of an interface softbot using the hooks provided by the engine. The decision about which hooks to provide is an important one. From the point of view of the ibot, when the engine is started, the system keeps reading the screen contents, identifying the widgets, characters, and so forth on the screen, and putting them into appropriate data structures so that the information can be used by the developer.
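From the developer's side, using such hooks might look like the fragment below. It builds on the hypothetical IbotEngineSketch above and is not the actual hook set of the engine.

```java
// Hypothetical use of the engine sketched above: start it once, then query
// the published model of the screen instead of processing pixels directly.
public class EngineClientSketch {
    public static void main(String[] args) throws InterruptedException {
        IbotEngineSketch engine = new IbotEngineSketch();
        engine.start(100);                               // scan roughly ten times per second

        Thread.sleep(500);                               // allow the first scans to complete
        for (IbotEngineSketch.ScreenObject o : engine.objectsOfType("window")) {
            System.out.println("window: " + o.label() + " at (" + o.x() + ", " + o.y() + ")");
        }
        engine.stop();
    }
}
```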
In the graphical interface to the engine, the user can select the text representations of objects, at which point they are highlighted on the screen by the presence of a colored border. Filtering by object type is supported as well. The user can also point at a specific, unrecognized object and direct the engine to generate a description for it. (This capability is currently limited, in that it produces descriptions based on existing primitives; more general learning is a topic under development.) Other capabilities supported by the engine are not shown in the interface. These include access to time-stamped object specifications, access to specific properties of recognized objects (position, size, text label, etc.), and manipulation of objects.
The ibot engine addresses an important issue for us aside from developer support. One of the basic limitations of our work on interface softbots until now has been platform dependence, in two ways. First, the pattern matching components of the system are tailored to the look and feel of the Windows user interface (Windows 95/98/2000/NT). The image processing algorithms are simple and general, and could easily be ported to other systems, but the knowledge base of visual patterns on which they rely would need to be extended significantly for interface softbots to work on a Macintosh or a Linux graphical user interface. Second, for both its visual and motor processing, the current substrate implementation relies on function calls to the Windows API. This is a small part of the system (estimated at 3% of the total code [9]) but still an important part.
The engine addresses the problem of dependence on implementation, though not the dependence on the display. We have reimplemented the substrate in Java, from its earlier implementation in a combination of Lisp and C++. Java is operating system independent, so the code written for Windows will also run on a different user interface for another operating system, as long as that interface follows the same visual conventions that Windows does.
4. THE IBOT ENGINE IN USE
Extending our previous research, we plan to concentrate our work with the new implementation in the areas of cognitive modeling and planning.
From a cognitive modeling viewpoint, the design of human-computer interfaces can be seen as a form of engineering design. Engineering guidelines can arise from a number of sources, including HCI, software engineering [2], artificial intelligence [3], and cognitive psychology [7]. Cognitive models in particular provide a means of applying what is known about psychology to predict time, errors, and other measures of user interaction.
Cognitive models have been used to analyze a wide range of human problem-solving activities in HCI. Tasks include searching for and selecting menu items, learning the layout of widgets in a graphical user interface, and the exploratory navigation of an unfamiliar interface, among many others. These tasks have the common characteristic that the model must interact with a visual representation of a graphical user interface. Until recently, cognitive models have interacted with applications either in simulation, through an application's programming interface (API), or with the help of a specialized user interface management system. In these cases, the perceptual functions of the cognitive model operate at one remove from the actual environment. Cognitive models built on top of interface softbot functionality do not have this limitation. We expect that our new environment will facilitate the development of task environments in which cognitive models can operate.
In our planning work, we are developing domain specifications to allow unmodified planners to use the user interface as a testbed environment. We are currently in the process of integrating different planners from other research laboratories into the ibot engine, in order to evaluate their varying capabilities in solving realistic but tractable problems—those that are commonly solved by everyday users.
5. ACKNOWLEDGMENTS
This effort was supported by the National Science Foundation under award 0083281 and by the Space and Naval Warfare Systems Center, San Diego. The information in this paper does not necessarily reflect the position or policies of the U.S. government, and no official endorsement should be inferred.
6. REFERENCES
[1] John Anderson and Christian Lebiere. The Atomic Components of Thought. Lawrence Erlbaum, Mahwah, NJ, 1998.
[2] Alan J. Dix, Janet E. Finlay, Gregory D. Abowd, and Russell Beale. Human-Computer Interaction. Prentice Hall, 2nd edition, 1998.
[3] James Foley, Won Chul Kim, Srdjan Kovacevic, and Kevin Murray. UIDE–an intelligent user interface design environment. In Joseph W. Sullivan and Sherman W. Tyler, editors, Intelligent User Interfaces, pages 339–385. ACM Press, New York, 1991.
[4] Mark O. Riedl and Robert St. Amant. Toward automated exploration of interactive systems. In Proceedings of Intelligent User Interfaces, pages 135–142, 2002.
[5] Frank E. Ritter, Gordon D. Baxter, Gary Jones, and Richard M. Young. Supporting cognitive models as users. ACM Transactions on Computer-Human Interaction, 7(2):141–173, 2000.
[6] Frank E. Ritter, Dirk Van Rooy, and Robert St. Amant. A user modeling design tool for comparing interfaces. In Proceedings of the International Symposium on Computer-Aided Design of User Interfaces, 2002. To appear.
[7] Frank E. Ritter and Richard M. Young. Embodied models as simulated users: Introduction to this special issue on using cognitive models to improve interface design. International Journal of Human-Computer Studies, 55(1):1–14, 2001.
[8] Robert St. Amant, Henry Lieberman, Richard Potter, and Luke S. Zettlemoyer. Visual generalization in programming by example. Communications of the ACM, 43(3):107–114, March 2000.
[9] Robert St. Amant and Mark O. Riedl. A perception/action substrate for cognitive modeling in HCI. International Journal of Human-Computer Studies, 55(1), 2001.
[10] Robert St. Amant and R. Michael Young. Interface agents in a model world environment. AI Magazine, 22(4):95–107, 2001.
[11] Robert St. Amant and Luke S. Zettlemoyer. The user interface as an agent environment. In Proceedings of the Fourth International Conference on Autonomous Agents, pages 483–490, 2000.
[12] Luke Zettlemoyer and Robert St. Amant. A visual medium for programmatic control of interactive applications. In CHI '99 (ACM Conference on Human Factors in Computing), pages 199–206, 1999.