CHI 97 Electronic Publications: Late-Breaking/Short Talks

Internet Scrapbook: Creating Personalized World Wide Web Pages

Atsushi Sugiura, Yoshiyuki Koseki
C&C Research Laboratories, NEC Corporation
4-1-1 Miyazaki, Miyamae-ku, Kawasaki 216, JAPAN
E-mail: {sugiura, koseki}@mmp.cl.nec.co.jp

ABSTRACT

This paper describes an information personalization system, called Internet Scrapbook, which enables users to create a personal page by clipping and merging the data they need from multiple Web pages. Even when the source Web pages are modified, the system updates the personal page, replacing its contents with the latest data extracted from the source pages. Therefore, once a user has created a personal page, she can browse only the information she needs.

KEYWORDS: World Wide Web, Web browser, end-user programming, programming by example, programming by demonstration.

INTRODUCTION

One of the main problems with WWW (World Wide Web) browsers, such as NCSA Mosaic [2], is that users must perform repetitive operations to access Web information. To obtain information from a Web page, the user specifies its URL (Uniform Resource Locator) and searches the downloaded page for the desired information, either by eye or by using the string search function provided by the browser. Since there are usually several pages the user needs to access, she has to repeat these operations many times. Furthermore, if the Web pages are frequently modified, repeated access to those pages is a heavy burden on the user. Our goal is to enable users with little programming skill to automate daily Web browsing tasks.

This paper describes an information personalization system, called Internet Scrapbook, which allows users to create a personal page by clipping only the data they need from Web pages. Even if the source pages are modified on their Web sites, the system automatically updates the personal page by re-constructing it with the latest information. The user can thereby avoid the repetitive operations of specifying URLs and searching for information.

© 1997 Copyright on this material is held by the authors.



INTERNET SCRAPBOOK

Overview

Our approach to achieving this goal is based on example-based, or demonstration-based, programming [1]. Namely, the user demonstrates the objective task on example data, and the system generates a program corresponding to the demonstration and executes the task on behalf of the user. Users are thus not required to learn special programming skills to automate repetitive tasks.

In Scrapbook, the operations for creating a personal page constitute the user demonstration. As shown in Figures 1a and 1b, the user first selects the desired portion of data in a browser used for regular Web browsing, and then copies the selected data to another browser holding the personal page. Currently, Scrapbook is implemented to copy data directly from Netscape Navigator, using NCAPI (the Netscape Client Application Program Interface).

Every time the user performs the selection and copy operations, the system generates a matching pattern, which is used to update the personal page. The system extracts the portion corresponding to the one selected by the user from the new source page and automatically re-composes the personal page with the latest information (Figure 1c).

An important design principle in Scrapbook is to make the data extraction procedure simple enough for users to understand its mechanism and easily examine whether the data extraction will be performed as desired. Scrapbook infers the user's desired portion of a source page from a single example, and there is no guarantee that the inference is always correct. Therefore, it is essential that users can anticipate the system's behavior in advance.


(a) Browsers for regular browsing (b) Browser for personal page (c) Updated personal page
Figure 1: Overview of Internet Scrapbook

Updating a Personal Page

To update a personal page, the system extracts the target data from the source pages and re-constructs the personal page. In Scrapbook, the extracted data is determined by line patterns, which are the previous, first, and next lines of the portion that the user originally selected on a source page. For example, if the user selects the data as shown in Figure 1a, "Last update: 97.1.14", "Top News" and "Economy", which are the previous line, the first line and the next line respectively, are used as line patterns.
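The derivation of line patterns from a selection can be sketched as follows. This is an illustrative reconstruction, not the actual implementation; it assumes the source page is represented as a list of text lines and the selection as an inclusive line range.

```python
def make_line_patterns(source_lines, start, end):
    """Derive the three line patterns from the user's selection
    (lines start..end, inclusive): the line before the selection,
    the first selected line, and the line after the selection.
    Patterns at the page boundary are None."""
    previous = source_lines[start - 1] if start > 0 else None
    first = source_lines[start]
    nxt = source_lines[end + 1] if end + 1 < len(source_lines) else None
    return previous, first, nxt
```

With the Figure 1a example, selecting the lines between "Last update: 97.1.14" and "Economy" yields exactly those boundary lines, plus "Top News", as the patterns.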

To determine the starting point of the data to be extracted, the system first tries to find a point in the source page that completely matches the previous/first line pattern. If no such point can be found, it performs partial matching to find the point with the largest matching degree. Likewise, the end point is determined using the next line pattern.
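The two-stage matching above (complete match first, partial match as a fallback) might be sketched as follows. The scoring function used for the "matching degree" is an assumption on our part (a shared-word ratio); the paper does not specify how partial matches are actually scored.

```python
def best_match(lines, pattern):
    """Return the index of the line matching `pattern`: an exact match
    if one exists, otherwise the line with the highest partial-match
    score.  The score here is a hypothetical word-overlap (Jaccard)
    ratio, standing in for the paper's unspecified matching degree."""
    for i, line in enumerate(lines):
        if line == pattern:          # stage 1: complete match
            return i
    def score(line):                 # stage 2: partial match
        a, b = set(line.split()), set(pattern.split())
        return len(a & b) / max(len(a | b), 1)
    return max(range(len(lines)), key=lambda i: score(lines[i]))
```

For the Figure 1a page, the pattern "Last update: 97.1.14" would still match the line "Last update: 97.1.15" after the page is modified, because the words "Last" and "update:" survive even though the date changes.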

This simple matching procedure allows users to anticipate the system's behavior in extracting data from the source page simply by checking the previous, first, and next lines of the data they selected. In the case of Figure 1a, it is expected that most of the text contained in the line patterns will remain in future versions of the source page. The first line "Top News" and the next line "Economy" will be unchanged, because they represent categories of articles, not daily articles. In the previous line, although the date "97.1.14" will change, the text "Last update:" will remain. Therefore, it is anticipated that the system will be able to extract the proper portion.

Although the data extraction procedure is simple, it works well on many Web pages. Most Web pages are composed so that users can browse them easily. To help users find their target information, the frequently modified parts of a Web page are usually preceded by permanent titles. Since users usually need the frequently modified portions surrounded by those permanent parts, the line patterns are useful for data extraction in many cases.

In addition to the line patterns, Scrapbook also uses HTML tag patterns, which describe the order of tags in a source page, such as extracting from the first <H2> tag to the second one. However, the system uses the line patterns in preference to the tag patterns. The tag patterns are used only when no candidate portion can be found with the line patterns or when multiple candidates are found.
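A tag pattern of the kind described (from the first <H2> to the second) could be realized roughly as below. This is only a sketch under assumed representations: it treats the page as a raw HTML string and locates tag occurrences with a regular expression, whereas the paper does not specify how tag order is actually recorded or matched.

```python
import re

def extract_by_tag_pattern(html, tag, start_index, end_index):
    """Return the HTML between the start_index-th and end_index-th
    occurrences (0-based) of an opening <tag> in the page.  Assumes a
    flat, well-formed page; closing tags (e.g. </H2>) are not counted."""
    positions = [m.start() for m in re.finditer(f"<{tag}[ >]", html, re.I)]
    return html[positions[start_index]:positions[end_index]]
```

Applied to a page containing two <H2> headings, a pattern "from the first <H2> to the second" returns everything from the first heading up to, but not including, the second.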

INFORMAL EXPERIMENTS

We performed informal experiments with 90 pages on 25 Web sites. The data extraction procedure using both line patterns and tag patterns extracted the user's desired portion from 79 of the pages. A procedure using line patterns alone worked well on 62 pages.

A typical case where extraction fails is one in which the latest information is added to the head of the source page, as shown in Figure 2. If a user clips the data inside the dashed line, the texts "97.1.14" and "97.1.13" are used as line patterns. In this case, the system always extracts the same data even when the source page is modified. We need to improve the matching procedure so that it extracts the new information by using tag patterns in preference to line patterns in such situations.


Figure 2: A typical Web page from which the system fails to extract the proper portion.

CONCLUSION

This paper has described a technique for obtaining the information a user desires from Web pages with minimal effort. The proposed demonstration-based technique frees users from repetitive tasks on the Web without requiring them to write any script or program.

ACKNOWLEDGMENTS

The authors express their appreciation to Satoshi Goto and Shiro Sakata of NEC Corporation for giving them the opportunity to pursue this research.

REFERENCES

  1. Cypher, A. ed. Watch What I Do: Programming by Demonstration. MIT Press, 1993.
  2. NCSA Mosaic: http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/

