Enhancing Content Selection and Extraction
Mechanisms in Web Browsers and PDF
Senior Honors Thesis by Katie Han
Advisor: Andries van Dam
Second Reader: James Tompkin
Abstract
A common task during research or document organization is selecting fragments of content
from the web or PDF documents and migrating the information to different environments, such
as note-taking applications and word processors. However, the native selection and extraction
capabilities of standard web browsers and PDF viewers offer little help in scraping resources in
this manner. For web pages, the underlying HTML structure that represents images, lists,
tables, links, and other formatting is often lost in the process; for PDF documents, selecting and
copying any content other than basic text is virtually impossible. In this project, I explore and
extend the work done by cTed, a web browser plug-in that allows intuitive gestures to select
content on websites with arbitrary layouts. Then, I move on to applying those selection
mechanisms to PDF documents and implementing software that enables extracting excerpts
from documents with little loss of the embedded information.

Introduction
In recent years, the main channel for people to consume and share information has shifted
towards the web. Whether it is morning news, research articles, or interesting blog posts,
information is most often accessed through web pages displayed on browsers. When absorbing
these materials, people often intend to scrape excerpts to share with others or to
reorganize the information in a different environment, such as Microsoft Word or OneNote.
However, selecting specific fragments of web pages and exporting the clipped content is generally
unintuitive and difficult. For instance, a web browser’s basic click-and-drag functionality provides
limited selection and often fails to reflect what users want to capture from the page. On
the other hand, a device’s screenshot tool simply grabs a bitmap representation of the selected
text, images, tables, or videos instead of conserving the important underlying HTML elements.
While this method allows the user to share a visual representation of the page, other resources
behind the content, such as text and links, are lost.
Apart from HTML web pages, information is frequently shared and displayed via PDF
documents, especially in the context of research and scholarship. Similar problems arise
when selecting and extracting excerpts from a PDF document. Current PDF
viewer applications provide limited interactions between the user and the underlying content of
the PDF. While many applications focus on displaying the pages of a PDF document and editing