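Many research efforts use search engines to obtain Web information for their studies. This dissertation defines the "search-engine-based research method" and takes as its object of study all papers published from 2001 to 2007 at seven major academic conferences, including WWW and SIGIR. It proposes eight dimensions, such as "search engine used" and "search engine access mode", classifies and compares 146 related studies along them, and offers guiding recommendations.
Keywords: Web entity instance, Web entity attribute type set, search engine, Web entity track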
Research on the extraction of Web entities and discovery of entity activities
Conglei Yao (Computer Science) Directed by Professor Xiaoming Li
For a given entity type, it is challenging and significant to obtain all of its instances at web scale, together with their extracted entity attributes. It is also important to retrieve entity activities and organize them in chronological order to form so-called entity tracks. This dissertation explores models and algorithms for these two issues, with the goal of building efficient and effective entity-based web information systems.

This thesis presents a comprehensive study of the extraction of web entities, their relations, and their tracks. The research is based on the WebDigest project, which was started by the Lab of Computer Networks of Peking University and aims to provide advanced entity-related services by extracting desired information from billions of pages. The dissertation focuses on two main problems. The first is the effective and efficient extraction of web entity instances for a given entity type, under the assumption that each desired instance already contains its determining attributes. The second is the extraction of activities for a given web entity and the organization of those activities into a proper form. Moreover, because all the models and methods in this thesis are based on search engines, the dissertation also surveys and analyzes related studies based on search engines.

In summary, the main contributions of this dissertation lie in the following five aspects:

(1) A novel framework for the extraction of web entity instances from large-scale web pages. A main shortcoming of current studies is that the entity attribute types are mostly manually specified and their real importance is unknown. This dissertation proposes a novel framework in which entity attribute types can be extracted and evaluated automatically. The input of this framework consists of a specified entity type and the initial knowledge of the entity type provided by users.
Based on this input information, the framework first creates a global web entity attribute schema, and makes sure
