Abstract
A multiparadigm approach is developed and demonstrated for exploiting knowledge about structure in order to extract information from noisy textual data. A motivating example of a potential application is an address encoding system for a delivery service such as UPS, Federal Express, or the United States Post Office. The approach combines database organization and clustering of records, fuzzy parsing, fuzzy retrieval, an aggregation algebra, and measures of both performance and accuracy. Fuzzy retrieval, in the form of set and fuzzy operators, is accomplished by treating each symbol of the input text as imperfect and retrieving from the database the inexactly matching records that satisfy a particular threshold value. The set of low-level database operators constrains the cardinality and accuracy of retrievals. A hierarchical method of clustering the database is defined, whereby the records are partitioned so that similar records fall in the same cluster. This clustering strategy is guaranteed to produce clusters that are mutually exclusive and that completely cover the data records. Associated with these clusters is an algebra that combines clusters of data into one window of ranked data. A set of fuzzy measures is defined and used to aggregate and rank sets of records.
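The threshold-based retrieval described above can be illustrated with a minimal sketch. It is not the paper's operators or fuzzy measures: the similarity score here is a stand-in taken from Python's standard `difflib.SequenceMatcher`, and the names `similarity` and `fuzzy_retrieve`, the sample street records, and the threshold of 0.85 are all hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; the paper defines its own fuzzy measures."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_retrieve(query: str, records: list[str], threshold: float = 0.85):
    """Return (score, record) pairs whose score meets the threshold, best first."""
    scored = [(similarity(query, record), record) for record in records]
    hits = [(score, record) for score, record in scored if score >= threshold]
    return sorted(hits, key=lambda pair: pair[0], reverse=True)


# Hypothetical address fragments; the query contains one corrupted symbol.
records = ["MAIN STREET", "MAPLE STREET", "MARKET AVENUE"]
matches = fuzzy_retrieve("MA1N STREET", records, threshold=0.85)
```

Raising the threshold shrinks the retrieved set and lowering it admits more distant records, which is one simple way a retrieval operator can trade cardinality against accuracy.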
Original language | English |
---|---|
Pages (from-to) | 195-205 |
Number of pages | 11 |
Journal | Soft Computing |
Volume | 4 |
Issue number | 4 |
Publication status | Published - 2000 |
Externally published | Yes |
Keywords
- Database clustering
- Fuzzy retrieval
- Noisy data
- Semi-structured data