LIRICS Software: Automatic Quality Control
Automatic Quality Control
Neil Newbold, Lee Gillam
Document Content Management System - Readability Tools
The Document Content Management System is a flexible package of integrated readability tools designed to enable automatic quality control. The system provides supporting resources and components for the standards development process, including a Plain English thesaurus, lookup of ISO TC 37 terminology supplied by a terminology management system (TMS) via ISO 16642, automatic terminology discovery using statistical and linguistic techniques, and readability metrics. The work was undertaken as part of the EU-funded project LIRICS, in collaboration with project partners from international research institutions including INRIA (France) and DFKI (Germany).
Installation
- Download GATE from here and follow GATE installation instructions.
- Download Readability_Tools.zip from here.
- Extract the Readability_Tools folder from the Zip file and copy the folder into the plugins directory within your GATE installation directory.
- Run GATE and click on the manage CREOLE plugins button at the top.
- Tick the load now box next to readability tools to make all the new processing resources available.
Benefits of the Document Content Management System
- annotates known terminology in a document and gives easy access to its definitions
- supports automatic terminology extraction
- identifies verbose words and phrases and offers suitable replacements
- calculates readability measures based on terminology used
- automatically replaces verbose words with a selected replacement, or terminology with its ISO definition
- incorporates GATE functionality, with tools run in a user-selected pipeline
Process
The readability components have been integrated with the University of Sheffield's GATE system, which offers a set of reusable processing resources for common NLP tasks. These resources are packaged together to form ANNIE, A Nearly-New Information Extraction system. Existing GATE plug-ins from ANNIE are used for the preliminary tasks and lead into the newly devised readability processing resources. The Readability Analyser can be run at two separate points in the pipeline, either to incorporate or to ignore terminology. Once the process has completed, it can be repeated to refine the quality of the text further. The pipeline of processing resources is shown in the diagram below:
Readability Tools
1. Terminology Lookup
This plug-in examines a document and identifies term entries found in the text. It takes an XML file containing the full terminology and a language identifier that determines the language in which terms are annotated. The file comprises all the terms and definitions for the subject field and is a small-scale export of a terminology database in the meta-model format specified in the ISO 16642 standard. The terminology was collected from existing ISO standards during the project and entered into a terminology management system (Surrey’s System Quirk Browser/Refiner application).
2. Linguistic Term Finder
This plug-in determines all the compound nouns in the document using the part of speech annotations created by the ANNIE POS tagger.
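As an illustration of the idea only, compound nouns can be found by collecting runs of consecutive noun tokens from part-of-speech output. The sketch below assumes Penn Treebank-style noun tags and a minimum run length of two; the function name and the exact rule are assumptions, not the plug-in's documented behaviour.

```python
from typing import List, Tuple

# Assumption: nouns carry Penn Treebank-style tags.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def compound_nouns(tagged: List[Tuple[str, str]], min_len: int = 2) -> List[str]:
    """Return maximal runs of at least `min_len` consecutive noun tokens."""
    compounds: List[str] = []
    run: List[str] = []
    # A sentinel entry with a non-noun tag flushes a run that ends the text.
    for token, tag in tagged + [("", "END")]:
        if tag in NOUN_TAGS:
            run.append(token)
        else:
            if len(run) >= min_len:
                compounds.append(" ".join(run))
            run = []
    return compounds


tagged_tokens = [
    ("The", "DT"), ("terminology", "NN"), ("management", "NN"),
    ("system", "NN"), ("stores", "VBZ"), ("terms", "NNS"),
]
print(compound_nouns(tagged_tokens))
```

A single noun such as "terms" is not reported here because the run must contain at least two tokens.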
3. Keyword Extractor
This plug-in calculates the frequency and weirdness of individual words, as utilised by System Quirk. The user supplies a file of words and their frequencies in a reference corpus: we use frequency information from the 100 million word tokens of the British National Corpus (BNC). Words that are frequent in the document but have a low frequency in the BNC (and can therefore be called “weird”) are extracted.
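A minimal sketch of this kind of extraction follows. It assumes the usual weirdness ratio: a word's relative frequency in the document divided by its relative frequency in the reference corpus. The function name, the thresholds, the add-one smoothing for words missing from the reference list, and the reference counts in the example are all illustrative assumptions, not real BNC figures or the plug-in's actual settings.

```python
from collections import Counter
from typing import Dict, List


def extract_keywords(
    doc_tokens: List[str],
    ref_freq: Dict[str, int],
    ref_total: int,
    min_count: int = 2,         # hypothetical frequency threshold
    min_weirdness: float = 10.0  # hypothetical weirdness threshold
) -> Dict[str, float]:
    """Return words that are frequent in the document and 'weird' against the reference."""
    counts = Counter(t.lower() for t in doc_tokens)
    doc_total = sum(counts.values())
    keywords: Dict[str, float] = {}
    for word, n in counts.items():
        if n < min_count:
            continue
        rel_doc = n / doc_total
        # Assumption: add-one smoothing avoids division by zero for unseen words.
        rel_ref = (ref_freq.get(word, 0) + 1) / (ref_total + 1)
        score = rel_doc / rel_ref
        if score >= min_weirdness:
            keywords[word] = score
    return keywords


# Invented reference counts, for illustration only.
reference = {"the": 6_000_000, "of": 3_000_000, "ontology": 50}
tokens = "the ontology of the ontology describes the ontology".split()
print(sorted(extract_keywords(tokens, reference, ref_total=100_000_000)))
```

In this toy input both "the" and "ontology" occur three times, but only "ontology" is rare in the reference counts, so only it passes the weirdness threshold.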
4. Statistical Term Finder
This plug-in runs after the Keyword Extractor. It examines the words neighbouring each keyword and identifies any recurring patterns. Input parameters include the neighbourhood size (the maximum distance from the keyword) and the weirdness threshold for inclusion. If a word consistently appears within the user-defined neighbourhood and meets the weirdness threshold, it is considered a potential new term.
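One way to picture the neighbourhood test is the sketch below. It counts how often each word falls within a fixed window around a keyword and keeps the pairs that recur and whose neighbour meets a weirdness threshold. Reading "consistently" as a minimum co-occurrence count is an assumption, as are the function name, the parameter names, and the example scores.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def candidate_terms(
    tokens: List[str],
    keywords: Set[str],
    weirdness: Dict[str, float],
    window: int = 2,             # neighbourhood size: max distance from the keyword
    min_weirdness: float = 10.0,  # weirdness threshold for inclusion
    min_count: int = 2,           # assumption: "consistently" means at least this often
) -> List[Tuple[str, str]]:
    """Return (keyword, neighbour) pairs that recur within the window."""
    pair_counts: Counter = Counter()
    for i, token in enumerate(tokens):
        if token not in keywords:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pair_counts[(token, tokens[j])] += 1
    return sorted(
        pair for pair, n in pair_counts.items()
        if n >= min_count and weirdness.get(pair[1], 0.0) >= min_weirdness
    )


text = "the readability metric and the readability formula use a readability metric"
scores = {"readability": 50.0, "metric": 80.0, "the": 1.0}  # invented scores
print(candidate_terms(text.split(), {"metric"}, scores, window=1))
```

With a window of one, "readability" appears next to "metric" twice and is weird enough, so the pair is proposed; "and" appears only once and has no score, so it is dropped.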
5. SimpleText Analyser
This plug-in uses a dictionary of words and phrases identified as verbose by either the Plain English Campaign or ASD Simplified Technical English. The dictionary contains a list of 1302 phrases and offers a selection of one or more alternatives for each. The SimpleText Analyser identifies these phrases within the text and produces potential replacements for the expression.
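The lookup can be sketched as a phrase match against a dictionary that maps each verbose expression to its alternatives. The three entries below are illustrative stand-ins, not taken from the 1302-phrase dictionary, and the function name and the longest-phrase-first matching are assumptions.

```python
import re
from typing import Dict, List, Tuple

# Illustrative entries only; keys must be lower case.
VERBOSE: Dict[str, List[str]] = {
    "in order to": ["to"],
    "prior to": ["before"],
    "utilise": ["use"],
}


def find_verbose(text: str, dictionary: Dict[str, List[str]]) -> List[Tuple[int, int, str, List[str]]]:
    """Return (start, end, matched text, alternatives) for each verbose phrase found."""
    # Longer phrases come first so they win over shorter ones at the same position.
    phrases = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(p) for p in phrases) + r")\b",
        re.IGNORECASE,
    )
    return [
        (m.start(), m.end(), m.group(0), dictionary[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


sample = "Prior to installation, utilise the wizard in order to configure GATE."
for start, end, phrase, alternatives in find_verbose(sample, VERBOSE):
    print(start, end, phrase, alternatives)
```

Each match keeps its character offsets so that a later step can annotate or replace the exact span.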
6. Annotation Controller
This plug-in is used to reduce the number of overlapping annotations by prioritising some annotations over others.
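The source does not state which annotations take priority, so the following is only a sketch of one possible scheme: annotation types are ranked in a caller-supplied order, and a lower-ranked annotation is dropped when it overlaps one already kept. The type names "Term" and "SimpleText" and the function name are hypothetical.

```python
from typing import List, Tuple

Annotation = Tuple[int, int, str]  # (start offset, end offset, annotation type)


def resolve_overlaps(annotations: List[Annotation], priority: List[str]) -> List[Annotation]:
    """Keep higher-priority annotations and drop any that overlap them."""
    rank = {ann_type: i for i, ann_type in enumerate(priority)}
    kept: List[Annotation] = []
    # Visit by priority first, then longer spans, then earlier start.
    ordered = sorted(annotations, key=lambda a: (rank[a[2]], -(a[1] - a[0]), a[0]))
    for start, end, ann_type in ordered:
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, ann_type))
    return sorted(kept)


annotations = [(0, 22, "Term"), (12, 22, "SimpleText"), (30, 38, "SimpleText")]
print(resolve_overlaps(annotations, priority=["Term", "SimpleText"]))
```

Here the "SimpleText" span at 12 to 22 is removed because it lies inside the higher-priority "Term" span, while the non-overlapping span at 30 to 38 survives.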
7. Readability Analyser
This plug-in stores the number of words, syllables, sentences, characters and polysyllabic words contained within a document. These values are used for the calculation of readability formulas such as the Flesch Reading Ease, Kincaid Formula, SMOG, ARI and Fog Index.
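For reference, the named formulas can be computed from those counts as sketched below, using the standard published coefficients. The function names are illustrative. "Kincaid Formula" is taken to mean the Flesch-Kincaid grade level, and the Fog Index here treats every polysyllabic word as a complex word, which is a simplification.

```python
import math


def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def smog(polysyllables: int, sentences: int) -> float:
    # Polysyllable count is normalised to a 30-sentence sample.
    return 1.0430 * math.sqrt(polysyllables * 30 / sentences) + 3.1291


def ari(characters: int, words: int, sentences: int) -> float:
    return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43


def fog_index(words: int, sentences: int, polysyllables: int) -> float:
    # Simplification: polysyllabic words stand in for "complex" words.
    return 0.4 * ((words / sentences) + 100 * (polysyllables / words))


# Invented counts for a short passage.
counts = dict(words=100, sentences=5, syllables=150, characters=500, polysyllables=10)
print(round(flesch_reading_ease(counts["words"], counts["sentences"], counts["syllables"]), 2))
print(round(fog_index(counts["words"], counts["sentences"], counts["polysyllables"]), 2))
```

Higher Flesch Reading Ease scores indicate easier text, whereas the other four measures estimate a school grade level, so higher values indicate harder text.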
8. Replacer
This plug-in substitutes the text in a document with user-selected SimpleText replacements. If there is no best replacement selected then the text is left unchanged.
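The substitution step can be sketched as follows, assuming each selection is a character span plus the chosen replacement, with None meaning that no best replacement was selected. The function name and data layout are assumptions for illustration.

```python
from typing import List, Optional, Tuple

Selection = Tuple[int, int, Optional[str]]  # (start, end, chosen replacement or None)


def apply_replacements(text: str, selections: List[Selection]) -> str:
    """Substitute each selected span; spans without a selection are left unchanged."""
    # Work from the end of the text so earlier offsets stay valid.
    for start, end, replacement in sorted(selections, key=lambda s: s[0], reverse=True):
        if replacement is not None:
            text = text[:start] + replacement + text[end:]
    return text


original = "Prior to installation, utilise the wizard."
chosen = [(0, 8, "Before"), (23, 30, None)]  # no replacement chosen for "utilise"
print(apply_replacements(original, chosen))
```

Applying the edits from right to left avoids having to recompute offsets after each substitution changes the length of the text.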
For further information on Readability Tools for GATE, please contact:
Readability Tools Support: Neil Newbold
Fax: +44 1483 876051




