Welcome to the TextMarker Wiki
The TextMarker system is an open source tool (https://sourceforge.net/projects/textmarker/) for developing rule-based information extraction applications. The development environment is based on the DLTK framework (http://www.eclipse.org/dltk/). It supports the knowledge engineer with a full-featured rule editor, components that explain the rule inference, and a build process for generic UIMA Analysis Engines and Type Systems (http://incubator.apache.org/uima/). TextMarker components can therefore be easily created and flexibly combined with other UIMA components in different information extraction pipelines.
TextMarker uses a specialized rule representation language for effective knowledge formalization. A rule of the TextMarker language is composed of a list of rule elements, each of which consists of four parts. The mandatory matching condition establishes a connection to the input document by referring to an already existing concept, that is, an annotation. The optional quantifier defines how often the matching condition may apply, similar to quantifiers in regular expressions. Additional conditions add constraints on the matched text fragment, and additional actions determine the consequences of the rule. TextMarker rules therefore match on a pattern of given annotations and, if the additional conditions evaluate to true, execute their actions, e.g. create a new annotation. If no initial annotations exist, for example annotations created by another component, a scanner is used to seed simple token annotations contained in a taxonomy.
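The four parts of a rule element can be illustrated with a small Python model. This is only a sketch of the idea described above, not the TextMarker rule syntax or its matching engine: the class names, the `apply_rule` function, the greedy matching without backtracking, and the token types `CW` and `SW` are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Annotation:
    type: str   # annotation type, e.g. an illustrative token type
    text: str   # covered text


Span = List[Annotation]


@dataclass
class RuleElement:
    match_type: str                      # mandatory matching condition
    min_count: int = 1                   # quantifier: lower bound
    max_count: int = 1                   # quantifier: upper bound
    conditions: List[Callable[[Span], bool]] = field(default_factory=list)
    actions: List[Callable[[Span], None]] = field(default_factory=list)


def apply_rule(rule: List[RuleElement], annotations: Span) -> int:
    """Apply the rule left to right; return how often it fired.

    Hypothetical simplification: each element matches greedily and
    there is no backtracking.
    """
    fired = 0
    start = 0
    while start < len(annotations):
        pos = start
        spans = []
        for element in rule:
            span = []
            while (pos < len(annotations)
                   and len(span) < element.max_count
                   and annotations[pos].type == element.match_type):
                span.append(annotations[pos])
                pos += 1
            if len(span) < element.min_count:
                break  # quantifier not satisfied
            if not all(cond(span) for cond in element.conditions):
                break  # an additional condition failed
            spans.append(span)
        if len(spans) == len(rule):
            # Every rule element matched: execute the actions.
            for element, span in zip(rule, spans):
                for action in element.actions:
                    action(span)
            fired += 1
            start = max(pos, start + 1)
        else:
            start += 1
    return fired


created: Span = []


def mark_author(span: Span) -> None:
    # Action: create a new annotation covering the matched span.
    created.append(Annotation("Author", " ".join(a.text for a in span)))


# One to five capitalized words followed by the word "wrote".
rule = [
    RuleElement("CW", min_count=1, max_count=5, actions=[mark_author]),
    RuleElement("SW", conditions=[lambda span: span[0].text == "wrote"]),
]
tokens = [
    Annotation("CW", "Peter"), Annotation("CW", "Kluegl"),
    Annotation("SW", "wrote"), Annotation("CW", "TextMarker"),
]
fired = apply_rule(rule, tokens)
```

In this sketch the rule fires once and `created` holds a single `Author` annotation covering "Peter Kluegl". The trailing "TextMarker" token is not annotated because no matching second element follows it.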
The TextMarker system provides functionality that is usually not found in similar systems. Actions can modify the document either by replacing or deleting text fragments or by filtering the view on the document. In the latter case, the rules ignore some annotations, e.g. HTML markup, or are executed only on the remaining text passages. The knowledge engineer can add heuristic knowledge by using scoring rules. In addition, several language elements common to scripting languages, such as conditional statements, loops, procedures, recursion, variables and expressions, increase the expressiveness of the language. Rules can directly invoke external rule sets or arbitrary UIMA Analysis Engines, and foreign libraries can be integrated through the extension mechanism for new language elements.
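The effect of filtering the view on the document can be sketched in a few lines of Python. The helpers `visible` and `find_sequence` and the token types `CW`, `SW` and `MARKUP` are hypothetical and chosen only for this illustration. They show the general idea that a pattern can match across markup once the markup is hidden, not how TextMarker implements filtering.

```python
from typing import List, Set, Tuple

Token = Tuple[str, str]  # (annotation type, covered text)


def visible(tokens: List[Token], filtered_types: Set[str]) -> List[Token]:
    """Return the view on the document with the filtered types hidden."""
    return [t for t in tokens if t[0] not in filtered_types]


def find_sequence(tokens: List[Token], pattern: List[str]) -> List[int]:
    """Return start indices where consecutive token types equal the pattern."""
    types = [t[0] for t in tokens]
    n = len(pattern)
    return [i for i in range(len(types) - n + 1) if types[i:i + n] == pattern]


tokens = [
    ("CW", "Text"), ("MARKUP", "<b>"), ("CW", "Marker"),
    ("MARKUP", "</b>"), ("SW", "rules"),
]
pattern = ["CW", "CW", "SW"]

# Without filtering, the markup interrupts the pattern.
unfiltered_hits = find_sequence(tokens, pattern)
# With markup hidden, the same pattern matches at the first visible token.
filtered_hits = find_sequence(visible(tokens, {"MARKUP"}), pattern)
```

Here `unfiltered_hits` is empty, while `filtered_hits` contains the index 0 of the filtered view.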
If you use the TextMarker system in academic research, then please cite the following paper as appropriate:
@inproceedings{2009:GSCL:KAP:TextMarker,
title = {TextMarker: A Tool for Rule-Based Information Extraction},
author = {Peter Kluegl and Martin Atzmueller and Frank Puppe},
booktitle = {Proceedings of the Biennial GSCL Conference 2009, 2nd UIMA@GSCL Workshop},
editor = {Christian Chiarcos and Richard Eckart de Castilho and Manfred Stede},
pages = {233-240},
publisher = {Gunter Narr Verlag},
year = {2009}
}