The tokeniser is a tool which takes a text as input and returns a list of tokens. The tokens can be orthographic words, numerals or punctuation marks.
The tokeniser is designed to work on Maltese texts.
The tokeniser can be used in two ways: online, through a graphical user interface, or integrated into other applications as a web service.
Online graphical user interface
A graphical user interface is available here. It offers different levels of tagging which can be applied to a given text.
Web service
The tokeniser is also available as a web service. The WSDL link is http://metanet4u.research.um.edu.mt/services/MtTokeniser?wsdl.
The service has one method which can be invoked:
- String tokenise(String text, Boolean tokenTags, String separator)
The method takes three parameters:
- text: the text that will be tokenised.
- tokenTags: a boolean. If tokenTags is true, each output token is wrapped in tags (e.g. <token> tagged_text </token>). If false, the tokens have no tags.
- separator: the string used to separate one token from the next in the output string.
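The effect of tokenTags and separator on the returned string can be illustrated with a small local sketch. It does not call the web service: format_tokens and parse_output are hypothetical helper names, the token list is hand-written rather than produced by the tokeniser, and the exact whitespace inside the tags is an assumption, since the description above does not specify it.

```python
import re


def format_tokens(tokens, token_tags, separator):
    """Mimic the described output format (illustrative, not the service's code)."""
    if token_tags:
        # Assumption: tags wrap each token with no extra whitespace.
        tokens = [f"<token>{t}</token>" for t in tokens]
    return separator.join(tokens)


def parse_output(output, token_tags, separator):
    """Recover a list of tokens from a string in the described format."""
    if token_tags:
        # Tolerate optional whitespace between the tags and the token text.
        return re.findall(r"<token>\s*(.*?)\s*</token>", output, flags=re.DOTALL)
    # Without tags, the separator is the only delimiter between tokens.
    return output.split(separator) if output else []


# Hand-written tokens, used only to show the two output shapes.
tokens = ["Dan", "test", "."]
tagged = format_tokens(tokens, True, "\n")
plain = format_tokens(tokens, False, " ")
```

With tokenTags set to false, a separator that can also occur inside a token makes the output ambiguous, so a client that needs to split the result reliably would probably either enable the tags or choose a separator such as a newline.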