ABSTRACT/Implementation

of the Formalism

for a Grammar Checker(1)


JAN HRIC

 

GOALS OF THE DEVELOPMENT

This paper describes the main goals and ideas of an implementation of a prototype system for supporting the design of grammars for natural languages and the development of the appropriate parsers. It is a part of a wider project for an automatic checking of grammatical correctness of texts in natural languages with a free word order. The applied formalism is based on list automata driven by rules with feature structures. A more detailed description can be found in [Hri].

The main goal of the development was to implement as fast as possible the formal means that enable an application and testing of the suggested linguistic ideas. Since further development of the relevant formalism, the syntax used and conventions introduced can be expected, flexibility of the system is desirable. A certain comfort for linguistic research, e.g. debugging the grammar and tracing computation processes, is necessary. The system should also provide good checking abilities and sensitivity to possible errors. Prolog was chosen for our implementation of the prototype. It facilitates a quick development of a flexible system, which should be able to work in an interactive as well as in a batch mode. The whole system is supposed to be reimplemented after the ideas crystallize.

Formalism for Error Checking. The device chosen for the shallow grammatical analysis (grammar checking) of natural languages with a free word order is a nondeterministic list automaton. The formal base for a classification of the list automata can be found in [Jančar et al. 1993].

Automaton Description. The main goal of the error checking module (ECM) is to distinguish grammatically well-formed and ill-formed sentences in the input text. The input of the module is the output of the lexical analyzer and the elements of the input list take the form of feature structures. After the processing is complete, the input list is reduced to its sentinels. The output list will contain no error marking symbol if the input list is recognized as free of errors.

The ECM module consists of a list automaton P, a list automaton N and a control module C.

The function of the positive automaton P is to decide (nondeterministically) which parts of the text at the input level are correct (wrt. the rules of the control grammar) and to delete these parts from the input level. If the input list is free of errors, the P-automaton will sequentially delete (in at least one of its computations) all inputs elements except sentinels. The task of the negative automaton N is to find an error in the input text, to record it and remove it. It is supposed that after deleting the correct parts

next page

  contents
 

JAN HRIC, IMPLEMENTATION OF THE FORMALISM . . .

54

 

of the text, the words which cause the error are quite near at the input list and so the error may be detected by the N-automaton. The N-automaton recognizes typical errors.

The C module is a control submodule of the ECM. It switches the control between the P and N automata until the input is empty. Different computations due to nondeterminism can finish with different results and records about errors. A composition of the results is a forthcoming task, which will be dealt with during the later stages of the project.

Grammar description. A formalism for controlling the deletion automata is a Deletion Automata Based Grammar (DABG).

DABG rules are a special kind of rewriting rules. The left part describes the requirements imposed on the input sequence and the right part is the result of the transformation which the rule represents. It is allowed only to delete or (in certain cases) to modify the input. The rule can contain a condition (constraint), which encodes additional requirements imposed on the rule.

Comparing the input FS with the left-hand side of the rule is based on a unification of the FSs.

Implementation

This section describes the most interesting parts of the system. It justifies the decisions used during the implementation.

Input format. We decided to separate the input form of the feature structures from an internal form of the implementation to achieve flexibility. The special symbols, logical conjunctions and operators used in grammar rules are defined as Prolog operators to enable direct reading of the input form. The advantages of the explicit transformation between the input and internal form are that internal structures are independent on the particular input syntax and therefore the kernel is more stable. Also the implementation oriented choice of internal structures simplified writing the prototype.

On the other hand, we had to write transformation procedures and to perform them on all inputs and outputs. Thus the users are separated from the implementation and a user interface had to be developed as a shell encapsulating the Prolog system.

Output formats. There are various types of outputs and they differ in volume and format of the information provided to the user. In general the listings of feature structures are very large. Thus we had to provide the users with the means to suppress the output of information irrelevant to their particular purposes. The user have to get an opportunity to adjust the amount of computation tracing information. It is possible to adjust output to the following levels:

previous page

next page

  contents
 

JAN HRIC, IMPLEMENTATION OF THE FORMALISM . . .

55

 

(i) only the final results, needed for the later phases of an error detection

(ii) use of certain rules during computation

Reduction of nondeterminism. The automaton is nondeterministic. The simulation of the nondeterminism is in depth-first fashion. The application of two rules P and R is independent, if there is no input FS which the left hand sides of both rules are applied to. If the application of two rules is independent, their use is commutative and we can choose the order of the application without damaging the completeness of the search.

Current state, future and conclusions

The system is now implemented in the LPA 386 Prolog in a DOS environment. We have implemented the kernel of the system, the input and the output. It is possible to develop the grammar and use it for the parsing. However, the system is slow, so this prototype is acceptable in debugging single rules, not in parsing whole sentences.(2)

The implementation of the prototype has been successful. It fulfils the main need of the linguists - to develop quickly a working prototype for the formalism they use.

 

REFERENCES

 

[AC] A.V. Aho, M.J. Corasick: Efficient String Matching: An Aid to Bibliographic Search, Communications of the ACM, 1975, Vol. 18, No. 6, pp. 333-340

[Col] A. Colmerauer: Les systémes-Q: un formalisme pour analyser et synthétiser des phrasese sur ordinateur, Groupe TAUM, Université de Montréal, 1971

[Hri] J. Hric: Implementation of the Formalism for a Grammar Checker, Proceeding of the Conference on the Practical Applications of Prolog, London, April 1994

[Jan] P.Jančar, F.Mráz, M.Plátek: A Taxonomy of Forgetting Automata, Proceedings of MFCS'93, Gdansk, Poland, August 1993

[O'K] R.A. O'Keefe: The Craft of Prolog, MIT Press 1990

[Smo] G. Smolka: Feature-Constraint Logics For Unification Grammars, Journal of Logic Programming, 1992, Vol. 12, Nos. 1 & 2, pp. 51-87

 

NOTES

 

1. This work is partially supported by Joint Research Project PECO 2824

2. The system works on SUN under compiled SICStus Prolog 15 times more quickly.

previous page

  contents