Vojtech Svatek - abstracts of selected papers (linked from his bibliography page, in reverse chronological order)

Belak V., Svatek V.: Supporting Self-Organization in Politics by the Semantic Web Technologies. In: E-Part 2010 - Electronic Government and Electronic Participation, Lausanne, 2010. Full paper. We present a use of knowledge technologies in support of self-organization of people with joint political goals. We argue for the use of semantic web technologies to enhance interoperability between eParticipation systems and to provide a better user experience. We claim that ontology-supported eParticipation may increase the impact of eParticipation projects on public policy, because it enables better linkage of users and sharing of knowledge across different systems. In order to enable these scenarios, we built a core eParticipation ontology in RDF/OWL. The suitability of this approach is preliminarily demonstrated through the design and implementation of a proof-of-concept social-semantic web application, Ontopolis.net. It is wholly backed by the ontology and thus demonstrates the possible openness of such an approach. This system leverages various knowledge technologies and resources, such as the WordNet thesaurus, in order to provide intelligent recommendation of content or users. Hence it is designed to help people establish groups centred around joint goals and interests, which may subsequently lead to the emergence of public initiatives and joint actions.

Nekvasil M., Svatek V., Novotny O.: Critical Success Factors of Semantic Applications. In: Znalosti 2010, 9th Czecho-Slovak Knowledge Engineering Conference, Jindrichuv Hradec 2010. Full paper. Semantic technologies have recently been covering broader and broader areas of application deployment, and their scope is constantly increasing. The possibilities of semantic applications are now so vast that they can no longer be judged as one market segment. These differences only augment the skepticism that arises from the uncertainty of investments in such technologies. This paper provides a possible approach to the categorization of semantic applications and subsequently sets out several critical success factors for the deployment of these technologies in a business environment. Last but not least, it outlines how maturity models of enterprises can be formed for preliminary assessment of investments into semantic applications.

Kliegr T., Svatek V., Ralbovsky M., Simunek M.: SEWEBAR-CMS: semantic analytical report authoring for data mining results. Journal of Intelligent Information Systems, 2010. Pre-final draft; final version via SpringerLink (Online First). SEWEBAR-CMS is a set of extensions for the Joomla! Content Management System (CMS) that extends it with functionality required to serve as a communication platform between the data analyst, domain expert and the report user. SEWEBAR-CMS integrates with existing data mining software through PMML. Background knowledge is entered via a web-based elicitation interface and is preserved in documents conforming to the proposed Background Knowledge Exchange Format (BKEF) specification. SEWEBAR-CMS offers web service integration with semantic knowledge bases, into which PMML and BKEF data are stored. Combining domain knowledge and mining model visualizations with results of queries against the knowledge base, the data analyst conveys the results of the mining through a semi-automatically generated textual analytical report to the end user. The paper demonstrates the use of SEWEBAR-CMS on a real-world task from the cardiological domain and presents a user study showing that the proposed report authoring support leads to a statistically significant decrease in the time needed to author the analytical report.

Kliegr T., Svatek V., Simunek M., Stastny D., Hazucha A.: XML Schema and Topic Map Ontology for Formalization of Background Knowledge in Data Mining. In: IRMLeS 2010 - Workshop on Inductive Reasoning and Machine Learning on the Semantic Web, at ESWC 2010. Paper in CEUR proceedings. Background knowledge (sometimes referred to as domain knowledge) is extensively used in data mining for data pre-processing and for nugget-oriented data mining tasks: it is essential for constraining the search space and pruning the results. Despite the costs of eliciting background knowledge from domain experts, there has so far been little effort to devise a common exchange standard for its representation. This paper proposes the Background Knowledge Exchange Format (BKEF), a lightweight XML Schema for storing information on features and patterns, and the Background Knowledge Ontology (BKOn) as its semantic abstraction. The purpose of BKOn is to allow reasoning over and integration of analysed data with existing domain ontologies. We show an elicitation interface producing BKEF and discuss the possibilities for integration of such background knowledge with domain ontologies.

Svatek V., Svab-Zamazal O., Vacura M.: Adapting Ontologies to Content Patterns using Transformation Patterns. In: WOP 2010 - Workshop on Ontology Patterns, at ISWC 2010. Paper in CEUR proceedings. Ontology content patterns are meant to be used not only for new ontologies but also for reengineering of existing ontologies. However, the modelling style of such ontologies often differs from the best-practice pattern that is to be imported to their root portions, which makes the integration of the two models time-consuming and error-prone. We explore how the recently developed PatOMat transformation framework could be applied to ease the adaptation of ‘legacy’ ontologies to a widely used content pattern from the OntologyDesignPatterns.org library. We also investigate the link between transformation choices and logical patterns as those earlier proposed by the W3C SWBPD Group.

Svatek V., Svab-Zamazal O., Iannone L.: Pattern-Based Ontology Transformation Service Exploiting OPPL and OWL-API. In: EKAW 2010 - Knowledge Engineering and Management by the Masses, Lisbon, 2010. Springer-Verlag. Full paper. Exploitation of OWL ontologies is often difficult due to their modelling style even if the underlying conceptualisation is adequate. We developed a generic framework and collection of services that make it possible to define and execute ontology transformation (in particular) with respect to modelling style. The definition of transformation is guided by transformation patterns spanning between mutually corresponding patterns in the source and target ontology, the detection of an instance of one leading to construction of an instance of the other. The execution of axiom-level transformations relies on the functionality of the OPPL processor, while entity-level transformations, including sophisticated handling of naming and treatment of annotations, are carried out directly through the OWL API. A scenario of applying the transformation in the specific context of ontology matching is also presented.

Vacura M., Svatek V.: Ontological Analysis of Human Relations for Semantically Consistent Transformations of FOAF Data. In: KIELD 2010 - Workshop on Knowledge Injection into and Extraction from Linked Data, at EKAW 2010. Paper in CEUR proceedings. The FOAF project has prominent importance for capturing human relations in Linked Data. We analyze the FOAF data structures and their extensions from the point of view of formal ontology and discuss problems inherent in its design. We also point out necessary considerations for transforming the FOAF data structures by supplying additional knowledge into them, while achieving/maintaining semantic consistency.

Vacura M., Svatek V.: Ontology Based Tracking and Propagation of Provenance Metadata. In: Networked Digital Technologies (NDT 2010), Prague, Springer-Verlag, 2010. Full paper. Tracking the provenance of application data is of key importance in the network environment due to the abundance of heterogeneous and controllable resources. We focus on ontologies as a means of knowledge representation and present a novel approach to the representation of provenance metadata in knowledge bases, relying on an OWL 2 design pattern. We also outline an abstract method of propagating provenance metadata during the reasoning process.

Svatek V., Svab-Zamazal O.: Entity Naming in Semantic Web Ontologies: Design Patterns and Empirical Observations. In: Znalosti 2010, 9th Czecho-Slovak Knowledge Engineering Conference, Jindrichuv Hradec 2010. Full paper. We systematically analyse the entity naming options over the structure of the OWL ontology language, both at the level of entity types (classes, properties and individuals) and simple structures such as inverse property or domain/range axioms. We attempt to distinguish what is good and bad practice in entity naming. Finally, we partially compare our assumptions with the reality of OWL ontology design.

Labsky M., Svatek V., Nekvasil M.: Multi-Paradigm and Multi-Lingual Information Extraction as Support for Medical Web Labelling Authorities. Journal of Systems Integration, 2010, Vol. 1, No. 4, pp. 3–12. ISSN 1804-2724. Full paper. Until recently, quality labelling of medical web content has been a pre-dominantly manual activity. However, the advances in automated text processing opened the way to computerised support of this activity. The core enabling technology is information extraction (IE). However, the heterogeneity of websites offering medical content imposes particular requirements on the IE techniques to be applied. In the paper we discuss these requirements and describe a multi-paradigm approach to IE addressing them. Experiments on multi-lingual data are reported. The research has been carried out within the EU MedIEQ project.

Svatek V., Kliegr T., Nemrava J., Ralbovsky M., Splichal J., Vejlupek T., Rocek V., Rauch J.: Building and Integrating Competitive Intelligence Reports Using the Topic Map Technology. In: Fifth International Conference on Topic Maps Research and Applications (TMRA 2009), Leipzig, Germany, November 2009. Full paper. Competitive intelligence supports the decision makers in understanding the competitive environment by means of textual reports prepared based on public resources. CI is particularly demanding in the context of larger business clusters. We report on a long-term project featuring large-scale manual semantic annotation of CI reports wrt. business clusters in several industries. The underlying ontologies are the result of collaborative editing by multiple student teams. The results of annotation are finally merged into CI maps that allow easy access to both the original documents and the knowledge structures.

Svab-Zamazal O., Scharffe F., Svatek V.: Preliminary Results of Logical Ontology Pattern Detection using SPARQL and Lexical Heuristics. In: WOP 2009 - Workshop on Ontology Patterns, at ISWC 2009. Paper in CEUR proceedings. Ontology design patterns were proposed in order to assist the ontology engineering task, providing models of specific constructions representing a particular form of knowledge. Various kinds of patterns have since been introduced and classes of patterns identified. Detecting these patterns in existing ontologies is needed in various scenarios, for example the detection of the two parts of an alignment pattern in an ontology matching scenario, or the detection of an anti-pattern in an optimization scenario. In this paper we present a novel method for the detection of logical patterns in ontologies. This method is based on both SPARQL, as the underlying language for retrieving patterns, and a lexical heuristic constraining the query. It extends our previous work on ontology pattern modeling and detection. We describe an algorithm computing a token-based similarity measure used as the lexical heuristic. We conduct an experiment on a large number of Web ontologies, obtaining interesting measures on the usage frequency of three selected patterns.
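
The abstract above does not say how the token-based similarity measure is computed. As a rough illustration only, the sketch below assumes a Jaccard overlap between the lowercased tokens of two entity names; the function names `tokenize` and `token_similarity` and the choice of Jaccard are assumptions made here, not the algorithm from the paper.

```python
import re


def tokenize(name):
    """Split an entity name into lowercased tokens.

    Handles camelCase, acronyms, digits, and separators such as
    underscores, hyphens and spaces (illustrative rules only).
    """
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return {part.lower() for part in parts}


def token_similarity(name_a, name_b):
    """Jaccard overlap of the two token sets (an assumed measure)."""
    tokens_a, tokens_b = tokenize(name_a), tokenize(name_b)
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


if __name__ == "__main__":
    # A subclass name that shares its head token with the superclass name.
    print(token_similarity("ConferencePaper", "Paper"))
    print(token_similarity("accepted_paper", "ConferencePaper"))
```

A score of this kind could serve as a filter on the candidate bindings returned by a structural query, keeping only those whose names are lexically related.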

Svatek V., Svab-Zamazal O., Presutti V.: Ontology Naming Pattern Sauce for (Human and Computer) Gourmets. In: WOP 2009 - Workshop on Ontology Patterns, at ISWC 2009. Paper in CEUR proceedings. Various explicit and implicit naming conventions for entities have emerged in ontological engineering realms during the decade/s of its existence. In the paper we argue that the naming principles are neither trivial nor completely haphazard in practice, present a preliminary categorisation of ontology naming patterns, and discuss the impact of entity naming on both human and computer perception of ontologies.

Kliegr T., Ralbovsky M., Svatek V., Simunek M., Jirkovsky V., Nemrava J., Zemanek J.: Semantic Analytical Reports: A Framework for Post-processing Data Mining Results. In: 18th International Symposium on Methodologies for Intelligent Systems (ISMIS 2009), Prague. Springer, LNCS 5722. Pre-final paper (final version available via SpringerLink). Intelligent post-processing of data mining results can provide valuable knowledge. In this paper we present the first systematic solution to post-processing that is based on semantic web technologies. The framework input is constituted by PMML and description of background knowledge. Using the Topic Maps formalism, a generic Data Mining ontology and Association Rule Mining ontology were designed. Through combination of a content management system and a semantic knowledge base, the analyst can enter new pieces of information or interlink existing ones. The information is accessible either via semi-automatically authored textual analytical reports or via semantic querying. A prototype implementation of the framework for generalized association rules is demonstrated on the PKDD’99 Financial Data Set.

Svab-Zamazal O., Svatek V., Scharffe F.: Pattern-Based Ontology Transformation Service. In: International Conference on Knowledge Engineering and Ontology Development (KEOD 2009), Funchal, Madeira, Portugal, October 2009. Full paper. Many use cases for semantic technologies (e.g. reasoning, modularisation, matching) could benefit from an ontology transformation service. This service is supported by ontology transformation patterns consisting of corresponding ontology patterns capturing alternative modelling choices, and an alignment between them. In this paper we present the transformation process together with its two constituents: a pattern detection process and an ontology transformation process. The pattern detection process is based on SPARQL and the transformation process is based on an ontology alignment representation with specific extensions regarding detailed information about the transformation.

Nekvasil M., Svatek V.: Towards Models for Judging the Maturity of Enterprises for Semantics. In: Workshop Econom 2009; part of BIS 2009 International Workshops, Poznan, Poland, April 27-29, 2009, Revised Papers. Lecture Notes in Business Information Processing, Vol. 37. Pre-final paper; final version is available via SpringerLink. In recent years, semantic technologies have been included in broader and broader areas of application deployment, and their scope has been constantly expanding. The differences amongst them, however, are often vast and the successes of such investments are uncertain. This work provides a possible approach to the categorization of semantic applications and uses it to formulate a set of critical success factors of the deployment of these technologies in a business environment. Finally, it outlines how it is possible to formulate the maturity models of enterprises for preliminary assessment of the investments into semantic applications.

Svab-Zamazal O., Svatek V.: Empirical Knowledge Discovery over Ontology Matching Results. In: IRMLeS 2009 - Workshop on Inductive Reasoning and Machine Learning on the Semantic Web, at ESWC 2009. Paper in CEUR proceedings. Analysis of ontology alignments, as sets of correspondences between entities, can reveal knowledge to be later fed back to the alignment process. We report on data mining experiments over 3-year results of the ‘conference’ track of the Ontology Alignment Evaluation Initiative. The discovered hypotheses express relationships among the matching tools used, the nature of source ontologies, the confidence measure of the returned correspondences, their actual correctness, and, notably, the participation of the correspondences in mapping patterns.

Labsky M., Svatek V., Nekvasil M., Rak D.: The Ex Project: Web Information Extraction using Extraction Ontologies. In: Berendt, B.; Mladenic, D.; de Gemmis, M.; Semeraro, G.; Spiliopoulou, M.; Stumme, G.; Svatek, V.; Zelezny, F. (Eds.): Knowledge Discovery Enhanced with Semantic and Social Information. Springer, Studies in Computational Intelligence, Vol. 220, 2009. Full paper. Extraction ontologies represent a novel paradigm in web information extraction (as one of ‘deductive’ species of web mining) allowing to swiftly proceed from initial domain modelling to running a functional prototype, without the necessity of collecting and labelling large amounts of training examples. Bottlenecks in this approach are however the tedium of developing an extraction ontology adequately covering the semantic scope of web data to be processed and the difficulty of combining the ontology-based approach with inductive or wrapper-based approaches. We report on an ongoing project aiming at developing a web information extraction tool based on richly-structured extraction ontologies and with additional possibility of (1) semi-automatically constructing these from third-party domain ontologies, (2) absorbing the results of inductive learning for subtasks where pre-labelled data abound, and (3) actively exploiting formatting regularities in the wrapper style.

Karkaletsis V., Stamatakis, K., Karampiperis, P., Labsky, M., Ruzicka, M., Svatek, V., Cabrera, E. A., Polla, M., Mayer, M. A., Villaroel Gonzales, D.: Management of Medical Website Quality Labels via Web Mining. In: Berka P., Rauch J., Zighed D.A. (eds.): Data Mining and Medical Knowledge Management: Cases and Applications. IGI Global Inc., 2009. Full paper. The WWW is an important channel of information exchange in many domains, including the medical one. The ever increasing amount of freely available healthcare-related information generates, on the one hand, excellent conditions for self-education of patients as well as physicians, but on the other hand entails substantial risks if such information is trusted irrespective of low competence or even bad intentions of its authors. This is why medical website certification (also called ‘quality labeling’) by renowned authorities is of high importance. In this respect, it recently became obvious that the labeling process could benefit from employment of web mining and information extraction techniques, in combination with flexible methods of web-based information management developed within the semantic web initiative. Achieving such synergy is the central issue in the MedIEQ project. The AQUA (Assisting QUality Assessment) system, developed within the MedIEQ project, aims to provide the infrastructure and the means to organize and support various aspects of the daily work of labeling experts.

Svab-Zamazal O., Svatek V.: Towards Ontology Matching via Pattern-Based Detection of Semantic Structures in OWL Ontologies. In: Znalosti 2009, 8th Czecho-Slovak Knowledge Engineering Conference, Brno 2009. Full paper. Ontology Matching (OM) is nowadays a lively area of Computer Science. There are several OM tools looking for correspondences between entities of ontologies. These correspondences are usually simple equivalence mapping pairs, class-to-class or property-to-property. In our work we concentrate on diverse kinds of semantic structures in ontologies in terms of their detection and mutual matching. For this kind of matching, not only equivalence relations but also other relations, and not only homogeneous but also heterogeneous correspondences, are envisaged. The paper is a report from the first phase of the work aiming at ontology matching via pattern-based detection of semantic structures in OWL ontologies. In this initial phase we mainly pay attention to n-ary relations and their discovery in OWL ontologies. Subsequent phases will consist in describing the conditions of semantic matching between semantic structures.

Zeman M., Ralbovsky M., Svatek V., Rauch J.: Ontology-Driven Data Preparation for Association Mining. In: Znalosti 2009, 8th Czecho-Slovak Knowledge Engineering Conference, Brno 2009. Full paper. Ontologies can convey domain semantics to various phases of a KDD application through a mapping established between ontology entities and columns of the data matrix. The approach implemented in the Ferda tool focuses on providing support for the data preparation phase. Information about important data values and column groupings, once injected into a domain ontology, can be repeatedly used for creating meaningful categories for attributes and for defining mining tasks producing association hypotheses well-interpretable in the domain context. Tests on real data have been carried out in the domain of cardiology.

Svatek V., Vacura M., Ralbovsky M., Svab-Zamazal O., Parsia B.: OWL Support for (Some) Non-Deductive Scenarios of Ontology Usage. In: OWL: Experiences and Directions (OWLED 2008), Fifth International Workshop, Karlsruhe, Germany, October 26-27, 2008, co-located with ISWC 2008. Full paper. Applications of ontologies exist that go beyond standard deductive reasoning and rather have the character of empirical discovery in knowledge/data. We analyse the inventory of OWL with respect to two such applications, namely to pattern-based ontology matching and to ontology-aware knowledge discovery from databases.

Kliegr T., Nemrava J., Svatek V., Rauch J., Nekvasil M., Ralbovsky M., Vejlupek T., Splichal J.: Semantic Annotation and Linking of Competitive Intelligence Reports for Business Clusters. In: First International Workshop on Ontology-supported Business Intelligence (OBI2008), co-located with ISWC2008. Full paper. Competitive intelligence (CI) is a sub-discipline of business intelligence that supports the decision makers in understanding the competitive environment by means of textual reports prepared based on public resources. CI is particularly demanding in the context of larger business clusters. We report on a long-term project featuring large-scale manual semantic annotation of CI reports wrt. business clusters in several industries. The underlying ontologies are the result of collaborative editing by multiple student teams. The results of annotation are finally merged into CI maps that allow easy access to both the original documents and the knowledge structures.

Svab-Zamazal O., Svatek V., Meilicke C., Stuckenschmidt, H.: Testing the impact of pattern-based ontology refactoring on ontology matching results. In: Third International Workshop on Ontology Matching (OM-2008) collocated with ISWC-2008. Full paper. We observe the impact of ontology refactoring, based on detection of name patterns in the ontology structure, on the results of ontology matching. Results of our experiment are evaluated using novel logic-based measures accompanied by an analysis of typical effects. Although the pattern detection method only covers a fraction of ontological errors, there seems to be a measurable effect on the quality of the resulting matching.

Kliegr T., Svatek V., Chandramouli K., Nemrava J., Izquierdo E.: Wikipedia As the Premiere Source for Targeted Hypernym Discovery. In: Wikis, Blogs, Bookmarking Tools - Mining the Web 2.0 (WBBTMine 08), Workshop at ECML PKDD 2008, 15 September 2008, Antwerp, Belgium. Full paper. Targeted Hypernym Discovery (THD) applies lexico-syntactic (Hearst) patterns on a suitable corpus with the intent to extract one hypernym at a time. Using Wikipedia as the corpus in THD has recently yielded promising results in a number of tasks. We investigate the reasons that make Wikipedia articles such an easy target for lexico-syntactic patterns, and suggest that it is primarily the adherence of its contributors to Wikipedia's Manual of Style. We propose the hypothesis that extractable patterns are more likely to appear in articles covering popular topics, since these receive more attention, including adherence to the rules from the manual. However, two preliminary experiments carried out with 131 and 100 Wikipedia articles do not support this hypothesis.
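
To make the idea of a lexico-syntactic (Hearst) pattern concrete, here is a minimal sketch that applies a single "X is a Y" pattern to the opening sentence of an article and returns one hypernym candidate. The regular expression, the stop-word list and the rule of taking the last word of the phrase as its head are simplifying assumptions for illustration; the THD system described in the paper is not reproduced here.

```python
import re

# One Hearst-style pattern: "<entity> is/was a/an/the <noun phrase>".
IS_A_PATTERN = re.compile(r"\b(?:is|was)\s+(?:a|an|the)\s+([^.,;]+)", re.IGNORECASE)

# Words assumed to end the noun phrase that carries the hypernym.
PHRASE_END = {"of", "in", "that", "which", "who", "from", "for", "on", "at", "by", "with"}


def discover_hypernym(sentence):
    """Return one hypernym candidate for the sentence, or None."""
    match = IS_A_PATTERN.search(sentence)
    if match is None:
        return None
    phrase = []
    for word in match.group(1).split():
        if word.lower() in PHRASE_END:
            break
        phrase.append(word)
    # Naive head selection: the last word of the noun phrase.
    return phrase[-1].lower() if phrase else None


if __name__ == "__main__":
    print(discover_hypernym("Antwerp is a city in Belgium."))  # city
```

The sketch only works on sentences that follow the definitional "is a" form closely, which is the kind of regularity the abstract attributes to Wikipedia's Manual of Style.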

Labsky M., Svatek V., Nekvasil M.: Information Extraction Based on Extraction Ontologies: Design, Deployment and Evaluation. In: Workshop on Ontology-Based Information Extraction Systems (OBIES-08) held within KI-08, Kaiserslautern, 23 September 2008. Paper in CEUR proceedings. Most IE methods do not provide easy means for integrating complex prior knowledge that can be provided by human experts. Such knowledge is especially valuable when there is little or no training data. In the paper we elaborate on the extraction ontology paradigm; the distinctive features of our system called Ex are 1) probabilistic reasoning over extractable attribute and instance candidates and 2) combination of the extraction ontology approach with the inductive and (to some degree) wrapper approach. We also discuss the issues related to the deployment and evaluation of applications based on extraction ontologies.

Svab-Zamazal O., Svatek V.: Analysing Ontological Structures through Name Pattern Tracking. In: EKAW 2008 - 16th International Conference on Knowledge Engineering and Knowledge Management. Springer LNCS. Full paper. Concept naming over the taxonomic structure is a useful indicator of the quality of design as well as source of information exploitable for various tasks such as ontology refactoring and mapping. We analysed collections of OWL ontologies with the aim of determining the frequency of several combined name&graph patterns potentially indicating underlying semantic structures. Such structures range from simple set-theoretic subsumption to more complex constructions such as parallel taxonomies of different entity types. The final goal is to help refactor legacy ontologies as well as to ease automatic alignment among different models. The results show that in most ontologies there is a significant number of occurrences of such patterns. Moreover, their detection even using very simple methods has precision sufficient for a semi-automated analysis scenario.

Vacura M., Svatek V., Smrz P.: A Pattern-based Framework for Representation of Uncertainty in Ontologies. In: 11th International Conference on Text, Speech and Dialogue. Brno, Czech Republic, September 8-12, 2008. Springer LNCS, to appear. Full paper. We present a novel approach to representing uncertain information in ontologies based on design patterns. We provide a brief description of our approach, present its use in the case of fuzzy information and probabilistic information, and describe the possibility of modelling multiple types of uncertainty in a single ontology. We also briefly present an appropriate fuzzy reasoning tool and define a complex ontology architecture for well-founded handling of uncertain information.

Vacura M., Svatek V., Saathoff C., Franz T., Troncy R.: Describing Low-Level Image Features Using The COMM Ontology. In: First ICIP Workshop on Multimedia Information Retrieval 2008, October 12, 2008, San Diego, California, U.S.A. Full paper. We present an innovative approach for storing and processing extracted low-level image features based on current Semantic Web technologies. We propose to use the COMM multimedia ontology as a “semantic” alternative to the MPEG-7 standard, which is at the same time largely compliant with it. We describe how COMM can be used directly or through its associated Java API.

Karkaletsis V., Karampiperis P., Stamatakis K., Labsky M., Ruzicka M., Svatek V., Mayer M.A., Leis A., Villarroel D.: Automating Accreditation of Medical Web Content. In: 5th Prestigious Applications of Intelligent Systems Conference (PAIS 2008), Greece. Incl. in Proc. ECAI'08, IOS Press, 2008. Full paper. The increasing amount of freely available health-related web content generates, on one hand, excellent conditions for self-education of patients as well as physicians, but on the other hand entails substantial risks if such information is trusted irrespective of low competence or even bad intentions of its authors. This is why medical web resources accreditation by renowned authorities is of high importance. However, various health web content surveys show that the proportion of accredited web resources is insufficient due to the difficulty of the labeling authorities to cope with the amount and dynamics of the medical web. In this paper, we address the problem of automating the accreditation of medical web content. To this end, we present a system which provides the infrastructure and the means to organize and support various aspects of the daily work of labeling experts, exploiting web content collection and information extraction techniques.

Praks P., Svatek V., Cernohorsky J.: Linear algebra for vision-based surveillance in heavy industry - convergence behavior case study. In: IEEE CBMI 2008 - Sixth International Workshop on Content-Based Multimedia Indexing. 18-20th June, 2008, Queen Mary, University of London, London, UK. pp. 346-352. Full paper. The surveillance application aims at improving the quality of technology via modelling human expert behaviour in the coking plant ArcelorMittal Ostrava, the Czech Republic. Video data on several industrial processes are captured by means of a CCD camera and classified using Latent Semantic Indexing (LSI) with respect to etalons classified by an expert. We also study the convergence behavior of the proposed partial eigenproblem-based dimension reduction technique and its ability for knowledge acquisition. In our cases, increasing the computational effort of the dimension reduction technique did not improve the quality of the retrieved results.

Nemrava J., Buitelaar P., Svatek V., Declerck T.: Text Mining Support for Semantic Indexing and Analysis of A/V Streams. In: Proc. of OntoImage, Workshop at LREC 2008, Marrakech, Morocco, May 2008. Full paper. The work described here concerns the use of complementary resources in sports video analysis; soccer in our case. Structured web data such as match tables with teams, player names, score goals, substitutions, etc. and multiple, unstructured, textual web data sources (minute-by-minute match reports) are processed with an ontology-based information extraction tool to extract and annotate events and entities according to the SmartWeb soccer ontology. Through the temporal alignment of the primary A/V data (soccer videos) with the textual and structured complementary resources, these extracted and semantically organized events can be used as indicators for video segment extraction and semantic classification, i.e. occurrences of particular events in the complementary resources can be used to classify the corresponding video segment, enabling semantic indexing and retrieval of soccer videos.

Kliegr T., Chandramouli K., Nemrava J., Svatek V., Izquierdo E.: Combining Image Captions and Visual Analysis for Image Concept Classification. In: The 9th Intl. Workshop on Multimedia Data Mining, held with ACM SIGKDD'08, Las Vegas, 2008. Full paper. We present a framework for efficiently exploiting free-text annotations as a complementary resource to image classification. A novel approach called Semantic Concept Mapping (SCM) is used to classify entities occurring in the text to a custom-defined set of concepts. SCM performs unsupervised classification by exploiting the relations between common entities codified in the Wordnet thesaurus. SCM exploits Targeted Hypernym Discovery (THD) to map unknown entities extracted from the text to concepts in Wordnet. We show how the result of SCM/THD can be fused with the outcome of Knowledge Assisted Image Analysis (KAA), a classification algorithm that extracts and labels multiple segments from an image. In the experimental evaluation, THD achieved an accuracy of 75%, and SCM an accuracy of 52%. In one of the first experiments with fusing the results of a free-text and image-content classifier, SCM/THD + KAA achieved a relative improvement of 49% and 31% over the text-only and image-content-only baselines.

Chandramouli K., Kliegr T., Nemrava J., Svatek V., Izquierdo E.: Query Refinement and User Relevance Feedback for Contextualized Image Retrieval. In: The 5th IET Visual Information Engineering 2008 Conference (VIE'08), China, 2008. Full paper. The motivation of this paper is to increase the user-perceived precision of results of Content Based Information Retrieval (CBIR) systems with Query Refinement (QR), Visual Analysis (VA) and Relevance Feedback (RF) algorithms. The proposed algorithms were implemented as modules in the K-Space CBIR system. The QR module discovers hypernyms for the given query from a free text corpus (Wikipedia) and uses these hypernyms as refinements for the original query. Extracting hypernyms from Wikipedia makes it possible to apply query refinement to more queries than in related approaches that use a static predefined thesaurus such as WordNet. The VA module uses the K-Means algorithm for clustering the images based on low-level features. The RF module uses the preference information expressed by the user to build user profiles by applying SOM-based supervised classification, which is further optimized by a hybrid Particle Swarm Optimization (PSO) algorithm. The experiments evaluating the performance of the QR and VA modules show promising results.

Nekvasil M., Svatek V., Labsky M.: Transforming Existing Knowledge Models to Information Extraction Ontologies. In: 11th International Conference (BIS'08), Innsbruck, May 5-7, 2008. Springer LNBIP 7. Draft paper (final version available via SpringerLink). Diverse types of structured domain models are nowadays in use in various contexts. On the one hand there are generic models, especially domain ontologies, which are typically used in applications with artificial intelligence (reasoning) flavor; on the other hand there are more specific models that only come to use in areas like software engineering or business analysis. Furthermore, the discipline of information extraction has invented very specific knowledge models called extraction ontologies, whose purpose is to help extract and semantically annotate textual data. In this paper we present a method of authoring extraction ontologies (more specifically, their abstract constituents called presentation ontologies) via reusing different types of other knowledge models, especially domain ontologies and UML models. Our priority is to maintain consistency between extracted data and those prior models.

Labsky M., Svatek V.: Combining Multiple Sources of Evidence in Web Information Extraction. In: 17th International Symposium on Methodologies for Intelligent Systems (ISMIS'08), Toronto, May 20-23, 2008. Springer LNCS 4994. Draft paper (final version available via SpringerLink). Extraction of meaningful content from collections of web pages with unknown structure is a challenging task, which can only be successfully accomplished by exploiting multiple heterogeneous resources. In the Ex information extraction tool, so-called extraction ontologies are used by human designers to specify the domain semantics, to manually provide extraction evidence, as well as to define extraction subtasks to be carried out via trainable classifiers. Elements of an extraction ontology can be endowed with probability estimates, which are used for selection and ranking of attribute and instance candidates to be extracted. At the same time, HTML formatting regularities are locally exploited.

Rak D., Svatek V., Fidalgo M., Alm O.: Detecting MeSH Keywords and Topics in the Context of Website Quality Assessment. In: The 1st International Workshop on Describing Medical Web Resources (DRMed 2008), held in conjunction with the 21st International Congress of the European Federation for Medical Informatics (MIE 2008), May 27, 2008, Goteborg, Sweden. Full paper. Automatic detection of keywords and general topics is a special-purpose auxiliary task in the website quality assessment process. We describe the approach to obtaining such information used in the MedIEQ project, discuss problems related to the type of human language used in medical websites, and illustrate them on examples.

Svatek V., Svab O.: Towards Retrieving Scholarly Literature via Ontological Relationships. In: Znalosti 2008, Bratislava, Slovakia, February 2008. Full paper. We analyse the problem of retrieving scientific literature related to a problem with complex description, and outline the skeleton of a solution. The proposed mixture of methods and approaches covers manual as well as automatic methods, with emphasis on community tagging, automated ontology learning from text and ontology mapping. Symbiosis of RDF/OWL and Topic Maps as underlying formalisms is foreseen. As a very simple proof of concept, relational annotation of five research papers has been carried out independently by two annotators, and the results were analysed.

Nemrava J., Buitelaar P., Simou N., Sadlier D., Svatek V., Declerck T., Cobet A., Sikora T., O’Connor N., Tzouvaras V., Zeiner H., Petrak J.: An Architecture for Mining Resources Complementary to Audio-Visual Streams. In: Workshop on Knowledge Acquisition from Multimedia Content (KAMC) at SAMT 2007, Genova. Full paper. In this paper we attempt to characterize resources of information complementary to audio-visual (A/V) streams and propose their usage for enriching A/V data with semantic concepts in order to bridge the gap between low-level video detectors and high-level analysis. Our aim is to extract cross-media feature descriptors from semantically enriched and aligned resources so as to detect finer-grained events in video. We introduce an architecture for complementary resource analysis and discuss domain dependency aspects of this approach related to our domain of soccer broadcasts.

Svatek V., Svab O.: Tracking Name Patterns in OWL Ontologies. In: International Workshop on Evaluation of Ontologies (EON-07) collocated with the 6th International Semantic Web Conference (ISWC-2007), Busan, Korea. Full paper. Analysis of concept naming in OWL ontologies with set-theoretic semantics could serve as partial means for understanding their conceptual structure, detecting modelling errors and assessing their quality. We carried out experiments on three existing ontologies from public repositories, concerning the consistency of very simple name patterns: a subclass name being a certain kind of parent class name extension, while considering thesaurus relationships. Several probable taxonomic errors were identified in this way.

Svab O., Svatek V.: In Vitro Study of Mapping Method Interactions in a Name Pattern Landscape. In: International Workshop on Ontology Matching (OM-07) collocated with the 6th International Semantic Web Conference (ISWC-2007), Busan, Korea. Full paper. Ontology mapping tools typically employ combinations of methods, the mutual effects of which deserve study. We propose an approach to the analysis of such combinations using synthetic ontologies. Initial experiments have been carried out for two string-based methods and one graph-based method. The most important target of the study was the impact of name patterns over taxonomy paths on the mapping results.

Stamatakis K., Metsis V., Karkaletsis V., Ruzicka M., Svatek V., Amigo Cabrera E., Polla E., Spyropoulos C.: Content Collection for the Labelling of Health-related Web Content. In: 11th Conference on Artificial Intelligence in Medicine (AIME 07), 7-11 July 2007, Amsterdam, The Netherlands. Draft paper (final version available via SpringerLink). As the number of health-related web sites in various languages increases, it is more than necessary to implement control mechanisms that give users an adequate guarantee that the web resources they are visiting meet a minimum level of quality standards. Based upon state-of-the-art technology in the areas of semantic web, content analysis and quality labeling, the AQUA system, designed for the EC-funded project MedIEQ, aims to support the automation of the labeling process in health-related web content. AQUA provides tools that crawl the web to locate unlabelled health web resources in different European languages, as well as tools that traverse websites, identify and extract information and, upon this information, propose labels or monitor already labeled resources. Two major steps in this automated labeling process are web content collection and information extraction. This paper focuses on content collection. We describe existing approaches, present the architecture of the content collection toolkit and how this is integrated within the AQUA system, and discuss our initial experimental results in the English language (six more languages will be covered by the end of the project).

Labsky M., Nekvasil M., Svatek V.: Towards Web Information Extraction using Extraction Ontologies and (Indirectly) Domain Ontologies. In: (Poster Paper) Int'l Conf. on Knowledge Capture (K-CAP'07), Whistler, BC, Canada, October 2007, ACM. Full paper. Extraction ontologies allow one to proceed swiftly from initial domain modelling to running a functional prototype of a web information extraction application. We investigate the possibility of semi-automatically deriving extraction ontologies from third-party domain ontologies.

Labsky M., Svatek V., Nekvasil M., Rak D.: The Ex Project: Web Information Extraction using Extraction Ontologies. In: Proc. PriCKL'07, ECML/PKDD Workshop on Prior Conceptual Knowledge in Machine Learning and Knowledge Discovery. Warsaw, Poland, October 2007. Also as post-proceedings to appear at Univ. Bari. Full paper; and its extended version for post-proceedings. Extraction ontologies represent a novel paradigm in web information extraction (as one of ‘deductive’ species of web mining), allowing one to proceed swiftly from initial domain modelling to running a functional prototype, without the necessity of collecting and labelling large amounts of training examples. Bottlenecks in this approach are, however, the tedium of developing an extraction ontology adequately covering the semantic scope of web data to be processed and the difficulty of combining the ontology-based approach with inductive or wrapper-based approaches. We report on an ongoing project aiming at developing a web information extraction tool based on richly-structured extraction ontologies and with the additional possibility of (1) semi-automatically constructing these from third-party domain ontologies, (2) absorbing the results of inductive learning for subtasks where pre-labelled data abound, and (3) actively exploiting formatting regularities in the wrapper style.

Nemrava J., Buitelaar P., Svatek V., Declerck T.: Event Alignment for Cross-Media Feature Extraction in the Football Domain. In: Int'l Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS'07), Santorini, Greece, June 6-8, 2007. Full paper. This paper describes an experiment in creating cross-media descriptors from football-related text and videos. We used video analysis results and combined them with several textual resources – both semi-structured (tabular match reports) and unstructured (textual minute-by-minute match reports). Our aim was to discover the relations among six video data detectors and their behavior during a time window that corresponds to an event described in the textual data. Based on this experiment we show how football events extracted from text may be mapped to and help in analysing corresponding scenes in video.

Svatek V., Vacura M., Labsky M., ten Teije A.: Modelling Web Service Composition for Deductive Web Mining. Computing and Informatics, Vol. 26, 2007, 255-279. Full paper. Composition of simpler web services into custom applications is understood as a promising technique for information requests in a heterogeneous and changing environment. This is also relevant for applications characterised as deductive web mining (DWM). We suggest using problem-solving methods (PSMs) as templates for composed services. We developed a multi-dimensional, ontology-based framework, and a collection of PSMs, which make it possible to characterise DWM applications at an abstract level; we describe several existing applications in this framework. We show that the heterogeneity and unboundedness of the web demand some modifications of the PSM paradigm used in the context of traditional artificial intelligence. Finally, as a simple proof of concept, we simulate automated DWM service composition on a small collection of services, PSM-based templates, data objects and ontological knowledge, all implemented in Prolog.

Svab O., Svatek V., Stuckenschmidt H.: A Study in Empirical and ‘Casuistic’ Analysis of Ontology Mapping Results. In: 4th European Semantic Web Conference (ESWC-2007), Innsbruck 2007. Springer LNCS 4519. Draft paper (final version available via SpringerLink). Many ontology mapping systems nowadays exist. In order to evaluate their strengths and weaknesses, benchmark datasets (ontology collections) have been created, several of which have been used in the most recent edition of the Ontology Alignment Evaluation Initiative (OAEI). While most OAEI tracks rely on straightforward comparison of the results achieved by the mapping systems with some kind of reference mapping created a priori, the 'conference' track (based on the OntoFarm collection of heterogeneous 'conference organisation' ontologies) instead encompassed multiway manual as well as automated analysis of mapping results themselves, with 'correct' and 'incorrect' cases determined a posteriori. The manual analysis consisted in simple labelling of discovered mappings plus discussion of selected cases ('casuistics') within a face-to-face consensus building workshop. The automated analysis relied on two different tools: the Drago system for testing the consistency of aligned ontologies and the LISp-Miner system for discovering frequent associations in mapping meta-data including the phenomenon of graph-based mapping patterns. The results potentially provide specific feedback to the developers and users of mining tools, and generally indicate that automated mapping can rarely be successful without considering the larger context and possibly deeper semantics of the entities involved.

Svab O., Svatek V.: Ontology Mapping Enhanced using Bayesian Networks. In: Znalosti 2007, Ostrava, TU Ostrava 2007. Full paper. Bayesian networks (BNs) can capture interdependencies among ontology mapping methods and thus possibly improve the way they are combined. We outline the basic idea behind our approach and show some experiments on ontologies from the OAEI ‘conference organisation’ collection. The possibility of modelling explicit mapping patterns in combination with methods is also discussed.

(This is a long version of the OM06 paper...)

Svab O., Svatek V.: Combining Ontology Mapping Methods Using Bayesian Networks. In: International Workshop on Ontology Matching collocated with the 5th International Semantic Web Conference (ISWC-2006), November 5, 2006: GA Center, Athens, Georgia, USA. Full paper. Bayesian networks (BNs) can capture interdependencies among ontology mapping methods and thus possibly improve the way they are combined. Experiments on ontologies from the OAEI collection are shown, and the possibility of modelling explicit mapping patterns in combination with methods is discussed.

(This is a short pre-version of the Znal07 paper...)

Labsky M., Svatek V.: On the Design and Exploitation of Presentation Ontologies for Information Extraction. In: ESWC'06 Workshop on Mastering the Gap: From Information Extraction to Semantic Representation, Budva, Montenegro, June, 2006. Full paper. The structure of ontologies that are considered as input to information extraction is mostly rather simple. In this paper we report on our ongoing effort of using rich ontologies with numerous constraints over the information to be extracted. Important aspects of the approach are the coupling of user-defined ontologies with other sources of knowledge such as training data and document formatting structures, and the distinction between proper domain ontologies and so-called presentation ontologies, where the latter (as 'pragmatic bridges' over the 'semantic gap') can partially be derived from the former. The extraction tool under construction builds on experience from an ongoing application in the domain of product catalogue analysis, and attempts to offer high flexibility with respect to availability of various input information sources.

Svatek V., Rauch J., Ralbovsky M.: Ontology-Enhanced Association Mining. In: Ackermann M. et al., eds., Semantics, Web, and Mining, Springer Verlag, LNCS 4289, 2006. Draft paper (final version available via SpringerLink). The roles of ontologies in KDD are potentially manifold. We track them through different phases of the KDD process, from data understanding through task setting to mining result interpretation and sharing over the semantic web. The underlying KDD paradigm is association mining tailored to our 4ft-Miner tool. Experience from two different application domains (medicine and sociology) is presented throughout the paper. Envisaged software support for prior knowledge exploitation via customisation of an existing user-oriented KDD tool is also discussed.

Labsky M., Svatek V., Svab O., Praks P., Kratky M., Snasel V.: Information Extraction from HTML Product Catalogues: from Source Code and Images to RDF. In: 2005 IEEE/WIC/ACM International Conference on Web Intelligence (WI'05), Compiegne 2005. Full paper. We describe an application of information extraction from company websites focusing on product offers. A statistical approach to text analysis is used in conjunction with different ways of image classification. Ontological knowledge is used to group the extracted items into structured objects. The results are stored in an RDF repository and made available for structured search.

Svatek V., Vacura M.: Automatic Composition of Web Analysis Tools: Simulation on Classification Templates. In: First International Workshop on Representation and Analysis of Web Space (RAWS-05). Full paper. Template-based composition of web services is considered a useful middle way between their manual 'programming in the large' and fully automatic 'AI-planning-style' composition. This is also relevant for applications analysing the content and structure of the web space. As a simple proof of concept, we simulate this approach on a collection of services, templates, data objects and ontological knowledge, all implemented in Prolog. The underlying task is multi-way recognition of sites containing pornography, understood as an instance of the classification task.

Kratky M., Andrt M., Svatek V.: XML Query Support for Web Information Extraction: A Study on HTML Element Depth Distribution. In: First International Workshop on Representation and Analysis of Web Space (RAWS-05). Full paper. Knowledge-based web information extraction methods can achieve very high precision in restricted domains; they are however slow and suffer from performance degradation beyond their specific domain. We thus plan to adapt an existing XML storage and query engine to act as efficient pre-processor for such methods. The critical point of the approach is the amount of information provided as XML environment of the start-up terms/elements. For this purpose, we carried out a statistical analysis of depth distribution in the WebTREC collection.

Kavalec M., Svatek V.: A Study on Automated Relation Labelling in Ontology Learning. In: P. Buitelaar, P. Cimiano, B. Magnini (eds.), Ontology Learning and Population, IOS Press, 2005. Full paper. Ontology learning from texts has been proposed as a technology helping ontology designers in the modelling process. Within ontology learning, the discovery of non-taxonomic relations is understood as the problem least addressed. We propose a technique for extraction of lexical items that may give a cue in assigning semantic labels to otherwise 'anonymous' non-taxonomic relations. The technique has been implemented as an extension to the existing Text-to-Onto tool. Experiments have been carried out on a collection of texts describing tour destinations as well as on a semantically annotated general corpus. The paper also discusses evaluation aspects of relation labelling, among which the distinction between prior and posterior precision appears to be the most important.

Svatek V., ten Teije A., Vacura M.: Web Service Composition for Deductive Web Mining: A Knowledge Modelling Approach. In: Znalosti 2005, High Tatras 2005. Full paper. Composition of simpler web services into custom applications is understood as a promising technique for information requests in a heterogeneous and changing environment. This is also relevant for applications analysing the content and structure of the web. We discuss the ways the problem-solving-method approach studied in artificial intelligence can be adopted for template-based service composition for this problem domain; the main focus is on the classification task.

Labsky M., Praks P., Svatek V., Svab O.: Multimedia information extraction from HTML product catalogues. In: Workshop on Databases, Texts, Specifications and Objects (DATESO'05), Ostrava 2005. Full paper. We describe a demo application of information extraction from company websites, focusing on bicycle product offers. A statistical approach (Hidden Markov Models) is used in combination with different ways of image classification, including latent semantic analysis of image collections. Ontological knowledge is used to group the extracted items into structured objects. The results are stored in an RDF repository and made available for structured search.

Nemrava J., Svatek V.: Text mining tool for ontology engineering based on use of product taxonomy and web directory. In: Workshop on Databases, Texts, Specifications and Objects (DATESO'05), Ostrava 2005. Full paper. This paper presents our attempt to build a text mining tool for collecting specific words – verbs in our case – that usually occur together with a particular product category, as support for ontology designers. As ontologies are a cornerstone for the success of the semantic web, our effort is focused on building small and specialized ontologies concerning one product category and describing its frequent relations in common text. We describe the way we use web directories to obtain suitable information about the products from the UNSPSC taxonomy and we propose a method for further processing of the extracted information.

Svatek V., Labsky M., Vacura M.: Knowledge Modelling for Deductive Web Mining. In: 14th International Conference on Knowledge Engineering and Knowledge Management (EKAW 2004), Whittlebury Hall, Northamptonshire, UK. Draft paper (final version will be available via SpringerLink). Knowledge-intensive methods that can altogether be characterised as deductive web mining (DWM) already act as supporting technology for building the semantic web. Reusable knowledge-level descriptions may further ease the deployment of DWM tools. We developed a multi-dimensional, ontology-based framework, and a collection of problem-solving methods, which make it possible to characterise DWM applications at an abstract level. We show that the heterogeneity and unboundedness of the web demand some modifications of the problem-solving method paradigm used in the context of traditional artificial intelligence.

Kavalec M., Svatek V.: Relation Labelling in Ontology Learning: Experiments with Semantically Tagged Corpus. In: EKAW 2004 Workshop on the Application of Language and Semantic Technologies to support Knowledge Management Processes, Whittlebury Hall, Northamptonshire, UK. Full paper. Ontology learning from text can be viewed as an auxiliary technology for knowledge management application design. We proposed a technique for extraction of lexical entries that may give a cue in assigning semantic labels to otherwise 'anonymous' non-taxonomic relations. In this paper we present experiments on the semantically annotated corpus SemCor, and compare them with previous experiments on plain texts.

Svatek V., Snasel V.: Formal Model of Meta-Information Acquisition from Information Resources. In: Workshop on Information Technology - Applications and Theory (ITAT2004), High Tatras 2004. Full paper. An outline of a formal model describing the acquisition of ‘meta-information’ on information resources is proposed, which should make it possible to compare the quality of different analysis procedures. It is illustrated on an example from website analysis.

Svab O., Labsky M., Svatek V.: RDF-Based Retrieval of Information Extracted from Web Product Catalogues. In: SIGIR'04 Semantic Web Workshop, Sheffield. Full paper. Extraction of relevant data from the raw source of HTML pages poses specific requirements on their subsequent RDF storage and retrieval. We describe an application of a statistical information extraction technique (Hidden Markov Models) on product catalogues, followed by conversion of extracted data to RDF format and their structured retrieval. The domain-specific query interface, built on top of the Sesame repository, offers a simple form of navigational retrieval. Integration of further web-analysis methods, within the Rainbow architecture, is forthcoming.

Cespivova H., Rauch J., Svatek V., Kejkula M., Tomeckova M.: Roles of Medical Ontology in Association Mining CRISP-DM Cycle. In: ECML/PKDD04 Workshop on Knowledge Discovery and Ontologies, Pisa. Full paper. We experimented with introduction of medical ontology and other background knowledge into the process of association mining. The inventory used consisted of the LISp-Miner tool, the UMLS ontology, the STULONG dataset on cardiovascular risk, and a set of simple qualitative rules. The experiment suggested that an ontology may bring benefits to all phases of the KDD cycle as described in CRISP-DM.

Labsky M., Svatek V., Svab O.: Types and Roles of Ontologies in Web Information Extraction. In: ECML/PKDD04 Workshop on Knowledge Discovery and Ontologies, Pisa. Full paper. We discuss the diverse types and roles of ontologies in web information extraction and illustrate them on a small study from the product offer domain. Attention is mainly paid to the impact of domain ontologies, presentation ontologies and terminological taxonomies.

Svab O., Svatek V., Kavalec M., Labsky M.: Querying the RDF: Small Case Study in the Bicycle Sale Domain. In: Workshop on Databases, Texts, Specifications and Objects (DATESO'04), Ostrava 2004. Full paper. We examine the suitability of RDF, RDF Schema (as simple ontology language), and RDF repository Sesame, for providing the back-end to a prospective domain-specific web search tool, targeted at the offer of bicycles and their components. Actual data for the RDF repository are to be extracted by analysis modules of a distributed knowledge-based system named Rainbow. Attention is paid to the comparison of different query languages and to the design of application-specific templates.

Svatek V.: Design Patterns for Semantic Web Ontologies: Motivation and Discussion. In: 7th Conference on Business Information Systems, Poznaň 2004. Full paper. The relatively high level of standardisation of semantic web ontology languages is in contrast to mostly ad hoc designed content of ontologies themselves. An overview of existing methods supporting ontology content creation is presented. Methods based on design patterns are then discussed in more detail as they seem most promising particularly for business environment. Examples of elementary problems typical for semantic web ontologies are shown, and their pattern–based solution is outlined.

Ruzicka M., Svatek V.: Mark-up based analysis of narrative guidelines with the Stepper tool. In: Symposium on Computerized Guidelines and Protocols (CGP-04), Praha 2004. IOS Press. Full paper. The Stepper tool was developed to assist a knowledge engineer in developing a computable version of narrative guidelines. The system is document-centric: it formalises the initial text in multiple user-definable steps corresponding to interactive XML transformations. In this paper, we report on experience obtained by applying the tool on a narrative guideline document addressing unstable angina pectoris. Possible role of the tool and associated methodology in developing a guideline-based application is also discussed.

Svatek V., Riha A., Peleska J., Rauch J.: Analysis of guideline compliance – a data mining approach. In: Symposium on Computerized Guidelines and Protocols (CGP-04), Praha 2004. IOS Press. Full paper. While guideline-based decision support is safety-critical and typically requires human interaction, offline analysis of guideline compliance can be performed to a large extent automatically. We examine the possibility of automatic detection of potential non-compliance followed up with (statistical) association mining. Only frequent associations of non-compliance patterns with various patient data are submitted to a medical expert for interpretation. The initial experiment was carried out in the domain of hypertension management.

Kavalec M., Maedche A., Svatek V.: Discovery of Lexical Entries for Non-Taxonomic Relations in Ontology Learning. In: SOFSEM – Theory and Practice of Computer Science, Springer LNCS 2932, 2004. Full paper. Ontology learning from texts has recently been proposed as a new technology helping ontology designers in the modelling process. Discovery of non-taxonomic relations is understood as the least tackled problem therein. We propose a technique for extraction of lexical entries that may give a cue in assigning semantic labels to otherwise 'anonymous' relations. The technique has been implemented as an extension to the existing Text-to-Onto tool, and tested on a collection of texts describing worldwide geographic locations from a tour-planning viewpoint.

Svatek V., Ruzicka M.: Step-by-step formalisation of medical guideline content. International Journal of Medical Informatics, 2003, 70, 2–3, 329–335. Full paper. Approaches to formalisation of medical guidelines can be divided into model-centric and document-centric. While model-centric approaches dominate in the development of clinical decision support applications, document-centric, mark-up-based formalisation is suitable for application tasks requiring the 'literal' content of the document to be transferred into the formal model. Examples of such tasks are logical verification of the document or compliance analysis of health records. The quality and efficiency of document-centric formalisation can be improved using a decomposition of the whole process into several explicit steps. We present a methodology and software tool supporting the step-by-step formalisation process. The knowledge elements can be marked up in the source text, refined to a tree structure with increasing level of detail, rearranged into an XML knowledge base, and, finally, exported into the operational representation. User-definable transformation rules make it possible to automate a large part of the process. The approach is being tested in the domain of cardiology. For parts of the WHO/ISH Guidelines for Hypertension, the process has been carried out through all the stages, to the form of an executable application, generated automatically from the XML knowledge base.

Svatek V., Berka P., Kavalec M., Kosek J., Vavra V.: Discovering company descriptions on the web by multiway analysis. In: New Trends in Intelligent Information Processing and Web Mining (IIPWM'03), Zakopane 2003. Springer-Verlag, 'Advances in Soft Computing' series, 2003. Full paper. We investigate the possibility of web information discovery and extraction by means of a modular architecture analysing separately the multiple forms of information presentation, such as free text, structured text, URLs and hyperlinks, by independent knowledge-based modules. First experiments in discovering a relatively easy target, general company descriptions, suggest that web information can be efficiently retrieved in this way. Thanks to the separation of data types, individual knowledge bases can be much simpler than those used in information extraction over unified representations.

Labsky M., Svatek V.: Ontology Merging in Context of Web Analysis. In: Workshop on Databases, Texts, Specifications and Objects (DATESO'03), Ostrava 2003. Full paper (zipped). The Rainbow system aims at the analysis of websites by means of distributed modules specialized in particular types of data, such as free text, HTML structures or link topology. In order to ease the integration of services offered by the individual modules, which may come from third parties, a collection of ontologies has been developed. Parts of the ontologies contain information specific to the different ways of analysis, resulting in a need for integration. This paper describes how ontology merging, namely the FCA-Merge method, may be used to integrate the results of multiple analyses for a certain application domain.

Svatek V., Kosek J., Labsky M., Braza J., Kavalec M., Vacura M., Vavra V., Snasel V.: Rainbow - Multiway Semantic Analysis of Websites. In: 2nd International DEXA Workshop on Web Semantics (WebS03), Prague 2003, IEEE Computer Society Press. Full paper. The Rainbow project aims at the development of a reusable, modular architecture for web (particularly, website) analysis. Individual knowledge-based modules separately analyse different types of web data and communicate the results via a web-service interface. The output of analysis has the form of classes (of web resources) predefined in an ontology, extracted text, and/or addresses of retrieved web resources. Within the project, several original methods of analysis as well as (analytic) knowledge acquisition have been developed. The current domains of investigation are sites of small organisations offering products or services, and pornography sites. The paper is the first systematic overview of the diverse methods developed or envisaged in Rainbow.

Ruzicka M., Svatek V.: An interactive approach to rule-based transformation of XML documents. In: Datakon 2003, the annual database conference, Brno 2003, 277-288. Full paper. Transformation of XML documents is typically understood as non-interactive. In contrast, we formulate the specific task of XML-based transformation of knowledge contained in semi-formal documents, which heavily depends on human understanding of element content and thus requires frequent user intervention. Yet, many aspects of this process are pre-determined, and their automation is highly desirable. We implemented a software tool (called Stepper) supporting interactive step-by-step transformation of 'knowledge blocks'. The transformation is governed by rules expressed in a new 'interactive transformation' language (called XKBT), while its non-interactive aspects are handled by embedded XSLT rules.

Svatek V., Ruzicka M.: Step-by-step Mark-up of Medical Guideline Documents. In: (Surjan G. et al., eds.) Health Data in the Information Society. Proceedings of MIE2002, Budapest 2002. IOS Press, 591-595. Full paper (zipped PostScript). The quality of document-centric formalisation of medical guidelines can be improved by decomposing the whole process into several explicit steps. We present a methodology and a software tool supporting the step-by-step formalisation process. The knowledge elements can be marked up in the text with increasing level of detail, rearranged into an XML knowledge base and exported into the operational representation. Semi-automated transitions can be specified by means of rules. The approach has been tested in a hypertension application.

Svatek V., Kosek J., Braza J., Kavalec M., Klemperer J., Berka P.: Framework and Tools for Multiway Extraction of Web Metadata. In: Information Systems Modelling, Roznov 2002. Full paper. We outline a generic conceptual framework for automated extraction of semantic metadata on the web, and present the results of experiments aiming at the development of an integrated multiway architecture for the metadata extraction task. The architecture will consist of separate modules, each specialised in a different type of data, cooperating via a SOAP-based message passing protocol.

Kavalec M., Svatek V.: Information Extraction and Ontology Learning Guided by Web Directory. In: ECAI Workshop on NLP and ML for Ontology engineering. Lyon, 2002. Full paper. The paper presents our ongoing effort to create an information extraction tool for collecting general information on products and services from the free text of commercial web pages. A promising approach is that of combining information extraction with ontologies. Ontologies can improve the quality of information extraction and, on the other hand, the extracted information can be used to improve and extend the ontology. We describe the way we use Open Directory as training data, analyse this resource from the ontological point of view, present some preliminary results related to information extraction, and outline our plans for building and deploying the ontology.

Lin V., Rauch J., Svatek V.: Content-based Retrieval of Analytic Reports. In: International Workshop on Rule Markup Languages for Business Rules on the Semantic Web. Sardinia, 2002. Full paper, Slides. Analytic reports are special textual documents containing condensed results from a data mining process. Embedded knowledge enables the interpretation of the reports by automated procedures, which opens the way to content-based retrieval. We elaborate the technique for statistical association rules as a specific form of discovered knowledge, demonstrate its formal apparatus on examples from the medical domain, and outline the perspectives of sharing and reusing the content of analytic reports over the Semantic Web.

Riha A., Svatek V., Nemec P., Zvarova J.: Medical guideline as prior knowledge in electronic healthcare record mining. In: 3rd International Conference on Data Mining Methods and Databases for Engineering, Finance and Other Fields, 25-27 September 2002, Bologna, Italy. WIT Press, to appear. Full paper (zipped MS Word). We investigate the possibility of a two-step approach to electronic healthcare record mining, in the context of analysing the compliance of healthcare practice with standards formulated in medical guidelines. Non-compliance patterns detected in the process of guideline-based data pre-processing provide additional attributes for subsequent association rule mining. The approach has been preliminarily tested on databases of hypertensive patients from different Czech hospitals. It should help reveal causes of frequent non-compliance; its sensitivity, however, depends on the quality of guideline formalisation, on the eligibility of patients for the given guideline, and on the coverage of datasets.

Svatek V., Kroupa T., Ruzicka M.: Guide-X - a Step-by-step, Markup-Based Approach to Guideline Formalisation. In: First European Workshop on Computer-based Support for Clinical Guidelines and Protocols, Leipzig 2000. IOS Press, 2001, 97-114. Full paper. The main difficulties of converting the original textual form of medical guidelines to a computer-tractable form are connected both with the ambiguity of the natural language text and with the complexity of the resulting formal (and operational) representation. Proceeding directly from one to the other is thus an extremely demanding task. The proposed Guide-X methodology addresses this problem by breaking the whole process of guideline operationalisation into several steps, each of which requires a different mixture of types (medical, knowledge representation, typographical) and degrees of expertise. The principal technology used is that of XML tagging (using both pre-existing and newly developed languages). The result of each step is connected, element-by-element, to the results of previous steps, thus making the verification and revision of the operationalisation process easier. The methodology is currently being tested in the field of hypertension treatment, within the framework of the Medical Guideline Technology project of the EU Fourth Framework Programme.

Kavalec M., Svatek V., Strossa P.: Web Directories as Training Data for Automated Metadata Extraction. In: Semantic Web Mining, Workshop at ECML/PKDD-2001, Freiburg 2001. Full paper. In this paper, we analyse the possibility of reusing the knowledge embedded in the structure of web directories in order to obtain labelled training data for Web Information Extraction with limited human effort.

Svatek V., Riha A., Zika T., Zvarova J., Jirousek R., Zdrahal Z.: Informal, Formal and Operational Modelling of Medical Guidelines. In: Hruska T., Hashimoto M. (eds.): Knowledge-Based Software Engineering. IOS Press 2000, 9-16. Full paper. Formal representation and automatic processing of medical guidelines is one of the foremost challenges in applied knowledge engineering. We describe a new, flexible model of medical guideline, which seems to be particularly suitable for analysis of guidelines with respect to patient records. To provide for clear transition between the analysis and design phases of development of guideline-processing software, we formulate the model at various levels: informal, formal and operational. For the first one, we use structured text plus diagrams, while for the latter two, the OCML (Operational Concept Modelling Language) seems to be suitable, since it enables both formal checking of concept definitions and execution of operational specifications. The model is currently being tested in the hypertension domain.

Svatek V., Zvarova J., Jirousek R.: A Two-Tiered Model of Medical Guideline. In: Hasman A., Blobel B., Dudeck J., Engelbrecht R., Gell G., Prokosch H.-U. (eds.): Telematics in Health Care - Medical Infobahn for Europe, MIE2000/GMDS2000 [CD-ROM]. Quintessenz Verlag, Berlin 2000. ISSN: 1616-2463. Full paper. One of the most important issues in designing a formal model for representing medical guidelines is the trade-off between modularity and compactness. We briefly analyse this problem, review some existing approaches, and suggest a two-tiered model that relies on a secondary structure superposed on the set of fine-grained knowledge modules. We hypothesise that such a model is particularly suitable for tasks related to long-term patient management with respect to high-level, loosely structured guidelines. The model is currently being operationalised and tested in the hypertension treatment domain.

Svatek V., Kavalec M.: Supporting Case Acquisition and Labelling in the Context of Web Mining. In: (Zighed D., Komorowski J., Zytkow J.:) Principles of Data Mining and Knowledge Discovery - PKDD2000. LNAI 1910, Springer Verlag 2000, 626-631. Full paper. Case acquisition and labelling are important bottlenecks for predictive data mining. In the web context, a cascade of supporting techniques can be used, from general ones such as user interfaces, through filtering based on keyword frequency, to web-specific techniques exploiting public search engines. We show how a synergistic application of multiple techniques can be helpful in obtaining and pre-processing textual data, in particular for ILP-based web mining. The (two-fold) learning task itself consists in the construction and disambiguation of categorisation rules, which are to process the results returned by web search engines.

Svatek V., Berka P.: URL as starting point for WWW document categorisation. In: (Mariani J., Harman D.:) RIAO'2000 - Content-Based Multimedia Information Access, CID, Paris, 2000, 1693-1702. Full paper. Information about the category (type) of a WWW page can be helpful for the user within search, filtering, as well as navigation tasks. We propose a multidimensional categorisation scheme, with the bibliographic dimension as the primary one. We examine the possibilities and limits of performing such categorisation based on information extracted from the URL, which is particularly useful for certain on-line applications such as meta-search or navigation support. In addition, we describe the problem of ambiguity of URL terms, and suggest a method for partially overcoming it by means of machine learning. As a side-effect, we show that general purpose WWW search engines can be used for providing input data for both human and computational analysis of the web.

Sramek D., Berka P., Kosek J., Svatek V.: Improving WWW Access - from Single-Purpose Systems to Agent Architectures? In: (Cerri S., Dochev D., eds.) Artificial Intelligence: Methodology, Systems, and Applications. Berlin : Springer Verlag, 2000, 167-178. Full paper. Sophisticated techniques from various areas of Artificial Intelligence can be used to improve access to the WWW; the most promising ones stem from Data Mining and Knowledge Modeling. We describe the process of building two experimental systems: the VSEved system for intelligent meta-search, and the VSEtecka system for navigation support. We discuss our experience from this process, which seems to justify the hypothesis that the Multi-Agent paradigm can improve the efficiency of web access tools in the future. In this respect, we outline a web-oriented multi-agent architecture.

Vojtech Svatek, last update February 27, 2010