Faceted Search for Discovering Software

paper, specified "long paper"
Authorship
  1. 1. Jan Odijk

    Utrecht University

Work text
This plain text was ingested for the purpose of full-text search, not to preserve original formatting or readability. For the most complete copy, refer to the original conference program.


Introduction

Enabling the easy discovery of resources is an important service for Digital Humanities scholars. In CLARIN, the Virtual Language Observatory (VLO) serves this purpose, but it is currently mostly suited for the discovery of data. Discovering software is not so easy in the current VLO. In order to address this issue, we present a proposal for faceted search in metadata for software, which is based on a CLARIN Component Metadata Infrastructure (CMDI) profile for the description of software that enables discovery of the software and formal documentation of aspects of the software. We have tested the profile and the faceted search based on this profile by making metadata for over 80 pieces of software, and by creating an implementation of the faceted search. It is available on the web.

http://portal.clarin.nl/clariah-tools-fs

We propose to add this faceted search to the VLO, and show how metadata curation software, combined with provided metadata curation files, can curate existing metadata descriptions for software using other profiles to make them suited for such faceted search (implemented for 280 other pieces of software, also available on the web

http://portal.clarin.nl/clariah-tools-fs-global

).

Metadata Profile ClarinSoftwareDescription

The ClarinSoftwareDescription (CSD) profile

http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/profiles/clarin.eu:cr1:p_1342181139640/xsd

enables one to describe information about software in accordance with the CMDI metadata framework

(Broeder et al., 2012).

The profile has been set up in such a way that it enables (1) the description of properties that support discovery of the resource, and (2) the description of properties for documenting the resource, in as formal a manner as possible.

Since the focus of this paper is on the faceted search for software, we only briefly describe some aspects of the profile.

The profile consists of the CMDI components Generalinfo, SoftwareFunction, SoftwareImplementation, Access, ResourceDocumentation, SoftwareDevelopment, TechnicalInfo, Service and LRS.

The
SoftwareFunction
component enables one to describe the function of the software in terms of the closed vocabulary elements
tool category
,
tool tasks
,
research phase(s)
(for which it is most relevant),
research domains
and, for the linguistics domain, relevant
linguistic subdisciplines
for which it was originally developed.

The
SoftwareImplementation
component enables one to describe information for users on the implementation and installation of the software. The most important components are for the description of the
interface
, the
input
and the
output
of the software.

The
Service
component (CLARIN-NL Web Service description) is intended for describing properties of web services. It is compatible with the CLARIN CMDI core model for Web Service description version 1.0.2.

The
LRS
component is intended for the description of the properties of a particular task for the CLARIN Language Resource SwitchBoard (CLRS,

(Zinn,

2016)). It is our viewpoint that specifications for an application for inclusion in the CLRS registry
should be derivable from the metadata for this application. We devised a script to turn a CSD-compatible metadata record that contains an LRS component into the format required for the CLRS and tested it successfully with the Frog web service and application (Van den

Bosch et al., 2007).

Semantics

Many of the profile’s components, elements and their possible values have a semantic definition by a link to an entry in the CLARIN Concept Registry (CCR,

(Schuurman et al.,

2016)). For the ones that were lacking we created definitions and provided other relevant information required for inclusion into the CCR. These are currently being evaluated by the CCR coordinators for inclusion in the CCR. After our submission to the CCR, we made some new modifications to the profile, so there are new elements and values for which the semantics does not yet exist.

Metadata Descriptions using the CSD profile

We have described more than 80 software resources with the CSD profile. Describing these software resources resulted in various improvements of earlier versions of the profile. Most descriptions started from the information contained in an unformalized software overview. The information from this overview was semi-automatically converted to CMDI metadata in accordance with the CSD profile. The resulting descriptions were further extended and then submitted to the original developers and CLARIN Centres that host the resources for corrections and/or additions.

Faceted Search

A major purpose of metadata is to facilitate the discovery of resources. An important instrument for this purpose in CLARIN is the Virtual Language Observatory (VLO, (

Van Uytvanck,

2014)). The VLO offers faceted search for resources through their metadata, but its faceted search is fully tuned to the discovery of
data
. For this reason, we defined a new faceted search, specifically tuned to discovery of
software
. This faceted search offers
search
facets and
display
facets:

Search Facets
LifeCycleStatus, ResearchPhase, toolTask, researchDomain, linguisticsSubject, inputLanguage, applicationType, NationalProject, CLARINCentre, input modality, licence

Display Facets
name, title, version, inputMimetype, outputMimetype, outputLanguage, Country, Description, ResourceProxy, AccessContact, ProjectContact, CreatorContact, Documentation, Publication, sourcecodeURI, Project, MDSelfLink, OriginalLocation

Furthermore, all metadata profiles for the description of software must be able to provide the values for the facets. That is the case to a large extent, though some metadata curation is needed in some cases and existing values must be mapped to a restricted vocabulary for use in the faceted search.

Curation of existing metadata for software

The basic idea is as follows: we create a new standardised metadata record for all software descriptions, in principle each time a record is harvested. This metadata record contains the components and elements that are required for the faceted search as defined above. The record is constructed from the original CMDI record for the resource, combined with the data for this resource contained in a curation file, by a script. The curation file can be used to add information that was lacking or only present in an unformalised way, and it can be used to map existing values to other values, e.g. to restrict them to a specific closed vocabulary. We report on our experiments with such a curation file for the
WebLichtWebService
profile, since it was most needed and most complex for this profile. Over 280 WebLicht Web Services can now be found with the faceted search.

Concluding Remarks

The faceted search is publicly available and in use by digital humanities researchers. We already received feedback from users for further improvements, which we hope to make in the course of 2019. We will also describe some problems we encountered, which we only briefly mention here: (1) definition of closed vocabularies (2) several technical problems related to the CLARIN Component Registry (3) lack of good CMDI metadata editors. Finally, we will identify some future work, in particular on deriving CLRS registry entries for CLAM-based applications and web services,

https://proycon.github.io/clam/

for extracting metadata information from independent initiatives such as codemeta.

https://codemeta.github.io/

References

Daan Broeder, Menzo Windhouwer, Dieter Van Uytvanck, Twan Goosen, and Thorsten Trippel
. (2012). CMDI: A component metadata infrastructure. In
Proceedings of the LREC workshop ‘Describing LRs with Metadata: Towards Flexibility and Interoperability in the Documentation of LR’.
, pages 1–4, Istanbul, Turkey. European Language Resources Association (ELRA).

Davor Ostojic, Go Sugimoto, and Matej Ďurčo
. (2017). The curation module and statistical analysis on VLO metadata quality. In Selected papers from the CLARIN Annual Conference 2016, Aix-en-Provence, 2628 October 2016, number 136 in Linköping Electronic Conference Proceedings, pages 90–101. Linköping University Electronic Press, Linköpings Universitet.

Ineke Schuurman, Menzo Windhouwer, Oddrun Ohren, and Daniel Zeman
. (2016). CLARIN Concept Registry: The New Semantic Registry. In Koenraad De Smedt, editor, Selected Papers from the CLARIN Annual Conference 2015, October 1416, 2015, Wroclaw, Poland, number 123 in Linköping Electronic Conference Proceedings, pages 62–70, Linköping, Sweden. CLARIN, Linköping University Electronic Press.

http://www.ep.liu.se/ecp/article.asp?issue=123&article=004

.

A. van den Bosch, G.J. Busser, W. Daelemans, and S. Canisius
. (2007). An efficient memory-based morphosyntactic tagger and parser for Dutch. In F. Van Eynde, P. Dirix, I. Schuurman, and V. Vandeghinste, editors,
Selected Papers of the 17th Computational Linguistics in the Netherlands Meeting
, pages 99–114. Leuven, Belgium.

Dieter Van Uytvanck.
(2014). How can I find resources using CLARIN? Presentation held at the
Using CLARIN for Digital Research
tutorial workshop at the
2014 Digital Humanities Conference
, Lausanne, Switzerland.

https://www.clarin.eu/sites/default/files/CLARIN-dvu-dh2014_

VLO.pdf

, July.

Claus Zinn
. (2016). The CLARIN language resource switchboard.

https://www.clarin.eu/

sites/default/files/08%20-%20ZINN-Lg-Sw-Board.pdf

. Presentation at the CLARIN 2016 Annual Conference.

If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.

Conference Info

In review

ADHO - 2019
"Complexities"

Hosted at Utrecht University

Utrecht, Netherlands

July 9, 2019 - July 12, 2019

436 works by 1162 authors indexed

Series: ADHO (14)

Organizers: ADHO