The CodeSharing Protocol for TEI Markup

Version 1.0

Martin Holmes, University of Victoria, 2013.

This document describes a protocol through which any repository containing documents encoded in TEI XML (or any other schema) may make examples of XML encoding available to any harvesting tool, or, through a form-based interface, to anyone interested in examining encoding practices. I recently presented the project at the TEI 2014 Conference in Evanston (PDF of paper), and the project itself is housed on SourceForge at https://sourceforge.net/projects/codesharing/.

This is not a formal specification, so it eschews the customary definitions of modal verbs and key adjectives.

Definitions

Repository
Any collection of XML-encoded documents, stored in an XML database, a conventional database, or on a file system.
Service
A server responding on a particular URL to requests formatted according to the CodeSharing protocol, returning results in accordance with the description below.
Harvester
An automated application which queries a system implementing the CodeSharing API to retrieve lists of XML elements or other information provided through the API.
User
A non-automated client who uses a human-friendly interface based on the CodeSharing API to retrieve examples of encoding.

Use cases

There are currently three primary use-cases for an implementation of this protocol.

Easy access to example encodings for project workers

Large-scale encoding projects are increasingly common in the digital humanities. It is not unusual for five or six encoders to be working simultaneously on a document collection, with different levels of skill and experience. Good documentation and training are essential, but many people learn more effectively by looking at examples. A human-friendly front-end for a CodeSharing service, such as the HTML-form-based implementation included in the current codebase, can be a very useful tool for working encoders, enabling them to find examples of the current usage of tags and attributes on which to base their encoding.

Statistical and survey work across projects

Organizations with a strong investment in TEI encoding, including the TEI itself, will be interested in the possibility of querying multiple collections of TEI documents for information about the use of specific tags and attributes. For instance, the TEI Council, in its role as maintainer and updater of the Guidelines, often needs to know how a particular element is being used "in the wild" when considering how or whether to make a change to its definition or values.

A source of examples for the TEI Guidelines

TEI Council members working on improving and expanding the TEI Guidelines are often in search of realistic uses of elements and attributes to insert as examples into the Guidelines text and specifications. Repositories implementing a CodeSharing service would provide easily accessible sources for such examples.

HTTP requests

CodeSharing requests are HTTP requests, submitted using the HTTP GET or POST methods. Since requests cannot by the nature of the CodeSharing protocol be very long, it is unlikely that normal length-limits for GET will be exceeded, so it is more practical to use GET than POST because GET requests can easily be bookmarked.

There is a single base URL for all requests. This base URL does not need to be included in any of the responses to requests, since it is impossible for the requester to retrieve a response without already knowing the URL; the base URL must be advertised to potential users and harvesters using other means, such as links on the project website.

In what follows, we will use the example base URL http://mapoflondon.uvic.ca/codesharing(.xml|.htm), which is in fact a working prototype.

Requests made to the base URL must also include a list of keyword arguments in the form of key = value pairs. Keys and values are described below. Responses to requests on the base URL take the form of a TEI XML document, whose parameters are described below.
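As an illustration of the request format described above, the following minimal Python sketch assembles a GET request URL from key = value pairs. The function name build_request_url is hypothetical, and the ".xml" form of the base URL is an assumption read from the (.xml|.htm) notation above; neither is part of the protocol.

```python
from urllib.parse import urlencode

# Example base URL from the text; the ".xml" suffix is an assumption
# based on the "(.xml|.htm)" notation used for the prototype.
BASE_URL = "http://mapoflondon.uvic.ca/codesharing.xml"


def build_request_url(base_url, **params):
    """Return base_url followed by a percent-encoded query string."""
    # Leave out parameters that were not supplied, so the URL stays short.
    supplied = {key: value for key, value in params.items() if value is not None}
    if not supplied:
        return base_url
    return base_url + "?" + urlencode(supplied)


url = build_request_url(
    BASE_URL,
    verb="getExamples",
    elementName="div",
    namespace="http://www.tei-c.org/ns/1.0",
)
print(url)
```

Because the whole request lives in the URL, the resulting string can be bookmarked or passed to any HTTP client unchanged.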

In addition to the base URL for XML responses, a repository may also advertise a second URL on which requests can be made, and which will provide responses in the form of an HTML page with output formatted for easy reading. Such an HTML page may also include an HTML form, enabling a user to query the server more easily. An example of such an interface is available at http://mapoflondon.uvic.ca/codesharing.htm. The key = value parameters are identical when querying the XML or HTML URLs of a repository.

Request parameters

The following is a list of the keys and values that may be included in the request, and a brief description of the response the server should provide to each. The exact format of the response is treated in a later section.

Key: verb

The verb key is the primary component of the request. It can take the following values:

Note that identify is the default value for this key, and if the key is absent, or has no value, identify is assumed. This means that a request to the base URL with no parameters at all is the same as a request with verb = identify.
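The defaulting rule can be sketched on the service side as follows. This is only one possible reading, written in Python for illustration; the function name effective_verb is hypothetical.

```python
from urllib.parse import parse_qsl


def effective_verb(query_string):
    """Return the verb a service should act on for a raw query string."""
    # keep_blank_values=True so that "verb=" is seen as a key with no value
    # rather than being silently dropped by the parser.
    params = dict(parse_qsl(query_string, keep_blank_values=True))
    # An absent key and an empty value both fall back to identify.
    return params.get("verb", "").strip() or "identify"


print(effective_verb(""))                   # no parameters at all
print(effective_verb("verb="))              # key present, no value
print(effective_verb("verb=listElements"))  # explicit verb
```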

Key: elementName

The value of this parameter is the local name of an XML element. Where verb = getExamples, the service responds with a set of examples of the element requested in the namespace which is specified. For example, this request:

?verb=getExamples&elementName=div&namespace=http%3A%2F%2Fwww.tei-c.org%2Fns%2F1.0

will retrieve a set of examples of the TEI <div> element, if it exists in documents in the repository.

Key: attributeName

The value of this parameter is the local name of an XML attribute. Where verb = getExamples, the service responds with a set of examples of the attribute requested in the namespace which is specified. Attributes are always returned in the context of their parent elements. For example, this request:

?verb=getExamples&attributeName=ref&namespace=http%3A%2F%2Fwww.tei-c.org%2Fns%2F1.0

will retrieve a set of examples of elements which bear the TEI @ref attribute, if it exists in documents in the repository. Note: the same conditions regarding namespaces apply to this key as to the verb = listAttributes parameter discussed above.

When this parameter is combined with the elementName parameter, only attributes found in the context of the named element will be returned. For instance, this request:

?verb=getExamples&elementName=hi&attributeName=rend&namespace=http%3A%2F%2Fwww.tei-c.org%2Fns%2F1.0

will retrieve only TEI <hi> elements which bear the attribute @rend.

Key: attributeValue

The value of this parameter is the value of an XML attribute. Where verb = getExamples, and an attributeName parameter is provided, the service responds with a set of examples of the attribute with the value requested in the namespace which is specified. Attributes are always returned in the context of their parent elements. For example, this request:

?verb=getExamples&attributeName=type&attributeValue=simple&namespace=http%3A%2F%2Fwww.tei-c.org%2Fns%2F1.0

will retrieve a set of examples of elements which bear the TEI @type attribute set to the value simple, if it exists in documents in the repository.

When the two attribute parameters are combined with the elementName parameter, only attributes found in the context of the named element will be returned. For instance, this request:

?verb=getExamples&elementName=list&attributeName=type&attributeValue=simple&namespace=http%3A%2F%2Fwww.tei-c.org%2Fns%2F1.0

will retrieve only TEI <list> elements whose @type attribute has the value simple.
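The ways in which elementName, attributeName and attributeValue combine in a getExamples request can be sketched from the client side in Python. The helper get_examples_query is hypothetical, and the check that attributeValue is accompanied by attributeName is a choice made for this sketch, based on the description above; the protocol does not say how a service treats an attributeValue sent on its own.

```python
from urllib.parse import urlencode

TEI_NS = "http://www.tei-c.org/ns/1.0"


def get_examples_query(element_name=None, attribute_name=None,
                       attribute_value=None, namespace=TEI_NS):
    """Build the query string for a getExamples request."""
    # Sketch-level check: a value is only meaningful for a named attribute.
    if attribute_value is not None and attribute_name is None:
        raise ValueError("attributeValue needs an accompanying attributeName")
    pairs = [("verb", "getExamples")]
    if element_name is not None:
        pairs.append(("elementName", element_name))
    if attribute_name is not None:
        pairs.append(("attributeName", attribute_name))
    if attribute_value is not None:
        pairs.append(("attributeValue", attribute_value))
    pairs.append(("namespace", namespace))
    return "?" + urlencode(pairs)


# Any element bearing @type="simple":
print(get_examples_query(attribute_name="type", attribute_value="simple"))
# Only <list> elements bearing @type="simple":
print(get_examples_query("list", "type", "simple"))
```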

Key: documentType

The value of this parameter is a string token. If this parameter is supplied, then the results of other requests will be filtered such that only elements and attributes from documents which belong to the specified document type are returned. A harvester or human user may discover what documentType values are available in the collection by submitting a request with verb = listDocumentTypes. The documentType parameter is an optional feature of the API, and services working with collections which do not have document type categories may choose not to provide it.

Key: namespace

The value of this parameter is a namespace URL. In the case of a GET request, note that certain characters in the URL (such as colons and slashes) will be escaped using hexadecimal representations in order to comply with RFC 2396. The namespace parameter provides the namespace context within which other requests operate, as described above.
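The escaping of the namespace value can be reproduced with the Python standard library, as in this short sketch. Passing safe="" is what causes the slashes to be escaped as well as the colon.

```python
from urllib.parse import quote, unquote

TEI_NS = "http://www.tei-c.org/ns/1.0"

# safe="" means that no reserved characters are left unescaped.
encoded = quote(TEI_NS, safe="")
print(encoded)  # http%3A%2F%2Fwww.tei-c.org%2Fns%2F1.0

# A service reverses the escaping before comparing namespaces.
decoded = unquote(encoded)
print(decoded == TEI_NS)  # True
```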

Key: wrapped

The key wrapped may have two values:

If wrapped = true, the elements found by the service in response to the rest of the request will be returned in the context of their parent element. For example, this request:

?verb=getExamples&elementName=hi&wrapped=true&namespace=http%3A%2F%2Fwww.tei-c.org%2Fns%2F1.0

would result in a list of examples of elements which contain the TEI <hi> element (such as <p>). This is particularly useful for learning about the typical usage of an element and the contexts in which it is found.
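The difference between wrapped and unwrapped results can be sketched with Python's xml.etree.ElementTree. This is not the XQuery used by the sample implementation; the function find_examples is hypothetical, and returning each parent only once, however many matching children it has, is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

SAMPLE = (
    '<body xmlns="http://www.tei-c.org/ns/1.0">'
    "<head><hi>Title</hi></head>"
    "<p>Some <hi>first</hi> and <hi>second</hi> text.</p>"
    "</body>"
)


def find_examples(root, local_name, wrapped=False):
    """Return matching elements, or their parents when wrapped is true."""
    target = "{%s}%s" % (TEI_NS, local_name)
    if not wrapped:
        return list(root.iter(target))
    # ElementTree has no parent pointers, so look for elements
    # that have the target among their direct children.
    return [parent for parent in root.iter()
            if any(child.tag == target for child in parent)]


root = ET.fromstring(SAMPLE)
plain = find_examples(root, "hi")
in_context = find_examples(root, "hi", wrapped=True)
print(len(plain), [el.tag.split("}")[1] for el in in_context])
```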

Key: maxItemsPerPage

The value of this key is a positive integer. This enables a harvester to specify the maximum number of items which it is prepared to process in one operation. However, it is only a request. The service is free to impose its own maximum number of items per page on its responses, and this may override the number requested in this parameter. This is important, because it would be relatively easy to overwhelm a server providing CodeSharing services for a large repository by requesting a response containing thousands of elements.

Note: this parameter works in conjunction with the paging functionality described below.
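A service might reconcile the requested page size with its own limit along the following lines. The limit of 50 and the name effective_page_size are hypothetical, and falling back to the service limit when the value is missing or invalid is an assumption of this sketch, not a rule of the protocol.

```python
SERVICE_MAX_ITEMS = 50  # hypothetical limit chosen by the service


def effective_page_size(requested, service_max=SERVICE_MAX_ITEMS):
    """Return the page size actually used for a response."""
    try:
        wanted = int(requested)
    except (TypeError, ValueError):
        # Missing or non-numeric value: use the service's own limit.
        return service_max
    if wanted < 1:
        # The key is defined as a positive integer.
        return service_max
    # The request is only a request; the service limit always wins.
    return min(wanted, service_max)


print(effective_page_size("10"))    # within the limit
print(effective_page_size("5000"))  # capped by the service
```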

HTTP responses

Responses to requests on the base URL of the service (XML responses) must conform with the following guidelines:

There are no rules or guidelines regarding the format of a response to a request on the HTML URL of a CodeSharing service; any human-readable page which includes the response information in some format is acceptable.

Data points

The following data points are returned in the <front> element of the TEI response file. Each data point is identified by an @xml:id attribute beginning with "cs_" (for "CodeSharing"). The data itself comprises the text content of the element. There is no constraint on what type of element should be used, as long as it has the correct @xml:id. Some data is required, and some is optional. Many of these data points simply echo back input parameters.
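Because data points are identified only by their @xml:id, a harvester can collect them without knowing which elements carry them. The Python sketch below shows one way to do this. The sample response is invented for illustration: cs_nextUrl is named later in this document, but cs_verb is a hypothetical identifier standing in for an echoed input parameter, and example.org is a placeholder host.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
# ElementTree exposes xml:id under the fixed XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Invented response fragment; cs_verb is a hypothetical identifier.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <front>
      <div>
        <p xml:id="cs_verb">getExamples</p>
        <p xml:id="cs_nextUrl">http://example.org/codesharing.xml?verb=getExamples&amp;elementName=hi</p>
        <p xml:id="intro">Not a data point.</p>
      </div>
    </front>
    <body/>
  </text>
</TEI>"""


def read_data_points(xml_text):
    """Map each cs_ identifier found in <front> to its text content."""
    root = ET.fromstring(xml_text)
    front = root.find(".//{%s}front" % TEI_NS)
    points = {}
    if front is None:
        return points
    for element in front.iter():
        identifier = element.get(XML_ID, "")
        if identifier.startswith("cs_"):
            # The data is the text content, whatever the element type.
            points[identifier] = "".join(element.itertext()).strip()
    return points


data_points = read_data_points(SAMPLE)
print(sorted(data_points))
```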

Required data items

Optional data items

Format of results

Results are always returned in the <body> of the TEI XML result document. They are formatted as follows:

Paging of results

As mentioned above, where there are more results than can fit on one page, the result document will always contain an element with the @xml:id cs_nextUrl, so that the user or harvester can retrieve a series of result pages until the entire result set has been gathered, if required. Assuming that all the parameters for the subsequent request can be encoded in the cs_nextUrl value, the server need not maintain a session record and the entire API can function in a stateless manner, as does the sample XQuery implementation running on the Map of Early Modern London website. However, if it suits the implementer to maintain state on the server, to avoid (for instance) having to do expensive queries multiple times, then there is nothing to prevent this.

Nevertheless, since the protocol is intended to function in a stateless manner, it is possible that the underlying data may change between requests in a series. If, for instance, a document is deleted between the first and second requests for examples of an element, and the number of such examples is thereby reduced, the second page of results may not be what was expected; in fact it may be empty. This is not regarded as problematic; there is no requirement that a series of result pages, when amalgamated, need precisely reflect a particular coherent state of the repository.
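A harvester's paging loop can be sketched in Python as follows. The names next_url and harvest are hypothetical, and the fetch argument is any caller-supplied function that takes a URL and returns the response document as a string. The page cap and the check for repeated URLs are defensive choices made for this sketch; the protocol does not require them.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def next_url(xml_text):
    """Return the cs_nextUrl value from a response, or None if absent."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        if element.get(XML_ID) == "cs_nextUrl":
            return "".join(element.itertext()).strip() or None
    return None


def harvest(first_url, fetch, max_pages=100):
    """Follow cs_nextUrl from page to page and return the pages fetched."""
    pages = []
    seen = set()
    url = first_url
    # Stop when there is no next URL, when a URL repeats, or at the cap.
    while url and url not in seen and len(pages) < max_pages:
        seen.add(url)
        page = fetch(url)
        pages.append(page)
        url = next_url(page)
    return pages


# Two canned pages standing in for a live service.
CANNED = {
    "page1": '<TEI xmlns="http://www.tei-c.org/ns/1.0"><p xml:id="cs_nextUrl">page2</p></TEI>',
    "page2": '<TEI xmlns="http://www.tei-c.org/ns/1.0"/>',
}
pages = harvest("page1", CANNED.__getitem__)
print(len(pages))
```

Since each page is requested independently, the loop keeps no state beyond the next URL, which matches the stateless behaviour described above. An empty or unexpected later page is simply appended as returned.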

Examples

This section provides several example requests along with constructed URLs for them, linked to the example implementation on the Map of Early Modern London website. Links are provided both to the base URL http://mapoflondon.uvic.ca/codesharing.xml (resulting in an XML response), and to the HTML front-end of the service http://mapoflondon.uvic.ca/codesharing.htm, which provides a more human-readable rendering created by transforming the XML response to XHTML5 using XSLT.

  1. List all TEI elements in the repository:
        verb=listElements
        namespace=http://www.tei-c.org/ns/1.0
  2. Get examples of TEI <head> elements in the repository:
        verb=getExamples
        elementName=head
        namespace=http://www.tei-c.org/ns/1.0
  3. Get examples of TEI @style attributes, on any element:
        verb=getExamples
        attributeName=style
        namespace=http://www.tei-c.org/ns/1.0
  4. Get examples of TEI @style attributes appearing on the <hi> element, and return them in the context of their parent element:
        verb=getExamples
        elementName=hi
        attributeName=style
        wrapped=true
        namespace=http://www.tei-c.org/ns/1.0

Future development

The following features are under consideration for a future version of this protocol:

Response compression

The OAI-PMH protocol, which is a strong influence on this one, includes the option for the server to supply a response to the querying harvester in a compressed format. Where large quantities of XML data are being harvested, this might be a useful feature.

XPath queries

This protocol is designed to be as simple as possible, so that it can easily be implemented and understood both by implementers and users. However, sophisticated users who are familiar with XPath and XQuery are likely to find its limitations frustrating. One obvious way to expand the protocol is to allow querying of the data using XPath and/or XQuery. This presents potential issues of security and input-sanitization, so it would need some careful thought; an XPath query could easily construct a single response element that includes the entire collection that is being queried, bringing the responding server to its knees, and an XQuery could modify or delete data if permissions are not correctly configured.

Source document information for examples

It would be useful if information about the source document from which an example is taken could be provided along with the example. This might be supplied as a URI in the @source attribute on the <egXML> element. This would have to be optional, since many repositories do not provide access to their XML source code as a matter of course, and may choose to implement the CodeSharing API in a manner that excludes some components of their documents.

Identifiers for examples

Other than providing the option to "wrap" a target element in its parent, the current version of the protocol provides no contextual information about any element in the response. It might be useful if there were a method of specifying a stable identifier for any result element, which could be expressed as an attribute on the containing <egXML> element. However, it is by no means clear that any such identifier could be stable in the long term, given that collections change and data is edited, and there is no current use case for any such identifier.