Program with MoEML

Introduction

This documentation provides basic information for programmers needing to work on the MoEML infrastructure and build process. It covers programming languages used, code organization, software requirements, the static build process, and tips and tricks for working more efficiently on the code.

The MoEML project consists of two parts. The first is the source data: a large collection of TEI-encoded XML files, plus other resources such as images, stylesheets and scripts. The second is a substantial codebase whose job is to check, validate and diagnose problems with the source data, and eventually to build it into a complete static website for deployment. The codebase is currently organized in a rather haphazard way, largely for historical reasons, and still includes many components from previous incarnations of the project which are no longer relevant. Figuring out which parts of the repository are responsible for which types of functionality can be difficult. This document should help to clarify some of these issues, pending a proper purge and reorganization of the codebase.

Organization of the SVN Repository

As mentioned above, the organization of the software repository is somewhat confusing, for two reasons. First, the repository is large, and we don’t want to force regular encoders to download the whole thing. This means the data component of the repo needs to be self-sufficient in some ways, so it includes programming code and resources which would be better placed elsewhere in a perfect world. Second, the project has gone through many phases in which different components were used (Cocoon, eXist, even PHP), so there are remnants of older code which really should be reorganized.

These are the important areas for programmers:

backup_schemas contains copies of schemas which the build process would normally download from the web (to ensure we have the latest versions). When the download fails, these copies are used in order to allow the build to proceed.
db is confusingly named, because it was once part of an eXist XML db folder structure. It contains the following components:
    agas contains the image files used in the Agas Map page.
    data contains all the TEI XML and related files (images, binary documents) which form the intellectual content of the project. Because this is where encoders work, this folder also includes the project schemas (in rng, although the schema constraints are created in ODD and Schematron, and then built into RNG). This folder also includes a utilities folder with some important build components, in particular the diagnostics code and the schema build code.
    redirects currently contains only one file, an XML file called redirects.xml, where we specify how to handle ids (which are tied to URLs) which need to be retired. We specify where each retired id should be redirected to, so that pages do not simply disappear from the site when new versions are released.
    site contains site components and resources that are edited by designers and programmers, used in building the website.
ise contains some old versions of Internet Shakespeare Edition plays which were part of an experiment to link between the two projects. The long-term status of this experiment is undecided; ignore this folder for the moment.
jenkins contains two components related to the build process for the project which runs on our Jenkins CI server. config.xml is the configuration file for that build; this should be updated whenever a change is made to the build configuration on the server. The other file, moeml_log_parse_rules.txt, is a set of rules which is used by Jenkins to determine whether a build has failed or succeeded. In the course of a normal build process, words such as error or warning may appear in the output from the process; normally these would cause the build to fail, but in some cases they are accidental (for instance, a filename may contain the word error), so this ruleset is used to refine the process to make sure it only fails when something is actually wrong.
obsolete is what you would expect: a place where we stash data and code files which are no longer needed.
presentations contains the materials for presentations made by project members that relate directly to MoEML.
static is the folder which contains all of the code used to build the current version of the site.
    css has all the various CSS files used in the current XHTML5 version of the site.
    exist contains code related to the version of the site which was hosted in the eXist XML database; this was used for version 6.3, but from version 6.4 onwards we have moved to a completely static version which does not require a backend database, so this code will eventually be moved to the obsolete folder.
    externals is a folder which is configured to bring in some XQuery code from other repositories, used in the eXist version of the site. This will eventually be removed.
    fonts contains all the web fonts used in the current version of the site, and in the PDF versions of the Mayoral Shows.
    fopConfig contains a configuration file for the FOP PDF processor which is used for generating the PDF versions of Mayoral Shows.
    js contains a variety of JavaScript libraries used in the static website.
    ssExtras contains two files which are used as part of the staticSearch component of the site build (described at length below).
    xsl contains all the XSLT code used in our current build processes, to create the website and the Mayoral Show PDFs. This will be described in detail below. This folder also contains a number of Ant build files which control the various build processes.
    You may also notice other files and folders inside static which are not part of the SVN repository; examples are site and staticSearch. These are created during the build process.
utilities contains a range of libraries and code modules, some of which are essential for everything (e.g. the Saxon XSLT processor) and some of which are one-off transformations used to fix problems. Many of these files are obsolete and a cleanup of this folder is long overdue.
workshops contains materials used for teaching workshops for RAs on specific topics such as regular expressions and XPath.
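The purpose of moeml_log_parse_rules.txt, described above under the jenkins folder, can be illustrated with a small sketch. The patterns below are hypothetical and are written in Python purely for illustration; they are not the contents of the real rules file, and they do not use the rule syntax that Jenkins expects. The sketch only shows the underlying idea: mentions of error inside something that looks like a filename are discarded before the line is checked for real problems.

```python
import re

# Hypothetical patterns for illustration only; NOT the real MoEML rules.
# Anything that looks like a path or filename with an extension.
FILENAME = re.compile(r"[\w./-]+\.\w+")
ERROR = re.compile(r"\b(error|failed)\b", re.IGNORECASE)
WARNING = re.compile(r"\bwarning\b", re.IGNORECASE)


def classify(line):
    """Classify one line of build output as 'error', 'warning' or 'ok'."""
    # Remove filenames first, so that a file called error_page.xml
    # does not count as an error message.
    without_filenames = FILENAME.sub("", line)
    if ERROR.search(without_filenames):
        return "error"
    if WARNING.search(without_filenames):
        return "warning"
    return "ok"


if __name__ == "__main__":
    for sample in [
        "[copy] Copying site/error_page.xml",
        "Error: validation failed for ABCH1",
    ]:
        print(classify(sample), "<-", sample)
```

A real ruleset would need to be tuned against actual build logs; the point here is only that accidental matches are filtered out before the build is marked as failed.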

Software Requirements

This is a list of software that is required for running the various build processes. Some of it is actually stored in the repository, and some must be installed on the machine doing the build.

Software Included in the Repository

The following software is stored in the SVN repository, so does not need to be installed locally:

Saxon XSLT processor (saxon-he-10.jar)
Schematron library for Ant (ant-schematron-2010-04-14.jar)
The W3C HTML validator (vnu.jar)
The Jing RELAXNG validator (jing.jar)

Software to be Installed Locally

To run the various MoEML build processes, you will need the following software to be installed on your machine. At present most of the build processes have to be run on *NIX systems because they depend on command-line utilities. If you are forced to use Windows, you’ll probably have to install the Windows Subsystem for Linux. For running specific components of the build, you may not need all of these applications or libs.

Java
Ant
ant-contrib
linkchecker
jsonlint
xmllint
svn
git
zip
pdftoppm
sensible-browser
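Before attempting a build, it can be useful to check which of the tools listed above are actually available on your machine. The following is a minimal sketch, not part of the MoEML codebase. It assumes each tool is on your PATH under the lowercase command name shown; on some systems a tool may be installed under a different name. ant-contrib is an Ant library rather than a command, so it cannot be detected this way and is left out.

```python
import shutil

# Command names taken from the list above (assumed to match the executable
# names on your system). ant-contrib is omitted because it is a library,
# not an executable.
REQUIRED_COMMANDS = [
    "java", "ant", "linkchecker", "jsonlint", "xmllint",
    "svn", "git", "zip", "pdftoppm", "sensible-browser",
]


def missing_commands(commands=REQUIRED_COMMANDS):
    """Return the commands from the list that cannot be found on the PATH."""
    return [name for name in commands if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_commands()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All required commands were found.")
```

Remember that you may not need every tool if you only run specific components of the build.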

MoEML’s Build Processes

The project has two distinct build processes:

The extended validation build (run by build.xml in the project root folder)
The static site build (run by build.xml in the static folder)

The Extended Validation Build

The extended validation build is designed to carry out a range of extra checks before bothering to build the website. It is controlled by the Ant build.xml file in the project root folder. It checks that:

all XML documents are valid (with Schematron and RELAXNG)
TEI code in all egXMLs in praxis is valid
all inline CSS is valid
all internal links point to something real
there are no duplicate ids

It also runs the project diagnostics, to find problems which are not build-breaking but will require attention.
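To make the duplicate id check listed above more concrete, here is a minimal sketch of what such a check involves. It is written in Python for illustration and is not the project's implementation, which is driven by the Ant build. It assumes that ids are expressed as xml:id attributes, as is usual in TEI; the function name and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# ElementTree expands the xml: prefix to this namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def find_duplicate_ids(documents):
    """Map each xml:id used more than once to the documents that use it.

    documents is a mapping of document name to XML source string.
    """
    seen = defaultdict(list)
    for name, source in documents.items():
        root = ET.fromstring(source)
        for element in root.iter():  # includes the root element itself
            xml_id = element.get(XML_ID)
            if xml_id is not None:
                seen[xml_id].append(name)
    return {xml_id: names for xml_id, names in seen.items() if len(names) > 1}


if __name__ == "__main__":
    # Hypothetical sample documents, for demonstration only.
    samples = {
        "a.xml": '<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:id="ABCH1"/>',
        "b.xml": '<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:id="ABCH1"/>',
    }
    print(find_duplicate_ids(samples))
```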

Project Diagnostics

RELAXNG and Schematron validation are vital components of MoEML’s quality control process, but they aren’t sufficient to find all of the issues we need to avoid. The project diagnostics provide a second level of checking and testing. See Holmes and Takeda’s 2019 article <ref target="https://doi.org/10.1093/llc/fqz011">Beyond validation: Using programmed diagnostics to learn about, monitor, and successfully complete your DH project</ref> for more details on the principles underlying our diagnostic processes.

Running the diagnostics is simple. In the root directory of the MoEML project, type:

    ant diagnostics

The process will take about five or six minutes, and will output products/diagnostics, which you can open in your browser. This contains the results of all the tests and checks performed. Our Jenkins CI server runs the diagnostics as part of every build, and serves the results for everyone to use. If you are working through problems raised in the diagnostics, and you want to check whether your fixes have been successful, you can run a local build of the diagnostics as specified above to get quicker feedback than waiting for the whole build process to complete on Jenkins.

The Static Site Build

If all checks in the Extended Validation Build have completed successfully, then Jenkins will run the static site build. This is controlled by the Ant build.xml file in the static subfolder.

This is a long and complex process, and it takes a long time to complete. Programmers working on the project need to understand it well so that they can run subcomponents of the build process in order to reproduce build errors rapidly and fix them efficiently.

This is the list of tasks that run, in sequence, as part of the static build (warning: this may change; check build.xml for the precise details).

clean: Delete products and by-products of previous builds.
getSvnInfo: Get the latest svn version to use in footers etc.
getStaticSearchCode: Download the latest version of the staticSearch codebase from its GitHub repository.
createXslCaptions: Process the boilerplate.xml file to create an XSLT resource containing the captions, to be used when building the site.
createBinaryDocList: Create a text file listing all binary documents (PDFs etc.) from the repository which are actually linked on the site, so that we copy only those documents to the output.
createImageLists: Create a text file listing all images from the repository which are actually used on the site.
copySiteAncillaryFiles: Copy CSS, JavaScript and other static files from the static/ folder to the output site/ folder.
extractSchematron: Extract the Schematron ruleset from the tei_all RelaxNG schema, so that it can be used for validation.
copyBinaryDocs: Copy required binary documents to the output folder.
copyImages: Copy required images to the output folder.
createImportXsl: Do some preprocessing to handle cases where MoEML uses its own custom mol-import processing instruction to create composite documents.
applyImportXsl: Finish processing the mol-import cases started in the preceding step.
createOriginalXml: Transform the source XML in db/data to create the more normalized and standardized version we publish as Original XML.
createGeneratedContent: Create a set of additional TEI XML files mechanically constructed from existing data (document category lists, etc.), and some JSON.
validateOriginalXml: Validate the Original XML collection against our schemas.
createStandaloneXml: Transform the Original XML to pull in all referenced entities (people, places etc.) to create a standalone version.
resolveStyleSelectors: Process any Standalone XML documents using rendition/selector so that targeted elements have explicit pointers up to the rendition element.
rationalizeStyleAttributes: Process all inline style attributes to make them into rendition elements in the header.
validateStandaloneXml: Validate the Standalone XML against our schemas.
createAjaxFragments: Create versions of core entities (people, places, etc.) in the form of XHTML5 div elements which can be retrieved by AJAX when clicking on a site link. These fragments are also used later in the build process to generate the XHTML pages for entities from BIBL1, PERS1, and ORGS1.
createStandardXml: Process the Standalone XML to create versions which do not include more unusual encoding strategies that regular TEI encoders might find puzzling.
createSimpleXml: Process the Standalone XML to create versions which are valid against the TEI simplePrint schema.
createLiteXml: Process the Standalone XML to create versions which are valid against the TEI Lite schema.
createXhtmlDocs: Process all the Standalone XML files and AJAX fragments to create the XHTML5 output documents constituting the site.
copyAgasMapTiles: Copy the collection of tiles (small fragments of the Agas Map at different zoom levels) to the output site.
createAgasMapXhtml: Create the actual page for the Agas Map in the site folder.
validateXhtmlDocs: Validate the entire constructed site using the W3C VNU validator (which also checks CSS).
buildStaticSearch: Run the staticSearch build/indexing process to create the search page and the JSON and other resource files which support it.
createTxtList: Create a list of primary source published documents that will be converted into text files to enable users to do text analysis.
createTxtFiles: Process the list of files from the previous step to create plain text versions.
To see the full set of tasks and subtasks that are available, type ant in the static folder and press the tab key twice.

Running Partial Builds

The complete static build process takes hours. If you’re working on fixing a build problem and you need to test your changes, it is obviously not practical to run the entire build process and wait to see the results. However, in most cases, you don’t need to. Here are a number of examples of how you can run only a small component of the build process to test specific changes.

IMPORTANT NOTE: In most cases, you must have an existing completed build in place before you can successfully run partial builds. That means that once in a while, you will need to run a complete local build for yourself. You can of course do that over lunch or overnight. Another alternative is to run this:

    cd static
    ant -f getBuiltSiteFromJenkins.xml

This will go to the Jenkins server and download the latest complete successful build, then unzip it into the correct locations on your machine. Note that the MoEML build is several GB in size, so if you have a slow connection, it might be faster to build it yourself.

Once you have a full completed build available locally, you can start running only the part of the build that you are interested in. For example, if you are trying to work on a problem that relates to the generation of the Original XML, you might do this:

    ant createOriginalXml validateOriginalXml

This will perform only those two steps, and you can then examine the results in the site/xml/original folder.

If you’re working on something more substantial that requires several steps, you can just chain them together as appropriate. Make sure you run them in the order they’re shown in the long list above, because each process may depend on the output from a preceding process.
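Because the order of the steps matters, a small helper can put a set of chosen targets into build order before you pass them to ant. This is a sketch only, not part of the MoEML codebase: the target names are copied from the long list above, which may be out of date (check build.xml in the static folder for the precise details), and the function name is hypothetical.

```python
# Target names in the order given in the task list above.
# Assumption: this matches the current build.xml, which may have changed.
BUILD_ORDER = [
    "clean", "getSvnInfo", "getStaticSearchCode", "createXslCaptions",
    "createBinaryDocList", "createImageLists", "copySiteAncillaryFiles",
    "extractSchematron", "copyBinaryDocs", "copyImages", "createImportXsl",
    "applyImportXsl", "createOriginalXml", "createGeneratedContent",
    "validateOriginalXml", "createStandaloneXml", "resolveStyleSelectors",
    "rationalizeStyleAttributes", "validateStandaloneXml",
    "createAjaxFragments", "createStandardXml", "createSimpleXml",
    "createLiteXml", "createXhtmlDocs", "copyAgasMapTiles",
    "createAgasMapXhtml", "validateXhtmlDocs", "buildStaticSearch",
    "createTxtList", "createTxtFiles",
]


def in_build_order(targets):
    """Return the given targets, de-duplicated and sorted into build order."""
    unknown = [t for t in targets if t not in BUILD_ORDER]
    if unknown:
        raise ValueError("Unknown targets: " + ", ".join(unknown))
    return sorted(set(targets), key=BUILD_ORDER.index)


if __name__ == "__main__":
    chosen = ["validateStandaloneXml", "createOriginalXml", "createStandaloneXml"]
    print("ant " + " ".join(in_build_order(chosen)))
```

Note that this only orders the targets you name; it does not add any intermediate steps that your chosen targets may depend on.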

Processing a subset of documents

Another useful approach to rapid building is to process only a specific subset of documents. For example, imagine that you are dealing with an HTML problem that affects lots of documents, but you know that one particular document (ABCH1) exemplifies the issue, and can be used as a test. You can run this:

    ant createXhtmlDocs -DdocsToBuild=ABCH1

This will run the part of the build that transforms the Standalone XML into HTML files, but it will only process a single document, making it very fast indeed; you can then inspect the changes to that specific document. To process more than one document, separate them with commas:

    ant createStandaloneXml -DdocsToBuild=ABCH1,STMA12

Finally, there is a specific target named quick, which is designed to do the minimum processing to get from the source XML to the XHTML output (in other words, the most important stages in the build process, ignoring such things as RSS feeds). If you run:

    ant quick -DdocsToBuild=ABCH1,STMA12

you’ll pass those two documents through the entire process from source to HTML output, but the process should be relatively fast. Again, it’s important to remember that you must have a complete set of build products in place in your static/site folder before this will work properly.

You can even do the same with entities such as people, places and bibliography items, but you will need to build both the item itself and the containing XML file. So to rebuild the Clothworkers’ Company (CLOT2) page, you could run:

    ant quick -DdocsToBuild=ORGS1,CLOT2

This will process all the content of the orgs, and then build the AJAX fragment for CLOT2, and generate a page from it.
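If you run partial builds often, you may want to assemble these command lines in a script. The sketch below is a hypothetical helper, not part of the MoEML codebase. It only builds the argument list in the same form as the commands shown above (target names followed by a comma-separated docsToBuild property); it does not run anything itself.

```python
def ant_command(targets, docs_to_build=None):
    """Build the argument list for a partial build.

    targets: Ant target names, already in build order.
    docs_to_build: optional document or entity ids, e.g. ["ORGS1", "CLOT2"].
    """
    command = ["ant", *targets]
    if docs_to_build:
        # Multiple ids are separated with commas, as in the examples above.
        command.append("-DdocsToBuild=" + ",".join(docs_to_build))
    return command


if __name__ == "__main__":
    # Rebuild an entity page: the containing file (ORGS1) plus the item (CLOT2).
    print(" ".join(ant_command(["quick"], ["ORGS1", "CLOT2"])))
    # To actually run it, something like the following should work from the
    # project root (untested assumption about your local layout):
    #   subprocess.run(ant_command(["quick"], ["ORGS1", "CLOT2"]),
    #                  cwd="static", check=True)
```

Returning a list rather than a single string means the result can be passed directly to subprocess.run without shell quoting.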

Strategies for Building and Testing

The various strategies described above allow a programmer to work efficiently on solving a specific problem or adding a specific feature without having to wait for long periods to see the results of changes. If you triage the issue you’re working on carefully, you’ll be able to break it down into small steps and identify a specific subset of documents which can be used for testing. You can then develop and test your changes carefully, so that when you do commit them to the repository, it’s much less likely that the full build will fail because of something you did.