Apply, Part 2: Overview, and the Document Model

This installment will provide an overview of the Apply system, primarily of the treatment of documents. (See Part 1 for the basic explanation of why this software exists in the first place.) The rest of the system is based on how documents are handled, so this is probably the most basic introduction possible.

This guide will deliberately ignore many important parts of the system—the administrator interface, reviews, the attached content management system, (mostly) the search engine—but once the document system is clear, those things will be much easier to explain.

Documents

An application in progress. Shows page navigation to the left, mini CMS to host surrounding pages, etc.

The Apply system is based on a model of documents—collections of answers to questions, given in some paginated order, which a user can (a) begin, (b) save and return to, and (c) eventually finalize by clicking a Submit button.

A typical workflow is as follows: A project (a grant opportunity, say) begins with a proposal document. (This is what one might be tempted to call an “application”; unfortunately, “application” is so loaded w/r/t its meaning in software that it had to be banished from this particular vocabulary.) By submitting a proposal, the user initiates a candidacy for that project.

When the proposal (nee “application”) period ends, completed proposals are referred to some number of reviewers. The review process may be as simple as a numerical score and a comment, or as complicated as a multi-page form of its own. (To Apply, reviews are just another kind of document.) Review scores are averaged, considered by project administrators, and winners are selected.

After awards are distributed, winners may be asked to complete supplementary documents, e.g., a schedule of public events or proof of insurance for traveling exhibits. These additional documents are collated under their candidacy along with the original proposal. At the end of the project term, most winners will also submit a final report evaluating the grant. At all times, project administrators need visibility (search, export, list) into all documents/candidacies.

Regardless of what other choices I would make, I wasn’t going to uproot this document-based workflow. Users and administrators both understand it and find it intuitive. So I started there.

The model

Logically at least (if not literally at the level of software abstractions), documents can be thought of as composed of answers of various kinds, collected and/or presented in some order. The key thing here is that whatever abstraction we use for documents, we must be able to associate it with (a) arbitrarily many answers, of (b) arbitrarily many types. Because it must characterize all documents, the model cannot make any assumptions about the answers it contains. (How many, what order, what type, etc.)

Wherefore schemas?

Given the above, it’s reasonable to wonder whether a “schemaless” (document-based, NoSQL, etc.) database is the right tool for the job. Spoiler alert: ultimately, I did not use such a database—Apply uses ActiveRecord. The reasons why bear mentioning though.

First, my own knowledge was a factor. Especially at the time, (summer 2011) I had a lot more experience with SQL/relational databases. This isn’t an ideal constraint, but it is a real world one: I was going to have to build and maintain this system, and to some extent schemaless (etc.) databases represented a source of uncertainty.

But second, and more important: in a cost/benefit analysis, I’m not sure schemaless DBs are the right tool anyway. If your data is truly schemaless, in that you need arbitrary values under a single type, such a database may be necessary.

I don’t think our data is truly schemaless. The fact that the document system must remain agnostic about answer types does not mean it is schemaless: this would be to mistake encapsulation or information-hiding for schema properties. By remaining agnostic about answer types, Document is saying “I don’t know what (sub)schemas I will support,” not “I support no schemas at all.” I take “schemaless” here to mean “no restriction on the composition of individual type instances”—as in, for a given type A, instances of A may differ in their composition, data members, their types, etc.

Contrast this with the case where, for any given type, that type does have a schema, but the code that coordinates it must remain agnostic about what type it is dealing with at any given time. (This looks more like our situation.) There is nothing schemaless about this arrangement, because in every case where a record is in play—say, a Contact— that record has a schema, imposed on it by its type.

When you look at it that way, it’s not clear what part is supposed to be schemaless. Elsewhere in the application we expect (for example) Contact objects to support at least a minimal kind of API: fname, lname, email, etc. These values may be nil, but we expect the methods to be present. Any object that satisfies the API is basically already “schema-fied”, at least insofar as the API is a schema.

Similarly, for Narratives, we expect #narrative_content to return a kind of text entity; AttachmentGroups should only contain instances of Attachment, and so on. This looks (to me) like a plurality of fine-grained schemas, not schemalessness. Even if we permitted arbitrariness, some sort of interface still has to be spec’d out on top of it. I don’t really see what a schemaless database gets us here.

The other main benefit of many schemaless/document-based/NoSQL systems is horizontal scalability. But it’s extremely unlikely the Public Programs Office will actually require scalability beyond that possible with MySQL or Postgres. (File under ‘problems we’d love to have.’) In the unlikely case we did, projects could probably be partitioned to different servers at the level of the HTTP proxy. At this point, the liabilities (no ACID guarantees, eventual consistency, split-brain problems, less experience with code) don’t seem to be yielding much obvious benefit.

I vs. O

The downside of SQL/relational systems (to me, or for this project) is that they are clunky to search. For this, we use Elasticsearch. When documents are created or updated, they are (asynchronously, via Resque) indexed in ES, at the granularity of individual answers, entire documents and more. (See below for more information.)

The idea is that data enters the system via ActiveRecord, and the relational database is the canonical copy. Data by and large leaves the system (exports, reviewer menus, etc.) by way of Elasticsearch, which can be verified/regenerated from the SQL. SQL/relational systems provide a good interface for writing: ACID guarantees, enforced schemas, etc. Elasticsearch provides an interface ideal for reading: search, faceting, plenty of capacity for dormant projects, etc. So that is how we use them.

The exception to this rule is the show action, as called on a document via the web. This always pulls the document directly from the database, rather than ES, due to the slight lag time of the background worker. This is to prevent users from seeing confirmation screens missing their most recent update. For large documents it can be a little expensive, but this happens relatively infrequently and is worth not getting those phone calls.

Implementation

Fudging a few details, the document system is based on two corresponding hierarchies of ActiveRecord models. The first comprise the specification of the project—the types of documents it requires, their questions, their pagination order. The second hierarchy consists of records specific to a user, each corresponding to a type in the first hierarchy:

Description Model    State Model
Project*             Candidacy
Template             Document
Page                 Paging
Question             Answering

* Technically, this is “ProjectDescription,” but for clarity’s sake we’ll worry about that later.

Definitions:

As a user traverses a Template, Page by Page, they build up a Document, Paging by Paging, composed of Answerings (and their Answerables).

From the perspective of the user, the models on the right “store state” in a way the ones on the left don’t. I use the shudder quotes because in fact they do store state; they wouldn’t be in the database otherwise. But that state holds items like question prompts, validation requirements, etc. We may update them from time to time, but to the user they (ideally) appear eternal. The state the user is concerned with (their essay responses, their contact information, whether they finalized their form or not) is contained entirely in the hierarchy on the right.

Descriptions

This difference is reflected in the fact that while candidacies, documents, pagings and answerings are all backed by their own database tables, all of the models in the left-hand hierarchy (Project, Template, Page, Question in case that’s above the fold) share a single database table, and inherit from the same class: Description. This arrangement is called “single table inheritance.” An extra column is added, “type”, which stores the name of the inherited class to use. ActiveRecord supports this out of the box.

STI gets a bad rap sometimes, and at least some of it is unfair. The complaint is generally (some variant of) that it’s a confusing idiom that doesn’t always do what you’d expect.

STI is appropriate here because Description classes really do all have the same backing format, and differ only in their behavior. (This being the condition under which STI is appropriate.) This is intentional and by definition. That data format is: a “nickname,” (a database-independent identifier for use within the project) a parent_id, (the id of another Description; the relationship is a self join on the Description table) and a serialized hash object, whose contents are given (normally) in a YAML file.
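To make the shared backing format concrete, here is a minimal plain-Ruby sketch of rows in a single table being wrapped by the class named in their type column. It is not Apply's code: the row layout, the `instantiate` helper, and the `word_limit` method are assumptions for illustration, and `describe` is modeled on the call that appears later in this post.

```ruby
require "yaml"

# Plain-Ruby stand-in for a Description row. In Apply this is an ActiveRecord
# model using single table inheritance; here a "table" is an array of hashes.
class Description
  attr_reader :nickname, :parent_id, :data

  def initialize(row)
    @nickname  = row.fetch("nickname")
    @parent_id = row["parent_id"]
    # The serialized hash lives in one column as YAML text.
    @data = YAML.safe_load(row.fetch("data")) || {}
  end

  # Look up a key in the serialized hash.
  def describe(key)
    data[key.to_s]
  end

  # Hypothetical helper: choose the class named in the "type" column.
  def self.instantiate(row)
    Object.const_get(row.fetch("type")).new(row)
  end
end

# Subclasses share the backing format and differ only in behavior.
class Template < Description; end

class Question < Description
  def word_limit
    describe(:max_words)
  end
end

rows = [
  { "type" => "Template", "nickname" => "proposal", "parent_id" => nil,
    "data" => "starting_page: contact\n" },
  { "type" => "Question", "nickname" => "narrative1", "parent_id" => 1,
    "data" => "answerable_type: Narrative\nmax_words: 350\n" }
]

descriptions = rows.map { |row| Description.instantiate(row) }
descriptions.each { |d| puts "#{d.class}: #{d.nickname}" }
```

Because every row has the same three pieces of data, adding a new Description type in this sketch means adding a subclass, not a column.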

In fact, the database-backing of Description objects is ultimately a convenience. Projects are specified entirely in these YAML files, using directories for hierarchical relationships:

~/src/apply/local/projects$ tree sample
sample
├── phases
│   ├── pre.yaml
│   ├── proposal.yaml
│   ├── reporting.yaml
│   └── review.yaml
├── project.yaml
└── templates
    ├── proposal
    │   ├── pages
    │   │   ├── contact.yaml
    │   │   └── narratives_and_uploads.yaml
    │   ├── questions
    │   │   ├── attachments.yaml
    │   │   ├── contact.yaml
    │   │   ├── narrative1.yaml
    │   │   └── narrative2.yaml
    │   └── template.yaml
    └── review
        ├── pages
        │   └── review.yaml
        ├── questions
        │   └── score_and_comments.yaml
        └── template.yaml

8 directories, 15 files

The loader traverses the directory structure, resolves each individual description object by nicknames, and updates/creates the serialized hash object using the contents of the file. (In theory the database could be bypassed or replaced entirely by any format capable of supplying a hash.)
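As a rough illustration of that traversal, the sketch below walks a directory of YAML files and creates or updates one hash per file. It is an assumption-laden stand-in, not the actual loader: the `load_descriptions` name, the in-memory `store`, and the choice to key entries by directory plus nickname are all hypothetical.

```ruby
require "yaml"
require "tmpdir"
require "fileutils"

# Hypothetical loader: walk a project directory and create or update one hash
# per YAML file. The in-memory store stands in for the database table.
def load_descriptions(root, store = {})
  Dir.glob("**/*.yaml", base: root).sort.each do |relative|
    data = YAML.safe_load(File.read(File.join(root, relative))) || {}
    # Fall back to the file name when a file omits its nickname.
    nickname = data["nickname"] || File.basename(relative, ".yaml")
    dir = File.dirname(relative) # "." for files directly under root
    key = dir == "." ? nickname : File.join(dir, nickname)
    # Update the existing entry if there is one, otherwise create it.
    store[key] = (store[key] || {}).merge(data)
  end
  store
end

# Build a tiny project tree in a temporary directory and load it.
Dir.mktmpdir do |root|
  questions = File.join(root, "templates", "proposal", "questions")
  FileUtils.mkdir_p(questions)
  File.write(File.join(root, "project.yaml"),
             "nickname: sample\nfull_name: Sample Grant Project\n")
  File.write(File.join(questions, "narrative1.yaml"),
             "nickname: narrative1\nanswerable_type: Narrative\nmax_words: 350\n")

  load_descriptions(root).each { |key, data| puts "#{key}: #{data.inspect}" }
end
```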

Some directories have a “node” file: project.yaml for project wide information; template.yaml for any given template. Some directories contain collections: questions, pages, phases. Examples:

# sample/project.yaml
nickname: sample
full_name: Sample Grant Project

# sample/templates/proposal/template.yaml
nickname: proposal
document_class: Proposal
starting_page: contact
flat_page_order:
  - contact
  - narratives_and_uploads

# sample/templates/proposal/pages/narratives_and_uploads.yaml
nickname: narratives_and_uploads
question_order:
  - narrative1
  - narrative2
  - attachments
text:
  title: Narrative Answers and File Uploads
  info: "On this page you can write some words and upload some attachments."
terminal: true   # show a submit button

# sample/templates/proposal/questions/narrative1.yaml
answerable_type: Narrative
nickname: narrative1
max_words: 350
public_max: 300 # tell them it's 300 but leave margin of error
text:
  title: "Narrative 1"

Description objects really are schemaless. This is basically why STI works so well with them: STI usually breaks down (it seems?) when people discover that two models joined at the table really do need different backing schemas after all. This is not really an issue if the only “schema” is that you require a hash. The benefit is that we can provide a single abstraction (Description) that can be used to build any type of project specification we can dream up. E.g., note above we use the same abstraction for a project phase as we do for a question. I’ve recently been toying with a Description type to model custom exports.

Documents proper

Apply represents the hierarchical relations above as ActiveRecord associations: A Candidacy has_many Documents; a Document has_many Pagings and Answerings, etc.

While the actual Document instance contains only metadata (ownership, current state, create/update timestamps, etc.), it provides the main interface for updating all of its child records. By using the accepts_nested_attributes_for and validates_associated directives, we can treat the entire document as if it were a single record.

#red_tape_update

The stock update methods like update_attributes won’t be sufficient here, because other work needs to be managed around an update. We abstract all of that work into the method #red_tape_update:

@document.red_tape_update(attributes, options_hash) 

Attributes is a normal ActiveRecord hash of attributes. The options control other behavior: a submit flag, to (attempt to) finalize the document; an extended checks flag to trigger validations that should only happen when a user is leaving a page; etc. This method returns a Boolean indicating the success or failure of the update.

The main side effect of #red_tape_update is, if the update goes through, to queue an appropriate Event (see below) to asynchronously handle any secondary work, like reindexing answers in Elasticsearch.

This method also temporarily sets the @red_tape instance variable to true. To (lightly) enforce the interface, instances of Document include a validation on this ivar. (It’s easily defeated if you need to—just enough to remind you not to if it’s an accident.)
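The guard described above can be sketched in plain Ruby. This is only an approximation under stated assumptions: the real Document is an ActiveRecord model, and here `save`, `errors`, and `queued_events` are simplified stand-ins for the stock update path, the validation errors, and the Resque queue.

```ruby
# Plain-Ruby sketch of the @red_tape guard. Only the method name
# red_tape_update and the ivar name come from the post; the rest is assumed.
class Document
  attr_reader :attributes, :errors, :queued_events

  def initialize
    @attributes = {}
    @errors = []
    @queued_events = [] # stand-in for the background job queue
    @red_tape = false
  end

  # The sanctioned entry point. Returns true or false.
  def red_tape_update(new_attributes, options = {})
    @red_tape = true
    return false unless save(new_attributes)

    # Queue secondary work only when the update went through.
    @queued_events << { kind: options[:submit] ? :submit : :update }
    true
  ensure
    @red_tape = false # the flag is set only for the duration of the call
  end

  # Stand-in for the stock update path, guarded by a validation on the flag.
  def save(new_attributes)
    @errors = []
    @errors << "use #red_tape_update" unless @red_tape
    return false unless @errors.empty?

    @attributes = @attributes.merge(new_attributes)
    true
  end
end

document = Document.new
puts document.save("title" => "Bypass attempt")
puts document.red_tape_update({ "title" => "Proposal" }, submit: true)
```

As in the post, the check is easy to defeat on purpose; its job in this sketch is just to make an accidental bypass fail validation.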

Building a document

In normal operation, for a template to be available to a user, three conditions must obtain:

If all of these things obtain, the template is made available to the user with a nice big Begin link. When the user clicks that link, (a) the template produces the first page (as in, an instance of Page); (b) the first page generates a list of questions to be presented to the user; (c) assuming they click Save and the document passes validation, answers are created/updated, and a paging is created referencing that Page, and is made the current paging; (d) if they clicked Proceed, the template determines the next page and repeats the process there.

Pages and Pagings

The idea is that Pages, given a user document, dynamically generate a list of Questions for that document. As the user traces a path through the document by clicking Proceed, they leave a trail of Pagings, which as noted are organized in a tree structure. (Courtesy of the Ancestry gem.) The path ending in the document’s current paging represents the active path through the document.

Again, this is required to properly support templates that determine page order dynamically based on previous answers to questions. This feature was required in the very first project for Apply, and is used regularly for all kinds of purposes. (It also may be the most complicated part of the entire codebase, and will hopefully get its own entry in this series.)

My stalking horse when designing it was the Choose Your Own Adventure novel: I required (somewhat arbitrarily) that the system should be capable of hosting a CYOA novel using the page system. A darker example was that ghoulish feature of the GRE wherein early correct answers lead to more valuable questions later; both can be modeled in Apply.

The document itself stores its current_paging_id as part of its own state; this paging will be considered the most advanced node in the branch. For example, when a user calls the URL to edit their document, with no parameters passed, they will be presented with the page denoted by document.current_paging.

(Note that this is implemented by specifying:

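class Document < ActiveRecord::Base
  # The association name must be a symbol. class_name is given because
  # ActiveRecord would otherwise infer a CurrentPaging class from the name.
  belongs_to :current_paging, class_name: "Paging"
  # ...
end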

In no real sense does a document “belong to” its current paging. The meaning of/difference between has_* and belongs_to in AR is simply where the foreign key is stored. We want documents to keep track of their own most recent paging, so the foreign key is stored in the documents table—hence “belongs to.”)

If you traverse the current paging’s ancestors until you find a paging with no ancestors, you have the active path through the document. This is used to generate the full view of the document: beginning with the root paging, the show action gets its page, which generates a list of questions; if the document has any answers for those questions, they will be shown. This is repeated on the next paging, and so on, until the current paging is met. (Documents can therefore respond to the show action as soon as they are created, regardless of completeness.)
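The ancestor walk is simple enough to show without the database. The sketch below is a plain-Ruby approximation: the `Paging` struct, its `parent` link, and the `active_path` method are assumed names (Apply stores the tree with the Ancestry gem), and the page nicknames are invented.

```ruby
# Minimal stand-in for a Paging node: a page nickname plus a parent link.
Paging = Struct.new(:page_nickname, :parent) do
  # Walk parent links up to the root, then reverse to get root-first order.
  def active_path
    path = []
    node = self
    while node
      path << node
      node = node.parent
    end
    path.reverse
  end
end

root      = Paging.new("contact", nil)
library   = Paging.new("library_details", root)
abandoned = Paging.new("museum_details", root) # a branch the user left behind
current   = Paging.new("narratives_and_uploads", library)

puts current.active_path.map(&:page_nickname).inspect
```

The abandoned branch stays in the tree but never appears in the result, because only the ancestors of the current paging are visited.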

Here is the definition of a valid document in Apply: if a user begins on a valid start page and proceeds legally through zero or more following pages, the document is valid. If the current page is marked as a terminal page, the user will be offered the opportunity to submit their document. Any valid document that is currently on a terminal page is submittable.

The motivation: Suppose (as has been the case) on the first page of a document, a user is asked what sort of institution they work at. (A public library, museum, state organization, etc.) Depending on their answer, they will receive different paths through the application. As they are preparing to submit, they realize they made a mistake in that field. Correcting it will now send them through a different path through the application—to ensure their path through the document is valid, their original access to the submit button must be revoked, and they have to at least click through each page to the final one. They may mostly be asked the same questions in a different order though; their original answers should survive. This use case (hereafter referred to as “page-jumping”) alone was sufficient to prohibit any strong “ownership” relationship between Page/Question and Paging/Answering. Documents have a pool of Answerings, and a list of Paging tickets. That’s it. Everything else is determined on the fly.
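The validity rule and the page-jumping scenario can be modeled in a few lines. Everything in this sketch is hypothetical: the `PAGES` table, the lambdas that choose the next page, and the `valid_path?` and `submittable?` helpers are illustrative stand-ins for whatever the Page descriptions actually do.

```ruby
# Hypothetical page specs: each page picks its successor from the document's
# current answers, and a page may be marked terminal.
PAGES = {
  "contact" => {
    next: ->(answers) { answers["institution"] == "library" ? "library_info" : "museum_info" }
  },
  "library_info" => { next: ->(_answers) { "narratives" } },
  "museum_info"  => { next: ->(_answers) { "narratives" } },
  "narratives"   => { terminal: true }
}.freeze
START_PAGE = "contact"

# A path is valid if it begins on the start page and every step is the one
# the previous page would choose given the answers as they stand now.
def valid_path?(path, answers)
  return false unless path.first == START_PAGE

  path.each_cons(2).all? do |from, to|
    chooser = PAGES.dig(from, :next)
    !chooser.nil? && chooser.call(answers) == to
  end
end

# A valid path that currently ends on a terminal page may be submitted.
def submittable?(path, answers)
  valid_path?(path, answers) && PAGES.dig(path.last, :terminal) == true
end

path = %w[contact library_info narratives]
puts submittable?(path, { "institution" => "library" })
# Correcting the first answer invalidates the old path.
puts submittable?(path, { "institution" => "museum" })
```

Note that nothing here ties an answer to a page: the answers are a flat pool, and only the path is re-checked, which is why the original answers can survive a jump.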

Answers

Earlier we said Answerings “wrap” their Answerables, and this composite entity is (sort of) an “Answer.” What does that mean exactly?

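The relationship between a document and the answers that comprise it has two interesting qualities: it is one-to-many (a document has many answers), and it is polymorphic (those answers are of different types, not known in advance).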

As it turns out, this particular combination is not naturally expressed by relational databases. ActiveRecord provides the first condition with has_many. If, say a Blog has_many Posts, calling @blog.posts means, “look in a table called ‘posts’ for those with my (i.e., @blog’s) id in the blog_id column.”

ActiveRecord provides the second condition, polymorphism, via belongs_to with polymorphic: true. This allows a model to belong to another model whose type is only known at runtime. Here, ActiveRecord says, “consult the table named by my own owner_type column for the id given in my own owner_id column.”

In both of these cases, the calling object can resolve both the id and the relevant table in order to find its relations. What you can’t do (directly) is a polymorphic has_many, i.e., where object X has many associated objects, and their type is unknown. For a model to resolve its relations, it must know both what to look for (the key, usually an id) and where to look for it, i.e., what table to query. The key is known, as this will be the id of the calling object. But as it doesn’t know the type of its related objects, it doesn’t know where to look for the key.

The solution is to create a new model, Answering, that provides a generic representation of an Answer. A document has_many answerings, and answerings polymorphically belong_to an Answerable, as #payload. The sample question spec above contains the following key:

answerable_type: 'Narrative'

When an answer(ing|able) is initialized, the answering consults this information to determine what kind of payload to build. This involves overriding the #build_payload method generated by ActiveRecord, and replacing it with one that hooks into the Description system:

# answering.rb
  def build_payload(*params)
    # Look up the payload class named in the question's description and instantiate it.
    self.payload = Answerable.const_get(question.describe(:answerable_type)).new
    # more work goes on here...
  end

With this, answerings build their payloads polymorphically, and respond to all of the normal nested model behavior in ActiveRecord.
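For readers without the codebase at hand, the same idea can be tried in plain Ruby. This sketch assumes the payload classes live under an `Answerable` namespace (suggested by the `const_get` call above) and uses a bare hash in place of a Question description; the `Contact` fields echo the ones mentioned earlier, but the classes themselves are illustrative.

```ruby
# Plain-Ruby sketch of resolving a payload class by name at runtime.
module Answerable
  class Narrative
    attr_accessor :narrative_content
  end

  class Contact
    attr_accessor :fname, :lname, :email
  end
end

class Answering
  attr_reader :payload

  # question_spec stands in for the hash behind a Question description.
  def initialize(question_spec)
    @question_spec = question_spec
  end

  # Find the class named by answerable_type and instantiate it.
  def build_payload
    @payload = Answerable.const_get(@question_spec.fetch("answerable_type")).new
  end
end

answerings = [
  Answering.new("nickname" => "narrative1", "answerable_type" => "Narrative"),
  Answering.new("nickname" => "contact", "answerable_type" => "Contact")
]
answerings.each(&:build_payload)
puts answerings.map { |answering| answering.payload.class.name }.inspect
```

The document only ever holds a list of one type (Answering), which is what lets a relational has_many work; the variety is pushed one level down into the payloads.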

The “join model” Answering is really the only concession that was necessary for the strictures of working in a relational database. It’s a concession only in that it complicates things with another model; and in certain use cases it may serve a useful role of its own, e.g., tracking the last user to edit an answer in a collaborative document. (No such functionality exists yet, but it’s been discussed.) In that case, you wouldn’t want to duplicate those columns in every Answerable table; it’s natural to store them in the Answering table.

And, when we’re not working in relational strictures, an Answering and its Answerable payload can be combined into a single record, an Answer: this is how they go into Elasticsearch. Because ES is effectively schemaless, we can flatten these into a single type. (These types are then owned by a Candidacy type in ES, and accessed via a long has_child query, but that’s a topic for another post.)
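A flattened Answer might look something like the following. This is a guess at the shape, not the real index mapping: the `flatten_answer` helper and every field name here are assumptions made for illustration.

```ruby
require "json"

# Merge the generic wrapper (the answering row) and its typed payload row
# into one flat hash suitable for indexing. All field names are illustrative.
def flatten_answer(answering, payload)
  {
    "document_id"     => answering.fetch("document_id"),
    "question"        => answering.fetch("question_nickname"),
    "answerable_type" => answering.fetch("payload_type")
  }.merge(payload.reject { |key, _value| key == "id" })
end

answering = { "id" => 7, "document_id" => 3, "question_nickname" => "narrative1",
              "payload_type" => "Narrative", "payload_id" => 12 }
payload   = { "id" => 12, "narrative_content" => "Some words." }

puts JSON.pretty_generate(flatten_answer(answering, payload))
```

The relational ids that only exist to join the two tables drop out, since a single flat record no longer needs them.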

Schemalessness redux

Everything above notwithstanding, I have certainly looked at a more fundamentally schemaless approach for a future/major rewrite. This is still drawing board stuff, but because of Elasticsearch, we already have a leg up: There is already a pure JSON representation of all (most) of the data in the system, for indexing. This representation could be used as the baseline for a document-based system in something like MongoDB, or even Elasticsearch itself.

The damage from such a rewrite would, it seems, largely be limited to the Answerable module and the Document model. The paging system would essentially be the same. The DocumentController code is already ignorant of the implementation underneath it. Validations could be retained on non-AR objects via ActiveModel.

Events

The Event type is an ActiveRecord model with no parallel Description type. (Though if more complicated event behavior were required in the future, an EventDescription could be developed without too much headache.) When a document is created/updated, an event is filed in an open state, like a ticket, and its id is queued for a Resque worker process.

The worker finds the event in an open state, and examines the event’s #work_order, which is a hash. (Again—good schemalessness, in that we can add information to this arbitrarily if we need to do a new kind of work around such an event.) From the work order, the worker learns what kind of event it was, and triggers the appropriate callback. This is how confirmation emails are sent on document submissions, for example. The worker also reindexes documents, answers, candidacies as necessary in ES. When the worker is finished, it marks the event as complete. This way, failed updates can reveal themselves as incomplete events in the database.
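The open/complete lifecycle can be sketched without Resque or a database. In this plain-Ruby approximation, events are hashes in memory, the `EventWorker` class and its `handlers` table are hypothetical, and the `"kind"` key in the work order is an assumed name.

```ruby
# Plain-Ruby sketch of the event lifecycle: open -> (work) -> complete.
class EventWorker
  def initialize(events, handlers)
    @events = events     # id => { state:, work_order: }
    @handlers = handlers # kind => callable
  end

  def perform(event_id)
    event = @events.fetch(event_id)
    return unless event[:state] == :open

    # The work order is a plain hash, so new kinds of work need no new columns.
    handler = @handlers.fetch(event[:work_order].fetch("kind"))
    handler.call(event[:work_order])
    event[:state] = :complete # reached only if the handler did not raise
  end
end

events = {
  1 => { state: :open, work_order: { "kind" => "submit", "document_id" => 3 } },
  2 => { state: :open, work_order: { "kind" => "reindex", "document_id" => 4 } }
}
handlers = {
  "submit"  => ->(order) { puts "confirmation email for document #{order['document_id']}" },
  "reindex" => ->(_order) { raise "search index unavailable" }
}

worker = EventWorker.new(events, handlers)
events.each_key do |id|
  begin
    worker.perform(id)
  rescue RuntimeError => e
    puts "event #{id} failed: #{e.message}"
  end
end
puts events.select { |_id, event| event[:state] == :open }.keys.inspect
```

Because the state flips only after the handler returns, a failed job leaves its event open, which is the property the post relies on to spot failed updates.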

Example 1: Submitting a proposal document shifts the user’s candidacy into a Reviewable state, and can automatically create review assignments for the proposal, notify project administrators, etc.

Example 2: Updating or submitting a review recalculates the aggregate numerical score of a candidacy; this is derived from information in the relational database, but the score itself is only stored in ES. This is yet another way in which ES functions as a sort of optimization cache: we predigest the data for searching/viewing on updates, and in doing so transform it into a structurally different representation.

Example 3: Suppose the grant project is a large, costly traveling exhibition that arrives in trunks. Before the user can begin reporting on the exhibit, we require that they fill out a report on the condition of the trunks. Submitting the condition report reveals the exhibit report, by altering the state of the user’s candidacy.

The event system is the mechanism by which users’ actions cause changes beyond individual document updates. It is also the place for any work that should not be completed during the main request/response cycle. Callbacks are defined at the level of the Document subtype (this is another example of STI): #post_submit, e.g., is defined differently on Proposal, Review, Supplement, FinalReport, and so on. (STI here has the advantage that new Document types may be declared on the fly without the worry of managing more complex kinds of polymorphism.)