Apply, Part 2: Overview, and the Document Model
This installment will provide an overview of the Apply system, primarily of the treatment of documents. (See Part 1 for the basic explanation of why this software exists in the first place.) The rest of the system is based on how documents are handled, so this is probably the most basic introduction possible.
This guide will deliberately ignore many important parts of the system—the administrator interface, reviews, the attached content management system, (mostly) the search engine—but once the document system is clear, those things will be much easier to explain.
The Apply system is based on a model of documents—collections of answers to questions, given in some paginated order, which a user can (a) begin, (b) save and return to, and (c) eventually finalize by clicking a Submit button.
A typical workflow is as follows: A project (a grant opportunity, say) begins with a proposal document. (This is what one might be tempted to call an “application”; unfortunately “application” is so loaded w/r/t its meaning in software that it had to be banished from this particular vocabulary.) By submitting a proposal, the user initiates a candidacy for that project.
When the proposal (née “application”) period ends, completed proposals are referred to some number of reviewers. The review process may be as simple as a numerical score and a comment, or as complicated as a multi-page form of its own. (To Apply, reviews are just another kind of document.) Review scores are averaged, considered by project administrators, and winners are selected.
After awards are distributed, winners may be asked to complete supplementary documents, e.g., a schedule of public events or proof of insurance for traveling exhibits. These additional documents are collated under their candidacy along with the original proposal. At the end of the project term, most winners will also submit a final report evaluating the grant. At all times, project administrators need visibility (search, export, list) into all documents/candidacies.
Regardless of what other choices I would make, I wasn’t going to uproot this document-based workflow. Users and administrators both understand it and find it intuitive. So I started there.
Logically at least, (if not literally at the level of software abstractions) documents can be thought of as composed of answers of various kinds, collected and/or presented in some order. The key thing here is that whatever abstraction we use for documents, we must be able to associate it with (a) arbitrarily many answers, of (b) arbitrarily many types. Because it must characterize all documents, the model cannot make any assumptions about the answers it contains. (How many, what order, what type, etc.)
Given the above, it’s reasonable to wonder whether a “schemaless” (document-based, NoSQL, etc.) database is the right tool for the job. Spoiler alert: ultimately, I did not use such a database—Apply uses ActiveRecord. The reasons why bear mentioning though.
First, my own knowledge was a factor. Especially at the time, (summer 2011) I had a lot more experience with SQL/relational databases. This isn’t an ideal constraint, but it is a real world one: I was going to have to build and maintain this system, and to some extent schemaless (etc.) databases represented a source of uncertainty.
But second, and more important: in a cost/benefit analysis, I’m not sure schemaless DBs are the right tool anyway. If your data is truly schemaless, in that you need arbitrary values under a single type, it may be necessary.
I don’t think our data is truly schemaless. The fact that the document system must remain agnostic about answer types does not mean it is schemaless: this would be to mistake encapsulation or information-hiding for schema properties. By remaining agnostic about answer types, Document is saying “I don’t know what (sub)schemas I will support,” not “I support no schemas at all.” I take “schemaless” here to mean “no restriction on the composition of individual type instances”—as in, for a given type A, instances of A may differ in their composition, data members, their types, etc.
Contrast this with the case where, for any given type, that type does have a schema, but the code that coordinates it must remain agnostic about what type it is dealing with at any given time. (This looks more like our situation.) There is nothing schemaless about this arrangement, because in every case where a record is in play—say, a Contact—that record has a schema, imposed on it by its type.
When you look at it that way, it’s not clear what part is supposed to be schemaless. Elsewhere in the application we expect (for example) Contact objects to support at least a minimal kind of API: fname, lname, email, etc. These values may be nil, but we expect the methods to be present. Any object that satisfies the API is basically already “schema-fied”, at least insofar as the API is a schema.
Similarly, for Narratives, we expect #narrative_content to return a kind of text entity; AttachmentGroups should only contain instances of Attachment, and so on. This looks (to me) like a plurality of fine-grained schemas, not schemalessness. Even if we permitted arbitrariness, some sort of interface still has to be spec’d out on top of it. I don’t really see what a schemaless database gets us here.
The other main benefit of many schemaless/document-based/NoSQL systems is horizontal scalability. But it’s extremely unlikely the Public Programs Office will actually require scalability beyond that possible with MySQL or Postgres. (File under ‘problems we’d love to have.’) In the unlikely case we did, projects could probably be partitioned to different servers at the level of the HTTP proxy. At this point, the liabilities (no ACID guarantees, eventual consistency, split-brain problems, less experience with code) don’t seem to be yielding much obvious benefit.
I vs. O
The downside of SQL/relational systems (to me, or for this project) is that they are clunky to search. For this, we use Elasticsearch. When documents are created or updated, they are (asynchronously, via Resque) indexed in ES, at the granularity of individual answers, entire documents and more. (See below for more information.)
The idea is that data enters the system via ActiveRecord, and the relational database is the canonical copy. Data by and large leaves the system (exports, reviewer menus, etc.) by way of Elasticsearch, which can be verified/regenerated from the SQL. SQL/relational systems provide a good interface for writing: ACID guarantees, enforced schemas, etc. Elasticsearch provides an interface ideal for reading: search, faceting, plenty of capacity for dormant projects, etc. So that is how we use them.
The exception to this rule is the show action, as called on a document via the web. This always pulls the document directly from the database, rather than ES, due to the slight lag time of the background worker. This prevents users from seeing confirmation screens missing their most recent update. For large documents it can be a little expensive, but this happens relatively infrequently and is worth not getting those phone calls.
Fudging a few details, the document system is based on two corresponding hierarchies of ActiveRecord models. The first comprises the specification of the project—the types of documents it requires, their questions, their pagination order. The second hierarchy consists of records specific to a user, each corresponding to a type in the first hierarchy:
| Description Model | State Model |
| --- | --- |
| Project* | Candidacy |
| Template | Document |
| Page | Paging |
| Question | Answering |
<span class=image-caption>* Technically, this is “ProjectDescription,” but for clarity’s sake we’ll worry about that later.</span>
Candidacy is the relationship between an applicant and a grant opportunity. This is the master record under which all of their stuff is collected, and (if anything) this is the record that is flagged as “selected” when someone wins a grant.
Document is a collection of Pagings and Answerings, such as a grant proposal or a report. Documents are owned by Candidacies, and can be created, saved, submitted, unsubmitted, etc.
Paging is like a ticket for a certain Page, denoting that the user has begun and possibly completed it. Pagings are arranged in a tree structure, to represent the various paths a user might have taken through a document. This is necessary because many documents require page jumping behavior, (i.e., selecting a path through the document based on prior answers, like a Choose Your Own Adventure) and all documents require that users be able to go back and edit their answers.
Answering is a metadata record for a given answer. (The actual user data are stored as instances of model classes in the Answerable module; every Answering has one. For now, we can just say that Answering wraps its Answerable.) For any Question, a document may have an Answering. Questions don’t belong to Pages, and Answerings don’t belong to Pagings, for the page jumping reasons above: A single Question may appear on different Pages for the user, depending on their path through an application.
As a user traverses a Template, Page by Page, they build up a Document, Paging by Paging, composed of Answerings (and their Answerables).
From the perspective of the user, the models on the right “store state” in a way the ones on the left don’t. I use the shudder quotes because in fact they do store state—they wouldn’t be in the database otherwise. But that state holds items like question prompts, validation requirements, etc. We may update them from time to time, but to the user they (ideally) appear eternal. The state the user is concerned with—their essay responses, their contact information, whether they finalized their form or not—is contained entirely in the hierarchy on the right.
This difference is reflected in the fact that while candidacies, documents, pagings, and answerings are all backed by their own database tables, all of the models in the left-hand hierarchy (Project, Template, Page, and Question, in case the table is above the fold) share a single database table, and inherit from the same class: Description. This arrangement is called “single table inheritance.” An extra column is added, “type”, which stores the name of the inherited class to use. ActiveRecord supports this out of the box.
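To make the mechanism concrete, here is a minimal plain-Ruby sketch of the STI idea: every row shares one format, and a “type” column names the subclass to instantiate. (ActiveRecord does this for you; the class and row names below are just illustrations.)

```ruby
# A shared backing format: nickname, parent_id, and a data hash.
class Description
  attr_reader :nickname, :parent_id, :data

  def initialize(nickname:, parent_id: nil, data: {})
    @nickname  = nickname
    @parent_id = parent_id
    @data      = data
  end

  # Mimic AR's STI instantiation: pick the subclass named in the row.
  def self.instantiate(row)
    const_get(row.fetch("type")).new(
      nickname:  row["nickname"],
      parent_id: row["parent_id"],
      data:      row["data"] || {}
    )
  end
end

# The subclasses differ only in behavior, not in backing format.
class Project  < Description; end
class Template < Description; end
class Page     < Description; end
class Question < Description; end

row = { "type" => "Question", "nickname" => "narrative1",
        "data" => { "max_words" => 350 } }
question = Description.instantiate(row)
question.class             # => Question
question.data["max_words"] # => 350
```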
STI gets a bad rap sometimes, and at least some of it is unfair. The complaint is generally (some variant of) that it’s a confusing idiom that doesn’t always do what you’d expect.
STI is appropriate here because Description classes really do all have the same backing format, and differ only in their behavior. (This being the condition under which STI is appropriate.) This is intentional and by definition. That data format is: a “nickname,” (a database-independent identifier for use within the project) a parent_id, (the id of another Description; the relationship is a self join on the Description table) and a serialized hash object, whose contents are given (normally) in a YAML file.
In fact, the database-backing of Description objects is ultimately a convenience. Projects are specified entirely in these YAML files, using directories for hierarchical relationships:
```
~/src/apply/local/projects$ tree sample
sample
├── phases
│   ├── pre.yaml
│   ├── proposal.yaml
│   ├── reporting.yaml
│   └── review.yaml
├── project.yaml
└── templates
    ├── proposal
    │   ├── pages
    │   │   ├── contact.yaml
    │   │   └── narratives_and_uploads.yaml
    │   ├── questions
    │   │   ├── attachments.yaml
    │   │   ├── contact.yaml
    │   │   ├── narrative1.yaml
    │   │   └── narrative2.yaml
    │   └── template.yaml
    └── review
        ├── pages
        │   └── review.yaml
        ├── questions
        │   └── score_and_comments.yaml
        └── template.yaml

8 directories, 15 files
```
The loader traverses the directory structure, resolves each individual description object by nicknames, and updates/creates the serialized hash object using the contents of the file. (In theory the database could be bypassed or replaced entirely by any format capable of supplying a hash.)
Some directories have a “node” file: project.yaml for project wide information; template.yaml for any given template. Some directories contain collections: questions, pages, phases. Examples:
```yaml
# sample/project.yaml
nickname: sample
full_name: Sample Grant Project
```

```yaml
# sample/templates/proposal/template.yaml
nickname: proposal
document_class: Proposal
starting_page: contact
flat_page_order:
  - contact
  - narratives_and_uploads
```

```yaml
# sample/templates/proposal/pages/narratives_and_uploads.yaml
nickname: narratives_and_uploads
question_order:
  - narrative1
  - narrative2
  - attachments
text:
  title: Narrative Answers and File Uploads
  info: "On this page you can write some words and upload some attachments."
terminal: true # show a submit button
```

```yaml
# sample/templates/proposal/questions/narrative1.yaml
answerable_type: Narrative
nickname: narrative1
max_words: 350
public_max: 300 # tell them it's 300 but leave margin of error
text:
  title: "Narrative 1"
```
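As a minimal sketch of the loading step (illustrative, not the actual loader code), a spec file parses into a plain hash and gets registered under its nickname:

```ruby
require "yaml"

# Parse a question spec, as in the samples above, into a plain hash.
spec = YAML.safe_load(<<~YAML)
  answerable_type: Narrative
  nickname: narrative1
  max_words: 350
  public_max: 300
  text:
    title: "Narrative 1"
YAML

# Register it by nickname, roughly as the loader resolves objects.
descriptions = {}
descriptions[spec["nickname"]] = spec

descriptions["narrative1"]["answerable_type"] # => "Narrative"
```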
Description objects really are schemaless. This is basically why STI works so well with them: STI usually breaks down (it seems?) when people discover that two models joined at the table really do need different backing schemas after all. This is not really an issue if the only “schema” is that you require a hash. The benefit is that we can provide a single abstraction (Description) that can be used to build any type of project specification we can dream up. E.g., note above we use the same abstraction for a project phase as we do for a question. I’ve recently been toying with a Description type to model custom exports.
Apply represents the hierarchical relations above as ActiveRecord associations: a Candidacy has_many Documents; a Document has_many Pagings and Answerings, etc.
While the actual Document instance contains only metadata (ownership, current state, create/update timestamps, etc.), it provides the main interface for updating all of its child records. Through that interface we can treat the entire document as if it were a single record. The stock update methods like update_attributes won’t be sufficient here, because other work needs to be managed around an update. We abstract all of that work into the method #red_tape_update(attributes, options). Attributes is a normal ActiveRecord hash of attributes. The options control other behavior: a submit flag, to (attempt to) finalize the document; an extended checks flag to trigger validations that should only happen when a user is leaving a page; etc. This method returns a Boolean indicating the success/failure of the update.
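A hedged sketch of that contract follows. Every name except #red_tape_update itself is an illustrative stand-in, not the actual implementation:

```ruby
class Document
  attr_reader :attributes, :state, :queued_events

  def initialize
    @attributes    = {}
    @state         = "in_progress"
    @queued_events = []
  end

  def red_tape_update(attrs, options = {})
    @red_tape = true                     # lightly enforced interface flag
    @attributes.merge!(attrs)
    return false if options[:extended_checks] && !extended_checks_pass?
    @state = "submitted" if options[:submit]
    @queued_events << :document_updated  # secondary work is handled async
    true
  ensure
    @red_tape = false                    # only set for the duration
  end

  private

  # Stand-in for validations that only run when leaving a page.
  def extended_checks_pass?
    @attributes.values.none?(&:nil?)
  end
end

doc = Document.new
doc.red_tape_update({ "narrative1" => "Some words" }, submit: true) # => true
doc.state # => "submitted"
```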
The main side effect of #red_tape_update is, if the update goes through, to queue an appropriate Event (see below) to asynchronously handle any secondary work, like reindexing answers in Elasticsearch. This method also temporarily sets the @red_tape instance variable to true. To (lightly) enforce the interface, instances of Document include a validation on this ivar. (It’s easily defeated if you need to—just enough to remind you not to if it’s an accident.)
Building a document
In normal operation, for a template to be available to a user, three conditions must obtain:
- The current project phase must permit that template’s documents to be created/updated. You can’t keep updating your proposal after the application period closes, say.
- The user must be eligible for that template, on the basis of the state of their candidacy. Applicants who were not given an award shouldn’t be able to start a final report, say.
- If the document already exists, its own state must permit the update: if you’ve already submitted, you can’t make updates. (There’s a special document state, “unsubmitted,” that both meets this criterion and confers a free pass on the first condition. Useful for admin-sanctioned edits, accidental submission, etc.)
If all of these things obtain, the template is made available to the user with a nice big Begin link. When the user clicks that link, (a) the template produces the first page (as in, an instance of Page); (b) the first page generates a list of questions to be presented to the user; (c) assuming they click Save and the document passes validation, answers are created/updated, and a paging is created referencing that Page and made the current paging; (d) if they clicked Proceed, the template determines the next page and repeats the process there.
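A toy walkthrough of steps (a) through (d), with plain hashes standing in for the real Description and state models (all names here are illustrative):

```ruby
# Each "page" knows its questions and which page follows it.
pages = {
  "contact" => { questions: ["contact"], next: "narratives_and_uploads" },
  "narratives_and_uploads" =>
    { questions: ["narrative1", "narrative2", "attachments"], next: nil },
}

doc = { current: nil, pagings: [], answers: {} }

save_page = lambda do |page_nickname, answers|
  doc[:answers].merge!(answers)   # answers are created/updated
  doc[:pagings] << page_nickname  # a paging referencing that Page...
  doc[:current] = page_nickname   # ...is made the current paging
end

save_page.call("contact", "contact" => { "fname" => "Ada" })
next_page = pages[doc[:current]][:next] # the template determines the next page
save_page.call(next_page, "narrative1" => "Some words")

doc[:pagings] # => ["contact", "narratives_and_uploads"]
```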
Pages and Pagings
The idea is that Pages, given a user document, dynamically generate a list of Questions for that document. As the user traces a path through the document by clicking Proceed, they leave a trail of Pagings, which as noted are organized in a tree structure. (Courtesy of the Ancestry gem.) The path ending in the document’s current paging represents the active path through the document.
Again, this is required to properly support templates that determine page order dynamically based on previous answers to questions. This feature was required in the very first project for Apply, and is used regularly for all kinds of purposes. (It also may be the most complicated part of the entire codebase, and will hopefully get its own entry in this series.)
My stalking horse when designing it was the Choose Your Own Adventure novel: I required (somewhat arbitrarily) that the system should be capable of hosting a CYOA novel using the page system. A darker example was that ghoulish feature of the GRE wherein early correct answers lead to more valuable questions later; both can be modeled in Apply.
The document itself stores its current_paging_id as part of its own state; this paging will be considered the most advanced node in the branch. For example, when a user calls the URL to edit their document, with no parameters passed, they will be presented with the page denoted by the current paging.

(Note that this is implemented with a belongs_to association on Document. In no real sense does a document “belong to” its current paging. The difference between has_many and belongs_to in AR is simply where the foreign key is stored. We want documents to keep track of their own most recent paging, so the foreign key is stored in the documents table—hence “belongs to.”)
If you traverse the current paging’s ancestors until you find a paging with no ancestors, you have the active path through the document. This is used to generate the full view of the document: beginning with the root paging, the show action gets its page, which generates a list of questions; if the document has any answers for those questions, they will be shown. This is repeated on the next paging, and so on, until the current paging is met. (Documents can therefore respond to the show action as soon as they are created, regardless of completeness.)
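The ancestor walk can be sketched in plain Ruby. (The real tree is stored with the Ancestry gem on ActiveRecord models; this is just the idea.)

```ruby
# Each Paging knows its parent; the active path runs from the root
# paging down to the document's current paging.
Paging = Struct.new(:page_nickname, :parent) do
  def ancestors
    node, path = self, []
    while node.parent
      node = node.parent
      path.unshift(node)   # root ends up first
    end
    path
  end

  def active_path
    ancestors + [self]
  end
end

root    = Paging.new("contact", nil)
current = Paging.new("narratives_and_uploads", root)

current.active_path.map(&:page_nickname)
# => ["contact", "narratives_and_uploads"]
```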
Here is the definition of a valid document in Apply: if a user begins on a valid start page and proceeds legally through zero or more following pages, the document is valid. If the current page is marked as a terminal page, the user will be offered the opportunity to submit their document. Any valid document that is currently on a terminal page is submittable.
The motivation: Suppose (as has been the case) on the first page of a document, a user is asked what sort of institution they work at. (A public library, museum, state organization, etc.) Depending on their answer, they will receive different paths through the application. As they are preparing to submit, they realize they made a mistake in that field. Correcting it will now send them through a different path through the application—to ensure their path through the document is valid, their original access to the submit button must be revoked, and they have to at least click through each page to the final one. They may mostly be asked the same questions in a different order though; their original answers should survive. This use case (hereafter referred to as “page-jumping”) alone was sufficient to prohibit any strong “ownership” relationship between Page/Question and Paging/Answering. Documents have a pool of Answerings, and a list of Paging tickets. That’s it. Everything else is determined on the fly.
Earlier we said Answerings “wrap” their Answerables, and this composite entity is (sort of) an “Answer.” What does that mean exactly?
The relationship between a document and the answers that comprise it has two interesting qualities:
- It’s one-to-many: a single document is composed of many answers; and
- It’s polymorphic: the answers must be of many kinds (contacts, narratives, uploads, etc.)
As it turns out, this particular combination is not naturally expressed by relational databases. ActiveRecord provides the first condition with has_many. If, say, a Blog has_many Posts, calling blog.posts says, “look in a table called ‘posts’ for rows with my (i.e., the blog’s) id in the blog_id column.”
ActiveRecord provides the second condition, polymorphism, via polymorphic: true. This allows a model to belong to another model whose type is only known at runtime. Here, ActiveRecord says, “consult the table named by my own owner_type column for the id given in my own owner_id column.”

In both of these cases, the calling object can resolve both the id and the relevant table in order to find its relations. What you can’t do (directly) is a polymorphic has_many, i.e., where object X has many associated objects whose type is unknown. For a model to resolve its relations, it must know both what to look for—the key, usually an id—and where to look for it, i.e., what table to query. The key is known, as this will be the id of the calling object. But since it doesn’t know the type of its related objects, it doesn’t know where to look for the key.
The solution is to create a new model, Answering, that provides a generic representation of an Answer. A document has_many answerings, and each answering belongs_to an Answerable, as its #payload. The sample question spec above contains the key answerable_type: Narrative.
When an answer(ing|able) is initialized, the answering consults this information to determine what kind of payload to build. This involves overriding the #build_payload method generated by ActiveRecord, and replacing it with one that hooks into the Description system:
```ruby
# answering.rb
def build_payload(*params)
  self.payload = Answerable.const_get(question.describe(:answerable_type)).new
  # more work goes on here...
end
```
With this, answerings build their payloads polymorphically, and respond to all of the normal nested model behavior in ActiveRecord.
The “join model” Answering is really the only concession that was necessary for the strictures of working in a relational database. It’s a concession only in that it complicates things with another model; and in certain use cases it may serve a useful role of its own, e.g., tracking the last user to edit an answer in a collaborative document. (No such functionality exists yet, but it’s been discussed.) In that case, you wouldn’t want to duplicate those columns in every Answerable table; it’s natural to store them in the Answering table.
And, when we’re not working in relational strictures, an Answering and its Answerable payload can be combined into a single record, an Answer: this is how they go into Elasticsearch. Because ES is effectively schemaless, we can flatten these into a single type. (These types are then owned by a Candidacy type in ES, and accessed via a long has_child query, but that’s a topic for another post.)
Everything above notwithstanding, I have certainly looked at a more fundamentally schemaless approach for a future/major rewrite. This is still drawing board stuff, but because of Elasticsearch, we already have a leg up: There is already a pure JSON representation of all (most) of the data in the system, for indexing. This representation could be used as the baseline for a document-based system in something like MongoDB, or even Elasticsearch itself.
The damage from such a rewrite would, it seems, largely be limited to the Answerable module and the Document model. The paging system would essentially be the same. The DocumentController code is already ignorant of the implementation underneath it. Validations could be retained on non-AR objects via ActiveModel.
Events
The Event type is an ActiveRecord model with no parallel Description type. (Though if more complicated event behavior was required in the future, an EventDescription could be developed without too much headache.) When a document is created/updated, an event is filed in an open state, like a ticket, and its id is queued for a Resque worker process.
The worker finds the event in an open state, and examines the event’s work order, which is a hash. (Again—good schemalessness, in that we can add information to this arbitrarily if we need to do a new kind of work around such an event.) From the work order, the worker learns what kind of event it was, and triggers the appropriate callback. This is how confirmation emails are sent on document submissions, for example. The worker also reindexes documents, answers, and candidacies as necessary in ES. When the worker is finished, it marks the event as complete. This way, failed updates can reveal themselves as incomplete events in the database.
Example 1: Submitting a proposal document shifts the user’s candidacy into a Reviewable state, and can automatically create review assignments for the proposal, notify project administrators, etc.
Example 2: Updating or submitting a review recalculates the aggregate numerical score of a candidacy; this is derived from information in the relational database, but the score itself is only stored in ES. This is yet another way in which ES functions as a sort of optimization cache: we predigest the data for searching/viewing on updates, and in doing so transform it into a structurally different representation.
Example 3: Suppose the grant project is a large, costly traveling exhibition that arrives in trunks. Before the user can begin reporting on the exhibit, we require that they fill out a report on the condition of the trunks. Submitting the condition report reveals the exhibit report, by altering the state of the user’s candidacy.
The event system is the mechanism by which user actions cause changes beyond individual document updates. It is also the place for any work that should not be completed during the main request/response cycle. Callbacks are defined at the level of the Document subtype (this is another example of STI): the relevant callback is defined differently on Proposal, Review, Supplement, FinalReport, and so on. (STI here has the advantage that new Document types may be declared on the fly without the worry of managing more complex kinds of polymorphism.)
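The lifecycle can be sketched roughly as follows. (Names are illustrative; the real system files ActiveRecord events and queues their ids for Resque workers.)

```ruby
# An event is filed in an open state, like a ticket, carrying a
# free-form work-order hash.
class Event
  attr_reader :work_order, :state

  def initialize(work_order)
    @work_order = work_order
    @state      = "open"
  end

  def complete!
    @state = "complete"
  end
end

# A worker dispatches on the work order, then closes the ticket;
# events left open reveal failed updates.
def work_off(event)
  result =
    case event.work_order["kind"]
    when "document_submitted" then :send_confirmation_email # and reindex, etc.
    when "document_updated"   then :reindex_answers
    end
  event.complete!
  result
end

event  = Event.new("kind" => "document_submitted", "document_id" => 42)
result = work_off(event)
result      # => :send_confirmation_email
event.state # => "complete"
```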