SPDCA: Page Associations

In the weird, elitist, math-y world of computer science, it's rare that a thing is called elegant. Elegant,used in such a context, has a somewhat different meaning than the common understanding (in the same way that "theory" means something different within the scientific realm than it does outside). Elegance is a quality assigned to something that has simple rules, but vast and interesting consequences. An elegant solution for a problem solves it in an interesting and subtle way, usually in a way that is otherwise tedious or impossible to do otherwise.

My sense of elegance is probably a little warped, as my appreciation for algorithmic design and mathematical proofs is pretty much zero in the face of practicality and work deadlines. Elegance to a true computer scientist is being able to perform all iteration through functional recursion. Elegance to me is something that works surprisingly well, and fits the size of the problem at hand without overstepping boundaries or introducing unneeded dependency or complexity.

Finding new elegant things is rare. Often, they are subtle, and hard to recognize unless I have to implement them myself. When I do find them, I feel giddy at their discovery, and afterwards the buzz is diminished by the fact that what makes a solution elegantis the problem it is solving, not how well I can apply it elsewhere. Like finding the perfect millimeter wrench, and then realizing you will probably never find a bolt that size again (Murrica!).

Working within Giterary, I found something interesting, and in my limited definition, elegant.

The problem was this: Mediawiki, MoinMoin, and Liquid Story Binder all have the concept of document "association." That is, for a given document, based on its contents, the applications can tell you what implied and explicit links between two documents that an author has drawn. Mediawiki and MoinMoin support associations between documents based on what wikilinks exist between the documents. Liquid Story Binder simply associates commonly named files together into arbitrary containers.

Mediawiki and MoinMoin both support a "What Links Here" feature, where you can ask their databases "What pages link here?"Rather than searching the entirety of the page contents, both can access a specialized database (MySQL for Mediawiki, and a custom database for MoinMoin) that is maintained in the background which can tell them very quickly which pages have the current one as its target.

Mediawiki and MoinMoin also both support the concept of "Orphaned" pages and "Wanted" pages. Orphan pages are pages that are targeted by no other documents (accessible only by searching). Wanted pages are those that, while targeted from another document, do not actually exist anywhere in the wiki. Having the ability to see orphans is valuable if you're trying to maintain an inclusive structure ("I want to be able to navigate to all pages somehow...") while wanted pages are useful to track work by allowing you to ask "What holes can I fill in?"

Of course, I wanted these features. But with Giterary, there is no "special database" to use in this way, and adding something like it would violate a number of the design principles I started with ("At the end of the day, everything in Giterary is just files," "Everything within reason should be editable with a text editor," "Don't use proprietary or closed-source storage formats"). The Git repository database doesn't store much more than directory state, and while that is sufficient to solve the problem, you would have to search the entire directory every time you ask for Orphaned or Wanted pages. It would also introduce complexity in what it would take to install the software, and introduce a dependency on something that was completely optional.

The more critical reason for not using an external database, though, is that performing a Git synchronization would lose all associative information. If you spend a ton of time architecting a story and file structure, you shouldn't be expected to abandon it. Your data should be portable, and persistent.

One solution would be to have a single, binary database (like SQLite, or BerkleyDB) that would live alongside the repository. SQLite and BDB are fairly easy to deal with, and are well supported by PHP. However, for limited or locked-down installations of PHP, the libraries for SQLite and BDB are not installed, and additionally, require the installation of SQLite and BDB external libraries. If you're using an shared hosting solution (like idkfa is, and many frugal web developers are), you don't always have control over your system. While simpler, you would still be introducing dependency and difficulty for someone trying to get Giterary installed. It would also mean that you would have files within your database that you would need the SQLite or BDB tools to query, if you ever wanted to do so.

Another solution would be to create a human-readable database file that could be "queried" upon user request. While this would work, it would not scale well. As you added more and more files, with more links between them, you would have to read the file in its entirety (into memory), perform your query/operation on the file, and if necessary, write the file back. Only one person could do this at a time, which for a web application is an unacceptable design choice ("Your web application can only have 1 person saving a file at a time? What?"). Additionally, the larger the file got, the more memory it would take to perform functions on it, and the more likely a shared web host would kill your process for trying to use too much.

So:

  • An external database is out (complexity, dependency)
  • An internal, binary database is out (complexity, dependency, non-human-readable)
  • A custom, internal, human-readable database is out (performance, scaling)
  • Git won't store association data, just the files you're interested in
  • You don't want to lose your association data, and you want it to carry between Git repositories, or Giterary instances

Also, another requirement is that you're a programmer, and it isn't enough to borrow features from your competitors: you have to make them better. Why should your associations be limited to just the links between? What if you're using templates, or other special "associations" that aren't just hyperlinks? What if you want to make up your own associations, given them their own names, etc.?

The solution? ...Your files and directories are a type of database. A very specialized one, but given the right organization, can store interesting information.

Here's what you do:

  1. Create a special directory in your repository. Call it "Associations."
  2. For each file in your database, create a file under Associations (Associations/A, Associations/B,etc.)
  3. For each file in your database, also create a directory counterpart under Associations based on the naming of the file in step #2 (Associations/A.dir, Associations/B.dir,etc.).
  4. For each associated file with the above files, maintain a file under the directory counterpart based on the target (Associations/A.dir/B, Associations/B.dir/C, etc.).
  5. Whenever you save a file, detect all "targets" from the saved "source," and update the directory counterparts with the targets (also cleaning up non-existence targets along the way).
  6. Whenever you delete a file, delete the Association source file (Associations/A) and its directory counterpart (Associations/A.dir).
  7. Whenever you move a file, delete the Association source file (Associations/A) and its directory counterpart (Associations/A.dir), and then rebuild the source at its new destination (Associations/D,Associations/D.dir,Associations/D.dir/B, etc.).

This abuses the semantics of files/folders such that you can ask questions like:

  • For all targets (Associations/*/*), show those that do not have sources (Associations/*) (Wanted pages query)
  • For all sources (Associations/*), show those that are not targeted (Associations/*/*) (Orphaned pages query)
  • For this source (Associations/A), who do I link to (Associations/A.dir/*)? (All page targets query)
  • For this source (Associations/A), who links to me (Associations/*/A)? (What links here query)

This solves the basic problems (Wanted Pages, Orphaned pages, source/target associations), but still leaves a few questions:

  • What if a file's name or path has a "/" in it? That will cause problems, as "/" represents a path component boundary.
  • What if you want more than 1 type of association, and you wanted to distinguish between them?

The file naming problem can be solved by finding a way to translate a full path (regardless of the characters within it) to a "normal" file name. In this case, you can use a hash like MD5 on a file name/path or Base64 encoding, and use that instead of A or B. In the case of MD5s, you can store the "original" path in the source and target directories within the files themselves.

The association type problem can be solved by adding another directory layer in between the sources and their targets:

Associations/A.dir/link/B
Associations/A.dir/template/C
Associations/A.dir/list/D
Associations/A.dir/table/D

This means you can ask questions like:

  • What associations do I have to tables for a given document?
  • What are all of the associations of type template?
  • What are all of the association types the database knows about?

There are also a few nice things that are as a result of this storage mechanism:

  • As the associations are stored as files, and the files are versioned by Git, when you use Git to synchronize, or to "go back in time," the associations for that particular directory structure are maintained. You don't have to "rebuild" the associations, Git can track them just like any other file.
  • The associations, while not super easy to navigate just by browsing, could be used externally, or even used to build directed graphs just by listing the files. A savvy author who maintained references between scenes, chapters, etc., could later use the data to show the references in graphical form.
  • No external dependencies are introduced, and the storage remains consistent with all other files in the database.
  • The storage mechanism is syntax independent, meaning that even if I decide to change my mind on how to express associations, or perhaps make associations be assignable outside of a document's contents, the structure doesn't care.
  • It's fast. Quite fast. In fact, I'd initially introduced a bug that would attempt to scan all associations, rather than just the ones I was interested in for orphans/wanted pages. The entirety of the process took ~100ms, which after I found the bug, was reduced to ~30ms. Improvements could still be made.

I'll admit that there are a few cons:

  • The repository contains files that the author doesn't maintain directly. Thankfully, every association can be tucked away in the special "Associations" folder, but it is ugly that you have a few hundred tiny files in your repository for which you have no direct involvement with.

  • If using the MD5 hashes to determine unique, non-colliding filenames, it's not super apparent what is being edited, or whose associations are being modified.

  • For each edit, you have to rebuild your associations. This would have to happen either way, but you also have to commit your association changes. This leaves you with two options:

    • Include your association changes with the edit commits, such that you see at 1 file change for the file in question, and 0 or more changes with the original change depending on whether associations changed.

    • Perform your association changes in a separate commit.

    For my particular solution, I opted to use a separate commit so that I could easily try to separate the "association" commits from the "real" commits. The interface I have set up currently by default hides the "associative" commits.

So, there you have it. Taking a file structure, and building another file structure alongside it to model the relationships between the original file structure.

This is something I consider elegant, as it solves the problem at hand, is practical, efficient, and consistent within the technical limitations/specifications of the project.

#5173, posted at 2013-06-13 17:08:17 in Indiscernible from Magic