Papermill: Introduction

Max F. Albrecht

Bachelor Thesis,
Bauhaus-Universität Weimar,
2013

0.1 About this document

This is the theoretical part of my bachelor thesis, ‘Papermill’. For the sake of usefulness, it doubles as the first version of an end user’s manual for the practical part of the thesis: a framework for writing and publishing long-form text using open source software, also named papermill.

The instructional character has several implications regarding its structure. For one, it is split into two documents.

After enumerating the goals of the project and how they are planned to be achieved, the Introduction starts with a Glossary. It explains several important technical topics, followed by an overview of the most important theoretical aspects of my research and practical work. This should serve as a foundation, familiarizing the reader with the general concepts and problems in the scope of this project.

The Manual is a step-by-step description of how to use the papermill framework to produce a publication. It will evolve alongside the framework.

The thesis is accompanied by the following two attachments:

0.1.1 (Typographic) conventions

monospaced
acronyms, technical terms, standards, trademarks and names of software. A block of monospaced text denotes a source code listing.

bold-monospaced
Glossary keywords

italic
emphasis, reference to chapter title

bold
strong emphasis

‘single quotes’
idiomatic terms and expressions

‘single quotes italic’
foreign terms

“double quotes”
quotation

Quoted sources are referenced numerically and listed in the References chapter at the end of each document.

External Links are used to refer to software or further information, like Wikipedia articles about background topics the reader might want to explore, but are not crucial to the understanding of the text. In the print edition, they are represented by footnotes.

0.2 Goals

There are 3 primary goals:

0.2.1 Publication development

Make the mode of production and tools used in (open source) software development more accessible for authors and writers.

  • Version control
  • Automation, Continuous integration
  • IDEs: Integrated Development Environments
  • Libraries: reusable modules

0.2.2 Cross-media publishing

  • Problem: one input, multiple outputs
  • Solution: semantic and structured content

Produce a document source which is as media-agnostic as possible, in the hopes that if it works for print and web today, it should be possible to adapt to the to-be-invented media of the future.

Requires careful weighing of options and features: The syntax has to be powerful enough to serve power-users, but needs to be friendly enough to not intimidate newcomers.

0.2.3 Long-term reproducibility

Make sure that once written, the source of a publication can be compiled into the desired output ‘forever’, or at least as long as computers exist.


From these, we can derive our secondary goals:

0.2.4 Plain text

If a document is written in a binary container (like a Microsoft Word file), the usefulness of a software-like development process is very limited.

0.2.5 Online & offline

The framework should be usable on a general purpose computer without requiring an internet connection.

At the same time, it should be possible to use it with just a web browser, because fewer and fewer personal computing devices are in fact ‘general purpose’.

0.2.6 Simplicity

A complicated system is harder to use, explain and extend than a simple one.

A long-term goal is to abstract away as much as possible from the user, while at the same time still providing all the necessary information for those who want to know what is happening in the background.

Most of the contents of the current ‘Manual’ will stay relevant and will simply be updated parallel to the framework. This ‘Introduction’ however will hopefully not be required reading in the future.

0.2.7 Free/Libre/Open Source

Using ‘open source’ software can be a goal in itself. Usually, this is a matter of personal or political opinion.

However, considering the goals already listed, the problems that arise from them and the existing approaches to solving them, we can safely conclude that there is no alternative to using ‘open source’ software, at least for the framework itself.1

0.3 Implementation

Implementation of the goals happens in the form of real-world prototypes. This means a minimal working solution is developed while an actual publication is produced along with it.

0.3.1 Version 0: phd.nts.is

The first prototype was written alongside Naomi T. Salmon’s PhD dissertation “Als ich Künstler war”.

Using the tools that were already available (Markdown, pandoc, git), everything from automation to templates was written (‘scripted’) for this specific publication. I also guided the usage of version control for collaborating with (proof-reading) editors. Furthermore, my role as a technical administrator provided valuable insight and inspiration for the next steps of the development process.

The dissertation can be downloaded in web and print format at the project website, while the source code (including the aforementioned prototype) is published on GitHub.

0.3.2 Version 1: This document

This document, my Bachelor thesis, was written alongside the development of several more prototypes.

The goal was to find generally useful solutions to the technical problems that are common to most publication projects.

  • Project Configuration specification
    • define meta-data, Inputs, Outputs
  • Stationery: publication templates

  • mill Command Line Utility
    • compiles papermill projects
    • inside: node.js module, usable locally and server-side
  • bookstrap: template and style for web output to complement the LaTeX templates
    • optimized for modern web browsers
    • designed for long-form text: readability, non-distracting, table/sidebar of contents
    • uses novel grid system based on typographic em-units
  • Bonus: Papermill.app, a graphical ‘drag-and-drop’ interface to compile papermill projects on Mac OS X

1 Glossary

1.1 Unix, Linux, *nix

Unix is:

  1. an operating system family with a history going back to 1969.

    Linux is a well-known member of this diverse family, so *nix is sometimes used as a more general term.

    Most of the internet runs on some kind of *nix. In fact, most computers today that don’t bear a Windows sticker probably run a variation of it, including Apple’s and Google’s computers, smartphones and tablets.

  2. a philosophy

    • everything is a (text) file!
    • Simplicity and modularity

1.2 FLOSS

“Free/Libre/Open Source” – the most unambiguous name for the concept of non-restrictive licensing.

1.3 HTML

The formatting language of the web. Invented by Tim Berners-Lee at CERN in 1989, it allows authors to write plain text and ‘mark it up’ using <tags>. By enclosing content in them, distinct elements of a document can be created, like headings, paragraphs, images, links, and so on.

Made specifically to be used with HTTP (the Hyper Text Transfer Protocol), which explains the meaning of the acronym: Hyper Text Markup Language.

Today, these two standards serve as the foundation of the web, along with CSS (for styling) and JavaScript (for interactive and programmatic elements).

Example: A document with a top-level heading with the text “Hello”, followed by a paragraph with the text “World!”

<html>
<body>
  <h1>Hello</h1>
  <p>World!</p>
</body>
</html>

1.4 TeX, LaTeX, *TeX

“I can’t go to a restaurant and order food because I keep looking at the fonts on the menu.”

— Donald Knuth [1, p. 321]

A typesetting engine, with a formatting syntax that doubles as a programming language.

Invented by Donald Knuth while trying to digitally typeset the second edition of his book “The Art of Computer Programming” (the hot metal machine used for the first edition was no longer available). Unhappy with the then state of typesetting software, he spent seven years programming the TeX system from scratch.

Today, several derivatives exist; LaTeX and XeTeX are among the most widely used.

Example: A document with a top-level heading with the text “Hello”, followed by a paragraph with the text “World!”

\documentclass{article}
\begin{document}
\section{Hello}
World!
\end{document}

1.5 WYSIWYG

“What you see is what you get” — promise made by word processors and other visual layout and design editors.

1.6 git

Git is a distributed version/revision control system dubbed “the stupid content tracker” [2, L. 3] and developed by Linus Torvalds, more commonly known as the creator of the Linux kernel. Just like Donald Knuth and his TeX project, Linus became so frustrated with the lack of (by his standards) good software to solve his problem that he put off work on Linux until git was usable.

For a step-by-step introduction to Version Control and git, see the chapter Versioning.

1.7 diff, (patch)

  • diff, the (noun)
  • diff, to (verb)

“The verb ‘diff’ is computer jargon, but it’s the only word with exactly the sense I want. […]

diff: An unselective and microscopically thorough comparison between two versions of something. From the Unix diff utility, which compares files.” [3, p. 224, 244]

A diff, in general, is a file which stores the differences between two files in a text format. This format can be read by humans, but more importantly it can be evaluated by a computer.

If there is an original file A and a different version of the same file B, a piece of software can produce B just by applying the diff between A and B to the file A.2 This process can also be called “patching”, which is why a diff is sometimes referred to as a patch.

As seen in the example below, a “diff” compares files line by line only.

Example: A small text file, another version of it, and the diff

  1. text1.txt:

    THIS IS A TEXT.  
    I MADE IT.
  2. text2.txt:

    THIS IS A TEXT.  
    I WROTE IT.
  3. output of “git diff text1.txt text2.txt”:

    --- a/text1.txt
    +++ b/text2.txt
    @@ -1,2 +1,2 @@
     THIS IS A TEXT.  
    -I MADE IT.
    +I WROTE IT.
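
To make the idea more tangible, the same comparison can be sketched in a few lines of Python using the standard library module difflib. This is only an illustration of the concept, not how git itself is implemented; the variable names are made up for this example. Since difflib cannot apply a unified diff, the ‘patching’ step at the end uses difflib’s own delta format instead.

```python
import difflib

# The two versions from the example above, as lists of lines.
text1 = ["THIS IS A TEXT.\n", "I MADE IT.\n"]
text2 = ["THIS IS A TEXT.\n", "I WROTE IT.\n"]

# Compute a line-based diff in the 'unified' format (the one git prints).
diff_lines = list(
    difflib.unified_diff(text1, text2, fromfile="a/text1.txt", tofile="b/text2.txt")
)
print("".join(diff_lines))

# 'Patching': rebuild the second version from a delta.
# ndiff/restore use difflib's own delta format, not the unified format.
delta = list(difflib.ndiff(text1, text2))
rebuilt = list(difflib.restore(delta, 2))
print(rebuilt == text2)
```

Note that the whole second line is reported as removed and re-added, although only one word changed: the comparison works on lines, not on words.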

1.8 hash

A hash is a concept from the area of cryptography.

A simplistic explanation would be to think of it as a kind of checksum of some content, which (mathematically) can only be calculated with access to the exact content (and not by guessing, for example).3

The hash function used in git (and many other software programs) is called SHA-1, which is why git users sometimes call the hash a ‘SHA’.

A textual representation of a hash looks like this: eb9095849a85a02e29c3fd7b4224dc4bd55c35e0. git can automatically abbreviate it to the shortest string that is still unique within the repository; in this case that could be: eb9095849a.
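
As a small illustration, the following Python sketch calculates the SHA-1 hash that git assigns to the content of a file. git does not hash the bare content, but prefixes it with a short header (the word ‘blob’, the length in bytes and a zero byte). The function name git_blob_hash is made up for this example.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Return the SHA-1 hash git uses for file content (a 'blob')."""
    # git prefixes the content with "blob <length in bytes>\0" before hashing.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


first = git_blob_hash(b"I MADE IT.\n")
second = git_blob_hash(b"I WROTE IT.\n")

print(first)       # 40 hexadecimal characters
print(second)      # a completely different value, although only one word changed
print(first[:10])  # an abbreviated form, as git would display it
```

The same content always results in the same hash, while even the smallest change results in a different one.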

1.9 Hidden file, -folder

Certain files or folders on a computer, for example configuration files belonging to software, can be hidden from the user by the operating system and/or file browser. This is usually done to keep the user from inadvertently modifying or deleting them: ‘seeing’ and editing them then requires some kind of setting, command line flag or other ‘trick’. In operating systems of the Unix family, there is a long-standing convention that files and folders whose names start with a . (dot) character are hidden.
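
The dot convention is simple enough to be sketched in a few lines of Python. The example creates a temporary folder with made-up contents (thesis.md, .git, .gitignore) and then separates visible from hidden names, roughly the way a file browser on a *nix system would. The function name split_hidden is hypothetical.

```python
import os
import tempfile


def split_hidden(folder):
    """Split the names in a folder into (visible, hidden) by the dot convention."""
    names = sorted(os.listdir(folder))
    visible = [name for name in names if not name.startswith(".")]
    hidden = [name for name in names if name.startswith(".")]
    return visible, hidden


with tempfile.TemporaryDirectory() as folder:
    # Made-up example content: one source file, one hidden folder, one hidden file.
    open(os.path.join(folder, "thesis.md"), "w").close()
    os.mkdir(os.path.join(folder, ".git"))
    open(os.path.join(folder, ".gitignore"), "w").close()

    visible, hidden = split_hidden(folder)

print("visible:", visible)
print("hidden:", hidden)
```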

2 Semantic Writing

“EFFECTIVE IMMEDIATELY!! NO MORE TYPEWRITERS […]
If word processing is so neat, then let’s all use it!”

Michael Scott, President of Apple Computer, 1980 (internal memo) [4, p. 1]

Historically, the semantic structure of any text has always been “embedded” into the final document by the author or writer using visual formatting. Any emphasis, division of the text into chapters, paragraphs and line breaks was (hopefully) copied in conjunction with the text.

With the introduction of the movable type printing press, this changed dramatically: Any text now had to be split up into its composing letters and spaces and then re-arranged, using only the available (lead) characters.

So, long before the widespread use of computers in the writing process and before the process now widely described as “Digitalization”4 even started, the printing press marks the transition to thinking about and working with text as a discrete (countable, ‘digital’) signal, as opposed to the monolithic, continuous signal it was seen as before.

Not surprisingly, the typical problems that have to be dealt with when converting any signal from continuous to discrete, from analog to digital, can be observed from this period on and are partly still unsolved.

Suddenly, the letters of the alphabet in use were not enough to properly “encode” a text so that it could be reproduced without losing either the content or the intent of the author. Jan Tschichold, one of the most influential typographers of the 20th century, was still complaining about the ambiguity of paragraph positions in the 1960s, when authors already used (also discrete) typewriters to write manuscripts:

“Thousands of working hours are sacrificed by typographers, getting the right order of letters written without indentation, with countless pencil marks and deep thought. This idling could be avoided if the manuscripts would be handed in formatted as described here.” [5] (own translation)5

Since then, the situation has become both better and worse: Nowadays, most authors produce their manuscripts digitally, which should leave less room for interpretations and errors. Yet, the vast majority of non-technical writers are using a WYSIWYG-based system such as Microsoft Word or Apple Pages, meaning the intent of the author is once again ‘entangled’ with the visual output (or even more so, as shown in the next section, Formatting).

Note that there are alternatives in the market, but these are mostly aimed at very complex projects. An example would be Adobe FrameMaker, which according to Wikipedia does allow input of “structured text”, but is aimed at “industries such as aerospace, where several models of the same complex product exist, or pharmaceuticals”. [6]

2.1 Formatting

The relationship between formatting, typography and design is a common source of confusion. On the one hand, it can be summed up quite simply:

  • Formatting conveys intent, thus is part of the document’s source
  • Design translates this source into an output, using typography

On the other hand, there are specific connotations embedded in our visual and cultural knowledge. These are shaped by a) how this translation was handled historically; and b) the user interface of word processors since the 1980s.

Bold/Italic buttons. From left to right: Microsoft Word 2.0 (1989); Apple Pages (2009); WordPress (2012); Apple iOS 6 (2012)

As the examples show, the interface of most word processors visualizes ‘emphasis’ as ‘italic’ and ‘strong emphasis’ as ‘bold’. While this is consistent with how these semantic intentions are usually expressed typographically, it shows the general problem of the WYSIWYG paradigm: the semantic structure of the document is once again ‘embedded’, thus uncertain.

2.2 Markup and Markdown

The alternative approach, sometimes called “What you mean is what you get” in response to WYSIWYG, is to use a Markup Language, like HTML.

Continuing with our example, in HTML a phrase is emphasized by enclosing it in a “<em>”-Tag.

If there is no associated (CSS-) instruction for the browser how to style this (semantic) tag, it uses the built-in default, which defines “emphasis” as “italic”.

As an example, this is how this definition looks in the source code of the Safari web browser:

em {
  font-style: italic;
}

Webkit Default CSS, Lines 993-995 (truncated)

Another well-known language for writing structured text, at least among scientific and technical authors, is the syntax used in TeX, LaTeX and other TeX-like systems.

However, both HTML and TeX share the same problems. They are:

  1. too complicated to use for the average user
  2. very verbose, requiring an excessive amount of typing
  3. media-specific, HTML is for the web and TeX for print

A popular approach for the first two problems is called Markdown.

Markdown’s creator, John Gruber, presents it as follows:

“Markdown is intended to be as easy-to-read and easy-to-write as is feasible. […] The single biggest source of inspiration for Markdown’s syntax is the format of plain text email.

To this end, Markdown’s syntax is comprised entirely of punctuation characters […] carefully chosen so as to look like what they mean. E.g., asterisks around a word actually look like *emphasis*. Markdown lists look like, well, lists. Even blockquotes look like quoted passages of text, assuming you’ve ever used email.”

He also clarifies the relation to HTML:

“HTML is a publishing format; Markdown is a writing format.”

However, this means that Markdown also has a media-specific heritage, but the basic syntax is focused enough on semantic elements to be generally useful.

There are several extensions to Markdown trying to solve this and other shortcomings of the basic syntax. The most popular include MultiMarkdown, PHP Markdown Extra, and pandoc’s Markdown, which is the most complete of them.

pandoc is a program that converts Markdown to HTML, like the original implementation Gruber released together with his specification, but also extends the concept in several important ways. It was written by John MacFarlane, himself also a scientist and author, so his program, the supported syntax extensions and novel output options (namely TeX) are a natural fit for the papermill framework and its most important basis.
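
The principle of ‘one semantic source, several outputs’ can be illustrated with a toy converter in Python. To be clear, this is not how pandoc works and it only understands a single Markdown feature, emphasis marked with single asterisks; the function names to_html and to_latex are made up for this sketch.

```python
import re

# Matches *emphasis* written with single asterisks (nothing else is supported).
EMPHASIS = re.compile(r"\*(.+?)\*")


def to_html(text):
    """Render one paragraph for the web: emphasis becomes an <em> tag."""
    return "<p>" + EMPHASIS.sub(r"<em>\1</em>", text) + "</p>"


def to_latex(text):
    """Render the same paragraph for print: emphasis becomes \\emph{...}."""
    return EMPHASIS.sub(r"\\emph{\1}", text)


source = "Asterisks around a word look like *emphasis*."

print(to_html(source))
print(to_latex(source))
```

The source only states that a word is emphasized. How this emphasis looks is decided later and separately for each output format.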

3 Versioning

In general, Version Control is the act of collecting, labeling, ordering and indexing all the different revisions of a document. By extension, this also tracks the changes made between those revisions, making it possible to retrace the development of the document and possibly even the thought process of the author.

These drafts and revisions and their comparison have spawned diverse studies in the literary sciences. Yet, there is a danger of losing this raw material as more and more authors move to produce their textual work using a computer.

This danger was the starting point for this whole project as well, sparked by Cory Doctorow’s essay “Extreme Geek” [7], where he writes about this problem and how he solved it for himself. As Doctorow summarizes in a blog post:

“I was prompted to do this after discussions with several digital archivists who complained that, prior to the computerized era, writers produced a series [of] complete drafts on the way to publications, complete with erasures, annotations, and so on. These are archival gold, since they illuminate the creative process […]. By contrast, many writers produce only a single (or a few) digital files that are modified right up to publication time, without any real systematic records of the interim states between the first bit of composition and the final draft.” [8]

Being a blogger, a digital activist and an all-round-nerd, he falls into the target group of ‘technically involved’ authors, so it comes as no surprise that he found a highly technical solution. He commissioned a piece of software (flashbake), which automatically keeps track of his work in 15-minute-intervals.

Underneath, it uses the git version control system, which in recent years has slowly become the de-facto standard for Version Control and collaboration in open source software projects. Its usage grew hand in hand with the popularity of GitHub, a git hosting service providing a complete ecosystem, including a web view for all files and content of the repository and project management features like issue tracking. An open source, self-hosted alternative with similar features is called GitLab; another service with a different focus but similar hosting options is Bitbucket.

3.1 Git

Sources for this guide and further reading: “The Git Parable” [9] and “Pro Git” [10]

This chapter breaks down what one needs to know about git to an absolute minimum. For example, it won’t explain how to use the git command line, or any other git interface.

Some newer graphical interfaces (especially GitHub’s GUI apps) make working with git so easy that the first half of this chapter is condensed to the click of a single button; the second half means 3-4 clicks in their web interface.

But: since your document’s history should be as important to you as it is to Mr. Doctorow, I really want you to understand the concepts and nomenclature behind it.

The hope is to give you peace of mind that your work is saved and safe. Though there is a video of Linus Torvalds saying exactly that6, only with some background knowledge can you start to really trust the system. Moreover, this same knowledge should enable you to learn how to use any git interface in a relatively short time.

3.2 Repository

The most basic term one needs to know when using git is a “repository”. It is a purposefully general term, but it helps to think of it as a ‘folder’:

This is quite accurate: If you directly edit files in a git repository on your computer, it will be there just like any other folder (in git terms, this is your ‘working directory’). In our case, this folder contains all the files related to a publication:

  • Most importantly: the text files (‘sources’)
  • Any non-text assets that are part of the publication, like images and figures.
  • Any additional files that need to be tracked, or just shared.

So, what makes this folder special? Inside it, there is one more thing: a hidden folder called .git!

As a user, you never directly use this folder, but it is good to know where it is and what its purpose is.7 It contains a lot of meta-data and also a small database-like storage. The git software reads and writes to this storage, facilitating all the nice things described further in this chapter. It may sound like “magic” at times, so keep in mind: It is just a very simple (but clever) program which reads and writes to this small database inside your repository.

3.3 Clone and Fork

If a repository is not started (‘initialized’) locally, it first has to be cloned.

A clone is a copy of a repository. If the clone has changes, it is considered a fork.8

Let’s have a look at how changes are made in the first place.

3.4 Commit

“Committing” is the activity of saving your changes into the git database; the result of this is also called a commit. Every commit contains the state of all the files in your repository at a certain point in time. Since we are talking about “Versioning”, it is best to think of every commit as a “version” of your project. You can later use git to go “back in time”, revisiting or even restoring an older version; or to get a list of changes between two specific commits. As we will see later in the chapter, commits can also be sent around – this is how git is used for collaboration.

Let’s take a look at how to do a commit:

First, you should review your changes. If you are happy with them, you need to tell git which files you want to commit. This is called staging. It allows you to make several changes at the same time, but only commit a fraction of them each time. In software development, this is mostly done to break up the changes into smaller pieces, making them easier to review on the receiving side.

After all changes that should be included are staged, you can proceed with the actual commit. It includes:

  • Your name and email
  • A commit message (if you supplied one)
  • The state of your files, from which git can derive a diff of your changes
  • A hash of all those items

The commit message is meant to explain the changes you’ve made. Depending on the context, it might be addressed to yourself, collaborators like editors or co-authors, or anybody looking at a repository’s history.

In the software world, this message is very important, because the text being worked on is source code, meant to be interpreted by a computer. Any changes can have side effects which might be non-obvious by just looking at the changes.

When working with a human-language document source however, most of the time the changes in the text don’t need to be explained since the intent is already apparent from the changes. In this case, the message can be omitted or even automatically generated, containing information about the circumstances of the commit (location, name of the computer, etc).

The hash of each commit is calculated. It can be used as a unique version number because it refers to a specific commit in a repository.

Furthermore, the hash of every commit is used by git in the background to make sure that your content has not been changed, be it by error, accident or malicious intent: because every commit also records the hash of its predecessor, changing any earlier version would change all later hashes as well. In this way, your history is (cryptographically speaking) secured much like the content of your online banking website.
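
Why a hash protects the whole history can be shown with a strongly simplified model in Python. The record format below is an assumption made for this sketch and not git’s real storage format; the function make_commit and the example author are made up. The important part is that every commit includes the hash of its predecessor (‘parent’).

```python
import hashlib


def make_commit(author, message, content, parent=None):
    """Return the hash of a simplified commit record (not git's real format)."""
    record = "\n".join([
        "parent " + (parent or "none"),  # hash of the previous commit, if any
        "author " + author,
        "message " + message,
        "",
        content,
    ])
    return hashlib.sha1(record.encode("utf-8")).hexdigest()


author = "Jon Doe <jon@example.com>"  # made-up example author

# Two honest versions, the second one building on the first.
first = make_commit(author, "first draft", "I MADE IT.")
second = make_commit(author, "rephrase", "I WROTE IT.", parent=first)

# Somebody changes the first version after the fact.
tampered_first = make_commit(author, "first draft", "I FAKED IT.")
tampered_second = make_commit(author, "rephrase", "I WROTE IT.", parent=tampered_first)

print(first != tampered_first)    # the changed version gets a different hash
print(second != tampered_second)  # and so does every version after it
```

Anyone who knows the hash of the latest commit can therefore detect a change to any earlier version.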

3.5 Branch

We already established that clones are copies of a repository, and forks are clones with any changes not found elsewhere.

But git is even more flexible: There is also the possibility of having a complete copy of the repository inside your local copy. These “built-in” copies are called branches.

They make it possible to work on an isolated copy of the complete project, for example while working on something that is not ready to be included in the ‘main’ project, but still should be committed in small steps.

There is always at least one branch in every git repository. By default this branch is called ‘master’, which is nothing more than a default name.9

New branches are always based on a commit of an existing branch. This commit is the starting point of the branch, or where your changes branch off, just like in a tree.
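
How a branch ‘branches off’ can be modelled in a few lines of Python. The sketch assumes a very reduced picture: a branch is just a name that points to its latest commit, and every commit remembers its predecessor. The commit names c1, c2, c3 and the functions create_branch, commit and history are made up for this illustration.

```python
# Every commit remembers its predecessor; every branch points to its latest commit.
parents = {"c1": None, "c2": "c1"}
branches = {"master": "c2"}


def create_branch(name, start_from):
    """A new branch starts at the latest commit of an existing branch."""
    branches[name] = branches[start_from]


def commit(branch, commit_id):
    """Add a commit on top of a branch and move the branch forward."""
    parents[commit_id] = branches[branch]
    branches[branch] = commit_id


def history(branch):
    """List all commits of a branch, newest first."""
    result, current = [], branches[branch]
    while current is not None:
        result.append(current)
        current = parents[current]
    return result


create_branch("draft", "master")
commit("draft", "c3")

print(history("draft"))   # contains the new commit and the shared history
print(history("master"))  # unaffected by the work on 'draft'
```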

3.6 Collaboration

If we break the process down into individual steps, it should sound familiar to anyone who has ever collaborated on text documents with others. Even when using paper, they are the same:

  • Obtain a copy of the document(s) (clone)
  • Make some changes, review and save them (commit)
  • Instead of sending the complete changed document back, formulate just the changes10 (diff)
  • Inform the source of the document of your changes, asking it to integrate them

Example letter:

Dear Sir/Madam,

attached you find my changes to your files. 
I kindly ask that you apply them to your source.

Regards,
 Jon Doe

---
Changes:
- In the file "doc.md", 3rd line, 1st character, 
  I have changed the word "hello" to "world".

But instead of doing all these steps manually, we have already learned that git takes care of the cloning, branching and committing; and that every commit can be read as the difference between the new version and the old version.

So, how do we send our changes?

Technically, there are many ways to do this. git can create an email for you not unlike the example letter; this is how collaboration on the Linux kernel and many more projects is mostly dealt with.

The email model is fitting for a project like Linux, where the mailing list archives form a public record of which changes were proposed by whom, the discussion around them, and if, when and how they were integrated.

However, there are other built-in ways to share commits which are much simpler to use (and automate).

3.7 Push

Because git is a distributed version control system, there is no inherent need to have a central server, or any internet connection: everything can be done locally/offline.

Yet, it is possible to use any number of remotes, which are again copies of the repository, outside of it. They can be an actual server, but it is also possible to use any storage, like an external USB drive. Web interfaces like GitHub and GitLab are based on the ability to use them as remotes and offer additional features based on the data you send there.

Once a remote is set up, the commits can be pushed there.11 This ability can be used for backups and syncing, but most importantly for sharing the code with collaborators.

Remote repositories also have branches, and pushing always happens from a local branch (e.g. ‘master’) to a specific branch in the remote. This remote branch can be the master branch as well (if it is your own project), or a different one. Common strategies are to have a branch per collaborator, or one per topic.

Furthermore, git only allows a linear history in each branch. That means you will only be able to push if your changes are based on all the changes in the remote branch (otherwise some of them could be lost). To get these changes before we can send our own, we have to pull first.
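
This rule can be expressed as a small check in Python. The sketch assumes that the history of a branch is simply a list of commit names from oldest to newest, which is a simplification; the function can_push and the commit names are made up.

```python
def can_push(local, remote):
    """A push is only possible if the local history contains the complete remote history."""
    return local[:len(remote)] == remote


remote = ["c1", "c2"]

# Our changes are based on everything in the remote branch: pushing works.
print(can_push(["c1", "c2", "c3"], remote))

# Someone else pushed "c2" in the meantime, and our work is based on "c1" only.
# Pushing would lose "c2", so we have to pull (and merge) first.
print(can_push(["c1", "x1"], remote))
```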

3.8 Pull & Merge

After new commits are pushed to a remote repository, everyone (with access) can pull them into one of their local branches.

The formal way of asking someone to pull your changes is called a “Pull Request”, or a “Merge Request” when using GitLab.

Both names have some truth to them, because pulling, like pushing, happens from branch to branch and only allows linear history.

Because of this, pulling actually happens in two stages: first, the remote changes are fetched from the remote, then they are merged into the desired local branch.

If there are any conflicts, they have to be solved before the merge can be completed (it can also be aborted).

A conflict happens if the same part(s) of a file were changed in both branches.

Example:

  • Base text file

    I <3 free software!
  • Changes made in branch A:

    I love free software!
  • Changes made in branch B:

    I <3 open source!
  • Trying to merge B into A yields a warning: “Merge conflict in file.txt Automatic merge failed; fix conflicts and then commit the result.”

  • The file now looks like this:

    <<<<<<< HEAD    
    I love free software!
    =======
    I <3 open source!
    >>>>>>> B
  • Manually solving the conflict. The result can then be committed:

    I love open source!
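
The decision whether a change can be merged automatically or results in a conflict can be sketched in Python. This is a simplified model and not git’s actual merge algorithm: it assumes that all three versions have the same number of lines and compares them line by line, whereas git first has to find out which lines belong together. The function merge_lines is made up for this illustration.

```python
def merge_lines(base, ours, theirs):
    """Merge two changed versions of a text, line by line, relative to their common base."""
    if not (len(base) == len(ours) == len(theirs)):
        raise ValueError("this sketch only handles versions with the same number of lines")

    merged, conflicts = [], 0
    for b, o, t in zip(base, ours, theirs):
        if o == t:      # both sides agree (or nobody changed the line)
            merged.append(o)
        elif o == b:    # only the other side changed the line
            merged.append(t)
        elif t == b:    # only our side changed the line
            merged.append(o)
        else:           # both sides changed the same line differently
            conflicts += 1
            merged += ["<<<<<<< HEAD", o, "=======", t, ">>>>>>> B"]
    return merged, conflicts


base = ["I <3 free software!"]
branch_a = ["I love free software!"]
branch_b = ["I <3 open source!"]

merged, conflicts = merge_lines(base, branch_a, branch_b)
print("\n".join(merged))
print("conflicts:", conflicts)
```

If the two branches had changed different lines, the function would have combined both changes without reporting a conflict.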

The much better solution is obviously trying to not create conflicts at all. This requires a rigorous workflow if more than two people are involved in the project, but can be summed up as:

  • Commit early!
  • Push often
  • Pull regularly

There are many ways to deal with this in bigger projects; they are outside the scope of this introduction. However, when using a ‘1-branch-per-collaborator’ model this problem is partly remedied. If everyone always only pushes to their own branch, there should not be any surprising changes. Similarly, a ‘1-branch-per-topic’ model ensures that this is not a practical problem by just first pushing to a new remote branch (with the name of the topic), and subsequent changes into the same one.

Finally, this example shows one of the advantages of web interfaces like GitHub: When a ‘Merge Request’ is created, it is automatically checked for conflicts. If there are none, the merge can happen directly on the server, so the result can simply be pulled without handling a merge.

4 Appendix

4.1 License

This publication and all related software are ‘free and open source’, licensed under the MIT License:

Copyright (c) 2013 Max F. Albrecht

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

4.2 Colophon

Made with:

Published using fonts from Adobe, released under an open source license: [12]

  • Source Sans Pro
  • Source Code Pro

4.3 Acknowledgments

NTS, for having the confidence to write a 193-page dissertation using a very early prototype. The Kunst-Technik-Einheit staff. Everybody at MR, for providing a space where ideas like this can grow; jd, marv, eick et al. for listening to my ramblings. CC for sponsoring this. Richard Stallman, for basing a religion-like ideology on the idea that software should be free (as in freedom, not as in free beer). The Open Source Initiative (OSI), for establishing the more marketable term ‘open source’ (which can be understood without bringing up beer). Donald Knuth, for creating TeX etc. Linus Torvalds, for creating and maintaining Linux and git. Chris Wanstrath, Tom Preston-Werner and PJ Hyett, for founding GitHub and exposing the world to git through a world-class service. John Gruber, for the original Markdown specification. John MacFarlane, for pandoc, its Markdown extensions and related software. Brendan Eich, for creating JavaScript. Ryan Dahl, for creating node.js, and Isaac Z. Schlueter at Joyent, for keeping it running. Lakshan ‘laktek’ Perera, for his software punch, which convinced me to learn js and node for this project. Jeremy Ashkenas, for underscore, docco, etc. Everyone from the node.js community, especially nodejitsu, for flatiron and all the other modules. <3

4.4 References

[1] “Notices of the AMS,” vol. 49, no. 3.

[2] L. Torvalds and others, “Git Readme.” [Online]. Available: https://github.com/git/git. [Accessed: 16-May-2013]

[3] P. Graham, Hackers & Painters. Big Ideas from the Computer Age. O’Reilly Media, Sebastopol (CA), 2010.

[4] S. Ditlea, “An Apple On Every Desk,” Inc., New York City, 1981 [Online]. Available: http://www.inc.com/magazine/19811001/2033.html

[5] J. Tschichold, Erfreuliche Drucksachen durch gute Typographie. Eine Fibel für jedermann. Maro Verlag, Augsburg, 1988.

[6] “Adobe FrameMaker.” [Online]. Available: https://en.wikipedia.org/wiki/Adobe_Framemaker. [Accessed: 01-Aug-2013]

[7] C. Doctorow, “Extreme Geek.” [Online]. Available: http://www.locusmag.com/Perspectives/2009/05/cory-doctorow-extreme-geek.html. [Accessed: 06-Aug-2012]

[8] C. Doctorow, “Flashbake: Free version-control for writers using git.” [Online]. Available: http://craphound.com/?p=2171. [Accessed: 13-Feb-2009]

[9] T. Preston-Werner, “The Git Parable.” [Online]. Available: http://tom.preston-werner.com/2009/05/19/the-git-parable.html. [Accessed: 07-Aug-2013]

[10] S. Chacon, Pro Git. Apress, 2009 [Online]. Available: http://git-scm.com/book

[11] R. E. Silverman, Git Pocket Guide. O’Reilly Media, Sebastopol (CA), 2013.

[12] P. D. Hunt, “Source Sans Pro: Adobe’s first open source type family.” [Online]. Available: https://blogs.adobe.com/typblography/2012/08/source-sans-pro.html. [Accessed: 02-Aug-2012]


  1. The user’s operating system, text editor, etc. can of course be proprietary or ‘Open Source’.

  2. On a *nix operating system, this program is itself called diff (from “difference”) and gave the concept its name.

  3. A hash can thus be used to prove that one was in possession of a specific content (like your document) at a certain point in time, just by publicly releasing the hash (but not the document). If the document is published at a later point, anyone can recalculate its hash and check that it matches the previously released ‘proof-of-existence hash’.

  4. In German: ‘Digitalisierung’.

  5. Original: “Tausende von Arbeitsstunden werden von Typographen geopfert, um einzugslos geschriebene Briefe durch unzählbare Bleistiftangaben und Nachdenken richtig zu ordnen. Dieser Leerlauf ließe sich vermeiden, wenn die Manuskripte gleich in der soeben beschriebenen Art abgeliefert würden.”

  6. Quote: “I guarantee you, if you put your data in git: You can trust, that 5 years later, after it was converted from your hard disk, to DVD, to whatever new technology, and you copied it along, […] you can verify that the data you get out is the exact same data you’ve put in.”
    (He goes on to explain that, in the past, there was an attempt to smuggle bad changes into the Linux kernel by physically breaking into a data center, which adds some gravitas to his testimony.)

  7. Namely, if you move or copy the repository folder from one place to another (disk, computer, …). If the folder is copied as a whole, the .git folder is still inside, meaning the complete versioned history is included, for good or bad. If just single files from the folder are copied somewhere else, the .git folder is not copied with them, so the history/database is not included.

  8. Not to be confused with how ‘fork’ is used in software development, where a project might split into two new ones with different goals, using the old code base as a starting point.

  9. “There is nothing special about the name ‘master’ apart from convention.” [11, p. 4]

  10. Keep in mind that if you don’t do it yourself, whoever you send the document to has to do it.

  11. If the repository was cloned, the source is already configured as a remote with the name ‘origin’.
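
The ‘proof-of-existence hash’ described in note 3 can be illustrated with a minimal sketch. It assumes SHA-256 from Python's standard library as the hash function (the note does not name a specific algorithm), and the function names and the sample text are hypothetical, chosen only for illustration.

```python
import hashlib


def proof_of_existence_hash(content: bytes) -> str:
    """Return the hex digest that could be released publicly instead of the content.

    Assumption: SHA-256 is used here; any cryptographic hash would illustrate the idea.
    """
    return hashlib.sha256(content).hexdigest()


def verify(content: bytes, published_hash: str) -> bool:
    """Recalculate the hash of the (now published) content and compare it."""
    return proof_of_existence_hash(content) == published_hash


# Hypothetical document that is kept private at first.
document = "My unpublished manuscript.\n".encode("utf-8")

# Step 1: release only the hash.
released = proof_of_existence_hash(document)

# Step 2: once the document is published, anyone can check it against the hash.
print(verify(document, released))
# A document that differs in even one character no longer matches.
print(verify(document + b"!", released))
```

Because the hash alone reveals nothing readable about the content, it can be released early, while the document itself stays private until publication.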