This document describes yet another instantiation of my attempts to create a sane reference manager that could actually improve rather than degrade my bibliography/PDF management workflows.
- Introduction: doing two things and doing them well
- Basic usage
- Implementation
- Client/server protocol
- Specific difficulties to consider
- Footnotes
Why yet another reference manager? The first goal of Tkacz is to be usable for very large bibliographies in the field of historical sciences, with a strong focus on contemporary history of science, although it shouldn’t be restricted to a specific field. Current document/management software fall in one of two categories, each with its own shortcomings:
\paragraph{Document managers} usually have very powerful classification, ordering and annotation features, but a very poor data model, which makes it hard to actually meaningfully and formally describe documents by providing formal bibliographic references.
Most document management software also assume documents are somehow unique and sui generis, where historical sources tend to have significant histories: they may be reprinted, republished, collected and updated, and all of this matters, both for the history of the document and its availability: knowing where to find an obscure article is important. In the “hard” sciences, the most complicated (significant) thing that can happen to an article is to be withdrawn, and all you’re expected to do in this case is not to cite it anymore; but for historical research, withdrawals or reprints are actually significant.
It is also assumed that documents are files, usually PDFs. But it’s often the case that documentation is scattered between different libraries, printed and electronic copies, that not everything has a PDF and not every PDF is complete.
\paragraph{Reference managers} make it easy to reference documents you don’t actually own (and often make it the default), and usually have a powerful (or extensible) enough data model to accommodate a lot of information, but they usually fall very short when it comes to classification. Beyond that, the (potential) power of the data model is not necessarily reflected in all UIs, and some changes that are theoretically possible in the data model will make the data actually unusable. For instance, handing authors not as strings, but as objects is theoretically feasible in the BibTeX format; but such files would be unusable by the Bib(La)TeX family of programs.
\paragraph{}
The goal of Tkacz is to “do two things, and do them well”: to serve as a document manager with the full power of a bibliographical data manager. That means:
- Nicely deal with complex or unusual entities: manuscripts, unpublished works, posthumous publications, etc. This means a more complex data model that most reference managers provide, and if possible an extensible data model.
- Prefer objects over strings for qualified properties. Authors, for example: attribution is important, and attribution can be tricky; it’s thus better that the
author
field takes a series of person objects, optionally parameterized, instead of a list of strings. This helps solve three problems: a) it allows to keep notes on persons as well, which can be quickly retrieved and are connected to this person’s works; b) it nicely handles cases where a given person published under multiple names, or published anonymous works, making it easy to present the complete list of someone’s works, regardless of the name they were published under; c) it also, obviously, makes typos much easier to detect and correct. - Provide advanced classification mechanisms: hierarchical or flat collections, tags, etc. Because we use objects rather than strings, a good deal of classification can be automated: works by this author, works in this journal, etc. Classification also allows to consider objects as topics or more generally classes: it is not uncommon, eg. in philosophy or literature, that the topic of a book is the author of other books. Friedrich Nietzsche is the author of the Geneology of Morals, and the topic of Heidegger’s Nietzsche.
- Provide just as powerful annotation features.
- Express relationships between entities: an article and a reprint, or a withdrawal, a book and its editions, a multipart work and its parts, etc.
- Help represent the history of references: an article may have been published in a journal, then reprinted multiple times in different books. Where do I look for Canguilhem’s 1966 article “Mort de l’homme ou épuisement du cogito?”, if I can’t get the original journal? Has it been translated into English, if I need to cite it in an English article, and where to find the translation?
- Make it hard to introduce inconsistencies or typos in the database. Journals, like persons, are objects: if the object describes the publishing rate, then no reference could contain a volume/issue number inconsistent with publication date without at least generating a warning.
- Create usable bibliographic databases for use with (La)TeX (and other reference managers). The data model described above is way too expressive for what BiBTeX or BibLaTeX can handle, so Tkacz needs to be able to build
.bib
files on-demand with a given subset of its db. - Rather than supporting “file attachments”, provide a general mechanism to describe copies of a reference, some of which can be files, but would also include things like: it’s in this library at this call number; it’s in my personal library, I have a printout in this box, etc.
- Make the database history easy to browse, review, and understand. Nothing frightens me more than undocumented and non-versioned binary databases with quirky UIs that make it hard to get a grasp of what’s going on (yes, Papers, I’m looking at you). The database will be in a Lisp-like syntax (think XML with parentheses), with Git integration out of the box. A change = a commit, with a meaningful message. This leaves the user free to rebase, reorder or squash commits before pushing, and should make it trivial to keep a perfectly clean history.
when started with M-x tkacz RET
, Tkacz shows a list of all references it has in store. It can also show a list of any other type of entities: to do so, press e, then select the entity type you want. There are four by default: references (r
), persons (p
), publishers (b
), and journals (j
).
By default, entities are displayed in so-called natural format, they can also be shown in tabulated format by pressing ===.
There are multiple ways to create new references:
- Press
n n
in the references view to display an input form where you can manually fill fields. This is the most tedious way, and should generally be avoided. n u
will prompt for a URL, then do its best to build a reference out of it. If possible, it will assimilate the associated PDF as a copy of the reference.n U
does the same in a loop, which is useful if you’re browsing the web in search for documentation (terminate withC-g
). To create references from a web browser, simply configure it to call(tkacz/create-reference-from-url)
or (tkacz/create-reference-from-html)
on the Emacs daemon.- Similarly,
n f
will prompt for a file,n F
will do so in a loop. n d
will show a drop area on which you can drag and drop virtually everything, although with a strong preference for URLs and PDFs.
From the list view, press <RET>
to open or focus editor view.
Tkacz classification system is made of two distinct mechanisms: taxonomies and contexts.
Taxonomies are hierarchical trees whose branches and leaves may contain entities of various types.
:context nil
- if true, this category is treated as a context.
:incontexts nil
- the contexts this category is meaningful in.
:superclass t
- if true, this branch contains the entities its children contains.
:accept-entities t
- if
:accept-branches t
By default, children inherit all of their parent properties. This can be overriden by passing each property a second argument to be used as the default value for its children. Eg, the root context taxonomy has the property :context nil t
Branches can be of two types:
- Standard branches
- Selection branches are queries (think of OSX’s Smart Folders)
Contexts are branches and leaves of a taxonomy. Contexts are how Tkacz help manage huge collections of possibly unrelated entities. If you’re working on, eg, your PhD in history of psychiatry, you don’t want all your computer science articles collection popping up in the list. Contexts are taxonomies, but the contract with the UI is different:
- Contexts are used as first-order filters. In the default UI,
C
is used to toggle between contexts. - When toggling back to a previous context, secondary filters are to be restored as they were.
What’s good is a personal library if you can’t find anything inside? Tkacz comes with two powerful query systems. The coolest one is a formal search syntax, the fastest one is full-text search.
Formal queries are especially useful for building collections and taxinomies. They take the following form:
((type book)
(by MichelFoucault)
(date (between 1960 1980)))
Multiple values can be searched on a single selector. Into French Theory?
((type book article)
(by GillesDeleuze JacquesDerrida JacquesLacan MichelFoucault)))
Need the complete works of someone, including books they edited?
(((author editor) PierreBourdieu))
Notice the car
of each s-expression is the field, the whole cdr
is values.
Standard boolean operators are available, of course:
(not (and (author RobertStoller) (author RobertGreen)))
(or (date (between 1910 1930)) (date (between 1950 1965)))
Some basic capture and logic is available. You can search for a book by at least two of a group of authors by searching like this:
;; Set the original author list
(let ([authors (AlonzoChurch KurtGödel AlanTuring)])
;; Do twice
(repeat 2
;; Capture the matched author as capt
(capture capt (by authors))
;; Remove the matched author from list before searching again
(set authors (remove capt authors))))
- How do we search for, eg, people who wrote books?
- How do we restrict search to given taxonomic branch?
- Need specification searching text fields. We need “like”, “contains”, “starts with” and “regexp match”, etc.
Just type ?
in the UI, and type some search terms. This is actually just another formal search: Eg, searching for “popper logic” actually generates:
((fulltext "popper" "logic"))
Very similarly to Emacs itself, Tkacz is written and configured in Lisp[fn:1].
Tkacz has strong implicit typing, but its type system comes with its own set of quirks. The set of types contains the usual suspects (strings, integer, floats, booleans, lists, sets…) along with algebraic sums and products, called either
and struct
. The primitive types are as boring as can be expected, but either
and struct
are the tricky ones.
Structs are the root of all Tkacz’s types. they’re a combination of named fields of a specific type. A typical struct would look like this:
(tkacz/deftype natural-person-name
(prefix :type string)
(first :type string)
(middle :type string)
(von :type string)
(suffix :type string)
:read '((string) . 'tkacz/string-to-natural-person-name)
:show '(tkacz/natural-person-name-to-string))
(tkacz/deftype person
(natural :type bool :default t)
(name :type ,(if tzo/natural-of-person-name self
natural-person-name
string))
(abbrev :exists ,(tzo/person-natural self)
:type string))
The tkacz/deftype
macro creates a series of functions in the tzo/
pseudo-namespace:
- a function called
tzo/make-[structname]
(here,tzo/make-person
), which takes an unspecified list of arguments and consumes them in the order the fields are defined. - a function for each reader, in the form
tzo/read-structname-from-type1-type2-...
(here,tzo/read-person-from-string
) - a function
tzo/show-person
. - accessor functions for each field, of the form
tzo/field-name-of-struct-name
. - type-predicate functions for each field, of the form
tzo/field-name-of-struct-name-is-type-p
.
The constructor returns a Tkacz struct object, which is a list of the form:
(tzo/struct person
:natural (:type boolean :value t)
:name (:type natural-person-name
:value (tzo/struct-natural-person-name
:first "Louis"
:von "de"
:suffix "Broglie"))
(tzo/struct person
:natural (:type boolean :value nil)
:name (:type string :value "Federal Bureau of Investigation")
:abbrev (:type string :value "FBI"))
To do so, it populates each field in order. Populating in order is useful because the type or even existence of fields may depend on other fields. In this example:
- The type of
name
depends on whether this is a natural person or not. Natural persons have names in five parts[fn:2] which can’t be abbreviated, others have a simple string.
Either is a rough equivalent of Haskell’s |
. It defines a sum type which can be of any of a finite set of type. A simple example of either
is:
(either string number)
A field of this type can be, guess what, either a string or a number. Unlike structs, either isn’t enough to define a type, and can only be assigned as the type of a struct’s field. Ether’s are resolved at struct constructor level, and don’t appear in the object itself but are replaced by a value of the chosen type. For example, if the above definition was the type of a field called a
, the struct object would only contain:
(tzo/struct struct-name :a (:type integer :value 1))
Entities are the essential Tkacz type. They’re defined from structs, but unlike structs, entities are named root objects, not values. Entity names start by an uppercase letter, and they’re defined with the (tkacz/defentity ENTITY-NAME STRUCT-NAME)
macro:
(tkacz/defentity Person person)
Everything Tkacz is meant to keep information about is an entity. The most important type of Entity is of course Reference, which stores a bibliographic reference.
Taxonomies are trees. Taxonomy objects are structs with the following attributes:
Name | Default | Meaning |
---|---|---|
name | required | The name of this branch |
parent | nil | parent branch |
gender | true | whether this branch is a gender |
showempty | false | Whether to show this branch if empty |
parent
is nil at the root branch of a tree.- An
electric
branch is created programmatically, and won’t be serialized. This implies thataccept-subtrees
andaccept-entities
are false and all children inherit them as false. - A
gender
is a branch which contains the leaves of its children (the way, in biology, a gender is made of its species) showempty
hides a branch and all its subtrees if they contain no entities, and only in this case.
There are two kind of branches: standard and queries. Query branches can do two things: they can treat their result as a list of entities, or as a list of branches which each receive a result and use it on a second, standard query.
The behavior of query branches is defined by their gender
field. If gender
is true, these branches contain their results as leaves, and subbranches may contain other queries which refine the original query (ie, they apply on the first result set, so subbranches are necessarily strict subsets of their parents)
Query branches have an extra query
attribute, which holds the query.
Also, query branches:
- cannot have results be manually added/removed.
- non-gender query branches cannot have subtrees added/removed.
The Tkacz program is a simple server able to talk to various clients. The CLI program itself isn’t made to be used by humans, but only for programs to interact with it. The initial implementation uses s-expressions for requests and responses, because a they’re) really easy to parse; and b) both the server and original client are written in Lisp. Client writes to server’s stdin to send requests of the form:
(req ID BODY)
The client is responsible for giving each request a session-unique ID, as the server provides no guarantee on the order of responses. Replies are written to stdout, wrapped in:
(res ID BODY)
To begin a session, the client sends:
(tkacz?)
to which the server replies with the version message:
(tkacz! :protocol (0 1 0) :async t)
Eg. virtually every ancient work: (Plato, 2004) sounds weird, but we really don’t know the exact date The Sophist was written, and publication date is meaningless in the context.
- Some works don’t actually exist: Hume’s Treatise of Human Nature is made of three different books, but some editions merges some, or all, of these books.
- This is actually the general issue of multiple volume works.
- We could create a
collection
ormultipart
type, which would represent an abstract work as a collection of multiple parts.
Some works have not been originally published on papers:
- Conferences and lectures (Austin’s How to do things with words, Goodman’s Facts, Fictions, Predictions, Bourdieu’s lectures at the Collège de France…)
- Ancient works (Plato, Aristotle…)
This is actually an easy case, which could build on the basic relationship between editions. Something like:
(conference Goodman:FactFictionForecast)
Eg Critique of Pure Reason
To handle these cases, we may create a “virtual” entry, something like:
(virtual KRV
:title '(("Kritik der reinen Vernunft" :lang de :orig t)
("Critique de la raison pure" :lang fr))
("Critique of Pure Reason" :lang en))
Louis Marin’s Portrait of the King contains a long commentary of Jean de la Fontaine’s The Crow and the Fox. If one wanted to take a quick note on that (ie, express that Marin1981
, I, 2 is about the fable, should one look up the original publication year of the whole collection of fables, create records, etc, or could one just create a quick draft entry on “Fables” with a single, unnumbered chapter “The Fox and the Crown” to which Marin1981, I, 2 could point to as a topic?
The definition of an entity should include rules for sorting its instances, regardless of the way they’re rendered: Jacques Lacan should appear after Sigmund Freud.
Since entities describe how they should be displayed, we need a rudimentary text formatting syntax, something that should be trivial to convert to any other syntax, and which could look like:
(tkacz/format-text
(sc "Bourdieu") 'comma "Pierre"
'space "et" 'space
"Jean-Claude" 'space (sc "Passeron")
'comma
(italic "La reproduction"
'dot
"Éléments pour une théorie du système d'enseignement"
'dot)
"Paris" 'colon "Les éditions de Minuit" 'colon "1970" 'fullstop)
which
[fn:1] The original implementation is an Emacs Lisp program. Further developments may convert it to a standalone program, but it would be reasonable that such a program is either a compiled Lisp program, or written in another language an embedding a Lisp machine.
[fn:2] This is an oversimplification, but it’s the BibTeX model.
[fn:3] It is hard to avoid that the format be in a specific Lisp dialect. But this dialect should not command the implementation language of the backend. This has some obvious consequences on extensibility.