Giter VIP home page Giter VIP logo

tkacz's Introduction

Tkacz

This document describes yet another instantiation of my attempts to create a sane reference manager that could actually improve rather than degrade my bibliography/PDF management workflows.

Contents

Introduction: doing two things and doing them well

Why yet another reference manager? The first goal of Tkacz is to be usable for very large bibliographies in the field of historical sciences, with a strong focus on contemporary history of science, although it shouldn’t be restricted to a specific field. Current document/management software fall in one of two categories, each with its own shortcomings:

\paragraph{Document managers} usually have very powerful classification, ordering and annotation features, but a very poor data model, which makes it hard to actually meaningfully and formally describe documents by providing formal bibliographic references.

Most document management software also assume documents are somehow unique and sui generis, where historical sources tend to have significant histories: they may be reprinted, republished, collected and updated, and all of this matters, both for the history of the document and its availability: knowing where to find an obscure article is important. In the “hard” sciences, the most complicated (significant) thing that can happen to an article is to be withdrawn, and all you’re expected to do in this case is not to cite it anymore; but for historical research, withdrawals or reprints are actually significant.

It is also assumed that documents are files, usually PDFs. But it’s often the case that documentation is scattered between different libraries, printed and electronic copies, that not everything has a PDF and not every PDF is complete.

\paragraph{Reference managers} make it easy to reference documents you don’t actually own (and often make it the default), and usually have a powerful (or extensible) enough data model to accommodate a lot of information, but they usually fall very short when it comes to classification. Beyond that, the (potential) power of the data model is not necessarily reflected in all UIs, and some changes that are theoretically possible in the data model will make the data actually unusable. For instance, handing authors not as strings, but as objects is theoretically feasible in the BibTeX format; but such files would be unusable by the Bib(La)TeX family of programs.

\paragraph{}

The goal of Tkacz is to “do two things, and do them well”: to serve as a document manager with the full power of a bibliographical data manager. That means:

  • Nicely deal with complex or unusual entities: manuscripts, unpublished works, posthumous publications, etc. This means a more complex data model that most reference managers provide, and if possible an extensible data model.
  • Prefer objects over strings for qualified properties. Authors, for example: attribution is important, and attribution can be tricky; it’s thus better that the author field takes a series of person objects, optionally parameterized, instead of a list of strings. This helps solve three problems: a) it allows to keep notes on persons as well, which can be quickly retrieved and are connected to this person’s works; b) it nicely handles cases where a given person published under multiple names, or published anonymous works, making it easy to present the complete list of someone’s works, regardless of the name they were published under; c) it also, obviously, makes typos much easier to detect and correct.
  • Provide advanced classification mechanisms: hierarchical or flat collections, tags, etc. Because we use objects rather than strings, a good deal of classification can be automated: works by this author, works in this journal, etc. Classification also allows to consider objects as topics or more generally classes: it is not uncommon, eg. in philosophy or literature, that the topic of a book is the author of other books. Friedrich Nietzsche is the author of the Geneology of Morals, and the topic of Heidegger’s Nietzsche.
  • Provide just as powerful annotation features.
  • Express relationships between entities: an article and a reprint, or a withdrawal, a book and its editions, a multipart work and its parts, etc.
  • Help represent the history of references: an article may have been published in a journal, then reprinted multiple times in different books. Where do I look for Canguilhem’s 1966 article “Mort de l’homme ou épuisement du cogito?”, if I can’t get the original journal? Has it been translated into English, if I need to cite it in an English article, and where to find the translation?
  • Make it hard to introduce inconsistencies or typos in the database. Journals, like persons, are objects: if the object describes the publishing rate, then no reference could contain a volume/issue number inconsistent with publication date without at least generating a warning.
  • Create usable bibliographic databases for use with (La)TeX (and other reference managers). The data model described above is way too expressive for what BiBTeX or BibLaTeX can handle, so Tkacz needs to be able to build .bib files on-demand with a given subset of its db.
  • Rather than supporting “file attachments”, provide a general mechanism to describe copies of a reference, some of which can be files, but would also include things like: it’s in this library at this call number; it’s in my personal library, I have a printout in this box, etc.
  • Make the database history easy to browse, review, and understand. Nothing frightens me more than undocumented and non-versioned binary databases with quirky UIs that make it hard to get a grasp of what’s going on (yes, Papers, I’m looking at you). The database will be in a Lisp-like syntax (think XML with parentheses), with Git integration out of the box. A change = a commit, with a meaningful message. This leaves the user free to rebase, reorder or squash commits before pushing, and should make it trivial to keep a perfectly clean history.

Basic usage

when started with M-x tkacz RET, Tkacz shows a list of all references it has in store. It can also show a list of any other type of entities: to do so, press e, then select the entity type you want. There are four by default: references (r), persons (p), publishers (b), and journals (j).

By default, entities are displayed in so-called natural format, they can also be shown in tabulated format by pressing ===.

Working with references

Creating references

There are multiple ways to create new references:

  • Press n n in the references view to display an input form where you can manually fill fields. This is the most tedious way, and should generally be avoided.
  • n u will prompt for a URL, then do its best to build a reference out of it. If possible, it will assimilate the associated PDF as a copy of the reference. n U does the same in a loop, which is useful if you’re browsing the web in search for documentation (terminate with C-g). To create references from a web browser, simply configure it to call (tkacz/create-reference-from-url) or (tkacz/create-reference-from-html) on the Emacs daemon.
  • Similarly, n f will prompt for a file, n F will do so in a loop.
  • n d will show a drop area on which you can drag and drop virtually everything, although with a strong preference for URLs and PDFs.

Viewing and editing references

From the list view, press <RET> to open or focus editor view.

Organizing references

Tkacz classification system is made of two distinct mechanisms: taxonomies and contexts.

Taxonomies

Taxonomies are hierarchical trees whose branches and leaves may contain entities of various types.

:context nil
if true, this category is treated as a context.
:incontexts nil
the contexts this category is meaningful in.
:superclass t
if true, this branch contains the entities its children contains.
:accept-entities t
if
:accept-branches t

By default, children inherit all of their parent properties. This can be overriden by passing each property a second argument to be used as the default value for its children. Eg, the root context taxonomy has the property :context nil t

Branches can be of two types:

  • Standard branches
  • Selection branches are queries (think of OSX’s Smart Folders)

Contexts

Contexts are branches and leaves of a taxonomy. Contexts are how Tkacz help manage huge collections of possibly unrelated entities. If you’re working on, eg, your PhD in history of psychiatry, you don’t want all your computer science articles collection popping up in the list. Contexts are taxonomies, but the contract with the UI is different:

  • Contexts are used as first-order filters. In the default UI, C is used to toggle between contexts.
  • When toggling back to a previous context, secondary filters are to be restored as they were.

Querying the database

What’s good is a personal library if you can’t find anything inside? Tkacz comes with two powerful query systems. The coolest one is a formal search syntax, the fastest one is full-text search.

Formal queries

Formal queries are especially useful for building collections and taxinomies. They take the following form:

((type book)
 (by MichelFoucault)
 (date (between 1960 1980)))

Multiple values can be searched on a single selector. Into French Theory?

((type book article)
 (by GillesDeleuze JacquesDerrida JacquesLacan MichelFoucault)))

Need the complete works of someone, including books they edited?

(((author editor) PierreBourdieu))

Notice the car of each s-expression is the field, the whole cdr is values.

Standard boolean operators are available, of course:

(not (and (author RobertStoller) (author RobertGreen)))
(or (date (between 1910 1930)) (date (between 1950 1965)))

Some basic capture and logic is available. You can search for a book by at least two of a group of authors by searching like this:

;; Set the original author list
(let ([authors (AlonzoChurch KurtGödel AlanTuring)])
    ;; Do twice
  (repeat 2
          ;; Capture the matched author as capt
          (capture capt (by authors))
          ;; Remove the matched author from list before searching again
          (set authors (remove capt authors))))

Question on formal searches

  • How do we search for, eg, people who wrote books?
  • How do we restrict search to given taxonomic branch?
  • Need specification searching text fields. We need “like”, “contains”, “starts with” and “regexp match”, etc.

Full-text search

Just type ? in the UI, and type some search terms. This is actually just another formal search: Eg, searching for “popper logic” actually generates:

((fulltext "popper" "logic"))

Implementation

Very similarly to Emacs itself, Tkacz is written and configured in Lisp[fn:1].

Types

Tkacz has strong implicit typing, but its type system comes with its own set of quirks. The set of types contains the usual suspects (strings, integer, floats, booleans, lists, sets…) along with algebraic sums and products, called either and struct. The primitive types are as boring as can be expected, but either and struct are the tricky ones.

Structs

Structs are the root of all Tkacz’s types. they’re a combination of named fields of a specific type. A typical struct would look like this:

(tkacz/deftype natural-person-name
               (prefix :type string)
               (first :type string)
               (middle :type string)
               (von :type string)
               (suffix :type string)
               :read '((string) . 'tkacz/string-to-natural-person-name)
               :show '(tkacz/natural-person-name-to-string))

(tkacz/deftype person
               (natural :type bool :default t)
               (name :type ,(if tzo/natural-of-person-name self
                              natural-person-name
                              string))
               (abbrev :exists ,(tzo/person-natural self)
                       :type string))

The tkacz/deftype macro creates a series of functions in the tzo/ pseudo-namespace:

  • a function called tzo/make-[structname] (here, tzo/make-person), which takes an unspecified list of arguments and consumes them in the order the fields are defined.
  • a function for each reader, in the form tzo/read-structname-from-type1-type2-... (here, tzo/read-person-from-string)
  • a function tzo/show-person.
  • accessor functions for each field, of the form tzo/field-name-of-struct-name.
  • type-predicate functions for each field, of the form tzo/field-name-of-struct-name-is-type-p.

The constructor returns a Tkacz struct object, which is a list of the form:

(tzo/struct person
            :natural (:type boolean :value t)
            :name (:type natural-person-name
                         :value (tzo/struct-natural-person-name
                                 :first "Louis"
                                 :von "de"
                                 :suffix "Broglie"))

(tzo/struct person
            :natural (:type boolean :value nil)
            :name (:type string :value "Federal Bureau of Investigation")
            :abbrev (:type string :value "FBI"))

To do so, it populates each field in order. Populating in order is useful because the type or even existence of fields may depend on other fields. In this example:

  • The type of name depends on whether this is a natural person or not. Natural persons have names in five parts[fn:2] which can’t be abbreviated, others have a simple string.

Alternatives (either)

Either is a rough equivalent of Haskell’s |. It defines a sum type which can be of any of a finite set of type. A simple example of either is:

(either string number)

A field of this type can be, guess what, either a string or a number. Unlike structs, either isn’t enough to define a type, and can only be assigned as the type of a struct’s field. Ether’s are resolved at struct constructor level, and don’t appear in the object itself but are replaced by a value of the chosen type. For example, if the above definition was the type of a field called a, the struct object would only contain:

(tzo/struct struct-name :a (:type integer :value 1))

Use-case for either is missing.

Entities

Entities are the essential Tkacz type. They’re defined from structs, but unlike structs, entities are named root objects, not values. Entity names start by an uppercase letter, and they’re defined with the (tkacz/defentity ENTITY-NAME STRUCT-NAME) macro:

(tkacz/defentity Person person)

Everything Tkacz is meant to keep information about is an entity. The most important type of Entity is of course Reference, which stores a bibliographic reference.

Taxonomies

Taxonomies are trees. Taxonomy objects are structs with the following attributes:

NameDefaultMeaning
namerequiredThe name of this branch
parentnilparent branch
gendertruewhether this branch is a gender
showemptyfalseWhether to show this branch if empty
  • parent is nil at the root branch of a tree.
  • An electric branch is created programmatically, and won’t be serialized. This implies that accept-subtrees and accept-entities are false and all children inherit them as false.
  • A gender is a branch which contains the leaves of its children (the way, in biology, a gender is made of its species)
  • showempty hides a branch and all its subtrees if they contain no entities, and only in this case.

There are two kind of branches: standard and queries. Query branches can do two things: they can treat their result as a list of entities, or as a list of branches which each receive a result and use it on a second, standard query.

Standard branches

Query branches

The behavior of query branches is defined by their gender field. If gender is true, these branches contain their results as leaves, and subbranches may contain other queries which refine the original query (ie, they apply on the first result set, so subbranches are necessarily strict subsets of their parents)

Query branches have an extra query attribute, which holds the query.

Also, query branches:

  • cannot have results be manually added/removed.
  • non-gender query branches cannot have subtrees added/removed.

Git support

Client/server protocol

The Tkacz program is a simple server able to talk to various clients. The CLI program itself isn’t made to be used by humans, but only for programs to interact with it. The initial implementation uses s-expressions for requests and responses, because a they’re) really easy to parse; and b) both the server and original client are written in Lisp. Client writes to server’s stdin to send requests of the form:

(req ID BODY)

The client is responsible for giving each request a session-unique ID, as the server provides no guarantee on the order of responses. Replies are written to stdout, wrapped in:

(res ID BODY)

Starting a session

To begin a session, the client sends:

(tkacz?)

to which the server replies with the version message:

(tkacz! :protocol (0 1 0) :async t)

Requests

Specific difficulties to consider

References without a known original publication date

Eg. virtually every ancient work: (Plato, 2004) sounds weird, but we really don’t know the exact date The Sophist was written, and publication date is meaningless in the context.

«Abstract» references and «virtual» works

Multipart works

  • Some works don’t actually exist: Hume’s Treatise of Human Nature is made of three different books, but some editions merges some, or all, of these books.
  • This is actually the general issue of multiple volume works.
  • We could create a collection or multipart type, which would represent an abstract work as a collection of multiple parts.

Non-published works

Some works have not been originally published on papers:

  • Conferences and lectures (Austin’s How to do things with words, Goodman’s Facts, Fictions, Predictions, Bourdieu’s lectures at the Collège de France…)
  • Ancient works (Plato, Aristotle…)
Conferences and lectures

This is actually an easy case, which could build on the basic relationship between editions. Something like:

(conference Goodman:FactFictionForecast)

Works with multiple, different, editions

Eg Critique of Pure Reason

To handle these cases, we may create a “virtual” entry, something like:

(virtual KRV
         :title '(("Kritik der reinen Vernunft" :lang de :orig t)
                  ("Critique de la raison pure" :lang fr))
                  ("Critique of Pure Reason" :lang en))

La Fontaine’s fables

Louis Marin’s Portrait of the King contains a long commentary of Jean de la Fontaine’s The Crow and the Fox. If one wanted to take a quick note on that (ie, express that Marin1981, I, 2 is about the fable, should one look up the original publication year of the whole collection of fables, create records, etc, or could one just create a quick draft entry on “Fables” with a single, unnumbered chapter “The Fox and the Crown” to which Marin1981, I, 2 could point to as a topic?

Sorting

The definition of an entity should include rules for sorting its instances, regardless of the way they’re rendered: Jacques Lacan should appear after Sigmund Freud.

Text formatting syntax

Since entities describe how they should be displayed, we need a rudimentary text formatting syntax, something that should be trivial to convert to any other syntax, and which could look like:

(tkacz/format-text
 (sc "Bourdieu") 'comma "Pierre"
 'space "et" 'space
 "Jean-Claude" 'space (sc "Passeron")
 'comma
 (italic "La reproduction"
         'dot
         "Éléments pour une théorie du système d'enseignement"
         'dot)
 "Paris" 'colon "Les éditions de Minuit" 'colon "1970" 'fullstop)

which

Footnotes

[fn:1] The original implementation is an Emacs Lisp program. Further developments may convert it to a standalone program, but it would be reasonable that such a program is either a compiled Lisp program, or written in another language an embedding a Lisp machine.

[fn:2] This is an oversimplification, but it’s the BibTeX model.

[fn:3] It is hard to avoid that the format be in a specific Lisp dialect. But this dialect should not command the implementation language of the backend. This has some obvious consequences on extensibility.

tkacz's People

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.