SiSU parses a lightweight-markup source into an abstract document object model. Every substantive element (paragraph, heading, table, verse, image) becomes a typed object carrying its position in the document's sequence and hierarchy, and a stable citation number. From that single abstraction it emits multiple output formats - HTML (segmented and scroll), EPUB3, LaTeX (then PDF via xelatex), ODT, plain text, and an SQLite full-text search database. Each object's number stays stable across every output format and across translations of the same document.
The processing pipeline is markup → abstraction → output.
Object-Centric Document Abstraction. The abstraction stage builds an in-memory object model: every paragraph, heading, table, footnote and so on is a numbered object that carries its own parent / sibling / type metadata; the numbering is known as OCN (Object Citation Numbering). Every output format is generated from that single abstraction, so all formats share the same object identifiers. The abstraction can also be written out as a human-readable, PEG-parsable text format (.ssp), so other tools - in any language - can consume it directly without re-implementing the parser. This is also what lets the abstraction stage be reasoned about, diffed, fed to embedding pipelines, or used as the input to custom renderers.
A single document, The Wealth of Networks by Yochai Benkler, is shown in every output format SiSU Spine produces; the same OCN identifies the same object in each.
⌘ Authors - ⌘ Topics - ፨ Search
(Authors and Topics are software-curated from each document's header metadata. Search is object-granular.)
The collection contains 25+ documents released under various Creative Commons licences, in the public domain, or as the author's own work (with one GPL-licensed exception and the abandoned Debian live-manual). A specialised collection would benefit from a consistently applied bespoke ontology or thesaurus.
All project repositories are at https://git.sisudoc.org:
git clone git://git.sisudoc.org/software/sisudoc-spine
git clone git://git.sisudoc.org/software/sisudoc-spine-search-cgi
git clone git://git.sisudoc.org/markup/sisudoc-spine-samples
git clone git://git.sisudoc.org/software/sisu
git clone git://git.sisudoc.org/markup/sisu-markup-samples
git clone git://git.sisudoc.org/tools/tree-sitter-sisu
Spine (D) and the original sisu (Ruby) share the same lightweight body markup; spine moves the document header to YAML where the original uses a bespoke header dialect. Spine is roughly 60x faster on equivalent inputs (a one-minute Ruby run is about a one-second D run). Spine emits HTML, EPUB, LaTeX, ODT, plain text and the SQLite search database; PDF is delegated to an external xelatex pass (slower but produces excellent output). For output formats both produce, spine's representations are generally more up to date. Spine was released publicly under AGPLv3 on 2024-05-01.
Summary. An object is a unit of text within a document, the most common being a paragraph. Objects include individual headings, paragraphs, tables, and grouped text of various types such as code blocks and (within poems) verse. Objects have properties and attributes; of particular significance are headings and their levels, which provide document structure. A heading is an object with a hierarchical value that conceptually contains other objects (such as paragraphs and possibly sub-headings). Objects are tracked sequentially as they relate to each other within a document, and substantive objects are numbered sequentially for citation purposes. Notably, footnotes are not objects in themselves - they belong to the object from which they are referenced, and follow their own numbering sequence. From heading objects, linked tables of content may be generated; and if additional metadata is provided, book-style indexes can be generated that link back to the objects to which they relate.
Object-centricity. In SiSU, objects are the fundamental unit from which larger constructs and the document itself are built. Breaking the document into objects provides interesting possibilities.
Objects are fundamental building blocks. Objects are usually blocks of text - paragraphs, headings, tables, grouped text of various types including code blocks and verse - and may also be, for example, images. Objects can be formatted and placed as needed, enabling multiple types of representation across disparate formats and text receptacles: HTML, EPUB, LaTeX, SQL (populated at object level, so that search has that degree of granularity) and, in the past, mind-maps.
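As a rough illustration of one object being placed into different formats, here is a minimal Python sketch. The DocObject type, its field names and the two renderer functions are hypothetical and chosen for this example only; they are not spine's actual data structures or output conventions.

```python
from dataclasses import dataclass
from html import escape


@dataclass(frozen=True)
class DocObject:
    ocn: int    # object citation number (field name is an assumption)
    kind: str   # "heading" or "para" in this sketch
    text: str


def to_html(obj: DocObject) -> str:
    """Render one object as HTML, exposing its number as an anchor id."""
    tag = "h1" if obj.kind == "heading" else "p"
    return f'<{tag} id="o{obj.ocn}">{escape(obj.text)}</{tag}>'


def to_text(obj: DocObject) -> str:
    """Render the same object as plain text, with its number appended."""
    return f"{obj.text} [{obj.ocn}]"


objects = [
    DocObject(1, "heading", "Chapter 1"),
    DocObject(2, "para", "Peers & networks."),
]
html_out = [to_html(o) for o in objects]
text_out = [to_text(o) for o in objects]
```

Each renderer is free to present the object as suits its format; only the object number is shared between them.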
Sequence. Objects have sequence - this follows authorship and is part of how a document conveys meaning.
Object numbers and citation. Substantive objects are numbered sequentially and can be referenced for citation purposes. Object numbers locate text precisely across different document formats and different languages (assuming the document has been translated). For search, they identify precisely where within each document the search criteria are met - in the form of an index, or by surfacing the matching text objects so a reader can decide which documents are of interest before opening them. Object numbering also frees the representation of each format to be whatever is most suitable to that format, while structural and citation integrity are retained.
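The numbering idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not spine's implementation: the Obj type and the "substantive" flag are hypothetical, and a generated table of contents stands in for any object that receives no citation number.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Obj:
    text: str
    substantive: bool = True   # hypothetical flag: does this object get a number?
    ocn: Optional[int] = None


def number_objects(objs):
    """Assign sequential numbers to substantive objects, in document order."""
    n = 0
    for o in objs:
        if o.substantive:
            n += 1
            o.ocn = n
    return objs


def by_ocn(objs):
    """Index the numbered objects by their citation number."""
    return {o.ocn: o.text for o in objs if o.ocn is not None}


en = number_objects([
    Obj("Title"),
    Obj("(generated table of contents)", substantive=False),
    Obj("First paragraph."),
    Obj("Second paragraph."),
])
fr = number_objects([
    Obj("Titre"),
    Obj("(table des matières générée)", substantive=False),
    Obj("Premier paragraphe."),
    Obj("Deuxième paragraphe."),
])
```

Because a translation that preserves the object sequence receives the same numbers, by_ocn(en)[3] and by_ocn(fr)[3] refer to the same paragraph in each language.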
Characteristics. Objects have properties (the fundamental type: heading, paragraph, table, verse, etc.) and may carry attributes (e.g. indentation, language, programming language for a code block).
Document structure. Headings hold the document's structure through their heading-level property. Headings are individual objects like any other, with the additional properties that (i) they may be regarded as containing the other objects following them sequentially (until the next heading of similar or higher level), and (ii) they have a hierarchy, the root being the document title. To give greater flexibility across output formats, headings have two sets of levels: the level under which substantive text occurs (chapter or segment), and above that, optional document section separators (book, section, part).
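The containment rule described above (a heading contains what follows until the next heading of the same or a higher level) can be sketched with a stack. This is a simplified illustration: it assumes a single numeric level per heading, with 0 for the document title, rather than the two sets of heading levels, and the tuple layout is hypothetical.

```python
def assign_parents(objects):
    """Map each object number to the number of the heading that contains it.

    objects: list of (ocn, level, text); level is an int for headings
    (0 = document title, larger = deeper) and None for non-heading objects.
    """
    stack = []    # currently open headings as (level, ocn), outermost first
    parents = {}
    for ocn, level, _text in objects:
        if level is not None:
            # A heading closes every open heading of the same or a deeper level.
            while stack and stack[-1][0] >= level:
                stack.pop()
        parents[ocn] = stack[-1][1] if stack else None
        if level is not None:
            stack.append((level, ocn))
    return parents


doc = [
    (1, 0, "Title"),
    (2, 1, "Chapter 1"),
    (3, None, "A paragraph in chapter 1."),
    (4, 2, "Section 1.1"),
    (5, None, "A paragraph in section 1.1."),
    (6, 1, "Chapter 2"),
    (7, None, "A paragraph in chapter 2."),
]
parents = assign_parents(doc)
```

Here "Chapter 2" (object 6) closes both "Section 1.1" and "Chapter 1", so its parent is the title, and the paragraph that follows it belongs to chapter 2.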
Non-objects. Footnotes are not objects in themselves; they belong to the referencing object and follow their own numbering sequence. Tables of content may be generated from heading objects; book-style indexes may be generated when the required metadata is provided.
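A small sketch of how footnotes might be attached to their referencing object while following their own sequence. The inline marker pattern used here (~{ ... }~), the bracketed replacement and the dictionary layout are assumptions made for illustration, and every paragraph is treated as a numbered object for simplicity.

```python
import re

# Assumed inline footnote marker for this sketch: ~{ footnote text }~
FOOTNOTE = re.compile(r"~\{\s*(.*?)\s*\}~")


def extract_footnotes(paragraphs):
    """Number objects and footnotes independently; notes stay with their object."""
    note_no = 0   # footnote sequence, separate from the object sequence
    result = []
    for ocn, raw in enumerate(paragraphs, start=1):
        notes = []

        def repl(match):
            nonlocal note_no
            note_no += 1
            notes.append((note_no, match.group(1)))
            return f"[{note_no}]"

        text = FOOTNOTE.sub(repl, raw)
        result.append({"ocn": ocn, "text": text, "notes": notes})
    return result


parsed = extract_footnotes([
    "First~{ note A }~ paragraph.",
    "Plain paragraph.",
    "Third~{ note B }~ one~{ note C }~.",
])
```

In this example object 3 carries footnotes 2 and 3: the two sequences advance independently, and a footnote is never given an object number of its own.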
The document header. A SiSU document has a header carrying document metadata - at a minimum, title and author. The header may also carry markup instructions (e.g. how to identify headings within the document, so that those headings do not need to be inferred).
With minimal preparation of a plain-text (UTF-8) file using SiSU markup syntax in your text editor of choice, SiSU can generate various document formats - plain text, HTML, XHTML, XML, EPUB, OpenDocument text (ODT), LaTeX, PDF - most of which share a common object numbering system for locating content. It can also populate an SQL database with objects (roughly paragraph-sized chunks) so searches may be performed and matches returned with that degree of granularity. Think of being able to finely match text across different output formats (same object identifier for PDF, EPUB or HTML) and across languages where translations exist (same object identifier across languages). For search, the result takes the form "your criteria are met by these documents, at these locations within each document" (equally relevant across different output formats and languages). Page numbers provide none of this functionality. Object numbering is particularly suitable for "published" works (finalised texts as opposed to works that are frequently changed or updated), for which it provides a fixed means of referencing content. Document outputs can also share provided semantic metadata.
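To make the search granularity concrete, here is a minimal sketch using Python's standard sqlite3 module. The table name, column names and sample rows are hypothetical and do not reflect the schema spine actually writes; a simple LIKE match stands in for full-text search.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical schema: one row per document object.
con.execute("CREATE TABLE doc_objects (doc TEXT, ocn INTEGER, body TEXT)")
con.executemany(
    "INSERT INTO doc_objects VALUES (?, ?, ?)",
    [
        ("book_a", 1, "Networks and freedom"),
        ("book_a", 2, "Markets are discussed here"),
        ("book_b", 5, "On peer networks"),
    ],
)


def search(term):
    """Return (document, object number) pairs whose text contains the term."""
    rows = con.execute(
        "SELECT doc, ocn FROM doc_objects WHERE body LIKE ? ORDER BY doc, ocn",
        (f"%{term}%",),
    )
    return rows.fetchall()


hits = search("networks")
```

Because each row is a single object, a match reports the document and the object number within it, which is the same location in every output format.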
SiSU is less about document layout than about finding a way, using little markup, to construct an abstract representation of a document that makes it possible to produce multiple representations - which may be rather different from each other and used for different purposes - whether layout and publishing, scrollworthy online viewing, or content search. The aim is to take advantage, from a minimal-preparation starting point, of some of the strengths of rather different established ways of representing documents for different purposes: search (relational database, or indexed flat files of complete documents or files made up of objects), online or electronic viewing (HTML, XML, EPUB), or paper publication (PDF via LaTeX).
The solution arrived at is to extract structural information about the document (sections and headings, available through pattern matching or markup) and to track objects (defined units of text such as paragraphs, headings, tables, verse, etc., but also images), which can then be reconstituted as the same document with relevant object identification numbers - so text (objects) can be referenced across different output formats and presentations.
SiSU generates tables of content and, through its markup, the means for metadata to be provided for the generation of book-style indexes for a document (that, again, due to document object numbers, are the same and equally relevant across all output formats). Per-document classifying/organizing metadata can also be provided for automated document curation.
There have also been working experiments with SiSU-markup source: two-way conversion/representation in mind-mapping software (kdissert / semantik, for its strong focus on producing documents); and po4a (for translators) has been used successfully in its regular text mode for SiSU markup in translation - which is more an attribute of po4a than of SiSU, but of interest due to SiSU/spine's object citation numbering being available across translations. ODT has been an output, but much more interesting (and requested by potential users) would be the ability of a word processor to save text in SiSU markup, making alternative document processing and presentations with SiSU possible.
Also worth mention: in the relatively long history of this project there has been work on extracting hash representations of each object that could hypothetically be shared to prove the content of a document without sharing its content, or to identify which objects change. These hashes can also be used as unique identifiers in a database, or as filenames if individual objects are saved.
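A sketch of the per-object hash idea, using SHA-256 from Python's standard hashlib. The choice of hash function and the mapping of object number to text are assumptions for illustration; the text above does not say which digest or layout the project used.

```python
import hashlib


def object_hashes(objects):
    """Hash each object's text; objects maps object number -> text."""
    return {
        ocn: hashlib.sha256(text.encode("utf-8")).hexdigest()
        for ocn, text in objects.items()
    }


def changed_objects(old, new):
    """List the object numbers whose content differs between two versions."""
    h_old, h_new = object_hashes(old), object_hashes(new)
    return sorted(
        ocn for ocn in h_old.keys() & h_new.keys() if h_old[ocn] != h_new[ocn]
    )


v1 = {1: "Title", 2: "First paragraph.", 3: "Second paragraph."}
v2 = {1: "Title", 2: "First paragraph, revised.", 3: "Second paragraph."}
changed = changed_objects(v1, v2)
```

Two parties holding the same text can compare the digests rather than the text itself, and each digest can also serve as a database key or a filename for a stored object.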
SiSU has evolved; the current implementation focuses on one primary use-case, books and literary writings. The concept, however, has wider application. The following is a souvenir from an encounter with an IBM software evaluator in London in June 2004, set up after a chance meeting with an IBM manager at a Linux Expo who was curious about my interest in GNU/Linux given my legal background - on hearing that I also wrote software, he suggested IBM should have a look. The evaluator's response after the meeting:
"Ralph
Good to meet with you today, I was very impressed with your software.
[colleague's name (also posted to an IBM colleague)] - in summary -
Ralph has built an application that runs on linux and takes ASCII documents
and pulls them apart in to the smallest constituent parts, storing them as
XML, PDF and HTML; the HTML are hyperlinked up so the document can be browsed
in its full form. The format and text data created is stored in a
database.
This has potential in any place that needs the power of full
text search whilst holding the structural concepts of the document i.e. legal,
pharma, education, research.. which ones we need to figure out, ..."
Special interest was expressed in the search implications of SiSU. To paraphrase: the company has document management systems dealing with hundreds of thousands of texts; these tell you which documents match your search criteria, but cannot inform you where within a text these matches were found without opening the documents. SiSU addresses this by defining document objects and making them the building block of the document - trackable objects that can be placed back in the context of the document or corpus of documents if part of a collection. SiSU's early design was to abstract documents to their structure and identified objects, numbered in a citable way (as the evaluator pointed out, document-object hashes can be of use for the purpose).
SiSU is more suited to finalised / stratified / published writings (articles, books) that are to remain and be referenced as published - works set at a given time. (As opposed to the increasingly prevalent and important forms of fluid text.)
Trained AI could likely assist in the preparation of documents with SiSU markup, with resulting deterministic and reproducible outputs (for substantive document objects). Caveats: where text objects may be in blocks (or not), there is some room for discretion and ambiguity in the markup, with resulting possibility of differences in presentation. Book indexes are another markup-intensive area; unless following an already published index, they can be prepared differently and possibly improved over time, and for specialised subject collections could potentially be prepared against a thesaurus.
ralph.amissah - www since 1993 ;-)