≅ SiSU - lightweight markup, object-centric documents, multiple outputs & search

SiSU parses a lightweight-markup source into an abstract document object model. Every substantive element (paragraph, heading, table, verse, image) becomes a typed object carrying its position in the document's sequence and hierarchy, and a stable citation number. From that single abstraction it emits multiple output formats - HTML (segmented and scroll), EPUB3, LaTeX (then PDF via xelatex), ODT, plain text, and an SQLite full-text search database. Each object's number stays stable across every output format and across translations of the same document.

The processing pipeline is markup → abstraction → output.

Object-Centric Document Abstraction. The abstraction stage builds an in-memory object model: every paragraph, heading, table, verse and so on is a numbered object that carries its own parent / sibling / type metadata. The numbering is known as OCN (Object Citation Numbering). Footnotes are not numbered objects; they belong to the object that references them. Every output format is generated from that single abstraction, so all formats share the same object identifiers. The abstraction can also be written out as a human-readable, PEG-parsable text format (.ssp) that other tools can consume directly.
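As a rough illustration of the idea (not spine's actual data structures, which are written in D), the numbering step can be sketched in a few lines of Python. The names DocObject and abstract, and the (kind, text) input shape, are assumptions made for this sketch only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocObject:
    ocn: int   # object citation number, 1-based, in document order
    kind: str  # object type, e.g. "heading" or "para"
    text: str


def abstract(blocks):
    """Number (kind, text) pairs in document order (hypothetical helper)."""
    return [DocObject(ocn, kind, text)
            for ocn, (kind, text) in enumerate(blocks, start=1)]


# Toy input: already split into typed blocks, so no markup parsing is shown.
blocks = [
    ("heading", "Introduction"),
    ("para", "Objects are the unit of citation."),
    ("para", "Every output is generated from the same objects."),
]
doc = abstract(blocks)
```

Running abstract twice on the same input gives the same numbered list, which is the property every output format then relies on.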

ℹ - How this differs from a typical "markup → HTML" pipeline

Citation that survives format conversion. Quote object 412 and the reference is meaningful in the HTML, the EPUB, the PDF, the plain text and the SQLite search results - and in any translation, because OCN is a property of the abstraction, not of pagination or layout.
Object-granular search. The SQLite database is populated at object granularity. A query reports not just "this document matches" but "object 412 in this document matches" - and links straight back to that object in the published HTML.
Inspectable intermediate form. The document abstraction has a human-readable, PEG-parsable text serialisation (.ssp). Other tools - in any language - can consume the abstraction without re-implementing the parser. This is also what lets the abstraction stage be reasoned about, diffed, fed to embedding pipelines, or used as the input to custom renderers.
Deterministic and reproducible. The same markup source produces the same OCN sequence and the same outputs every time. Per-object content hashes can be exposed for content identification or verification without disclosing the content itself.
Designed for finished, "published" works. SiSU is aimed at writings that are published as a stable artefact (books, essays, articles, legal and regulatory texts), where a fixed citable reference of object-level granularity is more valuable than the flexibility of fluid text.
Static output, optional search. Generated content is static HTML / EPUB / PDF / text - trivial to host and to archive. The SQLite + CGI search is an opt-in component that adds object-granular full-text query without changing the publishing model.
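Object-granular search can be sketched with Python's standard sqlite3 module. The table layout, the sample rows, the LIKE-based matching and the "doc.html#oN" link pattern below are all assumptions for illustration; they are not the schema or URL scheme that spine or its CGI search actually use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per document object, not one row per document.
conn.execute("CREATE TABLE doc_objects (doc TEXT, ocn INTEGER, body TEXT)")
rows = [
    ("essay-a", 1, "Networks change how information is produced."),
    ("essay-a", 2, "Peer production is one example."),
    ("essay-b", 1, "Markets and firms are not the only options."),
    ("essay-b", 2, "Commons-based peer production scales."),
]
conn.executemany("INSERT INTO doc_objects VALUES (?, ?, ?)", rows)


def search(term):
    """Return (doc, ocn, link) for every object whose text contains term."""
    cur = conn.execute(
        "SELECT doc, ocn FROM doc_objects WHERE body LIKE ? ORDER BY doc, ocn",
        (f"%{term}%",),
    )
    # The link pattern is an assumption; it only shows that a hit can point
    # at an object rather than at a whole document.
    return [(doc, ocn, f"{doc}.html#o{ocn}") for doc, ocn in cur]


hits = search("peer production")
```

Each hit names a document and an object number, so a reader can see where the match lies before opening anything.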

⌘ - See it in action

A single document - The Wealth of Networks, Yochai Benkler - shown in every output format SiSU Spine produces. The same OCN identifies the same object in each.

⌘ - Browse and search the sample collection

⌘ Authors - ⌘ Topics - ፨ Search
(Authors and Topics are software-curated from each document's header metadata. Search is object-granular.)

The collection contains 25+ documents released under various Creative Commons licences, in the public domain, or as the author's own work (with one GPL-licensed exception and the abandoned Debian live-manual). A specialised collection would benefit from a consistently applied bespoke ontology or thesaurus.

Δ - Source repositories

All project repositories are at https://git.sisudoc.org:

sisudoc-spine (D) - the current generator
git clone git://git.sisudoc.org/software/sisudoc-spine
sisudoc-spine-search-cgi (D) - object-granular CGI search
git clone git://git.sisudoc.org/software/sisudoc-spine-search-cgi
sisudoc-spine-samples - 25+ marked-up sample documents
git clone git://git.sisudoc.org/markup/sisudoc-spine-samples
sisu (Ruby, original/antecedent) - the original generator
git clone git://git.sisudoc.org/software/sisu
sisu-markup-samples - samples for the original sisu
git clone git://git.sisudoc.org/markup/sisu-markup-samples
tree-sitter-sisu - tree-sitter grammar for sisu markup
git clone git://git.sisudoc.org/tools/tree-sitter-sisu

ℹ - Spine vs. the original sisu

Spine (D) and the original sisu (Ruby) share the same lightweight body markup; spine moves the document header to YAML where the original uses a bespoke header dialect. Spine is roughly 60x faster on equivalent inputs (a one-minute Ruby run is about a one-second D run). Spine emits HTML, EPUB, LaTeX, ODT, plain text and the SQLite search database; PDF is delegated to an external xelatex pass (slower but produces excellent output). For output formats both produce, spine's representations are generally more up to date. Spine was released publicly under AGPLv3 on 2024-05-01.

ℹ - A longer description (design and intent)

Summary. An object is a unit of text within a document, the most common being a paragraph. Objects include individual headings, paragraphs, tables, and grouped text of various types such as code blocks and (within poems) verse. Objects have properties and attributes; of particular significance are headings and their levels, which provide document structure. A heading is an object with a hierarchical value that conceptually contains other objects (such as paragraphs and possibly sub-headings). Objects are tracked sequentially as they relate to each other within a document, and substantive objects are numbered sequentially for citation purposes. Notably, footnotes are not objects in themselves - they belong to the object from which they are referenced, and follow their own numbering sequence. From heading objects, linked tables of content may be generated; and if additional metadata is provided, book-style indexes can be generated that link back to the objects to which they relate.

Object-centricity. In SiSU, objects are the fundamental unit from which larger constructs and the document itself are built. Breaking the document into objects provides interesting possibilities.

Objects are fundamental building blocks. Objects are usually blocks of text - paragraphs, headings, tables, grouped text of various types including code blocks and verse - and may also be, for example, images. Objects can be formatted and placed as needed, enabling multiple types of representation across disparate formats and text receptacles: HTML, EPUB, LaTeX, SQL (populated at object level, so that search has that degree of granularity) and, in the past, mind-maps.

Sequence. Objects have sequence - this follows authorship and is part of how a document conveys meaning.

Object numbers and citation. Substantive objects are numbered sequentially and can be referenced for citation purposes. Object numbers locate text precisely across different document formats and different languages (assuming the document has been translated). For search, they identify precisely where within each document the search criteria are met - in the form of an index, or by surfacing the matching text objects so a reader can decide which documents are of interest before opening them. Object numbering also frees the representation of each format to be whatever is most suitable to that format, while structural and citation integrity are retained.
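A minimal sketch of how one numbered object list can feed two quite different renderers while keeping the same citation numbers. The function names, the id="oN" anchor form and the bracketed plain-text marker are assumptions for this sketch, not spine's actual output conventions.

```python
import html

# (ocn, kind, text) triples standing in for the document abstraction.
objects = [
    (1, "heading", "Introduction"),
    (2, "para", "Objects & citation."),
]


def to_html(objs):
    """Render each object as an element whose id carries its object number."""
    out = []
    for ocn, kind, text in objs:
        tag = "h1" if kind == "heading" else "p"
        out.append(f'<{tag} id="o{ocn}">{html.escape(text)}</{tag}>')
    return "\n".join(out)


def to_text(objs):
    """Render plain text, appending the object number in brackets."""
    return "\n\n".join(f"{text} [{ocn}]" for ocn, _kind, text in objs)


html_out = to_html(objects)
text_out = to_text(objects)
```

The layouts differ, but object 2 is object 2 in both, so a citation made against one output resolves in the other.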

Characteristics. Objects have properties (the fundamental type: heading, paragraph, table, verse, etc.) and may carry attributes (e.g. indentation, language, programming language for a code block).

Document structure. Headings hold the document's structure through their heading-level property. Headings are individual objects like any other, with the additional properties that (i) they may be regarded as containing the other objects following them sequentially (until the next heading of the same or a higher level), and (ii) they have a hierarchy, the root being the document title. To give greater flexibility across output formats, headings have two sets of levels: the level under which substantive text occurs (chapter or segment), and above that, optional document section separators (book, section, part).
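The containment rule can be sketched with a stack of open headings: a new heading closes every open heading at the same or a deeper level, and each object's parent is whatever heading is then innermost. The integer levels (0 for the title, larger numbers for deeper headings) and the function name are assumptions for this sketch; they do not reproduce SiSU's two sets of heading levels.

```python
def assign_parents(objects):
    """objects: list of (ocn, level_or_None, text). Returns {ocn: parent_ocn}.

    level is an int for headings (0 = document title) and None for other
    objects. This encoding is hypothetical.
    """
    parents = {}
    stack = []  # (level, ocn) of currently open headings, outermost first
    for ocn, level, _text in objects:
        if level is not None:
            # A heading closes every open heading of the same or deeper level.
            while stack and stack[-1][0] >= level:
                stack.pop()
        parents[ocn] = stack[-1][1] if stack else None
        if level is not None:
            stack.append((level, ocn))
    return parents


objects = [
    (1, 0, "Document title"),
    (2, 1, "Chapter 1"),
    (3, None, "First paragraph."),
    (4, 2, "Section 1.1"),
    (5, None, "Nested paragraph."),
    (6, 1, "Chapter 2"),
    (7, None, "Another paragraph."),
]
parents = assign_parents(objects)
```

Here "Chapter 2" closes both "Section 1.1" and "Chapter 1", so its parent is the document title and the paragraph after it belongs to "Chapter 2".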

Non-objects. Footnotes are not objects in themselves; they belong to the referencing object and follow their own numbering sequence. Tables of content may be generated from heading objects; book-style indexes may be generated when the required metadata is provided.
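One way to picture footnotes as non-objects: they are pulled out of the object that references them, numbered in their own sequence, and leave the object numbers untouched. The "[[fn: ...]]" marker below is invented for this sketch and is not SiSU footnote markup; the function name is likewise hypothetical.

```python
import re

# Hypothetical inline marker, used only in this sketch.
FOOTNOTE = re.compile(r"\[\[fn:(.*?)\]\]")


def extract_footnotes(objects):
    """objects: list of (ocn, text). Returns (clean_objects, footnotes).

    footnotes is a list of (note_number, owning_ocn, note_text); note numbers
    run across the whole document, independently of the object numbers.
    """
    clean, notes = [], []
    for ocn, text in objects:
        def repl(match, ocn=ocn):
            n = len(notes) + 1
            notes.append((n, ocn, match.group(1).strip()))
            return f"[{n}]"
        clean.append((ocn, FOOTNOTE.sub(repl, text)))
    return clean, notes


objects = [
    (1, "A claim.[[fn: Source one.]]"),
    (2, "No notes here."),
    (3, "Two claims[[fn: Source two.]] in one object.[[fn: Source three.]]"),
]
clean, notes = extract_footnotes(objects)
```

Object 3 owns footnotes 2 and 3, and the objects are still numbered 1, 2, 3 regardless of how many footnotes they carry.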

The document header. A SiSU document has a header carrying document metadata - at a minimum, title and author. The header may also carry markup instructions (e.g. how to identify headings within the document, so that those headings do not need to be inferred).

ℹ - Historical description (original sisu)

With minimal preparation of a plain-text (UTF-8) file using SiSU markup syntax in your text editor of choice, SiSU can generate various document formats - plain text, HTML, XHTML, XML, EPUB, OpenDocument text (ODT), LaTeX, PDF - most of which share a common object numbering system for locating content. It can also populate an SQL database with objects (roughly paragraph-sized chunks), so that searches may be performed and matches returned with that degree of granularity. Think of being able to finely match text across different output formats (same object identifier for PDF, EPUB or HTML) and across languages where translations exist (same object identifier across languages). Search results take the form "your criteria are met by these documents, at these locations within each document", and those locations are equally relevant across different output formats and languages. Page numbers provide none of this functionality. Object numbering is particularly suitable for "published" works (finalised texts as opposed to works that are frequently changed or updated), for which it provides a fixed means of reference to content. Document outputs can also share provided semantic metadata.

SiSU is less about document layout than about finding a way, using little markup, to construct an abstract representation of a document that makes it possible to produce multiple representations - which may be rather different from each other and used for different purposes - whether layout and publishing, scrollworthy online viewing, or content search. The aim is to take advantage, from a minimal-preparation starting point, of some of the strengths of rather different established ways of representing documents for different purposes: search (relational database, or indexed flat files of complete documents or files made up of objects), online or electronic viewing (HTML, XML, EPUB), or paper publication (PDF via LaTeX).

The solution arrived at is to extract structural information about the document (sections and headings, available through pattern matching or markup) and to track objects (defined units of text such as paragraphs, headings, tables, verse, etc., but also images), which can then be reconstituted as the same document with relevant object identification numbers - so text (objects) can be referenced across different output formats and presentations.

SiSU generates tables of content and, through its markup, the means for metadata to be provided for the generation of book-style indexes for a document (that, again, due to document object numbers, are the same and equally relevant across all output formats). Per-document classifying/organizing metadata can also be provided for automated document curation.

There have also been working experiments with SiSU-markup source. One was two-way conversion/representation in mind-mapping software (kdissert / semantik, for its strong focus on producing documents). Another is po4a (for translators), which has been used successfully in its regular text mode for SiSU markup in translation; this is more an attribute of po4a than of SiSU, but is of interest because SiSU/spine's object citation numbering is available across translations. ODT has been an output, but much more interesting (and requested by potential users) would be the ability of a word processor to save text in SiSU markup, making alternative document processing and presentations with SiSU possible.

Also worth mention: in the relatively long history of this project there has been work on extracting hash representations of each object that could hypothetically be shared to prove the content of a document without sharing its content, or to identify which objects change. These hashes can also be used as unique identifiers in a database, or as filenames if individual objects are saved.
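A minimal sketch of per-object hashing, assuming SHA-256 over each object's UTF-8 text (the source does not say which hash function or input normalisation the project used; both are assumptions here, as are the function names). Comparing two hash maps shows which objects changed without comparing, or sharing, the text itself.

```python
import hashlib


def object_hashes(objects):
    """Map OCN -> SHA-256 hex digest of the object's UTF-8 text."""
    return {ocn: hashlib.sha256(text.encode("utf-8")).hexdigest()
            for ocn, text in objects}


def changed(old, new):
    """OCNs whose digest differs, or that appear in only one of the maps."""
    return sorted(k for k in old.keys() | new.keys()
                  if old.get(k) != new.get(k))


v1 = [(1, "Title"), (2, "First paragraph."), (3, "Second paragraph.")]
v2 = [(1, "Title"), (2, "First paragraph, revised."), (3, "Second paragraph.")]

h1 = object_hashes(v1)
h2 = object_hashes(v2)
diff = changed(h1, h2)  # only object 2 differs between the two versions
```

The same digests could serve as database keys or as filenames for individually saved objects, as the paragraph above suggests.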

ℹ - From a 2004 evaluation (IBM Software Innovations)

SiSU has evolved; the current implementation focuses on one primary use-case, books and literary writings. The concept, however, has wider application. The following is a souvenir from an encounter with an IBM software evaluator in London in June 2004. The evaluation was set up after a chance meeting at a Linux Expo with an IBM manager who was curious about my interest in GNU/Linux given my legal background; on hearing that I also wrote software, he suggested IBM should have a look. The evaluator's response after the meeting:

"Ralph
Good to meet with you today, I was very impressed with your software.
[colleague's name (also posted to an IBM colleague)] - in summary - Ralph has built an application that runs on linux and takes ASCII documents and pulls them apart in to the smallest constituent parts, storing them as XML, PDF and HTML; the HTML are hyperlinked up so the document can be browsed in its full form. The format and text data created is stored in a database.
This has potential in any place that needs the power of full text search whilst holding the structural concepts of the document i.e. legal, pharma, education, research.. which ones we need to figure out, ..."

Special interest was expressed in the search implications of SiSU. To paraphrase: the company has document management systems dealing with hundreds of thousands of texts; these tell you which documents match your search criteria, but cannot inform you where within a text these matches were found without opening the documents. SiSU addresses this by defining document objects and making them the building block of the document - trackable objects that can be placed back in the context of the document or corpus of documents if part of a collection. SiSU's early design was to abstract documents to their structure and identified objects, numbered in a citable way (as the evaluator pointed out, document-object hashes can be of use for the purpose).

ℹ - Some observations

SiSU is more suited to finalised / stratified / published writings (articles, books) that are to remain and be referenced as published - works set at a given time, as opposed to the increasingly prevalent and important forms of fluid text.

Trained AI could likely assist in the preparation of documents with SiSU markup, with resulting deterministic and reproducible outputs (for substantive document objects). Caveats: where text objects may be in blocks (or not), there is some room for discretion and ambiguity in the markup, with resulting possibility of differences in presentation. Book indexes are another markup-intensive area; unless following an already published index, they can be prepared differently and possibly improved over time, and for specialised subject collections could potentially be prepared against a thesaurus.

ralph.amissah - www since 1993 ;-)