Index: doc-rnd/devlog/20160823_lhtpers.html =================================================================== --- doc-rnd/devlog/20160823_lhtpers.html (nonexistent) +++ doc-rnd/devlog/20160823_lhtpers.html (revision 2694) @@ -0,0 +1,209 @@ + + +

pcb-rnd devlog

+ +

File format vs. VCS vs. script and manual editing

+ +

The problem

+ +The file a CAD program loads and saves is likely to end up in a Version +Control System (VCS) sooner or later. To make VCS more meaningful, the +changes to the file need to be minimal and relevant. For example in a +load-save round trip (i.e. load the file in the CAD and save it in another +file without any modification) the diff should be empty. If the user adds a new +object, the diff should contain only lines added for the new object. (And maybe +a few lines of changes in an index, if there's an index, but I believe it's +better if there's no index.) Likewise, if an object is deleted, it is expected +to result a few lines removed, preferrably at a single location. + +

+Terminology used in this document: +

+

+The file is a tree; even an ini file is a tree with 1 or 2 levels (bottom +level is a hash of text nodes, top level, groups, represent another hash). An +xml is a list of lists with text nodes and non-text-node attributes +representing a hash. + +

Problem #0: order

+To achieve the above goals, both gschem and pcb and pcb-rnd are prudent: +objects are ordered in memory. This is necessary because in an unordered +list or hash if the number of objects changes (objects added or removed) +the saving order may shuffle, causing a lot of unnecessary changes. +

+This problem is already solved. + +

Problem #1: canonical form

+Our file formats are plain text. While there are certain constructs that have +strict format, e.g. a string in quotes can be described only one way without +changing the content of the string, other parts have alternatives: + +

+This problem is seemingly solved by using a canonical form: even if the +loader may accept alternatives, the save code always generates the same formatted +output, down to the last whitespace indentation. + +

Problem #2: hand editing, external scripts

+However, the canonical format is not always the most convenient for the user. +Sometimes different indentation, maybe even locally, may improve readability. +Unfortunately a CAD round-trip brings back the file to canonical format, removing +any user formatting. Or in other words, the user needs to learn the syntax +of the file format and how the canonical format looks like, else user +changes to a file may cause VCS noise later. +

+Even worse, 3rd party scripts need to learn the canonical format too, for the +same reason. + + +

The solution

+One solution is to merge the canonical format and the syntax, so the resulting +specification is so strict that there are no alternatives but everything can +be written using only a single, well described format. This solves problem +#1 and #2, but creates problem #3: a text file format as rigid as a binary +format would be. While it could be still parsed by text tools easily, it +would be much harder to emit properly using the same tools or even harder to +edit manually, with a text editor. In this sense it may start losing its +most important text-file-property: free form, to some extent. +

+An alternative is that the CAD program doesn't save canonical format over +an existing file but tries to accomodate to the format the file already featurs. +This solves problem #0, #1, #2 and #3, allowing the user (or 3rd party scripts) +to use whatever formatting they prefer, with clean round trips. + +

How to implement it (in theory)

+A few unusual features are required: + + +

The obvious approach

+An obvious approach would be to smarten the parser and store delimiters +(e.g. whitespace) and the original text version for each data node, along with +the order of nodes in the original file. + +

The less obvious approach

+A second approach is to ignore these details on load and keep on storing data +in native binary format in memory. On save, start parsing the original file +(or a snapshot of it taken load-time), and: + +

+This may sound more expensive than the first approach, because the parser needs +to run again on save. However, the costs of the previous approach are +high too: +

+

+If the syntax-sugar overhead of the file format is low, majority of all +bytes will be actual node data (need to be stored) and delimiters/indentation +(need to be stored). Then it might be cheaper to store it all in one bug chunk +and reparse it (without building a DOM) than storing the same data in small +chunks. + + +

An even less obvious approach

+Take the previous method, but assume the application does not change the +order of objects on ordered lists and each object has some sort of unique +identifer (so two identical looking objects can't be on a list). +This makes list merging much simpler and doable in a single pass, keeping +a list item pointer with the in-memory list: + +

+Empty lists may need special (but trivial) treatment. +

+While copying an existing node, it is possible to "pick up" style +e.g. indentation. When a new item needs to be added, such picked up local +style info can be used to generat otuput that looks similar to other items +in its local environment. For example the first time a list or hash item +needs to be added is right after at least one item of the given hash or +list is already parsed, which means the style of an item is already know. + +

How it is implemented in practice

+Using the last approach from the above list, with an utility library for +lihata. The full list of requirements (on the application and the tree +structure): +