Index: doc-rnd/devlog/20160823_lhtpers.html =================================================================== --- doc-rnd/devlog/20160823_lhtpers.html (nonexistent) +++ doc-rnd/devlog/20160823_lhtpers.html (revision 2694) @@ -0,0 +1,209 @@ + + +

pcb-rnd devlog

+ +

File format vs. VCS vs. script and manual editing

+ +

The problem

+ +The file a CAD program loads and saves is likely to end up in a Version +Control System (VCS) sooner or later. To make VCS more meaningful, the +changes to the file need to be minimal and relevant. For example in a +load-save round trip (i.e. load the file in the CAD and save it in another +file without any modification) the diff should be empty. If the user adds a new +object, the diff should contain only lines added for the new object. (And maybe +a few lines of changes in an index, if there's an index, but I believe it's +better if there's no index.) Likewise, if an object is deleted, it is expected +to result a few lines removed, preferrably at a single location. + +

+Terminology used in this document: +

file is the data file, represented in a structured text format; + for the whole document except for the last chapter, any format may work: + xml, json, ini, lihata, etc. +
node a data node; may be: +
- a leaf node with payload: an actual data item (like a number) +
- a hash that contains an unordered set of other named nodes +
- a list is an ordered list of named or unnamed nodes +
+

+The file is a tree; even an ini file is a tree with 1 or 2 levels (bottom +level is a hash of text nodes, top level, groups, represent another hash). An +xml is a list of lists with text nodes and non-text-node attributes +representing a hash. + +

Problem #0: order

+To achieve the above goals, both gschem and pcb and pcb-rnd are prudent: +objects are ordered in memory. This is necessary because in an unordered +list or hash if the number of objects changes (objects added or removed) +the saving order may shuffle, causing a lot of unnecessary changes. +

+This problem is already solved. + +

Problem #1: canonical form

+Our file formats are plain text. While there are certain constructs that have +strict format, e.g. a string in quotes can be described only one way without +changing the content of the string, other parts have alternatives: +

white spaces and other field separators +
numerical formats: is 8 the same as 08 or 8.0? +
pcb flag formats, old/new style: the same object can be described + using different syntax constructions, sometimes resulting the same + object +

+This problem is seemingly solved by using a canonical form: even if the +loader may accept alternatives, the save code always generates the same formatted +output, down to the last whitespace indentation. + +

Problem #2: hand editing, external scripts

+However, the canonical format is not always the most convenient for the user. +Sometimes different indentation, maybe even locally, may improve readability. +Unfortunately a CAD round-trip brings back the file to canonical format, removing +any user formatting. Or in other words, the user needs to learn the syntax +of the file format and how the canonical format looks like, else user +changes to a file may cause VCS noise later. +

+Even worse, 3rd party scripts need to learn the canonical format too, for the +same reason. + + +

The solution

+One solution is to merge the canonical format and the syntax, so the resulting +specification is so strict that there are no alternatives but everything can +be written using only a single, well described format. This solves problem +#1 and #2, but creates problem #3: a text file format as rigid as a binary +format would be. While it could be still parsed by text tools easily, it +would be much harder to emit properly using the same tools or even harder to +edit manually, with a text editor. In this sense it may start losing its +most important text-file-property: free form, to some extent. +

+An alternative is that the CAD program doesn't save canonical format over +an existing file but tries to accomodate to the format the file already featurs. +This solves problem #0, #1, #2 and #3, allowing the user (or 3rd party scripts) +to use whatever formatting they prefer, with clean round trips. + +

How to implement it (in theory)

+A few unusual features are required: +

F1: the parser needs to be aware of otherwise unimportant + characters (e.g. whitespaces in indentation) +
F2: if (payload) data is to be converted to a native format by + the program, the original text representaton needs to be saved; + e.g. the input file contins 4.99999999, the program loads it to + a float and later saved back, it may end up as 4.999997836; + instead if the data did not change, the original text shall be + saved back; this also solves the 005 vs 5 vs 5.0 problem +
F3: new data should be inserted, obsolete data should be removed + without reordering stationary data +

+ +

The obvious approach

+An obvious approach would be to smarten the parser and store delimiters +(e.g. whitespace) and the original text version for each data node, along with +the order of nodes in the original file. + +

The less obvious approach

+A second approach is to ignore these details on load and keep on storing data +in native binary format in memory. On save, start parsing the original file +(or a snapshot of it taken load-time), and: +

any non-data-payload character should be copied the the output; + this includes whitespace, delimiters, comments, syntax sugar; + this solves F1 +
if a node did not change, the parser copies it byte-to-byte to + the output (solves F2 partially - no format change if data is + unchanged in memory); +
value change check: while parsing, convert the original value to + native format again and compare, which solves F2: the same + string->native conversion should yield the same native result; + if our in-memory representation differs, it has changed and the + new value should be copied to the output instead of the old value; + the format of the new value doesn't matter, can be canonical, since + the diff is already caused by the fact that the value has changed. +
if the language supports ordered lists: before emitting list items, + the input list must be read through (pre-read) and the in-memory and + just-parse lists need to be compared to detect changes in order; + existing list items are copied as described above; list items + deleted in-memory are ignored; list items added in-memory are + written out (solves F3) +
if the language supports unordered hashes: check each key parsed; + if it exists in-memory, copy/write it normally; if it doesn't, + delete it; add new hash items (the ones present in-memory but + not seen when the parser reaches the end of the hash; solves F3) +

+This may sound more expensive than the first approach, because the parser needs +to run again on save. However, the costs of the previous approach are +high too: +

essentially the whole file needs to be stored in-memory + (for whitespaces), but not in one bug chunk but in many + little chunks among with the data nodes (which leads to + allocation overhead); the original values need to be stored in + text too, to solve the numeric problems +
either the same string->native conversion needs to be done using + stored text values; or the code needs to keep track of which node + changed using some data-changed bit +
the same list/hash merging needs to be done +

+If the syntax-sugar overhead of the file format is low, majority of all +bytes will be actual node data (need to be stored) and delimiters/indentation +(need to be stored). Then it might be cheaper to store it all in one bug chunk +and reparse it (without building a DOM) than storing the same data in small +chunks. + + +

An even less obvious approach

+Take the previous method, but assume the application does not change the +order of objects on ordered lists and each object has some sort of unique +identifer (so two identical looking objects can't be on a list). +This makes list merging much simpler and doable in a single pass, keeping +a list item pointer with the in-memory list: +

initialization: set the pointer to the first element of the list +
iteration: parse the first (next) item of the list +
check if it exists in-memory and is the first +
- if yes, copy from the original file, increase the pointer by + one element +
- if not, check if it got deleted (not present on the list + in-memory) and skip it if so; pointer stays in place +
- if not, check if it is on the in-memory list as a later entry; + if so, insert all preceding entries from the in-memory list, + increasing the pointer by one for each element written +
+
repeat the iteration until there are no more items on the input list +
check if there are items remaining from the pointer, if so output + them (they will end up at the end of the list in the output file) +
finished: the output file's list now matches the in-memory list. +

+Empty lists may need special (but trivial) treatment. +

+While copying an existing node, it is possible to "pick up" style +e.g. indentation. When a new item needs to be added, such picked up local +style info can be used to generat otuput that looks similar to other items +in its local environment. For example the first time a list or hash item +needs to be added is right after at least one item of the given hash or +list is already parsed, which means the style of an item is already know. + +

How it is implemented in practice

+Using the last approach from the above list, with an utility library for +lihata. The full list of requirements (on the application and the tree +structure): +

the root element of the input file and the in-memory representation + must fully match +
list items must have names that are unique to the list, even accross + item removal from the list within a session (so a newly added item + never has the same name as a past item parsed back from the file) +
the application must not change the order of elements on a list - + new elements may be inserted anywhere and old elements may be + removed from anywhere but the order of any two remaining old + elements must be preserved +