Design and Architecture
This document lists the design and architectural decisions taken during the development of Correct Markdown. It follows the Architecture Decisions format.
StringView
Date: (2024-10-01)
Context
The goal of the project is to correct pieces of text in a markdown document. This cannot be done with a straightforward find and replace, because the markdown document contains information other than plain text, namely markdown tags and possibly HTML tags.
We need a data-structure to isolate markup information from plain-text information.
Decision
Create the StringView data structure. It maintains an index and a view object so that the subsets of a string partition can be edited independently of each other.
To initialize a StringView we pass a partition of a string. A partition is a collection of disjoint sets whose union gives the original string. It is described by a dictionary in which each key is a set name (view name) and each value is the list of strings that the set is made of.
During initialization we build the index and view objects. The index is a list of ViewSegmentItem:
from dataclasses import dataclass
from typing import Any

@dataclass
class ViewSegmentItem:
    view_name: Any
    master_index: int
    segment_index: int
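To make the index concrete, here is a minimal sketch, not the library's actual implementation. It assumes that segments are interleaved round-robin across the views (which is consistent with the get_content() output shown in this section), that master_index is the position of a segment in the full string, and that segment_index is its position inside its own view. The helpers build_index and join_content are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class ViewSegmentItem:
    view_name: Any
    master_index: int   # assumption: position of the segment in the full string
    segment_index: int  # assumption: position of the segment inside its view


def build_index(segments: Dict[Any, List[str]]) -> List[ViewSegmentItem]:
    """Interleave the views round-robin: text[0], markup[0], text[1], ..."""
    index: List[ViewSegmentItem] = []
    longest = max(len(parts) for parts in segments.values())
    for i in range(longest):
        for view_name, parts in segments.items():
            if i < len(parts):
                index.append(ViewSegmentItem(view_name, len(index), i))
    return index


def join_content(segments: Dict[Any, List[str]], index: List[ViewSegmentItem]) -> str:
    """Rebuild the full string by walking the index in master order."""
    return "".join(segments[item.view_name][item.segment_index] for item in index)


segments = {
    "text": ["", "October journal", "\n\n", "October first", "\n\n", "Today it rained."],
    "markup": ["<h1>", "</h1>", "<h2>", "</h2>", "", ""],
}
index = build_index(segments)
print(repr(join_content(segments, index)))
```

Because each index entry only points at a (view, segment) pair, a segment of one view can be rewritten without touching the segments of the other views.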
After initialization, one can update a view without affecting the others.
>>> segments = {
... "text": ["","October journal","\n\n","October first","\n\n", "Today it rained."],
... "markup": ["<h1>","</h1>","<h2>","</h2>","",""]
... }
>>> from danoan.correct_markdown.core.string_view import StringView
>>> SV = StringView(segments)
>>> SV["text"]
'October journal\n\nOctober first\n\nToday it rained.'
>>> SV["markup"]
'<h1></h1><h2></h2>'
>>> SV.get_content()
'<h1>October journal</h1>\n\n<h2>October first</h2>\n\nToday it rained.'
>>> s = SV["text"].find("October first")
>>> SV["text"] = s, "Monday, October first\n\nToday was sunny!"
>>> SV["text"]
'October journal\n\nMonday, October first\n\nToday was sunny!'
>>> SV.get_content()
'<h1>October journal</h1>\n\n<h2>Monday, October first</h2>\n\nToday was sunny!'
Notice that to replace part of a view, we index the StringView by the view name and assign an integer and a string to it. The integer is the position in the view at which the reassignment starts, and the string is the new content.
Status
Implemented.
Consequences
Refactor strikethrough-items to start using StringView. A helper data structure will likely be needed to address special requirements of the markdown correction feature.
Diff Items: Diff by character and diff by word
Date: (2024-10-07)
Context
Our goal is to correct a given markdown file (original) and generate another (corrected) that is as similar as possible to the original one. That is, we only apply corrections to words, not formatting. The formatting should be kept the same as the original and that includes white spaces and new lines.
The diff items can be computed in either word-basis or character-basis mode. In the first, trailing spaces and new lines are ignored, while in the second they are not.
From an application point of view, we are interested in a word-basis diff to identify the parts of the text that differ between the original and corrected versions. The diff items generated in word-basis mode are used to generate the explanations of the corrections. It is not helpful to pass formatting differences to the explanation prompt.
On the other hand, to create the StringView segments we need to use character-basis mode; otherwise we cannot reconstruct the original string correctly.
Decision
Accepted.
Status
Implemented.
Consequences
The joiner string in word-basis mode is a space, while in character-basis mode it is the empty string.
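The difference between the two modes can be sketched with Python's standard difflib module. This is an illustration under assumptions, not the project's actual diff code: the function name diff_items, the mode strings "word" and "character", and the tuple layout of each item are all hypothetical.

```python
import difflib
from typing import List, Tuple


def diff_items(original: str, corrected: str, mode: str) -> List[Tuple[str, str, str]]:
    """Return (operation, original_part, corrected_part) tuples.

    Hypothetical helper: in "word" mode the strings are split on whitespace
    and parts are joined with a space; in "character" mode every character
    is kept and parts are joined with the empty string.
    """
    if mode == "word":
        a, b, joiner = original.split(), corrected.split(), " "
    else:
        a, b, joiner = list(original), list(corrected), ""
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, joiner.join(a[i1:i2]), joiner.join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    ]


original = "Today  it rain.\n\nSee you"
corrected = "Today  it rained.\n\nSee you"

word_items = diff_items(original, corrected, "word")
char_items = diff_items(original, corrected, "character")

# Character-basis items rebuild the original string exactly.
rebuilt_from_chars = "".join(part for _, part, _ in char_items)
# Word-basis items lose the double space and the blank line.
rebuilt_from_words = " ".join(part for _, part, _ in word_items)
```

In this sketch the word-basis items isolate the changed word, which suits an explanation prompt, while only the character-basis items preserve the whitespace needed to build StringView segments.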
MarkdownView: Ignore white spaces and new lines during find operation
Date: (2024-10-09)
Context
The diff items are computed using word-basis mode, and this mode ignores white spaces and new lines. Therefore, we cannot reconstruct the exact string present in the text from the diff items: new lines and white spaces might be missing. That is why, during the find operation, we use a regex that ignores white spaces and new lines between words.
Decision
Accepted.
Status
Implemented.
Consequences
Use [\s\n]* between the words of the search_value in the find method of MarkdownView.
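A minimal sketch of how such a pattern could be built with Python's re module follows. The function name build_search_pattern is hypothetical and the real find method of MarkdownView may differ; the sketch also assumes that the words of search_value should be matched literally, so each word is escaped.

```python
import re


def build_search_pattern(search_value: str) -> "re.Pattern[str]":
    """Join the escaped words of search_value with [\\s\\n]* (hypothetical helper)."""
    words = search_value.split()
    return re.compile(r"[\s\n]*".join(re.escape(word) for word in words))


text = "October  journal\n\nOctober\nfirst"
match = build_search_pattern("October first").search(text)
# The match spans the new line that the word-basis diff item dropped.
found = match.group() if match else None
```

In Python's re module, \s already matches new lines, so the explicit \n in the character class is redundant but harmless. Because the quantifier is *, the pattern also matches when no white space at all separates two words.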