VisualEditor is an in-browser rich text editor for HTML documents. It is most widely used as the Wikipedia editor. However, the core implementation is a standalone JavaScript library that does not depend on the MediaWiki platform.
This tutorial focuses mainly on the fundamentals of the VisualEditor core implementation. It is aimed at newcomers joining the Editing team and volunteers who wish to hack on internals, but it should be interesting to anyone who wants to understand how an in-browser rich text editor works.
Some of the fundamentals depend heavily on algorithms, so a computer science background would be very helpful, but pointers will be provided so the reader can fill in necessary background learning.
VisualEditor works on HTML documents — it doesn’t know wikitext. However, MediaWiki’s native storage format is wikitext. This is possible because MediaWiki’s parser (Parsoid) can automatically translate to/from HTML+RDFa format, and a lot of work has gone into ensuring diffs round-trip cleanly, so that source editors who use raw wikitext can work side-by-side with rich-text editors who use VisualEditor.
VisualEditor never needs to parse wikitext directly; from its point of view, it just sees MediaWiki loading/saving HTML+RDFa documents.
We’re going to try things out on a live Wikipedia page, loaded into VisualEditor. So of course don’t click Publish! (Though if you did ever publish a test change by accident, you could just revert it or someone else would).
Working on a live Wikipedia page means you don’t need a local development environment identical to the instance, which can actually be hard to achieve. There are over 300 versions of Wikipedia (in different languages) and each may have different templates and extensions installed, not to mention a data set that can be very large.
We’re using Simple English Wikipedia, which is written in easy-to-understand English and has a (much) smaller data set than full English Wikipedia.
In the URL, oldid=9187227 means we’re opening a specific revision of the page, not necessarily the most current. This is helpful for a tutorial because it means the data will be exactly as the tutorial expects. In general, if you edit an old revision then click Publish, you’ll potentially undo the changes made in subsequent revisions — as an editor you’re responsible for handling this manually. This doesn’t matter for the tutorial, because you won’t ever publish.
In the URL, veaction=edit means we’re jumping straight into VisualEditor. Normally, you’d open an article in read mode, and then click to edit the page.
Now open the developer tools. In Firefox and Chromium, you can press Ctrl+Shift+I to do this. Click on the Console tab, and type:
s = ve.init.target.surface.model
d = s.documentModel
Then d is the document model (i.e. the abstract data representation of the document we're editing), and s is the surface model, which additionally represents the current selection.
We will use s and d so defined throughout this tutorial.
Now try:
d.getData()
The array returned is a dump of the linear model.
Q1a. How long is the array?
A1a. The array has length 779.
Q1b. What section within the array represents the word “traditional” in the first paragraph?
Q1d. What difference do you notice between the representation of “traditional” vs “Oolong”?
A1d. Each letter of “Oolong” is represented as an array, the first element being the letter and the second element being ["hed5e5a34cf0f5c5b"]. But it is not immediately obvious what hed5e5a34cf0f5c5b means.
LEARNING GOALS: We learned how to access the live VisualEditor instance running within an editing session on a Wikipedia page, and how to query the document model to see the abstract representation of the content.
“The three primary components of VisualEditor are ve.__, ve.__ and ve.__ .”
“The linear model is optimized for ___ editing. It is similar to an ___ token stream, but with ___ ___ composed onto each character. This allows arbitrary ___ of content to be simple and efficient.”
LEARNING GOALS: We learned about the VisualEditor architecture, and in particular what the linear model is. Now we’ll return to the live editing session, and learn about the transactions system.
In the developer tools console tab (see above), type:
d.completeHistory
and find the .transactions array. If you haven’t edited anything, it should contain exactly one transaction. (If you have edited something, you may need to open this afresh in a private browser tab, to defeat VisualEditor’s autosave feature which may preserve your edits even if you close and reopen the page). That transaction should consist of a single .operation.
Q3a. What is the .type of the operation? What does the operation do?
A3a. The .type of the operation is 'retain'. It is essentially a no-op, keeping content unchanged.
Q3b. The .length property is 779. Where have you seen that number before? Why do you think it appears here?
A3b. 779 was the length of the linear model data. So retaining 779 items means keeping the entire document unchanged.
Now click on the document text, press Ctrl+A (to select all), then press Backspace (to delete the entire selection). Look again at d.getData().
Q4a. How long is the array now?
A4a. The array now has length 348.
Q4b. What are the items at offsets 2 and 3?
A4b. The items at offsets 2 and 3 are open and close tags for an mwCategory item.
Q4c. Can you find content in the editing interface that corresponds to the items? Can you do something to delete that content and make those items disappear?
A4c. They correspond to the Category: Tea tag at the bottom of the page. Clicking on the tag, removing the category Tea, then clicking “Apply changes” removes the tag from the page and the items from the linear model data (so calling d.getData() will show the length becomes 346).
Then try it again after pressing Ctrl+Z repeatedly to undo all changes.
Q5a. What is the response? Comparing to the linear model, what do you think “document range” might mean?
A5a. After doing Select All + Delete, the document range is 0-2. For the original document state, the document range is 0-435. So “document range” in some sense represents the entire document. But it is not immediately obvious why some content lies beyond the document range.
Q5b. Look closely at the linear model data again. What tag contains all the content outside the document range? What is this content?
A5b. All the content outside the document range lies inside a single internalList tag pair. It appears to contain content that appears in references.
Q5c. Why do you think such content is stored separately?
A5c. Each reference can be cited more than once, so there may be a need to have a handle on them in an object store that’s separate from the main document content.
Q5d. It feels like such content should have disappeared when you deleted everything. But it didn’t. Can you explain why that is not actually a bug?
A5d. Uncited references just take up unnecessary memory in the internalList store. This is not really a problem because a VisualEditor edit session has a relatively short lifetime. Once the editor publishes, the edit session will end and the memory will be freed.
In the linear model, find the words “Oolong” and “traditional” from the first paragraph again. Recall there’s an interesting difference, and expand the items to see it completely.
Q6. What do you think hed5e5a34cf0f5c5b might mean?
A6. The word “Oolong” is bolded and each letter is represented with the hed5e5a34cf0f5c5b code, whereas the word “traditional” is not bolded and the letters do not have the hed5e5a34cf0f5c5b code. So it looks like something to do with the bold.
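To make the shape of this data concrete, here is a toy sketch that runs outside VisualEditor. It assumes what the console output suggests: plain characters are strings, annotated characters are [character, [code, ...]] pairs, and element boundaries are objects with a type. It further assumes the code is a key into a separate store of annotation objects. The names toyStore, toyData, getText and getAnnotationTypesAt, and the key 'h-bold', are invented for illustration.

```javascript
// Toy linear data for a paragraph whose first two letters are bold.
// The store key 'h-bold' is made up; real keys look like 'hed5e5a34cf0f5c5b'.
const toyStore = {
  'h-bold': { type: 'textStyle/bold' }
};
const toyData = [
  { type: 'paragraph' },
  [ 'O', [ 'h-bold' ] ],
  [ 'o', [ 'h-bold' ] ],
  ' ',
  'i',
  's',
  { type: '/paragraph' }
];

// Return the plain text of the data, skipping element open/close items.
function getText( data ) {
  return data
    .filter( ( item ) => typeof item === 'string' || Array.isArray( item ) )
    .map( ( item ) => ( Array.isArray( item ) ? item[ 0 ] : item ) )
    .join( '' );
}

// Return the annotation types applied at one offset (empty for plain items).
function getAnnotationTypesAt( data, offset, store ) {
  const item = data[ offset ];
  if ( !Array.isArray( item ) ) {
    return [];
  }
  return item[ 1 ].map( ( key ) => store[ key ].type );
}

console.log( getText( toyData ) );
console.log( getAnnotationTypesAt( toyData, 1, toyStore ) );
console.log( getAnnotationTypesAt( toyData, 3, toyStore ) );
```

Because every character carries its own annotation keys, any slice of the array can be cut out or moved without first having to close and reopen formatting tags.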
Q1. Look inside tx1.operations and guess at the meaning of everything
A1. tx1.operations is a diff representing how the document shall be changed. Conceptually, it is applied by putting a pointer at the start of the linear model, then working through the operations in order:
the “retain” operation of length 3 means the next three characters shall remain unchanged;
the “replace” operation removes [ 'c', 'd' ] and inserts [] (i.e. it is a pure removal);
finally the “retain” operation of length 6 means the rest of the document shall remain unchanged.
Notice that the “replace” operation is symmetrical: it specifies the content to remove as well as the content to insert. This is useful for creating the reverse (“undo”) transaction:
s.change(tx1)
s.change(tx1.reversed())
s.change(tx1) // error, because tx1.applied === true
tx2 = ve.dm.TransactionBuilder.static.newFromInsertion(d, 6, ['h','e','l','l','o'])
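The pointer-walking description of retain and replace can be tried out with a toy re-implementation. This is only a sketch: applyOperations and reverseOperations are hypothetical helpers written for this tutorial, not VisualEditor's own functions, and the toy document (a heading containing 'abcdefg' followed by an empty internalList) is an assumed stand-in for the demo page.

```javascript
// Toy sketch: apply retain/replace operations to an array of items.
function applyOperations( data, operations ) {
  const result = [];
  let cursor = 0; // pointer into the original data
  for ( const op of operations ) {
    if ( op.type === 'retain' ) {
      // Copy the next op.length items unchanged
      result.push( ...data.slice( cursor, cursor + op.length ) );
      cursor += op.length;
    } else if ( op.type === 'replace' ) {
      // Check the content to remove really is at the pointer
      const found = data.slice( cursor, cursor + op.remove.length );
      if ( JSON.stringify( found ) !== JSON.stringify( op.remove ) ) {
        throw new Error( 'Content to remove does not match at offset ' + cursor );
      }
      result.push( ...op.insert );
      cursor += op.remove.length;
    } else {
      throw new Error( 'Unknown operation type: ' + op.type );
    }
  }
  if ( cursor !== data.length ) {
    throw new Error( 'Operations must cover the whole document' );
  }
  return result;
}

// Reversing only needs to swap remove and insert, because replace is symmetrical.
function reverseOperations( operations ) {
  return operations.map( ( op ) => ( op.type === 'replace' ?
    { type: 'replace', remove: op.insert, insert: op.remove } :
    op ) );
}

const toyData = [
  { type: 'heading' }, 'a', 'b', 'c', 'd', 'e', 'f', 'g', { type: '/heading' },
  { type: 'internalList' }, { type: '/internalList' }
];
const toyOps = [
  { type: 'retain', length: 3 },
  { type: 'replace', remove: [ 'c', 'd' ], insert: [] },
  { type: 'retain', length: 6 }
];

const afterData = applyOperations( toyData, toyOps );
const undoneData = applyOperations( afterData, reverseOperations( toyOps ) );
console.log( afterData.length, undoneData.length );
```

Note that the retain lengths plus the removed length (3 + 2 + 6) add up to the full length of the original data, so the operations describe the whole document, not just the changed part.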
Q4. Reload the page, setup s= and d= and tx1= again. Set a breakpoint inside s.change. Apply tx1 and step into everything interesting.
An interesting place to explore:
Step into changeInternal and see that it commits each transaction via ve.dm.Document.commit. Step inside again. Notice that each commit creates a new ve.dm.TransactionProcessor and calls its process method, which then calls ve.dm.TreeModifier’s process method. Step into ve.dm.TreeModifier.static.applyTreeOperations. From here we arrive at the ve.dm.TreeModifier.static.applyTreeOperation method that we’ll learn about in Tutorial 3.
Step in again, into ve.dm.Node#setLength. This method changes the length of this DM node, and makes the corresponding change to all ancestor nodes (recursively), and uses “emit( ‘update’ )” to notify listeners there has been a change.
Q2. Read this method very carefully, and try to state in what order the following things happen: updating this node’s length, updating ancestor nodes’ lengths, notifying listeners of these changes.
A2.
Nitty gritty details:
The node we’re updating is type “text”, its parent is type “heading”, and grandparent is type “document”.
Updates this node’s length (7 to 6)
Updates parent’s length (7 to 6)
Updates grandparent’s length (11 to 10)
Emits ‘lengthChange’ and ‘update’ from grandparent’s setLength
Emits ‘lengthChange’ and ‘update’ from parent’s setLength
Emits ‘lengthChange’ and ‘update’ from node’s setLength
Takeaway:
Recursively updates all lengths, starting at the current node. When it hits the end of the recursion, it emits ‘lengthChange’ and ‘update’ from each node, all the way back to the starting node. All lengths must be adjusted before emitting update; the LM and DM tree must be in sync.
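The ordering in the takeaway can be reproduced with a toy node class. This is a sketch under assumptions: ToyNode and its tiny listener mechanism are invented for illustration and only mimic the behaviour described above (set own length, recurse into the parent, then emit). The lengths are the ones from the walkthrough.

```javascript
// Toy sketch of a node whose setLength updates ancestors before emitting.
class ToyNode {
  constructor( type, length, parent ) {
    this.type = type;
    this.length = length;
    this.parent = parent || null;
    this.listeners = [];
  }

  on( listener ) {
    this.listeners.push( listener );
  }

  emit( eventName ) {
    // Listeners run synchronously, before emit returns
    this.listeners.forEach( ( listener ) => listener( eventName, this ) );
  }

  setLength( length ) {
    const diff = length - this.length;
    this.length = length;
    if ( this.parent ) {
      // Recurse first: all ancestor lengths are updated before any event fires here
      this.parent.setLength( this.parent.length + diff );
    }
    this.emit( 'lengthChange' );
    this.emit( 'update' );
  }
}

const eventLog = [];
const documentNode = new ToyNode( 'document', 11 );
const headingNode = new ToyNode( 'heading', 7, documentNode );
const textNode = new ToyNode( 'text', 7, headingNode );

[ documentNode, headingNode, textNode ].forEach( ( node ) => {
  node.on( ( eventName, source ) => {
    // Record the lengths every listener sees at the moment it is called
    const lengths = [ textNode.length, headingNode.length, documentNode.length ];
    eventLog.push( eventName + ' from ' + source.type + ' (lengths: ' + lengths.join( ',' ) + ')' );
  } );
} );

textNode.setLength( 6 );
console.log( eventLog.join( '\n' ) );
```

Every logged line shows the same, fully updated lengths, which is the point of the ordering: no listener ever observes a half-updated tree.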
Step into the emit( ‘update’ ) line. This will pass into OO.EventEmitter#emit; you want to step into the method.apply line, which will pass into ve.ce.BranchNode#onModelUpdate. Notice we’re now in a completely different part of the codebase: the listener lives in the CE.
Q4. Is this listener running synchronously or asynchronously, with respect to the emit call? How do you know?
A4. Synchronously. The emit call blocks until all the listener functions for it have completed their execution.
Notice we’re processing operations one by one. Each operation modifies the linear model, then we update the DM tree correspondingly, then each node change in the DM tree emits an ‘update’ event which the CE node uses to update itself correspondingly.
Q5. How is this even possible? A single linear operation, in isolation, does not necessarily preserve tree validity. It can leave the linear data in a state that does not even represent a tree. For instance <heading>...</paragraph>. So how does VE update the tree incrementally?
A5. There are two different types of operations here: linear operations and tree operations. ve.dm.TreeModifier calculates tree operations from the linear ones, and each tree operation is guaranteed to leave the tree in a valid state.
Step into ve.dm.TreeModifier.calculateTreeOperations to see how tree operations are made.
Next time: synchronous updates originating in the model.
Tutorial 4: Updates initiated in the model vs the view
Q1. Look at the call stack. How did renderContents (which is CE code) get called from DM code (which isn’t supposed to know or care whether there’s a CE listening)? Is this call synchronous (=happens while the DM is applying a transaction) or asynchronous (=happens after the DM has finished applying a transaction)?
A1. renderContents is called from the event emitter that we went over in the previous section. The call is synchronous; it happens while the DM is applying a transaction (to be precise, after the current tree operation has been processed but before the next tree operation is processed).
Q2. Step carefully through renderContents. When does the update reach the DOM?
A2. Child nodes are detached from this.$element and then changes are made. The changes reach the DOM when the nodes are reattached to this.$element with appendRenderedContents.
Now apply the same change but do it by editing the contentEditable DOM directly: select the letters ‘cde’ and press ‘e’ (so the net effect will be to remove the ‘c’ and ‘d’). The breakpoint in renderContents should trigger again.
Q3. Look at the call stack this time. Can you see where the following things happened?
ve.ce.SurfaceObserver detected that the content changed
ve.ce.Surface built a ve.dm.Transaction
ve.ce.Surface added a render lock then applied the transaction
ve.ce.ContentBranchNode saw the render lock and so did not try to update its contents
A3
ve.ce.SurfaceObserver detects that the content has changed in pollOnceInternal
ve.ce.Surface builds a ve.dm.Transaction in handleObservedChanges; specifically, it calls ve.ce.TextState.getChangeTransaction to build the transaction from the observed change (this call is not seen in the call stack because it has already returned)
ve.ce.Surface adds a render lock in handleObservedChanges and applies the transaction in changeModel
ve.ce.ContentBranchNode checks for the render lock in renderContents (first if statement returns false)
Q4. Describe briefly the difference in control flow between the first example (where the update was initiated in the model) and the second example (where the update was initiated in the view).
A4
(A) Update initiated in model, (B) Update initiated in view
Differences:
In (A), the DM initiates the transaction; in (B), the ve.ce.Surface initiates it
More specifically, in (A), we build a transaction manually and then call ve.dm.Surface.change on it (though in other model-initiated changes it could come from a keydown handler and be built programmatically)
Whereas in (B), ve.ce.Surface observes a change that already happened to ContentEditable, builds a transaction from the observed changes, and then calls ve.dm.Surface.change
In both (A) and (B), DM is then updated through the TransactionProcessor
In (A), the view is updated in renderContents; in (B), renderContents does nothing
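The two control flows can be compared in a toy model/view pair. This is a sketch, not VisualEditor code: ToyModel, ToyView, domText, renderLocks and handleObservedChange are hypothetical names, and a plain string stands in for the contentEditable DOM.

```javascript
// Toy model: holds the text and notifies listeners synchronously on change.
class ToyModel {
  constructor( text ) {
    this.text = text;
    this.listeners = [];
  }

  onUpdate( listener ) {
    this.listeners.push( listener );
  }

  change( newText ) {
    this.text = newText;
    this.listeners.forEach( ( listener ) => listener() );
  }
}

// Toy view: re-renders on model updates unless a render lock is held.
class ToyView {
  constructor( model ) {
    this.model = model;
    this.domText = model.text; // stands in for the contentEditable DOM
    this.renderLocks = 0;
    this.renderCount = 0;
    model.onUpdate( () => this.renderContents() );
  }

  renderContents() {
    if ( this.renderLocks > 0 ) {
      // The DOM already shows this content, so leave it alone
      return false;
    }
    this.domText = this.model.text;
    this.renderCount++;
    return true;
  }

  // (B): the browser has already changed the DOM; bring the model up to date
  handleObservedChange( observedText ) {
    this.domText = observedText;
    this.renderLocks++;
    try {
      this.model.change( observedText );
    } finally {
      this.renderLocks--;
    }
  }
}

const toyModel = new ToyModel( 'abcdefg' );
const toyView = new ToyView( toyModel );

toyModel.change( 'abefg' ); // (A) model-initiated: the view renders
console.log( toyView.domText, toyView.renderCount );

toyView.handleObservedChange( 'abeg' ); // (B) view-initiated: rendering is skipped
console.log( toyView.domText, toyView.renderCount );
```

In both cases the model ends up matching the view; the lock only prevents the view from overwriting DOM content that the browser has just produced.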
Q1. Guess, and then test, what formatting will appear if you type text after placing the cursor:
between the space and the ‘d’
between the ‘f’ and the space
Notice that the cursor positions above are visually ambiguous: it’s not clear whether they lie inside the italic tags or outside. Chromium normalizes ambiguous cursor positions towards the left, or more precisely, towards the document start (since it applies in right-to-left scripts too).
Q3. Try the same experiments from Q1-2 in Firefox. Does the result depend on whether you click on the cursor position or cursor there with the arrow keys?
Notice that Firefox does NOT normalize ambiguous cursor positions. When moving the cursor with left/right arrow keys, it moves lazily (choosing the nearest of the ambiguous cursor positions to the prior position).
Notice that VE adds an extra cursor step to step into/out of a link, whereby you can type text that extends the link or not, depending on your wishes. Can you think of how this might have been implemented? Bear in mind you’ve just seen Chromium’s native behaviour won’t let you extend a link by typing text at its end.
Fixup as you type, to add link annotation? No, we used to do that but it breaks IMEs. In general we can’t fixup text if it might be part of uncommitted IME candidate text, and there’s no easy way to detect whether text is part of uncommitted IME candidate text. This massively constrains what fixups we can apply.
IME = Input Method Engine, a software component for typing languages with complex scripts, such as Chinese or Japanese. An IME treats a combination of keystrokes as a composite character. This might look like a dropdown of candidate text that the user can choose from as they type. On mobile browsers, the mobile keyboard is an IME and so imposes these same constraints.
Change the link styling to inline-block or block? No, the latter can actually solve this problem, but it has major side effects (e.g. it breaks word wrapping).
Inspect the link to see how we achieve this behaviour. We call the extra <img> elements “annotation nails”.
Q6. Can you explain why we need two at each end of the link, and not just one?
A6. (Depending on the browser) Without the second nail on either side, the browser might always treat the text next to the link as “not a link”, because of how the browser sees img tags.
An example of the difference is how Chromium would actually work with just one tag on either end, since Chromium doesn’t normalize across an img tag. However, this wouldn’t work in Firefox.
It’s less messy to have two images, as it’s more predictable across browsers and responds better to potential future browser implementation changes.
Go to he.wikipedia.org and copy a single word of Hebrew (language written right to left). Paste it over the ‘def’. Now try cursoring right across the h1.
Q9. Is your browser doing visual bidi cursoring or logical bidi cursoring? (Search for these terms if you’re not sure what they mean)
A9. Bidi = bidirectional; combines LTR and RTL scripts
With bidi text, cursor movement/selection is handled in two ways:
Visual = Cursor moves to the next visually adjacent character, regardless of text’s directionality
If you press the left arrow, the cursor moves left, regardless of the direction of the text at the cursor position
Logical = Cursor decides what “before” means based on what’s in memory, the data model, regardless of how it’s rendered
Q10. Bearing in mind that some browsers do visual cursoring and others do logical cursoring, and there’s no easy way for our code to know which will happen, do you need to improve on your answer to Q8?
A10. We have to wait and see whether we did jump over a nail, and then fix the jump to cross the other nail too.
Q11. What do you think “prepare-observe-fixup” means with respect to how we handle cursoring across link boundaries? Can you think of other cases where this would be a useful pattern?
A11. Another case is different scripts and how they may affect cursoring.
In bidirectional text, we cannot necessarily tell whether pressing Left will move the cursor towards the logical start of the document or towards the logical end.
In text with complex grapheme clusters, we cannot necessarily tell how many logical offsets the cursor will skip.
Therefore we need to let native cursor movement happen — and then potentially need to fix up where the cursor landed, for example if it lands inside content that should be read-only (such as template output).
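One possible reading of this pattern, as a toy sketch: remember where the cursor was, let an unpredictable native movement happen, then correct the landing position if needed. Everything here is hypothetical (moveCursor, the integer offsets, and the nativeMove callback that stands in for the browser); it illustrates the idea only and is not how VisualEditor implements it.

```javascript
// Toy sketch of prepare-observe-fixup for cursor movement.
// readOnlyRanges: array of { start, end }; offsets strictly inside are not allowed.
function moveCursor( cursor, nativeMove, readOnlyRanges ) {
  // Prepare: remember the position before the native movement
  const before = cursor;

  // Observe: let the native movement happen; we cannot predict its direction or size
  const after = nativeMove( before );

  // Fixup: if the cursor landed inside read-only content, push it out,
  // continuing in the direction it actually moved
  const range = readOnlyRanges.find( ( r ) => after > r.start && after < r.end );
  if ( !range ) {
    return after;
  }
  return after >= before ? range.end : range.start;
}

const toyReadOnly = [ { start: 3, end: 7 } ];
const stepTowardsEnd = ( offset ) => offset + 1;
const stepTowardsStart = ( offset ) => offset - 1;

console.log( moveCursor( 1, stepTowardsEnd, toyReadOnly ) );
console.log( moveCursor( 3, stepTowardsEnd, toyReadOnly ) );
console.log( moveCursor( 7, stepTowardsStart, toyReadOnly ) );
```

The fixup never needs to know why the cursor moved the way it did (visual versus logical cursoring, grapheme clusters); it only compares the positions before and after.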
Put “throw new Error( 'foo' )” at the top of ve.ce.ClipboardHandler#afterPaste. Then open http://localhost:8000/demos/ve/desktop.html#!h1 and try to paste something with the console open. Note how the browser gives an “uncaught error” warning in the console.
Next, move the “throw new Error( 'foo' )” into the .then callback at the bottom of that method. And try the paste again. Note how there is NO warning now. This is because the .then callback is called through the jQuery promise system, and there’s no good way for it to know whether the promise error is uncaught. (See https://phabricator.wikimedia.org/T233480 ).
Now instead try “Promise.resolve().then( function () { throw new Error( 'foo' ); } )” Note this time you do get an “uncaught error” warning — because the native promise system does know whether the native promise error is uncaught.
Q1. Suppose you suspect an uncaught error is happening in a .then callback within some new code, but you can’t see exactly where. How might you temporarily use native promises to help you debug this?
A1. You could wrap the suspect code inside a native promise callback to quickly see where the uncaught error might be.
Microtasks were added to JS as a way to escape the limitations of a single-threaded language (which JS is)
Agents
Runtime engine maintains set of agents in which to execute JS code
Agents are made up of
Set of execution contexts
Execution context stack
Main thread
Set for any additional threads that may be created to handle workers
Task queue
Microtask queue
Each component of an agent is unique to that agent (except the main thread)
Event loops
Each agent is driven by an event loop
Each iteration of an event loop:
Runs at most one pending JS task
Runs any pending microtasks
Performs any needed rendering and painting before looping again
Task = anything scheduled to be run by the standard mechanisms such as initially starting to execute a script or async dispatching an event
A task can be enqueued by using events, setTimeout(), setInterval(), etc
Microtasks vs tasks
Only one task executes per event-loop iteration; microtasks run after each task finishes, before the next task begins (including any microtasks scheduled by those microtasks)
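The iteration rules above can be simulated with a toy event loop. This is a simplified sketch with invented names (ToyEventLoop, queueTask, runIteration); it models only the ordering of tasks, microtasks and rendering, not a real browser.

```javascript
// Toy event loop: one task per iteration, then drain all microtasks, then render.
class ToyEventLoop {
  constructor() {
    this.tasks = [];
    this.microtasks = [];
    this.log = [];
  }

  queueTask( name, fn ) {
    this.tasks.push( { name: name, fn: fn } );
  }

  queueMicrotask( name, fn ) {
    this.microtasks.push( { name: name, fn: fn } );
  }

  runIteration() {
    // Run at most one pending task
    const task = this.tasks.shift();
    if ( task ) {
      this.log.push( 'task:' + task.name );
      task.fn( this );
    }
    // Run all pending microtasks, including any queued by other microtasks
    while ( this.microtasks.length > 0 ) {
      const microtask = this.microtasks.shift();
      this.log.push( 'microtask:' + microtask.name );
      microtask.fn( this );
    }
    // Rendering happens only after the microtask queue is empty
    this.log.push( 'render' );
  }

  run() {
    while ( this.tasks.length > 0 || this.microtasks.length > 0 ) {
      this.runIteration();
    }
  }
}

const toyLoop = new ToyEventLoop();
toyLoop.queueTask( 'script', ( loop ) => {
  // Comparable to setTimeout( fn, 0 ): a new task
  loop.queueTask( 'timeout', () => {} );
  // Comparable to Promise.resolve().then( fn ): a microtask
  loop.queueMicrotask( 'promise1', ( innerLoop ) => {
    innerLoop.queueMicrotask( 'promise2', () => {} );
  } );
} );
toyLoop.run();
console.log( toyLoop.log.join( ' > ' ) );
```

Although the timeout was queued before either promise callback, both microtasks run first, in the same iteration as the script that queued them.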
Q2. VisualEditor’s jQuery promises are created using the helper method ve.createDeferred . It wouldn’t be a huge code change to reimplement this to use native promises instead. Why might this create subtle timing issues? (Hint: native promises use microtasks but jQuery promises don’t)
A2. Because native promises use microtasks, they execute immediately after a task finishes; once the current stack finishes, all pending promise callbacks run. jQuery’s promise callbacks run later in the event loop, after the browser has processed other events. However, we might not want callbacks to run any earlier, as we need the information that the browser is processing. More generally, the slight change in the processing moment might introduce some very subtle timing bugs that are infeasible to debug, e.g. in interactions with IMEs.
Reload http://localhost:8000/demos/ve/desktop.html#!h1 and place the cursor between ‘c’ and ‘d’. Click Filibuster, press Enter, then click Filibuster again. You should see a call tree appended below the document, with each function call numbered sequentially.
Q3. Can you see where the javascript started handling the ‘Enter’ keypress? And how that triggered one of the two processes you learned in tutorial 4? Is it a model-initiated change or a view-initiated change?
A3. Model-initiated change.
It started handling the ‘Enter’ keypress in 62 (902.00ms-902.00ms) VeCeKeyDownHandlerFactory.lookupHandlersForKey(13, "linear")--->["(function VeCeLinearEnterKeyDownHandler)"]
The KeydownHandler triggers a transaction being built programmatically - from keyboard input and not from any observed changes in the CE.
Tricky. We pressed a key, so why did this not result in the ContentEditable being updated (...and then the surface observing a change in the CE, and then all the steps of a view-initiated change)?
Because in ve.ce.LinearEnterKeyDownHandler, we override the Enter key behavior with preventDefault, so ContentEditable is not actually updated when you would expect it to be.
Now reload, and set a breakpoint at the start of ve.ce.Surface#onDocumentInput. Again place the cursor between ‘c’ and ‘d’, and press ‘x’. You should be able to resume as normal.
Install a Japanese romaji input method on your operating system, activate it, and learn how to enter the Kanji ‘日本’ (“Japan”, probably by typing ‘nihon’ and selecting from a list).
Try to use the Chromium debugger to put a breakpoint at the start of ve.ce.Surface#onDocumentBeforeInput. Now type ‘日本’ in Japanese.
Q5. Does the debugger close the input method? Why do you think this might happen?
“Closing the input method” means prematurely committing the candidate text.
Exactly how an input method interacts with Javascript is highly platform-specific. It depends on platform combinations: OS, browser, language, input method software, and even software version. Different input methods can send radically different sequences of events, even if they look like they’re doing exactly the same thing. For instance, as of 2023, pressing Enter in Android Gboard does completely different things depending on whether the language is English or Cantonese.
Now reload, and place the cursor between ‘c’ and ‘d’ again. Click Filibuster, type ‘日本’ in Japanese, then click Filibuster again, to get a call tree.
Q6. Look carefully through the call tree. Can you list the Javascript events which VisualEditor observes from the input method? For many platform combinations, you’ll see interesting changes of selection and content as the input method software builds up candidate text and then commits it. Or at the other extreme, you may only see a single ‘input’ event where ‘日本’ is inserted.
Filibuster works by wrapping every method in ve.ce.*, ve.ui.* and ve.dm.* with a proxy that logs the call and its return value. It slows down execution greatly and the logs are too vast to be useful for complex edit sessions. Its main use is for debugging input method behaviour — where you can’t set a breakpoint because it will disturb the IME (see Q5) — and for this purpose a few keystrokes on a document of a few words usually suffices.
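The wrapping idea can be sketched in a few lines. This is a hypothetical miniature, not Filibuster's actual code: wrapMethods, callLog and toyHandlers are invented names, and the sketch replaces methods with plain wrapper functions on a single object.

```javascript
// Toy sketch: replace each method of an object with a wrapper that logs
// the call, its nesting depth, and its return value (or thrown error).
function wrapMethods( target, targetName, log ) {
  Object.keys( target ).forEach( ( key ) => {
    const original = target[ key ];
    if ( typeof original !== 'function' ) {
      return;
    }
    target[ key ] = function ( ...args ) {
      const entry = { depth: log.depth, call: targetName + '.' + key, args: args };
      log.entries.push( entry );
      log.depth++;
      try {
        entry.returned = original.apply( this, args );
        return entry.returned;
      } catch ( error ) {
        entry.threw = error;
        throw error;
      } finally {
        log.depth--;
      }
    };
  } );
}

// A made-up object to observe; 13 is the key code for Enter
const toyHandlers = {
  lookup( keyCode ) {
    return keyCode === 13 ? 'enterHandler' : null;
  },
  handleKey( keyCode ) {
    return this.lookup( keyCode ) !== null;
  }
};

const callLog = { depth: 0, entries: [] };
wrapMethods( toyHandlers, 'toyHandlers', callLog );
toyHandlers.handleKey( 13 );

callLog.entries.forEach( ( entry ) => {
  console.log( '  '.repeat( entry.depth ) + entry.call + ' returned ' + JSON.stringify( entry.returned ) );
} );
```

Because nothing pauses execution, the browser and the IME carry on undisturbed while the log builds up, which is what makes this approach usable where breakpoints are not.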
Now use the cursor and shift keys in VE to select the entire H2, starting from the end and moving left (so that your cursor ends at the beginning of the node). Try
r2 = s.getSelection().getRange()
Q2. How do r1 and r2 differ?
A2. ‘from’ and ‘to’ are swapped for r1 and r2. These two variables track selection.
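A toy range class shows how a forwards and a backwards selection over the same content relate. This is a sketch: ToyRange is a hypothetical class, the offsets 12 and 19 are made up, and the start and end properties are an assumption about how a direction-independent view of the range could be exposed.

```javascript
// Toy range: from/to record the direction of selection,
// start/end give the same span regardless of direction.
class ToyRange {
  constructor( from, to ) {
    this.from = from;
    this.to = to === undefined ? from : to;
    this.start = Math.min( this.from, this.to );
    this.end = Math.max( this.from, this.to );
  }

  isBackwards() {
    return this.from > this.to;
  }

  isCollapsed() {
    return this.from === this.to;
  }
}

const toyR1 = new ToyRange( 12, 19 ); // selected from the start, moving right
const toyR2 = new ToyRange( 19, 12 ); // selected from the end, moving left

console.log( toyR1.isBackwards(), toyR2.isBackwards() );
console.log( toyR1.start === toyR2.start && toyR1.end === toyR2.end );
```

Code that only cares which content is selected can use start and end, while code that cares where the cursor sits (for example, when extending the selection) needs from and to.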
and look at the property values returned by sel.getRange() and sf.getSelection().getRange().
Q4. If you insert text at the start of the document, does calling sf.getSelection().getRange() again now give a result with different property values? What are the consequences of this answer?