Tuesday, July 30, 2019

Markdown and support of embedded mathematics

Hello everyone!

In the previous post I mentioned that Cantor now handles embedded mathematical expressions inside of Markdown, like $...$ and $$...$$ in accordance with the Markdown syntax.

In the past Cantor for a long time didn’t have any support for Markdown and only have simple text entry type for comment purposes. Markdown entry type was added only in 2018 by kqwyf. Internally, this was realized by using the Discount library, which converts markdown syntax to the to html code which is then passed to Qt for final the rendering (Qt supports limited set of the html syntax).

Discount library actually supports integration with LaTeX: text inside LaTeX expressions like $$...$$, \(...\), \[...]\ is passed to the output html string without modifications (except html escaping).

As you see Discount doesn't support embedded mathematics with single delimiter $...$ that is used in Jupyter very frequently. Of course, for my Jupiter integration projects ignoring this type of statements was not an option. I decided to report this issue in Discount bug tracker because all the other options solve this problem purely in Cantor had other problems.

Fortunately, the author of Discount reacted very soon (thanks to him for that) and suggested code changes for supporting the single-delimited math. Unfortunately, the changes didn't get into master branch yet. To proceed further in Cantor I decided to copy required Discount’s code having all the relevant changed into Cantor’s repository as a third party library.

Independent of the support for the single-delimiter mathematics, there is a big problem with the embedded mathematical expressions - you need to somehow find these mathematical statements in output html string. In the initial implementation I simply searched for $$ in the result string but this could lead to "search collisions".

The dollar sign could be inside of a Markdown code block or inside of a quote block. Here, the dollar signs shouldn't treated as part of the embedded mathematics. After some further testing of couple of other implementations on Cantor sidethe conclusion was obvios - the identification and labeling of positions of embedded mathematics in the html string, produced by Discount, should be done directly inside Discount itself.

At this moment, the version of Discount added to Cantor’s repository had two additional functional fixes on top of the officially released version of this library. First, Discount copies all LaTeX expressions during the processing of markdown syntax to a special string list, which is then used by Cantor to search for LaTeX code. Second, a useful change was to add an ASCII non-text symbol to every math expression. This symbol is used as a search key which greatly reduces the likelihood for a string collision, still theoretically possible, though.

For example, if Discount will find (according Markdown syntax) math expression $\Gamma$, then it will write the additional symbol and the expression iin the output html string will be $<symbol>\Gamma$ and Cantor will search exactly this text.

I think, that's all.  Maybe this doesn’t look like a complex problem but solving this problem was a task that took the most time and it took me two months to fix it. So, I think the problem and its solution deserved a separate blog post.

At this moment, what I called "maximum plan" (I have mentioned this concep in this post) of the Jupyter support in Cantor is mostly finished. So, in the next post I plan to show how Cantor now handles test notebooks and what I’ll plan to do next.

Tuesday, July 23, 2019

Improved rendering of mathematical expressions in Cantor

Hello everyone!

In the previous post I mentioned that the render of mathematical expressions in Cantor has bad performance. This heavily and negatively influences user experience. With my recent code changes I addressed this problem and it should be solved now. In this blog post I wand to provide some details on what was done.

First, I want to show some numbers proving the performance improvements. For example, loading of the notebook "Rigid-body transformations in a plane (2D)" - one of the notebooks I’m using for testing - took 15.9 seconds (this number and all other numbers mentioned in the following are average values of 5 consequent measurements). With the new implementation it takes only 4.06 seconds. And this acceleration comes without loosing render quality.

This is a example, how modern render looks like compared with Jupyter renderer (as you can see, Cantor doesn't show images from web in Markdown entries, but I will fix it soon).




I did further measurements by executing all the tests I wrote for the Jupiter import which cover several Jupyter notebooks. Here the results:
  • Without math rendering - 7.75 seconds.
  • New implementation - 14.014 seconds.
  • Old implementation - 41.296 seconds.
To quickly summarize, we get an average of 535% performance improvement. This result depends on the number of available cores and I’ ll explain below why.

To get these results I solved two main problems of math rendering in Cantor.
First, I changed the code for the LaTeX renderer. In the old implementation the rendering process consisted of the following steps:
  1. create TeX document using a page template and the code provided by the user.
  2. run latex executable on the produced TEX file to generate the DVI file.
  3. run dvips executable to convert the DVI file to an EPS file.
  4. convert the produced EPS file to a QImage using Libspectre library.
After these four steps the produced QImage is shown Cantor’s worksheet (QGraphicsScene/View). As you see, the overall chain of steps to get the image out of a mathematical expression is quite long - there are several steps where the time is spent. In total, for a usual mathematical expression these operations take ~500 ms where the first three steps take 300 ms and the last step takes 200 ms. The complexity and the size of the mathematical expressions have a negligible impact on the numbers shown above. Also, the time spent in Cantor for expressions of other types is relatively small. So, for example if you have a notebook with 20 different mathematical expressions and some other entries of other types, Cantor will load the project in ca 20*500ms=10s.

I reduced this chain to three elements by merging the steps two and three. This was achieved by using pdflatex as the LaTeX engine which produces a PDF file directly out of the TEX file. Furthermore, I replaced libspectre library with Poppler pdf rendering library. This brought the overall time down to 330 ms with pdflatex process taking 300 ms and with the rendering in Poppler (converting PDF to QImage) taking only 30 ms. With this, for our example notebook with 20 mathematical expressions mentioned above the rendering take only 6.6 seconds. In this new chain, the LaTeX process is the bottle neck and I’m looking for potential acceleration here but until now I didn’t find any "magic" parameters which would help to reduce this time spent in latex rendering.

Despite this progress, loading of somewhat bigger documents will hurt in Cantor. For example, for a project having 100 formulas openning of the file will take ca. 33 seconds.

The problem here is that the rendering process is a blocking and non-parallelized operation - only one mathematical expression is processed simultanuosly and the UI is blocked for the whole processing time. Obviously, this behaviour is unsatisfactory and under-utilizes the modern multi-core CPUs. So I  decided to run the rendering part in many parallel tasks asynchronously without blocking the main GUI thread. Fortunately, Qt helps here a log with it's classes QThreadPool managing the pool of threads and QRunnable providing an interface to define "tasks" that will be executed in parallel.

Now when openning a project, for every Markdown entry containing mathematical expression, Cantor creates a render task for every expression, sends this task to the thread pool and continues with the processing of entries in the document. Once such a task has finished, the result is shown in Cantor's worksheet. With this a further good performance improvement can be achieved. Clearly, the more cores you have the faster the processing will be. Of course, if you have only a small number of physical threads possible on your computer, you won't notice a huge difference. But still, you should see an improvement compared to the old single-threaded implementation in Cantor.

For a notebook comparable in size to  "Rigid-body transformations in a plane (2D)" project which has 109 mathematical expressions, the loading of the notebook takes a reasonable and acceptable time on my hardware (I have 8 physical cores in the CPU, so that is why the render acceleration is so big). And, thanks to the asynchron processing, the user can interact with the notebook even though the rendering of the mathematical expressions is still in process.

Since my previous post, not only math renderer have changed, there is also a huge change in Markdown support - Cantor finally handles embeded math expressions, like $...$ and $$...$$ in accordance with the Markdown syntax. In the next blog post I'll describe how it works now.

Wednesday, July 10, 2019

New unit tests for the new code

Hello everyone,

today I want to present the test system for Cantor's worksheet.
The worksheet is the most central, prominent and important part of the application where the most work is done.

So, it is important to cover this part with enough tests to ensure the quality and stability of this component in future.

At the moment, this system contains only ten tests and all of them cover the functionality for the import of Jupyter notebooks only that was added recently to Cantor (I have mentioned them in my first post).
However, this test infrastructure is of generic nature and can easily be used for testing Cantor's own Cantor files, too.

The test system checks that a worksheet/notebook file is loaded successfully, tests the backend type and validates the overall worksheet structure and the content of its entries.

Actually, some content is not validated, for example the image content. This would increase the complexity of the tests and slow down their execution without additional big value with respect to the quality assurance.

This new infrastructure has proven to be helpful already. When writing the first tests for the worksheet I have found couple of bugs in the implementation of the import of Jupyter notebooks. After having fixed them and now, having this additional barriers, I'm more confident about the implementation and can say more surely that the import of Jupyter notebooks works fine.

In previous post I have mentioned some issues with the perfromance of the renderer used for mathematical expressions in Cantor. It turned out this problem is not so easy to solve as I assumed first. But now, after having finished a substantial part of the work that was planned to be done as part of this GSoC project, I can give more attention to to remaining problems, including this one with the performance of the renderer.
In the next post I plan to show a better realization of the math renderer in Cantor.