Thursday, August 22, 2019

Cantor and the support for Jupyter notebooks at the finish line

Hello everyone! It's been almost three weeks since my last post, and this is going to be my final post in this blog. So, I want to summarize all the work done in this GSoC project. As a reminder, the goal of the project was to add support for Jupyter notebooks to Cantor. This format is widely used in science and education, mostly by the application Jupyter, and there is a lot of content available on the internet in this format (for example, here). By adding support for this format to Cantor, we allow Cantor users to access this content. This is a short description; if you are more interested, you can find more details in my proposal.

In the previous post, I described the "maximum plan" of the Jupyter support in Cantor as mostly finished. What this means in practice for Cantor is:
  • you can open Jupyter notebooks
  • you can modify Jupyter notebooks
  • you can save modified Jupyter notebooks without losing any information
  • you can save native Cantor worksheets in Jupyter notebook format
To test the implemented code I used a couple of notebooks mentioned in an earlier post. But the Jupyter world doesn’t consist of only this small number of notebooks, of course. So, it was interesting to confront the code with more notebooks available in the wild.

I recently discovered a nice repository of Jupyter notebooks about Biomechanics and Motor Control with 70 notebooks. I hadn’t used these notebooks for testing and validation before and didn’t know anything about their content. 70 notebooks is quite a number, and my assumption was that these notebooks, even without knowing them in detail, would cover many different parts of the Jupyter notebook format specification and would challenge my implementation to an extent that was not possible in my previous testing. So, this new set of notebooks was supposed to be new and good test content for further and stricter validation of Cantor.

I was not disappointed. After the first round of manual testing based on this content, I found issues in 7 notebooks (63 projects functioning correctly!), which I addressed. Now, Cantor handles all 70 notebooks from this repository correctly.

Looking back at what was achieved this summer, the following list summarizes the project:
  • the scope for mandatory features described in the project proposal was fully realized
  • most of the optional features were implemented
  • some other new features that were needed for the realization of the project were added to Cantor, like new result types, support for embedded mathematical expressions and attachments in Markdown cells, etc.
  • the new implementation was tested and considered stable enough to be merged into master, and we plan to release it with Cantor 19.12
  • new dedicated tests were written to cover the new code and to avoid regressions in the future; the testing framework was extended to handle project load and save steps
I prepared some screenshots of Jupyter notebooks that show the final result in Cantor:

Even though the initial goal of the project was achieved, there are still some problems and limitations in the current implementation:
  • for Markdown entries containing text with images where certain alignment properties were set, or after image size manipulations, the visualization of the content is not always correct, which is potentially a bug in Qt
  • because of small differences in syntax between MathJax used in Jupyter notebooks and LaTeX used for the actual rendering in Cantor, the rendering of embedded mathematical expressions is not always successful. At the moment Cantor shows an error message in such cases, but this message is often not very clear or helpful for the user
  • the Qt classes used by Cantor, which don't involve the full web engine, provide only limited and basic support for HTML. More complex cases like embedded YouTube videos and JavaScript don’t work at all.
This is all for the limitations, I think. Let's talk about future plans and perspectives. In my opinion, this project has reached its initial goals, is finished now and will only need maintenance and support in terms of bug fixing and adjustment to potential format changes in future.

More generally, this project is part of the current overall development activities in Cantor to improve the usability and stability of the application and to extend the feature set in order to enable more workflows and to reach a bigger audience. See the 19.08 and 18.12 release announcements to read more about the developments in the recent releases of Cantor. Support for the Jupyter notebook format is a big step in this direction, but this is not all. We already have many other items in our backlog going in this direction, like UX improvements and plot integration improvements. Some of these items will be addressed soon. Some of them are maybe something for the next GSoC project next year?

I think, that's all for now. Thank you for reading this blog and thank you for your interest in my project. Working on this project was a very interesting and pleasant period of my life. I am happy that I had this opportunity and was able to contribute to KDE and especially to Cantor with the support of my mentor Alexander Semke.
So, Bye.

Tuesday, July 30, 2019

Markdown and support of embedded mathematics

Hello everyone!

In the previous post I mentioned that Cantor now handles embedded mathematical expressions inside of Markdown, like $...$ and $$...$$ in accordance with the Markdown syntax.

For a long time Cantor didn’t have any support for Markdown and only had a simple text entry type for comments. The Markdown entry type was added only in 2018 by kqwyf. Internally, this was realized with the Discount library, which converts Markdown syntax to HTML code that is then passed to Qt for the final rendering (Qt supports a limited subset of HTML).

The Discount library actually supports integration with LaTeX: text inside LaTeX expressions like $$...$$, \(...\), \[...\] is passed to the output HTML string without modifications (except HTML escaping).

As you see, Discount doesn't support embedded mathematics with the single delimiter $...$, which is used in Jupyter very frequently. Of course, for my Jupyter integration project ignoring this type of statement was not an option. I decided to report this issue in the Discount bug tracker because all the other options for solving this problem purely in Cantor had problems of their own.

Fortunately, the author of Discount reacted very quickly (thanks to him for that) and suggested code changes for supporting single-delimiter math. Unfortunately, the changes haven't made it into the master branch yet. To proceed further in Cantor, I decided to copy the required Discount code with all the relevant changes into Cantor’s repository as a third-party library.

Independent of the support for single-delimiter mathematics, there is a big problem with embedded mathematical expressions - you need to somehow find these mathematical statements in the output HTML string. In the initial implementation I simply searched for $$ in the result string, but this could lead to "search collisions".

The dollar sign could be inside a Markdown code block or inside a quote block. Here, the dollar signs shouldn't be treated as part of the embedded mathematics. After some further testing of a couple of other implementations on the Cantor side, the conclusion was obvious - the identification and labeling of the positions of embedded mathematics in the HTML string produced by Discount should be done directly inside Discount itself.
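To make the collision problem more concrete, here is a minimal Python sketch. It is not Cantor's C++ code; the HTML string is a hand-written stand-in for Discount output, and `naive_math_spans` is a hypothetical helper that only mimics the "just search for $$" idea.

```python
import re

# Hand-written stand-in for the HTML produced from Markdown: one real
# formula, and an inline code span that happens to contain "$$".
html = (
    "<p>Euler: $$e^{i\\pi} + 1 = 0$$ and the shell PID: "
    "<code>echo $$ and $$</code></p>"
)

def naive_math_spans(text):
    """Return everything between pairs of $$, ignoring the HTML structure."""
    return re.findall(r"\$\$(.+?)\$\$", text)

spans = naive_math_spans(html)
print(spans)
```

The second match comes from inside the code span, so a renderer relying on this search would try to typeset text that was never meant to be a formula.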

At this moment, the version of Discount added to Cantor’s repository has two additional functional changes on top of the officially released version of this library. First, Discount copies all LaTeX expressions found during the processing of the Markdown syntax to a special string list, which is then used by Cantor to search for LaTeX code. Second, Discount adds an ASCII non-text symbol to every math expression. This symbol is used as a search key, which greatly reduces the likelihood of a string collision, though one is still theoretically possible.

For example, if Discount finds (according to the Markdown syntax) the math expression $\Gamma$, it will write the additional symbol, so the expression in the output HTML string will be $<symbol>\Gamma$, and Cantor will search for exactly this text.
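The idea of the search key can be sketched in a few lines of Python. This is only an illustration under assumptions: the concrete marker character (`\x06` here) and the helper name `marked_math` are made up, and the HTML string is hand-written rather than real Discount output.

```python
import re

MARKER = "\x06"  # assumption: stands in for the non-text ASCII symbol

# Hand-written stand-in for Discount output: one marked formula and a
# code span with plain dollar signs that must not be picked up.
html = (
    "<p>The gamma function $" + MARKER + "\\Gamma$ costs "
    "<code>$5 or $6</code></p>"
)

def marked_math(text, marker=MARKER):
    """Find $<marker>...$ expressions and return them without the marker."""
    pattern = re.compile(r"\$" + re.escape(marker) + r"(.+?)\$")
    return [match.group(1) for match in pattern.finditer(text)]

found = marked_math(html)
print(found)
```

Because ordinary text practically never contains the marker character, the dollar signs inside the code span are ignored and only the labeled expression is found.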

I think that's all. Maybe this doesn’t look like a complex problem, but it was the task that took the most time: it took me two months to solve it. So, I think the problem and its solution deserved a separate blog post.

At this moment, what I called the "maximum plan" (I have mentioned this concept in this post) of the Jupyter support in Cantor is mostly finished. So, in the next post I plan to show how Cantor now handles test notebooks and what I plan to do next.

Tuesday, July 23, 2019

Improved rendering of mathematical expressions in Cantor

Hello everyone!

In the previous post I mentioned that the rendering of mathematical expressions in Cantor has bad performance. This heavily and negatively influences the user experience. With my recent code changes I addressed this problem and it should be solved now. In this blog post I want to provide some details on what was done.

First, I want to show some numbers demonstrating the performance improvements. For example, loading the notebook "Rigid-body transformations in a plane (2D)" - one of the notebooks I’m using for testing - took 15.9 seconds (this number and all other numbers mentioned in the following are average values of 5 consecutive measurements). With the new implementation it takes only 4.06 seconds. And this acceleration comes without losing render quality.

This is an example of how the new renderer looks compared with the Jupyter renderer (as you can see, Cantor doesn't show images from the web in Markdown entries, but I will fix it soon).

I did further measurements by executing all the tests I wrote for the Jupyter import, which cover several Jupyter notebooks. Here are the results:
  • Without math rendering - 7.75 seconds.
  • New implementation - 14.014 seconds.
  • Old implementation - 41.296 seconds.
To quickly summarize, we get an average performance improvement of 535% for the math rendering. This result depends on the number of available cores, and I’ll explain below why.
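The 535% figure appears to correspond to the time spent on math rendering alone, that is, after subtracting the run without math rendering from both totals. The short Python calculation below reproduces it from the three measurements above (the variable names are mine):

```python
baseline = 7.75      # seconds, test run without math rendering
new_total = 14.014   # seconds, new implementation
old_total = 41.296   # seconds, old implementation

old_math = old_total - baseline   # time spent only on math rendering before
new_math = new_total - baseline   # time spent only on math rendering now

math_speedup = old_math / new_math
total_speedup = old_total / new_total

print(f"math rendering: {old_math:.3f} s -> {new_math:.3f} s, x{math_speedup:.2f}")
print(f"whole test run: x{total_speedup:.2f}")
```

Measured over the whole test run, including the work that has nothing to do with math, the gain is smaller, a bit under a factor of three.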

To get these results I solved two main problems of math rendering in Cantor.
First, I changed the code for the LaTeX renderer. In the old implementation the rendering process consisted of the following steps:
  1. create a TeX document using a page template and the code provided by the user.
  2. run the latex executable on the produced TeX file to generate the DVI file.
  3. run the dvips executable to convert the DVI file to an EPS file.
  4. convert the produced EPS file to a QImage using the Libspectre library.
After these four steps the produced QImage is shown in Cantor’s worksheet (QGraphicsScene/View). As you see, the overall chain of steps to get the image out of a mathematical expression is quite long - there are several steps where time is spent. In total, for a usual mathematical expression these operations take ~500 ms, where the first three steps take 300 ms and the last step takes 200 ms. The complexity and the size of the mathematical expressions have a negligible impact on these numbers. Also, the time spent in Cantor for entries of other types is relatively small. So, for example, if you have a notebook with 20 different mathematical expressions and some entries of other types, Cantor will load the project in ca. 20*500ms=10s.

I reduced this chain to three elements by merging steps two and three. This was achieved by using pdflatex as the LaTeX engine, which produces a PDF file directly out of the TeX file. Furthermore, I replaced the Libspectre library with the Poppler PDF rendering library. This brought the overall time down to 330 ms, with the pdflatex process taking 300 ms and the rendering in Poppler (converting PDF to QImage) taking only 30 ms. With this, for our example notebook with 20 mathematical expressions mentioned above, the rendering takes only 6.6 seconds. In this new chain, the LaTeX process is the bottleneck and I’m looking for potential acceleration here, but until now I haven't found any "magic" parameters which would help to reduce the time spent in LaTeX rendering.

Despite this progress, loading somewhat bigger documents will still hurt in Cantor. For example, for a project with 100 formulas, opening the file will take ca. 33 seconds.
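The estimates above all follow from the same simple model: the expressions are rendered one after another, each at a roughly constant cost. A small Python sketch of this model, using the per-step timings quoted in this post (the function and constant names are mine):

```python
def sequential_load_seconds(expressions, ms_per_expression):
    """Estimated load time when expressions are rendered one after another."""
    return expressions * ms_per_expression / 1000

OLD_MS = 300 + 200   # latex + dvips, then Libspectre
NEW_MS = 300 + 30    # pdflatex, then Poppler

for count in (20, 100):
    old = sequential_load_seconds(count, OLD_MS)
    new = sequential_load_seconds(count, NEW_MS)
    print(f"{count} expressions: old {old:.1f} s, new {new:.1f} s")
```

The model ignores the time spent on other entry types, which is small compared to the math rendering.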

The problem here is that the rendering process is a blocking and non-parallelized operation - only one mathematical expression is processed at a time and the UI is blocked for the whole processing time. Obviously, this behaviour is unsatisfactory and under-utilizes modern multi-core CPUs. So I decided to run the rendering in many parallel tasks, asynchronously, without blocking the main GUI thread. Fortunately, Qt helps a lot here with its classes QThreadPool, which manages the pool of threads, and QRunnable, which provides an interface to define "tasks" that will be executed in parallel.

Now, when opening a project, for every Markdown entry containing mathematical expressions Cantor creates a render task for every expression, sends this task to the thread pool and continues with the processing of the entries in the document. Once such a task has finished, the result is shown in Cantor's worksheet. With this, a further good performance improvement can be achieved. Clearly, the more cores you have, the faster the processing will be. Of course, if your computer has only a small number of hardware threads, you won't notice a huge difference. But still, you should see an improvement compared to the old single-threaded implementation in Cantor.
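The pattern described here (one task per expression, a pool of worker threads, results handled as they arrive) can be sketched with Python's standard library. This is an analogy only, not the QThreadPool/QRunnable code in Cantor: `render` is a placeholder that sleeps instead of calling pdflatex and Poppler, and the worker count of 8 is an arbitrary choice.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def render(expression):
    """Stand-in for the slow pdflatex + Poppler step (here: a short sleep)."""
    time.sleep(0.05)
    return f"<image of {expression}>"

def render_all(expressions, workers=8):
    """Submit one task per expression and collect results as they finish."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(render, e): e for e in expressions}
        for future in as_completed(futures):
            # In a GUI this is the point where the finished image
            # would replace the placeholder in the worksheet.
            results[futures[future]] = future.result()
    return results

expressions = [f"x^{n}" for n in range(16)]
start = time.perf_counter()
images = render_all(expressions)
elapsed = time.perf_counter() - start
print(f"rendered {len(images)} expressions in {elapsed:.2f} s")
```

Rendered one by one, the 16 placeholder tasks would need about 0.8 seconds; with eight workers they should finish in a fraction of that, which mirrors why the gain depends on the number of available cores.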

For a notebook comparable in size to the "Rigid-body transformations in a plane (2D)" project, which has 109 mathematical expressions, loading the notebook takes a reasonable and acceptable time on my hardware (I have 8 physical cores in the CPU, which is why the render acceleration is so big). And, thanks to the asynchronous processing, the user can interact with the notebook even while the rendering of the mathematical expressions is still in progress.

Since my previous post, not only the math renderer has changed; there is also a huge change in the Markdown support - Cantor finally handles embedded math expressions, like $...$ and $$...$$ in accordance with the Markdown syntax. In the next blog post I'll describe how it works now.

Wednesday, July 10, 2019

New unit tests for the new code

Hello everyone,

today I want to present the test system for Cantor's worksheet.
The worksheet is the most central, prominent and important part of the application where the most work is done.

So, it is important to cover this part with enough tests to ensure the quality and stability of this component in future.

At the moment, this system contains only ten tests, and all of them cover only the import of Jupyter notebooks that was recently added to Cantor (I have mentioned these notebooks in my first post).
However, this test infrastructure is generic and can easily be used for testing Cantor's own worksheet files, too.

The test system checks that a worksheet/notebook file is loaded successfully, tests the backend type and validates the overall worksheet structure and the content of its entries.
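To give an idea of what such checks look like, here is a minimal Python sketch that works directly on the notebook JSON. It is not Cantor's C++ test code: the sample notebook is hand-written, `check_notebook` is a hypothetical helper, and reading the backend from `metadata.language_info.name` is my assumption about where to look.

```python
import json

# A tiny hand-written notebook in the Jupyter (nbformat 4) JSON layout.
NOTEBOOK_JSON = """
{
  "nbformat": 4,
  "nbformat_minor": 2,
  "metadata": {"language_info": {"name": "python"}},
  "cells": [
    {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
    {"cell_type": "code", "metadata": {}, "execution_count": 1,
     "source": ["print(2 + 2)"],
     "outputs": [{"output_type": "stream", "name": "stdout", "text": ["4\\n"]}]}
  ]
}
"""

def check_notebook(text, expected_language, expected_cells):
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    try:
        nb = json.loads(text)
    except json.JSONDecodeError as error:
        return [f"not loadable: {error}"]
    language = nb.get("metadata", {}).get("language_info", {}).get("name")
    if language != expected_language:
        problems.append(f"backend is {language!r}, expected {expected_language!r}")
    cells = nb.get("cells", [])
    if len(cells) != len(expected_cells):
        problems.append(f"{len(cells)} cells, expected {len(expected_cells)}")
    for index, (cell, (kind, source)) in enumerate(zip(cells, expected_cells)):
        if cell.get("cell_type") != kind:
            problems.append(f"cell {index}: type {cell.get('cell_type')!r}")
        if "".join(cell.get("source", [])) != source:
            problems.append(f"cell {index}: unexpected content")
    return problems

expected = [("markdown", "# Title"), ("code", "print(2 + 2)")]
print(check_notebook(NOTEBOOK_JSON, "python", expected))
```

The three groups of checks map onto the description above: the file loads, the backend matches, and the structure and content of the entries are as expected.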

Actually, some content is not validated, for example image content. Validating it would increase the complexity of the tests and slow down their execution without adding much value for quality assurance.

This new infrastructure has already proven to be helpful. When writing the first tests for the worksheet I found a couple of bugs in the implementation of the Jupyter notebook import. Having fixed them and having these additional barriers in place, I'm more confident about the implementation and can say with more certainty that the import of Jupyter notebooks works fine.

In the previous post I mentioned some issues with the performance of the renderer used for mathematical expressions in Cantor. It turned out this problem is not as easy to solve as I first assumed. But now, after having finished a substantial part of the work that was planned as part of this GSoC project, I can give more attention to the remaining problems, including this one with the performance of the renderer.
In the next post I plan to show a better realization of the math renderer in Cantor.

Saturday, June 22, 2019

Support for Jupyter notebooks has evolved in Cantor

Hello everyone, it's been almost a month since my last post and there are a lot of changes that have been done since then.

First, what I called the "minimal plan" is already done! Cantor can now load Jupyter notebooks and save the currently opened document in Jupyter format.

Below you can see how one of the Jupyter notebooks I'm using for test purposes (I have mentioned them in the previous post) looks in Jupyter and in Cantor.

As you can see, there aren't many differences in the representation of the content except for some minor differences in the rendering of the Markdown code.

For the comparison, I also prepared some previews of the same fragments of the notebooks, opened in Jupyter and in Cantor.
This is a fragment from the "Understanding evolutionary strategies and covariance matrix adaptation" notebook.

As the next example, we show a screenshot of the "A Reaction-Diffusion Equation Solver in Python with Numpy" notebook.

As the final example, we show a screenshot of the "Rigid-body transformations in a plane (2D)" notebook.

To be more detailed and concrete on what is currently supported in Cantor, below is the list of objects that can be imported:
  • Markdown cells
    • With mathematical expressions
    • With attachments
  • Code cells
    • With text (including error messages) and image results
  • Raw NBConvert cells
Cantor is able to handle almost all content specified by the Jupyter notebook format, except for some metadata about the notebook in general and about its cells, information about the used "kernel" (support for this will be added soon) and results of other types (for example LaTeX or HTML outputs), which are more difficult to implement because of the lack of good and complete documentation for them.
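For readers who haven't looked inside a notebook file: the objects listed above are plain JSON structures. The Python sketch below builds a small hand-written notebook (as a dict with the same layout as the JSON file) and counts the cell types, attachments and output types in it. The `inventory` helper and the sample content are made up for illustration; the image payloads are just short placeholder strings.

```python
from collections import Counter

# Hand-written notebook fragment as a Python dict (same layout as the JSON file).
notebook = {
    "nbformat": 4, "nbformat_minor": 2, "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {},
         "source": "Area: $\\pi r^2$ ![plot](attachment:plot.png)",
         "attachments": {"plot.png": {"image/png": "iVBORw0KGgo="}}},
        {"cell_type": "code", "metadata": {}, "execution_count": 1,
         "source": "1/0",
         "outputs": [{"output_type": "error", "ename": "ZeroDivisionError",
                      "evalue": "division by zero", "traceback": []}]},
        {"cell_type": "code", "metadata": {}, "execution_count": 2,
         "source": "plot()",
         "outputs": [{"output_type": "display_data", "metadata": {},
                      "data": {"image/png": "iVBORw0KGgo="}}]},
        {"cell_type": "raw", "metadata": {}, "source": "raw text"},
    ],
}

def inventory(nb):
    """Count cell types, attachments and output types in a notebook dict."""
    counts = Counter()
    for cell in nb["cells"]:
        counts[cell["cell_type"]] += 1
        counts["attachments"] += len(cell.get("attachments", {}))
        for output in cell.get("outputs", []):
            counts["output:" + output["output_type"]] += 1
    return counts

print(dict(inventory(notebook)))
```

A quick inventory like this is also a handy way to see which kinds of content a test notebook actually exercises.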

When saving the project in Jupyter's format, Cantor handles almost all of its native entry types like markdown entries, text entries, code entries and image entries. For the remaining "page break entry" in Cantor it is still to be worked out how to map this element to Jupyter's structures.

Despite the good progress made, there is still a lot of room and potential for improvement. Besides some technical issues that arise when importing another format and mapping its structure to the native structures of your application, which is quite natural for all applications I guess, there is currently also a problem with the performance of the renderer used for mathematical expressions in Cantor. Opening large documents (either in Cantor's native format or Jupyter notebooks) with a lot of formulas takes a considerable amount of time because of the bad renderer implementation in Cantor. This heavily influences the user experience and I plan to start working on this soon.

So, there is some work to be done before Cantor supports what I call the "maximum plan". By this I mean the ability to guarantee that the conversion between the two formats when opening or saving projects happens without any substantial loss of information relevant and critical for the consumption of the project file.
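One way to think about "no substantial loss of information" is as a round-trip check: load a notebook, save it again, and compare what came out with what went in. The Python sketch below shows the comparison part only; `lost_keys` is a hypothetical helper and the "lossy" copy merely simulates a converter that drops per-cell metadata. It is not how Cantor itself verifies this.

```python
import copy

def lost_keys(original, converted, path=""):
    """List dictionary keys present in `original` but missing in `converted`."""
    missing = []
    if isinstance(original, dict) and isinstance(converted, dict):
        for key, value in original.items():
            where = f"{path}/{key}"
            if key not in converted:
                missing.append(where)
            else:
                missing.extend(lost_keys(value, converted[key], where))
    elif isinstance(original, list) and isinstance(converted, list):
        for index, (a, b) in enumerate(zip(original, converted)):
            missing.extend(lost_keys(a, b, f"{path}[{index}]"))
    return missing

original = {
    "nbformat": 4,
    "metadata": {"kernelspec": {"name": "python3"}},
    "cells": [{"cell_type": "markdown", "source": "text",
               "metadata": {"collapsed": True}}],
}

# Simulate an importer/exporter pair that forgets per-cell metadata.
lossy = copy.deepcopy(original)
del lossy["cells"][0]["metadata"]

print(lost_keys(original, original))
print(lost_keys(original, lossy))
```

The sketch only reports missing keys; changed values or dropped list items would need additional comparisons.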

To achieve this, I now want to invest more into testing with more notebooks and closing the remaining gaps, but also into writing automatic tests covering this new functionality in Cantor. The latter are also important to prevent any kind of regressions introduced during bug fixing in the next weeks. This is something for the next week.

In the next post I plan to show a working test system and how Cantor passes its tests.

Tuesday, June 4, 2019

Hello everyone! I'm participating in Google Summer of Code 2019, working on the KDE Cantor project. The GSoC project is mentored by Alexander Semke - one of the core developers of LabPlot, Knights and Cantor. First, let me introduce you to Cantor and to my GSoC project:
Cantor is a KDE application providing a graphical interface to different open-source computer algebra systems and programming languages, like Octave, Maxima, Julia, Python etc. The main idea of this application is to provide one single, common and user-friendly interface for different systems instead of providing different GUIs for different systems. The details specific to the different languages are transparent to the end-user and are handled internally in the language specific parts of Cantor's code.
There is another project following this idea - the project Jupyter. As a result of its very big popularity, user base and the community around this project, there is a lot of content available for this project created and contributed by users from different scientific and educational areas, as documented in the gallery of interesting Jupyter Notebooks.
At the moment, Cantor has its own format for projects. Though this format is good enough to manage Cantor projects, there is not a lot of content created and published by Cantor users, and the user base is still not at the level this application deserves. Furthermore, sharing content stored in Cantor's native format requires Cantor on the target system, and Cantor is available for Linux only at the moment. All this complicates the attempts to make Cantor more popular and known to a broader user base. Adding the possibility to import/export Jupyter Notebook worksheets in Cantor will address the problems described above.
If you are interested in a more technical and detailed description of the project, you can check out my proposal.

Actually, it's not my first contribution to Cantor. I have been contributing to this project for roughly one year already. As a developer interested in C++, Qt and applications relevant for scientific purposes, I started to contribute to Cantor last year by working on smaller bug fixes first. With time and with more understanding of the overall architecture of Cantor I could work on bigger topics like new features, more complicated bug fixes and refactorings in the code, and this year I'm happy to contribute yet another big and very important piece of functionality to Cantor as part of GSoC.

To start, I selected a couple of well-structured Jupyter notebooks from a gallery of interesting Jupyter Notebooks. Those notebooks were selected based on three criteria:

  • they should be self-sufficient
  • they should contain commands and results of different types
  • they should have a reasonable size sufficient for testing the new code and for demoing the results
Below you can see the screenshots of the notebooks I decided to use:

The notebooks will be used for testing the functionality and also for showing the progress of this project, and in the final post I will summarize and report on Cantor being able to successfully process such files.

In the next post I plan to already show a working first version of the Jupyter importer.