SanketDG

Home Projects About RSS

Summer of Code Week 1

So as I said in my previous blog post, I am working with coala for language independent documentation extraction for this year’s Google Summer of Code.

It has been one week since the coding period has started, and there has been some work done! I would like to explain some stuff before we get started on the real work.

So, my project deals with language independent documentation extraction. Turns out, documentation isn’t that independent of the language. Most programming languages don’t have an official documentation specification. But It could be said that documentation is independent of the documentation standard (hereby referred to as docstyle) it uses.

I have to extract parts/metadata from the documentation like descriptions, parameters and their descriptions, return descriptions and perform various analyzing routines on this parsed metadata.

Most of my work is with the DocumentationComment class, where I have to implement routines for each language/docstyle. I started out with python first because of two reasons:

  • Its my favourite programming language (Duh!)
  • coala is written in python! (Duh again!)

So python has its own docstyle, that is known as “docstrings”, and they are clearly defined in PEP 257. Note that PEP 257 is just a general styleguide on how to write docstrings.

The PEP contains conventions, not laws or syntax

It is not a specifictaion.

Several documentation tools support compiling these docstrings into API documentation like Sphinx and Doxygen. I aim to support both of them.

So, I have come up with the following signature for DocumentationComment:

DocumentationComment(documentation, language, docstyle, indent, marker, range)

Now let’s say doc is an instance of DocumentationComment. doc would have a function named parse_documentation() that would do the parsing and get the metadata. So if I have a function with a docstring:

>>> def best_docstring(param1, param2, param3):
        """
        This is the best docstring ever!

        :param param1:
            Very Very Long Parameter description.
        :param param2:
            Short Param description.

        :return: Long Return Description That Makes No Sense And Will
                 Cut to the Next Line.
        """
        return None

And I load this into the DocumentationComment class and then apply the parsing:

>>> from coalib.bearlib.languages.documentation.DocumentationComment import (
    DocumentationComment)
>>> doc = DocumentationComment(best_docstring.__doc__,
                           language="python",
                           docstyle="default")
>>> docdata = doc.parse_documentation()

Note: Not all parameters are required for instantation.

Now printing repr(docdata) would print:

[Desc(desc='\nThis is the best docstring ever!\n'),
Param(name='param1:', desc='    Very Very Long Parameter description.\n'),
Param(name='param2:', desc='    Short Param description.\n'),
RetVal(desc='Long Return Description That Makes No Sense'
            ' And Will\n         Cut to the Next Line.\n')]

You may ask about the strange formatting. That is because it retains the exact formatting, as displayed in the docstring. This is important, because whatever analyzing routines I run, I should always be able to “assemble” back to the original docstring.

That’s it! This was my milestone for week 1, to parse and extract metadata out of python docstrings! I have already started developing a simple Bear, that I will talk about later this week.

PS: I would really like to thank my mentor Mischa Krüger for his thoughts on the API design and for doing reviews on my ugly code. :P