view test/html_norm.py @ 7853:03c1b7ae3a68

issue2551328/issue2551264 unneeded next link and total_count incorrect Fix: issue2551328 - REST results show next link if number of results is a multiple of page size. (Found by members of team 3 in the UMass-Boston CS682 Spring 2024 class.) issue2551264 - REST X-Total-Count header and @total_size count incorrect when paginated These issues arose because we retrieved the exact number of rows from the database as requested by the user using the @page_size parameter. With this changeset, we retrieve up to 10 million + 1 rows from the database. If the total number of rows exceeds 10 million, we set the total_count indicators to -1 as an invalid size. (The max number of requested rows (default 10 million +1) can be modified by the admin through interfaces.py.) By retrieving more data than necessary, we can calculate the total count by adding @page_index*@page_size to the number of rows returned by the query. Furthermore, since we return more than @page_size rows, we can determine the existence of a row at @page_size+1 and use that information to determine if a next link should be provided. Previously, a next link was returned if @page_size rows were retrieved. This change does not guarantee that the user will get @page_size rows returned. Access policy filtering occurs after the rows are returned, and discards rows inaccessible by the user. Using the current @page_index/@page_size it would be difficult to have the roundup code refetch data and make sure that a full @page_size set of rows is returned. E.G. @page_size=100 and 5 of them are dropped due to access restrictions. We then fetch 10 items and add items 1-4 and 6 (5 is inaccessible). There is no way to calculate the new database offset at: @page_index*@page_size + 6 from the URL. We would need to add an @page_offset=6 or something. This could work since the client isn't adding 1 to @page_index to get the next page. Thanks to HATEOAS, the client just uses the 'next' url. But I am not going to cross that bridge without a concrete use case. This can also be handled client side by merging a short response with the next response and re-paginating client side. Also added extra index markers to the docs to highlight use of interfaces.py.
author John Rouillard <rouilj@ieee.org>
date Mon, 01 Apr 2024 09:57:16 -0400
parents 5cadcaa13bed
children
line wrap: on
line source

"""Minimal html parser/normalizer for use in test_templating.

When testing markdown -> html conversion libraries, there are
gratuitous whitespace changes in generated output that break the
tests. Use this to try to normalize the generated HTML into something
that tries to preserve the semantic meaning allowing tests to stop
breaking.

This is not a complete parsing engine. It supports the Roundup issue
tracker unit tests so that no third party libraries are needed to run
the tests. If you find it useful enjoy.

Ideally this would be done by hijacking in some way
lxml.html.usedoctest to get a liberal parser that will ignore
whitespace. But that means the user has to install lxml to run the
tests. Similarly BeautifulSoup could be used to pretty print the html
but again, BeautifulSoup would need to be installed to run the
tests.

"""
try:
    from html.parser import HTMLParser
except ImportError:
    from HTMLParser import HTMLParser  # python2

try:
    from htmlentitydefs import name2codepoint
except ImportError:
    pass  # assume running under python3, name2codepoint predefined


class NormalizingHtmlParser(HTMLParser):
    """Handle start/end tags and normalize whitespace in data.
    Strip doctype, comments when passed in.

    Implements normalize method that takes input html and returns a
    normalized string leaving the instance ready for another call to
    normalize for another string.


    Note that using this rewrites all attributes parsed by HTMLParser
    into attr="value" form even though HTMLParser accepts other
    attribute specification forms.
    """

    debug = False  # set to true to enable more verbose output

    current_normalized_string = ""  # accumulate result string
    preserve_data = False           # if inside pre preserve whitespace

    def handle_starttag(self, tag, attrs):
        """put tag on new line with attributes.
           Note valid attributes according to HTMLParser:
              attrs='single_quote'
              attrs=noquote
              attrs="double_quote"
        """
        if self.debug: print("Start tag:", tag)

        self.current_normalized_string += "\n<%s" % tag

        for attr in attrs:
            if self.debug: print("     attr:", attr)
            self.current_normalized_string += ' %s="%s"' % attr

        self.current_normalized_string += ">\n"

        if tag == 'pre':
            self.preserve_data = True

    def handle_endtag(self, tag):
        if self.debug: print("End tag  :", tag)

        self.current_normalized_string += "\n</%s>" % tag

        if tag == 'pre':
            self.preserve_data = False

    def handle_data(self, data):
        if self.debug: print("Data     :", data)
        if not self.preserve_data:
            # normalize whitespace remove leading/trailing
            data = " ".join(data.strip().split())

        if data:
            self.current_normalized_string += "%s" % data

    def handle_comment(self, data):
        print("Comment  :", data)

    def handle_decl(self, data):
        print("Decl     :", data)

    def reset(self):
        """wrapper around reset with clearing of csef.current_normalized_string
           and reset of self.preserve_data
        """
        HTMLParser.reset(self)
        self.current_normalized_string = ""
        self.preserve_data = False

    def normalize(self, html):
        self.feed(html)
        result = self.current_normalized_string
        self.reset()
        return result


if __name__ == "__main__":
    parser = NormalizingHtmlParser()

    parser.feed('<div class="markup"><p> paragraph   text with whitespace\n  and more space  <pre><span class="f" data-attr="f">text more text</span></pre></div>')
    print("\n\ntest1", parser.current_normalized_string)

    parser.reset()

    parser.feed('''<div class="markup">
       <p> paragraph   text with whitespace\n  and more space  
       <pre><span class="f" data-attr="f">text \n more text</span></pre>
    </div>''')
    print("\n\ntest2", parser.current_normalized_string)
    parser.reset()
    print("\n\nnormalize", parser.normalize('''<div class="markup">
       <p> paragraph   text with whitespace\n  and more space  
       <pre><span class="f" data-attr="f">text \n more text &lt;</span></pre>
    </div>'''))

Roundup Issue Tracker: http://roundup-tracker.org/