view roundup/dehtml.py @ 5710:0b79bfcb3312

Add support for making an idempotent POST. This allows retrying a POST that was interrupted. It involves creating a post once only (poe) url /rest/data/<class>/@poe/<random_token>. This url acts the same as a POST to /rest/data/<class>. However, once the @poe url is used, it can't be used for a second POST.

To make these changes:

1) Take the body of post_collection into a new post_collection_inner function. Have post_collection call post_collection_inner.
2) Add a handler for POST to rest/data/class/@poe. This will return a unique POE url. By default the url expires after 30 minutes. The POE random token is only good for a specific user and is stored in the session db.
3) Add a handler for POST to rest/data/<class>/@poe/<random token>. The random token generated in 2 is validated for proper class (if token is not generic) and proper user and must not have expired. If everything is valid, call post_collection_inner to process the input and generate the new entry.

To make recognition of 2 stable (so it's not confused with rest/data/<:class_name>/<:item_id>), removed @ from Routing::url_to_regex. The current Routing.execute method stops on the first regular expression to match the URL. Since item_id doesn't accept a POST, I was sometimes getting 405 bad method. My guess is the order of the regular expressions is not stable, so sometimes I would get the right regexp for /data/<class>/@poe and sometimes I would get the one for /data/<class>/<item_id>. By removing the @ from url_to_regex, there was no way for the item_id case to match @poe.

There are alternate fixes we may need to look at. If a regexp matches but the method does not, return to the regexp matching loop in execute() looking for another match. Only once every possible match has failed should the code return a 405 method failure.
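The first alternate fix (keep scanning after a method mismatch, and return 405 only when no matching route accepts the method) could be sketched roughly as below. This is not the Routing.execute code; the route table, the regular expressions, the handler names and the dispatch function are all hypothetical and exist only to illustrate the loop.

```python
import re

# Hypothetical route table: (compiled pattern, allowed methods, handler name).
# The item pattern deliberately also matches "@poe", as in the problem
# described above.
ROUTES = [
    (re.compile(r"^/data/(?P<class_name>[^/]+)/(?P<item_id>[^/]+)$"),
     {"GET", "PUT", "DELETE", "PATCH"}, "item"),
    (re.compile(r"^/data/(?P<class_name>[^/]+)/@poe$"),
     {"POST"}, "get_poe"),
]


def dispatch(method, path, routes=ROUTES):
    """Return (status, handler_name) for a request.

    A route whose pattern matches but whose methods do not include
    `method` is remembered, and the scan continues with the next route.
    """
    path_matched = False
    for pattern, methods, handler in routes:
        if pattern.match(path) is None:
            continue
        path_matched = True
        if method in methods:
            return 200, handler
        # Pattern matched but the method did not: keep looking.
    if path_matched:
        # Every matching route rejected the method.
        return 405, None
    return 404, None
```

With this loop the result no longer depends on the order of the routes: a POST to /data/issue/@poe reaches the hypothetical get_poe handler whether the item route is tried first or last.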
Another fix is to implement a more sophisticated mechanism so that @Routing.route("/data/<:class_name>/<:item_id>/<:attr_name>", 'PATCH') has different regexps for matching <:class_name>, <:item_id> and <:attr_name>. Currently the regexp specified by url_to_regex is used for every component.

Other fixes: Made failure to find any props in props_from_args return an empty dict rather than throwing an unhandled error. Make __init__ for SimulateFieldStorageFromJson handle an empty json doc. Useful for POSTing to rest/data/class/@poe with an empty document.

Testing: added testPostPOE to test/rest_common.py that I think covers all the code that was added.

Documentation: Add doc to rest.txt in the "Client API" section titled "Safely Re-sending POST". Move existing section "Adding new rest endpoints" in "Client API" to a new second level section called "Programming the REST API". Also a minor change to the simple rest client moving the header setting to continuation lines rather than showing one long line.
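The token rules from steps 2 and 3 of the commit message (bound to one user, optionally bound to one class, expiring after 30 minutes by default, usable once) can be illustrated with a small in-memory sketch. This is not the code in this changeset: PoeStore, issue, redeem and the injectable clock are hypothetical names, and a plain dict stands in for the session db. Whether a failed validation should also consume the token is not stated in the commit message; the sketch assumes it does not.

```python
import secrets
import time

# 30 minutes, the default lifetime given in the commit message.
POE_LIFETIME = 30 * 60


class PoeStore:
    """Hypothetical in-memory stand-in for the session db."""

    def __init__(self, clock=time.time):
        self._tokens = {}
        self._clock = clock  # injectable so expiry can be tested

    def issue(self, user, class_name=None, lifetime=POE_LIFETIME):
        """Create a token for `user`; class_name=None means generic."""
        token = secrets.token_urlsafe(32)
        self._tokens[token] = (user, class_name, self._clock() + lifetime)
        return token

    def redeem(self, token, user, class_name):
        """Return True exactly once for a valid, unexpired token."""
        entry = self._tokens.get(token)
        if entry is None:
            return False  # unknown or already used
        owner, token_class, expires = entry
        if self._clock() >= expires:
            del self._tokens[token]
            return False  # expired
        if owner != user:
            return False  # token belongs to another user
        if token_class is not None and token_class != class_name:
            return False  # token was issued for a different class
        del self._tokens[token]  # single use: a second POST must fail
        return True
```

A handler for the @poe/<random token> url would call something like redeem() first and only process the POST body when it returns True.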
author John Rouillard <rouilj@ieee.org>
date Sun, 14 Apr 2019 21:07:11 -0400
parents c749d6795bc2
children b74f0b50bef1


from __future__ import print_function
from roundup.anypy.strings import u2s, uchr


class dehtml:
    def __init__(self, converter):
        if converter == "none":
            self.html2text = None
            return

        try:
            if converter == "beautifulsoup":
                # Not as well tested as dehtml.
                from bs4 import BeautifulSoup
                def html2text(html):
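                    # name the parser explicitly so the result does not
                    # depend on which parser libraries are installed
                    soup = BeautifulSoup(html, "html.parser")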

                    # kill all script and style elements
                    for script in soup(["script", "style"]):
                        script.extract()

                    return u2s(soup.get_text('\n', strip=True))

                self.html2text = html2text
            else:
                raise ImportError  # use the fallback parser below
        except ImportError:
            # use the fallback below if beautiful soup is not installed.
            try:
                # Python 3+.
                from html.parser import HTMLParser
                from html.entities import name2codepoint
            except ImportError:
                # Python 2.
                from HTMLParser import HTMLParser
                from htmlentitydefs import name2codepoint

            class DumbHTMLParser(HTMLParser):
                # class attribute
                text = ""

                # internal state variable
                _skip_data = False
                _last_empty = False

                def handle_data(self, data):
                    if self._skip_data:  # skip data if in script or style block
                        return

                    if data.strip() == "":
                        # reduce multiple blank lines to 1
                        if self._last_empty:
                            return
                        self._last_empty = True
                    else:
                        self._last_empty = False

                    self.text = self.text + data

                def handle_starttag(self, tag, attrs):
                    if tag == "p":
                        # start each paragraph on a new line
                        self.text = self.text + "\n"
                    if tag in ("style", "script"):
                        self._skip_data = True

                def handle_endtag(self, tag):
                    if tag in ("style", "script"):
                        self._skip_data = False

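                def handle_entityref(self, name):
                    if self._skip_data:
                        return
                    try:
                        c = uchr(name2codepoint[name])
                    except KeyError:
                        # unknown entity name: keep the reference as written
                        c = "&%s;" % name
                    try:
                        self.text = self.text + c
                    except UnicodeEncodeError:
                        # use a space as a placeholder
                        self.text = self.text + " "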

            def html2text(html):
                parser = DumbHTMLParser()
                parser.feed(html)
                parser.close()
                return parser.text

            self.html2text = html2text

if "__main__" == __name__:
    html='''
<body>
<script>
this must not be in output
</script>
<style>
p {display:block}
</style>
    <div class="header"><h1>Roundup</h1>
        <div id="searchbox" style="display: none">
          <form class="search" action="../search.html" method="get">
            <input type="text" name="q" size="18" />
            <input type="submit" value="Search" />
            <input type="hidden" name="check_keywords" value="yes" />
            <input type="hidden" name="area" value="default" />
          </form>
        </div>
        <script type="text/javascript">$('#searchbox').show(0);</script>
    </div>
       <ul class="current">
<li class="toctree-l1"><a class="reference internal" href="../index.html">Home</a></li>
<li class="toctree-l1"><a class="reference external" href="http://pypi.python.org/pypi/roundup">Download</a></li>
<li class="toctree-l1 current"><a class="reference internal" href="../docs.html">Docs</a><ul class="current">
<li class="toctree-l2"><a class="reference internal" href="features.html">Roundup Features</a></li>
<li class="toctree-l2 current"><a class="current reference internal" href="">Installing Roundup</a></li>
<li class="toctree-l2"><a class="reference internal" href="upgrading.html">Upgrading to newer versions of Roundup</a></li>
<li class="toctree-l2"><a class="reference internal" href="FAQ.html">Roundup FAQ</a></li>
<li class="toctree-l2"><a class="reference internal" href="user_guide.html">User Guide</a></li>
<li class="toctree-l2"><a class="reference internal" href="customizing.html">Customising Roundup</a></li>
<li class="toctree-l2"><a class="reference internal" href="admin_guide.html">Administration Guide</a></li>
</ul>
<div class="section" id="prerequisites">
<h2><a class="toc-backref" href="#id5">Prerequisites</a></h2>
<p>Roundup requires Python 2.5 or newer (but not Python 3) with a functioning
anydbm module. Download the latest version from <a class="reference external" href="http://www.python.org/">http://www.python.org/</a>.
It is highly recommended that users install the latest patch version
of python as these contain many fixes to serious bugs.</p>
<p>Some variants of Linux will need an additional &#8220;python dev&#8221; package
installed for Roundup installation to work. Debian and derivatives, are
known to require this.</p>
<p>If you&#8217;re on windows, you will either need to be using the ActiveState python
distribution (at <a class="reference external" href="http://www.activestate.com/Products/ActivePython/">http://www.activestate.com/Products/ActivePython/</a>), or you&#8217;ll
have to install the win32all package separately (get it from
<a class="reference external" href="http://starship.python.net/crew/mhammond/win32/">http://starship.python.net/crew/mhammond/win32/</a>).</p>
</div>
</body>
'''

    html2text = dehtml("dehtml").html2text
    if html2text:
        print(html2text(html))

    try:
        # trap error seen if N_TOKENS not defined when run.
        html2text = dehtml("beautifulsoup").html2text
        if html2text:
            print(html2text(html))
    except NameError as e:
        print("captured error %s" % e)

    html2text = dehtml("none").html2text
    if html2text:
        print("FAIL: Error, dehtml(none) is returning a function")
    else:
        print("PASS: dehtml(none) is returning None")


