Sven R. Kunze<br />
<b>UI Changes of Ubuntu 17.10</b> (2017-11-02)<br />
<h2>
A Timeline Of Discovery</h2>
<ol>
<li>Dist-upgrade went through without an issue. <b><span style="color: #38761d;">Check!<span style="color: #6aa84f;"> ✓</span></span></b></li>
<li>Up-to-date packages installed. They seem to work just fine. <b><span style="color: #38761d;">Check!<span style="color: #6aa84f;"> ✓</span></span></b></li>
<li>Old stuff removed. <b><span style="color: #38761d;">Check!<span style="color: #6aa84f;"> ✓</span></span></b></li>
<li>Internals basically work the same except for some minor systemd changes. <b><span style="color: #38761d;">Check!<span style="color: #6aa84f;"> ✓</span></span></b></li>
<li>A lot of UI changes. <span style="color: red;"><b>π</b></span></li>
</ol>
<h2>
UI Changes</h2>
<ol>
<li>
First things first: logging in now requires an additional keystroke (the Enter key) to reveal the user's login form. <span style="color: red;"><b>Meeeh!<a name='more'></a></b></span></li>
<li>Starting applications from dock results in strange icon animation. <b style="color: red;">Meeeh! π</b><b style="color: red;"><b style="color: red;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiupSwlWUSsRKAAoK9KV9eIOTuj7VcOzOKkrLoa_fKL4RYkOaE2ey9utka5O0jbFlov3dlo0mjQzpLVCEs0qHK6UDuh0E6xpCGeDwFxrkOuNH2-4YhdB94p-D7G_bKIQUpIQPTy5msJ-6w/s1600/appstart1.png" imageanchor="1"><img border="0" height="133" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiupSwlWUSsRKAAoK9KV9eIOTuj7VcOzOKkrLoa_fKL4RYkOaE2ey9utka5O0jbFlov3dlo0mjQzpLVCEs0qHK6UDuh0E6xpCGeDwFxrkOuNH2-4YhdB94p-D7G_bKIQUpIQPTy5msJ-6w/s200/appstart1.png" width="200" /></a></b><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbz4OZ-q50RNANwr4WntDOyZBILBzgp0kLcxuGPGF-98bo5oU8fYuLBifJP-SAT7hXTRE6a1qslJAyYmOhI8wjsNQ_blAVeyy2D3FToIlILuRjaZkS7lHtza_zBX9KYzeEhsTSMh45ar0/s1600/appstart2.png" imageanchor="1"><img border="0" height="185" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbz4OZ-q50RNANwr4WntDOyZBILBzgp0kLcxuGPGF-98bo5oU8fYuLBifJP-SAT7hXTRE6a1qslJAyYmOhI8wjsNQ_blAVeyy2D3FToIlILuRjaZkS7lHtza_zBX9KYzeEhsTSMh45ar0/s200/appstart2.png" width="200" /></a></b></li>
<li>Pinning applications is not called “pinning” anymore but “Add to Favorites”. Meeeh for beginners - pinning is a well-known concept!</li>
<li>Missing Desktop Link to Desktop in Files. <b><span style="color: red;">Meeeh!</span></b><b style="color: red;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjWD13n6MNbBiloZXnmHyShjw-y-TE6bg__gTZLJrB9GbIwVqJ65RX47BYnFCUY6cq4QHzPezeOqfyxiNrHjs2psXVr0xLZDS9Vllx8K0vK5_E3xuQ4uOViIurUUyW9uW19oSkZPm8CBPg/s1600/missing-desktop.png" imageanchor="1"><img border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjWD13n6MNbBiloZXnmHyShjw-y-TE6bg__gTZLJrB9GbIwVqJ65RX47BYnFCUY6cq4QHzPezeOqfyxiNrHjs2psXVr0xLZDS9Vllx8K0vK5_E3xuQ4uOViIurUUyW9uW19oSkZPm8CBPg/s200/missing-desktop.png" width="126" /></a></b></li>
<li>No hot corners for closing a maximized application with the mouse. <b style="color: red;">Meeeh!</b></li>
<li>
<div style="line-height: 100%; margin-bottom: 0in;">
The trash is not in the dock but pollutes the desktop instead, so I disabled the trash icon altogether. <b style="color: red;">Meeeh!</b></div>
</li>
<li>
<div style="line-height: 100%; margin-bottom: 0in;">
Top bar does
not merge with window bar when maximized β wasting a lot of space. <b style="color: red;">Meeeh! π<br /><br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhy_Xtn9ZLaercUqy-PU9un33N9H062O4mCvDomonSX03SzMKn04DIr1eQc1MGM1D6d2U5hcMGbsInheVXsLf9yURGPMAbZW3sVEKfxeM5OaCohPVyQYG5Drh_xLSkxi93gOcvflmiW3K0/s1600/top-window-not-merging.png" imageanchor="1" style="font-weight: 400;"><img border="0" height="25" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhy_Xtn9ZLaercUqy-PU9un33N9H062O4mCvDomonSX03SzMKn04DIr1eQc1MGM1D6d2U5hcMGbsInheVXsLf9yURGPMAbZW3sVEKfxeM5OaCohPVyQYG5Drh_xLSkxi93gOcvflmiW3K0/s400/top-window-not-merging.png" width="400" /></a></b></div>
<div style="line-height: 100%; margin-bottom: 0in;">
</div>
</li>
<li>
<div style="line-height: 100%; margin-bottom: 0in;">
Ancient-looking
icons with random arrows on the top right. <b style="color: red;">Meeeh! π<br /><br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiT3gRKKJ6aBkrxdFcHGsbpIKUYV1oUs852HI6pswgYqHmXgvKKF4VUI4E3M1LGu7PGc8Hde-6Xz9bKA-7RXxIwRB7jIkG0fNJ4pnvx_Hr3WyD2n3TMJdf7GXrWPSR_wYZuhpB4h2u_Cf4/s1600/top-right.png" imageanchor="1"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiT3gRKKJ6aBkrxdFcHGsbpIKUYV1oUs852HI6pswgYqHmXgvKKF4VUI4E3M1LGu7PGc8Hde-6Xz9bKA-7RXxIwRB7jIkG0fNJ4pnvx_Hr3WyD2n3TMJdf7GXrWPSR_wYZuhpB4h2u_Cf4/s400/top-right.png" /></a></b></div>
<div style="line-height: 100%; margin-bottom: 0in;">
</div>
</li>
<li>
<div style="line-height: 100%; margin-bottom: 0in;">
Aggregation
of unrelated concepts, makes it harder to find right quick settings
and requires a second click to open menus with arrows. <b style="color: red;">Meeeh! π<br /><br /><br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjpZyH0XLeeXdQaqlujgNFu-qdngZC8ULkcq3YdzhTDHq3B9zNNFRDlkblgGWkYsQq2HX167ttWJ93zq8Z-CeeHf4P35ArcldzfKCQ9DBr7S0xwk-TeHYpTl0oAGC7RYJzCdPsWQudB_Rs/s1600/quick-settings.png" imageanchor="1"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjpZyH0XLeeXdQaqlujgNFu-qdngZC8ULkcq3YdzhTDHq3B9zNNFRDlkblgGWkYsQq2HX167ttWJ93zq8Z-CeeHf4P35ArcldzfKCQ9DBr7S0xwk-TeHYpTl0oAGC7RYJzCdPsWQudB_Rs/s320/quick-settings.png" width="275" /></a></b></div>
<div style="line-height: 100%; margin-bottom: 0in;">
</div>
</li>
<li>
<div style="line-height: 100%; margin-bottom: 0in;">
The clock sits in the middle of the top bar. Not an issue in itself, but it is a reason for issues 5
and 7, so meeeh!</div>
<div style="line-height: 100%; margin-bottom: 0in;">
</div>
</li>
<li>
<div style="line-height: 100%; margin-bottom: 0in;">
Space-wasting
buttons on the top-left corner, another reason for issues 5 and 7. <b style="color: red;">Meeeh! π<br /><br /><b style="color: red;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhCnzN7f4OLadR3bRrAocgw7qE57OdTw1_Qne099GQt5dLiu3mZejXpRtg88WhApfd5F9IXNGZLUuHuUgGy-LKZKVkrnE8qgOXN8RrMrpKr7P0RF3qBPetrl5gs0JbG-aJfHawFR5gUaPs/s1600/top-left.png" imageanchor="1"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhCnzN7f4OLadR3bRrAocgw7qE57OdTw1_Qne099GQt5dLiu3mZejXpRtg88WhApfd5F9IXNGZLUuHuUgGy-LKZKVkrnE8qgOXN8RrMrpKr7P0RF3qBPetrl5gs0JbG-aJfHawFR5gUaPs/s400/top-left.png" /></a></b></b></div>
<div style="line-height: 100%; margin-bottom: 0in;">
</div>
</li>
<li>
<div style="line-height: 100%; margin-bottom: 0in;">
Apps-floating-around
animation when clicking on the Sudoku icon on the bottom left. Like
kindergarden, meeeh! <b style="color: red;">π<br /><br /><b style="color: red;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg29ILo7344Sh6__PbkRpi7WZdwz9XPSjsoFT8xDZ61H0MXxcH6iRgaFvWgMupdKfbHAyux8cWl_ymaK8RGkIdTww1ZE7M2SGXgobDwLVGPm3QefuXwymkyfUu1YwdObDgHUaXSWBbRbKk/s1600/sudoku1.png" imageanchor="1"><img border="0" height="120" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg29ILo7344Sh6__PbkRpi7WZdwz9XPSjsoFT8xDZ61H0MXxcH6iRgaFvWgMupdKfbHAyux8cWl_ymaK8RGkIdTww1ZE7M2SGXgobDwLVGPm3QefuXwymkyfUu1YwdObDgHUaXSWBbRbKk/s200/sudoku1.png" width="200" /></a></b><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiTdjvFqO_ruyl8gWtTVMWFnzrWAMyvBNnFAnWouxnG_efXr1Yy7ziJOsDEpj9mSdcOYFe5Yr3AymrlbbEJWDBGnI_GfLIZ4YQbT-jcU8lomt_RXVV3wvKPO1XNX6RUejegjta6_UEPEZk/s1600/sudoku2.png" imageanchor="1"><img border="0" height="112" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiTdjvFqO_ruyl8gWtTVMWFnzrWAMyvBNnFAnWouxnG_efXr1Yy7ziJOsDEpj9mSdcOYFe5Yr3AymrlbbEJWDBGnI_GfLIZ4YQbT-jcU8lomt_RXVV3wvKPO1XNX6RUejegjta6_UEPEZk/s200/sudoku2.png" width="200" /></a><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiQhyYeeES4mt5wTd3_azbfcwfD_nQ8p9nYrMlAyxo4VIPnLlm2dq0DNcK2lBtOPD2xs5lPZjzkOfSAQTj_HGarZWMewv9Wr9XjlFf-7RkeXbsDJiFTZMapQ-UROoos4IRZZaNATvIzJGU/s1600/sudoku3.png" imageanchor="1"><img border="0" height="185" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiQhyYeeES4mt5wTd3_azbfcwfD_nQ8p9nYrMlAyxo4VIPnLlm2dq0DNcK2lBtOPD2xs5lPZjzkOfSAQTj_HGarZWMewv9Wr9XjlFf-7RkeXbsDJiFTZMapQ-UROoos4IRZZaNATvIzJGU/s320/sudoku3.png" width="320" /></a></b></div>
<div style="line-height: 100%; margin-bottom: 0in;">
</div>
<div style="line-height: 100%; margin-bottom: 0in;">
</div>
</li>
<li>
<div style="line-height: 100%; margin-bottom: 0in;">
No known
shortcut for the Sudoku icon. <b style="color: red;">Meeeh!</b></div>
</li>
<li>
<div style="line-height: 100%; margin-bottom: 0in;">
The Settings app no longer shows all settings at once and now needs scrolling -
wasted space again. <b style="color: red;">Meeeh!</b></div>
</li>
<li>
<div style="line-height: 100%; margin-bottom: 0in;">
Single
settings dialog looks clean. <b><span style="color: #38761d;">Nice!<span style="color: #38761d;"><span style="color: #6aa84f;"> ✓</span></span></span></b></div>
</li>
<li>
<div style="line-height: 100%; margin-bottom: 0in;">
Super-key search doesn't work reliably - it depends on the mouse position (windows steal focus). Super annoying. <b style="color: red;">Meeeh!</b></div>
</li>
<li>
<div style="line-height: 100%; margin-bottom: 0in;">
Super-key search is not as forgiving of typos as in former Ubuntu versions (e.g. geidt -&gt; gedit). <b style="color: red;">Meeeh!</b></div>
</li>
<li>
<div style="line-height: 100%; margin-bottom: 0in;">
The lock screen image disappears when entering the password, leaving a gray-noise background. Looks 10 years older! <b><span style="color: red;">Meeeh!</span></b></div>
<div style="line-height: 100%; margin-bottom: 0in;">
</div>
</li>
<li>
<div style="line-height: 100%; margin-bottom: 0in;">
Long-pressing
the super key does not reveal the dock numbers of applications. <b style="color: red;">Meeeh!</b></div>
</li>
<li>
<div style="line-height: 100%; margin-bottom: 0in;">
Long-pressing
the super key does not reveal the system shortcuts. <b style="color: red;">Meeeh!</b></div>
<div style="line-height: 100%; margin-bottom: 0in;">
</div>
</li>
<li>
<div style="line-height: 100%; margin-bottom: 0in;">
While preparing this post: screenshots aren't copied to the <b>clipboard</b> by default but stored in Pictures as image files. I couldn't find a setting for this so far. <b style="color: red;">Meeeh!</b></div>
</li>
<li>
<div style="line-height: 100%; margin-bottom: 0in;">
I don't get
the conceptual difference (and thus the intended usage) between the Sudoku icon
search and the Super-key search. Both let you search for and start
applications, but they look different for unknown reasons.</div>
</li>
<li>Plugged in my phone. Why isn't it in the dock? I discovered it after closing all apps (and finished the task). Hmm. <b style="color: red;">π<br /><br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_abn58TCyzgkrwF10E5P8wXyHNoZtqb23uLI65e-0ZYYuhcyRF6M5oaZLe3Bk2w2BDI2jMYD8GcNFkpimiTvO_5a00fO7cm59zM1ekgq1BAZhYufrreKHoU24KJwSIG70onR9sIKHMGE/s1600/Screenshot+from+2017-10-30+11-36-22.png" imageanchor="1"><img border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_abn58TCyzgkrwF10E5P8wXyHNoZtqb23uLI65e-0ZYYuhcyRF6M5oaZLe3Bk2w2BDI2jMYD8GcNFkpimiTvO_5a00fO7cm59zM1ekgq1BAZhYufrreKHoU24KJwSIG70onR9sIKHMGE/s200/Screenshot+from+2017-10-30+11-36-22.png" width="120" /></a></b></li>
</ol>
<div style="line-height: 100%; margin-bottom: 0in;">
<h2>
Conclusion</h2>
</div>
<ul>
<li><div style="line-height: 100%; margin-bottom: 0in;">
Upgrade from
17.04 works fine!
</div>
</li>
<li>
<div style="line-height: 100%; margin-bottom: 0in;">
Internals and
drivers work fine as well!</div>
</li>
<li>
<div style="line-height: 100%; margin-bottom: 0in;">
The UI has
changed a lot, and for a desktop OS graphical design is important:</div>
<ul>
<li>
<div style="line-height: 100%; margin-bottom: 0in;">
wasting a
lot of space - especially bad for smaller screens such as the
laptop I use to write this post</div>
</li>
<li>
<div style="line-height: 100%; margin-bottom: 0in;">
aged look
:-(</div>
</li>
<li>
<div style="line-height: 100%; margin-bottom: 0in;">
a lot of
useless bells and whistles</div>
</li>
</ul>
</li>
</ul>
<div style="line-height: 100%; margin-bottom: 0in;">
Despite all the meeehs, thanks a lot, Ubuntu team, for the hard work you put into this
release. I am very eager to see the next iteration of this product!<br />
<br />
Cheers,<br />
Sven<br />
<br />
<b>PS</b>: you might wonder why I nitpick about so many tiny details of the new Gnome-based UI. I care because I build my <b>workflows</b> around these tiny details, and for some of them I now have no viable alternative (e.g. the clipboard one, issue 21) or no time-efficient one (hot corners for closing apps by mouse).</div>
<br />
<b>Drinking Games with PostgreSQL: GIN and RUM</b> (2017-10-28)<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhesov52Y5a5DuKO5mkNxQlIdRLcQWXNt-xbBzol1YLQU7QKIqSpvKC7vhPR0iRpLcZq0I7mRwEZ2v-nbI3pCWYq7dd9yyUny-pmaCNq_num7elR2gGvNwbEJAEApMAalk1txEoOZmJPEg/s1600/whiskey.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="332" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhesov52Y5a5DuKO5mkNxQlIdRLcQWXNt-xbBzol1YLQU7QKIqSpvKC7vhPR0iRpLcZq0I7mRwEZ2v-nbI3pCWYq7dd9yyUny-pmaCNq_num7elR2gGvNwbEJAEApMAalk1txEoOZmJPEg/s640/whiskey.png" width="640" /></a></div>
<br />
Fulltext support is a very important feature of most modern relational databases. For one, it enables fast <a href="https://en.wikipedia.org/wiki/Information_retrieval">information retrieval</a>, and for another, it simply allows application designers and operators to remain inside the known relational world. No need for a second database-like system, no need for additional maintenance, no need for yet another library, no need for a fundamentally different query language. You get the point.<br />
<br />
Starting with version 8.3, PostgreSQL supports the <a href="https://www.postgresql.org/docs/9.6/static/datatype-textsearch.html">tsvector and tsquery</a> constructs and, with them, index support for the match operator. This post covers more of that indexing technology, which we can use to accelerate regular expressions, LIKE queries, fulltext and even JSON-related queries.<br />
<a name='more'></a><b>Juniper-Flavored Spirit</b><br />
<br />
<b>GIN</b> stands for Generalized Inverted Index and has been the de facto standard index since PostgreSQL 8.3 for indexing <a href="https://www.postgresql.org/docs/9.6/static/datatype-textsearch.html">tsvectors</a>, thus accelerating subsequent <a href="https://www.postgresql.org/docs/9.6/static/datatype-textsearch.html">tsqueries</a>.<br />
<br />
Here is how it works:<br />
<ol>
<li>a document (basically text) is split into normalized words</li>
<li>GIN is, simply speaking, a B-tree of words mapping to document IDs</li>
</ol>
With this strategy, it is possible to index a lot of documents and retrieve them efficiently via the match operator <b>@@</b>. If you want to read about it in detail, have a look <a href="https://www.postgresql.org/docs/9.6/static/gin-intro.html">here</a> and <a href="http://www.cybertec.at/gin-just-an-index-type/">here</a>.<br />
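<br />
The two steps above can be sketched as a toy inverted index. This is only an illustration in Python, not how PostgreSQL implements GIN: the names <b>normalize</b>, <b>build_index</b> and <b>match_all</b> are made up for this example, a plain dict stands in for the B-tree, and lowercasing stands in for real text-search normalization (no stemming, no stop words).<br />

```python
import re
from collections import defaultdict


def normalize(text):
    """Split a document into lowercase words (a crude stand-in for real normalization)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to the set of IDs of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in normalize(text):
            index[word].add(doc_id)
    return index


def match_all(index, query):
    """Return IDs of documents containing every word of the query (AND semantics)."""
    words = normalize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


documents = {
    1: "Gin is a juniper-flavored spirit",
    2: "Rum is a distilled alcoholic beverage",
    3: "Gin and rum are both spirits",
}
index = build_index(documents)
print(sorted(match_all(index, "gin")))
print(sorted(match_all(index, "gin rum")))
```

A lookup only touches the entries of the queried words, never the documents themselves, which is why this layout scales to many documents.<br />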
<br />
Fulltext search has become more and more important in recent years, so GIN received, among other things, performance improvements in PostgreSQL 9.4. Reducing the index size and speeding up multi-key lookups (rare + frequent keys) did the trick to accelerate the <b>@@</b> operator even more for larger numbers of documents.<br />
<br />
I can recommend <a href="https://www.pgcon.org/2014/schedule/events/698.en.html">this resource</a> and <a href="https://www.pgcon.org/2014/schedule/attachments/329_PGCon2014-GIN.pdf">this one</a>, if you need more good reads.<br />
<br />
<blockquote class="tr_bq">
<i><b>BUT</b> there's still room for improvement, and that's where more beverages come into play.</i></blockquote>
<br />
<b>Distilled Alcoholic Beverage</b><br />
<br />
It seems like the folks over at Postgres Professional have a preference for alcoholic beverages (please mind the vodka at the end of the presentation referred to above). Good beverages appear to help push boundaries even further and build a brand-new index infrastructure. They called it:<br />
<br />
<div style="text-align: center;">
<b>RUM</b></div>
<br />
Why do we need yet another index implementation? The following list shows the features that still lack efficient or general indexing support:<br />
<ul>
<li>phrase operator (<b>&lt;-&gt;</b>, <b>&lt;N&gt;</b>)</li>
<li>ranking (<b>ts_rank</b>, <b>ts_rank_cd</b>)</li>
<li>inverse fulltext search</li>
<li>inverse regular expression</li>
<li>position in JSON arrays</li>
</ul>
I didn't find out whether or not RUM is an acronym, but the basic idea is to store additional data alongside the usual GIN data [1][2]. This allows associating ordering information for efficient ranking, distance determination or positioning in JSON arrays. This approach increases the size of the index marginally but speeds up real-world queries by several orders of magnitude.<br />
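<br />
To make the idea of "additional data" more tangible, here is a minimal Python sketch that stores word positions next to the document IDs. It is an assumption-laden toy, not RUM's actual layout: <b>build_positional_index</b> and <b>phrase_match</b> are hypothetical names, and words are split on whitespace only. It merely shows why a phrase or distance query can be answered from such an index alone.<br />

```python
from collections import defaultdict


def build_positional_index(documents):
    """Map word -> {doc_id: [positions]}; the positions are the extra data."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(position)
    return index


def phrase_match(index, first, second, distance=1):
    """Return IDs of documents where `second` occurs exactly `distance` words after `first`."""
    hits = set()
    first_postings = index.get(first, {})
    second_postings = index.get(second, {})
    for doc_id, first_positions in first_postings.items():
        second_positions = set(second_postings.get(doc_id, []))
        if any(p + distance in second_positions for p in first_positions):
            hits.add(doc_id)
    return hits


documents = {
    1: "cheap rum and expensive gin",
    2: "expensive rum and cheap gin",
}
index = build_positional_index(documents)
print(phrase_match(index, "cheap", "rum"))     # only document 1
print(phrase_match(index, "cheap", "gin"))     # only document 2
print(phrase_match(index, "rum", "gin", 3))    # both documents
```

With document IDs alone, both documents would match "cheap" and "rum" equally, and the database would have to re-read the documents to tell them apart.<br />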
<br />
To make the most of RUM, they also gave birth to a new ranking function: <b>ts_score</b>, which is a middle ground between the well-known ranking functions <a href="https://www.postgresql.org/docs/9.6/static/textsearch-controls.html#TEXTSEARCH-RANKING" style="font-weight: bold;">ts_rank</a> and <a href="https://www.postgresql.org/docs/9.6/static/textsearch-controls.html#TEXTSEARCH-RANKING"><b>ts_rank_cd</b></a>.<br />
The slides are pretty clear about the shortcomings of those functions [1][2]:<br />
<ul>
<li><b>ts_rank</b> doesn't support logical operators</li>
<li><b>ts_rank_cd</b> works poorly with OR queries</li>
</ul>
<br />
I don't know exactly how they plan to integrate this into PostgreSQL trunk, if at all, but personally, I would love to see these features in PostgreSQL 11. They solve many real-world use-cases. So, keep up the good work!<br />
<br />
Btw. if you want to give them a hand or a review, stop by their <a href="https://github.com/postgrespro/rum">GitHub repository</a>.<br />
<br />
Enjoy the drinks,<br />
Sven<br />
<br />
[1] <a href="https://pgconf.ru/media/2017/04/03/20170316H3_Korotkov-rum.pdf">https://pgconf.ru/media/2017/04/03/20170316H3_Korotkov-rum.pdf</a><br />
[2] <a href="http://www.sai.msu.su/~megera/postgres/talks/pgopen-2016-rum.pdf">http://www.sai.msu.su/~megera/postgres/talks/pgopen-2016-rum.pdf</a><br />
<br />
<b>PostgreSQL: production-ready Hash Indexes</b> (2017-06-27)<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgV9V2CG4NsRQ79mi3XnbM_H-HBly2X2VWUEF41c73Xm_v8od3Lh2U__3lW70eAOH4m7NqpWooEVkCyXLtA-pYXJinqR5pK7P7vvfEdsQfH5Zw1sPZBepVct93olFuv7y4DSsKnCufLUq0/s1600/hash-index.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="212" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgV9V2CG4NsRQ79mi3XnbM_H-HBly2X2VWUEF41c73Xm_v8od3Lh2U__3lW70eAOH4m7NqpWooEVkCyXLtA-pYXJinqR5pK7P7vvfEdsQfH5Zw1sPZBepVct93olFuv7y4DSsKnCufLUq0/s640/hash-index.png" width="640" /></a></div>
<br />
<br />
Up to PostgreSQL 9.6, hash indexes were second-class citizens. Using them in production was not recommended because they weren't crash-safe. That's going to change with PostgreSQL 10, and here's how.<br />
<br />
<a name='more'></a><br />
<b>Write Ahead Logging for Hash Indexes</b><br />
<a href="https://commitfest.postgresql.org/13/740/">Commit Fest Entry</a> and the relevant threads <a href="https://www.postgresql.org/message-id/flat/CAA4eK1JOBX=YU33631Qh-XivYXtPSALh514+jR8XeD7v+K3r_Q@mail.gmail.com">(1)</a> + <a href="https://www.postgresql.org/message-id/flat/CA%2BTgmoZWH0L%3DmEq9-7%2Bo-yogbXqDhF35nERcK4HgjCoFKVbCkA%40mail.gmail.com">(2)</a><br />
<br />
The actual problem with hash indexes was their inability to use the WAL. <a href="https://www.postgresql.org/docs/9.6/static/indexes-types.html">Quoting the docs:</a><br />
<blockquote class="tr_bq">
<i>Hash index operations are not presently WAL-logged, so hash indexes might need to be rebuilt with REINDEX after a database crash if there were unwritten changes.</i></blockquote>
With PostgreSQL 10, this drawback is gone. The community, with Amit Kapila leading the way, worked hard to remove this restriction and made hash indexes first-class citizens of PostgreSQL.<br />
<br />
What's in it for you, you might wonder. According to Amit Kapila, <a href="https://www.postgresql.org/message-id/CAA4eK1%2Bf%3DKpUW3TOW0P9LUCA8xhFujcjKhEnYRtoaB83LSsSMg%40mail.gmail.com">hash indexes tend to perform better than B-trees if the column values are unique</a>. So, it can be beneficial to use hash indexes instead of B-trees. Performance FTW!<br />
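<br />
A rough intuition for that claim, as a Python toy and explicitly not PostgreSQL's implementation or a benchmark: a sorted list with binary search stands in for a B-tree, a dict stands in for a hash index. The class names <b>SortedIndex</b> and <b>HashIndex</b> are invented for this sketch. An equality lookup in the hash needs one hash computation, while the ordered structure needs a number of comparisons that grows with the logarithm of the row count; in return, only the ordered structure can answer range queries.<br />

```python
import bisect


class SortedIndex:
    """Toy ordered index: binary search over sorted keys."""

    def __init__(self, rows):
        pairs = sorted(rows.items())
        self.keys = [key for key, _ in pairs]
        self.values = [value for _, value in pairs]

    def lookup(self, key):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None

    def between(self, low, high):
        """Range query: possible because the keys are kept in order."""
        start = bisect.bisect_left(self.keys, low)
        stop = bisect.bisect_right(self.keys, high)
        return self.values[start:stop]


class HashIndex:
    """Toy hash index: equality lookups only."""

    def __init__(self, rows):
        self.buckets = dict(rows)

    def lookup(self, key):
        return self.buckets.get(key)


rows = {user_id: f"user-{user_id}" for user_id in range(1000)}
ordered, hashed = SortedIndex(rows), HashIndex(rows)
print(ordered.lookup(42), hashed.lookup(42))
print(ordered.between(10, 12))
```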
<div>
<br /></div>
<div>
Given the production-readiness of hash indexes, it now makes sense to improve the long-neglected index type in general. It's hard to say if "WAL for Hash Indexes" was the trigger here but it makes subsequent patches to hash indexes usable in the first place. So, I picked the following two interesting ones.</div>
<div>
<br /></div>
<b><br /></b>
<b>Microvacuum Hash Indexes</b><br />
<a href="https://commitfest.postgresql.org/13/835/">Commit Fest Entry</a> and the relevant <a href="https://www.postgresql.org/message-id/flat/CAE9k0PkRSyzx8dOnokEpUi2A-RFZK72WN0h9DEMv_ut9q6bPRw@mail.gmail.com">thread</a><br />
<br />
This patch reduces the index size by factors of 2 to 4. It does that by increasing the re-usability of hash index pages and avoiding page splits.<br />
A small but nice side effect: while reviewing the patch, occasional deadlocks caught the community's eye and eventually led to improved testing code for other index types.<br />
<br />
<br />
<b>A Better Way to Expand Hash Indexes</b><br />
<a href="https://commitfest.postgresql.org/13/1012/">Commit Fest Entry</a> and the relevant <a href="https://www.postgresql.org/message-id/flat/CAD__OuhG6F1gQLCgMQNnMNgoCvOLQZz9zKYJQNYvYmmJoM42gA@mail.gmail.com">thread</a><br />
<br />
Hash indexes need to grow as more and more items are inserted. Usually, this happens through a so-called bucket increase. This patch chops that increase into smaller pieces, which allows hash indexes to grow more gradually.<br />
<br />
<br />
Given those improvements, I can imagine a lot of people now considering hash indexes as a viable alternative to the venerable B-tree. What about you?<br />
<br />
Regards,<br />
Sven<br />
<br />
<br />
<b>CI Done Right</b> (2017-06-10)<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td class="copyright-fotos2" style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_86vLVsEvDnN0ja22bXvoODYNCHN869z3TCdDOuivh7Un5OwvAckXUoARpk0mUxX92u59sSGAVu_CRg2BcoZjNeZ5FR6JicP7RKSDfNt00hxtMcjw-M7AlJidzhYH7klen4X8AQPFDHc/s1600/broken.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="288" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_86vLVsEvDnN0ja22bXvoODYNCHN869z3TCdDOuivh7Un5OwvAckXUoARpk0mUxX92u59sSGAVu_CRg2BcoZjNeZ5FR6JicP7RKSDfNt00hxtMcjw-M7AlJidzhYH7klen4X8AQPFDHc/s640/broken.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Broken.</td></tr>
</tbody></table>
<br />
Recently, some people have had some serious issues with a broken <a href="https://pypi.python.org/pypi/setuptools">setuptools release</a> [1, 2, 3]. One particular complaint was about CI systems breaking just because of this central package. Replying to those reactions, I had a conversation via Twitter about certain design decisions of continuous integration [4, 5].<br />
<br />
In this post, I want to share some of my personal experience from building such systems over the last couple of years. The result should be a cozy little guideline for those setting up their <a href="https://en.wikipedia.org/wiki/Continuous_integration">CI system</a> and <a href="https://en.wikipedia.org/wiki/Continuous_delivery">CD system</a> within a corporate environment.<br />
<a name='more'></a><br />
So, let's start with those rough steps:<br />
<ol>
<li><b>Define what is to be tested.</b></li>
<li><b>Define what will happen in case of success and failure of those tests.</b></li>
</ol>
<div>
Neither of those steps is easy, and setting everything up overnight won't work. CI systems differ from environment to environment. As usual, it depends. That said, let's dig into the details.<br />
<br /></div>
<div>
<h2>
<b><span style="font-size: x-large;">What is to be tested?</span></b></h2>
</div>
<div>
We can break down this question into these subpoints:</div>
<div>
<ol>
<li>Define the build environment, including</li>
<ol>
<li>system binaries</li>
<li>system configs</li>
<li>directory structures</li>
<li>running services</li>
</ol>
<li>Define the test environment, including</li>
<ol>
<li>your source-code</li>
<li>your packages</li>
<li>3rd-party packages</li>
<li>non-source-code data</li>
</ol>
</ol>
</div>
Define this stuff. Don't even start without having put some thought into it. Usually, we want to test our code and its integration with 3rd-party code but <b>not</b> the 3rd-party code itself.<br />
<br />
And there is more to it. Software usually comes in the form of releases. In practice, release A does almost the same things as release B. That's why we have a constant package name with changing versions attached to it.<br />
<br />
However, on an abstract level, each release of a package is actually a different package in itself. There is a diff, isn't there? So, we need to bother with package releases before implementing our CI system, which basically boils down to:
<br />
<br />
<div style="text-align: center;">
<b><span style="font-size: large;">Dependency Update Cadence
</span></b></div>
<br />
You guessed it - we need to define it as well for everything already defined above (usually the dependencies of your code): system binaries, 3rd-party packages, your packages, etc.<br />
<ul>
<li>How often do you update the OS environment?</li>
<li>How often do you update 3rd-party dependencies? Which ones?</li>
<li>How often do you update your packages?</li>
</ul>
<ul>
<li>Manually?</li>
<li>On a regular basis like monthly?</li>
<li>Do you really need to be bleeding edge? Or will a more stable version do?</li>
</ul>
Ever updated one of your important frameworks (e.g. <a href="https://www.djangoproject.com/">Django</a>)? Don't tell me you could fix the incompatibilities in a few million lines of legacy source code within a day - <i>all of them</i>.<br />
<br />
Some packages require bleeding-edge versions of other dependencies (as in the case of virtualenv installing the latest setuptools). Using a <b><a href="https://pypi.python.org/pypi/pypiserver">private package server</a></b> and some <b>cronjobs</b> is one way to answer those questions in a sensible and enterprise-friendly way.<br />
<a href="https://pip.pypa.io/en/stable/reference/pip_install/#requirement-specifiers"><br /></a>
<b><a href="https://pip.pypa.io/en/stable/reference/pip_install/#requirement-specifiers">Pinning dependencies</a></b> is another way. Use it to fix the versions, and define a (preferably automated) process for updating them according to your needs.<br />
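<br />
One way to enforce pinning in a CI job is a small check that fails when a requirement is not fixed to an exact version. The following Python sketch is an assumption, not an official pip feature: the function name <b>unpinned_requirements</b> is made up, the versions in the sample are for illustration only, and the regular expression covers only simple <b>name==version</b> lines (it ignores URLs, hashes and other corner cases of the requirements format).<br />

```python
import re

# A requirement counts as pinned only if it uses an exact "==" version without wildcards.
PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[^\]]*\])?\s*==\s*[^*\s;]+\s*(;.*)?$")


def unpinned_requirements(lines):
    """Return the requirement lines that do not pin an exact version."""
    problems = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line or line.startswith("-"):  # skip blank lines and pip options such as -r
            continue
        if not PINNED.match(line):
            problems.append(line)
    return problems


requirements = """
# sample requirements file content, versions chosen for illustration only
Django==1.11.2
setuptools>=36.0
requests
pytest==3.1.*
""".splitlines()

print(unpinned_requirements(requirements))
```

A CI job could run such a check first and stop early, so that a surprising upstream release never reaches the actual test run.<br />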
<br />
This will give you a proper test environment, where you can rely on what complements your code while being tested. Test results aren't meaningful otherwise.<br />
<blockquote class="tr_bq">
<b><span style="font-size: large;">Accounting for dependency versions enables you to</span></b><b><span style="font-size: large;"> change their update cadence according to quick needs.</span></b></blockquote>
It seems that some of the people who were using the bleeding-edge version of setuptools had no way to change their setup quickly enough. So, they depended on how fast the team around setuptools could fix it. That is unacceptable in corporate software engineering.<br />
<h2>
<span style="font-size: x-large;">
What happens in case of test success and test failure?</span></h2>
As mentioned at the beginning, every CI/CD setup is different and should serve its creators' needs. So, here's a collection of things people might want to do after a successful test run (YMMV but usually that means zero failures):<br />
<ul>
<li>merge code</li>
<li>create a release aka version tagging</li>
<li>build a package of this release</li>
<li>trigger jobs for dependent packages</li>
<li>update/deploy QA (re-running the tests there?) and production systems with these releases</li>
</ul>
<div>
Here we go with another list in case of test failures (usually that means at least a single test failure):</div>
<div>
<ul>
<li>notify developers</li>
<li>re-run tests under certain circumstances</li>
<li>do nothing ;-)</li>
</ul>
<div>
For the sake of completeness: all of these items should be triggered and executed <b>automatically</b>, with no supervision required except when the CI/CD machinery itself fails. That's a design aspect you need to care about deeply.<br />
<br />
Considering all those points should help building CI/CD systems while allowing you to minimize wasted enterprise resources (aka your time) and increase acceptance within your team.</div>
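<div>
<br />
The two lists can be wired together in a few lines. This is a minimal Python sketch under stated assumptions, not the API of any particular CI product: <b>run_pipeline</b> is a hypothetical function, the test run is reduced to a callable returning the number of failures, and the actions are plain callables that simply record what they would do.</div>

```python
def run_pipeline(run_tests, on_success, on_failure, max_attempts=2):
    """Run the tests (re-running up to max_attempts), then fire the matching actions."""
    failures = 0
    for _ in range(max_attempts):
        failures = run_tests()
        if failures == 0:
            for action in on_success:
                action()
            return True
    for action in on_failure:
        action(failures)
    return False


log = []
results = iter([3, 0])  # first run: 3 failures, the re-run passes
ok = run_pipeline(
    run_tests=lambda: next(results),
    on_success=[lambda: log.append("merge"), lambda: log.append("tag release")],
    on_failure=[lambda n: log.append(f"notify developers: {n} failures")],
)
print(ok, log)
```

<div>
No step waits for a human: success actions, re-runs and notifications all follow from the test result alone.</div>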
</div>
<div>
<b><br /></b></div>
<div>
Happy coding,</div>
<div>
Sven</div>
<br />
[1] <a href="https://www.reddit.com/r/Python/comments/6elcaa/psa_setuptools_broken_release_36_dont_use_it/">https://www.reddit.com/r/Python/comments/6elcaa/psa_setuptools_broken_release_36_dont_use_it/</a><br />
[2] <a href="https://github.com/pypa/setuptools/issues/1042">https://github.com/pypa/setuptools/issues/1042</a><br />
[3] <a href="https://github.com/pypa/setuptools/pull/1043">https://github.com/pypa/setuptools/pull/1043</a><br />
<br />
[4] <a href="https://twitter.com/kunsv/status/870399225883836417">https://twitter.com/kunsv/status/870399225883836417</a><br />
[5] <a href="https://twitter.com/lucaswiman/status/870682012675211265">https://twitter.com/lucaswiman/status/870682012675211265</a><br />
<br />
<b>PostgreSQL 10 is on its way</b> (2017-05-03)<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td class="copyright-fotos2" style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgBdsUAGwj2le4n1w9XI4TnHr4nFSLJKfNIlBWGnJ4qcXHMv2rxJteb3lttJsJ-tKF1hOFAovjaceVI3ArXyy_5KJq33hAcRZOgaMnmFM0kAxgeS4gbsv4sZ1OvVjQfSfRApaxXbXeO43o/s1600/arch.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="188" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgBdsUAGwj2le4n1w9XI4TnHr4nFSLJKfNIlBWGnJ4qcXHMv2rxJteb3lttJsJ-tKF1hOFAovjaceVI3ArXyy_5KJq33hAcRZOgaMnmFM0kAxgeS4gbsv4sZ1OvVjQfSfRApaxXbXeO43o/s640/arch.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">A multitude of features and fascinating details.</td></tr>
</tbody></table>
<br />
The next major release of <a href="https://www.postgresql.org/">PostgreSQL</a> is on its way. It's a huge release in terms of features, improvements and bugfixes.<br />
<br />
<h2>
First things first</h2>
Versioning has been changed to a simpler scheme using only two numbers: major and minor version. So, this will be PostgreSQL 10 followed by 10.1 and 10.2 and so on. Okay, now let's get to the real deal <b><span style="font-size: large;">π</span></b><br />
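To make the change concrete, here is a minimal sketch of how a version string could be split under both schemes. The function name `split_version` is made up for illustration. It assumes that before version 10 the first two numbers formed the major version (e.g. 9.6) and the third the minor one.

```python
from typing import Tuple


def split_version(version: str) -> Tuple[str, int]:
    """Split a PostgreSQL version string into (major, minor).

    Hypothetical helper for illustration only.
    """
    parts = version.split(".")
    if int(parts[0]) >= 10:
        # New scheme: "10.2" means major "10", minor 2.
        major = parts[0]
        minor = int(parts[1]) if len(parts) > 1 else 0
    else:
        # Old scheme: "9.6.3" means major "9.6", minor 3.
        major = ".".join(parts[:2])
        minor = int(parts[2]) if len(parts) > 2 else 0
    return major, minor


new_style = split_version("10.2")
old_style = split_version("9.6.3")
```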
<a name='more'></a><br />
<h2>
Logical Replication</h2>
The inclusion of <a href="https://www.2ndquadrant.com/en/resources/pglogical/">pglogical</a> as <a href="https://www.postgresql.org/docs/devel/static/logical-replication.html">logical replication</a> has been completed. So, PostgreSQL 10 will feature a publish-subscribe replication system which finally supports use-cases like <b>zero-downtime upgrades</b>, data consolidation à la <b>data warehousing</b> and <b>scaling out</b> read load, all with a single database technology. I personally find this an important feat for the open-source project!<br />
<br />
<h2>
Parallel Support</h2>
Parallel support has been extended far beyond the initial parallel-enabling architecture (e.g. flagging parallel-safe functions) and its first implementation for sequential scans with hash join, nested loops and aggregate functions. Now PostgreSQL includes parallel <b>btree index scans</b>, parallel <b>bitmap heap scans</b>, parallel <b>merge joins</b> and parallel uncorrelated <b>subqueries</b> as well.<br />
<br />
<h2>
Hash Indexes</h2>
Finally, <b>hash indexes</b> are going to be fully featured citizens of PostgreSQL, thus the <a href="https://www.postgresql.org/docs/9.6/static/indexes-types.html">age-old warning</a> against their usage in production is going to be removed.<br />
<br />
<h2>
Fulltext meets JSON</h2>
Other improvements will be <b>fulltext</b> searches performed on <b>JSON</b> datasets, plus more utility functions for JSON data. Together, these form the cornerstone of what almost all modern non-trivial websites need to provide a streamlined user experience combined with short development cycles.<br />
<br />
<br />
In a series of upcoming posts, I am going to cover the most interesting aspects in detail by looking at specific commit sets.<br />
<br />
SvenAnonymoushttp://www.blogger.com/profile/13449109761788005994noreply@blogger.com0tag:blogger.com,1999:blog-7246401958767395519.post-1729778425010233342016-09-29T23:38:00.000+02:002016-09-29T23:40:20.959+02:00PostgreSQL 9.6 ReleasedWe are finally there. There is a new major version of PostgreSQL - 9.6.<br />
<br />
Here are the <a href="https://www.postgresql.org/about/news/1703/">news</a>, so go ahead and <a href="http://srkunze.blogspot.com/2016/06/postgresql-parallel-aggregate.html">read</a> <a href="http://srkunze.blogspot.com/2016/07/postgresql-index-only-scans-with.html">what</a> <a href="http://srkunze.blogspot.com/2016/09/postgresql-optimizing-aggregates.html">they've</a> done for the community. It's just great!<br />
<br />
Cheers!Anonymoushttp://www.blogger.com/profile/13449109761788005994noreply@blogger.com0tag:blogger.com,1999:blog-7246401958767395519.post-40397026434601789672016-09-07T23:00:00.002+02:002017-05-03T22:45:33.141+02:00PostgreSQL: Optimizing Aggregates<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td class="copyright-fotos2" style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgHQbeEV2qRnse2e7dnaGGcwPjOAvCnKYF_YQqpqeeiSSbEPAS4pdzT8BtALmoDzqK5BkR-K-1OW2yYPYHtCukSAOYt_HXtBB4SMbjU7ALo9STyQRql7oasydIBRhhjvBUCcMIL-H3qquE/s1600/aggregate.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="256" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgHQbeEV2qRnse2e7dnaGGcwPjOAvCnKYF_YQqpqeeiSSbEPAS4pdzT8BtALmoDzqK5BkR-K-1OW2yYPYHtCukSAOYt_HXtBB4SMbjU7ALo9STyQRql7oasydIBRhhjvBUCcMIL-H3qquE/s640/aggregate.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Aggregation of vast energy: our sun.</td></tr>
</tbody></table>
It's time to write about PostgreSQL 9.6 <a href="http://srkunze.blogspot.com/2016/06/postgresql-parallel-aggregate.html">again</a>. With the <a href="https://www.postgresql.org/about/news/1693/">Release Candidate 1</a> out in the wild, we slowly approach the end of a very interesting development cycle. Here I would like to talk about the development in the field of aggregates. So, two commits, <a href="https://commitfest.postgresql.org/9/552/">9/552</a> and <a href="https://commitfest.postgresql.org/9/435/">9/435</a>, will improve the performance of queries using aggregate functions and GROUP BY clauses:<br />
<a name='more'></a><br />
<blockquote class="tr_bq">
<b>Title </b>Combine Aggs: Serialize/Deserialize Internal aggregate states<br />
<b>Topic </b>Server Features<br />
<b>Created </b>2016-02-29 04:08:46<br />
<b>Last modified </b>2016-04-08 17:49:52 (5 months ago)<br />
<b>Latest email </b>2016-04-09 10:13:21 (5 months ago)<br />
<b>Status </b>2016-03: Committed<br />
<b>Authors </b>David Rowley (davidrowley)<br />
<b>Reviewers </b>Robert Haas (rhaas)<br />
<b>Committer </b>Robert Haas (rhaas)<br />
(<a href="https://www.postgresql.org/message-id/flat/CA+U5nMJ92azm0Yt8TT=hNxFP=VjFhDqFpaWfmj+66-4zvCGv3w@mail.gmail.com#CA+U5nMJ92azm0Yt8TT=hNxFP=VjFhDqFpaWfmj+66-4zvCGv3w@mail.gmail.com">full discussion</a>)</blockquote>
<blockquote class="tr_bq">
<b>Title </b>Remove Functionally Dependent GROUP BY Columns<br />
<b>Topic </b>Performance<br />
<b>Created </b>2015-12-02 10:23:32<br />
<b>Last modified </b>2016-02-11 22:52:25 (6 months, 4 weeks ago)<br />
<b>Latest email </b>2016-02-14 21:16:35 (6 months, 3 weeks ago)<br />
<b>Status </b>2016-03: Committed<br />
2016-01: Moved to next CF<br />
<b>Authors </b>David Rowley (davidrowley)<br />
<b>Reviewers </b>Julien Rouhaud (rjuju)<br />
<b>Committer </b>Tom Lane (tgl)<br />
(<a href="https://www.postgresql.org/message-id/flat/CAKJS1f_UZ_MXtpot6EPXsgHSujoUCrKuXYHLH06h072rDXsCzw@mail.gmail.com#CAKJS1f_UZ_MXtpot6EPXsgHSujoUCrKuXYHLH06h072rDXsCzw@mail.gmail.com">full discussion</a>)</blockquote>
<br />
The former commit basically improves the performance of queries that look like this:<br />
<br />
<pre>SELECT AVG(c), VARIANCE(c), SUM(c) FROM my_table;</pre>
<br />
Here the work is reduced to a single calculation per row, where previously each result column triggered its own redundant calculation. The speedup comes from <a href="https://www.postgresql.org/message-id/54E7948D.9010803%402ndquadrant.com">sharing the internal state and a common transition function</a> among those aggregate functions.<br />
<br />
Another benefit is that sharing independent state variables across different aggregates allows parallelizing the computation of these variables. The functions above share the state [count, sum, sum of squares], and all three variables are independent.<br />
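The idea of a shared state can be sketched in a few lines of Python. This is not PostgreSQL's implementation, only an illustration under the assumption that the state is the triple [count, sum, sum of squares] mentioned above and that VARIANCE means the sample variance. The names `transition` and `finalize` are made up.

```python
def transition(state, value):
    """Advance the shared state (count, sum, sum of squares) by one row."""
    count, total, squares = state
    return (count + 1, total + value, squares + value * value)


def finalize(state):
    """Derive all three aggregates from the one shared state."""
    count, total, squares = state
    avg = total / count
    # Sample variance, computed from the same three state variables.
    variance = (squares - total * total / count) / (count - 1)
    return {"avg": avg, "variance": variance, "sum": total}


rows = [1.0, 2.0, 3.0, 4.0]
state = (0, 0.0, 0.0)
for row in rows:
    # One pass over the data, one state update per row.
    state = transition(state, row)

result = finalize(state)
```

Each row is touched once, no matter how many of the three aggregates the query asks for.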
<br />
The discussion around this topic started at the end of 2014 and ended in spring 2016. Robert Haas then committed the patch and <a href="https://www.postgresql.org/message-id/CA%2BTgmoarQVxcK0QVnzA2ghsXD4JjyKg1muAnqwVzFEsuyC_pEw%40mail.gmail.com">noted</a> "Man, that was a lot of work", indicating the difficulties in implementing this infrastructure correctly.<br />
<br />
<br />
The latter commit removes redundant GROUP BY arguments, i.e. columns that functionally depend on other grouped columns, so the GROUP BY clause can be simplified. As of 9.6, the PostgreSQL planner does this automatically. There are some quite impressive performance gains (<a href="https://www.postgresql.org/message-id/CAKJS1f_UZ_MXtpot6EPXsgHSujoUCrKuXYHLH06h072rDXsCzw%40mail.gmail.com">up to 50%</a>), at the cost of some rather small performance losses due to increased planning overhead.<br />
<br />
Its main beneficiaries are users migrating from database systems that do not allow omitting redundant GROUP BY arguments when these appear in the SELECT clause.<br />
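Why dropping such columns is safe can be shown with a small experiment. The sketch below uses SQLite (only because it ships with Python), not PostgreSQL, and the tables `customer` and `purchase` are invented for the example. It merely demonstrates that grouping by the primary key alone yields the same groups as grouping by the key plus the columns that depend on it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE purchase (customer_id INTEGER, amount INTEGER);
    INSERT INTO customer VALUES (1, 'Ann', 'Berlin'), (2, 'Bob', 'Dresden');
    INSERT INTO purchase VALUES (1, 10), (1, 5), (2, 7);
""")

# Verbose form: name and city are listed although they depend on id.
verbose = conn.execute("""
    SELECT c.id, c.name, c.city, SUM(p.amount)
    FROM customer c JOIN purchase p ON p.customer_id = c.id
    GROUP BY c.id, c.name, c.city
    ORDER BY c.id
""").fetchall()

# Reduced form: grouping by the primary key alone.
reduced = conn.execute("""
    SELECT c.id, c.name, c.city, SUM(p.amount)
    FROM customer c JOIN purchase p ON p.customer_id = c.id
    GROUP BY c.id
    ORDER BY c.id
""").fetchall()

conn.close()
```

Both queries return the same rows. The difference is that fewer columns have to be compared while grouping.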
<br />
<br />
That's it for now.<br />
<br />
SvenAnonymoushttp://www.blogger.com/profile/13449109761788005994noreply@blogger.com0tag:blogger.com,1999:blog-7246401958767395519.post-78122972455412173252016-09-06T19:45:00.002+02:002016-09-06T19:45:50.348+02:00Followup: systemd user instancesIn my <a href="http://srkunze.blogspot.com/2016/08/what-do-you-do-when-you-need-more.html">last post</a>, I wrote about how to fix systemd user instances for older/broken systemd versions. Here, I'd like to explain how we rolled out the solution to, say, more than a single host, where you can't do those changes by hand (at least not while keeping yourself sane and your customers happy).<br />
<br />
In order to keep track here, we need to do the following:<br />
<a name='more'></a><ol>
<li>generate an SSH key for root if missing</li>
<li>deploy that SSH key for the users in question if missing</li>
<li>deploy <code style="background-color: #cccccc;">our-user@.service</code> on the host</li>
<li>enable <code style="background-color: #cccccc;">our-user@.service</code> for the users in question</li>
</ol>
<div>
For that purpose, we built an RPM package (I guess the same will be possible with your favorite package management system as well), which ships the following unit file:</div>
<pre style="background-color: #cccccc;">[Unit]
Description=Alternative User Manager for %I
After=sshd.service
[Service]
ExecStartPre=/bin/bash -c "test -e ~/.ssh/id_rsa || ssh-keygen
-t rsa -N '' -f ~/.ssh/id_rsa"
ExecStartPre=/bin/bash -c "/usr/bin/ssh -oBatchMode=yes %I@localhost /usr/bin/echo
|| cat ~/.ssh/id_rsa.pub | /usr/bin/su -l %I -c 'tee -a ~/.ssh/authorized_keys'"
ExecStart=/usr/bin/ssh %I@localhost /usr/lib/systemd/systemd --user
Restart=on-failure
[Install]
WantedBy=default.target
</pre>
NOTE: don't break the lines.<br />
<br />
The noticeable difference from our <a href="http://srkunze.blogspot.com/2016/08/what-do-you-do-when-you-need-more.html">first version</a> is the addition of two <span style="background-color: #cccccc;">ExecStartPre</span> options which perform (1) and (2) of our TODO list. Especially (2) turned out to be very tricky and required all sorts of shell magic.<br />
<br />
The remaining points (3) and (4) require remote execution, for which we use <a href="https://docs.saltstack.com/">SaltStack</a> (<a href="https://github.com/saltstack/salt">also on GitHub</a>). This way, we can deploy our additional package, and with it the service unit, on all affected hosts without hassle. Enabling and starting the service (via salt) also performs steps (1) and (2) and we are all set!<br />
<br />
Best,<br />
SvenAnonymoushttp://www.blogger.com/profile/13449109761788005994noreply@blogger.com0tag:blogger.com,1999:blog-7246401958767395519.post-56624840362828352452016-08-23T21:24:00.000+02:002016-09-06T10:49:43.291+02:00What do you do when you need more systemd instances?<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td class="copyright-fotos2" style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh_xVnlecHQjDvYtaOF9OZrhi7bVFL5pdsZklTh-bIpKtqxxKV-dySOjyeiJotSD8FdRfuOqIau06PvcYVKCTdEvHjJrO0M1VSJlvnSs4aPYwlg7O_msY7QvUZtvnC1Aa4DMyStFaDDupw/s1600/challenge.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="332" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh_xVnlecHQjDvYtaOF9OZrhi7bVFL5pdsZklTh-bIpKtqxxKV-dySOjyeiJotSD8FdRfuOqIau06PvcYVKCTdEvHjJrO0M1VSJlvnSs4aPYwlg7O_msY7QvUZtvnC1Aa4DMyStFaDDupw/s640/challenge.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">You wanna make it there? You need to go there!</td></tr>
</tbody></table>
Today, we had a very interesting problem. We needed to have <a href="https://www.freedesktop.org/wiki/Software/systemd/">systemd</a> run <b>additional instances of itself</b> to manage custom daemons. It works as follows.<br />
<a name='more'></a><br />
You would need to enable "lingering" for the corresponding users:<br />
<pre style="background-color: #cccccc;">loginctl enable-linger &lt;user&gt; # only once via root
# alternative: touch /var/lib/systemd/linger/&lt;user&gt;</pre>
After this, you can happily put your service file here:<br />
<pre style="background-color: #cccccc;">mkdir -p ~/.config/systemd/user/
vim ~/.config/systemd/user/test.service</pre>
<pre style="background-color: #cccccc;">[Unit]
Description=test
[Service]
Type=simple
ExecStart=/bin/sleep 10000
Restart=on-abort
[Install]
WantedBy=default.target</pre>
Then, enabling/starting/stopping your brand-new service (and everything else you would expect from a proper task management) works like a charm, also after reboots.<br />
<pre style="background-color: #cccccc;">systemctl --user enable test.service
systemctl --user start test.service
systemctl --user status test.service</pre>
(Please note the <code style="background-color: #cccccc;">--user</code> argument; run this under a non-root user.)<br />
<br />
Well, life would be so easy if everything ran on Ubuntu 16.04 (or on any recent distribution, for that matter). In fact, not all our production servers do. A large portion of them run openSUSE 12.3 and need special handling.<br />
<br />
Once we brought our test setup described above to said servers, we noticed that<br />
<pre style="background-color: #cccccc;">systemctl --user</pre>
fails with<br />
<pre style="background-color: #cccccc;">Failed to issue method call: Process /bin/false exited with status 1</pre>
Not very helpful indeed, so we dug deeper. The core issue here is that there is simply no user instance of systemd running. This is what <code style="background-color: #cccccc;">/lib/systemd/system/user@.service</code> is good for. On my Ubuntu 16.04, <code style="background-color: #cccccc;">user@1000.service</code> is enabled and running, thus maintaining a second systemd instance just for my login user in addition to root's system systemd - commonly known as pid 1. If you stop <code style="background-color: #cccccc;">user@1000.service</code>, you'll notice that <code style="background-color: #cccccc;">systemctl --user</code> also fails.<br />
<br />
In short, on openSUSE 12.3, the mechanism to start user instances of systemd is simply broken. Starting the <code style="background-color: #cccccc;">user@.service</code> results in a failure.<br />
<br />
<b>How to fix it for openSUSE 12.3?</b><br />
<br />
The prospect of updating all production servers made the following solutions unacceptable (due to the possibility of errors or failures while updating and the necessary reboots; basically headaches on steroids):<br />
<ul>
<li>update systemd (will it reboot at all?)</li>
<li>change the PAM config (will we authenticate again?)</li>
<li>even deeper changes in Linux (no please)</li>
</ul>
So, we decided to stick with what we have and what is known to work properly.<br />
<br />
<b>How does it work anyway?</b><br />
<br />
If we pay closer attention to the user@.service unit, we see what it actually does. Let's do the same in a shell:<br />
<pre style="background-color: #cccccc;">user:~$ systemd --user</pre>
It works and systemctl works as well! Hooray. :)<br />
So, it should be a no-brainer for root, right?<br />
<pre style="background-color: #cccccc;">root:~# su - user -c '/usr/lib/systemd/systemd --user'</pre>
<pre style="background-color: #cccccc;"><samp>Failed to create root cgroup hierarchy: Permission denied
Failed to allocate manager object: Permission denied</samp></pre>
Uh, that's odd. Maybe, that's the reason why systemd cannot create another instance of itself. But what's the difference here? When does it work and when not?<br />
<br />
<b>Use SSH!</b><br />
<br />
In the first try, we connected to the server via ssh. There, the authentication and session creation process is successfully finished and thoroughly tested. So, we decided to use ssh when writing the following <code style="background-color: #cccccc;">our-user@.service</code> unit file:
<br />
<pre style="background-color: #cccccc;">[Unit]
Description=Alternative User Manager for %I
After=sshd.service
[Service]
ExecStart=/usr/bin/ssh %I@localhost /usr/lib/systemd/systemd --user
Restart=on-failure
[Install]
WantedBy=default.target
</pre>
Enabling this made the whole systemd user instance magic work again on openSUSE 12.3.<br />
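To see what the template unit above turns into for a concrete user, here is a small Python sketch of the <code>%I</code> substitution. It is not how systemd does it internally: the helper names are made up, the user name "alice" is just an example, and systemd's escaping rules for instance names are ignored on purpose.

```python
# ExecStart line taken from the unit file above.
EXEC_START_TEMPLATE = "/usr/bin/ssh %I@localhost /usr/lib/systemd/systemd --user"


def instance_name(unit: str) -> str:
    """Return the part between '@' and the unit suffix.

    Simplified on purpose: systemd's unescaping of instance names is ignored.
    """
    name, _, _suffix = unit.rpartition(".")
    _prefix, at, instance = name.partition("@")
    if not at or not instance:
        raise ValueError("not an instantiated template unit: %r" % unit)
    return instance


def expand_exec_start(unit: str) -> str:
    """Substitute %I with the instance name of the given unit."""
    return EXEC_START_TEMPLATE.replace("%I", instance_name(unit))


command = expand_exec_start("our-user@alice.service")
```

So enabling <code>our-user@alice.service</code> would make root's systemd log in as alice via ssh and start her user instance there.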
<br />
Cheers,<br />
Sven<br />
<br />
Further readings about system vs user instances of systemd:<br />
<a href="https://www.freedesktop.org/software/systemd/man/systemd.html#--system">https://www.freedesktop.org/software/systemd/man/systemd.html#--system</a><br />
<a href="https://www.freedesktop.org/software/systemd/man/systemctl.html#--user">https://www.freedesktop.org/software/systemd/man/systemctl.html#--user</a>Anonymoushttp://www.blogger.com/profile/13449109761788005994noreply@blogger.com0tag:blogger.com,1999:blog-7246401958767395519.post-25127179763087538542016-07-15T17:29:00.000+02:002016-07-15T17:35:47.277+02:00PostgreSQL: Index-Only Scans with Partial Indexes<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td class="copyright-fotos2" style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgNPb0NE9RX_zakdUXAyqwUZdNGfGWHWaYt88KtSUw4RozCOvbOhT0p10Kohw2QBHH5N9V1ELDdAsaBXuVpBaPiCa8Rd9ZF8mz2O3PG_aJ7bfbBmZRl-6uaP1mZ31s4NrbB41RQ5uhOmho/s1600/partial.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="248" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgNPb0NE9RX_zakdUXAyqwUZdNGfGWHWaYt88KtSUw4RozCOvbOhT0p10Kohw2QBHH5N9V1ELDdAsaBXuVpBaPiCa8Rd9ZF8mz2O3PG_aJ7bfbBmZRl-6uaP1mZ31s4NrbB41RQ5uhOmho/s640/partial.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Partial sun lurking out of the water.</td></tr>
</tbody></table>
Another post in my <a href="http://srkunze.blogspot.com/2016/06/postgresql-parallel-aggregate.html">PostgreSQL 9.6 series</a>. This time, I am talking about commit <a href="https://commitfest.postgresql.org/9/299/">9/299</a>. The complete discussion can be found <a href="http://www.postgresql.org/message-id/flat/55A00F17.1020608@2ndquadrant.com#55A00F17.1020608@2ndquadrant.com">here</a>.<br />
<a name='more'></a><blockquote class="tr_bq">
<b>Title</b> index-only scans with partial indexes<br />
<b>Topic </b>Performance<br />
<b>Created </b>2015-07-10 18:32:48<br />
<b>Last modified </b>2016-03-31 22:00:55 (2 months, 4 weeks ago)<br />
<b>Latest email </b>2016-04-01 02:39:49 (2 months, 4 weeks ago)<br />
<b>Status</b><br />
2016-03: Committed<br />
2016-01: Moved to next CF<br />
2015-11: Moved to next CF<br />
2015-09: Moved to next CF<br />
<b>Authors</b><span class="Apple-tab-span" style="white-space: pre;"> </span>Tomas Vondra, Kyotaro Horiguchi<br />
<b>Reviewers </b>Kyotaro Horiguchi, Kevin Grittner, Konstantin Knizhnik<br />
<b>Committer </b>Tom Lane (tgl)</blockquote>
This commit basically allows partial indexes to participate in index-only scans. Index-only scans, as the name suggests, use only the corresponding index to sift through the data and are thus much faster than going back and forth to the related table.<br />
<br />
However, some indexes, namely partial indexes, cannot be used that easily for this kind of optimization. Partial indexes are indexes which include a <b>WHERE</b> clause in order to index only a slice of the data.<br />
<br />
Quoting Tomas Vondra:<br />
<blockquote class="tr_bq">
<blockquote class="tr_bq">
<i>In other words, unless you include columns from the index predicate to the index, the planner will decide index only scans are not possible. Which is a bit unfortunate, because those columns are not needed at runtime, and will only increase the index size (and the main benefit of partial indexes is size reduction).</i></blockquote>
</blockquote>
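The situation from the quote can be reproduced in miniature. The sketch below uses SQLite instead of PostgreSQL (it ships with Python and also supports partial indexes), so it says nothing about how either planner decides. The table `orders` and the index `open_orders` are invented. The point is only the shape of the case: the predicate column `done` is not part of the index, and the query needs nothing but the indexed column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer INTEGER, done INTEGER);
    INSERT INTO orders (customer, done) VALUES (1, 0), (2, 1), (3, 0), (4, 1);
    -- Partial index: only rows with done = 0 are indexed,
    -- and the predicate column itself is not an index column.
    CREATE INDEX open_orders ON orders (customer) WHERE done = 0;
""")

# The query repeats the index predicate and reads only the indexed column.
rows = conn.execute(
    "SELECT customer FROM orders WHERE done = 0 ORDER BY customer"
).fetchall()

# The stored definition shows the WHERE clause of the partial index.
index_sql = conn.execute(
    "SELECT sql FROM sqlite_master WHERE name = 'open_orders'"
).fetchone()[0]

conn.close()
```

Before this commit, PostgreSQL would have required `done` to be added to the index columns for an index-only scan, which is exactly the size increase the quote complains about.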
The initial discussion on this topic concerned the increase in complexity and runtime of the planning phase, which is usually the case when the planner needs to act more and more intelligently. After thorough discussion and three delays, the patch was merged into the development branch of PostgreSQL by Tom Lane on the 31st of March 2016.<br />
<br />
Thanks again for another improvement to the performance of the PostgreSQL server.<br />
<br />
<br />
SvenAnonymoushttp://www.blogger.com/profile/13449109761788005994noreply@blogger.com0tag:blogger.com,1999:blog-7246401958767395519.post-50272553346878284462016-06-28T18:56:00.000+02:002016-06-28T19:12:32.446+02:00PostgreSQL: Parallel AggregateWith <a href="http://www.postgresql.org/developer/roadmap/">PostgreSQL 9.6</a> looming on the horizon, I went out to sift through some of <a href="https://commitfest.postgresql.org/">PostgreSQL's commitfests</a> to find some interesting bits and pieces. This post is the start of a series covering commits of the next generation of the venerable database management system.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td class="copyright-fotos2" style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg-rls5HiQY8C97QSbTvVmT3Ttz7Wnc1czFJGIUYgFlsWCcBNUYQyp5NMFwc28LpR2zHOHpjTxNZvYxJdAT57SMFEfcxw9z0vD7AV8G7sO7jPnL8D47gDGA_DkUBbmLKpIM_tmFcwGpy7E/s1600/parallel.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="354" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg-rls5HiQY8C97QSbTvVmT3Ttz7Wnc1czFJGIUYgFlsWCcBNUYQyp5NMFwc28LpR2zHOHpjTxNZvYxJdAT57SMFEfcxw9z0vD7AV8G7sO7jPnL8D47gDGA_DkUBbmLKpIM_tmFcwGpy7E/s640/parallel.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Outflow made parallel.</td></tr>
</tbody></table>
<a name='more'></a>We all fancy performance improvements and concurrency, especially these days, when servers tend to come with upper double-digit numbers of execution cores. As <a href="http://srkunze.blogspot.com/2016/01/postgresql-95-released.html">suspected</a> earlier, we will now see a lot of movement in that direction as the foundation for parallel execution has been introduced. So, let's start with the following: <a href="https://commitfest.postgresql.org/9/551/">Parallel Aggregate</a>. See its key data below, from which you can see that it has already been integrated successfully into the main branch of development:<br />
<blockquote class="tr_bq">
<b>Title</b><span class="Apple-tab-span" style="white-space: pre;"> </span>parallel aggregate<br />
<b>Topic</b><span class="Apple-tab-span" style="white-space: pre;"> </span>Server Features<br />
<b>Created</b><span class="Apple-tab-span" style="white-space: pre;"> </span>2016-02-29 00:15:35<br />
<b>Last modified</b><span class="Apple-tab-span" style="white-space: pre;"> </span>2016-03-21 13:36:54<br />
<b>Latest email</b><span class="Apple-tab-span" style="white-space: pre;"> </span>2016-03-22 05:47:25<br />
<b>Status</b><span class="Apple-tab-span" style="white-space: pre;"> </span>2016-03: Committed<br />
<b>Authors</b><span class="Apple-tab-span" style="white-space: pre;"> </span>David Rowley, Haribabu Kommi<br />
<b>Reviewers</b><span class="Apple-tab-span" style="white-space: pre;"> </span>Robert Haas (rhaas)<br />
<b>Committer</b><span class="Apple-tab-span" style="white-space: pre;"> </span>Robert Haas (rhaas)</blockquote>
Robert Haas initially created an infrastructure for parallel execution in PostgreSQL (described in detail <a href="http://rhaas.blogspot.de/2015/11/parallel-sequential-scan-is-committed.html">there</a>) by adding a <i>Gather</i> node which spawns a number of workers to solve a parallelizable workload of an SQL execution plan. David Rowley and Haribabu Kommi extended this idea to aggregation which also allows parallel execution in certain situations.<br />
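The principle behind parallel aggregation can be sketched in Python: every worker aggregates its own chunk into a partial state, and the partial states are combined at the end, which is roughly the role of the step above the Gather node. This is an illustration only, not PostgreSQL's code. The function names are made up, and Python threads are used for the structure, not for real CPU parallelism.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_state(chunk):
    """What a single worker computes: (count, sum) for its chunk."""
    return (len(chunk), sum(chunk))


def combine(left, right):
    """Merge two partial states into one."""
    return (left[0] + right[0], left[1] + right[1])


def parallel_avg(values, workers=4):
    """Average computed from per-worker partial states."""
    size = max(1, -(-len(values) // workers))  # ceiling division
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        states = list(pool.map(partial_state, chunks))
    state = (0, 0)
    for partial in states:
        state = combine(state, partial)
    count, total = state
    # Like AVG over no rows in SQL, return nothing for empty input.
    return total / count if count else None


average = parallel_avg(list(range(1, 101)))
```

The important property is that `combine` gives the same result regardless of how the rows were split across workers.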
<br />
First drafts can be found <a href="http://www.postgresql.org/message-id/CAJrrPGd3AjmgYp3CGxjKeWf1942GbGgHUOXDs2KTS=xt1hkEMw@mail.gmail.com/">here</a>. As noted there, the aggregate needs to indicate parallel support, and to keep the implementation simple, they only implemented the most basic bits first. As such, most of the potential of parallelism still lies ahead of us. In the course of the <a href="https://www.postgresql.org/message-id/flat/CAJrrPGd3AjmgYp3CGxjKeWf1942GbGgHUOXDs2KTS=xt1hkEMw@mail.gmail.com#CAJrrPGd3AjmgYp3CGxjKeWf1942GbGgHUOXDs2KTS=xt1hkEMw@mail.gmail.com">review</a> and the further development, some difficulties arose and some related issues needed to be handled. But in the end the <a href="https://www.postgresql.org/message-id/CA%2BTgmoY7_f-tfuKa%3DV0XrYcOqZ3fdd%2BRjc4QcLH%2BUn-ZTc9iyg%40mail.gmail.com">commit went through</a> and you can make use of it now.<br />
<br />
That said, I can only say a huge "Thank You!" to the PostgreSQL team.<br />
<br />
<br />
Best,<br />
SvenAnonymoushttp://www.blogger.com/profile/13449109761788005994noreply@blogger.com0tag:blogger.com,1999:blog-7246401958767395519.post-64850051479669733152016-03-31T16:33:00.000+02:002016-03-31T22:30:03.828+02:00What is a path?<div style="text-align: right;">
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: left; margin-right: 1em; text-align: left;"><tbody>
<tr><td class="copyright-fotos2" style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEitVGd0IEC9SOFIuKgYK1i7MLiB6nHLvZ3l51mnrZQ8uL6If_beSY1ZBFviP9LdelWWZ7j6GOyQQBRNB5QHCu9Ga_WEhnZLGwc5YSNKzHIg8YaXWaYLsxjtJep9Ck-oe6tTiP6ooDi_KcA/s1600/path.png" imageanchor="1" style="clear: right; display: inline; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEitVGd0IEC9SOFIuKgYK1i7MLiB6nHLvZ3l51mnrZQ8uL6If_beSY1ZBFviP9LdelWWZ7j6GOyQQBRNB5QHCu9Ga_WEhnZLGwc5YSNKzHIg8YaXWaYLsxjtJep9Ck-oe6tTiP6ooDi_KcA/s400/path.png" width="300" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Wisecracker.</td></tr>
</tbody></table>
</div>
<br />
<a href="https://docs.python.org/3/library/pathlib.html">pathlib</a> is a provisional stdlib module. However, as the current threads (<a href="https://mail.python.org/pipermail/python-ideas/2016-March/039353.html">here</a>, <a href="https://mail.python.org/pipermail/python-ideas/2016-March/039268.html">here</a>, <a href="https://mail.python.org/pipermail/python-ideas/2016-March/039132.html">here</a> and <a href="https://mail.python.org/pipermail/python-ideas/2016-March/038998.html">here</a>) on python-ideas show, it is not as easy to work with as originally intended. Once you have a Path object, it's quite easy to use what Path offers which is a lot.<br />
<br />
One big problem, though, is the interaction of Path objects with existing stdlib functions. Most of the latter are string-consuming functions whereas the former are no strings at all. As far as I can see, this is one reason why pathlib lacks broader adoption, and many agree. This situation leads to the following possible resolutions:<br />
<div>
<ol>
<li>make Path objects compatible with strings (basically make them inherit from strings)</li>
<li>make existing stdlib functions accept Path objects (basically make them accept both and convert if needed)</li>
<li>do both but that seems superfluous</li>
</ol>
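The mismatch and the second resolution can be sketched quickly. The helper `as_path_string` is made up for this post. It only illustrates the "accept both and convert if needed" idea and is not a proposal for how the stdlib should do it.

```python
from pathlib import PurePosixPath

path = PurePosixPath("/tmp/example.txt")

# A Path object is not a string ...
is_string = isinstance(path, str)

# ... so string-consuming code needs an explicit conversion.
as_text = str(path)


def as_path_string(path_or_str):
    """Accept a str or a path object and hand back a plain str (resolution 2)."""
    if isinstance(path_or_str, str):
        return path_or_str
    return str(path_or_str)
```

Every function that wants to support both kinds of arguments would need such a conversion step at its entry point.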
<div>
Solution 2 would also affect third-party libraries as <a href="https://mail.python.org/pipermail/python-ideas/2016-March/039351.html">noted here</a>.<br />
<br />
In order to decide appropriately, it becomes necessary to answer the following question:</div>
</div>
<a name='more'></a><div>
<br /></div>
<div>
<b>What is a path in the first place?</b></div>
<div>
<b><br /></b></div>
<div>
<a href="https://mail.python.org/pipermail/python-ideas/2016-March/039284.html">Brett Cannon made a good point</a> of why PEP 428 (that's the one which introduced pathlib as a stdlib module) deliberately chose Path not to inherit from string. I pondered over it for a while and saw that from his perspective a path is actually not a string but rather a complex data-structure consisting of parts with distinct meanings: literally the steps (of which a path consists) to a resource. The string which represents a path as most people know it is just that: a representation of a more complex object, just like a dict or a list. Let me make this a bit clearer. </div>
<div>
<br /></div>
<div>
I think we agree on the following: writing down 21 characters in a row is a string, right? So, what about these 21 characters?</div>
<div>
<br />
{1: 11, 2: 12, 3: 13}</div>
<div>
<br />
If you see that in a Python program (and presumably in many other modern programming languages), you associate that with a dictionary, mapping, hash, etc. So, these 21 characters are a mere representation of a complex object with a very rich functionality.<br />
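In Python terms, the difference between the representation and the object looks like this. It is a small sketch using only the stdlib.

```python
import ast

text = "{1: 11, 2: 12, 3: 13}"

# As a string, it is nothing more than 21 characters in a row.
length = len(text)

# Interpreted as a Python literal, it becomes an object with rich functionality.
mapping = ast.literal_eval(text)
value = mapping[2]

# And the object can be turned back into its representation.
round_trip = repr(mapping)
```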
<br />
The following paragraphs summarize what makes the discussion about paths and strings so hard. Depending on whom you ask, there are different interpretations of what a path actually is.</div>
<div>
<br />
<b>Paths as complex objects and strings as their representation</b><br />
<br /></div>
<div>
Let's put this analogy to work with paths. If you come from a language that treats strings as file paths (like Python), you can imagine and categorize the facilities of pathlib like so:<br />
<ul>
<li><b>pure path</b> - operating on the path string</li>
<li><b>concrete path</b> - operating on files corresponding to the given path</li>
</ul>
The classic "extract the file extension" issue is done easily with the pure path methods. Writing to a file is also easily done with concrete path operations. So, it seems paths are pretty complex objects with some internal structure and a lot of functionality.<br />
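Both categories in a short sketch. The file names are invented, and the concrete part runs in a temporary directory so that nothing on your disk is touched.

```python
import tempfile
from pathlib import Path, PurePosixPath

# Pure path: works on the path string only, no file system access needed.
pure = PurePosixPath("/home/me/notes.tar.gz")
extension = pure.suffix
all_extensions = pure.suffixes

# Concrete path: works on the file the path points to.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "notes.txt"
    target.write_text("hello")
    content = target.read_text()
```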
<br />
<b>Paths as monolithic object for addressing resources</b><br />
<b><br /></b>
The previous interpretation is not the only one. Despite all the fine functionality of extracting file extensions, concatenating parts to a larger path, etc., building a path is not an end in itself. When you have a path, it addresses a resource on a machine. When doing so for reading or writing that resource, you actually don't care about whether the path consists of parts or not. To you, it's a monolithic structure, <b>an address</b>.<br />
<br />
But, you might say, each part of a path represents a directory in hierarchical file systems. Sure, that is true for many file systems but <a href="https://en.wikipedia.org/wiki/File_system#Database_file_systems">not for all</a>. Moreover, how often do you really care about the underlying directory structure? It needs to be there to make things work, of course. When it's there, you mostly don't care. How often do you need to create a subtree in a directory in order to create a single config file? I encounter this once in a while and to be honest: <b>it sucks</b>.<br />
<br />
<span style="background-color: #eeeeee; font-family: "courier new" , "courier" , monospace;">touch /home/me/on/your/ssd.conf</span> will fail if the directory "on/your/" has not been created by somebody before me.<br />
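The same behaviour shows up in pathlib. This is a sketch in a temporary directory, with the path from the example above shortened to its last parts.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "on" / "your" / "ssd.conf"

    try:
        # Fails because the directories "on/your/" do not exist yet.
        target.touch()
        failed = False
    except FileNotFoundError:
        failed = True

    # The directory structure has to be created first.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.touch()
    created = target.exists()
```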
<br />
Especially for me, as a <b>Web</b> developer, it's quite hard to understand what purpose this restriction serves. Within a Web application the hierarchy of URLs is an emergent property not a prerequisite.<br />
<br />
Users of <b>git</b> are accustomed to not committing directories. Why? Because it's unnecessary: the directory structure again emerges from the file names themselves (aka from the content).<br />
<br />
That said, it's rather cumbersome to attribute semantics to the parts of a string that happen to be separated by "/" or "\". At least to me, a path is made of one piece.<br />
<br />
<b>What about security then?</b><br />
<br />
One can further argue that Web development and git repositories are different here. There is a clear boundary where a path can lead. A URL path cannot address a foreign resource on another domain. git file paths are contained within the repository root.<br />
<br />
See the common theme? There is a container from where the path of a resource <b>cannot escape</b>.<br />
<br />
If you have a complete file system available at your fingertips, a lot of harm can be done when malicious user input is concatenated unchecked as a subpath: it is meant to address a resource within a container but is misused to gain access to the complete file system.<br />
<br />
I cannot say if the container pattern would work for everybody but it's definitely worth exploring as there are some prominent working examples out there.<br />
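<br />
One possible reading of the container pattern for file paths, as a rough sketch: the function name <code>resolve_inside</code> is hypothetical, and the check relies on <code>Path.resolve()</code> accepting non-existent paths (Python 3.6+).<br />

```python
from pathlib import Path

def resolve_inside(root, user_path):
    """Return the absolute path for user_path, or raise if it escapes root."""
    root = Path(root).resolve()
    # Joining with an absolute user_path discards root, and '..' segments are
    # normalized by resolve(); both cases are caught by the check below.
    candidate = (root / user_path).resolve()
    try:
        candidate.relative_to(root)
    except ValueError:
        raise ValueError('path escapes container: {!r}'.format(user_path))
    return candidate
```

With this, <code>resolve_inside('/srv/app', 'conf/app.ini')</code> returns a path below the root, while inputs such as <code>'../outside.txt'</code> or <code>'/etc/passwd'</code> raise a <code>ValueError</code>.<br />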
<br />
<b>Conclusion</b><br />
<br />
I really like pathlib since it solves many frequently asked questions in the right way once and for all.<br />
<br />
But I don't like using it as an argument against inheriting paths from strings, on the grounds that paths have internal structure in contrast to strings. At least to me, they do not. That, on the other hand, does not necessarily mean inheriting path from string is a good idea, but it makes the idea no worse than it was before.<br />
<br />
Best,<br />
Sven</div>
Anonymoushttp://www.blogger.com/profile/13449109761788005994noreply@blogger.com0tag:blogger.com,1999:blog-7246401958767395519.post-42723501161406433842016-03-31T13:24:00.003+02:002016-03-31T18:39:50.211+02:00p-stringsCurrently, there is an <a href="https://mail.python.org/pipermail/python-ideas/2016-March/039132.html">interesting debate on python-ideas</a> on the topic of "Would we like to add so-called <b>p-strings</b> to Python?". The p-string idea basically extends the <a href="https://www.python.org/dev/peps/pep-0498/">f-string syntax</a> which will be released in the upcoming Python 3.6.<br />
<br />
The "p" in p-string stands for path and <b>one of the alternative proposals</b> is to add the following syntactic sugar to Python:<br />
<a name='more'></a><br />
<span style="font-family: "courier new" , "courier" , monospace;">p'/someroot/{myvariable}/file.ext'</span><br />
<br />
This is basically supposed to create a <a href="https://docs.python.org/3/library/pathlib.html#basic-use">pathlib.Path</a> object which allows all sorts of convenience methods like <a href="https://docs.python.org/3/library/pathlib.html#pathlib.PurePath.suffix">extracting the extension</a> or <a href="https://docs.python.org/3/library/pathlib.html#pathlib.PurePath.parts">iterating over the parts of the path</a> and more.<br />
<br />
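Since p-strings are only a proposal, the closest equivalent with existing syntax might look like the sketch below. The value of <code>myvariable</code> is made up, and <code>PurePosixPath</code> is used instead of <code>Path</code> only to keep the result independent of the platform:<br />

```python
from pathlib import PurePosixPath

myvariable = 'data'   # hypothetical value, for illustration only

# Roughly what p'/someroot/{myvariable}/file.ext' is meant to abbreviate:
# format the string first, then wrap it in a path object.
path = PurePosixPath('/someroot/{}/file.ext'.format(myvariable))

extension = path.suffix   # extracting the extension
parts = path.parts        # the parts of the path, as a tuple
```
<br />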
Despite the usability improvement, there are of course <b>reservations</b> about the idea. Mainly these are:<br />
<br />
<ul>
<li>path vs. str (I will cover this in <a href="http://srkunze.blogspot.com/2016/03/what-is-path.html">another post</a>)</li>
<li>tying a syntax to a stdlib module (in this case pathlib)</li>
<li>security concerns about including user input (same discussion arose with f-strings back then)</li>
</ul>
<br />
If you find the proposal useful or have something else to contribute to the discussion, we look forward to seeing you on the mailing list. :)<br />
<br />
Best,<br />
SvenAnonymoushttp://www.blogger.com/profile/13449109761788005994noreply@blogger.com0tag:blogger.com,1999:blog-7246401958767395519.post-49567283064618578072016-03-29T11:50:00.000+02:002016-03-31T22:34:26.989+02:00Python makes you a worse programmerThanks Luke for this interesting read: <a href="http://lukeplant.me.uk/blog/posts/why-learning-haskell-python-makes-you-a-worse-programmer/">http://lukeplant.me.uk/blog/posts/why-learning-haskell-python-makes-you-a-worse-programmer/</a><br />
<br />
It reminds me of English as it is substantially simpler than most other languages. What I've heard (from themselves) is that most native English speakers are not easily motivated to learn a second language, even though they know all the corresponding advantages like healthier brains, better first language, more interesting traveling, etc.<br />
<br />
A good article about why to learn a second language: <a href="http://www.omniglot.com/language/articles/benefitsoflearningalanguage.htm">http://www.omniglot.com/language/articles/benefitsoflearningalanguage.htm</a>Anonymoushttp://www.blogger.com/profile/13449109761788005994noreply@blogger.com0tag:blogger.com,1999:blog-7246401958767395519.post-52929982503316636752016-03-22T22:22:00.001+01:002016-03-22T22:41:11.520+01:00Safe Cache Invalidation<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td class="copyright-fotos2" style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiXkbt6Vcs-y_kPYm4rigbvJwcc6uZI8yt3ji9SoiHewlPu4ZNzIePMCRm7tXz-maE7URKgJ_bSc-7uPtoCR8xAOUaVo-Ez0QnQSa-tyevxY4UrMhu7McO-03Et8hz6__9BIFZhsTBwLvE/s1600/bubble.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="432" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiXkbt6Vcs-y_kPYm4rigbvJwcc6uZI8yt3ji9SoiHewlPu4ZNzIePMCRm7tXz-maE7URKgJ_bSc-7uPtoCR8xAOUaVo-Ez0QnQSa-tyevxY4UrMhu7McO-03Et8hz6__9BIFZhsTBwLvE/s640/bubble.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Caches - as fragile as bubbles.</td></tr>
</tbody></table>
<blockquote>
<a href="http://martinfowler.com/bliki/TwoHardThings.html">There are only two hard things in Computer Science: cache invalidation and naming things.</a>
<br />
<br />
- Phil Karlton
</blockquote>
And right he is. Both are true for the package that I would like to present in this post. Based on <span style="font-family: "courier new" , "courier" , monospace;">functools.lru_cache</span>, it allows you to specify when the caches should be invalidated. In the absence of a proper name for this kind of functionality, I called it xcache, analogous to xheap and xfork.<br />
<a name='more'></a><br />
You can find the <a href="https://github.com/srkunze/xcache">source at github</a> and the pre-built <a href="https://pypi.python.org/pypi/xcache">package on PyPI</a>.<br />
<br />
<b>What's in the package?</b><br />
<br />
The purpose of xcache can be explained best using an example. Imagine you have a function like the following:<br />
<br />
<pre style="background-color: #eeeeee; margin-left: 40px; margin-right: 40px;">from functools import lru_cache

@lru_cache()  # parentheses keep this working on older Python 3 versions
def math_func(a, b, c):
    return ...
</pre>
<br />
Let's assume this is one of those proper mathematical functions you know from school or something built upon those. That means, besides all parameters being single letters, the same input yields the same output, now and forever. It further means that an LRU cache can be used to speed up this function <a href="http://srkunze.blogspot.com/2016/03/lru-caches.html">enormously without compromising readability</a>.<br />
<br />
But, since there is an eternal battle going on between mathematicians and computer scientists, you don't write all your functions in this manner. Almost certainly, most of the functions you've written and are about to write will have side effects, which will inevitably lead to wrong results the longer you cache them.<br />
<br />
In short, the associated LRU caches should be invalidated once in a while to maintain the proper output of not-so-mathematical functions. This is where xcache comes in. It allows you to invalidate caches in two ways:<br />
<ol>
<li>using <b>automatic memory management</b> (aka garbage collection)</li>
<li>using <b>context managers</b> (aka <span style="font-family: "courier new" , "courier" , monospace;">with</span> or <span style="font-family: "courier new" , "courier" , monospace;">@</span>)</li>
</ol>
The following examples illustrate the use-cases by attaching LRU caches to the lifespan of a Web request. Normally each request is handled within its own transaction, so most of the data used can be considered constant while handling the request. As soon as the request is finished, the transaction ends (committed or rolled back); thus the caches should be invalidated since another concurrently executed request might have changed the underlying data.<br />
<br />
<b>Invalidation via Memory Management</b><br />
<br />
<i>some preparation</i><br />
<br />
<pre style="background-color: #eeeeee; margin-left: 40px; margin-right: 40px;">from xcache import cached_gen

request_cache = cached_gen(lambda: request) # create new cache wrapper

@request_cache()
def check_permission(user, obj):
    return ...
</pre>
<br />
<i>invalidation happens magically</i><br />
<br />
<pre style="background-color: #eeeeee; margin-left: 40px; margin-right: 40px;">request = ... # wherever you get your request from
objs = ... # list of some objects
if any(check_permission(request.user, obj) for obj in objs):
    print(result_success)
else:
    print(result_deny)
request = ... # another request; all request caches are invalidated
</pre>
<br />
NOTE: we generally attach the request object to some thread-local object, so <span style="font-family: "courier new" , "courier" , monospace;">ref_cache_gen</span> can access it regardless of context.<br />
<br />
<b>Invalidation via Context Manager</b><br />
<br />
<i>preparation again</i><br />
<br />
<pre style="background-color: #eeeeee; margin-left: 40px; margin-right: 40px;">from xcache import cached

@cached()
def check_permission(user, obj):
    return ...
</pre>
<br />
<i>explicit invalidation</i><br />
<br />
<pre style="background-color: #eeeeee; margin-left: 40px; margin-right: 40px;">from xcache import clean_caches

for request in request_list:
    with clean_caches(): # start with empty caches
        objs = ... # list of some objects
        if any(check_permission(request.user, obj) for obj in objs):
            print(result_success)
        else:
            print(result_deny) # after this line caches are empty as well
</pre>
<br />
NOTE: using <span style="font-family: "courier new" , "courier" , monospace;">clean_caches</span> you can even specify which object the caches should be attached to.<br />
<br />
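To make the idea of explicit invalidation more tangible, here is a rough standard-library sketch. It is <i>not</i> how xcache is implemented; the names <code>registered_cache</code> and <code>fresh_caches</code> are invented for this example and simply combine <code>functools.lru_cache</code> with a context manager that clears every registered cache on entry and on exit:<br />

```python
from contextlib import contextmanager
from functools import lru_cache

_registry = []   # every function wrapped by registered_cache() below

def registered_cache(**kwargs):
    """Hypothetical helper: an lru_cache that remembers what it wrapped."""
    def decorator(func):
        wrapped = lru_cache(**kwargs)(func)
        _registry.append(wrapped)
        return wrapped
    return decorator

@contextmanager
def fresh_caches():
    """Clear all registered caches when entering and when leaving the block."""
    for func in _registry:
        func.cache_clear()
    try:
        yield
    finally:
        for func in _registry:
            func.cache_clear()

calls = []

@registered_cache(maxsize=None)
def check_permission(user, obj):
    calls.append((user, obj))    # stands in for an expensive lookup
    return user == 'admin'

with fresh_caches():
    check_permission('admin', 'doc')
    check_permission('admin', 'doc')   # served from the cache
check_permission('admin', 'doc')       # cache was cleared, so computed again
```
<br />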
<b>Conclusion</b><br />
<br />
LRU caches are very useful, as is cache invalidation. Thus, you might find xcache to be a low-overhead addition to your caching libs. <a href="https://pypi.python.org/pypi/xcache">Check out the docs</a> for more options and use-cases. You can plug all <span style="font-family: "courier new" , "courier" , monospace;">lru_cache-</span>compatible cache implementations into xcache, cf. <a href="https://pypi.python.org/pypi/cachetools/1.0.0">cachetools</a>.<br />
<br />
Best,<br />
SvenAnonymoushttp://www.blogger.com/profile/13449109761788005994noreply@blogger.com2tag:blogger.com,1999:blog-7246401958767395519.post-69662477979805244012016-03-08T23:19:00.001+01:002016-09-29T23:50:01.109+02:00Even Faster Heaps<div class="separator" style="clear: both; text-align: center;">
</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td class="copyright-fotos2" style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiTWW37Htua9l7ff04XfGVynr8hW-wH7LbGsF9YgeTdo_TUEg1wiolqPASlDFGAU4IBUCq9DggiKos6qE30XT5op52KZpWemP7MBu69IUFFrqfZTMYX8KagOcfNNOQ8berB2b7AR-W1Mac/s1600/accident.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="312" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiTWW37Htua9l7ff04XfGVynr8hW-wH7LbGsF9YgeTdo_TUEg1wiolqPASlDFGAU4IBUCq9DggiKos6qE30XT5op52KZpWemP7MBu69IUFFrqfZTMYX8KagOcfNNOQ8berB2b7AR-W1Mac/s640/accident.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">An ambulance rushing by.</td></tr>
</tbody></table>
Heaps are about performance. So, it is time to make xheap faster again. After <a href="https://mail.python.org/pipermail/python-list/2016-February/703832.html">realizing</a> that the actual slowdown of RemovalHeap and XHeap does not simply stem from the general overhead but from NOT using the C implementation at all, I decided to change that.<br />
<a name='more'></a><br />
Here's an update of the benchmark. <a href="http://srkunze.blogspot.de/2016/02/the-xheap-benchmark.html">Compared to its predecessor</a>, the change was quite a success. The removal capability now accounts for a 4x slowdown compared to its prior 50x. Furthermore, I could improve the test suite considerably and, as expected, needed to fix some bugs along the way.<br />
<br />
The faster and better version of xheap can be <a href="https://pypi.python.org/pypi/xheap">obtained from PyPI</a> and the source can be <a href="https://github.com/srkunze/xheap">found at github</a>.<br />
<br />
<hr />
<span style="font-family: "courier new" , "courier" , monospace;"><b>operation</b> 1,000 items 10,000 </span><span style="font-family: "courier new" , "courier" , monospace;">items</span><span style="font-family: "courier new" , "courier" , monospace;"> 100,000 </span><span style="font-family: "courier new" , "courier" , monospace;">items</span><span style="font-family: "courier new" , "courier" , monospace;"> 1,000,000 </span><span style="font-family: "courier new" , "courier" , monospace;">items</span><br />
<hr />
<span style="font-family: "courier new" , "courier" , monospace;">init heapq 0.03 ( 1.00x) 0.41 ( 1.00x) 4.45 ( 1.00x) 64.37 ( 1.00x)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> Heap 0.03 ( 1.02x) 0.42 ( 1.03x) 4.46 ( 1.00x) 64.36 ( 1.00x)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> RemovalHeap 0.05 ( 1.57x) 0.62 ( 1.53x) 7.94 ( 1.79x) 101.16 ( 1.57x)</span><br />
<hr />
<span style="font-family: "courier new" , "courier" , monospace;">pop heapq 0.01 ( 1.00x) 0.07 ( 1.00x) 0.85 ( 1.00x) 10.32 ( 1.00x)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> Heap 0.01 ( 1.46x) 0.10 ( 1.38x) 1.13 ( 1.33x) 13.20 ( 1.28x)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> RemovalHeap 0.02 ( 4.18x) 0.24 ( 3.51x) 2.61 ( 3.07x) 28.37 ( 2.75x)</span><br />
<hr />
<span style="font-family: "courier new" , "courier" , monospace;">push heapq 0.00 ( 1.00x) 0.03 ( 1.00x) 0.34 ( 1.00x) 3.58 ( 1.00x)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> Heap 0.01 ( 1.82x) 0.06 ( 1.85x) 0.61 ( 1.81x) 6.38 ( 1.78x)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> RemovalHeap 0.01 ( 2.83x) 0.09 ( 2.91x) 0.97 ( 2.86x) 9.81 ( 2.74x)</span><br />
<hr />
<hr />
<span style="font-family: "courier new" , "courier" , monospace;">init heapq 0.14 ( 1.00x) 1.60 ( 1.00x) 24.00 ( 1.00x) 271.42 ( 1.00x)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> OrderHeap 0.17 ( 1.26x) 1.88 ( 1.18x) 26.80 ( 1.12x) 299.55 ( 1.10x)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> XHeap 0.19 ( 1.42x) 2.10 ( 1.31x) 30.14 ( 1.26x) 332.56 ( 1.23x)</span><br />
<hr />
<span style="font-family: "courier new" , "courier" , monospace;">pop heapq 0.01 ( 1.00x) 0.15 ( 1.00x) 1.83 ( 1.00x) 22.37 ( 1.00x)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> OrderHeap 0.02 ( 1.80x) 0.24 ( 1.60x) 2.73 ( 1.50x) 31.36 ( 1.40x)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> XHeap 0.03 ( 2.66x) 0.33 ( 2.25x) 3.69 ( 2.02x) 41.03 ( 1.83x)</span><br />
<hr />
<span style="font-family: "courier new" , "courier" , monospace;">push heapq 0.00 ( 1.00x) 0.04 ( 1.00x) 0.54 ( 1.00x) 5.67 ( 1.00x)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> OrderHeap 0.01 ( 3.97x) 0.15 ( 3.70x) 1.62 ( 3.03x) 16.46 ( 2.90x)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> XHeap 0.01 ( 3.28x) 0.12 ( 3.06x) 1.38 ( 2.58x) 13.94 ( 2.46x)</span><br />
<hr />
<hr />
<span style="font-family: "courier new" , "courier" , monospace;">remove RemovalHeap 0.02 ( 1.00x) 0.19 ( 1.00x) 2.18 ( 1.00x) 22.62 ( 1.00x)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> XHeap 0.02 ( 0.90x) 0.17 ( 0.89x) 1.72 ( 0.79x) 17.60 ( 0.78x)</span><br />
<hr />
<br />
Kudos to <a href="https://mail.python.org/pipermail/python-list/2016-January/701653.html">Srinivas who proposed the mark&sweep approach</a> in the first place, especially the sweeping condition which allows an amortized runtime of O(log n) for pop, push and remove.<br />
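<br />
For readers unfamiliar with the idea, a mark&sweep removal heap on top of <code>heapq</code> could look roughly like the sketch below. This is an illustration under assumptions, not xheap's actual code: the class name is invented, items must be hashable and unique, <code>remove</code> may only be called for items that are in the heap, and the sweeping threshold (marked items outnumber live ones) is my own choice rather than the condition proposed on the list.<br />

```python
import heapq

class LazyRemovalHeap:
    """Min-heap with remove(): items are marked first and swept later."""

    def __init__(self, items=()):
        self._heap = list(items)
        heapq.heapify(self._heap)    # uses the C implementation if available
        self._removed = set()        # marked items still sitting in _heap

    def push(self, item):
        heapq.heappush(self._heap, item)

    def remove(self, item):
        self._removed.add(item)      # mark
        # sweep once marked items outnumber live ones (assumed threshold)
        if len(self._removed) * 2 > len(self._heap):
            self._heap = [i for i in self._heap if i not in self._removed]
            heapq.heapify(self._heap)
            self._removed.clear()

    def pop(self):
        while True:
            item = heapq.heappop(self._heap)
            if item in self._removed:
                self._removed.discard(item)   # skip items marked as removed
                continue
            return item
```

All heavy lifting is delegated to <code>heapq</code>, which is the point of the change described above.<br />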
<br />
Best,<br />
SvenAnonymoushttp://www.blogger.com/profile/13449109761788005994noreply@blogger.com0tag:blogger.com,1999:blog-7246401958767395519.post-33369747906449887652016-03-06T22:40:00.003+01:002016-03-06T22:40:59.735+01:00Raymond Tomlinson, the inventor of email, died<a href="https://de.wikipedia.org/wiki/Ray_Tomlinson">Raymond Tomlinson</a> invented one of the most famous technologies of today: <a href="https://en.wikipedia.org/wiki/Email">email</a>.<br />
<br />
He died on Friday.<br />
<br />
Read more about him on Ars: <a href="http://arstechnica.com/business/2016/03/e-mail-inventor-ray-tomlinson-who-popularized-symbol-dies-at-74/">http://arstechnica.com/business/2016/03/e-mail-inventor-ray-tomlinson-who-popularized-symbol-dies-at-74/</a>Anonymoushttp://www.blogger.com/profile/13449109761788005994noreply@blogger.com0tag:blogger.com,1999:blog-7246401958767395519.post-54932828674980766512016-03-02T22:06:00.000+01:002016-03-07T22:53:47.979+01:00LRU Caches<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td class="copyright-fotos2" style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi4a7mJsQ8bUyTL0OWWA523scczxKaZls5bThd9u9R597M_LbZPmtjllrnGzMp6yfkGhcAUfoXshUnwXGCh-MkE2g7QwusTo3t0C0wYiSyIjq86cqeM8RYUoC4DplF7icAxykuhDW9IwDs/s1600/precious.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="316" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi4a7mJsQ8bUyTL0OWWA523scczxKaZls5bThd9u9R597M_LbZPmtjllrnGzMp6yfkGhcAUfoXshUnwXGCh-MkE2g7QwusTo3t0C0wYiSyIjq86cqeM8RYUoC4DplF7icAxykuhDW9IwDs/s640/precious.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Precious little pieces preserving the balance of nature.</td></tr>
</tbody></table>
Python features <a href="https://en.wikipedia.org/wiki/Cache_algorithms#Examples">LRU caches</a>. For this purpose, the decorator <span style="font-family: "courier new" , "courier" , monospace;"><a href="https://docs.python.org/3/library/functools.html#functools.lru_cache">@functools.lru_cache</a></span> is provided. You can configure the size of the cache as well as whether equal arguments of different types should be distinguished.<br />
<br />
<b>LRU</b> stands for "<b>l</b>east <b>r</b>ecently <b>u</b>sed", i.e. if the maximum size of the cache has been reached and a new item is to be inserted, the item with the oldest access timestamp will be discarded to make room for the new resident. The cache size can be unlimited, which is especially useful for short-running scripts.<br />
<br />
Let's get our hands dirty:<a name='more'></a><br />
<pre style="background-color: #eeeeee; margin-left: 40px; margin-right: 40px;">from time import time

def fib(n):
    return fib(n-1) + fib(n-2) if n > 1 else 1

for j in range(0, 40, 5):
    a = time()
    f = fib(j)
    b = time()
    print('{t:10.8f} fib({j})={f}'.format(t=b-a, j=j, f=f))
</pre>
Which results in:<br />
<pre style="background-color: #eeeeee; margin-left: 40px; margin-right: 40px;">0.00000048 fib(0)=1
0.00000238 fib(5)=8
0.00001502 fib(10)=89
0.00017858 fib(15)=987
0.00217247 fib(20)=10946
0.01955438 fib(25)=121393
0.21021986 fib(30)=1346269
2.30364680 fib(35)=14930352
</pre>
Runtime increases dramatically.<br />
<br />
What if you really need <span style="font-family: "courier new" , "courier" , monospace;">fib(100)</span>? It seems you are screwed then, right? Let's see how <span style="font-family: "courier new" , "courier" , monospace;">lru_cache</span> remedies the situation here:<br />
<pre style="background-color: #eeeeee; margin-left: 40px; margin-right: 40px;">from functools import lru_cache
from time import time

@lru_cache(maxsize=None) # unlimited cache size
def fib(n):
    return fib(n-1) + fib(n-2) if n > 1 else 1

for j in range(0, 1000, 100):
    a = time()
    f = fib(j)
    b = time()
    print('{t:10.8f} fib({j})={f:e}'.format(t=b-a, j=j, f=f))
print(fib.cache_info())
</pre>
Et voilà:<br />
<pre style="background-color: #eeeeee; margin-left: 40px; margin-right: 40px;">0.00000381 fib(0)=1.000000e+00
0.00016809 fib(100)=5.731478e+20
0.00013471 fib(200)=4.539737e+41
0.00013041 fib(300)=3.595793e+62
0.00014305 fib(400)=2.848123e+83
0.00013804 fib(500)=2.255915e+104
0.00013733 fib(600)=1.786845e+125
0.00017452 fib(700)=1.415308e+146
0.00013733 fib(800)=1.121024e+167
0.00014663 fib(900)=8.879303e+187
CacheInfo(hits=907, misses=901, maxsize=None, currsize=901)
</pre>
Handsome execution times even for very large Fibonacci numbers.
<br />
<br />
<b>Conclusion</b><br />
<ul>
<li>performance gained - <i>yay</i></li>
<li>intention of the implementation preserved - <i>yay</i></li>
<li><b>recursion depth issue</b> not solved - try <span style="font-family: "courier new" , "courier" , monospace;">range(0, 10000, 1000)</span></li>
</ul>
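One possible workaround for the recursion depth issue, sketched under the assumption that filling the cache bottom-up is acceptable: the helper <code>fib_deep</code> and its <code>step</code> parameter are made up for this example.<br />

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    return fib(n-1) + fib(n-2) if n > 1 else 1

def fib_deep(n, step=200):
    """Warm the cache in chunks so no call recurses much deeper than step."""
    for k in range(0, n, step):
        fib(k)          # everything below k - step is already cached
    return fib(n)

value = fib_deep(10000)  # a direct fib(10000) on a cold cache would hit the recursion limit
```
<br />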
That's it for now. There'll be another post covering automatic cache invalidation. Stay tuned.<br />
<br />
Best,<br />
SvenAnonymoushttp://www.blogger.com/profile/13449109761788005994noreply@blogger.com0tag:blogger.com,1999:blog-7246401958767395519.post-53623171703448900232016-03-01T21:31:00.003+01:002016-03-01T21:54:46.444+01:00DROWN: Yet Another Vulnerability of TLSYet another vulnerability of <a href="https://en.wikipedia.org/wiki/Transport_Layer_Security">TLS</a> has been discovered, even affecting the latest version 1.2, as Ars wrote:<br />
<br />
<a href="http://arstechnica.com/security/2016/03/more-than-13-million-https-websites-imperiled-by-new-decryption-attack/">http://arstechnica.com/security/2016/03/more-than-13-million-https-websites-imperiled-by-new-decryption-attack/</a><br />
<br />
Best,<br />
SvenAnonymoushttp://www.blogger.com/profile/13449109761788005994noreply@blogger.com0tag:blogger.com,1999:blog-7246401958767395519.post-29456039342691746032016-03-01T20:51:00.002+01:002016-03-31T18:41:54.568+02:00Designing xforkRecently, I came to know a small team working on a problem which they try to solve by using threads. As expected, problems popped up soon and development slowed down considerably. So based on the <a href="http://srkunze.blogspot.com/2016/02/concurrency-in-python.html">previous post</a>, I would like to lay out my intentions and design decisions regarding <a href="https://pypi.python.org/pypi/xfork">xfork</a>, a module I've written and actively maintain in analysis of the <a href="https://www.python.org/dev/peps/pep-0492/">newly introduced async/await syntax</a>.<br />
<br />
<b>Concurrency is a hard engineering problem.</b>
<br />
<br />
Take it seriously and even consider <u>not being concurrent</u> a valid option.
<br />
<br />
<b>Design Assumptions</b><br />
<b><br /></b>
I created xfork from the following observations based on my own experience. Developers usually:<br />
<a name='more'></a><br />
<ol>
<li>don't understand 100% of the problem's domain.</li>
<li>need a simple approach to get things right.</li>
<li>understand code written in a sequential style.</li>
<li>don't know what environment their code is running on such as:</li>
<ol>
<li>How many cores does the target system have?</li>
<li>How many processes are allowed on the target system, and how many would wrestle it down?</li>
<li>How much memory does the target system have?</li>
<li>How often will the code be re-used and re-executed?</li>
</ol>
</ol>
<div>
Each observation is addressed in one of the following sections.</div>
<div>
<br /></div>
<div>
<b>Background Tasks</b></div>
<div>
<br /></div>
<div>
Observation 1 stems from a pretty basic human property. So, let me put it bluntly:</div>
<div>
<ul>
<li>we don't want processes</li>
<li>we don't want threads</li>
<li>we don't want coroutines</li>
</ul>
<div>
What we <b>really want</b> is faster execution. Parallel (or at least concurrent) execution is just a means to an end here. In turn, processes, threads and coroutines are just a means to parallel execution. So, we had better build an abstraction which is actually closer to the developer's problem: faster execution.</div>
<div>
<br /></div>
<div>
Let's start by calling units of execution which can run independently "background tasks" or simply "tasks".<br />
<br /></div>
</div>
<div>
<b>Task Hierarchy</b></div>
<div>
<br /></div>
<div>
In order to address observation 2, something to structure a collection of tasks is needed.</div>
<div>
<br /></div>
<div>
A software developer is just a normal guy who needs simple solutions for his job. Something that has emerged several times throughout human history is the <b>hierarchy</b>. As far as humans are concerned, they understand hierarchies pretty well. Most companies are structured this way, your file system is probably a hierarchical one, as is your governmental system or the process tree of your computer, tablet or smartphone.<br />
<br />
To put it simply, a hierarchy is a <b>layered system</b> (so you only care about the layer above and below you) and one layer is represented by a <b>single representative</b> (so you greatly simplify the communication to the layers above and below). These two properties have made hierarchies quite successful so far.</div>
<div>
<br />
This said, we go with a hierarchical system for now when it comes to concurrency. That means, there is one task managing a bunch of independent and similar tasks. Managing basically subsumes task creation, result collection and result processing.</div>
<div>
<br /></div>
<div>
<b>Functions as Tasks</b></div>
<div>
<br /></div>
<div>
xfork has been designed to address observation 3 and to take the warning from the beginning seriously. A main goal was to make hopping back and forth from sequential to concurrent style of programming as easy as possible.<br />
<br />
The most basic concept developers usually understand is the function. Thus, functions act as a kind of <b>bridge between the two worlds</b>. A function can be executed either by waiting for its result (sequential style) or by submitting it to a background worker and requesting its result at a later point (concurrent style).<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjr5k1osetgOlb_Zf9mHEsgOacEkRfoxIZ9FUx2gmaU5RtqPrzxvZ4YVZGl5nyCyMw4wJ6xIBs4ZoovVXDjqkHIxIXjzawG41sp1SATSgUWpiAUJAaC5QffAP_2iTWZ9I1Jzl4nQOOMiUg/s1600/onedoesnot.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="188" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjr5k1osetgOlb_Zf9mHEsgOacEkRfoxIZ9FUx2gmaU5RtqPrzxvZ4YVZGl5nyCyMw4wJ6xIBs4ZoovVXDjqkHIxIXjzawG41sp1SATSgUWpiAUJAaC5QffAP_2iTWZ9I1Jzl4nQOOMiUg/s320/onedoesnot.jpg" width="320" /></a></div>
This becomes especially clear when working with a large legacy code base. You might finally consider using concurrent approaches to speed things up, but a complete rewrite is out of the question. One does not simply throw away a large collection of already working functions.<br />
<br /></div>
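The two styles can be illustrated with the standard library's <code>concurrent.futures</code>. This is only a sketch of the concept, not xfork's API; the function <code>word_count</code> and its inputs are made up:<br />

```python
from concurrent.futures import ThreadPoolExecutor

def word_count(text):
    """An ordinary function; nothing in it knows about concurrency."""
    return len(text.split())

texts = ['one two', 'three four five', 'six']

# Sequential style: call the function and wait for each result.
sequential = [word_count(t) for t in texts]

# Concurrent style: submit to a background worker, request results later.
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(word_count, t) for t in texts]
    in_background = [f.result() for f in futures]
```

The function itself stays untouched in both cases, which is what makes it a bridge between the two worlds.<br />
<br />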
<b>Task Management</b><br />
<br />
This job should be done for you by xfork. It should take care of the question whether to create a thread or a process for a background task. Moreover, the number of processes and threads needs to be managed without developer interaction by creating them and shutting them down on the fly, according to the machine's capabilities.<br />
<br />
<div>
When should a task be a process, when should it be a thread and when should it be implemented as a coroutine running in an event loop? <a href="http://srkunze.blogspot.com/2016/02/concurrency-in-python.html">The last post</a> gives a pretty simple explanation for this. Processes utilize the multicore architectures of today's computers, so they are suitable for CPU-bound tasks. Coroutines are designed to wait for I/O efficiently, so I/O-bound tasks are their use-case. Threads are located somewhere in the middle, especially when it comes to the <a href="https://docs.python.org/3/glossary.html#term-global-interpreter-lock">GIL of CPython</a>. So right now, they are best applied to I/O-bound tasks.<br />
<br /></div>
This said, the main exercise for a developer using xfork is actually thinking of whether their function is <b>I/O-bound</b> or <b>CPU-bound</b> and whether it is <b>thread-safe or not</b>.<br />
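<br />
Expressed with the standard library rather than xfork, that decision might be sketched like this. The helpers <code>choose_executor</code> and <code>run_all</code> are invented for illustration, and the thread branch assumes the function is thread-safe:<br />

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def choose_executor(cpu_bound):
    """Return the stdlib pool class matching the rule of thumb above."""
    # Processes sidestep the GIL, so they suit CPU-bound work;
    # threads are enough when a task mostly waits for I/O.
    return ProcessPoolExecutor if cpu_bound else ThreadPoolExecutor

def run_all(func, args, cpu_bound):
    """Run func over args in a pool chosen by the kind of workload."""
    # Note: with processes, func must be picklable (a module-level function).
    with choose_executor(cpu_bound)() as pool:
        return list(pool.map(func, args))

lengths = run_all(len, ['a', 'bb', 'ccc'], cpu_bound=False)
```
<br />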
<b><br /></b>
<b>Conclusion</b><br />
<br />
With all observations addressed, I think it's time to take a break. The next post will investigate the current implementation of xfork.<br />
<br />
Best,<br />
<div>
Sven</div>
Anonymoushttp://www.blogger.com/profile/13449109761788005994noreply@blogger.com0tag:blogger.com,1999:blog-7246401958767395519.post-54372762442085840372016-02-23T19:57:00.000+01:002016-03-31T18:26:20.514+02:00Concurrency in Python<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td class="copyright-fotos2" style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjq61cWP2yOjUbmYN57aXbcPuyquLnMFJaGBL42OYembeKS38X__qnis-or7IpEVD3eW7eqbLAPDs6PI9AJMAc9mldUlPblMCMWyE1ZzhFrcmn1dEkX8o869bVibfR0IdEFq0cUWSLDInk/s1600/double.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="268" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjq61cWP2yOjUbmYN57aXbcPuyquLnMFJaGBL42OYembeKS38X__qnis-or7IpEVD3eW7eqbLAPDs6PI9AJMAc9mldUlPblMCMWyE1ZzhFrcmn1dEkX8o869bVibfR0IdEFq0cUWSLDInk/s640/double.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><i>More speed by having two rails. Not always true.</i></td></tr>
</tbody></table>
Last year, <a href="https://www.python.org/dev/peps/pep-0492/">PEP 0492</a> got accepted, which introduced <b>coroutines</b> and <b>async/await</b> to Python. Around that time, I subscribed to some Python mailing lists and have participated in discussions since then. <a href="https://mail.python.org/pipermail/python-ideas/2015-July/034537.html">I wondered</a> how ordinary Python developers can write code that can be executed <a href="https://en.wikipedia.org/wiki/Parallel_computing">in parallel</a> or at least <a href="https://en.wikipedia.org/wiki/Concurrent_computing">concurrently</a>. Specifically regarding asyncio (coroutines) and concurrency in general, we compiled a survey which I want to record here.<br />
<a name='more'></a><br />
<b>Improving Performance by Running Independent Tasks Concurrently - A Survey
</b><br />
<br />
<table class="data-table">
<tbody>
<tr>
<th></th>
<th>Processes</th>
<th>Threads</th>
<th>Coroutines</th>
</tr>
<tr>
<th>purpose</th>
<td>cpu-bound tasks</td>
<td>cpu- & i/o-bound tasks</td>
<td>i/o-bound tasks</td>
</tr>
<tr>
<th>customizable</th>
<td>no</td>
<td>no</td>
<td><b>yes</b></td>
</tr>
<tr>
<th>controllable</th>
<td>no</td>
<td>no</td>
<td><b>yes</b></td>
</tr>
<tr><td colspan="4"></td></tr>
<tr>
<th>managed by</th>
<td>os scheduler</td>
<td>os scheduler + interpreter</td>
<td>event loop</td>
</tr>
<tr>
<th>parallelism</th>
<td><b>yes</b></td>
<td>no</td>
<td>no</td>
</tr>
<tr>
<th>switching</th>
<td>at any time</td>
<td>after any bytecode</td>
<td>at user-defined points</td>
</tr>
<tr>
<th>shared state</th>
<td><b>no</b></td>
<td>yes</td>
<td>yes</td>
</tr>
<tr><td colspan="4"></td></tr>
<tr>
<th>startup time</th>
<td>biggest/medium*</td>
<td>medium</td>
<td><b>smallest</b></td>
</tr>
<tr>
<th>CPU overhead**</th>
<td>biggest</td>
<td>medium</td>
<td><b>smallest</b></td>
</tr>
<tr>
<th>memory overhead</th>
<td>biggest</td>
<td>medium</td>
<td><b>smallest</b></td>
</tr>
<tr><td colspan="4"></td></tr>
<tr>
<th>pool class</th>
<td><a href="https://docs.python.org/3.5/library/multiprocessing.html#multiprocessing.pool.Pool">multiprocessing.Pool</a></td>
<td><a href="https://docs.python.org/3.5/library/multiprocessing.html#module-multiprocessing.dummy">multiprocessing.dummy.Pool</a></td>
<td><a href="https://docs.python.org/3/library/asyncio-eventloop.html">asyncio.BaseEventLoop</a></td>
</tr>
<tr>
<th>solo class</th>
<td><a href="https://docs.python.org/3.5/library/multiprocessing.html#multiprocessing.Process">multiprocessing.Process</a></td>
<td><a href="https://docs.python.org/3.5/library/threading.html#threading.Thread">threading.Thread</a></td>
<td><a href="https://docs.python.org/3/library/asyncio-task.html#coroutines">asyncio.coroutine</a></td>
</tr>
</tbody></table>
<br />
<span style="font-size: x-small;">*
biggest - </span><span style="font-size: x-small;">on Windows and </span><span style="font-size: x-small;">if using 'spawn' ('fork'+'exec')</span><span style="font-size: x-small;">; </span><span style="font-size: x-small;">medium - if using 'fork' alone</span><br />
<span style="font-size: x-small;">**
due to context switching
</span><br />
<br />
I started this little survey out of curiosity and the professional need to speed up our Python production systems. What I can tell from the experience gained in that field is that you basically need to ask yourself two questions:<br />
<ol>
<li>Does this code run faster with concurrency?</li>
<li>Is the code cpu-bound or i/o-bound?</li>
</ol>
Writing and maintaining concurrent code always makes many brains hurt. So, if you don't get any significant improvement, leave your code alone. If you still want to go ahead, you need to settle for either processes (for cpu-bound tasks) or threads/coroutines (for i/o-bound tasks). As usual, your mileage may vary (also considering the table).<br />
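To make the "pool class" row of the table a bit more tangible, here is a minimal sketch of running an i/o-bound task through a thread-backed pool. The names <code>fake_download</code> and <code>run_concurrently</code> are made up for illustration, and sleeping merely stands in for waiting on a network. Because <code>multiprocessing.dummy.Pool</code> offers the same interface as <code>multiprocessing.Pool</code>, the process-based pool could be passed in instead for cpu-bound work (then with the usual <code>if __name__ == "__main__":</code> guard).<br />

```python
import time
from multiprocessing.dummy import Pool as ThreadPool  # threads, same API as multiprocessing.Pool


def fake_download(name):
    # Hypothetical i/o-bound task: sleeping stands in for waiting on a network.
    time.sleep(0.05)
    return name.upper()


def run_concurrently(pool_class, task, items, workers=4):
    # Works with any pool class that offers the multiprocessing.Pool interface.
    with pool_class(workers) as pool:
        return pool.map(task, items)


names = ["a", "b", "c", "d"]
start = time.perf_counter()
results = run_concurrently(ThreadPool, fake_download, names)
elapsed = time.perf_counter() - start
print(results, round(elapsed, 2))
```

Whether this actually pays off is exactly the first question above, so measuring before and after remains necessary.<br />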
<br />
In the course of the thread, <b>Steve Dower</b> responded very ingeniously (<a href="https://mail.python.org/pipermail/python-ideas/2015-July/034565.html">here</a> and <a href="https://mail.python.org/pipermail/python-ideas/2015-July/034793.html">here</a>) which was my main driver to extend the survey. It explains certain values in the table quite easily, so I am going to quote him here for your convenience.<br />
<b><br /></b>
<br />
<div style="text-align: center;">
<b>Steve Dower's "Let's Bake a Cake"</b></div>
<blockquote class="tr_bq">
<i>Let's say you are making a cake. There are two high-level steps involved:</i><br />
<ol>
<li><i>Gather all the ingredients</i></li>
<li><i>Mix all the ingredients</i></li>
<li><i>Bake it in the oven</i></li>
</ol>
<i>You are personally required to do steps 1 and 2 ("hands-on"). They takes all of your time and attention and you can't do anything else simultaneously.</i><i><br /></i><i><br /></i><i>For step 3, you hand off the work to the oven. While the oven is baking, you are basically free to do other things.</i><i><br /></i><i><br /></i><i>In this analogy, "you" are the main thread and the oven is another thread. (Thread and process are interchangeable here in the general sense - the GIL in Python is practicality that makes processes preferable, but that doesn't affect the concepts.) Steps 1 and 2 are CPU bound (as far as "you" the main thread are concerned), and step 3 is IO bound from "your" (the main thread's) point-of-view.</i><i><br /></i><i><br /></i><i>Step 3 requires you to wait until it is complete:</i><br />
<ul>
<li><i>You can do a synchronous wait, by sitting and staring at the oven until it's done.</i></li>
<li><i>You can poll, by occasionally interrupting yourself to walk over to the oven and see if it's done yet.</i></li>
<li><i>You can use a signal/interrupt, when the oven is ready, regardless of whether you are ready to handle the interruption (but note: you know that the oven is done without having to walk over and check it).</i></li>
<li><i>Or you can use asyncio, where you occasionally interrupt yourself and, when you do, the oven will make some noise if it has finished. (and if you never interrupt yourself, the oven never makes a sound)</i></li>
</ul>
<i>This last option is most efficient for you, because you aren't interrupted at awkward times (i.e. greatly reduced need for locking on shared state) but you also don't have to walk all the way over to the oven to check whether it is done. You pause, listen, and get straight back to work if the oven is still going. That's the core feature of asyncio - not the networking or subprocess support - the ability to be notified efficiently that a task is complete without being interrupted by that notification.</i><i><br /></i><i><br /></i><i>Now let's expand this to making 3 cakes in parallel to see how "parallelism" works. Since there's so much going on, we'll create a TODO list:</i><br />
<ol>
<li><i>Make cake #1</i></li>
<li><i>Make cake #2</i></li>
<li><i>Make cake #3</i></li>
</ol>
<i>(This means we've started three tasks to the current event loop. It's likely these are three external requests from clients, such as HTTP requests. It is possible, though not common in my experience, for production software to explicitly start with multiple tasks like this. More common is to have one task and a UI event loop that injects UI events as necessary.)</i><i><br /></i><i><br /></i><i>Task 1 is the obvious place to start, so we take that off the TODO list and start working on it. The steps to make cake #1 are:</i><br />
<ul>
<li><i>Gather ingredients for cake #1</i></li>
<li><i>Mix ingredients for cake #1</i></li>
<li><i>Bake cake #1</i></li>
</ul>
<i>Gathering ingredients is a synchronous operation (`def gather_ingredients()`) so we do that until we've gathered everything.</i><i><br /></i><i><br /></i><i>Mixing ingredients is a long, interruptible operation (`async def mix_ingredients()`, with occasional explicit `await yield()` or whatever syntax was chosen for this), so we start mixing and then pause. When we pause, we put our current task on the TODO list:</i><br />
<ol>
<li><i>Make cake #2</i></li>
<li><i>Make cake #3</i></li>
<li><i>Continue mixing cake #1</i></li>
</ol>
<i>We see that our next task is to make cake #2, so we repeat the steps above and eventually pause while we're mixing. Now the TODO list looks like:</i><br />
<ol>
<li><i>Make cake #3</i></li>
<li><i>Continue mixing cake #1</i></li>
<li><i>Continue mixing cake #2</i></li>
</ol>
<i>And this continues. (Note that selecting which task to continue with is a detail of the event loop you're using. Check the spec to see whether some tasks have a higher priority or what order tasks are continued in. And bear in mind that so far, we've only used explicit yields - "I'm ready to do something else now if something needs doing".)</i><i><br /></i><i><br /></i><i>Eventually we will finish mixing one of the cakes, let's say it's cake #1. We will put it in the oven (`await put_in_oven()`) and then check the TODO list for what we should do next. There's nothing for us to do with cake #1, so our TODO list looks like:</i><br />
<ol>
<li><i>Continue mixing cake #2</i></li>
<li><i>Continue mixing cake #3</i></li>
</ol>
<i>Eventually, the oven will finish baking cake #1 and will add its own item to the TODO list:</i><br />
<ol>
<li><i>Continue mixing cake #2</i></li>
<li><i>Continue mixing cake #3</i></li>
<li><i>Cake #1 is ready</i></li>
</ol>
<i>When we take a break from mixing cake #2, we will continue mixing cake #3 (again, depending on your event loop's policy with regards to prioritisation). When we take a break from mixing cake #3, "Cake #1 is ready" will be the top of our TODO list and so we will continue with the statement following where we awaited it (it probably looked like `await put_in_oven(); remove_from_oven()` or maybe `baked_cake = await put_in_oven(mixed_ingredients)`).</i><i><br /></i><i><br /></i><i>Eventually our TODO list will be empty, and so we will sit there waiting for something to appear on it (such as another incoming request, or an oven adding a "remove cake" item).</i><i><br /></i><i><br /></i><i>Processes and threads only really enter into asyncio as a "thing that can post messages back to my TODO list/event loop", while asyncio provides an efficient mechanism for interleaving (not parallelising) multiple tasks throughout an entire application (or a very significant self-contained piece of it). The parallelism only comes when all the main thread has to do for a particular task is wait, because another thread/process/service/device/etc. is doing the actual work.</i><i><br /></i><i><br /></i><i>-----</i><i>"But I still have a question: why can't we use threads for the cakes? 
(1 cake = 1 thread)."</i><i><br /></i><i><br /></i><i>Because that is the wrong equality - it's really 1 baker = 1 thread.</i><i><br /></i><i><br /></i><i>Bakers aren't free, you have to pay for each one (memory, stack space), it will take time for each one to learn how your bakery works (startup time), and you will waste some of your own time coordinating them (interthread communication).</i><i><br /></i><i><br /></i><i>You also only have one set of baking equipment (the GIL), buying another bakery is expensive (another process) and fitting more equipment into the current one is very complicated (subinterpreters).</i><i><br /></i><i><br /></i><i>So you either pay a high price for 2 bakers = 2 cakes, or you accept 2
bakers = 1.5 cakes (in the same amount of time). It turns out that often 1 baker can do 1.5 cakes in the same time as well, and it's much easier to reason about and implement correctly.</i></blockquote>
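The three-cake walkthrough can be sketched with asyncio roughly as follows. This is my own illustration and not code from Steve's mails: all function names are taken loosely from the analogy, <code>await asyncio.sleep(0)</code> plays the role of the explicit "I'm ready to do something else" pause, and a short sleep stands in for the oven. It uses <code>asyncio.run</code>, which requires Python 3.7 or newer.<br />

```python
import asyncio


def gather_ingredients(cake):
    # Synchronous step: the baker is busy until everything is gathered.
    return f"ingredients for cake #{cake}"


async def mix_ingredients(cake, ingredients, log):
    # Long, interruptible step: pause after every stir so other cakes get a turn.
    for _ in range(3):
        log.append(f"mixing cake #{cake}")
        await asyncio.sleep(0)  # explicit pause, control goes back to the event loop
    return f"batter for cake #{cake}"


async def put_in_oven(batter):
    # The "oven" does the work; the baker only waits to be notified.
    await asyncio.sleep(0.01)
    return batter.replace("batter for", "baked")


async def make_cake(cake, log):
    ingredients = gather_ingredients(cake)
    batter = await mix_ingredients(cake, ingredients, log)
    baked = await put_in_oven(batter)
    log.append(f"cake #{cake} is ready")
    return baked


async def bakery():
    log = []
    # Put three "make cake" tasks on the TODO list of the event loop.
    cakes = await asyncio.gather(*(make_cake(n, log) for n in (1, 2, 3)))
    return cakes, log


cakes, log = asyncio.run(bakery())
print(cakes)
print(log)
```

The log shows the mixing steps of the three cakes interleaved on a single thread, which is the interleaving (not parallelising) described in the quote.<br />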
I further want to thank everybody who participated in discussing the matter and thus improved the survey enormously with a lot of patient explanations and insightful details.<br />
<br />
There will be <a href="http://srkunze.blogspot.com/2016/03/designing-xfork.html">another post</a> covering a <a href="https://pypi.python.org/pypi/xfork">module I wrote back then</a> and still actively maintain. It was and still is a way of digesting the concurrency matter in an attempt to improve usability and reduce boilerplate in Python.<br />
<br />
Best,<br />
Sven<br />
<br />
<b>My Python IDE Journey</b> (published 2016-02-19)<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td class="copyright-fotos2" style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiHFnx4O3gNprUoUeSK4Mri-LA8tBhc_b5G9xUVd5bhuq622NWt7lzTjpvgMDxepsHmaKkcL6s8J52hEirowWd4wAEa_odO_GZVVbUyHGcqz40qDQZRnMhllJuiJL8oFORaJoUHYBT4Myk/s1600/journey.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="424" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiHFnx4O3gNprUoUeSK4Mri-LA8tBhc_b5G9xUVd5bhuq622NWt7lzTjpvgMDxepsHmaKkcL6s8J52hEirowWd4wAEa_odO_GZVVbUyHGcqz40qDQZRnMhllJuiJL8oFORaJoUHYBT4Myk/s640/journey.png" width="640" /></a>
</td></tr>
<tr><td class="tr-caption" style="text-align: center;"><i>Pick one.</i></td></tr>
</tbody></table>
This post is not intended as advertising but to illustrate my journey to my currently used Python IDE. I tried several ones in recent years due to educational and professional needs as well as to satisfy my curiosity.<br />
<br />
<b>First Stop</b><br />
<br />
Everything starts with <a href="https://wiki.gnome.org/Apps/Gedit">gedit</a>, <a href="http://www.nano-editor.org/">nano</a> and <a href="http://www.vim.org/"><b>vim</b></a>, right? Not quite full IDEs but it's a start. You can at least write code and have some syntax highlighting available. To this day, a colleague of mine uses vim with tons of plugins featuring "go to definition", "find usages", "code completion", "project nav tree", etc. So, it's quite possible to work with simple editors and enhance them indefinitely.<br />
<br />
As you can imagine, I was looking for something else which goes beyond the venerable terminal. So, I started looking for an alternative with the following properties (in order of priority):<br />
<a name='more'></a><ul>
<li>out-of-the-box experience</li>
<li>mouse usage where appropriate</li>
<li>fewer keystrokes and mouse clicks</li>
<li>faster search without manual indexing</li>
<li>configurable executions (run unittests, run scripts, etc.)</li>
<li>debugging with the most important things at a glance</li>
<li>introspection</li>
</ul>
<div>
<br /></div>
<b>Second Stop</b><br />
<br />
I made my second stop at <a href="http://pythonhosted.org/spyder/"><b>Spyder</b></a>. It is an open-source project providing <b>almost all the basic needs</b> described above. In general it feels like Eclipse but is much simpler, cleaner and more thought-out to my taste.<br />
<br />
Spyder stands for <b>S</b>cientific <b>PY</b>thon <b>D</b>evelopment <b>E</b>nvi<b>R</b>onment and as expected is best suited for scientific tasks such as research. So, it handles small and informal scripts quite well for performing data transformations, evaluation and plotting diagrams.<br />
<br />
Spyder works reliably provided you've installed all necessary third-party dependencies such as <a href="https://pypi.python.org/pypi/pyflakes">pyflakes</a> and <a href="https://pypi.python.org/pypi/rope">rope</a>. An up-to-date list can be obtained from <a href="https://pythonhosted.org/spyder/installation.html#installing-or-running-directly-from-source">here</a>. Furthermore, if you are inclined to use <a href="http://www.numpy.org/">numpy</a> and <a href="http://www.scipy.org/">scipy</a> to live up to Spyder's name, you also need build tools available (at least my machine running Ubuntu 14.04 does). So, the out-of-the-box experience, though somewhat impaired, is still way above that of most multi-purpose editors.<br />
<br />
Further features that are not required but nice to have are available: an integrated profiler, static code analysis, custom color/font schemes and an object inspector (aka "show me the doc string").<br />
<br />
After a year or so of working with Spyder, the journey resumed in order to also satisfy the following emerging requirements:<br />
<!--more--><br />
<ul>
<li>integration of version control (such as local history and svn+git)</li>
<li>better usability and more convenience</li>
<li>integrated bash</li>
</ul>
<div>
<br /></div>
<b>Third Stop</b><br />
<br />
Another colleague of mine pointed out that JetBrains (the maker of <a href="https://www.jetbrains.com/resharper/">ReSharper</a>) had open-sourced their <a href="http://blog.jetbrains.com/pycharm/2013/10/pycharm-3-0-community-edition-source-code-now-available/">community edition of <b>PyCharm</b></a>. Since being open source was another requirement of mine, I gave it a try and fell in love with it instantly. The people at JetBrains just know their craft. PyCharm is a beautiful IDE with tons of features. It can handle professional workloads with massive amounts of files, yet is usable and convenient.<br />
<br />
It definitely brings you a solid out-of-the-box experience. Thus, if you don't care or don't want to bother with installing any third-party library just to get a decent IDE, <b>PyCharm is your choice</b>.<br />
<br />
In fact, the first-mentioned colleague got so inspired by PyCharm that he set out to give every possible vim plugin a try to replicate PyCharm's productivity features. He has not given up yet, and his efforts have brought him a massive increase in productivity even though he works in a regular terminal session.<br />
<br />
PyCharm also solved the missing SCM and local code history integration. Furthermore, its usability is beyond good and evil and anything I've seen so far. Using it makes me feel free and I have to admit not using it makes me feel a lot slower. But don't forget what I wrote about <a href="http://srkunze.blogspot.com/2016/01/boon-and-bane-of-ides.html">IDEs in general</a> last time. Last but not least, if you ever need a terminal, it's right there at your fingertips running in the correct directory.<br />
<br />
<b>Fourth Stop</b><br />
<br />
Quite recently, <a href="http://srkunze.blogspot.com/2016/01/next-programming-language.html">in an attempt at lifelong learning</a> I made a short trip to an interesting piece of technology: <a href="http://ipython.org/notebook.html" style="font-weight: bold;">IPython Notebook</a>. It's not a traditional Python IDE but it can improve your productivity given the right workload.<br />
<br />
Just think of it as an interactive Python session (where you can also execute blobs of Python code) which you are able to resume later. The notebook stores the Python source and its corresponding output (prints, tables, diagrams, ...) once executed. Thus you can examine the results later or re-execute the code if there have been some changes to the data.<br />
<br />
This way, it makes it a perfect tool for scientific usage and because of its simplicity it's even more suitable than Spyder in my opinion. If you want to give it a try and see if it would complement your current workflows, have a look at this <a href="http://pandas.pydata.org/pandas-docs/stable/tutorials.html">pandas cookbook</a> for some wild number crunching experience.<br />
<br />
See you next time around,<br />
Sven<br />
<br />
PS: After writing this post, I felt inclined to install Spyder again to see how far the project has come. It turns out it is actively maintained and the Spyder folks are making great progress. Still, the installation procedure feels brittle and quite manual, i.e. error-prone. So, I wish them good luck and hope we will see some serious competition in the Python IDE area.<br />
<br />
<b>Let's go down the rabbit hole!</b> (published 2016-02-18)<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td class="copyright-fotos2" style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh5PdR9F4NqAEvDqgn18kPa4JgW-v69CeZnkhjovsOUvdyNLjY6GO27SUVwkeCDCXgs9Za-5DcYJE635bRO4jX0vzwxx-U__4i4aguERen9lrXd5usvOinb2yNFcM9eGTl6auVfeX1nTKc/s1600/upside-down.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="390" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh5PdR9F4NqAEvDqgn18kPa4JgW-v69CeZnkhjovsOUvdyNLjY6GO27SUVwkeCDCXgs9Za-5DcYJE635bRO4jX0vzwxx-U__4i4aguERen9lrXd5usvOinb2yNFcM9eGTl6auVfeX1nTKc/s640/upside-down.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><i>Things can be topsy-turvy when considered upside down.</i></td></tr>
</tbody></table>
As mentioned in the <a href="http://srkunze.blogspot.com/2016/02/the-xheap-benchmark.html">previous post</a>, there is an interesting and at the same time weird piece of code duplication in <span style="font-family: "courier new" , "courier" , monospace;"><a href="https://pypi.python.org/pypi/xheap">RemovalHeap</a></span> and <span style="font-family: "courier new" , "courier" , monospace;"><a href="https://pypi.python.org/pypi/xheap">XHeap</a></span> that is necessary to make them work properly. This post will cover this oddity in depth.<br />
<br />
Imagine you want to count the number of items being set in a list. So, instead of providing a native list object, you write your own class like this:<br />
<a name='more'></a><br />
<pre style="background-color: #eeeeee; margin-left: 40px; margin-right: 40px;">class MyList(list):
    count = 0

    def __setitem__(self, key, value):
        self.count += 1
        super(MyList, self).__setitem__(key, value)

ml = MyList([0])
for i in range(10):
    ml[0] = -i
print(ml.count)  # prints 10
</pre>
<br />
Sounds good, right? Now, <span style="font-family: "courier new" , "courier" , monospace;">RemovalHeap</span> and <span style="font-family: "courier new" , "courier" , monospace;">XHeap</span> do exactly this when they keep track of the index of an item. Instead of counting, though, they store the new index in a dictionary. This is necessary for fast item removal, i.e. runtime of O(log n).<br />
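To illustrate the idea of storing indexes instead of counting, here is a stripped-down sketch. It is not the actual xheap source: the class name <code>IndexTrackingList</code> and the attribute <code>index_of</code> are made up, items are assumed to be hashable and unique, and only plain integer keys are handled.<br />

```python
class IndexTrackingList(list):
    """Hypothetical list that remembers where each item was last stored."""

    def __init__(self, items=()):
        super().__init__(items)
        # list.__init__ does not go through __setitem__, so fill the map here.
        self.index_of = {item: i for i, item in enumerate(self)}

    def __setitem__(self, key, value):
        # Record the new position of the item before storing it.
        # The entry of the overwritten item is not cleaned up in this sketch.
        self.index_of[value] = key
        super().__setitem__(key, value)


tl = IndexTrackingList(["a", "b", "c"])
tl[2] = "z"
print(tl.index_of["z"])
```

With such a map, the position of an item is known without scanning the list, which is what makes fast removal feasible in the first place.<br />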
<br />
So far so good and this could be the end of the story if that were the only necessity for fast item removal. However, it turned out it isn't.<br />
<br />
When we now apply <span style="font-family: "courier new" , "courier" , monospace;">heappop</span> on our counting class, it turns out it doesn't work anymore:<br />
<br />
<pre style="background-color: #eeeeee; margin-left: 40px; margin-right: 40px;">from heapq import heappop
ml = MyList(range(10))
heappop(ml)
print(ml.count) # print 0
</pre>
<br />
No changes at all? That cannot be right.<br />
<br />
It seems we don't jump back into Python code. You might say, this is due to the underlying C implementation of <span style="font-family: "courier new" , "courier" , monospace;">heappop</span> but have a look. Let's copy the Python <a href="https://hg.python.org/cpython/file/3.5/Lib/heapq.py#l135">source from <span style="font-family: "courier new" , "courier" , monospace;">heappop</span></a> and call it <span style="font-family: "courier new" , "courier" , monospace;">my_heappop</span>:<br />
<br />
<pre style="background-color: #eeeeee; margin-left: 40px; margin-right: 40px;">def my_heappop(heap):
    from heapq import _siftup
    lastelt = heap.pop()
    if heap:
        returnitem = heap[0]
        heap[0] = lastelt
        _siftup(heap, 0)
        return returnitem
    return lastelt
</pre>
<br />
As you can see, it just delegates work to <span style="font-family: "courier new" , "courier" , monospace;">_siftup</span> and that will do the trick:
<br />
<br />
<pre style="background-color: #eeeeee; margin-left: 40px; margin-right: 40px;">ml = MyList(range(10))
my_heappop(ml)
print(ml.count) # print 6
</pre>
<br />
Now, there have been 6 changes to our heap when popping off the first item.<br />
<br />
What's wrong here? As one can see, Python does in fact hop back and forth between intertwined Python and C code. Why the public API of heapq doesn't work as expected whereas the private one does is unclear to me. This peculiar behavior of <span style="font-family: "courier new" , "courier" , monospace;">heappop</span> also holds for <span style="font-family: "courier new" , "courier" , monospace;">heappush</span>, <span style="font-family: "courier new" , "courier" , monospace;">heapreplace</span> and <span style="font-family: "courier new" , "courier" , monospace;">heappushpop</span>. As usual I am open to suggestions and explanations.<br />
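One way to narrow this down is to check which of these functions are C builtins and which are plain Python functions in the interpreter at hand. The following sketch only inspects the function objects; it makes no statement about what the C code does internally. On a typical CPython build I would expect the public functions to show up as builtins and <code>_siftup</code> as a Python function, which would at least fit the observation above.<br />

```python
import heapq
import inspect

# Report for each function whether it is implemented in C or in Python.
for name in ("heappop", "heappush", "heapreplace", "heappushpop", "_siftup"):
    func = getattr(heapq, name)
    kind = "C builtin" if inspect.isbuiltin(func) else "Python function"
    print(name, "->", kind)

siftup_is_python = inspect.isfunction(heapq._siftup)
```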
<br />
In any case, that is the reason why <span style="font-family: "courier new" , "courier" , monospace;">RemovalHeap</span> and <span style="font-family: "courier new" , "courier" , monospace;">XHeap</span> duplicate some parts of heapq and now you know.<br />
<br />
Best,<br />
Sven<br />
<br />
NOTE: I could reproduce this behavior for Python 2.7.6 and Python 3.4.3.<br />
<br />
<b>The xheap Benchmark</b> (published 2016-02-16)<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td class="copyright-fotos2" style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEisZTsAHIOFrnUjXl3xtfK6XFVE2vdIRE918CHJkzCPC40L6l4m65Oj_mkZbNoAPv1dwhS4MFNEvkSmwwbxumDQ873z9YVZdlR9joiwc45jYeT8bKdZf80QX5nvY34ry83JQdmGMiBqtn4/s1600/measure.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="425" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEisZTsAHIOFrnUjXl3xtfK6XFVE2vdIRE918CHJkzCPC40L6l4m65Oj_mkZbNoAPv1dwhS4MFNEvkSmwwbxumDQ873z9YVZdlR9joiwc45jYeT8bKdZf80QX5nvY34ry83JQdmGMiBqtn4/s640/measure.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><i>These are the inlets of a steam engine. That means, it's time to perform some serious measurements!</i></td></tr>
</tbody></table>
<br />
We are going to compare <a href="https://pypi.python.org/pypi/xheap">xheap</a> and <a href="https://docs.python.org/3.5/library/heapq.html">heapq</a>. The benchmark suite can be found right by the <a href="https://github.com/srkunze/xheap">source</a>.<br />
<b><br /></b>
<b>The Competitors</b><br />
<ul>
<li>heapq - collection of heap functions of the Python stdlib written in C</li>
<li>xheap - object-oriented wrappers for heapq</li>
</ul>
<br />
<a name='more'></a><b>The Benchmark of </b><b>Runtime</b><br />
<br />
In order to make things comparable, I split the benchmark up into two benchmark cases. The two cases differ in that customizing the ordering of a heap always requires more work. The second benchmark takes that into account.<br />
Both cases run <span style="font-family: "courier new" , "courier" , monospace;">heapify</span>, <span style="font-family: "courier new" , "courier" , monospace;">pop</span> and <span style="font-family: "courier new" , "courier" , monospace;">push</span> 10,000 times and take the minimum of these runtimes (in <b><span style="color: #cc0000;">milliseconds</span></b>), which you see presented below. Runtimes can vary depending on which item is popped or pushed. Because of that, we start with a heap of size X and then pop or push X/32 items.<br />
Each case further has a baseline, which is obviously heapq. I used <a href="https://docs.python.org/3.5/library/timeit.html">timeit</a> as suggested <a href="https://mail.python.org/pipermail/python-list/2016-January/702571.html">here</a>; the benchmark ran on Python 3.4.<br />
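The measuring approach described above can be sketched as follows. This is not the actual benchmark suite from the repository; the function name <code>min_pop_runtime_ms</code> and the small default for <code>repeat</code> are my own choices to keep the example quick. It rebuilds a fresh heap before every repetition so that only the pops are timed.<br />

```python
import heapq
import random
import timeit


def min_pop_runtime_ms(size, repeat=5):
    state = {}

    def setup():
        # Build a fresh heap of the given size; this part is not timed.
        heap = [random.random() for _ in range(size)]
        heapq.heapify(heap)
        state["heap"] = heap

    def pop_some():
        # Timed part: pop size/32 items from the prepared heap.
        heap = state["heap"]
        for _ in range(size // 32):
            heapq.heappop(heap)

    times = timeit.repeat(pop_some, setup=setup, repeat=repeat, number=1)
    return min(times) * 1000  # seconds -> milliseconds


ms = min_pop_runtime_ms(1000)
print(ms)
```

Taking the minimum rather than the mean filters out disturbances from other processes running on the machine.<br />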
<br />
<b>1) heapq vs Heap vs RemovalHeap</b><br />
<br />
<table class="data-table">
<tbody>
<tr><th>operation</th><th></th><th>1,000 items</th><th>10,000 items</th><th>100,000 items</th><th>1,000,000 items</th></tr>
<tr><th>init</th><td>heapq</td><td>0.03 (1.00x)</td><td>0.41 (1.00x)</td><td>4.50 (1.00x)</td><td>64.83 (1.00x)</td></tr>
<tr><th></th><td>Heap</td><td>0.03 (1.02x)</td><td>0.42 (1.02x)</td><td>4.47 (0.99x)</td><td>64.70 (1.00x)</td></tr>
<tr><th></th><td>RemovalHeap</td><td>0.11 (3.35x)</td><td>1.32 (3.23x)</td><td>17.88 (3.98x)</td><td>222.94 (3.44x)</td></tr>
<tr><td colspan="6"></td></tr>
<tr><th>pop</th><td>heapq</td><td>0.01 (1.00x)</td><td>0.07 (1.00x)</td><td>0.86 (1.00x)</td><td>10.05 (1.00x)</td></tr>
<tr><th></th><td>Heap</td><td>0.01 (1.45x)</td><td>0.09 (1.33x)</td><td>1.08 (1.26x)</td><td>12.73 (1.27x)</td></tr>
<tr><th></th><td>RemovalHeap</td><td>0.27 (51.73x)</td><td>3.68 (52.28x)</td><td>44.35 (51.80x)</td><td>517.98 (51.54x)</td></tr>
<tr><td colspan="6"></td></tr>
<tr><th>push</th><td>heapq</td><td>0.00 (1.00x)</td><td>0.03 (1.00x)</td><td>0.34 (1.00x)</td><td>3.65 (1.00x)</td></tr>
<tr><th></th><td>Heap</td><td>0.01 (1.81x)</td><td>0.06 (1.83x)</td><td>0.61 (1.82x)</td><td>6.46 (1.77x)</td></tr>
<tr><th></th><td>RemovalHeap</td><td>0.08 (23.66x)</td><td>0.66 (19.65x)</td><td>6.48 (19.20x)</td><td>67.46 (18.46x)</td></tr>
</tbody></table>
As expected the bare C implementation outperforms any wrapper lib written in Python. However, the performance penalty is quite small compared to what we gain through cleaner code and better maintainability when using <span style="font-family: "courier new" , "courier" , monospace;">Heap</span>.<br />
<br />
The current implementation of <span style="font-family: "courier new" , "courier" , monospace;">RemovalHeap</span> definitely runs slower as it needs to keep track of changing indexes. Perhaps that is a good reason to rethink its implementation and switch to a mark-and-sweep approach. So, this benchmark is a good starting point for the future. Additionally, this shows that if you don't really need removal, you had better stick to <span style="font-family: "courier new" , "courier" , monospace;">Heap</span>.<br />
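For reference, the mark-and-sweep idea could look roughly like this. It is only a sketch of lazy deletion and not how xheap is implemented: the class <code>LazyRemovalHeap</code> and its methods are hypothetical, items are assumed to be hashable, and callers are assumed to remove only items that are currently in the heap.<br />

```python
import heapq


class LazyRemovalHeap:
    """Hypothetical heap that marks items as removed and skips them on pop."""

    def __init__(self, items=()):
        self._heap = list(items)
        heapq.heapify(self._heap)
        self._removed = {}  # item -> number of pending removals

    def push(self, item):
        heapq.heappush(self._heap, item)

    def remove(self, item):
        # Mark only; the item physically stays in the heap for now.
        self._removed[item] = self._removed.get(item, 0) + 1

    def pop(self):
        # Sweep: drop marked items until an unmarked one shows up.
        # Raises IndexError once the underlying list is exhausted.
        while True:
            item = heapq.heappop(self._heap)
            pending = self._removed.get(item, 0)
            if not pending:
                return item
            if pending == 1:
                del self._removed[item]
            else:
                self._removed[item] = pending - 1


heap = LazyRemovalHeap([5, 1, 3])
heap.remove(1)
first = heap.pop()
second = heap.pop()
print(first, second)
```

The trade-off is that removed items keep occupying memory until they reach the top, while no index bookkeeping is needed on every swap.<br />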
<br />
<b>2) heapq + tuples vs OrderHeap vs XHeap</b><br />
<br />
<table class="data-table">
<tbody>
<tr><th>operation</th><th></th><th>1,000 items</th><th>10,000 items</th><th>100,000 items</th><th>1,000,000 items</th></tr>
<tr><th>init</th><td>heapq</td><td>0.14 (1.00x)</td><td>1.60 (1.00x)</td><td>24.16 (1.00x)</td><td>272.64 (1.00x)</td></tr>
<tr><th></th><td>OrderHeap</td><td>0.17 (1.26x)</td><td>1.90 (1.19x)</td><td>27.20 (1.13x)</td><td>302.80 (1.11x)</td></tr>
<tr><th></th><td>XHeap</td><td>0.29 (2.16x)</td><td>3.18 (1.99x)</td><td>50.32 (2.08x)</td><td>553.61 (2.03x)</td></tr>
<tr><td colspan="6"></td></tr>
<tr><th>pop</th><td>heapq</td><td>0.01 (1.00x)</td><td>0.15 (1.00x)</td><td>1.82 (1.00x)</td><td>21.85 (1.00x)</td></tr>
<tr><th></th><td>OrderHeap</td><td>0.02 (1.75x)</td><td>0.23 (1.58x)</td><td>2.72 (1.49x)</td><td>30.59 (1.40x)</td></tr>
<tr><th></th><td>XHeap</td><td>0.28 (25.25x)</td><td>3.84 (25.98x)</td><td>46.32 (25.45x)</td><td>540.32 (24.73x)</td></tr>
<tr><td colspan="6"></td></tr>
<tr><th>push</th><td>heapq</td><td>0.00 (1.00x)</td><td>0.04 (1.00x)</td><td>0.54 (1.00x)</td><td>5.69 (1.00x)</td></tr>
<tr><th></th><td>OrderHeap</td><td>0.01 (4.00x)</td><td>0.15 (3.68x)</td><td>1.65 (3.08x)</td><td>16.50 (2.90x)</td></tr>
<tr><th></th><td>XHeap</td><td>0.04 (9.55x)</td><td>0.35 (8.79x)</td><td>3.85 (7.16x)</td><td>39.49 (6.94x)</td></tr>
</tbody></table>
Generally speaking, the relative overhead of <span style="font-family: "courier new" , "courier" , monospace;">OrderHeap</span> and <span style="font-family: "courier new" , "courier" , monospace;">XHeap</span> for providing custom orders diminishes as we hit bigger and bigger sizes. This makes sense as the overhead basically consists of creating and unpacking tuple values on the fly and super calls, both of which happen only once per operation; thus it is constant.<br />
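The "heapq + tuples" baseline refers to the common idiom of wrapping each item in a tuple together with its sort key. A minimal sketch, with helper names of my own choosing:<br />

```python
import heapq


def push_with_key(heap, item, key):
    # Wrap the item in a (key, item) tuple so heapq orders by the key first.
    heapq.heappush(heap, (key(item), item))


def pop_with_key(heap):
    # Unpack the tuple again and hand back only the item.
    return heapq.heappop(heap)[1]


heap = []
for word in ["banana", "fig", "cherry"]:
    push_with_key(heap, word, key=len)

shortest = pop_with_key(heap)
print(shortest)
```

Creating and unpacking one tuple per push or pop is the constant per-operation cost mentioned above. Note that items with equal keys are compared directly, so they need to be comparable themselves.<br />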
<div>
<br />
<b>The Benchmark of Removal</b></div>
<br />
This particular feature cannot be compared using heapq, <span style="font-family: "courier new" , "courier" , monospace;">Heap</span> and <span style="font-family: "courier new" , "courier" , monospace;">OrderHeap</span> since they simply don't feature removal. For future reference and for guidance, I still present the benchmark results comparing <span style="font-family: "courier new" , "courier" , monospace;">RemovalHeap</span> and <span style="font-family: "courier new" , "courier" , monospace;">XHeap</span> to each other. In an attempt to equalize item comparison, <span style="font-family: "courier new" , "courier" , monospace;">RemovalHeap</span> is fed with tuples this time. <span style="font-family: "courier new" , "courier" , monospace;">RemovalHeap</span> is the baseline.<br />
<br />
<hr style="font-family: 'Times New Roman';" />
<span style="font-family: "courier new" , "courier" , monospace;">
<b>operation</b> 1,000 items 10,000 </span><span style="font-family: "courier new" , "courier" , monospace;">items</span><span style="font-family: "courier new" , "courier" , monospace;"> 100,000 </span><span style="font-family: "courier new" , "courier" , monospace;">items</span><span style="font-family: "courier new" , "courier" , monospace;"> 1,000,000 </span><span style="font-family: "courier new" , "courier" , monospace;">items</span><br />
<hr />
<span style="font-family: "courier new" , "courier" , monospace;"><b>remove</b> RemoHeap 0.12 ( 1.00x) 1.26 ( 1.00x) 13.28 ( 1.00x) 149.50 ( 1.00x)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> XHeap 0.12 ( 0.94x) 1.19 ( 0.95x) 12.16 ( 0.92x) 124.02 ( 0.83x)</span><br />
<hr />
<div>
<b><br /></b>
<b>The Benchmark of Comparisons</b></div>
<br />
Regarding <span style="font-family: "courier new" , "courier" , monospace;">heapify</span>, <span style="font-family: "courier new" , "courier" , monospace;">pop</span> and <span style="font-family: "courier new" , "courier" , monospace;">push</span>, xheap does not perform even a single item comparison more than heapq. As you can infer from the <a href="https://github.com/srkunze/xheap">source</a>, Heap and OrderHeap leave it completely to heapq for efficiency reasons.<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">RemovalHeap</span> and <span style="font-family: "courier new" , "courier" , monospace;">XHeap</span>, on the other hand, have a peculiarity that requires them to perform at least some of the item comparisons on their own. However, the number of comparisons stays the same, as the source is a simple copy from heapq. <a href="http://srkunze.blogspot.com/2016/02/lets-go-down-rabbit-hole.html">Another post</a> will cover why this is necessary.<br />
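If you want to check comparison counts yourself, one way is to wrap the items in a class that counts every call to its less-than method. The following is only a sketch against plain heapq; the class name Counted is made up for illustration and is not part of xheap.<br />

```python
import heapq


class Counted:
    """Wraps a value and counts every '<' comparison made on it."""

    comparisons = 0  # shared counter for all instances

    def __init__(self, value):
        self.value = value

    def __lt__(self, other):
        Counted.comparisons += 1
        return self.value < other.value


items = [Counted(v) for v in range(1000, 0, -1)]

# Count the comparisons performed by heapify.
heapq.heapify(items)
heapify_count = Counted.comparisons

# Reset the counter and count the comparisons of a single pop.
Counted.comparisons = 0
smallest = heapq.heappop(items)
pop_count = Counted.comparisons

print("heapify:", heapify_count, "pop:", pop_count)
```

Feeding the same wrapped items to two heap implementations and comparing the counters shows whether one of them compares more often than the other.<br />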
<b><br /></b>
<b>Conclusion</b><br />
<br />
As suspected in the <a href="http://srkunze.blogspot.com/2016/01/fast-object-oriented-heap-implementation.html">previous post</a>, both convenience and features come at a price. So, in order to preserve your initial goal when using heaps (= speed), your best choice is the variant with the fewest features you can get away with. Most of the overhead is constant per operation since it consists of wrapper, <span style="font-family: "courier new" , "courier" , monospace;">super</span> and descriptor calls. Given the current <a href="https://mail.python.org/pipermail/python-dev/2016-January/142945.html">optimization efforts of CPython</a>, this kind of overhead may be reduced even further without changing xheap itself.<br />
<br />
I would like to express my gratitude to the Python community for providing heapq. Without this amazing library, the development of xheap would not have been possible.<br />
<br />
As usual, I am open to suggestions on how to improve the benchmark. To see how xheap performs on your machine, you can simply execute <span style="font-family: "courier new" , "courier" , monospace;">test_xheap_time.py</span> from the <a href="https://github.com/srkunze/xheap">xheap repository</a>.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td class="copyright-fotos2" style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgxIxwH4RcG8LCNRB8X4reHrXVdMGA-As5BylodaPdiJ_buVlkVxfR9a4rle80wdua0dk9rm2HeusTOr3Vo2AmQZepAcqGkfNPPwiJFPDkayUX-_tPdxw6UY_nATUeKxe2VoD2EAltF7N8/s1600/machine.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="428" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgxIxwH4RcG8LCNRB8X4reHrXVdMGA-As5BylodaPdiJ_buVlkVxfR9a4rle80wdua0dk9rm2HeusTOr3Vo2AmQZepAcqGkfNPPwiJFPDkayUX-_tPdxw6UY_nATUeKxe2VoD2EAltF7N8/s640/machine.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><i>A beautiful steam engine running at full speed. <a href="http://fotosquadrat.blogspot.com/2016/02/mit-volldampf-voraus.html">You can read more about it here</a>.</i></td></tr>
</tbody></table>
Best,<br />
Sven<br />
Fast Object-Oriented Heap Implementation (published 2016-01-30, updated 2016-03-07)<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td class="copyright-fotos2" style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhx7YCjAMgO0FWRZ7VOk818c6VAHf497WLjyqU1M_5_edRPhw2vpbcF47cPKJEKuNjV0ez_8iyUfV_2ZBP-cx71O4Ny3NnJP5H4cC7HOtN17lGyDNWB1DrE9OOWSPc3BEyABmbYcqcgOvE/s1600/light.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="426" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhx7YCjAMgO0FWRZ7VOk818c6VAHf497WLjyqU1M_5_edRPhw2vpbcF47cPKJEKuNjV0ez_8iyUfV_2ZBP-cx71O4Ny3NnJP5H4cC7HOtN17lGyDNWB1DrE9OOWSPc3BEyABmbYcqcgOvE/s640/light.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><i>There comes light to the darkness of your heaps.</i></td></tr>
</tbody></table>
<br />
This is the third post in a series of heap-related posts. See <a href="http://srkunze.blogspot.de/2016/01/heaps-in-python.html">here</a> and <a href="http://srkunze.blogspot.de/2016/01/heapq-and-missing-features.html">here</a> for the back story.<br />
<br />
In the last post, we found the <a href="https://docs.python.org/3.5/library/heapq.html">heapq module</a> lacking important features. Average Joe Dev doesn't want to clutter up his source code and re-implement the same features all over the place to rectify the shortcomings of heapq. Understandably, Python core devs don't want to compromise on the performance of heapq either; being fast is the mission of a heap.<br />
<br /><a name='more'></a>
One issue with all proposals and implementations I've seen so far is either their reduced feature set or their lack of performance. So, I set out to fix that once and (I hope) for all.<br />
<b>What's different this time?</b> Not much, I suppose, except that I made the following observation. Past implementations provided a single feature-enhanced and object-oriented implementation. So, you get object orientation but with an unnecessary slowdown. One solution would be to use heapq alongside such a feature-rich heap class. Then again, using different interfaces for almost the same thing usually produces headaches in production. So, this approach isn't quite optimal.<br />
<br />
If you need a feature, you will implement the corresponding logic (and slowness) anyway. From what I can tell, cluttering your source does not make you or your program faster in the long run. Thus, hiding complexity in a feature-specific class makes sense. Taking into account the performance penalty produced by more features, it makes further sense to split things up into different implementations.<br />
<br />
So, I came to the conclusion that this problem cannot be solved by a single heap implementation but by a feature-complete heap suite: <b>different classes with the same base interface for different use-cases</b>.<br />
<br />
<a href="https://pypi.python.org/pypi/xheap">xheap is such a heap suite</a>. I have also set up a <a href="https://github.com/srkunze/xheap">repo on GitHub</a> to serve as an issue tracker. The remainder of this post examines each class provided by xheap; the next post will investigate their performance.<br />
<h2>
Heap</h2>
The heap interface is pretty standard these days:<br />
<ul>
<li><b>peek</b> - get first item</li>
<li><b>push</b> - insert an item</li>
<li><b>pop</b> - remove first item</li>
<li><b>pushpop</b> - push then pop; but faster</li>
<li><b>poppush/replace</b> - pop then push; but faster</li>
</ul>
<div>
You somehow also expect it to work like an array/list (cf. <a href="http://srkunze.blogspot.com/2016/01/heaps-in-python.html">heap invariant</a>). All variants of <span style="background-color: #eeeeee; font-family: monospace;">Heap</span> will support that interface. Some demo code:<br />
<pre style="background-color: #eeeeee; margin-left: 40px; margin-right: 40px;">from xheap import Heap
heap = Heap([5, 4, 3, 1]) # make heap from list
heap.push(2)
heap.pop() # returns 1
heap.peek() # returns 2
heap[0] # returns 2
heap.pushpop(6) # returns 2
heap.replace(7) # returns 3
</pre>
<br /></div>
<div>
<b>Benefit?</b> It's object oriented! And as fast as heapq since it's a thin wrapper.<br />
<br />
You can replace it with a more feature-rich (and potentially slower) version if needed. When that happens, the only thing you need to change is which heap class you instantiate.</div>
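To illustrate what "thin wrapper" means here, the following is a minimal sketch of such a class on top of heapq. It is not xheap's actual source (see the repo for that); the class name MiniHeap and the choice to hold the list in an attribute are assumptions made for this example.<br />

```python
import heapq


class MiniHeap:
    """Hypothetical thin object-oriented wrapper around heapq."""

    def __init__(self, iterable=()):
        self._items = list(iterable)
        heapq.heapify(self._items)  # establish the heap invariant

    def __len__(self):
        return len(self._items)

    def __getitem__(self, index):
        return self._items[index]  # list-like read access, e.g. heap[0]

    def peek(self):
        return self._items[0]

    def push(self, item):
        heapq.heappush(self._items, item)

    def pop(self):
        return heapq.heappop(self._items)

    def pushpop(self, item):
        return heapq.heappushpop(self._items, item)

    def replace(self, item):
        return heapq.heapreplace(self._items, item)


heap = MiniHeap([5, 4, 3, 1])
heap.push(2)
first = heap.pop()
```

Every method delegates directly to the corresponding heapq function, so the only extra cost per operation is one method call.<br />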
<h2>
OrderHeap</h2>
You guessed it: you can specify how items are compared. The <span style="background-color: #eeeeee; font-family: monospace;">key</span>-parameter (analogous to builtin functions <span style="background-color: #eeeeee; font-family: monospace;">max</span> or <span style="background-color: #eeeeee; font-family: monospace;">sorted</span>) is specified during heap initialization:<br />
<pre style="background-color: #eeeeee; margin-left: 40px; margin-right: 40px;">from xheap import OrderHeap
heap = OrderHeap(key=lambda x: -ord(x)) # define the order
heap.push('a')
heap.push('z')
heap.push('t')
heap.pop() # returns 'z'</pre>
<br />
<b>Benefit?</b> New items are automatically wrapped into tuples by which the heap is sorted internally. Otherwise, you would have to re-implement that in all places (without making a mistake, btw.), so the heap class is the natural home for such logic.<br />
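<br />
For comparison, this is roughly what the manual version looks like with plain heapq: every push has to build the tuple and every pop has to unpack it again. It is only a sketch of the pattern, not xheap's internals; the function name key is chosen to mirror the key parameter above.<br />

```python
import heapq


def key(item):
    # Same order as in the OrderHeap demo: later letters come first.
    return -ord(item)


heap = []
for item in ['a', 'z', 't']:
    # Wrap by hand: the heap is sorted by the first tuple element.
    heapq.heappush(heap, (key(item), item))

# Unwrap by hand: throw away the key, keep the item.
_, first = heapq.heappop(heap)
print(first)  # prints z
```

Forgetting the wrapping or unwrapping in just one place is enough to break the order, which is why it is nicer to have it inside the heap class.<br />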
<br />
Why not set the key each time you push an item? Right now, I can't imagine such a use-case. From what I can tell, the key is always derived from the item itself. So, a tuple might suffice for now; if you feel that's not quite true for your project, let me know so I can tweak OrderHeap to your needs.<br />
<br />
Btw. <a href="https://github.com/srkunze/xheap/pulls">pull requests</a> are welcome as well. ;-)<br />
<h2>
RemovalHeap</h2>
Pretty obvious as well: you can remove an item from anywhere in the heap without manually keeping track of indexes.<br />
<br />
<b>Benefit?</b> Keeping track of indexes is not easy to implement, especially outside of a class. It further requires special tweaks to <span style="background-color: #eeeeee; font-family: monospace;">Heap.pop</span> to enable removing an item from the middle of the heap.<br />
<br />
There is an alternative approach using periodic sweeps but hey, the concrete implementation is encapsulated within the heap class. So, if I ever feel like changing it for a better approach, nobody will notice.<br />
<br />
It's demo time:
<br />
<pre style="background-color: #eeeeee; margin-left: 40px; margin-right: 40px;">from xheap import RemovalHeap
heap = RemovalHeap(['z', 'u', 'd', 'a'])
heap.remove('d')</pre>
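<br />
To give an idea of what the sweep-style alternative mentioned above could look like, here is a small sketch based on lazy deletion: remove only marks an item as gone, and stale entries are thrown away when they reach the top. This is not how RemovalHeap is implemented; the class name LazyRemovalHeap is made up, and the sketch assumes unique, hashable items.<br />

```python
import heapq


class LazyRemovalHeap:
    """Hypothetical heap with removal via lazy deletion (not part of xheap)."""

    def __init__(self, iterable=()):
        self._items = list(iterable)
        heapq.heapify(self._items)
        self._live = set(self._items)  # items that have not been removed

    def push(self, item):
        if item in self._live:
            raise ValueError("items must be unique")
        self._live.add(item)
        heapq.heappush(self._items, item)

    def remove(self, item):
        # Only forget the item; its entry stays in the list until it surfaces.
        self._live.remove(item)  # raises KeyError if the item is not present

    def pop(self):
        while self._items:
            item = heapq.heappop(self._items)
            if item in self._live:
                self._live.remove(item)
                return item
            # Otherwise it is a stale entry of a removed item: skip it.
        raise IndexError("pop from empty heap")


heap = LazyRemovalHeap(['z', 'u', 'd', 'a'])
heap.remove('d')
remaining = [heap.pop() for _ in range(3)]
```

The trade-off is that removed items keep occupying memory until they are popped, whereas index tracking removes them right away.<br />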
<h2>
XHeap</h2>
For the problem described <a href="http://srkunze.blogspot.de/2016/01/heapq-and-missing-features.html">here</a>, I need both properties. Thus, XHeap is basically a combination of OrderHeap and RemovalHeap: you can remove items AND you can define arbitrary orders.<br />
<br />
What's left is investigating how fast or slow each class is compared to Python's original heapq module. <a href="http://srkunze.blogspot.com/2016/02/the-xheap-benchmark.html">To be continued ...</a><br />
<br />
Best,<br />
Sven