Is that Python code of yours running a little slow? Are you thinking about rewriting the algorithm, or maybe even rewriting it in another language? Well, before you do, you'll want to listen to what Davis Silverman has to say about speeding up Python code using profiling. This is show number 28, recorded Wednesday, September 16th, 2015.

Welcome to Talk Python To Me, a weekly podcast on Python: the language, the libraries, the ecosystem, and the personalities. This is your host, Michael Kennedy. Follow me on Twitter, where I'm @mkennedy. Keep up with the show and listen to past episodes at talkpython.fm, and follow the show on Twitter via @talkpython. This episode is brought to you by Hired and CodeShip. Thank them for supporting the show on Twitter via @Hired_HQ and @CodeShip.

There's nothing special to report this week, so let's get right to the show with Davis.

Let me introduce Davis. Davis Silverman is currently a student at the University of Maryland, working part-time at the HumanGeo Group. He writes mostly Python, with an emphasis on performant, Pythonic code. Davis, welcome to the show.

Hello.

Thanks for being here. I'm really excited to talk to you about how you made some super slow Python code much, much faster using profiling. You work at a place called the HumanGeo Group.
You guys do a lot of Python there, and we're going to spend a lot of time talking about how you took some of your social media data, real-time analytics type stuff, built that in Python, and improved it using profiling. But let's start at the beginning. What's your story? How did you get into programming and Python?

When I was a kid, I grew up with the internet. I was lucky, and I was very into computers, and my parents were very happy with me building and fixing computers for them. So by the time high school came around, I took a programming course, and it was Python, and I fell in love with it immediately. I've been programming in Python ever since, since sophomore year of high school, so it's been quite a few years now.

I think all of us programmers unwittingly become tech support for our families and whatnot, right?

Oh, yeah. For my entire family, I'm that guy.

Yeah, I try to not be that guy, but I end up being that guy a lot. So you took Python in high school. That's pretty cool that they offered Python there. Did they have other programming classes as well?

Yeah. I live in the DC metro area, in Montgomery County. It's a very nice county, and the schools are very good. Luckily, the intro programming course was taught by a very intelligent teacher, so she taught us all Python.
And then the courses after that were Java courses: the college-level Advanced Placement Java, and then a data structures class after that. So we got to learn a lot about the fundamentals of computer science in those classes.

Yeah, that's really cool. I think I got to take BASIC when I was in high school, and that was about it. It was a while ago.

I wrote a BASIC interpreter once, but it wasn't very good.

Cool. So before we get into the programming stuff, maybe you could just tell me: what is the HumanGeo Group? What do you guys do?

Yeah. The HumanGeo Group is a small government contractor. We deal mostly in government contracts, but we have a few commercial projects and ventures, which is what I was working on over the summer and what we'll be talking about. We're a great little company in Arlington, Virginia, and we actually just won an award as one of the best places to work in the DC metro area for millennials and younger people.

If you go to thehumangeo.com, you guys have a really awesome webpage. I really like how it is. It's like, bam, here we are, we're about data, and it has this kind of live page. So many companies want to get their CEO statement and all the marketing stuff and the news up front. You guys are just like, look, it's about data. That's cool.

Yeah. There was a website rewrite recently, I don't remember how recent, and one guy decided to take the time to really do something that shows some geographic stuff.
So he used Leaflet.js, which we do a lot of open source work with on our GitHub page, and he made it very beautiful. There are even icons of all the people at HumanGeo. I think it's much better than a generic contractor site, like you said. It's much more energetic.

It looks to me like you do a lot with social media and sentiment analysis, and tying that to location, the geo part, right? What's the story there? What kind of stuff do you guys do?

Yeah. One of the things I was working on: one of our customers is a billion-dollar entertainment company. You've probably heard of them; I think we talk about them on our site. What we do is analyze various social media sites, like Reddit and Twitter and YouTube. We gather geographic data, if it's available, and we gather sentiment data using specific algorithms from things like the Natural Language Toolkit, which is an amazing Python package. Then we show it to the user in a very nice website that we've created.

So you work for this entertainment company as well as being a government contractor. What is the government interested in with all this data? The US government, that is, for the international listeners.

Yeah, it's definitely the United States government. We do less social media analysis for the government. We do some, but it's not what people think the NSA does, definitely.
I think, you know, it's just like anything a company would want: you'd search on something, and it would show, oh, there are Twitter users talking about this in these areas.

Yeah. I guess the government would want to know that, especially in emergencies or things like that, possibly, right?

Yeah. We also do some platform stuff: we create certain platforms for the government that aren't necessarily social media related.

Right, sure. So how much of that is Python, and how much is other languages?

At HumanGeo, we do tons of Python on the backend. For some of the government stuff we do Java, which is big in government, obviously. But we definitely have a lot of Python. We use it a lot in the pipeline, for various tools and things that we use internally and externally at HumanGeo. The project I was working on is exclusively Python: all parts of the pipeline for gathering data and representing data, the backend server and the frontend server. That was all Python.

Right. So it's pretty much Python end to end. I don't know the specific project you were working on, but it looks like there's very heavy D3, fancy JavaScript stuff on the front end for the visualization. But other than that, it was more or less Python, right?

Yeah. We do have some amazing JavaScript people. They do a lot of really fun-looking stuff.

Yeah, you can tell it's a fun place for data display and front-end development. That's cool.
So, Python 2 or Python 3?

We use Python 2, but when I was working on the code base, I was definitely writing Python 3-compatible code using the proper __future__ imports. And I was testing it on Python 3. We're probably closer to Python 3 than a lot of companies are; we just haven't expended the time to do it. We probably will in 2018, when Python 2 is nearing end of life.

That seems like a really smart way to go. Did you happen to profile it under both CPython 2 and CPython 3?

I didn't. It doesn't fully run on CPython 3 right now. I wish I could.

It would just be really interesting, since you spent so much time looking at the performance, if you could compare those. But yeah, if it doesn't run...

That would be interesting. You're right. I wish I could know that.

I suspect that most people know what profiling is, but there's a whole diverse set of listeners, so maybe we could just say really quickly: what is profiling?

Yeah. Profiling, in any language, is gathering statistics about what's running in your program: for example, how many times a function is called, or how long a section of code takes to run. It's a statistical thing. You get a profile of your code, and you see all the parts of your code that are fast or slow, for example.

You wrote a really interesting blog post, and we'll talk about that in just a second.
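As a concrete sketch of that idea, here is a minimal example of collecting a profile with the standard library's cProfile and pstats modules; `slow_sum` is just an invented stand-in for real work:

```python
import cProfile
import io
import pstats

def slow_sum(n):
    # Invented stand-in for real work: a plain arithmetic loop
    total = 0
    for i in range(n):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
slow_sum(100_000)
profiler.disable()

# Gather the report into a string; sort by cumulative time
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats()
report = buf.getvalue()
```

Each row of the report shows the call count, the time spent inside the function itself, and the cumulative time including its callees.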
And I think, like all good profiling articles, you point out what I consider to be the first rule of profiling, or really the first rule of optimization, profiling being the tool that gets you there: don't prematurely optimize your stuff, right?

Yeah, definitely.

You know, I've spent a lot of time thinking about how programs run, and why this is slow or that is fast, worrying about this little part or that little part, and most of the time it just doesn't matter. Or if it is slow, it's slow for reasons that were unexpected, right?

Yeah, definitely. I always make sure that there's a legitimate problem to be solved before spending time on something like profiling or optimizing a code base.

Definitely. So let's talk about your blog post, because it went around on Twitter in a pretty big way, and on the Python newsletters and so on. I read it and thought, oh, this is really cool, I should have Davis on and we should talk about this. So, on the HumanGeo blog, you wrote a post called "Profiling in Python," right? What motivated you to write that?

Yeah. When I was working on this code, my coworkers and my boss said, you know, we have this pipeline for gathering the data that this specific customer gives us, and we ran it at about 2 a.m. every night. The problem was that it took 10 hours to run. It was doing very heavy text processing, which I'll talk about more later, I guess.
It was doing a lot of text processing, which ended up taking 10 hours, and they looked at it and said, you know, it's updating at noon every day, and the workday starts at about nine. So we should probably try to fix this and get it running faster. Davis, you should totally look at this as a great first project.

Here's your new project, right?

Yeah. It was like, here's day one. Okay, make this thing 10 times faster. Go.

Yeah. Oh, is that all you want me to do?

Yeah. So the first thing I did was profile it, which is the first thing you should do, to make sure that there's actually a hotspot. I ran through the process that I talk about in the post, the tools I used. And I realized that it wasn't a simple thing for me to do all this; I don't do this often. And I figured that a lot of people, like you said, maybe don't know about profiling or haven't done this. So I said, you know, if I write a blog post about this, hopefully somebody else won't have to Google 20 different posts and read all of them to come up with one coherent idea of how to do this, in one fell swoop.

Right. Maybe you can just lay it out: these are the five steps you should probably start with. And, you know, profiling is absolutely an art more than it is engineering, but at least having steps to follow is super helpful. You started out by talking about the CPython distribution, and I think it'd be interesting to talk a little bit about the alternative implementations as well, because you did talk about that a bit.
Yep. You said there are basically two profilers that come with CPython: profile and cProfile.

Yeah. The two profilers that come with Python, as you said, profile and cProfile, have the same exact interface and collect the same statistics. But the idea behind profile is that it's written in Python and is portable across Python implementations, hopefully, while cProfile is written in C, so it's pretty specific to a C interpreter, such as CPython, or even PyPy, because PyPy is very interoperable.

Right, it does have that interoperability layer. So maybe if we were using Jython, or IronPython, or Pyston or something, we couldn't use cProfile.

Yeah, I wouldn't say that you can.

I'm just guessing. I haven't tested it.

I would say that for Jython and IronPython, you could use the standard Java or .NET profilers instead. I'm pretty sure those will work just fine, because they're known to work great with their respective ecosystems.

Do you know whether there's a significant performance difference? Let me take a step back. It seems to me, when I've done profiling in the past, that it's a little bit like the Heisenberg uncertainty principle: by the fact of observing a thing, you've altered it, right? When you run your code under the profiler, it might not behave with the same exact timings as if it were running natively, but you can still get a pretty good idea, usually. So is there a difference between cProfile and profile in that regard?
Oh yeah, definitely. Profile is much slower; it has much higher overhead than cProfile, because it has to do a lot more work in Python. Python exposes the interpreter internals in some Python modules, but they're a lot slower than using straight C and getting right at it. So if you're using CPython or PyPy, I'd recommend using cProfile instead, because it has much lower overhead and gives you much better numbers to work with, numbers that make more sense.

Okay, awesome. So how do I get started? Suppose I have some code that's slow. How do I run it under cProfile?

Yeah. So the first thing that I would do... I'm pulling up the blog post right now, just to see.

And I'll be sure to link to that in the show notes as well, so people can just jump to the show notes and pull it up.

Yeah. One of my favorite things about cProfile is that you can call it using the python -m syntax, the module syntax, and it will print your profile to standard out when it's done. It's super easy. All you need to do is give it your main Python file and it'll run it, and at the end of the run, it'll give you the profile. It's super simple, and that's one of the reasons the blog post was so easy to write.

Yeah, that's cool. So by default, it gives you basically the total time spent in all of the methods, looking at the call stack and stuff, the number of times each was called, stuff like that, right? The cumulative time, the individual time per call, and so on.
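The python -m invocation described here can be sketched as follows; the throwaway script and its `work` function are made up for the demonstration, and are written to a temp file purely so the example is self-contained:

```python
# Equivalent of running at the shell:
#   python -m cProfile -s cumtime your_script.py
import os
import subprocess
import sys
import tempfile
import textwrap

# A throwaway demo script
script = textwrap.dedent("""\
    def work():
        return sum(i * i for i in range(50_000))

    if __name__ == "__main__":
        work()
""")

with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write(script)
    path = f.name

try:
    # -s cumtime sorts the printed report by cumulative time
    out = subprocess.run(
        [sys.executable, "-m", "cProfile", "-s", "cumtime", path],
        capture_output=True, text=True, check=True,
    ).stdout
finally:
    os.unlink(path)
```

In everyday use you would just run the shell command directly against your real entry-point file; no temp file needed.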
It gives you the defaults, like you said; you're correct. And it's also really easy to give it a sorting argument. That way, you can sort on how many times something is called: if it's called 60,000 times, that's probably a problem in a 10-minute run. Or it could be called only twice but take an hour to run, which would be very scary. In which case, you want to sort it both ways; you want to sort it every way, so you can see everything, just in case you're missing something important.

Right. You want to slice it a lot of different ways. How many times was it called? What was the maximum individual time, the cumulative time, all those types of things. Maybe you're calling a database function, right? And maybe that's just got tons of data.

Yeah, it's slow. Maybe the database is slow, and so that says, well, don't even worry about your Python code. Just go make your database fast, or use Redis for caching or something, right?

Or, yeah, work on your query. Maybe you can make it a distinct query and get a much smaller data set coming back to you.

Yeah, absolutely. It could just be a smarter query you're doing. So this is all pretty good. This output that you get is pretty nice. But in a real program, with many, many functions and so on, it can be pretty hard to understand, right?

Yes, definitely. When I was working on this, I had the same issue. There were so many lines of code.
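Sorting the same profile several ways, as suggested here, is one line per view with the standard pstats module; `hot` and `main` below are invented examples:

```python
import cProfile
import io
import pstats

def hot():
    # Called many times, so it stands out when sorting by call count
    return [str(i) for i in range(5_000)]

def main():
    for _ in range(50):
        hot()

prof = cProfile.Profile()
prof.runcall(main)

buf = io.StringIO()
stats = pstats.Stats(prof, stream=buf)
# Same data, two views: by number of calls, then by cumulative time
stats.sort_stats("ncalls").print_stats(5)
stats.sort_stats("cumtime").print_stats(5)
report = buf.getvalue()
```

Other useful sort keys include "tottime" (time inside the function itself, excluding callees), which catches the "called twice but takes an hour" case directly.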
It was filling up my terminal, you know, and I had to save it to an output file, and that was too much work. So I was searching around more, and I found pycallgraph, which is amazing at showing you the flow of a program. It gives you a great graphical representation of what cProfile is also showing you.

That's awesome. So it's kind of like a visualizer for the cProfile output, right?

Yeah. It even colors things in: the more red a call is, the hotter it is, the more times it runs and the longer it runs.

Yeah, that's awesome. So just pip install pycallgraph to get started, right?

Super simple. It's one of the best things about pip as a package manager.

Yeah, I mean, that's part of why Python is awesome, right? pip install whatever. Anti-gravity.

So then you invoke it: you say basically pycallgraph, and you say graphviz. And then how does that work?

So pycallgraph supports outputting to multiple different file formats. Graphviz is, I mean, it's a program that can render
.dot files. I don't really understand how to say it out loud. So the first argument to pycallgraph is the output style, the program that's going to read it. And then you give it the double dash, which means what follows is not part of the pycallgraph options; it's now the Python program that you're calling, and its arguments.

So it's almost basically the same as cProfile, but kind of inverted, right? And you can get really simple call graphs, just this function called this function, which called this function, and it took this amount of time. Or you can get quite complex call graphs: like, you can see this module calls these functions, which then reach out to this other module, and they're all interacting in these ways.

That's pretty amazing.

Yeah. It shows exactly what modules you're using. I mean, if you're using regex, it'll show you each part of the regex module, like the regex compile, or the different modules that are using the regex module, and then
it'll show you how many times each is called, and they're all boxed up nicely. And the image is so easy to look at: you can just zoom in on the exact part you want, and then look at what calls it, and where, and what it calls, to see how the program flows, much more simply.

Yeah, that's really cool.

And, you know, it's something that just came to me as an idea while I was looking at this.

This episode is brought to you by Hired. Hired is a two-sided, curated marketplace that connects the world's knowledge workers to the best opportunities. Each offer you receive has salary and equity presented right up front, and you can view the offers to accept or reject them before you even talk to the company. Typically, candidates receive five or more offers in just the first week, and there are no obligations, ever. Sounds pretty awesome, doesn't it? Well, did I mention there's a signing bonus? Everyone who accepts a job from Hired gets a $2,000 signing bonus, and for Talk Python listeners, it gets way sweeter. Use the link hired.com/talkpythontome,
and Hired will double the signing bonus to $4,000. Opportunity's knocking. Visit hired.com/talkpythontome and answer the call.

Because it colors the hotspots and all, it's really good for profiling. But even if you weren't doing profiling, it seems like that would be pretty interesting just for understanding new code that you're trying to get your head around.

Oh yeah, that's definitely true. I've since employed it as a method to look at the control flow of a program.

Right. How do these methods, these modules and all this stuff, how do they relate? Just run the main method and see what happens, right?

Exactly. It's become a useful tool of mine that I'll definitely be using in the future. I always have it in my virtualenvs nowadays.

So, we've taken cProfile, we've applied it to our program, we've gotten some textual version of the output that tells us where we're spending our time in various ways, and then we can use pycallgraph to visualize that and understand it more quickly. So then what? How do you fix these problems? What are some of the common things you can do?

Yeah. As I outlined in the blog post, there's a plethora of methods, depending on what your profile is showing. For
example, if you're spending a lot of time in Python code, then you can definitely look at things like using a different interpreter, for example an optimizing compiler like PyPy, which will definitely make your code run a lot faster, as it translates to machine code at runtime. Or you can look at the algorithm you're using and see if it's, for example, an O(n³) time-complexity algorithm. That would be terrible, and you might want to fix that.

Yeah, those are the kinds of problems you run into when you have small test data: you write your code and it seems fine, and then you give it real data and it just dies, right?

Exactly. That's the thing. They gave me the code, and they gave me like five gigabytes of data, and they said, this is the data we get on a nightly basis. And I said, oh my god, this will take all day to run. So I used smaller test pieces, but luckily big enough that they showed some statistically significant numbers for me.

Right, something you could actually execute in a reasonable amount of time through your exploration, but not so big that...

I'd rather not have, like, C++ compile times, but at runtime, where it just kind of sits there while it's processing.
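The algorithmic point above can be illustrated with a classic case: a membership test inside a loop is quadratic against a list but roughly linear against a set. The data sizes here are made up for the demonstration:

```python
import time

data = list(range(2_000))
lookups = list(range(0, 4_000, 2))  # half hit, half miss

# Quadratic-ish: each `in` scans the list front to back
start = time.perf_counter()
found_list = sum(1 for x in lookups if x in data)
list_time = time.perf_counter() - start

# Linear-ish: each `in` is a constant-time hash lookup
data_set = set(data)
start = time.perf_counter()
found_set = sum(1 for x in lookups if x in data_set)
set_time = time.perf_counter() - start

# Same answer, very different cost as the data grows
assert found_list == found_set
```

This is exactly the kind of thing that looks fine on small test data and blows up on five gigabytes of real data.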
Right. So you'd rather not only run it once a day. You mentioned some interesting things. PyPy is super interesting to me, and we're going to talk more about how, well, why you chose not to use it. But, you know, on show 21 we had Maciej from the PyPy team.

Oh yeah, I have it up right now. I'm going to watch it later.

Yeah, I'm so excited. That was super interesting, and we talked a little bit about optimization there, and why you might choose an alternate interpreter. That was cool. Then there are some other interesting things you can do as well. Like, you could use namedtuples instead of actual classes, or you could use built-in data structures instead of creating your own, because a lot of times the built-in structures, like list and array and dictionary and so on, are implemented deep down in C, and they're much faster.

Yeah, definitely. One of the best examples of this is that I saw some guy who wrote his own dictionary class, and it was a lot slower. And this isn't in the HumanGeo code base, just so you know; we have good code at HumanGeo.

You guys don't show up on The Daily WTF?

Oh, please, no. We're much better than that. No, this was another place where I saw some code. And, yeah, I mean, they have a lot of optimizations, like in the latest Python
Yeah, and they have a lot of optimizations in the latest Python release. They actually made the dict class, I mean, it's actually now an ordered dict in the latest Python releases, because they basically copied what PyPy did.

Yeah, so you should always trust the internal stuff.

Yeah, and that's really true. And if you're going to stick with CPython, as the majority of people do, understanding how the interpreter actually works is also super important. I've talked about it several times on the show, and had Philip Guo on the show; he recorded basically a ten-hour graduate course at the University of Rochester and put it online, so I can try to link to that, and check it out. That's show 22.

I'd love to see that.

Yeah, you really understand what's happening inside the C runtime, and so then you're like, oh, I see why this is expensive. All those things can help. Another thing you talked about that was interesting was I/O. I gave you my example of databases, or maybe you're talking about a file system, or a web service call, something like this, right? And you had some good advice on that, I thought.

Yeah. Basically, CPython's GIL, the global interpreter lock, is very precluding: you can't do multiple computations at the same time, because CPython only allows one core to be used at a time.

Right. Basically, computational parallelism is not such a thing in Python; you've got to drop
down to C, or fork processes, or something like that, right?

Exactly, and those are all fairly expensive for a task that we're running on an AWS server, where we're trying to spend as little money as possible, because it runs at the break of dawn. So we don't want to be running multiple Python instances. But when you're doing I/O, which doesn't involve any Python, you can use threading for things like file system access, or expensive I/O tasks like getting data off a URL. That's great for Python's threading, but otherwise you don't really want to be using it.

Right. Basically, the built-in functions that wait on I/O release the global interpreter lock, and that frees up your other code to keep on running, more or less.

Right. You definitely want to make sure that if you're doing I/O, it's not the bottleneck; I mean, as long as everything else is not the bottleneck, right?

And you know, we've had a lot of cool advances in Python 3 around this type of parallelism. They just added async and await, the new keywords. Is that 3.5, I think, right?

Yeah, 3.5 just came out, like, two days ago. So that's super new, but these are the types of places where async and await would be ideal for helping you increase your performance.

Yeah, the new syntax is like a special case; it's syntactic sugar for yielding, but it makes things much simpler and easier to read.
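The threading-for-I/O advice can be sketched as below. The downloads are simulated with time.sleep, which releases the GIL while waiting, just as real socket or file I/O does; the URLs and function names are made up.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_download(url):
    # Stand-in for a blocking I/O call; the sleep releases the GIL
    time.sleep(0.2)
    return f"data from {url}"

urls = [f"https://example.com/{i}" for i in range(4)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fake_download, urls))
elapsed = time.perf_counter() - start

# Four 0.2s "downloads" overlap, so this finishes in roughly 0.2s, not 0.8s
print(len(results), f"{elapsed:.2f}s")
```

For CPU-bound work the same code would show no speedup at all, which is the distinction being made here.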
If you're using yields to do complex async stuff that isn't just a generator, it's very ugly, so they added this new syntax to make it much simpler.

Yeah, it's great; I'm looking forward to doing stuff with that. You also had some guidance on regular expressions, and the first bit I really liked, kind of like your premature-optimization bit, is: once you decide to use a regular expression to solve a problem, you now have two problems.

Yeah. I always have that issue whenever I talk to students. I'm like, look at this text, you could do this, and they say, oh, I'll just use regex to solve it. And I'm saying, please, no. You'll end up with something that you won't be able to read in the next two days. Just find a better way to do it, for goodness' sake.

Yeah, I'm totally with you. A friend of mine has this really great way of looking at complex code, both around regular expressions and parallelism. He says that when you're writing that kind of code, you're often writing right at the limit of your understanding, of your ability to write complex code. And debugging is harder than writing code, so you're writing code you literally can't debug. So maybe don't go quite that far, right?

Yeah. I think you should always try to make code as simple as possible, because debugging and profiling and looking through your code will be much less fun if you try to be as clever as possible.
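On readability: when a regex really is the right tool, Python's re.VERBOSE flag (which the conversation turns to next) lets you annotate the pattern. A minimal sketch with an invented phone-number pattern:

```python
import re

# Under re.VERBOSE, whitespace and comments inside the pattern are ignored,
# so the regex can be documented inline
phone = re.compile(r"""
    (\d{3})      # area code
    [-.\s]?      # optional separator
    (\d{3})      # exchange
    [-.\s]?      # optional separator
    (\d{4})      # line number
""", re.VERBOSE)

m = phone.search("call 301-555-0123 tomorrow")
print(m.groups())  # ('301', '555', '0123')
```

The compiled pattern behaves identically to the one-line version; only the source is readable in two days' time.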
Yeah, absolutely. Clever code is not so clever when it's your bug to solve later.

Yeah. And I also tried to give special mention to Python's regex engine. As much as I dislike regex, I think Python's re.VERBOSE flag is amazing, and if you haven't looked into it, Python has great support for annotating the regex. So if you have to use one, you can be very verbose about it, and that way it's much better documented in the code.

Yeah, that's great advice. So let's maybe talk about how you solved your problem. What was the problem, what did you find the issue to be, and then how did you solve it?

Yeah. So what we were doing in this code is we were taking gigabytes upon gigabytes of plain-text data, words from users of various forums, and we processed all this data for sentiment analysis. To do that, you need to stem each word to its base word, so you can make sure you're analyzing the correct word, because there are like five different forms of every single word.

Right, like running, ran; those are all basically the same meaning, right?

Exactly. So we get the base word of all those words. And the thing is, with gigabytes, let's say five gigabytes of words, if a word is like four bytes, that's like billions; that's so many words, I don't even want to think about it. And for every single word, we stemmed it and we analyzed it, and it was a slow, arduous process. And as I was profiling, I realized we run the same stemming function, which is in NLTK; it's a Porter stemmer, which is
amazing, and it's what everyone uses for stemming. In my 50-megabyte test case, which is still so many words, thousands upon thousands of words, we ran it about 600,000 times. And I was like, my goodness, this is running way too much. There are only like 400,000 words in the English language; there's no way each of these words needs to be stemmed, because, you know, gamers on the internet aren't very linguistically amazing.

Yeah, exactly.

So I figured I could create a cache, or, as it's called in more functional terms, a memoization algorithm that saves these answers, I mean, saves the computation, so I don't need to recompute the function, because stemming is a pure function. If you're into functional programming: you don't need to recompute it every single time you get the same input.

Right. If you're guaranteed that with the same inputs you get the same output, then you can cache the heck out of this thing, right?

Exactly. So I went from about 600,000 calls to, like, 30,000, and it was immediate; the whole program ran orders of magnitude faster.

That's awesome. And you know, there are two things I really love about this. One is, you talked a little bit about the solution on the blog, and it's, like, I don't know, nine lines of code?

Yes. That's Python.

It's so awesome.
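The roughly nine-line fix isn't reproduced in the transcript, but a memoizing decorator in that spirit looks something like this sketch. The toy stem function here is a stand-in, not NLTK's Porter stemmer, and in modern Python functools.lru_cache gives you the same behavior for free:

```python
import functools

def memoize(func):
    """Cache results by argument tuple; call func only on a cache miss."""
    cache = {}
    @functools.wraps(func)
    def wrapper(*args):
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]
    return wrapper

calls = 0

@memoize
def stem(word):
    # Toy stand-in for a real stemmer; it's a pure function, so caching is safe
    global calls
    calls += 1
    return word.rstrip("s")

stems = [stem(w) for w in ["runs", "run", "runs", "cats", "run"]]
print(stems, calls)  # five lookups, only three real calls
```

Repeated inputs hit the dictionary instead of the function, which is exactly the 600,000-to-30,000 reduction described above.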
And the other thing is, you didn't even have to change your code, necessarily. You're able to create things that you can just apply to your code; you don't have to rewrite the algorithms or things like that, right?

Yeah. I really find that, you know, the algorithm worked; it got things done, and it did them correctly. I mean, I wasn't opposed to changing the algorithm, obviously, if that had been the hot part of the code. But once I found out that the hot part of the code wasn't even code that we wrote, it was just code that we were calling from another library, and it's probably really optimized; but if you're calling it 600,000 times, well, nothing is optimized when you're calling it hundreds of thousands of times. At that point, you've got to not call it that many times within that time span.

You basically created a decorator that will just cache the output, and only call the function if it's never seen the inputs before, right?

Exactly. Internally, a decorator, all it does is wrap the function in another function. So it adds an internal cache, which is just a Python dictionary, which keeps the function arguments as the key and the output of the function as the value. And if it's been computed, then it's in the dictionary, so all it needs to do is a simple dictionary lookup. That's, like, just one or two Python bytecode instructions, as opposed to calling an entire function itself, which would be hundreds of Python bytecode instructions.

Right, yeah, that's
fantastic.

Yeah. And when it doesn't find the arguments in the cache, I mean, it's just one computation, and for the number of words in the English language that are primarily used, it'll be called much less.

Right. Yeah, so on your typical data set, maybe, I don't know, 30,000, 20,000 times, something like that.

Yeah, yeah.

This episode is brought to you by Codeship. Codeship has launched Organizations: create teams, set permissions for specific team members, and improve collaboration in your continuous delivery workflow. Maintain centralized control over your organization's projects and teams with Codeship's new Organizations plan. And as Talk Python listeners, you can save 20% off any premium plan for the next three months. Just use the code TALKPYTHON, all caps, no spaces. Check them out at codeship.com, and tell them thanks for supporting the show on Twitter, where they're @codeship.

The other thing I like is that the solution was so simple, but you probably needed the profiling to come up with it, right? You know, I have an example that I've given a few times about these types of things: just choosing the right data structure can be really important. I worked on this project that did real-time processing of data coming in 250 times a second, so that leaves four milliseconds to process each segment, which is really not much. But it had to be real time, and if it couldn't keep up, well, then you can't
write a real-time system, or you need some insane hardware to run it, or something, right? And it was doing crazy math, like wavelet decomposition and all sorts of stuff.

Okay, this is like I was saying earlier; it's at the verge of understanding what we're doing, right?

Yeah, and it was too slow the first time we ran it. I'm like, oh no, please don't make me try to optimize a wavelet decomposition; you know, it's kind of like Fourier analysis, but worse. And I'm like, there's got to be a better way, right? So, break out the profiler. And it turned out that we had to do lookups back in the past on our data structures, and we happened to use a list to go back and look things up, and we were spending 80% of our time just looking for stuff in the list.

You just switched it to an O(1)...

Yeah, exactly, we just switched it to a dictionary, and it went five times faster. And it was almost a one-line code change; it was ridiculously simple. But if you don't use the profiler to find the real problem, you're going to go muck with your algorithm and not even really make a difference, because it had nothing to do with the algorithm; it was somewhere else.

Yeah. I definitely find there's also a lot more of a push, like in the Java world, to make things final by default, to try to make them immutable unless they need not to be. And a lot of languages are also embracing immutable-by-default and trying to keep things as strict as possible, so that you can be more lenient when you need to.
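That list-to-dictionary switch is easy to feel in a toy benchmark: membership in a list is an O(n) scan, while a set (or a dict key lookup) hashes straight to the answer. The sizes and repeat counts here are arbitrary, chosen only to make the gap obvious:

```python
import timeit

n = 100_000
haystack_list = list(range(n))
haystack_set = set(haystack_list)
needle = -1  # worst case: absent, so the list scans all n elements

list_time = timeit.timeit(lambda: needle in haystack_list, number=200)
set_time = timeit.timeit(lambda: needle in haystack_set, number=200)
print(f"list: {list_time:.4f}s  set: {set_time:.6f}s")
```

The gap widens linearly with n, which is why a near one-line change produced the five-times speedup in the story.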
And I find the same thing in languages like Python, where I try to use a set unless I absolutely need a list. If I'm just containing elements, a set is much better for finding things.

Right. If you just want to know, have I seen this thing before, a set is maybe exactly the right data structure. Or, if you're going to store integers in a list, you'd be much better off using an array of integers or an array of floats, because it's much, much more efficient. You said one of the things you considered was PyPy. Just for those who don't know, what's the super-quick elevator pitch on PyPy, and what did you find out about using it?

Yeah. So, PyPy is a very compliant Python interpreter that, at runtime, turns the Python code into machine code. It finds the hot parts of your code, what's being run a lot, and it finds a better way to run that for you. So it'll run a lot faster than CPython, because CPython doesn't do many optimizations, by default or just in general.

Right, it runs it through the interpreter, which is a fairly complex thing, rather than turning it straight into machine instructions.

Yeah, yeah, there's a lot more overhead.

You tried out PyPy. Did it make it faster, or did it matter at all?

Oh yeah. I actually used PyPy before even profiling, just to see, you know, just because I was like, let's just see how much faster PyPy is in this case. And it ran about five times faster, because it figured out what to do. But the thing is, under our constraints, we wanted to stick
with a little AWS instance that we were just running every night. And the thing is, PyPy uses more memory than CPython, to support its garbage collector and its just-in-time compiler, which both need to run at runtime. So it uses a little bit more memory, and we didn't really want to spend that money. Because if I can get it down to an hour in CPython and it runs at 2 a.m., no one's going to be looking at that stuff at 3 a.m.

Right, absolutely. And if you can optimize it in CPython and then feed it into PyPy, maybe it could go faster still, right?

Exactly. As opposed to going from ten hours to one hour, if I was running it in PyPy with the cache optimization, it would probably run in, like, 30 minutes, 20 minutes. Like I said, it was unnecessary. It would have been nice, but we didn't need it, so we didn't really feel like spending the time to add that to the pipeline.

Right, sure. It seems like, if you needed it to go faster, you could just keep iterating on this, right? Your decorator will wrap it for every run of your Python app, but you could actually keep those stemmed words...

Yeah, it could save them to a Redis cache, or to a file, or to a database, right? You could just keep going on this idea.

Right.

Yeah, that was definitely something I thought about doing. It was fast enough already, though, and that is where we'll go next after that. I always find
that when you profile and you have a flat profile, with nothing sticking out and nothing that looks like it needs to be optimized, that's when you need to change the runtime. That's when you need to look into FFI, or PyPy, or Jython, with a proper JIT.

Interesting. What's FFI?

So, FFI is the foreign function interface. It's just the general term, used by all languages, for when you can drop down into C.

That's right, or any C-compatible language.

Yeah. So you would basically go write C and then hook into it.

That's right. Or use Cython, for example, which compiles Python down to C, with a weirdly Python-plus-C syntax, if you have to.

Right. Yeah, I've never tried it, but I've seen it, and, like you're saying, you annotate Python variables with C; like, you say double i equals zero, in Python syntax.

That's right. It's really strange.

Yeah, that is strange. So, you know, I talk about PyCharm on the show a lot; I'm a big fan of PyCharm. And they just added built-in profiling in their latest release, 4.5.

Sounds nice.

Yeah. I don't know if it's out or not. It has a visualizer, too, kind of like pycallgraph, but they said they're using something called the yappi (y-a-p-p-i) profiler, and it'll fall back to cProfile if it has to.

Yappi, I've not seen this. Yeah, I was looking at all these profilers. Also, PyPy comes with a profiler that works on Linux, called vmprof. And those are all different profilers, and I looked at them, and they're nice,
sure, but I really loved how simple it was. I mean, I got the results that I needed from cProfile, and it comes with Python. You didn't need to install anything; you just ran the module. And that's why I was so happy with using it and didn't need to try a different profiler.

Right, that's cool. Just python, space, dash-m, space, cProfile, and then your app. Boom.

Exactly. And if you're using something like PyCharm, I mean, if it comes with yappi or any other profiler, definitely use that. I'm sure it's great, because the PyCharm people are awesome. But for the purposes of the simplicity that I wanted in the blog post, I only used the word pip once; I mean, I only used it once, to install pycallgraph. And that's just how simple it needs to be.

Right. You maybe could have gotten away without it, if you really wanted to look at just the text output. The other thing is, you know, this is something you can do really well on your server, right? You can go over and profile something where maybe it has real data it can get to; like, maybe it works fine on my machine, but not in production, or not in QA, or something like that. And so you don't have to go install PyCharm on the server, which you probably don't want to do, SSHed into AWS.

Exactly. Yeah, and that's why Python is amazing for being batteries-included: it includes all of these nice things that any developer is going to need to use eventually.
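Alongside the python -m cProfile command line, the same profiler can be driven from code with cProfile and pstats, both in the standard library. A small sketch, with an invented slow function:

```python
import cProfile
import io
import pstats

def slow_square_sum(n):
    # Deliberately busy function so it stands out in the profile
    return sum(i * i for i in range(n))

profiler = cProfile.Profile()
profiler.enable()
slow_square_sum(100_000)
profiler.disable()

# Sort by cumulative time and show the top five entries
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
report = out.getvalue()
print("slow_square_sum" in report)  # the hot spot shows up by name
```

This is handy on a server, since nothing needs to be installed and the text report can be captured and shipped anywhere.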
Do you want to talk a little bit about open source, and what you guys are doing with open source at HumanGeo?

Yeah. So, at HumanGeo, we have a GitHub profile, github.com/humangeo, and most of our open source stuff is the JavaScript Leaflet stuff that we've incorporated; we add themes and things. So if you want any map visualizations, I like it much better than the Google Maps API, using Leaflet, so I'd recommend looking at the stuff we've done there. And we also have a couple of Python libraries. You know, we have an Elasticsearch binding that we've used, which has since been superseded by the Mozilla elastic binding. So we definitely love open source at HumanGeo; we make and use open source libraries, and that's one of my favorite parts about HumanGeo.

That's really awesome. And you guys are hiring, right?

Oh yeah, we're always hiring. We're looking for, you know, great developers of any age, Python, Java. Like I said, we won one of the best places to work for millennials in the DC metro area.

Right, and for the international listeners, that's Washington, DC, in the United States.

Yes, Washington, DC. Yeah, no worries. Whether you have a government clearance or not, just send an email, or call, or get in touch if you want to work at an amazing place that does awesome Python. I love their code bases.

Yeah, that's really cool. It should definitely reach a bunch of enthusiastic Python people, so if you're looking for a job, give
those guys a ring.

That's awesome. Definitely. So, you guys are big fans of Python, right?

Oh yeah. I mean, they've been using Python since the company started. I was looking at the original commits, from when the company started, and they're all Python. It's really exciting; we've been using it since the beginning. It's amazing for rapid iteration, it's fast enough, obviously, and it's super easy to profile when you need to. That's another reason why it's amazing: it's not just that you can say it's slow; it's easy to optimize in that case.

Yeah, that's really cool. So, are there some practices or techniques that you guys have landed on, or focused in on, having been doing this for so long, that you could recommend?

At HumanGeo, we don't go, like, full agile or anything, but we definitely consider ourselves fast-moving, and we work at a very great pace, so I guess you could call it agile. And we have great git practices: we use git-flow, and we make sure to have good code review. In any code base, including Python, you've got to have good code review. And when I write Python code, I write a lot of unit tests for my code, things like that.

Nice. So is that the unittest module, or py.test, or...?

Oh yeah, the standard library. I just love standard library stuff.

Yeah, it's the batteries included, right?

Yes, definitely.

So, anything else you want to touch on before we kind
of wrap things up, Davis?

No, I'm just really excited to be given this opportunity, and I wanted to give a shout-out to all my amazing colleagues at HumanGeo.

Yeah, that's awesome. You guys are doing cool stuff, so I'm glad to shine a light on that. So, final two questions before I let you out of here. Favorite editor: what do you use to write Python these days?

I'll say that for all my open source work, like, I work on a website for the DC Python community, I use Sublime Text; for all that open source stuff, all my tools that I use. And when I work for a company, I ask them for a PyCharm license.

Yeah, nice.

Because PyCharm is great for big projects that you can really focus on.

Yeah. You know, like I said, that's my favorite editor as well. There's definitely a group of people that love the lightweight editors, like Vim and Emacs and Sublime, and then people that like IDEs; it's a bit of a divide. But I feel like, when you're on a huge project, you can just understand it more in its entirety using something like PyCharm. So, yeah, I like it.

My favorite thing about PyCharm is that you can control-click or command-click on a module, and it'll take you to the source for that module, so you can really quickly look at where the code is flowing in the source.

Yeah, absolutely. Or, hey, you're importing a module, but it's not specified in the requirements; do you want me to add it for
you, for this package you're writing? Right, stuff like that. It's just, it's sweet.

They have great support for the tooling and the tool chain of Python.

Yeah. Awesome, Davis, this has been a really, really interesting conversation, and hopefully some people can go make their Python code faster.

Yeah, I definitely hope that they will, and I hope they learned a lot from this.

Yeah, thanks for being on the show, man.

No problem, thank you so much.

Yeah, talk to you later.

This has been another episode of Talk Python To Me. Today's guest was Davis Silverman, and this episode has been brought to you by Hired and Codeship. Thank you guys for supporting the show. Hired wants to help you find your next big thing: visit hired.com/talkpythontome to get five or more offers, with salary and equity presented right up front, and a special listener signing bonus of four thousand dollars. Codeship wants you to always keep shipping; check them out at codeship.com, and thank them on Twitter via @codeship. Don't forget the discount code for listeners: it's easy, TALKPYTHON, all caps, no spaces. You can find the links from today's show at talkpython.fm/episodes/show/28. Be sure to subscribe to the show: open your favorite podcatcher and search for Python; we should be right at the top. You can also find iTunes and direct RSS feeds in the footer of the website. Our theme music is Developers, Developers, Developers, by Cory Smith, who goes by Smixx. You can hear the entire song at talkpython.fm. This is your host, Michael Kennedy. Thank you very much for listening.
Smixx, take us out of here.

Stating with my voice, there's no norm that I can feel within. Haven't been sleeping, I've been using lots of rest. I'll pass the mic back to who rocked it best. Developers, developers, developers, developers...