Is that Python code of yours running a little slow? Are you thinking about rewriting the algorithm, or maybe even rewriting it in another language? Well, before you do, you'll want to listen to what Davis Silverman has to say about speeding up Python code using profiling. This is show number 28, recorded Wednesday, September 16th, 2015.

Welcome to Talk Python To Me, a weekly podcast on Python: the language, the libraries, the ecosystem, and the personalities. This is your host, Michael Kennedy. Follow me on Twitter, where I'm @mkennedy. Keep up with the show and listen to past episodes at talkpython.fm, and follow the show on Twitter via @talkpython. This episode is brought to you by Hired and CodeShip. Thank them for supporting the show on Twitter via @Hired_HQ and @CodeShip.

There's nothing special to report this week, so let's get right to the show with Davis.

Let me introduce Davis. Davis Silverman is currently a student at the University of Maryland, working part-time at the HumanGeo Group. He writes mostly Python, with an emphasis on performant, Pythonic code. Davis, welcome to the show.

Hello.

Thanks for being here. I'm really excited to talk to you about how you made some super slow Python code much, much faster using profiling. You work at a place called the HumanGeo Group.
You guys do a lot of Python there, and we're going to spend a lot of time talking about how you took some of your social media data, real-time analytics type stuff, built that in Python, and improved it using profiling. But let's start at the beginning. What's your story? How did you get into programming and Python?

When I was a kid, I grew up with the internet. I was lucky, and I was very into computers, and my parents were very happy with me building and fixing computers for them. So by the time high school came around, I took a programming course, and it was Python, and I fell in love with it immediately. I've been programming in Python ever since, since sophomore year of high school, so it's been quite a few years now.

I think all of us programmers unwittingly become tech support for our families and whatnot, right?

Oh, yeah. For my entire family, I'm that guy.

Yeah, I try to not be that guy, but I end up being that guy a lot. So you took Python in high school. That's pretty cool that they offered Python there. Did they have other programming classes as well?

Yeah. I live in the DC metro area, in Montgomery County. It's a very nice county, and the schools are very good. Luckily, the intro programming course was taught by a very intelligent teacher, so she taught us all Python.
And then the courses after that were Java courses: the college-level Advanced Placement Java, and then a data structures class after that. So we got to learn a lot about the fundamentals of computer science in those classes.

Yeah, that's really cool. I think I got to take BASIC when I was in high school, and that was about it. It was a while ago.

I wrote a BASIC interpreter once, but it wasn't very good.

Cool. So before we get into the programming stuff, maybe you could just tell me: what is the HumanGeo Group? What do you guys do?

Yeah. The HumanGeo Group is a small government contractor. We deal mostly in government contracts, but we have a few commercial projects and ventures, which is what I was working on over the summer and what we'll be talking about. We're a great little company in Arlington, Virginia, and we actually just won an award as one of the best places to work in the DC metro area for millennials and younger people.

If you go to thehumangeo.com, you guys have a really awesome webpage. I really like how it is. It's like, bam, here we are, we're about data, and it has this kind of live page. So many companies want to get their CEO statement and all the marketing stuff and the news up front. You guys are just like, look, it's about data. That's cool.

Yeah. There was a website rewrite recently, I don't remember how recent, and one guy decided to take the time to really do something that shows some geographic stuff.
So he used Leaflet.js, which we do a lot of open source work with on our GitHub page, and he made it very beautiful. There are even icons of all the people at HumanGeo. I think it's much better than a generic contractor site, like you said. It's much more energetic.

It looks to me like you do a lot with social media and sentiment analysis, and tying that to location, the geo part, right? What's the story there? What kind of stuff do you guys do?

Yeah. One of the things I was working on: one of our customers is a billion-dollar entertainment company. You've probably heard of them; I think we talk about them on our site. What we do is analyze various social media sites, like Reddit and Twitter and YouTube. We gather geographic data, if it's available, and we gather sentiment data using specific algorithms from things like the Natural Language Toolkit, which is an amazing Python package. Then we show it to the user in a very nice website that we've created.

So you work for this entertainment company as well as being a government contractor. What is the government interested in with all this data? The US government, that is, for the international listeners.

Yeah, it's definitely the United States government. We do less social media analysis for the government. We do some, but it's not what people think the NSA does, definitely.
I think, you know, it's just like anything a company would want: you'd search on something, and it would show, oh, there are Twitter users talking about this in these areas.

Yeah. I guess the government would want to know that, especially in emergencies or things like that, possibly, right?

Yeah. We also do some platform stuff: we create certain platforms for the government that aren't necessarily social media related.

Right, sure. So how much of that is Python, and how much is other languages?

At HumanGeo, we do tons of Python on the backend. For some of the government stuff we do Java, which is big in government, obviously. But we definitely have a lot of Python. We use it a lot in the pipeline, for various tools and things that we use internally and externally at HumanGeo. The project I was working on is exclusively Python: all parts of the pipeline for gathering data and representing data, the backend server and the frontend server. That was all Python.

Right. So it's pretty much Python end to end. I don't know the specific project you were working on, but it looks like there's very heavy D3, fancy JavaScript stuff on the front end for the visualization. But other than that, it was more or less Python, right?

Yeah. We do have some amazing JavaScript people. They do a lot of really fun-looking stuff.

Yeah, you can tell it's a fun place for data display and front-end development. That's cool.
So, Python 2 or Python 3?

We use Python 2, but when I was working on the code base, I was definitely writing Python 3-compatible code using the proper __future__ imports. And I was testing it on Python 3. We're probably closer to Python 3 than a lot of companies are; we just haven't expended the time to do it. We probably will in 2018, when Python 2 is nearing end of life.

That seems like a really smart way to go. Did you happen to profile it under both CPython 2 and CPython 3?

I didn't. It doesn't fully run on CPython 3 right now. I wish I could.

It would just be really interesting, since you spent so much time looking at the performance, if you could compare those. But yeah, if it doesn't run...

That would be interesting. You're right. I wish I could know that.

I suspect that most people know what profiling is, but there's a whole diverse set of listeners, so maybe we could just say really quickly: what is profiling?

Yeah. Profiling, in any language, is gathering statistics about what's running in your program: for example, how many times a function is called, or how long a section of code takes to run. It's a statistical thing. You get a profile of your code, and you see all the parts of your code that are fast or slow, for example.

You wrote a really interesting blog post, and we'll talk about that in just a second.
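As a concrete sketch of that idea, here is a minimal example of collecting a profile with the standard library's cProfile and pstats modules; `slow_sum` is just an invented stand-in for real work:

```python
import cProfile
import io
import pstats

def slow_sum(n):
    # Invented stand-in for real work: a plain arithmetic loop
    total = 0
    for i in range(n):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
slow_sum(100_000)
profiler.disable()

# Gather the report into a string; sort by cumulative time
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats()
report = buf.getvalue()
```

Each row of the report shows the call count, the time spent inside the function itself, and the cumulative time including its callees.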
And I think, like all good profiling articles, you point out what I consider to be the first rule of profiling, or really the first rule of optimization, profiling being the tool that gets you there: don't prematurely optimize your stuff, right?

Yeah, definitely.

You know, I've spent a lot of time thinking about how programs run, and why this is slow or that is fast, worrying about this little part or that little part, and most of the time it just doesn't matter. Or if it is slow, it's slow for reasons that were unexpected, right?

Yeah, definitely. I always make sure that there's a legitimate problem to be solved before spending time on something like profiling or optimizing a code base.

Definitely. So let's talk about your blog post, because it went around on Twitter in a pretty big way, and on the Python newsletters and so on. I read it and thought, oh, this is really cool, I should have Davis on and we should talk about this. So, on the HumanGeo blog, you wrote a post called "Profiling in Python," right? What motivated you to write that?

Yeah. When I was working on this code, my coworkers and my boss said, you know, we have this pipeline for gathering the data that this specific customer gives us, and we ran it at about 2 a.m. every night. The problem was that it took 10 hours to run. It was doing very heavy text processing, which I'll talk about more later, I guess.
It was doing a lot of text processing, which ended up taking 10 hours, and they looked at it and said, you know, it's updating at noon every day, and the workday starts at about nine. So we should probably try to fix this and get it running faster. Davis, you should totally look at this as a great first project.

Here's your new project, right?

Yeah. It was like, here's day one. Okay, make this thing 10 times faster. Go.

Yeah. Oh, is that all you want me to do?

Yeah. So the first thing I did was profile it, which is the first thing you should do, to make sure that there's actually a hotspot. I ran through the process that I talk about in the post, the tools I used. And I realized that it wasn't a simple thing for me to do all this; I don't do this often. And I figured that a lot of people, like you said, maybe don't know about profiling or haven't done this. So I said, you know, if I write a blog post about this, hopefully somebody else won't have to Google 20 different posts and read all of them to come up with one coherent idea of how to do this, in one fell swoop.

Right. Maybe you can just lay it out: these are the five steps you should probably start with. And, you know, profiling is absolutely an art more than it is engineering, but at least having steps to follow is super helpful. You started out by talking about the CPython distribution, and I think it'd be interesting to talk a little bit about the alternative implementations as well, because you did talk about that a bit.
Yep. You said there are basically two profilers that come with CPython: profile and cProfile.

Yeah. The two profilers that come with Python, as you said, profile and cProfile, have the same exact interface and collect the same statistics. But the idea behind profile is that it's written in Python and is portable across Python implementations, hopefully, while cProfile is written in C, so it's pretty specific to a C interpreter, such as CPython, or even PyPy, because PyPy is very interoperable.

Right, it does have that interoperability layer. So maybe if we were using Jython, or IronPython, or Pyston or something, we couldn't use cProfile.

Yeah, I wouldn't say that you can.

I'm just guessing. I haven't tested it.

I would say that for Jython and IronPython, you could use the standard Java or .NET profilers instead. I'm pretty sure those will work just fine, because they're known to work great with their respective ecosystems.

Do you know whether there's a significant performance difference? Let me take a step back. It seems to me, when I've done profiling in the past, that it's a little bit like the Heisenberg uncertainty principle: by the fact of observing a thing, you've altered it, right? When you run your code under the profiler, it might not behave with the same exact timings as if it were running natively, but you can still get a pretty good idea, usually. So is there a difference between cProfile and profile in that regard?
Oh yeah, definitely. Profile is much slower; it has much higher overhead than cProfile, because it has to do a lot more work in Python. Python exposes the interpreter internals in some Python modules, but they're a lot slower than using straight C and getting right at it. So if you're using CPython or PyPy, I'd recommend using cProfile instead, because it has much lower overhead and gives you much better numbers to work with, numbers that make more sense.

Okay, awesome. So how do I get started? Suppose I have some code that's slow. How do I run it under cProfile?

Yeah. So the first thing that I would do... I'm pulling up the blog post right now, just to see.

And I'll be sure to link to that in the show notes as well, so people can just jump to the show notes and pull it up.

Yeah. One of my favorite things about cProfile is that you can call it using the python -m syntax, the module syntax, and it will print your profile to standard out when it's done. It's super easy. All you need to do is give it your main Python file and it'll run it, and at the end of the run, it'll give you the profile. It's super simple, and that's one of the reasons the blog post was so easy to write.

Yeah, that's cool. So by default, it gives you basically the total time spent in all of the methods, looking at the call stack and stuff, the number of times each was called, stuff like that, right? The cumulative time, the individual time per call, and so on.
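The python -m invocation described here can be sketched as follows; the throwaway script and its `work` function are made up for the demonstration, and are written to a temp file purely so the example is self-contained:

```python
# Equivalent of running at the shell:
#   python -m cProfile -s cumtime your_script.py
import os
import subprocess
import sys
import tempfile
import textwrap

# A throwaway demo script
script = textwrap.dedent("""\
    def work():
        return sum(i * i for i in range(50_000))

    if __name__ == "__main__":
        work()
""")

with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write(script)
    path = f.name

try:
    # -s cumtime sorts the printed report by cumulative time
    out = subprocess.run(
        [sys.executable, "-m", "cProfile", "-s", "cumtime", path],
        capture_output=True, text=True, check=True,
    ).stdout
finally:
    os.unlink(path)
```

In everyday use you would just run the shell command directly against your real entry-point file; no temp file needed.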
It gives you the defaults, like you said; you're correct. And it's also really easy to give it a sorting argument. That way, you can sort on how many times something is called: if it's called 60,000 times, that's probably a problem in a 10-minute run. Or it could be called only twice but take an hour to run, which would be very scary. In which case, you want to sort it both ways; you want to sort it every way, so you can see everything, just in case you're missing something important.

Right. You want to slice it a lot of different ways. How many times was it called? What was the maximum individual time, the cumulative time, all those types of things. Maybe you're calling a database function, right? And maybe that's just got tons of data.

Yeah, it's slow. Maybe the database is slow, and so that says, well, don't even worry about your Python code. Just go make your database fast, or use Redis for caching or something, right?

Or, yeah, work on your query. Maybe you can make it a distinct query and get a much smaller data set coming back to you.

Yeah, absolutely. It could just be a smarter query you're doing. So this is all pretty good. This output that you get is pretty nice. But in a real program, with many, many functions and so on, it can be pretty hard to understand, right?

Yes, definitely. When I was working on this, I had the same issue. There were so many lines of code.
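Sorting the same profile several ways, as suggested here, is one line per view with the standard pstats module; `hot` and `main` below are invented examples:

```python
import cProfile
import io
import pstats

def hot():
    # Called many times, so it stands out when sorting by call count
    return [str(i) for i in range(5_000)]

def main():
    for _ in range(50):
        hot()

prof = cProfile.Profile()
prof.runcall(main)

buf = io.StringIO()
stats = pstats.Stats(prof, stream=buf)
# Same data, two views: by number of calls, then by cumulative time
stats.sort_stats("ncalls").print_stats(5)
stats.sort_stats("cumtime").print_stats(5)
report = buf.getvalue()
```

Other useful sort keys include "tottime" (time inside the function itself, excluding callees), which catches the "called twice but takes an hour" case directly.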
It was filling up my terminal, you know, and I had to save it to an output file, and that was too much work. So I was searching around more, and I found pycallgraph, which is amazing at showing you the flow of a program. It gives you a great graphical representation of what cProfile is also showing you.

That's awesome. So it's kind of like a visualizer for the cProfile output, right?

Yeah. It even colors things in: the more red a call is, the hotter it is, the more times it runs and the longer it runs.

Yeah, that's awesome. So just pip install pycallgraph to get started, right?

Super simple. It's one of the best things about pip as a package manager.

Yeah, I mean, that's part of why Python is awesome, right? pip install whatever. Anti-gravity.

So then you invoke it: you say basically pycallgraph, and you say graphviz. And then how does that work?

So pycallgraph supports outputting to multiple different file formats. Graphviz is, I mean, it's a program that can render
.dot files. I don't really understand how to say it out loud. So the first argument to pycallgraph is the output style, the program that's going to read it. And then you give it the double dash, which means what follows is not part of the pycallgraph options; it's now the Python program that you're calling, and its arguments.

So it's almost basically the same as cProfile, but kind of inverted, right? And you can get really simple call graphs, just this function called this function, which called this function, and it took this amount of time. Or you can get quite complex call graphs: like, you can see this module calls these functions, which then reach out to this other module, and they're all interacting in these ways.

That's pretty amazing.

Yeah. It shows exactly what modules you're using. I mean, if you're using regex, it'll show you each part of the regex module, like the regex compile, or the different modules that are using the regex module, and then
it'll show you how many times each is called, and they're all boxed up nicely. And the image is so easy to look at: you can just zoom in on the exact part you want, and then look at what calls it, and where, and what it calls, to see how the program flows, much more simply.

Yeah, that's really cool.

And, you know, it's something that just came to me as an idea while I was looking at this.

This episode is brought to you by Hired. Hired is a two-sided, curated marketplace that connects the world's knowledge workers to the best opportunities. Each offer you receive has salary and equity presented right up front, and you can view the offers to accept or reject them before you even talk to the company. Typically, candidates receive five or more offers in just the first week, and there are no obligations, ever. Sounds pretty awesome, doesn't it? Well, did I mention there's a signing bonus? Everyone who accepts a job from Hired gets a $2,000 signing bonus, and for Talk Python listeners, it gets way sweeter. Use the link hired.com/talkpythontome,
and Hired will double the signing bonus to $4,000. Opportunity's knocking. Visit hired.com/talkpythontome and answer the call.

Because it colors the hotspots and all, it's really good for profiling. But even if you weren't doing profiling, it seems like that would be pretty interesting just for understanding new code that you're trying to get your head around.

Oh yeah, that's definitely true. I've since employed it as a method to look at the control flow of a program.

Right. How do these methods, these modules and all this stuff, how do they relate? Just run the main method and see what happens, right?

Exactly. It's become a useful tool of mine that I'll definitely be using in the future. I always have it in my virtualenvs nowadays.

So, we've taken cProfile, we've applied it to our program, we've gotten some textual version of the output that tells us where we're spending our time in various ways, and then we can use pycallgraph to visualize that and understand it more quickly. So then what? How do you fix these problems? What are some of the common things you can do?

Yeah. As I outlined in the blog post, there's a plethora of methods, depending on what your profile is showing. For
example, if you're spending a lot of time in Python code, then you can definitely look at things like using a different interpreter, for example an optimizing compiler like PyPy, which will definitely make your code run a lot faster, as it translates to machine code at runtime. Or you can look at the algorithm you're using and see if it's, for example, an O(n³) time-complexity algorithm. That would be terrible, and you might want to fix that.

Yeah, those are the kinds of problems you run into when you have small test data: you write your code and it seems fine, and then you give it real data and it just dies, right?

Exactly. That's the thing. They gave me the code, and they gave me like five gigabytes of data, and they said, this is the data we get on a nightly basis. And I said, oh my god, this will take all day to run. So I used smaller test pieces, but luckily big enough that they showed some statistically significant numbers for me.

Right, something you could actually execute in a reasonable amount of time through your exploration, but not so big that...

I'd rather not have, like, C++ compile times, but at runtime, where it just kind of sits there while it's processing.
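The algorithmic point above can be illustrated with a classic case: a membership test inside a loop is quadratic against a list but roughly linear against a set. The data sizes here are made up for the demonstration:

```python
import time

data = list(range(2_000))
lookups = list(range(0, 4_000, 2))  # half hit, half miss

# Quadratic-ish: each `in` scans the list front to back
start = time.perf_counter()
found_list = sum(1 for x in lookups if x in data)
list_time = time.perf_counter() - start

# Linear-ish: each `in` is a constant-time hash lookup
data_set = set(data)
start = time.perf_counter()
found_set = sum(1 for x in lookups if x in data_set)
set_time = time.perf_counter() - start

# Same answer, very different cost as the data grows
assert found_list == found_set
```

This is exactly the kind of thing that looks fine on small test data and blows up on five gigabytes of real data.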
Right. So you'd rather not only run it once a day. You mentioned some interesting things. PyPy is super interesting to me, and we're going to talk more about how, well, why you chose not to use it. But, you know, on show 21 we had Maciej from the PyPy team.

Oh yeah, I have it up right now. I'm going to watch it later.

Yeah, I'm so excited. That was super interesting, and we talked a little bit about optimization there, and why you might choose an alternate interpreter. That was cool. Then there are some other interesting things you can do as well. Like, you could use namedtuples instead of actual classes, or you could use built-in data structures instead of creating your own, because a lot of times the built-in structures, like list and array and dictionary and so on, are implemented deep down in C, and they're much faster.

Yeah, definitely. One of the best examples of this is that I saw some guy who wrote his own dictionary class, and it was a lot slower. And this isn't in the HumanGeo code base, just so you know; we have good code at HumanGeo.

You guys don't show up on The Daily WTF?

Oh, please, no. We're much better than that. No, this was another place where I saw some code. And, yeah, I mean, they have a lot of optimizations, like in the latest Python
Yeah, and they have a lot of optimizations in the latest Python release. They actually made the dict class, I mean, it's actually now an ordered dict in the latest Python releases, because they basically copied what PyPy did.

Yeah, so you should always trust the internal stuff.

Yeah, and that's really true. And if you're going to stick with CPython, as the majority of people do, understanding how the interpreter actually works is also super important. I've talked about it several times on the show, and had Philip Guo on the show; he recorded basically a ten-hour graduate course at the University of Rochester and put it online, so I can try to link to that, and check it out. That's show 22.

I'd love to see that.

Yeah, you really understand what's happening inside the C runtime, and so then you're like, oh, I see why this is expensive. All those things can help. Another thing you talked about that was interesting was I/O. I gave you my example of databases, or maybe you're talking about a file system, or a web service call, something like this, right? And you had some good advice on that, I thought.

Yeah. Basically, CPython's GIL, the global interpreter lock, is very precluding: you can't do multiple computations at the same time, because CPython only allows one core to be used at a time.

Right. Basically, computational parallelism is not such a thing in Python; you've got to drop
down to C, or fork processes, or something like that, right?

Exactly, and those are all fairly expensive for a task that we're running on an AWS server, where we're trying to spend as little money as possible, because it runs at the break of dawn. So we don't want to be running multiple Python instances. But when you're doing I/O, which doesn't involve any Python, you can use threading for things like file system access, or expensive I/O tasks like getting data off a URL. That's great for Python's threading, but otherwise you don't really want to be using it.

Right. Basically, the built-in functions that wait on I/O release the global interpreter lock, and that frees up your other code to keep on running, more or less.

Right. You definitely want to make sure that if you're doing I/O, it's not the bottleneck; I mean, as long as everything else is not the bottleneck, right?

And you know, we've had a lot of cool advances in Python 3 around this type of parallelism. They just added async and await, the new keywords. Is that 3.5, I think, right?

Yeah, 3.5 just came out, like, two days ago. So that's super new, but these are the types of places where async and await would be ideal for helping you increase your performance.

Yeah, the new syntax is like a special case; it's syntactic sugar for yielding, but it makes things much simpler and easier to read.
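The threading-for-I/O advice can be sketched as below. The downloads are simulated with time.sleep, which releases the GIL while waiting, just as real socket or file I/O does; the URLs and function names are made up.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_download(url):
    # Stand-in for a blocking I/O call; the sleep releases the GIL
    time.sleep(0.2)
    return f"data from {url}"

urls = [f"https://example.com/{i}" for i in range(4)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fake_download, urls))
elapsed = time.perf_counter() - start

# Four 0.2s "downloads" overlap, so this finishes in roughly 0.2s, not 0.8s
print(len(results), f"{elapsed:.2f}s")
```

For CPU-bound work the same code would show no speedup at all, which is the distinction being made here.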
If you're using yields to do complex async stuff that isn't just a generator, it's very ugly, so they added this new syntax to make it much simpler.

Yeah, it's great; I'm looking forward to doing stuff with that. You also had some guidance on regular expressions, and the first bit I really liked, kind of like your premature-optimization bit, is: once you decide to use a regular expression to solve a problem, you now have two problems.

Yeah. I always have that issue whenever I talk to students. I'm like, look at this text, you could do this, and they say, oh, I'll just use regex to solve it. And I'm saying, please, no. You'll end up with something that you won't be able to read in the next two days. Just find a better way to do it, for goodness' sake.

Yeah, I'm totally with you. A friend of mine has this really great way of looking at complex code, both around regular expressions and parallelism. He says that when you're writing that kind of code, you're often writing right at the limit of your understanding, of your ability to write complex code. And debugging is harder than writing code, so you're writing code you literally can't debug. So maybe don't go quite that far, right?

Yeah. I think you should always try to make code as simple as possible, because debugging and profiling and looking through your code will be much less fun if you try to be as clever as possible.
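On readability: when a regex really is the right tool, Python's re.VERBOSE flag (which the conversation turns to next) lets you annotate the pattern. A minimal sketch with an invented phone-number pattern:

```python
import re

# Under re.VERBOSE, whitespace and comments inside the pattern are ignored,
# so the regex can be documented inline
phone = re.compile(r"""
    (\d{3})      # area code
    [-.\s]?      # optional separator
    (\d{3})      # exchange
    [-.\s]?      # optional separator
    (\d{4})      # line number
""", re.VERBOSE)

m = phone.search("call 301-555-0123 tomorrow")
print(m.groups())  # ('301', '555', '0123')
```

The compiled pattern behaves identically to the one-line version; only the source is readable in two days' time.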
Yeah, absolutely. Clever code is not so clever when it's your bug to solve later.

Yeah. And I also tried to give special mention to Python's regex engine. As much as I dislike regex, I think Python's re.VERBOSE flag is amazing, and if you haven't looked into it, Python has great support for annotating the regex. So if you have to use one, you can be very verbose about it, and that way it's much better documented in the code.

Yeah, that's great advice. So let's maybe talk about how you solved your problem. What was the problem, what did you find the issue to be, and then how did you solve it?

Yeah. So what we were doing in this code is we were taking gigabytes upon gigabytes of plain-text data, words from users of various forums, and we processed all this data for sentiment analysis. To do that, you need to stem each word to its base word, so you can make sure you're analyzing the correct word, because there are like five different forms of every single word.

Right, like running, ran; those are all basically the same meaning, right?

Exactly. So we get the base word of all those words. And the thing is, with gigabytes, let's say five gigabytes of words, if a word is like four bytes, that's like billions; that's so many words, I don't even want to think about it. And for every single word, we stemmed it and we analyzed it, and it was a slow, arduous process. And as I was profiling, I realized we run the same stemming function, which is in NLTK; it's a Porter stemmer, which is
amazing, and it's what everyone uses for stemming. In my 50-megabyte test case, which is still so many words, thousands upon thousands of words, we ran it about 600,000 times. And I was like, my goodness, this is running way too much. There are only like 400,000 words in the English language; there's no way each of these words needs to be stemmed, because, you know, gamers on the internet aren't very linguistically amazing.

Yeah, exactly.

So I figured I could create a cache, or, as it's called in more functional terms, a memoization algorithm that saves these answers, I mean, saves the computation, so I don't need to recompute the function, because stemming is a pure function. If you're into functional programming: you don't need to recompute it every single time you get the same input.

Right. If you're guaranteed that with the same inputs you get the same output, then you can cache the heck out of this thing, right?

Exactly. So I went from about 600,000 calls to, like, 30,000, and it was immediate; the whole program ran orders of magnitude faster.

That's awesome. And you know, there are two things I really love about this. One is, you talked a little bit about the solution on the blog, and it's, like, I don't know, nine lines of code?

Yes. That's Python.

It's so awesome.
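The roughly nine-line fix isn't reproduced in the transcript, but a memoizing decorator in that spirit looks something like this sketch. The toy stem function here is a stand-in, not NLTK's Porter stemmer, and in modern Python functools.lru_cache gives you the same behavior for free:

```python
import functools

def memoize(func):
    """Cache results by argument tuple; call func only on a cache miss."""
    cache = {}
    @functools.wraps(func)
    def wrapper(*args):
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]
    return wrapper

calls = 0

@memoize
def stem(word):
    # Toy stand-in for a real stemmer; it's a pure function, so caching is safe
    global calls
    calls += 1
    return word.rstrip("s")

stems = [stem(w) for w in ["runs", "run", "runs", "cats", "run"]]
print(stems, calls)  # five lookups, only three real calls
```

Repeated inputs hit the dictionary instead of the function, which is exactly the 600,000-to-30,000 reduction described above.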
And the other thing is, you didn't even have to change your code, necessarily. You're able to create things that you can just apply to your code; you don't have to rewrite the algorithms or things like that, right?

Yeah. I really find that, you know, the algorithm worked; it got things done, and it did them correctly. I mean, I wasn't opposed to changing the algorithm, obviously, if that had been the hot part of the code. But once I found out that the hot part of the code wasn't even code that we wrote, it was just code that we were calling from another library, and it's probably really optimized; but if you're calling it 600,000 times, well, nothing is optimized when you're calling it hundreds of thousands of times. At that point, you've got to not call it that many times within that time span.

You basically created a decorator that will just cache the output, and only call the function if it's never seen the inputs before, right?

Exactly. Internally, a decorator, all it does is wrap the function in another function. So it adds an internal cache, which is just a Python dictionary, which keeps the function arguments as the key and the output of the function as the value. And if it's been computed, then it's in the dictionary, so all it needs to do is a simple dictionary lookup. That's, like, just one or two Python bytecode instructions, as opposed to calling an entire function itself, which would be hundreds of Python bytecode instructions.

Right, yeah, that's
fantastic.

Yeah. And when it doesn't find the arguments in the cache, I mean, it's just one computation, and for the number of words in the English language that are primarily used, it'll be called much less.

Right. Yeah, so on your typical data set, maybe, I don't know, 30,000, 20,000 times, something like that.

Yeah, yeah.

This episode is brought to you by Codeship. Codeship has launched Organizations: create teams, set permissions for specific team members, and improve collaboration in your continuous delivery workflow. Maintain centralized control over your organization's projects and teams with Codeship's new Organizations plan. And as Talk Python listeners, you can save 20% off any premium plan for the next three months. Just use the code TALKPYTHON, all caps, no spaces. Check them out at codeship.com, and tell them thanks for supporting the show on Twitter, where they're @codeship.

The other thing I like is that the solution was so simple, but you probably needed the profiling to come up with it, right? You know, I have an example that I've given a few times about these types of things: just choosing the right data structure can be really important. I worked on this project that did real-time processing of data coming in 250 times a second, so that leaves four milliseconds to process each segment, which is really not much. But it had to be real time, and if it couldn't keep up, well, then you can't
write a real-time system, or you need some insane hardware to run it, or something, right? And it was doing crazy math, like wavelet decomposition and all sorts of stuff.

Okay, this is like I was saying earlier; it's at the verge of understanding what we're doing, right?

Yeah, and it was too slow the first time we ran it. I'm like, oh no, please don't make me try to optimize a wavelet decomposition; you know, it's kind of like Fourier analysis, but worse. And I'm like, there's got to be a better way, right? So, break out the profiler. And it turned out that we had to do lookups back in the past on our data structures, and we happened to use a list to go back and look things up, and we were spending 80% of our time just looking for stuff in the list.

You just switched it to an O(1)...

Yeah, exactly, we just switched it to a dictionary, and it went five times faster. And it was almost a one-line code change; it was ridiculously simple. But if you don't use the profiler to find the real problem, you're going to go muck with your algorithm and not even really make a difference, because it had nothing to do with the algorithm; it was somewhere else.

Yeah. I definitely find there's also a lot more of a push, like in the Java world, to make things final by default, to try to make them immutable unless they need not to be. And a lot of languages are also embracing immutable-by-default and trying to keep things as strict as possible, so that you can be more lenient when you need to.
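That list-to-dictionary switch is easy to feel in a toy benchmark: membership in a list is an O(n) scan, while a set (or a dict key lookup) hashes straight to the answer. The sizes and repeat counts here are arbitrary, chosen only to make the gap obvious:

```python
import timeit

n = 100_000
haystack_list = list(range(n))
haystack_set = set(haystack_list)
needle = -1  # worst case: absent, so the list scans all n elements

list_time = timeit.timeit(lambda: needle in haystack_list, number=200)
set_time = timeit.timeit(lambda: needle in haystack_set, number=200)
print(f"list: {list_time:.4f}s  set: {set_time:.6f}s")
```

The gap widens linearly with n, which is why a near one-line change produced the five-times speedup in the story.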
And I find the same thing in languages like Python, where I try to use a set unless I absolutely need a list. If I'm just containing elements, a set is much better for finding things.

Right. If you just want to know, have I seen this thing before, a set is maybe exactly the right data structure. Or, if you're going to store integers in a list, you'd be much better off using an array of integers or an array of floats, because it's much, much more efficient. You said one of the things you considered was PyPy. Just for those who don't know, what's the super-quick elevator pitch on PyPy, and what did you find out about using it?

Yeah. So, PyPy is a very compliant Python interpreter that, at runtime, turns the Python code into machine code. It finds the hot parts of your code, what's being run a lot, and it finds a better way to run that for you. So it'll run a lot faster than CPython, because CPython doesn't do many optimizations, by default or just in general.

Right, it runs it through the interpreter, which is a fairly complex thing, rather than turning it straight into machine instructions.

Yeah, yeah, there's a lot more overhead.

You tried out PyPy. Did it make it faster, or did it matter at all?

Oh yeah. I actually used PyPy before even profiling, just to see, you know, just because I was like, let's just see how much faster PyPy is in this case. And it ran about five times faster, because it figured out what to do. But the thing is, under our constraints, we wanted to stick
with a little AWS instance that we were just running every night. And the thing is, PyPy uses more memory than CPython, to support its garbage collector and its just-in-time compiler, which both need to run at runtime. So it uses a little bit more memory, and we didn't really want to spend that money. Because if I can get it down to an hour in CPython and it runs at 2 a.m., no one's going to be looking at that stuff at 3 a.m.

Right, absolutely. And if you can optimize it in CPython and then feed it into PyPy, maybe it could go faster still, right?

Exactly. As opposed to going from ten hours to one hour, if I was running it in PyPy with the cache optimization, it would probably run in, like, 30 minutes, 20 minutes. Like I said, it was unnecessary. It would have been nice, but we didn't need it, so we didn't really feel like spending the time to add that to the pipeline.

Right, sure. It seems like, if you needed it to go faster, you could just keep iterating on this, right? Your decorator will wrap it for every run of your Python app, but you could actually keep those stemmed words...

Yeah, it could save them to a Redis cache, or to a file, or to a database, right? You could just keep going on this idea.

Right.

Yeah, that was definitely something I thought about doing. It was fast enough already, though, and that is where we'll go next after that. I always find
that when you profile and you have a flat profile, with nothing sticking out and nothing that looks like it needs to be optimized, that's when you need to change the runtime. That's when you need to look into FFI, or PyPy, or Jython, with a proper JIT.

Interesting. What's FFI?

So, FFI is the foreign function interface. It's just the general term, used by all languages, for when you can drop down into C.

That's right, or any C-compatible language.

Yeah. So you would basically go write C and then hook into it.

That's right. Or use Cython, for example, which compiles Python down to C, with a weirdly Python-plus-C syntax, if you have to.

Right. Yeah, I've never tried it, but I've seen it, and, like you're saying, you annotate Python variables with C; like, you say double i equals zero, in Python syntax.

That's right. It's really strange.

Yeah, that is strange. So, you know, I talk about PyCharm on the show a lot; I'm a big fan of PyCharm. And they just added built-in profiling in their latest release, 4.5.

Sounds nice.

Yeah. I don't know if it's out or not. It has a visualizer, too, kind of like pycallgraph, but they said they're using something called the yappi (y-a-p-p-i) profiler, and it'll fall back to cProfile if it has to.

Yappi, I've not seen this. Yeah, I was looking at all these profilers. Also, PyPy comes with a profiler that works on Linux, called vmprof. And those are all different profilers, and I looked at them, and they're nice,
sure, but I really loved how simple it was. I mean, I got the results that I needed from cProfile, and it comes with Python. You didn't need to install anything; you just ran the module. And that's why I was so happy with using it and didn't need to try a different profiler.

Right, that's cool. Just python, space, dash-m, space, cProfile, and then your app. Boom.

Exactly. And if you're using something like PyCharm, I mean, if it comes with yappi or any other profiler, definitely use that. I'm sure it's great, because the PyCharm people are awesome. But for the purposes of the simplicity that I wanted in the blog post, I only used the word pip once; I mean, I only used it once, to install pycallgraph. And that's just how simple it needs to be.

Right. You maybe could have gotten away without it, if you really wanted to look at just the text output. The other thing is, you know, this is something you can do really well on your server, right? You can go over and profile something where maybe it has real data it can get to; like, maybe it works fine on my machine, but not in production, or not in QA, or something like that. And so you don't have to go install PyCharm on the server, which you probably don't want to do, SSHed into AWS.

Exactly. Yeah, and that's why Python is amazing for being batteries-included: it includes all of these nice things that any developer is going to need to use eventually.
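Alongside the python -m cProfile command line, the same profiler can be driven from code with cProfile and pstats, both in the standard library. A small sketch, with an invented slow function:

```python
import cProfile
import io
import pstats

def slow_square_sum(n):
    # Deliberately busy function so it stands out in the profile
    return sum(i * i for i in range(n))

profiler = cProfile.Profile()
profiler.enable()
slow_square_sum(100_000)
profiler.disable()

# Sort by cumulative time and show the top five entries
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
report = out.getvalue()
print("slow_square_sum" in report)  # the hot spot shows up by name
```

This is handy on a server, since nothing needs to be installed and the text report can be captured and shipped anywhere.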
Do you want to talk a little bit about open source, and what you guys are doing with open source at HumanGeo?

Yeah. So, at HumanGeo, we have a GitHub profile, github.com/humangeo, and most of our open source stuff is the JavaScript Leaflet stuff that we've incorporated; we add themes and things. So if you want any map visualizations, I like it much better than the Google Maps API, using Leaflet, so I'd recommend looking at the stuff we've done there. And we also have a couple of Python libraries. You know, we have an Elasticsearch binding that we've used, which has since been superseded by the Mozilla elastic binding. So we definitely love open source at HumanGeo; we make and use open source libraries, and that's one of my favorite parts about HumanGeo.

That's really awesome. And you guys are hiring, right?

Oh yeah, we're always hiring. We're looking for, you know, great developers of any age, Python, Java. Like I said, we won one of the best places to work for millennials in the DC metro area.

Right, and for the international listeners, that's Washington, DC, in the United States.

Yes, Washington, DC. Yeah, no worries. Whether you have a government clearance or not, just send an email, or call, or get in touch if you want to work at an amazing place that does awesome Python. I love their code bases.

Yeah, that's really cool. It should definitely reach a bunch of enthusiastic Python people, so if you're looking for a job, give
those guys a ring.

That's awesome. Definitely. So, you guys are big fans of Python, right?

Oh yeah. I mean, they've been using Python since the company started. I was looking at the original commits, from when the company started, and they're all Python. It's really exciting; we've been using it since the beginning. It's amazing for rapid iteration, it's fast enough, obviously, and it's super easy to profile when you need to. That's another reason why it's amazing: it's not just that you can say it's slow; it's easy to optimize in that case.

Yeah, that's really cool. So, are there some practices or techniques that you guys have landed on, or focused in on, having been doing this for so long, that you could recommend?

At HumanGeo, we don't go, like, full agile or anything, but we definitely consider ourselves fast-moving, and we work at a very great pace, so I guess you could call it agile. And we have great git practices: we use git-flow, and we make sure to have good code review. In any code base, including Python, you've got to have good code review. And when I write Python code, I write a lot of unit tests for my code, things like that.

Nice. So is that the unittest module, or py.test, or...?

Oh yeah, the standard library. I just love standard library stuff.

Yeah, it's the batteries included, right?

Yes, definitely.

So, anything else you want to touch on before we kind
of wrap things up, Davis?

No, I'm just really excited to be given this opportunity, and I wanted to give a shout-out to all my amazing colleagues at HumanGeo.

Yeah, that's awesome. You guys are doing cool stuff, so I'm glad to shine a light on that. So, final two questions before I let you out of here. Favorite editor: what do you use to write Python these days?

I'll say that for all my open source work, like, I work on a website for the DC Python community, I use Sublime Text; for all that open source stuff, all my tools that I use. And when I work for a company, I ask them for a PyCharm license.

Yeah, nice.

Because PyCharm is great for big projects that you can really focus on.

Yeah. You know, like I said, that's my favorite editor as well. There's definitely a group of people that love the lightweight editors, like Vim and Emacs and Sublime, and then people that like IDEs; it's a bit of a divide. But I feel like, when you're on a huge project, you can just understand it more in its entirety using something like PyCharm. So, yeah, I like it.

My favorite thing about PyCharm is that you can control-click or command-click on a module, and it'll take you to the source for that module, so you can really quickly look at where the code is flowing in the source.

Yeah, absolutely. Or, hey, you're importing a module, but it's not specified in the requirements; do you want me to add it for
you, for this package you're writing? Right, stuff like that. It's just, it's sweet.

They have great support for the tooling and the tool chain of Python.

Yeah. Awesome, Davis, this has been a really, really interesting conversation, and hopefully some people can go make their Python code faster.

Yeah, I definitely hope that they will, and I hope they learned a lot from this.

Yeah, thanks for being on the show, man.

No problem, thank you so much.

Yeah, talk to you later.

This has been another episode of Talk Python To Me. Today's guest was Davis Silverman, and this episode has been brought to you by Hired and Codeship. Thank you guys for supporting the show. Hired wants to help you find your next big thing: visit hired.com/talkpythontome to get five or more offers, with salary and equity presented right up front, and a special listener signing bonus of four thousand dollars. Codeship wants you to always keep shipping; check them out at codeship.com, and thank them on Twitter via @codeship. Don't forget the discount code for listeners: it's easy, TALKPYTHON, all caps, no spaces. You can find the links from today's show at talkpython.fm/episodes/show/28. Be sure to subscribe to the show: open your favorite podcatcher and search for Python; we should be right at the top. You can also find iTunes and direct RSS feeds in the footer of the website. Our theme music is Developers, Developers, Developers, by Cory Smith, who goes by Smixx. You can hear the entire song at talkpython.fm. This is your host, Michael Kennedy. Thank you very much for listening.
Smixx, take us out of here.

Stating with my voice, there's no norm that I can feel within. Haven't been sleeping, I've been using lots of rest. I'll pass the mic back to who rocked it best. Developers, developers, developers, developers...