Tommy Blanchard

Cognitive Wonderland

2024-08-20T00:00:00-04:00

I haven't been active on here for a while, but I've been active on my Substack newsletter/blog. Follow me on Cognitive Wonderland

Data Bullshit and Data Illiteracy

2020-11-07T00:00:00-05:00

Something I see too often in both immature data science practitioners and data science teams is overcomplication: the desire to use complex neural networks when there isn't enough data to support them, or to include all kinds of variables that don't actually add predictive value.

Ironically, using a simple solution is often a sign of sophistication. The experienced data scientists I know can tell when complexity is warranted, but at the end of the day, the simple approach often gives the best reward for the effort.

For inexperienced data practitioners, the reason for unnecessary complexity is obvious: it's a combination of wanting to demonstrate you know complex methods, and the lack of experience to know when enough is enough. That can be fine when building up a portfolio and gaining experience.

In immature data science teams, the issue is usually a more serious problem. It leads to what I'm calling Data Bullshit: data science projects that are overly complicated for the sake of sounding complicated.

Data Naivete Leads to Perverse Incentives

One of my previous bosses would always brag about how the models our team was making "included thousands of variables". This was true. What he didn't say was that we could get the same model performance with a tiny fraction of those variables. The additional variables were just bloat: they complicated the extraction of all of the data elements, added unnecessary complexity to the code that made it harder to maintain, and added literally months to development time.

There was a ton of low-hanging fruit, places we could be adding tremendous value with small, properly targeted projects, but none of them sounded spectacular enough. It was easier to get a spotlight if he claimed we were doing things that sounded super sophisticated, that involved skills that no other team had, even if the simple stuff would have worked just as well.

This is a pattern I see repeated: a leader who only knows enough to be dangerous, a data-naive organization where people can't tell what's bullshit, and incentives to justify a data science team because of their expensive data science salaries.

Widespread data naivete in an organization, where people are wowed by the talk of statistical models and machine learning sounds like magic, results in these perverse incentives - especially when there is data naivete at the executive level. If the people at the top don't know how to smell out data bullshit, they're going to be wowed by the complexity of the solution, and incorrectly think the complexity was required, justifying the expensive team. In reality, a simpler solution would be better and more quickly implemented - but those solutions are too easy to explain, and therefore don't show how the data science team is "special" for having come up with them (even though it often takes more experience to spot where a simple solution can be used than it takes to throw the kitchen sink at a problem).

This trend toward unnecessarily complicated methods reached an extreme at this organization when one executive suggested we capture sound data at some of our sites to see if there were patterns between compliance issues and sound features.

This was going to be a huge logistical nightmare - literally planting recording hardware in thousands of sites, building data pipelines to collect and process the data, dealing with potential legal issues of placing recording devices, collecting data for months just to get a handful of positive data points, etc. There was going to be so much variance in the data due to dumb factors like where the device was placed in a site, and it was clear to every data literate person that there was no way the data was going to be helpful for this classification.

More importantly, there were plenty of other ways of getting more relevant data that wasn't currently being used - one of the big things they wanted to look for in the noise data was the sounding of alarms, but we could just pull alarm triggers more directly without needing any hardware! But none of that was good enough, because this executive thought this idea would sound good. And it did - he gave talks about this idea and drew a lot of attention for his innovative thinking.

I was one of two data scientists on the team assigned to this project. We both knew it was doomed to failure but couldn't convince the data-illiterate leadership. It lowered morale and took a huge amount of our time and resources. I ended up leaving the company, and was happy to wash my hands of the project.

Conclusion

Data illiteracy leads to data bullshit. Leadership in a company using data science needs either to be data literate, or to be willing to listen to those who are. Otherwise, you're going to end up with time-wasting projects, lose your best data scientists, and find that the projects that do get completed are overly complicated messes.

Data Science Bootcamps - What They Can and Can't Do

2020-10-30T00:00:00-04:00

It's pretty common that I get asked if a data science bootcamp (or a part-time Master's program, which for the most part I put in the same category) is worth it. Just to get my credentials out of the way as I weigh in: I have taught for a couple of Master's programs and an online bootcamp, as well as having gone through a bootcamp myself when I was making the transition into data science.

I've written some thoughts on bootcamps in the past. The trouble is, it's a hard question to give a single general answer for. Everyone has different skills and different goals. However, I think one thing is pretty constant: bootcamps tend to overpromise to those without a highly technical background. If you're not already most of the way there, a bootcamp won't make you a data scientist.

Different starting points

Let me give a few thoughts on bootcamps for people at different stages in their data science learning:

Starting from scratch: Let me be blunt. If you have little to no experience with programming, data, or statistics, a 6-month part-time program isn't going to make you a data scientist. You might learn some cool skills, and if you put in the work you might land an entry-level data analyst position. Data science positions are generally considered pretty senior, or at least highly skilled. A short program simply isn't enough time to gain the levels of competency expected in the foundational skills.

Academic backgrounds: I think those coming from a quantitative academic background are some of the best candidates for bootcamps (full disclosure, this was my background). Most quantitative academics have passable coding skills, know statistics, and have experience working with and telling stories about data. Bootcamps give a taste of the breadth of tools out there, some practice using some of the popular tools in data science, and some coding practice. Generally, people with this kind of background are the ones I see most reliably go on to actually get data science positions following a bootcamp. As an added bonus, there are programs like Data Incubator and Insight that are free for promising candidates coming from academic backgrounds. That said, many make the transition without needing a bootcamp, so while I think they can be useful, it might be worth getting a sense for your prospects on the job market without one first.

Software engineering background: Honestly, I don't see many software engineers taking bootcamps. I think that makes sense - for the most part, if you're already working in tech, you know the space and the transition isn't as big as it is for those coming from academia. You probably have the skills to learn from all of the great free online resources. There is a huge need for data scientists who have serious coding skills, and often people going this route end up with the title "machine learning engineer". That said, for a software engineer who wants some background in data science and a more structured curriculum, I think a bootcamp could be a good way of getting it, since it will provide the statistics and machine learning background that they probably lack, but it's definitely not a necessity.

Data analysts: Bootcamps can certainly "level up" your data skills. Depending on where your skills are now, that could be a great way of trying to either work your way into a more senior data analyst position, or try to make the jump to data scientist. But it depends a lot on what your current role entails - if you haven't used Python at all, and mostly just use Tableau, a bootcamp is going to introduce you to things but is unlikely to be enough to get you to make the leap to data scientist. It also might not be useful for your current role if there are no opportunities for using machine learning. If you see places for using data science skills in your current role and the only thing holding you back is a lack of skills of your own, a part-time bootcamp can be super useful.

Bootcamps and other part-time programs are great for introducing you to the wide range of data science tools out there. They are not enough on their own to make you an expert, and they are not transformative. If you're just curious about data science, there are a ton of online resources for learning. If you want to do something more directed and don't have serious financial constraints, a bootcamp can be a reasonable option. Just don't go in expecting that a bootcamp will take you from 0 to 6-figure data scientist in a few months.

Decision Theory and Captain Picard's Leadership

2020-02-02T00:00:00-05:00

With the start of Star Trek: Picard recently, I've been thinking about one of the big impacts Captain Picard has had on my life (as a lifelong geek). There is an episode of Star Trek: The Next Generation called "Attached", where Doctor Crusher and Captain Picard become mind-linked while trapped on a strange planet. The pair need to start moving on foot through the wilderness, but Crusher has no idea which direction they should start hiking in. Picard simply says "This way," and starts walking. Crusher, able to read Picard's thoughts (because, mind-link) is shocked to realize that Picard has no better idea than she does and is just arbitrarily choosing a direction, but acting confidently about it.

Crusher takes this to say something deep about one of Picard's traits that make him an effective leader: his ability to act confident despite uncertainty. But there's a more general principle here that comes from normative decision theory.

The decision Picard is making is between walking in different directions, without having any reason to prefer one over the other. Time is of the essence, and the morale of his crew (in this case, only Crusher) is important. Given the facts, Picard's real choice isn't what direction to walk in, but whether it's worth it to stop and consider before making a decision. Given that all directions seem the same, and there's no information to be gained, the decision should be as quick and confident as Picard made it. The potential payoffs and risks of each option are the same. There is a cost to stopping to consider: it takes time, and it can lower morale among crew because he is giving an indication of uncertainty. Picard therefore should do what he did: make the decision of which direction to go based on a mental coin flip. He's able to act confidently because he actually has made the best choice he could have: the choice to not waste resources deliberating.

When is the best course of action to just arbitrarily choose an option and get on with it? The value to be gained from stopping to think about a decision rather than just using a mental coin-flip is proportional to the difference in the value of the options. If choosing between $20 and $15, the value of being deliberate and not just tossing a coin is $2.50 ($5, cut in half since if you flipped a coin half the time you would choose the $20). When put in terms of money, decisions are easy because we just look at what number is larger. In real life, we have options that we might not explicitly put a number on, but we have some subjective valuation of them that's useful to think of as a number (e.g. the amount you would pay to take that option). Similarly, you can imagine putting a price on how much you would pay to get rid of the anxiety of the decision and have your time back. Deliberate decisions have a cost.
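To make the arithmetic concrete, here is a minimal sketch in Python. The function name `value_of_deliberation` is just a label I'm using for illustration, and it assumes that deliberating always finds the better option and that the alternative is a fair coin flip between exactly two options.

```python
def value_of_deliberation(value_a: float, value_b: float) -> float:
    """Expected gain from picking the better option on purpose
    instead of flipping a fair coin between the two.

    Assumes (for illustration) that deliberating always finds the better option.
    """
    best = max(value_a, value_b)
    coin_flip = (value_a + value_b) / 2  # expected value of a fair coin flip
    return best - coin_flip


# Choosing between $20 and $15: half of the $5 difference.
print(value_of_deliberation(20, 15))  # 2.5
```

The result is always half the gap between the two options, which is why the value of deliberating shrinks as the options get closer together.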

If choosing between an obviously good option (one you would pay $100 for) and an obviously bad (one you would pay $20 to avoid), there's quite a bit of value to making a deliberate choice. However, in most cases that choice will be obvious and therefore require little deliberation. High payoff and low cost since the deliberation will be quick, so you should definitely actually consciously choose. It isn't hard and doesn't take much deliberation to decide you would prefer to get a free fancy dinner over having smelly garbage dumped all over your kitchen. If the values are closer together, like choosing between two high-end restaurants, the decision might be harder. Is one worth $105 to you and the other worth only $104? It might be hard to distinguish your true valuations of the two options and which is higher because they are so close. But because they are close, the value of a deliberate choice is diminished -- you'll end up with a good dinner regardless, and the difference in value to you might only be a couple of dollars. Choosing between two very different options, like a new pair of jeans and a high-end meal might also be difficult, but the more difficult it seems, the closer the values of those options must be to you, so the less it matters what you choose. Again, you'll be happy with either.

If there's low or no value to a deliberate choice, the costs of making a deliberate choice can easily outweigh the value of having made the deliberate choice. Ironically, these are also the situations in which we're most likely to be struck with indecisiveness, simply because it isn't so easy to distinguish the values of the options. We can go back and forth, imagining taking one option and then the other, and being unsure which we value more. If the options are complicated, it might be worth it to give the decision a bit of thought and make sure you're not missing something obvious, but beyond that you're facing the same situation as Picard: two options very close in value, and the real decision is how long you're going to sit and think about it. The longer you think about it, the bigger the cost -- in terms of time, the anxiety of making the decision, the mental effort, etc. Since the options are close in value, it's very easy for these costs to outweigh the benefit of choosing the better option (if there even is one). Even with large life decisions, if a large difference in value doesn't appear after thinking about it for a day, is it worth being anxious and uncertain about it for a week if the value difference between them is so indistinguishable? Likely not.
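The trade-off above can be sketched as a simple comparison. Everything here is a hypothetical illustration: the function name, the idea of pricing deliberation by the hour, and the dollar figures are assumptions, not measurements.

```python
def should_deliberate(value_a, value_b, hours_needed, cost_per_hour):
    """Return True if deliberating is expected to pay for itself.

    The benefit is the expected gain over a fair coin flip (half the gap
    between the two options). The cost is a hypothetical price on the time,
    anxiety, and effort spent deciding.
    """
    benefit = abs(value_a - value_b) / 2
    cost = hours_needed * cost_per_hour
    return benefit > cost


# Two restaurants valued at $105 and $104: the benefit is only $0.50,
# so even 15 minutes of agonizing at $10/hour ($2.50) is not worth it.
print(should_deliberate(105, 104, hours_needed=0.25, cost_per_hour=10))  # False

# A fancy dinner ($100) versus garbage in the kitchen (-$20): the benefit
# is $60, and the choice takes almost no time.
print(should_deliberate(100, -20, hours_needed=0.01, cost_per_hour=10))  # True
```

Putting a number on the cost is the shaky part, but even a rough guess makes it clear how quickly the cost of dithering overtakes a small difference in value.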

When I face a decision where there isn't an obvious right answer, especially when the stakes are low, I often think of Picard and make the decision quickly. I'm put at ease knowing that by just arbitrarily picking an option, I'm actually making the right choice -- the choice to not pay the price of being indecisive.

Reboot

2020-01-23T00:00:00-05:00

A lot has changed since I started this website and blog.

When I started, I was just starting my career as a data scientist. I was looking for an outlet for posting projects so I could build up a bit of a portfolio and online presence. This was a career-building move. (I also had a bunch of pent-up thoughts about leaving academia that I felt I needed a venue for).

It wasn't long before I got tired of doing data science during the day at my job, and in the evening for this site, even if the projects I was doing for this site were pretty fun, light projects. Instead, I transitioned to making this site mostly a place for me to post thoughts about data science and connect with the data science community. This was easier on me, allowed me to have my online presence, and it was rewarding to have my stuff discussed on Reddit, have people comment here, or have people reach out to me directly.

My main source of readership was https://www.reddit.com/r/datascience/. The subreddit was at the time plagued with posts about people asking how to get a data science job, which seemed like 95% of the content on the subreddit. But one of the moderators seems to have decided that percentage was too low and started to clamp down on people posting their blogs, and removed one of my posts (which had already generated a bunch of discussion) and told me not to post another. So down my post went, and with it my primary audience.

It was hard to stay motivated to write posts after that. Why write posts to enter a discussion with the data science community when there was no venue for connecting the community to what I was writing? So this site went dormant.

Nowadays, I'm far enough along in my career that I don't think my web presence is going to matter too much one way or another. So I'm doing a bit of a reboot, and making this less of a (poor attempt at a) professional blog and more of whatever I feel like writing about. I'll probably still write about data science stuff, but I'm not limiting myself to that. We'll see what happens!

The Data Incubator Unofficial Frequently Asked Questions

2018-05-30T00:00:00-04:00

About a year ago I wrote a review of The Data Incubator (updated review is here). I always know when the Data Incubator application season is here because I always get a few people who have found my blog reaching out with questions about the process. I decided to put together a short list of some of the most common questions I get asked.

Should I do the program?

The program has pros and cons. For some people I think the program is great. It provides an overview of a lot of data science topics and practice with concepts. However, it’s probably not as useful for people weak in programming or those that hope it will help them network or get them a job in a specific area/industry. In general, if you’re not accepted as a Fellow (tuition-free), I think it generally isn’t worth it. You can read more about my thoughts on the program in my review of it.

How good is the network of hiring partners?

Much worse than you would expect from the slick marketing. Around the areas the program is located, especially NYC and the Bay Area, there are a fair number of hiring partners. However, the types of industries are pretty limited. Don't expect to be able to work in gaming or non-profit; those jobs are extremely rare despite being listed alongside industries with way more hiring partners (like finance). If you have strong geographic constraints, don’t expect much from the program’s network.

What is the admission process like?

It’s long and involved, especially compared to other programs like Insight. There are three stages to it. First there is an initial web application that is just a typical application asking for relevant background, education, why you want to do the program, where you could work, etc. If you make it through that stage, you are considered a semi-finalist, where there are some more involved challenges thrown at you (see below for more info). Finally there is the finalist stage, where you go through a quick interview.

What are the semi-finalist challenges like?

The hard part of the semi-finalist challenges is a couple of tough programming questions. You have a limited amount of time to do them (~4 days), with no flexibility about the schedule for doing them. Being very comfortable with Python will help a huge amount here. For a sense of the kinds of questions they like to ask, check out their blog post on efficient numerical computation; they love asking you to make use of the concepts there.

In addition to the programming challenges, you’ll be asked to come up with an idea for your capstone project and submit a short video describing it. They also want a preliminary data visualization, and they like it to be interactive (ideally hosted on Heroku; see their post on getting set up on Heroku).

I'm a finalist, what is the interview like?

The final interview is pretty superficial. It will happen with a group of other candidates. You’ll have 2 minutes (seriously, that’s all) to pitch your project and show a couple of preliminary analyses. The other candidates will have a chance to ask one or two questions. Then it will move on to the next candidate. The interviewers won’t ask any technical questions, but will ask things about where you’re willing to work. For the whole group of 4-6 candidates, the interview will only be 30 minutes.

What should I do for my capstone project?

It’s nice but absolutely not necessary to have a project dealing with data in the industry you would like to work in. Don’t be overly ambitious, make sure you can get the data pretty easily, and just plan to build a predictive model of some kind. For your video and initial data analysis, keep it simple: all you’re doing is showing that your idea is feasible and interesting. When you start working on your project, having a nice presentation/app is usually more important than building a well-performing model.

That's all for now. Feel free to reach out with any additional questions not asked here. I would be happy to add to the list!

An Updated Review of The Data Incubator Data Science Bootcamp

2018-05-29T00:00:00-04:00

A while ago, I wrote a review of The Data Incubator based on my experience in the program. Since then, it’s been the most common reason people reach out to me. I’ve had people reach out to tell me how the program went for them, to ask me questions about the program, or to ask advice. Since this happens so frequently, and my review is a bit out of date now, I figured I would write an updated version, taking into account what has changed (and what hasn’t) according to those who have been through the program after me. I've also written a collection of the most common questions I get asked about the program, with my answers.

Curriculum

The curriculum is the right level of breadth. A lot of topics are covered, but with enough depth that you get a reasonable introduction to them. The topics covered are pretty standard and there are no major omissions, though arguably it is a bit weak on statistics (it focuses much more on coding skills). It’s a good overview of the broad strokes of data science, which I think is the benefit of bootcamps like this.

Lectures

While the topics covered are good, the lectures themselves are lacking. Many of the lectures are taught remotely with very poor A/V equipment, making it difficult and annoying to interact with the instructor. The lecture quality varies a lot, but tends to not be very useful. I and many others tended to use lecture time for eating lunch so it wasn’t a total waste.

Assignments

The assignments are possibly the greatest strength of the program. While there are annoyances with the automated marking, the assignments have the great benefit of forcing you to get your hands dirty with each topic and get some experience. This, in my opinion, is the benefit of The Data Incubator over its main competitor, Insight, which essentially just gives a list of topics and resources and leaves it to the fellows to learn and get experience.

Coding exercises

Every morning in the program starts with coding exercises: programming problems that you might get as a technical exercise at a job interview. I found these extremely useful as they made me much more confident in Python. Opinions differ, though: some people found them frustrating and discouraging. As these challenges are not unlike a lot of the ones you’ll see as the technical part of a job interview, I tend to think they are valuable.

Job hunt support

In general, I thought there was good written content and advice on how to interview and build a data science resume. However, the lectures on these topics were significantly less useful. The alumni network is very weak, giving few ‘ins’ and networking opportunities from the program. In comparison, Insight ends up being very good at this, and fellows in that program end up with a strong network and community after completing the program. The hiring partners, the companies that hire from the bootcamp, are fewer than you would expect from the slick advertising, and are much less diverse than you would think (fewer industries and locations). This is a major issue, as some people from my cohort dropped out after seeing how few opportunities there were in the area they wanted to work in.

Organizational culture

I’ve heard mixed opinions on this, but the majority of people I’ve talked to agree that the culture of the organization feels a bit off. The demeanor of some career advisement staff (to name names, Alyssa Thomas, director of program experience and career advising) can be very condescending. In addition to this, fellows are punished for not completing their assignments by being cut off from the portal to interact with employers, and this punishment is wielded pretty arbitrarily to punish fellow behavior they don’t like (such as turning down a bad job offer). A number of other factors make the program feel much less welcoming an environment, which is too bad. I would love to love The Data Incubator, but instead was left with mixed feelings about it, largely because the culture led to a mixed experience.

Final thoughts

Overall the program is good for what it is. The curriculum and assignments are solid, and I learned a lot from the program as well as the other fellows. The program has problems, but all programs do. I wouldn’t wholeheartedly endorse the program, but I think it is a good way to go for a lot of people looking to break into data science, especially those coming from academia.

Implementing data science in healthcare

2018-05-15T00:00:00-04:00

I recently wrote a blogpost for datascience.com, which you can find here. It’s about some of the lessons I’ve learned from implementing data science projects in healthcare for clinicians. It isn’t about the technical side of things, but about the challenges of making sure your project is actually useful and fits into the workflows of clinicians.

Automated machine learning is coming... and it won't matter

2018-04-04T00:00:00-04:00

Recently, I’ve been seeing a lot of services and products advertising automation of machine learning. Data Robot and H2O.ai offer platforms that allow the creation of machine learning algorithms in point-and-click interfaces. They’ll even do the feature engineering for you! This functionality, or something like it, is slowly being built into various tools and programs. They promise to automate the creation of the whole machine learning pipeline - from feature transformations and hyperparameter tuning to model selection. There are open-source tools that do much the same things (like TPOT, a cool module I love the idea of but can never get to actually work on a data set that isn’t trivially small).

Right now, these tools mostly aren’t great and/or are absurdly expensive (for the cost of a subscription to Data Robot, you can employ a full-time data scientist). But I have no doubt that soon tools will exist that will completely take care of the model/hyperparameter/feature-transformation process.

I’ve had people ask me if I’m worried about my job security as a data scientist. No, I am not. I can’t wait until these tools are there and open source so I can just type “import machinelearn” and just have it do the stupid hyperparameter optimization and I can get on with the hard part of the job.

When I get data to the point where it could conceivably be ingested by one of these tools, the problem is basically done. At that point I need to run a bit of code to do the grid search and find a reasonably decent model and tune the hyperparameters. Hell, if I just ran XGBoost with the default parameters at this point it would usually be almost as good as I am ever going to get it anyways. Doing the extra work of tuning things a bit more is only worth it because it’s relatively easy, and you very quickly get to the point of diminishing returns (unless you’re in a Kaggle competition, where even diminished returns might take you from 10th place to 1st so you milk every tiny incremental increase in accuracy you can).
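For a sense of how little code that last step is, here is a minimal sketch of a grid search compared against a "just use the defaults" baseline. It is a toy stand-in, not what I would run at work: the model is a hand-rolled k-nearest-neighbours classifier rather than XGBoost, the data is synthetic, and the grid and the "default" of k=5 are assumptions chosen purely for illustration. In practice a library helper such as scikit-learn's `GridSearchCV` does this job.

```python
import random


def make_data(n, rng):
    """Two noisy 2-D clusters labelled 0 and 1 (synthetic, for illustration)."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 2.0 * label
        point = (rng.gauss(center, 1.0), rng.gauss(center, 1.0))
        data.append((point, label))
    return data


def knn_predict(train, point, k):
    """Majority vote among the k training points closest to `point`."""
    nearest = sorted(
        train,
        key=lambda row: (row[0][0] - point[0]) ** 2 + (row[0][1] - point[1]) ** 2,
    )[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0


def accuracy(train, valid, k):
    """Fraction of validation points classified correctly."""
    hits = sum(knn_predict(train, p, k) == y for p, y in valid)
    return hits / len(valid)


rng = random.Random(0)
train, valid = make_data(200, rng), make_data(100, rng)

DEFAULT_K = 5                # the "just use the defaults" baseline (assumed)
grid = [1, 3, 5, 9, 15, 25]  # hypothetical search grid, includes the default

# The whole "grid search": score every candidate and keep the best one.
scores = {k: accuracy(train, valid, k) for k in grid}
best_k = max(scores, key=scores.get)

print(f"default k={DEFAULT_K}: {scores[DEFAULT_K]:.2f}")
print(f"best    k={best_k}: {scores[best_k]:.2f}")
```

The thing to look at is the gap between the two printed scores; the exact numbers depend on the synthetic data. The search itself is three lines, which is the point: this is the easy part.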

Once you have your data in the format where you could make a Kaggle competition out of it, you’ve done the hard part. I would love it if at that point I just ran a single function that did a well optimized search that was way more thorough than my typical grid searches, and also explored some different feature transformations. Maybe my models would do marginally better, and I would save myself a few minutes writing the code. It would be nice. But if it would put you out of a job, maybe you should be seriously thinking about what skills you bring to the table.

In most data science positions I’ve heard of, the hard part isn’t building a model once the problem has been framed, data collected, samples chosen, and data is in a neat one-row-per-sample format. The hard part is getting to that point. While I don’t doubt some of these steps will be made simpler in the future as tools evolve, I can’t see anytime in the near future where the whole process could be easily automated. Translating a business problem into a prediction problem is hard and requires a lot of business knowledge coupled with abstract, quantitative thinking. Figuring out what data to use and how to get it is hard - businesses evolve and the data infrastructure isn’t always so clean, so there aren’t ready solutions here. Choosing an unbiased sample set for training can be extremely difficult and there isn’t a cookie-cutter solution to this. Most often, some structure needs to be imposed on the data from knowledge about the particulars of the problem.

I have no doubt that in the next few years, we’ll have some nice tools for automating the building of a machine learning pipeline. Hopefully once that problem is rendered trivial, fewer aspiring data scientists will try to prove their skills by showing off how accurate their model is on the Iris data set. I don’t see much impact on the field beyond that.

It’s okay to not be a data scientist

2018-02-20T00:00:00-05:00

Everyone wants to be a data scientist

Being a data scientist has been hyped a lot in the past few years. Glassdoor has listed it as the #1 Best Job in America 3 years in a row, and it isn’t hard to find blog posts talking about how great being a data scientist is. It’s no surprise, then, that the main data science subreddit seems to be mostly people asking for advice on how to become a data scientist, rather than about data science.

I’ve been feeling pretty ambivalent about a lot of the advice I’ve seen (and previously given), and I worry it’s been doing people a disservice. A lot of the advice out there (and advertising from certain online courses/bootcamps) makes it sound like the only things you need to be a data scientist are a few technical skills - so it’s no surprise that there are questions like this asking why the pay gap between data analysts and data scientists is so large if the lists of technical skills each need to have are the same.

Why isn’t everyone a data scientist?

The truth is, there is more to being a data scientist than learning how to import sklearn. I have to tread carefully here. I don’t want to be seen as gatekeeping. But I think there needs to be some reality check on the idea that anyone can do one machine learning project and put it on github to land a job paying a 6-figure income.

According to the 2017 Burtch Works ‘Salaries of Data Scientists’ report, about 90% of data scientists have advanced degrees (~40% PhDs, ~50% Masters). Of those 10% without a graduate degree, my guess (I don’t have stats on this) is that the majority have a lot of experience either with pretty high-level data analytics or software engineering.

There are reasons for the large proportion of graduate degrees. Graduate degrees (especially PhDs) in quantitative disciplines indicate that a person has had a certain amount of experience with exploring abstract concepts, developing intuitions for statistical relationships, devising ways of testing hypotheses, translating data into stories, communicating complex results, etc. It may only take a couple of months to learn Python and how to wrangle data in Pandas and build a model in sklearn, but that’s just the surface-level stuff. The real world has complications that require you to have an intuition behind the math and understand what will and won’t work - and how to show that it will or won’t work.

I don’t mean to imply a PhD is required to be a data scientist - that is empirically false (about 60% don’t have one, after all). There are many different paths to becoming a data scientist. But there’s no easy path.

The thing to realize about all of those data science bootcamps and online courses is this: for most of the people who get a data scientist job after them, the skills they learned in those courses were just the icing on the cake. If you have a quantitative PhD, or tonnes of software engineering experience, or years of working closely with data scientists as a data analyst, maybe all you need is a short course to pick up how to use a couple of Python packages and you’ll be competitive. If you don’t, most likely you have a longer road to being a data scientist.

You don’t have to be a data scientist

Gaining new skills is fun and probably good for your career regardless of what title you end up with. It’s worth picking up skills that allow you to do things you enjoy. If you currently don’t work as an analytics professional, learning online or doing a data science course could certainly help land that first data analyst position. Maybe if you’re a current data analyst working at a high level, learning some of those skills and incorporating them into your work could eventually lead to a data science position down the road.

But there are more ways ‘up’ than one. Being a data scientist doesn’t need to be your goal. You can end up on a management track and end up leading a team of data analysts. That may lead to managing a large data analytics group, or even to being Chief Analytics Officer one day. My point is, a lot of people are a bit too obsessed with the title ‘Data Scientist’. It’s a good job, but I think people are overvaluing it and underestimating the expected level of experience for it. There are lots of great analytics jobs (and lots of great non-analytics jobs, for that matter), and I think it would benefit some people to take their eyes off the shiny, hyped up ‘data scientist’ title and judge their options more objectively.

Performance metrics aren't everything

2018-02-09T00:00:00-05:00

Lately I’ve been getting pretty annoyed with an obsession with performance metrics. It’s like someone let the word out about what area under the ROC curve is and suddenly everyone thinks that’s the only measure of whether a data science project is ‘good’ or not.

The problem is, whether a model is good or not typically relies on much more than [insert your favorite performance metric here]. Yes, if your model is predicting at chance, it’s almost certainly useless. But even slightly above chance it might be immensely useful (conversely, even with perfect predictions it might be useless). The usefulness of a predictive model is a function of what it enables you to do that you wouldn’t be able to do without it. What actions or interventions will the project as a whole allow that would not otherwise happen, and how valuable are those?

Unfortunately, a little knowledge can be a dangerous thing. I’ve had projects where the predictive accuracy of a model is completely irrelevant - the model was built just to eliminate the effects of a few variables on some outcome (i.e. to get the residual). After showing how well the model captured the trends in those few variables, the non-technical people who had seen a predictive model before immediately started asking about the area under the ROC curve, and judging the project on this. Yet the value of the project was just removing those trends from the outcome, which the model did beautifully.
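To make "getting the residual" concrete, here is a minimal sketch, not the model from that project. It assumes a single nuisance variable and a straight-line fit; the names `age` and `cost` and all of the numbers are made up for illustration.

```python
from statistics import mean

def residualize(x, y):
    """Fit y = intercept + slope * x by least squares and return y minus the fit."""
    x_bar, y_bar = mean(x), mean(y)
    slope = (
        sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
        / sum((xi - x_bar) ** 2 for xi in x)
    )
    intercept = y_bar - slope * x_bar
    # What is left over is the part of y that the trend in x does not explain.
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Hypothetical data: an outcome that mostly just rises with age.
age = [30, 40, 50, 60, 70]
cost = [110, 190, 320, 390, 510]

residuals = residualize(age, cost)
```

The residuals are the deliverable here. How well the fit would score as a classifier never enters into it, which is why asking about the area under the ROC curve misses the point for a project like this.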

Return On Investment (brought up often and referred to as ROI by those who want you to know they’ve taken a business class) is a better concept to keep in mind for thinking about the value of a project, even if in practice it’s often not possible to work it out exactly. What is the (value - cost) of having this model in production? The cost of false predictions and the payoff of true predictions are just as important to figuring this out as the predictive accuracy of a model. There is a lot that a data scientist can do to alter these costs and payoffs, and sometimes that’s a better place to focus effort than getting an extra 0.00001 precision in your model.
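A rough sketch of that (value - cost) arithmetic is below. The function name, the counts, and the dollar figures are all hypothetical, and real projects rarely have numbers this clean, but it shows how changing the cost of a false prediction can matter more than the model's precision.

```python
def net_value(true_pos, false_pos, payoff_per_true_pos, cost_per_false_pos):
    """Value minus cost of acting on a model's positive predictions."""
    value = true_pos * payoff_per_true_pos
    cost = false_pos * cost_per_false_pos
    return value - cost

# Model A: 80% precision, but every false alarm triggers an expensive intervention.
model_a = net_value(true_pos=80, false_pos=20,
                    payoff_per_true_pos=100, cost_per_false_pos=200)

# Model B: only 70% precision, but the workflow around it was redesigned
# so that a false alarm is cheap to dismiss.
model_b = net_value(true_pos=70, false_pos=30,
                    payoff_per_true_pos=100, cost_per_false_pos=20)

print(model_a, model_b)
```

With these made-up numbers the less precise model comes out ahead, purely because the cost of being wrong was reduced.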

For example, often a model is expected to not only make a prediction, but give some idea of why that prediction was made. This is what makes tools like LIME that explain where a prediction has come from so useful. Unfortunately, they don’t completely solve the issue because explanations are complicated. Some explanations might be less useful than others. Explaining that a patient is likely to be sick because they are old might be true, but that doesn’t make the explanation or the prediction useful. Additional work of picking out those features that yield explanations of interest (or grouping features together in a way that makes sense) is also necessary. Doing this work well might create a tool that drastically improves the efficiency of some process. Doing it poorly might mean your model directs effort to the wrong places, creating cost and no value. These are both in the realm of possibilities regardless of how accurate your model is.

There are plenty of other examples of ways you can use a predictive model beyond just its prediction, or add value on top of that prediction. Knowing a performance metric doesn’t mean you can judge a project’s value. Real projects aren’t Kaggle competitions where the only thing that matters is predictive accuracy.

Lessons learned in my first year as a data scientist

2018-01-25T00:00:00-05:00

It’s been a year since I started my first job as a data scientist. In that time, I’ve learned a lot, but most of that learning hasn’t been the type I expected. I’ve certainly learned some things about new technologies and techniques, but much of what I’ve learned has been about how to actually make my skills useful to others in the company.

For context, I work in a healthcare company where the data science group acts kind of like a consulting firm - we take on projects to create predictive models and provide other analytical support. For that reason, a lot of the lessons I’ve learned have more to do with the issues with operationalizing data science projects in the business.

Feature engineering is important

One of the big lessons from the year is that feature engineering is the most important part of building a predictive model. This might not be a surprise to anyone with a lot of machine learning experience, and I certainly had heard this when I started. But if I'm honest, it was hard for me to come up with examples of when or why it was important before I started. Seeing repeated examples of this in practice has really made me understand it on a deeper level. The kind of model you use doesn’t matter if your features suck, and being able to understand what kinds of features will be useful is a skill that requires abstract thinking and intuitive math. This is what makes machine learning hard.

Getting the best performance is not important

When I was initially learning data science and machine learning, there was a lot of emphasis on performance metrics. The truth is, for almost every project, you can get 90% of the actual value of the project quickly and the last 10% will take you ten times longer. For most projects, it just isn’t worth it. While with Kaggle you want to perfect the model as much as possible because a .01 increase in your performance metric of choice can make or break you, in the real world, tiny increases rarely matter (though obviously this depends on the industry and project).

The hard part of projects isn't the machine learning, but everything else

More important than knowing how to build the best predictive model is knowing how to translate a business problem into a problem a predictive model can help with, and translating a predictive model into something actionable. When people from other departments hear we can build predictive models, they often come up with some idea of something they think would be interesting to predict. Often, their ideas don’t make sense, but explaining why that’s the case, the technical issues involved, and coming up with an alternative that will work - that’s important. It’s also important that once a model is built, the output is actually usable - it’s easy to predict things, but often the prediction is only useful if it comes with some additional insight. Tools like LIME have been useful and are part of the solution, but sometimes your features are not actionable or easy for the end user to understand, so LIME is not a silver bullet - more thought needs to be put in than just sticking LIME on the end.

Ask the right questions BEFORE you start a project

Finally, probably the most important lesson I’ve learned the hard way this year: Make sure you know up front where the data is going to come from and how the project is going to be used before putting too much effort into it or committing to any kind of deadline. I had one project described to me that I was excited to tackle and promised a quick turn-around on, only to learn that writing the algorithm was the easy part. I was given a sample of data and had assumed that meant there was a process in place for me to acquire the data. In reality, there was no such process, and it took months to figure out how to get access to the necessary data. I ended up looking pretty bad for going way over my initial time estimate, even though most of that time was spent waiting for people to respond to emails so I could track down and get access to the necessary data. Figuring out data pipelines is really hard in a big company!

I also had to pick up a few projects that were started before me when another data scientist left. One was a dashboard put together to help navigate and visualize some text data. I was amazed when I talked to the end-user to check what else they wanted done with the project, and discovered all of the changes they wanted involved removing features. The dashboard was massively over-engineered, with some of its major functionality either removed or left in but never used. Worse, the dashboard was never meant to be in long-term use and would be retired in a year, so there was no chance these features would find eventual use. This meant time was wasted twice - once creating those features, and again going back and removing them to simplify the dashboard.

This and other experiences really taught me the importance of having a close feedback loop with the end user of whatever project I’m working on. A biweekly, half-hour meeting can end up saving a lot of time.

Motivation in Academia vs Industry

2018-01-21T00:00:00-05:00

I’ve been in industry working as a data scientist for a year now after leaving academia. By all measures I can think of, it’s been a good year. I got lucky with having a great manager, I’ve gotten a bunch of experience and learned a lot, I work at a place where doing good work means actually helping sick people, and I even lucked out with a promotion, allowing me to try out a slightly more managerial position so I get to see how I like those responsibilities while still making my own development contributions.

However, I wanted to revisit my appraisal of my life in academia vs industry. Rereading my previous post about this, I agree with everything I said. On the whole, the issues I had in academia aren’t an issue in industry, and overall I still can’t imagine going back to academia.

But I’ve been thinking lately about what’s been the biggest tradeoff for me: the motivational structure. With academia, there are a bunch of somewhat quantifiable metrics that are often looked at to approximate how much you’ve accomplished as a scientist - your number of first-author publications, h-index, etc. Of course there are issues with these metrics, and a lot of controversy around how much weight is put on them to judge a scholar. However, one thing I’ve realized is that (at least for me) they are incredibly effective motivators.

It feels incredibly good to get a new first-author publication when that is the main metric by which you’re judged. It goes on your CV, and that accomplishment will follow you wherever you go. Every citation it accrues goes down on a sort of permanent record of achievement for you. It feels a lot like building something, where every incremental piece adds up.

There’s no real equivalent in industry. When a project is completed, it might come up between you and your manager in your performance review. If it’s a big project, you might put it on your resume to help the next time you’re looking for a job.

This motivational structure for academia is a double-edged sword. It gave me a sense of purpose and progression, quantifiable goals and accomplishments. However, it was a bit too effective. I didn’t take much time off because I wanted to keep moving towards that next publication. I felt guilty reading if it wasn’t reading something I hoped would give me a new project idea for a quick publication. And of course, when there was a big setback in my project, it would take a big psychological toll.

I’ve been trying, both at work and in my private life, to find things that will help give me some of this sense of progression and purpose. I’ve had mixed success - certainly I’ve found some things motivating and rewarding in a similar way, but not in as sustained a way as publications were when I was an academic. It just seems to take more effort to find these things.

A probably healthier way of looking at things is with a growth mindset - that it’s the skills I’ve gained and lessons learned that are important, not the specific accomplishments I’ve made. But growth is hard to measure, and it’s often hard to see how far you’ve come. It’s hard to get super excited about growth.

On balance, I still think the move to industry was the right one for me. The same lack of a super motivating goal also makes it much easier to leave work at work and have a better work-life balance. But this is probably the biggest issue I’ve struggled with since the shift. It was something I worried about before leaving, and that worry has turned out to be justified.

"Should I do a Data Science bootcamp?"

2018-01-03T00:00:00-05:00

Since I wrote a review of my experience with The Data Incubator bootcamp, a lot of people have reached out in one way or another wondering if they should bother with a data science bootcamp. Many people interpreted my review as very negative. To be clear, I do think there is value in the program, but also some pretty serious problems. Despite my ambivalence about the program itself, I have to say it was worth it for me and is probably worth it for many others.

The real value of Data Incubator (which I imagine goes for most data science bootcamps) was being given challenges that required me to learn, and being surrounded by people with the same goal. It’s much easier to learn that way than trying to learn on your own. It’s amazing how much you learn from just being around smart people trying to learn things. Having some direction from a course specifically designed to take you into the world of data science also gives you a good amount of breadth over the field of data science, which can be hard to get when trying to learn on your own.

However, I think there are some serious costs to these programs. They are stressful. They require a more-than-full-time commitment for a couple of months. Some may require you to move, though maybe temporarily (I moved to NYC for 2 months for Data Incubator). Some of them require you to pay up to $15,000 - Data Incubator offers a less competitive option of being a ‘scholar’, where you pay about that amount to attend the program.

I think the stress is probably worth it - it sucks during the program, but any quick way of learning a lot is going to involve a lot of stress. It’s hard to put a dollar amount that the programs are worth, so it’s hard to say if it’s worth taking the time off or even paying tuition. If you have financial support from somewhere, and you won’t be sunk financially if you don’t get a job immediately after finishing, it might be worth it.

The issue is, not everyone gets a job. These programs don’t work for everyone. It’s hard to know the rate at which attendees of a program get a job - the issue with most of the numbers these programs put out is that the programs decide who to include in the calculation. A lot of students just lose contact with the program after losing hope of getting a job through them, and the program doesn’t count them when they tell you what percentage of students found a job through the program. This basically guarantees the program will be able to list a high success rate. So don’t trust the numbers they give (or at least ask whether the figure includes everyone who enters the program, not just the people who ‘successfully complete’ it, since the definition of successfully complete may mean they either got a job or they stayed in contact with the program for 3 months after it was over). Regardless, if doing the program will put you on a financial knife-edge, it probably isn’t worth the risk.
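To see how much the choice of denominator matters, here is a small worked example. The numbers are invented for illustration and are not Data Incubator's (or any program's) actual figures.

```python
def placement_rate(placed, counted):
    """Share of counted students who found a job through the program."""
    return placed / counted

# Hypothetical cohort.
enrolled = 100       # everyone who started the program
placed = 45          # found a job through the program
lost_contact = 40    # gave up on the program's job pipeline and stopped responding

# Rate a program might advertise: students who lost contact are dropped.
reported = placement_rate(placed, enrolled - lost_contact)

# Rate over everyone who entered the program.
actual = placement_rate(placed, enrolled)

print(f"reported: {reported:.0%}, actual: {actual:.0%}")
```

In this made-up cohort the same 45 placements read as 75% or 45% depending only on who gets counted, which is why it is worth asking what the denominator is.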

There are alternatives to doing a bootcamp. If you’re motivated and can get a data analyst position that works closely with data scientists, you can transition from there. You can also take courses, do projects, and learn on your own in your spare time. I think these options are slower and probably harder than doing a bootcamp, but they can work, and are certainly safer. But if you feel like you're close - you have a PhD or relevant work experience and just need that extra boost to get over the edge - then bootcamps can be really powerful.

Data professional definitions: Data analyst vs data scientist vs data engineer

2017-12-14T00:00:00-05:00

Lately I’ve read a lot of attempts at defining data scientist and differentiating it from other data-centric roles. The terms ‘data scientist’, ‘data analyst’, and ‘data engineer’ are obviously interrelated. But recently I’ve seen some weird definitions of them.

Let me make clear that this isn’t just a silly semantic quibble with no practical significance (though it certainly is partially, maybe largely, that). This issue often comes up when people are giving career advice. A recent blog post defined a data analyst as someone who interrogates data using SQL and Excel to produce reports, while a data scientist is someone who delivers software. This is a lead-in to advice that a data analyst position is not good preparation for a data scientist position because data scientists are basically software engineers and data analyst positions don’t give you that experience. The premise, however, is based on an extremely narrow definition of both of these roles that might only be true in certain companies.

Another example of silly definitions for these roles came from a reddit thread I read, where someone was claiming that anyone who regularly uses pandas/sk-learn must be a data engineer. This bizarre claim seemed to stem from an idea that legitimate data scientists are at the forefront of machine learning research and therefore must be programming completely new machine learning algorithms from scratch.

Here’s the thing: if anyone claims a very clear, straightforward definition of any of these roles that sharply delineates them, they are probably extrapolating based on experience of how these roles are defined in one company (or industry). This is because these jobs, by their very nature, have a lot of overlap.

I would give the high-level, fuzzy (and hopefully not controversial) definitions of these roles as:

Data engineer: a data professional who focuses on building data pipelines, manages how to get data from point A to point B.

Data analyst: a data professional who focuses on producing reports describing trends/insights in data.

Data scientist: a data professional who focuses on producing insights and predictions from data.

Note my use of the weasel phrase ‘focuses on’ to avoid making any hard statements. Because at bottom, each of these roles overlaps significantly with the others - it’s typical for data professionals to need to extract data or move it around. It’s not uncommon for data engineers to produce dashboards or reports, or for data analysts to set up data pipelines. Data scientists may sometimes produce reports, and some data analysts deliver code to go into production.

For any hard line you can think of (data scientists use machine learning, data analysts use Excel, etc.), there are going to be lots of counter-examples.

At the end of the day, there is a difference in these roles. It’s a fuzzy yet important one. Typically data scientists are more specialized and have more experience/skills than data analysts. Data engineers might have more knowledge of databases but less about statistics. But the field is not at a point where you can easily determine the total extent of the role just by the title.

The bottom line is, if you’re looking to become a data scientist and want to know what path to take, getting experience as a data analyst (or data engineer) might not be a bad way to go about it. However, it’s dependent on the specifics of the particular position you get. If you’re a data analyst who sits with business people and only uses Excel to produce simple reports, you’re not going to get the kind of experience you need to move on to a data science position. But if you work closely with other data scientists, or are expected to learn stats/machine learning to perform your duties, it might be good experience. I think the blog post I mentioned above does give good advice on this - be concerned about who you’ll be sitting with, since that probably says a lot about what skills you’ll learn.

“Should I get a PhD to be a data scientist/analytics professional?”

2017-11-19T00:00:00-05:00

There’s some advice I’ve read that “when you’ve given the same in-person advice 3 times, write a blog post”. I’ve decided to take that advice and write some of my thoughts on getting a PhD to go into industry.

Opportunity Costs of Getting a PhD

Obviously, having a PhD on your resume, all else being equal, helps you. It shows you’re smart and have a variety of skills. It also gives you a number of important skills for industry.

But asking whether a PhD is a good thing to have on your resume isn’t the right question. Is having a PhD better than having 5+ years of industry experience, or a Master’s degree and a couple of years of experience? My guess is in most cases, no. PhDs take a long time, and aren’t very efficient for learning the specific skills for an industry job. PhDs train you to be a researcher in a very specific field.

Simply put, if you’re just looking for a way of advancing your career, in most cases you’re better off doing something other than getting a PhD. There might be some exceptions to this - like if you’re planning on getting a PhD in machine learning. But for the most part, a PhD is just a slow way of learning some skills. When you factor in the financial opportunity costs as well, it just doesn’t make sense to get a PhD if you’re doing it just for your career.

Risks of Enrolling in a PhD Program

One issue I don’t see brought up enough in discussions about getting a PhD is that getting a PhD carries significant risk. A large proportion of students (estimated ~50%) don’t complete their PhD. Everyone thinks that these statistics don’t apply to them, but quite simply, there is a lot of luck involved in getting a PhD. Many advisers just aren’t good (or have styles that don’t jibe well with your own), some departments suck, some research plans don’t pan out. On top of that, sometimes life circumstances get in the way - you have a kid, a family member gets sick, or you find you can’t continue doing it for financial reasons. These are things it’s hard to know ahead of time. There is always a chance things just won’t work out with your PhD program, or that it will require much more time than you originally planned.

Doing it for the Wrong Reasons

Finally - if you’re getting a PhD just thinking about what it’s going to give you on the other end, you’re doing it for the wrong reasons. If you’re not seriously excited about the prospect of doing academic research for the next few years, you’re just not going to gain much from the experience. A PhD is a long commitment - 5+ years might as well be a lifetime. It’s also a lot of work - I worked much harder during my PhD than I have at any other point in my life. Academia can also be extremely lonely and isolating, as I've written about before. If you aren’t happy and motivated by the work you’re doing during that time, you’re going to (1) not accomplish much, (2) not learn much, and (3) miss out on enjoying a big chunk of your life.

Bottom Line

I’m not against the idea of getting a PhD, and going into it planning to take an industry job after isn’t a bad idea (it’s probably better than going into it expecting a tenure-track academic job, given the prospects of that). However, I think if you’re going to do a PhD, it should be because you love the idea of doing a PhD. If it’s just something you’re trying to get through for the position and prestige you feel you’ll have on the other end, it really isn’t worth it. Given the risks involved, the financial hardship it entails, the cost to your mental health, I really believe you should only do a PhD if you would want to do it even if it wouldn’t actually help your career at all.

Data Science in Healthcare

2017-11-14T00:00:00-05:00

I've recently written a couple of guest blog posts about data science in the healthcare space.

The first is a pretty general article about some of the challenges and opportunities for data science and predictive analytics in healthcare. You can find it here.

The second one is about the importance of domain knowledge in data science, with a focus again on healthcare. It talks about why domain knowledge is important, and ideas of how to get it if you (like me) are a data scientist without a medical background working in healthcare. Find that one here.

Review of The Data Incubator data science bootcamp

2017-05-29T00:00:00-04:00

May 2018: Click here for an updated review, and here for a recent collection of answers to common questions I get asked about Data Incubator

This is a review of my experience in the Data Incubator data science program.

When I decided to make the switch from academia to data science, I went the route of doing a bootcamp. There is a lot that is appealing about this route. You get to learn with others in the same position as you. Having assignments and lessons forced on you can be a powerful motivator to learn and get experience. Having a fully worked out curriculum can mean getting a larger, better rounded introduction to the field than if you tried to learn on your own. Finally, many bootcamps explicitly market themselves to academics, which I found attractive given my background.

For these reasons, I decided to attend a bootcamp to launch my data science career. I ended up attending Data Incubator, in the September-October 2016 cohort. While I did end up getting a job through the program, I still left it with some mixed feelings about the program and thought writing some details may be helpful for someone trying to make a decision about attending.

Getting in

Many, if not all, data science bootcamps have some kind of admission process. For some, this means just sending a resume and going through an interview. Others involve short technical challenges to test your statistics/programming to see if you're at the level they expect students of the program to be at.

I applied for a bunch of different programs and went through a few different admissions tests. By far the Data Incubator's was the most involved. After applying with typical application paperwork and getting reference letters, if you're chosen you're given challenge questions - programming questions that are pretty challenging that you only have a few days to complete. There is no way to reschedule so if you happen to be travelling during the scheduled time (like I was), you just have to figure out a way to do it. Along with the challenge questions, you have to come up with a project for the program, and do some initial exploratory analysis. If you make it through to the semi-finalist stage, you need to put together more analyses for the project proposal and prepare to do a short pitch in an interview. All of this took many stressful hours of preparation time.

Compare this to Data Incubator's main competitor, Insight, where I did an initial application and an interview and that was it. That application took perhaps an hour or two (and was also successful).

In addition to all the work you put into the Data Incubator application, you're not told until the end if you're accepted as a 'fellow' (no tuition), or a 'scholar' ($15k for the program). My hunch (based on a small sample size) is that they are strongly biased towards giving 'fellowships' to PhDs, and most Masters applicants end up with 'scholar' status if they are accepted.

Strengths of the program

Perhaps the best part of the program is that the assignments are diverse and do a good job of introducing you to a range of topics and tools. In contrast to Insight (where, my understanding is, you're given resources and a list of things you should do but no real lectures or assignments), there is a pretty structured curriculum that makes it clear what you're supposed to be learning and when. The fact that they are assignments forces you to actually get your hands dirty working with a range of powerful Python tools.

Another strength of the program is the focus on programming skills. In addition to the assignments, which are mostly expected to be done in Python, the program starts four days a week with an hour-long coding challenge. I came in having basically taught myself Python to complete the application. By the time I finished, Python was my most comfortable language and I felt very confident in it.

The course includes in its curriculum some basic job skills that many people transitioning to data science lack: how to sell yourself to employers as a data scientist. There is advice and feedback on making a resume and interview practice. While some of this was a bit too obvious to be useful, I found some tips and advice helped a lot.

Finally, the course provides what any bootcamp should: a social learning experience where you're immersed and forced to learn a lot in a short period of time.

Weaknesses of the program

The program includes daily lectures. On paper, this sounds great, but in practice it's terrible. The hour-long lectures are frequently unhelpful because they often come after you've already learned the content while trying to complete the assignment for that week. When they are timely, they are frequently of such low quality that it is hard to gain anything from them. Most lectures are also remote - being teleconferenced from one of the other Data Incubator locations. The remote lectures raise barriers to interacting with the lecturer, and the sound quality is often poor enough that it takes just too much effort to pay attention. I found myself frequently using lecture time as lunch hour so the time wasn't a complete waste.

Parts of the curriculum structure also just didn't make sense. We had an assignment nominally supposed to teach us SQL, even though we had no resources for being taught how to even set up SQL on our machines (everyone ended up completing the assignment using Python's Pandas). In one of the last weeks of the program we suddenly switched to using Scala for the morning programming challenges, despite using Python for everything else in the program. We came in Monday morning for our weekly assignment, and were literally given an hour to complete a challenge in a language none of us had ever used.

The advertising of the program is slick but the reality often falls short. The variety of companies hiring from the program is greatly exaggerated. Based on the marketing, I was expecting them to have hiring partners all over the country in a large variety of industries. There was exactly one non-profit hiring, and zero video-game companies. There were only a couple of jobs in all of Boston. One student at my location just left the program immediately after seeing how limited his job options were in his geographical area.

Other problems with the course were deeper than just poor course planning. There seemed to be a deep distrust and infantilization of the students woven into the philosophy of the course. If we failed to complete an assignment on time, we were punished by being locked out of the system for contacting hiring partners until we caught up. One fellow in my cohort was locked out of the system as punishment for turning down a bad job offer. The 'soft skills' lectures (which were really 'job search' lectures) were at best not useful, but frequently condescending and verged on outright insulting people with academic backgrounds. We were routinely encouraged to apply for jobs we weren't interested in.

Conclusion

Overall, the program was useful for me. I probably could have gotten a data science job without it, but I am glad I did it. There are ugly parts of the program, but it was free training for me and did end up landing me a job, so I can't complain too much. However, I don't view it particularly favorably compared to Insight (a program I've only heard good things about from a few of my friends who have gone through it). I also am not convinced a bootcamp is necessary for the job - I actually got more data science interviews that weren't through Data Incubator, and it's mostly fluke that I got my job through Data Incubator instead of on my own.

How to make the transition from academia to data science

2017-04-23T00:00:00-04:00

Ever since I've started writing about my transition from academia to industry (both my reasons for leaving and what I think about the transition in retrospect), I've started receiving a lot of requests for advice on making that transition. Sometimes these requests come from former peers, former professors looking to advise students, or just someone who read one of my blog posts online.

I'm always happy to reply to these requests as best as I can given my experiences, but I figured since so many people seem interested, it might be useful to put my thoughts in a blog post.

What do I need to know to be a data scientist?

My main message to most of the people that have been asking me for advice is not to worry too much. People seem to think you need to know everything to be a data scientist. If you have a PhD in the sciences or any quantitative discipline, you're likely most of the way to having the skills you need, so don't be scared to take that next step (by just looking at and applying to jobs). Put yourself out there and let potential employers decide whether you have the skills to work for them or not. Don't do their job for them.

If you really want to know what I think are the essentials to get a first data science position, the main classes of skills you need to have should come as no surprise:

Be a decent programmer: Python and R are the languages that are most common. Being comfortable in both opens more doors, but you can make do with one or the other. SQL is also used frequently, so at least do a short tutorial on it (basic queries are simple enough that it's not super difficult to pick up the necessities).
Know some machine-learning: You should understand basic machine-learning concepts and have at least some experience with some of the most common algorithms (like tree-based methods, logistic regression, dimensionality reduction, etc.)
Know some traditional stats: You should know about p-values, regressions, t-tests, statistical significance.

These are the bare essentials. You'll need to be more than just 'decent' in at least one of these categories (which one depends on the job, read on), but the point is, these should be skills you already have or could gain in a short period of time through some additional reading/practice. The skills involved in data science are so broad, no one is an expert in all of them. As long as you know the basics, and have some areas where you're particularly strong, you should be able to find somewhere that your skills could be useful.

Other skills that are useful

Some people seem convinced that they need an understanding of every big data technology under the sun to be a data scientist. To the uninitiated, the bizarre names you might hear (Hadoop, Spark, MongoDB, etc.) just sound intimidating. The truth is, they're not that complicated, and you don't need to be an expert in them to get a data science job.

Different data science teams use different tools. Some may use some big data tools for everything and really need someone who can jump into that workflow right away. But in my experience, the majority of data science jobs have these skills listed as 'Preferred but not required', if they're listed at all.

Knowing specific packages and algorithms might be required by specific jobs. The best way to figure out what you might need to know for the kinds of jobs you're interested in is to just start looking. Go to Indeed or any other job posting site and browse the listings. Find some that interest you, figure out what skills they want, and get to work on those skills.

Where do I learn these skills?

Once you get a sense of what skills you might need to gain or strengthen, where do you go to get them? For almost any package under the sun, it's easy to just google a tutorial or free online course to learn the basics.

Some people opt to go to boot camps. My overall opinion is that they are good for getting a general overview if you can afford to take a couple of months off full-time, but I don't think they are necessary.

Start applying/interviewing as early as possible

I find I learn best by being forced to solve problems. I think what taught me the most about being a data scientist was applying and interviewing. Interviewers will often give you a coding challenge or small project to work on at home before they bring you in for an in-person interview. These can be anything from a quick one hour coding exercise to a multi-day project where you have to implement a recommendation engine. These are great for learning, because they're nice small projects that are, by definition, what interviewers want you to know. Treating them as a learning exercise helps in two ways. First, they're efficient for learning skills. Second, it will take the edge off when, inevitably, you get rejected repeatedly after interviewing. Interviewing is hard and it can really suck to get rejected after you've put in a lot of time and effort and fallen in love with the idea of working at a particular company. A better mentality is to approach interviews as being a success as long as you've learned something.

There are plenty of other good resources for learning some of the essential skills to being a data scientist. Here's a woefully incomplete guide:

To brush up on programming, try the Cracking the Coding Interview resources at HackerRank
For machine-learning, An Introduction to Statistical Learning is excellent. The Elements of Statistical Learning is more comprehensive but denser.
To just learn some basic packages and data skills, try working your way through some of the Titanic tutorials at Kaggle

Conclusion

I think my main piece of advice is this: You're probably closer than you know to getting a data science interview, and you should take the next steps (looking at job listings, or applying to some of them) as early as possible to better guide your search. Good luck!

Retrospective on leaving academia for industry data science

2017-04-09T00:00:00-04:00

It's been two and a half months since I left academia to take a job as a data scientist in industry. Although I haven't been in my new job for very long yet, I wanted to check in with my current thoughts about whether the transition was worth it and the differences between working in academia and industry.

In a lot of ways, being an industry data scientist is a lot like being a post-doc. I'm paid to tackle intellectual problems, I have a lot of freedom in how I approach problems, and there is emphasis placed on professional development (attending conferences, etc.). Of course, there is a change of emphasis on what types of tasks I do - as an academic I spent a lot of time collecting data and writing, in industry the emphasis is more on doing analyses and communicating results verbally or through simple reports.

The problems I had in academia I no longer have in industry

I made the decision to leave academia for a number of reasons, outlined in a previous post. The main reasons were that I was lonely, stressed out, and no longer found the research held any interest for me.

Being in industry, there's a lot less social isolation. It's rare that a day goes by where I'm not in some kind of meeting. While sometimes this can get frustrating when meetings seem unnecessary, I think a big part of the point of meetings is social. It's nice to get away from the desk and talk to people. I'm not the kind of person that often seeks out social interactions, and I think it's really good for me to be involved in various meetings.

In academia, it often felt like I alone had to figure out how to solve a problem (or come up with an idea) in isolation to save a project that might span years. The projects I've worked on in industry are just much less stressful. They tend to be smaller or have a much longer timeline. Perhaps most importantly, they tend to have more people involved so you can always get more input. The end product isn't a research paper, so often it makes sense to be a bit less rigorous if it means a quicker turn-around (depending on the needs of the project).

Perhaps most importantly, it's expected that you leave work at work. I have done small amounts of work on evenings and weekends since starting my new job, but it's seen as weird and definitely not expected. People will talk about how much they have to finish, but doing it on the weekend is still unthinkable for them. It keeps work from encroaching on my personal life, and keeps me from feeling guilty about working on other things for fun.

Industry isn't without its problems, but I find them easier to overcome

While most of the time I'm interacting with more people throughout the course of a project, one downside of industry is that I have to explain things to many more non-technical people. In academic science, everyone knows some stats and some programming, so it's usually pretty easy to explain the idea behind analyses to someone regardless of their specific field. Now, I find myself talking to HR people or nurses who lack any math or programming understanding. This can lead to some frustration and talking at cross-purposes.

However, the challenge of communicating with less technical people is just that - a challenge. I've always enjoyed explaining technical things in non-technical ways. It helps me to understand things when I'm forced to put it in simple terms. This is a bit more extreme - explaining sometimes complex algorithms and statistics to people who don't know the first thing about math or code. But it allows me to develop a skill to speak to a wider audience, and I think that's a good thing.

While projects are less stressful, the flip side is a feeling of less personal investment that makes it hard to feel passionate about projects. The projects can still be fun and interesting, but there isn't always a whole lot of deeper meaning in them. Relatedly, with academia, your work is in some ways open to the world. You get publications, maybe some media coverage, and are considered a recognized expert. Without going beyond your job, the same thing doesn't really happen in industry. I might get some publications due to my position, but media coverage (or any other sort of recognition outside of the company) is unlikely.

But once again, I find this issue much easier to deal with than the issues I had in academia. Though I have less passion for work projects, I have a lot more time and energy to put into my own personal projects. And this can be quite rewarding - I get to try different things, like coding silly Twitter bots or writing data journalism pieces. I find I have a wider range of projects to work on for satisfaction than I did while an academic, where I felt like any side project should ultimately net me a publication.

A lot of clear benefits of industry

I've had moments of wondering if I really did the right thing by leaving academia (usually when some reward of academia is made particularly salient). However, the thought doesn't usually last long.

Simply put, I value the types of freedom an industry job gives me far more than the types academia does. I have less freedom now in terms of what specific projects I take on. However, I have much more freedom to define where I live, what industry I work in, and all of the freedoms involved in having a much more secure financial situation. There are also a number of different ways I could take my career - I could transition into a management position or stay closer to development/implementation. I could try freelancing or consulting. While I'm happy where I am right now, it's good to know that I have a lot of options for the future.

Does the Muslim ban make us safer?

2017-03-10T00:00:00-05:00

On March 6, Donald Trump signed a second executive order banning people from certain majority-Muslim countries from entering the USA. The previous order banned citizens of seven countries: Syria, Iran, Sudan, Libya, Somalia, Yemen, and Iraq, but was stopped after legal action. The new ban applies to six countries - Iraq is not included - and uses different wording in the hopes of being on firmer legal ground.

Trump and his administration have claimed that this ban is to protect Americans. How effective is this ban for that purpose? I looked at data originally collected by New America. This data is from records of criminal convictions and deaths of terrorists. The data contains terrorist activity from after 9/11 up to 2017.

The ban is supposed to protect Americans from terrorism. An obvious metric we can look at for how much harm terrorists from the banned countries do to America is looking at the number of people killed on American soil by terrorists from these countries.

Immigration control isn't going to be of much help curbing harm from terrorists, since the vast majority of terrorism-related deaths are caused by people born in America. A few were caused by people born in other countries, and absolutely none by people from the banned countries.

However, the fact that terrorists from these countries didn't kill anyone does not mean that there are no terrorists from these countries, contrary to false claims by a federal judge. I was surprised to find that quite a few terrorists come from the banned countries - 58, if we include naturalized citizens and permanent residents. The breakdown of where they come from is interesting, though. The vast majority of terrorists from the banned countries come from Somalia. None came from Libya. And of those that come from the banned countries, most are naturalized citizens or permanent residents, and thus would not have been affected by the ban (except the first ban, which originally applied to permanent residents).

It's important to put the threat foreign terrorists pose to America in context. Currently, even among Democrats, a majority list terrorism as a top policy priority. But would we actually be significantly safer even if we somehow managed to eliminate all terrorism caused by people born outside of America?

25 deaths over 16 years is not a very large number. On average, only slightly more people die each year at the hands of foreign terrorists than from alligator attacks or during enema administrations. More than twice as many people die riding roller-coasters, and the number of people killed each year by lightning is one and a half times greater than the total number of people killed by foreign terrorists since 9/11.

There are plenty of costs to the Muslim ban - it exacerbates the doctor shortage and it hurts U.S. science. Many worry that the move hurts intelligence gathering on terrorist organizations by hurting relations with the banned countries, making us less safe from terrorists.

In terms of protecting Americans, legislation on roller-coasters would likely be more effective at keeping Americans safe than immigration policy reform possibly could.

Sources for non-terrorism death rates:

Alligators: Wikipedia. Data for 2001-2017

Enema administration: Centers for Disease Control and Prevention. Data for 1999-2016.

Roller-coasters: Injury Prevention. Data for 1994-2004

Lightning: National Weather Service. Data for 2006-2013

Unique Tag Cloud for Each Category in Pelican

2017-02-15T17:51:29-05:00

Pelican allows you to put articles into different 'categories'. On this site, I have the Blog and Projects categories. I wanted 'Projects' to be able to function as a portfolio. Ideally, it would have a tag cloud to allow someone to see all the different tools I've used for different projects, and easily find projects that involved the tools they're interested in.

By default, Pelican does not have a tag cloud. In recent versions, they've taken the tag cloud functionality out of the main program and put it into a plugin. However, the plugin counts all tags across all articles, meaning my 'Blog' tags would be mixed in with the 'Projects' tags. I ended up needing to modify the plugin to get the functionality I wanted. Instructions in case you want to do the same thing are below.

Set up

Since I modified the base tag_cloud plugin, the set up is very similar to that described on the official tag cloud plugin page. First, pelicanconf.py needs to be modified to look for the tag cloud plugin. If this is your first plugin, you should simply add these lines:

PLUGIN_PATHS = ["plugins"]
PLUGINS = ["tag_cloud"]

Otherwise, just add "tag_cloud" to your current PLUGINS list.

Next, you'll need the tag_cloud.py file, and need to place it in your plugins folder. You can download my version of tag_cloud.py from the github repo.

The basic change from the original script is that instead of having a tag_cloud structure with just a list of the tags, tag_cloud is a dictionary with each key being a category and each value being a tag list. The hard part is that Pelican doesn't allow you to use dictionaries in its templates - it converts everything to a list. For clarity, I change the dictionary to a list before passing it to the generator. In the templates I use Jinja2 to loop through all of the entries and find the matching category, like an inefficient dictionary, as explained below.
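
To make that structure concrete, here is a minimal sketch of how a per-category tag list could be built and then flattened into a list. It is not the plugin's actual code: `build_category_tag_clouds` is a hypothetical helper, the articles are plain dictionaries standing in for Pelican's article objects, and the logarithmic size formula is an assumption about how counts map to font-size steps. Each entry comes out as (category, [(tag, size step, article count), ...]), which matches the cat.0, cat.1, tag.0, tag.1 and tag.2 lookups used in the templates.

```python
import math
from collections import Counter, defaultdict


def build_category_tag_clouds(articles, steps=4, max_items=100):
    """Return [(category, [(tag, size_step, count), ...]), ...].

    `articles` is a hypothetical list of dicts with "category" and "tags"
    keys, standing in for Pelican's article objects.
    """
    # One tag counter per category, so Blog and Projects tags never mix
    counts = defaultdict(Counter)
    for article in articles:
        counts[article["category"]].update(article["tags"])

    clouds = {}
    for category, tag_counts in counts.items():
        most_common = tag_counts.most_common(max_items)
        if not most_common:
            clouds[category] = []
            continue
        max_count = most_common[0][1]
        # log(1) == 0, so fall back to 1 to avoid dividing by zero
        scale = math.log(max_count) or 1
        # Assumed sizing rule: step 1 for the most common tag, `steps` for the rarest
        clouds[category] = [
            (tag, int(math.floor(steps - (steps - 1) * math.log(count) / scale)), count)
            for tag, count in most_common
        ]

    # Flatten the dictionary into a list of (category, tags) pairs for the templates
    return sorted(clouds.items())


sample_articles = [
    {"category": "Projects", "tags": ["python", "flask"]},
    {"category": "Projects", "tags": ["python"]},
    {"category": "Blog", "tags": ["academia"]},
]
clouds = build_category_tag_clouds(sample_articles)
```

The sketch ignores TAG_CLOUD_SORTING; it only shows the dictionary-to-list idea.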

Displaying the tag cloud

Displaying the tag clouds is a little more complicated than with the basic plugin. You probably don't want tag clouds showing up on every page - I only wanted them on the category pages themselves, article pages, and tags pages.

The different themes all have similar structures, but some might be a bit different. I use the blueidea theme.

To have the tag list for a category appear on article pages, the following code should be added to article.html:

<div id="tagcloud">
    {% for cat in tag_cloud %}
        {% if article.category == cat.0 %}
            <b><center>{{ cat.0 }} Tags</center></b>
            <ul class="tagcloud">
                {% for tag in cat.1 %}
                    <li class="tag-{{ tag.1 }}">
                        <a href="{{ SITEURL }}/{{ tag.0.url }}">
                            {{ tag.0 }}
                            {% if TAG_CLOUD_BADGE %}
                                <span class="badge">({{ tag.2 }})</span>
                            {% endif %}
                        </a>
                    </li>
                {% endfor %}
            </ul>
        {% endif %}
    {% endfor %}
</div>

The placement of the code depends a bit on how you plan to style it and where you want it to show up. I have mine near the top, right below {% block content %}

To add the tag cloud to the tags and categories pages, the same code needs to be added to index.html. Place it within the first item conditional, right after this:

{# First item #}
{% if loop.first and not articles_page.has_previous() %}

This will take the category of the first article appearing on the page and generate the appropriate category tag cloud.

Settings and CSS Styling

The tag cloud should now display, just not necessarily how you want or where you want.

The plugin allows you to have different sizes for tags based on how common they are. This can be altered by changing the TAG_CLOUD_STEPS value (default is 4, you can set TAG_CLOUD_STEPS=num in pelicanconf.py). So use the following, adding as many li.tag-# as you have TAG_CLOUD_STEPS. You can also style the tag cloud list any way you want:

ul.tagcloud {
    list-style: none;
    padding: 0;
}

ul.tagcloud li {
    display: inline-block;
}

li.tag-1 {
    font-size: 150%;
}

li.tag-2 {
    font-size: 120%;
}
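
If you raise TAG_CLOUD_STEPS, writing one rule per step by hand gets tedious. As a rough sketch (not part of the plugin), a small Python helper can generate the li.tag-# rules for you. The `tag_size_rules` name and the 150%-to-90% range are my own assumptions; pick whatever sizes suit your theme.

```python
def tag_size_rules(steps=4, largest=150, smallest=90):
    """Return CSS rules for li.tag-1 .. li.tag-<steps>, largest font first."""
    rules = []
    for step in range(1, steps + 1):
        if steps == 1:
            size = largest
        else:
            # Linearly interpolate from `largest` (tag-1) down to `smallest`
            size = largest - (largest - smallest) * (step - 1) / (steps - 1)
        rules.append(f"li.tag-{step} {{\n    font-size: {round(size)}%;\n}}")
    return "\n\n".join(rules)


# Paste the printed rules into your theme's stylesheet
print(tag_size_rules(steps=4))
```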

You can also use CSS to style the div the tag appears in. Here is mine:

div#tagcloud {
    right: -150px;
    top: 250px;
    position: absolute;
    width: 140px;
    background: white;
    border-radius: 10px 10px 10px 10px;
    -moz-border-radius: 10px 10px 10px 10px;
    -webkit-border-radius: 10px 10px 10px 10px;
}

(I have a fixed-width wrapper around everything including my tagcloud, so this positioning puts my sidebar just outside of the main content area)

Full list of settings and defaults (you can add these lines to pelicanconf.py and change whatever values you want. If they don't exist in pelicanconf.py the defaults will be used):

TAG_CLOUD_STEPS=4 #number of different sizes of fonts in the tag cloud
TAG_CLOUD_MAX_ITEMS=100 #number of different tags that can appear in tag cloud
TAG_CLOUD_SORTING='size' #how tags will be ordered in the tag cloud. Valid values: random, alphabetically, alphabetically-rev, size and size-rev
TAG_CLOUD_BADGE=True #If True, displays the number of articles in each tag

Known issues

If you have overlap in the tags used between the two categories, the page for that tag will give the tag cloud for whatever the first article on the page is for that tag.

Since I used the same names as for the basic tag cloud plugin, this is not compatible with the base tag cloud.

If this somehow becomes popular, I might try to make this plugin a bit more official - giving it a unique name, adding some tests, and creating a tagcloud.html file that just gets included instead of copying and pasting the same code to two separate locations. So if you're using the plugin (or interested in it), let me know! Right now I only put in enough work to get it working for myself.

Reasons I left academia

2017-02-12T00:00:00-05:00

I recently made the transition from academia to industry. Some people were surprised I decided to leave - I had had a pretty successful academic career. So why did I want to leave? I've read a lot of articles and opinions on reasons people leave academia. The common reasons I often saw people cite for why they decided to leave didn't resonate with me. It's usually things like the academic job market being too tough, or industry paying better. The better pay is certainly nice, but that wasn't an impetus for my switch. I had never really worried about the job market. So why did I decide to leave?

Becoming unhappy in academia

I was pretty happy during my PhD. I learned a lot throughout it, got a lot of positive feedback, felt respected, and interacted with my peers a lot. Even so, by my last year, and especially after starting my postdoc, I was slowly becoming less satisfied. Eventually I was just not happy. I became stressed out, and research wasn't fun anymore.

One day I was describing my work environment to my roommate who works in industry. I told him I had maybe one to three meetings with people throughout the week. I explained that during my day, I mostly was by myself in an office. He said that sounded like it really sucked. That off-hand comment was actually really eye-opening to me. It had never occurred to me that working in such isolation probably wasn't good for me. I had just assumed that the natural process of becoming more independent would of course mean more isolation.

As a postdoc or a grad student, it would often just be me and my advisor on a project. If my advisor was busy (frequently the case), it meant I was alone on the project. If others were involved in the project, in my experience their involvement was pretty superficial and communications usually occurred via email, or meetings once every month or two. It really didn't feel like I had anyone 'with me' on solving issues on a project, which was stressful. More importantly, I felt lonely.

Talking to other friends in industry reinforced the impression that it was much less isolating to work in industry than in research. They talked about frequent code-reviews and stand-up meetings. Since project turn-around was much quicker, project meetings were much more frequent. There was always a lot of interaction going on.

Few reasons to stay

After I had pinpointed why I was feeling so unhappy, I did a lot of soul-searching on why I was so set on an academic career in the first place.

Supposedly the big advantages of academia are the intellectual satisfaction of research and the freedom to research what you want. However, I didn't feel intellectually satisfied. Science works slowly, with incremental advances. Every paper I wrote was just one interpretation of one small data set about what one small brain area did in one contrived task. I was familiar with the big theories in my field, and there weren't any new big ideas that excited me anymore. Intellectually, focusing on such a narrow space of science was just becoming boring.

The intellectual freedom didn't sound that exciting to me anymore. Sure, I could set up a wide range of different experiments, except 1) they had to be in line with a fundable research plan, and 2) they had to lead to publications. It seemed more like pressure to come up with good ideas than freedom to do what I want.

I'm a bit ashamed to admit that for me, the factor that made the decision difficult was the level of fame and prestige that comes with being an academic researcher. It's nice to have your name on a bunch of publications, to get some media attention, and have international colleagues that respect you.

Costs of staying

My growing unhappiness forced me to really face the sacrifices I was making by staying in academia. The biggest issue for me was the cost to my personal relationships. Academia meant moving at the whims of the academic job market - due to how few jobs there are, there's little freedom in where you live geographically. Already the moves I had done for academia had cost me important relationships. There was also putting off thinking about starting a family because I didn't feel like my life was stable enough to start one.

Deciding to leave

I think there's a big disconnect in mentality between academia and industry because they offer very different types of rewards. In academia, all of the metrics you have for how well you're doing don't exist in industry: publications, citations, H-index, etc. Until I really started to feel the cost of academia in terms of my happiness, I never felt the allure of industry because it didn't offer me these rewards I had been trained to seek through my academic training. Similarly, it was easy to ignore the income one could make in industry because money isn't one of the big motivators in academia since (at least at the postdoc level) everyone is making about the same low amount.

Once I was forced to really open my eyes to the possibility of leaving, suddenly I had to consider what an industry job could offer me. It was painful going through the process of changing my mentality and accepting that I might never get another publication, but that was okay because publications only mean anything in academia.

Eventually I started learning more about data science. I learned how it would allow me to work with different data, do the parts of science I love (analysis), and give me freedom to find jobs I wanted. I thought it sounded like a much better place for me. Hopefully that judgment was right!

Update

I wrote a retrospective on my first few months in industry and how I feel now about the move from academia. See it here.

Simple Stock Ticker App

2017-02-04T15:14:21-05:00

This was just a very simple learning project I did as part of The Data Incubator program.

The project itself was just a simple stock ticker. It requires as input the ticker, and produces a graph of the stock prices over time. Here is the finished product. You can also see the code for it here.

Flask Framework

Flask is a lightweight Python framework. It's pretty simple to use. Set up some template html files, define some GET and POST methods, and you're good to go.

The Data Incubator provided a pretty simple template for getting started. It provided some sample files, but the app itself just returned whatever the index template was.

I modified the template to have two pages: the index page, which just has a GET command, and a graph page, which has POST.

On the index page, the user inputs the ticker they want. This is used in the POST command, which requests and then plots the data.
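
For a sense of what that two-page layout looks like, here is a minimal sketch. It is not the app's actual code: the route paths, the `ticker` form field, and the inline templates are assumptions made to keep the example self-contained (the real app used separate template files), and the graph page just echoes the ticker instead of plotting anything.

```python
from flask import Flask, render_template_string, request

app = Flask(__name__)

# Inline templates keep the sketch self-contained (assumption: the real app used files)
INDEX_HTML = """
<form action="/graph" method="post">
  <input type="text" name="ticker" placeholder="Ticker symbol">
  <input type="submit" value="Plot">
</form>
"""

GRAPH_HTML = "<h1>Prices for {{ ticker }}</h1>"


@app.route("/", methods=["GET"])
def index():
    # The index page only needs GET: it just shows the input form
    return render_template_string(INDEX_HTML)


@app.route("/graph", methods=["POST"])
def graph():
    # The form posts the ticker here; data fetching and plotting would follow
    ticker = request.form.get("ticker", "").strip().upper()
    if not ticker:
        return "Please enter a ticker symbol.", 400
    return render_template_string(GRAPH_HTML, ticker=ticker)
```

Calling app.run() (or using the flask command) starts a local development server for trying it out.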

Requesting Data

The data I used for my stock ticker came from Quandl. I used Python's Requests library to make API requests.

Plotting Data

After requesting the data and getting it into a usable format, I plotted it using Bokeh.

Overall I really like Bokeh - some things are easier to do in Bokeh than in either Seaborn or matplotlib. But it definitely isn't perfect, and some things that feel like they should be trivial end up being more work than you would expect.

But the real power of Bokeh is that the figures it produces can be output as JavaScript so they can easily be placed into a website. The code for the Bokeh figure is simply dropped into the graph.html template as JavaScript, and the figure appears, all while staying in Python world.
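
One way to do this kind of embedding is Bokeh's components function, which returns a script snippet and a div placeholder for a figure. The sketch below assumes that approach; I'm not claiming it is exactly what the app does. The `build_price_plot` helper, the axis labels, and the sample values are all made up for illustration.

```python
from datetime import datetime

from bokeh.embed import components
from bokeh.plotting import figure


def build_price_plot(dates, prices, ticker):
    """Return (script, div) HTML snippets for a simple price line chart."""
    fig = figure(
        title=f"{ticker} closing price",
        x_axis_type="datetime",
        x_axis_label="Date",
        y_axis_label="Price",
    )
    fig.line(dates, prices, line_width=2)
    # components() returns the JavaScript and the placeholder div as strings
    return components(fig)


# Made-up values purely for illustration, not real market data
script, div = build_price_plot(
    [datetime(2017, 1, 3), datetime(2017, 1, 4), datetime(2017, 1, 5)],
    [10.0, 10.5, 10.2],
    "XYZ",
)
```

The two strings can then be passed to the template and rendered unescaped (for example with Jinja2's safe filter); the page also needs to load the BokehJS files for the figure to show up.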

This Website

2017-01-18T10:04:08-05:00

I felt it was fitting to write the first "Project" article on this website, since it's the most recent little project I've been working on.

Choosing a framework

I was looking for a few things in a website: Something easily customizable, easily publishable, capable of blog feeds, and that could easily incorporate Jupyter Notebooks directly into it.

Looking around, it seemed the best fit for the job was a static site generator. Static site generators are just programs that take a bunch of content, your settings and templates/stylesheets, and create static HTML files. Static site generators are simple to use, easy to customize, don't require some awkward web UI, and publishing is extremely fast and doesn't even require me to visit a website. They also have the advantage of being fast compared to any dynamic websites since the server just has to serve a plain HTML page. There are a few other advantages to them - I found this a helpful read.

With a static site generator, I can easily write an article in Markdown. Then with a couple of simple terminal commands I can publish it to the site.

Choosing a static site generator

It turns out Jekyll is the most popular static site generator. That means it has the advantage of the biggest community of support. However, there were two major disadvantages for me.

First, Jekyll is written in Ruby. I'm not very familiar with Ruby, and don't currently have any reason to learn it (besides Jekyll). This means if I wanted to dig into the code to customize things, it would be a big pain.

Second, Jekyll doesn't have easy support for Jupyter Notebooks. I want to be able to do some data exploration in a Jupyter Notebook, and then just upload the notebook as an article easily. While it is possible to post Jupyter Notebooks with Jekyll, it isn't supported natively and takes extra steps.

After doing some research, I ended up deciding on Pelican. Pelican is written in Python, my preferred language. With a simple plugin, publishing Jupyter Notebooks becomes almost as easy as publishing Markdown: the only extra step is creating a separate file for the metadata. Making changes to notebooks and then republishing becomes trivial, and life is easy.

Website host

After I had figured out how I was going to create my website, I had to choose a place to put it. This was a no-brainer. Because I'm using a static website, I don't need any fancy server support. GitHub Pages is free and allows me to change my site just by pushing to a GitHub repo. It also allows custom URLs (though not with HTTPS currently). So the whole process for publishing an article is: write a Markdown file or Jupyter Notebook, run pelican to generate the site pages, then commit and push to GitHub. Super easy and quick!
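The publishing steps could be scripted roughly like this. It is only a sketch under assumptions: the directory names, the settings file name, and the idea that the generated output directory is itself a checkout of the Pages repo are all placeholders, and the helper names are hypothetical.

```python
import subprocess


def publish_commands(message, content_dir="content", site_repo="output",
                     settings="pelicanconf.py"):
    """Build the commands for: generate the site, then commit and push it."""
    return [
        # Generate static HTML from the content directory into site_repo.
        ["pelican", content_dir, "-o", site_repo, "-s", settings],
        # Stage, commit, and push the generated pages from inside site_repo.
        ["git", "-C", site_repo, "add", "--all"],
        ["git", "-C", site_repo, "commit", "-m", message],
        ["git", "-C", site_repo, "push"],
    ]


def publish(message, **kwargs):
    """Run each step in order, stopping at the first failure."""
    for command in publish_commands(message, **kwargs):
        subprocess.run(command, check=True)
```

Calling `publish("Add new article")` would run the four commands in sequence. Building the command list separately makes it easy to print and inspect the steps before running anything.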

If you're looking for a tutorial on how to set up your own GitHub-hosted Pelican website, this tutorial is super helpful.

Hello, world!

2017-01-16T00:00:00-05:00

As of this writing, I'm in the middle of the transition from academia to industry data science. This website is meant partially to replace my academic website. It's also a place for me to showcase some of my personal data science projects. I'll probably be starting with some of the small projects I've done while building up some of my data science skills, and hopefully eventually be releasing some more polished projects.

I also intend to write blog entries (like this one). While my projects will tend to involve coding, learning a new skill, and/or exploring a data set, blog entries will just be free-form writing. I mostly intend to write about data science, statistics, science, and related things. Because I've so recently come from academia, I expect many will focus on the differences between academia and industry, and on the transition. And sometimes maybe I'll write about things completely unrelated.

This site is still somewhat under construction, and the organization of it might change quite a bit as it grows/matures.

If you want to read about the technical aspect of the blog (what tools I'm using and why), see this post.