To me, data scientists are software developers with a knack for numerical analysis. When I differentiate my role as a data scientist from my role as a software developer, I am highlighting my skills with…
- Machine learning & numerical optimization
- Matrix algebra & data structures
- Statistical analysis & data visualization
- Scientific computing stacks (SciPy, BioConda, etc.)
- Distributed computing tools (Spark, Hadoop, MRO, etc.)
Of course, I may use many different programming languages in any given data science task. My career has been defined by addressing numerical analysis needs with the right software to get the job done. In the simplest terms, programming is the best way to actuate a talent in mathematics.
When numerical analysis is the goal, some programming language is part of the solution. While I like to differentiate my roles as a data scientist and a software developer, I am wary of data scientists who do not also consider themselves software developers.
Machine Learning from Scratch
Machine learning is hot. As of late, the new buzzword is artificial intelligence. For me, it's always been called numerical optimization. I call it this because of how I approach machine learning problems.
For a few years now, machine learning has been taught at universities and coding bootcamps as a set of ready-made algorithms. In these classes, I often hear concerning statements like, “It doesn’t matter what data you input, the same algorithm can learn the solution.” For example, in a Lynda course called “Machine Learning Essential Training”, the instructor claims that the same algorithms used for email spam classification can be used to identify handwritten numbers. This may be technically true, but if it were that simple, data scientists would be out of a job.
The truth is that most new applications of machine learning require brand new algorithms. When we build brand new algorithms to tackle machine learning problems, we are working on numerical optimization problems.
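To make that framing concrete, here is a minimal sketch of "learning" written as plain numerical optimization, with no machine learning library involved. It is a toy illustration rather than code from any project described here: the function names `mean_squared_error` and `fit_line`, the sample data, and the learning rate are all assumptions chosen for the example. It fits a line y = a*x + b by repeatedly stepping the parameters downhill on the squared error.

```python
def mean_squared_error(a, b, xs, ys):
    """Objective to minimize: average squared gap between a*x + b and y."""
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_line(xs, ys, learning_rate=0.05, steps=5000):
    """Fit y = a*x + b by gradient descent on the mean squared error.

    learning_rate and steps are illustrative values tuned for the toy
    data below; other data may need different settings.
    """
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Partial derivatives of the mean squared error with respect to a and b.
        grad_a = sum(2 * (a * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (a * x + b - y) for x, y in zip(xs, ys)) / n
        # Step against the gradient to reduce the error.
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
    return a, b


# Toy data generated from y = 2x + 1 (hypothetical, noise-free).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

a, b = fit_line(xs, ys)
print(f"a = {a:.4f}, b = {b:.4f}, error = {mean_squared_error(a, b, xs, ys):.2e}")
```

Everything problem-specific lives in the objective function and its gradient. A new application usually means writing a new objective, which is where the case-specific work comes in.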
For example, I was once tasked with determining the 3D shape of a wall of a house given the corner points of the windows drawn on 2D images. I can guarantee you, there are no off-the-shelf algorithms prepared to do this. The success of the algorithm required me to take advantage of very case-specific constraints of the problem, and program every addition, subtraction, multiplication, and division operation myself. To do this, I used a number of matrix algebra, 3D modeling, and multi-core computing libraries, but not one machine learning library.
Building new numerical optimization procedures is a delicate and iterative process that requires tons of research, documentation, and preparation, but it is how we build new machine learning algorithms. The advent of off-the-shelf machine learning algorithms has the hard work of many numerical optimization experts to thank. Ultimately, the result is very powerful and satisfying.