Data scientists are often faced with the task of understanding and modelling multivariate data sets. As any experienced data scientist will tell you, it is usually hard to wrap your head around data with more than four columns. While we can plot the relationship between any pair of variables, it is difficult to intuitively understand multivariate patterns. At this point, we typically throw our hands in the air and begin applying various machine learning algorithms to see which one achieves the highest prediction accuracy. All the while:
- We do not have a subconscious or instinctual understanding of our data.
- We cannot set proper expectations or goals for predictive accuracy.
- Traditional plotting methods are introducing random biases into our perception of the data.
We seek an instinctual and unbiased understanding of complex multivariate data.
The State of Machine Learning
The advancement of machine learning on vectors of univariate data is now primarily dependent on enabling a better understanding of complex data.
- We have hit a developmental and educational plateau in machine learning on vectors of univariate input.
- Data scientists are complacent with the performance of turnkey algorithms, leaving them prone to under-featurizing rich data.
- Promise lies in 3D, virtual reality, and augmented reality technologies. These technologies will enable a better understanding of multivariate data, leading to better performance via:
  - Data cleaning
  - Outlier detection
Back to Fisher’s Iris
Years ago, as a student in Professor Jeff Holt’s UVA Data Science class, I explored the Fisher’s Iris data set. We used the R Language for this class, which comes with the data pre-loaded. The first thing we do is…
which generates this…
which leaves us overwhelmed and confused.
We think there are clearly some clusters and relationships, but we are not exactly sure what they are. Regardless, we move on to fitting a k-nearest neighbors model and get 99%+ in-sample accuracy. This is essentially perfect, so we accept that we do not understand the data, but the computer does.
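To make "in-sample accuracy" concrete, here is a minimal sketch in plain Python rather than the R used in class. It is not the class exercise: it uses only six rows of the iris data, and the helper names `knn_predict` and `in_sample_accuracy` are hypothetical, chosen for illustration.

```python
from collections import Counter
from math import dist

# Six rows of Fisher's Iris (two per species); the full data set has 150.
# Columns: sepal length, sepal width, petal length, petal width, species.
SAMPLE = [
    (5.1, 3.5, 1.4, 0.2, "setosa"),
    (4.9, 3.0, 1.4, 0.2, "setosa"),
    (7.0, 3.2, 4.7, 1.4, "versicolor"),
    (6.4, 3.2, 4.5, 1.5, "versicolor"),
    (6.3, 3.3, 6.0, 2.5, "virginica"),
    (5.8, 2.7, 5.1, 1.9, "virginica"),
]


def knn_predict(train, features, k):
    """Return the majority label among the k training rows nearest to `features`."""
    nearest = sorted(train, key=lambda row: dist(row[:4], features))[:k]
    votes = Counter(row[4] for row in nearest)
    return votes.most_common(1)[0][0]


def in_sample_accuracy(rows, k):
    """Score the model on the same rows it was fitted on."""
    hits = sum(knn_predict(rows, row[:4], k) == row[4] for row in rows)
    return hits / len(rows)


accuracy = in_sample_accuracy(SAMPLE, k=1)
```

Note that with `k=1` every row is its own nearest neighbor, so the in-sample score is perfect by construction. That is one reason a high in-sample number says little about whether we understand the data.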
This line of thinking is convenient in education but toxic in production, where we face bigger and badder problems. In practice, we handle thousands to millions of rows of unnatural data. What differentiates the average data scientist from the extraordinary one here is the ability to draw on intuition to expertly clean and featurize.
Cue Next-Gen Visualization
The key to understanding complex multivariate data is 3D visualization. It is commonly thought that 3D visualization is useful only for understanding 3D data, but it is not quite that simple. In the same way we can add colors to 2D plots to denote different labels, we can add colors, shapes, sizes, and 2D labels onto 3D plots. In the end, there is enough space in the 3D rendering to intuitively understand the data. Instead of squinting at n-by-n scatterplots, the data hits your frontal cortex like a freight train, leaving you eager to get your hands dirty with the machine learning.
Below is a 3D rendering of the iris data set. Each data point is plotted as a sphere. The x, y, and z axes correspond to sepal length, sepal width, and petal length.
Visualizing Greater-Than-Three Dimensions
As mentioned previously, we can control more variables within the 3D rendering to visualize more columns of data. The next rendering uses the fourth column of the iris data, the petal width, to scale the size of the spheres.
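One way to turn a numeric column into a sphere size is a linear rescale onto a chosen radius range. The sketch below is an assumption about how this could be done, not the code behind the rendering: the function name `scale_to_range` and the radius bounds 0.05 and 0.25 are hypothetical values picked for illustration.

```python
def scale_to_range(values, low, high):
    """Linearly rescale values so the minimum maps to `low` and the maximum to `high`."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # A constant column carries no size information; use the midpoint.
        return [(low + high) / 2 for _ in values]
    span = v_max - v_min
    return [low + (v - v_min) / span * (high - low) for v in values]


# Petal widths (cm) of three iris rows, one per species.
petal_widths = [0.2, 1.4, 2.5]

# Assumed radius bounds: small enough that neighboring spheres rarely overlap.
radii = scale_to_range(petal_widths, 0.05, 0.25)
```

Keeping the smallest radius above zero means that flowers with the narrowest petals still show up in the rendering.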
Using Shapes and Colors for Categorical Data
The target variable in the iris data set is generally the flower type. There are three flower types: setosa, versicolor, and virginica. We will map them to three shapes: sphere, cube, and cone, respectively. The results are surprisingly intuitive. The figure below sheds some light on how nearest neighbors algorithms can achieve 99%+ in-sample accuracy with this data.
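The full encoding described so far (three columns for position, one for size, species for shape) can be written down as a small lookup. This is a sketch under my own naming: `SHAPE_BY_SPECIES`, `to_marker`, and the dictionary keys are hypothetical, not taken from the original plotting code.

```python
# Species-to-shape mapping as described in the text.
SHAPE_BY_SPECIES = {
    "setosa": "sphere",
    "versicolor": "cube",
    "virginica": "cone",
}


def to_marker(row):
    """Translate one iris row into a description of a 3D marker."""
    sepal_length, sepal_width, petal_length, petal_width, species = row
    return {
        "location": (sepal_length, sepal_width, petal_length),  # x, y, z
        "size_source": petal_width,  # to be rescaled into a radius
        "shape": SHAPE_BY_SPECIES[species],
    }


marker = to_marker((7.0, 3.2, 4.7, 1.4, "versicolor"))
```

Keeping the mapping in one dictionary makes it easy to swap shapes for colors, or to use both at once.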
I did not understand Fisher’s Iris data until I saw this plot.
See It to Believe It
I used Blender’s Python API to generate these plots, and the screenshots do not do the final product justice. To professionals who understand Blender’s 3D Viewport, the result is stunningly simple. Rotating around and focusing on certain sections of the rendering shows each cluster of data to be perfectly distinct. Tools like this will be implemented in browsers, virtual reality devices, and augmented reality devices in the near future, where they will become essential tools for the modern data scientist. The technical hurdles encountered in data visualization will help further optimize these technologies.
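For readers curious what the Blender side might look like, here is a rough sketch and not the script behind the screenshots. The helper `add_marker` is hypothetical. It takes the mesh operator namespace as an argument so the logic can be exercised outside Blender, and the keyword names assume Blender 2.80 or later (older releases named some of these arguments differently).

```python
def add_marker(mesh_ops, shape, location, size):
    """Add one primitive via a `bpy.ops.mesh`-like object.

    `size` is the approximate half-extent of the marker, so the three
    shapes occupy a comparable volume in the scene.
    """
    if shape == "sphere":
        mesh_ops.primitive_uv_sphere_add(radius=size, location=location)
    elif shape == "cube":
        # Cube `size` is the full edge length.
        mesh_ops.primitive_cube_add(size=2 * size, location=location)
    elif shape == "cone":
        mesh_ops.primitive_cone_add(radius1=size, depth=2 * size, location=location)
    else:
        raise ValueError(f"unknown shape: {shape}")


# Inside Blender's Python console (assumed usage):
#   import bpy
#   add_marker(bpy.ops.mesh, "cone", (6.3, 3.3, 6.0), 0.25)
```

Calling operators once per data point is fine for 150 flowers. For much larger data sets a different approach would likely be needed, since operator calls carry noticeable overhead.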
Look Out for Our Presentation
I am currently working with Professor Jeff Holt, Professor Gretchen Martinet, and programmer James Wang of the University of Virginia to create user-friendly 3D visualization software in Google VR. We will be presenting our prototype with the UVA Statistics Department in April 2017 as part of a student-teacher collaboration to promote machine learning education.