Chris Conlan

Financial Data Scientist

Machine Learning: Visualizing N-Dimensional Data

January 30, 2017 By Chris Conlan 2 Comments

Data scientists are often faced with the task of understanding and modelling multivariate data sets. As any experienced data scientist will tell you, it is usually hard to wrap your head around data with more than four columns. While we can plot the relationship between each pair of variables individually, it is difficult to intuitively understand multivariate patterns. At this point, we typically throw our hands in the air and begin applying various machine learning algorithms to see which one achieves the highest prediction accuracy. All the while:

  1. We do not have a subconscious or instinctual understanding of our data.
  2. We cannot set proper expectations or goals for predictive accuracy.
  3. Traditional plotting methods are introducing random biases into our perception of the data.

We seek an instinctual and unbiased understanding of complex multivariate data.

The State of Machine Learning

The advancement of machine learning on vectors of univariate data is now primarily dependent on enabling a better understanding of complex data.

  1. We have hit a developmental and educational plateau in machine learning on vectors of univariate input.
  2. Data scientists are complacent about the performance of turnkey algorithms, leaving them prone to under-featurizing rich data.
  3. Promise lies in 3D, virtual reality, and augmented reality technologies. These technologies will enable a better understanding of multivariate data, leading to better performance via:
    1. Data cleaning
    2. Outlier detection
    3. Featurization

Back to Fisher’s Iris

Years ago, as a student in Professor Jeff Holt’s UVA Data Science class, I explored Fisher’s Iris data set. We used the R language for this class, which comes with the data pre-loaded. The first thing we do is…

plot(iris)

which generates this…

[Figure: pairwise scatterplot matrix of the iris data, as produced by plot(iris) in R]

which leaves us overwhelmed and confused.

We think: there are clearly some clusters and relationships, but we are not exactly sure what they are. Regardless, we move on to fitting a k-nearest neighbors model and get 99%+ in-sample accuracy. This is essentially perfect, so we accept that we don’t understand the data, but the computer does.

This line of thinking is convenient in education but toxic in production, where we face bigger and badder problems. In practice, we handle thousands to millions of rows of unnatural data. What differentiates the average data scientist from the extraordinary one here is the ability to draw on intuition to expertly clean and featurize.
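To make the in-sample accuracy figure above concrete, here is a minimal Python sketch of k-nearest neighbors scored on its own training data. It is not the code from the class: the helper names `knn_predict` and `in_sample_accuracy` are hypothetical, and the nine rows are a small sample copied from R's built-in iris data rather than the full 150 rows.

```python
import math
from collections import Counter

# Nine rows copied from R's built-in iris data (three per species):
# (sepal length, sepal width, petal length, petal width, species)
ROWS = [
    (5.1, 3.5, 1.4, 0.2, "setosa"),
    (4.9, 3.0, 1.4, 0.2, "setosa"),
    (4.7, 3.2, 1.3, 0.2, "setosa"),
    (7.0, 3.2, 4.7, 1.4, "versicolor"),
    (6.4, 3.2, 4.5, 1.5, "versicolor"),
    (6.9, 3.1, 4.9, 1.5, "versicolor"),
    (6.3, 3.3, 6.0, 2.5, "virginica"),
    (5.8, 2.7, 5.1, 1.9, "virginica"),
    (7.1, 3.0, 5.9, 2.1, "virginica"),
]


def knn_predict(train, features, k):
    """Predict a label by majority vote among the k closest training rows."""
    # Euclidean distance over the four numeric columns.
    neighbors = sorted(train, key=lambda row: math.dist(row[:4], features))[:k]
    votes = Counter(row[4] for row in neighbors)
    return votes.most_common(1)[0][0]


def in_sample_accuracy(rows, k):
    """Score the model on the same rows it was fit on."""
    hits = sum(knn_predict(rows, row[:4], k) == row[4] for row in rows)
    return hits / len(rows)


accuracy_k1 = in_sample_accuracy(ROWS, k=1)
print(accuracy_k1)
```

With k=1, every row is its own nearest neighbor at distance zero, so the in-sample score is perfect by construction. That is one reason a high in-sample number says little about whether anyone understands the data.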

Cue Next-Gen Visualization

The key to understanding complex multivariate data is 3D visualization. It is commonly thought that 3D visualization is the key to understanding only 3D data, but it is not quite that simple. In the same way we can add colors to 2D plots to denote different labels, we can add colors, shapes, sizes, and 2D labels onto 3D plots. In the end, there is enough space in the 3D rendering to intuitively understand the data. Instead of squinting your eyes at n-by-n scatterplots, the data hits your frontal cortex like a freight train, leaving you eager to get your hands dirty with the machine learning.

Below is a 3D rendering of the iris data set. Spheres are plotted as data points. The x, y, and z axes correspond to sepal length, sepal width, and petal length.

[Figure: Blender 3D rendering of the iris data]
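The axis mapping described above can be sketched in a few lines of Python. This is an illustration under assumptions, not the script that produced the rendering. The helper `to_position` is hypothetical, the three rows are copied from R's built-in iris data, and the Blender call assumes the 2.8+ signature of `bpy.ops.mesh.primitive_uv_sphere_add`, which differs in older releases. The Blender part only runs when the script is executed inside Blender.

```python
# Three rows copied from R's built-in iris data:
# (sepal length, sepal width, petal length, petal width, species)
IRIS_SAMPLE = [
    (5.1, 3.5, 1.4, 0.2, "setosa"),
    (7.0, 3.2, 4.7, 1.4, "versicolor"),
    (6.3, 3.3, 6.0, 2.5, "virginica"),
]


def to_position(row):
    """Map sepal length, sepal width, and petal length to (x, y, z)."""
    sepal_length, sepal_width, petal_length = row[:3]
    return (sepal_length, sepal_width, petal_length)


positions = [to_position(row) for row in IRIS_SAMPLE]

try:
    import bpy  # only importable inside Blender
except ImportError:
    bpy = None

if bpy is not None:
    for location in positions:
        # Assumed Blender 2.8+ keyword arguments; the fixed radius is arbitrary.
        bpy.ops.mesh.primitive_uv_sphere_add(radius=0.1, location=location)
```

Outside Blender the script still computes `positions`, which makes the mapping easy to check before rendering anything.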

Visualizing Greater-Than-Three Dimensions

As mentioned previously, we can control more variables within the 3D rendering to visualize more columns of data. The next rendering uses the fourth column of the iris data, the petal width, to scale the size of the spheres.

[Figure: 4D visualization of the iris data, with sphere size scaled by petal width]
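One plausible way to turn petal width into a sphere size is a linear rescale onto a fixed radius range. The sketch below assumes that approach. The function `scale_radius` and the radius bounds are hypothetical choices for illustration, and the post does not say which scaling was actually used.

```python
def scale_radius(value, lo, hi, r_min=0.05, r_max=0.25):
    """Linearly rescale value from [lo, hi] onto [r_min, r_max]."""
    if hi == lo:
        # Degenerate column: every point gets the same mid-range radius.
        return (r_min + r_max) / 2
    fraction = (value - lo) / (hi - lo)
    return r_min + fraction * (r_max - r_min)


# Petal widths from rows 1, 51, and 101 of R's built-in iris data.
PETAL_WIDTHS = [0.2, 1.4, 2.5]

lo, hi = min(PETAL_WIDTHS), max(PETAL_WIDTHS)
radii = [scale_radius(width, lo, hi) for width in PETAL_WIDTHS]
print(radii)
```

Keeping the radius range small relative to the axis ranges helps prevent large spheres from hiding their neighbors. The right bounds depend on how densely the points are packed.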

Using Shapes and Colors for Categorical Data

The target variable in the iris data set is generally the flower type. There are three flower types: setosa, versicolor, and virginica. We will map these to the shapes sphere, cube, and cone, respectively. The results are surprisingly intuitive. The figure below sheds some light on how nearest neighbors algorithms can achieve 99%+ in-sample accuracy with this data.

I did not understand Fisher’s Iris data until I saw this plot.

[Figure: 5D visualization of the iris data in Blender, with shape denoting flower type]
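Putting the five columns together, each row can be described by a location, a size, and a shape before anything is drawn. The sketch below shows that bookkeeping only. The names `SHAPE_BY_SPECIES` and `to_glyph` are hypothetical, the rows are copied from R's built-in iris data, and rendering is left out, since each shape would need its own Blender primitive call.

```python
# Flower type to shape, following the mapping described in the text.
SHAPE_BY_SPECIES = {
    "setosa": "sphere",
    "versicolor": "cube",
    "virginica": "cone",
}

# Three rows copied from R's built-in iris data:
# (sepal length, sepal width, petal length, petal width, species)
IRIS_SAMPLE = [
    (5.1, 3.5, 1.4, 0.2, "setosa"),
    (7.0, 3.2, 4.7, 1.4, "versicolor"),
    (6.3, 3.3, 6.0, 2.5, "virginica"),
]


def to_glyph(row):
    """Describe one data point as a location, a size, and a shape."""
    sepal_length, sepal_width, petal_length, petal_width, species = row
    return {
        "location": (sepal_length, sepal_width, petal_length),
        "size": petal_width,  # raw value; rescale before using it as a radius
        "shape": SHAPE_BY_SPECIES[species],
    }


glyphs = [to_glyph(row) for row in IRIS_SAMPLE]
for glyph in glyphs:
    print(glyph)
```

Separating the description of a point from the drawing code means the same list of glyphs could feed Blender, a browser renderer, or a VR scene.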

See It to Believe It

I used Blender’s Python API to generate these plots, and the screenshots do not do the final product justice. To professionals who understand Blender’s 3D Viewport, the result is stunningly simple. Rotating around and focusing on certain sections of the rendering shows each cluster of data to be perfectly distinct. Tools like this will be implemented in browsers, virtual reality devices, and augmented reality devices in the near future, where they will become essential tools for the modern data scientist. The technical hurdles encountered in data visualization will help further optimize these technologies.

Look Out for Our Presentation

I am currently working with Professor Jeff Holt, Professor Gretchen Martinet, and programmer James Wang of the University of Virginia to create user-friendly 3D visualization software in Google VR. We will be presenting our prototype with the UVA Statistics Department in April 2017 as part of a student-teacher collaboration to promote machine learning education.


Comments

  1. S Mraz says

    March 15, 2018 at 12:52 pm

    Hi chris!
    Very nice examples for someone who is a beginner. Any pointers on how to color the objects?

    • Chris Conlan says

      March 15, 2018 at 1:55 pm

      Yes, please see this post for the code: https://chrisconlan.com/visualizing-data-blender-python/

      Plotting colors is actually a little complicated because it requires us to access Blender textures. See the Blender Python API book for more info on this: https://chrisconlan.com/books/
