Visualizing Big Data

[Figure: 100,000 rows - heatmap of Java source code]

A couple of weeks ago I was asked how visualization relates to Big Data. I am far from a Big Data expert, but I know the basics and have been studying and selling visualization solutions for the past seven years. This article describes my thoughts on visualizing Big Data, not as a definitive statement, but as an exploration of ideas.

Defining Big Data

Different people have different definitions for what constitutes Big Data. For the purposes of this article, I’m going to leave the definition vague. For me, Big Data includes all the movie recommendations in Netflix’s database, the shopping cart data for every customer of a grocery store or the minute-by-minute location of everyone with a GPS mobile phone, as well as smaller data sets such as the list of a million tasks in an enterprise project management system or the work history of everyone in LinkedIn.

Uses of Big Data

Proper use of data visualization requires first understanding the problem being solved. Big Data can be used for different types of problems. Roughly, the problems I see include:

  • Automating Decisions
    Data used to trigger a decision, such as low inventory levels triggering automatic re-orders or automated credit approvals. Big Data determines the decision here, either explicitly through rules or implicitly through inference-based systems.
  • Creating Operational Data
    Data used as input into algorithms, such as calculating average highway speeds when determining the optimal trucking route. Big Data influences the result here, but other factors may drive the final result.
  • Supporting Decisions
    Data used for a user-driven decision, such as determining how to arrange products in a store. Big Data informs the result here, but the final result requires human judgement.
  • Creating Personalized Views
    Data used to create personalized views for an individual, such as which movies to watch based on previous rentals. Big Data creates the result here, but the final result is a small data subset.

Each of these has different patterns of how people interact with Big Data. While automated decisions cut out people entirely, people are a central part of using Big Data to support decisions. And since visualization is entirely about people, these distinctions become useful.

Visualizing Big Data

Depending on the use of Big Data, the goals and techniques used for visualization will differ.

For instance, Big Data doesn’t need to be visualized when used in automated decisions, except to monitor and improve the algorithms. The actual process of analyzing the Big Data and implementing a decision doesn’t involve a human, so visualization adds nothing to that step.

On the other side, it’s difficult to use Big Data to support decision-making without the use of visualization. While you could, in theory, reduce Big Data to a single number, the value of Big Data is in the details, where data visualization shines.

If we look at the use cases described above, the uses of visualization then become:

  • Exploring Data
    To help people explore data, either to support decisions or to improve the development of the algorithms used in creating operational data or automating decisions. Exploring Big Data often requires new types of visualization platforms that can support navigating and visualizing huge data sets.
  • Monitoring Results
    To help people debug and monitor the results of using Big Data, such as the improved purchase rate of a new recommendation algorithm. While the data driving the algorithm may be huge, the result data can often be visualized using existing tools.

    For instance, you may be looking at 10 billion transactions, but only be concerned about the effectiveness of recommendations for 50,000 products. While 10 billion transactions require new types of visualization, 50,000 products can easily be shown on a heat map. It’s the result data you’re analyzing, not the source data.

  • Finding Insights
    To help people make better decisions using insights gained from Big Data. In this case, Big Data gets reduced to a manageable size through data mining and aggregation algorithms. While new visualization techniques may help in analyzing Big Data at the transaction level, most Big Data visualization today is of summarized views…

    …which is often good enough. In supporting decisions, it’s important not to attempt to visualize too much data. Analyze data at the level at which you’re optimizing, or one level deeper. Visualization can be used to see greater levels of detail, but greater detail can introduce noise that interferes with identifying the broad trends required for higher-level decisions.

  • Exploring Results
    To help people explore their own personalized view of Big Data. Here Big Data transforms to Small Data, and existing visualization tools can be used.

    For instance, Netflix might analyze a billion movie ratings to recommend a movie to you, but they’ll only present 10-20 movies to you at once. Visualization can help explore more of these results at once, but you likely won’t be viewing the entire Netflix catalogue.
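
As a rough illustration of the Monitoring Results point above, the sketch below reduces a stream of transaction records to one purchase rate per product, which is the kind of small result set a heat map can display. The record layout (a product ID paired with a purchased flag), the sample data and the function name are all assumptions made for illustration; they are not taken from any real system.

```python
from collections import defaultdict


def purchase_rate_by_product(transactions):
    """Reduce (product_id, purchased) records to one purchase rate per product.

    `transactions` can be any iterable, so the source data never has to fit
    in memory; only the per-product counters do.
    """
    shown = defaultdict(int)   # times a product was recommended
    bought = defaultdict(int)  # times that recommendation led to a purchase
    for product_id, purchased in transactions:
        shown[product_id] += 1
        if purchased:
            bought[product_id] += 1
    return {pid: bought[pid] / shown[pid] for pid in shown}


# Hypothetical sample: five transactions reduce to two result rows.
events = [("A", True), ("A", False), ("B", False), ("A", True), ("B", True)]
rates = purchase_rate_by_product(events)
```

The output has one entry per product no matter how many transactions were read, so it is the product count, not the transaction count, that determines what the visualization has to handle.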

Comments?

Do you agree with these categories? What did I miss? Do you know of specific visualization techniques well suited for Big Data? Give me your thoughts by leaving a comment below or on Twitter at FastFedora.

2 comments

  1. Matt says:

    I think you’ve got it about right. Our software helps users visualize large amounts of radio frequency data. They have new capabilities to capture vast amounts of data, but it is useless without tools to visualize it and search it (like the Internet without a browser and search engine).

    A recurring theme for us is letting the machine do what machines do well and letting the human do what humans do well. The machine can display a visualization and search using specific parameters, but the human can recognize patterns much better than a machine. The machine is better at dealing with the expected, the human is better at dealing with the unexpected.

  2. Eric Jackson says:

    It seems to me there might be another dimension here, having to do with relationships between people and the visualization, or perhaps between the visualization and the decision process. Examples that come to mind: a) one or more people actively using visualization to understand something critical to a decision, b) a presenter using visualization to convince an audience of the points he’s trying to make, c) a group of people using a visualization to establish a context for their discussion. What’s changing here is not so much the use of the visualization as the relationship of the visualization to the process.

    A secondary comment here is that none of the uses are particularly related to big data (as opposed to a few thousand entries in a spreadsheet). The key point seems to be that visualization is relevant to big data when support for a decision requires human rather than algorithmic insight into a massive dataset.
