Unraveling NCAA March Madness: Seeding, Upsets, and Performance Trends

Our visualizations explore the relationship between seeding and performance in NCAA March Madness. They show that upsets are common enough to challenge the idea that seeding is decisive, yet better-seeded teams still tend to prevail in the later rounds.

Data Description

Our visualization is based on a March Madness dataset containing every NCAA tournament game result since 1985, when the tournament expanded to a 64-team bracket. It includes information such as the year of the tournament, the round of each game (ranging from 1 to 6), the seeds of the competing teams (ranging from 1 to 16), the region of the bracket (ranging from 1 to 4), and the scores of the games.

Data Cleaning

The original dataset is a CSV with 10 variables: 'Year', 'Round', 'Region Number', 'Region Name', 'Seed', 'Score', 'Team', 'Team', 'Score', 'Seed'. Because distinct columns shared the same names, we renamed the variables as follows: 'Year', 'Round', 'Region Number', 'Region Name', 'Seed1', 'Score1', 'Team1', 'Team2', 'Score2', 'Seed2'. We then wrote a Python script to create a new CSV file with one additional variable: Upset. We defined Upset as a boolean that is true when the team with the numerically higher (worse) seed outscored the team with the numerically lower (better) seed, that is, when a higher-ranked team lost to a lower-ranked one. We based our visualizations on this new CSV dataset.
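The cleaning step described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the script we actually ran: the function names `is_upset` and `add_upset_column` are hypothetical, and the sketch assumes the columns appear in the order of the renamed header and that seeds and scores are integers.

```python
import csv
import io

# Assumed column order, taken from the renamed header described above.
COLUMNS = ["Year", "Round", "Region Number", "Region Name",
           "Seed1", "Score1", "Team1", "Team2", "Score2", "Seed2"]


def is_upset(seed1, score1, seed2, score2):
    """Return True when the numerically higher (worse) seed outscores the other team."""
    if seed1 == seed2:
        # Equal seeds can meet in late rounds; this sketch does not count that as an upset.
        return False
    if seed1 > seed2:
        return score1 > score2
    return score2 > score1


def add_upset_column(src, dst):
    """Read games from src, write them to dst with unique column names plus Upset."""
    reader = csv.reader(src)
    next(reader)  # discard the original header, which has duplicate names
    writer = csv.writer(dst)
    writer.writerow(COLUMNS + ["Upset"])
    for row in reader:
        rec = dict(zip(COLUMNS, row))
        upset = is_upset(int(rec["Seed1"]), int(rec["Score1"]),
                         int(rec["Seed2"]), int(rec["Score2"]))
        writer.writerow(row + [upset])


# Usage with two made-up games (team names and scores are illustrative only).
raw = (
    "Year,Round,Region Number,Region Name,Seed,Score,Team,Team,Score,Seed\n"
    "2000,1,1,East,1,80,Team A,Team B,60,16\n"
    "2000,1,1,East,5,61,Team C,Team D,70,12\n"
)
cleaned = io.StringIO()
add_upset_column(io.StringIO(raw), cleaned)
```

With real files, `src` and `dst` would be handles from `open(..., newline="")`, which is what the `csv` module expects.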

Variables in the Dataset

Variable Name | Description
--- | ---
'Year' | The year of the tournament.
'Round' | The round of the tournament game (1 to 6).
'Region Number' | The number of the region in the tournament bracket.
'Region Name' | The name of the region in the tournament bracket.
'Seed1' | The seed of Team 1.
'Score1' | The score of Team 1.
'Team1' | The name of Team 1.
'Seed2' | The seed of Team 2.
'Score2' | The score of Team 2.
'Team2' | The name of Team 2.
'Upset' | A boolean value indicating whether an upset occurred (the team with the numerically higher, i.e. worse, seed defeated the team with the numerically lower seed).

Design Rationale

The interactive elements in our visualization aim to enhance user engagement and exploration of the NCAA March Madness data. We incorporated several design decisions to ensure these interactions are discoverable, usable, and interesting.

Scatterplot

The scatterplot shows the relationship between seed values and scores. Instead of traditional dots, each data point is represented by a basketball emoji, a thematic choice that adds context and visual appeal and reinforces our NCAA basketball topic. A linear regression line helps viewers identify the overall trend and draw conclusions.

Mouseover events on the basketball emoji in the scatterplot trigger a tooltip, providing detailed information about each data point (seed and score). This allows users to explore specific data points and understand the correlation between seed and performance.
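For readers curious how a regression line like this can be computed, here is a minimal ordinary least-squares sketch. It is an illustration, not the code behind our chart: `fit_line` is a hypothetical helper, and it assumes each point is a (seed, score) pair.

```python
def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept.

    points: list of (x, y) pairs, here assumed to be (seed, score).
    Returns (slope, intercept).
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points to fit a line")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Sum of squared deviations in x, and of cross deviations in x and y.
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up points lying exactly on y = 82 - 2x, so the fit recovers that line.
slope, intercept = fit_line([(1, 80), (8, 66), (16, 50)])
```

A negative slope means that scores tend to fall as the seed number rises, which is the pattern the regression line is meant to make visible.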

Discoverability and Usability

The tooltip appears dynamically on mouseover, offering contextual information without cluttering the main visualization. As a tradeoff, the visibility and affordance of this feature are lower, since users might not notice it until they hover over specific elements. Even so, the feature makes the scatterplot more informative and engaging.

Heatmap

The sequential colour scale creates a visual hierarchy, allowing viewers to quickly gauge how frequent upsets are for different seed combinations.

Like our previous design with the scatterplot, mouseover events on cells in the heatmap trigger a tooltip, displaying information about the corresponding seed combinations and the number of upsets. This interactive element allows users to quickly identify areas with high upset occurrences.
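The numbers behind each heatmap cell can be thought of as a simple tally of upsets per seed pairing. The sketch below is one possible way to build that tally, assuming games are dictionaries with integer 'Seed1', 'Score1', 'Seed2', and 'Score2' values and that no game ends in a tie. The helper names `upset_counts` and `to_grid`, and the choice of winner seed for rows and loser seed for columns, are assumptions made for illustration.

```python
from collections import Counter


def upset_counts(games):
    """Count upsets per (winning seed, losing seed) pair."""
    counts = Counter()
    for g in games:
        if g["Score1"] > g["Score2"]:
            winner, loser = g["Seed1"], g["Seed2"]
        else:
            winner, loser = g["Seed2"], g["Seed1"]
        if winner > loser:  # the numerically higher (worse) seed won
            counts[(winner, loser)] += 1
    return counts


def to_grid(counts, size=16):
    """Lay the counts out as a size x size grid: rows are winner seeds, columns loser seeds."""
    grid = [[0] * size for _ in range(size)]
    for (winner, loser), n in counts.items():
        grid[winner - 1][loser - 1] = n
    return grid


# Made-up games for illustration.
sample = [
    {"Seed1": 1, "Score1": 54, "Seed2": 16, "Score2": 74},   # 16 over 1
    {"Seed1": 12, "Score1": 70, "Seed2": 5, "Score2": 60},   # 12 over 5
    {"Seed1": 12, "Score1": 66, "Seed2": 5, "Score2": 65},   # 12 over 5 again
    {"Seed1": 2, "Score1": 80, "Seed2": 15, "Score2": 60},   # not an upset
]
grid = to_grid(upset_counts(sample))
```

Each grid value would then be mapped onto the sequential colour scale, so a pairing that has produced only one upset sits at the extreme end of the scale.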

Discoverability and Usability

Similar to the scatterplot, the heatmap tooltip appears on mouseover, providing additional details without overwhelming the user. The same drawback persists as the user might not discover the functionality until they actively hover.

“CLICK ME”

A falling basketball animation is triggered once the user interacts with the "CLICK ME" button. The bouncing animation adds a playful element and serves as a visual cue: when the basketballs finish bouncing a few times, a summary box appears to provide a concise overview of the project, enhancing user understanding.

Discoverability and Usability

The "CLICK ME" button triggers the animation, which is a one-time event, so it does not disrupt the overall usability of the visualization. The button has a very clear interaction affordance because it directly invites users to click it. Clicking prompts a summary box that is centrally positioned on the page and designed as a floating overlay, making it easy to notice without disrupting the overall layout. Users can read a quick overview of our data visualization before proceeding to other project-related information.

Checkbox Filters

Checkbox filters for years and rounds enable users to explore specific subsets of the data, giving them access to more detailed data and deeper insights. For a quick overview of all the data, we also added “Select All” and “Clear All” buttons for both the years and rounds variables.

Discoverability and Usability

Checkboxes themselves clearly afford the interaction of selecting and deselecting. The "Select All" and "Clear All" buttons offer a convenient way to manage multiple selections.
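The filtering logic behind the checkboxes amounts to keeping only the games whose year and round are both currently selected. A minimal sketch, assuming games are dictionaries with integer 'Year' and 'Round' values; `filter_games` and `all_values` are hypothetical names, not functions from our page:

```python
def filter_games(games, years, rounds):
    """Keep games whose Year and Round are both among the checked values."""
    years, rounds = set(years), set(rounds)
    return [g for g in games if g["Year"] in years and g["Round"] in rounds]


def all_values(games, key):
    """Every distinct value of a field, i.e. what "Select All" would check."""
    return sorted({g[key] for g in games})


# Made-up games for illustration.
sample = [
    {"Year": 1985, "Round": 1},
    {"Year": 1985, "Round": 6},
    {"Year": 2018, "Round": 1},
]

# All years checked, only round 1 checked.
round_one = filter_games(sample, all_values(sample, "Year"), [1])

# "Clear All" on years leaves nothing selected, so no games remain.
nothing = filter_games(sample, [], all_values(sample, "Round"))
```

Treating an empty selection as "show nothing" rather than "show everything" is a design choice in this sketch; either behaviour is defensible as long as the buttons make the current state obvious.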

Layout

The single-column layout with flexbox ensures a clean and focused presentation of content for data visualizations. Centering the main content also makes the visualization accessible and aesthetically pleasing.

Visualization Containers

The white background, border radius and box shadow of the visualization containers separate the graph from the background, providing a polished appearance while catching the viewer’s attention. The rounded corners also add a touch of softness to the containers.

Colour Scheme

The overall background is a linear gradient from light blue to a darker shade, providing a visually appealing and dynamic backdrop. The orange text color contrasts with the background, improving readability. This consistent color scheme contributes to a cohesive and aesthetically pleasing design. It also reinforces the thematic connection to NCAA basketball, where blue and orange are commonly associated colors.

The Objective

Our visualizations are intended to show historical trends in seeding and performance in the NCAA end-of-season tournament, known colloquially as “March Madness.” We wanted to investigate whether seeding is statistically as big a deal as sports media make it seem. “Upsets,” in which a worse-ranked team (one with a numerically higher seed) beats a better-ranked team, are a huge deal in sports. For a very long time, there was a fascination with upsets in college basketball specifically because a 16 seed had never beaten a number 1 seed in the history of the tournament until 2018. Furthermore, the NCAA tournament involves fans predicting all games in a “bracket.” To this day, no one has ever predicted a perfect bracket. This further adds to the folklore of March Madness and its seeding. Theoretically, if seeding were perfect, simply picking the better-seeded team at each stage would lead to a perfect bracket, but evidently this has never been a successful strategy. This prompted our interest in this particular dataset.
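One way to make the "always pick the better seed" strategy concrete is to measure how often it would have been right. The sketch below is only an illustration of that idea, assuming games are dictionaries with integer 'Seed1', 'Score1', 'Seed2', and 'Score2' values; `chalk_accuracy` is a hypothetical helper, and games between equal seeds are skipped because the strategy gives no pick for them.

```python
def chalk_accuracy(games):
    """Fraction of games won by the better (numerically lower) seed.

    Games between equally seeded teams are ignored. Returns None when
    no game has differently seeded teams.
    """
    decided = 0
    correct = 0
    for g in games:
        if g["Seed1"] == g["Seed2"]:
            continue
        decided += 1
        team1_won = g["Score1"] > g["Score2"]
        team1_better = g["Seed1"] < g["Seed2"]
        # The pick is right when the better-seeded team is also the winner.
        if team1_won == team1_better:
            correct += 1
    return correct / decided if decided else None


# Made-up games for illustration: two go to the better seed, one is an upset,
# and one is between equal seeds and is skipped.
sample = [
    {"Seed1": 1, "Score1": 80, "Seed2": 16, "Score2": 60},
    {"Seed1": 10, "Score1": 65, "Seed2": 7, "Score2": 72},
    {"Seed1": 5, "Score1": 61, "Seed2": 12, "Score2": 70},
    {"Seed1": 1, "Score1": 75, "Seed2": 1, "Score2": 70},
]
accuracy = chalk_accuracy(sample)
```

A value of 1.0 would mean seeding predicted every game; anything lower reflects upsets, and a perfect bracket requires every single pick to be right.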

The linear regression visualization can be filtered on different combinations of years and rounds. March Madness has a total of 6 rounds: Round of 64, Round of 32, Round of 16, Elite Eight, Final Four, and the National Championship. One interesting observation is that seeding seems to matter less and less until about rounds 5 and 6. For example, when filtering on all years but only round 1, the slope is steeper than in rounds 2 and 3, indicating that round 1 is where lower-numbered (better) seeds win most reliably. This makes conceptual sense, as the remaining teams should theoretically be more evenly matched as they win more games. However, in the later rounds (5 and 6), the slope is the steepest of all the rounds, with many 1 seeds winning it all. The highest-numbered seed to win the championship was an 8 seed. This indicates that upsets tend to happen in rounds 2 and 3, but the “best” teams end up winning much more frequently later in the tournament either way.

The heatmap also provides some interesting insights. In earlier rounds, it is almost purely linear, indicating that nearly every seed has beaten its opponent at some point, but with lower-numbered (better) seeds dominating. Interestingly, our dataset includes one of the only two cases in which a 16 seed beat a 1 seed, which occurred for the first time in 2018. When all years and only round 1 are selected, this shows up as a black square on the heatmap (very “cold,” since it has happened only once in the dataset). In later rounds, the distribution is more skewed toward better-seeded teams, consistent with the trends described in the previous paragraph.

Overall, our investigation of the dataset suggests that seeding is far from perfect: there are clear outliers, but team performance generally regresses to the mean. This is why it is so difficult to predict a perfect bracket, yet fairly safe to bet that the eventual winners will, in aggregate, have very low seed numbers.