Visual Baseball: The Use of Word Clouds

If you saw my recent post on the payroll disparity between the Yankees and Twins, you’ll recall that I used a visual approach called “word clouds.” I’ve used this in other situations, such as comparing base stealing between the Angels and Red Sox (back in the ALDS). A few of you noted that in the Twins-Yankees payroll visual the length of each name distorted what was being communicated. This is a point well taken and applies here, since Ellsbury’s name is longer than names like Hunter or Abreu. Still, I’m hoping that there’s insight revealed from this visual that might not be so easy to spot in looking at a list of numbers. Or, at least, more fun.

image


Kevin Dame is a writer and visual designer who brings sports information to life in new and meaningful ways. Visit his website and follow him on Twitter @kevintdame.
12 Comments
Detroit Michael
14 years ago

Isn’t there another problem? For example, Ellsbury stole 70 bases and Nick Green stole 1 base. It looks to me like Ellsbury’s name is in a 70 point font and Green’s name is in a 1 point font, so small that I can’t read it. However, a printed name is a two-dimensional image, so even if their names were the same length, Ellsbury’s name would be 70 × 70 times larger than Green’s name.

Julian
14 years ago

Christopher: I think the jumbled presentation is kind of the point. Kevin is working for a balance between visual interest and data presentation.

Detroit Michael: Similarly, the precise numbers aren’t really the point. The point (to me anyway) is to visually represent the wide gap between Ellsbury’s steals total and pretty much everyone else on the Sox (plus the more even distribution of the Angels’ steals and the various other differences, etc.), not to nail the exact relationships between all players shown.

Jacob Rothberg
14 years ago

These are great features, and I think they really hold true to the purpose of this site, which, I feel, is to present interesting ideas and information about baseball in easily comprehensible ways. I think everybody complaining about the length of names or other details is missing the forest for the trees here. Keep up the good work, I look forward to seeing what other tricks you have up your sleeve.

Christopher Taylor
14 years ago

Instead of size (which has the confound of name length, as you point out), it might be better to use something like contrast or saturation. For example, the greater the number of bases stolen, the darker the grey (or the more saturated the red) in which the name would appear in a list. The jumbled presentation in the graphic obfuscates rather than clarifies the information presented.

Detroit Michael
14 years ago

Maybe I didn’t make my point clearly, so let me try a different comparison. Let’s compare Ellsbury (70 SB) to Youkilis (7 SB). Their names are the same length. Ellsbury’s name should be 10 times larger than Youkilis’ name. Instead it looks to me like “Ellsbury” is 10 times as tall and 10 times as wide, making it 100 times as big as “Youkilis.”

The effect is that the visual representation far overstates the relative contributions of Ellsbury to the Red Sox’ stolen base output (or A-Rod’s salary to the Yankees’ payroll to use an earlier example).  This seems like a much bigger effect than worrying about the number of letters in the player’s name.
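The arithmetic behind this objection can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two names are the same length, a name's printed area grows with the square of its font size, and the function names (`height_scaled`, `area_scaled`, `area_ratio`) are hypothetical. It says nothing about how Wordle actually computes its sizes.

```python
import math

# Stolen base totals quoted in the comments above.
steals = {"Ellsbury": 70, "Youkilis": 7, "Green": 1}


def height_scaled(value, base=1.0):
    """Font size proportional to the value (hypothetical helper)."""
    return base * value


def area_scaled(value, base=1.0):
    """Font size proportional to the square root of the value,
    so that printed area is proportional to the value (hypothetical helper)."""
    return base * math.sqrt(value)


def area_ratio(size_a, size_b):
    """Ratio of printed areas for two equal-length names,
    assuming area grows with the square of the font size."""
    return (size_a / size_b) ** 2


for scale in (height_scaled, area_scaled):
    big = scale(steals["Ellsbury"])
    small = scale(steals["Youkilis"])
    print(scale.__name__, "font ratio:", big / small,
          "area ratio:", area_ratio(big, small))
```

With font size proportional to steals, the 10-to-1 gap between Ellsbury and Youkilis becomes a 100-to-1 gap in area. Scaling the font by the square root instead keeps the area gap at roughly 10 to 1, at the cost of a less dramatic picture.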

Kevin Dame
14 years ago

Hi there.  I think the point about distortion is a good one.  I’ve used a Google tool called Wordle to create these word clouds.  I think Wordle does not take into account the multiplying effect of the data (the 70×70 factor that someone brought up).  They probably don’t care as much about data integrity and are biasing towards powerful visuals.  But I think in our space (the world of baseball) data integrity is more important and should be more seriously considered.  In the future, I may choose to create these word clouds myself to avoid these distortions.  Of course the visuals will become less powerful.  What would you rather have – a perfectly accurate image or one that makes a clear point quickly and more powerfully?  I’m curious to hear what people think…

Detroit Michael
14 years ago

Excellent!  Sorry for the false accusation then.

Kevin Dame
14 years ago

Detroit Michael, I took a closer look and it turns out the word sizes are accurate. It may not look like it, but Ellsbury is 10 times the size of Youkilis. In other words, if you enlarge the word Youkilis to 1,000% of its size, the two are equal. But it’s good to ask the question “Are these accurate?” The answer is YES!

In any case, to echo what Julian wrote earlier, the goal here is to tell a story about how these two teams steal bases.  Clearly the Sox were much more dependent on Ellsbury for their speed, while the Angels had a more distributed running game.

Alex
14 years ago

I don’t see why having the relationships between the names be exponential rather than linear would destroy the “integrity” of the data.

Kevin Dame
14 years ago

I think the issue with a relationship being exponential rather than linear is that it results in exaggeration. If an infographic implies that A-Rod’s salary is 10 times that of Jorge Posada’s salary (which is inaccurate), I think I’ve failed in my goal of bringing greater insight to baseball. I think it’s important to maintain the integrity of what the data tells us, and if things become exaggerated it should be duly noted. Kind of like how many TV ads write “dramatization” at the bottom of your TV screen as they show little scrubby bubble characters cleaning your bath tub.

kardo
14 years ago

I personally think our perception is linear, not area based. Ask most people on the street to write something twice as big, and they will make it twice as long and twice as high. Correcting it for area will confuse people more than it helps.

Also Kevin, I don’t think Wordle is a Google tool. It’s a Java applet that runs on Google infrastructure. Whatever that means. If those two are the same then ignore my comment.

/whinemode.

Kevin Dame
14 years ago

This is a great discussion. You’re making me ask myself “How do I best visualize relative sizes?” Do I do it via measurement (i.e., the visual of x is exactly double the visual of y)? Or do I do whatever causes the viewer to think “X is double Y”? I tend to lean towards the latter (your point, I think), but I wonder if that then becomes a subjective thing. Would be a cool thing to study.