Can HP and Kaggle’s Data Scientists Decipher the Perfect March Madness Bracket?


In the quest to fill out the perfect March Madness bracket, it quickly becomes apparent that such a scenario is more of a pipe dream than anything else. Still, the joy of entering a pool in the hopes of beating everyone else in the group is something anyone can relate to, including data scientists.

Although the rough odds of completing an unblemished bracket are put at about one in 18 trillion, that doesn’t dissuade big data analysts from dabbling in algorithms of their own. Writing a model in code to compete against other people’s models constitutes an advanced level of play for the Big Dance, and arguably a more fun one, too.
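How long the odds are depends on how reliably each individual game can be called. As a rough sketch only, the snippet below assumes a 63-game bracket (a 64-team field, ignoring play-in games) and treats every pick as independent with the same accuracy. The function name and the sample accuracies are illustrative assumptions, not figures from HP or Kaggle.

```python
from fractions import Fraction

GAMES_IN_BRACKET = 63  # 64-team single-elimination field; play-in games ignored


def perfect_bracket_odds(per_game_accuracy: Fraction) -> Fraction:
    """Chance of calling every game, assuming independent, equally hard picks."""
    return per_game_accuracy ** GAMES_IN_BRACKET


# Pure coin flips: the chance is exactly 1 in 2**63.
coin_flip = perfect_bracket_odds(Fraction(1, 2))
print(f"coin flips: 1 in {coin_flip.denominator:,}")

# A hypothetical forecaster who calls two out of three games correctly.
better = perfect_bracket_odds(Fraction(2, 3))
print(f"two-thirds accuracy: roughly 1 in {float(1 / better):.3g}")
```

Even a modest gain in per-game accuracy shortens the odds by many orders of magnitude, which is why published estimates of the "perfect bracket" odds vary so widely.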

The growing amount of data that has been and can be collected over the course of the NCAA season, along with all of the previous tournament results, provides a glimmer of an opportunity to boost their odds. Everything from team statistics, seeding, and geographical placement to social media is fair game. The challenge lies in deciding how much weight, qualitative or quantitative, each of these many factors deserves in the combined model. The field stacks up unlike any other.

Accordingly, for a second consecutive year, Kaggle, the world’s largest community of data scientists, is hosting its March Machine Learning Mania competition. The contest is sponsored by HP, a partner that was looking to grow its predictive tool set and place it in front of a bevy of data scientists.

More than 104 teams have entered the contest, with a $15,000 prize on the line. HP provides its HP Haven Big Data platform, which the data scientists can elect to use alongside their own machine learning and statistics skills. Up to three decades’ worth of historical games can be fed into their models, along with any external data of their choosing. The competition spurs boundless creativity among the participants in the kinds of datasets they use to build their models. These algorithms are pitted head-to-head to try to predict the winner of this year’s tournament.

Fundamentally, though, predicting a March Madness bracket can be treated as a data problem.

“Depending on utility, you can make anything into a data problem, as long as the data you are bringing in can help you. Basketball is interesting because the data doesn’t capture many of the nuances of the sport,” Will Cukierski, Kaggle’s Head of Competitions and Data Scientist, tells SportTechie.

“For example, it is hard for a computer to gauge how a small forward fares against a big guy in the paint. The goal of the contest is to have teams bring in as much data as they can and see how well it stacks up against expert predictions,” he continued.

Which factors or segments of data prove key to execution and results thus depends entirely on the people competing and their personal preferences. Each team can examine the full gamut of options, from going back only a year in the statistics for primarily freshmen-laden teams to treating the coach as a major differentiator.

An example here would be looking at Duke as a perennially great basketball program. Analytically, its success could be attributed to Coach Mike Krzyzewski and his ability to recruit talented players. A range of other factors can yield further insight into a particular matchup: past battles between rivals, a team that plays zone defense versus one that plays man-to-man, or a team’s collective experience in recent tournaments all generate different intel within big data constructs.

In essence, modeling a March Madness bracket is akin to how people test financial models: by delving into how teams have handled past situations.

Courtesy of HP, these data scientists have 50-plus REST APIs from the HP IDOL OnDemand platform at their disposal. The platform serves as a starting point for augmenting their datasets. They can pull trending topics and identify entities from its news dataset, on top of analyzing public sentiment about teams and players using social media data. Participants are encouraged to quantify as many data points as possible, a practice that spreads among community members and increases competitiveness.

HP’s Head of Developer Programs, Sean Hughes, informs SportTechie that the HP Haven Big Data platform available to these data scientists comprises two components, the aforementioned HP IDOL and HP Vertica, which, when used together, offer a broad set of capabilities for harnessing 100 percent of their data. Those who work with the platform gain the ability to analyze structured and unstructured data faster than with traditional methods, and can instantly factor real-time news influences into their forecasts.

When building out a model, though, the process can be as complex or as simple as desired. The former would entail leveraging the latest neural nets, while the latter can be drawn from a lone Excel spreadsheet. A simple model, in this case, would predict the outcome from the difference in seeds between two schools. Assigning a top seed a 100 percent probability of beating a 16th seed, or treating an 8th seed versus 7th seed matchup as a 50 percent proposition, are examples of such simple models. These kinds of models likely won’t trump the best algorithms, but a user shouldn’t bet against them performing quite well either.
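A seed-difference model of the kind described above fits in a few lines. The sketch below assumes a straight-line mapping from seed gap to win probability, scaled so that a 1 seed facing a 16 seed comes out at 100 percent and evenly seeded teams at 50 percent. The function name and the linear form are illustrative assumptions, not the method used by any contestant.

```python
def seed_win_probability(seed_a: int, seed_b: int, max_gap: int = 15) -> float:
    """Probability that team A beats team B, judged from seeds alone.

    Assumes a linear relationship: each step of seed advantage adds the
    same amount of win probability. max_gap=15 is the 1-vs-16 gap.
    """
    gap = seed_b - seed_a  # positive when A is the better (lower-numbered) seed
    p = 0.5 + 0.5 * gap / max_gap
    return min(1.0, max(0.0, p))  # clamp to a valid probability


for a, b in [(1, 16), (7, 8), (8, 8), (12, 5)]:
    print(f"{a} seed vs {b} seed: {seed_win_probability(a, b):.2f}")
```

Under this assumed mapping, a 7 seed is only a slight favorite over an 8 seed, close to the coin-flip proposition mentioned above. A contestant would normally fit the slope to historical results rather than fix it by hand.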

There are challenges, though, for the data scientists who get too creative with their models.

Cukierski believes that “a certain idea could lead you down a path that is hard to execute on.”

“It is possible to record stats down to such a specific target, but then the act of trying to digest all the data–not just quantify–could be too difficult. Tracking the ball and all player actions could form the basis of an entire PhD thesis,” continued Cukierski.

In fact, there’s a host of potential predictive factors that would compound matters, including weather and distance traveled by schools. Participants can “waste” a lot of time attempting to incorporate data points that do not always translate to superior predictions, something these data scientists should be cognizant of if they really want to win the contest.

Public sentiment, too, can be cast as virtually inconsequential, or as noisy data. While signals such as the frequency of hashtags can be considered, the elementary framework of the games, especially team-wide statistics, is more likely to contribute to a viable model.

With that in mind, it remains unclear to what extent machine learning really enhances March Madness bracket predictions.

Cukierski explains: “It is hard to say how well machine learning has improved forecasts prior to Kaggle; allow people to predict before the beginning of the tournament–make a prediction for every single game that could ever occur in the tournament. However, last year we had around ten teams who beat Vegas odds, which are considered to be state-of-the-art.”

“So there is something there.”
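Forecasting "every single game that could ever occur" means attaching a win probability to each possible pairing of teams, then grading only the games that are actually played. The sketch below enumerates the pairings for a hypothetical 68-team field and grades a forecast with log loss. Log loss is an assumed scoring rule here, chosen because it is a common way to grade probability forecasts; the team ids and sample results are made up for illustration.

```python
import math
from itertools import combinations


def all_matchups(team_ids):
    """Every unordered pair of teams, i.e. every game that could occur."""
    return list(combinations(sorted(team_ids), 2))


def log_loss(predictions, outcomes, eps=1e-15):
    """Mean negative log-likelihood of the observed results (lower is better).

    predictions: dict mapping (team_a, team_b) -> forecast P(team_a wins)
    outcomes:    dict mapping (team_a, team_b) -> 1 if team_a won, else 0
    Only games that were actually played (the keys of `outcomes`) are scored.
    """
    total = 0.0
    for game, a_won in outcomes.items():
        p = min(1 - eps, max(eps, predictions[game]))  # avoid log(0)
        total += -math.log(p) if a_won else -math.log(1 - p)
    return total / len(outcomes)


teams = range(1, 69)  # 68 hypothetical team ids
pairs = all_matchups(teams)
print(len(pairs))  # number of possible matchups: 68 * 67 / 2

baseline = {pair: 0.5 for pair in pairs}  # "no idea" forecast for every game
played = {(1, 2): 1, (3, 4): 0}  # made-up results
print(log_loss(baseline, played))  # an all-50% forecast always scores ln(2)
```

A rule like this rewards confident forecasts that turn out right and punishes confident forecasts that turn out wrong, so a model has to be well calibrated and not merely pick winners.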

Still, plenty of people are producing predictions, which, statistically, means some of these teams are bound to get lucky. With that many entries, a few strong results would be expected by chance alone. Over such a short span of time, then, a good showing doesn’t necessarily mean these data scientists should be deemed experts in any fashion.
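The role of luck can be made concrete with a little arithmetic. The sketch below assumes every entrant is a pure coin-flipper whose picks are independent of everyone else's, which real entries are not, and asks how likely it is that at least one of them calls 40 or more of 63 games correctly. The 40-game threshold and the function names are illustrative assumptions.

```python
from math import comb


def chance_of_at_least(correct: int, games: int = 63, p: float = 0.5) -> float:
    """P(a forecaster with per-game accuracy p gets >= `correct` games right)."""
    return sum(
        comb(games, k) * p**k * (1 - p) ** (games - k)
        for k in range(correct, games + 1)
    )


def chance_someone_does(single: float, entrants: int) -> float:
    """P(at least one of `entrants` independent forecasters succeeds)."""
    return 1 - (1 - single) ** entrants


single = chance_of_at_least(40)
for n in (1, 104, 1000):
    print(n, round(chance_someone_does(single, n), 3))
```

The chance that somebody in the field posts an impressive-looking record grows quickly with the number of entrants, even when nobody has any skill. That is why one good tournament is weak evidence of expertise.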

In the end, the odds of forecasting a perfect bracket are as slim as it gets, and they are predicated as much on luck as on data science.

“For us, success goes back to helping people learn,” says Hughes of HP’s purpose in this endeavor, which aims to raise awareness of its platform among the developer and data scientist communities and to keep it as open and accessible as possible in order to showcase its ease of use and speed.

Meanwhile, Cukierski elaborates on the fascinating nexus of sports and analytics that drives a larger point, one worthy of STEM education, behind this entire March Madness bracket competition: “What I find really interesting is the fact that you can take data and turn it into a real prediction. And while this can be applied to a number of industries, the fact that it’s sports means more people are taking notice. We can say with 80 percent confidence that a team will win or lose–that is a real prediction. Many people don’t deal with stats if it doesn’t interest them, but sports data is a unique exception to this rule.”