# Netflix and high school algebra: a short story about methodological domination

Featured image by Daniel Friedman

This post is part of a series on artificial intelligence and computational methods in education. The first post can be found here.

In his book ‘Machine Learners’, the sociologist Adrian Mackenzie attends to what he calls the ‘broadest claim’ associated with machine learning (ML), summarised by the following equation:

N = ∀X

In the equation, N refers to the number of observations carried out for any given purpose, that is, the size of the dataset. The key, distinctive element of the equation is the logical quantifier ∀: ‘all’. Unlike ‘traditional’ statistical methods, machine learning does not deal with samples taken from populations, but with entire populations. It does not want some of the data, but all of it. The final term, X, is also interesting. It refers to the process of vectorisation, that is, the mathematical process of abstractly representing (and then computing) relations between data points as sets of entities in a geometric space defined by two or more dimensions or coordinates. Taken together, the equation represents an enticing promise: give me all the data, says machine learning, and I will vectorize it (X) to find new patterns. The glaring absence of Y from the equation is almost like a show of overconfidence. In traditional regression models, Y is the outcome variable, which in machine learning may or may not be known. This is possible because ML can operate either in a ‘supervised’ manner, when parameters and labels frame the possible outputs of the computations (i.e. we know in advance how to classify what is likely to surface), or in an ‘unsupervised’ manner, when pre-defined labels are not available and the process unfolds in an emergent fashion, identifying unexpected and yet-to-be-classified patterns.
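To make the distinction concrete, here is a minimal Python sketch of vectorisation followed by a supervised and an unsupervised pass over the same data. Everything in it is an assumption made for illustration: the four toy records, the two dimensions (`hours_on_task`, `hints_requested`), the labels and the helper names are hypothetical and do not come from Mackenzie's book or from any real dataset.

```python
import math

# Hypothetical toy records, vectorised as points in a two-dimensional space:
# (hours_on_task, hints_requested) for each student.
records = {
    "s1": (1.0, 9.0),
    "s2": (1.5, 8.0),
    "s3": (6.0, 1.0),
    "s4": (7.0, 2.0),
}

# Labels exist only in the supervised case (the "Y" missing from N = ∀X).
labels = {"s1": "struggling", "s2": "struggling", "s3": "fluent", "s4": "fluent"}


def distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_label(point, vectors, known_labels):
    """Supervised: give a new point the label of its nearest labelled neighbour."""
    closest = min(vectors, key=lambda name: distance(point, vectors[name]))
    return known_labels[closest]


def two_means(points, iterations=10):
    """Unsupervised: split points into two groups without using any labels."""
    centroids = [points[0], points[-1]]  # naive initialisation, fine for a sketch
    assignment = []
    for _ in range(iterations):
        # Assign every point to its closest centroid.
        assignment = [
            min((0, 1), key=lambda c: distance(p, centroids[c])) for p in points
        ]
        # Move each centroid to the mean of the points assigned to it.
        for c in (0, 1):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment


print(nearest_label((2.0, 7.0), records, labels))
print(two_means(list(records.values())))
```

In the supervised call the categories are fixed in advance by `labels`; in the unsupervised call the grouping emerges from the geometry alone, and it is left to the analyst to decide what, if anything, the two groups mean.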

This promise is not about data reduction, but epistemic transformation. New knowledge is produced through a probabilistic process of discovery that reassembles and queries the data.

Indeed, it is the same promise that underpins the entire ‘big data’ project: transformative insights become possible when entire populations (not samples) are subjected to computational analysis. The relationship between this promise and the goals of artificial intelligence is complex, to the point that Mackenzie questions the connection between ML and automation, arguing that this technique is, in essence, about expanding the horizon of calculability: transforming hitherto intractable problems by making them amenable to computation and prediction. Automation comes into play only when ML becomes part of a socio-technical system in which said calculations and predictions shape, or even bypass, human judgement and decision-making. An ML-based diagnostic system in general medical practice is an example. An adaptive software-based teaching agent is, of course, another. The key point is that machine learning is not necessarily about robots, even though ML algorithms often find their way inside AI systems. The concern of machine learning for big data is also ad hoc and highly instrumental: any data will do (increasingly even smaller datasets), as long as it is a population. Indeed, ML thrives on population data that are hard to represent through traditional row/column tables, but ‘whose open margins afford many formulations of similarity and difference’ (Mackenzie: 58): biometric data, geolocation data, textual corpus data – any kind of data from any imaginable context is fair game and can be used to refine the transformative calculations of statistical learning.

This rather cold, utilitarian interest in whole datasets has consequences, as important criteria of data quality are at constant risk of being ignored or glossed over. To explain this claim, the relationship between populations and samples must be examined more closely. This relationship (which lies at the heart of traditional statistics) is not without its problems. Mackenzie correctly highlights its power-laden, biopolitical nature, as the notion of a normally distributed statistical population has been used, since its inception, to govern people based on their bodily and existential characteristics: age, health, mortality, wealth, education and so forth. This process is ultimately about ‘subjectification’, as individuals are classified according to where they sit in a bell-shaped probability distribution. Their place in various ‘normal’ populations is therefore linked to social and personal positioning and, equally, to strategies of governance: being at the left tail of a high-stakes educational testing distribution has consequences for personal identity, and may also affect educational trajectories and futures.

Despite these critical issues, moving from populations to samples (and vice versa) is seldom only a procedural step in empirical research: something we must do to get around the fact that, for practical or ethical reasons, we cannot study entire populations. As is often the case, limitations and barriers may lead to valuable forms of knowledge and, ultimately, to theorisation. Indeed, moving from populations to samples entails several moments of epistemic creation, as the researcher engages with issues of randomisation and representativeness. By extracting samples from populations, we learn something about the complexity of the latter, as we tentatively distinguish between their empirically tractable dimensions and their elements of (temporary) unknowability. Dealing with entire datasets short-circuits this process, as we are led to believe that we no longer need to engage carefully with the multidimensionality and relative unknowability of populations. It is all about the method, because we already have ‘all the data’ after all. To be fair, similar critiques are being articulated in some of the more thoughtful and theoretically oriented circles of Learning Analytics, but the encroachment of corporate interests and computational hegemony on education risks silencing these voices.

The application of computational methods to educational data illustrates this point.

The Association for Computing Machinery (ACM) is the world’s largest learned society for computing. Several Special Interest Groups operate within it, including the Knowledge Discovery and Data Mining SIG (SIGKDD). Every year, SIGKDD organizes the KDD Cup, where teams of data scientists from the private sector and/or top-ranking university departments vie with each other to develop, recombine and test computational methods applied to a dizzyingly broad range of data: measures of research impact, advertising click-through, breast cancer data, urban pollution data, etc. The variety is mind-boggling and reflects an established trend in the Data Science field. Year after year, the goals are largely the same, informed by dominant, cross-sector big data concerns: recommendation, prediction, personalisation, content filtering and so forth. The 2010 iteration is of particular interest, as it is the only one that focused on educational data. A very similar competition was held more recently, in 2017, but was organised by a different group: the International Educational Data Mining Society.

The 2010 KDD Cup challenge asked participants to predict student performance on mathematical problems from logs of student interaction with an Intelligent Tutoring System (ITS). Participants developed ‘learning models’ (algorithms), which were first fitted to a training dataset. Once trained, the models were used to predict student performance in a test dataset.
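The train-then-predict workflow described above can be sketched in a few lines of Python. This is a deliberately naive baseline under stated assumptions, not a reconstruction of any competitor's entry: the log rows, the field layout `(student_id, problem_id, correct_first_attempt)`, the student names, and the choice of root mean squared error as the score are all hypothetical choices made for illustration.

```python
import math
from collections import defaultdict

# Hypothetical tutoring-system logs: (student_id, problem_id, correct_first_attempt).
train = [
    ("anna", "p1", 1), ("anna", "p2", 1), ("anna", "p3", 0), ("anna", "p4", 1),
    ("ben", "p1", 0), ("ben", "p2", 0), ("ben", "p3", 1), ("ben", "p4", 0),
]
test = [("anna", "p5", 1), ("ben", "p5", 0), ("cara", "p1", 1)]


def fit(rows):
    """'Train' the model: learn each student's success rate from the training logs."""
    totals = defaultdict(lambda: [0, 0])  # student -> [correct answers, attempts]
    for student, _problem, correct in rows:
        totals[student][0] += correct
        totals[student][1] += 1
    global_rate = sum(row[2] for row in rows) / len(rows)
    student_rate = {s: c / n for s, (c, n) in totals.items()}
    return global_rate, student_rate


def predict(model, student):
    """Predicted probability of a correct first attempt for this student."""
    global_rate, student_rate = model
    # Students never seen in training fall back to the overall success rate.
    return student_rate.get(student, global_rate)


def rmse(model, rows):
    """Root mean squared error between predictions and observed outcomes."""
    errors = [(predict(model, s) - y) ** 2 for s, _p, y in rows]
    return math.sqrt(sum(errors) / len(errors))


model = fit(train)
print(predict(model, "anna"), predict(model, "cara"))
print(round(rmse(model, test), 4))
```

Note that the sketch treats students and problems as opaque identifiers: nothing in it requires knowing what the problems were about or how they were taught.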

We can interpret the 2010 KDD competition in two ways. The first, most obvious reading is as a showcase of computational methods applied to educational data, which illustrates the predictive effectiveness and the transferability of a collection of techniques across domains. Even in this rather superficial interpretation, one cannot help but be struck by the level of computational diversity on display. The various entries to the competition provide an insight into a vibrant process, simultaneously generative and recombinatorial, where individual methods are selected, tweaked, reassembled and then unleashed on educational data. Each method, on its own, represents a different approach to analysis and probabilistic prediction: k-nearest neighbours (KNN), linear regression, logistic regression, neural networks, random forests, gradient boosting. For those looking for further detail, several publications resulted from the competition.

A second, more sociologically interesting, reading of the competition is as a moment of ‘translation’ that helped to ‘create convergences and homologies by relating things that were previously different’ (Callon 1981, p. 211). As such, the event acquires historical relevance, as it occurred at a moment when the Educational Data Mining and the fledgling Learning Analytics fields were beginning to gain significant traction. The competition provided a stage where the data science community confronted a typical, if rather narrow, educational problem: performance in mathematics. This was made possible by the fact that competing teams were dealing with a population dataset, and no familiarity with educational theory or practice was needed to engage with it. The only function of these data was as tabulated values ripe for some hard-core computational treatment. As such, the 2010 KDD Cup and its narrow ITS datasets allowed several actors (computational methods and data scientists) to transition into the education domain, creating new networked trajectories that warrant further scrutiny and analysis.

Here is an example of these networked trajectories: the third prize in the 2010 KDD Cup was awarded to a team of Austrian data scientists called BigChaos@KDD, whose members came from a private analytics company, Commendo Consulting and Research. These individuals were (and still are) highly rated experts within the data science community, having won a great many data mining challenges and, most notably, having reached leading positions in the \$1 million Netflix Collaborative Filtering Algorithm prize. The Netflix competition ran from 2006 to 2009; a planned sequel was officially cancelled in 2010 in response to a lawsuit that followed some serious privacy concerns. Commendo Consulting was later (in 2012) bought by another analytics company called Opera Solutions, where the other winning team who worked on the Netflix data was employed, thus creating a ‘predictive analytics powerhouse’.

In the space of a paragraph, a picture begins to emerge in which various dynamics, intersecting contexts and actors are observable:

1. The data science industry, with its emphasis on methodological proficiency and analytic power, where reputations and careers are built on the capacity to subject radically different forms of (population) data to comparable and scalable forms of computational analysis.
2. A thin algorithmic line connecting corporate interests in predictive analytics and computational methods in education. Netflix and secondary-level algebra have something in common after all.
3. A discourse and a praxis of data as challenge: a tendency to pursue methodological domination over data points whereby data (any kind of data, as long as it’s ‘all of it’) can be tamed, harnessed and vectorized.

This highlights a big problem: the current epistemic (and of course economic) vibrancy in the data science field is predicated on the establishment of several degrees of separation between methods and data. Perhaps understandably, given the backgrounds and inclinations of data analysts, computer scientists, mathematicians and statisticians, there seems to be little interest in the actual data, only in the process of methodological abstraction enabled by computational analysis. In his examination of machine learning, Adrian Mackenzie tells the fascinating story of a dataset containing clinical measurements of men’s prostates organised (transformed) in a scatterplot matrix. This created an unconventional ‘tabular space’ in which ‘relational contrasts between pairs of variables started to appear’ (p. 61). The most interesting part of the story is not the (slightly enthralled) account of the vectorization process, in which novel patterns in the data become visible, but a detail relegated to a footnote, where the reader is reminded that the entire dataset was possibly based on flawed assumptions about the relevance of PSA (Prostate Specific Antigen) as a biomarker for prostate cancer.