Does the debate between machine learning and statistics make sense?
Posted by
Md Ashikquer Rahman
Whether there is a clear distinction between statistics and machine learning has long been a focus of academic debate.
Some scholars believe that machine learning is merely statistics dressed up in a glamorous coat. Others argue that work involving logistic regression or generalized linear models (GLMs) can be called machine learning, while anything else cannot.
Still others hold that whether a meta-analysis is performed could serve as the criterion for distinguishing the two fields.
But does it really make sense to argue about the boundary between the two? If we think about this issue seriously, perhaps we will find that the answer is no.
Dr. Sam Finlayson of the Massachusetts Institute of Technology has pointed out that past debates between machine learning and statistics have largely missed the point, because they either ignored the historical background or were ambiguous about what "regression methods" actually means. In that light, the argument itself makes little sense.
1. Ignoring the historical background: the term "machine learning" was not coined to set itself apart from statistics
[Photo: group photo from the Dartmouth Conference]
For thousands of years, researchers have dreamed of building "intelligent" devices, but the term "artificial intelligence" did not appear until 1956, when John McCarthy proposed it at the Dartmouth Conference and defined it as the science and engineering of making intelligent machines.
The term has remained in use and popular ever since.
McCarthy was able to persuade the conference participants to adopt it in large part because the definition itself was so vague.
In that era, scientists devoted to the study of "intelligence" had not yet adopted a data-driven perspective; they focused instead on automata theory, formal logic, and cybernetics.
In other words, McCarthy wanted a term that could accommodate all of these paradigms rather than favoring any particular method.
It was in this context that Arthur Samuel (one of the attendees of the Dartmouth Conference) coined the term "machine learning" in 1959, defining it as the field of study that gives computers the ability to learn without being explicitly programmed.
Samuel and his colleagues chose this definition because they hoped to make computers more "intelligent" by giving them the ability to recognize patterns and to keep improving that ability over time.
From today's point of view, this research method seems familiar, but it took decades for pioneers to make it the dominant paradigm for AI research.
Judging from the intentions of researchers at the time, "machine learning" was coined to describe the process of designing computers that use statistical methods to improve their performance. That is, the term was meant to contrast with non-data-driven approaches to building intelligent machines, not to contrast with statistics.
After all, statistics focuses on using data-driven methods to provide humans with useful information.
Another widely recognized definition of machine learning comes from Tom M. Mitchell's 1997 textbook, in which he writes that the field of machine learning is concerned with how to make computer programs improve automatically through experience.
The book also offers a semi-formal definition: a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.
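As a concrete illustration of the T/P/E framing (my own minimal sketch, not something from the article), one can watch a simple classifier's held-out accuracy (the measure P) on a digit-classification task (T) improve as it is given more training examples (the experience E); the scikit-learn dataset and model below are just convenient assumptions.

```python
# A minimal sketch of Mitchell's T/P/E framing (illustrative only):
#   task T       = classifying handwritten digits,
#   measure P    = held-out accuracy,
#   experience E = the number of labeled training examples seen.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for n in (50, 200, 800):  # growing "experience" E
    model = LogisticRegression(max_iter=5000)
    model.fit(X_train[:n], y_train[:n])
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"trained on {n:4d} examples -> accuracy {acc:.3f}")
```

In this framing, nothing about the program's "learning" requires exotic machinery; the performance measure simply improves as experience accumulates.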
2. The debate over who "owns" regression misses the point
Many people today try to draw a hard line between statistical methods and machine learning methods using a simple dichotomy, but this is plainly arbitrary.
Some are particularly insistent that regression-based methods belong exclusively to statistics and can never, under any circumstances, be called machine learning.
This view is even sillier than the claim that "logistic regression equals econometrics," and both have provoked fierce debate.
The machine learning community has spent sixty years working to make computers "better," and does not much care whether the method that gets it there is a fancy new algorithm or plain old statistics.
This is why most professors devote a great deal of time to generalized linear models and their variants when teaching machine learning courses.
Statistics is therefore deeply meaningful in the context of machine learning and artificial intelligence. The term "machine learning" covers a range of methods devoted to making programs intelligent. Frankly, no statistician of any standing would assert that statistical methods divorced from their actual research context are useful.
The dispute over who owns regression methods actually sells both machine learning and statistics short. The reasons can be roughly summarized in the following four points:
1. It confines the core role that classic statistical methods can play in the construction of computer programs;
2. It ignores the influence of machine learning on statistics. In fact, artificial intelligence and computer science have greatly contributed to a revival of statistics; Judea Pearl's work on causality, for example, opened up a new statistical paradigm;
3. The "hard" dichotomy between statistics and machine learning obscures important information about modeling decisions, and the classification is sometimes meaningless;
4. Most of today's top researchers in machine learning and statistics belong to both fields at once.
In fact, much current work highlights the rich interaction between statisticians and machine learning researchers. Well-known scholars such as Rob Tibshirani and Trevor Hastie, for example, did not agonize over methodological boundaries; instead they drew on tools developed by machine learning researchers to advance research in statistics. The point is not simply that Hastie and Tibshirani invented new methods, but that those methods have shaped the daily work of statisticians and machine learning researchers alike.
3. Many "arguments" are doomed to failure before they begin
The two fields' different goals lead to differences in methods and culture, which is why the meaning of the term "machine learning" has shifted so much since its inception.
This disconnect in language dooms many "arguments" before they even begin.
As mentioned above, the research field of machine learning was created because computer scientists tried to create and understand intelligent computer systems, which is still the case today.
The main applications of machine learning include speech recognition, computer vision, robotics and autonomous systems, computational advertising, surveillance, chatbots, and more. When trying to solve these problems, machine learning researchers usually begin with classic statistical methods, such as the relatively simple generalized linear model (GLM).
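To make "starting with a classic statistical method" concrete, here is a minimal sketch, entirely my own illustration rather than anything from the article, of fitting a GLM (a Poisson regression) with statsmodels; the synthetic data and variable names are invented for the example.

```python
# A minimal, hypothetical sketch of fitting a classic GLM (Poisson regression)
# with statsmodels; the synthetic data and column names are made up here.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({"exposure": rng.uniform(0, 2, n),
                   "age": rng.normal(40, 10, n)})
# Simulate counts whose log-mean is linear in the predictors.
lam = np.exp(0.5 + 0.8 * df["exposure"] - 0.01 * df["age"])
df["events"] = rng.poisson(lam)

X = sm.add_constant(df[["exposure", "age"]])
glm = sm.GLM(df["events"], X, family=sm.families.Poisson()).fit()
print(glm.summary())
```

Whether one calls this "statistics" or "machine learning" changes nothing about the model being fit.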
Of course, over the years, computer scientists have also constantly proposed new methods to make machine learning tools increasingly powerful.
Like evolution in any other context, the evolution of the statistical methods used in machine learning has been shaped by the pressures of "natural selection." Compared with statisticians, machine learning researchers tend to pay less attention to understanding exactly what an algorithm is doing behind the scenes, even though that understanding matters, and matters more and more.
What they usually care about most is reducing model error. As a result, the methods developed by machine learning researchers tend to be more flexible, even trading interpretability for that extra flexibility. This divergent evolution makes it easy to blur any line between machine learning and statistics that is drawn purely on the basis of methods.
In addition, many statisticians are unfamiliar with the history of machine learning, so it is not surprising that they are keen to define the field using whatever terminology suits them, even when doing so is unnecessary. For the same reason, a strict division based on "use" has become very murky. Many machine learning practitioners today, even when they are merely applying machine learning methods to pure data analysis rather than building computer programs, will still claim that they are doing machine learning.
Although that claim is not accurate in a strict historical sense, I see little point in blaming anyone for it; it is probably some mix of habit, cultural background, and "thinking it sounds cool."
So in practice, when people use the term "machine learning," they often mean things quite different from machine learning in its original sense. They may mean: "I am using statistical methods to make the programs I design learn," or "I am building data analyses that can be deployed in automated systems."
Or they may mean: "I am using a method originally developed by the machine learning community, such as random forests, for statistical data analysis." More loosely still, they may simply mean: "I am a machine learning researcher working with data, so I will call whatever I do machine learning."
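As a small illustration of that second usage (again my own sketch on an arbitrary built-in dataset, not an example from the article), one can take random forests, a method from the machine learning community, and use them for an ordinary data-analysis chore such as ranking which predictors seem to matter:

```python
# A minimal, illustrative sketch: using a random forest (an ML-community method)
# for an ordinary statistical data-analysis task -- here, ranking predictors
# by importance on a built-in dataset. Purely an example, not from the article.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(data.data, data.target)

# Report the five most "important" features according to the forest.
ranked = sorted(zip(data.feature_names, forest.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked[:5]:
    print(f"{name:25s} {score:.3f}")
```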
In fact, these different usages are neither surprising nor problematic in themselves; they are simply the result of how language evolves. What is genuinely funny, however, is when yet another group of people, data scientists, gather to argue over whether a particular project should be labeled "purely machine learning" or "purely statistics," as if one must choose exactly one of the two.
In my view, the term "data scientist" itself arose at the intersection of machine learning and statistics. When these arguments break out, everyone joins in with different, vaguely defined assumptions about what the words mean in the first place. They then spend hardly any time understanding where the terms came from or listening to what the other side actually wants to say; they simply shout at each other from a distance, loudly but not clearly.
4. This whole "argument" is almost a waste of time
Now let us put the real problems on the table. Many machine learning researchers (or at least machine learning enthusiasts) today still have an insufficient grasp of statistics. Some of these people are indeed machine learning researchers, but there are also many professional statisticians who at times regard themselves as machine learning researchers.
The more serious reality is that machine learning research has moved so fast, and has so often been culturally disconnected from statistics, that it is quite common for even excellent machine learning researchers to "rediscover" or "reinvent" parts of statistics.
That is both a problem and a waste! Finally, a large number of applied researchers in other fields like to use the term "machine learning" to make their papers sound more fashionable, even when what they call "machine learning" neither builds an automated system nor uses methods proposed by the machine learning community.
I think the remedy for all of these problems is to make people more aware that most of the data methods used in machine learning already exist within statistics. Whether those methods are used for data analysis or for designing intelligent systems, the first task is to cultivate a deep understanding of statistical principles, rather than obsessing over exactly where the line between machine learning and statistics falls.
Endless debate over whether a given piece of work is machine learning or statistics ultimately just distracts people from the conversation that matters more: how to match the problem with the right tools and do the job well. At the same time, an opinionated and mistaken dichotomy between statistical and machine learning methods pushes many researchers toward needlessly complex methods, simply so they can feel that they are doing "real machine learning."
It also leads directly to people indiscriminately labeling their work as machine learning just to make it sound more methodologically fashionable.
The golden age of statistical computing is pushing machine learning and statistics closer together than ever. Machine learning research was, of course, born within computer science, and contemporary statisticians increasingly rely on algorithms and software stacks that the computer science community has built over decades. They are also increasingly finding methods proposed by machine learning researchers useful, such as high-dimensional regression, a trend especially visible in computational biology.
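For readers unfamiliar with the term, "high-dimensional regression" usually refers to settings with far more predictors than observations, where penalized methods such as the lasso remain usable; the sketch below is my own illustration on synthetic data, not something drawn from the article.

```python
# A minimal, illustrative sketch of high-dimensional regression (p >> n)
# using the lasso; the synthetic data here are made up for the example.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p = 100, 1000                         # far more predictors than samples
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = [3.0, -2.0, 1.5, -1.0, 0.5]   # only 5 truly relevant predictors
y = X @ beta + rng.normal(scale=0.5, size=n)

lasso = LassoCV(cv=5).fit(X, y)          # cross-validated penalty selection
selected = np.flatnonzero(lasso.coef_)
print(f"lasso kept {selected.size} of {p} predictors:", selected[:10])
```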
On the other hand, the machine learning community is paying more and more attention to topics such as interpretability, fairness, and verifiable robustness, which has led many researchers to prioritize making machine learning outputs line up more directly with traditional statistical quantities. At the very least, even when deploying systems with the most complex architectures imaginable, people generally recognize the need to use classic statistics to measure and evaluate the performance of machine learning models.
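A tiny illustration of that last point (my own sketch on synthetic data, not from the article): even a black-box classifier's test accuracy can, and arguably should, be reported with a classical confidence interval.

```python
# A minimal, illustrative sketch: reporting a black-box classifier's test
# accuracy with a classical 95% normal-approximation confidence interval.
# The synthetic dataset and model choice are assumptions made for the example.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
correct = model.predict(X_te) == y_te
p_hat, n = correct.mean(), correct.size
se = np.sqrt(p_hat * (1 - p_hat) / n)    # binomial standard error of accuracy
print(f"accuracy = {p_hat:.3f} +/- {1.96 * se:.3f}  (95% CI, normal approx.)")
```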
5. Summary
All in all, the academic debate between machine learning and statistics is misguided: the relevant terms are heavily overloaded, and the methodological dichotomy is simply not correct. Machine learning researchers are paying more and more attention to statistics, while statisticians are increasingly dependent on the computer science and machine learning communities.
There is no conspiracy to appropriate regression, and no conspiracy to merge the two fields.
There is a lot of hype now, but the fact that cannot be changed is that when other people use different terms than you, it is because they come from a different background and have different goals, not because they are dishonest or stupid.