Talking about Deep Learning: Core Concepts
Posted by
Md Ashikquer Rahman
Machine Learning
In machine learning, we (1) read the data, (2) train a model, and (3) use the model to make predictions on new data. Training can be seen as a step-by-step learning process in which the model is repeatedly shown new data. At each step, the model makes a prediction and receives feedback on its accuracy. The feedback takes the form of an error under some measure (such as the distance from the correct answer), which is then used to correct the prediction error.
Learning is an iterative process in parameter space: when you adjust the parameters to correct one prediction, the model may start getting previously correct predictions wrong. It takes many iterations before the model acquires good predictive ability. This "predict and correct" process continues until the model has no further room for improvement.
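To make this loop concrete, here is a minimal sketch of the predict-and-correct cycle described above, assuming NumPy and an illustrative one-parameter linear model (neither of which is prescribed by the text):

```python
import numpy as np

# Illustrative data with the true relationship y = 2x.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

w = 0.0                  # the model's single parameter, initially uninformed
learning_rate = 0.01

for step in range(200):
    prediction = w * x                        # (1) predict on the data
    error = prediction - y                    # (2) feedback: distance from the correct answer
    w -= learning_rate * (error * x).mean()   # (3) correct the parameter using the error

print(round(w, 3))       # approaches 2.0 after many iterations
```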
Feature Engineering
Feature engineering extracts useful patterns from data to make them easier for machine learning models to classify. For example, the amount of green or blue pixel area could be used as an indicator of whether a photo shows a land animal or an aquatic animal. Such a feature is very helpful to a machine learning model because it limits the number of categories that need to be considered.
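As a rough illustration of such a hand-crafted feature (the blue-dominance ratio, the random image, and the 0.5 threshold below are purely illustrative assumptions, not from the original), one might compute:

```python
import numpy as np

def blue_fraction(image):
    """image: HxWx3 array of RGB values in [0, 1].
    Returns the fraction of pixels dominated by their blue channel."""
    r, g, b = image[..., 0], image[..., 1], image[..., 2]
    blueish = (b > r) & (b > g)
    return blueish.mean()

image = np.random.rand(64, 64, 3)        # stand-in for a real photo
feature = blue_fraction(image)
is_probably_aquatic = feature > 0.5      # a simple hand-tuned decision rule
print(feature, is_probably_aquatic)
```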
In most prediction tasks, feature engineering is a necessary skill for achieving good results. However, because different data sets call for different feature engineering, it is hard to distill general rules rather than general experience, which makes feature engineering more of an art than a science. A feature that is extremely important in one data set may be useless in another (for example, if the next data set contains only plants). Precisely because feature engineering is so difficult, scientists have worked on algorithms that extract features automatically.
While feature extraction for many tasks (such as object recognition and speech recognition) can already be automated, feature engineering remains the most effective technique for complex tasks (such as most tasks in Kaggle machine learning competitions).
Feature Learning
Feature learning algorithms look for patterns shared within a class and extract them automatically for classification or regression. Feature learning is feature engineering done automatically by an algorithm. In deep learning, convolutional layers are exceptionally good at finding features in images and mapping them to the next layer, forming a hierarchy of nonlinear features that grow in complexity (for example: edges, circles -> nose, eyes, cheeks). The last layer uses all the generated features for classification or regression (the last layer of a convolutional network is essentially multinomial logistic regression).
Figure 1 shows features generated by a deep learning algorithm. It is rare for features to be this clear; most features are hard to interpret, especially in recurrent neural networks and LSTMs or very deep convolutional networks.
Deep Learning
In hierarchical feature learning, we extract several layers of nonlinear features and pass them to a classifier, which combines all the features to make a prediction. We deliberately stack these deep nonlinear features because with only a few layers, complex features cannot be learned. It can be shown mathematically that the best features a single-layer neural network can learn are edges and circles, because they carry the most information that a single nonlinear transformation can extract. To generate features that contain more information, we cannot operate on the inputs directly; instead, we keep transforming the first batch of features (edges and circles) to obtain more complex features.
Studies suggest that the human brain works the same way: the first layer of neurons that receives visual information is sensitive to edges and circles, while deeper regions of the cerebral cortex respond to more complex structures, such as human faces.
Hierarchical feature learning predates deep learning, but its architectures faced serious problems, such as vanishing gradients: the gradients become too small at very deep layers to provide any learning signal. This made hierarchical architectures inferior to some traditional machine learning algorithms (such as support vector machines).
To solve the vanishing gradient problem, so that we can train dozens of layers of nonlinear features, many new methods and strategies emerged, and the term "deep learning" grew out of this work. In the early 2010s, researchers found that with the help of GPUs, activation functions could be given enough gradient flow to train deep architectures. Since then, deep learning has developed steadily.
Deep learning is not always tied to deep nonlinear hierarchical features; sometimes it refers to long-range nonlinear time dependencies in sequence data. For sequence data, most other algorithms only have a memory of roughly the last 10 time steps, whereas the LSTM recurrent neural network (invented by Sepp Hochreiter and Jürgen Schmidhuber in 1997) lets the network trace activity hundreds of time steps back to make correct predictions. Although LSTMs were largely overlooked for nearly 10 years, their use has grown rapidly since they were combined with convolutional networks in 2013.
Basic Concepts
Logistic Regression
Regression analysis estimates the relationship between statistical input variables in order to predict an output variable. Logistic regression uses input variables to produce an output that takes one of a limited set of categories, such as "cancer" / "no cancer", or an image type such as "bird" / "car" / "dog" / "cat" / "horse".
Logistic regression applies the logistic sigmoid function (see Figure 2) to weighted input values to produce a two-class prediction (in multinomial logistic regression, a multi-class prediction).
Logistic regression is very similar to a nonlinear perceptron or a neural network without hidden layers. The main difference is that, as long as the input variables satisfy certain statistical properties, logistic regression is easy to interpret and reliable. If these statistical properties hold, a very stable model can be produced from very little input data. Logistic regression is therefore extremely valuable in many sparse-data applications; for example, in medicine and the social sciences it is used to analyze and interpret experimental results. Because logistic regression is simple and fast, it also handles large data sets comfortably.
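As a small sketch of how a fitted logistic regression produces a prediction (NumPy, the weights, and the 0.5 threshold below are illustrative assumptions, not values from the text):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.8, -1.2, 0.3])   # one weight per input variable (assumed already fitted)
b = 0.1

x = np.array([0.5, 0.2, 1.0])    # one example with three input variables
p = sigmoid(x @ w + b)           # probability of the positive class
label = "cancer" if p > 0.5 else "no cancer"
print(p, label)
```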
In deep learning, the final layers of a neural network used for classification are generally logistic regression. In this series, a deep learning algorithm is viewed as a number of feature learning stages whose features are then passed to a logistic regression that performs the classification.
Artificial Neural Network
An artificial neural network (1) reads the input data, (2) transforms it by computing a weighted sum, and (3) applies a nonlinear function to the result to compute an intermediate state. These three steps together are called a "layer", and the transformation function is called a "unit". The intermediate state, that is, the features, becomes the input to another layer.
By repeating these steps, the artificial neural network learns many layers of nonlinear features and finally combines them to make a prediction. The learning process consists of generating error signals (the difference between the network's prediction and the expected value) and using these errors to adjust the weights (or other parameters) so that the predictions become more accurate.
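A minimal sketch of these three steps for two stacked layers, assuming NumPy and ReLU units; the shapes and random weights are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=(1, 4))               # one example with 4 input features
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)

hidden = np.maximum(0.0, x @ W1 + b1)     # layer 1: weighted sum + nonlinearity -> intermediate features
output = hidden @ W2 + b2                 # layer 2 reads those features as its input
print(hidden.shape, output.shape)         # (1, 8) (1, 3)
```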
Kaiser: What follows is an analysis of several terms, including how usage has shifted in recent years. I don't think we need to go into it too deeply.
Unit
A unit sometimes refers to the activation functions in a layer through which the inputs are transformed, such as the logistic sigmoid function. Usually a unit has several incoming and outgoing connections, but units can be more complicated. For example, a long short-term memory (LSTM) unit contains multiple activation functions connected to nonlinear functions or maxout units in a special arrangement. An LSTM computes its final output through a series of nonlinear transformations of the input. Pooling, convolution, and other input-transforming functions are generally not called units.
Artificial Neuron
Artificial neurons, or simply neurons, are a synonym for "units", used to suggest a close connection to neurobiology and the human brain. In fact, deep learning has little to do with the brain; for example, it is now believed that a biological neuron behaves more like an entire multilayer perceptron than like a single artificial neuron. The term "neuron" was coined during the last AI winter to distinguish the more successful neural networks from the failed perceptrons. However, since the great successes of deep learning beginning in 2012, the media has often invoked "neurons" and described deep learning as mimicking the human brain, which is misleading and potentially dangerous for the field. Today the term "neuron" is discouraged in favor of the more precise "unit".
Activation Function
The activation function takes weighted data (the input data multiplied by the weights) and outputs a nonlinear transformation of it. For example, output = max(0, weighted_data) is the rectified linear activation function (ReLU), which essentially sets negative values to zero. The difference between a "unit" and an "activation function" is that a unit can be more complex; for example, a unit may contain several activation functions (as in LSTM units) or a more complex structure (such as maxout units).
Kaiser: This example in the original text overcomplicates a simple point; I suggest skipping it.
The difference between linear and nonlinear activation functions can be illustrated by the relationships among a set of weighted values: imagine four points A1, A2, B1, B2. The points within each pair (A1/A2 and B1/B2) are very close to each other, but A1 is far from B1 and B2, and the same is true of A2.
A linear transformation may change the relationships between the points; for example, A1 and A2 might move far apart, but then B1 and B2 also move far apart. The relationship between the pairs (A1/A2) and (B1/B2) stays essentially the same.
A nonlinear transformation is different: we can increase the distance between A1 and A2 while reducing the distance between B1 and B2, creating a new, more complex relationship. In deep learning, the nonlinear activation function in each layer creates increasingly complex features.
A purely linear transformation, even stacked 1000 layers deep, can be represented by a single layer (because a chain of matrix multiplications collapses into a single matrix multiplication). This is why nonlinear activation functions are so important in deep learning.
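A quick numerical check of this collapse, assuming NumPy; the shapes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 10))
W1 = rng.normal(size=(10, 20))
W2 = rng.normal(size=(20, 3))

two_linear_layers = (x @ W1) @ W2
one_equivalent_layer = x @ (W1 @ W2)     # a single matrix does the same job
print(np.allclose(two_linear_layers, one_equivalent_layer))   # True
```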
Layer
The layer is the highest-level building block of deep learning. A layer is a container that receives weighted input, transforms it with a nonlinear activation function, and passes the result to the next layer as output. A layer is usually uniform, containing only one type of activation function, pooling, convolution, etc., so that it can be easily compared with other parts of the network. The first and last layers are called the "input layer" and "output layer" respectively, and the layers in between are called "hidden layers".
Convolutional Deep Learning
Convolution
Convolution is a mathematical operation that describes a rule for mixing two functions or two pieces of information: (1) a feature map (or the input data) is mixed with (2) a convolution kernel to form (3) a transformed feature map. Convolution is also often interpreted as filtering: the kernel filters the feature map for certain kinds of information. For example, a kernel may pick out only edges and discard everything else.
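A minimal sketch of such a filtering operation, assuming NumPy; the vertical-edge kernel and the 8x8 image are illustrative. Note that, as in most deep learning libraries, no kernel flip is applied, so this is technically the cross-correlation discussed further below:

```python
import numpy as np

def conv2d(feature_map, kernel):
    """Slide `kernel` over `feature_map` and sum the element-wise products
    at every position (valid positions only, no padding)."""
    h, w = feature_map.shape
    kh, kw = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(feature_map[i:i + kh, j:j + kw] * kernel)
    return out

# A vertical-edge detector: responds only where brightness changes from left to right.
edge_kernel = np.array([[1., 0., -1.],
                        [1., 0., -1.],
                        [1., 0., -1.]])

image = np.zeros((8, 8))
image[:, 4:] = 1.0                 # left half dark, right half bright
print(conv2d(image, edge_kernel))  # nonzero only where the window straddles the boundary
```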
Convolution is very important in physics and mathematics because it builds a bridge between the spatial/time domain (e.g., position (0, 30), pixel value 147) and the frequency domain (e.g., amplitude 0.3, frequency 30 Hz, phase 60 degrees). This bridge is the Fourier transform: when you take the Fourier transform of both the convolution kernel and the feature map, the convolution operation is greatly simplified (the integral becomes a multiplication).
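A quick numerical check of this convolution theorem in one dimension, assuming NumPy; the signal and kernel lengths are arbitrary:

```python
import numpy as np

signal = np.random.rand(64)
kernel = np.random.rand(8)

direct = np.convolve(signal, kernel)     # direct convolution, length 64 + 8 - 1 = 71

# Multiply the Fourier transforms (zero-padded to the full output length), then invert.
n = len(signal) + len(kernel) - 1
via_fft = np.fft.irfft(np.fft.rfft(signal, n) * np.fft.rfft(kernel, n), n)

print(np.allclose(direct, via_fft))      # True, up to floating-point error
```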
Convolution can describe how information spreads. For example, when you pour milk into coffee without stirring, that is diffusion, and it can be described accurately by convolution (pixels diffusing toward the contours of an image). In quantum mechanics, convolution describes the probability of a particle being found in a particular place when its position is measured (the probability is highest that the pixel lies on a contour). In probability theory, it describes cross-correlation, the degree of similarity between two sequences (the similarity is high if the pixels of a feature, such as a nose, overlap with those of an image, such as a face). In statistics, convolution describes a weighted moving average over a normalized input sequence (high weights on the contour, low weights elsewhere). There are many other interpretations from other perspectives.
Kaiser: The "edge" here is different from the earlier edges. The original word is contour, which refers to the decision boundary; for face recognition, for example, the contour of the face is what recognition focuses on.
For deep learning, however, we do not know which interpretation of convolution is the right one. The cross-correlation interpretation is currently the most useful: a convolution filter is a feature detector; the input (a feature map) is filtered for a certain feature (the kernel), and the output is high where the feature is detected. This is exactly how cross-correlation is interpreted in image processing.
Pooling/Sub-Sampling
Pooling reads the input from a given area and compresses it into a single value (downsampling). In a convolutional neural network, this concentration of information is a very useful property, because the outgoing connections then usually receive similar information (the information is funneled into the right place in the input feature map of the next convolutional layer). This provides basic rotation/translation invariance. For example, if a face is not in the center of the picture but slightly offset, recognition should not be affected, because pooling funnels the information into the right place, so the convolution filters can still detect the face.
The larger the pooling area, the more the information is compressed, leading to a "slimmer" network that fits more easily into GPU memory. But if the pooling area is too large, too much information is lost and predictive power drops.
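A minimal sketch of 2x2 max pooling, assuming NumPy and even input dimensions:

```python
import numpy as np

def max_pool2x2(feature_map):
    """Keep the largest value in each non-overlapping 2x2 block,
    halving the height and width (assumes even dimensions)."""
    h, w = feature_map.shape
    return feature_map.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool2x2(x))   # [[ 5.  7.] [13. 15.]]
```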
Kaiser: The following part puts together the concepts listed above and is the essence of the article.
Convolutional Neural Network (CNN)
Convolutional neural networks, or convolutional networks, use convolutional layers to filter the input for useful information. The parameters of a convolutional layer are learned automatically, so the filters adjust themselves to extract the most useful information (feature learning). For example, in a general object recognition task the shape of the object may be most useful, while in a bird recognition task color may be most useful (because most birds are similar in shape but vary in color).
Generally, we use multiple convolutional layers to filter an image, and the information obtained after each layer becomes more and more abstract (hierarchical features).
Convolutional networks use pooling layers to obtain limited translation/rotation invariance (the target can still be detected even when it is not in its usual position). Pooling also reduces memory consumption, allowing more convolutional layers to be used.
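A minimal sketch of such a stack of convolution and pooling layers, assuming PyTorch and 32x32 RGB inputs (both illustrative choices, not from the text), ending in a final linear classification stage as described earlier:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # filter the image for low-level features
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 32x32 -> 16x16, some translation tolerance
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # more abstract features on the pooled maps
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 16x16 -> 8x8
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),                    # final classification stage
)

logits = model(torch.randn(1, 3, 32, 32))
print(logits.shape)                               # torch.Size([1, 10])
```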
Recent convolutional networks use inception modules, which employ 1x1 convolution kernels to further reduce memory consumption and speed up computation and training.
Inception
In convolutional networks, the original intention of the inception module was to allow deeper and larger convolutional layers at higher computational efficiency. Inception uses 1x1 convolution kernels to generate smaller feature maps; for example, 192 28x28 feature maps can be compressed into 64 28x28 feature maps by 64 1x1 convolutions. Because the size is reduced, these 1x1 convolutions can be followed by larger convolutions, such as 3x3 and 5x5. Besides 1x1 convolution, max pooling is also used inside the module for dimensionality reduction.
At the output of the inception module, the outputs of all the branches are concatenated into one large feature map, which is passed to the next layer (or the next inception module).
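A simplified inception-style block, assuming PyTorch; it matches the 192-to-64 reduction mentioned above, but the other branch widths are illustrative (the original GoogLeNet uses several variants):

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.branch1x1 = nn.Conv2d(192, 64, kernel_size=1)
        self.branch3x3 = nn.Sequential(
            nn.Conv2d(192, 64, kernel_size=1),             # cheap channel reduction first
            nn.Conv2d(64, 128, kernel_size=3, padding=1),  # larger convolution on the smaller input
        )
        self.branch5x5 = nn.Sequential(
            nn.Conv2d(192, 32, kernel_size=1),
            nn.Conv2d(32, 64, kernel_size=5, padding=2),
        )
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(192, 32, kernel_size=1),
        )

    def forward(self, x):
        branches = [self.branch1x1(x), self.branch3x3(x),
                    self.branch5x5(x), self.branch_pool(x)]
        return torch.cat(branches, dim=1)   # splice the branch outputs channel-wise

x = torch.randn(1, 192, 28, 28)
print(InceptionBlock()(x).shape)            # torch.Size([1, 288, 28, 28]) = 64+128+64+32 channels
```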