Understanding overfitting and several regularization methods for solving it
One of the most common problems data scientists run into in daily work and study is overfitting. Have you ever built a model that performed excellently on the training set but was a mess on the test set? Or entered a modeling contest where, judging by your local runs, your model should have topped the leaderboard, only to find yourself hundreds of places down when the official rankings were announced? If you have had a similar experience, this article was written for you: it explains how to avoid overfitting and improve the performance of your model.
In this article, we will explain the concept of overfitting and walk through several regularization methods for addressing it, with Python examples to further consolidate the knowledge. Note that this article assumes the reader has some experience with neural networks and Keras.
What is regularization
Before delving into this topic, take a look at this picture:
Every time overfitting comes up, this picture gets pulled out as the classic illustration. As shown above, at first the model cannot fit all the data points well, i.e. it fails to reflect the data distribution: this is underfitting. As training continues, it gradually finds the pattern in the data, fitting the data points as closely as possible while still reflecting the overall trend: at this stage we have a well-performing model. If we keep training beyond this point, the model starts to dig into the details and noise of the training data; in order to fit every data point "unscrupulously", it overfits.
In other words, from left to right the model's complexity gradually increases and its prediction error on the training set keeps falling, while its error on the test set first falls and then rises again.
If you have built a neural network before, you have surely learned this lesson: the more complex the network, the easier it is to overfit. To make the model generalize better while still fitting the data, we can use regularization to make small modifications to the learning algorithm and improve the model's overall performance.
Regularization and overfitting
Overfitting is closely related to the design of the neural network, so let's first look at an overfitted neural network:
If you have read our earlier from-scratch tutorial on understanding and coding neural networks in Python and R, or have a basic understanding of how neural networks work, you will know that the arrowed lines in the figure above each carry a weight, while the neurons are where the inputs and outputs are stored. To keep things in check, that is, to prevent the network from drifting too far in one direction during optimization, we add a regularization penalty term that penalizes the neurons' weight matrices.
If we set the regularization coefficient too large, some of the weight matrices are driven to nearly zero, and what we end up with is a much simpler, almost linear network that is likely to underfit.
Therefore, a larger coefficient is not always better. We need to tune the regularization coefficient to obtain a well-fitted model, as shown in the figure below.
Regularization in deep learning
L2 and L1 regularization
L1 and L2 are the most common regularization methods. Both work by adding a regularization term to the cost function:
Cost function = loss (e.g. binary cross-entropy) + regularization term
Because of this added regularization term, the weights shrink; in other words, the complexity of the neural network is reduced. Combined with the idea that "the more complex the network, the easier it overfits", this should, in theory, help prevent overfitting (Occam's razor).
Of course, this regularization term is not the same in L1 and L2.
For L2, its cost function can be expressed as:
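C = C0 + (λ/2m)·Σ w², where C0 is the original (unregularized) loss, m is the number of training samples, and the sum runs over all the weights w in the network (the standard form, consistent with the derivative discussed next).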
Here λ is the regularization coefficient, a hyperparameter that can be tuned for better results. Differentiating the equation above, the coefficient in front of the weight w becomes 1−ηλ/m. Since η, λ, and m are all positive, 1−ηλ/m is less than 1, so w tends to shrink; this is why L2 regularization is also called weight decay.
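Written out, the gradient-descent update with learning rate η is w → (1−ηλ/m)·w − η·∂C0/∂w, which makes the shrinking factor explicit.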
For L1, its cost function can be expressed as:
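C = C0 + (λ/m)·Σ |w|, with the same notation as above.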
Unlike L2, here we penalize the absolute value of the weight w. Differentiating the expression above yields a term containing sgn(w), which means that when w is positive it is pushed down toward 0, and when w is negative it is pushed up toward 0. So the idea of L1 is to drive weights to 0 and thereby reduce the complexity of the network.
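The corresponding update is w → w − (ηλ/m)·sgn(w) − η·∂C0/∂w: a constant-size push toward zero regardless of how large w is, which is why L1 tends to produce sparse weights.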
Therefore, L1 works very well when we want to compress the model, but if the goal is simply to prevent overfitting, L2 is usually the choice. In Keras, we can apply regularization to any layer by calling regularizers directly.
Example: using L2 regularization in a fully connected (Dense) layer:
from keras import regularizers

model.add(Dense(64, input_dim=64,
                kernel_regularizer=regularizers.l2(0.01)))

Note: here 0.01 is the value of the regularization coefficient λ; we can tune it further with a grid search.
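As a concrete illustration (not part of the original example), here is a minimal sketch of such a grid search, wrapping a small stand-in Keras model with scikit-learn's GridSearchCV via the KerasClassifier wrapper; the architecture, the helper name build_model, and the candidate λ values are arbitrary choices for this sketch.

from keras.models import Sequential
from keras.layers import Dense
from keras import regularizers
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import GridSearchCV

def build_model(l2_lambda=0.01):
    # small stand-in network; the layer sizes are arbitrary for this sketch
    model = Sequential()
    model.add(Dense(64, input_dim=784, activation='relu',
                    kernel_regularizer=regularizers.l2(l2_lambda)))
    model.add(Dense(10, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam',
                  metrics=['accuracy'])
    return model

# epochs/batch_size are forwarded to model.fit by the wrapper
clf = KerasClassifier(build_fn=build_model, epochs=10, batch_size=128, verbose=0)
grid = GridSearchCV(clf, param_grid={'l2_lambda': [0.0001, 0.001, 0.01, 0.1]}, cv=3)
# grid.fit(x_train, y_train)     # x_train / y_train: your own training arrays
# print(grid.best_params_)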
Dropout
Dropout is arguably the most interesting regularization method. It also works very well, which is why it is one of the most commonly used techniques in deep learning. To explain it better, let's first assume our neural network looks like this:
So what exactly does Dropout drop? Take a look at the picture below: in each iteration it randomly selects some neurons and "drops" them, removing those neurons together with their corresponding inputs and outputs.
Compared with L1 and L2, which modify the cost function, Dropout is more of a trick applied to the training of the network itself. As training progresses, the neural network ignores some hidden/input neurons in each iteration (the fraction is a hyperparameter, conventionally around half), which leads to different outputs, some correct and some wrong.
This approach is somewhat similar to ensemble learning, since it captures more randomness. An ensemble of classifiers usually performs better than a single classifier. Likewise, because the network has learned the data distribution, most of the outputs of the Dropout-trained model will be correct, while noisy data affects only a small fraction of them and does not have much influence on the final result.
Because of these factors, Dropout is generally used when the neural network is large and more randomness is needed.
In Keras, we can implement dropout with the Dropout layer from keras.layers.core. Here is the Python code:
from keras.layers.core import Dropout

model = Sequential([
    Dense(output_dim=hidden1_num_units, input_dim=input_num_units, activation='relu'),
    Dropout(0.25),
    Dense(output_dim=output_num_units, input_dim=hidden1_num_units, activation='softmax'),
])

Note: here we set the Dropout hyperparameter to 0.25 (a quarter of the neurons are "dropped" each time); we can tune it further with a grid search.
Data augmentation
Since overfitting means the model captures too much of the noise and detail in the training data, the simplest way to prevent it is to increase the amount of training data. In machine learning tasks, however, that is often hard to do, because collecting and labeling data is expensive.
Suppose we are working with handwritten digit images. To expand the training set we can rotate, flip, shrink or enlarge, shift, crop, add random noise, or distort the images. Here are some processed examples:
This is data augmentation. In a sense, the performance of a machine learning model rests on the amount of data, so augmentation can bring a large improvement in prediction accuracy; sometimes it is an essential technique for improving a model.
In Keras, we can perform all of these transformations with ImageDataGenerator, which provides a long list of parameters for preprocessing the training data. Here is sample code:
from keras.preprocessing.image import ImageDataGenerator

# train: a 4-D numpy array of images
datagen = ImageDataGenerator(horizontal_flip=True)
datagen.fit(train)
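Once the generator has been fitted, the augmented batches are usually streamed to the model through flow(). A minimal sketch, assuming an already compiled model, a 4-D image array X_train with matching labels y_train, and the Keras 2 generator API (older Keras used samples_per_epoch/nb_epoch instead):

# train on augmented batches streamed from the generator
model.fit_generator(datagen.flow(X_train, y_train, batch_size=128),
                    steps_per_epoch=len(X_train) // 128,
                    epochs=10)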
Early stopping
Early stopping is a kind of cross-validation strategy: we set aside part of the training set as a validation set, and as training progresses, as soon as the model's performance on the validation set starts to get worse, we stop training. This is the early stopping method.
In the above figure, we should stop training at the dotted line, because after that, the model begins to overfit.
In Keras, we can stop training early with the EarlyStopping callback. Here is its sample code:
from keras.callbacks import EarlyStopping
EarlyStopping(monitor='val_loss', patience=5)
Here, monitor specifies the quantity to watch ('val_loss' is the loss on the validation set), and patience is the number of epochs to wait for an improvement.
Patience=5 means that training stops once the monitored quantity has not improved for 5 consecutive epochs. Looking at the figure above, after the dashed line each epoch yields a higher validation error (lower validation accuracy), so after 5 epochs in a row without improvement, training stops early.
Note: it is possible that the validation accuracy would have improved again after those 5 epochs, so we must be careful when choosing this hyperparameter.
A case study: MNIST digit data with Keras
Dataset: datahack.analyticsvidhya.com/contest/practice-problem-identify-the-digits/
Having covered all these regularization methods, it is time to put them into practice. For this case study we use the digit recognition dataset from Analytics Vidhya.
First, import a few basic libraries:
%pylab inline

import os
import numpy as np
import pandas as pd
from scipy.misc import imread
from sklearn.metrics import accuracy_score
from matplotlib import pyplot

import tensorflow as tf
import keras

# fix the seed to prevent potential randomness
seed = 128
rng = np.random.RandomState(seed)
Then load the data set:
root_dir = os.path.abspath('/Users/shubhamjain/Downloads/AV/identify the digits/')
data_dir = os.path.join(root_dir, 'data')
sub_dir = os.path.join(root_dir, 'sub')

## only read the training file
train = pd.read_csv(os.path.join(data_dir, 'Train', 'train.csv'))
train.head()
Check the image:
img_name = rng.choice(train.filename)
filepath = os.path.join(data_dir, 'Train', 'Images', 'train', img_name)

img = imread(filepath, flatten=True)

pylab.imshow(img, cmap='gray')
pylab.axis('off')
pylab.show()

# store the images in a numpy array
temp = []
for img_name in train.filename:
    image_path = os.path.join(data_dir, 'Train', 'Images', 'train', img_name)
    img = imread(image_path, flatten=True)
    img = img.astype('float32')
    temp.append(img)

x_train = np.stack(temp)
x_train /= 255.0
x_train = x_train.reshape(-1, 784).astype('float32')
y_train = keras.utils.np_utils.to_categorical(train.label.values)
Create the validation set (70:30 split):
split_size = int(x_train.shape[0]*0.7)

x_train, x_test = x_train[:split_size], x_train[split_size:]
y_train, y_test = y_train[:split_size], y_train[split_size:]
Construct a simple neural network with 5 hidden layers, each containing 500 neurons:
# import keras modules
from keras.models import Sequential
from keras.layers import Dense

# define variables
input_num_units = 784
hidden1_num_units = 500
hidden2_num_units = 500
hidden3_num_units = 500
hidden4_num_units = 500
hidden5_num_units = 500
output_num_units = 10

epochs = 10
batch_size = 128
model = Sequential([
    Dense(output_dim=hidden1_num_units, input_dim=input_num_units, activation='relu'),
    Dense(output_dim=hidden2_num_units, input_dim=hidden1_num_units, activation='relu'),
    Dense(output_dim=hidden3_num_units, input_dim=hidden2_num_units, activation='relu'),
    Dense(output_dim=hidden4_num_units, input_dim=hidden3_num_units, activation='relu'),
    Dense(output_dim=hidden5_num_units, input_dim=hidden4_num_units, activation='relu'),
    Dense(output_dim=output_num_units, input_dim=hidden5_num_units, activation='softmax'),
])
First run 10 epochs to quickly check the model's performance:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
trained_model_5d = model.fit(x_train, y_train, nb_epoch=epochs, batch_size=batch_size, validation_data=(x_test, y_test))
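To read off the final validation accuracy explicitly, a small sketch using the x_test/y_test split created above:

# evaluate the trained baseline on the held-out 30% split
val_loss, val_acc = model.evaluate(x_test, y_test, batch_size=batch_size, verbose=0)
print('baseline validation accuracy: %.4f' % val_acc)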
L2 regularization
from keras import regularizers

model = Sequential([
    Dense(output_dim=hidden1_num_units, input_dim=input_num_units, activation='relu',
          kernel_regularizer=regularizers.l2(0.0001)),
    Dense(output_dim=hidden2_num_units, input_dim=hidden1_num_units, activation='relu',
          kernel_regularizer=regularizers.l2(0.0001)),
    Dense(output_dim=hidden3_num_units, input_dim=hidden2_num_units, activation='relu',
          kernel_regularizer=regularizers.l2(0.0001)),
    Dense(output_dim=hidden4_num_units, input_dim=hidden3_num_units, activation='relu',
          kernel_regularizer=regularizers.l2(0.0001)),
    Dense(output_dim=hidden5_num_units, input_dim=hidden4_num_units, activation='relu',
          kernel_regularizer=regularizers.l2(0.0001)),
    Dense(output_dim=output_num_units, input_dim=hidden5_num_units, activation='softmax'),
])

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
trained_model_5d = model.fit(x_train, y_train, nb_epoch=epochs, batch_size=batch_size, validation_data=(x_test, y_test))
With λ equal to 0.0001, the model's prediction accuracy is higher than before!
L1 regularization
## l1
model = Sequential([
    Dense(output_dim=hidden1_num_units, input_dim=input_num_units, activation='relu',
          kernel_regularizer=regularizers.l1(0.0001)),
    Dense(output_dim=hidden2_num_units, input_dim=hidden1_num_units, activation='relu',
          kernel_regularizer=regularizers.l1(0.0001)),
    Dense(output_dim=hidden3_num_units, input_dim=hidden2_num_units, activation='relu',
          kernel_regularizer=regularizers.l1(0.0001)),
    Dense(output_dim=hidden4_num_units, input_dim=hidden3_num_units, activation='relu',
          kernel_regularizer=regularizers.l1(0.0001)),
    Dense(output_dim=hidden5_num_units, input_dim=hidden4_num_units, activation='relu',
          kernel_regularizer=regularizers.l1(0.0001)),
    Dense(output_dim=output_num_units, input_dim=hidden5_num_units, activation='softmax'),
])

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
trained_model_5d = model.fit(x_train, y_train, nb_epoch=epochs, batch_size=batch_size, validation_data=(x_test, y_test))
The model's accuracy did not improve, so we pass on L1.
Dropout
## dropout
from keras.layers.core import Dropout

model = Sequential([
    Dense(output_dim=hidden1_num_units, input_dim=input_num_units, activation='relu'),
    Dropout(0.25),
    Dense(output_dim=hidden2_num_units, input_dim=hidden1_num_units, activation='relu'),
    Dropout(0.25),
    Dense(output_dim=hidden3_num_units, input_dim=hidden2_num_units, activation='relu'),
    Dropout(0.25),
    Dense(output_dim=hidden4_num_units, input_dim=hidden3_num_units, activation='relu'),
    Dropout(0.25),
    Dense(output_dim=hidden5_num_units, input_dim=hidden4_num_units, activation='relu'),
    Dropout(0.25),
    Dense(output_dim=output_num_units, input_dim=hidden5_num_units, activation='softmax'),
])

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
trained_model_5d = model.fit(x_train, y_train, nb_epoch=epochs, batch_size=batch_size, validation_data=(x_test, y_test))
Again, the accuracy is higher than the baseline.
Data augmentation
from keras.preprocessing.image import ImageDataGenerator
datagen = ImageDataGenerator(zca_whitening=True)

# load the data
train = pd.read_csv(os.path.join(data_dir, 'Train', 'train.csv'))

temp = []
for img_name in train.filename:
    image_path = os.path.join(data_dir, 'Train', 'Images', 'train', img_name)
    img = imread(image_path, flatten=True)
    img = img.astype('float32')
    temp.append(img)

X_train = np.stack(temp)
X_train = X_train.reshape(X_train.shape[0], 1, 28, 28)
X_train = X_train.astype('float32')

# fit parameters from the data - this augments the training data
datagen.fit(X_train)
Here we use zca_whitening, which highlights the outline of each digit, as shown in the figure below:
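To reproduce such a preview, a minimal sketch that draws one batch of whitened digits from the fitted generator (the 3x3 grid and the batch size of 9 are arbitrary choices):

# plot the first nine ZCA-whitened images produced by the generator
for x_batch in datagen.flow(X_train, batch_size=9):
    for i in range(9):
        pyplot.subplot(3, 3, i + 1)
        pyplot.imshow(x_batch[i].reshape(28, 28), cmap='gray')
        pyplot.axis('off')
    pyplot.show()
    break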
## splitting
y_train = keras.utils.np_utils.to_categorical(train.label.values)
split_size = int(X_train.shape[0]*0.7)

x_train, x_test = X_train[:split_size], X_train[split_size:]
y_train, y_test = y_train[:split_size], y_train[split_size:]

## reshaping
x_train = np.reshape(x_train, (x_train.shape[0], -1)) / 255
x_test = np.reshape(x_test, (x_test.shape[0], -1)) / 255
## model structure using dropout
from keras.layers.core import Dropout

model = Sequential([
    Dense(output_dim=hidden1_num_units, input_dim=input_num_units, activation='relu'),
    Dropout(0.25),
    Dense(output_dim=hidden2_num_units, input_dim=hidden1_num_units, activation='relu'),
    Dropout(0.25),
    Dense(output_dim=hidden3_num_units, input_dim=hidden2_num_units, activation='relu'),
    Dropout(0.25),
    Dense(output_dim=hidden4_num_units, input_dim=hidden3_num_units, activation='relu'),
    Dropout(0.25),
    Dense(output_dim=hidden5_num_units, input_dim=hidden4_num_units, activation='relu'),
    Dropout(0.25),
    Dense(output_dim=output_num_units, input_dim=hidden5_num_units, activation='softmax'),
])

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
trained_model_5d = model.fit(x_train, y_train, nb_epoch=epochs, batch_size=batch_size, validation_data=(x_test, y_test))
The improvement is very noticeable!
Early stopping
from keras.callbacks import EarlyStopping

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
trained_model_5d = model.fit(x_train, y_train, nb_epoch=epochs, batch_size=batch_size,
                             validation_data=(x_test, y_test),
                             callbacks=[EarlyStopping(monitor='val_acc', patience=2)])
Compared with the methods above, early stopping halted after only five epochs, because the prediction accuracy stopped improving. If we allow more iterations, it should give better results.
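If we do allow more epochs, a common companion to early stopping is a checkpoint callback that keeps the best weights seen so far. A hedged sketch (the file name best_weights.h5 is an arbitrary choice):

from keras.callbacks import EarlyStopping, ModelCheckpoint

# stop when val_acc stalls, but also save the best epoch's weights to disk
callbacks = [EarlyStopping(monitor='val_acc', patience=2),
             ModelCheckpoint('best_weights.h5', monitor='val_acc', save_best_only=True)]
model.fit(x_train, y_train, nb_epoch=epochs, batch_size=batch_size,
          validation_data=(x_test, y_test), callbacks=callbacks)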