This paper studies high-quality image compression. Compressed images enable fast transmission of image information and reduce memory usage, which places high demands on both compression speed and the preservation of image detail. Earlier work designed an image compression algorithm based on an autoencoder [30]; by allocating bits reasonably, it significantly improved compression performance. However, for some high-resolution images the compression results were still unsatisfactory, with blurring and distortion. Generative adversarial networks, by contrast, offer excellent compression performance on high-resolution images. Deep learning has driven the development of image compression technology, continuously improving compression performance and achieving better compression rates and compression metrics, so learning a better image compression framework is important. This paper therefore redesigns a content-weighted autoencoder as the basis of image compression and integrates it deeply with a generative adversarial network to form a high-quality image compression framework, aiming to preserve as much image information as possible at a faster compression speed and a better compression rate. The following first introduces the overall network structure of the proposed high-quality image compression algorithm, then describes the key modules and the loss function in detail, and finally explains the training and use of the algorithm.
3.1 Overall structure design of algorithm network
The network structure of the high-quality image compression algorithm designed in this paper includes the following main modules: a content-weighted autoencoder, whose decoding part serves as the generator G in the generative adversarial network, with the compressed data output by the autoencoder used as the generation condition; an importance map, where Q(x) denotes the importance map quantization process and M(x) denotes the importance mask calculation; a binary quantizer, which binarizes the Sigmoid activations output by the encoder and whose output is inverted during decoding to produce the decoding result; a multi-scale discriminator DM; and a composite loss function Lcom. Together these modules realize high-quality image compression, and the overall network structure of the algorithm is shown in Fig. 3.
3.2 Content-weighted autoencoder
The content-weighted autoencoder replaces the traditional fully connected encoding with convolution operations, which enables image compression at a lower bit rate and improves the rate-entropy trade-off. Its structure includes two parts: encoding and decoding. The encoding part is a cascade of convolutional layers and residual modules, consisting of 3 convolutional layers and 3 residual blocks; each residual block contains two convolutional layers and a ReLU function. The residual modules improve the noise robustness of the encoder. The encoder designed in this paper does not include normalization layers, to avoid visual artifacts in smooth areas.
In the encoding process, the input image is first convolved by Conv1, 64 convolution kernels of size 8×8 with stride 4, and then passes through a residual module Res1. It then passes through Conv2, a convolutional layer with 128 kernels of size 4×4 and stride 2, followed by two residual modules Res2 and Res3, after which the feature map is convolved by the 1×1 convolution kernel Conv3. Except for the last layer of the encoder, which uses the Sigmoid activation function, all convolutional layers use ReLU. The encoding process is shown in Fig. 4.
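To make the encoder structure concrete, the following PyTorch-style sketch assembles the layers described above. It is only an illustration: the padding values, the 3×3 kernels inside the residual blocks, and the number of output channels n of Conv3 are assumptions not fixed by the text.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Residual block as described: two conv layers with a ReLU in between."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)  # 3x3 is an assumption
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return x + self.conv2(self.relu(self.conv1(x)))

class Encoder(nn.Module):
    """Conv1 (64, 8x8, stride 4) -> Res1 -> Conv2 (128, 4x4, stride 2)
    -> Res2 -> Res3 -> Conv3 (1x1, Sigmoid); n_out is an illustrative choice."""
    def __init__(self, n_out=64):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 64, kernel_size=8, stride=4, padding=2)
        self.res1 = ResidualBlock(64)
        self.conv2 = nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1)
        self.res2 = ResidualBlock(128)
        self.res3 = ResidualBlock(128)
        self.conv3 = nn.Conv2d(128, n_out, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.conv1(x))
        x = self.res1(x)
        x = self.relu(self.conv2(x))
        x = self.res3(self.res2(x))
        return torch.sigmoid(self.conv3(x))   # values in [0, 1] for the binary quantizer
```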
In the encoding process, the input signal \(x=[{x_1},{x_2}, \cdots ,{x_n}]\) is mapped through an activation function to a new data matrix y. The mathematical principle is shown in formula (1).
$$y=f(wx+b)$$
1
Wherein, f is the activation function, w is the mapping matrix, and b is the bias term of the encoding part.
The decoding part consists of up-sampling and deconvolution layers: the convolutional layers extract features and the deconvolution layers reconstruct the image. Through continuous iteration, the error between the output and the input is minimized to obtain the optimal autoencoder parameters. Feature extraction in this autoencoder is efficient, and the convolutional weights are shared across neurons, which keeps the network complexity low, makes the model easy to train, and improves the reconstruction quality of the compressed image.
The decoding process is to restore the extracted effective features so that the result is close to the input signal x. The mathematical principle is as shown in formula (2).
$$x^{\prime}=f^{\prime}(w^{\prime}y+b^{\prime})$$
2
Wherein, \(f^{\prime}\) is the mapping function, \(w^{\prime}\) is the mapping matrix, and \(b^{\prime}\) is the bias term of the decoding part.
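As a rough illustration of the decoding path, the sketch below mirrors the encoder with transposed-convolution up-sampling. It reuses the ResidualBlock class from the encoder sketch above; the exact layer widths and kernel sizes are assumptions, since the text only states that the decoder is symmetric to the encoder.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Illustrative mirror of the encoder: undo the 1x1 projection, then
    invert the 2x and 4x down-sampling with transposed convolutions."""
    def __init__(self, n_in=64):
        super().__init__()
        self.proj = nn.Conv2d(n_in, 128, kernel_size=1)                          # undo Conv3
        self.res1 = ResidualBlock(128)
        self.res2 = ResidualBlock(128)
        self.up1 = nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1)  # x2
        self.res3 = ResidualBlock(64)
        self.up2 = nn.ConvTranspose2d(64, 3, kernel_size=8, stride=4, padding=2)    # x4
        self.relu = nn.ReLU(inplace=True)

    def forward(self, y):
        y = self.relu(self.proj(y))
        y = self.res2(self.res1(y))
        y = self.relu(self.up1(y))
        y = self.res3(y)
        return torch.sigmoid(self.up2(y))   # reconstruction x' in [0, 1]
```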
3.3 Importance map
In the process of image compression, different regions have different compression difficulties: smoother regions are easier to compress, while regions with rich texture carry the important information [31], so more bits should be allocated to parts with complex texture structures. In the process of extracting feature maps, different feature maps contain different information. Content-weighted importance maps achieve better bit allocation and allow the compression rate to be controlled and optimized.
The importance map is learned from the input image. An intermediate feature map is taken from a residual block of the encoder and then passed through convolutional layers to obtain the importance map \(F(x)\). The importance map extraction process is shown in Fig. 5.
In the network, let the input image be x and the encoder output be \(E(x) \in {R^{h \times w \times n}}\), where \(h \times w\) is the spatial size and n is the number of feature maps output by the encoder. \(F(x)\) denotes the importance map of size \(h \times w\), L denotes the number of importance levels, and \(\frac{n}{L}\) is the number of bits corresponding to each level. When \(\frac{{l - 1}}{L} \leqslant {F_{ij}} \leqslant \frac{l}{L}\), only the first \(\frac{{nl}}{L}\) bits of the output at that position are encoded and stored; in this way the importance map realizes the allocation of bits. The importance map is quantized into an integer smaller than L, and an importance feature mask m of size \(h \times w \times n\) corresponding to \(B(E(x))\) is generated. Denoting by \({f_{ij}}\) an element of \(F(x)\), the quantization of the importance map is defined in formula (3).
$$Q({f_{ij}})=l - 1,{\text{ }}if{\text{ }}\frac{{l - 1}}{L} \leqslant {f_{ij}} \leqslant \frac{l}{L},{\text{ }}l=1, \cdot \cdot \cdot ,L$$
3
After quantizing the importance map, the importance feature mask m is calculated by formula (4).
$${m_{kij}}=\begin{cases} 1, & \text{if } k \leqslant \frac{n}{L}Q({f_{ij}}) \\ 0, & \text{otherwise} \end{cases}$$
4
The final encoding result of the input image x can be represented as \(c=M \otimes B\), where \(\otimes\) denotes element-wise multiplication of the importance mask and the binary code. In this way the content-weighted importance map is obtained, which guides the generation of images with clearer textures.
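A minimal sketch of formulas (3) and (4) and of the final masking \(c=M \otimes B\) is given below; the function names and the toy values of n and L are purely illustrative.

```python
import torch

def quantize_importance(f, L):
    """Formula (3): map f_ij in [0, 1] to an integer level Q(f_ij) in {0, ..., L-1}."""
    return torch.clamp(torch.ceil(f * L) - 1, min=0, max=L - 1)

def importance_mask(f, n, L):
    """Formula (4): expand the h x w importance map into an n x h x w binary mask;
    channel k at position (i, j) is kept iff k <= (n / L) * Q(f_ij)."""
    q = quantize_importance(f, L)                                # (h, w)
    k = torch.arange(1, n + 1, dtype=f.dtype).view(n, 1, 1)      # channel indices 1..n
    return (k <= (n / L) * q.unsqueeze(0)).to(f.dtype)           # (n, h, w)

# toy usage: n = 8 feature maps, L = 4 importance levels
f = torch.tensor([[0.1, 0.9], [0.5, 0.3]])     # learned importance map F(x)
b = (torch.rand(8, 2, 2) > 0.5).float()        # binary code B(E(x)) from the quantizer
m = importance_mask(f, n=8, L=4)
c = m * b                                      # c = M (element-wise product) B
print(m.sum(dim=0))                            # channels kept per spatial position
```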
During back propagation the gradient still needs to be computed. The importance map is generated by convolving the feature map, but the importance feature mask is produced by the quantizer, which makes the gradient zero over most regions. The gradient of the mask with respect to an element \({p_{ij}}\) of the importance map is therefore approximated by formula (5).
$${m_{kij}}=\begin{cases} 1, & \text{if } \left\lceil {\frac{{kL}}{n}} \right\rceil \leqslant L{p_{ij}} \\ 0, & \text{otherwise} \end{cases}$$
5
3.4 Binary quantizer
After encoding the image, a binary quantizer is used to complete the quantization. The activation function is the Sigmoid function, whose values lie in [0,1], so after the nonlinear transformation the encoder outputs also lie in [0,1]. In forward propagation, activation values greater than 0.5 are defined as 1 and values no greater than 0.5 as 0, as shown in formula (6).
$$B({e_{ij}})=l,{\text{ if }}\frac{l}{2}<{e_{ij}} \leqslant \frac{{l+1}}{2},{\text{ }}l=0,1$$
6
In back propagation the gradient is computed by the chain rule, but the hard threshold makes the gradient almost zero everywhere. To solve this vanishing-gradient problem, this paper designs a surrogate function for gradient back propagation, shown in formula (7).
$$\widetilde {B}(x)=x,{\text{ 0}}<x<1$$
7
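The forward threshold of formula (6) and the identity surrogate of formula (7) can be combined in a custom autograd function. This is a minimal sketch under those formulas, not the exact implementation used in the paper.

```python
import torch

class BinaryQuantizer(torch.autograd.Function):
    """Forward (formula 6): threshold the Sigmoid outputs at 0.5.
    Backward (formula 7): B~(x) = x on (0, 1), so the gradient passes straight through."""

    @staticmethod
    def forward(ctx, e):
        return (e > 0.5).to(e.dtype)

    @staticmethod
    def backward(ctx, grad_output):
        # identity surrogate: hand the incoming gradient back unchanged
        return grad_output

# usage on encoder outputs e in [0, 1]
e = torch.rand(1, 64, 16, 16, requires_grad=True)
b = BinaryQuantizer.apply(e)
b.sum().backward()            # gradients reach e despite the hard threshold
print(e.grad.abs().sum())
```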
3.5 Multi-scale discriminator
The discriminator is the core of the generative adversarial network; through adversarial training with the generator, its ability to distinguish real images from generated ones improves [32]. Obtaining good compression and visual quality requires a relatively large receptive field, which would normally call for large convolution kernels or a more complex network and thus risk overfitting, so a better convolutional network design is needed. The multi-scale discriminator collects feature data at each scale, obtaining both a broad global view and accurate detail information, and fuses the data from each level so that the generated compressed image is as close to the original image as possible.
When the image data generated by the content-weighted autoencoder is input into the multi-scale discriminator, pooling layers down-sample the input at different scales to obtain images at three resolutions, which are then processed by three discriminator networks. The low-resolution discriminator obtains a larger field of view during training, while the high-resolution discriminator minimizes image distortion, so that the texture of the generated compressed image is clearer. The network structure of the multi-scale discriminator is shown in Fig. 6.
The multi-scale discriminator obtains better discrimination ability by training on the generated compressed image and the original image. Its working principle is to down-sample both the image produced by the generator and the original image by factors of two and four, yielding images at three different scales, which are then convolved by their respective discriminator modules. The three discriminator modules share the same structure: two convolutional layers, three convolutional block layers, and a Sigmoid function, where each convolutional block consists of a conv layer, a BN layer, and a Leaky-ReLU. The number of channels in the convolutional blocks increases successively: n = 128 in the first block, n = 256 in the second, and n = 512 in the third. All convolution kernels in the discriminator are 4×4; the stride is 2 in the first convolutional layer and in the convolutional blocks, and 1 in the last convolutional layer. Finally, the multi-scale discriminator analyzes and judges the images, fuses the results from each scale, and outputs Real if the input is judged valid and Fake otherwise.
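The structure above can be sketched as follows. The channel count of the first convolutional layer (here 64) and the use of average pooling for the 2× and 4× down-sampling are assumptions, since the text does not specify them.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, stride=2):
    """conv + BN + Leaky-ReLU block used inside each single-scale discriminator."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=stride, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.2, inplace=True),
    )

class ScaleDiscriminator(nn.Module):
    """One discriminator: first conv (stride 2), three conv blocks with
    n = 128, 256, 512, a final stride-1 conv, and a Sigmoid."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1),   # 64 channels is an assumption
            nn.LeakyReLU(0.2, inplace=True),
            conv_block(64, 128),
            conv_block(128, 256),
            conv_block(256, 512),
            nn.Conv2d(512, 1, kernel_size=4, stride=1, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)

class MultiScaleDiscriminator(nn.Module):
    """Run three identical discriminators on the input at full, 1/2 and 1/4 resolution."""
    def __init__(self):
        super().__init__()
        self.scales = nn.ModuleList([ScaleDiscriminator() for _ in range(3)])
        self.down = nn.AvgPool2d(kernel_size=3, stride=2, padding=1)

    def forward(self, x):
        outputs = []
        for d in self.scales:
            outputs.append(d(x))
            x = self.down(x)       # halve the resolution for the next scale
        return outputs
```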
3.6 Composite loss function
The image compression design in this paper is based on unsupervised learning and introduces an adversarial network into the end-to-end generative image compression framework, so the loss function consists of the content-weighted autoencoder loss, the decoder (generator) loss, the feature matching loss, and the multi-scale discriminator loss.
When the content-weighted autoencoder compresses and reconstructs image data, errors arise between the input data and the reconstructed data, that is, some information is lost. To better learn and extract image features, the distortion and bit rate of the reconstruction must be balanced continuously, so an optimized rate-distortion function is used as its loss function, as shown in formula (8).
$${L_C}={L_D}+\alpha {L_R}$$
8
Where \({L_D}\) is the distortion loss, α is the weight used to adjust the bit rate, and \({L_R}\) is the rate loss.
The distortion loss is expressed as the squared L2 norm, as shown in formula (9).
$${L_D}=\left\| {{{x^{\prime}}_n} - {x_n}} \right\|_{2}^{2}$$
9
Where \({x^{\prime}_n}\) represents the reconstructed image and \({x_n}\) represents the input image.
In the autoencoder network, the rate loss is defined as the entropy of the intermediate feature map: the amount of data stored in the encoder's latent space depends on how concentrated the quantized data are, so the entropy of the intermediate data is used to define \({L_R}\); see formula (10).
$${L_R}= - E\left[ {{{\log }_2}{P_q}} \right]$$
10
In this formula, \({P_q}=\int_{{x - \frac{1}{2}}}^{{x+\frac{1}{2}}} {{P_d}}\) is the probability of a quantized value, where \({P_d}\) denotes the probability density function of the original data.
Therefore, the content-weighted autoencoder loss function can be expressed as formula (11).
$${L_C}=\left\| {{{x^{\prime}}_n} - {x_n}} \right\|_{2}^{2}+\alpha \left( { - E\left[ {{{\log }_2}{P_q}} \right]} \right)$$
11
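A minimal sketch of formulas (8) to (11) follows. How the symbol probabilities \(P_q\) are estimated (an entropy model) and the value of α are assumptions left open here.

```python
import torch

def distortion_loss(x_rec, x):
    """Formula (9): squared L2 norm between reconstruction and input, averaged over the batch."""
    return ((x_rec - x) ** 2).flatten(1).sum(dim=1).mean()

def rate_loss(code_probs, eps=1e-9):
    """Formula (10): L_R = -E[log2 P_q]; code_probs holds the estimated probability
    P_q of each quantized symbol (assumed to come from some entropy model)."""
    return -torch.log2(code_probs + eps).mean()

def content_weighted_loss(x_rec, x, code_probs, alpha=0.1):
    """Formula (11): L_C = L_D + alpha * L_R; alpha = 0.1 is an illustrative rate weight."""
    return distortion_loss(x_rec, x) + alpha * rate_loss(code_probs)
```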
The decoder acts as the generator in the generative adversarial network and must allocate bits during image compression. The rate-distortion function balances the reconstruction quality against the bit rate, as shown in formula (12).
$${L_d}+\beta R={L_d}+\beta H(\hat{w})$$
12
The loss function of the optimized generator is formula (13).
$${L_G}={E_{x\sim {p_x}}}\left[ {\lambda R+d(x,\hat{x}) - \beta {{\log }_2}D(\hat{x},y)} \right]$$
13
Wherein, \(d(x,\hat{x})\) is the distortion between the original image and the reconstruction, and \(\lambda\) and \(\beta\) are weight parameters.
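Translated directly into code, formula (13) might read as follows; d_fake stands for the discriminator score of the reconstruction (assumed to be a scalar in (0, 1], with the condition y already folded in), and the weights are illustrative values.

```python
import torch

def generator_loss(rate, distortion, d_fake, lam=0.01, beta=1.0, eps=1e-9):
    """Formula (13): L_G = E[ lambda * R + d(x, x^) - beta * log2 D(x^, y) ]."""
    return lam * rate + distortion - beta * torch.log2(d_fake + eps)
```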
The feature matching loss is represented by MAE (mean absolute error) here, which is less susceptible to outliers than MSE (mean square error), as shown in formula (14).
$${L_{FM}}=E\sum\limits_{{i=1}}^{{L_1}} {\frac{1}{{{N_i}}}} \left[ {{{\left\| {F_{D}^{i}(x) - F_{D}^{i}(G(z))} \right\|}_1}} \right]$$
14
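Assuming the multi-scale discriminator exposes its intermediate feature maps \(F_{D}^{i}\), the feature matching loss of formula (14) can be sketched as:

```python
import torch

def feature_matching_loss(feats_real, feats_fake):
    """Formula (14): MAE between discriminator features of the original and of the
    generated image, averaged over layers; feats_* are lists of feature maps F_D^i."""
    loss = 0.0
    for fr, ff in zip(feats_real, feats_fake):
        loss = loss + torch.mean(torch.abs(fr.detach() - ff))   # 1/N_i handled by mean()
    return loss / len(feats_real)
```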
The loss function of the multi-scale discriminator is defined in formula (15).
$${L_M}={E_{\hat{x} \sim {p_g}}}\left[ {D(\hat{x})} \right] - {E_{x\sim {p_r}}}\left[ {D(x)} \right]+\lambda {E_{\hat{x} \sim {p_{\hat{x}}}}}\left[ {{{\left( {{{\left\| {{\nabla _{\hat{x}}}D(\hat{x})} \right\|}_2} - 1} \right)}^2}} \right]$$
15
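Formula (15) has the form of a WGAN loss with a gradient penalty. The compact sketch below assumes D returns a single scalar score per image (for example the averaged outputs of the three scales) and uses λ = 10 as an illustrative penalty weight.

```python
import torch

def multiscale_discriminator_loss(D, x_real, x_fake, lam=10.0):
    """Formula (15): E[D(x^)] - E[D(x)] + lam * gradient penalty."""
    loss = D(x_fake).mean() - D(x_real).mean()

    # gradient penalty on random interpolations between real and generated images
    eps = torch.rand(x_real.size(0), 1, 1, 1, device=x_real.device)
    x_hat = (eps * x_real + (1 - eps) * x_fake).requires_grad_(True)
    grads = torch.autograd.grad(D(x_hat).sum(), x_hat, create_graph=True)[0]
    penalty = ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()

    return loss + lam * penalty
```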
Therefore, the above loss functions together constitute a composite loss function, which can effectively improve the quality and effect of image compression generation from many aspects. The composite loss function is defined as formula (16).
$${L_{com}}=\rho {L_C}+\varphi {L_G}+\phi {L_{FM}}+\psi {L_M}$$
16
Wherein, \(\rho\), \(\varphi\), \(\phi\), and \(\psi\) are weight parameters. They are tuned experimentally on the experimental platform to achieve the best image compression effect; the weights selected in this paper are 0.5, 0.5, 5, and 3, respectively.
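Combining the four terms with the weights reported above gives the following one-line sketch (the individual losses are the illustrative helpers defined earlier):

```python
def composite_loss(L_C, L_G, L_FM, L_M, rho=0.5, varphi=0.5, phi=5.0, psi=3.0):
    """Formula (16): weighted sum of the four losses with the weights 0.5, 0.5, 5, 3."""
    return rho * L_C + varphi * L_G + phi * L_FM + psi * L_M
```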
3.7 Algorithm training process
The specific process of algorithm training is as follows:
Step 1: Use paired original and input images as training data; the input image passes through the encoder, binary quantizer, and importance map calculation and is then sent to the generator to produce a compressed image;
Step 2: Send the generated compressed image and the original image to the multi-scale discriminator DM, which discriminates between them and judges whether the compression effect meets the standard. If it does, the compressed image is output; otherwise the reconstructed image is returned until a usable compressed image is generated. Based on these results, the multi-scale discriminator loss, decoder loss, and content-weighted loss are calculated;
Step 3: Compare the generated compressed image with the original image, and calculate the feature matching loss;
Step 4: Back propagate, according to the losses calculated in steps 2 and 3 above, update the multi-scale discriminator DM and generator G parameters respectively;
Step 5: Execute steps 1 to 4. The input image of the encoder is denoted x, and the encoder output obtained by analyzing and transforming the input signal is denoted \(E(x) \in {R^{h \times w \times n}}\), where \(h \times w\) is the size and n is the number of feature maps. \(E(x)\) is quantized by the binary quantizer: output values greater than 0.5 are marked as 1 and the rest as 0. A feature map in the encoder is extracted and passed through a separate convolution to obtain the importance map, denoted \(F(x)\). \(F(x)\) is likewise quantized, producing an importance mask of the same size as the quantized \(E(x)\). The importance mask is combined with the binary code produced by the quantizer from the encoder output, so that the image better preserves important information, and finally an image compression code is obtained. The decoder is symmetric to the encoder; it performs the corresponding analysis and transformation to obtain the decoder output and generate the compressed image. When the parameters of DM are updated, the score for effectively generated compressed images approaches 1 and the score for invalid ones approaches 0, that is, the multi-scale discriminator DM is optimized by maximization. When the generative network is trained, the generator G is connected in series with the multi-scale discriminator DM and the resulting error is passed back to the generative network; at this point the generative network's loss is minimized, that is, the generator G is optimized by minimization. In this process the generator G and the multi-scale discriminator DM form a dynamic game, and the loop exits once a Nash equilibrium is reached, at which point the generated compressed image is judged to be virtually indistinguishable from the real original image.
Step 6: Finally, adjust the parameters of each step according to specific needs, and the algorithm network outputs a usable high-quality compressed image. A minimal code sketch of one training iteration is given below.
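Under the assumptions made in the sketches above (a scalar-output discriminator and the illustrative loss helpers), one iteration of Steps 1 to 4 could look roughly like this; the rate and feature matching terms are omitted here for brevity, so this is only a rough outline, not the authors' implementation.

```python
import torch

def train_step(encoder, quantize, importance_mask_fn, decoder, D, opt_G, opt_D, x):
    """One adversarial iteration: Step 1 generate, Step 2 update DM, Steps 3-4 update G."""
    # Step 1: encode, binarize, mask by the importance map, then decode into a compressed image
    e = encoder(x)
    code = quantize(e) * importance_mask_fn(x)
    x_rec = decoder(code)

    # Step 2: update the multi-scale discriminator DM on real vs. generated images
    opt_D.zero_grad()
    d_loss = multiscale_discriminator_loss(D, x, x_rec.detach())
    d_loss.backward()
    opt_D.step()

    # Steps 3-4: generator-side losses and parameter update
    # (rate and feature matching terms omitted in this sketch)
    opt_G.zero_grad()
    L_C = distortion_loss(x_rec, x)
    L_G = -torch.log2(D(x_rec).mean() + 1e-9)      # adversarial term of formula (13)
    g_loss = 0.5 * L_C + 0.5 * L_G
    g_loss.backward()
    opt_G.step()
    return d_loss.item(), g_loss.item()
```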