The dense feature fusion end-to-end stereo matching network is a deep learning model designed for three-dimensional (3D) object recognition and matching. It uses convolutional neural networks (CNNs) to extract features from corresponding targets, and these features are then fused through a dense feature fusion layer to improve matching accuracy. The network is trained end to end, mapping input image pairs directly to output disparities so that feature extraction and matching are learned and optimized jointly [26,27]. The primary focus of this research is to establish this model and apply it to challenges of management efficiency and cultural conflict in cross-cultural management projects.
The overall structure of the dense feature fusion end-to-end stereo matching network model comprises four modules: multi-scale feature extraction, dense fusion, mixed attention, and cost aggregation. Figure 1 illustrates the specific structure:
In Fig. 1, the CNN is a deep learning model composed of several modules dedicated to 3D object recognition and matching. The multi-scale feature extraction module processes the model's input to acquire contextual information. By fusing the multi-scale features, the dense fusion module generates a dense feature map, enhancing matching in weakly constrained regions. The attention module improves the network's processing of useful information and thereby the accuracy of disparity prediction. The left and right feature tensors are then concatenated to form a four-dimensional (4D) matching cost volume, which is aggregated by 3D convolutions in an encoding-decoding structure. Finally, the soft argmin method is applied to obtain the final disparity map.
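The concatenation step above can be sketched in a few lines of NumPy. This is a minimal illustration of how a 4D cost volume is typically formed from left and right feature maps, not the paper's implementation; the function name and tensor shapes are assumptions for the example.

```python
import numpy as np

def build_cost_volume(left_feat, right_feat, max_disp):
    """Concatenate left features with disparity-shifted right features.

    left_feat, right_feat: feature maps of shape (C, H, W).
    Returns a 4D cost volume of shape (2*C, max_disp, H, W).
    """
    C, H, W = left_feat.shape
    volume = np.zeros((2 * C, max_disp, H, W), dtype=left_feat.dtype)
    for d in range(max_disp):
        # A left pixel at column x is paired with the right pixel at x - d;
        # columns with no valid counterpart stay zero.
        volume[:C, d, :, d:] = left_feat[:, :, d:]
        volume[C:, d, :, d:] = right_feat[:, :, : W - d]
    return volume
```

The resulting volume is what the 3D-convolutional encoding-decoding structure later regularizes.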
Based on this structure, the multi-scale feature extraction module is designed to extract contextual information at different scales. A feature pyramid network is constructed, incorporating several residual modules to deepen the network while reducing resolution, which enables the network to capture both rich shallow and deep information. The dense fusion module aims to strengthen the interaction between channels and reduce the number of network parameters while maintaining the resolution of the input features. To improve the network's processing of useful information, the Hybrid Attention Module (HAM) is introduced; it strengthens feature representation without adding an excessive number of parameters and consists of temporal and spatial attention branches. The cost aggregation module uses three stacked hourglass encoding-decoding structures to regularize the cost volume, fully exploiting information from deeper-level cost volumes and global context. By incorporating residual connections, the module aggregates context information while reducing information loss. The feature extraction and dense fusion modules are the central components; their structures are depicted in Figs. 2 and 3, respectively.
In Fig. 2, the multi-scale feature extraction module is designed to capture features of various scales and levels from the input image. The module typically comprises multiple convolutional layers with different kernel sizes, together with down-sampling and up-sampling operations. The different kernel sizes allow the module to extract features at different scales in the input image. Down-sampling operations then reduce the spatial resolution of the feature map, which lowers computational cost and improves the model's robustness to positional variations in the input. Up-sampling operations restore the feature map to its original size, enabling further processing such as classification or segmentation. By incorporating this module, the model acquires image information at different scales and granularities, which facilitates a better understanding of the input image and ultimately improves performance.
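The down-sample/up-sample pattern described above can be illustrated with a minimal NumPy sketch. This is only a toy model of the idea (average pooling standing in for strided convolution, nearest-neighbour interpolation for learned up-sampling); the function names and the scale set are assumptions for the example.

```python
import numpy as np

def avg_pool2d(x, k):
    """Non-overlapping k x k average pooling over an (H, W) map."""
    H, W = x.shape
    return x[: H - H % k, : W - W % k].reshape(H // k, k, W // k, k).mean(axis=(1, 3))

def upsample_nearest(x, k):
    """Nearest-neighbour up-sampling by an integer factor k."""
    return x.repeat(k, axis=0).repeat(k, axis=1)

def multi_scale_features(img, scales=(1, 2, 4)):
    """Extract maps at several resolutions and restore each to full size."""
    H, W = img.shape
    feats = []
    for s in scales:
        pooled = img if s == 1 else avg_pool2d(img, s)
        feats.append(pooled if s == 1 else upsample_nearest(pooled, s)[:H, :W])
    return feats
```

Each returned map has the input's resolution but encodes context from a different receptive-field size, which is what the later fusion stage consumes.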
In Fig. 3, the Dense Fusion Module is a module specifically designed for deep learning image segmentation tasks. Its purpose is to effectively integrate feature information from various scales and levels, thereby enhancing the model's performance. This module typically comprises multiple convolutional layers, up-sampling operations, fusion operations, additional convolutional layers, and activation functions. By fusing feature information from different scales and levels, it generates a comprehensive feature tensor that contributes to improved performance in image segmentation tasks.
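A minimal sketch of the fusion idea follows, assuming the multi-scale maps have already been restored to a common resolution. The uniform per-pixel weighted sum stands in for a learned 1x1 convolution; the function name and fixed weights are illustrative assumptions, not the paper's design.

```python
import numpy as np

def dense_fusion(feature_maps):
    """Fuse same-resolution feature maps from different scales.

    Each map has shape (H, W). The maps are stacked along a new channel
    axis, mixed with a per-pixel weighted sum (a stand-in for a learned
    1x1 convolution), and the fused map is appended so that both the
    original and the combined features remain available downstream.
    """
    stacked = np.stack(feature_maps, axis=0)              # (S, H, W)
    weights = np.full(len(feature_maps), 1.0 / len(feature_maps))
    fused = np.tensordot(weights, stacked, axes=1)        # (H, W)
    return np.concatenate([stacked, fused[None]], axis=0)  # (S + 1, H, W)
```

Keeping the input maps alongside the fused map mirrors the "dense" aspect: later layers can still see every scale, not just the mixture.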
The calculation steps for obtaining the disparity and loss are outlined below:
Step 1: The soft argmin method is employed to regress the disparity map, resulting in a differentiable and smooth disparity map. This facilitates end-to-end network training and is represented by Eq. (1):
$$\widehat{y}=\sum _{y=0}^{{Y}_{max}}y\times \alpha \left(-{B}_{y}\right)$$
1
In Eq. (1), \(\widehat{y}\) is the regressed disparity, obtained by weighting each candidate disparity y by its probability, and \({Y}_{max}\) is the maximum disparity value. \(\alpha \left(-{B}_{y}\right)\) denotes the softmax operation applied to the negated costs, which converts the cost values into a probability distribution over disparities, and \({B}_{y}\) is the matching cost at an individual disparity y.
Step 2: The smooth L1 function is used as the loss function for supervised training. Compared with the L2 loss, it is more robust and less sensitive to outliers. The function expression is shown in Eq. (2).
$$F=\frac{1}{N}\sum _{i=1}^{N}f\left({y}_{i}-{\widehat{y}}_{i}\right)$$
2
In Eq. (2), F represents the supervised loss, N is the number of labeled pixels, and \(f\left({y}_{i}-{\widehat{y}}_{i}\right)\) is the smooth L1 loss applied to the difference between the ground-truth disparity \({y}_{i}\) and the predicted disparity \({\widehat{y}}_{i}\). The calculation of \(f\) is shown in Eq. (3):
$$f\left(x\right)=\left\{\begin{array}{ll}0.5{x}^{2},& \left|x\right|<1\\ \left|x\right|-0.5,& \left|x\right|\ge 1\end{array}\right.$$
3
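Eqs. (2) and (3) translate into a few lines of NumPy. The sketch below uses the standard smooth L1 form (quadratic inside the unit interval, linear outside); the function names are illustrative assumptions.

```python
import numpy as np

def smooth_l1(x):
    """Elementwise smooth L1, Eq. (3): quadratic near zero, linear elsewhere."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) < 1, 0.5 * x ** 2, np.abs(x) - 0.5)

def disparity_loss(y_true, y_pred):
    """Mean smooth L1 over all labeled pixels, Eq. (2)."""
    return smooth_l1(np.asarray(y_true) - np.asarray(y_pred)).mean()
```

The quadratic region gives small, stable gradients near the optimum, while the linear region caps the influence of large disparity errors (outliers), which is exactly the robustness property claimed for this loss.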
Data Collection
In order to better evaluate the performance of the designed algorithm, the detailed parameters of the network are shown in Fig. 4:
In Fig. 4, the experimental setup shows the network configuration used to evaluate the proposed algorithm on the KITTI and Scene Flow datasets. The architecture consists of two convolutional layers, denoted con-1 and con-2. The con-1 layer is further divided into con-1.1 and con-1.2, while con-2 is composed of three components. The entire network comprises six branches, labeled branches 1–6. "Convolution kernel" indicates the number of kernels in each layer, "Stride" denotes the step size of the convolution operation, and "Expansion rate" signifies the dilation rate. This configuration serves as the network environment in which the effectiveness of the dense feature fusion end-to-end CNN algorithm is assessed.
In order to conduct the analysis of the dense feature fusion-based end-to-end stereo matching network, this study utilizes the ETH3D dataset and the Middlebury dataset for the evaluation and training of the end-to-end stereo matching network. The ETH3D dataset (https://www.eth3d.net/) provides a large-scale collection of stereo matching and depth estimation data. It consists of high-resolution image pairs, disparity maps, and camera parameters in multiple scenes, which can be used for evaluating and training the end-to-end stereo matching network. The Middlebury dataset (http://vision.middlebury.edu/stereo/) is one of the classic datasets used for stereo vision research. It contains high-quality stereo image pairs and manually annotated disparity maps, making it suitable for evaluating stereo matching and depth estimation algorithms.
Analysis of the current situation of cross-cultural management in projects
Project X is an education initiative jointly undertaken by Yunnan Province and the World Bank Group. The project encompasses a four-year preparation phase followed by a three-year execution period [28,29]. With the aid of loans provided by the World Bank Group, this initiative aims to alleviate local financial constraints and enhance the accessibility of standardized and high-quality preschool education resources. The primary objectives include improving enrollment rates, enhancing educational quality, advancing preschool education research capabilities, and realizing economic benefits [30,31]. Throughout the project implementation, international advanced management concepts, extensive experience, and superior resources are utilized to ensure its success. The scale of the project is illustrated in Fig. 5.
In Fig. 5, the organizational structure of Project X comprises a Project Management Office (PMO) consisting of six key components: project manager, finance, engineering cost, migration and relocation, coordination, and financial management. The core team is composed of eleven members, including nine females and two males, with a mix of three foreign nationals and six domestic personnel from four different countries. The management functions encompass decision-making, motivation, and training. In this project, the management personnel carefully analyze the diverse cultural backgrounds of team members and propose suitable solutions, which involve three distinct steps:
Step 1: Defining the interview content, which encompasses project understanding, decision-making processes, cultural dimensions, and communication methods.
Step 2: Refining the interview content. Project understanding is classified into categories such as public welfare, demonstration, and foreign-funded loan projects. The decision-making process involves on-site investigations, collective discussions, and leadership decisions. Cultural dimensions encompass power distance, individualism and collectivism, uncertainty avoidance, orientation maintenance, and career planning. Communication methods include aspects such as attire, physical space, gestures, and timing.
Step 3: Summarizing the interview data and deriving research findings, as presented in Table 1.
Table 1
Summary of Research Results
| Index | Project awareness | | |
| Refinement of indicators | Public welfare project | Demonstration project | Foreign-funded loan project |
| Number of people | 4 (Chinese) | 4 (3 Chinese, 1 foreign) | 1 (Chinese) |

| Index | Decision-making process | |
| Refinement of indicators | On-site research decisions | Collective discussion + leadership decision |
| Number of people | 4 (foreign) | 5 (3 Chinese, 2 foreign) |

| Index | Cultural dimension | | | | |
| Refinement of indicators | Power distance | Individualism/collectivism | Uncertainty avoidance | Orientation maintenance | Career planning |
| Difference | Foreign side shows a small power distance | Foreign side leans toward individualism | Foreign side shows high uncertainty avoidance | Foreign side is short-term oriented | Shared by Chinese and foreign members |

| Index | Communication methods | | | |
| Refinement of indicators | Clothing and accessories | Space | Gesture | Time concept |
| Difference | Foreign side favors business attire | Foreign side keeps a closer communication distance | Foreign side uses gestures frequently | Foreign side has a strong sense of time |
Step 4: Analyze the issues in Project X based on the interview information.
Step 5: Analyze the problems and propose solutions based on the established end-to-end CNN model.