The proposed method involves designing and building a real-time catheter tracking system compatible with the commercial Meta Quest 3 XR headset. This system will be suitable for use with a variety of commercially available catheters (Fig. 1-a). The hardware component utilizes a 3D-printed frame to secure the catheter within a region of interest (the boundary of the heart), simultaneously recorded by two orthogonally positioned cameras (Fig. 1-b). A custom-designed computer vision (CV) algorithm (Fig. 1-c) infers the catheter's shape and orientation from the biplane views (Fig. 1-d, e). These data are used to reconstruct the catheter's 3D shape (Fig. 1-f), which is transmitted in real time to the XR headset. The headset leverages the Unity game engine to provide a stand-alone rendering environment (Fig. 1-i). This system will facilitate a specific set of tasks related to precise catheter positioning. The reconstructed catheter data is co-registered and visualized within a patient-specific anatomical heart model (Fig. 1-h) generated from a cardiac computed tomography (CT) scan acquired at end-diastole in DICOM format (Fig. 1-g). This integration occurs within the Meta Quest 3 (Fig. 1-j). Users can physically manipulate a real commercial catheter (identical to those used in a cath lab) and maneuver it in real time within the 3D patient heart model, observing the process via the Meta Quest 3 XR headset. The following sections provide a detailed technical description of each development stage.
2.1. Design and Fabrication of the 3D Printed Setup
To facilitate tracking of the catheter and establish a physical platform for controlled catheter manipulation, we designed a cubic 3D model using Dassault Systèmes SolidWorks 2022 and fabricated it using a 3D printer (Fig. 2). As illustrated in Fig. 2, several components were integrated into the 3D setup to enable accurate computer vision tracking of the catheter. A removable inlet with a 4 mm diameter, centered at (0,0,0) in our 3D Cartesian coordinate system, is included for catheter insertion and articulation into the central open space within the cube. This inlet can be easily exchanged to accommodate catheters of varying diameters. The setup contains mounts designed to securely hold two cameras at specific locations along two sides of the cube, forming an orthogonal biplane imaging system. From these fixed positions, the two cameras capture video of the catheter movement through the central tracking region of interest (ROI).
To specify the ROI, eight fiducial markers (mTi and mFi for i = 1, 2, 3, and 4, where T and F denote the top and front planes, respectively) were 3D printed at specific locations within the model on two cross-shaped pillars, ensuring that the ROI can accommodate the size of a human heart. The fiducial markers lie in two orthogonal planes, allowing the transformation of each camera's perspective to a global coordinate system. This coordinate transformation is essential for reconstructing the 3D catheter trajectory within the 3D model workspace, ensuring that the heart is centered in the ROI and that the inlets of the cube and the heart are aligned. The cube faces are printed in white to provide a high-contrast background against the catheter when viewed by the cameras. This design choice enables more accurate isolation and segmentation of the catheter from the background by the computer vision tracking algorithm. The setup, including all sections and backgrounds, is 3D printed from rigid materials (VeroClear, with VeroUltraWhite for backgrounds) on a Stratasys J826 PolyJet printer.
2.2. Computer Vision-Based Catheter Segmentation and 3D Trajectory Extraction
The main component enabling real-time tracking of the catheter position is a computer vision segmentation algorithm that first identifies and segments the catheter and then reconstructs the full 3D trajectory from synchronized orthogonal camera footage captured from the workspace. The processing pipeline operates on a per-frame basis, taking in video streams from two cameras positioned orthogonally around the 3D cube model. The cameras are intrinsically calibrated and have known, fixed poses relative to the prespecified coordinate space. The proposed vision algorithm has nine steps, as illustrated in Fig. 3.
Each frame of the real-time video is denoted by I(r, c, t), where r, c, and t represent the pixel row, pixel column, and time dimensions, respectively. To start, every frame undergoes a perspective transformation (Eq. 1) that maps the front and top camera views into the same global coordinate space, based on the known fiducial points (denoted by mT and mF). Vectors mT and mF, each consisting of eight coordinate values (mTi and mFi for i = 1, 2, 3, and 4 as depicted in Fig. 2), specify the four corners of the ROI for the top and front planes, respectively. These points are marked on the two printed crosses within the setup (Fig. 2). The perspective transformation enables consistent image processing in the unified coordinate space and cancels out perspective distortion. As expressed in Eq. 1, the transformed coordinates (x', y') are obtained from the local camera-based coordinates (x, y) up to the homogeneous scale factor w = c1x + c2y + 1. The transformation matrix combines rotation and scaling (a1 to a4), translation (b1 and b2), and projection (c1 and c2); the projection terms make this a full projective (perspective) transformation rather than a purely affine one. The parameters of the matrix are initially unknown; to obtain them, a system of eight equations must be solved. To do so, the four fiducial points are first located in the input image and then mapped to predetermined locations based on the known dimensions and position of the ROI. Because each point correspondence contributes two equations, the four correspondences yield a system of eight equations in eight unknowns, from which the perspective transformation matrix is computed. After acquisition of the transformation matrix, all input frames undergo the perspective transformation, yielding coordinate-normalized images. Because the cameras are fixed, the matrix parameters are estimated from the initial frame of each video and applied unchanged to all subsequent frames in real time.
$$w\left[\begin{array}{c}x'\\ y'\\ 1\end{array}\right]=\left[\begin{array}{ccc}{a}_{1}&{a}_{2}&{b}_{1}\\ {a}_{3}&{a}_{4}&{b}_{2}\\ {c}_{1}&{c}_{2}&1\end{array}\right]\left[\begin{array}{c}x\\ y\\ 1\end{array}\right] \quad (1)$$
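For illustration, the following minimal Python/OpenCV sketch shows how the matrix in Eq. 1 can be estimated from the four fiducial correspondences and applied to incoming frames. The fiducial pixel coordinates and the 512 x 512 workspace size are placeholder values, not the calibrated values of the actual setup.

```python
import cv2
import numpy as np

# Pixel coordinates of the four fiducial markers detected in the first
# frame of one camera (illustrative values; in practice these come from
# detecting the markers on the printed crosses).
src_pts = np.float32([[112, 95], [538, 90], [545, 512], [105, 518]])

# Target corners of the ROI in the unified coordinate space, derived from
# the known physical dimensions of the setup (here a 512 x 512 workspace).
dst_pts = np.float32([[0, 0], [512, 0], [512, 512], [0, 512]])

# Solves the 8-equation / 8-unknown system of Eq. 1 for the 3x3
# perspective transformation matrix.
M = cv2.getPerspectiveTransform(src_pts, dst_pts)

def normalize_frame(frame):
    """Warp an incoming camera frame into the unified coordinate space."""
    return cv2.warpPerspective(frame, M, (512, 512))
```

Because the cameras are rigidly mounted, `M` is computed once from the initial frame and reused for every subsequent frame.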
With the frames aligned, preprocessing steps are applied, including Gaussian smoothing to reduce background noise and enable robust detection; a larger Gaussian kernel increases smoothness but can also degrade localization precision. Next, contrast and brightness adjustments enhance the visibility of the catheter against the white background. To isolate the catheter, an adaptive thresholding operation converts the grayscale frame into a binary image based on dynamic local thresholding. The algorithm computes an individualized threshold value, denoted T(x, y), for each pixel (x, y) by averaging the pixel values within a localized neighborhood window centered on that pixel. In other words, the adaptive threshold is computed as:
$$T\left(x,y\right)=\text{mean}\left(\text{neighborhood}\right)-C \quad (2)$$
where mean(neighborhood) is the mean of the pixel values within the neighborhood window. The constant C is then subtracted from this mean to bias the threshold. C is a customizable constant that can be positive, negative, or zero, but is typically positive: a positive C raises the threshold, producing a darker binary image; a negative C lowers it, producing a brighter one; and zero uses the local mean directly. The optimal value of C depends on the specific image and desired outcome, and is generally chosen through experimentation to achieve the best segmentation or object detection results. Each pixel's final value, represented by g(x, y), is then obtained by masking the input frame with the computed threshold as in:
$$g\left(x,y\right)=\left\{\begin{array}{ll}1,& PV(x,y)>T(x,y)\\ 0,& PV(x,y)\le T(x,y)\end{array}\right. \quad (3)$$
where PV(x, y) denotes the pixel value in the input image. Each pixel is thus classified against a locally determined threshold rather than a single global value for the entire image. The method is configured through two adjustable parameters: the neighborhood block size and the constant C. Larger block sizes mitigate the impact of noise artifacts but, as with the Gaussian kernel, reduce localization precision.
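A minimal sketch of this preprocessing and adaptive thresholding chain using OpenCV is shown below; the kernel size, contrast gain, block size, and C values are illustrative starting points rather than the tuned parameters of the actual system.

```python
import cv2

def segment_catheter(frame, block_size=31, c=10):
    """Binary segmentation of the catheter against the white background.

    block_size and c correspond to the neighborhood window and the
    constant C of Eq. 2; the values here are illustrative defaults.
    """
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Gaussian smoothing suppresses background noise; a larger kernel
    # smooths more but degrades localization precision.
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)
    # Contrast/brightness adjustment to enhance the dark catheter
    # against the white background (alpha/beta are illustrative).
    adjusted = cv2.convertScaleAbs(blurred, alpha=1.5, beta=-20)
    # ADAPTIVE_THRESH_MEAN_C implements Eq. 2 (local mean minus C);
    # THRESH_BINARY_INV inverts Eq. 3 so the dark catheter, rather
    # than the bright background, maps to foreground.
    binary = cv2.adaptiveThreshold(
        adjusted, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
        cv2.THRESH_BINARY_INV, block_size, c)
    return binary
```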
In the subsequent stage of the proposed vision algorithm, a morphological skeletonization operation thins the binary representation of the catheter into a central, one-pixel-wide curve that characterizes its medial-axis trajectory. This transformation yields a concise representation that streamlines the subsequent tracking process. Several skeletonization methods are available, such as Zhang's method 21 and Lee's method 22. Here, Zhang's method is implemented, which performs a series of sequential passes across the image, systematically removing pixels on the periphery of the object. This iterative process continues until no further pixels can be removed.
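Zhang's thinning is available off the shelf; the sketch below uses scikit-image's implementation (which applies Zhang's algorithm for 2D inputs) as one possible choice, not necessarily the library used in the original pipeline.

```python
import numpy as np
from skimage.morphology import skeletonize

def skeletonize_catheter(binary):
    """Reduce the segmented catheter to a one-pixel-wide medial axis.

    For 2D inputs, scikit-image's skeletonize implements Zhang's
    iterative boundary-removal thinning.
    """
    skeleton = skeletonize(binary > 0)
    return skeleton.astype(np.uint8)
```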
Initially, the tip location is determined by identifying the point with the maximum number of neighboring true values (white pixels) 23. The coordinates of the inferred tip are recorded as the first entry in an array, and the corresponding pixels are marked as visited in the skeletonized binary catheter image. This procedure is repeated iteratively until every skeleton pixel has been visited, documenting the coordinates of the entire catheter from its tip to its entry point. The result is an ordered two-dimensional array that succinctly encapsulates the spatial trajectory of the catheter for each frame, from tip to entry point. These pixel coordinates are then mapped onto a real-world 3D coordinate system in millimeters, using the known phantom dimensions and camera intrinsic parameters. The coordinates are further downsampled to a subset of K points, where the first point denotes the catheter's tip, the last point designates the entry point, and the intervening K-2 points are evenly distributed to delineate the catheter's curvature. The computer vision algorithm is implemented in Python using the OpenCV library and operates in real time on both planes. For each of the K points, the X and Y coordinates are inferred from the top view and the Z coordinate from the front view. Combining these yields a real-time 3D K-point tracking system, providing a holistic representation of the catheter's spatial configuration and curvature.
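The downsampling and biplane fusion steps can be summarized in the following sketch, which assumes the ordered tip-to-entry trajectories from both views have already been extracted; the axis assignments depend on the actual camera mounting and are illustrative.

```python
import numpy as np

def downsample_trajectory(traj_px, k, mm_per_px):
    """Reduce an ordered tip-to-entry pixel trajectory to K points.

    traj_px: (N, 2) array of skeleton coordinates ordered from tip to
    entry point. mm_per_px converts to millimeters using the known
    phantom dimensions. The first and last of the K points are the tip
    and entry point; the rest are evenly spaced along the curve.
    """
    idx = np.linspace(0, len(traj_px) - 1, k).round().astype(int)
    return traj_px[idx] * mm_per_px

def fuse_biplane(top_pts, front_pts):
    """Combine the two orthogonal views into (K, 3) 3D points in mm.

    The top view supplies (x, y) and the front view supplies z; the
    front view's other axis is redundant with the top view. Which
    image axis maps to z is an assumption about the camera mounting.
    """
    x, y = top_pts[:, 0], top_pts[:, 1]
    z = front_pts[:, 1]
    return np.stack([x, y, z], axis=1)
```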
2.3. Patient-Specific 3D Heart Model Generation
To generate a patient-specific 3D model of the heart, we utilized Materialise Mimics Research software version 21.0 for 3D image processing. The initial step involved importing a cardiac computed tomography (CT) scan acquired at end-diastole in the DICOM format, as demonstrated in Fig. 4-a. This data then underwent image segmentation within Mimics to delineate the heart and spine, forming a unified 3D mask while preserving their respective spatial relationships. The resulting 3D segmentation was saved as an STL file. To further refine the model and eliminate extraneous components, such as vessels and ribs, we employed Geomagic Wrap software (3D Systems Geomagic Corporation). This refinement removed artifacts and smoothed the mesh, as illustrated in Fig. 4-b and 4-c. To prepare the generated 3D heart for the proposed XR scene, the model was exported from Geomagic as an STL file and imported into the CAD software SolidWorks. Within SolidWorks, six posts were carefully positioned inside the right atrium, ensuring optimal accessibility from the inferior vena cava with a catheter (Fig. 4-d). Each post was constructed by extruding a 2.9 mm diameter circle into a 4 mm long cylinder, and the xyz coordinates of the center of each post were precisely measured (Fig. 4-e). Upon finalizing the model, four spherical markers, each 3 mm in diameter, were added parallel to the spine base (Fig. 4-f). These markers serve as reference points for orientation and integration within the Unity scene system. Finally, the entire model was exported as a single glTF file, with a bin file associated with each individual element (spine, heart, post 1, post 2, etc.), facilitating its seamless integration into the XR scene.
2.4. Mixed Reality Rendering
Employing the Unity game engine, we integrated the inferred catheter tracking data and the patient-specific cardiac model into a mixed reality application tailored for the Meta Quest 3 headset. Following the steps described in section 2.3, the 3D model of the patient's heart, reconstructed from a CT scan 9,10,24, is rendered using the Unity game engine. Furthermore, the catheter spline, obtained from the K points inferred in real time by the proposed computer vision algorithm, is also rendered in the scene. For this mapping to be accurate, alignment of the fiducial points on the 3D-printed model with corresponding anatomical landmarks is essential.
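One way to realize this alignment is a least-squares affine fit between corresponding fiducial coordinates in the tracking space and the heart-model space. The sketch below assumes the fiducial positions in both frames are already known; the function names are illustrative, and this is not necessarily the exact registration procedure used in the system.

```python
import numpy as np

def fit_affine_3d(src, dst):
    """Least-squares 3D affine transform mapping tracking-space
    fiducials (src) onto heart-model fiducials (dst).

    src, dst: (N, 3) arrays of corresponding fiducial coordinates
    (N >= 4; points should not be coplanar for a unique solution).
    Returns a 3x4 matrix A such that dst ~ A @ [x, y, z, 1]^T.
    """
    n = src.shape[0]
    src_h = np.hstack([src, np.ones((n, 1))])        # homogeneous (N, 4)
    X, *_ = np.linalg.lstsq(src_h, dst, rcond=None)  # solves src_h @ X = dst
    return X.T                                       # (3, 4)

def apply_affine(A, pts):
    """Map catheter points (K, 3) into the heart-model frame."""
    pts_h = np.hstack([pts, np.ones((pts.shape[0], 1))])
    return pts_h @ A.T
```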
Communication between the output of the Python-based computer vision algorithm and the 3D rendering in Unity is implemented using the web framework Flask, a common choice for efficient real-time data transfer. More specifically, the inferred tracking data, comprising the position of each of the K points on the catheter as well as the rotation of the tip, is serialized as JSON and transmitted to Unity through a Flask-based WebSocket connection. The Flask server can run on any IP address and port the user specifies, defaulting to localhost at port 5000. On the Unity side, the user is prompted to enter this IP address and port at launch, and the application connects to the server using the SocketIO Unity package. After receiving the transmitted tracking data, Unity renders dynamic updates to both the catheter location and curvature in the 3D scene. Using the fiducial markers, an affine transformation maps the catheter placement and movement into the coordinate frame of the 3D heart model. The catheter rendering is superimposed within the patient-specific heart rendering and the surrounding scene in the Quest 3 headset, creating a cohesive XR environment. This integrated platform offers an interactive experience that goes beyond visualization, allowing users to observe and analyze the catheter's movements in real time with quantitative feedback based on tracking relative to predefined target positions. The visual representation is contextualized within the reconstructed cardiac anatomy, providing a valuable tool for training simulations. Moreover, this approach holds considerable potential for real-life applications in clinical settings, where the interactive XR platform could contribute to improved catheterization procedures through enhanced training and procedural guidance.
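As an illustration of the Python side of this link, the following sketch uses the Flask-SocketIO package; the event name, JSON layout, and helper names are assumptions rather than the exact protocol of the described system.

```python
# Minimal sketch of the Python-side transmitter, assuming Flask-SocketIO.
import json
from flask import Flask
from flask_socketio import SocketIO

app = Flask(__name__)
socketio = SocketIO(app, cors_allowed_origins='*')

def broadcast_tracking(points_mm, tip_rotation):
    """Push the latest K catheter points and tip rotation to Unity.

    points_mm: (K, 3) array of catheter positions in millimeters.
    tip_rotation: tip orientation, e.g. a quaternion [x, y, z, w]
    (the exact rotation encoding is an assumption).
    """
    payload = {
        'points': points_mm.tolist(),
        'tip_rotation': tip_rotation,
    }
    # 'tracking' is an illustrative event name the Unity SocketIO
    # client would subscribe to.
    socketio.emit('tracking', json.dumps(payload))

if __name__ == '__main__':
    # Default endpoint: localhost:5000, matching the address and port
    # the Unity application prompts for at launch.
    socketio.run(app, host='127.0.0.1', port=5000)
```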