The introduction of CIM architecture has provided a promising solution to address the inherent inefficiencies related to data movement in the traditional von Neumann architecture1–4. Simultaneously, it has enhanced parallel computing capabilities and energy efficiency within the domain of artificial intelligence (AI) applications5–8. A crucial component within this architecture is the multiply-and-accumulate (MAC) unit9–11. Memristors have garnered significant attention as viable candidates for MAC operations in CIM due to their compact size12, high array density13,14, low energy requirements15–18, and compatibility with back-end-of-line (BEOL) integration19–21. In particular, memristors based on 2D materials have generated substantial research interest due to their exceptional properties that are essential for AI applications, especially the ultrathin thickness and low switching voltage for energy-efficient computing (Supplementary Table 1).
However, the advancement of 2D materials-based memristive crossbar arrays (CBAs) faces several challenges. While the synthesis and integration of 2D materials still face challenges22–25, recent advances have demonstrated successful wafer-scale and monolithic 3D integration of 2D materials, achieving remarkable performance in logic and memory applications73–77,46,39. This method is impractical for scaling up to large-scale integration with silicon platforms. Scalable methods such as chemical vapor deposition (CVD), molecular beam epitaxy (MBE), and atomic layer deposition (ALD) present challenges in achieving growth processes compatible with back-end-of-line (BEOL) technology, where the growth temperature must remain below 450 °C26. Alternative approaches involve the growth of 2D materials over a substrate without strict temperature constraints, followed by the transfer of the 2D material onto the desired substrate. However, conventional transfer processes may introduce impurities and doping effects into the 2D material due to the utilization of polymers and liquid media27,28.
Furthermore, the implementation of CIM hardware utilizing 2D materials-based CBAs also faces challenges, including the lack of process integration, limited array size, speed constraints, and the absence of integration with peripheral control and sensing circuits. To enhance device selectivity within the CBA and mitigate leakage current, the incorporation of access selectors or transistors becomes crucial in the process integration29,30. Nevertheless, existing 2D materials-based memristor arrays predominantly focus on memristor-only (1R) array configurations, thus lacking scalability22,23,27,28,31–33. The issue of array size presents another obstacle, as large-scale CBAs are indispensable for weighted layers in neural networks. Fully-connected (FC) layers, for instance, necessitate hundreds of neurons, while convolution layers require hundreds of channels, both of which rely on large array sizes for efficient parallel computing34,35. Although some heterogeneous integrations of 2D materials-based memristors have been reported, such as the graphene transistor/hexagonal boron nitride (h-BN) memristor-based 0.5T0.5R single cell36, the MoS2 transistor/h-BN memristor-based 1T1R single cell37, the Si transistor/MoS2 memristor-based 2T1R single cell38, and the Si transistor/h-BN memristor-based 1T1R 5 × 5 CBA48, these demonstrations suffer from limited array sizes and fail to accommodate the complete weight mapping of a neural network. Moreover, slow programming speed remains a challenge: state-of-the-art heterogeneous Si/2D integrated 1T1R arrays report a program time of 232 µs, which falls short of the tens of nanoseconds required for compute-in-memory operation48,40. Finally, the hardware-level integration of 2D materials-based memristive CBAs with peripheral circuits for CIM systems has not been adequately addressed.
Previous demonstrations have primarily focused on device features extracted from small CBAs without integrated peripheral circuits and have evaluated array functionality using simulation approaches48,41–43.
Variable resistive states have been observed in HfSe2-metal compounds, showing potential for memristive devices44,45. Our research indicates that Molecular Beam Epitaxy (MBE) can achieve wafer-scale growth of 2D HfSe2 thin films, enabling large-scale (1R) memristor arrays53. HfSe2 offers advantages in scalability and integration with silicon-based devices for constructing 1S1R arrays. The restricted bandgap of these semiconductors allows for lower set and reset voltages, improving energy efficiency compared to hexagonal boron nitride (hBN)47. A thinner semiconductor layer (3 nm), as shown in device cross-section TEM (Supplementary Fig. 26), shows better switching characteristics than thicker hBN layers (6 nm)48.
In this work, we demonstrate a hardware CIM system that leverages heterogeneous integration technology and incorporates the design of low-energy peripheral circuits. The integrated crossbar array utilizes a 1S1R configuration, where each cell integrates a Si-based selector and a HfSe2-based memristor. To enable this integration, a low-temperature three-dimensional (3D) stacking process is employed, involving wafer-scale 2D HfSe2 synthesis and wafer-scale metal-assisted transfer techniques, positioning the 2D memristor above the Si-selector and thereby ensuring compatibility with complementary metal-oxide-semiconductor (CMOS) processes. As a result, we successfully fabricated a 1-kilobit (32 × 32) one-selector-one-memristor (1S1R) crossbar array (CBA). This integrated array exhibits significantly reduced sneak current and enhanced endurance when compared to other 2D materials-based memristors that lack access devices. In comparison to the current state-of-the-art 2D materials-based memristors integrated with selectors, our integrated array demonstrates a marked improvement in both response time and energy efficiency48,49. Furthermore, these enhancements are shown to be comparable to those achieved by conventional oxide-based memristors9,21,50–52, with the added advantages of superior switching voltage, faster switching and reduced thickness, as shown in Supplementary Table 11. Additionally, time-domain sensing peripheral circuits are designed using a time-to-digital converter (TDC) for energy-efficient reading and computing, which takes advantage of the rapid device speed. The CIM hardware system achieves full integration of the CBA with peripheral circuits and demonstrates high-accuracy CNN inference. Leveraging the nonlinear behavior of the TDC circuits, analog computing capabilities and in-built activation functions are developed, enhancing energy efficiency during computation for the CNN.
Finally, this work suggests that semiconductors with limited bandgaps provide improved voltage and energy efficiency, opening routes for future low-voltage, high-speed memristor applications in logic and storage.
System architecture and device heterogeneous integration
The fully integrated CIM system based on a 1S1R CBA is shown in Fig. 1a. Implemented on a printed circuit board (PCB), the system consists of several modules, including a power supply module, a digital-to-analog converter (DAC) module, a time-domain sensing module, a field-programmable gate array (FPGA) control module, and a MAC computing module. Figure 1b illustrates the role of the Advanced RISC Machine (ARM) controller within the FPGA evaluation board, which is responsible for encoding input data and decoding output data. The FPGA control unit is programmable with program, read, and compute modes to generate encoded control signals for the DAC and to collect output signals with 16 parallel 16-bit counters for the time-to-digital converter (TDC). Further information on the FPGA programming is provided in Supplementary Fig. 1. Acting as a voltage generator, the DAC module provides voltage pulses to the central CBA for programming, read, and MAC operations. The MAC compute module performs the analog vector-matrix-multiplication (VMM) operations in CNNs, leveraging the structure of the CBA21. Peripheral sensing circuits, including the current subtractor and TDC blocks, are integrated to enable the measurement of output currents from the central CBA. Detailed information about the system setup can be found in Supplementary Note 1.
The MAC module comprises a heterogeneously integrated 1S1R CBA that is wire-bonded to the PCB. In this work, a 32 × 32 CBA is utilized, as depicted in Fig. 1c. A zoomed-in optical image of the central region of the CBA is presented in Fig. 1d, revealing the word lines (WLs) as the horizontal top electrodes (TEs) and the bit lines (BLs) as the vertical bottom electrodes (BEs). To provide a schematic representation of a single cell within the CBA, Fig. 1e shows a cross-sectional view, highlighting the distinctive 3D-stacked structure between the Si-based selector and the HfSe2-based memristor. Corresponding cross-sectional transmission electron microscopy (TEM) images of the region marked by the red dotted rectangle are displayed in Fig. 1f. The inset shows the 2D HfSe2 layer with a thickness of 3 nm. The stacked structure is clearly visible in the low-magnification image, consisting of the BE/p-type Si/middle electrode (ME) layers forming the Si-based selector, and the ME/HfSe2/TE memristor stack. Energy-dispersive X-ray spectroscopy (EDX) confirms the elemental composition of each layer in the stacked structure (Supplementary Fig. 2). Further information on the fabrication process of the Si-selector is provided in Supplementary Fig. 3.
The schematic 3D structure of a single 1S1R cell within the CBA is depicted in Fig. 1g. For the Si-based selector, titanium nitride (TiN) is used to establish a Schottky barrier at the interface between the BE and the p-type Si. Additionally, nickel silicide (NiSi) is utilized to form an Ohmic contact between the middle electrode (ME) and the p-type Si. The HfSe2-based memristor follows the same Ti/HfSe2/Au structure as described in a previous study, where the Ti filaments can be formed or erased under external bias, resulting in resistive switching (RS) behavior53. Accordingly, the memristor operates in set mode when the selector is under reverse bias and in reset mode when the selector is under forward bias, as illustrated in the equivalent circuit diagram in Fig. 1h. This operation mode differs from that of the one-diode-one-memristor (1D1R) structure, which conventionally denotes a unipolar memristor in which the set and reset operations take place within the same voltage polarity (Supplementary Table 2). This distinctive operation prompts us to classify our integration as 1S1R instead of 1D1R. In addition, 1S1R integration offers advantages over 1D1R for ANN applications, including real-valued weight implementation, backward matrix-vector multiplication, and high resistance state stability (Supplementary Note 2). An optical image of the fabricated Si-based selector array is shown in Fig. 1i, while Fig. 1j displays a two-inch polycrystalline HfSe2 thin film grown at a high temperature of 750 °C53. Raman spectroscopy and X-ray photoelectron spectroscopy (XPS) confirm the chemical composition of 2D HfSe2 (Supplementary Figs. 4 and 5). X-ray diffraction and high-resolution TEM further confirm the layer-by-layer structure of the 2D HfSe2 (Supplementary Fig. 6).
To mitigate impurities and doping effects due to the utilization of polymers and liquid media during transfer, we adopt a metal-assisted dry transfer technique, wherein the 2D material is shielded by a metal layer (Supplementary Fig. 7). The maximum transfer temperature is controlled at 150°C to minimize any impact on the underlying Si-based selector. Notably, many existing transfer methods rely on manual craftsmanship rather than automated and controlled procedures, posing challenges for achieving wafer-scale 2D material transfer54. In this study, we successfully implement a wafer-scale transfer process facilitated by a wafer de-bonder machine (Supplementary Fig. 8). Furthermore, the Raman mapping results depict a uniform and consistent distribution of the A1g peak of HfSe2 before and after the transfer, thereby substantiating the existence of continuous and uniform HfSe2 thin film before and after the transfer (Supplementary Fig. 4). The complete CBA is finalized through interlayer oxide (SiO2) deposition, via etching, and WL metallization (Fig. 1k). This low-temperature integration process ensures compatibility with the BEOL processes in Si-based CMOS process flows. For detailed information on the fabrication process, refer to Methods.
Device characteristics of 1S1R CBA
To gain insights into the RS behaviors within the integrated 1S1R CBA, we begin by analyzing the characteristics of individual 1S1R devices. Figure 2a depicts the current-voltage (I-V) curve of a representative Si-based selector. Here, a voltage is applied between the middle electrode (ME) and the bottom electrode (BE), while the resulting current is measured across the bulk p-type Si. Subsequently, we investigate the memristor located on top of the stack by applying a voltage at the top electrode (TE) while keeping the ME grounded. The measured RS I-V curve (Fig. 2b) exhibits behavior consistent with previously reported values for the same structure, including the switching voltage (-0.6 V), switching ratio (50 times), and reset current (1 mA)53, thereby confirming the proper functionality of the integrated memristor. Further examination of the RS characteristics of the integrated device is presented in Fig. 2c. The dotted lines represent DC sweep measurements conducted over multiple cycles. Notably, during the reset process in the positive voltage regime, the I-V curves exhibit significant overlap, underscoring minimal cycle-to-cycle variation when reading the resistance states of the device. Reverse current engineering is conducted for the Si-selector, and the results are summarized in Supplementary Table 3, indicating that identifying the optimal Si-based selector, which balances a sufficiently large rectification ratio with an appropriate reverse current, is imperative for the success of our integration. By utilizing the Si-based selector at the bottom, the current during the set operation is effectively limited, leading to self-compliance behavior. The increased set voltage in the integrated 1S1R device arises from the reverse-biased Si diode dropping a significant portion of the voltage during the set process. However, Supplementary Fig. 
9 shows the small switching voltages of the memristors (1R) with minimal device variation, indicating the low-voltage advantage of the 2D HfSe2-based memristors (Supplementary Table 1). It should be noted that the primary objective of this study is to demonstrate heterogeneous integration between 2D material memristors and Si selectors, with a focus on exploring the statistical behavior of devices within the 1S1R CBA (discussed later). The choice of Si-based diodes over transistors is motivated by practical considerations, particularly the ease of implementation within university laboratories. Potential improvements could be achieved through the utilization of foundry-fabricated transistors, facilitating a 1T1R integration approach (a 1T1R demonstration using a foundry FinFET is shown in Supplementary Fig. 24). In such a configuration, the on-state transistor would incur a notably lower voltage drop than a reverse-biased diode. In addition, 1T1R integration is suitable for solving the cross-talk issue when the array density is high, such as at the 10 nm technology node (Supplementary Table 4).
Our HfSe2-based 1R memristor exhibits rapid response speed, with switch and read times as short as 1 ns (Fig. 2d, 2e and Supplementary Fig. 10). For the integrated 1S1R cell, the switch and read times are 45 ns and 60 ns, respectively (Supplementary Fig. 11). Although such a response speed is considered typical for standalone memristors, it represents a notably swift operation within the context of 2D materials-based 1T1R devices and self-rectifying memristors. To substantiate this assertion, we have conducted a comparative analysis of our work against other system-level implementations utilizing memristors in either 1S1R or 1T1R configurations. The benchmark comparison is provided in Supplementary Tables 5 and 6. We attribute the accelerated speed to two main factors. First, the fast Ti ion diffusivity facilitates rapid filament formation within 20 ns, as has also been reported in other 2D materials53,55. Second, the 1S1R CBA design incorporates a small overlap area and maintains low parasitic capacitance, effectively mitigating any additional latency that could potentially arise from integrated devices (Supplementary Fig. 12 and Supplementary Note 3)56. In addition, the integration of the Si-based selector in the 1S1R cell effectively controls the current during the set process. This prevents the overshoot phenomenon, in which the current escalates to excessively high levels and may cause deep low resistance states (LRS) or device breakdown. As shown in Fig. 2f, precise conductance adjustment during device programming is achieved through a single-boundary closed-loop pulsing scheme. When the device conductance falls below the set target conductance (330 µS), a negatively ramped stair pulse is applied to set the cell.
Once the conductance exceeds the set target, positive reset pulses are subsequently applied to reset the conductance to the reset target (13 µS). The detailed single-boundary closed-loop pulse scheme is shown in Supplementary Note 4. To validate the increase/decrease in conductance due to Ti within the device, conductive atomic force microscopy (CAFM) measurements were conducted, with different electrodes shown in Supplementary Fig. 27. Supplementary Fig. 13 shows that the observed change in conductance is attributed to the combined effect of an increase/decrease in the number of filaments and the growth/erasure of each individual filament. To further control the variation of resistance at HRS and LRS, a double-boundary closed-loop pulse scheme is developed (Supplementary Note 5). This closed-loop scheme is repeated for 26,500 cycles without device failure, and the endurance results are presented in Fig. 2g. Notably, the endurance of our integrated 1S1R cell surpasses that of other 2D materials-based memristors without selector integration23,27,57,58. In comparison, a HfSe2 memristor with the same structure but lacking selector integration readily experiences breakdown due to the lack of an appropriate compliance current during pulse programming (Supplementary Fig. 14), underscoring the improved endurance achieved in our integrated 1S1R cell. Furthermore, Supplementary Fig. 15 demonstrates that individual devices exhibit a substantial switching ratio of 40 times through the double-boundary closed-loop programming. This observed phenomenon is consistently replicable across ten distinct devices. The primary reason for this is the high-speed operation of the memristor during set/reset cycles, which requires an additional device such as a selector or transistor in series to rapidly limit the current. We validated this by applying 1 ns program pulses of ± 1 V to a 1R device, achieving an endurance of 1 million cycles (Supplementary Fig. 25).
The device also demonstrates stable non-volatile retention (up to 10⁴ seconds) at 85 °C, highlighting its potential for CIM applications where weight storage in the CBA requires reliable retention (Supplementary Fig. 16)21. The comparison between our device and other conventional memristors is shown in Supplementary Table 7 and Supplementary Note 6.
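The single-boundary closed-loop programming described above can be sketched in a few lines of Python. This is only an illustrative sketch: the `ToyCell` class, its pulse response, the staircase start and step voltages, and the reset amplitude are hypothetical and are not taken from the measurements. Only the two target conductances (330 µS and 13 µS) come from the text.

```python
import random

SET_TARGET_US = 330.0    # set target conductance from the text (µS)
RESET_TARGET_US = 13.0   # reset target conductance from the text (µS)


class ToyCell:
    """Hypothetical stand-in for a 1S1R cell; not a physical device model."""

    def __init__(self, conductance_us=50.0, seed=0):
        self.g = conductance_us
        self.rng = random.Random(seed)

    def read(self):
        return self.g

    def pulse(self, voltage):
        # Assumed behavior: negative pulses raise conductance (set),
        # positive pulses lower it (reset), with some random jitter.
        jitter = self.rng.uniform(0.8, 1.2)
        if voltage < 0:
            self.g += 60.0 * abs(voltage) * jitter
        else:
            self.g = max(1.0, self.g - 0.5 * self.g * voltage * jitter)


def program_cycle(cell, v_start=-0.5, v_step=-0.1, v_reset=1.0, max_pulses=100):
    """Run one set/reset cycle and return (set_pulses, reset_pulses)."""
    set_pulses = 0
    v = v_start
    # Set phase: negative staircase, read after each pulse, stop at the target.
    while cell.read() < SET_TARGET_US and set_pulses < max_pulses:
        cell.pulse(v)
        v += v_step  # ramp the staircase to a more negative amplitude
        set_pulses += 1
    reset_pulses = 0
    # Reset phase: positive pulses until the reset target is reached.
    while cell.read() > RESET_TARGET_US and reset_pulses < max_pulses:
        cell.pulse(v_reset)
        reset_pulses += 1
    return set_pulses, reset_pulses
```

Each phase checks the conductance against a single boundary, which is why the final value can land anywhere beyond the target; a double-boundary variant would additionally bound the overshoot.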
We proceeded with an investigation into the RS behavior within the CBA. To the best of our knowledge, such extensive integration utilizing both 2D material and Si platforms has not been reported, and the performance characteristics of this large-scale integration have not been thoroughly examined to date. With the Si-based selector, we effectively address the sneak current issue (Fig. 2h), in which the programming of a selected cell inadvertently affects half-selected and non-selected cells in a CBA without adequate control over the leakage current (Fig. 2i)29. As shown in Fig. 2h, only the selected cell experiences changes in conductance under external pulses, while other regions of the CBA remain unaffected. The detailed voltage pulsing program is described in Supplementary Fig. 17. It is worth noting that in our memristor array without selector integration, even with the V/2 voltage scheme applied to the 1R array, nearby cells still experience disturbance (Supplementary Fig. 18), which demonstrates the importance of the Si-based selector in mitigating sneak current59. Moreover, the utilization of a Si-based selector, coupled with the mitigation of the sneak current issue in the CBA, enables precise programming of individual devices within the CBA. This precision facilitates a detailed statistical analysis, including an examination of device variations and the yield of the as-fabricated CBA. Such analyses are challenging to conduct in passive memristor 1R arrays, as devices are disturbed during the measurement of nearby devices due to uncontrollable sneak current in the passive array. To assess the 1S1R CBA, we first employed a "2D-Si NUS" pattern to set weights across the entire array using multiple cells. In Fig. 2j, the bright cells represent those set to LRS, while the background cells correspond to HRS.
Subsequently, we measured the resistance of 273 cells in different columns across the array, as shown in Fig. 2k. However, discrepancies exist between the ideal pattern and the measured pattern, due primarily to analog noise processes (to be discussed in the subsequent section) and to variations in HRS and LRS originating from the single-boundary closed-loop programming (Supplementary Note 4). The variations in HRS and LRS can be reduced by fine-tuning the set and reset target ranges using the double-boundary closed-loop logic60,61. Supplementary Fig. 19 illustrates the device variation of 40 devices randomly selected from the CBA, each switched for 50 cycles by the double-boundary closed-loop logic. The device-to-device variation is extracted and compared with that of conventional oxide-based memristors, showing small variation and high uniformity of the array (Supplementary Table 8). The uniform RS ratio of 7.5 times and minimal variation of the devices across these cycles are evidenced by the consistent HRS and LRS. Supplementary Table 9 benchmarks the RS ratio in the 1S1R CBA against other system-level memristor CBA demonstrations, and our RS ratio is found to be comparable with those studies. During wafer-level integration, it is anticipated that the variation within a single die remains manageable. The primary source of variation is likely to be die-to-die differences arising from the transfer process. This variability can be mitigated by adapting the pulse programming range for each die. Figure 2l presents a fault mapping of the devices in a 32 × 32 1S1R CBA, revealing a commendable yield of 89.0% assessed across 992 devices. As shown in Supplementary Table 5, our work has significantly improved both response time and energy efficiency compared to other 2D material-based memristors with selectors.
This advancement helps close the performance gap between 2D material-based memristors and oxide-based memristors. Furthermore, due to the low-voltage advantage of the ultra-thin 2D HfSe2 film, computations based on a 1T1R configuration demonstrate the potential for substantial improvement in energy efficiency, reaching 1309.1 TOPS W⁻¹. This surpasses the performance of oxide-based memristors (detailed calculations of energy efficiency are presented in Supplementary Tables 5 and 10 and Supplementary Note 7). In contrast to conventional technologies that offer thousands of conductance states62, our device demonstrates a 3-bit weight programming capability, as depicted in Supplementary Fig. 20. We contend that this range of 3–4 bits suffices for our intended applications, and exceeding this bit range would be superfluous. Moreover, resolving higher-bit weights would necessitate a system equipped with a very high precision analog-to-digital converter (ADC), inevitably resulting in increased power consumption.
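To make the half-selected and non-selected cells mentioned above concrete, the following sketch computes the ideal voltage across every cell under the standard V/2 biasing scheme (selected word line at V, selected bit line at 0, all other lines at V/2). It is a simplified illustration that ignores line resistance and the selector; the function name and the example programming voltage are hypothetical.

```python
def half_bias_voltages(rows, cols, sel_row, sel_col, v_prog):
    """Ideal voltage across each cell (word line minus bit line) under V/2 biasing."""
    word_lines = [v_prog if r == sel_row else v_prog / 2 for r in range(rows)]
    bit_lines = [0.0 if c == sel_col else v_prog / 2 for c in range(cols)]
    return [[word_lines[r] - bit_lines[c] for c in range(cols)] for r in range(rows)]


# Example for a 32 x 32 array with an assumed 1 V programming amplitude.
drops = half_bias_voltages(32, 32, sel_row=3, sel_col=5, v_prog=1.0)
flat = [v for row in drops for v in row]
n_selected = sum(1 for v in flat if v == 1.0)       # the one target cell
n_half_selected = sum(1 for v in flat if v == 0.5)  # cells sharing its row or column
n_unselected = sum(1 for v in flat if v == 0.0)     # all remaining cells
```

In this idealized picture the 62 half-selected cells still see half of the programming voltage, which is consistent with the disturbance reported for the 1R array and with the need for a selector in each cell.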
Time-domain parallel array readout
To enhance the functionality and energy efficiency of peripheral circuits in the CIM system, we have designed specialized sensing circuits equipped with current subtractor and sensing units. A time-to-digital converter (TDC) approach was chosen instead of an analog-to-digital converter (ADC) approach. This choice is motivated by the benefits of current summation in the array columns, which enables low-power readout in the time domain63. As shown in Fig. 3a and Supplementary Fig. 21, each column pair in the array is equipped with a column subtraction unit that enables differential reading, facilitating the implementation of negative weights in CNN kernels23,53,64,65. These readings are then processed through the TDC unit. We have employed a time delay technique with explicit lumped capacitors to extract the current subtraction results. In our hardware implementation, an FPGA-based TDC counter measures the time taken for the voltage of the column lumped capacitor to discharge from the charged state to a reference voltage (Vref). Figure 3b shows an optical image of the fabricated PCB containing the current subtractor and TDC units. As our CBA consists of 32 columns, we have integrated a total of 16 current subtractors and 16 TDC-based sensing circuits into the CIM hardware system (Fig. 1a).
Figure 3c presents the sensed time, measured in tens of nanoseconds, corresponding to the output current from the CBA, demonstrating rapid sensing and 2.5-fold lower power compared to conventional ADC sensing circuits (Fig. 3d)10,66. In addition, energy consumption in the time domain is comparatively lower since only a single sample is required, as opposed to multiple samples in the frequency domain. The efficiency of the TDC relies on a sufficiently large summed current resulting from the MAC operation, owing to the inverse relationship between current and time conversion. Leveraging the fast device operation, fast sensing, and low-power sensing circuits, low-energy sensing in the nanojoule range is achieved. The detailed energy consumption of the TDC sensing circuit is shown in Supplementary Fig. 22. However, in the actual implementation on the PCB, data-dependent errors may occur due to factors such as distributed bit cell capacitances and variations in the discrete components. Figure 3e depicts the measured time-current relationship, with the extrapolated curve exhibiting a typical nonlinear decreasing trend. Although TDC sensing exhibits a nonlinear response to current, we have observed that the impact of this nonlinearity on the accuracy of CNNs in practical applications is negligible (discussed in the subsequent section).
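The inverse relationship between column current and sensed time can be illustrated with a first-order model. The sketch below assumes a constant-current discharge of the lumped capacitor, so that t = C (V0 − Vref) / I, followed by a counter that saturates at 16 bits as in the FPGA counters described earlier. The capacitance, voltages, and clock frequency are hypothetical placeholders, and the real circuit (with distributed capacitances and component variation) will deviate from this ideal curve.

```python
def discharge_time(current_a, cap_f=100e-12, v_start=1.0, v_ref=0.5):
    """Time for the capacitor to fall from v_start to v_ref at constant current."""
    if current_a <= 0:
        raise ValueError("current must be positive")
    return cap_f * (v_start - v_ref) / current_a


def tdc_counts(current_a, f_clk_hz=200e6, bits=16, **circuit):
    """Clock cycles counted during the discharge, saturating at the counter width."""
    t = discharge_time(current_a, **circuit)
    return min(int(t * f_clk_hz), 2 ** bits - 1)
```

With these assumed values a 1 mA column current discharges the capacitor in about 50 ns, while halving the current doubles the time. The reciprocal curve is what gives the readout its nonlinear, decreasing time-current characteristic.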
Biological nervous systems rely on the parallel processing capabilities of multiple receptor neurons, allowing outputs to be generated simultaneously (Fig. 3f)67,68. In our CIM system, we have implemented a biologically-inspired parallel computing scheme. Current readings of all columns are initiated simultaneously when the TDC receives a start signal, and they are terminated by separate stop signals whose timing depends on the current-dependent delay (Fig. 3g). On the other hand, the diverse properties of different kernels in a neural network pose a challenge when it comes to optimizing a single type of CBA that efficiently handles all of them. For example, the output current of the MAC operation through a larger kernel typically exhibits a larger current and faster response than that through smaller kernels. Hence, optimal utilization of the CBA needs further investigation (Fig. 3h)69.
Hardware implementation of CNN
To demonstrate the capabilities of our hardware CIM system for AI applications, we have constructed a complete-hardware CNN that consists of four 3 × 3 convolution kernels and four output neurons (Fig. 4a). The input images used in this experiment are binary representations of various English letters. To evaluate the classification accuracy, we directly compare the network's outputs with the labels of the input patterns. The input patterns include four stylized letters ('n', 'u', 'z', and 's') and four sets of nine noisy versions of each letter, formed by flipping one of the pixels of the original image (Fig. 4b). Because of the limited size of the dataset, these 40 patterns are used for both training and testing. It is noteworthy that previous experimental demonstrations of ANNs, particularly in the field of hardware implementations, have employed relatively small datasets, consisting of 30 images or even fewer, for both training and testing purposes70. In our present study, our emphasis lies in showcasing the potential of the 2D material platform in executing a fully hardware compute-in-memory demonstration. We acknowledge that utilizing identical datasets for both training and testing carries a risk of overfitting. Future investigations should explore the expansion of the array size, or the usage of multiple CBAs within the same system, to accommodate more intricate datasets21. The flowchart in Fig. 4c illustrates the training and inference process employed in this CNN demonstration. In our hardware implementation, we initially perform ex-situ training of the CNN and extract the weights of the convolution kernels (refer to the Methods section for more details). To program the analog weights into our binary-capable 1S1R CBA hardware, we perform weight quantization, in which the weight values are updated to '-1', '0', and '1'. The quantized weight matrix is shown in Fig. 4d.
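The weight quantization step can be illustrated with a simple threshold rule that maps each trained weight to -1, 0, or 1. This is a minimal sketch under an assumption: the text does not state how the quantization boundaries were chosen, so the symmetric `threshold` parameter and the example kernel values below are hypothetical.

```python
def quantize_ternary(weights, threshold):
    """Map each weight to -1, 0, or 1 using a symmetric threshold (assumed rule)."""
    return [
        [1 if w > threshold else -1 if w < -threshold else 0 for w in row]
        for row in weights
    ]


# Hypothetical trained 3 x 3 kernel, for illustration only.
trained_kernel = [
    [0.8, -0.05, -0.9],
    [0.2, 0.0, -0.4],
    [0.6, 0.31, -0.1],
]
quantized_kernel = quantize_ternary(trained_kernel, threshold=0.3)
```

Weights with a magnitude below the threshold collapse to 0, so the choice of threshold trades sparsity against fidelity to the trained kernel.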
After extracting the kernel weights, they are transferred onto our hardware by programming selected cells in the CBA to represent the desired conductance values corresponding to the weights. The CBA employs the column differential pair method to represent the kernel weights, where two adjacent cells in different columns represent the weights using the equation: W1i = G1i+ – G1i−, where G1i+ and G1i− represent the conductance of the two neighboring cells (Fig. 4e). The subtraction operation is performed using the current subtractor in the peripheral circuits (Fig. 3a). In this demonstration, we used four 3 × 3 kernels, requiring a total of 72 (4 × 9 × 2) cells in the CBA to generate 36 differential pairs that represent the weight values. Figure 4f provides an example of the hardware inference process within the CBA. The input voltage vectors, encoded based on the input patterns, are sent to the CBA, and the MAC output currents are obtained simultaneously in different columns. The current subtraction between column pairs, such as I1+ and I1−, occurs outside the CBA.
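The column differential pair mapping and the resulting MAC readout can be sketched as follows. The conductance levels reuse the set and reset targets quoted earlier (330 µS and 13 µS). The mapping of a zero weight to two high-resistance cells, the 0.1 V read voltage, and the four example kernels are assumptions made for illustration rather than details taken from the hardware.

```python
G_LRS = 330e-6  # S, set target conductance from the text
G_HRS = 13e-6   # S, reset target conductance from the text


def weight_to_pair(w):
    """Return (G+, G-) for a ternary weight so that W is proportional to G+ - G-."""
    if w > 0:
        return (G_LRS, G_HRS)
    if w < 0:
        return (G_HRS, G_LRS)
    return (G_HRS, G_HRS)  # assumed encoding of a zero weight


def differential_mac(voltages, kernel):
    """Column currents I+ and I- summed by Kirchhoff's law, then subtracted."""
    pairs = [weight_to_pair(w) for w in kernel]
    i_pos = sum(v * g_pos for v, (g_pos, _) in zip(voltages, pairs))
    i_neg = sum(v * g_neg for v, (_, g_neg) in zip(voltages, pairs))
    return i_pos - i_neg


def classify(voltages, kernels):
    """Pick the output neuron (column pair) with the largest differential current."""
    currents = [differential_mac(voltages, k) for k in kernels]
    return max(range(len(currents)), key=currents.__getitem__), currents


# Hypothetical flattened 3 x 3 ternary kernels for 'n', 'u', 'z', 's'.
kernels = [
    [1, 1, 1, 1, -1, 1, 1, -1, 1],
    [1, -1, 1, 1, -1, 1, 1, 1, 1],
    [1, 1, -1, -1, 1, -1, -1, 1, 1],
    [-1, 1, 1, -1, 1, -1, 1, 1, -1],
]
V_READ = 0.1  # V, assumed read amplitude
pattern_n = [1, 1, 1, 1, 0, 1, 1, 0, 1]
label, currents = classify([V_READ * p for p in pattern_n], kernels)
cells_used = sum(2 * len(k) for k in kernels)  # two cells per weight
```

Because classification only depends on which differential current is largest, moderate programming error in the individual conductances does not necessarily change the predicted label.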
We now summarize the results obtained from the CNN inference on our hardware system. Figure 4g and Fig. 4h illustrate the map of the kernel weights in terms of differential conductance within the hardware CBA. Initially, prior to the weight transfer step, the weights are randomly distributed (Fig. 4g). After the weight transfer process, the conductances of the cells are programmed to the desired values based on the extracted kernel weights (Fig. 4h). Next, we present the results of the CNN inference in Fig. 4i-l. Figure 4i shows the direct output currents sensed from our CIM system, where each color represents the current from one column differential pair corresponding to a neuron output in the final layer of the CNN (Fig. 4a). The CNN recognizes the input pattern by identifying the neuron with the highest output current. Figure 4j provides examples of both correct and incorrect classification results. In the upper panel, for the letter 'n', the first neuron exhibits the highest output current, correctly recognizing the letter 'n'. Conversely, in the bottom panel, the largest current occurs at the neuron corresponding to the letter 'z', whereas the true label of the input is 's'. Figure 4k shows the MAC output error relative to the output current obtained through arithmetic calculation. It should be noted that although errors exist due to sensing noise and device variation, the classification accuracy is not significantly affected, because we choose the index of the maximum output as the classified pattern. Figure 4l summarizes the classification results, comparing pure software training, CNN with quantized kernels, and hardware CNN inference. The hardware CNN achieves the same accuracy as the software implementation with quantized weights (97.5%), implying that the increase in the number of misclassified patterns in this CNN implementation is due to the weight quantization.
Importantly, this CNN hardware demonstration performs all data encoding, MAC operations, and output sensing within our CIM system, without any postprocessing or simulation after classification. This showcases a complete hardware implementation of the CNN based on our CIM system. The high accuracy achieved by this CNN demonstration, combined with the fast device and low-latency sensing circuits, highlights the potential of our CIM system for real-time processing applications71.
To meet the requirements of neural networks in real-world scenarios, where analog input images are common and data transfer is resource-intensive, we have extended the capabilities of our CIM system to incorporate analog computing functionality and in-built activation functions72. Figure 5a depicts the CNN structure employed to evaluate the performance of analog computing and the hardware activation functions. The hardware implementation is confined to the initial layer of the network, which converts the input image (28 × 28) into feature maps (4 × 26 × 26). This layer uses four 3 × 3 kernels, requiring a total of 72 devices for weight mapping onto the CBA hardware, following the same method as shown in Fig. 4. Each kernel requires 9 × 2 devices because of the column differential method. The subsequent layers of the CNN depicted in Fig. 5 are simulated in Python using TensorFlow. Figure 5b shows the outputs of the analog convolution operation obtained with both software and our hardware CBA. The analog hardware CBA output consists of the output currents measured directly from the hardware CBA during the convolution operation, and therefore accounts for both device-to-device and cycle-to-cycle variations within the CBA. In addition, this output does not rely solely on the nonlinear properties of a single device. Conversely, the analog software output is simulated based on single-device conductance data and executes an ideal arithmetic multiply-and-accumulate operation. The close resemblance between the software-processed images and the output images from the hardware CBA validates the accuracy of analog computing. Discrepancies between the analog hardware CBA output and the software output arise from nonideal sneak currents in the CBA and from the programming error in the conductance of each device.
Additional information regarding the configuration for analog computing, including pixel quantization and voltage encoding, is provided in Supplementary Fig. 23.
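The feature-map dimensions quoted above follow from sliding a 3 × 3 kernel over a 28 × 28 image with unit stride and no padding, which leaves 28 − 3 + 1 = 26 positions per axis. The pure-Python sketch below reproduces only this ideal arithmetic; it assumes unit stride and no padding (inferred from the 26 × 26 output size), uses placeholder image and kernel values, and does not model conductance mapping, voltage encoding, or device variation.

```python
def conv2d_valid(image, kernel):
    """Stride-1, no-padding sliding-window MAC (cross-correlation form)."""
    k = len(kernel)
    rows = len(image) - k + 1
    cols = len(image[0]) - k + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(k) for j in range(k))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def first_layer(image, kernels):
    """Apply every kernel to the same image, giving one feature map per kernel."""
    return [conv2d_valid(image, kernel) for kernel in kernels]


# Placeholder data: a uniform 28 x 28 image and four identical 3 x 3 kernels.
image = [[1.0] * 28 for _ in range(28)]
kernels = [[[0.5] * 3 for _ in range(3)] for _ in range(4)]
maps = first_layer(image, kernels)
print(len(maps), len(maps[0]), len(maps[0][0]))  # prints "4 26 26"
```

Each output element is one nine-term MAC, which is the operation a single column differential pair would evaluate in the CBA for one image patch.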
Additionally, considering the nonlinear response and monotonically decreasing trend of our TDC-based sensing circuits, we leverage these circuits to incorporate in-built hardware activation functions within the CNN. This integration of activation functions reduces data transfer during forward propagation in artificial neural networks and enhances the overall energy efficiency of the system72. Figure 5c illustrates the effectiveness of our activation functions based on nonlinear responses. We compare the recognition accuracy obtained with the rectified linear unit (ReLU), the Sigmoid function, and our TDC-based activation function, while maintaining the same CNN structure described in Fig. 5a and keeping the other layers identical during software simulation. Our in-built activation function achieves a higher accuracy (95.48%) than Sigmoid, indicating superior recognition performance (Fig. 5c and 5d). Following the software simulation illustrated in Fig. 5c, we extract the four 3 × 3 kernels and transfer their weights into the CBA to evaluate the accuracy of hybrid computing, which uses both hardware and software components. In hybrid computing with TDC-based activation, the analog convolution operation and the subsequent TDC-based activation functions are executed in hardware, while the subsequent layers of the CNN are implemented in software. In contrast, in hybrid computing with ReLU and Sigmoid activations, only the analog convolution is performed within the hardware CBA, and the activation functions are defined through mathematical equations in software. Figures 5e and 5f demonstrate that the introduction of hardware analog convolution has a negligible impact on CNN recognition accuracy, which remains at 95.02% compared with 95.48% in software simulation.
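The activation options compared above differ only in the element-wise function applied to each feature map after the convolution. The sketch below shows ReLU and Sigmoid in their standard forms. The transfer curve of the TDC-based sensing circuit is not given in this section, so `decreasing_stand_in` is a purely hypothetical placeholder that shares only the two stated properties (nonlinear and monotonically decreasing); its exponential form and the parameter `x0` are assumptions and do not represent the measured response.

```python
import math


def relu(x):
    """Rectified linear unit."""
    return max(0.0, x)


def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def decreasing_stand_in(x, x0=1.0):
    """Hypothetical monotonically decreasing nonlinearity.

    NOT the measured TDC curve; used only to show where a hardware
    activation would plug into the pipeline.
    """
    return math.exp(-x / x0)


def apply_activation(feature_map, activation):
    """Apply an element-wise activation to a 2D feature map."""
    return [[activation(value) for value in row] for row in feature_map]
```

In the hybrid scheme with TDC-based activation, the step corresponding to `apply_activation` would be performed by the sensing circuit itself, whereas for ReLU and Sigmoid it remains a software operation on the sensed convolution outputs.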
Supplementary Video 1 provides a detailed process flow, starting from input images to the recognized output patterns, and includes additional examples.