Animals
Male C57BL/6 mice were obtained from the Animal Center of National Taiwan University (NTU) School of Medicine. Akt1 heterozygous (HET) male mice and their wild-type (WT) littermates were bred from Akt1 HET pairs. PV-Cre (Jax-008069) and GAD-Cre (Jax-010802) mice were used to study the specific roles of cell types in the 2C task. All mice were on a C57BL/6 background and genotyped via PCR of tail DNA. They were housed individually with ad libitum food and water, starting experiments at 2–3 months old. Mice were handled and weighed daily for one week before experiments. All animal procedures adhered to protocols approved by the Animal Care and Use Committees at NTU.
Two-choice probabilistic task (2C task)
The 2C task was adapted from a dynamic foraging task used previously in humans and mice (9, 20, 48). As shown in Fig. 1A, this task featured a two-alternative forced-choice paradigm with one lever offering high-rate rewards and the other low-rate rewards, conducted daily over 45-minute sessions. Trial counts and choice outcomes were recorded using Graphic State 4.2.03 software (Coulbourn Instruments). The experimental protocol included shaping, surgery and recovery, reshaping, and testing phases, detailed below.
The shaping and reshaping phases
Animals underwent food (or water in Experiment 1) restriction to 85% of their original body weight and locomotor activity assessment before the shaping phase. Each shaping stage lasted 45 minutes, with mice advancing to the next stage daily upon meeting criteria.
The surgery and recovery phase
Experiments 1, 3, 4, and 6 included surgery and recovery. Under isoflurane anesthesia (1.5%), mice underwent stereotaxic surgery with skull burr holes drilled. Procedures included neurotoxin microinjection, electrode implantation, or viral microinjection as dictated by experimental conditions. Post-surgery, spontaneous locomotor activity was assessed in an open field using EthoVision video tracking (Noldus Information Technology).
For Experiment 1, lesions targeted the dorsomedial striatum (DMS; AP, 0.5 mm; ML, ± 1.5 mm; DV, -3.0 mm), dorsolateral striatum (DLS; AP, 0.5 mm; ML, ± 2.5 mm; DV, -3.0 mm), or nucleus accumbens (NA; AP, 1.8 mm; ML, ± 1.1 mm; DV, -4.7 mm) with NMDA solution infusion via Hamilton syringe. Post-operative analgesics were administered for 7 days.
In Experiment 3, electrode implants targeted the DMS region (AP, 0.5 mm; ML, ± 1.5 mm; DV, -3.0 mm) using a 4-electrode array. Electrodes were secured with dental cement and analgesics were provided for 7 days. Mice fully recovered before entering the reshaping phase in all experiments.
For viral microinjection surgery (Experiments 4 and 6), each mouse underwent bilateral microinjection of a virus mix targeting the DMS (AP, 0.5 mm; ML, ± 1.5 mm; DV, -3.0 mm; 0.6 µL per site). The virus mix consisted of AAV with Cre-inducible Gi-coupled human M4 muscarinic receptor (AAV-hsyn-DIO-hM4D(Gi)-mCherry, NTU AAV core) and AAV with Cre expression driven by the CMV promoter (AAV-CMV-Cre, NTU AAV core) in a 1:1 ratio. AAV groups received the full virus mix, while sham groups received AAV-CMV-Cre only, matched in volume to the virus mix. For PV-Cre mice, the injected AAV8 carried Cre-inducible diphtheria toxin A (AAV-mCherry-FLEx-DTA, UNC vector core). Mice remained in their home cage for 3 weeks post-surgery to allow for full virus expression and recovery before entering the reshaping phase of the 2C task.
The testing phase
Following the shaping phase (or reshaping phase in Experiments 1, 3, 4, and 6), mice entered the testing phase, aiming to achieve specific reward rates: 60%-20% for sucrose water in Experiment 1 and 80%-20% for food pellets in other experiments (Fig. 1A). Each 45-minute daily session comprised 3 to 6 blocks (each block with 10 trials). Sessions began with house and food magazine lights illuminating. A nose-poke initiated a trial, extinguishing the food magazine light. A 5-second fixed inter-trial interval (ITI) preceded insertion of stimulus-response levers. After the ITI, two levers were presented, and mice pressed one. Each press led to a reward or no-reward outcome, followed by food magazine illumination. Trials ended when the reward was collected or after a 5-second wait post-nose-poke. Mice learned through trial and error to identify the high reward rate lever. Completion criteria required achieving ≥ 70% accuracy in lever choice across three consecutive blocks, with an average accuracy > 75%. Mice had 2 weeks to meet these criteria; failure resulted in data exclusion.
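The completion criterion can be illustrated with a minimal Python sketch. It assumes one reading of the rule: each of three consecutive 10-trial blocks must reach at least 70% high-reward-lever choices, and the mean over those three blocks must exceed 75%. The function names and the Boolean input format are hypothetical and are not taken from the authors' software.

```python
from typing import List, Sequence

BLOCK_SIZE = 10  # trials per block, as stated in the text


def block_accuracies(correct: Sequence[bool], block_size: int = BLOCK_SIZE) -> List[float]:
    """Fraction of high-reward-lever choices in each complete block."""
    n_blocks = len(correct) // block_size
    return [
        sum(correct[i * block_size:(i + 1) * block_size]) / block_size
        for i in range(n_blocks)
    ]


def meets_criterion(correct: Sequence[bool],
                    per_block_min: float = 0.70,
                    mean_min: float = 0.75,
                    run_length: int = 3) -> bool:
    """True if any run of consecutive blocks satisfies the assumed criterion."""
    acc = block_accuracies(correct)
    for start in range(len(acc) - run_length + 1):
        window = acc[start:start + run_length]
        # Assumed reading: every block >= 70% AND window mean > 75%.
        if all(a >= per_block_min for a in window) and sum(window) / run_length > mean_min:
            return True
    return False


def make_block(n_correct: int) -> List[bool]:
    """Helper for examples: a block with n_correct high-reward choices."""
    return [True] * n_correct + [False] * (BLOCK_SIZE - n_correct)


# Example session: blocks with 7, 8 and 8 correct choices.
session = make_block(7) + make_block(8) + make_block(8)
print(meets_criterion(session))
```

Under this reading, three blocks at exactly 70% would not pass, because their mean does not exceed 75%.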
The analysis of choice strategy in the 2C task
Trial-by-trial choice data from all mice in the testing phase of the 2C task were recorded and analyzed for accumulated trials and choice strategy. The analysis of choice strategy encompassed four distinct strategies: win-stay, win-shift, lose-stay, and lose-shift. The ratio of each choice strategy was computed using custom R code by dividing the number of occurrences of that strategy by the total accumulated trials.
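The strategy ratios can be sketched as follows. This is an illustrative Python version, not the authors' R code. It assumes that a trial is labeled by the outcome of the preceding trial (win or lose) and by whether the same lever was chosen again (stay or shift), and that the denominator is the total number of trials as described above. The first trial has no predecessor and is therefore not assigned a strategy.

```python
from typing import Dict, Sequence

STRATEGIES = ("win-stay", "win-shift", "lose-stay", "lose-shift")


def strategy_ratios(choices: Sequence, rewards: Sequence) -> Dict[str, float]:
    """Ratio of each choice strategy over the total number of trials.

    choices: lever identity per trial (any hashable labels, e.g. "L"/"R").
    rewards: outcome per trial (truthy = rewarded, falsy = not rewarded).
    """
    if len(choices) != len(rewards):
        raise ValueError("choices and rewards must have the same length")
    if len(choices) == 0:
        raise ValueError("at least one trial is required")

    counts = {name: 0 for name in STRATEGIES}
    for t in range(1, len(choices)):
        outcome = "win" if rewards[t - 1] else "lose"
        action = "stay" if choices[t] == choices[t - 1] else "shift"
        counts[outcome + "-" + action] += 1

    total = len(choices)  # denominator: total accumulated trials
    return {name: counts[name] / total for name in STRATEGIES}


# Example: five trials on levers "L" and "R".
ratios = strategy_ratios(["L", "L", "R", "R", "L"], [1, 0, 0, 1, 0])
print(ratios)
```

Because the first trial cannot be classified, the four ratios sum to (n − 1)/n rather than 1 under this convention.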
Fitting a reinforcement learning model to behavioral data in the 2C task
To explore the mechanism governing reward prediction error (RPE)-driven choice behavior, we applied a reinforcement learning model to fit trial-by-trial behavioral data from mice engaged in the 2C task. Model fitting was performed using the RStan and hBayesDM R packages with custom code. Hierarchical Bayesian modeling with the MCMC algorithm estimated parameters from trial-by-trial choice data. Differences in parameters among mice were compared using posterior distribution values from the Bayesian estimation.
We applied a modified Q-learning model to examine how reward prediction error (RPE) affects and updates expectations. The model separates the learning rate (α) into αrew for rewarding results and αnor for no-reward results, determining the update speed of expected values. The model equations are as follows:
$$Q_c(t) = Q_c(t-1) + \alpha_{rew}\,\delta(t-1) + \alpha_{nor}\,\delta(t-1)$$
$$\delta(t-1) = R_c(t-1) - Q_c(t-1)$$
Here, αnor is set to 0 on reward trials, and αrew is set to 0 on no-reward trials.
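The update rule can be made concrete with a short Python sketch. Since only one of the two learning rates is active on any trial, the update reduces to selecting the learning rate by outcome. The function name, the coding of reward as 1 and no reward as 0, and the example values are assumptions for illustration.

```python
def update_q(q_chosen: float, reward: float,
             alpha_rew: float, alpha_nor: float) -> float:
    """One update of the expected value of the chosen lever.

    delta is the reward prediction error R_c - Q_c. alpha_rew applies on
    rewarded trials and alpha_nor on non-rewarded trials, which matches
    setting the other learning rate to 0 in the equation above.
    """
    delta = reward - q_chosen
    alpha = alpha_rew if reward > 0 else alpha_nor
    return q_chosen + alpha * delta


# Worked example with hypothetical parameters alpha_rew = 0.5, alpha_nor = 0.2.
q = 0.0
for outcome in (1.0, 0.0, 1.0):
    q = update_q(q, outcome, alpha_rew=0.5, alpha_nor=0.2)
    print(round(q, 4))
```

With these values, a reward moves the expectation halfway toward 1, while an omission moves it one fifth of the way toward 0.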
To characterize how the choice tendency is guided by the updated expectation, we assumed that the probability of choosing the previously selected lever, Pc(t), was determined by Boltzmann exploration, represented in a logistic form assigning a weight to each action, with Qnc denoting the expected value of the other lever:
$$P_c(t) = \frac{e^{\beta Q_c}}{e^{\beta Q_c} + e^{\beta Q_{nc}}}$$
Here, the parameter β denotes the choice consistency (choice perseveration or exploration/exploitation) parameter, describing the tendency to make actions guided by expected reward values.
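As a sketch of how the choice rule and the value update combine into a likelihood for one animal, the Python code below evaluates the two-lever Boltzmann rule in its equivalent logistic form and sums the log-probabilities of the observed choices. It is not the hBayesDM/Stan implementation. The initial expected value of 0 for both levers, the 0/1 coding of levers and rewards, and all function names are assumptions.

```python
import math
from typing import Sequence


def choice_probability(q_chosen: float, q_other: float, beta: float) -> float:
    """P_c = exp(beta*Qc) / (exp(beta*Qc) + exp(beta*Qnc)).

    Written as 1 / (1 + exp(-beta * (Qc - Qnc))), which is algebraically
    identical for two options.
    """
    return 1.0 / (1.0 + math.exp(-beta * (q_chosen - q_other)))


def log_likelihood(choices: Sequence[int], rewards: Sequence[float],
                   alpha_rew: float, alpha_nor: float, beta: float,
                   q_init: float = 0.0) -> float:
    """Log-likelihood of a choice sequence (levers coded 0 and 1)."""
    q = [q_init, q_init]  # assumed starting expectations for both levers
    total = 0.0
    for c, r in zip(choices, rewards):
        total += math.log(choice_probability(q[c], q[1 - c], beta))
        # Update only the chosen lever, with an outcome-specific learning rate.
        alpha = alpha_rew if r > 0 else alpha_nor
        q[c] += alpha * (r - q[c])
    return total


# Example with hypothetical parameter values.
print(log_likelihood([0, 0, 1, 0], [1, 1, 0, 1],
                     alpha_rew=0.4, alpha_nor=0.1, beta=3.0))
```

When β = 0 the rule ignores expected values and every choice has probability 0.5, so the log-likelihood equals n·log(0.5) for n trials.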
For MCMC analysis, both αrew and αnor were assigned a non-informative beta prior, Beta(1.2, 1.2), on the interval from 0 to 1. The choice consistency parameter β was assigned a Gaussian prior bounded between 0 and 10.
In vivo electrophysiological recording of the DMS
Measuring local field potentials (LFPs): In Experiment 3, LFPs in the DMS were recorded during the 2C task. Event time points were imported into MATLAB for event-related potential (ERP) analysis. Normalized LFPs were segmented into epochs spanning −1 to +1 second around each event: (1) Trial initiation (nose-poke to start), (2) Lever press (choice-making), and (3) Outcome (entering the food magazine for reward or no reward). This segmentation facilitated ERP component extraction for decision-making analysis.
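The epoch segmentation can be sketched in plain Python as shown below. The original analysis was done in MATLAB, so this is only an illustration. The sampling rate is not stated in the text and appears here as a hypothetical parameter, and dropping events whose window runs past the recording is an assumed convention.

```python
from typing import List, Sequence


def extract_epochs(signal: Sequence[float], event_times_s: Sequence[float],
                   fs_hz: float, pre_s: float = 1.0,
                   post_s: float = 1.0) -> List[List[float]]:
    """Cut a window from -pre_s to +post_s around each event time."""
    pre = int(round(pre_s * fs_hz))
    post = int(round(post_s * fs_hz))
    epochs = []
    for t in event_times_s:
        center = int(round(t * fs_hz))
        start, stop = center - pre, center + post
        if start < 0 or stop > len(signal):
            continue  # assumed convention: skip incomplete windows
        epochs.append(list(signal[start:stop]))
    return epochs


def average_epochs(epochs: Sequence[Sequence[float]]) -> List[float]:
    """Sample-by-sample mean across epochs (the event-related average)."""
    if not epochs:
        raise ValueError("no epochs to average")
    n = len(epochs)
    return [sum(samples) / n for samples in zip(*epochs)]


# Toy example: a 10 s ramp sampled at a hypothetical 10 Hz.
lfp = [float(i) for i in range(100)]
epochs = extract_epochs(lfp, event_times_s=[5.0, 0.5, 9.9], fs_hz=10.0)
print(len(epochs), len(epochs[0]))
```

Running the same function separately on trial-initiation, lever-press, and outcome timestamps would give one set of epochs per event type.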
Histological verification of electrode placement
After behavioral testing, mice were euthanized, and electrode positions were marked by passing current (10 µA, 30 sec) to create iron deposits, which were visualized with potassium ferrocyanide.
Inhibition of the DMS of Akt1 HET mice during the 2C task
To investigate the causal relationship between DMS neuronal activity and reward-related decision-making behavior, we employed chemogenetic modulation to directly inhibit DMS activity during the 2C task. Adult male HET and WT mice (90–100 days old, n = 4–5 per group) were used in Experiment 4. Following virus mixture injection (AAV-hsyn-DIO-hM4D(Gi)-mCherry + AAV-CMV-Cre), mice received clozapine N-oxide (CNO, 5 mg/kg, i.p.) 30 minutes before testing. CNO was freshly prepared in 1% DMSO saline. After meeting criteria, mice underwent 2-day CNO-off sessions to mitigate chronic injection effects (49).
RNA sequencing (RNA-seq) and validation
RNA Sample Collection
Left or right striatum was dissected from male HET and WT mice (90–100 days old, n = 4 each) in Experiment 5. RNA was extracted using TRIzol (Thermo Fisher) and QIAamp RNeasy Mini Kit (QIAGEN). Samples were quantified by Qsep100 capillary gel electrophoresis (RQN > 8.0), Nanodrop 2000 (260/280 ratio between 1.8 and 2.0, 260/230 > 2.0), and Qubit 3 Fluorometer (RNA concentration). Only high-quality RNA was used for RNA sequencing.
RNA-Seq Library Construction and Sequencing
Poly-A enriched libraries were prepared using the SureSelect Strand Specific RNA Library Prep Kit (Integrated Science) and sequenced on the Illumina MiniSeq system with an eight-base index for sample identification.
Analysis for RNA-Seq Data
Raw read quality was assessed with FastQC (Babraham Bioinformatics). Reads were aligned to the Mus musculus genome GRCm38 with Gencode vM25 annotation using STAR 2.7.6a (mapping rates > 98%). Alignment quality was checked with RSeQC, and gene-level read counts from featureCounts were expressed as transcripts per million (TPM). Differential expression analysis was performed, and volcano plots were generated, using limma in R.
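For reference, the standard transcripts-per-million normalization can be sketched as follows. The text does not describe how counts were converted, so this Python snippet shows only the usual definition: each count is divided by feature length, and the resulting rates are rescaled to sum to one million. The function name and the example values are hypothetical.

```python
from typing import List, Sequence


def counts_to_tpm(counts: Sequence[float], lengths_bp: Sequence[float]) -> List[float]:
    """Convert raw read counts to TPM using feature lengths in base pairs."""
    if len(counts) != len(lengths_bp):
        raise ValueError("counts and lengths must have the same length")
    # Length-normalized rate for each gene.
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    total = sum(rates)
    if total == 0:
        raise ValueError("all counts are zero")
    # Rescale so that the values sum to one million.
    return [rate / total * 1e6 for rate in rates]


# Toy example: the second gene has twice the reads but is twice as long.
print(counts_to_tpm([10, 20], [1000, 2000]))
```

The two genes in the example receive equal TPM values because their reads per base are identical.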
Gene selection and primer design: Target and reference genes were selected based on differential expression (top 10 by p-value), significant fold changes (log2FC > 2), and associations with schizophrenia, parvalbumin (PV) expression, or Akt1 function. Notable genes included Akt1, PV, GAD67, Calr, Ascl1, and Cldn5, with Gapdh as the reference gene. Primers were designed using PrimerQuest (Integrated DNA Technologies). The following primers were used in this experiment:
Akt1: forward, TCGTGTGGCAGGATGTGTAT; reverse, ACCTGGTGTCAGTCTCAGAGG.
Gapdh: forward, TGTGTCCGTCGTGGATCTGA; reverse, CCTGCTTCACCACCTTCTTGA.
Gad67: forward, CACAGGTCACCCTCGATTTTT; reverse, ACCATCCAACGATCTCTCTCATC.
Pvalb: forward, ATCAAGAAGGCGATAGGAGCC; reverse, GGCCAGAAGCGTCTTTGTT.
Calretinin: forward, TTTCAGGGTATGAAGCTGACCTC; reverse, TGACACTCTTCCTGTAGGTGGTG.
Cldn5: forward, GCAAGGTGTATGAATCTGTGCT; reverse, GTCAAGGTAACAAAGAGTGCCA.
Ascl1: forward, TTGAACTCTATGGCGGGTTC; reverse, CAAAGTCCATTCCCAGGAGA.
Reverse transcription quantitative real-time PCR (RT-qPCR): RNA was extracted as described above, and cDNA was synthesized using the LunaScript RT SuperMix Kit (#E3010, New England Biolabs). For qPCR, 0.5 µl of cDNA was used in a 10 µl reaction with SYBR Green I-based Luna Universal qPCR Master Mix (#M3003, New England Biolabs) on an Applied Biosystems StepOne qPCR machine. Threshold cycles (CT) were calculated, and relative expression was determined using the ΔΔCT method: ΔΔCT = (CTA – CTref) − (CTB – CTref); Relative expression = 2^(-ΔΔCT).
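The ΔΔCT calculation can be worked through with a small Python sketch. It assumes that A is the sample of interest and B the comparison sample, and that the reference gene (Gapdh) CT is measured separately in each sample. The text does not spell out these roles, and the CT values below are made up for illustration.

```python
def relative_expression(ct_target_a: float, ct_ref_a: float,
                        ct_target_b: float, ct_ref_b: float) -> float:
    """Relative expression of a target gene in sample A versus sample B.

    Each target CT is first normalized to the reference gene CT from the
    same sample (delta CT). The difference between the two delta CT values
    (delta-delta CT) is then converted to a fold change of 2^(-ddCT).
    """
    delta_a = ct_target_a - ct_ref_a
    delta_b = ct_target_b - ct_ref_b
    ddct = delta_a - delta_b
    return 2.0 ** (-ddct)


# Hypothetical CT values: the target crosses threshold one cycle later in A.
print(relative_expression(ct_target_a=25.0, ct_ref_a=18.0,
                          ct_target_b=24.0, ct_ref_b=18.0))  # 0.5
```

A one-cycle delay after normalization corresponds to half the expression, since each cycle ideally doubles the amplicon.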
Immunohistochemistry
PV interneurons were labeled by immunohistochemistry on 40 µm brain sections with an anti-PV antibody (1:250; Synaptic Systems). Neuronal density in the DMS was measured using NIH ImageJ.
Selective lesioning of PV interneurons in the DMS
To investigate the causal relationship between DMS PV interneurons and reward-related decision-making behavior, we selectively lesioned DMS PV interneurons with virally expressed diphtheria toxin A (DTA) in PV-Cre mice before the 2C task. Adult male PV-Cre mice and their WT littermates (90–100 days old) were used (n = 8–11 per group) in Experiment 6. The experimental schedule followed the previously described protocol, with virus injection (AAV-mCherry-FLEx-DTA) occurring 3 weeks before the task to allow for virus expression.
Data analyses and statistics
Data are presented as mean ± SEM. Behavioral data were analyzed using Student's t-test or one-way ANOVA for genotypic differences, and Mann-Whitney U test for choice strategy ratios. Effect sizes were measured by Cohen’s d (≥ 0.8, large effect) and rank-biserial r (maximum = 1). Pearson correlation evaluated relationships between behavioral data and neural oscillation power. Data with misplaced injections or electrodes were excluded. The two-sample Kolmogorov-Smirnov test was employed to reveal genotypic/group differences in the distribution of model parameters of the reinforcement learning model. A p-value below 0.05 was considered statistically significant.
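The two effect sizes can be illustrated with a minimal Python sketch using only the standard library. It assumes Cohen's d is computed with the pooled standard deviation and that the rank-biserial r is the difference between the proportions of favorable and unfavorable pairs across the two groups. The text does not state which variants were used, and the example data are made up.

```python
import math
import statistics
from typing import Sequence


def cohens_d(x: Sequence[float], y: Sequence[float]) -> float:
    """Cohen's d using the pooled sample standard deviation."""
    n1, n2 = len(x), len(y)
    pooled_var = ((n1 - 1) * statistics.variance(x) +
                  (n2 - 1) * statistics.variance(y)) / (n1 + n2 - 2)
    return (statistics.mean(x) - statistics.mean(y)) / math.sqrt(pooled_var)


def rank_biserial(x: Sequence[float], y: Sequence[float]) -> float:
    """Rank-biserial r: (pairs with x > y minus pairs with x < y) / all pairs.

    Ranges from -1 to 1, and tied pairs contribute to neither count.
    """
    greater = sum(1 for a in x for b in y if a > b)
    less = sum(1 for a in x for b in y if a < b)
    return (greater - less) / (len(x) * len(y))


# Made-up example data for two groups.
group_a = [2.0, 4.0, 6.0]
group_b = [1.0, 3.0, 5.0]
print(cohens_d(group_a, group_b), rank_biserial(group_a, group_b))
```

A rank-biserial r of 1 means every value in the first group exceeds every value in the second, which is the maximum noted above.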