Samples. Sample 1. Russia: Novgorod Prov., right bank of Luga river, nr. Maluy Volochek village, 58.486544 N 30.338933 E, galls on leaves of Fragaria viridis, 14 Sept. 2018, about 80 adult mites. Sample 2. Russia: Leningrad Prov., Luzhsky District, 184m E Beloye Ozero [Beloye Lake], 58.808028 N, 30.487389 E, galls on leaves of Fragaria viridis, Aug 10 2019, about 1500 specimens obtained by alcohol washing. Collection of plant material was done with relevant institutional, national, and international guidelines and legislation. This collecting was done as part of cooperative agreement № 075-15-2020-922 (Ministry of Science and Higher Education of the Russian Federation) for a non-endangered, non-psychoactive species growing on public land outside of protected areas; no additional permissions and/or licenses are required for these samples as per paragraph 11 of the Forest Code of the Russian Federation (No. 200-FZ).
Nucleic acid extraction and sequencing. DNA/RNA extraction and Illumina next generation sequencing were performed as detailed in Supplementary Methods.
Genomic assembly, decontamination, annotation, phylogenomic analyses. Mite metagenomic assembly was done in MetaSPAdes v. 3.13.0 66. It was decontaminated using MetaBat v. 2.12.167, BLAST, and Diamond 68, and annotated in Maker v.2.31.10 69 using the mite transcriptome and Ecdysozoa UniProtKB / Swiss-Prot proteins for gene prediction. See detail in Supplementary Methods. The phylogenomic tree was inferred using amino acid sequence data in a Maximum Likelihood framework in IQ-TREE 70 with alignment matrices prepared based on the BUSCO v.4 arachnida_odb10 database 71 and custom utility scripts. See detail in Supplementary Methods.
Gall-ID. For identification of gall-inducing bacteria in samples 1 and 2 using raw reads, we used SRST2 72 and Gall-ID databases 73 , with the minimum gene coverage parameter set to 50% and maximum divergence parameter set to 10%: srst2 --input_pe $input_file_reads_forward $input_file_reads_reversed --max_divergence 10 --min_coverage 50 --log --output $out --gene_db $input_file_gene_db --threads $proc --report_all_consensus. Gall-ID databases have either functional genes known to be part of gall-inducement pathways or house-keeping genes (16S rDNA) that can be used to identify gall-inducing bacteria. Since the use of a majority rule consensus sequence (the Gall-ID default) is unreliable in the presence of multiple similar bacterial species, we also conducted validation of our Gall-ID results: (i) reads mapped on target genes in the Gall-ID databases were extracted (samtools fastq − 1 forward.fq -2 reverse.fq -s singletons.fq -0 other.fq in.bam), (ii) and assembled in SPAdes (for PE reads: spades.py -1 forward.fq -2 reverse.fq -s singletons.fq -t $proc; for SE reads: spades.py -s $extracted.reads.fq -t $proc -k 127), (iii) SPAdes contigs were then classified by BLAST.
K-mer-based metagenomic profiling. Taxonomic classification of raw reads was made by Kraken2 v.2.0.874. A custom 35-mer database was built from six standard Kraken databases (archaea, bacteria, fungi, human, protozoa, and viral) plus the genomes of Fragaria vesca (GenBank accession GCF_000184155.1), Albugo laibachii Nc14 (GenBank BioProject accession: PRJEA53219), Fragariocoptes setiger (JAIFTH000000000, assembled here), Wolbachia endosymbiont of Fragariocoptes setiger (JAHRAF000000000, assembled here) and the Illumina PhiX technical sequence. Taxonomic classification was done with a confidence scoring threshold value of 0.1, which performed well in identifying Illumina PhiX technical sequences (not reported). In addition, this approach also substantially decreases the number of false positive classifications 75 . Using Kraken utilities scripts (KrakenTools), we converted standard Kraken report files to MetaPhlAn format and then combined these converted files as follows: kreport2mpa.py -r $kraken_report -o $kraken_report.mpa; combine_mpa.py -i kraken_report1.mpa,kraken_report2.mpa -o kraken.mpas.combined.txt. To estimate relative abundances, we also tried Bracken 76 , using the read length value as appropriate, 150 bp (sample 1) and 250 bp (sample 2). However, this analysis produced spurious results, e.g., the read proportion for Enterobacteriaceae were seemingly overestimated: 310.6 times (Salmonella) and 299.2 times (Escherichia) higher in sample 1 as compared to the Kraken data. Abundances of these taxa were also substantially overestimated in sample 2. Since these unusually high abundances were not supported by any other analyses (Kraken, singleM, BLAST, see below), we do not report Bracken analyses here.
Our initial Kraken analysis yielded a large number of OTUs, suggesting that many reads were probably overclassified 75. For example, there was a total of 1,124 genera, including 975 bacterial genera. We believe that such a large diversity is biologically unrealistic and we used a combination of Kraken confidence filtering (0.1, see above) and an abundance cutoff (≥ 0.0338%) as suggested in the literature 75. For comparison, we also ran an analysis with a lower abundance cutoff (≥ 0.0005%).
To calculate the taxonomic intersection (shared OTUs in samples 1 and 2), bacterial genera with a fraction of reads ≥ 0.0338% at least in one sample were selected, and then these data were used to create Venn diagrams (OTU counts) and abundance heatmaps. For Bacteria, this analysis yielded a total of 83 genera in both samples and 21 genera in the sample intersection, which we consider biologically meaningful, and so we present this as our main result. These data were clustered based on Euclidean distances and visualized as heatmaps in TBTools 77. For comparison, abundance heatmaps were also generated for all organisms using a lower abundance cutoff value (≥ 0.0005%), yielding 171 classified genera.
Identifying common OTUs across samples: Read mapping on marker genes. We identified OTUs shared across the two samples based on mapping of raw reads on 14 marker, single-copy genes in SingleM 30 . This is an assembly-free, largely taxonomy-independent approach. The use of a subset of single-copy genes eliminates issues associated with copy-number variation in ribosomal genes (16S, 23S), plasmids, and transposable elements. The following commands were used to create OTU tables from both samples, combine them and cluster OTUs: singlem pipe --forward $f1.fq --reverse $f2.fq --otu_table $dna_tbl --threads $proc; singlem summarise --input_otu_tables $dna_tbl $rna_tbl --output_otu_table dna_rna.combined.otu_table.csv; singlem summarise --input_otu_tables dna_rna.combined.otu_table.csv --cluster --clustered_output_otu_table clustered.otu_table.csv.
Unique representative sequences (used as "OTUs" in SingleM) shared across the two samples were filtered. This dataset was used to construct a heatmap (see the next subsection), where: (i) percentages of the average read counts across the 14 marker genes were used for each OTU in both samples; (ii) gene count, which is indicative of the data completeness, was recorded and visualized on the heatmap.
Identifying common OTUs across samples: Assembly intersection. Common OTUs present in the two NGS samples could also be identified via read assembly for each sample followed by assembly intersection. Intersection was done by standalone BLAST where sample 1 contigs were the query and sample 2 contigs were the subject. Matches having ≥ 98% similarity and bitscore ≥ 500 were then classified by BLAST. OTUs having > = 96% similarity with GenBank nucleotide (nt) database were classified at the species level, while all other matches were classified at the genus level (top hits were reported in case of multiple matches). This approach generated a subset of contigs shared between the two assemblies, and these contigs were identified by their sequence similarity (not by taxonomic labels). This methodology effectively minimizes the effect of high misclassification rates due to incompleteness of the GenBank databases 78 . For each contig, read-based coverage was calculated in bbduk (bbmap.sh ref=$ref in1=$f1 in2=$f2 covstats = covstats.txt), and the number of mapped reads per classified OTUs were recorded. To minimize the influence of rDNA (which may have multiple copies per genome), plasmids, and mobile/transposable elements (which may occur in multiple species), BLAST results were checked and edited. Final read count data were normalized by calculating read percentages. It is important to emphasize that this procedure was done only for OTUs present in the intersection (so these data can only be interpreted in the context of the subset of OTUs present in both samples). Then these data were log2-transformed, and a heatmap was constructed in TBtools v.1.0 77. In this heatmap, each OTU was labeled with (i) an intersection bitscore, which is indicative of the magnitude of common matches across the two metagenomic samples for a given OTU, and (ii) average percent identity with the closest GenBank matches. Clustering was done using Euclidean distances. We did this analysis using only bacterial taxa. Among fungi, some of which also can cause galls, we found only a single dominant taxon, the family Erysiphaceae (powdery mildews, which do not produce galls). Best matches were Cystotheca wrightii AB120747.1 (18S identity = 100% bitscore = 813) followed by Podosphaera pannosa AB525937.2 (rDNA identity = 99.88% bitscore = 1,578) and Podosphaera leucotricha JAATOF010000279.1 (mt-DNA identity = 99.9%, bitscore = 5,723).
Phytohormones and horizontal gene transfer. The annotated mite assembly was searched for the following major plant/bacterial genes responsible for the production of phytohormones and enzymes involved in growth regulatory metabolism: 1-aminocyclopropane-1-carboxylic acid (ACC) deaminase, 2,3-butanediol, abscisic acid, acetoin, auxin, brassinosteroids, cytokinin, ethylene, gibberellic acid, gibberellin, indole, indole-3-acetic acid (IAA), jasmonate, jasmonic acid, salicylic acid, strigolactone 79–81.
TEM microscopy. Mites were fixed in 2.5% glutaraldehyde and 0.1M cacodilate buffer (Sigma C0250) for four hours, washed with the same buffer, treated with 1% tetroxide osmium for 1 hour, washed twice in distillated water, dehydrated in increasing ethanol series (30, 50,70, 96, 100%) and embedded in SPURR resin (Sigma EM0300-1KT). Thin sections were prepared using a Leica EM UC7 ultramicrotome, contrasted with saturated solution of uranyl acetate 20 min and lead citrate for 5 min, and photographed using a Jeol JEM-1400 electron microscope.
Fluorescence in situ hybridization (FISH). We used three oligonucleotide probes, each with a distinct fluorophore label: Eub 338 5'-GCTGCCTCC CGTAGGAGT-3' specific to Bacterial 16S rDNA gene (except for Planctomycetales и Verrucomicrobia), and two probes specific to Agrobacterium tumefaciens: 16S.1722F.Agr.tum 5'-TGTCCTTCAGTTAGGCTGGC-3' and 16S.907F.Agr.tum 5'-AATTAATACCGCATACGCCC-3'. Two fluorescent labels were used: CY3 (indocarocyanine 3) and FITC (5(6)-Carboxyfluorescein); DAPI (Invitrogen) was used for no probe control experiments. See additional detail in Supplementary Methods.
Data submitted to GenBank. Fragariocoptes setiger. Genomic assembly. GenBank BioSample id: SAMN13972306 accession: JAIFTH000000000.
Mite-associated organisms, metagenomic assembly. GenBank BioSample id: SAMN13981716, accession: JAALJN000000000.
Mite and associated organisms, metagenome, Illumina short reads. Short read archive (SRA) BioSample id: SAMN13981716, run selector: SRR11015813
Mite and associated organisms, metatranscriptome, Illumina short reads. Short read archive (SRA) BioSample id: SAMN13991554, run selector: SRR11026779.
Osperalycus tenerphagus AD1672, genomic assembly. DDBJ/ENA/GenBank accession JAGGCA000000000.
Speleorchestes sp. AD1671, genomic assembly. DDBJ/ENA/GenBank accession number JAGHQN000000000.
Wolbachia endosymbiont of Fragariocoptes setiger. Genomic assembly. GenBank accession: JAHRAF000000000, BioSample id: SAMN19370650.