Algorithm
ARM aims to search for frequent itemsets (items that tend to co-occur) and to find association rules (patterns of co-occurrence between items within a frequent itemset). Our study aimed to develop a method that would enable i) searching for frequent itemsets in a given dataset containing continuous values and ii) finding association rules between frequent itemsets detected in different given datasets, thereby associating two heterogeneous datasets through a limited number of items. The workflow of this algorithm is shown in Fig. 4a.
Let I = {i1, i2, …, in} with n binary attributes be a set of “items” called the “itemset”. Let T = {t1, t2, …, tm} be a set of m observations in which each t has I. When an observation has k items (I has k ones and (n - k) zeros), the itemset is called a k-length itemset. An association rule is defined next, where X (called the antecedent) and Y (called the consequent) co-occur, with X ⊆ I, Y ⊆ I, and X ∩ Y = ∅:
X → Y. (1)
First, the Apriori algorithm evaluates the frequencies of 1-length itemsets (itemsets containing only one item), and those whose frequency does not satisfy the user-specified minimum support are pruned. Support is a score that represents the frequency of X, Y, or the co-occurrence of X and Y. Given X → Y, support can be expressed as follows:
support (X → Y) = (number of observations containing both X and Y)/m. (2)
Next, all possible (k + 1)-length itemsets are generated from the k-length itemsets, and those containing a k-length itemset whose support is smaller than the user-specified minimum support are pruned. These processes are iterated until no new frequent itemsets are found. Using this procedure, frequent itemsets (those with support higher than the user-specified minimum support) are detected.
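The following is a minimal sketch of this pruning loop for crisp (binary) data, not the authors' implementation; the function name, the frozenset representation, and the toy transactions are illustrative assumptions.

```python
from itertools import combinations

def apriori_sketch(transactions, min_support):
    """Minimal crisp Apriori; transactions is a list of sets of item labels."""
    m = len(transactions)
    support = lambda itemset: sum(itemset <= t for t in transactions) / m

    # frequent 1-length itemsets
    universe = {i for t in transactions for i in t}
    frequent = {frozenset([i]) for i in universe
                if support(frozenset([i])) >= min_support}
    result, k = set(frequent), 1
    while frequent:
        # join k-length frequent itemsets into (k + 1)-length candidates
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k + 1}
        # prune candidates with an infrequent k-length subset, then by support
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k))}
        frequent = {c for c in candidates if support(c) >= min_support}
        result |= frequent
        k += 1
    return result

# apriori_sketch([{"a", "b"}, {"a", "c"}, {"a", "b", "c"}], 0.6)
# frequent itemsets: {a}, {b}, {c}, {a, b}, {a, c}
```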
Thereafter, association rules are generated by searching for an antecedent and a consequent within each frequent itemset. Several scores are commonly used, such as lift, which can be expressed as follows:
lift (X → Y) = support (X → Y)/{support(X) * support(Y)}. (3)
- Fuzzy association rule mining
Conventional ARM approaches assume that input data contain categorical attributes. However, the data that we handle can be quantitative or a mixture of qualitative and quantitative data. Therefore, quantitative attributes are converted into categorical attributes by setting thresholds for quantization (e.g., when threshold1 and threshold2 are given, with threshold1 < threshold2, a quantitative attribute is assigned to one of three categories: i) ≤ threshold1, ii) between threshold1 and threshold2, or iii) ≥ threshold2).
This forms the crisp data; however, the procedure results in loss of information. To solve this problem, fuzzy logic is introduced in the Apriori algorithm.
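For instance, such crisp quantization can be expressed with pandas; the thresholds and values below are toy placeholders:

```python
import pandas as pd

values = pd.Series([0.2, 1.4, 3.1, 0.9, 2.6])  # toy quantitative attribute
t1, t2 = 1.0, 2.0                              # threshold1 < threshold2 (placeholders)

# each value falls into exactly one crisp category, losing how close it was
# to either threshold
categories = pd.cut(values,
                    bins=[float("-inf"), t1, t2, float("inf")],
                    labels=["<=threshold1", "threshold1..threshold2", ">=threshold2"])
print(categories.tolist())
```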
A fuzzy set is defined as “a class of objects with a continuum of grades of membership,” and quantitative attributes are converted into several categories with “membership values” ranging from 0 to 1 (e.g., a quantitative attribute lying exactly at threshold1 is converted into i) the category “≤ threshold1” with a membership value of 0.5 and ii) the category “between threshold1 and threshold2” with a membership value of 0.5). The notions of union, intersection, and complement, which are used to calculate several important scores in ARM, such as support, can also be extended to fuzzy sets.
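As a brief illustration, the standard fuzzy extensions of these operators can be written with numpy (the membership vectors are toy values): intersection as the elementwise minimum, union as the maximum, and complement as one minus the membership.

```python
import numpy as np

# membership values of two fuzzy items across five observations (toy values)
x = np.array([0.9, 0.2, 0.5, 0.0, 1.0])
y = np.array([0.4, 0.8, 0.5, 0.3, 0.6])

intersection = np.minimum(x, y)  # degree to which both items apply
union = np.maximum(x, y)         # degree to which at least one item applies
complement = 1.0 - x             # degree to which x does not apply
```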
- Functions for calculating membership values
When converting quantitative attributes into fuzzy categorical sets, the main problem lies in defining the membership functions used to calculate membership values. Because membership values range from 0 to 1, min-max scaling, sigmoid transformation, or rank-based conversion is typically used. However, these methods reduce the differences in membership values between the most applicable category and the least applicable category, yielding data that are too fuzzy for the Apriori algorithm. Based on this preliminary observation, we designed novel membership functions, as described next.
・Histogram-based conversion
The frequency distribution of the quantitative data was converted into a histogram with a user-specified number of bins (Fig. 4b). The quantitative attributes are converted into three categories: “low,” “average,” and “high.” Their membership values for a quantitative attribute v can be expressed as given subsequently, where Fv is the frequency of the bin that includes v, FH is the frequency of the highest (most frequent) bin, bL is the lower boundary of the highest bin, and bH is its upper boundary.
Membership value for category “low” = 1 - Fv/FH if v < bL, (4)
0 otherwise. (5)
Membership value for category “high” = 1 - Fv/FH if v > bH, (6)
0 otherwise. (7)
The sum of the membership values for the categories “low,” “average,” and “high” is designed to be 1. However, the category “average” is not used for ARM because its items occur frequently and would prevent the detection of interesting association rules involving the categories “low” or “high.”
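A sketch of this conversion with numpy, under two assumptions consistent with equations (4)-(7): the “highest bin” is the modal (most frequent) bin, and “average” takes the remainder so that the three memberships sum to 1; the function name is illustrative.

```python
import numpy as np

def histogram_membership(values, n_bins):
    """Fuzzy 'low'/'average'/'high' memberships from a frequency histogram
    (equations 4-7); 'average' is taken as the remainder."""
    values = np.asarray(values, dtype=float)
    freq, edges = np.histogram(values, bins=n_bins)
    modal = freq.argmax()                       # the highest (most frequent) bin
    f_h = freq[modal]                           # F_H
    b_l, b_h = edges[modal], edges[modal + 1]   # its lower/upper boundaries

    # F_v: frequency of the bin containing each value
    idx = np.clip(np.digitize(values, edges) - 1, 0, n_bins - 1)
    f_v = freq[idx]

    low = np.where(values < b_l, 1.0 - f_v / f_h, 0.0)
    high = np.where(values > b_h, 1.0 - f_v / f_h, 0.0)
    average = 1.0 - low - high                  # the three memberships sum to 1
    return low, average, high
```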
・Z-score-based conversion
The frequency distribution of the quantitative data was converted into a standard normal distribution to obtain the z-scores (Fig. 4c). Because approximately 95% of normally distributed data is known to range from -2 to 2, the quantitative attributes are converted into three categories: “low,” “average,” and “high.” Their membership values for a quantitative attribute v can be expressed as follows:
Membership value for category “low” (v) = -z-score(v)/2 if -1 ≤ z-score(v)/2 < 0, (8)
1 if z-score(v)/2 < -1, (9)
0 otherwise. (10)
Membership value for category “high” (v) = z-score(v)/2 if 0 < z-score(v)/2 ≤ 1, (11)
1 if z-score(v)/2 > 1, (12)
0 otherwise. (13)
By analogy with the histogram-based conversion, the sum of the membership values for the categories “low,” “average,” and “high” is designed to be 1, and the category “average” is likewise not used for ARM because its items occur frequently and would prevent the detection of interesting association rules involving the categories “low” or “high.”
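A compact numpy sketch of equations (8)-(13), assuming the population standard deviation and treating “average” as the remainder; the function name is illustrative.

```python
import numpy as np

def zscore_membership(values):
    """Fuzzy 'low'/'average'/'high' memberships from z-scores (equations 8-13)."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()  # population standard deviation
    half = z / 2.0                               # ~95% of the data falls in [-1, 1]

    low = np.clip(-half, 0.0, 1.0)   # equations (8)-(10) in one expression
    high = np.clip(half, 0.0, 1.0)   # equations (11)-(13) in one expression
    average = 1.0 - low - high       # remainder, so the three memberships sum to 1
    return low, average, high
```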
- Other membership functions used for comparison
The formula of min-max scaling for the quantitative attribute v is as follows:
Membership value for category “Low” = 1 - (v - v_min)/(v_max - v_min), (14)
Membership value for category “High” = (v - v_min)/(v_max - v_min). (15)
The formula of sigmoid function for the quantitative attribute v is as follows:
Membership value for category “Low” = 1 - 1/(1 + e^(-v)), (16)
Membership value for category “High” = 1/(1 + e^(-v)). (17)
The formula of rank-based conversion for the quantitative attribute v is as follows:
Membership value for category “Low” = 1 - r(v)/n, (18)
Membership value for category “High” = r(v)/n, (19)
where r(v) is the rank of v, and n is the number of observations.
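Sketches of these three baselines with numpy (function names are illustrative; inputs are assumed to be one-dimensional arrays):

```python
import numpy as np

def minmax_membership(v):
    """Min-max scaling (equations 14-15); returns ('Low', 'High')."""
    v = np.asarray(v, dtype=float)
    high = (v - v.min()) / (v.max() - v.min())
    return 1.0 - high, high

def sigmoid_membership(v):
    """Sigmoid transformation (equations 16-17)."""
    v = np.asarray(v, dtype=float)
    high = 1.0 / (1.0 + np.exp(-v))
    return 1.0 - high, high

def rank_membership(v):
    """Rank-based conversion (equations 18-19)."""
    v = np.asarray(v, dtype=float)
    r = v.argsort().argsort() + 1  # r(v): 1-based ranks, ties broken arbitrarily
    return 1.0 - r / len(v), r / len(v)
```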
- Association of heterogeneous datasets
In the conventional ARM approach, association rules are generated “within” each frequent itemset. Our method aims to generate association rules whose antecedents are derived from one dataset and whose consequents are derived from the other, so that the detected association rules relate itemsets from different datasets to each other. With this purpose in mind, we developed a novel algorithm, as described subsequently.
Let I1 = {i1,1, i1,2, …, i1,p} of p attributes and I2 = {i2,1, i2,2, …, i2,q} of q attributes be sets of “items” and I1 and I2 be called “itemsets”. Let T1 = {t1,1, t1,2, …, t1,m} and T2 = {t2,1, t2,2, …, t2,m} be sets of m observations in which each of t1 and t2 has I1 and I2, respectively. T1 and T2 are assumed to have the same number of observations, and t1,a and t2,a (a∈{1, 2, …, m}) are associated with each other (e.g., t1,a: medical record of patient ID a, t2,a: gene expression profile of patient ID a). If T1 and/or T2 contain(s) quantitative attributes, the calculation of membership values for category “Low” and category “High” would be required as a pre-processing step.
First, the fuzzy Apriori algorithm detects frequent itemsets in T1 and T2 independently with a user-specified minimum support. Support can be expressed as follows:
support (X → Y) = Σ(a∈{1, 2, …, m}) min(X(a), Y(a))/m, (20)
where
X(a): a membership value of X in observation a, and
Y(a): a membership value of Y in observation a.
Second, association rules are generated such that antecedents are selected from the frequent itemsets detected in T1 and consequents from those detected in T2, or vice versa. Several scores can be used as thresholds to limit the number of rules in the output; for example, lift can be expressed as follows:
lift (X → Y) = support (X → Y)/{support(X) * support(Y)}, (21)
where
X: a frequent itemset constituted of I1,
Y: a frequent itemset constituted of I2,
support(X) = Σ(a∈{1, 2, …, m}) X(a) /m, (22)
and support(Y) = Σ(a∈{1, 2, …, m}) Y(a) /m. (23)
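A direct numpy transcription of equations (20)-(23) on toy membership vectors; for a multi-item itemset, its per-observation membership is assumed here to be the minimum over its items' memberships.

```python
import numpy as np

def fuzzy_support(x, y):
    """Equation (20): mean of the elementwise min of two membership vectors."""
    return np.minimum(x, y).mean()

def fuzzy_lift(x, y):
    """Equation (21): fuzzy support of X -> Y over the product of the
    individual supports (equations 22-23)."""
    return fuzzy_support(x, y) / (x.mean() * y.mean())

# x: per-observation membership of a frequent itemset from T1; for a
# multi-item itemset, the min over its items' memberships is assumed
x = np.array([0.8, 0.1, 0.9, 0.4])
y = np.array([0.7, 0.2, 1.0, 0.3])  # membership of an itemset from T2
print(fuzzy_support(x, y))  # 0.5
print(fuzzy_lift(x, y))     # ~1.65
```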
This novel algorithm would enable the identification of relevant items in heterogeneous datasets that are associated with each other.
- Pre- and post-processing of the liver toxicity dataset for the experiments
In the original biological data that we used for the experiment (liver toxicity data), histopathological observations were graded as “minimal,” “mild,” “moderate,” or “marked.” For this experiment, these grades were converted into “0,” “1,” “2,” and “3,” respectively. The authors of the original paper reported acetaminophen doses of 50 and 150 mg/kg body weight as subtoxic and doses of 1500 and 2000 mg/kg body weight as severely toxic [17]. Hence, the column of dose levels in data2 was converted into a binary attribute representing the toxicity level (50 and 150 mg/kg body weight: 0; 1500 and 2000 mg/kg body weight: 1). In addition, the column of time points in data2 was converted into a binary attribute (6, 18, and 48 h: 0; 24 h: 1) because 24 h was reported to be the time of peak toxicity, and rats were in a recovery phase after 48 h of acetaminophen treatment [17]. In data1 (gene expression profile), the genes were indicated by Agilent probe IDs. DAVID [37] was used to convert the Agilent probe IDs into Entrez gene IDs and gene names.
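A sketch of this pre-processing with pandas; the file path and column names (finding_grade, dose_mg_per_kg, time_h) are hypothetical placeholders for the actual dataset's labels.

```python
import pandas as pd

data2 = pd.read_csv("data2.csv")  # hypothetical file and column names

# histopathological grades -> ordinal scores
severity = {"minimal": 0, "mild": 1, "moderate": 2, "marked": 3}
data2["finding_grade"] = data2["finding_grade"].map(severity)

# dose: 50/150 mg/kg (subtoxic) -> 0; 1500/2000 mg/kg (severely toxic) -> 1
data2["toxic_dose"] = data2["dose_mg_per_kg"].isin([1500, 2000]).astype(int)

# sampling time: 24 h (peak toxicity) -> 1; 6, 18, and 48 h -> 0
data2["peak_time"] = (data2["time_h"] == 24).astype(int)
```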
Experiments
The AI Bridging Cloud Infrastructure (ABCI) operated at the National Institute of Advanced Industrial Science and Technology (AIST), Japan, was used for the experiments.
Implementation
Our method is implemented in Python 3 and depends on the pandas, joblib, and os modules. Frequent itemsets were detected using a modified version of the apriori function from the mlxtend Python module, adapted to incorporate fuzzy logic.
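For reference, the unmodified (crisp) mlxtend function is called as below on a one-hot encoded DataFrame with toy column names; the fuzzy adaptation replaces the internal support computation and is not reproduced here.

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori

# one-hot encoded observations with toy column names (crisp, unmodified interface)
df = pd.DataFrame({"ALT_high":   [1, 0, 1, 1],
                   "ALT_low":    [0, 1, 0, 0],
                   "geneX_high": [1, 0, 1, 0]}).astype(bool)
frequent = apriori(df, min_support=0.5, use_colnames=True)
print(frequent)  # a DataFrame with 'support' and 'itemsets' columns
```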