In this section, we use the independent factors listed in Table 2 to build a decision tree model. The dependent variable is “Enforcement final status”, which takes two values: “Comply/Closed” and “Failed”. As a white-box model, the decision tree can explain the degree of influence of each independent variable on the dependent variable.
Except for “Violation code”, which is represented as a theme-document matrix, the independent variables are encoded as dummy variables: 1 indicates that the category is present and 0 that it is absent. The full set of variables used in the decision tree is shown in Fig. 5 below:
Table 3 lists the top 35 features that influence the enforcement results of violations. “Violation date”, “County”, “Enforcement code description”, the violation themes obtained by LDA, “Inspection type”, “Inspection result description”, and “Inspection source” are the seven most important groups of factors. Because Table 3 cannot show how these factors relate to one another, we use the decision tree diagram to show the patterns among them.
Table 3
Key factors for enforcement results

| Rank | Feature name | Feature importance |
|---|---|---|
| 1 | Violation date | 0.191363 |
| 2 | County Susquehanna | 0.121717 |
| 3 | Enforcement code: CACP - Consent Assessment of Civil Penalty | 0.089137 |
| 4 | Topic 4 | 0.075278 |
| 5 | County Somerset | 0.059117 |
| 6 | County Greene | 0.047737 |
| 7 | County Washington | 0.046644 |
| 8 | Enforcement code: NOV - Notice of Violation | 0.037634 |
| 9 | Topic 2 | 0.035327 |
| 10 | Topic 3 | 0.034699 |
| 11 | County Wyoming | 0.023694 |
| 12 | Topic 5 | 0.022939 |
| 13 | Inspection type: Incident - Response to Accident or Event | 0.019401 |
| 14 | Inspection source: EFACTS | 0.01854 |
| 15 | Inspection result: Violation(s) & Outstanding Violations | 0.016991 |
| 16 | County Sullivan | 0.014569 |
| 17 | County Armstrong | 0.012632 |
| 18 | County Westmoreland | 0.012374 |
| 19 | County Bradford | 0.011777 |
| 20 | Enforcement code: ADORD - Administrative Order | 0.010974 |
| 21 | Topic 1 | 0.010586 |
| 22 | Inspection type: Compliance Evaluation | 0.009899 |
| 23 | Inspection type: Follow-up Inspection | 0.009394 |
| 24 | Inspection type: Routine/Partial Inspection | 0.009326 |
| 25 | Violation type: Administrative | 0.009228 |
| 26 | Inspection source: SUBSAIR | 0.008389 |
| 27 | Enforcement code description: COA - Consent Order and Agreement | 0.006555 |
| 28 | Inspection type: Routine/Complete Inspection | 0.006465 |
| 29 | County Lycoming | 0.005136 |
| 30 | Enforcement code description: EHBO - Environmental Hearing Board Order | 0.005134 |
| 31 | Inspection type: drilling/alteration | 0.004463 |
| 32 | County Huntingdon | 0.003708 |
| 33 | County Tioga | 0.003408 |
| 34 | Inspection type: Complaint Inspection | 0.003225 |
| 35 | County Clearfield | 0.002538 |
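To see how much each group of factors contributes overall, the importances in Table 3 can be summed per variable group. The following is a minimal sketch: the values are copied from Table 3, while the group labels and the `group_importance` helper are illustrative names and not part of the original analysis.

```python
from collections import defaultdict

# (group, feature, importance) triples copied from Table 3.
# The group labels are an illustrative assumption about how features cluster.
TABLE_3 = [
    ("Violation date", "Violation date", 0.191363),
    ("County", "Susquehanna", 0.121717),
    ("Enforcement code", "CACP", 0.089137),
    ("LDA topic", "Topic 4", 0.075278),
    ("County", "Somerset", 0.059117),
    ("County", "Greene", 0.047737),
    ("County", "Washington", 0.046644),
    ("Enforcement code", "NOV", 0.037634),
    ("LDA topic", "Topic 2", 0.035327),
    ("LDA topic", "Topic 3", 0.034699),
    ("County", "Wyoming", 0.023694),
    ("LDA topic", "Topic 5", 0.022939),
    ("Inspection type", "Incident", 0.019401),
    ("Inspection source", "EFACTS", 0.01854),
    ("Inspection result", "Violation(s) & Outstanding Violations", 0.016991),
    ("County", "Sullivan", 0.014569),
    ("County", "Armstrong", 0.012632),
    ("County", "Westmoreland", 0.012374),
    ("County", "Bradford", 0.011777),
    ("Enforcement code", "ADORD", 0.010974),
    ("LDA topic", "Topic 1", 0.010586),
    ("Inspection type", "Compliance Evaluation", 0.009899),
    ("Inspection type", "Follow-up Inspection", 0.009394),
    ("Inspection type", "Routine/Partial Inspection", 0.009326),
    ("Violation type", "Administrative", 0.009228),
    ("Inspection source", "SUBSAIR", 0.008389),
    ("Enforcement code", "COA", 0.006555),
    ("Inspection type", "Routine/Complete Inspection", 0.006465),
    ("County", "Lycoming", 0.005136),
    ("Enforcement code", "EHBO", 0.005134),
    ("Inspection type", "drilling/alteration", 0.004463),
    ("County", "Huntingdon", 0.003708),
    ("County", "Tioga", 0.003408),
    ("Inspection type", "Complaint Inspection", 0.003225),
    ("County", "Clearfield", 0.002538),
]


def group_importance(rows):
    """Sum feature importances per group, sorted from largest to smallest."""
    totals = defaultdict(float)
    for group, _feature, importance in rows:
        totals[group] += importance
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for group, total in group_importance(TABLE_3):
        print(f"{group:<20} {total:.6f}")
```

Summing per group treats the dummy variables of one categorical factor as a single factor, which is one plausible way to compare a factor such as “County” with a single-column factor such as “Violation date”.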
The decision tree constitutes a sequence of “if-then” conditions and is read from top to bottom (Fig. 6). After pruning, the tree has 8 levels of depth; level 0 is the root node (depth 0, top), whose splitting feature is “County Susquehanna”. Consider a training sample. The root node tests whether the county is Susquehanna (a Boolean dummy variable split at 0.5). If the value is not greater than 0.5, the county is not Susquehanna and the sample moves down to the left child of the root (depth 1, left 1). There the tree asks whether the violation occurred before 2011; if so, the sample moves down to the left child (depth 2, left 1). This node is a leaf, so we read off its predicted class: the decision tree predicts that the violation will be resolved (class: Comply/Closed, Gini: 0.143).
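The traversal described above can be sketched as a walk over a small tree structure. This is only an illustration: the `Node` class, the field names `county_susquehanna` and `violation_year`, and the exact form of the year test are assumptions. Only the single path described in the text is filled in; all other branches are placeholders.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Node:
    """One node of a binary decision tree (hypothetical structure)."""
    test: Optional[Callable[[dict], bool]] = None  # None marks a leaf
    left: Optional["Node"] = None    # followed when test(sample) is True
    right: Optional["Node"] = None   # followed when test(sample) is False
    label: Optional[str] = None      # predicted class, set on leaves only


# Placeholder for branches that the text does not describe.
UNSPECIFIED = Node(label="(branch not described in the text)")

tree = Node(
    # Root: dummy variable split at 0.5, True means "not Susquehanna".
    test=lambda s: s["county_susquehanna"] <= 0.5,
    left=Node(
        # Depth 1, left 1: did the violation occur before 2011? (assumed form)
        test=lambda s: s["violation_year"] < 2011,
        left=Node(label="Comply/Closed"),  # depth 2, left 1 (leaf)
        right=UNSPECIFIED,
    ),
    right=UNSPECIFIED,
)


def predict(node: Node, sample: dict) -> str:
    """Follow the if-then tests from the root until a leaf is reached."""
    while node.test is not None:
        node = node.left if node.test(sample) else node.right
    return node.label


if __name__ == "__main__":
    print(predict(tree, {"county_susquehanna": 0, "violation_year": 2009}))
```

Each internal node holds one test, so a prediction is the sequence of test outcomes along a single root-to-leaf path, which is what makes the model readable as a chain of conditions.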
The “sample” attribute of each node counts the training examples that reach it. For example, 4063 training samples reach the node at depth 1, left 1 (the county is not Susquehanna), and 526 training samples reach the node at depth 1, left 2 (the county is Susquehanna). The “value” attribute gives the number of training examples of each class at the node: for example, the root node (depth 0, top) contains 3627 cases of Comply/Closed and 962 cases of Failed. Finally, the “Gini” attribute measures a node’s impurity: if all training examples at the node belong to the same class, the node is “pure” (Gini = 0).
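As a worked check of the “Gini” attribute, the impurity of a node can be computed from its “value” counts as one minus the sum of the squared class proportions. The sketch below applies this standard definition to the root-node counts reported above; the function name `gini` is illustrative.

```python
def gini(counts):
    """Gini impurity of a node given the number of samples per class."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


if __name__ == "__main__":
    # Root node: 3627 Comply/Closed and 962 Failed (4589 samples in total).
    print(round(gini([3627, 962]), 3))  # about 0.331
    # A pure node, where every sample belongs to one class.
    print(gini([10, 0]))  # 0.0
```

For two classes the impurity ranges from 0 (pure) to 0.5 (an even split), so the leaf values quoted in this section, such as 0.143, indicate nodes dominated by a single class.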
The colors of the decision tree nodes represent the classes: purple nodes represent “Comply/Closed” and yellow nodes represent “Failed”. The more saturated the color, the better the classification (the higher the purity). The pruned decision tree reveals some instructive relationships between the influencing factors and the final status of violations. When the outcome is a failure, there are two judgment chains with notably small Gini coefficients. The first is: violation date is after 2012 → enforcement code description is “Notice of Violation” → county is Somerset → violation final status is “Failed”. The second is: violation date is after 2012 → enforcement code description is “Notice of Violation” → county is Washington → the violation LDA topic is “Erosion and Sediment” → violation final status is “Failed”.
There are also two judgment chains for successfully handled violations with low Gini coefficients (high purity). The first is: county is not Susquehanna → violation date is after 2011 → enforcement code is not “Notice of Violation” → violation final status is “Comply/Closed”. The second is: county is Susquehanna → enforcement code is “Consent Assessment of Civil Penalty” → violation final status is “Comply/Closed”.
From the above judgment chains we can obtain the following implications for policymakers and practitioners:
First, the LDA classification of violation codes helps policymakers focus on the relevant environmental regulations. In the pruned decision tree, the LDA topics that appear as node features are Topic 4 “Erosion and Sediment” and Topic 5 “Water Pollution”. When a violation involves either of these two themes, most cases (more than 50%) end in failed enforcement.
Second, economic punishment is not a significant feature for the successful enforcement of violations. Judging from the higher-purity sub-nodes, factors such as “Enforcement code”, “Violation date”, “County”, and the LDA topics are closely associated with successful enforcement.
Third, the enforcement outcomes of violations differ across counties, and our analysis identifies the counties where the differences are most pronounced. The counties with high feature importance in the pruned decision tree are Susquehanna, Somerset, Greene, and Washington. Susquehanna had a significantly higher success rate for handling violations under financial punishment (Gini = 0.122), and it also had a higher success rate for violations before 2008. Most violations in Greene and Somerset counties after 2012 were not handled successfully. After 2012, enforcement in Washington generally failed for violations of the “Erosion and Sediment” type (Gini = 0.165).