Data-Driven Analysis: A Comprehensive Study of CPS Case Outcomes in 42 English Counties (2014-2018) with R Analytics

doi:10.21203/rs.3.rs-3492090/v1


Research Article


This work is licensed under a CC BY 4.0 License

Version 1


This work examines a dataset of criminal case outcomes, with emphasis on data pre-processing, cleansing, and the subsequent analysis. The dataset is the Crown Prosecution Service Case Outcomes by Principal Offence Category (POC), covering the period from 2014 to 2018 and forty-two counties in England. Pre-processing follows a systematic sequence of steps: deleting superfluous percentage columns, arranging the data in chronological order, aligning the columns, removing special characters, and converting data types as necessary. Missing data are handled so as to protect the integrity of the dataset. The descriptive analytics section examines multiple variables, including county, year, month, area, and crime categories such as homicide, sexual offenses, and burglary. Clustering techniques, namely K-means and Hierarchical clustering, are used to identify underlying patterns within the dataset. Classification models such as Support Vector Machines (SVM) and Random Forest are used to forecast case outcomes, supported by detailed reporting and Receiver Operating Characteristic (ROC) analysis. Time series analysis using ARIMA modeling is employed to understand the temporal patterns in the crime data. The paper evaluates the ARIMA models through hypotheses, model descriptions, accuracy metrics, and visualizations.

Keywords: Fraud Detection, Machine Learning, Encoding, Bank, Feature Engineering, ROC-AUC

Over the last two decades, society's reliance on data and information has grown markedly, and significant technological advances now support the storage, analysis, and processing of vast quantities of data. This study examines a complex collection of data gathered over non-uniform periods: although certain months are missing from the archives of their respective years, these gaps have been identified and taken into account [1]. The study examines data pre-processing, cleansing, and analysis for crime datasets and also serves as a step-by-step guide. It sets out the procedural steps of the pre-processing and cleaning stage [2]: removing percentage columns, organizing the data chronologically through data augmentation and sorting, shifting columns, removing special characters, converting values to integers, and handling missing data. Code samples accompany each step and show how much care data preparation requires before analysis can begin. The analysis then proceeds to descriptive analytics of the cleaned dataset, revealing many characteristics that can be examined [3].

The analysis covers counties, years, months, regions, and a range of crime categories, including homicide, sexual offenses, and burglary [4]. Code samples lead the reader through the detailed preparation and cleaning procedures.

This research emphasizes predictive analytics and time series analysis, which broaden the range of potential applications. Linear and multiple regression techniques are employed, together with the formulation of hypotheses, summaries of datasets and models, accuracy metrics, graphs of predicted versus actual values, and evaluation statistics. Selected code snippets show how data splitting and the regression models fit together, so that readers can begin their own analyses. The report then explores clustering through both K-means and Hierarchical clustering. The hypotheses, dataset summaries, cluster dendrograms, and concise summaries together give a full picture of this part of the analysis.

Here, we set out to estimate the crime rate, the crime categories, and the total number of crimes in the country in specific years. To validate the R models, we set the following problem statements:

To predict the number of crimes based only on the date.
To be able to predict when crimes will happen in the future.
To be able to predict future unsuccessful case outcomes.
To determine how many crimes will happen based on the county, the year, and the month as separate variables.

The paper is structured as follows: Section 2 presents a literature survey, Section 3 presents the research materials and methodology, Section 4 provides the analysis and results, and Section 5 concludes the study.

This analysis aims to further explore comparing different machine learning models in the context of crime prosecution analytics, as previously discussed by several scholars.

Apurba A. et al. [5] examined the types of crime in various countries and introduced crime prediction using ML models. They worked on assessment, selection, and classification models to predict crime risk, with the classification model mainly applied to time interval and place. The authors compared the performance of Decision Trees, Neural Networks, K-nearest Neighbors, and Impact Learning, and obtained 81% accuracy with the Decision Tree algorithm.

Wajiha S. et al. [6] examined previous investigations of crime forecasting and the prediction accuracy of learning models. The researchers used various machine learning models, including Logistic Regression, Support Vector Machine (SVM), Naive Bayes, K-nearest neighbor, decision tree, and multilayer perceptron (MLP), to find the best fit for the crime data. The dataset used in this research comprises criminal records obtained from Chicago and Los Angeles. Chicago's crime rate has decreased, whereas Los Angeles has seen a minor rise in its crime rate. The authors organized the article into four distinct parts.

Sapna S. et al. [7] introduced an approach called the assemble-stacking-based crime prediction technique (SBCPM), which uses support vector machine (SVM) algorithms to provide precise crime predictions. The SVM technique was used to obtain domain-specific configurations, which were compared against models such as J48, SMO, Naive Bayes, bagging, and Random Forest. The model achieved a classification accuracy of 99.5% on the testing data. The authors' approach proved effective in forecasting potential criminal activities, and they observed that the stacking ensemble model outperforms the individual classifiers.

Debasish B et al. [8] used machine learning techniques to predict the crime rate in metropolitan Bangladesh, focusing on the decision tree technique to forecast crime in the country's metropolitan areas. The datasets, gathered from the police department of Bangladesh, included different types of crimes such as Dacoity, Murders, Women and Child Repression, Kidnappings, Police Assault, Burglaries, Thefts, and Other Cases. The study achieved more than 90% accuracy on the test data. For the Decision Tree technique, precision, recall, and F1-score were analyzed to predict the crime area.

Vinothkumar K et al. [9] proposed a system for crime hotspot identification. The study used a dataset maintained by the Chicago Police Department and available online. The authors applied machine learning models, namely SVM and Random Forest (RF), to identify crime hotspots, and used cell phones to trace a thief's movements. The precision rate was 86.6% in this study.

Varun et al. [10] examined over 150 articles to analyze the machine learning and deep learning algorithms used to detect crime. The authors reviewed more than 30 datasets from various localities to compare crime rates. They examined the architectural flow of crime prediction: text, image, and video footage from crime datasets, data preprocessing, feature selection, the application of traditional machine learning or deep learning algorithms, and the crime prediction results. The authors concluded by describing the lack of literature on how those technologies can be used to solve the problem of crime.

Sameya K et al. [11] applied advanced machine learning analysis to crime-rate prediction. The study prioritized the Naive Bayes technique over other ML techniques and achieved a 99.5% accuracy rate on crime prediction. The authors used the CAW dataset, with 13 columns and 18 rows covering different crimes. They also built models using Linear Regression, SVM, Bagging, regression, stacking regression, etc.

Myung et al. (2018) [12] developed a crime classification system incorporating risk assessment and machine learning algorithms to anticipate criminal techniques, and evaluated the system's performance to assess its effectiveness. The research focused on several categories of illegal activities, using text-based criminal case summaries derived from actual criminal cases. The authors used the KICS data format with authentic police data. The performance of the DNN, CNN, Naive Bayes, and SVM models was assessed; the accuracy achieved was 87%, 91%, 84%, and 83%, respectively. The researchers constructed two distinct prediction models: a crime-type prediction model and a CRS prediction model.

TABLE I. SUMMARY OF RELATED WORK

Ref.	Data set	Model	Precision	Accuracy
[5]	Bangladesh Police website accessed on Oct’21	KNN	0.70	0.73
		MLP Classifier	0.70	0.77
		Decision Tree	0.78	0.81
[6]	Chicago Criminal Records	Logistic Regression	0.93	90
		Decision Tree	1.00	66
		Random Forest	0.92	77
		SVM	1.00	66
		MLP	1.00	87
		Naïve Bayes	1.00	73
		KNN	0.88	88
	Los-Angeles Criminal Records	Logistic Regression	0.72	48
		Decision Tree	0.98	60
		Random Forest	0.83	43
		SVM	0.80	60
		MLP	0.98	84
		Naïve Bayes	0.88	71
		KNN	1.00	89
[7]	National Crime Report Bureau (India)	J48	0.94	94.44
		Random Forest	0.98	97.21
		SMO	0.32	41.66
		Bagging	0.95	95.55
		Naïve Bayes	0.70	67.22
[8]	Police Department of Bangladesh	Decision Tree	0.91	90
[9]	Chicago Police Department	Random Forest	86.6	
		SVM	98.4	97.64
[11]	CAW Dataset	Naïve Bayes		99.9
[12]	KICS Dataset	CNN	92	91
		DNN	77	87
		SVM	80	84
		Naïve Bayes	68	83

Our work can be divided into the following main parts: Dataset Collection, Data Integration and Preprocessing, Cleansing, Descriptive Analytics, Model Preparation and Training, and Model Performance and Evaluation.

Dataset Description

This report examines a dataset collected from data.gov.uk containing Crown Prosecution Service Case Outcomes by Principal Offence Category (POC). The dataset covers 42 counties in England and includes monthly data for the years 2014–2018 [5].

In this dataset, the outcomes reported by the Crown Prosecution Service (CPS) are broken down into successful and unsuccessful outcomes. Convictions include cases in which the defendant pleaded guilty, was found guilty in court, or was found guilty in their absence. Withdrawals, discharged committals, dismissals, acquittals, and administrative finalizations are all included in the category of unsuccessful outcomes.

A wide variety of criminal acts is represented in this dataset. These include, but are not limited to, homicide, offenses against the person, sexual offenses, burglary, robbery, theft and handling, fraud and forgery, criminal damage, and motoring offenses. A residual category covers all other offenses, excluding motoring offenses. The dataset is likely to hold many useful insights, and the following sections set out to uncover the meaning and significance behind its numbers.

Data Integration & Pre-processing

Data integration combines data from numerous sources into a single dataset to meet the information demands of applications and business processes. It plays a growing role in data management as large-scale data integration and data sharing increase [6].

In this work, we collected the data from a public, government-owned online source as Excel files. The format is straightforward and requires only cleaning before analysis. The necessary data files can be loaded individually and then merged; both steps are applied here to make the process clear.
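The load-and-merge step can be sketched as follows. This is a minimal Python illustration, not the R code used in the study; the in-memory CSV strings, the column names, and the `Year` and `Month` fields are assumptions made for the example.

```python
import csv
import io

# Hypothetical monthly extracts keyed by (year, month); in practice each
# would be read from its own file rather than from an in-memory string.
monthly_files = {
    (2014, 2): "County,Homicide\nAvon and Somerset,3\n",
    (2014, 1): "County,Homicide\nAvon and Somerset,5\n",
}

merged = []
for (year, month), text in monthly_files.items():
    for row in csv.DictReader(io.StringIO(text)):
        # Tag every row with its period so the files can be combined safely.
        row["Year"], row["Month"] = year, month
        merged.append(row)

# Arrange the combined rows chronologically, then by county.
merged.sort(key=lambda r: (r["Year"], r["Month"], r["County"]))
```

Tagging each row with its period before merging keeps the monthly origin of every record, which is what makes the later chronological sort possible.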

A series of careful pre-processing techniques improves dataset quality so that the data analytics are accurate and trustworthy and the statistical algorithms run efficiently. We analyze the significance, justification, and impact of each pre-processing phase, and examine the code for each operation. Data cleaning is an iterative process: the dataset is re-evaluated after each step [11].

Data cleansing detects, corrects, and eliminates errors in the dataset. Data analytics relies on it for accuracy and reliability: cleansing ensures data integrity and improves insights, and analysis may be misleading without it. Cleansing also removes extraneous data and standardizes what remains, which enhances usability and efficiency. We detail each data pre-processing and cleaning method in the following sections. Evaluating the rationale and effect of each method shows how raw data are turned into insights.
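Several of the cleaning steps named in this paper (dropping percentage columns, removing special characters, converting values to integers, and flagging missing data) can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' R code: the column names and sample rows are hypothetical, and representing a missing count as `None` is only one possible choice.

```python
import re

# Hypothetical raw rows; column names and values are illustrative only.
raw_rows = [
    {"County": "Avon and Somerset", "Homicide Convictions": "1,204",
     "Homicide Convictions %": "85.2%", "Burglary Convictions": "-"},
    {"County": "Bedfordshire", "Homicide Convictions": "310",
     "Homicide Convictions %": "80.0%", "Burglary Convictions": "57"},
]

def clean_value(text):
    """Strip every non-digit character and convert to int.

    Suitable only for non-negative whole-number counts; returns None
    (treated as missing) when no digits remain.
    """
    digits = re.sub(r"[^0-9]", "", text)
    return int(digits) if digits else None

def clean_row(row, key_column="County"):
    cleaned = {}
    for name, value in row.items():
        if name.strip().endswith("%"):
            continue  # drop percentage columns; they can be derived from counts
        cleaned[name] = value if name == key_column else clean_value(value)
    return cleaned

clean_rows = [clean_row(r) for r in raw_rows]
```

Keeping missing values explicit at this stage leaves the choice of imputation method open for later steps.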

After all the steps have been performed, the dataset takes the shape visualized in the sections below.

Glimpse
Visualizing Missing
Visualizing Data Types

We can visualize this diagram for each year from 2014 to 2018, clarifying month values with individual attributes.

This matrix shows the correlation coefficients between crime categories. Each cell gives the correlation coefficient between the two crime categories in its row and column: 1 is a perfect positive correlation, -1 is a perfect negative correlation, and 0 is no correlation. The matrix can therefore reveal relationships between crime categories. Homicide is strongly correlated with crimes against the person (0.95), sexual offenses (0.91), and theft (0.89), suggesting that they often occur together. Traffic violations are reported as having the weakest association with other offenses: traffic offenses correlate with homicide at 0.91, crimes against the person at 0.98, and sexual offenses at 0.96, whereas homicide, crimes against the person, and sexual offenses have higher correlations (0.95, 0.99, and 0.99, respectively). Traffic offenses are thus less related to other crimes than other crimes are to each other. This suggests that traffic violations are distinct from other crimes and may require different interventions and preventive tactics.
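Each cell of such a matrix is a Pearson correlation coefficient. As a minimal sketch in Python (not the R code used in the study), with hypothetical monthly counts that are not taken from the dataset, one cell can be computed as:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical monthly counts for three categories (illustrative only).
homicide = [2, 4, 6, 8]
theft = [10, 20, 30, 40]     # rises in step with homicide
burglary = [8, 6, 4, 2]      # falls as homicide rises

r_homicide_theft = pearson(homicide, theft)
r_homicide_burglary = pearson(homicide, burglary)
```

In this toy example the first pair is perfectly positively correlated and the second perfectly negatively correlated, the two extremes of the scale described above.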

Plots for unsuccessful crimes can be visualized similarly.

Table II. Correlation matrix of crime activity and Crime categories

	Offenses against the person	Sexual Offences	Robbery	Homicide
Offenses against the person	0.95	0.98		0.98
Sexual Offences	0.99	0.96		–
Robbery	0.89
Homicide	0.95	0.95		0.91
Burglary	0.91

Descriptive Analytics

Descriptive analysis is a statistical approach used to summarize a dataset succinctly and visually. It characterizes the dataset through measures of central tendency (mean, median, and mode), measures of dispersion (range, variance, and standard deviation), and the distribution of the data. Visual representations such as histograms, box plots, and scatter plots illustrate the data and its underlying characteristics. The aim is a concise overview that highlights significant patterns and trends, gives insight into the data's inherent characteristics, and exposes atypical observations or outliers. Descriptive analysis can also reveal missing data or inaccuracies within the dataset. It is a fundamental stage of data analysis because it lays the groundwork for further statistical analysis and modeling.

Sampling is an initial phase of Exploratory Data Analysis (EDA), a methodology for analyzing datasets and summarizing their primary characteristics, frequently with visual techniques. Sampling can help alleviate the concerns associated with this process. The dataset's features are organized into distinct groups, that is, they are categorical. There are multiple approaches to handling categorical data. However, given the context of our study, which aims to compare logistic regression and gradient-boosting machines, we opted to convert these categorical features into numerical representations [22]. After one-hot encoding and feature scaling, we obtained the values of three significant features, shown below.
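One-hot encoding and feature scaling can be sketched as follows. This is a minimal Python illustration rather than the R code used in the study; the sample values are hypothetical, and min-max scaling is an assumption, since the text does not state which scaling method was applied.

```python
def one_hot(values):
    """Return one indicator vector per value plus the category order used."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return encoded, categories

def min_max_scale(values):
    """Rescale values to [0, 1]; assumes the values are not all equal."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical categorical and numeric features (illustrative only).
areas = ["North", "South", "East", "North"]
counts = [120, 80, 40, 100]

encoded, categories = one_hot(areas)
scaled = min_max_scale(counts)
```

Each categorical value becomes a vector with a single 1 in the position of its category, while the numeric feature is mapped onto a common 0 to 1 range.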

Model Preparation and Testing

Ridge Regression: Ridge Regression is a technique for analyzing multiple regression data that exhibit multicollinearity. It is a type of linear regression in which the predicted target value is a linear combination of the input features. What distinguishes Ridge Regression from standard linear regression is the way the coefficients are estimated: a penalty proportional to the sum of the squared coefficients is added to the loss. This penalty, called the L2 penalty, shrinks the coefficients relative to those that ordinary linear regression would produce. Several measures were used to judge how well the Ridge Regression model worked [13]. The Mean Squared Error (MSE), the average squared difference between the predicted and actual values, was about 70.72. The Root Mean Squared Error (RMSE), which estimates the standard deviation of the residuals and expresses the forecast error in the units of the target, was about 8.41. The Mean Absolute Error (MAE), which measures the average magnitude of the errors regardless of their direction, was about 2.91.
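The three error measures can be computed as in the following minimal Python sketch, which is not the R code used in the study; the actual and predicted values are hypothetical. Since RMSE is the square root of MSE, the reported RMSE of about 8.41 is consistent with the reported MSE of about 70.72.

```python
import math

def regression_errors(actual, predicted):
    """Return (MSE, RMSE, MAE) for paired actual and predicted values."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    mse = sum(r * r for r in residuals) / len(residuals)
    rmse = math.sqrt(mse)
    mae = sum(abs(r) for r in residuals) / len(residuals)
    return mse, rmse, mae

# Hypothetical values (illustrative only, not results from the paper).
actual = [10, 12, 15, 20]
predicted = [11, 12, 13, 24]

mse, rmse, mae = regression_errors(actual, predicted)
```

Because the errors are squared before averaging, a single large residual raises MSE and RMSE far more than it raises MAE.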

Linear Regression: Linear regression is a statistical method used to describe the relationship between a dependent variable (also called the outcome or response variable) and one or more independent variables (also called explanatory or predictor variables). It assumes a linear relationship between the independent variables and the dependent variable, and it seeks the best-fitting line through the data points [14].

Random Forest: A random forest (RF) is a combination of tree models in which each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. As the number of trees in a forest grows, the generalization error of the forest converges to a limit. How well a forest of tree classifiers generalizes depends on the strength of the individual trees and on the correlation between them [15].

Random Forest Classifier (Gini approach): In a random forest model, the Gini importance of a variable measures how much the tree nodes that split on that variable reduce impurity, averaged across all trees in the forest. Variables whose nodes produce larger decreases in impurity have higher Gini importance scores. Each row in the result corresponds to a variable in the model [16].
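The impurity decrease behind Gini importance can be illustrated for a single split. The following is a minimal Python sketch with hypothetical class labels, not the R code used in the study; a full Gini importance would accumulate such decreases over every node that uses a variable and average them across the trees.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted

# Hypothetical node and a split that separates the classes perfectly.
parent = ["High", "High", "Low", "Low"]
left, right = ["High", "High"], ["Low", "Low"]

decrease = impurity_decrease(parent, left, right)
```

Here the parent node is maximally mixed for two classes and both children are pure, so the split removes all of the impurity.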

Confusion Matrix: The prediction summary is presented in the form of a confusion matrix, which shows the correct and incorrect predictions made for each class. It makes it easy to see which classes the model confuses with one another. The confusion matrix represents classification performance through the counts of true positives, true negatives, false positives, and false negatives for each category. The model's total accuracy is 1, indicating an accuracy rate of 100%. The Kappa statistic is also 1, signifying complete agreement between the predicted and actual labels [17].
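The measures derived from a two-class confusion matrix can be sketched as follows. This is a minimal Python illustration, not the R code used in the study, and the four counts are hypothetical rather than taken from the paper's results.

```python
def binary_metrics(tp, fn, fp, tn):
    """Summary measures from the four cells of a 2x2 confusion matrix."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / total ** 2
    return {
        "accuracy": accuracy,
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "ppv": tp / (tp + fp),           # precision
        "npv": tn / (tn + fn),
        "kappa": (accuracy - expected) / (1 - expected),
    }

# Hypothetical counts (illustrative only).
m = binary_metrics(tp=40, fn=10, fp=5, tn=45)
```

Kappa rescales accuracy by the agreement expected from the class totals alone, so it equals 1 only when every prediction matches the actual label.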

Decision Tree: Decision trees are tree-like structures that make judgments from the information provided. A decision tree is constructed by repeatedly splitting the data into smaller subgroups until a stopping criterion, such as an acceptable degree of accuracy, is met. The nodes represent decisions and the branches represent their outcomes [18].

Support Vector Machines (SVMs) are a category of linear classifiers that operate by maximizing the margin. To improve the classifier's generalization performance, they seek to minimize structural risk by controlling the classifier's complexity. The SVM algorithm performs classification by constructing an optimal hyperplane that separates the data points into two distinct groups in a higher-dimensional feature space [19].

Multiple Regression: Multiple regression uses several independent variables to predict a single dependent variable. It extends simple linear regression by including more than one independent variable. The relationship between the independent variables and the dependent variable is expressed by an equation with a single dependent variable and several independent variables, and this equation is used to predict the value of the dependent variable from the values of the independent variables. Multiple regression is useful for examining the relationship between several independent variables and a single dependent variable, and for understanding how changes in the independent variables affect the dependent variable. It is widely used in disciplines such as economics, psychology, and marketing to understand complex relationships [20].

Ridge Regression:

Ridge Regression is a statistical method used to analyze multiple regression datasets that exhibit multicollinearity. It is a variation of linear regression, a technique that assumes the target variable can be expressed as a linear combination of the input variables. Ridge Regression differs from conventional linear regression in how the coefficients are estimated: it adds a regularization term proportional to the sum of the squared coefficients.

This L2 penalty yields coefficient estimates that are smaller in magnitude than those obtained with conventional linear regression.
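The penalized estimate described above can be written in a standard form. The notation is an assumption introduced here for illustration: $y_i$ is the response for observation $i$, $x_{ij}$ the value of predictor $j$, $\beta_0$ the intercept, $\beta_j$ the coefficients, and $\lambda \ge 0$ the tuning parameter that controls the strength of the penalty.

```latex
\hat{\beta}^{\mathrm{ridge}}
  = \arg\min_{\beta_0,\,\beta}
    \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\,\beta_j \Bigr)^{2}
    + \lambda \sum_{j=1}^{p} \beta_j^{2},
  \qquad \lambda \ge 0
```

Setting $\lambda = 0$ recovers ordinary least squares, while larger values of $\lambda$ shrink the coefficients further toward zero.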

Linear Regression:

Linear regression is a statistical methodology employed to construct a mathematical model that elucidates the relationship between a dependent variable, commonly known as the outcome or response variable, and one or more independent variables, also called explanatory or predictor variables. The methodology assumes that there is a linear relationship between the independent variables and the dependent variable. It aims to determine the best linear regression line that fits the observed data points.
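For a single predictor, the best-fitting line has a closed-form least-squares solution. The following minimal Python sketch (not the R code used in the study) fits a line to hypothetical monthly totals and predicts the next month; the data are illustrative only.

```python
def fit_line(x, y):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical month index and crime totals (illustrative only).
months = [1, 2, 3, 4]
crimes = [105, 110, 115, 120]

intercept, slope = fit_line(months, crimes)
forecast_month_5 = intercept + slope * 5
```

The slope is the covariance of the two variables divided by the variance of the predictor, and the intercept places the line through the point of means.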

Multiple Regression:

Multiple regression is a statistical technique that predicts a single dependent variable using more than one independent variable. It extends simple linear regression, which predicts a dependent variable using only one independent variable. In multiple regression, the relationship between the independent variables and the dependent variable is represented by an equation containing one dependent variable and multiple independent variables. The equation can be used to predict the value of the dependent variable from the values of the independent variables.

Clustering:

Clustering is an unsupervised machine learning method that groups similar data points. It attempts to identify patterns or structures in the data by grouping similar observations. Some clustering algorithms organize data elements into a predetermined number of groups based on their similarity. Clustering is frequently used for market segmentation, image compression, and anomaly detection. There are various clustering techniques, such as K-means, Hierarchical clustering, and DBSCAN. For our analysis, we use K-means and Hierarchical clustering.
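The K-means procedure alternates between assigning each point to its nearest center and moving each center to the mean of its points. The following is a minimal one-dimensional Python sketch, not the R code used in the study; the totals, the fixed starting centers, and the fixed iteration count are assumptions made for the example.

```python
def kmeans_1d(values, centers, iterations=10):
    """Lloyd's algorithm on one-dimensional data with given starting centers."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            # Assign each value to the nearest current center.
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster (keep it if empty).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical crime totals for six counties (illustrative only).
totals = [10, 12, 11, 50, 52, 51]

centers, clusters = kmeans_1d(totals, centers=[10, 50])
```

In practice the starting centers are usually chosen at random and the procedure is repeated several times, because the result can depend on the initialization.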

Dplyr in R:

The dplyr package is an integral component of the tidyverse framework within the R programming language and is mainly used to manipulate data. It offers a comprehensive suite of tools for working with datasets in the R environment [20].

Caret in R:

The Caret package, which stands for Classification and Regression Training, is a crucial component within the data science ecosystem of the R programming language. The system provides an efficient procedure for constructing predictive models [21].

Principal Component Analysis (PCA):

Principal Component Analysis (PCA) is a statistical approach used to emphasize the variability present in a dataset and reveal prominent patterns. It is frequently used to facilitate data exploration and visualization, and it is highly effective in reducing the number of dimensions, compressing data, and mitigating noise. However, PCA assumes that each principal component is a linear combination of the original features, an assumption that may not always hold. It also reduces the interpretability of the features [22].

Lattice in R:

Lattice is a powerful visualization library in R inspired by Trellis graphics. It is designed explicitly for visualizing multivariate data in a structured manner [23].

Plotly in R:

Plotly is a versatile library that allows the creation of interactive plots within R. Plotly’s R library lets you create interactive graphs and charts like line graphs, scatter plots, area plots, bar charts, error bars, box plots, histograms, heatmaps, subplots, etc.

Ggplot2 in R:

The ggplot2 library, part of R's tidyverse, is a comprehensive and powerful tool for creating static, aesthetically pleasing visualizations. It is based on the Grammar of Graphics concept.

The graph below examines crime rates and reveals that crimes against the person are the most prevalent throughout all regions of England, particularly in the northern area. The incidence of robbery follows a consistent pattern across all areas, with the North region reporting the highest rates. Motoring ranks as the third most prevalent offense across all areas, with other crimes exhibiting a similar distribution pattern. In general, the northern region has the highest incidence of criminal activity, while the eastern region tends to have the lowest.

The graph below examines unsuccessful case outcomes, revealing that unsuccessful outcomes for offenses against the person are the most prevalent throughout all regions of England, with a particular concentration in the southern area. Unsuccessful Motoring Offences has the second greatest frequency, consistently across all areas of England and particularly in the South. Unsuccessful Admin Finalized ranks as the third most frequent category across all areas, with other offenses exhibiting a similar pattern. In general, the southern region exhibits the largest incidence of unsuccessful outcomes, while the eastern region has comparatively lower rates.

Evaluating States:

In this phase of the study, we examine the correlations between crime occurrences and their temporal distribution throughout different areas of England, using graph visualization techniques and statistical methods.

We also examine the correlations between successful and unsuccessful case outcomes across different counties in England from 2015 to 2018, considering the temporal factors of year and month, using graph visualization techniques.

Our method successfully generated a dendrogram that places similar counties next to one another, as seen in the following graph. For instance, crime rates have always been higher in the metropolitan and city counties. Cutting the dendrogram into two clusters separates the metropolitan and city counties from the rest of the counties. Cutting it into three clusters yields a new cluster at the extreme left of the dendrogram, which contains three counties. To group counties with similar characteristics properly, we cut the dendrogram into five clusters.
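Cutting a dendrogram into k clusters is equivalent to running agglomerative merging until k clusters remain. The following minimal Python sketch illustrates this; it is not the R code used in the study. The county names and totals are hypothetical, and single linkage on a one-dimensional total is an assumption, since the text does not state the linkage or distance used.

```python
def agglomerate(points, k):
    """Merge the two closest clusters (single linkage) until k remain."""
    clusters = [[name] for name in points]

    def distance(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(abs(points[p] - points[q]) for p in a for q in b)

    while len(clusters) > k:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: distance(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters

# Hypothetical county totals; "County E" stands in for a high-volume area.
crime_totals = {"County A": 100, "County B": 110, "County C": 500,
                "County D": 520, "County E": 2000}

two_clusters = agglomerate(crime_totals, k=2)
three_clusters = agglomerate(crime_totals, k=3)
```

With two clusters the high-volume county is isolated from all others; asking for three clusters splits the remaining counties into a low and a mid group, mirroring how each additional cut refines the grouping.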

The following is a visualization of the time series data. The missing values visible in the graph may adversely affect our analysis or render our diagnostics unreliable. The number of reported offenses, thankfully, continues to decline.

Fig. 20. To address the issue of missing data, the mice package is used to impute the incomplete data. After imputation, the dataset is more coherent and free of irregularities.

The decomposition of a multiplicative time series is a technique used to break a time series into its constituent components: trend, seasonality, and residuals. It is used for time series whose components combine multiplicatively, meaning that the magnitudes of the seasonal and trend components depend on the level of the series.

The decomposition process entails partitioning a time series into its constituent elements, namely the trend, seasonal, and irregular components. The trend component captures the enduring alterations in the data over an extended period. On the other hand, the seasonal component is responsible for capturing the recurring patterns that manifest at certain time intervals. Lastly, the irregular component is responsible for capturing the stochastic and unforeseeable fluctuations in the data.
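The multiplicative model described above can be stated compactly. The symbols are assumptions introduced here for illustration: $Y_t$ is the observed series at time $t$, $T_t$ the trend component, $S_t$ the seasonal component, and $I_t$ the irregular component.

```latex
Y_t = T_t \times S_t \times I_t
\quad\Longleftrightarrow\quad
\log Y_t = \log T_t + \log S_t + \log I_t
\qquad (Y_t > 0)
```

Taking logarithms turns the multiplicative model into an additive one, which is why a log transform is often applied before fitting additive methods to such series.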

According to the four-year forecast, no significant decline in crimes is projected: the number of crimes is expected to remain in the range of 30,000 incidents. This suggests that the existing crime rate will be sustained.

According to the eight-year forecast, the number of offenses is likewise projected to remain consistently around 30,000, indicating a sustained crime rate.

This investigation examined a crime dataset through thorough data pre-processing and extensive analysis. Attention to data integrity and uniformity during pre-processing established a firm basis for the dependability and credibility of the dataset. Descriptive analytics examined several characteristics, such as county, year, month, region, and types of crimes, and code snippets supplemented this analysis to ensure transparency and reproducibility. Beyond descriptive insights, predictive analytics used several statistical techniques, including linear and multiple regression, clustering, classification, and time series analysis, to identify patterns, correlations, and predictive capability within the data. The sensitivity (true positive rate) is 0.7927, meaning the model correctly classifies 79.27% of the actual 'High' cases. The specificity (true negative rate) is 0.9747, meaning the model correctly classifies 97.47% of the actual 'Low' cases. The precision, or positive predictive value (PPV), is 0.9646: when the model predicts 'High', it is correct 96.46% of the time. The negative predictive value (NPV) is 0.8438: when the model predicts 'Low', it is correct 84.38% of the time. Kappa, a measure of agreement between predicted and actual labels that accounts for chance, is 0.7764. We must acknowledge that our Lasso Regression model showed performance abnormalities, as evidenced by atypical R-squared and Adjusted R-squared values. This analysis highlights the importance of data pre-processing and cleansing in revealing the full potential of a dataset. A carefully curated dataset is the foundation for rigorous research, significant findings, and evidence-based decision-making. Through appropriate use of data, individuals gain substantial insights and make informed decisions, which makes data analysis a vital resource for both scholars and practitioners.

Data Availability Statement

The dataset used in this study is publicly available at https://ckan.publishing.service.gov.uk/dataset/?q=crime&sort=score+desc%2C+metadata_modified+desc&organization=crown-prosecution-service

Conflict of Interest Statement

The authors declare no conflicts of interest.

Funding

This research received partial financial support from RNB Lab to cover the essential cost of digital resources.

Acknowledgments

The authors are grateful for the support of the staff and faculty members of the Data Science Cohort, especially Dr William Sayers of the School of Computing and Technology, University of Gloucestershire.

