This research article analyzes the impact of various machine learning data categorization techniques, namely the NB, RF, SVM, and KNN algorithms. Data categorization is performed according to sensitivity levels for the cloud. Our proposed model consists of three classes: Basic Class, Confidential Class, and Highly Confidential Class, as shown in Figure 2.
3.1 Basic Class
Our proposed model's basic class comprises common types of data, such as text documents, with a low level of confidentiality. Basic information such as advertising, announcements, and notices can be found in text documents. As a result, this level provides a basic level of data security. The basic class does not require encryption on the client side; nevertheless, when data is sent, it is encrypted on the server side using the backup service's key.
3.2 Confidential Class
This class covers personal files, such as private accounts, web accounts, and professional details. Our confidential class is intended for data with a medium level of confidentiality. Because this class holds secret and private information, security measures are necessary to protect the data. At the confidential level, encryption methods such as AES can be utilized for this purpose, and encryption is performed on the client side.
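A minimal sketch of such client-side encryption, assuming the third-party cryptography package and AES-256 in GCM mode (the library, mode, and function names are our illustrative choices, not prescribed by the model):

import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def encrypt_confidential(plaintext: bytes):
    # Generate a 256-bit key and a fresh 96-bit nonce on the client
    key = AESGCM.generate_key(bit_length=256)
    nonce = os.urandom(12)
    ciphertext = AESGCM(key).encrypt(nonce, plaintext, None)
    return key, nonce, ciphertext

def decrypt_confidential(key: bytes, nonce: bytes, ciphertext: bytes) -> bytes:
    # Decryption raises an exception if the ciphertext was tampered with
    return AESGCM(key).decrypt(nonce, ciphertext, None)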
3.3 Highly Confidential Class
This class is responsible for financial transactions, organization-wide secret documents, and military data. Users may have reservations about services handling data of such high confidentiality and therefore avoid newly offered services altogether. Owing to the high degree of confidentiality and integrity required, this level provides security by using two standard recommended algorithms. The US National Security Agency (NSA) recommends AES-256 to prevent unwanted access to top-secret material. The SHA-2 algorithm, on the other hand, ensures data integrity. This algorithm is used to compute the hash value of data before modifying or transferring it. A hash value is also computed when data is retrieved on user request; the two values must match to ensure that the data has not been tampered with.
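A minimal sketch of the integrity check, assuming Python's standard hashlib with SHA-256 (a member of the SHA-2 family); the function names are illustrative:

import hashlib

def compute_digest(data: bytes) -> str:
    # SHA-256 digest computed before the data is stored or transferred
    return hashlib.sha256(data).hexdigest()

def verify_integrity(data: bytes, stored_digest: str) -> bool:
    # On retrieval, recompute the digest and compare it with the stored value
    return hashlib.sha256(data).hexdigest() == stored_digest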
3.4 Dataset
We collected the Reuters-21578 text categorization collection dataset from the UCI ML repository. To test the recommended system, we also collected confidential and highly confidential data from the CIA public library for text composition. The relevant material, such as commercials, announcements, news-article data, organizational account-information documents, and military information, was compiled from the different international repositories mentioned above.
The datasets generated during and/or analyzed during the current study are available in the [UCI, CIA] repositories (Access links are provided in Table 2).
Table 2
Dataset with corresponding repositories
S.No | Dataset Type | Number of documents | Dataset Link |
1 | Public and Confidential Data | 4000 | https://miguelmalvarez.com/2015/03/20/classifying-reuters-21578-collection-with-python-representing-the-data/ |
2 | Highly Confidential Data | 2010 | https://www.archives.gov/research/intelligence/cia |
3.5 Data Processing
Natural language processing (NLP) is a method of analyzing, manipulating, and extracting meaning from human language in such a way that computers can understand it. Before the text input is sent to the algorithm, it is transformed using the NLTK library; the unstructured text data is thereby converted into a structured format. Pre-processing is a significant component of many machine learning techniques and has a noticeable impact on the classification process as well [15].
3.5.1 Tokenization
Tokenization is the process of breaking down a character sequence into components, each of which represents a word or phrase. In natural language processing, there are two types of tokenization: word tokenization and sentence tokenization. The resulting list of tokens, which can be words or phrases, is then used to process the data [15]. Fig. 3 shows the tokenization process.
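A minimal sketch of both tokenization types, assuming the NLTK library used in this work (nltk.download('punkt') may be required before first use); the sample text is invented:

from nltk.tokenize import word_tokenize, sent_tokenize

text = "This document is public. It contains an announcement."
sentences = sent_tokenize(text)   # sentence tokenization
words = word_tokenize(text)       # word tokenization
print(sentences)  # ['This document is public.', 'It contains an announcement.']
print(words)      # ['This', 'document', 'is', 'public', '.', 'It', ...]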
3.5.2 Filtering
Filtering a text file is a common practice to remove some of the more inconsequential terms. A typical filtering mechanism is the removal of stop words. Stop words are terms that appear regularly in text yet carry little substantive information (for example, relational words, conjunctions, and so on).
Thus, words that appear very frequently in the content are considered to carry insufficient information to distinguish between documents, while terms that appear only rarely may likewise be of low significance and can be eliminated from the content document [17].
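A minimal sketch of stop-word filtering with NLTK's English stop-word list (nltk.download('stopwords') may be required first); the sample sentence and variable names are illustrative:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
tokens = word_tokenize("The report and the notice were sent to all employees")
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)  # ['report', 'notice', 'sent', 'employees']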
3.5.3 Lemmatization
Lemmatization considers the morphological analysis of words: the different inflected varieties of a word are reduced to a single base form. In other words, lemmatization approaches aim to map forms spread across several tenses and items into a single structure. To lemmatize a text, we must first identify the part of speech (POS) of each word in the document; stemming approaches are preferred over POS-based lemmatization because POS tagging is repetitive and prone to errors [16].
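A minimal sketch of lemmatization with NLTK's WordNet lemmatizer (nltk.download('wordnet') may be required first); the words and POS tags are illustrative:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("transactions"))      # 'transaction' (noun is the default POS)
print(lemmatizer.lemmatize("running", pos="v"))  # 'run' (the verb form needs an explicit POS tag)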
3.5.4 Stemming
This process converts words to their root forms, mapping a group of words to a common stem regardless of whether the stem is a valid word in the language. As a result, stemming a word or a sentence can produce non-words. Stems are formed by deleting prefixes and suffixes from a word, and stemming algorithms are language-dependent [17].
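A minimal sketch using NLTK's Porter stemmer (the same stemmer applied in the vectorizer code further below); the example words are illustrative:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ["classified", "classification", "confidential"]])
# e.g. 'classified' -> 'classifi': the resulting stems need not be valid English words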
3.6 Feature extraction
Before being fed into the classifier, the data from the text document is represented as indexes, and words can be used as features. The Bag of Words technique, which represents the document as a collection of words, is a commonly used structure. To allow formal descriptions of feature extraction, we must first define several terms and variables that will be used regularly in what follows. Given a set of documents D = {d1, d2, ..., d|D|}, the set V = {w1, w2, ..., w|V|} of the distinct terms or words occurring in those documents is known as the vocabulary [18]. fd(w) denotes the number of occurrences of the word w ∈ V in the document d ∈ D, and fD(w) denotes the number of documents containing the word w. The feature vector of a document d is td = (fd(w1), fd(w2), ..., fd(w|V|)).
Algorithm for creating a BoW model
import nltk

# Creating the Bag of Words model: count how often each word occurs in the dataset
word2count = {}
for data in dataset:                 # dataset is an iterable of document strings
    words = nltk.word_tokenize(data)
    for word in words:
        if word not in word2count:
            word2count[word] = 1
        else:
            word2count[word] += 1
There are two general approaches to representing a document with a list of features, namely the local dictionary technique and the global dictionary methodology [13, 18]. In the local approach, the dictionary is built using only the relevant texts: if a term appears in a relevant document, it can be added to the dictionary as a feature. The local dictionary technique generally produces better results [19].
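As an illustration of building such a dictionary and its bag-of-words representation, a minimal sketch using scikit-learn's CountVectorizer (the sample documents are invented):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["public notice about the event",
        "confidential account details",
        "public announcement about the account"]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)   # document-term count matrix (one row per document)
print(sorted(vectorizer.vocabulary_))  # the learned dictionary of feature words
print(bow.toarray())                   # bag-of-words feature vector for each document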
3.7 Feature Vector
Transforming documents into numeric vectors is the most universal approach to representing them. This representation is also known as the “Vector Space Model” (VSM). Its structure is simple and was originally designed with information retrieval (IR) and indexing in mind. The vector space model is widely used in various text mining techniques and IR applications, and it allows for intelligent analysis of a huge number of documents [19]. In VSM, each word is assigned a numeric value that signifies the word's weight, or 'importance', in the document. The first of two basic feature weighting models is the Boolean model: if a feature is present in the document it has a weight of 1, and otherwise it has a weight of 0. The second is term frequency–inverse document frequency (TF-IDF), which is the most general term weighting scheme. It comes from IR, where both the term frequency (TF) and the inverse document frequency (IDF) are used to determine the relevance of a feature [19]. TF represents the number of times a feature appears in the document, while IDF reflects how rare or frequent the feature is over all documents. Using the TF-IDF weighting method as an example, the weight f(w) assigned to each word w in a document d is derived as follows:
$$f\left(w\right)=f_{d}\left(w\right)\cdot \log\frac{\left|D\right|}{f_{D}\left(w\right)} \qquad \left(1\right)$$
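For instance (with invented numbers), if a word occurs fd(w) = 3 times in a document, the collection contains |D| = 100 documents, and the word appears in fD(w) = 10 of them, then its weight is
$$f\left(w\right)=3\cdot \log\frac{100}{10}=3\cdot \log 10,$$
which equals 3 when a base-10 logarithm is used.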
Algorithm for Tfidf Vectorizer to calculate tf-idf score
import os
import csv
from nltk.tokenize import RegexpTokenizer
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

tokenizer = RegexpTokenizer(r'\w+')
stemmer = PorterStemmer()
path = "E:/Thesis/dataset/"

# Read every document in the dataset folder into a list of strings
docs = []
for subdir, dirs, files in os.walk(path):
    for file in files:
        file_path = os.path.join(subdir, file)
        with open(file_path, encoding="latin-1") as fh:
            docs.append(fh.read())

def stem_tokens(tokens, stemmer):
    # Reduce every token to its Porter stem
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize(text):
    # Tokenize on word characters, then stem each token
    tokens = tokenizer.tokenize(text)
    return stem_tokens(tokens, stemmer)

# Initializing the TF-IDF vectorizer (stemmed tokens, English stop words removed)
vectorizer = TfidfVectorizer(tokenizer=tokenize, stop_words='english')
DocumentVectorizerArray = vectorizer.fit_transform(docs).toarray()

# In scikit-learn >= 1.0 use get_feature_names_out() instead of get_feature_names()
feature_names = list(vectorizer.get_feature_names())

# The first row of fahad.csv holds the selected feature words; for every document,
# append a row of TF-IDF scores for those words to the same file
with open('E:/Thesis/fahad.csv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    for line_count, row in enumerate(csv_reader):
        if line_count == 0:
            for docIndex, doc in enumerate(docs):
                newrow = [0] * len(row)
                for index, column in enumerate(row):
                    if column in doc:
                        vocabularyIndex = feature_names.index(column)
                        newrow[index] = DocumentVectorizerArray[docIndex][vocabularyIndex]
                    else:
                        newrow[index] = 0
                with open('E:/Thesis/fahad.csv', 'a', newline='') as f:
                    writer = csv.writer(f)
                    writer.writerow(newrow)
        else:
            break
The number of documents in the collection is represented by |D|. In the TF-IDF formula, the term frequency is multiplied by the logarithm of the inverse document frequency. This normalization lowers the weight of terms that appear in many documents of the collection, ensuring that features that appear in fewer documents receive comparatively higher weights and are therefore more discriminative. The estimated feature vector is shown in Figure 5 below.
3.8 Sensitivity Base Classification
Classification is a supervised learning method in which a classifier learns from training data and is then used to predict the class label of new data. It is used in a variety of disciplines, including medical diagnosis, image processing, document management, and text classification, and it is studied in a variety of communities, including machine learning, databases, IR, and data mining [23].
The fundamental goal of classification is to assign predefined categories to text documents [17]. The classification problem can be stated as follows: we need a training set of documents D = {d1, d2, ..., dn} such that each document di is associated with a label ℓi from the collection L = {ℓ1, ℓ2, ..., ℓk}.
Several well-defined machine learning algorithms were used to classify the documents. These techniques include artificial neural networks, decision trees, the KNN technique, NB, rule-based classifiers, and SVM. Of these classifiers, the ones described below are the most appropriate for the document classification considered here [17].
A random forest is simply a series of decision trees with their outcomes aggregated into a single ultimate result. They're so effective because they can reduce overfitting while also reducing bias-related inaccuracy. A decision tree is essentially a categorized tree of the training dataset in which the data is hierarchically segregated using a feature value condition [24, 25].
3.9 Training Module
In the training phase, pre-processing is performed using NLP and features are obtained automatically. The extracted features are then used for prediction by applying different classification algorithms, and the predicted output is matched against the input instance to train the model. The suggested methodology takes as input text documents containing basic, confidential, and highly confidential data, and at the end of the training phase a final predictive model is chosen to forecast class labels. We provided 6010 text documents to train our model. KNN, NB, RF, and SVM are the four classifiers used to analyze the data.
In this module, the class labels are a set of outputs that, combined with the features (also known as variables), are used to train a prediction model; the trained model allows the machine-learning system to predict class labels. Cross-validation is performed to train the classifier and test the efficiency of the trained model. Once trained, the model is able to predict the class label of any new text document based on the features it contains. The suggested method uses the training dataset, which was manually labelled with class labels, to create a classifier; in this study, text documents are mapped to one of the predefined categories, i.e., the class labels. The suggested architecture creates a training model for the automatically retrieved features stored as sparse matrices. Because we generated 2048 automatic features, each one has its own frequency, which is calculated using TF-IDF. Classification algorithms were applied to all of them to predict class labels, and the three models with the highest accuracy were chosen, as sketched below. We chose the Random Forest, Naïve Bayes, and k-nearest neighbor classifiers over other machine learning algorithms because they performed better. Multiple classifiers are available, and it is critical to choose the appropriate one for our problem to predict correct class labels. Furthermore, the SVM technique is inefficient for large data sets, and SVM does not perform well when the data set contains more noise, such as overlapping target classes.
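A minimal sketch of this training step, assuming scikit-learn; the classifier settings and the 5-fold cross-validation are illustrative choices, and X and y stand for the TF-IDF feature matrix and the manually assigned class labels from the previous steps:

from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

classifiers = {
    "RF": RandomForestClassifier(),
    "NB": MultinomialNB(),
    "KNN": KNeighborsClassifier(),
    "SVM": LinearSVC(),
}
# Compare the four classifiers with cross-validation on the training data
# X: TF-IDF feature matrix; y: class labels (Basic / Confidential / Highly Confidential)
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=5)
    print(name, scores.mean())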
3.10 Testing Module
The final trained model is evaluated using a new set of testing data to see how successfully our model was trained. A new dataset of text documents was used as input in the testing phase, and the new text documents were pre-processed; for pre-processing, the documents were tokenized into words based on the properties they contained. The new cases were then loaded into the trained model, which predicted the text's class label. To evaluate the classifiers, about 2030 testing documents from various publications were employed, which is quite helpful to the testing module.
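A minimal sketch of this evaluation step, again assuming scikit-learn; X_test and y_test stand for the pre-processed testing feature vectors and their true labels, and best_model for the final trained model chosen in the training phase:

from sklearn.metrics import accuracy_score, classification_report

y_pred = best_model.predict(X_test)            # predict class labels for unseen documents
print(accuracy_score(y_test, y_pred))          # overall accuracy on the testing set
print(classification_report(y_test, y_pred))   # precision, recall and F1 per class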
3.11 Development of Prototyping
The proposed methodology's goal is to construct the application's modules such that they can be validated. The system's back end was built in Python 3.7, which is a powerful language to work with for data analysis when combined with the correct tools and modules.
Document categorization is further grouped into two phases: the first consists of training and the second of testing. The training phase is broken down into several parts, including NLP pre-processing, feature extraction, and feature vector generation, followed by the application of various classification algorithms in the prediction module.
A final trained model is chosen from the training phase and used to predict document classifications in the testing phase. Before being pre-processed, the data is passed through the pdf-to-text software (pdf to txt.exe), which converts PDF documents to text files before they are fed into the pre-processing step.
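As an equivalent sketch of this conversion step in Python, assuming the third-party pdfminer.six package as a stand-in for pdf to txt.exe (the folder paths are illustrative):

import os
from pdfminer.high_level import extract_text

pdf_dir = "E:/Thesis/pdfs/"     # input folder of PDF documents
txt_dir = "E:/Thesis/dataset/"  # text files consumed by the pre-processing step

for name in os.listdir(pdf_dir):
    if name.lower().endswith(".pdf"):
        text = extract_text(os.path.join(pdf_dir, name))
        out_path = os.path.join(txt_dir, name[:-4] + ".txt")
        with open(out_path, "w", encoding="utf-8") as out:
            out.write(text)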
Python is a free and open-source programming language designed to be simple to learn yet powerful. We used a variety of Python libraries, including nltk, sklearn, numpy, os, csv, and scipy. These libraries are used to preprocess data, extract features automatically, construct feature vectors, produce dataset csv files, and then feed the csv files to the classifier to train and test the models.