2.2.1 Text Categorization: textCNN
During typhoon disasters, people post a variety of disaster-related information on social media platforms. Timely analysis of this information can help governments and social groups develop awareness of the disaster, understand conditions at different locations, and formulate corresponding rescue and recovery actions based on the analysis results.
This study uses the textCNN model to explore the relationship between social media data and disasters. Yoon Kim first applied Convolutional Neural Networks (CNN) to the task of text classification (Kim 2014). The model uses multiple kernels of different sizes to extract key information from sentences (similar to n-grams with multiple window sizes), which allows it to capture local correlations well. The network structure of textCNN is unchanged from that of a traditional image CNN. The traditional textCNN model consists of four parts: an input layer, a convolutional layer, a pooling layer, and a fully connected layer. The input layer is an n×k matrix, where n is the number of words in a sentence and k is the dimension of the word vector corresponding to each word; the original sentences are padded so that all input vectors have the same length. In the convolutional layer, each convolution operation extracts one feature vector, and kernels with different window sizes extract different feature vectors, which together form the layer's output. The pooling layer reduces sentences of different lengths to fixed-length vector representations; commonly used pooling methods include 1-max pooling, k-max pooling, and average pooling. The last layer is the fully connected layer, which maps the learned feature representation to the label space of the samples and uses the softmax activation function to output the classification category probabilities (Kalchbrenner N et al. 2014).
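The convolution-and-pooling steps described above can be illustrated with a minimal pure-Python sketch. This is not the trained model: the sentence matrix and kernel weights below are hypothetical placeholders that would normally be learned and supplied by word embeddings.

```python
def conv_1max(sent_matrix, kernel, h):
    """Slide a window of h words over the n x k sentence matrix,
    take the dot product with a flattened h*k kernel, and apply
    1-max pooling over the resulting feature vector."""
    n = len(sent_matrix)
    feats = []
    for i in range(n - h + 1):
        window = [v for row in sent_matrix[i:i + h] for v in row]
        feats.append(sum(w * x for w, x in zip(kernel, window)))
    return max(feats)  # 1-max pooling: keep the strongest activation

def textcnn_features(sent_matrix, kernels):
    """One pooled feature per (window size, kernel) pair; the
    concatenated features would feed the fully connected layer."""
    return [conv_1max(sent_matrix, k, h) for h, k in kernels]

# Toy sentence of n=3 words with k=2-dimensional word vectors,
# already padded to a fixed length.
sent = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical kernels for window sizes h=2 and h=3 (flattened h*k weights).
kernels = [(2, [1.0, 1.0, 1.0, 1.0]), (3, [0.5] * 6)]
pooled = textcnn_features(sent, kernels)
```

Because each kernel is 1-max pooled, the length of `pooled` depends only on the number of kernels, not on the sentence length, which is what lets the fully connected layer accept sentences of varying lengths.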
Compared with traditional models, CNN models do not rely on carefully engineered features or complex natural language processing tools, and they offer a simple network structure, less computation, and faster training. By introducing pre-trained word vectors, the CNN model performs well on multiple datasets. The word vectors used in this study were produced by the Word2Vec model (Mikolov T et al. 2013). The model has two channels, one of static word vectors and one of dynamic word vectors: the static channel is kept fixed while the dynamic channel is fine-tuned through backpropagation during training. In this two-channel architecture, each filter is applied to both channels and the results are added.
2.2.2 Disaster Assessment Based on BP Neural Network Model
The back-propagation (BP) neural network continuously corrects the network weights and thresholds through training on sample data, so that the error function decreases along the negative gradient direction and approaches the desired output. It is a widely used neural network model, applied mainly to function approximation, pattern recognition and classification, data compression, and time-series prediction (YE X et al. 2011). The BP network consists of an input layer, one or more hidden layers, and an output layer. Figure 3 shows the three-layer m×L×n BP network model used in this study, in which the related factors form the input layer and the economic loss from the disaster forms the output layer. Establishing the BPNN model involves two main steps. (1) Model establishment and correlation analysis: train the network on sample data to determine its parameters and obtain a trained neural network. (2) Model evaluation: input new related-factor data to obtain the estimated economic loss from the output layer, then compare the estimated values with the observed values to evaluate the accuracy of the trained network.
To obtain a better fit and avoid overfitting, we validate the model with K-fold cross-validation: the training set is randomly divided into K groups, K−1 groups are used for modeling, the remaining group is used for prediction, and the predicted results are compared with the actual values. These steps are repeated until every sample has been predicted.
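The K-fold procedure above can be sketched as follows. This is a generic illustration rather than the study's exact implementation; the sample count, fold count, and random seed are arbitrary.

```python
import random

def kfold_splits(samples, k, seed=0):
    """Randomly partition the sample indices into K folds; each round
    uses K-1 folds for modeling and the remaining fold for prediction,
    so every sample is predicted exactly once across the K rounds."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield ([samples[j] for j in train_idx],
               [samples[j] for j in test_idx])

samples = list(range(10))                 # ten hypothetical training samples
splits = list(kfold_splits(samples, k=5))
```

Each of the 5 rounds trains on 8 samples and predicts the remaining 2, and the union of all test folds recovers the full sample set.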
In building a BP neural network model, determining the number of hidden neuron nodes (L) is very important. The following two empirical formulas are commonly used to calculate the number of hidden nodes (Zhuo et al. 2011).
Formula (1):
$$\text{L}=\sqrt{m+n}+a$$
Formula (2):
$$\text{L}=\sqrt{0.43mn+0.12{n}^{2}+2.54m+0.77n+0.35}+0.51$$
In these two formulas, the parameter m represents the number of neurons in the input layer, the parameter n represents the number of neurons in the output layer, and a is an empirical integer between 1 and 10.
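As a quick numeric check, the two formulas can be transcribed directly into code. The values of m, n, and a below are illustrative only, not the configuration used in this study.

```python
import math

def hidden_nodes_1(m, n, a):
    # Formula (1): L = sqrt(m + n) + a, with a an empirical integer in [1, 10]
    return math.sqrt(m + n) + a

def hidden_nodes_2(m, n):
    # Formula (2): L = sqrt(0.43mn + 0.12n^2 + 2.54m + 0.77n + 0.35) + 0.51
    return math.sqrt(0.43 * m * n + 0.12 * n ** 2
                     + 2.54 * m + 0.77 * n + 0.35) + 0.51

# Example: m = 8 input factors, n = 1 output (economic loss), a = 3
L1 = hidden_nodes_1(8, 1, 3)   # sqrt(9) + 3 = 6.0
L2 = hidden_nodes_2(8, 1)      # sqrt(25.0) + 0.51 = 5.51
```

The two estimates are then rounded to a nearby integer and compared empirically when selecting the final network size.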
The BP network uses the sigmoid transfer function \(\text{f}\left(\text{x}\right)=\frac{1}{1+{e}^{-x}}\) and the error function \(\text{E}=\frac{\sum _{i}{({t}_{i}-{O}_{i})}^{2}}{2}\) for backpropagation (where \({t}_{i}\) is the expected output and \({O}_{i}\) is the calculated output of the network). The BP neural network minimizes the error function E by continuously adjusting the network weights and thresholds.
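In code, the transfer function and error function are direct transcriptions of the two formulas (this is only the forward evaluation, not the full weight-update loop; the target and output values below are made up):

```python
import math

def sigmoid(x):
    """Transfer function f(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def bp_error(targets, outputs):
    """Error function E = sum_i (t_i - O_i)^2 / 2, which training
    drives toward a minimum by adjusting weights and thresholds."""
    return sum((t - o) ** 2 for t, o in zip(targets, outputs)) / 2.0

e_perfect = bp_error([1.0, 0.0], [1.0, 0.0])   # exact fit: E = 0
e_poor    = bp_error([1.0, 0.0], [0.5, 0.5])   # E = (0.25 + 0.25) / 2 = 0.25
```

Because the squared differences are non-negative, E reaches its minimum of zero exactly when every network output matches its expected value.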
This study uses the root mean square error (RMSE), mean absolute error (MAE), and coefficient of determination (\({R}^{2}\)) to evaluate the performance of the model. RMSE represents the standard deviation of the differences between the actual loss and the evaluation results, MAE represents the mean absolute difference between the actual loss and the evaluation results (Willmott and Matsuura 2005), and \({R}^{2}\) represents the proportion of the variance in the actual loss that is explained by the evaluation results (Draper and Smith 1998). The formulas for the three metrics are as follows:
$$\text{R}\text{M}\text{S}\text{E}=\sqrt{\frac{1}{n}{\sum }_{i=1}^{n}{({y}_{i}-{\tilde{y}}_{i})}^{2}}$$
$$\text{M}\text{A}\text{E}=\frac{1}{n}{\sum }_{i=1}^{n}|{y}_{i}-{\tilde{y}}_{i}|$$
$${R}^{2}=1-\frac{{\sum }_{i=1}^{n}{({y}_{i}-{\tilde{y}}_{i})}^{2}}{{\sum }_{i=1}^{n}{({y}_{i}-\stackrel{-}{y})}^{2}}$$
where \({y}_{i}\) is the actual damage, \({\tilde{y}}_{i}\) is the assessed damage, \(\stackrel{-}{y}\) is the mean value of the actual damage, and \(n\) is the number of samples used for calculating model performance.
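The three evaluation metrics can be computed directly from the formulas above. The transcription is straightforward; the sample damage values are made up for illustration.

```python
import math

def rmse(y, y_hat):
    """Root mean square error between actual and assessed damage."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))

def mae(y, y_hat):
    """Mean absolute error between actual and assessed damage."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)

def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = sum(y) / len(y)                          # mean of actual damage
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - y_bar) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot

# Hypothetical actual vs. assessed damage values
y     = [1.0, 2.0, 3.0]
y_hat = [1.0, 2.0, 3.0]   # a perfect assessment: RMSE = MAE = 0, R^2 = 1
```

A useful sanity check on \(R^2\): an assessor that always predicts the mean of the actual damage scores exactly 0, so positive values indicate the model explains variance that the mean alone cannot.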