Author: Brad Cable
Illinois State University
For Reproducible Research - Johns Hopkins University (Coursera)
Storms can do a lot of damage, but how much and what type? This document aggregates public storm data to show totals for public health impact (injuries and fatalities) and economic impact (property damage). The data used dates from approximately 1950 to 2007. Older records are not necessarily accurate, however, as reporting was not as robust as it is today.
library(ggplot2)
storm_data <- read.csv(file=bzfile("repdata-data-StormData.csv.bz2"))
The classification is based on the document “National Weather Service Instruction 10-1605”, published on August 17, 2007 by the Department of Commerce, National Oceanic and Atmospheric Administration, and the National Weather Service, and available for download at this location:
http://www.nws.noaa.gov/directives/
The following categories were taken for classification:
valid_events <- c(
"Astronomical Low Tide", "Avalanche", "Blizzard",
"Coastal Flood", "Cold/Wind Chill", "Debris Flow",
"Dense Fog", "Dense Smoke", "Drought",
"Dust Devil", "Dust Storm", "Excessive Heat",
"Extreme Cold/Wind Chill", "Flash Flood", "Flood",
"Frost/Freeze", "Funnel Cloud", "Freezing Fog",
"Hail", "Heat", "Heavy Rain",
"Heavy Snow", "High Surf", "High Wind",
"Hurricane (Typhoon)", "Ice Storm", "Lake-Effect Snow",
"Lakeshore Flood", "Lightning", "Marine Hail",
"Marine High Wind", "Marine Strong Wind", "Marine Thunderstorm Wind",
"Rip Current", "Seiche", "Sleet",
"Storm Surge/Tide", "Strong Wind", "Thunderstorm Wind",
"Tornado", "Tropical Depression", "Tropical Storm",
"Tsunami", "Volcanic Ash", "Waterspout",
"Wildfire", "Winter Storm", "Winter Weather"
)
First, we need to aggregate the data into usable data objects. The “FATALITIES” and “INJURIES” fields are summed into a single “HEALTH_IMPACT” value: the number of people whose health was impacted.
We also sum the “PROPDMG” values to generate a dollar value of property damage. Event types with neither health nor economic impact are dropped.
health_impact <- aggregate(FATALITIES + INJURIES ~ EVTYPE, storm_data, sum)
names(health_impact) <- c("EVTYPE", "HEALTH_IMPACT")
econ_impact <- aggregate(PROPDMG ~ EVTYPE, storm_data, sum)
names(econ_impact) <- c("EVTYPE", "ECON_IMPACT")
impact <- merge(health_impact, econ_impact)
impact <- impact[impact$HEALTH_IMPACT != 0 | impact$ECON_IMPACT != 0, ]
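The aggregation steps above can be tried on a tiny data frame first. This is only a sketch: the `toy` rows and their values are invented for illustration and are not taken from the storm data.

```r
# Toy data; rows and values are invented for illustration only.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "HAIL", "FOG"),
  FATALITIES = c(1, 0, 0, 0),
  INJURIES   = c(5, 2, 0, 0),
  PROPDMG    = c(25, 10, 3, 0),
  stringsAsFactors = FALSE
)

# Sum fatalities plus injuries per event type.
toy_health <- aggregate(FATALITIES + INJURIES ~ EVTYPE, toy, sum)
names(toy_health) <- c("EVTYPE", "HEALTH_IMPACT")

# Sum property damage per event type.
toy_econ <- aggregate(PROPDMG ~ EVTYPE, toy, sum)
names(toy_econ) <- c("EVTYPE", "ECON_IMPACT")

# Join on the shared EVTYPE column, then drop rows with no impact at all.
toy_impact <- merge(toy_health, toy_econ)
toy_impact <- toy_impact[toy_impact$HEALTH_IMPACT != 0 | toy_impact$ECON_IMPACT != 0, ]
print(toy_impact)
```

In this toy case the "FOG" row disappears because both of its totals are zero, while "HAIL" survives on property damage alone.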
The data is incredibly messy, so we now have to clean it up. To start, we just strip off extraneous characters and suffixes.
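string_cleaner <- function(x){
    # innermost gsub: uppercase, then keep only letters and digits
    # middle gsub: strip one trailing G-number (e.g. G45) or common suffix
    # outermost gsub: drop any digits that remain
    gsub("[^A-Z]", "",
        gsub(
            "(G[0-9]+|ADVISORY|EXPOSURE|DAMAGE|ITATION|S|ING|ACCIDENT|MISHAP)$", "",
            gsub("[^A-Z0-9]", "", toupper(x))
        )
    )
}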
impact$EVTYPE <- string_cleaner(impact$EVTYPE)
impact$EVTYPE_CLEAN <- rep(NA, nrow(impact))
valid_events_nonclean <- string_cleaner(valid_events)
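To see what the cleaner does, it can be applied to a few strings. The function body below is copied from the definition above so that the block runs on its own; the sample inputs are illustrative and are not claimed to be verbatim rows of the data set.

```r
# Copy of the cleaner defined above, repeated so this block is self-contained.
string_cleaner <- function(x){
  gsub("[^A-Z]", "",
    gsub(
      "(G[0-9]+|ADVISORY|EXPOSURE|DAMAGE|ITATION|S|ING|ACCIDENT|MISHAP)$", "",
      gsub("[^A-Z0-9]", "", toupper(x))
    )
  )
}

# Illustrative inputs (assumed examples, not quoted from the data).
samples <- c(
  "TSTM WIND (G45)",      # trailing G-number is stripped
  "Thunderstorm Winds",   # trailing S is stripped
  "Flash Flooding",       # trailing ING is stripped
  "Marine Mishap",        # trailing MISHAP is stripped
  "Hurricane (Typhoon)"   # only punctuation and spaces are removed
)
cleaned <- string_cleaner(samples)
print(data.frame(samples, cleaned))
```

Note that the suffix pattern is anchored with `$`, so only one suffix at the very end of each string is removed.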
From here, we have to create some new valid events with a one-to-one mapping. These are exceptions added as tweaks to the automated classifier below, to correct errors it would otherwise make.
This also introduces a few assumptions into the data: all landslide or rockslide events were assumed to be due to flooding; “DAMBREAK” was assumed to mean that a dam had broken and that the resulting damages were due to flooding; rogue and high waves were classified as high surf, while other sea-related damage was assumed to be the result of astronomical high tide; general marine accidents were classified as caused by high surf; and beach erosion was attributed to astronomical high tide.
# create new valid events (1:1 mapping)
valid_events <- c(valid_events,
"Wind", "Ice Storm", "Cold/Wind Chill",
"Ice Storm", "Flood", "Thunderstorm Wind",
"Excessive Heat", "Cold/Wind Chill", "Flood",
"Flood", "Flood", "Flood",
"Heat", "Heat", "Flood",
"Flood", "Thunderstorm Wind", "Thunderstorm Wind",
"High Surf", "High Surf", "Astronomical High Tide",
"Wind", "Marine Thunderstorm Wind", "High Surf",
"Astronomical High Tide", "Astronomical High Tide",
""
)
valid_events_nonclean <- c(valid_events_nonclean,
"BURST", "GLAZE", "HYPOTHERMIA",
"ICY", "SLIDE", "TSTMW",
"HYPERTHERMIA", "LOWTEMPERATURE", "DAMBREAK",
"SLUMP", "PRECIP", "RAINFALL",
"WARM", "WARMANDDRY", "URBANSMALL",
"RAPIDLYRISINGWATER", "TSTMWIND", "TUNDERSTORMWIND",
"ROGUEWAVE", "HIGHWAVE", "SEA",
"TURBULENCE", "MARINETSTMWIND", "MARINE",
"ASTRONOMICALHIGHTIDE", "BEACHEROSION",
"OTHER"
)
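Because the two vectors above are paired purely by position, a length mismatch would silently shift every mapping. A minimal sketch of a sanity check follows; the names `extra_labels`, `extra_keys`, and `lookup` are hypothetical and only three of the pairs are reproduced.

```r
# Three of the pairs from above, reproduced for illustration.
extra_labels <- c("Flood", "Heat", "")
extra_keys   <- c("DAMBREAK", "WARM", "OTHER")

# Parallel vectors must have the same length to stay aligned.
stopifnot(length(extra_labels) == length(extra_keys))

# A named vector makes each pairing explicit and easy to inspect.
lookup <- setNames(extra_labels, extra_keys)
print(lookup[["DAMBREAK"]])
```

The same `stopifnot()` check could be run on `valid_events` and `valid_events_nonclean` after they are extended.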
This code performs the exact matches between the cleaned reported event types and the cleaned valid categories.
matches <- match(impact$EVTYPE, valid_events_nonclean)
impact$EVTYPE_CLEAN <- valid_events[matches]
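As a small illustration of the exact-match step, `match()` returns the position of each value in a lookup vector, or `NA` when there is none, and those positions can then index a parallel vector of labels. The vectors `keys`, `labels`, and `reported` below are hypothetical stand-ins for the real ones.

```r
# Hypothetical cleaned keys and their display labels (parallel vectors).
keys   <- c("TORNADO", "HAIL", "TSTMWIND")
labels <- c("Tornado", "Hail", "Thunderstorm Wind")

# Cleaned values as they might appear in the data.
reported <- c("HAIL", "TSTMWIND", "THUNDERSTROMWNDTREE")

# Position of each reported value in keys; NA when there is no exact match.
idx <- match(reported, keys)

# Indexing with NA yields NA, so unmatched rows stay unclassified.
classified <- labels[idx]
print(classified)
```

The unmatched third value is left as `NA`, which is what allows a later pass to pick it up.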
The following function contains the main logic for the automated processing. In general, it takes a range of character counts and matches substrings of the messy data against substrings of the valid categories, based on a selection of rules. For instance, it can match the first six characters of “THUNDERSTROMWNDTREE” to “Thunderstorm Wind”. This fixes many typographical errors made in the original data and also makes it much easier to adjust classification rules on messy data.
submatches <- function(range,
    impact_startfun, impact_endfun,
    valid_startfun, valid_endfun
){
    # reads the global `impact`; assignments below modify a local copy only
    for(i in range){
        # only rows that are still unclassified
        nomat <- is.na(impact$EVTYPE_CLEAN)
        impact_evtype_nchar <- nchar(impact$EVTYPE[nomat])
        valid_events_nchar <- nchar(valid_events_nonclean)
        # compare a substring of each messy value with a substring of each valid key
        matches <- match(
            substr(
                impact$EVTYPE[nomat],
                impact_startfun(i, impact_evtype_nchar),
                impact_endfun(i, impact_evtype_nchar)
            ),
            substr(
                valid_events_nonclean,
                valid_startfun(i, valid_events_nchar),
                valid_endfun(i, valid_events_nchar)
            )
        )
        impact$EVTYPE_CLEAN[nomat] <- valid_events[matches]
    }
    # return the updated classification column
    impact$EVTYPE_CLEAN
}
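The core idea of the function can be sketched in a simplified form that only handles prefix-to-prefix matching. This is not a replacement for `submatches`; the vectors `keys`, `labels`, and `messy` are hypothetical, and the loop merely shows how shorter prefixes are tried only for values that are still unmatched.

```r
# Hypothetical cleaned keys, their labels, and some messy values.
keys   <- c("THUNDERSTORMWIND", "FLASHFLOOD", "HAIL")
labels <- c("Thunderstorm Wind", "Flash Flood", "Hail")
messy  <- c("THUNDERSTROMWNDTREE", "HAILSTORM", "XYZ")

result <- rep(NA_character_, length(messy))
for (n in 6:3) {
  # Only values that no longer prefix has matched so far.
  todo <- is.na(result)
  # Compare the first n characters of each messy value with those of each key.
  idx <- match(substr(messy[todo], 1, n), substr(keys, 1, n))
  result[todo] <- labels[idx]
}
print(result)
```

Here the first value matches at six characters ("THUNDE"), the second only once the prefix shrinks to four ("HAIL"), and the third never matches and stays `NA`.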
The following code segments run the automated classifier with different sets of parameters.
The first call matches the first 6, 5, 4, and then 3 characters of the messy data against the same number of leading characters of the valid categories.
impact$EVTYPE_CLEAN <- submatches(6:3,
function(i,nc){0}, function(i,nc){i},
function(i,nc){0}, function(i,nc){i}
)
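The remaining calls repeat the process with different anchors. The second call matches the last 6, 5, 4, 3 characters of the messy data against the beginning of the valid categories. The third matches the beginning of the messy data against the end of the valid categories. The fourth matches the ends of both, using only 6, 5, 4 characters.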
impact$EVTYPE_CLEAN <- submatches(6:3,
function(i,nc){nc-i+1}, function(i,nc){nc},
function(i,nc){0}, function(i,nc){i}
)
impact$EVTYPE_CLEAN <- submatches(6:3,
function(i,nc){0}, function(i,nc){i},
function(i,nc){nc-i+1}, function(i,nc){nc}
)
impact$EVTYPE_CLEAN <- submatches(6:4,
function(i,nc){nc-i+1}, function(i,nc){nc},
function(i,nc){nc-i+1}, function(i,nc){nc}
)
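The index functions passed in the calls above are easier to follow with concrete numbers. In this sketch, `nc - i + 1` to `nc` selects the last `i` characters of each string, and a start of `0` behaves like a start of `1` in `substr()`. The vectors `messy`, `keys`, and `labels` are hypothetical.

```r
# Hypothetical messy values and valid keys.
messy  <- c("COASTALFLOOD", "URBANFLOOD")
keys   <- c("FLOOD", "HAIL")
labels <- c("Flood", "Hail")

i  <- 5
nc <- nchar(messy)

# Last i characters of each messy value (substr is vectorised over start/stop).
tails <- substr(messy, nc - i + 1, nc)

# First i characters of each key; a start of 0 is treated like 1.
heads <- substr(keys, 0, i)

# Suffix of the messy value against prefix of the key, as in the second call.
suffix_matched <- labels[match(tails, heads)]
print(suffix_matched)
```

Both toy values end in "FLOOD", so both are classified as "Flood" even though their beginnings differ.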
A final re-aggregation is performed to make sure that the repeated values are summed together, since values such as “THUNDERSTROMWNDTREE” and “THUNDERSTORMWIND” are now on different rows with the same classification of “Thunderstorm Wind”.
impact$EVTYPE <- NULL
impact$EVTYPE_CLEAN[impact$EVTYPE_CLEAN == ""] <- "Other"
health_impact <- aggregate(HEALTH_IMPACT ~ EVTYPE_CLEAN, impact, sum)
econ_impact <- aggregate(ECON_IMPACT ~ EVTYPE_CLEAN, impact, sum)
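A toy version of this re-aggregation shows two details: rows sharing a cleaned label collapse into one total, and rows whose label is still `NA` are dropped, because the formula interface of `aggregate()` omits `NA` values by default. The data frame below is invented for illustration.

```r
# Invented rows: two share a label, one is blank, one is unclassified (NA).
toy <- data.frame(
  EVTYPE_CLEAN  = c("Thunderstorm Wind", "Thunderstorm Wind", "", NA),
  HEALTH_IMPACT = c(10, 4, 1, 2),
  stringsAsFactors = FALSE
)

# Blank labels become "Other"; which() skips the NA comparison result.
toy$EVTYPE_CLEAN[which(toy$EVTYPE_CLEAN == "")] <- "Other"

# Sum per label; the NA row is omitted by the default na.action.
toy_health <- aggregate(HEALTH_IMPACT ~ EVTYPE_CLEAN, toy, sum)
print(toy_health)
```

In the toy result, the two "Thunderstorm Wind" rows sum to 14 and the unclassified row does not contribute to any total.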
We sort the data in descending order so that we can generate some plots.
health_impact <- health_impact[order(-health_impact$HEALTH_IMPACT),]
econ_impact <- econ_impact[order(-econ_impact$ECON_IMPACT),]
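The sorting idiom can be checked on a small example. Negating a numeric column inside `order()` gives descending order, and `head()` is one way to take the top rows afterwards. The `toy` values are invented.

```r
# Invented totals for three event types.
toy <- data.frame(
  EVTYPE_CLEAN  = c("Hail", "Tornado", "Flood"),
  HEALTH_IMPACT = c(3, 40, 12),
  stringsAsFactors = FALSE
)

# order() on the negated column returns row indices from largest to smallest.
toy_sorted <- toy[order(-toy$HEALTH_IMPACT), ]

# Keep the two largest rows.
top2 <- head(toy_sorted, 2)
print(top2)
```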
Now we are ready to use these results.
Plotting the top eight data points, we can get the following results for health impact:
g <- ggplot(health_impact[1:8,], aes(x=factor(EVTYPE_CLEAN), y=HEALTH_IMPACT))
g + geom_bar(stat="identity") +
guides(fill="none") +
ggtitle("Storm Data Health Impact") +
xlab("Event Type") + ylab("Fatalities and Injuries")
Plotting the top eight for economic impact:
g <- ggplot(econ_impact[1:8,], aes(x=factor(EVTYPE_CLEAN), y=ECON_IMPACT/1000000))
g + geom_bar(stat="identity") +
guides(fill="none") +
ggtitle("Storm Data Economic Impact") +
xlab("Event Type") + ylab("Damages (millions of dollars)")
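One detail worth noting about both plots: `factor(EVTYPE_CLEAN)` assigns levels in alphabetical order, and a discrete axis follows the factor levels, so the bars are not drawn in ranked order even though the data frame is sorted. A possible adjustment, shown here in base R only and with invented values, is to set the levels explicitly from the sorted column; `EVTYPE_RANKED` is a hypothetical column name.

```r
# Invented data, already sorted by impact in descending order.
toy <- data.frame(
  EVTYPE_CLEAN  = c("Tornado", "Flood", "Hail"),
  HEALTH_IMPACT = c(40, 12, 3),
  stringsAsFactors = FALSE
)

# Default factor levels are alphabetical, regardless of row order.
default_levels <- levels(factor(toy$EVTYPE_CLEAN))

# Passing the sorted values as levels preserves the ranking.
toy$EVTYPE_RANKED <- factor(toy$EVTYPE_CLEAN, levels = toy$EVTYPE_CLEAN)
ranked_levels <- levels(toy$EVTYPE_RANKED)

print(default_levels)
print(ranked_levels)
```

Using a column like `EVTYPE_RANKED` as the `x` aesthetic would then be expected to draw the bars from highest to lowest.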
Based on this data, it appears that tornadoes do the most damage to both health and the economy. In health impact they lead every other type of storm event by a significant margin; on the economic scale, tornadoes are only slightly ahead of thunderstorms.