Challenge 1: Data Exploration

Submission 1

Make an argument for one quality that is different between good and evil superheroes. For example, you might say “Characters with red eyes are more likely to be evil”.

library(ggplot2)
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ tibble  2.1.3     ✓ dplyr   0.8.5
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ✓ purrr   0.3.3
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(ggcorrplot)

heroes_information  <- read.csv("heroes_information.csv")
super_hero_powers  <- read.csv("super_hero_powers.csv")

heroes_information <- heroes_information %>% 
  rename(
    hero_names = name
    )

heroes <- inner_join(heroes_information, super_hero_powers, by = "hero_names")
## Warning: Column `hero_names` joining factors with different levels, coercing to
## character vector
heroes <- inner_join(heroes_information, super_hero_powers, by = "hero_names")
## Warning: Column `hero_names` joining factors with different levels, coercing to
## character vector
heroes <- heroes %>% 
  mutate(
    Alignment2 = case_when(
     Alignment == "neutral" ~ "good", 
     Alignment == "good" ~ "good", 
     Alignment == "bad" ~ "bad", 
    )) %>% 
  filter(
    Alignment2 != "NA", 
    Gender != "-") 

heroes$Alignment2 <- as.factor(heroes$Alignment2)
heroes$Gender <- factor(heroes$Gender)

g <- ggplot(heroes, aes(Eye.color))
g + geom_bar(aes(fill=Alignment2), width = 0.5) + 
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
  labs(title="Barchart of eye color distribution for good and bad superheroes") 

# There looks like there is a significant difference between good and bad superhereos based in terms of number of superhereos with blue eyes. More good superheores appear to have blue eyes

heroes <- heroes %>%
  mutate(
    Eye.color2 = case_when(
     Eye.color == "blue" ~ "blue",
     TRUE ~ "non-blue")
    )

table(heroes$Eye.color2, heroes$Alignment2)
##           
##            bad good
##   blue      43  160
##   non-blue 146  287

Submission 2

library(tidyverse)
hero <- read.csv("heroes_information.csv", header = TRUE)
power <- read.csv("super_hero_powers.csv", header = TRUE)

Does Being Human or Not affect Alignment?

I decided to look at whether or not being human affected the alignment of the character. I thought this would be interesting because most of my favorite villains in Star Wars are not human. This led me to wonder if villains are more often not human in all universes. I only counted characters that were purely human. Humans that had alterations, mutations, or are clones did not count as human.

hero_2 <- hero %>%
  filter(Race != "-") %>%
  mutate(
    Human_or_not = case_when(
    Race != "Human" ~ "Other",
    TRUE ~ as.character(Race))
    )%>%
  mutate(
   Alignment_new = case_when(
     Alignment == "bad" ~ "bad",
     Alignment == "good" ~ "good"
   )
  ) %>%
  select(Alignment_new, Human_or_not) %>%
  na.omit(Alignment_new)
  

ggplot(hero_2, aes(x = Human_or_not, fill = Alignment_new)) +
  geom_bar() +
  annotate("text", x = 1, y = 175, label = "25.25% (50)") +
  annotate("text", x = 1, y = 75, label = "74.75% (148)") + 
  annotate("text", x = 2, y = 175, label = "34.83% (70)") + 
  annotate("text", x = 2, y = 75, label = "65.17% (131)") +
  ggtitle("Does Being Human Affect Alignment?") +
  xlab("Human or Not") +
  ylab("Count") +
  scale_fill_discrete(name = "Alignment")

Solely by looking at the stacked bar chart I made, it is evident that characters who are humans tend to be good more often than non-humans. Over 74% of humans are good while only aobut 65% of non-humans are good. However, I am not positive if this difference is actually signifcant.

Chi Square Test

chisq.test(table(hero_2$Alignment_new,
                 hero_2$Human_or_not))
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(hero_2$Alignment_new, hero_2$Human_or_not)
## X-squared = 3.9036, df = 1, p-value = 0.04818

I performed a Chi-Square Test of Significance to determine whether this difference is signficant. At a 5% level of significance there is enough evidence to conclude that whether or not the character is a human or not does have a signifcant effect on their alignment.

Submission 3

I am only including “good” and “bad” superheros since neutral/not listed make up < .05% of the sample

heros <- heros %>% filter(Alignment == "good" | Alignment == "bad")
overall <- prop.table(table(heros$Alignment))
overall 
## 
##       bad      good 
## 0.2944523 0.7055477

This sample of superheros is made up of 70.56% good heros and 29.44% evil ones. We are looking for a characteristic that shifts these proportions by a substantial amount. I feel like the general stereotypes of people with blonde hair and blue eye being either dumb helpless beauties (usually in the case of females) or strapping young lads with savior complexes. These stereotypes, coupled with the fact that many of these characters were created in the mid 1900s when the blonde hair, blue eyed family was the “perfect American family” on every advertisement, I hypothesize that if a character has blond hair and blue eyes (aka the “Aryan” race) then they are significantly more likely to be a good superhero than a bad one.

blond_blue <- table(heros$blond_blue, heros$Alignment)
bb_prop <- prop.table(blond_blue, margin = 1)

# Calcualting relative risk
rel_risk <- riskratio(blond_blue[2], blond_blue[1], blond_blue[2] + blond_blue[4], blond_blue[1] + blond_blue[3])
bb_prop
##      
##              bad       good
##   No  0.32154341 0.67845659
##   Yes 0.08641975 0.91358025
rel_risk["conf.int"]
## $conf.int
## [1] 0.1311881 0.5506205
## attr(,"conf.level")
## [1] 0.95

As we can see from the table above, of the superheros that have blonde hair and blue eyes, 91.36% of then are good while only 8.64% are bad. In addition, we are 95% confident that the risk of being evil for superheros who are blonde hair, blue eyed is between 44.94% and 86.88% smaller than for super heros with any other hair/eye color combination.

Submission 4

Setup

heroes = read.csv('heroes_information.csv', header = T)
heroes = na.omit(heroes)

df1 = heroes[heroes$Alignment == 'good' | heroes$Alignment == 'bad',]

Categorical Variable Analysis

dt_Race = ctree(Alignment ~ Race, data = df1)
plot(dt_Race)

dt_Publisher = ctree(Alignment ~ Publisher, data = df1)
plot(dt_Publisher)

Quantitative Variable Analysis

m_weight = glm(Alignment ~ Weight, data = df1, family = 'binomial')

summary(m_weight)
## 
## Call:
## glm(formula = Alignment ~ Weight, family = "binomial", data = df1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7027  -1.4474   0.7313   0.8559   1.5753  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.975960   0.090386  10.798  < 2e-16 ***
## Weight      -0.002084   0.000658  -3.167  0.00154 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 849.02  on 700  degrees of freedom
## Residual deviance: 838.77  on 699  degrees of freedom
## AIC: 842.77
## 
## Number of Fisher Scoring iterations: 4
binomial_smooth = function(...) {
  geom_smooth(method = "glm", method.args = list(family = "binomial"), ...)
}

ggplot(df1, aes(Weight, as.numeric(df1$Alignment) - 2)) + geom_point() + 
  binomial_smooth(se = F) + labs(y = 'Likelihood Good')
## `geom_smooth()` using formula 'y ~ x'

Conclusions

Based on the analysis of the categorical variables the two variables which produced noticeable difference between good and bad superheroes were Race and Publisher, all other categorical variables produced negligible differences or lacked statistical significance. The most compelling determinant of Alignment between these variables is race due in large part to its significantly bigger sample size. The most compelling quantitative variable is Weight which shows that a one unit increase in weight is associated with a 0.002084 decrease in the likelihood a person is good. Overall while Weight does appear to have some predictive power in determining if a hero is bad, I would still recommend Race as the better predictor.

Submission 5

heroes_information <- read.csv("C:/Users/Austin/STAT431/challenge1/heroes_information.csv")
super_hero_powers <- read.csv("C:/Users/Austin/STAT431/challenge1/super_hero_powers.csv")

Since its the first week and I didn’t budget enough time to really dig into this data, I’m going to have some fun with it.

Here is a link to one of my favorite scenes in the show Fleabag on Amazon: https://www.youtube.com/watch?v=q97iIDx-b7U

Hair is everything for normal people. Fleabag is right; it is the difference between a good day and a bad day. We let it control so much of our life even though we have so little control over it on a daily basis. At least once or twice a week people will wake up and their hair is out of control there’s nothing they can do about it except cope. Obviously, superheroes are not like you and I. But maybe their hair dictates whether they are good or evil.

First off, are balds more likely to be evil than good? Can someone who lost their hair resent the world and tend to be evil in part because of their lack of hair? On the surface, the most famous bald superheroes don’t trend towards good or evil. Professor X and Magneto are both bald but set at opposite sides of the good-evil spectrum. However, when we take a cursory look at the data we can glean some informaton.

heroes_information$baldies <- ifelse(heroes_information$Hair.color == 'No Hair','bald','not bald')
table(heroes_information$Alignment,heroes_information$baldies)
##          
##           bald not bald
##   -          0        7
##   bad       35      172
##   good      37      459
##   neutral    3       21

In this counts table it is obvious that balds tend to be more evil than their hair-ful counterparts. This is backed up by a chi-squared test comparing the two:

chisq.test(heroes_information$Alignment,heroes_information$baldies)
## Warning in chisq.test(heroes_information$Alignment, heroes_information$baldies):
## Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  heroes_information$Alignment and heroes_information$baldies
## X-squared = 15.146, df = 3, p-value = 0.001696

The p-value between the hair and the hairless shows that the balds tend to be more evil superheroes.

Another one I decided to look into was whether “blondes have more fun”. As a blonde myself I had a vested interest in this one. The thing about superheroes is that the good guys almost always win, which is inherentely more fun than losing. Thus, it would make sense that blondes would be more likely to be good guys. However, as a blonde, I’ve noticed we’ve had some more blonde villains recently. Joffrey, Draco Malfoy, and Dolores (Westworld) have led me to question if the blondes really do tend to be good guys?

heroes_information$blondes <- ifelse(heroes_information$Hair.color == 'Blond','blonde','not blonde')
table(heroes_information$Alignment,heroes_information$blondes)
##          
##           blonde not blonde
##   -            2          5
##   bad         11        196
##   good        85        411
##   neutral      1         23
chisq.test(heroes_information$Alignment,heroes_information$blondes)
## Warning in chisq.test(heroes_information$Alignment, heroes_information$blondes):
## Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  heroes_information$Alignment and heroes_information$blondes
## X-squared = 20.665, df = 3, p-value = 0.0001236

It appears that blondes are more likely to be good guys than bad guys in the comic book world. As a blonde its just nice to win one. It’s like the Yankees catching a break. /s

Submission 6

Clear workspace and access dplyr:

rm(list = ls())
library(dplyr)
library(ggplot2)
superheropowers <- read.csv(file = "/Users/maggiegreco/Desktop/super_hero_powers.csv")
heroes_information <- read.csv(file = "/Users/maggiegreco/Desktop/heroes_information.csv")
master_heroes_data <- merge(heroes_information, superheropowers)
master_heroes_data <- master_heroes_data %>% 
select(Alignment, Gender)  %>% 
filter(Gender != "-") %>% 
filter(Alignment != "neutral", Alignment != "-") %>% droplevels()
tbl <- table(master_heroes_data$Alignment, master_heroes_data$Gender)
chisq.test(tbl)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  tbl
## X-squared = 12069, df = 1, p-value < 2.2e-16
prop.test(tbl, alternative = "less")
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  tbl
## X-squared = 12069, df = 1, p-value < 2.2e-16
## alternative hypothesis: less
## 95 percent confidence interval:
##  -1.0000000 -0.1603233
## sample estimates:
##    prop 1    prop 2 
## 0.1750000 0.3375262
ggplot(data = master_heroes_data, aes(x = Gender, fill = Alignment)) + 
  geom_bar(position = position_dodge()) + 
  scale_fill_viridis_d(option = 8)+ 
  ylab("Count") + 
  ggtitle("Alignment of Superheroes\nBased on Gender")

Based on the Chi-Sq test statistic, we have sufficient evidence to support the claim that there is a difference in whether a superhero is good or not based on their gender.

The two proportion test confirms this and there is sufficient evidence to support the claim that male superheroes are more likely to be evil (33%) compared to (17.5%) of women.

Submission 7

Hypotheis: Shorter Characters Will Tend to Be More Evil

heros = read.csv("/Users/MarcoB/Downloads/superhero-set/heroes_information.csv")
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
group_by(heros, Alignment) %>%
  summarise(
    count = n(),
    mean = mean(Height, na.rm = TRUE),
    sd = sd(Height, na.rm = TRUE)
  )
## # A tibble: 4 x 4
##   Alignment count  mean    sd
##   <fct>     <int> <dbl> <dbl>
## 1 -             7 117.   149.
## 2 bad         207 106.   133.
## 3 good        496  98.9  138.
## 4 neutral      24 139.   209.

Height Based on ALignment Visual

library("ggpubr")
## Loading required package: ggplot2
## Loading required package: magrittr
ggboxplot(heros, x = "Alignment", y = "Height", 
          color = "Alignment",
          ylab = "Height", xlab = "Alignment")

Analysis Of Variance

res.aov <- aov(Height ~ Alignment, data = heros)

summary(res.aov)
##              Df   Sum Sq Mean Sq F value Pr(>F)
## Alignment     3    42397   14132   0.724  0.538
## Residuals   730 14247448   19517

Multiple Comparison

TukeyHSD(res.aov)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Height ~ Alignment, data = heros)
## 
## $Alignment
##                    diff        lwr       upr     p adj
## bad--        -11.743685 -149.98887 126.50150 0.9963041
## good--       -18.408093 -155.32953 118.51334 0.9857383
## neutral--     22.005952 -132.52074 176.53265 0.9831421
## good-bad      -6.664408  -36.43099  23.10217 0.9391094
## neutral-bad   33.749638  -43.82013 111.31941 0.6770911
## neutral-good  40.414046  -34.77118 115.59927 0.5097228

Based on the information shown above there is little to no evidence to show that evil characters are shorter on average and in fact the good characters tended to be the tallest on average while the nueteral were the tallest.

Submission 8

The Research Question: How are good and evil superheros different?

The Challenge: Find in the dataset the most interesting quality that tends to differentiate good and evil.

#read in data
library(tidyverse)
heroes <- read.csv("/Users/ariarango/Downloads/heroes_information.csv")

Superheroes that are evil tend to be taller than superheores that are good, on average.

compare <- heroes %>%
  select(Height, Alignment) %>%
  group_by(Alignment) %>%
  summarise(avg_height = mean(Height), median =  median(Height)) %>%
  filter(Alignment == "bad" | Alignment == "good")

compare
## # A tibble: 2 x 3
##   Alignment avg_height median
##   <fct>          <dbl>  <dbl>
## 1 bad            106.     180
## 2 good            98.9    173
ggplot(compare, aes(x = Alignment, y = avg_height, fill = Alignment)) +
  geom_bar(stat = "identity") +
  labs(title = "Average Height of Superheroes by Alignment", y = "Average Height(cm)", x = "Alignment")

Submission 9

First I limited the dataset to only heros/villans with a specified alignment

hero_infoGB <- hero_info  %>% 
  filter(Alignment == "bad" | Alignment == "good")

I saw there weren’t too many features in the heros information dataset so I decided I could go through each and figure out which I would be immidately disregard.

For height and weight, I plotted the distribution and seperated by alignment. The shape and the means of the distributions looked similar for both good and bad so I ignored those.

Skin Color had many missing values, so any association made would be very weak as it only pertained to a small fraction of the data.

Between the remaining variables (Hair color, Eye color, Race, Gender) I figured race and gender would be the best to test against alignment. This is because after removing all the missing values I can split up race into human/non-human and split up gender into male/female.

To see how related each category is to the Alignment category, I used Cramér’s V. Cramér’s V is a measure of association between two nominal variables. It varies from 0 to 1, with a 1 indicting a perfect association. So the category with the higher value will be more linked with being good or bad.

hero_infoGB_race <- hero_infoGB %>% 
  mutate(
    Race = case_when(
      Race == "Human" ~ "Human",
      TRUE ~ "non-human"
      )
    )
hero_infoGB_gender <- hero_infoGB %>% 
  filter(Gender != "-")
library(rcompanion)

cramerV(hero_infoGB_gender$Gender, hero_infoGB_gender$Alignment)
## Cramer V 
##   0.1635
cramerV(hero_infoGB_race$Gender, hero_infoGB_race$Alignment)
## Cramer V 
##    0.161

Even though both were not that high, gender managed to edge out race. If we now visualize the data, we can see that there are very few female villians, which is probably the biggest link to whether a given character will be good or bad.

hero_infoGB_gender %>% 
  ggplot() +
    geom_bar(aes(x = Gender, fill = Alignment))

Submission 10

hero <- read.csv("~/Downloads/heroes_information.csv")
hero1 <- hero %>% 
  filter(Weight >= 0, Height >= 0, Alignment != "neutral", Alignment != "-")%>%
  mutate(
  weight_kg = Weight * 0.454,
  height_m = Height / 100,
  bmi = weight_kg / height_m**2
  )

model <- lm(bmi ~ Alignment, data = hero1)
summary(model)
## 
## Call:
## lm(formula = bmi ~ Alignment, data = hero1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -23.46   -6.47   -2.79   -0.53 1110.70 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     29.012      5.248   5.529 5.38e-08 ***
## Alignmentgood  -16.412      6.244  -2.629  0.00885 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 61.65 on 468 degrees of freedom
## Multiple R-squared:  0.01455,    Adjusted R-squared:  0.01244 
## F-statistic:  6.91 on 1 and 468 DF,  p-value: 0.008855

With a F-statistic of 6.91 and a p-value of .008855, we can conclude that the overall model is significant. Thus, there is sufficient evidence that the Body Mass Index of a super-hero depends on if they are “good” or “evil”. Fortunately, good superheroes tend to have a lower Body Mass Index than bad superheroes.

plothero <- hero1 %>% mutate( Alignment = 
                      case_when(
                        Alignment == "good" ~ "Good",
                        Alignment == "bad" ~ "Bad"
                      )
)

ggplot(plothero, aes(x= Alignment, y = bmi, color = Alignment)) + geom_boxplot() + scale_y_continuous(limits=c(0, 55)) + ggtitle("Comparing the BMI of Bad and Good Superheroes") + xlab("Alignment") + ylab("Body Mass Index") + theme_classic()

The median BMI of bad superheroes is slightly higher than the median BMI of good superheroes. Similarly, the first and third quartile of the BMI for bad superheroes are higher than the first and third quartile of the BMI for good superheroes, respectively. The interquartile range for the BMI of bad superheroes is also higher than the interquartile range for the BMI of good superheroes. Both alignments contain many outliers in the data of their Body Mass Index, which is understandable since we are talking about fictional characters here!

Submission 11

Summaries

is_bald bad good
Not Bald 172 459
Bald 35 37

7.5 percent of good characters are bald, while 16.9 percent of bad characters are bald.

The top 5 superpowers that show up more among bad characters than good ones are:

Power bad good diff
Super Strength 0.6546392 0.4919540 0.1626852
Durability 0.4587629 0.3678161 0.0909468
Natural Weapons 0.1340206 0.0666667 0.0673540
Invulnerability 0.2113402 0.1494253 0.0619149
Shapeshifting 0.1391753 0.0781609 0.0610143

Plot

Conclusion

The Ultimate Villain

The Ultimate Villain

Submission 12

#Here I am creating a variable called Total_Power, that calculates the total number of powers each person in the dat set has 
total_powers <- powers %>%
  mutate(Total_Power = rowSums(powers[2:168])) %>%
  select(hero_names, Total_Power)%>%
  arrange(desc(Total_Power))
#Here I am joining the super power information with the information from the heroes characteristic document
#I am left joining to only include names that have a Total_Power calculated
joined_data <- left_join(total_powers, heros, by = c("hero_names" = "name"))



join_power <- joined_data %>%
  #here I am creating a new variable that adjusts Total_Power  by dividing by the person/creatures BMI
  #I divided the height to create inches and then calculated the BMI with Weight(lbs)/Height(inches)
  mutate(Power_Adj = Total_Power / Weight/(Height/30.48)) %>%
  select(hero_names, Power_Adj, Alignment) %>%
  #here I am filtering to only good and bad
  filter(Alignment == "good" | Alignment == "bad") %>%
  #here I am creating a numerical variable to be used later
  mutate(align = case_when(
    Alignment == "good" ~ 1, 
    Alignment == "bad" ~ 0))


#Here we are creating a linear model for the new variable vs. the Alignment 
power_model <- lm(Power_Adj ~ Alignment, data = join_power)
summary(power_model)
## 
## Call:
## lm(formula = Power_Adj ~ Alignment, data = join_power)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.77987 -0.01348 -0.00450  0.00626  0.52914 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   0.010170   0.003610   2.817  0.00500 **
## Alignmentgood 0.012639   0.004339   2.912  0.00371 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05016 on 625 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.01339,    Adjusted R-squared:  0.01181 
## F-statistic: 8.482 on 1 and 625 DF,  p-value: 0.003714
#Here we are creating a qqplot for the new variable vs. the Alignment
qqplot(y = join_power$Power_Adj, x= join_power$align)

From the Linear Model, we can see that Alignment[good] is significanlty significant, thus suggesting that the the number of Powers, adjusted for BMI, the person/creature has, has a signifcant impact on the Alignment of the person or creature.

Submission 13

heroes <- merge(super, powers,by="name")
fly <- heroes %>% 
  filter(Alignment == "good" | Alignment == "bad") %>% 
  select(Alignment, Flight, Telepathy, `Super Strength`) %>% 
  group_by(Flight) %>% 
  count(Alignment)

telepathy <- heroes %>% 
  filter(Alignment == "good" | Alignment == "bad") %>% 
  select(Alignment, Flight, Telepathy, `Super Strength`) %>% 
  group_by(Telepathy) %>% 
  count(Alignment)

strength <- heroes %>% 
  filter(Alignment == "good" | Alignment == "bad") %>% 
  select(Alignment, Flight, Telepathy, `Super Strength`) %>% 
  group_by(`Super Strength`) %>% 
  count(Alignment)
ggplot(fly, aes(x = Alignment, y = n, color = Flight)) +
  geom_line(aes(group = Flight)) +
  geom_point()  + 
  labs(title= "Dotplot of Number of Heroes based on Alignment and Flight Power")

# Plot for Telepathy power

ggplot(telepathy, aes(x = Alignment, y = n, color = Telepathy)) +
  geom_line(aes(group = Telepathy)) +
  geom_point()  + 
  labs(title= "Dotplot of Number of Heroes based on Alignment and Telepathy Power")

# Plot for Super Strength power

ggplot(strength, aes(x = Alignment, y = n, color = `Super Strength`)) +
  geom_line(aes(group = `Super Strength`)) +
  geom_point()  + 
  labs(title= "Dotplot of Number of Heroes based on Alignment and Super Strength Power")

Using R, I attempted to find main effects between the Alignment of the hero and these three powers: Flight, Telepathy, and Super Strength. To do this I first merged the two data sets and organized the new data set into three separate data sets based on the super power (fly, telepathy, strength). I then used ggplot and geom_point to produce a plot where a main effect can be seen or not be seen. Based on my plots above, there cannot be a significant main effect seen with the Flight and Telepathy power since the lines do not instersect. However, there can be a significant main effect seen with the Supper Strength power since the lines do intersect.

Submission 14

library(tidyverse)
heroes_info <- read_csv("heroes_information.csv", na = c("-", "NA", ""))
df_hero <- heroes_info %>% 
  # filter out all heroes that have no known alignment or are neutral since we are not interested in them for this analysis 
  filter(!is.na(Alignment)&Alignment != "neutral") %>% 
  # there were numerous enteries of supers that had negative height and weight, I will be replacing those with NA values since those are inaccurate.
  mutate(Height = replace(Height, Height < 0, NA), Weight = replace(Weight, Weight < 0, NA))  
  
  
df_hero %>% 
  head()
## # A tibble: 6 x 11
##      X1 name  Gender `Eye color` Race  `Hair color` Height Publisher
##   <dbl> <chr> <chr>  <chr>       <chr> <chr>         <dbl> <chr>    
## 1     0 A-Bo~ Male   yellow      Human No Hair         203 Marvel C~
## 2     1 Abe ~ Male   blue        Icth~ No Hair         191 Dark Hor~
## 3     2 Abin~ Male   blue        Unga~ No Hair         185 DC Comics
## 4     3 Abom~ Male   green       Huma~ No Hair         203 Marvel C~
## 5     4 Abra~ Male   blue        Cosm~ Black            NA Marvel C~
## 6     5 Abso~ Male   blue        Human No Hair         193 Marvel C~
## # ... with 3 more variables: `Skin color` <chr>, Alignment <chr>, Weight <dbl>
df_hero %>% 
  filter(!is.na(Gender)) %>% 
  group_by(Alignment, Gender) %>% 
  summarise_at(vars(Weight),mean, na.rm = TRUE)
## # A tibble: 4 x 3
## # Groups:   Alignment [2]
##   Alignment Gender Weight
##   <chr>     <chr>   <dbl>
## 1 bad       Female  115. 
## 2 bad       Male    144. 
## 3 good      Female   71.0
## 4 good      Male    108.
df_hero %>% 
  filter(!is.na(Gender)) %>% 
  ggplot(aes(x = Gender, y = Weight)) + 
  geom_boxplot(color = "gray47",fill = "firebrick") +
  facet_wrap(~ Alignment) +
  labs(title = "Superheroes Weights Groupped by Alignment", x = "Gender", y = "Weight (lbs)") +
  theme(
      plot.title = element_text(color="purple", size=16, face="bold.italic"),
      axis.title.x = element_text(color="blue", size=14, face="bold"),
      axis.title.y = element_text(color="red", size=14, face="bold")
    )

I found the variable that distingueshes good and bad the most was Weight. For my analysis I decided to break it down by Gender since some variation in weight is explained by gender. In the tibble above we see that the average weight we see that Female average weight for villians is on average 43 pounds more than the average female hero. And for the average Male villian we see they are on average 144 pounds and the average Male hero is on average 108 pounds which is 36 pounds less than villains. We see that villains are significantly heavier on average than their hero counterparts, however, this is not to say the heavier the super the more evil they are. This can be explained thinking logically as well. There are a significant amount more younger heroes than there are younger villians. This would result in a lower average weight, since there are some heroes that are as young as teenagers. We do not traditionally see villians that are this young, so it makes sense that these younger heroes bring the average weight of heroes. We can see in the plot that the 1st through 3rd quantiles of heroes fall in a much narrower Interquartile range than the villains which have a signifcantly wider IQR that starts at around the same level as the heroes but extends much further than the heroes. I think weight is a good indicator of whether a hero is good or bad, but don’t go around looking at heroes like the Hulk and The Thing saying I called them bad just because they’re a little heavier than the rest of the bunch.

Submission 15

#This chunk is for reading in and joining the two datasets together by the heroes' names.
library(tidyverse)
supers <- read.csv("heroes_information.csv")
powers <- read.csv("super_hero_powers.csv")
supers <- supers %>% 
  rename("hero_names" = name)
fulldata <- left_join(supers, powers, by = "hero_names")

After scouring the dataset for a long, long time, I tested many variables that would best explain the difference between good and evil superheroes. With so many variables to choose from, it was overwhelming and I wanted to run through a brick wall. I developed an insatiable thirst for finding that perfect quality. It was difficult to decide which one quality tends to differentiate good and evil; I thought my thirst would never be quenched. But then, as I read through the over 150 unique super powers, suddenly, I came across one that made me tremble to my very core and exclaim, “OHHH YEAHHH!

fulldata %>% 
  select(Alignment, Thirstokinesis) %>%
  filter(Thirstokinesis == 'True')
##   Alignment Thirstokinesis
## 1      good           True
fulldata %>% 
  select(Thirstokinesis, Alignment) %>% 
  filter(Thirstokinesis == 'True' | Thirstokinesis == 'False') %>% 
  ggplot() +
    geom_bar(mapping = aes(x = Thirstokinesis,
                           color = Thirstokinesis)) +
  ggtitle("Superheroes that have the ability of Thirstokinesis")

THIRSTOKINESIS, the ultimate superpower, is defined as the ability to control someone’s thirst. What remarkable, overwhelming power and responsibility to have… From the code chunk above, there appears to only be one superhero with the strength to yield this dominant, overwhelmingly superior power.

As a visual representation, the bar graph does a superb job at visualizing how much of a joke all the rest of these superheroes are in this dataset. What are they going to do mid-battle when all of a sudden, their mouth is parched?! They’re ruined. Clearly the being that created our universe only saw it fit to bestow this immense power unto one hero. And in that regard, we can see that thirstokinesis can, fortunately, predict with certainty that the hero is on the side of good. For if they were opposed to our society, all hell would break loose in our mouths and our chances of beating them would evaporate.

fulldata %>% 
  filter(Thirstokinesis == 'True') %>% 
  select(Gender, Eye.color, Hair.color, Skin.color)
##   Gender Eye.color Hair.color Skin.color
## 1   Male     black    No Hair        red

Red-skinned and a fiery attitude, causing the feeling of 1000 suns in the enemy’s mouth. Black, searing, eyes that can turn his foe’s chops bone-dry. Who is he you must be asking? Who is this hero???

fulldata %>% 
  select(hero_names, Thirstokinesis) %>% 
  filter(Thirstokinesis == 'True')
##     hero_names Thirstokinesis
## 1 Kool-Aid Man           True
The Kool Aid Man takes on the Nazis (colorized), Circa 1945

The Kool Aid Man takes on the Nazis (colorized), Circa 1945

Submission 16

Make an argument for one quality that is different between good and evil superheroes. For example, you might say “Characters with red eyes are more likely to be evil”.

library(ggplot2)
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ tibble  2.1.3     ✓ dplyr   0.8.5
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ✓ purrr   0.3.3
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(ggcorrplot)

heroes_information  <- read.csv("heroes_information.csv")
super_hero_powers  <- read.csv("super_hero_powers.csv")

heroes_information <- heroes_information %>% 
  rename(
    hero_names = name
    )

heroes <- inner_join(heroes_information, super_hero_powers, by = "hero_names")
## Warning: Column `hero_names` joining factors with different levels, coercing to
## character vector
heroes <- inner_join(heroes_information, super_hero_powers, by = "hero_names")
## Warning: Column `hero_names` joining factors with different levels, coercing to
## character vector
heroes <- heroes %>% 
  mutate(
    Alignment2 = case_when(
     Alignment == "neutral" ~ "good", 
     Alignment == "good" ~ "good", 
     Alignment == "bad" ~ "bad", 
    )) %>% 
  filter(
    Alignment2 != "NA", 
    Gender != "-") 

heroes$Alignment2 <- as.factor(heroes$Alignment2)
heroes$Gender <- factor(heroes$Gender)

g <- ggplot(heroes, aes(Eye.color))
g + geom_bar(aes(fill=Alignment2), width = 0.5) + 
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
  labs(title="Barchart of eye color distribution for good and bad superheroes") 

# There looks like there is a significant difference between good and bad superhereos based in terms of number of superhereos with blue eyes. More good superheores appear to have blue eyes

heroes <- heroes %>%
  mutate(
    Eye.color2 = case_when(
     Eye.color == "blue" ~ "blue",
     TRUE ~ "non-blue")
    )

table(heroes$Eye.color2, heroes$Alignment2)
##           
##            bad good
##   blue      43  160
##   non-blue 146  287

Submission 17

For my data exploration I wanted to see what the weight of a super person has to do with their alignment. To start I created a stacked boxplot of the weights of good and bad aligned supers.

super <- super %>% filter(Weight != -99) %>%
  filter(Alignment %in% c("bad", "good"))

ggplot(super, aes(Weight, Alignment)) + geom_boxplot()

From the graph we can see that the bad supers tend to weigh a slight bit more. How much more on average you ask?

good <- super %>% 
  filter(Alignment == "good")

bad <- super %>% 
  filter(Alignment == "bad")

mean_g <- mean((good$Weight))
mean_b <- mean(bad$Weight)

paste0("The mean Weight for good supers is ", mean_g, "and the mean Weight for bad supers is ", mean_b)
## [1] "The mean Weight for good supers is 95.5465465465466and the mean Weight for bad supers is 139.80985915493"

That is quite the difference! To estimate a 95% confidence interval of the true mean difference between the weight of good and bad supers we should do some bootstrapping!!

g_weight <- good %>%
  pull(Weight)

b_weight <- bad %>%
  pull(Weight)

booted_good <- replicate(1000, sample(b_weight, replace = TRUE), simplify = FALSE)
booted_bad <- replicate(1000, sample(b_weight, replace = TRUE), simplify = FALSE)

booted_g_ci <- data.frame(
  mean = map_dbl(booted_good, mean))  %>%
  pull(mean) %>%
  quantile(c(.025,.975))

booted_b_ci <- data.frame(
  mean = map_dbl(booted_bad, mean))  %>%
  pull(mean) %>%
  quantile(c(.025,.975))

paste0("We are 95% confident that the true mean weight of good supers is between ", booted_g_ci[1], " and ", booted_g_ci[2], " whereas the true mean weight of bad supers is between ", booted_b_ci[1], " and ", booted_b_ci[2])
## [1] "We are 95% confident that the true mean weight of good supers is between 121.869894366197 and 159.747007042254 whereas the true mean weight of bad supers is between 120.816197183099 and 159.887676056338"

This sadly did not reveal much information to us as it turned out that both good and bad supers have roughly the same 95% confidence interval for their true mean weights. That means you can’t judge the alignment of a super by their weight! Which is a good lesson for humanity but not exactly an interesting find in regards to the data

Submission 18

“‘R’ is among the most menacing of sounds. That’s why they call it Murder, not Muckduck.”

- Dwight Schrute, The Office

This is analysis is based off of this scene from The Office, an American mockumentary sitcom telivison series that depicts the everyday lives of office employees in the Scranton, Pennsylvania branch of the fictional Dunder Mifflin Paper Company.

In the scene, Dwight claims that “R” is the most menacing of sounds. On the other hand, I think that “Z” is the most menacing of sounds. Here, I look to see if the number of Rs or the number of Zs in an individual’s name is associated with being a super villain (i.e. with an alignment of “bad” instead of “good”).

heros_information <- read.csv('superhero-set/heroes_information.csv') %>%
  filter(Alignment %in% c('good','bad')) %>%
  mutate(Alignment = ifelse(Alignment == 'good', 1, 0),
         name = tolower(name))
name_information <- heros_information %>%
  select(name, Alignment) %>%
  mutate(
    n_z = str_count(name, 'z'),
    n_r = str_count(name, 'r'),
    name_length = str_length(name),
  )

Alignment: 1 = ‘good’, 0 = ‘bad’

model <- lm(Alignment ~ n_z + n_r + name_length, name_information)
summary(model)$coefficients
##                 Estimate  Std. Error    t value     Pr(>|t|)
## (Intercept)  0.663643113 0.049603899 13.3788498 1.648271e-36
## n_z         -0.249744412 0.066613862 -3.7491357 1.921200e-04
## n_r          0.010325519 0.024856917  0.4153982 6.779779e-01
## name_length  0.005239924 0.005451769  0.9611420 3.368131e-01

The p-value of 0.00019 for n_z lets us to conclude that the number of “Z”s in an individual’s name is associated with alignment. Specifically, “good” individuals are estimated to have 0.25 less “Z”s in their name than “bad” individuals, on average.

We cannot conclude that the number occurances of Rs in an individual’s name is associated with being evil (p-value = 0.689 > 0.5).

This leads me to conclude that Z is a more menacing sound than R, and Dwight was wrong.

The plots above shows that when looking at individuals with no “Z”s in their name, the majority are good. When looking at individuals with only 1 “Z” in their name, the group is approximately ballanced between good and evil. Finally, when looking at individuals with two “Z”s in their name, the group is mostly evil.

Submission 19

The Research Question: How are good and evil superheros different?

The Challenge: Find in the dataset the most interesting quality that tends to differentiate good and evil.

Details: Make an argument for one quality that is different between good and evil superheroes. For example, you might say “Characters with red eyes are more likely to be evil”.

# Read in data
heros <- read.csv("/Users/mvonsosen/Documents/STAT431/Challenge1/heroes_information.csv", stringsAsFactors = FALSE)
powers <- read.csv("/Users/mvonsosen/Documents/STAT431/Challenge1/super_hero_powers.csv", stringsAsFactors = FALSE)

#Join the two datasets together
powers = powers %>%
  rename(name = hero_names)
allheros = inner_join(heros, powers, by = "name")

#Split heros into good and bad to examine
good = allheros %>%
  filter(Alignment == "good")
bad = allheros %>%
  filter(Alignment == "bad")

Looking at the data between good and bad superheros and their qualities, I was curious as to if there was a difference between the combination of insanity and lack of empathy tied to alignment.

insaneandempathy = allheros %>%
  filter(Insanity == "True" & Empathy == "False")
table(insaneandempathy$Alignment)
## 
##     bad neutral 
##       4       1
#When filtering on heros who are conidered both insane and unempathetic, 4/5 were bad, which is not surprising, and only 1 was considered neutral. None of them were considered "good".

To dig deeper, I made a linear model with insanity, empathy, and I added the power of death, to predict alignment:

#Add 0/1 indicator for good or bad heros
for (i in 1:nrow(allheros)){
  if(allheros$Alignment[i] == "good"){
  allheros$goodorbad[i] = 0
    } else if (allheros$Alignment[i] == "bad"){
  allheros$goodorbad[i] = 1
  }
}

#Running a linear model for insanity, empathy, and the power of death touch as predictors:
linearMod = lm(allheros$goodorbad ~ allheros$Insanity + allheros$Empathy + allheros$Death.Touch)
summary(linearMod)
## 
## Call:
## lm(formula = allheros$goodorbad ~ allheros$Insanity + allheros$Empathy + 
##     allheros$Death.Touch)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8000 -0.2973 -0.2973  0.7027  0.9474 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               0.29730    0.01806  16.462   <2e-16 ***
## allheros$InsanityTrue     0.50270    0.20336   2.472   0.0137 *  
## allheros$EmpathyTrue     -0.24467    0.10547  -2.320   0.0207 *  
## allheros$Death.TouchTrue -0.01158    0.17215  -0.067   0.9464    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4529 on 656 degrees of freedom
## Multiple R-squared:  0.01748,    Adjusted R-squared:  0.01299 
## F-statistic:  3.89 on 3 and 656 DF,  p-value: 0.00898
#Insanity has the lowest p-value at 0.0137, so we will examine how this is differentiated between good or evil:

insane = allheros %>%
  filter(Insanity == "True")

notinsane = allheros %>%
  filter(Insanity == "False")

prop.table(table(insane$Alignment, insane$Insanity))
##          
##           True
##   bad      0.8
##   neutral  0.2
prop.table(table(notinsane$Alignment, notinsane$Insanity))
##          
##                False
##   -       0.01068702
##   bad     0.29007634
##   good    0.66412214
##   neutral 0.03511450

The tables show that that there are no “good” sueprheros considered insane. This provides evidence that one quality that is different between good and evil superheroes is the insanity factor - if a superhero is insane, you can be pretty positive they aren’t good, and they are most likely evil or neutral.

Submission 20

super <- read_csv("heroes_information.csv") %>% 
  filter(Alignment == "good" | Alignment == "bad" ) %>% 
  filter(Height > 0 & Weight > 0)

labels <- data.frame(Alignment = c("good", "bad"),
                     Means = c(95.62, 140.06),
                     Labels = c("Mean = 95.6", "Mean = 140.1")
                     )
super %>% 
  ggplot(aes(x = Weight)) +
  geom_histogram(mapping = aes(y = ..density..), fill = "cyan3") +
  geom_density( alpha = .5) +
  geom_vline(data = labels, aes(xintercept = Means), color = "Red") +
  geom_text(data = labels, aes(x = Means, y = .008, label = Labels), angle = 270, vjust = -1) +
  facet_wrap(~Alignment) +
  ggtitle("Good vs Bad Super-Heroes by Weight")

goodies <- super %>% filter(Alignment == "good")
badies <- super %>%  filter(Alignment == "bad")
t.test(badies$Weight, goodies$Weight)
## 
##  Welch Two Sample t-test
## 
## data:  badies$Weight and goodies$Weight
## t = 3.9593, df = 189.26, p-value = 0.0001063
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  22.29989 66.58111
## sample estimates:
## mean of x mean of y 
## 140.05797  95.61747

From the side-by-side histogram with density line plots there seems to be a trend of bad guys weighing more than good guys. This idea is further proven with the two-sample t-test which got a t-statistic 3.96 wich equates to a p-value of .00011. This means we have strong evidnence to conclude that bad guys tend to weigh more than good guys.

Submission 21

Libraries

library(tidyverse)
## -- Attaching packages ----------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.0     v purrr   0.3.3
## v tibble  3.0.0     v dplyr   0.8.5
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## -- Conflicts -------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Research Question: How are good and evil superheros different?

The Challenge: Find in the dataset the most interesting quality that tends to differentiate good and evil.


Website I found my data on: https://aces.nmsu.edu/pubs/_c/C316/welcome.html
Based on this website, “cool eye colors” in this data set are “black”, “blue”, “blue / white”, “green”, “green / blue”,“grey”, “purple”, and “silver.” The others would be considered “warm eye colors.” Using this, I will determine if there is an association between the alignment of a superhero and their type of eye color.


H_0: There is no association between alignment (good or bad superhero) and having a “cool eye color.” H_1: There is an association between alignment (good or bad superhero) and having a “cool eye color.”


Reading in csv file

path = "C:/Users/joshu/Desktop/Josh Stuff/Junior/Spring 2019 (Online)/STAT 431/Week 0/Challenge 1/superhero-set/heroes_information.csv"
hero_info = read.csv(path)

Manipulating Data set

# The colors of a "cool eye color"
cool_eye_colors = c("black", "blue", "blue / white", "green", "green / blue",
                    "grey", "purple", "silver")

# I extracted only the good or bad superheros and split the eye colors into "Cool" and "Warm"
eye_vs_alignment = hero_info %>%
  filter(Alignment == "good" | Alignment == "bad") %>%
  mutate(type_of_eyes = ifelse(Eye.color %in% cool_eye_colors, "Cool", "Warm")) %>%
  group_by(Alignment) %>%
  select(Alignment, type_of_eyes)

#Placing data into a table for Fisher's Exact Test
fisher.test(table(eye_vs_alignment))
## 
##  Fisher's Exact Test for Count Data
## 
## data:  table(eye_vs_alignment)
## p-value = 0.04623
## alternative hypothesis: two.sided

Interpretation: At the 5% significance level, we cannot conclude that there is no association between alignment and having “cool eye color.” In other words, there may be an association between the alignment of the superhero and type of eye color.

Graphing

# I made a new data set with the same information to get a table of the exact percent of alignment vs type of eyes

eye_percent = hero_info %>%
  filter(Alignment == "bad" | Alignment == "good") %>%
  mutate(cool_eyes = ifelse(Eye.color %in% cool_eye_colors, 1, 0)) %>%
  group_by(Alignment) %>%
  select(Alignment, cool_eyes) %>%
  summarize(cool_eyes = mean(cool_eyes)) %>%
  mutate(warm_eyes = 1 - cool_eyes)

eye_percent
## # A tibble: 2 x 3
##   Alignment cool_eyes warm_eyes
##   <fct>         <dbl>     <dbl>
## 1 bad           0.396     0.604
## 2 good          0.480     0.520

Graph for Cool Eyes

ggplot(eye_percent) + 
  geom_bar(aes(x = Alignment, y = cool_eyes ),stat = "identity") +
  labs(title = "Alignment vs Cool Eye Probability", 
       x = "Alignment", y = "Probability of Cool Eyes")

Graph for Warm Eyes

ggplot(eye_percent) + 
  geom_bar(aes(x = Alignment, y = warm_eyes ),stat = "identity") +
  labs(title = "Alignment vs Warm Eye Probability", 
       x = "Alignment", y = "Probability of Warm Eyes")

Submission 22

library(dplyr)
library(ggplot2)
info = readr::read_csv('heroes_information.csv')
## Warning: Missing column names filled in: 'X1' [1]
powers = readr::read_csv('super_hero_powers.csv')
supers = merge(info, powers, by.x = 'name', by.y = 'hero_names')
supers = mutate(supers, short = case_when(
  between(Height, 63, 165) ~ 1,
  !between(Height, 63, 165) ~0
))
supers = mutate(supers, good = case_when(
  Alignment == 'good' ~ 1,
  Alignment != 'good' ~ 0
))

For my alignment predicting quality, I decided that no rational publisher would make a medium short supervillian. I picked a height range that excluded those too short to see, and those tall enough to be threatening. This is the territory of Yoda, Baby Yoda, Rocket Raccoon, and super-children, not the Thanos’ of the world.

model = lm(data = supers, good~short)
supers$short = factor(supers$short, labels = c('Not Medium Short', 'Medium Short'))
summary(model)
## 
## Call:
## lm(formula = good ~ short, data = supers)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8654 -0.6414  0.3585  0.3585  0.3585 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.64145    0.01910  33.590  < 2e-16 ***
## short        0.22394    0.06803   3.292  0.00105 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4709 on 658 degrees of freedom
## Multiple R-squared:  0.0162, Adjusted R-squared:  0.0147 
## F-statistic: 10.83 on 1 and 658 DF,  p-value: 0.00105

As you can see, this height range has a statistically significant effect on a characters allignment. A character in this height range is predicted to be 22 percentage points more likely to be good than if they were any other height.

ggplot(data = supers) + geom_bar(aes(x=Alignment)) + facet_grid(cols = vars(short)) + ggtitle('Alignment Differences by Height')

While there are not many of them, they are pure and sweet relative to their tall counterparts. Based on the evidence I have laid out here, I believe the single best quality to determine a superheroe’s allignment is whether or not they fall into the cute height region.

Submission 23

Stat 431


The Dataset: https://www.kaggle.com/claudiodavi/superhero-set

The Research Question: How are good and evil superheros different?

The Challenge: Find in the dataset the most interesting quality that tends to differentiate good and evil.

## # A tibble: 2 x 3
##   Alignment AvgNumPowers MedNumPowers
##   <chr>            <dbl>        <dbl>
## 1 bad               9.37            8
## 2 good              8.90            7

Good guys tend to, each, have less powers then bad guys, even though the top handful of good guys have the most powers. This fits with the stereotypical plot and dynamic of “bad guys work alone” and it’s often a team of good guys with complementary strengths that work together to take down a bad guy.

So, be wary of an individual with many powers!

Submission 24

heroes_information <- read.csv("C:/Users/Austin/STAT431/challenge1/heroes_information.csv")
super_hero_powers <- read.csv("C:/Users/Austin/STAT431/challenge1/super_hero_powers.csv")

Since its the first week and I didn’t budget enough time to really dig into this data, I’m going to have some fun with it.

Here is a link to one of my favorite scenes in the show Fleabag on Amazon: https://www.youtube.com/watch?v=q97iIDx-b7U

Hair is everything for normal people. Fleabag is right; it is the difference between a good day and a bad day. We let it control so much of our life even though we have so little control over it on a daily basis. At least once or twice a week people will wake up and their hair is out of control there’s nothing they can do about it except cope. Obviously, superheroes are not like you and I. But maybe their hair dictates whether they are good or evil.

First off, are balds more likely to be evil than good? Can someone who lost their hair resent the world and tend to be evil in part because of their lack of hair? On the surface, the most famous bald superheroes don’t trend towards good or evil. Professor X and Magneto are both bald but set at opposite sides of the good-evil spectrum. However, when we take a cursory look at the data we can glean some informaton.

heroes_information$baldies <- ifelse(heroes_information$Hair.color == 'No Hair','bald','not bald')
table(heroes_information$Alignment,heroes_information$baldies)
##          
##           bald not bald
##   -          0        7
##   bad       35      172
##   good      37      459
##   neutral    3       21

In this counts table it is obvious that balds tend to be more evil than their hair-ful counterparts. This is backed up by a chi-squared test comparing the two:

chisq.test(heroes_information$Alignment,heroes_information$baldies)
## Warning in chisq.test(heroes_information$Alignment, heroes_information$baldies):
## Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  heroes_information$Alignment and heroes_information$baldies
## X-squared = 15.146, df = 3, p-value = 0.001696

The p-value between the hair and the hairless shows that the balds tend to be more evil superheroes.

Another one I decided to look into was whether “blondes have more fun”. As a blonde myself I had a vested interest in this one. The thing about superheroes is that the good guys almost always win, which is inherentely more fun than losing. Thus, it would make sense that blondes would be more likely to be good guys. However, as a blonde, I’ve noticed we’ve had some more blonde villains recently. Joffrey, Draco Malfoy, and Dolores (Westworld) have led me to question if the blondes really do tend to be good guys?

heroes_information$blondes <- ifelse(heroes_information$Hair.color == 'Blond','blonde','not blonde')
table(heroes_information$Alignment,heroes_information$blondes)
##          
##           blonde not blonde
##   -            2          5
##   bad         11        196
##   good        85        411
##   neutral      1         23
chisq.test(heroes_information$Alignment,heroes_information$blondes)
## Warning in chisq.test(heroes_information$Alignment, heroes_information$blondes):
## Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  heroes_information$Alignment and heroes_information$blondes
## X-squared = 20.665, df = 3, p-value = 0.0001236

It appears that blondes are more likely to be good guys than bad guys in the comic book world. As a blonde its just nice to win one. It’s like the Yankees catching a break. /s

Submission 25

Reading in data

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
heroes = read.csv("heroes_information.csv")
powers = read.csv("super_hero_powers.csv")
library(dplyr)
#renames hero names variable to make consistent to be joined
powers = powers %>%
  rename(name = hero_names)
#merge both data sets
heroes_villians = right_join(heroes, powers, by = "name")
## Warning: Column `name` joining factors with different levels, coercing to
## character vector
#seperate data sets into good superheros and evil villians
good = heroes_villians %>%
  filter(Alignment == "good")
bad =  heroes_villians %>%
  filter(Alignment == "bad")

EXPLORING DATA

good %>%
  group_by(Mind.Control) %>%
  summarise(n = n())
## # A tibble: 2 x 2
##   Mind.Control     n
##   <fct>        <int>
## 1 False          424
## 2 True            11
bad %>%
  group_by(Mind.Control) %>%
  summarise(n = n())
## # A tibble: 2 x 2
##   Mind.Control     n
##   <fct>        <int>
## 1 False          184
## 2 True            10

Analysis

mind_control = table(heroes_villians$Mind.Control,
                     heroes_villians$Alignment)
mind_control
##        
##           - bad good neutral
##   False   7 184  424      23
##   True    0  10   11       1
prop_mind_control = prop.table(mind_control, margin = 2)
prop_mind_control
##        
##                  -        bad       good    neutral
##   False 1.00000000 0.94845361 0.97471264 0.95833333
##   True  0.00000000 0.05154639 0.02528736 0.04166667
barplot(mind_control, main = "count of mind control",
     xlab = "alignment", ylab = "mind.control", 
     col = c("blue", "red"))

barplot(prop_mind_control, main = "proportion of mind control",
     xlab = "alignment", ylab = "mind.control", 
     col = c("blue", "red"))

Characters with mind control are more likely to be evil than good. Although in the data when you look at straight counts of number of characters with mind control there is actually 1 more good hero than evil villian as seen in the the first table and first bar plot. However when you look at the proportion of characters with mind control we see that evil guys are twice as likely to have mind control than heros.

Submission 26

I have a hypothesis that evil heros are typically depicted to be less fit. I will use a crude measure of fitness of bmi to assess this.

super <- read_csv("heroes_information.csv")

super <- super %>% 
  mutate(Weight = ifelse(Weight>0,Weight,NA),
         Height = ifelse(Height>0,Height,NA),
         bmi = (Weight) / (Height/100)^2) %>% 
  filter(Alignment != "-",
         Alignment != "neutral")

Lets just look at mean bmi. Also we can look at the distribution of people in each class of bmi (underweight, normal weight, overweight).

super %>%
  group_by(Alignment) %>%
  summarize(mean_bmi = mean(bmi,na.rm = T))
## # A tibble: 2 x 2
##   Alignment mean_bmi
##   <chr>        <dbl>
## 1 bad           63.9
## 2 good          27.8
super %>% 
  filter(bmi > 0) %>%
  group_by(Alignment) %>% 
  summarize(percentunderw = sum(bmi < 18.5) / n(),
            percentinrange = sum(bmi > 18.5 & bmi < 24.9) / n(),
            percentoverw = sum(bmi > 24.8) / n())
## # A tibble: 2 x 4
##   Alignment percentunderw percentinrange percentoverw
##   <chr>             <dbl>          <dbl>        <dbl>
## 1 bad              0.0725          0.261        0.674
## 2 good             0.0964          0.464        0.443

The mean is higher. Also good have a high percentage underweigth and in the normal range, and a lower percent overweight. Lets try to visualize this. Also there are a few extreme outliers that we should remove.

#taking out outliers
super = super %>% filter(bmi < 1000) %>% mutate(lign = ifelse(Alignment == "bad",1,0))
ggplot(super, aes(x = bmi, fill = Alignment)) + geom_histogram() -> p1
ggplot(super, aes(y=bmi)) + facet_grid(cols = vars(Alignment)) + geom_boxplot() + scale_y_continuous(limits = c(0,50)) -> p2
grid.arrange(p1, p2, ncol = 2)


The goal is to find a variable that is a predictor of alignment so lets do a logistic regression.

mlogit <- glm(lign ~ bmi, family = binomial, data = super)
summary(mlogit)$coefficients
##                Estimate  Std. Error   z value     Pr(>|z|)
## (Intercept) -1.56432601 0.209649683 -7.461619 8.546586e-14
## bmi          0.02196972 0.005895364  3.726610 1.940724e-04
grid = data.frame(bmi = seq(-100,250,.1))
grid$prob = predict(object = mlogit, grid, type = "response")

ggplot(grid, aes(bmi, prob)) + geom_point()

With a p value of less than .001 we can assert that bmi is a significant predictor of alignment. Heros with higher BMIs are more likely to be evil. Even though the probability changes very little between the meaningful ranges of bmi (10 to 50) we still believe that it is a significant predictor.+

Submission 27

Challenge 1: Data Exploration

4-12-2020

Argument: Evil characters have a higher probability of being overweight or obese compared to good characters.

  • My thoughts were that evil characters may be more likely than good characters to be portrayed with characteristics that some would describe as unsavory.
  • The unsavory charactersitic I explored was being overweight or obese, as classfied by body mass index (BMI). I calculated BMI for characters that had both their weight and height supplied in the data set using the calculation and classification for BMI described here: link. The original heroes_information.csv data was also filtered to only include good and bad heroes, as neutral and NA heroes are not applicable here.

There are higher counts of bad characters who are either obese or overweight (categorized by BMI) than good characters who are obese or overweight. This is the case despite there being more good characters than bad characters.

To solidfy my claim that evil characters have a higher probability of being overweight or obese compared to good characters, I ran two bootstrapping procedures, each with 1000 sampling replicates with replacement. The first produces the 95% CI for the percentage of good characters that are obese or overweight. The second produces this same 95% CI, but for bad characters.

##       2.5%      97.5% 
## 0.02710843 0.07228916
##       2.5%      97.5% 
## 0.08695652 0.21014493

The 95% CI for the percentage of bad characters that are obese or overweight contains higher percentages than the same 95% CI produced for good characters.

Stat 431