Biostatistics-Tasks (Ali Ranjbaranghaleh)

Task 1

Task 2 (page 42):

Question:

Assessing the efficiency of the new milking robot:

In another scenario, our current knowledge regarding milking duration for a cow indicates that it typically takes about 40 seconds to collect 1 kg of milk from a cow, with a population standard deviation of approximately 5 seconds. A new milking robot has been introduced on a farm with the aim of speeding up the milking process. Now, we need to determine the sample size required to test the hypothesis that milking efficiency has indeed improved. We want to confirm this when the actual mean time required to collect 1 kg of milk using the new robot is 39 seconds or less. The farmer desires a 90% power of the test and sets the significance level α at 0.05.

Answer:

For simplicity, I write the values first:

mean = 40 (sec)

sd = 5 (sec)

Δ = 1 (sec) (we want to detect a mean of 39 sec or less, so the difference to detect is 40 − 39 = 1 sec)

α = 0.05 (5% and it is a one-tailed test)

β = 0.1 (10%)

Formula: n = (sd² · (Zα + Zβ)²) / Δ²

((5^2) * ((qnorm(0.05)+qnorm(0.1))^2)) / (1^2)
[1] 214.0962

So in this case we need to round 214.0962 up: at least 215 cows are required to reach 90% power for detecting the improvement in milking efficiency.
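As a sanity check, the same planning can be done with base R's power.t.test (no extra packages). It uses the t distribution rather than the normal approximation, so the required n comes out slightly larger than the hand calculation:

```r
# Sample-size check with base R's power.t.test.
# It uses the t distribution, so n comes out slightly above the
# normal-approximation value of 214.1 computed above.
power.t.test(delta = 1, sd = 5, sig.level = 0.05, power = 0.9,
             type = "one.sample", alternative = "one.sided")
```

The returned n is just over 215, which confirms the hand calculation up to the small t-vs-normal correction.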

Task 2

Task 2 (page 46):

Question:

Determine the sample size for a two-sided test to detect a difference in the mean carcass length between two pig breeds of 0.5 cm or more with a desired power of 99.9%. Utilize the data in the file “swine-data.txt” to estimate the variance of the trait (column CarcassL).

First we import our data and load the library “tidyverse” for data-frame manipulation, because we need to filter down to 2 of the 3 pig breeds present in the file.

library(tidyverse)
Pigs = read.table("http://merlin.up.poznan.pl/~mcszyd/dyda/Experimental-Design/swine-data.txt", header=TRUE)

Answer:

We create a new data set called Pigs_2breeds that contains only the breeds PL and PLW. Since this is a two-sided test, the quantile in the formula is based on α/2, so we use qnorm(0.025). Moreover, we are planning for two independent samples, so the formula carries a factor of 2 in its first term:

mean = MCL

sd = SCL

Δ = 0.5 (cm) (the smallest difference in mean carcass length we want to detect)

α = 0.05 (5%, two-sided test) -> α/2 = 0.025

β = 0.001 (0.1%)

Formula: n = (2 · sd² · (Zα/2 + Zβ)²) / Δ²

Pigs_2breeds <- Pigs |> 
  filter(Breed == "PL" | Breed == "PLW")
Pigs_2breeds$CarcassL -> CL
mean(CL) -> MCL
sd(CL) -> SCL
qnorm(0.025)  -> Z1
qnorm(0.001) -> Z2
n <- (2*(SCL^2) * ((Z1+Z2)^2))/(0.5)^2
n
[1] 651.8972

To conduct the test under the conditions asked in the question we need 652 pigs per breed (652 for the PL breed and 652 for the PLW breed).
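This plan can also be cross-checked with base R's power.t.test. The sd of CarcassL is not printed above, but it is implied by the result (n = 651.9 with Δ = 0.5 gives SCL ≈ 1.79 cm); it is hard-coded below as an assumption so the snippet runs on its own:

```r
# Cross-check with base R's power.t.test for two independent samples.
# SCL is the sample sd of CarcassL; 1.7875 is the value implied by the
# calculation above (hard-coded here to keep the snippet self-contained).
SCL <- 1.7875
power.t.test(delta = 0.5, sd = SCL, sig.level = 0.05, power = 0.999,
             type = "two.sample", alternative = "two.sided")
```

The t-based n per group lands just above the 651.9 from the normal-approximation formula, so the conclusion of roughly 652 pigs per breed stands.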

Task 3

Task 4 (page 46):

Question:

Utilize the data in the file ‘swine-data.txt’ to test the research hypothesis that the mean back fat thickness at sacrum point I is different from the mean back fat thickness at sacrum point II (columns BFT3 and BFT4, respectively).

Answer:

We already have the data frame Pigs; we filter out each breed and test the hypothesis within each breed separately. Here are our hypotheses:

H0 : mean back fat thickness at BFT3 and BFT4 are equal (mean(BFT3)=mean(BFT4))

H1: mean(BFT3) != mean(BFT4) (are not equal)

We use t.test for this. Since BFT3 and BFT4 are measurements taken on the same individual pig at two different sites (T3 and T4), the samples are paired, so we perform a paired t-test:

PLBreed <- Pigs |>
   filter(Breed == "PL")
# H0 : Mu 3 = Mu4 
# H1 : Mu 3 != Mu4
BFT3 <- PLBreed$BFT3
BFT4 <- PLBreed$BFT4
t.test(BFT3, BFT4, paired = TRUE)

    Paired t-test

data:  BFT3 and BFT4
t = 33.267, df = 242, p-value < 2.2e-16
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 0.5869275 0.6608091
sample estimates:
mean difference 
      0.6238683 

In the PL breed, the difference in back fat thickness between T3 and T4 is highly statistically significant, t(242) = 33.27, p < .001.

(it means we reject H0 and we assume H1 is correct)

Now we do it for other breeds of pigs:

PLWBreed <- Pigs |>
   filter(Breed == "PLW")
# H0 : Mu 3 = Mu4 
# H1 : Mu 3 != Mu4
BFT3_PLW <- PLWBreed$BFT3
BFT4_PLW <- PLWBreed$BFT4
t.test(BFT3_PLW, BFT4_PLW, paired = TRUE)

    Paired t-test

data:  BFT3_PLW and BFT4_PLW
t = 32.124, df = 191, p-value < 2.2e-16
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 0.5797796 0.6556370
sample estimates:
mean difference 
      0.6177083 
L990Breed <- Pigs |>
   filter(Breed == "L990")
# H0 : Mu 3 = Mu4 
# H1 : Mu 3 != Mu4
BFT3_L990 <- L990Breed$BFT3
BFT4_L990 <- L990Breed$BFT4
t.test(BFT3_L990, BFT4_L990, paired = TRUE)

    Paired t-test

data:  BFT3_L990 and BFT4_L990
t = 37.142, df = 243, p-value < 2.2e-16
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 0.7009102 0.7794176
sample estimates:
mean difference 
      0.7401639 

As we can see, both remaining breeds give the same result as the first breed (PL): H0 is rejected in every breed.
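Instead of repeating the same block three times, the per-breed paired tests can be collected in one pass. A minimal sketch (paired_bft_tests is a hypothetical helper, not from the exercise; it assumes a data frame like Pigs with columns Breed, BFT3 and BFT4):

```r
# Run the paired t-test within each breed and return the p-values,
# named by breed. Assumes columns Breed, BFT3 and BFT4.
paired_bft_tests <- function(df) {
  sapply(split(df, df$Breed),
         function(d) t.test(d$BFT3, d$BFT4, paired = TRUE)$p.value)
}

# Usage, with the Pigs data frame loaded as above:
# paired_bft_tests(Pigs)
```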

Task 4

Task 3 (page 59):

Question:

Farmers around the world have been asked about their opinion on GMO. Is there any dependency between opinion and geography?

The table is given in the booklet; first we need to enter it into R:

Favour <- c(24,40,16)
Dont_Favour <- c(27,45,18)
Undecided <- c(9,15,6)
opinions_table <- data.frame(Favour, Dont_Favour, Undecided, row.names = c("Americas", "Europe", "Asia"))
opinions_table
         Favour Dont_Favour Undecided
Americas     24          27         9
Europe       40          45        15
Asia         16          18         6

Answer:

First we need to set up our hypotheses and choose a test method. In this case a chi-square test of independence is appropriate:

H0: opinion and geography are independent

H1: opinion and geography are dependent (associated)

and now we test our hypothesis:

chisq.test(opinions_table)

    Pearson's Chi-squared test

data:  opinions_table
X-squared = 0, df = 4, p-value = 1

A chi-square test of independence showed that there was no significant association between opinion and geography, X2(4, N = 200) = 0, p = 1.
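A statistic of exactly zero with p = 1 is no accident here: every region's counts are in the same 8 : 9 : 3 ratio (e.g. 24 : 27 : 9 for the Americas), so the observed table coincides with the expected table. We can verify this by inspecting the expected counts:

```r
# Rebuild the opinion table and look at the expected counts under
# independence; they match the observed counts exactly, which is why
# X-squared is 0.
opinions <- matrix(c(24, 27, 9,
                     40, 45, 15,
                     16, 18, 6),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(c("Americas", "Europe", "Asia"),
                                   c("Favour", "Dont_Favour", "Undecided")))
chisq.test(opinions)$expected
```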

Task 5

Task 4 (page 59):

Question:

In some local consumer tests, it was shown that the optimal percentage of intramuscular fat (IMF) is between 1-2%. Analyse the data on IMF from the ‘swine-data.txt’ file. Classify pigs into two groups according to optimal and non-optimal IMF. Is there any contingency between this classification and breed? Note, missing observation on IMF is denoted by ‘0’.

Answer:

As we already have the data frame, let’s look at it first and see what we have in Breed and IMF:

1- First we should delete the records whose IMF column has a 0 value, since these are missing observations rather than actual measurements and would distort our test:

We can easily filter the rows with an IMF != 0 condition.

2- we want to classify IMF into 2 groups : optimal and non-optimal

for this purpose we should mutate a new column that has the new classification and we can use case_when function easily to set our conditions.

Pigs |>
  select(Breed,IMF) |>
  filter(IMF != 0 ) |>
  summarise(length(Breed))
  length(Breed)
1           369

So far so good! :) We have a total of 369 records with a non-missing IMF in our table.

Now, for step 2, we mutate the new classification column, using case_when to set our conditions:

Pigs |>
  select(Breed,IMF) |>
  filter(IMF != 0 ) |>
  mutate(classification = case_when(IMF >= 1 & IMF <= 2 ~ "OP",
                                    TRUE ~ "NOP")) |>
  head(n = 10)
   Breed  IMF classification
1   L990 1.14             OP
2   L990 1.89             OP
3   L990 1.89             OP
4    PLW 1.89             OP
5    PLW 2.22            NOP
6   L990 1.57             OP
7   L990 1.17             OP
8   L990 2.21            NOP
9   L990 1.28             OP
10  L990 2.37            NOP
# based on head(n = 10) we can see first 10 rows to check if we have done the classification correctly.

Perfect!

3- Now that we have everything, we can build the contingency table for the chi-square test of independence. First we count OP and NOP per breed and reshape the data frame into a suitable format for the test:

Pigs |>
  select(Breed,IMF) |>
  filter(IMF != 0 ) |>
  mutate(classification = case_when(IMF >= 1 & IMF <= 2 ~ "OP",
                                    TRUE ~ "NOP")) |>
  count(Breed,classification,name = "numbers") |>
  pivot_wider(names_from = classification, values_from = numbers) |>
  column_to_rownames("Breed")
     NOP OP
L990  61 95
PL    56 51
PLW   50 56

Now that we have the data frame ready for our chi-square test of independence, we save it as Breed_Class and conduct the test. Here are our hypotheses:

H0: Breed and the classification of IMF (optimal vs non-optimal) are independent

H1: Breed and the classification of IMF are dependent

Breed_Class <- Pigs |>
  select(Breed,IMF) |>
  filter(IMF != 0 ) |>
  mutate(classification = case_when(IMF >= 1 & IMF <= 2 ~ "OP",
                                    TRUE ~ "NOP")) |>
  count(Breed,classification,name = "numbers") |>
  pivot_wider(names_from = classification, values_from = numbers) |>
  column_to_rownames("Breed")
chisq.test(Breed_Class)

    Pearson's Chi-squared test

data:  Breed_Class
X-squared = 4.7061, df = 2, p-value = 0.09508

A chi-square test of independence was performed to assess the relationship between IMF classification and pig breed; there was no significant relationship between the two variables, X2(2, N = 369) = 4.71, p = .095.
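For comparison, the same table can be built without tidyverse. The sketch below (imf_table is a hypothetical helper, not part of the exercise) classifies IMF with a plain ifelse and cross-tabulates with table(); it assumes a data frame like Pigs with columns Breed and IMF:

```r
# Base-R version of the pipeline above: drop missing IMF (coded 0),
# classify into optimal (OP, 1-2%) vs non-optimal (NOP), tabulate by breed.
imf_table <- function(df) {
  d <- df[df$IMF != 0, ]
  cls <- ifelse(d$IMF >= 1 & d$IMF <= 2, "OP", "NOP")
  table(d$Breed, cls)
}

# chisq.test(imf_table(Pigs)) would then reproduce the test above.
```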