R has different ways of keeping a record of code and analysis, such as R Shiny or R Markdown. This analysis project is recorded using R Markdown. The analysis is split into code chunks which reference main.R, and each chunk has a brief explanation of what it does, just like this one! Calling the source() function allows me to grab code from a different file called main.R. The knitr package neatly incorporates and formats code into text and can be used to specify settings that apply to all code chunks or just a specific one. This file can be processed into different document formats such as Word, PowerPoint, or webpages. For the sake of this example, I have decided to report this code in HTML.
source("main.R") # load all our wonderful functions
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
knitr::opts_chunk$set(echo = TRUE, fig.height = 10)
Load the data with load_data() and store it in a variable

The original data was in .rda format; .rda files allow users to save R data structures such as vectors, matrices, and data frames. Saving and loading data in .rda format can be very useful when working with large datasets that you want to clear from memory but would also like to keep for later. In this example, I loaded the .rda file using the load() function and saved it as a CSV file called ELISA.csv.
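As a hedged illustration of that conversion step (the .rda file name and the object name elisa are assumptions; load() restores whatever objects were saved in the file):

load("ELISA.rda")              # restore the saved R object(s) into the session (file name assumed)
write_csv(elisa, "ELISA.csv")  # keep a CSV copy for load_data() to read (object name assumed)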
Below, ELISA.csv is passed to load_data() in the main.R file, which loads the CSV into a tibble and then merges the PlateDay and Read columns into a column called platedayNO. Looking at the first couple of rows in the dataset, duplicates are replicated row-wise…
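For reference, here is a minimal sketch of what load_data() in main.R might look like; the exact body lives in that file, so treat this as an assumption.

# Hypothetical sketch of load_data(): read the CSV into a tibble and unite
# the PlateDay and Read columns into a single platedayNO column.
load_data <- function(path) {
  read_csv(path) %>%
    unite("platedayNO", PlateDay, Read, sep = " ")
}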
datax <- load_data("ELISA.csv")
## Rows: 504 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): PlateDay, Description
## dbl (3): Read, Concentration, Signal
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
library(DT)
datatable(datax)
Because the data is formatted row-wise, it is difficult to analyze and to compute means for the duplicates. This function converts the data by pivoting it wider and assigning each duplicate its own column, such as Concentration_A and Concentration_B.
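For context, here is a minimal sketch of what data_wrangle() in main.R might do; the body below is an assumption (the real function evidently passes the external colnames vector straight to select(), which is what triggers the tidyselect warning shown after the call).

# Hypothetical sketch of data_wrangle(): tag each row with its replicate group
# and letter, keep the requested columns, then pivot wider so each duplicate
# pair becomes one row with *_A and *_B columns.
data_wrangle <- function(df, gr_dup, gr_AB, colnames) {
  df %>%
    mutate(grp_dup = gr_dup, rep = gr_AB) %>%
    select(grp_dup, rep, all_of(colnames)) %>%
    pivot_wider(names_from = rep, values_from = all_of(colnames))
}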
gr_dup <- rep(1:(nrow(datax)/2), each = 2) #group technical replicates together
gr_AB <- rep(c('A', 'B'), len = nrow(datax)) #separate replicate sets called A and B
colnames <- c('platedayNO', 'Description', 'Concentration','Signal') # col names to keep
datax <- data_wrangle(datax, gr_dup, gr_AB, colnames)
## Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
## ℹ Please use `all_of()` or `any_of()` instead.
## # Was:
## data %>% select(colnames)
##
## # Now:
## data %>% select(all_of(colnames))
##
## See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
datax <- datax %>%
rename(
concdil_A = Concentration_A,
concdil_B = Concentration_B)
datax
## # A tibble: 252 × 7
## grp_dup platedayNO_A Description_A concdil_A concdil_B Signal_A Signal_B
## <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 Plate 1 (Day 1) 1 Standard 500 500 2.75 2.59
## 2 2 Plate 1 (Day 1) 1 Standard 200 200 1.38 1.35
## 3 3 Plate 1 (Day 1) 1 Standard 80 80 0.713 0.746
## 4 4 Plate 1 (Day 1) 1 Standard 32 32 0.462 0.466
## 5 5 Plate 1 (Day 1) 1 Standard 12.8 12.8 0.375 0.356
## 6 6 Plate 1 (Day 1) 1 Standard 5.12 5.12 0.335 0.335
## 7 7 Plate 1 (Day 1) 1 Standard 2.05 2.05 0.318 0.287
## 8 8 Plate 1 (Day 1) 1 BLANK 0 0 0.284 0.295
## 9 9 Plate 1 (Day 1) 1 Quality Cont… 312. 312. 1.90 1.97
## 10 10 Plate 1 (Day 1) 1 Quality Cont… 156. 156. 1.01 1.07
## # ℹ 242 more rows
The header explains it all: the next chunk computes the mean signal for each duplicate pair by grouping on grp_dup and averaging Signal_A and Signal_B.
datax <- datax %>%
group_by(grp_dup) %>%
mutate(mean_Signal = mean(c(Signal_A, Signal_B)))
datax
## # A tibble: 252 × 8
## # Groups: grp_dup [252]
## grp_dup platedayNO_A Description_A concdil_A concdil_B Signal_A Signal_B
## <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 Plate 1 (Day 1) 1 Standard 500 500 2.75 2.59
## 2 2 Plate 1 (Day 1) 1 Standard 200 200 1.38 1.35
## 3 3 Plate 1 (Day 1) 1 Standard 80 80 0.713 0.746
## 4 4 Plate 1 (Day 1) 1 Standard 32 32 0.462 0.466
## 5 5 Plate 1 (Day 1) 1 Standard 12.8 12.8 0.375 0.356
## 6 6 Plate 1 (Day 1) 1 Standard 5.12 5.12 0.335 0.335
## 7 7 Plate 1 (Day 1) 1 Standard 2.05 2.05 0.318 0.287
## 8 8 Plate 1 (Day 1) 1 BLANK 0 0 0.284 0.295
## 9 9 Plate 1 (Day 1) 1 Quality Cont… 312. 312. 1.90 1.97
## 10 10 Plate 1 (Day 1) 1 Quality Cont… 156. 156. 1.01 1.07
## # ℹ 242 more rows
## # ℹ 1 more variable: mean_Signal <dbl>
In order to account for background noise and extract a true absorbance value, a blank correction has to be applied. Below is an embedded R code chunk doing just that: the dataset datax is passed to blank_correction(), and the function subtracts the blank from the remaining signal values.
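A minimal sketch of what blank_correction() in main.R might do; the body below, which subtracts each plate's BLANK row column by column, is an assumption.

# Hypothetical sketch of blank_correction(): within each plate, subtract the
# BLANK row's values from every signal column on that plate.
blank_correction <- function(df) {
  df %>%
    group_by(platedayNO_A) %>%
    mutate(
      Signal_A    = Signal_A    - Signal_A[Description_A == "BLANK"],
      Signal_B    = Signal_B    - Signal_B[Description_A == "BLANK"],
      mean_Signal = mean_Signal - mean_Signal[Description_A == "BLANK"]
    )
}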
datax <- blank_correction(datax)
datatable(datax)
To find the best-fitting line for the standard set, a linear regression model is fit using the lm() function in R, which estimates the coefficients with minimal error. The code block below fits a linear model y ~ x against the data, where x is the predictor variable and y is the outcome variable.
Predictor Variable X = Theoretical Concentration
Outcome Variable Y = Mean Signal
The resulting table shows the intercept (b) and slope for each plate.
standards <- datax %>%
filter(Description_A == 'Standard') %>%
group_by(platedayNO_A)
lin_mod <- standards %>%
do(lin_coeff = coef(lm(mean_Signal ~ concdil_A, data = .))) %>%
summarise(plateday_lm = platedayNO_A, b=lin_coeff[1], slope=lin_coeff[2])
lin_mod
## # A tibble: 12 × 3
## plateday_lm b slope
## <chr> <dbl> <dbl>
## 1 Plate 1 (Day 1) 1 0.0341 0.00477
## 2 Plate 1 (Day 1) 2 0.0358 0.00476
## 3 Plate 1 (Day 1) 3 0.0359 0.00476
## 4 Plate 2 (Day 1) 1 0.0238 0.00478
## 5 Plate 2 (Day 1) 2 0.0267 0.00472
## 6 Plate 2 (Day 1) 3 0.0288 0.00467
## 7 Plate 3 (Day 2) 1 0.0253 0.00557
## 8 Plate 3 (Day 2) 2 0.0267 0.00552
## 9 Plate 3 (Day 2) 3 0.0298 0.00545
## 10 Plate 4 (Day 2) 1 0.0418 0.00486
## 11 Plate 4 (Day 2) 2 0.0415 0.00486
## 12 Plate 4 (Day 2) 3 0.0403 0.00484
After finding the intercept and slope, we can now solve for the concentration: because our coefficients were fit with a linear model, we can assume the line equation y = mx + b, which rearranges to x = (y - b) / m.
Plugging the values from above into that equation, we can now add our concentration results.
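A minimal sketch of what solve_for_x() in main.R might do, assuming it joins each row to its plate's fitted line and inverts the line equation for both replicates.

# Hypothetical sketch of solve_for_x(): attach each plate's intercept (b) and
# slope, then back-calculate concentration from signal via x = (y - b) / m.
solve_for_x <- function(df, lin_mod) {
  df %>%
    left_join(lin_mod, by = c("platedayNO_A" = "plateday_lm")) %>%
    mutate(
      Concentration_A = (Signal_A - b) / slope,
      Concentration_B = (Signal_B - b) / slope
    ) %>%
    select(-b, -slope)
}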
datax <- solve_for_x(datax, lin_mod) %>%
mutate_if(is.double, round, 3)
## `mutate_if()` ignored the following grouping variables:
## • Columns `grp_dup`, `platedayNO_A`
datatable(datax)
Below is an embedded R code chunk computing the mean concentration of groups A and B. The dataset datax is passed to find_cv(), which calculates the mean of the two groups, then the standard deviation of the technical replicates and their variation relative to the mean, in other words, the coefficient of variation (CV).
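A minimal sketch of what find_cv() in main.R might do; it relies on the data still being grouped by grp_dup (one duplicate pair per group), and reporting the CV as a percentage is an assumption.

# Hypothetical sketch of find_cv(): average the replicate concentrations, take
# the standard deviation of each pair, and express it relative to the mean.
find_cv <- function(df) {
  df %>%
    mutate(
      mean_Concentration = (Concentration_A + Concentration_B) / 2,
      sd_c = sd(c(Concentration_A, Concentration_B)),
      cv   = 100 * sd_c / mean_Concentration   # percent CV (assumed)
    )
}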
datax <- find_cv(datax)
datax <- datax %>% mutate(across(where(is.numeric), round, 3))
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `across(where(is.numeric), round, 3)`.
## ℹ In group 1: `grp_dup = 1`.
## Caused by warning:
## ! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
## Supply arguments directly to `.fns` through an anonymous function instead.
##
## # Previously
## across(a:b, mean, na.rm = TRUE)
##
## # Now
## across(a:b, \(x) mean(x, na.rm = TRUE))
datax
## # A tibble: 252 × 13
## # Groups: grp_dup [252]
## grp_dup platedayNO_A Description_A concdil_A concdil_B Signal_A Signal_B
## <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 Plate 1 (Day 1) 1 Standard 500 500 2.47 2.29
## 2 2 Plate 1 (Day 1) 1 Standard 200 200 1.10 1.06
## 3 3 Plate 1 (Day 1) 1 Standard 80 80 0.429 0.451
## 4 4 Plate 1 (Day 1) 1 Standard 32 32 0.178 0.171
## 5 5 Plate 1 (Day 1) 1 Standard 12.8 12.8 0.091 0.061
## 6 6 Plate 1 (Day 1) 1 Standard 5.12 5.12 0.051 0.04
## 7 7 Plate 1 (Day 1) 1 Standard 2.05 2.05 0.034 0.008
## 8 8 Plate 1 (Day 1) 1 BLANK 0 0 0 0
## 9 9 Plate 1 (Day 1) 1 Quality Cont… 312. 312. 1.62 1.67
## 10 10 Plate 1 (Day 1) 1 Quality Cont… 156. 156. 0.724 0.773
## # ℹ 242 more rows
## # ℹ 6 more variables: mean_Signal <dbl>, Concentration_A <dbl>,
## # Concentration_B <dbl>, mean_Concentration <dbl>, sd_c <dbl>, cv <dbl>
Below is a facet grid plotting the standards for all plates in this dataset.
Shewhart charts (Levey-Jennings) are used to determine whether the QC values remain within the upper and lower control range.
If the mean concentration is the average for each subgroup (the QC levels), then we can find the grand mean by adding all the QC-level averages and dividing by the number of QC levels, which is 7.
Although this dataset contains enough data points to pass the CLT and assume normality, we cannot use the pdc to determine the standard deviation from the mean because we don't know the population variance, so an estimated parameter is used instead.
After getting the standard deviations for each subgroup, we can calculate the average standard deviation by adding them and dividing by the total. The next challenge is to correct that standard deviation using a theoretical factor (a_n). Finally, the upper and lower control limits are computed for each QC level and shown in the table below.
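Minimal sketches of the two helpers called below; both bodies, and the choice of correction factor, are assumptions.

# Hypothetical sketch of extract_qc(): keep Quality Control rows whose plate
# name matches the supplied regex.
extract_qc <- function(df, plate_regex) {
  df %>%
    filter(str_detect(Description_A, "Quality"),
           str_detect(platedayNO_A, plate_regex))
}

# Hypothetical sketch of calculate_LUCL(): per QC level, take the grand mean
# of the subgroup means and set the limits at +/- 3 bias-corrected average
# standard deviations (a_n taken here as c4 for subgroups of size 2).
calculate_LUCL <- function(qc, N) {
  # N (the number of plates) is kept for signature parity; unused in this sketch
  a_n <- sqrt(2 / pi)  # c4 for n = 2 (assumed)
  qc %>%
    group_by(qc_lvl) %>%
    summarise(
      grand_mean = mean(mean_Concentration),
      s_bar      = mean(sd_c),
      low        = grand_mean - 3 * s_bar / a_n,
      high       = grand_mean + 3 * s_bar / a_n,
      .groups = "drop"
    ) %>%
    select(low, grand_mean, high, qc_lvl)
}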
# extract the Quality Control rows for all four plates
QC_all <- extract_qc(datax, 'Plate 1|Plate 2|Plate 3|Plate 4')
# get the number of plates that were run (k = subgroups)
N <- unique(QC_all$platedayNO_A) %>% length()
# assign a group for each QC level (1 = 312.500, 2 = 156.250, 3 = 78.125, 4 = 39.100, 5 = 19.500, 6 = 13.000, 7 = 9.750 ng/mL)
QC_all$qc_lvl <- rep(1:(nrow(QC_all)/N), length = nrow(QC_all))
CL <- calculate_LUCL(QC_all, N)
CL
## # A tibble: 7 × 4
## low grand_mean high qc_lvl
## <dbl> <dbl> <dbl> <int>
## 1 327. 339. 351. 1
## 2 157. 160. 164. 2
## 3 75.3 79.8 84.3 3
## 4 33.4 36.5 39.6 4
## 5 12.5 15.7 18.9 5
## 6 5.61 7.30 9.00 6
## 7 -0.390 7.81 16.0 7
QC_all <- QC_all %>% left_join(CL, by = c('qc_lvl') )
Below is a seven-panel plot, one for each QC level (1 = 312.500, 2 = 156.250, 3 = 78.125, 4 = 39.100, 5 = 19.500, 6 = 13.000, 7 = 9.750 ng/mL), showing the data points from each plate. Some QCs fall outside the upper and lower control limit range and should be omitted from the data, after which the LCL and UCL should be recomputed.
Before recalculating the LCL and UCL, the dataset should be filtered to exclude any QCs above or below the control range. After doing so, only 52 QC data points remain in the dataset.
QC_filter <- QC_all %>% filter(mean_Concentration >= low & mean_Concentration <= high)
CL <- calculate_LUCL(QC_filter, N)
QC_filter <- QC_filter %>% left_join(CL, by = c('qc_lvl') )
QC_filter
## # A tibble: 52 × 20
## # Groups: platedayNO_A [12]
## grp_dup platedayNO_A Description_A concdil_A concdil_B Signal_A Signal_B
## <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 9 Plate 1 (Day 1) 1 Quality Cont… 312. 312. 1.62 1.67
## 2 12 Plate 1 (Day 1) 1 Quality Cont… 39.1 39.1 0.195 0.208
## 3 13 Plate 1 (Day 1) 1 Quality Cont… 19.5 19.5 0.113 0.096
## 4 14 Plate 1 (Day 1) 1 Quality Cont… 13 13 0.074 0.048
## 5 15 Plate 1 (Day 1) 1 Quality Cont… 9.75 9.75 0.054 0.023
## 6 30 Plate 1 (Day 1) 2 Quality Cont… 312. 312. 1.63 1.67
## 7 33 Plate 1 (Day 1) 2 Quality Cont… 39.1 39.1 0.195 0.209
## 8 34 Plate 1 (Day 1) 2 Quality Cont… 19.5 19.5 0.114 0.096
## 9 35 Plate 1 (Day 1) 2 Quality Cont… 13 13 0.074 0.056
## 10 36 Plate 1 (Day 1) 2 Quality Cont… 9.75 9.75 0.056 0.025
## # ℹ 42 more rows
## # ℹ 13 more variables: mean_Signal <dbl>, Concentration_A <dbl>,
## # Concentration_B <dbl>, mean_Concentration <dbl>, sd_c <dbl>, cv <dbl>,
## # qc_lvl <int>, low.x <dbl>, grand_mean.x <dbl>, high.x <dbl>, low.y <dbl>,
## # grand_mean.y <dbl>, high.y <dbl>
Below is the recalculated plot for the new LCL and UCL. Although the range is tighter for each QC level, there are not enough data points to qualify the QCs, and future work should aim to collect more data and compute error probabilities such as false positives or false negatives.
Note that the echo = TRUE parameter was added to the code chunk to print the R code that generated the plot.