vignettes/Dietary_data_annotation.Rmd
Compiled date: 2024-05-06
Last edited: 2022-01-12
License: GPL-3
Run the following code to install the Bioconductor version of the package.
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("fobitools")
We will also need some additional CRAN packages in this vignette: tidyverse (for the pipe and the dplyr verbs) and kableExtra (for kbl() and kable_styling()), both of which appear as attached packages in the session information below.
In nutritional studies, dietary data are usually collected with questionnaires such as food frequency questionnaires (FFQs) or 24-hour dietary recalls (24h-DRs). The free text collected in these questionnaires commonly requires a manual preprocessing step before it can be analyzed.
This is an example of what an FFQ could look like in a common nutritional study.
load("data/sample_ffq.rda")
sample_ffq %>%
  dplyr::slice(1L:10L) %>%
  kbl(row.names = FALSE, booktabs = TRUE) %>%
  kable_styling(latex_options = c("striped"))
ID | Name |
---|---|
ID_001 | Beef: roast, steak, mince, stew casserole, curry or bolognese |
ID_002 | Beefburgers |
ID_003 | Pork: roast, chops, stew, slice or curry |
ID_004 | Lamb: roast, chops, stew or curry |
ID_005 | Chicken, turkey or other poultry: including fried, casseroles or curry |
ID_006 | Bacon |
ID_007 | Ham |
ID_008 | Corned beef, Spam, luncheon meats |
ID_009 | Sausages |
ID_0010 | Savoury pies, e.g. meat pie, pork pie, pasties, steak & kidney pie, sausage rolls, scotch egg |
The fobitools::annotate_foods() function allows the automatic annotation of free nutritional text using the FOBI ontology (Castellano-Escuder et al. 2020). It returns a table of food IDs, food names, FOBI IDs and FOBI names for the FOBI terms that match the input text. The input should be a two-column data frame containing the food IDs (first column) and the food names (second column). Food names can be provided either as single words or as complex strings.
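As a minimal sketch of the expected input, the data frame below uses hypothetical IDs and food names (they are not taken from sample_ffq); only the two-column layout comes from the description above. The call to fobitools::annotate_foods() is left commented out so that the snippet runs without the package installed.

```r
# Hypothetical input: food IDs in the first column, free-text food names in the second
my_foods <- data.frame(
  ID = c("ID_001", "ID_002", "ID_003"),
  Name = c("Beefburgers", "Bananas", "Chicken, turkey or other poultry"),
  stringsAsFactors = FALSE
)

# Inspect the structure before annotating
str(my_foods)

# Uncomment when fobitools is installed:
# annotated <- fobitools::annotate_foods(my_foods)
```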
This function includes a text mining algorithm composed of 5 sequential layers. In this process, singulars and plurals are analyzed, irrelevant words are removed, each input string is tokenized so that each word is analyzed independently, and the semantic similarity between the input text and FOBI items is computed. Finally, the function reports the percentage of the input text that was annotated.
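To make the preprocessing idea concrete, here is a toy sketch in base R. It is not the fobitools implementation: the stop-word list, the naive plural stripping, and the helper names normalize_food() and string_similarity() are all assumptions made for illustration, and a normalized Levenshtein similarity stands in for the semantic similarity used by the package.

```r
# Toy illustration only: NOT the fobitools implementation
stop_words <- c("or", "and", "other", "including")  # hypothetical list

normalize_food <- function(x) {
  x <- tolower(x)
  tokens <- unlist(strsplit(x, "[^a-z]+"))    # tokenize on non-letter characters
  tokens <- tokens[nzchar(tokens)]            # drop empty tokens
  tokens <- tokens[!tokens %in% stop_words]   # remove irrelevant words
  sub("s$", "", tokens)                       # naive singularization (strips a final "s")
}

# Normalized Levenshtein similarity in [0, 1]; a stand-in, not a semantic measure
string_similarity <- function(a, b) {
  1 - drop(utils::adist(a, b)) / max(nchar(a), nchar(b))
}

normalize_food("Chicken, turkey or other poultry")
string_similarity("brocoli", "broccoli")
```

Under this stand-in metric, the misspelling "brocoli" scores 0.875 against "broccoli" (one edit over eight characters), which illustrates how a numeric cutoff can separate near matches from exact ones.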
annotated_text <- fobitools::annotate_foods(sample_ffq)
#> 89.57% annotated
#> 1.676 sec elapsed
annotated_text$annotated %>%
  dplyr::slice(1L:10L) %>%
  kbl(row.names = FALSE, booktabs = TRUE) %>%
  kable_styling(latex_options = c("striped"))
FOOD_ID | FOOD_NAME | FOBI_ID | FOBI_NAME |
---|---|---|---|
ID_00100 | Oranges, satsumas, mandarins, tangerines, clementines | FOODON:03309832 | orange (whole, raw) |
ID_00101 | Grapefruit | FOODON:03301702 | grapefruit (whole, raw) |
ID_00102 | Bananas | FOODON:03311513 | banana (whole, ripe) |
ID_00103 | Grapes | FOODON:03301123 | grape (whole, raw) |
ID_00104 | Melon | FOODON:03301593 | melon (raw) |
ID_00105 | *Peaches, plums, apricots, nectarines | FOODON:03301107 | nectarine (whole, raw) |
ID_00106 | *Strawberries, raspberries, kiwi fruit | FOODON:03305656 | fruit (dried) |
ID_00106 | *Strawberries, raspberries, kiwi fruit | FOODON:03414363 | kiwi |
ID_00106 | *Strawberries, raspberries, kiwi fruit | FOODON:00001057 | plant fruit food product |
ID_00107 | Tinned fruit | FOODON:03305656 | fruit (dried) |
Additionally, the similarity argument sets the semantic similarity cutoff used in the last layer of the text mining pipeline. It is a numeric value between 0 (very poor match) and 1 (exact match). Users can modify this value to obtain more or less accurate annotations. The authors do not recommend values below the default of 0.85.
annotated_text_95 <- fobitools::annotate_foods(sample_ffq, similarity = 0.95)
#> 86.5% annotated
#> 1.625 sec elapsed
annotated_text_95$annotated %>%
  dplyr::slice(1L:10L) %>%
  kbl(row.names = FALSE, booktabs = TRUE) %>%
  kable_styling(latex_options = c("striped"))
FOOD_ID | FOOD_NAME | FOBI_ID | FOBI_NAME |
---|---|---|---|
ID_00100 | Oranges, satsumas, mandarins, tangerines, clementines | FOODON:03309832 | orange (whole, raw) |
ID_00101 | Grapefruit | FOODON:03301702 | grapefruit (whole, raw) |
ID_00102 | Bananas | FOODON:03311513 | banana (whole, ripe) |
ID_00103 | Grapes | FOODON:03301123 | grape (whole, raw) |
ID_00104 | Melon | FOODON:03301593 | melon (raw) |
ID_00105 | *Peaches, plums, apricots, nectarines | FOODON:03301107 | nectarine (whole, raw) |
ID_00106 | *Strawberries, raspberries, kiwi fruit | FOODON:03305656 | fruit (dried) |
ID_00106 | *Strawberries, raspberries, kiwi fruit | FOODON:03414363 | kiwi |
ID_00106 | *Strawberries, raspberries, kiwi fruit | FOODON:00001057 | plant fruit food product |
ID_00107 | Tinned fruit | FOODON:03305656 | fruit (dried) |
Note that increasing the similarity value from 0.85 (the default) to 0.95 (a stricter, more accurate annotation) decreases the percentage of annotated terms from 89.57% to 86.5%. Let's check which food items are annotated with similarity = 0.85 but not with similarity = 0.95.
annotated_text$annotated %>%
  filter(!FOOD_ID %in% annotated_text_95$annotated$FOOD_ID) %>%
  kbl(row.names = FALSE, booktabs = TRUE) %>%
  kable_styling(latex_options = c("striped"))
FOOD_ID | FOOD_NAME | FOBI_ID | FOBI_NAME |
---|---|---|---|
ID_00124 | Beansprouts…130 | FOODON:00002753 | bean (whole) |
ID_00127 | Watercress | FOODON:00002340 | water food product |
ID_00140 | Beansprouts…171 | FOODON:00002753 | bean (whole) |
ID_00143 | Brocoli | FOODON:03301713 | broccoli floret (whole, raw) |
ID_002 | Beefburgers | FOODON:00002737 | beef hamburger (dish) |
Then, with the fobitools::fobi_graph() function, we can visualize the annotated food terms together with their corresponding FOBI relationships.
terms <- annotated_text$annotated %>%
  pull(FOBI_ID)
fobitools::fobi_graph(terms = terms,
                      get = NULL,
                      layout = "lgl",
                      labels = TRUE,
                      legend = TRUE,
                      labelsize = 6,
                      legendSize = 20)
We will often be interested in the food-related compounds in our study. Once the foods are annotated, we can obtain the metabolites associated with them as follows:
inverse_rel <- fobitools::fobi %>%
  filter(id_BiomarkerOf %in% annotated_text$annotated$FOBI_ID) %>%
  dplyr::select(id_code, name, id_BiomarkerOf, FOBI) %>%
  dplyr::rename(METABOLITE_ID = 1, METABOLITE_NAME = 2, FOBI_ID = 3, METABOLITE_FOBI_ID = 4)
annotated_foods_and_metabolites <- left_join(annotated_text$annotated, inverse_rel, by = "FOBI_ID")
annotated_foods_and_metabolites %>%
  filter(!is.na(METABOLITE_ID)) %>%
  dplyr::slice(1L:10L) %>%
  kbl(row.names = FALSE, booktabs = TRUE) %>%
  kable_styling(latex_options = c("striped"))
FOOD_ID | FOOD_NAME | FOBI_ID | FOBI_NAME | METABOLITE_ID | METABOLITE_NAME | METABOLITE_FOBI_ID |
---|---|---|---|---|---|---|
ID_00100 | Oranges, satsumas, mandarins, tangerines, clementines | FOODON:03309832 | orange (whole, raw) | CHEBI:15600 | (+)-catechin | FOBI:030460 |
ID_00100 | Oranges, satsumas, mandarins, tangerines, clementines | FOODON:03309832 | orange (whole, raw) | CHEBI:15864 | luteolin | FOBI:030555 |
ID_00100 | Oranges, satsumas, mandarins, tangerines, clementines | FOODON:03309832 | orange (whole, raw) | CHEBI:16243 | quercetin | FOBI:030558 |
ID_00100 | Oranges, satsumas, mandarins, tangerines, clementines | FOODON:03309832 | orange (whole, raw) | CHEBI:17620 | ferulic acid | FOBI:030406 |
ID_00100 | Oranges, satsumas, mandarins, tangerines, clementines | FOODON:03309832 | orange (whole, raw) | CHEBI:17620 | ferulic acid | FOBI:030406 |
ID_00100 | Oranges, satsumas, mandarins, tangerines, clementines | FOODON:03309832 | orange (whole, raw) | CHEBI:18388 | apigenin | FOBI:030553 |
ID_00100 | Oranges, satsumas, mandarins, tangerines, clementines | FOODON:03309832 | orange (whole, raw) | CHEBI:28499 | kaempferol | FOBI:030565 |
ID_00100 | Oranges, satsumas, mandarins, tangerines, clementines | FOODON:03309832 | orange (whole, raw) | CHEBI:6052 | isorhamnetin | FOBI:030562 |
ID_00100 | Oranges, satsumas, mandarins, tangerines, clementines | FOODON:03309832 | orange (whole, raw) | CHEBI:77131 | sinapic acid | FOBI:030412 |
ID_00100 | Oranges, satsumas, mandarins, tangerines, clementines | FOODON:03309832 | orange (whole, raw) | CHEBI:77131 | sinapic acid | FOBI:030412 |
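The output above contains some repeated rows (for example, ferulic acid and sinapic acid each appear twice for the same food). If one row per food-metabolite pair is preferred, duplicates can be dropped before further analysis. The sketch below applies base R's unique() to a small data frame that mimics three columns of the table; the object names pairs and pairs_unique are hypothetical, and whether deduplication is appropriate depends on why the rows repeat in the FOBI table, which is not stated here.

```r
# Small excerpt mimicking three columns of the joined table above
pairs <- data.frame(
  FOOD_ID = c("ID_00100", "ID_00100", "ID_00100"),
  METABOLITE_ID = c("CHEBI:17620", "CHEBI:17620", "CHEBI:18388"),
  METABOLITE_NAME = c("ferulic acid", "ferulic acid", "apigenin"),
  stringsAsFactors = FALSE
)

# unique() on a data frame keeps the first occurrence of each distinct row
pairs_unique <- unique(pairs)
nrow(pairs_unique)
```

With dplyr attached, dplyr::distinct(annotated_foods_and_metabolites) would perform the same operation on the full table.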
The FOBI ontology is currently in its first release version, so it does not yet include information on many metabolites, foods and food relationships. All future efforts will be directed at expanding this ontology, leading to a significant increase in the number of metabolites, foods (from the FoodOn ontology (Dooley et al. 2018)) and metabolite-food relationships. The fobitools package provides the methodology for easy use of the FOBI ontology regardless of the amount of information it contains. Therefore, future FOBI improvements will also have a direct impact on the fobitools package, increasing its utility and allowing users to perform, among other things, more accurate, complete and robust dietary text annotations.
sessionInfo()
#> R version 4.3.2 (2023-10-31)
#> Platform: aarch64-apple-darwin20 (64-bit)
#> Running under: macOS Sonoma 14.1
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> time zone: America/New_York
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] kableExtra_1.3.4 lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1
#> [5] dplyr_1.1.4 purrr_1.0.2 readr_2.1.4 tidyr_1.3.0
#> [9] tibble_3.2.1 ggplot2_3.4.4 tidyverse_2.0.0 fobitools_1.11.2
#> [13] BiocStyle_2.30.0
#>
#> loaded via a namespace (and not attached):
#> [1] DBI_1.2.0 ada_2.0-5 qdapRegex_0.7.8
#> [4] gridExtra_2.3 rlang_1.1.3 magrittr_2.0.3
#> [7] e1071_1.7-14 compiler_4.3.2 RSQLite_2.3.4
#> [10] systemfonts_1.0.5 vctrs_0.6.5 rvest_1.0.3
#> [13] pkgconfig_2.0.3 crayon_1.5.2 fastmap_1.1.1
#> [16] labeling_0.4.3 ggraph_2.1.0 utf8_1.2.4
#> [19] rmarkdown_2.25 tzdb_0.4.0 prodlim_2023.08.28
#> [22] ragg_1.2.7 bit_4.0.5 xfun_0.41
#> [25] cachem_1.0.8 jsonlite_1.8.8 blob_1.2.4
#> [28] highr_0.10 tictoc_1.2 BiocParallel_1.36.0
#> [31] tweenr_2.0.2 syuzhet_1.0.7 parallel_4.3.2
#> [34] R6_2.5.1 bslib_0.6.1 stringi_1.8.3
#> [37] textclean_0.9.3 parallelly_1.36.0 rpart_4.1.23
#> [40] jquerylib_0.1.4 Rcpp_1.0.12 bookdown_0.37
#> [43] knitr_1.45 future.apply_1.11.1 clisymbols_1.2.0
#> [46] timechange_0.2.0 Matrix_1.6-1 splines_4.3.2
#> [49] nnet_7.3-19 igraph_1.6.0 tidyselect_1.2.0
#> [52] rstudioapi_0.15.0 yaml_2.3.8 viridis_0.6.4
#> [55] codetools_0.2-19 listenv_0.9.0 lattice_0.22-5
#> [58] withr_2.5.2 evaluate_0.23 ontologyIndex_2.11
#> [61] future_1.33.1 desc_1.4.3 survival_3.5-7
#> [64] proxy_0.4-27 polyclip_1.10-6 xml2_1.3.6
#> [67] pillar_1.9.0 BiocManager_1.30.22 lexicon_1.2.1
#> [70] generics_0.1.3 vroom_1.6.5 hms_1.1.3
#> [73] munsell_0.5.0 scales_1.3.0 ff_4.0.12
#> [76] globals_0.16.2 xtable_1.8-4 class_7.3-22
#> [79] glue_1.7.0 RecordLinkage_0.4-12.4 tools_4.3.2
#> [82] data.table_1.14.10 webshot_0.5.5 fgsea_1.28.0
#> [85] fs_1.6.3 graphlayouts_1.0.2 fastmatch_1.1-4
#> [88] tidygraph_1.3.0 cowplot_1.1.2 grid_4.3.2
#> [91] ipred_0.9-14 colorspace_2.1-0 ggforce_0.4.1
#> [94] cli_3.6.2 evd_2.3-6.1 textshaping_0.3.7
#> [97] fansi_1.0.6 viridisLite_0.4.2 svglite_2.1.3
#> [100] lava_1.7.3 gtable_0.3.4 sass_0.4.8
#> [103] digest_0.6.34 ggrepel_0.9.5 farver_2.1.1
#> [106] memoise_2.0.1 htmltools_0.5.7 pkgdown_2.0.7
#> [109] lifecycle_1.0.4 httr_1.4.7 bit64_4.0.5
#> [112] MASS_7.3-60