Dietary text annotation

Compiled date: 2024-05-06

Last edited: 2022-01-12

License: GPL-3

Installation

Run the following code to install the Bioconductor version of the package.

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("fobitools")

Load packages

library(fobitools)

This vignette also uses the following CRAN packages for data manipulation and table formatting.

library(tidyverse)
library(kableExtra)

Load food items from sample food frequency questionnaire (FFQ) data

In nutritional studies, dietary data are usually collected with questionnaires such as food frequency questionnaires (FFQs) or 24-hour dietary recalls (24h-DRs). The text collected in these questionnaires commonly requires a manual preprocessing step before it can be analyzed.

This is an example of what an FFQ could look like in a typical nutritional study.

load("data/sample_ffq.rda")

sample_ffq %>%
  dplyr::slice(1L:10L) %>%
  kbl(row.names = FALSE, booktabs = TRUE) %>%
  kable_styling(latex_options = c("striped"))

ID	Name
ID_001	Beef: roast, steak, mince, stew casserole, curry or bolognese
ID_002	Beefburgers
ID_003	Pork: roast, chops, stew, slice or curry
ID_004	Lamb: roast, chops, stew or curry
ID_005	Chicken, turkey or other poultry: including fried, casseroles or curry
ID_006	Bacon
ID_007	Ham
ID_008	Corned beef, Spam, luncheon meats
ID_009	Sausages
ID_0010	Savoury pies, e.g. meat pie, pork pie, pasties, steak & kidney pie, sausage rolls, scotch egg

Automatic dietary text annotation

The fobitools::annotate_foods() function allows the automatic annotation of free nutritional text using the FOBI ontology (Castellano-Escuder et al. 2020). It returns a table of food IDs, food names, FOBI IDs and FOBI names for the FOBI terms that match the input text. The input should be structured as a two-column data frame containing the food IDs (first column) and the food names (second column). Note that food names can be provided either as single words or as complex strings.
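
As a minimal sketch of the expected input, a two-column data frame can also be built by hand. The column names ID and Name mirror the sample data shown above; the object name my_ffq and its contents are hypothetical, and we assume here that a plain data.frame is accepted in the same way as the sample data. The call to annotate_foods() is guarded so that the snippet also runs where fobitools is not installed.

```r
# Hypothetical three-item questionnaire; IDs and names are invented for illustration.
my_ffq <- data.frame(
  ID   = c("F1", "F2", "F3"),
  Name = c("Bananas", "Grapefruit", "Chicken, turkey or other poultry"),
  stringsAsFactors = FALSE
)

# First column: food IDs. Second column: free-text food names.
stopifnot(ncol(my_ffq) == 2, !anyDuplicated(my_ffq$ID))

# Annotate only if fobitools is available.
if (requireNamespace("fobitools", quietly = TRUE)) {
  my_annotation <- fobitools::annotate_foods(my_ffq)
  print(head(my_annotation$annotated))
}
```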

This function includes a text mining algorithm composed of 5 sequential layers. In this process, singulars and plurals are analyzed, irrelevant words are removed, each string of the text input is tokenized and each word is analyzed independently, and the semantic similarity between input text and FOBI items is computed. Finally, this function also shows the percentage of the annotated input text.
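
The layers described above can be pictured with a small base R sketch. It is not the fobitools implementation: the stop word list, the rule that strips a trailing "s", the normalized Levenshtein similarity computed with utils::adist(), and the helper names normalize_food() and match_tokens() are all assumptions made for illustration.

```r
# Illustrative sketch only: this is NOT the fobitools implementation.
# Stop words, the plural rule and the similarity measure are assumptions.

normalize_food <- function(x, stopwords = c("or", "and", "other", "including")) {
  x <- tolower(x)
  x <- gsub("[^a-z ]", " ", x)                               # punctuation and digits become spaces
  tokens <- strsplit(x, " +")[[1]]                           # tokenize on whitespace
  tokens <- tokens[nzchar(tokens) & !tokens %in% stopwords]  # drop empty and irrelevant words
  unique(sub("s$", "", tokens))                              # crude plural-to-singular rule
}

match_tokens <- function(food, vocabulary, cutoff = 0.85) {
  tokens <- normalize_food(food)
  # Normalized Levenshtein similarity: 1 = identical strings, 0 = nothing shared.
  dists <- utils::adist(tokens, vocabulary)
  sims <- 1 - dists / outer(nchar(tokens), nchar(vocabulary), pmax)
  hits <- which(sims >= cutoff, arr.ind = TRUE)
  data.frame(
    token = tokens[hits[, "row"]],
    term = vocabulary[hits[, "col"]],
    similarity = sims[hits],
    stringsAsFactors = FALSE
  )
}

# Toy stand-in for the FOBI term names.
vocabulary <- c("orange", "banana", "grape", "kiwi")

match_tokens("Strawberries, raspberries, kiwi fruit", vocabulary)
match_tokens("Oranges, satsumas, mandarins", vocabulary)
```

Each input string is analyzed word by word, so a complex string can match a term through any one of its tokens.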

annotated_text <- fobitools::annotate_foods(sample_ffq)
#> 89.57% annotated
#> 1.676 sec elapsed

annotated_text$annotated %>%
  dplyr::slice(1L:10L) %>%
  kbl(row.names = FALSE, booktabs = TRUE) %>%
  kable_styling(latex_options = c("striped"))

FOOD_ID	FOOD_NAME	FOBI_ID	FOBI_NAME
ID_00100	Oranges, satsumas, mandarins, tangerines, clementines	FOODON:03309832	orange (whole, raw)
ID_00101	Grapefruit	FOODON:03301702	grapefruit (whole, raw)
ID_00102	Bananas	FOODON:03311513	banana (whole, ripe)
ID_00103	Grapes	FOODON:03301123	grape (whole, raw)
ID_00104	Melon	FOODON:03301593	melon (raw)
ID_00105	*Peaches, plums, apricots, nectarines	FOODON:03301107	nectarine (whole, raw)
ID_00106	*Strawberries, raspberries, kiwi fruit	FOODON:03305656	fruit (dried)
ID_00106	*Strawberries, raspberries, kiwi fruit	FOODON:03414363	kiwi
ID_00106	*Strawberries, raspberries, kiwi fruit	FOODON:00001057	plant fruit food product
ID_00107	Tinned fruit	FOODON:03305656	fruit (dried)
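
Because one food item can match several FOBI terms (ID_00106 appears three times above), the number of rows in the annotated table is not the number of annotated foods. The sketch below computes a coverage figure by hand on toy data, assuming that coverage means the share of input food IDs with at least one annotation. The exact definition behind the percentage printed by annotate_foods() is not stated here, and all IDs in the block are placeholders.

```r
# Toy input and annotation result; all values are invented placeholders.
input_ids <- c("ID_001", "ID_002", "ID_003", "ID_004")
annotated <- data.frame(
  FOOD_ID = c("ID_001", "ID_003", "ID_003"),  # ID_003 matched two terms
  FOBI_ID = c("TERM:1", "TERM:2", "TERM:3"),
  stringsAsFactors = FALSE
)

# Share of input items with at least one annotation (assumed definition).
coverage <- 100 * mean(input_ids %in% annotated$FOOD_ID)
coverage

# Items left without any annotation.
not_annotated <- setdiff(input_ids, annotated$FOOD_ID)
not_annotated
```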

The similarity argument

The similarity argument sets the semantic similarity cutoff used in the last layer of the text mining pipeline. It is a numeric value between 0 (very poor match) and 1 (exact match). Users can modify this value to obtain more or less accurate annotations. The authors do not recommend values below 0.85 (the default).

annotated_text_95 <- fobitools::annotate_foods(sample_ffq, similarity = 0.95)
#> 86.5% annotated
#> 1.625 sec elapsed

annotated_text_95$annotated %>%
  dplyr::slice(1L:10L) %>%
  kbl(row.names = FALSE, booktabs = TRUE) %>%
  kable_styling(latex_options = c("striped"))

FOOD_ID	FOOD_NAME	FOBI_ID	FOBI_NAME
ID_00100	Oranges, satsumas, mandarins, tangerines, clementines	FOODON:03309832	orange (whole, raw)
ID_00101	Grapefruit	FOODON:03301702	grapefruit (whole, raw)
ID_00102	Bananas	FOODON:03311513	banana (whole, ripe)
ID_00103	Grapes	FOODON:03301123	grape (whole, raw)
ID_00104	Melon	FOODON:03301593	melon (raw)
ID_00105	*Peaches, plums, apricots, nectarines	FOODON:03301107	nectarine (whole, raw)
ID_00106	*Strawberries, raspberries, kiwi fruit	FOODON:03305656	fruit (dried)
ID_00106	*Strawberries, raspberries, kiwi fruit	FOODON:03414363	kiwi
ID_00106	*Strawberries, raspberries, kiwi fruit	FOODON:00001057	plant fruit food product
ID_00107	Tinned fruit	FOODON:03305656	fruit (dried)

Note that increasing the similarity value from 0.85 (the default) to 0.95 (a stricter, more accurate annotation) decreases the percentage of annotated terms from 89.57% to 86.5%. Let’s check the food items that were annotated with similarity = 0.85 but not with similarity = 0.95.

annotated_text$annotated %>%
  filter(!FOOD_ID %in% annotated_text_95$annotated$FOOD_ID) %>%
  kbl(row.names = FALSE, booktabs = TRUE) %>%
  kable_styling(latex_options = c("striped"))

FOOD_ID	FOOD_NAME	FOBI_ID	FOBI_NAME
ID_00124	Beansprouts…130	FOODON:00002753	bean (whole)
ID_00127	Watercress	FOODON:00002340	water food product
ID_00140	Beansprouts…171	FOODON:00002753	bean (whole)
ID_00143	Brocoli	FOODON:03301713	broccoli floret (whole, raw)
ID_002	Beefburgers	FOODON:00002737	beef hamburger (dish)
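
The misspelled "Brocoli" entry gives an intuition for why a near match can pass one cutoff and fail another. The sketch below uses a normalized Levenshtein similarity as a stand-in; this is an assumption, since the similarity measure used inside annotate_foods() may differ, and string_similarity() is a hypothetical helper.

```r
# Assumption: normalized Levenshtein similarity stands in for the package's own measure.
string_similarity <- function(a, b) {
  1 - drop(utils::adist(a, b)) / pmax(nchar(a), nchar(b))
}

# One missing letter in an eight-letter word.
sim <- string_similarity("brocoli", "broccoli")
sim          # 0.875

sim >= 0.85  # TRUE: kept at the default cutoff
sim >= 0.95  # FALSE: dropped at the stricter cutoff
```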

Network visualization of the annotated terms

The fobitools::fobi_graph() function can then be used to visualize the annotated food terms together with their corresponding FOBI relationships.

terms <- annotated_text$annotated %>%
  pull(FOBI_ID)

fobitools::fobi_graph(terms = terms,
                      get = NULL,
                      layout = "lgl",
                      labels = TRUE,
                      legend = TRUE,
                      labelsize = 6,
                      legendSize = 20)

How do I know which compounds are associated with my study food items?

We will often want to know which compounds are related to the foods in our study. Once the foods are annotated, we can obtain the metabolites associated with them as follows:

inverse_rel <- fobitools::fobi %>%
  filter(id_BiomarkerOf %in% annotated_text$annotated$FOBI_ID) %>%
  dplyr::select(id_code, name, id_BiomarkerOf, FOBI) %>%
  dplyr::rename(METABOLITE_ID = 1, METABOLITE_NAME = 2, FOBI_ID = 3, METABOLITE_FOBI_ID = 4)

annotated_foods_and_metabolites <- left_join(annotated_text$annotated, inverse_rel, by = "FOBI_ID")

annotated_foods_and_metabolites %>%
  filter(!is.na(METABOLITE_ID)) %>%
  dplyr::slice(1L:10L) %>%
  kbl(row.names = FALSE, booktabs = TRUE) %>%
  kable_styling(latex_options = c("striped"))

FOOD_ID	FOOD_NAME	FOBI_ID	FOBI_NAME	METABOLITE_ID	METABOLITE_NAME	METABOLITE_FOBI_ID
ID_00100	Oranges, satsumas, mandarins, tangerines, clementines	FOODON:03309832	orange (whole, raw)	CHEBI:15600	(+)-catechin	FOBI:030460
ID_00100	Oranges, satsumas, mandarins, tangerines, clementines	FOODON:03309832	orange (whole, raw)	CHEBI:15864	luteolin	FOBI:030555
ID_00100	Oranges, satsumas, mandarins, tangerines, clementines	FOODON:03309832	orange (whole, raw)	CHEBI:16243	quercetin	FOBI:030558
ID_00100	Oranges, satsumas, mandarins, tangerines, clementines	FOODON:03309832	orange (whole, raw)	CHEBI:17620	ferulic acid	FOBI:030406
ID_00100	Oranges, satsumas, mandarins, tangerines, clementines	FOODON:03309832	orange (whole, raw)	CHEBI:17620	ferulic acid	FOBI:030406
ID_00100	Oranges, satsumas, mandarins, tangerines, clementines	FOODON:03309832	orange (whole, raw)	CHEBI:18388	apigenin	FOBI:030553
ID_00100	Oranges, satsumas, mandarins, tangerines, clementines	FOODON:03309832	orange (whole, raw)	CHEBI:28499	kaempferol	FOBI:030565
ID_00100	Oranges, satsumas, mandarins, tangerines, clementines	FOODON:03309832	orange (whole, raw)	CHEBI:6052	isorhamnetin	FOBI:030562
ID_00100	Oranges, satsumas, mandarins, tangerines, clementines	FOODON:03309832	orange (whole, raw)	CHEBI:77131	sinapic acid	FOBI:030412
ID_00100	Oranges, satsumas, mandarins, tangerines, clementines	FOODON:03309832	orange (whole, raw)	CHEBI:77131	sinapic acid	FOBI:030412
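
The table above contains repeated rows (for example, ferulic acid and sinapic acid each appear twice for the same food). As a minimal sketch on a toy excerpt of that table, exact duplicates can be dropped and the metabolites counted per food with base R. The object names are hypothetical; with the tidyverse loaded, dplyr::distinct() would do the same job as unique().

```r
# Toy excerpt of the joined table, using values shown above.
foods_and_metabolites <- data.frame(
  FOOD_ID = rep("ID_00100", 4),
  FOBI_NAME = rep("orange (whole, raw)", 4),
  METABOLITE_NAME = c("luteolin", "ferulic acid", "ferulic acid", "sinapic acid"),
  stringsAsFactors = FALSE
)

# Drop exact duplicate rows.
deduplicated <- unique(foods_and_metabolites)

# Count distinct metabolites per food item.
metabolites_per_food <- tapply(
  deduplicated$METABOLITE_NAME,
  deduplicated$FOOD_ID,
  function(x) length(unique(x))
)
metabolites_per_food
```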

Limitations

The FOBI ontology is currently in its first release version, so it does not yet include information on many metabolites, foods and food relationships. Future efforts will be directed at expanding the ontology, significantly increasing the number of metabolites, foods (from the FoodOn ontology (Dooley et al. 2018)) and metabolite-food relationships. The fobitools package provides the methodology for easy use of the FOBI ontology regardless of the amount of information it contains. Future FOBI improvements will therefore have a direct impact on the fobitools package, increasing its utility and enabling, among other things, more accurate, complete and robust dietary text annotations.

Session Information

sessionInfo()
#> R version 4.3.2 (2023-10-31)
#> Platform: aarch64-apple-darwin20 (64-bit)
#> Running under: macOS Sonoma 14.1
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: America/New_York
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] kableExtra_1.3.4 lubridate_1.9.3  forcats_1.0.0    stringr_1.5.1   
#>  [5] dplyr_1.1.4      purrr_1.0.2      readr_2.1.4      tidyr_1.3.0     
#>  [9] tibble_3.2.1     ggplot2_3.4.4    tidyverse_2.0.0  fobitools_1.11.2
#> [13] BiocStyle_2.30.0
#> 
#> loaded via a namespace (and not attached):
#>   [1] DBI_1.2.0              ada_2.0-5              qdapRegex_0.7.8       
#>   [4] gridExtra_2.3          rlang_1.1.3            magrittr_2.0.3        
#>   [7] e1071_1.7-14           compiler_4.3.2         RSQLite_2.3.4         
#>  [10] systemfonts_1.0.5      vctrs_0.6.5            rvest_1.0.3           
#>  [13] pkgconfig_2.0.3        crayon_1.5.2           fastmap_1.1.1         
#>  [16] labeling_0.4.3         ggraph_2.1.0           utf8_1.2.4            
#>  [19] rmarkdown_2.25         tzdb_0.4.0             prodlim_2023.08.28    
#>  [22] ragg_1.2.7             bit_4.0.5              xfun_0.41             
#>  [25] cachem_1.0.8           jsonlite_1.8.8         blob_1.2.4            
#>  [28] highr_0.10             tictoc_1.2             BiocParallel_1.36.0   
#>  [31] tweenr_2.0.2           syuzhet_1.0.7          parallel_4.3.2        
#>  [34] R6_2.5.1               bslib_0.6.1            stringi_1.8.3         
#>  [37] textclean_0.9.3        parallelly_1.36.0      rpart_4.1.23          
#>  [40] jquerylib_0.1.4        Rcpp_1.0.12            bookdown_0.37         
#>  [43] knitr_1.45             future.apply_1.11.1    clisymbols_1.2.0      
#>  [46] timechange_0.2.0       Matrix_1.6-1           splines_4.3.2         
#>  [49] nnet_7.3-19            igraph_1.6.0           tidyselect_1.2.0      
#>  [52] rstudioapi_0.15.0      yaml_2.3.8             viridis_0.6.4         
#>  [55] codetools_0.2-19       listenv_0.9.0          lattice_0.22-5        
#>  [58] withr_2.5.2            evaluate_0.23          ontologyIndex_2.11    
#>  [61] future_1.33.1          desc_1.4.3             survival_3.5-7        
#>  [64] proxy_0.4-27           polyclip_1.10-6        xml2_1.3.6            
#>  [67] pillar_1.9.0           BiocManager_1.30.22    lexicon_1.2.1         
#>  [70] generics_0.1.3         vroom_1.6.5            hms_1.1.3             
#>  [73] munsell_0.5.0          scales_1.3.0           ff_4.0.12             
#>  [76] globals_0.16.2         xtable_1.8-4           class_7.3-22          
#>  [79] glue_1.7.0             RecordLinkage_0.4-12.4 tools_4.3.2           
#>  [82] data.table_1.14.10     webshot_0.5.5          fgsea_1.28.0          
#>  [85] fs_1.6.3               graphlayouts_1.0.2     fastmatch_1.1-4       
#>  [88] tidygraph_1.3.0        cowplot_1.1.2          grid_4.3.2            
#>  [91] ipred_0.9-14           colorspace_2.1-0       ggforce_0.4.1         
#>  [94] cli_3.6.2              evd_2.3-6.1            textshaping_0.3.7     
#>  [97] fansi_1.0.6            viridisLite_0.4.2      svglite_2.1.3         
#> [100] lava_1.7.3             gtable_0.3.4           sass_0.4.8            
#> [103] digest_0.6.34          ggrepel_0.9.5          farver_2.1.1          
#> [106] memoise_2.0.1          htmltools_0.5.7        pkgdown_2.0.7         
#> [109] lifecycle_1.0.4        httr_1.4.7             bit64_4.0.5           
#> [112] MASS_7.3-60

References

Castellano-Escuder, Pol, Raúl González-Domínguez, David S Wishart, Cristina Andrés-Lacueva, and Alex Sánchez-Pla. 2020. “FOBI: An Ontology to Represent Food Intake Data and Associate It with Metabolomic Data.” Database 2020.

Dooley, Damion M, Emma J Griffiths, Gurinder S Gosal, Pier L Buttigieg, Robert Hoehndorf, Matthew C Lange, Lynn M Schriml, Fiona SL Brinkman, and William WL Hsiao. 2018. “FoodOn: A Harmonized Food Ontology to Increase Global Food Traceability, Quality Control and Data Integration.” npj Science of Food 2 (1): 1–10.

Pol Castellano-Escuder

6 May 2024