• Advice on data management
  • Training (data, reproducibility, identifiers)
  • Support for data management plans
  • Curation of the Univ-Rennes collection on Recherche Data Gouv
  • Help with making your research source code reproducible

ARDoISE data hub

1. Research Data, What Are We Talking About?

Figure 1: data life cycle

A typical experiment: counting the fish in a river

Figure 2

Which Files Are Important to Make Available?

  • raw_data_fish_counter.csv
  • intermediate_data.xls
  • filter1.py
  • first_draft_submission.pdf
  • fish_counter_calibration.md
  • kick_off_report.docx
  • filter2.py
  • notebook_experiment.ipynb
  • final_data_fish_counter.xls
  • project_presentation_funders.pptx
  • final_data.csv
  • study_draft.qmd
  • january_meeting_partners.docx
  • fish_counter_instructions_for_use.pdf
  • gantt_calendar.xlsx

Answers

  • raw_data_fish_counter.csv
  • intermediate_data.xls
  • filter1.py
  • first_draft_submission.pdf
  • fish_counter_calibration.md
  • kick_off_report.docx
  • filter2.py
  • notebook_experiment.ipynb
  • final_data_fish_counter.xls
  • project_presentation_funders.pptx
  • final_data.csv
  • study_draft.qmd
  • january_meeting_partners.docx
  • fish_counter_instructions_for_use.pdf
  • gantt_calendar.xlsx

2. Towards Cumulative, Reliable, and Reproducible Science?

📓 Berners-Lee (2015)

From buried-in-a-PDF data…

Figure 3
  • Even AI-based optical recognition tools are bad at extracting data from PDFs 📓 Edwards (2025)
  • Mistral failed to convert PDF tables into Markdown (figures were truncated)
  • A PDF should be one of the layers you communicate, not the only one
  • Share the native format and use reproducible tools such as R

…through Excel (but why would Excel be necessary anyway?)

  • It is better to share flat files (not workbooks)
  • Have you considered collecting your data in CSV rather than Excel?

Unlike a script written in a programming language such as Python or R, which documents every step of the process and can be saved, versioned and rerun, an analysis that happens inside a spreadsheet using pointing and clicking is hard to follow and even harder to replicate.

📓 Melchor (2025)
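As a minimal sketch of what such a script could look like for the fish-counting example, the Python below reads raw CSV text, applies two filtering steps, and writes a flat CSV result. The column names (`date`, `species`, `count`), the filtering rules, the function names, and the sample rows are all hypothetical, invented for illustration; they are not taken from the actual project files.

```python
import csv
import io


def filter_counts(raw_csv_text, min_count=1):
    """Keep only rows whose 'count' is an integer >= min_count.

    Column names are hypothetical; adapt them to your own data.
    """
    reader = csv.DictReader(io.StringIO(raw_csv_text))
    kept = []
    for row in reader:
        try:
            count = int(row["count"])
        except ValueError:
            # Step 1: discard rows where the counter logged a non-numeric value.
            continue
        if count >= min_count:
            # Step 2: discard rows below the minimum count.
            kept.append(
                {"date": row["date"], "species": row["species"], "count": count}
            )
    return kept


def to_csv_text(rows):
    """Serialise the filtered rows back to CSV text (a flat file, not a workbook)."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=["date", "species", "count"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Invented sample data, for illustration only.
RAW = """date,species,count
2024-05-01,trout,12
2024-05-01,salmon,error
2024-05-02,trout,0
2024-05-02,salmon,3
"""

if __name__ == "__main__":
    print(to_csv_text(filter_counts(RAW)), end="")
```

Because every step lives in the script, the whole analysis can be versioned and rerun on the untouched raw file, which a sequence of clicks in a spreadsheet does not allow.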

Excel itself (like every non-FOSS tool) is prone to errors

Figure 4: CSV vs XLS

📓 Ziemann et al. (2023)

Preserve the raw data

  • The raw and final data are important to preserve and later deposit in a repository
  • If your data management is well documented, you don’t need to keep intermediate versions
  • Keep the file that contains your raw data untouched (lock it down in a separate folder) and use a duplicate as your working version
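One possible way to apply the advice above is sketched here in Python: record a checksum of the raw file, duplicate it into a working folder, and remove write permission from the original. The function names, folder layout, and file contents are assumptions made for the example, and read-only permissions behave differently across operating systems, so treat this as a starting point rather than a guarantee.

```python
import hashlib
import os
import shutil
import stat
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, to detect later modification."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def protect_raw_and_copy(raw_path, working_dir):
    """Duplicate the raw file, then make the original read-only.

    Returns (working_copy_path, raw_checksum).
    """
    raw_path = Path(raw_path)
    working_dir = Path(working_dir)
    working_dir.mkdir(parents=True, exist_ok=True)
    checksum = sha256_of(raw_path)
    working_copy = working_dir / raw_path.name
    # Copy first, so the working copy keeps the original (writable) permissions.
    shutil.copy2(raw_path, working_copy)
    # Then remove write permission from the raw file.
    os.chmod(raw_path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    return working_copy, checksum


def demo():
    """Run the workflow in a temporary folder with invented file contents."""
    with tempfile.TemporaryDirectory() as tmp:
        raw = Path(tmp) / "raw" / "raw_data_fish_counter.csv"
        raw.parent.mkdir()
        raw.write_text("date,count\n2024-05-01,12\n")
        work, checksum = protect_raw_and_copy(raw, Path(tmp) / "work")
        # Edit the working copy only.
        work.write_text("date,count\n2024-05-01,13\n")
        raw_unchanged = sha256_of(raw) == checksum
        raw_writable = bool(raw.stat().st_mode & stat.S_IWUSR)
        # Restore write permission so the temporary folder can be cleaned up.
        os.chmod(raw, stat.S_IRUSR | stat.S_IWUSR)
        return raw_unchanged, raw_writable


if __name__ == "__main__":
    print(demo())
```

Storing the checksum alongside the raw file lets you verify, at deposit time, that the raw data were never altered.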

…to Linked Open Data

“Lowest temperature in Galway on the fourteenth of January 2020”

| concept | ontology | expression |
|---|---|---|
| lowest temperature | http://inamidst.com/sw/ont/meteo | Property :temperature |
| temperature type | Wikipedia | https://fr.wikipedia.org/wiki/Degr%C3%A9_Celsius |
| Galway | DBpedia | https://dbpedia.org/page/Galway |
| 14th of January | Unix epoch | 1578984191 |
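To make the last row concrete, this short Python sketch decodes the Unix epoch value from the table into a readable UTC timestamp. The function name `epoch_to_iso` is only illustrative.

```python
from datetime import datetime, timezone


def epoch_to_iso(epoch_seconds):
    """Convert seconds since 1970-01-01T00:00:00Z into an ISO 8601 UTC string."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).isoformat()


if __name__ == "__main__":
    print(epoch_to_iso(1578984191))  # prints 2020-01-14T06:43:11+00:00
```

An unambiguous machine-readable value like this removes any doubt about day/month order or time zone that a free-text date would leave open.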

Permanence of Data Access

📓 Gibney & Van Noorden (2013)

3. A Challenge for Open Science

FAIR Principles

Figure 5: FAIR principles

Openness / closure

  • “as open as possible, as closed as necessary”

  • Default openness

  • Closure must be justified:

    • personal data
    • intellectual property
    • trade secrets
    • defense

Making Your Data Findable

Quality criteria for a repository:

  • reputation
  • sustainability (institutional support)
  • open license
  • persistent identifier
  • richness of metadata
  • curation

| discipline | repository |
|---|---|
| Images (humanities and social sciences) | MediHal |
| Code | Software Heritage via HAL |
| Bioinformatics | GenOuest |
| Humanities | Nakala |
| Mathematics | no repository, see with the RNBM group |
| Environment, hydrology | Data InDoRES |
| Earth Sciences | Data Terra |
| Marine Sciences | Ifremer data, SEANOE |
| Medical Sciences | INSERM repository on RDG |
| Ecology, Environment, and Society | Data InDoRES and Cat.InDoRES |

Recherche Data Gouv

  • richness of metadata
  • curation
  • national reference (supported by the Ministry)
  • persistent identifier
  • significant volume
  • free of charge
  • simplified generation of data papers
  • RDG sandbox

Are Data Accessible?

In 93% of cases, requests received no response or a refusal without justification 📓 Gabelica et al. (2022)

Figure 6: “data available upon request”

Are Data Interoperable?

Which identifiers should be used for copper telluride?

| registry | identifier |
|---|---|
| CAS number | 12019-52-2 |
| PubChem CID number | 6914517 |
| PubChem SID number | 24879035 |
| openSMILES identifier | CuCu.CuCu.TeTe |
| InChI identifier | InChI=1/2Cu.Te |
| MDL number | MFCD00049727 |
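Some of these identifiers carry a built-in integrity check, which helps when exchanging data between systems. As an illustration, the Python sketch below verifies the check digit of a CAS number: the digits before the final one are weighted 1, 2, 3, … from right to left, and the sum modulo 10 must equal the last digit. The function name is hypothetical.

```python
def cas_check_digit_is_valid(cas_number):
    """Return True if a CAS number such as '12019-52-2' has a valid check digit."""
    parts = cas_number.split("-")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        return False
    if len(parts[1]) != 2 or len(parts[2]) != 1:
        return False
    digits = parts[0] + parts[1]
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(
        int(d) * weight for weight, d in enumerate(reversed(digits), start=1)
    )
    return total % 10 == int(parts[2])


if __name__ == "__main__":
    print(cas_check_digit_is_valid("12019-52-2"))  # prints True
    print(cas_check_digit_is_valid("12019-52-3"))  # prints False
```

A check like this catches simple transcription errors, but it does not confirm that the number refers to the intended substance.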

Documenting the Data

  • Documentation is the glue that binds a data science project together 📓 Ziemann et al. (2023)
  • Carefully describe the data and the context of its acquisition (production, collection)
  • Practice literate programming
  • Describe the data using ontologies

Documenting to Avoid Context Errors

Be precise in describing the context of data production

Figure 7: the importance of data context

Ontologies

| discipline | thesaurus |
|---|---|
| Biodiversity | INRAE |
| Environment | GEMET |
| Biology, Health | MeSH |
| Mental Health | Ascodocpsy |

directory of thesauri

Reusable Data?

  • Creative Commons (CC BY)
  • a license written by a law firm specialized in intellectual property that provides for a variety of authorized or prohibited use cases
  • ODbL
  • Etalab
  • no license, do whatever you want with my dataset
  • CC0
  • CC BY for everyone except fossil industries, arms sellers, and Google 📓 Thomas (2023)

Data Management Plan

  • The DMP summarizes all the choices made for data management
  • Submit an initial version of a DMP 6 months after signing a contract (ANR, European projects)
  • DMP OPIDOR
  • DMP Online

4. Let’s Get Practical

Figures

| figure | credits |
|---|---|
| Figure 2 | Yanmar, https://www.yanmar.com/global/news/2021/04/20/90815.html |
| Figure 1 | Harvard Biomedical Data Management |
| ?@fig-data_steps | Tim Berners-Lee |
| Figure 3 | Jon Ippolito |
| ?@fig-data_loss | Gibney, Van Noorden |
| Figure 5 | Wilkinson, Dumontier et al. |
| Figure 6 | Sergio Uribe |
| Figure 4 | meme whose origin is lost in the mists of time |
| Figure 7 | Ralph Aboujaoude Diaz |

Software Used for the Presentation

The presentation was created with free and open-source software. Thank you to everyone who keeps it alive ❤️ ❤️

Slides:
 
[1] "Quarto version: 1.6.40"
R version 4.5.2 (2025-10-31)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.12.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=fr_FR.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=fr_FR.UTF-8        LC_COLLATE=fr_FR.UTF-8    
 [5] LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=fr_FR.UTF-8   
 [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Paris
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] digest_0.6.37     later_1.3.2       fastmap_1.2.0     xfun_0.53        
 [5] knitr_1.50        htmltools_0.5.8.1 rmarkdown_2.29    lifecycle_1.0.4  
 [9] ps_1.7.6          cli_3.6.5         processx_3.8.3    compiler_4.5.2   
[13] rstudioapi_0.15.0 tools_4.5.2       quarto_1.5.1      evaluate_1.0.4   
[17] Rcpp_1.1.0        yaml_2.3.10       rlang_1.1.6       jsonlite_2.0.0   

Quiz: H5P

Repository: Framagit, owned and managed by the French pro-FOSS association Framasoft
 

References

Berners-Lee, T. (2015). 5-star open data.
Edwards, B. (2025). Why extracting data from PDFs is still a nightmare for data experts. Ars Technica. https://arstechnica.com/ai/2025/03/why-extracting-data-from-pdfs-is-still-a-nightmare-for-data-experts/
Gabelica, M., Bojčić, R., & Puljak, L. (2022). Many researchers were not compliant with their published data sharing statement: Mixed-methods study. Journal of Clinical Epidemiology. https://doi.org/10.1016/j.jclinepi.2022.05.019
Gibney, E., & Van Noorden, R. (2013). Scientists losing data at a rapid rate. Nature. https://doi.org/10.1038/nature.2013.14416
Melchor, S. (2025). Six questions to ask before jumping into a spreadsheet. Nature, 644(8076), 569–570. https://doi.org/10.1038/d41586-025-02511-z
Thomas, M., & Tannier, É. (2023, May 17). Se réapproprier la production de connaissance. AOC media - Analyse Opinion Critique. https://aoc.media/opinion/2023/05/17/se-reapproprier-la-production-de-connaissance/
Ziemann, M., Poulain, P., & Bora, A. (2023). The five pillars of computational reproducibility: Bioinformatics and beyond. Briefings in Bioinformatics, 24(6), bbad375. https://doi.org/10.1093/bib/bbad375