The 26th ACM Symposium on Document Engineering
August 25, 2026 to August 28, 2026
HES-SO / University of Fribourg, Switzerland
Tutorials
Tuesday, August 25, 2026 — Pérolles Campus, Fribourg
DocEng 2026 features four tutorials. Attendance is included in the full conference registration; a Tutorials & Workshop Only registration ($75) is also available — see the Registration page.
1. Document Engineering Issues in Malware Analysis
Presenters: Charles Nicholas (UMBC), Robert J. Joyce (UMBC), Steve Simske (Colorado State University)
Duration: Half-day (3 hours)
We present an overview of the field of malware analysis with an emphasis on issues related to document engineering. The tutorial introduces the field with a discussion of the types of malware, including executable binaries, malicious PDFs, polymorphic malware, ransomware, and exploit kits. Malware analysis has been affected by recent developments in machine learning, especially LLMs, and we will discuss and demonstrate some of these advances. We conclude with our view of important research questions in the field. This is an updated version of tutorials presented in previous years, with more information about newly available tools.
The session covers both static analysis (inspection of the PE header and system-call import table, and disassembly with tools such as IDA and Ghidra, including the impact of the Model Context Protocol) and dynamic analysis (running specimens within a virtual machine). It also touches on malware analysis "in the large", that is, finding patterns and trends across collections of specimens through cluster analysis, and on the persistent difficulty of malware attribution.
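As a rough illustration of the PE header inspection mentioned above, the following sketch reads the DOS signature, follows the `e_lfanew` pointer at offset 0x3C, and unpacks the 20-byte COFF file header. It is not taken from the tutorial materials: the function names and the synthetic test image are hypothetical, and the sketch stops short of the optional header and import table that a real analysis would go on to parse.

```python
import struct


def parse_pe_header(data: bytes) -> dict:
    """Read the COFF file header fields from the start of a PE image."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        raise ValueError("missing DOS 'MZ' signature")
    # e_lfanew: 4-byte little-endian offset of the PE signature, stored at 0x3C.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if pe_offset + 24 > len(data):
        raise ValueError("file too short for PE signature and COFF header")
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing 'PE\\0\\0' signature")
    # The 20-byte COFF file header follows the 4-byte signature.
    (machine, n_sections, timestamp, _symtab_ptr, _n_symbols,
     opt_header_size, characteristics) = struct.unpack_from(
        "<HHIIIHH", data, pe_offset + 4)
    return {
        "machine": machine,
        "number_of_sections": n_sections,
        "timestamp": timestamp,
        "size_of_optional_header": opt_header_size,
        "characteristics": characteristics,
    }


def synthetic_image() -> bytes:
    """Build a tiny, non-executable byte string with valid header fields (illustrative only)."""
    dos = bytearray(0x40)
    dos[:2] = b"MZ"
    struct.pack_into("<I", dos, 0x3C, 0x40)  # PE signature right after the DOS header
    coff = struct.pack("<HHIIIHH", 0x8664, 3, 0, 0, 0, 0xF0, 0x0022)
    return bytes(dos) + b"PE\x00\x00" + coff


info = parse_pe_header(synthetic_image())
print(hex(info["machine"]), info["number_of_sections"])
```

Working on a synthetic byte string keeps the example safe to run anywhere; real specimens should only be handled inside an isolated virtual machine.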
Participants are welcome to follow along on their own laptops, ideally with a virtual machine platform such as VMware or VirtualBox running Windows and Linux; those without laptops will be at no disadvantage.
2. Accessible Presentations with LaTeX
Presenters: Frank Mittelbach, Ulrike Fischer, Joseph Wright (LaTeX Project)
Duration: Half-day (3 hours)
Document reuse and accessibility are central concerns for the future of document handling. Accessible presentations matter in particular when slides are handed out to students or made available on websites, where step-wise reveals and other slide-specific techniques require special handling to remain accessible.
This tutorial shows the current state of the art for producing well-tagged, accessible PDFs with LaTeX — WTPDF and PDF/UA-2 documents — with a special focus on presentations. It demonstrates tools to inspect and validate tagged PDFs, and discusses open issues, future work, and the resulting restrictions on supported input. Across various types of LaTeX presentations, participants will see how documents can be made tagging-aware and how the tagging can be checked. With internet access, participants will be able to compile and test on their own laptops.
The material builds on talks and tutorials presented at venues including TUG 2023 (Bonn), DEIMS 2024 (Tokyo), TUG 2024 (Prague), DocEng 2024, and DocEng 2025.
3. Temporally Entangled Documents: Multimodal AI for ICU Records Under Label Ambiguity
Presenter: Liam Butler (University of Malta)
Duration: Half-day (3 hours)
Modern Intensive Care Unit (ICU) records are not simply multimodal collections of text, tables, and biosignals; they are temporally entangled document systems. Clinical meaning depends critically on when evidence is recorded, not only on what is recorded: physiological signals evolve continuously, laboratory results arrive with delays, and clinical notes retrospectively summarise earlier events. The timing of a diagnosis is frequently uncertain, indirect, or distributed across multiple records with no single ground-truth timestamp.
This tutorial introduces temporal document–signal alignment under label ambiguity as a foundational challenge in clinical document engineering — one that has received limited systematic treatment in the community. It argues that treating clinical records as static, synchronous inputs causes models to learn documentation artefacts rather than clinical reality, and presents a structured framework for time-aware multimodal integration and cross-modal explainability, with implications for pipeline design, representation, and evaluation.
The tutorial is delivered in three 60-minute blocks combining conceptual presentation with guided Jupyter notebooks on synthetic ICU data:
- Block 1 — ICU Records as Document Engineering Problems: the temporal anatomy of ICU records (notes, flowsheets, labs, waveforms) and the documentation-lag problem.
- Block 2 — Temporal Alignment and Label Ambiguity: why standard multimodal fusion fails under asynchrony; event-centred windowing, onset uncertainty, soft windowing, probabilistic labelling, and structured missingness.
- Block 3 — Explainability, Evaluation, and Deployment: cross-modal, time-aware attribution; temporal inconsistency as an audit tool; evaluation beyond AUROC; and regulatory context including EU AI Act requirements for clinical AI.
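The probabilistic labelling listed under Block 2 can be pictured with a minimal sketch. It is not drawn from the tutorial notebooks: it assumes, purely for illustration, that uncertainty about the onset time is Gaussian, and the function name `soft_onset_labels` and all numbers are hypothetical. Each time window receives the probability that onset has already occurred by its end, instead of a hard 0/1 label tied to a single timestamp.

```python
import math


def soft_onset_labels(window_ends_h, onset_estimate_h, onset_sd_h):
    """Return P(onset <= t) for each window end t, under a Gaussian onset model.

    window_ends_h: end times of the windows, in hours since admission.
    onset_estimate_h: best-guess onset time, in hours (assumed known).
    onset_sd_h: standard deviation of the onset estimate, in hours.
    """
    if onset_sd_h <= 0:
        raise ValueError("onset_sd_h must be positive")
    labels = []
    for t in window_ends_h:
        z = (t - onset_estimate_h) / (onset_sd_h * math.sqrt(2.0))
        labels.append(0.5 * (1.0 + math.erf(z)))  # Gaussian CDF
    return labels


# Hypothetical example: onset estimated at hour 12, give or take 3 hours.
labels = soft_onset_labels([0, 6, 12, 18, 24], onset_estimate_h=12.0, onset_sd_h=3.0)
for end, p in zip([0, 6, 12, 18, 24], labels):
    print(f"window ending at {end:>2} h: soft label {p:.3f}")
```

Windows far before the estimated onset get labels near 0, windows far after it get labels near 1, and windows close to it carry intermediate values that reflect the ambiguity.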
Participants should bring their own laptops. A synthetic clinical dataset and a Google Colab-compatible environment are provided, requiring no institutional data access or local installation. Familiarity with Python and basic ML is recommended; no clinical informatics background is required. Slides, notebooks, and a reading list will be made openly available after the tutorial.
4. How to Measure the Energy and Environmental Impact of AI Models? Challenges and Practical Solutions
Presenters: Loïc Guibert, Jean Hennebert, Sébastien Rumley (HES-SO, iCoSys, Fribourg)
Duration: Half-day (approximately 3 hours, with the option to conclude at around 2 hours)
The energy and environmental impact of AI is becoming significant across all application domains. In document engineering, AI models are now the predominant components of processing pipelines, from information extraction and transformation to interpretation and translation. As AI workloads, largely driven by generative models, are projected to potentially double global data-centre energy consumption by 2030, there is an urgent need to accurately quantify and mitigate this footprint.
Measuring this impact is challenged by the "black box" nature of large-model training and inference, opaque hardware-software interactions, and a historical prioritization of accuracy over environmental efficiency. This tutorial addresses those challenges by exploring Lifecycle Assessment (LCA) frameworks and telemetry tools that track operational energy, carbon intensity, and embodied carbon. The session also situates the topic within new reporting directives such as the EU Corporate Sustainability Reporting Directive (CSRD) and the principle of double materiality.
Participants will gain practical experience in:
- Auditing AI pipelines with open-source tools to estimate the carbon footprint of specific workloads.
- Optimizing for efficiency through techniques such as quantization, distillation, and pruning.
- Strategic deployment — making informed decisions about when and where to run models to minimize environmental impact.
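Once the energy of a workload has been measured, the operational part of the audit described above reduces to simple arithmetic: energy multiplied by the carbon intensity of the electricity used. The sketch below is a minimal illustration and not the method of any particular tool; the function name, the optional PUE (power usage effectiveness) multiplier, and all figures are assumptions, and embodied carbon is left out.

```python
def operational_emissions_g(energy_kwh, carbon_intensity_g_per_kwh, pue=1.0):
    """Estimate operational emissions in grams of CO2-equivalent.

    energy_kwh: energy drawn by the IT equipment for the workload.
    carbon_intensity_g_per_kwh: grid carbon intensity (g CO2e per kWh).
    pue: data-centre overhead multiplier (1.0 means no overhead).
    """
    if energy_kwh < 0 or carbon_intensity_g_per_kwh < 0:
        raise ValueError("energy and carbon intensity must be non-negative")
    if pue < 1.0:
        raise ValueError("pue cannot be below 1.0")
    return energy_kwh * pue * carbon_intensity_g_per_kwh


# Hypothetical workload: 2 kWh measured at the server, PUE of 1.5,
# on a grid emitting 100 g CO2e per kWh.
estimate_g = operational_emissions_g(2.0, 100.0, pue=1.5)
print(f"{estimate_g:.1f} g CO2e")  # 300.0 g CO2e
```

The same function makes the deployment trade-off visible: running the identical workload on a grid with a different carbon intensity scales the estimate proportionally.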
The tutorial includes a live demonstration using a physical benchmark server connected to a Power Distribution Unit (PDU) to measure power draw in real time; the organizers will bring the equipment to the venue (a power socket is required). It is an adaptation of a workshop the organizers ran at the 2026 Swiss AI Days (Martigny and Fribourg), reframed to cover introductory concepts and practical aspects in tutorial form.
Contact
For general questions about tutorials, please contact docengsymposium@gmail.com.