Historical Document Processing
Basilis G. Gatos
Historical manuscript collections are an important source of original information: they provide access to historical data and support cultural documentation over the years. This tutorial focuses on recent advances and ongoing developments in historical handwritten document processing. It covers the main challenges involved, the tasks that have to be implemented, and the practices and technologies that currently exist in the literature. The main tasks in the historical document image recognition pipeline include preprocessing for image enhancement and binarisation; segmentation for the detection of main page elements, text lines and words; and, finally, recognition. In cases where optical recognition is expected to give poor results, keyword spotting has been proposed as a substitute for full text recognition. The focus is on the most promising techniques and related projects, as well as on existing datasets and competitions that can prove useful to historical handwritten document processing research.
Format: Presentations and demos
Basilis G. Gatos was born in Athens, Greece. He received his Electrical Engineering Diploma and his Ph.D. degree, both from the Electrical and Computer Engineering Department of Democritus University of Thrace, Xanthi, Greece. His Ph.D. thesis is on Optical Character Recognition Techniques. In 1993 he was awarded a scholarship from the Institute of Informatics and Telecommunications, NCSR "Demokritos", where he worked until 1996. From 1997 to 1998 he worked as a Software Engineer at Computer Logic S.A. From 1998 to 2001 he worked at Lambrakis Press Archives as Director of the Research Division in the field of digital preservation of old newspapers. From 2001 to 2003 he worked at BSI S.A. as Managing Director of the R&D Division in the field of document management and recognition. He is currently working as a Researcher at the Institute of Informatics and Telecommunications of the National Center for Scientific Research "Demokritos", Athens, Greece. His main research interests are in Image Processing and Document Image Analysis, OCR and Pattern Recognition. He has more than 150 publications in journals and international conference proceedings and has participated in several research programs funded by the European community. He is a member of the Technical Chamber of Greece and of the Editorial Board of the International Journal on Document Analysis and Recognition (IJDAR), and a program committee member of several international conferences and workshops (e.g. ICDAR, ICFHR, DAS, International Workshop on Historical Document Imaging and Processing). Basilis Gatos was co-organiser of the 14th International Conference on Frontiers in Handwriting Recognition (ICFHR) and of the 12th International Workshop on Document Analysis Systems (DAS), held in Greece in 2014 (Crete) and 2016 (Santorini), respectively.
Document Engineering Issues in Malware Analysis
Charles Nicholas
This tutorial gives an overview of the field of malware analysis, with emphasis on issues related to scalability. We introduce the field with a discussion of the types of malware, including executable binaries, malicious PDFs, and exploit kits. Some of the popular tools used for analyzing malicious binaries will be presented, including IDA, Binary Ninja, and x64dbg. Concepts and tools from static and binary analysis will be discussed. Some collections of malware specimens are available to researchers, and these will be used as examples where appropriate. We will discuss cluster analysis, malware attribution, and the problems caused by polymorphic malware. We will conclude with our view of important research questions in the field.
Format: Presentations and demos
Dr. Charles Nicholas is a Professor of Computer Science at UMBC. He has in recent years turned his attention to the problems of malware analysis “in the large”. His recent work has considered questions related to storing, searching, and finding patterns in large collections of malware. He has taught a combined graduate-undergraduate course in malware analysis at UMBC for each of the last four years.
Understanding the User: User Studies and User Evaluation for Document Engineering
Kim Marriott, Steven Simske, Margaret Sturgill
Document engineering is all about building systems and tools that allow people to work with documents and document collections. A key aspect is the usefulness and usability of these tools. In this tutorial we will look at the many different kinds of user studies and user evaluations that can be used to inform the design and improve utility and usability of document engineering applications. The tutorial will be based on actual studies and will also give participants a chance to explore how they might use these techniques in their research or system development. In the first part of the tutorial we will look at:
1. Controlled experiments (lab studies). These have been adopted from research methods in psychology and are widely used to answer very focussed questions of the form "if I vary X, how does that affect Y?" For instance, how do different layouts affect reading speed and comprehension?
2. Questionnaires, in-depth interviews, focus groups and field studies. These provide more open-ended information and draw on techniques from anthropology/ethnography. For instance, do academics read research electronically or on paper, and why?
3. Participative design. Participative design, user-centered design and co-design include the user in the whole development process. One case study is the presentation of accessible eBooks.
4. User data collection and analysis. What kinds of user data can you collect (e.g. instrumented collection, eye tracking), and how do you analyse them?
In the second part of the tutorial we will look at data analytics. How can data analytics be applied to user evaluation in the document engineering field? This is a two-way relationship:
A. Data science to understand how users evaluate document sets (multiple versions, related documents, search results, other corpora). This includes functional measurements, user errors, ties to UI design, etc.
B. Data science to understand how to evaluate users based on their interaction with the document set (user analytics), including time to task completion, robustness to frustration, ability to complete task, etc.
In (A) we use analytics to discern what types of workflows and user-document interactions to enable. In (B) we use analytics to classify different types of users, with the aim of feeding this back into the design and architecture (structure and flow) of the user interface(s).
The goal of the 'data analytics' portion of the tutorial will be to introduce the audience to classification and evaluation approaches, and from this understanding help to identify research challenges and experiments to be performed by the document engineering research community.
Format: Presentations and demos
Kim Marriott is a Professor at Monash University, Australia, where he leads the Creative Technologies and HCI group. Marriott obtained his PhD in 1988 from the University of Melbourne, then worked as a Research Scientist with IBM TJ Watson Labs in New York before taking a position at Monash in 1993. With around 200 scientific publications, he is best known for his research in data visualisation, document engineering, assistive technologies, optimisation and visual languages. Originally a theoretical computer scientist, he became more and more interested in the people using computers, and now most of his research involves some sort of user study.
Steve Simske is an HP Fellow and Director in HP Inc Labs where he leads teams in analytics, security printing, authentication, forensics, mechatronics, 3D printing, microfluidics and sensing. Simske obtained his BS (Marquette) and MS (Rensselaer Polytechnic) in Biomedical Engineering, his PhD (University of Colorado) in Electrical Engineering, and his PostDoc in Aerospace Engineering. Steve has more than 150 US patents and 400 scientific publications. Most relevant to this workshop is his book on Meta-Algorithmics (Wiley, 2013) which explores using multiple simultaneous approaches to obtain more robust, cost-sensitive and/or accurate intelligent systems.
Margaret Sturgill currently works at HP Labs in Fort Collins, Colorado, in the HP Labs Print Adjacencies & 3D Lab. Her main interests include Document Security, Document Workflows, Supply Chain Analysis and Anti-counterfeiting. She holds a BS in Computer Science and Mathematics from the University of Kentucky and a Ph.D. in Computer Science from the University of Utah. She has previously worked at HP on scanner image processing software and as Chief Operating Officer of Ataman Software Inc. She has 25 US patents.