I am a Ph.D. candidate in Artificial Intelligence at AImageLab (University of Modena and Reggio Emilia), advised by Rita Cucchiara and Silvia Cascianelli. I work on visually rich Document Understanding and Generation, exploring techniques in handwriting imitation, image generation, and document understanding.
In 2025, I spent six months as an Applied Scientist Intern at Amazon CVNA, where I researched foundation GenAI models to establish visual understanding and generation capabilities in images with text.
Education
University of Modena and Reggio Emilia
Ph.D. in Artificial Intelligence
Advisors: Prof. Rita Cucchiara and Dr. Silvia Cascianelli
University of Modena and Reggio Emilia
M.S. in Computer Engineering, 110 CUM LAUDE
Thesis: Introducing Vision Transformers in Denoising Diffusion Probabilistic Models
Experience
Applied Scientist Intern — Amazon Berlin
Advisors: Bojan Pepik and Michael Opitz
Research on foundation GenAI models to establish visual understanding and generation capabilities in images with text. Managed project planning and supervised the data collection and annotation team.
Publications
CVPR 2025
🏆 Outstanding Reviewer Award
Zero-Shot Styled Text Image Generation, but Make It Autoregressive
Vittorio Pippi*, Fabio Quattrini*, Silvia Cascianelli, Alessio Tonioni, Rita Cucchiara
Existing styled handwritten text generation (HTG) methods struggle to generalize to new styles and have technical constraints like maximum output length. We propose a new framework that combines a variational autoencoder with an autoregressive Transformer to generate styled text images based on both content and style examples. Trained solely on a diverse synthetic dataset of English text with over 100,000 fonts, our approach can reproduce previously unseen styles in a zero-shot setting. Our model generates clean images without background artifacts, making them easier for downstream use. We extensively evaluate our method on both typewritten and handwritten text images of any length.
ECCV 2024
Merging and Splitting Diffusion Paths for Semantically Coherent Panoramas
Fabio Quattrini, Vittorio Pippi, Silvia Cascianelli, Rita Cucchiara
Existing panorama generation methods with diffusion models often create images that look aligned but lack semantic consistency. We introduce the Merge-Attend-Diffuse operator (MAD), which modifies pretrained diffusion models at inference time to generate panorama images with improved semantic and perceptual coherence. By merging diffusion paths and reprogramming self- and cross-attention layers, our approach addresses the semantic incoherence of previous methods and outperforms them, as demonstrated by extensive experiments and a user study.
AI for Visual Arts Workshop, ECCV 2024
⭐ Oral Spotlight (Top 15%)
Alfie: Democratising RGBA Image Generation With No $$$
Fabio Quattrini, Vittorio Pippi, Silvia Cascianelli, Rita Cucchiara
Most image generation models are incapable of generating RGBA images, required by graphic designers for composition in artworks. We propose a fully automated approach that modifies the inference process of a pretrained Diffusion Transformer to generate RGBA images with prompt-guided control and high visual quality. Our method enables the creation of complete subjects with easily removable backgrounds, making them ideal for integration into design projects. A user study shows that users prefer, in most cases, our solution over traditional generate-and-matte pipelines, and our illustrations work well in composite scene generation.
AI for Digital Humanities Workshop, ECCV 2024
μgat: Improving Single-Page Document Parsing by Providing Multi-Page Context
Fabio Quattrini*, Carmine Zaccagnino*, Silvia Cascianelli, Laura Righi, Rita Cucchiara
We focus on multi-page visually rich documents, where the layout is as important as the text content in conveying information through structure. In this context, Document Parsing has emerged as a task to process document images and convert them into machine-readable structured representations, usually in a markup language. However, most current models consider single-page documents. In this work, we propose an adaptation to process multi-page context.
ICDAR 2024
⭐ Oral
Binarizing Documents by Leveraging both Space and Frequency
Fabio Quattrini, Vittorio Pippi, Silvia Cascianelli, Rita Cucchiara
Document image binarization remains an unsolved problem due to variable page degradations and the need for both local and global context. We propose a solution based on Fast Fourier Convolutions, which models global information more effectively than standard convolutions and scales better to different resolutions at inference time, while being more efficient than Vision Transformers. Our method is validated on diverse document degradations.
BMVC 2023
HWD: A Novel Evaluation Score for Styled Handwritten Text Generation
Vittorio Pippi, Fabio Quattrini, Silvia Cascianelli, Rita Cucchiara
We introduce Handwriting Distance (HWD), a new metric for evaluating Styled Handwritten Text Generation models. HWD measures similarity in the feature space of a network trained to extract handwriting style features from variable-length images. We show its effectiveness with extensive experiments.
4th ICCV Workshop on e-Heritage, ICCV 2023
⭐ Oral Spotlight
Volumetric Fast Fourier Convolution for Detecting Ink on the Carbonized Herculaneum Papyri
Fabio Quattrini, Vittorio Pippi, Silvia Cascianelli, Rita Cucchiara
We adapt the Fast Fourier Convolution operator for 3D volumetric data and integrate it in a segmentation architecture, aiming to detect ink traces in the 3D X-ray tomography scans of the carbonized Herculaneum papyri.