I am a Ph.D. candidate in Artificial Intelligence at AImageLab (University of Modena and Reggio Emilia), advised by Rita Cucchiara and Silvia Cascianelli. I work on visually rich Document Understanding and Generation, exploring techniques in handwriting imitation, image generation, and document understanding.

Additionally, I was an Applied Scientist Intern at Amazon CVNA for six months (09/2024 - 03/2025), where I researched foundation GenAI models to establish visual understanding and generation capabilities in images with text.

Education

2023—Present

University of Modena and Reggio Emilia

Ph.D. in Artificial Intelligence

Advisors: Prof. Rita Cucchiara and Dr. Silvia Cascianelli

2019—2022

University of Modena and Reggio Emilia

M.S. in Computer Engineering, 110 CUM LAUDE

Thesis: Introducing Vision Transformers in Denoising Diffusion Probabilistic Models

Experience

09/2024 - 03/2025

Applied Scientist Intern, Amazon Berlin

Advisors: Bojan Pepik and Michael Opitz

Research on foundation GenAI models to establish visual understanding and generation capabilities in images with text. Managed project planning and supervised the data collection and annotation team.

Publications

CVPR 2025

🏆 Outstanding Reviewer Award

Zero-Shot Styled Text Image Generation, but Make It Autoregressive

Vittorio Pippi*, Fabio Quattrini*, Silvia Cascianelli, Alessio Tonioni, Rita Cucchiara

Existing styled handwritten text generation (HTG) methods struggle to generalize to new styles and have technical constraints such as a maximum output length. We propose a new framework that combines a variational autoencoder with an autoregressive Transformer to generate styled text images based on both content and style examples. Trained solely on a diverse synthetic dataset of English text with over 100,000 fonts, our approach can reproduce previously unseen styles in a zero-shot setting. Our model generates clean images without background artifacts, making them easier for downstream use. We extensively evaluate our method on both typewritten and handwritten text images of any length.

ECCV 2024

Merging and Splitting Diffusion Paths for Semantically Coherent Panoramas

Fabio Quattrini, Vittorio Pippi, Silvia Cascianelli, Rita Cucchiara

Existing panorama generation methods with diffusion models often create images that look aligned but lack semantic consistency. We introduce the Merge-Attend-Diffuse operator (MAD), which modifies pretrained diffusion models at inference time to generate panorama images with improved semantic and perceptual coherence. By merging diffusion paths and reprogramming self- and cross-attention layers, our approach addresses the semantic incoherence of previous methods and outperforms them, as demonstrated by extensive experiments and a user study.

AI for Visual Arts Workshop, ECCV 2024

⭐ Oral Spotlight (Top 15%)

Alfie: Democratising RGBA Image Generation With No $$$

Fabio Quattrini, Vittorio Pippi, Silvia Cascianelli, Rita Cucchiara

Most image generation models are incapable of generating RGBA images, required by graphic designers for composition in artworks. We propose a fully automated approach that modifies the inference process of a pretrained Diffusion Transformer to generate RGBA images with prompt-guided control and high visual quality. Our method enables the creation of complete subjects with easily removable backgrounds, making them ideal for integration into design projects. A user study shows that users prefer, in most cases, our solution over traditional generate-and-matte pipelines, and our illustrations work well in composite scene generation.

AI for Digital Humanities Workshop, ECCV 2024

μgat: Improving Single-Page Document Parsing by Providing Multi-Page Context

Fabio Quattrini*, Carmine Zaccagnino*, Silvia Cascianelli, Laura Righi, Rita Cucchiara

We focus on multi-page visually rich documents, where the layout is as important as the text content in conveying information through the document's structure. In this context, Document Parsing has emerged as a task to process document images and convert them into machine-readable structured representations, usually in a markup language. However, most current models consider only single-page documents. In this work, we propose an adaptation to process multi-page context.

ICDAR 2024

⭐ Oral

Binarizing Documents by Leveraging both Space and Frequency

Fabio Quattrini, Vittorio Pippi, Silvia Cascianelli, Rita Cucchiara

Document image binarization remains an unsolved problem due to variable page degradations and the need for both local and global context. We propose a solution based on Fast Fourier Convolutions, which models global information more effectively than standard convolutions and scales better to different resolutions at inference time, while being more efficient than Vision Transformers. Our method is validated on diverse document degradations.

BMVC 2023

HWD: A Novel Evaluation Score for Styled Handwritten Text Generation

Vittorio Pippi, Fabio Quattrini, Silvia Cascianelli, Rita Cucchiara

We introduce Handwriting Distance (HWD), a new metric for evaluating Styled Handwritten Text Generation models. HWD measures similarity in the feature space of a network trained to extract handwriting style features from variable-length images. We show its effectiveness with extensive experiments.

4th ICCV Workshop on e-Heritage, ICCV 2023

⭐ Oral Spotlight

Volumetric Fast Fourier Convolution for Detecting Ink on the Carbonized Herculaneum Papyri

Fabio Quattrini, Vittorio Pippi, Silvia Cascianelli, Rita Cucchiara

We adapt the Fast Fourier Convolution operator to 3D volumetric data and integrate it into a segmentation architecture, aiming to detect ink traces in the 3D X-ray tomography scans of the carbonized Herculaneum papyri.