
AIpparel: A Multimodal Foundation Model for Digital Garments

Kiyohiro Nakayama1*          Jan Ackermann1,2*          Timur Levent Kesdogan1,2*          Yang Zheng1          Maria Korosteleva2          Olga Sorkine-Hornung2          Leonidas Guibas1          Guandao Yang1          Gordon Wetzstein1

1Stanford University 2ETH Zürich
* Equal contribution.
Work done as a visiting researcher at Stanford.

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2025

AIpparel is a multimodal foundation model for digital garments, built by fine-tuning a large multimodal model on a custom sewing pattern dataset with a novel tokenization scheme for these patterns. AIpparel generates complex, diverse, high-quality sewing patterns from multimodal inputs, such as text and images, and it unlocks new applications such as language-instructed sewing pattern editing. The generated sewing patterns can be directly used to simulate the corresponding 3D garments.

Abstract

Apparel is essential to human life, offering protection, mirroring cultural identities, and showcasing personal style. Yet, the creation of garments remains a time-consuming process, largely due to the manual work involved in designing them. To simplify this process, we introduce AIpparel, a large multimodal model for generating and editing sewing patterns. Our model fine-tunes state-of-the-art large multimodal models (LMMs) on a custom-curated large-scale dataset of over 120,000 unique garments, each with multimodal annotations including text, images, and sewing patterns. Additionally, we propose a novel tokenization scheme that concisely encodes these complex sewing patterns so that LMMs can learn to predict them efficiently. AIpparel achieves state-of-the-art performance in single-modal tasks, including text-to-garment and image-to-garment prediction, and it enables novel multimodal garment generation applications such as interactive garment editing.

Dataset
The GarmentCodeData-Multimodal dataset extends GarmentCodeData (GCD) with additional, rich annotations. Specifically, we provide textual descriptions of garments: a descriptive text that describes the garment's style in detail, and a speculative text that describes a suitable occasion for the garment. Moreover, we provide pairs of garments that are edited versions of each other, and each pair is annotated with the corresponding editing instruction. A data-layout sketch follows below.
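For concreteness, here is a minimal Python sketch of how one sample and one editing pair in GarmentCodeData-Multimodal could be represented. The field names and file paths are illustrative assumptions, not the released dataset's exact schema.

from dataclasses import dataclass

@dataclass
class GarmentSample:
    pattern_path: str       # sewing pattern specification inherited from GCD (hypothetical path)
    image_path: str         # rendered garment image (hypothetical path)
    descriptive_text: str   # detailed description of the garment's style
    speculative_text: str   # a suitable occasion for wearing the garment

@dataclass
class EditPair:
    source: GarmentSample   # original garment
    target: GarmentSample   # edited version of the same garment
    instruction: str        # natural-language editing instruction pairing the two

example = EditPair(
    source=GarmentSample("patterns/dress_0001.json", "renders/dress_0001.png",
                         "A knee-length A-line dress with short sleeves and a round neckline.",
                         "A light dress for a casual summer afternoon."),
    target=GarmentSample("patterns/dress_0001_edit.json", "renders/dress_0001_edit.png",
                         "A knee-length A-line dress with long sleeves and a round neckline.",
                         "A dress for a cool autumn evening."),
    instruction="Lengthen the sleeves to full length.",
)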

Dataset Samples


Method
AIpparel uses a novel sewing pattern tokenizer (light blue region) to tokenize each panel into a set of special tokens (light green region). Panel vertex positions and 3D transformations are incorporated into the tokens via positional embeddings (colored arrows). AIpparel takes multimodal inputs, such as images and text (light orange region), and outputs sewing patterns via autoregressive sampling (light grey region). Finally, the output is decoded into simulation-ready sewing patterns (light pink region).
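The sketch below illustrates the tokenization idea in Python: each panel becomes a short sequence of special tokens whose embeddings are augmented with the panel's 2D vertex positions and its 3D placement. The token names, edge parameterization, and embedding layout are assumptions for illustration only, not AIpparel's exact scheme.

import torch
import torch.nn as nn

class PanelTokenizer(nn.Module):
    def __init__(self, d_model=4096):  # hidden size of the backbone LMM (illustrative)
        super().__init__()
        # Special tokens marking pattern/panel boundaries and edges.
        self.special_tokens = {"<pattern_start>": 0, "<pattern_end>": 1,
                               "<panel_start>": 2, "<panel_end>": 3, "<edge>": 4}
        self.token_embed = nn.Embedding(len(self.special_tokens), d_model)
        # Continuous geometry is injected into the token embeddings via small
        # projections, analogous to the positional embeddings (colored arrows).
        self.vertex_embed = nn.Linear(2, d_model)     # 2D vertex position in panel space
        self.transform_embed = nn.Linear(7, d_model)  # 3D translation + quaternion placement

    def embed_panel(self, vertices, transform):
        # vertices: float tensor (V, 2) panel outline; transform: float tensor (7,)
        tokens = [self.token_embed(torch.tensor(self.special_tokens["<panel_start>"]))]
        for v in vertices:
            edge_tok = self.token_embed(torch.tensor(self.special_tokens["<edge>"]))
            tokens.append(edge_tok + self.vertex_embed(v))
        end_tok = self.token_embed(torch.tensor(self.special_tokens["<panel_end>"]))
        tokens.append(end_tok + self.transform_embed(transform))
        return torch.stack(tokens)  # (V + 2, d_model) sequence for this panel

In the full pipeline, such per-panel sequences would be interleaved with image and text tokens and fed to the LMM, which predicts the pattern tokens autoregressively before they are decoded back into a simulation-ready sewing pattern.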

Image to Sewing Pattern Reconstruction
Left: Our model reconstructs suitable sewing patterns from the input image alone. In contrast, SewFormer does not produce simulation-ready sewing patterns despite fine-tuning. Right: Our model also achieves state-of-the-art performance on the existing SewFactory dataset.

Text to Sewing Pattern Generation
Our model can generate sewing patterns that follow text descriptions. The generated sewing patterns closely match the textual details, as highlighted.

Sewing Pattern Editing
Our model can take an existing sewing pattern and edit it based on textual instructions. The resulting sewing pattern closely follows the style of the original garment while incorporating the desired edit.



Citation
@article{nakayama2024aipparel,
  title={AIpparel: A Large Multimodal Generative Model for Digital Garments},
  author={Kiyohiro Nakayama and Jan Ackermann and Timur Levent Kesdogan and Yang Zheng and Maria Korosteleva and Olga Sorkine-Hornung and Leonidas Guibas and Guandao Yang and Gordon Wetzstein},
  journal={Computer Vision and Pattern Recognition (CVPR)},
  year={2025}
}