TextFlux: An OCR-Free DiT Model for High-Fidelity Multilingual Scene Text Synthesis

dc.contributor.author: Xie, Yu
dc.contributor.author: Zhang, Jielei
dc.contributor.author: Chen, Pengyu
dc.contributor.author: Wang, Weihang
dc.contributor.author: Gao, Longwen
dc.contributor.author: Li, Peiyi
dc.contributor.author: Qiao, Qian
dc.contributor.author: Lian, Zhouhui
dc.contributor.editor: Masia, Belen
dc.contributor.editor: Thies, Justus
dc.date.accessioned: 2026-04-17T11:52:30Z
dc.date.available: 2026-04-17T11:52:30Z
dc.date.issued: 2026
dc.description.abstract: Diffusion-based scene text synthesis has progressed rapidly, yet existing methods commonly rely on additional visual conditioning modules and require large-scale annotated data to support multilingual generation. In this work, we revisit the necessity of complex auxiliary modules and further explore an approach that simultaneously ensures glyph accuracy and achieves high-fidelity scene integration, by leveraging diffusion models’ inherent capabilities for contextual reasoning. To this end, we introduce TextFlux, a DiT-based framework that enables multilingual scene text synthesis. The advantages of TextFlux can be summarized as follows: (1) OCR-free model architecture. TextFlux eliminates the need for OCR encoders that are specifically used to extract visual text-related features. (2) Strong multilingual scalability. TextFlux is effective in low-resource multilingual settings, and achieves strong performance in newly added languages with fewer than 1,000 samples. (3) Streamlined training setup. TextFlux is trained with only 1% of the training data required by competing methods. (4) Controllable multi-line text generation. TextFlux offers flexible multi-line synthesis with precise line-level control, outperforming methods restricted to single-line or rigid layouts. Extensive experiments and visualizations demonstrate that TextFlux outperforms previous methods in both qualitative and quantitative evaluations. Our code is available at https://github.com/yyyyyxie/textflux.
dc.description.number: 2
dc.description.sectionheaders: Diffusion and Beyond: Controlled Image Generation and Stylization
dc.description.seriesinformation: Computer Graphics Forum
dc.description.volume: 45
dc.identifier.doi: 10.1111/cgf.70342
dc.identifier.issn: 1467-8659
dc.identifier.pages: 12 pages
dc.identifier.uri: https://diglib.eg.org/handle/10.1111/cgf70342
dc.identifier.uri: https://doi.org/10.1111/cgf.70342
dc.publisher: The Eurographics Association and John Wiley & Sons Ltd.
dc.rights: CC-BY-4.0
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.subject: Scene Text Synthesis
dc.subject: Diffusion Models
dc.subject: OCR-free
dc.subject: Image Editing
dc.subject: Multilingual Generation
dc.title: TextFlux: An OCR-Free DiT Model for High-Fidelity Multilingual Scene Text Synthesis
Files
Original bundle
Name: cgf70342.pdf
Size: 20.69 MB
Format: Adobe Portable Document Format
Name: paper1161_mm1.zip
Size: 20.96 MB
Format: Zip file