References
Chen, J.; Shen, Y.; Gao, J.; Liu, J.; and Liu, X. 2018. Language-based image editing with recurrent attentive models. In CVPR.
Cheng, Y.; Gan, Z.; Li, Y.; Liu, J.; and Gao, J. 2020. Sequential attention GAN for interactive image editing. In ACMMM.
Ding, M.; Yang, Z.; Hong, W.; Zheng, W.; Zhou, C.; Yin, D.; Lin, J.; Zou, X.; Shao, Z.; Yang, H.; and Tang, J. 2021. CogView: Mastering Text-to-Image Generation via Transformers. arXiv:2105.13290.
El-Nouby, A.; Sharma, S.; Schulz, H.; Hjelm, D.; Asri, L. E.; Kahou, S. E.; Bengio, Y.; and Taylor, G. W. 2019. Tell, draw, and repeat: Generating and modifying images based on continual linguistic instruction. In ICCV.
Fu, T.-J.; Wang, X.; Grafton, S.; Eckstein, M.; and Wang, W. Y. 2020. Iterative language-based image editing via self-supervised counterfactual reasoning. In EMNLP.
Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In NIPS.
Guo, X.; Wu, H.; Cheng, Y.; Rennie, S.; Tesauro, G.; and Feris, R. 2018. Dialog-based Interactive Image Retrieval. In NIPS, 676–686.
Guo, X.; Wu, H.; Gao, Y.; Rennie, S.; and Feris, R. 2019. The Fashion IQ Dataset: Retrieving Images by Combining Side Information and Relative Natural Language Feedback.
Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; and Hochreiter, S. 2017. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS.
Karras, T.; Laine, S.; and Aila, T. 2019. A style-based generator architecture for generative adversarial networks. In CVPR, 4401–4410.
Karras, T.; Laine, S.; Aittala, M.; Hellsten, J.; Lehtinen, J.; and Aila, T. 2020. Analyzing and improving the image quality of StyleGAN. In CVPR, 8110–8119.
Li, B.; Qi, X.; Lukasiewicz, T.; and Torr, P. H. 2020. ManiGAN: Text-guided image manipulation. In CVPR.
Lin, T.-H.; Bui, T.; Kim, D. S.; and Oh, J. 2018. A multimodal dialogue system for conversational image editing. In NeurIPSW.
Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft COCO: Common objects in context. In ECCV, 740–755. Springer.
Liu, Y.; Li, Q.; Sun, Z.; and Tan, T. 2020. Style Intervention: How to Achieve Spatial Disentanglement with Style-based Generators? arXiv:2011.09699.
Nam, S.; Kim, Y.; and Kim, S. J. 2018. Text-adaptive generative adversarial networks: Manipulating images with natural language. In NeurIPS.
Patashnik, O.; Wu, Z.; Shechtman, E.; Cohen-Or, D.; and Lischinski, D. 2021. StyleCLIP: Text-driven manipulation of StyleGAN imagery. arXiv:2103.17249.
Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. 2021. Learning transferable visual models from natural language supervision. arXiv:2103.00020.
Ramesh, A.; Pavlov, M.; Goh, G.; Gray, S.; Voss, C.; Radford, A.; Chen, M.; and Sutskever, I. 2021. Zero-shot text-to-image generation. arXiv:2102.12092.
Salimans, T.; Goodfellow, I.; Zaremba, W.; Cheung, V.; Radford, A.; and Chen, X. 2016. Improved techniques for training GANs. In NIPS, 2234–2242.
Shi, J.; Xu, N.; Bui, T.; Dernoncourt, F.; Wen, Z.; and Xu, C. 2020. A benchmark and baseline for language-driven image editing. In ACCV.
Shi, J.; Xu, N.; Xu, Y.; Bui, T.; Dernoncourt, F.; and Xu, C. 2021. Learning by planning: Language-guided global image editing. In CVPR.
Tan, F.; Cascante-Bonilla, P.; Guo, X.; Wu, H.; Feng, S.; and Ordonez, V. 2019. Drill-down: Interactive Retrieval of Complex Scenes using Natural Language Queries. In NeurIPS.
van den Oord, A.; Vinyals, O.; and Kavukcuoglu, K. 2017. Neural discrete representation learning. In NIPS, 6309–6318.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In NIPS.
Wu, Z.; Lischinski, D.; and Shechtman, E. 2021. StyleSpace analysis: Disentangled controls for StyleGAN image generation. In CVPR, 12863–12872.
Xia, W.; Yang, Y.; Xue, J.-H.; and Wu, B. 2021. TediGAN: Text-Guided Diverse Face Image Generation and Manipulation. In CVPR.
Xu, T.; Zhang, P.; Huang, Q.; Zhang, H.; Gan, Z.; Huang, X.; and He, X. 2018. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In CVPR, 1316–1324.
Yu, A.; and Grauman, K. 2014. Fine-Grained Visual Comparisons with Local Learning. In CVPR.
Zhang, H.; Koh, J. Y.; Baldridge, J.; Lee, H.; and Yang, Y. 2021a. Cross-Modal Contrastive Learning for Text-to-Image Generation. arXiv:2101.04702.
Zhang, R.; Isola, P.; Efros, A. A.; Shechtman, E.; and Wang, O. 2018. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR.
Zhang, R.; Yu, T.; Shen, Y.; Jin, H.; and Chen, C. 2019. Text-Based Interactive Recommendation via Constraint-Augmented Reinforcement Learning. In NeurIPS, 15188–15198.