TiGAN: Text-Based Interactive Image Generation and Manipulation
Yufan Zhou^1*, Ruiyi Zhang^2, Jiuxiang Gu^2, Chris Tensmeyer^2, Tong Yu^2, Changyou Chen^1, Jinhui Xu^1, Tong Sun^2
^1 State University of New York at Buffalo    ^2 Adobe Research
{yufanzho, changyou, jinhui}@buffalo.edu  {ruizhang, jigu, tensmeye, tyu, tsun}@adobe.com
Abstract
Using natural-language feedback to guide image generation and manipulation can greatly lower the required effort and skill. This topic has received increased attention in recent years through refinements of Generative Adversarial Networks (GANs); however, most existing works are limited to single-round interaction, which is not reflective of real-world interactive image editing workflows. Furthermore, previous works dealing with multi-round scenarios are limited to predefined feedback sequences, which is also impractical. In this paper, we propose a novel framework for Text-based interactive image generation and manipulation (TiGAN) that responds to users' natural-language feedback. TiGAN utilizes the powerful pre-trained CLIP model to understand users' natural-language feedback and exploits contrastive learning for a better text-to-image mapping. To maintain image consistency during interactions, TiGAN generates intermediate feature vectors aligned with the feedback and selectively feeds these vectors to our proposed generative model. Empirical results on several datasets show that TiGAN improves both interaction efficiency and image quality while better avoiding undesirable image manipulation during interactions.
Introduction
Text-to-image generation and text-guided image manipulation are important research topics, which have demonstrated great application potential due to the flexibility and usability of natural language. Compared to traditional image editing software that requires users to learn complex tools, language-driven methods can be more intuitive for novice users.
One main challenge of text-to-image generation/manipula-
tion is that images are 2D arrays of pixels, while natural
language expressions are sequences of words with no clear
mapping between them. While existing works (Xu et al.
2018; Zhang et al. 2021a; Xia et al. 2021; Patashnik et al.
2021) have proposed useful new models and loss functions,
they share the limitation that they focus on single-round
tasks, i.e., these methods generate or manipulate an image
only in the context of a single natural language instruction.
Such a restriction limits the applicability of the models for real use cases, as a user may want to continually refine an image until it is satisfactory. While such models could be naively applied recursively, at each round the model would be oblivious to previously given feedback, leading to a high likelihood that it interferes with previous edits.

*Work done during an internship at Adobe.
Corresponding author.
Copyright © 2022, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
There also exist works that sequentially generate images following different instructions (El-Nouby et al. 2019; Fu et al. 2020). However, these methods are not fully interactive and are less practical. For example, the models in (El-Nouby et al. 2019; Fu et al. 2020) are trained on predefined sequences of natural-language instructions, where the instructions are independent of the generated images and follow a predefined order. When a real user interacts with the model, however, the natural-language feedback is unpredictable and depends on the generated image in each round. Thus, the use of predefined sequences is impractical for real-world interactive applications.
In this work, we focus on a new problem of interactive image generation, which generalizes text-to-image generation and text-guided image manipulation to the multi-round setting. It is a natural extension of existing single-round methods, and our goal is to generate the desired image with fewer interactions. Consequently, we address two critical challenges: (i) how to learn a better text-to-image mapping; and (ii) how to avoid undesirable image manipulations throughout the interaction session. A better text-to-image mapping improves overall image quality and how well the image agrees with the text. An undesirable image manipulation occurs when the model accidentally changes an aspect of the image that the user has already specified. For instance, assume the user requests the model to "generate a man's face" and then issues the command "make the hair long". For this two-round example, we expect two generated images: an image of a man, then an image of the same man with long hair. Receiving an image of a man and then an image of a woman with long hair is a failure case, even though it satisfies the user's requirement at each individual round. Since the user requested the image to be of a man in the first round, the model should not change that aspect of the image in later manipulations unless the user explicitly says otherwise.
To handle the aforementioned challenges, we propose
Text-based Interactive image generation and manipulation
(TiGAN).

Figure 1: Overview of our interactive image generation. A user starts a session and keeps giving natural-language feedback to the generative model until they are satisfied with the generated image. We propose TiGAN with specifically designed contrastive losses that encourage a better text-to-image mapping, which is used in both generation and manipulation. The pre-trained CLIP encoders help TiGAN to better understand images and texts semantically.

Different from existing works that focus on complicated architecture designs, we tackle the problem by directly adapting powerful unconditional generative models
into our model for text-conditional generation. Specifically,
TiGAN uses state-of-the-art (SOTA) StyleGAN2 (Karras
et al. 2020) as its backbone and uses Contrastive Language-
Image Pre-training (CLIP) (Radford et al. 2021) to inject
text information into StyleGAN2. CLIP is a multi-modal
model pre-trained on 400 million text-image pairs and con-
sists of one image encoder and one text encoder that respec-
tively map images and text into a unified joint embedding
space. Using CLIP, TiGAN can evaluate the semantic simi-
larity of text with images inside the joint embedding space.
We train TiGAN with various proposed contrastive losses
that encourage the model to learn a better text-to-image
mapping with disentangled intermediate features. On top of
the trained model, we propose an image manipulation mech-
anism that manipulates an image according to text feedback
and avoids undesirable visible changes. We achieve this by
only updating the intermediate features of the generator that
are relevant to the text.
To summarize, we propose a novel model for Text-
based Interactive image generation (abbreviated TiGAN).
Our main contributions are as follows:
• We propose a novel text-to-image generation model, which seamlessly integrates the SOTA StyleGAN2 and the CLIP model. To achieve a better text-to-image mapping with disentangled, semantically meaningful features, we also propose new contrastive losses to train the model;
• We further propose a new text-guided image manipulation mechanism, which can handle complex text information and maintain image consistency during the interaction;
• We conduct extensive experiments demonstrating the advances of the proposed method over SOTA methods in both standard text-to-image generation and interactive image generation settings; human evaluations further verify the effectiveness of the proposed method compared to existing works.
Proposed Framework
Based on generative adversarial networks (GANs) (Goodfel-
low et al. 2014) and contrastive language-image pre-training
model (CLIP) (Radford et al. 2021), our framework consists
of a text-to-image generation module and a text-guided im-
age manipulation mechanism. Our proposed framework for
interactive image generation is illustrated in Figure 1, with
details described below. Different from the standard text-to-
image generation task which is in a single-round setting, in-
teractive image generation is naturally in a multi-round set-
ting. At every round, the user provides natural-language feedback to the proposed model, and the model generates or manipulates images according to the requirements. The resulting images are shown to the user to obtain further feedback, and the session ends when the user is satisfied with the results.
Architecture of the Proposed TiGAN
In this part, we present the detailed architecture design of our proposed framework for text-based interactive image generation. Throughout the paper, $z$ denotes standard Gaussian noise, $x$ denotes a real image sample, $x'$ denotes a generated image, and $T$ denotes a raw text description.
Generator architecture The generator is used to gener-
ate realistic and high-quality data samples. To achieve this,
we build our generator based on the StyleGAN2 architecture
(Karras et al. 2020). Our proposed generator architecture is illustrated in Figure 2, where $w$ denotes the intermediate latent vector and $\{s_i\}_{i=1}^{m}$ denote the vectors obtained by applying learned transformations to $w$. These transformations are affine transformations in the original StyleGAN2. Throughout the paper, we use $s$ to denote the concatenation of the vectors $\{s_i\}_{i=1}^{m}$, which is defined as the style vector following previous work (Wu, Lischinski, and Shechtman 2021).
Different from the original StyleGAN2, our proposed
generator requires extra text features as inputs. Thus the
main challenge is how to effectively extend the uncondi-
tional SOTA model to a conditional one by utilizing the text
information. Existing works (Zhu et al. 2019; Xu et al. 2018;
Zhang et al. 2021a) inject text information either by directly concatenating the text feature with the noise vector, or by updating the latent noise with learnable scale and bias factors. Different from these methods, which exploit different ways to update the noise vectors (the initial input of the generator), we handle the problem by updating well-disentangled intermediate features of the generator.
Figure 2: Illustration of the proposed generator architecture.
Replacing the proposed modules with affine transformations and removing the text information leads back to the original StyleGAN2 generator.
Intuitively, dimensions of a well-disentangled feature vec-
tor should be highly independent. Ideally, each dimension
should control a specific visible attribute of the gener-
ated images. Consequently, accurate text-to-image genera-
tion can be achieved if one can directly learn a mapping from
text to the well-disentangled features.
To this end, let the style space $\mathcal{S}$ be the space spanned by style vectors. As analyzed in previous works (Wu, Lischinski, and Shechtman 2021; Liu et al. 2020), the style space is shown to be well-disentangled. Inspired by these works, we propose to directly inject text information into this disentangled style space. We propose the following two modules to replace the affine transformations on $w$ in the original StyleGAN2:
$$s_i = \pi_i([\kappa_i(t), w]), \quad \text{and} \quad (1)$$
$$s_i = \phi_i(t) \odot \psi_i(w) + \chi_i(t), \quad (2)$$
where $\pi_i, \kappa_i, \phi_i, \psi_i, \chi_i$ denote different learnable functions constructed using 2-layer neural networks, $\odot$ denotes element-wise multiplication, $[\cdot, \cdot]$ denotes vector concatenation, and $t$ denotes the text feature extracted with the pre-trained CLIP model. With the proposed modules, the generator can generate images that match text descriptions. In practice, one can choose one of the modules or use both modules in the generator. In experiments, we start by using (1) for all $s_i$ and gradually tune the model architecture by using (2) for some layers. Generally, using only (1) already leads to promising results; using (2) for the last few layers may further improve them.
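To make the two conditioning modules concrete, the following is a minimal PyTorch sketch of (1) and (2). The hidden layer widths and the 512-dimensional CLIP text feature are illustrative assumptions rather than the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn

class ConcatStyleModule(nn.Module):
    """Eq. (1): s_i = pi_i([kappa_i(t), w]) -- concatenation-based conditioning."""
    def __init__(self, w_dim=512, t_dim=512, s_dim=512, hidden=512):
        super().__init__()
        # kappa_i: 2-layer network on the CLIP text feature t
        self.kappa = nn.Sequential(nn.Linear(t_dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        # pi_i: 2-layer network on the concatenation [kappa_i(t), w]
        self.pi = nn.Sequential(nn.Linear(hidden + w_dim, hidden), nn.ReLU(), nn.Linear(hidden, s_dim))

    def forward(self, t, w):
        return self.pi(torch.cat([self.kappa(t), w], dim=-1))

class ModulatedStyleModule(nn.Module):
    """Eq. (2): s_i = phi_i(t) * psi_i(w) + chi_i(t) -- element-wise modulation."""
    def __init__(self, w_dim=512, t_dim=512, s_dim=512, hidden=512):
        super().__init__()
        mlp = lambda d_in: nn.Sequential(nn.Linear(d_in, hidden), nn.ReLU(), nn.Linear(hidden, s_dim))
        self.phi, self.chi, self.psi = mlp(t_dim), mlp(t_dim), mlp(w_dim)

    def forward(self, t, w):
        return self.phi(t) * self.psi(w) + self.chi(t)

# Usage: one module per style layer, replacing StyleGAN2's per-layer affine transforms.
t = torch.randn(4, 512)   # CLIP text features (from the frozen text encoder)
w = torch.randn(4, 512)   # intermediate latent from the mapping network
s_i = ConcatStyleModule()(t, w)          # style vector for layer i via Eq. (1)
s_i_alt = ModulatedStyleModule()(t, w)   # alternative via Eq. (2)
```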
Discriminator architecture In standard unconditional
settings, the discriminator D(·) is trained to distinguish the
real samples from fake samples. In our conditional setting,
the discriminator should also consider text information to
distinguish samples. To incorporate the text information, we
propose to use the architecture in Figure 3, where $f_R(x)$ is a scalar indicating the unconditional realness of the image, as in a standard discriminator output, and $f_D(x)$ is the semantic feature extracted by the discriminator. An image $x$ is classified as real when it has both high similarity with the text $T$ and large unconditional realness $f_R(x)$. Thus we define $D(x) = f_R(x) + \langle f_D(x), t \rangle$ as the realness of image $x$ given the text feature $t$.
Figure 3: Illustration of the proposed discriminator. FC de-
notes fully-connected layers.
Consequently, the standard loss functions for our generator and discriminator are:
$$\mathcal{L}_G = -\mathbb{E}_{p(x')}[\log \sigma(D(x'))],$$
$$\mathcal{L}_D = -\mathbb{E}_{p(x)}[\log \sigma(D(x))] - \mathbb{E}_{p(x')}[\log(1 - \sigma(D(x')))],$$
where $\sigma(\cdot)$ is the sigmoid function, and $p(x)$, $p(x')$ denote the distributions of real and generated images respectively.
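As a small sanity check on the conditional score and the losses above, here is a hedged PyTorch sketch. `f_R` and `f_D` stand for the two discriminator heads in Figure 3 and are assumed callables; the softplus identities implement $-\log\sigma(\cdot)$ and $-\log(1-\sigma(\cdot))$.

```python
import torch
import torch.nn.functional as F

def discriminator_score(f_R, f_D, x, t):
    """Conditional realness D(x) = f_R(x) + <f_D(x), t>.
    f_R: image -> scalar realness; f_D: image -> semantic feature with the
    same dimensionality as the CLIP text feature t (both assumed callables)."""
    return f_R(x).squeeze(-1) + (f_D(x) * t).sum(dim=-1)

def adversarial_losses(d_real, d_fake):
    """Standard losses on the conditional scores:
    L_D = -log sigma(D(x)) - log(1 - sigma(D(x'))),  L_G = -log sigma(D(x')).
    softplus(-v) equals -log sigma(v); softplus(v) equals -log(1 - sigma(v))."""
    loss_d = F.softplus(-d_real).mean() + F.softplus(d_fake).mean()
    loss_g = F.softplus(-d_fake).mean()
    return loss_d, loss_g
```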
Text-Image Matching via Contrastive Learning
In TiGAN, we propose two additional contrastive losses to
enhance the text-image matching. Let $\{(x_i, T_i)\}_{i=1}^{n}$ be a mini-batch of text-image pairs and $\{x'_i\}_{i=1}^{n}$ be the corresponding generated fake images. $f_I$ and $f_T$ denote the image encoder and text encoder of CLIP respectively, and $t_i = f_T(T_i)$ denotes the CLIP text feature of $T_i$. We propose to add the following contrastive loss:
$$\mathcal{L}_{CLIP}(\{x'_i\}_{i=1}^{n}, \{T_i\}_{i=1}^{n}) = -\lambda \sum_{i=1}^{n} \log \frac{\exp(\tau \cos(f_I(x'_i), t_i))}{\sum_{j=1}^{n} \exp(\tau \cos(f_I(x'_i), t_j))} - (1-\lambda) \sum_{j=1}^{n} \log \frac{\exp(\tau \cos(f_I(x'_j), t_j))}{\sum_{i=1}^{n} \exp(\tau \cos(f_I(x'_i), t_j))} \quad (3)$$
where $\lambda$ and $\tau$ are hyper-parameters, and $\cos(\cdot, \cdot)$ denotes cosine similarity. Intuitively, minimizing $\mathcal{L}_{CLIP}$ encourages the generator to generate an image $x'_i$ that has high semantic similarity with the corresponding text description $T_i$. It also encourages $x'_i$ to have low semantic similarity with $\{T_j\}_{j \neq i}$, the text descriptions of the other images.
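In implementation terms, (3) is a symmetric contrastive loss over pre-computed CLIP features; a minimal PyTorch sketch is given below. Note that `cross_entropy` averages over the batch rather than summing, which only rescales the loss, and the values of $\tau$ and $\lambda$ are placeholders rather than our actual settings.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_feat, txt_feat, tau=10.0, lam=0.5):
    """Symmetric contrastive loss of Eq. (3).
    img_feat: CLIP image features f_I(x'_i) of the generated batch, shape (n, d)
    txt_feat: CLIP text features t_i of the paired descriptions, shape (n, d)
    tau, lam: temperature and mixing weight (placeholder values)."""
    img = F.normalize(img_feat, dim=-1)
    txt = F.normalize(txt_feat, dim=-1)
    logits = tau * img @ txt.t()                    # logits[i, j] = tau * cos(f_I(x'_i), t_j)
    labels = torch.arange(img.size(0), device=img.device)
    loss_i2t = F.cross_entropy(logits, labels)      # image-to-text direction (first sum)
    loss_t2i = F.cross_entropy(logits.t(), labels)  # text-to-image direction (second sum)
    return lam * loss_i2t + (1.0 - lam) * loss_t2i
```

The same function applied to the discriminator features $f_D(x_i)$ in place of $f_I(x'_i)$ gives the loss $\mathcal{L}_{CD}$ in (4) below.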
In addition, we propose the following contrastive loss to
regularize the discriminator.
$$\mathcal{L}_{CD}(\{x_i\}_{i=1}^{n}, \{T_i\}_{i=1}^{n}) = -\lambda \sum_{i=1}^{n} \log \frac{\exp(\tau \cos(f_D(x_i), t_i))}{\sum_{j=1}^{n} \exp(\tau \cos(f_D(x_i), t_j))} - (1-\lambda) \sum_{j=1}^{n} \log \frac{\exp(\tau \cos(f_D(x_j), t_j))}{\sum_{i=1}^{n} \exp(\tau \cos(f_D(x_i), t_j))} \quad (4)$$
where $f_D(x_i)$ denotes the feature from the discriminator as illustrated in Figure 3. $\mathcal{L}_{CD}$ encourages the discriminator to extract semantically meaningful features aligned with the input text.
The final loss functions for the generator and the discriminator are defined respectively as:
$$\mathcal{L}'_G = \mathcal{L}_G + \alpha \mathcal{L}_{CLIP}(\{x'_i\}_{i=1}^{n}, \{T_i\}_{i=1}^{n}) + \beta \mathcal{L}_{CD}(\{x'_i\}_{i=1}^{n}, \{T_i\}_{i=1}^{n}), \quad (5)$$
$$\mathcal{L}'_D = \mathcal{L}_D + \beta \mathcal{L}_{CD}(\{x_i\}_{i=1}^{n}, \{T_i\}_{i=1}^{n}). \quad (6)$$
During the training process, only the parameters of the
generator and discriminator are updated. The parameters of
the CLIP text and image encoders are fixed and loaded from
the pre-trained checkpoint. In a later section, we discuss the differences between our work and other methods that also use contrastive losses (Xu et al. 2018; Zhang et al. 2021a). We also perform an ablation study to better understand the impact of these contrastive losses.
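As a rough sketch of how (5) and (6) can be wired together, reusing the loss helpers sketched earlier and assuming a generator `G`, a discriminator `D` with heads `D.f_R`/`D.f_D`, a frozen `clip_model`, and a `clip_image_features` helper returning $f_I(\cdot)$ are defined elsewhere; the $\alpha$, $\beta$ defaults are placeholders, not our tuned values.

```python
import torch
import torch.nn.functional as F

# CLIP stays frozen at its pre-trained weights; only G and D are optimized.
for p in clip_model.parameters():
    p.requires_grad_(False)

def generator_loss(fake_images, t, alpha=1.0, beta=1.0):
    """L'_G = L_G + alpha * L_CLIP(fake, text) + beta * L_CD(fake, text), Eq. (5)."""
    d_fake = discriminator_score(D.f_R, D.f_D, fake_images, t)
    l_g = F.softplus(-d_fake).mean()                                     # -log sigma(D(x'))
    l_clip = clip_contrastive_loss(clip_image_features(fake_images), t)  # f_I(x') from frozen CLIP
    l_cd = clip_contrastive_loss(D.f_D(fake_images), t)
    return l_g + alpha * l_clip + beta * l_cd

def discriminator_loss(real_images, fake_images, t, beta=1.0):
    """L'_D = L_D + beta * L_CD(real, text), Eq. (6)."""
    d_real = discriminator_score(D.f_R, D.f_D, real_images, t)
    d_fake = discriminator_score(D.f_R, D.f_D, fake_images.detach(), t)
    l_d = F.softplus(-d_real).mean() + F.softplus(d_fake).mean()
    return l_d + beta * clip_contrastive_loss(D.f_D(real_images), t)
```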
Interactive Image Generation
Training with (5) and (6) results in a standard text-to-image
generation model for a single-round interaction. To extend
our model for interactive generation, we regard the problem as a combination of text-to-image generation and a sequence of text-guided image manipulations. Thus our next step is to design a method for image manipulation that only allows the model to manipulate the target attributes of the image. With this, information from previous interactions can be maximally preserved and undesirable image changes can be maximally avoided.
Let $z$ be a noise vector sampled from the standard Gaussian distribution and $t$ be a text feature from the dataset extracted with CLIP. The text-to-image generation process can be formulated as $x = G_I(s)$, $s = G_S(t, z)$, where $s = [s_1, s_2, \ldots, s_m]$ denotes the generated style vector. As shown in Figure 2, $G_S$ consists of the mapping network and the newly proposed module, which generates a style vector $s$ given text $t$; $G_I$ denotes the synthesis network in Figure 2, which generates an image based on the style vector $s$. To manipulate an image $x$ with style $s$ according to a new text $t'$, we first identify the most relevant dimensions of $s$, denoted $\{c_i\}_{i=1}^{k}$, and then generate a new style $s'$ via:
$$[s']_i = \begin{cases} [s]_i + \gamma\left([G_S(z, t')]_i - [s]_i\right) & \text{if } i \in \{c_i\}_{i=1}^{k} \\ [s]_i & \text{otherwise} \end{cases} \quad (7)$$
where $[s]_i$ denotes the $i$-th element of $s$, and $\gamma > 0$ is the step size (we set $\gamma = 1$ in practice). With the updated style vector, a new image is generated via $x' = G_I(s')$.
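A minimal PyTorch sketch of the selective update in (7) is shown below; the index set and step size are supplied by the caller, and `G_S`/`G_I` refer to the style and synthesis networks from Figure 2.

```python
import torch

def manipulate_style(s, s_new, relevant_idx, gamma=1.0):
    """Eq. (7): move only the text-relevant style dimensions toward the new
    style G_S(z, t') and keep every other dimension unchanged.
    s, s_new: current and newly generated style vectors (1-D tensors)
    relevant_idx: indices {c_i} selected by the relevance test in Eq. (9)
    gamma: step size (gamma = 1 in practice)."""
    s_prime = s.clone()
    s_prime[relevant_idx] = s[relevant_idx] + gamma * (s_new[relevant_idx] - s[relevant_idx])
    return s_prime

# Usage sketch:
# s     = G_S(z, t)                       # style behind the current image
# s_new = G_S(z, t_new)                   # style generated from the feedback t'
# x_new = G_I(manipulate_style(s, s_new, relevant_idx))
```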
To obtain the relevant dimensions $\{c_i\}_{i=1}^{k}$ of $s$, we follow the same strategy as (Patashnik et al. 2021). Let $\tilde{s}_i \in \mathbb{R}^{\dim(s)}$ be a vector with value $\eta_i$ on its $i$-th dimension and 0 on all other dimensions ($\tilde{s}_i$ has the same dimensionality as $s$). We use the following term to evaluate the effect of revising the $i$-th dimension:
$$r_i = \mathbb{E}_s\left[f_I(G_I(s + \tilde{s}_i)) - f_I(G_I(s))\right], \quad s = G_S(z, t) \quad (8)$$
where $z$ is sampled from the standard Gaussian distribution and $t$ is a randomly sampled text from the dataset. Intuitively, $r_i$ evaluates the semantic feature change caused by revising the $i$-th dimension of the style vector. After obtaining $r_i$ for all dimensions, we select every dimension $i$ satisfying:
$$\cos(\Delta t, r_i) \geq a \quad (9)$$
where $a > 0$ is a threshold and $\Delta t$ is the desired semantic change evaluated by CLIP. $\Delta t$ can be estimated in different ways. For instance, let $f_T$ be the text encoder of CLIP and suppose we would like to edit the hair color of the human face in the image. $\Delta t$ can be estimated using prompts: $\Delta t = f_T(\text{"a face with black hair"}) - f_T(\text{"a face with hair"})$. It can also be estimated directly by $\Delta t = f_T(\text{"this person should have black hair"}) - t$, where $t$ is the text feature of the previous round's instruction or the feature of an empty string (for the first round). In practice, we found both ways work equally well.
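The channel-relevance estimation of (8)-(9) can be sketched as follows in PyTorch. Since it requires one synthesis and CLIP pass per style dimension, it would typically be computed once and reused across edits; the perturbation size $\eta$ and threshold $a$ below are placeholder values.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def channel_relevance(G_I, f_I, styles, eta=5.0):
    """Eq. (8): estimate r_i, the CLIP-feature change caused by perturbing the
    i-th style dimension, averaged over a batch of style vectors.
    G_I: synthesis network, f_I: CLIP image encoder, styles: (b, dim_s)."""
    base = f_I(G_I(styles))                                  # f_I(G_I(s))
    r = []
    for i in range(styles.size(1)):
        perturbed = styles.clone()
        perturbed[:, i] += eta                               # add eta on dimension i only
        r.append((f_I(G_I(perturbed)) - base).mean(dim=0))   # average semantic change
    return torch.stack(r)                                    # (dim_s, d_clip)

def select_relevant_dims(r, delta_t, a=0.1):
    """Eq. (9): keep the dimensions whose effect aligns with the desired
    semantic change delta_t (the threshold a is a placeholder)."""
    sims = F.cosine_similarity(r, delta_t.unsqueeze(0), dim=-1)
    return (sims >= a).nonzero(as_tuple=True)[0]

# delta_t can be estimated from prompts with the CLIP text encoder f_T, e.g.
# delta_t = f_T("a face with black hair") - f_T("a face with hair"),
# or as f_T(current instruction) - (previous round's text feature).
```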
Related Work
Compared to existing works, our proposed framework is
more general and can be applied in different scenarios.
Text-to-Image Generation There are two major categories of text-to-image generation models. (Xu et al. 2018; Zhu et al. 2019; Zhang et al. 2021a) propose GAN-based structures, while (Ramesh et al. 2021; Ding et al. 2021) propose to combine a discrete variational auto-encoder (VAE) (van den Oord, Vinyals, and Kavukcuoglu 2017) with a transformer (Vaswani et al. 2017). Although (Ramesh et al. 2021; Ding et al. 2021) achieve better qualitative results than GAN-based models, they are large models trained on huge datasets that are inaccessible to most researchers; e.g., DALL-E (Ramesh et al. 2021) has over 12 billion parameters and is trained on a dataset consisting of 250 million text-image pairs.
Our proposed model follows the GAN-based structure.
Although AttnGAN (Xu et al. 2018) proposes DAMSM, which also uses a contrastive loss, it only trains the generator with that loss. XMC-GAN (Zhang et al. 2021a) proposes to use contrastive losses for both the generator and the discriminator, and uses a complicated architecture to reach SOTA results. Different from these works, which directly design complex model architectures in a heuristic way, we focus on how to efficiently turn an existing SOTA unconditional GAN into a text-conditioned GAN. To this end, we propose to inject text information into the disentangled feature space of the generator, and train both the generator and the discriminator with the proposed contrastive losses. The pre-trained CLIP model is also incorporated to provide better semantic information during training. As a result, we obtain SOTA performance on text-to-image generation tasks.
Text-guided Image Manipulation The general idea of manipulating images consists of three steps: map images into a latent space, manipulate the obtained latent vectors, and generate images from the manipulated latent vectors. Existing works (Wu, Lischinski, and Shechtman 2021; Liu et al. 2020; Li et al. 2020; Xia et al. 2021; Patashnik et al. 2021) handle the third step by directly using pre-trained GANs, and focus on the first or second step. Different from these works, we address the third step by training a better text-to-image generation model. We now briefly discuss the SOTA methods for the second step, which is also the step our manipulation mechanism focuses on.
(a) A green train is coming down the track; (b) A yellow school bus in the forest; (c) A small kitchen with a low ceiling; (d) A peaceful lake in a cloudy day; (e) Skyline of a modern city; (f) A tower on the mountain.
Figure 4: Text-to-image generation examples with the model trained on the MS-COCO 2014 dataset; the captions are the input text.

TediGAN (Xia et al. 2021) proposes to train different encoders that map different modalities into the same latent space of the generator. To manipulate an image according
to a text description, the authors propose to first map both the image and the text into the joint latent space, then combine the two latent vectors by replacing some elements of the image latent vector with elements from the text latent vector. The resulting latent vector is fed into the generator to produce the manipulated image. StyleCLIP (Patashnik et al. 2021) also proposes to utilize the pre-trained CLIP model and maximize the semantic similarity between the resulting images and the text descriptions. Three different methods are proposed in (Patashnik et al. 2021); we compare our method to the one with the most promising results, denoted StyleCLIP-Global. StyleCLIP-Global uses a strategy similar to our manipulation mechanism: it first finds $r_i$ for all dimensions, then selects the relevant dimensions and adds predefined constant values to the selected elements. The potential drawback of StyleCLIP-Global is that it can produce unnatural images; some examples are provided in the Appendix. This usually happens when inappropriate constants are added, pushing the style vectors outside the support of $G_I$. Compared with StyleCLIP-Global, we first train a text-to-image generation model on the given dataset, and then manipulate the style vector by (7) instead of adding constants. Since $G_S$ is trained in conjunction with $G_I$, the manipulated style vector remains within the support of $G_I$.
Interactive Multi-modal Learning The classical multi-round manipulation problem considers a model that sequentially generates images toward an ultimate goal by following a sequence of linguistic instructions (El-Nouby et al. 2019; Shi et al. 2020; Chen et al. 2018; Nam, Kim, and Kim 2018; Zhang et al. 2021b; Shi et al. 2021). The SOTA performance on this task is achieved by a self-supervised framework that incorporates counterfactual thinking to overcome data scarcity (Fu et al. 2020). All these methods are based on predefined sequences. Furthermore, they suffer from exposure bias and error accumulation, i.e., the image quality becomes worse with more interactions. A POMDP formulation for conversational image editing was also developed to enable full interaction (Lin et al. 2018), but the manipulation is based on predefined operations without any creation. The fully interactive image generation problem was explored in (Cheng et al. 2020), but the generation quality is poor and the method can only handle relatively simple datasets. Interactive image retrieval was explored in (Guo et al. 2019, 2018; Tan et al. 2019; Zhang et al. 2019), which focuses on learning a better recommender policy to handle users' natural-language feedback. Compared to the aforementioned methods, our proposed method is fully interactive, does not suffer from error accumulation, and can handle both image generation and manipulation on complex datasets.
Experiments
We conduct extensive experiments on three different
datasets: UT Zappos50k (Yu and Grauman 2014), MS-
COCO 2014 (Lin et al. 2014) and Multi-modal CelebA-HQ
(Xia et al. 2021). The experiments are implemented under
two settings: single-round image generation and interactive
(multi-round) image generation. All experiments are con-
ducted on 4 Nvidia Tesla V100 GPUs and implemented with PyTorch. Details of the datasets, the experimental setup and
hyper-parameters are provided in the Appendix.
Text-to-image Generation
To test the generation quality of our method for text-to-
image generation, we first evaluate it on MS-COCO 2014,
a dataset containing complex scenes and many kinds of objects that is commonly used in text-to-image generation tasks. Following previous work (Zhang et al. 2021a), we report Fréchet Inception Distance (FID) (Heusel et al. 2017) and Inception Score (IS) (Salimans et al. 2016), which evaluate the quality and the diversity of generated images respectively. 30,000 images generated from randomly sampled text are used to compute the metrics. The main results are provided in Table 2. Our proposed method outperforms the previous SOTA model XMC-GAN (Zhang et al. 2021a). Compared to XMC-GAN, which contains many attention modules, our proposed model has fewer parameters and a smaller model size, while achieving better IS and FID scores. Some generated examples are shown in Figure 4, and more results are provided in the Appendix.
In addition, (Xia et al. 2021) provides results of text-to-image generation on Multi-modal CelebA-HQ. We also compare our method with it in Table 3. Note that the results in (Xia et al. 2021) are based on a generator pre-trained on FFHQ (Karras, Laine, and Aila 2019), which is directly used to calculate the FID score on Multi-modal CelebA-HQ. Since FID measures the distance between generated images and real images from a dataset, it is fairer to fine-tune the generator on Multi-modal CelebA-HQ before evaluating FID. Thus we report both the original results from (Xia et al. 2021) and the results of fine-tuning the model before applying their methods.
Method                   AR (10)  SR (10)  SR (20)  SR (50)  CGAR (10)  CGAR (20)  CGAR (50)
Dataset: UT Zappos50K
SeqAttnGAN               7.090    0.426    0.506    0.596    0.798      0.847      0.879
TediGAN                  7.537    0.419    0.442    0.492    0.781      0.802      0.818
StyleCLIP-Global         6.954    0.424    0.462    0.476    0.757      0.773      0.790
TiGAN (w/o threshold)    6.056    0.628    0.724    0.818    0.896      0.922      0.951
TiGAN                    5.412    0.682    0.784    0.886    0.896      0.941      0.970
Dataset: Multi-modal CelebA-HQ
SeqAttnGAN               6.284    0.582    0.728    0.835    0.878      0.926      0.944
TediGAN                  5.769    0.597    0.670    0.706    0.854      0.876      0.897
StyleCLIP-Global         5.510    0.628    0.664    0.666    0.864      0.879      0.880
TiGAN (w/o threshold)    4.942    0.737    0.816    0.852    0.923      0.950      0.957
TiGAN                    4.933    0.761    0.830    0.886    0.928      0.947      0.967

Table 1: Interactive image generation results evaluated with a user simulator. Average rounds (AR) is the average number of interactions needed. Success rate (SR) is defined as the ratio of the number of successful cases to the total number of cases. Correctly generated attribute rate (CGAR) denotes the average percentage of correctly generated attributes across all cases. The integer in parentheses denotes the maximal number of interaction rounds.
Method IS FID
AttnGAN 23.61 33.10
Obj-GAN 24.09 36.52
DM-GAN 32.32 27.23
OP-GAN 27.88 24.70
XMC-GAN 30.45 9.33
TiGAN 31.95 8.90
Table 2: Text-to-image generation on MS-COCO 2014.
Method IS FID
w/o fine-tuning (Xia et al. 2021)
AttnGAN - 125.98
ControlGAN - 116.32
DFGAN - 137.60
DMGAN - 131.05
TediGAN - 106.57
with fine-tuning
TediGAN + fine-tune 2.29 27.39
TiGAN 2.85 11.35
Table 3: Text-to-image generation on Multi-modal CelebA-
HQ.
Following (Xia et al. 2021), all results are evaluated by generating 6,000 images using the descriptions from the test set of Multi-modal CelebA-HQ. Note that we do not report LPIPS (Zhang et al. 2018) as (Xia et al. 2021) does, because we found that LPIPS can be easily hacked in this experiment: one can easily obtain a good LPIPS score that does not correspond to a good model. More discussion can be found in the Appendix.
Interactive Image Generation
We then test the proposed method on UT Zappos 50k and
Multi-modal CelebA-HQ for the interactive image genera-
tion task. We choose these two datasets because each image
has associated attributes in these datasets. Some examples
are shown in the Appendix.
Some visualization results are illustrated in Figure 5. It
is clear that the proposed method can manipulate the im-
age correctly and maintain the manipulated attributes dur-
ing the whole interaction. We also evaluate the proposed
method quantitatively. To this end, we design a user simulator that gives text feedback based on the generated images. In each test case, the user simulator has some target attributes in mind, which are randomly sampled from the dataset and unknown to the model. The model starts by generating a random image and feeds it to the user simulator. The user simulator gives feedback by randomly pointing out one of the target attributes that is not satisfied by the generation. The feedback is then fed to the model for further image manipulation. The interaction process stops when the user simulator finds that the generated image matches all the target attributes. In the experiments, we use a neural-network-based classifier as the user simulator, which classifies the attributes of the generated images and outputs text feedback based on prompt engineering. The details of constructing the user simulator can be found in the Appendix.
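For concreteness, the simulated protocol can be summarized by the loop below; `model.generate`, `model.manipulate`, the attribute `classifier`, and the `to_prompt` template function are assumed interfaces for this sketch rather than the exact implementation used in our experiments.

```python
import random

def simulate_session(model, classifier, to_prompt, target_attrs, max_rounds=10):
    """Simulated interaction: each round, the simulator points out one
    unsatisfied target attribute until all attributes are satisfied or the
    round budget is exhausted. Returns (success, rounds used)."""
    image = model.generate()                       # random initial image
    for rounds in range(1, max_rounds + 1):
        predicted = classifier(image)              # attributes of the current image
        unsatisfied = [a for a in target_attrs if a not in predicted]
        if not unsatisfied:
            return True, rounds                    # all target attributes matched
        feedback = to_prompt(random.choice(unsatisfied))   # text feedback via prompts
        image = model.manipulate(image, feedback)
    return False, max_rounds                       # treated as a failure case
```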
The main results, averaged over 1,000 test cases, are reported in Table 1. Note that we set a maximal number of interaction rounds. Once the interaction exceeds this number, the user simulator directly treats the current test as a failure case and starts a new test case. The attributes used in this experiment are summarized in Table 6 in the Appendix. We compare our proposed method with the current SOTA image manipulation methods StyleCLIP-Global and TediGAN, and with the existing SOTA interactive method SeqAttnGAN (Cheng et al. 2020). Note that for a fair comparison, we re-implemented SeqAttnGAN using StyleGAN2 and the CLIP model, which leads to a much more powerful variant than (Cheng et al. 2020). We also provide results of our method without the threshold during image manipulation, i.e., instead of using the update in Eq. (7), we directly generate the new style vector $s'$ from the feedback $t'$ via $s' = G_S(z, t')$. From the results, we conclude that our proposed method leads to better interaction efficiency, as it needs fewer interaction rounds on average.
Human Evaluation
We also conducted human evaluation on Amazon Mechan-
ical Turk (MTurk) for text-to-image generation, text-guided
image manipulation, and interactive image generation.
Method              Text-to-Image Generation    Text-Guided Manipulation            Interactive Generation
                    Realistic   Match           Realistic   Match   Consistency     Realistic   Match
Dataset: UT Zappos50K
SeqAttnGAN          3.66        3.82            3.88        2.86    2.64            3.46        2.78
TediGAN             3.91        2.31            3.50        3.04    2.95            3.66        2.60
StyleCLIP-Global    -           -               3.28        2.30    2.93            3.84        2.28
TiGAN               4.12        4.11            4.10        3.64    2.98            4.18        2.98
Dataset: Multi-modal CelebA-HQ
SeqAttnGAN          3.10        3.59            3.74        3.58    3.26            2.92        2.34
TediGAN             3.19        2.49            4.50        2.92    2.62            3.86        2.62
StyleCLIP-Global    -           -               4.14        3.60    3.42            2.84        2.36
TiGAN               3.27        4.09            4.36        3.68    3.72            4.00        2.76

Table 4: Results of human evaluation on UT Zappos50K and Multi-modal CelebA-HQ. Note that text-to-image generation and text-guided manipulation are under the single-round setting, while interactive generation is under the multi-round setting. StyleCLIP is an image manipulation method and cannot be applied to the single-round text-to-image generation task.
(a) A woman face; (b) A face wearing earrings; (c) This is a face with blonde hair; (d) She has short hair; (e) A face with heavy makeup; (f) A young man face; (g) A face with red hair; (h) He has beard; (i) This is a face wearing glasses; (j) He is wearing a hat.
Figure 5: Interactive image generation with the proposed method. Each row is a user session, and each sub-figure is the result of one round of interaction. The caption of each sub-figure is the text input from the user.
In the evaluation, the workers were provided 100 images from each
method, which are generated or manipulated according to
randomly sampled texts. The workers were asked to judge
whether the generated or manipulated images match the text
and how realistic the images are. Furthermore, the workers were also asked to judge whether consistency is well maintained during manipulation, in the sense that no undesirable changes are observed. The three metrics are denoted as Match, Realistic and Consistency respectively. The workers are all from the US and were required to have performed at least 10,000 approved assignments with an approval rate of at least 98%. For each metric, the workers were asked to score the images on a scale of 1 to 5, where 5 denotes the most realistic/best matching/most consistent. The main results are provided in Table 4, and more details of the human evaluation can be
Table 4 and more details of the human evaluation can be
found in the Appendix.
Ablation Study
To better understand the proposed method, we conducted an
ablation study to determine how each component of the loss
function influences TiGAN. The main results are provided
Method               IS      FID
TiGAN w/o L_CLIP     22.87   19.62
TiGAN w/o L_CD       27.21   18.21
TiGAN                31.95   8.90
Table 5: Ablation study on MS-COCO 2014.
in Table 5. We observe that excluding either $\mathcal{L}_{CLIP}$ or $\mathcal{L}_{CD}$ leads to performance degradation. Meanwhile, $\mathcal{L}_{CLIP}$ seems to contribute more than $\mathcal{L}_{CD}$, as the model trained without $\mathcal{L}_{CLIP}$ has much poorer diversity according to IS.
Conclusions
In this paper, we proposed TiGAN for interactive image gen-
eration and manipulation from text. Using both human and
automated evaluation, we showed that TiGAN is able to gen-
erate more realistic images that better match the text in fewer
rounds than prior SOTA methods. Empirical results on sev-
eral datasets show that TiGAN improves both interaction ef-
ficiency and image quality while better avoids undesirable
image manipulations during interaction.
References
Chen, J.; Shen, Y.; Gao, J.; Liu, J.; and Liu, X. 2018.
Language-based image editing with recurrent attentive mod-
els. In CVPR.
Cheng, Y.; Gan, Z.; Li, Y.; Liu, J.; and Gao, J. 2020. Se-
quential attention GAN for interactive image editing. In
ACMMM.
Ding, M.; Yang, Z.; Hong, W.; Zheng, W.; Zhou, C.; Yin,
D.; Lin, J.; Zou, X.; Shao, Z.; Yang, H.; and Tang, J. 2021.
CogView: Mastering Text-to-Image Generation via Trans-
formers. arXiv:2105.13290.
El-Nouby, A.; Sharma, S.; Schulz, H.; Hjelm, D.; Asri, L. E.;
Kahou, S. E.; Bengio, Y.; and Taylor, G. W. 2019. Tell, draw,
and repeat: Generating and modifying images based on con-
tinual linguistic instruction. In ICCV.
Fu, T.-J.; Wang, X.; Grafton, S.; Eckstein, M.; and Wang,
W. Y. 2020. Iterative language-based image editing via self-
supervised counterfactual reasoning. In EMNLP.
Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.;
Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y.
2014. Generative adversarial nets. Advances in neural in-
formation processing systems, 27.
Guo, X.; Wu, H.; Cheng, Y.; Rennie, S.; Tesauro, G.; and
Feris, R. 2018. Dialog-based Interactive Image Retrieval. In
NIPS, 676–686.
Guo, X.; Wu, H.; Gao, Y.; Rennie, S.; and Feris, R. 2019.
The Fashion IQ Dataset: Retrieving Images by Combining
Side Information and Relative Natural Language Feedback.
Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; and
Hochreiter, S. 2017. Gans trained by a two time-scale update
rule converge to a local nash equilibrium. In NeurIPS.
Karras, T.; Laine, S.; and Aila, T. 2019. A style-based gen-
erator architecture for generative adversarial networks. In
Proceedings of the IEEE/CVF Conference on Computer Vi-
sion and Pattern Recognition, 4401–4410.
Karras, T.; Laine, S.; Aittala, M.; Hellsten, J.; Lehtinen, J.;
and Aila, T. 2020. Analyzing and improving the image qual-
ity of stylegan. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, 8110–8119.
Li, B.; Qi, X.; Lukasiewicz, T.; and Torr, P. H. 2020. ManiGAN: Text-guided image manipulation. In CVPR.
Lin, T.-H.; Bui, T.; Kim, D. S.; and Oh, J. 2018. A multi-
modal dialogue system for conversational image editing. In
NeurIPSW.
Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft COCO: Common objects in context. In European conference on computer vision, 740–755. Springer.
Liu, Y.; Li, Q.; Sun, Z.; and Tan, T. 2020. Style Intervention:
How to Achieve Spatial Disentanglement with Style-based
Generators? arXiv:2011.09699.
Nam, S.; Kim, Y.; and Kim, S. J. 2018. Text-adaptive gen-
erative adversarial networks: manipulating images with nat-
ural language. In NeurIPS.
Patashnik, O.; Wu, Z.; Shechtman, E.; Cohen-Or, D.; and
Lischinski, D. 2021. Styleclip: Text-driven manipulation of
stylegan imagery. arXiv preprint arXiv:2103.17249.
Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.;
Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.;
et al. 2021. Learning transferable visual models from natural
language supervision. arXiv preprint arXiv:2103.00020.
Ramesh, A.; Pavlov, M.; Goh, G.; Gray, S.; Voss, C.; Rad-
ford, A.; Chen, M.; and Sutskever, I. 2021. Zero-shot text-
to-image generation. arXiv preprint arXiv:2102.12092.
Salimans, T.; Goodfellow, I.; Zaremba, W.; Cheung, V.; Rad-
ford, A.; and Chen, X. 2016. Improved techniques for train-
ing gans. Advances in neural information processing sys-
tems, 29: 2234–2242.
Shi, J.; Xu, N.; Bui, T.; Dernoncourt, F.; Wen, Z.; and Xu, C.
2020. A benchmark and baseline for language-driven image
editing. In ACCV.
Shi, J.; Xu, N.; Xu, Y.; Bui, T.; Dernoncourt, F.; and Xu, C.
2021. Learning by planning: Language-guided global image
editing. In CVPR.
Tan, F.; Cascante-Bonilla, P.; Guo, X.; Wu, H.; Feng, S.; and
Ordonez, V. 2019. Drill-down: Interactive Retrieval of Com-
plex Scenes using Natural Language Queries. In NeurIPS.
van den Oord, A.; Vinyals, O.; and Kavukcuoglu, K. 2017.
Neural discrete representation learning. In Proceedings of
the 31st International Conference on Neural Information
Processing Systems, 6309–6318.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones,
L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. At-
tention is all you need. In NIPS.
Wu, Z.; Lischinski, D.; and Shechtman, E. 2021. Stylespace
analysis: Disentangled controls for stylegan image genera-
tion. In Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition, 12863–12872.
Xia, W.; Yang, Y.; Xue, J.-H.; and Wu, B. 2021. TediGAN:
Text-Guided Diverse Face Image Generation and Manipula-
tion. In CVPR.
Xu, T.; Zhang, P.; Huang, Q.; Zhang, H.; Gan, Z.; Huang,
X.; and He, X. 2018. Attngan: Fine-grained text to image
generation with attentional generative adversarial networks.
In Proceedings of the IEEE conference on computer vision
and pattern recognition, 1316–1324.
Yu, A.; and Grauman, K. 2014. Fine-Grained Visual Com-
parisons with Local Learning. In CVPR.
Zhang, H.; Koh, J. Y.; Baldridge, J.; Lee, H.; and Yang,
Y. 2021a. Cross-Modal Contrastive Learning for Text-to-
Image Generation. arXiv:2101.04702.
Zhang, R.; Isola, P.; Efros, A. A.; Shechtman, E.; and Wang,
O. 2018. The unreasonable effectiveness of deep features as
a perceptual metric. In CVPR.
Zhang, R.; Yu, T.; Shen, Y.; Jin, H.; and Chen, C. 2019.
Text-Based Interactive Recommendation via Constraint-
Augmented Reinforcement Learning. In Advances in Neural
Information Processing Systems, 15188–15198.
Zhang, T.; Tseng, H.-Y.; Jiang, L.; Yang, W.; Lee, H.; and
Essa, I. 2021b. Text as neural operator: Image manipulation
by text instruction. In ACMMM.
Zhu, M.; Pan, P.; Chen, W.; and Yang, Y. 2019. Dm-gan:
Dynamic memory generative adversarial networks for text-
to-image synthesis. In CVPR.