24. Personalizing Dialogue Agents : I have a dog, do you have pets too? (2018)
Table of Contents
- Abstract
- Introduction
- PERSONA-CHAT Dataset
- Models
- Baseline ranking models
- Ranking Profile Memory Network
- Key-Value Profile Memory Network
- Seq2Seq
- Generative Profile Memory Network
Abstract
Problems of Chit-chat models
- lack specificity
- do not display consistent personality
Present the task of making chit-chat more “engaging”, by conditioning on profile info
collect data & train models to…
- 1) condition on their given profile information
- 2) use information about the person they are talking to
1. Introduction
common issues :
- 1) lack of consistent personality
- \(\because\) trained over many dialogues, each with different speakers
- 2) lack of explicit long-term memory
- \(\because\) trained to produce an utterance given only RECENT dialogue history
- 3) tendency to produce non-specific answers, ex) “I don’t know”
\(\rightarrow\) due to there being no good publicly available dataset!
Goal : make more ENGAGING chit-chat dialogue agents!
- give them a configurable & persistent persona ( encoded by multiple sentences = “profile” )
- trained to both ask & answer questions about personal topics
- the resulting dialogue can be used to build a model of the persona of the speaking partner!
2. PERSONA-CHAT Dataset
- crowd-sourced dataset
- collect via Amazon Mechanical Turk
- each pair of speakers condition their dialogue on a given profile, which is provided
3. Models
2 classes of model, for next utterance prediction
- 1) ranking models
- produce the next utterance by considering any utterance in the training set as a possible candidate reply
- 3-1) ~ 3-3)
- 2) generative models
- generate novel sentences, conditioning on the dialogue history
- generate the response word-by-word
- 3-4) ~ 3-5)
3-1) Baseline ranking models
- 1) IR baseline & 2) Starspace
- similarity function \(sim (q, c')\)
- cosine similarity of the sum of word embeddings of query \(q\) and candidate \(c'\)
- in both 1) & 2), to incorporate the profile, concatenate it to the query vector
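A minimal sketch of this scoring scheme, assuming a toy vocabulary and a randomly initialized embedding table (the paper uses trained embeddings, e.g. learned by Starspace); `sent_vec` and `sim` are illustrative names, not from the paper:

```python
import torch
import torch.nn.functional as F

# toy vocabulary & embedding table (assumption: real models use trained embeddings)
vocab = {"i": 0, "have": 1, "a": 2, "dog": 3, "do": 4, "you": 5, "pets": 6, "too": 7}
embed = torch.nn.Embedding(len(vocab), 16)

def sent_vec(words):
    """Sum of word embeddings of a tokenized sentence."""
    idx = torch.tensor([vocab[w] for w in words])
    return embed(idx).sum(dim=0)

def sim(q_words, c_words):
    """sim(q, c') : cosine similarity of the summed embeddings."""
    return F.cosine_similarity(sent_vec(q_words), sent_vec(c_words), dim=0)

# incorporate the profile by concatenating it to the query
profile = ["i", "have", "a", "dog"]
query = ["do", "you", "have", "pets", "too"]
score = sim(profile + query, ["i", "have", "a", "dog", "too"])
```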
3-2) Ranking Profile Memory Network
Comparison
- baseline models : use profile information by combining it with the dialogue query
- this model : use a memory network with the dialogue history as input, which then performs attention over the profile to find relevant lines
\(q^{+}=q+\sum_{i} s_{i} p_{i}, \quad s_{i}=\operatorname{Softmax}\left(\operatorname{sim}\left(q, p_{i}\right)\right)\)
- then, rank candidates \(c^{\prime}\) using \(\operatorname{sim}\left(q^{+}, c^{\prime}\right)\)
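A sketch of this attention step, assuming the query \(q\), profile sentences \(p_i\), and candidates \(c'\) have already been embedded as vectors (e.g. summed word embeddings as above); the function name is hypothetical:

```python
import torch
import torch.nn.functional as F

def rank_with_profile(q, P, C):
    """q : (d,) query vector; P : (n_profile, d) profile entries;
    C : (n_cand, d) candidate replies. Returns candidates ranked by sim(q+, c')."""
    s = torch.softmax(F.cosine_similarity(P, q.unsqueeze(0)), dim=0)  # s_i
    q_plus = q + (s.unsqueeze(1) * P).sum(dim=0)  # q+ = q + sum_i s_i p_i
    scores = F.cosine_similarity(C, q_plus.unsqueeze(0))
    return scores.argsort(descending=True)
```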
3-3) Key-Value Profile Memory Network
improvement to the memory network : perform attention over keys and output the values
- keys : dialogue histories
- values : next dialogue utterances
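A sketch of the key-value read, assuming `K` holds encoded dialogue histories from the training set and `V` the encodings of their next utterances; names are illustrative. The returned vector can then be combined with the query before ranking, as in 3-2):

```python
import torch
import torch.nn.functional as F

def kv_read(q, K, V):
    """Attend over keys (dialogue histories), output a weighted
    sum of the corresponding values (next utterances)."""
    a = torch.softmax(F.cosine_similarity(K, q.unsqueeze(0)), dim=0)
    return (a.unsqueeze(1) * V).sum(dim=0)  # blended next-utterance vector
```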
3-4) Seq2Seq
- encode the input sentence by \(h_{t}^{e}=\operatorname{LSTM}_{enc}\left(x_{t} \mid h_{t-1}^{e}\right)\)
- use GloVe for word embeddings
- the final hidden state \(h_{t}^{e}\) is fed into the decoder as its initial state \(h_{0}^{d}\)
- for each time step \(t\), the decoder produces the probability of a word \(j\) occurring in that place via the softmax: \(p\left(y_{t, j}=1 \mid y_{t-1}, \ldots, y_{1}\right)=\frac{\exp \left(w_{j} h_{t}^{d}\right)}{\sum_{j^{\prime}=1}^{K} \exp \left(w_{j^{\prime}} h_{t}^{d}\right)}\)
- trained via negative log-likelihood
- can be extended to include persona information by prepending the profile to the input : \(x=\forall p \in P \,\|\, x\), where \(\|\) denotes concatenation
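A minimal PyTorch sketch of this encoder-decoder, with toy dimensions; embeddings are GloVe-initialized in the paper, and target shifting / decoding details are omitted here for brevity:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, vocab_size, d=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d)   # GloVe-initialized in the paper
        self.enc = nn.LSTM(d, d, batch_first=True)
        self.dec = nn.LSTM(d, d, batch_first=True)
        self.out = nn.Linear(d, vocab_size)      # rows play the role of the w_j above

    def forward(self, x, y):
        # persona variant : prepend the profile tokens to x before encoding
        _, state = self.enc(self.emb(x))         # final encoder state
        h_dec, _ = self.dec(self.emb(y), state)  # h_0^d = final h_t^e
        return self.out(h_dec)                   # logits -> softmax over words

model = Seq2Seq(vocab_size=1000)
x = torch.randint(0, 1000, (1, 7))               # input utterance (token ids)
y = torch.randint(0, 1000, (1, 5))               # teacher-forced reply
loss = nn.CrossEntropyLoss()(model(x, y).view(-1, 1000), y.view(-1))  # NLL
```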
3-5) Generative Profile Memory Network
- generative model that encodes “each of the profile entries as individual memory representations in a memory network”
- the decoder attends over the encoded profile entries \(F\) at each step :
\(a_{t}=\operatorname{softmax}\left(F W_{a} h_{t}^{d}\right), \quad c_{t}=a_{t}^{\top} F ; \quad \hat{x}_{t}=\tanh \left(W_{c}\left[c_{t-1}, x_{t}\right]\right)\)
- if the model has no profile information ( no memory ), it becomes the same as the Seq2Seq model in 3-4)
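A sketch of this decoder-side attention, assuming `F_mem` is the matrix of encoded profile entries (one row per profile sentence) and the weight matrices are random stand-ins for learned parameters:

```python
import torch

d = 64
F_mem = torch.randn(4, d)     # 4 encoded profile entries (rows of F)
W_a = torch.randn(d, d)       # attention weights (stand-in)
W_c = torch.randn(d, 2 * d)   # combination weights (stand-in)

def profile_attention(h_dec, c_prev, x_t):
    a_t = torch.softmax(F_mem @ W_a @ h_dec, dim=0)     # a_t = softmax(F W_a h_t^d)
    c_t = F_mem.T @ a_t                                 # c_t = a_t^T F
    x_hat = torch.tanh(W_c @ torch.cat([c_prev, x_t]))  # x^_t = tanh(W_c [c_{t-1}, x_t])
    return c_t, x_hat   # x^_t is fed to the decoder at step t
```

With an empty memory (no profile entries), the attention read contributes nothing and the model falls back to the plain Seq2Seq behaviour, matching the last bullet above.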