
Proc Int Conf Mach Learn.
Author manuscript; available in PMC 2010 Sep 22.

Published in final edited form as:

Proc Int Conf Mach Learn. 2009; 382(26): 25–32.

PMCID:

PMC2943854

NIHMSID:

NIHMS154934

Incorporating Domain Knowledge into Topic Modeling via Dirichlet Forest Priors

David Andrzejewski

*Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI 53706 USA

Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI 53706 USA

Xiaojin Zhu

*Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI 53706 USA

Mark Craven

*Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI 53706 USA

Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI 53706 USA

Abstract

Users of topic modeling methods often have knowledge about the composition of words that should have high or low probability in various topics. We incorporate such domain knowledge using a novel Dirichlet Forest prior in a Latent Dirichlet Allocation framework. The prior is a mixture of Dirichlet tree distributions with special structures. We present its construction, and inference via collapsed Gibbs sampling. Experiments on synthetic and real datasets demonstrate our model's ability to follow and generalize beyond user-specified domain knowledge.

1. Introduction

Topic modeling, using approaches such as Latent Dirichlet Allocation (LDA) (Blei et al., 2003), has enjoyed popularity as a way to model hidden topics in data. However, in many applications, a user may have additional knowledge about the composition of words that should have high probability in various topics. For example, in a biological application, one may prefer that the words "termination", "disassembly" and "release" appear with high probability in the same topic, because they all describe the same phase of biological processes. Furthermore, a biologist could automatically extract these preferences from an existing biomedical ontology, such as the Gene Ontology (GO) (The Gene Ontology Consortium, 2000). As another example, an analyst may run topic modeling on a corpus of people's wishes, inspect the resulting topics, and find that "into, college" and "cure, cancer" all appear with high probability in the same topic. The analyst may want to interactively express the preference that the two sets of words should not appear together, re-run topic modeling, and incorporate additional preferences based on the new results. In both cases, we would like these preferences to guide the recovery of latent topics. Standard LDA lacks a mechanism for incorporating such domain knowledge.

In this paper, we propose a principled approach to the incorporation of such domain knowledge into LDA. We show that many types of knowledge can be expressed with two primitives on word pairs. Borrowing names from the constrained clustering literature (Basu et al., 2008), we call the two primitives Must-Links and Cannot-Links, although there are important differences. We then encode the set of Must-Links and Cannot-Links associated with the domain knowledge using a Dirichlet Forest prior, replacing the Dirichlet prior over the topic-word multinomial p(word|topic). The Dirichlet Forest prior is a mixture of Dirichlet tree distributions with very specific tree structures. Our approach has several advantages: (i) A Dirichlet Forest can encode Must-Links and Cannot-Links, something impossible with Dirichlet distributions. (ii) The user can control the strength of the domain knowledge by setting a parameter η, allowing domain knowledge to be overridden if the data strongly suggest otherwise. (iii) The Dirichlet Forest lends itself to efficient inference via collapsed Gibbs sampling, a property inherited from the conjugacy of Dirichlet trees. We present experiments on several synthetic datasets and two real domains, demonstrating that the resulting topics not only successfully incorporate the specified domain knowledge, but also generalize beyond it by including/excluding other related words not explicitly mentioned in the Must-Links and Cannot-Links.

2. Related Work

We review LDA using the notation of Griffiths and Steyvers (2004). Let there be T topics. Let w = w_1 … w_n represent a corpus of D documents, with a total of n words. We use d_i to denote the document of word w_i, and z_i the hidden topic from which w_i is generated. Let ϕ_j^(w) = p(w|z = j), and θ_j^(d) = p(z = j) for document d. The LDA generative model is then:

θ^(d) ~ Dirichlet(α)    (1)

z_i | θ^(d_i) ~ Multinomial(θ^(d_i))    (2)

ϕ ~ Dirichlet(β)    (3)

w_i | z_i, ϕ ~ Multinomial(ϕ_{z_i})    (4)

where α and β are hyperparameters for the document-topic and topic-word Dirichlet distributions, respectively. For simplicity we will assume symmetric α and β, but asymmetric hyperparameters are also possible.
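To make this generative process concrete, here is a minimal NumPy sketch of (1)–(4); the corpus sizes, vocabulary size, and hyperparameter values are illustrative assumptions, not settings from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

T, V, D = 3, 10, 5          # topics, vocabulary size, documents (illustrative)
alpha, beta = 0.5, 0.1      # symmetric hyperparameters
doc_lengths = [20] * D

phi = rng.dirichlet([beta] * V, size=T)                 # (3): topic-word multinomials
corpus = []
for d in range(D):
    theta = rng.dirichlet([alpha] * T)                  # (1): document-topic proportions
    z = rng.choice(T, size=doc_lengths[d], p=theta)     # (2): topic assignment per token
    w = np.array([rng.choice(V, p=phi[t]) for t in z])  # (4): word drawn from its topic
    corpus.append((z, w))
```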

Previous work has modeled correlations in the LDA document-topic mixtures using the logistic Normal distribution (Blei & Lafferty, 2006), DAG (Pachinko) structures (Li & McCallum, 2006), or the Dirichlet Tree distribution (Tam & Schultz, 2007). In addition, the concept-topic model (Chemudugunta et al., 2008) employs domain knowledge through special "concept" topics, in which only a particular set of words can be present. Our work complements the previous work by encoding complex domain knowledge on words (especially arbitrary Cannot-Links) into a flexible and computationally efficient prior.

3. Topic Modeling with Dirichlet Forest

Our proposed model differs from LDA in the way ϕ is generated. Instead of (3), we have

q ~ DirichletForest(β, η)

ϕ ~ DirichletTree(q)

where q specifies a Dirichlet tree distribution, β plays a role corresponding to the topic-word hyperparameter in standard LDA, and η ≥ 1 is the "strength parameter" of the domain knowledge. Before discussing DirichletForest(β, η) and DirichletTree(q), we first explain how knowledge can be expressed using Must-Link and Cannot-Link primitives.

3.1. Must-Links and Cannot-Links

Must-Links and Cannot-Links were originally proposed for constrained clustering to encourage two instances to fall into the same cluster or into separate clusters, respectively. We borrow the notion for topic modeling. Informally, the Must-Link primitive prefers that two words tend to be generated by the same topic, while the Cannot-Link primitive prefers that two words tend to be generated by separate topics. However, since any topic ϕ is a multinomial over words, any two words (in general) always have some probability of being generated by the topic. We therefore propose the following definitions:

Must-Link (u, v)

Two words u, v have similar probability within any topic, i.e., ϕ_j^(u) ≈ ϕ_j^(v) for j = 1 … T. It is important to note that the probabilities can be both large or both small, as long as they are similar. For example, for the earlier biology example we could say Must-Link (termination, disassembly).

Cannot-Link (u, v)

Two words u, v should not both have large probability within any topic. It is permissible for one to have a large probability and the other small, or both small. For example, one primitive for the wish example can be Cannot-Link (college, cure).

Many types of domain knowledge can be decomposed into a set of Must-Links and Cannot-Links. We demonstrate three types in our experiments: we can Split two or more sets of words from a single topic into different topics by placing Must-Links within the sets and Cannot-Links between them. We can Merge two or more sets of words from different topics into one topic by placing Must-Links among the sets. Given a common set of words which appear in multiple topics (such as stopwords in English, which tend to appear in all LDA topics), we can Isolate them by placing Must-Links within the common set, and then placing Cannot-Links between the common set and the other high-probability words from all topics. It is important to note that our Must-Links and Cannot-Links are preferences instead of hard constraints.
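As a minimal sketch (not the authors' code) of how these three operations could be compiled into the two primitives, with function names and data structures of our own choosing:

```python
from itertools import combinations, product

def split_op(word_sets):
    """Split: Must-Links within each set, Cannot-Links between sets."""
    must = [p for s in word_sets for p in combinations(s, 2)]
    cannot = [(u, v) for s1, s2 in combinations(word_sets, 2)
              for u, v in product(s1, s2)]
    return must, cannot

def merge_op(word_sets):
    """Merge: Must-Links among all words of all sets."""
    words = [w for s in word_sets for w in s]
    return list(combinations(words, 2)), []

def isolate_op(common_set, other_top_words):
    """Isolate: Must-Links within the common set, Cannot-Links to other top words."""
    must = list(combinations(common_set, 2))
    cannot = [(u, v) for u in common_set for v in other_top_words if v not in common_set]
    return must, cannot

# Example: Split({go, school, into, college}, {cancer, free, cure, well})
must, cannot = split_op([["go", "school", "into", "college"],
                         ["cancer", "free", "cure", "well"]])
```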

3.2. Encoding Must-Links

It is well-known that the Dirichlet distribution is limited in that all words share a common variance parameter, and are mutually independent except for the normalization constraint (Minka, 1999). However, for Must-Link (u, v) it is crucial to control the two words u, v differently than other words.

The Dirichlet tree distribution (Dennis III, 1991) is a generalization of the Dirichlet distribution that allows such control. It is a tree with the words as leaf nodes; see Figure 1(a) for an example. Let γ^(k) be the Dirichlet tree edge weight leading into node k. Let C(k) be the immediate children of node k in the tree, L the leaves of the tree, I the internal nodes, and L(k) the leaves in the subtree under k. To generate a sample ϕ ~ DirichletTree(γ), one first draws a multinomial at each internal node s ∈ I from Dirichlet(γ^(C(s))), i.e., using the weights from s to its children as the Dirichlet parameters. One can think of this as re-distributing the probability mass reaching s by this multinomial (initially, the mass is 1 at the root). The probability ϕ^(k) of a word k ∈ L is then simply the product of the multinomial parameters on the edges from k to the root, as shown in Figure 1(b). It can be shown (Dennis III, 1991) that this procedure gives



DirichletTree(γ):  p(ϕ|γ) = ( ∏_{k∈L} (ϕ^(k))^(γ^(k) − 1) ) ∏_{s∈I} [ Γ(∑_{k∈C(s)} γ^(k)) / ∏_{k∈C(s)} Γ(γ^(k)) ] ( ∑_{k∈L(s)} ϕ^(k) )^Δ(s)

where Γ(·) is the standard gamma function, and ∏_L is shorthand for ∏_{k∈L}. The function Δ(s) ≡ γ^(s) − ∑_{k∈C(s)} γ^(k) is the difference between the in-degree and out-degree of internal node s. When this difference Δ(s) = 0 for all internal nodes s ∈ I, the Dirichlet tree reduces to a Dirichlet distribution.
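The sampling procedure just described (draw a Dirichlet at each internal node, then multiply the multinomial parameters along the root-to-leaf path) can be sketched as follows; the tree representation (a children dict and an edge-weight dict) is an illustrative assumption rather than the authors' data structure.

```python
import numpy as np

def sample_dirichlet_tree(children, gamma, root, rng):
    """Draw phi ~ DirichletTree(gamma).

    children: dict node -> list of child nodes (leaves have no entry)
    gamma:    dict node -> edge weight gamma^(k) on the edge leading into node k
    Returns a dict leaf -> probability.
    """
    phi = {}

    def descend(node, mass):
        kids = children.get(node, [])
        if not kids:                         # leaf: it receives the accumulated mass
            phi[node] = mass
            return
        multinomial = rng.dirichlet([gamma[k] for k in kids])  # redistribute mass
        for k, p in zip(kids, multinomial):
            descend(k, mass * p)

    descend(root, 1.0)
    return phi

# Figure 1(a)-style tree: Must-Link(A,B) with beta=1, eta=50 on vocabulary {A,B,C}
rng = np.random.default_rng(0)
children = {"root": ["AB", "C"], "AB": ["A", "B"]}
gamma = {"AB": 2 * 1.0, "C": 1.0, "A": 50.0, "B": 50.0}
print(sample_dirichlet_tree(children, gamma, "root", rng))
```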




Figure 1. Encoding Must-Links and Cannot-Links with a Dirichlet Forest. (a) A Dirichlet tree encoding Must-Link (A,B) with β = 1, η = 50 on vocabulary {A,B,C}. (b) A sample ϕ from this Dirichlet tree. (c) A large set of samples from the Dirichlet tree, plotted on the 3-simplex. Note p(A) ≈ p(B), yet they remain flexible in actual value, which is desirable for a Must-Link. (d) In contrast, samples from a standard Dirichlet with comparable parameters (50,50,1) force p(A) ≈ p(B) ≈ 0.5, and cannot encode a Must-Link. (e) The Cannot-Link-graph for Cannot-Link (A,B) and Cannot-Link (B,C). (f) The complementary graph, with two maximal cliques {A,C} and {B}. (g) The Dirichlet subtree for clique {A,C}. (h) The Dirichlet subtree for clique {B}. (i) Samples from the mixture model on (g,h), encoding both Cannot-Links, again with β = 1, η = 50.

Like the Dirichlet, the Dirichlet tree is conjugate to the multinomial. It is possible to integrate out ϕ to get a distribution over word counts directly, similar to the multivariate Pólya distribution:


p(w|γ) = ∏_{s∈I} [ Γ(∑_{k∈C(s)} γ^(k)) / Γ(∑_{k∈C(s)} (γ^(k) + n^(k))) ] ∏_{k∈C(s)} [ Γ(γ^(k) + n^(k)) / Γ(γ^(k)) ]    (5)

Here n^(k) is the number of word tokens in w that appear in L(k).

We encode Must-Links using a Dirichlet tree. Note that our definition of Must-Link is transitive: Must-Link (u, v) and Must-Link (v, w) imply Must-Link (u, w). We thus first compute the transitive closures of the expressed Must-Links. Our Dirichlet tree for Must-Links has a very simple structure: each transitive closure is a subtree, with one internal node and the words in the closure as its leaves. The weights from the internal node to its leaves are ηβ. The root connects to these internal nodes s with weight |L(s)|β, where | · | denotes the set size. In addition, the root directly connects to other words not in any closure, with weight β. For example, the transitive closure for a Must-Link (A,B) on vocabulary {A,B,C} is simply {A,B}, corresponding to the Dirichlet tree in Figure 1(a).
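A minimal sketch of this construction, using networkx to compute the Must-Link transitive closures; the representation of the tree follows the sketch above and is our own choice.

```python
import networkx as nx

def mustlink_tree(vocab, must_links, beta, eta):
    """Build the Must-Link Dirichlet tree of Section 3.2: children and edge weights."""
    g = nx.Graph()
    g.add_nodes_from(vocab)
    g.add_edges_from(must_links)
    closures = [c for c in nx.connected_components(g) if len(c) > 1]

    children, gamma = {"root": []}, {}
    covered = set()
    for i, closure in enumerate(closures):
        node = f"ML{i}"
        children["root"].append(node)
        children[node] = sorted(closure)
        gamma[node] = len(closure) * beta          # root -> closure node: |L(s)| * beta
        for w in closure:
            gamma[w] = eta * beta                  # closure node -> word: eta * beta
        covered |= closure
    for w in vocab:
        if w not in covered:
            children["root"].append(w)
            gamma[w] = beta                        # root -> free word: beta
    return children, gamma

# Must-Link(A,B) on vocabulary {A,B,C} with beta=1, eta=50 (Figure 1(a))
children, gamma = mustlink_tree(["A", "B", "C"], [("A", "B")], beta=1.0, eta=50.0)
```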

To understand this encoding of Must-Links, consider first the case when the domain knowledge strength parameter is at its weakest, η = 1. Then in-degree equals out-degree for any internal node s (both are |L(s)|β), and the tree reduces to a Dirichlet distribution with symmetric prior β: the Must-Links are turned off in this case. As we increase η, the re-distribution of probability mass at s (governed by a Dirichlet under s) has increasing concentration |L(s)|ηβ but the same uniform base measure. This tends to redistribute the mass evenly within the transitive closure represented by s. Therefore, the Must-Links are turned on when η > 1. Furthermore, the mass reaching s is independent of η, and can still have a large variance. This properly encodes the fact that we want Must-Linked words to have similar, but not always large, probabilities. Otherwise, Must-Linked words would be forced to appear with large probability in all topics, which is clearly undesirable. This is impossible to represent with Dirichlet distributions. For example, the blue dots in Figure 1(c) are ϕ samples from the Dirichlet tree in Figure 1(a), plotted on the probability simplex of dimension three. While it is always true that p(A) ≈ p(B), their total probability mass can be anywhere from 0 to 1. The most similar Dirichlet distribution is perhaps the one with parameters (50,50,1), which generates samples close to (0.5, 0.5, 0) (Figure 1(d)).

3.3. Encoding Cannot-Links

Cannot-Links are considerably harder to handle. We first transform them into an alternative form that is amenable to Dirichlet trees. Note that Cannot-Links are not transitive: Cannot-Link (A,B) and Cannot-Link (B,C) do not entail Cannot-Link (A,C). We define a Cannot-Link-graph where the nodes are words¹, and the edges correspond to the Cannot-Links. Then the connected components of this graph are independent of each other when encoding Cannot-Links. We will use this property to factor a Dirichlet-tree selection probability later. For example, the two Cannot-Links (A,B) and (B,C) form the graph in Figure 1(e) with a single connected component {A,B,C}.

Consider the subgraph on connected component r. We define its complement graph by flipping the edges (on to off, off to on), as shown in Figure 1(f). Let there be Q^(r) maximal cliques M_{r1} … M_{rQ^(r)} in this complement graph. In the following, we simply call them "cliques", but it is important to remember that they are maximal cliques of the complement graph, not of the original Cannot-Link-graph. In our example, Q^(r) = 2 and M_{r1} = {A, C}, M_{r2} = {B}. These cliques have the following interpretation: each clique (e.g., M_{r1} = {A, C}) is a maximal subset of words in the connected component that can "occur together". That is, these words are allowed to simultaneously have large probabilities in a given topic without violating any Cannot-Link preferences. By the maximality of these cliques, allowing any word outside the clique (e.g., "B") to also have a large probability would violate at least one Cannot-Link (in this example two).
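A short sketch of this step, with networkx as an illustrative choice of library: take each connected component of the Cannot-Link-graph, complement it, and enumerate the maximal cliques.

```python
import networkx as nx

def cannotlink_cliques(nodes, cannot_links):
    """For each connected component r of the Cannot-Link-graph, return the
    maximal cliques M_r1 ... M_rQ(r) of its complement graph."""
    g = nx.Graph()
    g.add_nodes_from(nodes)
    g.add_edges_from(cannot_links)
    cliques_per_component = []
    for comp in nx.connected_components(g):
        complement = nx.complement(g.subgraph(comp))
        cliques_per_component.append([sorted(c) for c in nx.find_cliques(complement)])
    return cliques_per_component

# Cannot-Link(A,B) and Cannot-Link(B,C): one component {A,B,C},
# whose complement has maximal cliques {A,C} and {B} (Figure 1(e,f))
print(cannotlink_cliques(["A", "B", "C"], [("A", "B"), ("B", "C")]))
```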

We discuss the encoding for this single connected component r now, deferring discussion of the complete encoding to Section 3.4. We create a mixture model of Q^(r) Dirichlet subtrees, one for each clique. Each topic selects exactly one subtree according to probability

p(q) ∝ |M_{rq}|,  q = 1 … Q^(r).    (6)

Conceptually, the selected subtree indexed by q tends to redistribute nearly all probability mass to the words within M_{rq}. Since there is no mass left for the other cliques, it is impossible for a word outside clique M_{rq} to have a large probability. Therefore, no Cannot-Link will be violated. In reality, the subtrees are soft rather than hard, because Cannot-Links are only preferences. The Dirichlet subtree for M_{rq} is structured as follows. The subtree's root connects to an internal node s with weight η|M_{rq}|β. The node s connects to the words in M_{rq}, each with weight β. The subtree's root also directly connects to words not in M_{rq} (but in the connected component r), each with weight β. This will send most probability mass down to s, and then flexibly redistribute it among the words in M_{rq}. For example, Figures 1(g,h) show the Dirichlet subtrees for M_{r1} = {A, C} and M_{r2} = {B}, respectively. Samples from this mixture model are shown in Figure 1(i), representing multinomials in which no Cannot-Link is violated. Such behavior is not achievable by a Dirichlet distribution, or a single Dirichlet tree².

Finally, we mention that although in the worst case the number of maximal cliques Q^(r) in a connected component of size |r| can grow exponentially as O(3^(|r|/3)) (Griggs et al., 1988), in our experiments Q^(r) is no larger than 3, due in part to Must-Linked words "collapsing" to single nodes in the Cannot-Link-graph.

3.4. The Dirichlet Forest Prior

In general, our domain knowledge is expressed by a set of Must-Links and Cannot-Links. We first compute the transitive closure of the Must-Links. We then form a Cannot-Link-graph, where a node is either a Must-Link closure or a word not present in any Must-Link. Note that the domain knowledge must be "consistent" in that no pair of words is simultaneously Cannot-Linked and Must-Linked (either explicitly or implicitly through the Must-Link transitive closure). Let R be the number of connected components in the Cannot-Link-graph. Our Dirichlet Forest consists of ∏_{r=1}^{R} Q^(r) Dirichlet trees, represented by the template in Figure 2. Each Dirichlet tree has R branches below the root, one for each connected component. The trees differ in which subtrees they include under these branches. For the r-th branch, there are Q^(r) possible Dirichlet subtrees, corresponding to cliques M_{r1} … M_{rQ^(r)}. Therefore, a tree in the forest is uniquely identified by an index vector q = (q^(1) … q^(R)), where q^(r) ∈ {1 … Q^(r)}.




Figure 2. Template of Dirichlet trees in the Dirichlet Forest.

To draw a Dirichlet tree q from the prior DirichletForest(β, η), we select the subtrees independently, because the R connected components are independent with respect to Cannot-Links:

p(q) = ∏_{r=1}^{R} p(q^(r)).

Each q^(r) is sampled according to (6), and corresponds to choosing a solid box for the r-th branch in Figure 2. The structure of the subtree within the solid box has been defined in Section 3.3. The black nodes may be a single word, or a Must-Link transitive closure having the subtree structure shown in the dotted box. The edge weight leading into most nodes k is γ^(k) = |L(k)|β, where L(k) is the set of leaves under k. However, for edges coming out of a Must-Link internal node or going into a Cannot-Link internal node, the weights are multiplied by the strength parameter η. These edges are marked by "η*" in Figure 2.
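A minimal sketch of drawing the index vector q from this prior, with p(q^(r)) ∝ |M_{rq}| as in (6); the input format (one list of cliques per connected component, as produced by the sketch in Section 3.3) is our own convention.

```python
import numpy as np

def sample_forest_index(cliques_per_component, rng):
    """Draw q = (q^(1), ..., q^(R)) with p(q^(r) = q) proportional to |M_rq| (eq. 6)."""
    q = []
    for cliques in cliques_per_component:        # one entry per connected component r
        sizes = np.array([len(m) for m in cliques], dtype=float)
        q.append(rng.choice(len(cliques), p=sizes / sizes.sum()))
    return q

rng = np.random.default_rng(0)
# One component with cliques {A,C} and {B}: selection probabilities (2/3, 1/3)
print(sample_forest_index([[["A", "C"], ["B"]]], rng))
```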


We now define the complete Dirichlet Forest model, integrating out ("collapsing") θ and ϕ. Let n_j^(d) be the number of word tokens in document d that are assigned to topic j. z is generated the same way as in LDA:

p(z|α) = ( Γ(Tα) / Γ(α)^T )^D ∏_{d=1}^{D} ( ∏_{j=1}^{T} Γ(n_j^(d) + α) ) / Γ(n_·^(d) + Tα).



There is one Dirichlet tree q_j per topic j = 1 … T, sampled from the Dirichlet Forest prior p(q_j) = ∏_{r=1}^{R} p(q_j^(r)). Each Dirichlet tree q_j implicitly defines its tree edge weights γ_j(·) using β, η, and its tree structure L_j, I_j, C_j(·). Let n_j^(k) be the number of word tokens in the corpus assigned to topic j that appear under node k in the Dirichlet tree q_j. The probability of generating the corpus w, given the trees q_{1:T} ≡ q_1 … q_T and the topic assignments z, can be derived using (5):


p(w | q_{1:T}, z, β, η) = ∏_{j=1}^{T} ∏_{s∈I_j} [ Γ(∑_{k∈C_j(s)} γ_j^(k)) / Γ(∑_{k∈C_j(s)} (γ_j^(k) + n_j^(k))) ] ∏_{k∈C_j(s)} [ Γ(γ_j^(k) + n_j^(k)) / Γ(γ_j^(k)) ].





Finally, the complete generative model is


p(w, z, q_{1:T} | α, β, η) = p(w | q_{1:T}, z, β, η) p(z|α) ∏_{j=1}^{T} p(q_j).



4. Inference for Dirichlet Forest

Because a Dirichlet Forest is a mixture of Dirichlet trees, which are conjugate to multinomials, we can efficiently perform inference by Markov chain Monte Carlo (MCMC). Specifically, we use collapsed Gibbs sampling similar to Griffiths and Steyvers (2004). However, in our case the MCMC state is defined by both the topic labels z and the tree indices q_{1:T}. An MCMC iteration in our case consists of a sweep through both z and q_{1:T}. We present the conditional probabilities for collapsed Gibbs sampling below.

(Sampling z_i): Let n^(d)_{−i,j} be the number of word tokens in document d assigned to topic j, excluding the word at position i. Similarly, let n^(k)_{−i,j} be the number of word tokens in the corpus that are under node k in topic j's Dirichlet tree, excluding the word at position i. For candidate topic labels v = 1 … T, we have

p(z_i = v | z_{−i}, q_{1:T}, w) ∝ (n^(d)_{−i,v} + α) ∏_{s∈I_v(↑i)} [ ( γ_v^(C_v(s↓i)) + n^(C_v(s↓i))_{−i,v} ) / ∑_{k∈C_v(s)} ( γ_v^(k) + n^(k)_{−i,v} ) ],

where I_v(↑i) denotes the subset of internal nodes in topic v's Dirichlet tree that are ancestors of leaf w_i, and C_v(s↓i) is the unique node that is s's immediate child and an ancestor of w_i (including w_i itself).
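A sketch of this conditional for a single token, assuming each topic's tree is stored as, per word, the list of (child-toward-the-leaf, sibling set) pairs along the leaf-to-root path, plus per-node edge weights and counts; this data layout is an illustrative assumption, not the authors' implementation.

```python
import numpy as np

def sample_z_i(d, w_i, n_doc, paths, gamma, n_tree, alpha, rng):
    """Collapsed Gibbs update for one token (all counts must already exclude token i).

    n_doc[d][j]              : tokens of document d assigned to topic j
    paths[j][w_i]            : list of (c, siblings) pairs, one per ancestor internal
                               node s of leaf w_i, where c = C_j(s, down toward w_i)
                               and siblings = C_j(s)
    gamma[j][k], n_tree[j][k]: edge weight and topic-j token count for tree node k
    """
    T = len(paths)
    weights = np.empty(T)
    for v in range(T):
        p = n_doc[d][v] + alpha                       # document-topic factor
        for c, siblings in paths[v][w_i]:             # tree factor along the leaf's path
            p *= (gamma[v][c] + n_tree[v][c]) / sum(
                gamma[v][k] + n_tree[v][k] for k in siblings)
        weights[v] = p
    weights /= weights.sum()
    return rng.choice(T, p=weights)                   # draw the new z_i
```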

(Sampling q_j^(r)): Since the connected components are independent, sampling the tree q_j factors into sampling the clique index for each connected component, q_j^(r). For candidate cliques q′ = 1 … Q^(r), we have

p(q_j^(r) = q′ | z, q_{−j}, q_j^(−r), w) ∝ |M_{rq′}| × ∏_{s∈I_{j,r=q′}} [ Γ(∑_{k∈C_j(s)} γ_j^(k)) / Γ(∑_{k∈C_j(s)} (γ_j^(k) + n_j^(k))) ] ∏_{k∈C_j(s)} [ Γ(γ_j^(k) + n_j^(k)) / Γ(γ_j^(k)) ],

where I_{j,r=q′} denotes the internal nodes below the r-th branch of tree q_j when clique M_{rq′} is selected.
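A corresponding sketch for resampling one clique index, computed in log space with log-gamma for numerical stability; the per-candidate branch representation (lists of child-edge weights and counts for each internal node in I_{j,r=q′}) is again an illustrative assumption.

```python
import numpy as np
from scipy.special import gammaln

def sample_q_jr(candidate_branches, clique_sizes, rng):
    """Resample q_j^(r). candidate_branches[q'] lists the internal nodes s in
    I_{j, r=q'}, each given as (gamma_children, n_children) arrays."""
    log_w = np.empty(len(candidate_branches))
    for qp, branch in enumerate(candidate_branches):
        lw = np.log(clique_sizes[qp])                 # prior factor |M_rq'|
        for gamma_c, n_c in branch:
            gamma_c = np.asarray(gamma_c, dtype=float)
            n_c = np.asarray(n_c, dtype=float)
            lw += gammaln(gamma_c.sum()) - gammaln((gamma_c + n_c).sum())
            lw += (gammaln(gamma_c + n_c) - gammaln(gamma_c)).sum()
        log_w[qp] = lw
    probs = np.exp(log_w - log_w.max())
    probs /= probs.sum()
    return rng.choice(len(candidate_branches), p=probs)
```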

(Estimating ϕ and θ): After running MCMC for sufficient iterations, we follow standard practice (e.g. (Griffiths & Steyvers, 2004)) and use the last sample (z, q_{1:T}) to estimate ϕ and θ. Because a Dirichlet tree is a conjugate distribution, its posterior is a Dirichlet tree with the same structure and updated edge weights. The posterior for the Dirichlet tree of the j-th topic is γ_j^post(k) = γ_j^(k) + n_j^(k), where the counts n_j^(k) are collected from z, q_{1:T}, w. We estimate ϕ_j by the first moment under this posterior (Minka, 1999):




ϕ̂_j^(w) = ∏_{s∈I_j(↑w)} [ γ_j^post(C_j(s↓w)) / ∑_{k∈C_j(s)} γ_j^post(k) ].    (7)

The parameter θ is estimated the same way as in standard LDA:

θ̂_j^(d) = (n_j^(d) + α) / (n_·^(d) + Tα).
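A sketch of (7) and of the θ estimate, reusing the per-word path representation from the sampling sketches above (again an illustrative data layout):

```python
import numpy as np

def estimate_phi(paths_j, gamma_post_j, vocab):
    """phi_hat_j^(w): product over ancestors s of w of
    gamma_post_j(C_j(s, down to w)) / sum over k in C_j(s) of gamma_post_j(k)  (eq. 7)."""
    phi = {}
    for w in vocab:
        p = 1.0
        for c, siblings in paths_j[w]:
            p *= gamma_post_j[c] / sum(gamma_post_j[k] for k in siblings)
        phi[w] = p
    return phi

def estimate_theta(n_doc_d, alpha):
    """theta_hat_j^(d) = (n_j^(d) + alpha) / (n_.^(d) + T*alpha)."""
    n = np.asarray(n_doc_d, dtype=float)
    return (n + alpha) / (n.sum() + alpha * len(n))
```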

5. Experiments

Synthetic Corpora

We present results on synthetic datasets to show how the Dirichlet Forest (DF) incorporates different types of knowledge. Recall that DF with η = 1 is equivalent to standard LDA (verified with the code of (Griffiths & Steyvers, 2004)).

Previous studies often take the last MCMC sample (z and q_{1:T}), and discuss the topics ϕ_{1:T} derived from that sample. Because of the stochastic nature of MCMC, we argue that more insight can be gained if multiple independent MCMC samples are considered. For each dataset, and each DF with a different η, we run a long MCMC chain with 200,000 iterations of burn-in, and take out a sample every 10,000 iterations afterward, for a total of 200 samples. We have some indication that our chain is well-mixed, as we observe all expected modes, and samples with "label switching" (i.e., equivalent up to label permutation) occur with nearly equal frequency. For each sample, we derive its topics ϕ_{1:T} with (7) and then greedily align the ϕ's from different samples, permuting the T topic labels to remove the label-switching effect. Within a dataset, we perform PCA on the baseline (η = 1) ϕ and project all samples into the resulting space to obtain a common visualization (each row in Figure 3; points are dithered to show overlap).




Figure 3. PCA projections of permutation-aligned ϕ samples for the four synthetic data experiments.
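A sketch of the alignment-and-projection step. Here we substitute the Hungarian algorithm (scipy's linear_sum_assignment) for the paper's greedy alignment and use scikit-learn's PCA; both are our own choices for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.decomposition import PCA

def align_to_reference(phi_ref, phi_sample):
    """Permute the topics of phi_sample (T x V) to best match phi_ref (T x V)."""
    cost = np.linalg.norm(phi_ref[:, None, :] - phi_sample[None, :, :], axis=2)
    _, perm = linear_sum_assignment(cost)   # optimal matching (replaces greedy alignment)
    return phi_sample[perm]

def project_samples(phi_baseline_samples, phi_all_samples):
    """Fit PCA on the flattened baseline (eta=1) samples, project all samples into it."""
    pca = PCA(n_components=2).fit([p.ravel() for p in phi_baseline_samples])
    return pca.transform([p.ravel() for p in phi_all_samples])
```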

Must-Link (B,C)

The corpus consists of six documents over a vocabulary of five "words." The documents are: ABAB, CDCD, and EEEE, each represented twice. We let T = 2, α = 0.5, β = 0.01. LDA produces three kinds of ϕ_{1:T}: roughly a third of the time the topics are around [A₂B₂ | C₄D₄E₂], which is shorthand for ϕ₁ = (1/2, 1/2, ·, ·, ·), ϕ₂ = (·, ·, 1/4, 1/4, 1/2) on the vocabulary ABCDE. Another third are around [A₄B₄E₂ | C₂D₂], and the final third around [A₄B₄C₄D₄ | E]. They correspond to clusters 1, 2 and 3 respectively in the upper-left panel of Figure 3. We add a single Must-Link (B,C). When η = 10, the data still override our Must-Link somewhat, because clusters 1 and 2 do not disappear completely. As η increases to 50, the Must-Link overrides the data and clusters 1 and 2 vanish, leaving only cluster 3. That is, running DF and taking the final sample is very likely to produce the [A₄B₄C₄D₄ | E] topics. This is what we want: B and C are present or absent together in the topics, and they also "pull" A and D along, even though A and D are not in the knowledge we added.

Cannot-Link (A,B)

The corpus has four documents: ABCCABCC, ABDDABDD, twice each; T = 3, α = 1, β = 0.01. LDA produces six kinds of ϕ_{1:T} evenly: [B₂D₂ | A | C], [A₂B₂ | C | D], [A₂D₂ | B | C], [B₂C₂ | A | D], [A₂C₂ | B | D], [C₂D₂ | A | B], corresponding to clusters 1–5 and the "lines". We add a single Cannot-Link (A,B). As the DF η increases, cluster 2 [A₂B₂ | C | D] disappears, because it involves a topic A₂B₂ that violates the Cannot-Link. The other clusters become uniformly more likely.

Isolate(B)

The corpus has four documents, all of which are ABC; T = 2, α = 1, β = 0.01. LDA produces three clusters evenly: [A₂C₂ | B], [A₂B₂ | C], [B₂C₂ | A]. We add Isolate(B), which is compiled into Cannot-Link (B,A) and Cannot-Link (B,C). The DF's samples concentrate on cluster 1: [A₂C₂ | B], which indeed isolates B into its own topic.

Split(AB,CD)

The corpus has six documents: ABCDEEEE, ABCDFFFF, each present three times; α = 0.5, β = 0.01. LDA with T = 3 produces a large portion of topics around [A₄B₄C₄D₄ | E | F] (not shown). We add Split(AB,CD), which is compiled into Must-Link (A,B), Must-Link (C,D), Cannot-Link (B,C), and increase T to 4. However, DF with η = 1 (i.e., LDA with T = 4) produces a large variety of topics: e.g., cluster 1 is [A₄ 3B₈ 3D₈ | A₈ 7F₈ | C | E], cluster 2 is [C₈ 7D₈ | 3A₈ 3B₈ C₄ | E | F], and cluster 7 is [A₂B₂ | C₂D₂ | E | F]. That is, merely adding one more topic does not cleanly separate AB from CD. On the other hand, as η increases, DF eventually concentrates on cluster 7, which satisfies the Split operation.

Wish Corpus

We now consider interactive topic modeling with DF. The corpus we use is a collection of 89,574 New Year's wishes submitted to the Times Square Alliance (Goldberg et al., 2009). Each wish is treated as a document, downcased but without stop-word removal. For each step in our interactive example, we set α = 0.5, β = 0.1, η = 1000, and run MCMC for 2000 iterations before estimating the topics from the last sample. The domain knowledge in DF is cumulative along the steps.

Step 1:
We run LDA with T = 15. Many of the most probable words in the topics are conventional ("to, and") or corpus-specific ("wish, 2008") stop-words, which obscure the meaning of the topics.

Step 2:
We manually create a 50-word stopword list and issue an Isolate preference. This is compiled into Must-Links among this set and Cannot-Links between this set and all other words in the top 50 for all topics. T is increased to 16. After running DF, we end up with two stop-word topics. Importantly, with the stop-words explained by these two topics, the top words for the other topics become much more meaningful.


Step 3:
We notice that one topic conflates two concepts: enter college and cure disease (top 8 words: "go school cancer into well free cure college"). We issue Split("go,school,into,college", "cancer,free,cure,well") to separate the concepts. This is compiled into Must-Links within each quadruple, and a Cannot-Link between them. T is increased to 18. After running DF, one of the topics clearly takes on the "college" concept, picking up related words which we did not explicitly encode in our prior. Another topic does likewise for the "cure" concept (many wishes are like "mom stays cancer free"). Other topics have minor changes.

Step 4:
We then notice that two topics correspond to romance concepts. We apply Merge("love, forever, marry, together, loves", "meet, boyfriend, married, girlfriend, wedding"), which is compiled into Must-Links between these words. T is decreased to 17. After running DF, one of the romance topics disappears, and the remaining one corresponds to the merged romance topic ("lose", "weight" were in one of them, and remain so). Other previous topics survive with only minor changes.

Table 1 shows the wish topics after these four steps, where we place the DF operations next to the most affected topics and color-code the words explicitly specified in the domain knowledge.


Table 1

Wish topics from interactive topic modeling

Topic | Top words sorted by ϕ = p(word|topic)
Merge | [row shown as a color-coded image in the original]
success | health happiness family good friends prosperity
life | life happy best live time long wishes ever years
 | as do not what someone so like don much he
money | out make money up house work able pay own lots
people | no people end less day every each other another
iraq | home safe end troops iraq bring war return
joy | [row shown as a color-coded image in the original]
family | happy healthy family baby safe prosperous
vote | better hope president paul ron than person bush
Isolate | and to for a the year in new all my
god | god bless jesus everyone loved know heart christ
peace | peace earth world win lottery around save
spam | com call if u 4 www 2 3 visit 1
Isolate | i to wish my for and a be that the
Split | [row shown as a color-coded image in the original]
Split | [row shown as a color-coded image in the original]

Yeast Corpus

Whereas the previous experiment illustrates the utility of our approach in an interactive setting, we now consider a case in which we use background knowledge from an ontology to guide topic modeling. Our prior knowledge is based on six concepts. The concepts transcription, translation and replication characterize three important processes that are carried out at the molecular level. The concepts initiation, elongation and termination describe phases of the three aforementioned processes. Combinations of concepts from these two sets correspond to concepts in the Gene Ontology (e.g., GO:0006414 is translational elongation, and GO:0006352 is transcription initiation). We guide our topic modeling using Must-Links among a small set of words for each concept. Moreover, we use Cannot-Links among words to specify that we prefer (i) transcription, translation and replication to be represented in separate topics, and (ii) initiation, elongation and termination to be represented in separate topics. We do not set any preferences between the "process" topics and the "phase" topics, however.

The corpus that we use for our experiments consists of 18,193 abstracts selected from the MEDLINE database for their relevance to yeast genes. We induce topic models using DF to encode the Must-Links and Cannot-Links described above, and use standard LDA as a control. We set T = 100, α = 0.5, β = 0.1, η = 5000. For each word that we use to seed a concept, Table 2 shows the topics that include it among their 50 most probable words. We make several observations about the DF-induced topics. First, each concept is represented by a small number of topics, and the Must-Link words for each concept all occur as highly probable words in these topics. Second, the Cannot-Link preferences are obeyed in the final topics. Third, the topics combine the process and phase concepts compositionally. For example, DF Topic 4 represents transcription initiation and DF Topic 8 represents replication initiation. Moreover, the topics that are significantly influenced by the prior typically include highly relevant terms among their most probable words. For instance, the top words in DF Topic 4 include "TATA", "TFIID", "promoter", and "recruitment", which are all specifically germane to the composite concept of transcription initiation. In the case of standard LDA, the seed concept words are dispersed across a greater number of topics, and highly related words, such as "cycle" and "division", often do not fall into the same topic. Many of the topics induced by ordinary LDA are semantically coherent, but the specific concepts suggested by our prior do not naturally emerge without using DF.


Table 2

Yeast topics. The left column shows the seed words in the DF model. The middle columns indicate the topics in which at least 2 seed words are among the 50 highest-probability words for LDA; the "o" column gives the number of other topics (not shared by another word). The right columns show the same topic-word relationships for the DF model.

Seed words by concept: transcription, transcriptional, template; translation, translational, tRNA; replication, cycle, division; initiation, start, assembly; elongation; termination, disassembly, release, stop.

[Topic-membership marks for LDA topics 1–8, "o", and DF topics 1–10 omitted.]

Acknowledgments

This work was supported by NIH/NLM grants T15 LM07359 and R01 LM07050, and by the Wisconsin Alumni Research Foundation.

Footnotes

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009.

1. When there are Must-Links, all words in a Must-Link transitive closure form a single node in this graph.

2. Dirichlet distributions with very small concentration do have some selection effect. For example, Beta(0.1, 0.1) tends to concentrate probability mass on one of the two variables. However, such priors are weak – the "pseudo counts" in them are too small because of the small concentration. The posterior will be dominated by the data, and we would lose any encoded domain knowledge.

References

  • Basu S, Davidson I, Wagstaff K, editors. Constrained clustering: Advances in algorithms, theory, and applications. Chapman & Hall/CRC; 2008.
  • Blei D, Lafferty J. Correlated topic models. In: Advances in Neural Information Processing Systems. Vol. 18. Cambridge, MA: MIT Press; 2006. pp. 147–154.
  • Blei D, Ng A, Jordan M. Latent Dirichlet allocation. Journal of Machine Learning Research. 2003;3:993–1022.
  • Chemudugunta C, Holloway A, Smyth P, Steyvers M. Modeling documents by combining semantic concepts with unsupervised statistical learning. Intl. Semantic Web Conf.; Springer; 2008. pp. 229–244.
  • Dennis SY III. On the hyper-Dirichlet type 1 and hyper-Liouville distributions. Communications in Statistics – Theory and Methods. 1991;20:4069–4081.
  • Goldberg A, Fillmore N, Andrzejewski D, Xu Z, Gibson B, Zhu X. May all your wishes come true: A study of wishes and how to recognize them. Human Language Technologies: Proc. of the Annual Conf. of the North American Chapter of the Assoc. for Computational Linguistics; ACL Press; 2009.
  • Griffiths TL, Steyvers M. Finding scientific topics. Proc. of the Nat. Academy of Sciences of the United States of America. 2004;101:5228–5235.
  • Griggs JR, Grinstead CM, Guichard DR. The number of maximal independent sets in a connected graph. Discrete Math. 1988;68:211–220.
  • Li W, McCallum A. Pachinko allocation: DAG-structured mixture models of topic correlations. Proc. of the 23rd Intl. Conf. on Machine Learning; ACM Press; 2006. pp. 577–584.
  • Minka TP. The Dirichlet-tree distribution (Technical Report). 1999. http://research.microsoft.com/~minka/papers/dirichlet/minka-dirtree.pdf
  • Tam Y-C, Schultz T. Correlated latent semantic model for unsupervised LM adaptation. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing; 2007. pp. 41–44.
  • The Gene Ontology Consortium. Gene Ontology: Tool for the unification of biology. Nature Genetics. 2000;25:25–29.

Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2943854/