Proc Int Conf Mach Learn. Author manuscript; available in PMC 2010 Sep 22.
Published in final edited form as: Proc Int Conf Mach Learn. 2009; 382(26): 25–32.
PMCID: PMC2943854
NIHMSID: NIHMS154934
Incorporating Domain Knowledge into Topic Modeling via Dirichlet Forest Priors
David Andrzejewski
^{*}Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI 53706 USA
^{†}Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI 53706 USA
Xiaojin Zhu
^{*}Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI 53706 USA
Mark Craven
^{*}Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI 53706 USA
^{†}Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI 53706 USA
Abstract
Users of topic modeling methods often have knowledge about the composition of words that should have high or low probability in various topics. We incorporate such domain knowledge using a novel Dirichlet Forest prior in a Latent Dirichlet Allocation framework. The prior is a mixture of Dirichlet tree distributions with special structures. We present its construction, and inference via collapsed Gibbs sampling. Experiments on synthetic and real datasets demonstrate our model's ability to follow and generalize beyond user-specified domain knowledge.
1. Introduction
Topic modeling, using approaches such as Latent Dirichlet Allocation (LDA) (Blei et al., 2003), has enjoyed popularity as a way to model hidden topics in data. However, in many applications, a user may have additional knowledge about the composition of words that should have high probability in various topics. For example, in a biological application, one may prefer that the words "termination", "disassembly" and "release" appear with high probability in the same topic, because they all describe the same phase of biological processes. Furthermore, a biologist could automatically extract these preferences from an existing biomedical ontology, such as the Gene Ontology (GO) (The Gene Ontology Consortium, 2000). As another example, an analyst may run topic modeling on a corpus of people's wishes, inspect the resulting topics, and find that "into, college" and "cure, cancer" all appear with high probability in the same topic. The analyst may want to interactively express the preference that the two sets of words should not appear together, rerun topic modeling, and incorporate additional preferences based on the new results. In both cases, we would like these preferences to guide the recovery of latent topics. Standard LDA lacks a mechanism for incorporating such domain knowledge.
In this paper, we propose a principled approach to the incorporation of such domain knowledge into LDA. We show that many types of knowledge can be expressed with two primitives on word pairs. Borrowing names from the constrained clustering literature (Basu et al., 2008), we call the two primitives Must-Links and Cannot-Links, although there are important differences. We then encode the set of Must-Links and Cannot-Links associated with the domain knowledge using a Dirichlet Forest prior, replacing the Dirichlet prior over the topic-word multinomial p(word|topic). The Dirichlet Forest prior is a mixture of Dirichlet tree distributions with very specific tree structures. Our approach has several advantages: (i) A Dirichlet Forest can encode Must-Links and Cannot-Links, something impossible with Dirichlet distributions. (ii) The user can control the strength of the domain knowledge by setting a parameter η, allowing domain knowledge to be overridden if the data strongly suggest otherwise. (iii) The Dirichlet Forest lends itself to efficient inference via collapsed Gibbs sampling, a property inherited from the conjugacy of Dirichlet trees. We present experiments on several synthetic datasets and two real domains, demonstrating that the resulting topics not only successfully incorporate the specified domain knowledge, but also generalize beyond it by including/excluding other related words not explicitly mentioned in the Must-Links and Cannot-Links.
2. Related Work
We review LDA using the notation of Griffiths and Steyvers (2004). Let there be T topics. Let w = w_{1} … w_{n} represent a corpus of D documents, with a total of n words. We use d_{i} to denote the document of word w_{i}, and z_{i} the hidden topic from which w_{i} is generated. Let ϕ^{(j)}_{w} = p(w|z = j), and θ^{(d)}_{j} = p(z = j) for document d. The LDA generative model is then:

θ^{(d)} ~ Dirichlet(α)    (1)

z_{i} | θ^{(d_{i})} ~ Multinomial(θ^{(d_{i})})    (2)

ϕ^{(j)} ~ Dirichlet(β)    (3)

w_{i} | z_{i}, ϕ ~ Multinomial(ϕ^{(z_{i})})    (4)

where α and β are hyperparameters for the document-topic and topic-word Dirichlet distributions, respectively. For simplicity we will assume symmetric α and β, but asymmetric hyperparameters are also possible.
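As a concrete illustration, the generative process in (1)–(4) can be sketched as a small simulation. This is our own minimal sketch (all function and variable names are ours, not from the paper), using only the standard library:

```python
import random

def dirichlet(params, rng):
    # Draw from Dirichlet(params) via normalized Gamma variates.
    g = [rng.gammavariate(a, 1.0) for a in params]
    s = sum(g)
    return [x / s for x in g]

def generate_corpus(D, T, V, doc_len, alpha, beta, rng):
    # Equations (1)-(4) with symmetric hyperparameters alpha, beta.
    phi = [dirichlet([beta] * V, rng) for _ in range(T)]       # (3) topic-word multinomials
    docs = []
    for _ in range(D):
        theta = dirichlet([alpha] * T, rng)                    # (1) document-topic multinomial
        z = rng.choices(range(T), weights=theta, k=doc_len)    # (2) topic assignments
        w = [rng.choices(range(V), weights=phi[zi], k=1)[0]
             for zi in z]                                      # (4) words drawn from topics
        docs.append(w)
    return docs, phi

rng = random.Random(0)
docs, phi = generate_corpus(D=4, T=2, V=5, doc_len=10, alpha=0.5, beta=0.01, rng=rng)
```

The Dirichlet Forest model below replaces only step (3) of this process.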
Previous work has modeled correlations in the LDA document-topic mixtures using the logistic Normal distribution (Blei & Lafferty, 2006), DAG (Pachinko) structures (Li & McCallum, 2006), or the Dirichlet Tree distribution (Tam & Schultz, 2007). In addition, the concept-topic model (Chemudugunta et al., 2008) incorporates domain knowledge through special "concept" topics, in which only a particular set of words can be present. Our work complements the previous work by encoding complex domain knowledge on words (especially arbitrary Cannot-Links) into a flexible and computationally efficient prior.
3. Topic Modeling with Dirichlet Forest
Our proposed model differs from LDA in the way ϕ is generated. Instead of (3), we have

q ~ DirichletForest(β, η)

ϕ^{(j)} ~ DirichletTree(q)

where q specifies a Dirichlet tree distribution, β plays a role corresponding to the topic-word hyperparameter in standard LDA, and η ≥ 1 is the "strength parameter" of the domain knowledge. Before discussing DirichletForest(β, η) and DirichletTree(q), we first explain how knowledge can be expressed using Must-Link and Cannot-Link primitives.
3.1. Must-Links and Cannot-Links
Must-Links and Cannot-Links were originally proposed for constrained clustering to encourage two instances to fall into the same cluster or into separate clusters, respectively. We borrow the notion for topic modeling. Informally, the Must-Link primitive prefers that two words tend to be generated by the same topic, while the Cannot-Link primitive prefers that two words tend to be generated by separate topics. However, since any topic ϕ is a multinomial over words, any two words (in general) always have some probability of being generated by the topic. We therefore adopt the following definitions:
Must-Link (u, υ)
Two words u, υ have similar probability within any topic, i.e., ϕ^{(j)}_{u} ≈ ϕ^{(j)}_{υ} for j = 1 … T. It is important to note that the probabilities can be both large or both small, as long as they are similar. For example, for the earlier biology example we could say Must-Link (termination, disassembly).
Cannot-Link (u, υ)
Two words u, υ should not both have large probability within any topic. It is permissible for one to have a large probability and the other small, or both small. For example, one primitive for the wish example can be Cannot-Link (college, cure).
Many types of domain knowledge can be decomposed into a set of Must-Links and Cannot-Links. We demonstrate three types in our experiments. We can Split two or more sets of words from a single topic into different topics by placing Must-Links within the sets and Cannot-Links between them. We can Merge two or more sets of words from different topics into one topic by placing Must-Links among the sets. Given a common set of words which appear in multiple topics (such as stopwords in English, which tend to appear in all LDA topics), we can Isolate them by placing Must-Links within the common set, and then placing Cannot-Links between the common set and the other high-probability words from all topics. It is important to note that our Must-Links and Cannot-Links are preferences instead of hard constraints.
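To illustrate, the three operations can be compiled into primitive pairs roughly as follows. This is a sketch under our own naming; note that the paper's experiments sometimes place a single representative Cannot-Link between sets (e.g., Split(AB,CD) compiles to Cannot-Link (B,C)), whereas for simplicity this sketch emits the full cross product:

```python
from itertools import combinations

def split_op(word_sets):
    # Must-Links within each set, Cannot-Links between sets.
    must = [p for s in word_sets for p in combinations(sorted(s), 2)]
    cannot = [(u, v) for s1, s2 in combinations(word_sets, 2)
              for u in sorted(s1) for v in sorted(s2)]
    return must, cannot

def merge_op(word_sets):
    # Must-Links among all words of all sets; no Cannot-Links.
    words = sorted(set().union(*word_sets))
    return list(combinations(words, 2)), []

def isolate_op(common, top_words):
    # Must-Links within the common set, Cannot-Links from the common set
    # to all other high-probability words.
    must = list(combinations(sorted(common), 2))
    cannot = [(u, v) for u in sorted(common)
              for v in sorted(set(top_words) - set(common))]
    return must, cannot

must, cannot = split_op([{"A", "B"}, {"C", "D"}])
```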
3.2. Encoding Must-Links
It is well known that the Dirichlet distribution is limited in that all words share a common variance parameter, and are mutually independent except for the normalization constraint (Minka, 1999). However, for Must-Link (u, υ) it is crucial to control the two words u, υ differently than other words.
The Dirichlet tree distribution (Dennis III, 1991) is a generalization of the Dirichlet distribution that allows such control. It is a tree with the words as leaf nodes; see Figure 1(a) for an example. Let γ^{(k)} be the Dirichlet tree edge weight leading into node k. Let C(k) be the immediate children of node k in the tree, L the leaves of the tree, I the internal nodes, and L(k) the leaves in the subtree under k. To generate a sample ϕ ~ DirichletTree(γ), one first draws a multinomial at each internal node s ∈ I from Dirichlet(γ^{C(s)}), i.e., using the weights from s to its children as the Dirichlet parameters. One can think of it as redistributing the probability mass reaching s by this multinomial (initially, the mass is 1 at the root). The probability ϕ^{(k)} of a word k ∈ L is then simply the product of the multinomial parameters on the edges from k to the root, as shown in Figure 1(b). It can be shown (Dennis III, 1991) that this procedure gives
p(ϕ | γ) = ( ∏_{k∈L} (ϕ^{(k)})^{γ^{(k)} − 1} ) ∏_{s∈I} ( ∑_{k∈L(s)} ϕ^{(k)} )^{Δ(s)} Γ( ∑_{k∈C(s)} γ^{(k)} ) / ∏_{k∈C(s)} Γ(γ^{(k)})

where Γ(·) is the standard gamma function. The quantity Δ(s) ≡ γ^{(s)} − ∑_{k∈C(s)} γ^{(k)} is the difference between the in-degree and out-degree of internal node s. When this difference Δ(s) = 0 for all internal nodes s ∈ I, the Dirichlet tree reduces to a Dirichlet distribution.
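The sampling procedure above can be sketched directly: recursively draw a Dirichlet multinomial at each internal node and push mass down to the leaves. The tuple tree encoding and all names below are ours, and the example tree is the Must-Link tree of Figure 1(a) (closure {A, B} with leaf weights ηβ, root edges |L(s)|β and β):

```python
import random

def dirichlet(weights, rng):
    # Dirichlet draw via normalized Gamma variates.
    g = [rng.gammavariate(w, 1.0) for w in weights]
    s = sum(g)
    return [x / s for x in g]

def sample_dirichlet_tree(node, rng, mass=1.0, out=None):
    # node is ("leaf", word) or ("node", [(edge_weight, child), ...]).
    # Mass reaching an internal node is redistributed by a Dirichlet draw
    # over its child edge weights; leaves accumulate their final probability.
    if out is None:
        out = {}
    kind, payload = node
    if kind == "leaf":
        out[payload] = out.get(payload, 0.0) + mass
    else:
        probs = dirichlet([w for w, _ in payload], rng)
        for p, (_, child) in zip(probs, payload):
            sample_dirichlet_tree(child, rng, mass * p, out)
    return out

beta, eta = 1.0, 50.0
tree = ("node", [(2 * beta, ("node", [(eta * beta, ("leaf", "A")),
                                      (eta * beta, ("leaf", "B"))])),
                 (beta, ("leaf", "C"))])
phi = sample_dirichlet_tree(tree, random.Random(7))
```

With large η the draw at the internal node is tightly concentrated, so p(A) ≈ p(B), while the total mass reaching {A, B} still varies freely, matching Figure 1(c).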
Like the Dirichlet, the Dirichlet tree is conjugate to the multinomial. It is possible to integrate out ϕ to get a distribution over word counts directly, similar to the multivariate Pólya distribution:

p(w | γ) = ∏_{s∈I} [ Γ( ∑_{k∈C(s)} γ^{(k)} ) / Γ( ∑_{k∈C(s)} (γ^{(k)} + n^{(k)}) ) ∏_{k∈C(s)} Γ(γ^{(k)} + n^{(k)}) / Γ(γ^{(k)}) ]    (5)

Here n^{(k)} is the number of word tokens in w that appear in L(k).
We encode Must-Links using a Dirichlet tree. Note that our definition of Must-Link is transitive: Must-Link (u, υ) and Must-Link (υ, w) imply Must-Link (u, w). We thus first compute the transitive closures of the expressed Must-Links. Our Dirichlet tree for Must-Links has a very simple structure: each transitive closure is a subtree, with one internal node and the words in the closure as its leaves. The weights from the internal node to its leaves are ηβ. The root connects to each such internal node s with weight |L(s)|β, where |·| denotes set size. In addition, the root directly connects to each word not in any closure, with weight β. For example, the transitive closure for a Must-Link (A,B) on vocabulary {A,B,C} is simply {A,B}, corresponding to the Dirichlet tree in Figure 1(a).
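A sketch of this construction, using union-find for the transitive closures and the tuple tree encoding from our earlier sketches (all names ours):

```python
def mustlink_closures(vocab, mustlinks):
    # Union-find over the vocabulary to form Must-Link transitive closures.
    parent = {w: w for w in vocab}
    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path halving
            w = parent[w]
        return w
    for u, v in mustlinks:
        parent[find(u)] = find(v)
    groups = {}
    for w in vocab:
        groups.setdefault(find(w), []).append(w)
    return [sorted(g) for g in groups.values()]

def mustlink_tree(vocab, mustlinks, beta, eta):
    # Closures become subtrees with leaf weights eta*beta; the root connects
    # to each closure with weight |L(s)|*beta and to lone words with beta.
    children = []
    for g in mustlink_closures(vocab, mustlinks):
        if len(g) == 1:
            children.append((beta, ("leaf", g[0])))
        else:
            sub = ("node", [(eta * beta, ("leaf", w)) for w in g])
            children.append((len(g) * beta, sub))
    return ("node", children)

tree = mustlink_tree(["A", "B", "C"], [("A", "B")], beta=0.5, eta=10.0)
```

At η = 1 every node's in-degree equals its out-degree, so the tree collapses to a symmetric Dirichlet(β), consistent with the discussion below.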
To understand this encoding of Must-Links, consider first the case when the domain knowledge strength parameter is at its weakest, η = 1. Then in-degree equals out-degree for any internal node s (both are |L(s)|β), and the tree reduces to a Dirichlet distribution with symmetric prior β: the Must-Links are turned off in this case. As we increase η, the redistribution of probability mass at s (governed by a Dirichlet under s) has increasing concentration |L(s)|ηβ but the same uniform base measure. This tends to redistribute the mass evenly within the transitive closure represented by s. Therefore, the Must-Links are turned on when η > 1. Furthermore, the mass reaching s is independent of η, and can still have a large variance. This properly encodes the fact that we want Must-Linked words to have similar, but not always large, probabilities. Otherwise, Must-Linked words would be forced to appear with large probability in all topics, which is clearly undesirable. This is impossible to represent with Dirichlet distributions. For example, the blue dots in Figure 1(c) are ϕ samples from the Dirichlet tree in Figure 1(a), plotted on the probability simplex of dimension three. While it is always true that p(A) ≈ p(B), their total probability mass can be anywhere from 0 to 1. The most similar Dirichlet distribution is perhaps the one with parameters (50,50,1), which generates samples close to (0.5, 0.5, 0) (Figure 1(d)).
3.3. Encoding Cannot-Links
Cannot-Links are considerably harder to handle. We first transform them into an alternative form that is amenable to Dirichlet trees. Note that Cannot-Links are not transitive: Cannot-Link (A,B) and Cannot-Link (B,C) do not entail Cannot-Link (A,C). We define a Cannot-Link-graph where the nodes are words^{1}, and the edges correspond to the Cannot-Links. Then the connected components of this graph are independent of each other when encoding Cannot-Links. We will use this property to factor a Dirichlet-tree selection probability later. For example, the two Cannot-Links (A,B) and (B,C) form the graph in Figure 1(e) with a single connected component {A,B,C}.
Consider the subgraph on connected component r. We define its complement graph by flipping the edges (on to off, off to on), as shown in Figure 1(f). Let there be Q^{(r)} maximal cliques M_{r1} … M_{rQ^{(r)}} in this complement graph. In the following, we simply call them "cliques", but it is important to remember that they are maximal cliques of the complement graph, not the original Cannot-Link-graph. In our example, Q^{(r)} = 2, M_{r1} = {A, C}, and M_{r2} = {B}. These cliques have the following interpretation: each clique (e.g., M_{r1} = {A, C}) is a maximal subset of words in the connected component that can "occur together". That is, these words are allowed to simultaneously have large probabilities in a given topic without violating any Cannot-Link preferences. By the maximality of these cliques, allowing any word outside the clique (e.g., "B") to also have a large probability will violate at least one Cannot-Link (in this example, two).
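This preprocessing (connected components of the Cannot-Link-graph, then maximal cliques of each component's complement) can be sketched with a small Bron-Kerbosch recursion. All names are ours; the example reproduces the {A,B,C} component above:

```python
def components(nodes, edges):
    # Connected components of the Cannot-Link-graph, plus its adjacency map.
    adj = {u: set() for u in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, comps = set(), []
    for u in nodes:
        if u in seen:
            continue
        comp, stack = set(), [u]
        while stack:
            x = stack.pop()
            if x in comp:
                continue
            comp.add(x)
            stack.extend(adj[x] - comp)
        seen |= comp
        comps.append(comp)
    return comps, adj

def maximal_cliques(nodes, adj):
    # Bron-Kerbosch on the COMPLEMENT of the Cannot-Link subgraph.
    cadj = {u: {v for v in nodes if v != u and v not in adj[u]} for u in nodes}
    out = []
    def bk(R, P, X):
        if not P and not X:
            out.append(R)
            return
        for v in list(P):
            bk(R | {v}, P & cadj[v], X & cadj[v])
            P = P - {v}
            X = X | {v}
    bk(set(), set(nodes), set())
    return out

comps, adj = components({"A", "B", "C"}, [("A", "B"), ("B", "C")])
cliques = maximal_cliques(comps[0], adj)
```

On this component the complement graph has the single edge A-C, so the maximal cliques are {A, C} and {B}, matching Q^{(r)} = 2 above.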
We discuss the encoding for this single connected component r now, deferring discussion of the complete encoding to Section 3.4. We create a mixture model of Q^{(r)} Dirichlet subtrees, one for each clique. Each topic selects exactly one subtree according to probability

p(q) ∝ |M_{rq}|,  q = 1 … Q^{(r)}.    (6)

Conceptually, the selected subtree indexed by q tends to redistribute nearly all probability mass to the words within M_{rq}. Since there is no mass left for other cliques, it is impossible for a word outside clique M_{rq} to have a large probability. Therefore, no Cannot-Link will be violated. In reality, the subtrees are soft rather than hard, because Cannot-Links are only preferences. The Dirichlet subtree for M_{rq} is structured as follows. The subtree's root connects to an internal node s with weight η|M_{rq}|β. The node s connects to the words in M_{rq}, each with weight β. The subtree's root also directly connects to the words not in M_{rq} (but in the connected component r), each with weight β. This will send most probability mass down to s, and then flexibly redistribute it among the words in M_{rq}. For example, Figures 1(g,h) show the Dirichlet subtrees for M_{r1} = {A, C} and M_{r2} = {B} respectively. Samples from this mixture model are shown in Figure 1(i), representing multinomials in which no Cannot-Link is violated. Such behavior is not achievable by a Dirichlet distribution, or a single Dirichlet tree^{2}.
Finally, we mention that although in the worst case the number of maximal cliques Q^{(r)} in a connected component of size r can grow exponentially as O(3^{r/3}) (Griggs et al., 1988), in our experiments Q^{(r)} is no larger than 3, due in part to Must-Linked words "collapsing" to single nodes in the Cannot-Link graph.
3.4. The Dirichlet Forest Prior
In general, our domain knowledge is expressed by a set of Must-Links and Cannot-Links. We first compute the transitive closure of the Must-Links. We then form a Cannot-Link-graph, where a node is either a Must-Link closure or a word not present in any Must-Link. Note that the domain knowledge must be "consistent", in that no pair of words is simultaneously Cannot-Linked and Must-Linked (either explicitly or implicitly through Must-Link transitive closure). Let R be the number of connected components in the Cannot-Link-graph. Our Dirichlet Forest consists of ∏_{r=1}^{R} Q^{(r)} Dirichlet trees, represented by the template in Figure 2. Each Dirichlet tree has R branches beneath the root, one for each connected component. The trees differ in which subtrees they include under these branches. For the r-th branch, there are Q^{(r)} possible Dirichlet subtrees, corresponding to cliques M_{r1} … M_{rQ^{(r)}}. Therefore, a tree in the forest is uniquely identified by an index vector q = (q^{(1)} … q^{(R)}), where q^{(r)} ∈ {1 … Q^{(r)}}.
To draw a Dirichlet tree q from the prior DirichletForest(β, η), we select the subtrees independently, because the R connected components are independent with respect to Cannot-Links: p(q) = ∏_{r=1}^{R} p(q^{(r)}). Each q^{(r)} is sampled according to (6), and corresponds to choosing a solid box for the r-th branch in Figure 2. The structure of the subtree within the solid box was defined in Section 3.3. The black nodes may be a single word, or a Must-Link transitive closure having the subtree structure shown in the dotted box. The edge weight leading into most nodes k is γ^{(k)} = |L(k)|β, where L(k) is the set of leaves under k. However, for edges coming out of a Must-Link internal node or going into a Cannot-Link internal node, the weights are multiplied by the strength parameter η. These edges are marked by "η*" in Figure 2.
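Drawing a tree from the forest thus amounts to one clique choice per connected component, with p(q^{(r)}) ∝ |M_{rq}| as in (6). A sketch (names ours):

```python
import random

def sample_tree_indices(cliques_per_component, rng):
    # One subtree index per connected component, chosen with probability
    # proportional to clique size, per equation (6).
    q = []
    for cliques in cliques_per_component:
        sizes = [len(c) for c in cliques]
        q.append(rng.choices(range(len(cliques)), weights=sizes, k=1)[0])
    return q

rng = random.Random(0)
# Component 1 has cliques {A,C} and {B}; component 2 has only {D}.
q = sample_tree_indices([[{"A", "C"}, {"B"}], [{"D"}]], rng)
```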
We now define the complete Dirichlet Forest model, integrating out ("collapsing") θ and ϕ. Let n_{j}^{(d)} be the number of word tokens in document d that are assigned to topic j, and n^{(d)} the total number of word tokens in document d. Then z is generated the same as in LDA:

p(z) = ∏_{d=1}^{D} Γ(Tα) / Γ(n^{(d)} + Tα) ∏_{j=1}^{T} Γ(n_{j}^{(d)} + α) / Γ(α)

There is one Dirichlet tree q_{j} per topic j = 1 … T, sampled from the Dirichlet Forest prior: q_{j} ~ DirichletForest(β, η). Each Dirichlet tree q_{j} implicitly defines its tree edge weights γ_{j} using β, η, and its tree structure L_{j}, I_{j}, C_{j}(·). Let n_{j}^{(k)} be the number of word tokens in the corpus assigned to topic j that appear under node k in the Dirichlet tree q_{j}. The probability of generating the corpus w, given the trees q_{1:T} ≡ q_{1} … q_{T} and the topic assignment z, can be derived using (5):

p(w | z, q_{1:T}) = ∏_{j=1}^{T} ∏_{s∈I_{j}} [ Γ( ∑_{k∈C_{j}(s)} γ_{j}^{(k)} ) / Γ( ∑_{k∈C_{j}(s)} (γ_{j}^{(k)} + n_{j}^{(k)}) ) ∏_{k∈C_{j}(s)} Γ(γ_{j}^{(k)} + n_{j}^{(k)}) / Γ(γ_{j}^{(k)}) ]

Finally, the complete generative model is

p(w, z, q_{1:T}) = p(w | z, q_{1:T}) p(z) ∏_{j=1}^{T} p(q_{j}).
4. Inference for Dirichlet Forest
Because a Dirichlet Forest is a mixture of Dirichlet trees, which are conjugate to multinomials, we can efficiently perform inference by Markov chain Monte Carlo (MCMC). Specifically, we use collapsed Gibbs sampling similar to Griffiths and Steyvers (2004). However, in our case the MCMC state is defined by both the topic labels z and the tree indices q_{1:T}. An MCMC iteration in our case consists of a sweep through both z and q_{1:T}. We present the conditional probabilities for collapsed Gibbs sampling below.
(Sampling z_{i}): Let n_{−i,j}^{(d)} be the number of word tokens in document d assigned to topic j, excluding the word at position i. Similarly, let n_{−i,j}^{(k)} be the number of word tokens in the corpus that are under node k in topic j's Dirichlet tree, excluding the word at position i. For candidate topic labels υ = 1 … T, we have

p(z_{i} = υ | z_{−i}, q_{1:T}, w) ∝ (n_{−i,υ}^{(d_{i})} + α) ∏_{s∈I_{υ}(↑i)} ( γ_{υ}^{(C_{υ}(s↓i))} + n_{−i,υ}^{(C_{υ}(s↓i))} ) / ( ∑_{k∈C_{υ}(s)} (γ_{υ}^{(k)} + n_{−i,υ}^{(k)}) )

where I_{υ}(↑i) denotes the subset of internal nodes in topic υ's Dirichlet tree that are ancestors of leaf w_{i}, and C_{υ}(s↓i) is the unique node that is both s's immediate child and an ancestor of w_{i} (including w_{i} itself).
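The path-product term in this update can be computed by a single walk from the root toward leaf w_{i}. A sketch with a toy tree (the tuple encoding, counts, and names are ours; a full sampler would multiply this factor by (n_{−i,υ}^{(d_{i})} + α) for each candidate topic υ):

```python
def word_likelihood(node, word):
    # Product over internal ancestors s of `word` of
    # (gamma + n) on the child edge toward `word`, divided by the sum of
    # (gamma + n) over all children of s.
    # node is ("leaf", w) or ("node", [(gamma, count, child), ...]),
    # where counts exclude the word position being resampled.
    kind, payload = node
    if kind == "leaf":
        return 1.0 if payload == word else None
    denom = sum(g + n for g, n, _ in payload)
    for g, n, child in payload:
        below = word_likelihood(child, word)
        if below is not None:
            return (g + n) / denom * below
    return None  # word not under this subtree

tree = ("node", [(2.0, 3, ("node", [(1.0, 2, ("leaf", "A")),
                                    (1.0, 1, ("leaf", "B"))])),
                 (1.0, 0, ("leaf", "C"))])
p_A = word_likelihood(tree, "A")   # (2+3)/6 * (1+2)/5 = 0.5
```

Note the three leaf factors sum to 1 over the vocabulary, as they should for a posterior-mean multinomial.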
(Sampling q_{j}^{(r)}): Since the connected components are independent, sampling the tree q_{j} factors into sampling the clique for each connected component r = 1 … R. For candidate cliques q′ = 1 … Q^{(r)}, we have

p(q_{j}^{(r)} = q′ | z, q_{−j}, q_{j}^{(−r)}, w) ∝ |M_{rq′}| ∏_{s∈I_{j,r=q′}} [ Γ( ∑_{k∈C_{j}(s)} γ_{j}^{(k)} ) / Γ( ∑_{k∈C_{j}(s)} (γ_{j}^{(k)} + n_{j}^{(k)}) ) ∏_{k∈C_{j}(s)} Γ(γ_{j}^{(k)} + n_{j}^{(k)}) / Γ(γ_{j}^{(k)}) ]

where I_{j,r=q′} denotes the internal nodes beneath the r-th branch of tree q_{j} when clique M_{rq′} is selected.
(Estimating ϕ and θ): After running MCMC for sufficiently many iterations, we follow standard practice (e.g., (Griffiths & Steyvers, 2004)) and use the last sample (z, q_{1:T}) to estimate ϕ and θ. Because a Dirichlet tree is a conjugate distribution, its posterior is a Dirichlet tree with the same structure and updated edge weights. The posterior for the Dirichlet tree of the j-th topic has edge weights γ_{j}^{(k)} + n_{j}^{(k)}, where the counts n_{j}^{(k)} are collected from z, q_{1:T}, w. We estimate ϕ_{j} by the first moment under this posterior (Minka, 1999):

ϕ_{j}^{(w)} = ∏_{s∈I_{j}(↑w)} ( γ_{j}^{(C_{j}(s↓w))} + n_{j}^{(C_{j}(s↓w))} ) / ( ∑_{k∈C_{j}(s)} (γ_{j}^{(k)} + n_{j}^{(k)}) )    (7)

The parameter θ is estimated the same way as in standard LDA: θ_{j}^{(d)} = (n_{j}^{(d)} + α) / (n^{(d)} + Tα).
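The θ estimate, for example, is just a smoothed count ratio (sketch, names ours):

```python
def estimate_theta(doc_topic_counts, alpha):
    # theta_j^(d) = (n_j^(d) + alpha) / (n^(d) + T*alpha)
    T = len(doc_topic_counts)
    total = sum(doc_topic_counts)
    return [(n + alpha) / (total + T * alpha) for n in doc_topic_counts]

theta = estimate_theta([3, 1, 0], alpha=0.5)  # [3.5/5.5, 1.5/5.5, 0.5/5.5]
```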
5. Experiments
Synthetic Corpora
We present results on synthetic datasets to show how the Dirichlet Forest (DF) incorporates different types of knowledge. Recall that DF with η = 1 is equivalent to standard LDA (verified with the code of (Griffiths & Steyvers, 2004)).
Previous studies often take the last MCMC sample (z and q_{1:T}), and discuss the topics ϕ_{1:T} derived from that sample. Because of the stochastic nature of MCMC, we argue that more insight can be gained if multiple independent MCMC samples are considered. For each dataset, and each DF with a different η, we run a long MCMC chain with 200,000 iterations of burn-in, and take out a sample every 10,000 iterations afterward, for a total of 200 samples. We have some indication that our chain is well-mixed, as we observe all expected modes, and samples with "label switching" (i.e., equivalent up to label permutation) occur with nearly equal frequency. For each sample, we derive its topics ϕ_{1:T} with (7), and then greedily align the ϕ's from different samples, permuting the T topic labels to remove the label-switching effect. Within a dataset, we perform PCA on the baseline (η = 1) ϕ and project all samples into the resulting space to obtain a common visualization (each row in Figure 3; points are dithered to show overlap).
Must-Link (B,C)
The corpus consists of six documents over a vocabulary of five "words". The documents are ABAB, CDCD, and EEEE, each represented twice. We let T = 2, α = 0.5, β = 0.01. LDA produces three kinds of ϕ_{1:T}, each roughly a third of the time, corresponding to clusters 1, 2 and 3 respectively in the upper-left panel of Figure 3. We add a single Must-Link (B,C). When η = 10, the data still override our Must-Link somewhat, because clusters 1 and 2 do not disappear completely. As η increases to 50, the Must-Link overrides the data and clusters 1 and 2 vanish, leaving only cluster 3. That is, running DF and taking the final sample is very likely to obtain the cluster-3 topics. This is what we want: B and C are present or absent together in the topics, and they also "pull" A and D along, even though A and D are not in the knowledge we added.
Cannot-Link (A,B)
The corpus has four documents: ABCCABCC and ABDDABDD, twice each; T = 3, α = 1, β = 0.01. LDA produces six kinds of ϕ_{1:T} evenly, corresponding to clusters 1–5 and the "lines" in Figure 3. We add a single Cannot-Link (A,B). As the DF η increases, cluster 2 disappears, because it involves a topic that violates the Cannot-Link. Other clusters become uniformly more likely.
Isolate(B)
The corpus has four documents, all of which are ABC; T = 2, α = 1, β = 0.01. LDA produces three clusters evenly. We add Isolate(B), which is compiled into Cannot-Link (B,A) and Cannot-Link (B,C). The DF's samples concentrate on cluster 1, which indeed isolates B into its own topic.
Split(AB,CD)
The corpus has six documents: ABCDEEEE and ABCDFFFF, each present three times; α = 0.5, β = 0.01. LDA with T = 3 produces a large portion of topics around a single configuration (not shown). We add Split(AB,CD), which is compiled into Must-Link (A,B), Must-Link (C,D), and Cannot-Link (B,C), and increase T to 4. However, DF with η = 1 (i.e., LDA with T = 4) produces a large variety of topics, spread over several clusters. That is, merely adding one more topic does not cleanly separate AB and CD. On the other hand, as η increases, DF eventually concentrates on cluster 7, which satisfies the Split operation.
Wish Corpus
We now consider interactive topic modeling with DF. The corpus we use is a collection of 89,574 New Year's wishes submitted to The Times Square Alliance (Goldberg et al., 2009). Each wish is treated as a document, downcased but without stopword removal. For each step in our interactive example, we set α = 0.5, β = 0.1, η = 1000, and run MCMC for 2000 iterations before estimating the topics from the last sample. The domain knowledge in DF is cumulative along the steps.
Step 1:
We run LDA with T = 15. Many of the most probable words in the topics are conventional ("to, and") or corpus-specific ("wish, 2008") stopwords, which obscure the meaning of the topics.
Step 2:
We manually create a 50-word stopword list and issue an Isolate preference. This is compiled into Must-Links among this set, and Cannot-Links between this set and all other words in the top 50 for all topics. T is increased to 16. After running DF, we end up with two stopword topics. Importantly, with the stopwords explained by these two topics, the top words for the other topics become much more meaningful.
Step 3:
We notice that one topic conflates two concepts: enter college and cure disease (top 8 words: "go school cancer into well free cure college"). We issue Split("go, school, into, college", "cancer, free, cure, well") to separate the concepts. This is compiled into Must-Links within each quadruple, and a Cannot-Link between them. T is increased to 18. After running DF, one of the topics clearly takes on the "college" concept, picking up related words which we did not explicitly encode in our prior. Another topic does likewise for the "cure" concept (many wishes are like "mom stays cancer free"). Other topics have minor changes.
Step 4:
We then notice that two topics correspond to romance concepts. We apply Merge("love, forever, marry, together, loves", "meet, boyfriend, married, girlfriend, wedding"), which is compiled into Must-Links between these words. T is decreased to 17. After running DF, one of the romance topics disappears, and the remaining one corresponds to the merged romance topic ("lose", "weight" were in one of them, and remain so). Other previous topics survive with only minor changes. Table 1 shows the wish topics after these four steps, where we place the DF operations next to the most affected topics, and color-code the words explicitly specified in the domain knowledge.
Table 1
Topic  Top words sorted by ϕ = p(wordtopic) 



Merge 

success  wellness happiness family unit good friends prosperity 
life  life happy best live time long wishes ever years 
–  as do not what someone so similar don much he 
money  out make money up house piece of work able pay own lots 
people  no people end less day every each other another 
republic of iraq  domicile safety end troops iraq bring war return 
joy 

family unit  happy healthy family babe safe prosperous 
vote  meliorate promise president paul ron than person bush 
Isolate  and to for a the yr in new all my 
god  god bless jesus everyone loved know centre christ 
peace  peace earth earth win lottery around salve 
spam  com telephone call if u 4 www 2 three visit 1 
Isolate  i to wish my for and a be that the 
Split 

Carve up 

Yeast Corpus
Whereas the previous experiment illustrates the utility of our approach in an interactive setting, we now consider a case in which we use background knowledge from an ontology to guide topic modeling. Our prior knowledge is based on six concepts. The concepts transcription, translation and replication characterize three important processes that are carried out at the molecular level. The concepts initiation, elongation and termination describe phases of those same three processes. Combinations of concepts from these two sets correspond to concepts in the Gene Ontology (e.g., GO:0006414 is translational elongation, and GO:0006352 is transcription initiation). We guide our topic modeling using Must-Links among a small set of words for each concept. Moreover, we use Cannot-Links among words to specify that we prefer (i) transcription, translation and replication to be represented in separate topics, and (ii) initiation, elongation and termination to be represented in separate topics. We do not set any preferences between the "process" topics and the "phase" topics, however.
The corpus that we use for our experiments consists of 18,193 abstracts selected from the MEDLINE database for their relevance to yeast genes. We induce topic models using DF to encode the Must-Links and Cannot-Links described above, and use standard LDA as a control. We set T = 100, α = 0.5, β = 0.1, η = 5000. For each word that we use to seed a concept, Table 2 shows the topics that include it among their 50 most probable words. We make several observations about the DF-induced topics. First, each concept is represented by a small number of topics, and the Must-Link words for each topic all occur as highly probable words in these topics. Second, the Cannot-Link preferences are obeyed in the final topics. Third, the topics use the process and phase concepts compositionally. For example, DF Topic 4 represents transcription initiation and DF Topic 8 represents replication initiation. Moreover, the topics that are significantly influenced by the prior typically include highly relevant terms among their most probable words. For instance, the top words in DF Topic 4 include "TATA", "TFIID", "promoter", and "recruitment", which are all specifically germane to the composite concept of transcription initiation. In the case of standard LDA, the seed concept words are dispersed across a greater number of topics, and highly related words, such as "cycle" and "division", often do not fall into the same topic. Many of the topics induced by ordinary LDA are semantically coherent, but the specific concepts suggested by our prior do not naturally emerge without using DF.
Table 2
For each concept seed word, the LDA and DF topics that include it among their 50 most probable words (• in the original grid marks inclusion; the grid layout is not reproduced here). Seed words by concept: transcription (transcription, transcriptional, template); translation (translation, translational, tRNA); replication (replication, cycle, division); initiation (initiation, start, assembly); elongation (elongation); termination (termination, disassembly, release, end).
Acknowledgments
This work was supported by NIH/NLM grants T15 LM07359 and R01 LM07050, and the Wisconsin Alumni Research Foundation.
Footnotes
Appearing in Proceedings of the 26^{th} International Conference on Machine Learning, Montreal, Canada, 2009.
^{1}When there are Must-Links, all words in a Must-Link transitive closure form a single node in this graph.
^{2}Dirichlet distributions with very small concentration do have some selection effect. For example, Beta(0.1, 0.1) tends to concentrate probability mass on one of the two variables. However, such priors are weak: the "pseudo-counts" in them are too small because of the small concentration. The posterior will be dominated by the data, and we would lose any encoded domain knowledge.
References
Basu S, Davidson I, Wagstaff K, editors. Constrained Clustering: Advances in Algorithms, Theory, and Applications. Chapman & Hall/CRC; 2008.
Blei D, Lafferty J. Correlated topic models. In: Advances in Neural Information Processing Systems, Vol. 18. Cambridge, MA: MIT Press; 2006. pp. 147–154.
Blei D, Ng A, Jordan M. Latent Dirichlet allocation. Journal of Machine Learning Research. 2003;3:993–1022.
Chemudugunta C, Holloway A, Smyth P, Steyvers M. Modeling documents by combining semantic concepts with unsupervised statistical learning. In: Intl. Semantic Web Conf.; Springer; 2008. pp. 229–244.
Dennis SY III. On the hyper-Dirichlet type 1 and hyper-Liouville distributions. Communications in Statistics – Theory and Methods. 1991;20:4069–4081.
Goldberg A, Fillmore N, Andrzejewski D, Xu Z, Gibson B, Zhu X. May all your wishes come true: A study of wishes and how to recognize them. In: Human Language Technologies: Proc. of the Annual Conf. of the North American Chapter of the Assoc. for Computational Linguistics; ACL Press; 2009.
Griffiths TL, Steyvers M. Finding scientific topics. Proc. of the Natl. Academy of Sciences of the United States of America. 2004;101:5228–5235.
Griggs JR, Grinstead CM, Guichard DR. The number of maximal independent sets in a connected graph. Discrete Math. 1988;68:211–220.
Li W, McCallum A. Pachinko allocation: DAG-structured mixture models of topic correlations. In: Proc. of the 23rd Intl. Conf. on Machine Learning; ACM Press; 2006. pp. 577–584.
Minka TP. The Dirichlet-tree distribution (Technical Report). 1999. http://research.microsoft.com/~minka/papers/dirichlet/minka-dirtree.pdf
Tam YC, Schultz T. Correlated latent semantic model for unsupervised LM adaptation. In: IEEE Intl. Conf. on Acoustics, Speech and Signal Processing; 2007. pp. 41–44.
The Gene Ontology Consortium. Gene Ontology: Tool for the unification of biology. Nature Genetics. 2000;25:25–29.