Empirical Inference Conference Paper 2008

Graph Mining with Variational Dirichlet Process Mixture Models

Graph data such as chemical compounds and XML documents are getting more common in many application domains. A main difficulty of graph data processing lies in the intrinsic high dimensionality of graphs, namely, when a graph is represented as a binary feature vector of indicators of all possible subgraph patterns, the dimensionality gets too large for usual statistical methods. We propose a nonparametric Bayesian method for clustering graphs and selecting salient patterns at the same time. Variational inference is adopted here, because sampling is not applicable due to extremely high dimensionality. The feature set minimizing the free energy is efficiently collected with the DFS code tree, where the generation of useless subgraphs is suppressed by a tree pruning condition. In experiments, our method is compared with a simpler approach based on frequent subgraph mining, and graph kernels.

Author(s): Tsuda, K. and Kurihara, K.
Book Title: SDM 2008
Journal: Proceedings of the 8th SIAM International Conference on Data Mining
Pages: 432-442
Year: 2008
Month: April
Day: 0
Editors: Zaki, M. J.
Publisher: Society for Industrial and Applied Mathematics
Bibtex Type: Conference Paper (inproceedings)
Address: Philadelphia, PA, USA
Event Name: 8th 2008 SIAM International Conference on Data Mining
Event Place: Atlanta, GA, USA
Digital: 0
Electronic Archiving: grant_archive
Language: en
Organization: Max-Planck-Gesellschaft
School: Biologische Kybernetik
Links:

BibTex

@inproceedings{4950,
  title = {Graph Mining with Variational Dirichlet Process Mixture Models},
  journal = {Proceedings of the 8th SIAM International Conference on Data Mining},
  booktitle = {SDM 2008},
  abstract = {Graph data such as chemical compounds and XML documents are getting more
  common in many application domains.
  A main difficulty of graph data processing
  lies in the intrinsic high dimensionality of graphs, namely,
  when a graph is represented as a binary feature vector
  of indicators of all possible subgraph patterns,
  the dimensionality gets too large for usual statistical methods.
  We propose a nonparametric Bayesian method for clustering graphs and
  selecting salient patterns at the same time.
  Variational inference is adopted here, because sampling
  is not applicable due to extremely high dimensionality.
  The feature set minimizing the free energy is efficiently
  collected with the DFS code tree, where the generation of useless subgraphs
  is suppressed by a tree pruning condition.
  In experiments, our method is compared with a simpler approach based on
  frequent subgraph mining, and graph kernels.},
  pages = {432-442},
  editors = {Zaki, M. J.},
  publisher = {Society for Industrial and Applied Mathematics},
  organization = {Max-Planck-Gesellschaft},
  school = {Biologische Kybernetik},
  address = {Philadelphia, PA, USA},
  month = apr,
  year = {2008},
  slug = {4950},
  author = {Tsuda, K. and Kurihara, K.},
  month_numeric = {4}
}