Distributed mining of molecular fragments
Di Fatta, G. and Berthold, M. R. (2004) Distributed mining of molecular fragments. In: DM-Grid 2004, IEEE Workshop on Data Mining and the Grid in conjunction with ICDM 2004, 1 Nov 2004, Brighton, UK. (Unpublished)
In real-world applications, sequential data mining and data exploration algorithms are often unsuitable for datasets of enormous size, high dimensionality and complex structure. Grid computing promises access to computing and storage resources on an unprecedented scale, which creates a need for high-performance distributed data mining algorithms. However, the computational complexity of the problem and the large amount of data to be explored often make the design of large-scale applications particularly challenging. In this paper we present the first distributed formulation of a frequent subgraph mining algorithm for discriminative fragments of molecular compounds. Two distributed approaches have been developed and compared on the well-known National Cancer Institute HIV-screening dataset. We present experimental results obtained in a small-scale computing environment.