Year of publication: 1388 (Solar Hijri calendar)

Published in: First National Conference on Software Engineering of Iran

Number of pages: 7

Author(s):

Sajjad Fallah – Department of ICT, Shahid Beheshti University of Medical Sciences, Tehran, Iran
Ghorban Kheradmandian – Department of Computer Engineering, Amirkabir University of Technology, Tehran, Iran

Abstract:

Finding useful patterns in large datasets has attracted considerable interest recently, and one of the most widely studied problems in this area is the identification of clusters, or densely populated regions, in a multi-dimensional dataset. In this paper, a new clustering algorithm for very large databases is proposed. The proposed algorithm loads the entire data set into memory at most three times. In the first phase, the space enclosing the data set is identified and partitioned into several sub-spaces, the number of which depends on the amount of available memory. Next, the data set is loaded into memory piece by piece; each data point is assigned to a sub-space, and the average of the points assigned to each sub-space is stored. In the following phase, some of the small sub-clusters corresponding to the sub-spaces are merged hierarchically to form larger clusters. The algorithm is independent of the order in which the training samples appear, and our experimental results on complex data show that it performs better than the well-known k-means algorithm.
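
The phases described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the space is partitioned or which sub-clusters are merged, so the sketch assumes a uniform grid with a hypothetical `bins` parameter per dimension, and a merge rule that repeatedly joins the two nearest centroids until a hypothetical target of `k` clusters remains. All function names are illustrative. The sketch reads the data twice; the purpose of any further pass is not stated in the abstract and is not reproduced here.

```python
import math


def bounding_box(chunks):
    """Pass 1: scan the data once to find the per-dimension min and max."""
    lo, hi = None, None
    for chunk in chunks:
        for p in chunk:
            if lo is None:
                lo, hi = list(p), list(p)
            else:
                lo = [min(a, x) for a, x in zip(lo, p)]
                hi = [max(b, x) for b, x in zip(hi, p)]
    return lo, hi


def cell_of(p, lo, hi, bins):
    """Map a point to the index tuple of its grid cell (assumed sub-space)."""
    idx = []
    for x, a, b in zip(p, lo, hi):
        width = (b - a) / bins or 1.0  # guard against a zero-width dimension
        idx.append(min(int((x - a) / width), bins - 1))
    return tuple(idx)


def summarize(chunks, lo, hi, bins):
    """Pass 2: keep only a running sum and a count per non-empty cell."""
    sums, counts = {}, {}
    for chunk in chunks:
        for p in chunk:
            c = cell_of(p, lo, hi, bins)
            if c not in sums:
                sums[c] = [0.0] * len(p)
                counts[c] = 0
            sums[c] = [s + x for s, x in zip(sums[c], p)]
            counts[c] += 1
    # Each sub-cluster is (mean of its points, number of points).
    return [([s / counts[c] for s in sums[c]], counts[c]) for c in sums]


def merge_closest(subclusters, k):
    """Assumed merge rule: join the two nearest centroids until k remain."""
    clusters = list(subclusters)
    while len(clusters) > k:
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = min(
            pairs,
            key=lambda ij: math.dist(clusters[ij[0]][0], clusters[ij[1]][0]),
        )
        (ci, ni), (cj, nj) = clusters[i], clusters[j]
        # Count-weighted mean keeps the merged centroid equal to the true mean.
        merged = ([(a * ni + b * nj) / (ni + nj) for a, b in zip(ci, cj)], ni + nj)
        clusters = [c for t, c in enumerate(clusters) if t not in (i, j)] + [merged]
    return clusters


# Two small in-memory chunks stand in for data read from disk piece by piece.
chunks = [
    [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3)],
    [(9.8, 10.0), (10.0, 9.7), (9.9, 9.9)],
]
lo, hi = bounding_box(chunks)
subclusters = summarize(chunks, lo, hi, bins=4)
clusters = merge_closest(subclusters, k=2)
```

Because each cell stores only a sum and a count, the summaries do not depend on the order in which the points arrive (up to floating-point rounding), which matches the order-independence property claimed in the abstract. Memory use is bounded by the number of non-empty cells rather than by the number of data points.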