LSROM: Learning Self-Refined Organizing Map for Fast Imbalanced Streaming Data Clustering (2024)


Yongqi Xu, Yujian Lee, Rong Zou, Yiqun Zhang, and Yiu-Ming Cheung
Yongqi Xu is with the School of Computer Science and Technology, Guangdong University of Technology, Guangzhou, China (e-mail: 3121004930@mail2.gdut.edu.cn). Yujian Lee is with the School of Artificial Intelligence, BNU-HKBU United International College, Zhuhai, China (e-mail: r130034019@mail.uic.edu.cn). Yiqun Zhang is with the Department of Computer Science, Hong Kong Baptist University, Hong Kong SAR, China, and also with the School of Computer Science and Technology, Guangdong University of Technology, Guangzhou, China (e-mail: yqzhang@comp.hkbu.edu.hk). Rong Zou and Yiu-Ming Cheung are with the Department of Computer Science, Hong Kong Baptist University, Hong Kong SAR, China (e-mail: {rongzou, ymc}@comp.hkbu.edu.hk). Yiqun Zhang is the corresponding author.

Abstract

Streaming data clustering is a popular research topic in the fields of data mining and machine learning. Compared to static data, streaming data, which is usually analyzed in data chunks, is more susceptible to the dynamic cluster imbalance issue. That is, the degree of cluster imbalance varies across streaming data chunks, degrading either the accuracy or the efficiency of existing clustering methods applied to streaming data analysis. We therefore propose an efficient approach called Learning Self-Refined Organizing Map (LSROM) to handle the imbalanced streaming data clustering problem, in which an advanced SOM is proposed to represent the global data distribution. The constructed SOM is first refined to guide the partition of the dataset into many micro-clusters, so that small clusters in imbalanced data are not missed. An efficient merging of the micro-clusters is then conducted through quick retrieval based on the SOM, which automatically yields the true number of imbalanced clusters. In comparison to existing imbalanced data clustering approaches, LSROM has a lower time complexity of $O(n\log n)$ while achieving very competitive clustering accuracy. Moreover, LSROM is interpretable and insensitive to hyper-parameters. Extensive experiments have verified its efficacy.

Index Terms:

Cluster analysis, streaming data, imbalanced data, self-organizing map, time complexity.

I Introduction

Streaming data is prevalent in various fields such as market research, streaming media analysis, and the Internet of Things [1, 2]. Due to the lack of readily available labels for streaming data, clustering, which gathers similar data objects into a certain number of groups, becomes indispensable for data analysis. Cluster analysis of streaming data is often conducted on a chunk-by-chunk basis to ensure sufficient statistical information. However, the non-uniform co-occurrence of samples from different distributions, along with shifts in data distributions [3], more frequently leads to the emergence of imbalanced clusters, i.e., clusters with very different numbers of objects in the same data chunk. Fig. 1 illustrates the causes of this problem, which puts forward a new clustering challenge, i.e., how to efficiently deal with the cross-coupled streaming and imbalance issues in an unsupervised learning environment.

[Fig. 1: Causes of imbalanced clusters in streaming data chunks.]

Imbalanced streaming data introduces more complex and challenging issues to clustering, manifested in two key aspects: 1) Uncertainty in the number of clusters. Due to the very different and changing sizes of clusters, it is difficult to dynamically determine an appropriate number of clusters, leading to unsatisfactory clustering accuracy. 2) Laborious computation involved in detecting imbalanced clusters. Existing imbalanced data clustering methods usually involve a two-phase process: i) partitioning the data into $m$-scale micro-clusters, and ii) considering $m^2$-scale combinations of the micro-clusters for merging, where a proper $m$ usually depends on the scale $n$ of the dataset. Such a time-consuming $O(m^2)$ process evidently prevents existing imbalanced data clustering approaches from being applied to the real-time analysis of large-scale streaming data. Subsequently, we analyze existing methods from the perspective of the above two challenges to Imbalanced Streaming Data Clustering (ISDC).

Partition-based clustering algorithms demonstrate a certain level of capability in addressing ISDC. One of the most classic methods is $k$-means [4]. Due to its simplicity and efficiency, it can be applied to the analysis of streaming data with linear complexity. However, $k$-means and its variants (e.g., [5, 6]) tend to generate balanced clusters, which makes them incapable of detecting imbalanced clusters. To specifically address the imbalance issue, the Multi-Center (MC) clustering algorithm [7] has been proposed to divide the data into several small subclusters and then successively merge the closest subclusters to a certain degree to detect imbalanced clusters. As its performance is very sensitive to initialization, a self-adaptive clustering method called SMCL [8] has been proposed, which can automatically select a reasonable number of subclusters. Nevertheless, both MC and SMCL suffer from laborious computation, as their merging processes involve exhaustively searching for the currently most similar pair of subclusters.

Density-based clustering methods determine object-cluster affiliation according to the local density of objects [9], and they demonstrate a certain degree of capability in handling imbalanced data. For instance, Density Peaks Clustering (DPC) [10] identifies data objects with relatively higher local density as cluster centers, and can thus explore smaller clusters. To achieve a more reasonable density quantification, an algorithm called FKNN-DPC [11] has been proposed, employing fuzzy weighted $k$-nearest neighbors as the density measure. With a more appropriate density measure, FKNN-DPC achieves better performance in detecting imbalanced clusters than the original DPC. Later, an approach called Local Density Peaks for Imbalanced data (LDPI) [12] has been proposed, adopting an adaptive subcluster construction scheme that forms more subclusters than true clusters, and the subclusters jointly support the detection of imbalanced clusters. However, LDPI has quadratic time complexity and is thus unsuitable for ISDC.

To achieve efficient cluster analysis on streaming data, fast clustering solutions have emerged. One of the conventional methods is StreamKM++ [13], which integrates a software acceleration architecture with $k$-means++ [14] and dynamically updates cluster centers to enhance clustering efficiency. The efficiency of density-based clustering has also been improved by adapting DPC to streaming data, which demonstrates superior clustering performance compared to other streaming data clustering methods [9]. However, dynamic DPC is not robust to hyper-parameters when dealing with non-stationary streaming data. To further address this, AMD-DPC [15] has been proposed, adopting a graph-based data structure for local density updates. This algorithm saves computation cost while remaining accurate, showing potential for real-time big data processing. However, the above-mentioned fast algorithms have yet to consider imbalanced clusters, and may thus yield unsatisfactory accuracy in ISDC.

To the best of our knowledge, clustering algorithms that are specifically designed for imbalanced data or streaming data cannot simultaneously achieve fast and accurate ISDC. Moreover, as streaming data chunks may arrive with a changing number of clusters, adaptability is also worth considering, so that an ideal ISDC approach can automatically adapt to the "true" number of clusters in different data chunks.

Therefore, we propose an approach called Learning Self-Refined Organizing Map (LSROM) to address the ISDC problem. To quickly describe the distribution of a newly streamed-in chunk for clustering, we propose a new SOM technique to form a topological representation of the data. As some nodes may be trapped by the initialized map and thus cannot well represent natural object clusters, a local topological radius metric is designed to eliminate such incompetent nodes, yielding a refined topology. Nodes in the topology are treated as cluster centers to divide the whole dataset into many micro-clusters so that small clusters are not missed, and the micro-clusters are then hierarchically merged to detect imbalanced clusters. As hierarchical merging is the most laborious phase, we also let the established topology serve as the retrieval structure, which allows only promisingly similar micro-clusters to be considered for merging. In comparison with state-of-the-art counterparts, LSROM demonstrates its superiority in ISDC, as it substantially reduces time complexity while still achieving very competitive clustering accuracy. Comprehensive experiments have been conducted to illustrate the promising performance of LSROM. The main contributions of this paper can be summarized as four-fold:

  1. A new clustering paradigm called LSROM is proposed, which thoroughly exploits the information provided by the finely trained SOM to cope with the trade-off between efficiency and the detection of imbalanced clusters, achieving accurate yet efficient ISDC.

  2. To remove the less-representative nodes trapped by the map, a node elimination strategy is designed based on the proposed measure called the local topological radius ratio. The trained SOM can be refined accordingly, providing the representation basis for fast and accurate ISDC.

  3. A rapid retrieval mechanism, which retrieves object neighbors through the topological links within the refined SOM to approximate density computation, is proposed to substantially reduce the computation cost of the laborious micro-cluster merging phase.

  4. To ensure a more convincing experimental evaluation, we design a new data generator, which simultaneously models the changing cluster number and cluster size to approximate real streaming data. It can serve as a universal experimental tool for studying ISDC.

The rest of this paper is organized as follows: Section II reviews the related work, and Section III presents the problem statement. In Section IV, we provide technical details and time complexity analysis of the proposed LSROM. Section V showcases experimental results with observations, and finally we draw a conclusion in Section VI.

II Related Work

This section will briefly introduce streaming data clustering, imbalanced data clustering, and clustering methods utilizing Self-Organizing Maps.

II-A Streaming Clustering

Data stream clustering involves categorizing data chunks into a finite set of clusters, which differs from static clustering and places higher demands on memory and time due to continuously varying chunks entering the stream. This necessitates fast incremental data processing to produce timely results. Modern data stream clustering algorithms strive to balance high accuracy with compressed time complexity.

Many of these algorithms are variations of traditional clustering methods such as $k$-means (e.g., [16], [17], [13]), DBSCAN (e.g., [18], [19]), and hierarchical clustering [20]. While these adaptations enable efficient processing, they introduce specific hyper-parameters. To circumvent the non-trivial setting of hyper-parameters, Zhang et al. propose the Davies-Bouldin Index Evolving Clustering Method (DBIECM) [21] for streaming data clustering. However, this concept cannot be directly extended to the online setting, since streaming algorithms retain neither the data nor a partition of it, both of which are required by batch cluster validity indices. Moshtaghi et al. develop two incremental versions of the Xie-Beni and Davies-Bouldin validity indices and employ them to monitor and control two streaming clustering algorithms, namely sk-means and online ellipsoidal clustering [22]. In this context, these incremental validity indices serve as performance monitoring functions. The above methods are limited to a single view, and a multiview support vector domain description (MVSVDD) model [23] has been proposed to capture cluster evolution and discover arbitrarily shaped clusters with limited computing resources. Nevertheless, handling non-stationary data distributions remains a challenge, as most algorithms struggle to maintain accuracy and robustness.

II-B Imbalanced Clustering

Imbalanced clustering, where the number of samples varies across clusters, has emerged from many real data mining applications [24]. Using prototypes is one effective approach to tackling this issue (e.g., in [25], [26], [7]). Each cluster can be represented by multiple prototype points, reducing the uniform effect [27] between subclusters. However, the above methods typically use a pre-defined number of prototypes. The issue with this strategy is that if the given number of prototypes is too small, the uniform effect may still occur between minority and majority clusters. Conversely, if the number of prototypes is too large, prototype points may reside in overlapping regions between clusters, and noise and outliers could be treated as subclusters. Furthermore, if the number of prototypes is less than the number of clusters, the clustering result can never be correct. Hence, various methods have sought to adaptively improve the pre-defined number of prototype points.

SMCL [8] achieves this by gradually adding seed points until a seed point is driven out by a competitive punishment mechanism, which effectively prevents the issue of dead units. When $k$-means is used as the clustering algorithm during learning, if no seed point is driven out, a new seed point is reintroduced. However, this approach lacks effectiveness in distinguishing small clusters from noise points. To address this, LDPI [12] designs an initial subcluster generation scheme, improving the clustering method of DPC by automatically identifying noise points and initial subcluster centers. According to the nearest-neighbor principle, the remaining points are assigned to subcluster centers to generate subclusters. MCNS [28] introduces a measure based on the reconstruction rate to select the appropriate number of clusters, enhancing convergence speed while ensuring accuracy. All of these methods seek the optimal solution through iterations, resulting in time complexities exceeding quadratic levels.

II-C SOM-Based Clustering

Clustering algorithms based on SOM [29] have garnered significant attention and research in recent years. The two-dimensional map produced by SOM is a highly suitable representation of the inherent relationships within data, making it easy to explore the intrinsic clustering structures. As a result, many improvements and extensions have been developed in this regard. Given that traditional SOM requires an initial number of neurons, leading to non-adaptive clustering precision, Rauber et al. construct a hierarchical structure (GHSOM) [30] by recursively growing and splitting neurons, partitioning the data space into finer-grained regions and gradually refining the clustering results. To go beyond the constraints of traditional numerical features, FMSOM [31] extends neural prototypes with category frequency tables to handle heterogeneous feature data without the need for binary transformation.

Olszewski et al. address data asymmetry by introducing additional asymmetric connection weights (ASOM) [32], which makes the topological structure more adaptable and reduces the topological distortion caused by random initialization. Ali Hameed et al. introduce a new intelligent adaptive learning SOM [33], deriving a new variable learning rate that adaptively achieves optimal weights, so that winning neurons are obtained with a shorter learning process than in traditional SOM. The authors aim to simultaneously achieve low quantization error and fast convergence, thereby overcoming the main drawbacks of classical SOM. Zhang et al. design a clustering method (SOFM) [34] where the learning rate and neighborhood function are adaptively adjusted based on the data distribution and changes in the feature space. Later, a self-supervised self-organizing clustering network ($\text{S}^3$OCNet) [35] is proposed to jointly learn feature extraction and feature clustering, achieving a single-stage clustering method.

III Problem Statement

The primary cause of the accuracy decrease in current streaming clustering algorithms is the difficulty of adapting to imbalanced data. To address this challenge, we introduce the concept of Imbalanced Streaming Data Clustering (ISDC). Assume data objects are continuously generated or collected, which can be more formally denoted as a data stream $S=\{\mathbf{X}^i\}_{i=1}^{N}$ comprising a series of data chunks $\mathbf{X}^i$ formed at different time-stamps $i$. Generally, data streams are considered to be unbounded (i.e., $N\rightarrow\infty$). Each data chunk consists of $n$ attribute vectors, denoted as $\mathbf{X}^i=[\mathbf{x}_j^i]_{j=1}^{n}$, describing object properties, and the attribute space can be diverse. When sampling is biased towards certain categories, there is a high probability of generating class-imbalanced data chunks, where the number of instances in the majority class significantly exceeds that in the minority class. We define this difference as the Imbalanced Ratio ($IR$), as shown in Eq. (1):

$$IR=\frac{m_a}{m_i},\qquad(1)$$

where $m_a$ and $m_i$ represent the number of objects in the majority class and the minority class, respectively. $IR=1$ indicates a perfectly balanced dataset, and larger $IR$ values signify greater data imbalance. A data chunk $\mathbf{X}^i$ with a relatively large $IR$ serves as the metric space for the clustering operation. For any choice of $Z^i=\{\mathbf{z}_1^i,\mathbf{z}_2^i,\ldots,\mathbf{z}_k^i\}\subset\mathbf{X}^i$ consisting of $k$ cluster centers, a partition of $\mathbf{X}^i$ into $k$ clusters $\varPhi(k)=\{C_1^i,C_2^i,\ldots,C_k^i\}$ is defined. Each $C_j^i$ contains all objects in $\mathbf{X}^i$ that are closer to $\mathbf{z}_j^i$ than to any other center. The objective is to select $\varPhi(k)$ in a way that minimizes the sum of squared distances (SSQ) measure [36]:

$$SSQ(\mathbf{X}^i,Z^i)=\sum_{j=1}^{k}\sum_{\mathbf{x}\in C_j^i}d(\mathbf{x},\mathbf{z}_j^i).\qquad(2)$$

Here, $d(\mathbf{x},\mathbf{z}_j^i)$ is typically the Euclidean distance between two objects. ISDC requires fast incremental processing of data objects to provide real-time results while achieving good clustering performance under the imbalanced distribution of data chunks. Therefore, it imposes more demanding requirements in terms of both time and accuracy. However, to the best of our knowledge, this problem has not yet received thorough research attention.
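To make these two quantities concrete, here is a minimal NumPy sketch (function names are ours, not from the paper) of the Imbalanced Ratio in Eq. (1) for a labeled chunk and of the SSQ objective in Eq. (2) for a candidate set of centers; for example, a chunk with class counts 400 and 100 yields $IR=4$.

```python
import numpy as np

def imbalance_ratio(labels):
    """Imbalanced Ratio of a chunk (Eq. 1): majority-class count over minority-class count."""
    counts = np.bincount(labels)
    counts = counts[counts > 0]          # ignore labels that do not occur in this chunk
    return counts.max() / counts.min()

def ssq(X, Z):
    """SSQ objective of Eq. (2): each object contributes its Euclidean distance
    to the nearest center in Z.

    X: (n, f) data chunk; Z: (k, f) cluster centers.
    """
    dists = np.linalg.norm(X[:, None, :] - Z[None, :, :], axis=2)   # (n, k) pairwise distances
    return dists.min(axis=1).sum()
```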

IV Proposed Method

This section provides a detailed description of the LSROM algorithm, which consists of three steps: 1) Intensive Partition based on Trained Topology (IP), 2) Refining Partition by Bridge Nodes (RP), and 3) Merging Micro-Clusters based on Refined Topology (MMC). Finally, we analyze the overall algorithm framework of LSROM along with its time complexity. An overview of the LSROM algorithm is shown in Fig. 2.

[Fig. 2: Overview of the LSROM algorithm.]

IV-A IP: Intensive Partition based on Trained Topology

To expedite the construction of micro-clusters, we employ reclustering, which consists of two clustering steps. In the initial clustering step, we adopt an advanced SOM architecture named Randomized Self-Organizing Map (RSOM) [37]. We assume that the mapping grid comprises $Q\times Q$ neurons, while the input imbalanced data chunk $\mathbf{X}^i\in\mathbb{R}^{n\times f}$ is used as the dataset. In this context, $\mathbf{x}_j^i\in\mathbb{R}^f$ represents the $j$-th data object within $\mathbf{X}^i$, and $\mathbf{z}_q\in\mathbb{R}^f$ represents each neuron in the topology. To obtain a random placement of neurons with a specific spectral distribution, a rapid Poisson disk sampling method for neurons of any dimension [38] is utilized; Poisson disk sampling ensures that neurons are not placed too close to each other. Subsequently, through 10 iterations of the Lloyd relaxation scheme [39], the initial placements are further rearranged to achieve a nearly centroidal Voronoi tessellation. Furthermore, the neighbors of $\mathbf{z}_q$ at step $t$ are defined as $\varOmega(\mathbf{z}_q,t)$, representing a group of neurons close to $\mathbf{z}_q$ on the grid. Throughout the entire training process, the neighborhood follows a nearest-neighbor winner-takes-all rule and shrinks over time.

During the learning phase, for each data object $\mathbf{x}_j^i$, the nearest neuron is denoted as $\hat{\mathbf{z}}_b$, where $b$ is identified as:

$$b=\underset{1\leq q\leq Q^{2}}{\arg\min}\;d(\mathbf{z}_q,\mathbf{x}_j^i).\qquad(3)$$

$\hat{\mathbf{z}}_b$ is called the Best Matching Unit (BMU), and neurons in the BMU neighborhood are updated according to:

$$\varDelta\mathbf{z}_q=\varepsilon(t)\,h_{\sigma}(t,q,b)\,(\mathbf{x}_j^i-\mathbf{z}_q),\qquad(4)$$

where $\varepsilon(t)$ is the learning rate, which decreases over time, and $h_{\sigma}(t,q,b)$ is a Gaussian kernel function used as a suppression coefficient. Under this update rule, neurons within the vicinity of the BMU are pulled towards the corresponding mapped data object, while neurons outside this neighborhood are unaffected.
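As an illustration of the learning rule in Eqs. (3)-(4), the sketch below performs one update step for a single data object; the exponential decay schedules for $\varepsilon(t)$ and $\sigma(t)$ are our own illustrative assumptions rather than the exact schedules of RSOM [37].

```python
import numpy as np

def som_step(Z, pos, x, t, T, eps0=0.5, sigma0=0.3):
    """One SOM learning step for a single object x (Eqs. 3-4).

    Z:   (Q*Q, f) neuron weight vectors in the input space.
    pos: (Q*Q, 2) neuron positions on the (randomized) map, used to measure
         neighborhood distances on the grid.
    t, T: current training step and total number of steps.
    """
    # Eq. (3): the Best Matching Unit is the neuron closest to x in the input space.
    b = np.argmin(np.linalg.norm(Z - x, axis=1))

    # Assumed exponential decay of the learning rate and neighborhood width.
    eps = eps0 * (0.01 / eps0) ** (t / T)
    sigma = sigma0 * (0.01 / sigma0) ** (t / T)

    # Gaussian neighborhood h_sigma(t, q, b), measured between map positions;
    # neurons far from the BMU receive a near-zero coefficient and barely move.
    g = np.linalg.norm(pos - pos[b], axis=1)
    h = np.exp(-(g ** 2) / (2 * sigma ** 2))

    # Eq. (4): pull neurons in the BMU neighborhood towards x.
    Z += eps * h[:, None] * (x - Z)
    return Z
```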

Since the data characteristics in the data chunk may be high-dimensional and diverse, the topological structure obtained through RSOM exhibits controlled randomness and discontinuities, enhancing readability in high-dimensional spaces. As the training converges, it eventually yields the trained neurons $Z=\{\mathbf{z}_1,\mathbf{z}_2,\ldots,\mathbf{z}_{Q^2}\}$. The SOM topological map, denoted as $G_{\text{SOM}}=\{Z,\mathbf{A}\}$, constitutes an undirected weighted graph. The adjacency matrix of connections between neurons, $\mathbf{A}=[a_{ij}]\in\mathbb{R}^{Q^2\times Q^2}$, is formed as follows:

$$a_{ij}=\begin{cases}1,&\text{if }\mathbf{z}_i\text{ and }\mathbf{z}_j\text{ are connected,}\\0,&\text{otherwise.}\end{cases}\qquad(5)$$

When $a_{ij}=1$, there exists a connecting edge between $\mathbf{z}_i$ and $\mathbf{z}_j$; when $a_{ij}=0$, there is no such edge.
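The paper does not spell out how the connectivity in Eq. (5) is derived from the randomized neuron layout. One plausible construction, sketched below purely as an assumption, connects two neurons whenever they share an edge in the Delaunay triangulation of their 2-D map positions (SciPy is assumed to be available).

```python
import numpy as np
from scipy.spatial import Delaunay

def build_adjacency(pos):
    """Adjacency matrix A of Eq. (5) from 2-D neuron positions on the map.

    Assumption: z_i and z_j are 'connected' iff they share an edge in the
    Delaunay triangulation of the neuron positions.
    """
    n = len(pos)
    A = np.zeros((n, n), dtype=int)
    for simplex in Delaunay(pos).simplices:      # each simplex is a triangle (i, j, k)
        for a in range(3):
            for b in range(a + 1, 3):
                A[simplex[a], simplex[b]] = 1
                A[simplex[b], simplex[a]] = 1
    return A
```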

For the second step of reclustering, $Z$ serves as the initial cluster centroids for $k$-means. Our criterion for fine-tuning the neurons is to minimize the objective function $P$ [40], which can be written as follows:

$$P(\mathbf{M},Z)=\sum_{j=1}^{n}\sum_{u=1}^{Q^2}m_{ju}\cdot d(\mathbf{x}_j^i,\mathbf{z}_u),\qquad(6)$$

where $\mathbf{M}$ is an $n\times Q^2$ partition matrix and $m_{ju}$ is a binary indicator variable representing whether the $j$-th data object belongs to the $u$-th cluster, satisfying:

$$m_{ju}=\begin{cases}1,&\text{if }d(\mathbf{x}_j^i,\mathbf{z}_u)<d(\mathbf{x}_j^i,\mathbf{z}_q),\ \text{for }1\leq q\leq Q^2,\ u\neq q,\\0,&\text{otherwise.}\end{cases}\qquad(7)$$

If $m_{ju}=1$, the $j$-th data object belongs to cluster $C_u$. Next, we describe the actual fine-tuning rule:

$$\mathbf{z}_u=\frac{\sum_{\mathbf{x}_j^i\in C_u}\mathbf{x}_j^i}{|C_u|}.\qquad(8)$$

Repeating the above steps for a finite number of iterations converges to a minimum of $P$ [41]. Since the fine-tuned neurons are already reasonably representative of the data objects, we define them as the micro-cluster centers $M=\{\mathbf{m}_1,\mathbf{m}_2,\ldots,\mathbf{m}_{Q^2}\}$. This completes the reclustering process composed of RSOM and $k$-means.
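A minimal sketch of this fine-tuning step (Eqs. (6)-(8)) is given below, with the trained neurons $Z$ used as the initial centers; the iteration cap and tolerance are illustrative choices.

```python
import numpy as np

def fine_tune(X, Z, max_iter=50, tol=1e-6):
    """Lloyd-style refinement of the trained neurons (Eqs. 6-8).

    X: (n, f) data chunk; Z: (Q*Q, f) trained neurons used as initial centers.
    Returns the micro-cluster centers M and each object's assignment.
    """
    M = Z.copy()
    for _ in range(max_iter):
        # Eq. (7): assign every object to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - M[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Eq. (8): move each center to the mean of its members (empty clusters keep their center).
        M_new = M.copy()
        for u in range(len(M)):
            members = X[assign == u]
            if len(members) > 0:
                M_new[u] = members.mean(axis=0)
        if np.linalg.norm(M_new - M) < tol:      # centers stopped moving: P has converged
            return M_new, assign
        M = M_new
    return M, assign
```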

IV-B RP: Refining Partition by Bridge Nodes

After obtaining the micro-cluster centers, an excessive number of them may cause micro-clusters to be dispersed within the overlapping regions among the true clusters.

Definition 1.

Given a micro-cluster center $\mathbf{m}_i\in\mathbb{R}^f$, if the local topological radius of $\mathbf{m}_i$'s neighbors is smaller than its own topological radius, i.e., $r(\mathbf{m}_i)<1$, then such a center is called a Bridge Node (BN).

A BN acts like a bridge connecting two micro-clusters that should be separated during the merging phase, which results in degraded performance. Specifically, the distribution of micro-cluster centers reflects that of the data objects, making these centers the primary reference for the merging process. If we do not eliminate BNs, there is a risk of merging two true clusters into one, deviating from the expected final number of clusters.

In the topology map generated in Section IV-A, as shown in Fig. 3(a), regions with high data-object density are mapped to regions with a high local density of micro-cluster centers. BNs appear in the blank areas between true clusters because the SOM-based topology construction forms a neural-network-like structure: during learning and updating, neurons are driven by data objects, and their driving force is limited in low-density areas, causing some neurons to be trapped there.

[Fig. 3: Micro-cluster centers before (a) and after (b) removing Bridge Nodes.]

We introduce the local topological radius ratio as a metric to determine whether a micro-cluster center is a BN. Although $k$-means is used to fine-tune the neurons into micro-cluster centers, the original topological information is not destroyed. Therefore, we use $\Omega(\mathbf{m}_i)$ to denote the adjacent micro-cluster centers of $\mathbf{m}_i$ in the topological map. If there exists an edge connecting $\mathbf{m}_i$ and $\mathbf{m}_j$, then:

$$\Omega(\mathbf{m}_i)=\{\mathbf{m}_j\,|\,a_{ij}=1\}.\qquad(9)$$

The local topological radius refers to the average Euclidean distance from a micro-cluster center to its adjacent micro-cluster centers, defined as follows:

$$\rho(\mathbf{m}_i)=\frac{1}{|\Omega(\mathbf{m}_i)|}\sum_{\mathbf{m}_j\in\Omega(\mathbf{m}_i)}\lVert\mathbf{m}_i-\mathbf{m}_j\rVert^{2},\qquad(10)$$

where $|\Omega(\mathbf{m}_i)|$ is the number of adjacent micro-cluster centers of $\mathbf{m}_i$. According to Eq. (10), we can calculate the local topological radius ratio $r(\mathbf{m}_i)$ as follows:

$$r(\mathbf{m}_i)=\frac{\sum_{\mathbf{m}_j\in\Omega(\mathbf{m}_i)}\rho(\mathbf{m}_j)}{|\Omega(\mathbf{m}_i)|\cdot\rho(\mathbf{m}_i)},\qquad(11)$$

which compares the average $\rho$ of the adjacent micro-cluster centers with $\rho(\mathbf{m}_i)$. When the former is larger, $r(\mathbf{m}_i)>1$. Intuitively, if $r(\mathbf{m}_i)<1$, the data objects represented by $\mathbf{m}_i$ are sparser than those represented by its adjacent micro-cluster centers. Therefore, using the set

$$M_r=\{\mathbf{m}_i\,|\,r(\mathbf{m}_i)<1\},\qquad(12)$$

BNs can be identified. As shown in Fig. 3(b), RP effectively eliminates BNs located in blank areas, preventing an excessive dispersion of micro-cluster centers. This removal enhances the separability of true clusters, while the primary structures represented by the remaining, unpruned micro-cluster centers are preserved. We then assign the data objects to the updated micro-cluster centers to obtain the micro-clusters.
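The node-elimination rule in Eqs. (9)-(12) reduces to a few array operations over the refined topology; a minimal NumPy sketch is given below, assuming every center has at least one topological neighbor in A.

```python
import numpy as np

def remove_bridge_nodes(M, A):
    """Identify and drop Bridge Nodes (Eqs. 9-12).

    M: (m, f) micro-cluster centers; A: (m, m) adjacency matrix of Eq. (5).
    Returns the retained centers and a boolean mask of kept nodes.
    """
    m = len(M)
    neighbors = [np.where(A[i] == 1)[0] for i in range(m)]       # Omega(m_i), Eq. (9)
    rho = np.array([
        np.mean(np.linalg.norm(M[nb] - M[i], axis=1) ** 2)       # Eq. (10): local topological radius
        for i, nb in enumerate(neighbors)
    ])
    r = np.array([
        rho[nb].mean() / rho[i]                                  # Eq. (11): local topological radius ratio
        for i, nb in enumerate(neighbors)
    ])
    keep = r >= 1.0                                              # Eq. (12): r < 1 marks a Bridge Node
    return M[keep], keep
```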

IV-C MMC: Merging Micro-Clusters based on Refined Topology

In this stage, micro-clusters are merged to automatically obtain the final clustering result. Our merging process utilizes two metrics, namely compactness and separability. We use separation to find the most suitable micro-clusters to merge, because the smaller the separation, the more similar the two micro-clusters are and the more likely they are to come from the same true cluster. The separation is calculated using a 1-D Gaussian mixture probability density function.

Suppose we want to compute the separation between two micro-clusters $C_i$ and $C_j$. The data objects $\mathbf{x}$ within $C_i$ and $C_j$ are projected onto the topological connection line as 1-D objects $\mathbf{x}'$:

$$\mathbf{x}'=\frac{(\mathbf{x}-\bar{\mathbf{m}})^{T}(\mathbf{m}_i-\mathbf{m}_j)}{\lVert\mathbf{m}_i-\mathbf{m}_j\rVert^{2}},\qquad(13)$$

where $\bar{\mathbf{m}}$ is the midpoint of the topological connection line, i.e., $\bar{\mathbf{m}}=(\mathbf{m}_i+\mathbf{m}_j)/2$. Here, $\mathbf{m}_i$ and $\mathbf{m}_j$ are the endpoints of the connection line, located at $0.5$ and $-0.5$, respectively. By substituting the statistics of $\mathbf{x}'$ into the 1-D Gaussian mixture probability density function $S(u)$, we obtain:

$$S(u)=\frac{|C_i|\,f(u\,|\,0.5,\sigma_i^{2})+|C_j|\,f(u\,|\,-0.5,\sigma_j^{2})}{|C_i|+|C_j|}.\qquad(14)$$

$f(u\,|\,0.5,\sigma_i^{2})$ is the probability density function of a Gaussian distribution, where $u$ is the variable and $\sigma_i^{2}$ is the variance for $C_i$. We define the separation between two micro-clusters as:

$$s_{ij}=\frac{1}{\underset{u\in U}{\min}\,S(u)}.\qquad(15)$$

To estimate the minimum discrete probability of $u$ within $S(u)$, we traverse the interval from $-0.5$ to $0.5$ with a step size of $0.01$, denoted as $U=\{-0.5,-0.49,\ldots,0.5\}$. When the separation between two micro-clusters is large, $\min_{u\in U}S(u)$ approaches 0 and $s_{ij}$ becomes large. Suppose the current unmerged clustering result is $\varPhi(k)$, which indicates that the current data chunk is split into $k$ micro-clusters. The following equation identifies the micro-clusters that should be merged and gives the global compactness of the clustering result for $\varPhi(k)$:

$$com_{k-1}=\underset{1<i<k}{\min}\;\underset{\substack{\mathbf{m}_j\in SOMN(\mathbf{m}_i)\\ j>i}}{\min}\;s_{ij}.\qquad(16)$$

We merge the $C_i$ and $C_j$ selected by Eq. (16) and update $\varPhi(k)$ to $\varPhi(k-1)$. The computation stops when $k=1$. The global compactness of the clustering result, denoted as $com$, represents the minimum separation value between globally adjacent clusters. Typically, when different micro-clusters are merged, their separation measure before merging should be significantly higher than after merging. Therefore, a smaller $com$ value indicates a better clustering result, as each cluster becomes more compact. To automatically determine the final number of clusters, we introduce another metric: the global separability of the clustering result. Before that, we define $\hat{\mathbf{m}}_g$ as the micro-cluster center with the maximum local topological radius, where $g$ is determined by:

$$g=\underset{1\leq q\leq Q^{2}}{\arg\max}\;\rho(\mathbf{m}_q).\qquad(17)$$

$\hat{\mathbf{m}}_g$ represents the center of micro-cluster $C_g$, which is generally distributed in the outer region of a true cluster, making it sparse and exhibiting a higher degree of dispersion from objects outside the micro-cluster. Therefore, we define:

$$sep_{k}=\sum_{\mathbf{x}_j\in C_g}\frac{q_j}{\kappa},\qquad(18)$$

where $\kappa$ is the number of neighbors considered, and $q_j$ is the number of the $\kappa$-nearest neighbors of $\mathbf{x}_j$ that do not belong to $C_g$. To efficiently compute the $\kappa$-nearest neighbors of the specified data objects, an approximate search algorithm is employed, reducing time overhead while maintaining accuracy. This is achieved by constructing a multi-level graph structure in which each level is composed of approximate Delaunay graphs; skip lists and heuristic initialization are introduced to enhance search efficiency [42].

Clearly, a smaller $sep$ value indicates a better clustering result, as each cluster exhibits a higher degree of separability. During the merging process, the $com$ value trends upward while the $sep$ value trends downward. Therefore, we can automatically determine the number of clusters as follows:

$$k^{*}=\underset{1<K<k-1}{\arg\min}\left(\frac{com_{K}}{\underset{k'}{\max}\{com_{k'}\}}+\frac{sep_{K}}{\underset{k'}{\max}\{sep_{k'}\}}\right).\qquad(19)$$

We iterate through all possible values of $K$ to find the best combination of $com$ and $sep$. To normalize the global compactness and global separability, each is divided by its maximum value so that the two terms are on comparable scales. Finally, we select the $K$ that minimizes the sum of the normalized global separability and global compactness to determine the clustering result.
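To make the merging criterion concrete, below is a condensed sketch of the separation measure in Eqs. (13)-(15) and the cluster-number selection of Eq. (19). It is illustrative only: function names are ours, SciPy's Gaussian density is assumed to be available, both micro-clusters are assumed to contain several spread-out objects, and, unlike LSROM, candidate pairs are not restricted here to topologically adjacent micro-clusters.

```python
import numpy as np
from scipy.stats import norm

def separation(Xi, Xj, mi, mj):
    """Separation s_ij between micro-clusters C_i and C_j (Eqs. 13-15)."""
    axis = mi - mj
    mbar = (mi + mj) / 2.0
    # Eq. (13): project objects onto the line joining the two centers;
    # members of C_i land near +0.5 and members of C_j near -0.5.
    pi = (Xi - mbar) @ axis / np.dot(axis, axis)
    pj = (Xj - mbar) @ axis / np.dot(axis, axis)
    ni, nj = len(Xi), len(Xj)
    u = np.arange(-0.5, 0.501, 0.01)          # U = {-0.5, -0.49, ..., 0.5}
    # Eq. (14): 1-D Gaussian mixture density along the connection line.
    S = (ni * norm.pdf(u, 0.5, pi.std()) + nj * norm.pdf(u, -0.5, pj.std())) / (ni + nj)
    return 1.0 / S.min()                      # Eq. (15)

def choose_k(com, sep):
    """Eq. (19): pick K minimizing normalized compactness plus separability.

    com and sep are dictionaries mapping a cluster number K (recorded during
    the merging process) to com_K and sep_K.
    """
    cmax, smax = max(com.values()), max(sep.values())
    return min(com, key=lambda K: com[K] / cmax + sep[K] / smax)
```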

Algorithm 1: LSROM

Input: dataset $\mathbf{X}$, the number of neurons $Q^2$.
Output: $k$ clusters.

1. IP:
   1.1 Construct RSOM from dataset $\mathbf{X}$ to generate $Q^2$ neurons according to Eqs. (3)-(4).
   1.2 Adjust the neurons to align with the centers of dataset $\mathbf{X}$ according to Eqs. (6)-(8).
2. RP:
   2.1 Calculate the local topological radius ratio of the micro-cluster centers according to Eqs. (9)-(11).
   2.2 Determine and remove BNs according to Eq. (12).
3. MMC:
   3.1 Merge micro-clusters by calculating the separation of adjacent micro-clusters according to Eqs. (13)-(15).
   3.2 Calculate the compactness and separability measures according to Eqs. (16)-(18).
   3.3 Obtain $k$ clusters at the minimum of the sum of global separability and global compactness according to Eq. (19).

IV-D Overall Algorithm and Complexity Analysis

The summary of LSROM is presented in Algorithm 1. In the initial stage, we utilize RSOM and apply reclustering with $k$-means to generate micro-cluster centers distributed across the imbalanced dataset, and we employ the resulting topological map to guide the subsequent processes. In the second stage, we compute the local topological radius ratio of each micro-cluster center and use this ratio to remove BNs. In the third stage, the updated micro-clusters are adaptively merged using two metrics, separability and compactness, and the final suitable number of clusters is quickly determined using the topological map.

Theorem 1.

Given a data chunk $X=\{\mathbf{x}_1,\mathbf{x}_2,\cdots,\mathbf{x}_n\}$, let the initial number of neurons be $Q$ with $Q\ll n$, and let the number of micro-clusters be $\hat{Q}$ with $\hat{Q}\leq Q$. The time complexity of the LSROM algorithm is $O(n\log n)$.

Proof.

In IP, the time complexity of training the RSOM neurons is $O(Qn)$, and the time complexity of fine-tuning the neurons with $k$-means is $O(Qn)$. In the RP stage, generating the adjacency matrix of micro-cluster centers has a time complexity of $O(Q^2)$, and calculating the local topological radius ratios of the micro-cluster centers takes $O(Q)$. In the MMC stage, the number of micro-clusters is reduced to $\hat{Q}$, and updating the separation degrees of adjacent clusters has a time complexity of $O(\hat{Q}^2\log\hat{Q})$. When iteratively calculating the global separability and global compactness, an approximate $k$-nearest neighbor algorithm is used, leveraging the information provided by the topological structure to avoid traversing all micro-clusters, which requires $O(\frac{n}{\hat{Q}}\log\frac{n}{\hat{Q}})$ time. Since both $Q$ and $\hat{Q}$ can be regarded as constants, the overall time complexity of the algorithm is $O(n\log n)$. ∎

V Experimental Results

Four experiments, i.e., efficiency evaluation, clustering accuracy evaluation, ablation study, and parameter sensitivity evaluation, have been conducted on ten datasets with ten counterparts (seven existing methods and three ablated versions of LSROM). In the following, we first provide the experimental settings, and then demonstrate experimental results.

TABLE I: Statistics of the ten datasets.

No. | Dataset | # Features | # Samples | IR | # Clusters
1 | lithuanian | 2 | 2.00K | 5.00 | 2
2 | banana | 2 | 2.40K | 5.00 | 2
3 | haberman | 3 | 0.31K | 2.77 | 2
4 | wpbc | 30 | 0.57K | 1.69 | 2
5 | ids2 | 2 | 1.00M | 10.00 | 5
6 | gaussian | 2 | 1.00M | 19.87 | 4
7 | Forest | 54 | 0.58M | 103.13 | 7
8 | IOT | 26 | 2.00M | 2.54 | 2
9 | Bitcoin | 8 | 1.00M | 23.19 | 2
10 | SEER | 16 | 1.00M | 3.99 | 3

Algorithm 2: Two-Layer Random Sampling (TLRS) for generating one data chunk.

1: Input: dataset $\mathbf{X}$, Imbalanced Ratio $IR$ of the dataset, the number of clusters $k$ within the dataset.
2: Output: data chunk $\mathbf{X}^t$ at the $t$-th time-stamp.
3: $k^t \sim \text{DiscreteUniform}(\mathbb{N}\cap[2,k])$.
4: $\mathfrak{n} \leftarrow$ sort the cluster sizes of $\mathbf{X}$ in ascending order.
5: for $i \leftarrow 1$ to $k^t-1$ do
6:   $IR_i^t \sim \text{DiscreteUniform}(\mathbb{Z}\cap(1,IR])$;
7: end for
8: $IR^t \leftarrow$ sort $IR^t$ in ascending order.
9: $\mathfrak{n}_1^t \leftarrow \mathfrak{n}_1$;  ▷ the quantity represented by $\mathfrak{n}_1$ is the smallest.
10: for $i \leftarrow 2$ to $k^t$ do
11:   if $IR_{i-1}^t \cdot \mathfrak{n}_{i-1} \leq \mathfrak{n}_i$ then
12:     $\mathfrak{n}_i^t \leftarrow IR_{i-1}^t \cdot \mathfrak{n}_{i-1}$;
13:   else
14:     $\mathfrak{n}_i^t \leftarrow \mathfrak{n}_i$.
15:   end if
16: end for
17: $\mathfrak{n}_i^t$ data objects are randomly taken from the $i$-th cluster to form $\mathbf{X}^t$.

V-A Experimental Settings

Ten datasets, including four synthetic and six real ones with varying sizes, dimensions, and distribution types, are adopted; their statistical details are provided in Table I. For the synthetic datasets, banana and lithuanian are generated using PRTools [43] with the thickness parameter set to 0.4 and the "non-spherical clusters" option selected. ids2 and gaussian [7] are synthesized by applying a mixture of bivariate Gaussian density functions with customized scales on the order of one million objects, resulting in spherical clusters. All real datasets are obtained from the UCI Machine Learning Repository (https://archive.ics.uci.edu). Min-max normalization is adopted to pre-process each feature so that all features share the identical value domain $[0,1]$.
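For concreteness, this preprocessing step corresponds to the standard per-feature min-max scaling, e.g.:

```python
import numpy as np

def min_max_normalize(X, eps=1e-12):
    """Scale each feature (column) of the n x d array X into [0, 1]."""
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    return (X - x_min) / (x_max - x_min + eps)  # eps guards constant features
```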


We have designed a data generation method to support the evaluation of ISDC, which can comprehensively generate imbalanced data chunks. A selected dataset serves as the basis for generating data chunks using a Two-Layer Random Sampling method (TLRS), where the two layers involve random sampling of the number of clusters and of the Imbalanced Ratio. The specific procedure is described in Algorithm 2. By repeating this algorithm $t$ times, we obtain a total of $t$ data chunks. Generating data chunks in this manner ensures the rationality of the data distribution while covering the possible imbalanced states of streaming data as comprehensively as possible. This enhances the diversity of the experimental data, allowing a better evaluation of algorithm performance under various imbalanced scenarios.
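As a reading of Algorithm 2, a minimal NumPy sketch of TLRS is given below. The function name `tlrs_chunk` and the assumption that per-object cluster labels are available in `y` are ours for illustration, not part of any released code, and the sketch assumes an integer-valued $IR \geq 2$.

```python
import numpy as np

def tlrs_chunk(X, y, IR, k, rng=None):
    """Two-Layer Random Sampling (TLRS): draw one imbalanced data chunk
    from the labeled dataset (X, y), following Algorithm 2.

    IR is the imbalanced ratio of the whole dataset, k its number of
    clusters; y holds the cluster label of each object.
    """
    rng = np.random.default_rng() if rng is None else rng

    # Layer 1: sample the number of clusters k^t for this chunk.
    k_t = rng.integers(2, k + 1)                     # DiscreteUniform([2, k])

    # Cluster sizes of X in ascending order (with matching labels).
    labels, sizes = np.unique(y, return_counts=True)
    order = np.argsort(sizes)
    labels, sizes = labels[order], sizes[order]

    # Layer 2: sample per-pair imbalanced ratios and sort ascending.
    IR_t = np.sort(rng.integers(2, int(IR) + 1, size=k_t - 1))  # Z ∩ (1, IR]

    # Chunk-wise cluster sizes n^t (lines 9-16 of Algorithm 2).
    n_t = np.empty(k_t, dtype=int)
    n_t[0] = sizes[0]
    for i in range(1, k_t):
        n_t[i] = min(IR_t[i - 1] * sizes[i - 1], sizes[i])

    # Randomly draw n^t_i objects from the i-th (ascending-size) cluster.
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == labels[i]), size=n_t[i], replace=False)
        for i in range(k_t)
    ])
    return X[idx], y[idx]
```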

TABLE II: NMI and DCV (mean ± std over 10 runs) of the compared methods on the ten datasets; "–" denotes out of memory.

Dataset | Measure | k-means | StreamKM++ | BIRCH | AMD-DPC | SMCL | MC-MNN | LDPI | LSROM
lithuanian | NMI | 0.0009±0.0002 | 0.0012±0.0008 | 0.0452±0.0012 | 0.0548±0.0037 | 1.0000±0.0000 | 1.0000±0.0000 | 1.0000±0.0000 | 1.0000±0.0000
lithuanian | DCV | 0.9815±0.0277 | 0.7452±0.0609 | 1.0228±0.0035 | 0.9289±0.0287 | 0.0000±0.0000 | 0.0000±0.0000 | 0.0000±0.0000 | 0.0000±0.0000
banana | NMI | 0.0943±0.0219 | 0.9206±0.0743 | 0.6512±0.0013 | 0.2631±0.0014 | 0.9745±0.0087 | 1.0000±0.0000 | 1.0000±0.0000 | 1.0000±0.0000
banana | DCV | 0.8221±0.0054 | 0.7995±0.0464 | 0.9195±0.0044 | 0.4057±0.0091 | 0.0791±0.0070 | 0.0000±0.0000 | 0.0000±0.0000 | 0.0000±0.0000
haberman | NMI | 0.0007±0.0001 | 0.0015±0.0001 | 0.0003±0.0000 | 0.0469±0.0030 | 0.0011±0.0002 | 0.1343±0.0041 | 0.1002±0.0017 | 0.0088±0.0012
haberman | DCV | 1.0289±0.0500 | 1.0160±0.1627 | 1.3489±0.1524 | 0.4612±0.0125 | 0.4416±0.0005 | 0.3545±0.0034 | 0.3991±0.0107 | 0.4529±0.0052
wpbc | NMI | 0.6175±0.1132 | 0.6231±0.2402 | 0.3219±0.0994 | 0.1329±0.0090 | 0.6339±0.0057 | 0.6355±0.0128 | 0.4353±0.0470 | 0.5052±0.0323
wpbc | DCV | 1.1217±0.0038 | 1.2710±0.7039 | 1.2335±0.0073 | 0.1870±0.0061 | 0.1396±0.0001 | 0.1564±0.0046 | 0.1028±0.0033 | 0.1693±0.0037
ids2 | NMI | 0.4312±0.0038 | 0.5092±0.0029 | 0.7966±0.0041 | 0.9312±0.0081 | – | – | – | 0.9648±0.0002
ids2 | DCV | 0.5337±0.0033 | 0.4766±0.0074 | 0.7919±0.0234 | 0.0312±0.0015 | – | – | – | 0.0023±0.0002
gaussian | NMI | 0.5022±0.0012 | 0.5498±0.0012 | 0.9234±0.0026 | 0.9358±0.0016 | – | – | – | 0.9498±0.0000
gaussian | DCV | 0.6092±0.0007 | 0.5720±0.0092 | 0.5253±0.0022 | 0.0406±0.0011 | – | – | – | 0.0031±0.0005
Forest | NMI | 0.1374±0.0041 | 0.1459±0.0039 | 0.1487±0.0016 | 0.2648±0.0004 | – | – | – | 0.3677±0.0020
Forest | DCV | 0.8561±0.0352 | 0.8092±0.0114 | 1.4571±0.0313 | 0.4113±0.0210 | – | – | – | 0.2901±0.0063
IOT | NMI | 0.2011±0.0091 | 0.2310±0.0017 | 0.0757±0.0015 | 0.7951±0.0008 | – | – | – | 0.9703±0.0909
IOT | DCV | 0.5099±0.0232 | 0.4971±0.0163 | 1.2376±0.0462 | 0.1463±0.0810 | – | – | – | 0.0002±0.0000
Bitcoin | NMI | 0.0138±0.0023 | 0.0139±0.0076 | 0.0201±0.0037 | 0.0192±0.0056 | – | – | – | 0.5722±0.0135
Bitcoin | DCV | 1.2386±0.0325 | 0.8639±0.0482 | 0.8257±0.0154 | 1.0395±0.0498 | – | – | – | 0.0482±0.0061
SEER | NMI | 0.2185±0.0026 | 0.2202±0.0059 | 0.2123±0.0265 | 0.0402±0.0002 | – | – | – | 0.3155±0.0251
SEER | DCV | 1.4544±0.0674 | 1.4517±0.0614 | 1.6613±0.0417 | 2.1138±0.1208 | – | – | – | 0.3252±0.0485

We compare LSROM with seven other methods: the traditional partitional clustering approach ($k$-means [4]), advanced algorithms designed for streaming data (StreamKM++ [13], BIRCH [16], AMD-DPC [15]), and state-of-the-art methods suitable for static imbalanced data (SMCL [8], MC-MNN [44], LDPI [12]). For $k$-means, StreamKM++, BIRCH, and AMD-DPC, the number of clusters needs to be specified in advance. In contrast, SMCL, MC-MNN, LDPI, and our method can adaptively determine the number of clusters without prior specification. The specific experimental parameters of our method are set as follows: the initial number of neurons $Q^2=100$ and the number of nearest neighbors $\kappa=10$. In Section V-E, we conduct a comprehensive parameter sensitivity verification; since the performance is not very sensitive to $\kappa$ and $Q$, the rationality of these settings is confirmed. The other parameters of all the compared methods are set following the recommendations in the source literature.

Two metrics, i.e., Normalized Mutual Information (NMI) [45] and the Difference of CV (DCV) index [8], are utilized for performance evaluation. NMI is an external metric reflecting the correlation between the clustering results and the given labels from the perspective of information theory; its value range is $[0,1]$, and larger is better. DCV is an internal metric specifically designed for evaluating the clustering performance on imbalanced data. It assesses the effectiveness of clustering based on the similarity between clusters and the compactness within clusters; a lower value indicates better clustering results, with greater differences between clusters and higher compactness within clusters. All the reported NMI and DCV performances are averaged over 10 runs with random initialization seeds.
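As a usage note, NMI can be computed directly from predicted and ground-truth labels, e.g., with scikit-learn; DCV follows the definition in [8] and is not restated here.

```python
from sklearn.metrics import normalized_mutual_info_score

y_true = [0, 0, 1, 1, 2, 2]   # ground-truth labels of a toy chunk
y_pred = [1, 1, 0, 0, 2, 2]   # cluster assignments (label permutation is fine)
nmi = normalized_mutual_info_score(y_true, y_pred)  # -> 1.0; range [0, 1]
```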

V-B Efficiency Evaluation

The impact of data size on the runtime of the clustering methods is studied by plotting their runtime-data size curves on the six large-scale datasets in Fig. 4. Each point in the figure reflects the execution time (y-axis) of a method under a certain amount of data (x-axis). We categorize the compared methods into three groups: 1) the proposed LSROM (in purple), 2) SMCL (in orange), MC-MNN (in green), and LDPI (in blue), and 3) existing fast streaming data clustering methods that are not dedicated to the imbalanced issue (in gray). Please note that the y-axis is visualized on a logarithmic scale because the running times of different methods vary widely.

It can be seen that the proposed LSROM has a running time similar to that of $k$-means, StreamKM++, BIRCH, and AMD-DPC thanks to their linear or $O(n\log n)$-level complexity, while the group of SMCL, MC-MNN, and LDPI incurs heavy computational cost due to their polynomial time complexity. Moreover, due to their excessive computational resource requirements, these three algorithms encounter "out of memory" situations when dealing with larger datasets.

V-C Clustering Accuracy Evaluation

Two sub-experiments have been conducted. We first compare the clustering performance of the different methods on each whole dataset in Table II to evaluate their clustering accuracy on static datasets, where the best- and second-best-performing method on each dataset is highlighted in boldface and underlined, respectively. Then we generate imbalanced streaming data chunks based on the six large-scale datasets to specifically evaluate the capability of the methods in ISDC, as shown in Fig. 5 and Fig. 6. For each of the six large-scale datasets, we generate 50 chunks by using TLRS and let them flow in one by one per time-stamp. The data volume of each chunk is determined by the maximum data capacity that SMCL can process; since SMCL takes the longest to run among all methods, as shown in Fig. 4, this ensures that all methods can run and produce results. Specifically, the chunk sizes are: $chunk_{ids2}=chunk_{gaussian}=8\times10^4$, $chunk_{Forest}=chunk_{Bitcoin}=chunk_{SEER}=5\times10^4$, and $chunk_{IOT}=4\times10^4$.


From Table II, it can be seen that the proposed LSROM performs best in general, winning in 16 out of 20 comparisons. In comparison with the fast clustering methods, i.e., $k$-means, StreamKM++, BIRCH, and AMD-DPC, LSROM demonstrates its superiority as it is competent in detecting imbalanced clusters, while the four fast methods do not adopt a mechanism that specially takes the imbalanced issue into account. AMD-DPC performs second-best in general as it adopts a density-based principle, which can somewhat distinguish imbalanced clusters as discussed in Section I. In comparison with SMCL, MC-MNN, and LDPI, which are proposed for imbalanced data, LSROM still performs the best in general: SMCL, MC-MNN, and LDPI are unavailable due to "out of memory" on the six large-scale datasets, while LSROM remains competitive on the four smaller datasets. The reason would be that LSROM inherits most merits of SMCL and further appropriately exploits the global distribution information in the intensive data partition with the help of the trained SOM topology and the partition refinement (i.e., Section IV).

It can be observed from Fig. 5 and Fig. 6 that in ISDC tasks, LSROM still significantly outperforms the other four fast clustering methods, i.e., $k$-means, StreamKM++, BIRCH, and AMD-DPC, which is consistent with the observations in Table II. Since SMCL, MC-MNN, and LDPI are executable on each chunk generated from the large-scale datasets, their ISDC performance is also reported, showing that LSROM remains very competitive in comparison with them. In summary, the proposed LSROM is superior in both accuracy and scalability in handling imbalanced streaming data of different scales.

V-D Ablation Study

To further validate the effectiveness of LSROM, we conduct ablation experiments. Since LSROM consists of three components, we ablate each module individually and report the resulting NMI, DCV, and runtime. For the IP ablation, we directly obtain neurons by training RSOM without the $k$-means fine-tuning step. For the RP ablation, we omit the step of removing BNs and thus directly merge the micro-clusters obtained by the IP. For the MMC ablation, we use the merging method of SMCL, named SGMS, instead of MMC. The results are shown in Table III. We find that, compared with the standard LSROM workflow, ablating any module leads to a certain decrease in clustering performance, demonstrating that each module contributes to achieving good clustering performance. Among these ablations, the IP ablation has the smallest impact, while the ablations of RP and MMC result in more significant drops in accuracy. When processing large-scale data, SGMS requires quadratic running time and thus causes "out of memory". Overall, the complete LSROM consistently produces the best clustering results among all ablation settings, further confirming its effectiveness.

TABLE III: Ablation study of LSROM (NMI, DCV, and runtime); ✓/✗ indicates whether the corresponding component is retained, and "–" denotes out of memory.

Dataset | IP | RP | MMC | NMI | DCV | Runtime
lithuanian | ✓ | ✓ | ✓ | 1.000±0.000 | 0.000±0.000 | 2.332±0.002
lithuanian | ✗ | ✓ | ✓ | 0.993±0.000 | 0.001±0.000 | 1.820±0.002
lithuanian | ✓ | ✗ | ✓ | 0.987±0.001 | 0.002±0.000 | 1.755±0.002
lithuanian | ✓ | ✓ | ✗ | 0.987±0.000 | 0.002±0.000 | 2.060±0.002
banana | ✓ | ✓ | ✓ | 1.000±0.000 | 0.000±0.000 | 2.206±0.001
banana | ✗ | ✓ | ✓ | 0.998±0.000 | 0.000±0.000 | 2.145±0.003
banana | ✓ | ✗ | ✓ | 0.992±0.000 | 0.002±0.000 | 2.308±0.002
banana | ✓ | ✓ | ✗ | 0.993±0.000 | 0.001±0.000 | 2.415±0.002
haberman | ✓ | ✓ | ✓ | 0.009±0.000 | 0.453±0.005 | 0.679±0.000
haberman | ✗ | ✓ | ✓ | 0.006±0.000 | 0.666±0.002 | 0.730±0.000
haberman | ✓ | ✗ | ✓ | 0.005±0.000 | 0.712±0.003 | 0.818±0.000
haberman | ✓ | ✓ | ✗ | 0.005±0.000 | 0.710±0.002 | 0.776±0.000
wpbc | ✓ | ✓ | ✓ | 0.505±0.032 | 0.169±0.003 | 1.061±0.001
wpbc | ✗ | ✓ | ✓ | 0.403±0.046 | 0.508±0.000 | 1.047±0.002
wpbc | ✓ | ✗ | ✓ | 0.010±0.000 | 1.041±0.005 | 0.949±0.002
wpbc | ✓ | ✓ | ✗ | 0.010±0.000 | 1.041±0.002 | 1.131±0.001
ids2 | ✓ | ✓ | ✓ | 0.965±0.000 | 0.002±0.000 | 102.924±1.944
ids2 | ✗ | ✓ | ✓ | 0.963±0.076 | 0.019±0.000 | 99.865±1.906
ids2 | ✓ | ✗ | ✓ | 0.892±0.023 | 0.026±0.000 | 251.019±2.137
ids2 | ✓ | ✓ | ✗ | – | – | –
gaussian | ✓ | ✓ | ✓ | 0.949±0.000 | 0.003±0.000 | 101.661±1.963
gaussian | ✗ | ✓ | ✓ | 0.922±0.000 | 0.023±0.000 | 98.137±1.958
gaussian | ✓ | ✗ | ✓ | 0.825±0.003 | 0.140±0.000 | 242.696±2.126
gaussian | ✓ | ✓ | ✗ | – | – | –
Forest | ✓ | ✓ | ✓ | 0.368±0.002 | 0.290±0.006 | 109.286±1.893
Forest | ✗ | ✓ | ✓ | 0.237±0.005 | 0.399±0.000 | 107.207±1.837
Forest | ✓ | ✗ | ✓ | 0.211±0.003 | 0.291±0.002 | 239.633±2.299
Forest | ✓ | ✓ | ✗ | – | – | –
IOT | ✓ | ✓ | ✓ | 0.970±0.096 | 0.000±0.000 | 143.023±2.060
IOT | ✗ | ✓ | ✓ | 0.828±0.006 | 0.195±0.096 | 138.871±2.136
IOT | ✓ | ✗ | ✓ | 0.777±0.006 | 0.374±0.026 | 356.471±2.055
IOT | ✓ | ✓ | ✗ | – | – | –
Bitcoin | ✓ | ✓ | ✓ | 0.572±0.013 | 0.048±0.006 | 137.653±2.019
Bitcoin | ✗ | ✓ | ✓ | 0.542±0.015 | 0.057±0.000 | 135.639±2.013
Bitcoin | ✓ | ✗ | ✓ | 0.428±0.005 | 0.129±0.002 | 301.587±2.023
Bitcoin | ✓ | ✓ | ✗ | – | – | –
SEER | ✓ | ✓ | ✓ | 0.316±0.011 | 0.325±0.002 | 135.328±2.023
SEER | ✗ | ✓ | ✓ | 0.297±0.015 | 0.621±0.036 | 134.232±2.057
SEER | ✓ | ✗ | ✓ | 0.275±0.016 | 0.614±0.022 | 297.729±2.003
SEER | ✓ | ✓ | ✗ | – | – | –

V-E Parameter Sensitivity Evaluation

In the proposed LSROM framework, there are two parameters, namely the number of neurons $Q^2$ and the neighbor count $\kappa$. Different values of these parameters can potentially affect the clustering performance in various ways, as explained below.

The parameter $Q^2$ directly determines the initial number of neurons. A value of $Q^2$ that is too large leads to an excessive number of micro-clusters, resulting in significant overlap between the final clusters: during the merging process, micro-clusters representing low-density data may be incorrectly identified as belonging to the same true cluster, and the excessive number of micro-clusters also incurs high computational cost. On the other hand, a value of $Q^2$ that is too small leads to significant differences in the data sizes contained within the micro-clusters, which can still be influenced by the uniform effect and directly degrades the quality of the LSROM framework. The parameter $\kappa$ is used in the approximate $\kappa$-nearest neighbor algorithm in MMC to calculate the global separability values of specific clusters. When $\kappa$ is too large, the computational overhead increases; conversely, when $\kappa$ is too small, the global separability values become meaningless.

We conduct experiments on the ten datasets by varying the values of $Q$ and $\kappa$ and recording the obtained NMI and DCV scores, as shown in Fig. 7. $Q$ is traversed from 3 to 12 with a step size of 1, while $\kappa$ ranges from 3 to 21 with a step size of 2. Based on the results in Fig. 7, it can be concluded that the clustering quality, as measured by NMI and DCV, remains robust across different parameter combinations. In particular, $Q$ performs well and stably between 8 and 12, and $\kappa$ performs well and stably between 7 and 15. Different $Q$ values produce larger accuracy changes than $\kappa$ does. In general, the LSROM framework demonstrates low sensitivity to parameter changes.
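The sweep itself is a plain grid search over the two parameters. The sketch below assumes a user-supplied `cluster_fn(X, Q, kappa)` wrapping one's own LSROM implementation (a hypothetical helper, not part of the released code) and scores each setting with NMI.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def sensitivity_grid(X, y_true, cluster_fn):
    """Sweep Q in {3,...,12} and kappa in {3,5,...,21}, recording NMI.

    cluster_fn(X, Q=..., kappa=...) is assumed to return cluster labels
    for the objects in X.
    """
    Qs = list(range(3, 13))          # step size 1
    kappas = list(range(3, 22, 2))   # step size 2
    scores = np.zeros((len(Qs), len(kappas)))
    for i, Q in enumerate(Qs):
        for j, kappa in enumerate(kappas):
            labels = cluster_fn(X, Q=Q, kappa=kappa)
            scores[i, j] = normalized_mutual_info_score(y_true, labels)
    return scores
```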

VI Conclusion

In this study, we propose an efficient and accurate ISDC approach called LSROM, which trains and refines an SOM to obtain an informative representation of the data. On this basis, the dataset is first partitioned into many compact micro-clusters, which are then efficiently merged to form an appropriate number of clusters. To facilitate a convincing experimental evaluation, a data generator called TLRS is designed to generate realistic imbalanced streaming data based on ten real benchmark and synthetic datasets. Compared to the seven counterparts, LSROM demonstrates its superiority in terms of both computational efficiency and clustering accuracy. Furthermore, the comprehensive ablation experiments verify the effectiveness of its key components, and the parameter sensitivity evaluation illustrates its robustness.

References

  • [1] M. M. Gaber, A. Zaslavsky, and S. Krishnaswamy, "Mining data streams: A review," SIGMOD Rec., vol. 34, no. 2, pp. 18-26, 2005.
  • [2] L. Zhao, Z. Chen, Y. Yang, L. Zou, and Z. J. Wang, "ICFS clustering with multiple representatives for large data," IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 3, pp. 728-738, 2019.
  • [3] L. Wang, H. Zhu, J. Meng, and W. He, "Incremental local distribution-based clustering using Bayesian adaptive resonance theory," IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 11, pp. 3496-3504, 2019.
  • [4] J. A. Hartigan and M. A. Wong, "A k-means clustering algorithm," Journal of the Royal Statistical Society: Series C (Applied Statistics), vol. 28, no. 1, pp. 100-108, 1979.
  • [5] S. C. Ahalt, A. K. Krishnamurthy, P. Chen, and D. E. Melton, "Competitive learning algorithms for vector quantization," Neural Networks, vol. 3, no. 3, pp. 277-290, 1990.
  • [6] L. Xu, A. Krzyzak, and E. Oja, "Rival penalized competitive learning for clustering analysis, RBF net, and curve detection," IEEE Transactions on Neural Networks, vol. 4, no. 4, pp. 636-649, 1993.
  • [7] J. Liang, L. Bai, C. Dang, and F. Cao, "The k-means-type algorithms versus imbalanced data distributions," IEEE Transactions on Fuzzy Systems, vol. 20, no. 4, pp. 728-745, 2012.
  • [8] Y. Lu, Y.-M. Cheung, and Y. Y. Tang, "Self-adaptive multiprototype-based competitive learning approach: A k-means-type algorithm for imbalanced data clustering," IEEE Transactions on Cybernetics, vol. 51, no. 3, pp. 1598-1612, 2021.
  • [9] R. Liu, H. Wang, and X. Yu, "Shared-nearest-neighbor-based clustering by fast search and find of density peaks," Information Sciences, vol. 450, pp. 200-226, 2018.
  • [10] A. Rodriguez and A. Laio, "Clustering by fast search and find of density peaks," Science, vol. 344, no. 6191, pp. 1492-1496, 2014.
  • [11] J. Xie, H. Gao, W. Xie, X. Liu, and P. W. Grant, "Robust clustering by detecting density peaks and assigning points based on fuzzy weighted k-nearest neighbors," Information Sciences, vol. 354, pp. 19-40, 2016.
  • [12] W. Tong, Y. Wang, and D. Liu, "An adaptive clustering algorithm based on local-density peaks for imbalanced data without parameters," IEEE Transactions on Knowledge and Data Engineering, vol. 35, no. 4, pp. 3419-3432, 2023.
  • [13] M. R. Ackermann, M. Märtens, C. Raupach, K. Swierkot, C. Lammersen, and C. Sohler, "StreamKM++: A clustering algorithm for data streams," ACM J. Exp. Algorithmics, vol. 17, 2012.
  • [14] D. Arthur and S. Vassilvitskii, "k-means++: The advantages of careful seeding," in Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 2007, pp. 1027-1035.
  • [15] D. Amagata, "Scalable and accurate density-peaks clustering on fully dynamic data," in Proceedings of 2022 IEEE International Conference on Big Data, 2022, pp. 445-454.
  • [16] T. Zhang, R. Ramakrishnan, and M. Livny, "BIRCH: A new data clustering algorithm and its applications," Data Mining and Knowledge Discovery, vol. 1, pp. 141-182, 1997.
  • [17] C. C. Aggarwal, P. S. Yu, J. Han, and J. Wang, "A framework for clustering evolving data streams," in Proceedings of 2003 VLDB Conference, 2003, pp. 81-92.
  • [18] Y. Chen and L. Tu, "Density-based clustering for real-time stream data," in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2007, pp. 133-142.
  • [19] F. Cao, M. Estert, W. Qian, and A. Zhou, "Density-based clustering over an evolving data stream with noise," in Proceedings of the 2006 SIAM International Conference on Data Mining, 2006, pp. 328-339.
  • [20] P. P. Rodrigues, J. Gama, and J. Pedroso, "Hierarchical clustering of time-series data streams," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 5, pp. 615-627, 2008.
  • [21] K.-S. Zhang, L. Zhong, L. Tian, X.-Y. Zhang, and L. Li, "DBIECM-an evolving clustering method for streaming data clustering," AMSE J, vol. 60, no. 1, pp. 239-254, 2017.
  • [22] M. Moshtaghi, J. C. Bezdek, S. M. Erfani, C. Leckie, and J. Bailey, "Online cluster validity indices for performance monitoring of streaming data clustering," International Journal of Intelligent Systems, vol. 34, no. 4, pp. 541-563, 2019.
  • [23] L. Huang, C.-D. Wang, H.-Y. Chao, and P. S. Yu, "MVStream: Multiview data stream clustering," IEEE Transactions on Neural Networks and Learning Systems, vol. 31, no. 9, pp. 3482-3496, 2020.
  • [24] J. Zhang, H. Tao, and C. Hou, "Imbalanced clustering with theoretical learning bounds," IEEE Transactions on Knowledge and Data Engineering, vol. 35, no. 9, pp. 9598-9612, 2023.
  • [25] M. Liu, X. Jiang, and A. C. Kot, "A multi-prototype clustering algorithm," Pattern Recognition, vol. 42, no. 5, pp. 689-698, 2009.
  • [26] T. Luo, C. Zhong, H. Li, and X. Sun, "A multi-prototype clustering algorithm based on minimum spanning tree," in Proceedings of 2010 Seventh International Conference on Fuzzy Systems and Knowledge Discovery, vol. 4, 2010, pp. 1602-1607.
  • [27] H. Xiong, J. Wu, and J. Chen, "K-means clustering versus validation measures: A data-distribution perspective," IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 39, no. 2, pp. 318-331, 2009.
  • [28] D. Li, S. Zhou, T. Zeng, and R. H. Chan, "Multi-prototypes convex merging based k-means clustering algorithm," arXiv preprint arXiv:2302.07045, 2023.
  • [29] T. Kohonen, "The self-organizing map," Proceedings of the IEEE, vol. 78, no. 9, pp. 1464-1480, 1990.
  • [30] A. Rauber, D. Merkl, and M. Dittenbach, "The growing hierarchical self-organizing map: Exploratory analysis of high-dimensional data," IEEE Transactions on Neural Networks, vol. 13, no. 6, pp. 1331-1341, 2002.
  • [31] C. Del Coso, D. Fustes, C. Dafonte, F. J. Nóvoa, J. M. Rodríguez-Pedreira, and B. Arcay, "Mixing numerical and categorical data in a self-organizing map by means of frequency neurons," Applied Soft Computing, vol. 36, pp. 246-254, 2015.
  • [32] D. Olszewski, "Asymmetric k-means clustering of the asymmetric self-organizing map," Neural Processing Letters, vol. 43, no. 1, pp. 231-253, 2016.
  • [33] A. A. Hameed, B. Karlik, M. S. Salman, and G. Eleyan, "Robust adaptive learning approach to self-organizing maps," Knowledge-Based Systems, vol. 171, pp. 25-36, 2019.
  • [34] Y. Zhang, M. Simsek, and B. Kantarci, "Empowering self-organized feature maps for AI-enabled modeling of fake task submissions to mobile crowdsensing platforms," IEEE Internet of Things Journal, vol. 8, no. 3, pp. 1334-1346, 2020.
  • [35] S. Li, F. Liu, L. Jiao, P. Chen, and L. Li, "Self-supervised self-organizing clustering network: A novel unsupervised representation learning method," IEEE Transactions on Neural Networks and Learning Systems, pp. 1-15, 2022.
  • [36] R. Zhu, Z. Wang, Z. Ma, G. Wang, and J.-H. Xue, "LRID: A new metric of multi-class imbalance degree based on likelihood-ratio test," Pattern Recognition Letters, vol. 116, pp. 36-42, 2018.
  • [37] N. P. Rougier and G. I. Detorakis, "Randomized self-organizing map," Neural Computation, vol. 33, no. 8, pp. 2241-2273, 2021.
  • [38] R. Bridson, "Fast Poisson disk sampling in arbitrary dimensions," SIGGRAPH Sketches, vol. 10, no. 1, 2007.
  • [39] S. Lloyd, "Least squares quantization in PCM," IEEE Transactions on Information Theory, vol. 28, no. 2, pp. 129-137, 1982.
  • [40] X. Huang, Y. Ye, and H. Zhang, "Extensions of kmeans-type algorithms: A new clustering framework by integrating intracluster compactness and intercluster separation," IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 8, pp. 1433-1446, 2014.
  • [41] J. Huang, M. Ng, H. Rong, and Z. Li, "Automated variable weighting in k-means type clustering," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 657-668, 2005.
  • [42] Y. A. Malkov and D. A. Yashunin, "Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 4, pp. 824-836, 2020.
  • [43] R. P. Duin, P. Juszczak, P. Paclik, E. Pekalska, D. De Ridder, D. M. Tax, and S. Verzakov, "PRTools4.1, a Matlab toolbox for pattern recognition," Delft University of Technology, vol. 2600, 2007.
  • [44] W. Tong, Y. Wang, D. Liu, and X. Guo, "A multi-center clustering algorithm based on mutual nearest neighbors for arbitrarily distributed data," Integrated Computer-Aided Engineering, vol. 29, no. 3, pp. 259-275, 2022.
  • [45] A. Fahad, N. Alshatri, Z. Tari, A. Alamri, I. Khalil, A. Y. Zomaya, S. Foufou, and A. Bouras, "A survey of clustering algorithms for big data: Taxonomy and empirical analysis," IEEE Transactions on Emerging Topics in Computing, vol. 2, no. 3, pp. 267-279, 2014.
Yongqi Xu is with the School of Computer Science and Technology, Guangdong University of Technology, Guangzhou, China. His current research interests include unsupervised machine learning and streaming data mining.
Yujian Lee is with the School of Artificial Intelligence, Beijing Normal University-Hong Kong Baptist University United International College, Zhuhai, China. His current research interests include unsupervised machine learning and deep clustering.
Rong Zou received the B.Eng. degree from Xiamen University in 2016, and the M.S. degree from Hong Kong Baptist University in 2017. He is currently pursuing a Ph.D. at Hong Kong Baptist University. His current research interests include machine learning, image processing, and their applications.
Yiqun Zhang (Senior Member, IEEE) received the B.Eng. degree from South China University of Technology in 2013, and the M.S. and Ph.D. degrees from Hong Kong Baptist University in 2014 and 2019, respectively. He is currently an associate professor at Guangdong University of Technology. His research works have been published in top-tier journals and conferences, including TPAMI, TCYB, TNNLS, AAAI, and IJCAI. His current research interests include machine learning and data mining. Dr. Zhang serves as an Associate Editor for the IEEE Transactions on Emerging Topics in Computational Intelligence.
Yiu-ming Cheung (Fellow, IEEE) received the Ph.D. degree from the Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong. He is currently a Chair Professor with the Department of Computer Science, Hong Kong Baptist University, Hong Kong. His research interests include machine learning, data science, visual computing, and optimization. Prof. Cheung is a Fellow of the American Association for the Advancement of Science (AAAS), the Institution of Engineering and Technology (IET), and the British Computer Society (BCS). He is also an awardee of the RGC Senior Research Fellow Scheme. He is the Editor-in-Chief of the IEEE Transactions on Emerging Topics in Computational Intelligence, and an Associate Editor for several prestigious journals, including IEEE Transactions on Cybernetics, IEEE Transactions on Cognitive and Developmental Systems, Pattern Recognition, Pattern Recognition Letters, and Neurocomputing, to name a few. For more details, please refer to: https://www.comp.hkbu.edu.hk/~ymc.