LSROM: Learning Self-Refined Organizing Map for Fast Imbalanced Streaming Data Clustering (2024)


Yongqi Xu, Yujian Lee, Rong Zou, Yiqun Zhang, and Yiu-Ming Cheung
Yongqi Xu is with the School of Computer Science and Technology, Guangdong University of Technology, Guangzhou, China (e-mail: 3121004930@mail2.gdut.edu.cn). Yujian Lee is with the School of Artificial Intelligence, BNU-HKBU United International College, Zhuhai, China (e-mail: r130034019@mail.uic.edu.cn). Yiqun Zhang is with the Department of Computer Science, Hong Kong Baptist University, Hong Kong SAR, China, and also with the School of Computer Science and Technology, Guangdong University of Technology, Guangzhou, China (e-mail: yqzhang@comp.hkbu.edu.hk). Rong Zou and Yiu-Ming Cheung are with the Department of Computer Science, Hong Kong Baptist University, Hong Kong SAR, China (e-mail: {rongzou, ymc}@comp.hkbu.edu.hk). Yiqun Zhang is the corresponding author.

Abstract

Streaming data clustering is a popular research topic in the fields of data mining and machine learning. Compared to static data, streaming data, which is usually analyzed in data chunks, is more susceptible to the dynamic cluster imbalance issue. That is, the degree of cluster imbalance varies across streaming data chunks, degrading either the accuracy or the efficiency of existing clustering methods applied to streaming data analysis. We therefore propose an efficient approach called Learning Self-Refined Organizing Map (LSROM) to handle the imbalanced streaming data clustering problem, in which an advanced SOM is proposed to represent the global data distribution. The constructed SOM is first refined to guide the partition of the dataset into many micro-clusters, so that small clusters in imbalanced data are not missed. An efficient merging of the micro-clusters is then conducted through quick retrieval based on the SOM, which automatically yields the true number of imbalanced clusters. In comparison to existing imbalanced data clustering approaches, LSROM has a lower time complexity of $O(n\log n)$ while achieving very competitive clustering accuracy. Moreover, LSROM is interpretable and insensitive to hyper-parameters. Extensive experiments have verified its efficacy.

Index Terms:

Cluster analysis, streaming data, imbalanced data, self-organizing map, time complexity.

I Introduction

Streaming data is prevalent in various fields such as market research, streaming media analysis, and the Internet of Things [1, 2]. Due to the lack of readily available labels for streaming data, clustering, which gathers similar data objects into a certain number of groups, becomes indispensable for data analysis. Cluster analysis of streaming data is often conducted on a chunk-by-chunk basis to ensure sufficient statistical information. However, the non-uniform co-occurrence of samples from different distributions, along with shifts in data distributions [3], more frequently leads to the emergence of imbalanced clusters, i.e., clusters with very different numbers of objects in the same data chunk. Fig. 1 illustrates the causes of this problem, which puts forward a new clustering challenge, i.e., how to efficiently deal with the cross-coupled streaming and imbalance issues in an unsupervised learning environment.

[Fig. 1: Causes of imbalanced clusters in streaming data chunks.]

Imbalanced streaming data introduces more complex and challenging issues to clustering, manifested in two key aspects: 1) Uncertainty in the number of clusters. Due to the very different and changing sizes of clusters, it is difficult to dynamically determine an appropriate number of clusters, leading to unsatisfactory clustering accuracy. 2) Laborious computation involved in detecting imbalanced clusters. Existing imbalanced data clustering methods usually involve a two-phase process: i) partitioning the data into $m$-scale micro-clusters, and ii) considering $m^2$-scale combinations of the micro-clusters for merging, where a proper $m$ usually depends on the scale $n$ of the dataset. Such a time-consuming $O(m^2)$ process evidently prevents existing imbalanced data clustering approaches from being applied to the real-time analysis of large-scale streaming data. Subsequently, we analyze existing methods from the perspective of the above two challenges to Imbalanced Streaming Data Clustering (ISDC).

Partition-based clustering algorithms demonstrate a certain level of capability in addressing ISDC. One of the most classic methods is $k$-means [4]. Due to its simplicity and efficiency, it can be applied to the analysis of streaming data with linear complexity. However, $k$-means and its variants (e.g., [5, 6]) tend to generate balanced clusters, which makes them incapable of detecting imbalanced clusters. To specifically address the imbalance issue, the Multi-Center (MC) clustering algorithm [7] has been proposed to divide the data into several small subclusters and then successively merge the closest subclusters to a certain degree to detect imbalanced clusters. As its performance is very sensitive to initialization, a self-adaptive clustering method called SMCL [8] has been proposed, which can automatically select a reasonable number of subclusters. Nevertheless, both MC and SMCL suffer from laborious computation, as their merging processes involve exhaustively searching for the currently most similar pair of subclusters.

Density-based clustering methods determine object-cluster affiliation according to the local density of objects [9], and they demonstrate a certain degree of capability in handling imbalanced data. For instance, Density Peaks Clustering (DPC) [10] identifies data objects with relatively higher local density as cluster centers, and can thus explore smaller clusters. To achieve a more reasonable density quantification, an algorithm called FKNN-DPC [11] has been proposed, employing fuzzy weighted $k$-nearest neighbors as the density measure. With a more appropriate density measure, FKNN-DPC achieves better performance in detecting imbalanced clusters than the original DPC. Later, an approach called Local Density Peaks for Imbalanced data (LDPI) [12] has been proposed, adopting an adaptive subcluster construction scheme that forms more subclusters than true clusters, and the subclusters jointly support the detection of imbalanced clusters. However, LDPI has quadratic time complexity and is thus unsuitable for ISDC.

To achieve efficient cluster analysis on streaming data, fast clustering solutions have emerged. One of the conventional methods is StreamKM++ [13], which integrates a software acceleration architecture with $k$-means++ [14] and dynamically updates cluster centers to enhance clustering efficiency. The efficiency of density-based clustering has also been improved by adapting DPC to streaming data, which demonstrates superior clustering performance compared to other streaming data clustering methods [9]. However, dynamic DPC is not robust to hyper-parameters when dealing with non-stationary streaming data. To further address this, AMD-DPC [15] has been proposed, adopting a graph-based data structure for local density updates. This algorithm saves computation cost while remaining accurate, showing potential for real-time big data processing. However, the above-mentioned fast algorithms have yet to consider imbalanced clusters, and may thus yield unsatisfactory accuracy in ISDC.

To the best of our knowledge, clustering algorithms that are specifically designed for imbalanced data or streaming data cannot simultaneously achieve fast and accurate ISDC. Moreover, as streaming data chunks may arrive with a changing number of clusters, adaptability is also worth considering, so that an ideal ISDC approach can automatically adapt to the "true" number of clusters in different data chunks.

Therefore, we propose an approach called Learning Self-Refined Organizing Map (LSROM) to address the ISDC problem. To quickly describe the distribution of a newly streamed-in chunk for clustering, we propose a new SOM technique to form a topological representation of the data. As some nodes may be trapped by the initialized map and thus cannot well represent natural object clusters, a local topological radius metric is designed to eliminate such incompetent nodes, yielding a refined topology. Nodes in the topology are treated as cluster centers to divide the whole dataset into many micro-clusters so that small clusters are not missed, and the micro-clusters are then hierarchically merged to detect imbalanced clusters. As hierarchical merging is the most laborious phase, we also let the established topology serve as the retrieval structure, which allows only promisingly similar micro-clusters to be considered for merging. In comparison with state-of-the-art counterparts, LSROM demonstrates its superiority in ISDC, as it substantially reduces time complexity while still achieving very competitive clustering accuracy. Comprehensive experiments have been conducted to illustrate the promising performance of LSROM. The main contributions of this paper can be summarized as four-fold:

  1. A new clustering paradigm called LSROM is proposed, which thoroughly exploits the information provided by the finely trained SOM to cope with the trade-off between efficiency and the detection of imbalanced clusters, achieving accurate yet efficient ISDC.

  2. To remove the less-representative nodes trapped by the map, a node elimination strategy is designed based on the proposed measure called the local topological radius ratio. The trained SOM can be refined accordingly, providing the representation basis for fast and accurate ISDC.

  3. A rapid retrieval mechanism, which retrieves object neighbors through the topological links within the refined SOM to approximate density computation, is proposed to substantially reduce the computation cost of the laborious micro-cluster merging phase.

  4. To ensure a more convincing experimental evaluation, we design a new data generator, which simultaneously models the changing cluster number and cluster size to approximate real streaming data. It can serve as a universal experimental tool for studying ISDC.

The rest of this paper is organized as follows: Section II reviews the related work, and Section III presents the problem statement. In Section IV, we provide technical details and time complexity analysis of the proposed LSROM. Section V showcases experimental results with observations, and finally we draw a conclusion in Section VI.

II Related Work

This section will briefly introduce streaming data clustering, imbalanced data clustering, and clustering methods utilizing Self-Organizing Maps.

II-A Streaming Clustering

Data stream clustering involves categorizing data chunks into a finite set of clusters, which differs from static clustering and places higher demands on memory and time due to continuously varying chunks entering the stream. This necessitates fast incremental data processing to produce timely results. Modern data stream clustering algorithms strive to balance high accuracy with compressed time complexity.

Many of these algorithms are variations of traditional clustering methods such as $k$-means (e.g., [16], [17], [13]), DBSCAN (e.g., [18], [19]), and hierarchical clustering [20]. While these adaptations enable efficient processing, they introduce specific hyper-parameters. To circumvent the non-trivial setting of hyper-parameters, Zhang et al. propose the Davies-Bouldin Index Evolving Clustering Method (DBIECM) [21] for streaming data clustering. However, this concept cannot be directly extended to the online setting, since streaming algorithms retain neither the data nor a partition of it, both of which are required by batch cluster validity indices. Moshtaghi et al. develop two incremental versions of the Xie-Beni and Davies-Bouldin validity indices and employ them to monitor and control two streaming clustering algorithms, namely sk-means and online ellipsoidal clustering [22]. In this context, these incremental validity indices serve as performance monitoring functions. The above methods are limited to a single view, and a multiview support vector domain description (MVSVDD) model [23] has been proposed to capture cluster evolution and discover arbitrarily shaped clusters with limited computing resources. Nevertheless, handling non-stationary data distributions remains a challenge, as most algorithms struggle to maintain accuracy and robustness.

II-B Imbalanced Clustering

Imbalanced clustering, where the number of samples varies across clusters, has emerged from many real data mining applications [24]. Using prototypes is one effective approach to tackling this issue (e.g., in [25], [26], [7]). Each cluster can be represented by multiple prototype points, reducing the uniform effect [27] between subclusters. However, the above methods typically use a pre-defined number of prototypes. The issue with this strategy is that if the given number of prototypes is too small, the uniform effect may still occur between minority and majority clusters. Conversely, if the number of prototypes is too large, prototype points may reside in overlapping regions between clusters, and noise and outliers could be treated as subclusters. Furthermore, if the number of prototypes is less than the number of clusters, the clustering result can never be correct. Hence, various methods have sought to adaptively improve the pre-defined number of prototype points.

SMCL [8] achieves this by gradually adding seed points until a seed point is driven out by a competitive punishment mechanism, which effectively prevents the issue of dead units. When $k$-means is used as the clustering algorithm during learning, if no seed point is driven out, a new seed point is reintroduced. However, this approach lacks effectiveness in distinguishing small clusters from noise points. To address this, LDPI [12] designs an initial subcluster generation scheme, improving the clustering method of DPC by automatically identifying noise points and initial subcluster centers. According to the nearest-neighbor principle, the remaining points are assigned to subcluster centers to generate subclusters. MCNS [28] introduces a measure based on the reconstruction rate to select the appropriate number of clusters, enhancing convergence speed while ensuring accuracy. All of these methods seek the optimal solution through iterations, resulting in time complexities exceeding quadratic levels.

II-C SOM-Based Clustering

Clustering algorithms based on SOM [29] have garnered significant attention and research in recent years. The two-dimensional map produced by SOM is a highly suitable representation of the inherent relationships within data, making it easy to explore the intrinsic clustering structures. As a result, many improvements and extensions have been developed in this regard. Given that traditional SOM requires an initial number of neurons, leading to non-adaptive clustering precision, Rauber et al. construct a hierarchical structure (GHSOM) [30] by recursively growing and splitting neurons, partitioning the data space into finer-grained regions and gradually refining the clustering results. To go beyond the constraints of traditional numerical features, FMSOM [31] extends neural prototypes with category frequency tables to handle heterogeneous feature data without the need for binary transformation.

Olszewski et al. address data asymmetry by introducing additional asymmetric connection weights (ASOM) [32], which makes the topological structure more adaptable and reduces the topological distortion caused by random initialization. Ali Hameed et al. introduce a new intelligent adaptive learning SOM [33], deriving a new variable learning rate that adaptively achieves optimal weights, so that winning neurons are obtained with a shorter learning process than in traditional SOM. The authors aim to simultaneously achieve low quantization error and fast convergence, thereby overcoming the main drawbacks of classical SOM. Zhang et al. design a clustering method (SOFM) [34] where the learning rate and neighborhood function are adaptively adjusted based on the data distribution and changes in the feature space. Later, a self-supervised self-organizing clustering network ($\text{S}^3$OCNet) [35] is proposed to jointly learn feature extraction and feature clustering, achieving a single-stage clustering method.

III Problem Statement

The primary cause of the accuracy decrease in current streaming clustering algorithms is the difficulty of adapting to imbalanced data. To address this challenge, we introduce the concept of Imbalanced Streaming Data Clustering (ISDC). Assume data objects are continuously generated or collected, which can be more formally denoted as a data stream $S=\{\mathbf{X}^i\}_{i=1}^{N}$ comprising a series of data chunks $\mathbf{X}^i$ formed at different time-stamps $i$. Generally, data streams are considered to be unbounded (i.e., $N\rightarrow\infty$). Each data chunk consists of $n$ attribute vectors, denoted as $\mathbf{X}^i=[\mathbf{x}_j^i]_{j=1}^{n}$, describing object properties, and the attribute space can be diverse. When sampling is biased towards certain categories, there is a high probability of generating class-imbalanced data chunks, where the number of instances in the majority class significantly exceeds that in the minority class. We define this difference as the Imbalanced Ratio ($IR$), as shown in Eq. (1):

$$IR=\frac{m_a}{m_i},\qquad(1)$$

where $m_a$ and $m_i$ represent the number of objects in the majority class and the minority class, respectively. $IR=1$ indicates a perfectly balanced dataset, and larger $IR$ values signify greater data imbalance. A data chunk $\mathbf{X}^i$ with a relatively large $IR$ serves as the metric space for the clustering operation. For any choice of $Z^i=\{\mathbf{z}_1^i,\mathbf{z}_2^i,\ldots,\mathbf{z}_k^i\}\subset\mathbf{X}^i$ consisting of $k$ cluster centers, a partition of $\mathbf{X}^i$ into $k$ clusters $\varPhi(k)=\{C_1^i,C_2^i,\ldots,C_k^i\}$ is defined. Each $C_j^i$ contains all objects in $\mathbf{X}^i$ that are closer to $\mathbf{z}_j^i$ than to any other center. The objective is to select $\varPhi(k)$ in a way that minimizes the sum of squared distances (SSQ) measure [36]:

$$SSQ(\mathbf{X}^i,Z^i)=\sum_{j=1}^{k}\sum_{\mathbf{x}\in C_j^i}d(\mathbf{x},\mathbf{z}_j^i).\qquad(2)$$

Here, $d(\mathbf{x},\mathbf{z}_j^i)$ is typically the Euclidean distance between two objects. ISDC requires fast incremental processing of data objects to provide real-time results while achieving good clustering performance under the imbalanced distribution of data chunks. Therefore, it imposes more demanding requirements in terms of both time and accuracy. However, to the best of our knowledge, this problem has not yet received thorough research attention.
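To make these two quantities concrete, here is a minimal NumPy sketch (function names are ours, not from the paper) of the Imbalanced Ratio in Eq. (1) for a labeled chunk and of the SSQ objective in Eq. (2) for a candidate set of centers; for example, a chunk with class counts 400 and 100 yields $IR=4$.

```python
import numpy as np

def imbalance_ratio(labels):
    """Imbalanced Ratio of a chunk (Eq. 1): majority-class count over minority-class count."""
    counts = np.bincount(labels)
    counts = counts[counts > 0]          # ignore labels that do not occur in this chunk
    return counts.max() / counts.min()

def ssq(X, Z):
    """SSQ objective of Eq. (2): each object contributes its Euclidean distance
    to the nearest center in Z.

    X: (n, f) data chunk; Z: (k, f) cluster centers.
    """
    dists = np.linalg.norm(X[:, None, :] - Z[None, :, :], axis=2)   # (n, k) pairwise distances
    return dists.min(axis=1).sum()
```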

IV Proposed Method

This section provides a detailed description of the LSROM algorithm, which consists of three steps: 1) Intensive Partition based on Trained Topology (IP), 2) Refining Partition by Bridge Nodes (RP), and 3) Merging Micro-Clusters based on Refined Topology (MMC). Finally, we analyze the overall algorithm framework of LSROM along with its time complexity. An overview of the LSROM algorithm is shown in Fig. 2.

[Fig. 2: Overview of the LSROM algorithm.]

IV-A IP: Intensive Partition based on Trained Topology

To expedite the construction of micro-clusters, we employ reclustering, which consists of two clustering steps. In the initial clustering step, we adopt an advanced SOM architecture named Randomized Self-Organizing Map (RSOM) [37]. We assume that the mapping grid comprises $Q\times Q$ neurons, while the input imbalanced data chunk $\mathbf{X}^i\in\mathbb{R}^{n\times f}$ is used as the dataset. In this context, $\mathbf{x}_j^i\in\mathbb{R}^f$ represents the $j$-th data object within $\mathbf{X}^i$, and $\mathbf{z}_q\in\mathbb{R}^f$ represents each neuron in the topology. To obtain a random placement of neurons with a specific spectral distribution, a rapid Poisson disk sampling method for neurons of any dimension [38] is utilized; Poisson disk sampling ensures that neurons are not placed too close to each other. Subsequently, through 10 iterations of the Lloyd relaxation scheme [39], the initial placements are further rearranged to achieve a nearly centroidal Voronoi tessellation. Furthermore, the neighbors of $\mathbf{z}_q$ at step $t$ are defined as $\varOmega(\mathbf{z}_q,t)$, representing a group of neurons close to $\mathbf{z}_q$ on the grid. Throughout the entire training process, the neighborhood follows a nearest-neighbor winner-takes-all rule and shrinks over time.

During the learning phase, for each data object $\mathbf{x}_j^i$, the nearest neuron is denoted as $\hat{\mathbf{z}}_b$, where $b$ is identified as:

$$b=\underset{1\leq q\leq Q^{2}}{\arg\min}\;d(\mathbf{z}_q,\mathbf{x}_j^i).\qquad(3)$$

$\hat{\mathbf{z}}_b$ is called the Best Matching Unit (BMU), and neurons in the BMU neighborhood are updated according to:

$$\varDelta\mathbf{z}_q=\varepsilon(t)\,h_{\sigma}(t,q,b)\,(\mathbf{x}_j^i-\mathbf{z}_q),\qquad(4)$$

where $\varepsilon(t)$ is the learning rate, which decreases over time, and $h_{\sigma}(t,q,b)$ is a Gaussian kernel function used as a suppression coefficient. Under this update rule, neurons within the vicinity of the BMU are pulled towards the corresponding mapped data object, while neurons outside this neighborhood are unaffected.
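As an illustration of the learning rule in Eqs. (3)-(4), the sketch below performs one update step for a single data object; the exponential decay schedules for $\varepsilon(t)$ and $\sigma(t)$ are our own illustrative assumptions rather than the exact schedules of RSOM [37].

```python
import numpy as np

def som_step(Z, pos, x, t, T, eps0=0.5, sigma0=0.3):
    """One SOM learning step for a single object x (Eqs. 3-4).

    Z:   (Q*Q, f) neuron weight vectors in the input space.
    pos: (Q*Q, 2) neuron positions on the (randomized) map, used to measure
         neighborhood distances on the grid.
    t, T: current training step and total number of steps.
    """
    # Eq. (3): the Best Matching Unit is the neuron closest to x in the input space.
    b = np.argmin(np.linalg.norm(Z - x, axis=1))

    # Assumed exponential decay of the learning rate and neighborhood width.
    eps = eps0 * (0.01 / eps0) ** (t / T)
    sigma = sigma0 * (0.01 / sigma0) ** (t / T)

    # Gaussian neighborhood h_sigma(t, q, b), measured between map positions;
    # neurons far from the BMU receive a near-zero coefficient and barely move.
    g = np.linalg.norm(pos - pos[b], axis=1)
    h = np.exp(-(g ** 2) / (2 * sigma ** 2))

    # Eq. (4): pull neurons in the BMU neighborhood towards x.
    Z += eps * h[:, None] * (x - Z)
    return Z
```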

Since the data characteristics in the data chunk may be high-dimensional and diverse, the topological structure obtained through RSOM exhibits controlled randomness and discontinuities, enhancing readability in high-dimensional spaces. As the training converges, it eventually yields the trained neurons $Z=\{\mathbf{z}_1,\mathbf{z}_2,\ldots,\mathbf{z}_{Q^2}\}$. The SOM topological map, denoted as $G_{\text{SOM}}=\{Z,\mathbf{A}\}$, constitutes an undirected weighted graph. The adjacency matrix of connections between neurons, $\mathbf{A}=[a_{ij}]\in\mathbb{R}^{Q^2\times Q^2}$, is formed as follows:

$$a_{ij}=\begin{cases}1,&\text{if }\mathbf{z}_i\text{ and }\mathbf{z}_j\text{ are connected,}\\0,&\text{otherwise.}\end{cases}\qquad(5)$$

When $a_{ij}=1$, there exists a connecting edge between $\mathbf{z}_i$ and $\mathbf{z}_j$; when $a_{ij}=0$, there is no such edge.
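The paper does not spell out how the connectivity in Eq. (5) is derived from the randomized neuron layout. One plausible construction, sketched below purely as an assumption, connects two neurons whenever they share an edge in the Delaunay triangulation of their 2-D map positions (SciPy is assumed to be available).

```python
import numpy as np
from scipy.spatial import Delaunay

def build_adjacency(pos):
    """Adjacency matrix A of Eq. (5) from 2-D neuron positions on the map.

    Assumption: z_i and z_j are 'connected' iff they share an edge in the
    Delaunay triangulation of the neuron positions.
    """
    n = len(pos)
    A = np.zeros((n, n), dtype=int)
    for simplex in Delaunay(pos).simplices:      # each simplex is a triangle (i, j, k)
        for a in range(3):
            for b in range(a + 1, 3):
                A[simplex[a], simplex[b]] = 1
                A[simplex[b], simplex[a]] = 1
    return A
```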

For the second step of reclustering, $Z$ serves as the initial cluster centroids for $k$-means. Our criterion for fine-tuning the neurons is to minimize the objective function $P$ [40], which can be written as follows:

$$P(\mathbf{M},Z)=\sum_{j=1}^{n}\sum_{u=1}^{Q^2}m_{ju}\cdot d(\mathbf{x}_j^i,\mathbf{z}_u),\qquad(6)$$

where $\mathbf{M}$ is an $n\times Q^2$ partition matrix and $m_{ju}$ is a binary indicator variable representing whether the $j$-th data object belongs to the $u$-th cluster, satisfying:

$$m_{ju}=\begin{cases}1,&\text{if }d(\mathbf{x}_j^i,\mathbf{z}_u)<d(\mathbf{x}_j^i,\mathbf{z}_q),\ \text{for }1\leq q\leq Q^2,\ u\neq q,\\0,&\text{otherwise.}\end{cases}\qquad(7)$$

If $m_{ju}=1$, the $j$-th data object belongs to cluster $C_u$. Next, we describe the actual fine-tuning rule:

$$\mathbf{z}_u=\frac{\sum_{\mathbf{x}_j^i\in C_u}\mathbf{x}_j^i}{|C_u|}.\qquad(8)$$

Repeating the above steps for a finite number of iterations converges to a minimum of $P$ [41]. Since the fine-tuned neurons are already reasonably representative of the data objects, we define them as the micro-cluster centers $M=\{\mathbf{m}_1,\mathbf{m}_2,\ldots,\mathbf{m}_{Q^2}\}$. This completes the reclustering process composed of RSOM and $k$-means.
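A minimal sketch of this fine-tuning step (Eqs. (6)-(8)) is given below, with the trained neurons $Z$ used as the initial centers; the iteration cap and tolerance are illustrative choices.

```python
import numpy as np

def fine_tune(X, Z, max_iter=50, tol=1e-6):
    """Lloyd-style refinement of the trained neurons (Eqs. 6-8).

    X: (n, f) data chunk; Z: (Q*Q, f) trained neurons used as initial centers.
    Returns the micro-cluster centers M and each object's assignment.
    """
    M = Z.copy()
    for _ in range(max_iter):
        # Eq. (7): assign every object to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - M[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Eq. (8): move each center to the mean of its members (empty clusters keep their center).
        M_new = M.copy()
        for u in range(len(M)):
            members = X[assign == u]
            if len(members) > 0:
                M_new[u] = members.mean(axis=0)
        if np.linalg.norm(M_new - M) < tol:      # centers stopped moving: P has converged
            return M_new, assign
        M = M_new
    return M, assign
```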

IV-B RP: Refining Partition by Bridge Nodes

After obtaining the micro-cluster centers, an excessive number of them may cause micro-clusters to be dispersed within the overlapping regions among the true clusters.

Definition 1.

Given a micro-cluster center $\mathbf{m}_i\in\mathbb{R}^f$, if the local topological radius of $\mathbf{m}_i$'s neighbors is smaller than its own topological radius, i.e., $r(\mathbf{m}_i)<1$, then such a center is called a Bridge Node (BN).

A BN acts like a bridge connecting two micro-clusters that should be separated during the merging phase, which results in degraded performance. Specifically, the distribution of micro-cluster centers reflects that of the data objects, making these centers the primary reference for the merging process. If we do not eliminate BNs, there is a risk of merging two true clusters into one, deviating from the expected final number of clusters.

In the topology map generated in Section IV-A, as shown in Fig. 3(a), regions with high data-object density are mapped to regions with a high local density of micro-cluster centers. BNs appear in the blank areas between true clusters because the SOM-based topology construction forms a neural-network-like structure: during learning and updating, neurons are driven by data objects, and their driving force is limited in low-density areas, causing some neurons to be trapped there.

[Fig. 3: Micro-cluster centers before (a) and after (b) removing Bridge Nodes.]

We introduce the local topological radius ratio as a metric to determine whether a micro-cluster center is a BN. Although $k$-means is used to fine-tune the neurons into micro-cluster centers, the original topological information is not destroyed. Therefore, we use $\Omega(\mathbf{m}_i)$ to denote the adjacent micro-cluster centers of $\mathbf{m}_i$ in the topological map. If there exists an edge connecting $\mathbf{m}_i$ and $\mathbf{m}_j$, then:

$$\Omega(\mathbf{m}_i)=\{\mathbf{m}_j\,|\,a_{ij}=1\}.\qquad(9)$$

The local topological radius refers to the average Euclidean distance from a micro-cluster center to its adjacent micro-cluster centers, defined as follows:

$$\rho(\mathbf{m}_i)=\frac{1}{|\Omega(\mathbf{m}_i)|}\sum_{\mathbf{m}_j\in\Omega(\mathbf{m}_i)}\lVert\mathbf{m}_i-\mathbf{m}_j\rVert^{2},\qquad(10)$$

where $|\Omega(\mathbf{m}_i)|$ is the number of adjacent micro-cluster centers of $\mathbf{m}_i$. According to Eq. (10), we can calculate the local topological radius ratio $r(\mathbf{m}_i)$ as follows:

$$r(\mathbf{m}_i)=\frac{\sum_{\mathbf{m}_j\in\Omega(\mathbf{m}_i)}\rho(\mathbf{m}_j)}{|\Omega(\mathbf{m}_i)|\cdot\rho(\mathbf{m}_i)},\qquad(11)$$

which compares the average $\rho$ of the adjacent micro-cluster centers with $\rho(\mathbf{m}_i)$. When the former is larger, $r(\mathbf{m}_i)>1$. Intuitively, if $r(\mathbf{m}_i)<1$, the data objects represented by $\mathbf{m}_i$ are sparser than those represented by its adjacent micro-cluster centers. Therefore, using the set

$$M_r=\{\mathbf{m}_i\,|\,r(\mathbf{m}_i)<1\},\qquad(12)$$

BNs can be identified. As shown in Fig. 3(b), RP effectively eliminates BNs located in blank areas, preventing an excessive dispersion of micro-cluster centers. This removal enhances the separability of true clusters, while the primary structures represented by the remaining, unpruned micro-cluster centers are preserved. We then assign the data objects to the updated micro-cluster centers to obtain the micro-clusters.
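The node-elimination rule in Eqs. (9)-(12) reduces to a few array operations over the refined topology; a minimal NumPy sketch is given below, assuming every center has at least one topological neighbor in A.

```python
import numpy as np

def remove_bridge_nodes(M, A):
    """Identify and drop Bridge Nodes (Eqs. 9-12).

    M: (m, f) micro-cluster centers; A: (m, m) adjacency matrix of Eq. (5).
    Returns the retained centers and a boolean mask of kept nodes.
    """
    m = len(M)
    neighbors = [np.where(A[i] == 1)[0] for i in range(m)]       # Omega(m_i), Eq. (9)
    rho = np.array([
        np.mean(np.linalg.norm(M[nb] - M[i], axis=1) ** 2)       # Eq. (10): local topological radius
        for i, nb in enumerate(neighbors)
    ])
    r = np.array([
        rho[nb].mean() / rho[i]                                  # Eq. (11): local topological radius ratio
        for i, nb in enumerate(neighbors)
    ])
    keep = r >= 1.0                                              # Eq. (12): r < 1 marks a Bridge Node
    return M[keep], keep
```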

IV-C MMC: Merging Micro-Clusters based on Refined Topology

In this stage, micro-clusters are merged to automatically obtain the final clustering result. Our merging process utilizes two metrics, namely compactness and separability. We use separation to find the most suitable micro-clusters to merge, because the smaller the separation, the more similar the two micro-clusters are and the more likely they are to come from the same true cluster. The separation is calculated using a 1-D Gaussian mixture probability density function.

Suppose we want to compute the separation between two micro-clusters $C_i$ and $C_j$. The data objects $\mathbf{x}$ within $C_i$ and $C_j$ are projected onto the topological connection line as 1-D objects $\mathbf{x}'$:

$$\mathbf{x}'=\frac{(\mathbf{x}-\bar{\mathbf{m}})^{T}(\mathbf{m}_i-\mathbf{m}_j)}{\lVert\mathbf{m}_i-\mathbf{m}_j\rVert^{2}},\qquad(13)$$

where $\bar{\mathbf{m}}$ is the midpoint of the topological connection line, i.e., $\bar{\mathbf{m}}=(\mathbf{m}_i+\mathbf{m}_j)/2$. Here, $\mathbf{m}_i$ and $\mathbf{m}_j$ are the endpoints of the connection line, located at $0.5$ and $-0.5$, respectively. By substituting the statistics of $\mathbf{x}'$ into the 1-D Gaussian mixture probability density function $S(u)$, we obtain:

$$S(u)=\frac{|C_i|\,f(u\,|\,0.5,\sigma_i^{2})+|C_j|\,f(u\,|\,-0.5,\sigma_j^{2})}{|C_i|+|C_j|}.\qquad(14)$$

$f(u\,|\,0.5,\sigma_i^{2})$ is the probability density function of a Gaussian distribution, where $u$ is the variable and $\sigma_i^{2}$ is the variance for $C_i$. We define the separation between two micro-clusters as:

$$s_{ij}=\frac{1}{\underset{u\in U}{\min}\,S(u)}.\qquad(15)$$

To estimate the minimum discrete probability of $u$ within $S(u)$, we traverse the interval from $-0.5$ to $0.5$ with a step size of $0.01$, denoted as $U=\{-0.5,-0.49,\ldots,0.5\}$. When the separation between two micro-clusters is large, $\min_{u\in U}S(u)$ approaches 0 and $s_{ij}$ becomes large. Suppose the current unmerged clustering result is $\varPhi(k)$, which indicates that the current data chunk is split into $k$ micro-clusters. The following equation identifies the micro-clusters that should be merged and gives the global compactness of the clustering result for $\varPhi(k)$:

$$com_{k-1}=\underset{1<i<k}{\min}\;\underset{\substack{\mathbf{m}_j\in SOMN(\mathbf{m}_i)\\ j>i}}{\min}\;s_{ij}.\qquad(16)$$

We merge the $C_i$ and $C_j$ selected by Eq. (16) and update $\varPhi(k)$ to $\varPhi(k-1)$. The computation stops when $k=1$. The global compactness of the clustering result, denoted as $com$, represents the minimum separation value between globally adjacent clusters. Typically, when different micro-clusters are merged, their separation measure before merging should be significantly higher than after merging. Therefore, a smaller $com$ value indicates a better clustering result, as each cluster becomes more compact. To automatically determine the final number of clusters, we introduce another metric: the global separability of the clustering result. Before that, we define $\hat{\mathbf{m}}_g$ as the micro-cluster center with the maximum local topological radius, where $g$ is determined by:

$$g=\underset{1\leq q\leq Q^{2}}{\arg\max}\;\rho(\mathbf{m}_q).\qquad(17)$$

$\hat{\mathbf{m}}_g$ represents the center of micro-cluster $C_g$, which is generally distributed in the outer region of a true cluster, making it sparse and exhibiting a higher degree of dispersion from objects outside the micro-cluster. Therefore, we define:

$$sep_{k}=\sum_{\mathbf{x}_j\in C_g}\frac{q_j}{\kappa},\qquad(18)$$

where $\kappa$ is the number of neighbors considered, and $q_j$ is the number of the $\kappa$-nearest neighbors of $\mathbf{x}_j$ that do not belong to $C_g$. To efficiently compute the $\kappa$-nearest neighbors of the specified data objects, an approximate search algorithm is employed, reducing time overhead while maintaining accuracy. This is achieved by constructing a multi-level graph structure in which each level is composed of approximate Delaunay graphs; skip lists and heuristic initialization are introduced to enhance search efficiency [42].

Clearly, a smaller $sep$ value indicates a better clustering result, as each cluster exhibits a higher degree of separability. During the merging process, the $com$ value trends upward while the $sep$ value trends downward. Therefore, we can automatically determine the number of clusters as follows:

$$k^{*}=\underset{1<K<k-1}{\arg\min}\left(\frac{com_{K}}{\underset{k'}{\max}\{com_{k'}\}}+\frac{sep_{K}}{\underset{k'}{\max}\{sep_{k'}\}}\right).\qquad(19)$$

We iterate through all possible values of $K$ to find the best combination of $com$ and $sep$. To normalize the global compactness and global separability, each is divided by its maximum value so that the two terms are on comparable scales. Finally, we select the $K$ that minimizes the sum of the normalized global separability and global compactness to determine the clustering result.
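To make the merging criterion concrete, below is a condensed sketch of the separation measure in Eqs. (13)-(15) and the cluster-number selection of Eq. (19). It is illustrative only: function names are ours, SciPy's Gaussian density is assumed to be available, both micro-clusters are assumed to contain several spread-out objects, and, unlike LSROM, candidate pairs are not restricted here to topologically adjacent micro-clusters.

```python
import numpy as np
from scipy.stats import norm

def separation(Xi, Xj, mi, mj):
    """Separation s_ij between micro-clusters C_i and C_j (Eqs. 13-15)."""
    axis = mi - mj
    mbar = (mi + mj) / 2.0
    # Eq. (13): project objects onto the line joining the two centers;
    # members of C_i land near +0.5 and members of C_j near -0.5.
    pi = (Xi - mbar) @ axis / np.dot(axis, axis)
    pj = (Xj - mbar) @ axis / np.dot(axis, axis)
    ni, nj = len(Xi), len(Xj)
    u = np.arange(-0.5, 0.501, 0.01)          # U = {-0.5, -0.49, ..., 0.5}
    # Eq. (14): 1-D Gaussian mixture density along the connection line.
    S = (ni * norm.pdf(u, 0.5, pi.std()) + nj * norm.pdf(u, -0.5, pj.std())) / (ni + nj)
    return 1.0 / S.min()                      # Eq. (15)

def choose_k(com, sep):
    """Eq. (19): pick K minimizing normalized compactness plus separability.

    com and sep are dictionaries mapping a cluster number K (recorded during
    the merging process) to com_K and sep_K.
    """
    cmax, smax = max(com.values()), max(sep.values())
    return min(com, key=lambda K: com[K] / cmax + sep[K] / smax)
```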

Algorithm 1: LSROM

Input: dataset $\mathbf{X}$, the number of neurons $Q^2$.
Output: $k$ clusters.

1. IP:
   1.1 Construct RSOM from dataset $\mathbf{X}$ to generate $Q^2$ neurons according to Eqs. (3)-(4).
   1.2 Adjust the neurons to align with the centers of dataset $\mathbf{X}$ according to Eqs. (6)-(8).
2. RP:
   2.1 Calculate the local topological radius ratio of the micro-cluster centers according to Eqs. (9)-(11).
   2.2 Determine and remove BNs according to Eq. (12).
3. MMC:
   3.1 Merge micro-clusters by calculating the separation of adjacent micro-clusters according to Eqs. (13)-(15).
   3.2 Calculate the compactness and separability measures according to Eqs. (16)-(18).
   3.3 Obtain $k$ clusters at the minimum of the sum of global separability and global compactness according to Eq. (19).

IV-D Overall Algorithm and Complexity Analysis

The summary of LSROM is presented in Algorithm 1. In the initial stage, we utilize RSOM and apply reclustering with $k$-means to generate micro-cluster centers distributed across the imbalanced dataset, and we employ the resulting topological map to guide the subsequent processes. In the second stage, we compute the local topological radius ratio of each micro-cluster center and use this ratio to remove BNs. In the third stage, the updated micro-clusters are adaptively merged using two metrics, separability and compactness, and the final suitable number of clusters is quickly determined using the topological map.

Theorem 1.

Given a data chunk $X=\{\mathbf{x}_1,\mathbf{x}_2,\cdots,\mathbf{x}_n\}$, let the initial number of neurons be $Q$ with $Q\ll n$, and let the number of micro-clusters be $\hat{Q}$ with $\hat{Q}\leq Q$. The time complexity of the LSROM algorithm is $O(n\log n)$.

Proof.

In IP, the time complexity of training the RSOM neurons is $O(Qn)$, and the time complexity of fine-tuning the neurons with $k$-means is $O(Qn)$. In the RP stage, generating the adjacency matrix of micro-cluster centers has a time complexity of $O(Q^2)$, and calculating the local topological radius ratios of the micro-cluster centers takes $O(Q)$. In the MMC stage, the number of micro-clusters is reduced to $\hat{Q}$, and updating the separation degrees of adjacent clusters has a time complexity of $O(\hat{Q}^2\log\hat{Q})$. When iteratively calculating the global separability and global compactness, an approximate $k$-nearest neighbor algorithm is used, leveraging the information provided by the topological structure to avoid traversing all micro-clusters, which requires $O(\frac{n}{\hat{Q}}\log\frac{n}{\hat{Q}})$ time. Since both $Q$ and $\hat{Q}$ can be regarded as constants, the overall time complexity of the algorithm is $O(n\log n)$. ∎

V Experimental Results

Four experiments, i.e., efficiency evaluation, clustering accuracy evaluation, ablation study, and parameter sensitivity evaluation, have been conducted on ten datasets with ten counterparts (seven existing methods and three ablated versions of LSROM). In the following, we first provide the experimental settings, and then demonstrate experimental results.

TABLE I: Statistics of the ten datasets.

No. | Dataset | # Features | # Samples | IR | # Clusters
1 | lithuanian | 2 | 2.00K | 5.00 | 2
2 | banana | 2 | 2.40K | 5.00 | 2
3 | haberman | 3 | 0.31K | 2.77 | 2
4 | wpbc | 30 | 0.57K | 1.69 | 2
5 | ids2 | 2 | 1.00M | 10.00 | 5
6 | gaussian | 2 | 1.00M | 19.87 | 4
7 | Forest | 54 | 0.58M | 103.13 | 7
8 | IOT | 26 | 2.00M | 2.54 | 2
9 | Bitcoin | 8 | 1.00M | 23.19 | 2
10 | SEER | 16 | 1.00M | 3.99 | 3

Algorithm 2: Two-Layer Random Sampling (TLRS) for generating one data chunk.

1: Input: dataset $\mathbf{X}$, Imbalanced Ratio $IR$ of the dataset, the number of clusters $k$ within the dataset.
2: Output: data chunk $\mathbf{X}^t$ at the $t$-th time-stamp.
3: $k^t \sim \text{DiscreteUniform}(\mathbb{N}\cap[2,k])$.
4: $\mathfrak{n} \leftarrow$ sort the cluster sizes of $\mathbf{X}$ in ascending order.
5: for $i \leftarrow 1$ to $k^t-1$ do
6:   $IR_i^t \sim \text{DiscreteUniform}(\mathbb{Z}\cap(1,IR])$;
7: end for
8: $IR^t \leftarrow$ sort $IR^t$ in ascending order.
9: $\mathfrak{n}_1^t \leftarrow \mathfrak{n}_1$;  ▷ the quantity represented by $\mathfrak{n}_1$ is the smallest.
10: for $i \leftarrow 2$ to $k^t$ do
11:   if $IR_{i-1}^t \cdot \mathfrak{n}_{i-1} \leq \mathfrak{n}_i$ then
12:     $\mathfrak{n}_i^t \leftarrow IR_{i-1}^t \cdot \mathfrak{n}_{i-1}$;
13:   else
14:     $\mathfrak{n}_i^t \leftarrow \mathfrak{n}_i$.
15:   end if
16: end for
17: $\mathfrak{n}_i^t$ data objects are randomly taken from the $i$-th cluster to form $\mathbf{X}^t$.

V-A Experimental Settings

Ten datasets, including four synthetic and six real ones with varying sizes, dimensions, and distribution types, are adopted; their statistical details are provided in Table I. For the synthetic datasets, banana and lithuanian are generated using PRTools [43] with the thickness parameter set to 0.4 and the "non-spherical clusters" option selected. ids2 and gaussian [7] are synthesized by applying a mixture of bivariate Gaussian density functions with customized scales on the order of one million objects, resulting in spherical clusters. All real datasets are obtained from the UCI Machine Learning Repository (https://archive.ics.uci.edu). Min-max normalization is adopted to pre-process each feature so that all features share the identical value domain $[0,1]$.
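For concreteness, this preprocessing step corresponds to the standard per-feature min-max scaling, e.g.:

```python
import numpy as np

def min_max_normalize(X, eps=1e-12):
    """Scale each feature (column) of the n x d array X into [0, 1]."""
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    return (X - x_min) / (x_max - x_min + eps)  # eps guards constant features
```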


We have designed a data generation method to support the evaluation of ISDC, which can comprehensively generate imbalanced data chunks. A selected dataset serves as the basis for generating data chunks using a Two-Layer Random Sampling method (TLRS), where the two layers involve random sampling of the number of clusters and of the Imbalanced Ratio. The specific procedure is described in Algorithm 2. By repeating this algorithm $t$ times, we obtain a total of $t$ data chunks. Generating data chunks in this manner ensures the rationality of the data distribution while covering the possible imbalanced states of streaming data as comprehensively as possible. This enhances the diversity of the experimental data, allowing a better evaluation of algorithm performance under various imbalanced scenarios.
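As a reading of Algorithm 2, a minimal NumPy sketch of TLRS is given below. The function name `tlrs_chunk` and the assumption that per-object cluster labels are available in `y` are ours for illustration, not part of any released code, and the sketch assumes an integer-valued $IR \geq 2$.

```python
import numpy as np

def tlrs_chunk(X, y, IR, k, rng=None):
    """Two-Layer Random Sampling (TLRS): draw one imbalanced data chunk
    from the labeled dataset (X, y), following Algorithm 2.

    IR is the imbalanced ratio of the whole dataset, k its number of
    clusters; y holds the cluster label of each object.
    """
    rng = np.random.default_rng() if rng is None else rng

    # Layer 1: sample the number of clusters k^t for this chunk.
    k_t = rng.integers(2, k + 1)                     # DiscreteUniform([2, k])

    # Cluster sizes of X in ascending order (with matching labels).
    labels, sizes = np.unique(y, return_counts=True)
    order = np.argsort(sizes)
    labels, sizes = labels[order], sizes[order]

    # Layer 2: sample per-pair imbalanced ratios and sort ascending.
    IR_t = np.sort(rng.integers(2, int(IR) + 1, size=k_t - 1))  # Z ∩ (1, IR]

    # Chunk-wise cluster sizes n^t (lines 9-16 of Algorithm 2).
    n_t = np.empty(k_t, dtype=int)
    n_t[0] = sizes[0]
    for i in range(1, k_t):
        n_t[i] = min(IR_t[i - 1] * sizes[i - 1], sizes[i])

    # Randomly draw n^t_i objects from the i-th (ascending-size) cluster.
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == labels[i]), size=n_t[i], replace=False)
        for i in range(k_t)
    ])
    return X[idx], y[idx]
```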

TABLE II: NMI and DCV (mean ± std over 10 runs) of the compared methods on the ten datasets; "–" denotes out of memory.

Dataset | Measure | k-means | StreamKM++ | BIRCH | AMD-DPC | SMCL | MC-MNN | LDPI | LSROM
lithuanian | NMI | 0.0009±0.0002 | 0.0012±0.0008 | 0.0452±0.0012 | 0.0548±0.0037 | 1.0000±0.0000 | 1.0000±0.0000 | 1.0000±0.0000 | 1.0000±0.0000
lithuanian | DCV | 0.9815±0.0277 | 0.7452±0.0609 | 1.0228±0.0035 | 0.9289±0.0287 | 0.0000±0.0000 | 0.0000±0.0000 | 0.0000±0.0000 | 0.0000±0.0000
banana | NMI | 0.0943±0.0219 | 0.9206±0.0743 | 0.6512±0.0013 | 0.2631±0.0014 | 0.9745±0.0087 | 1.0000±0.0000 | 1.0000±0.0000 | 1.0000±0.0000
banana | DCV | 0.8221±0.0054 | 0.7995±0.0464 | 0.9195±0.0044 | 0.4057±0.0091 | 0.0791±0.0070 | 0.0000±0.0000 | 0.0000±0.0000 | 0.0000±0.0000
haberman | NMI | 0.0007±0.0001 | 0.0015±0.0001 | 0.0003±0.0000 | 0.0469±0.0030 | 0.0011±0.0002 | 0.1343±0.0041 | 0.1002±0.0017 | 0.0088±0.0012
haberman | DCV | 1.0289±0.0500 | 1.0160±0.1627 | 1.3489±0.1524 | 0.4612±0.0125 | 0.4416±0.0005 | 0.3545±0.0034 | 0.3991±0.0107 | 0.4529±0.0052
wpbc | NMI | 0.6175±0.1132 | 0.6231±0.2402 | 0.3219±0.0994 | 0.1329±0.0090 | 0.6339±0.0057 | 0.6355±0.0128 | 0.4353±0.0470 | 0.5052±0.0323
wpbc | DCV | 1.1217±0.0038 | 1.2710±0.7039 | 1.2335±0.0073 | 0.1870±0.0061 | 0.1396±0.0001 | 0.1564±0.0046 | 0.1028±0.0033 | 0.1693±0.0037
ids2 | NMI | 0.4312±0.0038 | 0.5092±0.0029 | 0.7966±0.0041 | 0.9312±0.0081 | – | – | – | 0.9648±0.0002
ids2 | DCV | 0.5337±0.0033 | 0.4766±0.0074 | 0.7919±0.0234 | 0.0312±0.0015 | – | – | – | 0.0023±0.0002
gaussian | NMI | 0.5022±0.0012 | 0.5498±0.0012 | 0.9234±0.0026 | 0.9358±0.0016 | – | – | – | 0.9498±0.0000
gaussian | DCV | 0.6092±0.0007 | 0.5720±0.0092 | 0.5253±0.0022 | 0.0406±0.0011 | – | – | – | 0.0031±0.0005
Forest | NMI | 0.1374±0.0041 | 0.1459±0.0039 | 0.1487±0.0016 | 0.2648±0.0004 | – | – | – | 0.3677±0.0020
Forest | DCV | 0.8561±0.0352 | 0.8092±0.0114 | 1.4571±0.0313 | 0.4113±0.0210 | – | – | – | 0.2901±0.0063
IOT | NMI | 0.2011±0.0091 | 0.2310±0.0017 | 0.0757±0.0015 | 0.7951±0.0008 | – | – | – | 0.9703±0.0909
IOT | DCV | 0.5099±0.0232 | 0.4971±0.0163 | 1.2376±0.0462 | 0.1463±0.0810 | – | – | – | 0.0002±0.0000
Bitcoin | NMI | 0.0138±0.0023 | 0.0139±0.0076 | 0.0201±0.0037 | 0.0192±0.0056 | – | – | – | 0.5722±0.0135
Bitcoin | DCV | 1.2386±0.0325 | 0.8639±0.0482 | 0.8257±0.0154 | 1.0395±0.0498 | – | – | – | 0.0482±0.0061
SEER | NMI | 0.2185±0.0026 | 0.2202±0.0059 | 0.2123±0.0265 | 0.0402±0.0002 | – | – | – | 0.3155±0.0251
SEER | DCV | 1.4544±0.0674 | 1.4517±0.0614 | 1.6613±0.0417 | 2.1138±0.1208 | – | – | – | 0.3252±0.0485

We compare LSROM with seven other methods: the traditional partitional clustering approach ($k$-means [4]), advanced algorithms designed for streaming data (StreamKM++ [13], BIRCH [16], AMD-DPC [15]), and state-of-the-art methods suitable for static imbalanced data (SMCL [8], MC-MNN [44], LDPI [12]). For $k$-means, StreamKM++, BIRCH, and AMD-DPC, the number of clusters needs to be specified in advance. In contrast, SMCL, MC-MNN, LDPI, and our method can adaptively determine the number of clusters without prior specification. The specific experimental parameters of our method are set as follows: the initial number of neurons $Q^2=100$ and the number of nearest neighbors $\kappa=10$. In Section V-E, we conduct a comprehensive parameter sensitivity verification; since the performance is not very sensitive to $\kappa$ and $Q$, the rationality of these settings is confirmed. The other parameters of all the compared methods are set following the recommendations in the source literature.

Two metrics, i.e., Normalized Mutual Information (NMI) [45] and the Difference of CV (DCV) index [8], are utilized for performance evaluation. NMI is an external metric reflecting the correlation between the clustering results and the given labels from the perspective of information theory; its value range is $[0,1]$, and larger is better. DCV is an internal metric specifically designed for evaluating the clustering performance on imbalanced data. It assesses the effectiveness of clustering based on the similarity between clusters and the compactness within clusters; a lower value indicates better clustering results, with greater differences between clusters and higher compactness within clusters. All the reported NMI and DCV performances are averaged over 10 runs with random initialization seeds.
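As a usage note, NMI can be computed directly from predicted and ground-truth labels, e.g., with scikit-learn; DCV follows the definition in [8] and is not restated here.

```python
from sklearn.metrics import normalized_mutual_info_score

y_true = [0, 0, 1, 1, 2, 2]   # ground-truth labels of a toy chunk
y_pred = [1, 1, 0, 0, 2, 2]   # cluster assignments (label permutation is fine)
nmi = normalized_mutual_info_score(y_true, y_pred)  # -> 1.0; range [0, 1]
```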

V-B Efficiency Evaluation

The impact of data size on the runtime of the clustering methods is studied by plotting their runtime-data size curves on the six large-scale datasets in Fig. 4. Each point in the figure reflects the execution time (y-axis) of a method under a certain amount of data (x-axis). We categorize the compared methods into three groups: 1) the proposed LSROM (in purple), 2) SMCL (in orange), MC-MNN (in green), and LDPI (in blue), and 3) existing fast streaming data clustering methods that are not dedicated to the imbalanced issue (in gray). Please note that the y-axis is visualized on a logarithmic scale because the running times of different methods vary widely.

It can be seen that the proposed LSROM has a running time similar to that of $k$-means, StreamKM++, BIRCH, and AMD-DPC thanks to their linear or $O(n\log n)$-level complexity, while the group of SMCL, MC-MNN, and LDPI incurs heavy computational cost due to their polynomial time complexity. Moreover, due to their excessive computational resource requirements, these three algorithms encounter "out of memory" situations when dealing with larger datasets.

V-C Clustering Accuracy Evaluation

Two sub-experiments have been conducted. We first compare the clustering performance of the different methods on each whole dataset in Table II to evaluate their clustering accuracy on static datasets, where the best- and second-best-performing method on each dataset is highlighted in boldface and underlined, respectively. Then we generate imbalanced streaming data chunks based on the six large-scale datasets to specifically evaluate the capability of the methods in ISDC, as shown in Fig. 5 and Fig. 6. For each of the six large-scale datasets, we generate 50 chunks by using TLRS and let them flow in one by one per time-stamp. The data volume of each chunk is determined by the maximum data capacity that SMCL can process; since SMCL takes the longest to run among all methods, as shown in Fig. 4, this ensures that all methods can run and produce results. Specifically, the chunk sizes are: $chunk_{ids2}=chunk_{gaussian}=8\times10^4$, $chunk_{Forest}=chunk_{Bitcoin}=chunk_{SEER}=5\times10^4$, and $chunk_{IOT}=4\times10^4$.


From Table II, it can be seen that the proposed LSROM performs best in general, winning in 16 out of 20 comparisons. In comparison with the fast clustering methods, i.e., $k$-means, StreamKM++, BIRCH, and AMD-DPC, LSROM demonstrates its superiority as it is competent in detecting imbalanced clusters, while the four fast methods do not adopt a mechanism that specially takes the imbalanced issue into account. AMD-DPC performs second-best in general as it adopts a density-based principle, which can somewhat distinguish imbalanced clusters as discussed in Section I. In comparison with SMCL, MC-MNN, and LDPI, which are proposed for imbalanced data, LSROM still performs the best in general: SMCL, MC-MNN, and LDPI are unavailable due to "out of memory" on the six large-scale datasets, while LSROM remains competitive on the four smaller datasets. The reason would be that LSROM inherits most merits of SMCL and further appropriately exploits the global distribution information in the intensive data partition with the help of the trained SOM topology and the partition refinement (i.e., Section IV).

It can be observed from Fig. 5 and Fig. 6 that in ISDC tasks, LSROM still significantly outperforms the other four fast clustering methods, i.e., $k$-means, StreamKM++, BIRCH, and AMD-DPC, which is consistent with the observations in Table II. Since SMCL, MC-MNN, and LDPI are executable on each chunk generated from the large-scale datasets, their ISDC performance is also reported, showing that LSROM remains very competitive in comparison with them. In summary, the proposed LSROM is superior in both accuracy and scalability in handling imbalanced streaming data of different scales.

V-D Ablation Study

To further validate the effectiveness of LSROM, we conduct ablation experiments. Since LSROM consists of three components, we ablate each module individually and report the resulting NMI, DCV, and runtime. For the IP ablation, we directly obtain neurons by training RSOM without the $k$-means fine-tuning step. For the RP ablation, we omit the step of removing BNs and thus directly merge the micro-clusters obtained by the IP. For the MMC ablation, we use the merging method of SMCL, named SGMS, instead of MMC. The results are shown in Table III. We find that, compared with the standard LSROM workflow, ablating any module leads to a certain decrease in clustering performance, demonstrating that each module contributes to achieving good clustering performance. Among these ablations, the IP ablation has the smallest impact, while the ablations of RP and MMC result in more significant drops in accuracy. When processing large-scale data, SGMS requires quadratic running time and thus causes "out of memory". Overall, the complete LSROM consistently produces the best clustering results among all ablation settings, further confirming its effectiveness.

TABLE III: Ablation study of LSROM (NMI, DCV, and runtime); ✓/✗ indicates whether the corresponding component is retained, and "–" denotes out of memory.

Dataset | IP | RP | MMC | NMI | DCV | Runtime
lithuanian | ✓ | ✓ | ✓ | 1.000±0.000 | 0.000±0.000 | 2.332±0.002
lithuanian | ✗ | ✓ | ✓ | 0.993±0.000 | 0.001±0.000 | 1.820±0.002
lithuanian | ✓ | ✗ | ✓ | 0.987±0.001 | 0.002±0.000 | 1.755±0.002
lithuanian | ✓ | ✓ | ✗ | 0.987±0.000 | 0.002±0.000 | 2.060±0.002
banana | ✓ | ✓ | ✓ | 1.000±0.000 | 0.000±0.000 | 2.206±0.001
banana | ✗ | ✓ | ✓ | 0.998±0.000 | 0.000±0.000 | 2.145±0.003
banana | ✓ | ✗ | ✓ | 0.992±0.000 | 0.002±0.000 | 2.308±0.002
banana | ✓ | ✓ | ✗ | 0.993±0.000 | 0.001±0.000 | 2.415±0.002
haberman | ✓ | ✓ | ✓ | 0.009±0.000 | 0.453±0.005 | 0.679±0.000
haberman | ✗ | ✓ | ✓ | 0.006±0.000 | 0.666±0.002 | 0.730±0.000
haberman | ✓ | ✗ | ✓ | 0.005±0.000 | 0.712±0.003 | 0.818±0.000
haberman | ✓ | ✓ | ✗ | 0.005±0.000 | 0.710±0.002 | 0.776±0.000
wpbc | ✓ | ✓ | ✓ | 0.505±0.032 | 0.169±0.003 | 1.061±0.001
wpbc | ✗ | ✓ | ✓ | 0.403±0.046 | 0.508±0.000 | 1.047±0.002
wpbc | ✓ | ✗ | ✓ | 0.010±0.000 | 1.041±0.005 | 0.949±0.002
wpbc | ✓ | ✓ | ✗ | 0.010±0.000 | 1.041±0.002 | 1.131±0.001
ids2 | ✓ | ✓ | ✓ | 0.965±0.000 | 0.002±0.000 | 102.924±1.944
ids2 | ✗ | ✓ | ✓ | 0.963±0.076 | 0.019±0.000 | 99.865±1.906
ids2 | ✓ | ✗ | ✓ | 0.892±0.023 | 0.026±0.000 | 251.019±2.137
ids2 | ✓ | ✓ | ✗ | – | – | –
gaussian | ✓ | ✓ | ✓ | 0.949±0.000 | 0.003±0.000 | 101.661±1.963
gaussian | ✗ | ✓ | ✓ | 0.922±0.000 | 0.023±0.000 | 98.137±1.958
gaussian | ✓ | ✗ | ✓ | 0.825±0.003 | 0.140±0.000 | 242.696±2.126
gaussian | ✓ | ✓ | ✗ | – | – | –
Forest | ✓ | ✓ | ✓ | 0.368±0.002 | 0.290±0.006 | 109.286±1.893
Forest | ✗ | ✓ | ✓ | 0.237±0.005 | 0.399±0.000 | 107.207±1.837
Forest | ✓ | ✗ | ✓ | 0.211±0.003 | 0.291±0.002 | 239.633±2.299
Forest | ✓ | ✓ | ✗ | – | – | –
IOT | ✓ | ✓ | ✓ | 0.970±0.096 | 0.000±0.000 | 143.023±2.060
IOT | ✗ | ✓ | ✓ | 0.828±0.006 | 0.195±0.096 | 138.871±2.136
IOT | ✓ | ✗ | ✓ | 0.777±0.006 | 0.374±0.026 | 356.471±2.055
IOT | ✓ | ✓ | ✗ | – | – | –
Bitcoin | ✓ | ✓ | ✓ | 0.572±0.013 | 0.048±0.006 | 137.653±2.019
Bitcoin | ✗ | ✓ | ✓ | 0.542±0.015 | 0.057±0.000 | 135.639±2.013
Bitcoin | ✓ | ✗ | ✓ | 0.428±0.005 | 0.129±0.002 | 301.587±2.023
Bitcoin | ✓ | ✓ | ✗ | – | – | –
SEER | ✓ | ✓ | ✓ | 0.316±0.011 | 0.325±0.002 | 135.328±2.023
SEER | ✗ | ✓ | ✓ | 0.297±0.015 | 0.621±0.036 | 134.232±2.057
SEER | ✓ | ✗ | ✓ | 0.275±0.016 | 0.614±0.022 | 297.729±2.003
SEER | ✓ | ✓ | ✗ | – | – | –

V-E Parameter Sensitivity Evaluation

In the proposed LSROM framework, there are two parameters, namely the number of neurons $Q^2$ and the neighbor count $\kappa$. Different values of these parameters can potentially affect the clustering performance in various ways, as explained below.

The parameter $Q^2$ directly determines the initial number of neurons. A value of $Q^2$ that is too large leads to an excessive number of micro-clusters, resulting in significant overlap between the final clusters: during the merging process, micro-clusters representing low-density data may be incorrectly identified as belonging to the same true cluster, and the excessive number of micro-clusters also incurs high computational cost. On the other hand, a value of $Q^2$ that is too small leads to significant differences in the data sizes contained within the micro-clusters, which can still be influenced by the uniform effect and directly degrades the quality of the LSROM framework. The parameter $\kappa$ is used in the approximate $\kappa$-nearest neighbor algorithm in MMC to calculate the global separability values of specific clusters. When $\kappa$ is too large, the computational overhead increases; conversely, when $\kappa$ is too small, the global separability values become meaningless.

We conduct experiments on the ten datasets by varying the values of $Q$ and $\kappa$ and recording the obtained NMI and DCV scores, as shown in Fig. 7. $Q$ is traversed from 3 to 12 with a step size of 1, while $\kappa$ ranges from 3 to 21 with a step size of 2. Based on the results in Fig. 7, it can be concluded that the clustering quality, as measured by NMI and DCV, remains robust across different parameter combinations. In particular, $Q$ performs well and stably between 8 and 12, and $\kappa$ performs well and stably between 7 and 15. Different $Q$ values produce larger accuracy changes than $\kappa$ does. In general, the LSROM framework demonstrates low sensitivity to parameter changes.
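The sweep itself is a plain grid search over the two parameters. The sketch below assumes a user-supplied `cluster_fn(X, Q, kappa)` wrapping one's own LSROM implementation (a hypothetical helper, not part of the released code) and scores each setting with NMI.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def sensitivity_grid(X, y_true, cluster_fn):
    """Sweep Q in {3,...,12} and kappa in {3,5,...,21}, recording NMI.

    cluster_fn(X, Q=..., kappa=...) is assumed to return cluster labels
    for the objects in X.
    """
    Qs = list(range(3, 13))          # step size 1
    kappas = list(range(3, 22, 2))   # step size 2
    scores = np.zeros((len(Qs), len(kappas)))
    for i, Q in enumerate(Qs):
        for j, kappa in enumerate(kappas):
            labels = cluster_fn(X, Q=Q, kappa=kappa)
            scores[i, j] = normalized_mutual_info_score(y_true, labels)
    return scores
```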

VI Conclusion

In this study, we propose an efficient and accurate ISDC approach called LSROM, which trains and refines an SOM to obtain an informative representation of the data. On this basis, the dataset is first partitioned into many compact micro-clusters, which are then efficiently merged to form an appropriate number of clusters. To facilitate a convincing experimental evaluation, a data generator called TLRS is designed to generate realistic imbalanced streaming data based on ten real benchmark and synthetic datasets. Compared to the seven counterparts, LSROM demonstrates its superiority in terms of both computational efficiency and clustering accuracy. Furthermore, the comprehensive ablation experiments verify the effectiveness of its key components, and the parameter sensitivity evaluation illustrates its robustness.

References

  • [1] M. M. Gaber, A. Zaslavsky, and S. Krishnaswamy, "Mining data streams: A review," SIGMOD Rec., vol. 34, no. 2, pp. 18-26, 2005.
  • [2] L. Zhao, Z. Chen, Y. Yang, L. Zou, and Z. J. Wang, "ICFS clustering with multiple representatives for large data," IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 3, pp. 728-738, 2019.
  • [3] L. Wang, H. Zhu, J. Meng, and W. He, "Incremental local distribution-based clustering using Bayesian adaptive resonance theory," IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 11, pp. 3496-3504, 2019.
  • [4] J. A. Hartigan and M. A. Wong, "A k-means clustering algorithm," Journal of the Royal Statistical Society: Series C (Applied Statistics), vol. 28, no. 1, pp. 100-108, 1979.
  • [5] S. C. Ahalt, A. K. Krishnamurthy, P. Chen, and D. E. Melton, "Competitive learning algorithms for vector quantization," Neural Networks, vol. 3, no. 3, pp. 277-290, 1990.
  • [6] L. Xu, A. Krzyzak, and E. Oja, "Rival penalized competitive learning for clustering analysis, RBF net, and curve detection," IEEE Transactions on Neural Networks, vol. 4, no. 4, pp. 636-649, 1993.
  • [7] J. Liang, L. Bai, C. Dang, and F. Cao, "The k-means-type algorithms versus imbalanced data distributions," IEEE Transactions on Fuzzy Systems, vol. 20, no. 4, pp. 728-745, 2012.
  • [8] Y. Lu, Y.-M. Cheung, and Y. Y. Tang, "Self-adaptive multiprototype-based competitive learning approach: A k-means-type algorithm for imbalanced data clustering," IEEE Transactions on Cybernetics, vol. 51, no. 3, pp. 1598-1612, 2021.
  • [9] R. Liu, H. Wang, and X. Yu, "Shared-nearest-neighbor-based clustering by fast search and find of density peaks," Information Sciences, vol. 450, pp. 200-226, 2018.
  • [10] A. Rodriguez and A. Laio, "Clustering by fast search and find of density peaks," Science, vol. 344, no. 6191, pp. 1492-1496, 2014.
  • [11] J. Xie, H. Gao, W. Xie, X. Liu, and P. W. Grant, "Robust clustering by detecting density peaks and assigning points based on fuzzy weighted k-nearest neighbors," Information Sciences, vol. 354, pp. 19-40, 2016.
  • [12] W. Tong, Y. Wang, and D. Liu, "An adaptive clustering algorithm based on local-density peaks for imbalanced data without parameters," IEEE Transactions on Knowledge and Data Engineering, vol. 35, no. 4, pp. 3419-3432, 2023.
  • [13] M. R. Ackermann, M. Märtens, C. Raupach, K. Swierkot, C. Lammersen, and C. Sohler, "StreamKM++: A clustering algorithm for data streams," ACM J. Exp. Algorithmics, vol. 17, 2012.
  • [14] D. Arthur and S. Vassilvitskii, "k-means++: The advantages of careful seeding," in Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 2007, pp. 1027-1035.
  • [15] D. Amagata, "Scalable and accurate density-peaks clustering on fully dynamic data," in Proceedings of 2022 IEEE International Conference on Big Data, 2022, pp. 445-454.
  • [16] T. Zhang, R. Ramakrishnan, and M. Livny, "BIRCH: A new data clustering algorithm and its applications," Data Mining and Knowledge Discovery, vol. 1, pp. 141-182, 1997.
  • [17] C. C. Aggarwal, P. S. Yu, J. Han, and J. Wang, "A framework for clustering evolving data streams," in Proceedings of 2003 VLDB Conference, 2003, pp. 81-92.
  • [18] Y. Chen and L. Tu, "Density-based clustering for real-time stream data," in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2007, pp. 133-142.
  • [19] F. Cao, M. Estert, W. Qian, and A. Zhou, "Density-based clustering over an evolving data stream with noise," in Proceedings of the 2006 SIAM International Conference on Data Mining, 2006, pp. 328-339.
  • [20] P. P. Rodrigues, J. Gama, and J. Pedroso, "Hierarchical clustering of time-series data streams," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 5, pp. 615-627, 2008.
  • [21] K.-S. Zhang, L. Zhong, L. Tian, X.-Y. Zhang, and L. Li, "DBIECM-an evolving clustering method for streaming data clustering," AMSE J, vol. 60, no. 1, pp. 239-254, 2017.
  • [22] M. Moshtaghi, J. C. Bezdek, S. M. Erfani, C. Leckie, and J. Bailey, "Online cluster validity indices for performance monitoring of streaming data clustering," International Journal of Intelligent Systems, vol. 34, no. 4, pp. 541-563, 2019.
  • [23] L. Huang, C.-D. Wang, H.-Y. Chao, and P. S. Yu, "MVStream: Multiview data stream clustering," IEEE Transactions on Neural Networks and Learning Systems, vol. 31, no. 9, pp. 3482-3496, 2020.
  • [24] J. Zhang, H. Tao, and C. Hou, "Imbalanced clustering with theoretical learning bounds," IEEE Transactions on Knowledge and Data Engineering, vol. 35, no. 9, pp. 9598-9612, 2023.
  • [25] M. Liu, X. Jiang, and A. C. Kot, "A multi-prototype clustering algorithm," Pattern Recognition, vol. 42, no. 5, pp. 689-698, 2009.
  • [26] T. Luo, C. Zhong, H. Li, and X. Sun, "A multi-prototype clustering algorithm based on minimum spanning tree," in Proceedings of 2010 Seventh International Conference on Fuzzy Systems and Knowledge Discovery, vol. 4, 2010, pp. 1602-1607.
  • [27] H. Xiong, J. Wu, and J. Chen, "K-means clustering versus validation measures: A data-distribution perspective," IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 39, no. 2, pp. 318-331, 2009.
  • [28] D. Li, S. Zhou, T. Zeng, and R. H. Chan, "Multi-prototypes convex merging based k-means clustering algorithm," arXiv preprint arXiv:2302.07045, 2023.
  • [29] T. Kohonen, "The self-organizing map," Proceedings of the IEEE, vol. 78, no. 9, pp. 1464-1480, 1990.
  • [30] A. Rauber, D. Merkl, and M. Dittenbach, "The growing hierarchical self-organizing map: Exploratory analysis of high-dimensional data," IEEE Transactions on Neural Networks, vol. 13, no. 6, pp. 1331-1341, 2002.
  • [31] C. Del Coso, D. Fustes, C. Dafonte, F. J. Nóvoa, J. M. Rodríguez-Pedreira, and B. Arcay, "Mixing numerical and categorical data in a self-organizing map by means of frequency neurons," Applied Soft Computing, vol. 36, pp. 246-254, 2015.
  • [32] D. Olszewski, "Asymmetric k-means clustering of the asymmetric self-organizing map," Neural Processing Letters, vol. 43, no. 1, pp. 231-253, 2016.
  • [33] A. A. Hameed, B. Karlik, M. S. Salman, and G. Eleyan, "Robust adaptive learning approach to self-organizing maps," Knowledge-Based Systems, vol. 171, pp. 25-36, 2019.
  • [34] Y. Zhang, M. Simsek, and B. Kantarci, "Empowering self-organized feature maps for AI-enabled modeling of fake task submissions to mobile crowdsensing platforms," IEEE Internet of Things Journal, vol. 8, no. 3, pp. 1334-1346, 2020.
  • [35] S. Li, F. Liu, L. Jiao, P. Chen, and L. Li, "Self-supervised self-organizing clustering network: A novel unsupervised representation learning method," IEEE Transactions on Neural Networks and Learning Systems, pp. 1-15, 2022.
  • [36] R. Zhu, Z. Wang, Z. Ma, G. Wang, and J.-H. Xue, "LRID: A new metric of multi-class imbalance degree based on likelihood-ratio test," Pattern Recognition Letters, vol. 116, pp. 36-42, 2018.
  • [37] N. P. Rougier and G. I. Detorakis, "Randomized self-organizing map," Neural Computation, vol. 33, no. 8, pp. 2241-2273, 2021.
  • [38] R. Bridson, "Fast Poisson disk sampling in arbitrary dimensions," SIGGRAPH Sketches, vol. 10, no. 1, 2007.
  • [39] S. Lloyd, "Least squares quantization in PCM," IEEE Transactions on Information Theory, vol. 28, no. 2, pp. 129-137, 1982.
  • [40] X. Huang, Y. Ye, and H. Zhang, "Extensions of kmeans-type algorithms: A new clustering framework by integrating intracluster compactness and intercluster separation," IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 8, pp. 1433-1446, 2014.
  • [41] J. Huang, M. Ng, H. Rong, and Z. Li, "Automated variable weighting in k-means type clustering," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 657-668, 2005.
  • [42] Y. A. Malkov and D. A. Yashunin, "Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 4, pp. 824-836, 2020.
  • [43] R. P. Duin, P. Juszczak, P. Paclik, E. Pekalska, D. De Ridder, D. M. Tax, and S. Verzakov, "PRTools4.1, a Matlab toolbox for pattern recognition," Delft University of Technology, vol. 2600, 2007.
  • [44] W. Tong, Y. Wang, D. Liu, and X. Guo, "A multi-center clustering algorithm based on mutual nearest neighbors for arbitrarily distributed data," Integrated Computer-Aided Engineering, vol. 29, no. 3, pp. 259-275, 2022.
  • [45] A. Fahad, N. Alshatri, Z. Tari, A. Alamri, I. Khalil, A. Y. Zomaya, S. Foufou, and A. Bouras, "A survey of clustering algorithms for big data: Taxonomy and empirical analysis," IEEE Transactions on Emerging Topics in Computing, vol. 2, no. 3, pp. 267-279, 2014.
Yongqi Xu is with the School of Computer Science and Technology, Guangdong University of Technology, Guangzhou, China. His current research interests include unsupervised machine learning and streaming data mining.
Yujian Lee is with the School of Artificial Intelligence, Beijing Normal University-Hong Kong Baptist University United International College, Zhuhai, China. His current research interests include unsupervised machine learning and deep clustering.
Rong Zou received the B.Eng. degree from Xiamen University in 2016, and the M.S. degree from Hong Kong Baptist University in 2017. He is currently pursuing a Ph.D. at Hong Kong Baptist University. His current research interests include machine learning, image processing, and their applications.
Yiqun Zhang (Senior Member, IEEE) received the B.Eng. degree from South China University of Technology in 2013, and the M.S. and Ph.D. degrees from Hong Kong Baptist University in 2014 and 2019, respectively. He is currently an associate professor at Guangdong University of Technology. His research works have been published in top-tier journals and conferences, including TPAMI, TCYB, TNNLS, AAAI, and IJCAI. His current research interests include machine learning and data mining. Dr. Zhang serves as an Associate Editor for the IEEE Transactions on Emerging Topics in Computational Intelligence.
Yiu-ming Cheung (Fellow, IEEE) received the Ph.D. degree from the Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong. He is currently a Chair Professor with the Department of Computer Science, Hong Kong Baptist University, Hong Kong. His research interests include machine learning, data science, visual computing, and optimization. Prof. Cheung is a Fellow of the American Association for the Advancement of Science (AAAS), the Institution of Engineering and Technology (IET), and the British Computer Society (BCS). He is also an awardee of the RGC Senior Research Fellow Scheme. He is the Editor-in-Chief of the IEEE Transactions on Emerging Topics in Computational Intelligence, and an Associate Editor for several prestigious journals, including IEEE Transactions on Cybernetics, IEEE Transactions on Cognitive and Developmental Systems, Pattern Recognition, Pattern Recognition Letters, and Neurocomputing, to name a few. For more details, please refer to: https://www.comp.hkbu.edu.hk/~ymc.