API Reference#
This page documents the implementation details of the radius_clustering package.
RadiusClustering Class#
- class radius_clustering.RadiusClustering(manner: str = 'approx', radius: float = 0.5, threshold=None, random_state: int | None = None)[source]#
Bases: ClusterMixin, BaseEstimator
Radius Clustering algorithm.
This class implements clustering based on the Minimum Dominating Set (MDS) problem. It can use either an exact or approximate method for solving the MDS problem.
Parameters:#
- manner : str, optional (default="approx")
The method to use for solving the MDS problem. Can be "exact" or "approx".
- radius : float, optional (default=0.5)
The dissimilarity threshold to act as radius constraint for the clustering.
- threshold : float, optional (default=None)
Deprecated alias for radius. Use radius instead.
- random_state : int | None, optional (default=None)
The random state to use for reproducibility in the approximate method.
Attributes:#
- X : array-like, shape (n_samples, n_features)
The input data.
- centers_ : list
The indices of the cluster centers.
- labels_ : array-like, shape (n_samples,)
The cluster labels for each point in the input data.
- effective_radius_ : float
The maximum distance between any point and its assigned cluster center.
- random_state_ : int | None
The random state used for reproducibility. If None, no random state is set.
Note
The random_state_ attribute is not used when the manner is set to “exact”.
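To make the relationship between these attributes concrete, here is a small illustrative sketch (not the package's internal code; the data and center indices are made up): given a symmetric distance matrix and the chosen center indices, each point is assigned to its nearest center, and the effective radius is the largest such assignment distance.

```python
import numpy as np

# Illustrative only: how labels_ and effective_radius_ relate to centers_.
# Pairwise distance matrix for three points (symmetric, zero diagonal).
dist = np.array([
    [0.0, 1.0, 4.0],
    [1.0, 0.0, 3.0],
    [4.0, 3.0, 0.0],
])
centers_ = [0, 2]  # hypothetical indices of the cluster centers

# Each point gets the label of its nearest center...
labels_ = np.argmin(dist[:, centers_], axis=1)
# ...and the effective radius is the largest point-to-center distance.
effective_radius_ = dist[:, centers_].min(axis=1).max()
```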
Changed in version 1.4.0: The RadiusClustering class has been refactored. Clustering algorithms are now separated into their own module (algorithms.py) to improve maintainability and extensibility.
Added in version 1.4.0: The set_solver method was added to allow users to set a custom solver for the MDS problem. This allows for flexibility in how the MDS problem is solved and enables users to use their own implementations of MDS clustering algorithms.
Added in version 1.3.0:
- The random_state parameter was added to allow reproducibility in the approximate method.
- The radius parameter replaces the threshold parameter for setting the dissimilarity threshold, for better clarity and consistency.
Changed in version 1.3.0: All publicly accessible attributes are now suffixed with an underscore (e.g., centers_, labels_). This is particularly useful for compatibility with scikit-learn’s API.
Deprecated since version 1.3.0: The threshold parameter is deprecated. Use radius instead. Will be removed in a future version.
- fit(X: np.ndarray, y: None = None, metric: str | callable = 'euclidean') RadiusClustering [source]#
Fit the MDS clustering model to the input data.
This method computes the distance matrix if the input is a feature matrix, or uses the provided distance matrix directly if the input is already a distance matrix.
Note
If the input is a distance matrix, it should be symmetric and square. If the input is a feature matrix, the distance matrix will be computed using Euclidean distance.
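The symmetry and squareness condition from the note above can be checked with a short helper (a hypothetical sketch, not part of the package's API):

```python
import numpy as np

def looks_like_distance_matrix(X) -> bool:
    """Hypothetical helper: True if X is square and symmetric."""
    X = np.asarray(X)
    return X.ndim == 2 and X.shape[0] == X.shape[1] and bool(np.allclose(X, X.T))
```

A symmetric square matrix such as [[0, 1], [1, 0]] passes the check, while a rectangular feature matrix does not.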
Tip
A future version will support providing different metrics or even custom callables to compute the distance matrix.
Parameters:#
- X : array-like, shape (n_samples, n_features)
The input data to cluster. X should be a 2D array-like structure. It can be either:
- A distance matrix (symmetric, square) with shape (n_samples, n_samples).
- A feature matrix with shape (n_samples, n_features), from which the distance matrix will be computed.
- y : Ignored
Not used, present here for API consistency by convention.
- metric : str | callable, optional (default="euclidean")
The metric to use when computing the distance matrix. The default is "euclidean". This should be a valid metric string from sklearn.metrics.pairwise_distances or a callable that computes the distance between two points.
Note
The metric parameter MUST be a valid metric string from sklearn.metrics.pairwise_distances or a callable that computes the distance between two points. Valid metric strings include:
- "euclidean"
- "manhattan"
- "cosine"
- "minkowski"
- and many more supported by scikit-learn; please refer to the sklearn.metrics.pairwise_distances documentation for a full list.
Attention
If the input is a distance matrix, the metric parameter is ignored. The distance matrix should be symmetric and square.
Warning
If the parameter is a callable, it should:
- Accept two 1D arrays as input.
- Return a single float value representing the distance between the two points.
Returns:#
- self : object
Returns self.
Examples:#
>>> from radius_clustering import RadiusClustering
>>> from sklearn import datasets
>>> # Load the Iris dataset
>>> iris = datasets.fetch_openml(name="iris", version=1, parser="auto")
>>> X = iris["data"]  # Use dictionary-style access instead of attribute access
>>> rad = RadiusClustering(manner="exact", threshold=1.43).fit(
...     X
... )  # Threshold set to 1.43 because it is the optimal
...     # threshold for the Iris dataset
>>> rad.centers_
[96, 49, 102]
For examples on common datasets and differences with kmeans, see Iris Dataset Clustering Example
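To illustrate the callable-metric contract described above, here is a minimal sketch of a custom metric (the function name is hypothetical): it accepts two 1D arrays and returns a single float.

```python
import numpy as np

def manhattan_metric(a: np.ndarray, b: np.ndarray) -> float:
    """Hypothetical custom metric: two 1D arrays in, one float out."""
    return float(np.sum(np.abs(a - b)))

# |0 - 3| + |1 - 5| = 7.0
d = manhattan_metric(np.array([0.0, 1.0]), np.array([3.0, 5.0]))
```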
- fit_predict(X: np.ndarray, y: None = None, metric: str | callable = 'euclidean') np.ndarray [source]#
Fit the model and return the cluster labels.
This method is a convenience function that combines fit and predict.
Parameters:#
- X : array-like, shape (n_samples, n_features)
The input data to cluster. X should be a 2D array-like structure. It can be either:
- A distance matrix (symmetric, square) with shape (n_samples, n_samples).
- A feature matrix with shape (n_samples, n_features), from which the distance matrix will be computed.
- y : Ignored
Not used, present here for API consistency by convention.
- metric : str | callable, optional (default="euclidean")
The metric to use when computing the distance matrix. The default is "euclidean". Refer to the fit method for more details on valid metrics.
Returns:#
- labels : array, shape (n_samples,)
The cluster labels for each point in X.
- set_fit_request(*, metric: bool | None | str = '$UNCHANGED$') RadiusClustering #
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
Parameters:#
- metric : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for the metric parameter in fit.
Returns:#
- self : object
The updated object.
- set_solver(solver: callable) None [source]#
Set a custom solver for resolving the MDS problem. This method allows users to replace the default MDS solver with a custom one.
An example is provided below and in the example gallery: Benchmark of Radius Clustering using multiple datasets and comparison with custom MDS
Important
The custom solver must accept the same parameters as the default solvers and return a tuple containing the cluster centers and the execution time. e.g., it should have the signature:
>>> def custom_solver(
...     n: int,
...     edges: np.ndarray,
...     nb_edges: int,
...     random_state: int | None = None,
... ) -> tuple[list, float]:
...     # Custom implementation details
...     centers = [...]
...     exec_time = ...
...     # Return the centers and execution time
...     return centers, exec_time
This allows for flexibility in how the MDS problem is solved.
Parameters:#
- solver : callable
The custom solver function to use for MDS clustering. It should accept the same parameters as the default solvers and return a tuple containing the cluster centers and the execution time.
Raises:#
- ValueError
If the provided solver does not have the correct signature.
Algorithms Module#
This module contains the implementation of the clustering algorithms. It provides two main functions: clustering_approx and clustering_exact.
These functions can be replaced in the RadiusClustering class to perform clustering using another algorithm.
Added in version 1.4.0: The code structure was refactored to separate the clustering algorithms into their own module. This allows for easier maintenance and extensibility of the codebase.
- radius_clustering.algorithms.clustering_approx(n: int, edges: ndarray, nb_edges: int, random_state: int | None = None) None [source]#
Perform approximate MDS clustering. This method uses a small trick to set the seed for the random state in the C++ code of the MDS solver.
Tip
The random state is used to ensure reproducibility of the results when using the approximate method. If random_state is None, a default value of 42 is used.
Important
The trick to set the random state is:
1. Use the check_random_state function to get a RandomState singleton instance, set up with the provided random_state.
2. Use the randint method of the RandomState instance to generate a random integer.
3. Use this random integer as the seed for the C++ code of the MDS solver.
This ensures that the seed passed to the C++ code is always an integer, which is required by the MDS solver, and allows for reproducibility of the results.
Note
This function uses the approximation method to solve the MDS problem. See [casado] for more details.
Parameters:#
- n : int
The number of points in the dataset.
- edges : np.ndarray
The edges of the graph, flattened into a 1D array.
- nb_edges : int
The number of edges in the graph.
- random_state : int | None
The random state to use for reproducibility. If None, a default value of 42 is used.
Returns:#
- centers : list
A sorted list of the centers of the clusters.
- mds_exec_time : float
The execution time of the MDS algorithm in seconds.
- radius_clustering.algorithms.clustering_exact(n: int, edges: ndarray, nb_edges: int, seed: None = None) None [source]#
Perform exact MDS clustering.
This function uses the EMOS algorithm to solve the MDS problem.
Important
The EMOS algorithm is an exact algorithm for solving the MDS problem. It is a branch and bound algorithm that uses graph theory tricks to efficiently cut the search space. See [jiang] for more details.
Parameters:#
- n : int
The number of points in the dataset.
- edges : np.ndarray
The edges of the graph, flattened into a 1D array.
- nb_edges : int
The number of edges in the graph.
- seed : None
This parameter is not used in the exact method, but it is kept for compatibility with the approximate method.
Returns:#
- centers : list
A sorted list of the centers of the clusters.
- mds_exec_time : float
The execution time of the MDS algorithm in seconds.