• No results found

that shown by surface mapping. Similarly, Gvishiani et al. [32] introduced a fuzzy based Discrete Perfect Sets (DPS) algorithm to recognize the earthquake-prone areas of California.

Here, the location of epicenter clusters recognized by the DPS algorithm is compared with the location recognized for Earthquake-Prone Areas (EPA) [33] method. This method uses only information of earthquake coordinates instead of any geophysical, geomorphological and geological attributes associated with a region.

2.5 Density based clustering approaches

Density-based clustering (DBC) algorithms categorize a dataset into meaningful sub-classes based on the crowdedness of points in large spatial databases. They consider clusters as high-density areas in the database separated by regions of low density. Among others, these algorithms are popular due to their potential for detecting clusters of arbitrary shape and ability to handle outliers present in a dataset. Further, these algorithms do not need a priori information about the number of clusters or their shapes as input parameters. The number of clusters evolves with the progress of the algorithm. In this thesis work, two most prominent algorithms are utilized and they are discussed below:

2.5.1 Density-based spatial clustering of applications with noise (DB- SCAN)

Density-based spatial clustering of applications with noise (DBSCAN) is proposed by Ester et al.[2] which utilizes a minimum number of input parameters to cluster large spatial databases effectively. It is a fundamental density-based approach for cluster analysis which does not require prior information about the number of clusters. It determines the number of clusters automatically, based on the estimated density with the progress of the algorithm. This uses a predefined neighborhood radius (ε) to estimate the density and discover clusters having a significant number of data points (MinPts). The algorithm is described with the help of some fundamental definitions, outlined as:

Definition 1: ε-distance neighborhood of a point – Theε neighborhood of a point pin the datasetEN×Dis defined as

Nε(p) ={q∈P:d(p,q)≤ε} (2.16) whered(.)represents Euclidean distance between two points.

Definition 2: Direct density-reachable– A pointqis said to be directly density reachable from point pafter considerationε andMinPtsif

q∈Nε(p) (2.17) and

|Nε(p)| ≥MinPts (2.18)

where|Nε(p)|represents number of data points inNε(p).

Here, there are two types of points lying in a cluster, points inside the cluster (core points) and points on the border of the cluster (border points). The density reachable is symmetric when it involves two core points (points p andq). But in case of one core point and one border point it becomes non-symmetric (border pointqis directly reachable from pbut not vice-versa).

Definition 3Density-reachable – A pointqis density-reachable from a point pif there is a sequence of data points q1,q2, ...,qj with q1= p and qj=q such that qj+1 is directly density-reachable fromqj.

Definition 4Density-connected – Two points pandqare density-connected with respect to ε and MinPtsif there exists a point xsuch that both pandqare density-reachable fromx after considerationε andMinPts.

Definition 5Cluster – A clusterC∈EN×Dis a non-empty set satisfying the condition of maximality and connectivity defined as follows:

• Maximality: For all p,q∈N×D,if p∈Candqare density-reachable from pwith respect toεandMinPtsthenq∈C.

• Connectivity: For all p,q∈C, pandqare density connected whenε andMinPtsare considered.

Definition 6Outliers – LetC1, ..Ck be clusters ofEN×D, then Outliers are defined as a set of points inEN×Dwhich do not belong to any clusterCiwith respect toε andMinPts.

These definitions allow the DBSCAN algorithm to discover clusters of arbitrary shape in the presence of outliers. Nanda and Panda [34] reduced the computational complexity associated with the DBSCAN by efficiently implementing new merging criteria at the initial stage of evolution of clusters. The algorithm is then applied to detect the hazardous regions present in the earthquake catalog of Japan.

Georgoulas et al. [35] proposed a seismic-mass density-based clustering (SM-DBSCAN) scheme to determine the potential seismic sources in the vicinity of the Hellenic arc. The algorithm considers a weight to each data point by its earthquake magnitude (seismic mass) in the estimation of density. Space and time information is incorporated in the algorithm to discover the clusters that come from the same fault but at different time instances.

2.5 Density based clustering approaches 19 Recently, Scitovski introduced a modified implementation of the DBSCAN algorithm called ‘Rough DBSCAN’ [36] to determine non-convex shapes of seismic zones present in an area of the Republic of Croatia. The author also paid attention to determine the ε parameter which can influence the number and configuration of earthquake zones.

2.5.2 Clustering by fast search and finding of density peaks (CFSFDP)

Clustering by fast search and finding of density peaks also known as density peak clustering (DPC) is reported in [3]. It is based on the assumption that cluster centers are surrounded by neighbors with less local density and they are at a relatively large distance from the points with a higher local density. By this assumption, the algorithm primarily determines the cluster centers accurately; then remaining points are assigned to the same clusters as its nearest neighbor with higher density. Further, the advantage is, the algorithm finds out cluster number intuitively and able to detect clusters regardless of their shapes and the dimension of given data. The algorithm also determines the membership of a point to a cluster by considering not only connectivity but also the separation of points. Therefore its performance is robust for the radius of a neighborhood, compared to DBSCAN [2]. The clusters obtained from DPC mainly depends on computation of two important quantities for eachithdata point:

one is its local densityρiand another is its distanceδifrom points of higher density.

DPC algorithm determines the local density (ρ) for each point present in the dataset EN×Din three different ways.

1.Counting method: Theρiofithpoint can be determined by counting the number of points with a hard threshold. It is defined as

ρi =∑

j

χ(d(xi,xj)−dc)where χ(x) =

1 if x<0 0 otherwise

(2.19)

It increases the counter by one wheneverρiis equal to the number of points that are closer thandctoithpoint. Thedcis a predefined cut-off distance parameter. The similarity measure d(xi,xj)between pointxiandxjis taken as Euclidean distance.

2. Kernel based method: In this approach, DPC adopts kernel density estimation technique.

Here, Gaussian function is utilized for kernel andρi(soft threshold) is determined as ρi=

j

exp

−d(xi,xj)2 dc2

(2.20) In both approaches, choice ofdcis important to obtain the robust local densityρi. Generally, it is taken as the average number of neighbors around 1-2% of the total number of points in

the dataset.

3.K-nearest neighbor (kNN) based method:K-nearest neighbor (KNN) based approach given in [37, 38] for determining the local densityρ. It require ‘K’ as input to find the K-nearest neighbor instead of ‘dc’. The setting of ‘K’ is feasible compare to ‘dc’ to get the robust performance of the algorithm. Ifηk is theKth nearest neighbor ofxithen KNN ofxi is

κη(xi) ={ j ∈ EN×D|d(xi,xj)≤d(xik(xi))} (2.21) Then,ρiis defined as:

ρi=exp

−1 k

xj

d(xi,xj)2

,∀xj∈κη(xi) (2.22) After obtaining the local densityρifrom any of the above methods,δiis easily computed by finding the minimum distance between pointxi and any other point with higher local density as follows:

δi=





minj d(xi,xj) ifρji

maxj d(xi,xj) otherwise

(2.23)

Based on these two quantities ρ, δ: a decision graph is drawn between ρ and δ to determine the cluster centroids. The points in a graph for which value of δ and ρ are comparatively high (i.e., points have density peak and large separation) are considered as cluster centroids in the DPC. After that, the remaining data points are assigned to the same cluster as its nearest neighbor of higher density.

Outliers present in a catalog are determined by finding a border region for each cluster in DPC (without using a noise-signal cutoff as in DBSCAN algorithm). It is also defined as the set of points assigned to that cluster (within a distancedc from data points belonging to other clusters. Then one finds the point of highest density within its border region for each cluster by defining its density asρb. The points of the cluster). whose density is higher than ρbare assigned as a part of the cluster, and the rest of the points assigned to the cluster are considered as outliers. An improved method for outlier detection in DPC [38] is presented in the following steps:

Step 1: If E is the whole dataset of sizeN, then the kNN distance δiK of a point iis determined as:

δiK=max

j∈κηi di j (2.24)