Today's guest is Balaji Padmanathan, Data Scientist & Master's Graduate at University College Dublin. Balaji won the Best Application of AI in a Student Project award at the 2023 AI Awards for his project on optimising electric vehicle charging station locations in Dublin. The aim of the project was to make Dublin more sustainable and to serve as a model for global efforts to combat climate change and promote responsible urban development. Topics include:
- Optimising electric vehicle charging station locations for viability
- Pivoting to DBSCAN and ranking methods due to data scarcity
- Using open-source data such as OpenStreetMap & OpenChargeMap for the analysis
- Validating the analysis against existing EV stations to assess accuracy
Maria Kubara, a doctoral candidate in economics and finance at the Doctoral School of Social Sciences, talks about whether startups move in herds, that is, about studying the location patterns of startups, answering Dr Justyna Pokojska's questions about:
→ what spatial factors determine the chances of getting started?
→ where did the interest in studying startups come from?
→ what factors influence startup success?
→ how do Warsaw's districts differ in this respect?
→ which location factors affect startups?
→ why do startups travel in herds?
→ what are the assumptions behind the neural network method used in the study?
→ what is the expert's research methodology based on - DBSCAN, kernel density estimation?
→ what are the capabilities of the constructed model?
→ what outcomes will the research deliver?
→ in which fields can recurrent neural networks be used?
The second set of pulsar discoveries by CHIME FRB Pulsar: 14 Rotating Radio Transients and 7 pulsars by Fengqiu Adam Dong et al. on Tuesday 18 October The Canadian Hydrogen Intensity Mapping Experiment (CHIME) is a radio telescope located in British Columbia, Canada. The large field of view (FOV) of ~200 square degrees has enabled the CHIME/FRB instrument to produce the largest FRB catalog to date. The large FOV also allows CHIME/FRB to be an exceptional pulsar and Rotating Radio Transient (RRAT) finding machine, despite saving only the metadata of incoming Galactic events. We have developed a pipeline to search for pulsars/RRATs using DBSCAN, a clustering algorithm. Output clusters are then inspected by a human for pulsar/RRAT candidates, and follow-up observations are scheduled with the more sensitive CHIME/Pulsar instrument. The CHIME/Pulsar instrument is capable of a near-daily search-mode observation cadence. We have thus developed the CHIME/Pulsar Single Pulse Pipeline to automate the processing of CHIME/Pulsar search-mode data. We report the discovery of 21 new Galactic sources, with 14 RRATs, 6 regular slow pulsars and 1 binary system. Owing to CHIME/Pulsar's daily observations, we have obtained timing solutions for 8 of the 14 RRATs along with all the regular pulsars. This demonstrates CHIME/Pulsar's ability to find timing solutions for transient sources. arXiv: http://arxiv.org/abs/2210.09172v1
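As a rough sketch of the clustering step described here (this is not the CHIME/FRB pipeline, and the event columns, scales, and eps value are illustrative assumptions), the idea is to group single-pulse detection metadata so that repeated events from one candidate source fall into the same DBSCAN cluster, with isolated events left as noise:

```python
# Hypothetical sketch: cluster detection metadata (sky position, dispersion
# measure) with DBSCAN so repeated events from one source group together.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Toy event metadata: (RA deg, Dec deg, DM pc/cm^3); all values are made up.
events = np.vstack([
    rng.normal([83.6, 22.0, 56.7], [0.05, 0.05, 0.5], size=(20, 3)),    # repeating source
    rng.normal([161.4, -59.7, 302.0], [0.05, 0.05, 0.5], size=(8, 3)),  # another source
    rng.uniform([0, -90, 10], [360, 90, 1000], size=(30, 3)),           # isolated events
])

X = StandardScaler().fit_transform(events)
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

# Clusters (label >= 0) become candidates for human inspection; -1 is noise.
for lab in sorted(set(labels)):
    print(lab, int(np.sum(labels == lab)))
```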
Unveiling hidden stellar aggregates in the Milky Way: 1656 new star clusters found in Gaia EDR3 by Zhihong He et al. on Monday 19 September We report 1,656 new star clusters found in the Galactic disk (|b|
In today's episode, Gregory Glatzer explained his machine learning project that involved predicting elephant movement and settlement, in a bid to limit the activities of poachers. He used two machine learning algorithms, DBSCAN and K-Means clustering, at different stages of the project. Listen to learn why these two techniques were useful and what conclusions could be drawn.
Unsupervised learning for finding fraud cases
We present 3 clustering algorithms which will help us detect anomalies: DBSCAN, Gaussian Mixture Models and K-means. These 3 algorithms are very popular and basic but have passed the test of time. All these algorithms have many variations which try to overcome some of the disadvantages of the original implementation.
This time we present three clustering techniques that help us detect anomalies: DBSCAN, Gaussian Mixture Models and K-means. These three algorithms are among the most popular and basic ones, and newer variants have been developed from them that resolve some of the drawbacks initially identified in their implementations.
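A minimal sketch of how each of the three algorithms mentioned above can be turned into an anomaly detector (the toy data, thresholds, and parameter values are illustrative choices, not recommendations from the episode):

```python
# DBSCAN flags noise points; GMM flags low-likelihood points; K-means flags
# points far from their assigned centre.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN, KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
normal, _ = make_blobs(n_samples=500, centers=2, cluster_std=0.8, random_state=42)
outliers = rng.uniform(-10, 10, size=(10, 2))
X = np.vstack([normal, outliers])

# 1) DBSCAN: points labelled -1 (noise) are treated as anomalies.
db_anomalies = DBSCAN(eps=0.5, min_samples=5).fit_predict(X) == -1

# 2) Gaussian Mixture Model: flag points with low likelihood under the fit.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
log_lik = gmm.score_samples(X)
gmm_anomalies = log_lik < np.percentile(log_lik, 2)      # bottom 2%

# 3) K-means: flag points far from their assigned cluster centre.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
km_anomalies = dist > np.percentile(dist, 98)             # top 2%

print(db_anomalies.sum(), gmm_anomalies.sum(), km_anomalies.sum())
```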
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.10.29.361451v1?rss=1 Authors: Koch, F. C., Sutton, G. J., Voineagu, I., Vafaee, F. C. Abstract: A typical single-cell RNA sequencing (scRNA-seq) experiment will measure on the order of 20,000 transcripts and thousands, if not millions, of cells. The high dimensionality of such data presents serious complications for traditional data analysis methods and, as such, methods to reduce dimensionality play an integral role in many analysis pipelines. However, few studies benchmark the performance of these methods on scRNA-seq data, with existing comparisons assessing performance via downstream analysis accuracy measures which may confound the interpretation of their results. Here, we present the most comprehensive benchmark of dimensionality reduction methods in scRNA-seq data to date, utilizing over 300,000 compute hours to assess the performance of over 25,000 low-dimensional embeddings across 33 dimensionality reduction methods and 55 scRNA-seq datasets (ranging from 66 to 27,500 cells). We employ a simple-yet-novel approach which does not rely on the results of downstream analyses. Internal validation measures (IVMs), traditionally used as an unsupervised method to assess clustering performance, are repurposed to measure how well-formed biological clusters are after dimensionality reduction. Performance was further evaluated using nearly 200,000,000 iterations of DBSCAN, a density-based clustering algorithm, showing that hyperparameter optimization using IVMs as the objective function leads to near-optimal clustering. Methods were also assessed on the extent to which they preserve the global structure of the data, and on their computational memory and time requirements across a large range of sample sizes. Our comprehensive benchmarking analysis provides a valuable resource for researchers and aims to guide best practice for dimensionality reduction in scRNA-seq analyses, and we highlight LDA (Latent Dirichlet Allocation) and PHATE (Potential of Heat-diffusion for Affinity-based Transition Embedding) as high-performing algorithms. Copyright belongs to the original authors. Visit the link for more info.
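A hedged sketch of the general idea behind IVM-driven hyperparameter optimization: use an internal validation measure (here the silhouette score, one common IVM) as the objective when choosing DBSCAN's eps on a low-dimensional embedding. The dataset, the PCA embedding, and the parameter grid are stand-ins, not the paper's setup:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

X = load_iris().data
embedding = PCA(n_components=2).fit_transform(X)   # stand-in for any DR method

best = (None, -1.0)
for eps in np.linspace(0.1, 1.0, 19):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(embedding)
    mask = labels != -1                             # ignore noise points
    if len(set(labels[mask])) < 2:                  # silhouette needs >= 2 clusters
        continue
    score = silhouette_score(embedding[mask], labels[mask])
    if score > best[1]:
        best = (eps, score)

print("best eps:", best[0], "silhouette:", round(best[1], 3))
```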
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.09.25.285619v1?rss=1 Authors: Zhang, W., Claesen, M., Moerman, T., Groseclose, M. R., Waelkens, E., De Moor, B., Verbeeck, N. Abstract: Computational analysis is crucial to capitalize on the wealth of spatio-molecular information generated by mass spectrometry imaging (MSI) experiments. Currently, the spatial information available in MSI data is often under-utilized, due to the challenges of in-depth spatial pattern extraction. The advent of deep learning has greatly facilitated such complex spatial analysis. In this work, we use a pre-trained neural network to extract high-level features from ion images in MSI data, and test whether this improves downstream data analysis. The resulting neural network interpretations of ion images, coined neural ion images, are used to cluster ion images based on spatial expressions. We evaluate the impact of neural ion images on two ion image clustering pipelines, namely DBSCAN clustering, combined with UMAP-based dimensionality reduction, and k-means clustering. In both pipelines, we compare regular and neural ion images from two different MSI datasets. All tested pipelines could extract underlying spatial patterns, but the neural network-based pipelines provided better assignment of ion images, with more fine-grained clusters, and greater consistency in the spatial structures assigned to individual clusters. Additionally, we introduce the Relative Isotope Ratio metric to quantitatively evaluate clustering quality. The resulting scores show that isotopic m/z values are more often clustered together in the neural network-based pipeline, indicating improved clustering outcomes. The usefulness of neural ion images extends beyond clustering towards a generic framework to incorporate spatial information into any MSI-focused machine learning pipeline, both supervised and unsupervised. Copyright belongs to the original authors. Visit the link for more info.
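A hedged sketch of the pipeline shape described in this abstract: embed each ion image with a pre-trained CNN, reduce the embeddings with UMAP, then cluster with DBSCAN. It assumes torch/torchvision (recent enough to have the weights API) and umap-learn are installed; the ResNet-18 backbone, image size, and parameters are illustrative stand-ins, not the authors' configuration, and input normalisation is omitted for brevity:

```python
import numpy as np
import torch
import torchvision.models as models
import umap                                 # pip install umap-learn
from sklearn.cluster import DBSCAN

# Toy "ion images": 64 single-channel intensity maps of size 96x96.
ion_images = np.random.rand(64, 96, 96).astype(np.float32)

# Pre-trained backbone used as a fixed feature extractor.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()           # drop the classification head
backbone.eval()

with torch.no_grad():
    batch = torch.from_numpy(ion_images).unsqueeze(1).repeat(1, 3, 1, 1)
    features = backbone(batch).numpy()      # one 512-d feature vector per image

embedding = umap.UMAP(n_neighbors=15, n_components=2).fit_transform(features)
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(embedding)
print("clusters found:", len(set(labels) - {-1}))
```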
DBSCAN is a density-based clustering algorithm for doing unsupervised learning. It's pretty nifty: with just two parameters, you can specify "dense" regions in your data, and grow those regions out organically to find clusters. In particular, it can fit irregularly-shaped clusters, and it can also identify outlier points that don't belong to any of the clusters. Pretty cool!
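In scikit-learn terms, those two parameters are eps (the neighbourhood radius) and min_samples (how many neighbours make a region "dense"); a minimal example on the classic two-moons data, with values chosen only for illustration:

```python
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

print("clusters:", sorted(set(labels) - {-1}))   # the two crescent shapes
print("outliers:", int((labels == -1).sum()))    # points in no cluster get label -1
```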
Fakultät für Mathematik, Informatik und Statistik - Digitale Hochschulschriften der LMU - Teil 02/02
Data-intensive applications like biology, medicine, and neuroscience require effective and efficient data mining technologies. Advanced data acquisition methods produce data of constantly increasing volume and complexity. As a consequence, the need for new data mining technologies to deal with complex data has emerged during the last decades. In this thesis, we focus on the data mining task of clustering, in which objects are separated into different groups (clusters) such that objects inside a cluster are more similar than objects in different clusters. In particular, we consider density-based clustering algorithms and their applications in biomedicine. The core idea of the density-based clustering algorithm DBSCAN is that each object within a cluster must have a certain number of other objects inside its neighborhood. Compared with other clustering algorithms, DBSCAN has many attractive benefits, e.g., it can detect clusters of arbitrary shape and is robust to outliers. Thus, DBSCAN has attracted a lot of research interest during the last decades, with many extensions and applications. In the first part of this thesis, we aim at developing new algorithms based on the DBSCAN paradigm to deal with the new challenges of complex data, particularly expensive distance measures and incomplete availability of the distance matrix. Like many other clustering algorithms, DBSCAN suffers from poor performance when facing expensive distance measures for complex data. To tackle this problem, we propose a new algorithm based on the DBSCAN paradigm, called Anytime Density-based Clustering (A-DBSCAN), that works in an anytime scheme: in contrast to the original batch scheme of DBSCAN, the algorithm A-DBSCAN first produces a quick approximation of the clustering result and then continuously refines the result during the further run. Experts can interrupt the algorithm, examine the results, and choose between (1) stopping the algorithm at any time whenever they are satisfied with the result to save runtime and (2) continuing the algorithm to achieve better results. Such an anytime scheme has proven in the literature to be a very useful technique when dealing with time-consuming problems. We also introduce an extended version of A-DBSCAN called A-DBSCAN-XS, which is more efficient and effective than A-DBSCAN when dealing with expensive distance measures. Since DBSCAN relies on the cardinality of the neighborhood of objects, it requires the full distance matrix to operate. For complex data, these distances are usually expensive, time-consuming or even impossible to acquire due to high cost, high time complexity, noisy and missing data, etc. Motivated by these potential difficulties of acquiring the distances among objects, we propose another approach for DBSCAN, called Active Density-based Clustering (Act-DBSCAN). Given a budget limitation B, Act-DBSCAN is only allowed to use up to B pairwise distances, ideally producing the same result as if it had the entire distance matrix at hand. The general idea of Act-DBSCAN is that it actively selects the most promising pairs of objects to calculate the distances between them and tries to approximate as much as possible the desired clustering result with each distance calculation. This scheme provides an efficient way to reduce the total cost needed to perform the clustering. Thus, it limits the potential weakness of DBSCAN when dealing with the distance sparseness problem of complex data.
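To make the core idea stated above concrete, here is a tiny illustration of DBSCAN's density test only (not the thesis algorithms, and not a full DBSCAN implementation): a point is a "core point" if its eps-neighbourhood contains at least min_pts objects, itself included.

```python
import numpy as np

def core_points(X, eps=0.5, min_pts=4):
    """Return a boolean mask of which rows of X satisfy the core-point criterion."""
    diffs = X[:, None, :] - X[None, :, :]           # pairwise difference vectors
    dists = np.linalg.norm(diffs, axis=-1)          # full distance matrix
    neighbour_counts = (dists <= eps).sum(axis=1)   # counts include the point itself
    return neighbour_counts >= min_pts

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]])
print(core_points(X, eps=0.3, min_pts=4))  # dense block -> True, isolated point -> False
```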
As a fundamental data clustering algorithm, density-based clustering has many applications in diverse fields. In the second part of this thesis, we focus on an application of density-based clustering in neuroscience: the segmentation of the white matter fiber tracts in human brain acquired from Diffusion Tensor Imaging (DTI). We propose a model to evaluate the similarity between two fibers as a combination of structural similarity and connectivity-related similarity of fiber tracts. Various distance measure techniques from fields like time-sequence mining are adapted to calculate the structural similarity of fibers. Density-based clustering is used as the segmentation algorithm. We show how A-DBSCAN and A-DBSCAN-XS are used as novel solutions for the segmentation of massive fiber datasets and provide unique features to assist experts during the fiber segmentation process.
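A hedged sketch of the general pattern this describes (not the thesis code): combine a "structural" and a "connectivity-related" distance matrix into one weighted distance and hand it to DBSCAN via a precomputed metric. The synthetic features, the weight alpha, and eps are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances
from sklearn.cluster import DBSCAN

# Stand-ins for per-fiber descriptions under two similarity views.
struct_feats, _ = make_blobs(n_samples=120, centers=3, cluster_std=0.5, random_state=0)
conn_feats = struct_feats + np.random.default_rng(1).normal(scale=0.2, size=struct_feats.shape)

d_struct = pairwise_distances(struct_feats)   # "structural" distances
d_conn = pairwise_distances(conn_feats)       # "connectivity-related" distances

alpha = 0.7                                   # weight on the structural component
combined = alpha * d_struct + (1 - alpha) * d_conn

labels = DBSCAN(eps=1.0, min_samples=5, metric="precomputed").fit_predict(combined)
print("segments found:", len(set(labels) - {-1}))
```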
Fakultät für Mathematik, Informatik und Statistik - Digitale Hochschulschriften der LMU - Teil 01/02
Due to the rapid development of computer technology and new methods for the extraction of data in the last few years, more and more database applications have emerged for which an efficient and effective similarity search is of great importance. Application areas of similarity search include multimedia, computer-aided engineering, marketing, image processing and many more. Of special interest is the task of finding similar objects in large amounts of data with complex representations. For example, set-valued objects as well as tree- or graph-structured objects are among these complex object representations. The grouping of similar objects, the so-called clustering, is a fundamental analysis technique which makes it possible to search through extensive data sets. The goal of this dissertation is to develop new efficient and effective methods for similarity search in large quantities of complex objects. Furthermore, the efficiency of existing density-based clustering algorithms is to be improved when applied to complex objects. The first part of this work motivates the use of vector sets for similarity modeling. For this purpose, a metric distance function is defined, which is suitable for various application areas but time-consuming to compute. Therefore, a filter-refinement technique is proposed to efficiently process range queries and k-nearest neighbor queries, two basic query types within the field of similarity search. Several filter distances are presented, which approximate the exact object distance and can be computed efficiently. Moreover, a multi-step query processing approach is described, which can be directly integrated into the well-known density-based clustering algorithms DBSCAN and OPTICS. In the second part of this work, new application areas for density-based hierarchical clustering using OPTICS are discussed. A prototype is introduced, which has been developed for these new application areas and is based on the aforementioned similarity models and accelerated clustering algorithms for complex objects. This prototype facilitates interactive semi-automatic cluster analysis and allows visual search for similar objects in multimedia databases. Another prototype extends these concepts and enables the user to analyze multi-represented and multi-instance data. Finally, the problem of music genre classification is addressed as another application supporting multi-represented and multi-instance data objects. An extensive experimental evaluation examines the efficiency and effectiveness of the presented techniques using real-world data and points out advantages in comparison to conventional approaches.
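A hedged sketch of the multi-step (filter/refinement) range query idea described above: a cheap lower-bounding filter distance prunes candidates, and the expensive exact distance is computed only for the survivors. The distances used here are simple stand-ins, not the dissertation's vector-set measures.

```python
import numpy as np

def exact_distance(a, b):
    """Stand-in for an expensive exact distance (here: plain Euclidean)."""
    return float(np.linalg.norm(a - b))

def filter_distance(a, b):
    """Cheap lower bound: |mean(a) - mean(b)| = |mean(a-b)| <= ||a-b||_2 / sqrt(d) <= ||a-b||_2."""
    return abs(float(a.mean() - b.mean()))

def range_query(query, database, eps):
    result, refinements = [], 0
    for obj in database:
        if filter_distance(query, obj) > eps:   # filter step: safe to prune
            continue
        refinements += 1                        # refinement step: exact distance
        if exact_distance(query, obj) <= eps:
            result.append(obj)
    return result, refinements

rng = np.random.default_rng(0)
db = [rng.normal(loc=rng.uniform(-5, 5), size=8) for _ in range(1000)]
hits, refinements = range_query(db[0], db, eps=2.0)
print(len(hits), "hits,", refinements, "exact distance computations out of", len(db))
```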
Fakultät für Mathematik, Informatik und Statistik - Digitale Hochschulschriften der LMU - Teil 01/02
Modern automated methods for measurement, collection, and analysis of data in industry and science are providing more and more data with drastically increasing structure complexity. On the one hand, this growing complexity is justified by the need for a richer and more precise description of real-world objects; on the other hand, it is justified by the rapid progress in measurement and analysis techniques that allow the user a versatile exploration of objects. In order to manage the huge volume of such complex data, advanced database systems are employed. In contrast to conventional database systems that support exact match queries, the user of these advanced database systems focuses on applying similarity search and data mining techniques. Based on an analysis of typical advanced database systems (such as biometric, biological, multimedia, moving-object, and CAD-object database systems), the following three challenging characteristics of complexity are detected: uncertainty (probabilistic feature vectors), multiple instances (a set of homogeneous feature vectors), and multiple representations (a set of heterogeneous feature vectors). Therefore, the goal of this thesis is to develop similarity search and data mining techniques that are capable of handling uncertain, multi-instance, and multi-represented objects. The first part of this thesis deals with similarity search techniques. Object identification is a similarity search technique that is typically used for the recognition of objects from image, video, or audio data. Thus, we develop a novel probabilistic model for object identification. Based on it, two novel types of identification queries are defined. In order to process the novel query types efficiently, we introduce an index structure called Gauss-tree. In addition, we specify further probabilistic models and query types for uncertain multi-instance objects and uncertain spatial objects. Based on the index structure, we develop algorithms for an efficient processing of these query types. Practical benefits of using probabilistic feature vectors are demonstrated on a real-world application for video similarity search. Furthermore, a similarity search technique is presented that is based on aggregated multi-instance objects, and that is suitable for video similarity search. This technique takes multiple representations into account in order to achieve better effectiveness. The second part of this thesis deals with two major data mining techniques: clustering and classification. Since privacy preservation is a very important demand of distributed advanced applications, we propose using uncertainty for data obfuscation in order to provide privacy preservation during clustering. Furthermore, a model-based and a density-based clustering method for multi-instance objects are developed. Afterwards, original extensions and enhancements of the density-based clustering algorithms DBSCAN and OPTICS for handling multi-represented objects are introduced. Since several advanced database systems like biological or multimedia database systems handle predefined, very large class systems, two novel classification techniques for large class sets that benefit from using multiple representations are defined. The first classification method is based on the idea of a k-nearest-neighbor classifier. It employs a novel density-based technique to reduce training instances and exploits the entropy impurity of the local neighborhood in order to weight a given representation.
The second technique addresses hierarchically-organized class systems. It uses a novel hierarchical, supervised method for the reduction of large multi-instance objects, e.g. audio or video, and applies support vector machines for efficient hierarchical classification of multi-represented objects. User benefits of this technique are demonstrated by a prototype that performs a classification of large music collections. The effectiveness and efficiency of all proposed techniques are discussed and verified by comparison with conventional approaches in versatile experimental evaluations on real-world datasets.
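A hedged sketch of the obfuscation idea mentioned in this abstract (the thesis's own methods differ; this only illustrates the general point that clustering can tolerate uncertainty introduced for privacy): perturb feature vectors with Gaussian noise before clustering and check how much of the cluster structure survives. The noise level, data, and DBSCAN parameters are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN
from sklearn.metrics import adjusted_rand_score

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)
obfuscated = X + np.random.default_rng(0).normal(scale=0.3, size=X.shape)

labels_original = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
labels_obfuscated = DBSCAN(eps=0.8, min_samples=5).fit_predict(obfuscated)

# Agreement close to 1.0 means the obfuscated data still clusters the same way.
print("agreement:", round(adjusted_rand_score(labels_original, labels_obfuscated), 3))
```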
Fakultät für Mathematik, Informatik und Statistik - Digitale Hochschulschriften der LMU - Teil 01/02
The tremendous amount of data produced nowadays in various application domains such as molecular biology or geography can only be fully exploited by efficient and effective data mining tools. One of the primary data mining tasks is clustering, which is the task of partitioning points of a data set into distinct groups (clusters) such that two points from one cluster are similar to each other whereas two points from distinct clusters are not. Due to modern database technology, e.g. object-relational databases, a huge amount of complex objects from scientific, engineering or multimedia applications is stored in database systems. Modelling such complex data often results in very high-dimensional vector data ("feature vectors"). In the context of clustering, this causes a lot of fundamental problems, commonly subsumed under the term "Curse of Dimensionality". As a result, traditional clustering algorithms often fail to generate meaningful results, because in such high-dimensional feature spaces data does not cluster anymore. But usually, there are clusters embedded in lower dimensional subspaces, i.e. meaningful clusters can be found if only a certain subset of features is regarded for clustering. The subset of features may even be different for varying clusters. In this thesis, we present original extensions and enhancements of the density-based clustering notion to cope with high-dimensional data. In particular, we propose an algorithm called SUBCLU (density-connected Subspace Clustering) that extends DBSCAN (Density-Based Spatial Clustering of Applications with Noise) to the problem of subspace clustering. SUBCLU efficiently computes all clusters of arbitrary shape and size that would have been found if DBSCAN were applied to all possible subspaces of the feature space. Two subspace selection techniques called RIS (Ranking Interesting Subspaces) and SURFING (SUbspaces Relevant For clusterING) are proposed. They do not compute the subspace clusters directly, but generate a list of subspaces ranked by their clustering characteristics. A hierarchical clustering algorithm can be applied to these interesting subspaces in order to compute a hierarchical (subspace) clustering. In addition, we propose the algorithm 4C (Computing Correlation Connected Clusters) that extends the concepts of DBSCAN to compute density-based correlation clusters. 4C searches for groups of objects which exhibit an arbitrary but uniform correlation. Often, the traditional approach of modelling data as high-dimensional feature vectors is no longer able to capture the intuitive notion of similarity between complex objects. Thus, objects like chemical compounds, CAD drawings, XML data or color images are often modelled by using more complex representations like graphs or trees. If a metric distance function like the edit distance for graphs and trees is used as a similarity measure, traditional clustering approaches like density-based clustering are applicable to those data. However, we face the problem that a single distance calculation can be very expensive. As clustering performs a lot of distance calculations, approaches like filter-and-refinement and metric indices become important. The second part of this thesis deals with special approaches for clustering in application domains with complex similarity models. We show how appropriate filters can be used to enhance the performance of query processing and, thus, clustering of hierarchical objects.
Furthermore, we describe how the two paradigms of filtering and metric indexing can be combined. As complex objects can often be represented by using different similarity models, a new clustering approach is presented that is able to cluster objects that provide several different complex representations.
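To make the subspace-clustering goal stated in this abstract concrete, here is a deliberately naive sketch: run DBSCAN on every non-empty subset of features and report the subspaces that contain clusters. SUBCLU reaches the same semantics with efficient pruning; this brute-force loop is exponential and purely illustrative, and the toy data and parameters are assumptions.

```python
from itertools import combinations
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN

# Toy data: two informative dimensions with cluster structure plus two noise dimensions.
informative, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.5, random_state=0)
noise = np.random.default_rng(0).uniform(-50, 50, size=(200, 2))
X = np.hstack([informative, noise])

for k in range(1, X.shape[1] + 1):
    for subspace in combinations(range(X.shape[1]), k):
        labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X[:, list(subspace)])
        n_clusters = len(set(labels) - {-1})
        if n_clusters > 0:
            print("subspace", subspace, "->", n_clusters, "cluster(s)")
```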
Fakultät für Mathematik, Informatik und Statistik - Digitale Hochschulschriften der LMU - Teil 01/02
Knowledge Discovery in Databases (KDD) is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in large data collections. The most important step within the process of KDD is data mining, which is concerned with the extraction of the valid patterns. KDD is necessary to analyze the steadily growing amount of data caused by the enhanced performance of modern computer systems. However, with the growing amount of data the complexity of data objects increases as well. Modern methods of KDD should therefore examine more complex objects than simple feature vectors to solve real-world KDD applications adequately. Multi-instance and multi-represented objects are two important types of object representations for complex objects. Multi-instance objects consist of a set of object representations that all belong to the same feature space. Multi-represented objects are constructed as a tuple of feature representations where each feature representation belongs to a different feature space. The contribution of this thesis is the development of new KDD methods for the classification and clustering of complex objects. Therefore, the thesis introduces solutions for real-world applications that are based on multi-instance and multi-represented object representations. On the basis of these solutions, it is shown that a more general object representation often provides better results for many relevant KDD applications. The first part of the thesis is concerned with two KDD problems for which employing multi-instance objects provides efficient and effective solutions. The first is data mining in CAD parts, e.g. the use of hierarchical clustering for the automatic construction of product hierarchies. The introduced solution decomposes a single part into a set of feature vectors and compares them by using a metric on multi-instance objects. Furthermore, multi-step query processing using a novel filter step is employed, enabling the user to efficiently process similarity queries. On the basis of this similarity search system, it is possible to perform several distance-based data mining algorithms like the hierarchical clustering algorithm OPTICS to derive product hierarchies. The second important application is the classification and search for complete websites in the World Wide Web (WWW). A website is a set of HTML documents that is published by the same person, group or organization and usually serves a common purpose. To perform data mining for websites, the thesis presents several methods to classify websites. After introducing naive methods that model websites as webpages, two more sophisticated approaches to website classification are introduced. The first approach uses a preprocessing that maps single HTML documents within each website to so-called page classes. The second approach directly compares websites as sets of word vectors and uses nearest neighbor classification. To search the WWW for new, relevant websites, a focused crawler is introduced that retrieves relevant websites efficiently. This crawler minimizes the number of HTML documents and increases the accuracy of website retrieval. The second part of the thesis is concerned with data mining in multi-represented objects. An important example application for this kind of complex object is proteins, which can be represented as a tuple of a protein sequence and a text annotation.
To analyze multi-represented objects, a clustering method for multi-represented objects is introduced that is based on the density-based clustering algorithm DBSCAN. This method uses all representations that are provided to find a global clustering of the given data objects. However, in many applications there already exists a sophisticated class ontology for the given data objects, e.g. proteins. To map new objects into an ontology, a new method for the hierarchical classification of multi-represented objects is described. The system employs the hierarchical structure of the ontology to efficiently classify new proteins, using support vector machines.
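One way to make the "metric on multi-instance objects" mentioned in this abstract concrete is a set-to-set distance; the sketch below uses a symmetric average of per-vector minimum distances (one common choice, not necessarily the thesis's metric) and then clusters the set-valued objects with DBSCAN on the precomputed distances. The toy objects and parameters are illustrative assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import DBSCAN

def set_distance(A, B):
    """Symmetric average of minimum distances between two sets of feature vectors."""
    d = cdist(A, B)                                    # pairwise vector distances
    return 0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean())

rng = np.random.default_rng(0)
# Toy multi-instance objects: each "part" is a set of 5-15 four-dimensional feature vectors.
objects = [rng.normal(loc=c, scale=0.3, size=(rng.integers(5, 16), 4))
           for c in (0.0, 0.0, 0.0, 3.0, 3.0, 3.0, 8.0)]

n = len(objects)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = set_distance(objects[i], objects[j])

# Similar sets group together; the isolated object is labelled -1 (noise).
print(DBSCAN(eps=1.0, min_samples=2, metric="precomputed").fit_predict(D))
```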
Fakultät für Mathematik, Informatik und Statistik - Digitale Hochschulschriften der LMU - Teil 01/02
Knowledge Discovery in Databases (KDD) is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. The core step of the KDD process is the application of a Data Mining algorithm in order to produce a particular enumeration of patterns and relationships in large databases. Clustering is one of the major data mining tasks and aims at grouping the data objects into meaningful classes (clusters) such that the similarity of objects within clusters is maximized, and the similarity of objects from different clusters is minimized. Besides many others, the density-based clustering notion underlying the algorithm DBSCAN and its hierarchical extension OPTICS has been proposed recently, being one of the most successful approaches to clustering. In this thesis, our aim is to advance the state of the art in clustering, especially density-based clustering, by identifying novel challenges for density-based clustering and proposing innovative and solid solutions for these challenges. We describe the development of the industrial prototype BOSS (Browsing OPTICS plots for Similarity Search), which is a first step towards developing a comprehensive, scalable and distributed computing solution designed to make the efficiency and analytical capabilities of OPTICS available to a broader audience. For the development of BOSS, several key enhancements of OPTICS are required, which are addressed in this thesis. We develop incremental algorithms for OPTICS to efficiently reconstruct the hierarchical clustering structure in frequently updated databases, in particular, when a set of objects is inserted into or deleted from the database. We empirically show that these incremental algorithms yield significant speed-up factors over the original OPTICS algorithm. Furthermore, we propose a novel algorithm for automatic extraction of clusters from hierarchical clustering representations that outperforms comparative methods, and introduce two novel approaches for selecting meaningful representatives, using the density-based concepts of OPTICS and producing better results than the related medoid approach. Another major challenge for density-based clustering is to cope with high-dimensional data. Many of today's real-world data sets contain a large number of measurements (or features) for a single data object. Usually, global feature reduction techniques cannot be applied to these data sets. Thus, the task of feature selection must be combined with and incorporated into the clustering process. In this thesis, we present original extensions and enhancements of the density-based clustering notion to cope with high-dimensional data. In particular, we propose an algorithm called SUBCLU (density-based SUBspace CLUstering) that extends DBSCAN to the problem of subspace clustering. SUBCLU efficiently computes all clusters that would have been found if DBSCAN were applied to all possible subspaces of the feature space. An experimental evaluation on real-world data sets illustrates that SUBCLU is more effective than existing subspace clustering algorithms because it is able to find clusters of arbitrary size and shape, and produces deterministic results. A semi-hierarchical extension of SUBCLU called RIS (Ranking Interesting Subspaces) is proposed that does not compute the subspace clusters directly, but generates a list of subspaces ranked by their clustering characteristics.
A hierarchical clustering algorithm can be applied to these interesting subspaces in order to compute a hierarchical (subspace) clustering. A comparative evaluation of RIS and SUBCLU shows that RIS in combination with OPTICS can achieve an information gain over SUBCLU. In addition, we propose the algorithm 4C (Computing Correlation Connected Clusters) that extends the concepts of DBSCAN to compute density-based correlation clusters. 4C benefits from an innovative, well-defined and effective clustering model, outperforming related approaches in terms of clustering quality on real-world data sets.