...CHAPTER 2 DATA MINING TECHNIQUE OVERVIEW 2.1 Introduction In the 21st century as we are moving towards more and more online system, the databases have grown into terabytes. Within this huge data, information of importance needs to be identified. Since the evolution of human life, the people discover patterns. As farmer recognizes pattern of growth in the field, bank recognizes the earning and spending pattern of a customer and politicians seeks pattern in voter opinion. This huge amount of data needs to be used either for business growth or scientific discoveries. The process of discovering the patterns and relationships in data using the analysis tools is called Data Mining. The simplest form of data mining is as follows: 1. Describing...
Words: 2594 - Pages: 11
...(Online): 2347 - 4718 DATA MINING TECHNIQUES TO ANALYZE CRIME DATA R. G. Uthra, M. Tech (CS) Bharathidasan University, Trichy, India. Abstract: In data mining, Crime management is an interesting application where it plays an important role in handling of crime data. Crime investigation has very significant role of police system in any country. There had been an enormous increase in the crime in recent years. With rapid popularity of the internet, crime information maintained in web is becoming increasingly rampant. In this paper the data mining techniques are used to analyze the web data. This paper presents detailed study on classification and clustering. Classification is the process of classifying the crime type Clustering is the process of combining data object into groups. The construct of scenario is to extract the attributes and relations in the web page and reconstruct the scenario for crime mining. Key words: Crime data analysis, classification, clustering. I. INTRODUCTION Crime is one of the dangerous factors for any country. Crime analysis is the activity in which analysis is done on crime activities. Today criminals have maximum use of all modern technologies and hi-tech methods in committing crimes. The law enforcers have to effectively meet out challenges of crime control and maintenance of public order. One challenge to law enforcement and intelligence agencies is the difficulty of analyzing large volumes of data involved in criminal and...
Words: 1699 - Pages: 7
...Data Mining 6/3/12 CIS 500 Data Mining is the process of analyzing data from different perspectives and summarizing it into useful information. This information can be used to increase revenue, cut costs or both. Data mining software is a major analytical tool used for analyzing data. It allows the user to analyze data from many different angles, categorize the data and summarizing the relationships. In a nut shell data mining is used mostly for the process of finding correlations or patterns among fields in very large databases. What ultimately can data mining do for a company? A lot. Data mining is primarily used by companies with strong customer focus in retail or financial. It allows companies to determine relationships among factors such as price, product placement, and staff skill set. There are external factors that data mining can use as well such as location, economic indicators, and competition of other companies. With the use of data mining a retailer can look at point of sale records of a customer purchases to send promotions to certain areas based on purchases made. An example of this is Blockbuster looking at movie rentals to send customers updates regarding new movies depending on their previous rent list. Another example would be American express suggesting products to card holders depending on monthly purchases histories. Data Mining consists of 5 major elements: • Extract, transform, and load transaction data onto the data...
Words: 1012 - Pages: 5
...1 Video Data Mining JungHwan Oh University of Texas at Arlington, USA JeongKyu Lee University of Texas at Arlington, USA Sae Hwang University of Texas at Arlington, USA 8 INTRODUCTION Data mining, which is defined as the process of extracting previously unknown knowledge and detecting interesting patterns from a massive set of data, has been an active research area. As a result, several commercial products and research prototypes are available nowadays. However, most of these studies have focused on corporate data — typically in an alpha-numeric database, and relatively less work has been pursued for the mining of multimedia data (Zaïane, Han, & Zhu, 2000). Digital multimedia differs from previous forms of combined media in that the bits representing texts, images, audios, and videos can be treated as data by computer programs (Simoff, Djeraba, & Zaïane, 2002). One facet of these diverse data in terms of underlying models and formats is that they are synchronized and integrated hence, can be treated as integrated data records. The collection of such integral data records constitutes a multimedia data set. The challenge of extracting meaningful patterns from such data sets has lead to research and development in the area of multimedia data mining. This is a challenging field due to the non-structured nature of multimedia data. Such ubiquitous data is required in many applications such as financial, medical, advertising and Command, Control, Communications and Intelligence...
Words: 3477 - Pages: 14
...1. Define data mining. Why are there many different names and definitions for data mining? Data mining is the process through which previously unknown patterns in data were discovered. Another definition would be “a process that uses statistical, mathematical, artificial intelligence, and machine learning techniques to extract and identify useful information and subsequent knowledge from large databases.” This includes most types of automated data analysis. A third definition: Data mining is the process of finding mathematical patterns from (usually) large sets of data; these can be rules, affinities, correlations, trends, or prediction models. Data mining has many definitions because it’s been stretched beyond those limits by some software vendors to include most forms of data analysis in order to increase sales using the popularity of data mining. What recent factors have increased the popularity of data mining? Following are some of most pronounced reasons: * More intense competition at the global scale driven by customers’ ever-changing needs and wants in an increasingly saturated marketplace. * General recognition of the untapped value hidden in large data sources. * Consolidation and integration of database records, which enables a single view of customers, vendors, transactions, etc. * Consolidation of databases and other data repositories into a single location in the form of a data warehouse. * The exponential increase...
Words: 4581 - Pages: 19
...of collected data. The presence of outliers may indicate something sinister such as unauthorised system access or fraudulent activity, or may be a new and previously unidentified occurrence. Whatever the cause of these outliers, it is important they are detected so appropriate action can be taken to minimise their harm if malignant or to exploit a newly discovered opportunity. Chandola, Banerjee and Kumar (2007) conducted a comprehensive survey of outlier detection techniques, which highlighted the importance of detection across a wide variety of domains. Their survey described the categories of outlier detection, applications of detection and detection techniques. Chandola et al. identified three main categories of outlier detection - supervised, semi-supervised and unsupervised detection. Each category utilises different detection techniques such as classification, clustering, nearest neighbour and statistical. Each category and technique has several strengths and weaknesses compared with other outlier detection methods. This review provides initial information on data labelling and classification before examining some of the existing outlier detection techniques within each of the three categories. It then looks at the use of combining detection techniques before comparing and discussing the advantages and disadvantages of each method. Finally, a new classification technique is proposed using a new outlier detection algorithm, Isolation Forest. DATA LABELLING Datasets...
Words: 2395 - Pages: 10
...identify and evaluate data mining algorithms which are commonly implemented in modern Medical Decision Support Systems (MDSS). They are used in various healthcare units all over the world. These institutions store large amounts of medical data. This data may contain relevant medical information hidden in various patterns buried among the records. Within the research several popular MDSS’s are analysed in order to determine the most common data mining algorithms utilized by them. Three algorithms have been identified: Naïve Bayes, Multilayer Perceptron and C4.5. Prior to the very analyses the algorithms are calibrated. Several testing configurations are tested in order to determine the best setting for the algorithms. Afterwards, an ultimate comparison of the algorithms orders them with respect to their performance. The evaluation is based on a set of performance metrics. The analyses are conducted in WEKA on five UCI medical datasets: breast cancer, hepatitis, heart disease, dermatology disease, diabetes. The analyses have shown that it is very difficult to name a single data mining algorithm to be the most suitable for the medical data. The results gained for the algorithms were very similar. However, the final evaluation of the outcomes allowed singling out the Naïve Bayes to be the best classifier for the given domain. It was followed by the Multilayer Perceptron and the C4.5. Keywords: Naïve Bayes, Multilayer Perceptron, C4.5, medical data mining, medical decision...
Words: 35271 - Pages: 142
...Computer Networks Computer Graphics and Multimedia Lab Advanced Operating System Internet programming and Web Design Data Mining and Warehousing Internet programming and Web Design Lab Project Work and Viva Voce Total University Examinations Durations Max in Hrs Marks 3 100 3 100 3 100 3 100 3 100 3 3 3 3 100 100 100 100 100 1000 II For project work and viva voce (External) Breakup: Project Evaluation : 75 Viva Voce : 25 1 Anx.31 J - M Sc CS (SDE) 2007-08 with MQP Page 2 of 16 YEAR – I PAPER I: ADVANCED COMPUTER ARCHITECTURE Subject Description: This paper presents the concept of parallel processing, solving problem in parallel processing, Parallel algorithms and different types of processors. Goal: To enable the students to learn the Architecture of the Computer. Objectives: On successful completion of the course the students should have: Understand the concept of Parallel Processing. Learnt the different types of Processors. Learnt the Parallel algorithms. Content: Unit I Introduction to parallel processing – Trends towards parallel processing – parallelism in uniprocessor Systems – Parallel Computer structures – architectural classification schemes – Flynn’ Classification – Feng’s Classification – Handler’s Classification – Parallel Processing Applications. Unit II Solving problems in Parallel: Utilizing Temporal Parallelism – Utilizing Data Parallelism –...
Words: 3613 - Pages: 15
...spectrometric data analysis By XIAO Jiali, Jenny ( 0830300038) A Final Year Project thesis (STAT 4121; 3 Credits) submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Statistics at BNU-HKBU UNITED INTERNATIONAL COLLEGE December, 2011 DECLARATION I hereby declare that all the work done in this Project is of my independent effort. I also certify that I have never submitted the idea and product of this Project for academic or employment credits. XIAO Jiali, Jenny (0830300038) Date: ii Application of Bootstrap method in spectrometric data analysis XIAO Jiali, Jenny Science and Technology Division Abstract In this project the bootstrap methodology for spectrometric data is considered. The bootstrap can also compare two populations, without the normality condition and without the restriction to comparison of means. The most important new idea is that bootstrap resampling must mimic the separate samples design that produced the original data. Bootstrap in mean, bootstrap in median, and bootstrap in confidence interval are three kinds of effective way to handle mass spectrometric data. Then,we need to reduce dimension based on bootstrap method. It may allow the data to be more easily visualized. Afterwards, using results obtained by bootstrap, we use data mining method to predict a patient has ovarian cancer or not. Decision tree induction and neural network are usual way to classify it. Keywords: Bootstrap, data mining...
Words: 7049 - Pages: 29
...Top 10 data mining algorithms in plain English 1.1K Today, I’m going to explain in plain English the top 10 most influential data mining algorithms as voted on by 3 separate panels in this survey paper. Once you know what they are, how they work, what they do and where you can find them, my hope is you’ll have this blog post as a springboard to learn even more about data mining. What are we waiting for? Let’s get started! Contents [hide] 1. C4.5 2. k-means 3. Support vector machines 4. Apriori 5. EM 6. PageRank 7. AdaBoost 8. kNN 9. Naive Bayes 10. CART Interesting Resources Now it’s your turn… Update 16-May-2015: Thanks to Yuval Merhav and Oliver Keyes for their suggestions which I’ve incorporated into the post. Update 28-May-2015: Thanks to Dan Steinberg (yes, the CART expert!) for the suggested updates to the CART section which have now been added. 1. C4.5 What does it do? C4.5 constructs a classifier in the form of a decision tree. In order to do this, C4.5 is given a set of data representing things that are already classified. Wait, what’s a classifier? A classifier is a tool in data mining that takes a bunch of data representing things we want to classify and attempts to predict which class the new data belongs to. What’s an example of this? Sure, suppose a dataset contains a bunch of patients. We know various things about each patient like age, pulse...
Words: 6478 - Pages: 26
...Data Mining Objectives: Highlight the characteristics of Data mining Operations, Techniques and Tools. A Brief Overview Online Analytical Processing (OLAP): OLAP is the dynamic synthesis, analysis, and consolidation of large volumns of multi-dimensional data. Multi-dimensional OLAP support common analyst operations, such as: ▪ Considation – aggregate of data, e.g. roll-ups from branches to regions. ▪ Drill-down – showing details, just the reverse of considation. ▪ Slicing and dicing – pivoting. Looking at the data from different viewpoints. E.g. X, Y, Z axis as salesman, Nth quarter and products, or region, Nth quarter and products. A Brief Overview Data Mining: Construct an advanced architecture for storing information in a multi-dimension data warehouse is just the first step to evolve from traditional DBMS. To realize the value of a data warehouse, it is necessary to extract the knowledge hidden within the warehouse. Unlike OLAP, which reveal patterns that are known in advance, Data Mining uses the machine learning techniques to find hidden relationships within data. So Data Mining is to ▪ Analyse data, ▪ Use software techniques ▪ Finding hidden and unexpected patterns and relationships in sets of data. Examples of Data Mining Applications: ▪ Identifying potential credit card customer groups ▪ Identifying buying patterns of customers. ▪ Predicting trends of market...
Words: 1258 - Pages: 6
...512 Use of Data Mining in the field of Library and Information Science : An Overview Roopesh K Dwivedi Abstract Data Mining refers to the extraction or “Mining” knowledge from large amount of data or Data Warehouse. To do this extraction data mining combines artificial intelligence, statistical analysis and database management systems to attempt to pull knowledge form stored data. This paper gives an overview of this new emerging technology which provides a road map to the next generation of library. And at the end it is explored that how data mining can be effectively and efficiently used in the field of library and information science and its direct and indirect impact on library administration and services. R P Bajpai Keywords : Data Mining, Data Warehouse, OLAP, KDD, e-Library 0. Introduction An area of research that has seen a recent surge in commercial development is data mining, or knowledge discovery in databases (KDD). Knowledge discovery has been defined as “the non-trivial extraction of implicit, previously unknown, and potentially useful information from data” [1]. To do this extraction data mining combines many different technologies. In addition to artificial intelligence, statistics, and database management system, technologies include data warehousing and on-line analytical processing (OLAP), human computer interaction and data visualization; machine learning (especially inductive learning techniques), knowledge representation, pattern recognition...
Words: 3471 - Pages: 14
...accuracy rate of the proposed algorithm for recognizing the network traffic. A. Overview The algorithm consists two phases: the classification phase and clustering phase. In the classification phase, the artificial neural network, one of the core methods of Computational Intelligence is used to create the classifier from the known network traffic data. The neural network it identifies the input pattern and tries to output the corresponding class. It can map input patterns to their associated output patterns using their mapping capabilities. They can be trained with known examples of a problem...
Words: 1312 - Pages: 6
...FINAL REPORT DATA MINING Reported by: Nguyen Bao An – M9839920 Date: 99/06/16 Outline In this report I present my study in the Data mining course. It includes my two proposed approaches in the field of clustering, my learn lessons in class and my comment on this class. The report’s outline is as following: Part I: Proposed approaches 1. Introduction and backgrounds 2. Related works and motivation 3. Proposed approaches 4. Evaluation method 5. Conclusion Part II: Lessons learned 1. Data preprocessing 2. Frequent pattern and association rule 3. Classification and prediction 4. Clustering Part III: My own comments on this class. I. Proposed approach • An incremental subspace-based K-means clustering method for high dimensional data • Subspace based document clustering and its application in data preprocessing in Web mining 1. Introduction and background High dimensional data clustering has many applications in real world, especially in bioinformatics. Many well-known clustering algorithms often use a whole-space distance score to measure the similarity or distance between two objects, such as Euclidean distance, Cosine function... However, in fact, when the dimensionality of space or the number of objects is large, such whole-space-based pairwise similarity scores are no longer meaningful, due to the distance of each pair of object nearly the same [5]. ...
Words: 5913 - Pages: 24
...Assignment 1: Introduction Miles A. Cabanos Data Mining - 10303 CAP4770 Q1: Present an example where data mining is crucial to the success of a business. What data mining function does this business need? Can they be performed alternatively by data query processing or simple statistical analysis? Best example that I could think of would be Amazon.com. I think one of the functions the business uses is the characterization. This way it can keep track of what type of products customers buy, and then the pop-up windows to suggest similar items. No, I do not think they could be performed by data query processing or simple statistical analysis. Q2: Define each of the following data mining functionalities: characterization, discrimination, association and correlation analysis, classification, prediction and clustering. Give examples of each data mining functionality, using a real-life database with which you are familiar. Characterization: Given data that shares similarities of characteristics and/or other requested specific information. (Example: Data from Amazon.com customers buying superhero books, can then be used to determine what age group buys superhero books. Then can suggest other types of similar books) Discrimination: Compares or contrasts data information. (Example: Amazon could find out what customers types buy more superhero book compared to biographies) Association and correlation analysis: How certain types of data can be associated with each other. Patterns...
Words: 481 - Pages: 2