Premium Essay

Data Mining A1

Submitted By liveReign
Words 973
Pages 4
FIT3002 Applications of Data Mining
Assignment 1 (100 marks)
This assignment requires you to use the data mining tool, WEKA, to build a good model from a given set of data; and then write a report to describe the process. The Hyperthyroid data set is for the study of hyperthyroid disease. The data is supplied by Garvan Institute and J. Ross Quinlan. An instance in this data set is a diagnosis record for a single patient, and the data set contains a total of 2800 instances. Each instance is represented by 29 input attributes and a class attribute indicating whether the diagnosis for the patient is hyperthyroid, T3 toxic, goitre, secondary toxic, or negative.

The attribute information is given below:

age: numeric.
sex: M, F.
on thyroxine: f, t.
query on thyroxine: f, t.
on antithyroid medication: f, t.
sick: f, t.
pregnant: f, t.
thyroid surgery: f, t.
I131 treatment: f, t.
query hypothyroid: f, t.
query hyperthyroid: f, t.
lithium: f, t.
goitre: f, t.
tumor: f, t.
hypopituitary: f, t.
psych: f, t.
TSH measured: f, t.
TSH: numeric.
T3 measured: f, t.
T3: numeric.
TT4 measured: f, t.
TT4: numeric.
T4U measured: f, t.
T4U: numeric.
FTI measured: f, t.
FTI: numeric.
TBG measured: f, t.
TBG: numeric.
referral source: WEST, STMW, SVHC, SVI, SVHD, other.
class: hyperthyroid, T3 toxic, goitre, secondary toxic, negative.

Your tasks are to: (a) analyze the data, convert it as suggested above, build several models from it, and choose the best model; and (b) write a report describing the data mining process (i.e., problem definition, data preparation and preprocessing, selection of algorithms and training parameters, training and testing, and result analysis). Your analysis should include a comparison with a default model that always predicts the majority class in the training data. For the "data preparation and preprocessing" step, you only need to concentrate on
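
The default model named in the tasks, one that always predicts the majority class in the training data, can be sketched in a few lines. The Python below is only an illustration of the idea: the label lists are hypothetical and are not taken from the Hyperthyroid data, and the function names `majority_class` and `accuracy` are made up for this sketch. In WEKA itself, the same baseline is available as the ZeroR classifier (`weka.classifiers.rules.ZeroR`), so no code is needed for the assignment.

```python
from collections import Counter


def majority_class(train_labels):
    """Return the most frequent class label in the training data."""
    return Counter(train_labels).most_common(1)[0][0]


def accuracy(predictions, true_labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == t for p, t in zip(predictions, true_labels))
    return correct / len(true_labels)


# Hypothetical labels for illustration only (not the real data set).
train_labels = ["negative"] * 8 + ["hyperthyroid", "goitre"]
test_labels = ["negative", "negative", "hyperthyroid", "negative"]

# The default model ignores all input attributes and always
# predicts the majority class seen during training.
baseline = majority_class(train_labels)
baseline_predictions = [baseline] * len(test_labels)
baseline_accuracy = accuracy(baseline_predictions, test_labels)

print(baseline, baseline_accuracy)
```

Any model built from the 29 input attributes should be judged against this figure. When one class dominates the data, the baseline accuracy can already be high, so a candidate model is only worthwhile if it clearly exceeds it.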

Similar Documents

Premium Essay

Association Rule Mining

...This part gives you the concept of multi-level association rule or generalized association rule. Basic reading: sections 5.1, 5.2.1, and 5.2.2 of the English material. This content matches what the instructor covered in class; there is no need to focus too much on the algorithms and code, as it is more important to understand the meaning of the methods, the process, and the related examples. Extended reading: to solve part (c) of assignment question 2, you should also read section 5.3.1. Mining Frequent Patterns, Associations, and Correlations Frequent patterns are patterns (such as itemsets, subsequences, or substructures) that appear in a data set frequently. For example, a set of items, such as milk and bread, that appear frequently together in a transaction data set is a frequent itemset. A subsequence, such as buying first a PC, then a digital camera, and then a memory card, if it occurs frequently in a shopping history database, is a (frequent) sequential pattern. A substructure can refer to different structural forms, such as subgraphs, subtrees, or sublattices, which may be combined with itemsets or subsequences. If a substructure occurs frequently, it is called a (frequent) structured pattern. Finding such frequent patterns plays an essential role in mining associations, correlations, and many other interesting relationships among data. Moreover, it helps in data classification, clustering, and other data mining tasks as well. Thus, frequent pattern mining has become an important data mining task and a focused...

Words: 26078 - Pages: 105

Premium Essay

Data Mining with R

...Data Mining with R: learning by case studies Luis Torgo LIACC-FEP, University of Porto R. Campo Alegre, 823 - 4150 Porto, Portugal email: ltorgo@liacc.up.pt http://www.liacc.up.pt/∼ltorgo May 22, 2003 Preface The main goal of this book is to introduce the reader to the use of R as a tool for performing data mining. R is a freely downloadable language and environment for statistical computing and graphics. Its capabilities and the large set of available packages make this tool an excellent alternative to the existing (and expensive!) data mining tools. One of the key issues in data mining is size. A typical data mining problem involves a large database from where one seeks to extract useful knowledge. In this book we will use MySQL as the core database management system. MySQL is also freely available for several computer platforms. This means that you will be able to perform “serious” data mining without having to pay any money at all. Moreover, we hope to show you that this comes with no compromise in the quality of the obtained solutions. Expensive tools do not necessarily mean better tools! R together with MySQL form a pair very hard to beat as long as you are willing to spend some time learning how to use them. We think that it is worthwhile, and we hope that you are convinced as well at the end of reading this book. The goal of this book is not to describe all facets of data mining processes. Many books exist that cover this area. Instead we propose to introduce...

Words: 18348 - Pages: 74

Premium Essay

Importance Of Confidentiality Classification

...particularly useful in many applications. In this paper, we aim to advance a worldwide classification model based on the Naïve Bayes classification scheme. The Naïve Bayes classification is chosen because of its applicability in case of its previous history. For Confidentiality-preservation of the data, the concept of counterweight providing reliable party is used. Keywords- Confidentiality-preservation, Naïve Bayes, distributed databases, partition. I. INTRODUCTION In recent years, there has been an advancement of computing and...

Words: 1684 - Pages: 7

Premium Essay

Effect of Broken Home on Student Academic Performances

...students in tertiary institutions has for a long time been the focus of study among higher education managers, parents, government and researchers. The cause of this differential can be due to intellective factors, non-intellective factors, or both. From studies investigating student performance and related problems it has been determined that academic success is dependent on many factors such as grades and achievements, personality and expectations, and academic environments. This work uses data mining techniques to investigate the effect of socio-economic or family background on the performance of students, using the data from one of the Nigerian tertiary institutions as a case study. The analysis was carried out using Decision Tree algorithms. The data comprised two hundred forty (240) records of students. The academic performance of students was measured by the students’ first year cumulative grade point average (CGPA). Various Decision Tree algorithms were investigated and the algorithm which best models the data was used to generate rule sets which can be used to analyze the effect of the socio-economic background of students on their academic performance. The rules generated can serve as a guide to educational administrators in their planning activities. Keywords: Socio-Economic, Performance, Intellective, Family Background, Academic. 1.0 INTRODUCTION The differential students’ performance in tertiary institutions has been and is still a source of great concern and research interest...

Words: 5499 - Pages: 22

Premium Essay

Dataminig

...Data Mining: Practical Machine Learning Tools and Techniques, Third Edition Ian H. Witten, Eibe Frank, Mark A. Hall AMSTERDAM • BOSTON • HEIDELBERG • LONDON NEW YORK • OXFORD • PARIS • SAN DIEGO SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO Morgan Kaufmann Publishers is an imprint of Elsevier 30 Corporate Drive, Suite 400, Burlington, MA 01803, USA This book is printed on acid-free paper. Copyright © 2011 Elsevier Inc. All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions. This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein). Notices Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary. Practitioners and researchers must...

Words: 194698 - Pages: 779

Premium Essay

Data Mining Practical Machine Learning Tools and Techniques - Weka

...Data Mining Practical Machine Learning Tools and Techniques The Morgan Kaufmann Series in Data Management Systems Series Editor: Jim Gray, Microsoft Research Data Mining: Practical Machine Learning Tools and Techniques, Second Edition Ian H. Witten and Eibe Frank Fuzzy Modeling and Genetic Algorithms for Data Mining and Exploration Earl Cox Data Modeling Essentials, Third Edition Graeme C. Simsion and Graham C. Witt Location-Based Services Jochen Schiller and Agnès Voisard Database Modeling with Microsoft® Visio for Enterprise Architects Terry Halpin, Ken Evans, Patrick Hallock, and Bill Maclean Designing Data-Intensive Web Applications Stefano Ceri, Piero Fraternali, Aldo Bongio, Marco Brambilla, Sara Comai, and Maristella Matera Mining the Web: Discovering Knowledge from Hypertext Data Soumen Chakrabarti Understanding SQL and Java Together: A Guide to SQLJ, JDBC, and Related Technologies Jim Melton and Andrew Eisenberg Database: Principles, Programming, and Performance, Second Edition Patrick O’Neil and Elizabeth O’Neil The Object Data Standard: ODMG 3.0 Edited by R. G. G. Cattell, Douglas K. Barry, Mark Berler, Jeff Eastman, David Jordan, Craig Russell, Olaf Schadow, Torsten Stanienda, and Fernando Velez Data on the Web: From Relations to Semistructured Data and XML Serge Abiteboul, Peter Buneman, and Dan Suciu Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations Ian H. Witten and Eibe Frank ...

Words: 191947 - Pages: 768

Premium Essay

Hostel Management System

...Applications of Data Mining A wide range of companies have deployed successful applications of data mining. While early adopters of this technology have tended to be in information-intensive industries such as financial services and direct mail marketing, the technology is applicable to any company looking to leverage a large data warehouse to better manage their customer relationships. Two critical factors for success with data mining are: a large, well-integrated data warehouse and a well-defined understanding of the business process within which data mining is to be applied (such as customer prospecting, retention, campaign management, and so on). Some successful application areas include: • A pharmaceutical company can analyze its recent sales force activity and their results to improve targeting of high-value physicians and determine which marketing activities will have the greatest impact in the next few months. The data needs to include competitor market activity as well as information about the local health care systems. The results can be distributed to the sales force via a wide-area network that enables the representatives to review the recommendations from the perspective of the key attributes in the decision process. The ongoing, dynamic analysis of the data warehouse allows best practices from throughout the organization to be applied in specific sales situations. • A credit card company can leverage its vast warehouse of customer transaction data to identify...

Words: 5855 - Pages: 24

Premium Essay

Research on Synopsis

...Data Streams: Models and Algorithms, Chapter 9: A SURVEY OF SYNOPSIS CONSTRUCTION IN DATA STREAMS Abstract The large volume of data streams poses unique space and time constraints on the computation process. Many query processing, database operations, and mining algorithms require efficient execution which can be difficult to achieve with a fast data stream. In many cases, it may be acceptable to generate approximate solutions for such problems. In recent years a number of synopsis structures have been developed, which can be used in conjunction with a variety of mining and query processing techniques in data stream processing. Some key synopsis methods include those of sampling, wavelets, sketches and histograms. In this chapter, we will provide a survey of the key synopsis techniques, and the mining techniques supported by such methods. We will discuss the challenges and tradeoffs associated with using different kinds of techniques, and the important research directions for synopsis construction. 1. Introduction Data streams pose a unique challenge to many database and data mining applications because of the computational and storage costs associated with the large volume of the data stream. In many cases, synopsis data structures and statistics can be constructed from streams which are useful for a variety of applications. Some examples of such applications are as follows: Approximate Query Estimation: The problem of query estimation...

Words: 17478 - Pages: 70

Premium Essay

Dfdfadf

...CHAPTER 17 FINDING INFORMATION WITH DATA MINING © Cengage Learning. All rights reserved. No distribution allowed without express authorization. The types of data analysis we discuss in this and other chapters of this book are crucial to the success of most companies in today’s data-driven business world. However, the sheer volume of available data often defies traditional methods of data analysis. Therefore, a whole new set of methods, and accompanying software, has recently been developed under the name of data mining. Data mining attempts to discover patterns, trends, and relationships among data, especially non-obvious and unexpected patterns. For example, the analysis might discover that people who purchase skim milk also tend to purchase whole wheat bread, or that cars built on Mondays before 10 A.M. on production line #5 using parts from suppliers ABC and XYZ have significantly more defects than average. This new knowledge can then be used for more effective management of a business. The place to start is with a data warehouse. Typically, a data warehouse is a huge database that is designed specifically to study patterns ... Classification analysis attempts to find variables that are related to a categorical (often binary) variable. For example, credit card customers can be categorized as those who pay their...

Words: 23665 - Pages: 95

Free Essay

Hypermarket Personal Care Campaign

...Dove Valentine Mailing Campaign Course Name: Business Analytics Using Data Mining Submitted by: (Student names) Group Members (8A) Harneet Chawla Ankit Sobti Kanika Miglani Varghese Cherian Saad Khan Note: Considering our client is an FMCG, each technique mentioned below has been explained in detail ensuring thorough/easy understanding. Business Analytics Using Data Mining – Final Project Valentine Coupon Scheme Executive summary Business problem We have been hired as data mining consultants by our client, Unilever, a reputed FMCG conglomerate. Our client has a range of products in the Personal Care Category that comprises soaps etc. One of the brands that our client owns is the Dove brand of soap. For the first time, the client is formulating a Valentine mail-in coupon scheme to be rolled out in the month of February (next year). The scheme has the following business objectives: 1) Understand the customer profile of those customers who buy Dove soap. 2) Based on the customer profile understanding, predict for next year, new customers who are most likely to buy Dove. 3) Send out a mail-in discount coupon to those respective customers of Hypermarket. 4) Client will be conducting a promotional campaign for which it will be incurring substantial costs, so it wants to ensure that next year when the campaign is rolled out, the coupons are sent out to customers who are most likely to avail them. 5) Client wants to increase customer loyalty towards Dove soap, considering...

Words: 2632 - Pages: 11

Premium Essay

Data Mining

...Running Head: DATA MINING Assignment 4: Data Mining Introduction Data mining is also called Knowledge Discovery in Databases (KDD). It is a powerful technology which has great potential in helping companies to focus on the most important information they have in their database. Due to the increased use of technologies, interest in data mining has increased rapidly. Data mining can be used to predict future behavior rather than focus on past events. This is done by focusing on existing information that may be stored in their data warehouse or information warehouse. Companies are now utilizing data mining techniques to assess their database for trends, relationships, and outcomes to improve their overall operations and discover new ways that may permit them to improve their customer services. Data mining provides multiple benefits to government, businesses, and society as well as individual persons (Data Mining, 2011). Benefits of data mining to businesses From a business point of view, the advantage of data mining is that large volumes of apparently pointless information are filtered into important and valuable business information for the company, which can be stored in data warehouses. While in the past the emphasis was on marketing products, utilities, and services, the center of attention is now on customers (their choices, preferences, likes, and dislikes), and data mining is possibly one of the most important tools...

Words: 1302 - Pages: 6

Free Essay

Market Segmentation Theory

...Marketing Segmentation Theory Defining the Segmentation: Segmentation can be defined as “the term given to the grouping of customers with similar needs by a number of different variables”. In simple words, it can also be defined as “the act of dividing or partitioning; separation by the creation of a boundary that divides or keeps apart”. What Does Market Segmentation Mean? “A marketing term referring to the aggregating of prospective buyers into groups (segments) that have common needs and will respond similarly to a marketing action”. Market segmentation can also be defined as “the process of dividing a market up into different groups of customers, in order to create different products to meet their specific needs”. The most obvious type of segmentation is between customers who buy distinctly different products. For example, in manufacturing sandwiches, you would clearly be able to make a distinction between creating sandwiches for vegetarians and those for meat eaters. Market segmentation enables companies to target different categories of consumers who perceive the full value of certain products and services differently from one another. Generally three criteria can be used to identify different market segments: 1) Homogeneity (common needs within segment) 2) Distinction (unique from other groups) 3) Reaction (similar response to market) What is Market Segmentation Theory? “A modern theory pertaining to interest rates stipulating that there is no necessary relationship...

Words: 1034 - Pages: 5

Premium Essay

Airtel

...Churn Prediction Vladislav Lazarov vladislav.lazarov@in.tum.de Technische Universität München Marius Capota Technische Universität München mariuscapota@yahoo.com ABSTRACT The rapid growth of the market in every sector is leading to a bigger subscriber base for service providers. More competitors, new and innovative business models and better services are increasing the cost of customer acquisition. In this environment service providers have realized the importance of the retention of existing customers. Therefore, providers are forced to put more effort into the prediction and prevention of churn. This paper aims to present commonly used data mining techniques for the identification of churn. Based on historical data, these methods try to find patterns which can point out possible churners. Well-known techniques used for this are Regression analysis, Decision Trees, Neural Networks and Rule based learning. In section 1 we give a short introduction describing the current state of the market, then in section 2 a definition of customer churn, its types and the importance of identification of churners is discussed. Section 3 reviews different techniques used, pointing out advantages and disadvantages. Finally, the current state of research and new emerging algorithms are presented. given a huge choice of offers and different service providers to decide upon, winning new customers is a costly and hard process. Therefore, putting more effort in keeping churn low has become...

Words: 3713 - Pages: 15

Free Essay

Data Mining Algorithms for Classification

...Data Mining Algorithms for Classification BSc Thesis Artificial Intelligence Author: Patrick Ozer Radboud University Nijmegen January 2008 Supervisor: Dr. I.G. Sprinkhuizen-Kuyper Radboud University Nijmegen Abstract Data Mining is a technique used in various domains to give meaning to the available data. In classification tree modeling the data is classified to make predictions about new data. Using old data to predict new data has the danger of being too fitted on the old data. But that problem can be solved by pruning methods which degeneralize the modeled tree. This paper describes the use of classification trees and shows two methods of pruning them. An experiment has been set up using different kinds of classification tree algorithms with different pruning methods to test the performance of the algorithms and pruning methods. This paper also analyzes data set properties to find relations between them and the classification algorithms and pruning methods. 1 Introduction In the last few years Data Mining has become more and more popular. Together with the information age, the digital revolution made it necessary to use some heuristics to be able to analyze the large amount of data that has become available. Data Mining has especially become popular in the fields of forensic science, fraud analysis and healthcare, for it reduces costs in time and money. One of the definitions of Data Mining is: “Data Mining is a process that consists of applying data analysis and discovery...

Words: 5455 - Pages: 22

Premium Essay

Data Mining

...Data Mining 6/3/12 CIS 500 Data mining is the process of analyzing data from different perspectives and summarizing it into useful information. This information can be used to increase revenue, cut costs, or both. Data mining software is a major analytical tool used for analyzing data. It allows the user to analyze data from many different angles, categorize the data, and summarize the relationships. In a nutshell, data mining is used mostly for finding correlations or patterns among fields in very large databases. What ultimately can data mining do for a company? A lot. Data mining is primarily used by companies with a strong customer focus in retail or finance. It allows companies to determine relationships among factors such as price, product placement, and staff skill set. There are external factors that data mining can use as well, such as location, economic indicators, and competition from other companies. With the use of data mining, a retailer can look at point-of-sale records of customer purchases to send promotions to certain areas based on purchases made. An example of this is Blockbuster looking at movie rentals to send customers updates regarding new movies depending on their previous rental list. Another example would be American Express suggesting products to card holders depending on monthly purchase histories. Data mining consists of 5 major elements: • Extract, transform, and load transaction data onto the data...

Words: 1012 - Pages: 5