...Assignment 3 – Classification

Note: Show all your work.

Problem 1 (25 points) Consider the following dataset:

ID | A1 | A2 | A3 | Class
1 | Low | Mild | East | Yes
2 | Low | Hot | West | No
3 | Medium | Mild | East | No
4 | Low | Mild | East | Yes
5 | High | Mild | East | Yes
6 | Medium | Hot | West | No
7 | High | Hot | West | Yes
8 | Low | Cool | West | No
9 | Medium | Cool | East | Yes
10 | High | Cool | East | No
11 | Medium | Mild | West | Yes
12 | Medium | Cool | West | No
13 | Medium | Hot | West | Yes
14 | High | Hot | East | Yes

Suppose we have a new tuple X = (A1 = Medium, A2 = Cool, A3 = East). Predict the class label of X using Naïve Bayesian classification. You need to show all your work.

Problem 2 (25 points) Consider the following dataset D.

ID | A1 | A2 | A3 | Class
1 | Low | Mild | East | Yes
2 | Low | Hot | West | No
3 | Medium | Mild | East | No
4 | Low | Mild | West | Yes
5 | High | Cool | East | Yes
6 | Low | Hot | West | No
7 | High | Hot | West | Yes
8 | Low | Cool | West | No
9 | Medium | Cool | East | Yes
10 | High | Hot | East | No
11 | Medium | Mild | East | Yes
12 | Medium | Cool | West | No
13 | High | Hot | West | Yes
14 | High | Hot | East | Yes

(1) Compute the Info of the whole dataset D. (2) Compute the information gain for each of A1, A2, and A3, and determine the splitting attribute (or the best split attribute)...
Words: 1222 - Pages: 5
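The two computations the assignment asks for can be sketched in Python as follows. This is only an illustrative sketch, not the assignment's official solution: it assumes plain relative-frequency estimates with no Laplace smoothing, entropy measured in bits (log base 2), and A1 values written with consistent capitalization. The function names (`naive_bayes_scores`, `info`, `gain`) are hypothetical helpers introduced here.

```python
from collections import Counter
from math import log2

# Rows are (A1, A2, A3, Class), copied from the two tables in the assignment.
P1 = [
    ("Low", "Mild", "East", "Yes"), ("Low", "Hot", "West", "No"),
    ("Medium", "Mild", "East", "No"), ("Low", "Mild", "East", "Yes"),
    ("High", "Mild", "East", "Yes"), ("Medium", "Hot", "West", "No"),
    ("High", "Hot", "West", "Yes"), ("Low", "Cool", "West", "No"),
    ("Medium", "Cool", "East", "Yes"), ("High", "Cool", "East", "No"),
    ("Medium", "Mild", "West", "Yes"), ("Medium", "Cool", "West", "No"),
    ("Medium", "Hot", "West", "Yes"), ("High", "Hot", "East", "Yes"),
]
P2 = [
    ("Low", "Mild", "East", "Yes"), ("Low", "Hot", "West", "No"),
    ("Medium", "Mild", "East", "No"), ("Low", "Mild", "West", "Yes"),
    ("High", "Cool", "East", "Yes"), ("Low", "Hot", "West", "No"),
    ("High", "Hot", "West", "Yes"), ("Low", "Cool", "West", "No"),
    ("Medium", "Cool", "East", "Yes"), ("High", "Hot", "East", "No"),
    ("Medium", "Mild", "East", "Yes"), ("Medium", "Cool", "West", "No"),
    ("High", "Hot", "West", "Yes"), ("High", "Hot", "East", "Yes"),
]

def naive_bayes_scores(rows, x):
    """Return {class: P(class) * product of P(x_i | class)} from raw counts."""
    class_counts = Counter(r[-1] for r in rows)
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / len(rows)  # prior P(class)
        for i, value in enumerate(x):
            matches = sum(1 for r in rows if r[-1] == c and r[i] == value)
            score *= matches / n_c  # conditional P(x_i | class)
        scores[c] = score
    return scores

def info(rows):
    """Entropy (in bits) of the class column."""
    counts = Counter(r[-1] for r in rows)
    return -sum((n / len(rows)) * log2(n / len(rows)) for n in counts.values())

def gain(rows, i):
    """Information gain from splitting on the attribute at column index i."""
    partitions = {}
    for r in rows:
        partitions.setdefault(r[i], []).append(r)
    info_a = sum(len(p) / len(rows) * info(p) for p in partitions.values())
    return info(rows) - info_a

scores = naive_bayes_scores(P1, ("Medium", "Cool", "East"))
prediction = max(scores, key=scores.get)
gains = {name: gain(P2, i) for i, name in enumerate(["A1", "A2", "A3"])}
best_attribute = max(gains, key=gains.get)
print(scores, prediction)
print(info(P2), gains, best_attribute)
```

The printed scores are unnormalized; comparing them is enough to pick a class, and dividing each by their sum would give posterior probabilities.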
...data. The investigator may want to ensure that the dogs allocated to each treatment group were of similar compositions with respect to gender and hair coat. Use PROC FREQ to conduct Fisher’s exact test to see if the concentration of the drug received was statistically independent of the gender of the dog. Likewise, see if the length of the coat and the drug treatment were statistically independent with Fisher’s exact test. Write your interpretation of the results of these tests. 2. Refer to the MANATEES data. Your task is to see if the proportion of manatees killed by human-related causes has remained about the same through time or if this proportion has changed significantly from year to year. Create a dataset with...
Words: 891 - Pages: 4
...calculated

Mean

The mean is calculated by adding up all the numbers in a dataset and then dividing that by the number of values in that dataset. An example is given below of a dataset of six students and their test scores out of 20:

S1 | S2 | S3 | S4 | S5 | S6 | Total | Divided by | Mean
18 | 16 | 17 | 13 | 14 | 19 | 97 | 6 | 16.17

In order to find the mean test score, the total amount scored (97) is divided by the number of students (6), which equals 16.17. The formula for calculating the mean is as follows:

$\bar{X} = \frac{\sum X}{n}$

$\bar{X}$ is the symbol for the mean and is referred to as bar X (ex)
$\sum$ is the Greek symbol sigma and simply means sum or add up
$X$ refers to each of the individual values that make up the dataset
$n$ is the number of values that make up the dataset

The use of the mean is effective when the values of the dataset are evenly spread with no extreme high or low values. If the dataset does contain one or two extremely high or low values, the result will be adversely affected by this. An example of this is given below, where S1 and S4 are much lower than the other values.

S1 | S2 | S3 | S4 | S5 | S6 | Total | Divided by | Mean
2 | 16 | 17 | 3 | 14 | 19 | 71 | 6 | 11.83

If a dataset does contain extremely low or extremely high values, then the median is a better way of measuring the average value.

Median

This refers to the middle value in a dataset, where values are arranged from lowest to highest or vice versa. When the dataset has...
Words: 1285 - Pages: 6
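The mean calculation described in this excerpt can be reproduced with a short Python sketch. It uses the two sets of six test scores from the excerpt and the standard library's `statistics` module; the variable names and the `summarize` helper are hypothetical, and the median is included only to show how little it is pulled around by the two low scores compared with the mean.

```python
from statistics import mean, median

# Test scores out of 20 for students S1..S6, taken from the excerpt.
scores_even = [18, 16, 17, 13, 14, 19]
scores_skewed = [2, 16, 17, 3, 14, 19]  # S1 and S4 are extreme low values

def summarize(values):
    """Return the total, count, mean (2 decimals) and median of the values."""
    return {
        "total": sum(values),
        "n": len(values),
        "mean": round(mean(values), 2),  # sum of values divided by n
        "median": median(values),        # middle of the sorted values
    }

print(summarize(scores_even))
print(summarize(scores_skewed))
```

With an even number of values, `median` averages the two middle values of the sorted list.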
...semi-supervised and unsupervised detection. Each category utilises different detection techniques such as classification, clustering, nearest neighbour and statistical. Each category and technique has several strengths and weaknesses compared with other outlier detection methods. This review provides initial information on data labelling and classification before examining some of the existing outlier detection techniques within each of the three categories. It then looks at the use of combining detection techniques before comparing and discussing the advantages and disadvantages of each method. Finally, a new classification technique is proposed using a new outlier detection algorithm, Isolation Forest. DATA LABELLING Datasets normally consist of many data instances with each instance usually containing one to many attributes. Each attribute may take a predetermined value such as numerical, nominal or binary or the value may be missing. Data labelling (assigning a class to each instance in a dataset) is a time...
Words: 2395 - Pages: 10
...easier input options that perform similar tasks as Microsoft Access. This allows ABC Sales to develop the most robust data sets for the lowest cost, with minimal additional training requirements for existing staff. It is clear that ABC Sales Department needs to choose Microsoft Excel. Problem Statement The ABC Sales Department has an organizational problem. ABC Sales Department does not keep a record of annual inventory renewal schedules for their special projects. The sales department needs to start keeping these records in order to properly plan future budgets, set realistic goals for employees to reach, and be able to forecast proper pricing for renewals. The sales director needs to implement an easy-to-use and cost-effective tracker for the managers to use. The managers need to start tracking these items in order to increase revenue so that the ABC Company can be successful in the market. Overview of Alternatives The following two alternatives considered in this report meet the ABC Company requirements: Alternative A-Microsoft Excel Spreadsheets In order to understand and track the revenue and dates of prior existing orders, databases can...
Words: 2026 - Pages: 9
...easier input options that perform similar tasks as Microsoft Access. This allows ABC Sales to develop the most robust data sets for the lowest cost, with minimal additional training requirements for existing staff. It is clear that ABC Sales Department needs to choose Microsoft Excel. Problem Statement The ABC Sales Department has an organizational problem. ABC Sales Department does not keep a record of annual inventory renewal schedules for their special projects. The sales department needs to start keeping these records in order to properly plan future budgets, set realistic goals for employees to reach, and be able to forecast proper pricing for renewals. The sales director needs to implement an easy-to-use and cost-effective tracker for the managers to use. The managers need to start tracking these items in order to increase revenue so that the ABC Company can be successful in the market. Overview of Alternatives The following two alternatives considered in this report meet the ABC Company requirements: Alternative A-Microsoft Excel Spreadsheets In order to understand and track the revenue and dates of prior existing orders, databases can be exported...
Words: 2025 - Pages: 9
...vkanlanji}@andrew.cmu.edu 1 Carnegie Mellon School of Computer Science, Pittsburgh, USA Abstract. In various ML-as-a-service cloud systems, the process of performing machine learning over the data is treated almost as a black box, where the user just feeds in their data and knows the model used, and the system outputs the required insights. In this work, we explore the idea of being able to predict sensitive attributes associated with the database, given that the adversary would have access to a few quasi-identifiers associated with the database. We use the inversion attack as the theoretical foundation for our attack, and implement the same for our database. We experiment with this attack on different variants of classification algorithms, like classification tree and regression tree. We follow it up by analysing the accuracy of our attack for each of our classification-based machine learning algorithms for different sizes of training datasets. We end our work by trying to figure out what we call the "most impactful attribute", by selectively removing the data pertaining to an attribute and checking the corresponding effect on the inversion attack. We hope our work in this domain pushes future batches of this class to explore this question even further, and to look into understanding whether Differential Privacy solves this problem. Keywords: Inversion Attack, Black Box, Classification Tree, Regression Tree 1 Introduction The Internet of Things (IoT) is the network of physical...
Words: 5223 - Pages: 21
...Data Collection and Calculation: Real Estate Data QNT/351 Real Estate: Home Price and Size Analysis The purpose of this paper is to perform a descriptive statistics analysis on the real estate dataset. The analysis will be aimed at investigating relationships between the price of a home and the home’s square footage. The team will use the dataset to figure out if there is a direct relationship between price and square footage, the assumption being there is a positive correlation between home size and price. Research Questions Purchasing a home is the largest financial decision made by most American families. The traditional view has been that people should first buy a small “starter home,” build equity for a few years, and then move into a bigger home. The logic behind this is that smaller homes are relatively cheaper than larger homes, i.e. that there is a direct positive correlation between the price of a house and its square footage. This brings up the following research questions: • What are some of the descriptive statistics for home prices and home sizes? • Is there a relationship between the price of a home and its size? These research questions will be answered through an analysis of the descriptive statistics of two of the variables in the real estate dataset, specifically Price (in dollars) and Size (in square feet.) These two variables are both quantitative, with a ratio level of measurement. Descriptive Statistics The following table summarizes some...
Words: 1135 - Pages: 5
...[BI-PROJECT REPORT] April 13, 2014 DATA MINING Analysis of Bike sharing dataset April 13, 2014 Group 007 MIS 6324 Project Report for Analysis of bike sharing dataset MIS-6324 Intro. to business intelligence software and techniques Prepared by Group Name Group007 Group Members Rohith Raj Abhay Joshi Sai Karan Jahnavi Papanaboina Under the guidance of Professor Kelly Slaughter, PhD Clinical Professor Information Systems University of Texas at Dallas

Table of Contents
1. Introduction to Data Mining
2. Background of the dataset
2.1 Description of dataset
3. Outline of Analysis
4. The Methodology
5. Pre-processing the dataset...
Words: 2575 - Pages: 11
...to cope with the variety of human skin colors across different ethnicities. Moreover, existing methods require high computational cost. In this paper, we propose a novel human skin detection approach that combines a smoothed 2-D histogram and Gaussian model, for automatic human skin detection in color image(s). In our approach, an eye detector is used to refine the skin model for a specific person. The proposed approach reduces computational costs as no training is required, and it improves the accuracy of skin detection despite wide variation in ethnicity and illumination. To the best of our knowledge, this is the first method to employ a fusion strategy for this purpose. Qualitative and quantitative results on three standard public datasets and a comparison with state-of-the-art methods have shown the effectiveness and robustness of the proposed approach. Index Terms: Color space, dynamic threshold, fusion strategy, skin detection. I. INTRODUCTION WITH the progress of information society today, images have become more and more important. Among them, skin detection plays an important role in a wide range of image processing applications from face tracking, gesture analysis, content-based image retrieval systems to various human–computer interaction domains [1]–[6]. In these applications, the search space for...
Words: 5432 - Pages: 22
...PROC FREQ [DATA=dataset] [ORDER=FREQ|INTERNAL|DATA];
[TABLES vars ... [/ [MISSING] [NOCUM]];]

DATA;
INPUT dept$ pos$;
CARDS;
CM CLK
CM MGR
BA CLK
BA CLK
CM MGR
EE MGR
EE CLK
CM CLK
CM MGR
BA CLK
BA CLK
CM MGR
EE MGR
EE CLK
RUN;
PROC FREQ ORDER=FREQ;
TABLES dept*pos / MISSING NOCUM;
RUN;

Frequency |
Percent   |
Row Pct   |
Col Pct   | CLK    | MGR    | Total
----------+--------+--------+
CM        |      2 |      4 |      6
          |  14.29 |  28.57 |  42.86
          |  33.33 |  66.67 |
          |  25.00 |  66.67 |
----------+--------+--------+
BA        |      4 |      0 |      4
          |  28.57 |   0.00 |  28.57
          | 100.00 |   0.00 |
          |  50.00 |   0.00 |
----------+--------+--------+
EE        |      2 |      2 |      4
          |  14.29 |  14.29 |  28.57
          |  50.00 |  50.00 |
          |  25.00 |  33.33 |
----------+--------+--------+
Total            8        6       14
             57.14    42.86   100.00

Or we can use the WEIGHT statement, which specifies a numeric variable with a value that represents the frequency of the observation.

DATA;
INPUT dept$ pos$ weight;
CARDS;
CM CLK 1
CM MGR 2
BA CLK 2
EE MGR 1
...
Words: 948 - Pages: 4
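For readers without SAS, the two-way table above can be approximated in Python. This is a rough analogue under stated assumptions, not a reimplementation of PROC FREQ: `crosstab` and `cell_stats` are hypothetical helper names, the 14 observations are the ones listed in the first DATA step, and the pre-aggregated weights below are my own tally of those 14 rows rather than the excerpt's (truncated) weighted data.

```python
from collections import Counter

# The 14 (dept, pos) observations from the first DATA step.
tokens = ("CM CLK CM MGR BA CLK BA CLK CM MGR EE MGR EE CLK "
          "CM CLK CM MGR BA CLK BA CLK CM MGR EE MGR EE CLK").split()
observations = list(zip(tokens[0::2], tokens[1::2]))

def crosstab(obs, weights=None):
    """Count each (dept, pos) cell; a weight plays the role of a frequency."""
    weights = weights or [1] * len(obs)
    cells = Counter()
    for key, w in zip(obs, weights):
        cells[key] += w
    return cells

def cell_stats(cells, dept, pos):
    """Return (frequency, percent, row percent, column percent) for a cell."""
    total = sum(cells.values())
    row_total = sum(n for (d, _), n in cells.items() if d == dept)
    col_total = sum(n for (_, p), n in cells.items() if p == pos)
    n = cells[(dept, pos)]
    return (n, round(100 * n / total, 2),
            round(100 * n / row_total, 2), round(100 * n / col_total, 2))

cells = crosstab(observations)
print(cell_stats(cells, "CM", "MGR"))

# Same table from one row per cell plus a frequency weight (assumed tally).
aggregated = [("CM", "CLK"), ("CM", "MGR"), ("BA", "CLK"), ("EE", "MGR"), ("EE", "CLK")]
weighted_cells = crosstab(aggregated, weights=[2, 4, 4, 2, 2])
print(weighted_cells == cells)
```

The weighted call illustrates the idea behind the WEIGHT statement: each input row stands for as many observations as its weight says.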
...growth and expansion of Internet technology and local network systems. Moreover, programming errors, firewall configuration errors and ambiguous or undefined security policies add to the system’s complexity. An Intrusion Detection system (IDS) is therefore needed as another layer to protect computer systems. The IDS is one of the most important techniques of information dynamic security technology. It is defined as a process of monitoring the events occurring in a computer system or network and analyzing them to differentiate between normal activities of the system and behaviors that can be classified as suspicious or intrusive. Current Intrusion Detection Systems have several known shortcomings, such as low accuracy (registering high False Positives and False Negatives); low real-time performance (processing a large amount of traffic in real time); limited scalability (storing a large number of user profiles and attack signatures); an inability to detect new attacks (recognizing new attacks when they are launched for the first time); and weak system reactive capabilities (efficiency of response). This makes the area of IDS an attractive...
Words: 2519 - Pages: 11
...Introduction The dramatic growth in obesity and overweight among Americans has become a health topic that receives widespread attention in the media. Providers believe that environmental and community factors contribute to unhealthy habits, which pose a major risk for chronic health conditions. The following are chronic health conditions: diabetes, high blood pressure, cardiovascular disease, stroke, high cholesterol, asthma, and depression. These health consequences can lead to premature death and chronic health conditions, which reduce the quality of life. In the Atlanta area, obesity has increased over the past 10 years, which affects individuals' lives. Health care organizations have established health objectives to reduce the prevalence of obesity among individuals in America. What is overweight and obesity? According to National Heart and Lung Institute (2010) “the terms overweight and obesity refer to a person’s overall body weight and whether it’s too high” (What are overweight and obesity, para. 1). A person is overweight when he or she is above a healthy weight because of muscle, bone, and fat. Obesity occurs when individuals have extra body fat. Hospitals, community clinics, and public health care agencies utilize the body mass index (BMI) to measure overweight and obesity for adults, children, and teens. BMI is the ratio of a person’s weight to the square of his or her height (MediLexicon International Ltd, 2011). This is an assessment tool to chart...
Words: 1980 - Pages: 8
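The BMI definition quoted in this excerpt (weight divided by the square of height) can be written as a one-line calculation. The sketch below assumes metric units, kilograms and metres, which the excerpt does not state; the function name `bmi` and the sample inputs are hypothetical.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight divided by the square of height (kg / m^2)."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2

# Illustrative values only, not data from the essay.
print(round(bmi(70, 1.75), 1))
```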
...the field. In contrast, in this article, we take a closer look at the entire CS research in the past two decades by analyzing the data on publications in the ACM Digital Library and IEEE Xplore, and the grants awarded by the National Science Foundation (NSF). We identify trends, bursty topics, and interesting inter-relationships between NSF awards and CS publications, finding, for example, that if an uncommonly high frequency of a specific topic is observed in publications, the funding for this topic is usually increased. We also analyze CS researchers and communities, finding that only a small fraction of authors attribute their work to the same research area for a long period of time, reflecting for instance the emphasis on novelty (use of new keywords) and typical academic research teams (with core faculty and more rapid turnover of students and postdocs). Finally, our work highlights the dynamic research landscape in CS, with its focus constantly moving to new challenges arising from new technological developments. Computer science is an atypical science in that its universe evolves quickly, with a speed that is unprecedented even for engineers. Naturally, researchers follow the evolution of their artifacts by adjusting their research interests. We want to capture this vibrant co-evolution in this paper. 1 Introduction...
Words: 15250 - Pages: 61
...III: My own comments on this class. I. Proposed approach • An incremental subspace-based K-means clustering method for high dimensional data • Subspace based document clustering and its application in data preprocessing in Web mining 1. Introduction and background High dimensional data clustering has many applications in the real world, especially in bioinformatics. Many well-known clustering algorithms use a whole-space distance score, such as Euclidean distance or the cosine function, to measure the similarity or distance between two objects. However, when the dimensionality of the space or the number of objects is large, such whole-space-based pairwise similarity scores are no longer meaningful, because the distances between all pairs of objects become nearly the same [5]. Pattern-space clustering, a kind of subspace clustering, can overcome the above problem by discovering patterns that exist in subspaces of the high dimensional data. An example of a subspace pattern can be seen in Figure 1: given a dataset consisting of 5 objects with 5 attributes {a, b, c, d, e}, as in Table 1, a subspace pattern with 4 attributes {a, b, d, e} exists in 3 objects {2, 3, 5}....
Words: 5913 - Pages: 24