Reprinted from Journal of Documentation, Volume 60, Number 5, 2004, pp. 493-502 (Copyright © MCB University Press, ISSN 0022-0418), and previously published in Journal of Documentation, Volume 28, Number 1, 1972, pp. 11-21.

A statistical interpretation of term specificity and its application in retrieval
Karen Spärck Jones
Computer Laboratory, University of Cambridge, Cambridge, UK

Abstract: The exhaustivity of document descriptions and the specificity of index terms are usually regarded as independent. It is suggested that specificity should be interpreted statistically, as a function of term use rather than of term meaning. The effects on retrieval of variations in term specificity are examined, experiments with three test collections showing, in particular, that frequently-occurring terms are required for good overall performance. It is argued that terms should be weighted according to collection frequency, so that matches on less frequent, more specific, terms are of greater value than matches on frequent terms. Results for the test collections show that considerable improvements in performance are obtained with this very simple procedure.

Exhaustivity and specificity
We are familiar with the notions of exhaustivity and specificity: exhaustivity is a property of index descriptions, and specificity one of index terms. They are most clearly illustrated by a simple keyword or descriptor system. In this case the exhaustivity of a document description is the coverage of its various topics given by the terms assigned to it; and the specificity of an individual term is the level of detail at which a given concept is represented. These features of a document retrieval system have been discussed by Cleverdon et al. (1966) and Lancaster (1968), for example, and the effects of variation in either have been noted. For instance, if the exhaustivity of a document description is increased by the assignment of more terms, when the number of terms in the indexing vocabulary is constant, the chance of the document matching a request is increased. The idea of an optimum level of indexing exhaustivity for a given document collection then follows: the average number of descriptors per document should be adjusted so that, hopefully, the chances of requests matching relevant documents are maximized, while too many false drops are avoided. Exhaustivity obviously applies to requests too, and one function of a search strategy is to vary request exhaustivity. I will be mainly concerned here, however, with document descriptions.

Specificity as characterized above is a semantic property of index terms: a term is more or less specific as its meaning is more or less detailed and precise. This is a natural view for anyone concerned with the construction of an entire indexing vocabulary. Some decision has to be made about the discriminating power of individual terms in addition to their descriptive propriety. For example, the index term "beverage" may be as properly used for documents about tea, coffee, and cocoa as the terms "tea", "coffee", and "cocoa". Whether the more general term "beverage" only is incorporated in the vocabulary, or whether "tea", "coffee", and "cocoa" are adopted, depends on judgements about the retrieval utility of distinctions between documents made by the latter but not the former. It is also predicted that the more general term would be applied to more documents than the separate terms "tea", "coffee", and "cocoa", so the less specific term would have a larger collection distribution than the more specific ones. It is of course assumed here that such choices when a vocabulary is constructed are exclusive: we may either have "beverage" or "tea", "coffee", and "cocoa". What happens if we have all four terms is a different matter. We may then either interpret "beverage" to mean "other beverages" or explicitly treat it as a related broader term. I will, however, disregard these alternatives here.

In setting up an index vocabulary the specificity of index terms is looked at from one point of view: we are concerned with the probable effects on document description, and hence retrieval, of choosing particular terms, or rather of adopting a certain set of terms. For our decisions will, in part, be influenced by relations between terms, and how the set of chosen terms will collectively characterize the set of documents. But throughout we assume some level of indexing exhaustivity. We are concerned with obtaining an effective vocabulary for a collection of documents of some broadly known subject matter and size, where a given level of indexing exhaustivity is believed to be sufficient to represent the content of individual documents adequately, and distinguish one document from another.

Index term specificity must, however, be looked at from another point of view. What happens when a given index vocabulary is actually used? We predict when we opt for "beverage", for example, that it will be used more than "cocoa". But we do not have much idea of how many documents there will be to which "beverage" may appropriately be assigned. This is not simply determined even when some level of exhaustivity is assumed. There will be some documents which cry out for "beverage", so to speak, and we may have some idea of what proportion of the collection this is likely to be. There will also be documents to which "beverage" cannot justifiably be assigned, and this proportion may also be estimated. But there is unfortunately liable to be some number of documents to which "beverage" may or may not be assigned, in either case quite plausibly. In general, therefore, the actual use of a descriptor may diverge considerably from the predicted use. The proportions of a collection to which a term does and does not belong can only be estimated very roughly; and there may be enough intermediate documents for the way the term is assigned to these to affect its overall distribution considerably. Over a long period the character of the collection as a whole may also change, with further effects on term distribution.

This is where the level of exhaustivity of description matters. As a collection grows, maintaining a certain level of exhaustivity may mean that the descriptions of different documents are not sufficiently distinguished, while some terms are very heavily used. More generally, great variation in term distribution is likely to appear. It may thus be the case that a particular term becomes less effective as a means of retrieval, whatever its actual meaning. This is because it is not discriminating. It may be properly assigned to documents, in the sense that their content justifies the assignment; but it may no longer be sufficiently useful in itself as a device for distinguishing the typically small class of documents relevant to a request from the remainder of the collection. A frequently used term thus functions in retrieval as a non-specific term, even though its meaning may be quite specific in the ordinary sense.

Statistical specificity
It is not enough, in other words, to think of index term specificity solely in setting up an index vocabulary, as having to do with accuracy of concept representation. We should think of specificity as a function of term use. It should be interpreted as a statistical rather than semantic property of index terms. In general we may expect vaguer terms to be used more often, but the behaviour of individual terms will be unpredictable. We can thus redefine exhaustivity and specificity for simple term systems: the exhaustivity of a document description is the number of terms it contains, and the specificity of a term is the number of documents to which it pertains. The relation between the two is then clear, and we can see, for instance, that a change in the exhaustivity of descriptions will affect term specificity: if descriptions are longer, terms will be used more often. This is inevitable for a controlled vocabulary, but also applies if extracted keywords are used, particularly in stem form. The incidence of words new to the keyword vocabulary does not simply parallel the number of documents indexed, and the extraction of more keywords per document is more likely to increase the frequency of current keywords than to generate new ones.

Once this statistical interpretation of specificity, and the relation between it and exhaustivity, are recognized, it is natural to attempt a more formal approach to seeking an optimum level of specificity in a vocabulary and an optimum level of exhaustivity in indexing, for a given collection. Within the broad limits imposed by having sensible terms, i.e. ones which can be reached from requests and applied to documents, we may try to set up a vocabulary with the statistical properties which are hopefully optimal for retrieval. Purely formal calculations may suggest the correct number of terms, and of terms per document, for a certain degree of document discrimination. Work on these lines has been done by Zunde and Slamecka (1967), for instance. More informally, the suggestion that descriptors should be designed to have approximately the same distribution, made by Salton (1968), for example, is motivated by respect for the retrieval effects of purely statistical features of term use.

Unfortunately, abstract calculations do not select actual terms. Nor are document collections static. More importantly, it is difficult to control requests. One may characterize documents with a view to distinguishing them nicely and then find that users do not provide requests utilizing these distinctions. We may therefore be forced to accept a de facto non-optimal situation with terms of varying specificity and at least some disagreeably non-specific terms. There will be some terms which, whatever the original intention, retrieve a large number of documents, of which only a small proportion can be expected to be relevant to a request. Such terms are on the whole more of a nuisance than rare, over-specific terms which fail to retrieve documents.

These features of term behaviour can be illustrated by examples from three well-known test collections, obtained from the Aslib Cranfield, INSPEC, and College of Librarianship Wales projects. In fact in these the vocabulary consists of extracted keyword stems, which may be expected to show more variation than controlled terms. But there is no reason to suppose that the situation is essentially different. Full descriptions of the collections are given in Cleverdon et al. (1966), Aitchison et al. (1970), and Keen (forthcoming). Relevant characteristics of the collections are given in Section A of Table I. The INSPEC collection, for instance, has 541 documents indexed by 1,341 terms. In all the collections there are some very frequently occurring terms: for example, in the Cranfield collection one term occurs in 144 out of 200 documents; in the INSPEC collection one term occurs in 112 out of 541; and in the Keen collection one term occurs in 199 out of 797 documents. The terms concerned do not necessarily represent concepts central to the subject areas of the collections, and they are not always general terms. In the Keen collection, which is about information science, the most frequent term is "index-", and other frequent ones include "librar-", "inform-", and "comput-". In the INSPEC collection the most frequent is "theor-", followed by "measur-" and "method-". And in the Cranfield collection the most frequent is "flow-", followed by "pressur-", "distribut-" and "bound-" (boundary). The rarer terms are a fine mixed bag, including "purchas-" and "xerograph-" for Keen, "parallel-" and "silver-" for INSPEC, and "logarithm-" and "seri-" (series) for Cranfield.
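Computing specificity in this statistical sense is trivial: a term's collection frequency is just the number of documents to which it is assigned. The following is a minimal sketch in Python, using an invented toy collection of keyword-stem descriptions (the stems echo the Cranfield examples above, but the documents and counts are made up for illustration):

```python
from collections import Counter

def collection_frequencies(documents):
    """Number of documents each term is assigned to. Under the
    statistical interpretation, the fewer documents a term occurs
    in, the more specific it is."""
    df = Counter()
    for terms in documents:
        df.update(set(terms))  # count each term at most once per document
    return df

# Invented toy collection: each document is a set of keyword stems.
docs = [
    {"flow-", "pressur-", "bound-"},
    {"flow-", "layer-", "bound-"},
    {"flow-", "seri-"},
]

df = collection_frequencies(docs)
print(df.most_common())
# "flow-" occurs in all three documents, so it functions as a
# non-specific term; "seri-" occurs in one, so it is highly specific.
```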

Table I. Collection characteristics and term-matching statistics for the Cranfield, INSPEC, and Keen collections (Sections A-D, referenced in the text; table not reproduced in this reprint).

Specificity and matching
How should one cope with variable term specificity, and especially with insufficiently specific terms, when these occur in requests? The untoward effects of frequent term use can in principle be dealt with very naturally, through term combinations. For instance, though the three terms "bound-", "layer-", and "flow-" occur in 73, 62, and 144 documents respectively in the Cranfield collection, there are only 50 documents indexed by all three terms together. Relying on term conjunction is quite straightforward. It is in particular a way of overcoming the untoward consequences of the fact that requests tend to be formulated in better known, and hence generally more frequent, terms. It is unfortunate, but not surprising, that requests tend to be presented in terms with an average frequency much above that for the indexing vocabulary as a whole. This holds for all three test collections, as appears in Section B of Table I. For the Cranfield collection, for example, the average number of postings for the terms in the vocabulary is nine, while the average for the terms used in the requests is 31.6; for Keen the figures are 6.1 and 44.8.

But relying on term combination to reduce false drops is well known to be risky. It is true that the more terms a document and a request have in common, the more likely it is that the document is relevant to the request. Unfortunately, it just happens to be difficult to match term conjunctions. This is well exhibited by the term-matching behaviour of the three collections, as shown in Section C of Table I. The average number of starting terms per request ranges from 5.3 for Keen to 6.9 for Cranfield. But the average number of retrieving terms per request, i.e. the average of the highest matching scores, ranges from 3.2 to 5.0. More importantly, the average number of matching terms for the relevant documents retrieved ranges from only 1.8 for Keen to 3.6 for Cranfield, though fortunately the average for all documents retrieved, which are predominantly non-relevant, ranges from a mere 1.2 to 1.8.

Clearly, one solution to this problem is to provide for more matching terms in some way. This may be achieved either by providing alternative substitutes for given terms, through a classification; or by increasing the exhaustivity of document or request specifications, say by adding statistically associated terms. But either approach involves effort, perhaps considerable effort, since the sets of terms related to individual terms must be identified. The question naturally arises as to whether better use can be made of existing term descriptions without such effort.

As very frequently occurring terms are responsible for noise in retrieval, one possible course is simply to remove them from requests. The fact that this reduces the number of terms available for conjoined matching may be offset by the fact that fewer non-relevant documents will be retrieved. Unfortunately, while frequent terms cause noise, they are also required for reasonably high recall. For all three test collections, the deletion of very frequent terms by the application of a suitable threshold leads to a decline in overall performance. For the INSPEC collection, for example, the threshold was set to delete terms occurring in 20 or more documents, so that 73 terms out of the total vocabulary of 1,341 were removed. The effect on retrieval performance is illustrated by the recall/precision graph of Figure 1 for the Cranfield collection. Matching is by simple term co-ordination levels, and averaging over the set of requests is by straightforward averaging of numbers; precision at ten standard recall values is then interpolated.

The same relationship between full term matching and this restricted matching with non-frequent terms only is exhibited by the other collections: the recall ceiling is lowered by at least 30 per cent, and indeed for the Keen collection is reduced from 75 per cent to 25 per cent, though precision is maintained. Inspection of the requests shows why this result is obtained. Not merely is request term frequency much above average collection frequency; the comparatively small number of very frequent terms plays a large part in request formulation. "Flow-", for example, appears in twelve Cranfield requests out of 42, and in general, for all three collections, about half the terms in a request are very frequent ones, as shown in Section D of Table I. Throwing very frequent terms away is throwing the baby out with the bath water, since they are required for the retrieval of many relevant documents. The combination of non-frequent terms is discriminating, but no more so than that of frequent and non-frequent terms. The value of the non-frequent terms is clearly seen, on the other hand, when matching using frequent terms only is compared with full matching, also shown in Figure 1. Matching levels for total and relevant documents are nearly as high as for all terms, but the non-frequent terms in the latter raise the relevant matching level by about 1.

These features of term retrieval suggest that to improve on the initial full term performance we need to exploit the good features of very frequent and non-frequent terms, while minimizing their bad ones. We should allow some merit in frequent term matches, while allowing rather more in non-frequent ones. In any case we wish to maximize the number of matching terms.
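To make the two matching regimes just compared concrete, here is a small Python sketch of co-ordination-level matching with and without frequent-term deletion. The collection frequencies for "bound-", "layer-", and "flow-" are the Cranfield counts quoted above; the example document, the rare term's count, and the deletion threshold are invented:

```python
# Cranfield counts from the text for three terms; "seri-" count invented.
df = {"flow-": 144, "bound-": 73, "layer-": 62, "seri-": 2}

def coordination_level(request, document):
    """Co-ordination level: the number of terms a request and a
    document have in common; documents are ranked by this score."""
    return len(request & document)

def delete_frequent(request, df, threshold):
    """Drop request terms occurring in `threshold` or more documents
    (cf. the threshold of 20 applied to the INSPEC collection)."""
    return {t for t in request if df.get(t, 0) < threshold}

request = {"bound-", "layer-", "flow-"}
document = {"flow-", "layer-", "bound-", "seri-"}

print(coordination_level(request, document))  # 3: all three terms match
# With an (invented) threshold of 70, "flow-" and "bound-" are deleted
# from the request and the match level drops to 1 -- the recall loss
# the experiments exhibit.
print(coordination_level(delete_frequent(request, df, 70), document))
```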

Figure 1. Recall/precision graph for the Cranfield collection, comparing full term matching with matching on non-frequent terms only and on frequent terms only (graph not reproduced in this reprint).

Weighting by specificity
This clearly suggests a weighting scheme. In normal term co-ordination matches, if a request and document have a frequent term in common, this counts for as much as a non-frequent one; so a document sharing three frequent terms with a request is retrieved at the same level as another sharing three rare terms with it. But it seems we should treat matches on non-frequent terms as more valuable than ones on frequent terms, without disregarding the latter altogether. The natural solution is to correlate a term's matching value with its collection frequency. At this stage the division of terms into frequent and non-frequent is arbitrary and probably not optimal: the elegant and almost certainly better approach is to relate matching value more closely to relative frequency. The appropriate way of doing this is suggested by the term distribution curve for the vocabulary, which has the familiar Zipf shape. Let f(n) = m such that 2^(m-1) < n ≤ 2^m.
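This excerpt breaks off at that definition, so to make it concrete: f(n) = m with 2^(m-1) < n ≤ 2^m is simply m = ⌈log2 n⌉, and the full paper goes on to weight a term occurring in n of N documents by f(N) - f(n) + 1, the form now recognized as the ancestor of inverse document frequency (IDF) weighting. A minimal Python sketch, reusing the Cranfield counts quoted earlier plus an invented rare term:

```python
import math

def f(n):
    """f(n) = m such that 2^(m-1) < n <= 2^m, i.e. m = ceil(log2 n)."""
    assert n >= 1
    return math.ceil(math.log2(n))

def term_weight(n, N):
    """Weight of a term occurring in n of N documents: f(N) - f(n) + 1.
    Rarer, more specific terms receive larger weights."""
    return f(N) - f(n) + 1

def weighted_match(request, document, df, N):
    """Sum the weights of matching terms instead of merely counting
    them, so matches on non-frequent terms are worth more."""
    return sum(term_weight(df[t], N) for t in request & document)

# Cranfield-style figures: 200 documents; "flow-" occurs in 144 of
# them (from the text), "seri-" in 2 (invented for illustration).
N = 200
df = {"flow-": 144, "bound-": 73, "layer-": 62, "seri-": 2}

request = {"flow-", "bound-", "seri-"}
document = {"flow-", "layer-", "seri-"}
# "flow-" contributes f(200) - f(144) + 1 = 8 - 8 + 1 = 1;
# "seri-" contributes 8 - 1 + 1 = 8: the rare match dominates.
print(weighted_match(request, document, df, N))  # 9
```

The frequent term still earns a little credit, as the text requires, while the match on the rare, specific term dominates the score.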
