
Text Mining Research Paper

Text mining is the process of extracting interesting and non-trivial knowledge or information from unstructured text data. It is a multidisciplinary field that draws on data mining, machine learning, information retrieval, computational linguistics and statistics. This research paper discusses one of the preprocessing techniques used in text mining. Preprocessing is the initial stage of a text mining system, and it significantly reduces the size of the input text documents. It involves actions such as sentence boundary determination, language-specific stop-word elimination, tokenization and stemming. The paper presents a comparative analysis of document tokenization tools.
I. Introduction
Tokenization …
Tokens are separated by whitespace characters, such as a space or line break, or by punctuation characters.
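The two separators mentioned above can be contrasted in a minimal sketch. It uses only Python's standard library; the function names and the sample text are illustrative assumptions, not part of the paper.

```python
import re


def whitespace_tokenize(text):
    """Split on runs of whitespace (spaces, tabs, line breaks)."""
    return text.split()


def punctuation_tokenize(text):
    """Split on whitespace and also emit each punctuation mark as its own token."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation character.
    return re.findall(r"\w+|[^\w\s]", text)


sample = "Text mining, in short, finds knowledge.\nIt starts with tokens."
ws_tokens = whitespace_tokenize(sample)
punct_tokens = punctuation_tokenize(sample)
print(ws_tokens)
print(punct_tokens)
```

With whitespace alone, punctuation stays attached to the neighbouring word ("mining," and "knowledge."), whereas the second function separates it so that it can be filtered out later.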
Tokenization is particularly complicated for languages written in ‘scriptio continua’, which shows no word boundaries, such as Ancient Greek, Chinese, or Thai [2]. Scriptio continua, also known as scriptura continua or scripta continua, is a style of writing without any spaces or other marks between words or sentences. The main use of tokenization is identifying meaningful keywords.
Need for Tokenization
Textual data generally begins as nothing more than a block of characters. All processes in information retrieval require the words of the data set, so a parser first needs the documents to be tokenized. This may sound trivial, as the text is already stored in a machine-readable format. Nevertheless, some problems remain, such as the removal of punctuation marks. Other characters, such as brackets and hyphens, require processing as well. Furthermore, a tokenizer can enforce consistency across the documents. Inconsistencies can arise, for example, from different number and time formats. Another problem is abbreviations and acronyms, which have to be transformed into a standard …
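The clean-up steps named in this section (punctuation and bracket removal, consistent number and time formats, abbreviation handling) might be sketched as follows. The abbreviation table, the target formats (plain digits, zero-padded HH:MM) and the function names are assumptions chosen for illustration; the paper does not specify them.

```python
import re
import string

# Hypothetical abbreviation table; a real system would use a curated list.
ABBREVIATIONS = {"dept.": "department", "approx.": "approximately"}


def normalize_token(token):
    """Return a normalized form of one whitespace-delimited token ('' if nothing is left)."""
    core = token.lower().strip("()[]{}\"'")   # drop surrounding brackets and quotes
    if core in ABBREVIATIONS:                  # expand known abbreviations
        return ABBREVIATIONS[core]
    core = core.strip(string.punctuation)      # drop leading/trailing punctuation
    if re.fullmatch(r"\d{1,3}(,\d{3})+", core):   # 1,000 -> 1000
        return core.replace(",", "")
    match = re.fullmatch(r"(\d{1,2}):(\d{2})", core)
    if match:                                  # 9:30 -> 09:30
        return f"{int(match.group(1)):02d}:{match.group(2)}"
    return core


def normalize(text):
    """Tokenize on whitespace and normalize every token, skipping empty results."""
    return [t for t in (normalize_token(tok) for tok in text.split()) if t]


normalized = normalize("The dept. opened at 9:30 (approx.) with 1,000 records.")
print(normalized)
```

Abbreviations are checked before general punctuation is stripped because their trailing period is part of the lookup key.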
These tokens could be paragraphs, sentences, or individual words. NLTK provides a number of tokenizers in its tokenize module. The text is first tokenized into sentences using the PunktSentenceTokenizer [11]. Each sentence is then tokenized into words using four different word tokenizers:
• TreebankWordTokenizer
• WordPunctTokenizer
• PunktWordTokenizer
• WhitespaceTokenizer
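The two-stage procedure described above (sentences first, then words) can be sketched with NLTK as follows. This is an illustration under stated assumptions rather than the paper's own code: the sample text is invented, the Punkt sentence tokenizer is used untrained (so it needs no downloaded model but knows no abbreviations), and the Punkt word tokenizer from the list is left out because it is not available in current NLTK releases.

```python
from nltk.tokenize import (
    PunktSentenceTokenizer,
    TreebankWordTokenizer,
    WhitespaceTokenizer,
    WordPunctTokenizer,
)

text = "Tokenization isn't trivial. It splits text into tokens!"

# Stage 1: split the text into sentences (untrained Punkt, default parameters).
sentences = PunktSentenceTokenizer().tokenize(text)

# Stage 2: run each word tokenizer over every sentence.
word_tokenizers = {
    "TreebankWordTokenizer": TreebankWordTokenizer(),
    "WordPunctTokenizer": WordPunctTokenizer(),
    "WhitespaceTokenizer": WhitespaceTokenizer(),
}
results = {
    name: [tokenizer.tokenize(sentence) for sentence in sentences]
    for name, tokenizer in word_tokenizers.items()
}

# Compare how each tokenizer treats the first sentence.
for name, tokenized in results.items():
    print(name, tokenized[0])
```

The contraction shows the difference most clearly: the Treebank tokenizer splits "isn't" into "is" and "n't", the WordPunct tokenizer breaks it at the apostrophe, and the whitespace tokenizer keeps it whole while leaving the final period attached to "trivial".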

III. Challenges
The challenges in tokenization depend on the type of language. For example, English and French are referred to as space-delimited, as most words are separated from each other by white space [3]. Languages such as Chinese and Thai are referred to as un-segmented, as words do not have clear boundaries. Tokenizing sentences in un-segmented languages requires extra lexical and morphological information. Tokenization is likewise influenced by the writing system and the typographical structure of the words [4].
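To illustrate why un-segmented text needs lexical information, the sketch below applies greedy forward maximum matching against a small word list. The technique is one common dictionary-based approach and is not taken from the paper; the lexicon, the function name and the English run-together string (standing in for an un-segmented script) are all assumptions.

```python
def max_match(text, lexicon):
    """Greedy forward maximum matching: take the longest lexicon entry at each position."""
    max_len = max(len(word) for word in lexicon)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy lexicon.
LEXICON = {"the", "cat", "cats", "sat", "at"}
unsegmented = "thecatsat"

whitespace_result = unsegmented.split()
lexicon_result = max_match(unsegmented, LEXICON)
print(whitespace_result)
print(lexicon_result)
```

Whitespace splitting returns the whole string as one token. The lexicon-based pass does find words, but the greedy choice yields "the", "cats", "at" rather than "the", "cat", "sat", which shows why a word list alone is not enough and further morphological or contextual information is needed.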
IV. Conclusion
Pre-processing activities play an important role in various application domains. It is therefore concluded that domain-specific applications are more suitable for text analysis. This research paper gives an overview of the tokenization tools that are available as open source. This improves the performance of information retrieval.
