...decommission servers on demand. The shared-disk database architecture is ideally suited to cloud computing: it requires fewer, lower-cost servers, provides high availability, reduces maintenance costs by eliminating partitioning, and delivers dynamic scalability. Benefits of cloud computing: a. Lower costs: all resources are shared, reducing costs. b. Shifting CapEx to OpEx: this lets customers focus their money and resources on adding value and innovating in their areas of competence. c. Agility. d. Dynamic scalability: demand spikes can be absorbed smoothly and efficiently under a more cost-effective pay-as-you-go model. e. Simplified maintenance: patches and upgrades are deployed once across the shared infrastructure. f. Large-scale prototyping/load testing. g. Diverse platform support. h. Faster management approval. i. Faster development. With corporate adoption of cloud computing there has been an explosion of cloud options. One of those options is the provisioning of database services in the form of cloud databases, or Database-as-a-Service (DaaS). Cloud databases originally served consumer applications that prioritized read access, because the ratio of reads to writes was very high. However, this is changing. Nowadays consumer-centric cloud database...
Words: 3040 - Pages: 13
...Big Data Landscape. Shubham Sharma, Banking Product Development Division, Oracle Financial Services Software Ltd.; Bachelor of Technology in Information Technology, Maharishi Markandeshwar Engineering College. Abstract- "Big Data" has become a major source of innovation across enterprises of all sizes. Data is being produced at an ever-increasing rate. This growth in data production is driven by increased use of media, fast-developing organizations, and the proliferation of the web and systems connected to it. Having a lot of data is one thing; being able to store it, analyze it, and visualize it in a real-time environment is a whole different ball game. New technologies are accumulating more data than ever; therefore many organizations are looking for optimal ways to make better use of their data. In a broader sense, organizations analyzing big data need to view data management, analysis, and decision-making in terms of "industrialized" flows and processes rather than discrete stocks of data or events. To handle these aspects of large quantities of data, various open platforms have been developed. Index Terms- Big Data, Landscape, Open Platforms, Technologies, Tools ... nearly 500 exabytes per day. To put the number in perspective, this is equivalent to 5×10^20 bytes per day, almost 200 times more than all the sources in the world combined. Handling this huge volume of data will be hard with existing data management technologies. Hence the technology...
Words: 3643 - Pages: 15
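The arithmetic behind the excerpt's figure is straightforward: one exabyte is 10^18 bytes, so 500 exabytes per day works out to 5×10^20 bytes per day. A one-line Python check of that conversion (the 500 EB/day figure itself is the excerpt's estimate, not a measured value):

```python
EXABYTE = 10 ** 18                 # bytes in one exabyte (SI)
daily_bytes = 500 * EXABYTE        # the excerpt's estimated daily data production
print(f"{daily_bytes:.0e} bytes per day")   # prints 5e+20
```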
...Paper on Big Data and Hadoop. Harshawardhan S. Bhosale1, Prof. Devendra P. Gadekar2. 1 Department of Computer Engineering, JSPM's Imperial College of Engineering & Research, Wagholi, Pune, Bhosale.harshawardhan186@gmail.com; 2 Department of Computer Engineering, JSPM's Imperial College of Engineering & Research, Wagholi, Pune, devendraagadekar84@gmail.com. Abstract: The term 'Big Data' describes innovative techniques and technologies to capture, store, distribute, manage, and analyze petabyte- or larger-sized datasets with high velocity and varied structures. Big data can be structured, unstructured, or semi-structured, which conventional data management methods are unable to handle. Data is generated from many different sources and can arrive in the system at various rates. In order to process these large amounts of data in an inexpensive and efficient way, parallelism is used. Big Data is data whose scale, diversity, and complexity require new architectures, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it. Hadoop is the core platform for structuring Big Data and solves the problem of making it useful for analytics purposes. Hadoop is an open-source software project that enables the distributed processing of large data sets across clusters of commodity servers. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. Keywords - Big Data, Hadoop, Map...
Words: 5034 - Pages: 21
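The excerpt above describes Hadoop's model of processing large data sets in parallel across clusters of commodity servers. The MapReduce idea behind that model can be illustrated with a minimal, single-process Python sketch of word counting; in a real Hadoop cluster the map and reduce functions run as distributed tasks over HDFS blocks rather than over an in-memory list, so this is only a conceptual illustration:

```python
from collections import defaultdict

def map_phase(lines):
    """Emit (word, 1) pairs, as a Hadoop mapper would for each input record."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    """Group values by key, mimicking Hadoop's shuffle-and-sort step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the counts for each word, as a Hadoop reducer would."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data needs parallel processing",
         "hadoop processes big data on commodity servers"]
print(reduce_phase(shuffle(map_phase(lines))))
# {'big': 2, 'data': 2, 'hadoop': 1, ...}
```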
...511 Report – Webcast 8/13/14 on Data Mining. SAS (Statistical Analysis System) was originally developed as a project to analyze agricultural data from 1966 to 1976 at North Carolina State University. As demand for such software grew, SAS Institute was founded in 1976. SAS is a software suite that can mine, alter, manage, and retrieve data from a variety of sources and perform statistical analysis on it. SAS provides a graphical point-and-click user interface for non-technical users, and more advanced options are available through the SAS programming language. On August 13, 2014, SAS sponsored a web seminar titled "Analytically Speaking"; the topic of the webcast was data mining techniques. Michael Berry and Gordon Linoff were the featured speakers; they have written a leading introductory book on data mining titled "Data Mining Techniques." They discussed much of the current data mining landscape, including new methods, new types of data, and the importance of using the right analysis for your problem (as good analysis is wasted doing the wrong thing). They also briefly discussed using 'found data' – text data, social data, and device data. Michael Berry is the Business Intelligence Director at TripAdvisor and co-founder of Data Miners Inc. Gordon Linoff is co-founder of Data Miners Inc. and a consultant to financial, media, and pharmaceutical companies. Data mining is the analysis step of the KDD (Knowledge Discovery in Databases) process. Data mining is an interdisciplinary sub-field...
Words: 818 - Pages: 4
...(limited store size). 7-Eleven's supply uncertainty is medium to high. Supply uncertainty is increased by the high diversity of products, perishable products (e.g. frozen and dairy products), the rate of innovation and number of new products, possible delivery delays due to dense traffic around stores, as well as possible low yields further upstream in the supply chain (2nd- and 3rd-tier suppliers of raw materials, e.g. rice). All these attributes call for a responsive supply chain. Assignment/Question #2: 7-Eleven has a high degree of responsiveness, due to the high implied demand uncertainty and medium-to-high supply uncertainty. It is crucial that their supply chain be able to respond to wide ranges of quantities demanded, meet short lead times, handle a large variety of products, meet a very high service level, and handle supply uncertainty arising from the customer needs mentioned above. Additionally, they are innovative in the sense that they constantly introduce new products, as well as new services, to attract...
Words: 656 - Pages: 3
...HIPI: A Hadoop Image Processing Interface for Image-based MapReduce Tasks. Chris Sweeney, Liu Liu, Sean Arietta, Jason Lawrence, University of Virginia. [Figure 1: A typical MapReduce pipeline using our Hadoop Image Processing Interface with n images, i map nodes, and j reduce nodes.] Abstract: The amount of images being uploaded to the internet is rapidly increasing, with Facebook users uploading over 2.5 billion new photos every month [Facebook 2010]; however, applications that make use of this data are severely lacking. Current computer vision applications use a small number of input images because of the difficulty in acquiring computational resources and storage for large amounts of data [Guo et al. 2005; White et al. 2010]. As such, development of vision applications that use a large set of images has been limited [Ghemawat et al. 2003]. The Hadoop MapReduce platform provides a system for large and computationally intensive distributed processing (Dean, 2004), though use of Hadoop's system is severely limited by the technical complexities of developing useful applications [Ghemawat et al. 2003; White et al. 2010]. To immediately address this, we propose an open-source Hadoop Image Processing Interface (HIPI) that aims to create an interface for computer vision with MapReduce...
Words: 4082 - Pages: 17
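HIPI itself is a Java library built on Hadoop's MapReduce API, so the following is only a schematic Python sketch of the pipeline in Figure 1: cull unwanted images from a bundle, map over each surviving image to extract a feature (here, mean pixel intensity), and reduce the results to a single aggregate. The bundle format, image representation, and feature are simplified stand-ins, not the actual HIPI interface:

```python
# Toy "image bundle": each image is (name, 2D list of grayscale pixel values).
bundle = [
    ("img0", [[0, 50], [100, 150]]),
    ("img1", [[200, 200], [200, 200]]),
    ("tiny", [[255]]),                      # will be culled below
]

def cull(image):
    """Keep only images large enough to be useful (HIPI lets users cull by metadata)."""
    _, pixels = image
    return sum(len(row) for row in pixels) >= 4

def mapper(image):
    """Map step: compute mean intensity for one image."""
    name, pixels = image
    flat = [p for row in pixels for p in row]
    return name, sum(flat) / len(flat)

def reducer(mapped):
    """Reduce step: average the per-image means into one result."""
    means = [m for _, m in mapped]
    return sum(means) / len(means)

kept = [img for img in bundle if cull(img)]
print(reducer([mapper(img) for img in kept]))   # aggregate mean intensity: 137.5
```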
...Apache Hadoop YARN: Yet Another Resource Negotiator. Vinod Kumar Vavilapalli, Mahadev Konar, Siddharth Seth, Arun C. Murthy, Owen O'Malley, Hitesh Shah, Sanjay Radia, Bikas Saha, Eric Baldeschwieler (hortonworks.com); Carlo Curino, Chris Douglas (microsoft.com); Jason Lowe, Robert Evans, Thomas Graves (yahoo-inc.com); Sharad Agarwal (inmobi.com); Benjamin Reed (facebook.com). Abstract: The initial design of Apache Hadoop [1] was tightly focused on running massive MapReduce jobs to process a web crawl. For increasingly diverse companies, Hadoop has become the data and computational agora, the de facto place where data and computational resources are shared and accessed. This broad adoption and ubiquitous usage has stretched the initial design well beyond its intended target, exposing two key shortcomings: 1) tight coupling of a specific programming model with the resource management infrastructure, forcing developers to abuse the MapReduce programming model, and 2) centralized handling of jobs' control flow, which resulted in endless scalability concerns for the scheduler. In this paper, we summarize the design, development, and current state of deployment of the next generation of Hadoop's compute platform: YARN. The new architecture we introduced decouples the programming model from the resource management infrastructure, and delegates many scheduling functions (e.g., task fault-tolerance) to per-application components. We provide experimental...
Words: 12006 - Pages: 49
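The key idea in the abstract above is the split between a global resource manager, which only hands out containers, and per-application masters, which decide how to use them. Below is a minimal Python sketch of that separation; the class and method names are illustrative only, not YARN's actual API:

```python
class ResourceManager:
    """Global allocator: tracks free containers, knows nothing about job logic."""
    def __init__(self, total_containers):
        self.free = total_containers

    def allocate(self, requested):
        granted = min(requested, self.free)
        self.free -= granted
        return granted

    def release(self, count):
        self.free += count

class ApplicationMaster:
    """Per-application component: owns the job's tasks and its scheduling policy."""
    def __init__(self, name, tasks):
        self.name, self.pending = name, list(tasks)

    def run(self, rm):
        while self.pending:
            granted = rm.allocate(len(self.pending))
            if granted == 0:
                break                        # a real AM would wait and ask again
            batch, self.pending = self.pending[:granted], self.pending[granted:]
            print(f"{self.name}: running {batch}")
            rm.release(granted)              # containers freed when the batch finishes

rm = ResourceManager(total_containers=3)
ApplicationMaster("mapreduce-job", ["m1", "m2", "m3", "m4"]).run(rm)
ApplicationMaster("stream-job", ["s1", "s2"]).run(rm)   # another framework, same RM
```

The point of the sketch is that the ResourceManager never sees task names or job logic, which is exactly the decoupling the paper credits for removing the scheduler bottleneck of the original Hadoop JobTracker.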
...Identification Number (LIN) * Submission of returns * Grievance redressal * Combined returns under 8 labour laws * Online portals: * Real-time registration * Payments through 56 accredited banks * Online application process for environmental and forest clearances * 14 government services delivered via eBiz, a single-window online portal * Investor Facilitation Cell established * Dedicated Japan+ Cell established * Consent to Establish/NOC no longer required for new electricity connections * Documents reduced from 7 to 3 for exports and imports * Option to obtain company name and DIN at the time of incorporation * Simplified forms for: * Industrial Licence * Industrial Entrepreneurs Memorandum * Many defence sector dual-use products no longer require licences * Validity of security clearance from Ministry of Home Affairs extended to 3 years * Extended validity for implementing industrial...
Words: 2161 - Pages: 9
...the primary source of performance variation comes from disk I/O and the underlying communication network [1]. In this paper, we explore opportunities to improve the performance of high-performance applications running on emerging cloud platforms. Our contributions are: 1. the quantification and assessment of the performance variation of data-intensive scientific workloads on a small set of homogeneous nodes running Hadoop, and 2. the development of an improved Hadoop scheduler that can improve performance (and potentially scalability) of these applications by leveraging the intrinsic performance variation of the system. Using our enhanced scheduler for data-intensive scientific workloads, we are able to obtain more than a 21% performance gain over the default Hadoop scheduler. I. INTRODUCTION. Certain high-performance applications such as weather prediction or algorithmic trading require the analysis and aggregation of large amounts of data geo-spatially distributed across the world, in a very short amount of time (i.e., on demand). A traditional supercomputer may be neither a practical nor an economical solution because it is not suitable for handling data that is distributed across the...
Words: 7930 - Pages: 32
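The enhanced scheduler described above exploits the fact that nominally identical nodes exhibit measurably different I/O and network performance. The excerpt does not give the paper's actual algorithm, so the sketch below only illustrates the general idea under an assumed heuristic: weight task assignment by each node's observed throughput, so faster nodes receive proportionally more work:

```python
def assign_tasks(tasks, node_throughput):
    """Greedily assign each task to the node with the earliest projected finish time,
    where projected time scales with measured throughput (MB/s). A crude stand-in
    for a variation-aware Hadoop scheduler, not the paper's actual algorithm."""
    assignment = {node: [] for node in node_throughput}

    def projected_finish(node):
        # One more task on this node's queue, divided by its observed throughput.
        return (len(assignment[node]) + 1) / node_throughput[node]

    for task in tasks:
        best = min(node_throughput, key=projected_finish)
        assignment[best].append(task)
    return assignment

# Nominally identical nodes with different measured throughput, as the paper observes.
nodes = {"node-a": 95.0, "node-b": 72.0, "node-c": 51.0}
print(assign_tasks([f"split-{i}" for i in range(12)], nodes))
```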
...implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients. While sharing many of the same goals as previous distributed file systems, our design has been driven by observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system assumptions. This has led us to reexamine traditional choices and explore radically different design points. The file system has successfully met our storage needs. It is widely deployed within Google as the storage platform for the generation and processing of data used by our service as well as research and development efforts that require large data sets. The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, and it is concurrently accessed by hundreds of clients. In this paper, we present file system interface extensions designed to support distributed applications, discuss many aspects of our design, and report measurements from both micro-benchmarks and real world use. We have designed and implemented the Google File System (GFS) to meet the rapidly growing demands of Google’s data processing needs. GFS shares many of the same goals as previous distributed file systems...
Words: 14789 - Pages: 60
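GFS reaches the scale described above by splitting each file into large fixed-size chunks and storing several replicas of every chunk on different chunkservers, with a single master tracking only the mapping from file to chunk locations. The Python sketch below illustrates that placement idea; the 64 MB chunk size and 3-way replication match GFS's commonly cited defaults, but the round-robin placement and data structures are purely illustrative, not GFS's actual protocol:

```python
import itertools

CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB chunks, GFS's commonly cited default
REPLICAS = 3                    # each chunk stored on three different chunkservers

def place_file(file_size, chunkservers):
    """Return a toy chunk-location table: chunk index -> list of chunkservers."""
    num_chunks = -(-file_size // CHUNK_SIZE)          # ceiling division
    rotation = itertools.cycle(chunkservers)          # naive round-robin placement
    table = {}
    for chunk in range(num_chunks):
        table[chunk] = [next(rotation) for _ in range(REPLICAS)]
    return table

servers = [f"cs{i}" for i in range(5)]
for chunk, locations in place_file(200 * 1024 * 1024, servers).items():
    print(chunk, locations)      # 4 chunks, each placed on 3 of the 5 chunkservers
```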
...An Overview of Data Mining Techniques. Excerpted from the book Building Data Mining Applications for CRM by Alex Berson, Stephen Smith, and Kurt Thearling. Introduction: This overview provides a description of some of the most common data mining algorithms in use today. We have broken the discussion into two sections, each with a specific theme: Classical Techniques: Statistics, Neighborhoods and Clustering; Next Generation Techniques: Trees, Networks and Rules. Each section will describe a number of data mining algorithms at a high level, focusing on the "big picture" so that the reader will be able to understand how each algorithm fits into the landscape of data mining techniques. Overall, six broad classes of data mining algorithms are covered. Although there are a number of other algorithms and many variations of the techniques described, one of the algorithms from this group of six is almost always used in real-world deployments of data mining systems. I. Classical Techniques: Statistics, Neighborhoods and Clustering. 1.1. The Classics. These two sections have been broken up based on when the data mining technique was developed and when it became technically mature enough to be used for business, especially for aiding in the optimization of customer relationship management systems. Thus this section contains descriptions of techniques that have classically been used for decades; the next section represents techniques...
Words: 23868 - Pages: 96
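Of the classical techniques the overview groups together, nearest-neighbor ("neighborhood") prediction is the easiest to show concretely: a new record is assigned the outcome of the most similar records already seen. A minimal Python sketch with made-up customer data (the features and labels are invented for illustration, not taken from the book):

```python
from math import dist
from collections import Counter

# Toy labelled records: (age, monthly_spend) -> did the customer respond to an offer?
history = [((25, 40.0), "yes"), ((31, 55.0), "yes"),
           ((47, 12.0), "no"),  ((52, 18.0), "no")]

def knn_predict(record, k=3):
    """Classify `record` by majority vote among its k nearest historical records."""
    neighbors = sorted(history, key=lambda item: dist(item[0], record))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

print(knn_predict((29, 48.0)))   # "yes": the nearest customers responded
```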
...Technical Overview of the Oracle Exadata Database Machine and Exadata Storage Server
Introduction (p. 2)
Exadata Product Family (p. 4)
  Exadata Database Machine (p. 4)
  Exadata Storage Server (p. 8)
Exadata Database Machine Architecture (p. 12)
  Database Server Software (p. 14)
  Exadata Storage Server Software (p. 16)
  Exadata Smart Scan Processing (p. 16)
  Exadata Hybrid Columnar Compression (p. 20)
  I/O Resource Management With Exadata (p. 21)
  Quality of Service (QoS) Management with Exadata (p. 22)
Conclusion (p. 28)...
Words: 10244 - Pages: 41
...How Windows Server 2008 Delivers Business Value. Published: January 2008. © 2008 Microsoft Corporation. All rights reserved. This document is developed prior to the product's release to manufacturing, and as such, we cannot guarantee that all details included herein will be exactly as what is found in the shipping product. The information contained in this document represents the current view of Microsoft Corporation on the issues discussed as of the date of publication. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the date of publication. The information represents the product at the time this document was printed and should be used for planning purposes only. Information subject to change at any time without prior notice. This whitepaper is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY. Microsoft, Active Directory, PowerShell, SharePoint, SoftGrid, Windows, Windows Media, the Windows logo, Windows Vista, and Windows Server are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries. All other trademarks are property of their respective owners. Table of Contents: Table of Contents (ii); Introduction (1); Make Your Infrastructure More Efficient with Virtualization (1); Server Virtualization...
Words: 10609 - Pages: 43
...Top 10 data mining algorithms in plain English. Today, I'm going to explain in plain English the top 10 most influential data mining algorithms as voted on by 3 separate panels in this survey paper. Once you know what they are, how they work, what they do and where you can find them, my hope is you'll have this blog post as a springboard to learn even more about data mining. What are we waiting for? Let's get started! Contents: 1. C4.5 2. k-means 3. Support vector machines 4. Apriori 5. EM 6. PageRank 7. AdaBoost 8. kNN 9. Naive Bayes 10. CART Interesting Resources Now it's your turn… Update 16-May-2015: Thanks to Yuval Merhav and Oliver Keyes for their suggestions which I've incorporated into the post. Update 28-May-2015: Thanks to Dan Steinberg (yes, the CART expert!) for the suggested updates to the CART section which have now been added. 1. C4.5 What does it do? C4.5 constructs a classifier in the form of a decision tree. In order to do this, C4.5 is given a set of data representing things that are already classified. Wait, what's a classifier? A classifier is a tool in data mining that takes a bunch of data representing things we want to classify and attempts to predict which class the new data belongs to. What's an example of this? Sure, suppose a dataset contains a bunch of patients. We know various things about each patient like age, pulse...
Words: 6478 - Pages: 26
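The post above explains that C4.5 builds a decision tree from pre-classified data. The quantity C4.5 uses to choose each split is information gain (reduction in entropy). Here is a small Python sketch of that calculation on an invented patient-style dataset; the attributes and labels are made up for illustration, and the gain-ratio normalisation that full C4.5 applies is omitted:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, labels):
    """Entropy reduction from splitting `rows` on `attribute`
    (C4.5's split criterion before gain-ratio normalisation)."""
    before = entropy(labels)
    after = 0.0
    for value in set(row[attribute] for row in rows):
        subset = [lbl for row, lbl in zip(rows, labels) if row[attribute] == value]
        after += len(subset) / len(labels) * entropy(subset)
    return before - after

# Invented records: does a patient need a follow-up visit?
rows = [{"age": "young", "pulse": "normal"}, {"age": "young", "pulse": "high"},
        {"age": "old",   "pulse": "high"},   {"age": "old",   "pulse": "normal"}]
labels = ["no", "yes", "yes", "no"]

for attr in ("age", "pulse"):
    print(attr, round(information_gain(rows, attr, labels), 3))
# "pulse" has gain 1.0 here, so a C4.5-style tree would split on it first.
```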