Hadoop Cluster Setup
Hadoop is a framework written in Java for running applications on large clusters of commodity hardware and incorporates features similar to those of the Google File System (GFS) and of the MapReduce computing paradigm. Hadoop’s HDFS is a highly fault-tolerant distributed file system and, like Hadoop in general, designed to be deployed on low-cost hardware.

This document describes how to install, configure and manage non-trivial Hadoop clusters ranging from a few nodes to extremely large clusters with thousands of nodes.

Required Software
Required software for Linux and Windows includes:
1. Java 1.6.x, preferably from Sun, must be installed.
2. ssh must be installed and sshd must be running, so that the Hadoop scripts that manage remote Hadoop daemons can be used.

Installation
Installing a Hadoop cluster typically involves unpacking the software on all the machines in the cluster. Typically, one machine in the cluster is designated as the NameNode and another as the JobTracker, exclusively. These are the masters. The rest of the machines in the cluster act as both DataNode and TaskTracker. These are the slaves.
The root of the distribution is referred to as HADOOP_HOME. All machines in the cluster usually have the same HADOOP_HOME path.

Steps for Installation
1. Install Java 1.6
Check the Java version:
$ java -version

2. Add a dedicated user and group
$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser
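To verify the account, a quick check (standard coreutils; the exact output format varies by distribution):
$ id hduser    # should list hadoop among the user's groups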

3. Install ssh
$ su - hduser
Generate a keypair:
$ ssh-keygen -t rsa -P ""
Enable access to the local machine:
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
$ ssh localhost

4. Disable IPv6
$ sudo gedit /etc/sysctl.conf
Add the following lines to the file:
# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
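To apply the change without rebooting and confirm it took effect (a standard sysctl check, not part of the original steps):
$ sudo sysctl -p
$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6    # should print 1 once IPv6 is disabled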

5. Download Hadoop and place it in /usr/local/hadoop
$ cd /usr/local
$ sudo tar xzf hadoop-1.0.3.tar.gz
$ sudo mv hadoop-1.0.3 hadoop
$ sudo chown -R hduser:hadoop hadoop

6. Update .bashrc
# Set Hadoop-related environment variables
export HADOOP_HOME=/usr/local/hadoop

# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop)
export JAVA_HOME=/usr/lib/jvm/java-6-sun

# Some convenient aliases and functions for running Hadoop-related commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"

# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin
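To pick up the new settings in the current shell and confirm them (assuming the tarball was unpacked as in step 5):
$ source ~/.bashrc
$ echo $HADOOP_HOME    # should print /usr/local/hadoop
$ hadoop version       # confirms the bin/ directory is on the PATH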

For the master machine only: copy the ssh public key to the slave systems
$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@<slave-ip>
$ ssh <slave-ip>
# repeat the same for all the slave IPs
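With many slaves, a hypothetical helper loop saves typing; it assumes the slave IPs are listed one per line in a file named slaves.txt, which is not part of the original steps:
$ for ip in $(cat slaves.txt); do ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@$ip; done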

Configuration:
1. Setting JAVA_HOME
Open /usr/local/hadoop/conf/hadoop-env.sh and set JAVA_HOME:
export JAVA_HOME=/usr/lib/jvm/java-6-sun

2. Create the tmp directory
$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp
$ sudo chmod 755 /app/hadoop/tmp

3. Modify conf/masters (only on the master node)

<master-ip>    # use the full IP, don't use localhost or 127.0.0.1

4. Modify conf/slaves

<slave-1-ip>    # use the full IP, don't use localhost or 127.0.0.1
<slave-2-ip>    # use the full IP, don't use localhost or 127.0.0.1
# repeat the same for all slaves, one IP per line

5. Modify conf/core-site.xml (add between the <configuration> tags):
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://<master-ip>:54310</value>
</property>

6. Modify conf/mapred-site.xml (add between the <configuration> tags):
<property>
  <name>mapred.job.tracker</name>
  <value><master-ip>:54311</value>
</property>

7. Modify conf/hdfs-site.xml (add between the <configuration> tags):
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
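The conf/ directory must be identical on every node. One way to push the files out to the slaves, assuming rsync is installed on all machines (a convenience, not part of the original steps):
$ rsync -av /usr/local/hadoop/conf/ hduser@<slave-ip>:/usr/local/hadoop/conf/
# repeat for each slave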

Formatting the HDFS filesystem
HDFS must be formatted before the cluster is started for the first time. The filesystem directory is specified by dfs.name.dir.
$ /usr/local/hadoop/bin/hadoop namenode -format    # only on the master system
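To confirm the format succeeded, you can inspect the name directory. This sketch assumes dfs.name.dir was left at its Hadoop 1.x default of ${hadoop.tmp.dir}/dfs/name:
$ ls /app/hadoop/tmp/dfs/name/current
# a freshly formatted NameNode directory contains fsimage, edits and VERSION files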

Starting the multi-node cluster
Starting the cluster is performed in two steps.
1. We begin with starting the HDFS daemons: the NameNode daemon is started on the master, and DataNode daemons are started on all slaves (here: master and slaves).
2. Then we start the MapReduce daemons: the JobTracker is started on the master, and TaskTracker daemons are started on all slaves (here: master and slaves).

Start the HDFS layer:
hduser@master:/usr/local/hadoop$ bin/start-dfs.sh

On a slave, you can examine the success or failure of this command by inspecting the log file logs/hadoop-hduser-datanode-slave.log.
Java processes on master after starting the HDFS daemons:
hduser@master:/usr/local/hadoop$ jps
14799 NameNode
15314 Jps
14880 DataNode
14977 SecondaryNameNode

Java processes on slave after starting the HDFS daemons:
hduser@slave:/usr/local/hadoop$ jps
15183 DataNode
15616 Jps

Start the MapReduce daemons
## Perform the following only on the master node
hduser@master:/usr/local/hadoop$ bin/start-mapred.sh

Java processes on master after starting the MapReduce daemons:
hduser@master:/usr/local/hadoop$ jps
16017 Jps
14799 NameNode
15686 TaskTracker
14880 DataNode
15596 JobTracker
14977 SecondaryNameNode

Java processes on slave after starting the MapReduce daemons:
hduser@slave:/usr/local/hadoop$ jps
15183 DataNode
15897 TaskTracker
16284 Jps
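A quick sketch for verifying that every expected daemon is up on the master; it assumes jps from the JDK is on the PATH (drop NameNode, SecondaryNameNode and JobTracker from the list when checking a slave):
$ for d in NameNode SecondaryNameNode DataNode JobTracker TaskTracker; do jps | grep -qw "$d" && echo "$d running" || echo "$d NOT running"; done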

Stopping the Cluster:
Like starting the cluster, stopping it is done in two steps; the workflow, however, is the reverse of starting.
1. We begin with stopping the MapReduce daemons: the JobTracker is stopped on the master, and TaskTracker daemons are stopped on all slaves (here: master and slave).
2. Then we stop the HDFS daemons: the NameNode daemon is stopped on the master, and DataNode daemons are stopped on all slaves (here: master and slave).

Stopping the MapReduce daemons:
Run the command bin/stop-mapred.sh on the JobTracker machine. This shuts down the MapReduce cluster by stopping the JobTracker daemon on the machine you ran the command on, along with the TaskTrackers on the machines listed in the conf/slaves file.

Stopping the HDFS layer:
hduser@master:/usr/local/hadoop$ bin/stop-dfs.sh

Java processes on master after stopping the HDFS daemons:
hduser@master:/usr/local/hadoop$ jps
18670 Jps

Java processes on slave after stopping the HDFS daemons:
hduser@slave:/usr/local/hadoop$ jps
18894 Jps

Running a MapReduce Job (WordCount)
# Perform all operations only on the master node
1. Start the Hadoop cluster.

2. Copy data to HDFS
Download eBooks as plain-text files in UTF-8 encoding and store them in a local temporary directory of choice, for example /tmp/gutenberg.
Links:
http://www.gutenberg.org/ebooks/19699
http://www.gutenberg.org/ebooks/132
http://www.gutenberg.org/ebooks/1661

hduser@master:/usr/local/hadoop$ bin/hadoop dfs -copyFromLocal /tmp/gutenberg /user/hduser/gutenberg
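For reference, a command-line sketch of the download and a check that the upload succeeded. The .txt.utf-8 suffix is an assumption about gutenberg.org's plain-text download links; fall back to the pages above if it differs:
$ mkdir -p /tmp/gutenberg && cd /tmp/gutenberg
$ for id in 19699 132 1661; do wget "http://www.gutenberg.org/ebooks/$id.txt.utf-8" -O pg$id.txt; done
hduser@master:/usr/local/hadoop$ bin/hadoop dfs -ls /user/hduser/gutenberg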

3. Run the MapReduce job
hduser@master:/usr/local/hadoop$ bin/hadoop jar hadoop*examples*.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output

(or)

hduser@master:/usr/local/hadoop$ bin/hadoop jar hadoop-examples-1.0.3.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output

4. Check that the result was successfully stored in the HDFS directory /user/hduser/gutenberg-output:
hduser@master:/usr/local/hadoop$ bin/hadoop dfs -ls /user/hduser

5. Copy the result to the local disk
To inspect the output, you can copy it from HDFS to the local file system:
hduser@ubuntu:/usr/local/hadoop$ mkdir /tmp/gutenberg-output
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -getmerge /user/hduser/gutenberg-output /tmp/gutenberg-output
hduser@ubuntu:/usr/local/hadoop$ head /tmp/gutenberg-output/gutenberg-output
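WordCount writes one word<TAB>count pair per line, so the most frequent words can be pulled out with standard tools (a convenience sketch, not part of the original steps):
$ sort -k2,2nr /tmp/gutenberg-output/gutenberg-output | head -20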

Hadoop Web Interfaces

• http://localhost:50070/ – web UI of the NameNode daemon
• http://localhost:50030/ – web UI of the JobTracker daemon
• http://localhost:50060/ – web UI of the TaskTracker daemon
