...ETL_Notes - Pentaho Version

ETL, which stands for Extract, Transform and Load, is the process of moving data from a source to a destination. I use this generic definition because the tools are not specific to data warehousing: ETL tools and processes can be used to migrate data in any context, from data warehousing to data migration during an OLTP system update. The rest of this document focuses on ETL issues in general and on issues specific to Pentaho Kettle. A good resource for Pentaho Kettle is http://wiki.pentaho.com/display/EAI/Spoon+User+Guide

ETL BASICS

Some of the common usages for ETL are:

Merging Data – Data is pulled from multiple sources and merged into one or more destinations.

Data Scrubbing (Format Modification) – Data formats can be changed, e.g. string to int. Additionally, ETL can be used to "scrub" the data, where bad data is "fixed" using fuzzy lookups or other operations that infer a value.

Automation – ETL can be used to automate the movement of data between two locations. This also standardizes the process, so that the load is done the same way in every run.

JOBS, TRANSFORMATIONS, STEPS, and HOPS

In Pentaho, each ETL process is described in one or more TRANSFORMATIONS. A transformation is an entity that contains definitions of how to move data from one or more sources to one or more destinations. Specifically, a transformation contains the following parts:

1. Input(s) – one or more input steps define the source of...
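The notes above stop mid-list, but the three usages they name (merging data, scrubbing/format changes, and automated repeatable loads) can be illustrated with a very small extract-transform-load script. In Pentaho Kettle the same flow would be assembled graphically in Spoon from input, transformation, and output steps; the Python sketch below is only an illustration of the idea, and its file names, table name, and column names (customer, amount) are assumptions made for the example, not details from the notes.

# Illustrative extract -> transform -> load flow covering the three ETL usages above.
# Assumed inputs: CSV files with "customer" and "amount" columns (hypothetical).
import csv
import sqlite3

def extract(paths):
    # Merging Data: pull rows from multiple CSV sources into one stream.
    for path in paths:
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                yield row

def transform(rows):
    # Data Scrubbing / format modification: cast the "amount" string to int,
    # dropping rows that cannot be fixed (a real scrub might infer a value instead).
    for row in rows:
        try:
            yield row["customer"], int(row["amount"])
        except (KeyError, ValueError):
            continue

def load(records, db_path="target.db"):
    # Load the cleaned records into a single destination table.
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount INTEGER)")
    con.executemany("INSERT INTO sales (customer, amount) VALUES (?, ?)", records)
    con.commit()
    con.close()

if __name__ == "__main__":
    # Automation: running this script on a schedule repeats the identical load every run.
    load(transform(extract(["source_a.csv", "source_b.csv"])))

In Kettle terms this would roughly correspond to a CSV file input step, a Select values step for the type change, and a Table output step, connected by hops.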
...White Paper
Big Data Analytics
Extract, Transform, and Load Big Data with Apache Hadoop*

ABSTRACT

Over the last few years, organizations across public and private sectors have made a strategic decision to turn big data into competitive advantage. The challenge of extracting value from big data is similar in many ways to the age-old problem of distilling business intelligence from transactional data. At the heart of this challenge is the process used to extract data from multiple sources, transform it to fit your analytical needs, and load it into a data warehouse for subsequent analysis, a process known as "Extract, Transform & Load" (ETL). The nature of big data requires that the infrastructure for this process can scale cost-effectively. Apache Hadoop* has emerged as the de facto standard for managing big data. This whitepaper examines some of the platform hardware and software considerations in using Hadoop for ETL. We plan to publish other white papers that show how a platform based on Apache Hadoop can be extended to support interactive queries and real-time predictive analytics. When complete, these white papers will be available at http://hadoop.intel.com.

The ETL Bottleneck in Big Data Analytics

Big Data refers to the large amounts, at least terabytes, of poly-structured...
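The excerpt cuts off before the white paper's detail, but the pattern it describes (extract from many sources, transform at scale, load for analysis) is commonly expressed on Hadoop with Hadoop Streaming, where a transform step is a plain script that reads stdin and writes stdout and Hadoop handles the distribution. The sketch below is an illustration under assumed inputs: the log layout (timestamp, user_id, bytes, tab-separated), the HDFS paths, and the jar path are hypothetical, not taken from the paper.

#!/usr/bin/env python3
# mapper.py -- transform stage of a Hadoop Streaming ETL job (illustrative sketch).
# Assumed input layout: tab-separated log lines of  timestamp <TAB> user_id <TAB> bytes
import sys

for line in sys.stdin:
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 3:
        continue                  # drop malformed records during the transform
    timestamp, user_id, raw_bytes = parts
    try:
        nbytes = int(raw_bytes)   # string -> int scrub, as in any ETL format fix
    except ValueError:
        continue
    day = timestamp[:10]          # e.g. "2013-07-01T12:34:56" -> "2013-07-01"
    print(f"{day}\t{nbytes}")

#!/usr/bin/env python3
# reducer.py -- aggregation stage: total bytes per day, ready to load into a warehouse.
# Hadoop Streaming delivers mapper output sorted by key, so a running total suffices.
import sys

current_day, total = None, 0
for line in sys.stdin:
    day, nbytes = line.rstrip("\n").split("\t")
    if day != current_day:
        if current_day is not None:
            print(f"{current_day}\t{total}")
        current_day, total = day, 0
    total += int(nbytes)
if current_day is not None:
    print(f"{current_day}\t{total}")

# Illustrative invocation (all paths are assumptions):
# hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
#   -input /logs/raw -output /logs/daily_bytes \
#   -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py

Because the scripts only touch one record (or one sorted key group) at a time, the same code scales from a single node to a large cluster, which is the cost-effective scaling property the abstract emphasizes.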