Friday, March 21, 2014

Hadoop Technology

The corporate world needs to process petabytes of data efficiently, and it has become expensive to build reliability into each individual application that processes large datasets. Large clusters contain many electronic components, so node failures happen for a variety of reasons; failure is expected rather than exceptional, and the number of nodes in a cluster is not constant. There is therefore a need for a common infrastructure that is efficient, reliable, scalable, and economical.

Hadoop is an Apache open source software framework for distributed processing of large datasets across large clusters of computers, and it also provides parallel storage. Hadoop is a “flexible and available architecture for large scale computation and data processing on a network”. It enables users to store and process large volumes of data. It was originally created by Doug Cutting at Yahoo!, and its primary purpose is to run MapReduce batch programs in parallel on tens to thousands of server nodes. MapReduce is an application module written by a programmer that runs in two phases: first mapping the data (extract) and then reducing it (transform); a small word-count sketch of these two phases appears at the end of this post. Hadoop scales out to large clusters of servers and storage using the Hadoop Distributed File System (HDFS) to manage big data sets and spread them across the servers; a short example of reading a file from HDFS also appears below. Further advantages of Hadoop include affordability (it runs on commodity hardware), open source availability (it can be downloaded freely, for example through Cloudera's distribution), and agility (store any data, run any analysis). The current Apache Hadoop ecosystem also includes a number of related projects such as Apache Hive, HBase, and ZooKeeper.

Here, “large datasets” means terabytes or petabytes of data, while “large clusters” means hundreds or thousands of nodes. Hadoop is designed as a master-slave architecture, and the framework consists of two main layers:
• the MapReduce engine
• the distributed file system (HDFS)

There are several design principles behind Hadoop: the need to process big data, the need to parallelize computation across thousands of nodes, commodity hardware (a large number of low-end, cheap machines working in parallel, in contrast to parallel databases that rely on a small number of high-end machines), automatic parallelization and distribution, fault tolerance and automatic recovery, and a clean and simple programming abstraction. Hadoop is composed of a MapReduce engine and a user-level file system that manages storage resources across the cluster. The preferred operating systems are Windows and Linux, but Hadoop can also work with BSD and OS X.
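To make the two MapReduce phases concrete, here is a minimal word-count sketch using the standard org.apache.hadoop.mapreduce API. The class names (WordCount, TokenizerMapper, IntSumReducer) and the command-line input/output paths are illustrative choices, not anything the post prescribes: the map phase extracts words and emits a (word, 1) pair for each one, and the reduce phase sums the counts per word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: extract words from each input line and emit (word, 1) pairs.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: sum the counts collected for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Input and output locations are taken from the command line,
    // e.g. an HDFS directory of text files and an empty output directory.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Hadoop runs many copies of the mapper in parallel across the cluster, shuffles the intermediate (word, count) pairs by key, and then runs the reducers, which is what lets the same small program scale from one machine to thousands of nodes.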
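Since the post also notes that HDFS spreads a big data set across the servers, here is a short sketch of reading one file back through the standard org.apache.hadoop.fs.FileSystem API. The NameNode address and the file path are hypothetical placeholders; in a real cluster they would come from the site configuration rather than being hard-coded.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical NameNode address; normally supplied by core-site.xml.
    conf.set("fs.defaultFS", "hdfs://namenode:9000");

    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/user/demo/input/sample.txt"); // hypothetical path

    // The file's blocks are stored on different DataNodes, but the client
    // reads them back as a single continuous stream.
    try (BufferedReader reader =
             new BufferedReader(new InputStreamReader(fs.open(file)))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
    fs.close();
  }
}
```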
