MODULE 5 Flashcards
(8 cards)
- List and explain the Important Cloud Platform Capabilities
Cloud platforms offer a variety of capabilities that make them suitable for deploying scalable, reliable, and efficient applications. These capabilities span hardware abstraction, programming support, and service-level integration. Below are the most important cloud platform capabilities as listed in Table 6.1 of the textbook.
- Physical or Virtual Computing Platform (1 Mark)
Cloud platforms provide both physical servers and virtual machines (VMs). Virtual platforms offer isolation, flexibility, and efficient resource utilization by running multiple virtual environments on the same hardware.
- Massive Data Storage Service and Distributed File System (1 Mark)
Clouds offer large-scale storage services that allow users to store and retrieve data efficiently. Distributed file systems mimic traditional local file systems but are built to handle massive datasets across geographically distributed nodes (e.g., GFS, HDFS).
- Massive Database Storage Service (1 Mark)
In addition to file-based storage, clouds provide scalable and semantic data storage through distributed databases. These services are similar to traditional DBMS but designed for horizontal scalability (e.g., BigTable, SimpleDB, SQL Azure).
- Massive Data Processing Methods and Programming Models (1.5 Marks)
Cloud infrastructure provides thousands of computing nodes. Programming models like MapReduce, Dryad, and Twister abstract the complexity of distributed processing, enabling developers to process large data volumes without managing hardware intricacies.
- Workflow and Data Query Language Support (1 Mark)
Cloud providers support workflow programming and query languages that simplify the execution of complex data processing tasks. These features offer abstraction similar to SQL for data management, improving developer productivity.
- Programming Interface and Service Deployment (1.5 Marks)
Clouds expose APIs and support common programming environments like J2EE, Python, .NET, PHP, etc. Web services can be accessed through browsers using Ajax and RESTful interfaces, facilitating cross-platform development and integration.
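As a brief illustration of RESTful access, here is a minimal Python sketch; the endpoint URL, resource path, and payload fields are hypothetical placeholders, not any specific provider's API:

```python
# Minimal sketch of calling a RESTful cloud service over HTTP.
# The base URL, resource path, and payload fields are hypothetical.
import requests

BASE_URL = "https://api.example-cloud.com/v1"  # hypothetical endpoint

# Read a collection of resources (GET).
resp = requests.get(f"{BASE_URL}/instances", timeout=10)
resp.raise_for_status()
print(resp.json())

# Create a resource (POST) with a JSON payload.
resp = requests.post(
    f"{BASE_URL}/instances",
    json={"image": "my-image", "size": "small"},
    timeout=10,
)
print(resp.status_code)
```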
- Runtime Support (1.5 Marks)
Cloud platforms provide built-in support services such as:
Distributed task schedulers
Monitoring services
Lock management systems
These services ensure the stable execution of applications and seamless scaling.
- Support Services (1.5 Marks)
Support services include:
Data services
Computing services
Data parallel execution models (e.g., MapReduce)
These enhance the performance and reliability of applications hosted on the cloud.
Conclusion:
The cloud platform capabilities collectively enable elastic, scalable, and reliable computing environments. They support developers and enterprises in building, deploying, and managing applications without concern for the underlying infrastructure complexities.
- Explain data features and databases
In cloud computing, efficient data storage, processing, and management are essential due to the massive and distributed nature of applications and services. Cloud platforms support various data features and databases to cater to different requirements, including performance, scalability, and ease of access.
- Data Features in Cloud Platforms (5 Marks)
Cloud data platforms provide rich and varied storage and access features to support different data types and workloads:
a) Blobs and Drives
Used in Azure (Blobs) and Amazon S3.
Suitable for storing large binary objects (e.g., images, videos).
Can be attached to compute instances as virtual drives like Azure Drives and Amazon EBS.
Provide fault tolerance and high availability (see the sketch below).
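A short boto3 sketch of blob storage and retrieval; the bucket and object names are hypothetical, and credentials are assumed to be configured in the environment:

```python
# Store and retrieve a binary object ("blob") in Amazon S3 via boto3.
# Bucket and key names are hypothetical examples.
import boto3

s3 = boto3.client("s3")

# Upload a blob (e.g., an image) under a key in a bucket.
s3.put_object(Bucket="example-media-bucket",
              Key="images/logo.png",
              Body=b"...binary image data...")

# Download it again.
obj = s3.get_object(Bucket="example-media-bucket", Key="images/logo.png")
print(len(obj["Body"].read()), "bytes retrieved")
```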
b) DPFS (Data Parallel File Systems)
Includes systems like Google File System (GFS), HDFS, and Cosmos.
Optimized for data processing with compute-data affinity.
Useful in applications using MapReduce models.
c) Workflow and Query Language Support
Supports SQL-like languages and custom workflow languages for data manipulation.
Simplifies cloud programming and makes it similar to traditional software development.
d) Queuing Services
Offered by Amazon and Azure for reliable communication between application components.
RESTful interface and support for fault-tolerant message delivery (see the sketch below).
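For instance, a hedged boto3 sketch of queue-based messaging between two application components; the queue name is a hypothetical example:

```python
# Reliable component-to-component messaging with Amazon SQS via boto3.
# The queue name is a hypothetical example.
import boto3

sqs = boto3.client("sqs")
queue_url = sqs.create_queue(QueueName="example-task-queue")["QueueUrl"]

# Producer side: enqueue a work item.
sqs.send_message(QueueUrl=queue_url, MessageBody="process order 42")

# Consumer side: receive, handle, then delete the message.
reply = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1)
for msg in reply.get("Messages", []):
    print("got:", msg["Body"])
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```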
- Databases in the Cloud (5 Marks)
Cloud platforms support both traditional relational databases and modern non-relational (NoSQL) systems to meet different application demands.
a) SQL and Relational Databases
Available on Amazon RDS and SQL Azure.
Useful for structured data with consistency and transactional support.
Often deployed as SQL-as-a-Service on independent VMs.
b) Table-Based and NoSQL Databases
Include Google BigTable, Amazon SimpleDB, and Azure Table.
Provide scalable, schema-free data storage suitable for large datasets.
Support “document stores” and key-value storage.
Useful for applications requiring horizontal scalability and semi-structured data processing.
c) Triple Stores and Semantic Databases
Used for metadata and semantic web applications.
Scalable implementations based on MapReduce and tables (e.g., RDF triple stores).
Examples include HBase (an open-source implementation of the BigTable model) and CouchDB.
- Illustrative Storage Systems in Cloud Platforms
Google GFS – distributed file system; high-bandwidth access through a file-system-style (non-POSIX) API.
Hadoop HDFS – open-source counterpart of GFS; Java-based and used throughout the Apache ecosystem.
Amazon S3 + EBS – object plus block storage; remote object access combined with virtual disk support for compute instances.
Conclusion:
Cloud platforms offer advanced data features such as blob storage, distributed file systems, and scalable database services. These capabilities support both structured and unstructured data, providing flexibility for a wide range of applications and development models.
- Explain MapReduce framework with neat diagram.
MapReduce is a powerful framework for processing large datasets using distributed and parallel computing. It was developed by Google and is widely used in big data applications. The model abstracts the complexity of parallel programming by providing two fundamental operations: Map and Reduce.
- Definition and Purpose (1 Mark)
MapReduce is a programming model and processing technique for handling large-scale data by distributing it across many nodes and processing it in parallel. It simplifies data processing tasks such as filtering, sorting, and aggregation.
- Key Components (2 Marks)
Map Function: Takes an input key-value pair and produces a list of intermediate key-value pairs.
Reduce Function: Accepts intermediate keys and a list of values for each key, aggregates them, and produces the final output.
Master Node: Manages task distribution and monitors progress.
Worker Nodes: Execute the actual Map and Reduce tasks.
- Steps in MapReduce Framework (4 Marks)
The MapReduce framework follows a systematic execution process (a small single-machine sketch follows these steps):
Data Partitioning: Input data is divided into chunks and distributed to map workers.
Map Phase: Each map worker processes its chunk and emits intermediate (key, value) pairs.
Combiner Function (Optional): Aggregates intermediate data locally at each map node to reduce network traffic.
Partitioning: Intermediate data is divided based on keys using a partitioning function (e.g., hash(key) mod R) to assign data to appropriate reduce workers.
Shuffling and Sorting: Data is transferred to reduce workers and sorted by key.
Reduce Phase: Reduce workers process each group of values associated with a key to produce the final result.
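A small single-machine simulation of these phases for the classic word-count task; the helper names are illustrative and the code is a sketch of the model, not a distributed implementation:

```python
# Simulate the MapReduce phases: map -> partition/shuffle -> sort -> reduce.
from collections import defaultdict

R = 3  # number of reduce tasks

def map_fn(_key, line):
    """Map: emit (word, 1) for every word in one input split (a line)."""
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    """Reduce: sum the counts emitted for the same word."""
    return word, sum(counts)

def partition(key):
    """Partitioning function: hash(key) mod R picks the reduce task."""
    return hash(key) % R

def map_reduce(lines):
    # Map phase: every split produces intermediate (key, value) pairs,
    # which are shuffled into R partitions keyed by partition(key).
    partitions = [defaultdict(list) for _ in range(R)]
    for i, line in enumerate(lines):
        for key, value in map_fn(i, line):
            partitions[partition(key)][key].append(value)

    # Reduce phase: each reduce task sorts its keys and aggregates values.
    results = {}
    for part in partitions:
        for key in sorted(part):
            k, v = reduce_fn(key, part[key])
            results[k] = v
    return results

print(map_reduce(["the cat sat", "the dog sat"]))
# e.g. {'cat': 1, 'sat': 2, 'the': 2, 'dog': 1} (key order may vary)
```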
- Advantages (1 Mark)
Fault Tolerance: Automatic re-execution on failure.
Scalability: Handles petabytes of data across thousands of nodes.
Simplicity: Developers focus only on Map and Reduce logic.
Conclusion:
The MapReduce framework abstracts the complexity of parallel and distributed data processing. By breaking tasks into simple Map and Reduce operations and managing data flow through partitioning, shuffling, and sorting, it provides an efficient and scalable solution for big data applications.
- Explain Hadoop library from Apache with architectural diagram.
Hadoop is an open-source framework from the Apache Software Foundation designed for scalable and fault-tolerant distributed computing. It provides tools to store and process massive datasets using a distributed architecture, primarily composed of HDFS for storage and MapReduce for computation.
- Overview of Hadoop (1 Mark)
Hadoop is implemented in Java and enables users to write and run applications that process large data volumes across clusters of commodity hardware. It is based on the MapReduce programming model and the Hadoop Distributed File System (HDFS) for data storage.
- Hadoop Core Layers (2 Marks)
Hadoop has two fundamental layers:
HDFS (Hadoop Distributed File System):
Manages data storage across nodes.
Splits files into large blocks (e.g., 64 MB) and stores them redundantly across DataNodes.
NameNode: The master node, manages metadata and namespace.
DataNode: Slave nodes storing actual data blocks.
MapReduce Engine:
The computation engine that runs on top of HDFS.
Consists of a JobTracker (master) and TaskTrackers (slaves).
Responsible for distributing map and reduce tasks and monitoring their execution.
Legend:
JobTracker and NameNode manage task execution and file storage respectively.
TaskTrackers and DataNodes are distributed across different physical nodes/racks.
Map and Reduce tasks are executed by TaskTrackers based on data locality.
- Execution Flow (2 Marks)
The execution of a job in Hadoop follows these steps:
Job Submission: A user submits a job using the runJob() function with configuration parameters.
Input Splitting: Input data is split into blocks and assigned to map tasks.
Task Assignment: JobTracker assigns map/reduce tasks to TaskTrackers.
Execution and Monitoring: TaskTrackers execute tasks, report status to JobTracker.
Result Storage: Output is written back to HDFS (a minimal streaming example follows).
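One common way to run such a job from Python is Hadoop Streaming, where the mapper and reducer are plain scripts that read lines from stdin and write tab-separated key-value pairs to stdout. A minimal word-count sketch (script names are illustrative):

```python
# mapper.py -- Hadoop Streaming mapper: emit "word<TAB>1" for each word.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
# reducer.py -- Hadoop Streaming reducer: input arrives sorted by key,
# so counts for the same word are contiguous and can be summed in one pass.
import sys

current, count = None, 0
for line in sys.stdin:
    line = line.rstrip("\n")
    if not line:
        continue
    word, value = line.split("\t", 1)
    if word != current:
        if current is not None:
            print(f"{current}\t{count}")
        current, count = word, 0
    count += int(value)
if current is not None:
    print(f"{current}\t{count}")
```

The job would then be submitted with the hadoop-streaming JAR, pointing -mapper and -reducer at these scripts and -input/-output at HDFS paths.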
- Key Characteristics (2 Marks)
Fault Tolerance: Data blocks are replicated across nodes; failed tasks are re-executed.
Scalability: Easily scales to thousands of nodes.
High Throughput: Optimized for batch processing of large data sets.
Open Source and Extensible: Can integrate with tools like Pig, Hive, HBase.
Conclusion:
Apache Hadoop provides a robust framework for handling big data with efficient storage and distributed processing. Its core architecture—MapReduce engine over HDFS—ensures scalability, fault tolerance, and high availability in data-intensive applications.
- Explain Amazon EC2 execution environment.
Amazon EC2 (Elastic Compute Cloud) is a foundational IaaS (Infrastructure-as-a-Service) offering from Amazon Web Services (AWS). It provides resizable compute capacity in the cloud, allowing users to run virtual machines (instances) on demand with full control over configuration, software, and networking.
- Overview and Objective (1 Mark)
Amazon EC2 enables users to launch virtual servers using Amazon Machine Images (AMIs) and pay only for the compute time consumed. It supports dynamic scaling, making it ideal for varying workload demands.
- Key Components of EC2 Execution Environment (3 Marks)
a) Amazon Machine Images (AMIs):
Pre-configured templates containing OS and software.
Types: Public, Private, and Paid AMIs.
Users can create custom AMIs as per their needs.
b) Virtualization Layer:
EC2 runs on a Xen-based virtualization environment.
Separates physical resources into virtual compute, storage, and network units.
c) Instance Types:
EC2 provides various instance types to support diverse workloads:
Standard Instances – general-purpose.
High-memory Instances – for memory-intensive applications.
High-CPU Instances – optimized for compute-heavy tasks.
Cluster Compute Instances – for high-performance computing with high-speed networking.
- VM Deployment Workflow (2 Marks)
A user typically follows this sequence to launch an EC2 instance (a hedged boto3 sketch follows the list):
Create an AMI
Create Key Pair
Configure Firewall Rules
Launch Instance
Assign Elastic IP (Optional)
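A hedged boto3 sketch of the same workflow; the AMI ID, key pair, and security group names are hypothetical placeholders:

```python
# Launch an EC2 instance and attach an Elastic IP with boto3.
# All identifiers below are hypothetical placeholders.
import boto3

ec2 = boto3.client("ec2")

resp = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",    # chosen AMI
    InstanceType="t2.micro",            # instance type
    KeyName="example-keypair",          # previously created key pair
    SecurityGroups=["example-web-sg"],  # firewall rules
    MinCount=1,
    MaxCount=1,
)
instance_id = resp["Instances"][0]["InstanceId"]
print("launched", instance_id)

# Optional: allocate an Elastic IP and associate it with the instance.
alloc = ec2.allocate_address(Domain="vpc")
ec2.associate_address(InstanceId=instance_id,
                      AllocationId=alloc["AllocationId"])
```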
- Execution Flexibility (2 Marks)
Elasticity: Users can scale up or down based on demand.
Persistence: Instances can use Elastic Block Store (EBS) for persistent storage.
Auto-scaling and Load Balancing: EC2 integrates with CloudWatch to monitor and scale instances automatically.
- Instance Performance Metrics (1 Mark)
Amazon uses EC2 Compute Units (ECUs) to define performance. One ECU equals the CPU capacity of a 1.0–1.2 GHz 2007 Opteron or Xeon processor. Instance types vary in cores, memory, and storage capacity, as shown in Table 6.13.
Conclusion:
Amazon EC2 provides a flexible, secure, and cost-effective environment for deploying cloud applications. It offers rich customization, scalability, and integration with AWS services like S3 and EBS, making it a preferred choice for developers and enterprises.
- Explain the control flow implementation for MapReduce functionalities.
The control flow in the MapReduce framework refers to the sequence of execution steps and coordination between different components (master, map workers, and reduce workers) from input processing to output generation. It ensures the orderly handling of data splits, task scheduling, communication, and result consolidation.
- Framework Overview (1 Mark)
MapReduce is based on a master-worker model where the master node handles task coordination, and worker nodes perform the actual map and reduce tasks. The user provides custom Map and Reduce functions that are called during execution via a centralized control mechanism.
- Initialization and Task Forking (2 Marks)
The user initializes a Spec object which defines input/output file locations, Map and Reduce function names, and parameters.
The user program calls MapReduce(Spec, &Results).
The MapReduce library forks the program; one copy becomes the master and the others become workers.
- Map Task Assignment and Execution (2 Marks)
The master partitions the input file and assigns chunks to map workers.
Each map worker reads its assigned input split, executes the Map() function, and emits intermediate (key, value) pairs.
An optional Combiner function may locally aggregate values to reduce communication cost.
- Partitioning and Synchronization (2 Marks)
The intermediate results are partitioned using a partitioning function (e.g., hash(key) mod R), distributing them to the appropriate reduce tasks.
The master coordinates synchronization so that reduce tasks begin only after all map tasks complete.
- Reduce Phase and Output Generation (2 Marks)
Reduce workers retrieve and sort intermediate data by key.
The Reduce() function is applied to each group of values sharing the same key.
Final output is written to disk in a user-defined format (an end-to-end sketch of this flow follows).
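An end-to-end, single-machine sketch of this control flow; the Spec class, helper names, and the use of multiprocessing are purely illustrative stand-ins for the library's actual mechanism:

```python
# Toy control-flow sketch: build a Spec, call map_reduce(spec); the "master"
# forks map workers, waits for them all (synchronization), partitions the
# intermediate pairs with hash(key) mod R, then forks the reduce workers.
from collections import defaultdict
from multiprocessing import Pool

class Spec:
    """Stand-in for the user-supplied specification object."""
    def __init__(self, splits, map_fn, reduce_fn, num_reducers):
        self.splits = splits          # input chunks, one per map task
        self.map_fn = map_fn          # user Map(): chunk -> [(k, v), ...]
        self.reduce_fn = reduce_fn    # user Reduce(): (k, [v, ...]) -> (k, out)
        self.num_reducers = num_reducers

def _run_map(args):
    map_fn, chunk = args
    return map_fn(chunk)

def _run_reduce(args):
    reduce_fn, bucket = args
    return [reduce_fn(k, vs) for k, vs in bucket.items()]

def map_reduce(spec):
    # Map phase: Pool.map() returns only when every map task has finished,
    # acting as the barrier before the reduce phase begins.
    with Pool() as pool:
        intermediate = pool.map(_run_map, [(spec.map_fn, c) for c in spec.splits])

    # Partition intermediate pairs to reduce tasks with hash(key) mod R.
    buckets = [defaultdict(list) for _ in range(spec.num_reducers)]
    for pairs in intermediate:
        for k, v in pairs:
            buckets[hash(k) % spec.num_reducers][k].append(v)

    # Reduce phase: fork reduce workers and collect the final output.
    with Pool() as pool:
        reduced = pool.map(_run_reduce, [(spec.reduce_fn, b) for b in buckets])
    return [kv for part in reduced for kv in part]

# Example user functions (top-level so they can be shipped to workers).
def count_map(chunk):
    return [(w, 1) for w in chunk.split()]

def count_reduce(key, values):
    return key, sum(values)

if __name__ == "__main__":
    spec = Spec(["the cat sat", "the dog sat"], count_map, count_reduce, 2)
    print(map_reduce(spec))
```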
- Explain the Dryad control and data flow with diagram.
Dryad is a distributed execution engine developed by Microsoft for running data-parallel applications on large clusters. Unlike MapReduce, which enforces a rigid flow (Map → Reduce), Dryad supports arbitrary directed acyclic graph (DAG) structures, providing greater flexibility in expressing complex computations.
- Overview (1 Mark)
Dryad allows users to represent applications as DAGs where each vertex is a computation and each edge is a communication channel. This allows flexible control over computation and data flow in distributed environments.
- Dryad Control Flow (2 Marks)
The Job Manager constructs the DAG using the application’s logic.
It queries the Name Server to get resource availability in the cluster.
The DAG is mapped to the underlying physical nodes based on data locality and resource availability.
A daemon runs on each node to receive the assigned tasks and monitor execution status.
Control decisions (scheduling, deployment, monitoring) are made by the Job Manager, while actual data transfer is handled independently to avoid bottlenecks.
- Dryad Data Flow (2 Marks)
The edges in the DAG represent data channels (e.g., files, TCP, shared memory).
Each vertex executes a program that reads from input channels and writes to output channels.
Communication is asynchronous, allowing parallel data streaming.
The system supports dynamic graph updates such as adding vertices or merging graphs during runtime.
- Dryad Job Execution Example (2 Marks)
An example 2D pipe execution flow could be:
Grep → Sort → Awk
Grep → Sort → Perl
Grep → Sort → Sed
Each stage (e.g., “Sort”) has multiple parallel vertices. Data flows between them via channels, while control is managed separately. Dryad handles fault tolerance by re-executing failed vertices or regenerating broken channels.
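A minimal sketch of how such a graph could be modeled and run in dependency order; the data structures and vertex programs below are purely illustrative, not Dryad's actual API:

```python
# Tiny Dryad-style illustration: vertices are programs, edges are data
# channels, and execution follows the DAG's dependency (topological) order.
from graphlib import TopologicalSorter  # Python 3.9+

# Each vertex maps to the set of vertices whose output it consumes.
dag = {
    "grep-0": set(), "grep-1": set(), "grep-2": set(),
    "sort-0": {"grep-0"}, "sort-1": {"grep-1"}, "sort-2": {"grep-2"},
    "awk": {"sort-0"}, "perl": {"sort-1"}, "sed": {"sort-2"},
}

def run_vertex(name, inputs):
    """Stand-in for executing one vertex program over its input channels."""
    return f"{name}({', '.join(inputs) or 'raw input'})"

channels = {}  # vertex name -> data written to its output channel
for vertex in TopologicalSorter(dag).static_order():
    upstream = [channels[dep] for dep in dag[vertex]]
    channels[vertex] = run_vertex(vertex, upstream)
    print("executed", vertex)
```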
- Fault Tolerance and Dynamic Scheduling (1 Mark)
Dryad supports two types of failures:
Vertex failure: The job manager reassigns the task to another node.
Channel failure: The edge is regenerated by re-running the source vertex.
It also supports graph refinement during runtime, adapting execution for performance gains.
Conclusion:
Dryad’s separation of control and data flow, use of DAGs, and flexible channel-based communication make it a powerful platform for distributed data-parallel applications. It supports complex workflows and ensures efficient scheduling, scalability, and resilience in cloud environments.
- Explain Eucalyptus architecture for VM image management
Eucalyptus (Elastic Utility Computing Architecture for Linking Your Programs to Useful Systems) is an open-source cloud computing platform that provides Infrastructure as a Service (IaaS) for building private and hybrid clouds. It emulates Amazon Web Services (AWS) and allows users to manage virtual machine (VM) images and infrastructure resources effectively.
- Overview of Eucalyptus Architecture (1.5 Marks)
Eucalyptus architecture is composed of a set of modular components that provide cloud services. These components include the Cloud Controller (CLC), Cluster Controller (CC), Storage Controller (SC), Node Controller (NC), and Walrus (a storage service analogous to Amazon S3).
It is AWS-compatible and supports EC2-style VM management and S3-style storage via Walrus.
Users can upload, register, and launch VMs, making it ideal for academic and enterprise environments.
- Key Components of VM Image Management (3 Marks)
a) Walrus Storage System:
Functions like AWS S3 and is used to store VM images.
Users can bundle root file systems and upload them into Walrus using defined buckets.
b) Image Registration:
Uploaded images are registered and linked with kernel and RAM disk images.
Enables users to create and deploy virtual appliances for specific use cases.
c) Instance Management:
Once an image is registered, it can be launched across any available Node Controller (NC) in any availability zone.
Images can be used repeatedly and managed using the Eucalyptus user interface or API (see the sketch below).
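Because Eucalyptus exposes EC2-compatible APIs, a standard EC2 client can simply be pointed at the private cloud's endpoint. A hedged boto3 sketch follows; the endpoint URL, credentials, and emi- image ID are hypothetical placeholders:

```python
# Manage a private Eucalyptus cloud through its EC2-compatible API.
# Endpoint, credentials, and image ID are hypothetical placeholders.
import boto3

ec2 = boto3.client(
    "ec2",
    endpoint_url="https://eucalyptus.example.edu:8773/services/compute",
    aws_access_key_id="EXAMPLE_ACCESS_KEY",
    aws_secret_access_key="EXAMPLE_SECRET_KEY",
    region_name="eucalyptus",
)

# List images registered in Walrus (Eucalyptus Machine Images, "emi-...").
for image in ec2.describe_images(Owners=["self"])["Images"]:
    print(image["ImageId"], image.get("Name"))

# Launch an instance from a registered image on an available Node Controller.
ec2.run_instances(ImageId="emi-0abc1234", InstanceType="m1.small",
                  MinCount=1, MaxCount=1)
```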
- Eucalyptus Control Hierarchy (1.5 Marks)
CLC (Cloud Controller) – entry point for users and admins; handles authentication and scheduling.
CC (Cluster Controller) – manages node resources within a cluster.
SC (Storage Controller) – manages storage volumes and snapshots.
NC (Node Controller) – hosts and executes VM instances.
Walrus – stores and retrieves VM images and user data.
- Advantages of Eucalyptus for VM Image Management (1.5 Marks)
Supports custom virtual appliances for user-defined applications.
Fully compatible with AWS EC2/S3 APIs, allowing easy transition to/from AWS.
Enables on-premise private cloud development.
Provides fine-grained control over VM provisioning, networking, and storage.