Top 50 Data Architecture Questions Flashcards
What is big data?
Big Data is a term for datasets so large and complex that a traditional relational database cannot handle them, which is why special tools and methods are used to perform operations on them. Big data enables companies to understand their business better and helps them derive meaningful information from the unstructured, raw data collected on a regular basis. It also allows companies to make better, data-backed business decisions.
What are the five V’s of Big Data?
Volume – Volume is the amount of data, which is growing at a high rate; big data volumes are typically measured in petabytes.
Velocity – Velocity is the rate at which data is generated and grows. Social media plays a major role in the velocity of growing data.
Variety – Variety refers to the different data types and formats, such as text, audio, and video.
Veracity – Veracity refers to the uncertainty of available data, which arises because high volumes of data often bring incompleteness and inconsistency.
Value – Value refers to the ability to turn data into business value; by converting collected big data into actionable insights, businesses can generate revenue.
Tell us how big data and Hadoop are related to each other.
Big data and Hadoop are almost synonymous terms. With the rise of big data, Hadoop, a framework that specializes in big data operations, also became popular. Professionals use the framework to analyze big data and help businesses make data-driven decisions.
How is big data analysis helpful in increasing business revenue?
Big data analysis helps increase business revenue through techniques such as sentiment analysis, predictive analytics, and prescriptive analytics, which turn raw data into insights about customers and operations.
Define the respective components of HDFS and YARN
The two main components of HDFS are –
NameNode – This is the master node; it processes and stores the metadata information for the data blocks within HDFS.
DataNode/Slave node – This is the slave node that stores the actual data blocks and serves them for processing, as directed by the NameNode.
The two main components of YARN are –
ResourceManager – This component receives processing requests and allocates resources to the respective NodeManagers according to the processing needs.
NodeManager – It executes tasks on every single DataNode and manages the containers running there.
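As a quick check of both layers, the standard admin commands below list the DataNodes registered with the NameNode and the NodeManagers registered with the ResourceManager:

$ hdfs dfsadmin -report
$ yarn node -list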
Why is Hadoop used for Big Data Analytics?
Since data analysis has become one of the key parameters of business, enterprises are dealing with massive amounts of structured, unstructured, and semi-structured data. Analyzing unstructured data is quite difficult, and this is where Hadoop plays a major part with its capabilities of
Storage
Processing
Data collection
Moreover, Hadoop is open source and runs on commodity hardware, which makes it a cost-effective solution for businesses.
What is fsck?
fsck stands for File System Check. It is an HDFS command used to check for inconsistencies and problems in files; for example, missing blocks for a file are reported through this command. Note that unlike the traditional Linux fsck, HDFS fsck only reports issues; it does not repair them.
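A minimal usage sketch, assuming an example path of /user/data:

$ hdfs fsck /user/data -files -blocks -locations

The -files, -blocks, and -locations flags print per-file status, the list of blocks, and the DataNodes holding each block, respectively.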
What are the main differences between NAS (Network-attached storage) and HDFS?
HDFS runs on a cluster of machines while NAS runs on an individual machine. HDFS replicates each data block across several machines for fault tolerance, so data is stored redundantly by design; NAS has no comparable replication protocol, so the data is far less redundant.
In HDFS, data is stored as data blocks on the local drives of the cluster machines. In NAS, data is stored on dedicated hardware.
What is the command to format the NameNode?
$ hdfs namenode -format
Do you have any Big Data experience? If so, please share it with us.
There is no specific answer to this question, as it is subjective and depends on your previous experience. By asking it during a big data interview, the interviewer wants to understand your previous experience and evaluate whether you fit the project requirements.
So, how will you approach the question? If you have previous experience, start with your duties in your past position and slowly add details to the conversation. Tell them about the contributions that made the project successful. This is generally the 2nd or 3rd question asked in an interview, and later questions build on it, so answer it carefully. Also take care not to go overboard with a single aspect of your previous job. Keep it simple and to the point.
Do you prefer good data or good models? Why?
The two are not mutually exclusive: good models depend on good data. A strong answer notes that better data usually beats a more sophisticated model trained on poor data, so invest in data quality first and then iterate on the model.
Will you optimize algorithms or code to make them run faster?
Again, these are not mutually exclusive: write correct code first, then profile it. Algorithmic improvements usually yield far larger speedups than low-level code tweaks, so optimize the algorithm before micro-optimizing the code.
How do you approach data preparation?
As you already know, data preparation is required to obtain the data that can then be used for modeling, and you should convey this message to the interviewer. You should also emphasize the type of model you are going to use and the reasons behind choosing that particular model. Last but not least, discuss important data preparation steps such as transforming variables, handling outlier values, dealing with unstructured data, and identifying gaps.
How would you transform unstructured data into structured data?
Unstructured data is very common in big data and must be transformed into structured data to enable proper analysis. You can start answering the question by briefly differentiating between the two forms. Then discuss the methods you use to transform one into the other, and share a real-world situation where you did so. If you have graduated recently, you can draw on your academic projects instead.
Which hardware configuration is most beneficial for Hadoop jobs?
Dual-processor or dual-core machines with 4 to 8 GB of RAM and ECC memory are ideal for running Hadoop operations. However, the hardware configuration varies with the project-specific workflow and process flow, and needs to be customized accordingly.
What happens when two users try to access the same file in the HDFS?
HDFS supports exclusive writes only. The NameNode grants the write lease on a file to the first user; a second user who tries to open the same file for writing is rejected. (Concurrent reads, however, are allowed.)
How to recover a NameNode when it is down?
Use the FsImage, the file system metadata replica, to start a new NameNode.
Configure the DataNodes and the clients so that they acknowledge the newly started NameNode.
Once the new NameNode has finished loading the last checkpoint FsImage and has received enough block reports from the DataNodes, it will leave safe mode and start serving clients.
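If a recent checkpoint exists in the configured checkpoint directory, the new NameNode can alternatively be started with the -importCheckpoint option, which loads the image from that directory:

$ hdfs namenode -importCheckpoint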
What do you understand by Rack Awareness in Hadoop?
It is an algorithm applied to the NameNode to decide how blocks and their replicas are placed. Depending on the rack definitions, network traffic between DataNodes is kept within the same rack wherever possible. For example, with a replication factor of 3, two copies are placed on one rack and the third copy on a separate rack.
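The rack assignments the NameNode is actually using can be checked with:

$ hdfs dfsadmin -printTopology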
What is the difference between “HDFS Block” and “Input Split”?
HDFS Block is the physical division of data: HDFS splits the input data into fixed-size blocks for storage and processing.
Input Split is the logical division of data, produced by the InputFormat, that defines the unit of work for a single mapper; see the sketch below.
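To make the distinction concrete, here is a minimal driver sketch showing that the logical split size can be tuned independently of the physical block size; SplitSizeDemo is an illustrative name, not part of any Hadoop distribution:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-size-demo");
        // Cap each Input Split at 64 MB even if the HDFS block size is 128 MB;
        // the physical blocks on disk are unchanged, only the per-mapper unit of work shrinks.
        FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
    }
}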
Explain the different modes in which Hadoop runs.
Standalone (Local) Mode – By default, Hadoop runs in local mode, i.e. on a single, non-distributed node. This mode uses the local file system for input and output operations instead of HDFS, which makes it convenient for debugging. No custom configuration is needed for the configuration files in this mode.
Pseudo-Distributed Mode – In pseudo-distributed mode, Hadoop also runs on a single node, but each daemon runs in its own Java process. Since all the daemons run on a single node, the same node acts as both master and slave.
Fully-Distributed Mode – In fully-distributed mode, all the daemons run on separate individual nodes and thus form a multi-node cluster, with distinct master and slave nodes.
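A quick way to check which mode an installation is configured for is to query fs.defaultFS; it typically resolves to file:/// in standalone mode and to an hdfs:// URI in pseudo- or fully-distributed mode:

$ hdfs getconf -confKey fs.defaultFS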
What are the Hadoop components?
The three core Hadoop components are HDFS (storage), YARN (resource management), and MapReduce (processing).
What are the configuration parameters in a “MapReduce” program?
Input locations of Jobs in the distributed file system
Output location of Jobs in the distributed file system
The input format of data
The output format of data
The class which contains the map function
The class which contains the reduce function
JAR file which contains the mapper, reducer and the driver classes
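These parameters map directly onto job setup calls in a MapReduce driver. Below is a minimal sketch of a classic word-count driver; the class name and the TokenizerMapper/IntSumReducer mapper and reducer classes are illustrative and assumed to be defined elsewhere, as in the standard WordCount example:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);         // JAR containing mapper, reducer, and driver
        job.setMapperClass(TokenizerMapper.class);        // class with the map function
        job.setReducerClass(IntSumReducer.class);         // class with the reduce function
        job.setInputFormatClass(TextInputFormat.class);   // input format of data
        job.setOutputFormatClass(TextOutputFormat.class); // output format of data
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input location in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output location in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The job would then be launched with something like: hadoop jar wordcount.jar WordCountDriver <input> <output>.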
What is a block in HDFS and what is its default size in Hadoop 1 and Hadoop 2? Can we change the block size?
Blocks are the smallest continuous units of data storage on a drive. In HDFS, files are broken into blocks that are stored and replicated across the Hadoop cluster.
The default block size in Hadoop 1 is: 64 MB
The default block size in Hadoop 2 is: 128 MB
Yes, we can change the block size by using the dfs.blocksize parameter (called dfs.block.size in older releases) in the hdfs-site.xml file.
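Since the block size is a per-file property fixed at write time, it can also be overridden for a single upload without editing hdfs-site.xml; the file name and target path below are only examples:

$ hadoop fs -D dfs.blocksize=268435456 -put largefile.csv /data/

Here 268435456 bytes equals 256 MB; files already in HDFS keep the block size they were written with.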
Explain JobTracker in Hadoop
JobTracker is a JVM process in Hadoop 1 used to submit and track MapReduce jobs. (In Hadoop 2, its responsibilities are split between the YARN ResourceManager and a per-job ApplicationMaster.)
JobTracker performs the following activities in Hadoop in a sequence –
JobTracker receives the jobs that a client application submits.
JobTracker contacts the NameNode to determine the DataNodes that hold the input data.
JobTracker allocates TaskTracker nodes based on available slots.
It submits the work to the allocated TaskTracker nodes.
JobTracker monitors the TaskTracker nodes.
When a task fails, JobTracker is notified and decides how to reallocate the task.