7. Big Data - Hadoop Ecosystem Flashcards
DataNode function
Stores the actual file data (blocks)
NameNode function
1. Stores the filesystem metadata. 2. Knows which DataNodes hold each block.
What was the volume of data in the world in 2013?
4.4 × 10^21 bytes (4.4 zettabytes)
What is the problem with disks when processing large volumes of data?
The read speed of a 1 TB disk is at most around 100 MB/s, so scanning the whole disk takes hours.
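The arithmetic behind that card can be checked directly (decimal units assumed, i.e. 1 TB = 1,000,000 MB):

```python
# Time to read a full 1 TB disk sequentially at ~100 MB/s.
disk_size_mb = 1_000_000   # 1 TB expressed in MB (decimal units)
read_speed_mb_s = 100      # typical sequential read speed from the card

seconds = disk_size_mb / read_speed_mb_s
hours = seconds / 3600
print(f"{seconds:.0f} s ≈ {hours:.1f} h")  # 10000 s ≈ 2.8 h
```

Nearly three hours just to read one disk is the motivation for spreading data (and reads) across many machines.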
What is Hive?
SQL-like queries that run against data stored in HDFS
What is Spark?
Interactive, in-memory data processing
Solr
Search over data stored in HDFS
What is the first step for a data scientist?
Define the question well: what you want to answer with the data.
MapReduce phases
1. Map - filters/extracts what is needed. 2. Reduce - combines the data into a summary.
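The two phases above can be sketched with the classic word-count example in plain Python (a minimal sketch of the model, not the Hadoop API itself):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Map: extract what is needed - emit a (word, 1) pair per word.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Reduce: group by key and summarize - sum the counts per word.
    # (The framework sorts/shuffles by key between the phases.)
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

result = dict(reduce_phase(map_phase(["big data", "big deal"])))
print(result)  # {'big': 2, 'data': 1, 'deal': 1}
```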
MapReduce Jobs (components)
Input data + MapReduce program + configuration information
O que são distributed filesystems?
Filesystems that manage the storage across a network of machines are called distributed filesystems.
What design considerations shaped HDFS?
- Very large files
- Streaming data access
- Commodity hardware
In which situations is HDFS not a good fit?
- Low-latency data access
- Lots of small files
- Multiple writers / arbitrary file modifications
What is the default size of an HDFS block?
128 MB
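To illustrate how a file maps onto blocks, here is a small sketch for a hypothetical 500 MB file (the file size is an assumption for the example):

```python
import math

# Splitting a file into HDFS blocks (default block size: 128 MB).
block_size_mb = 128
file_size_mb = 500  # hypothetical file for illustration

num_blocks = math.ceil(file_size_mb / block_size_mb)
last_block_mb = file_size_mb - (num_blocks - 1) * block_size_mb
print(num_blocks, last_block_mb)  # 4 blocks; the last holds only 116 MB
```

Note that a final block smaller than 128 MB does not occupy a full block's worth of disk space.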
To how many servers is a data block typically replicated?
Three servers
What are the measures for protecting the NameNode?
- Back up the NameNode metadata files.
- Run a "secondary NameNode".
- Deploy in the "Active/Standby" (HA) model available from Hadoop 2 onwards.
What are the steps in NameNode startup?
(i) Load its namespace image into memory, (ii) replay its edit log, and (iii) receive enough block reports from the DataNodes to leave safe mode.
How long does the NameNode take to start on a large cluster?
On large clusters with many files and blocks, the time it takes for a namenode to start from cold can be 30 minutes or more.
What is responsible for managing the failover process between the active and standby NameNode?
Failover controller. The transition from the active namenode to the standby is managed by a new entity in the system called the failover controller. There are various failover controllers, but the default implementation uses ZooKeeper to ensure that only one namenode is active. Each namenode runs a lightweight failover controller process whose job it is to monitor its namenode for failures (using a simple heartbeating mechanism) and trigger a failover should a namenode fail.
Which property defines how many DataNodes a block is replicated to?
dfs.replication = x
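As a sketch, this property is typically set in `hdfs-site.xml`; the value 3 below matches the default replication factor:

```xml
<!-- hdfs-site.xml: replication factor applied to new files -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```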
What command shows help for the HDFS filesystem commands?
hadoop fs -help
Which property enables HDFS permission checking?
dfs.permissions.enabled
How does opening a file for reading work in HDFS?
The client opens the file it wishes to read by calling open() on the FileSystem object, which for HDFS is an instance of DistributedFileSystem (step 1 in Figure 3-2). DistributedFileSystem calls the namenode, using remote procedure calls (RPCs), to determine the locations of the first few blocks in the file (step 2). For each block, the namenode returns the addresses of the datanodes that have a copy of that block. Furthermore, the datanodes are sorted according to their proximity to the client (according to the topology of the cluster’s network; see Network Topology and Hadoop). If the client is itself a datanode (in the case of a MapReduce task, for instance), the client will read from the local datanode if that datanode hosts a copy of the block (see also Figure 2-2 and Short-circuit local reads).
What does it mean for two nodes in a local network to be “close” to each other?
Hadoop takes a simple approach in which the network is represented as a tree and the distance between two nodes is the sum of their distances to their closest common ancestor. Levels in the tree are not predefined, but it is common to have levels that correspond to the data center, the rack, and the node that a process is running on.
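That tree distance can be sketched in a few lines of Python, using paths of the form `/datacenter/rack/node` (the path scheme and node names are illustrative, not Hadoop's actual API):

```python
# Distance between two nodes = sum of each node's hops up to their
# closest common ancestor in the network tree.
def distance(path_a, path_b):
    a = path_a.strip("/").split("/")
    b = path_b.strip("/").split("/")
    # Depth of the closest common ancestor = length of the common prefix.
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return (len(a) - common) + (len(b) - common)

print(distance("/d1/r1/n1", "/d1/r1/n1"))  # 0: same node
print(distance("/d1/r1/n1", "/d1/r1/n2"))  # 2: same rack
print(distance("/d1/r1/n1", "/d1/r2/n3"))  # 4: same data center
print(distance("/d1/r1/n1", "/d2/r3/n4"))  # 6: different data centers
```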
