# Introduction to Hadoop

## The HDFS File System

![](/files/-LoyKMnOadQMs978CK-Q)

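* namenode: manages and maintains the HDFS directory tree (the file system namespace) and controls file reads and writes
* datanode(s): store the actual data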

### HDFS Design

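* Runs on low-cost servers while providing high fault tolerance
* Streaming data access
* Built for large data sets on a cluster architecture
* Simple coherency model: a write-once, read-many access pattern, where a file is not modified in place after it has been created
* Moves computation to the data: computation is scheduled on servers close to where the data is stored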

### HDFS File Storage Architecture

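* When a file is stored with an HDFS command, it is split into multiple blocks of 64 MB each (the default in Hadoop 1.x; Hadoop 2.x and later default to 128 MB). In the figure, the file is split into three blocks: A, B, and C.
* Each block is replicated 3 times by default. When a replica of a block is lost or corrupted, the namenode automatically locates a surviving replica on another datanode and copies it, so that 3 replicas are maintained.
* Rack awareness (a rack is the frame that holds the servers): in the figure below there are 3 racks with 4 datanode servers each, and Hadoop knows which rack each datanode belongs to, so it can spread the replicas of a block across racks.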

![](/files/-LoyKMnQcw_IWdUzWJob)
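
The block splitting and replica placement described above can be sketched in a few lines of Python. This is only an illustrative model, not Hadoop code: the function names `split_into_blocks` and `place_replicas` and the rack and datanode names are hypothetical, and the placement rule is a simplified, deterministic version of the default HDFS policy (first replica on the writer's node, the other two on different nodes of one other rack).

```python
BLOCK_SIZE_MB = 64  # block size used in this page; newer Hadoop releases default to 128 MB
REPLICATION = 3     # default number of replicas per block


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks a file would be split into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block only holds the leftover data
    return sizes


def place_replicas(racks, writer_rack, writer_node):
    """Choose three (rack, node) pairs for one block.

    racks maps a rack name to the list of datanode names in that rack.
    Simplification: real HDFS picks the remote rack and nodes at random;
    here the first other rack (by name) and its first two nodes are used.
    """
    first = (writer_rack, writer_node)
    remote_rack = next(r for r in sorted(racks) if r != writer_rack)
    second = (remote_rack, racks[remote_rack][0])
    third = (remote_rack, racks[remote_rack][1])
    return [first, second, third]


# Hypothetical cluster matching the figure: 3 racks with 4 datanodes each.
racks = {f"rack{i}": [f"rack{i}-dn{j}" for j in range(1, 5)] for i in range(1, 4)}

blocks = split_into_blocks(150)  # a 150 MB file
placement = place_replicas(racks, "rack1", "rack1-dn1")

print(blocks)     # [64, 64, 22]
print(placement)  # three replicas on three nodes across two racks
```

With this layout, losing a single datanode or even a whole rack still leaves at least one replica of the block available for the namenode to copy from.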

## MapReduce

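A distributed computing technique:

* Map splits a job into many small tasks
* Reduce collects and combines the results computed on the servers and returns the final result

This makes it possible to process data in parallel across thousands of machines.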
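
As a rough illustration of the Map and Reduce steps, here is a single-process word count in plain Python. It does not use the Hadoop API: the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical, and in a real cluster the map and reduce calls would run on many machines while the framework performs the grouping step between them.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # Each line could be mapped on a different machine; here they run in a loop.
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


result = word_count(["hadoop stores data", "spark reads data", "hadoop data"])
print(result)
```

Because each `map_phase` call depends only on its own line and each `reduce_phase` call only on its own key, both steps can be spread across machines without coordination, which is what makes the model scale.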

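Since Hadoop 2, MapReduce jobs run on YARN, Hadoop's cluster resource management and job scheduling layer (this architecture is also known as MapReduce 2).

Hadoop MapReduce writes intermediate data to disk while a job runs, which adds latency.

Spark is an in-memory computing framework, so it can be much faster for workloads that reuse intermediate data.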

