Apache Hadoop

Architecture

HDFS

  • Designed to reliably store large files across machines in a large cluster.
  • It stores each file as a sequence of blocks. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file (see the Java sketch after this list).
  • Write Once Read Many (WORM) model: once a file is written and closed, it cannot be modified in place, and at most one client can write to it at a time.
  • Writes are always made at the end of the file, in append-only fashion.
  • Block size is 128 MB by default.
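
A minimal sketch of the per-file block size and replication factor mentioned above, assuming the standard Hadoop Java client (org.apache.hadoop:hadoop-client) is on the classpath; the /tmp/example.dat path and the chosen values are hypothetical. The FileSystem.create overload takes the replication factor and block size directly, overriding the cluster defaults for that one file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PerFileBlockSettings {
    public static void main(String[] args) throws Exception {
        // Reads fs.defaultFS and friends from core-site.xml / hdfs-site.xml.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/tmp/example.dat"); // hypothetical path

        // create(path, overwrite, bufferSize, replication, blockSize):
        // this file gets 2 replicas and 64 MB blocks regardless of the
        // cluster-wide dfs.replication / dfs.blocksize settings.
        try (FSDataOutputStream out =
                 fs.create(file, true, 4096, (short) 2, 64L * 1024 * 1024)) {
            out.writeUTF("hello hdfs");
        }
        fs.close();
    }
}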

NameNode (master)

  • The NameNode holds the entire filesystem namespace (the metadata for every file, directory, and block) in memory, so the number of files a filesystem can hold is governed by the amount of RAM on the NameNode.
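  • As a rough rule of thumb, each file, directory, and block occupies about 150 bytes of NameNode memory; ten million single-block files therefore need roughly 10,000,000 × 2 × 150 bytes ≈ 3 GB of RAM for the namespace alone.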

DataNode (worker)

  • Stores the actual blocks on local disk and serves read/write requests from clients; it reports the blocks it holds to the NameNode via periodic heartbeats and block reports.

HDFS - Cheatsheet

List the blocks of a file
hdfs fsck $FILE_PATH -files -blocks
Copy a file from local to HDFS
hdfs dfs -copyFromLocal $LOCAL_FILE_PATH $HDFS_FILE_PATH
List the contents of an HDFS directory
hdfs dfs -ls $HDFS_DIR_PATH
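
Two related entries in the same format (the $… paths are placeholders): hdfs dfs -stat reads back the per-file replication factor (%r) and block size (%o) discussed above, and hdfs dfs -appendToFile exercises the append-only write path.
Show a file's replication factor and block size
hdfs dfs -stat "%r %o" $HDFS_FILE_PATH
Append a local file to an existing HDFS file
hdfs dfs -appendToFile $LOCAL_FILE_PATH $HDFS_FILE_PATH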

Deployment

Docker