
Learning Architecture from Scratch#

Basics of Architecture#

Concepts#

  • Systems and Subsystems

    System: Refers to a group of related individuals that operate according to certain rules and can accomplish tasks that individual components cannot achieve alone. (Relation, Rules, Capability)

    A subsystem is also a system composed of a group of related individuals, often part of a larger system.

    For example, WeChat itself is a system that includes subsystems such as chat, login, payment, and Moments; the Moments subsystem in turn contains subsystems for posts, comments, and likes.

  • Modules and Components

    Software Module: A cohesive, closely related unit of software organization that comprises both programs and data structures.

    Software Component: A self-contained, programmable, reusable, language-independent software unit that can easily be used to assemble applications.

    For example, a student information management system can be logically divided into login and registration module, personal information module, and personal grades module; from a physical perspective, it can be divided into nginx, web server, and MySQL.

    • From a logical perspective, the split system results in modules.
    • From a physical perspective, the split system results in components.
    • Dividing modules is for the separation of responsibilities, while dividing components is for unit reuse.
  • Framework and Architecture

    Software Framework: A component specification designed to implement an industry standard or to accomplish a specific basic task; it also refers to the software product that provides the basic functionality required when implementing such a specification.

    Software Architecture: Refers to the foundational structure of a software system, the principles that create these foundational structures, and the description of these principles. (Software architecture refers to the top-level structure of the software system.)

Purpose of Architecture Design: To solve problems arising from complexity.#

Sources of Complexity#

  • High Performance

    • Complexity of a Single Machine

      • Consider multi-process, multi-threading, inter-process communication, multi-threaded concurrency, etc.
    • Complexity of a Cluster

      • Task Allocation

        • One task allocator with multiple business servers gradually evolves into multiple task allocators and multiple business servers.
      • Task Decomposition

        • Splitting a complex business system into multiple small and simple systems that cooperate.

        • Reasons for performance improvement through task decomposition:

          • Simple systems are easier to achieve high performance.
          • Individual subsystems can be scaled on their own; however, the system cannot be split endlessly finer, because with too many subsystems a single user request needs many calls across subsystems before a response is returned, which actually hurts performance.
  • High Availability

    • Achieved through "redundancy."

      • High availability in computing

      • High availability in storage

      • High availability in state decision-making

        • Dictatorial: a single node is the designated decision-maker; all other nodes only report their state to it.
        • Negotiation: two independent nodes exchange state information and then decide according to agreed rules.
        • Democratic: multiple independent nodes decide the state by majority vote (see the quorum sketch after this list).
  • Scalability

    • A capability provided to respond to future demand changes, allowing the system to support new requirements with little or no modification, without needing to completely restructure or rebuild the system.

      • Correctly predict changes
      • Perfectly encapsulate changes
  • Low Cost

    • Only innovation can achieve low-cost goals.

      • Examples

        • The emergence of NoSQL (Redis, Memcache, etc.) is to solve the access pressure brought by high concurrency that relational databases cannot handle.
        • The emergence of full-text search engines (Sphinx, Elasticsearch, Solr) is to solve the inefficiency of like searches in relational databases.
        • The emergence of Hadoop is to solve the problem of traditional file systems being unable to handle massive data storage and computation.
  • Security

    • Functional Security

      • Common XSS attacks, CSRF attacks, SQL injection, Windows vulnerabilities, password cracking, etc., are essentially due to vulnerabilities in the system that allow hackers to exploit them.
    • Architectural Security

      • Mainly relies on firewalls for network isolation.

        • DDoS

          Distributed Denial of Service (DDoS): multiple attackers in different locations attack one or several targets at the same time, or a single attacker takes control of machines in different locations and uses them to attack the victim simultaneously. Because the attacking points are distributed across many places, this type of attack is called a distributed denial-of-service attack.

  • Scale

    • Increasing functionality leads to an exponential rise in system complexity.
    • Increasing data leads to a qualitative change in system complexity.
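To make the "democratic" decision style above concrete, here is a minimal majority-vote (quorum) sketch in Python. The `Node` and `elect_leader` names are made up for illustration; real implementations such as ZooKeeper's ZAB or Raft handle far more (terms, log replication, network failures).

```python
# A minimal sketch of "democratic" state decision-making by majority vote.
# Node and elect_leader are illustrative names, not from any specific system.
from collections import Counter
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    vote: str  # which node this member votes for as the new master


def elect_leader(nodes: list[Node]) -> str | None:
    """Return the winner only if it holds a strict majority (a quorum)."""
    tally = Counter(n.vote for n in nodes)
    candidate, votes = tally.most_common(1)[0]
    quorum = len(nodes) // 2 + 1
    return candidate if votes >= quorum else None  # None: no decision, avoids split brain


if __name__ == "__main__":
    cluster = [Node("a", "b"), Node("b", "b"), Node("c", "b"),
               Node("d", "c"), Node("e", "c")]
    print(elect_leader(cluster))  # "b" wins with 3 of 5 votes
```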

Principles of Architecture Design#

Suitability Principle: Suitable is better than industry-leading.#

Simplicity Principle: Simplicity is better than complexity.#

Evolution Principle: Evolution is better than a one-step approach.#

High-Performance Architecture#

High-Performance Storage#

  • Relational Databases

    • Read-Write Separation: Distributes access pressure across multiple nodes in a cluster, but does not distribute storage pressure.

    • Database Sharding: Can both distribute access pressure and storage pressure.

      • Business Sharding: Distributes storage and access pressure.

        • Introduced Issues

          • Join Operation Issues

            • Tables originally in the same database are spread across different databases, making it impossible to use SQL's join queries.
          • Transaction Issues

            • Different tables originally in the same database can be modified in the same transaction; after sharding, tables spread across different databases cannot be modified uniformly through transactions.
          • Cost Issues

            • What could originally be handled by one server may now require three or more.
      • Table Sharding

        • Vertical Sharding: Suitable for splitting out certain infrequently used columns that occupy a lot of space.

          • Introduced Issues

            • Because table information is spread across multiple tables, what was once a single query may now require two or more.
        • Horizontal Sharding: Suitable for tables with particularly large row data, such as tables with over 50 million records.

          • Introduced Issues

            • Routing: after horizontal sharding, a routing algorithm is needed to determine which sub-table a given row belongs to (a minimal sketch appears at the end of this section).

              • Range Routing
              • Hash Routing
              • Configuration Routing
            • Count() Operation

              • Originally a single count() on one table; after sharding, count(*) must be run on every sub-table and the results summed.
            • Record Count Table

              • Create a new table to record the number of records in each table after each insert or delete operation.
    • Implementation Methods

      • Program Code Encapsulation: Abstract a data access layer in the code to achieve read-write separation and sharding.

      • Middleware Encapsulation: Create an independent system to implement read-write separation and sharding operations.

      • Implementation Complexity: Sharding is much more complex than read-write separation.

        • For read-write separation, it is only necessary to identify whether the SQL operation is a read or write operation, which can be determined by keywords SELECT, UPDATE, INSERT, DELETE.
        • For sharding, in addition to determining the operation type, it is necessary to identify the specific table to be operated on in the SQL, the operation functions (count, order by, group by), and then handle them differently based on the operation.
    • Existing Drawbacks

      • Relational databases store row records and cannot store data structures.
      • Extending the schema of relational databases is very inconvenient.
      • Relational databases have high I/O in big data scenarios.
      • Relational databases have weak full-text search capabilities.
  • NoSQL

    • The essence of NoSQL is to sacrifice one or more of the ACID properties in exchange for capabilities that relational databases lack; it complements relational databases rather than replacing them.

    • Common NoSQL solutions fall into four categories:

      • K-V Storage: Solves the problem of relational databases being unable to store data structures, represented by Redis.

      • Document Databases: Solve the problem of strong schema constraints in relational databases, represented by MongoDB.

        The biggest feature of document databases is no-schema, allowing for the storage and retrieval of any data, typically in JSON format.
        Advantages:
        Adding fields is simple;
        Historical data will not be erroneous;
        Can easily store complex data.

      • Columnar Databases: Solve the I/O problems of relational databases in big data scenarios, represented by HBase.

        For example, to count the number of overweight people in a city, only the weight column needs to be read.

      • Full-Text Search Engines: Solve the performance issues of full-text searches in relational databases, represented by Elasticsearch.

  • Caching

    • Basic Principle: Place potentially reusable data in memory, generating it once and using it multiple times, avoiding accessing the storage system every time.

    • Problems Faced

      • Cache Penetration: Accessing data that does not exist in the cache, causing the business system to need to access the database again, putting pressure on the database server.
      • Cache Breakdown: The moment a single high-traffic data expires, the data access volume is large, and after missing the cache, a large number of database accesses for the same data are initiated, putting pressure on the database server.
      • Cache Warming (Preheating): loading the relevant data into the cache before or as the system starts, so that early requests do not all miss and hit the database.
      • Cache Avalanche: a large number of hot keys share the same or similar expiration time, so the cache fails en masse at one moment and the requests all fall through to the database, putting enormous pressure on the storage layer and potentially bringing the system down. (A cache-aside sketch addressing penetration and avalanche follows this list.)
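As a concrete illustration of cache penetration and cache avalanche, here is a minimal cache-aside sketch in Python. It caches "not found" results to blunt penetration and adds random jitter to expiration times to spread out avalanches; the in-memory dict stands in for Redis/Memcached, and `load_from_db` is a hypothetical stub.

```python
# A minimal cache-aside sketch: cache the "not found" result too (penetration)
# and add random jitter to expiration times so hot keys do not expire together
# (avalanche). The dict stands in for Redis/Memcached; load_from_db is a stub.
import random
import time

_MISS = object()                                 # sentinel: key known to be absent in the database
_cache: dict[str, tuple[object, float]] = {}     # key -> (value, expire_at)


def load_from_db(key: str):
    """Stub for the real database query; returns None when the row is missing."""
    return None if key.startswith("ghost:") else f"row-for-{key}"


def get(key: str, ttl: float = 60.0):
    now = time.time()
    hit = _cache.get(key)
    if hit and hit[1] > now:                     # fresh cache hit
        value = hit[0]
        return None if value is _MISS else value

    value = load_from_db(key)
    jitter = random.uniform(0, ttl * 0.2)        # spread expirations to soften avalanches
    _cache[key] = (value if value is not None else _MISS, now + ttl + jitter)
    return value


if __name__ == "__main__":
    print(get("user:42"))    # loaded from the "database", then cached
    print(get("ghost:1"))    # missing row: the miss itself is cached briefly
```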

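Returning to the routing rules listed under horizontal sharding, here is a minimal sketch of range routing and hash routing; the table names, shard size, and shard count are made up for illustration.

```python
# A minimal sketch of range routing and hash routing for horizontal sharding.
ROWS_PER_SHARD = 10_000_000    # e.g. order ids 0-9,999,999 live in sub-table order_0
NUM_HASH_SHARDS = 4


def range_route(order_id: int) -> str:
    """Range routing: contiguous id ranges map to consecutive sub-tables."""
    return f"order_{order_id // ROWS_PER_SHARD}"


def hash_route(user_id: int) -> str:
    """Hash routing: rows are spread evenly, but changing the shard count means rehashing."""
    return f"user_{user_id % NUM_HASH_SHARDS}"


if __name__ == "__main__":
    print(range_route(25_000_000))   # -> order_2
    print(hash_route(123457))        # -> user_1
```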
High-Performance Computing#

  • High Performance on a Single Server

    • PPC (Process per Connection): A new process is created to handle the request of each new connection.

    • Prefork: Processes are created in advance for subsequent direct use.

    • TPC (Thread per Connection): A new thread is created to handle the request of each new connection.

    • Prethread: Threads are created in advance for subsequent direct use.

    • Reactor (Non-blocking Synchronous Network Model): Core components include the Reactor and the processing resource pool, where the Reactor is responsible for listening and distributing events, and the processing resource pool handles events.

        1. The mainReactor object in the parent process listens for connection-establishment events through select; when one arrives, it accepts the connection through the Acceptor and assigns the new connection to a child process.
        2. The subReactor in the child process adds the connection assigned by the mainReactor to its connection queue for monitoring and creates a Handler to handle the connection's events.
        3. When a new event occurs on the connection, the subReactor calls the corresponding Handler to respond.
        4. The Handler completes the full business flow of read → business processing → send.
      (A minimal single-Reactor sketch appears at the end of this section.)
    • Proactor (Asynchronous Network Model): Core components include the Proactor and the asynchronous operation processor.

        1. The Proactor Initiator creates the Proactor and the Handler, and registers both with the kernel through the Asynchronous Operation Processor.
        2. The Asynchronous Operation Processor handles the registration request and performs the I/O operations.
        3. When an I/O operation completes, the Asynchronous Operation Processor notifies the Proactor.
        4. The Proactor calls the appropriate Handler for business processing according to the event type.
        5. The Handler completes the business processing and may register new Handlers with the kernel.
  • High Performance in Clusters

    • Essence: Enhance the overall computing capability of the system by adding more servers.

    • Complexity: Increase task allocators and select a suitable task allocation algorithm. (Task allocators are more commonly referred to as load balancers.)

    • Load Balancing Classification

      • DNS Load Balancing: Achieves geographical level load balancing. For example, northern users access the Beijing data center; southern users access the Shenzhen data center.
      • Hardware Load Balancing: Achieves cluster-level load balancing through dedicated hardware devices. These devices are similar to routers and switches and can be understood as basic network devices for load balancing.
      • Software Load Balancing: Achieves machine-level load balancing through load balancing software.
    • Load Balancing Architecture

      • In actual use, the above three load balancing methods can be flexibly used. First, find the nearest city server IP through DNS load balancing, then find the corresponding cluster group through hardware load balancing, and finally find the required cluster within the cluster group through software load balancing.
    • Load Balancing Algorithms

      • Task Balancing: Round Robin, Weighted Round Robin
      • Load Balancing Class: Lowest Load Priority
      • Performance Optimal Class: Shortest Response Time Priority
      • Hash Class: Perform hash calculations based on certain key information of the task to map to a specified host.
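Here is a minimal sketch of three of the algorithms just listed: round robin, weighted round robin, and hash-based assignment. Server addresses and weights are illustrative; production load balancers (e.g. Nginx, LVS) implement these far more carefully.

```python
# A minimal sketch of round robin, weighted round robin, and hash assignment.
import hashlib
import itertools

SERVERS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
WEIGHTS = {"10.0.0.1": 3, "10.0.0.2": 1, "10.0.0.3": 1}

_rr = itertools.cycle(SERVERS)                                       # plain round robin
_wrr = itertools.cycle([s for s in SERVERS for _ in range(WEIGHTS[s])])  # weighted round robin


def round_robin() -> str:
    return next(_rr)


def weighted_round_robin() -> str:
    return next(_wrr)


def hash_route(client_ip: str) -> str:
    """Hash class: the same client always lands on the same server (useful for session stickiness)."""
    digest = hashlib.md5(client_ip.encode()).hexdigest()
    return SERVERS[int(digest, 16) % len(SERVERS)]


if __name__ == "__main__":
    print([round_robin() for _ in range(4)])
    print([weighted_round_robin() for _ in range(5)])
    print(hash_route("203.0.113.7"))
```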

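Going back to the Reactor model described under single-server performance, here is a minimal single-process, single-Reactor echo server built on Python's standard selectors module. It shows the essential loop of waiting for events and dispatching them to registered handlers; real multi-Reactor designs (mainReactor in the parent, subReactors in children) layer process management on top of the same idea. The port and handler names are illustrative.

```python
# A minimal single-Reactor sketch: one event loop waits for "connection
# established" and "data readable" events and dispatches them to handlers.
import selectors
import socket

sel = selectors.DefaultSelector()


def accept_handler(server_sock: socket.socket) -> None:
    conn, _addr = server_sock.accept()            # the "Acceptor" step
    conn.setblocking(False)
    sel.register(conn, selectors.EVENT_READ, read_handler)


def read_handler(conn: socket.socket) -> None:
    data = conn.recv(1024)                        # read -> business processing -> send
    if data:
        conn.sendall(data.upper())                # simplified: assumes the send buffer has room
    else:
        sel.unregister(conn)
        conn.close()


def serve(port: int = 8888) -> None:
    server = socket.socket()
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(("127.0.0.1", port))
    server.listen()
    server.setblocking(False)
    sel.register(server, selectors.EVENT_READ, accept_handler)
    while True:                                   # the Reactor: wait for events, dispatch to handlers
        for key, _mask in sel.select():
            key.data(key.fileobj)                 # call the handler registered for this socket


if __name__ == "__main__":
    serve()
```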
High-Availability Architecture#

CAP#

  • CAP Theory: in a distributed system, Consistency, Availability, and Partition tolerance cannot all be guaranteed at once; at most two of the three can be satisfied simultaneously.
  • BASE Theory: Basically Available, Soft state, Eventual consistency; an extension of the AP choice in CAP that trades strong consistency for availability.

The essence of high availability is redundancy.#

High-Availability Storage#

  • Common high-availability storage architectures include master-slave, master-master, cluster, and partition.

    • Master-Backup Replication: all client operations go through the master; the backup machine only keeps a copy of the data and does not take part in actual business reads or writes.

    • Master-Slave Replication: The master is responsible for read/write operations, while the slave is only responsible for read operations, not write operations.

    • Master-Backup Switchover and Master-Slave Switchover

      • Key Design Considerations

          1. State judgment between master and slave: including the channel for state transmission and the content of state checks.
          2. Failover decision: timing of the failover, the failover strategy, and the degree of automation.
          3. Data conflict resolution: how to resynchronize data after the faulty master recovers.
      • Common Architectures

        • Interconnected: The master and slave directly establish channels for state transmission.
        • Mediated: The master and slave are not directly connected but connect to a mediator, and transmit state information through the mediator.
        • Simulated: No state data is transmitted between the master and slave; instead, the slave simulates a client, initiating simulated read/write operations to the master and judging the master's state based on the read/write response.
    • Master-Master Replication

      • Concept: Both machines are masters, replicating data to each other, and clients can choose either machine for read/write operations.
      • Many kinds of data cannot simply be replicated in both directions, for example auto-incremented user registration IDs and inventory counts.
    • Data Clusters

      • Concept: A cluster is a combination of multiple machines forming a unified system. (Master-slave, master-master, and cluster architectures inherently assume that the master can store all data.)

      • Cluster Classification

        • Centralized Data Cluster

        • Distributed Data Cluster

          • Elasticsearch Cluster
      • Distributed Transaction Algorithms

        • Purpose: To ensure that data scattered across multiple nodes is uniformly committed or rolled back to meet ACID requirements.

        • Two-Phase Commit Protocol (2PC)

          • A commit-request (prepare/voting) phase followed by a commit-execution phase. (A minimal coordinator sketch appears at the end of this section.)
        • Three-Phase Commit Protocol (3PC)

          • A can-commit phase, a pre-commit phase, and a do-commit phase. (The extra preparation phase addresses the blocking problem of two-phase commit when the coordinator is a single point of failure: participants can commit after a timeout instead of being blocked indefinitely.)
      • Distributed Consistency Algorithms

        • Purpose: To ensure consistency of the same data across multiple nodes.

        • Mechanism: Replicated State Machine

          • Replica: Multiple distributed servers form a cluster, with each server containing a replica of the complete state machine.
          • State Machine: The state machine accepts input, executes operations, and changes the state to the next state.
          • Algorithm: Uses algorithms to coordinate the processing logic of each replica to keep the state machines consistent.
        • Algorithms: Paxos, Raft, ZAB

    • Data Partitioning

      • Concept: Refers to partitioning data according to certain rules, with different partitions distributed across different geographical locations, each storing a portion of the data, thereby mitigating the significant impact of geographical-level failures.

      • Issues to Consider

        • Data Volume: The larger the data volume, the more complex the partitioning rules will be, and the more situations to consider.

        • Partitioning Rules: Intercontinental partitioning, national partitioning, city partitioning.

        • Replication Rules

          • Centralized: There is a central backup center where all partition data is backed up.
          • Mutual Backup: Each partition backs up the data of another partition.
          • Independent: Each partition has its own independent backup center.
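To illustrate the two phases of 2PC described above, here is a minimal, in-process coordinator sketch; `Participant` is a toy class invented for the example, and real implementations also need timeouts, logging, and crash recovery, which is exactly where 2PC's blocking weakness shows up.

```python
# A minimal in-process sketch of two-phase commit (2PC): the coordinator asks
# every participant to prepare (voting phase) and issues a global commit only
# if all of them vote yes; otherwise it rolls everyone back.
class Participant:
    def __init__(self, name: str, can_commit: bool = True):
        self.name = name
        self._can_commit = can_commit
        self.state = "init"

    def prepare(self) -> bool:        # phase 1: vote
        self.state = "prepared" if self._can_commit else "aborted"
        return self._can_commit

    def commit(self) -> None:         # phase 2: commit
        self.state = "committed"

    def rollback(self) -> None:       # phase 2: abort
        self.state = "rolled_back"


def two_phase_commit(participants: list[Participant]) -> bool:
    if all(p.prepare() for p in participants):   # phase 1: commit request / voting
        for p in participants:                   # phase 2: commit execution
            p.commit()
        return True
    for p in participants:                       # any "no" vote aborts the whole transaction
        p.rollback()
    return False


if __name__ == "__main__":
    ok = two_phase_commit([Participant("db1"), Participant("db2", can_commit=False)])
    print("committed" if ok else "rolled back")  # -> rolled back
```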

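As a tiny illustration of the "mutual backup" replication rule for data partitions, the sketch below arranges partitions in a ring so that each one backs up the next; the partition names are made up.

```python
# A minimal sketch of the "mutual backup" rule: each partition backs up
# the data of the next partition in a ring.
PARTITIONS = ["beijing", "shanghai", "guangzhou", "shenzhen"]


def backup_target(partition: str) -> str:
    """Return the partition that stores this partition's backup copy."""
    i = PARTITIONS.index(partition)
    return PARTITIONS[(i + 1) % len(PARTITIONS)]


if __name__ == "__main__":
    for p in PARTITIONS:
        print(f"{p} -> backed up on {backup_target(p)}")
```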
High-Availability Computing#

  • Master-Backup

    • Cold Backup: The backup machine's business is not started.
    • Hot Backup: The backup machine's business is started.
  • Master-Slave

  • Symmetrical Cluster: Each server in the cluster has the same role and can perform all tasks.

  • Asymmetrical Cluster: Servers in the cluster are divided into multiple different roles, with different roles performing different tasks.

Business High Availability#

  • Multi-Active in Different Locations

    • Purpose: To respond to system-level failures.

    • Architectures: same-city different zones, cross-city different locations, cross-country different locations.

    • Design Techniques

        1. Ensure geo-redundant multi-activity only for the core businesses.
        2. Ensure eventual (final) consistency of the core data.
        3. Use multiple means to synchronize data:

          • Message queues
          • Secondary reads
          • Storage-system synchronization
          • Read-back from the source
          • Regenerating the data
        4. Ensure geo-redundant multi-activity only for the vast majority of users.
    • Design Steps: Business Classification, Data Classification, Data Synchronization, Exception Handling.

  • Interface-Level Fault Response Plans

    • Downgrade: Reduce the functionality of certain businesses or interfaces, which can be partial functionality or completely stop all functionalities.

      • System Backdoor Downgrade
      • Independent Downgrade System
    • Circuit Breaker: set failure thresholds for dependent external systems and stop calling a system once its threshold is exceeded, so failures do not cascade (a minimal sketch follows this list).

    • Rate Limiting: Only allow a manageable volume of access into the system, discarding requests that exceed the limit.

      • Request-based rate limiting
      • Resource-based rate limiting
    • Queueing: let users wait a while before their requests are processed; after a long wait the request is either handled or the user is told that it failed.
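For the circuit-breaker plan above, here is a minimal sketch: after a number of consecutive failures the breaker opens and calls fail fast, and after a cooldown one probe call is let through. The class name, thresholds, and behavior are illustrative, not taken from any particular library.

```python
# A minimal circuit-breaker sketch: after max_failures consecutive failures the
# breaker "opens" and calls fail fast; after reset_timeout seconds one trial
# call is allowed through to probe whether the dependency has recovered.
import time


class CircuitBreaker:
    def __init__(self, max_failures: int = 5, reset_timeout: float = 30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, func, *args, **kwargs):
        # usage: breaker.call(some_remote_call, arg1, arg2)
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")   # breaker is open
            self.opened_at = None                                  # half-open: allow one probe
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()                       # trip the breaker
            raise
        self.failures = 0                                          # success closes the breaker
        return result
```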

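And for request-based rate limiting, a minimal token-bucket sketch: tokens refill at a fixed rate up to a burst capacity, and a request is admitted only if a token is available. The rate and capacity values are illustrative.

```python
# A minimal token-bucket sketch of request-based rate limiting.
import time


class TokenBucket:
    def __init__(self, rate: float, capacity: int):
        self.rate = rate                   # tokens added per second
        self.capacity = capacity           # maximum burst size
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                       # over the limit: reject (or let the caller enqueue)


if __name__ == "__main__":
    bucket = TokenBucket(rate=2, capacity=5)
    print([bucket.allow() for _ in range(7)])   # first 5 admitted, the rest rejected
```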