System Design concepts I learned while preparing for large-scale system developments

Vishalsheth
9 min readJan 3, 2021

— — — — -— — Index : — — — — — —

Depth details of the Topic in System Design…..!

Real time application — System Design Learning.

Key Characteristics of Distributed Systems —

Step by Step factor we need to tackle —

Primary things to learn is — Consistency, availability or idem-potency, SLAs, data durability, message persistence,

— — — — — — — Fundamentals : — — — — — —

1) What is System: Architecture or Collection of Technologies.

For the Computing world (systems), we need to keep in mind — Set of Users and their set of requirements and the components used to serve these users as per their requirements.

@Copyright to @SudoCode

2) Abstract level Components of System Design :

Logical Entity: User interact with presentation layer(front-end). Front-end connects with middle-ware API, RPC, message queue. This service data that store in the database.

Tangible Entity: Technology to each component….and its uses as per their different use-cases.

Read also the follow-up post for this article: Operating a large, distributed system in a reliable way and I also recommend the book Designing Data-Intensive Applications for more reading.

A — Ask good questions

B — Don’t use buzzwords

C — Clear and organized thinking

D — Drive discussions with 80–20 rule

Things to consider — — Features API Availability Latency Scalability Durability Class Diagram Security and Privacy Cost-effective Concepts to know Vertical vs horizontal scaling CAP theorem ACID vs BASE Partitioning/Sharding Consistent Hashing Optimistic vs pessimistic locking Strong vs eventual consistency RelationalDB vs NoSQL Types of NoSQL Key-value Wide column Document-based Graph-based Caching Datacenter/racks/hosts CPU/memory/Hard drives/Network bandwidth Random vs sequential read/writes to disk HTTP vs http2 vs WebSocket TCP/IP model ipv4 vs ipv6 TCP vs UDP DNS lookup Http & TLS Public key infrastructure and certificate authority(CA) Symmetric vs asymmetric encryption Load Balancer CDNs & Edges Bloom filters and Count-Min sketch Paxos Leader election Design patterns and Object-oriented design Virtual machines and containers Pub-sub architecture MapReduce Multithreading, locks, synchronization, CAS(compare and set) Tools Cassandra MongoDB/Couchbase Mysql Memcached Redis Zookeeper Kafka NGINX HAProxy Solr, Elastic search Amazon S3 Docker, Kubernetes, Mesos Hadoop/Spark and HDFS

https://rajagoyal815.medium.com/design-amazon-a6d157d6f2f7

Depth details of the Topic in System Design…..!

System Design — Load Balancing

System Design — Caching

System Design — Sharding / Data Partitioning

System Design — Indexes

System Design — Proxies

System Design — Message Queues

System Design — Redundancy and Replication

System Design — SQL vs. NoSQL : https://vishalsheth4.medium.com/nosql-vs-sql-4-reasons-why-nosql-is-better-for-data-science-data-science-applications-38632f2f5fee

System Design — CAP Problem

System Design — Consistent Hashing

System Design — Client-Server Communication

System Design — Storage

System Design — Other Topics

Real time application — System Design Learning.

Netflix

1. Key Characteristics of Distributed Systems-

https://liverungrow.medium.com

Scalability

- The capability of a system to continuously grow and manage large amount of work/traffic.

- Horizontal scaling: by adding more servers into the pool of resources.

- Vertical scaling: by adding more resource (CPU, RAM, storage, etc) to an existing server. This approach comes with downtime and an upper limit.

Reliability

- Reliability is the probability that a system will fail in a given period.

- A distributed system is reliable if it keeps delivering its service even when one or multiple components fail.

- Reliability is achieved through redundancy of components and data (remove every single point of failure).

Availability

- Availability is the time a system remains operational to perform its required function in a specific period.

- Measured by the percentage of time that a system remains operational under normal conditions.

- A reliable system is available but an available system is not necessarily reliable.

- A system with a security hole is available when there is no security attack.

Efficiency

- Latency: response time, the delay to obtain the first piece of data.

- Bandwidth: throughput, amount of data delivered in a given time.

- Serviceability / Manageability and Easiness to operate and maintain the system.

- Simplicity and spend with which a system can be repaired or maintained.

Load Balancing

- Between user and web server

- Spread workload

- Proxies — A proxy server is an intermediary piece of hardware / software sitting between client and back-end server.

- Filter requests

- Log requests

- Transform requests (encryption, compression, etc)

Cache

Queues — A queue is a line of things waiting to be handled — in sequential order starting at the beginning of the line.

Queues are used to effectively manage requests in a large-scale distributed system, in which different components of the system may need to work in an asynchronous way. Asynchronous processing allows a task to call a service, and move on to the next task while the service processes the request at its own pace.

It is an abstraction between the client’s request and the actual work performed to service it. The client can continue operating without interruption when the server is not ready.

The basic architecture of a message queue is simple; there are client applications called producers that create messages and deliver them to the message queue. Another application, called a consumer, connects to the queue and gets the messages to be processed. Messages placed onto the queue are stored until the consumer retrieves them.

Example of queues: Kafka, Heron, real time streaming, RabbitMQ

Redundancy — Duplication of critical data or services with the intention of increased reliability of the system.

Sharding (Data Partitioning)

- Horizontal Partitioning: Hash (Range based, consistent hashing), Horizontally derived

- Vertical Partitioning

SQL VS No SQL — Component details

Storage

  • SQL: store data in tables.
  • NoSQL: have different data storage models, Eg, Key-Values, Document Databases, Wide-Column (Cassandra), Graph Databases

Schema

  • Sql: Each record has a fixed schema.
  • NoSql: Schemas are dynamic

Scalability

  • SQL -> Vertically scalable. Horizontally scalable may be harder.
  • NoSQL -> Horizontally scalable and cheap.

ACID

  • SQL: ACID compliant and guarantee of transactions
  • NO SQL: Sacrifice ACID for performance and scalability. Even Cassandra is not ACID. It does not delete items, only mark with tombstones. It does append-on writes. It is good for heavy writes applications.

When to use SQL and No-SQL

SQL

  • Ensure ACID compliance.
  • Reduce anomalies.
  • Protect database integrity.
  • Data is structured and unchanging.

NoSQL

  • Data has little or no structure.
  • Make the most of cloud computing and storage.
  • Cloud-based storage requires data to be easily spread across multiple servers to scale up.
  • Rapid development.
  • Frequent updates to the data structure.

Step by Step factor we need to tackle.

  1. Database Selection….? — Let suppose we want to do count on any metric. (So here metric is ambiguous i.e. videos views count, Facebook like count, machine performance count.) So we have to use different database for different purpose.

For every specific purpose we have to use specific database to provide convenient service.

2. Requirements clarification: — Ask as many as questions into this 4 category.

4 Category from this we can decide our approach to solve the system design solution.

3. Then define Functional and Non-functional requirements.

→ Functional Requirements — We mean system behavior or more specifically API’s.

Let suppose our system count video view events.

First Define API’s Input.

Secondly define output/return of APIs.

→ Non-Functional Requirements — fast, fault tolerance , secure etc…..

4. Data Driven Model: — Like wise you want to store the individual events data or aggregate data(e.g. per minute) in real time.

As per requirement we have to decided which kind of data model we have to go further.

5. Selection Of Specific Database : — Based on gathering of Non-functional requirements. — Scalability , Performance and Availability.

So we have to evaluate databases against these requirements.

Majorly it looks like that we can use both SQL and No-SQL database.

SQL Database(MySQL)

  1. So here there are two service that accessing data — Processing service(store) and Query Service(retrieve) data.
  2. Between MySQL and service we mainly added Cluster-proxy so that it do health check-up, horizontal scaling and database sharding.
  3. From 1st point we able to achieve performance and from 2nd point we able to achieve scalability.
  4. For Availability we use Master-Slave (Use Synchronously and Asynchronously approach) Write is done on Master side and read is done on Master or Slave side.

But this solution does not seem simple as such need more details and configuration service, leader, proxies and replica instances…?

No-SQL Databases — Apache Cassandra

  1. Each node of equal size and each node communicate with each other node not more than 2 nodes.
  2. Now all node know each other so no proxy server needed. this communication of this nodes/shards is called Gossip Protocol.
  3. Client may call any node or any specific to closet location or use round robin algorithm to choose initial node for serve request (Consistent Hashing).
  4. When any data Write happen, initially it stores values on only 2 nodes and after then system start writing to all other nodes eventually — called Quorum writes.
  5. Similarly when Read happen, it fetch value from all nodes and majority response wins will return the value — because might be there are some nodes who still does not have latest value or updated value for this Cassandra uses version for latest updation — called Quorum Read.
  6. For Availability we replica all this nodes to data center B.

Note: synchronous data replication is slow , we usually replicate data asynchronously. some read replica will behind there master — this inconsistency is temporary — called eventual inconsistency.

Now we have to think in terms of database views —

SQL provides — Normalization — So no data inconsistency.

No-SQL provides every hour data and new column can be added when any new nouns/parameter added.

Cassandra Details : Fault tolerance , Scalable, Support multiple data replication and works well with Time Series Data.

(Need to learn MongoDB, HBase, Master-based database architectures so we can understand more depth and its utility)

6. Processing Service —

Data processing concepts: check-pointing, partitioning, in-memory aggregation, de-duplication cache, dead-letter queue, embedded database, state management.

https://www.youtube.com/watch?v=bUHFg8CZFws

Example: 1https://thinksoftware.medium.com/elevator-system-design-a-tricky-technical-interview-question-116f396f2b1c

Choosing the Best Databases in a System Design.

Note-1) Selecting a database does not impact functional requirements.

Factors to check :

1) Structure of data we have — Structure data or Non-structure data

2) Query Pattern

3) Amount of scale that we need to handle

  1. Caching: When we request the same data lots of time to the server so to optimize we try to store it in data-value in Key-Value format. The basic database used is Redis, Memcached, Etcd, Hazelcast.
  2. File Storage Options [i.e. Amazon to store product images and videos, Netflix to store movies ]: Blob storage comes into the picture: Framework: Amazon S3, Alternative approach is to use a content delivery network (CDN). so it fetches the data from nearby geographical location.
  3. Text Searching Capability or its description — Text Search Engine Framework — Elastic Search, Solaris. [i.e. — AIRPROT the framework will use fuzziness approach and convert the wrong spell to right spell and fetch the result of right spell — AIRPORT] — it is not a database it is a Search engine.
  4. Metrics kind of database: Metrics, CPU, Utilization, — Time Series database — [Database used as InfluxDB, openTSDB (Open Time-series Database)].
  5. Analytics requirements : [i.e Amazon or Uber to provide analytics on all the transactions.][how many orders are having from which Geopgraphics][Database used: data warehouse, Hadoop] [Generally used for Offline Reporting]

MUST for EASY UnderStanding BLOG: How to design a system to scale to your first 100 million users…

https://levelup.gitconnected.com/how-to-design-a-system-to-scale-to-your-first-100-million-users-4450a2f9703d

--

--