System Design: Instagram/Twitter/Reddit
The Fork/Join model in multi-threaded programming:
(1) Initial setup: The Main Thread
(2) Fork: Spawn new subtasks
(3) Parallel execution
(4) Join: consolidate results
(5) Repeat
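The steps above can be sketched in Python with `concurrent.futures` (the function names and the toy summing task are illustrative, not from the original):

```python
from concurrent.futures import ThreadPoolExecutor

def square(n: int) -> int:
    # Subtask: runs in a worker thread.
    return n * n

def fork_join_sum(numbers: list[int]) -> int:
    # (1) Initial setup: the main thread creates the pool.
    with ThreadPoolExecutor(max_workers=4) as pool:
        # (2) Fork: spawn one subtask per item.
        futures = [pool.submit(square, n) for n in numbers]
        # (3) Parallel execution happens in the pool's worker threads.
        # (4) Join: block until each result is ready, then consolidate.
        return sum(f.result() for f in futures)

print(fork_join_sum([1, 2, 3, 4]))  # 30
```

Step (5), repeat, would simply call `fork_join_sum` again on the next batch of work.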
Critical Section: a code region that accesses shared resources or variables and must not be executed by more than one thread at a time.
Race Condition: multiple threads reading and writing shared state concurrently, so the outcome depends on the timing of their interleaving.
Synchronization tools: coordinate threads around critical sections. They include: Mutexes, Read/Write locks, Semaphores, Condition variables, and Barriers.
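A minimal sketch of a mutex guarding a critical section (the counter and thread count are made up for illustration):

```python
import threading

counter = 0
lock = threading.Lock()

def increment(times: int) -> None:
    global counter
    for _ in range(times):
        # The read-modify-write below is the critical section;
        # without the lock, concurrent threads can lose updates.
        with lock:
            counter += 1

threads = [threading.Thread(target=increment, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 40000 with the lock; often less without it
```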
There are lots of different options for databases. How can we decide which one to choose? There are a couple of aspects to consider.
We will walk through them in turn: indexing, replication, failure detection, and consistency; the last part covers some existing databases as examples.
A database index is used to speed up reads based on a specific key: it slows down writes and speeds up reads.
A hash index is an in-memory hash table mapping each key to the location of its data, occasionally written to disk for persistence. However, it works poorly on disk.
Pros: easy to implement and very fast (RAM is fast).
Cons: all keys must fit in memory, and it is bad for range queries.
Scenarios: fast, but only practical when the full key set fits in memory, i.e. on relatively small datasets.
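A minimal in-memory sketch of a hash index over an append-only log, in the style of Bitcask (the class and method names are invented for illustration):

```python
class HashIndexStore:
    """Append-only log with an in-memory hash index (Bitcask-style sketch)."""

    def __init__(self) -> None:
        self.log: list[bytes] = []       # stands in for the on-disk data file
        self.index: dict[str, int] = {}  # key -> position of the latest value

    def put(self, key: str, value: bytes) -> None:
        # Write cost: one append plus one hash-table update (the slowdown).
        self.log.append(value)
        self.index[key] = len(self.log) - 1

    def get(self, key: str) -> bytes:
        # Read cost: one hash lookup plus one fetch; no scan (the speedup).
        return self.log[self.index[key]]

store = HashIndexStore()
store.put("user:1", b"alice")
store.put("user:1", b"alicia")  # overwrite: the old entry stays in the log
print(store.get("user:1"))      # b'alicia'
```

The dict keeps keys in no useful order, which is exactly why range queries are a weak spot for hash indexes.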
Block storage: raw blocks attached to a server as a volume. Mutable, with higher cost and higher performance, but lower scalability because a volume can only be attached to one server. Good for VMs and databases.
File storage: built on top of block storage at a higher level of abstraction, handling files and directories. Medium-to-high performance and cost, medium scalability. Provides general-purpose file-system access; good for sharing files/folders within an organization.
Object storage: sacrifices performance for higher durability and vast scalability at low cost. Objects are generally immutable, though versioning is supported. It targets relatively cold data; access is through RESTful APIs.
This blog is more about object storage. It provides RESTful APIs, including PUT object and GET object.
Business entities: bucket (folder) and object.
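A toy in-memory model of these two entities and the PUT/GET object calls (the class and its methods are hypothetical, not any real object store's SDK):

```python
class ObjectStore:
    """Toy in-memory model of an object store: buckets hold objects."""

    def __init__(self) -> None:
        self.buckets: dict[str, dict[str, bytes]] = {}

    def create_bucket(self, bucket: str) -> None:
        self.buckets.setdefault(bucket, {})

    def put_object(self, bucket: str, key: str, data: bytes) -> None:
        # PUT /<bucket>/<key> -- objects are written whole, not edited in place.
        self.buckets[bucket][key] = data

    def get_object(self, bucket: str, key: str) -> bytes:
        # GET /<bucket>/<key>
        return self.buckets[bucket][key]

store = ObjectStore()
store.create_bucket("photos")
store.put_object("photos", "2024/cat.jpg", b"\xff\xd8...")
print(store.get_object("photos", "2024/cat.jpg"))
```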
Consider maintaining a highly concurrent service, with 3k-5k requests per second hitting one server. This generates a large number of request logs. Among those logs, data-plane APIs (data-related) normally have a much larger volume than control-plane APIs (management-related).
We want to understand how healthy the service is and how healthy each API is; note that an API with lower volume is not less important.
How do we do that? When the request-log data is large, it is often advantageous to choose a smaller subset that summarizes the original dataset; this is called sampling. The main idea is to take a statistically significant sample of the data and analyze that sample rather than the whole original dataset.
By querying the sampled data, the system can efficiently provide a result that approximates the real answer.
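A small sketch of the idea, estimating mean request latency from a 1% sample of synthetic logs (the API name and latency distribution are made up):

```python
import random

random.seed(7)

# Hypothetical request logs: (api_name, latency_ms), ~50 ms mean latency.
logs = [("put_object", random.expovariate(1 / 50)) for _ in range(100_000)]

# Take a 1% random sample instead of scanning every log line.
sample = random.sample(logs, k=len(logs) // 100)

estimate = sum(latency for _, latency in sample) / len(sample)
actual = sum(latency for _, latency in logs) / len(logs)
print(f"estimated mean latency: {estimate:.1f} ms (actual: {actual:.1f} ms)")
```

The sample answer is approximate, but for a 1,000-element sample the error is typically a couple of milliseconds, at 1% of the query cost.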
Cache helps availability and resiliency by, for example, improving request latency (so the service is better able to handle incoming traffic) and by decreasing load on downstream dependencies.
On the flip side, a cache introduces modal behavior in your service: behavior differs depending on whether a given object is cached.
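A minimal sketch of those two modes, a fast hit path and a slow miss path (the lookup function and timings are illustrative):

```python
import time

cache: dict[str, str] = {}

def slow_lookup(key: str) -> str:
    time.sleep(0.05)  # stands in for a call to a downstream dependency
    return key.upper()

def get(key: str) -> str:
    # Miss: slow downstream path, then populate. Hit: fast in-memory path.
    if key not in cache:
        cache[key] = slow_lookup(key)
    return cache[key]

start = time.perf_counter(); get("a"); miss = time.perf_counter() - start
start = time.perf_counter(); get("a"); hit = time.perf_counter() - start
print(f"miss: {miss * 1000:.1f} ms, hit: {hit * 1000:.3f} ms")
```

The gap between the two timings is the modal behavior in miniature: the same call has two very different latency profiles, which matters when reasoning about tail latency and downstream load.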