Analysis of Dependency Management

npm#

npm is the earliest of these command-line tools for installing dependencies. Here are the steps npm follows when installing a package (a rough sketch in code follows the list):

  • Issue the npm install command
  • npm queries the registry for the URL of the module package
  • Download the package and store it in the ~/.npm directory
  • Extract the package into the current project's node_modules directory.
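As an illustration only (this is not npm's implementation), the sketch below mirrors those four steps with Node's built-in fetch and the system tar command; the cache path and file naming are simplified assumptions.

```typescript
import { execFileSync } from "node:child_process";
import { mkdirSync, writeFileSync } from "node:fs";
import { homedir } from "node:os";
import { join } from "node:path";

// Illustrative sketch of the npm install flow: registry -> cache -> node_modules.
async function installPackage(name: string, version: string): Promise<void> {
  // 1. Query the registry for the package's metadata, which contains the tarball URL.
  const meta = (await (await fetch(`https://registry.npmjs.org/${name}/${version}`)).json()) as any;
  const tarballUrl: string = meta.dist.tarball;

  // 2. Download the tarball and store it in a local cache directory (~/.npm here).
  const cacheDir = join(homedir(), ".npm", "_sketch-cache");
  mkdirSync(cacheDir, { recursive: true });
  const tarballPath = join(cacheDir, `${name}-${version}.tgz`);
  writeFileSync(tarballPath, Buffer.from(await (await fetch(tarballUrl)).arrayBuffer()));

  // 3. Extract the package into the current project's node_modules directory.
  // npm tarballs wrap their contents in a top-level "package/" folder, hence --strip-components.
  const dest = join(process.cwd(), "node_modules", name);
  mkdirSync(dest, { recursive: true });
  execFileSync("tar", ["-xzf", tarballPath, "-C", dest, "--strip-components=1"]);
}

installPackage("left-pad", "1.3.0").catch(console.error);
```

The real npm client additionally resolves the full dependency tree, verifies package integrity, and writes a lock file; those parts are covered below.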

It is important to note that there are some differences between npm2 and npm3.

npm2 Nesting Hell#

Installing dependencies with npm2 is simple and direct: packages are downloaded and laid out on disk exactly as the dependency tree dictates, producing the nested node_modules structure. Direct dependencies sit directly under node_modules, and each sub-dependency is nested inside its parent dependency's node_modules.

For example, if the project depends on A and C, and both A and C depend on the same version of B@1.0, while C also depends on D@1.0.0, the structure of node_modules would be as follows:

node_modules
├── A@1.0.0
│   └── node_modules
│       └── B@1.0.0
└── C@1.0.0
    └── node_modules
        ├── B@1.0.0
        └── D@1.0.0

As can be seen, the same version of B is installed twice by A and C.

If the levels of dependencies increase and the number of dependency packages grows, over time, this will lead to nesting hell:


npm3#

Flattening Nesting

To address the issues present in npm2, npm3 proposed a new solution: flattening dependencies.

npm v3 adopts a flattened node_modules structure by "hoisting" sub-dependencies: instead of being nested, they are installed as high up in node_modules as possible, alongside the direct dependencies.

For example, if the project depends on A and C, and A depends on B@1.0.0, while C depends on B@2.0.0, then:

node_modules
├── A@1.0.0
├── B@1.0.0
└── C@1.0.0
    └── node_modules
        └── B@2.0.0

As can be seen, A's sub-dependency B@1.0 is no longer located under A's node_modules, but is at the same level as A. Due to versioning, C's dependency B@2.0 remains in C's node_modules.

This avoids a large number of duplicate package installations and prevents overly deep dependency hierarchies, thus solving the dependency hell problem.
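As a rough illustration of this hoisting rule (a sketch, not npm's actual algorithm; the Dep and Layout types are invented for the example), the function below walks a dependency tree and hoists each sub-dependency to the top level unless a different version of the same package already occupies that slot, in which case the copy stays nested under its parent.

```typescript
// Hypothetical data shapes for illustration; not npm's internal types.
interface Dep { name: string; version: string; deps?: Dep[] }

interface Layout {
  topLevel: Map<string, string>;                 // package name -> hoisted version
  nested: Map<string, Map<string, string>>;      // parent name -> its private node_modules
}

function hoist(rootDeps: Dep[]): Layout {
  const layout: Layout = { topLevel: new Map(), nested: new Map() };

  const place = (dep: Dep, parent?: Dep): void => {
    const existing = layout.topLevel.get(dep.name);
    if (existing === undefined) {
      // No conflict at the top level: hoist it.
      layout.topLevel.set(dep.name, dep.version);
    } else if (existing !== dep.version && parent) {
      // A different version already occupies the top level: keep this copy
      // inside the parent's own node_modules.
      if (!layout.nested.has(parent.name)) layout.nested.set(parent.name, new Map());
      layout.nested.get(parent.name)!.set(dep.name, dep.version);
    }
    for (const child of dep.deps ?? []) place(child, dep);
  };

  for (const dep of rootDeps) place(dep);
  return layout;
}

// The example from the text: A depends on B@1.0.0, C depends on B@2.0.0.
const layout = hoist([
  { name: "A", version: "1.0.0", deps: [{ name: "B", version: "1.0.0" }] },
  { name: "C", version: "1.0.0", deps: [{ name: "B", version: "2.0.0" }] },
]);
console.log(layout.topLevel); // Map { A -> 1.0.0, B -> 1.0.0, C -> 1.0.0 }
console.log(layout.nested);   // Map { C -> Map { B -> 2.0.0 } }
```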

So why is B@1.0 hoisted to node_modules rather than B@2.0? And if B is extracted directly into our node_modules, does that mean we can reference the B package directly in our own code? This leads to the next set of questions:

Uncertainty

This approach naturally raises doubts: if the project references different versions of the same package at the same time, which one gets hoisted? Will the same version be hoisted after every run of npm i? In other words, even with an identical package.json, the installed node_modules structure may differ.

For example:

  • A@1.0.0: B@1.0.0
  • C@1.0.0: B@2.0.0

After installation, should B's version be hoisted to 1.0 or 2.0?

node_modules
├── A@1.0.0
├── B@1.0.0
└── C@1.0.0
    └── node_modules
        └── B@2.0.0

or

node_modules
├── A@1.0.0
│   └── node_modules
│       └── B@1.0.0
├── B@2.0.0
└── C@1.0.0

Many people assume the hoisted package is determined by its order in package.json, with the earlier entry winning. In reality, a look at the source code shows that npm sorts dependencies with localeCompare, so the package whose name comes earlier in lexicographic order is hoisted first.
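A tiny sketch of that idea (illustrative only; the real hoisting happens deep inside npm's tree-building code):

```typescript
// A@1.0.0 depends on B@1.0.0; C@1.0.0 depends on B@2.0.0.
const directDeps = ["C", "A"]; // the order written in package.json

// npm sorts dependency names with localeCompare, so A is processed before C
// regardless of package.json order, and A's copy of B (B@1.0.0) claims the
// top-level node_modules slot first.
const visitOrder = [...directDeps].sort((a, b) => a.localeCompare(b));
console.log(visitOrder); // [ "A", "C" ]  ->  B@1.0.0 is the version that gets hoisted
```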

Phantom Dependencies

Phantom dependencies are packages that are not listed in package.json but are nonetheless used by the project. Because of flattening they end up at the top of node_modules and can be accessed directly, even though that access is illegitimate. dayjs is a common example.

For instance, my project uses arco, and arco's sub-dependencies include dayjs. Under the flattening rule, dayjs is placed at the top level of node_modules. This creates a serious problem: the moment arco drops dayjs from its own dependencies, our code breaks outright, even though our package.json never changed.
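A minimal illustration of the trap (the arco package name and version range here are only an example):

```typescript
// package.json declares only arco, e.g.:
//   "dependencies": { "@arco-design/web-react": "^2.0.0" }
//
// This import still resolves, but only because dayjs, a sub-dependency of arco,
// happens to be hoisted to the top of node_modules. Nothing guarantees it stays there.
import dayjs from "dayjs";

console.log(dayjs().format("YYYY-MM-DD"));
// If a future arco release removes or replaces dayjs, this module stops resolving
// even though our own package.json never changed.
```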

Dependency Doppelgängers

Assuming we continue to install module D that depends on B@1.0 and module E that depends on B@2.0, at this point:

  • A and D depend on B@1.0
  • C and E depend on B@2.0

Here is the structure of node_modules with B@1.0 hoisted:

node_modules
├── A@1.0.0
├── B@1.0.0
├── D@1.0.0
├── C@1.0.0
│   └── node_modules
│       └── B@2.0.0
└── E@1.0.0
    └── node_modules
        └── B@2.0.0

As can be seen, B@2.0 is installed twice. In fact, whichever version is hoisted, B@1.0 or B@2.0, duplicate copies of B still end up installed. These duplicated installations of B are referred to as "doppelgängers."

Moreover, although modules C and E both appear to depend on B@2.0, they are not actually referencing the same copy of B. If B does any caching or performs side effects before being exported, users of the project may run into errors.
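The "not the same B" problem can be made concrete with a sketch. Suppose B@2.0.0's entry module keeps module-level state (a hypothetical cache); each physical copy of that file on disk is loaded as a separate module instance with its own state:

```typescript
// Imagine this file is B@2.0.0's entry module (hypothetical package code).
// `cache` is module-level state, so every copy of the file loaded from a
// different path becomes a separate module instance with its own cache.
const cache = new Map<string, number>();

export function remember(key: string, value: number): void {
  cache.set(key, value);
}

export function recall(key: string): number | undefined {
  return cache.get(key);
}

// C resolves "B" to node_modules/C/node_modules/B and calls remember("answer", 42).
// E resolves "B" to node_modules/E/node_modules/B -- a different file on disk, and
// therefore a different module instance -- so recall("answer") from E's side returns
// undefined. Shared caches, singletons, and instanceof checks all break this way.
```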

npm install

The steps npm 3 and later follow when installing dependencies are (a simplified decision sketch follows the list):

  1. Check Configuration: Read the npm config and .npmrc settings, such as the registry mirror.
  2. Determine Dependency Versions, Build Dependency Tree: Check whether package-lock.json exists. If it does, compare it against package.json; the exact handling depends on the npm version, but under current npm rules a lockfile entry that is compatible with package.json is installed from the lockfile, and otherwise the version comes from package.json. If no lockfile exists, dependency information is determined from package.json alone.
  3. Check Cache or Download: If the package is already in the cache, extract it from the cache into node_modules and generate package-lock.json; if not, download the resource package, verify its integrity, add it to the cache, then extract it into node_modules and generate package-lock.json.
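A simplified sketch of steps 2 and 3 (not npm's real code; the lockfile and manifest shapes are invented, and semver is the public semver package):

```typescript
import { existsSync } from "node:fs";
import semver from "semver";

// Hypothetical shapes for illustration.
interface LockEntry { version: string; resolved: string }
type LockFile = Record<string, LockEntry>;

// Step 2: decide which version to install for a dependency.
function resolveVersion(name: string, range: string, lock?: LockFile): string | "needs-registry" {
  const locked = lock?.[name];
  if (locked && semver.satisfies(locked.version, range)) {
    return locked.version;   // lockfile version is compatible with package.json: keep it
  }
  return "needs-registry";   // fall back to package.json range + registry metadata
}

// Step 3: use the cache when possible, otherwise download and fill it.
function fetchPackage(name: string, version: string, cacheDir: string): "from-cache" | "downloaded" {
  const cached = `${cacheDir}/${name}-${version}.tgz`;
  if (existsSync(cached)) return "from-cache"; // extract straight into node_modules
  // ...download the tarball, verify integrity, write it into the cache, then extract...
  return "downloaded";
}
```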


yarn#

Parallel Installation#

Installing packages with npm or yarn generates a series of install tasks. Before npm 5, npm executed those tasks serially, in package order: the next package did not start installing until the previous one had finished completely. Yarn instead runs these operations in parallel, maximizing resource utilization and making installation noticeably faster.
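A rough sketch of the difference (the package list and installOne are placeholders that merely simulate work):

```typescript
// Placeholder for the real work of fetching and extracting one package.
async function installOne(pkg: string): Promise<void> {
  await new Promise((resolve) => setTimeout(resolve, 100)); // simulate network + disk time
  console.log(`installed ${pkg}`);
}

const packages = ["react", "lodash", "dayjs", "axios", "vue", "eslint"];

// Serial, pre-npm5 style: each package waits for the previous one.
async function installSerially(): Promise<void> {
  for (const pkg of packages) await installOne(pkg);
}

// Parallel, yarn style: run up to `limit` installs at once.
async function installInParallel(limit = 3): Promise<void> {
  const queue = [...packages];
  const workers = Array.from({ length: limit }, async () => {
    while (queue.length > 0) {
      const pkg = queue.shift()!;
      await installOne(pkg);
    }
  });
  await Promise.all(workers);
}

installInParallel().catch(console.error);
```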

yarn.lock#

We know that the versions npm actually installs from a given package.json are not always consistent, because package.json is written in terms of semantic version ranges: in theory a released patch should contain only non-breaking fixes, but in practice that is not always the case. As a result, npm's strategy may lead to two devices using the same package.json yet installing different versions of packages, which can cause failures.
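For example, assuming a hypothetical dependency declared as ^1.2.0, two installs run at different times can legitimately resolve to different versions (the semver package is used here only to check the ranges):

```typescript
import semver from "semver";

// package.json says: "some-lib": "^1.2.0" (hypothetical dependency)
const range = "^1.2.0";

// Device 1 installed while 1.2.5 was the latest; device 2 installs after 1.3.0 ships.
console.log(semver.satisfies("1.2.5", range)); // true
console.log(semver.satisfies("1.3.0", range)); // true  -> same package.json, different trees
console.log(semver.satisfies("2.0.0", range)); // false -> a new major is never picked up by ^

// yarn.lock pins the exact resolution, e.g. "some-lib@^1.2.0" -> version "1.2.5",
// so both devices end up with the same bytes on disk.
```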

To prevent pulling different versions of packages, yarn uses a lock file to record the exact version numbers of the modules installed. Each time a module is added, yarn creates (or updates) a file named yarn.lock. This way, every time the dependencies of the same project are pulled, it ensures the same module versions are used.

The yarn.lock file only locks versions; it does not record the structure of the dependency tree, so it must be combined with package.json to determine the final layout. This is explained in more detail in the install process below.

yarn.lock lists all dependency packages in a flattened form; packages that share a name but have incompatible semver ranges appear as separate entries at the same level of the file.

yarn install#

After executing yarn install, it goes through five stages:

  1. Validating package.json: Check the system runtime environment, including OS, CPU, engines, etc.
  2. Resolving packages: Integrate dependency information.
  3. Fetching packages: First, check if there are cached resources in the cache directory, then read the file system; if neither exists, download from the Registry.
  4. Linking dependencies: Copy dependencies into node_modules. First resolve peerDependencies information, then, following yarn's flattening rules (which differ from npm's: the most frequently used version is installed at the top level, a process called dedupe), copy the dependencies from the cache into the current project's node_modules directory.
  5. Building fresh packages: This process executes install-related hooks, including preinstall, install, postinstall.

Resolving packages: First, build the first-level dependency set from the dependencies, devDependencies, and optionalDependencies fields of the project's package.json, then recursively resolve nested dependencies level by level. A Set records packages that have been resolved or are currently being resolved, so packages within the same version range are not resolved twice. Along the way, yarn.lock and the Registry are consulted (with yarn.lock taking priority) to obtain concrete versions, download addresses, hash values, sub-dependencies, and so on, which finally pins down each dependency's version and download address.

The process can be summarized in two parts (a condensed sketch follows the list):

  • Collect first-level dependencies: merge the dependencies, devDependencies, and optionalDependencies lists from package.json, together with the top-level packages listed in workspaces, into a first-level dependency set in the form "package name@version range" (conceptually a string array).
  • Traverse all dependencies and collect their concrete information: starting from the first-level set, consult yarn.lock and the Registry to obtain exact versions, download addresses, hash values, sub-dependencies, and so on.
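A condensed sketch of that traversal (the Resolved shape and lookup function are invented for the example; the real resolver consults yarn.lock before falling back to the Registry):

```typescript
// Hypothetical shapes for illustration.
interface Resolved { version: string; tarball: string; dependencies: Record<string, string> }
type Lookup = (name: string, range: string) => Promise<Resolved>;

async function resolveAll(firstLevel: Record<string, string>, lookup: Lookup) {
  const seen = new Set<string>();               // "name@range" already resolved or in flight
  const resolved = new Map<string, Resolved>(); // "name@range" -> concrete info

  async function resolve(name: string, range: string): Promise<void> {
    const key = `${name}@${range}`;
    if (seen.has(key)) return;                  // same version range: resolve only once
    seen.add(key);
    const info = await lookup(name, range);     // yarn.lock first, then the Registry
    resolved.set(key, info);
    // Recurse into sub-dependencies, level by level.
    await Promise.all(
      Object.entries(info.dependencies).map(([dep, depRange]) => resolve(dep, depRange)),
    );
  }

  await Promise.all(
    Object.entries(firstLevel).map(([name, range]) => resolve(name, range)),
  );
  return resolved;
}
```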


pnpm#

pnpm stands for performant npm. As the official introduction puts it, it is a fast, disk-space-efficient package manager. In essence, pnpm offers two advantages:

  • Extremely fast package installation speed
  • Highly efficient disk space utilization

So how does pnpm achieve such a significant performance boost? Through a file-system mechanism called the hard link. A hard link lets the same file be reached through different paths. pnpm keeps the actual package files in a global store, and what is placed inside the project's node_modules are hard links to those files.

A hard link can be thought of as a copy of the source file that takes up no extra space; what is installed in the project are, in effect, these copies, through which the source file is reached. Because pnpm keeps the real files in the global store, different projects resolve the same dependencies to the same store entries, which saves a great deal of disk space.

Hard links are connected through inodes. In the Linux file system, every file stored on a disk partition is assigned a number called the inode index, and multiple file names can point to the same inode. For example, if A is a hard link to B (both A and B are file names), the inode number in A's directory entry is the same as the one in B's: one inode corresponds to two different file names, and both names point to the same file. To the file system, A and B are completely equal, and deleting either one does not affect access through the other.

A symbolic link, also known as a soft link, can be thought of as a shortcut: the link file merely records the path of its target, and pnpm uses symbolic links to point at the corresponding dependency directories on disk. A symlink cannot stand on its own: when the source file is deleted, the link's file name remains, but its contents can no longer be read through it.

Deleting the source file therefore affects what the symlink resolves to; if the file is deleted and later restored, the symlink works again, because it records only a path. A symlink can even point at a file that does not exist, which is what is commonly called a "broken link."
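The difference between the two link types can be observed directly with Node's fs module (a throwaway demo for Linux or macOS; the file names are arbitrary):

```typescript
import { writeFileSync, linkSync, symlinkSync, statSync, lstatSync, rmSync } from "node:fs";

// Throwaway demo files in the current directory.
writeFileSync("source.txt", "hello from the store\n");

// Hard link: a second directory entry for the same inode.
linkSync("source.txt", "hard-link.txt");
console.log(statSync("source.txt").ino === statSync("hard-link.txt").ino); // true: same inode

// Symbolic link: a small file that merely records the target path.
symlinkSync("source.txt", "soft-link.txt");
console.log(lstatSync("soft-link.txt").isSymbolicLink()); // true
console.log(statSync("soft-link.txt").ino === statSync("source.txt").ino); // true: stat follows the link

// Deleting the source breaks the symlink but not the hard link.
rmSync("source.txt");
console.log(statSync("hard-link.txt").size); // still readable through the remaining inode reference
try {
  statSync("soft-link.txt");
} catch {
  console.log("soft link is now dangling"); // the path it records no longer exists
}

rmSync("hard-link.txt");
rmSync("soft-link.txt");
```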


This new mechanism design is very clever, as it not only accommodates node's dependency resolution but also solves the following problems:

  • Phantom dependency issue: Only direct dependencies will be expanded in the node_modules directory; sub-dependencies will not be hoisted, thus avoiding phantom dependencies.

  • Dependency doppelgänger issue: The same dependency will only be installed once in the global store. The project only contains copies of the source files, which occupy almost no space, thus eliminating the dependency doppelgänger issue.

  • The greatest advantage is saving disk space; each package is stored only once in the global store, while the rest are soft links or hard links.

Drawbacks
  • Global hard links can also cause problems. If the linked code is modified in place, every project that links to it is affected; the scheme is also unfriendly to postinstall scripts, since code modified during postinstall can break other projects. pnpm defaults to a copy-on-write (CoW) strategy to mitigate this, but that setting does not take effect on Mac, which is actually due to missing support in Node; see the corresponding issue.

  • Since the node_modules dependencies created by pnpm are soft links, pnpm cannot be used in environments that do not support soft links, such as Electron applications.


References:
https://mp.weixin.qq.com/s/9JCs3rCmVuGT3FvKxXMJwg
