Git[7] is a
version control system for tracking changes in
computer files and coordinating work on those files among multiple people. It is primarily used for
source code management in
software development,
[8] but it can be used to keep track of changes in any set of files. As a
distributed revision control system it is aimed at speed,
[9] data integrity,
[10] and support for distributed, non-linear workflows.
[11]
Git was created by
Linus Torvalds in 2005 for development of the
Linux kernel, with other kernel developers contributing to its initial development.
[12] Since 2005 it has been maintained by
Junio Hamano.
As with most other distributed version control systems, and unlike most
client–server systems, every Git
directory on every
computer is a full-fledged
repository with complete history and full version tracking abilities, independent of network access or a central server.
[13]
Git is
free software distributed under the terms of the
GNU General Public License version 2.
History
Git development began in April 2005, after many developers of the
Linux kernel gave up access to
BitKeeper, a proprietary
source control management (SCM) system that they had formerly used to maintain the project.
[14] The copyright holder of BitKeeper,
Larry McVoy, had withdrawn free use of the product after claiming that
Andrew Tridgell had
reverse-engineered the BitKeeper protocols.
[15] (The same incident would also spur the creation of another version control system,
Mercurial.)
Linus Torvalds
wanted a distributed system that he could use like BitKeeper, but none
of the available free systems met his needs, especially for performance.
Torvalds cited an example of a source-control management system needing
30 seconds to apply a patch and update all associated metadata, and
noted that this would not scale to the needs of Linux kernel
development, where syncing with fellow maintainers could require 250
such actions at once. For his design criteria, he specified that
patching should take no more than three seconds,
[9] and added three more points:
- Take Concurrent Versions System (CVS) as an example of what not to do; if in doubt, make the exact opposite decision[11]
- Support a distributed, BitKeeper-like workflow[11]
- Include very strong safeguards against corruption, either accidental or malicious[10]
These criteria eliminated every
version control system in use at the time except
Monotone, which was in turn excluded by performance considerations.
[11] So, immediately after the 2.6.12-rc2 Linux kernel development release, Torvalds set out to write his own system.
[11]
Torvalds quipped about the name
git (which means
unpleasant person in
British English slang): "I'm an egotistical bastard, and I name all my projects after myself. First '
Linux', now 'git'."
[16][17] The
man page describes Git as "the stupid content tracker".
[18] The readme file of the source code elaborates further:
[19]
The name "git" was given by Linus Torvalds when he wrote the very
first version. He described the tool as "the stupid content tracker"
and the name as (depending on your mood):
- random three-letter combination that is pronounceable, and not actually used by any common UNIX command. The fact that it is a mispronunciation of "get" may or may not be relevant.
- stupid. contemptible and despicable. simple. Take your pick from the dictionary of slang.
- "global information tracker": you're in a good mood, and it actually works for you. Angels sing, and a light suddenly fills the room.
- "goddamn idiotic truckload of shit": when it breaks
The development of Git began on 3 April 2005.
[20] Torvalds announced the project on 6 April;
[21] it became
self-hosting as of 7 April.
[20] The first merge of multiple branches took place on 18 April.
[22]
Torvalds achieved his performance goals; on 29 April, the nascent Git
was benchmarked recording patches to the Linux kernel tree at the rate
of 6.7 patches per second.
[23] On 16 June Git managed the kernel 2.6.12 release.
[24]
Torvalds turned over
maintenance on 26 July 2005 to
Junio Hamano, a major contributor to the project.
[25] Hamano was responsible for the 1.0 release on 21 December 2005, and remains the project's maintainer.
[26]
Releases
| Version | Original release date[citation needed] | Latest version | Release date[citation needed] |
| 0.99 | 2005-07-11 | 0.99.9n | 2005-12-15 |
| 1.0 | 2005-12-21 | 1.0.13 | 2006-01-27 |
| 1.1 | 2006-01-08 | 1.1.6 | 2006-01-30 |
| 1.2 | 2006-02-12 | 1.2.6 | 2006-04-08 |
| 1.3 | 2006-04-18 | 1.3.3 | 2006-05-16 |
| 1.4 | 2006-06-10 | 1.4.4.5 | 2008-07-16 |
| 1.5 | 2007-02-14 | 1.5.6.6 | 2008-12-17 |
| 1.6 | 2008-08-17 | 1.6.6.3 | 2010-12-15 |
| 1.7 | 2010-02-13 | 1.7.12.4 | 2012-10-17 |
| 1.8 | 2012-10-21 | 1.8.5.6 | 2014-12-17 |
| 1.9 | 2014-02-14 | 1.9.5 | 2014-12-17 |
| 2.0 | 2014-05-28 | 2.0.5 | 2014-12-17 |
| 2.1 | 2014-08-16 | 2.1.4 | 2014-12-17 |
| 2.2 | 2014-11-26 | 2.2.3 | 2015-09-04 |
| 2.3 | 2015-02-05 | 2.3.10 | 2015-09-29 |
| 2.4 | 2015-04-30 | 2.4.12 | 2017-05-05 |
| 2.5 | 2015-07-27 | 2.5.6 | 2017-05-05 |
| 2.6 | 2015-09-28 | 2.6.7 | 2017-05-05 |
| 2.7 | 2015-10-04 | 2.7.5 | 2017-05-05 |
| 2.8 | 2016-03-28 | 2.8.5 | 2017-05-05 |
| 2.9 | 2016-06-13 | 2.9.4 | 2017-05-05 |
| 2.10 | 2016-09-02 | 2.10.3 | 2017-05-05 |
| 2.11 | 2016-11-29 | 2.11.2 | 2017-05-05 |
| 2.12 | 2017-02-24 | 2.12.3 | 2017-05-05 |
| 2.13 | 2017-05-10 | 2.13.4 | 2017-08-01 |
| 2.14 | 2017-08-04 | 2.14.3 | 2017-10-24 |
| 2.15 | 2017-10-30 | 2.15.1 | 2017-11-28 |
Design
Git's design was inspired by
BitKeeper and
Monotone.
[27][28]
Git was originally designed as a low-level version control system
engine on top of which others could write front ends, such as
Cogito or
StGIT.
[28] The core Git project has since become a complete version control system that is usable directly.
[29] While strongly influenced by BitKeeper, Torvalds deliberately avoided conventional approaches, leading to a unique design.
[30]
Characteristics
Git's
design is a synthesis of Torvalds's experience with Linux in
maintaining a large distributed development project, along with his
intimate knowledge of file system performance gained from the same
project and the urgent need to produce a working system in short order.
These influences led to the following implementation choices:[citation needed]
- Strong support for non-linear development
- Git supports rapid branching and merging, and includes specific
tools for visualizing and navigating a non-linear development history.
A core assumption in Git is that a change will be merged more often
than it is written, as it is passed around to various reviewers.
Branches are very lightweight: a branch is only a reference to one
commit. By following that commit's parents, the full branch structure can be
reconstructed.
- Distributed development
- Like Darcs, BitKeeper, Mercurial, SVK, Bazaar, and Monotone,
Git gives each developer a local copy of the full development history
and changes are copied from one such repository to another. These
changes are imported as added development branches, and can be merged in
the same way as a locally developed branch.
- Compatibility with existing systems and protocols
- Repositories can be published via Hypertext Transfer Protocol (HTTP), File Transfer Protocol (FTP), rsync (removed in Git 2.8.0[31]), or a Git protocol over either a plain socket or Secure Shell
(SSH). Git also has a CVS server emulation, which enables the use of
existing CVS clients and IDE plugins to access Git repositories. Subversion and SVK repositories can be used directly with git-svn.
- Efficient handling of large projects
- Torvalds has described Git as being very fast and scalable,[32] and performance tests done by Mozilla[33] showed it to be an order of magnitude
faster than some version control systems. Fetching version history
from a locally stored repository can be one hundred times faster than
fetching it from the remote server.[34]
- Cryptographic authentication of history
- The Git history is stored in such a way that the ID of a particular version (a commit
in Git terms) depends upon the complete development history leading up
to that commit. Once it is published, it is not possible to change the
old versions without it being noticed. The structure is similar to a Merkle tree, but with added data at the nodes and leaves.[35] (Mercurial and Monotone also have this property.)
- Toolkit-based design
- Git was designed as a set of programs written in C, and several shell scripts that provide wrappers around those programs.[36]
Although most of those scripts have since been rewritten in C for speed
and portability, the design remains, and it is easy to chain the
components together.[37]
- Pluggable merge strategies
- As part of its toolkit design, Git has a well-defined model of an
incomplete merge, and it has multiple algorithms for completing it,
culminating in telling the user that it is unable to complete the merge
automatically and that manual editing is needed.
- Garbage accumulates until collected
- Aborting operations or backing out changes will leave useless
dangling objects in the database. These are generally a small fraction
of the continuously growing history of wanted objects. Git will
automatically perform garbage collection when enough loose objects have been created in the repository. Garbage collection can be called explicitly using
git gc --prune.[38]
- Periodic explicit object packing
- Git stores each newly created object as a separate file. Although
individually compressed, this takes a great deal of space and is
inefficient. This is solved by the use of packs that store a large number of objects delta-compressed among themselves in one file (or network byte stream) called a packfile. Packs are compressed using the heuristic
that files with the same name are probably similar, but do not depend
on it for correctness. A corresponding index file is created for each
packfile, telling the offset of each object in the packfile. Newly
created objects (with newly added history) are still stored as single
objects and periodic repacking is needed to maintain space efficiency.
The process of packing the repository can be very computationally
costly. By allowing objects to exist in the repository in a loose but
quickly generated format, Git allows the costly pack operation to be
deferred until later, when time matters less, e.g., the end of a work
day. Git does periodic repacking automatically but manual repacking is
also possible with the git gc command. For data integrity, both the packfile and its index have an SHA-1
checksum inside and the file name of the packfile also contains an
SHA-1 checksum. To check the integrity of a repository, run the git fsck command.
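The "cryptographic authentication of history" property listed above can be illustrated with a small sketch. The `commit_id` function and its payload layout are hypothetical simplifications for illustration, not Git's actual commit encoding; the point is only that each ID is computed over the parent IDs, so altering an old commit changes the ID of every later one.

```python
import hashlib
from typing import List


def commit_id(tree_id: str, parent_ids: List[str], message: str) -> str:
    """Hypothetical, simplified commit ID: SHA-1 over tree, parents and message."""
    lines = ["tree " + tree_id]
    lines += ["parent " + p for p in parent_ids]
    lines += ["", message]
    return hashlib.sha1("\n".join(lines).encode("utf-8")).hexdigest()


# A linear history of three commits (tree IDs are placeholder strings).
c1 = commit_id("tree-a", [], "initial import")
c2 = commit_id("tree-b", [c1], "add feature")
c3 = commit_id("tree-c", [c2], "fix bug")

# The same history with only the first commit's message altered.
t1 = commit_id("tree-a", [], "initial import (tampered)")
t2 = commit_id("tree-b", [t1], "add feature")
t3 = commit_id("tree-c", [t2], "fix bug")

# Every descendant ID differs, so the change cannot go unnoticed.
history_changed = (c1 != t1) and (c2 != t2) and (c3 != t3)
```

Anyone who already holds the original ID of the latest commit would detect the rewritten history, because no later ID matches.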
Another property of Git is that it snapshots directory trees of
files. The earliest systems for tracking versions of source code,
Source Code Control System (SCCS) and
Revision Control System (RCS), worked on individual files and emphasized the space savings to be gained from
interleaved deltas (SCCS) or
delta encoding
(RCS) of the (mostly similar) versions. Later revision control systems
maintained this notion of a file having an identity across multiple
revisions of a project. However, Torvalds rejected this concept.
[39] Consequently, Git does not explicitly record file revision relationships at any level below the source code tree.
These implicit revision relationships have some significant consequences:
- It is slightly more costly to examine the change history of one file than the whole project.[40]
To obtain a history of changes affecting a given file, Git must walk
the global history and then determine whether each change modified that
file. This method of examining history does, however, let Git produce
with equal efficiency a single history showing the changes to an
arbitrary set of files. For example, a subdirectory of the source tree
plus an associated global header file is a very common case.
- Renames are handled implicitly rather than explicitly. A common complaint with CVS
is that it uses the name of a file to identify its revision history, so
moving or renaming a file is not possible without either interrupting
its history, or renaming the history and thereby making the history
inaccurate. Most post-CVS revision control systems solve this by giving a
file a unique long-lived name (analogous to an inode number) that survives renaming. Git does not record such an identifier, and this is claimed as an advantage.[41][42] Source code files are sometimes split or merged, or simply renamed,[43]
and recording this as a simple rename would freeze an inaccurate
description of what happened in the (immutable) history. Git addresses
the issue by detecting renames while browsing the history of snapshots
rather than recording it when making the snapshot.[44] (Briefly, given a file in revision N, a file of the same name in revision N−1 is its default ancestor. However, when there is no like-named file in revision N−1, Git searches for a file that existed only in revision N−1 and is very similar to the new file.) This approach requires more CPU-intensive
work every time the history is reviewed, and several options to adjust
the heuristics are available. The mechanism does not always work:
sometimes a file that is renamed and changed in the same commit is read
as a deletion of the old file and the creation of a new file.
Developers can work around this limitation by committing the rename and
the changes separately.
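The rename heuristic described above can be sketched as follows. This is an illustration under stated assumptions, not Git's algorithm: snapshots are modelled as plain dictionaries mapping paths to text, `difflib.SequenceMatcher` stands in for Git's similarity measure, and the `find_ancestor` name and the 0.5 threshold are chosen for the example.

```python
from difflib import SequenceMatcher


def find_ancestor(path, new_tree, old_tree, threshold=0.5):
    """Return the path in old_tree treated as the ancestor of `path`, or None.

    new_tree and old_tree map file paths to file contents (revisions N and N-1).
    """
    # A file of the same name in revision N-1 is the default ancestor.
    if path in old_tree:
        return path
    best_path, best_score = None, threshold
    for old_path, old_content in old_tree.items():
        if old_path in new_tree:
            continue  # still present in revision N, so not renamed away
        score = SequenceMatcher(None, old_content, new_tree[path]).ratio()
        if score >= best_score:
            best_path, best_score = old_path, score
    return best_path


old = {"util.c": "int add(int a, int b) { return a + b; }\n",
       "main.c": "int main(void) { return 0; }\n"}
new = {"math.c": "int add(int a, int b) { return a + b; }\n",
       "main.c": "int main(void) { return 0; }\n",
       "NOTES": "####"}

renamed_from = find_ancestor("math.c", new, old)   # candidate: util.c
unchanged = find_ancestor("main.c", new, old)      # same name in N-1
brand_new = find_ancestor("NOTES", new, old)       # nothing similar enough
```

Because the comparison runs when history is inspected rather than when the snapshot is made, the cost is paid on every review, which matches the trade-off described in the text.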
Git implements several merging strategies; a non-default strategy can be selected at merge time:
[45]
- resolve: the traditional three-way merge algorithm.
- recursive: This is the default when pulling or merging one branch, and is a variant of the three-way merge algorithm.
When there is more than one common ancestor that can be used for a
three-way merge, it creates a merged tree of the common ancestors and
uses that as the reference tree for the three-way merge. In tests
on prior merge commits taken from Linux 2.6 kernel
development history, this has been reported to result in fewer merge
conflicts without causing mis-merges. It can also detect and handle
merges involving renames.
- octopus: This is the default when merging more than two heads.
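The three-way merge underlying the resolve and recursive strategies can be sketched at whole-file granularity. This is a minimal illustration that assumes each snapshot is a dictionary from path to content; the `three_way_merge` function is hypothetical, and it does not attempt line-level merging, rename handling, or the recursive construction of a merged ancestor.

```python
def three_way_merge(base, ours, theirs):
    """Merge two path->content mappings against their common ancestor.

    Returns (merged, conflicts), where conflicts lists paths that both
    sides changed in different ways.
    """
    merged, conflicts = {}, []
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:        # both sides agree (this includes both deleting)
            result = o
        elif o == b:      # only their side changed the file
            result = t
        elif t == b:      # only our side changed the file
            result = o
        else:             # both sides changed it differently
            conflicts.append(path)
            continue
        if result is not None:   # None means the file was deleted
            merged[path] = result
    return merged, conflicts


base = {"a": "1", "b": "2", "c": "3"}
ours = {"a": "1x", "b": "2", "c": "3"}
theirs = {"a": "1", "b": "2y"}            # their side deleted "c"

clean_merge, clean_conflicts = three_way_merge(base, ours, theirs)
_, clash = three_way_merge(base, {"a": "1x"}, {"a": "1y"})
```

The common ancestor is what lets the merge tell a one-sided change from a genuine conflict; a conflicting path is left for manual resolution, as the text describes.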
Data structures
Git's primitives are not inherently a
source-code management system. Torvalds explains,
[47]
In many ways you can just see git as a filesystem – it's content-addressable, and it has a notion of versioning, but I really really designed it coming at the problem from the viewpoint of a filesystem person (hey, kernels is what I do), and I actually have absolutely zero interest in creating a traditional SCM system.
From this initial design approach, Git has developed the full set of features expected of a traditional SCM,
[29] with features mostly being created as needed, then refined and extended over time.
Some data flows and storage levels in the Git revision control system.
Git has two
data structures: a mutable
index (also called
stage or
cache) that caches information about the working directory and the next revision to be committed; and an immutable, append-only
object database.
The index serves as the connection point between the object database and the working tree.
The object database contains four types of objects:
- A blob (binary large object) is the content of a file. Blobs have no proper file name, time stamps, or other metadata. (A blob's name internally is a hash of its content.)
- A tree object is the equivalent of a directory. It contains a
list of file names, each with some type bits and a reference to a blob
or tree object that is that file, symbolic link, or directory's
contents. These objects are a snapshot of the source tree. (Taken as a whole,
they form a Merkle tree,
meaning that the single hash of the root tree is sufficient, and
is what commits actually use, to pinpoint the exact state of a whole
tree structure with any number of subdirectories and files.)
- A commit object links tree objects together into a history.
It contains the name of a tree object (of the top-level source
directory), a time stamp, a log message, and the names of zero or more
parent commit objects.
- A tag object is a container that holds a reference to
another object and can hold added metadata about that object.
Most commonly, it is used to store a digital signature of a commit object corresponding to a particular release of the data being tracked by Git.
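The Merkle-tree property of tree objects noted above can be sketched with nested dictionaries. The `tree_hash` function and its byte layout are hypothetical and are not Git's on-disk tree format; the sketch only shows that a directory's hash is derived from its children's hashes, so one root hash identifies the whole tree.

```python
import hashlib


def tree_hash(node):
    """Hash a snapshot: bytes are file contents, dicts are directories."""
    if isinstance(node, bytes):
        return hashlib.sha1(b"blob\0" + node).hexdigest()
    # A directory's hash covers each entry name and the entry's own hash.
    entries = ["%s %s" % (name, tree_hash(child))
               for name, child in sorted(node.items())]
    return hashlib.sha1(("tree\0" + "\n".join(entries)).encode("utf-8")).hexdigest()


snapshot = {"README": b"hello\n",
            "src": {"main.c": b"int main(void) { return 0; }\n"},
            "docs": {"intro.txt": b"intro\n"}}
edited = {"README": b"hello\n",
          "src": {"main.c": b"int main(void) { return 1; }\n"},
          "docs": {"intro.txt": b"intro\n"}}

root_changed = tree_hash(snapshot) != tree_hash(edited)
docs_shared = tree_hash(snapshot["docs"]) == tree_hash(edited["docs"])
```

A change deep in `src` alters the root hash, while the untouched `docs` subtree keeps the same hash and could therefore be shared between the two snapshots.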
Each object is identified by a SHA-1
hash
of its contents. Git computes the hash, and uses this value for the
object's name. The object is put into a directory matching the first two
characters of its hash. The rest of the hash is used as the file name
for that object.
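The naming scheme can be sketched in a few lines. The text above does not spell out the exact bytes that are hashed; the header used here (object type, a space, the content length, and a NUL byte, followed by the content) is taken from Git's documented object format rather than from this article, and the helper names are chosen for illustration.

```python
import hashlib


def object_id(obj_type: str, content: bytes) -> str:
    """SHA-1 of '<type> <length>\\0' followed by the content."""
    header = ("%s %d" % (obj_type, len(content))).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def loose_object_path(oid: str) -> str:
    """First two hex characters name the directory, the rest name the file."""
    return "objects/%s/%s" % (oid[:2], oid[2:])


empty_blob = object_id("blob", b"")
empty_blob_path = loose_object_path(empty_blob)
```

For an empty file this gives the ID `e69de29bb2d1d6434b8b29ae775ad8c2e48c5391`, stored under `objects/e6/`; the same value should be reported by `git hash-object` for an empty file.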
Git stores each revision of a file as a unique blob. The
relationships between the blobs can be found through examining the tree
and commit objects. Newly added objects are stored in their entirety
using
zlib compression. This can consume a large amount of disk space quickly, so objects can be combined into
packs, which use
delta compression to save space, storing blobs as their changes relative to other blobs.
Git servers typically listen on
TCP port 9418.
[48]
References
Every
object in the Git database that is not referred to may be cleaned up,
either automatically or by running a garbage collection command. An object may
be referenced by another object or by an explicit reference. Git has
several types of references, and the commands to create, move, and delete
them vary. "git show-ref" lists all references. Some types are:
- heads: refers to an object locally
- remotes: refers to an object which exists in a remote repository
- stash: refers to an object not yet committed
- meta: for example, configuration in a bare repository or user rights; the refs/meta/config namespace was introduced by, or is used by, Gerrit[clarification needed][49]
- tags: see above
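The relationship between references and garbage collection can be sketched as a reachability walk. The data model here is an assumption made for illustration: `objects` maps each object ID to the IDs it refers to, `refs` maps reference names to object IDs, and `unreachable_objects` is a hypothetical helper, not a Git command.

```python
def unreachable_objects(objects, refs):
    """Return the IDs of objects not reachable from any reference."""
    reachable = set()
    stack = list(refs.values())
    while stack:
        oid = stack.pop()
        if oid in reachable or oid not in objects:
            continue
        reachable.add(oid)
        stack.extend(objects[oid])   # follow references to other objects
    return set(objects) - reachable


# Two commits on a branch, plus one abandoned commit with its own tree and blob.
objects = {
    "c2": ["c1", "t2"], "c1": ["t1"],
    "t1": ["b1"], "t2": ["b1", "b2"],
    "b1": [], "b2": [],
    "c3": ["t3"], "t3": ["b3"], "b3": [],
}

garbage = unreachable_objects(objects, {"refs/heads/master": "c2"})
kept = unreachable_objects(objects, {"refs/heads/master": "c2",
                                     "refs/stash": "c3"})
```

With only the branch reference, the abandoned commit and everything solely reachable from it are candidates for cleanup; adding a second reference to that commit keeps them all alive.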
Implementations
gitg is a graphical front-end using
GTK+
Git is primarily developed on
Linux, although it also supports most major operating systems including
BSD,
Solaris,
macOS, and
Windows.
[50]
The first Microsoft Windows
port
of Git was primarily a Linux emulation framework that hosts the Linux
version. Installing Git under Windows creates a similarly named Program
Files directory containing the MinGW port of the GNU Compiler
Collection, Perl 5, msys2.0 (itself a fork of Cygwin, a Unix-like
emulation environment for Windows) and various other Windows ports or
emulations of Linux utilities and libraries. Currently, native Windows
builds of Git are distributed as 32- and 64-bit installers.
[51]
The JGit implementation of Git is a pure
Java software library, designed to be embedded in any Java application. JGit is used in the
Gerrit code review tool and in EGit, a
Git client for the
Eclipse IDE.
[52]
The Dulwich implementation of Git is a pure
Python software component for Python 2.7, 3.4 and 3.5.
[53]
The libgit2 implementation of Git is an ANSI C software library with
no other dependencies, which can be built on multiple platforms
including Windows, Linux, macOS, and BSD.
[54] It has bindings for many programming languages, including
Ruby, Python, and
Haskell.
[55][56][57]
JS-Git is a
JavaScript implementation of a subset of Git.
[58]
Web interfaces
Screenshot of Gitweb interface showing a commit
diff.
There are various web interfaces available for Git.
- Cgit: A web frontend for Git repositories written in C.
- Gitweb: A Git frontend written in Perl.
- Gogs: A Git frontend with built-in authentication, issue handling, forking, and many other features, written in Go.
- Gitea: A fork of Gogs.
- Gitlist: A Git repository viewer written in PHP using the Bootstrap framework.
Git server
As
Git is a distributed version control system, it can be used as a server
out of the box. Dedicated Git server software helps, among other
things, to add access control, display the contents of a Git
repository via the web, and manage multiple repositories.
Remote file store and shell access: A Git repository can be cloned to a shared
file system and accessed by other people. It can also be accessed via
remote shell just by having the Git software installed and allowing a
user to log in.
[59]
Adoption
The
Eclipse Foundation
reported in its annual community survey that, as of May 2014, Git was
the most widely used source-code management tool, with 42.9% of
professional software developers reporting that they used Git as their
primary source control system,
[60] compared with 36.3% in 2013 and 32% in 2012. Excluding use of
GitHub, the Git figures were 33.3% in 2014, 30.3% in 2013, 27.6% in 2012 and 12.8% in 2011.
[61] Open source directory
Black Duck Open Hub reports a similar uptake among open source projects.
[62]
The Stack Overflow developer survey reported in 2015 that 69.3% of
developers use Git; 36.9% use Subversion; 12.2% use TFS; and 7.9% use
Mercurial.
[63]
The UK IT jobs website itjobswatch.co.uk reported that, as of late
September 2016, 29.27% of UK permanent software development job openings
cited Git,
[64] ahead of 12.17% for Microsoft
Team Foundation Server,
[65] 10.60% for
Subversion,
[66] 1.30% for
Mercurial,
[67] and 0.48% for
Visual SourceSafe.
[68]
Since February 2017,
Microsoft has been migrating
Microsoft Windows development from
Perforce to Git.
To handle the size of the Windows source code tree, Microsoft
had to develop customizations to the software, including Git
Virtual File System (GVFS), a system that allows cloned repositories to
use placeholders whose contents are downloaded only once a file is
accessed.
[69]
Security
Git
does not provide access control mechanisms, but was designed for
operation with other tools that specialize in access control.
[70]
On 17 December 2014, a vulnerability was found affecting the
Windows and
Mac versions of the Git client. An attacker could perform
arbitrary code execution on a target computer with Git installed by creating a malicious Git tree (directory) named
.git
(the directory in a Git repository that stores all of the
repository's data) in a different case (such as .GIT or .Git, needed because
Git does not allow the all-lowercase version of
.git to be created manually) with malicious files in the
.git/hooks
subdirectory (a folder with executable files that Git runs), in a
repository that the attacker created or could
modify. If a Windows or Mac user
pulls (downloads) a version
of the repository with the malicious directory, then switches to that
directory, the .git directory will be overwritten (because the
Windows and Mac filesystems are case-insensitive) and the
malicious executable files in
.git/hooks may be run, which results in the attacker's commands being executed. An attacker could also modify the
.git/config
configuration file, which allows the attacker to create malicious Git
aliases (aliases for Git commands or external commands) or modify existing
aliases to execute malicious commands when run. The vulnerability was
patched in version 2.2.1 of Git, released on 17 December 2014, and
announced on the next day.
[71][72]
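The kind of check that guards against this class of problem can be sketched as a case-insensitive test on path components. This is an illustration only and is not the code of the actual fix: the `has_dot_git_component` name is hypothetical, and the sketch does not attempt to cover other filesystem-specific name equivalences.

```python
def has_dot_git_component(path: str) -> bool:
    """True if any path component equals '.git' when case is ignored."""
    parts = path.replace("\\", "/").split("/")
    return any(part.casefold() == ".git" for part in parts)


suspicious = [p for p in (".GIT/hooks/post-checkout",
                          "src/.Git/config",
                          "src/main.c",
                          "docs/.gitignore")
              if has_dot_git_component(p)]
```

On a case-insensitive filesystem the first two paths would land inside the repository's own .git directory, so a tool applying this check would refuse to write them; ordinary files such as .gitignore are unaffected because the whole component must match.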
Git version 2.6.1, released on 29 September 2015, contained a patch for a security vulnerability (CVE-2015-7545)
[73] which allowed arbitrary code execution.
[74]
The vulnerability was exploitable if an attacker could convince a
victim to clone a specific URL, as the arbitrary commands were embedded
in the URL itself.
[75] An attacker could use the exploit via a
man-in-the-middle attack if the connection was unencrypted,
[75]
as they could redirect the user to a URL of their choice. Recursive
clones were also vulnerable, since they allowed the controller of a
repository to specify arbitrary URLs via the gitmodules file.
[75]
Git uses
SHA-1
hashes internally. Linus Torvalds has said that the hash is
mainly meant to guard against accidental corruption, that the security a
cryptographically secure hash provides was just an accidental side effect,
and that the main security comes from signing elsewhere.
[76][77]