Managing Large Files in Git: LFS and Alternatives
Complete guide to managing large files in Git repositories using Git LFS, including setup, migration, CI/CD integration, locking, and alternative approaches like git-annex and DVC.
Overview
Git was designed to track text files efficiently, and it does that brilliantly. But the moment you start committing design assets, compiled binaries, ML model weights, or video files, Git's internal architecture works against you. Git Large File Storage (LFS) replaces large files with lightweight pointer files while storing the actual content on a remote server, keeping your repository fast and manageable. This article covers everything from initial LFS setup to production CI/CD integration, repo migration, and when you should skip LFS entirely in favor of alternatives like git-annex or DVC.
Prerequisites
- Git 2.x or later installed
- Basic familiarity with Git operations (clone, commit, push, pull)
- Command-line access (bash, PowerShell, or similar)
- A GitHub, GitLab, Azure DevOps, or Bitbucket account for LFS hosting
- Node.js 18+ (for the working example)
Why Large Files Break Git
Before reaching for a solution, you need to understand the actual problem. Git is a content-addressable filesystem. Every object — blob, tree, commit — gets SHA-1 hashed and stored. When you run git gc, Git packs these objects into packfiles using delta compression. Delta compression works by storing the difference between similar objects.
Here is where the problem starts. Binary files do not delta compress well. A single-byte change in a 50MB Photoshop file produces an entirely new 50MB blob. Git cannot compute a meaningful delta between two versions of a .psd file the way it can between two versions of a .js file.
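If you want to see this for yourself, here is a quick throwaway experiment on a GNU/Linux shell (the file name and sizes are arbitrary): commit a 50MB binary, change one byte, commit again, and the pack roughly doubles.
# Two commits of a 50MB binary with a one-byte difference take roughly 100MB of pack space
$ git init delta-demo && cd delta-demo
$ head -c 52428800 /dev/urandom > design.bin   # stand-in for a .psd
$ git add design.bin && git commit -q -m "v1"
$ printf 'x' >> design.bin                     # one-byte edit
$ git add design.bin && git commit -q -m "v2"
$ git gc --quiet
$ git count-objects -vH                        # size-pack reports roughly 100 MiB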
The consequences compound across three dimensions:
Repository size explodes. Every version of every large file lives in the .git/objects directory forever. Ten revisions of a 100MB file means roughly 1GB of object storage, even after garbage collection.
# Check your repo size
$ du -sh .git
2.3G .git
# See the largest objects in your packfile
$ git verify-pack -v .git/objects/pack/*.idx | sort -k 3 -n | tail -10
a3f2e1d... blob 104857600 104832512 12345
b7c9a0f... blob 104857600 104831488 104844857
c1d4e8b... blob 52428800 52416384 209689345
# Those are 100MB and 50MB blobs sitting in your pack
Clone time becomes painful. git clone downloads the entire history by default. A repository with 2GB of accumulated binary assets takes minutes on a fast connection and can time out on slower ones.
# A bloated repo clone
$ time git clone https://github.com/myorg/bloated-repo.git
Cloning into 'bloated-repo'...
remote: Enumerating objects: 48230, done.
remote: Counting objects: 100% (48230/48230), done.
remote: Compressing objects: 100% (12845/12845), done.
remote: Total 48230 (delta 31204), reused 47891 (delta 30987)
Receiving objects: 100% (48230/48230), 2.31 GiB | 8.42 MiB/s, done.
Resolving deltas: 100% (31204/31204), done.
Updating files: 100% (1847/1847), done.
real 4m38.214s
user 0m42.108s
sys 0m18.447s
Checkout and branch switching slow down. Even after cloning, switching branches requires Git to materialize files in the working tree. Large binary files make this noticeably slower, especially on spinning disks or network filesystems.
The fundamental issue is that Git treats every file the same way. LFS changes that by introducing a layer of indirection.
Git LFS Setup and Configuration
Git LFS is an open-source Git extension maintained by GitHub. It uses Git's clean and smudge filters to swap large files for small pointer files in your repository while storing the actual content on a dedicated LFS server.
Installation
# macOS
$ brew install git-lfs
# Ubuntu/Debian
$ sudo apt-get install git-lfs
# Windows (via installer or chocolatey)
$ choco install git-lfs
# Verify installation
$ git lfs version
git-lfs/3.5.1 (GitHub; windows amd64; go 1.21.8)
After installing the binary, you need to initialize LFS for your user account. This is a one-time operation per machine:
$ git lfs install
Updated git hooks.
Git LFS initialized.
This command adds smudge and clean filter configurations to your global .gitconfig:
[filter "lfs"]
clean = git-lfs clean -- %f
smudge = git-lfs smudge -- %f
process = git-lfs filter-process
required = true
Tracking Patterns with .gitattributes
LFS uses .gitattributes to determine which files should be managed by LFS. This file gets committed to the repo, so all collaborators automatically track the same patterns.
# Track specific file types
$ git lfs track "*.psd"
$ git lfs track "*.ai"
$ git lfs track "*.sketch"
$ git lfs track "*.zip"
$ git lfs track "*.tar.gz"
$ git lfs track "*.mp4"
$ git lfs track "*.pkl"
$ git lfs track "*.h5"
$ git lfs track "*.onnx"
# Track files in a specific directory regardless of extension
$ git lfs track "assets/designs/**"
$ git lfs track "models/weights/**"
# Track a specific file
$ git lfs track "data/training-dataset.csv"
The resulting .gitattributes file looks like this:
*.psd filter=lfs diff=lfs merge=lfs -text
*.ai filter=lfs diff=lfs merge=lfs -text
*.sketch filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.tar.gz filter=lfs diff=lfs merge=lfs -text
*.mp4 filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
assets/designs/** filter=lfs diff=lfs merge=lfs -text
models/weights/** filter=lfs diff=lfs merge=lfs -text
data/training-dataset.csv filter=lfs diff=lfs merge=lfs -text
When you commit a tracked file, Git stores a pointer file in the repository instead of the actual content:
$ cat .git/lfs/objects/ab/cd/abcd1234...
# This is the actual file content, stored locally
$ git show HEAD:assets/logo.psd
version https://git-lfs.github.com/spec/v1
oid sha256:abcd1234567890abcdef1234567890abcdef1234567890abcdef1234567890ab
size 48234567
That pointer file is only 130-140 bytes regardless of how large the actual file is. The real content lives in .git/lfs/objects/ locally and on the LFS server remotely.
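A quick sanity check you can run anywhere: peek at the first line of a file in your working tree. A pointer starts with the spec line shown above; a materialized binary prints garbage instead (the path here is illustrative).
$ head -n 1 assets/logo.psd
version https://git-lfs.github.com/spec/v1
# If you see binary noise instead, the real content has been smudged into place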
Verifying What LFS Is Tracking
# List tracked patterns
$ git lfs track
Listing tracked patterns
*.psd (.gitattributes)
*.ai (.gitattributes)
*.pkl (.gitattributes)
# List actual LFS objects in the current checkout
$ git lfs ls-files
abcd1234 * assets/logo.psd
ef567890 * models/weights/classifier.pkl
1a2b3c4d * assets/hero-video.mp4
# See LFS status
$ git lfs status
On branch main
Objects to be pushed to origin/main:
assets/logo.psd (LFS: abcd1234)
Objects to be committed:
(no changes)
LFS Storage Backends
GitHub
GitHub provides 1GB of free LFS storage and 1GB of bandwidth per month. Additional data packs cost $5/month for 50GB of storage and 50GB of bandwidth.
# GitHub LFS is automatic — just push to a GitHub remote
$ git remote add origin https://github.com/myorg/my-project.git
$ git push origin main
Uploading LFS objects: 100% (3/3), 142 MB | 12 MB/s, done.
Azure DevOps
Azure DevOps is the most generous for LFS. There is no separate LFS quota; LFS objects simply count against overall project storage (typically 250GB for free organizations). This makes it a strong choice for game studios or media-heavy projects.
# Azure DevOps LFS works the same way
$ git remote add origin https://dev.azure.com/myorg/myproject/_git/myrepo
$ git push origin main
GitLab
GitLab.com provides 5GB of LFS storage in the free tier. Self-managed GitLab lets you configure your own storage backend, including S3-compatible object storage.
Self-Hosted LFS Server
You can run your own LFS server if you need complete control over storage. The reference implementation is lfs-test-server, but for production you want something more robust:
# Using the open-source Rudolfs server (Rust, S3-backed)
$ cargo install rudolfs
# Configure with S3 backend
$ rudolfs \
--host 0.0.0.0:8080 \
--s3-bucket my-lfs-bucket \
--s3-region us-east-1 \
--key "$(cat lfs-encryption.key)"
Then configure your repo to use the custom server:
# .lfsconfig (committed to repo)
[lfs]
url = https://lfs.mycompany.com/myorg/myrepo
Migrating Existing Repos to LFS
This is where things get interesting. If you already have large files committed directly to your repository, simply adding LFS tracking going forward does not fix the history. Those old blobs still live in every packfile. You need to rewrite history.
Important: History rewriting is destructive. Every collaborator will need to re-clone after migration. Coordinate with your team before doing this.
Using git lfs migrate (Recommended)
The built-in git lfs migrate import command rewrites history so that files matching your patterns become LFS pointers. For more general history surgery, git-filter-repo is the modern replacement for git filter-branch and BFG: faster, safer, and more flexible. A straightforward LFS migration does not need it.
# Optional: install git-filter-repo for general-purpose history rewriting
$ pip install git-filter-repo
# First, identify the largest files in history
$ git rev-list --objects --all \
| git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
| awk '/^blob/ {print $3, $4}' \
| sort -rn \
| head -20
104857600 assets/design-mockup-v3.psd
104857600 assets/design-mockup-v2.psd
104857600 assets/design-mockup-v1.psd
52428800 data/training-data.csv
31457280 build/release-2.1.0.zip
# Migrate specific extensions to LFS throughout history
$ git lfs migrate import --include="*.psd,*.zip,*.csv" --everything
migrate: Sorting commits: ..., done.
migrate: Rewriting commits: 100% (1847/1847), done.
main abcd1234 -> ef567890
develop 1a2b3c4d -> 5e6f7a8b
migrate: Updating refs: ..., done.
migrate: checkout: ..., done.
The --everything flag rewrites all branches and tags. Without it, only the current branch is migrated.
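If you only want to rewrite selected branches rather than everything, git lfs migrate also accepts ref filters; the branch names below are illustrative.
# Limit the rewrite to specific refs
$ git lfs migrate import --include="*.psd" \
    --include-ref=refs/heads/main \
    --include-ref=refs/heads/develop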
# Verify the migration
$ git lfs ls-files --all | wc -l
47
# Check the new repo size
$ git gc --prune=now
$ du -sh .git
187M .git
# Down from 2.3GB
Using BFG Repo-Cleaner
BFG is an older tool but still works well for simple cases. It is a Java application so it runs anywhere with a JVM.
# Download BFG
$ wget https://repo1.maven.org/maven2/com/madgag/bfg/1.14.0/bfg-1.14.0.jar
# Clone a fresh mirror
$ git clone --mirror https://github.com/myorg/my-project.git
# Remove files larger than 10MB from history
$ java -jar bfg-1.14.0.jar --strip-blobs-bigger-than 10M my-project.git
# Clean up
$ cd my-project.git
$ git reflog expire --expire=now --all
$ git gc --prune=now --aggressive
# Push the rewritten history
$ git push --force
Post-Migration Checklist
After rewriting history, every team member needs to:
# Option 1: Fresh clone (safest)
$ rm -rf my-project
$ git clone https://github.com/myorg/my-project.git
# Option 2: Reset existing clone (if they have no local changes)
$ git fetch origin
$ git reset --hard origin/main
$ git lfs pull
LFS Locking for Binary Files
Binary files cannot be merged. If two people edit the same Photoshop file simultaneously, one person's work gets lost. LFS provides file locking to prevent this.
# Make sure the LFS hooks are in place (locking relies on them);
# --force overwrites any existing hooks
$ git lfs install --force
# Configure lockable file types in .gitattributes
*.psd filter=lfs diff=lfs merge=lfs -text lockable
*.ai filter=lfs diff=lfs merge=lfs -text lockable
*.sketch filter=lfs diff=lfs merge=lfs -text lockable
# Lock a file before editing
$ git lfs lock assets/hero-banner.psd
Locked assets/hero-banner.psd
# See all locks
$ git lfs locks
ID Path Owner Locked At
1234 assets/hero-banner.psd shane 2026-02-08T10:30:00Z
1235 assets/icon-set.ai maria 2026-02-07T14:15:00Z
# Unlock when done
$ git lfs unlock assets/hero-banner.psd
Unlocked assets/hero-banner.psd
# Force unlock someone else's file (admin only)
$ git lfs unlock assets/icon-set.ai --force
The lockable attribute in .gitattributes makes tracked files read-only in the working tree by default. You must explicitly lock a file before editing it, which prevents accidental concurrent edits.
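You can see the effect directly. The output below is illustrative, but the pattern holds: lockable files are checked out read-only, and locking makes your local copy writable while you hold the lock.
$ ls -l assets/hero-banner.psd
-r--r--r-- 1 shane staff 89128960 Feb  8 10:29 assets/hero-banner.psd
$ git lfs lock assets/hero-banner.psd
Locked assets/hero-banner.psd
$ ls -l assets/hero-banner.psd
-rw-r--r-- 1 shane staff 89128960 Feb  8 10:30 assets/hero-banner.psd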
LFS in CI/CD Pipelines
LFS in CI/CD introduces two concerns: bandwidth consumption and build speed. Every CI run that clones the repo downloads LFS objects, which costs bandwidth and time.
GitHub Actions
# .github/workflows/build.yml
name: Build
on: [push, pull_request]
jobs:
build:
runs-on: ubuntu-latest
steps:
- name: Checkout with LFS
uses: actions/checkout@v4
with:
lfs: true
# Only fetch LFS objects needed for current commit
# not the entire LFS history
fetch-depth: 1
- name: Cache LFS objects
uses: actions/cache@v4
with:
path: .git/lfs
key: lfs-${{ hashFiles('.gitattributes') }}-${{ github.sha }}
restore-keys: |
lfs-${{ hashFiles('.gitattributes') }}-
- name: Pull LFS files
run: git lfs pull
- name: Setup Node.js
uses: actions/setup-node@v4
with:
node-version: '20'
- name: Install and build
run: |
npm ci
npm run build
GitLab CI
# .gitlab-ci.yml
variables:
GIT_LFS_SKIP_SMUDGE: "1" # Skip automatic LFS download
stages:
- build
build:
stage: build
before_script:
# Only pull LFS files we actually need
- git lfs pull --include="assets/production/**" --exclude=""
script:
- npm ci
- npm run build
cache:
key: lfs-cache
paths:
- .git/lfs/
Selective LFS Fetching
The biggest optimization is not downloading LFS files you do not need. If your CI pipeline only runs tests and does not need design assets, skip them entirely:
# Skip all LFS downloads during clone
$ GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/myorg/project.git
# Then selectively pull only what you need
$ git lfs pull --include="models/production/*.onnx" --exclude="assets/designs/*"
This can reduce CI bandwidth usage by 80% or more depending on your file distribution.
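To make the same rules stick for a clone instead of remembering the flags, you can persist them with the standard lfs.fetchinclude and lfs.fetchexclude config keys; the patterns below are examples.
# Persist selective fetching for this clone
$ git config lfs.fetchinclude "models/production/*.onnx"
$ git config lfs.fetchexclude "assets/designs/*"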
Alternatives to LFS
LFS is not the only game in town, and it is not always the right choice. Here are the major alternatives and when each one makes sense.
git-annex
git-annex predates LFS and is more flexible but also more complex. It supports a wider range of storage backends, including local drives, SSH servers, S3, Glacier, and even Bittorrent.
# Initialize git-annex in a repo
$ git annex init "my laptop"
init my laptop ok
# Add a large file
$ git annex add datasets/large-training-set.tar.gz
add datasets/large-training-set.tar.gz ok
# The file becomes a symlink to the annex
$ ls -la datasets/large-training-set.tar.gz
lrwxrwxrwx 1 shane shane 198 Feb 8 10:00 datasets/large-training-set.tar.gz ->
.git/annex/objects/Xk/9V/SHA256E-s524288000--abcd1234.../SHA256E-s524288000--abcd1234...
# Configure an S3 remote
$ git annex initremote s3-backup type=S3 bucket=my-annex-bucket encryption=none
# Sync content to the remote
$ git annex copy --to s3-backup datasets/large-training-set.tar.gz
copy datasets/large-training-set.tar.gz (to s3-backup...) ok
When to use git-annex: You need fine-grained control over where data lives. git-annex tracks which remotes have which files and can enforce redundancy policies. It is popular in scientific computing and archival workflows.
When to avoid it: Your team uses Windows (git-annex support for Windows has historically been fragile), or you want something that works out of the box with GitHub/GitLab.
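The redundancy and location tracking mentioned above look like this in practice; the UUIDs and output are illustrative.
# Require at least two copies of every annexed file
$ git annex numcopies 2
# Ask which remotes hold a given file
$ git annex whereis datasets/large-training-set.tar.gz
whereis datasets/large-training-set.tar.gz (2 copies)
  8a5c2e1f-0b3d-4f6a-9c7e-1d2f3a4b5c6d -- my laptop [here]
  3b9d0c7a-5e6f-4a1b-8c2d-9e0f1a2b3c4d -- [s3-backup]
ok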
DVC (Data Version Control)
DVC is built specifically for machine learning workflows. It tracks data files, models, and pipelines alongside your code in Git.
# Install DVC
$ pip install dvc[s3]
# Initialize DVC in a Git repo
$ dvc init
# Configure S3 as remote storage
$ dvc remote add -d myremote s3://my-dvc-bucket/project
$ dvc remote modify myremote region us-east-1
# Track a large dataset
$ dvc add data/training-images/
This creates a .dvc pointer file:
# data/training-images.dvc
outs:
- md5: abcdef1234567890abcdef1234567890.dir
size: 2147483648
nfiles: 50000
hash: md5
path: training-images
DVC also supports pipeline tracking, which is where it really shines:
# dvc.yaml
stages:
preprocess:
cmd: python src/preprocess.py
deps:
- src/preprocess.py
- data/raw/
outs:
- data/processed/
train:
cmd: python src/train.py
deps:
- src/train.py
- data/processed/
outs:
- models/classifier.pkl
metrics:
- metrics/accuracy.json:
cache: false
# Reproduce the entire pipeline
$ dvc repro
Running stage 'preprocess':
> python src/preprocess.py
Generating processed dataset...done (50000 images)
Running stage 'train':
> python src/train.py
Training classifier...done (accuracy: 0.94)
# Push data and models to remote
$ dvc push
50001 files pushed
When to use DVC: You are building ML pipelines with large datasets and model files. DVC integrates experiment tracking, pipeline reproducibility, and data versioning in one tool.
Cloud Storage with Pointer Files (DIY Approach)
Sometimes the simplest approach is the right one. Store large files in S3, GCS, or Azure Blob Storage and commit a manifest file to Git.
// scripts/sync-assets.js
var AWS = require("aws-sdk");
var fs = require("fs");
var path = require("path");
var crypto = require("crypto");
var s3 = new AWS.S3({ region: "us-east-1" });
var BUCKET = "my-project-assets";
var MANIFEST_PATH = path.join(__dirname, "..", "assets-manifest.json");
function hashFile(filePath) {
return new Promise(function(resolve, reject) {
var hash = crypto.createHash("sha256");
var stream = fs.createReadStream(filePath);
stream.on("data", function(chunk) { hash.update(chunk); });
stream.on("end", function() { resolve(hash.digest("hex")); });
stream.on("error", reject);
});
}
function uploadAsset(filePath, hash) {
var key = "assets/" + hash + "/" + path.basename(filePath);
var body = fs.createReadStream(filePath);
return s3.upload({
Bucket: BUCKET,
Key: key,
Body: body
}).promise().then(function(result) {
console.log("Uploaded: " + filePath + " -> s3://" + BUCKET + "/" + key);
return { path: filePath, hash: hash, s3Key: key, size: fs.statSync(filePath).size };
});
}
function downloadAsset(entry, destDir) {
var dest = path.join(destDir, path.basename(entry.path));
var file = fs.createWriteStream(dest);
return new Promise(function(resolve, reject) {
s3.getObject({ Bucket: BUCKET, Key: entry.s3Key })
.createReadStream()
.pipe(file)
.on("finish", function() {
console.log("Downloaded: " + dest);
resolve();
})
.on("error", reject);
});
}
function syncUp(directory) {
var manifest = {};
var files = fs.readdirSync(directory).filter(function(f) {
var ext = path.extname(f).toLowerCase();
return [".psd", ".ai", ".mp4", ".zip", ".pkl"].indexOf(ext) !== -1;
});
var uploads = files.map(function(file) {
var filePath = path.join(directory, file);
return hashFile(filePath).then(function(hash) {
return uploadAsset(filePath, hash);
});
});
Promise.all(uploads).then(function(results) {
results.forEach(function(entry) {
manifest[entry.path] = entry;
});
fs.writeFileSync(MANIFEST_PATH, JSON.stringify(manifest, null, 2));
console.log("Manifest written: " + MANIFEST_PATH);
}).catch(function(err) {
console.error("Sync failed:", err);
process.exit(1);
});
}
// Usage: node scripts/sync-assets.js upload ./assets
// Usage: node scripts/sync-assets.js download ./assets
var command = process.argv[2];
var dir = process.argv[3] || "./assets";
if (command === "upload") {
syncUp(dir);
} else if (command === "download") {
var manifest = JSON.parse(fs.readFileSync(MANIFEST_PATH, "utf8"));
var downloads = Object.values(manifest).map(function(entry) {
return downloadAsset(entry, dir);
});
Promise.all(downloads).then(function() {
console.log("All assets downloaded.");
});
}
Add the manifest to Git and ignore the actual files:
# .gitignore
assets/*.psd
assets/*.ai
assets/*.mp4
models/*.pkl
models/*.onnx
# But track the manifest
!assets-manifest.json
When to use this approach: You want full control over storage, already have cloud infrastructure, and your team is comfortable with a custom workflow. This avoids LFS bandwidth costs entirely.
Cost Considerations Across Providers
Cost is a real factor when managing large files at scale. Here is how the major providers compare as of early 2026:
| Provider | Free Storage | Free Bandwidth | Paid Storage | Paid Bandwidth |
|---|---|---|---|---|
| GitHub | 1 GB | 1 GB/month | $5/50 GB | $5/50 GB |
| GitLab | 5 GB | 10 GB/month | $0.10/GB | Included |
| Azure DevOps | 250 GB* | Unlimited | Included | Included |
| Bitbucket | 1 GB | 1 GB/month | $10/100 GB | $10/100 GB |
| Self-hosted (S3) | N/A | N/A | ~$0.023/GB | ~$0.09/GB transfer |
*Azure DevOps includes LFS in project storage limits.
For a team with 50GB of binary assets and active CI/CD, monthly costs look roughly like this:
- GitHub: $5-15/month (depending on CI bandwidth)
- GitLab: $5/month
- Azure DevOps: $0 (within free tier)
- Self-hosted S3: $2-5/month
Azure DevOps is the clear winner on cost for LFS-heavy projects. If you are on GitHub and hitting bandwidth limits, the DIY S3 approach can be cheaper than buying data packs.
Monitoring LFS Storage Usage
You need visibility into what LFS is consuming. Without it, you will get surprised by costs or storage limits.
# Repository size on GitHub (returned in KB; LFS storage is billed separately)
$ gh api /repos/{owner}/{repo} --jq '.size'
# More detailed: GitHub API for LFS usage
$ gh api /orgs/{org}/settings/billing/shared-storage
{
"days_left_in_billing_cycle": 22,
"estimated_paid_storage_for_month": 12.5,
"estimated_storage_for_month": 62.5
}
# Local LFS usage
$ git lfs env
Endpoint=https://github.com/myorg/project.git/info/lfs (auth=basic)
LocalMediaDir=/home/shane/project/.git/lfs/objects
TempDir=/home/shane/project/.git/lfs/tmp
$ du -sh .git/lfs
847M .git/lfs
You can build a simple monitoring script for your Node.js project:
// scripts/lfs-report.js
var childProcess = require("child_process");
function run(cmd) {
return childProcess.execSync(cmd, { encoding: "utf8" }).trim();
}
function getLfsReport() {
var files = run("git lfs ls-files --size").split("\n").filter(Boolean);
var totalSize = 0;
var report = files.map(function(line) {
// Each line looks like: "<oid> <*|-> <path> (<size> <unit>)"
var match = line.match(/^(\S+)\s+([*-])\s+(.+)\s+\(([\d.]+)\s*(B|KB|MB|GB|TB)\)$/);
if (!match) return null;
var oid = match[1];
var indicator = match[2];
var filePath = match[3];
var sizeStr = match[4] + " " + match[5];
var sizeBytes = parseFloat(match[4]);
if (match[5] === "KB") sizeBytes *= 1024;
if (match[5] === "MB") sizeBytes *= 1024 * 1024;
if (match[5] === "GB") sizeBytes *= 1024 * 1024 * 1024;
if (match[5] === "TB") sizeBytes *= 1024 * 1024 * 1024 * 1024;
totalSize += sizeBytes;
return {
path: filePath,
size: sizeStr,
oid: oid.substring(0, 10),
downloaded: indicator === "*" // "*" = content present locally, "-" = pointer only
};
}).filter(Boolean);
console.log("=== Git LFS Storage Report ===\n");
console.log("Total LFS objects: " + report.length);
console.log("Total size: " + (totalSize / (1024 * 1024)).toFixed(2) + " MB\n");
console.log("Files:");
report.forEach(function(entry) {
var status = entry.downloaded ? "[local]" : "[remote]";
console.log(" " + status + " " + entry.path + " (" + entry.size + ")");
});
var patterns = run("git lfs track").split("\n").filter(function(line) {
return line.indexOf("*") !== -1 || line.indexOf("/") !== -1;
});
console.log("\nTracked patterns:");
patterns.forEach(function(p) {
console.log(" " + p.trim());
});
}
getLfsReport();
$ node scripts/lfs-report.js
=== Git LFS Storage Report ===
Total LFS objects: 23
Total size: 847.32 MB
Files:
[local] assets/hero-banner.psd (104.2 MB)
[local] assets/brand-guide.ai (52.1 MB)
[local] models/classifier.pkl (312.5 MB)
[remote] assets/old-mockup.psd (98.7 MB)
...
Tracked patterns:
*.psd (.gitattributes)
*.ai (.gitattributes)
*.pkl (.gitattributes)
Complete Working Example
Let's set up Git LFS for a real Node.js project that has design assets, build artifacts, and ML model files. We will configure LFS, set up CI/CD, and migrate existing large files.
Project Structure
my-node-project/
package.json
app.js
src/
routes/
models/
services/
inference.js # Uses ML model for predictions
assets/
designs/
homepage.psd # 85 MB
brand-kit.ai # 45 MB
images/
hero.jpg # 2 MB (keep in Git - small enough)
models/
sentiment.onnx # 200 MB
embeddings.pkl # 150 MB
dist/ # Build output
bundle.js
bundle.js.map
scripts/
lfs-report.js
post-checkout.sh
Step 1: Initialize LFS and Configure Tracking
# Initialize LFS in the project
$ cd my-node-project
$ git lfs install
Updated git hooks.
Git LFS initialized.
# Track large binary formats
$ git lfs track "*.psd"
$ git lfs track "*.ai"
$ git lfs track "*.sketch"
$ git lfs track "*.onnx"
$ git lfs track "*.pkl"
$ git lfs track "*.h5"
$ git lfs track "*.tar.gz"
# Track entire directories for build artifacts
$ git lfs track "dist/**"
# Commit the .gitattributes first
$ git add .gitattributes
$ git commit -m "Configure Git LFS tracking patterns"
Step 2: Add Package Scripts
{
"name": "my-node-project",
"version": "2.1.0",
"scripts": {
"start": "node app.js",
"build": "webpack --mode production",
"lfs:report": "node scripts/lfs-report.js",
"lfs:prune": "git lfs prune --verify-remote",
"postinstall": "git lfs pull --include='models/*.onnx'"
},
"dependencies": {
"express": "^4.18.2",
"onnxruntime-node": "^1.17.0"
}
}
The postinstall script ensures ML model files are downloaded after npm install, which is critical for development setup.
Step 3: Configure Selective Fetching
# .lfsconfig
[lfs]
fetchinclude = models/*, assets/images/*
fetchexclude = assets/designs/*
This configuration means developers get ML models (needed to run the app) but not design files (only needed by designers) by default. Designers can pull their files manually:
$ git lfs pull --include="assets/designs/*"
Step 4: Set Up the Inference Service
// src/services/inference.js
var ort = require("onnxruntime-node");
var path = require("path");
var fs = require("fs");
var MODEL_PATH = path.join(__dirname, "..", "..", "models", "sentiment.onnx");
var session = null;
function loadModel() {
if (!fs.existsSync(MODEL_PATH)) {
console.error("Model file not found at: " + MODEL_PATH);
console.error("Run 'git lfs pull --include=models/*.onnx' to download.");
process.exit(1);
}
var stats = fs.statSync(MODEL_PATH);
if (stats.size < 1000) {
// LFS pointer file is ~130 bytes; actual model is 200MB
console.error("Model file appears to be an LFS pointer, not the actual model.");
console.error("Run 'git lfs pull --include=models/*.onnx' to download.");
process.exit(1);
}
return ort.InferenceSession.create(MODEL_PATH).then(function(s) {
session = s;
console.log("Model loaded: " + MODEL_PATH + " (" + (stats.size / 1024 / 1024).toFixed(1) + " MB)");
return session;
});
}
function predict(inputData) {
if (!session) {
return Promise.reject(new Error("Model not loaded. Call loadModel() first."));
}
var tensor = new ort.Tensor("float32", inputData, [1, inputData.length]);
var feeds = { input: tensor };
return session.run(feeds).then(function(results) {
return results.output.data;
});
}
module.exports = {
loadModel: loadModel,
predict: predict
};
The LFS pointer size check in loadModel is important. If someone clones without LFS or LFS fails silently, you get a 130-byte text file instead of a 200MB model. Without this check, you get cryptic ONNX parsing errors.
Step 5: CI/CD Pipeline
# .github/workflows/build.yml
name: Build and Test
on:
push:
branches: [main, develop]
pull_request:
branches: [main]
jobs:
test:
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v4
with:
lfs: false # Don't download all LFS files
fetch-depth: 1
- name: Cache LFS objects
uses: actions/cache@v4
id: lfs-cache
with:
path: .git/lfs
key: lfs-v1-${{ hashFiles('**/.gitattributes') }}-${{ github.sha }}
restore-keys: |
lfs-v1-${{ hashFiles('**/.gitattributes') }}-
- name: Pull required LFS files
run: |
git lfs install
# Only pull model files needed for tests
git lfs pull --include="models/*.onnx"
# Verify the files are real, not pointers
for f in models/*.onnx; do
size=$(stat --format=%s "$f" 2>/dev/null || stat -f%z "$f")
if [ "$size" -lt 1000 ]; then
echo "ERROR: $f appears to be an LFS pointer ($size bytes)"
exit 1
fi
done
- name: Setup Node.js
uses: actions/setup-node@v4
with:
node-version: '20'
cache: 'npm'
- name: Install dependencies
run: npm ci
- name: Run tests
run: npm test
- name: Build
run: npm run build
lint:
runs-on: ubuntu-latest
steps:
- name: Checkout (no LFS)
uses: actions/checkout@v4
with:
lfs: false
fetch-depth: 1
# Lint job doesn't need any LFS files
- name: Setup Node.js
uses: actions/setup-node@v4
with:
node-version: '20'
cache: 'npm'
- name: Install and lint
run: |
npm ci
npm run lint
Notice how the lint job skips LFS entirely. It does not need model files or design assets to check code style. This saves bandwidth and speeds up the pipeline.
Step 6: Migrate Existing Large Files
If the project already has large files committed directly:
# See what is taking up space
$ git lfs migrate info --everything --above=1mb
migrate: Sorting commits: ..., done.
migrate: Examining commits: 100% (1847/1847), done.
*.psd 285.3 MB 3/3 files 100%
*.onnx 200.1 MB 1/1 files 100%
*.pkl 150.4 MB 1/1 files 100%
*.ai 45.2 MB 1/1 files 100%
*.zip 31.0 MB 2/2 files 100%
# Migrate everything matching our tracking patterns
$ git lfs migrate import \
--include="*.psd,*.ai,*.onnx,*.pkl,*.zip,*.tar.gz" \
--everything
migrate: Sorting commits: ..., done.
migrate: Rewriting commits: 100% (1847/1847), done.
main abcd1234 -> 5e6f7a8b
develop 1a2b3c4d -> 9c0d1e2f
migrate: Updating refs: ..., done.
migrate: checkout: ..., done.
# Verify
$ du -sh .git
193M .git
# Down from 1.9 GB
# Force push the rewritten history
$ git push --force-with-lease origin main develop
Common Issues and Troubleshooting
1. "Encountered N file(s) that should have been pointers"
$ git push origin main
LFS upload missing objects:
(missing) assets/logo.psd (abcd1234 -- 48234567)
Encountered 1 file(s) that should have been pointers, but weren't:
assets/logo.psd
This happens when a file was committed before LFS tracking was configured, or when someone commits with GIT_LFS_SKIP_SMUDGE=1 set. Fix it by re-adding the file:
# Remove the file from the index (not disk)
$ git rm --cached assets/logo.psd
# Re-add it — LFS filters will now process it
$ git add assets/logo.psd
$ git commit -m "Convert logo.psd to LFS pointer"
$ git push
2. "batch response: Rate limit exceeded"
$ git lfs pull
batch response: Rate limit exceeded. Please retry in 3600 seconds.
error: failed to fetch some objects from 'https://github.com/myorg/project.git/info/lfs'
GitHub rate-limits LFS API calls. This typically hits CI pipelines making many concurrent requests. Solutions:
# Use LFS caching in CI (see CI/CD section above)
# Or set a custom transfer adapter with retries
$ git config lfs.transfer.maxretries 5
$ git config lfs.transfer.maxretrydelay 30
# Or use a GitHub Personal Access Token for higher limits
$ git config lfs.url "https://<TOKEN>@github.com/myorg/project.git/info/lfs"
3. "smudge filter lfs failed"
$ git checkout develop
Filtering content: 100% (47/47), 812.43 MiB | 2.10 MiB/s, done.
error: external filter 'git-lfs filter-process' failed
fatal: assets/mockup.psd: smudge filter lfs failed
This usually means git-lfs is not installed or not on the PATH. It can also occur when LFS storage is unavailable.
# Verify LFS is installed and accessible
$ which git-lfs
/usr/bin/git-lfs
$ git lfs env
# Check the endpoint URL and authentication
# Skip LFS temporarily to unblock yourself
$ GIT_LFS_SKIP_SMUDGE=1 git checkout develop
# Then pull LFS files separately
$ git lfs pull
4. "Repository over storage quota"
$ git push origin main
remote: error: GH008: Your push was rejected because the repository exceeded
its data quota. Purchase additional data packs to restore access.
You have exceeded your LFS storage limit. Options:
# Check what is using space
$ git lfs ls-files --size
abcd1234 * assets/old-mockup-v1.psd (104 MB)
ef567890 * assets/old-mockup-v2.psd (104 MB)
1a2b3c4d * assets/old-mockup-v3.psd (104 MB)
# 312 MB of old mockups nobody needs
# Prune old LFS objects not referenced by recent commits
$ git lfs prune --verify-remote --recent-refs-days=30
prune: 14 local objects, 6 retained, done.
prune: Deleting objects: 100% (8/8), done.
# If that's not enough, remove old files from history entirely
$ git lfs migrate import --include="*.psd" --everything
# Then push with --force-with-lease
# Or purchase additional storage
# GitHub: Settings -> Billing -> Git LFS data
5. "Unable to find source for object"
$ git lfs fetch --all
fetch: Fetching reference refs/heads/main
error: Unable to find source for object abcdef1234567890 (try running git lfs fetch --all)
This occurs when an LFS object was never pushed to the remote, or was deleted from remote storage. If you have the object locally on another machine:
# On the machine that has the object
$ git lfs push --all origin
# If no machine has it, the file is lost
# Remove the broken reference
$ git rm assets/lost-file.psd
$ git commit -m "Remove orphaned LFS reference"
Best Practices
- Set up .gitattributes before committing any large files. Retroactive migration works but is painful. Start tracking patterns from day one.
- Use fetchinclude and fetchexclude in .lfsconfig. Not every developer needs every large file. Designers need PSD files. ML engineers need model weights. Backend developers might need neither. Selective fetching keeps clone times fast.
- Cache LFS objects in CI/CD. Without caching, every CI run downloads every LFS file. Cache the .git/lfs directory with a key based on .gitattributes content and the commit SHA.
- Enable file locking for unmergeable formats. PSD, AI, Sketch, and other binary formats cannot be three-way merged. Use the lockable attribute in .gitattributes to prevent concurrent edits and lost work.
- Add LFS pointer validation to your build scripts. When LFS is misconfigured or unavailable, you get tiny text pointer files instead of actual binary content. Check file sizes in your application startup or build process to fail fast with a clear error message (see the sketch after this list).
- Run git lfs prune periodically. LFS keeps old versions of files in .git/lfs/objects/ after they are no longer referenced by your current branch. Pruning reclaims local disk space. Use --verify-remote to ensure objects exist on the server before deleting locally.
- Monitor storage and bandwidth usage monthly. LFS costs can creep up silently. Set up alerts on your Git hosting provider when usage exceeds 80% of your plan limits.
- Consider Azure DevOps for LFS-heavy projects. If your primary concern is cost and your team is not locked into GitHub or GitLab, Azure DevOps offers the most generous LFS limits at no additional cost.
- Keep files under 5 MB in regular Git. Not every binary file needs LFS. Small icons, compressed thumbnails, and configuration files are fine in regular Git. LFS adds complexity, so only use it where the size justifies the overhead.
- Document your LFS setup in the project README. New developers need to know to install git-lfs, which files are tracked, and how to pull specific LFS assets. A three-line setup section prevents hours of confusion.
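Here is a minimal sketch of the pointer-validation practice, as a hypothetical scripts/check-lfs-pointers.sh you could call from a build step or Git hook. It assumes tracked paths contain no spaces and simply looks for the LFS spec line at the top of each file.
#!/usr/bin/env bash
# Fail if any LFS-tracked file in the working tree is still a pointer.
bad=""
for file in $(git lfs ls-files --name-only); do
  if head -c 100 "$file" | grep -q "git-lfs.github.com/spec"; then
    bad="$bad $file"
  fi
done
if [ -n "$bad" ]; then
  echo "ERROR: LFS pointers found instead of real content:$bad" >&2
  echo "Run 'git lfs pull' and try again." >&2
  exit 1
fi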
References
- Git LFS Official Documentation
- Git LFS Specification
- GitHub LFS Documentation
- git-filter-repo - History rewriting tool
- BFG Repo-Cleaner - Legacy history cleaner
- DVC Documentation - Data Version Control for ML
- git-annex - Distributed file synchronization
- Rudolfs - Self-hosted LFS server with S3 backend
- Azure DevOps Git LFS - Azure LFS documentation
