Managing Large Files in Git: LFS and Alternatives
Complete guide to managing large files in Git repositories using Git LFS, including setup, migration, CI/CD integration, locking, and alternative approaches like git-annex and DVC.
Overview
Git was designed to track text files efficiently, and it does that brilliantly. But the moment you start committing design assets, compiled binaries, ML model weights, or video files, Git's internal architecture works against you. Git Large File Storage (LFS) replaces large files with lightweight pointer files while storing the actual content on a remote server, keeping your repository fast and manageable. This article covers everything from initial LFS setup to production CI/CD integration, repo migration, and when you should skip LFS entirely in favor of alternatives like git-annex or DVC.
Prerequisites
- Git 2.x or later installed
- Basic familiarity with Git operations (clone, commit, push, pull)
- Command-line access (bash, PowerShell, or similar)
- A GitHub, GitLab, Azure DevOps, or Bitbucket account for LFS hosting
- Node.js 18+ (for the working example)
Why Large Files Break Git
Before reaching for a solution, you need to understand the actual problem. Git is a content-addressable filesystem. Every object — blob, tree, commit — gets SHA-1 hashed and stored. When you run git gc, Git packs these objects into packfiles using delta compression. Delta compression works by storing the difference between similar objects.
Here is where the problem starts. Binary files do not delta compress well. A single-byte change in a 50MB Photoshop file produces an entirely new 50MB blob. Git cannot compute a meaningful delta between two versions of a .psd file the way it can between two versions of a .js file.
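If you want to see this for yourself, here is a quick throwaway experiment on a GNU/Linux shell (the file name and sizes are arbitrary): commit a 50MB binary, change one byte, commit again, and the pack roughly doubles.
# Two commits of a 50MB binary with a one-byte difference take roughly 100MB of pack space
$ git init delta-demo && cd delta-demo
$ head -c 52428800 /dev/urandom > design.bin   # stand-in for a .psd
$ git add design.bin && git commit -q -m "v1"
$ printf 'x' >> design.bin                     # one-byte edit
$ git add design.bin && git commit -q -m "v2"
$ git gc --quiet
$ git count-objects -vH                        # size-pack reports roughly 100 MiB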
The consequences compound across three dimensions:
Repository size explodes. Every version of every large file lives in the .git/objects directory forever. Ten revisions of a 100MB file means roughly 1GB of object storage, even after garbage collection.
# Check your repo size
$ du -sh .git
2.3G .git
# See the largest objects in your packfile
$ git verify-pack -v .git/objects/pack/*.idx | sort -k 3 -n | tail -10
a3f2e1d... blob 104857600 104832512 12345
b7c9a0f... blob 104857600 104831488 104844857
c1d4e8b... blob 52428800 52416384 209689345
# Those are 100MB and 50MB blobs sitting in your pack
Clone time becomes painful. git clone downloads the entire history by default. A repository with 2GB of accumulated binary assets takes minutes on a fast connection and can time out on slower ones.
# A bloated repo clone
$ time git clone https://github.com/myorg/bloated-repo.git
Cloning into 'bloated-repo'...
remote: Enumerating objects: 48230, done.
remote: Counting objects: 100% (48230/48230), done.
remote: Compressing objects: 100% (12845/12845), done.
remote: Total 48230 (delta 31204), reused 47891 (delta 30987)
Receiving objects: 100% (48230/48230), 2.31 GiB | 8.42 MiB/s, done.
Resolving deltas: 100% (31204/31204), done.
Updating files: 100% (1847/1847), done.
real 4m38.214s
user 0m42.108s
sys 0m18.447s
Checkout and branch switching slow down. Even after cloning, switching branches requires Git to materialize files in the working tree. Large binary files make this noticeably slower, especially on spinning disks or network filesystems.
The fundamental issue is that Git treats every file the same way. LFS changes that by introducing a layer of indirection.
Git LFS Setup and Configuration
Git LFS is an open-source Git extension maintained by GitHub. It uses Git's clean and smudge filters to swap large files for small pointer files in your repository while storing the actual content on a dedicated LFS server.
Installation
# macOS
$ brew install git-lfs
# Ubuntu/Debian
$ sudo apt-get install git-lfs
# Windows (via installer or chocolatey)
$ choco install git-lfs
# Verify installation
$ git lfs version
git-lfs/3.5.1 (GitHub; windows amd64; go 1.21.8)
After installing the binary, you need to initialize LFS for your user account. This is a one-time operation per machine:
$ git lfs install
Updated git hooks.
Git LFS initialized.
This command adds smudge and clean filter configurations to your global .gitconfig:
[filter "lfs"]
clean = git-lfs clean -- %f
smudge = git-lfs smudge -- %f
process = git-lfs filter-process
required = true
Tracking Patterns with .gitattributes
LFS uses .gitattributes to determine which files should be managed by LFS. This file gets committed to the repo, so all collaborators automatically track the same patterns.
# Track specific file types
$ git lfs track "*.psd"
$ git lfs track "*.ai"
$ git lfs track "*.sketch"
$ git lfs track "*.zip"
$ git lfs track "*.tar.gz"
$ git lfs track "*.mp4"
$ git lfs track "*.pkl"
$ git lfs track "*.h5"
$ git lfs track "*.onnx"
# Track files in a specific directory regardless of extension
$ git lfs track "assets/designs/**"
$ git lfs track "models/weights/**"
# Track a specific file
$ git lfs track "data/training-dataset.csv"
The resulting .gitattributes file looks like this:
*.psd filter=lfs diff=lfs merge=lfs -text
*.ai filter=lfs diff=lfs merge=lfs -text
*.sketch filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.tar.gz filter=lfs diff=lfs merge=lfs -text
*.mp4 filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
assets/designs/** filter=lfs diff=lfs merge=lfs -text
models/weights/** filter=lfs diff=lfs merge=lfs -text
data/training-dataset.csv filter=lfs diff=lfs merge=lfs -text
When you commit a tracked file, Git stores a pointer file in the repository instead of the actual content:
$ cat .git/lfs/objects/ab/cd/abcd1234...
# This is the actual file content, stored locally
$ git show HEAD:assets/logo.psd
version https://git-lfs.github.com/spec/v1
oid sha256:abcd1234567890abcdef1234567890abcdef1234567890abcdef1234567890ab
size 48234567
That pointer file is only 130-140 bytes regardless of how large the actual file is. The real content lives in .git/lfs/objects/ locally and on the LFS server remotely.
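A quick sanity check you can run anywhere: peek at the first line of a file in your working tree. A pointer starts with the spec line shown above; a materialized binary prints garbage instead (the path here is illustrative).
$ head -n 1 assets/logo.psd
version https://git-lfs.github.com/spec/v1
# If you see binary noise instead, the real content has been smudged into place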
Verifying What LFS Is Tracking
# List tracked patterns
$ git lfs track
Listing tracked patterns
*.psd (.gitattributes)
*.ai (.gitattributes)
*.pkl (.gitattributes)
# List actual LFS objects in the current checkout
$ git lfs ls-files
abcd1234 * assets/logo.psd
ef567890 * models/weights/classifier.pkl
1a2b3c4d * assets/hero-video.mp4
# See LFS status
$ git lfs status
On branch main
Objects to be pushed to origin/main:
assets/logo.psd (LFS: abcd1234)
Objects to be committed:
(no changes)
LFS Storage Backends
GitHub
GitHub provides 1GB of free LFS storage and 1GB of bandwidth per month. Additional data packs cost $5/month for 50GB of storage and 50GB of bandwidth.
# GitHub LFS is automatic — just push to a GitHub remote
$ git remote add origin https://github.com/myorg/my-project.git
$ git push origin main
Uploading LFS objects: 100% (3/3), 142 MB | 12 MB/s, done.
Azure DevOps
Azure DevOps is the most generous for LFS. There is no separate LFS quota; LFS objects simply count against overall project storage (typically 250GB for free organizations). This makes it a strong choice for game studios or media-heavy projects.
# Azure DevOps LFS works the same way
$ git remote add origin https://dev.azure.com/myorg/myproject/_git/myrepo
$ git push origin main
GitLab
GitLab.com provides 5GB of LFS storage in the free tier. Self-managed GitLab lets you configure your own storage backend, including S3-compatible object storage.
Self-Hosted LFS Server
You can run your own LFS server if you need complete control over storage. The reference implementation is lfs-test-server, but for production you want something more robust:
# Using the open-source Rudolfs server (Rust, S3-backed)
$ cargo install rudolfs
# Configure with S3 backend
$ rudolfs \
--host 0.0.0.0:8080 \
--s3-bucket my-lfs-bucket \
--s3-region us-east-1 \
--key "$(cat lfs-encryption.key)"
Then configure your repo to use the custom server:
# .lfsconfig (committed to repo)
[lfs]
url = https://lfs.mycompany.com/myorg/myrepo
Migrating Existing Repos to LFS
This is where things get interesting. If you already have large files committed directly to your repository, simply adding LFS tracking going forward does not fix the history. Those old blobs still live in every packfile. You need to rewrite history.
Important: History rewriting is destructive. Every collaborator will need to re-clone after migration. Coordinate with your team before doing this.
Using git lfs migrate (Recommended)
The built-in git lfs migrate import command rewrites history so that files matching your patterns become LFS pointers. For more general history surgery, git-filter-repo is the modern replacement for git filter-branch and BFG: faster, safer, and more flexible. A straightforward LFS migration does not need it.
# Optional: install git-filter-repo for general-purpose history rewriting
$ pip install git-filter-repo
# First, identify the largest files in history
$ git rev-list --objects --all \
| git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
| awk '/^blob/ {print $3, $4}' \
| sort -rn \
| head -20
104857600 assets/design-mockup-v3.psd
104857600 assets/design-mockup-v2.psd
104857600 assets/design-mockup-v1.psd
52428800 data/training-data.csv
31457280 build/release-2.1.0.zip
# Migrate specific extensions to LFS throughout history
$ git lfs migrate import --include="*.psd,*.zip,*.csv" --everything
migrate: Sorting commits: ..., done.
migrate: Rewriting commits: 100% (1847/1847), done.
main abcd1234 -> ef567890
develop 1a2b3c4d -> 5e6f7a8b
migrate: Updating refs: ..., done.
migrate: checkout: ..., done.
The --everything flag rewrites all branches and tags. Without it, only the current branch is migrated.
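If you only want to rewrite selected branches rather than everything, git lfs migrate also accepts ref filters; the branch names below are illustrative.
# Limit the rewrite to specific refs
$ git lfs migrate import --include="*.psd" \
    --include-ref=refs/heads/main \
    --include-ref=refs/heads/develop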
# Verify the migration
$ git lfs ls-files --all | wc -l
47
# Check the new repo size
$ git gc --prune=now
$ du -sh .git
187M .git
# Down from 2.3GB
Using BFG Repo-Cleaner
BFG is an older tool but still works well for simple cases. It is a Java application so it runs anywhere with a JVM.
# Download BFG
$ wget https://repo1.maven.org/maven2/com/madgag/bfg/1.14.0/bfg-1.14.0.jar
# Clone a fresh mirror
$ git clone --mirror https://github.com/myorg/my-project.git
# Remove files larger than 10MB from history
$ java -jar bfg-1.14.0.jar --strip-blobs-bigger-than 10M my-project.git
# Clean up
$ cd my-project.git
$ git reflog expire --expire=now --all
$ git gc --prune=now --aggressive
# Push the rewritten history
$ git push --force
Post-Migration Checklist
After rewriting history, every team member needs to:
# Option 1: Fresh clone (safest)
$ rm -rf my-project
$ git clone https://github.com/myorg/my-project.git
# Option 2: Reset existing clone (if they have no local changes)
$ git fetch origin
$ git reset --hard origin/main
$ git lfs pull
LFS Locking for Binary Files
Binary files cannot be merged. If two people edit the same Photoshop file simultaneously, one person's work gets lost. LFS provides file locking to prevent this.
# Make sure the LFS hooks are in place (locking relies on them);
# --force overwrites any existing hooks
$ git lfs install --force
# Configure lockable file types in .gitattributes
*.psd filter=lfs diff=lfs merge=lfs -text lockable
*.ai filter=lfs diff=lfs merge=lfs -text lockable
*.sketch filter=lfs diff=lfs merge=lfs -text lockable
# Lock a file before editing
$ git lfs lock assets/hero-banner.psd
Locked assets/hero-banner.psd
# See all locks
$ git lfs locks
ID Path Owner Locked At
1234 assets/hero-banner.psd shane 2026-02-08T10:30:00Z
1235 assets/icon-set.ai maria 2026-02-07T14:15:00Z
# Unlock when done
$ git lfs unlock assets/hero-banner.psd
Unlocked assets/hero-banner.psd
# Force unlock someone else's file (admin only)
$ git lfs unlock assets/icon-set.ai --force
The lockable attribute in .gitattributes makes tracked files read-only in the working tree by default. You must explicitly lock a file before editing it, which prevents accidental concurrent edits.
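You can see the effect directly. The output below is illustrative, but the pattern holds: lockable files are checked out read-only, and locking makes your local copy writable while you hold the lock.
$ ls -l assets/hero-banner.psd
-r--r--r-- 1 shane staff 89128960 Feb  8 10:29 assets/hero-banner.psd
$ git lfs lock assets/hero-banner.psd
Locked assets/hero-banner.psd
$ ls -l assets/hero-banner.psd
-rw-r--r-- 1 shane staff 89128960 Feb  8 10:30 assets/hero-banner.psd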
LFS in CI/CD Pipelines
LFS in CI/CD introduces two concerns: bandwidth consumption and build speed. Every CI run that clones the repo downloads LFS objects, which costs bandwidth and time.
GitHub Actions
# .github/workflows/build.yml
name: Build
on: [push, pull_request]
jobs:
build:
runs-on: ubuntu-latest
steps:
- name: Checkout with LFS
uses: actions/checkout@v4
with:
lfs: true
# Only fetch LFS objects needed for current commit
# not the entire LFS history
fetch-depth: 1
- name: Cache LFS objects
uses: actions/cache@v4
with:
path: .git/lfs
key: lfs-${{ hashFiles('.gitattributes') }}-${{ github.sha }}
restore-keys: |
lfs-${{ hashFiles('.gitattributes') }}-
- name: Pull LFS files
run: git lfs pull
- name: Setup Node.js
uses: actions/setup-node@v4
with:
node-version: '20'
- name: Install and build
run: |
npm ci
npm run build
GitLab CI
# .gitlab-ci.yml
variables:
GIT_LFS_SKIP_SMUDGE: "1" # Skip automatic LFS download
stages:
- build
build:
stage: build
before_script:
# Only pull LFS files we actually need
- git lfs pull --include="assets/production/**" --exclude=""
script:
- npm ci
- npm run build
cache:
key: lfs-cache
paths:
- .git/lfs/
Selective LFS Fetching
The biggest optimization is not downloading LFS files you do not need. If your CI pipeline only runs tests and does not need design assets, skip them entirely:
# Skip all LFS downloads during clone
$ GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/myorg/project.git
# Then selectively pull only what you need
$ git lfs pull --include="models/production/*.onnx" --exclude="assets/designs/*"
This can reduce CI bandwidth usage by 80% or more depending on your file distribution.
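To make the same rules stick for a clone instead of remembering the flags, you can persist them with the standard lfs.fetchinclude and lfs.fetchexclude config keys; the patterns below are examples.
# Persist selective fetching for this clone
$ git config lfs.fetchinclude "models/production/*.onnx"
$ git config lfs.fetchexclude "assets/designs/*"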
Alternatives to LFS
LFS is not the only game in town, and it is not always the right choice. Here are the major alternatives and when each one makes sense.
git-annex
git-annex predates LFS and is more flexible but also more complex. It supports a wider range of storage backends, including local drives, SSH servers, S3, Glacier, and even Bittorrent.
# Initialize git-annex in a repo
$ git annex init "my laptop"
init my laptop ok
# Add a large file
$ git annex add datasets/large-training-set.tar.gz
add datasets/large-training-set.tar.gz ok
# The file becomes a symlink to the annex
$ ls -la datasets/large-training-set.tar.gz
lrwxrwxrwx 1 shane shane 198 Feb 8 10:00 datasets/large-training-set.tar.gz ->
.git/annex/objects/Xk/9V/SHA256E-s524288000--abcd1234.../SHA256E-s524288000--abcd1234...
# Configure an S3 remote
$ git annex initremote s3-backup type=S3 bucket=my-annex-bucket encryption=none
# Sync content to the remote
$ git annex copy --to s3-backup datasets/large-training-set.tar.gz
copy datasets/large-training-set.tar.gz (to s3-backup...) ok
When to use git-annex: You need fine-grained control over where data lives. git-annex tracks which remotes have which files and can enforce redundancy policies. It is popular in scientific computing and archival workflows.
When to avoid it: Your team uses Windows (git-annex support for Windows has historically been fragile), or you want something that works out of the box with GitHub/GitLab.
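The redundancy and location tracking mentioned above look like this in practice; the UUIDs and output are illustrative.
# Require at least two copies of every annexed file
$ git annex numcopies 2
# Ask which remotes hold a given file
$ git annex whereis datasets/large-training-set.tar.gz
whereis datasets/large-training-set.tar.gz (2 copies)
  8a5c2e1f-0b3d-4f6a-9c7e-1d2f3a4b5c6d -- my laptop [here]
  3b9d0c7a-5e6f-4a1b-8c2d-9e0f1a2b3c4d -- [s3-backup]
ok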
DVC (Data Version Control)
DVC is built specifically for machine learning workflows. It tracks data files, models, and pipelines alongside your code in Git.
# Install DVC
$ pip install dvc[s3]
# Initialize DVC in a Git repo
$ dvc init
# Configure S3 as remote storage
$ dvc remote add -d myremote s3://my-dvc-bucket/project
$ dvc remote modify myremote region us-east-1
# Track a large dataset
$ dvc add data/training-images/
This creates a .dvc pointer file:
# data/training-images.dvc
outs:
- md5: abcdef1234567890abcdef1234567890.dir
size: 2147483648
nfiles: 50000
hash: md5
path: training-images
DVC also supports pipeline tracking, which is where it really shines:
# dvc.yaml
stages:
preprocess:
cmd: python src/preprocess.py
deps:
- src/preprocess.py
- data/raw/
outs:
- data/processed/
train:
cmd: python src/train.py
deps:
- src/train.py
- data/processed/
outs:
- models/classifier.pkl
metrics:
- metrics/accuracy.json:
cache: false
# Reproduce the entire pipeline
$ dvc repro
Running stage 'preprocess':
> python src/preprocess.py
Generating processed dataset...done (50000 images)
Running stage 'train':
> python src/train.py
Training classifier...done (accuracy: 0.94)
# Push data and models to remote
$ dvc push
50001 files pushed
When to use DVC: You are building ML pipelines with large datasets and model files. DVC integrates experiment tracking, pipeline reproducibility, and data versioning in one tool.
Cloud Storage with Pointer Files (DIY Approach)
Sometimes the simplest approach is the right one. Store large files in S3, GCS, or Azure Blob Storage and commit a manifest file to Git.
// scripts/sync-assets.js
var AWS = require("aws-sdk");
var fs = require("fs");
var path = require("path");
var crypto = require("crypto");
var s3 = new AWS.S3({ region: "us-east-1" });
var BUCKET = "my-project-assets";
var MANIFEST_PATH = path.join(__dirname, "..", "assets-manifest.json");
function hashFile(filePath) {
return new Promise(function(resolve, reject) {
var hash = crypto.createHash("sha256");
var stream = fs.createReadStream(filePath);
stream.on("data", function(chunk) { hash.update(chunk); });
stream.on("end", function() { resolve(hash.digest("hex")); });
stream.on("error", reject);
});
}
function uploadAsset(filePath, hash) {
var key = "assets/" + hash + "/" + path.basename(filePath);
var body = fs.createReadStream(filePath);
return s3.upload({
Bucket: BUCKET,
Key: key,
Body: body
}).promise().then(function(result) {
console.log("Uploaded: " + filePath + " -> s3://" + BUCKET + "/" + key);
return { path: filePath, hash: hash, s3Key: key, size: fs.statSync(filePath).size };
});
}
function downloadAsset(entry, destDir) {
var dest = path.join(destDir, path.basename(entry.path));
var file = fs.createWriteStream(dest);
return new Promise(function(resolve, reject) {
s3.getObject({ Bucket: BUCKET, Key: entry.s3Key })
.createReadStream()
.pipe(file)
.on("finish", function() {
console.log("Downloaded: " + dest);
resolve();
})
.on("error", reject);
});
}
function syncUp(directory) {
var manifest = {};
var files = fs.readdirSync(directory).filter(function(f) {
var ext = path.extname(f).toLowerCase();
return [".psd", ".ai", ".mp4", ".zip", ".pkl"].indexOf(ext) !== -1;
});
var uploads = files.map(function(file) {
var filePath = path.join(directory, file);
return hashFile(filePath).then(function(hash) {
return uploadAsset(filePath, hash);
});
});
Promise.all(uploads).then(function(results) {
results.forEach(function(entry) {
manifest[entry.path] = entry;
});
fs.writeFileSync(MANIFEST_PATH, JSON.stringify(manifest, null, 2));
console.log("Manifest written: " + MANIFEST_PATH);
}).catch(function(err) {
console.error("Sync failed:", err);
process.exit(1);
});
}
// Usage: node scripts/sync-assets.js upload ./assets
// Usage: node scripts/sync-assets.js download ./assets
var command = process.argv[2];
var dir = process.argv[3] || "./assets";
if (command === "upload") {
syncUp(dir);
} else if (command === "download") {
var manifest = JSON.parse(fs.readFileSync(MANIFEST_PATH, "utf8"));
var downloads = Object.values(manifest).map(function(entry) {
return downloadAsset(entry, dir);
});
Promise.all(downloads).then(function() {
console.log("All assets downloaded.");
});
}
Add the manifest to Git and ignore the actual files:
# .gitignore
assets/*.psd
assets/*.ai
assets/*.mp4
models/*.pkl
models/*.onnx
# But track the manifest
!assets-manifest.json
When to use this approach: You want full control over storage, already have cloud infrastructure, and your team is comfortable with a custom workflow. This avoids LFS bandwidth costs entirely.
Cost Considerations Across Providers
Cost is a real factor when managing large files at scale. Here is how the major providers compare as of early 2026:
| Provider | Free Storage | Free Bandwidth | Paid Storage | Paid Bandwidth |
|---|---|---|---|---|
| GitHub | 1 GB | 1 GB/month | $5/50 GB | $5/50 GB |
| GitLab | 5 GB | 10 GB/month | $0.10/GB | Included |
| Azure DevOps | 250 GB* | Unlimited | Included | Included |
| Bitbucket | 1 GB | 1 GB/month | $10/100 GB | $10/100 GB |
| Self-hosted (S3) | N/A | N/A | ~$0.023/GB | ~$0.09/GB transfer |
*Azure DevOps includes LFS in project storage limits.
For a team with 50GB of binary assets and active CI/CD, monthly costs look roughly like this:
- GitHub: $5-15/month (depending on CI bandwidth)
- GitLab: $5/month
- Azure DevOps: $0 (within free tier)
- Self-hosted S3: $2-5/month
Azure DevOps is the clear winner on cost for LFS-heavy projects. If you are on GitHub and hitting bandwidth limits, the DIY S3 approach can be cheaper than buying data packs.
Monitoring LFS Storage Usage
You need visibility into what LFS is consuming. Without it, you will get surprised by costs or storage limits.
# Repository size on GitHub (returned in KB; LFS storage is billed separately)
$ gh api /repos/{owner}/{repo} --jq '.size'
# More detailed: GitHub API for LFS usage
$ gh api /orgs/{org}/settings/billing/shared-storage
{
"days_left_in_billing_cycle": 22,
"estimated_paid_storage_for_month": 12.5,
"estimated_storage_for_month": 62.5
}
# Local LFS usage
$ git lfs env
Endpoint=https://github.com/myorg/project.git/info/lfs (auth=basic)
LocalMediaDir=/home/shane/project/.git/lfs/objects
TempDir=/home/shane/project/.git/lfs/tmp
$ du -sh .git/lfs
847M .git/lfs
You can build a simple monitoring script for your Node.js project:
// scripts/lfs-report.js
var childProcess = require("child_process");
function run(cmd) {
return childProcess.execSync(cmd, { encoding: "utf8" }).trim();
}
function getLfsReport() {
var files = run("git lfs ls-files --size").split("\n").filter(Boolean);
var totalSize = 0;
var report = files.map(function(line) {
// Each line looks like: "<oid> <*|-> <path> (<size> <unit>)"
var match = line.match(/^(\S+)\s+([*-])\s+(.+)\s+\(([\d.]+)\s*(B|KB|MB|GB|TB)\)$/);
if (!match) return null;
var oid = match[1];
var indicator = match[2];
var filePath = match[3];
var sizeStr = match[4] + " " + match[5];
var sizeBytes = parseFloat(match[4]);
if (match[5] === "KB") sizeBytes *= 1024;
if (match[5] === "MB") sizeBytes *= 1024 * 1024;
if (match[5] === "GB") sizeBytes *= 1024 * 1024 * 1024;
if (match[5] === "TB") sizeBytes *= 1024 * 1024 * 1024 * 1024;
totalSize += sizeBytes;
return {
path: filePath,
size: sizeStr,
oid: oid.substring(0, 10),
downloaded: indicator === "*" // "*" = content present locally, "-" = pointer only
};
}).filter(Boolean);
console.log("=== Git LFS Storage Report ===\n");
console.log("Total LFS objects: " + report.length);
console.log("Total size: " + (totalSize / (1024 * 1024)).toFixed(2) + " MB\n");
console.log("Files:");
report.forEach(function(entry) {
var status = entry.downloaded ? "[local]" : "[remote]";
console.log(" " + status + " " + entry.path + " (" + entry.size + ")");
});
var patterns = run("git lfs track").split("\n").filter(function(line) {
return line.indexOf("*") !== -1 || line.indexOf("/") !== -1;
});
console.log("\nTracked patterns:");
patterns.forEach(function(p) {
console.log(" " + p.trim());
});
}
getLfsReport();
$ node scripts/lfs-report.js
=== Git LFS Storage Report ===
Total LFS objects: 23
Total size: 847.32 MB
Files:
[local] assets/hero-banner.psd (104.2 MB)
[local] assets/brand-guide.ai (52.1 MB)
[local] models/classifier.pkl (312.5 MB)
[remote] assets/old-mockup.psd (98.7 MB)
...
Tracked patterns:
*.psd (.gitattributes)
*.ai (.gitattributes)
*.pkl (.gitattributes)
Complete Working Example
Let's set up Git LFS for a real Node.js project that has design assets, build artifacts, and ML model files. We will configure LFS, set up CI/CD, and migrate existing large files.
Project Structure
my-node-project/
package.json
app.js
src/
routes/
models/
services/
inference.js # Uses ML model for predictions
assets/
designs/
homepage.psd # 85 MB
brand-kit.ai # 45 MB
images/
hero.jpg # 2 MB (keep in Git - small enough)
models/
sentiment.onnx # 200 MB
embeddings.pkl # 150 MB
dist/ # Build output
bundle.js
bundle.js.map
scripts/
lfs-report.js
post-checkout.sh
Step 1: Initialize LFS and Configure Tracking
# Initialize LFS in the project
$ cd my-node-project
$ git lfs install
Updated git hooks.
Git LFS initialized.
# Track large binary formats
$ git lfs track "*.psd"
$ git lfs track "*.ai"
$ git lfs track "*.sketch"
$ git lfs track "*.onnx"
$ git lfs track "*.pkl"
$ git lfs track "*.h5"
$ git lfs track "*.tar.gz"
# Track entire directories for build artifacts
$ git lfs track "dist/**"
# Commit the .gitattributes first
$ git add .gitattributes
$ git commit -m "Configure Git LFS tracking patterns"
Step 2: Add Package Scripts
{
"name": "my-node-project",
"version": "2.1.0",
"scripts": {
"start": "node app.js",
"build": "webpack --mode production",
"lfs:report": "node scripts/lfs-report.js",
"lfs:prune": "git lfs prune --verify-remote",
"postinstall": "git lfs pull --include='models/*.onnx'"
},
"dependencies": {
"express": "^4.18.2",
"onnxruntime-node": "^1.17.0"
}
}
The postinstall script ensures ML model files are downloaded after npm install, which is critical for development setup.
Step 3: Configure Selective Fetching
# .lfsconfig
[lfs]
fetchinclude = models/*, assets/images/*
fetchexclude = assets/designs/*
This configuration means developers get ML models (needed to run the app) but not design files (only needed by designers) by default. Designers can pull their files manually:
$ git lfs pull --include="assets/designs/*"
Step 4: Set Up the Inference Service
// src/services/inference.js
var ort = require("onnxruntime-node");
var path = require("path");
var fs = require("fs");
var MODEL_PATH = path.join(__dirname, "..", "..", "models", "sentiment.onnx");
var session = null;
function loadModel() {
if (!fs.existsSync(MODEL_PATH)) {
console.error("Model file not found at: " + MODEL_PATH);
console.error("Run 'git lfs pull --include=models/*.onnx' to download.");
process.exit(1);
}
var stats = fs.statSync(MODEL_PATH);
if (stats.size < 1000) {
// LFS pointer file is ~130 bytes; actual model is 200MB
console.error("Model file appears to be an LFS pointer, not the actual model.");
console.error("Run 'git lfs pull --include=models/*.onnx' to download.");
process.exit(1);
}
return ort.InferenceSession.create(MODEL_PATH).then(function(s) {
session = s;
console.log("Model loaded: " + MODEL_PATH + " (" + (stats.size / 1024 / 1024).toFixed(1) + " MB)");
return session;
});
}
function predict(inputData) {
if (!session) {
return Promise.reject(new Error("Model not loaded. Call loadModel() first."));
}
var tensor = new ort.Tensor("float32", inputData, [1, inputData.length]);
var feeds = { input: tensor };
return session.run(feeds).then(function(results) {
return results.output.data;
});
}
module.exports = {
loadModel: loadModel,
predict: predict
};
The LFS pointer size check in loadModel is important. If someone clones without LFS or LFS fails silently, you get a 130-byte text file instead of a 200MB model. Without this check, you get cryptic ONNX parsing errors.
Step 5: CI/CD Pipeline
# .github/workflows/build.yml
name: Build and Test
on:
push:
branches: [main, develop]
pull_request:
branches: [main]
jobs:
test:
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v4
with:
lfs: false # Don't download all LFS files
fetch-depth: 1
- name: Cache LFS objects
uses: actions/cache@v4
id: lfs-cache
with:
path: .git/lfs
key: lfs-v1-${{ hashFiles('**/.gitattributes') }}-${{ github.sha }}
restore-keys: |
lfs-v1-${{ hashFiles('**/.gitattributes') }}-
- name: Pull required LFS files
run: |
git lfs install
# Only pull model files needed for tests
git lfs pull --include="models/*.onnx"
# Verify the files are real, not pointers
for f in models/*.onnx; do
size=$(stat --format=%s "$f" 2>/dev/null || stat -f%z "$f")
if [ "$size" -lt 1000 ]; then
echo "ERROR: $f appears to be an LFS pointer ($size bytes)"
exit 1
fi
done
- name: Setup Node.js
uses: actions/setup-node@v4
with:
node-version: '20'
cache: 'npm'
- name: Install dependencies
run: npm ci
- name: Run tests
run: npm test
- name: Build
run: npm run build
lint:
runs-on: ubuntu-latest
steps:
- name: Checkout (no LFS)
uses: actions/checkout@v4
with:
lfs: false
fetch-depth: 1
# Lint job doesn't need any LFS files
- name: Setup Node.js
uses: actions/setup-node@v4
with:
node-version: '20'
cache: 'npm'
- name: Install and lint
run: |
npm ci
npm run lint
Notice how the lint job skips LFS entirely. It does not need model files or design assets to check code style. This saves bandwidth and speeds up the pipeline.
Step 6: Migrate Existing Large Files
If the project already has large files committed directly:
# See what is taking up space
$ git lfs migrate info --everything --above=1mb
migrate: Sorting commits: ..., done.
migrate: Examining commits: 100% (1847/1847), done.
*.psd 285.3 MB 3/3 files 100%
*.onnx 200.1 MB 1/1 files 100%
*.pkl 150.4 MB 1/1 files 100%
*.ai 45.2 MB 1/1 files 100%
*.zip 31.0 MB 2/2 files 100%
# Migrate everything matching our tracking patterns
$ git lfs migrate import \
--include="*.psd,*.ai,*.onnx,*.pkl,*.zip,*.tar.gz" \
--everything
migrate: Sorting commits: ..., done.
migrate: Rewriting commits: 100% (1847/1847), done.
main abcd1234 -> 5e6f7a8b
develop 1a2b3c4d -> 9c0d1e2f
migrate: Updating refs: ..., done.
migrate: checkout: ..., done.
# Verify
$ du -sh .git
193M .git
# Down from 1.9 GB
# Force push the rewritten history
$ git push --force-with-lease origin main develop
Common Issues and Troubleshooting
1. "Encountered N file(s) that should have been pointers"
$ git push origin main
LFS upload missing objects:
(missing) assets/logo.psd (abcd1234 -- 48234567)
Encountered 1 file(s) that should have been pointers, but weren't:
assets/logo.psd
This happens when a file was committed before LFS tracking was configured, or when someone commits with GIT_LFS_SKIP_SMUDGE=1 set. Fix it by re-adding the file:
# Remove the file from the index (not disk)
$ git rm --cached assets/logo.psd
# Re-add it — LFS filters will now process it
$ git add assets/logo.psd
$ git commit -m "Convert logo.psd to LFS pointer"
$ git push
2. "batch response: Rate limit exceeded"
$ git lfs pull
batch response: Rate limit exceeded. Please retry in 3600 seconds.
error: failed to fetch some objects from 'https://github.com/myorg/project.git/info/lfs'
GitHub rate-limits LFS API calls. This typically hits CI pipelines making many concurrent requests. Solutions:
# Use LFS caching in CI (see CI/CD section above)
# Or set a custom transfer adapter with retries
$ git config lfs.transfer.maxretries 5
$ git config lfs.transfer.maxretrydelay 30
# Or use a GitHub Personal Access Token for higher limits
$ git config lfs.url "https://<TOKEN>@github.com/myorg/project.git/info/lfs"
3. "smudge filter lfs failed"
$ git checkout develop
Filtering content: 100% (47/47), 812.43 MiB | 2.10 MiB/s, done.
error: external filter 'git-lfs filter-process' failed
fatal: assets/mockup.psd: smudge filter lfs failed
This usually means git-lfs is not installed or not on the PATH. It can also occur when LFS storage is unavailable.
# Verify LFS is installed and accessible
$ which git-lfs
/usr/bin/git-lfs
$ git lfs env
# Check the endpoint URL and authentication
# Skip LFS temporarily to unblock yourself
$ GIT_LFS_SKIP_SMUDGE=1 git checkout develop
# Then pull LFS files separately
$ git lfs pull
4. "Repository over storage quota"
$ git push origin main
remote: error: GH008: Your push was rejected because the repository exceeded
its data quota. Purchase additional data packs to restore access.
You have exceeded your LFS storage limit. Options:
# Check what is using space
$ git lfs ls-files --size
abcd1234 * assets/old-mockup-v1.psd (104 MB)
ef567890 * assets/old-mockup-v2.psd (104 MB)
1a2b3c4d * assets/old-mockup-v3.psd (104 MB)
# 312 MB of old mockups nobody needs
# Prune old LFS objects not referenced by recent commits
$ git lfs prune --verify-remote --recent-refs-days=30
prune: 14 local objects, 6 retained, done.
prune: Deleting objects: 100% (8/8), done.
# If that's not enough, remove old files from history entirely
$ git lfs migrate import --include="*.psd" --everything
# Then push with --force-with-lease
# Or purchase additional storage
# GitHub: Settings -> Billing -> Git LFS data
5. "Unable to find source for object"
$ git lfs fetch --all
fetch: Fetching reference refs/heads/main
error: Unable to find source for object abcdef1234567890 (try running git lfs fetch --all)
This occurs when an LFS object was never pushed to the remote, or was deleted from remote storage. If you have the object locally on another machine:
# On the machine that has the object
$ git lfs push --all origin
# If no machine has it, the file is lost
# Remove the broken reference
$ git rm assets/lost-file.psd
$ git commit -m "Remove orphaned LFS reference"
Best Practices
- Set up .gitattributes before committing any large files. Retroactive migration works but is painful. Start tracking patterns from day one.
- Use fetchinclude and fetchexclude in .lfsconfig. Not every developer needs every large file. Designers need PSD files. ML engineers need model weights. Backend developers might need neither. Selective fetching keeps clone times fast.
- Cache LFS objects in CI/CD. Without caching, every CI run downloads every LFS file. Cache the .git/lfs directory with a key based on .gitattributes content and the commit SHA.
- Enable file locking for unmergeable formats. PSD, AI, Sketch, and other binary formats cannot be three-way merged. Use the lockable attribute in .gitattributes to prevent concurrent edits and lost work.
- Add LFS pointer validation to your build scripts. When LFS is misconfigured or unavailable, you get tiny text pointer files instead of actual binary content. Check file sizes in your application startup or build process to fail fast with a clear error message (see the sketch after this list).
- Run git lfs prune periodically. LFS keeps old versions of files in .git/lfs/objects/ after they are no longer referenced by your current branch. Pruning reclaims local disk space. Use --verify-remote to ensure objects exist on the server before deleting locally.
- Monitor storage and bandwidth usage monthly. LFS costs can creep up silently. Set up alerts on your Git hosting provider when usage exceeds 80% of your plan limits.
- Consider Azure DevOps for LFS-heavy projects. If your primary concern is cost and your team is not locked into GitHub or GitLab, Azure DevOps offers the most generous LFS limits at no additional cost.
- Keep files under 5 MB in regular Git. Not every binary file needs LFS. Small icons, compressed thumbnails, and configuration files are fine in regular Git. LFS adds complexity, so only use it where the size justifies the overhead.
- Document your LFS setup in the project README. New developers need to know to install git-lfs, which files are tracked, and how to pull specific LFS assets. A three-line setup section prevents hours of confusion.
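Here is a minimal sketch of the pointer-validation practice, as a hypothetical scripts/check-lfs-pointers.sh you could call from a build step or Git hook. It assumes tracked paths contain no spaces and simply looks for the LFS spec line at the top of each file.
#!/usr/bin/env bash
# Fail if any LFS-tracked file in the working tree is still a pointer.
bad=""
for file in $(git lfs ls-files --name-only); do
  if head -c 100 "$file" | grep -q "git-lfs.github.com/spec"; then
    bad="$bad $file"
  fi
done
if [ -n "$bad" ]; then
  echo "ERROR: LFS pointers found instead of real content:$bad" >&2
  echo "Run 'git lfs pull' and try again." >&2
  exit 1
fi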
References
- Git LFS Official Documentation
- Git LFS Specification
- GitHub LFS Documentation
- git-filter-repo - History rewriting tool
- BFG Repo-Cleaner - Legacy history cleaner
- DVC Documentation - Data Version Control for ML
- git-annex - Distributed file synchronization
- Rudolfs - Self-hosted LFS server with S3 backend
- Azure DevOps Git LFS - Azure LFS documentation
