Block Caching

How to use block-level caching to optimize random access reads from remote sources.

When to Use Block Caching

Block caching improves performance in these scenarios:

  • Scattered random reads: Accessing individual files spread across the archive
  • Remote sources with latency: HTTP sources where each range request has overhead
  • Partial file reads: Reading portions of large files multiple times

Block caching is NOT recommended for:

  • Sequential directory extraction (use CopyDir without caching)
  • Single-pass full file reads (block overhead exceeds benefit)
  • Local file sources (no network latency to hide)

How It Differs from Content Caching

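| Aspect        | Content Cache                   | Block Cache                               |
|---------------|---------------------------------|-------------------------------------------|
| Cache key     | SHA256 of file content          | SHA256(sourceID + blockSize + blockIndex) |
| Granularity   | Whole files                     | Fixed-size blocks (default 64 KB)         |
| Deduplication | Across archives                 | Within single source                      |
| Best for      | Repeated access, shared content | Random access, remote sources             |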

Use content caching when you read the same files repeatedly or share content across archives. Use block caching when you need fast random access to remote data.
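To make the block cache key from the table concrete, here is a minimal sketch of deriving a key from the three listed inputs. The function name `blockKey` and the byte layout (source ID, a NUL separator, then two big-endian integers) are assumptions made for illustration; the library's actual encoding may differ.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

// blockKey hashes the three inputs named in the comparison table.
// The byte layout used here is an assumption, not the library's format.
func blockKey(sourceID string, blockSize, blockIndex uint64) string {
	h := sha256.New()
	h.Write([]byte(sourceID))
	h.Write([]byte{0}) // separator so the ID cannot run into the integers
	var buf [16]byte
	binary.BigEndian.PutUint64(buf[:8], blockSize)
	binary.BigEndian.PutUint64(buf[8:], blockIndex)
	h.Write(buf[:])
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	id := `url:https://example.com/data|etag:"abc123"`
	fmt.Println(blockKey(id, 64<<10, 42))
	fmt.Println(blockKey(id, 64<<10, 43))  // neighbouring block: different key
	fmt.Println(blockKey(id, 128<<10, 42)) // different block size: different key
}
```

Because the block size is part of the key, blocks cached at one block size are not reused after the block size changes, and because the source ID is part of the key, two sources never share blocks.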

Creating a Block Cache

To create a disk-backed block cache:

import "github.com/meigma/blob/cache/disk"

blockCache, err := disk.NewBlockCache("/path/to/cache")
if err != nil {
	return err
}

Block Cache Options

Configure the block cache with options:

blockCache, err := disk.NewBlockCache("/path/to/cache",
	disk.WithBlockMaxBytes(512 << 20), // 512 MB cache limit
	disk.WithBlockShardPrefixLen(3),   // 3-character sharding
	disk.WithBlockDirPerm(0o750),      // Directory permissions
)
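One plausible reading of the shard prefix option is that the first few characters of a block's hex key name a subdirectory, so that no single directory accumulates every block file. The sketch below illustrates that idea only; `shardPath` and the resulting directory layout are hypothetical and not taken from the library.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// shardPath places a hex cache key under a subdirectory named after its
// first prefixLen characters. The layout is an assumption for illustration.
func shardPath(root, key string, prefixLen int) string {
	if prefixLen <= 0 || prefixLen >= len(key) {
		return filepath.Join(root, key) // no sharding possible
	}
	return filepath.Join(root, key[:prefixLen], key)
}

func main() {
	fmt.Println(shardPath("/path/to/cache", "a3f9c07e", 3))
	fmt.Println(shardPath("/path/to/cache", "a3f9c07e", 0))
}
```

Under this assumed layout, a 3-character hex prefix spreads blocks over up to 16^3 = 4096 subdirectories.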

Wrapping a Source with Block Caching

To add block caching to an HTTP source:

import (
	"github.com/meigma/blob"
	"github.com/meigma/blob/cache/disk"
	"github.com/meigma/blob/http"
)

// Create HTTP source
source, err := http.NewSource(dataURL)
if err != nil {
	return err
}

// Create block cache
blockCache, err := disk.NewBlockCache("/var/cache/blob-blocks")
if err != nil {
	return err
}

// Wrap source with caching
cachedSource, err := blockCache.Wrap(source)
if err != nil {
	return err
}

// Use cachedSource with blob.New
archive, err := blob.New(indexData, cachedSource)
if err != nil {
	return err
}
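To show what wrapping does conceptually, the following is a minimal, standard-library-only sketch of a block-caching `io.ReaderAt`. It is not the library's implementation: `memSource` and `blockReader` are hypothetical names, an in-memory map stands in for the disk cache, and there is no eviction or locking.

```go
package main

import (
	"fmt"
	"io"
)

// memSource is an in-memory stand-in for a remote source; it counts ReadAt calls.
type memSource struct {
	data  []byte
	calls int
}

func (m *memSource) ReadAt(p []byte, off int64) (int, error) {
	m.calls++
	if off >= int64(len(m.data)) {
		return 0, io.EOF
	}
	n := copy(p, m.data[off:])
	if n < len(p) {
		return n, io.EOF
	}
	return n, nil
}

// blockReader serves reads from fixed-size blocks and fetches each block
// from src at most once. Not safe for concurrent use.
type blockReader struct {
	src       io.ReaderAt
	size      int64 // total length of the source
	blockSize int64
	blocks    map[int64][]byte
}

func newBlockReader(src io.ReaderAt, size, blockSize int64) *blockReader {
	return &blockReader{src: src, size: size, blockSize: blockSize, blocks: make(map[int64][]byte)}
}

// block returns block idx, fetching it from the source on a cache miss.
func (b *blockReader) block(idx int64) ([]byte, error) {
	if blk, ok := b.blocks[idx]; ok {
		return blk, nil
	}
	start := idx * b.blockSize
	length := b.blockSize
	if rest := b.size - start; rest < length {
		length = rest // the final block may be short
	}
	blk := make([]byte, length)
	if n, err := b.src.ReadAt(blk, start); n < len(blk) {
		if err == nil {
			err = io.ErrUnexpectedEOF
		}
		return nil, err
	}
	b.blocks[idx] = blk
	return blk, nil
}

// ReadAt copies the requested range out of the cached blocks it touches.
func (b *blockReader) ReadAt(p []byte, off int64) (int, error) {
	if off < 0 {
		return 0, fmt.Errorf("negative offset %d", off)
	}
	n := 0
	for n < len(p) {
		pos := off + int64(n)
		if pos >= b.size {
			return n, io.EOF
		}
		blk, err := b.block(pos / b.blockSize)
		if err != nil {
			return n, err
		}
		n += copy(p[n:], blk[pos%b.blockSize:])
	}
	return n, nil
}

func main() {
	src := &memSource{data: make([]byte, 16<<10)}
	cached := newBlockReader(src, int64(len(src.data)), 4096)

	buf := make([]byte, 100)
	for i := 0; i < 3; i++ {
		if _, err := cached.ReadAt(buf, 5000); err != nil {
			panic(err)
		}
	}
	fmt.Println("source reads:", src.calls)
}
```

The three reads in `main` all fall inside the same 4 KB block, so only the first one reaches the source.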

Wrap Options

Configure per-source wrapping behavior:

cachedSource, err := blockCache.Wrap(source,
	cache.WithBlockSize(128 << 10),  // 128 KB blocks
	cache.WithMaxBlocksPerRead(8),   // Bypass for reads spanning > 8 blocks
)

Block Size Selection

The block size affects cache efficiency:

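| Block Size      | Best For                         | Trade-offs                    |
|-----------------|----------------------------------|-------------------------------|
| 16 KB           | Small files, fine-grained access | More metadata overhead        |
| 64 KB (default) | Balanced workloads               | Good general choice           |
| 256 KB          | Large sequential reads           | Wasted space on partial reads |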

Choose smaller blocks when files are small or access is fine-grained. Choose larger blocks when reads tend to be sequential within files.
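The trade-off can be estimated with simple arithmetic: with a cold cache, a read transfers every whole block it touches, however few bytes it needs. The sketch below uses a hypothetical helper, `bytesFetched`, and ignores truncation of the final block at the end of the source.

```go
package main

import "fmt"

// bytesFetched returns how many bytes whole-block fetching transfers for a
// read of n bytes at offset off, assuming none of the blocks are cached yet.
func bytesFetched(off, n, blockSize int64) int64 {
	if n <= 0 {
		return 0
	}
	first := off / blockSize
	last := (off + n - 1) / blockSize
	return (last - first + 1) * blockSize
}

func main() {
	const off, n = 100_000, 4 << 10 // a 4 KB read in the middle of a file
	for _, bs := range []int64{16 << 10, 64 << 10, 256 << 10} {
		got := bytesFetched(off, n, bs)
		fmt.Printf("block size %3d KB: fetches %3d KB for a %d KB read\n", bs>>10, got>>10, n>>10)
	}
}
```

For this isolated 4 KB read, 256 KB blocks transfer 64 times the requested data. Larger blocks pay off only when later reads land in the same block.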

Bypass for Large Reads

The MaxBlocksPerRead option bypasses caching when a single ReadAt spans too many blocks. This prevents sequential reads from polluting the block cache:

// Default: 4 blocks.
// A block-aligned 256 KB read with 64 KB blocks spans 4 blocks: cached.
// A 1 MB read with 64 KB blocks spans at least 16 blocks: bypassed.

cachedSource, err := blockCache.Wrap(source,
	cache.WithMaxBlocksPerRead(4), // Bypass reads spanning > 4 blocks
)

Set to 0 to disable the limit and cache all reads.
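The bypass rule can be sketched as follows, assuming the limit is compared against the number of blocks a read touches. `blocksSpanned` and `shouldBypass` are hypothetical helpers, not library functions.

```go
package main

import "fmt"

// blocksSpanned returns how many blocks a read of n bytes at offset off touches.
func blocksSpanned(off, n, blockSize int64) int64 {
	if n <= 0 {
		return 0
	}
	return (off+n-1)/blockSize - off/blockSize + 1
}

// shouldBypass reports whether a read exceeds the per-read block limit.
// A limit of 0 disables the check, so every read is cached.
func shouldBypass(off, n, blockSize, maxBlocks int64) bool {
	if maxBlocks == 0 {
		return false
	}
	return blocksSpanned(off, n, blockSize) > maxBlocks
}

func main() {
	const blockSize, maxBlocks = 64 << 10, 4
	reads := []struct {
		name   string
		off, n int64
	}{
		{"aligned 256 KB", 0, 256 << 10},
		{"unaligned 256 KB", 1, 256 << 10},
		{"1 MB", 0, 1 << 20},
	}
	for _, r := range reads {
		fmt.Printf("%-17s spans %2d blocks, bypass=%v\n", r.name,
			blocksSpanned(r.off, r.n, blockSize),
			shouldBypass(r.off, r.n, blockSize, maxBlocks))
	}
}
```

Note that a read that does not start on a block boundary can touch one more block than its length suggests, so an unaligned 256 KB read spans 5 blocks of 64 KB.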

SourceID Requirements

Block cache keys depend on a stable source identifier. The HTTP source automatically generates one from the URL and the ETag and Last-Modified response headers:

// Automatic: url:https://example.com/data|etag:"abc123"
source, err := http.NewSource(dataURL)

// Alternatively, override the generated identifier if needed
customSource, err := http.NewSource(dataURL,
	http.WithSourceID("my-custom-identifier"),
)

For custom ByteSource implementations, implement the SourceID() method:

import "fmt"

type MySource struct {
	identifier string
	version    int
	// ...
}

func (s *MySource) SourceID() string {
	return fmt.Sprintf("mysource:%s:%d", s.identifier, s.version)
}

The SourceID must be stable for the same content and change when content changes. Using content hashes or version identifiers is recommended.
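As an illustration of the content-hash approach, the sketch below derives the identifier from a SHA-256 digest of the content. `digestSource` and `newDigestSource` are hypothetical names. In practice the digest would more likely come from a manifest or release metadata than from hashing the remote data on every start.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// digestSource is a hypothetical source whose ID is tied to its content digest.
type digestSource struct {
	name   string
	digest [sha256.Size]byte
}

func newDigestSource(name string, content []byte) *digestSource {
	return &digestSource{name: name, digest: sha256.Sum256(content)}
}

// SourceID stays the same for identical content and changes when content changes.
func (s *digestSource) SourceID() string {
	return fmt.Sprintf("mysource:%s:sha256:%x", s.name, s.digest)
}

func main() {
	v1 := newDigestSource("data", []byte("release 1"))
	v1Again := newDigestSource("data", []byte("release 1"))
	v2 := newDigestSource("data", []byte("release 2"))

	fmt.Println(v1.SourceID() == v1Again.SourceID()) // same content
	fmt.Println(v1.SourceID() == v2.SourceID())      // changed content
}
```

When the content changes, the new identifier produces new cache keys, so blocks cached for the old content are never served for the new content.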

Concurrent Access

The block cache uses singleflight to deduplicate concurrent fetches. Multiple goroutines requesting the same block share a single network request:

// These run in parallel but only one network request occurs
go archive.ReadFile("large-file.bin") // Needs block 42
go archive.ReadFile("large-file.bin") // Also needs block 42 - shares fetch
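The deduplication pattern itself can be sketched with the standard library alone. This is a simplified illustration of the singleflight idea, not the block cache's actual code: the `group` type, its `waiters` counter, and the `demo` helper are hypothetical, and the counter exists only so the demonstration is deterministic.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// call tracks one in-flight fetch and the callers waiting on it.
type call struct {
	done    chan struct{}
	waiters int
	val     []byte
	err     error
}

// group deduplicates concurrent fetches of the same block index.
type group struct {
	mu    sync.Mutex
	calls map[int64]*call
}

// do runs fetch once per block at a time; callers that arrive while a fetch
// is in flight wait for it and share its result.
func (g *group) do(block int64, fetch func() ([]byte, error)) ([]byte, error) {
	g.mu.Lock()
	if c, ok := g.calls[block]; ok {
		c.waiters++
		g.mu.Unlock()
		<-c.done
		return c.val, c.err
	}
	if g.calls == nil {
		g.calls = make(map[int64]*call)
	}
	c := &call{done: make(chan struct{})}
	g.calls[block] = c
	g.mu.Unlock()

	c.val, c.err = fetch()

	g.mu.Lock()
	delete(g.calls, block)
	g.mu.Unlock()
	close(c.done) // publish the result to all waiters
	return c.val, c.err
}

// waiting reports how many callers have joined the in-flight fetch for block.
func (g *group) waiting(block int64) int {
	g.mu.Lock()
	defer g.mu.Unlock()
	if c, ok := g.calls[block]; ok {
		return c.waiters
	}
	return 0
}

// demo starts n goroutines that all need block 42 and returns the fetch count.
func demo(n int) int {
	var g group
	var fetches int32
	release := make(chan struct{})
	fetch := func() ([]byte, error) {
		atomic.AddInt32(&fetches, 1)
		<-release // simulate a slow network request
		return []byte("block 42"), nil
	}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			g.do(42, fetch)
		}()
	}
	// Hold the fetch open until every other caller has joined it.
	for g.waiting(42) < n-1 {
		time.Sleep(time.Millisecond)
	}
	close(release)
	wg.Wait()
	return int(atomic.LoadInt32(&fetches))
}

func main() {
	fmt.Println("fetches:", demo(8))
}
```

Deduplication applies only to callers that overlap in time with an in-flight fetch. A request that arrives after the fetch completes is served by the cache instead.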

Complete Example

A complete setup with block caching for a remote archive:

import (
	"fmt"
	"os"
	"path/filepath"

	"github.com/meigma/blob"
	"github.com/meigma/blob/cache"
	"github.com/meigma/blob/cache/disk"
	"github.com/meigma/blob/http"
)

func setupBlockCachedArchive(indexData []byte, dataURL, token string) (*blob.Blob, error) {
	// Create HTTP source with authentication
	source, err := http.NewSource(dataURL,
		http.WithHeader("Authorization", "Bearer "+token),
	)
	if err != nil {
		return nil, fmt.Errorf("create source: %w", err)
	}

	// Locate the per-user cache directory
	cacheDir, err := os.UserCacheDir()
	if err != nil {
		return nil, fmt.Errorf("locate user cache dir: %w", err)
	}

	// Create block cache with size limit
	blockCache, err := disk.NewBlockCache(
		filepath.Join(cacheDir, "blob-blocks"),
		disk.WithBlockMaxBytes(256 << 20), // 256 MB limit
	)
	if err != nil {
		return nil, fmt.Errorf("create block cache: %w", err)
	}

	// Wrap source with block caching
	cachedSource, err := blockCache.Wrap(source,
		cache.WithBlockSize(64 << 10),  // 64 KB blocks
		cache.WithMaxBlocksPerRead(4),  // Bypass large sequential reads
	)
	if err != nil {
		return nil, fmt.Errorf("wrap source: %w", err)
	}

	return blob.New(indexData, cachedSource)
}

See Also