Version: v25.1 (latest)

Initial import

Bulk Loader performs fast initial data imports into a new Dgraph cluster. It's significantly faster than Live Loader for large datasets and is the recommended approach for initial data ingestion.

Use Bulk Loader when:

  • Setting up a new Dgraph cluster
  • Importing large datasets (GBs to TBs)
  • Performance is critical for the initial load
warning

Bulk Loader can only be used with a new cluster. For importing data into an existing cluster, use Live Loader.

Prerequisites

Before running Bulk Loader:

  • Start one or more Dgraph Zeros (Alphas will be started later)
  • Prepare data files in RDF (.rdf, .rdf.gz) or JSON (.json, .json.gz) format
  • Prepare a schema file
note

Bulk Loader only accepts RDF N-Quad/Triple data or JSON. See data migration for converting other formats.
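A minimal input pair might look like this (file names and the predicate are placeholders for illustration, not taken from the docs):

```shell
# Illustrative: create a tiny schema file and a gzipped RDF N-Quad file
# of the shape Bulk Loader accepts.
printf 'name: string @index(exact) .\n' > schema.txt
printf '_:alice <name> "Alice" .\n_:bob <name> "Bob" .\n' > data.rdf
gzip -f data.rdf   # Bulk Loader reads .rdf.gz directly
```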

Quick Start

dgraph bulk \
--files data.rdf.gz \
--schema schema.txt \
--zero localhost:5080 \
--map_shards 4 \
--reduce_shards 1

Understanding Shards

Before running Bulk Loader, determine your cluster topology:

  • --reduce_shards — Set to the number of Alpha groups in your cluster
  • --map_shards — Set equal to or higher than --reduce_shards for even distribution
| Cluster Setup | Alpha Groups | --reduce_shards |
|---|---|---|
| 3 Alphas, 3 replicas/group | 1 | 1 |
| 6 Alphas, 3 replicas/group | 2 | 2 |
| 9 Alphas, 3 replicas/group | 3 | 3 |

Basic Usage

dgraph bulk \
--files ./data.rdf.gz \
--schema ./schema.txt \
--zero localhost:5080 \
--map_shards 4 \
--reduce_shards 2

Output Structure

Bulk Loader writes one p directory per reduce shard under the out folder:

./out
├── 0
│   └── p
│       ├── 000000.vlog
│       ├── 000002.sst
│       └── MANIFEST
└── 1
    └── p
        └── ...

With --reduce_shards=2, two directories are created (./out/0 and ./out/1).

Deploying Output to Cluster

Copy each shard's p directory to the corresponding Alpha group:

  • Group 1 (Alpha1, Alpha2, Alpha3) → copy ./out/0/p
  • Group 2 (Alpha4, Alpha5, Alpha6) → copy ./out/1/p

Bulk Loader diagram

note

Every Alpha replica in a group must have a copy of the same p directory.

Loading from Cloud Storage

Amazon S3

Set credentials via environment variables or use IAM roles:

| Environment Variable | Description |
|---|---|
| AWS_ACCESS_KEY_ID | AWS access key with S3 read permissions |
| AWS_SECRET_ACCESS_KEY | AWS secret key |
dgraph bulk \
--files s3:///bucket/data \
--schema s3:///bucket/data/schema.txt \
--zero localhost:5080
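For the environment-variable route, credentials can be exported before running the command above (the values here are placeholders):

```shell
# Illustrative placeholder credentials; Bulk Loader picks these up from
# the environment when --files/--schema point at s3:// URIs.
export AWS_ACCESS_KEY_ID="AKIAEXAMPLE"
export AWS_SECRET_ACCESS_KEY="example-secret-key"
```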

IAM Setup

  1. Create an IAM Role with S3 access
  2. Attach using Instance Profile (EC2) or IAM roles for service accounts (EKS)

MinIO

| Environment Variable | Description |
|---|---|
| MINIO_ACCESS_KEY | MinIO access key |
| MINIO_SECRET_KEY | MinIO secret key |
dgraph bulk \
--files minio://server:port/bucket/data \
--schema minio://server:port/bucket/data/schema.txt \
--zero localhost:5080

Deployment Strategies

Small Datasets (< 10 GB)

Let Dgraph stream snapshots between replicas:

  1. Run Bulk Loader on one server
  2. Start only the first Alpha replica
  3. Wait ~1 minute for snapshot creation:
    Creating snapshot at index: 30. ReadTs: 4.
  4. Start remaining Alpha replicas — snapshots stream automatically:
    Streaming done. Sent 1093470 entries. Waiting for ACK...

Large Datasets (> 10 GB)

Copy p directories directly for faster deployment:

  1. Run Bulk Loader on one server
  2. Copy/rsync p directories to all Alpha servers
  3. Start all Alphas simultaneously
  4. Verify all Alphas create snapshots with matching index values
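Step 2 above can be sketched locally (directory names stand in for remote Alpha hosts; in practice you would rsync over SSH):

```shell
# Illustrative: give every replica of a group an identical copy of that
# group's p directory before starting the Alphas.
mkdir -p out/0/p && touch out/0/p/MANIFEST
for replica in alpha1 alpha2 alpha3; do   # group 1's replicas
  mkdir -p "$replica"
  cp -r out/0/p "$replica/"
done
```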

Multi-tenancy

By default, Bulk Loader preserves namespace information from data files. Without namespace info, data loads into the default namespace.

Force all data into a specific namespace with --force-namespace:

dgraph bulk \
--files data.rdf.gz \
--schema schema.txt \
--zero localhost:5080 \
--force-namespace 123

Encryption

Loading into Encrypted Cluster

Generate encrypted p directories:

dgraph bulk \
--files data.rdf.gz \
--schema schema.txt \
--zero localhost:5080 \
--encryption key-file=./encryption.key
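The key file itself is raw bytes. A sketch of generating one, assuming AES-256 (a 32-byte key; Dgraph's encryption at rest also accepts 16- or 24-byte keys for AES-128/192):

```shell
# Illustrative: generate a 32-byte key for --encryption key-file=…
head -c 32 /dev/urandom > encryption.key
chmod 600 encryption.key
```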

Loading Encrypted Exports

Decrypt encrypted export files during import:

# Encrypted input → Encrypted output
dgraph bulk \
--files encrypted-data.rdf.gz \
--schema encrypted-schema.txt \
--zero localhost:5080 \
--encrypted=true \
--encryption key-file=./encryption.key

# Encrypted input → Unencrypted output (migration)
dgraph bulk \
--files encrypted-data.rdf.gz \
--schema encrypted-schema.txt \
--zero localhost:5080 \
--encrypted=true \
--encrypted_out=false \
--encryption key-file=./encryption.key

Using HashiCorp Vault:

dgraph bulk \
--files encrypted-data.rdf.gz \
--schema encrypted-schema.txt \
--zero localhost:5080 \
--encrypted=true \
--vault "addr=http://localhost:8200;enc-field=enc_key;enc-format=raw;path=secret/data/dgraph"

Encryption Flag Combinations

| --encrypted | --encryption key-file | Result |
|---|---|---|
| true | not set | Error |
| true | set | Encrypted input → Encrypted output |
| false | not set | Unencrypted input → Unencrypted output |
| false | set | Unencrypted input → Encrypted output |

Performance Tuning

tip

Disable swap space when running Bulk Loader. It's better to reduce memory usage via flags than let swapping slow the process.

Map Phase

Reduce memory usage:

| Flag | Description |
|---|---|
| --num_go_routines | Lower = less memory |
| --mapoutput_mb | Lower = less memory |

Tip: For large datasets, split RDF files into ~256MB chunks to parallelize gzip decoding.
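The chunking tip can be sketched like this (sizes shrunk for the demo; use `split -C 256m` on real data, and note that -C requires GNU split):

```shell
# Illustrative: re-chunk a gzipped RDF file into line-aligned pieces so
# several goroutines can decompress and parse them in parallel.
printf '_:a <name> "Alice" .\n_:b <name> "Bob" .\n' | gzip > big.rdf.gz
gunzip -c big.rdf.gz | split -C 64 - chunk_   # -C keeps whole lines
gzip chunk_*
```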

Reduce Phase

Increase if you have RAM to spare:

| Flag | Description |
|---|---|
| --reduce_shards | Higher = more parallelism, more memory |
| --map_shards | Higher = better distribution, more memory |

CLI Options Reference

| Flag | Description |
|---|---|
| --files, -f | Data file(s) or directory path |
| --schema, -s | Schema file path |
| --graphql_schema, -g | GraphQL schema file (optional) |
| --zero | Dgraph Zero address |
| --map_shards | Number of map shards |
| --reduce_shards | Number of reduce shards (= Alpha groups) |
| --out | Output directory (default: out) |
| --tmp | Temp directory (default: tmp) |
| --new_uids | Assign new UIDs instead of preserving existing ones |
| --store_xids | Store XIDs as the xid predicate |
| --xidmap | Directory for XID→UID mappings |
| --format | Force input format (rdf or json) |
| --force-namespace | Load all data into a specific namespace |
| --encryption | Encryption superflag; key-file sets the key file |
| --encrypted | Input files are encrypted |
| --encrypted_out | Encrypt output (default: true if key provided) |
| --badger | Badger superflag; compression can be snappy, zstd, or none |

See dgraph bulk CLI reference for the complete list.