How to process gigantic files in Golang — Fast and Memory-Efficient

golang process huge files

Why Golang is a good choice for very huge files?

Efficient Memory Usage : Go allows reading files in chunks (e.g., using bufio or io.Reader) so you don’t have to load everything into RAM.
Concurrency Support : Goroutines and channels let you process parts of the file in parallel, speeding up tasks like filtering, transformation, or computation.
Strong Standard Library : The os, io, bufio, and encoding packages provide tools to process files efficiently.
Static Typing & Performance : Faster and safer than scripting languages for processing large volumes of data.

Sample project

As developers, we often face the challenge of processing very large files without consuming too much memory or CPU—because fewer resources means lower costs 🤑. In this post, I’ll show you how to handle huge files efficiently in Go, line by line, using concurrency and memory reuse.

Project Structure

1
2
3
4
5
├── dummy_100000000_rows.csv
├── generate.sh
├── main.go
├── model.go
├── processor.go

Model

The model.go defines reusable data structures (Payload and Chunk) and uses sync.Pool to efficiently manage their memory, minimizing allocations and garbage collection when processing large amounts of data.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
package main

import "sync"

type Payload struct {
 Data []byte
}

type Chunk struct {
 payloads []*Payload
}

var payloadPool = sync.Pool{
 New: func() interface{} {
  return &Payload{}
 },
}

var chunkPool = sync.Pool{
 New: func() interface{} {
  return &Chunk{
   payloads: make([]*Payload, 0),
  }
 },
}

func getPayload() *Payload {
 return payloadPool.Get().(*Payload)
}

func releasePayload(payload *Payload) {
 payload.Data = payload.Data[:0]
 payloadPool.Put(payload)
}

func getChunk() *Chunk {
 return chunkPool.Get().(*Chunk)
}

func releaseChunk(chunk *Chunk) {
 chunk.payloads = chunk.payloads[:0]
 chunkPool.Put(chunk)
}

Processor

The processor.go reads a large file using buffered scanning, groups lines into chunks, and processes them concurrently using a worker pool. It leverages buffered channels and memory pools to efficiently manage data flow and minimize allocations, making it ideal for fast, low-overhead processing of large files.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
package main

import (
 "bufio"
 "io"
 "log"
 "sync"
)

func processFile(file io.Reader, numWorkers int, chunkSize int) {
 scanner := bufio.NewScanner(file)
 payloadChan := make(chan *Chunk, numWorkers)
 var wg sync.WaitGroup

 for i := 0; i < numWorkers; i++ {
  wg.Add(1)
  go func(workerId int) {
   defer wg.Done()
   for chunk := range payloadChan {
    for _, payload := range chunk.payloads {
     // Actual processing logic goes here
     // fmt.Println(string(payload.Data))

     releasePayload(payload)
    }

    releaseChunk(chunk)
   }
  }(i)
 }

 go func() {
  defer close(payloadChan)
  for {
   chunk := getChunk()
   for len(chunk.payloads) < chunkSize && scanner.Scan() {
    payload := getPayload()

    payload.Data = payload.Data[:0]
    payload.Data = append(payload.Data[:0], scanner.Bytes()...)

    chunk.payloads = append(chunk.payloads, payload)
   }

   if len(chunk.payloads) > 0 {
    payloadChan <- chunk
   } else {
    releaseChunk(chunk)
    break
   }
  }

  if err := scanner.Err(); err != nil {
   log.Printf("Scanner error: %v", err)
  }
 }()

 wg.Wait()
}

Main

The main.go is a simple command-line interface that accepts user parameters, opens the specified file, and runs the processing logic with those settings. After processing, it outputs useful information about the process, such as execution time and memory usage, to help monitor performance during large file handling.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
package main

import (
 "flag"
 "log"
 "os"
 "runtime"
 "time"
)

func main() {
 filePath := flag.String("file", "", "Path to CSV file to process (required)")
 chunkSize := flag.Int("chunk", 1000, "Number of lines per chunk (default: 1000)")
 workers := flag.Int("workers", runtime.NumCPU(), "Number of concurrent workers (default: number of CPUs)")
 flag.Parse()

 if *filePath == "" {
  log.Fatal("Missing required --file parameter")
 }

 startTime := time.Now()

 file, err := os.Open(*filePath)
 if err != nil {
  log.Fatalf("Failed to open file: %v", err)
 }
 defer file.Close()

 stat, err := file.Stat()
 if err != nil {
  log.Fatalf("Failed to get file stats: %v", err)
 }

 fileSize := stat.Size()
 log.Printf("Processing file: %s (%.2f MB)", *filePath, float64(fileSize)/1024/1024)
 log.Printf("Number of workers: %d", *workers)
 log.Printf("Chunks size: %d", *chunkSize)

 processFile(file, *workers, *chunkSize)

 printMemStats()
 log.Printf("Processing completed in %v", time.Since(startTime))
}

func printMemStats() {
 var m runtime.MemStats
 runtime.ReadMemStats(&m)

 log.Printf("Memory stats: Alloc=%.2fMB, TotalAlloc=%.2fMB, Sys=%.2fMB, NumGC=%d",
  float64(m.Alloc)/1024/1024,
  float64(m.TotalAlloc)/1024/1024,
  float64(m.Sys)/1024/1024,
  m.NumGC,
 )
}

CLI-supported parameters:

1
2
3
4
5
6
7
+------------+---------------------------------------------------+---------+-----------------------+----------+
| Flag | Description | Type | Default Value | Required |
+------------+---------------------------------------------------+---------+-----------------------+----------+
| --file | Path to the CSV file to process | string | None | Yes |
| --chunk | Number of lines per processing chunk | int | 1000 | No |
| --workers | Number of concurrent worker goroutines to use | int | Number of CPU cores | No |
+------------+---------------------------------------------------+---------+-----------------------+----------+

Example Usage:

1
go run main.go --file=dummy_10000000_rows.csv --chunk=1000 --workers=8

You will get a similar output depending on your file and computer.

Execute Output

Bonus point: Generate sample data

It includes a handy simple Bash script (generate.sh) that quickly creates realistic sample data in CSV format, perfect for testing and demonstration.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
#!/bin/bash

DEFAULT_ROWS=1000000
NUM_ROWS=${1:-$DEFAULT_ROWS}
FILENAME="dummy_${NUM_ROWS}_rows.csv"

echo "Generating $NUM_ROWS rows..."
echo "ID,Name,Surname,Gender,Age,City,Score,Code,Status,Date,Email,Country,Verified,Phone,IP,Browser,OS,Device,Source,Plan,Referral" > "$FILENAME"

awk -v rows="$NUM_ROWS" '
BEGIN {
  srand();

  split("Alice,Bob,Charlie,David,Eve,Frank,Grace,Hannah,Ian,Julia,Sam,Manuel,Francisco,Fred,Carlos", names, ",");
  split("Smith,Johnson,Williams,Brown,Jones,Garcia,Miller,Davis,Martinez,Lopez,Marshall,Andres", surnames, ",");
  split("Male,Female,Other", genders, ",");
  split("New York,London,Berlin,Barcelona,Tokyo,Sydney,Paris,Dubai,Rome,Madrid,Beijing", cities, ",");
  split("USA,UK,Germany,Japan,Australia,France,UAE,Italy,Spain,China", countries, ",");
  split("example.com,test.org,demo.net,sample.io", domains, ",");
  split("Chrome,Firefox,Safari,Edge,Opera", browsers, ",");
  split("Windows,macOS,Linux,Android,iOS", oss, ",");
  split("Desktop,Laptop,Tablet,Mobile", devices, ",");
  split("Google,Facebook,Twitter,LinkedIn,Direct,Email", sources, ",");
  split("Free,Basic,Pro,Enterprise", plans, ",");

  for(i=1;i<=rows;i++) {
    name = names[int(rand()*length(names))+1];
    surname = surnames[int(rand()*length(surnames))+1];
    gender = genders[int(rand()*length(genders))+1];
    age = 18 + int(rand()*63);
    city = cities[int(rand()*length(cities))+1];
    score = sprintf("%.2f", rand()*100);
    code = int(rand()*99999) + 10000;
    status = (rand() > 0.7) ? "Suspended" : ((rand() > 0.5) ? "Active" : "Inactive");
    day = sprintf("%02d", 1 + int(rand()*28));
    date = "2025-" sprintf("%02d", 1 + int(rand()*12)) "-" day;
    email = tolower(name) i "@" domains[int(rand()*length(domains))+1];
    country = countries[int(rand()*length(countries))+1];
    verified = (rand() > 0.5) ? "true" : "false";
    phone = sprintf("+1-%03d-%03d-%04d", int(rand()*900+100), int(rand()*900+100), int(rand()*10000));
    ip = int(rand()*255) "." int(rand()*255) "." int(rand()*255) "." int(rand()*255);
    browser = browsers[int(rand()*length(browsers))+1];
    os = oss[int(rand()*length(oss))+1];
    device = devices[int(rand()*length(devices))+1];
    source = sources[int(rand()*length(sources))+1];
    plan = plans[int(rand()*length(plans))+1];
    referral = sprintf("REF_%04d", int(rand()*10000));

    printf "%d,%s,%s,%s,%d,%s,%s,%d,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s\n",
           i, name, surname, gender, age, city, score, code, status, date, email, country, verified,
           phone, ip, browser, os, device, source, plan, referral;
  }
}' >> "$FILENAME"

echo "CSV file with $NUM_ROWS rows generated: $FILENAME"

FILE_SIZE=$(du -h "$FILENAME" | cut -f1)
echo "File size: $FILE_SIZE"

1
2
#Assign execution permissions
chmod +x generate.sh

1
2
#Example to execute with number of lines. Default value 1000000
./generate.sh 10000000

And when we run it we have a similar output:

Generate Script Output

NOTE : Depending on the number of rows and the power of the computer, this script can take a long time.

Conclusion

Processing huge files in Go doesn’t have to be hard. Using Go’s concurrency and memory reuse features, you can handle big CSVs or text files efficiently without using too much memory. This makes your programs faster and more reliable, especially when dealing with large datasets

Repository

The code for this tutorial can be found in the public: GitHub - albertcolom/go-file-processor

You can read the article on Medium

Why Golang is a good choice for very huge files?#

Sample project#

Project Structure#

Model#

Processor#

Main#

Bonus point: Generate sample data#

Conclusion#

Repository#