Creating a New Source Integration in Go
This guide walks through building a source integration in Go using the CloudQuery SDK. As a running example, we reference the xkcd integration which fetches comic data from the xkcd API.
Before starting, make sure you’re familiar with CloudQuery core concepts and have completed the Getting Started guide.
Prerequisites:
- Go installed (Go Tutorial, A Tour of Go)
- CloudQuery CLI installed
Scaffold a New Integration
The cq-scaffold tool generates a new Go source integration with all the boilerplate. Download it from the releases page, or install via Homebrew on macOS:
brew install cloudquery/tap/scaffold

Create a new integration (replace <org> and <name> with your GitHub org and integration name):
cq-scaffold source <org> <name>
cd cq-source-<name>
go mod tidy

The scaffold tool only generates Go source integrations. For other languages, see the Python, JavaScript, or Java guides.
Project Structure
Here’s the structure of the xkcd integration, which is representative of a typical Go source integration:
plugins/source/xkcd/
├── main.go # Entry point
├── go.mod # SDK dependency (plugin-sdk/v4)
├── plugin/
│ └── plugin.go # Name, version, kind, team constants
├── client/
│ ├── client.go # Client struct (implements schema.ClientMeta)
│ ├── spec.go # Configuration spec
│ └── testing.go # Mock client constructor used in table unit tests
├── internal/xkcd/
│ └── xkcd.go # HTTP client + Comic struct
└── resources/
├── plugin/
│ ├── plugin.go # Creates plugin via plugin.NewPlugin()
│ └── client.go # Configure function, Sync, Tables
└── services/
├── comic.go # Table definition + resolver
└── comic_test.go # TestsA CloudQuery integration has several distinct components. Here’s what each part does and how they fit together:
- main.go: the entry point. It creates the integration and starts serving it over gRPC. You rarely need to modify this file.
- plugin/: defines constants like the integration name, version, team, and kind (source). These identify your integration on the CloudQuery Hub.
- client/: the Client is a struct that stores everything your resolvers need: an authenticated API client, configuration values, a logger, etc. Every resolver receives the Client so it can make API calls. The Spec is a struct matching the user’s YAML configuration. It defines what settings your integration accepts (API keys, endpoints, concurrency, etc.).
- internal/xkcd/ (or internal/<api_name>/): your raw API client code. This is where you make HTTP calls to the third-party API, handle authentication headers, parse responses, and define the response structs. Keeping this separate from the CloudQuery-specific code means you can test and reuse it independently.
- resources/plugin/: the Configure function lives here. It’s called once when a sync starts: it parses the user’s spec, creates the API client, sets up the scheduler, and returns a plugin.Client that the SDK uses to run the sync.
- resources/services/: one file per table. Each file defines a table (the name, columns, and how they map to your API response struct) and a resolver (the function that actually calls the API and sends results back to CloudQuery). The resolver is the heart of each table: it’s where you make API calls, handle pagination, and stream results to the destination.
How It All Connects
Before reading the code, it helps to understand the flow of what happens when a user runs cloudquery sync:
- The CLI starts your integration as a separate process (or connects to it over gRPC if you’re running it locally)
- Your main.go creates the integration and starts the gRPC server
- The CLI sends the user’s spec configuration to your Configure function
- Configure parses the spec, validates it, creates an authenticated API client, and returns a plugin.Client
- The CLI asks your integration for its list of tables, then for each table, calls the table’s resolver
- Each resolver fetches data from the API and sends results over a channel. The SDK handles writing them to the destination.
This flow means your main implementation work is in two places: the Configure function (parsing configuration and creating the API client) and the resolvers (fetching data from the API).
Entry Point
The main.go creates and serves the integration. This is boilerplate that you rarely need to modify. It wires together the serve package and your integration:
package main
import (
"context"
"log"
"github.com/cloudquery/plugin-sdk/v4/serve"
plugin "github.com/<org>/cq-source-<name>/resources/plugin"
)
func main() {
p := serve.Plugin(plugin.Plugin())
if err := p.Serve(context.Background()); err != nil {
log.Fatalf("failed to serve plugin: %v", err)
}
}

Note that main.go imports from resources/plugin — the package that creates the full integration with Sync, Tables, and Close methods. The top-level plugin/ directory only holds name, version, and kind constants. This is a common point of confusion in the project layout.
Integration Setup
Both resources/plugin/plugin.go and resources/plugin/client.go live in the same Go package (package plugin). The plugin.go file creates the integration; client.go holds the SDK-facing client struct and all the methods the SDK calls at sync time.
resources/plugin/plugin.go wires the constants from plugin/ to the SDK:
// resources/plugin/plugin.go
package plugin
import (
internalPlugin "github.com/<org>/cq-source-<name>/plugin"
"github.com/cloudquery/plugin-sdk/v4/plugin"
)
func Plugin() *plugin.Plugin {
return plugin.NewPlugin(
internalPlugin.Name,
internalPlugin.Version,
Configure,
plugin.WithKind(internalPlugin.Kind),
plugin.WithTeam(internalPlugin.Team),
)
}

resources/plugin/client.go defines the SDK-facing client struct and the three methods the SDK calls during a sync. This is distinct from client/client.go (which implements schema.ClientMeta and is used inside resolvers):
// resources/plugin/client.go
package plugin
import (
"context"
"encoding/json"
"fmt"
"github.com/cloudquery/plugin-sdk/v4/message"
"github.com/cloudquery/plugin-sdk/v4/plugin"
"github.com/cloudquery/plugin-sdk/v4/scheduler"
"github.com/cloudquery/plugin-sdk/v4/schema"
"github.com/cloudquery/plugin-sdk/v4/state"
"github.com/cloudquery/plugin-sdk/v4/transformers"
"github.com/rs/zerolog"
"github.com/<org>/cq-source-<name>/client"
"github.com/<org>/cq-source-<name>/internal/yourapi" // placeholder for your internal API client package
"github.com/<org>/cq-source-<name>/resources/services"
)
// Client implements plugin.Client for a source integration.
// It embeds plugin.UnimplementedDestination to satisfy the full
// plugin.Client interface without providing write methods.
type Client struct {
logger zerolog.Logger
config client.Spec
tables schema.Tables
scheduler *scheduler.Scheduler
services *yourapi.Client
plugin.UnimplementedDestination
}
func (c *Client) Sync(ctx context.Context, options plugin.SyncOptions, res chan<- message.SyncMessage) error {
tt, err := c.tables.FilterDfs(options.Tables, options.SkipTables, options.SkipDependentTables)
if err != nil {
return err
}
stateClient, err := state.NewConnectedClient(ctx, options.BackendOptions)
if err != nil {
return err
}
defer stateClient.Close()
schedulerClient := client.New(c.logger, c.config, c.services, stateClient)
if err := c.scheduler.Sync(ctx, schedulerClient, tt, res,
scheduler.WithSyncDeterministicCQID(options.DeterministicCQID)); err != nil {
return fmt.Errorf("failed to sync: %w", err)
}
return stateClient.Flush(ctx)
}
func (c *Client) Tables(_ context.Context, options plugin.TableOptions) (schema.Tables, error) {
return c.tables.FilterDfs(options.Tables, options.SkipTables, options.SkipDependentTables)
}
func (*Client) Close(_ context.Context) error { return nil }

getTables() builds the table list once at startup, applies transformers, and injects the standard CloudQuery columns (_cq_id, _cq_source_name, _cq_sync_time):
func getTables() schema.Tables {
tables := []*schema.Table{
services.ComicsTable(),
}
if err := transformers.TransformTables(tables); err != nil {
panic(err)
}
for _, t := range tables {
schema.AddCqIDs(t)
}
return tables
}

Configuration & Authentication
The SDK passes the user’s spec block from their YAML configuration as raw JSON bytes to your Configure function. Define a Spec struct and unmarshal it:
// client/spec.go
package client
import "fmt"
type Spec struct {
AccessToken string `json:"access_token"`
Concurrency int `json:"concurrency"`
}
func (s *Spec) SetDefaults() {
if s.Concurrency == 0 {
s.Concurrency = 100
}
}
func (s *Spec) Validate() error {
if s.AccessToken == "" {
return fmt.Errorf("access_token is required")
}
return nil
}

Configure is the constructor the SDK calls once per sync. It has two distinct code paths: a fast path for when the CLI only needs metadata (no live connection), and the normal path that creates a real API client and scheduler:
// resources/plugin/client.go (continued)
func Configure(_ context.Context, logger zerolog.Logger, specBytes []byte, opts plugin.NewClientOptions) (plugin.Client, error) {
if opts.NoConnection {
// Called when the CLI needs table schema without connecting,
// e.g. for documentation generation or --no-migrate.
return &Client{
logger: logger.With().Str("module", "<name>").Logger(),
tables: getTables(),
}, nil
}
var spec client.Spec
if err := json.Unmarshal(specBytes, &spec); err != nil {
return nil, fmt.Errorf("failed to unmarshal spec: %w", err)
}
spec.SetDefaults()
if err := spec.Validate(); err != nil {
return nil, err
}
apiClient, err := yourapi.NewClient(spec.AccessToken)
if err != nil {
return nil, fmt.Errorf("failed to create API client: %w", err)
}
return &Client{
logger: logger.With().Str("module", "<name>").Logger(),
config: spec,
scheduler: scheduler.NewScheduler(
scheduler.WithLogger(logger),
scheduler.WithConcurrency(spec.Concurrency),
),
services: apiClient,
tables: getTables(),
}, nil
}

Users configure authentication in their YAML file. The CLI automatically resolves environment variable references:
spec:
access_token: "${YOUR_API_TOKEN}"
concurrency: 50

For public APIs that don’t require authentication (like xkcd), omit access_token from the Spec struct and remove its validation from Validate().
Define a Table
A table in CloudQuery represents a collection of related data, typically one API resource type. In Go, you define a table as a function returning a *schema.Table. Each table needs three things: a name (which becomes the database table name), a transformer (which maps your Go struct fields to columns), and a resolver (the function that fetches data from the API).
Rather than listing columns manually, the SDK can auto-map fields from a Go struct using transformers.TransformWithStruct. If an existing Go SDK already provides a struct for the API response, you can use it directly. Otherwise, define your own struct matching the API’s JSON response. Here’s the actual xkcd comics table:
package services
import (
"github.com/cloudquery/plugin-sdk/v4/schema"
"github.com/cloudquery/plugin-sdk/v4/transformers"
// plus your internal package defining Comic, e.g.:
// "github.com/<org>/cq-source-xkcd/internal/xkcd"
)
func ComicsTable() *schema.Table {
return &schema.Table{
Name: "xkcd_comics",
Resolver: fetchComics,
Transform: transformers.TransformWithStruct(
&xkcd.Comic{},
transformers.WithPrimaryKeys("Num"),
),
}
}

Notice that we don’t list individual columns. TransformWithStruct inspects the Comic struct and creates a column for each exported field. The WithPrimaryKeys("Num") option marks the Num field as the primary key. The final table name xkcd_comics will appear directly as a database table when synced.
The Comic struct defines the columns (from internal/xkcd/xkcd.go):
type Comic struct {
Month string `json:"month"`
Num int `json:"num"`
Link string `json:"link"`
Year string `json:"year"`
News string `json:"news"`
SafeTitle string `json:"safe_title"`
Transcript string `json:"transcript"`
Alt string `json:"alt"`
Img string `json:"img"`
Title string `json:"title"`
Day string `json:"day"`
}Each struct field becomes a column in the destination table. The SDK maps Go types to appropriate database types (e.g. string → text, int → integer). The json tags determine how the struct is serialized but don’t affect column names. Column names are derived from the Go field names, converted to snake_case.
Write a Table Resolver
The resolver is the heart of your integration: it’s the function that actually calls the third-party API and sends results back to CloudQuery. The resolver signature has four arguments, each serving a specific purpose:
func fetchComics(ctx context.Context, meta schema.ClientMeta, parent *schema.Resource, res chan<- any) error

- ctx: a standard Go context, used for cancellation. If a user stops a sync, this context is cancelled, so your resolver should respect it in long-running loops.
- meta: your Client struct (cast it with meta.(*client.Client)). This gives you access to the API client, credentials, and any shared state.
- parent: for top-level tables, this is nil. For child tables (e.g. fetching commits for a specific repository), this contains the parent row so you can extract the parent’s ID.
- res: a channel where you send your results. Each item you send becomes a row in the destination table.
Here’s the xkcd resolver. It fetches the latest comic to determine the total count, then iterates through all comics by ID:
func fetchComics(ctx context.Context, meta schema.ClientMeta, parent *schema.Resource, res chan<- any) error {
c := meta.(*client.Client)
latest, err := c.XKCD.GetLatestComic(ctx)
if err != nil {
return err
}
res <- latest
for i := 1; i < latest.Num; i++ {
comic, err := c.XKCD.GetComic(ctx, i)
if err != nil {
return err
}
res <- comic
}
return nil
}

A few important things to note about this code:
- We send each comic to the res channel as soon as we get it. This is important. The SDK streams items to the destination immediately, so don’t collect everything into a slice first. Streaming keeps memory usage low and gets data to the user’s database faster.
- You can send items one at a time or as a slice. The SDK handles both.
- If an API call fails, return the error. The SDK will log it and report it to the user. Any items you’ve already sent to res before the error are still written to the destination, so partial results are preserved.
- The struct you send (Comic) must match the struct used in TransformWithStruct. That’s how the SDK knows which fields map to which columns.
The Client
The Client struct is where you store everything that resolvers need to access: API clients, credentials, configuration, and any other shared state. Every resolver receives it via the meta argument. The Client lives in client/client.go and must implement the schema.ClientMeta interface, which requires an ID() method:
type Client struct {
Logger zerolog.Logger
XKCD *xkcd.Client
Backend state.Client
}
func (c *Client) ID() string {
return "xkcd"
}

The ID() method serves two purposes: it identifies the client in log messages, and the SDK uses it internally to track which multiplexed client is running. For a small integration like xkcd, a static string is fine. For multiplexed integrations (e.g. one that syncs multiple AWS accounts), you’d include the account name so each client has a unique ID.
The Client is created inside the Configure function and passed to the SDK, which then provides it to every resolver. This is the main way your integration’s initialization code communicates with its resolvers.
Test Locally
Start the integration as a gRPC server for debugging:
go run main.go serve

Or build and run as a local binary:
go build
./cq-source-<name> serve

Then sync using the appropriate registry. See Testing Locally for configuration examples and Running Locally for full details.
Advanced: Column Resolvers
Most of the time, TransformWithStruct handles column mapping automatically. But sometimes you need a column that doesn’t come directly from the API response: maybe it’s derived from other fields, or requires an additional API call. In these cases, you can add extra columns with their own resolver functions.
For example, imagine we want to add an is_good boolean column to the xkcd comics table that doesn’t exist in the API response. We add it to the Columns field alongside the auto-generated columns from Transform:
func ComicsTable() *schema.Table {
return &schema.Table{
Name: "xkcd_comics",
Resolver: fetchComics,
Transform: transformers.TransformWithStruct(&xkcd.Comic{}),
Columns: []schema.Column{
{
Name: "is_good",
Type: arrow.FixedWidthTypes.Boolean,
Resolver: resolveComicIsGood,
},
},
}
}
func resolveComicIsGood(ctx context.Context, meta schema.ClientMeta, resource *schema.Resource, c schema.Column) error {
comic := resource.Item.(xkcd.Comic)
return resource.Set(c.Name, strings.Contains(comic.Title, "xkcd"))
}

The column resolver receives the current row via resource.Item. You cast it to your struct type, compute the value, and set it with resource.Set(). As big fans of meta-jokes, we define only comics with "xkcd" in the title to be good. These custom columns appear alongside the auto-generated columns from TransformWithStruct.
Advanced: Multiplexing
For our xkcd integration, multiplexing isn’t necessary. There’s only one xkcd API with no accounts or organizations. But many real-world integrations need to fetch data for multiple entities. For example, a GitHub integration that syncs repositories for multiple organizations needs to make separate API calls per org. Without multiplexing, these would run sequentially. With multiplexing, they run in parallel.
A multiplexer is a function that takes the base client and returns a slice of clients, one per entity. The SDK calls your table resolver once for each client in the slice:
func AccountMultiplex(meta schema.ClientMeta) []schema.ClientMeta {
client := meta.(*Client)
l := make([]schema.ClientMeta, 0, len(client.accounts))
for _, acc := range client.accounts {
l = append(l, client.WithAccount(acc))
}
return l
}

Then set Multiplex: client.AccountMultiplex on tables that need it. Make sure the client’s ID() method returns a unique value per multiplexed entity:
func (c *Client) ID() string {
return fmt.Sprintf("myplugin:%s", c.Account)
}

Inside the resolver, you can then access the current account via client.Account to make the right API calls.
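Stripped of the SDK types, multiplexing is just cloning the base client once per entity and giving each clone a unique ID. A self-contained sketch (the Client fields and the WithAccount helper are illustrative, not SDK code):

```go
package main

import "fmt"

// Client carries an Account so each multiplexed copy is distinct.
type Client struct {
	accounts []string
	Account  string
}

// WithAccount returns a shallow copy bound to one account.
func (c *Client) WithAccount(acc string) *Client {
	child := *c
	child.Account = acc
	return &child
}

// ID must be unique per multiplexed client so the SDK can
// schedule and log each one independently.
func (c *Client) ID() string {
	return fmt.Sprintf("myplugin:%s", c.Account)
}

// AccountMultiplex fans the base client out into one client
// per account; the SDK would call the resolver once per client.
func AccountMultiplex(base *Client) []*Client {
	out := make([]*Client, 0, len(base.accounts))
	for _, acc := range base.accounts {
		out = append(out, base.WithAccount(acc))
	}
	return out
}

func main() {
	base := &Client{accounts: []string{"dev", "prod"}}
	for _, c := range AccountMultiplex(base) {
		fmt.Println(c.ID())
	}
	// myplugin:dev
	// myplugin:prod
}
```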
Advanced: Incremental Tables
By default, every sync fetches all data from scratch. For small APIs this is fine, but for APIs with millions of records that rarely change, re-fetching everything is wasteful. Incremental tables solve this by remembering where the last sync left off (using a cursor) and only fetching new data on subsequent syncs.
To make a table incremental, you need to mark it as such and designate one column as the incremental key (the cursor):
func Items() *schema.Table {
return &schema.Table{
Name: "hackernews_items",
Resolver: fetchItems,
IsIncremental: true,
Transform: transformers.TransformWithStruct(&hackernews.Item{}),
Columns: []schema.Column{
{
Name: "id",
Type: arrow.PrimitiveTypes.Int64,
PrimaryKey: true,
IncrementalKey: true,
},
},
}
}

In the resolver, use the state backend to persist the cursor:
func fetchItems(ctx context.Context, meta schema.ClientMeta, _ *schema.Resource, res chan<- any) error {
c := meta.(*client.Client)
tableName := Items().Name
// Load cursor from last sync (empty string on the first sync)
value, err := c.Backend.GetKey(ctx, tableName)
if err != nil {
return err
}
// ... fetch data starting from the cursor in `value` ...
// Save the new cursor after processing
if err := c.Backend.SetKey(ctx, tableName, strconv.Itoa(newCursor)); err != nil {
return err
}
return c.Backend.Flush(ctx) // Must flush to persist
}

See Managing Incremental Tables for the full guide, and the Hacker News integration for the complete working example.
Troubleshooting
go run main.go serve fails with a compile error
The most common cause is an import path mismatch. Make sure main.go imports from resources/plugin (the package with the Plugin() function), not the top-level plugin/ directory (which only holds constants). See the Entry Point section for the correct import path.
cloudquery sync fails with connection refused
Your integration isn’t running, or the port in your config doesn’t match. Make sure go run main.go serve is running in a separate terminal, and that the address in your YAML (localhost:7777) matches the address shown in the server output.
failed to validate spec error on sync
Your Spec’s Validate() method returned an error. Check that all required fields are present in your YAML config and that environment variable references like ${MY_API_TOKEN} are set in the shell where you’re running cloudquery sync.
Resolver is never called / zero rows synced
The table is probably filtered out. Check that the table name listed in your YAML tables: field matches the name returned by your ComicsTable() function exactly. If you’re using tables: ["*"], make sure the table is included in the list returned by getTables() in resources/plugin/client.go.
Resolver runs but sends no rows to the res channel
Add a log statement inside the resolver to verify it’s executing and that your API call is returning data. Check that you’re actually sending to res — a common mistake is building a slice and forgetting to send its elements.
Sync works locally but re-fetches everything on the second run
You have an incremental table but forgot to call c.Backend.Flush(ctx) after saving the cursor. Without Flush, the cursor is never persisted and each sync starts from scratch. See Advanced: Incremental Tables.
Common Pitfalls
Avoid these common mistakes when building Go integrations:
- Don’t batch results in memory. Send items to the res channel as soon as they’re available. Don’t collect all pages into a slice and send them at the end. This wastes memory and delays writes to the destination.
- Fetch concurrently when the API allows it. A sequential loop is the simplest starting point, but for large datasets use golang.org/x/sync/errgroup with a concurrency limit so you don’t overwhelm the API. See the xkcd integration for a working example.
- Always call Backend.Flush(ctx) for incremental tables. If you skip this, your cursor won’t persist and the next sync will re-fetch everything.
- Make ID() unique per multiplexed client. If two multiplexed clients return the same ID(), the SDK won’t parallelize them correctly.
- Return errors from resolvers. Don’t silently swallow API errors. Return them so the SDK can log them and surface them to the user.
- Respect context cancellation. Check ctx.Done() in long-running loops so the user can cancel a sync cleanly.
Publishing
Visit Publishing an Integration to the Hub for release instructions.
Real-World Examples
- xkcd: starter integration referenced in this tutorial
- Hacker News: incremental tables with state backend
- Kubernetes: large-scale integration with many tables and mock tests
- PostgreSQL Destination: “unmanaged” destination that handles batching itself
- BigQuery Destination: “managed” destination with per-table batching
- All integrations
Next Steps
Once your integration is working locally:
- Publish to the Hub: make your integration available to others
- Add tests: see comic_test.go in the xkcd integration for a testing pattern
- Add incremental tables: use the state backend for large datasets that don’t change much between syncs
- Add multiplexing: parallelize fetching if your integration supports multiple accounts or regions
- Build a destination: see the Go Destination guide to write an integration that receives and stores data
Resources
- CloudQuery Community
- Go SDK Source Code (plugin-sdk/v4)
- How to Write a CloudQuery Source Integration (video; may reference older SDK patterns, so use this guide for current code)