Get started with an open source customer data platform | Opensource.com

Get started with an open source customer data platform

As an open source alternative to Segment, RudderStack collects and routes event stream (or clickstream) data and automatically builds your customer data lake on your data warehouse.


RudderStack is an open source, warehouse-first customer data pipeline. It collects and routes event stream (or clickstream) data and automatically builds your customer data lake on your data warehouse.

RudderStack is commonly described as the open source alternative to Segment, a customer data platform (CDP). Compared with Segment, it aims to be more secure, flexible, and cost-effective: you get the CDP functionality while keeping full ownership of your customer data.

Warehouse-first tools like RudderStack are architected to build functional data lakes in the user's data warehouse. The benefits are improved data control, increased flexibility in tool use, and (frequently) lower costs. Since it's open source, you can see how complicated processes—like building your identity graph—are done without relying on a vendor's black box.

Getting the RudderStack workspace token

Before you get started, you will need the RudderStack workspace token from your RudderStack dashboard. To get it:

  1. Go to the RudderStack dashboard.
  2. Log in using your credentials (or sign up for an account, if you don't already have one).
  3. Once you've logged in, you should see the workspace token on your RudderStack dashboard.

Installing RudderStack

Setting up a RudderStack open source instance is straightforward. You have two installation options:

  1. On your Kubernetes cluster, using RudderStack's Helm charts
  2. On your Docker container, using the docker-compose command

This tutorial explains how to use both options but assumes that you already have Git installed on your system.

Deploying with Kubernetes

You can deploy RudderStack on your Kubernetes cluster using the Helm package manager.

If you plan to use RudderStack in production, we strongly recommend using this method. This is because the Docker images are updated with bug fixes more frequently than the GitHub repository (which follows a monthly release cycle).

Before you can deploy RudderStack on Kubernetes, make sure that kubectl is installed and connected to your Kubernetes cluster and that Helm is installed.

Once you've completed all the prerequisites, deploy RudderStack on your default Kubernetes cluster:

  1. Find the Helm chart required to deploy RudderStack in this repo.
  2. Install the Helm chart with a release name of your choice (my-release, in this example) from the root directory of the repo in the previous step:
    helm install my-release ./ \
      --set rudderWorkspaceToken="<your workspace token from RudderStack dashboard>"

This deploys RudderStack on your default Kubernetes cluster configured with kubectl using the workspace token you obtained from the RudderStack dashboard.

For more details on the configurable parameters in the RudderStack Helm chart or updating the versions of the images used, consult the documentation.
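
The Helm steps above can be wrapped in a small script that checks for the required tools before installing. This is a sketch rather than an official RudderStack script: the function names and the RUDDER_WORKSPACE_TOKEN variable are assumptions made for illustration, and the script expects to be run from the root directory of the Helm chart repository.

```shell
#!/bin/sh
# Sketch: check prerequisites, then install the RudderStack Helm chart.
# Function names and RUDDER_WORKSPACE_TOKEN are illustrative only; they are
# not defined by RudderStack.

# have_cmd NAME: succeed if NAME is a command available on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# missing_tools: print the name of each required tool that is not installed.
missing_tools() {
  for tool in kubectl helm; do
    have_cmd "$tool" || echo "$tool"
  done
}

# deploy_rudderstack RELEASE: install the chart found in the current directory.
deploy_rudderstack() {
  release="$1"
  missing="$(missing_tools)"
  if [ -n "$missing" ]; then
    echo "Install these tools first: $missing" >&2
    return 1
  fi
  if [ -z "${RUDDER_WORKSPACE_TOKEN:-}" ]; then
    echo "Set RUDDER_WORKSPACE_TOKEN to your workspace token" >&2
    return 1
  fi
  # Same command as in the steps above, with the token read from the environment.
  helm install "$release" ./ \
    --set rudderWorkspaceToken="$RUDDER_WORKSPACE_TOKEN"
}
```

After exporting the token, calling deploy_rudderstack my-release performs the install; running kubectl get pods afterward shows whether the pods have started.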

Deploying with Docker

Docker is the easiest and fastest way to set up your open source RudderStack instance.

First, get the workspace token from the RudderStack dashboard by following the steps above.

Once you have the RudderStack workspace token:

  1. Download the rudder-docker.yml docker-compose file required for the installation.
  2. Replace <your_workspace_token> in this file with your RudderStack workspace token.
  3. Set up RudderStack on your Docker container by running:
    docker-compose -f rudder-docker.yml up

Now RudderStack should be up and running on your Docker instance.
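
Step 2 above can also be done with sed instead of a text editor. The sketch below assumes the file contains the literal placeholder <your_workspace_token>, as described in the steps. The helper name fill_workspace_token is hypothetical, and the token is assumed not to contain the characters |, &, or backslash, which have special meaning in the sed expression.

```shell
#!/bin/sh
# fill_workspace_token FILE TOKEN
# Print a copy of FILE to standard output with every <your_workspace_token>
# placeholder replaced by TOKEN. The original file is left untouched.
# Assumes TOKEN contains no "|", "&", or backslash characters.
fill_workspace_token() {
  file="$1"
  token="$2"
  sed "s|<your_workspace_token>|$token|g" "$file"
}
```

For example, redirect the output of fill_workspace_token rudder-docker.yml "$TOKEN" into a new file (any name works, such as rudder-docker.local.yml) and pass that file to docker-compose -f in step 3.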

Verifying the installation

You can verify your RudderStack installation by sending test events using the bundled shell script:

  1. Clone the GitHub repository:
    git clone https://github.com/rudderlabs/rudder-server.git
  2. In this tutorial, you will verify RudderStack by sending test events to Google Analytics. Make sure you have a Google Analytics account and keep the tracking ID handy. Also, note that the Google Analytics account needs to have a Web property.
  3. In the RudderStack hosted control plane:
    • Add a source on the RudderStack dashboard by following the Adding a source and destination in RudderStack guide. You can use any of RudderStack's event stream software development kits (SDKs) for sending events from your app. This example sets up the JavaScript SDK as a source on the dashboard. Note: You aren't actually installing the RudderStack JavaScript SDK on your site in this step; you are just creating the source in RudderStack.
    • Configure a Google Analytics destination on the RudderStack dashboard using the instructions in the guide mentioned previously. Use the Google Analytics tracking ID you kept from step 2 of this section.
  4. As mentioned before, RudderStack bundles a shell script that generates test events. Get the Source write key from the RudderStack dashboard.
  5. Next, run:
    ./scripts/generate-event <YOUR_WRITE_KEY> https://hosted.rudderlabs.com/v1/batch
  6. Finally, log into your Google Analytics account and verify that the events were delivered. In your Google Analytics account, navigate to RealTime -> Events. The RealTime view is important because some dashboards can take one to two days to refresh.
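
To see roughly what such a test request looks like, here is a sketch that sends one hand-written event with curl. Several details are assumptions rather than facts taken from this tutorial: that the batch endpoint accepts a Segment-style JSON body with a batch array, and that the write key is passed as the HTTP Basic auth username with an empty password. The function names are hypothetical. The bundled generate-event script remains the reference.

```shell
#!/bin/sh
# Sketch: send one test event to the batch endpoint with curl.
# The payload shape and the Basic-auth convention are assumptions; check them
# against the RudderStack HTTP API documentation before relying on this.

# batch_payload USER_ID EVENT_NAME: print a one-event batch as JSON.
# Arguments are inserted verbatim, so they must not contain double quotes.
batch_payload() {
  printf '{"batch":[{"type":"track","userId":"%s","event":"%s"}]}\n' "$1" "$2"
}

# send_test_event WRITE_KEY URL: POST the payload to URL.
# -f makes curl return a non-zero status on HTTP error responses.
send_test_event() {
  batch_payload "test-user-1" "Test Event" |
    curl -sS -f -u "$1:" \
      -H "Content-Type: application/json" \
      --data-binary @- "$2"
}
```

Calling send_test_event with your write key and the same batch URL used in step 5 sends the event; a non-zero exit status means the request was rejected or could not be delivered.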

Optional: Setting up the open source control plane

RudderStack's core architecture contains two major components: the data plane and the control plane. The data plane, rudder-server, delivers your event data, and the RudderStack hosted control plane manages the configuration of your sources and destinations.

However, if you want to manage the source and destination configurations locally, you can set up an open source control plane in your environment using the RudderStack Config Generator. (You must have Node.js installed on your system to use it.)

Here are the steps to set up the control plane:

  1. Install and set up RudderStack on the platform of your choice by following the instructions above.
  2. Run the following commands in this order:
    • cd utils/config-gen
    • npm install
    • npm start

You should now be able to access the open source control plane at http://localhost:3000 by default. If your setup is successful, you will see the user interface.
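
The commands above can be combined into one script that checks for Node.js first. This sketch assumes it is run from the root of the cloned rudder-server repository, so that utils/config-gen resolves; node_ready and start_config_gen are hypothetical helper names.

```shell
#!/bin/sh
# Sketch: start the open source control plane (Config Generator).
# Assumes the current directory is the root of the cloned rudder-server
# repository. Helper names are illustrative only.

# node_ready: succeed if both node and npm are available on PATH.
node_ready() {
  command -v node >/dev/null 2>&1 && command -v npm >/dev/null 2>&1
}

# start_config_gen [DIR]: install dependencies and start the UI from DIR
# (defaults to utils/config-gen).
start_config_gen() {
  dir="${1:-utils/config-gen}"
  if ! node_ready; then
    echo "Node.js (node and npm) must be installed first" >&2
    return 1
  fi
  if [ ! -d "$dir" ]; then
    echo "Directory not found: $dir" >&2
    return 1
  fi
  # Run in a subshell so the caller's working directory is unchanged.
  ( cd "$dir" && npm install && npm start )
}
```

Running start_config_gen with no arguments performs the same three commands in order and stops at the first one that fails.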

To export the existing workspace configuration from the RudderStack-hosted control plane and have RudderStack use it, consult the docs.

RudderStack and open source

The core of RudderStack is in the rudder-server repository. It is open source, licensed under AGPL-3.0. A majority of the destination integrations live in the rudder-transformer repository. They are open source as well, licensed under the MIT License. The SDKs and instrumentation repositories and several tool and utility repositories are also open source under the MIT License and available on GitHub. So are some dbt model repositories for use cases like customer journey analysis and sessionization of the data residing in your data warehouse.

You can use RudderStack's open source offering, rudder-server, on your platform of choice. There are setup guides for Docker, Kubernetes, native installation, and developer machines.

RudderStack open source offers:

  1. RudderStack event stream
  2. 15+ SDKs and source integrations to ingest event data
  3. 80+ destination and warehouse integrations
  4. Slack community support

RudderStack Cloud

RudderStack also offers a managed option, RudderStack Cloud. It is fast, reliable, and highly scalable with a multi-node architecture and sophisticated error-handling mechanism. You can hit peak event volume without worrying about downtime, loss of events, or latency.

Explore our open source repos on GitHub, subscribe to our blog, and follow us on social media: Twitter, LinkedIn, dev.to, Medium, and YouTube!


About the author

Amey Varangaonkar - Amey is a Content Manager at RudderStack. He takes keen interest in Data Science, Content and Product Marketing, Gaming, and Music.