
# Our data

Good go-to-market decisions need data you can trust. Sumble's data comes from what companies actually do, not from self-reported profiles or stale lists. Every data point connects to the others, so you can filter, segment, and aggregate your way to an answer without writing SQL or learning which table to query.

It isn't pulled from a decades-old legacy database, invented by an LLM, or passed through from another vendor. It's built from organization websites, job posts, and people profiles that are always linked and referenced, then modeled so you can query it flexibly.

## Primary sources

Sumble's organization data is rooted in two primary sources: job posts and people profiles. From these we derive organization metadata, tech stack, the teams a company is building, and how all of it trends over time. Job posts and profiles are the root of the data tree, and you'll see them throughout the app in slideovers that show the underlying job post or profile.

{% tabs %}
{% tab title="Job Posts" %}

<figure><img src="/files/22HSJgBsb8NIMRPBEtfQ" alt=""><figcaption></figcaption></figure>
{% endtab %}

{% tab title="Person Profile" %}

<figure><img src="/files/aY6TEJV3kMJI9M5jmEse" alt=""><figcaption></figcaption></figure>
{% endtab %}
{% endtabs %}

## Derived data

* Headcount
* Location
  * Sourced from [GeoNames](https://www.geonames.org/), licensed under CC BY 4.0
* Industry
* Parent/subsidiary relationships
* Attributes
  * B2B/B2C
  * IT Services
  * Digital Native
* Teams and org structure
* Technologies
* Projects
* Job functions (for both people and job posts)
* Job levels (for both people and job posts)

## Data freshness

Sumble's data stays current through automated pipelines that run throughout the day.

| Data type            | Update frequency                                                                                                                                    |
| -------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------- |
| Job posts            | Multiple times daily. New postings are ingested from aggregators and careers pages throughout the day.                                              |
| People profiles      | Daily. Professional profiles are refreshed and matched to organizations.                                                                            |
| Organization data    | Daily. Firmographic data, industry classifications, parent/subsidiary relationships, and attributes are recomputed every night.                     |
| Technologies & teams | Daily. Technologies and team structures are re-extracted from job posts using ML models.                                                            |
| Trends               | Daily. Historical trends for technologies, job functions, and headcount are recomputed with each pipeline run, drawing on years of historical data. |
| Intent signals       | Daily. New signals are generated after each data pipeline run.                                                                                      |

## Coverage

Sumble tracks a broad cross-section of the global job market: hundreds of thousands of organizations, millions of job postings and people profiles, and thousands of tracked technologies across categories like programming languages, cloud platforms, databases, and frameworks. Team structures are extracted for covered companies.

Coverage is deepest for technology companies and companies with active hiring, since job posts are a primary data source.

## Data quality

Raw job posts and profiles are unstructured text. Sumble turns them into structured, queryable data through several processing stages:

1. **Entity extraction:** Custom named entity recognition (NER) models identify technologies, teams, projects, and other entities mentioned in job descriptions.
2. **Relation classification:** A second model classifies the relationship between entities, for example whether a technology is actively used or only mentioned.
3. **Entity linking:** Extracted entities are normalized and linked to canonical records, so "k8s", "Kubernetes", and "K8S" all resolve to the same technology.
4. **Deduplication:** Job posts and organizations are deduplicated across sources to prevent double-counting.
5. **Hierarchy construction:** Teams are organized into tree structures, job functions into a standardized hierarchy, and parent/subsidiary relationships are maintained for organizations.
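
To make the entity linking stage concrete, here is a minimal Python sketch. It is an illustration only, not Sumble's implementation: the `ALIASES` table and the `link_entity` function are hypothetical names, and a production linker would likely combine a much larger alias set with model-based matching.

```python
from typing import Optional

# Hypothetical alias table mapping normalized mentions to canonical names.
# Sumble's real canonical records are not described in this page.
ALIASES = {
    "k8s": "Kubernetes",
    "kubernetes": "Kubernetes",
    "postgres": "PostgreSQL",
    "postgresql": "PostgreSQL",
}


def link_entity(mention: str) -> Optional[str]:
    """Return the canonical name for a raw mention, or None if it is unknown."""
    # Normalize case and surrounding whitespace before the lookup.
    key = mention.strip().casefold()
    return ALIASES.get(key)


# Three spellings of the same technology resolve to one canonical record.
mentions = ["k8s", "Kubernetes", "K8S"]
linked = {link_entity(m) for m in mentions}
```

Normalizing the mention before the lookup is what lets differently cased spellings collapse to a single record. Unknown mentions return `None` rather than being guessed at.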

This pipeline runs daily, reprocessing new data and updating existing records.
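
The deduplication stage can be pictured with a similar sketch. It assumes, purely for illustration, that a job post is a dictionary with `org`, `title`, and `description` fields and that two posts are duplicates when those fields match after normalization. The `fingerprint` and `deduplicate` helpers are hypothetical, and a real pipeline would probably also need fuzzy matching for near-duplicates.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase the text and collapse runs of whitespace to single spaces."""
    return re.sub(r"\s+", " ", text).strip().casefold()


def fingerprint(post: dict) -> str:
    """Hash the normalized fields that identify a post (assumed field names)."""
    parts = [normalize(post[field]) for field in ("org", "title", "description")]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


def deduplicate(posts: list) -> list:
    """Keep the first post seen for each fingerprint, preserving input order."""
    seen = set()
    unique = []
    for post in posts:
        fp = fingerprint(post)
        if fp not in seen:
            seen.add(fp)
            unique.append(post)
    return unique


# The same posting picked up from two sources, with different whitespace and case.
posts = [
    {"org": "Acme", "title": "Data Engineer", "description": "Build pipelines.", "source": "careers page"},
    {"org": "ACME", "title": "Data  Engineer", "description": "Build pipelines. ", "source": "aggregator"},
    {"org": "Acme", "title": "Platform Engineer", "description": "Run Kubernetes.", "source": "careers page"},
]
unique_posts = deduplicate(posts)
```

Here the first two posts share a fingerprint, so only the careers page copy is kept and the posting is counted once.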

