# Our data

Sumble isn't just another data platform with lots of data but dubious worth because none of it is discoverable; nor is it a database whose value is only revealed if you happen to know which tables to look in and what SQL to write. **Sumble is a repository of interconnected data points that can be used to filter, contextualize, and aggregate insight.**

The source of Sumble's data isn't a decades-old legacy database, a hallucinating LLM, or a call to some other service. Our data is built from organization websites, job posts, and people profiles, all of which are linked and referenced, combined with extensive data modeling that enables dynamic and flexible querying.

## Primary Sources

Sumble's organization data is rooted in primary sources: job posts and people profiles. From these two sources we derive organization metadata, tech stacks, and teams mentioned, and construct trends of these over time. These sources are the root of our data tree: many pages in Sumble open a slideover showing the underlying job post or person profile.

{% tabs %}
{% tab title="Job Posts" %}

<figure><img src="https://591792295-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fd8pAyUZEiTL25GKepCIo%2Fuploads%2Fxu9LqMQejjPSwzGgBiwp%2Fimage.png?alt=media&#x26;token=f39ab3ef-5e34-4700-9ec3-9512ba29d2af" alt=""><figcaption></figcaption></figure>
{% endtab %}

{% tab title="Person Profile" %}

<figure><img src="https://591792295-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fd8pAyUZEiTL25GKepCIo%2Fuploads%2FSt50NcUIKVudrbFUO0jB%2Fimage.png?alt=media&#x26;token=9b977f74-b9c9-4986-9a17-30fa9ed67a5b" alt=""><figcaption></figcaption></figure>
{% endtab %}
{% endtabs %}

## Derived Data

* Headcount
* Location
  * Sourced from [GeoNames](https://www.geonames.org/), licensed under CC BY 4.0
* Industry
* Parent-Subsidiary relationships
* Attributes
  * B2B/B2C
  * IT Services
  * Digital Native
* Teams → Org structure
* Technologies
* Projects
* Job functions (from people & job posts)
* Job levels (from people & job posts)

## Data freshness

Sumble's data stays current through automated pipelines that run throughout the day.

| Data type            | Update frequency                                                                                                                                    |
| -------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------- |
| Job posts            | Multiple times daily — new postings are ingested from aggregators and careers pages throughout the day                                              |
| People profiles      | Daily — professional profiles are refreshed and matched to organizations daily                                                                      |
| Organization data    | Daily — firmographic data, industry classifications, parent/subsidiary relationships, and attributes are recomputed every night                     |
| Technologies & teams | Daily — technologies and team structures are re-extracted from job posts using ML models each day                                                   |
| Trends               | Daily — historical trends for technologies, job functions, and headcount are recomputed with each pipeline run, drawing on years of historical data |
| Intent signals       | Daily — new signals are generated after each data pipeline run                                                                                      |

## Coverage

Sumble tracks a broad cross-section of the global job market. At a high level: hundreds of thousands of organizations, millions of job postings and people profiles, and thousands of tracked technologies across categories like programming languages, cloud platforms, databases, and frameworks. Team structures are extracted for covered companies.

Coverage is deepest for technology companies and companies with active hiring, since job posts are a primary data source.

## Data quality

Raw job posts and profiles are unstructured text. Sumble turns them into structured, queryable data through several processing stages:

1. **Entity extraction** — Custom NER (Named Entity Recognition) models identify technologies, teams, projects, and other entities mentioned in job descriptions
2. **Relation classification** — A second model classifies the relationship between entities (e.g., whether a technology is actively used vs. just mentioned)
3. **Entity linking** — Extracted entities are normalized and linked to canonical records so that "k8s", "Kubernetes", and "K8S" all resolve to the same technology
4. **Deduplication** — Job posts and organizations are deduplicated across sources to prevent double-counting
5. **Hierarchy construction** — Teams are organized into tree structures, job functions into a standardized hierarchy, and parent/subsidiary relationships are maintained for organizations
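The entity-linking step can be sketched roughly as an alias-to-canonical lookup. This is an illustrative simplification, not Sumble's actual implementation: the alias table, function name, and technology names below are hypothetical.

```python
# Hypothetical sketch of entity linking: resolve raw technology mentions
# from job-post text to a canonical record. The alias table is illustrative.

from typing import Optional

CANONICAL_ALIASES = {
    "kubernetes": "Kubernetes",
    "k8s": "Kubernetes",
    "postgres": "PostgreSQL",
    "postgresql": "PostgreSQL",
}

def link_entity(mention: str) -> Optional[str]:
    """Normalize a raw mention and resolve it to a canonical name, if known."""
    return CANONICAL_ALIASES.get(mention.strip().lower())

# "k8s", "Kubernetes", and "K8S" all resolve to the same technology.
assert link_entity("k8s") == link_entity("K8S") == link_entity("Kubernetes")
```

In practice this stage handles far messier variation than case and whitespace, but the principle is the same: many surface forms, one canonical entity.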

This pipeline runs daily, processing newly ingested data and updating existing records.
