Our data
Sumble isn't just another data platform holding lots of data of dubious worth because none of it is discoverable, nor is it a database whose value is revealed only if you happen to know which tables to look in and what SQL to write. Sumble is a repository of interconnected data points that can be used to filter, contextualize, and aggregate insight.
The source of Sumble's data isn't a decades-old legacy database, a hallucinating LLM, or a call to some other service. Our data is based on organization websites, job posts, and people profiles, which are always linked and referenced, plus extensive data modeling that enables dynamic and flexible querying.
Primary Sources
Sumble's organization data is rooted in primary sources: job posts and people profiles. From these two data sources we derive organization metadata, tech stack, and teams mentioned, and we construct trends of these over time. These sources can be seen as the root of our data tree: many web pages show a slideover with the underlying job post or person profile.


Derived Data
Headcount
Location (sourced from GeoNames, licensed under CC BY 4.0)
Industry
Parent-Subsidiary relationships
Attributes: B2B/B2C, IT Services, Digital Native
Teams → Org structure
Technologies
Projects
Job functions: people & job posts
Job levels: people & job posts
Data freshness
Sumble's data stays current through automated pipelines that run throughout the day.
Job posts
Multiple times daily: new postings are ingested from aggregators and careers pages
People profiles
Daily: professional profiles are refreshed and matched to organizations
Organization data
Daily: firmographic data, industry classifications, parent/subsidiary relationships, and attributes are recomputed every night
Technologies & teams
Daily: technologies and team structures are re-extracted from job posts using ML models
Trends
Daily: historical trends for technologies, job functions, and headcount are recomputed with each pipeline run, drawing on years of historical data
Intent signals
Daily: new signals are generated after each data pipeline run
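To make the Trends entry above more concrete, here is a minimal Python sketch of recomputing a monthly trend from the full history on each run. It is an illustration only, not Sumble's implementation: the `mentions` sample data and the `monthly_trend` function are hypothetical names introduced for this example.

```python
from collections import Counter
from datetime import date

# Hypothetical input: one (posting date, canonical technology) pair per
# technology mention extracted from a job post.
mentions = [
    (date(2024, 1, 15), "Kubernetes"),
    (date(2024, 1, 28), "Kubernetes"),
    (date(2024, 2, 3), "Kubernetes"),
    (date(2024, 2, 10), "PostgreSQL"),
]


def monthly_trend(mentions, technology):
    """Count mentions of one technology per (year, month), oldest first."""
    counts = Counter()
    for posted, tech in mentions:
        if tech == technology:
            counts[(posted.year, posted.month)] += 1
    # Rebuilding from all history on each run keeps past months consistent
    # when older records are corrected or deduplicated.
    return dict(sorted(counts.items()))


kubernetes_trend = monthly_trend(mentions, "Kubernetes")
```

Recomputing from scratch is the simplest design to reason about. An incremental update would be cheaper, but it would need extra bookkeeping whenever historical records change.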
Coverage
Sumble tracks a broad cross-section of the global job market. At a high level: hundreds of thousands of organizations, millions of job postings and people profiles, and thousands of tracked technologies across categories like programming languages, cloud platforms, databases, and frameworks. Team structures are extracted for covered companies.
Coverage is deepest for technology companies and companies with active hiring, since job posts are a primary data source.
Data quality
Raw job posts and profiles are unstructured text. Sumble turns them into structured, queryable data through several processing stages:
Entity extraction — Custom NER (Named Entity Recognition) models identify technologies, teams, projects, and other entities mentioned in job descriptions
Relation classification — A second model classifies the relationship between entities (e.g., whether a technology is actively used vs. just mentioned)
Entity linking — Extracted entities are normalized and linked to canonical records so that "k8s", "Kubernetes", and "K8S" all resolve to the same technology
Deduplication — Job posts and organizations are deduplicated across sources to prevent double-counting
Hierarchy construction — Teams are organized into tree structures, job functions into a standardized hierarchy, and parent/subsidiary relationships are maintained for organizations
This pipeline runs daily, reprocessing new data and updating existing records.
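As a rough illustration of the entity linking and deduplication stages described above, the following Python sketch resolves raw mentions to a canonical technology name and collapses duplicate job posts. Everything here is an assumption made for the example: the alias table, the `JobPost` fields, and the choice of deduplication key are hypothetical and much simpler than a production pipeline would need.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical alias table; a real system would hold far more entries
# and likely use fuzzier matching than an exact lookup.
CANONICAL_TECH = {
    "k8s": "Kubernetes",
    "kubernetes": "Kubernetes",
    "postgres": "PostgreSQL",
    "postgresql": "PostgreSQL",
}


def link_entity(mention: str) -> str | None:
    """Map a raw mention to a canonical technology name, or None if unknown."""
    return CANONICAL_TECH.get(mention.strip().lower())


@dataclass(frozen=True)
class JobPost:
    org: str
    title: str
    location: str
    source: str  # e.g. "aggregator" or "careers_page" (illustrative values)


def dedup_key(post: JobPost) -> tuple[str, str, str]:
    """Illustrative key: same org, title, and location count as one post."""
    title = " ".join(post.title.casefold().split())  # collapse whitespace
    return (post.org.casefold(), title, post.location.casefold())


def deduplicate(posts: list[JobPost]) -> list[JobPost]:
    """Keep the first post seen for each key, preserving input order."""
    seen: set[tuple[str, str, str]] = set()
    unique: list[JobPost] = []
    for post in posts:
        key = dedup_key(post)
        if key not in seen:
            seen.add(key)
            unique.append(post)
    return unique
```

With this sketch, `link_entity("k8s")`, `link_entity("Kubernetes")`, and `link_entity("K8S")` all return `"Kubernetes"`, and the same posting collected from an aggregator and from a careers page is counted once.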