Knowledge Graph and Discovery Product

YourStory Startup Graph

Built a graph-based startup intelligence system to map founders, startups, investors, domains, and funding relationships.

Structured discovery
Graph-backed product

Context

This was a three-month internship during my undergrad and one of the earliest projects that changed how I thought about systems. I worked directly with the CEO on a startup intelligence product that treated startup coverage as structured discovery data rather than static articles. It was also the period that pushed me fully into coding and product-building as a real discipline.

Problem

Startup media is useful to read, but difficult to navigate systematically when users want to understand who invested where, which founders cluster together, or what adjacent companies exist within a domain.

What I Built

A graph database in Neo4j for startups, founders, investors, domains, and funding rounds
Relationship-driven discovery logic that made investor-style exploration more useful than flat search
Java service logic for ingestion, graph writes, and repeatable structured updates
Cypher query patterns for domain-level investor discovery, founder-investor traversal, and funding-pattern exploration
An Inshorts-style daily product concept for startup news that generated compact updates using funding stage, amount raised, investor names, and related startup signals

Notes

Overview

During a three-month internship at YourStory in undergrad, I worked on a problem that still feels current: how do you turn a stream of startup stories into something users can explore, not just read?

The answer I pursued was a graph-backed product. Rather than treating funding news as isolated text, I modeled startups, founders, investors, rounds, domains, and stories as connected entities. That shift turned the product from search into discovery.

Why a graph made sense

The startup ecosystem is naturally relational. Investors participate in rounds. Founders connect companies over time. Domains create clusters. Stories mention multiple entities at once. Once I framed the product that way, a graph was the cleanest representation of the actual problem.

At a practical level, the system needed to support questions like:

which investors are most active in a specific domain?
what similar founder or company clusters exist around this startup?
what rounds or entities connect to a recent story?

Those are relationship questions, not just text-search questions.

System shape

The pipeline conceptually looked like this:

Story input -> entity normalization -> graph upsert -> traversal/query layer -> editorial or discovery surface

The normalization step mattered more than anything else. Without canonical entity handling, a graph quietly becomes misleading: different spellings of the same company or investor create duplicate nodes and distort the network.

The engineering lesson was that the graph itself was not the hard part. Trustworthy entity resolution was.
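
As a minimal sketch of what canonical entity handling can look like, the Java snippet below collapses spelling variants of a company or investor name into one key. The class name EntityNormalizer, the suffix list, and the cleanup rules are illustrative assumptions, not the rules the original system used, and the names in main are fictional.

```java
import java.text.Normalizer;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper: maps name variants to one canonical key. */
class EntityNormalizer {

    // Assumed legal suffixes to ignore; longer forms come first so they match first.
    private static final List<String> LEGAL_SUFFIXES =
            List.of("private limited", "pvt ltd", "ltd", "llp", "inc");

    static String canonicalKey(String rawName) {
        // Strip accents, lowercase, and turn punctuation runs into single spaces.
        String s = Normalizer.normalize(rawName, Normalizer.Form.NFKD)
                .replaceAll("\\p{M}", "")
                .toLowerCase(Locale.ROOT)
                .replaceAll("[^a-z0-9]+", " ")
                .trim();
        // Drop one trailing legal suffix, if present.
        for (String suffix : LEGAL_SUFFIXES) {
            if (s.endsWith(" " + suffix)) {
                s = s.substring(0, s.length() - suffix.length()).trim();
                break;
            }
        }
        return s;
    }

    public static void main(String[] args) {
        // Both variants resolve to the same key, so they would share one node.
        System.out.println(canonicalKey("Acme Labs Pvt. Ltd."));
        System.out.println(canonicalKey("  ACME-Labs  "));
    }
}
```

String cleanup like this only catches surface variants. Aliases, rebrands, and abbreviations would still need a lookup table or manual review, which is part of why entity resolution was the hard part.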

At the storage layer, the graph was organized around a simple but expressive property-graph model:

Startup
Founder
Investor
Domain
Round

with relationship types like:

FOUNDED_BY
FUNDED_BY
OPERATES_IN
RAISED_IN

That model made it possible to move through the ecosystem the way users naturally think about it. An investor wants more than “articles mentioning fintech.” They want to traverse from a startup to its founders, from a founder to adjacent companies, from a round to participating investors, and from there into domain clusters or follow-on patterns.

On top of that graph, I used Cypher queries for the actual discovery layer. The key value was not just storing connected data, but making multi-hop questions cheap to express:

which investors are repeatedly showing up in a domain?
which founders connect otherwise separate startup clusters?
which recent rounds create interesting adjacency between companies?
which entities should appear together in a compact daily update?

That was one of the first times I saw clearly that query design is really product design in another form.
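
To make the first of those questions concrete, here is a hedged sketch of how "which investors keep showing up in a domain" might be expressed. The Cypher text assumes edge directions (Startup)-[:OPERATES_IN]->(Domain) and (Startup)-[:FUNDED_BY]->(Investor) plus a name property, none of which the original schema confirms. The Java method mirrors the same two hops over an in-memory edge list so the logic can run without a database. The class name, record, and sample data are hypothetical, and the code assumes Java 17 or later.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class DomainInvestorQuery {

    // Assumed schema: edge directions and the `name` property are illustrative.
    static final String CYPHER = """
            MATCH (s:Startup)-[:OPERATES_IN]->(:Domain {name: $domain})
            MATCH (s)-[:FUNDED_BY]->(i:Investor)
            RETURN i.name AS investor, count(DISTINCT s) AS startups
            ORDER BY startups DESC, investor ASC
            """;

    /** One typed, directed edge in a tiny in-memory stand-in for the graph. */
    record Edge(String from, String type, String to) {}

    /** Mirrors CYPHER: investors ranked by distinct funded startups in a domain. */
    static List<Map.Entry<String, Integer>> activeInvestors(List<Edge> edges, String domain) {
        // Hop 1: startups that operate in the requested domain.
        Set<String> inDomain = new HashSet<>();
        for (Edge e : edges) {
            if (e.type().equals("OPERATES_IN") && e.to().equals(domain)) {
                inDomain.add(e.from());
            }
        }
        // Hop 2: investors behind those startups, deduplicated per startup.
        Map<String, Set<String>> startupsByInvestor = new HashMap<>();
        for (Edge e : edges) {
            if (e.type().equals("FUNDED_BY") && inDomain.contains(e.from())) {
                startupsByInvestor.computeIfAbsent(e.to(), k -> new HashSet<>()).add(e.from());
            }
        }
        List<Map.Entry<String, Integer>> ranked = new ArrayList<>();
        startupsByInvestor.forEach((inv, funded) -> ranked.add(Map.entry(inv, funded.size())));
        // Highest count first; ties broken alphabetically for stable output.
        ranked.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return ranked;
    }

    /** Fictional sample data, for illustration only. */
    static List<Edge> sampleEdges() {
        return List.of(
                new Edge("AlphaPay", "OPERATES_IN", "fintech"),
                new Edge("BetaLend", "OPERATES_IN", "fintech"),
                new Edge("GammaFarm", "OPERATES_IN", "agritech"),
                new Edge("AlphaPay", "FUNDED_BY", "Investor X"),
                new Edge("BetaLend", "FUNDED_BY", "Investor X"),
                new Edge("BetaLend", "FUNDED_BY", "Investor Y"),
                new Edge("GammaFarm", "FUNDED_BY", "Investor Y"));
    }

    public static void main(String[] args) {
        System.out.println(activeInvestors(sampleEdges(), "fintech"));
    }
}
```

The Cypher version states the whole question in four lines, while the in-memory version needs two explicit passes. That gap is the practical meaning of multi-hop questions being cheap to express.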

Product surface

The graph enabled a more investor-style product surface. A user could move from one company to its founders, to the investors in a round, to other companies that shared those investors or domain patterns. That kind of traversal feels obvious once it exists, but it is hard to fake with traditional article archives.

Alongside the graph work, I also built toward a short-form startup updates product inspired by compact news formats. The interesting part there was not just templating text. It was using structured fields like funding stage, amount raised, participating investors, and startup/domain tags to make daily updates fast, legible, and consistent.

That became my first real exposure to the connection between data pipelines and product output. Once the entity structure is dependable, the same system can power search, traversal, and editorial surfaces without redoing the logic for every new view.
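
The short-form idea can be sketched as a template over structured fields. The record below is an assumed shape for a funding round; the field names, the USD unit, and the sentence template are illustrative, not the format the product actually used.

```java
import java.util.List;
import java.util.Locale;

class FundingUpdateTemplate {

    /** Assumed structured fields; names and units are illustrative. */
    record FundingRound(String startup, String domain, String stage,
                        long amountUsd, List<String> investors) {}

    /** Formats an amount in millions of USD with one decimal place. */
    static String formatAmount(long usd) {
        return String.format(Locale.ROOT, "$%.1fM", usd / 1_000_000.0);
    }

    /** Renders one compact, consistently shaped update line. */
    static String render(FundingRound r) {
        String backers = r.investors().isEmpty()
                ? "undisclosed investors"
                : String.join(", ", r.investors());
        return String.format(Locale.ROOT, "%s (%s) raised %s in its %s round from %s.",
                r.startup(), r.domain(), formatAmount(r.amountUsd()), r.stage(), backers);
    }

    public static void main(String[] args) {
        // Fictional round, for illustration only.
        FundingRound round = new FundingRound("AlphaPay", "fintech", "Series A",
                12_500_000L, List.of("Investor X", "Investor Y"));
        System.out.println(render(round));
    }
}
```

Because every update is rendered from the same fields, a bad field (a duplicated investor, a mislabeled stage) shows up directly in the published text, so the template is only as good as the normalization behind it.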

What this project taught me

This was one of the first projects where I was not just coding a component. I was thinking about the full chain:

how the world should be modeled
how data should be normalized
how queries reflect user intent
how editorial output can be powered by structure rather than manual repetition

That combination of knowledge modeling, infrastructure, and user-facing product thinking has stayed with me ever since.

It also made the direction feel personal. This was the point in undergrad where coding stopped feeling abstract and started feeling like the way I wanted to think and build.

Closing thought

What stayed with me from this project was not just the graph itself. It was the realization that product usefulness often depends on whether the data model matches how people actually think. Once startup coverage was structured as relationships instead of isolated articles, the product became easier to explore, easier to query, and more aligned with real user intent. That was one of the earliest moments where backend modeling, discovery design, and product thinking all clicked together for me.

Role / ownership

Worked directly with the CEO on the product direction rather than treating the project as an isolated engineering task
Owned a meaningful part of the graph-modeling and discovery-system thinking during the internship
Used the internship to move from static-content thinking into system design, search/discovery, and product structure

Impact

Showed how structured data, graph modeling, and short-form editorial products can work together to turn unstructured startup coverage into navigable product surfaces.

Stack

Neo4j
Cypher
Java
JavaScript
Entity resolution
Knowledge graphs

Technical design

Ingestion flow that turns startup stories into normalized startup, founder, investor, funding-round, and domain entities
Entity resolution layer that canonicalizes repeated company and investor mentions before they are written to the graph
Neo4j property graph with nodes like Startup, Founder, Investor, Domain, and Round plus relationships such as FOUNDED_BY, FUNDED_BY, and OPERATES_IN
Java services for ingestion, graph upserts, repeatable updates, and query-serving endpoints
Cypher traversal layer optimized for relationship-heavy investor discovery questions rather than keyword search alone
Short-form editorial pipeline that converts structured funding fields into compact startup summaries
Canonicalization and constraint strategy to keep repeated imports from fragmenting the dataset
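
The constraint and upsert idea in the last item can be sketched roughly as follows. The two Cypher strings use the Neo4j 5 constraint syntax and a hypothetical id property. The in-memory upsert method is only a stand-in for MERGE semantics so the behavior can be shown without a database, and the class and method names are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class IdempotentUpsert {

    // Assumed property name `id`; a uniqueness constraint rejects duplicate nodes.
    static final String CONSTRAINT =
            "CREATE CONSTRAINT startup_id IF NOT EXISTS "
            + "FOR (s:Startup) REQUIRE s.id IS UNIQUE";

    // MERGE matches on the canonical id only, then refreshes other properties.
    static final String UPSERT =
            "MERGE (s:Startup {id: $id}) "
            + "SET s.name = $name";

    // Canonical id -> latest display name, standing in for Startup nodes.
    private final Map<String, String> startupsById = new LinkedHashMap<>();

    /** In-memory stand-in for UPSERT: returns true only when a new node is created. */
    boolean upsert(String canonicalId, String displayName) {
        boolean created = !startupsById.containsKey(canonicalId);
        startupsById.put(canonicalId, displayName);
        return created;
    }

    int nodeCount() {
        return startupsById.size();
    }

    public static void main(String[] args) {
        IdempotentUpsert graph = new IdempotentUpsert();
        // Importing the same entity twice updates it instead of duplicating it.
        System.out.println(graph.upsert("acme labs", "Acme Labs"));
        System.out.println(graph.upsert("acme labs", "Acme Labs Pvt. Ltd."));
        System.out.println(graph.nodeCount());
    }
}
```

Matching on the canonical id alone matters: a MERGE that also included the display name in its pattern would create a second node whenever the spelling changed, which is exactly the fragmentation the constraint is meant to prevent.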

Engineering decisions

Use a graph model because the domain is fundamentally relationship-heavy
Prefer traversal-first discovery over flat relational filtering because the user intent is usually multi-hop
Invest in canonical IDs and constraints early so the graph stays trustworthy as ingestion grows
Treat domain labels as navigable structure, not just passive article metadata
Use structured funding fields to support both search and short-form editorial output

Tradeoffs

Graphs make multi-hop discovery easier, but they increase the burden on entity resolution and schema hygiene
Automated short-form story generation improves speed, but it only works if the underlying structured data is dependable
Rich relationship modeling helps investors explore, but the ingestion pipeline needs stronger normalization discipline than a simple search index

Outcome / impact

A more navigable startup-discovery surface for investor-style exploration
Working proof that media coverage can become reusable structured product data
Hands-on experience owning product logic, graph design, and infrastructure patterns together

Lessons learned

Graph modeling becomes powerful when the domain is relationship-heavy and the queries match how users actually think
Entity resolution is often the hardest part of making a discovery product trustworthy
This was the project that made coding feel like a way to build products, not just complete tasks