NLP datasets for the Databus — GSoC 2026

Description
Here we propose to use the periodic Wikimedia dumps to create valuable datasets (e.g., clean text and named entities for each article). These datasets will be publicly available on the Databus.

Goal
The main objective of this project is to provide a valuable set of NLP-related datasets via the Databus. These datasets include:

  • Clean text of full Wikipedia articles (e.g. no tables, no equations, no markup)
  • A gold standard for Named Entity recognition: for each Wikipedia article, a list of its internal and external links
  • Frequencies of tokens in a whole Wikipedia dump (per language)

Impact
These datasets will be very useful for the Computer Science community. For instance:

  • Clean datasets are useful for training language models. For English we estimate 4 billion tokens, for Spanish 1 billion tokens. We will pay special attention to the cleanliness of the texts, because it is well known that this is a cornerstone of LLM training.
  • The named entities (NEs) in Wikipedia articles are a gold standard for training NER models.
  • The frequencies of tokens allow us to create spell checkers, to analyze statistical behaviors (like Zipf’s law), or to detect neologisms.

Warm up tasks

  • Use the Wikipedia dumps eswiki-20260301 and eswiki-20251020 to:
    • clean the texts (use any publicly available library to strip the wiki markup)
    • tokenize the text and compute token frequencies.
    • compute the neologisms (tokens in 20260301 that are missing in 20251020)
      Notice: all of this should be achievable on an ordinary PC (up to 32 GB RAM). A minimal code sketch of one possible approach follows below.
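
A minimal sketch of how such a streaming pipeline could look in Python, assuming the bz2-compressed pages-articles XML dump and the mwparserfromhell library; the file name, XML namespace version, and the simple regex tokenizer are placeholders to be adapted:

```python
# Minimal streaming sketch: clean wikitext and count token frequencies
# without loading the full dump into memory. Assumes the standard
# bz2-compressed pages-articles XML dump and mwparserfromhell.
import bz2
import re
from collections import Counter
from xml.etree import ElementTree as ET

import mwparserfromhell  # pip install mwparserfromhell

NS = "{http://www.mediawiki.org/xml/export-0.11/}"  # namespace version may differ per dump
TOKEN_RE = re.compile(r"\w+", re.UNICODE)           # placeholder tokenizer

def iter_article_texts(dump_path):
    """Yield the raw wikitext of each page while streaming the XML dump."""
    with bz2.open(dump_path, "rb") as fh:
        for _, elem in ET.iterparse(fh, events=("end",)):
            if elem.tag == NS + "page":
                text_elem = elem.find(f"{NS}revision/{NS}text")
                if text_elem is not None and text_elem.text:
                    yield text_elem.text
                elem.clear()  # release memory used by the processed page

def token_frequencies(dump_path):
    """Strip wiki markup and count lowercase token frequencies."""
    freqs = Counter()
    for wikitext in iter_article_texts(dump_path):
        plain = mwparserfromhell.parse(wikitext).strip_code()
        freqs.update(tok.lower() for tok in TOKEN_RE.findall(plain))
    return freqs

if __name__ == "__main__":
    freqs = token_frequencies("eswiki-20260301-pages-articles.xml.bz2")  # placeholder file name
    print(freqs.most_common(20))
```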

Skills Required

  • A good understanding of the Databus
  • Optionally, good knowledge of SPARQL, RDF, and other Semantic Web technologies
  • NLP technologies and libraries
  • Machine Learning
  • Good documentation and communication skills (results will be publicly available: source code on GitHub, web apps/services on Docker Hub)

Project Size
350 hours

Mentors
Mariano Rico
Nandana Mihindukulasooriya

Hi Mariano,

This project immediately caught my attention because it combines large-scale data processing with high-quality NLP dataset creation — something I’m very interested in, especially from an LLM training perspective.

I’ve already started exploring the warm-up tasks using the Wikipedia dumps (eswiki-20260301 and eswiki-20251020). My current approach is:
• Cleaning text by removing markup, templates, tables, and non-natural language elements (evaluating tools like WikiExtractor and custom parsing where needed for better control over noise)
• Tokenizing using efficient pipelines (SpaCy / custom streaming tokenizer) to ensure it works within memory constraints
• Computing token frequencies using streaming methods (to avoid loading full dumps into memory)
• Detecting neologisms by comparing token sets across dumps, with normalization to reduce noise from casing and punctuation (a rough sketch of this step follows below)
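
A rough sketch of the comparison step, assuming the token-frequency Counters for both dumps are already computed; the normalization rules here are illustrative, not final:

```python
# Sketch of neologism detection between two dumps, with simple
# normalization to reduce noise from casing, punctuation and digits.
# Assumes freqs_new / freqs_old are token-frequency Counters computed beforehand.
from collections import Counter

def normalize(token):
    """Lowercase, strip surrounding punctuation, and drop pure-digit tokens."""
    tok = token.lower().strip(".,;:!?\"'()[]")
    return "" if not tok or tok.isdigit() else tok

def neologisms(freqs_new, freqs_old, min_freq=2):
    """Tokens with freq >= min_freq in the new dump that are absent from the old one."""
    old_vocab = {normalize(t) for t in freqs_old}
    found = set()
    for token, freq in freqs_new.items():
        norm = normalize(token)
        if norm and freq >= min_freq and norm not in old_vocab:
            found.add(norm)
    return sorted(found)

if __name__ == "__main__":
    # toy example with made-up counts
    new = Counter({"palabranueva": 5, "casa": 100, "2026": 40})
    old = Counter({"casa": 90, "2025": 30})
    print(neologisms(new, old))  # -> ['palabranueva']
```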

I’m particularly interested in improving the quality of cleaned text, since even small noise (like leftover templates or malformed markup) can significantly impact downstream LLM training.

I also have a few ideas I’d love your thoughts on:
• Using a pipeline-based architecture (stream → clean → tokenize → analyze) for scalability and reuse across languages
• Storing outputs in a structured format compatible with Databus (possibly RDF-based metadata for discoverability)
• Adding evaluation metrics for “cleanliness” of text (e.g., residual markup detection, token entropy checks); a rough illustration follows below
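
As a rough illustration of the cleanliness metric idea, cleaned articles could be scanned for patterns that typically indicate leftover markup; the pattern list below is only an example and would need tuning:

```python
# Illustrative "cleanliness" check: count suspicious residual-markup
# patterns per 1000 tokens in cleaned text. The pattern list is an example only.
import re

RESIDUE_PATTERNS = [
    re.compile(r"\{\{[^}]*\}\}"),    # leftover templates
    re.compile(r"\[\[[^\]]*\]\]"),   # leftover wiki links
    re.compile(r"</?\w+[^>]*>"),     # stray HTML/XML tags
    re.compile(r"&[a-z]+;"),         # unescaped HTML entities
]

def residue_score(text):
    """Residual-markup hits per 1000 tokens (0.0 means nothing suspicious found)."""
    tokens = max(len(text.split()), 1)
    hits = sum(len(p.findall(text)) for p in RESIDUE_PATTERNS)
    return 1000.0 * hits / tokens

print(residue_score("Texto limpio sin marcas residuales."))         # 0.0
print(residue_score("Texto con {{plantilla}} y <ref> residuales."))  # > 0
```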

I’ll continue working on the warm-up tasks and would be happy to share results or benchmarks soon.

Looking forward to your feedback!

Best regards,
Ayush Tripathi

Dear Ayush,

Nice to see your interest in this proposal. If you have started the warm-up tasks, I would like to see your results and some technical details. Please send me an email.

Best,

-Mariano

Hi Mariano,

I have completed all three warm-up tasks and wanted to share my results here before sending a detailed email.

Token Frequency Results:

| Metric        | eswiki-20260301 | eswiki-20251020 |
|---------------|-----------------|-----------------|
| Articles      | 4,843,334       | 4,792,369       |
| Unique tokens | 5,929,846       | 5,754,470       |
| Total tokens  | 1,130,176,559   | 1,091,450,750   |

Note: Current neologism comparison uses top 100K tokens. Full vocabulary comparison (freq ≥ 2) is in progress — updated results coming soon.

Pipeline: Fully streaming — peak RAM only ~3GB on a 16GB machine.

All code and results are on GitHub: https://github.com/alizahh-7/dbpedia-nlp (warm-up tasks for the DBpedia NLP Datasets for the Databus project).

I have sent a detailed email as well. Looking forward to your feedback!

Best regards,
Syeda Alizah

Hi @MarianoRico,

This project caught my attention immediately as it bridges the gap between raw Wikimedia data and high-quality LLM training sets.

I have already submitted an initial proposal to the GSoC dashboard, which focused on the Databus-python-client. However, after reviewing the specific goals for this project—especially the 32GB RAM constraint and the focus on clean text and token frequencies—I realize that my current proposal needs to be more focused on these extraction and cleaning tasks.

I am planning to revise and update my proposal to focus on:

  • Implementing a streaming pipeline (to handle the RAM limit) for cleaning Wikipedia markup and equations.
  • Using the Databus to release these datasets (leveraging my existing work on the Python client); a rough metadata sketch follows after this list.
  • Conducting the warm-up tasks involving eswiki neologism detection.
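
As a very rough illustration of the release side, a generated dataset could first be described in RDF before publishing. The sketch below uses rdflib with generic DCAT/DCTERMS terms; the actual DataID vocabulary and upload workflow expected by the Databus would come from its documentation, and all URIs and values are placeholders:

```python
# Hedged sketch: describe a generated dataset in RDF prior to a Databus
# release. The exact DataID vocabulary and upload API are defined by the
# Databus documentation; all URIs and values below are placeholders.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

DCAT = Namespace("http://www.w3.org/ns/dcat#")

g = Graph()
dataset = URIRef("https://example.org/nlp-datasets/eswiki-token-frequencies")  # placeholder URI

g.add((dataset, RDF.type, DCAT.Dataset))
g.add((dataset, DCTERMS.title, Literal("eswiki token frequencies (20260301 dump)", lang="en")))
g.add((dataset, DCTERMS.description, Literal("Token frequencies computed from the cleaned eswiki dump.", lang="en")))
g.add((dataset, DCTERMS.license, URIRef("http://creativecommons.org/licenses/by-sa/4.0/")))  # placeholder license

print(g.serialize(format="turtle"))
```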

Should I go ahead and update my submitted proposal on the GSoC dashboard to align with these specific requirements? I would appreciate any feedback on whether you’d like to see more focus on the cleaning tools (like mwparserfromhell) or the Databus integration side.

Best regards,
Tahoora Tabassum

Update on Neologism Detection:

I have now completed the full vocabulary comparison (3.1M+ tokens per dump, freq ≥ 2) instead of just the top 100K.

| Metric                     | Value            |
|----------------------------|------------------|
| 2026 full vocab            | 3,154,788 tokens |
| 2025 full vocab            | 3,141,544 tokens |
| Raw neologisms             | 13,052           |
| After linguistic filtering | 782              |

Key insight discovered: Wikipedia dumps contain significant noise even after cleaning — hex color codes, template artifacts, and usernames appear as false neologisms. Distinguishing true linguistic neologisms from Wikipedia-specific artifacts is a core challenge I’d like to address in the full project.
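
For illustration, this kind of filtering could start from simple heuristics like the ones below; they are simplified examples of the noise categories I mentioned, not the exact rules in the repository:

```python
# Simplified heuristics for separating likely linguistic neologisms from
# Wikipedia-specific artifacts (hex color codes, template fragments,
# username-like strings). Not the exact filters used in the repository.
import re

HEX_COLOR = re.compile(r"^(?:[0-9a-f]{3}|[0-9a-f]{6})$")
TEMPLATE_FRAGMENT = re.compile(r"[{}|=]")
USERNAME_LIKE = re.compile(r"^[a-z]+\d{2,}$")  # e.g. a name with a trailing digit run

def looks_like_artifact(token):
    """Heuristic: True if the token is probably wiki noise rather than a real word."""
    return bool(
        HEX_COLOR.match(token)
        or TEMPLATE_FRAGMENT.search(token)
        or USERNAME_LIKE.match(token)
        or any(ch.isdigit() for ch in token)
    )

candidates = ["telerrehabilitación", "ff00aa", "usuario2023", "plantilla|fecha"]  # toy example
print([t for t in candidates if not looks_like_artifact(t)])  # -> ['telerrehabilitación']
```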

True neologisms identified include: telerrehabilitación, hispanovenezolanos, bitvavo — reflecting real-world language evolution between Oct 2025 and Mar 2026.

Updated code on GitHub: https://github.com/alizahh-7/dbpedia-nlp