Projects with this topic
Kreuzberg document extraction bindings for Kit
"A haute voix !" is an accessibility tool that aims to extract text from a pdf and render it the best way to make it readable out loud by TTS browser tools. All the processing is done locally, documents are processed by your computer.
Python script that extracts images from a IIIF manifest, processes them with Tesseract OCR, and generates output files in several formats. It is designed for libraries, archives, and digitization projects that need raw optical character recognition of printed text.
Advanced free and open-source enterprise DMS (document management system).
A modular Clinical NLP Pipeline built to process and analyze unstructured medical text using both traditional machine learning and transformer-based approaches.
The project combines multiple components including OCR, text preprocessing, feature engineering, classification, named entity recognition, and visualization into a single end-to-end pipeline. It supports extracting clinical insights from raw documents and predicting medical categories using both TF-IDF + SVM and BERT-based models.
The system was designed and implemented as a structured Python project, with each stage separated into independent modules for scalability and maintainability.
Key Highlights
Built an end-to-end NLP pipeline for clinical text processing. Implemented SVM (≈51% accuracy) and BERT (≈77% accuracy) models. Integrated OCR for extracting text from medical documents. Performed Named Entity Recognition (NER) on clinical data. Designed modular architecture (src/) for clean code organization. Exported outputs for visualization and dashboard integration.
DocuMind is an automatic document organization system for the Linux desktop, powered by local AI (Ollama/Llama3 or HuggingFace). It processes PDFs, images, video, audio, and code: it extracts text (including via OCR), transcribes, analyzes content, and classifies and archives documents according to ISO 15489 (invoices, legal, work, personal, multimedia). It detects duplicates, records an audit trail in SQLite, and prioritizes offline privacy.
Developed in Python 3.10+ with PyMuPDF, Tesseract, Vosk/Whisper, multiprocessing, and optimizations (xxHash, caching, GPU), it demonstrates expertise in integrating local and multimodal LLMs, parallel processing, and scalable modular architecture, and is evolving toward a PyQt6 GUI with drag-and-drop, full-text search, and RPM/Flatpak packaging.
Solving 4chan captcha
Graphical browser-based Alto4 editor, for the construction of OCR training corpora.
(Design WIP) External tool that adds a transcription (OCR) workflow to the EmuHawk (BizHawk) emulator, allowing retro games to be translated partially or fully automatically.
Event-driven system built on Kafka that transforms unstructured documents into complete software specifications. It extracts text with OCR, performs NER with transformers, classifies sentences, and generates SRS documents in multiple formats.
Jochre OCR training corpus for Yiddish in Alto4 format
Processing of Italian newspaper articles in C++ (via RapidOCROnnx) as part of a research thesis in history. Categorization to come.
Jochre3 OCR engine with default implementation for Yiddish - completely new version of https://github.com/urieli/jochre
Process UrT gameplay to gather distance stats for Game Life Balance: https://game-life-balance.com
A libre smart powered comic book reader for Android.
❗ Note: This is a mirror. Check the GitHub repository.
Document Management Platform (DMP) for preserving the musical heritage of "El Sistema", using:
Papra DMP: metadata management. Audiveris OMR: optical music recognition for sheet music.
This project focuses on developing a prototype application for extracting headlines and content from digitized newspaper images stored in the SIDAK (Sistem Informasi Database Koleksi) system of the Monumen Pers Nasional, utilizing computer vision and deep learning techniques.
The prototype aims to overcome the limitations of standard OCR tools by integrating YOLOv8 object detection to precisely identify and separate newspaper headlines and article content before text extraction.