Code
This is where most of my typing goes. Updated April 2026.
Selected Projects
Data & ML Tools
pdpipe - A composable pipeline library for pandas DataFrames, with reusable stages for column operations, encoding, and data-preparation workflows. [website] [GitHub]
pulearn - Python estimators, metrics, guides, and examples for learning from positive and unlabeled data, including scikit-learn-compatible PU classifiers. [website] [documentation] [GitHub]
skift - scikit-learn-compatible wrappers for Python fastText, including DataFrame-friendly classifiers and stacking-friendly text-model adapters.
awesome-twitter-data - A curated, CC0 awesome-list of Twitter/X datasets and related resources, with license and dataset-size notes where available.
stationarizer - A pandas-friendly time-series utility that applies ADF/KPSS unit-root checks, multiple-testing correction, differencing, and detrending to stationarize numeric series automatically.
Python & AI Workflow Utilities
cachier - Persistent, stale-aware caching decorators for Python functions, with local files, memory, MongoDB, SQL, Redis, and S3 backends plus async support and cache analytics.
foldermix - A CLI that packs a folder into one LLM-friendly context file, with optional PDF, OCR, Office-document, and Markdown-conversion support.
pr-agent-context - A reusable GitHub Actions workflow that publishes managed PR handoff comments for coding agents, combining unresolved review threads, failing checks, log excerpts, and patch coverage.
birch - Hierarchical configuration for Python packages and applications, reading namespaced settings from environment variables and JSON/YAML config files.
s3bp - S3-backed persistence for Python objects, with local disk caching to avoid unnecessary downloads and special attention to pandas DataFrames.
morejson - A drop-in wrapper around Python’s json API that adds encoding support for sets, complex numbers, dates, times, datetimes, timedeltas, and timezones.
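To give a feel for the kind of problem the last item addresses, here is a minimal standard-library sketch of encoding and decoding a few types that plain json rejects. It is not morejson's implementation or API: the names ExtendedEncoder and decode_hook and the "__type__" tagging scheme are assumptions made up for this illustration.

```python
import json
from datetime import date, datetime, timedelta


class ExtendedEncoder(json.JSONEncoder):
    """Hypothetical encoder (not morejson's code) that tags unsupported types."""

    def default(self, o):
        if isinstance(o, (set, frozenset)):
            # Sort by repr so the output is deterministic for mixed item types.
            return {"__type__": "set", "items": sorted(o, key=repr)}
        if isinstance(o, complex):
            return {"__type__": "complex", "real": o.real, "imag": o.imag}
        if isinstance(o, datetime):
            # Checked before date because datetime is a subclass of date.
            return {"__type__": "datetime", "iso": o.isoformat()}
        if isinstance(o, date):
            return {"__type__": "date", "iso": o.isoformat()}
        if isinstance(o, timedelta):
            return {"__type__": "timedelta", "seconds": o.total_seconds()}
        return super().default(o)


def decode_hook(d):
    """Hypothetical object_hook that reverses the tagging done above."""
    kind = d.get("__type__")
    if kind == "set":
        return set(d["items"])
    if kind == "complex":
        return complex(d["real"], d["imag"])
    if kind == "datetime":
        return datetime.fromisoformat(d["iso"])
    if kind == "date":
        return date.fromisoformat(d["iso"])
    if kind == "timedelta":
        return timedelta(seconds=d["seconds"])
    return d


payload = {
    "tags": {1, 2},
    "z": 1 + 2j,
    "when": datetime(2026, 4, 1, 12, 0),
    "day": date(2026, 4, 1),
    "span": timedelta(minutes=5),
}
text = json.dumps(payload, cls=ExtendedEncoder)
restored = json.loads(text, object_hook=decode_hook)
```

A drop-in wrapper spares callers from passing the encoder class and hook on every call, which is the convenience the description above points at.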
Community & Hebrew NLP
NLPH - The Open Natural Language Processing in Hebrew initiative, promoting open tools, resources, datasets, and collaboration for production-ready Hebrew NLP.
DataTalks - A public archive of the Datahack DataTalks meetup series, collecting talks on machine learning, statistics, data engineering, and applied data science.
Recent & Experimental
SynthBanshee - A config-driven Datahack pipeline for generating synthetic Hebrew audio datasets, including dialogue generation, TTS rendering, acoustic augmentation, labeling, and QA.
hocrgen - Dataset operations tooling for HeOCR, covering source ingestion, rights filtering, normalization, review queues, deterministic splits, and benchmark/release assembly for Hebrew OCR data.
leadforge - An opinionated framework for generating narrative-grounded synthetic CRM and go-to-market datasets from simulated commercial worlds.
splendor - A local-first, git-native, schema-driven knowledge compiler for code and research repositories, keeping wiki pages, source manifests, runtime records, and planning objects in version control.