Code - Shay Palachy-Affek

Selected Projects

cachier - Persistent function caching with multiple local and shared backends.
pulearn - Practical positive-unlabeled learning estimators and evaluation tools.
pdpipe - Composable pandas DataFrame transformation pipelines.

Data & ML Tools

pdpipe - A composable pipeline library for pandas DataFrames, with reusable stages for column operations, encoding, and data-preparation workflows.

pulearn - Python estimators, metrics, guides, and examples for learning from positive and unlabeled data, including scikit-learn-compatible PU classifiers.

skift - scikit-learn-compatible wrappers for Python fastText, including DataFrame-friendly classifiers and stacking-friendly text-model adapters.

awesome-twitter-data - A curated, CC0-licensed awesome list of Twitter/X datasets and related resources, with license and dataset-size notes where available.

stationarizer - A pandas-friendly time-series utility that applies ADF/KPSS unit-root checks, multiple-testing correction, differencing, and detrending to stationarize numeric series automatically.

Python & AI Workflow Utilities

cachier - Persistent, stale-aware caching decorators for Python functions, with local files, memory, MongoDB, SQL, Redis, and S3 backends plus async support and cache analytics.

foldermix - A CLI that packs a folder into one LLM-friendly context file, with optional PDF, OCR, Office-document, and Markdown-conversion support.

pr-agent-context - A reusable GitHub Actions workflow that publishes managed PR handoff comments for coding agents, combining unresolved review threads, failing checks, log excerpts, and patch coverage.

birch - Hierarchical configuration for Python packages and applications, reading namespaced settings from environment variables and JSON/YAML config files.

s3bp - S3-backed persistence for Python objects, with local disk caching to avoid unnecessary downloads and special attention to pandas DataFrames.

morejson - A drop-in wrapper around Python’s json API that adds encoding support for sets, complex numbers, dates, times, datetimes, timedeltas, and timezones.
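
To make the morejson idea concrete, here is a minimal standard-library sketch of how a json wrapper can round-trip a few of the types listed above. It is an illustration only, not morejson's actual implementation or API: the names ExtendedEncoder and decode_hook, the "__type__" tag, and the on-disk layout are all assumptions made for this example.

```python
import datetime
import json


class ExtendedEncoder(json.JSONEncoder):
    """Hypothetical encoder (not morejson's code): tags unsupported types."""

    def default(self, obj):
        # json calls default() only for objects it cannot serialize itself.
        if isinstance(obj, (set, frozenset)):
            # Sort by repr so the output is deterministic across runs.
            return {"__type__": "set", "items": sorted(obj, key=repr)}
        if isinstance(obj, complex):
            return {"__type__": "complex", "real": obj.real, "imag": obj.imag}
        # datetime is a subclass of date, so it must be checked first.
        if isinstance(obj, datetime.datetime):
            return {"__type__": "datetime", "iso": obj.isoformat()}
        if isinstance(obj, datetime.date):
            return {"__type__": "date", "iso": obj.isoformat()}
        if isinstance(obj, datetime.timedelta):
            return {"__type__": "timedelta", "seconds": obj.total_seconds()}
        return super().default(obj)


def decode_hook(dct):
    """Rebuild tagged objects; untagged dicts pass through unchanged."""
    kind = dct.get("__type__")
    if kind == "set":
        return set(dct["items"])
    if kind == "complex":
        return complex(dct["real"], dct["imag"])
    if kind == "datetime":
        return datetime.datetime.fromisoformat(dct["iso"])
    if kind == "date":
        return datetime.date.fromisoformat(dct["iso"])
    if kind == "timedelta":
        return datetime.timedelta(seconds=dct["seconds"])
    return dct


def dumps(obj, **kwargs):
    """Same call shape as json.dumps, with the extended encoder plugged in."""
    return json.dumps(obj, cls=ExtendedEncoder, **kwargs)


def loads(text, **kwargs):
    """Same call shape as json.loads, with the decoding hook plugged in."""
    return json.loads(text, object_hook=decode_hook, **kwargs)


payload = {
    "tags": {"a", "b"},
    "z": 1 + 2j,
    "when": datetime.datetime(2024, 1, 2, 3, 4, 5),
    "day": datetime.date(2024, 1, 2),
    "span": datetime.timedelta(minutes=90),
}
restored = loads(dumps(payload))
```

In this sketch, restored compares equal to payload. A tagged-dict encoding like this is one simple way to keep the output valid JSON while preserving type information; a real library may choose a different representation.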

Community & Hebrew NLP

NLPH - The Open Natural Language Processing in Hebrew initiative, promoting open tools, resources, datasets, and collaboration for production-ready Hebrew NLP.

DataTalks - A public archive of the Datahack DataTalks meetup series, collecting talks on machine learning, statistics, data engineering, and applied data science.

Recent & Experimental

SynthBanshee - A config-driven Datahack pipeline for generating synthetic Hebrew audio datasets, including dialogue generation, TTS rendering, acoustic augmentation, labeling, and QA.

hocrgen - Dataset operations tooling for HeOCR, covering source ingestion, rights filtering, normalization, review queues, deterministic splits, and benchmark/release assembly for Hebrew OCR data.

leadforge - An opinionated framework for generating narrative-grounded synthetic CRM and go-to-market datasets from simulated commercial worlds.

splendor - A local-first, git-native, schema-driven knowledge compiler for code and research repositories, keeping wiki pages, source manifests, runtime records, and planning objects in version control.