#tokenize
2 APIs with this tag
N-gram API
Generate n-grams from text, with frequency counts — entirely locally. The ngrams endpoint breaks text into contiguous sequences of n tokens and returns each distinct n-gram with how often it occurs, ranked by frequency: word n-grams (unigrams, bigrams, trigrams and beyond) for phrase and collocation analysis, or character n-grams (shingles) for fuzzy matching, language detection and indexing. The range endpoint produces every size from a minimum to a maximum in a single call (for example 1–3 grams), which is exactly what you need to build feature vectors. Choose word or character mode, whether to lower-case first, and a top-N limit to keep only the most frequent. Word tokenization is Unicode-aware and keeps internal apostrophes and hyphens (don't, well-known) as single tokens. Everything runs locally and deterministically, so it is fast and private. Ideal for text mining and NLP feature extraction, language modelling and autocomplete, search indexing and shingling, plagiarism and similarity detection, and keyword and collocation analysis. Pure local computation — no key, no third-party service, instant. Live, nothing stored. 3 endpoints. This produces n-grams and counts; for extractive summaries and keywords use a summarize API and for grapheme/character counting use a text-segmentation API.
api.oanor.com/ngram-api
Case Detect API
Detect which case convention a string uses, and split identifiers into their constituent words. The detect endpoint classifies any value as camelCase, PascalCase, snake_case, CONSTANT_CASE, kebab-case, COBOL-CASE, Train-Case, dot.case, Title Case, Sentence case, lowercase or UPPERCASE — or mixed when it does not fit — and reports the separator it found and the words it is built from. The split endpoint tokenizes any identifier into words: it breaks camelCase humps, handles acronym boundaries correctly (HTTPServer → HTTP, Server; XMLHttpRequest → XML, Http, Request), and splits on digits and on underscores, dashes, dots and spaces, returning both the original-case tokens and lower-cased words ready to feed into a converter. Ideal for linters and code-mod tools, refactoring, API and schema validators, autocomplete and search, and any pipeline that needs to understand identifier naming. Pure local computation — no key, no third-party service, instant. Live, nothing stored. 3 endpoints. This DETECTS and tokenizes a case convention; to CONVERT a string between case styles use a text-case API.
api.oanor.com/casedetect-api