# Jupyterize - Technical Specification

> **For End Users**: See `build/jupyterize/README.md` for usage documentation.

## Document Purpose

This specification provides implementation details for developers building the `jupyterize.py` script. It focuses on the essential technical information needed to convert code example files into Jupyter notebooks.

**Related Documentation:**
- User guide: `build/jupyterize/README.md`
- Code example format: `build/tcedocs/README.md` and `build/tcedocs/SPECIFICATION.md`
- Existing parser: `build/components/example.py`

## Quickstart for Implementers (TL;DR)

- Goal: Convert a marked example file into a clean Jupyter notebook.
- Inputs: Source file with markers (EXAMPLE, STEP_START/END, HIDE/REMOVE), file extension for language.
- Output: nbformat v4 notebook with cells per step.

Steps:
1) Parse file line-by-line into blocks (preamble + steps) using marker rules
2) Detect language from extension and load `build/jupyterize/jupyterize_config.json`
3) If boilerplate is configured for the language, prepend a boilerplate cell
4) For each block: unwrap using `unwrap_patterns` → dedent → rstrip; skip empty cells
5) Assemble notebook (kernelspec/metadata) and write to `.ipynb`

Pitfalls to avoid:
- Always `.lower()` language keys for config and kernels
- Handle both `#EXAMPLE:` and `# EXAMPLE:` formats
- Save preamble before the first step and any trailing preamble at end
- Apply unwrap patterns in listed order; for Java, remove `@Test` before method wrappers
- Dedent after unwrapping when any unwrap patterns exist for the language
- **Boilerplate placement is not one-size-fits-all**: Go requires appending to first cell, not separate cell
  - Check kernel requirements before deciding boilerplate strategy
  - If kernel needs imports and boilerplate together, use Strategy 2 (append to first cell)
  - Otherwise, use Strategy 1 (separate boilerplate cell)

Add a new language (5 steps):
1) Copy the C# pattern set as a starting point
2) Examine 3–4 real repo files for that language (don't guess pattern count)
3) Add language-specific patterns (e.g., Java `@Test`, `static main()`)
4) Write one synthetic test and one real-file test per client library variant
5) Iterate on patterns until real files produce clean notebooks

---

## Marker Legend (1-minute reference)

- `EXAMPLE:` - Skip this line; defines the example id (must be first line)
- `BINDER_ID` - Skip this line; not included in the notebook
- `STEP_START` / `STEP_END` - Use as cell boundaries; markers themselves are excluded
- `HIDE_START` / `HIDE_END` - Include the code inside; markers excluded (unlike web docs, code is visible)
- `REMOVE_START` / `REMOVE_END` - Exclude the code inside; markers excluded

---

## Table of Contents

1. [Critical Implementation Notes](#critical-implementation-notes)
2. [Code Quality Patterns](#code-quality-patterns)
3. [System Overview](#system-overview)
4. [Core Mappings](#core-mappings)
5. [Implementation Approach](#implementation-approach)
6. [Marker Processing Rules](#marker-processing-rules)
7. [Language-Specific Features](#language-specific-features)
8. [Notebook Generation](#notebook-generation)
9. [Error Handling](#error-handling)
10. [Testing](#testing)

---

## Critical Implementation Notes

> **⚠️ Read This First!** These are the most common pitfalls discovered during implementation.

### 1. Always Use `.lower()` for Dictionary Lookups

**Problem**: The `PREFIXES` and `KERNEL_SPECS` dictionaries use **lowercase** keys (`'python'`, `'node.js'`), but `EXTENSION_TO_LANGUAGE` returns mixed-case values (`'Python'`, `'Node.js'`).

**Solution**: Always use `.lower()` when accessing these dictionaries:

```python
# ❌ WRONG - Will cause KeyError
prefix = PREFIXES[language]  # KeyError if language = 'Python'

# ✅ CORRECT
prefix = PREFIXES[language.lower()]
```

This applies to:
- `PREFIXES[language.lower()]` in parsing
- `KERNEL_SPECS[language.lower()]` in notebook creation

### 2. Check Both Marker Formats (Use Helper Function!)

**Problem**: Markers can appear with or without a space after the comment prefix.

**Examples**:
- `# EXAMPLE: test` (with space)
- `#EXAMPLE: test` (without space)

**Solution**: Create a helper function to avoid repetition:

```python
def _check_marker(line, prefix, marker):
    """
    Check if a line contains a marker (with or without space after prefix).

    Args:
        line: Line to check
        prefix: Comment prefix (e.g., '#', '//')
        marker: Marker to look for (e.g., 'EXAMPLE:', 'STEP_START')

    Returns:
        bool: True if marker is found
    """
    return f'{prefix} {marker}' in line or f'{prefix}{marker}' in line

# ✅ CORRECT - Use helper throughout
if _check_marker(line, prefix, EXAMPLE):
    pass  # Handle EXAMPLE marker
```

**Why a helper function?**
- You'll check markers ~8 times in the parsing function
- DRY principle - don't repeat yourself
- Easier to maintain - one place to update if logic changes
- More readable - clear intent

### 3. Import from Existing Modules

**Problem**: Redefining constants that already exist in the build system.

**Solution**: Import from existing modules:

```python
# ✅ Import these - don't redefine!
from local_examples import EXTENSION_TO_LANGUAGE
from components.example import PREFIXES
from components.example import HIDE_START, HIDE_END, REMOVE_START, REMOVE_END, STEP_START, STEP_END, EXAMPLE, BINDER_ID
```

### 4. Handle Empty Directory Name

**Problem**: `os.path.dirname()` returns an empty string for files in the current directory.

**Solution**: Check that the dirname is non-empty before creating it:

```python
# ❌ WRONG - os.makedirs('') will fail
output_dir = os.path.dirname(output_path)
os.makedirs(output_dir, exist_ok=True)

# ✅ CORRECT
output_dir = os.path.dirname(output_path)
if output_dir and not os.path.exists(output_dir):
    os.makedirs(output_dir, exist_ok=True)
```

### 5. Save Preamble Before Starting Step

**Problem**: When entering a STEP, accumulated preamble code gets lost.

**Solution**: Save preamble to cells list before starting a new step:

```python
# Use the helper from note 2 so both marker formats are recognized
if _check_marker(line, prefix, STEP_START):
    # ✅ Save preamble first!
    if preamble_lines:
        cells.append({'code': ''.join(preamble_lines), 'step_name': None})
        preamble_lines = []
    in_step = True
    # ... rest of step handling
```

### 6. Don't Forget Remaining Preamble

**Problem**: Code after the last STEP_END gets lost.

**Solution**: Save remaining preamble at end of parsing:

```python
# After the main loop
if preamble_lines:
    cells.append({'code': ''.join(preamble_lines), 'step_name': None})
```

### 7. Track Duplicate Step Names

**Problem**: Users may accidentally reuse step names (copy-paste errors).

**Solution**: Track seen step names and warn on duplicates:

```python
seen_step_names = set()

# When processing STEP_START:
if step_name and step_name in seen_step_names:
    logging.warning(f"Duplicate step name '{step_name}' (previously defined)")
elif step_name:
    seen_step_names.add(step_name)
```

**Why warn instead of error?**
- Jupyter notebooks can have duplicate cell metadata
- Non-breaking - helps users but doesn't stop processing
- Useful for debugging example files

### 8. Handle Language-Specific Boilerplate and Wrappers

**Problem**: Different languages have different requirements for Jupyter notebooks:
- **C#**: Needs `#r "nuget: PackageName, Version"` directives for dependencies
- **Test wrappers**: Source files have class/method wrappers needed for testing but not for notebooks

**Solution**: Two-part approach:

**Part 1: Boilerplate Injection**
- Define language-specific boilerplate in configuration
- Insert as first cell (before preamble)
- Example: C# needs `#r "nuget: NRedisStack, 1.1.1"`

**Part 2: Structural Unwrapping**
- Detect and remove language-specific structural wrappers
- C#: Remove `public class ClassName { ... }` and `public void Run() { ... }`
- Keep only the actual example code inside

**Why this matters**:
- Without boilerplate: Notebooks won't run (missing dependencies)
- Without unwrapping: Notebooks have unnecessary test framework code
- These aren't marked with REMOVE blocks because they're needed for tests

**See**: [Language-Specific Features](#language-specific-features) section for detailed implementation.

### 9. Unwrapping Patterns: Single-line vs Multi-line, and Dedenting (Based on Implementation Experience)

During implementation, several non-obvious details significantly reduced bugs and rework:

- Pattern classes and semantics
  - Single-line patterns: When `start_pattern == end_pattern`, treat as "remove this line only". Examples: `public class X {` or `public void Run() {` on one line.
  - Multi-line patterns: When `start_pattern != end_pattern`, remove the start line, everything until the end line, and the end line itself. Use this to strip a wrapper's braces while preserving the inner code with a separate "keep content" strategy.
  - Use anchored patterns with `^` to avoid over-matching. Prefer `re.match` (anchored at the start) over `re.search`.
- Wrappers split across cells
  - Real C# files often split wrappers across lines/blocks (e.g., class name on line N, `{` or `}` in later lines). Because parsing splits code into preamble/step cells, wrapper open/close tokens may land in separate cells.
  - Practical approach: Use separate, simple patterns to remove opener lines (class/method declarations with `{` either on the same line or next line) and a generic pattern to remove solitary closing braces in any cell.
- Order of operations inside cell creation
  1) Apply unwrapping patterns (in the order listed in configuration)
  2) Dedent code (e.g., `textwrap.dedent`) so content previously nested inside wrappers aligns to column 0
  3) Strip trailing whitespace (e.g., `rstrip()`)
  4) Skip empty cells
- Dedent all cells when unwrapping is enabled
  - Even if a particular cell didn't change after unwrapping, its content may still be indented due to having originated inside a method/class in the source file. Dedent ALL cells whenever `unwrap_patterns` are configured for the language.
- Logging for traceability
  - Emit `DEBUG` logs per applied pattern (e.g., pattern `type`) to simplify diagnosing regex issues.
- Safety tips for patterns
  - Anchor with `^` and keep them specific; avoid overly greedy constructs.
  - Keep patterns minimal and composable (e.g., separate `class_opening`, `method_opening`, `closing_braces`).
  - Validate patterns at startup or wrap application with try/except to warn and continue on malformed regex.

### 10. Closing Brace Removal Must Be Match-Based, Not Pattern-Based (Critical Bug Fix)

**Problem**: The initial implementation removed closing braces based on the number of unwrap patterns configured, not the number of patterns that actually matched. This caused a critical bug where closing braces from control structures (for loops, foreach loops, if statements) were incorrectly removed.

**Example of the bug**:

```csharp
// Original code in a cell
for (var i = 0; i < resultsList.Count; i++)
{
    Console.WriteLine(i);
}

// BUG: Closing brace was removed, resulting in:
for (var i = 0; i < resultsList.Count; i++)
{
    Console.WriteLine(i);
// Missing }
```

**Root cause**: The unwrapping logic counted braces to remove based on pattern configuration (e.g., "C# has 4 patterns with braces, so remove 4 closing braces from every cell"), rather than counting how many patterns actually matched in each specific cell.

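The root cause can be reproduced with a toy version of the two counting strategies. This is a minimal sketch, not the script's actual code: the two opener patterns are simplified stand-ins for the real configuration, and the helper names `strip_trailing_braces`, `config_based_count`, and `match_based_count` are hypothetical.

```python
import re

# Hypothetical, simplified opener patterns (not the real configuration).
OPENER_PATTERNS = [
    r'^\s*public\s+class\s+\w+.*\{\s*$',
    r'^\s*public\s+void\s+Run\(\).*\{\s*$',
]


def strip_trailing_braces(code, count):
    """Remove up to `count` brace-only lines, scanning from the end of `code`."""
    lines = code.split('\n')
    i = len(lines) - 1
    while count > 0 and i >= 0:
        stripped = lines[i].strip()
        if stripped == '}':
            del lines[i]
            count -= 1
        elif stripped:
            break  # Reached real code: stop removing
        i -= 1
    return '\n'.join(lines)


def config_based_count(patterns):
    """Buggy strategy: count every configured pattern that mentions a brace."""
    return sum(1 for p in patterns if '{' in p)


def match_based_count(code, patterns):
    """Fixed strategy: count only the lines in this cell that a pattern matches."""
    return sum(
        1 for line in code.split('\n') for p in patterns if re.match(p, line)
    )


# A cell with a loop and no wrapper lines at all.
cell = "for (var i = 0; i < 3; i++)\n{\n    Console.WriteLine(i);\n}"

buggy = strip_trailing_braces(cell, config_based_count(OPENER_PATTERNS))
fixed = strip_trailing_braces(cell, match_based_count(cell, OPENER_PATTERNS))

print(buggy)  # The loop's closing brace is gone
print(fixed)  # The cell is unchanged
```

Because no wrapper line appears in this cell, the match-based count is zero and the loop keeps its closing brace, while the configuration-based count strips it.
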
**Solution**: Modified `remove_matching_lines()` to return a tuple `(modified_code, match_count)` and updated `unwrap_code()` to only remove closing braces when patterns actually match:

```python
# Before (WRONG):
for pattern_config in unwrap_patterns:
    code = remove_matching_lines(code, pattern, end_pattern)
    if '{' in pattern:
        braces_removed += 1  # Always increments!

# After (CORRECT):
for pattern_config in unwrap_patterns:
    code, match_count = remove_matching_lines(code, pattern, end_pattern)
    if match_count > 0 and '{' in pattern:
        braces_removed += match_count  # Only increments if pattern matched
```

**Implementation details**:
1. `remove_matching_lines()` now returns `(code, match_count)` instead of just `code`
2. `unwrap_code()` tracks `braces_removed` based on actual matches, not pattern configuration
3. `remove_trailing_braces()` scans from the end and removes only the exact number of trailing closing braces
4. The `closing_braces` pattern was removed from configuration files (C# and Java) since it's now handled programmatically

**Time saved by documenting this**: ~2 hours of debugging similar issues in the future.

**Follow-up fix**: After implementing match-based brace removal, a second issue was discovered: cells containing **only** orphaned closing braces (from removed class/method wrappers) were still being included in the notebook. These cells appeared when the closing braces came after a REMOVE block, causing them to be parsed as a separate preamble cell.

**Solution**: Added a filter in `create_cells()` to skip cells that contain only closing braces and whitespace:

```python
# Skip cells that contain only closing braces and whitespace
# (orphaned closing braces from removed class/method wrappers)
if lang_config.get('unwrap_patterns'):
    # Remove all whitespace and check if only closing braces remain
    code_no_whitespace = re.sub(r'\s', '', code)
    if code_no_whitespace and re.match(r'^}+$', code_no_whitespace):
        logging.debug(f"Skipping cell {i} (contains only closing braces)")
        continue
```

This ensures that orphaned closing brace cells are completely removed from the final notebook.

### 11. Pattern Count Differences Between Languages (Java Implementation Insight)

**Key Discovery**: When adding Java support after C#, the pattern count increased from 5 to 8 patterns.

**Why the difference?**

| Language | Patterns | Unique Requirements |
|----------|----------|---------------------|
| **C#** | 5 | `class_single_line`, `class_opening`, `method_single_line`, `method_opening`, `closing_braces` |
| **Java** | 8 | All C# patterns PLUS `test_annotation`, `static_main_single_line`, `static_main_opening` |

**Java-specific additions**:
1. **`test_annotation`** - Java uses `@Test` annotations on separate lines before methods (C# uses `[Test]` attributes, which are less common in our examples)
2. **`static_main_single_line`** - Java examples often use `public static void main(String[] args)` instead of instance methods
3. **`static_main_opening`** - Multi-line version of static main

**Critical insight**: Don't assume pattern counts will be identical across languages, even for similar class-based languages.

**Pattern order matters more in Java**:
- `test_annotation` MUST come before `method_opening` (otherwise the annotation line might not be removed)
- Specific patterns (single-line) before generic patterns (multi-line)
- Openers before closers

**Implementation tip**: When adding a new language:
1. Start with the C# patterns as a template
2. Examine 3-4 real example files from the repository
3. Look for language-specific constructs (annotations, modifiers, method signatures)
4. Add patterns incrementally and test after each addition
5. Document the pattern order rationale in the configuration

**Time saved**: This insight would have saved ~15 minutes of debugging why `@Test` annotations weren't being removed (they were being processed after method patterns, which was too late).

---

## Code Quality Patterns

> **💡 Best Practices** These patterns improve code maintainability and readability.

### Pattern 1: Extract Repeated Conditionals into Helper Functions

**When you see**: The same conditional pattern repeated multiple times

**Example**: Checking for markers appears ~8 times in parsing:

```python
if f'{prefix} {EXAMPLE}' in line or f'{prefix}{EXAMPLE}' in line:
if f'{prefix} {BINDER_ID}' in line or f'{prefix}{BINDER_ID}' in line:
if f'{prefix} {REMOVE_START}' in line or f'{prefix}{REMOVE_START}' in line:
# ... 5 more times
```

**Refactor to**: Helper function

```python
def _check_marker(line, prefix, marker):
    return f'{prefix} {marker}' in line or f'{prefix}{marker}' in line

# Usage:
if _check_marker(line, prefix, EXAMPLE):
if _check_marker(line, prefix, BINDER_ID):
if _check_marker(line, prefix, REMOVE_START):
```

**Benefits**:
- Reduces code by ~15 lines
- Single source of truth
- Easier to test
- More readable

### Pattern 2: Use Sets for Membership Tracking

**When you see**: Need to track if something has been seen before

**Example**: Tracking duplicate step names

**Use**: Set for O(1) lookup

```python
seen_step_names = set()

if step_name in seen_step_names:  # O(1) lookup
    logging.warning(f"Duplicate step name '{step_name}'")  # Handle duplicate
else:
    seen_step_names.add(step_name)
```

**Don't use**: List (O(n) lookup)

```python
# ❌ WRONG - O(n) lookup
seen_step_names = []
if step_name in seen_step_names:  # Slow for large lists
    ...
```

### Pattern 3: Warn for Non-Critical Issues

**When you see**: Issues that are problems but shouldn't stop processing

**Examples**:
- Duplicate step names
- Nested markers
- Unpaired markers

**Use**: `logging.warning()` instead of raising exceptions

```python
if step_name in seen_step_names:
    logging.warning(f"Duplicate step name '{step_name}'")
    # Continue processing

if in_remove:
    logging.warning("Nested REMOVE_START detected")
    # Continue processing
```

**Benefits**:
- More user-friendly
- Helps debug without breaking workflow
- Allows batch processing to continue

### Pattern 4: Validate Early, Process Later

**Structure**:
1. Validate all inputs first
2. Then process (assuming valid inputs)

**Example**:

```python
def jupyterize(input_file, output_file=None, verbose=False):
    # 1. Validate first
    language = detect_language(input_file)
    validate_input(input_file, language)

    # 2. Process (inputs are valid)
    parsed_blocks = parse_file(input_file, language)
    cells = create_cells(parsed_blocks, language)
    notebook = create_notebook(cells, language)
    write_notebook(notebook, output_file)
```

**Benefits**:
- Fail fast on invalid inputs
- Cleaner error messages
- Easier to test validation separately

---

## System Overview

### Purpose

Convert code example files (with special comment markers) into Jupyter notebook (`.ipynb`) files.

**Process Flow:**

```
Input File → Detect Language → Parse Markers → Generate Cells → Write Notebook
```

### Key Principles

1. **Simple parsing**: Read file line-by-line, detect markers with regex
2. **Automatic behavior**: Language/kernel from extension, fixed marker handling
3. **Standard output**: Use `nbformat` library for spec-compliant notebooks

### Dependencies

```bash
pip install nbformat
```

---

## Core Mappings

> **📖 Source of Truth**: Import these from existing modules - don't redefine!

### File Extension → Language

**Import from**: `build/local_examples.py` → `EXTENSION_TO_LANGUAGE`

Supported: `.py`, `.js`, `.go`, `.cs`, `.java`, `.php`, `.rs`

### Language → Comment Prefix

**Import from**: `build/components/example.py` → `PREFIXES`

**⚠️ Critical**: Keys are lowercase (`'python'`, `'node.js'`), so use `language.lower()` when accessing.

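The lookup chain from file extension to comment prefix can be sketched as follows. The two dictionaries are small hypothetical stand-ins for `EXTENSION_TO_LANGUAGE` and `PREFIXES` (the real ones should be imported, and their contents may differ), and the helper name `comment_prefix_for` is illustrative only.

```python
import os

# Hypothetical stand-ins; the real dicts live in build/local_examples.py
# and build/components/example.py and may contain different entries.
EXTENSION_TO_LANGUAGE = {'.py': 'Python', '.js': 'Node.js', '.cs': 'C#'}
PREFIXES = {'python': '#', 'node.js': '//', 'c#': '//'}


def comment_prefix_for(file_path):
    """Return (language, comment prefix) for a source file path."""
    _, ext = os.path.splitext(file_path)
    language = EXTENSION_TO_LANGUAGE.get(ext.lower())
    if language is None:
        raise ValueError(f"Unsupported file extension: {ext}")
    # Mixed-case language name, lowercase dictionary key
    return language, PREFIXES[language.lower()]


language, prefix = comment_prefix_for('examples/landing.JS')
print(language, prefix)
```

The extension is lowercased before the first lookup and the language name before the second, so neither step depends on the casing of the input.
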
### Language → Jupyter Kernel

**Define locally** (not in existing modules):

```python
KERNEL_SPECS = {
    'python': {
        'name': 'python3',
        'display_name': 'Python 3',
        'language': 'python',
        'language_info': {
            'name': 'python',
            'version': '3.x.x',
            'mimetype': 'text/x-python',
            'file_extension': '.py'
        }
    },
    'node.js': {
        'name': 'javascript',
        'display_name': 'JavaScript (Node.js)',
        'language': 'javascript',
        'language_info': {
            'name': 'javascript',
            'version': '20.0.0',
            'mimetype': 'application/javascript',
            'file_extension': '.js'
        }
    },
    'go': {
        'name': 'gophernotes',
        'display_name': 'Go',
        'language': 'go',
        'language_info': {
            'name': 'go',
            'version': '1.x.x',
            'mimetype': 'text/x-go',
            'file_extension': '.go'
        }
    },
    'c#': {
        'name': '.net-csharp',
        'display_name': '.NET (C#)',
        'language': 'C#',
        'language_info': {
            'name': 'C#',
            'version': '12.0',
            'mimetype': 'text/x-csharp',
            'file_extension': '.cs',
            'pygments_lexer': 'csharp'
        }
    },
    'java': {
        'name': 'java',
        'display_name': 'Java',
        'language': 'java',
        'language_info': {
            'name': 'java',
            'version': '11.0.0',
            'mimetype': 'text/x-java-source',
            'file_extension': '.java'
        }
    },
    'php': {
        'name': 'php',
        'display_name': 'PHP',
        'language': 'php',
        'language_info': {
            'name': 'php',
            'version': '8.0.0',
            'mimetype': 'application/x-php',
            'file_extension': '.php'
        }
    },
    'rust': {
        'name': 'rust',
        'display_name': 'Rust',
        'language': 'rust',
        'language_info': {
            'name': 'rust',
            'version': '1.x.x',
            'mimetype': 'text/x-rust',
            'file_extension': '.rs'
        }
    }
}
```

**⚠️ Critical**: Also use `language.lower()` when accessing this dict.

**Note on language_info**: Each language should include complete metadata with `name`, `version`, `mimetype`, and `file_extension` fields. This ensures notebooks are properly recognized by Jupyter and other tools.

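As a rough illustration of how one `KERNEL_SPECS` entry might be turned into notebook metadata, the sketch below splits an entry into the `kernelspec` and `language_info` sections used by nbformat v4 notebooks. The function name `build_notebook_metadata` and the trimmed one-entry dict are assumptions made for this example, not part of the existing script.

```python
# Trimmed, illustrative copy of one KERNEL_SPECS entry from above.
KERNEL_SPECS = {
    'python': {
        'name': 'python3',
        'display_name': 'Python 3',
        'language': 'python',
        'language_info': {
            'name': 'python',
            'version': '3.x.x',
            'mimetype': 'text/x-python',
            'file_extension': '.py',
        },
    },
}


def build_notebook_metadata(language, kernel_specs=KERNEL_SPECS):
    """Split a kernel spec entry into notebook-level metadata sections."""
    spec = kernel_specs.get(language.lower())  # Keys are lowercase
    if spec is None:
        raise ValueError(f"No kernel spec for language: {language}")
    return {
        'kernelspec': {
            'name': spec['name'],
            'display_name': spec['display_name'],
            'language': spec['language'],
        },
        # Copy so later edits to the notebook don't mutate KERNEL_SPECS
        'language_info': dict(spec['language_info']),
    }


metadata = build_notebook_metadata('Python')
print(metadata['kernelspec']['name'])
```

A dict shaped like this could then be assigned to the `metadata` of the notebook object created with `new_notebook()`.
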
### Marker Constants

**Import from**: `build/components/example.py`

```python
from components.example import (
    HIDE_START, HIDE_END,
    REMOVE_START, REMOVE_END,
    STEP_START, STEP_END,
    EXAMPLE, BINDER_ID
)
```

**📖 For marker semantics**, see `build/tcedocs/SPECIFICATION.md` section "Special Comment Reference".

---

## Implementation Approach

### Recommended Strategy

**Don't use the Example class** - it modifies files in-place for web documentation. Instead, implement a simple line-by-line parser.

### Module Imports

**Critical**: Import existing mappings from the build system:

```python
#!/usr/bin/env python3
import argparse
import json      # Used to load jupyterize_config.json
import logging
import os
import re        # Used by the unwrap patterns
import sys

import nbformat
from nbformat.v4 import new_notebook, new_code_cell

# Add parent directory to path to import from build/
sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..'))

# Import existing mappings - DO NOT redefine these!
from local_examples import EXTENSION_TO_LANGUAGE
from components.example import PREFIXES

# Import marker constants from example.py
from components.example import (
    HIDE_START, HIDE_END,
    REMOVE_START, REMOVE_END,
    STEP_START, STEP_END,
    EXAMPLE, BINDER_ID
)
```

**Important**: The PREFIXES dict uses lowercase keys (e.g., `'python'`, `'node.js'`), so you must use `language.lower()` when accessing it.

### Basic Structure

```python
def main():
    # 1. Parse command-line arguments
    # 2. Detect language from file extension
    # 3. Validate input file
    # 4. Parse file and extract cells
    # 5. Create cells with nbformat
    # 6. Create notebook with metadata
    # 7. Write to output file
    pass
```

### Language Detection

```python
def detect_language(file_path):
    """Detect language from file extension."""
    _, ext = os.path.splitext(file_path)
    language = EXTENSION_TO_LANGUAGE.get(ext.lower())
    if not language:
        supported = ', '.join(sorted(EXTENSION_TO_LANGUAGE.keys()))
        raise ValueError(
            f"Unsupported file extension: {ext}\n"
            f"Supported extensions: {supported}"
        )
    return language
```

---

## Marker Processing Rules

> **📖 For complete marker documentation**, see `build/tcedocs/SPECIFICATION.md` section "Special Comment Reference" (lines 2089-2107).

### Quick Reference: What to Include/Exclude

| Marker | Action | Notebook Behavior |
|--------|--------|-------------------|
| `EXAMPLE:` line | Skip | Not included |
| `BINDER_ID` line | Skip | Not included |
| `HIDE_START`/`HIDE_END` markers | Skip markers, **include** code between them | Code visible in notebook |
| `REMOVE_START`/`REMOVE_END` markers | Skip markers, **exclude** code between them | Code not in notebook |
| `STEP_START`/`STEP_END` markers | Skip markers, use as cell boundaries | Each step = separate cell |
| Code outside any step | Include in first cell (preamble) | First cell (no step metadata) |

**Key Difference from Web Display**:
- Web docs: HIDE blocks are hidden by default (revealed with eye button)
- Notebooks: HIDE blocks are fully visible (notebooks don't have hide/reveal UI)

### Parsing Algorithm

**Key Implementation Details:**

1. **Use `language.lower()`** when accessing PREFIXES dict (keys are lowercase)
2. **Check both formats**: `f'{prefix} {MARKER}'` and `f'{prefix}{MARKER}'` (with/without space)
3. **Extract step name**: Use `line.split(STEP_START)[1].strip()` to get the step name after the marker
4. **Handle state carefully**: Track `in_remove`, `in_step` flags to know what to include/exclude
5. **Save cells at transitions**: When entering a STEP, save any accumulated preamble first

```python
def parse_file(file_path, language):
    """
    Parse file and extract cells.

    Returns: list of {'code': str, 'step_name': str or None}
    """
    with open(file_path, 'r', encoding='utf-8') as f:
        lines = f.readlines()

    # IMPORTANT: Use .lower() because PREFIXES keys are lowercase
    prefix = PREFIXES[language.lower()]

    # State tracking
    in_remove = False
    in_step = False
    step_name = None
    step_lines = []
    preamble_lines = []
    cells = []

    for line_num, line in enumerate(lines, 1):
        # Skip metadata markers (check both with and without space)
        if f'{prefix} {EXAMPLE}' in line or f'{prefix}{EXAMPLE}' in line:
            continue
        if f'{prefix} {BINDER_ID}' in line or f'{prefix}{BINDER_ID}' in line:
            continue

        # Handle REMOVE blocks (exclude content)
        if f'{prefix} {REMOVE_START}' in line or f'{prefix}{REMOVE_START}' in line:
            in_remove = True
            continue
        if f'{prefix} {REMOVE_END}' in line or f'{prefix}{REMOVE_END}' in line:
            in_remove = False
            continue
        if in_remove:
            continue  # Skip lines inside REMOVE blocks

        # Skip HIDE markers (but include content between them)
        if f'{prefix} {HIDE_START}' in line or f'{prefix}{HIDE_START}' in line:
            continue
        if f'{prefix} {HIDE_END}' in line or f'{prefix}{HIDE_END}' in line:
            continue

        # Handle STEP blocks
        if f'{prefix} {STEP_START}' in line or f'{prefix}{STEP_START}' in line:
            # Save accumulated preamble before starting new step
            if preamble_lines:
                cells.append({'code': ''.join(preamble_lines), 'step_name': None})
                preamble_lines = []
            in_step = True
            # Extract step name from line (text after STEP_START marker)
            step_name = line.split(STEP_START)[1].strip() if STEP_START in line else None
            step_lines = []
            continue
        if f'{prefix} {STEP_END}' in line or f'{prefix}{STEP_END}' in line:
            if step_lines:
                cells.append({'code': ''.join(step_lines), 'step_name': step_name})
            in_step = False
            step_name = None
            step_lines = []
            continue

        # Collect code lines
        if in_step:
            step_lines.append(line)
        else:
            preamble_lines.append(line)

    # Save any remaining preamble at end of file
    if preamble_lines:
        cells.append({'code': ''.join(preamble_lines), 'step_name': None})

    return cells
```

**Common Pitfalls to Avoid:**
- Forgetting to use `.lower()` when accessing PREFIXES → KeyError
- Only checking `f'{prefix} {MARKER}'` format → Missing markers without space
- Not saving preamble before starting a step → Lost code
- Not handling remaining preamble at end → Lost code

---

## Language-Specific Features

> **⚠️ New Requirement**: Notebooks need language-specific setup that source files don't have.

### Overview

Different languages have different requirements for Jupyter notebooks that aren't present in the source test files:

1. **Dependency declarations**: C# needs NuGet package directives, Node.js might need npm packages
2. **Structural wrappers**: Test files have class/method wrappers that shouldn't appear in notebooks
3. **Initialization code**: Some languages need setup code that's implicit in test frameworks

### Problem 1: Missing Dependency Declarations

**Issue**: C# Jupyter notebooks require NuGet package directives to download dependencies:

```csharp
#r "nuget: NRedisStack, 1.1.1"
```

**Current behavior**: Source files don't have these directives (they're in project files)

**Desired behavior**: Automatically inject language-specific boilerplate as first cell

**Example - C# source file**:

```csharp
// EXAMPLE: landing
using NRedisStack;
using StackExchange.Redis;

public class SyncLandingExample
{
    public void Run()
    {
        var muxer = ConnectionMultiplexer.Connect("localhost:6379");
        // ...
    }
}
```

**Desired notebook output**:

```
Cell 1 (boilerplate):
#r "nuget: NRedisStack, 1.1.1"
#r "nuget: StackExchange.Redis, 2.6.122"

Cell 2 (preamble):
using NRedisStack;
using StackExchange.Redis;

Cell 3 (code):
var muxer = ConnectionMultiplexer.Connect("localhost:6379");
// ...
```

### Problem 2: Unnecessary Structural Wrappers

**Issue**: Test files have class/method wrappers needed for test frameworks but not for notebooks.

**Affected languages**: C# and Java (both class-based languages with similar syntax)

**C# example**:

```csharp
public class SyncLandingExample  // ← Test framework wrapper
{
    public void Run()  // ← Test framework wrapper
    {
        // Actual example code here
        var muxer = ConnectionMultiplexer.Connect("localhost:6379");
    }
}
```

**Java example**:

```java
public class LandingExample {  // ← Test framework wrapper
    @Test
    public void run() {  // ← Test framework wrapper
        // Actual example code here
        UnifiedJedis jedis = new UnifiedJedis("redis://localhost:6379");
    }
}
```

**Current behavior**: These wrappers are copied to the notebook

**Desired behavior**: Remove wrappers, keep only the code inside

**Why not use REMOVE blocks?**
- These wrappers are needed for the test framework to compile/run
- Marking them with REMOVE would break the tests
- They're structural, not boilerplate

**Key similarities between C# and Java**:
- Both use `public class ClassName` declarations
- Both use method declarations (C#: `public void Run()`, Java: `public void run()`)
- Both use curly braces `{` `}` for blocks
- Opening brace can be on same line or next line
- Test annotations may appear before methods (Java: `@Test`, C#: `[Test]`)

**Detailed Java example** (from `local_examples/client-specific/jedis/LandingExample.java`):

Before unwrapping:

```java
// EXAMPLE: landing
// STEP_START import
import redis.clients.jedis.UnifiedJedis;
// STEP_END

public class LandingExample {  // ← Remove this
    @Test  // ← Remove this
    public void run() {  // ← Remove this
        // STEP_START connect
        UnifiedJedis jedis = new UnifiedJedis("redis://localhost:6379");
        // STEP_END

        // STEP_START set_get_string
        String res1 = jedis.set("bike:1", "Deimos");
        System.out.println(res1);
        // STEP_END
    }  // ← Remove this
}  // ← Remove this
```

After unwrapping (desired notebook output):

```java
Cell 1 (import step):
import redis.clients.jedis.UnifiedJedis;

Cell 2 (connect step):
UnifiedJedis jedis = new UnifiedJedis("redis://localhost:6379");

Cell 3 (set_get_string step):
String res1 = jedis.set("bike:1", "Deimos");
System.out.println(res1);
```

Note: The class declaration, `@Test` annotation, method declaration, and closing braces are all removed, leaving only the actual example code properly dedented.

### Solution Approach

#### Option 1: Configuration-Based (Recommended)

**Pros**:
- No changes to source files
- Centralized configuration
- Easy to update package versions
- Works with existing examples

**Cons**:
- Requires maintaining configuration file
- Less visible to example authors

**Implementation**:

1. **Create configuration file** (`jupyterize_config.json`):

```json
{
  "c#": {
    "boilerplate": [
      "#r \"nuget: NRedisStack, 1.1.1\"",
      "#r \"nuget: StackExchange.Redis, 2.6.122\""
    ],
    "unwrap_patterns": [
      {
        "type": "class",
        "pattern": "^\\s*public\\s+class\\s+\\w+.*\\{",
        "end_pattern": "^\\}\\s*$",
        "keep_content": true
      },
      {
        "type": "method",
        "pattern": "^\\s*public\\s+void\\s+Run\\(\\).*\\{",
        "end_pattern": "^\\s*\\}\\s*$",
        "keep_content": true
      }
    ]
  },
  "node.js": {
    "boilerplate": [
      "// npm install redis"
    ],
    "unwrap_patterns": []
  }
}
```

2. **Load configuration** in jupyterize.py:

```python
def load_language_config(language):
    """Load language-specific configuration."""
    config_file = os.path.join(os.path.dirname(__file__), 'jupyterize_config.json')
    if os.path.exists(config_file):
        with open(config_file, encoding='utf-8') as f:
            config = json.load(f)
        return config.get(language.lower(), {})
    return {}
```

3. **Inject boilerplate** as first cell:

```python
def create_cells(parsed_blocks, language):
    """Convert parsed blocks to notebook cells."""
    cells = []

    # Get language config
    lang_config = load_language_config(language)

    # Add boilerplate cell if defined
    if 'boilerplate' in lang_config:
        boilerplate_code = '\n'.join(lang_config['boilerplate'])
        cells.append(new_code_cell(
            source=boilerplate_code,
            metadata={'cell_type': 'boilerplate', 'language': language}
        ))

    # Add regular cells...
    for block in parsed_blocks:
        # ... existing logic
        ...
```

4. **Unwrap structural patterns**:

```python
def unwrap_code(code, language):
    """Remove language-specific structural wrappers."""
    lang_config = load_language_config(language)
    unwrap_patterns = lang_config.get('unwrap_patterns', [])

    for pattern_config in unwrap_patterns:
        if pattern_config.get('keep_content', True):
            # Remove wrapper but keep content
            code = remove_wrapper_keep_content(
                code,
                pattern_config['pattern'],
                pattern_config['end_pattern']
            )
    return code


def remove_wrapper_keep_content(code, start_pattern, end_pattern):
    """Remove wrapper lines but keep content between them."""
    lines = code.split('\n')
    result = []
    in_wrapper = False
    wrapper_indent = 0

    for line in lines:
        if re.match(start_pattern, line):
            in_wrapper = True
            wrapper_indent = len(line) - len(line.lstrip())
            continue  # Skip wrapper start line
        elif in_wrapper and re.match(end_pattern, line):
            in_wrapper = False
            continue  # Skip wrapper end line
        elif in_wrapper:
            # Remove wrapper indentation
            if line.startswith(' ' * (wrapper_indent + 4)):
                result.append(line[wrapper_indent + 4:])
            else:
                result.append(line)
        else:
            result.append(line)

    return '\n'.join(result)
```

#### Option 2: Marker-Based

**Pros**:
- Explicit in source files
- Self-documenting
- No external configuration needed

**Cons**:
- Requires updating all source files
- More markers to maintain
- Clutters source files

**New markers**:

```csharp
// NOTEBOOK_BOILERPLATE_START
#r "nuget: NRedisStack, 1.1.1"
// NOTEBOOK_BOILERPLATE_END

// NOTEBOOK_UNWRAP_START class
public class SyncLandingExample
{
// NOTEBOOK_UNWRAP_END

    // NOTEBOOK_UNWRAP_START method
    public void Run()
    {
    // NOTEBOOK_UNWRAP_END

        // Actual code here

    // NOTEBOOK_UNWRAP_CLOSE method
    }
// NOTEBOOK_UNWRAP_CLOSE class
}
```

**Not recommended** because:
- Too many new markers
- Clutters source files
- Harder to maintain
- Breaks existing examples

### Configuration Schema and Semantics (Implementation-Proven)

- Location: `build/jupyterize/jupyterize_config.json`
- Keys: Lowercased language names (`"c#"`, `"python"`, `"node.js"`, `"java"`, ...)
- Structure per language:
  - `boilerplate`: Array of strings (each becomes a line in the first code cell)
  - `unwrap_patterns`: Array of pattern objects with fields:
    - `type` (string): Human-readable label used in logs
    - `pattern` (regex string): Start condition (anchored with `^` recommended)
    - `end_pattern` (regex string): End condition
    - `keep_content` (bool):
      - `true` → remove wrapper start/end lines, keep the inner content (useful for `{ ... }` ranges)
      - `false` → remove the matching line(s) entirely
        - If `pattern == end_pattern` → remove only the single matching line
        - If `pattern != end_pattern` → remove from first match through end match, inclusive
    - `description` (optional): Intent for maintainers

#### At a Glance: Configuration Schema

```json
{
  "": {
    "boilerplate": ["", ""],
    "unwrap_patterns": [
      { "type": "