# SecondTouchReality

**Repository Path**: JINTIAN-JTST/SecondTouchReality

## Basic Information

- **Project Name**: SecondTouchReality
- **Description**: This is the second edition of OneTouchReality, the follow-up to the project built at Adventure X 2025, now aiming for the Cambridge EduX Hackathon 2025. A modular VR hand-gesture toolkit that turns a camera + natural language into interactive Unity scenes with pinch grabbing and servo haptic feedback.
- **Primary Language**: C#
- **License**: Unlicense
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-01-06
- **Last Updated**: 2026-03-14

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# SecondTouchReality Overview

### **SecondTouchReality is an evolution of OneTouchReality**

SecondTouchReality is a derivative of a **"VR hand-gesture based teaching-object system"**. Its focus is:

* **natural-language → object generation**
* **pinch-gesture grabbing in 3D**
* **single-channel servo feedback of the pinch state**

with a more modular and extensible architecture.

It's a small but reasonably complete **end-to-end system** that keeps the whole chain while stripping it down to a lightweight skeleton:

* **Python side**: hand tracking + depth estimation + text classification model
* **Unity side**: 3D hand skeleton reconstruction + grabbing interaction + camera control
* **Hardware side**: Arduino / servo / glove (interface reserved)

Wired together as one pipeline:

```
Camera → Python understands your hand + your sentence → Unity generates an interactive 3D scene → drives hardware feedback
```

The repo is still at the prototype stage, but it already shows a full loop of:

```text
Perception → Semantics → Interaction → Hardware
```

---

## 1. System Overview

From an end-user point of view, the system does three main things:

1. **Understand your hand**
   * Python uses MediaPipe Hands to detect 21 hand keypoints.
   * It computes palm width, palm length, finger curl, side pose, and palm/back-of-hand orientation.
   * A one-time calibration converts palm width / length into the real-world wrist-to-camera distance (meters), with filtering.
   * It packs the wrist 3D position, 20 bone direction vectors, and the `pinch` state into JSON and sends it to Unity over UDP (see the payload sketch after this list).
2. **Understand your words**
   * Unity pops up a dialog where you type an English description (e.g. `"a small red apple"`).
   * A lightweight text model on the Python side uses `HashingVectorizer + SGDClassifier` for multi-class classification and outputs a discrete label (e.g. `"101"`).
   * A TCP connection returns `"101"` to Unity.
3. **Let you grab things**
   * Unity uses `HandFromVectors` to reconstruct the 3D hand skeleton from the UDP JSON, drawing it with spheres and lines.
   * `PinchGrabBall` lets you pinch objects in the scene so they follow your hand.
   * `HandOrbitCamera` lets you rotate / zoom the camera with pinch + hand movement when you're not grabbing anything.
   * `ModelLibrary` / `RuntimeModelLoader` load the corresponding 3D model (prefab or GLB) based on the label.
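To make the data contract concrete, here is a minimal sketch of what such a per-frame UDP packet could look like. The field names (`hands`, `wrist`, `z_m`, `bones`, `pinch`) are assumptions pieced together from the descriptions in this README; the authoritative schema is whatever `tools/hand_udp.py` actually emits.

```python
# Minimal sketch of the per-frame UDP packet, with ASSUMED field names
# ("hands", "wrist", "z_m", "bones", "pinch"); the real schema lives in
# tools/hand_udp.py.
import json
import socket

UNITY_ADDR = ("127.0.0.1", 5065)  # default Unity port from this README

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

payload = {
    "hands": [
        {
            "is_left": False,
            "wrist": {"x": 0.52, "y": 0.48, "z_m": 0.45},  # normalized xy + depth (m)
            "bones": [[0.0, 1.0, 0.0]] * 20,               # 20 unit direction vectors
            "pinch": True,                                  # thumb + index pinched
        }
    ]
}

sock.sendto(json.dumps(payload).encode("utf-8"), UNITY_ADDR)
```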
**Directory Structure**

```text
SecondTouchReality/
├── README.md                      # English / mixed-language readme (overall design)
├── README_CHN.md                  # Chinese readme (overall design)
├── requirements.txt               # Python dependencies
├── text_model.pkl                 # Trained text classification model
├── main.py                        # combined_server, one-click start of the whole pipeline
├── .gitignore
├── unecessary/                    # Old scripts or unused resources
├── tools/
│   ├── hand_easy.py               # Hand distance estimation + calibration logic
│   ├── hand_udp.py                # Multi-hand + bone vectors + pinch → UDP JSON
│   ├── arduino_udp_receive.py     # Simple bridge: UDP → Arduino
│   ├── text_infer_server.py       # Text model inference TCP server
│   ├── run_model.py               # Load text_model.pkl, provide CLI inference
│   └── __pycache__/               # Python cache
├── test/
│   ├── collect_data.py            # Collect "description text + label" data
│   ├── clean_dataset.py           # Clean JSONL dataset
│   ├── train_model.py             # Train text classification model, output text_model.pkl
│   ├── text_object_dataset.jsonl
│   ├── cleaned_text_object_dataset.jsonl
│   └── object_models_csv.csv      # Text labels ↔ model ID mapping table
├── Game/                          # Unity demo project (can be opened directly)
│   ├── SampleScene.unity          # Main scene
│   ├── models/                    # Several glb models (apple, banana, bowl, etc.)
│   └── Scripts/                   # Main C# scripts
│       ├── HandFromVectors.cs
│       ├── HandOrbitCamera.cs
│       ├── PinchGrabBall.cs
│       ├── ModelLibrary.cs
│       ├── RuntimeModelLoader.cs
│       └── TextQueryClient_TMP.cs
└── ...
```

**Root directory**

* `main.py`: Recommended entry script. Opens the camera, starts the UDP → Unity hand data stream, registers the `on_payload` callback to map pinch state to serial commands, and starts the text inference server.
* `requirements.txt`: All Python dependencies; usually installed via `pip install -r requirements.txt`.
* `text_model.pkl`: Trained by `test/train_model.py`; used to map natural-language descriptions to object labels.

**`tools/` – runtime tools layer**

* `hand_easy.py`: Encapsulates distance calibration and filtering logic; used by `hand_udp.py` and other scripts.
* `hand_udp.py`:
  * Uses MediaPipe, supports multiple hands.
  * Outputs wrist depth, bone directions, pinch state.
  * Sends JSON packets to Unity via UDP (default `127.0.0.1:5065`).
  * Allows an external `on_payload` callback.
* `text_infer_server.py`: Starts a `socketserver`-based multithreaded TCP server; uses `infer_once` from `run_model.py` to call the text model in memory.
* `arduino_udp_receive.py`: Simplified bridge program; reads hand JSON over UDP, cares only about `hands[].pinch`, detects state changes and sends `'0'`/`'1'` to the Arduino serial port (see the sketch at the end of this section).

**`test/` – data & model playground**

* `collect_data.py`: Interactive CLI tool to quickly collect training data.
* `clean_dataset.py`: Filters out samples containing Chinese, keeps only clean English text + labels, outputs `cleaned_text_object_dataset.jsonl`.
* `train_model.py`: Trains a model on the cleaned data and saves it as `text_model.pkl` for the main program.

**`Game/` – Unity demo**

* `HandFromVectors.cs`: UDP client; parses JSON from Python, reconstructs joint positions and visualizes them with spheres and lines.
* `PinchGrabBall.cs`: Turns any 3D object into a "pinchable" object, handling grab/follow/release and smooth motion.
* `HandOrbitCamera.cs`: Uses pinch to control camera rotation and zoom.
* `ModelLibrary.cs`: Maintains a name → GameObject dictionary and provides `ShowModelByLabel(label)` for directly using text inference results.
* `TextQueryClient_TMP.cs`: TextMeshPro-based text input client; talks to the Python text server.
* `RuntimeModelLoader.cs`: Loads GLB models by index from `StreamingAssets/models` at runtime to expand the model library.
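Since `arduino_udp_receive.py` is the simplest link in the chain, a minimal sketch of its UDP → serial idea may help. It assumes the `hands[].pinch` field and port 5066 described above; `COM9` is a placeholder serial port, and `serial` is pyserial:

```python
# Minimal sketch of the UDP → serial bridge idea; port 5066 and the
# hands[].pinch field follow this README, COM9 is a placeholder.
import json
import socket

import serial  # pyserial

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("127.0.0.1", 5066))     # same JSON stream Unity receives
ser = serial.Serial("COM9", 9600)  # change to your actual port

last_state = None
while True:
    data, _ = sock.recvfrom(65536)
    payload = json.loads(data.decode("utf-8"))
    # One channel of feedback: is *any* hand pinching this frame?
    pinching = any(h.get("pinch") for h in payload.get("hands", []))
    if pinching != last_state:     # only write on state changes
        ser.write(b"1" if pinching else b"0")
        last_state = pinching
```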
---

## 2. Tech Stack

### 2.1 Python side

* Python 3.x
* OpenCV (`cv2`) – camera capture + HUD drawing
* MediaPipe Hands – hand keypoint detection
* NumPy – vector operations, statistics (median, EMA, etc.)
* scikit-learn – text features (`HashingVectorizer`) + linear classifier (`SGDClassifier`)
* joblib – serializes the model + label encoder (`text_model.pkl`)
* socket / socketserver – UDP + TCP communication
* pyserial – serial communication with Arduino

Main Python files:

* `hand_udp.py` – main hand tracking + UDP streaming
* `hand_easy.py` – depth estimation demo / debugging
* `collect_data.py` / `clean_dataset.py` / `train_model.py` / `run_model.py` – text model data & training toolchain
* `text_infer_server.py` – text inference TCP server
* `main.py` – combines hand tracking + text server + serial bridge into a single process
* `arduino_udp_receive.py` – alternative: standalone UDP → serial bridge

### 2.2 Unity / C# side

* Unity 202x
* C# scripts:
  * `HandFromVectors.cs` – UDP receiver + hand skeleton reconstruction + GUI tuning
  * `PinchGrabBall.cs` – grab logic for objects
  * `HandOrbitCamera.cs` – orbit/zoom camera around a scene target
  * `ModelLibrary.cs` – treats children/prefabs as a "model dictionary"
  * `RuntimeModelLoader.cs` – dynamic `.glb` loading with GLTFast
  * `TextQueryClient.cs` (class `TextQueryClient_TMP`) – Unity-side TCP client + UI for text
* TextMeshPro – input and text display
* GLTFast – runtime loading of .glb / .gltf models

### 2.3 Hardware side

* Arduino (Uno / Nano, etc.)
* One or more simple servos (for the demo)
* Very simple serial protocol: send one ASCII char per update, e.g. `'0'` / `'1'`.

---

## 3. Typical Run Flow

A typical workflow is:

1. **Start the Python side first** (`main.py`).
2. In the camera window that pops up, press **`c`** to calibrate, `r` to reset, `q` to quit.
3. **Then open the Unity project** and load the `Skin` scene (or your own demo scene).
4. In Unity, click **Play**:
   * The 3D skeleton hand follows your real hand.
   * Pinch to rotate the camera or grab objects.
   * Enter a sentence in the dialog box and the system will generate the corresponding 3D model in front of you.

Details follow.

---

### 3.1 Configure the Python Environment

1. Create a virtual environment in the project root (recommended):

   ```bash
   python -m venv .venv
   .venv\Scripts\activate
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Make sure the camera is accessible to OpenCV / MediaPipe.

---

### 3.2 Start the combined server `main.py`

In the project root:

```bash
python main.py
```

You'll get:

* A camera preview window with a HUD (FPS, calibration status, etc.).
* In the background:
  * The UDP hand data server (for Unity).
  * The text TCP server (listening on `127.0.0.1:9009`).
  * The serial port (if an Arduino is connected).

In the camera window:

* Press **`c`**:
  * Open your palm, face the camera, and keep still; it samples about 50 frames.
  * The terminal asks you for the real wrist-to-camera distance (meters), e.g. `0.45`.
  * It uses the median palm width/length to compute `k_w` / `k_l`, which are later used to estimate Z (see the sketch below).
* Press **`r`**: reset calibration.
* Press **`q`**: exit the Python side.

After calibration, the HUD text changes from "Calib: NOT SET" to something like "Calib: OK".
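The calibration itself boils down to an inverse-proportion model: palm width in pixels shrinks as the hand moves away from the camera. A minimal sketch of the width channel, with illustrative sample values and variable names (`w_samples`, `d_real`); the length channel `k_l` works the same way, and the real logic lives in `tools/hand_easy.py`:

```python
# Minimal sketch of the width-channel calibration; sample values and
# variable names are illustrative, see tools/hand_easy.py for the real logic.
import numpy as np

# ~50 palm-width samples (pixels) collected while the hand is held still
w_samples = np.array([182.0, 185.5, 183.9, 184.2, 186.1])  # example values
d_real = 0.45  # real wrist-to-camera distance in meters, typed in by the user

w_med = np.median(w_samples)
k_w = d_real * w_med          # calibration constant for the width channel

# Later, per frame: palm width is roughly inversely proportional to depth.
palm_width_px = 120.0
Z = k_w / palm_width_px       # estimated wrist depth in meters
print(f"Z ≈ {Z:.2f} m")
```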
---

### 3.3 Open the Unity scene

1. Open the project in Unity. The main demo scene is typically `Skin.unity`. The scene should contain:
   * An object with `HandFromVectors`:
     * `listenPort = 5065` (must match Python).
     * `targetCamera` set (usually the main camera).
   * The main camera with `HandOrbitCamera` attached.
   * A node with `ModelLibrary`; its children are model templates (their names usually match label names).
   * A UI Canvas with `TextQueryClient_TMP` attached, pointing to the TMP input field and buttons.
2. Click **Play**:
   * You'll see 3D hand bones (spheres + lines) in the camera view.
   * When you pinch (thumb + index):
     * If a `PinchGrabBall` object is nearby, it gets grabbed and follows your hand.
     * If nothing is grabbed, `HandOrbitCamera` interprets the pinch as camera control: moving your hand rotates the view.
3. In the UI, click the button to open the dialog and enter an English description, for example:

   ```text
   a green apple
   ```

   and click confirm:
   * Unity sends this string to `127.0.0.1:9009`.
   * Python runs the text model and returns something like `102|0.93`.
   * `TextQueryClient_TMP` parses `label = "102"` and calls `ModelLibrary.ShowModelByLabel("102")`.
   * `ModelLibrary` / `RuntimeModelLoader` spawn the corresponding model in front of the camera and auto-attach `PinchGrabBall` so you can pinch it.

---

### 3.4 Connect Arduino

1. Flash a simple serial control sketch on the Arduino, e.g.:
   * `Serial.begin(9600);`
   * `if (Serial.available()) char c = Serial.read();`
   * If `c == '1'` → the servo turns to 45°; if `c == '0'` → the servo returns to 0°.
2. In `main.py` or `arduino_udp_receive.py`, change `COM9` to your actual serial port.
3. Run Python:
   * When hand tracking works, `on_payload` or `arduino_udp_receive` sends `'1'` / `'0'` according to the pinch state.
   * You should see the servo move as you pinch / release.

---

## 4. Main Python Scripts (by function)

### 4.1 Hand tracking: `hand_udp.py`

Summary:

* Opens the camera and uses MediaPipe Hands to detect multiple hands.
* For each frame it computes:
  * Palm width / palm length (pixels)
  * Finger curl `curl` (0–1)
  * Side pose `side` (0–1)
  * Palm/back-of-hand orientation `palm_front`
  * Wrist depth `Z` (meters)
  * 20 bone direction vectors (unit vectors)
  * `pinch` (thumb + index pinched or not)
* Packs it into JSON and sends it via UDP to:
  * The Unity port (default 5065)
  * The Arduino UDP bridge port (default 5066)

Key points:

* **Calibration logic**
  * Uses a `CalibState` struct to store sampled palm widths/lengths, `k_w`, `k_l`, etc.
  * When you press `c`, it samples for a while, then asks for the real distance and computes `k_w = d_real * w_med` and similar.
  * For depth estimation, it uses two channels, `Z ≈ k_w / palm_width` and `Z ≈ k_l / palm_length`, then fuses them based on curl / side.
* **Pinch detection** (see the sketch after this list)
  * Typically checks the distance between `thumb_tip` and `index_tip`; if it's below a threshold, it's a pinch.
  * Writes `"pinch": true/false` directly into the JSON.
* **`on_payload` callback** (registered in `main.py`)
  * Receives the whole JSON per frame; can count how many hands are pinching and do post-processing (e.g. serial output).
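A minimal sketch of what threshold-based pinch detection looks like on MediaPipe landmarks (4 and 8 are the thumb-tip and index-tip indices in MediaPipe's hand model); the exact threshold used in `hand_udp.py` may differ:

```python
# Minimal sketch of threshold-based pinch detection on MediaPipe hand
# landmarks; the 0.06 threshold is illustrative, not the repo's value.
import math

THUMB_TIP, INDEX_TIP = 4, 8  # MediaPipe Hands landmark indices


def is_pinching(landmarks, threshold=0.06):
    """landmarks: list of 21 (x, y) normalized image coordinates."""
    tx, ty = landmarks[THUMB_TIP]
    ix, iy = landmarks[INDEX_TIP]
    return math.hypot(tx - ix, ty - iy) < threshold
```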
---

### 4.2 Depth estimation demo: `hand_easy.py`

This script pulls the depth-estimation piece out so you can play with it separately, without UDP or Unity.

Important functions:

* `compute_palm_width_and_length(...)` – computes palm width & length in pixels given the landmarks; used as depth proxies.
* `compute_curl(...)` – uses finger joint angles to determine whether the hand is open or in a fist.
* `compute_side(...)` – detects whether the hand faces the camera or is turned sideways.
* `fuse_depth(Zw, Zl, curl, side, palm_front, ...)` – fuses the two depth channels into a final `Z_final` with weighting and correction terms.
* `draw_hud(...)` – prints all intermediate values on the image for easier tuning and understanding.

---

### 4.3 Text model pipeline

* **Raw data**: `text_object_dataset.jsonl`
  Each line is a JSON record, `{"text": "...", "label": "101"}`, containing both Chinese and English.
* `collect_data.py`: interactively adds data.
* `clean_dataset.py`: filters out samples containing Chinese characters and writes `cleaned_text_object_dataset.jsonl`.
* `train_model.py`: trains / incrementally trains an `SGDClassifier` on the cleaned data, saves it to `text_model.pkl`, and prints training metrics (see the training sketch at the end of this section).
* `run_model.py`: tests the model on the command line, printing the top-k labels + probabilities for given sentences.
* **`text_infer_server.py`**:
  * Loads the model once at startup.
  * Exposes a TCP server: for each line of text it receives → runs inference → returns `"label|prob\n"`.

---

### 4.4 Combined entry: `main.py`

`main.py` glues the three directions together:

1. **Hand tracking + UDP**: starts the tracking loop from `tools.hand_udp`.
2. **Serial bridge**: registers `on_payload(payload)`:
   * Checks whether any hand in the JSON is pinching.
   * If yes → `ser.write(b"1")`; otherwise → `ser.write(b"0")`.
3. **Text TCP server**: uses `TextInferHandler` + `ThreadedTCPServer` to listen on port 9009 for text from Unity.

So you only need to run `python main.py` to get Unity hand tracking + text-driven object generation + hardware feedback all at once.

---

### 4.5 UDP → serial bridge: `arduino_udp_receive.py`

If you don't want to mix too much logic into `main.py`, you can run this script separately:

* Listens on UDP port 5066 for the same JSON as Unity.
* Parses the current pinch state.
* When the state changes, sends `'0'` or `'1'` to the Arduino serial port.
* Good for debugging the "hardware bridge" in isolation.
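To make the §4.3 pipeline concrete, here is a minimal training sketch for the `HashingVectorizer + SGDClassifier` combination described above. The dataset format follows this README; the hyperparameters and the idea of bundling the vectorizer with the classifier into one pickle are illustrative, not necessarily what `train_model.py` does:

```python
# Minimal training sketch for the HashingVectorizer + SGDClassifier pipeline
# described above; hyperparameters are illustrative, not the repo's values.
# Run from the test/ directory so the dataset path resolves.
import json

import joblib
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

texts, labels = [], []
with open("cleaned_text_object_dataset.jsonl", encoding="utf-8") as f:
    for line in f:
        rec = json.loads(line)
        texts.append(rec["text"])    # e.g. "a small red apple"
        labels.append(rec["label"])  # e.g. "101"

vectorizer = HashingVectorizer(n_features=2**18)  # stateless, nothing to fit
X = vectorizer.transform(texts)

clf = SGDClassifier(loss="log_loss")  # log loss → predict_proba available
clf.fit(X, labels)

# Bundle vectorizer + classifier so inference only needs one file.
joblib.dump({"vectorizer": vectorizer, "classifier": clf}, "text_model.pkl")
```

---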
## 5. Unity Scripts

### 5.1 `HandFromVectors.cs` – UDP receiver + hand skeleton reconstruction

Responsibilities:

* Creates a `UdpClient` to listen on the given port (default 5065).
* Parses JSON from Python:
  * `wrist` pixel coordinates + normalized coords + `z_m` (depth)
  * 20 bone direction vectors (unit vectors)
  * `pinch` / `is_left` flags
* Uses:
  * camera intrinsics (via the Unity `Camera` projection), and
  * pre-configured bone lengths
  to reconstruct the 21 joint positions in Unity world space.
* Dynamically creates:
  * A sphere array `jointObjects` to visualize joints.
  * A `LineRenderer` array `boneLines` to draw bones.
* Exposes an API:
  * `TryGetJointPosition(handIndex, jointIndex, out Vector3 pos)`
  * `bool IsPinching(handIndex)`
  * `bool AnyHandPinching`

It also draws an in-scene GUI window that lets you:

* Adjust each bone's length.
* Toggle debug options.
* See how many hands are active and their pinch states.

---

### 5.2 `PinchGrabBall.cs` – grabbing objects

Attach this script to any GameObject and assign a `handTracker`; the object becomes pinch-grabbable:

* When not yet grabbed:
  * Iterates all hands and checks whether any pinching hand has its control joint (`controlJointIndex`, the index fingertip by default) within `grabDistance` of the object.
  * If so, treats that as "grabbed" and records:
    * which hand grabbed it: `grabbedHandIndex`
    * the initial offset from the follow joint to the object: `grabOffset`
* While grabbed:
  * If `usePhysics` is enabled: disables gravity on its `Rigidbody`, zeroes out the velocity, and drives it by position interpolation.
  * If not using physics: directly uses `Vector3.Lerp` to move `transform.position` toward the target, controlled by `followSmoothing`.
* `pinchReleaseGrace` protects against MediaPipe glitches:
  * Short pinch dropouts that immediately recover will not drop the object.
  * The object is released only if the pinch has been off for longer than the grace time.

The script has a static counter `grabbedCount`; other scripts (e.g. camera control) can query `AnyObjectGrabbed` to know if anything is currently being held.

---

### 5.3 `HandOrbitCamera.cs` – hand-driven camera

Attach this to the camera so that pinch gestures control the camera whenever nothing is grabbed:

* Chooses a joint (default: index fingertip) as the control point.
* Records the hand position and yaw/pitch at the moment the pinch starts.
* While the pinch is held:
  * Maps hand movement on screen to yaw / pitch.
  * Clamps pitch to avoid flipping behind the head.
  * Adjusts the radius `radius` based on zoom gestures or depth changes to push/pull the camera.

Final camera update:

```csharp
orbitCamera.transform.position = pivot + dir.normalized * radius;
orbitCamera.transform.LookAt(pivot, Vector3.up);
```

---

### 5.4 `ModelLibrary.cs` – prefab library

Design: attach this script to an empty GameObject and put all model prefabs as its children:

* In `Awake()`:
  * Collects all child `GameObject`s into a `Dictionary` keyed by their names.
  * Calls `SetActive(false)` on all children, treating them as templates.
* `ShowModelByLabel(string label)`:
  * Finds the template by label.
  * If there's already a displayed instance, hides or deactivates it.
  * Spawns the new object near `spawnAnchor.position + spawnOffset`.
  * Ensures it has `PinchGrabBall` attached and configures:
    * `handTracker`
    * `grabDistance`
    * `usePhysics`

With this, the label returned by the text model directly determines which prefab appears in the scene.

---

### 5.5 `RuntimeModelLoader.cs` – runtime GLB loading

Used to load models on the fly instead of baking all of them into the scene.

Main interfaces:

* `MakeFileName(int index)`:
  * The default is `101 → "101.glb"`, but you can replace this with a more complex mapping (e.g. via a table).
* `LoadByIndexAsync(int index)`:
  * Builds a path (usually under `Application.streamingAssetsPath`).
  * Uses `GltfImport` to load the GLB.
  * Instantiates it as a `GameObject`.
  * If `currentInstance` exists, destroys the old one first.
  * Parents the new object under the loader and zeroes `localPosition` / `localRotation`.
* `LoadByIndex(int index)`:
  * Convenience wrapper: `_ = LoadByIndexAsync(index);` (fire-and-forget, ignores the `await`).

If you have a batch of `.glb` files in StreamingAssets, you can directly map labels to filenames and truly load them on demand in Unity.

---

### 5.6 `TextQueryClient_TMP` – Unity text TCP client

Attach this to a UI GameObject; it uses a TMP input field + buttons to talk to the Python text service (the server side of this exchange is sketched at the end of this README).

Flow:

1. `OpenDialog()`:
   * `dialogPanel.SetActive(true)` and focus the input field.
2. Button click → `OnClickSend()`:
   * Reads the user text from `descriptionInput.text`.
   * Starts `SendQueryCoroutine(q)`.
3. Inside `SendQueryCoroutine`:
   * Uses a `TcpClient` to connect to `serverIp:serverPort` (default `127.0.0.1:9009`).
   * Sends `q + "\n"` in UTF-8.
   * Blocks until one line of response is read.
   * Parses the `label|prob` response and calls `ModelLibrary.ShowModelByLabel(label)`.
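On the other end of this TCP exchange sits `text_infer_server.py` (§4.3). For completeness, a minimal, hedged sketch of that line-oriented protocol using `socketserver`; the `infer_once(text) → (label, prob)` signature is an assumption based on this README, not a verified import from `run_model.py`:

```python
# Minimal sketch of the line-oriented "text in → label|prob out" TCP server
# described in section 4.3; infer_once below is a placeholder, not the real
# function from tools/run_model.py.
import socketserver


def infer_once(text: str):
    """Placeholder for run_model.py's inference; returns (label, prob)."""
    return "101", 0.93


class TextInferHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # One request = one line of text, one response = "label|prob\n".
        line = self.rfile.readline().decode("utf-8").strip()
        if not line:
            return
        label, prob = infer_once(line)
        self.wfile.write(f"{label}|{prob:.2f}\n".encode("utf-8"))


class ThreadedTCPServer(socketserver.ThreadingMixIn, socketserver.TCPServer):
    allow_reuse_address = True


if __name__ == "__main__":
    with ThreadedTCPServer(("127.0.0.1", 9009), TextInferHandler) as server:
        server.serve_forever()
```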