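Please, please give me the offer. I fell in love with your company at first sight; you are the lighthouse of my career, lighting the road ahead. Every day without an interview invitation feels like suffocating, and I spend my days scrolling job sites in a daze. You are my tomorrow; without your company my career will never bloom 😭😭😭
HR, I really do long for this job, and I really do love your company's culture. You are the bright moon of the hiring market and the blazing sun of the industry; every second you exist fills me with longing. I'm begging you, give me the offer. My head hurts on any day I don't look at your careers page, and every night I toss and turn in bed slapping myself to ease how badly I want this job. I want to join, I want to join, I want to join 😭😭😭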

Project Overview

Repository

https://github.com/10086mea/JpTypingApp

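Purpose

At heart this is a job-hunting project and an excuse to learn C++ coroutines, but the idea itself took me several days to come up with.

The goal is to complete and correct Japanese entered by typing or by voice, as an aid for conversation practice. It suits people who already have roughly N2-level intuition and vocabulary and want to keep improving through systematic speaking practice. The C++ GUI form exists only because I wanted to use coroutines (making dumplings just for the sake of the vinegar); otherwise I would have thrown it together in JS and put it on the web.

At the very least it should work for my own speaking practice and grammar study? I take the N1 in July and meet my professor in August, so I want to use this to sort out and correct what I have picked up by feel. It will certainly be slower than studying from the yellow-cover Standard Japanese (标准日本语) textbook, but once TTS and character art are hooked up it will be more fun than the textbook.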

Demo

Competitor Research

Grammarly

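Its response time is blazing fast. For NLP, calling an API cannot match the speed of a commercial product like this, which is to be expected, so the only option is to avoid competing head-on and focus on voice conversation instead.

Overall Architecture

The project is split into three processes, connected through pipe and socket communication.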

Gemini API

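Sends requests to the LLM and retrieves predicted completions and grammar corrections.

Speech Recognition

Uses OpenAI's Whisper V2 to recognize voice input.

C++ Core

Monitors the interval between keystrokes and the IME pre-edit (composition) state on the page, reads voice input over a pipe, and sends completion and grammar-correction requests to the Python process over a socket.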
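As a rough sketch of how this process could answer the C++ core over a socket: the snippet below assumes a newline-delimited JSON wire format and uses a stub, `fake_llm`, in place of the real Gemini request. The field names (`mode`, `text`, `result`) and all function names are hypothetical and are not taken from the repository.

```python
import json
import socket
import threading
from typing import Tuple


def fake_llm(mode: str, text: str) -> str:
    """Stand-in for the real Gemini request (hypothetical stub)."""
    if mode == "complete":
        return text + "…"
    return text.strip()


def handle_line(line: str) -> str:
    """Turn one JSON request line into one JSON response line."""
    request = json.loads(line)
    result = fake_llm(request["mode"], request["text"])
    return json.dumps({"mode": request["mode"], "result": result}, ensure_ascii=False)


def serve_once(server: socket.socket) -> None:
    """Accept a single client and answer each newline-terminated request."""
    conn, _ = server.accept()
    with conn, conn.makefile("r", encoding="utf-8") as reader, \
            conn.makefile("w", encoding="utf-8") as writer:
        for line in reader:
            writer.write(handle_line(line) + "\n")
            writer.flush()


def start_server(host: str = "127.0.0.1", port: int = 0) -> Tuple[socket.socket, int]:
    """Bind a listening socket (port 0 lets the OS pick) and serve in a thread."""
    server = socket.create_server((host, port))
    threading.Thread(target=serve_once, args=(server,), daemon=True).start()
    return server, server.getsockname()[1]
```

One request per line keeps framing trivial on the C++ side: write a line, then read until the next newline.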
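The pipe side of this can be sketched as follows, assuming the recognizer is a command-line child process that prints one recognized utterance per line on stdout. The real core is written in C++; Python is used here only for illustration, and `read_transcripts` and `demo_command` are hypothetical names.

```python
import subprocess
import sys
from typing import Iterator, List


def read_transcripts(command: List[str]) -> Iterator[str]:
    """Yield the non-empty lines a recognizer process writes to its stdout pipe."""
    with subprocess.Popen(
        command,
        stdout=subprocess.PIPE,
        text=True,
        encoding="utf-8",
    ) as process:
        assert process.stdout is not None
        for line in process.stdout:
            line = line.strip()
            if line:
                yield line


# A child Python process stands in for the recognizer executable here.
demo_command = [sys.executable, "-c", "print('hello'); print(); print('world')"]
```

Iterating over `read_transcripts(demo_command)` yields each printed line as soon as the child flushes it, which is the behaviour the core needs from a streaming recognizer.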
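The typing-interval check could look roughly like this. It is a minimal sketch assuming a simple idle threshold; the `InputState` fields, the function name, and the 0.6 second default are hypothetical, and the real logic lives in the C++ core.

```python
from dataclasses import dataclass


@dataclass
class InputState:
    last_key_time: float  # seconds, time of the most recent keystroke
    composing: bool       # True while the IME still shows uncommitted pre-edit text
    text: str             # committed text currently in the input box


def should_request_completion(state: InputState, now: float,
                              idle_threshold: float = 0.6) -> bool:
    """Request a completion only after a typing pause, never mid-composition."""
    if state.composing or not state.text:
        return False
    return (now - state.last_key_time) >= idle_threshold
```

Skipping requests while the IME is composing avoids sending half-converted kana to the LLM.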

Voice Input

https://learn.microsoft.com/en-us/answers/questions/4243698/english-speech-recognition-with-unsupported-langua

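Damn it, Microsoft: on Win10, when the system language and the input-method language do not match, the input method's voice typing cannot be enabled. The original answer is quoted below; this cannot even be changed through the registry, because it is hard-coded into the feature.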

I am sorry, there is no way to over-ride this rule, the Windows 10 UI Language must match the Speech Recognition language, there is not even a registry hack that can bypass this rule, it is hard-coded into that functionality.

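So for voice input the only option seemed to be a third-party replacement. To keep recognition quality on par with Microsoft's, I chose OpenAI Whisper as the recognition model.

A winding detour: trying to integrate OpenAI's Whisper V2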

https://github.com/gtppoplp/WhisperDesktop

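That fork turned out to be GUI-only, and we need a pure CLI version to make inter-process communication easy, so I went with the MicrophoneCS.cs example from the original project.

First, clone the project: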

git clone https://github.com/Const-me/Whisper/

Download the prebuilt Whisper.dll from the releases page and drop it into Whisper\x64\Release, which saves a lot of trouble.

dotnet build Examples\MicrophoneCS\MicrophoneCS.csproj -c Release -p:Platform=x64

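The command-line program builds, but running it crashed immediately with a null reference error.

The compiler had actually already warned about this problem at build time, so we go back, fix the source, and rebuild.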

D:\code\Tencent_Prepare\Whisper\WhisperNet\Internal\NativeLogger.cs(43,4): warning CS8602: Dereference of a possibly null reference. [D:\code\Tencent_Prepare\Whisper\WhisperNet\WhisperNet.csproj]
    2 Warning(s)
    0 Error(s)

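		static void logSink(IntPtr context, eLogLevel lvl, string message)
		{
			if (lvl == eLogLevel.Error)
			{
				// Added: create the state if it is still null before dereferencing it
				if (state == null)
					createState();

				// Braces keep setText inside the Error branch, behind the null guard
				state.setText(message);
			}
			logMessage?.Invoke(lvl, message);
		}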

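The call state.setText(message); caused the problem. After patching the source and rebuilding, I tried again.

At runtime the immediate crash became an error message instead: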

runFullImpl: failed to generate timestamp token - skipping one second

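After repeated testing and reading through the issues, the conclusion is that the original project has not been updated since 2023 and does not support the V3 OpenAI Whisper models. The problem is not with my command-line build: the GUI version recognizes nothing either.

So I was forced to downgrade the model to Whisper V2.

I tested a community Chinese fork that can load V3 models, but the results were poor, so in the end I still went with the downgrade.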

options:
  -h,       --help          [default] show this help message and exit
  -t N,     --threads N     [20     ] number of threads to use during computation
  -ot N,    --offset-t N    [0      ] time offset in milliseconds
  -on N,    --offset-n N    [0      ] segment index offset
  -d  N,    --duration N    [0      ] duration of audio to process in milliseconds
  -mc N,    --max-context N [-1     ] maximum number of text context tokens to store
  -ml N,    --max-len N     [0      ] maximum segment length in characters
  -wt N,    --word-thold N  [0.01   ] word timestamp probability threshold
  -su,      --speed-up      [False  ] speed up audio by x2 (reduced accuracy)
  -tr,      --translate     [False  ] translate from source language to english
  -di,      --diarize       [False  ] stereo audio diarization
  -otxt,    --output-txt    [False  ] output result in a text file
  -ps,      --print-special [False  ] print special tokens
  -nc,      --no-colors     [False  ] do not print colors
  -nt,      --no-timestamps [False  ] do not print timestamps
  -l LANG,  --language LANG [en     ] spoken language
            --prompt PROMPT [       ] initial prompt
  -m FNAME, --model FNAME   [       ] model path
  -f FNAME, --file FNAME    [       ] path of the input audio file

GPU inference does not implement the speed-up option; passing -su throws an exception:

GPU model doesn't implement the SpeedupAudio flag
System.NotImplementedException: The method or operation is not implemented.
   at System.Runtime.InteropServices.Marshal.ThrowExceptionForHR(Int32 errorCode)
   at Whisper.Internal.NativeLogger.throwException(Int32 hr)
   at Whisper.Internal.iContext_proxy.Whisper.Internal.iContext.runCapture(sFullParams& params, sCaptureCallbacks& callbacks, iAudioCapture reader)
   at Whisper.Context.runCapture(iAudioCapture capture, Callbacks callbacks, CaptureCallbacks captureCallbacks)
   at MicrophoneCS.CaptureThread.threadMain()
--- End of stack trace from previous location ---
   at MicrophoneCS.CaptureThread.join()
   at MicrophoneCS.Program.Main(String[] args)
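Since the recognizer is launched by another process, one way to avoid this crash is to filter the flag out before spawning it. The helper below is a minimal sketch, not part of the project: `sanitize_args` is a hypothetical name, and only the `-su` / `--speed-up` spellings come from the help text above.

```python
from typing import List

# Flags from the help text that the GPU model rejects at runtime.
UNSUPPORTED_ON_GPU = {"-su", "--speed-up"}


def sanitize_args(args: List[str]) -> List[str]:
    """Drop the speed-up flag so the GPU model does not throw NotImplementedException."""
    return [arg for arg in args if arg not in UNSUPPORTED_ON_GPU]
```

For example, `sanitize_args(["-l", "ja", "-su", "-m", "model.bin"])` keeps the language and model options and removes `-su` (the model file name is a placeholder).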

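A wild idea: Microsoft's speech recognition is not completely unusable

The speech recognition invoked with Win+H cannot use a recognition language different from the system language, thanks to a legacy-code mess.

But the separate Dictate feature in Word does work, because it is independent of the system recognizer: as long as you are online and have downloaded the Japanese speech pack, it can recognize Japanese.

While looking into hooking Word's dictation interface, I found that Microsoft exposes a recognition API to developers: Windows.Media.SpeechRecognition

I ran a simple test and found that it recognizes Japanese and responds far faster than Whisper, so I switched approaches without hesitation.

C# test code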

using System;
using System.Linq;
using System.Threading.Tasks;
using Windows.Globalization;
using Windows.Media.SpeechRecognition;

namespace WinSpeechTest
{
    class Program
    {
        static async Task Main(string[] args)
        {
            Console.WriteLine("Checking which speech recognition languages the system supports...");
            var supportedLanguages = SpeechRecognizer.SupportedTopicLanguages;

            // Compatibility lookup: prefer ja-JP, otherwise fall back to any ja tag
            var targetLang = supportedLanguages.FirstOrDefault(l => l.LanguageTag == "ja-JP")
                          ?? supportedLanguages.FirstOrDefault(l => l.LanguageTag.StartsWith("ja"));

            if (targetLang == null)
            {
                Console.ForegroundColor = ConsoleColor.Red;
                Console.WriteLine("\n[Error] The Japanese (ja-JP) speech pack is not installed! Download it in Windows Settings.");
                Console.ResetColor();
                return;
            }

            Console.WriteLine($"\nInitializing the engine with {targetLang.DisplayName}...");

            using var recognizer = new SpeechRecognizer(targetLang);

            // ==========================================
            // Added 1: listen for low-level engine state changes (very important)
            // ==========================================
            recognizer.StateChanged += (sender, args) =>
            {
                Console.ForegroundColor = ConsoleColor.DarkGray;
                Console.WriteLine($"\n[State change]: {args.State}");
                Console.ResetColor();
            };

            // ==========================================
            // Added 2: listen for the session being interrupted or ending
            // ==========================================
            recognizer.ContinuousRecognitionSession.Completed += (sender, args) =>
            {
                Console.ForegroundColor = ConsoleColor.Red;
                Console.WriteLine($"\n[Session ended]: status = {args.Status}");
                Console.ResetColor();
            };

            Console.WriteLine("Compiling cloud/local dictation constraints...");
            var compilationResult = await recognizer.CompileConstraintsAsync();
            if (compilationResult.Status != SpeechRecognitionResultStatus.Success)
            {
                Console.WriteLine($"Constraint compilation failed: {compilationResult.Status}");
                return;
            }

            recognizer.HypothesisGenerated += (sender, args) =>
            {
                Console.Write($"\r[Live dictation]: {args.Hypothesis.Text}                ");
            };

            recognizer.ContinuousRecognitionSession.ResultGenerated += (sender, args) =>
            {
                Console.ForegroundColor = ConsoleColor.Cyan;
                // Added: print the system's confidence in this utterance (High, Medium, Low)
                Console.WriteLine($"\n[Final]: {args.Result.Text} (confidence: {args.Result.Confidence})");
                Console.ResetColor();
            };

            Console.ForegroundColor = ConsoleColor.Green;
            Console.WriteLine("\n[Engine started] Speak into the microphone! (Press Ctrl+C to exit)");
            Console.ResetColor();

            try
            {
                await recognizer.ContinuousRecognitionSession.StartAsync();
            }
            catch (Exception ex)
            {
                Console.ForegroundColor = ConsoleColor.Red;
                Console.WriteLine($"\n[Startup exception]: low-level error - {ex.Message}");
                // Catch the most common permission-block error code
                if (ex.HResult == unchecked((int)0x8004503a) || ex.Message.Contains("Access is denied"))
                {
                    Console.WriteLine("--> Diagnosis: your Windows microphone privacy settings blocked access for the console app!");
                }
                Console.ResetColor();
            }

            // Keep the main thread from exiting
            await Task.Delay(-1);
        }
    }
}

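Live2D interaction ideas

Lip sync

Use the librosa library to load the audio and extract its waveform amplitude. By computing the energy (loudness) of the audio at each point in time, the mouth opening can be driven to follow the voice.

Reference project: https://github.com/human3daigc/textoon

Limbs

Just rotating through a sequence of idle animations should be enough; no need to make it complicated.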
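The energy-to-mouth mapping can be sketched in plain Python. This is a standard-library stand-in for illustration only: with librosa, `librosa.load` and `librosa.feature.rms` would replace the hand-written loop. The function names, the `floor` threshold, and the normalization by peak energy are all assumptions.

```python
import math
from typing import List, Sequence


def rms_envelope(samples: Sequence[float], frame_length: int, hop_length: int) -> List[float]:
    """Root-mean-square energy of each frame of a mono waveform."""
    envelope = []
    last_start = max(len(samples) - frame_length + 1, 1)
    for start in range(0, last_start, hop_length):
        frame = samples[start:start + frame_length]
        if frame:
            envelope.append(math.sqrt(sum(x * x for x in frame) / len(frame)))
        else:
            envelope.append(0.0)
    return envelope


def mouth_openness(envelope: Sequence[float], floor: float = 0.02) -> List[float]:
    """Map energy to a 0..1 mouth-open value, treating quiet frames as closed."""
    peak = max(envelope, default=0.0)
    if peak <= floor:
        return [0.0 for _ in envelope]
    return [0.0 if e <= floor else min(e / peak, 1.0) for e in envelope]
```

Each value in the result would then be fed to the model's mouth-open parameter once per frame, with the hop length chosen to match the animation frame rate.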

TTS

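GPT-SOVITS is ruled out first: if it cannot get pitch accent right, it is useless for learning Japanese.

After comparing Azure-AI-Speech with gemini-2.5-pro-preview-tts, I chose Google. Their own sales pitch is quoted below, and Google's voices really do live up to it.

Gemini's native audio generation text-to-speech (TTS) models differ from traditional TTS models in that they use a large language model that knows not only what to say, but also how to say it.

To unlock this capability, users can think of themselves as a director setting the scene for a virtual voice actor. To craft a good prompt, we recommend considering the following components: an audio profile that defines the character's core identity and archetype; a scene description that establishes the physical environment and emotional "vibe"; and director's notes that give more precise performance guidance on style, accent, and pacing.

By providing detailed instructions, such as a precise regional accent, specific paralinguistic features (for example, breathiness), or speaking rate, users can draw on the model's context awareness to generate highly dynamic, natural, and expressive audio performances. For best performance, we recommend keeping the transcript and the director's prompt consistent, so that "who is speaking" matches "what is said" and "how it is said".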
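The three components described above can be assembled into a single prompt string before it is handed to the TTS model. This is only a sketch of the prompt layout: the section labels, the `TtsDirection` type, and `build_tts_prompt` are hypothetical, and no API call is shown.

```python
from dataclasses import dataclass


@dataclass
class TtsDirection:
    audio_profile: str    # who is speaking: core identity and archetype
    scene: str            # physical setting and emotional mood
    directors_notes: str  # style, accent, pacing


def build_tts_prompt(direction: TtsDirection, transcript: str) -> str:
    """Join the three direction components and the transcript into one prompt."""
    sections = [
        ("Audio profile", direction.audio_profile),
        ("Scene", direction.scene),
        ("Director's notes", direction.directors_notes),
        ("Transcript", transcript),
    ]
    return "\n".join(f"{title}: {body.strip()}" for title, body in sections)
```

Keeping the direction in a separate structure makes it easy to reuse one character profile across many transcripts, which helps keep "who is speaking" consistent with "what is said".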