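1 ML.NET version update
The current Microsoft.ML package versions are as follows:
The samples at https://gitee.com/mirrors_feiyun0112/machinelearning-samples.zh-cn use version 1.6.0.
To switch the sample projects to a different version:
1 Directory.Build.props and nuget.config
Edit the Directory.Build.props file in the samples directory.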
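2 Open the samples\csharp\All-Samples.sln solution
Visual Studio then loads the new version of the Microsoft.ML library.
In the earlier projects, the reference to the ML.NET library looks similar to the following: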
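As a rough sketch of the mechanism, assuming the samples centralize the package version in an MSBuild property (the property name `MicrosoftMLVersion` is an assumption; check the actual Directory.Build.props), the version is changed in one place:

```xml
<!-- samples/Directory.Build.props (sketch; the property name is an assumption) -->
<Project>
  <PropertyGroup>
    <!-- previously 1.6.0 -->
    <MicrosoftMLVersion>2.0.1</MicrosoftMLVersion>
  </PropertyGroup>
</Project>
```

Each project file would then pick the value up through the property rather than a hard-coded number:

```xml
<!-- inside an individual .csproj (sketch) -->
<ItemGroup>
  <PackageReference Include="Microsoft.ML" Version="$(MicrosoftMLVersion)" />
</ItemGroup>
```

MSBuild imports Directory.Build.props automatically for every project below that directory, which is why reopening the solution is enough for the new version to be restored.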
2 Updating the samples to ML.NET 2.0.1
3 Sentiment analysis sample [SentimentAnalysis]
The settings of the SentimentAnalysisConsoleApp.csproj project are changed to:
6 AutoML
- AutoMLQuickStart - C# console application that shows how to get started with the AutoML API.
- AutoMLAdvanced - C# console application that shows the following concepts:
  - Modifying column inference results
  - Excluding trainers
  - Configuring monitoring
  - Choosing tuners
  - Cancelling experiments
- AutoMLEstimators - C# console application that shows how to:
  - Customize search spaces
  - Create sweepable estimators
- AutoMLTrialRunner - C# console application that shows how to create your own trial runner for the Text Classification API.
7 Natural Language Processing (NLP)
- TextClassification - C# console app that shows how to use the Text Classification API for inferencing using code generated by Model Builder. The model is trained using Model Builder.
- TextClassification_Sentiment_Razor - ASP.NET Core Razor Pages application for sentiment analysis. Code sample for the "Analyze sentiment of website comments in a web application using ML.NET Model Builder" tutorial. The model is trained using Model Builder.
- SentenceSimilarity - C# console app that shows how to use the Sentence Similarity API. Like the Text Classification API, the Sentence Similarity API uses a NAS-BERT transformer-based deep learning model built with TorchSharp to compare how similar two pieces of text are.
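8 Sample walkthrough
Data source: download from this address
9 Sentence similarity: SentenceSimilarity
【The test machine has no CUDA environment, so training runs on the CPU】
train.csv - the training set, containing products, search terms, and relevance scores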
id | product_uid | product_title | search_term | relevance |
---|---|---|---|---|
2 | 100001 | Simpson Strong-Tie 12-Gauge Angle | angle bracket | 3 |
3 | 100001 | Simpson Strong-Tie 12-Gauge Angle | l bracket | 2.5 |
9 | 100002 | BEHR Premium Textured DeckOver 1-gal. #SC-141 Tugboat Wood and Concrete Coating | deck over | 3 |
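The code repository does not include home-depot-sentence-similarity.csv. For the relationship between the original train.csv and home-depot-sentence-similarity.csv, and for how to download and generate the file, see:
https://github.com/dotnet/machinelearning-samples/issues/982 【I wrote a preprocessing step that merges the CSV files into the format the code expects, as follows: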
    using System;
    using System.Collections.Generic;
    using System.IO;
    using Microsoft.ML;
    using Microsoft.ML.Data;
    using Microsoft.ML.Transforms;

    namespace SentenceSimilarity
    {
        internal class GenData
        {
            // Columns of train.csv:
            // id  product_uid  product_title                      search_term    relevance
            // 2   100001       Simpson Strong-Tie 12-Gauge Angle  angle bracket  3
            public class HomeDepot
            {
                [LoadColumn(0)]
                public int id { get; set; }
                [LoadColumn(1)]
                public int product_uid { get; set; }
                [LoadColumn(2)]
                public string product_title { get; set; }
                [LoadColumn(3)]
                public string search_term { get; set; }
                [LoadColumn(4)]
                public string relevance { get; set; }
            }

            // https://learn.microsoft.com/en-us/dotnet/api/microsoft.ml.custommappingcatalog.custommapping?view=ml-dotnet
            [CustomMappingFactoryAttribute("product_description")]
            private class ProdDescCustomAction : CustomMappingFactory<HomeDepot, CustomMappingOutput>
            {
                // Custom mapping applied to every row: look up the description
                // for the row's product_uid in the prodDesc dictionary.
                public static void CustomAction(HomeDepot input, CustomMappingOutput output)
                    => output.product_description = prodDesc[input.product_uid.ToString()];

                public override Action<HomeDepot, CustomMappingOutput> GetMapping()
                    => CustomAction;
            }

            // Defines only the column generated by the custom mapping,
            // in addition to the columns already present.
            private class CustomMappingOutput
            {
                public string product_description { get; set; }
            }

            // product_uid -> product_description, filled from product_descriptions.csv.
            static Dictionary<string, string> prodDesc = new Dictionary<string, string>();

            static void Main(string[] args)
            {
                var mlContext = new MLContext(seed: 1);

                // Step 1: read all product descriptions into the lookup dictionary.
                var DataPath = Path.GetFullPath(@"..\..\..\..\Data\product_descriptions.csv");
                {
                    IDataView dv = mlContext.Data.LoadFromTextFile(DataPath, hasHeader: true, separatorChar: ',', allowQuoting: true,
                        columns: new[] {
                            new TextLoader.Column("product_uid", DataKind.String, 0),
                            new TextLoader.Column("product_description", DataKind.String, 1)
                        }
                    );
                    foreach (var row in dv.Preview(maxRows: 150_000).RowView)
                    {
                        string uid = "", desc = "";
                        foreach (KeyValuePair<string, object> column in row.Values)
                        {
                            if (column.Key == "product_uid")
                            {
                                uid = column.Value.ToString();
                            }
                            else
                            {
                                desc = column.Value.ToString();
                            }
                        }
                        prodDesc[uid] = desc;
                    }
                }

                // Step 2: load train.csv. The load options are assumed to mirror
                // the ones used for product_descriptions.csv above.
                DataPath = Path.GetFullPath(@"..\..\..\..\Data\train.csv");
                IDataView dataView = mlContext.Data.LoadFromTextFile<HomeDepot>(
                    DataPath, hasHeader: true, separatorChar: ',', allowQuoting: true);

                // Print the first few rows as a sanity check.
                var preViewTransformedData = dataView.Preview(maxRows: 5);
                foreach (var row in preViewTransformedData.RowView)
                {
                    var ColumnCollection = row.Values;
                    string lineToPrint = "Row--> ";
                    foreach (KeyValuePair<string, object> column in ColumnCollection)
                    {
                        lineToPrint += $"| {column.Key}:{column.Value}";
                    }
                    Console.WriteLine(lineToPrint + "\n");
                }

                // Step 3: append the product_description column to every row.
                var pipeline = mlContext.Transforms.CustomMapping(new ProdDescCustomAction().GetMapping(), contractName: "product_description");
                var transformedData = pipeline.Fit(dataView).Transform(dataView);
                // Only needed when a saved model is loaded from another assembly:
                //mlContext.ComponentCatalog.RegisterAssembly(typeof(ProdDescCustomAction).Assembly);

                // Step 4: write the merged data as a comma-separated file.
                Console.WriteLine("save file");
                using FileStream fs = new FileStream(Path.GetFullPath(@"..\..\..\..\Data\home-depot-sentence-similarity.csv"), FileMode.Create);
                mlContext.Data.SaveAsText(transformedData, fs, schema: false, separatorChar: ',');
            }
        }
    }
For the full file, see https://gitee.com/iamops/x-unix-dotnet/blob/main/ml.net2/SentenceSimilarity/GenData.cs
】
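Stripped of the ML.NET plumbing, the merge is a lookup join on `product_uid`. The sketch below shows only that idea on in-memory rows; `JoinSketch`, `TrainRow`, and `JoinedRow` are hypothetical names, the description text is a made-up placeholder, and unlike the listing above it falls back to an empty description when a product has no entry.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class JoinSketch
{
    // One row of train.csv (relevance kept as text, as in the listing above).
    public record TrainRow(int Id, int ProductUid, string ProductTitle, string SearchTerm, string Relevance);

    // A train row plus the description looked up for its product.
    public record JoinedRow(TrainRow Row, string ProductDescription);

    // Attach the description for each row's product_uid.
    // Rows whose product has no description get an empty string.
    public static List<JoinedRow> Join(IEnumerable<TrainRow> rows, IReadOnlyDictionary<int, string> descriptions) =>
        rows.Select(r => new JoinedRow(r, descriptions.TryGetValue(r.ProductUid, out var d) ? d : string.Empty))
            .ToList();

    public static void Main()
    {
        // Placeholder description, not taken from the real data set.
        var descriptions = new Dictionary<int, string> { [100001] = "placeholder description" };
        var rows = new[]
        {
            new TrainRow(2, 100001, "Simpson Strong-Tie 12-Gauge Angle", "angle bracket", "3"),
            new TrainRow(3, 100001, "Simpson Strong-Tie 12-Gauge Angle", "l bracket", "2.5"),
        };
        foreach (var j in Join(rows, descriptions))
            Console.WriteLine($"{j.Row.Id},{j.Row.SearchTerm},{j.Row.Relevance},{j.ProductDescription}");
    }
}
```

Using `TryGetValue` instead of the indexer avoids a `KeyNotFoundException` when train.csv references a product that is missing from product_descriptions.csv.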
Once the data is in place, running the sample downloads the model file, with output similar to the following:
[Source=NasBertTrainer; TrainModel, Kind=Trace] Channel started
[Source=NasBertTrainer; Ensuring model file is present., Kind=Trace] Channel started
[Source=NasBertTrainer; Ensuring model file is present., Kind=Info] Downloading NasBert2000000.tsm from https://aka.ms/mlnet-resources/models/NasBert2000000.tsm to C:\Users\homelap\AppData\Local\Temp\mlnet\NasBert2000000.tsm
[Source=NasBertTrainer; Ensuring model file is present., Kind=Info] NasBert2000000.tsm: Downloaded 3620 bytes out of 17907563
...
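TorchSharp has no official 1.0 release yet, and the sample has many problems at run time. With the data files in place as described above, running it directly fails with: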
Fatal error. System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
Repeat 2 times:
at TorchSharp.torch+random.THSGenerator_manual_seed(Int64)
at TorchSharp.torch+random.manual_seed(Int64) ...
Setting the versions according to https://github.com/dotnet/machinelearning/issues/6669 did not fix it either:
ML.NET Version | TorchSharp Package Version |
---|---|
2.0.0 | 0.98.1 |
2.0.1 | 0.98.1 |
3.0.0-preview | 0.98.3 |
Next preview | 0.99.5 |
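
For reference, the ML.NET 2.0.1 row of the table would be pinned in the project file roughly as follows. This is only a sketch: the `Microsoft.ML.TorchSharp` version shown is an assumption (check NuGet for the release that matches your Microsoft.ML version), and `TorchSharp-cpu` is chosen because the test machine has no CUDA.

```xml
<ItemGroup>
  <PackageReference Include="Microsoft.ML" Version="2.0.1" />
  <!-- Version below is an assumption; verify it against NuGet. -->
  <PackageReference Include="Microsoft.ML.TorchSharp" Version="0.20.1" />
  <!-- CPU-only native bundle; version taken from the table above. -->
  <PackageReference Include="TorchSharp-cpu" Version="0.98.1" />
</ItemGroup>
```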
Even with the versions set as above, the run still crashes, so the next step is to look into the TorchSharp source code.
10 TorchSharp
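Preliminary library structure
As shown above, within the overall Microsoft.ML framework 【Microsoft.ML.Core.dll Microsoft.ML.dll Microsoft.ML.PCA.dll Microsoft.ML.Transforms.dll Microsoft.ML.Data.dll Microsoft.ML.KMeansClustering.dll Microsoft.ML.StandardTrainers.dll】,
a Microsoft.ML.TorchSharp layer 【Microsoft.ML.TorchSharp.dll】 was added for Torch, following the structure of the ML framework. An overview of this extension follows:
Separate libraries are provided for the CPU and GPU scenarios.
TorchSharp, TorchAudio, and TorchVision are C# libraries covering different domains. They wrap the native PyTorch library (libtorch) in an abstraction layer and serve Microsoft.ML.TorchSharp. The project also provides the LibTorchSharp library, written in C++, for TorchSharp/TorchAudio/TorchVision to call.
The C# code calls LibTorchSharp through P/Invoke, and LibTorchSharp in turn calls the native PyTorch library.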
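The managed-to-native hop can be sketched as follows. The entry point name `THSGenerator_manual_seed` and the library name `LibTorchSharp` come from the stack trace and the project layout described here; the exact signature (an `IntPtr` return value) is an assumption, and `NativeSeedProbe` is a hypothetical helper that is not part of TorchSharp. The probe only reports whether the runtime can find and bind the native entry point, which is a reasonable first check when a call dies inside native code.

```csharp
using System;
using System.Runtime.InteropServices;

public static class NativeSeedProbe
{
    // Assumed signature; the real declaration lives inside TorchSharp.
    // The runtime resolves "LibTorchSharp" to LibTorchSharp.dll / libLibTorchSharp.so.
    [DllImport("LibTorchSharp")]
    private static extern IntPtr THSGenerator_manual_seed(long seed);

    // Tries the native call and reports how far the binding got.
    // The returned native handle is deliberately ignored in this sketch.
    public static string Probe(long seed)
    {
        try
        {
            THSGenerator_manual_seed(seed);
            return "ok";
        }
        catch (DllNotFoundException) { return "native library not found"; }
        catch (EntryPointNotFoundException) { return "entry point not found"; }
        catch (BadImageFormatException) { return "architecture mismatch"; }
    }

    public static void Main() => Console.WriteLine(Probe(1));
}
```

Binding failures surface as ordinary .NET exceptions, but a mismatch between LibTorchSharp and the libtorch binaries it loads happens past this boundary, so it shows up as an `AccessViolationException` that terminates the process instead of something a `catch` block can handle.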
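The following is the TorchSharp project configuration for wrapping the native code and packaging it:
Reference links for the related projects: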
https://github.com/dotnet/TorchSharp
https://github.com/dotnet/TorchSharpExamples
https://www.nuget.org/packages/TorchSharp/
As we build up to a v1.0 release, we will continue to make breaking changes, but only when we consider it necessary for usability. Similarity to the PyTorch experience is a primary design tenet, and we will continue on that path.
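As the statement above shows, TorchSharp has not reached 1.0, so its interfaces change quickly. In addition, not every version published on NuGet has a matching branch or tag in the official repository, so compatibility problems are significant.
Even where a tag exists, for example https://github.com/dotnet/TorchSharpExamples/releases/tag/v0.95.4, the official examples fail to run 【
in TorchSharpExamples-0.95.4.tar\TorchSharpExamples-0.95.4\src\CSharp\CSharpExamples,
random.manual_seed(1); fails immediately with a memory access exception,
whereas the main branch with version 0.100.5 runs normally
】
Back to the current sample: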
TorchSharp.dll!TorchSharp.torch.random.manual_seed(long seed) Line 21623
at TorchSharp\torch.cs(21623)
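Version 0.98.1 has no branch or tag in the https://github.com/dotnet/TorchSharp repository, which makes the code hard to trace. It is unclear which source the version published at https://www.nuget.org/packages/TorchSharp-cpu was built from.
Version 0.99.3 (https://github.com/dotnet/TorchSharp/releases/tag/v0.99.3) is outright incompatible with Microsoft.ML.TorchSharp: some implementations are missing at run time, probably because the two projects iterate at very different paces.
Version 0.98.2 does have a branch, so that is the one analyzed next.
11 Debugging TorchSharp 0.98.2
Check out the https://github.com/dotnet/TorchSharp/tree/0.98.2 branch and, following its DEVGUIDE.md, build it with devenv: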
msbuild TorchSharp.sln
msbuild TorchSharp.sln /p:Configuration=Release /p:Platform=x64
libtorch versions correspond to PyTorch versions; for example, libtorch 1.6.0 corresponds to PyTorch 1.6.0.
- Debug build
https://download.pytorch.org/libtorch/cpu/
https://download.pytorch.org/libtorch/cu113
https://download.pytorch.org/libtorch/cu113/libtorch-win-shared-with-deps-debug-1.11.0%2Bcu113.zip
https://download.pytorch.org/libtorch/cpu/libtorch-win-shared-with-deps-debug-1.11.0%2Bcpu.zip
The build downloads these prebuilt archives automatically:
libtorch-cpu\libtorch-win-shared-with-deps-debug-1.11.0%2Bcpu.zip 650M
libtorch-cuda-11.3\libtorch-win-shared-with-deps-debug-1.11.0%2Bcu113.zip 2.7G
The debug archives are very large; the release archives are smaller:
libtorch-cpu\libtorch-win-shared-with-deps-1.11.0%2Bcpu.zip 143M
libtorch-cuda-11.3\libtorch-win-shared-with-deps-1.11.0%2Bcu113.zip 2G
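Once everything is prepared, debugging the failing function directly in Visual Studio works without any problem.
Running the sample code against this build also works.
- Release build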
https://download.pytorch.org/libtorch/cu113/libtorch-win-shared-with-deps-1.11.0%2Bcu113.zip
https://download.pytorch.org/libtorch/cpu/libtorch-win-shared-with-deps-1.11.0%2Bcpu.zip
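12 Getting the sample to run
After some experimentation, it turns out that only the following libtorch libraries need to be replaced:
The files shipped with the sample by default do not work; after replacing them, the result is as follows:
My initial guess is that the 0.98.2 release published on NuGet is inconsistent somewhere.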
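The replacement step can be scripted. The sketch below assumes the fix amounts to overwriting the native DLLs in the sample's build output with those from the `lib` folder of the downloaded libtorch archive; `LibtorchSwap` is a hypothetical helper, and because the exact set of libraries is not reproduced here it copies every `*.dll`, so narrow the pattern as needed.

```csharp
using System;
using System.IO;

public static class LibtorchSwap
{
    // Copies every *.dll from the libtorch lib folder into the build output,
    // overwriting files with the same name. Returns the number of files copied.
    public static int CopyNativeLibraries(string libtorchLibDir, string outputDir)
    {
        Directory.CreateDirectory(outputDir);
        int copied = 0;
        foreach (var src in Directory.EnumerateFiles(libtorchLibDir, "*.dll"))
        {
            var dest = Path.Combine(outputDir, Path.GetFileName(src));
            File.Copy(src, dest, overwrite: true);
            copied++;
        }
        return copied;
    }

    public static void Main(string[] args)
    {
        if (args.Length != 2)
        {
            Console.Error.WriteLine("usage: LibtorchSwap <libtorch lib dir> <build output dir>");
            return;
        }
        Console.WriteLine($"copied {CopyNativeLibraries(args[0], args[1])} file(s)");
    }
}
```

Both paths are supplied by the caller; nothing in the sketch depends on a particular libtorch version or project layout.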
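【For the full project, see
https://gitee.com/iamops/x-unix-dotnet/blob/main/ml.net2/SentenceSimilarity/SentenceSimilarity.csproj 】
Run output
13 Summary
Training with Torch on the CPU is indeed very slow.
The official SentenceSimilarity sample mixes C# with the native PyTorch library, and because versions on the Torch side are unstable, it is troublesome to use. The specific causes are analyzed above.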