Reading the PyTorch Source Code: A Mental Journey

1. Origin

I once hit a case where the prelu results our in-house inference engine computed for someone's model disagreed with the original framework, PyTorch. In theory everyone implements the same algorithm, but the parameters had been adjusted during model conversion. To determine whether the problem lay in the initial parameter hand-off, in further changes made as the parameters were passed along, or in some subtle difference in the final algorithm implementation, I wanted to see exactly what PyTorch does from end to end. But I kept losing the trail while following the code, and that is when it sank in that PyTorch is a genuinely complex project. Still, as A Bite of China puts it, the harsher the environment, the richer the reward. To make future expeditions easier, I decided to take PReLU as an example and statically map out PyTorch's code structure. After days of tinkering, I also came away with a new understanding of how to build a Python package that carries C/C++ code, which counts as an unexpected bonus.

2. The Journey

First, PReLU's import path tells us it should live under torch/nn. We don't see it directly at that level, but the __init__.py under that path reveals that it actually lives in torch/nn/modules/activation.py. The PReLU class ultimately calls the prelu function imported from torch.nn.functional. Following the vine, we find prelu, which looks like this:

```python
def prelu(input, weight):
    # type: (Tensor, Tensor) -> Tensor
    if not torch.jit.is_scripting():
        if type(input) is not Tensor and has_torch_function((input,)):
            return handle_torch_function(prelu, (input,), input, weight)
    return torch.prelu(input, weight)
```

Run this code through your head and you will find that the first if condition is satisfied while the second is not. So to see the actual algorithm, we have to look at torch.prelu(). Fine, onward…
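Before going further, it is worth pinning down what prelu is supposed to compute, since a reference implementation is exactly what a numerics-mismatch investigation needs. Below is a minimal sketch of the elementwise rule (my own reference code for comparison, not PyTorch's kernel):

```python
# PReLU rule: out = x where x >= 0, out = weight * x elsewhere,
# with one learnable slope per channel (broadcast over dim 1).
import torch
import torch.nn.functional as F

x = torch.randn(2, 3, 4, 4)            # NCHW input
w = torch.tensor([0.1, 0.2, 0.3])      # per-channel slopes

ref = torch.where(x >= 0, x, w.view(1, -1, 1, 1) * x)
print(torch.allclose(ref, F.prelu(x, w)))  # expect True
```

Checking each stage of a conversion pipeline against a reference like this quickly narrows down whether the slopes were mangled in transit or the formula itself differs.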
After a round of searching you will find that nowhere in the Python code under the torch package is prelu defined. But just as despair sets in, we spot the following lines in the torch package's __init__.py:

```python
# pytorch/torch/__init__.py
# (unrelated code omitted for brevity; see pytorch/torch/__init__.py for the full file)
try:
    # _initExtension is chosen (arbitrarily) as a sentinel.
    from torch._C import _initExtension
except ImportError:
    ...

__all__ += [name for name in dir(_C)
            if name[0] != '_' and
            not name.endswith('Base')]

if TYPE_CHECKING:
    # Some type signatures pulled in from _VariableFunctions here clash with
    # signatures already imported. For now these clashes are ignored; see
    # PR #43339 for details.
    from torch._C._VariableFunctions import *  # type: ignore

for name in dir(_C._VariableFunctions):
    if name.startswith('__'):
        continue
    globals()[name] = getattr(_C._VariableFunctions, name)
    __all__.append(name)
```

This is the whole village's last hope. We know that the names in __all__ are the APIs a module intentionally exposes. What does that mean here? Even though no prelu definition is visible in plain sight anymore, these lines show that a whole crowd of unidentified APIs is being quietly smuggled in, and our long-sought prelu may well be among them.

How do we confirm such a guess from so faint a clue? This is where a key piece of Python knowledge comes in: C/C++ extensions. (See 《使⽤C语⾔编写Python模块-引⼦》 and 《Python调⽤C++之PYBIND11简介》 for more.) Python C/C++ extensions follow a fixed layout: once we find the module's initialization entry point, we can follow the vine to every function the module exposes to the Python interpreter. In Python 3 the initialization function is named PyInit_<name>, where <name> is the module's name. For example, the from torch._C import … above implies that the torch package must have a submodule named _C, so its initialization function should be PyInit__C, and searching for that name leads us to the module entry. Alternatively, we can read the extension's description in setup.py:

```python
# setup.py
main_sources = ["torch/csrc/stub.c"]
C = Extension("torch._C",
              libraries=main_libraries,
              sources=main_sources,
              language='c',
              extra_compile_args=main_compile_args + extra_compile_args,
              include_dirs=[],
              library_dirs=library_dirs,
              extra_link_args=extra_link_args + main_link_args + make_relative_rpath_args('lib'))
extensions.append(C)
```

Whether by searching or by reading setup.py, we land on the module initialization function PyInit__C(void) in pytorch/torch/csrc/stub.c, and by tracking the initModule() it calls we can see exactly which APIs are exposed to the Python interpreter:

```c
// torch/csrc/stub.c
PyMODINIT_FUNC PyInit__C(void)
{
  return initModule();
}
// initModule() lives in torch/csrc/Module.cpp
```

Dig through initModule() for a while, though, and you will find that the _C module still has no Python interface for prelu. What now? No panic: from the earlier analysis of torch/__init__.py we know there is one hope left, the _VariableFunctions submodule under _C. This truly is the last chance; there is no way forward but to grind through it. After an epic, arduous search, we find traces of _VariableFunctions inside initModule()'s call chain: initModule() -> THPVariable_initModule(module) -> torch::autograd::initTorchFunctions(module). Aha, simple!

```cpp
void initTorchFunctions(PyObject* module) {
  if (PyType_Ready(&THPVariableFunctions) < 0) {
    throw python_error();
  }
  Py_INCREF(&THPVariableFunctions);
  // Steals
  Py_INCREF(&THPVariableFunctions);
  if (PyModule_AddObject(module, "_VariableFunctionsClass",
                         reinterpret_cast<PyObject*>(&THPVariableFunctions)) < 0) {
    throw python_error();
  }
  // PyType_GenericNew returns a new reference
  THPVariableFunctionsModule = PyType_GenericNew(&THPVariableFunctions, Py_None, Py_None);
  // PyModule_AddObject steals a reference
  if (PyModule_AddObject(module, "_VariableFunctions", THPVariableFunctionsModule) < 0) {
    throw python_error();
  }
}
```
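At this point a quick runtime check is reassuring. The sketch below (assuming an installed PyTorch build; the exact module layout can vary slightly across versions) confirms that _C is a compiled extension and that the re-export loop in torch/__init__.py really is where torch.prelu comes from:

```python
import torch

print(torch._C.__file__)  # a compiled extension, e.g. .../torch/_C.cpython-*.so
print(hasattr(torch._C._VariableFunctions, "prelu"))     # expect True
# The loop globals()[name] = getattr(_C._VariableFunctions, name) implies:
print(torch.prelu is torch._C._VariableFunctions.prelu)  # expect True
```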
But!! Don't celebrate too early! Inspect the interfaces the _VariableFunctions module exposes and you will find that what we want simply isn't there, as the code below shows:

```cpp
static PyMethodDef torch_functions[] = {
  {"arange", castPyCFunctionWithKeywords(THPVariable_arange), METH_VARARGS | METH_KEYWORDS | METH_STATIC, NULL},
  {"as_tensor", castPyCFunctionWithKeywords(THPVariable_as_tensor), METH_VARARGS | METH_KEYWORDS | METH_STATIC, NULL},
  {"dsmm", castPyCFunctionWithKeywords(THPVariable_mm), METH_VARARGS | METH_KEYWORDS | METH_STATIC, NULL},
  {"from_numpy", THPVariable_from_numpy, METH_STATIC | METH_O, NULL},
  {"full", castPyCFunctionWithKeywords(THPVariable_full), METH_VARARGS | METH_KEYWORDS | METH_STATIC, NULL},
  {"hsmm", castPyCFunctionWithKeywords(THPVariable_hspmm), METH_VARARGS | METH_KEYWORDS | METH_STATIC, NULL},
  {"nonzero", castPyCFunctionWithKeywords(THPVariable_nonzero), METH_VARARGS | METH_KEYWORDS | METH_STATIC, NULL},
  {"randint", castPyCFunctionWithKeywords(THPVariable_randint), METH_VARARGS | METH_KEYWORDS | METH_STATIC, NULL},
  {"range", castPyCFunctionWithKeywords(THPVariable_range), METH_VARARGS | METH_KEYWORDS | METH_STATIC, NULL},
  {"saddmm", castPyCFunctionWithKeywords(THPVariable_sspaddmm), METH_VARARGS | METH_KEYWORDS | METH_STATIC, NULL},
  {"sparse_coo_tensor", castPyCFunctionWithKeywords(THPVariable_sparse_coo_tensor), METH_VARARGS | METH_KEYWORDS | METH_STATIC, NULL},
  {"_sparse_coo_tensor_unsafe", castPyCFunctionWithKeywords(THPVariable__sparse_coo_tensor_unsafe), METH_VARARGS | METH_KEYWORDS | METH_STATIC, NULL},
  {"_validate_sparse_coo_tensor_args", castPyCFunctionWithKeywords(THPVariable__validate_sparse_coo_tensor_args), METH_VARARGS | METH_KEYWORDS | METH_STATIC, NULL},
  {"spmm", castPyCFunctionWithKeywords(THPVariable_mm), METH_VARARGS | METH_KEYWORDS | METH_STATIC, NULL},
  {"tensor", castPyCFunctionWithKeywords(THPVariable_tensor), METH_VARARGS | METH_KEYWORDS | METH_STATIC, NULL},
  {"get_device", castPyCFunctionWithKeywords(THPVariable_get_device), METH_VARARGS | METH_KEYWORDS | METH_STATIC, NULL},
  {"numel", castPyCFunctionWithKeywords(THPVariable_numel), METH_VARARGS | METH_KEYWORDS | METH_STATIC, NULL},
  ${py_method_defs}
  {NULL}
};
```

There is no sign of prelu anywhere in the code above. Could prelu somehow bypass the C/C++ extension mechanism and reach Python directly, which is why it doesn't appear here? No. Since ancient times there has been only one road up Mount Hua; programs don't play by unwritten rules. So the trail has gone cold, the author must be using black magic, and we Muggles are out of tricks. This article may as well end here…

Wait. Something strange has snuck into the C code above: ${py_method_defs}. C/C++ has no such syntax; it is the kind of thing you see in shell-like scripting languages. A new language feature? A laborious search turns up no such construct in C/C++. And since it isn't legitimate syntax, leaving it inside C/C++ code would certainly break compilation, yet there it sits. So there is only one truth: it is a placeholder, and real code must replace it later!
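To make the placeholder mechanics concrete, here is a toy illustration using string.Template from the Python standard library (PyTorch's codegen uses its own template machinery; the THPVariable_prelu entry below is only what such a generated line might plausibly look like, not a verbatim quote):

```python
from string import Template

cpp_template = Template("""static PyMethodDef torch_functions[] = {
  ${py_method_defs}
  {NULL}
};""")

# Hypothetical generated entries; the real list is derived from native_functions.yaml.
py_method_defs = [
    '{"prelu", castPyCFunctionWithKeywords(THPVariable_prelu), '
    'METH_VARARGS | METH_KEYWORDS | METH_STATIC, NULL},',
]
print(cpp_template.substitute(py_method_defs="\n  ".join(py_method_defs)))
```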
What next? Search! Using py_method_defs as the keyword for a global search, we eventually find that there is indeed a Python script that substitutes this placeholder, and the result of the substitution is that our long-sought prelu finally shows up in the _VariableFunctions module. Good: case closed.

But as in police work, a single piece of evidence only persuades once it joins a complete chain of evidence. The search tells us prelu will appear in _VariableFunctions, but how it gets there is still hazy: when is the placeholder replaced, and by whose invocation of which script? As it turns out, all of it can be traced, and the trail is once again in setup.py. In setup.py's main function, just before setup() is called, there is a call to a function named build_deps(), which ultimately invokes the platform's CMake to build according to the CMakeLists.txt in the repository root. The root CMakeLists.txt eventually descends into the caffe2 directory (add_subdirectory(caffe2)), and caffe2/CMakeLists.txt is what calls the Python script that performs code generation, as shown below.

How the code-generation script gets kicked off:

```cmake
# caffe2/CMakeLists.txt
add_custom_command(
  OUTPUT ${TORCH_GENERATED_CODE}
  COMMAND
  "${PYTHON_EXECUTABLE}" tools/setup_helpers/generate_code.py
    --declarations-path "${CMAKE_BINARY_DIR}/aten/src/ATen/Declarations.yaml"
    --native-functions-path "aten/src/ATen/native/native_functions.yaml"
    --nn-path "aten/src"
    $<$<BOOL:${INTERN_DISABLE_AUTOGRAD}>:--disable-autograd>
    $<$<BOOL:${SELECTED_OP_LIST}>:--selected-op-list-path="${SELECTED_OP_LIST}">
    --force_schema_registration
  # ... (remaining arguments omitted)
)
```

The main flow of code generation is shown in the code blocks below. Roughly: main() first parses the arguments passed to the script, then hands them to generate_code(). Combined with the arguments passed by caffe2/CMakeLists.txt above, all three gen_*() functions inside generate_code() get called, and gen_autograd_python() in turn reaches a function named create_python_bindings(), which is where code generation actually happens.

The code generator's call chain:

```python
# tools/setup_helpers/generate_code.py
def generate_code(ninja_global=None,
                  declarations_path=None,
                  nn_path=None,
                  native_functions_path=None,
                  install_dir=None,
                  subset=None,
                  disable_autograd=False,
                  force_schema_registration=False,
                  operator_selector=None):
    # ... (imports and output-directory setup omitted)
    if subset == "pybindings" or not subset:
        gen_autograd_python(
            declarations_path or DECLARATIONS_PATH,
            native_functions_path or NATIVE_FUNCTIONS_PATH,
            autograd_gen_dir,
            autograd_dir)

    if operator_selector is None:
        operator_selector = SelectiveBuilder.get_nop_selector()

    if subset == "libtorch" or not subset:
        gen_autograd(
            declarations_path or DECLARATIONS_PATH,
            native_functions_path or NATIVE_FUNCTIONS_PATH,
            autograd_gen_dir,
            autograd_dir,
            disable_autograd=disable_autograd,
            operator_selector=operator_selector,
        )

    if subset == "python" or not subset:
        gen_annotated(
            native_functions_path or NATIVE_FUNCTIONS_PATH,
            python_install_dir,
            autograd_dir)


def main():
    parser = argparse.ArgumentParser(description='Autogenerate code')
    parser.add_argument('--declarations-path')
    parser.add_argument('--native-functions-path')
    parser.add_argument('--nn-path')
    parser.add_argument('--ninja-global')
    parser.add_argument('--install_dir')
    parser.add_argument(
        '--subset',
        help='Subset of source files to generate. Can be "libtorch" or "pybindings". Generates both when omitted.'
    )
    parser.add_argument(
        '--disable-autograd',
        default=False,
        action='store_true',
        help='It can skip generating autograd related code when the flag is set',
    )
    parser.add_argument(
        '--selected-op-list-path',
        help='Path to the YAML file that contains the list of operators to include for custom build.',
    )
    parser.add_argument(
        '--operators_yaml_path',
        help='Path to the model YAML file that contains the list of operators to include for custom build.',
    )
    parser.add_argument(
        '--force_schema_registration',
        action='store_true',
        help='force it to generate schema-only registrations for ops that are not'
             'listed on --selected-op-list'
    )
    options = parser.parse_args()
    generate_code(
        options.ninja_global,
        options.declarations_path,
        options.nn_path,
        options.native_functions_path,
        options.install_dir,
        options.subset,
        options.disable_autograd,
        options.force_schema_registration,
        # options.selected_op_list
        operator_selector=get_selector(options.selected_op_list_path, options.operators_yaml_path),
    )


if __name__ == "__main__":
    main()
```

```python
# pytorch/tools/autograd/gen_autograd.py
def gen_autograd_python(aten_path, native_functions_path, out, autograd_dir):
    from .load_derivatives import load_derivatives
    differentiability_infos = load_derivatives(
        os.path.join(autograd_dir, 'derivatives.yaml'), native_functions_path)

    template_path = os.path.join(autograd_dir, 'templates')

    # Generate Functions.h/cpp
    from .gen_autograd_functions import gen_autograd_functions_python
    gen_autograd_functions_python(
        out, differentiability_infos, template_path)

    # Generate Python bindings
    from . import gen_python_functions
    deprecated_path = os.path.join(autograd_dir, 'deprecated.yaml')
    gen_python_functions.gen(
        out, native_functions_path, deprecated_path, template_path)
```

```python
# pytorch/tools/autograd/gen_python_functions.py
#
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #
#
#                           Main Function
#
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #

def gen(out: str, native_yaml_path: str, deprecated_yaml_path: str, template_path: str) -> None:
    fm = FileManager(install_dir=out, template_dir=template_path, dry_run=False)
    methods = load_signatures(native_yaml_path, deprecated_yaml_path, method=True)
    create_python_bindings(
        fm, methods, is_py_variable_method, None, 'python_variable_methods.cpp', method=True)

    functions = load_signatures(native_yaml_path, deprecated_yaml_path, method=False)
    create_python_bindings(
        fm, functions, is_py_torch_function, 'torch', 'python_torch_functions.cpp', method=False)

    create_python_bindings(
        fm, functions, is_py_nn_function, 'torch.nn', 'python_nn_functions.cpp', method=False)

    create_python_bindings(
        fm, functions, is_py_fft_function, 'torch.fft', 'python_fft_functions.cpp', method=False)

    create_python_bindings(
        fm, functions, is_py_linalg_function, 'torch.linalg', 'python_linalg_functions.cpp', method=False)


def create_python_bindings(
    fm: FileManager,
    pairs: Sequence[PythonSignatureNativeFunctionPair],
    pred: Callable[[NativeFunction], bool],
    module: Optional[str],
    filename: str,
    *,
    method: bool,
) -> None:
    """Generates Python bindings to ATen functions"""
    py_methods: List[str] = []
    py_method_defs: List[str] = []
    py_forwards: List[str] = []

    grouped: Dict[BaseOperatorName, List[PythonSignatureNativeFunctionPair]] = defaultdict(list)
    for pair in pairs:
        if pred(pair.function):
            grouped[pair.function.func.name.name].append(pair)

    for name in sorted(grouped.keys(), key=lambda x: str(x)):
        overloads = grouped[name]
        py_methods.append(method_impl(name, module, overloads, method=method))
        py_method_defs.append(method_def(name, module, overloads, method=method))
        py_forwards.extend(forward_decls(name, overloads, method=method))

    fm.write_with_template(filename, filename, lambda: {
        'generated_comment': '@' + f'generated from {fm.template_dir}/{filename}',
        'py_forwards': py_forwards,
        'py_methods': py_methods,
        'py_method_defs': py_method_defs,
    })
```
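Before reading on, it helps to look at the codegen's input directly. A small sketch (assuming a PyTorch source checkout with PyYAML installed, run from the repository root) that dumps the native_functions.yaml entries for prelu, i.e. the raw material that ultimately yields its Python binding:

```python
import yaml

# native_functions.yaml is a list of operator schemas, each a dict with a 'func' key.
with open("aten/src/ATen/native/native_functions.yaml") as f:
    entries = yaml.safe_load(f)

for entry in entries:
    if str(entry.get("func", "")).startswith("prelu"):
        print(entry)
```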
Finally, by examining the contents of native_functions.yaml and digging into the code that loads it, we find that the prelu entry in native_functions.yaml ends up written into the file generated from the python_torch_functions.cpp template; that is, it is generated by the call create_python_bindings(fm, functions, is_py_torch_function, 'torch', 'python_torch_functions.cpp', method=False). The generation process itself is quite tedious, but after tracing it layer by layer you find that the generated code ultimately exposes a function named at::<op> to Python. For our prelu, the API exposed to Python ends up calling a function named at::prelu() to do the real computation.

So where is this at::<op> (for example, at::prelu()) defined? The same trick again! Python scripts take the contents of native_functions.yaml and generate the actual C++ source files from the various templates under the pytorch/aten/src/ATen/templates directory. The end result is at::<op>, and inside this function the Dispatcher class is used to look up a handle to the target function. Normally, every usable function handle is managed through a class called Library. A Python script, again working from templates, generates the registration code for these target functions, and that code registers them with the Library class through a macro named TORCH_LIBRARY:

```cpp
#define TORCH_LIBRARY(ns, m)                                                      \
  static void TORCH_LIBRARY_init_##ns(torch::Library&);                           \
  static const torch::detail::TorchLibraryInit TORCH_LIBRARY_static_init_##ns(    \
      torch::Library::DEF,                                                         \
      &TORCH_LIBRARY_init_##ns,                                                    \
      #ns,                                                                         \
      c10::nullopt,                                                                \
      __FILE__,                                                                    \
      __LINE__);                                                                   \
  void TORCH_LIBRARY_init_##ns(torch::Library& m)

class TorchLibraryInit final {
 private:
  using InitFn = void(Library&);
  Library lib_;

 public:
  TorchLibraryInit(
      Library::Kind kind,
      InitFn* fn,
      const char* ns,
      c10::optional<c10::DispatchKey> k,
      const char* file,
      uint32_t line)
      : lib_(kind, ns, k, file, line) {
    fn(lib_);
  }
};
```

(Figure: a schematic of how PyTorch fits together.)

3. Summary

PyTorch is thoroughly Pythonic to use, but Python is really just the sugar coating wrapped around the C++ code for convenience. Pleasant to use, painful to read, especially when combing through the code statically, because much of the C++ that connects the Python C/C++ interface to the actual logic is itself generated by Python scripts. At this point the broad thread has been untangled; what remains is to look into the concrete implementations. Honestly, executing Python in your head and then switching to understanding C++ is exhausting, and hard on the hairline, so I have decided to let the computer generate the C++ code and then continue reading the finer details, for example how exactly each operator gets registered into the Library.

4. Bonus
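One experiment that ties the whole chain together (a minimal sketch, assuming any recent PyTorch build): once prelu has been registered with the dispatcher, the same ATen operator is reachable from Python through torch.ops, bypassing the generated torch.prelu binding entirely:

```python
import torch

x = torch.randn(2, 3)
w = torch.full((3,), 0.25)  # one slope per channel

# Both paths should resolve to the same registered ATen kernel.
print(torch.allclose(torch.ops.aten.prelu(x, w), torch.prelu(x, w)))  # expect True
```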