Hai Hu (胡海), Tenure-Track Assistant Professor
Department: Department of Translation
Background
Biography
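Assistant Professor at Shanghai Jiao Tong University (SJTU). He received his PhD in computational linguistics (with a minor in cognitive science) from Indiana University, USA, in 2021, and holds bachelor's and master's degrees in English language and literature from Renmin University of China. His research interests are computational linguistics, natural language processing, large language models, and cognitive science. He has published multiple papers in leading computational linguistics and linguistics journals such as Computational Linguistics, and at top natural language processing and artificial intelligence conferences including ACL, AAAI, EMNLP, and COLING. He is the principal investigator of a Ministry of Education Humanities and Social Sciences Youth Project and a Shanghai Pujiang Program project. He received the Highlight Paper Award at the 2024 China National Conference on Computational Linguistics (CCL).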
You can reach me at hu.hai [shift+2] sjtu.edu.cn
Homepage: Personal webpage
CL Lab
Lab webpage: Computational Linguistics (CL) lab
Our lab is equipped with several deep learning servers for training and evaluating state-of-the-art language models. Currently we have two master's students and several undergraduate students working in the lab. They come from diverse backgrounds: linguistics, computer science, information engineering, translation, language studies (English, German, Chinese), etc.
We are actively recruiting students interested in computational linguistics and related areas!
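Undergraduates interested in computational linguistics, natural language processing, corpus linguistics, machine translation, or cognitive science are welcome to join the group! Students with a background in computer science, psychology, or cognitive science are especially welcome.
Students who plan to pursue a master's degree (academic master's in linguistics or professional master's in translation) are welcome to contact me!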
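Teaching and Research
Funding and awards:
- Principal investigator, Shanghai Jiao Tong University Humanities and Social Sciences Research Innovation Cultivation Project (2023-)
- Supported by the Shanghai Pujiang Program (2022)
- Principal investigator, Ministry of Education Humanities and Social Sciences Youth Project (2022-)
- CCL 2024 Highlight Paper Award (China National Conference on Computational Linguistics)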
Datasets and systems
You are welcome to use the large language model training and evaluation data we have developed, as well as our ChatGPT English essay detector.
- ArguGPT detector: (2023) A detector for ChatGPT-written English essays; predicts the probability that an English argumentative essay was generated by ChatGPT (Hugging Face link)
- [new!] ZhoBLiMP: a Systematic Assessment of Language Models with Linguistic Minimal Pairs in Chinese. Chinese minimal pairs covering 15 high-level linguistic phenomena and 118 fine-grained phenomena, plus 20 Chinese language models pretrained from scratch (parameter counts: 14M to 1.4B)
- MELA: (2024) Multilingual Evaluation of Linguistic Acceptability. A linguistic acceptability dataset covering 10 languages: English, Chinese, Russian, Italian, German, Spanish, Japanese, French, Arabic, and Icelandic
- CoLAC: (2023) Corpus of Linguistic Acceptability in Chinese
- SwordsmanImp: (2024) A benchmark for pragmatic understanding in Chinese based on a sitcom; a conversational implicature dataset built from 《武林外传》 (My Own Swordsman)
- Cured SICK: (2023) Re-annotated SICK dataset
- ChineseNLIProbing: (2021) Multiple probing datasets for Chinese NLI, including Chinese HANS, expanded diagnostics, etc.
- OCNLI: (2020) Original Chinese Natural Language Inference; an NLI dataset built from original (non-translated) Chinese text
- CLUE: (2020) Chinese Language Understanding Evaluation (CLUE) benchmark
- FewCLUE: (2021) Few-shot CLUE Benchmark
Courses
Shanghai Jiao Tong University (SJTU): Introduction to Large Language Models; Language Intelligence; Academic Writing; College English; English Viewing, Listening and Speaking
Indiana University: Introduction to Linguistics; Math and Logic in Cognitive Science (TA)
Publications
# denotes corresponding author; * denotes equal contribution
Preprints
- Liu, Y., Shen, Y., Zhu, H., Xu, L., Qian, Z., Song, S., Zhang, K., Tang, J., Zhang, P., Yang, B., Wang, R., & Hu, H#. (2024). ZhoBLiMP: a Systematic Assessment of Language Models with Linguistic Minimal Pairs in Chinese. paper. data.
- Liu, Y., Zhang, Z., Zhang, W., Yue, S., Zhao, X., Cheng, X., Zhang, Y., & Hu, H#. (2023). ArguGPT: evaluating, understanding and identifying argumentative essays generated by GPT models. ArXiv, abs/2304.07666. paper data
- Hai Hu*, Ziyin Zhang*, Weifang Huang, Jackie Yan-Ki Lai#, Aini Li, Yina Patterson, Jiahui Huang, Peng Zhang, Chien-Jer Charles Lin, Rui Wang#. (2023). Revisiting Acceptability Judgements: CoLAC - Corpus of Linguistic Acceptability in Chinese. ArXiv, abs/2305.14091. *equal contributions. paper. data.
Benchmarking (Large) Language Models
We evaluate LLMs on various aspects of linguistic understanding, including but not limited to syntax, semantics and pragmatics in Chinese and beyond.
- Jushi Kai, Tianhang Zhang, Hai Hu, Zhouhan Lin. (2024). SH2: Self-Highlighted Hesitation Helps You Decode More Truthfully. Proceedings of EMNLP (Findings). paper
- Ziyin Zhang*, Yikang Liu*, Weifang Huang, Junyu Mao, Rui Wang#, Hai Hu#. (2024). MELA: Multilingual Evaluation of Linguistic Acceptability. Proceedings of ACL. paper. data *equal contributions
- Shisen Yue, Siyuan Song, Xinyuan Cheng, Hai Hu#. (2024). Do Large Language Models Understand Conversational Implicature – A case study with a Chinese sitcom. Proceedings of CCL. paper. data. [Highlight Paper Award]
Natural Language Understanding/Natural Language Inference
We teach computers to understand human language, in the form of natural language inference.
- Aikaterini-Lida Kalouli*, Hai Hu*, Alexander F. Webb, Lawrence S. Moss, Valeria de Paiva. (2023). Curing the SICK and other NLI maladies. Computational Linguistics. 49 (1): 199–243. doi: https://doi.org/10.1162/coli_a_00465. *equal contributions. paper. data. (SSCI)
- Xu, Liang, Xiaojing Lu, Chenyang Yuan, Xuanwei Zhang, Huilin Xu, Hu Yuan, Guoao Wei, Pan Xiang, Xin Tian, Hai Hu. (2021). FewCLUE: A Chinese few-shot learning evaluation benchmark. arXiv preprint arXiv:2107.07498. paper. code.
- Hu, Hai, He Zhou, Zuoyu Tian, Yiwen Zhang, Yina Ma, Yanting Li, Yixin Nie, Kyle Richardson (2021). Investigating Transfer Learning in Multilingual Pre-trained Language Models through Chinese Natural Language Inference. In: Findings of ACL. paper. code.
- Xu, Liang, Hai Hu, Xuanwei Zhang, Lu Li, Chenjie Cao, Yudong Li, Yechen Xu, Kai Sun, Dian Yu, Cong Yu, Yin Tian, Qianqian Dong, Weitang Liu, Bo Shi, Yiming Cui, Junyi Li, Jun Zeng, Rongzhao Wang, Weijian Xie, Yanting Li, Yina Patterson, Zuoyu Tian, Yiwen Zhang, He Zhou, Shaoweihua Liu, Zhe Zhao, Qipeng Zhao, Cong Yue, Xinrui Zhang, Zhengliang Yang, Kyle Richardson, and Zhenzhong Lan (2020). CLUE: A Chinese Language Understanding Evaluation Benchmark. In Proceedings of the 28th International Conference on Computational Linguistics (COLING). pp. 4762–4772. paper. website. github page
- Hu, Hai, Kyle Richardson, Liang Xu, Lu Li, Sandra Kuebler, and Larry Moss. (2020). OCNLI: Original Chinese Natural Language Inference. In: Findings of the Association for Computational Linguistics: EMNLP 2020. pp. 3512–3526. paper. code and data. leaderboard.
- Richardson, Kyle, Hai Hu, Larry Moss, and Ashish Sabharwal. (2020). Probing Natural Language Inference Models through Semantic Fragments. In: Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence. pp. 8713-8721. paper. code and data.
- Hu, Hai, Qi Chen, Kyle Richardson, Atreyee Mukherjee, Lawrence S Moss, and Sandra Kuebler. (2020). MonaLog: a Lightweight System for Natural Language Inference Based on Monotonicity. In: Proceedings of the Society for Computation in Linguistics 2020. pp. 319-329. paper. poster. code.
- Hu, Hai, Qi Chen and Larry Moss. (2019). Natural Language Inference with Monotonicity. In Proceedings of the 13th International Conference on Computational Semantics (IWCS 2019), pp. 8–15. Gothenburg, Sweden. paper.
- Hu, Hai, and Lawrence S. Moss. (2018). Polarity Computations in Flexible Categorial Grammar. In Proceedings of the 7th Joint Conference on Lexical and Computational Semantics: *SEM, pp. 124–129. New Orleans, Louisiana, USA. paper. poster. code.
Semantic change
Here I work on detecting semantic change using word embeddings (word2vec, GloVe) in low-resource scenarios, e.g., medieval Spanish.
- Amaral, Patrícia, Hai Hu and Sandra Kübler (2023). "Tracing semantic change with distributional methods: The contexts of algo". Diachronica. https://doi.org/10.1075/dia.21012.ama paper (SSCI).
- Hu, Hai, Patrícia Amaral and Sandra Kübler (2022). "Word Embeddings and Semantic Shifts in Historical Spanish: Methodological Considerations". Digital Scholarship in the Humanities. Volume 37, Issue 2, Pages 441–461. https://doi.org/10.1093/llc/fqab050 paper. code (SSCI)
Corpus translation studies/treebank construction
I am also interested in the morphological, syntactic and stylistic characteristics of translated Chinese (翻译汉语) and Europeanized Chinese (欧化汉语).
To this end, I 1) employ machine learning methods to study translations and 2) build treebanks (=syntactically annotated corpora) to look into the syntactic features of translationese.
- Hu, Hai and Sandra Kübler. (2021). Investigating Translated Chinese and Its Variants Using Machine Learning. In Natural Language Engineering. Volume 27, Issue 3, May 2021, pp. 339-372. https://doi.org/10.1017/S1351324920000182 (SCI/SSCI/AHCI) paper. code.
- Hu, Hai, Yanting Li, Yina Patterson, Zuoyu Tian, Yiwen Zhang, He Zhou, Sandra Kübler, and Chien-Jer Charles Lin (2020). "Building a Literary Treebank for Translation Studies in Chinese". In: Proceedings of 19th International Workshop on Treebanks and Linguistic Theories (TLT). pp. 18-31. paper.
- Hu, Hai, Wen Li, and Sandra Kübler. (2018). Detecting Syntactic Features of Translated Chinese. In Proceedings of the 2nd Workshop on Stylistic Variation, pp. 20-28. New Orleans, Louisiana, USA. paper. slides. video presentation.
Other papers
I'm a linguist, so I also collaborate with other linguists on very linguistic-y projects where computational modeling is sometimes used.
- Li, A., Tamminga, M., & Hu, H. (2023). Intra- and interspeaker repetitiveness in Chengdu Mandarin locative variation. Language Variation and Change, 1-21. doi:10.1017/S095439452300008X Paper.
- Lin, Chien-Jer Charles, and Hai Hu. (2023). Linking comprehension and production: Frequency distribution of Chinese relative clauses in the Sinica Treebank. In Chu-Ren Huang, Shukai Hsieh, & Peng Jin (eds.) Chinese Language Resources: Data Collection, Linguistic Analysis, Annotation and Language Processing. Springer. https://doi.org/10.1007/978-3-031-38913-9_23 paper
- Hu, Hai and Yiwen Zhang. (2017). Path of Vowel Raising in Chengdu Dialect of Mandarin. In Proceedings of the 29th North America Conference on Chinese Linguistics. Rutgers, NJ. pp. 481-498. paper. abstract.
For all publications, see: https://huhailinguist.github.io/publications/
Translations
- 《表象与本质——类比,思考之源和思维之火》 (Chinese translation of Surfaces and Essences: Analogy as the Fuel and Fire of Thinking). Translated by Liu Jian, Hu Hai, and Chen Qi; written by Douglas Hofstadter (USA) and Emmanuel Sander (France); Zhejiang People's Publishing House; 2018; Douban page
Recent talks:
- 2023/04: Progress and Prospects of Pre-trained Models (预训练模型进展与展望). SJTU SFL.
- 2022/03: Examining the Replicability of Grammaticality Judgments in Chinese Journal Articles: Dialectal Influences and Sources of Variability. Annual Conference on Human Sentence Processing (UC Santa Cruz; Online)
- 2021/12: Recent progress in natural language inference. AWS AI Lab Shanghai.
- 2021/11: Everytime I hire a linguist, my accuracy goes down: why NLU still needs linguists now? Fudan University NLP lab.
Professional Service
Conference organization:
- NAtural LOgic meets MAchine learning (NALOMA) workshop; Workshop at WESSLLI 2020. webpage
Reviewing:
- Computational linguistics conferences including ACL, EMNLP, NAACL, and CCL
- Academic journals including Natural Language Engineering