Introduction
Data is generally divided into structured and unstructured data. Structured data organizes entities and relationships in a regular, predictable way. Most of the data that needs to be processed is unstructured.
Information extraction, the task of deriving structured data from unstructured text, has been used in many fields, such as business intelligence, resume harvesting, media analysis, sentiment detection, patent search, and email scanning. An especially important area of current research is the extraction of structured data from electronic scientific literature, above all in the biological and medical fields.
https://www.cnblogs.com/no-tears-girl/p/6435283.html
Biological and medical literature often involves complex experimental systems and treatments with specific experimental factors, along with the various composite experimental results these treatments produce.
At present, the core work of researchers in this area can be seen as a three-step cycle:
1) Designing a plan to produce experimental results (based on previous results, their own or other people's).
2) Producing these experimental results (including preparing the prerequisite materials, equipment or other resources).
3) Interpreting the results and communicating them to other researchers (face to face, or through indirect media such as text (scientific papers) or video).
Most researchers have to spend a great deal of time reading scientific papers that describe complex reaction systems and experimental factor treatments (and that often contain misleading or even fraudulent content) in order to extract structured and unstructured data from them. Examples include:
1) Which reaction systems the results of a paper correspond to (cell lines, animal models or humans).
2) The number and type of experimental treatment factors defined by the specific reaction system and reaction conditions (in gene function studies, commonly introducing mutations, overexpressing genes, silencing genes, and so on).
3) The raw results of these treatment factors under various detection techniques (such as imaging data; cell morphology; raw high-throughput sequencing data; survival time and proliferation capacity of cell lines; survival time of mice; five-year survival of tumor patients).
4) The results after processing and mining (correlations and interactions among factors; regulatory networks; degree of enrichment, and so on).
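As a rough illustration of what "structured data" could mean for the items above, the sketch below tags a sentence with the reaction systems and treatment factors it mentions. Everything in it is an assumption made for illustration: the `ExperimentRecord` schema, the `extract_record` helper, the keyword patterns, and the invented example sentence are hypothetical and are not part of any openbiox project code. Keyword matching is used only because it is the simplest possible stand-in; a real extraction pipeline would need far more robust NLP methods.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Pattern

# Hypothetical keyword patterns, chosen only for illustration.
SYSTEM_PATTERNS: Dict[str, Pattern[str]] = {
    "cell line": re.compile(r"\bcell lines?\b|\bcells\b", re.IGNORECASE),
    "animal model": re.compile(r"\bmice\b|\bmouse\b|\banimal models?\b", re.IGNORECASE),
    "human": re.compile(r"\bpatients?\b|\bhumans?\b", re.IGNORECASE),
}

TREATMENT_PATTERNS: Dict[str, Pattern[str]] = {
    "mutation": re.compile(r"\bmutat(?:ion|ions|ed)\b", re.IGNORECASE),
    "overexpression": re.compile(r"\boverexpress(?:ion|ed|ing)?\b", re.IGNORECASE),
    "silencing": re.compile(r"\b(?:silenc(?:ed|ing)|knockdown)\b", re.IGNORECASE),
}


@dataclass
class ExperimentRecord:
    """Structured view of one sentence (hypothetical schema)."""
    sentence: str
    reaction_systems: List[str] = field(default_factory=list)
    treatment_factors: List[str] = field(default_factory=list)


def _match_labels(patterns: Dict[str, Pattern[str]], text: str) -> List[str]:
    # Return every label whose pattern occurs in the text, in dictionary order.
    return [label for label, pattern in patterns.items() if pattern.search(text)]


def extract_record(sentence: str) -> ExperimentRecord:
    """Turn an unstructured sentence into a structured record via keyword lookup."""
    return ExperimentRecord(
        sentence=sentence,
        reaction_systems=_match_labels(SYSTEM_PATTERNS, sentence),
        treatment_factors=_match_labels(TREATMENT_PATTERNS, sentence),
    )


if __name__ == "__main__":
    # Invented sentence with a placeholder gene name; it reports no real finding.
    example = (
        "Overexpression of GENE_X in cultured cells increased proliferation; "
        "silencing GENE_X in mice had no effect."
    )
    print(extract_record(example))
```

The record keeps the original sentence next to the extracted labels so that a reader can check each label against its source, which matters given that papers may contain misleading content.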
Natural language processing technology is evolving rapidly, making it possible to automatically organize and summarize the many kinds of information stated or implied in the scientific literature. In the biological and medical fields in particular, the need for this is becoming increasingly urgent.
Building on the practical project ideas drawn up earlier by openbiox, a learning/practice group for natural language processing and scientific literature information extraction is now formally established. The group will eventually be responsible for, and complete, a specific medical literature information extraction project.
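Introduction
Literature in biology and medicine frequently covers complex experimental systems and treatments with specific experimental factors, together with the composite experimental results obtained after those treatments.
As things stand, the main work of researchers in these fields can be viewed as a loop of three steps: 1) designing a plan that will produce experimental results (building on earlier results, whether their own or those of predecessors); 2) producing those results (including preparing the necessary materials, equipment or other resources beforehand); 3) explaining and discussing the results with other researchers (in person, or through indirect media such as text (scientific papers) or video).
The great majority of researchers need to spend large amounts of time reading scientific literature that covers all kinds of complex reaction systems and experimental factor treatments (and that frequently includes misleading or fabricated material) so as to extract structured and unstructured data from it. Such data includes which reaction systems the results reported in an article correspond to (cell lines, animal models or the human body); the number and kinds of experimental treatment factors formed by a given reaction system and its reaction conditions (in gene function research, typically introducing mutations, overexpressing genes, silencing genes, and so on); and the results of these treatment factors as measured by various detection techniques, both raw (such as imaging data; cell morphology; the raw format of high-throughput sequencing data; survival time and proliferation capacity of cell lines; survival time of mice; five-year survival of tumor patients) and processed or mined (correlations and interactions among the factors; regulatory networks; degree of enrichment, and so on).
Natural language processing technology is developing quickly, which makes it feasible to automatically collate and summarize the explicit and implicit information contained in scientific literature. The demand is growing ever more pressing, especially in biology and medicine.
Following the practical project proposals made earlier within openbiox, the natural language processing and scientific literature information extraction learning/practice group is now formally set up, and it will ultimately take charge of and complete one specific medical literature information extraction project.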