Raw Data Could Yield New Science upon Reanalysis

On Day 4, the topics related to computer and information science span three sessions and a wide range of data types: sunspot, upper atmosphere, earth observation, geomagnetic, biodiversity, crystallographic, chemistry, radionuclide, and census data. The technologies discussed for these data include database system design, data-intensive computing, statistical analysis, data mining and knowledge management, biodiversity informatics, open source software, and semantic web technology.

Data Management & Visualization

There is growing interest in retaining raw data, because reanalysis could yield new science. The issue has been discussed in both chemistry and physics. Crystallography, for instance, studies the structure of materials at the molecular scale, and the International Union of Crystallography has commissioned an international working group to investigate how to archive raw experimental data. Likewise, updated nuclide data are used not only in nuclear physics and associated fields but also in medicine, agriculture and space studies. The Russian National Standard Reference Data System is used to publish the Nuclide Guide and the Chart of the Nuclides, both based on updated data.

Meanwhile, chemical research requires comprehensive data, yet the data are often stored in different databases and in different formats. The Chinese Academy of Sciences is using a unique identifier to map records across data sources and is developing an integrated data access platform. In addition, molecular graphics and spectra are both widely used in molecular chemistry research. As part of an effort to merge two common open source visualization tools, Jmol and JSpecView, a common JCAMP-DX extension is being developed for interoperability.
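The idea of mapping several data sources through one unique identifier can be sketched as follows. This is only an illustration under stated assumptions: the record fields, the identifier values (`C001` and so on), the source names, and the `integrate` helper are all hypothetical and are not taken from the platform described in the talk.

```python
from collections import defaultdict

# Hypothetical records from two source databases that describe the same
# compounds under different field names. The identifier values are invented.
SPECTRA_DB = [
    {"uid": "C001", "spectrum_type": "IR"},
    {"uid": "C002", "spectrum_type": "NMR"},
]
PROPERTY_DB = [
    {"compound_id": "C001", "melting_point_c": 5.5},
    {"compound_id": "C003", "melting_point_c": 80.2},
]


def integrate(sources):
    """Merge records from several sources into one view keyed by identifier.

    `sources` maps a source name to a pair:
    (list of records, name of the field that holds the identifier).
    """
    merged = defaultdict(dict)
    for source_name, (records, id_field) in sources.items():
        for record in records:
            uid = record[id_field]
            # Keep each source's fields under its own name to avoid clashes.
            merged[uid][source_name] = {
                key: value for key, value in record.items() if key != id_field
            }
    return dict(merged)


catalog = integrate({
    "spectra": (SPECTRA_DB, "uid"),
    "properties": (PROPERTY_DB, "compound_id"),
})
```

Keeping each source's fields under its own key means that a compound present in only one database still appears in the integrated view, and fields with the same name in different databases do not overwrite each other.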

Data Modelling and Knowledge Management

In the last data mining session, two speakers discuss using data mining to model the periods and numbers of sunspots, and applying knowledge management systems in research scenarios. One challenge addressed is building prediction models from limited data that remain accurate without new data input. Another theme that resonates throughout the session is how to transform data into knowledge that benefits organizations and research and generates innovation and value. The session ends with a panel discussion on the role of data mining within data management in research, industry, technology transfer, and health care systems.

Earth and Environment (Part II)

The talks in the last Earth and Environment session cover a varied set of studies. A statistical analysis of data extracted from the computer files of the 1996 and 2001 censuses was used to explore the dynamics of poverty and migration, and the changes in the demographics of household structures, in the North West Province of South Africa. The Russian-Ukrainian Geomagnetic Data Center system provides quality control of incoming geomagnetic data through three components: (1) algorithmic modules; (2) a MySQL database for data handling, storage and access; and (3) a baseline value calculator with an interactive web interface.

Three projects are presented under the sustainable development umbrella. The TaiBIF project (Taiwan Biodiversity Information Facility) uses the GBIF (Global Biodiversity Information Facility) metadata format to integrate biodiversity information. One major challenge for the project is that researchers find it difficult to interpret the significance of species' spatial distributions. The study therefore applies and compares three cluster analysis algorithms, K-means (a partitioning method), DBSCAN (Density-Based Spatial Clustering of Applications with Noise, a density-based method), and STING (Statistical Information Grid, a grid-based method), and uses geovisualization techniques to analyze and display large-scale species occurrence data.
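To make the density-based approach concrete, here is a minimal DBSCAN sketch in plain Python. It is not the project's implementation: the occurrence coordinates are invented, the `eps` and `min_pts` values are arbitrary, and treating longitude/latitude degrees as planar coordinates is a simplification that only holds over small areas. K-means and STING, which the study also compared, are not shown.

```python
import math


def region_query(points, index, eps):
    """Return indices of all points within eps of points[index], itself included."""
    px, py = points[index]
    return [
        j for j, (qx, qy) in enumerate(points)
        if math.hypot(px - qx, py - qy) <= eps
    ]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    unvisited, noise = None, -1
    labels = [unvisited] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not unvisited:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = noise  # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not unvisited:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels


# Hypothetical species occurrence points as (longitude, latitude) in degrees.
occurrences = [
    (121.50, 25.00), (121.52, 25.01), (121.51, 25.02), (121.53, 25.00),
    (120.60, 23.50), (120.62, 23.51), (120.61, 23.49),
    (121.00, 24.20),
]
labels = dbscan(occurrences, eps=0.05, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, -1]
```

Unlike K-means, this method does not need the number of clusters in advance and marks isolated records as noise, which is one reason density-based methods are often considered for occurrence data with uneven sampling.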

This session also introduces two Kyoto University projects that highlight the importance of observation data for a sustainable environment. Because few daily precipitation data were available for Asia, the APHRODITE project (Asian Precipitation - Highly-Resolved Observational Data Integration Towards Evaluation of Water Resources) aims to provide scientific data products for the determination of Asian monsoon precipitation change, the evaluation of water resources, the verification of high-resolution model simulations and satellite precipitation estimates, and the improvement of precipitation forecasts. However, APHRODITE releases only part of its data products for free and does not yet publish raw data, because of restrictions in national data policies. The IUGONET project (Inter-university Upper atmosphere Global Observation NETwork), on the other hand, uses the open source software DSpace as its metadata database platform, and releases its metadata database and analysis software to promote effective use of atmospheric data by other disciplines.

To return to a topic familiar from the previous days, semantic web technology has attracted attention in the earth science community. The importance of ontology merging is discussed as a means to develop an e-Science data infrastructure, offer a single point of access to scientific geo-space-related data, and provide web-based applications and services. Several strategies are suggested for using W3C’s SKOS (Simple Knowledge Organization System) to match SPASE (Space Physics Archive Search and Extract), GCMD (NASA's Global Change Master Directory Science Keywords) and GEMET (GEneral Multilingual Environmental Thesaurus):

(1) Use SKOS mapping properties (e.g. has exact/close/related match) to find concordances between concepts (keywords).

(2) Use SKOS hierarchical properties (e.g. has broader/narrower transitive and has broader/narrower match) to verify those concordances within the ontology hierarchy in the neighbourhood of the matched concepts (keywords).

(3) Use the inference capabilities of ontology reasoners, such as the OWL inverse property and other OWL property characteristics.
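The three strategies can be sketched with plain subject-predicate-object triples. This is a simplified illustration, not the speakers' method: the concept names and the mappings between them are hypothetical and are not real GCMD or GEMET entries, the helper functions are invented for this example, and a real system would use an RDF library and an OWL reasoner. For brevity, matches are followed only from subject to object, although SKOS defines them as symmetric.

```python
# Hypothetical triples; the prefixes only hint at which vocabulary a concept
# would come from.
TRIPLES = {
    ("gcmd:Geomagnetism", "skos:broader", "gcmd:SolidEarth"),
    ("gemet:geomagnetism", "skos:broader", "gemet:geophysics"),
    ("gemet:geophysics", "skos:broader", "gemet:earth_science"),
    ("gcmd:Geomagnetism", "skos:exactMatch", "gemet:geomagnetism"),
    ("gcmd:SolidEarth", "skos:closeMatch", "gemet:earth_science"),
}
MATCH = {"skos:exactMatch", "skos:closeMatch"}


def objects(triples, subject, predicate):
    """All objects o such that (subject, predicate, o) is asserted."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def matches(triples, concept):
    """Strategy 1: concepts linked to `concept` by an exact or close match."""
    return {o for s, p, o in triples if s == concept and p in MATCH}


def broader_transitive(triples, concept):
    """All ancestors reachable through skos:broader (a transitive closure)."""
    seen, stack = set(), [concept]
    while stack:
        current = stack.pop()
        for parent in objects(triples, current, "skos:broader"):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def hierarchy_agrees(triples, source, target):
    """Strategy 2: some ancestor of `source` maps onto an ancestor of `target`."""
    target_ancestors = broader_transitive(triples, target)
    return any(
        matches(triples, ancestor) & target_ancestors
        for ancestor in broader_transitive(triples, source)
    )


def add_inverse(triples, prop, inverse):
    """Strategy 3: materialise inverse statements, as a reasoner would for
    a property pair declared with owl:inverseOf."""
    return triples | {(o, inverse, s) for s, p, o in triples if p == prop}


candidates = matches(TRIPLES, "gcmd:Geomagnetism")
confirmed = {
    target for target in candidates
    if hierarchy_agrees(TRIPLES, "gcmd:Geomagnetism", target)
}
closed = add_inverse(TRIPLES, "skos:broader", "skos:narrower")
```

Here the candidate match found in step 1 is confirmed in step 2 because the parents of the two concepts are themselves mapped to each other, and step 3 derives the `skos:narrower` statements that were never asserted explicitly.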

Editor’s note: The alignment between SKOS and the new ISO 25964 thesaurus standard (2012-12-13) has been published.




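On the fourth day of the CODATA 2012 conference, the three sessions related to computer and information science cover many types of data, including sunspot, astronomical, atmospheric, earth observation, geomagnetic, biodiversity, crystallographic, chemical element, radionuclide, and census data. The topics discussed from a technical perspective include database system design, data-intensive computing, statistical analysis, data mining and knowledge management, bioinformatics, open source software, and semantic web technology.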






Earth and Environment II

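The second Earth and Environment session covers a fairly diverse set of studies. One is a statistical analysis in demographic sociology, which uses 1996 and 2001 census data for the North West Province of South Africa to examine changes in poverty, rural-urban migration, and household structure. In addition, the Russian-Ukrainian Geomagnetic Data Center system provides quality control for geomagnetic data from Russia and Ukraine. The system comprises (1) algorithmic modules, (2) a MySQL database system, and (3) a baseline value calculator with an interactive web interface.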

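Viewed as case studies, three projects in the session are presented under the vision of sustainable development. The TaiBIF (Taiwan Biodiversity Information Facility) project uses the GBIF (Global Biodiversity Information Facility) metadata format to integrate species data. One challenge the project faces is the difficulty researchers have in interpreting species' spatial distributions. In this talk, the speaker therefore applies three clustering algorithms (K-means, the density-based DBSCAN, and the grid-based STING, or Statistical Information Grid) together with geovisualization techniques to analyze and display the distribution of large-scale species data.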


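Returning to a theme from the previous days of the conference, semantic web technology once again draws the attention of the earth science community in this session. The importance of ontology merging is discussed as a way to build an e-Science data infrastructure, access geospatial scientific data, and provide web applications and services. The suggested strategies for integrating W3C's SKOS (Simple Knowledge Organization System), SPASE (Space Physics Archive Search and Extract), GCMD (NASA's Global Change Master Directory Science Keywords), and GEMET (GEneral Multilingual Environmental Thesaurus) are as follows: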

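(1) Use SKOS (e.g. has exact/close/related match) to find concordances between concepts (keywords).

(2) Use SKOS (e.g. has broader/narrower transitive and has broader/narrower match) to verify concordances among neighbouring concepts (keywords) in the ontology hierarchy.

(3) Use the inference capabilities of ontology reasoners, such as the OWL inverse property.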

Editor's note: The alignment between SKOS and the new ISO 25964 thesaurus standard (2012-12-13) has been published.