Raw Data Could Yield New Science upon Reanalysis

The Computer and Information Science topics on Day 4 span three sessions covering a wide range of data types: sunspot, upper atmosphere and earth observation, geomagnetic, biodiversity, crystallographic, chemistry, radionuclide, and census data. The technologies discussed for these data include database system design, data-intensive computing, statistical analysis, data mining and knowledge management, biodiversity informatics, open source software, and semantic web technology.

Data Management & Visualization

There is growing interest in retaining raw data, since they could yield new science upon reanalysis. Several such discussions have taken place in chemistry and physics. For instance, crystallography studies the structure of materials at molecular scales, and the International Union of Crystallography has commissioned an international working group to investigate how to archive raw experimental data. Likewise, updated nuclide data are used not only in nuclear physics and associated fields but also in medicine, agriculture and space studies. The Russian National Standard Reference Data System is used to publish a Nuclide Guide and a Chart of the Nuclides based on updated data.

Meanwhile, chemical research requires comprehensive data, yet the data are often stored in different databases and in different formats. The Chinese Academy of Sciences attempts to use a unique identifier to map records across data sources and is developing an integrated data access platform. In addition, molecular graphics and spectra are both widely used in molecular chemistry research. As part of an effort to merge two common open source visualization tools, Jmol and JSpecView, a common extension to the JCAMP-DX format is being developed for interoperability.
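The idea of linking sources through a unique identifier can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the identifiers, field names and sample records are hypothetical and do not describe the platform presented at the session.

```python
from collections import defaultdict

# Hypothetical records from two sources that name their identifier field
# differently. Identifiers and values are illustrative only.
SOURCE_A = [
    {"uid": "C-0001", "name": "water", "formula": "H2O"},
    {"uid": "C-0002", "name": "ethanol", "formula": "C2H6O"},
]
SOURCE_B = [
    {"compound_id": "C-0001", "boiling_point_c": 100.0},
    {"compound_id": "C-0003", "boiling_point_c": 56.05},
]


def integrate(sources):
    """Merge records from several sources into one view keyed by identifier.

    `sources` is a list of (records, id_field) pairs, where id_field is the
    name of the field that carries the shared unique identifier.
    """
    merged = defaultdict(dict)
    for records, id_field in sources:
        for record in records:
            uid = record[id_field]
            for key, value in record.items():
                if key != id_field:
                    merged[uid][key] = value
    return dict(merged)


catalog = integrate([(SOURCE_A, "uid"), (SOURCE_B, "compound_id")])
print(catalog["C-0001"])
```

Records that share an identifier are combined into a single entry, while identifiers that appear in only one source are kept with whatever fields that source provides.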

Data Modelling and Knowledge Management

In the last data mining session, two speakers talk about using data mining to model the periods and amounts of sunspots, and about applying knowledge management systems in research scenarios. One challenging issue addressed is how to build prediction models on limited data that maintain high accuracy without new data input. Another topic that resonates throughout the session is how to transform data into knowledge for the benefit of organizations and research, and how to generate innovation and value from it. The session ends with an enlightening panel discussion on the role of data mining in the context of data management in research, industry, technology transfer, and health care systems.

Earth and Environment (Part II)

The talks in the last Earth and Environment session cover a rather varied set of studies. A statistical analysis of data extracted from computer files of the 1996 and 2001 censuses is used to explore the dynamics of poverty and migration, and the changes in the demographics of household structures, in the North West Province of South Africa. The system of the Russian-Ukrainian Geomagnetic Data Center provides quality control of incoming geomagnetic data with three design components: (1) algorithmic modules; (2) a MySQL database for data handling, storage and access; (3) a baseline value calculator and its interactive Web interface.

Three projects are presented under the sustainable development umbrella. The TaiBIF project (Taiwan Biodiversity Information Facility) uses the GBIF (Global Biodiversity Information Facility) metadata format to integrate biodiversity information. One of the major challenges met by this project is that researchers find it difficult to interpret the significance of the spatial distribution of species. The study applies and compares three cluster analysis algorithms: K-means (a partitioning method), DBSCAN (Density-Based Spatial Clustering of Applications with Noise, a density-based method), and STING (Statistical Information Grid, a grid-based method). It also uses geovisualization techniques to analyze and display large-scale species occurrence data.
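To make the density-based approach concrete, here is a minimal pure-Python sketch of DBSCAN applied to a handful of occurrence points. The coordinates, `eps` and `min_pts` values are made up for illustration, and plain Euclidean distance on longitude/latitude degrees is a simplification; this is not the implementation used in the study.

```python
from math import dist


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise.

    A point is a core point when at least min_pts points (itself included)
    lie within distance eps of it.
    """
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbours(i):
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue  # already assigned, do not expand again
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point, keep growing
    return labels


# Hypothetical (longitude, latitude) occurrence records.
occurrences = [
    (121.50, 25.00), (121.52, 25.01), (121.51, 25.02),  # dense group A
    (120.30, 22.60), (120.31, 22.62), (120.29, 22.61),  # dense group B
    (119.00, 24.00),                                    # isolated record
]
labels = dbscan(occurrences, eps=0.05, min_pts=3)
print(labels)
```

Unlike K-means, this approach does not need the number of clusters in advance and it marks isolated records as noise, which is one reason the two families of methods are worth comparing on occurrence data.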

In this session we are also introduced to two Kyoto University projects that highlight the importance of observation data for a sustainable environment. Because few daily precipitation data for Asia were available, the APHRODITE project (Asian Precipitation - Highly-Resolved Observational Data Integration Towards Evaluation of Water Resources) aims to provide scientific data products for determining changes in Asian monsoon precipitation, evaluating water resources, verifying high-resolution model simulations and satellite precipitation estimates, and improving precipitation forecasts. However, APHRODITE releases only part of its data products for free and does not yet publish raw data, because of restrictions in national data policies. The IUGONET project (Inter-university Upper atmosphere Global Observation NETwork), on the other hand, uses DSpace, a free software package, as its metadata database platform, and releases its metadata database and analysis software to promote the effective use of atmospheric data together with other disciplines.

To return to a topic familiar from the previous days, semantic web technology has attracted the attention of the earth science community. The importance of ontology merging is discussed as a way to develop an e-Science data infrastructure, offer a single point of access to geospatial scientific data, and provide web-based applications and services. Several strategies are suggested for using W3C’s SKOS (Simple Knowledge Organization System) to match SPASE (Space Physics Archive Search and Extract), GCMD (NASA's Global Change Master Directory Science Keywords) and GEMET (GEneral Multilingual Environmental Thesaurus):

(1) Use SKOS properties (e.g. has exact/close/related match) to find concordances between concepts (keywords).

(2) Use SKOS properties (e.g. has broader/narrower transitive and has broader/narrower match) to verify those concordances in the ontology hierarchy, in the neighbourhood of the concepts (keywords) found.

(3) Use the inference capabilities of ontology reasoners, such as OWL inverse properties and OWL property characteristics.
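The three strategies can be sketched with a small in-memory set of triples. All concept identifiers below (`gcmd:...`, `gemet:...`) are hypothetical placeholders rather than real thesaurus entries, and the helper functions are illustrative stand-ins for what an RDF library and an OWL reasoner would provide.

```python
# Triples are (subject, predicate, object). Concept ids are made up.
TRIPLES = {
    ("gcmd:Precipitation", "skos:exactMatch", "gemet:precipitation"),
    ("gcmd:Atmosphere", "skos:closeMatch", "gemet:atmosphere"),
    ("gcmd:Precipitation", "skos:broader", "gcmd:Atmosphere"),
    ("gemet:precipitation", "skos:broader", "gemet:atmosphere"),
    ("gcmd:Rain", "skos:broader", "gcmd:Precipitation"),
}

MATCH_PROPERTIES = {"skos:exactMatch", "skos:closeMatch"}


def matches(concept, triples):
    """Strategy 1: concepts linked to `concept` by an exact or close match.

    Both properties are symmetric in SKOS, so either direction counts.
    """
    found = set()
    for s, p, o in triples:
        if p in MATCH_PROPERTIES:
            if s == concept:
                found.add(o)
            elif o == concept:
                found.add(s)
    return found


def broader_transitive(concept, triples):
    """All ancestors of `concept`, following skos:broader transitively."""
    seen, stack = set(), [concept]
    while stack:
        current = stack.pop()
        for s, p, o in triples:
            if s == current and p == "skos:broader" and o not in seen:
                seen.add(o)
                stack.append(o)
    return seen


def hierarchy_agrees(a, b, triples):
    """Strategy 2: does some ancestor of `a` match some ancestor of `b`?"""
    ancestors_b = broader_transitive(b, triples)
    return any(
        matches(ancestor, triples) & ancestors_b
        for ancestor in broader_transitive(a, triples)
    )


def with_inverses(triples):
    """Strategy 3: infer skos:narrower as the inverse of skos:broader."""
    inferred = set(triples)
    for s, p, o in triples:
        if p == "skos:broader":
            inferred.add((o, "skos:narrower", s))
    return inferred


print(matches("gcmd:Precipitation", TRIPLES))
print(hierarchy_agrees("gcmd:Precipitation", "gemet:precipitation", TRIPLES))
```

In this toy data the candidate match found in step 1 is supported in step 2 because the parents of the two concepts are themselves mapped to each other, and step 3 makes the narrower direction available without it being stated explicitly.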

Editor’s note: An alignment between SKOS and the new ISO 25964 thesaurus standard has been published (2012-12-13).

----------------------------------------------------------------------------


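The three sessions related to computer and information science on Day 4 of the CODATA 2012 conference covered many kinds of data, including sunspot, astronomical, atmospheric, earth observation, geomagnetic, biodiversity, crystallographic, chemical element, radionuclide, and census data. The technical topics discussed for these data included database system design, data-intensive computing, statistical analysis, data mining and knowledge management, biodiversity informatics, open source software, and semantic web technology.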

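Data Management and Visualization

Because reanalysis often leads to new scientific discoveries, the way raw data are preserved is drawing increasing attention. Many related conversations have taken place in chemistry and physics. For example, crystallography studies the structure of materials at the molecular scale, so the International Union of Crystallography has set up an international working group to explore the feasibility of a joint archive for raw data. Likewise, up-to-date nuclide data matter not only to nuclear physics and related fields but also to pharmaceuticals, agriculture, and space research. The presentation on the Russian National Standard Reference Data System therefore focused on producing an updated Nuclide Guide and Chart of the Nuclides.

Chemical research often requires fairly complete data, yet these data are commonly scattered across different databases in different formats. The Chinese Academy of Sciences therefore attempts to use a unique identifier to link the different data sources and to build an integrated data access platform. In addition, molecular graphics and spectra are often used in molecular chemistry research. To integrate Jmol and JSpecView, two commonly used open source visualization platforms for molecular research, a new extension to the JCAMP-DX standard lets the same file be used interoperably on both platforms.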

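Data Modelling and Knowledge Management

In the last data mining session, two speakers presented on predicting the distribution of sunspots and on knowledge management systems in research settings. One highly challenging issue was how to build a lasting and accurate prediction model from limited data. The other was a key question raised repeatedly throughout the session: how to turn information into knowledge and bring innovative value to companies, research projects, and organizations. To engage the audience fully, the chair arranged a discussion in the second half of the session, in which people from various fields discussed the role data mining plays in biomedicine, industry, technology transfer, and research data management.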

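Earth and Environment II

The second part of the Earth and Environment session covered quite diverse research. It included a statistical analysis in population sociology, which used the 1996 and 2001 census data for the North West Province of South Africa to explore changes in poverty, urban-rural migration, and changes in household structure. In addition, the system of the Russian-Ukrainian Geomagnetic Data Center provides quality control for geomagnetic data coming from Russia and Ukraine. The system comprises: (1) algorithmic modules; (2) a MySQL database system; (3) a baseline value calculator and an interactive Web interface.

Viewed as case studies, three projects were presented under the vision of sustainable development. The TaiBIF (Taiwan Biodiversity Information Facility) project uses the GBIF (Global Biodiversity Information Facility) metadata format to integrate species data. One challenge the project faces is the difficulty researchers have in interpreting the spatial distribution of species. In this talk the speaker therefore applied three clustering algorithms (K-means, the density-based DBSCAN, and the statistical information grid method STING) together with geovisualization techniques to analyze and display the distribution of large-scale species data.

The other two projects, from Kyoto University, stress the importance of observation data for a sustainable environment. Because daily precipitation data for Asia are hard to obtain, the APHRODITE (Highly-Resolved Observational Data Integration Towards Evaluation of Water Resources) project mainly aims to provide the high-resolution observational data needed to determine changes in Asian monsoon precipitation, evaluate water resources, verify high-resolution model simulations and satellite precipitation estimates, and support precipitation forecasting. At present, constrained by national data policies, the project releases only part of its data products for scientific use. The other Kyoto University project, IUGONET (Inter-university Upper atmosphere Global Observation NETwork), uses the open software DSpace as its metadata database and also releases the metadata database contents and analysis software, in the hope of promoting the integrated use of atmospheric data with other scientific fields.

Returning to a theme observed in the previous days of the conference, semantic web technology again drew the attention of the earth science community in this session. The importance of ontology merging was discussed as a means to build an e-Science data infrastructure, access geospatial scientific data, and provide web applications and services. The suggested strategies for integrating W3C's SKOS (Simple Knowledge Organization System), SPASE (Space Physics Archive Search and Extract), GCMD (NASA's Global Change Master Directory Science Keywords), and GEMET (GEneral Multilingual Environmental Thesaurus) are as follows: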

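(1) Use SKOS (e.g. has exact/close/related match) to establish the concordances between concepts (keywords).

(2) Use SKOS (e.g. has broader/narrower transitive and has broader/narrower match) to check those concordances against neighbouring concepts (keywords) found in the levels of the ontology hierarchy.

(3) Use the inference capabilities of ontology reasoners, such as the OWL inverse property.

Editor's note: An alignment between SKOS and the new ISO 25964 thesaurus standard has been published (2012-12-13).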