When Big Data Meets Open Data


In the broadest sense, big data and data-intensive issues unfolded in almost every session of the CODATA 2012 conference. The sessions opened by introducing the concept of moving from "Digital Earth" toward "Future Earth". From the perspective of computer and information science, the main themes of the second day of the conference included cloud computing, standardization and interoperability, and crowdsourcing. These technical considerations were essential to the discussion in several application domains, including earth science and GIS science, materials science, disaster management, and e-health management. While technical issues are fundamental to the big-data, data-intensive era, several projects and case studies shared their experience in building open, collaborative, knowledge-based systems.

Technical problems are recognized as the main obstacle to GIS science research in the data-intensive era. Cloud computing and Service-Oriented Architecture (SOA) were presented as two solutions for collaborative geospatial services. The difficulties of image processing were addressed in cases using parallel-processing programming and platforms, as well as by the Globally Leveraged Integrated Data Explorer for Research (GLIDER) project, which develops a framework for visualizing, analyzing, and mining the complex structure of satellite imagery. Building on the GLIDER framework for workflow classification, data mining is not only offered as a service but is also accompanied by crowdsourcing tools such as Talkoot (Drupal-based features such as forums, chats, ratings, tagging, and tag clouds) that allow online communities to build open, collaborative portals of Earth-science services.

As we move to the domain of materials science, the applications range from the arts (e.g., OPPRA integrates information about art history, artistic techniques, and paint-conservation treatments) to industrial use (e.g., the Metals Bank, an integrated infrastructure for the metals industry). To preserve and transfer these data, including the identifiers of materials (e.g., crystal structure, phase diagrams, and properties), standardization of data formats is necessary. Instead of inventing new standards, communities are encouraged to collaborate with each other in constructing standards based on existing ones. Meanwhile, to make sure that communities can easily share and contribute, an open-data environment is recognized as an important condition for accelerating the collection of materials data.

In the context of disaster reduction, timely access to information such as the location of shelters can save numerous lives. Since current disaster-management systems cannot make good use of such information, systems built on a framework that combines virtual repositories, open information, and linked open data were suggested as possible solutions. For instance, CROSS, a crowdsourcing support system for disaster surveillance, combines human sensors with physical sensors and integrates information from volunteers; the prototype plans routes for volunteers to explore threatened areas. In addition, the Open Information Gateway (OIGY) is designed to deliver content-centric information over unreliable networks. In particular, if network service is interrupted, Unaffected Alternate Selection (UAS) uses backup routing to reroute the affected traffic and recover the connection.
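The rerouting idea behind UAS can be illustrated with a generic graph search that avoids a failed link. This is only a sketch of the concept, not the actual UAS algorithm; the network topology and node names below are invented for illustration.

```python
from collections import deque

# Hypothetical network topology as adjacency lists; all names are invented.
links = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C"],
}

def reroute(links, src, dst, failed_link):
    """Breadth-first search for a path from src to dst that avoids the
    failed link -- a stand-in for selecting an unaffected backup route."""
    failed = frozenset(failed_link)
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == dst:
            return path
        for nxt in links[node]:
            # Skip the broken link and nodes already visited.
            if frozenset((node, nxt)) == failed or nxt in seen:
                continue
            seen.add(nxt)
            queue.append(path + [nxt])
    return None  # no alternate route exists

# Primary route A-B-D loses link (B, D); traffic is rerouted through C.
print(reroute(links, "A", "D", ("B", "D")))  # ['A', 'C', 'D']
```

Breadth-first search is used here only because it finds a shortest alternate path with minimal code; a production routing protocol would of course use precomputed backup routes and link-state updates.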

In general, interoperability and standardization are central to e-health management. For interoperability, the Dolphin Project uses the Medical Markup Language for data exchange; the data are integrated and made available to record owners and medical research institutes. For standardization, the Electronic Health Record (EHR) represents a standardized, commonly agreed information model whose primary purpose is to support lifelong, safe, integrated health care. A joint VA–DoD iEHR improves the quality and accessibility of health care. Electronic Health Record Management and Preservation (EHR-MP) is accomplished through the iEHR Healthcare Services Platform (HSP) and its Common Information Interoperability Framework (CIIF). Notably, openEHR, which is based on archetypes, currently offers the richest available datasets; Ruby serves as the scripting language of openEHR, providing the capability to implement problems and deliver semantic interoperability. The Clinical Information Modelling Initiative (CIMI) is set to create a reference model based on Detailed Clinical Models (DCM) that can enrich EHR repositories.

Last but not least, we can see an emerging trend: standardized data models (e.g., RDF) and the ontological modelling of semantic-web technologies began to attract the attention of the Big Data and CODATA communities at this conference. We may wonder: Will semantic-web technologies offer a promising way to meet the Big Data challenge that one dataset's structure is often another's unstructured data? While most traditional relational databases are constructed in relatively closed environments, will the open character of RDF bring more semantic meaning and structured relations to Big Data? If so, how far will the huge potential of semantic-web technologies, with their capability to infer new information, carry Big Data toward Open Data? And what, then, would an effective open-data regime look like?
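The kind of inference alluded to above can be illustrated with a toy example. The sketch below uses plain Python tuples rather than an actual RDF toolkit: facts are (subject, predicate, object) triples, as in RDF, and a simple transitive-closure rule derives a triple that was never stated explicitly. All the resource and predicate names are hypothetical.

```python
# RDF-style facts as (subject, predicate, object) triples.
# Every name below is invented for illustration.
triples = {
    ("Shelter_A", "locatedIn", "District_1"),
    ("District_1", "locatedIn", "City_X"),
    ("Shelter_A", "type", "EmergencyShelter"),
}

def infer_transitive(facts, predicate):
    """Compute the transitive closure of one predicate, i.e. keep adding
    inferred triples until no new fact can be derived."""
    inferred = set(facts)
    changed = True
    while changed:
        changed = False
        for (s1, p1, o1) in list(inferred):
            for (s2, p2, o2) in list(inferred):
                if p1 == p2 == predicate and o1 == s2:
                    new_fact = (s1, predicate, o2)
                    if new_fact not in inferred:
                        inferred.add(new_fact)
                        changed = True
    return inferred

facts = infer_transitive(triples, "locatedIn")
# This triple was never asserted; it was inferred from the two above.
print(("Shelter_A", "locatedIn", "City_X") in facts)  # True
```

In a real semantic-web stack the same effect comes from declaring the predicate transitive in an OWL ontology and letting a reasoner materialize the new triples; the point here is only that open, schema-light triples permit new information to be derived rather than stored.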


Viewed in the broadest sense, big data and data-intensive science were the two themes running through most of the CODATA 2012 conference. The programme opened by introducing the concept of moving "from Digital Earth toward Future Earth". This conference report, written mainly from the perspective of computer and information science, summarizes the highlights of the second day, chiefly cloud computing, standardization and interoperability, and crowdsourcing. These technical considerations were discussed primarily in the sessions on earth science and GIS science, materials science, disaster management, and e-health management. Technical issues are the basic requirements of the big-data, data-intensive era, but the programme also shared many cases and project experiences of building open, collaborative knowledge systems.

In the data-intensive era, the main problems facing GIS science are technical. The research presented suggested cloud computing and Service-Oriented Architecture (SOA) as solutions for collaborative geospatial services. For image-processing problems, it pointed to parallel-processing programs and platforms, and shared as a case the GLIDER project (Globally Leveraged Integrated Data Explorer for Research), which provides an integrated framework for visualizing, analyzing, and mining complex satellite imagery. Using the GLIDER framework as the classification framework for system workflows, data mining can not only be regarded as a service; crowdsourcing tools such as Talkoot (Drupal-based, with features such as forums, chat, ratings, tagging, and tag clouds) have also been developed so that communities can build open, collaborative Earth-science data services.

If we shift our focus to materials science, the sessions showed that materials-data applications span both the arts and industry. For example, the OPPRA database, developed by the University of Queensland and partners, brings together information on art history, painting techniques, and conservation treatments, while the Metals Bank project, led by the Korean government, builds a public infrastructure of integrated materials data to serve the development needs of the metals industry. Preserving and sharing these data, however, requires standardized data formats for storing content such as crystal structures, phase diagrams, and properties by which materials are identified. This depends on communities and organizations collaborating to establish new standardization methods on the basis of existing standards. At the same time, an open-data environment helps communities share and contribute information and accelerates the collection of materials data.

When a disaster strikes, obtaining relevant information at the first moment is one of the keys to survival, yet current disaster-management systems cannot make effective use of information sourced from networks and crowds. To solve this problem, the research proposed a system framework that combines a virtual repository, an open information system, and linked open data to strengthen the integration and dissemination of disaster information. CROSS is one example: it aggregates crowdsourced information and matches relief needs with relief supplies and manpower, and the prototype system offers communities route planning through dangerous areas when a disaster occurs. In addition, with the Open Information Gateway (OIGY), users can transmit content-centric information under unstable network conditions. If network service goes down, routing and rerouting techniques are used to find unaffected alternate paths (Unaffected Alternate Selection, UAS) to replace damaged nodes and quickly restore the network connection.

Interoperability and standardization are the two main topics in e-health management. For interoperability, the electronic medical system in the Dolphin Project uses the Medical Markup Language as the standard for exchanging data between data centers and medical institutions; these data are automatically integrated so that patients themselves and medical research institutes can search them. For standardization, the Electronic Health Record (EHR) represents a standardized, broadly agreed information model whose primary purpose is to support safe, integrated, lifelong health care. For example, the joint interagency Electronic Health Record (iEHR) of the US Department of Veterans Affairs and Department of Defense improves the quality and accessibility of health care, while also optimizing the workflow of care providers and improving services to members and the satisfaction of veterans. Electronic Health Record Management and Preservation (EHR-MP) is achieved through the Healthcare Services Platform (HSP) and the Common Information Interoperability Framework (CIIF). Particularly worth noting is openEHR, built on archetypes, which is currently the richest available database. The research took Ruby as the programming language of openEHR, providing functions for implementing problems and semantic interoperability. The Clinical Information Modelling Initiative (CIMI), one archetype-based effort in the openEHR family, currently aims at a single-format information model for input and model sharing; CIMI's goal is to create a reference model based on Detailed Clinical Models (DCM) and enrich EHR repositories.

Finally, one trend observed at this conference deserves mention: semantic-web technologies such as the Resource Description Framework (RDF) and ontological modelling have gradually drawn the attention of the Big Data and CODATA communities. This prompts us to ask: In Big Data, the structure of one dataset is often unstructured data to another; can semantic-web technologies meet this technical challenge? When most traditional relational databases are built on the premise of a closed environment, can RDF's open principles give Big Data more semantic meaning and more structured relational architecture? If so, to what degree of openness will the inference capability of semantic-web technologies, which can generate new information, carry Big Data? And how, in turn, should an effective regime of open data be defined?