March / April 2014


Special Issue on Big Data (I)

This issue of Cloud-Link is the first of two on big data, including data-intensive science. Big data has emerged as a field alongside "the Cloud", and big data technologies are often a basis for providing Cloud services. Conversely, Cloud services can put big data tools and methods within reach of smaller users by providing affordable on-demand computation and storage. We also note the increasingly popular spelling "big data" instead of "Big Data," perhaps reflecting the maturity of the topic.

In reviewing publications on big data, we found a range of common themes, including visualization, data quality, privacy and legal issues, impact on society, technology, publishing and discovery of data, standards, and applications and case studies in business intelligence, analysis, the Internet of Things, and "smart cities". With so much to choose from, we have decided to devote two issues of Cloud-Link to the overall topic. In this first issue we focus on methods and tools for making data available and analyzing it, as well as applications in medicine and some of the associated concerns. Unavoidably, there are many other useful and insightful articles that have not been included.

We also recommend reading a special issue on "Leveraging Big Data" of IT Professional (published in November/December 2013), which includes five articles that cover related themes: application of big data to business analytics, reduction of healthcare costs, and transforming the government.  You are also encouraged to search through IEEE Xplore and other databases for further reading.

The next issue (May/June 2014) of Cloud-Link will be the second "Big Data" issue. If you would like to recommend any useful articles for this or for future issues, please email them to us. If you have any suggestions for future topics, please also let us know.

Henry Chan, Victor Leung, Jens Jensen and Tomasz Wiktor Wlodarczyk
Editor and Associate Editors

Articles in this issue

A Scalable Two-Phase Top-Down Specialization Approach for Data Anonymization Using MapReduce on Cloud

By Xuyun Zhang, L.T. Yang, Chang Liu and Jinjun Chen

Published in IEEE Transactions on Parallel and Distributed Systems, February 2014

A large number of cloud services require users to share private data like electronic health records for data analysis or mining, bringing privacy concerns. Anonymizing data sets via generalization to satisfy certain privacy requirements such as k-anonymity is a widely used category of privacy preserving techniques. At present, the scale of data in many cloud applications increases tremendously in accordance with the Big Data trend, thereby making it a challenge for commonly used software tools to capture, manage, and process such large-scale data within a tolerable elapsed time. As a result, it is a challenge for existing anonymization approaches to achieve privacy preservation on privacy-sensitive large-scale data sets due to their insufficiency of scalability. In this paper, we propose a scalable two-phase top-down specialization (TDS) approach to anonymize large-scale data sets using the MapReduce framework on cloud. In both phases of our approach, we deliberately design a group of innovative MapReduce jobs to concretely accomplish the specialization computation in a highly scalable way. Experimental evaluation results demonstrate that with our approach, the scalability and efficiency of TDS can be significantly improved over existing approaches.
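As a rough illustration of the k-anonymity requirement mentioned in the abstract (and not of the paper's MapReduce-based TDS approach), the following minimal Python sketch checks whether a small table is k-anonymous with respect to a set of quasi-identifier columns, and shows how generalizing a value can make it so. The records, the column names and the `generalize_age` helper are hypothetical, chosen only for illustration. Note also that this sketch generalizes exact values upward, whereas top-down specialization works in the opposite direction, starting from the most general values.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    occurs in at least k records."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())


def generalize_age(record, width=10):
    """Hypothetical generalization step: replace an exact age with a range."""
    low = (record["age"] // width) * width
    return {**record, "age": f"{low}-{low + width - 1}"}


# Hypothetical health records; "age" and "zip" act as quasi-identifiers.
records = [
    {"age": 31, "zip": "476**", "diagnosis": "flu"},
    {"age": 36, "zip": "476**", "diagnosis": "asthma"},
    {"age": 42, "zip": "479**", "diagnosis": "flu"},
    {"age": 47, "zip": "479**", "diagnosis": "diabetes"},
]
qi = ["age", "zip"]

# With exact ages every record is unique, so the table is not 2-anonymous.
raw_ok = is_k_anonymous(records, qi, k=2)

# After generalizing ages to decades, each group contains two records.
generalized = [generalize_age(r) for r in records]
generalized_ok = is_k_anonymous(generalized, qi, k=2)

print(raw_ok, generalized_ok)
```

On large data sets the grouping step is the expensive part, which is one reason a distributed framework such as MapReduce is attractive for this kind of computation.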

Read the full article at IEEE Xplore...

Special Issue on Leveraging Big Data

Published in IT Professional, November/December 2013

Leveraging Big Data and Business Analytics [Guest editors' Introduction] 
By Sunil Mithas, Maria R. Lee, Seth Earley, San Murugesan and Reza Djavanshir

Leveraging Big Data Analytics to Reduce Healthcare Costs 
By Uma Srinivasan and Bavani Arunasalam

Business Process Analytics Using a Big Data Approach 
By Alejandro Vera-Baquero, Ricardo Colomo-Palacios and Owen Molloy

XBRL in the Chinese Financial Ecosystem
By Li Jimei, Hu Yuzhou and Du Meijie

Big Data and Transformational Government
By Rhoda C. Joseph and Norman A. Johnson


Embedded Analytics and Statistics for Big Data

By P. Louridas and C. Ebert

Published in IEEE Software, November/December 2013

Embedded analytics and statistics for big data have emerged as an important topic across industries. As the volumes of data have increased, software engineers are called on to support data analysis and to apply some kind of statistics to the data. This article provides an overview of tools and libraries for embedded data analytics and statistics, both stand-alone software packages and programming languages with statistical capabilities.

Read the full article at IEEE Xplore...

Bias Correction in a Small Sample from Big Data

By Jianguo Lu and Dingding Li

Published in IEEE Transactions on Knowledge and Data Engineering, November 2013

This paper discusses the bias problem when estimating the population size of big data such as online social networks (OSN) using uniform random sampling and simple random walk. Unlike the traditional estimation problem where the sample size is not very small relative to the data size, in big data, a small sample relative to the data size is already very large and costly to obtain. We point out that when small samples are used, there is a bias that is no longer negligible. This paper shows analytically that the relative bias can be approximated by the reciprocal of the number of collisions; thereby, a bias correction estimator is introduced. The result is further supported by both simulation studies and the real Twitter network that contains 41.7 million nodes.
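To make the role of collisions concrete, here is a minimal Python sketch, assuming uniform random sampling with replacement. It uses the familiar birthday-paradox style estimate (the number of sample pairs divided by the number of colliding pairs); this estimator and the function names are assumptions for illustration and may differ from the paper's formulation. The correction step simply applies the abstract's statement that the relative bias is roughly the reciprocal of the number of collisions, and should not be read as the paper's exact estimator.

```python
import random
from collections import Counter


def count_collisions(sample):
    """Number of unordered pairs of draws that hit the same item."""
    return sum(c * (c - 1) // 2 for c in Counter(sample).values())


def naive_size_estimate(sample):
    """Birthday-paradox style estimate: (n choose 2) / collisions."""
    n = len(sample)
    collisions = count_collisions(sample)
    if collisions == 0:
        raise ValueError("no collisions; sample too small to estimate size")
    return n * (n - 1) / 2 / collisions


def corrected_size_estimate(sample):
    """Assumed correction: if the relative bias is about 1/C, divide the
    naive estimate by (1 + 1/C), where C is the number of collisions."""
    collisions = count_collisions(sample)
    return naive_size_estimate(sample) / (1 + 1 / collisions)


# Small simulation with a hypothetical population of 100,000 items.
rng = random.Random(0)
population_size = 100_000
sample = [rng.randrange(population_size) for _ in range(2_000)]

print("collisions:", count_collisions(sample))
print("naive estimate:", round(naive_size_estimate(sample)))
print("corrected estimate:", round(corrected_size_estimate(sample)))
```

With only a handful of collisions the two estimates differ noticeably, which matches the abstract's point that the bias is not negligible when the sample is small relative to the data.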

Read the full article at IEEE Xplore...

Efficient Skyline Computation on Big Data

By Xixian Han, Jianzhong Li, Donghua Yang and Jinbao Wang

Published in IEEE Transactions on Knowledge and Data Engineering, November 2013

Skyline is an important operation in many applications to return a set of interesting points from a potentially huge data space. Given a table, the operation finds all tuples that are not dominated by any other tuples. It is found that the existing algorithms cannot process skyline on big data efficiently. This paper presents a novel skyline algorithm, SSPL, for big data. SSPL utilizes sorted positional index lists, which require low space overhead, to reduce I/O cost significantly. The sorted positional index list Lj is constructed for each attribute Aj and is arranged in ascending order of Aj. SSPL consists of two phases. In phase 1, SSPL computes the scan depth of the involved sorted positional index lists. While retrieving the lists in a round-robin fashion, SSPL prunes any candidate positional index whose corresponding tuple is not a skyline result. Phase 1 ends when there is a candidate positional index seen in all of the involved lists. In phase 2, SSPL exploits the obtained candidate positional indexes to get skyline results by a selective and sequential scan on the table. The experimental results on synthetic and real data sets show that SSPL has a significant advantage over the existing skyline algorithms.
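For readers unfamiliar with the skyline operation itself, the following minimal Python sketch computes a skyline by comparing every pair of tuples. It is a naive quadratic baseline and not the SSPL algorithm described in the abstract. It assumes that smaller values are preferred in every attribute; the hotel data (price, distance) is hypothetical.

```python
def dominates(a, b):
    """a dominates b if a is no worse in every attribute and strictly
    better in at least one (assuming smaller values are better)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline(points):
    """Return the points that are not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical hotels as (price, distance to the beach) tuples.
hotels = [(50, 8), (60, 2), (80, 1), (70, 5), (55, 9)]
result = skyline(hotels)
print(result)  # [(50, 8), (60, 2), (80, 1)]
```

Here (70, 5) is dominated by (60, 2) and (55, 9) is dominated by (50, 8), so neither appears in the result. The pairwise comparison touches every tuple many times, which is the kind of cost that index-based approaches aim to avoid on large tables.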

Read the full article at IEEE Xplore...

Grand Challenge: Applying Regulatory Science and Big Data to Improve Medical Device Innovation

By A.G. Erdman, D.F. Keefe and R. Schiestl

Published in IEEE Transactions on Biomedical Engineering, March 2013

Understanding how proposed medical devices will interface with humans is a major challenge that impacts both the design of innovative new devices and approval and regulation of existing devices. Today, designing and manufacturing medical devices requires extensive and expensive product cycles. Bench tests and other preliminary analyses are used to understand the range of anatomical conditions, and animal and clinical trials are used to understand the impact of design decisions upon actual device success. Unfortunately, some scenarios are impossible to replicate on the bench, and competitive pressures often accelerate initiation of animal trials without sufficient understanding of parameter selections. We believe that these limitations can be overcome through advancements in data-driven and simulation-based medical device design and manufacturing, a research topic that draws upon and combines emerging work in the areas of Regulatory Science and Big Data. We propose a cross-disciplinary grand challenge to develop and holistically apply new thinking and techniques in these areas to medical devices in order to improve and accelerate medical device innovation.

Read the full article at IEEE Xplore...
