School of Computing, Queen's University
23rd Jan 2013
Ideas, Rizwan Mian, 23.1.13
I have brainstormed some ideas that could serve as projects for a graduate-level cloud computing course. I categorize them into two classes, namely concrete and work-in-progress. As their names imply, concrete ideas have been given some thought and can easily be translated into a project (and more?), whereas work-in-progress ideas need further development.
Concrete (readily translatable into a project)
· 1. Workloads: Calculate the data intensity of the TPC-H queries. Data intensity may be defined as the ratio (time spent accessing data by a query)/(execution time of the query). Alternatively, the ratio can be simplified to (amount of disk IO)/(execution time of the query), which then requires measuring data transfer to the buffer pool and query execution time. Some initial exploration has been done, but it requires a follow-up.
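For illustration, a minimal Python sketch of both definitions of data intensity; the measurements in the example are hypothetical, not actual TPC-H numbers:

```python
def data_intensity(io_time_s, exec_time_s):
    """Data intensity as the fraction of execution time spent accessing data."""
    return io_time_s / exec_time_s

def data_intensity_io(bytes_read, exec_time_s):
    """Simplified variant: disk IO volume per second of execution (bytes/s)."""
    return bytes_read / exec_time_s

# Hypothetical measurements for one TPC-H query:
# 42 s accessing data out of a 60 s run, reading 1.2 GB into the buffer pool.
print(data_intensity(42.0, 60.0))       # a ratio in [0, 1]
print(data_intensity_io(1.2e9, 60.0))   # bytes per second of execution
```

Note that the two definitions are not interchangeable: the first is dimensionless, while the second has units and depends on disk bandwidth, so queries should only be ranked within one definition.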
· 2. Performance Prediction Models: I have recently developed models to predict the response time and throughput of workloads -- the data to build these models is publicly available. These models are VM-specific, and an opportunity exists to build a generic, VM-agnostic model. I believe the existing data could be reused in Weka to train and validate such a model. Some knowledge of machine learning is required.
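As a toy stand-in for the Weka workflow, the sketch below fits a one-feature least-squares model predicting response time from load; the training data is entirely hypothetical:

```python
def fit_linear(xs, ys):
    """Least-squares fit y = a*x + b for one feature, in pure Python."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

# Hypothetical training data: (requests/min, mean response time in ms).
load = [10, 20, 30, 40]
resp = [105, 210, 295, 410]
a, b = fit_linear(load, resp)
predict = lambda x: a * x + b
print(round(predict(25)))   # prints 255
```

A real VM-agnostic model would add VM attributes (vCPUs, memory, disk type) as extra features so one model covers all instance types, which is where Weka's regression learners come in.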
· 3. Performance variables (advanced): Ahmad et al. show that query interactions can have a significant impact on database system performance, and hence on query execution time. Recent literature [2; 3; 16; 22] models query interactions using an experimental approach that samples the space of possible interactions and builds a model on the sampled data. While this approach yields accurate prediction models for a physical machine, we feel that it does not capture the variance caused by other variables in the cloud. Suppose we have a large list of system variables; we can then identify the variables that affect the performance of a request type by computing the information gain provided by each variable. This is similar to Zhang et al., who apply an information-gain measure to identify a set of database monitor metrics that indicate potential congestion in a specific scenario.
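The information-gain calculation itself is standard; a self-contained sketch on made-up data (the variable names and labels are hypothetical):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, feature_values):
    """Reduction in entropy of `labels` after splitting on `feature_values`."""
    n = len(labels)
    by_value = {}
    for v, y in zip(feature_values, labels):
        by_value.setdefault(v, []).append(y)
    remainder = sum(len(part) / n * entropy(part) for part in by_value.values())
    return entropy(labels) - remainder

# Hypothetical data: is a query "slow", vs. two binary system variables.
slow     = ["yes", "yes", "no", "no"]
cpu_high = ["y",   "y",   "n",  "n"]   # perfectly predicts slowness
io_high  = ["y",   "n",   "y",  "n"]   # carries no information
print(information_gain(slow, cpu_high))  # 1.0
print(information_gain(slow, io_high))   # 0.0
```

Ranking the candidate system variables by this score would shortlist the ones worth including in a cloud-aware interaction model.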
· 4. Network Prediction Models: Wolski develops adaptive models for predicting network throughput and latency. These models, collectively known as the Network Weather Service, are generic and based on simple statistics such as the mean and median. An opportunity exists to port these models to the Amazon cloud and evaluate how effectively they predict network throughput and latency within different regions of the cloud. Initially, the scope of the models may be limited to the network of a single zone.
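A much-simplified sketch of the NWS idea -- forecast the next measurement from a sliding window, keeping whichever simple statistic has been more accurate so far. This is my own reduction of Wolski's adaptive scheme, not his implementation:

```python
from collections import deque
from statistics import mean, median

class SlidingForecaster:
    """Forecast the next measurement from a sliding window, preferring
    whichever of mean/median has accumulated less absolute error."""
    def __init__(self, window=10):
        self.history = deque(maxlen=window)
        self.err = {"mean": 0.0, "median": 0.0}

    def observe(self, value):
        if self.history:  # score both candidate forecasts on the new value
            self.err["mean"] += abs(mean(self.history) - value)
            self.err["median"] += abs(median(self.history) - value)
        self.history.append(value)

    def forecast(self):
        best = min(self.err, key=self.err.get)
        return mean(self.history) if best == "mean" else median(self.history)

f = SlidingForecaster(window=5)
for latency_ms in [12.0, 11.5, 30.0, 12.2, 11.9]:   # one outlier spike
    f.observe(latency_ms)
print(f.forecast())   # prints 12.0 (median wins; it shrugged off the spike)
```

On cloud latency traces with occasional spikes, the median tends to win exactly as in this toy run, which is part of why NWS tracks several forecasters at once.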
· 5. Virtualization: Schad et al. find that the Amazon cloud exhibits far greater variance than a physical cluster. This variance is a major issue in building workload performance models, and for database applications providing service-level agreements. Investigate the causes of variance in Xen virtualization, which is used by Amazon, and rank them according to their impact.
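The first step of such an investigation is simply quantifying the variance. One scale-free measure is the coefficient of variation (stdev/mean) over repeated runs of the same job; the runtimes below are invented for illustration:

```python
from statistics import mean, stdev

def coefficient_of_variation(samples):
    """stdev/mean: a scale-free measure of runtime variance, suitable for
    comparing a cloud instance against a physical machine."""
    return stdev(samples) / mean(samples)

# Hypothetical repeated runtimes (seconds) of the same job:
physical = [100, 101, 99, 100, 100]
cloud    = [100, 140, 85, 120, 95]
print(coefficient_of_variation(physical))   # small
print(coefficient_of_variation(cloud))      # noticeably larger
```

Computing this per suspected cause (instance type, time of day, co-located load) would give the ranking the idea asks for.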
· 6. Benchmarking: Find or implement a YCSB benchmark driver for MySQL. Then evaluate MySQL in the Amazon cloud with YCSB, and compare against the published benchmark results for PNUTS, BigTable, HBase and Cassandra. A MySQL Amazon Machine Image (AMI) is publicly available.
· 7. Benchmarking: Find or implement a YCSB driver for the Hadoop Distributed File System (HDFS). Then evaluate HDFS with YCSB, and compare against the published benchmark results for PNUTS, BigTable, HBase and Cassandra.
· 8. Amazon’s MapReduce: Execute the distributed grep and/or TeraSort benchmarks using Amazon’s Elastic MapReduce, and compare the results with Google’s MapReduce.
· 9. MapReduce: MapReduce (MR) exploits data locality to enable large-scale data processing. In MR, every data block has three copies. Placing these data blocks on commodity servers such that data movement at execution time is minimized is an optimization problem, subject to constraints such as no one machine holding more than a single copy of a data block. Use Linear Programming (LP) to find the optimal placement of data on the servers. The work can be evaluated analytically, or in an MR simulation.
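To make the optimization concrete, here is a tiny exhaustive-search sketch of the same model: it places 3 replicas of each block on distinct servers (enforcing the no-duplicate-copies constraint via combinations) and minimises a hypothetical remote-access cost. A real project would express this as an LP/ILP and add capacity constraints that couple the blocks; this brute force only illustrates the objective and constraint:

```python
from itertools import combinations

def optimal_placement(n_servers, blocks, access_cost):
    """Place 3 replicas of each block on distinct servers, minimising
    total access cost. access_cost[b][s] is the (made-up) cost of
    serving block b's tasks from server s."""
    best = None

    def search(i, placement, cost):
        nonlocal best
        if best and cost >= best[0]:   # prune branches that cannot improve
            return
        if i == len(blocks):
            best = (cost, dict(placement))
            return
        b = blocks[i]
        # combinations() picks 3 *distinct* servers: the constraint that
        # no machine holds two copies of the same block.
        for servers in combinations(range(n_servers), 3):
            c = min(access_cost[b][s] for s in servers)  # serve from cheapest replica
            placement[b] = servers
            search(i + 1, placement, cost + c)
        del placement[b]

    search(0, {}, 0.0)
    return best

cost, placement = optimal_placement(
    4, ["b0", "b1"],
    {"b0": [1.0, 5.0, 5.0, 5.0], "b1": [5.0, 5.0, 1.0, 5.0]})
print(cost, placement)
```

Each block's replica set ends up covering its cheap server (server 0 for b0, server 2 for b1), giving total cost 2.0; an LP solver would find the same optimum without enumerating placements.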
· 10. Simulation: Ranganathan et al. develop a suite of job scheduling and data replication algorithms for grid computing, evaluated in simulation. They find that it is important to evaluate the combined effectiveness of replication and scheduling strategies, rather than studying them separately. An opportunity exists to perform a similar exercise for cloud computing using a simulation toolkit such as CloudSim.
· 11. Simulation: Find a simulation toolkit that simulates data-intensive workload execution over an IaaS cloud such as Amazon's. Evaluate how realistic it is by comparing its output against workloads executed in the Amazon cloud.
· 12. Simulation: Evaluate how realistic MR simulation is by simulating some workloads and comparing the results to actual workload executions. There are many examples of MapReduce executions; however, a detailed comparison with a few of them will suffice.
· 13. Elastic Storage: Lim et al. propose elastic control of the storage tier for multi-tier application services, in which adding or removing a storage node requires rebalancing the stored data across nodes. Lim et al. use mathematical models (integral control and multivariate regression) to build a controller that supervises data rebalancing. Alternative models (such as those based on utility functions or fuzzy logic) could be used to control the rebalancing. Explore and evaluate one such model.
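As a flavour of the utility-function alternative, a minimal sketch with an entirely hypothetical utility (performance benefit that saturates at full capacity, minus a per-node cost); this stands in for, and is much simpler than, Lim et al.'s integral controller:

```python
def utility(n_nodes, demand, perf_weight=1.0, cost_per_node=0.1):
    """Hypothetical utility: fraction of demand served minus node cost.
    Capacity is assumed to grow linearly with the node count."""
    capacity = n_nodes * 100.0          # requests/s each node can absorb
    served = min(demand, capacity)
    return perf_weight * served / demand - cost_per_node * n_nodes

def choose_nodes(demand, max_nodes=20):
    """Pick the node count that maximises utility for the observed demand."""
    return max(range(1, max_nodes + 1), key=lambda n: utility(n, demand))

print(choose_nodes(250.0))   # prints 3: 3 nodes cover 250 req/s, a 4th only adds cost
```

The interesting project work is then in the utility shape itself, e.g. penalising the data-rebalancing cost of each add/remove so the controller does not oscillate.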
Work-in-progress (needs developing)
· 1. Cloudify an application: Pick a scientific application from your domain, e.g. Weka, and expose it on a cloud. Examine how it benefits from the cloud, and document the technical aspects, benefits and limitations.
· 2. Information portal: Use a shared resource (such as the course Wiki) to collect and share information: useful links for the Amazon cloud, experience with Amazon, a growing glossary, etc.
· 3. Benchmarks: Develop new cloud benchmarks, for example for availability and/or replication, as suggested by Cooper et al.
· 4. Monitoring: Identify and/or develop monitoring tools for the Amazon cloud that provide comprehensive and detailed information about workload execution progress and resource utilization.
· 5. Amazon cloud architecture: Amazon does not share its cloud architecture publicly. It may be possible to comb the literature to reverse-engineer some aspects of the architecture.
· 6. Cost models: Develop cost models for Amazon storage services (e.g. Dynamo, RDS, SimpleDB). The models could be parameterized by workload type, workload volume, and SLA.
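The skeleton of such a model is a function of the workload parameters. A minimal sketch; every price below is an illustrative placeholder, not a current Amazon rate:

```python
def monthly_storage_cost(gb_stored, requests, gb_transferred_out,
                         price_gb=0.10, price_per_1k_req=0.01,
                         price_gb_out=0.12):
    """Hypothetical parameterised cost model: storage + requests + egress.
    Prices are made-up placeholders for illustration only."""
    return (gb_stored * price_gb
            + requests / 1000.0 * price_per_1k_req
            + gb_transferred_out * price_gb_out)

# A read-heavy workload: 500 GB stored, 2M requests, 50 GB egress per month.
print(round(monthly_storage_cost(500, 2000000, 50), 2))
```

The project would fit the price parameters per service and extend the signature with SLA terms (e.g. a penalty term when response-time targets are missed), so workloads can be compared across Dynamo, RDS and SimpleDB under one model.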
· 7. Workloads: Identify and/or develop new workloads that are data-intensive and/or dynamic.
 Known applications of MapReduce. http://www.dbms2.com/2008/08/26/known-applications-of-mapreduce/.
 Ahmad, M., Aboulnaga, A., and Babu, S., 2009. Query interactions in database workloads. In Proceedings of the Second International Workshop on Testing Database Systems ACM, Providence, Rhode Island, US, 1-6.
 Ahmad, M., Duan, S., Aboulnaga, A., and Babu, S., 2010. Interaction-aware prediction of business intelligence workload completion times. In Data Engineering (ICDE), 2010 IEEE 26th International Conference on, Long Beach, California, US, 413-416.
 Amazon, Elastic MapReduce. http://aws.amazon.com/elasticmapreduce/.
 Barham, P., Dragovic, B., Fraser, K., Hand, S., Harris, T., Ho, A., Neugebauer, R., Pratt, I., and Warfield, A., 2003. Xen and the art of virtualization. SIGOPS Oper. Syst. Rev. 37, 5, 164-177.
 Calheiros, R., Ranjan, R., Beloglazov, A., DeRose, C., and Buyya, R., 2011. CloudSim: a toolkit for modeling and simulation of cloud computing environments and evaluation of resource provisioning algorithms. Softw. Pract. Exper. 41, 1, 23-50.
 Cooper, B.F., Silberstein, A., Tam, E., Ramakrishnan, R., and Sears, R., 2010. Benchmarking cloud serving systems with YCSB. In 1st ACM Symposium on Cloud Computing (SoCC) Association for Computing Machinery, Indianapolis, IN, United states, 143-154. http://dx.doi.org/10.1145/1807128.1807152.
 Dean, J. and Ghemawat, S., 2008. MapReduce: simplified data processing on large clusters. Communications of the ACM 51, 1, 107-113.
 Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., and Witten, I.H., 2009. The WEKA data mining software: An update. ACM SIGKDD Explorations Newsletter 11, 1, 10-18.
 Hammoud, S., Maozhen, L., Yang, L., Alham, N.K., and Zelong, L., 2010. MRSim: A discrete event based MapReduce simulator. In 2010 Seventh International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), 2993-2997.
 Lim, H., Babu, S., and Chase, J., 2010. Automated Control for Elastic Storage. In International Conference on Autonomic Computing (ICAC).
 Mian, R., 2011. Managing Data-Intensive Workloads in a Cloud (Ph.D. Depth Paper). School of Computing, Queen's University. http://research.cs.queensu.ca/TechReports/Reports/2011-580.pdf.
 Mian, R., 2012. dataservice++, U64b-tpc-c-e-h-ebs(boot)-MySQL5.5(64)bp-r5 (ami-7bc16e12). http://thecloudmarket.com/owner/966178113014.
 Mian, R., 2013. buffer pool read/write? http://forums.mysql.com/read.php?24,576829,576829#msg-576829.
 Mian, R., Martin, P., and Vazquez-Poletti, J.L., 2012. Provisioning data analytic workloads in a cloud. Future Generation Computer Systems (FGCS), in press http://dx.doi.org/10.1016/j.future.2012.1001.1008.
 Mian, R., Martin, P., and Vazquez-Poletti, J.L., 2013. Towards Building Performance Models for Data-intensive Workloads in Public Clouds. In 4th ACM/SPEC International Conference on Performance Engineering (ICPE) ACM, Prague, Czech Republic, accepted.
 Mian, R., Martin, P., Zulkernine, F., and Vazquez-Poletti, J.L., 2012. Estimating Costs of Data-intensive Workload Execution in Public Clouds. In 10th International Workshop on Middleware for Grids, Clouds and e-Science (MGC) in conjunction with ACM/IFIP/USENIX 13th International Middleware Conference 2012 ACM, Montreal, Quebec, Canada, 3. http://dl.acm.org/citation.cfm?id=2405139.
 Ranganathan, K. and Foster, I., 2003. Simulation Studies of Computation and Data Scheduling Algorithms for Data Grids. Journal of Grid Computing 1, 1, 53.
 Ruiz-Alvarez, A. and Humphrey, M., 2012. A Model and Decision Procedure for Data Storage in Cloud Computing. In Cluster, Cloud and Grid Computing (CCGrid), 2012 12th IEEE/ACM International Symposium on, Ottawa, Canada, 572-579. http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=6217468&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D6217468.
 Schad, J., Dittrich, J., and Quiane-Ruiz, J.-A., 2010. Runtime measurements in the cloud: observing, analyzing, and reducing variance. Proc. VLDB Endow. 3, 1-2, 460-471.
 Shvachko, K., Hairong, K., Radia, S., and Chansler, R., 2010. The Hadoop Distributed File System. In Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on, 1-10. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5496972&tag=1.
 Tozer, S., Brecht, T., and Aboulnaga, A., 2010. Q-Cop: Avoiding bad query mixes to minimize client timeouts under heavy loads. In IEEE 26th International Conference on Data Engineering (ICDE), Long Beach, CA, USA, 397-408.
 TPC-H, Decision Support Benchmark. http://www.tpc.org/tpch/.
 Wolski, R., 1997. Forecasting Network Performance to Support Dynamic Scheduling Using the Network Weather Service. In Proceedings of the Sixth IEEE International Symposium on High Performance Distributed Computing., Washington, DC, USA, 316-326.
 Zhang, M., Martin, P., Powley, W., Bird, P., and McDonald, K., 2012. Discovering Indicators for Congestion in DBMSs. In Proceedings of the International Workshop on Self-Managing Database Systems (SMDB’12) in Conjunction with the International Conference on Data Engineering (ICDE’12), Washington, DC, USA, in press.