There are multiple CSE6242 sections. This is the course homepage for campus CSE6242A,Q/CX4242A.

CSE6242A,Q/CX4242A, Fall 2018
Data and Visual Analytics

Georgia Tech, College of Computing

4:30 - 5:45pm, Clough 152, Tue & Thu
Prof. Duen Horng (Polo) Chau
This course will introduce you to broad classes of techniques and tools for analyzing and visualizing data at scale. It emphasizes how to combine computation and visualization so that each complements the other in effective analysis. We will cover methods from each side, and hybrid ones that combine the best of both worlds. Students will work in small teams to complete a significant project exploring novel approaches for interactive data & visual analytics.

Course Goals

  • Learn visual and computational techniques and tools for typical data types
    • Learn how the two kinds of methods complement each other
    • Gain a breadth of knowledge
  • Work on real datasets and problems
  • Learn practical know-how (useful for jobs, research) through significant hands-on programming assignments

Acknowledgement

We thank Amazon Web Services and Microsoft Azure for their generous support in providing free cloud credits, Intel for supporting curriculum development of the memory mapping module (scaling up algorithms with virtual memory), and Tableau for data visualization software.

Announcements and Discussion

The fastest way to get help with homework assignments is to post your questions on Piazza. That way, not only our TAs and instructor can help; your peers can too.

If you prefer that your question be addressed only to our TAs and the instructor, you can use the private post feature (i.e., select the "Individual Student(s) / Instructor(s)" radio button).

While we welcome everyone to share their experiences in tackling issues and to help each other out, please do not post your answers, as that may affect the learning experience of your fellow classmates.

For special cases such as failed submissions due to system errors, missing grades, failed file uploads, emergencies that prevent you from submitting, or personal issues, you can contact the staff using a private Piazza post.

Course Staff & Office Hours

TAs will hold office hours starting week 2, except on Georgia Tech holidays (e.g., Thanksgiving, MLK Day, spring break). Each office hour session will be run by at least one TA and is 1 hour long. See GT’s academic calendar for the full list of holidays (https://registrar.gatech.edu/calendar). We will spread the office hours across weekdays.

Please note that you are always welcome to ask questions on Piazza. Office hours supplement Piazza, and do not replace it.

  • Polo Chau: Tue, 3:30 - 4pm, Klaus 1324 (Polo's office), plus FREE after-class coffee at Clough Starbucks
  • Neetha Ravishankar: Mon, 12:30 - 1:30pm
  • Jennifer Ma: Tue, 11am - 12pm
  • Mansi Mathur: Tue, 11am - 12pm
  • Arathi Arivayutham (Head TA): Wed, 4 - 5pm
  • Vineet Vinayak Pasupulety: Wed, 4 - 5pm
  • Siddharth Gulati: Mon, 12:30 - 1:30pm

All TA office hours are held in the open area outside Polo's office.

Course Schedule (evolving)

All times are in the Eastern time zone.
Week 1 (Aug 21, 23)
  • Topics: Course Introduction; Analytics Building Blocks; Data Science Buzzwords; Data Collection
  • Tue/Thu: intro; building blocks, buzzwords, data collection

Week 2 (Aug 28, 30)
  • Topics: SQLite; Data Cleaning; Class Project Overview; Code Back-up & Version Control
  • Tue/Thu: SQLite, git; cleaning, project overview
  • Homework: HW1 out Fri, Aug 31

Week 3 (Sept 4, 6)
  • Topics:
    • Example projects: (1) Firebird: Predicting Fire Risks in Atlanta, by Shang-Tse Chen; (2) PASSAGE: A Travel Safety Assistant, by Nilaksh Das
    • Data Integration
  • Tue/Thu: Firebird, PASSAGE, project overview; data integration, vis 101

Week 4 (Sept 11, 13)
  • Topics: Visualization 101; Data Visualization for Web (D3)
  • Tue/Thu: D3 cont'd
  • Homework: HW1 due Fri, Sept 14, 11:55pm; HW2 out Fri, Sept 14
  • Project: Form project teams by Fri, Sept 14, 11:55pm

Week 5 (Sept 18, 20)
  • Topics: Fixing Common Visualization Issues; Data Analytics, Concepts and Tasks; Overview of project proposal and presentation
  • Tue/Thu: fix vis; publication-fig; analytics tasks

Week 6 (Sept 25, 27)
  • Topics: Scalable Computing: Hadoop; Scalable Computing: Pig; Scalable Computing: Hive
  • Tue/Thu: hadoop; pig; hive; spark

Week 7 (Oct 2, 4)
  • Topics: Scalable Computing: Spark; Scalable Computing: HBase; Classification: concepts, cross-validation, k-NN, decision trees
  • Tue/Thu: hbase; classification
  • Homework: HW2 due Fri, Oct 5, 11:55pm; HW3 out Fri, Oct 5

Week 8 (Oct 9, 11)
  • Topics: Visualization for Classification: ROC, AUC, confusion matrix; Introduction to Clustering: k-means, hierarchical clustering, DBSCAN, vis
  • Tue/Thu: Fall recess; classification; classification-vis

Week 9 (Oct 16, 18)
  • Topics: Project proposal presentation
  • Tue/Thu: Show time! Show time!
  • Project: Proposal document due Mon, Oct 15, 11:55pm; Proposal presentation slides due Mon, Oct 15, 11:55pm

Week 10 (Oct 23, 25)
  • Topics: Ensemble Method: bagging, random forests
  • Tue/Thu: clustering; bagging, random forest; graph laws
  • Homework: HW3 due Fri, Oct 26, 11:55pm

Week 11 (Oct 30, Nov 1)
  • Topics: Graph Analytics: centrality; algorithms-(personalized) PageRank; interactive applications; Scaling up Algorithms with Virtual Memory
  • Tue/Thu: centrality, pagerank; mmap
  • Homework: HW4 out Fri, Nov 2
  • Project: Progress Report due Fri, Nov 2, 11:55pm

Week 12 (Nov 6, 8)
  • Topics: Text Analytics: concepts, algorithms (LSI=SVD)
  • Tue/Thu: X; text algorithms

Week 13 (Nov 13, 15)

Week 14 (Nov 20, 22)
  • Topics: Thanksgiving
  • Tue/Thu: X; X

Week 15 (Nov 27, 29)
  • Topics: Time series: algorithms, visualization, & applications; Project poster presentation
  • Tue/Thu: Poster presentation. 4:30pm to 5:45pm-ish. Klaus Atrium. Pizza + drinks served!
  • Homework: HW4 due Mon, Nov 26, 11:55pm

Week 16 (Dec 4)
  • Topics: Lessons learned and closing words
  • Tue/Thu: X
  • Project: Final report due Tue, Dec 4, 11:55pm

This course can be very tough for many!

WARNING! You are expected to quickly learn many things simultaneously, and for some materials you will need to learn them on your own (e.g., Linux commands, for working with MS Azure/Amazon AWS). This can be very intimidating for many students.

The amount of time students spend on this class varies greatly, depending on their backgrounds and what they already know. Some former students told us they spent about 40-60 hours on each homework assignment (we have 4 big assignments, and no exams), and some reported much less. For example, for the homework assignment about D3 visualization programming, students who are completely new to JavaScript, CSS, and HTML will likely spend significantly more time than their peers who have already tried them. Some former students who do not have a computer science background found that the homework assignments were challenging and took significant time and effort, but were rewarding, fun, and "do-able."

Students have at least 2 weeks to complete each homework assignment. Some students waited until the last week, and could not finish. It is critical to plan ahead and prepare for the significant time needed.

Almost all homework assignments involve a very large amount of programming (which naturally means that a lot of debugging will likely be needed, and that can be time-consuming). You should be proficient in at least one high-level programming language (e.g., Python, C++, Java), and be efficient with debugging principles and practices. If not, you should NOT take this course. Instead, you should first take CSE 6040 (for OMS Analytics students) and, if needed, CS 1301 and CS 1371 as well.

Some programming assignments involve high-level languages or scripting (e.g., Python, Java, SQL). Some assignments involve web programming and D3 (e.g., JavaScript, CSS, HTML). For example, an assignment on Hadoop and Spark may require you to learn some basic Java and Scala quickly, which should not be too challenging if you already know another high-level language like Python or C++. It is unlikely that you already know all the tools and skills needed for the programming tasks, so you are expected to learn many of them on the fly.

Basic linear algebra, probability and statistics knowledge is also expected.

Homework

We have 4 big assignments in total (subject to change). Visit this course's Canvas site for the assignment documents. See the schedule table above for deliverable due dates.
  • [10%] HW1: Collecting & visualizing data, SQLite, D3 warmup, OpenRefine, Web Development with Flask and jQuery
  • [15%] HW2: D3 Graphs and Visualization
  • [15%] HW3: Hadoop, Spark, Pig and Azure
  • [10%] HW4: Scalable PageRank via Virtual Memory (MMap), Random Forest, Scikit-Learn

Project

See project description. See the schedule table above for deliverable due dates.

Distance Learning Students (Q Section)

A standard 3-day lag applies to all homework and project deliverables. For the project presentation, a group that has a DL student member can choose to:
  1. Present in class without 3-day lag; or 
  2. Submit a video presentation with 3-day lag (e.g., screen capture)

Grading Policy

  1. There will be 4 homework assignments. Together, they are worth 50% (10%, 15%, 15%, 10%) of the course grade.
  2. There will be one course group project worth 50% of the course grade. The project components are:
    1. Proposal (7.5% of course grade)
    2. Proposal presentation (5%)
    3. Progress report (5%)
    4. Final poster presentation (7.5%)
    5. Final report (25%)
  3. You must achieve an overall weighted average of 60% to pass the course.
  4. All deliverables will be graded by our TAs, except the project poster presentation, which will be peer-graded.
  5. When assigning course grades, I will start with the standard grade thresholds (90, 80, etc.). I may lower (and never raise) the thresholds (i.e., to your benefit). For example, I may use 88 instead of 90.
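
The weights in the grading policy above can be combined into a single course percentage. The following is a minimal Python sketch of that arithmetic, not the official grade calculation. The names `WEIGHTS`, `weighted_average`, and `passes` are hypothetical; scores are assumed to be percentages from 0 to 100, and a missing deliverable is assumed to count as 0.

```python
# Weights (percent of course grade) copied from the grading policy above.
WEIGHTS = {
    "HW1": 10.0,
    "HW2": 15.0,
    "HW3": 15.0,
    "HW4": 10.0,
    "Proposal": 7.5,
    "Proposal presentation": 5.0,
    "Progress report": 5.0,
    "Final poster presentation": 7.5,
    "Final report": 25.0,
}


def weighted_average(scores):
    """Return the overall course percentage.

    `scores` maps a deliverable name to a score in percent (0-100).
    Assumption: a deliverable missing from `scores` counts as 0.
    """
    # The weights should cover the whole course grade.
    assert abs(sum(WEIGHTS.values()) - 100.0) < 1e-9
    total = sum(weight * scores.get(name, 0.0) for name, weight in WEIGHTS.items())
    return total / 100.0


def passes(scores, threshold=60.0):
    """Check the overall weighted average against the 60% passing mark."""
    return weighted_average(scores) >= threshold
```

Under these assumptions, full marks on all four homework assignments with no project deliverables gives an overall average of 50, which is below the 60% passing mark.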

Deliverable Due Dates

All homework and project deliverables will be due at the times shown in the Course Schedule. These times are subject to change so please check back often.

Plagiarism, Collaboration Policy, and Student Honor Code

  1. All course participants (myself, teaching assistants, and learners) are expected to know and abide by the Georgia Tech Academic Honor Code.
  2. Ethical behavior is extremely important in all facets of life.
    • Plagiarism is a serious offense. You are responsible for completing your own work. You are not allowed to copy and paste, or paraphrase, or submit materials created or published by others, as if you created the materials. All materials submitted must be your own.
    • You may discuss high-level ideas with other students at the "whiteboard" level (e.g., how cross validation works, or using a hashmap instead of an array) and review any relevant materials online. However, each student must write up and submit his or her own answers.
    • All incidents of suspected dishonesty, plagiarism, or violations of the Georgia Tech Honor Code will be subject to the institute’s Academic Integrity procedures (e.g., reported to and directly handled by the Office of Student Integrity (OSI)). Consequences can be severe, e.g., academic probation or dismissal, grade penalties, a 0 grade for assignments concerned, and prohibition from withdrawing from the class.

Late Policy

  • Every homework assignment deliverable and every project deliverable comes with a 48-hour "grace period" (except in-class activities like the proposal presentation and poster presentation). Such a deliverable may be submitted (and resubmitted) up to 48 hours after the official deadline without penalty. You do not need to ask before using this grace period.
  • Any deliverable submitted after the grace period will get zero credit.
  • We will not consider late submission of any missing parts of a deliverable. To make sure you have submitted everything, download your submitted files to double check.
  • There are no penalties for medical reasons or emergencies. Should they arise, you must contact the Dean of Students office. Doctor's notes, medical documentation, explanations of emergencies, etc. should be submitted to the Dean’s office. After their office receives the information, they will notify me on your behalf.
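
The grace-period rule above amounts to a simple date comparison. Here is a minimal Python sketch of it, for illustration only. The function name `earns_credit` is hypothetical, a submission made exactly at the end of the grace period is assumed to still count, and both times are assumed to be in the same (Eastern) time zone. It does not model in-class activities or the medical and emergency exceptions.

```python
from datetime import datetime, timedelta

# 48-hour grace period from the late policy above.
GRACE_PERIOD = timedelta(hours=48)


def earns_credit(deadline, submitted_at, grace=GRACE_PERIOD):
    """Return True if the submission is within the deadline plus the grace period.

    Assumption: submitting exactly at the end of the grace period still counts.
    Both datetimes are assumed to be naive values in the same time zone.
    """
    return submitted_at <= deadline + grace


# Example: HW1 is due Fri, Sept 14, 2018, 11:55pm, so the grace period
# ends Sun, Sept 16, 2018, 11:55pm.
hw1_deadline = datetime(2018, 9, 14, 23, 55)
print(earns_credit(hw1_deadline, datetime(2018, 9, 16, 23, 55)))  # True
print(earns_credit(hw1_deadline, datetime(2018, 9, 16, 23, 56)))  # False
```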

Timing Policy

  • The course videos follow a logical sequence that includes knowledge-building and experience-building (assignments).
  • Assignments should be completed by their due dates, in order for timely peer assessment. Peer assessments should also be completed by their due dates, to give timely feedback.
  • You will have access to the course content for the scheduled duration of the course.

Netiquette

  • Netiquette refers to etiquette that is used when communicating on the Internet. Review the Core Rules of Netiquette. When you are communicating via email, discussion forums or synchronously (real-time), please use correct spelling, punctuation and grammar consistent with the academic environment and scholarship [1].
  • We expect all participants (learners, faculty, teaching assistants, staff) to interact respectfully. Learners who do not adhere to this guideline may be removed from the course.

[1] Conner, P. (2006-2014). Ground Rules for Online Discussions. Retrieved 4/21/2014 from http://teaching.colostate.edu/tips/tip.cfm?tipid=128

Dataset Ideas (may need API, or scraping)

Resources

All content and course materials can be accessed online. There is no textbook for this course. All Georgia Tech students have FREE access to https://www.safaribooksonline.com, where you can find a huge number of highly rated and classic books on a wide variety of computer science topics.

Software engineering; become a better programmer and developer

D3 Visualization; JavaScript

Big Data

We also recommend the following books and resources.

Python

Data science, machine learning, data mining

Visualization

SQL

Probability

Human Computation

Office of Disability Services

The Office of Disability Services offers accommodations for students with disabilities. Please contact the office should you need help.

Prerequisites

Review Polo's "warnings" before taking this course.

Additional formal prerequisites for CSE 6242

None, but you should have taken courses similar to those listed in the next section, at Georgia Tech or at another school.

If you are an Analytics (OMS or campus) degree student, you should first take CSE 6040 and do very well in it; if necessary, please also first take CS 1301.

Additional formal prerequisites for CX 4242

(Undergraduate Semester level MATH 2605 Minimum Grade of D or
Undergraduate Semester level MATH 2401 Minimum Grade of D or
Undergraduate Semester level MATH 24X1 Minimum Grade of D)
and
(Undergraduate Semester level MATH 3215 Minimum Grade of D or
Undergraduate Semester level MATH 3225 Minimum Grade of D or
Undergraduate Semester level ECE 3077 Minimum Grade of D or
Undergraduate Semester level ISYE 2027 Minimum Grade of D)
and
(Undergraduate Semester level CS 1371 Minimum Grade of C or
Undergraduate Semester level CS 1372 Minimum Grade of C or
Undergraduate Semester level CX 4010 Minimum Grade of C or
Undergraduate Semester level CX 4240 Minimum Grade of C)

Auditing & Pass/Fail

Due to the class size, I am not offering auditing or pass/fail options.

Previous offerings

Spring 2018 - CSE 6242 / CX 4242 – Polo Chau
Spring 2018 - CSE 6242 OAN (for OMS Analytics students only) – Polo Chau
Fall 2017 - CSE 6242 / CX 4242 – Polo Chau
Spring 2017 - CSE 6242 / CX 4242 – Polo Chau
Fall 2016 - CSE 6242 / CX 4242 – Polo Chau
Spring 2016 - CSE 6242 / CX 4242 – Polo Chau
Fall 2015 - CSE 6242 / CX 4242 – Polo Chau
Spring 2015 - CSE 6242 / CX 4242 – Polo Chau
Fall 2014 - CSE 6242 / CX 4242 – Polo Chau
Spring 2014 - CSE 6242 / CX 4242 – Polo Chau
Spring 2013 - CSE 6242 / CS 4803-DVA – Polo Chau
Spring 2011 - CSE 8803-DVA / CS 4803-DVA - Guy Lebanon
Spring 2010 - CSE 8803-DVA - Guy Lebanon

Acknowledgment & Related Classes

We thank Intel for supporting curriculum development of the memory mapping module (scaling up algorithms with virtual memory).

We thank Amazon Educate for providing free cloud credit for Amazon Web Services. We are excited to be an AWS partner university and part of AWS Educate's private beta.

We thank Microsoft Azure for a special grant providing free cloud credit.

We thank the Tableau for Teaching program for providing data visualization software.

Many thanks to my colleagues for sharing their course materials:
  • Prof. John Stasko - Information Visualization - Fall 2012
  • Prof. Jeff Heer - Research Topics in Interactive Data Analysis - Spring 2011
  • Prof. Christos Faloutsos - Multimedia Databases and Data Mining - Fall 2012