There are multiple CSE6242 sections. This is the course homepage for campus CSE6242A,Q,R/CX4242A.
This course will introduce you to broad classes of techniques and tools for analyzing and visualizing data at scale. It emphasizes how to combine computation and visualization to perform effective analysis. We will cover methods from each side, and hybrid ones that combine the best of both worlds. Students will work in small teams to complete a significant project exploring novel approaches for interactive data & visual analytics.

Course Goals

  • Learn visual and computational techniques and tools for typical data types
    • Learn how to complement each kind of method
    • Gain a breadth of knowledge
  • Work on real datasets and problems
  • Learn practical know-how (useful for jobs, research) through significant hands-on programming assignments

Acknowledgement

We thank Amazon Web Services and Microsoft Azure for generously providing free cloud credits, Intel for supporting curriculum development of the memory mapping module (scaling up algorithms with virtual memory), and Tableau for providing data visualization software.

Announcements and Discussion

The fastest way to get help with homework assignments is to post your questions on Piazza. That way, not only can our TAs and instructor help, but your peers can too.

If you prefer that your question be addressed only to our TAs and the instructor, you can use the private post feature (i.e., check the "Individual Student(s) / Instructor(s)" radio box).

While we welcome everyone to share their experiences in tackling issues and to help each other out, please do not post your answers, as that may affect the learning experience of your fellow classmates.

For special cases such as failed submissions due to system errors, missing grades, failed file uploads, emergencies that prevent you from submitting, or personal issues, you can contact the staff using a private Piazza post.

Course Staff & Office Hours

TAs plan to hold office hours starting week 2, except on Georgia Tech holidays (e.g., Thanksgiving, MLK Day, spring break). Each office hour session will be run by at least one TA and is 1 hour long. See GT’s academic calendar for the full list of holidays (https://registrar.gatech.edu/calendar). We will spread the office hours across weekdays and across times of day. We will announce the office hour times.

We will hold office hours via Slack, where the TA running the office hour will be responsive. We will share information about how to join the appropriate Slack group.

Please note that you are always welcome to ask questions on Piazza. Office hours supplement Piazza, and do not replace it.

Course Schedule

For all dates used in this course, the time is 23:59 Anywhere on Earth (11:59 pm AoE), unless stated otherwise. For example, a due date of "January 8" is the same as "January 8, 23:59 AoE". Convert the times to your local time using a Time Zone Converter.
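As an alternative to an online converter, the AoE rule can be sketched in a few lines of Python. This is a minimal illustration, not an official course tool: it assumes AoE is the fixed offset UTC-12, and the function names `aoe_deadline` and `to_local`, the year 2021, and the fixed UTC-5 offset for Eastern Standard Time are all assumptions made for the example.

```python
from datetime import date, datetime, time, timedelta, timezone, tzinfo

# "Anywhere on Earth" (AoE) is a fixed offset of UTC-12 with no daylight saving.
AOE = timezone(timedelta(hours=-12), name="AoE")


def aoe_deadline(due: date) -> datetime:
    """Return the 23:59 AoE deadline for a due date as a timezone-aware datetime."""
    return datetime.combine(due, time(23, 59), tzinfo=AOE)


def to_local(deadline: datetime, tz: tzinfo) -> datetime:
    """Express the same instant in another time zone."""
    return deadline.astimezone(tz)


if __name__ == "__main__":
    # Hypothetical example: a due date of February 12 (year assumed to be 2021),
    # viewed in Eastern Standard Time, modeled here as a fixed UTC-5 offset.
    est = timezone(timedelta(hours=-5), name="EST")
    print(to_local(aoe_deadline(date(2021, 2, 12)), est))
```

Under these assumptions, a Friday, Feb 12 due date falls at 06:59 on Saturday in Eastern Standard Time, which agrees with the "(Sat, 06:59 ET)" note in the schedule. On Python 3.9 or later, passing `zoneinfo.ZoneInfo("America/New_York")` as `tz` would also account for daylight saving time.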
On Tuesdays and Thursdays, 3-4 pm Eastern, we will hold a combined virtual coffee chat and Q&A.
Each week below lists its dates, topics, homework (HW) milestones, and project milestones.

Week 0 (Jan 14-15)
  Topics:
    • Course Introduction [slides]
  Live session: Course Intro + Q&A

Week 1 (Jan 18-22)
  Topics:
    • Analytics Building Blocks [slides]
    • Data Science Buzzwords [slides]
    • Data Collection [slides]
    • SQLite [slides]
    • Data Cleaning [slides]
  Homework: HW1 out Fri, Jan 22

Week 2 (Jan 25-29)
  Topics:
    • Class Project Overview [slides]
      • Example project: Firebird - Predicting Fire Risks in Atlanta [2min | 20min]
    • Code Back-up & Version Control [slides]
    • Data Integration [slides]
    • Data Analytics, Concepts and Tasks [slides]

Week 3 (Feb 1-5)
  Topics:
    • Visualization 101 [slides]
    • Fixing Common Visualization Issues [slides]

Week 4 (Feb 8-12)
  Topics:
    • Data Visualization for Web (D3) [slides]
  Homework:
    • HW1 due Fri, Feb 12 (Sat, 06:59 ET)
    • HW2 out Fri, Feb 12

Week 5 (Feb 15-19)
  Topics:
    • Scalable Computing: Hadoop [slides]
    • Scalable Computing: Pig [slides]
    • Scalable Computing: Hive [slides]
  Project: Form project teams by Fri, Feb 19

Week 6 (Feb 22-26)
  Topics:
    • Scalable Computing: Spark [slides]
    • Scalable Computing: HBase [slides]

Week 7 (Mar 1-5)
  Topics:
    • Classification [slides]
    • Visualization for Classification [slides]
  Homework:
    • HW2 due Fri, Mar 5
    • HW3 out Fri, Mar 5

Week 8 (Mar 8-12)
  Topics:
    • Introduction to Clustering [slides]
  Project:
    • Proposal Document due Fri, Mar 12
    • Proposal Presentation Slides and Video due Fri, Mar 12

Week 9 (Mar 15-19)
  Topics:
    • No class on March 16
    • Graph Analytics [slides] [slides]
    • Ensemble Method [slides]
    • Scaling up Algorithms with Virtual Memory [slides]

Week 10 (Mar 22-26)
  Topics: [Work on Project]
  Homework:
    • HW3 due Fri, Mar 26
    • HW4 out Fri, Mar 26

Week 11 (Mar 29-Apr 2)
  Topics: [Work on Project]
  Project: Progress Report due Fri, Apr 2 (Sat, 07:59 ET)

Week 12 (Apr 5-9)
  Topics:
    • Text Analytics [slides]

Week 13 (Apr 12-16)
  Topics: [Work on Project]
  Homework: HW4 due Fri, Apr 16

Week 14 (Apr 19-23)
  Topics:
    • Course Review
  Project:
    • Poster Presentation Video due Fri, Apr 23
    • Final Report due Fri, Apr 23

Week 15 (Apr 26-30)
  Topics: Peer assessment
  Project:
    • Poster Presentation Video grading starts Tue, Apr 26
    • Poster Presentation Video grading due Fri, Apr 30

This course can be very tough for many!

WARNING! You are expected to quickly learn many things simultaneously, and you will need to learn some materials on your own (e.g., Linux commands for working with MS Azure/Amazon AWS). This can be very intimidating for many students.

The amount of time students spend on this class varies greatly, depending on their backgrounds and what they already know. Some former students told us they spent about 40-60 hours on each homework assignment (we have 4 big assignments, and no exams), and some reported much less. For example, for the homework assignment about D3 visualization programming, students who are completely new to JavaScript, CSS, and HTML will likely spend significantly more time than their peers who have already tried them. Some former students who do not have a computer science background found that the homework assignments were challenging and took significant time and effort, but were rewarding, fun, and "do-able."

Students have at least 3 weeks to complete each homework assignment. Some students waited until the last week, and could not finish. It is critical to plan ahead and prepare for the significant time needed.

Almost all homework assignments involve a very large amount of programming (which naturally means a lot of debugging will likely be needed, and this can be time-consuming). You should be proficient in at least one high-level programming language (e.g., Python, C++, Java) and in debugging principles and practices. If not, we recommend first taking introductory computing course(s) before taking this course. For example, CSE 6040 for (OMS) Analytics students; CS 1301, CS 1331, CS 1332, CS 1371, etc. for on-campus students.

Some programming assignments involve high-level languages or scripting (e.g., Python, Java, SQL). Some assignments involve web programming and D3 (e.g., JavaScript, CSS, HTML). For example, an assignment on Hadoop and Spark may require you to learn some basic Java and Scala quickly, which should not be too challenging if you already know another high-level language like Python or C++. It is unlikely that you already know all the tools and skills needed for the programming tasks, so you are expected to learn many of them on the fly.

Basic linear algebra, probability and statistics knowledge is also expected.

Minimum Computer Requirements

  • 8GB RAM (16GB recommended)
  • 512GB disk (SSD recommended). Some assignments use data files that are larger than a few GBs, and some use virtual machines that can easily take up tens of GBs. It is typical for some project teams to use large datasets ranging from a few GBs to tens of GBs.
  • Dual-core Core i5 (8th generation or better recommended)

Accessing Course Materials Outside of US

You may need to use Georgia Tech's VPN. We also recommend checking out some solutions that seem to be working well for OMS students in different countries.

Homework

We have 4 big assignments in total (subject to change). Visit this course's Canvas site for the assignment documents. See the schedule table above for deliverable due dates.
  • [10%] HW1: Collecting & visualizing data, SQLite, D3 warmup, OpenRefine
  • [15%] HW2: D3 Graphs and Visualization
  • [15%] HW3: Hadoop, Spark, Pig and Azure
  • [10%] HW4: Scalable PageRank via Virtual Memory (MMap), Random Forest, Scikit-Learn
We do not release solutions for homework.
Can you release homework early?
We understand that some students may prefer that homework assignments be released as soon as possible. Behind the scenes, our course staff work diligently to develop new questions, which means testing new datasets, new instructions, new autograders, solution code, and more! Unfortunately, this means we likely cannot release assignments well in advance. We will release them as early as possible, hopefully some days before the scheduled release dates on our course schedule. When we release an assignment, we always announce it on Piazza.

Project

See project description. See the schedule table above for deliverable due dates.

Distance Learning Students

A standard 3-day lag applies to all homework and project deliverables. For the project presentation, a group that has a DL student member (from the Q, QSZ, or R sections) can choose to:
  1. [Not applicable in Fall 2020] Present in class without 3-day lag; or
  2. Submit a video presentation with 3-day lag (e.g., screen capture)

Grading Policy

  • There will be 4 homework assignments. Together, they are worth 50% (10%, 15%, 15%, 10%) of the course grade.
  • There will be one course group project worth 50% of the course grade. The project components are:
    1. Proposal (7.5% of course grade)
    2. Proposal presentation (5%) (video recording)
    3. Progress report (5%)
    4. Final poster presentation (7.5%) (video recording)
    5. Final report (25%)
  • You must achieve an overall weighted average of 60% to pass the course.
  • All deliverables will be graded by our TAs, except the project poster presentation, which will be peer-graded.
  • When assigning course grades, I will start with the standard grade thresholds (90, 80, etc.). I may lower (and never raise) the thresholds (i.e., to your benefit). For example, I may use 88 instead of 90.
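The weighting described in this grading policy can be sketched as a short calculation. This is only an illustration under stated assumptions, not the official grade computation: the weights and the 60% passing average come from the list above, while the names `WEIGHTS`, `weighted_average`, and `passes`, the 0-100 score scale, and the choice to count a missing deliverable as 0 are assumptions made for the example.

```python
from typing import Dict

# Weights (percent of the course grade) as listed in the grading policy.
WEIGHTS: Dict[str, float] = {
    "HW1": 10.0,
    "HW2": 15.0,
    "HW3": 15.0,
    "HW4": 10.0,
    "Proposal": 7.5,
    "Proposal presentation": 5.0,
    "Progress report": 5.0,
    "Final poster presentation": 7.5,
    "Final report": 25.0,
}

# Overall weighted average needed to pass the course.
PASSING_AVERAGE = 60.0


def weighted_average(scores: Dict[str, float]) -> float:
    """Combine per-deliverable scores (0-100) into an overall weighted average.

    Assumption: a deliverable missing from `scores` counts as 0.
    """
    total_weight = sum(WEIGHTS.values())  # 100.0 for the weights above
    earned = sum(weight * scores.get(name, 0.0) for name, weight in WEIGHTS.items())
    return earned / total_weight


def passes(scores: Dict[str, float]) -> bool:
    """Check the overall weighted average against the passing average."""
    return weighted_average(scores) >= PASSING_AVERAGE


if __name__ == "__main__":
    # Hypothetical scores: 80 on everything except 90 on the final report.
    example = {name: 80.0 for name in WEIGHTS}
    example["Final report"] = 90.0
    print(weighted_average(example), passes(example))
```

Because the final report alone carries 25% of the course grade, skipping it caps the overall average at 75 in this sketch even with perfect scores elsewhere. The letter-grade thresholds are not modeled here, since they may be lowered at the instructor's discretion.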
    Plagiarism, Collaboration Policy, and Student Honor Code

    • All course participants (myself, teaching assistants, and learners) are expected to know and abide by the Georgia Tech Academic Honor Code.
    • Ethical behavior is extremely important in all facets of life.
    1. Plagiarism is a serious offense. You are responsible for completing your own work. You are not allowed to copy and paste, or paraphrase, or submit materials created or published by others, as if you created the materials. All materials submitted must be your own.
    2. You may discuss high-level ideas with other students at the "whiteboard" level (e.g., how cross validation works, or using a hashmap instead of an array) and review any relevant materials online. However, each student must write up and submit his or her own answers.
    3. You must not post your code publicly (e.g., in a public GitHub repository), because a (future) student could copy your code. That student would obviously be violating the honor code, and you may also be implicated.
    4. All incidents of suspected dishonesty, plagiarism, or violations of the Georgia Tech Honor Code will be subject to the institute’s Academic Integrity procedures (e.g., reported to and directly handled by the Office of Student Integrity (OSI)). Consequences can be severe, e.g., academic probation or dismissal, grade penalties, a 0 grade for assignments concerned, and prohibition from withdrawing from the class.

    Late Policy and Due Dates

    • All homework and project deliverables are due at the times shown in the Course Schedule. These times are subject to change so please check back often. Convert the times to your local times using a Time Zone Converter.
    • Every homework assignment deliverable and every project deliverable comes with a 48-hour "grace period". You do not need to ask before using this grace period.

      Your deliverable may be submitted (and resubmitted) up to 48 hours after the official deadline without penalty, but Canvas will mark your submission as "late".

      Canvas automatically appends a "version number" to files that you re-submit. You do not need to worry about these version numbers, and there is no need to delete old submissions. We will only grade the most recent submission.
    • Any deliverable submitted after the grace period will get 0 credit. We recommend that you submit your work before the grace period begins.
    • We will not consider late submission of any missing parts of a deliverable. To make sure you have submitted everything, download your submitted files to double check. If you are submitting large files, you are responsible for making sure they get uploaded to the system in time. You have 48 hours to verify your submissions!
    • There are no penalties for medical reasons or emergencies. Should they arise, you must contact the Dean of Students office. Doctor's notes, medical documentation, explanation of emergencies, etc. should be submitted to the Dean’s office. After their office receives the information, they will notify me on your behalf.
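The deadline and grace-period rules above can be summarized in a small sketch. It is illustrative only: the 48-hour grace period and the 0-credit rule come from this policy, while the function name `submission_status`, the status strings, and the treatment of a submission made exactly at a boundary as still inside it are assumptions.

```python
from datetime import datetime, timedelta, timezone

# Grace period stated in the late policy.
GRACE_PERIOD = timedelta(hours=48)


def submission_status(deadline: datetime, submitted: datetime) -> str:
    """Classify a submission time against the official deadline.

    Both arguments must be timezone-aware datetimes so they can be compared.
    Assumption: a submission exactly at a boundary counts as inside it.
    """
    if submitted <= deadline:
        return "on time"
    if submitted <= deadline + GRACE_PERIOD:
        return "grace period: no penalty, marked late by Canvas"
    return "after grace period: 0 credit"


if __name__ == "__main__":
    # Hypothetical deadline: 23:59 AoE (UTC-12) on an assumed date.
    aoe = timezone(timedelta(hours=-12))
    deadline = datetime(2021, 3, 5, 23, 59, tzinfo=aoe)
    for hours_after in (-1, 24, 49):
        submitted = deadline + timedelta(hours=hours_after)
        print(hours_after, submission_status(deadline, submitted))
```

Using timezone-aware datetimes means a submission timestamp recorded in any local time zone can be compared directly against an AoE deadline.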

    Timing Policy

    • The course videos follow a logical sequence that includes knowledge-building and experience-building (assignments).
    • Assignments should be completed by their due dates to allow for timely peer assessment. Peer assessments should also be completed by their due dates, to give timely feedback.
    • You will have access to the course content for the scheduled duration of the course.

    Attendance Policy

    • This semester, this course runs as remote asynchronous, which means this course is delivered completely online and without a designated meeting time.
    • Log in on a regular basis to complete your work, so that you do not have to spend a lot of time reviewing and refreshing yourself on the content.
    • We hope to provide you with experiences that are close to in-person learning. We welcome your suggestions on Piazza.

    COVID-19 Policy

    The fall semester 2020 is especially challenging due to the Covid-19 pandemic and a growing awareness of racial inequities.  The following information relates to specific services and guidelines for courses during this semester.  The most up-to-date information on Covid-19 is on the TECH Moving Forward website and in the Academic Restart Frequently Asked Questions.  

    Expectations and Guidelines

    Each of us has a responsibility to ourselves and our fellow Yellow Jackets to be mindful of our shared commitment.

    • We are all required to wear a face covering while inside any campus facilities/buildings, including during in-person classes, and to adhere to social distancing of at least 6 feet. If an individual forgets to bring a face covering to class or into any indoor space, there will be a clearly marked supply of these in each building. If a student fails to follow Georgia Tech’s policies on social distancing and face coverings, they will initially be reminded of the policy and if necessary, asked to leave the class, meeting, or space. If they still fail to follow the policy, they may be referred to the Office of the Dean of Students. Information on the Institute’s policy on face coverings.
    • Students are expected to sit in assigned seats and to come to class only on days that are assigned to them.
    • Papers, projects, tests, homework, and other assignments will only be accepted in electronic form unless the assignment is a physical artifact. 

    Additional information is available in the Student Guidebook.


    Instructor Illness or Exposure to Covid-19

    During the fall 2020 semester, some faculty members may be required to quarantine due to exposure or isolate due to a Covid-19 diagnosis. Some disruption to classes or services is inevitable, but Georgia Tech is making every effort to ensure continuity of operations. As is the case in any semester, faculty may cancel a class if they have an illness or emergency situation and cover any missed material at their own discretion. If an instructor needs to cancel a class, they should notify students as early as possible.

    Faculty who are staying home due to symptoms should monitor their health closely and consult with their school chair to determine if remote instruction or substitute instruction is most appropriate for the course. If they need to cancel a class repeatedly, a backup will be supplied in the form of a temporary substitute instructor or asynchronous work. No course will be canceled after the first class has occurred.

    If you have not tested positive but are ill or have been exposed to someone who is ill, please follow the Covid-19 Exposure Decision Tree for reporting your illness.


    Student Illness or Exposure to Covid-19

    During the semester, you may be required to quarantine or self-isolate to avoid the risk of infection to others.  Quarantine is the separation of those who have been exposed to someone with Covid-19 but who are not ill; isolation is the separation of those who have tested positive for Covid-19 or been diagnosed with Covid-19 by symptoms.

    If you have not tested positive but are ill or have been exposed to someone who is ill, please follow the Covid-19 Exposure Decision Tree for reporting your illness.

    During the quarantine or isolation period you may feel completely well, ill but able to work as usual, or too ill to work until you recover.

    Remote courses and remote class sessions during hybrid courses: unless you are too ill to work, you should be able to complete your remote work while in quarantine or isolation.

    If you are ill and unable to do course work this will be treated similarly to any student illness. The Dean of Students will have been contacted when you report your positive test or are told that it is necessary to quarantine and will notify your instructor that you may be unable to attend class events or finish your work as the result of a health issue. Your instructor will not be told the reason. We have asked all faculty to be lenient and understanding when setting work deadlines or expecting students to finish work, and so you should be able to catch up with any work that you miss while in quarantine or isolation. Your instructor may make available any video recordings of classes or slides that have been used while you are absent, and may prepare some complementary asynchronous assignments that compensate for your inability to participate in class sessions. Ask your instructor for the details.


    CARE Center, Counseling Center, Stamps Health Services, and the Student Center

    These uncertain times can be difficult, and many students may need help in dealing with stress and mental health. The CARE Center, the Counseling Center, and Stamps Health Services will offer both in-person and virtual appointments. Face-to-face appointments will require wearing a face covering and social distancing, with exceptions for medical examinations. Student Center services and operations are available on the Student Center website. For more information on these and other student services, contact the Vice President and Dean of Students or the Division of Student Life.


    Accommodations for Students at Higher Risk for Severe Illness with Covid-19

    Students may request an accommodation through the Office of Disability Services (ODS) due to 1) presence of a condition as defined by the Americans with Disabilities Act (ADA), or 2) identification as an individual of higher risk for Covid-19, as defined by the Centers for Disease Control (CDC). Registering with ODS is a 3-step process that includes completing an application, uploading documentation related to the accommodation request, and scheduling an appointment for an “intake meeting” (either in person or via phone or video conference) with a disability coordinator.

     If you have been approved by ODS for an accommodation, I will work closely with you to understand your needs and make a good faith effort to investigate whether or not requested accommodations are possible for this course. If the accommodation request results in a fundamental alteration of the stated learning outcome of this course, ODS, academic advisors, and the school offering the course will work with you to find a suitable alternative that as far as possible preserves your progress toward graduation.


    Netiquette

    • Netiquette refers to etiquette that is used when communicating on the Internet. Review the Ground Rules for Online Discussions. When you are communicating via email, discussion forums or synchronously (real-time), please use correct spelling, punctuation and grammar consistent with the academic environment and scholarship.
    • We expect all participants (learners, faculty, teaching assistants, staff) to interact respectfully. Learners who do not adhere to this guideline may be removed from the course.

    Dataset Ideas (may need API, or scraping)

    Resources

    Office of Disability Services

    The Office of Disability Services offers accommodations for students with disabilities. Please contact the office should you need help.

    Support Services

    Academic support and personal support: Office of the Dean of Students, Counseling Center, Health Services, Women's Resource Center, LGBTQIA Resource Center, Veteran's Resource Center, Georgia Tech Police.

    Recommended Reading

    All content and course materials can be accessed online. There is no textbook for this course.

    All Georgia Tech students have FREE access to https://www.oreilly.com, where you can find a huge number of highly rated and classic books (e.g., the "animal" books) from O'Reilly and Pearson covering a wide variety of computer science topics, including some of those listed below. Just log in with your official GT email address, e.g., jdoe3@gatech.edu.

    Software engineering; become a better programmer and developer

    D3 Visualization; Javascript

    Big Data

    Python

    Data science, machine learning, data mining

    Visualization

    SQL

    Probability

    Human Computation

    How to manage multiple versions of Python packages?

    To get started, we recommend the excellent article on Which Python package manager should you use?

    If you've decided to go with pyenv, I recommend Managing Multiple Python Versions With pyenv.

    If you use a Mac, we also recommend checking out The right and wrong way to set Python 3 as default on a Mac.

    Students in my research group said that Poetry seems to be fast replacing conda envs, and may even replace setuptools for PyPI packages in the future.

    Prerequisites

    Review Polo's "warnings" before taking this course.

    Additional formal prerequisites for CSE 6242

    None, but you should have taken courses similar to those listed in the next section, at Georgia Tech or at another school.

    If you are an Analytics (OMS or campus) degree student, you should first take CSE 6040 and do very well in it; if necessary, please also first take CS 1301.

    Additional formal prerequisites for CX 4242

    (Undergraduate Semester level MATH 2605 Minimum Grade of D or
    Undergraduate Semester level MATH 2401 Minimum Grade of D or
    Undergraduate Semester level MATH 24X1 Minimum Grade of D)
    and
    (Undergraduate Semester level MATH 3215 Minimum Grade of D or
    Undergraduate Semester level MATH 3225 Minimum Grade of D or
    Undergraduate Semester level ECE 3077 Minimum Grade of D or
    Undergraduate Semester level ISYE 2027 Minimum Grade of D)
    and
    (Undergraduate Semester level CS 1371 Minimum Grade of C or
    Undergraduate Semester level CS 1372 Minimum Grade of C or
    Undergraduate Semester level CX 4010 Minimum Grade of C or
    Undergraduate Semester level CX 4240 Minimum Grade of C)

    Course offerings and Registration

    Auditing & Pass/Fail

    Due to the large class size, we do not offer auditing or a pass/fail option.

    Previous offerings

    See https://poloclub.github.io/#cse6242 for all past course offerings.

    Acknowledgment & Related Classes

    We thank Intel for its support in curriculum development for the memory mapping module (scaling up algorithms with virtual memory).

    We thank Amazon Educate for providing free cloud credit for Amazon Web Services. We are excited to be an AWS partner university and part of AWS Educate's private beta.

    We thank Microsoft Azure for a special grant providing free cloud credit.

    We thank the Tableau for Teaching program for data visualization software.

    Many thanks to my colleagues for sharing their course materials:
    • Prof. John Stasko - Information Visualization - Fall 2012
    • Prof. Jeff Heer - Research Topics in Interactive Data Analysis - Spring 2011
    • Prof. Christos Faloutsos - Multimedia Databases and Data Mining - Fall 2012