Mining StackOverflow to Filter out Off-topic IRC Discussion

  • Author(s) / Creator(s)
  • Internet Relay Chat (IRC) is commonly used by open source developers. Developers use IRC channels to discuss programming-related problems, but much of the discussion is irrelevant and off-topic. If we treat IRC messages like email and apply spam filtering, we can try to separate the spam (the off-topic discussion) from the ham (the programming discussion). Spam filtering, however, requires labelled data, which takes time to curate. To avoid costly curation, we need ready-made sources of positive and negative examples. Online discussion forums such as Stack Overflow are very effective for solving programming problems, and because Stack Overflow releases its data openly, it is a powerful source of labelled text about programming. This work shows that classifiers can be trained using Stack Overflow posts as positive examples of on-topic programming discussion. YouTube video comments, notorious for their lack of quality, serve as a training set of off-topic discussion. By exploiting these datasets, accurate classifiers can be built, tested, and evaluated that require very little effort for end users to deploy.
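
The idea in the abstract can be illustrated with a minimal sketch. The abstract does not say which classifier the work uses; a multinomial naive Bayes model is assumed here only because it is a common choice for spam filtering. The example sentences are invented stand-ins for Stack Overflow posts and YouTube comments, and every name in the code (`tokenize`, `NaiveBayes`, `clf`) is hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word-like tokens."""
    return re.findall(r"[a-z0-9_+#]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing (an assumed model choice)."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of token frequencies
        self.doc_counts = Counter()  # label -> number of training documents
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.doc_counts[label] += 1
            counts = self.word_counts.setdefault(label, Counter())
            for tok in tokenize(text):
                counts[tok] += 1
                self.vocab.add(tok)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy data: stand-ins for Stack Overflow posts (positive examples).
on_topic = [
    "How do I fix a null pointer exception in Java",
    "Python list comprehension raises a syntax error",
    "Segfault when calling malloc in a C function",
    "Git merge conflict in the build script",
]
# Invented toy data: stand-ins for YouTube comments (negative examples).
off_topic = [
    "lol this video is so funny",
    "first comment subscribe to my channel",
    "who is watching this in 2015",
    "best song ever love it",
]

clf = NaiveBayes().fit(
    on_topic + off_topic,
    ["on-topic"] * len(on_topic) + ["off-topic"] * len(off_topic),
)

# Classify two made-up IRC-style messages.
print(clf.predict("syntax error in python function"))
print(clf.predict("lol love this song"))
```

No IRC messages are labelled by hand here: the labels come entirely from which source a training text was drawn from, which is the curation saving the abstract describes. A realistic setup would need far more training text than these eight sentences.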

  • Date created
    2015
  • Subjects / Keywords
  • Type of Item
    Conference/Workshop Presentation
  • DOI
    https://doi.org/10.7939/r3-z805-2h57
  • License
    Attribution 4.0 International