Have you ever been curious about how LinkedIn populates the content that appears under its Groups You May Like (GYML) tab? Senior software engineer Alexis Pribula detailed some of the nuts and bolts in a post on the LinkedIn Blog.
In a nutshell, GYML matches information from user profiles on the professional networking site with information from groups, and then applies various filters to remove results that may not be relevant. Pribula explains:
GYML matches key member profile features against key group features. An optimal matching relies on relevant historical data, key features, and a well-defined metric. More specifically:
Metric: We designed the metric to optimize for participation in the community, and not necessarily only for group affinity. Indeed, what makes a group valuable, along with its members, is the member contributions to the professional dialog. This design can be achieved by an approach we at LinkedIn call “data jiu jitsu”: First, match a group to a member based on content affinity, then optimize for the desired behavior (in our case, “participation”), which can be done via social learning. Social learning theory tells us that individuals learn by observing others’ behavior and the outcomes of those behaviors. Hence, someone joining a group with high participation from its members is more likely to engage further in the future.
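The two-step idea described above (match on content affinity first, then favor participation) can be illustrated with a minimal sketch. This is not LinkedIn's implementation: the function name, the score dictionaries, the `min_affinity` cutoff, and the sample numbers are all hypothetical and chosen only to show the ordering of the two steps.

```python
def recommend_groups(affinity, participation, min_affinity=0.3, top_k=3):
    """Shortlist groups by content affinity, then rank the shortlist by participation.

    Every name and threshold here is a hypothetical illustration.
    """
    # Step 1: keep only groups whose content matches the member well enough.
    shortlist = [group for group, score in affinity.items() if score >= min_affinity]
    # Step 2: among those, prefer groups whose members participate the most.
    shortlist.sort(key=lambda group: participation.get(group, 0.0), reverse=True)
    return shortlist[:top_k]


# Made-up scores for one member (assumed to be in the range 0..1).
affinity = {"Data Science": 0.9, "Hadoop Users": 0.5, "Gardening": 0.1}
participation = {"Data Science": 0.2, "Hadoop Users": 0.7, "Gardening": 0.95}

print(recommend_groups(affinity, participation))
```

In this toy data, "Gardening" has the highest participation but is dropped in the first step because its affinity is too low, so the sketch prints "Hadoop Users" ahead of "Data Science".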
Key features: One of the most interesting aspects of GYML is the group features definitions. Beyond the usual suspects that include group title and group description, the real DNA of a group resides within its members. Hence, using a construct of information theory called mutual information, we generate a “virtual” group profile, which, following the homophily concept, can be matched against each member. Another source of information we use as a feature for matching is the popularity of the group in someone’s network. If many of your connections belong to a group, that group will probably be of interest to you.
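To make the "virtual" group profile idea more concrete, here is a small sketch that ranks profile features by how much they contribute to the mutual information between "has this feature" and "belongs to this group." The post does not spell out LinkedIn's actual formula, so the scoring below (the contribution of the "feature present, member in group" cell to the mutual information sum) is an assumption, as are the function names and the toy data. The second function illustrates network popularity as the share of a member's connections that belong to the group, which is also an assumed definition.

```python
import math
from collections import Counter


def virtual_group_profile(group_members, all_members, top_k=3):
    """Return the features most informative about membership in the group.

    all_members maps a member id to a list of profile features (hypothetical schema).
    """
    if not all_members or not group_members:
        return []
    n_total = len(all_members)
    p_group = len(group_members) / n_total
    # How many members overall, and how many inside the group, carry each feature.
    overall = Counter(f for feats in all_members.values() for f in set(feats))
    in_group = Counter(f for m in group_members for f in set(all_members[m]))
    scores = {}
    for feature, count_in_group in in_group.items():
        p_joint = count_in_group / n_total
        p_feature = overall[feature] / n_total
        # One term of the mutual information sum: p(x, y) * log2(p(x, y) / (p(x) * p(y))).
        scores[feature] = p_joint * math.log2(p_joint / (p_feature * p_group))
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


def network_popularity(connections, group_members):
    """Fraction of a member's connections that belong to the group."""
    if not connections:
        return 0.0
    members = set(group_members)
    return sum(1 for c in connections if c in members) / len(connections)


# Toy data, invented for illustration only.
all_members = {
    "a": ["hadoop", "java"],
    "b": ["hadoop", "python"],
    "c": ["java", "sales"],
    "d": ["sales", "marketing"],
}
group = ["a", "b"]

print(virtual_group_profile(group, all_members, top_k=2))
print(network_popularity(["a", "b", "c", "d"], group))
```

Here "java" scores zero because it is equally common inside and outside the group, while "hadoop" appears only among group members and therefore dominates the virtual profile.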
Two interesting edge cases arose with this initial approach: potential mismatch with alumni groups (spurring strong reactions from members), and location-specific groups, like “Yahoo! India.” This was resolved by implementing filters that discard groups with an over-representation of a school (location) that does not match the member’s school (location).
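The over-representation filter can be sketched as follows. The post does not say how over-representation is measured, so this example assumes a simple majority threshold, and the dictionary-based member records and function names are hypothetical.

```python
from collections import Counter


def dominant_value(values, threshold=0.5):
    """Return the value held by more than `threshold` of the group, or None."""
    known = [v for v in values if v is not None]
    if not known:
        return None
    value, count = Counter(known).most_common(1)[0]
    return value if count / len(known) > threshold else None


def passes_filter(member, group_members, attribute, threshold=0.5):
    """Discard a group dominated by a school or location the member does not share."""
    dominant = dominant_value([m.get(attribute) for m in group_members], threshold)
    # Keep the group if nothing dominates it, or if the member matches what does.
    return dominant is None or dominant == member.get(attribute)


# Toy records, invented for illustration only.
india_group = [
    {"location": "India"},
    {"location": "India"},
    {"location": "India"},
    {"location": "US"},
]
mixed_group = [{"location": "India"}, {"location": "US"}, {"location": "UK"}]

print(passes_filter({"location": "US"}, india_group, "location"))
print(passes_filter({"location": "India"}, india_group, "location"))
print(passes_filter({"location": "US"}, mixed_group, "location"))
```

The same function covers the alumni case by passing "school" as the attribute. With this toy data, the member in the US is filtered out of the India-dominated group but keeps the mixed group.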
Historical Data: To fine-tune the matching process, we leveraged historical data focusing on recent group joins on LinkedIn. To keep the best possible relevance in our matching algorithm, we also applied some filtering. First, we filtered out groups that our members may find controversial. Second, we did not show group recommendations to spammers: Members who tried to join groups for the sole purpose of spamming them were subsequently removed from the groups.
To provide constantly fresh recommendations, group recommendations are updated in real-time when members update their profiles, while group features are updated offline on a weekly basis using Hadoop. Note that the latter could be updated more frequently if necessary, but we have found weekly updates to be quite sufficient to ensure freshness of the results.