Possibly the Shortest Collaborative Filtering Recommendation Engine Ever Written
by wentrue
at 2010-07-01 21:10:12
original http://www.wentrue.net/blog/?p=970
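These are the first two days without a match since the World Cup kicked off. With no football to watch, I might as well write a blog post. The title is a bit of a gimmick: what follows is simply an item-based CF recommendation algorithm implemented in R.
# Read the data; the raw data are user-subject collection pairs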
data = read.table('data.dat', sep=',', header=TRUE)
# Build row and column indices for users and subjects
user = unique(data$user_id)
subject = unique(data$subject_id)
uidx = match(data$user_id, user)
iidx = match(data$subject_id, subject)
# Build the collection matrix from the pairs
M = matrix(0, length(user), length(subject))
i = cbind(uidx, iidx)
M[i] = 1
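# Normalize the column vectors (subject vectors); %*% is matrix multiplication
mod = colSums(M^2)^0.5 # norm of each column
MM = M %*% diag(1/mod) # multiply M by the diagonal matrix built from 1/mod, i.e. divide each column by its norm
# crossprod computes t(MM) %*% MM, here the inner products of the column vectors; S is the subject similarity matrix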
S = crossprod(MM)
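# Recommendation score for every user-subject pair
# (subjects a user has already collected are not excluded)
R = M %*% S
# Sort each user's scores in descending order, keeping the sorted values (x) and their column indices (ix)
R = apply(R, 1, FUN=sort, decreasing=TRUE, index.return=TRUE)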
k = 5
# Take the 5 highest-scoring subjects for each user
res = lapply(R, FUN=function(r)return(subject[r$ix[1:k]]))
# Write the results, one "user:subject,subject,..." line per user
write.table(paste(user, sapply(res, paste, collapse=','), sep=':'), file='result.dat', quote=FALSE, row.names=FALSE, col.names=FALSE)
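To make the intermediate values easier to follow, here is a minimal sketch of the same steps on a tiny made-up data set. The `toy` data frame and the masking step are assumptions added for illustration: the script above does not mask anything, so a subject the user has already collected can appear among the recommendations.

```r
# Toy user-subject pairs (hypothetical data, not taken from data.dat)
toy <- data.frame(user_id    = c(1, 1, 2, 2, 3),
                  subject_id = c(10, 20, 10, 30, 20))

user    <- unique(toy$user_id)
subject <- unique(toy$subject_id)

# Binary collection matrix: rows are users, columns are subjects
M <- matrix(0, length(user), length(subject))
M[cbind(match(toy$user_id, user), match(toy$subject_id, subject))] <- 1

# Cosine similarity between subject columns
mod <- sqrt(colSums(M^2))
MM  <- M %*% diag(1 / mod)
S   <- crossprod(MM)

# Scores, then mask subjects the user has already collected (added step)
R <- M %*% S
R[M == 1] <- -Inf

# Best not-yet-collected subject for each user
best <- subject[apply(R, 1, which.max)]
print(data.frame(user = user, best = best))
```

In this toy case subjects 10 and 20 share one user out of two collectors each, so their cosine similarity is 0.5.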
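Excluding comments, there are only 16 lines of effective code. It relies heavily on vectorized functions and operations, so there is no explicit loop anywhere. For a more detailed discussion of vectorization, see here.
Note: this code implements only the most basic form of the algorithm and is for reference only. I make no promise that it is practical for large-scale or complex data.
The contents of the source data file data.dat are listed below: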
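As a rough illustration of what vectorization saves, the sketch below computes the inner products of all column pairs twice: once with explicit loops and once with a single `crossprod` call. The matrix `A` is random made-up data standing in for the collection matrix.

```r
# Small random 0/1 matrix (made-up stand-in for the collection matrix)
set.seed(1)
A <- matrix(rbinom(20, 1, 0.5), nrow = 5, ncol = 4)

# Explicit loops: inner product of every pair of columns
n <- ncol(A)
S_loop <- matrix(0, n, n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    S_loop[i, j] <- sum(A[, i] * A[, j])
  }
}

# Vectorized: one call computes t(A) %*% A
S_vec <- crossprod(A)

# Both approaches should agree
print(all.equal(S_loop, S_vec))
```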
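One possible direction for larger data, sketched here under the assumption that the Matrix package is installed, is to store the collection matrix in sparse form so that only the nonzero entries take memory. The `pairs` data frame is hypothetical, and this is not a tested large-scale solution: the similarity matrix `S` can still be much denser than `M`.

```r
library(Matrix)  # assumed to be installed; not used by the original script

# Hypothetical pairs; in practice these would be read from data.dat
pairs <- data.frame(user_id    = c(1, 1, 2, 2, 3),
                    subject_id = c(10, 20, 10, 30, 20))
user    <- unique(pairs$user_id)
subject <- unique(pairs$subject_id)

# Sparse 0/1 collection matrix: only nonzero entries are stored
M <- sparseMatrix(i = match(pairs$user_id, user),
                  j = match(pairs$subject_id, subject),
                  x = 1,
                  dims = c(length(user), length(subject)))

# Same normalization, similarity and scoring as the dense version
mod <- sqrt(colSums(M^2))
MM  <- M %*% Diagonal(x = 1 / mod)
S   <- crossprod(MM)
R   <- M %*% S
print(R)
```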
user_id,subject_id
1,1
1,3
1,7
1,13
2,2
2,5
2,6
2,7
2,9
2,10
2,11
3,1
3,2
3,3
3,4
3,7
3,9
3,10
5,13
6,1
6,3
6,4
6,5
6,8
6,10
8,1
8,2
8,3
8,5
8,6
8,7
8,8
9,13
10,12
11,2
11,3
11,4
11,6
11,8
11,9
11,13
12,12
13,3
13,6
13,7
15,4
15,12
15,13
16,2
16,3
16,4
16,7
16,8
17,2
17,3
17,4
17,5
17,6
17,7
17,8
17,9
17,10
17,11
18,2
18,3
19,2
19,3
19,5
19,6
19,9
19,10
19,11
19,12
20,1
20,3
20,4
20,7
20,13
21,1
21,6
21,8
21,9
21,11
21,12
21,13
22,6
23,2
23,4
23,9
23,12
24,1
24,5
24,9
25,2
25,6
25,10
25,11
26,2
26,3
26,8
27,3
27,6
27,12
27,13
28,1
28,2
28,3
28,5
28,7
28,9
28,10
28,11
28,12
28,13
29,1
29,2
29,3
29,4
29,5
29,6
29,7
29,8
29,9
29,10
30,6
30,7
30,9
30,13
31,6
31,11
32,1
32,5
33,2
33,13
34,3
34,7
34,8
34,9
34,10
34,13
35,3
35,4
35,5
35,6
35,7
36,2
36,3
36,4
36,6
36,7
36,8
36,9
36,11
36,12
36,13
38,5
41,1
41,3
41,4
41,5
41,6
41,7
41,11
42,2
42,3
42,7
42,8
42,9
42,10
42,11
43,2
43,6
43,10
43,11
43,12