R与Mongo数据库交互演示
薛英杰 / 2021-04-24
MongoDB 是一个基于分布式文件存储的数据库。由 C++ 语言编写。旨在为 WEB 应用提供可扩展的高性能数据存储解决方案。同时,他也是一个介于关系数据库和非关系数据库之间的产品,是非关系数据库当中功能最丰富,最像关系数据库的。MongoDB的安装和具体环境配置可以参考大神的学习笔记
R语言作为一种统计数据分析语言,必然需要输入数据才能进行各种运算,那么当R语言偶遇MongoDB时,该怎么从MongoDB中读入数据,又该如何将数据写入MongoDB呢?
我们都知道R是“包治百病”,与MongoDB交互自然也离不开package。链接MongoDB的R package有三个,分别为:rmongodb、RMongo、Mongolite,他们版本和使用手册如下:
Package:rmongodb(2011)、RMongo(2011)、Mongolite(2014)
User Manual:rmongodb Usage RMongo Usage Mongolite Usage
目前rmongodb和RMongo都已下架,上面的package和User Manual均为历史遗留版本,不过也都能使用。今天重点介绍一下Mongolite包,通过他来实现R语言与MongoDB的交互,可以参考Presentation: UseR 2015 slides。
要实现R语言与MongoDB的交互,首先需要开启MongoDB,在命令行窗口中输入:
D:Mongo\bin\mongod --dbpath D:Mongo\data\db
接下来在R环境中加载相关软件包,建立R与MongoDB链接,具体如下:
##加载软件包
pacman::p_load(data.table,plyr,stringr,dplyr,mongolite,future.apply,future)
##建立R与MongoDB数据库的链接,collection为数据表对应的文件,db为数据库中的表
m<-mongo(collection = "tradedatadaily",db="stockdata")
##断开链接
rm()
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 790514 42.3 1506261 80.5 1182562 63.2
## Vcells 1323857 10.2 8388608 64.0 2178657 16.7
1.查询数据
假设我们需要从MongoDB数据库中查找2021-04-21股票交易数据,统计各收益水平下的上市公司数量,具体代码如下:
##加载软件包
pacman::p_load(data.table,plyr,stringr,dplyr,mongolite,future.apply,future)
##建立R与MongoDB数据库的链接,collection为数据表对应的文件,db为数据库中的表
m<-mongo(collection = "tradedatadaily",db="stockdata")
##查询字段有上市公司代码(symbol)、收盘价(close)、前收盘价(pre_close),查询条件为交易日期(trade_date)等于2021-04-23
tradedata<-m$find(query = '{"trade_date":"20210423"}',
fields = '{"_id":0,"symbol":1,"trade_date":1,"close":1,"pre_close":1}')
head(tradedata)
## symbol trade_date close pre_close
## 1 000007 20210423 3.73 3.81
## 2 000008 20210423 2.28 2.30
## 3 000005 20210423 2.29 2.30
## 4 000004 20210423 18.64 18.33
## 5 000006 20210423 5.37 5.52
## 6 000009 20210423 10.50 10.38
##计算收益并分组统计上市公司数量
fdstb<-tradedata%>%mutate(rets=close/pre_close-1)%>%mutate(groups=cut(rets,c(-0.25,-0.15,-0.1,-0.05,-0.03,-.01,0,0.01,0.03,0.05,0.1,0.15,0.25),labels = c("<-15%","<-10%","<-5%","<-3%","<-1%","<0%","0%<","1%<","3%<","5%<","10%<","15%<"),right=F))%>%group_by(groups)%>%summarise(cout=n())
barplot(fdstb$cout,names.arg = fdstb$groups,col =c(rep("green",6),rep("red",6)),cex.names = 0.7,ylim = c(0,1700),axes = F)
text(c(0.7,1.9,3.1,4.3,5.5,6.7,7.9,9.1,10.3,11.5,12.6,13.9),fdstb$cout+c(8,8,8,8,15,10,10,10,10.3,9,9,9)+32,fdstb$cout,cex = 0.8)
从MongoDB数据库中提取2021年IPO且市值的公司
##加载软件包
pacman::p_load(data.table,plyr,stringr,dplyr,mongolite,future.apply,future,flextable,DT)
## Installing package into 'C:/Users/yingjie/AppData/Local/R/win-library/4.2'
## (as 'lib' is unspecified)
## also installing the dependencies 'triebeard', 'urltools', 'httpcode', 'crul', 'fontBitstreamVera', 'fontLiberation', 'gfonts', 'fontquiver', 'officer', 'gdtools'
## Warning: unable to access index for repository http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/4.2:
## cannot open URL 'http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/4.2/PACKAGES'
## package 'triebeard' successfully unpacked and MD5 sums checked
## package 'urltools' successfully unpacked and MD5 sums checked
## package 'httpcode' successfully unpacked and MD5 sums checked
## package 'crul' successfully unpacked and MD5 sums checked
## package 'fontBitstreamVera' successfully unpacked and MD5 sums checked
## package 'fontLiberation' successfully unpacked and MD5 sums checked
## package 'gfonts' successfully unpacked and MD5 sums checked
## package 'fontquiver' successfully unpacked and MD5 sums checked
## package 'officer' successfully unpacked and MD5 sums checked
## package 'gdtools' successfully unpacked and MD5 sums checked
## package 'flextable' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\yingjie\AppData\Local\Temp\Rtmp652jh8\downloaded_packages
##
## flextable installed
## Warning: package 'flextable' was built under R version 4.2.3
##建立R与MongoDB数据库的链接,collection为数据表对应的文件,db为数据库中的表
m<-mongo(collection = "stockbase",db="stockdata")
##查询字段有上市公司代码(symbol)、股票名称(name)、上市日期(list_date),查询条件为上市日期(list_date)大于2021-01-01
newfirm<-m$find(query = '{"list_date":{"$gt":"20210101"}}'
)
DT::datatable(newfirm,rownames = FALSE,extensions = 'Buttons', options = list(dom = 'Bfrtip',buttons = c( 'print')))
2 操纵数据
操纵数据主要有增、删、改,查,对应的主要函数为insert、remove、update,find和drop。具体示例如下:
### 加载软件包
pacman::p_load(data.table,plyr,stringr,dplyr,mongolite,future.apply,future,flextable,DT)
###在test表中创建文件iris
test<-mongo("iris","test")
###产生数据
str <- c('{"name" : "jerry"}' , '{"name": "anna", "age" : 23}', '{"name": "joe"}')
###在test表文件iris中插入数据
test$insert(str)
## List of 6
## $ nInserted : int 3
## $ nMatched : int 0
## $ nModified : int 0
## $ nRemoved : int 0
## $ nUpserted : int 0
## $ writeErrors: list()
###查找iris文件中的数据
test$find()
## name age
## 1 jerry NA
## 2 anna 23
## 3 joe NA
###补全缺失年龄
test$update('{"name":"jerry"}','{"$set":{"age":"12"}}')
## List of 3
## $ modifiedCount: int 1
## $ matchedCount : int 1
## $ upsertedCount: int 0
###查看结果
test$find()
## name age
## 1 jerry 12
## 2 anna 23
## 3 joe <NA>
###删除name为joe的记录
test$remove('{"name":"joe"}')
###查看结果
test$find()
## name age
## 1 jerry 12
## 2 anna 23
###删除表test
test$drop()
3 高级操作
如果要从MongoDB中提取中国平安、贵州茅台、五粮液三只股票2021年4月23日的成交量和收盘价,该怎操作呢?可以照猫画虎,一个stackoverflow上的例子可以参考,点这儿。
##加载软件包
pacman::p_load(data.table,plyr,stringr,dplyr,mongolite,future.apply,future,flextable,DT)
##建立R与MongoDB数据库的链接,collection为数据表对应的文件,db为数据库中的表
m<-mongo(collection = "tradedatadaily",db="stockdata")
##查询字段有上市公司代码(symbol)、成交量(vol)、收盘价(close),查询条件为交易日期(trade_date)等于2021-04-23
vol<-m$find(query = '{"trade_date":"20210423","symbol":{"$in":["601318","600519","000858"]}}',fields = '{"_id":0,"symbol":1,"trade_date":1,"close":1,"vol":1}')
vol
## symbol trade_date close vol
## 1 000858 20210423 280.00 207833.26
## 2 600519 20210423 2108.94 33462.77
## 3 601318 20210423 74.26 867845.14
如果要进行模糊查找,又该怎么办呢?查询使用正则表达式进行,可以查看这个例子。比如,要从MongoDB中提取中国平安、贵州茅台、五粮液三只股票上市以来每年5月20日成交量和收盘价,具体如下:
##加载软件包
pacman::p_load(data.table,plyr,stringr,dplyr,mongolite,future.apply,future,flextable,DT)
##建立R与MongoDB数据库的链接,collection为数据表对应的文件,db为数据库中的表
m<-mongo(collection = "tradedatadaily",db="stockdata")
##查询字段有上市公司代码(symbol)、成交量(vol)、收盘价(close),查询条件为交易日期(trade_date)的结尾为0520
vol<-m$find(query = '{"trade_date":{ "$regex": "(0520)$","$options":"i"},"symbol":{"$in":["601318","600519","000858"]}}',fields = '{"_id":0,"symbol":1,"trade_date":1,"close":1,"vol":1}'
)
vol
## symbol trade_date close vol
## 1 000858 19980520 60.89 9217.00
## 2 000858 19990520 33.10 5618.00
## 3 000858 20020520 14.14 7796.85
## 4 000858 20030520 9.46 20661.19
## 5 000858 20040520 6.59 41677.83
## 6 000858 20050520 7.49 32420.17
## 7 000858 20080520 24.56 338851.43
## 8 000858 20090520 17.10 314612.19
## 9 000858 20100520 23.96 166899.67
## 10 000858 20110520 34.40 177364.42
## 11 000858 20130520 23.48 242832.98
## 12 000858 20140520 16.09 167338.13
## 13 000858 20150520 25.69 545839.26
## 14 000858 20160520 29.44 152625.62
## 15 000858 20190520 104.00 398480.30
## 16 000858 20200520 150.85 170412.14
## 17 000858 20210520 303.51 140239.85
## 18 000858 20220520 162.47 452296.75
## 19 600519 20020520 34.51 3353.69
## 20 600519 20030520 24.75 5148.16
## 21 600519 20040520 36.48 2495.30
## 22 600519 20050520 49.15 5845.02
## 23 600519 20080520 173.09 9927.35
## 24 600519 20090520 116.07 14711.27
## 25 600519 20100520 131.48 22740.45
## 26 600519 20110520 200.83 40043.86
## 27 600519 20130520 195.00 30944.06
## 28 600519 20140520 144.98 46885.91
## 29 600519 20150520 253.95 63771.50
## 30 600519 20160520 251.26 25501.57
## 31 600519 20190520 884.96 44407.99
## 32 600519 20200520 1351.00 21183.40
## 33 600519 20210520 2058.88 21857.16
## 34 600519 20220520 1800.01 41949.60
## 35 601318 20080520 57.04 203398.58
## 36 601318 20090520 40.58 172127.01
## 37 601318 20100520 44.93 307979.03
## 38 601318 20110520 49.98 98703.88
## 39 601318 20130520 40.86 324041.66
## 40 601318 20140520 38.78 160323.90
## 41 601318 20150520 85.57 1432831.23
## 42 601318 20160520 31.82 175392.36
## 43 601318 20190520 77.87 745003.90
## 44 601318 20200520 72.95 274834.84
## 45 601318 20210520 70.72 422472.09
## 46 601318 20220520 44.31 549751.15
最后,给出一个Mongdb命令查询网站, 点这儿