Yingjie

R与Mongo数据库交互演示

薛英杰 / 2021-04-24


MongoDB 是一个基于分布式文件存储的数据库。由 C++ 语言编写。旨在为 WEB 应用提供可扩展的高性能数据存储解决方案。同时,他也是一个介于关系数据库和非关系数据库之间的产品,是非关系数据库当中功能最丰富,最像关系数据库的。MongoDB的安装和具体环境配置可以参考大神的学习笔记

R语言作为一种统计数据分析语言,必然需要输入数据才能进行各种运算,那么当R语言偶遇MongoDB时,该怎么从MongoDB中读入数据,又该如何将数据写入MongoDB呢?

我们都知道R是“包治百病”,与MongoDB交互自然也离不开package。链接MongoDB的R package有三个,分别为:rmongodb、RMongo、Mongolite,他们版本和使用手册如下:

Package:rmongodb(2011)RMongo(2011)Mongolite(2014)

User Manual:rmongodb Usage RMongo Usage Mongolite Usage

目前rmongodb和RMongo都已下架,上面的package和User Manual均为历史遗留版本,不过也都能使用。今天重点介绍一下Mongolite包,通过他来实现R语言与MongoDB的交互,可以参考Presentation: UseR 2015 slides

要实现R语言与MongoDB的交互,首先需要开启MongoDB,在命令行窗口中输入:

    D:Mongo\bin\mongod --dbpath D:Mongo\data\db
    

接下来在R环境中加载相关软件包,建立R与MongoDB链接,具体如下:

##加载软件包
pacman::p_load(data.table,plyr,stringr,dplyr,mongolite,future.apply,future)

##建立R与MongoDB数据库的链接,collection为数据表对应的文件,db为数据库中的表
m<-mongo(collection = "tradedatadaily",db="stockdata")

##断开链接
rm()
gc()
##           used (Mb) gc trigger (Mb) max used (Mb)
## Ncells  790514 42.3    1506261 80.5  1182562 63.2
## Vcells 1323857 10.2    8388608 64.0  2178657 16.7

1.查询数据

假设我们需要从MongoDB数据库中查找2021-04-21股票交易数据,统计各收益水平下的上市公司数量,具体代码如下:

##加载软件包
pacman::p_load(data.table,plyr,stringr,dplyr,mongolite,future.apply,future)

##建立R与MongoDB数据库的链接,collection为数据表对应的文件,db为数据库中的表
m<-mongo(collection = "tradedatadaily",db="stockdata")

##查询字段有上市公司代码(symbol)、收盘价(close)、前收盘价(pre_close),查询条件为交易日期(trade_date)等于2021-04-23

tradedata<-m$find(query = '{"trade_date":"20210423"}',
                  fields = '{"_id":0,"symbol":1,"trade_date":1,"close":1,"pre_close":1}')
head(tradedata)
##   symbol trade_date close pre_close
## 1 000007   20210423  3.73      3.81
## 2 000008   20210423  2.28      2.30
## 3 000005   20210423  2.29      2.30
## 4 000004   20210423 18.64     18.33
## 5 000006   20210423  5.37      5.52
## 6 000009   20210423 10.50     10.38
##计算收益并分组统计上市公司数量
fdstb<-tradedata%>%mutate(rets=close/pre_close-1)%>%mutate(groups=cut(rets,c(-0.25,-0.15,-0.1,-0.05,-0.03,-.01,0,0.01,0.03,0.05,0.1,0.15,0.25),labels = c("<-15%","<-10%","<-5%","<-3%","<-1%","<0%","0%<","1%<","3%<","5%<","10%<","15%<"),right=F))%>%group_by(groups)%>%summarise(cout=n())

barplot(fdstb$cout,names.arg = fdstb$groups,col =c(rep("green",6),rep("red",6)),cex.names = 0.7,ylim = c(0,1700),axes = F)
text(c(0.7,1.9,3.1,4.3,5.5,6.7,7.9,9.1,10.3,11.5,12.6,13.9),fdstb$cout+c(8,8,8,8,15,10,10,10,10.3,9,9,9)+32,fdstb$cout,cex = 0.8)

从MongoDB数据库中提取2021年IPO且市值的公司

##加载软件包
pacman::p_load(data.table,plyr,stringr,dplyr,mongolite,future.apply,future,flextable,DT)
## Installing package into 'C:/Users/yingjie/AppData/Local/R/win-library/4.2'
## (as 'lib' is unspecified)
## also installing the dependencies 'triebeard', 'urltools', 'httpcode', 'crul', 'fontBitstreamVera', 'fontLiberation', 'gfonts', 'fontquiver', 'officer', 'gdtools'
## Warning: unable to access index for repository http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/4.2:
##   cannot open URL 'http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/4.2/PACKAGES'
## package 'triebeard' successfully unpacked and MD5 sums checked
## package 'urltools' successfully unpacked and MD5 sums checked
## package 'httpcode' successfully unpacked and MD5 sums checked
## package 'crul' successfully unpacked and MD5 sums checked
## package 'fontBitstreamVera' successfully unpacked and MD5 sums checked
## package 'fontLiberation' successfully unpacked and MD5 sums checked
## package 'gfonts' successfully unpacked and MD5 sums checked
## package 'fontquiver' successfully unpacked and MD5 sums checked
## package 'officer' successfully unpacked and MD5 sums checked
## package 'gdtools' successfully unpacked and MD5 sums checked
## package 'flextable' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
## 	C:\Users\yingjie\AppData\Local\Temp\Rtmp652jh8\downloaded_packages
## 
## flextable installed
## Warning: package 'flextable' was built under R version 4.2.3
##建立R与MongoDB数据库的链接,collection为数据表对应的文件,db为数据库中的表
m<-mongo(collection = "stockbase",db="stockdata")

##查询字段有上市公司代码(symbol)、股票名称(name)、上市日期(list_date),查询条件为上市日期(list_date)大于2021-01-01

newfirm<-m$find(query = '{"list_date":{"$gt":"20210101"}}'
                  )
DT::datatable(newfirm,rownames = FALSE,extensions = 'Buttons', options = list(dom = 'Bfrtip',buttons = c( 'print')))

2 操纵数据

操纵数据主要有增、删、改,查,对应的主要函数为insert、remove、update,find和drop。具体示例如下:

### 加载软件包
pacman::p_load(data.table,plyr,stringr,dplyr,mongolite,future.apply,future,flextable,DT)

###在test表中创建文件iris
test<-mongo("iris","test")

###产生数据
str <- c('{"name" : "jerry"}' , '{"name": "anna", "age" : 23}', '{"name": "joe"}')

###在test表文件iris中插入数据
test$insert(str)
## List of 6
##  $ nInserted  : int 3
##  $ nMatched   : int 0
##  $ nModified  : int 0
##  $ nRemoved   : int 0
##  $ nUpserted  : int 0
##  $ writeErrors: list()
###查找iris文件中的数据
test$find()
##    name age
## 1 jerry  NA
## 2  anna  23
## 3   joe  NA
###补全缺失年龄
test$update('{"name":"jerry"}','{"$set":{"age":"12"}}')
## List of 3
##  $ modifiedCount: int 1
##  $ matchedCount : int 1
##  $ upsertedCount: int 0
###查看结果
test$find()
##    name  age
## 1 jerry   12
## 2  anna   23
## 3   joe <NA>
###删除name为joe的记录
test$remove('{"name":"joe"}')

###查看结果
test$find()
##    name age
## 1 jerry  12
## 2  anna  23
###删除表test
test$drop()

3 高级操作

如果要从MongoDB中提取中国平安、贵州茅台、五粮液三只股票2021年4月23日的成交量和收盘价,该怎操作呢?可以照猫画虎,一个stackoverflow上的例子可以参考,点这儿

##加载软件包
pacman::p_load(data.table,plyr,stringr,dplyr,mongolite,future.apply,future,flextable,DT)

##建立R与MongoDB数据库的链接,collection为数据表对应的文件,db为数据库中的表
m<-mongo(collection = "tradedatadaily",db="stockdata")

##查询字段有上市公司代码(symbol)、成交量(vol)、收盘价(close),查询条件为交易日期(trade_date)等于2021-04-23

vol<-m$find(query = '{"trade_date":"20210423","symbol":{"$in":["601318","600519","000858"]}}',fields = '{"_id":0,"symbol":1,"trade_date":1,"close":1,"vol":1}')
vol
##   symbol trade_date   close       vol
## 1 000858   20210423  280.00 207833.26
## 2 600519   20210423 2108.94  33462.77
## 3 601318   20210423   74.26 867845.14

如果要进行模糊查找,又该怎么办呢?查询使用正则表达式进行,可以查看这个例子。比如,要从MongoDB中提取中国平安、贵州茅台、五粮液三只股票上市以来每年5月20日成交量和收盘价,具体如下:

##加载软件包
pacman::p_load(data.table,plyr,stringr,dplyr,mongolite,future.apply,future,flextable,DT)

##建立R与MongoDB数据库的链接,collection为数据表对应的文件,db为数据库中的表
m<-mongo(collection = "tradedatadaily",db="stockdata")

##查询字段有上市公司代码(symbol)、成交量(vol)、收盘价(close),查询条件为交易日期(trade_date)的结尾为0520

vol<-m$find(query = '{"trade_date":{ "$regex": "(0520)$","$options":"i"},"symbol":{"$in":["601318","600519","000858"]}}',fields = '{"_id":0,"symbol":1,"trade_date":1,"close":1,"vol":1}'
  )
vol
##    symbol trade_date   close        vol
## 1  000858   19980520   60.89    9217.00
## 2  000858   19990520   33.10    5618.00
## 3  000858   20020520   14.14    7796.85
## 4  000858   20030520    9.46   20661.19
## 5  000858   20040520    6.59   41677.83
## 6  000858   20050520    7.49   32420.17
## 7  000858   20080520   24.56  338851.43
## 8  000858   20090520   17.10  314612.19
## 9  000858   20100520   23.96  166899.67
## 10 000858   20110520   34.40  177364.42
## 11 000858   20130520   23.48  242832.98
## 12 000858   20140520   16.09  167338.13
## 13 000858   20150520   25.69  545839.26
## 14 000858   20160520   29.44  152625.62
## 15 000858   20190520  104.00  398480.30
## 16 000858   20200520  150.85  170412.14
## 17 000858   20210520  303.51  140239.85
## 18 000858   20220520  162.47  452296.75
## 19 600519   20020520   34.51    3353.69
## 20 600519   20030520   24.75    5148.16
## 21 600519   20040520   36.48    2495.30
## 22 600519   20050520   49.15    5845.02
## 23 600519   20080520  173.09    9927.35
## 24 600519   20090520  116.07   14711.27
## 25 600519   20100520  131.48   22740.45
## 26 600519   20110520  200.83   40043.86
## 27 600519   20130520  195.00   30944.06
## 28 600519   20140520  144.98   46885.91
## 29 600519   20150520  253.95   63771.50
## 30 600519   20160520  251.26   25501.57
## 31 600519   20190520  884.96   44407.99
## 32 600519   20200520 1351.00   21183.40
## 33 600519   20210520 2058.88   21857.16
## 34 600519   20220520 1800.01   41949.60
## 35 601318   20080520   57.04  203398.58
## 36 601318   20090520   40.58  172127.01
## 37 601318   20100520   44.93  307979.03
## 38 601318   20110520   49.98   98703.88
## 39 601318   20130520   40.86  324041.66
## 40 601318   20140520   38.78  160323.90
## 41 601318   20150520   85.57 1432831.23
## 42 601318   20160520   31.82  175392.36
## 43 601318   20190520   77.87  745003.90
## 44 601318   20200520   72.95  274834.84
## 45 601318   20210520   70.72  422472.09
## 46 601318   20220520   44.31  549751.15

最后,给出一个Mongdb命令查询网站, 点这儿