使用自定义数据进行Qlib回测

天之道,损有余而补不足;人之道,损不足而奉有余

Qlib是微软出品的开源量化投资平台,包含数据处理、模型构建、交易策略、决策执行、数据分析等一系列功能模块。在对股票市场数据的预测和分析中,我们常常需要对一些模型在历史数据上进行回测。而Qlib提供了这样一个回测框架,能够在自己收集到的数据上进行测试,并输出相应的图形化结果,下面就来试试吧!

数据准备

这里以沪深300指数数据为例子,因为Qlib自带的数据时间只到2020年,所以需要我们自己去下载最新的数据。这里我选择baostock来下载数据,页面中也给出了对应的python代码。在Qlib中指数数据被称为benchmark,与股票市场数据区分开来,具体可以看下面的代码注释:

CSI300_BENCH = "SH000300"
# Benchmark is for calculating the excess return of your strategy.
# Its data format will be like **ONE normal instrument**.
# For example, you can query its data with the code below
# `D.features(["SH000300"], ["$close"], start_time='2010-01-01', end_time='2017-12-31', freq='day')`
# It is different from the argument `market`, which indicates a universe of stocks (e.g. **A SET** of stocks like csi300)
# For example, you can query all data from a stock market with the code below.
# ` D.features(D.instruments(market='csi300'), ["$close"], start_time='2010-01-01', end_time='2017-12-31', freq='day')`

首先可以根据时间测试一下是否有对应的数据,为了方便导入,首先将baostock中的沪深300数据下载下来存储为csv文件,具体代码如下:

from pprint import pprint

import baostock as bs
import qlib
from qlib.constant import REG_CN
import pandas as pd
from qlib.backtest import backtest, executor
from qlib.contrib.evaluate import risk_analysis
from qlib.contrib.strategy import TopkDropoutStrategy
# load lib for benchmark
from qlib.backtest.report import PortfolioMetrics
from qlib.utils.time import Freq
from qlib.utils import flatten_dict
from qlib.contrib.report import analysis_position


# init qlib
qlib.init(provider_uri=r'data/qlib/cn_data', region=REG_CN)

def test_qlib_benchmark(benchmark, start_time, end_time):
    benchmark_config = {
        "benchmark": benchmark,
        "start_time": start_time,
        "end_time": end_time,
    }
    portfolio_metrics = PortfolioMetrics('day', benchmark_config)
    print(portfolio_metrics.bench)

def add_qlib_benchmark(index_code, start_time, end_time):
    '''
    Description: 获取指数(综合指数、规模指数、一级行业指数、二级行业指数、策略指数、成长指数、价值指数、主题指数)K线数据并添加到qlib数据以进行回测
    Parameters:
        index_code:指数代码,综合指数,例如:sh.000001 上证指数,sz.399106 深证综指 等;
            规模指数,例如:sh.000016 上证50,sh.000300 沪深300,sh.000905 中证500,sz.399001 深证成指等;
            一级行业指数,例如:sh.000037 上证医药,sz.399433 国证交运 等;
            二级行业指数,例如:sh.000952 300地产,sz.399951 300银行 等;
            策略指数,例如:sh.000050 50等权,sh.000982 500等权 等;
            成长指数,例如:sz.399376 小盘成长 等;
            价值指数,例如:sh.000029 180价值 等;
            主题指数,例如:sh.000015 红利指数,sh.000063 上证周期 等;
        start_time: 开始时间
        end_time: 结束时间,要比实际时间+1天
    '''
    
    # 登陆系统
    lg = bs.login()
    # 显示登陆返回信息
    print('login respond error_code:'+lg.error_code)
    print('login respond  error_msg:'+lg.error_msg)
    rs = bs.query_history_k_data_plus(index_code,
        "date,code,open,high,low,close,preclose,volume,amount,pctChg",
        start_date=start_time, end_date=end_time, frequency="d")
    print('query_history_k_data_plus respond error_code:'+rs.error_code)
    print('query_history_k_data_plus respond  error_msg:'+rs.error_msg)

    # 打印结果集
    data_list = []
    while (rs.error_code == '0') & rs.next():
        # 获取一条记录,将记录合并在一起
        data_list.append(rs.get_row_data())
    result = pd.DataFrame(data_list, columns=rs.fields)
    # 结果集输出到csv文件
    result.rename(columns={'code':'symbol'},inplace=True)
    result.loc[:, 'symbol'] = "SH000300"
    result.to_csv(r"data/csv/SH000300.csv", index=False)
    print(result)

    # 登出系统
    bs.logout()

# init qlib
    qlib.init(provider_uri=r'data/qlib/cn_data', region=REG_CN)
    try:
        test_qlib_benchmark('SH000300', "2023-03-06", "2023-08-24")
    except ValueError:
        print('Please add SH000300 csv data with "add_custom_benchmark.sh"!')
        add_qlib_benchmark('sh.000300', "2023-03-06", "2023-08-25")
        exit(0)

在回测时,沪深300的成分股市场数据也需要导入到Qlib中来计算超额收益,同样地我将它们存成了csv格式,示例代码如下:

def transform_qlib_format(df):
    sym = np.unique(df['symbol'])
    #np.savetxt(r'data/csi300.txt',sym,fmt='%s')
    for i,isym in tqdm(enumerate(sym)):
        ohlcv_filter = df[df['symbol'] == isym]
        ohlcv_filter.to_csv(r'data/csv/'+isym+'.csv', index=False)

注意要将股票代码的列名命名为symbol,其它列名一般就是包含开盘价(open)、收盘价(close)、最高价(high)、最低价(low)、交易量(volume)。然后就是根据官方提供的工具,将数据导入Qlib目录,具体bash脚本如下:

#!/usr/bin/bash

FILE="./dump_bin.py"
if [ -f "$FILE" ]; then
    echo "$FILE exists."
else
    echo "$FILE does not exist! try download from github using curl"
    curl -O https://raw.githubusercontent.com/microsoft/qlib/refs/heads/main/scripts/dump_bin.py
fi

FILE="./data/qlib/cn_data"
if [ -f "$FILE" ]; then
    echo "$FILE exists."
else
    echo "$FILE does not exist! Creating folder..."
    mkdir -p data/qlib/cn_data
fi

python dump_bin.py dump_all --csv_path data/csv --qlib_dir data/qlib/cn_data/ --include_fields open,close,high,low,volume --symbol_field_name symbol --date_field_name date

然后再次运行一下上面的test_qlib_benchmark函数,应该能够正常读到数据。

模型回测

这里模型回测需要用到自己的模型预测的输出,同样也存储成csv文件。列名分别是:datetime,instrument,score,label。然后就可以配合官方的回测代码进行测试,我修改后的代码如下:

#def backtest_custom_benchmark(BENCH: str = "SH000300", start_date: str = "2023-03-06", end_date: str = "2023-08-24", pred_score: pd.Series=pd.Series([],dtype=int)):
"""
Backtest with custom benchmark
"""
# Benchmark is for calculating the excess return of your strategy.
# Its data format will be like **ONE normal instrument**.
# For example, you can query its data with the code below
# `D.features(["SH000300"], ["$close"], start_time='2010-01-01', end_time='2017-12-31', freq='day')`
# It is different from the argument `market`, which indicates a universe of stocks (e.g. **A SET** of stocks like csi300)
# For example, you can query all data from a stock market with the code below.
# ` D.features(D.instruments(market='csi300'), ["$close"], start_time='2010-01-01', end_time='2017-12-31', freq='day')`

pred_score = pd.read_csv(r'data/output/test.csv')
pred_score["datetime"] = pd.to_datetime(pred_score["datetime"])
pred_label = pred_score.set_index(["instrument", "datetime"])
pred_score = pred_score.set_index(["datetime", "instrument"])["score"]
    
STRATEGY_CONFIG = {
    "topk": 50,
    "n_drop": 5,
    # pred_score, pd.Series
    "signal": pred_score,
}

EXECUTOR_CONFIG = {
    "time_per_step": "day",
    "generate_portfolio_metrics": True,
}

backtest_config = {
    "start_time": "2023-03-06",
    "end_time": "2023-08-24",
    "account": 100000000,
    "benchmark": "SH000300",
    "exchange_kwargs": {
        "freq": "day",
        "limit_threshold": 0.095,
        "deal_price": "close",
        "open_cost": 0.0005,
        "close_cost": 0.0015,
        "min_cost": 5,
    },
}

# strategy object
strategy_obj = TopkDropoutStrategy(**STRATEGY_CONFIG)
# executor object
executor_obj = executor.SimulatorExecutor(**EXECUTOR_CONFIG)
# backtest
portfolio_metric_dict, indicator_dict = backtest(executor=executor_obj, strategy=strategy_obj, **backtest_config)
analysis_freq = "{0}{1}".format(*Freq.parse("day"))
# backtest info
report_normal, positions_normal = portfolio_metric_dict.get(analysis_freq)

# analysis
analysis = dict()
analysis["excess_return_without_cost"] = risk_analysis(
    report_normal["return"] - report_normal["bench"], freq=analysis_freq
)
analysis["excess_return_with_cost"] = risk_analysis(
    report_normal["return"] - report_normal["bench"] - report_normal["cost"], freq=analysis_freq
)

analysis_df = pd.concat(analysis)  # type: pd.DataFrame
# log metrics
analysis_dict = flatten_dict(analysis_df["risk"].unstack().T.to_dict())
# print out results
pprint(f"The following are analysis results of benchmark return({analysis_freq}).")
pprint(risk_analysis(report_normal["bench"], freq=analysis_freq))
pprint(f"The following are analysis results of the excess return without cost({analysis_freq}).")
pprint(analysis["excess_return_without_cost"])
pprint(f"The following are analysis results of the excess return with cost({analysis_freq}).")
pprint(analysis["excess_return_with_cost"])

最后还可以拿工具画出回测结果分析图,如下:

analysis_position.report_graph(report_normal)
analysis_position.score_ic_graph(pred_label)

return_bench

rank_ic

如果您觉得该文章对您有用,欢迎打赏作者,激励创作!
Welcome to tip the author!

微信(WeChat Pay) 支付宝(AliPay)
比特币(Bitcoin) 以太坊(Ethereum)
以太坊(Base) 索拉纳(Solana)