fetch_rcv1

sklearn.datasets.fetch_rcv1(*, data_home=None, subset='all', download_if_missing=True, random_state=None, shuffle=False, return_X_y=False, n_retries=3, delay=1.0)

Load the RCV1 multilabel dataset (classification).

Download it if necessary.

Version: RCV1-v2, vectors, full sets, topics multilabels.

Classes            103
Samples total      804414
Dimensionality     47236
Features           real, between 0 and 1

Read more in the User Guide.

Added in version 0.17.

Parameters:
data_home : str or path-like, default=None

Specify another download and cache folder for the datasets. By default all scikit-learn data is stored in '~/scikit_learn_data' subfolders.

subset : {'train', 'test', 'all'}, default='all'

Select the dataset to load: 'train' for the training set (23149 samples), 'test' for the test set (781265 samples), 'all' for both, with the training samples first if shuffle is False. This follows the official LYRL2004 chronological split.

download_if_missing : bool, default=True

If False, raise an OSError if the data is not locally available instead of trying to download the data from the source site.

random_state : int, RandomState instance or None, default=None

Determines random number generation for dataset shuffling. Pass an int for reproducible output across multiple function calls. See Glossary.

shuffle : bool, default=False

Whether to shuffle the dataset.

return_X_y : bool, default=False

If True, returns (dataset.data, dataset.target) instead of a Bunch object. See below for more information about the dataset.data and dataset.target objects.

Added in version 0.20.

n_retries : int, default=3

Number of retries when HTTP errors are encountered.

Added in version 1.5.

delay : float, default=1.0

Number of seconds between retries.

Added in version 1.5.
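As a rough illustration of the subset description above: with shuffle=False, subset='all' places the 23149 training samples first, so the train/test split can be recovered by row slicing. The sketch below uses a small toy matrix in place of the real data. The names split_train_test, X_toy, N_TRAIN and N_TEST are made up for this illustration and are not part of scikit-learn.

```python
import numpy as np
from scipy import sparse

# Subset sizes quoted in the `subset` parameter description.
N_TRAIN = 23149
N_TEST = 781265


def split_train_test(X, n_train=N_TRAIN):
    """Split the rows of an unshuffled 'all' matrix into (train, test).

    Hypothetical helper: it assumes the training rows come first,
    which the description states holds only when shuffle is False.
    """
    return X[:n_train], X[n_train:]


# Toy stand-in for the real data: 5 rows, the first 2 treated as "train".
X_toy = sparse.csr_matrix(np.arange(15, dtype=np.float64).reshape(5, 3))
X_train, X_test = split_train_test(X_toy, n_train=2)
```

The two subset sizes add up to the 804414 samples reported for the full dataset, which is a quick sanity check on the slicing boundary.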

Returns:
dataset : Bunch

Dictionary-like object. Returned only if return_X_y is False. dataset has the following attributes:

  • data : sparse matrix of shape (804414, 47236), dtype=np.float64

    About 0.16% of the entries are nonzero. The matrix is in CSR format.

  • target : sparse matrix of shape (804414, 103), dtype=np.uint8

    Each sample has a value of 1 in its categories, and 0 in others. About 3.15% of the entries are nonzero. The matrix is in CSR format.

  • sample_id : ndarray of shape (804414,), dtype=np.uint32

    Identification number of each sample, as ordered in dataset.data.

  • target_names : ndarray of shape (103,), dtype=object

    Names of each target (RCV1 topics), as ordered in dataset.target.

  • DESCR : str

    Description of the RCV1 dataset.

(data, target) : tuple

A tuple consisting of dataset.data and dataset.target, as described above. Returned only if return_X_y is True.

Added in version 0.20.
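The returned target is a sparse multilabel indicator matrix whose columns follow target_names, so a single topic can be pulled out as an ordinary binary label vector. The sketch below shows one possible way to do that, and how a nonzero fraction like the ones quoted above can be computed. It runs on a tiny toy matrix rather than the real dataset; density and binary_labels are hypothetical helper names, and the four topic codes are only a small stand-in for the 103 real ones.

```python
import numpy as np
from scipy import sparse

# Toy stand-ins for dataset.target_names and dataset.target.
target_names = np.array(["CCAT", "ECAT", "GCAT", "MCAT"], dtype=object)
target = sparse.csr_matrix(
    np.array(
        [
            [1, 0, 0, 0],
            [0, 0, 1, 0],
            [1, 1, 0, 0],
        ],
        dtype=np.uint8,
    )
)


def density(matrix):
    """Fraction of entries that are nonzero in a sparse matrix."""
    n_rows, n_cols = matrix.shape
    return matrix.nnz / (n_rows * n_cols)


def binary_labels(target, target_names, topic):
    """Return a dense 0/1 vector for one topic column (hypothetical helper)."""
    col = int(np.flatnonzero(target_names == topic)[0])
    return target[:, col].toarray().ravel()


y_ccat = binary_labels(target, target_names, "CCAT")  # array([1, 0, 1], dtype=uint8)
toy_density = density(target)  # 4 nonzero entries out of 12
```

Densifying one column at a time keeps memory use modest; converting the whole target matrix to a dense array is usually unnecessary.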

Examples

>>> from sklearn.datasets import fetch_rcv1
>>> rcv1 = fetch_rcv1()
>>> rcv1.data.shape
(804414, 47236)
>>> rcv1.target.shape
(804414, 103)
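The call above downloads the data on first use. Where network access is not wanted, the documented download_if_missing=False behaviour (an OSError when nothing is cached) can be wrapped in a small helper. This is a minimal sketch: load_rcv1_offline is a hypothetical name, and the temporary directory only stands in for a data_home that has no cached copy.

```python
import tempfile

from sklearn.datasets import fetch_rcv1


def load_rcv1_offline(data_home=None, subset="train"):
    """Return (X, y) from the local cache, or None if nothing is cached.

    Hypothetical helper: it never triggers a download. Note that when a
    cached copy does exist, this loads the requested subset into memory.
    """
    try:
        return fetch_rcv1(
            data_home=data_home,
            subset=subset,
            download_if_missing=False,
            return_X_y=True,
        )
    except OSError:
        return None


# An empty directory has no cached copy, so the helper returns None.
with tempfile.TemporaryDirectory() as empty_dir:
    result = load_rcv1_offline(data_home=empty_dir)
```

Called with data_home=None, the same helper would look in the default scikit-learn data folder instead.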