Working with Data

Reading and Writing Data

This is the reference documentation for the functions and classes inside PyPWA that can be used for parsing and writing data to disk. There are four different ways to do so, each described in a section below.

PyPWA also defines vector data types and collections for working with Particles, Four Vectors, and Three Vectors; these are described in the Builtin Vectors section below.

Reading and Writing Data

Reading and writing from disk to memory. This method will load the entire dataset straight into RAM, or write a dataset straight from RAM onto disk.

PyPWA.read(filename, use_pandas=False, cache=True, clear_cache=False)

Reads the entire file and returns either a DataFrame, ParticlePool, or standard numpy array, depending on the data found inside the file.

Parameters
  • filename (Path, str) – File to read.

  • use_pandas (bool) – Determines if a numpy data type or pandas data type is returned.

  • cache (bool, optional) – Enables or disables caching. Defaults to enabled. Leaving this enabled should do no harm unless caching itself is broken. Disable it for debugging if you suspect the wrong data is being returned; if the wrong data is still returned with caching disabled, then caching isn’t the issue.

  • clear_cache (bool, optional) – Forcefully clears the cache for the files that are parsed. Instead of loading the cache, it will delete the existing cache and write a new cache object in its place if cache is enabled.

Returns

  • DataFrame – If the file is a kVars file, CSV, or TSV

  • npy.ndarray – If the file is a numpy file, PF file, or single column txt file

  • ParticlePool – If parsing a gamp file

Raises

RuntimeError – If there is no plugin that can load the data found

PyPWA.write(filename, data, cache=True, clear_cache=False)

Writes a ParticlePool, DataFrame, or numpy array to the given file.

Parameters
  • filename (Path, str) – The filename of the file you wish to write

  • data (DataFrame, ParticlePool, or npy.ndarray) – The data you wish to write to disk

  • cache (bool, optional) – Enables or disables caching. Defaults to enabled. Leaving this enabled should do no harm unless something is broken with caching.

  • clear_cache (bool, optional) – Forcefully clears the cache for the files that are written. It will delete the existing cache and write a new cache object in its place when cache is enabled.

Raises

RuntimeError – If there is no plugin that can write the data provided

Basic Data Sanitization

Allows quick conversion of data from Pandas to Numpy, and preps data to be passed to non-Python functions and classes, such as Fortran modules compiled with f2py, or C/C++ modules bound with Cython.

PyPWA.pandas_to_numpy(df)

Converts Pandas DataTypes to Numpy

Takes a Pandas Series or DataFrame and converts it to Numpy. Pandas does have a built-in to_records function; however, records are slower than Structured Arrays, while offering much of the same functionality.

Parameters

df (Pandas Series or DataFrame) – The pandas data structure that you wish to be converted to standard Numpy Structured Arrays

Returns

The resulting Numpy array or structured array containing the data from the original DataFrame or Series. If it was a Series with each row named (like an element from a DataFrame), the result is a Structured Array of length 1; if it was a standard Series, a plain Numpy array is returned; and if it was a DataFrame, the result is a Structured Array matching the types and names from the DataFrame.

Return type

Numpy ArrayLike

PyPWA.to_contiguous(data, names)

Convert DataFrame or Structured Array to List of Contiguous Arrays

This takes a dataset and a list of column names and converts those columns into contiguous arrays. The reason to use contiguous arrays over DataFrames or Structured arrays is that the memory is better aligned, which improves the speed of computation. However, this doubles the memory requirements of your dataset, since all the events are copied into the new arrays. Use this only in amplitudes where you need to maximize speed.

Parameters
  • data (Structured Array, DataFrame, or Dict-like) – This is the data frame or Structured array that you want to extract columns from

  • names (List of Column Names or str) – This is either a list of columns you want from the array, or a single column you want from the array

Returns

If you provide only a single column name, a single array with that column’s data is returned. If you supply multiple columns in a list or tuple, a tuple of arrays is returned in the same order as the supplied names.

Return type

ArrayLike or Tuple[ArrayLike]
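
The layout argument above can be illustrated with plain NumPy (this is a sketch of the idea, not PyPWA’s implementation; `to_contiguous_sketch` is a hypothetical helper): a column view of a structured array is strided, and `numpy.ascontiguousarray` copies it into one contiguous block.

```python
import numpy as np

# A small structured array standing in for an event dataset
events = np.zeros(1000, dtype=[("x", "f8"), ("y", "f8"), ("z", "f8")])
events["x"] = np.arange(1000)

def to_contiguous_sketch(data, names):
    """Extract one or more columns as C-contiguous arrays (copies the data)."""
    if isinstance(names, str):
        return np.ascontiguousarray(data[names])
    return tuple(np.ascontiguousarray(data[n]) for n in names)

x = to_contiguous_sketch(events, "x")          # single column -> single array
x_y = to_contiguous_sketch(events, ["x", "y"])  # list of columns -> tuple of arrays
```

In this sketch each extracted column lands in its own contiguous memory block, at the cost of copying every event.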

Data Iterators and Writers

Reading and writing a single event at a time instead of keeping the entire contents of the dataset in memory at once. This is a good choice if you want to rapidly transform the data that is on disk.

class PyPWA.DataType(value)

Enumeration for type of data to be read or written using the reader and writer.

Because of how the reader and writer are designed, they cannot inspect the data before they start working with it. This enum is used to specify the type of data you’re working with.

  • BASIC = Standard arrays with no columns

  • STRUCTURED = Columned array (CSV, TSV, DataFrames)

  • TREE_VECTOR = Particle Data (GAMP)
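
The enum above can be sketched in plain Python. The member values here are assumptions for illustration; the reference documents only the names:

```python
from enum import Enum, auto

class DataType(Enum):
    # Standard arrays with no columns (e.g. single-column txt, PF files)
    BASIC = auto()
    # Columned data (CSV, TSV, structured arrays, DataFrames)
    STRUCTURED = auto()
    # Particle data (GAMP files), handled as ParticlePool events
    TREE_VECTOR = auto()

# Members can be selected by name, as you would when calling get_writer
chosen = DataType["TREE_VECTOR"]
```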

PyPWA.get_writer(filename, dtype)

Returns a writer that can write to the file one event at a time

Parameters
  • filename (str, Path) – The file that you want to write to

  • dtype (DataType) – Specifies the type of data that needs to be written. TREE_VECTOR is used for ParticlePools and only works with the ‘.gamp’ extension for now. STRUCTURED is used for both numpy structured arrays and pandas DataFrames. BASIC is used for standard numpy arrays.

Returns

A writer that can write to the file, defined in PyPWA.plugins.data

Return type

templates.WriterBase

Raises

RuntimeError – If there is no plugin that can write the data found

See also

write

Writes a ParticlePool, DataFrame, or array to file

Examples

The writer can be used to write a ParticlePool one event at a time

>>> writer = get_writer("example.gamp", DataType.TREE_VECTOR)
>>> for event in particles.iter_events():
...     writer.write(event)
>>> writer.close()

PyPWA.get_reader(filename, use_pandas=False)

Returns a reader that can read the file one event at a time

Note

The return value from the reader could be a pointer. If you need to keep the event without it being overwritten on the next call, you must call the copy method on the returned data to get a unique copy.

Parameters
  • filename (str, Path) – File to read

  • use_pandas (bool) – Determines if a numpy data type or pandas data type is returned.

Returns

A reader that can read the file, defined in PyPWA.plugins.data

Return type

templates.ReaderBase

Raises

RuntimeError – If there is no plugin that can load the data found

See also

read

Reads an entire file into a DataFrame, ParticlePool, or array

Examples

The reader can be used inside a standard for loop

>>> reader = get_reader("example.gamp")
>>> for event in reader:
...     my_kept_event = event.copy()
...     regular_event = event
>>> reader.close()
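
Why the copy matters can be shown without PyPWA at all. `BufferedReader` below is a hypothetical stand-in for a reader that reuses one internal buffer between events:

```python
class BufferedReader:
    """Toy reader that overwrites a single internal buffer per event."""

    def __init__(self, events):
        self._events = events
        self._buffer = [0.0] * len(events[0])

    def __iter__(self):
        for event in self._events:
            self._buffer[:] = event  # overwrite in place, like a C-backed reader
            yield self._buffer

reader = BufferedReader([[1.0, 2.0], [3.0, 4.0]])
kept_refs, kept_copies = [], []
for event in reader:
    kept_refs.append(event)           # every entry aliases the same buffer
    kept_copies.append(event.copy())  # independent snapshots
```

After the loop, every element of `kept_refs` shows the last event read, while `kept_copies` preserves each event as it was seen.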

Working with HDF5

Working directly with HDF5 datasets. These datasets offer large speed advantages over traditional flat files, benefit from extensive development by the HDF Group, and support chunked loading and on-the-fly data compression.

class PyPWA.ProjectDatabase(file, mode)

Larger-than-memory data manipulation inside an HDF5 file.

Allows the user to operate on data larger than the system’s RAM. This supports all data types, multiprocessing, and binning of ParticlePool data.

Data is stored in the HDF5 file using groups, or folders. Each file can have multiple folders with different data, each accessible independently of the others. Each folder has root data that must be either a ParticlePool or DataFrame; other data can then be added alongside the root data, either as a managed data type that the table manages for the user, or unmanaged, where the user must ensure there are no name conflicts or other issues.

Parameters
  • file (str, Path) – The name of the HDF5 file, most commonly with an hd5 extension

  • mode (str) – Either ‘a’ or ‘r’, for append or read-only respectively. If you try to open the table in write mode using ‘w’, it will be changed to append (‘a’) mode instead to avoid unintentionally overwriting data. Use Path to delete the file if you wish to start fresh.
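
The mode coercion described above can be sketched as a small helper (`coerce_mode` is hypothetical, for illustration only):

```python
def coerce_mode(mode: str) -> str:
    """Mirror ProjectDatabase's documented behavior: 'w' is downgraded
    to 'a' so opening a table can never silently overwrite data."""
    if mode == "w":
        return "a"
    if mode not in ("a", "r"):
        raise ValueError("mode must be 'a' or 'r'")
    return mode
```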

See also

PyPWA.libs.binning, PyPWA.bin_by_range

get_folder(name)

Returns a base folder from the HDF5 file that was previously created.

Parameters

name (str) – The name of the folder to retrieve from the file.

Caching

Using pickles to quickly write and read data straight from disk as intermediate caching steps. These are special functions that allow caching values or program state quickly for resuming later. This is a good way to save essential data in a Jupyter Notebook so that data isn’t lost if the kernel is restarted.

PyPWA.cache.read(path, intermediate=True, remove_cache=False)

Reads a cache object

This reads cache objects from the disk. With its default settings, it will load the cache file as long as the source file’s hash hasn’t changed. It can also be used to store an intermediate step directly by providing a name and setting intermediate to True.

Parameters
  • path (Path or str) – The path of the source file, or the path where you want the intermediate step to be stored.

  • intermediate (bool) – If set to true, the cache will be treated as an intermediate step; it is assumed there is no source file associated with the data, and file hashes will not be checked.

  • remove_cache (bool) – Setting this to true will remove the cache.

Returns

The first value in the tuple is whether the cache is valid or not and the second value in the returned tuple is whatever data was stored in the cache.

Return type

Tuple[bool, any]

PyPWA.cache.write(path, data, intermediate=False)

Writes a cache file

With its default settings, it’ll write the cache file into the cache location and store the source file’s hash in the cache for future comparison. If intermediate is set to true though, it’ll store the cache in the provided location, and will not store a hash.

Parameters
  • path (Path or str) – The path of the source file, or the path where you want the intermediate step to be stored.

  • data (Any) – Whatever data you wish to be stored in the cache. Almost anything that can be stored in a variable, can be stored on disk.

  • intermediate (bool) – If set to true, the cache will be treated as an intermediate step; it is assumed there is no source file associated with the data, and file hashes will not be checked.

Binning

We provide functions that make binning data in memory an easy process; for HDF5, a more in-depth example and documentation will be made available in the future.

PyPWA.bin_with_fixed_widths(dataframe, bin_series, fixed_size, lower_cut=None, upper_cut=None)

Bins a dataframe into fixed-size bins using a series in memory

Bins an input array by a fixed number of events in memory. You must put all data you want binned into the DataFrame or Structured Array before use. Each resulting bin can be further binned if you desire.

If fixed_size does not evenly divide into the length of bin_series, the first and last bins will contain the overflow.

Parameters
  • dataframe (DataFrame or Structured Array) – The dataframe or numpy array that you wish to break into bins

  • bin_series (Array-like) – Data that you want to bin by, selectable by user. Must have the same length as dataframe. If a column name is provided, that column will be used from the dataframe.

  • fixed_size (int) – The number of events you want in each bin.

  • lower_cut (float, optional) – The lower cut off for the dataset, if not provided it will be set to the smallest value in the bin_series

  • upper_cut (float, optional) – The upper cut off for the dataset, if not provided will be set to the largest value in the bin_series

Returns

A list of array-likes that have been masked off of the input bin_series.

Return type

List[DataFrame or Structured Array]

Raises

ValueError – If the length of the input array and bin array don’t match

Warning

This function does all binning in memory. If you are working with a large dataset that doesn’t fit in memory, or you run out of memory while binning, you must use a different binning method.

See also

PyPWA.libs.file.project

A numerical dataset that supports binning on disk instead of in-memory. It’s slower and requires more steps to use, but should work even on memory limited systems.

Examples

Binning a DataFrame with values x, y, and z using z to bin

>>> data = {
...     "x": npy.random.rand(1000), "y": npy.random.rand(1000),
...     "z": (npy.random.rand(1000) * 100) - 50
... }
>>> df = pd.DataFrame(data)
>>> list(df.columns)
["x", "y", "z"]

This will give us a usable DataFrame. Now make a series out of z and use it to bin the data into fixed-size bins of 250 events.

>>> binning = df["z"]
>>> range_bins = bin_with_fixed_widths(df, binning, 250)
>>> len(range_bins)
4

Each bin should contain exactly 250 events

>>> [len(abin) for abin in range_bins]
[250, 250, 250, 250]

That will give you 4 bins with exactly the same number of events per bin, plus up to 2 more overflow bins if needed.
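
The fixed-size rule, including the overflow behavior, can be sketched in plain Python. How the overflow is divided between the first and last bins is an assumption here; the reference only says those two bins receive it:

```python
def fixed_width_partition(n_events, fixed_size):
    """Return bin sizes: full bins of fixed_size, with any remainder
    split between a leading and a trailing overflow bin."""
    full, remainder = divmod(n_events, fixed_size)
    if remainder == 0:
        return [fixed_size] * full
    lead = remainder // 2
    tail = remainder - lead
    return [lead] + [fixed_size] * full + [tail]

# 1000 events in bins of 250 divide evenly: four equal bins.
# 1000 events in bins of 300 leave 100 overflow events at the edges.
even = fixed_width_partition(1000, 250)
uneven = fixed_width_partition(1000, 300)
```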

PyPWA.bin_by_range(dataframe, bin_series, number_of_bins, lower_cut=None, upper_cut=None, sample_size=None)

Bins a dataframe by range using a series in memory

Bins an input array by range in memory. You must put all data you want binned into the DataFrame or Structured Array before use. Each resulting bin can be further binned if you desire.

Parameters
  • dataframe (DataFrame or Structured Array) – The dataframe or numpy array that you wish to break into bins

  • bin_series (Array-like) – Data that you want to bin by, selectable by user. Must have the same length as dataframe. If a column name is provided, that column will be used from the dataframe.

  • number_of_bins (int) – The resulting number of bins that you would like to have.

  • lower_cut (float, optional) – The lower cut off for the dataset, if not provided it will be set to the smallest value in the bin_series

  • upper_cut (float, optional) – The upper cut off for the dataset, if not provided will be set to the largest value in the bin_series

  • sample_size (int, optional) – If provided each bin will have a randomly selected number of events of length sample_size.

Returns

A list of array-likes that have been masked off of the input bin_series.

Return type

List[DataFrame or Structured Array]

Raises

ValueError – If the length of the input array and bin array don’t match

Warning

This function does all binning in memory. If you are working with a large dataset that doesn’t fit in memory, or you run out of memory while binning, you must use a different binning method.

See also

PyPWA.libs.file.project

A numerical dataset that supports binning on disk instead of in-memory. It’s slower and requires more steps to use, but should work even on memory limited systems.

Notes

The range is selected using a simple method:

\[(max - min) / number\_of\_bins\]
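
A minimal sketch of that rule (`range_bin_edges` is a hypothetical helper; it assumes evenly spaced edges between the lower and upper cuts):

```python
def range_bin_edges(values, number_of_bins, lower_cut=None, upper_cut=None):
    """Compute bin edges with width (max - min) / number_of_bins."""
    low = min(values) if lower_cut is None else lower_cut
    high = max(values) if upper_cut is None else upper_cut
    width = (high - low) / number_of_bins
    return [low + i * width for i in range(number_of_bins + 1)]

# Ten bins over data spanning [-50, 50] give edges every 10 units
edges = range_bin_edges([-50.0, 0.0, 50.0], number_of_bins=10)
```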

Examples

Binning a DataFrame with values x, y, and z using z to bin

>>> data = {
...     "x": npy.random.rand(1000), "y": npy.random.rand(1000),
...     "z": (npy.random.rand(1000) * 100) - 50
... }
>>> df = pd.DataFrame(data)
>>> list(df.columns)
["x", "y", "z"]

This will give us a usable DataFrame. Now make a series out of z and use it to make 10 bins.

>>> binning = df["z"]
>>> range_bins = bin_by_range(df, binning, 10)
>>> len(range_bins)
10

That will give you 10 bins with very nearly the same number of values per bin.

Builtin Vectors

PyPWA includes support for both 3 and 4 vector classes, complete with methods to aid operating with vector data. Each vector utilizes Numpy for arrays and numerical operations.

class PyPWA.ParticlePool(particle_list)

Stores a collection of particles together as a list.

By default the particles are represented by their angles and mass; internally, however, the particles are still stored as Four Momenta.

display_raw()

Displays the contents of the ParticlePool as Four Momenta

property event_count: int
get_event_mass()
get_particles_by_id(particle_id)
get_particles_by_name(particle_name)
get_s()
get_t()
get_t_prime()
iter_events()
iter_particles()
property particle_count: int
split(count)

Splits the particles into N groups.

This is required to be a method on any object that needs to be passed to the processing module.

Parameters

count (int) – How many ParticlePools to return

Returns

A list of particle pools that can be passed to different process groups.

Return type

List[ParticlePool]

property stored: List[PyPWA.libs.vectors.particle.Particle]
class PyPWA.Particle(particle_id, charge, e, x=None, y=None, z=None)

Numpy backed Particle object for vector operations inside PyPWA.

By default, a Particle is represented through its angles and mass. However, internally the particle is stored as four momenta, just as it is stored in the GAMP format.

Parameters
  • particle_id (int) – The Particle ID, used to determine the particle’s name and charge.

  • charge (int) – The particle’s Charge as read from the GAMP file.

  • e (int, npy.ndarray, float, or DataFrame) – Can be an integer to specify size, a structured array or DataFrame with x, y, z, and e values, a single float value, or a Series or single-dimensional array. If you provide a float, Series, or array, you must provide values for the other options as well.

  • x (int, npy.ndarray, float, or DataFrame, optional) –

  • y (int, npy.ndarray, float, or DataFrame, optional) –

  • z (int, npy.ndarray, float, or DataFrame, optional) –

See also

FourVector

For storing a FourVector without particle ID

ParticlePool

For storing a collection of particles

property charge: int

Immutable charge for the particle produced from the ID.

display_raw()

Displays the contents of the Particle as Four Momenta

get_copy()

Returns a deep copy of the Particle.

Returns

Copy of the particle.

Return type

Particle

property id: int

Immutable ID provided at initialization.

property name: str

Immutable name for the particle produced from the ID.

split(count)

Splits the Particle for distributed computing.

Will return N Particles which together will have the same number of elements as the original Particle.

Parameters

count (int) – The amount of Particles to produce from current particle.

Returns

The list of Particles

Return type

List[Particle]
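
The splitting contract (count pieces whose lengths sum to the original) can be sketched with plain sequences; this mirrors the documented behavior, not PyPWA’s internals:

```python
def split_events(events, count):
    """Split a sequence into `count` near-equal contiguous chunks."""
    base, extra = divmod(len(events), count)
    chunks, start = [], 0
    for i in range(count):
        size = base + (1 if i < extra else 0)  # spread the remainder evenly
        chunks.append(events[start:start + size])
        start += size
    return chunks

# Ten events split across three workers
parts = split_events(list(range(10)), 3)
```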

class PyPWA.FourVector(e, x=None, y=None, z=None)

DataFrame backed FourVector object for vector operations inside PyPWA.

Parameters
  • e (int, np.ndarray, float, or DataFrame) – Can be an integer to specify size, a structured array or DataFrame with x, y, z, and e values, a single float value, or a Series or single-dimensional array. If you provide a float, Series, or array, you must provide values for the other options as well.

  • x (int, np.ndarray, float, or Series, optional) –

  • y (int, np.ndarray, float, or Series, optional) –

  • z (int, np.ndarray, float, or Series, optional) –

See also

ThreeVector

For storing a standard X, Y, Z vector

Particle

For storing a particle, adds support for a particle ID

class PyPWA.ThreeVector(x, y=None, z=None)

DataFrame backed ThreeVector object for vector operations inside PyPWA.

Parameters
  • x (int, npy.ndarray, float, or DataFrame) – Can be an integer to specify size, a structured array or DataFrame with x, y, and z values, a single float value, or a Series or single-dimensional array. If you provide a float, Series, or array, you must provide values for the other options as well.

  • y (int, npy.ndarray, float, or DataFrame, optional) –

  • z (int, npy.ndarray, float, or DataFrame, optional) –

See also

FourVector

For storing a vector with its energy.