Working with Data¶
This is the reference documentation for the functions and classes inside PyPWA that can be used for parsing and writing data to disk. There are four different methods for doing so, each described in the sections below.
PyPWA also defines vector data types and collections for working with Particles, Four Vectors, and Three Vectors, which can be found below.
Reading and Writing Data¶
Reading and writing between disk and memory. This method will load the entire dataset straight into RAM, or write a dataset straight from RAM onto disk.
- PyPWA.read(filename, use_pandas=False, cache=True, clear_cache=False)¶
Reads the entire file and returns either a DataFrame, ParticlePool, or standard numpy array, depending on the data found inside the file.
- Parameters
filename (Path, str) – File to read.
use_pandas (bool) – Determines if a numpy data type or pandas data type is returned.
cache (bool, optional) – Enables or disables caching. Defaults to enabled. Leaving this enabled should do no harm unless something is broken with caching. Disable this for debugging purposes if the wrong data is being returned. If the incorrect data is still returned with caching disabled, then caching isn’t the issue.
clear_cache (bool, optional) – Forcefully clears the cache for the files that are parsed. Instead of loading the cache, it’ll delete the cache and write a new cache object in its place when caching is enabled.
- Returns
DataFrame – If the file is a kVars file, CSV, or TSV
npy.ndarray – If the file is a numpy file, PF file, or single column txt file
ParticlePool – If parsing a gamp file
- Raises
RuntimeError – If there is no plugin that can load the data found
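Examples
A minimal sketch of loading a file into memory; the filename here is hypothetical:
>>> import PyPWA
>>> data = PyPWA.read("example.csv")
>>> df = PyPWA.read("example.csv", use_pandas=True)  # pandas instead of numpy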
- PyPWA.write(filename, data, cache=True, clear_cache=False)¶
Writes the given data to disk. The data may be a DataFrame, ParticlePool, or standard numpy array, and the output format is determined by the filename’s extension.
- Parameters
filename (Path, str) – The filename of the file you wish to write
data (ParticlePool, DataFrame, or npy.ndarray) – The data that you wish to write to disk.
cache (bool, optional) – Enables or disables caching. Defaults to enabled. Leaving this enabled should do no harm unless something is broken with caching.
clear_cache (bool, optional) – Forcefully clears the cache for the files that are written. It’ll delete the cache and write a new cache object when caching is enabled.
- Raises
RuntimeError – If there is no plugin that can write the data provided
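Examples
A minimal sketch of writing data back to disk; the filenames are hypothetical, and the output format follows the file extension:
>>> data = PyPWA.read("example.csv")
>>> PyPWA.write("example_copy.csv", data)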
Basic Data Sanitization¶
Allows quick conversion of data from Pandas to Numpy, and preps data to be passed to non-Python functions and classes, such as Fortran modules compiled with f2py, or C/C++ modules bound by Cython.
- PyPWA.pandas_to_numpy(df)¶
Converts Pandas DataTypes to Numpy
Takes a Pandas Series or DataFrame and converts it to Numpy. Pandas does have a built-in to_records function; however, records are slower than Structured Arrays, while providing much of the same functionality.
- Parameters
df (Pandas Series or DataFrame) – The pandas data structure that you wish to convert to standard Numpy Structured Arrays
- Returns
The resulting Numpy array or structured array containing the data from the original DataFrame or Series. If given a Series with named rows (like an element from a DataFrame), the result is a Structured Array with length 1; if given a standard Series, the result is a single Numpy array; and if given a DataFrame, the result is a Structured Array matching the types and names from the DataFrame.
- Return type
Numpy ArrayLike
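Examples
A minimal sketch, assuming numpy is imported as npy and pandas as pd (the same aliases used in the binning examples below):
>>> df = pd.DataFrame({"x": npy.random.rand(5), "y": npy.random.rand(5)})
>>> array = PyPWA.pandas_to_numpy(df)
>>> array.dtype.names
('x', 'y')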
- PyPWA.to_contiguous(data, names)¶
Convert DataFrame or Structured Array to List of Contiguous Arrays
This takes a dataset and a list of column names, and converts those columns into Contiguous arrays. The reason to use Contiguous arrays over DataFrames or Structured arrays is that their memory is better aligned, improving the speed of computation. However, this doubles the memory requirements of your dataset, since it copies all the events over to the new arrays. Use this only in amplitudes where you need to maximize speed.
- Parameters
data (Structured Array, DataFrame, or Dict-like) – This is the data frame or Structured array that you want to extract columns from
names (List of Column Names or str) – This is either a list of columns you want from the array, or a single column you want from the array
- Returns
If you provide only a single column, it’ll return a single array with the data from that column. However, if you supply multiple columns in a list or tuple, it’ll return a tuple of arrays in the same order as the supplied names.
- Return type
ArrayLike or Tuple[ArrayLike]
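Examples
Continuing from the DataFrame above, a minimal sketch that extracts two columns as contiguous arrays (the column names are hypothetical):
>>> x, y = PyPWA.to_contiguous(df, ["x", "y"])
>>> x.flags["C_CONTIGUOUS"]
True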
Data Iterators and Writers¶
Reading and writing a single event at a time, instead of holding the entire contents of the dataset in memory at once. This is a good choice if you want to rapidly transform the data that is on disk.
- class PyPWA.DataType(value)¶
Enumeration for type of data to be read or written using the reader and writer.
Because of how the reader and writer are designed, they cannot inspect the data before they begin working with it. This enum is used to specify the type of data you’re working with.
BASIC = Standard arrays with no columns
STRUCTURED = Columned array (CSV, TSV, DataFrames)
TREE_VECTOR = Particle Data (GAMP)
- PyPWA.get_writer(filename, dtype)¶
Returns a writer that can write to the file one event at a time
- Parameters
filename (str, Path) – The file that you want to write to
dtype (DataType) – Specifies the type of data that needs to be written. TREE_VECTOR is used for ParticlePools and only works with the ‘.gamp’ extension for now. STRUCTURED is used for both numpy structured arrays and pandas DataFrames. BASIC is used for standard numpy arrays.
- Returns
A writer that can write to the file, defined in PyPWA.plugins.data
- Return type
templates.WriterBase
- Raises
RuntimeError – If there is no plugin that can write the data found
See also
write
Writes a ParticlePool, DataFrame, or array to file
Examples
The writer can be used to write a ParticlePool one event at a time
>>> writer = get_writer("example.gamp", DataType.TREE_VECTOR)
>>> for event in particles.iter_events():
...     writer.write(event)
>>> writer.close()
- PyPWA.get_reader(filename, use_pandas=False)¶
Returns a reader that can read the file one event at a time
Note
The return value from the reader could be a pointer. If you need to keep the event without it being overwritten on the next call, you must call the copy method on the returned data to get a unique copy.
- Parameters
filename (str, Path) – File to read
use_pandas (bool) – Determines if a numpy data type or pandas data type is returned.
- Returns
A reader that can read the file, defined in PyPWA.plugins.data
- Return type
templates.ReaderBase
- Raises
RuntimeError – If there is no plugin that can load the data found
See also
read
Reads an entire file into a DataFrame, ParticlePool, or array
Examples
The reader can be used inside a standard for loop
>>> reader = get_reader("example.gamp")
>>> for event in reader:
...     my_kept_event = event.copy()
...     regular_event = event
>>> reader.close()
Caching¶
Using pickles to quickly write and read data straight from disk as intermediate caching steps. These are special functions that allow caching values or program states quickly for resuming later. This is a good way to save essential data for a Jupyter Notebook so that if the kernel is rebooted, data isn’t lost.
- PyPWA.cache.read(path, intermediate=True, remove_cache=False)¶
Reads a cache object
This reads cache objects from the disk. With its default settings, it’ll read the file as if it were a cache file. If intermediate is set to False, the path will be the source file, and it’ll load the cache file as long as the source file’s hash hasn’t changed. It can also remove an existing cache when remove_cache is set to True.
- Parameters
path (Path or str) – The path of the source file, or path where you want the intermediate step to be stored.
intermediate (bool) – If set to True, the cache will be treated as an intermediate step; this means it will assume there is no data file associated with the cache, and will not check file hashes. Defaults to True.
remove_cache (bool) – Setting this to true will remove the cache.
- Returns
The first value in the tuple is whether the cache is valid or not, and the second value is whatever data was stored in the cache.
- Return type
Tuple[bool, any]
- PyPWA.cache.write(path, data, intermediate=True)¶
Writes a cache file
With its default settings, it will treat the path as a save location for the cache as an intermediate step. If intermediate is set to false, it’ll write the cache file into a computed cache location and store the source file’s hash in the cache for future comparison.
- Parameters
path (Path or str) – The path of the source file, or path where you want the intermediate step to be stored.
data (Any) – Whatever data you wish to be stored in the cache. Almost anything that can be stored in a variable, can be stored on disk.
intermediate (bool) – If set to True, the cache will be treated as an intermediate step; this means it will assume there is no data file associated with the cache, and will not check file hashes.
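Examples
A minimal round-trip sketch of intermediate caching; the path and payload here are hypothetical:
>>> from PyPWA import cache
>>> cache.write("./fit_state", {"iteration": 10})
>>> valid, data = cache.read("./fit_state")
>>> valid, data
(True, {'iteration': 10})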
Binning¶
We provide functions that make binning data in memory an easy process; a more in-depth example and documentation for binning with HDF5 will be made available in the future.
- PyPWA.bin_with_fixed_widths(dataframe, bin_series, fixed_size, lower_cut=None, upper_cut=None)¶
Bins a dataframe by a fixed number of events using a series in memory
Bins an input array by a fixed number of events in memory. You must put all data you want binned into the DataFrame or Structured Array before use. Each resulting bin can be further binned if you desire.
If the fixed_size does not evenly divide into the length of bin_series, the first and last bin will contain overflows.
- Parameters
dataframe (DataFrame or Structured Array) – The dataframe or numpy array that you wish to break into bins
bin_series (Array-like) – Data that you want to bin by, selectable by user. Must have the same length as dataframe. If a column name is provided, that column will be used from the dataframe.
fixed_size (int) – The number of events you want in each bin.
lower_cut (float, optional) – The lower cut off for the dataset, if not provided it will be set to the smallest value in the bin_series
upper_cut (float, optional) – The upper cut off for the dataset, if not provided will be set to the largest value in the bin_series
- Returns
A list of array-likes that have been masked off of the input bin_series.
- Return type
List[DataFrame or Structured Array]
- Raises
ValueError – If the length of the input array and bin array don’t match
Warning
This function does all binning in memory; if you are working with a large dataset that doesn’t fit in memory, or if you overflow while binning, you must use a different binning method
See also
PyPWA.libs.file.project
A numerical dataset that supports binning on disk instead of in-memory. It’s slower and requires more steps to use, but should work even on memory limited systems.
Examples
Binning a DataFrame with values x, y, and z using z to bin
>>> data = {
...     "x": npy.random.rand(1000), "y": npy.random.rand(1000),
...     "z": (npy.random.rand(1000) * 100) - 50
... }
>>> df = pd.DataFrame(data)
>>> list(df.columns)
['x', 'y', 'z']
This will give us a usable DataFrame, now to make a series out of z and use it to bin the data into bins of 250 events each.
>>> binning = df["z"]
>>> range_bins = bin_with_fixed_widths(df, binning, 250)
>>> len(range_bins)
4
Each bin should have exactly 250 events in size
>>> lengths = []
>>> for abin in range_bins:
...     lengths.append(len(abin))
>>> lengths
[250, 250, 250, 250]
That will give you 4 bins with exactly the same number of events per bin, plus 2 more bins for the overflows if needed.
- PyPWA.bin_by_range(dataframe, bin_series, number_of_bins, lower_cut=None, upper_cut=None, sample_size=None)¶
Bins a dataframe by range using a series in memory
Bins an input array by range in memory. You must put all data you want binned into the DataFrame or Structured Array before use. Each resulting bin can be further binned if you desire.
- Parameters
dataframe (DataFrame or Structured Array) – The dataframe or numpy array that you wish to break into bins
bin_series (Array-like) – Data that you want to bin by, selectable by user. Must have the same length as dataframe. If a column name is provided, that column will be used from the dataframe.
number_of_bins (int) – The resulting number of bins that you would like to have.
lower_cut (float, optional) – The lower cut off for the dataset, if not provided it will be set to the smallest value in the bin_series
upper_cut (float, optional) – The upper cut off for the dataset, if not provided will be set to the largest value in the bin_series
sample_size (int, optional) – If provided, each bin will contain a randomly selected sample of sample_size events.
- Returns
A list of array-likes that have been masked off of the input bin_series.
- Return type
List[DataFrame or Structured Array]
- Raises
ValueError – If the length of the input array and bin array don’t match
Warning
This function does all binning in memory; if you are working with a large dataset that doesn’t fit in memory, or if you overflow while binning, you must use a different binning method
See also
PyPWA.libs.file.project
A numerical dataset that supports binning on disk instead of in-memory. It’s slower and requires more steps to use, but should work even on memory limited systems.
Notes
The range is selected using a simple method:
\[(max - min) / num\_of\_bins\]
Examples
Binning a DataFrame with values x, y, and z using z to bin
>>> data = {
...     "x": npy.random.rand(1000), "y": npy.random.rand(1000),
...     "z": (npy.random.rand(1000) * 100) - 50
... }
>>> df = pd.DataFrame(data)
>>> list(df.columns)
['x', 'y', 'z']
This will give us a usable DataFrame, now to make a series out of z and use it to make 10 bins.
>>> binning = df["z"]
>>> range_bins = bin_by_range(df, binning, 10)
>>> len(range_bins)
10
That will give you 10 bins with a very close number of events per bin
- PyPWA.bin_by_list(data, bin_series, bin_list)¶
Bins a dataframe by a list of bin limits using a series in memory
Bins an input array by a list of bin limits in memory. You must put all the data you want binned into the DataFrame or Structured Array before use. Each resulting bin can be further binned if you desire.
- Parameters
data (DataFrame or Structured Array) – The dataframe or numpy array that you wish to break into bins
bin_series (Array-like) – Data that you want to bin by, selectable by user. Must have the same length as dataframe. If a column name is provided, that column will be used from the dataframe.
bin_list (list) – The list of bin limits used to create the bins.
- Returns
A list of array-likes that have been masked off of the input bin_series.
- Return type
List[DataFrame or Structured Array]
- Raises
ValueError – If the length of the input array and bin array don’t match
Warning
This function does all binning in memory; if you are working with a large dataset that doesn’t fit in memory, or if you overflow while binning, you must use a different binning method
See also
PyPWA.libs.file.project
A numerical dataset that supports binning on disk instead of in-memory. It’s slower and requires more steps to use, but should work even on memory limited systems.
Examples
Binning a DataFrame with values x, y, and z using z to bin
First create the list which defines all the bin limits:
>>> bin_limits = [1, 3, 7, 10]
>>> dataset = {
...     "x": npy.random.rand(1000), "y": npy.random.rand(1000),
...     "z": (npy.random.rand(1000) * 100) - 50
... }
>>> df = pd.DataFrame(dataset)
>>> list(df.columns)
['x', 'y', 'z']
This will give us a usable DataFrame, now to make a series out of z and use it to make the 3 defined bins.
>>> binning = df["z"]
>>> range_bins = bin_by_list(df, binning, bin_limits)
>>> len(range_bins)
3
That will give you 3 bins with custom bin limits
Builtin Vectors¶
PyPWA includes support for both three- and four-vector classes, complete with methods to aid in operating on vector data. Each vector utilizes Numpy for arrays and numerical operations.
- class PyPWA.ParticlePool(particle_list)¶
Stores a collection of particles together as a list.
By default the particles are represented as their angles and mass; however, internally the particles are still stored as the four momenta.
- display_raw()¶
Displays the file
- property event_count: int¶
- get_event_mass()¶
- get_particles_by_id(particle_id)¶
- get_particles_by_name(particle_name)¶
- get_s()¶
- get_t()¶
- get_t_prime()¶
- iter_events()¶
- iter_particles()¶
- property particle_count: int¶
- split(count)¶
Splits the particles into N groups.
This is required to be a method on any object that needs to be passed to the processing module.
- Parameters
count (int) – How many ParticlePools to return
- Returns
A list of particle pools that can be passed to different process groups.
- Return type
List[ParticlePool]
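Examples
A minimal sketch of working with a ParticlePool, assuming example.gamp exists as in the reader examples above:
>>> particles = PyPWA.read("example.gamp")
>>> masses = particles.get_event_mass()  # mass of each event
>>> pools = particles.split(4)  # four ParticlePools for parallel processing
>>> len(pools)
4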
- class PyPWA.Particle(particle_id, charge, e, x=None, y=None, z=None)¶
Numpy backed Particle object for vector operations inside PyPWA.
By default, a Particle is represented through the particle’s angles and mass. However, internally the particle is stored as four momenta, just as it’s stored in the GAMP format.
- Parameters
particle_id (int) – The Particle ID, used to determine the particle’s name and charge.
charge (int) – The particle’s Charge as read from the GAMP file.
e (int, npy.ndarray, float, or DataFrame) – Can be an integer to specify size, a structured array or DataFrame with x, y, z, and e values, a single float value, or a Series or single-dimensional array. If you provide a float, Series, or array, you need to provide values for the other options as well.
x (int, npy.ndarray, float, or DataFrame, optional) –
y (int, npy.ndarray, float, or DataFrame, optional) –
z (int, npy.ndarray, float, or DataFrame, optional) –
See also
FourVector
For storing a FourVector without particle ID
ParticlePool
For storing a collection of particles
- property charge: int¶
Immutable charge for the particle produced from the ID.
- display_raw()¶
Displays the contents of the Particle as Four Momenta
- property id: int¶
Immutable provided ID at initialization.
- property name: str¶
Immutable name for the particle produced from the ID.
- split(count)¶
Splits the Particle for distributed computing.
Will return N Particles which together will have the same number of elements as the original Particle.
- Parameters
count (int) – The number of Particles to produce from the current particle.
- Returns
The list of Particles
- Return type
List[Particle]
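Examples
A minimal sketch of creating a Particle sized for a number of events; the ID is assumed to follow the GEANT numbering used by GAMP files, where 1 corresponds to a gamma with charge 0:
>>> gamma = PyPWA.Particle(1, 0, 1000)  # room for 1000 events
>>> gamma.id
1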
- class PyPWA.FourVector(e, x=None, y=None, z=None)¶
DataFrame backed FourVector object for vector operations inside PyPWA.
- Parameters
e (int, np.ndarray, float, or DataFrame) – Can be an integer to specify size, a structured array or DataFrame with x, y, z, and e values, a single float value, or a Series or single-dimensional array. If you provide a float, Series, or array, you need to provide values for the other options as well.
x (int, np.ndarray, float, or Series, optional) –
y (int, np.ndarray, float, or Series, optional) –
z (int, np.ndarray, float, or Series, optional) –
See also
ThreeVector
For storing a standard X, Y, Z vector
Particle
For storing a particle, adds support for a particle ID
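Examples
A minimal sketch of constructing a FourVector, either from single float components or by allocating space with an integer; the values are hypothetical:
>>> vec = PyPWA.FourVector(3.2, 1.0, 0.5, 0.2)
>>> vecs = PyPWA.FourVector(1000)  # room for 1000 events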
- class PyPWA.ThreeVector(x, y=None, z=None)¶
DataFrame backed ThreeVector object for vector operations inside PyPWA.
- Parameters
x (int, npy.ndarray, float, or DataFrame) – Can be an integer to specify size, a structured array or DataFrame with x, y, and z values, a single float value, or a Series or single-dimensional array. If you provide a float, Series, or array, you need to provide values for the other options as well.
y (int, npy.ndarray, float, or DataFrame, optional) –
z (int, npy.ndarray, float, or DataFrame, optional) –
See also
FourVector
For storing a vector with its energy.
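Examples
Likewise, a minimal sketch for ThreeVector with hypothetical values:
>>> vec = PyPWA.ThreeVector(1.0, 0.5, 0.2)
>>> vecs = PyPWA.ThreeVector(1000)  # room for 1000 events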